cs.CV - 2023-11-21

Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

  • paper_url: http://arxiv.org/abs/2311.13612
  • repo_url: https://github.com/chris210634/word_soups
  • paper_authors: Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis
  • for: This work proposes two more flexible methods, descriptor soup and word soup, to improve the out-of-distribution (OOD) target accuracy of zero-shot models.
  • methods: Neither method requires an LLM at test time, and both can leverage training data. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data and computes robust class embeddings from them; word soup greedily assembles a chain of words to the same end.
  • results: Compared with existing few-shot soft prompt tuning methods, word soup requires fewer parameters and less GPU memory, and it performs strongly on cross-dataset and domain generalization benchmarks. Compared with SoTA descriptor and prompt ensembling methods, word soup achieves higher OOD accuracy with fewer ensemble members.
    Abstract Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are un-trainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please checkout our code: github.com/Chris210634/word_soups
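    The greedy selection at the heart of descriptor soup is only named above; the sketch below illustrates how such a procedure might look, assuming precomputed class-embedding matrices for a pool of candidate descriptors and a few-shot accuracy function. The names `candidate_embeds` and `fewshot_accuracy` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def greedy_descriptor_soup(candidate_embeds, image_feats, labels,
                           fewshot_accuracy, max_members=16):
    """Greedily grow a set of descriptors whose averaged class embeddings
    maximise accuracy on generic few-shot training data.

    candidate_embeds: (D, C, dim) tensor, one class-embedding matrix per candidate descriptor.
    image_feats:      (N, dim) few-shot image features.
    labels:           (N,) ground-truth class indices.
    fewshot_accuracy: callable(class_embeds, image_feats, labels) -> float.
    """
    selected, best_acc = [], -1.0
    for _ in range(max_members):
        best_idx = None
        for d in range(candidate_embeds.shape[0]):
            if d in selected:
                continue
            # "Soup": average the class embeddings of the trial member set.
            class_embeds = candidate_embeds[selected + [d]].mean(dim=0)
            class_embeds = F.normalize(class_embeds, dim=-1)
            acc = fewshot_accuracy(class_embeds, image_feats, labels)
            if acc > best_acc:
                best_acc, best_idx = acc, d
        if best_idx is None:      # no remaining descriptor improves accuracy
            break
        selected.append(best_idx)
    return selected, best_acc
```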

Novel OCT mosaicking pipeline with Feature- and Pixel-based registration

  • paper_url: http://arxiv.org/abs/2311.13052
  • repo_url: None
  • paper_authors: Jiacheng Wang, Hao Li, Dewei Hu, Yuankai K. Tao, Ipek Oguz
  • for: This paper aims to improve the field of view (FoV) of high-resolution Optical Coherence Tomography (OCT) images for ophthalmology studies by stitching multiple overlapping images together.
  • methods: The proposed method combines learning-based feature matching and robust pixel-based registration to align multiple images effectively, and also utilizes a trained foundational model (Segment Anything Model, SAM) to validate mosaicking results in an unsupervised manner.
  • results: The authors evaluate the efficacy of their pipeline on two datasets and show superior performance in terms of both accuracy and computational efficiency compared to current mosaicking pipelines.
    Abstract High-resolution Optical Coherence Tomography (OCT) images are crucial for ophthalmology studies but are limited by their relatively narrow field of view (FoV). Image mosaicking is a technique for aligning multiple overlapping images to obtain a larger FoV. Current mosaicking pipelines often struggle with substantial noise and considerable displacement between the input sub-fields. In this paper, we propose a versatile pipeline for stitching multi-view OCT/OCTA \textit{en face} projection images. Our method combines the strengths of learning-based feature matching and robust pixel-based registration to align multiple images effectively. Furthermore, we advance the application of a trained foundational model, Segment Anything Model (SAM), to validate mosaicking results in an unsupervised manner. The efficacy of our pipeline is validated using an in-house dataset and a large public dataset, where our method shows superior performance in terms of both accuracy and computational efficiency. We also made our evaluation tool for image mosaicking and the corresponding pipeline publicly available at \url{https://github.com/MedICL-VU/OCT-mosaicking}.
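    As a rough illustration of the coarse-to-fine idea described above, combining feature-based and pixel-based registration (not the authors' pipeline, which uses a learned matcher and SAM-based validation), the sketch below uses OpenCV on a pair of greyscale en face projections; parameters are placeholders.

```python
import cv2
import numpy as np

def register_pair(fixed, moving):
    """Coarse feature-based alignment followed by pixel-based refinement (illustrative only)."""
    # Stage 1: feature matching (ORB here; the paper uses a learned matcher).
    orb = cv2.ORB_create(4000)
    k1, d1 = orb.detectAndCompute(fixed, None)
    k2, d2 = orb.detectAndCompute(moving, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Stage 2: pixel-based (intensity) refinement of the affine part with ECC.
    warp = H[:2, :].astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, warp = cv2.findTransformECC(fixed, moving, warp, cv2.MOTION_AFFINE, criteria)
    return cv2.warpAffine(moving, warp, fixed.shape[::-1])
```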

Camera-Independent Single Image Depth Estimation from Defocus Blur

  • paper_url: http://arxiv.org/abs/2311.13045
  • repo_url: https://github.com/sleekeagle/defocus_camind
  • paper_authors: Lahiru Wijayasingha, Homa Alemzadeh, John A. Stankovic
  • for: Improving the accuracy of monocular depth estimation methods for downstream tasks in machine vision.
  • methods: Analyzing the impact of camera parameters on defocus blur using optical physics equations, and proposing a simple correction method that does not require retraining of the original model.
  • results: Evaluating the model on both synthetic and real datasets obtained with different cameras, and demonstrating its robustness to camera changes.
    Abstract Monocular depth estimation is an important step in many downstream tasks in machine vision. We address the topic of estimating monocular depth from defocus blur which can yield more accurate results than the semantic based depth estimation methods. The existing monocular depth from defocus techniques are sensitive to the particular camera that the images are taken from. We show how several camera-related parameters affect the defocus blur using optical physics equations and how they make the defocus blur depend on these parameters. The simple correction procedure we propose can alleviate this problem which does not require any retraining of the original model. We created a synthetic dataset which can be used to test the camera independent performance of depth from defocus blur models. We evaluate our model on both synthetic and real datasets (DDFF12 and NYU depth V2) obtained with different cameras and show that our methods are significantly more robust to the changes of cameras. Code: https://github.com/sleekEagle/defocus_camind.git
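    The abstract refers to optical physics equations without reproducing them; for context, the standard thin-lens expression for the diameter of the defocus blur circle (not quoted from the paper) makes explicit how camera parameters enter:

```latex
% c : blur-circle (circle of confusion) diameter on the sensor
% f : focal length,  N : f-number (aperture diameter A = f/N)
% s : focus distance, d : object distance
c \;=\; \frac{f^{2}}{N\,(s - f)} \cdot \frac{\lvert d - s \rvert}{d}
```

    Because the blur depends on focal length, aperture, and focus distance, a depth-from-defocus model calibrated for one camera needs a correction of the kind proposed here when those parameters change.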

Unsupervised Multimodal Surface Registration with Geometric Deep Learning

  • paper_url: http://arxiv.org/abs/2311.13022
  • repo_url: https://github.com/mohamedasuliman/geomorph
  • paper_authors: Mohamed A. Suliman, Logan Z. J. Williams, Abdulah Fawaz, Emma C. Robinson
  • for: This paper proposes a geometric deep-learning framework for image registration of cortical surfaces, intended for brain studies.
  • methods: The method has two main steps. First, independent feature extraction is performed on each input surface using graph convolutions, generating low-dimensional feature representations that capture important cortical surface characteristics. Features are then registered in a deep-discrete manner by learning displacements of a set of control points to optimise the overlap of common structures across surfaces, with regularisation through a deep conditional random field to ensure smooth and biologically plausible deformations.
  • results: Experiments show that GeoMorph surpasses existing deep-learning methods, achieving improved alignment with smoother deformations, and performs competitively against classical frameworks. This versatility and robustness suggest strong potential for neuroscience applications.
    Abstract This paper introduces GeoMorph, a novel geometric deep-learning framework designed for image registration of cortical surfaces. The registration process consists of two main steps. First, independent feature extraction is performed on each input surface using graph convolutions, generating low-dimensional feature representations that capture important cortical surface characteristics. Subsequently, features are registered in a deep-discrete manner to optimize the overlap of common structures across surfaces by learning displacements of a set of control points. To ensure smooth and biologically plausible deformations, we implement regularization through a deep conditional random field implemented with a recurrent neural network. Experimental results demonstrate that GeoMorph surpasses existing deep-learning methods by achieving improved alignment with smoother deformations. Furthermore, GeoMorph exhibits competitive performance compared to classical frameworks. Such versatility and robustness suggest strong potential for various neuroscience applications.

Image-Based Soil Organic Carbon Remote Sensing from Satellite Images with Fourier Neural Operator and Structural Similarity

  • paper_url: http://arxiv.org/abs/2311.13016
  • repo_url: None
  • paper_authors: Ken C. L. Wong, Levente Klein, Ademir Ferreira da Silva, Hongzhi Wang, Jitendra Singh, Tanveer Syeda-Mahmood
  • for: This paper proposes a CNN-based approach to SOC remote sensing so that soil organic carbon concentration can be estimated at a regional or global scale.
  • methods: The method uses the FNO-DenseNet model, which combines the advantages of the Fourier neural operator (FNO) and DenseNet.
  • results: In the authors' experiments, the FNO-DenseNet outperformed the FNO with hundreds of times fewer parameters and outperformed a pixel-based random forest by 18% in mean absolute percentage error.
    Abstract Soil organic carbon (SOC) sequestration is the transfer and storage of atmospheric carbon dioxide in soils, which plays an important role in climate change mitigation. SOC concentration can be improved by proper land use, thus it is beneficial if SOC can be estimated at a regional or global scale. As multispectral satellite data can provide SOC-related information such as vegetation and soil properties at a global scale, estimation of SOC through satellite data has been explored as an alternative to manual soil sampling. Although existing studies show promising results, they are mainly based on pixel-based approaches with traditional machine learning methods, and convolutional neural networks (CNNs) are uncommon. To study the use of CNNs on SOC remote sensing, here we propose the FNO-DenseNet based on the Fourier neural operator (FNO). By combining the advantages of the FNO and DenseNet, the FNO-DenseNet outperformed the FNO in our experiments with hundreds of times fewer parameters. The FNO-DenseNet also outperformed a pixel-based random forest by 18% in the mean absolute percentage error.
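    The abstract names the Fourier neural operator without describing its core building block; below is a minimal, self-contained sketch of a 2D spectral convolution layer in PyTorch, loosely following the standard FNO formulation. The DenseNet-style connections of the FNO-DenseNet are omitted, and the number of retained modes is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Minimal 2D Fourier layer: FFT -> keep low modes -> learned complex weights -> iFFT."""

    def __init__(self, in_ch, out_ch, modes1=12, modes2=12):
        super().__init__()
        self.modes1, self.modes2 = modes1, modes2
        scale = 1.0 / (in_ch * out_ch)
        self.weight = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes1, modes2, dtype=torch.cfloat)
        )

    def forward(self, x):                        # x: (batch, in_ch, H, W)
        b, _, h, w = x.shape
        x_ft = torch.fft.rfft2(x)                # (batch, in_ch, H, W//2 + 1), complex
        out_ft = torch.zeros(b, self.weight.shape[1], h, w // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        # Multiply only the retained low-frequency modes by the learned weights.
        out_ft[:, :, :self.modes1, :self.modes2] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :self.modes1, :self.modes2], self.weight
        )
        return torch.fft.irfft2(out_ft, s=(h, w))  # back to (batch, out_ch, H, W)
```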

3D Compression Using Neural Fields

  • paper_url: http://arxiv.org/abs/2311.13009
  • repo_url: None
  • paper_authors: Janis Postels, Yannick Strümpler, Klara Reichard, Luc Van Gool, Federico Tombari
  • for: This paper proposes a novel Neural Fields (NF)-based compression algorithm for 3D data, which can compress both geometry and attribute of 3D data.
  • methods: The proposed method uses Signed Distance Fields (SDFs) for watertight shapes and Unsigned Distance Fields (UDFs) for arbitrary non-watertight shapes.
  • results: The method demonstrates excellent geometry compression on 3D point clouds and meshes, and the compression algorithm is straightforward to extend to compress both the geometry and attributes (e.g. color) of 3D data.
    Abstract Neural Fields (NFs) have gained momentum as a tool for compressing various data modalities - e.g. images and videos. This work leverages previous advances and proposes a novel NF-based compression algorithm for 3D data. We derive two versions of our approach - one tailored to watertight shapes based on Signed Distance Fields (SDFs) and, more generally, one for arbitrary non-watertight shapes using Unsigned Distance Fields (UDFs). We demonstrate that our method excels at geometry compression on 3D point clouds as well as meshes. Moreover, we show that, due to the NF formulation, it is straightforward to extend our compression algorithm to compress both geometry and attribute (e.g. color) of 3D data.
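    A neural-field codec of this kind essentially overfits a small coordinate network to the signed distance field of one shape and stores the (quantised) weights as the compressed representation. The sketch below illustrates that idea under simple assumptions: a hypothetical `sample_sdf(points)` oracle supplies ground-truth distances, and quantisation/entropy coding are omitted; it is not the authors' architecture.

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    """Tiny coordinate MLP: 3D point -> signed distance. Its weights ARE the compressed shape."""
    def __init__(self, hidden=64, layers=4):
        super().__init__()
        dims = [3] + [hidden] * layers + [1]
        mods = []
        for i in range(len(dims) - 1):
            mods.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                mods.append(nn.ReLU())
        self.net = nn.Sequential(*mods)

    def forward(self, xyz):
        return self.net(xyz)

def compress_shape(sample_sdf, steps=2000, batch=4096, lr=1e-3):
    """Overfit the MLP to one shape. `sample_sdf(points)` is an assumed oracle returning
    ground-truth signed distances of shape (batch, 1), e.g. computed from the input mesh."""
    net = SDFNet()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        pts = torch.rand(batch, 3) * 2 - 1                      # random points in [-1, 1]^3
        loss = nn.functional.l1_loss(net(pts), sample_sdf(pts))
        opt.zero_grad(); loss.backward(); opt.step()
    # The "bitstream" is just the network parameters (quantisation/entropy coding omitted).
    return {k: v.half() for k, v in net.state_dict().items()}
```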

TRIDENT: The Nonlinear Trilogy for Implicit Neural Representations

  • paper_url: http://arxiv.org/abs/2311.13610
  • repo_url: None
  • paper_authors: Zhenda Shen, Yanqi Cheng, Raymond H. Chan, Pietro Liò, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
  • for: This paper studies implicit neural representations (INRs), which model complex, high-dimensional data without explicit parameterisation.
  • methods: It introduces TRIDENT, a new function for implicit neural representations characterised by a trilogy of nonlinearities: it represents high-order features (order compactness), efficiently captures frequency information (frequency compactness), and represents signals or images whose energy is concentrated in a limited spatial region (spatial compactness).
  • results: Extensive experiments on various inverse problems show that the proposed function outperforms existing implicit neural representation functions.
    Abstract Implicit neural representations (INRs) have garnered significant interest recently for their ability to model complex, high-dimensional data without explicit parameterisation. In this work, we introduce TRIDENT, a novel function for implicit neural representations characterised by a trilogy of nonlinearities. Firstly, it is designed to represent high-order features through order compactness. Secondly, TRIDENT efficiently captures frequency information, a feature called frequency compactness. Thirdly, it has the capability to represent signals or images such that most of its energy is concentrated in a limited spatial region, denoting spatial compactness. We demonstrated through extensive experiments on various inverse problems that our proposed function outperforms existing implicit neural representation functions.

AI for Agriculture: the Comparison of Semantic Segmentation Methods for Crop Mapping with Sentinel-2 Imagery

  • paper_url: http://arxiv.org/abs/2311.12993
  • repo_url: None
  • paper_authors: Irina Korotkova, Natalia Efremova
  • for: This study explores the main machine learning methods that can be applied to the vineyard segmentation problem and evaluates how effective they are under different conditions.
  • methods: Several widely used machine learning techniques (e.g., support vector machines) are assessed on freely available satellite imagery.
  • results: The methods perform differently on the vineyard segmentation problem, and the paper offers guidance on selecting the most suitable model for specific scenarios.
    Abstract Crop mapping is one of the most common tasks in artificial intelligence for agriculture due to higher food demands from a growing population and increased awareness of climate change. In case of vineyards, the texture is very important for crop segmentation: with higher resolution satellite imagery the texture is easily detected by majority of state-of-the-art algorithms. However, this task becomes increasingly more difficult as the resolution of satellite imagery decreases and the information about the texture becomes unavailable. In this paper we aim to explore the main machine learning methods that can be used with freely available satellite imagery and discuss how and when they can be applied for vineyard segmentation problem. We assess the effectiveness of various widely-used machine learning techniques and offer guidance on selecting the most suitable model for specific scenarios.

FollowMe: a Robust Person Following Framework Based on Re-Identification and Gestures

  • paper_url: http://arxiv.org/abs/2311.12992
  • repo_url: https://github.com/federicorollo/followme
  • paper_authors: Federico Rollo, Andrea Zunino, Gennaro Raiola, Fabio Amadio, Arash Ajoudani, Nikolaos Tsagarakis
  • for: This work aims to increase the operational flexibility of mobile collaborative robots so they can provide personalised assistance to human operators across work scenarios.
  • methods: The framework combines visual Re-Identification (Re-ID), hand-gesture detection, and collision-free navigation to identify and follow a target person.
  • results: Experiments in a laboratory setting show that the framework can identify and follow the target person and maintain robust navigation even when unknown dynamic obstacles are introduced.
    Abstract Human-robot interaction (HRI) has become a crucial enabler in houses and industries for facilitating operational flexibility. When it comes to mobile collaborative robots, this flexibility can be further increased due to the autonomous mobility and navigation capacity of the robotic agents, expanding their workspace and consequently, the personalizable assistance they can provide to the human operators. This however requires that the robot is capable of detecting and identifying the human counterpart in all stages of the collaborative task, and in particular while following a human in crowded workplaces. To respond to this need, we developed a unified perception and navigation framework, which enables the robot to identify and follow a target person using a combination of visual Re-Identification (Re-ID), hand gestures detection, and collision-free navigation. The Re-ID module can autonomously learn the features of a target person and use the acquired knowledge to visually re-identify the target. The navigation stack is used to follow the target avoiding obstacles and other individuals in the environment. Experiments are conducted with few subjects in a laboratory setting where some unknown dynamic obstacles are introduced.

SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion

  • paper_url: http://arxiv.org/abs/2311.12981
  • repo_url: None
  • paper_authors: Yueqian Lin, Jingyang Zhang, Yiran Chen, Hai Li
  • for: This paper aims to provide a valuable method for obtaining challenging evaluation data to advance the development of more robust deep learning models.
  • methods: The proposed method, called SD-NAE, actively synthesizes natural adversarial examples (NAEs) using the state-of-the-art Stable Diffusion. The generation is guided by the gradient of loss from the target classifier.
  • results: The proposed method is effective in producing valid and useful NAEs, as demonstrated through a meticulously designed experiment. The generated NAEs can be used to evaluate the robustness of deep learning models and potentially advance their development.
    Abstract Robustly evaluating deep learning image classifiers is challenging due to some limitations of standard datasets. Natural Adversarial Examples (NAEs), arising naturally from the environment and capable of deceiving classifiers, are instrumental in identifying vulnerabilities in trained models. Existing works collect such NAEs by filtering from a huge set of real images, a process that is passive and lacks control. In this work, we propose to actively synthesize NAEs with the state-of-the-art Stable Diffusion. Specifically, our method formulates a controlled optimization process, where we perturb the token embedding that corresponds to a specified class to synthesize NAEs. The generation is guided by the gradient of loss from the target classifier so that the created image closely mimics the ground-truth class yet fools the classifier. Named SD-NAE (Stable Diffusion for Natural Adversarial Examples), our innovative method is effective in producing valid and useful NAEs, which is demonstrated through a meticulously designed experiment. Our work thereby provides a valuable method for obtaining challenging evaluation data, which in turn can potentially advance the development of more robust deep learning models. Code is available at https://github.com/linyueqian/SD-NAE.
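    The abstract describes the key loop of SD-NAE: perturb the token embedding of the ground-truth class and follow the gradient of the target classifier's loss through the generator. The outline below is only a schematic of that loop; `generate_image` and `class_token_embedding` are hypothetical placeholders standing in for the Stable Diffusion machinery, and the loss sign and schedule are illustrative rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def synthesize_nae(class_token_embedding, generate_image, classifier, target_class,
                   steps=50, lr=1e-2):
    """Schematic of gradient-guided natural adversarial example synthesis.

    class_token_embedding: initial embedding of the ground-truth class token (tensor).
    generate_image:        hypothetical differentiable generator, embedding -> image batch.
    classifier:            frozen target classifier under evaluation.
    """
    delta = torch.zeros_like(class_token_embedding, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        image = generate_image(class_token_embedding + delta)   # stays close to the class
        logits = classifier(image)
        # Push the classifier away from the true class: maximise its loss on `target_class`,
        # i.e. minimise the negative cross-entropy.
        loss = -F.cross_entropy(logits, torch.tensor([target_class], device=logits.device))
        opt.zero_grad(); loss.backward(); opt.step()
    return generate_image(class_token_embedding + delta).detach()
```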

Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models

  • paper_url: http://arxiv.org/abs/2311.12796
  • repo_url: https://github.com/davidstotko/physics-guided-shape-from-template
  • paper_authors: David Stotko, Nils Wandel, Reinhard Klein
  • for: 3D reconstruction of dynamic scenes is a long-standing problem in computer graphics that becomes harder as less information is available. Shape-from-Template (SfT) methods reconstruct template-based geometry from RGB images or video sequences, often from a single monocular camera without depth information, such as regular smartphone recordings; however, existing reconstruction methods are either unphysical and noisy or slow to optimise.
  • methods: The authors propose a novel SfT reconstruction algorithm for cloth that uses a pre-trained neural surrogate model, which is fast to evaluate, stable, and produces smooth reconstructions thanks to a regularising physics simulation. Differentiable rendering of the simulated mesh enables pixel-wise comparison between the reconstruction and a target video sequence, and a gradient-based optimisation extracts not only shape but also physical parameters such as stretching, shearing, or bending stiffness of the cloth.
  • results: Compared to $\phi$-SfT, a state-of-the-art physics-based SfT approach, the method reduces runtime by a factor of 400-500 while retaining a precise, stable, and smooth reconstructed geometry.
    Abstract 3D reconstruction of dynamic scenes is a long-standing problem in computer graphics and increasingly difficult the less information is available. Shape-from-Template (SfT) methods aim to reconstruct a template-based geometry from RGB images or video sequences, often leveraging just a single monocular camera without depth information, such as regular smartphone recordings. Unfortunately, existing reconstruction methods are either unphysical and noisy or slow in optimization. To solve this problem, we propose a novel SfT reconstruction algorithm for cloth using a pre-trained neural surrogate model that is fast to evaluate, stable, and produces smooth reconstructions due to a regularizing physics simulation. Differentiable rendering of the simulated mesh enables pixel-wise comparisons between the reconstruction and a target video sequence that can be used for a gradient-based optimization procedure to extract not only shape information but also physical parameters such as stretching, shearing, or bending stiffness of the cloth. This allows to retain a precise, stable, and smooth reconstructed geometry while reducing the runtime by a factor of 400-500 compared to $\phi$-SfT, a state-of-the-art physics-based SfT approach.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

  • paper_url: http://arxiv.org/abs/2311.12793
  • repo_url: https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V
  • paper_authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin
  • for: To improve modality alignment in large multi-modal models (LMMs), addressing the bottleneck posed by the scarcity of high-quality image-text data.
  • methods: The authors introduce the ShareGPT4V dataset, a large-scale resource of 1.2 million highly descriptive captions that surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations.
  • results: In the Supervised Fine-Tuning (SFT) stage, substituting an equivalent quantity of detailed captions in existing SFT datasets with the high-quality captions significantly improves LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. Incorporating ShareGPT4V data into both the pre-training and SFT phases yields ShareGPT4V-7B, an LMM with excellent performance across a majority of multi-modal benchmarks.
    Abstract In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.

SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering

  • paper_url: http://arxiv.org/abs/2311.12775
  • repo_url: https://github.com/Anttwo/SuGaR
  • paper_authors: Antoine Guédon, Vincent Lepetit
  • for: We propose a method for precise and extremely fast mesh extraction from 3D Gaussian Splatting.
  • methods: Our method includes a regularization term that encourages the gaussians to align well with the surface of the scene, and introduces a method that exploits this alignment to extract a mesh from the Gaussians using Poisson reconstruction.
  • results: Our method can retrieve an editable mesh for realistic rendering within minutes, compared to hours with the state-of-the-art methods on neural SDFs, while providing a better rendering quality.
    Abstract We propose a method to allow precise and extremely fast mesh extraction from 3D Gaussian Splatting. Gaussian Splatting has recently become very popular as it yields realistic rendering while being significantly faster to train than NeRFs. It is however challenging to extract a mesh from the millions of tiny 3D gaussians as these gaussians tend to be unorganized after optimization and no method has been proposed so far. Our first key contribution is a regularization term that encourages the gaussians to align well with the surface of the scene. We then introduce a method that exploits this alignment to extract a mesh from the Gaussians using Poisson reconstruction, which is fast, scalable, and preserves details, in contrast to the Marching Cubes algorithm usually applied to extract meshes from Neural SDFs. Finally, we introduce an optional refinement strategy that binds gaussians to the surface of the mesh, and jointly optimizes these Gaussians and the mesh through Gaussian splatting rendering. This enables easy editing, sculpting, rigging, animating, compositing and relighting of the Gaussians using traditional softwares by manipulating the mesh instead of the gaussians themselves. Retrieving such an editable mesh for realistic rendering is done within minutes with our method, compared to hours with the state-of-the-art methods on neural SDFs, while providing a better rendering quality. Our project page is the following: https://imagine.enpc.fr/~guedona/sugar/

Iris Presentation Attack: Assessing the Impact of Combining Vanadium Dioxide Films with Artificial Eyes

  • paper_url: http://arxiv.org/abs/2311.12773
  • repo_url: None
  • paper_authors: Darshika Jauhari, Renu Sharma, Cunjian Chen, Nelson Sepulveda, Arun Ross
  • for: This study investigates presentation attacks (PAs) on iris recognition systems that use artificial eyes, and examines the effect of affixing vanadium dioxide (VO2) films to the surface of artificial eyes.
  • methods: VO2 films, which selectively transmit near-infrared light and can therefore regulate how much NIR light reaches the iris sensor, are affixed to artificial eyes in various spatial configurations, and the resulting images are tested against two state-of-the-art iris PA detection methods.
  • results: The VO2 films can cause the PA detection methods to misclassify artificial eyes as bonafide eyes in some cases, revealing a vulnerability that must be systematically analysed and effectively addressed.
    Abstract Iris recognition systems, operating in the near infrared spectrum (NIR), have demonstrated vulnerability to presentation attacks, where an adversary uses artifacts such as cosmetic contact lenses, artificial eyes or printed iris images in order to circumvent the system. At the same time, a number of effective presentation attack detection (PAD) methods have been developed. These methods have demonstrated success in detecting artificial eyes (e.g., fake Van Dyke eyes) as presentation attacks. In this work, we seek to alter the optical characteristics of artificial eyes by affixing Vanadium Dioxide (VO2) films on their surface in various spatial configurations. VO2 films can be used to selectively transmit NIR light and can, therefore, be used to regulate the amount of NIR light from the object that is captured by the iris sensor. We study the impact of such images produced by the sensor on two state-of-the-art iris PA detection methods. We observe that the addition of VO2 films on the surface of artificial eyes can cause the PA detection methods to misclassify them as bonafide eyes in some cases. This represents a vulnerability that must be systematically analyzed and effectively addressed.

Swift Parameter-free Attention Network for Efficient Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.12770
  • repo_url: https://github.com/hongyuanyu/span
  • paper_authors: Cheng Wan, Hongyuan Yu, Zhiqi Li, Yihang Chen, Yajun Zou, Yuqing Liu, Xuanwu Yin, Kunlong Zuo
  • for: To balance parameter count, inference speed, and image quality in single image super-resolution (SISR), enabling high-quality and fast reconstruction in resource-constrained scenarios.
  • methods: The authors propose the Swift Parameter-free Attention Network (SPAN), an efficient SISR model with a novel parameter-free attention mechanism that uses symmetric activation functions and residual connections to enhance high-contribution information and suppress redundant information.
  • results: Theoretical analysis and evaluation on multiple benchmarks show that SPAN outperforms existing efficient super-resolution models in both image quality and inference speed, achieving a strong quality-speed trade-off; it attains the best PSNR of 27.09 dB, with the team's test runtime reduced by 7.08 ms, in the NTIRE 2023 efficient super-resolution challenge.
    Abstract Single Image Super-Resolution (SISR) is a crucial task in low-level computer vision, aiming to reconstruct high-resolution images from low-resolution counterparts. Conventional attention mechanisms have significantly improved SISR performance but often result in complex network structures and large number of parameters, leading to slow inference speed and large model size. To address this issue, we propose the Swift Parameter-free Attention Network (SPAN), a highly efficient SISR model that balances parameter count, inference speed, and image quality. SPAN employs a novel parameter-free attention mechanism, which leverages symmetric activation functions and residual connections to enhance high-contribution information and suppress redundant information. Our theoretical analysis demonstrates the effectiveness of this design in achieving the attention mechanism's purpose. We evaluate SPAN on multiple benchmarks, showing that it outperforms existing efficient super-resolution models in terms of both image quality and inference speed, achieving a significant quality-speed trade-off. This makes SPAN highly suitable for real-world applications, particularly in resource-constrained scenarios. Notably, our model attains the best PSNR of 27.09 dB, and the test runtime of our team is reduced by 7.08ms in the NTIRE 2023 efficient super-resolution challenge. Our code and models are made publicly available at \url{https://github.com/hongyuanyu/SPAN}.
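    The abstract states that the attention is parameter-free and built from symmetric activation functions and residual connections, but does not give its exact form. The snippet below is one plausible reading of that description, offered purely as an illustration; the activation choice and block layout are assumptions, not the official SPAN design.

```python
import torch
import torch.nn as nn

class ParameterFreeAttention(nn.Module):
    """Illustrative parameter-free attention: the attention map is computed directly from
    the feature map by a symmetric activation, so the mechanism itself learns no weights."""
    def forward(self, x, features):
        attn = torch.sigmoid(features) - 0.5      # symmetric around zero (an assumed choice)
        return x + attn * features                # residual connection keeps the identity path

class SPANBlockSketch(nn.Module):
    """A residual block gated by the parameter-free attention (not the official SPAN block)."""
    def __init__(self, ch=48):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.attn = ParameterFreeAttention()

    def forward(self, x):
        return self.attn(x, self.body(x))
```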

Investigating Weight-Perturbed Deep Neural Networks With Application in Iris Presentation Attack Detection

  • paper_url: http://arxiv.org/abs/2311.12764
  • repo_url: https://github.com/redwankarimsony/weightperturbation-msu
  • paper_authors: Renu Sharma, Redwan Sony, Arun Ross
  • for: This work analyses the sensitivity of deep neural networks (DNNs) to parameter perturbations, an analysis that should precede deployment in real-world applications.
  • methods: The study covers three DNN architectures (VGG, ResNet, and DenseNet), three types of parameter perturbation (Gaussian noise, weight zeroing, and weight scaling), and two settings (entire network and layer-wise), with experiments on iris presentation attack detection using two public datasets (LivDet-Iris-2017 and LivDet-Iris-2020).
  • results: Based on the sensitivity analysis, the authors obtain improved models simply by perturbing network parameters without retraining, and combine the perturbed models at the score level and at the parameter level. The parameter-level ensemble yields an average improvement of 43.58% on LivDet-Iris-2017 and 9.25% on LivDet-Iris-2020. The code is available on GitHub.
    Abstract Deep neural networks (DNNs) exhibit superior performance in various machine learning tasks, e.g., image classification, speech recognition, biometric recognition, object detection, etc. However, it is essential to analyze their sensitivity to parameter perturbations before deploying them in real-world applications. In this work, we assess the sensitivity of DNNs against perturbations to their weight and bias parameters. The sensitivity analysis involves three DNN architectures (VGG, ResNet, and DenseNet), three types of parameter perturbations (Gaussian noise, weight zeroing, and weight scaling), and two settings (entire network and layer-wise). We perform experiments in the context of iris presentation attack detection and evaluate on two publicly available datasets: LivDet-Iris-2017 and LivDet-Iris-2020. Based on the sensitivity analysis, we propose improved models simply by perturbing parameters of the network without undergoing training. We further combine these perturbed models at the score-level and at the parameter-level to improve the performance over the original model. The ensemble at the parameter-level shows an average improvement of 43.58% on the LivDet-Iris-2017 dataset and 9.25% on the LivDet-Iris-2020 dataset. The source code is available at https://github.com/redwankarimsony/WeightPerturbation-MSU.
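    The three perturbation types named above (Gaussian noise, weight zeroing, weight scaling) are straightforward to reproduce for any PyTorch model; a generic sketch is shown below, with magnitudes chosen arbitrarily for illustration. Score-level ensembling then simply averages the outputs of several perturbed copies.

```python
import torch

@torch.no_grad()
def perturb_weights(model, mode="gaussian", sigma=0.01, zero_frac=0.05, scale=1.05,
                    layer_name=None):
    """Perturb a model's parameters in place (illustrative magnitudes).

    mode:       'gaussian' (add noise), 'zero' (zero out a random fraction), or 'scale' (multiply).
    layer_name: if given, only parameters whose name contains this string are perturbed
                (layer-wise setting); otherwise the entire network is perturbed.
    """
    for name, p in model.named_parameters():
        if layer_name is not None and layer_name not in name:
            continue
        if mode == "gaussian":
            p.add_(sigma * torch.randn_like(p))
        elif mode == "zero":
            p.masked_fill_(torch.rand_like(p) < zero_frac, 0.0)
        elif mode == "scale":
            p.mul_(scale)
    return model
```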

High-resolution Image-based Malware Classification using Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2311.12760
  • repo_url: https://github.com/timppeters/mil-malware-images
  • paper_authors: Tim Peters, Hikmat Farhat
  • for: To classify malware into families in a way that is robust to adversarial binary enlargement.
  • methods: High-resolution greyscale images of binaries are divided into patches and classified with multiple instance learning to overcome adversarial enlargement.
  • results: Experiments show the method reaches 96.6% accuracy on adversarially enlarged samples, compared to a baseline of 22.8%.
    Abstract This paper proposes a novel method of classifying malware into families using high-resolution greyscale images and multiple instance learning to overcome adversarial binary enlargement. Current methods of visualisation-based malware classification largely rely on lossy transformations of inputs such as resizing to handle the large, variable-sized images. Through empirical analysis and experimentation, it is shown that these approaches cause crucial information loss that can be exploited. The proposed solution divides the images into patches and uses embedding-based multiple instance learning with a convolutional neural network and an attention aggregation function for classification. The implementation is evaluated on the Microsoft Malware Classification dataset and achieves accuracies of up to $96.6\%$ on adversarially enlarged samples compared to the baseline of $22.8\%$. The Python code is available online at https://github.com/timppeters/MIL-Malware-Images .
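    Multiple instance learning over image patches typically pools per-patch embeddings with a learned attention function before classification. The sketch below shows such an attention-based MIL pooling head, following the widely used formulation of Ilse et al.; it may differ in detail from the paper's aggregation function, and the embedding sizes are placeholders. A CNN backbone (not shown) would produce `patch_embeds` from the high-resolution greyscale patches.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Aggregate a bag of patch embeddings into one malware-family prediction."""
    def __init__(self, embed_dim=256, attn_dim=128, num_families=9):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(embed_dim, attn_dim), nn.Tanh(),
                                  nn.Linear(attn_dim, 1))
        self.classifier = nn.Linear(embed_dim, num_families)

    def forward(self, patch_embeds):              # (num_patches, embed_dim), one bag
        weights = torch.softmax(self.attn(patch_embeds), dim=0)   # (num_patches, 1)
        bag = (weights * patch_embeds).sum(dim=0)                 # attention-weighted average
        return self.classifier(bag), weights
```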

Breathing Life Into Sketches Using Text-to-Video Priors

  • paper_url: http://arxiv.org/abs/2311.13608
  • repo_url: None
  • paper_authors: Rinon Gal, Yael Vinker, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Ariel Shamir, Gal Chechik
  • for: To animate a single-subject sketch into a short animation, using a text prompt to indicate the desired motion.
  • methods: The text prompt guides the stroke motion through the motion prior of a large pretrained text-to-video diffusion model, applied via a score-distillation loss; the learned motion is modelled with two components, small local deformations and global affine transformations, to keep it natural and smooth while preserving the sketch's appearance.
  • results: The method produces short animations in vector representation that are of high quality and can be easily edited.
    Abstract A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.

Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches

  • paper_url: http://arxiv.org/abs/2311.12914
  • repo_url: None
  • paper_authors: Quazi Mishkatul Alam, Bilel Tarchoun, Ihsen Alouani, Nael Abu-Ghazaleh
  • for: This paper studies the adversarial robustness of transformer models used for vision tasks.
  • methods: It focuses on deformable vision transformers, which use sparse attention structures to reduce the quadratic complexity of attention modelling, and develops adversarial patch attacks, including collaborative attacks, that manipulate their attention.
  • results: Existing adversarial attacks on transformers do not transfer to deformable transformers because of their sparse attention, but the proposed attention-manipulating and collaborative patch attacks succeed: patching only 1% of the input area can reduce AP to 0%.
    Abstract The latest generation of transformer-based vision models have proven to be superior to Convolutional Neural Network (CNN)-based models across several vision tasks, largely attributed to their remarkable prowess in relation modeling. Deformable vision transformers significantly reduce the quadratic complexity of modeling attention by using sparse attention structures, enabling them to be used in larger scale applications such as multi-view vision systems. Recent work demonstrated adversarial attacks against transformers; we show that these attacks do not transfer to deformable transformers due to their sparse attention structure. Specifically, attention in deformable transformers is modeled using pointers to the most relevant other tokens. In this work, we contribute for the first time adversarial attacks that manipulate the attention of deformable transformers, distracting them to focus on irrelevant parts of the image. We also develop new collaborative attacks where a source patch manipulates attention to point to a target patch that adversarially attacks the system. In our experiments, we find that only 1% patched area of the input field can lead to 0% AP. We also show that the attacks provide substantial versatility to support different attacker scenarios because of their ability to redirect attention under the attacker control.

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatially Relation Matching

  • paper_url: http://arxiv.org/abs/2311.12751
  • repo_url: None
  • paper_authors: Meng Chu, Zhedong Zheng, Wei Ji, Tat-Seng Chua
  • for: To integrate natural language commands into drone control and navigation.
  • methods: A large language model (LLM)-based data generation framework and pretrained vision models are used to extend the University-1652 image dataset with spatial-aware text annotations, and a fine-grained spatial matching objective, called blending spatial matching, is introduced for region-level spatial relation matching.
  • results: The approach maintains an exceptional recall rate under varying description complexities, underscoring its promise for integrating natural language commands into real-world drone scenarios.
    Abstract Drone navigation through natural language commands remains a significant challenge due to the lack of publicly available multi-modal datasets and the intricate demands of fine-grained visual-text alignment. In response to this pressing need, we present a new human-computer interaction annotation benchmark called GeoText-1652, meticulously curated through a robust Large Language Model (LLM)-based data generation framework and the expertise of pre-trained vision models. This new dataset seamlessly extends the existing image dataset, \ie, University-1652, with spatial-aware text annotations, encompassing intricate image-text-bounding box associations. Besides, we introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains an exceptional recall rate under varying description complexities. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.

Q-Seg: Quantum Annealing-based Unsupervised Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.12912
  • repo_url: None
  • paper_authors: Supreeth Mysore Venkatesh, Antonio Macaluso, Marlon Nuske, Matthias Klusch, Andreas Dengel
  • for: This work proposes an unsupervised image segmentation method based on quantum annealing, tailored to existing quantum hardware.
  • methods: The pixel-wise segmentation problem, which assimilates the spectral and spatial information of the image, is formulated as a graph-cut optimisation task that exploits the interconnected qubit topology of the D-Wave Advantage device for efficient, scalable segmentation.
  • results: Experiments show that Q-Seg offers better runtime performance than the classical optimiser Gurobi, and on Earth Observation imagery, where labelled data are scarce, it achieves near-optimal flood-mapping results relative to classical supervised state-of-the-art machine learning methods and improved forest-coverage segmentation compared to existing annotated masks.
    Abstract In this study, we present Q-Seg, a novel unsupervised image segmentation method based on quantum annealing, tailored for existing quantum hardware. We formulate the pixel-wise segmentation problem, which assimilates spectral and spatial information of the image, as a graph-cut optimization task. Our method efficiently leverages the interconnected qubit topology of the D-Wave Advantage device, offering superior scalability over existing quantum approaches and outperforming state-of-the-art classical methods. Our empirical evaluations on synthetic datasets reveal that Q-Seg offers better runtime performance against the classical optimizer Gurobi. Furthermore, we evaluate our method on segmentation of Earth Observation images, an area of application where the amount of labeled data is usually very limited. In this case, Q-Seg demonstrates near-optimal results in flood mapping detection with respect to classical supervised state-of-the-art machine learning methods. Also, Q-Seg provides enhanced segmentation for forest coverage compared to existing annotated masks. Thus, Q-Seg emerges as a viable alternative for real-world applications using available quantum hardware, particularly in scenarios where the lack of labeled data and computational runtime are critical.
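    Casting pixel-wise segmentation as a graph cut for an annealer amounts to building a QUBO: one binary variable per pixel, with pairwise terms that make it costly to separate similar neighbouring pixels and profitable to separate dissimilar ones. The construction below is a generic sketch of that encoding on a 4-connected grid, not the exact objective, weight normalisation, or hardware embedding used by Q-Seg.

```python
import numpy as np

def segmentation_qubo(image):
    """Build a QUBO matrix Q for a two-segment graph cut on a 4-connected pixel grid.

    Each pixel i gets a binary variable x_i; cutting an edge (i, j) costs
    w_ij * (x_i + x_j - 2 x_i x_j). Minimising x^T Q x (e.g. on an annealer)
    yields the segmentation. `image` is assumed to be scaled to [0, 1].
    """
    h, w = image.shape
    Q = np.zeros((h * w, h * w))
    idx = lambda r, c: r * w + c
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):                  # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    # Signed similarity: similar neighbours get a positive weight (costly to
                    # cut), dissimilar ones a negative weight (cut encouraged), which keeps
                    # the minimisation from collapsing to a single segment.
                    diff = float(image[r, c]) - float(image[rr, cc])
                    wij = np.exp(-diff * diff) - 0.5
                    i, j = idx(r, c), idx(rr, cc)
                    Q[i, i] += wij
                    Q[j, j] += wij
                    Q[i, j] -= 2.0 * wij
    return Q
```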

Attacking Motion Planners Using Adversarial Perception Errors

  • paper_url: http://arxiv.org/abs/2311.12722
  • repo_url: None
  • paper_authors: Jonathan Sadeghi, Nicholas A. Lord, John Redford, Romain Mueller
  • for: This paper examines how the modules of an autonomous driving (AD) system interact and how the performance of each module affects the performance of the system as a whole.
  • methods: A simple boundary-attack algorithm is used to systematically construct perception errors that lead to planning failures.
  • results: Using this algorithm, the authors find such adversarial perception errors for two different black-box planners in several urban and highway driving scenarios in the CARLA simulator, and show that the attacks are isolated in the planner's input space.
    Abstract Autonomous driving (AD) systems are often built and tested in a modular fashion, where the performance of different modules is measured using task-specific metrics. These metrics should be chosen so as to capture the downstream impact of each module and the performance of the system as a whole. For example, high perception quality should enable prediction and planning to be performed safely. Even though this is true in general, we show here that it is possible to construct planner inputs that score very highly on various perception quality metrics but still lead to planning failures. In an analogy to adversarial attacks on image classifiers, we call such inputs \textbf{adversarial perception errors} and show they can be systematically constructed using a simple boundary-attack algorithm. We demonstrate the effectiveness of this algorithm by finding attacks for two different black-box planners in several urban and highway driving scenarios using the CARLA simulator. Finally, we analyse the properties of these attacks and show that they are isolated in the input space of the planner, and discuss their implications for AD system deployment and testing.

Cascade Learning Localises Discriminant Features in Visual Scene Classification

  • paper_url: http://arxiv.org/abs/2311.12704
  • repo_url: None
  • paper_authors: Junwen Wang, Katayoun Farrahi
  • for: To improve the interpretability of deep convolutional neural networks (DCNNs), particularly in the medical domain, where clinicians want trustworthy automated decisions.
  • methods: The work investigates how well feature representations learned under two different paradigms, traditional end-to-end (E2E) learning and layer-wise cascade learning (CL), localise with respect to expert-labelled regions of interest.
  • results: Analysis on medical and natural datasets shows that the E2E strategy has a limited ability to localise discriminative features across multiple network layers, whereas cascade learning yields more localised features; in the best result on the YOLO object detection framework, CL outperforms the E2E scheme by 2% in mAP.
    Abstract Lack of interpretability of deep convolutional neural networks (DCNN) is a well-known problem particularly in the medical domain as clinicians want trustworthy automated decisions. One way to improve trust is to demonstrate the localisation of feature representations with respect to expert labeled regions of interest. In this work, we investigate the localisation of features learned via two varied learning paradigms and demonstrate the superiority of one learning approach with respect to localisation. Our analysis on medical and natural datasets show that the traditional end-to-end (E2E) learning strategy has a limited ability to localise discriminative features across multiple network layers. We show that a layer-wise learning strategy, namely cascade learning (CL), results in more localised features. Considering localisation accuracy, we not only show that CL outperforms E2E but that it is a promising method of predicting regions. On the YOLO object detection framework, our best result shows that CL outperforms the E2E scheme by $2\%$ in mAP.

Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation

  • paper_url: http://arxiv.org/abs/2311.12682
  • repo_url: None
  • paper_authors: Mu Chen, Zhedong Zheng, Yi Yang
  • for: This paper addresses scene segmentation via unsupervised domain adaptation (UDA), i.e., transferring knowledge learned from source data to target-domain data, largely reducing the need for manual pixel-level annotations in the target domain.
  • methods: The method infers scene layout from depth estimation, performs data augmentation with a Depth-guided Contextual Filter (DCF), and uses a cross-task encoder for contextual learning.
  • results: The approach achieves competitive performance on two widely used benchmarks, 77.7 mIoU on GTA to Cityscapes and 69.3 mIoU on Synthia to Cityscapes, while using only pseudo depth.
    Abstract Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data, which largely reduces the need for manual pixel-level annotations in the target domain. To facilitate domain-invariant feature learning, existing methods typically mix data from both the source domain and target domain by simply copying and pasting the pixels. Such vanilla methods are usually sub-optimal since they do not take into account how well the mixed layouts correspond to real-world scenarios. Real-world scenarios are with an inherent layout. We observe that semantic categories, such as sidewalks, buildings, and sky, display relatively consistent depth distributions, and could be clearly distinguished in a depth map. Based on such observation, we propose a depth-aware framework to explicitly leverage depth estimation to mix the categories and facilitate the two complementary tasks, i.e., segmentation and depth learning in an end-to-end manner. In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning. DCF simulates the real-world layouts, while the cross-task encoder further adaptively fuses the complementing features between two tasks. Besides, it is worth noting that several public datasets do not provide depth annotation. Therefore, we leverage the off-the-shelf depth estimation network to generate the pseudo depth. Extensive experiments show that our proposed methods, even with pseudo depth, achieve competitive performance on two widely-used bench-marks, i.e. 77.7 mIoU on GTA to Cityscapes and 69.3 mIoU on Synthia to Cityscapes.
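    The Depth-guided Contextual Filter is described above as a depth-aware class-mix augmentation. A toy version of the underlying idea, pasting selected source-domain classes onto a target image in far-to-near depth order so that nearer content occludes farther content, is sketched below; the class selection, the use of an ignore label for non-pasted pixels, and all thresholds are placeholders, and the real DCF is more involved.

```python
import numpy as np

def depth_guided_mix(src_img, src_label, src_depth, tgt_img, classes_to_paste, ignore_index=255):
    """Paste the chosen source classes onto the target image, ordered far-to-near by depth,
    so that nearer pasted content occludes farther content (illustrative only).

    src_img/tgt_img: (H, W, 3) arrays; src_label: (H, W) class map; src_depth: (H, W) depths.
    classes_to_paste: class ids assumed to be present in src_label.
    """
    mixed_img = tgt_img.copy()
    mixed_label = np.full(src_label.shape, ignore_index, dtype=src_label.dtype)
    # Sort the selected classes by their median depth, farthest first.
    order = sorted(classes_to_paste, key=lambda c: -np.median(src_depth[src_label == c]))
    for c in order:
        mask = src_label == c
        mixed_img[mask] = src_img[mask]
        mixed_label[mask] = c
    return mixed_img, mixed_label
```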

BundleMoCap: Efficient, Robust and Smooth Motion Capture from Sparse Multiview Videos

  • paper_url: http://arxiv.org/abs/2311.12679
  • repo_url: None
  • paper_authors: Georgios Albanis, Nikolaos Zioulis, Kostas Kolomvatsos
  • for: Capturing smooth motion from videos with markerless techniques typically involves complex processes such as temporal constraints, multiple stages with data-driven regression and optimisation, and bundle solving over temporal windows, which can be inefficient and require tuning multiple objectives across stages.
  • methods: BundleMoCap solves the motion capture task in a single stage, eliminating temporal smoothness objectives while still delivering smooth motions. Its key idea is manifold interpolation between latent keyframes: by relying on a local manifold smoothness assumption, a bundle of frames is solved efficiently using a single code, implemented as a sliding-window optimisation that only requires the first frame to be properly initialised.
  • results: BundleMoCap outperforms the state of the art without increasing complexity, achieving high-quality motion capture results with simplicity and efficiency. More details can be found at https://moverseai.github.io/bundle/.
    Abstract Capturing smooth motions from videos using markerless techniques typically involves complex processes such as temporal constraints, multiple stages with data-driven regression and optimization, and bundle solving over temporal windows. These processes can be inefficient and require tuning multiple objectives across stages. In contrast, BundleMoCap introduces a novel and efficient approach to this problem. It solves the motion capture task in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions. BundleMoCap outperforms the state-of-the-art without increasing complexity. The key concept behind BundleMoCap is manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption, we can efficiently solve a bundle of frames using a single code. Additionally, the method can be implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap's strength lies in its ability to achieve high-quality motion capture results with simplicity and efficiency. More details can be found at https://moverseai.github.io/bundle/.
    摘要 使用无标记技术从视频中捕捉平滑运动,通常涉及复杂的流程,例如时间约束、多阶段的数据驱动回归与优化,以及在时间窗口上的捆绑求解。这些流程效率较低,且需要在不同阶段调节多个目标。相比之下,BundleMoCap提供了一种新颖而高效的方法:它在单个阶段内解决运动捕捉问题,无需时间平滑性目标,同时仍能提供平滑的运动。与现有技术相比,BundleMoCap表现更优,且不增加复杂度。BundleMoCap的关键思想是在隐式关键帧之间进行流形插值:基于局部流形平滑性假设,可以用单个编码高效地求解一束帧。此外,该方法可以实现为滑动窗口优化,只需正确初始化第一帧,从而降低总体计算负担。BundleMoCap的优势在于以简单高效的方式获得高质量的运动捕捉结果。更多细节请参考 https://moverseai.github.io/bundle/。

Similar Document Template Matching Algorithm

  • paper_url: http://arxiv.org/abs/2311.12663
  • repo_url: None
  • paper_authors: Harshitha Yenigalla, Bommareddy Revanth Srinivasa Reddy, Batta Venkata Rahul, Nannapuraju Hemanth Raju
  • for: 这份研究旨在提供一个整体的医疗文书验证方法,整合先进的模板提取、比较和欺诈检测技术。
  • methods: 该方法包括先进的模板提取技术,使用感兴趣区域(ROI)方法,包括轮廓分析和边缘识别。预处理步骤使用形态学操作和自适应阈值,以确保模板的清晰度。模板比较算法使用先进的特征匹配,包括关键点和描述子,并通过基于直方图的分析来增强对变化的鲁棒性。欺诈检测则使用SSIM量化结构相似性以帮助确定可能的匹配,并用OCR提取关键文本信息。
  • results: 该方法可以实现高精度的医疗文书验证,涵盖模板提取、比较、欺诈检测,以及对多种文书结构的灵活适应。
    Abstract This study outlines a comprehensive methodology for verifying medical documents, integrating advanced techniques in template extraction, comparison, and fraud detection. It begins with template extraction using sophisticated region-of-interest (ROI) methods, incorporating contour analysis and edge identification. Pre-processing steps ensure template clarity through morphological operations and adaptive thresholding. The template comparison algorithm utilizes advanced feature matching with key points and descriptors, enhancing robustness through histogram-based analysis for accounting variations. Fraud detection involves the SSIM computation and OCR for textual information extraction. The SSIM quantifies structural similarity, aiding in potential match identification. OCR focuses on critical areas like patient details, provider information, and billing amounts. Extracted information is compared with a reference dataset, and confidence thresholding ensures reliable fraud detection. Adaptive parameters enhance system flexibility for dynamic adjustments to varying document layouts. This methodology provides a robust approach to medical document verification, addressing complexities in template extraction, comparison, fraud detection, and adaptability to diverse document structures.
    摘要
  1. Template extraction using sophisticated region-of-interest (ROI) methods, including contour analysis and edge identification. Pre-processing steps such as morphological operations and adaptive thresholding are used to ensure template clarity.
  2. Template comparison using advanced feature matching with key points and descriptors, which enhances robustness through histogram-based analysis for accounting variations.
  3. Fraud detection involving the SSIM (Structural Similarity Index Measure) computation and OCR (Optical Character Recognition) for textual information extraction. The SSIM quantifies structural similarity, aiding in potential match identification, while OCR focuses on critical areas such as patient details, provider information, and billing amounts.
  4. Extracted information is compared with a reference dataset, and confidence thresholding ensures reliable fraud detection.
  5. Adaptive parameters are used to enhance system flexibility for dynamic adjustments to varying document layouts.

  This methodology provides a robust approach to medical document verification, addressing complexities in template extraction, comparison, fraud detection, and adaptability to diverse document structures.
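
The comparison step described above combines standard building blocks (ROI preparation with adaptive thresholding, keypoint/descriptor matching, and SSIM). A minimal illustrative sketch of that step with OpenCV and scikit-image is shown below; the specific detector (ORB), distance cutoff, and decision thresholds are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the template-comparison step (ORB keypoint matching + SSIM);
# thresholds and detector choice are assumptions, not the paper's exact values.
import cv2
from skimage.metrics import structural_similarity as ssim

def compare_templates(reference_path: str, candidate_path: str) -> dict:
    ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    cand = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)
    cand = cv2.resize(cand, (ref.shape[1], ref.shape[0]))

    # Adaptive thresholding reduces illumination differences before matching.
    ref_bin = cv2.adaptiveThreshold(ref, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 31, 5)
    cand_bin = cv2.adaptiveThreshold(cand, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 31, 5)

    # Keypoint/descriptor matching (ORB + Hamming brute force).
    orb = cv2.ORB_create(nfeatures=1000)
    _, d1 = orb.detectAndCompute(ref_bin, None)
    _, d2 = orb.detectAndCompute(cand_bin, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(d1, d2) if d1 is not None and d2 is not None else []
    good = [m for m in matches if m.distance < 40]          # assumed cutoff

    # Structural similarity between the aligned binarized documents.
    score = ssim(ref_bin, cand_bin, data_range=255)

    return {
        "n_keypoint_matches": len(good),
        "ssim": float(score),
        "likely_same_template": len(good) > 50 and score > 0.6,   # assumed thresholds
    }
```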

Visually Guided Object Grasping

  • paper_url: http://arxiv.org/abs/2311.12660
  • repo_url: None
  • paper_authors: Radu Horaud, Fadi Dornaika, Bernard Espiau
  • for: 本文提出了一种视觉伺服方法,用于物体抓取,以及更一般的末端执行器与物体对齐问题。
  • methods: 本文首先将Espiau等人提出的方法扩展到相机未安装在被控制机器人上的情形,并强调了实时估计图像雅可比矩阵的重要性。
  • results: 本文展示了如何利用未标定的立体相机对,在三维射影空间中表示抓取或更一般的两个刚体之间的对齐。这种三维射影表示具有视点不变性,无需任何相机参数知识即可轻松映射为图像设定点。
    Abstract In this paper we present a visual servoing approach to the problem of object grasping and more generally, to the problem of aligning an end-effector with an object. First we extend the method proposed by Espiau et al. [1] to the case of a camera which is not mounted onto the robot being controlled and we stress the importance of the real-time estimation of the image Jacobian. Second, we show how to represent a grasp or more generally, an alignment between two solids in 3-D projective space using an uncalibrated stereo rig. Such a 3-D projective representation is view-invariant in the sense that it can be easily mapped into an image set-point without any knowledge about the camera parameters. Third, we perform an analysis of the performances of the visual servoing algorithm and of the grasping precision that can be expected from this type of approach.
    摘要 在这篇论文中,我们提出了一种视觉伺服方法,用于解决物体抓取问题,以及更一般地,末端执行器与物体对齐的问题。首先,我们将Espiau等人提出的方法扩展到相机未安装在被控制机器人上的情形,并强调了实时估计图像雅可比矩阵的重要性。其次,我们展示了如何使用未标定的立体相机对,在三维射影空间中表示一个抓取,或更一般地,两个刚体之间的对齐。这种三维射影表示具有视点不变性,即无需任何相机参数知识,就可以轻松地将其映射为图像设定点。最后,我们对视觉伺服算法的性能以及这类方法可达到的抓取精度进行了分析。
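
The abstract stresses real-time estimation of the image Jacobian (interaction matrix). For reference, the classical image-based visual servoing step that such a Jacobian feeds into can be written in a few lines; the sketch below uses the standard point-feature interaction matrix and a pseudo-inverse control law, i.e. the textbook formulation rather than the paper's extension to a camera that is not mounted on the controlled robot.

```python
# Classical image-based visual servoing step: v = -lambda * pinv(L) @ (s - s*),
# using the standard point-feature interaction matrix (not the paper's extension).
import numpy as np

def interaction_matrix(x: float, y: float, Z: float) -> np.ndarray:
    """2x6 interaction (image Jacobian) matrix of a normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x ** 2), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y ** 2, -x * y, -x],
    ])

def servo_step(points, desired_points, depths, gain=0.5):
    """One control step: returns the 6-DoF camera velocity twist [vx, vy, vz, wx, wy, wz]."""
    L = np.vstack([interaction_matrix(x, y, Z) for (x, y), Z in zip(points, depths)])
    error = (np.asarray(points) - np.asarray(desired_points)).reshape(-1)
    return -gain * np.linalg.pinv(L) @ error

# Example: drive four tracked points toward their desired image locations.
pts     = [(0.10, 0.05), (-0.08, 0.06), (0.09, -0.07), (-0.11, -0.04)]
targets = [(0.05, 0.05), (-0.05, 0.05), (0.05, -0.05), (-0.05, -0.05)]
print(servo_step(pts, targets, depths=[1.0] * 4))
```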

Hand-Eye Calibration

  • paper_url: http://arxiv.org/abs/2311.12655
  • repo_url: https://github.com/IFL-CAMP/easy_handeye
  • paper_authors: Radu Horaud, Fadi Dornaika
  • for: 本研究旨在确定安装在机器人手上的传感器与机器人手之间的关系,即手眼标定问题。
  • methods: 本研究给出了手眼标定问题的两种形式化:一种是经典的AX=XB形式,另一种是直接利用相机两个位置对应的3x4透视投影矩阵(M和M')的MY=M'YB形式,并在统一的数学框架下给出求解方法。
  • results: 研究表明,使用非线性优化方法同时解决旋转和平移问题是最稳定的方法,并且可以更好地抵消噪声和测量误差的影响。
    Abstract Whenever a sensor is mounted on a robot hand it is important to know the relationship between the sensor and the hand. The problem of determining this relationship is referred to as hand-eye calibration, which is important in at least two types of tasks: (i) map sensor centered measurements into the robot workspace and (ii) allow the robot to precisely move the sensor. In the past some solutions were proposed in the particular case of a camera. With almost no exception, all existing solutions attempt to solve the homogeneous matrix equation AX=XB. First we show that there are two possible formulations of the hand-eye calibration problem. One formulation is the classical one that we just mentioned. A second formulation takes the form of the following homogeneous matrix equation: MY=M'YB. The advantage of the latter is that the extrinsic and intrinsic camera parameters need not be made explicit. Indeed, this formulation directly uses the 3 by 4 perspective matrices (M and M') associated with two positions of the camera. Moreover, this formulation together with the classical one cover a wider range of camera-based sensors to be calibrated with respect to the robot hand. Second, we develop a common mathematical framework to solve for the hand-eye calibration problem using either of the two formulations. We present two methods, (i) a rotation then translation and (ii) a non-linear solver for rotation and translation. Third, we perform a stability analysis both for our two methods and for the classical linear method of Tsai and Lenz (1989). In the light of this comparison, the non-linear optimization method, that solves for rotation and translation simultaneously, seems to be the most robust one with respect to noise and to measurement errors.
    摘要 当传感器安装在机器人手上时,了解传感器与机器人手之间的关系非常重要。确定这一关系的问题被称为手眼标定,它在至少两类任务中至关重要:(i)将以传感器为中心的测量映射到机器人工作空间中,(ii)使机器人能够精确地移动传感器。过去,人们针对相机这一特殊情形提出了一些解决方案,而几乎所有现有方案都试图求解齐次矩阵方程 AX=XB。我们首先指出,手眼标定问题存在两种可能的形式化:一种是上述经典形式;另一种形式化为齐次矩阵方程 MY=M'YB。后者的优点在于无需显式给出相机的内参和外参,它直接使用相机两个位置对应的3x4透视投影矩阵(M和M')。此外,这种形式化与经典形式化一起,覆盖了更广泛的可相对机器人手进行标定的基于相机的传感器。其次,我们建立了一个统一的数学框架,可用任一形式化求解手眼标定问题,并给出两种方法:(i)先求旋转再求平移,(ii)同时求解旋转和平移的非线性求解器。第三,我们对这两种方法以及Tsai和Lenz(1989)的经典线性方法进行了稳定性分析。比较结果表明,同时求解旋转和平移的非线性优化方法对噪声和测量误差最为鲁棒。
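
The second method mentioned in the abstract solves for rotation and translation simultaneously. A hedged sketch of such a simultaneous nonlinear least-squares solve of AX=XB over motion pairs is given below; the parameterization (rotation vector plus translation) and solver settings are assumptions, not the authors' exact algorithm.

```python
# Sketch: solve the hand-eye equation A_i X = X B_i over all motion pairs at once,
# jointly in rotation and translation, via nonlinear least squares.
# A_i, B_i, X are 4x4 homogeneous transforms; a generic formulation, not the paper's solver.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as R

def to_matrix(params):
    """params = [rx, ry, rz, tx, ty, tz] (rotation vector + translation) -> 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R.from_rotvec(params[:3]).as_matrix()
    T[:3, 3] = params[3:]
    return T

def residuals(params, A_list, B_list):
    X = to_matrix(params)
    res = []
    for A, B in zip(A_list, B_list):
        D = A @ X - X @ B
        res.append(D[:3, :].reshape(-1))   # rotation and translation residuals together
    return np.concatenate(res)

def calibrate_hand_eye(A_list, B_list):
    # At least two motion pairs with non-parallel rotation axes are needed for a unique solution.
    sol = least_squares(residuals, np.zeros(6), args=(A_list, B_list), method="lm")
    return to_matrix(sol.x)
```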

Polyhedral Object Recognition by Indexing

  • paper_url: http://arxiv.org/abs/2311.12641
  • repo_url: https://github.com/crowdbotics-dev/abd-231102-s1-dev-126419
  • paper_authors: Radu Horaud, Humberto Sossa
  • for: 本研究旨在解决计算机视觉中的索引问题,即从2D图像中识别3D多面体对象,而不需要经典的图像特征与对象特征匹配方法。
  • methods: 本文提出了一种新的图像索引方法,基于多项式表示法和哈希函数。这种方法可以快速地判断图像中是否包含一个特定的3D多面体对象。
  • results: 本文给出了用于评估系统性能的实验结果,表明该方法能够有效地从2D图像中识别3D多面体对象。
    Abstract In computer vision, the indexing problem is the problem of recognizing a few objects in a large database of objects while avoiding the help of the classical image-feature-to-object-feature matching paradigm. In this paper we address the problem of recognizing 3-D polyhedral objects from 2-D images by indexing. Both the objects to be recognized and the images are represented by weighted graphs. The indexing problem is therefore the problem of determining whether a graph extracted from the image is present or absent in a database of model graphs. We introduce a novel method for performing this graph indexing process which is based both on polynomial characterization of binary and weighted graphs and on hashing. We describe in detail this polynomial characterization and then we show how it can be used in the context of polyhedral object recognition. Next we describe a practical recognition-by-indexing system that includes the organization of the database, the representation of polyhedral objects in terms of 2-D characteristic views, the representation of this views in terms of weighted graphs, and the associated image processing. Finally, some experimental results allow the evaluation of the system performance.
    摘要 We propose a novel method for performing this graph indexing process, which is based on both polynomial characterization of binary and weighted graphs and hashing. We describe the polynomial characterization in detail and show how it can be used in the context of polyhedral object recognition.Our practical recognition-by-indexing system includes the organization of the database, the representation of polyhedral objects in terms of 2-D characteristic views, the representation of these views in terms of weighted graphs, and the associated image processing. We also present some experimental results to evaluate the system's performance.
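
The indexing idea hashes a polynomial characterization of each weighted graph so that a graph extracted from the image can be looked up in the model database without feature-to-feature matching. The sketch below uses the characteristic polynomial of the weighted adjacency matrix as the signature; the exact polynomial characterization in the paper may differ, and cospectral graphs would collide in this simplified version.

```python
# Illustrative graph-indexing sketch: hash the characteristic polynomial of the
# weighted adjacency matrix; shows the index/lookup mechanics only.
import numpy as np

def graph_key(adjacency: np.ndarray, decimals: int = 6) -> tuple:
    """Signature of a weighted graph: rounded characteristic-polynomial coefficients."""
    coeffs = np.poly(adjacency)              # coefficients of det(xI - A)
    return tuple(np.round(coeffs, decimals))

def build_index(model_graphs: dict) -> dict:
    """model_graphs: {model_name: adjacency matrix} -> hash table keyed by signature."""
    index = {}
    for name, adj in model_graphs.items():
        index.setdefault(graph_key(adj), []).append(name)
    return index

def lookup(index: dict, query_adjacency: np.ndarray):
    """Return candidate model names whose signature matches the query graph."""
    return index.get(graph_key(query_adjacency), [])

# Tiny example: two 3-node model graphs and a query extracted from an image.
g1 = np.array([[0, 1.0, 0], [1.0, 0, 2.0], [0, 2.0, 0]])
g2 = np.array([[0, 1.0, 1.0], [1.0, 0, 1.0], [1.0, 1.0, 0]])
index = build_index({"wedge": g1, "triangle": g2})
print(lookup(index, g1))   # ['wedge']
```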

GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

  • paper_url: http://arxiv.org/abs/2311.12631
  • repo_url: None
  • paper_authors: Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen
  • for: 提高文本到视频生成质量,增强视频中的物理动作准确性和一致性。
  • methods: 利用大语言模型GPT-4根据文本提示生成Blender脚本,驱动Blender内置的物理引擎生成蕴含连贯物理运动的基本场景组件,然后将其输入Stable Diffusion模型进行视频生成。
  • results: 在三种基本物理动作场景中(即固体物体落下与碰撞、布料折叠与摆动、液体流动),GPT4Motion可以高效生成高质量视频,保持动作准确性和实体一致性。
    Abstract Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. Then these components are inputted into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate high-quality videos efficiently in maintaining motion coherency and entity consistency. GPT4Motion offers new insights in text-to-video research, enhancing its quality and broadening its horizon for future explorations.
    摘要 最近的文本到视频生成技术充分利用扩散模型,根据文本提示生成具有视觉吸引力的内容。然而,这类方法通常计算成本高,且难以生成具有连贯物理运动的视频。为解决这些问题,我们提出了GPT4Motion,一个无需训练的框架,它结合了GPT等大型语言模型的规划能力、Blender的物理模拟能力以及文本到图像扩散模型出色的图像生成能力,以提升视频合成的质量。具体来说,GPT4Motion使用GPT-4根据用户文本提示生成Blender脚本,该脚本驱动Blender内置的物理引擎生成在帧间保持连贯物理运动的基本场景组件;随后,这些组件被输入Stable Diffusion,生成与文本提示一致的视频。我们在三类基本物理运动场景(刚体下落与碰撞、布料披覆与摆动、液体流动)上的实验表明,GPT4Motion能够高效地生成高质量视频,并保持运动连贯性与实体一致性。GPT4Motion为文本到视频研究提供了新的视角,提升了其质量并拓展了未来探索的空间。
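
The first stage of such a pipeline, prompting an LLM to emit a Blender physics script, can be sketched as below. The prompt wording and model name are illustrative assumptions, and the call shown is the generic OpenAI chat-completions client rather than anything specific to GPT4Motion.

```python
# Schematic of the "LLM -> Blender script" step of a GPT4Motion-style pipeline.
# Prompt template and model name are illustrative assumptions.
from openai import OpenAI

SYSTEM = (
    "You write Blender Python scripts. Given a scene description, build the objects, "
    "enable rigid-body / cloth / fluid physics as appropriate, and render the "
    "animation frames to /tmp/frames/."
)

def blender_script_for(prompt: str, model: str = "gpt-4") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Scene: {prompt}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    script = blender_script_for("A basketball drops onto a wooden floor and bounces twice.")
    with open("scene.py", "w") as f:
        f.write(script)
    # The generated script would then be executed headlessly, e.g.:
    #   blender --background --python scene.py
```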

Bridging Generalization Gaps in High Content Imaging Through Online Self-Supervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.12623
  • repo_url: None
  • paper_authors: Johan Fredin Haslum, Christos Matsoukas, Karl-Johan Leuchowius, Kevin Smith
  • for: 这个论文是为了提高高内容成像(HCI)在现代药物发现和开发流水线中的应用,从发现到药物候选者特征化。
  • methods: 该论文提出了一种在线自监督领域适应方法(CODA),将分类器拆分为通用特征提取器和任务特定模型;在新领域中利用跨批次自监督来适应特征提取器的参数,同时保持任务特定模型不变。
  • results: 实验结果表明,这种策略可以显著缩小泛化差距,在来自不同实验室、使用不同显微镜采集的数据上实现最高300%的提升。CODA可应用于规模不一的全新无标注域外数据源,从单个板到多个实验批次均可。
    Abstract High Content Imaging (HCI) plays a vital role in modern drug discovery and development pipelines, facilitating various stages from hit identification to candidate drug characterization. Applying machine learning models to these datasets can prove challenging as they typically consist of multiple batches, affected by experimental variation, especially if different imaging equipment have been used. Moreover, as new data arrive, it is preferable that they are analyzed in an online fashion. To overcome this, we propose CODA, an online self-supervised domain adaptation approach. CODA divides the classifier's role into a generic feature extractor and a task-specific model. We adapt the feature extractor's weights to the new domain using cross-batch self-supervision while keeping the task-specific model unchanged. Our results demonstrate that this strategy significantly reduces the generalization gap, achieving up to a 300% improvement when applied to data from different labs utilizing different microscopes. CODA can be applied to new, unlabeled out-of-domain data sources of different sizes, from a single plate to multiple experimental batches.
    摘要 高内容成像(HCI)在现代药物发现与开发流程中发挥着重要作用,但其数据通常由多个批次组成并受实验差异(尤其是成像设备差异)影响,且新数据需要以在线方式分析。为此,我们提出在线自监督领域适应方法CODA。CODA将分类器的角色拆分为一个通用特征提取器和一个任务特定模型:我们利用跨批次自监督将特征提取器的权重适应到新领域,同时保持任务特定模型不变。结果表明,这种策略可以大幅缩小泛化差距,在来自不同实验室、使用不同显微镜的数据上实现最高300%的提升。CODA可以应用于规模不一的全新无标注域外数据源,从单个板到多个实验批次均可。
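
Structurally, a CODA-style update keeps the task-specific head frozen and adapts only the feature extractor on unlabeled target batches. The PyTorch sketch below shows that online loop; the simple two-view consistency loss is an assumed stand-in for the paper's cross-batch self-supervision.

```python
# Structure of a CODA-style online adaptation loop (PyTorch sketch).
# The task head stays frozen; only the feature extractor is updated on unlabeled
# target batches. The two-view consistency loss is an assumed stand-in for the
# paper's cross-batch self-supervised objective.
import torch
import torch.nn.functional as F

def adapt_online(extractor, task_head, target_loader, augment, lr=1e-4, device="cuda"):
    extractor.train().to(device)
    task_head.eval().to(device)
    for p in task_head.parameters():          # task-specific model stays unchanged
        p.requires_grad_(False)

    opt = torch.optim.Adam(extractor.parameters(), lr=lr)
    for images in target_loader:              # unlabeled batches arriving online
        images = images.to(device)
        z1 = extractor(augment(images))
        z2 = extractor(augment(images))
        loss = 1.0 - F.cosine_similarity(z1, z2.detach(), dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return extractor

@torch.no_grad()
def predict(extractor, task_head, images):
    return task_head(extractor(images))
```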

Crowd management, crime detection, work monitoring using aiml

  • paper_url: http://arxiv.org/abs/2311.12621
  • repo_url: None
  • paper_authors: P. R. Adithya, Dheepak. S, B. Akash, Harshini. V, Sai Lakshana
  • for: 这项研究旨在利用现有的闭路电视(CCTV)网络,通过集成人工智能(AI)与机器学习(ML)技术,实现人群管理、犯罪预防和工作场所监测的综合方案。
  • methods: 该项目使用AI/ML技术对视频流进行实时分析,能够实时识别和评估人群动态,尽早发现潜在的犯罪活动,并持续监测工作环境。
  • results: 该项目通过与现有基础设施的集成,加强了公共安全措施,并提高了组织的生产效率。
    Abstract This research endeavors to harness the potential of existing Closed-Circuit Television (CCTV) networks for a comprehensive approach to crowd management, crime prevention, and workplace monitoring through the integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. The primary objective is to develop and implement advanced algorithms capable of real-time analysis of video feeds, enabling the identification and assessment of crowd dynamics, early detection of potential criminal activities, and continuous monitoring of workplace environments. By leveraging AI/ML, the project aims to optimize surveillance capabilities, thereby enhancing public safety measures and improving organizational productivity. This initiative underscores the transformative impact that intelligent video analytics can have on existing infrastructure, mitigating the need for extensive system overhauls while significantly advancing security and operational efficiency.
    摘要 这项研究寻求利用现有的闭路电视(CCTV)网络,通过人工智能(AI)和机器学习(ML)技术的集成,实现全面的人群管理、犯罪预防和工作场所监测。项目的主要目标是开发并部署能够实时分析视频流的先进算法,以识别和评估人群动态、尽早发现潜在的犯罪活动并持续监测工作环境。通过利用AI/ML,项目希望优化监控能力,从而加强公共安全措施并提升组织效率。这一举措强调了智能视频分析对现有基础设施的变革性影响,在无需大规模系统改造的情况下显著提升安全与运营效率。

Leveraging Unlabeled Data for 3D Medical Image Segmentation through Self-Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.12617
  • repo_url: https://github.com/xmindflow/SSL-contrastive
  • paper_authors: Sanaz Karimijafarbigloo, Reza Azad, Yury Velichko, Ulas Bagci, Dorit Merhof
  • for: 提高3D半监督分割方法的精度和可靠性,解决现有方法上下文信息有限、伪标签不可靠等局限。
  • methods: 引入两个独立的子网络,用于探索和利用二者之间的预测不一致,并通过有针对性的验证训练过程来纠正错误的预测结果。此外,采用自监督对比学习范式,自适应地微调网络的表征能力并降低预测不确定性。
  • results: 在临床MRI和CT扫描数据上的器官分割实验结果表明,我们的方法优于当前最新方法。代码已在GitHub开源。
    Abstract Current 3D semi-supervised segmentation methods face significant challenges such as limited consideration of contextual information and the inability to generate reliable pseudo-labels for effective unsupervised data use. To address these challenges, we introduce two distinct subnetworks designed to explore and exploit the discrepancies between them, ultimately correcting the erroneous prediction results. More specifically, we identify regions of inconsistent predictions and initiate a targeted verification training process. This procedure strategically fine-tunes and harmonizes the predictions of the subnetworks, leading to enhanced utilization of contextual information. Furthermore, to adaptively fine-tune the network's representational capacity and reduce prediction uncertainty, we employ a self-supervised contrastive learning paradigm. For this, we use the network's confidence to distinguish between reliable and unreliable predictions. The model is then trained to effectively minimize unreliable predictions. Our experimental results for organ segmentation, obtained from clinical MRI and CT scans, demonstrate the effectiveness of our approach when compared to state-of-the-art methods. The codebase is accessible on \href{https://github.com/xmindflow/SSL-contrastive}{GitHub}.
    摘要 当前的3D半监督分割方法面临重大挑战,例如对上下文信息考虑有限,以及无法为无监督数据的有效利用生成可靠的伪标签。为解决这些挑战,我们引入了两个不同的子网络,用于探索和利用二者之间的差异,最终纠正错误的预测结果。更具体地,我们识别预测不一致的区域,并启动有针对性的验证训练过程。这个过程可以有策略地微调并协调两个子网络的预测,从而更好地利用上下文信息。此外,为了自适应地调整网络的表征能力并降低预测不确定性,我们采用了自监督对比学习范式:利用网络的置信度来区分可靠和不可靠的预测,然后训练模型以有效减少不可靠的预测。我们在临床MRI和CT扫描数据上的器官分割实验结果表明,与当前最新方法相比,我们的方法更为有效。代码库可以在 \href{https://github.com/xmindflow/SSL-contrastive}{GitHub} 上获取。

Adaptive Dense Pseudo Label Selection for Semi-supervised Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2311.12608
  • repo_url: None
  • paper_authors: Tong Zhao, Qiang Fang, Shuohao Shi, Xin Xu
  • for: 这个研究是为了提高半指导的物体检测(SSOD)中的 Orientated Object Detection(OOD)性能。
  • methods: 我们提出了一个名为 Adaptive Dense Pseudo Label Selection(ADPLS)的方法,它使用了一个简单 yet effective的适应机制来指导伪标签的选择。特别是,我们提出了一个名为 Mean Feature-Richness Score(mFRS)的估计方法,用于估计潜在物体的密度,并使用这个分数来调整伪标签的数量。
  • results: 在DOTA-v1.5 bencmark上,我们的方法比前一代方法更高,尤其是当标签资料仅有5%时。例如,我们的方法在5%标签资料下 achieve 49.78 mAP,比前一代方法在10%标签资料下的48.63 mAP高出1.15 mAP。
    Abstract Recently, dense pseudo-label, which directly selects pseudo labels from the original output of the teacher model without any complicated post-processing steps, has received considerable attention in semi-supervised object detection (SSOD). However, for the multi-oriented and dense objects that are common in aerial scenes, existing dense pseudo-label selection methods are inefficient and impede the performance in semi-supervised oriented object detection. Therefore, we propose Adaptive Dense Pseudo Label Selection (ADPLS) for semi-supervised oriented object detection. In ADPLS, we design a simple but effective adaptive mechanism to guide the selection of dense pseudo labels. Specifically, we propose the mean Feature-Richness Score (mFRS) to estimate the density of potential objects and use this score to adjust the number of dense pseudo labels. On the DOTA-v1.5 benchmark, the proposed method outperforms previous methods especially when labeled data are scarce. For example, it achieves 49.78 mAP given only 5% of annotated data, which surpasses previous state-of-the-art method given 10% of annotated data by 1.15 mAP. Our codes will be available soon.
    摘要 最近,密集 pseudo-标签选择( dense pseudo-label selection)在 semi-supervised object detection(SSOD)中受到了广泛关注。然而,对于飞行场景中的多 oriented 和密集的对象,现有的密集 pseudo-标签选择方法效率低下,影响了 semi-supervised oriented object detection 的性能。因此,我们提出了适应性 dense pseudo-标签选择(ADPLS)方法。在 ADPLS 中,我们设计了一种简单 yet effective 的适应机制,以导引密集 pseudo-标签的选择。具体来说,我们提出了均值 Feature-Richness Score(mFRS)来估算潜在对象的密集程度,并使用这个分数来调整密集 pseudo-标签的数量。在 DOTA-v1.5 测试集上,我们的方法比前一代方法更高效,特别是当只有5%的注解数据时,我们的方法可以达到 49.78 mAP,而前一代方法只有在注解数据占10%的情况下达到这个水平。我们的代码将很快地上传。
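
The adaptive mechanism boils down to estimating how object-dense an image is and scaling the pseudo-label budget accordingly. A small numeric sketch of that idea follows; the exact definition of the feature-richness score and the scaling rule here are assumptions, since the paper defines its own mFRS.

```python
# Sketch of adaptive dense pseudo-label selection driven by a feature-richness
# score. How "richness" is computed and mapped to a label budget is assumed here
# for illustration; the paper defines its own mFRS.
import numpy as np

def mean_feature_richness(feature_map: np.ndarray) -> float:
    """feature_map: (C, H, W) teacher activations. Richness ~ mean per-location feature energy."""
    per_location = np.linalg.norm(feature_map, axis=0)        # (H, W)
    return float(per_location.mean())

def select_pseudo_labels(scores: np.ndarray, feature_map: np.ndarray,
                         base_k: int = 100, richness_scale: float = 50.0):
    """scores: (N,) confidences of candidate dense pseudo-labels for one image.
    Returns indices of the kept pseudo-labels, with k adapted to the estimated density."""
    mfrs = mean_feature_richness(feature_map)
    k = int(base_k + richness_scale * mfrs)                   # denser scene -> larger budget
    k = min(k, len(scores))
    return np.argsort(scores)[::-1][:k]

# Example with random stand-in data.
rng = np.random.default_rng(0)
kept = select_pseudo_labels(rng.random(500), rng.random((256, 64, 64)))
print(len(kept))
```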

Thinking Outside the Box: Orthogonal Approach to Equalizing Protected Attributes

  • paper_url: http://arxiv.org/abs/2311.14733
  • repo_url: None
  • paper_authors: Jiahui Liu, Xiaohao Cai, Mahesan Niranjan
  • for: 本研究旨在探讨黑盒AI在临床决策中可能加剧健康相关不平等与偏见(如性别和种族)的问题,以及如何通过机器学习方法来分析并抑制这些影响。
  • methods: 本研究使用基于机器学习的正交方法,通过判别降维并将保护属性相对主要属性信息进行正交化,来分析并消除混杂因素对疾病诊断的影响。
  • results: 研究表明,该正交方法可以揭示保护属性对疾病诊断的影响,缓解不良的特征相关性,并提升模型预测性能。
    Abstract There is growing concern that the potential of black box AI may exacerbate health-related disparities and biases such as gender and ethnicity in clinical decision-making. Biased decisions can arise from data availability and collection processes, as well as from the underlying confounding effects of the protected attributes themselves. This work proposes a machine learning-based orthogonal approach aiming to analyze and suppress the effect of the confounder through discriminant dimensionality reduction and orthogonalization of the protected attributes against the primary attribute information. By doing so, the impact of the protected attributes on disease diagnosis can be realized, undesirable feature correlations can be mitigated, and the model prediction performance can be enhanced.
    摘要 人们日益担忧,黑盒AI可能会加剧临床决策中与健康相关的差异和偏见,例如性别和种族偏见。这类有偏决策既可能源于数据的可得性和采集过程,也可能源于保护属性本身潜在的混杂效应。本工作提出一种基于机器学习的正交方法,通过判别降维并将保护属性相对主要属性信息正交化,来分析并抑制混杂因素的影响。由此可以揭示保护属性对疾病诊断的影响、缓解不良的特征相关性,并提升模型的预测性能。
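
The core operation, removing the component of the representation that is linearly predictable from the protected attributes, can be illustrated with plain least-squares residualization; this is a simplified stand-in for the discriminant dimensionality reduction and orthogonalization described in the paper.

```python
# Minimal sketch of orthogonalizing features against protected attributes:
# regress each feature on the protected attributes and keep the residuals.
# Plain linear residualization, a simplified stand-in for the paper's pipeline.
import numpy as np

def orthogonalize(features: np.ndarray, protected: np.ndarray) -> np.ndarray:
    """features: (n_samples, n_features); protected: (n_samples, n_protected).
    Returns features with the linear span of the protected attributes projected out."""
    Z = np.column_stack([np.ones(len(protected)), protected])   # include intercept
    beta, *_ = np.linalg.lstsq(Z, features, rcond=None)          # least-squares fit
    return features - Z @ beta                                    # residuals orthogonal to protected

# Example: 200 patients, 16 imaging features, protected attributes = [age_group, sex].
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
A = rng.integers(0, 2, size=(200, 2)).astype(float)
X_orth = orthogonalize(X, A)
print(np.abs((A - A.mean(0)).T @ X_orth).max())   # close to zero: no residual linear correlation
```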

Surgical Temporal Action-aware Network with Sequence Regularization for Phase Recognition

  • paper_url: http://arxiv.org/abs/2311.12603
  • repo_url: None
  • paper_authors: Zhen Chen, Yuhao Zhai, Jun Zhang, Jinqiao Wang
  • for: 帮助外科医生在手术室中更准确地识别手术阶段,以开发计算机辅助外科系统。
  • methods: 我们提出了STAR-Net,其包含高效的多尺度手术时间动作(MS-STA)模块和双分类器序列正则化(DSR):前者以较低代价整合手术动作的时空信息,从而更好地利用手术视频中的视觉特征;后者通过序列指导来优化模型的训练。
  • results: 我们的STAR-Net在大规模胃切除手术数据集和公开的Cholec80基准上实现了最先进的手术阶段识别性能。
    Abstract To assist surgeons in the operating theatre, surgical phase recognition is critical for developing computer-assisted surgical systems, which requires comprehensive understanding of surgical videos. Although existing studies made great progress, there are still two significant limitations worthy of improvement. First, due to the compromise of resource consumption, frame-wise visual features are extracted by 2D networks and disregard spatial and temporal knowledge of surgical actions, which hinders subsequent inter-frame modeling for phase prediction. Second, these works simply utilize ordinary classification loss with one-hot phase labels to optimize the phase predictions, and cannot fully explore surgical videos under inadequate supervision. To overcome these two limitations, we propose a Surgical Temporal Action-aware Network with sequence Regularization, named STAR-Net, to recognize surgical phases more accurately from input videos. Specifically, we propose an efficient multi-scale surgical temporal action (MS-STA) module, which integrates visual features with spatial and temporal knowledge of surgical actions at the cost of 2D networks. Moreover, we devise the dual-classifier sequence regularization (DSR) to facilitate the training of STAR-Net by the sequence guidance of an auxiliary classifier with a smaller capacity. Our STAR-Net with MS-STA and DSR can exploit visual features of surgical actions with effective regularization, thereby leading to the superior performance of surgical phase recognition. Extensive experiments on a large-scale gastrectomy surgery dataset and the public Cholec80 benchmark prove that our STAR-Net significantly outperforms state-of-the-arts of surgical phase recognition.
    摘要 为了在手术室中辅助外科医生,手术阶段识别对开发计算机辅助手术系统至关重要,而这需要对手术视频的全面理解。尽管现有研究已取得巨大进展,但仍存在两个值得改进的重要局限。首先,出于资源消耗的考虑,现有方法用2D网络提取逐帧视觉特征,忽略了手术动作的空间与时间知识,这会阻碍后续用于阶段预测的帧间建模。其次,这些工作仅使用带one-hot阶段标签的普通分类损失来优化阶段预测,无法在监督不足的情况下充分挖掘手术视频。为了解决这两个局限,我们提出了一种带序列正则化的手术时间动作感知网络STAR-Net,以便更准确地从输入视频中识别手术阶段。具体来说,我们提出了一个高效的多尺度手术时间动作(MS-STA)模块,仅以2D网络的代价即可将视觉特征与手术动作的空间和时间知识相融合。此外,我们设计了双分类器序列正则化(DSR),借助一个容量更小的辅助分类器的序列指导来促进STAR-Net的训练。我们的STAR-Net能够在有效正则化的同时充分利用手术动作的视觉特征,从而获得更优的手术阶段识别性能。在一个大规模胃切除手术数据集和公开的Cholec80基准上的大量实验表明,我们的STAR-Net显著优于现有最新的手术阶段识别方法。

TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction using Vision-Based Tactile Sensing

  • paper_url: http://arxiv.org/abs/2311.12602
  • repo_url: None
  • paper_authors: Mauro Comi, Yijiong Lin, Alex Church, Alessio Tonioni, Laurence Aitchison, Nathan F. Lepora
  • for: 这篇论文旨在探讨数据驱动方法如何利用高分辨率的基于视觉的触觉传感器来重建物体的3D形状。
  • methods: 论文提出了一种基于深度学习的触觉3D形状重建方法TouchSDF,它结合基于视觉的触觉传感器提供的丰富信息与隐式神经表示DeepSDF的表达能力来重建物体的3D形状。
  • results: 该方法在仿真和真实场景中均能从触觉输入重建平滑、连续的3D形状,为鲁棒的3D感知表示和机器人多模态感知开辟了研究方向。
    Abstract Humans rely on their visual and tactile senses to develop a comprehensive 3D understanding of their physical environment. Recently, there has been a growing interest in exploring and manipulating objects using data-driven approaches that utilise high-resolution vision-based tactile sensors. However, 3D shape reconstruction using tactile sensing has lagged behind visual shape reconstruction because of limitations in existing techniques, including the inability to generalise over unseen shapes, the absence of real-world testing, and limited expressive capacity imposed by discrete representations. To address these challenges, we propose TouchSDF, a Deep Learning approach for tactile 3D shape reconstruction that leverages the rich information provided by a vision-based tactile sensor and the expressivity of the implicit neural representation DeepSDF. Our technique consists of two components: (1) a Convolutional Neural Network that maps tactile images into local meshes representing the surface at the touch location, and (2) an implicit neural function that predicts a signed distance function to extract the desired 3D shape. This combination allows TouchSDF to reconstruct smooth and continuous 3D shapes from tactile inputs in simulation and real-world settings, opening up research avenues for robust 3D-aware representations and improved multimodal perception in robotics. Code and supplementary material are available at: https://touchsdf.github.io/
    摘要 人类依靠视觉和触觉来建立对物理环境的全面3D理解。近来,利用高分辨率的基于视觉的触觉传感器、以数据驱动方式探索和操作物体的研究日益增多。然而,基于触觉感知的3D形状重建仍落后于基于视觉的形状重建,原因在于现有技术存在局限,包括无法泛化到未见过的形状、缺乏真实环境测试,以及离散表示带来的表达能力限制。为解决这些挑战,我们提出了TouchSDF,一种基于深度学习的触觉3D形状重建方法,它结合了基于视觉的触觉传感器提供的丰富信息与隐式神经表示DeepSDF的表达能力。该技术包含两个部分:(1)一个卷积神经网络,将触觉图像映射为触摸位置处表示局部表面的网格;(2)一个隐式神经函数,预测有符号距离函数,从而提取所需的3D形状。这一组合使TouchSDF能够在仿真和真实环境中从触觉输入重建平滑、连续的3D形状,为机器人领域中鲁棒的3D感知表示与改进的多模态感知开辟了研究方向。代码和补充材料可以在以下链接找到:https://touchsdf.github.io/
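
The second component is a DeepSDF-style implicit function mapping a latent shape code and a 3-D query point to a signed distance. A minimal PyTorch sketch of such a decoder is shown below; the layer widths and activation choices are assumptions.

```python
# Minimal DeepSDF-style decoder, as used conceptually in TouchSDF's second stage:
# (latent shape code, xyz query) -> signed distance. Layer widths are assumptions.
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    def __init__(self, latent_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),       # signed distance, scaled to [-1, 1]
        )

    def forward(self, latent: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        """latent: (B, latent_dim); xyz: (B, N, 3) query points -> (B, N) signed distances."""
        B, N, _ = xyz.shape
        code = latent.unsqueeze(1).expand(B, N, latent.size(-1))
        return self.net(torch.cat([code, xyz], dim=-1)).squeeze(-1)

# The zero level set {x : f(z, x) = 0} is then extracted (e.g. with marching cubes)
# to obtain the reconstructed surface.
decoder = SDFDecoder()
sdf = decoder(torch.randn(2, 256), torch.rand(2, 1024, 3) * 2 - 1)
print(sdf.shape)   # torch.Size([2, 1024])
```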

Deep learning-based detection of morphological features associated with hypoxia in H&E breast cancer whole slide images

  • paper_url: http://arxiv.org/abs/2311.12601
  • repo_url: None
  • paper_authors: Petru Manescu, Joseph Geradts, Delmiro Fernandez-Reyes
  • for: This paper demonstrates the use of deep learning for evaluating hypoxia in breast cancer histomorphology.
  • methods: The paper uses weakly supervised deep learning (WSDL) models to detect hypoxia-associated features in routine Hematoxylin and Eosin (H&E) whole slide images (WSI).
  • results: The WSDL models accurately detect hypoxia-associated features in H&E WSI, reaching an average AUC of 0.87 on a left-out test set and revealing significant differences between features of hypoxic and normoxic tissue regions.
    Abstract Hypoxia occurs when tumour cells outgrow their blood supply, leading to regions of low oxygen levels within the tumour. Calculating hypoxia levels can be an important step in understanding the biology of tumours, their clinical progression and response to treatment. This study demonstrates a novel application of deep learning to evaluate hypoxia in the context of breast cancer histomorphology. More precisely, we show that Weakly Supervised Deep Learning (WSDL) models can accurately detect hypoxia associated features in routine Hematoxylin and Eosin (H&E) whole slide images (WSI). We trained and evaluated a deep Multiple Instance Learning model on tiles from WSI H&E tissue from breast cancer primary sites (n=240) obtaining on average an AUC of 0.87 on a left-out test set. We also showed significant differences between features of hypoxic and normoxic tissue regions as distinguished by the WSDL models. Such DL hypoxia H&E WSI detection models could potentially be extended to other tumour types and easily integrated into the pathology workflow without requiring additional costly assays.
    摘要 当肿瘤细胞的生长超出其血液供应时就会发生缺氧,导致肿瘤内部出现低氧区域。计算缺氧水平是理解肿瘤生物学特性、临床进展及治疗反应的重要一步。本研究展示了深度学习在乳腺癌组织形态学背景下评估缺氧的全新应用。具体而言,我们证明弱监督深度学习(WSDL)模型能够在常规苏木精-伊红(H&E)染色全切片图像(WSI)中准确检测与缺氧相关的特征。我们在来自乳腺癌原发部位(n=240)的H&E组织WSI切块上训练并评估了一个深度多实例学习模型,在留出测试集上获得平均0.87的AUC。我们还发现,WSDL模型所区分的缺氧与常氧组织区域的特征之间存在显著差异。这类基于H&E WSI的深度学习缺氧检测模型有望推广到其他肿瘤类型,并且无需额外昂贵的检测即可方便地集成到病理工作流程中。
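
Weakly supervised detection on whole slide images is commonly cast as multiple-instance learning over tiles, where a slide-level label supervises an attention-weighted aggregation of tile embeddings. The sketch below shows a generic attention-MIL head of that kind; it is an assumed standard formulation, not necessarily the exact model used in the paper.

```python
# Generic attention-based MIL head for slide-level (bag-level) prediction from tile
# embeddings, as commonly used for weakly supervised WSI analysis. An assumed
# standard formulation, not necessarily the paper's exact architecture.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim: int = 1024, attn_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(in_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, tiles: torch.Tensor):
        """tiles: (n_tiles, in_dim) embeddings of one slide -> (slide logits, attention weights)."""
        a = torch.softmax(self.attention(tiles), dim=0)   # (n_tiles, 1) tile importance
        slide_embedding = (a * tiles).sum(dim=0)          # attention-weighted pooling
        return self.classifier(slide_embedding), a.squeeze(-1)

model = AttentionMIL()
logits, attn = model(torch.randn(500, 1024))   # 500 tile embeddings from one H&E slide
print(logits.shape, attn.shape)                # torch.Size([2]) torch.Size([500])
```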

HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.12588
  • repo_url: None
  • paper_authors: Yongliang Lin, Yongzhi Su, Praveen Nathan, Sandeep Inuganti, Yan Di, Martin Sundermeyer, Fabian Manhardt, Didier Stricke, Jason Rambach, Yu Zhang
  • for: 这篇论文目的是提出一种新的高精度对象 pose 估计方法,使用单个 RGB-D 图像。
  • methods: 这篇论文提出了一种名为HiPose的方法,通过分层二进制表面编码以由粗到细的方式建立3D-3D对应。不同于以往的密集对应方法,HiPose采用点到表面匹配,并在多个层级上逐步收缩表面,直到其收敛为对应点。
  • results: 我们在公开基准LM-O、YCB-V和T-Less上进行了广泛的实验,结果表明该方法优于所有无精修(refinement-free)的方法,甚至可与代价更高的基于精修的方法相当。此外,该方法计算高效,能够满足高精度要求的实时应用。
    Abstract In this work, we present a novel dense-correspondence method for 6DoF object pose estimation from a single RGB-D image. While many existing data-driven methods achieve impressive performance, they tend to be time-consuming due to their reliance on rendering-based refinement approaches. To circumvent this limitation, we present HiPose, which establishes 3D-3D correspondences in a coarse-to-fine manner with a hierarchical binary surface encoding. Unlike previous dense-correspondence methods, we estimate the correspondence surface by employing point-to-surface matching and iteratively constricting the surface until it becomes a correspondence point while gradually removing outliers. Extensive experiments on public benchmarks LM-O, YCB-V, and T-Less demonstrate that our method surpasses all refinement-free methods and is even on par with expensive refinement-based approaches. Crucially, our approach is computationally efficient and enables real-time critical applications with high accuracy requirements. Code and models will be released.
    摘要 在这项工作中,我们提出了一种新的密集对应方法,用于从单张RGB-D图像估计物体的6DoF位姿。虽然现有的许多数据驱动方法性能出色,但由于依赖基于渲染的精修步骤,它们通常比较耗时。为绕开这一限制,我们提出HiPose,它通过分层二进制表面编码,以由粗到细的方式建立3D-3D对应。与以往的密集对应方法不同,我们采用点到表面匹配来估计对应表面,并逐步收缩该表面直至其成为对应点,同时逐渐剔除离群点。在公开基准LM-O、YCB-V和T-Less上的大量实验表明,我们的方法超越了所有无精修的方法,甚至可与代价高昂的基于精修的方法相媲美。更重要的是,我们的方法计算高效,能够满足高精度要求的实时关键应用。代码与模型将会发布。

A Region of Interest Focused Triple UNet Architecture for Skin Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2311.12581
  • repo_url: None
  • paper_authors: Guoqing Liu, Yu Guo, Caiying Wu, Guoqing Chen, Barintag Saheya, Qiyu Jin
  • for: automatic skin lesion segmentation
  • methods: Triple-UNet architecture with region of interest enhancement module (ROIE)
  • results: outperforms state-of-the-art on skin lesion segmentation
    Abstract Skin lesion segmentation is of great significance for skin lesion analysis and subsequent treatment. It is still a challenging task due to the irregular and fuzzy lesion borders, and diversity of skin lesions. In this paper, we propose Triple-UNet to automatically segment skin lesions. It is an organic combination of three UNet architectures with suitable modules. In order to concatenate the first and second sub-networks more effectively, we design a region of interest enhancement module (ROIE). The ROIE enhances the target object region of the image by using the predicted score map of the first UNet. The features learned by the first UNet and the enhanced image help the second UNet obtain a better score map. Finally, the results are fine-tuned by the third UNet. We evaluate our algorithm on a publicly available dataset of skin lesion segmentation. Experiments show that Triple-UNet outperforms the state-of-the-art on skin lesion segmentation.
    摘要 皮肤病变分割对皮肤病变分析及后续治疗具有重要意义。由于病变边界不规则且模糊,皮肤病变种类多样,这仍是一项具有挑战性的任务。在本文中,我们提出Triple-UNet来自动分割皮肤病变。它是三个UNet架构与适当模块的有机组合。为了更有效地衔接第一和第二个子网络,我们设计了感兴趣区域增强模块(ROIE)。ROIE利用第一个UNet预测的得分图来增强图像中的目标区域。第一个UNet学习到的特征与增强后的图像帮助第二个UNet获得更好的得分图。最后,结果由第三个UNet进行精调。我们在公开的皮肤病变分割数据集上评估了该算法,实验表明Triple-UNet在皮肤病变分割上优于现有最新方法。
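
The ROIE idea, boosting the lesion region of the input using the score map predicted by the first UNet before feeding the second UNet, can be expressed in a couple of lines; the exact weighting used in the paper is not given here, so the form below is an illustrative assumption.

```python
# Sketch of the region-of-interest enhancement (ROIE) idea: amplify the input image
# where the first UNet's score map is confident. The weighting form is an assumption.
import torch

def roi_enhance(image: torch.Tensor, score_map: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    """image: (B, C, H, W) normalized to [0, 1]; score_map: (B, 1, H, W) logits from the first UNet."""
    prob = torch.sigmoid(score_map)                 # predicted lesion probability
    enhanced = image * (1.0 + strength * prob)      # emphasize the predicted ROI
    return enhanced.clamp(0.0, 1.0)                 # keep intensities in the valid range

enhanced = roi_enhance(torch.rand(4, 3, 256, 256), torch.randn(4, 1, 256, 256))
print(enhanced.shape)
```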

Multi-Resolution Planar Region Extraction for Uneven Terrains

  • paper_url: http://arxiv.org/abs/2311.12562
  • repo_url: None
  • paper_authors: Yinghan Sun, Linfang Zheng, Hua Chen, Wei Zhang
  • for: 本研究旨在从无序点云测量数据中提取不平坦地形上的平面区域,以服务于机器人感知运动等机器人应用。
  • methods: 我们提出了一种多分辨率平面区域提取策略,在边界精度与计算效率之间取得平衡。该方法包括一个点级分类预处理模块和一个基于八叉树的多分辨率分割模块。
  • results: 我们通过实验证明了我们的方法可以在不同的不平坦地形上实现高效和稳定的平面区域提取,并且可以在真实世界中实现实时性,帧率超过 35 FPS。
    Abstract This paper studies the problem of extracting planar regions in uneven terrains from unordered point cloud measurements. Such a problem is critical in various robotic applications such as robotic perceptive locomotion. While existing approaches have shown promising results in effectively extracting planar regions from the environment, they often suffer from issues such as low computational efficiency or loss of resolution. To address these issues, we propose a multi-resolution planar region extraction strategy in this paper that balances the accuracy in boundaries and computational efficiency. Our method begins with a pointwise classification preprocessing module, which categorizes all sampled points according to their local geometric properties to facilitate multi-resolution segmentation. Subsequently, we arrange the categorized points using an octree, followed by an in-depth analysis of nodes to finish multi-resolution plane segmentation. The efficiency and robustness of the proposed approach are verified via synthetic and real-world experiments, demonstrating our method's ability to generalize effectively across various uneven terrains while maintaining real-time performance, achieving frame rates exceeding 35 FPS.
    摘要 Our method begins with pointwise classification, categorizing all sampled points based on their local geometric properties. This facilitates multi-resolution segmentation. We then arrange the categorized points using an octree, followed by an in-depth analysis of nodes to finish multi-resolution plane segmentation.We verify the efficiency and robustness of our approach through synthetic and real-world experiments. Our method demonstrates the ability to generalize effectively across various uneven terrains while maintaining real-time performance, achieving frame rates exceeding 35 FPS.

Convolutional Neural Networks for Neuroimaging in Parkinson’s Disease: Is Preprocessing Needed?

  • paper_url: http://arxiv.org/abs/2311.12561
  • repo_url: None
  • paper_authors: Francisco J. Martinez-Murcia, Juan M. Górriz, Javier Ramírez, Andrés Ortiz
  • for: The paper aims to investigate the effectiveness of Convolutional Neural Networks (CNNs) in accounting for spatial and intensity differences in nuclear brain imaging, and to determine whether spatial and intensity normalization is still necessary for accurate diagnosis.
  • methods: The authors trained four different CNN models based on well-established architectures, using or not different spatial and intensity normalization preprocessing.
  • results: The results show that a sufficiently complex model such as the three-dimensional version of the ALEXNET can effectively account for spatial differences, achieving a diagnosis accuracy of 94.1% with an area under the ROC curve of 0.984. However, the intensity normalization and its type are revealed as very influential in the results and accuracy of the trained model.
    摘要 如今,空间和强度归一化已成为神经影像分析的前提条件。受体素级及其他单变量比较的影响(这类校正在其中至关重要),它们被普遍应用于各类分析和成像模态。核医学成像模态(如PET-FDG或常用于帕金森病诊断的FP-CIT SPECT)尤其依赖强度归一化。然而,这些步骤计算代价高昂,而且可能在图像中引入形变,改变其所含的信息。另一方面,卷积神经网络(CNN)为模式识别引入了位置不变性,已被证明能够不受朝向、大小、角度等因素影响地识别目标。由此产生一个问题:在分析核医学脑成像时,CNN能在多大程度上补偿空间和强度差异?空间和强度归一化是否仍然必要?为回答这一问题,我们基于成熟架构训练了四个不同的CNN模型,分别使用或不使用不同的空间和强度归一化预处理。结果表明,一个足够复杂的模型(如我们的三维版ALEXNET)可以有效地补偿空间差异,达到94.1%的诊断精度,ROC曲线下面积为0.984。通过显著性图(saliency maps)对差异进行可视化表明,这些模型无需任何复杂的空间归一化过程,就能正确地找到与文献相符的模式。然而,强度归一化及其类型对训练模型的结果和精度影响很大,因此必须加以充分考虑。
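
Because the type of intensity normalization is reported as highly influential, it helps to spell out two common variants for nuclear imaging volumes, normalization to the image integral and z-scoring against a reference region; the reference-region mask in the sketch is a placeholder assumption.

```python
# Two common intensity-normalization schemes for nuclear imaging volumes, of the kind
# compared in studies like this one. The reference-region mask is a placeholder; in
# practice it would be e.g. a region of non-specific uptake.
import numpy as np

def integral_normalize(volume: np.ndarray) -> np.ndarray:
    """Scale the volume so its total uptake sums to 1."""
    return volume / (volume.sum() + 1e-8)

def reference_region_normalize(volume: np.ndarray, reference_mask: np.ndarray) -> np.ndarray:
    """Z-score the volume against the mean/std of a reference region."""
    ref = volume[reference_mask > 0]
    return (volume - ref.mean()) / (ref.std() + 1e-8)

vol = np.random.rand(79, 95, 69)                 # SPECT-sized volume (placeholder data)
mask = np.zeros_like(vol)
mask[10:30, 10:30, 10:30] = 1                    # placeholder reference region
print(integral_normalize(vol).sum(), reference_region_normalize(vol, mask).mean())
```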

Benchmarking bias: Expanding clinical AI model card to incorporate bias reporting of social and non-social factors

  • paper_url: http://arxiv.org/abs/2311.12560
  • repo_url: None
  • paper_authors: Carolina A. M. Heming, Mohamed Abdalla, Monish Ahluwalia, Linglin Zhang, Hari Trivedi, MinJae Woo, Benjamin Fine, Judy Wawira Gichoya, Leo Anthony Celi, Laleh Seyyed-Kalantari
  • for: 这个论文主要是为了扩展临床AI模型报告卡,以包括广泛的偏见报告,包括社会和非社会因素。
  • methods: 该论文使用了各种方法,包括对AI模型偏见的分析和评估,以及对其他因素的考虑,如疾病依赖、解剖学因素和仪器因素,以确保AI模型的安全部署。
  • results: 该论文得到了许多结果,包括提出了一种扩展了临床AI模型报告卡的方法,以及通过对多个疾病和不同的AI模型进行研究,发现了一些可能导致AI模型偏见的因素。
    Abstract Clinical AI model reporting cards should be expanded to incorporate a broad bias reporting of both social and non-social factors. Non-social factors consider the role of other factors, such as disease dependent, anatomic, or instrument factors on AI model bias, which are essential to ensure safe deployment.
    摘要 临床AI模型报告卡应该扩展到包括广泛的偏见报告,包括社会因素以及非社会因素。非社会因素包括疾病依赖、解剖学因素和仪器因素等,这些因素对AI模型偏见的影响非常重要,以确保安全部署。

“HoVer-UNet”: Accelerating HoVerNet with UNet-based multi-class nuclei segmentation via knowledge distillation

  • paper_url: http://arxiv.org/abs/2311.12553
  • repo_url: https://github.com/diagnijmegen/hover-unet
  • paper_authors: Cristian Tommasino, Cristiano Russo, Antonio Maria Rinaldi, Francesco Ciompi
  • for: 本研究旨在蒸馏HoVerNet框架的知识,用于组织病理学中的细胞核实例分割与分类。
  • methods: 提出了一种以Mix Vision Transformer为骨干的紧凑UNet网络,并使用自定义损失函数来最优地编码从HoVerNet蒸馏得到的知识。
  • results: 在公开的PanNuke和Consep数据集上,我们的模型取得了与HoVerNet相当的结果,而推理时间减少为原来的三分之一。
    Abstract We present "HoVer-UNet", an approach to distill the knowledge of the multi-branch HoVerNet framework for nuclei instance segmentation and classification in histopathology. We propose a compact, streamlined single UNet network with a Mix Vision Transformer backbone, and equip it with a custom loss function to optimally encode the distilled knowledge of HoVerNet, reducing computational requirements without compromising performances. We show that our model achieved results comparable to HoVerNet on the public PanNuke and Consep datasets with a three-fold reduction in inference time. We make the code of our model publicly available at https://github.com/DIAGNijmegen/HoVer-UNet.
    摘要 我们提出了"HoVer-UNet",一种蒸馏多分支HoVerNet框架知识的方法,用于组织病理学中的细胞核实例分割与分类。我们提出了一个紧凑、精简的单一UNet网络,采用Mix Vision Transformer骨干,并为其配备自定义损失函数,以最优地编码从HoVerNet蒸馏得到的知识,在不损失性能的前提下降低计算需求。我们的模型在公开的PanNuke和Consep数据集上取得了与HoVerNet相当的结果,且推理时间减少为原来的三分之一。我们的模型代码已公开发布,可在 https://github.com/DIAGNijmegen/HoVer-UNet 获取。
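
Distilling a multi-branch teacher into a single UNet typically uses a loss that mixes supervision from ground-truth masks with agreement to the teacher's soft outputs. A generic pixel-wise distillation loss of that form is sketched below; the temperature and weighting are assumptions, and the paper's custom loss additionally handles HoVerNet's other output branches.

```python
# Generic pixel-wise knowledge-distillation loss for segmentation: cross-entropy to
# ground truth plus KL divergence to the teacher's softened predictions.
# Weights and temperature are assumptions; the paper's custom loss is richer.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T: float = 2.0, alpha: float = 0.5):
    """student_logits/teacher_logits: (B, C, H, W); target: (B, H, W) class indices."""
    ce = F.cross_entropy(student_logits, target)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

loss = distillation_loss(torch.randn(2, 6, 64, 64), torch.randn(2, 6, 64, 64),
                         torch.randint(0, 6, (2, 64, 64)))
print(loss.item())
```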

GMISeg: General Medical Image Segmentation without Re-Training

  • paper_url: http://arxiv.org/abs/2311.12539
  • repo_url: None
  • paper_authors: Jing Xu
  • for: 解决新的医学图像分割任务,无需进行额外的训练。
  • methods: 基于所提出的方法,对SAM(Segment Anything Model)图像编码器采用新颖的低秩微调策略,并与提示编码器和掩码解码器一起进行微调。
  • results: GMISeg 在未知任务上表现出优于最新的方法,并进行了全面的分析和总结。
    Abstract Although deep learning models have become the main method for medical image segmentation, they often cannot be extended to unknown segmentation tasks involving new anatomical structures, image shapes, or labels. For new segmentation tasks, researchers often have to retrain or fine-tune the model, which is time-consuming and poses a significant obstacle to clinical researchers, who often lack the resources and professional knowledge to train neural networks. Therefore, we proposed a general method that can solve unknown medical image segmentation tasks without requiring additional training. Given an example set of images and prompts for defining new segmentation tasks, GMISeg applies a novel low-rank fine-tuning strategy based on the proposed approach to the SAM (Segment Anything Model) image encoder, and works with the prompt encoder and mask decoder to fine-tune the labeled dataset without the need for additional training. To achieve generalization of new tasks, we used medical image datasets with different imaging modes for different parts. We trained and generalized GMISeg on a different set of anatomical and imaging modes using cardiac images on other site datasets. We have demonstrated that GMISeg outperforms the latest methods on unknown tasks and have conducted a comprehensive analysis and summary of the important performance of the proposed method.
    摘要 虽然深度学习模型已成为医学图像分割的主要方法,但它们往往无法扩展到涉及新解剖结构、新图像形态或新标签的未知分割任务。面对新的分割任务,研究人员通常必须重新训练或微调模型,这既耗时,又对临床研究人员构成重大障碍,因为他们往往缺乏训练神经网络所需的资源和专业知识。因此,我们提出了一种无需额外训练即可解决未知医学图像分割任务的通用方法。给定一组示例图像以及定义新分割任务的提示,GMISeg基于所提方法对SAM(Segment Anything Model)图像编码器应用一种新颖的低秩微调策略,并与提示编码器和掩码解码器一起在已标注的数据集上进行微调,而无需额外训练。为实现对新任务的泛化,我们使用了来自不同部位、不同成像模式的医学图像数据集。我们在不同的解剖结构和成像模式上训练并泛化GMISeg,并在其他站点数据集的心脏图像上进行了验证。我们已证明GMISeg在未知任务上优于最新方法,并对所提方法的关键性能进行了全面的分析和总结。
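
Low-rank fine-tuning of a frozen encoder is usually implemented by adding small rank-r updates to selected linear layers while keeping the pretrained weights fixed. The sketch below shows a generic LoRA-style wrapper; where exactly such adapters would be placed inside the SAM encoder is the paper's design choice and is not specified here.

```python
# Minimal LoRA-style wrapper: freeze a pretrained nn.Linear and learn a low-rank
# update W + (alpha/r) * B A. Generic low-rank fine-tuning sketch, not the paper's
# exact adapter placement inside the SAM image encoder.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # frozen pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

# Wrapping one projection layer of a (placeholder) encoder block:
proj = nn.Linear(768, 768)          # stands in for a SAM encoder projection layer
lora_proj = LoRALinear(proj, r=8)
out = lora_proj(torch.randn(4, 196, 768))
print(out.shape, sum(p.numel() for p in lora_proj.parameters() if p.requires_grad))
```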

Hyb-NeRF: A Multiresolution Hybrid Encoding for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.12490
  • repo_url: None
  • paper_authors: Yifan Wang, Yi Gong, Yuan Zeng
  • for: 高精度场景重建和新视图生成
  • methods: 采用多分辨率混合编码的神经辐射场:在粗分辨率层级使用内存高效的可学习位置特征,在细分辨率层级使用优化速度快、细节丰富的基于哈希的特征网格,并在可学习位置编码中嵌入基于锥体追踪的特征以减少混叠伪影。
  • results: 提高渲染速度和图像质量,同时减少内存占用量
    Abstract Recent advances in Neural radiance fields (NeRF) have enabled high-fidelity scene reconstruction for novel view synthesis. However, NeRF requires hundreds of network evaluations per pixel to approximate a volume rendering integral, making it slow to train. Caching NeRFs into explicit data structures can effectively enhance rendering speed but at the cost of higher memory usage. To address these issues, we present Hyb-NeRF, a novel neural radiance field with a multi-resolution hybrid encoding that achieves efficient neural modeling and fast rendering, which also allows for high-quality novel view synthesis. The key idea of Hyb-NeRF is to represent the scene using different encoding strategies from coarse-to-fine resolution levels. Hyb-NeRF exploits memory-efficiency learnable positional features at coarse resolutions and the fast optimization speed and local details of hash-based feature grids at fine resolutions. In addition, to further boost performance, we embed cone tracing-based features in our learnable positional encoding that eliminates encoding ambiguity and reduces aliasing artifacts. Extensive experiments on both synthetic and real-world datasets show that Hyb-NeRF achieves faster rendering speed with better rending quality and even a lower memory footprint in comparison to previous state-of-the-art methods.
    摘要 Hyb-NeRF represents the scene using different encoding strategies from coarse-to-fine resolution levels. It exploits memory-efficient learnable positional features at coarse resolutions and the fast optimization speed and local details of hash-based feature grids at fine resolutions. Additionally, we embed cone tracing-based features in our learnable positional encoding to eliminate encoding ambiguity and reduce aliasing artifacts.Extensive experiments on both synthetic and real-world datasets show that Hyb-NeRF achieves faster rendering speed with better rendering quality and even a lower memory footprint compared to previous state-of-the-art methods.

HCA-Net: Hierarchical Context Attention Network for Intervertebral Disc Semantic Labeling

  • paper_url: http://arxiv.org/abs/2311.12486
  • repo_url: https://github.com/xmindflow/hca-net
  • paper_authors: Afshin Bozorgpour, Bobby Azad, Reza Azad, Yury Velichko, Ulas Bagci, Dorit Merhof
  • for: 这项研究的目的是在医学影像中对椎间盘(IVD)进行精确的自动语义标注,以评估骨质疏松、椎体骨折或椎间盘突出等脊柱相关疾病。
  • methods: 研究提出了一种新的分层上下文注意力网络架构(HCA-Net),特别注重利用先验几何信息。该方法能够处理不同尺度的特征并加以有效整合,以捕捉脊柱内部细微的空间关系。
  • results: 这个研究的结果显示,使用HCA-Net架构可以在多中心脊椎数据集上取得更高的准确性,并且在MRI T1w和T2w模式上具有更好的表现。
    Abstract Accurate and automated segmentation of intervertebral discs (IVDs) in medical images is crucial for assessing spine-related disorders, such as osteoporosis, vertebral fractures, or IVD herniation. We present HCA-Net, a novel contextual attention network architecture for semantic labeling of IVDs, with a special focus on exploiting prior geometric information. Our approach excels at processing features across different scales and effectively consolidating them to capture the intricate spatial relationships within the spinal cord. To achieve this, HCA-Net models IVD labeling as a pose estimation problem, aiming to minimize the discrepancy between each predicted IVD location and its corresponding actual joint location. In addition, we introduce a skeletal loss term to reinforce the model's geometric dependence on the spine. This loss function is designed to constrain the model's predictions to a range that matches the general structure of the human vertebral skeleton. As a result, the network learns to reduce the occurrence of false predictions and adaptively improves the accuracy of IVD location estimation. Through extensive experimental evaluation on multi-center spine datasets, our approach consistently outperforms previous state-of-the-art methods on both MRI T1w and T2w modalities. The codebase is accessible to the public on \href{https://github.com/xmindflow/HCA-Net}{GitHub}.
    摘要 在医学影像中对椎间盘(IVD)进行准确的自动分割,对于评估骨质疏松、椎体骨折或椎间盘突出等脊柱相关疾病至关重要。我们提出HCA-Net,一种新的分层上下文注意力网络架构,用于IVD的语义标注,并特别注重利用先验几何信息。我们的方法能够处理不同尺度的特征并加以有效整合,以捕捉脊柱内部细微的空间关系。为此,HCA-Net将IVD标注建模为一个位姿估计问题,旨在最小化每个预测的IVD位置与其对应真实关节位置之间的差异。此外,我们引入了骨架损失项,以强化模型对脊柱的几何依赖:该损失函数将模型的预测约束在与人类脊柱骨架总体结构相符的范围内。因此,网络学会减少错误预测的发生,并自适应地提高IVD位置估计的精度。在多中心脊柱数据集上的广泛实验评估表明,我们的方法在MRI T1w和T2w两种模态上都一致地优于以往的最新方法。代码库可以在 \href{https://github.com/xmindflow/HCA-Net}{GitHub} 上获取。

MaskFlow: Object-Aware Motion Estimation

  • paper_url: http://arxiv.org/abs/2311.12476
  • repo_url: None
  • paper_authors: Aria Ahmadi, David R. Walton, Tim Atherton, Cagatay Dikici
  • for: 这种新的运动估计方法MaskFlow能够准确地估计运动场,即便面对小物体、大位移和剧烈的外观变化也能处理。
  • methods: MaskFlow除了使用其他基于深度神经网络(DNN)的运动估计方法所用的低层特征外,还利用物体级特征与分割结果,用以近似物体的平移运动场,并将这一不完整的平移运动场融入后续的运动估计网络中进行精化和补全。
  • results: MaskFlow在我们新构建的高难度数据集上表现出色,同时在流行的FlyingThings3D基准数据集上仍能取得相当的结果。
    Abstract We introduce a novel motion estimation method, MaskFlow, that is capable of estimating accurate motion fields, even in very challenging cases with small objects, large displacements and drastic appearance changes. In addition to lower-level features, that are used in other Deep Neural Network (DNN)-based motion estimation methods, MaskFlow draws from object-level features and segmentations. These features and segmentations are used to approximate the objects' translation motion field. We propose a novel and effective way of incorporating the incomplete translation motion field into a subsequent motion estimation network for refinement and completion. We also produced a new challenging synthetic dataset with motion field ground truth, and also provide extra ground truth for the object-instance matchings and corresponding segmentation masks. We demonstrate that MaskFlow outperforms state of the art methods when evaluated on our new challenging dataset, whilst still producing comparable results on the popular FlyingThings3D benchmark dataset.
    摘要 我们介绍了一种新的运动估计方法MaskFlow,它即使在包含小物体、大位移和剧烈外观变化的高难度情形下,也能估计出准确的运动场。除了其他基于深度神经网络(DNN)的运动估计方法所使用的低层特征外,MaskFlow还利用物体级特征和分割结果,用于近似物体的平移运动场。我们提出了一种新颖而有效的方式,将这一不完整的平移运动场融入后续的运动估计网络中进行精化和补全。我们还构建了一个带有运动场真值的新高难度合成数据集,并额外提供了物体实例匹配及对应分割掩码的真值。我们证明,MaskFlow在我们新的高难度数据集上优于最新方法,同时在流行的FlyingThings3D基准数据集上仍能取得相当的结果。

GLAD: Global-Local View Alignment and Background Debiasing for Unsupervised Video Domain Adaptation with Large Domain Gap

  • paper_url: http://arxiv.org/abs/2311.12467
  • repo_url: https://github.com/khuvll/glad
  • paper_authors: Hyogun Lee, Kyungho Bae, Seong Jong Ha, Yumin Ko, Gyeong-Moon Park, Jinwoo Choi
  • for: 这篇论文是为了解决无监督视频领域适应(UVDA)问题,具体是在动作识别方面。
  • methods: 作者提出了一种新的全局-局部视图对齐方法来解决时间偏移问题,并通过时间顺序学习获得对时间顺序敏感的表示、通过背景增强获得背景不变的表示,以缓解背景偏移。
  • results: 实验表明,提出的方法在Kinetics->BABEL数据集上具有显著的改善,与现有方法相比。代码可以在https://github.com/KHUVLL/GLAD上获取。
    Abstract In this work, we tackle the challenging problem of unsupervised video domain adaptation (UVDA) for action recognition. We specifically focus on scenarios with a substantial domain gap, in contrast to existing works primarily deal with small domain gaps between labeled source domains and unlabeled target domains. To establish a more realistic setting, we introduce a novel UVDA scenario, denoted as Kinetics->BABEL, with a more considerable domain gap in terms of both temporal dynamics and background shifts. To tackle the temporal shift, i.e., action duration difference between the source and target domains, we propose a global-local view alignment approach. To mitigate the background shift, we propose to learn temporal order sensitive representations by temporal order learning and background invariant representations by background augmentation. We empirically validate that the proposed method shows significant improvement over the existing methods on the Kinetics->BABEL dataset with a large domain gap. The code is available at https://github.com/KHUVLL/GLAD.
    摘要 在这项工作中,我们面临着无监督视频领域适应(UVDA)问题的挑战,具体是针对具有较大领域差异的情景。现有工作主要集中在小领域差异的情况下进行研究,而我们则专注于更真实的场景。为了建立更加实际的设定,我们提出了一种新的UVDA场景,即Kinetics->BABEL场景,其领域差异更大,包括时间动态和背景变化。为了解决时间偏移问题,即来源频道和目标频道之间的动作持续时间差异,我们提出了全局地视角协调方法。为了缓解背景变化问题,我们提出了学习时间顺序敏感表示法和背景增强法。我们通过实验证明,我们的方法在Kinetics->BABEL数据集上表现出了显著的改进,而且与现有方法相比,具有更高的鲁棒性和可靠性。代码可以在https://github.com/KHUVLL/GLAD上获取。

HiFi-Syn: Hierarchical Granularity Discrimination for High-Fidelity Synthesis of MR Images with Structure Preservation

  • paper_url: http://arxiv.org/abs/2311.12461
  • repo_url: None
  • paper_authors: Ziqi Yu, Botao Zhao, Shengjie Zhang, Xiang Chen, Jianfeng Feng, Tingying Peng, Xiao-Yong Zhang
  • for: 本研究旨在提高医疗影像的合成,以保持医学研究中的结构信息。
  • methods: 我们提出了层次粒度识别策略,利用医学影像中不同层次的Semantic信息。
  • results: 我们的策略在三个独立数据集(UK Biobank、IXI和BraTS 2018)上评估了图像翻译性能,并在比较最佳方法时表现出色。特别是,我们的模型不仅能够合成正常结构,还能够处理异常(疾病)结构,如脑肿瘤,并且在不同的医学成像模式下具有较好的变化抗颤性。
    Abstract Synthesizing medical images while preserving their structural information is crucial in medical research. In such scenarios, the preservation of anatomical content becomes especially important. Although recent advances have been made by incorporating instance-level information to guide translation, these methods overlook the spatial coherence of structural-level representation and the anatomical invariance of content during translation. To address these issues, we introduce hierarchical granularity discrimination, which exploits various levels of semantic information present in medical images. Our strategy utilizes three levels of discrimination granularity: pixel-level discrimination using a Brain Memory Bank, structure-level discrimination on each brain structure with a re-weighting strategy to focus on hard samples, and global-level discrimination to ensure anatomical consistency during translation. The image translation performance of our strategy has been evaluated on three independent datasets (UK Biobank, IXI, and BraTS 2018), and it has outperformed state-of-the-art algorithms. Particularly, our model excels not only in synthesizing normal structures but also in handling abnormal (pathological) structures, such as brain tumors, despite the variations in contrast observed across different imaging modalities due to their pathological characteristics. The diagnostic value of synthesized MR images containing brain tumors has been evaluated by radiologists. This indicates that our model may offer an alternative solution in scenarios where specific MR modalities of patients are unavailable. Extensive experiments further demonstrate the versatility of our method, providing unique insights into medical image translation.
    摘要 医疗图像合成是医学研究中非常重要的一环。在这些场景下,保持结构信息的稳定性特别重要。虽然最近的进步已经通过在翻译中包含实例级信息来引导翻译,但这些方法忽略了医学图像的空间相关性和结构级别的内容一致性。为解决这些问题,我们提出了层次粒度识别,它利用医学图像中不同级别的 semantic 信息。我们的策略包括三个级别的识别粒度:像素级识别使用 Brain Memory Bank,结构级识别通过重新权重策略来关注困难样本,以及全球级识别来保证结构一致性。我们的策略在三个独立的数据集(UK Biobank、IXI和BraTS 2018)上进行了图像翻译性能的评估,并在比较当前的算法时表现出色。特别是,我们的模型不仅能够合成正常结构,还能够处理异常(疾病)结构,如脑肿瘤,尽管在不同的医学成像Modalities中存在异常的对比特征。医生评估的合成 MR 图像中的脑肿瘤的诊断价值已被评估。这表明我们的模型可能会在某些患者的特定 MR Modalities 不可用时提供一个替代解决方案。广泛的实验还证明了我们的方法的多样性,为医学图像翻译提供了新的视角。

Learning Site-specific Styles for Multi-institutional Unsupervised Cross-modality Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.12437
  • repo_url: https://github.com/medicl-vu/crossmoda2023
  • paper_authors: Han Liu, Yubo Fan, Zhoubing Xu, Benoit M. Dawant, Ipek Oguz
  • for: 这篇论文主要用于解决医学影像分析中的无监督跨频道领域适应问题,并且受到多个机构的数据收集所带来的挑战。
  • methods: 本篇论文使用了非配对图像转换将源域影像转换到目标域,并设计了动态网络以生成可控制、站点特定的目标域图像。然后,我们使用这些合成图像训练分割模型,并通过自我训练进一步缩小领域差距。
  • results: 根据 validation 和 testing 阶段的成绩,我们的解决方案获得了第一名。实际上,我们的代码存储库已经公开供大家使用,详细信息可以参考 https://github.com/MedICL-VU/crossmoda2023
    Abstract Unsupervised cross-modality domain adaptation is a challenging task in medical image analysis, and it becomes more challenging when source and target domain data are collected from multiple institutions. In this paper, we present our solution to tackle the multi-institutional unsupervised domain adaptation for the crossMoDA 2023 challenge. First, we perform unpaired image translation to translate the source domain images to the target domain, where we design a dynamic network to generate synthetic target domain images with controllable, site-specific styles. Afterwards, we train a segmentation model using the synthetic images and further reduce the domain gap by self-training. Our solution achieved the 1st place during both the validation and testing phases of the challenge. The code repository is publicly available at https://github.com/MedICL-VU/crossmoda2023.
    摘要 无监督跨模态领域适应是医疗影像分析中的挑战,而当源和目标领域数据来自多家机构时,这一挑战更加困难。在这篇文章中,我们介绍了针对 crossMoDA 2023 挑战中多机构无监督领域适应问题的解决方案。我们首先使用非配对图像转换将源领域图像转换到目标领域,并设计了动态网络来生成风格可控、站点特定的合成目标领域图像。接着,我们使用这些合成图像训练分割模型,并通过自我训练进一步缩小领域差距。我们的解决方案在验证和测试阶段都获得了第一名。代码已公开于 https://github.com/MedICL-VU/crossmoda2023。
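
The self-training step described above can be illustrated with a short sketch. This is a minimal, hypothetical pseudo-labeling routine (the confidence threshold and the use of -100 as an ignore index are assumptions, not details from the paper):

```python
import torch

@torch.no_grad()
def pseudo_label(model, target_batch, confidence=0.9):
    """Label real target-domain scans with the current segmentation model and
    keep only voxels whose softmax confidence clears a threshold; the rest are
    marked with -100 so a cross-entropy loss with ignore_index=-100 skips them."""
    probs = torch.softmax(model(target_batch), dim=1)   # (B, C, D, H, W) class probabilities
    conf, labels = probs.max(dim=1)                     # per-voxel confidence and hard label
    labels[conf < confidence] = -100                    # low-confidence voxels are ignored
    return labels
```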

AR Visualization System for Ship Detection and Recognition Based on AI

  • paper_url: http://arxiv.org/abs/2311.12430
  • repo_url: None
  • paper_authors: Ziqi Ye, Limin Huang, Yongji Wu, Min Hu
  • for: 这项研究旨在开发一个基于人工智能和扩展现实技术的船舶检测和识别系统。
  • methods: 该系统主要包括三部分:人工智能模块、Unity开发模块和Hololens2AR模块。使用R3Det算法完成远程感知图像中船舶的检测和识别,并使用RTX 2080Ti训练模型可达96%的识别率。然后通过将船舶分类和信息转化为3D模型,在虚拟场景中生成船舶的3D模型。最后,添加了语音模块和UI交互模块,并通过MRTK部署到Hololens2上。
  • results: 该系统实现了计算机视觉和扩展现实技术的融合,将对象检测结果映射到AR场景中,向未来技术趋势和智能应用迈出了坚实的一步。
    Abstract Augmented reality technology has been widely used in industrial design interaction, exhibition guide, information retrieval and other fields. The combination of artificial intelligence and augmented reality technology has also become a future development trend. This project is an AR visualization system for ship detection and recognition based on AI, which mainly includes three parts: artificial intelligence module, Unity development module and Hololens2AR module. This project is based on R3Det algorithm to complete the detection and recognition of ships in remote sensing images. The recognition rate of model detection trained on RTX 2080Ti can reach 96%. Then, the 3D model of the ship is obtained by ship categories and information and generated in the virtual scene. At the same time, voice module and UI interaction module are added. Finally, we completed the deployment of the project on Hololens2 through MRTK. The system realizes the fusion of computer vision and augmented reality technology, which maps the results of object detection to the AR field, and makes a brave step toward the future technological trend and intelligent application.
    摘要 “增强现实技术在工业设计互动、展览导航、信息检索等领域都得到了广泛应用。将人工智能和增强现实技术相结合,也成为未来发展趋势。这个项目是基于R3Det算法实现的基于AI的船舶检测和识别AR视觉系统,主要包括三部分:人工智能模块、Unity开发模块和Hololens2AR模块。该项目使用RTX 2080Ti进行模型训练,可达96%的检测率。然后,通过船 categories和信息获取3D模型,并在虚拟场景中生成。同时,添加了语音模块和UI互动模块。最后,通过MRTK进行 deploy 到 Hololens2 上。系统实现了计算机视觉和增强现实技术的融合,将对象检测结果映射到AR场景中,做出了勇敢的一步 towards 未来技术趋势和智能应用。”

Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency

  • paper_url: http://arxiv.org/abs/2311.12421
  • repo_url: None
  • paper_authors: Christian Keilstrup Ingwersen, Anders Bjorholm Dahl, Janus Nørtoft Jensen, Morten Rieger Hannemose
  • for: 提高单视图人姿估计模型性能,使其在没有3D数据的情况下进行 fine-tuning。
  • methods: 提出了一种新的损失函数——多视图一致性损失,可以在没有3D数据的情况下进行训练。这个损失函数要求一视图中的3D姿势与另一视图中的3D姿势相似。
  • results: 实验表明,只需要使用两个视图,即90度偏移的两个视图,可以获得良好的性能。只有在添加更多视图时,获得了有限的改进。这种方法可以通过捕捉活动中的偏移量来获得域特定的数据,从而消除较为复杂的准备过程。这些研究开创了域适应3D姿势估计领域的新可能性,提供了一种实用和经济的解决方案,可以适应特定应用场景。
    Abstract Deducing a 3D human pose from a single 2D image or 2D keypoints is inherently challenging, given the fundamental ambiguity wherein multiple 3D poses can correspond to the same 2D representation. The acquisition of 3D data, while invaluable for resolving pose ambiguity, is expensive and requires an intricate setup, often restricting its applicability to controlled lab environments. We improve performance of monocular human pose estimation models using multiview data for fine-tuning. We propose a novel loss function, multiview consistency, to enable adding additional training data with only 2D supervision. This loss enforces that the inferred 3D pose from one view aligns with the inferred 3D pose from another view under similarity transformations. Our consistency loss substantially improves performance for fine-tuning with no available 3D data. Our experiments demonstrate that two views offset by 90 degrees are enough to obtain good performance, with only marginal improvements by adding more views. Thus, we enable the acquisition of domain-specific data by capturing activities with off-the-shelf cameras, eliminating the need for elaborate calibration procedures. This research introduces new possibilities for domain adaptation in 3D pose estimation, providing a practical and cost-effective solution to customize models for specific applications. The used dataset, featuring additional views, will be made publicly available.
    摘要 用单个2D图像或2D关键点推算3D人姿是一项极其困难的任务,因为多个3D姿势可以对应同一个2D表示。获取3D数据非常有价值,可以解决姿势歧义,但其成本高昂且需要复杂的设置,通常只能在受控的实验室环境中进行。我们使用多视图数据进行微调,改进了单视图人姿估算模型的性能。我们提出了一个新的损失函数——多视图一致性,以便在仅有2D监督时加入额外训练数据。这个损失函数要求一个视图中推断的3D姿势与另一个视图中推断的3D姿势在相似变换下吻合。我们的一致性损失显著提高了没有3D数据时的微调性能。实验表明,只需两个夹角为90度的视图即可获得良好的性能,增加更多视图仅带来微量提升。因此,我们可以使用现成相机拍摄活动来获得领域特有的数据,消除了复杂的标定流程。这项研究为3D姿势估算的领域适应开拓了新的可能,提供了实用且经济的解决方案,便于针对特定应用定制模型。所使用的包含额外视图的数据集将会公开。
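
The multiview consistency idea lends itself to a compact sketch: align the pose predicted from one view to the pose predicted from another with a best-fit similarity transform (Umeyama/Procrustes) and penalize the residual. The exact loss in the paper may differ; this is an illustrative PyTorch version:

```python
import torch

def similarity_align(X, Y):
    """Umeyama alignment: scale s, rotation R, translation t minimizing ||X - (s R Y + t)||.
    X, Y: (J, 3) joint positions predicted from two views."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    cov = Xc.T @ Yc / X.shape[0]                          # 3x3 cross-covariance
    U, S, Vt = torch.linalg.svd(cov)
    d = float(torch.sign(torch.linalg.det(U @ Vt)))       # reflection guard
    sign = torch.tensor([1.0, 1.0, d])
    R = U @ torch.diag(sign) @ Vt
    s = (S * sign).sum() / ((Yc ** 2).sum() / Y.shape[0])
    t = mu_x - s * (R @ mu_y)
    return s * (Y @ R.T) + t

def multiview_consistency_loss(pose_view_a, pose_view_b):
    """Penalize disagreement between the 3D poses inferred from two views,
    up to a similarity transformation (no 3D ground truth required)."""
    aligned_b = similarity_align(pose_view_a, pose_view_b)
    return torch.norm(pose_view_a - aligned_b, dim=-1).mean()
```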

Attribute-Aware Deep Hashing with Self-Consistency for Large-Scale Fine-Grained Image Retrieval

  • paper_url: http://arxiv.org/abs/2311.12894
  • repo_url: None
  • paper_authors: Xiu-Shen Wei, Yang Shen, Xuhao Sun, Peng Wang, Yuxin Peng
  • for: 这paper的目的是提出一种基于特征意识hash网络,以提高精细图像检索的效率和可靠性。
  • methods: 该paper使用了一种encoder-decoder结构的待检索任务,以不监督地提取高级别特征特征向量。此外,它还加入了一个特征减 correlate约束,以强化特征表示能力。
  • results: 对六个精细图像检索 datasets和两个通用检索 datasets进行了广泛的量化实验,并证明了该方法在对比方法的情况下表现出优异性。
    Abstract Our work focuses on tackling large-scale fine-grained image retrieval as ranking the images depicting the concept of interests (i.e., the same sub-category labels) highest based on the fine-grained details in the query. It is desirable to alleviate the challenges of both fine-grained nature of small inter-class variations with large intra-class variations and explosive growth of fine-grained data for such a practical task. In this paper, we propose attribute-aware hashing networks with self-consistency for generating attribute-aware hash codes to not only make the retrieval process efficient, but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the captured visual representations by attention, we develop an encoder-decoder structure network of a reconstruction task to unsupervisedly distill high-level attribute-specific vectors from the appearance-specific visual representations without attribute annotations. Our models are also equipped with a feature decorrelation constraint upon these attribute vectors to strengthen their representative abilities. Then, driven by preserving original entities' similarity, the required hash codes can be generated from these attribute-specific vectors and thus become attribute-aware. Furthermore, to combat simplicity bias in deep hashing, we consider the model design from the perspective of the self-consistency principle and propose to further enhance models' self-consistency by equipping an additional image reconstruction path. Comprehensive quantitative experiments under diverse empirical settings on six fine-grained retrieval datasets and two generic retrieval datasets show the superiority of our models over competing methods.
    摘要 我们的工作关注于解决大规模细化图像检索问题,即根据查询图像中细化的细节来排序图像,并以细化细节为检索基准。这是一项实际重要的任务,因为细化图像数据的增长速度急剧,同时细化图像之间的变化很小,但是又有很大的内类变化。在这篇论文中,我们提出了Attribute-aware哈希网络,以提高检索效率和建立明确的哈希码和视觉特征之间的对应关系。具体来说,我们根据捕捉的视觉表示来设计一个encoder-decoder结构网络,以无监督的方式提取高级特征向量,并且通过对这些特征向量进行Feature decorrelation来增强其表示能力。然后,我们通过保持原始实体之间的相似性来生成需要的哈希码,从而使哈希码变得Attribute-aware。此外,为了解决深度哈希中的简单性偏好问题,我们从自 consistency 原理的角度设计了我们的模型,并在其基础上进一步增强模型的自 consistency。经过广泛的量化实验,我们的模型在六个细化检索数据集和两个通用检索数据集上表现出优于竞争方法。
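
The feature decorrelation constraint mentioned in the abstract can be sketched as a penalty on the off-diagonal entries of the Gram matrix of the attribute-specific vectors; the exact formulation in the paper may differ:

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(attr_vectors):
    """attr_vectors: (A, D), one high-level vector per attribute for one image.
    Push distinct attribute vectors toward mutual orthogonality."""
    v = F.normalize(attr_vectors, dim=-1)
    gram = v @ v.t()                                       # (A, A) cosine similarities
    off_diag = gram - torch.diag(torch.diagonal(gram))     # zero out the diagonal
    return (off_diag ** 2).sum() / (v.shape[0] * (v.shape[0] - 1))
```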

Board-to-Board: Evaluating Moonboard Grade Prediction Generalization

  • paper_url: http://arxiv.org/abs/2311.12419
  • repo_url: https://github.com/a1773620/moonboard-grade-prediction
  • paper_authors: Daniel Petashvili, Matthew Rodda
  • for: 预测攀岩路线的难度等级
  • methods: 应用经典和深度学习模型技术
  • results: 在2016、2017和2019年的 Moonboard 数据集上实现了最先进的难度预测性能(MAE=0.87、RMSE=1.12),不需要将路线分解为具体的动作,并且能够在不同版本之间泛化。
    Abstract Bouldering is a sport where athletes aim to climb up an obstacle using a set of defined holds called a route. Typically routes are assigned a grade to inform climbers of its difficulty and allow them to more easily track their progression. However, the variation in individual climbers technical and physical attributes and many nuances of an individual route make grading a difficult and often biased task. In this work, we apply classical and deep-learning modelling techniques to the 2016, 2017 and 2019 Moonboard datasets, achieving state of the art grade prediction performance with 0.87 MAE and 1.12 RMSE. We achieve this performance on a feature-set that does not require decomposing routes into individual moves, which is a method common in literature and introduces bias. We also demonstrate the generalization capability of this model between editions and introduce a novel vision-based method of grade prediction. While the generalization performance of these techniques is below human level performance currently, we propose these methods as a basis for future work. Such a tool could be implemented in pre-existing mobile applications and would allow climbers to better track their progress and assess new routes with reduced bias.
    摘要 抱石(bouldering)是一种运动,运动员的目标是利用一组预先定义的攀点(即路线)攀上障碍物。通常,路线会被分配一个难度等级,以便攀岩者了解其困难程度并更容易地跟踪自己的进步。然而,由于每名运动员的技术和身体条件各不相同,且每条路线都有许多细节差异,等级评定往往困难且带有主观偏差。在这项工作中,我们将经典和深度学习建模技术应用于2016、2017和2019年的 Moonboard 数据集,达到了最先进的等级预测性能(MAE 0.87,RMSE 1.12)。我们在不需要将路线分解为单个动作的特征集上实现了这一性能;将路线分解为动作是文献中常见的方法,但会引入偏差。我们还展示了该模型在不同版本之间的泛化能力,并提出了一种新的基于视觉的等级预测方法。虽然这些技术的泛化性能目前尚不及人类水平,但我们将其作为未来工作的基础。这类工具可以集成到现有的移动应用中,让攀岩者更好地跟踪进步,并以更小的偏差评估新路线。

Learning Part Motion of Articulated Objects Using Spatially Continuous Neural Implicit Representations

  • paper_url: http://arxiv.org/abs/2311.12407
  • repo_url: https://github.com/Yushi-Du/PartMotion
  • paper_authors: Yushi Du, Ruihai Wu, Yan Shen, Hao Dong
  • for: 本研究旨在提出一种新的拟合机构,以便更好地理解和操作具有高度自由度和复杂几何结构的扭轴物体(如门和抽屉)。
  • methods: 我们提出了一种新的拟合方法,该方法使用空间连续神经几何表示来描述扭轴物体的部件运动。该方法可以覆盖不同类型的关节运动,并且可以在不同的几何空间中进行细致的部件运动模拟。
  • results: 我们的实验结果表明,我们的提出的拟合方法可以有效地模拟扭轴物体的部件运动,并且可以在不同类型的扭轴物体上进行精准的操作。
    Abstract Articulated objects (e.g., doors and drawers) exist everywhere in our lives. Different from rigid objects, articulated objects have higher degrees of freedom and are rich in geometries, semantics, and part functions. Modeling different kinds of parts and articulations with neural networks plays an essential role in articulated object understanding and manipulation, and will further benefit the 3D vision and robotics communities. To model articulated objects, most previous works directly encode articulated objects into feature representations, without specific designs for parts, articulations and part motions. In this paper, we introduce a novel framework that explicitly disentangles the part motion of articulated objects by predicting the transformation matrix of points on the part surface, using spatially continuous neural implicit representations to model the part motion smoothly in space. More importantly, while many methods could only model a certain kind of joint motion (such as revolution in the clockwise order), our proposed framework is generic to different kinds of joint motions in that the transformation matrix can model diverse kinds of joint motions in space. Quantitative and qualitative results of experiments over diverse categories of articulated objects demonstrate the effectiveness of our proposed framework.
    摘要 everywhere in our lives, articulated objects (such as doors and drawers) exist. Compared to rigid objects, articulated objects have more freedom of movement and richer geometry, semantics, and part functions. Using neural networks to model different parts and articulations is essential for understanding and manipulating articulated objects, and will benefit the 3D vision and robotics communities.Previous works have directly encoded articulated objects into feature representations without specific designs for parts, articulations, and part motions. In this paper, we introduce a novel framework that explicitly disentangles the part motion of articulated objects by predicting the transformation matrix of points on the part surface using spatially continuous neural implicit representations to model part motion smoothly in space.Moreover, our proposed framework is generic to different types of joint motions, as the transformation matrix can model diverse types of joint motions in space. The effectiveness of our proposed framework is demonstrated through quantitative and qualitative experiments on diverse categories of articulated objects.
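
A spatially continuous motion field of the kind described can be sketched as an MLP mapping a surface point and a scalar joint state to per-point rigid-transform parameters; predicting an axis-angle rotation plus a translation (rather than a full matrix) is an implementation assumption:

```python
import torch
import torch.nn as nn

class PartMotionField(nn.Module):
    """Maps (x, y, z, joint_state) to 6 transform parameters for that point:
    an axis-angle rotation (3) and a translation (3)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 6))

    def forward(self, points, joint_state):
        # points: (N, 3) samples on the part surface; joint_state: scalar articulation value.
        state = torch.full_like(points[:, :1], joint_state)
        out = self.mlp(torch.cat([points, state], dim=-1))
        return out[:, :3], out[:, 3:]                      # axis-angle, translation per point
```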

CASR: Refining Action Segmentation via Marginalizing Frame-level Causal Relationships

  • paper_url: http://arxiv.org/abs/2311.12401
  • repo_url: None
  • paper_authors: Keqing Du, Xinyu Yang, Hang Chen
  • for: 提高 Temporal Action Segmentation (TAS) 任务的解释性,通过深度学习和 causal discovery 集成。
  • methods: 提出 Causal Abstraction Segmentation Refiner (CASR),可以对不同骨干模型的初步分割结果进行精炼,并通过边缘化帧级因果关系来增强视频因果性。
  • results: CASR 能够在多种数据集上显著超过现有方法,以及在 causal explainability 和泛化能力方面表现出色。
    Abstract Integrating deep learning and causal discovery has increased the interpretability of Temporal Action Segmentation (TAS) tasks. However, frame-level causal relationships contain many complicated noises beyond the segment level, making it infeasible to directly express macro action semantics. Thus, we propose the Causal Abstraction Segmentation Refiner (CASR), which can refine TAS results from various models by enhancing video causality while marginalizing frame-level causal relationships. Specifically, we define an equivalent frame-level causal model and segment-level causal model, so that the causal adjacency matrix constructed from marginalized frame-level causal relationships has the ability to represent the segment-level causal relationships. CASR works by reducing the difference between the causal adjacency matrix we construct and the one derived from the pre-segmentation results of backbone models. In addition, we propose a novel evaluation metric, Causal Edit Distance (CED), to evaluate causal interpretability. Extensive experimental results on mainstream datasets indicate that CASR significantly surpasses various existing methods in action segmentation performance, as well as in causal explainability and generalization.
    摘要 将深度学习与因果发现相结合,可以提高时序动作分割(TAS)任务的可解释性。然而,帧级因果关系中存在许多片段级之外的复杂噪声,使得难以直接表达宏观动作语义。因此,我们提出 Causal Abstraction Segmentation Refiner (CASR),它通过在边缘化帧级因果关系的过程中增强视频因果性,来精炼各种模型的 TAS 结果。具体而言,我们定义了等价的帧级因果模型和片段级因果模型,使得由边缘化帧级因果关系构建的因果邻接矩阵能够表示片段级因果关系。CASR 通过缩小我们构建的因果邻接矩阵与骨干模型预分割结果之间的差异来工作。此外,我们提出了一种新的评价指标 Causal Edit Distance (CED),用于评估因果可解释性。主流数据集上的大量实验结果表明,CASR 在动作分割性能、因果可解释性和泛化能力方面都显著优于现有方法。
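
One way to picture the marginalization step is collapsing a frame-level causal adjacency matrix into a segment-level one by pooling all frame pairs that fall into each ordered pair of segments. Averaging is an assumption made for illustration; the paper defines the equivalence between the two causal models formally:

```python
import numpy as np

def marginalize_to_segments(frame_adj, segment_ids):
    """frame_adj: (T, T) frame-level causal adjacency; segment_ids: (T,) segment label per frame.
    Returns an (S, S) segment-level adjacency obtained by averaging frame-pair entries."""
    segs = np.unique(segment_ids)
    seg_adj = np.zeros((len(segs), len(segs)))
    for a, sa in enumerate(segs):
        for b, sb in enumerate(segs):
            block = frame_adj[np.ix_(segment_ids == sa, segment_ids == sb)]
            seg_adj[a, b] = block.mean()
    return seg_adj
```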

IMJENSE: Scan-specific Implicit Representation for Joint Coil Sensitivity and Image Estimation in Parallel MRI

  • paper_url: http://arxiv.org/abs/2311.12892
  • repo_url: https://github.com/amri-lab/imjense
  • paper_authors: Ruimin Feng, Qing Wu, Jie Feng, Huajun She, Chunlei Liu, Yuyao Zhang, Hongjiang Wei
  • for: 提高磁共振成像(MRI)数据收集的速度。
  • methods: 使用含义表示(Implicit Neural Representation,INR)来改进平行MRI重建算法。
  • results: 在仅有4或8条校准线的限制条件下,IMJENSE 可以稳定地重建2D笛卡尔采集的图像,并且图像质量较高。
    Abstract Parallel imaging is a commonly used technique to accelerate magnetic resonance imaging (MRI) data acquisition. Mathematically, parallel MRI reconstruction can be formulated as an inverse problem relating the sparsely sampled k-space measurements to the desired MRI image. Despite the success of many existing reconstruction algorithms, it remains a challenge to reliably reconstruct a high-quality image from highly reduced k-space measurements. Recently, implicit neural representation has emerged as a powerful paradigm to exploit the internal information and the physics of partially acquired data to generate the desired object. In this study, we introduced IMJENSE, a scan-specific implicit neural representation-based method for improving parallel MRI reconstruction. Specifically, the underlying MRI image and coil sensitivities were modeled as continuous functions of spatial coordinates, parameterized by neural networks and polynomials, respectively. The weights in the networks and coefficients in the polynomials were simultaneously learned directly from sparsely acquired k-space measurements, without fully sampled ground truth data for training. Benefiting from the powerful continuous representation and joint estimation of the MRI image and coil sensitivities, IMJENSE outperforms conventional image or k-space domain reconstruction algorithms. With extremely limited calibration data, IMJENSE is more stable than supervised calibrationless and calibration-based deep-learning methods. Results show that IMJENSE robustly reconstructs the images acquired at 5$\mathbf{\times}$ and 6$\mathbf{\times}$ accelerations with only 4 or 8 calibration lines in 2D Cartesian acquisitions, corresponding to 22.0% and 19.5% undersampling rates. The high-quality results and scanning specificity make the proposed method hold the potential for further accelerating the data acquisition of parallel MRI.
    摘要 并行成像是加速磁共振成像(MRI)数据采集的常用技术。在数学上,并行MRI重建可以表述为一个逆问题,把稀疏采样的k空间测量与目标MRI图像联系起来。尽管许多现有重建算法已取得成功,但从高度欠采样的k空间测量中可靠地重建高质量图像仍是一个挑战。最近,隐式神经表示作为一种强大的范式出现,可以利用部分采集数据的内部信息和物理特性来生成目标对象。在本研究中,我们提出了IMJENSE,一种基于扫描特定隐式神经表示的方法,用于改进并行MRI重建。具体而言,底层MRI图像和线圈灵敏度分别被建模为空间坐标的连续函数,由神经网络和多项式参数化。网络权重和多项式系数直接从稀疏采集的k空间测量中同时学习,无需完全采样的真值数据进行训练。得益于强大的连续表示以及对MRI图像和线圈灵敏度的联合估计,IMJENSE优于传统的图像域或k空间域重建算法。在标定数据极其有限的情况下,IMJENSE比有监督的无标定及基于标定的深度学习方法更加稳定。结果表明,在2D笛卡尔采集中,IMJENSE仅用4条或8条校准线即可稳健地重建5倍和6倍加速(对应22.0%和19.5%的欠采样率)下采集的图像。高质量的结果和扫描特异性使该方法具有进一步加速并行MRI数据采集的潜力。
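
The scan-specific fitting can be pictured with a small sketch: an implicit network maps spatial coordinates to a complex image, and the loss enforces that the predicted coil-weighted k-space matches the acquired samples. Here the coil sensitivities are passed in as a fixed tensor for brevity, whereas the paper learns them jointly as low-order polynomials:

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Implicit image: normalized (x, y) coordinates -> complex intensity."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))         # real and imaginary parts
    def forward(self, xy):
        out = self.net(xy)
        return torch.complex(out[..., 0], out[..., 1])

def data_consistency_loss(mlp, coil_sens, kspace, mask, grid):
    """coil_sens, kspace: (C, H, W) complex; mask: (H, W) float sampling mask;
    grid: (H*W, 2) pixel coordinates. No fully sampled ground truth is needed."""
    H, W = kspace.shape[-2:]
    img = mlp(grid).reshape(H, W)                              # implicit complex image
    pred_k = torch.fft.fft2(coil_sens * img)                   # per-coil predicted k-space
    return (((pred_k - kspace).abs() ** 2) * mask).sum() / mask.sum()
```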

RFTrans: Leveraging Refractive Flow of Transparent Objects for Surface Normal Estimation and Manipulation

  • paper_url: http://arxiv.org/abs/2311.12398
  • repo_url: None
  • paper_authors: Tutian Tang, Jiyu Liu, Jieyi Zhang, Haoyuan Fu, Wenqiang Xu, Cewu Lu
  • for: 这篇论文的目的是教 robots 如何与透明物体交互,以解决直接从 RGB 图像中提取 geometry 信息时的困难。
  • methods: 这篇论文提出了一种基于 RGB-D 摄像头的方法,称为 RFTrans,可以估计透明物体表面法向。该方法利用了折射流作为中间表示,从而解决直接从 RGB 图像中提取 geometry 信息时的困难。
  • results: 在 synthetic 数据集上训练 RFTrans 后,它可以在 synthetic 和实际世界 benchmark 中一直胜过基准 ClearGrasp 的表现,占据了大幅度的优势。此外,在实际世界中,使用 RFTrans 实现了83% 的成功率。
    Abstract Transparent objects are widely used in our daily lives, making it important to teach robots to interact with them. However, it's not easy because the reflective and refractive effects can make RGB-D cameras fail to give accurate geometry measurements. To solve this problem, this paper introduces RFTrans, an RGB-D-based method for surface normal estimation and manipulation of transparent objects. By leveraging refractive flow as an intermediate representation, RFTrans circumvents the drawbacks of directly predicting the geometry (e.g. surface normal) from RGB images and helps bridge the sim-to-real gap. RFTrans integrates the RFNet, which predicts refractive flow, object mask, and boundaries, followed by the F2Net, which estimates surface normal from the refractive flow. To make manipulation possible, a global optimization module will take in the predictions, refine the raw depth, and construct the point cloud with normal. An analytical grasp planning algorithm, ISF, is followed to generate the grasp poses. We build a synthetic dataset with physically plausible ray-tracing rendering techniques to train the networks. Results show that the RFTrans trained on the synthetic dataset can consistently outperform the baseline ClearGrasp in both synthetic and real-world benchmarks by a large margin. Finally, a real-world robot grasping task witnesses an 83% success rate, proving that refractive flow can help enable direct sim-to-real transfer. The code, data, and supplementary materials are available at https://rftrans.robotflow.ai.
    摘要 透明物体在我们的日常生活中广泛使用,因此教机器人如何与它们交互非常重要。然而,由于反射和折射效应,RGB-D 摄像头往往无法给出准确的几何测量。为解决这个问题,本文介绍了 RFTrans,一种基于 RGB-D 的透明物体表面法向估计与操作方法。通过利用折射流作为中间表示,RFTrans 避开了直接从 RGB 图像预测几何(如表面法向)的缺陷,并帮助弥合仿真到现实的差距。RFTrans 集成了预测折射流、物体掩码和边界的 RFNet,以及从折射流估计表面法向的 F2Net。为了实现操作,全局优化模块会接收这些预测、修正原始深度,并构建带法向的点云,随后由解析抓取规划算法 ISF 生成抓握姿态。我们使用物理合理的光线追踪渲染技术构建了合成数据集来训练这些网络。结果表明,在合成与真实世界基准上,在该合成数据集上训练的 RFTrans 都能以较大优势稳定超越基线 ClearGrasp。最后,真实世界的机器人抓取任务取得了 83% 的成功率,证明折射流有助于实现直接的仿真到现实迁移。代码、数据和补充材料可在 https://rftrans.robotflow.ai 获取。

Rich and Poor Texture Contrast: A Simple yet Effective Approach for AI-generated Image Detection

  • paper_url: http://arxiv.org/abs/2311.12397
  • repo_url: None
  • paper_authors: Nan Zhong, Yiran Xu, Zhenxing Qian, Xinpeng Zhang
  • for: 本研究旨在开发一种可以识别基于多种生成模型生成的假图像的检测器。
  • methods: 我们的方法利用图像中质地区域之间的相关性异同来检测假图像。我们将图像分解成多个区域,并将每个区域重建为两个图像:一个rich texture图像和一个poor texture图像。然后,我们提取rich texture和poor texture区域之间的相关性异同特征,并使其服为生成模型生成图像的普适指标。
  • results: 我们的方法在16种常见的生成模型下进行了广泛的实验,并证明了与现有基eline相比,我们的方法能够显著提高检测精度。
    Abstract Recent generative models show impressive performance in generating photographic images. Humans can hardly distinguish such incredibly realistic-looking AI-generated images from real ones. AI-generated images may lead to ubiquitous disinformation dissemination. Therefore, it is of utmost urgency to develop a detector to identify AI-generated images. Most existing detectors suffer from sharp performance drops over unseen generative models. In this paper, we propose a novel AI-generated image detector capable of identifying fake images created by a wide range of generative models. Our approach leverages the inter-pixel correlation contrast between rich and poor texture regions within an image. Pixels in rich texture regions exhibit more significant fluctuations than those in poor texture regions. This discrepancy reflects that the entropy of rich texture regions is larger than that of poor ones. Consequently, synthesizing realistic rich texture regions proves to be more challenging for existing generative models. Based on this principle, we divide an image into multiple patches and reconstruct them into two images, comprising rich-texture and poor-texture patches respectively. Subsequently, we extract the inter-pixel correlation discrepancy feature between rich and poor texture regions. This feature serves as a universal fingerprint used for AI-generated image forensics across different generative models. In addition, we build a comprehensive AI-generated image detection benchmark, which includes 16 kinds of prevalent generative models, to evaluate the effectiveness of existing baselines and our approach. Our benchmark provides a leaderboard for follow-up studies. Extensive experimental results show that our approach outperforms state-of-the-art baselines by a significant margin. Our project: https://fdmas.github.io/AIGCDetect/
    摘要 现代生成模型表现出惊人的能力,可以生成极其逼真的人工图像,人类几乎无法分辨这些图像与真实图像之间的差异。人工图像可能会导致虚假信息的广泛传播,因此迫切需要开发识别AI生成图像的检测器。现有的检测器在面对未见过的生成模型时往往会出现明显的性能下降。在这篇论文中,我们提出了一种新的AI生成图像检测器,可以识别由各种生成模型创建的假图像。我们的方法利用图像中富纹理区域与贫纹理区域之间的像素间相关性对比。富纹理区域中的像素比贫纹理区域中的像素表现出更显著的波动,这反映了富纹理区域的熵更大,因此对现有生成模型而言,合成逼真的富纹理区域更加困难。基于这一原理,我们将图像分解成多个图块,并将它们重组为两幅图像,分别由富纹理图块和贫纹理图块构成。然后,我们提取富纹理与贫纹理区域之间的像素间相关性差异特征。该特征可作为一种通用指纹,用于跨不同生成模型的AI生成图像取证。此外,我们构建了一个包含16种常见生成模型的AI生成图像检测基准,以评估现有基线和我们方法的效果,并为后续研究提供榜单。大量实验结果表明,我们的方法显著优于现有最先进的基线。我们的项目:https://fdmas.github.io/AIGCDetect/
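
The patch regrouping step can be sketched directly: rank patches by a simple pixel-fluctuation score (a stand-in for texture richness/entropy) and rebuild one image from the richest half and one from the poorest half. Patch size and the fluctuation measure are illustrative choices, not the paper's exact recipe:

```python
import numpy as np

def split_rich_poor(img, patch=32):
    """Return (rich_texture_image, poor_texture_image) reassembled from patches of img."""
    h, w = (img.shape[0] // patch) * patch, (img.shape[1] // patch) * patch
    patches = [img[i:i + patch, j:j + patch]
               for i in range(0, h, patch) for j in range(0, w, patch)]
    # Pixel fluctuation: mean absolute difference between horizontally adjacent pixels.
    scores = [np.abs(np.diff(p.astype(np.float32), axis=1)).mean() for p in patches]
    order = np.argsort(scores)
    poor = [patches[k] for k in order[:len(order) // 2]]
    rich = [patches[k] for k in order[len(order) // 2:]]

    def assemble(ps):
        side = int(np.ceil(np.sqrt(len(ps))))
        canvas = np.zeros((side * patch, side * patch) + img.shape[2:], img.dtype)
        for idx, p in enumerate(ps):
            r, c = divmod(idx, side)
            canvas[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = p
        return canvas

    return assemble(rich), assemble(poor)
```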

From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation

  • paper_url: http://arxiv.org/abs/2311.12391
  • repo_url: None
  • paper_authors: Jiaxin Ge, Sanjay Subramanian, Trevor Darrell, Boyi Li
  • for: 提高视觉理解任务中的解释质量,减少人工标注量。
  • methods: 采用递归式视觉语言方法,在多步骤过程中迭代计算视觉特征、答案和解释,逐步提高解释质量。
  • results: 相比单步骤生成,多步骤生成能够更好地导引模型更正自己的答案,并且生成的解释也可以作为有价值的自动标注。 Our approach outperforms previous methods while utilizing merely 5% of the human-annotated explanations across 10 metrics, demonstrating up to a 4.2 and 1.3 increase in BLEU-1 score on the VCR and VQA-X datasets.
    Abstract Addressing the challenge of adapting pre-trained vision-language models for generating insightful explanations for visual reasoning tasks with limited annotations, we present ReVisE: a $\textbf{Re}$cursive $\textbf{Vis}$ual $\textbf{E}$xplanation algorithm. Our method iteratively computes visual features (conditioned on the text input), an answer, and an explanation, to improve the explanation quality step by step until the answer converges. We find that this multi-step approach guides the model to correct its own answers and outperforms single-step explanation generation. Furthermore, explanations generated by ReVisE also serve as valuable annotations for few-shot self-training. Our approach outperforms previous methods while utilizing merely 5% of the human-annotated explanations across 10 metrics, demonstrating up to a 4.2 and 1.3 increase in BLEU-1 score on the VCR and VQA-X datasets, underscoring the efficacy and data-efficiency of our method.
    摘要 针对在标注有限的情况下,将预训练视觉语言模型用于为视觉推理任务生成有洞察力解释的挑战,我们提出了 ReVisE:一种递归视觉解释算法。我们的方法迭代计算视觉特征(以文本输入为条件)、答案和解释,逐步提高解释质量,直到答案收敛。我们发现这种多步方法能够引导模型纠正自己的答案,并优于单步解释生成。此外,ReVisE 生成的解释还可以作为少样本自训练的有价值标注。我们的方法仅使用5%的人工标注解释,就在10项指标上超越了先前的方法,在 VCR 和 VQA-X 数据集上的 BLEU-1 分数分别最多提高4.2和1.3,体现了该方法的有效性和数据效率。
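
The recursive procedure is easy to sketch as a loop that stops once the answer stops changing. The `model` wrapper below (with `encode`, `answer`, `explain` methods) is hypothetical; the real system conditions visual features on text inside a vision-language transformer:

```python
def revise_loop(model, image, question, max_steps=5):
    """Iteratively recompute features conditioned on the current explanation,
    regenerate the answer and explanation, and stop at convergence."""
    explanation, prev_answer = "", None
    for _ in range(max_steps):
        feats = model.encode(image, question + " " + explanation)
        answer = model.answer(feats, question)
        explanation = model.explain(feats, question, answer)
        if answer == prev_answer:          # answer converged
            break
        prev_answer = answer
    return answer, explanation
```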

Point, Segment and Count: A Generalized Framework for Object Counting

  • paper_url: http://arxiv.org/abs/2311.12386
  • repo_url: None
  • paper_authors: Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, Hongming Shan
  • for: 本文提出了一种通用的对象计数框架,用于在图像中计算对象的数量,无需提供例示板或类名。
  • methods: 本文使用的方法包括SAM和CLIP两种基础模型,通过推断对象的分布来预测对象的数量。
  • results: 实验结果表明,PseCo在FSC-147数据集上达到了当前最佳性能,并在COCO和LVIS数据集上也取得了优秀的 результа。
    Abstract Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names, \emph{a.k.a} few-shot and zero-shot counting. Current state-of-the-art methods highly rely on density maps to predict object counts, which lacks model interpretability. In this paper, we propose a generalized framework for both few-shot and zero-shot object counting based on detection. Our framework combines the superior advantages of two foundation models without compromising their zero-shot capability: (\textbf{i}) SAM to segment all possible objects as mask proposals, and (\textbf{ii}) CLIP to classify proposals to obtain accurate object counts. However, this strategy meets the obstacles of efficiency overhead and the small crowded objects that cannot be localized and distinguished. To address these issues, our framework, termed PseCo, follows three steps: point, segment, and count. Specifically, we first propose a class-agnostic object localization to provide accurate but least point prompts for SAM, which consequently not only reduces computation costs but also avoids missing small objects. Furthermore, we propose a generalized object classification that leverages CLIP image/text embeddings as the classifier, following a hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals. Extensive experimental results on FSC-147 dataset demonstrate that PseCo achieves state-of-the-art performance in both few-shot/zero-shot object counting/detection, with additional results on large-scale COCO and LVIS datasets. The source code is available at \url{https://github.com/Hzzone/PseCo}.
    摘要 类别无关对象计数旨在根据示例框或类别名称统计图像中的所有目标对象,即少样本计数和零样本计数。当前最先进的方法高度依赖密度图来预测对象数量,缺乏模型可解释性。在这篇论文中,我们提出了一种基于检测的通用框架,同时适用于少样本和零样本对象计数。我们的框架结合了两种基础模型的优势而不削弱其零样本能力:SAM 用于将所有可能的对象分割为掩码提议,CLIP 用于对提议进行分类以获得准确的对象计数。然而,这一策略会遇到计算开销过高、以及拥挤的小对象难以定位和区分的问题。为了解决这些问题,我们的框架 PseCo 采用以下三步:点、分割、计数。具体来说,我们首先提出类别无关的对象定位,为 SAM 提供准确而数量最少的点提示,不仅降低计算成本,还避免漏检小对象。其次,我们提出一种通用对象分类方法,以 CLIP 图像/文本嵌入作为分类器,并通过层次知识蒸馏在层次掩码提议之间获得有判别力的分类。大量实验结果表明,PseCo 在 FSC-147 数据集上的少样本/零样本对象计数与检测任务中均达到当前最佳性能,并在大规模的 COCO 和 LVIS 数据集上给出了额外结果。源代码见 https://github.com/Hzzone/PseCo。
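
The point-segment-count pipeline reduces to three calls. The sketch below keeps the components abstract: `point_proposer`, `segmenter` (e.g. a SAM-style model prompted with points) and `classifier` (e.g. a CLIP-style scorer) are placeholder callables, and the 0.5 threshold is an assumption:

```python
def point_segment_count(image, class_text, point_proposer, segmenter, classifier, thr=0.5):
    """Count objects of class_text: class-agnostic points prompt mask proposals,
    each proposal is scored against the class name, accepted proposals are counted."""
    points = point_proposer(image)                       # (N, 2) candidate object centres
    masks = [segmenter(image, p) for p in points]        # one mask proposal per point prompt
    scores = [classifier(image, m, class_text) for m in masks]
    return sum(s > thr for s in scores)
```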

Text-Guided Texturing by Synchronized Multi-View Diffusion

  • paper_url: http://arxiv.org/abs/2311.12891
  • repo_url: None
  • paper_authors: Yuxin Liu, Minshan Xie, Hanyuan Liu, Tien-Tsin Wong
  • for: 本研究旨在Synthesize 3D 物体上的文本ure, 使用 pré-train 的 T2I 扩散模型。
  • methods: 我们提出了一种同步多视角扩散方法,使得不同视角的扩散过程可以在扩散过程中达成内容的协调,从而保证文本的一致性。在每个去噪步骤中,我们将不同视角的潜在内容在Texture 领域进行混合,特别是在视角之间的 overlap 区域。
  • results: 我们的方法在生成一致、不间断、高细节文本ure 方面达到了更高的性能,与当前状态艺技方法相比。
    Abstract This paper introduces a novel approach to synthesize texture to dress up a given 3D object, given a text prompt. Based on the pretrained text-to-image (T2I) diffusion model, existing methods usually employ a project-and-inpaint approach, in which a view of the given object is first generated and warped to another view for inpainting. But it tends to generate inconsistent texture due to the asynchronous diffusion of multiple views. We believe such asynchronous diffusion and insufficient information sharing among views are the root causes of the inconsistent artifact. In this paper, we propose a synchronized multi-view diffusion approach that allows the diffusion processes from different views to reach a consensus of the generated content early in the process, and hence ensures the texture consistency. To synchronize the diffusion, we share the denoised content among different views in each denoising step, specifically blending the latent content in the texture domain from views with overlap. Our method demonstrates superior performance in generating consistent, seamless, highly detailed textures, comparing to state-of-the-art methods.
    摘要 本文提出了一种新方法,根据文本提示为给定的3D物体合成贴图纹理。基于预训练的文本到图像(T2I)扩散模型,现有方法通常采用“投影-修补”策略:先生成物体某一视角的图像,再将其变换到另一视角进行修补。但由于多个视角的扩散过程彼此异步,这种做法往往产生不一致的纹理。我们认为,这种异步扩散以及视角之间信息共享不足正是不一致伪影的根源。本文提出了一种同步多视角扩散方法,使不同视角的扩散过程在早期就能就生成内容达成一致,从而保证纹理的一致性。为了实现同步,我们在每个去噪步骤中在不同视角之间共享去噪内容,具体做法是在纹理域中融合有重叠的视角的潜变量。与最先进的方法相比,我们的方法在生成一致、无缝、高细节纹理方面表现更优。
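
The per-step synchronization can be sketched as follows: after each denoising step, every view's latent is warped into the shared texture (UV) domain, overlapping texels are blended with per-view visibility weights, and the consensus is warped back. The warping callables and the weighted-average blending rule are assumptions for illustration:

```python
import torch

def synchronize_latents(latents, view_to_uv, uv_to_view, weights):
    """latents: list of per-view latent tensors; weights: list of (1, Hu, Wu) visibility maps.
    view_to_uv / uv_to_view are assumed warping functions built from the known camera poses."""
    uv_maps = torch.stack([view_to_uv(i, z) for i, z in enumerate(latents)])   # (V, C, Hu, Wu)
    w = torch.stack(weights)                                                   # (V, 1, Hu, Wu)
    blended = (uv_maps * w).sum(0) / w.sum(0).clamp_min(1e-6)                  # consensus texture latent
    return [uv_to_view(i, blended) for i in range(len(latents))]
```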

De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

  • paper_url: http://arxiv.org/abs/2311.12890
  • repo_url: None
  • paper_authors: Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
  • for: 解决复杂多步视觉任务
  • methods: 使用自动分解和自动反馈机制,将复杂任务 decomposed into simpler subtasks,并通过多模型的集成来提高逻辑推理性能
  • results: 在多种视觉任务上实现新的benchmark,创造更加准确和Robust的程序
    Abstract Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it advances in performing visual processing and reasoning in an unsupervised manner. Current visual programming methods generate programs in a single pass for each task where the ability to evaluate and optimize based on feedback, unfortunately, is lacking, which consequentially limits their effectiveness for complex, multi-step problems. Drawing inspiration from benders decomposition, we introduce De-fine, a general framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more accurate and robust programs, setting new benchmarks in the field.
    摘要 视觉编程是一种模块化、可泛化的范式,将不同的模块和 Python 运算符集成起来,以解决各种视觉语言任务。与需要任务特定数据的端到端模型不同,它能够以无监督的方式进行视觉处理和推理。现有的视觉编程方法对每个任务只一次性地生成程序,缺乏基于反馈进行评估和优化的能力,这限制了它们在复杂多步任务中的效果。受 Benders 分解的启发,我们提出了 De-fine,一个能够自动将复杂任务分解为更简单的子任务、并通过自动反馈来精炼程序的通用框架。这种与模型无关的方法通过整合多个模型的优势来提升逻辑推理性能。在多种视觉任务上的实验表明,De-fine 能生成更准确、更稳健的程序,为该领域树立了新的基准。

Semi-supervised Medical Image Segmentation via Query Distribution Consistency

  • paper_url: http://arxiv.org/abs/2311.12364
  • repo_url: None
  • paper_authors: Rong Wu, Dehua Li, Cong Zhang
  • for: 这篇论文主要针对医疗影像分割中的半监督学习问题进行研究,以提高医疗影像分割的精度和效率。
  • methods: 这篇论文提出了一个基于互学习的 Dual KMax UX-Net 框架,它结合了 3D UX-Net 元架构和 KMax 解码器,以优化医疗影像分割性能。
  • results: 实验结果显示,这个方法可以融合无标签数据来提升医疗影像分割性能。此外,这个方法在 10% 和 20% 标签设定下也能够超越现有的半监督学习方法。
    Abstract Semi-supervised learning is increasingly popular in medical image segmentation due to its ability to leverage large amounts of unlabeled data to extract additional information. However, most existing semi-supervised segmentation methods focus only on extracting information from unlabeled data. In this paper, we propose a novel Dual KMax UX-Net framework that leverages labeled data to guide the extraction of information from unlabeled data. Our approach is based on a mutual learning strategy that incorporates two modules: 3D UX-Net as our backbone meta-architecture and KMax decoder to enhance the segmentation performance. Extensive experiments on the Atrial Segmentation Challenge dataset have shown that our method can significantly improve performance by merging unlabeled data. Meanwhile, our framework outperforms state-of-the-art semi-supervised learning methods on 10\% and 20\% labeled settings. Code located at: https://github.com/Rows21/DK-UXNet.
    摘要 semi-supervised learning在医疗影像 segmentation 中日益受欢迎,因为它可以利用大量无标签数据来提取更多信息。然而,现有大多数 semi-supervised segmentation 方法都只是利用无标签数据来提取信息。在这篇论文中,我们提出了一种新的 Dual KMax UX-Net 框架,该框架利用标签数据来导引无标签数据中的信息提取。我们的方法基于一种互助学习策略,该策略包括3D UX-Net作为我们的基础元体系和 KMax decoder来提高分 segmentation 性能。我们在 Atrial Segmentation Challenge 数据集上进行了广泛的实验,发现我们的方法可以显著提高性能,并且在 10% 和 20% 标签设置下超越了现有的 semi-supervised learning 方法。代码位于:https://github.com/Rows21/DK-UXNet。

Modality Mixer Exploiting Complementary Information for Multi-modal Action Recognition

  • paper_url: http://arxiv.org/abs/2311.12344
  • repo_url: None
  • paper_authors: Sumin Lee, Sangmin Woo, Muhammad Adi Nugroho, Changick Kim
  • For: 本研究提出了一种新的网络模型,即Modality Mixer(M-Mixer)网络,用于多模态人体动作识别。M-Mixer网络可以有效利用不同模态之间的互补信息,并将其与时间上的动作内容相结合。
  • Methods: 本研究提出了一个新的模块,即Complementary Feature Extraction Module(CFEM),用于提取不同模态之间的互补信息。CFEM为每个模态使用专门的可学习查询嵌入,来引导其提取相应的互补信息和全局动作内容。
  • Results: 实验结果表明,我们提出的方法在NTU RGB+D 60、NTU RGB+D 120和NW-UCLA数据集上达到了最先进性能。此外,我们还进行了全面的消融研究,以验证所提方法的有效性。
    Abstract Due to the distinctive characteristics of sensors, each modality exhibits unique physical properties. For this reason, in the context of multi-modal action recognition, it is important to consider not only the overall action content but also the complementary nature of different modalities. In this paper, we propose a novel network, named Modality Mixer (M-Mixer) network, which effectively leverages and incorporates the complementary information across modalities with the temporal context of actions for action recognition. A key component of our proposed M-Mixer is the Multi-modal Contextualization Unit (MCU), a simple yet effective recurrent unit. Our MCU is responsible for temporally encoding a sequence of one modality (e.g., RGB) with action content features of other modalities (e.g., depth and infrared modalities). This process encourages M-Mixer network to exploit global action content and also to supplement complementary information of other modalities. Furthermore, to extract appropriate complementary information regarding to the given modality settings, we introduce a new module, named Complementary Feature Extraction Module (CFEM). CFEM incorporates sepearte learnable query embeddings for each modality, which guide CFEM to extract complementary information and global action content from the other modalities. As a result, our proposed method outperforms state-of-the-art methods on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, through comprehensive ablation studies, we further validate the effectiveness of our proposed method.
    摘要 因为各种感测器具有不同的特点,每种模式都具有独特的物理特性。因此,在多模式动作识别场景下,需要考虑不同模式之间的补做信息。在本文中,我们提出了一种新的网络模型,名为多模式混合网络(M-Mixer),它能够有效地利用不同模式之间的补做信息,并在时间上进行动作内容的推理。M-Mixer网络的关键组件是多模式Contextualization Unit(MCU),这是一个简单 yet efficient的循环单元。MCU负责在一个模式(例如RGB)中编码动作内容特征,并将其与其他模式(例如深度和红外模式)中的动作内容特征进行推理。这个过程会让M-Mixer网络能够利用全局的动作内容,同时也能够补做不同模式之间的补做信息。此外,为了提取适当的补做信息,我们引入了一个新的模块,名为补做特征提取模块(CFEM)。CFEM使用每个模式的特有学习查询嵌入,以导引CFEM提取适当的补做信息和全局动作内容。因此,我们的提出方法在NTU RGB+D 60、NTU RGB+D 120和NW-UCLA数据集上比州前方法表现出色。此外,我们还进行了详细的抽象研究,以验证我们的提出方法的有效性。
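
A minimal sketch of the Multi-modal Contextualization Unit: a recurrent cell that encodes one modality's frame sequence while being conditioned, at every step, on an action-content vector pooled from the other modalities. Using a GRU cell and simple concatenation is an implementation assumption:

```python
import torch
import torch.nn as nn

class MCU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.GRUCell(dim * 2, dim)

    def forward(self, primary_seq, other_content):
        # primary_seq: (T, B, D) e.g. RGB features; other_content: (B, D) pooled depth/IR content.
        h = torch.zeros_like(other_content)
        for t in range(primary_seq.shape[0]):
            h = self.cell(torch.cat([primary_seq[t], other_content], dim=-1), h)
        return h                                            # temporally encoded, context-aware feature
```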

LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

  • paper_url: http://arxiv.org/abs/2311.12342
  • repo_url: None
  • paper_authors: Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou
  • For: 本研究旨在提出一种无需训练的方法,用于将文本描述和空间布局转化为高质量图像。
  • Methods: 本方法引入了一种局部注意力约束,以准确地将各个对象放置在指定的区域中。此外,我们还提出了一种填充词符(padding token)约束,以利用此前被忽略的填充词符中的语义信息,从而避免生成对象的不良融合。
  • Results: 我们在多个测试基准上表明,我们的方法可以高效地解决先前方法中出现的语义失败问题,并且可以在无需训练的情况下与现有的文本到图像和布局到图像模型集成,从而大幅提高其性能。
    Abstract Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in accurately conveying fine-grained spatial compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and spatial layouts. Our method introduces a Localized Attention Constraint to refine cross-attention for individual objects, ensuring their precise placement in designated regions. We further propose a Padding Token Constraint to leverage the semantic information embedded in previously neglected padding tokens, thereby preventing the undesired fusion of synthesized objects. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, significantly amplifying their performance and effectively addressing semantic failures observed in prior methods. Through extensive experiments, we showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.
    摘要 最近的文本到图像扩散模型已达到历史最高水平,但它们均依赖文本提示,而不能准确表达细腻的空间组合。在这篇论文中,我们提出了LoCo,一种不需要训练的方法,可以高质量地将文本提示和空间布局转化为图像。我们的方法引入了本地化注意力约束,以确保各个 объек 的精确位置在指定的区域内。此外,我们还提出了补充符约束,以利用 previously neglected 的补充符嵌入 semantic information,以避免人工合成对象的不良融合。LoCo可以轻松地与现有的文本到图像和布局到图像模型集成,显著提高其性能,有效地解决先前方法中所见的语义失败。经过广泛的实验,我们证明了我们的方法的超越性,在多个benchmark上 both qualitatively and quantitatively surpassed existing training-free layout-to-image methods。
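
The localized attention constraint can be pictured as an energy that is minimized when all of an object token's cross-attention mass falls inside its layout box; the exact energy used in the paper may be formulated differently:

```python
import torch

def localized_attention_constraint(attn, region_mask):
    """attn: (H*W,) cross-attention an object token pays to image locations;
    region_mask: (H*W,) binary mask of the object's layout box."""
    attn = attn / attn.sum().clamp_min(1e-8)
    inside = (attn * region_mask).sum()
    return 1.0 - inside          # zero when the attention lies entirely inside the box
```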

Fine-Grained Open Domain Image Animation with Motion Guidance

  • paper_url: http://arxiv.org/abs/2311.12886
  • repo_url: None
  • paper_authors: Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang
  • for: 本研究旨在提供一种基于视频扩散模型的开放频谱图像动画方法,以便在多种自然环境中捕捉的图像动画 generation 中实现细化和可控的动画。
  • methods: 本方法使用目标动作区域指导和动作强度指导,以实现精细控制动画中的可动区域和其运动速度。
  • results: 经过严格的实验 validate,本方法在开放频谱dataset上显示出比较出色的性能,能够生成细化和交互式的动画序列。
    Abstract Image animation is a key task in computer vision which aims to generate dynamic visual content from static image. Recent image animation methods employ neural based rendering technique to generate realistic animations. Despite these advancements, achieving fine-grained and controllable image animation guided by text remains challenging, particularly for open-domain images captured in diverse real environments. In this paper, we introduce an open domain image animation method that leverages the motion prior of video diffusion model. Our approach introduces targeted motion area guidance and motion strength guidance, enabling precise control the movable area and its motion speed. This results in enhanced alignment between the animated visual elements and the prompting text, thereby facilitating a fine-grained and interactive animation generation process for intricate motion sequences. We validate the effectiveness of our method through rigorous experiments on an open-domain dataset, with the results showcasing its superior performance. The source code and model will be made publicly available upon publication.
    摘要 Image 动画是计算机视觉中关键任务之一,旨在将静止图像转化为动态视觉内容。现代图像动画技术使用神经网络基于渲染技术来生成真实的动画。然而,在文本指导下实现细化和可控的图像动画仍然是一项挑战,特别是在开放领域中拍摄的多样化实际环境中。在这篇论文中,我们提出了一种开放领域图像动画方法,利用视频扩散模型的运动先验。我们的方法包括targeted Motion Area指导和运动强度指导,以实现精细控制可动区域和其运动速度。这会使动画内的视觉元素更加与提示文本相对应,从而实现细化和交互的动画生成过程。我们通过对开放领域数据集进行严格的实验,证明了我们的方法的效果。代码和模型将在发表后公开。

ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability

  • paper_url: http://arxiv.org/abs/2311.12327
  • repo_url: https://github.com/anonymgiant/vilam
  • paper_authors: Xiaoyu Yang, Lijian Xu, Hongsheng Li, Shaoting Zhang
  • for: 本研究旨在应用语言模型于医学图像分析任务中,以提高人机交互和多模态任务的表现。
  • methods: 本研究提出了一种统一的视力语言变换模型(ViLaM),通过预训练语言模型的指令预测来优化知识和理解能力。文章使用冻结预训练encoder来编码和对应图像和文本特征,以便执行多种视觉任务。此外,文章还提出了一种循环训练策略来提高参照表达的质量。
  • results: 研究表明,ViLaM在公共普通数据集上显示出了极高的表现,并且在医学数据集上进行了验证和推广。此外,研究还发现了模型在零基础学习中的卓越表现,表明未来可能有很多应用场景。
    Abstract Vision-language models have revolutionized human-computer interaction and shown significant progress in multi-modal tasks. However, applying these models to complex visual tasks like medical image analysis remains challenging. In this study, we propose ViLaM, a unified Vision-Language transformer model that integrates instruction tuning predicated on a large language model. This approach enables us to optimally utilize the knowledge and reasoning capacities of large pre-trained language models for an array of tasks encompassing both language and vision. We employ frozen pre-trained encoders to encode and align both image and text features, enabling ViLaM to handle a variety of visual tasks following textual instructions. Besides, we've designed cycle training for referring expressions to address the need for high-quality, paired referring expression datasets for training large models in terms of both quantity and quality. We evaluated ViLaM's exceptional performance on public general datasets and further confirmed its generalizability on medical datasets. Importantly, we've observed the model's impressive zero-shot learning ability, indicating the potential future application of ViLaM in the medical field.
    摘要 现代视力语言模型已经改变人机交互方式,并在多模态任务上显示出了很大的进步。然而,将这些模型应用于复杂的视觉任务,如医疗图像分析,仍然是一个挑战。在这项研究中,我们提出了ViLaM,一种整合视力语言变换模型,它采用基于大型预训练语言模型的指令预测。这种方法允许我们最大限度利用大型预训练语言模型的知识和理解能力,以涵盖范围从语言到视觉的多种任务。我们使用冻结预训练的编码器来编码和对应图像和文本特征,使ViLaM能够处理多种视觉任务,根据文本指令进行执行。此外,我们还设计了循环训练来改进参照表达,以提高大型模型的训练数据质量和量。我们对公共普通数据集进行了评估,并证明了ViLaM在医疗领域的普适性。最重要的是,我们发现了ViLaM的很好的零基础学习能力,表明这个模型在未来可能在医疗领域发挥重要作用。

Long-MIL: Scaling Long Contextual Multiple Instance Learning for Histopathology Whole Slide Image Analysis

  • paper_url: http://arxiv.org/abs/2311.12885
  • repo_url: None
  • paper_authors: Honglin Li, Yunlong Zhang, Chenglu Zhu, Jiatong Cai, Sunyi Zheng, Lin Yang
  • for: This paper is written for developing a computer-aided diagnosis tool for histopathology images, specifically addressing the problem of shape varying large Whole Slide Images (WSIs) in previous Multi-Instance Learning (MIL) models.
  • methods: The proposed method, Long-contextual MIL (Long-MIL), introduces Linear Bias into Attention to amend position embedding for shape varying long-contextual WSIs, and utilizes Flash-Attention module to tackle computational complexity while maintaining full self-attention performance.
  • results: The proposed Long-MIL method is evaluated in extensive experiments on 4 datasets, covering WSI classification and survival prediction tasks, and demonstrates superiority on shape-varying WSIs compared to previous MIL models.
  • for: 这篇论文是为了开发一种基于计算机支持的 histopathology 图像诊断工具,特别是解决过去的 Multi-Instance Learning (MIL) 模型中 shape varying 大的 Whole Slide Images (WSIs) 问题。
  • methods: 提议的方法是 Long-contextual MIL (Long-MIL),它通过 Linear Bias into Attention 来修正 shape varying long-contextual WSIs 的位域嵌入,并使用 Flash-Attention module 来解决 transformer 模型的计算复杂性,保持了全自注意性能。
  • results: 提议的 Long-MIL 方法在四个数据集上进行了广泛的实验,包括 WSI 分类和存活预测任务,并在 shape varying WSIs 上表现出了超越性。
    Abstract Histopathology image analysis is the golden standard of clinical diagnosis for Cancers. In doctors daily routine and computer-aided diagnosis, the Whole Slide Image (WSI) of histopathology tissue is used for analysis. Because of the extremely large scale of resolution, previous methods generally divide the WSI into a large number of patches, then aggregate all patches within a WSI by Multi-Instance Learning (MIL) to make the slide-level prediction when developing computer-aided diagnosis tools. However, most previous WSI-MIL models using global-attention without pairwise interaction and any positional information, or self-attention with absolute position embedding can not well handle shape varying large WSIs, e.g. testing WSIs after model deployment may be larger than training WSIs, since the model development set is always limited due to the difficulty of histopathology WSIs collection. To deal with the problem, in this paper, we propose to amend position embedding for shape varying long-contextual WSI by introducing Linear Bias into Attention, and adapt it from 1-d long sequence into 2-d long-contextual WSI which helps model extrapolate position embedding to unseen or under-fitted positions. We further utilize Flash-Attention module to tackle the computational complexity of Transformer, which also keep full self-attention performance compared to previous attention approximation work. Our method, Long-contextual MIL (Long-MIL) are evaluated on extensive experiments including 4 dataset including WSI classification and survival prediction tasks to validate the superiority on shape varying WSIs. The source code will be open-accessed soon.
    摘要 血癌病理图像分析是诊断 golden standard 的临床应用。在医生日常 Routine 和计算机支持诊断中,整个染色体图像 (Whole Slide Image, WSI) 被用于分析。由于图像的尺度非常大,前一些方法通常将 WSI 分解成大量的小块,然后使用多例学习 (Multi-Instance Learning, MIL) 将所有块在 WSI 中聚合,以便在开发计算机支持诊断工具时进行满板级预测。然而,大多数前一些 WSI-MIL 模型使用全球注意力而无需对比和任何位置信息,或者使用自注意力并嵌入绝对位置信息,无法好好地处理变形大 WSI,例如在模型部署后测试 WSI 大于训练 WSI,因为模型开发集是由于病理图像收集的困难所受限。为解决这个问题,在本文中,我们提议修改位嵌入以适应形变大 WSI,并将它从一维长序列转换为二维长 contextual WSI,以帮助模型在未seen或压缩位置 extrapolate 位嵌入。我们还利用 Flash-Attention 模块来解决 transformer 的计算复杂性,并保持与前一些注意力缩写工作相同的全自注意性性能。我们的方法,长 contextual MIL (Long-MIL),在广泛的实验中证明其在形变 WSI 上的优越性。代码将很快公开。
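
Adapting a linear attention bias from 1-D sequences to 2-D slides can be sketched by penalizing attention between patches in proportion to their spatial distance, with an ALiBi-style per-head slope. The distance metric and slope schedule below are assumptions:

```python
import torch

def linear_bias_2d(coords, num_heads):
    """coords: (N, 2) grid positions of N patches from one WSI.
    Returns a (num_heads, N, N) additive attention bias."""
    dist = torch.cdist(coords.float(), coords.float())                     # pairwise Euclidean distance
    slopes = 2.0 ** (-8.0 * torch.arange(1, num_heads + 1) / num_heads)    # ALiBi-style head slopes
    return -slopes.view(num_heads, 1, 1) * dist

# Usage: scores = q @ k.transpose(-2, -1) / d ** 0.5 + linear_bias_2d(coords, heads)
```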

ABFL: Angular Boundary Discontinuity Free Loss for Arbitrary Oriented Object Detection in Aerial Images

  • paper_url: http://arxiv.org/abs/2311.12311
  • repo_url: None
  • paper_authors: Zifei Zhao, Shengyang Li
  • for: 寻找航空图像中的自由方向物体(AOOD)是一项广泛关注且高度挑战性的任务,在许多场景中具有重要作用。
  • methods: 本文使用了一种基于 von Mises 分布的ANGULAR BOUNDARY FREE LOSS(ABFL)来解决自由方向物体检测中的ANGULAR BOUNDARY DISCONTINUITY(ABD)问题。
  • results: 实验表明,提议的ABFL损失可以超越一些专门针对ABD问题的方法,并且不需要额外的编码-解码结构。
    Abstract Arbitrary oriented object detection (AOOD) in aerial images is a widely concerned and highly challenging task, and plays an important role in many scenarios. The core of AOOD involves the representation, encoding, and feature augmentation of oriented bounding-boxes (Bboxes). Existing methods lack intuitive modeling of angle difference measurement in oriented Bbox representations. Oriented Bboxes under different representations exhibit rotational symmetry with varying periods due to angle periodicity. The angular boundary discontinuity (ABD) problem at periodic boundary positions is caused by rotational symmetry in measuring angular differences. In addition, existing methods also use additional encoding-decoding structures for oriented Bboxes. In this paper, we design an angular boundary free loss (ABFL) based on the von Mises distribution. The ABFL aims to solve the ABD problem when detecting oriented objects. Specifically, ABFL proposes to treat angles as circular data rather than linear data when measuring angle differences, aiming to introduce angle periodicity to alleviate the ABD problem and improve the accuracy of angle difference measurement. In addition, ABFL provides a simple and effective solution for various periodic boundary discontinuities caused by rotational symmetry in AOOD tasks, as it does not require additional encoding-decoding structures for oriented Bboxes. Extensive experiments on the DOTA and HRSC2016 datasets show that the proposed ABFL loss outperforms some state-of-the-art methods focused on addressing the ABD problem.
    摘要 “自由方向物体检测(AOOD)在飞行图像中是广泛关注和高度挑战的任务,对多种场景都具有重要作用。AOOD的核心问题在于对 Orientated bounding-boxes(Bboxes)的表示、编码和特征增强。现有方法缺乏对角度差量的直观模型。Orientated Bboxes 在不同的表示下展现出固定的旋转对称性和不同的周期,由于角度对称性而导致的角度边界缺陷(ABD)问题在测量角度差量时出现。此外,现有方法还使用了额外的编码-解码结构。在这篇论文中,我们设计了一种角度边界无损损失(ABFL),基于韦氏分布。ABFL 目的是解决 ABD 问题在检测 Orientated 物体时。具体来说,ABFL 将 treat 角度为径向数据而不是线性数据,以引入角度对称性来缓解 ABD 问题并提高角度差量测量的准确性。此外,ABFL 还提供了一种简单有效的解决 Rotational Symmetry 导致的不同周期的径向边界缺陷,不需要额外的编码-解码结构。广泛的实验表明,提出的 ABFL 损失能够超越一些关注Addressing ABD 问题的方法。”
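
A loss of this kind can be sketched as the negative log-likelihood of a von Mises distribution centred on the ground-truth angle, so the penalty depends only on the cosine of the angular difference and is free of boundary discontinuities. The concentration value and this exact form are illustrative assumptions:

```python
import torch

def abfl_loss(pred_angle, gt_angle, kappa=2.0):
    """pred_angle, gt_angle: tensors of angles in radians.
    Von Mises negative log-likelihood: periodic in the angle difference, so a
    prediction of +179 deg against a target of -179 deg incurs a small penalty."""
    delta = pred_angle - gt_angle
    log_norm = torch.log(2 * torch.pi * torch.special.i0(torch.tensor(kappa)))
    nll = -kappa * torch.cos(delta) + log_norm
    return nll.mean()
```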

Challenges in Video-Based Infant Action Recognition: A Critical Examination of the State of the Art

  • paper_url: http://arxiv.org/abs/2311.12300
  • repo_url: https://github.com/ostadabbas/video-based-infant-action-recognition
  • paper_authors: Elaheh Hatamimajoumerd, Pooria Daneshvar Kakhaki, Xiaofei Huang, Lingfei Luan, Somaieh Amraee, Sarah Ostadabbas
  • for: 这个论文旨在提高宝宝的动作识别精度,以便提高宝宝的安全监测、发展阶段跟踪、早期发现发展延迟、促进父母宝宝关系、提高计算机辅助诊断和帮助科学界更好地理解婴儿发展。
  • methods: 这篇论文使用了新的 dataset “InfActPrimitive”,这个 dataset 包括5种重要的婴儿成长阶段动作类别,并使用了特殊的预处理技术来处理婴儿数据。AUTHORS 使用了当今最先进的 skeleton-based 动作识别模型进行了广泛的比较分析。
  • results: 研究发现,尽管 PoseC3D 模型的准确率达到了约71%,但其他模型几乎无法正确地捕捉婴儿动作的动态特征。这表明了成人动作识别领域和婴儿动作识别领域之间的知识差距很大,并且需要开发数据效率的管道模型。
    Abstract Automated human action recognition, a burgeoning field within computer vision, boasts diverse applications spanning surveillance, security, human-computer interaction, tele-health, and sports analysis. Precise action recognition in infants serves a multitude of pivotal purposes, encompassing safety monitoring, developmental milestone tracking, early intervention for developmental delays, fostering parent-infant bonds, advancing computer-aided diagnostics, and contributing to the scientific comprehension of child development. This paper delves into the intricacies of infant action recognition, a domain that has remained relatively uncharted despite the accomplishments in adult action recognition. In this study, we introduce a groundbreaking dataset called ``InfActPrimitive'', encompassing five significant infant milestone action categories, and we incorporate specialized preprocessing for infant data. We conducted an extensive comparative analysis employing cutting-edge skeleton-based action recognition models using this dataset. Our findings reveal that, although the PoseC3D model achieves the highest accuracy at approximately 71%, the remaining models struggle to accurately capture the dynamics of infant actions. This highlights a substantial knowledge gap between infant and adult action recognition domains and the urgent need for data-efficient pipeline models.
    摘要 自动人类动作识别,一个在计算机视觉领域崛起的新领域,具有广泛的应用领域,包括监控、安全、人机交互、医疗诊断和体育分析。准确地识别婴儿的动作可以涵盖多个重要目标,如安全监控、发展阶段跟踪、早期发现发展延迟、促进父婴关系、提高计算机辅助诊断和进一步科学成就儿童发展。本文探讨了婴儿动作识别领域的细节,这个领域在成人动作识别领域的成就之后,尚未得到充分的研究。在这个研究中,我们发布了一个名为“InfActPrimitive”的新的婴儿 milestone 动作类别集合,并对婴儿数据进行特殊的预处理。我们通过使用最新的骨架基本动作识别模型进行了广泛的比较分析。我们的发现表明,尽管poseC3D模型在约71%的准确率下表现最高,但其他模型在捕捉婴儿动作的dinamics方面具有巨大的难题。这 highlights a substantial knowledge gap between infant and adult action recognition domains and the urgent need for data-efficient pipeline models.

Instance-aware 3D Semantic Segmentation powered by Shape Generators and Classifiers

  • paper_url: http://arxiv.org/abs/2311.12291
  • repo_url: None
  • paper_authors: Bo Sun, Qixing Huang, Xiangru Huang
  • for: 提高3D semantic segmentation的精度和效果,增强模型对形状信息的感知和利用。
  • methods: combines several geometry processing tasks supervised at instance-level,包括形状生成和形状分类任务,以提高特征表示的精度和效果。
  • results: 在多个公共benchmark上显著超越现有方法,包括Waymo Open Dataset、SemanticKITTI和ScanNetV2等。
    Abstract Existing 3D semantic segmentation methods rely on point-wise or voxel-wise feature descriptors to output segmentation predictions. However, these descriptors are often supervised at point or voxel level, leading to segmentation models that can behave poorly at instance-level. In this paper, we proposed a novel instance-aware approach for 3D semantic segmentation. Our method combines several geometry processing tasks supervised at instance-level to promote the consistency of the learned feature representation. Specifically, our methods use shape generators and shape classifiers to perform shape reconstruction and classification tasks for each shape instance. This enforces the feature representation to faithfully encode both structural and local shape information, with an awareness of shape instances. In the experiments, our method significantly outperform existing approaches in 3D semantic segmentation on several public benchmarks, such as Waymo Open Dataset, SemanticKITTI and ScanNetV2.
    摘要 现有的3D语义分割方法通常基于点级或体积级特征描述器来输出分割预测。然而,这些描述器经常被supervised at point或体积级,导致分割模型可能会 behave poorly at instance级。在这篇论文中,我们提出了一种新的实例意识的方法 для3D语义分割。我们的方法结合了多个geometry处理任务supervised at instance级来提高学习的特征表示的一致性。特别是,我们的方法使用形态生成器和形态分类器来进行形态重建和分类任务 для每个形态实例。这种强制特征表示具有 faithful地编码结构和本地形态信息,同时具备对形态实例的意识。在实验中,我们的方法在多个公共标准测试集上显著超越现有的方法,包括 Waymo Open Dataset、SemanticKITTI和ScanNetV2。

Procedural Generation of Grain Orientations using the Wave Function Collapse Algorithm

  • paper_url: http://arxiv.org/abs/2311.12272
  • repo_url: None
  • paper_authors: G. Magny-Fokam, D. Madisetti, J. El-Awady
  • for: 该论文旨在研究如何利用参考电子背散射衍射(EBSD)图生成统计上相似的晶粒微观结构,以便进一步分析金属的变形与失效行为。
  • methods: 该论文考察了两种程序化生成晶粒微观结构的方法:波函数坍缩(WFC)算法和Markov Junior。
  • results: 研究发现,Markov Junior比WFC算法更有效,能够生成与参考EBSD图统计相似的Voronoi镶嵌,并可在Python中用于生成EBSD图。将参考EBSD与生成的EBSD进行比较后发现,两者的取向和体积分数极为相似,因此可以认为Markov Junior是重建具有代表性晶粒微观结构的有效机器学习工具。
    Abstract Statistics of grain sizes and orientations in metals correlate to the material's mechanical properties. Reproducing representative volume elements for further analysis of deformation and failure in metals, like 316L stainless steel, is particularly important due to their wide use in manufacturing goods today. Two approaches, initially created for video games, were considered for the procedural generation of representative grain microstructures. The first is the Wave Function Collapse (WFC) algorithm, and the second is constraint propagation and probabilistic inference through Markov Junior, a free and open-source software. This study aimed to investigate these two algorithms' effectiveness in using reference electron backscatter diffraction (EBSD) maps and recreating a statistically similar one that could be used in further research. It utilized two stainless steel EBSD maps as references to test both algorithms. First, the WFC algorithm was too constricting and, thus, incapable of producing images that resembled EBSDs. The second, MarkovJunior, was much more effective in creating a Voronoi tessellation that could be used to create an EBSD map in Python. When comparing the results between the reference and the generated EBSD, we discovered that the orientation and volume fractions were extremely similar. With the study, it was concluded that MarkovJunior is an effective machine learning tool that can reproduce representative grain microstructures.
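
For readers who want to experiment with the Voronoi-tessellation step described above without MarkovJunior, the sketch below builds a discrete Voronoi grain map with numpy/scipy and assigns each grain a random Euler-angle orientation, from which area (volume) fractions could be compared against a reference EBSD map. It is an illustrative stand-in under assumed parameters (grid size, grain count, uniform orientation sampling), not the study's pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def voronoi_grain_map(shape=(256, 256), n_grains=80, seed=0):
    """Label each pixel with the index of its nearest seed point (a discrete
    Voronoi tessellation) and assign each grain a random Euler-angle triple.
    Illustrative only; the study builds its tessellation with MarkovJunior."""
    rng = np.random.default_rng(seed)
    seeds = rng.uniform([0, 0], shape, size=(n_grains, 2))
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    pixels = np.stack([yy.ravel(), xx.ravel()], axis=1)
    _, labels = cKDTree(seeds).query(pixels)   # nearest seed per pixel
    labels = labels.reshape(shape)
    # One (phi1, Phi, phi2) orientation per grain, in degrees.
    orientations = rng.uniform([0, 0, 0], [360, 180, 360], size=(n_grains, 3))
    return labels, orientations

labels, oris = voronoi_grain_map()
areas = np.bincount(labels.ravel()) / labels.size
print(areas[:5], oris[0])  # grain area fractions and one sample orientation
```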

Boosting Audio-visual Zero-shot Learning with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.12268
  • repo_url: https://github.com/chenhaoxing/KDA
  • paper_authors: Haoxing Chen, Yaohui Li, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Huijia Zhu, Weiqiang Wang
  • for: 该论文主要研究音视频零样本学习问题,即基于已见类别的成对音视频序列来识别未见类别。
  • methods: 该论文提出了一个简单而有效的框架——知识感知分布适应(KDA),以帮助模型更好地理解未见动作的内容。具体而言,该框架利用大型语言模型根据类别名称生成详细描述,并引入分布对齐损失和知识感知自适应边距损失,以进一步提升对未见类别的泛化能力。
  • results: 实验结果表明,与现有方法相比,所提出的KDA在三个常用的音视频零样本学习数据集上取得了更高的识别性能。
    Abstract Audio-visual zero-shot learning aims to recognize unseen categories based on paired audio-visual sequences. Recent methods mainly focus on learning aligned and discriminative multi-modal features to boost generalization towards unseen categories. However, these approaches ignore the obscure action concepts in category names and may inevitably introduce complex network structures with difficult training objectives. In this paper, we propose a simple yet effective framework named Knowledge-aware Distribution Adaptation (KDA) to help the model better grasp the novel action contents with an external knowledge base. Specifically, we first propose using large language models to generate rich descriptions from category names, which leads to a better understanding of unseen categories. Additionally, we propose a distribution alignment loss as well as a knowledge-aware adaptive margin loss to further improve the generalization ability towards unseen categories. Extensive experimental results demonstrate that our proposed KDA can outperform state-of-the-art methods on three popular audio-visual zero-shot learning datasets. Our code will be available at \url{https://github.com/chenhaoxing/KDA}.
    摘要 音视频零样本学习旨在基于成对的音视频序列识别未见类别。现有方法主要侧重于学习对齐且具判别力的多模态特征,以提升对未见类别的泛化能力。然而,这些方法忽略了类别名称中含义模糊的动作概念,并且往往会引入结构复杂、训练目标困难的网络。在本文中,我们提出了一个简单而有效的框架——知识感知分布适应(KDA),借助外部知识库帮助模型更好地理解新的动作内容。具体而言,我们首先提出利用大型语言模型根据类别名称生成丰富的描述,从而更好地理解未见类别。此外,我们还提出了分布对齐损失以及知识感知自适应边距损失,以进一步提升对未见类别的泛化能力。大量实验结果表明,所提出的KDA在三个常用的音视频零样本学习数据集上优于当前最先进的方法。我们的代码将在 \url{https://github.com/chenhaoxing/KDA} 提供。
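
The sketch below illustrates, in simplified form, the two ingredients named in the abstract: a cosine classifier over class prototypes derived from LLM-generated descriptions with a per-class margin subtracted from the target logit, and a crude distribution-alignment term that pulls class-conditional feature means toward their text prototypes. These are hedged stand-ins for KDA's actual losses; the tensor shapes, the scale factor, and the mean-matching form are all assumptions.

```python
import torch
import torch.nn.functional as F

def margin_cosine_loss(av_feats, labels, class_protos, margins, scale=10.0):
    """Cosine-similarity classification against description-derived class
    prototypes, with a per-class margin subtracted from the target logit.
    A simplified stand-in for a knowledge-aware adaptive margin loss."""
    av = F.normalize(av_feats, dim=-1)
    protos = F.normalize(class_protos, dim=-1)
    logits = av @ protos.t()                                   # (B, C) cosines
    logits = scale * (logits - margins[None, :] * F.one_hot(labels, protos.size(0)))
    return F.cross_entropy(logits, labels)

def mean_alignment_loss(av_feats, class_protos, labels):
    """Pull the batch mean of each class's audio-visual features toward its
    text prototype (a crude distribution-alignment surrogate)."""
    loss, count = 0.0, 0
    for c in labels.unique():
        mask = labels == c
        loss = loss + F.mse_loss(av_feats[mask].mean(0), class_protos[c])
        count += 1
    return loss / max(count, 1)

# Toy usage with random tensors standing in for encoder outputs.
B, C, D = 8, 5, 128
feats = torch.randn(B, D)
protos = torch.randn(C, D)        # e.g. embeddings of LLM-generated descriptions
y = torch.randint(0, C, (B,))
margins = torch.full((C,), 0.2)
loss = margin_cosine_loss(feats, y, protos, margins) + 0.1 * mean_alignment_loss(feats, protos, y)
print(loss.item())
```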

Virtual Home Staging: Inverse Rendering and Editing an Indoor Panorama under Natural Illumination

  • paper_url: http://arxiv.org/abs/2311.12265
  • repo_url: https://github.com/gzhji/cali-hdr-dataset
  • paper_authors: Guanzhou Ji, Azadeh O. Sawyer, Srinivasa G. Narasimhan
  • for: 该论文旨在解决在自然光照条件下,为现有室内全景图更换家具布局的问题。
  • methods: 该方法包含三个关键组成部分:(1)检测并移除全景图中的家具;(2)自动生成室内地板布局设计;(3)结合场景几何、新的家具对象和实时室外照片进行全局渲染。
  • results: 该方法能够在不同的室外光照条件下重新渲染室内场景,并贡献了一个新的标定HDR(Cali-HDR)数据集,包含137张标定的室内全景图及其对应的室外照片。
    Abstract We propose a novel inverse rendering method that enables the transformation of existing indoor panoramas with new indoor furniture layouts under natural illumination. To achieve this, we captured indoor HDR panoramas along with real-time outdoor hemispherical HDR photographs. Indoor and outdoor HDR images were linearly calibrated with measured absolute luminance values for accurate scene relighting. Our method consists of three key components: (1) panoramic furniture detection and removal, (2) automatic floor layout design, and (3) global rendering with scene geometry, new furniture objects, and a real-time outdoor photograph. We demonstrate the effectiveness of our workflow in rendering indoor scenes under different outdoor illumination conditions. Additionally, we contribute a new calibrated HDR (Cali-HDR) dataset that consists of 137 calibrated indoor panoramas and their associated outdoor photographs. The source code and dataset are available: https://github.com/Gzhji/Cali-HDR-Dataset.
    摘要 我们提出了一种新的逆向渲染方法,能够在自然光照下将现有的室内全景图转换为具有新家具布局的图像。为此,我们采集了室内HDR全景图以及实时的室外半球HDR照片,并利用实测的绝对亮度值对室内和室外HDR图像进行线性标定,以实现准确的场景重打光。我们的方法包含三个关键组成部分:(1)检测并移除全景图中的家具;(2)自动生成室内地板布局设计;(3)结合场景几何、新的家具对象和实时室外照片进行全局渲染。我们验证了该工作流程在不同室外光照条件下渲染室内场景的有效性,并贡献了一个新的标定HDR(Cali-HDR)数据集,包含137张标定的室内全景图及其对应的室外照片。源代码和数据集可在以下地址获取:https://github.com/Gzhji/Cali-HDR-Dataset。
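
As a rough illustration of the linear calibration step mentioned in the abstract, the snippet below scales a relative-radiance HDR image so that the mean luminance of a reference patch matches a spot-measured absolute luminance in cd/m². The Rec. 709 luma weights, the patch convention, and the fictitious 250 cd/m² reading are assumptions; the paper's calibration procedure may differ.

```python
import numpy as np

def calibrate_hdr(hdr_rgb, patch, measured_cd_m2):
    """Linearly scale an HDR image (relative radiance) so that the mean
    luminance of a reference patch equals a spot-measured absolute luminance
    in cd/m^2. Uses Rec. 709 luma weights; a simplified sketch of the
    calibration step, not the paper's exact procedure."""
    y0, y1, x0, x1 = patch
    lum = hdr_rgb @ np.array([0.2126, 0.7152, 0.0722])   # relative luminance map
    scale = measured_cd_m2 / lum[y0:y1, x0:x1].mean()
    return hdr_rgb * scale, scale

# Toy usage: a synthetic 4x4 HDR image and a fictitious 250 cd/m^2 reading.
img = np.random.rand(4, 4, 3).astype(np.float32)
calibrated, k = calibrate_hdr(img, (1, 3, 1, 3), 250.0)
print(k)
```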