cs.CV - 2023-10-11

Dynamic Appearance Particle Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2310.07916
  • repo_url: None
  • paper_authors: Ancheng Lin, Jun Li
  • for: Modelling dynamic 3D scenes. Dynamic NeRFs extend the NeRF model to time-varying content, but existing dynamic NeRFs use a similar Eulerian representation for both radiance and deformation, which tightly couples appearance and motion and lacks a physical interpretation.
  • methods: We propose the Dynamic Appearance Particle Neural Radiance Field (DAP-NeRF), which uses a particle-based representation to model the motion of visual elements in a dynamic scene. DAP-NeRF consists of a static field and a dynamic field; the dynamic field is quantised as a collection of appearance particles, each carrying the visual information of a small scene element and equipped with its own motion model. All components, including the static field and the particles' visual features and motion models, are learned from monocular videos without any prior knowledge of the scene.
  • results: We construct a new dataset to evaluate motion modelling and develop an efficient computational framework. Experiments show that DAP-NeRF captures not only the appearance but also the physically meaningful motion in dynamic 3D scenes.
    Abstract Neural Radiance Fields (NeRFs) have shown great potential in modelling 3D scenes. Dynamic NeRFs extend this model by capturing time-varying elements, typically using deformation fields. The existing dynamic NeRFs employ a similar Eulerian representation for both light radiance and deformation fields. This leads to a close coupling of appearance and motion and lacks a physical interpretation. In this work, we propose Dynamic Appearance Particle Neural Radiance Field (DAP-NeRF), which introduces particle-based representation to model the motions of visual elements in a dynamic 3D scene. DAP-NeRF consists of a superposition of a static field and a dynamic field. The dynamic field is quantised as a collection of appearance particles, each of which carries the visual information of a small dynamic element in the scene and is equipped with a motion model. All components, including the static field, the visual features and motion models of the particles, are learned from monocular videos without any prior geometric knowledge of the scene. We develop an efficient computational framework for the particle-based model. We also construct a new dataset to evaluate motion modelling. Experimental results show that DAP-NeRF is an effective technique to capture not only the appearance but also the physically meaningful motions in a 3D dynamic scene.
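    As a rough illustration of the particle-based idea described above, the sketch below superposes a static field with a dynamic field made of appearance particles, each holding a feature vector and a tiny motion MLP. It is not the authors' implementation; the class names, the Gaussian blending kernel, and all dimensions are assumptions.

```python
# Minimal sketch of a particle-based dynamic field superposed with a static field.
# Not the authors' implementation; kernel choice and dimensions are assumptions.
import torch
import torch.nn as nn

class AppearanceParticles(nn.Module):
    """Each particle carries a feature vector and a tiny motion MLP t -> offset."""
    def __init__(self, num_particles: int, feat_dim: int = 16):
        super().__init__()
        self.base_pos = nn.Parameter(torch.rand(num_particles, 3))    # canonical positions
        self.features = nn.Parameter(torch.randn(num_particles, feat_dim))
        self.motion = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                                    nn.Linear(32, 3 * num_particles))
        self.num_particles = num_particles

    def positions(self, t: torch.Tensor) -> torch.Tensor:
        # t: scalar time; returns (num_particles, 3) particle positions at time t
        return self.base_pos + self.motion(t.view(1, 1)).view(self.num_particles, 3)

def query_dynamic_field(particles, x, t, bandwidth=0.05):
    """Blend particle features at query points x (N, 3) with a Gaussian kernel."""
    pos = particles.positions(t)                           # (P, 3)
    d2 = torch.cdist(x, pos) ** 2                          # (N, P) squared distances
    w = torch.softmax(-d2 / (2 * bandwidth ** 2), dim=-1)  # soft nearest-particle weights
    return w @ particles.features                          # (N, feat_dim)

# Superpose with a static field (any MLP mapping position to features) before decoding radiance.
static_field = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 16))
particles = AppearanceParticles(num_particles=256)
x, t = torch.rand(1024, 3), torch.tensor(0.3)
feat = static_field(x) + query_dynamic_field(particles, x, t)   # superposition of the two fields
```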

NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration

  • paper_url: http://arxiv.org/abs/2310.07896
  • repo_url: None
  • paper_authors: Ajay Sridhar, Dhruv Shah, Catherine Glossop, Sergey Levine
  • for: Providing a single diffusion policy that can perform both task-oriented (goal-directed) navigation and task-agnostic (goal-free) exploration, improving robotic navigation in unfamiliar environments.
  • methods: A large-scale Transformer-based policy, trained on data from multiple ground robots, is combined with a diffusion-model decoder so that one model flexibly handles both kinds of navigation.
  • results: Experiments on a real-world mobile robot platform show better overall performance and significantly lower collision rates than five alternative methods, despite using smaller models than state-of-the-art approaches.
    Abstract Robotic learning for navigation in unfamiliar environments needs to provide policies for both task-oriented navigation (i.e., reaching a goal that the robot has located), and task-agnostic exploration (i.e., searching for a goal in a novel setting). Typically, these roles are handled by separate models, for example by using subgoal proposals, planning, or separate navigation strategies. In this paper, we describe how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration, with the latter providing the ability to search novel environments, and the former providing the ability to reach a user-specified goal once it has been located. We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments, as compared to approaches that use subgoal proposals from generative models, or prior methods based on latent variable models. We instantiate our method by using a large-scale Transformer-based policy trained on data from multiple ground robots, with a diffusion model decoder to flexibly handle both goal-conditioned and goal-agnostic navigation. Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods, and demonstrate significant improvements in performance and lower collision rates, despite utilizing smaller models than state-of-the-art approaches. For more videos, code, and pre-trained model checkpoints, see https://general-navigation-models.github.io/nomad/
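    The sketch below illustrates only the goal-masking idea: during training the goal token is randomly replaced with a learned mask token so one policy covers both goal-conditioned and goal-agnostic behaviour, and at test time the mask is forced on or off. It is not the released NoMaD code; the module name, dimensions, and masking probability are assumptions, and the diffusion decoder is omitted.

```python
# Minimal sketch of goal masking; not the released NoMaD code. Names, sizes and
# the masking probability are assumptions; the diffusion decoder is omitted.
import torch
import torch.nn as nn

class GoalMaskedEncoder(nn.Module):
    """Encodes observation and goal tokens; the goal token is replaced by a learned
    mask token with probability p_mask so the same policy also learns exploration."""
    def __init__(self, obs_dim=512, goal_dim=512, d_model=256, p_mask=0.5):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.goal_proj = nn.Linear(goal_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))   # stands in for "no goal"
        self.p_mask = p_mask
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

    def forward(self, obs_feat, goal_feat, force_mask=None):
        B = obs_feat.shape[0]
        goal_tok = self.goal_proj(goal_feat)
        if force_mask is None:                                  # training: random masking
            mask = torch.rand(B, device=obs_feat.device) < self.p_mask
        else:                                                   # test: explore (True) or reach goal (False)
            mask = torch.full((B,), force_mask, dtype=torch.bool, device=obs_feat.device)
        goal_tok = torch.where(mask[:, None], self.mask_token.expand(B, -1), goal_tok)
        tokens = torch.stack([self.obs_proj(obs_feat), goal_tok], dim=1)   # (B, 2, d_model)
        return self.backbone(tokens).mean(dim=1)                # context for the diffusion decoder

enc = GoalMaskedEncoder()
ctx = enc(torch.randn(8, 512), torch.randn(8, 512), force_mask=True)   # goal-agnostic exploration
```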

Unsupervised Structured Noise Removal with Variational Lossy Autoencoder

  • paper_url: http://arxiv.org/abs/2310.07887
  • repo_url: https://github.com/krulllab/DVLAE
  • paper_authors: Benjamin Salmon, Alexander Krull
  • for: An unsupervised deep-learning method that removes structured imaging noise in microscopy without access to any clean images or a noise model.
  • methods: A Variational Autoencoder (VAE) with a specially designed autoregressive decoder that can model the noise component of an image but cannot independently model the underlying clean signal, so the encoder learns to encode only the clean content.
  • results: Experiments show the approach surpasses existing self- and unsupervised denoising methods and is robust to the size of the autoregressive receptive field. Code: https://github.com/krulllab/DVLAE.
    Abstract Most unsupervised denoising methods are based on the assumption that imaging noise is either pixel-independent, i.e., spatially uncorrelated, or signal-independent, i.e., purely additive. However, in practice many imaging setups, especially in microscopy, suffer from a combination of signal-dependent noise (e.g. Poisson shot noise) and axis-aligned correlated noise (e.g. stripe shaped scanning or readout artifacts). In this paper, we present the first unsupervised deep learning-based denoiser that can remove this type of noise without access to any clean images or a noise model. Unlike self-supervised techniques, our method does not rely on removing pixels by masking or subsampling so can utilize all available information. We implement a Variational Autoencoder (VAE) with a specially designed autoregressive decoder capable of modelling the noise component of an image but incapable of independently modelling the underlying clean signal component. As a consequence, our VAE's encoder learns to encode only underlying clean signal content and to discard imaging noise. We also propose an additional decoder for mapping the encoder's latent variables back into image space, thereby sampling denoised images. Experimental results demonstrate that our approach surpasses existing methods for self- and unsupervised image denoising while being robust with respect to the size of the autoregressive receptive field. Code for this project can be found at https://github.com/krulllab/DVLAE.

A Survey of Feature Types and Their Contributions for Camera Tampering Detection

  • paper_url: http://arxiv.org/abs/2310.07886
  • repo_url: None
  • paper_authors: Pranav Mantini, Shishir K. Shah
  • for: Camera tampering detection, i.e., detecting unauthorized and unintentional alterations of surveillance cameras by analyzing the video.
  • methods: Tampering detection is cast as a change detection and time series analysis problem; the paper reviews the existing literature with emphasis on feature types and designs experiments to study their robustness and detection capability.
  • results: Ten features are computed on real-world surveillance video, time series analysis is applied to ascertain their predictability and ability to detect tampering, and the performance of various time series models is quantified for each feature type.
    Abstract Camera tamper detection is the ability to detect unauthorized and unintentional alterations in surveillance cameras by analyzing the video. Camera tampering can occur due to natural events or it can be caused intentionally to disrupt surveillance. We cast tampering detection as a change detection problem, and perform a review of the existing literature with emphasis on feature types. We formulate tampering detection as a time series analysis problem, and design experiments to study the robustness and capability of various feature types. We compute ten features on real-world surveillance video and apply time series analysis to ascertain their predictability, and their capability to detect tampering. Finally, we quantify the performance of various time series models using each feature type to detect tampering.
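    A minimal sketch of the time-series view of tampering detection described above: compute a per-frame feature and flag frames whose one-step prediction residual under a simple AR(1) model is abnormally large. The edge-density feature, the AR(1) choice, and the threshold are assumptions for illustration, not the survey's protocol.

```python
# Minimal sketch: monitor a per-frame feature as a time series and flag frames whose
# one-step AR(1) prediction residual is abnormally large. Feature choice, model and
# threshold are illustrative assumptions, not the survey's protocol.
import numpy as np

def edge_density(frame: np.ndarray) -> float:
    """Example per-frame feature: fraction of pixels with a strong gradient."""
    gy, gx = np.gradient(frame.astype(float))
    return float((np.hypot(gx, gy) > 30).mean())

def detect_tampering(feature_series, train_len=200, k=4.0):
    """Fit AR(1) on an initial clean window, then flag frames with large residuals."""
    x = np.asarray(feature_series, dtype=float)
    a, b = np.polyfit(x[:train_len - 1], x[1:train_len], deg=1)   # least-squares AR(1)
    resid = x[1:] - (a * x[:-1] + b)
    sigma = resid[:train_len - 1].std() + 1e-8
    return np.where(np.abs(resid) > k * sigma)[0] + 1             # flagged frame indices

# Synthetic example: a covered lens after frame 300 collapses the edge density.
rng = np.random.default_rng(0)
series = np.concatenate([0.2 + 0.01 * rng.standard_normal(300),
                         0.02 + 0.01 * rng.standard_normal(100)])
print(detect_tampering(series)[:5])
```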

BrainVoxGen: Deep learning framework for synthesis of Ultrasound to MRI

  • paper_url: http://arxiv.org/abs/2310.08608
  • repo_url: None
  • paper_authors: Shubham Singh, Dr. Mrunal Bewoor, Ammar Ranapurwala, Satyam Rai, Sheetal Patil
  • for: A deep learning framework for synthesizing 3D MRI volumes from three-dimensional ultrasound images of the brain.
  • methods: A Pix2Pix GAN: the 3D ultrasound volume is fed into a UNET generator and a patch discriminator to generate the corresponding 3D MRI volume.
  • results: The synthesized MRI volumes exhibit some similarity to the expected outcomes, providing a satisfactory baseline for further research despite challenges related to dataset size, computational resources, and technical complexity.
    Abstract The study presents a deep learning framework aimed at synthesizing 3D MRI volumes from three-dimensional ultrasound images of the brain utilizing the Pix2Pix GAN model. The process involves inputting a 3D volume of ultrasounds into a UNET generator and patch discriminator, generating a corresponding 3D volume of MRI. Model performance was evaluated using losses on the discriminator and generator applied to a dataset of 3D ultrasound and MRI images. The results indicate that the synthesized MRI images exhibit some similarity to the expected outcomes. Despite challenges related to dataset size, computational resources, and technical complexities, the method successfully generated MRI volume with a satisfactory similarity score meant to serve as a baseline for further research. It underscores the potential of deep learning-based volume synthesis techniques for ultrasound to MRI conversion, showcasing their viability for medical applications. Further refinement and exploration are warranted for enhanced clinical relevance.

CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

  • paper_url: http://arxiv.org/abs/2310.07855
  • repo_url: None
  • paper_authors: Tim Lebailly, Thomas Stegmüller, Behzad Bozorgtabar, Jean-Philippe Thiran, Tinne Tuytelaars
  • for: Improving the accuracy and efficiency of dense visual representation learning.
  • methods: Object-level nearest neighbor bootstrapping is applied throughout training, so that each object in a scene is bootstrapped at a finer granularity, improving performance on scene-centric data with multiple objects.
  • results: CrIBo achieves state-of-the-art performance on in-context learning tasks that use nearest neighbor retrieval at test time, and is highly competitive on standard downstream segmentation tasks.
    Abstract Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models will be publicly available upon acceptance.
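    A minimal sketch of cross-image object-level bootstrapping with nearest-neighbor retrieval: each online object embedding is matched to its nearest neighbor in a memory bank of embeddings from other images, and a BYOL-style loss pulls it toward that neighbor. The memory bank, loss form, and dimensions are assumptions, not the authors' code.

```python
# Minimal sketch of cross-image object-level nearest-neighbor bootstrapping with a
# BYOL-style loss; memory bank, loss form and dimensions are assumptions.
import torch
import torch.nn.functional as F

class ObjectMemoryBank:
    """FIFO queue of normalized object-level embeddings from past images."""
    def __init__(self, dim=256, size=4096):
        self.bank = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    def nearest(self, q):
        sim = q @ self.bank.T                       # cosine similarity, q is normalized
        return self.bank[sim.argmax(dim=1)]         # cross-image nearest neighbors

    def enqueue(self, k):
        idx = (self.ptr + torch.arange(k.shape[0])) % self.bank.shape[0]
        self.bank[idx] = k.detach()
        self.ptr = int((self.ptr + k.shape[0]) % self.bank.shape[0])

def object_bootstrapping_loss(online_obj, target_obj, bank):
    """Each online object embedding is pulled toward the target embedding of its
    nearest neighbor retrieved from other images."""
    q = F.normalize(online_obj, dim=1)
    k = F.normalize(target_obj, dim=1)
    nn_targets = bank.nearest(q)
    loss = (2 - 2 * (q * nn_targets).sum(dim=1)).mean()   # negative cosine, BYOL-style
    bank.enqueue(k)
    return loss

bank = ObjectMemoryBank()
loss = object_bootstrapping_loss(torch.randn(16, 256), torch.randn(16, 256), bank)
```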

Explorable Mesh Deformation Subspaces from Unstructured Generative Models

  • paper_url: http://arxiv.org/abs/2310.07814
  • repo_url: None
  • paper_authors: Arman Maesumi, Paul Guerrero, Vladimir G. Kim, Matthew Fisher, Siddhartha Chaudhuri, Noam Aigerman, Daniel Ritchie
  • for: Exploring variations of 3D shapes: constructing an easily navigable 2D exploration space over a given set of landmark shapes and mapping it into a subspace of a pretrained generative model, so that high-quality variations between these shapes can be explored.
  • methods: A mapping from the 2D exploration space into the generative model's high-dimensional latent space is found that spans the input landmark shapes and exhibits smooth variations between them; the variations in this subspace are then turned into deformation fields that transfer the variations to the original high-quality meshes.
  • results: The method produces visually pleasing and easily navigable 2D exploration spaces for several shape categories, comparing favourably with prior work on learning deformation spaces for 3D shapes.
    Abstract Exploring variations of 3D shapes is a time-consuming process in traditional 3D modeling tools. Deep generative models of 3D shapes often feature continuous latent spaces that can, in principle, be used to explore potential variations starting from a set of input shapes. In practice, doing so can be problematic: latent spaces are high dimensional and hard to visualize, contain shapes that are not relevant to the input shapes, and linear paths through them often lead to sub-optimal shape transitions. Furthermore, one would ideally be able to explore variations in the original high-quality meshes used to train the generative model, not its lower-quality output geometry. In this paper, we present a method to explore variations among a given set of landmark shapes by constructing a mapping from an easily-navigable 2D exploration space to a subspace of a pre-trained generative model. We first describe how to find a mapping that spans the set of input landmark shapes and exhibits smooth variations between them. We then show how to turn the variations in this subspace into deformation fields, to transfer those variations to high-quality meshes for the landmark shapes. Our results show that our method can produce visually-pleasing and easily-navigable 2D exploration spaces for several different shape categories, especially as compared to prior work on learning deformation spaces for 3D shapes.
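    As a rough illustration of navigating a 2D exploration space that maps into a generative model's latent space, the sketch below blends the latent codes of landmark shapes with RBF weights. The RBF scheme and the corner layout are assumptions; the paper learns this mapping and additionally converts the result into deformation fields for the original meshes.

```python
# Rough illustration: map a 2D exploration coordinate to a latent code by blending
# landmark latent codes with RBF weights. The RBF scheme and corner layout are
# assumptions; the paper learns this mapping and outputs deformation fields.
import numpy as np

def explore(uv, anchors_2d, anchors_latent, eps=4.0):
    """uv: (2,) point in the 2D exploration space -> smoothly blended latent code."""
    d2 = ((anchors_2d - uv) ** 2).sum(axis=1)
    w = np.exp(-eps * d2)
    w /= w.sum()
    return w @ anchors_latent                      # (latent_dim,)

anchors_2d = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # 4 landmark shapes
anchors_latent = np.random.randn(4, 128)           # their latent codes (placeholders)
z = explore(np.array([0.3, 0.7]), anchors_2d, anchors_latent)
```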

CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.07794
  • repo_url: None
  • paper_authors: Changhe Chen, Mozhgan Pourkeshavarz, Amir Rasouli
  • for: A new benchmarking paradigm (CRITERIA) for evaluating trajectory prediction models for autonomous driving.
  • methods: 1) A method for extracting driving scenarios at varying levels of specificity, according to road structure, model performance, and data properties, for fine-grained ranking of prediction models; 2) a set of new bias-free metrics that measure diversity by incorporating scenario characteristics and admissibility by considering road structure and kinematic compliance, motivated by real-world driving constraints.
  • results: Extensive experiments on the large-scale Argoverse dataset show that the proposed benchmark produces a more accurate ranking of prediction models and can serve as a means of characterizing their behaviour; ablation studies highlight the contributions of the different elements used to compute the proposed metrics.
    Abstract Benchmarking is a common method for evaluating trajectory prediction models for autonomous driving. Existing benchmarks rely on datasets, which are biased towards more common scenarios, such as cruising, and distance-based metrics that are computed by averaging over all scenarios. Following such a regiment provides a little insight into the properties of the models both in terms of how well they can handle different scenarios and how admissible and diverse their outputs are. There exist a number of complementary metrics designed to measure the admissibility and diversity of trajectories, however, they suffer from biases, such as length of trajectories. In this paper, we propose a new benChmarking paRadIgm for evaluaTing trajEctoRy predIction Approaches (CRITERIA). Particularly, we propose 1) a method for extracting driving scenarios at varying levels of specificity according to the structure of the roads, models' performance, and data properties for fine-grained ranking of prediction models; 2) A set of new bias-free metrics for measuring diversity, by incorporating the characteristics of a given scenario, and admissibility, by considering the structure of roads and kinematic compliancy, motivated by real-world driving constraints. 3) Using the proposed benchmark, we conduct extensive experimentation on a representative set of the prediction models using the large scale Argoverse dataset. We show that the proposed benchmark can produce a more accurate ranking of the models and serve as a means of characterizing their behavior. We further present ablation studies to highlight contributions of different elements that are used to compute the proposed metrics.

An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions

  • paper_url: http://arxiv.org/abs/2310.07782
  • repo_url: https://github.com/PurdueCAM2Project/focused-convolutions
  • paper_authors: Caleb Tung, Nicholas Eliopoulos, Purvish Jajal, Gowri Ramshankar, Chen-Yun Yang, Nicholas Synovic, Xuecen Zhang, Vipin Chaudhary, George K. Thiruvathukal, Yung-Hsiang Lu
  • for: Improving the energy efficiency of Convolutional Neural Networks (CNNs) by reducing computation and energy costs without re-training.
  • methods: An automated method that inserts a threshold layer to filter activations from the preceding layers, identifying irrelevant regions of the image that the following layers can ignore while maintaining accuracy, thereby reducing inference latency and energy cost.
  • results: On various popular pretrained CNNs, the method reduces inference latency by up to 25% and energy cost by up to 22%, with little to no loss in accuracy.
    Abstract Computer vision often uses highly accurate Convolutional Neural Networks (CNNs), but these deep learning models are associated with ever-increasing energy and computation requirements. Producing more energy-efficient CNNs often requires model training which can be cost-prohibitive. We propose a novel, automated method to make a pretrained CNN more energy-efficient without re-training. Given a pretrained CNN, we insert a threshold layer that filters activations from the preceding layers to identify regions of the image that are irrelevant, i.e. can be ignored by the following layers while maintaining accuracy. Our modified focused convolution operation saves inference latency (by up to 25%) and energy costs (by up to 22%) on various popular pretrained CNNs, with little to no loss in accuracy.
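    A minimal sketch of the inserted threshold layer described above: spatial positions whose activation energy falls below a threshold are zeroed so that the following layers see only the relevant region. The threshold value and insertion point are assumptions; real latency savings additionally require the focused convolution kernels that skip the zeroed regions.

```python
# Minimal sketch of the inserted threshold layer. The mask only illustrates the idea;
# real latency savings require focused conv kernels that skip the zeroed regions.
import torch
import torch.nn as nn

class ActivationThreshold(nn.Module):
    """Zeroes spatial positions whose activation energy is below a threshold, so the
    following layers can ignore irrelevant pixels."""
    def __init__(self, threshold: float = 0.1):
        super().__init__()
        self.threshold = threshold

    def forward(self, x):                                    # x: (B, C, H, W)
        energy = x.abs().mean(dim=1, keepdim=True)           # per-pixel activation energy
        return x * (energy > self.threshold).float()

# Insert between two stages of a pretrained backbone, without re-training the convs.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    ActivationThreshold(threshold=0.1),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
out = backbone(torch.randn(1, 3, 64, 64))
```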

3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.07781
  • repo_url: https://github.com/Beckschen/3D-TransUNet
  • paper_authors: Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, Matthew Lungren, Lei Xing, Le Lu, Alan Yuille, Yuyin Zhou
  • for: Proposing a Transformer-based medical image segmentation network that improves segmentation accuracy and efficiency.
  • methods: Transformer self-attention complements U-Net's localized information with global context. Two key components are introduced: 1) a Transformer encoder that tokenizes image patches from a CNN feature map to extract global context, and 2) a Transformer decoder that adaptively refines candidate regions using cross-attention between candidate proposals and U-Net features.
  • results: Different medical tasks benefit from different architectural designs: the Transformer encoder excels at multi-organ segmentation, where the relationships among organs are crucial, while the Transformer decoder is more beneficial for small and challenging targets such as tumor segmentation. Overall, integrating the Transformer-based encoder and decoder into the U-shaped architecture improves segmentation, and TransUNet outperforms competitors in various medical applications.
    Abstract Medical image segmentation plays a crucial role in advancing healthcare systems for disease diagnosis and treatment planning. The u-shaped architecture, popularly known as U-Net, has proven highly successful for various medical image segmentation tasks. However, U-Net's convolution-based operations inherently limit its ability to model long-range dependencies effectively. To address these limitations, researchers have turned to Transformers, renowned for their global self-attention mechanisms, as alternative architectures. One popular network is our previous TransUNet, which leverages Transformers' self-attention to complement U-Net's localized information with the global context. In this paper, we extend the 2D TransUNet architecture to a 3D network by building upon the state-of-the-art nnU-Net architecture, and fully exploring Transformers' potential in both the encoder and decoder design. We introduce two key components: 1) A Transformer encoder that tokenizes image patches from a convolution neural network (CNN) feature map, enabling the extraction of global contexts, and 2) A Transformer decoder that adaptively refines candidate regions by utilizing cross-attention between candidate proposals and U-Net features. Our investigations reveal that different medical tasks benefit from distinct architectural designs. The Transformer encoder excels in multi-organ segmentation, where the relationship among organs is crucial. On the other hand, the Transformer decoder proves more beneficial for dealing with small and challenging segmented targets such as tumor segmentation. Extensive experiments showcase the significant potential of integrating a Transformer-based encoder and decoder into the u-shaped medical image segmentation architecture. TransUNet outperforms competitors in various medical applications.

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

  • paper_url: http://arxiv.org/abs/2310.07749
  • repo_url: None
  • paper_authors: Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo
  • for: Studying open-domain interleaved image-text generation, which generates interleaved text and images following an input query.
  • methods: A new interleaved generation framework, OpenLEAF, based on prompting large language models (LLMs) and pretrained text-to-image (T2I) models: the LLM generates textual descriptions, coordinates the T2I models, creates visual prompts for image generation, and incorporates global context into the T2I models to improve entity and style consistency.
  • results: Using large multi-modal models (LMMs) to evaluate entity and style consistency on a constructed evaluation set, the framework generates high-quality image-text content for various domains and applications, such as how-to question answering, storytelling, graphical story rewriting, and webpage/poster generation; the LMM evaluation technique is further validated against human assessment.
    Abstract This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models, creates visual prompts for generating images, and incorporates global contexts into the T2I models. This global context improves the entity and style consistencies of images in the interleaved generation. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences. According to the LMM evaluation on our constructed evaluation set, the proposed interleaved generation framework can generate high-quality image-text content for various domains and applications, such as how-to question answering, storytelling, graphical story rewriting, and webpage/poster generation tasks. Moreover, we validate the effectiveness of the proposed LMM evaluation technique with human assessment. We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.07702
  • repo_url: https://github.com/yingqinghe/scalecrafter
  • paper_authors: Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan
  • for: Investigating whether pretrained diffusion models can generate images at much higher resolutions than their training size, with arbitrary aspect ratios.
  • methods: A simple yet effective re-dilation that dynamically adjusts the convolutional receptive field during inference, addressing the limited perception field of the pretrained model's convolutional kernels; dispersed convolution and noise-damped classifier-free guidance are further proposed to enable ultra-high-resolution generation (e.g., 4096 x 4096). The approach requires no training or optimization.
  • results: The method resolves the object repetition problem and achieves state-of-the-art performance on higher-resolution image synthesis, especially in texture details.
    Abstract In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
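    A minimal sketch of the re-dilation knob: the dilation (and padding) of a pretrained 3x3 convolution is enlarged at inference time without touching its weights, so the receptive field grows with the target resolution. How ScaleCrafter schedules dilation across layers and denoising steps is not shown; the factor and the toy block are assumptions.

```python
# Minimal sketch of re-dilation: enlarge the dilation (and padding) of pretrained 3x3
# convs at inference so the receptive field matches the higher resolution, without
# touching the weights. The factor and toy block are assumptions.
import torch
import torch.nn as nn

def redilate_conv(conv: nn.Conv2d, factor: int) -> None:
    """In place: multiply the dilation of a 3x3 conv by `factor`, adjusting padding
    so the output resolution of a stride-1 conv is unchanged."""
    if conv.kernel_size != (3, 3):
        return
    d = conv.dilation[0] * factor
    conv.dilation = (d, d)
    conv.padding = (d, d)

block = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.SiLU(),
                      nn.Conv2d(8, 4, 3, padding=1))
for m in block.modules():
    if isinstance(m, nn.Conv2d):
        redilate_conv(m, factor=2)                # e.g. when sampling at 2x the training size
out = block(torch.randn(1, 4, 128, 128))          # output stays (1, 4, 128, 128)
```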

ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2310.07697
  • repo_url: None
  • paper_authors: Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
  • for: Training-free text-to-video generation: producing realistic dynamic videos given a condition, a video (or random noise), and input text.
  • methods: A training-free method built on off-the-shelf text-to-image models (e.g., Stable Diffusion) that explicitly disentangles the motion representation into condition-guided and scenery motion components, using a UNet branch and a control branch, with sparse bi-directional spatial-temporal attention (sBiST-Attn) and a 3D control network to improve temporal coherence and conditional accuracy.
  • results: The method outperforms compared approaches in frame consistency, clip score, and conditional accuracy.
    Abstract Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D controlnet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.

Orbital Polarimetric Tomography of a Flare Near the Sagittarius A* Supermassive Black Hole

  • paper_url: http://arxiv.org/abs/2310.07687
  • repo_url: None
  • paper_authors: Aviad Levis, Andrew A. Chael, Katherine L. Bouman, Maciek Wielgus, Pratul P. Srinivasan
  • for: Understanding the high-energy flares produced near the supermassive black hole at the center of the Milky Way, Sagittarius A*, during accretion.
  • methods: A highly ill-posed tomography problem is solved by integrating a neural 3D representation (an emergent artificial intelligence approach for 3D reconstruction) with a gravitational model for black holes, to recover the 3D emission structure of a flare in orbit.
  • results: From ALMA light curves observed on April 11, 2017, the recovery shows a compact bright region at a distance of roughly six times the event horizon, rotating clockwise in a low-inclination orbital plane, consistent with prior studies by the EHT and GRAVITY collaborations.
    Abstract The interaction between the supermassive black hole at the center of the Milky Way, Sagittarius A$^*$, and its accretion disk, occasionally produces high energy flares seen in X-ray, infrared and radio. One mechanism for observed flares is the formation of compact bright regions that appear within the accretion disk and close to the event horizon. Understanding these flares can provide a window into black hole accretion processes. Although sophisticated simulations predict the formation of these flares, their structure has yet to be recovered by observations. Here we show the first three-dimensional (3D) reconstruction of an emission flare in orbit recovered from ALMA light curves observed on April 11, 2017. Our recovery results show compact bright regions at a distance of roughly 6 times the event horizon. Moreover, our recovery suggests a clockwise rotation in a low-inclination orbital plane, a result consistent with prior studies by EHT and GRAVITY collaborations. To recover this emission structure we solve a highly ill-posed tomography problem by integrating a neural 3D representation (an emergent artificial intelligence approach for 3D reconstruction) with a gravitational model for black holes. Although the recovered 3D structure is subject, and sometimes sensitive, to the model assumptions, under physically motivated choices we find that our results are stable and our approach is successful on simulated data. We anticipate that in the future, this approach could be used to analyze a richer collection of time-series data that could shed light on the mechanisms governing black hole and plasma dynamics.

Prediction of MET Overexpression in Non-Small Cell Lung Adenocarcinomas from Hematoxylin and Eosin Images

  • paper_url: http://arxiv.org/abs/2310.07682
  • repo_url: None
  • paper_authors: Kshitij Ingale, Sun Hae Hong, Josh S. K. Bell, Abbas Rizvi, Amy Welch, Lingdao Sha, Irvin Ho, Kunal Nagpal, Aicha BenTaieb, Rohan P Joshi, Martin C Stumpe
  • for: Developing an algorithm that predicts MET protein overexpression in non-small cell lung cancer (NSCLC) directly from routinely available digitized H&E-stained slides, to help prioritize patients for confirmatory testing of MET protein or MET gene expression status.
  • methods: A large database of matched H&E slides and RNA expression data is used to train a weakly supervised model that predicts MET RNA overexpression directly from H&E images.
  • results: On an independent holdout test set of 300 over-expressed and 289 normal patients, the model achieves an ROC-AUC of 0.70 (95th percentile interval: 0.66-0.74), with stable performance across patient clinical variables and robustness to synthetic noise, suggesting that H&E-based predictive models could be a useful pre-screening tool.
    Abstract MET protein overexpression is a targetable event in non-small cell lung cancer (NSCLC) and is the subject of active drug development. Challenges in identifying patients for these therapies include lack of access to validated testing, such as standardized immunohistochemistry (IHC) assessment, and consumption of valuable tissue for a single gene/protein assay. Development of pre-screening algorithms using routinely available digitized hematoxylin and eosin (H&E)-stained slides to predict MET overexpression could promote testing for those who will benefit most. While assessment of MET expression using IHC is currently not routinely performed in NSCLC, next-generation sequencing is common and in some cases includes RNA expression panel testing. In this work, we leveraged a large database of matched H&E slides and RNA expression data to train a weakly supervised model to predict MET RNA overexpression directly from H&E images. This model was evaluated on an independent holdout test set of 300 over-expressed and 289 normal patients, demonstrating an ROC-AUC of 0.70 (95th percentile interval: 0.66 - 0.74) with stable performance characteristics across different patient clinical variables and robust to synthetic noise on the test set. These results suggest that H&E-based predictive models could be useful to prioritize patients for confirmatory testing of MET protein or MET gene expression status.

Accelerating Vision Transformers Based on Heterogeneous Attention Patterns

  • paper_url: http://arxiv.org/abs/2310.07664
  • repo_url: None
  • paper_authors: Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
  • for: Accelerating ViTs by reducing the computation of the costly self-attention mechanism.
  • methods: An integrated compression pipeline based on heterogeneous attention patterns across layers: a dynamic-guided static self-attention (DGSSA) method replaces the dynamic query-by-key self-attention matrix with a static matrix in early layers, where attention patterns are similar across images, while a global aggregation pyramid (GLAD) reduces the number of tokens in later layers, where attention maps are more low-rank and tokens are redundant.
  • results: The integrated pipeline of DGSSA and GLAD accelerates run-time throughput by up to 121% compared with DeiT, surpassing all SOTA approaches.
    Abstract Recently, Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision. Generally, the powerful representative capacity of ViTs mainly benefits from the self-attention mechanism, which has a high computation complexity. To accelerate ViTs, we propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers. On one hand, different images share more similar attention patterns in early layers than later layers, indicating that the dynamic query-by-key self-attention matrix may be replaced with a static self-attention matrix in early layers. Then, we propose a dynamic-guided static self-attention (DGSSA) method where the matrix inherits self-attention information from the replaced dynamic self-attention to effectively improve the feature representation ability of ViTs. On the other hand, the attention maps have more low-rank patterns, which reflect token redundancy, in later layers than early layers. In a view of linear dimension reduction, we further propose a method of global aggregation pyramid (GLAD) to reduce the number of tokens in later layers of ViTs, such as Deit. Experimentally, the integrated compression pipeline of DGSSA and GLAD can accelerate up to 121% run-time throughput compared with DeiT, which surpasses all SOTA approaches.
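    A minimal sketch of the static-attention idea behind DGSSA: in an early layer, the input-dependent query-by-key attention matrix is replaced by a single learned matrix shared across images, removing the quadratic QK^T computation at inference. How the paper distills this matrix from the dynamic attention, and the GLAD token reduction, are not shown; names and sizes are assumptions.

```python
# Minimal sketch of replacing dynamic query-by-key attention with a static, learned
# attention matrix in an early ViT layer; how DGSSA distills it is not shown.
import torch
import torch.nn as nn

class StaticSelfAttention(nn.Module):
    """Attention weights depend only on token positions, not on the input, so the
    quadratic QK^T computation is removed at inference time."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.attn_logits = nn.Parameter(torch.zeros(num_tokens, num_tokens))  # input-independent
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, D)
        attn = self.attn_logits.softmax(dim=-1)    # (N, N), shared across all images
        return self.proj(attn @ self.v(x))         # static token mixing

layer = StaticSelfAttention(num_tokens=197, dim=384)   # 196 patches + class token (DeiT-S sizes)
y = layer(torch.randn(2, 197, 384))
```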

Deep Video Inpainting Guided by Audio-Visual Self-Supervision

  • paper_url: http://arxiv.org/abs/2310.07663
  • repo_url: https://github.com/kyuyeonpooh/Audio-Visual-Deep-Video-Inpainting
  • paper_authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon
  • for: Improving the quality of video inpainting.
  • methods: Mimicking the human ability to imagine a scene from audio, an audio-visual network is first trained to learn the correspondence between auditory and visual information; it then guides the video inpainting network by transferring this prior knowledge through two novel losses, an audio-visual attention loss and an audio-visual pseudo-class consistency loss.
  • results: Experiments show the proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially occluded.
    Abstract Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To implement the prior knowledge, we first train the audio-visual network, which learns the correspondence between auditory and visual information. Then, the audio-visual network is employed as a guider that conveys the prior knowledge of audio-visual correspondence to the video inpainting network. This prior knowledge is transferred through our proposed two novel losses: audio-visual attention loss and audio-visual pseudo-class consistency loss. These two losses further improve the performance of the video inpainting by encouraging the inpainting result to have a high correspondence to its synchronized audio. Experimental results demonstrate that our proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially blinded.

Context-Enhanced Detector For Building Detection From Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2310.07638
  • repo_url: None
  • paper_authors: Ziyue Huang, Mingming Zhang, Qingjie Liu, Wei Wang, Zhe Dong, Yunhong Wang
  • for: Improving building detection accuracy in remote sensing images, addressing the diversity of building appearances and the complexity of vast scenes.
  • methods: A novel Context-Enhanced Detector (CEDet) with a three-stage cascade structure and two modules: a Semantic Guided Contextual Mining (SGCM) module that aggregates multi-scale context and uses an attention mechanism to capture long-range interactions, and an Instance Context Mining Module (ICMM) that captures instance-level relationship context by constructing a spatial relationship graph and aggregating instance features; a semantic segmentation loss based on pseudo-masks guides the extraction of contextual information.
  • results: State-of-the-art performance on three building detection benchmarks: CNBuilding-9P, CNBuilding-23P, and SpaceNet.
    Abstract The field of building detection from remote sensing images has made significant progress, but faces challenges in achieving high-accuracy detection due to the diversity in building appearances and the complexity of vast scenes. To address these challenges, we propose a novel approach called Context-Enhanced Detector (CEDet). Our approach utilizes a three-stage cascade structure to enhance the extraction of contextual information and improve building detection accuracy. Specifically, we introduce two modules: the Semantic Guided Contextual Mining (SGCM) module, which aggregates multi-scale contexts and incorporates an attention mechanism to capture long-range interactions, and the Instance Context Mining Module (ICMM), which captures instance-level relationship context by constructing a spatial relationship graph and aggregating instance features. Additionally, we introduce a semantic segmentation loss based on pseudo-masks to guide contextual information extraction. Our method achieves state-of-the-art performance on three building detection benchmarks, including CNBuilding-9P, CNBuilding-23P, and SpaceNet.

Attention-Map Augmentation for Hypercomplex Breast Cancer Classification

  • paper_url: http://arxiv.org/abs/2310.07633
  • repo_url: None
  • paper_authors: Eleonora Lopez, Filippo Betello, Federico Carmignani, Eleonora Grassucci, Danilo Comminiello
  • for: Improving the performance of early breast cancer diagnosis.
  • methods: A parameterized hypercomplex attention maps (PHAM) framework: an augmentation step computes attention maps, which are then used to condition classification by constructing a multi-dimensional input from the original breast cancer image and its attention map; a parameterized hypercomplex neural network (PHNN) performs the classification, exploiting the attention information through hypercomplex algebra.
  • results: The framework surpasses attention-based state-of-the-art networks and the real-valued counterpart of the method on both mammography and histopathological images. Code: https://github.com/elelo22/AttentionBCS.
    Abstract Breast cancer is the most widespread neoplasm among women and early detection of this disease is critical. Deep learning techniques have become of great interest to improve diagnostic performance. Nonetheless, discriminating between malignant and benign masses from whole mammograms remains challenging due to them being almost identical to an untrained eye and the region of interest (ROI) occupying a minuscule portion of the entire image. In this paper, we propose a framework, parameterized hypercomplex attention maps (PHAM), to overcome these problems. Specifically, we deploy an augmentation step based on computing attention maps. Then, the attention maps are used to condition the classification step by constructing a multi-dimensional input comprised of the original breast cancer image and the corresponding attention map. In this step, a parameterized hypercomplex neural network (PHNN) is employed to perform breast cancer classification. The framework offers two main advantages. First, attention maps provide critical information regarding the ROI and allow the neural model to concentrate on it. Second, the hypercomplex architecture has the ability to model local relations between input dimensions thanks to hypercomplex algebra rules, thus properly exploiting the information provided by the attention map. We demonstrate the efficacy of the proposed framework on both mammography images as well as histopathological ones, surpassing attention-based state-of-the-art networks and the real-valued counterpart of our method. The code of our work is available at https://github.com/elelo22/AttentionBCS.
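    A minimal sketch of the augmentation step described above: an attention map is computed for the image and stacked with it to form the multi-dimensional input for the classifier. The attention extractor is a placeholder, and a plain CNN stands in for the parameterized hypercomplex network (PHNN); both are assumptions.

```python
# Minimal sketch of the augmentation step: compute an attention map and stack it with
# the image as a multi-dimensional input. The attention extractor is a placeholder and
# a plain CNN stands in for the parameterized hypercomplex network (PHNN).
import torch
import torch.nn as nn

class AttentionMapExtractor(nn.Module):
    """Produces a single-channel attention map meant to highlight the ROI."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())

    def forward(self, x):                                     # x: (B, 1, H, W) grayscale image
        a = self.features(x).mean(dim=1, keepdim=True)        # channel-averaged activation energy
        return torch.sigmoid(a)                               # (B, 1, H, W)

def build_conditioned_input(image, attn_map):
    """Multi-dimensional input: the original image stacked with its attention map."""
    return torch.cat([image, attn_map], dim=1)                # (B, 2, H, W)

extractor = AttentionMapExtractor()
classifier = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))
img = torch.randn(4, 1, 224, 224)
logits = classifier(build_conditioned_input(img, extractor(img)))
```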

Prompt Backdoors in Visual Prompt Learning

  • paper_url: http://arxiv.org/abs/2310.07632
  • repo_url: None
  • paper_authors: Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
  • for: Fine-tuning large pretrained vision models is infeasible for resource-limited users, so visual prompt learning (VPL), offered as Visual Prompt as a Service (VPPTaaS), has emerged as an efficient and flexible alternative; this paper takes a first step toward exploring the security risks of this paradigm.
  • methods: In VPPTaaS, a provider optimizes a visual prompt on downstream data, and users apply this prompt together with the large pretrained model for prediction; the paper studies, through the lens of backdoor attacks, what happens when the provider instead supplies a malicious visual prompt.
  • results: A simple yet effective backdoor attack, BadVisualPrompt, is proposed: poisoning 5% of the CIFAR10 training data yields above 99% attack success rates with only a negligible 1.5% drop in model accuracy. A new technical challenge related to the interaction between the backdoor trigger and the visual prompt, absent in conventional model-level backdoors, is identified and addressed, and an analysis of seven defenses at the model, prompt, and input levels shows that all are either ineffective or impractical, implying a critical vulnerability of VPL.
    Abstract Fine-tuning large pre-trained computer vision models is infeasible for resource-limited users. Visual prompt learning (VPL) has thus emerged to provide an efficient and flexible alternative to model fine-tuning through Visual Prompt as a Service (VPPTaaS). Specifically, the VPPTaaS provider optimizes a visual prompt given downstream data, and downstream users can use this prompt together with the large pre-trained model for prediction. However, this new learning paradigm may also pose security risks when the VPPTaaS provider instead provides a malicious visual prompt. In this paper, we take the first step to explore such risks through the lens of backdoor attacks. Specifically, we propose BadVisualPrompt, a simple yet effective backdoor attack against VPL. For example, poisoning $5\%$ CIFAR10 training data leads to above $99\%$ attack success rates with only negligible model accuracy drop by $1.5\%$. In particular, we identify and then address a new technical challenge related to interactions between the backdoor trigger and visual prompt, which does not exist in conventional, model-level backdoors. Moreover, we provide in-depth analyses of seven backdoor defenses from model, prompt, and input levels. Overall, all these defenses are either ineffective or impractical to mitigate our BadVisualPrompt, implying the critical vulnerability of VPL.

Dual Radar: A Multi-modal Dataset with Dual 4D Radar for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.07602
  • repo_url: https://github.com/adept-thu/dual-radar
  • paper_authors: Xinyu Zhang, Li Wang, Jian Chen, Cheng Fang, Lei Yang, Ziying Song, Guangqi Yang, Yichen Wang, Xiaofei Zhang, Qingshan Yang, Jun Li
  • for: Introducing a novel large-scale multi-modal dataset for studying effective 4D radar perception algorithms in autonomous driving.
  • methods: Two types of 4D radar are captured simultaneously to build the dataset, which consists of 151 consecutive series, most lasting 20 seconds, with 10,007 meticulously synchronized and annotated frames covering a variety of road, weather, and lighting conditions.
  • results: The dataset is experimentally validated, providing valuable results for studying different types of 4D radar.
    Abstract Radar has stronger adaptability in adverse scenarios for autonomous driving environmental perception compared to widely adopted cameras and LiDARs. Compared with commonly used 3D radars, the latest 4D radars have precise vertical resolution and higher point cloud density, making it a highly promising sensor for autonomous driving in complex environmental perception. However, due to the much higher noise than LiDAR, manufacturers choose different filtering strategies, resulting in an inverse ratio between noise level and point cloud density. There is still a lack of comparative analysis on which method is beneficial for deep learning-based perception algorithms in autonomous driving. One of the main reasons is that current datasets only adopt one type of 4D radar, making it difficult to compare different 4D radars in the same scene. Therefore, in this paper, we introduce a novel large-scale multi-modal dataset featuring, for the first time, two types of 4D radars captured simultaneously. This dataset enables further research into effective 4D radar perception algorithms.Our dataset consists of 151 consecutive series, most of which last 20 seconds and contain 10,007 meticulously synchronized and annotated frames. Moreover, our dataset captures a variety of challenging driving scenarios, including many road conditions, weather conditions, nighttime and daytime with different lighting intensities and periods. Our dataset annotates consecutive frames, which can be applied to 3D object detection and tracking, and also supports the study of multi-modal tasks. We experimentally validate our dataset, providing valuable results for studying different types of 4D radars. This dataset is released on https://github.com/adept-thu/Dual-Radar.

PeP: a Point enhanced Painting method for unified point cloud tasks

  • paper_url: http://arxiv.org/abs/2310.07591
  • repo_url: None
  • paper_authors: Zichao Dong, Hang Ji, Xufeng Huang, Weikun Zhang, Xin Zhan, Junbo Chen
  • for: Improving point cloud recognition by adding features from diverse sources and providing a stronger feature encoding mechanism as input for downstream modules.
  • methods: A novel PeP module with two main parts: a refined point painting method and a LM-based point encoder.
  • results: Experiments on the nuScenes and KITTI datasets show strong performance on both semantic segmentation and object detection, in both lidar-only and multi-modal settings; the module is model-agnostic and plug-and-play.
    Abstract Point encoder is of vital importance for point cloud recognition. As the very beginning step of whole model pipeline, adding features from diverse sources and providing stronger feature encoding mechanism would provide better input for downstream modules. In our work, we proposed a novel PeP module to tackle above issue. PeP contains two main parts, a refined point painting method and a LM-based point encoder. Experiments results on the nuScenes and KITTI datasets validate the superior performance of our PeP. The advantages leads to strong performance on both semantic segmentation and object detection, in both lidar and multi-modal settings. Notably, our PeP module is model agnostic and plug-and-play. Our code will be publicly available soon.
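    A minimal sketch of classic point painting, the general idea that PeP refines (not the PeP module itself): lidar points are projected into the image with a calibration matrix and per-pixel semantic scores are appended to each point. The projection convention and image size are assumptions.

```python
# Minimal sketch of classic point painting (the general idea PeP refines): project
# lidar points into the image and append per-pixel semantic scores to each point.
# Projection convention and image size are assumptions.
import torch

def paint_points(points, sem_scores, lidar2img):
    """points: (N, 3) lidar xyz; sem_scores: (C, H, W) per-pixel class scores;
    lidar2img: (4, 4) projection matrix. Returns (N, 3 + C) painted points."""
    C, H, W = sem_scores.shape
    homog = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)   # (N, 4)
    cam = homog @ lidar2img.T
    depth = cam[:, 2].clamp(min=1e-6)
    u = (cam[:, 0] / depth).round().long().clamp(0, W - 1)
    v = (cam[:, 1] / depth).round().long().clamp(0, H - 1)
    feats = sem_scores[:, v, u].T                                        # (N, C) sampled scores
    feats[cam[:, 2] <= 0] = 0                                            # points behind the camera
    return torch.cat([points, feats], dim=1)

painted = paint_points(torch.randn(1000, 3) * 10,
                       torch.rand(10, 376, 1241),      # e.g. a KITTI-sized semantic map
                       torch.eye(4))                   # placeholder calibration
```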

A Discrepancy Aware Framework for Robust Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.07585
  • repo_url: https://github.com/caiyuxuan1120/daf
  • paper_authors: Yuxuan Cai, Dingkang Liang, Dongliang Luo, Xinwei He, Xin Yang, Xiang Bai
  • for: Studying anomaly (defect) detection with synthetic-data-based self-supervised learning, in particular the robustness of models to different synthesizing strategies.
  • methods: A Discrepancy Aware Framework (DAF): a teacher-student network trained on synthesized outliers computes a discrepancy map, which serves as an appearance-agnostic cue that guides the decoder in identifying defects, reducing reliance on the visual appearance of the synthetic data.
  • results: Extensive experiments on two challenging datasets demonstrate robust performance; under simple synthesis strategies the method outperforms existing approaches by a large margin and achieves state-of-the-art localization performance. Code: https://github.com/caiyuxuan1120/daf.
    Abstract Defect detection is a critical research area in artificial intelligence. Recently, synthetic data-based self-supervised learning has shown great potential on this task. Although many sophisticated synthesizing strategies exist, little research has been done to investigate the robustness of models when faced with different strategies. In this paper, we focus on this issue and find that existing methods are highly sensitive to them. To alleviate this issue, we present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies across different anomaly detection benchmarks. We hypothesize that the high sensitivity to synthetic data of existing self-supervised methods arises from their heavy reliance on the visual appearance of synthetic data during decoding. In contrast, our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance. To this end, inspired by existing knowledge distillation methods, we employ a teacher-student network, which is trained based on synthesized outliers, to compute the discrepancy map as the cue. Extensive experiments on two challenging datasets prove the robustness of our method. Under the simple synthesis strategies, it outperforms existing methods by a large margin. Furthermore, it also achieves the state-of-the-art localization performance. Code is available at: https://github.com/caiyuxuan1120/DAF.
    摘要 人工智能中的缺陷检测是一个关键的研究领域。最近,基于合成数据的自监督学习展现出了很大的潜力。虽然存在很多复杂的合成策略,但针对模型在不同策略下鲁棒性的研究仍然很少。在这篇论文中,我们关注这个问题,并发现现有方法对合成策略非常敏感。为了缓解这个问题,我们提出了一种差异感知框架(DAF),它在不同的异常检测基准上使用简单且低成本的合成策略即可保持稳定的表现。我们认为现有自监督方法对合成数据的高敏感性源于解码过程中对合成数据视觉外观的严重依赖。相比之下,我们的方法利用一种与外观无关的线索来引导解码器识别缺陷,从而减轻对合成外观的依赖。为此,受现有知识蒸馏方法的启发,我们采用一个基于合成异常样本训练的教师-学生网络来计算差异图作为该线索。我们在两个具有挑战性的数据集上进行了广泛的实验,结果表明我们的方法在简单的合成策略下明显优于现有方法,并且达到了当前最佳的异常定位性能。代码可以在 GitHub 上找到:https://github.com/caiyuxuan1120/DAF。
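The key ingredient described above is the discrepancy map computed by a teacher-student pair and used as an appearance-agnostic cue. The sketch below illustrates that idea with off-the-shelf ResNet-18 feature extractors as stand-ins; the actual backbones, the training of the student on synthesized outliers, and the decoder guided by the cue are not reproduced here.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def feature_map(model, x):
    # Use features up to layer3 of a ResNet-18 as a stand-in backbone.
    trunk = torch.nn.Sequential(*list(model.children())[:-3])
    return trunk(x)

@torch.no_grad()
def discrepancy_map(teacher, student, image):
    """1 - cosine similarity between teacher and student features at every location.

    High values mark regions where the student disagrees with the frozen teacher,
    which serves as an appearance-agnostic anomaly cue.
    """
    ft = F.normalize(feature_map(teacher, image), dim=1)   # (B, C, h, w)
    fs = F.normalize(feature_map(student, image), dim=1)
    d = 1.0 - (ft * fs).sum(dim=1, keepdim=True)           # (B, 1, h, w)
    # Upsample back to input resolution so it can guide a decoder or be thresholded.
    return F.interpolate(d, size=image.shape[-2:], mode="bilinear", align_corners=False)

teacher = resnet18(weights=None).eval()
student = resnet18(weights=None).eval()
x = torch.randn(2, 3, 256, 256)
print(discrepancy_map(teacher, student, x).shape)  # torch.Size([2, 1, 256, 256])
```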

Centrality of the Fingerprint Core Location

  • paper_url: http://arxiv.org/abs/2310.07584
  • repo_url: None
  • paper_authors: Laurenz Ruzicka, Bernhard Strobl, Bernhard Kohn, Clemens Heitzinger
  • for: 这个研究的目的是分析和提高指纹识别的方法,特别是研究指纹核心的分布。
  • methods: 该研究使用了rolled fingerprint recordings和plain fingerprint recordings的大量数据集,并使用了empirical distribution的方法来分析指纹核心的分布。
  • results: 研究发现,rolling fingerprint recordings中核心的位置与指纹中心的偏差为5.7% $\pm$ 5.2%到7.6% $\pm$ 6.9%,而plain fingerprint recordings中核心的位置遵循正态分布。此外,研究还发现,NFIQ 2 预测器偏爱rolled fingerprint recordings中核心处于指纹中心下方的位置。
    Abstract Fingerprints have long been recognized as a unique and reliable means of personal identification. Central to the analysis and enhancement of fingerprints is the concept of the fingerprint core. Although the location of the core is used in many applications, to the best of our knowledge, this study is the first to investigate the empirical distribution of the core over a large, combined dataset of rolled, as well as plain fingerprint recordings. We identify and investigate the extent of incomplete rolling during the rolled fingerprint acquisition and investigate the centrality of the core. After correcting for the incomplete rolling, we find that the core deviates from the fingerprint center by 5.7% $\pm$ 5.2% to 7.6% $\pm$ 6.9%, depending on the finger. Additionally, we find that the assumption of normal distribution of the core position of plain fingerprint recordings cannot be rejected, but for rolled ones it can. Therefore, we use a multi-step process to find the distribution of the rolled fingerprint recordings. The process consists of an Anderson-Darling normality test, the Bayesian Information Criterion to reduce the number of possible candidate distributions and finally a Generalized Monte Carlo goodness-of-fit procedure to find the best fitting distribution. We find the non-central Fischer distribution best describes the cores' horizontal positions. Finally, we investigate the correlation between mean core position offset and the NFIQ 2 score and find that the NFIQ 2 prefers rolled fingerprint recordings where the core sits slightly below the fingerprint center.
    摘要 人体指纹长期被认为是一种独特和可靠的个人认可方法。中心于指纹核心的分析和加强是识别人体指纹的关键。尽管核心位置在许多应用中被使用,但 according to our knowledge, this study is the first to investigate the empirical distribution of the core over a large, combined dataset of rolled and plain fingerprint recordings. We identify and investigate the extent of incomplete rolling during the rolled fingerprint acquisition and investigate the centrality of the core. After correcting for the incomplete rolling, we find that the core deviates from the fingerprint center by 5.7% ± 5.2% to 7.6% ± 6.9%, depending on the finger. Additionally, we find that the assumption of normal distribution of the core position of plain fingerprint recordings cannot be rejected, but for rolled ones it can. Therefore, we use a multi-step process to find the distribution of the rolled fingerprint recordings. The process consists of an Anderson-Darling normality test, the Bayesian Information Criterion to reduce the number of possible candidate distributions, and finally a Generalized Monte Carlo goodness-of-fit procedure to find the best fitting distribution. We find the non-central Fischer distribution best describes the cores' horizontal positions. Finally, we investigate the correlation between mean core position offset and the NFIQ 2 score and find that the NFIQ 2 prefers rolled fingerprint recordings where the core sits slightly below the fingerprint center.
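The multi-step distribution-selection pipeline described above (Anderson-Darling normality test, BIC-based candidate reduction, then a goodness-of-fit search) can be sketched with SciPy as follows. The data here are synthetic stand-ins, the final generalized Monte Carlo goodness-of-fit step is omitted, and scipy.stats.ncf is used for the non-central F (Fisher) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
plain_offsets = rng.normal(loc=0.02, scale=0.05, size=500)              # synthetic stand-in data
rolled_offsets = rng.noncentral_f(dfnum=5, dfden=20, nonc=2.0, size=500) * 0.02

# Step 1: Anderson-Darling normality test (reject normality if statistic > critical value).
ad = stats.anderson(plain_offsets, dist="norm")
print("A-D statistic:", ad.statistic, "5% critical value:", ad.critical_values[2])

# Step 2: rank candidate distributions for the rolled offsets by BIC.
def bic(dist, data):
    params = dist.fit(data)
    loglik = np.sum(dist.logpdf(data, *params))
    k = len(params)
    return k * np.log(len(data)) - 2.0 * loglik, params

candidates = {"ncf": stats.ncf, "gamma": stats.gamma, "lognorm": stats.lognorm}
scores = {name: bic(dist, rolled_offsets)[0] for name, dist in candidates.items()}
best = min(scores, key=scores.get)
print("BIC per candidate:", scores, "-> best:", best)
```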

Relational Prior Knowledge Graphs for Detection and Instance Segmentation

  • paper_url: http://arxiv.org/abs/2310.07573
  • repo_url: None
  • paper_authors: Osman Ülger, Yu Wang, Ysbrand Galama, Sezer Karaoglu, Theo Gevers, Martin R. Oswald
  • for: 这 paper 的目的是调查使用对象之间关系来进行物体探测和实例分割是否有效。
  • methods: 该 paper 提出了一种基于关系优先的特征增强模型(RP-FEM),该模型在Scene graph上运行,并同时学习对象探测和实例分割中的关系上下文模型。
  • results: 实验表明,在 COCO 数据集上,使用Scene graph和关系优先,可以提高物体探测和实例分割的性能,RP-FEM 可以减少图像中不可能的类别预测,同时避免生成重复预测,与基础模型相比呈现提高。
    Abstract Humans have a remarkable ability to perceive and reason about the world around them by understanding the relationships between objects. In this paper, we investigate the effectiveness of using such relationships for object detection and instance segmentation. To this end, we propose a Relational Prior-based Feature Enhancement Model (RP-FEM), a graph transformer that enhances object proposal features using relational priors. The proposed architecture operates on top of scene graphs obtained from initial proposals and aims to concurrently learn relational context modeling for object detection and instance segmentation. Experimental evaluations on COCO show that the utilization of scene graphs, augmented with relational priors, offer benefits for object detection and instance segmentation. RP-FEM demonstrates its capacity to suppress improbable class predictions within the image while also preventing the model from generating duplicate predictions, leading to improvements over the baseline model on which it is built.
    摘要 人类具有非常出色的能力,可以通过理解对象之间的关系来理解世界。在这篇论文中,我们研究了使用这些关系来进行对象检测和实例分割。为此,我们提出了一种基于关系优先的特征增强模型(RP-FEM),这是一种图 transformer 模型,可以通过在Scene Graph中增强对象提案特征来提高对象检测和实例分割的性能。这种建议的架构在Scene Graph中进行同时学习对象检测和实例分割的关系上下文模型。在COCO数据集上进行实验评估,我们发现使用Scene Graph和关系优先可以提高对象检测和实例分割的性能。RP-FEM可以降低图像中不可能的类别预测,同时避免模型生成重复预测,从而超越基础模型。

Impact of Label Types on Training SWIN Models with Overhead Imagery

  • paper_url: http://arxiv.org/abs/2310.07572
  • repo_url: None
  • paper_authors: Ryan Ford, Kenneth Hutchison, Nicholas Felts, Benjamin Cheng, Jesse Lew, Kyle Jackson
  • for: 这个论文的目的是研究数据集设计对模型训练和性能的影响,以帮助降低生成遥感和过空标注数据的成本。
  • methods: 这篇论文使用了固定窗口变换器的训练,使用了 bounding boxes 和 segmentation 标签,其中后者更加昂贵。作者比较了使用不同类型的标签进行训练,并研究了这些模型在不同任务上的性能。
  • results: 作者发现,使用只有目标像素的训练不会提高分类任务的性能,而且会混淆评估集中的背景像素。对于对象检测模型,使用不同类型的标签没有影响测试集的性能。作者发现,使用 bounding boxes 可以为一些不需要更复杂的标签的任务提供足够的性能。
    Abstract Understanding the impact of data set design on model training and performance can help alleviate the costs associated with generating remote sensing and overhead labeled data. This work examined the impact of training shifted window transformers using bounding boxes and segmentation labels, where the latter are more expensive to produce. We examined classification tasks by comparing models trained with both target and backgrounds against models trained with only target pixels, extracted by segmentation labels. For object detection models, we compared performance using either label type when training. We found that the models trained on only target pixels do not show performance improvement for classification tasks, appearing to conflate background pixels in the evaluation set with target pixels. For object detection, we found that models trained with either label type showed equivalent performance across testing. We found that bounding boxes appeared to be sufficient for tasks that did not require more complex labels, such as object segmentation. Continuing work to determine consistency of this result across data types and model architectures could potentially result in substantial savings in generating remote sensing data sets for deep learning.
    摘要 Understanding the impact of data set design on model training and performance can help reduce the costs associated with generating remote sensing and overhead labeled data. This work examined the impact of training shifted window transformers using bounding boxes and segmentation labels, where the latter are more expensive to produce. We compared models trained with both target and backgrounds against models trained with only target pixels, extracted by segmentation labels. For object detection models, we compared performance using either label type when training. We found that the models trained on only target pixels did not show performance improvement for classification tasks, appearing to conflate background pixels in the evaluation set with target pixels. For object detection, we found that models trained with either label type showed equivalent performance across testing. We found that bounding boxes were sufficient for tasks that did not require more complex labels, such as object segmentation. Continuing work to determine consistency of this result across data types and model architectures could potentially result in substantial savings in generating remote sensing data sets for deep learning.
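A minimal sketch of the two label-derived training conditions compared above: keeping only target pixels via a segmentation mask versus deriving the cheaper bounding-box label from the same mask. The exact preprocessing used by the authors is an assumption.

```python
import numpy as np

def target_only(image, mask, fill_value=0):
    """Zero out background pixels so the model sees only target pixels.

    image : (H, W, C) array; mask : (H, W) boolean/0-1 array from segmentation labels.
    This mimics the 'target pixels only' training condition compared in the study;
    the authors' exact preprocessing is an assumption here.
    """
    keep = mask.astype(bool)
    out = np.full_like(image, fill_value)
    out[keep] = image[keep]
    return out

def bbox_from_mask(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) derived from a segmentation mask,
    i.e. the cheaper label type the study finds sufficient for many tasks."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
m = np.zeros((64, 64), dtype=np.uint8)
m[20:40, 15:50] = 1
print(target_only(img, m).shape, bbox_from_mask(m))
```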

Does resistance to Style-Transfer equal Shape Bias? Evaluating Shape Bias by Distorted Shape

  • paper_url: http://arxiv.org/abs/2310.07555
  • repo_url: None
  • paper_authors: Ziqi Wen, Tianqin Li, Tai Sing Lee
  • for: 这项研究旨在评估深度学习模型对形状的敏感性,并提供一个新的测试工具箱(Distorted Shape Testbench,DiST)来评估模型的全局形状敏感性。
  • methods: 这项研究使用了样式传递图像来训练深度学习模型,并对模型的性能进行评估。
  • results: 研究发现,已有的Shape bias评估方法不能准确评估模型的全局形状敏感性,而DiST测试工具箱可以准确地评估模型的全局形状敏感性,并且训练使用DiST图像可以bridge人类和现有SOTA模型之间的性能差距,同时保持模型的标准图像分类任务的准确率。
    Abstract Deep learning models are known to exhibit a strong texture bias, while human tends to rely heavily on global shape for object recognition. The current benchmark for evaluating a model's shape bias is a set of style-transferred images with the assumption that resistance to the attack of style transfer is related to the development of shape sensitivity in the model. In this work, we show that networks trained with style-transfer images indeed learn to ignore style, but its shape bias arises primarily from local shapes. We provide a Distorted Shape Testbench (DiST) as an alternative measurement of global shape sensitivity. Our test includes 2400 original images from ImageNet-1K, each of which is accompanied by two images with the global shapes of the original image distorted while preserving its texture via the texture synthesis program. We found that (1) models that performed well on the previous shape bias evaluation do not fare well in the proposed DiST; (2) the widely adopted ViT models do not show significant advantages over Convolutional Neural Networks (CNNs) on this benchmark despite that ViTs rank higher on the previous shape bias tests. (3) training with DiST images bridges the significant gap between human and existing SOTA models' performance while preserving the models' accuracy on standard image classification tasks; training with DiST images and style-transferred images are complementary, and can be combined to train network together to enhance both the global and local shape sensitivity of the network. Our code will be host at: https://github.com/leelabcnbc/DiST
    摘要 深度学习模型通常表现出很强的纹理偏好,而人类更倾向于依靠全局形状来识别对象。目前评估模型形状偏好的基准是一组风格迁移图像,其假设是:对风格迁移攻击的抵抗力与模型形状敏感性的发展相关。在这项工作中,我们证明了使用风格迁移图像训练的网络确实会学会忽略风格,但其形状偏好主要来自局部形状。我们提出了扭曲形状测试基准(DiST),作为全局形状敏感性的替代度量。该测试包含来自ImageNet-1K的2400张原始图像,每张图像都配有两张通过纹理合成程序生成的图像,这些图像在保持纹理的同时扭曲了原图的全局形状。我们发现:1. 在以往形状偏好评估中表现好的模型在DiST测试中表现并不理想;2. 尽管ViT模型在以往的形状偏好测试中得分更高,但在该基准上相比卷积神经网络(CNN)并没有显著优势;3. 使用DiST图像进行训练可以缩小人类与现有SOTA模型之间的显著性能差距,同时保持模型在标准图像分类任务上的准确率;DiST图像与风格迁移图像的训练是互补的,可以联合用于训练网络,同时增强网络的全局和局部形状敏感性。我们的代码将托管在:https://github.com/leelabcnbc/DiST

Attribute Localization and Revision Network for Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2310.07548
  • repo_url: None
  • paper_authors: Junzhe Xu, Suling Duan, Chenwei Tang, Zhenan He, Jiancheng Lv
  • for: 本文旨在提出一种基于 Attribute Localization and Revision Network 的零shot学习模型,以便在无需训练数据的情况下,模型能够识别未经训练的类别。
  • methods: 本文使用了 Attribute Localization Module (ALM) 和 Attribute Revision Module (ARM) 两种模块来捕捉图像区域的本地和全局特征,并通过修改涂抹图像的每个特征值来补做忽略内类 attribute 变化的缺陷。
  • results: 经过广泛的实验测试,本文的模型在三个常用的 benchmark 上表现出色,证明了本文提出的方法的有效性。
    Abstract Zero-shot learning enables the model to recognize unseen categories with the aid of auxiliary semantic information such as attributes. Current works proposed to detect attributes from local image regions and align extracted features with class-level semantics. In this paper, we find that the choice between local and global features is not a zero-sum game, global features can also contribute to the understanding of attributes. In addition, aligning attribute features with class-level semantics ignores potential intra-class attribute variation. To mitigate these disadvantages, we present Attribute Localization and Revision Network in this paper. First, we design Attribute Localization Module (ALM) to capture both local and global features from image regions, a novel module called Scale Control Unit is incorporated to fuse global and local representations. Second, we propose Attribute Revision Module (ARM), which generates image-level semantics by revising the ground-truth value of each attribute, compensating for performance degradation caused by ignoring intra-class variation. Finally, the output of ALM will be aligned with revised semantics produced by ARM to achieve the training process. Comprehensive experimental results on three widely used benchmarks demonstrate the effectiveness of our model in the zero-shot prediction task.
    摘要 zero-shot 学习可以让模型认识未经见过的类别,通过 auxiliary Semantic information such as 特征。现有工作提出了从本地图像区域检测特征并将抽取的特征与类别 semantics 对齐。在这篇论文中,我们发现选择本地和全局特征并不是零和游戏,全局特征也可以帮助理解特征。此外,对属性特征与类别 semantics 对齐忽略了内部类 attribute 的变化。为了解决这些缺点,我们在这篇论文中提出了 Attribute Localization and Revision Network。首先,我们设计了 Attribute Localization Module (ALM),它可以从图像区域中捕捉 both local 和 global 特征。为了融合全局和本地表示,我们采用了一个新的 Scale Control Unit。其次,我们提出了 Attribute Revision Module (ARM),它可以根据图像的实际情况修改每个特征的真实值,以补偿因为忽略内部类 attribute 的变化而导致的性能下降。最后,ALM 的输出将被与 ARM 生成的修订后的 semantics 进行对齐,以实现训练过程。我们在三个常用的 benchmark 上进行了广泛的实验, demonstarting 我们的模型在 zero-shot 预测任务中的有效性。

S4C: Self-Supervised Semantic Scene Completion with Neural Fields

  • paper_url: http://arxiv.org/abs/2310.07522
  • repo_url: None
  • paper_authors: Adrian Hayler, Felix Wimbauer, Dominik Muhle, Christian Rupprecht, Daniel Cremers
  • for: 本研究旨在解决计算机视觉中的3D语义场景理解挑战,帮助移动智能体自主规划和探索未知环境。
  • methods: 我们提出了首个不需要3D真值数据的自监督方法,通过单张图像重建场景,并且只使用视频和伪分割标签进行训练。
  • results: 我们的方法可以准确地推断场景的占据情况和语义类别,并能为远距离视角合成精确的分割图。
    Abstract 3D semantic scene understanding is a fundamental challenge in computer vision. It enables mobile agents to autonomously plan and navigate arbitrary environments. SSC formalizes this challenge as jointly estimating dense geometry and semantic information from sparse observations of a scene. Current methods for SSC are generally trained on 3D ground truth based on aggregated LiDAR scans. This process relies on special sensors and annotation by hand which are costly and do not scale well. To overcome this issue, our work presents the first self-supervised approach to SSC called S4C that does not rely on 3D ground truth data. Our proposed method can reconstruct a scene from a single image and only relies on videos and pseudo segmentation ground truth generated from off-the-shelf image segmentation network during training. Unlike existing methods, which use discrete voxel grids, we represent scenes as implicit semantic fields. This formulation allows querying any point within the camera frustum for occupancy and semantic class. Our architecture is trained through rendering-based self-supervised losses. Nonetheless, our method achieves performance close to fully supervised state-of-the-art methods. Additionally, our method demonstrates strong generalization capabilities and can synthesize accurate segmentation maps for far away viewpoints.
    摘要 三维语义场景理解是计算机视觉领域的基本挑战,它使移动智能体能够自主规划并在任意环境中导航。SSC将这一挑战形式化为从场景的稀疏观测中同时估计稠密的几何和语义信息。现有的SSC方法通常基于聚合LiDAR扫描得到的3D真值进行训练,这一过程需要特殊的传感器和人工标注,成本高且难以扩展。为了解决这个问题,我们提出了第一个不需要3D真值数据的自监督SSC方法,称为S4C。该方法可以从单张图像重建场景,在训练中只需要视频以及由现成图像分割网络生成的伪分割标签。与使用离散体素网格的现有方法不同,我们将场景表示为隐式语义场。这种表示方式允许查询相机视锥内任意点的占据情况和语义类别。我们的架构通过基于渲染的自监督损失进行训练。尽管如此,我们的方法仍达到了接近全监督SOTA方法的性能,并且表现出很强的泛化能力,能够为远距离视角合成准确的分割图。
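A minimal sketch of the implicit semantic field interface described above: an MLP that can be queried at any 3D point for occupancy/density and a semantic class. The image conditioning, rendering-based losses, and training pipeline of S4C are omitted; the positional encoding and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SemanticField(nn.Module):
    """Minimal implicit field: 3D point -> (density, class logits).

    S4C conditions such a field on image features; here the conditioning is omitted
    and only the 'query any point in the frustum' interface is illustrated.
    """
    def __init__(self, num_classes=20, n_freqs=6, hidden=128):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 + 3 * 2 * n_freqs              # xyz + sin/cos positional encoding
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + num_classes),   # density + per-class logits
        )

    def encode(self, x):
        freqs = 2.0 ** torch.arange(self.n_freqs, device=x.device)
        ang = x[..., None] * freqs                 # (..., 3, n_freqs)
        pe = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1).flatten(-2)
        return torch.cat([x, pe], dim=-1)

    def forward(self, points):
        out = self.mlp(self.encode(points))
        density = torch.relu(out[..., :1])         # non-negative occupancy/density
        logits = out[..., 1:]
        return density, logits

field = SemanticField()
pts = torch.rand(4096, 3) * 20 - 10               # arbitrary query points inside the frustum
density, logits = field(pts)
print(density.shape, logits.shape)                # (4096, 1), (4096, 20)
```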

CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing

  • paper_url: http://arxiv.org/abs/2310.07517
  • repo_url: None
  • paper_authors: Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, Wenwu Wang
  • for: 这篇论文旨在提高音视频视频解析任务的精度,通过注意力机制捕捉视频中各个片段之间的语义相关性。
  • methods: 我们提出了一种新的交互增强跨模态感知方法(CM-PIE),通过基于片段的注意力模块来学习细粒度的特征。此外,我们还引入了一个跨模态聚合模块,以协同优化音频和视觉信号的语义表示。
  • results: 我们的模型在 Look, Listen, and Parse 数据集上的 parsing 性能比其他方法高。
    Abstract Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting them as audible or visible events. Recent methods for this task leverage the attention mechanism to capture the semantic correlations among the whole video across the audio-visual modalities. However, these approaches have overlooked the importance of individual segments within a video and the relationship among them, and tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method~(CM-PIE), which can learn fine-grained features by applying a segment-based attention module. Furthermore, a cross-modal aggregation block is introduced to jointly optimize the semantic representation of audio and visual signals by enhancing inter-modal interactions. The experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse dataset compared to other methods.
    摘要 Audio-visual视频分解是将视频分为每个段的任务,并将它们分类为听ible或可见事件。现代方法在这个任务中利用注意力机制来捕捉全视频的语义相关性。然而,这些方法通常忽略视频中每个段的重要性以及它们之间的关系,而且通常仅仅使用单一的感知modalities来学习特征。在本文中,我们提出了一种新的交互增强的交叉模态感知方法(CM-PIE),可以通过应用段级注意力模块来学习细化的特征。此外,我们还引入了交叉模态聚合块,以便在语音和视频信号之间强化交互。实验结果表明,我们的模型在Look, Listen, and Parse数据集上的分解性能比其他方法更高。

A Unified Remote Sensing Anomaly Detector Across Modalities and Scenes via Deviation Relationship Learning

  • paper_url: http://arxiv.org/abs/2310.07511
  • repo_url: https://github.com/jingtao-li-cver/uniadrs
  • paper_authors: Jingtao Li, Xinyu Wang, Hengwei Zhao, Liangpei Zhang, Yanfei Zhong
  • for: 这项研究旨在构建一个跨模态、跨场景的统一遥感异常检测器,用于发现地球上的多种异常目标。
  • methods: 本研究基于偏差关系将异常检测任务重新表述为一个无向双层图,将异常得分建模为条件概率,并通过条件概率排序问题来训练模型。
  • results: 实验表明,所提出的模型在高光谱、可见光、合成孔径雷达(SAR)、红外和低照度五种模态上均具有统一的异常检测能力。
    Abstract Remote sensing anomaly detector can find the objects deviating from the background as potential targets. Given the diversity in earth anomaly types, a unified anomaly detector across modalities and scenes should be cost-effective and flexible to new earth observation sources and anomaly types. However, the current anomaly detectors are limited to a single modality and single scene, since they aim to learn the varying background distribution. Motivated by the universal anomaly deviation pattern, in that anomalies exhibit deviations from their local context, we exploit this characteristic to build a unified anomaly detector. Firstly, we reformulate the anomaly detection task as an undirected bilayer graph based on the deviation relationship, where the anomaly score is modeled as the conditional probability, given the pattern of the background and normal objects. The learning objective is then expressed as a conditional probability ranking problem. Furthermore, we design an instantiation of the reformulation in the data, architecture, and optimization aspects. Simulated spectral and spatial anomalies drive the instantiated architecture. The model is optimized directly for the conditional probability ranking. The proposed model was validated in five modalities including the hyperspectral, visible light, synthetic aperture radar (SAR), infrared and low light to show its unified detection ability.
    摘要 遥感异常检测器可以发现偏离背景的物体,并将其作为潜在目标。鉴于地球上异常类型的多样性,一个跨模态、跨场景的统一异常检测器应当具有成本效益,并能够灵活适应新的对地观测数据源和异常类型。然而,现有的异常检测器通常局限于单一模态和单一场景,因为它们旨在学习不断变化的背景分布。受异常普遍表现为偏离其局部上下文这一通用偏差模式的启发,我们利用这一特性构建了统一的异常检测器。首先,我们基于偏差关系将异常检测任务重新表述为一个无向双层图,其中异常得分被建模为在给定背景和正常物体模式下的条件概率,学习目标则表述为条件概率排序问题。此外,我们从数据、架构和优化三个方面给出了该表述的具体实现,利用模拟的光谱和空间异常驱动该架构,并直接针对条件概率排序进行优化。所提出的模型在高光谱、可见光、合成孔径雷达(SAR)、红外和低照度五种模态上得到了验证,展示了统一的检测能力。

Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2310.07510
  • repo_url: None
  • paper_authors: Zhiming Qian
  • for: 该 paper 的目的是提出一种视觉基础模型,以便更好地模仿人类视觉认识多样化、开放世界的方式。
  • methods: 该 paper 以多任务方式同时采用自监督和有监督的视觉代理任务(pre-text tasks),包括多标签分类、掩码图像建模和对比学习等,以提升视觉表示学习的效率。
  • results: 该 paper 的实验结果表明,使用该基础模型可以在多种视觉任务中达到或超过现有最佳Result,包括 ImageNet-1K 分类、COCO 物体检测和 ADE-20K Semantic Segmentation 等。
    Abstract To mimic human vision with the way of recognizing the diverse and open world, foundation vision models are much critical. While recent techniques of self-supervised learning show the promising potentiality of this mission, we argue that signals from labelled data are also important for common-sense recognition, and properly chosen pre-text tasks can facilitate the efficiency of vision representation learning. To this end, we propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner. Specifically, given an image, we take a heuristic way by considering its intrinsic style properties, inside objects with their locations and correlations, and how it looks like in 3D space for basic visual understanding. However, large-scale object bounding boxes and correlations are usually hard to achieve. Alternatively, we develop a hybrid method by leveraging both multi-label classification and self-supervised learning. On the one hand, under the multi-label supervision, the pre-trained model can explore the detailed information of an image, e.g., image types, objects, and part of semantic relations. On the other hand, self-supervised learning tasks, with respect to Masked Image Modeling (MIM) and contrastive learning, can help the model learn pixel details and patch correlations. Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3\% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection for Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation when using Upernet. The performance shows the ability of our vision foundation model to serve general purpose vision tasks.
    摘要 On one hand, under multi-label supervision, the pre-trained model can explore detailed image information such as image types, objects, and semantic relations. On the other hand, self-supervised learning tasks like Masked Image Modeling (MIM) and contrastive learning help the model learn pixel details and patch correlations. Our pre-trained models achieve results on par with or better than state-of-the-art (SOTA) on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection for Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation when using Upernet. These results demonstrate the ability of our vision foundation model to handle various general-purpose vision tasks.

Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation

  • paper_url: http://arxiv.org/abs/2310.07506
  • repo_url: None
  • paper_authors: Haizhong Zheng, Jiachen Sun, Shutong Wu, Bhavya Kailkhura, Zhuoqing Mao, Chaowei Xiao, Atul Prakash
  • for: 本文提出了一种新的数据压缩方法,以提高模型训练的性能。
  • methods: 本文使用了一种新的数据参数化方法,将数据压缩成参数化数据容器而不是像素空间。
  • results: 比较baseline方法,本文的提posed方法在四个公共数据集(SVHN、CIFAR10、CIFAR100和Tiny-ImageNet)上表现出了更高的性能,甚至在使用批处理损失函数并使用 less GPU内存。
    Abstract Given a real-world dataset, data condensation (DC) aims to synthesize a significantly smaller dataset that captures the knowledge of this dataset for model training with high performance. Recent works propose to enhance DC with data parameterization, which condenses data into parameterized data containers rather than pixel space. The intuition behind data parameterization is to encode shared features of images to avoid additional storage costs. In this paper, we recognize that images share common features in a hierarchical way due to the inherent hierarchical structure of the classification system, which is overlooked by current data parameterization methods. To better align DC with this hierarchical nature and encourage more efficient information sharing inside data containers, we propose a novel data parameterization architecture, Hierarchical Memory Network (HMN). HMN stores condensed data in a three-tier structure, representing the dataset-level, class-level, and instance-level features. Another helpful property of the hierarchical architecture is that HMN naturally ensures good independence among images despite achieving information sharing. This enables instance-level pruning for HMN to reduce redundant information, thereby further minimizing redundancy and enhancing performance. We evaluate HMN on four public datasets (SVHN, CIFAR10, CIFAR100, and Tiny-ImageNet) and compare HMN with eight DC baselines. The evaluation results show that our proposed method outperforms all baselines, even when trained with a batch-based loss consuming less GPU memory.
    摘要 给定一个真实世界数据集,数据集凝缩(DC)的目标是合成一个显著更小的数据集,使其保留原数据集的知识,从而支撑高性能的模型训练。近期的工作提出通过数据参数化来增强DC,即将数据压缩到参数化的数据容器中,而不是像素空间。数据参数化的直觉是编码图像间的共享特征,以避免额外的存储开销。在本文中,我们注意到由于分类体系固有的层次结构,图像以层次化的方式共享共同特征,而这一点被现有的数据参数化方法所忽视。为了使DC更好地契合这种层次特性,并促进数据容器内部更高效的信息共享,我们提出了一种新的数据参数化架构——层次记忆网络(HMN)。HMN将凝缩后的数据存储在三层结构中,分别表示数据集级、类别级和实例级特征。层次架构的另一个有益特性是,HMN在实现信息共享的同时自然地保证了图像之间良好的独立性,这使得HMN可以进行实例级剪枝以去除冗余信息,从而进一步减少冗余并提升性能。我们在四个公共数据集(SVHN、CIFAR10、CIFAR100和Tiny-ImageNet)上评估了HMN,并与八个DC基线方法进行比较。评估结果表明,即使使用占用更少GPU内存的基于批次的损失进行训练,我们的方法也优于所有基线。
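One plausible reading of the three-tier container described above is a shared dataset-level code plus class-level and instance-level codes that are summed and decoded into an image. The abstract does not specify the composition rule or the decoder, so the sketch below is an assumption meant only to illustrate the hierarchical parameter sharing and why instance-level pruning stays cheap.

```python
import torch
import torch.nn as nn

class HierarchicalMemory(nn.Module):
    """Three-tier condensed-data container: dataset-, class- and instance-level features.

    How the tiers are combined and decoded in HMN is not spelled out in the abstract;
    summing the three codes and decoding with a small ConvTranspose stack is an assumption.
    """
    def __init__(self, num_classes, per_class, code_hw=8, channels=16):
        super().__init__()
        self.dataset_code = nn.Parameter(torch.randn(1, channels, code_hw, code_hw))
        self.class_codes = nn.Parameter(torch.randn(num_classes, channels, code_hw, code_hw))
        self.instance_codes = nn.Parameter(
            torch.randn(num_classes, per_class, channels, code_hw, code_hw))
        self.decoder = nn.Sequential(                      # 8x8 code -> 32x32 RGB image
            nn.ConvTranspose2d(channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, cls_idx, inst_idx):
        code = (self.dataset_code
                + self.class_codes[cls_idx]
                + self.instance_codes[cls_idx, inst_idx])  # (B, C, h, w)
        return self.decoder(code)

hmn = HierarchicalMemory(num_classes=10, per_class=5)
imgs = hmn(torch.tensor([0, 3]), torch.tensor([1, 4]))
print(imgs.shape)  # torch.Size([2, 3, 32, 32])
```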

PtychoDV: Vision Transformer-Based Deep Unrolling Network for Ptychographic Image Reconstruction

  • paper_url: http://arxiv.org/abs/2310.07504
  • repo_url: None
  • paper_authors: Weijie Gan, Qiuchen Zhai, Michael Thompson McCann, Cristina Garcia Cardona, Ulugbek S. Kamilov, Brendt Wohlberg
  • for: 本研究旨在提出一种高效、高品质ptychography图像重建方法,以解决现有深度学习方法的缺陷和费时问题。
  • methods: 本研究使用了一种新的深度学习模型,称为PtychoDV,来实现高效、高品质ptychography图像重建。PtychoDV包括一个视觉变换器,用于从 raw measurement 中生成初始图像,并考虑到这些测量的相互关系。然后,使用一个深度 unfolding 网络来修正初始图像,使用学习的卷积推荐和ptychography测量模型。
  • results: 实验结果表明,PtychoDV 在仿真数据上优于现有的深度学习方法,并且与迭代方法相比显著降低了计算成本,同时保持了有竞争力的性能。
    Abstract Ptychography is an imaging technique that captures multiple overlapping snapshots of a sample, illuminated coherently by a moving localized probe. The image recovery from ptychographic data is generally achieved via an iterative algorithm that solves a nonlinear phase-field problem derived from measured diffraction patterns. However, these approaches have high computational cost. In this paper, we introduce PtychoDV, a novel deep model-based network designed for efficient, high-quality ptychographic image reconstruction. PtychoDV comprises a vision transformer that generates an initial image from the set of raw measurements, taking into consideration their mutual correlations. This is followed by a deep unrolling network that refines the initial image using learnable convolutional priors and the ptychography measurement model. Experimental results on simulated data demonstrate that PtychoDV is capable of outperforming existing deep learning methods for this problem, and significantly reduces computational cost compared to iterative methodologies, while maintaining competitive performance.
    摘要 Ptychography(叠层衍射成像)是一种成像技术,通过移动的局部探针对样品进行相干照明,采集多幅相互重叠的快照。从ptychography数据中恢复图像通常依赖迭代算法,求解由测得的衍射图样导出的非线性相位场问题,但这类方法计算成本很高。在本文中,我们提出了PtychoDV,一种新的基于深度模型的网络,用于高效、高质量的ptychography图像重建。PtychoDV包括一个视觉Transformer,它在考虑原始测量数据之间相互关系的情况下,从这些测量中生成初始图像;随后,一个深度展开网络利用可学习的卷积先验和ptychography测量模型对初始图像进行细化。在仿真数据上的实验结果表明,PtychoDV能够优于现有的深度学习方法,并且相比迭代方法显著降低了计算成本,同时保持有竞争力的性能。
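The deep-unrolling half of the pipeline alternates a data-consistency step under the measurement model with a learned convolutional prior. The sketch below shows that generic pattern with a toy blur operator standing in for the ptychographic forward model; the ViT initializer and the true probe/Fourier-magnitude physics are not modeled.

```python
import torch
import torch.nn as nn

class UnrolledNet(nn.Module):
    """Generic deep unrolling: alternate a data-consistency gradient step with a
    learnable convolutional prior. The true ptychographic forward model (probe,
    Fourier magnitudes) and the ViT initializer used by PtychoDV are omitted;
    a toy linear operator A stands in for the measurement model."""
    def __init__(self, n_iters=5):
        super().__init__()
        self.step = nn.Parameter(torch.full((n_iters,), 0.1))
        self.priors = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1))
            for _ in range(n_iters)])

    def forward(self, x0, y, A):
        x = x0
        for step, prior in zip(self.step, self.priors):
            grad = A.adjoint(A(x) - y)        # gradient of 0.5 * ||Ax - y||^2
            x = x - step * grad               # data-consistency update
            x = x + prior(x)                  # learned residual refinement
        return x

class Blur:
    """Toy measurement operator: 3x3 average blur (approximately self-adjoint)."""
    def __init__(self):
        self.k = torch.ones(1, 1, 3, 3) / 9.0
    def __call__(self, x):
        return torch.nn.functional.conv2d(x, self.k, padding=1)
    def adjoint(self, r):
        return torch.nn.functional.conv2d(r, self.k, padding=1)

A = Blur()
x_true = torch.rand(1, 1, 32, 32)
y = A(x_true)
net = UnrolledNet()
print(net(torch.zeros_like(x_true), y, A).shape)  # torch.Size([1, 1, 32, 32])
```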

FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation

  • paper_url: http://arxiv.org/abs/2310.07473
  • repo_url: https://github.com/XinyuSun/FGPrompt
  • paper_authors: Xinyu Sun, Peihao Chen, Jugang Fan, Thomas H. Li, Jian Chen, Mingkui Tan
  • for: 本研究旨在解决自主系统导航到图像指定目标的重要 yet challenging task,agent需要从拍摄图像中理解目标位置。
  • methods: 我们采用 Fine-grained Goal Prompting (FGPrompt) 方法,利用高级别和高分辨率特征图作为启发,通过conditioned embedding来保留目标图像中细节信息,并使观察Encoder更加关注目标相关区域。
  • results: 与现有方法相比,我们在3个图像目标导航 benchmark 数据集(Gibson、MP3D、HM3D)上显著提高了性能;特别是在Gibson上,我们将最先进方法的成功率提高了8%,而模型大小仅为其1/50。
    Abstract Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems. The agent is required to reason the goal location from where a picture is shot. Existing methods try to solve this problem by learning a navigation policy, which captures semantic features of the goal image and observation image independently and lastly fuses them for predicting a sequence of navigation actions. However, these methods suffer from two major limitations. 1) They may miss detailed information in the goal image, and thus fail to reason the goal location. 2) More critically, it is hard to focus on the goal-relevant regions in the observation image, because they attempt to understand observation without goal conditioning. In this paper, we aim to overcome these limitations by designing a Fine-grained Goal Prompting (FGPrompt) method for image-goal navigation. In particular, we leverage fine-grained and high-resolution feature maps in the goal image as prompts to perform conditioned embedding, which preserves detailed information in the goal image and guides the observation encoder to pay attention to goal-relevant regions. Compared with existing methods on the image-goal navigation benchmark, our method brings significant performance improvement on 3 benchmark datasets (i.e., Gibson, MP3D, and HM3D). Especially on Gibson, we surpass the state-of-the-art success rate by 8% with only 1/50 model size. Project page: https://xinyusun.github.io/fgprompt-pages
    摘要 学习导航到由图像指定的目标位置,是自主系统中一项重要而具有挑战性的任务,智能体需要从拍摄的图像中推理出目标位置。现有方法通过学习导航策略来解决该问题:独立地提取目标图像和观测图像的语义特征,最后将二者融合以预测导航动作序列。然而,这些方法存在两个主要局限:1)它们可能遗漏目标图像中的细节信息,从而无法推理出目标位置;2)更关键的是,由于在理解观测时没有以目标为条件,它们难以关注观测图像中与目标相关的区域。在本文中,我们通过设计细粒度目标提示(FGPrompt)方法来克服这些局限。具体来说,我们利用目标图像中细粒度、高分辨率的特征图作为提示来进行条件化嵌入,从而保留目标图像中的细节信息,并引导观测编码器关注与目标相关的区域。与现有方法相比,我们在三个图像目标导航基准(Gibson、MP3D和HM3D)上带来了显著的性能提升;特别是在Gibson上,我们以仅1/50的模型大小将最先进的成功率提高了8%。项目页面:https://xinyusun.github.io/fgprompt-pages

PoRF: Pose Residual Field for Accurate Neural Surface Reconstruction

  • paper_url: http://arxiv.org/abs/2310.07449
  • repo_url: None
  • paper_authors: Jia-Wang Bian, Wenjing Bian, Victor Adrian Prisacariu, Philip Torr
  • for: The paper aims to improve the accuracy of neural surface reconstruction in real-world scenarios by refining camera poses and reducing the impact of camera pose noise.
  • methods: The proposed method uses a novel implicit representation called the pose residual field (PoRF), which leverages global information over the entire sequence to improve pose accuracy. The method also introduces an epipolar geometry loss to enhance supervision and improve pose accuracy.
  • results: The proposed method achieves promising results on two datasets: the DTU dataset and the MobileBrick dataset. On the DTU dataset, the method reduces rotation error by 78% for COLMAP poses and decreases the reconstruction Chamfer distance from 3.48mm to 0.85mm. On the MobileBrick dataset, the method improves the reconstruction F1 score from 69.18 to 75.67, outperforming the dataset provided ground-truth pose.
    Abstract Neural surface reconstruction is sensitive to the camera pose noise, even if state-of-the-art pose estimators like COLMAP or ARKit are used. More importantly, existing Pose-NeRF joint optimisation methods have struggled to improve pose accuracy in challenging real-world scenarios. To overcome the challenges, we introduce the pose residual field (\textbf{PoRF}), a novel implicit representation that uses an MLP for regressing pose updates. This is more robust than the conventional pose parameter optimisation due to parameter sharing that leverages global information over the entire sequence. Furthermore, we propose an epipolar geometry loss to enhance the supervision that leverages the correspondences exported from COLMAP results without the extra computational overhead. Our method yields promising results. On the DTU dataset, we reduce the rotation error by 78\% for COLMAP poses, leading to the decreased reconstruction Chamfer distance from 3.48mm to 0.85mm. On the MobileBrick dataset that contains casually captured unbounded 360-degree videos, our method refines ARKit poses and improves the reconstruction F1 score from 69.18 to 75.67, outperforming that with the dataset provided ground-truth pose (75.14). These achievements demonstrate the efficacy of our approach in refining camera poses and improving the accuracy of neural surface reconstruction in real-world scenarios.
    摘要 神经表面重建对相机位姿噪声十分敏感,即使使用COLMAP或ARKit等最先进的位姿估计器也是如此。更重要的是,现有的位姿-NeRF联合优化方法在具有挑战性的真实场景中难以进一步提高位姿精度。为了解决这些挑战,我们引入位姿残差场(PoRF),这是一种新的隐式表示,使用MLP来回归位姿更新。由于参数共享能够利用整个序列的全局信息,这种方法比传统的逐位姿参数优化更加鲁棒。此外,我们提出了对极几何损失来增强监督,它利用从COLMAP结果导出的对应关系,而无需额外的计算开销。我们的方法取得了可观的结果:在DTU数据集上,我们将COLMAP位姿的旋转误差降低了78%,使重建的Chamfer距离从3.48mm降至0.85mm;在包含随手拍摄的无界360度视频的MobileBrick数据集上,我们的方法改进了ARKit位姿,将重建F1分数从69.18提高到75.67,超过了使用数据集提供的真值位姿的结果(75.14)。这些成果表明我们的方法能够在真实场景中有效地改进相机位姿并提高神经表面重建的精度。
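The core idea is a single MLP, shared across the whole sequence, that regresses a small SE(3) residual for each frame and composes it with the initial COLMAP/ARKit pose. The sketch below assumes a normalized frame index as input and left-multiplicative rotation corrections; the epipolar geometry loss and joint training with the surface model are omitted.

```python
import torch
import torch.nn as nn

def axis_angle_to_matrix(r):
    """Rodrigues' formula: (B, 3) axis-angle -> (B, 3, 3) rotation matrices."""
    theta = r.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = r / theta
    K = torch.zeros(r.shape[0], 3, 3, device=r.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3, device=r.device).expand_as(K)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return I + s * K + (1 - c) * (K @ K)

class PoseResidualField(nn.Module):
    """One shared MLP regresses a small SE(3) correction for every frame.

    Feeding a normalized frame index is an assumption about the input
    parameterization; the key point is that parameters are shared across
    the whole sequence instead of optimizing each pose independently.
    """
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 6))

    def forward(self, t, R_init, t_init):
        out = 0.01 * self.mlp(t)                    # keep residuals small
        dR = axis_angle_to_matrix(out[:, :3])
        R = dR @ R_init                             # left-multiply rotation correction
        trans = t_init + out[:, 3:]
        return R, trans

n = 8
porf = PoseResidualField()
t_idx = torch.linspace(0, 1, n).unsqueeze(1)        # normalized frame indices
R0 = torch.eye(3).expand(n, 3, 3)
t0 = torch.zeros(n, 3)
R, trans = porf(t_idx, R0, t0)
print(R.shape, trans.shape)  # torch.Size([8, 3, 3]) torch.Size([8, 3])
```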

Distance Weighted Trans Network for Image Completion

  • paper_url: http://arxiv.org/abs/2310.07440
  • repo_url: None
  • paper_authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Xuelong Li, Yue Lu
  • for: 本文提出了一种新的图像生成模型,用于更好地理解图像的结构和关系。
  • methods: 该模型基于距离Weighted Transformer (DWT)和卷积神经网络 (CNN) 两种不同的模型,以优化图像完成过程。
  • results: 对三个复杂的图像 dataset 进行了广泛的量化和质量测试,并证明了该模型的超越性。
    Abstract The challenge of image generation has been effectively modeled as a problem of structure priors or transformation. However, existing models have unsatisfactory performance in understanding the global input image structures because of particular inherent features (for example, local inductive prior). Recent studies have shown that self-attention is an efficient modeling technique for image completion problems. In this paper, we propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components. In our model, we leverage the strengths of both Convolutional Neural Networks (CNNs) and DWT blocks to enhance the image completion process. Specifically, CNNs are used to augment the local texture information of coarse priors and DWT blocks are used to recover certain coarse textures and coherent visual structures. Unlike current approaches that generally use CNNs to create feature maps, we use the DWT to encode global dependencies and compute distance-based weighted feature maps, which substantially minimizes the problem of visual ambiguities. Meanwhile, to better produce repeated textures, we introduce Residual Fast Fourier Convolution (Res-FFC) blocks to combine the encoder's skip features with the coarse features provided by our generator. Furthermore, a simple yet effective technique is proposed to normalize the non-zero values of convolutions, and fine-tune the network layers for regularization of the gradient norms to provide an efficient training stabiliser. Extensive quantitative and qualitative experiments on three challenging datasets demonstrate the superiority of our proposed model compared to existing approaches.
    摘要 图像生成的挑战通常被建模为结构先验或变换问题。然而,由于某些固有特性(例如局部归纳先验),现有模型在理解全局输入图像结构方面表现并不理想。最近的研究表明,自注意力是图像补全问题的一种高效建模技术。在本文中,我们提出了一种依赖距离加权Transformer(DWT)的新架构,以更好地理解图像各组成部分之间的关系。我们在模型中结合卷积神经网络(CNN)和DWT模块的优势来增强图像补全过程:CNN用于增强粗略先验的局部纹理信息,DWT模块用于恢复某些粗略纹理和连贯的视觉结构。与通常使用CNN生成特征图的现有方法不同,我们使用DWT来编码全局依赖关系并计算基于距离的加权特征图,从而大幅减少视觉歧义问题。同时,为了更好地生成重复纹理,我们引入残差快速傅里叶卷积(Res-FFC)模块,将编码器的跳跃特征与生成器提供的粗略特征相结合。此外,我们还提出了一种简单而有效的技术来归一化卷积的非零值,并对网络层进行微调以正则化梯度范数,从而提供高效的训练稳定器。在三个具有挑战性的数据集上进行的大量定量和定性实验表明,我们提出的模型优于现有方法。

DESTINE: Dynamic Goal Queries with Temporal Transductive Alignment for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2310.07438
  • repo_url: None
  • paper_authors: Rezaul Karim, Soheil Mohamad Alizadeh Shabestary, Amir Rasouli
  • for: 预测路用户的时间相关路径,以便更好地预测路用户的行为。
  • methods: 使用semantic map信息和模型交互,并建立一种能够理解不同粒度的行为的机制。
  • results: 使用DESTINE方法,在Argoverse数据集上实现了状态领先的性能,并通过精细的ablation研究,证明了不同模块的贡献。
    Abstract Predicting temporally consistent road users' trajectories in a multi-agent setting is a challenging task due to unknown characteristics of agents and their varying intentions. Besides using semantic map information and modeling interactions, it is important to build an effective mechanism capable of reasoning about behaviors at different levels of granularity. To this end, we propose Dynamic goal quErieS with temporal Transductive alIgNmEnt (DESTINE) method. Unlike past arts, our approach 1) dynamically predicts agents' goals irrespective of particular road structures, such as lanes, allowing the method to produce a more accurate estimation of destinations; 2) achieves map compliant predictions by generating future trajectories in a coarse-to-fine fashion, where the coarser predictions at a lower frame rate serve as intermediate goals; and 3) uses an attention module designed to temporally align predicted trajectories via masked attention. Using the common Argoverse benchmark dataset, we show that our method achieves state-of-the-art performance on various metrics, and further investigate the contributions of proposed modules via comprehensive ablation studies.
    摘要 预测 temporally consistent road users' trajectories 在多代理人设定下是一项具有挑战性的任务,因为代理人的特性和意图都是未知的。除了使用semantic map信息和模型交互外,我们还需要建立一种有效的机制,以便对代理人的行为进行不同级别的推理。为此,我们提出了动态目标QuErieS with temporal Transductive alIgNmEnt(DESTINE)方法。与过去的艺术不同,我们的方法具有以下特点:1. 动态预测代理人的目标,不受特定的道路结构,如车道,影响预测的精度;2. 实现地图符合的预测,通过在粗粒度和细粒度之间进行多层次预测,使得中间的粗粒度预测服为间接目标;3. 使用面向时间的注意模块,通过做Masked attention来进行时间启配。使用常用的 Argoverse 数据集,我们表明了我们的方法在不同的维度上达到了领先的性能,并通过完整的减少研究来调查提posed模块的贡献。

A Novel Voronoi-based Convolutional Neural Network Framework for Pushing Person Detection in Crowd Videos

  • paper_url: http://arxiv.org/abs/2310.07416
  • repo_url: https://github.com/pedestriandynamics/vcnn4pude
  • paper_authors: Ahmed Alia, Mohammed Maree, Mohcine Chraibi, Armin Seyfried
  • for: 本研究旨在通过分析人群中微观动态行为,提供更深刻的人群模式和互动理解,以便制定更有效的人群管理策略、优化人群流动和提高总体人群体验。
  • methods: 本研究提出了一种新的自动化推挤检测框架,包括两个主要组成部分:i)特征提取和ii)视频标注。在特征提取部分,采用了一种基于Voronoi的新方法来确定输入视频中每个人的局部区域。然后,这些区域被送入EfficientNetV1B0卷积神经网络,以提取每个人随时间变化的深度特征。在第二个组成部分中,使用一个带sigmoid激活函数的全连接层来分析这些深度特征,并标注视频中参与推挤的个体。
  • results: 实验结果表明,提议的框架在比较分析中超过了7个基准方法。
    Abstract Analyzing the microscopic dynamics of pushing behavior within crowds can offer valuable insights into crowd patterns and interactions. By identifying instances of pushing in crowd videos, a deeper understanding of when, where, and why such behavior occurs can be achieved. This knowledge is crucial to creating more effective crowd management strategies, optimizing crowd flow, and enhancing overall crowd experiences. However, manually identifying pushing behavior at the microscopic level is challenging, and the existing automatic approaches cannot detect such microscopic behavior. Thus, this article introduces a novel automatic framework for identifying pushing in videos of crowds on a microscopic level. The framework comprises two main components: i) Feature extraction and ii) Video labeling. In the feature extraction component, a new Voronoi-based method is developed for determining the local regions associated with each person in the input video. Subsequently, these regions are fed into EfficientNetV1B0 Convolutional Neural Network to extract the deep features of each person over time. In the second component, a combination of a fully connected layer with a Sigmoid activation function is employed to analyze these deep features and annotate the individuals involved in pushing within the video. The framework is trained and evaluated on a new dataset created using six real-world experiments, including their corresponding ground truths. The experimental findings indicate that the suggested framework outperforms seven baseline methods that are employed for comparative analysis purposes.
    摘要 分析人群中推挤行为的微观动态,可以为人群模式和相互作用提供有价值的见解。通过在人群视频中识别推挤行为,可以更深入地理解这种行为何时、何地以及为何发生。这些知识对于制定更有效的人群管理策略、优化人群流动以及改善整体人群体验至关重要。然而,在微观层面手动识别推挤行为十分困难,而现有的自动方法无法检测这种微观行为。因此,本文提出了一种新的自动框架,用于在微观层面识别人群视频中的推挤行为。该框架包括两个主要组成部分:一是特征提取,二是视频标注。在特征提取部分,我们开发了一种基于Voronoi的新方法,用于确定输入视频中每个人的局部区域;然后,这些区域被送入EfficientNetV1B0卷积神经网络,以提取每个人随时间变化的深度特征。在第二个组成部分中,我们使用带sigmoid激活函数的全连接层来分析这些深度特征,并标注视频中参与推挤的个体。我们使用由六个真实实验及其对应真值构建的新数据集对该框架进行了训练和评估。实验结果表明,所提出的框架优于用于对比分析的七种基线方法。
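The feature-extraction stage assigns each tracked person a local region before CNN encoding. A discrete Voronoi partition of the frame (every pixel assigned to its nearest person) is one simple way to realize such regions, sketched below with SciPy; the paper's exact region construction and the EfficientNet/temporal pipeline are not reproduced.

```python
import numpy as np
from scipy.spatial import cKDTree

def voronoi_label_map(positions, frame_hw):
    """Assign every pixel to its nearest person, i.e. the discrete Voronoi partition
    of the frame. The paper's exact region construction around each person may differ;
    this only shows the idea of person-specific local regions.
    positions : (N, 2) array of (x, y) pixel coordinates of tracked persons.
    Returns an (H, W) int map whose value at each pixel is the index of the closest person.
    """
    h, w = frame_hw
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1)
    _, idx = cKDTree(positions).query(pixels)
    return idx.reshape(h, w)

def crop_region(frame, label_map, person_idx):
    """Tight crop of one person's Voronoi cell, to be fed to the CNN feature extractor."""
    ys, xs = np.nonzero(label_map == person_idx)
    return frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

frame = np.random.randint(0, 255, (120, 160, 3), dtype=np.uint8)
people = np.array([[40.0, 30.0], [100.0, 60.0], [70.0, 100.0]])
labels = voronoi_label_map(people, frame.shape[:2])
print(labels.shape, crop_region(frame, labels, 1).shape)
```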

CLIP for Lightweight Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.07394
  • repo_url: None
  • paper_authors: Ke Jin, Wankou Yang
  • for: 该文章目的是提出一种新的语言引导的Semantic Segmentation方法,使得语言引导的方法可以应用于轻量级网络。
  • methods: 该方法使用了一种新的特征融合模块,该模块通过并行的CNN和Transformer网络实现语言引导的特征融合,从而提高Semantic Segmentation的性能。
  • results: 实验结果表明,无论采用哪种视觉骨干网络,该方法都能取得优于以往SOTA工作(例如DenseCLIP)的性能,并且能够充分利用预训练语言先验知识来提升语义分割的效果。
    Abstract The large-scale pretrained model CLIP, trained on 400 million image-text pairs, offers a promising paradigm for tackling vision tasks, albeit at the image level. Later works, such as DenseCLIP and LSeg, extend this paradigm to dense prediction, including semantic segmentation, and have achieved excellent results. However, the above methods either rely on CLIP-pretrained visual backbones or use none-pretrained but heavy backbones such as Swin, while falling ineffective when applied to lightweight backbones. The reason for this is that the lightweitht networks, feature extraction ability of which are relatively limited, meet difficulty embedding the image feature aligned with text embeddings perfectly. In this work, we present a new feature fusion module which tackles this problem and enables language-guided paradigm to be applied to lightweight networks. Specifically, the module is a parallel design of CNN and transformer with a two-way bridge in between, where CNN extracts spatial information and visual context of the feature map from the image encoder, and the transformer propagates text embeddings from the text encoder forward. The core of the module is the bidirectional fusion of visual and text feature across the bridge which prompts their proximity and alignment in embedding space. The module is model-agnostic, which can not only make language-guided lightweight semantic segmentation practical, but also fully exploit the pretrained knowledge of language priors and achieve better performance than previous SOTA work, such as DenseCLIP, whatever the vision backbone is. Extensive experiments have been conducted to demonstrate the superiority of our method.
    摘要 大规模预训练模型CLIP,训练了400万张图像和文本对应对,提供了解决视觉任务的有前途的方法,即图像水平。后续的工作,如 denseclip 和 lseg,在 dense prediction 方面扩展了这种方法,并取得了出色的结果。然而,以上方法都是基于 CLIP 预训练的视觉脊梁或使用不预训练的但是重量级别的脊梁,如 swin,而不是使用轻量级别的脊梁。这是因为轻量级别的网络,其特征提取能力相对较弱,难以将图像特征与文本嵌入完全匹配。在这种情况下,我们提出了一种新的特征融合模块,该模块可以解决这个问题,并使得语言导向的方法可以应用于轻量级别的网络。具体来说,该模块是一种并行的 CNN 和 transformer 的设计,其中 CNN 提取图像中的空间信息和视觉上下文,而 transformer 将文本嵌入从文本编码器传递进来。模块的核心是两个方向的特征融合,使得视觉和文本特征在嵌入空间中相互吸引。这种模块是模型无关的,可以不仅使得语言导向的轻量级别semantic segmentation 成为现实,还可以全面利用预训练的语言优先知识,并在不同的视觉脊梁上达到更高的性能,比如 denseclip 等。我们进行了广泛的实验,以证明我们的方法的优越性。

Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters

  • paper_url: http://arxiv.org/abs/2310.07361
  • repo_url: None
  • paper_authors: Mateusz Michalkiewicz, Masoud Faraki, Xiang Yu, Manmohan Chandraker, Mahsa Baktashmotlagh
  • for: 防止深度神经网络过拟合源域数据
  • methods: 基于梯度信号噪声比(GSNR)选择dropout掩码,利用元学习法寻找优化的dropout比率
  • results: 在标准领域泛化基准上取得有竞争力的结果,包括分类和人脸反欺骗问题
    Abstract Overfitting to the source domain is a common issue in gradient-based training of deep neural networks. To compensate for the over-parameterized models, numerous regularization techniques have been introduced such as those based on dropout. While these methods achieve significant improvements on classical benchmarks such as ImageNet, their performance diminishes with the introduction of domain shift in the test set i.e. when the unseen data comes from a significantly different distribution. In this paper, we move away from the classical approach of Bernoulli sampled dropout mask construction and propose to base the selection on gradient-signal-to-noise ratio (GSNR) of network's parameters. Specifically, at each training step, parameters with high GSNR will be discarded. Furthermore, we alleviate the burden of manually searching for the optimal dropout ratio by leveraging a meta-learning approach. We evaluate our method on standard domain generalization benchmarks and achieve competitive results on classification and face anti-spoofing problems.
    摘要 常见的过拟合问题在深度神经网络的梯度基本训练中出现。为了缓解过参数的模型,许多正则化技术被引入,如基于dropout的方法。这些方法在ImageNet等 классических测试集上实现了显著改善,但是在测试集中的领域变化时,其性能减退。在这篇论文中,我们偏离了传统的bernoulli抽样dropout面积建构方法,并基于网络参数的梯度噪声比率(GSNR)来选择参数。具体来说,在每个训练步骤中,具有高GSNR的参数将被排除。此外,我们利用元学习方法,以避免手动搜索dropout率的优化。我们在标准的领域普适性测试上评估了我们的方法,并在分类和人脸防伪检测问题上实现了竞争性的结果。
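A rough sketch of the selection criterion described above: estimate each parameter's gradient signal-to-noise ratio over a few micro-batches and mask the highest-GSNR entries. How the selected parameters are "discarded" during training and the meta-learned dropout ratio are not specified here, so masking their gradient update is an assumption.

```python
import torch
import torch.nn as nn

def gsnr_masks(model, loss_fn, batches, drop_ratio=0.1):
    """Estimate per-parameter gradient signal-to-noise ratio from a few micro-batches
    and return boolean masks that zero out the highest-GSNR entries.

    GSNR(theta) = mean_b(grad_b)^2 / var_b(grad_b).
    """
    grads = {name: [] for name, _ in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            grads[name].append(p.grad.detach().clone())
    masks = {}
    for name, gs in grads.items():
        g = torch.stack(gs)                                   # (B, *param_shape)
        gsnr = g.mean(0).pow(2) / (g.var(0, unbiased=False) + 1e-12)
        k = int(drop_ratio * gsnr.numel())
        thresh = gsnr.flatten().topk(k).values.min() if k > 0 else float("inf")
        masks[name] = gsnr < thresh                           # keep low-GSNR entries
    return masks

# Toy usage: mask the gradients of high-GSNR parameters before an optimizer step.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
masks = gsnr_masks(model, nn.CrossEntropyLoss(), data)
for name, p in model.named_parameters():
    if p.grad is not None:
        p.grad *= masks[name].float()
```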

Diagnosing Bipolar Disorder from 3-D Structural Magnetic Resonance Images Using a Hybrid GAN-CNN Method

  • paper_url: http://arxiv.org/abs/2310.07359
  • repo_url: None
  • paper_authors: Masood Hamed Saghayan, Mohammad Hossein Zolfagharnasab, Ali Khadem, Farzam Matinfar, Hassan Rashidi
  • for: 这个研究旨在开发一个基于三维显微镜像(sMRI)的潜在鉴别BD患者的方法,以提供更可靠的诊断支持系统,帮助医生更加准确地诊断BD患者。
  • methods: 这个研究使用了一种混合式GAN-CNN模型,通过对sMRI样本进行数据增广来提高BD的诊断精度。研究还使用了5折交叉验证,以评估不同增广比例的影响。
  • results: 根据结果,这个研究获得了75.8%的准确率、60.3%的灵敏度和82.5%的特异度,较以往研究高出3-5%,且仅使用了不到6%的样本量。此外,研究还证明了基于2D层的GAN生成器可以有效地重现复杂的3D脑部样本,且本研究在172个sMRI样本上的最佳增广阈值为50%。
    Abstract Bipolar Disorder (BD) is a psychiatric condition diagnosed by repetitive cycles of hypomania and depression. Since diagnosing BD relies on subjective behavioral assessments over a long period, a solid diagnosis based on objective criteria is not straightforward. The current study responded to the described obstacle by proposing a hybrid GAN-CNN model to diagnose BD from 3-D structural MRI Images (sMRI). The novelty of this study stems from diagnosing BD from sMRI samples rather than conventional datasets such as functional MRI (fMRI), electroencephalography (EEG), and behavioral symptoms while removing the data insufficiency usually encountered when dealing with sMRI samples. The impact of various augmentation ratios is also tested using 5-fold cross-validation. Based on the results, this study obtains an accuracy rate of 75.8%, a sensitivity of 60.3%, and a specificity of 82.5%, which are 3-5% higher than prior work while utilizing less than 6% sample counts. Next, it is demonstrated that a 2- D layer-based GAN generator can effectively reproduce complex 3D brain samples, a more straightforward technique than manual image processing. Lastly, the optimum augmentation threshold for the current study using 172 sMRI samples is 50%, showing the applicability of the described method for larger sMRI datasets. In conclusion, it is established that data augmentation using GAN improves the accuracy of the CNN classifier using sMRI samples, thus developing more reliable decision support systems to assist practitioners in identifying BD patients more reliably and in a shorter period
    摘要 双相情感障碍(BD)是一种精神疾病,通过反复出现的轻躁狂与抑郁发作来诊断。由于BD的诊断依赖于长期的主观行为评估,基于客观标准作出可靠诊断并不容易。本研究针对这一难题,提出了一种混合GAN-CNN模型,用于从三维结构MRI图像(sMRI)诊断BD。本研究的创新之处在于:不同于以往使用fMRI、EEG和行为症状等常规数据,而是直接从sMRI样本中诊断BD,同时缓解了处理sMRI样本时常见的数据不足问题。此外,本研究还通过5折交叉验证测试了不同增广比例的影响。结果显示,本研究获得了75.8%的准确率、60.3%的灵敏度和82.5%的特异度,较先前工作高出3-5%,而所用样本数不到其6%。随后,研究表明基于2D层的GAN生成器可以有效地重现复杂的3D脑部样本,这比手动图像处理更为简便。最后,在本研究使用的172个sMRI样本上,最佳增广阈值为50%,表明该方法同样适用于更大规模的sMRI数据集。总之,本研究证明了利用GAN进行数据增广能够提高基于sMRI样本的CNN分类器的准确率,从而开发出更可靠的决策支持系统,帮助医生更可靠、更快速地识别BD患者。

IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training

  • paper_url: http://arxiv.org/abs/2310.07355
  • repo_url: None
  • paper_authors: Che Liu, Sibo Cheng, Miaojing Shi, Anand Shah, Wenjia Bai, Rossella Arcucci
  • for: 这 paper 是为了提高医学视语预训练(VLP)中对医疗报告和相关医疗图像的特征提取方法。
  • methods: 这篇论文提出了一种新的临床先验引导的VLP框架,名为IMITATE,通过利用医疗报告的层次结构信息来学习层次化的视觉-语言对齐。该框架从胸部X光图像中提取多级视觉特征,并分别与报告中的描述性文本和结论性文本进行对齐。此外,论文还提出了一种引入临床先验知识的对比损失,用于跨模态学习。
  • results: 根据实验结果,IMITATE 模型在六个不同的数据集上比基eline VLP 方法表现出色,在五种医疗影像下游任务中均显示出优于基eline方法。这些结果表明,通过 integrate 医疗报告中的结构信息可以提高视语预训练中的特征提取效果。
    Abstract In the field of medical Vision-Language Pre-training (VLP), significant efforts have been devoted to deriving text and image features from both clinical reports and associated medical images. However, most existing methods may have overlooked the opportunity in leveraging the inherent hierarchical structure of clinical reports, which are generally split into `findings' for descriptive content and `impressions' for conclusive observation. Instead of utilizing this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from the chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge in formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports for vision-language alignment.
    摘要 在医学视觉-语言预训练(VLP)领域,已有大量工作致力于从临床报告及其关联的医学图像中提取文本和图像特征。然而,大多数现有方法忽略了临床报告固有的层次结构:报告通常分为描述性的"发现"部分和结论性的"印象"部分。现有的医学VLP方法往往将报告简化为单一整体或零散的词元,而没有利用这种丰富的结构化格式。为此,我们提出了一种新的临床先验引导的VLP框架,名为IMITATE,通过层次化的视觉-语言对齐从医疗报告中学习结构信息。该框架从胸部X光(CXR)图像中提取多级视觉特征,并将这些特征分别与层次化医疗报告中编码的描述性文本和结论性文本进行对齐。此外,我们还为跨模态学习引入了一种新的临床知情对比损失,在构建对比学习的样本相关性时考虑了临床先验知识。所提出的IMITATE模型在涵盖五种医学影像下游任务的六个不同数据集上均优于基线VLP方法。全面的实验结果凸显了在视觉-语言对齐中整合医疗报告层次结构的优势。

PointHR: Exploring High-Resolution Architectures for 3D Point Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2310.07743
  • repo_url: https://github.com/haibo-qiu/PointHR
  • paper_authors: Haibo Qiu, Baosheng Yu, Yixin Chen, Dacheng Tao
  • for: 高精度3D点云分割,即使没有额外的优化和修饰。
  • methods: 提出了一种高精度架构,named PointHR,包括knn顺序运算符和差分扩散运算符,以及预计算序列和扩散运算符的索引。
  • results: 在S3DIS和ScanNetV2数据集上进行了广泛的实验,并表明PointHR可以高度竞争与当前最佳方法无需额外优化和修饰。
    Abstract Significant progress has been made recently in point cloud segmentation utilizing an encoder-decoder framework, which initially encodes point clouds into low-resolution representations and subsequently decodes high-resolution predictions. Inspired by the success of high-resolution architectures in image dense prediction, which always maintains a high-resolution representation throughout the entire learning process, we consider it also highly important for 3D dense point cloud analysis. Therefore, in this paper, we explore high-resolution architectures for 3D point cloud segmentation. Specifically, we generalize high-resolution architectures using a unified pipeline named PointHR, which includes a knn-based sequence operator for feature extraction and a differential resampling operator to efficiently communicate different resolutions. Additionally, we propose to avoid numerous on-the-fly computations of high-resolution architectures by pre-computing the indices for both sequence and resampling operators. By doing so, we deliver highly competitive high-resolution architectures while capitalizing on the benefits of well-designed point cloud blocks without additional effort. To evaluate these architectures for dense point cloud analysis, we conduct thorough experiments using S3DIS and ScanNetV2 datasets, where the proposed PointHR outperforms recent state-of-the-art methods without any bells and whistles. The source code is available at \url{https://github.com/haibo-qiu/PointHR}.
    摘要 Recently, there have been significant advances in point cloud segmentation using an encoder-decoder framework, where point clouds are first encoded into low-resolution representations and then decoded into high-resolution predictions. Inspired by the success of high-resolution architectures in image dense prediction, we believe it is also crucial for 3D dense point cloud analysis. Therefore, in this paper, we explore high-resolution architectures for 3D point cloud segmentation. Specifically, we propose a unified pipeline named PointHR, which includes a knn-based sequence operator for feature extraction and a differential resampling operator to efficiently communicate different resolutions. Additionally, we pre-compute the indices for both sequence and resampling operators to avoid on-the-fly computations. By doing so, we achieve highly competitive high-resolution architectures without additional effort.To evaluate these architectures for dense point cloud analysis, we conduct thorough experiments using S3DIS and ScanNetV2 datasets. The results show that our proposed PointHR outperforms recent state-of-the-art methods without any additional bells and whistles. The source code is available at \url{https://github.com/haibo-qiu/PointHR}.
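The abstract highlights pre-computing the indices used by the knn-based sequence operator and the resampling operator. The sketch below shows the knn half of that idea: indices computed once per resolution and then reused by a toy local-aggregation operator; PointHR's actual operators are more elaborate.

```python
import torch

def precompute_knn(xyz, k=16):
    """Compute k-nearest-neighbour indices once per resolution so that the knn-based
    sequence operator can reuse them instead of searching on the fly.
    xyz : (N, 3) point coordinates. Returns an (N, k) long tensor of neighbour indices."""
    dists = torch.cdist(xyz, xyz)                 # (N, N) pairwise distances
    return dists.topk(k, largest=False).indices   # includes the point itself

def knn_sequence_op(features, knn_idx, weight):
    """A minimal stand-in for a knn-based local aggregation: gather each point's
    neighbours with the cached indices, apply a shared linear map and max-pool.
    PointHR's actual operator is more elaborate; this only shows how the
    precomputed indices are consumed."""
    neigh = features[knn_idx]                     # (N, k, C_in)
    return (neigh @ weight).max(dim=1).values     # (N, C_out)

xyz = torch.rand(2048, 3)
feats = torch.rand(2048, 32)
idx = precompute_knn(xyz, k=16)                   # cached once, reused every forward pass
w = torch.randn(32, 64)
print(knn_sequence_op(feats, idx, w).shape)       # torch.Size([2048, 64])
```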

Guided Attention for Interpretable Motion Captioning

  • paper_url: http://arxiv.org/abs/2310.07324
  • repo_url: https://github.com/rd20karim/m2t-interpretable
  • paper_authors: Karim Radouane, Andon Tchechmedjiev, Sylvie Ranwez, Julien Lagarde
  • for: 本研究旨在从动作生成文本,并提出了一种结合运动编码器与时空注意力模型的方法,以及一种在训练中引导注意力的策略,以提高文本生成的可解释性。
  • methods: 本研究使用运动编码器和时空注意力模型,并提出了一种在训练中引导注意力的策略,以在提升性能的同时获得可解释的描述生成。
  • results: 本研究在KIT MLD数据集和HumanML3D数据集上取得了优于基线的性能,包括BLEU@4、ROUGE-L、CIDEr和Bertscore等指标。
    Abstract While much effort has been invested in generating human motion from text, relatively few studies have been dedicated to the reverse direction, that is, generating text from motion. Much of the research focuses on maximizing generation quality without any regard for the interpretability of the architectures, particularly regarding the influence of particular body parts in the generation and the temporal synchronization of words with specific movements and actions. This study explores the combination of movement encoders with spatio-temporal attention models and proposes strategies to guide the attention during training to highlight perceptually pertinent areas of the skeleton in time. We show that adding guided attention with adaptive gate leads to interpretable captioning while improving performance compared to higher parameter-count non-interpretable SOTA systems. On the KIT MLD dataset, we obtain a BLEU@4 of 24.4% (SOTA+6%), a ROUGE-L of 58.30% (SOTA +14.1%), a CIDEr of 112.10 (SOTA +32.6) and a Bertscore of 41.20% (SOTA +18.20%). On HumanML3D, we obtain a BLEU@4 of 25.00 (SOTA +2.7%), a ROUGE-L score of 55.4% (SOTA +6.1%), a CIDEr of 61.6 (SOTA -10.9%), a Bertscore of 40.3% (SOTA +2.5%). Our code implementation and reproduction details will be soon available at https://github.com/rd20karim/M2T-Interpretable/tree/main.
    摘要 “尽管有很多研究投入到从文本生成人体动作中,但相对少数研究关注反向方向,即从动作生成文本。大多数研究集中于提高生成质量而忽略模型的可解释性,特别是特定身体部分对生成的影响,以及词语与具体动作之间的时间同步关系。本研究结合运动编码器与时空注意模型,并提出了在训练中引导注意力的策略,以突出骨架中在时间上具有感知意义的区域。我们的实验表明,添加带自适应门控的引导注意力,可以在实现可解释描述的同时提高性能,优于参数量更大的不可解释 SOTA 系统。在 KIT MLD 数据集上,我们获得了 BLEU@4 24.4% (SOTA +6%)、ROUGE-L 58.30% (SOTA +14.1%)、CIDEr 112.10 (SOTA +32.6) 和 Bertscore 41.20% (SOTA +18.20%);在 HumanML3D 数据集上,我们获得了 BLEU@4 25.00 (SOTA +2.7%)、ROUGE-L 55.4% (SOTA +6.1%)、CIDEr 61.6 (SOTA -10.9%)、Bertscore 40.3% (SOTA +2.5%)。我们的代码实现和复现细节将在 GitHub 上公开。”
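The paper's guidance strategy and adaptive gate are not reproduced here; the PyTorch sketch below shows one plausible way to guide spatio-temporal attention toward annotated "pertinent" joints with an auxiliary KL term added to the captioning loss. The mask format, joint count, and loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_guidance_loss(attn_weights: torch.Tensor,
                            pertinent_mask: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between predicted attention over joints and a target
    distribution that is uniform over the joints marked as pertinent.

    attn_weights:   (batch, time, joints), rows sum to 1 (softmax output).
    pertinent_mask: (batch, time, joints), 1 for joints that should be attended.
    """
    target = pertinent_mask / (pertinent_mask.sum(-1, keepdim=True) + eps)
    return F.kl_div((attn_weights + eps).log(), target, reduction="batchmean")

# Usage sketch: total loss = caption cross-entropy + lambda * guidance term.
attn = torch.softmax(torch.randn(2, 10, 21), dim=-1)   # e.g. 21 skeleton joints
mask = (torch.rand(2, 10, 21) > 0.7).float()
mask[mask.sum(-1) == 0] = 1.0                           # avoid empty targets
loss = attention_guidance_loss(attn, mask)
print(loss.item())
```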

A webcam-based machine learning approach for three-dimensional range of motion evaluation

  • paper_url: http://arxiv.org/abs/2310.07322
  • repo_url: None
  • paper_authors: Xiaoye Michael Wang, Derek T. Smith, Qin Zhu
  • for: 这个研究旨在提供一种可靠且可以远端存取的肢体范围动作评估方法,并评估这种方法的可靠性。
  • methods: 这个研究使用机器学习方法来评估肢体范围动作,并使用 webcam 进行评估。
  • results: 研究结果显示,这种基于 webcam 的方法具有较高的重测信度,与其他评估方法之间也有良好的评分者间信度,可用于远程实施物理治疗与康复。
    Abstract Background. Joint range of motion (ROM) is an important quantitative measure for physical therapy. Commonly relying on a goniometer, accurate and reliable ROM measurement requires extensive training and practice. This, in turn, imposes a significant barrier for those who have limited in-person access to healthcare. Objective. The current study presents and evaluates an alternative machine learning-based ROM evaluation method that could be remotely accessed via a webcam. Methods. To evaluate its reliability, the ROM measurements for a diverse set of joints (neck, spine, and upper and lower extremities) derived using this method were compared to those obtained from a marker-based optical motion capture system. Results. Data collected from 25 healthy adults demonstrated that the webcam solution exhibited high test-retest reliability, with substantial to almost perfect intraclass correlation coefficients for most joints. Compared with the marker-based system, the webcam-based system demonstrated substantial to almost perfect inter-rater reliability for some joints, and lower inter-rater reliability for other joints (e.g., shoulder flexion and elbow flexion), which could be attributed to the reduced sensitivity to joint locations at the apex of the movement. Conclusions. The proposed webcam-based method exhibited high test-retest and inter-rater reliability, making it a versatile alternative for existing ROM evaluation methods in clinical practice and the tele-implementation of physical therapy and rehabilitation.
    摘要 背景:关节活动范围(ROM)是物理治疗中非常重要的量化测量。通常使用测角仪进行测量,而准确、可靠的 ROM 测量需要大量的训练和实践,这对面对面医疗资源有限的人群构成了显著障碍。目标:本研究提出并评估了一种基于机器学习的 ROM 评估方法,可以通过网络摄像头远程使用。方法:为评估其可靠性,将这种方法对多种关节(颈部、脊柱及上下肢)测得的 ROM 数据与基于标记点的光学运动捕捉系统获取的数据进行比较。结果:来自25名健康成人的数据表明,网络摄像头方案具有较高的重测信度,大多数关节的组内相关系数达到高度乃至接近完美水平。与基于标记点的系统相比,网络摄像头方案在部分关节上表现出高度乃至接近完美的评分者间信度,而在另一些关节(如肩关节屈曲和肘关节屈曲)上信度较低,这可能源于其在动作顶点处对关节位置的敏感度降低。结论:所提出的基于网络摄像头的方法具有较高的重测信度和评分者间信度,可作为现有 ROM 评估方法的多用途替代方案,应用于临床实践以及物理治疗与康复的远程实施。
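As a small illustration of how a range-of-motion value can be derived from pose-estimator output, the numpy sketch below computes an elbow flexion angle from three hypothetical 3D landmarks; the landmark names and the 0-degrees-at-full-extension convention are assumptions, not the paper's protocol.

```python
import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle at landmark b (degrees) formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

# Hypothetical 3D landmarks (metres) from a webcam pose estimator.
shoulder = np.array([0.00, 1.40, 0.0])
elbow    = np.array([0.05, 1.10, 0.0])
wrist    = np.array([0.30, 1.05, 0.0])

inner = joint_angle(shoulder, elbow, wrist)   # angle between upper arm and forearm
flexion_rom = 180.0 - inner                   # one common convention: 0 deg = fully extended
print(f"elbow flexion ~ {flexion_rom:.1f} deg")
```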

Deep Aramaic: Towards a Synthetic Data Paradigm Enabling Machine Learning in Epigraphy

  • paper_url: http://arxiv.org/abs/2310.07310
  • repo_url: None
  • paper_authors: Andrei C. Aioanei, Regine Hunziker-Rodewald, Konstantin Klein, Dominik L. Michels
  • for: 这个研究旨在提高古代文献中文字识别率,使用现代人工智能技术,如机器学习(ML),从古代文献中提取知识。
  • methods: 该研究使用创新的数据生成技术,合成照片级真实的 Aramaic 字母数据集,包括纹理特征、照明、损坏和数据增强等,以模拟实际铭文的多样性。
  • results: 该研究成功创建了250,000个训练图像和25,000个验证图像,涵盖了阿拉米(Aramaic)字母表的22个字母类别。这个全面的数据集用于训练一个残差神经网络(ResNet),在高度退化的字母上取得了较高的识别率,并在不同材料和风格下得到了验证,证明了模型的泛化能力。
    Abstract Epigraphy increasingly turns to modern artificial intelligence (AI) technologies such as machine learning (ML) for extracting insights from ancient inscriptions. However, scarce labeled data for training ML algorithms severely limits current techniques, especially for ancient scripts like Old Aramaic. Our research pioneers an innovative methodology for generating synthetic training data tailored to Old Aramaic letters. Our pipeline synthesizes photo-realistic Aramaic letter datasets, incorporating textural features, lighting, damage, and augmentations to mimic real-world inscription diversity. Despite minimal real examples, we engineer a dataset of 250,000 training and 25,000 validation images covering the 22 letter classes in the Aramaic alphabet. This comprehensive corpus provides a robust volume of data for training a residual neural network (ResNet) to classify highly degraded Aramaic letters. The ResNet model demonstrates high accuracy in classifying real images from the 8th century BCE Hadad statue inscription. Additional experiments validate performance on varying materials and styles, proving effective generalization. Our results validate the model's capabilities in handling diverse real-world scenarios, proving the viability of our synthetic data approach and avoiding the dependence on scarce training data that has constrained epigraphic analysis. Our innovative framework elevates interpretation accuracy on damaged inscriptions, thus enhancing knowledge extraction from these historical resources.
    摘要 铭文学(epigraphy)越来越多地借助机器学习(ML)等现代人工智能(AI)技术来提取古代铭文中的信息。然而,对于 Old Aramaic 这类古代文字,可用于训练的标注数据非常稀缺,严重限制了现有技术。我们的研究首创了一种为 Old Aramaic 字母量身定制的合成训练数据生成方法。我们的管道可以合成照片级真实的 Aramaic 字母数据集,涵盖纹理特征、照明、损坏和数据增强等,以模拟实际铭文的多样性。尽管真实样本极少,我们仍构建了一个包含250,000个训练图像和25,000个验证图像的数据集,覆盖 Aramaic 字母表的22个字母类别。这个全面的语料为训练一个 ResNet 神经网络分类高度损坏的 Aramaic 字母提供了充足的数据。该 ResNet 模型在公元前8世纪 Hadad 雕像铭文的真实图像上表现出较高的分类精度。此外,针对多种材质和风格的实验进一步验证了模型的泛化能力。这些结果证明了我们的合成数据方法的可行性,摆脱了长期制约铭文分析的稀缺训练数据,从而提高了受损铭文的解读准确性,加强了对这些历史资源的知识提取。
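The following numpy sketch illustrates the kind of lighting, noise, and damage perturbations the abstract mentions, applied to a grayscale glyph array; the actual pipeline renders photo-realistic inscriptions and is far richer, so every parameter here is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_glyph(glyph: np.ndarray) -> np.ndarray:
    """Apply simple lighting, noise and 'damage' perturbations to a [0,1] glyph image."""
    img = glyph.astype(np.float32).copy()
    # Global lighting: random gain and offset.
    img = img * rng.uniform(0.7, 1.3) + rng.uniform(-0.1, 0.1)
    # Sensor-like noise.
    img += rng.normal(0.0, 0.05, img.shape)
    # Damage: erase a few random rectangular chips.
    h, w = img.shape
    for _ in range(rng.integers(1, 4)):
        y, x = rng.integers(0, h - 8), rng.integers(0, w - 8)
        img[y:y + 8, x:x + 8] = rng.uniform(0.0, 1.0)
    return np.clip(img, 0.0, 1.0)

glyph = np.zeros((64, 64), dtype=np.float32)
glyph[16:48, 28:36] = 1.0                    # stand-in stroke for an Aramaic letter
sample = augment_glyph(glyph)
print(sample.shape, sample.min(), sample.max())
```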

Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.07265
  • repo_url: None
  • paper_authors: Xu Zheng, Yunhao Luo, Pengyuan Zhou, Lin Wang
  • for: 本研究目的是如何将预训练的 CNN 模型知识转移到学习 Compact Vision Transformer (ViT) 模型,而保持其学习能力?
  • methods: 本研究提出了一个 CNN 到 ViT 的知识蒸馏框架(C2VKD),其中包括一个视觉-语言相容特征蒸馏(VLFD)模块和一个像素级解耦蒸馏(PDD)模块。
  • results: 实验结果显示,在三个语义分割基准数据集上,我们的方法带来的 mIoU 增量超过 SoTA KD 方法的 200%。
    Abstract In this paper, we tackle a new problem: how to transfer knowledge from the pre-trained cumbersome yet well-performed CNN-based model to learn a compact Vision Transformer (ViT)-based model while maintaining its learning capacity? Due to the completely different characteristics of ViT and CNN and the long-existing capacity gap between teacher and student models in Knowledge Distillation (KD), directly transferring the cross-model knowledge is non-trivial. To this end, we subtly leverage the visual and linguistic-compatible feature character of ViT (i.e., student), and its capacity gap with the CNN (i.e., teacher) and propose a novel CNN-to-ViT KD framework, dubbed C2VKD. Importantly, as the teacher's features are heterogeneous to those of the student, we first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations. Moreover, due to the large capacity gap between the teacher and student and the inevitable prediction errors of the teacher, we then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes. Experiments on three semantic segmentation benchmark datasets consistently show that the increment of mIoU of our method is over 200% of the SoTA KD methods
    摘要 在这篇论文中,我们面临一个新的问题:如何将预训练的庞大而性能优异的 CNN 模型的知识传递给 compact Vision Transformer (ViT) 模型,同时保持其学习能力?由于 CNN 和 ViT 的特征性质完全不同,且知识蒸馏(KD)中教师与学生模型之间长期存在容量差距,直接跨模型传递知识并非易事。为此,我们巧妙地利用 ViT 模型(学生)的视觉与语言相容特征,以及它与 CNN 模型(教师)之间的容量差距,提出了一种新的 CNN 到 ViT 的 KD 框架,称为 C2VKD。特别是,由于教师模型的特征与学生模型的特征是异构的,我们首先提出了一种视觉-语言相容特征蒸馏(VLFD)模块,在对齐的视觉与语言相容表示之间实现高效的 KD。此外,由于师生之间容量差距较大且教师的预测错误不可避免,我们进一步提出了一种像素级解耦蒸馏(PDD)模块,在解耦的目标类与非目标类上结合标签与教师预测共同监督学生。我们在三个 semantic segmentation benchmark 数据集上进行了实验,结果表明,我们的方法带来的 mIoU 增量超过 SoTA KD 方法的 200%。
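The PDD module itself is not specified here; the sketch below is a generic pixel-wise decoupled distillation loss, separating the target-class probability from the distribution over non-target classes, of the kind such a module could build on. The temperature, weights, and exact decomposition are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_decoupled_kd(student_logits, teacher_logits, labels, T=2.0, alpha=1.0, beta=1.0):
    """student/teacher logits: (B, C, H, W); labels: (B, H, W) class ids.
    Splits the KD signal into a binary target-vs-rest term and a KL over
    non-target classes only (target logit masked out)."""
    B, C, H, W = student_logits.shape
    s = student_logits.permute(0, 2, 3, 1).reshape(-1, C) / T
    t = teacher_logits.permute(0, 2, 3, 1).reshape(-1, C) / T
    y = labels.reshape(-1)

    ps, pt = F.softmax(s, -1), F.softmax(t, -1)
    ps_tgt = ps.gather(1, y[:, None])                 # student prob of the labelled class
    pt_tgt = pt.gather(1, y[:, None])
    bs = torch.cat([ps_tgt, 1 - ps_tgt], dim=1).clamp_min(1e-8)
    bt = torch.cat([pt_tgt, 1 - pt_tgt], dim=1).clamp_min(1e-8)
    tckd = F.kl_div(bs.log(), bt, reduction="batchmean")          # target vs rest

    mask = F.one_hot(y, C).bool()
    nckd = F.kl_div(F.log_softmax(s.masked_fill(mask, -1e9), -1), # non-target classes only
                    F.softmax(t.masked_fill(mask, -1e9), -1), reduction="batchmean")
    return (alpha * tckd + beta * nckd) * T * T

student = torch.randn(2, 19, 32, 32)
teacher = torch.randn(2, 19, 32, 32)
labels = torch.randint(0, 19, (2, 32, 32))
print(pixel_decoupled_kd(student, teacher, labels).item())
```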

ADASR: An Adversarial Auto-Augmentation Framework for Hyperspectral and Multispectral Data Fusion

  • paper_url: http://arxiv.org/abs/2310.07255
  • repo_url: https://github.com/fangfang11-plog/adasr
  • paper_authors: Jinghui Qin, Lihuang Fang, Ruitao Lu, Liang Lin, Yukai Shi
  • for: 提高基于深度学习的高光谱图像(HSI)超分辨率,通过融合高光谱图像(HSI)和多光谱图像(MSI),使用深度神经网络(DNNs)来生成高空间分辨率HSI(HR-HSI)。
  • methods: 我们提出了一种新的对抗式自动数据增强框架ADASR,可以自动优化和增强HSI-MSI样本对,以丰富数据多样性,便于在实际场景中应用。我们的框架是样本感知的,通过对增强网络和两个下采样网络进行联合对抗学习来优化它们,从而学习更加鲁棒的下采样网络,用于训练上采样网络。
  • results: 我们在两个公共的经典高光谱数据集上进行了广泛的实验,证明了ADASR比当前先进方法更加有效。
    Abstract Deep learning-based hyperspectral image (HSI) super-resolution, which aims to generate high spatial resolution HSI (HR-HSI) by fusing hyperspectral image (HSI) and multispectral image (MSI) with deep neural networks (DNNs), has attracted lots of attention. However, neural networks require large amounts of training data, hindering their application in real-world scenarios. In this letter, we propose a novel adversarial automatic data augmentation framework ADASR that automatically optimizes and augments HSI-MSI sample pairs to enrich data diversity for HSI-MSI fusion. Our framework is sample-aware and optimizes an augmentor network and two downsampling networks jointly by adversarial learning so that we can learn more robust downsampling networks for training the upsampling network. Extensive experiments on two public classical hyperspectral datasets demonstrate the effectiveness of our ADASR compared to the state-of-the-art methods.
    摘要 基于深度神经网络(DNN)的高光谱图像(HSI)超分辨率,旨在通过融合高光谱图像(HSI)和多光谱图像(MSI),生成高空间分辨率的高光谱图像(HR-HSI)。然而,神经网络需要大量的训练数据,限制了其在实际场景中的应用。在这封信中,我们提出了一种新的对抗式自动数据增强框架ADASR,可以自动优化和增强HSI-MSI样本对,以丰富HSI-MSI融合的数据多样性。我们的框架是样本感知的,通过对抗学习联合优化一个增强器网络和两个下采样网络,从而学习更加鲁棒的下采样网络,用于训练上采样网络。我们在两个公共的经典高光谱数据集上进行了广泛的实验,证明了ADASR比 state-of-the-art 方法更有效。

A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation

  • paper_url: http://arxiv.org/abs/2310.07252
  • repo_url: None
  • paper_authors: Rashid Khan, Bingding Huang, Haseeb Hassan, Asim Zaman, Zhongfu Ye
  • for: 这个论文是为了提出一种深度神经网络框架,用于图像描述文本生成。
  • methods: 该方法使用了多个预训练的卷积神经网络作为Encoder来提取图像特征,并使用GRU语言模型作为Decoder来生成描述文本。它还使用了Bahdanau注意机制与GRU解码器相结合,以便学习专注于特定图像部分。
  • results: 该方法在MSCOCO和Flickr30k datasets上进行了评估,并取得了与当前方法相当的成绩。
    Abstract Image captioning is a challenging task involving generating a textual description for an image using computer vision and natural language processing techniques. This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism. Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate descriptive sentences. To improve performance, we integrate the Bahdanau attention model with the GRU decoder to enable learning to focus on specific image parts. We evaluate our approach using the MSCOCO and Flickr30k datasets and show that it achieves competitive scores compared to state-of-the-art methods. Our proposed framework can bridge the gap between computer vision and natural language and can be extended to specific domains.
    摘要 Image captioning是一个复杂的任务,即使用计算机视觉和自然语言处理技术生成图像的文本描述。这篇论文提出了一种基于GRU注意机制的深度神经网络框架,用于图像描述生成。我们的方法使用多个预训练的卷积神经网络作为Encoder提取图像特征,并使用基于GRU的语言模型作为Decoder生成描述性句子。为了提高性能,我们将Bahdanau注意模型与GRU解码器结合,使模型学会关注图像的特定部分。我们使用MSCOCO和Flickr30k数据集进行评估,结果显示其分数与最先进方法相当。我们提出的框架可以衔接计算机视觉和自然语言,并可扩展到特定领域。
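A minimal PyTorch sketch of additive (Bahdanau) attention over CNN grid features at one GRU decoding step is given below; the feature dimensions and the 7x7 grid are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(h_i, s) = v^T tanh(W1 h_i + W2 s)."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.W1 = nn.Linear(feat_dim, attn_dim)
        self.W2 = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (B, L, feat_dim) CNN grid locations; hidden: (B, hidden_dim) GRU state.
        scores = self.v(torch.tanh(self.W1(features) + self.W2(hidden)[:, None, :]))  # (B, L, 1)
        weights = torch.softmax(scores, dim=1)
        context = (weights * features).sum(dim=1)          # (B, feat_dim) weighted image context
        return context, weights.squeeze(-1)

attn = BahdanauAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
feats = torch.randn(4, 49, 2048)                           # e.g. a 7x7 CNN feature grid
state = torch.randn(4, 512)
context, w = attn(feats, state)
print(context.shape, w.shape)                              # (4, 2048) (4, 49)
```

The context vector would then be concatenated with the previous word embedding and fed to the GRU to predict the next caption token.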

Synthesizing Missing MRI Sequences from Available Modalities using Generative Adversarial Networks in BraTS Dataset

  • paper_url: http://arxiv.org/abs/2310.07250
  • repo_url: None
  • paper_authors: Ibrahim Ethem Hamamci
  • for: 这篇论文的目的是为了提高脑癌MRI检查和诊断的精度和效率,并且将AI技术应用到脑癌MRI量化中。
  • methods: 这篇论文使用了生成对抗网络(GAN)技术,将三个MRI序列作为输入,生成缺失的第四个MRI序列。
  • results: 这篇论文的结果显示,使用GAN技术可以实现高品质和现实的MRI序列生成,帮助临床医生提高诊断能力,并且支持AI技术应用到脑癌MRI量化中。
    Abstract Glioblastoma is a highly aggressive and lethal form of brain cancer. Magnetic resonance imaging (MRI) plays a significant role in the diagnosis, treatment planning, and follow-up of glioblastoma patients due to its non-invasive and radiation-free nature. The International Brain Tumor Segmentation (BraTS) challenge has contributed to generating numerous AI algorithms to accurately and efficiently segment glioblastoma sub-compartments using four structural (T1, T1Gd, T2, T2-FLAIR) MRI scans. However, these four MRI sequences may not always be available. To address this issue, Generative Adversarial Networks (GANs) can be used to synthesize the missing MRI sequences. In this paper, we implement and utilize an open-source GAN approach that takes any three MRI sequences as input to generate the missing fourth structural sequence. Our proposed approach is contributed to the community-driven generally nuanced deep learning framework (GaNDLF) and demonstrates promising results in synthesizing high-quality and realistic MRI sequences, enabling clinicians to improve their diagnostic capabilities and support the application of AI methods to brain tumor MRI quantification.
    摘要 胶质母细胞瘤是一种侵袭性极强、高度致命的脑癌。核磁共振成像(MRI)因其非侵入性和无辐射性,在胶质母细胞瘤患者的诊断、治疗规划和随访中扮演着非常重要的角色。国际脑肿瘤分割(BraTS)挑战赛已经促成了许多人工智能算法,利用四种结构(T1、T1Gd、T2、T2-FLAIR)MRI扫描,准确而高效地分割胶质母细胞瘤的子区域。然而,这四种 MRI 序列并不总是齐备的。为解决这个问题,可以使用生成对抗网络(GANs)合成缺失的 MRI 序列。在这篇论文中,我们实现并使用了一种开源 GAN 方法,该方法以任意三种 MRI 序列作为输入,生成缺失的第四种结构序列。我们的方法已被贡献到社区驱动的通用深度学习框架(GaNDLF)中,在生成高质量、逼真的 MRI 序列方面展现出可喜的成果,可帮助临床医生提高诊断能力,并支持人工智能方法在脑肿瘤 MRI 量化中的应用。

IBoxCLA: Towards Robust Box-supervised Segmentation of Polyp via Improved Box-dice and Contrastive Latent-anchors

  • paper_url: http://arxiv.org/abs/2310.07248
  • repo_url: None
  • paper_authors: Zhiwei Wang, Qiang Hu, Hongkuan Shi, Li He, Man He, Wenxuan Dai, Ting Li, Yitong Zhang, Dun Li, Mei Liu, Qiang Li
  • for: 这篇论文面向框标注监督下的息肉分割,以提高医疗影像分割的精度和效率。
  • methods: 这篇论文提出了两种创新的学习方法:Improved Box-dice(IBox)和Contrastive Latent-Anchors(CLA),并将它们结合使用以训练一个鲁棒的框监督息肉分割模型IBoxCLA。其核心思想是将形状学习与位置/大小学习解耦,从而分别施加有针对性的约束,提高模型的准确性和稳定性。
  • results: 实验结果显示,IBoxCLA在五个公共息肉数据集上的表现可与最近的完全监督息肉分割方法相媲美,并且优于其他框监督的先进方法,整体 mDice 和 mIoU 分别至少相对提升 6.5% 和 7.5%。
    Abstract Box-supervised polyp segmentation attracts increasing attention for its cost-effective potential. Existing solutions often rely on learning-free methods or pretrained models to laboriously generate pseudo masks, triggering Dice constraint subsequently. In this paper, we found that a model guided by the simplest box-filled masks can accurately predict polyp locations/sizes, but suffers from shape collapsing. In response, we propose two innovative learning fashions, Improved Box-dice (IBox) and Contrastive Latent-Anchors (CLA), and combine them to train a robust box-supervised model IBoxCLA. The core idea behind IBoxCLA is to decouple the learning of location/size and shape, allowing for focused constraints on each of them. Specifically, IBox transforms the segmentation map into a proxy map using shape decoupling and confusion-region swapping sequentially. Within the proxy map, shapes are disentangled, while locations/sizes are encoded as box-like responses. By constraining the proxy map instead of the raw prediction, the box-filled mask can well supervise IBoxCLA without misleading its shape learning. Furthermore, CLA contributes to shape learning by generating two types of latent anchors, which are learned and updated using momentum and segmented polyps to steadily represent polyp and background features. The latent anchors facilitate IBoxCLA to capture discriminative features within and outside boxes in a contrastive manner, yielding clearer boundaries. We benchmark IBoxCLA on five public polyp datasets. The experimental results demonstrate the competitive performance of IBoxCLA compared to recent fully-supervised polyp segmentation methods, and its superiority over other box-supervised state-of-the-arts with a relative increase of overall mDice and mIoU by at least 6.5% and 7.5%, respectively.
    摘要 框监督(box-supervised)息肉分割因其高成本效益而受到越来越多的关注。现有的解决方案通常依赖无学习方法或预训练模型费力地生成伪掩码,随后再施加 Dice 约束。在这篇论文中,我们发现,仅以最简单的方框填充掩码作为引导的模型就可以准确预测息肉的位置/大小,但会出现形状坍缩。为此,我们提出了两种创新的学习方式:Improved Box-dice(IBox)和Contrastive Latent-Anchors(CLA),并将它们结合以训练一个鲁棒的框监督模型IBoxCLA。IBoxCLA的核心思想是解耦位置/大小与形状的学习,以便对二者分别施加有针对性的约束。具体而言,IBox依次通过形状解耦和混淆区域交换,将分割图转换为一个代理图。在代理图中,形状被解耦出来,而位置/大小被编码为类似方框的响应。通过约束代理图而非原始预测,方框填充掩码可以良好地监督IBoxCLA,而不会误导其形状学习。此外,CLA通过生成两种类型的潜在锚点来辅助形状学习,这些锚点利用动量和已分割的息肉进行学习与更新,从而稳定地表示息肉和背景特征。潜在锚点使IBoxCLA能够以对比的方式捕捉方框内外的判别性特征,从而得到更清晰的边界。我们在五个公共息肉数据集上对IBoxCLA进行了测试。实验结果表明,IBoxCLA与最近的完全监督息肉分割方法相比具有竞争性的性能,并且优于其他框监督的先进方法,整体 mDice 和 mIoU 分别至少相对提升 6.5% 和 7.5%。

Optimizing the Placement of Roadside LiDARs for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.07247
  • repo_url: https://github.com/PJLab-ADG/PCSim
  • paper_authors: Wentao Jiang, Hao Xiang, Xinyu Cai, Runsheng Xu, Jiaqi Ma, Yikang Li, Gim Hee Lee, Si Liu
  • for: 提高自动驾驶roadside LiDAR的探测性能
  • methods: 基于感知增益的贪婪算法,以及仅需单个点云帧即可评估LiDAR布设的感知预测器
  • results: 提出了一种基于感知增益的路侧LiDAR位置优化方法,并创建了Roadside-Opt数据集以便进一步研究roadside LiDAR placement问题。
    Abstract Multi-agent cooperative perception is an increasingly popular topic in the field of autonomous driving, where roadside LiDARs play an essential role. However, how to optimize the placement of roadside LiDARs is a crucial but often overlooked problem. This paper proposes an approach to optimize the placement of roadside LiDARs by selecting optimized positions within the scene for better perception performance. To efficiently obtain the best combination of locations, a greedy algorithm based on perceptual gain is proposed, which selects the location that can maximize the perceptual gain sequentially. We define perceptual gain as the increased perceptual capability when a new LiDAR is placed. To obtain the perception capability, we propose a perception predictor that learns to evaluate LiDAR placement using only a single point cloud frame. A dataset named Roadside-Opt is created using the CARLA simulator to facilitate research on the roadside LiDAR placement problem.
    摘要 Multi-agent合作感知是自动驾驶领域中越来越受欢迎的话题,其中路边LiDAR扮演着关键角色。然而,如何优化路边LiDAR的布设位置是一个关键但常被忽视的问题。这篇论文提出了一种优化路边LiDAR布设的方法,通过在场景中选择最优位置来提升感知性能。为了高效地获得最佳位置组合,我们提出了一种基于感知增益的贪婪算法,该算法依次选择能够最大化感知增益的位置。我们将感知增益定义为新增一个LiDAR后感知能力的提升量。为了获得感知能力,我们提出了一种感知预测器,该预测器仅使用单个点云帧即可评估LiDAR布设。为了促进路边LiDAR布设问题的研究,我们使用CARLA模拟器创建了名为Roadside-Opt的数据集。
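The sequential greedy selection the abstract describes can be sketched directly; in the snippet below the gain function is a toy coverage stand-in, whereas the paper learns perceptual gain with a trained perception predictor.

```python
from typing import Callable, List, Set

def greedy_lidar_placement(candidates: List[int],
                           budget: int,
                           gain: Callable[[Set[int]], float]) -> List[int]:
    """Sequentially add the candidate position whose marginal perceptual gain is largest."""
    chosen: List[int] = []
    for _ in range(budget):
        base = gain(set(chosen))
        best, best_gain = None, 0.0
        for c in candidates:
            if c in chosen:
                continue
            marginal = gain(set(chosen) | {c}) - base
            if marginal > best_gain:
                best, best_gain = c, marginal
        if best is None:          # no remaining candidate adds perceptual capability
            break
        chosen.append(best)
    return chosen

# Toy stand-in: each position covers a set of road cells; gain = number of cells covered.
coverage = {0: {1, 2}, 1: {2, 3, 4}, 2: {5}, 3: {1, 5, 6}}
toy_gain = lambda placed: float(len(set().union(*(coverage[p] for p in placed)) if placed else set()))
print(greedy_lidar_placement(list(coverage), budget=2, gain=toy_gain))
```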

Crowd Counting in Harsh Weather using Image Denoising with Pix2Pix GANs

  • paper_url: http://arxiv.org/abs/2310.07245
  • repo_url: None
  • paper_authors: Muhammad Asif Khan, Hamid Menouar, Ridha Hamila
  • for: 本研究旨在提高人群计数模型在恶劣天气下的性能,特别是在雾、尘埃和低光照等不良条件下。
  • methods: 本研究提出使用Pix2Pix生成器对人群图像进行去噪预处理,以提高计数模型的推理性能。
  • results: 研究在JHU-Crowd数据集上进行了测试,证明了所提方法可以在不良天气下提高人群计数模型的准确率和可靠性。
    Abstract Visual crowd counting estimates the density of the crowd using deep learning models such as convolution neural networks (CNNs). The performance of the model heavily relies on the quality of the training data that constitutes crowd images. In harsh weather such as fog, dust, and low light conditions, the inference performance may severely degrade on the noisy and blur images. In this paper, we propose the use of Pix2Pix generative adversarial network (GAN) to first denoise the crowd images prior to passing them to the counting model. A Pix2Pix network is trained using synthetic noisy images generated from original crowd images and then the pretrained generator is then used in the inference engine to estimate the crowd density in unseen, noisy crowd images. The performance is tested on JHU-Crowd dataset to validate the significance of the proposed method particularly when high reliability and accuracy are required.
    摘要 视觉人群计数使用深度学习模型(如卷积神经网络,CNNs)估算人群密度。模型的性能在很大程度上取决于训练数据中人群图像的质量。在雾、尘埃和低光照等恶劣天气条件下,噪声和模糊的图像可能导致推理性能严重下降。在这篇论文中,我们提议先使用 Pix2Pix 生成对抗网络(GAN)对人群图像进行去噪,再将其送入计数模型。Pix2Pix 网络使用由原始人群图像生成的合成噪声图像进行训练,随后将预训练的生成器用于推理引擎,以估算未见过的噪声人群图像中的人群密度。我们在 JHU-Crowd 数据集上测试了该方法,验证了其在对可靠性和准确性要求较高的场景中的价值。

SAGE-ICP: Semantic Information-Assisted ICP

  • paper_url: http://arxiv.org/abs/2310.07237
  • repo_url: None
  • paper_authors: Jiaming Cui, Jiming Chen, Liang Li
  • for: 本研究旨在提高搭载传感器的机器人在未知环境中姿态估计的精度和鲁棒性。
  • methods: 本文提出了一种基于LiDAR的点对点ICP方法,并结合有效的semantic信息。
  • results: 实验表明,相比基线方法,本方法可以在大规模场景中提高姿态估计精度,并且保持实时性。
    Abstract Robust and accurate pose estimation in unknown environments is an essential part of robotic applications. We focus on LiDAR-based point-to-point ICP combined with effective semantic information. This paper proposes a novel semantic information-assisted ICP method named SAGE-ICP, which leverages semantics in odometry. The semantic information for the whole scan is timely and efficiently extracted by a 3D convolution network, and these point-wise labels are deeply involved in every part of the registration, including semantic voxel downsampling, data association, adaptive local map, and dynamic vehicle removal. Unlike previous semantic-aided approaches, the proposed method can improve localization accuracy in large-scale scenes even if the semantic information has certain errors. Experimental evaluations on KITTI and KITTI-360 show that our method outperforms the baseline methods, and improves accuracy while maintaining real-time performance, i.e., runs faster than the sensor frame rate.
    摘要 在未知环境中进行鲁棒而准确的姿态估计是机器人应用的关键部分。我们关注基于LiDAR的点对点ICP方法,并结合有效的semantic信息。这篇论文提出了一种由语义信息辅助的ICP方法,名为SAGE-ICP,它在odometry中利用语义信息。整个扫描帧的语义信息由一个3D卷积网络及时且高效地提取,这些点级标签深入参与配准的每一个环节,包括语义体素下采样、数据关联、自适应局部地图和动态车辆去除。与以往的语义辅助方法不同,即使语义信息存在一定错误,我们的方法仍可以在大规模场景中提高定位精度。在KITTI和KITTI-360上的实验评估显示,我们的方法优于基线方法,在保持实时性(运行速度快于传感器帧率)的同时提高了精度。
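One ingredient suggested by the abstract, restricting point-to-point correspondences to points that share a semantic label, can be sketched as follows; the brute-force search and distance threshold are placeholders, and the full SAGE-ICP pipeline (semantic voxel downsampling, adaptive local map, dynamic-vehicle removal) is not reproduced.

```python
import numpy as np

def semantic_nn_association(src_pts, src_lbl, map_pts, map_lbl, max_dist=1.0):
    """For each source point, find the nearest map point with the same semantic label.
    Returns index pairs (i_src, j_map) for matches within max_dist."""
    pairs = []
    for lbl in np.unique(src_lbl):
        si = np.where(src_lbl == lbl)[0]
        mi = np.where(map_lbl == lbl)[0]
        if len(mi) == 0:
            continue
        d = np.linalg.norm(src_pts[si, None, :] - map_pts[None, mi, :], axis=-1)  # (|si|, |mi|)
        j = d.argmin(axis=1)
        keep = d[np.arange(len(si)), j] < max_dist
        pairs += list(zip(si[keep], mi[j[keep]]))
    return pairs

src = np.random.rand(100, 3) * 10
dst = src + 0.05 * np.random.randn(100, 3)
labels = np.random.randint(0, 4, 100)
matches = semantic_nn_association(src, labels, dst, labels)
print(len(matches), "correspondences")
```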

AdaMesh: Personalized Facial Expressions and Head Poses for Speech-Driven 3D Facial Animation

  • paper_url: http://arxiv.org/abs/2310.07236
  • repo_url: None
  • paper_authors: Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang
  • for: 这个论文旨在提出一种基于 speech-driven 3D 人脸动画的个性化方法,以实现与驱动语音同步的自然人脸表达和头pose。
  • methods: 该方法使用 mixture-of-low-rank adaptation (MoLoRA) 技术来精准地捕捉Reference video中的 talking style,并通过一个具有语义意识的 pose style matrix 来自动调整头pose。
  • results: 对比于现有方法,该方法能够更好地保持 Reference video 中的 talking style,并生成更加生动的人脸动画。
    Abstract Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
    摘要 语音驱动3D面部动画的目标是生成与驱动语音同步的面部动作,这一方向近来得到了广泛探索。现有工作大多在生成时忽略了个人特有的说话风格,包括表情风格和头部姿态风格。一些工作尝试通过微调模块来捕捉个性,但由于训练数据有限,生成结果缺乏生动性。为此,我们提出了 AdaMesh,一种新的自适应语音驱动面部动画方法,可以从约10秒的参考视频中学习个性化的说话风格,并生成生动的表情和头部姿态。具体来说,我们提出了混合低秩适配(MoLoRA)来微调表情适配器,从而高效地捕捉表情风格。针对个性化的姿态风格,我们提出了一种姿态适配器,通过构建离散姿态先验并利用语义感知的 pose style matrix 检索相应的风格嵌入,无需微调。大量实验结果表明,我们的方法优于最先进的方法,能够保持参考视频中的说话风格,并生成生动的面部动画。补充视频和代码将发布在 https://adamesh.github.io。
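MoLoRA's mixture design is not described in enough detail to reproduce; the PyTorch sketch below shows a standard low-rank adaptation (LoRA) layer of the kind such an expression adapter could be built from, with the rank and scaling chosen arbitrarily.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (B A) x * scale."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

layer = LoRALinear(nn.Linear(256, 256), rank=4)
out = layer(torch.randn(2, 10, 256))
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```

Because only the low-rank factors are trainable, an adapter like this can be fitted on roughly ten seconds of reference video without overwriting the pretrained model, which matches the motivation given in the abstract.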

Deep Learning for blind spectral unmixing of LULC classes with MODIS multispectral time series and ancillary data

  • paper_url: http://arxiv.org/abs/2310.07223
  • repo_url: https://github.com/jrodriguezortega/msmtu
  • paper_authors: José Rodríguez-Ortega, Rohaifa Khaldi, Domingo Alcaraz-Segura, Siham Tabik
  • for: 这个论文的目的是提出一种基于深度学习模型、针对多光谱时间序列数据的盲解混方法,用于从混合像元中提取各土地利用/土地覆盖(LULC)类别的信息。
  • methods: 该方法使用深度学习模型,包括长短期记忆网络(LSTM)模型,将光谱-时间输入数据与地形及气候辅助参数结合使用,以提高混合像元中各类别丰度的估计。
  • results: 实验表明,将光谱-时间输入数据与地形及气候信息相结合,可以显著提高混合像元中LULC类别丰度的估计。
    Abstract Remotely sensed data are dominated by mixed Land Use and Land Cover (LULC) types. Spectral unmixing is a technique to extract information from mixed pixels into their constituent LULC types and corresponding abundance fractions. Traditionally, solving this task has relied on either classical methods that require prior knowledge of endmembers or machine learning methods that avoid explicit endmembers calculation, also known as blind spectral unmixing (BSU). Most BSU studies based on Deep Learning (DL) focus on one time-step hyperspectral data, yet its acquisition remains quite costly compared with multispectral data. To our knowledge, here we provide the first study on BSU of LULC classes using multispectral time series data with DL models. We further boost the performance of a Long-Short Term Memory (LSTM)-based model by incorporating geographic plus topographic (geo-topographic) and climatic ancillary information. Our experiments show that combining spectral-temporal input data together with geo-topographic and climatic information substantially improves the abundance estimation of LULC classes in mixed pixels. To carry out this study, we built a new labeled dataset of the region of Andalusia (Spain) with monthly multispectral time series of pixels for the year 2013 from MODIS at 460m resolution, for two hierarchical levels of LULC classes, named Andalusia MultiSpectral MultiTemporal Unmixing (Andalusia-MSMTU). This dataset provides, at the pixel level, a multispectral time series plus ancillary information annotated with the abundance of each LULC class inside each pixel. The dataset and code are available to the public.
    摘要 遥感数据中大量像元是混合的土地利用/土地覆盖(LULC)类型。光谱解混是一种从混合像元中提取其组成LULC类型及相应丰度分数的技术。传统上,解决这一任务要么依赖需要端元先验知识的经典方法,要么依赖避免显式计算端元的机器学习方法,即盲光谱解混(BSU)。大多数基于深度学习(DL)的BSU研究只关注单一时相的高光谱数据,而这类数据的获取成本仍然远高于多光谱数据。据我们所知,这里提供了首个利用多光谱时间序列数据和DL模型对LULC类别进行BSU的研究。我们还通过引入地理-地形(geo-topographic)及气候辅助信息进一步提升了基于LSTM的模型的性能。实验表明,将光谱-时间输入数据与地形及气候信息相结合,可以显著提高混合像元中LULC类别的丰度估计。为开展这项研究,我们构建了一个新的标注数据集,名为Andalusia MultiSpectral MultiTemporal Unmixing(Andalusia-MSMTU),包含2013年MODIS在460米分辨率下西班牙安达卢西亚地区像元的逐月多光谱时间序列,以及两个层级的LULC类别。该数据集在像元级别提供多光谱时间序列和辅助信息,并标注了每个像元内各LULC类别的丰度。数据集和代码已向公众开放。
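A minimal PyTorch sketch of the kind of model described, an LSTM over the monthly multispectral series concatenated with ancillary features and a softmax head so per-pixel abundances sum to one, is shown below; all sizes are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class UnmixingLSTM(nn.Module):
    def __init__(self, n_bands=7, n_aux=5, n_classes=10, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_bands, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + n_aux, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, spectra, aux):
        # spectra: (B, 12, n_bands) monthly reflectances; aux: (B, n_aux) geo-topographic/climatic data.
        _, (h, _) = self.lstm(spectra)
        z = torch.cat([h[-1], aux], dim=1)
        return torch.softmax(self.head(z), dim=1)   # abundance fractions sum to 1 per pixel

model = UnmixingLSTM()
abund = model(torch.randn(8, 12, 7), torch.randn(8, 5))
print(abund.shape, abund.sum(dim=1))                # (8, 10), each row ~1.0
```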

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.07222
  • repo_url: https://github.com/ysy31415/unipaint
  • paper_authors: Shiyuan Yang, Xiaodong Chen, Jing Liao
  • for: 这篇论文的目的是提出一种多模态图像修复(inpainting)方法,提供多种引导方式,包括无条件、文本驱动、笔触驱动和示例驱动修复,以及这些模式的组合。
  • methods: 该方法基于预训练的稳定扩散模型(Stable Diffusion),不需要特定任务的训练,可以在少量样本下泛化到自定义图像。
  • results: 论文通过对多种数据集进行广泛的质量和量化评估,表明其方法可以与单modal方法匹配的效果,同时具有多modal填充的能力。
    Abstract Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have demonstrated impressive image generation capabilities and have also been successfully applied to image inpainting. However, in practice, users often require more control over the inpainting process beyond textual guidance, especially when they want to composite objects with customized appearance, color, shape, and layout. Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. Furthermore, our Uni-paint is based on pretrained Stable Diffusion and does not require task-specific training on specific datasets, enabling few-shot generalizability to customized images. We have conducted extensive qualitative and quantitative evaluations that show our approach achieves comparable results to existing single-modal methods while offering multimodal inpainting capabilities not available in other methods. Code will be available at https://github.com/ysy31415/unipaint.
    摘要 最近,文本到图像去噪扩散概率模型(DDPM)展现出了惊人的图像生成能力,并已成功应用于图像修复(inpainting)。然而,在实践中,用户往往需要在文本引导之外对修复过程进行更多控制,特别是当他们想要合成具有自定义外观、颜色、形状和布局的物体时。可惜,现有的基于扩散的修复方法局限于单模态引导,并且需要任务特定的训练,这阻碍了其跨模态扩展性。为了解决这些限制,我们提出了Uni-paint,一个统一的多模态修复框架,它提供了多种引导模式,包括无条件、文本驱动、笔触驱动、示例驱动修复,以及这些模式的组合。此外,我们的Uni-paint基于预训练的Stable Diffusion,不需要在特定数据集上进行任务特定的训练,能够以少量样本泛化到自定义图像。我们进行了广泛的定性和定量评估,结果表明,我们的方法取得了与现有单模态方法相当的效果,同时提供了其他方法所不具备的多模态修复能力。代码将在https://github.com/ysy31415/unipaint上公开。
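Uni-paint's exact procedure is not reproduced here; the sketch below shows the common known-region blending trick used in diffusion-based inpainting, where at each denoising step the unmasked area is replaced by the original image noised to the current level so only the hole is synthesized. Shapes, the noise schedule, and the surrounding loop are assumptions.

```python
import torch

def blend_known_region(x_t, x0_known, mask, alphas_cumprod, t):
    """Keep the unmasked (known) pixels consistent with the original image at noise level t.

    x_t:      current noisy latent/image, (B, C, H, W)
    x0_known: original image containing the hole, same shape
    mask:     1 where the region must be inpainted, 0 where content is known
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noised_known = a_bar.sqrt() * x0_known + (1 - a_bar).sqrt() * torch.randn_like(x0_known)
    return mask * x_t + (1 - mask) * noised_known

# Toy usage inside a (hypothetical) reverse-diffusion loop:
alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
x_t = torch.randn(1, 3, 64, 64)
x0 = torch.zeros(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64); mask[..., 16:48, 16:48] = 1.0
x_t = blend_known_region(x_t, x0, mask, alphas_cumprod, torch.tensor([500]))
print(x_t.shape)
```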

Multi-task Explainable Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2310.07209
  • repo_url: None
  • paper_authors: Mahapara Khurshid, Mayank Vatsa, Richa Singh
  • for: 这篇论文旨在提出一种多任务少样本学习方法,在标注数据很少的情况下也能良好泛化,以帮助早期识别皮肤癌。
  • methods: 本文提出了一种多任务少样本学习方法,由一个充当注意力模块的分割网络(Segmentation Network)和一个分类网络(Classification Network)融合而成,并以加权方式联合分割与分类损失进行训练。
  • results: 实验结果显示,提出的方法可以帮助皮肤病变的早期识别,并且在不同的数据集(包括跨数据集实验)上具有很好的一致性和稳定性。
    Abstract Skin cancer is one of the deadliest diseases and has a high mortality rate if left untreated. The diagnosis generally starts with visual screening and is followed by a biopsy or histopathological examination. Early detection can aid in lowering mortality rates. Visual screening can be limited by the experience of the doctor. Due to the long tail distribution of dermatological datasets and significant intra-variability between classes, automatic classification utilizing computer-aided methods becomes challenging. In this work, we propose a multitask few-shot-based approach for skin lesions that generalizes well with few labelled data to address the small sample space challenge. The proposed approach comprises a fusion of a segmentation network that acts as an attention module and classification network. The output of the segmentation network helps to focus on the most discriminatory features while making a decision by the classification network. To further enhance the classification performance, we have combined segmentation and classification loss in a weighted manner. We have also included the visualization results that explain the decisions made by the algorithm. Three dermatological datasets are used to evaluate the proposed method thoroughly. We also conducted cross-database experiments to ensure that the proposed approach is generalizable across similar datasets. Experimental results demonstrate the efficacy of the proposed work.
    摘要 皮肤癌是最致命的疾病之一,如果不及时治疗,死亡率很高。诊断通常从视觉筛查开始,随后进行活检或组织病理学检查。早期发现有助于降低死亡率,但视觉筛查可能受限于医生的经验。由于皮肤病数据集的长尾分布以及类别内部的显著差异,基于计算机辅助方法的自动分类十分困难。为此,我们提出了一种多任务少样本方法,在少量标注数据下也能良好泛化,以应对小样本空间的挑战。我们的方法由一个充当注意力模块的分割网络和一个分类网络融合而成。分割网络的输出帮助分类网络在决策时聚焦于最具判别性的特征。为了进一步提高分类性能,我们以加权方式结合了分割损失和分类损失,并给出了可视化结果,以解释算法的决策。我们使用三个皮肤病数据集对所提方法进行了全面评估,并进行了跨数据集实验,以确保该方法在相似数据集上具有泛化能力。实验结果证明了所提方法的有效性。
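The weighted combination of segmentation and classification losses mentioned above can be sketched as follows; the soft Dice formulation and the weights are generic choices, not the paper's exact ones.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(pred_mask, gt_mask, eps=1e-6):
    """pred_mask: (B, 1, H, W) probabilities; gt_mask: same shape in {0,1}."""
    inter = (pred_mask * gt_mask).sum(dim=(1, 2, 3))
    union = pred_mask.sum(dim=(1, 2, 3)) + gt_mask.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def multitask_loss(seg_logits, gt_mask, cls_logits, labels, w_seg=1.0, w_cls=1.0):
    seg = soft_dice_loss(torch.sigmoid(seg_logits), gt_mask)   # lesion segmentation term
    cls = F.cross_entropy(cls_logits, labels)                  # lesion classification term
    return w_seg * seg + w_cls * cls

loss = multitask_loss(torch.randn(4, 1, 64, 64), (torch.rand(4, 1, 64, 64) > 0.5).float(),
                      torch.randn(4, 7), torch.randint(0, 7, (4,)))
print(loss.item())
```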

DeepSimHO: Stable Pose Estimation for Hand-Object Interaction via Physics Simulation

  • paper_url: http://arxiv.org/abs/2310.07206
  • repo_url: https://github.com/rongakowang/DeepSimHO
  • paper_authors: Rong Wang, Wei Mao, Hongdong Li
  • for: 这个研究目的是从单一图像观察中进行3D姿势估测,并且处理手与物体之间的互动。
  • methods: 这个研究使用了深度学习管线,将前向物理模拟与反向梯度近似相结合。
  • results: 实验结果显示,这个方法可以明显提高估测的稳定性,并且比测试时优化更高效。
    Abstract This paper addresses the task of 3D pose estimation for a hand interacting with an object from a single image observation. When modeling hand-object interaction, previous works mainly exploit proximity cues, while overlooking the dynamical nature that the hand must stably grasp the object to counteract gravity and thus preventing the object from slipping or falling. These works fail to leverage dynamical constraints in the estimation and consequently often produce unstable results. Meanwhile, refining unstable configurations with physics-based reasoning remains challenging, both by the complexity of contact dynamics and by the lack of effective and efficient physics inference in the data-driven learning framework. To address both issues, we present DeepSimHO: a novel deep-learning pipeline that combines forward physics simulation and backward gradient approximation with a neural network. Specifically, for an initial hand-object pose estimated by a base network, we forward it to a physics simulator to evaluate its stability. However, due to non-smooth contact geometry and penetration, existing differentiable simulators can not provide reliable state gradient. To remedy this, we further introduce a deep network to learn the stability evaluation process from the simulator, while smoothly approximating its gradient and thus enabling effective back-propagation. Extensive experiments show that our method noticeably improves the stability of the estimation and achieves superior efficiency over test-time optimization. The code is available at https://github.com/rongakowang/DeepSimHO.
    摘要 To address these issues, we propose DeepSimHO, a novel deep-learning pipeline that combines forward physics simulation and backward gradient approximation with a neural network. Given an initial hand-object pose estimated by a base network, we forward it to a physics simulator to evaluate its stability. However, existing differentiable simulators cannot provide reliable state gradients due to non-smooth contact geometry and penetration. To address this, we introduce a deep network to learn the stability evaluation process from the simulator, while smoothly approximating its gradient and enabling effective back-propagation.Our method significantly improves the stability of the estimation and achieves superior efficiency over test-time optimization. The code is available at https://github.com/rongakowang/DeepSimHO.

SpikePoint: An Efficient Point-based Spiking Neural Network for Event Cameras Action Recognition

  • paper_url: http://arxiv.org/abs/2310.07189
  • repo_url: None
  • paper_authors: Hongwei Ren, Yue Zhou, Yulong Huang, Haotian Fu, Xiaopeng Lin, Jie Song, Bojun Cheng
  • for: 本研究旨在开发一种能够实现低功耗和高精度的事件摄像头应用场景,通过将事件摄像头与脉冲神经网络(SNN)相结合。
  • methods: 本研究提出了一种名为SpikePoint的新的端到端点 wise SNN架构,可以高效处理 sparse event cloud 数据,并提取全局和局部特征。
  • results: 在四个基于事件的动作识别数据集上,SpikePoint 仅使用 16 个时间步便达到了最先进(SOTA)性能,超过了其他 SNN 方法。此外,它在其中三个数据集上取得了超越包括人工神经网络(ANNs)在内所有方法的 SOTA 性能,而参数量和功耗分别仅约为 ANNs 的 0.3% 和 0.5%。这些结果凸显了点云表示的重要性,并为许多低功耗的基于事件的数据处理应用打开了大门。
    Abstract Event cameras are bio-inspired sensors that respond to local changes in light intensity and feature low latency, high energy efficiency, and high dynamic range. Meanwhile, Spiking Neural Networks (SNNs) have gained significant attention due to their remarkable efficiency and fault tolerance. By synergistically harnessing the energy efficiency inherent in event cameras and the spike-based processing capabilities of SNNs, their integration could enable ultra-low-power application scenarios, such as action recognition tasks. However, existing approaches often entail converting asynchronous events into conventional frames, leading to additional data mapping efforts and a loss of sparsity, contradicting the design concept of SNNs and event cameras. To address this challenge, we propose SpikePoint, a novel end-to-end point-based SNN architecture. SpikePoint excels at processing sparse event cloud data, effectively extracting both global and local features through a singular-stage structure. Leveraging the surrogate training method, SpikePoint achieves high accuracy with few parameters and maintains low power consumption, specifically employing the identity mapping feature extractor on diverse datasets. SpikePoint achieves state-of-the-art (SOTA) performance on four event-based action recognition datasets using only 16 timesteps, surpassing other SNN methods. Moreover, it also achieves SOTA performance across all methods on three datasets, utilizing approximately 0.3\% of the parameters and 0.5\% of power consumption employed by artificial neural networks (ANNs). These results emphasize the significance of Point Cloud and pave the way for many ultra-low-power event-based data processing applications.
    摘要 事件相机是一类受生物启发的传感器,对局部光强变化作出响应,具有低延迟、高能效和高动态范围。与此同时,脉冲神经网络(SNN)因其出色的效率和容错性而受到广泛关注。通过将事件相机固有的能效性与 SNN 基于脉冲的处理能力相结合,可以实现超低功耗的应用场景,例如动作识别任务。然而,现有方法通常会将异步事件转换成常规帧,这不仅需要额外的数据映射工作,还会丧失稀疏性,与 SNN 和事件相机的设计理念相悖。为解决这一挑战,我们提出了 SpikePoint,一种新的端到端基于点的 SNN 架构。SpikePoint 擅长处理稀疏的事件点云数据,通过单阶段结构高效提取全局和局部特征。借助替代(surrogate)训练方法,SpikePoint 以少量参数实现了高精度,并保持了低功耗。在四个基于事件的动作识别数据集上,SpikePoint 仅使用 16 个时间步便达到了最先进(SOTA)性能,超过了其他 SNN 方法。此外,它在其中三个数据集上取得了超越所有方法的 SOTA 性能,参数量和功耗分别仅约为人工神经网络(ANN)的 0.3% 和 0.5%。这些结果凸显了点云表示的重要性,并为许多超低功耗的基于事件的数据处理应用打开了大门。

NeuroInspect: Interpretable Neuron-based Debugging Framework through Class-conditional Visualizations

  • paper_url: http://arxiv.org/abs/2310.07184
  • repo_url: https://github.com/yeongjoonju/neuroinspect
  • paper_authors: Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee
  • for: This paper aims to provide an interpretable neuron-based debugging framework for deep learning (DL) models, to help DL practitioners understand and fix mistakes made by the models.
  • methods: The paper proposes a three-stage debugging framework called NeuroInspect, which includes counterfactual explanations, feature visualizations, and false correlation mitigation. The framework uses a novel feature visualization method called CLIP-Illusion to generate human-interpretable explanations for model errors.
  • results: The paper demonstrates the effectiveness of NeuroInspect in addressing false correlations and improving inferences for classes with the worst performance in real-world settings. The results show that NeuroInspect helps debug the mistakes of DL models and improves human understanding of the decision-making process within the networks.
    Abstract Despite deep learning (DL) has achieved remarkable progress in various domains, the DL models are still prone to making mistakes. This issue necessitates effective debugging tools for DL practitioners to interpret the decision-making process within the networks. However, existing debugging methods often demand extra data or adjustments to the decision process, limiting their applicability. To tackle this problem, we present NeuroInspect, an interpretable neuron-based debugging framework with three key stages: counterfactual explanations, feature visualizations, and false correlation mitigation. Our debugging framework first pinpoints neurons responsible for mistakes in the network and then visualizes features embedded in the neurons to be human-interpretable. To provide these explanations, we introduce CLIP-Illusion, a novel feature visualization method that generates images representing features conditioned on classes to examine the connection between neurons and the decision layer. We alleviate convoluted explanations of the conventional visualization approach by employing class information, thereby isolating mixed properties. This process offers more human-interpretable explanations for model errors without altering the trained network or requiring additional data. Furthermore, our framework mitigates false correlations learned from a dataset under a stochastic perspective, modifying decisions for the neurons considered as the main causes. We validate the effectiveness of our framework by addressing false correlations and improving inferences for classes with the worst performance in real-world settings. Moreover, we demonstrate that NeuroInspect helps debug the mistakes of DL models through evaluation for human understanding. The code is openly available at https://github.com/yeongjoonJu/NeuroInspect.
    摘要 尽管深度学习(DL)已经取得了各种领域的显著进步,但DL模型仍然容易出错。这个问题需要有效的调试工具,以便DL实践者可以解释网络的决策过程。然而,现有的调试方法经常需要额外的数据或调整决策过程,限制其应用。为解决这个问题,我们提出了NeuroInspect,一个可解释的 neuron-based 调试框架。我们的调试框架首先在网络中标识出负责出错的 neuron,然后使用CLIP-Illusion,一种新的特征视图方法,生成表示特征类别的图像,以便人类可以理解。我们通过使用类信息,缩小混杂的解释,从而提供更人类可理解的错误解释,而无需更改已训练的网络或需要额外的数据。此外,我们的框架还解决了基于数据的false correlation问题,通过修改考虑到的neuron的决策,从而提高网络的准确率。我们验证了NeuroInspect的效果,通过对实际场景中的false correlation和各类错误进行修复,提高了网络的推理能力。此外,我们还证明了NeuroInspect可以帮助调试DL模型的错误。代码可以在https://github.com/yeongjoonJu/NeuroInspect中下载。

Improving mitosis detection on histopathology images using large vision-language models

  • paper_url: http://arxiv.org/abs/2310.07176
  • repo_url: None
  • paper_authors: Ruiwen Ding, James Hall, Neil Tenenholtz, Kristen Severson
  • for: 这个论文的目的是提高组织病理图像中有丝分裂(mitosis)检测的准确率,借助大规模视觉-语言模型同时利用视觉特征和自然语言。
  • methods: 这个论文将有丝分裂检测表述为图像描述任务和视觉问答(VQA)任务,并将肿瘤类型、扫描仪类型等元数据作为上下文,以提高检测准确率。
  • results: 研究表明,与多种基线模型相比,这种方法可以提高有丝分裂检测的准确率。
    Abstract In certain types of cancerous tissue, mitotic count has been shown to be associated with tumor proliferation, poor prognosis, and therapeutic resistance. Due to the high inter-rater variability of mitotic counting by pathologists, convolutional neural networks (CNNs) have been employed to reduce the subjectivity of mitosis detection in hematoxylin and eosin (H&E)-stained whole slide images. However, most existing models have performance that lags behind expert panel review and only incorporate visual information. In this work, we demonstrate that pre-trained large-scale vision-language models that leverage both visual features and natural language improve mitosis detection accuracy. We formulate the mitosis detection task as an image captioning task and a visual question answering (VQA) task by including metadata such as tumor and scanner types as context. The effectiveness of our pipeline is demonstrated via comparison with various baseline models using 9,501 mitotic figures and 11,051 hard negatives (non-mitotic figures that are difficult to characterize) from the publicly available Mitosis Domain Generalization Challenge (MIDOG22) dataset.
    摘要 在某些癌变组织中,有丝分裂计数与肿瘤增殖、较差的预后和治疗抵抗相关。由于病理医生之间的有丝分裂计数存在较大的评估者间差异,卷积神经网络(CNNs)已被用于降低 H&E 染色全切片图像中有丝分裂检测的主观性。然而,大多数现有模型的性能仍落后于专家小组评审,且只利用了视觉信息。在这项工作中,我们证明了利用视觉特征与自然语言的预训练大规模视觉-语言模型可以提高有丝分裂检测精度。我们将有丝分裂检测任务表述为一个图像描述任务和一个视觉问答(VQA)任务,并将肿瘤类型和扫描仪类型等 metadata 作为上下文信息。我们通过与多种基线模型进行比较验证了该管线的有效性,所用数据来自公开的有丝分裂领域泛化挑战(MIDOG22)数据集,包含 9,501 个有丝分裂图像和 11,051 个难以判别的非有丝分裂困难负样本。

Anchor-based Multi-view Subspace Clustering with Hierarchical Feature Descent

  • paper_url: http://arxiv.org/abs/2310.07166
  • repo_url: None
  • paper_authors: Qiyuan Ou, Siwei Wang, Pei Zhang, Sihang Zhou, En Zhu
  • for: 这个论文的主要目标是提出一种基于锚点的多视图子空间聚类算法,以解决现有多视图聚类算法时间复杂度过高的问题。
  • methods: 该论文通过层次特征下降挖掘并利用不同视图之间的依赖关系,得到一个共同的潜在空间(相似空间);然后在该相似空间中采用统一的锚点采样策略,并使用子空间聚类来共同学习表示。
  • results: 实验结果表明,提出的 MVSC-HFD 模型在公共评估数据集上经常超越当前状态艺技。
    Abstract Multi-view clustering has attracted growing attention owing to its capabilities of aggregating information from various sources and its promising horizons in public affairs. Up till now, many advanced approaches have been proposed in recent literature. However, there are several ongoing difficulties to be tackled. One common dilemma occurs while attempting to align the features of different views. We dig out as well as deploy the dependency amongst views through hierarchical feature descent, which leads to a common latent space( STAGE 1). This latent space, for the first time of its kind, is regarded as a 'resemblance space', as it reveals certain correlations and dependencies of different views. To be exact, the one-hot encoding of a category can also be referred to as a resemblance space in its terminal phase. Moreover, due to the intrinsic fact that most of the existing multi-view clustering algorithms stem from k-means clustering and spectral clustering, this results in cubic time complexity w.r.t. the number of the objects. However, we propose Anchor-based Multi-view Subspace Clustering with Hierarchical Feature Descent(MVSC-HFD) to further reduce the computing complexity to linear time cost through a unified sampling strategy in resemblance space( STAGE 2), followed by subspace clustering to learn the representation collectively( STAGE 3). Extensive experimental results on public benchmark datasets demonstrate that our proposed model consistently outperforms the state-of-the-art techniques.
    摘要 多视图聚类在最近几年来得到了越来越多的关注,这是因为它可以聚合来自不同源泉的信息,并且在公共事务中具有承诺的前途。到目前为止,文献中已经提出了许多高级方法。然而,还有许多在进行的困难,其中一个最常见的困难是对不同视图之间的特征进行Alignment。我们通过层次特征降低来解决这个问题,并在多视图聚类中提出了一个新的Latent space(Stage 1)。这个Latent space被认为是一种'相似空间',因为它揭示了不同视图之间的相似性和依赖关系。此外,由于大多数现有的多视图聚类算法来自k-means clustering和spectral clustering,这会导致 cubic time complexity 对于对象的数量。然而,我们提出了基于 anchor的多视图子空间聚类 Algorithm with Hierarchical Feature Descent(MVSC-HFD),可以在resemblance space中进行统一采样策略(Stage 2),然后使用子空间聚类来学习表示(Stage 3)。我们在公共测试数据集上进行了广泛的实验,结果显示,我们提出的模型在与现有技术相比 consistently outperform。

Robust Unsupervised Domain Adaptation by Retaining Confident Entropy via Edge Concatenation

  • paper_url: http://arxiv.org/abs/2310.07149
  • repo_url: None
  • paper_authors: Hye-Seong Hong, Abhishek Kumar, Dong-Gyu Lee
  • for: 提高不监督领域适应的Semantic segmentation网络训练效果,使用计算机生成的标注数据作为源数据。
  • methods: 利用内部和外部信息的共同作用,在Entropy-based adversarial networks中增强预测源领域的能力。增加权重 Edge-predicted probability values,以提高分类边界的清晰度。设计了一种概率分享网络,将多种信息集成到更有效的分类中。
  • results: 在不监督领域适应 benchmark上进行了严格的评估,包括 SYNTHIA $\rightarrow$ Cityscapes和 SYNTHIA $\rightarrow$ Mapillary。实验结果表明,提出的方法在不同的无监督领域适应场景中具有优于当前方法的性能。
    Abstract The generalization capability of unsupervised domain adaptation can mitigate the need for extensive pixel-level annotations to train semantic segmentation networks by training models on synthetic data as a source with computer-generated annotations. Entropy-based adversarial networks are proposed to improve source domain prediction; however, they disregard significant external information, such as edges, which have the potential to identify and distinguish various objects within an image accurately. To address this issue, we introduce a novel approach to domain adaptation, leveraging the synergy of internal and external information within entropy-based adversarial networks. In this approach, we enrich the discriminator network with edge-predicted probability values within this innovative framework to enhance the clarity of class boundaries. Furthermore, we devised a probability-sharing network that integrates diverse information for more effective segmentation. Incorporating object edges addresses a pivotal aspect of unsupervised domain adaptation that has frequently been neglected in the past -- the precise delineation of object boundaries. Conventional unsupervised domain adaptation methods usually center around aligning feature distributions and may not explicitly model object boundaries. Our approach effectively bridges this gap by offering clear guidance on object boundaries, thereby elevating the quality of domain adaptation. Our approach undergoes rigorous evaluation on the established unsupervised domain adaptation benchmarks, specifically in adapting SYNTHIA $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Mapillary. Experimental results show that the proposed model attains better performance than state-of-the-art methods. The superior performance across different unsupervised domain adaptation scenarios highlights the versatility and robustness of the proposed method.
    摘要 无监督领域自适应的泛化能力可以减轻训练 semantic segmentation 网络所需的大量像素级标注:以带有计算机生成标注的合成数据作为源域训练模型。然而,已有的基于熵的对抗网络虽然能改善源域预测,却忽略了诸如边缘等重要的外部信息,而这些信息有助于准确识别和区分图像中的不同对象。为解决这个问题,我们提出了一种新的领域自适应方法,在基于熵的对抗网络中协同利用内部和外部信息。在这种方法中,我们将边缘预测概率值引入判别器网络,以提高类别边界的清晰度。此外,我们设计了一种概率共享网络,整合多种信息以实现更有效的分割。引入对象边缘解决了以往无监督领域自适应方法常被忽视的关键问题,即对象边界的精确刻画;传统方法通常着眼于对齐特征分布,并未显式建模对象边界,而我们的方法通过对边界给出明确引导弥补了这一不足,从而提升了领域自适应的质量。我们的方法在公认的无监督领域自适应 benchmark 上进行了严格评估,包括 SYNTHIA $\rightarrow$ Cityscapes 和 SYNTHIA $\rightarrow$ Mapillary。实验结果表明,我们的方法性能优于现有的 state-of-the-art 方法;在不同无监督领域自适应场景下的优异表现也体现了该方法的通用性和鲁棒性。
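A hedged sketch of two pieces implied by the abstract, a per-pixel entropy map computed from segmentation logits and its concatenation with an edge-probability map before the discriminator, is shown below; how edges are predicted and the discriminator itself are placeholders.

```python
import torch
import torch.nn.functional as F

def entropy_map(logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel normalised entropy of the softmax output, shape (B, 1, H, W)."""
    p = F.softmax(logits, dim=1)
    ent = -(p * torch.log(p + 1e-8)).sum(dim=1, keepdim=True)
    return ent / torch.log(torch.tensor(float(logits.shape[1])))

def discriminator_input(logits: torch.Tensor, edge_prob: torch.Tensor) -> torch.Tensor:
    """Concatenate the entropy map with an edge-probability map along channels."""
    return torch.cat([entropy_map(logits), edge_prob], dim=1)   # (B, 2, H, W)

logits = torch.randn(2, 19, 128, 256)          # e.g. 19 Cityscapes classes
edges = torch.sigmoid(torch.randn(2, 1, 128, 256))
d_in = discriminator_input(logits, edges)
print(d_in.shape)
```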

Echocardiography video synthesis from end diastolic semantic map via diffusion model

  • paper_url: http://arxiv.org/abs/2310.07131
  • repo_url: None
  • paper_authors: Phi Nguyen Van, Duc Tran Minh, Hieu Pham Huy, Long Tran Quoc
  • for: 这篇论文的目的是为echocardiography视频生成任务提供一种新的方法,使用semantic映射来指导生成过程,以提高生成的视频的真实感和一致性。
  • methods: 这篇论文使用了Denoising Diffusion Probabilistic Models (DDPMs),并在其基础上进行了扩展和改进,以适应echocardiography视频生成任务。具体来说,我们使用semantic映射来指导生成过程,并在多尺度特征图中引入了空间自适应归一化。
  • results: 实验表明,我们的模型在CAMUS数据集上的表现优于标准扩散技术,在FID、FVD和SSMI等多个评价指标上均有提升。这表明我们的模型可以生成真实感和一致性更强的echocardiography视频序列。
    Abstract Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated significant achievements in various image and video generation tasks, including the domain of medical imaging. However, generating echocardiography videos based on semantic anatomical information remains an unexplored area of research. This is mostly due to the constraints imposed by the currently available datasets, which lack sufficient scale and comprehensive frame-wise annotations for every cardiac cycle. This paper aims to tackle the aforementioned challenges by expanding upon existing video diffusion models for the purpose of cardiac video synthesis. More specifically, our focus lies in generating video using semantic maps of the initial frame during the cardiac cycle, commonly referred to as end diastole. To further improve the synthesis process, we integrate spatial adaptive normalization into multiscale feature maps. This enables the inclusion of semantic guidance during synthesis, resulting in enhanced realism and coherence of the resultant video sequences. Experiments are conducted on the CAMUS dataset, which is a highly used dataset in the field of echocardiography. Our model exhibits better performance compared to the standard diffusion technique in terms of multiple metrics, including FID, FVD, and SSMI.
    摘要 去噪扩散概率模型(DDPM)在各类图像和视频生成任务中取得了显著成就,其中也包括医学影像领域。然而,基于语义解剖信息的echocardiography视频生成仍是一个未被探索的方向,这主要是因为现有数据集规模不足,且缺乏对每个心动周期逐帧的完整标注。本文旨在解决上述挑战,将现有的视频扩散模型扩展到心脏视频合成。具体而言,我们的重点是利用心动周期起始帧(即舒张末期)的semantic maps来生成视频。为进一步改进合成过程,我们在多尺度特征图中引入空间自适应归一化,从而在合成时加入语义引导,提高生成视频序列的真实性和一致性。我们在echocardiography领域广泛使用的CAMUS数据集上进行了实验,结果显示我们的模型在FID、FVD和SSMI等多个评价指标上优于标准扩散技术。
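A compact sketch of spatially-adaptive normalization, where per-pixel scale and shift are predicted from the semantic map and applied to normalized features, is given below; the layer sizes, the InstanceNorm choice, and nearest-neighbour resizing are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveNorm(nn.Module):
    """Normalise features, then modulate them with gamma/beta maps predicted from a semantic map."""
    def __init__(self, feat_channels: int, label_channels: int, hidden: int = 64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, semantic_map):
        # Resize the semantic map to the feature resolution, then predict modulation maps.
        seg = F.interpolate(semantic_map, size=feat.shape[-2:], mode="nearest")
        h = self.shared(seg)
        return self.norm(feat) * (1 + self.to_gamma(h)) + self.to_beta(h)

norm = SpatiallyAdaptiveNorm(feat_channels=128, label_channels=4)
out = norm(torch.randn(2, 128, 28, 28), torch.randn(2, 4, 112, 112))
print(out.shape)
```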