cs.CV - 2023-08-24

VNI-Net: Vector Neurons-based Rotation-Invariant Descriptor for LiDAR Place Recognition

  • paper_url: http://arxiv.org/abs/2308.12870
  • repo_url: None
  • paper_authors: Gengxuan Tian, Junqiao Zhao, Yingfeng Cai, Fenglin Zhang, Wenjie Mu, Chen Ye
  • for: Improving the rotation invariance of LiDAR-based place recognition.
  • methods: Employs a Vector Neurons Network (VNN) to achieve SO(3) rotation invariance: rotation-equivariant features are extracted from neighboring points and low-dimensional features are mapped into a high-dimensional space, after which Euclidean and cosine distances in the equivariant feature space serve as rotation-invariant descriptors (see the sketch after this entry).
  • results: Experiments on public datasets show that the method significantly outperforms other baseline methods that implement rotation invariance, and achieves results comparable to current state-of-the-art place recognition methods that do not consider rotation.
    Abstract LiDAR-based place recognition plays a crucial role in Simultaneous Localization and Mapping (SLAM) and LiDAR localization. Despite the emergence of various deep learning-based and hand-crafting-based methods, rotation-induced place recognition failure remains a critical challenge. Existing studies address this limitation through specific training strategies or network structures. However, the former does not produce satisfactory results, while the latter focuses mainly on the reduced problem of SO(2) rotation invariance. Methods targeting SO(3) rotation invariance suffer from limitations in discrimination capability. In this paper, we propose a new method that employs Vector Neurons Network (VNN) to achieve SO(3) rotation invariance. We first extract rotation-equivariant features from neighboring points and map low-dimensional features to a high-dimensional space through VNN. Afterwards, we calculate the Euclidean and Cosine distance in the rotation-equivariant feature space as rotation-invariant feature descriptors. Finally, we aggregate the features using GeM pooling to obtain global descriptors. To address the significant information loss when formulating rotation-invariant descriptors, we propose computing distances between features at different layers within the Euclidean space neighborhood. This greatly improves the discriminability of the point cloud descriptors while ensuring computational efficiency. Experimental results on public datasets show that our approach significantly outperforms other baseline methods implementing rotation invariance, while achieving comparable results with current state-of-the-art place recognition methods that do not consider rotation issues.
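To make the rotation-invariance argument concrete, here is a minimal sketch (toy data, not the authors' code; the channel count, GeM power p, and helper names are assumptions) showing that Euclidean and cosine distances between vector-neuron style (C, 3) features are unchanged under an SO(3) rotation, so a GeM-pooled global descriptor built from them is rotation-invariant:

```python
# Vector-neuron style features have shape (C, 3); a rotation R acts as V -> V @ R.T.
# Distances and cosine similarities between vector channels depend only on inner
# products, which rotation preserves.
import numpy as np
from scipy.spatial.transform import Rotation

def invariant_descriptor(V):
    """V: (C, 3) rotation-equivariant vector features -> rotation-invariant vector."""
    diff = V[:, None, :] - V[None, :, :]                 # (C, C, 3)
    euclid = np.linalg.norm(diff, axis=-1)               # pairwise Euclidean distances
    norms = np.linalg.norm(V, axis=-1, keepdims=True) + 1e-8
    cosine = (V @ V.T) / (norms @ norms.T)               # pairwise cosine similarities
    return np.concatenate([euclid.ravel(), cosine.ravel()])

def gem_pool(X, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over points (axis 0) of clipped features."""
    return (np.clip(X, eps, None) ** p).mean(axis=0) ** (1.0 / p)

rng = np.random.default_rng(0)
points = rng.normal(size=(32, 8, 3))                     # 32 points, 8 vector channels each
R = Rotation.from_euler("xyz", [30, 45, 60], degrees=True).as_matrix()

desc = gem_pool(np.stack([invariant_descriptor(v) for v in points]))
desc_rot = gem_pool(np.stack([invariant_descriptor(v @ R.T) for v in points]))
print(np.allclose(desc, desc_rot))                       # True: global descriptor is rotation-invariant
```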

ToonTalker: Cross-Domain Face Reenactment

  • paper_url: http://arxiv.org/abs/2308.12866
  • repo_url: None
  • paper_authors: Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, Yujiu Yang
  • for: Cross-domain face reenactment, i.e., driving a cartoon image with the video of a real person and vice versa.
  • methods: A transformer-based framework with two domain-specific motion encoders and two learnable motion base memories. A source query transformer and a driving query transformer project domain-specific motion into a common latent space, where motion transfer is carried out via latent code addition and the edited motion is projected back to the source domain (see the sketch after this entry).
  • results: The method outperforms competing approaches in extensive evaluations. A Disney-style cartoon dataset is also contributed for further validation and application of the method.
    Abstract We target cross-domain face reenactment in this paper, i.e., driving a cartoon image with the video of a real person and vice versa. Recently, many works have focused on one-shot talking face generation to drive a portrait with a real video, i.e., within-domain reenactment. Straightforwardly applying those methods to cross-domain animation will cause inaccurate expression transfer, blur effects, and even apparent artifacts due to the domain shift between cartoon and real faces. Only a few works attempt to settle cross-domain face reenactment. The most related work AnimeCeleb requires constructing a dataset with pose vector and cartoon image pairs by animating 3D characters, which makes it inapplicable anymore if no paired data is available. In this paper, we propose a novel method for cross-domain reenactment without paired data. Specifically, we propose a transformer-based framework to align the motions from different domains into a common latent space where motion transfer is conducted via latent code addition. Two domain-specific motion encoders and two learnable motion base memories are used to capture domain properties. A source query transformer and a driving one are exploited to project domain-specific motion to the canonical space. The edited motion is projected back to the domain of the source with a transformer. Moreover, since no paired data is provided, we propose a novel cross-domain training scheme using data from two domains with the designed analogy constraint. Besides, we contribute a cartoon dataset in Disney style. Extensive evaluations demonstrate the superiority of our method over competing methods.
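As a schematic of the motion-transfer step described above, the sketch below uses hypothetical module sizes and names (plain linear stand-ins for the encoders and back-projection, a tiny attention block for the query transformer); it illustrates latent code addition in a shared space rather than the ToonTalker architecture itself:

```python
import torch
import torch.nn as nn

class TinyQueryTransformer(nn.Module):
    """Projects domain-specific motion features onto learnable base queries."""
    def __init__(self, dim=64, n_bases=8):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(n_bases, dim))   # learnable motion base memory
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, motion_feat):                            # (B, T, dim)
        q = self.bases.unsqueeze(0).expand(motion_feat.size(0), -1, -1)
        out, _ = self.attn(q, motion_feat, motion_feat)        # canonical-space code (B, n_bases, dim)
        return out

src_encoder, drv_encoder = nn.Linear(128, 64), nn.Linear(128, 64)  # stand-in motion encoders
src_proj, drv_proj = TinyQueryTransformer(), TinyQueryTransformer()
back_proj = nn.Linear(64, 64)                                   # stand-in projection back to the source domain

src_motion = src_proj(src_encoder(torch.randn(2, 10, 128)))     # cartoon-domain motion
drv_motion = drv_proj(drv_encoder(torch.randn(2, 10, 128)))     # real-domain driving motion

edited = src_motion + drv_motion                                # motion transfer via latent code addition
decoded = back_proj(edited)                                     # projected back to the source domain
print(decoded.shape)                                            # torch.Size([2, 8, 64])
```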

SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection

  • paper_url: http://arxiv.org/abs/2308.12863
  • repo_url: None
  • paper_authors: Xinyu Zhang, Yan Gong, Zhiwei Li, Xin Gao, Dafeng Jin, Jun Li, Huaping Liu
  • for: This paper proposes a novel fusion architecture called SkipcrossNets for multi-modal fusion of LiDAR point clouds and camera images in autonomous driving tasks.
  • methods: The SkipcrossNets architecture uses skip-cross connections to adaptively combine features from both modalities at each layer, without being bound to a specific fusion epoch. The network is divided into several blocks to reduce the complexity of feature fusion and the number of model parameters.
  • results: The proposed SkipcrossNets architecture achieved a MaxF score of 96.85% on the KITTI dataset and an F1 score of 84.84% on the A2D2 dataset, with a memory requirement of only 2.33 MB and a speed of 68.24 FPS, making it viable for mobile terminals and embedded devices.
    Abstract Multi-modal fusion is increasingly being used for autonomous driving tasks, as images from different modalities provide unique information for feature extraction. However, the existing two-stream networks are only fused at a specific network layer, which requires a lot of manual attempts to set up. As the CNN goes deeper, the two modal features become more and more advanced and abstract, and the fusion occurs at the feature level with a large gap, which can easily hurt the performance. In this study, we propose a novel fusion architecture called skip-cross networks (SkipcrossNets), which combines adaptively LiDAR point clouds and camera images without being bound to a certain fusion epoch. Specifically, skip-cross connects each layer to each layer in a feed-forward manner, and for each layer, the feature maps of all previous layers are used as input and its own feature maps are used as input to all subsequent layers for the other modality, enhancing feature propagation and multi-modal features fusion. This strategy facilitates selection of the most similar feature layers from two data pipelines, providing a complementary effect for sparse point cloud features during fusion processes. The network is also divided into several blocks to reduce the complexity of feature fusion and the number of model parameters. The advantages of skip-cross fusion were demonstrated through application to the KITTI and A2D2 datasets, achieving a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters required only 2.33 MB of memory at a speed of 68.24 FPS, which could be viable for mobile terminals and embedded devices.
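The skip-cross connection pattern can be illustrated with a small two-stream block in which every layer also consumes all earlier feature maps of the other modality. The sketch below is an assumption-laden toy (channel counts, depth, and activations are arbitrary), not the published network:

```python
import torch
import torch.nn as nn

class SkipCrossBlock(nn.Module):
    """Two streams (e.g. LiDAR and camera) with dense cross-modal connections."""
    def __init__(self, channels=16, n_layers=3):
        super().__init__()
        def conv(i):  # layer i sees its own latest map plus (i + 1) cross-modal maps
            return nn.Conv2d(channels * (i + 2), channels, kernel_size=3, padding=1)
        self.lidar_layers = nn.ModuleList(conv(i) for i in range(n_layers))
        self.cam_layers = nn.ModuleList(conv(i) for i in range(n_layers))

    def forward(self, lidar, cam):
        lidar_feats, cam_feats = [lidar], [cam]
        for lid_layer, cam_layer in zip(self.lidar_layers, self.cam_layers):
            # each layer receives its own latest feature plus all previous features
            # of the other modality, concatenated along the channel axis
            new_lidar = torch.relu(lid_layer(torch.cat([lidar_feats[-1], *cam_feats], dim=1)))
            new_cam = torch.relu(cam_layer(torch.cat([cam_feats[-1], *lidar_feats], dim=1)))
            lidar_feats.append(new_lidar)
            cam_feats.append(new_cam)
        return lidar_feats[-1], cam_feats[-1]

block = SkipCrossBlock()
out_l, out_c = block(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))
print(out_l.shape, out_c.shape)  # torch.Size([1, 16, 32, 32]) for each stream
```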

Learned Local Attention Maps for Synthesising Vessel Segmentations

  • paper_url: http://arxiv.org/abs/2308.12861
  • repo_url: None
  • paper_authors: Yash Deo, Rodrigo Bonazzola, Haoran Dou, Yan Xia, Tianyou Wei, Nishant Ravikumar, Alejandro F. Frangi, Toni Lassila
  • for: Synthesising blood vessel segmentations from routinely acquired MR contrasts, providing a better tool for diagnosis and for assessing the risk of adverse vascular events.
  • methods: An encoder-decoder model that synthesises segmentations of the main cerebral arteries in the circle of Willis (CoW) from only T2 MRI. A two-phase multi-objective learning approach captures both global and local features, and learned local attention maps, generated by dilating the segmentation labels, force the network to extract only the information from the T2 MRI relevant to synthesising the CoW (see the sketch after this entry).
  • results: The synthetic vessel segmentations achieved a mean Dice score of $0.79 \pm 0.03$ in testing, higher than state-of-the-art segmentation networks such as transformer U-Net and nnU-net while using only a fraction of their parameters. The main qualitative difference from the comparative models was the sharper resolution of the CoW vessel segments, especially in the posterior circulation.
    Abstract Magnetic resonance angiography (MRA) is an imaging modality for visualising blood vessels. It is useful for several diagnostic applications and for assessing the risk of adverse events such as haemorrhagic stroke (resulting from the rupture of aneurysms in blood vessels). However, MRAs are not acquired routinely, hence, an approach to synthesise blood vessel segmentations from more routinely acquired MR contrasts such as T1 and T2, would be useful. We present an encoder-decoder model for synthesising segmentations of the main cerebral arteries in the circle of Willis (CoW) from only T2 MRI. We propose a two-phase multi-objective learning approach, which captures both global and local features. It uses learned local attention maps generated by dilating the segmentation labels, which forces the network to only extract information from the T2 MRI relevant to synthesising the CoW. Our synthetic vessel segmentations generated from only T2 MRI achieved a mean Dice score of $0.79 \pm 0.03$ in testing, compared to state-of-the-art segmentation networks such as transformer U-Net ($0.71 \pm 0.04$) and nnU-net($0.68 \pm 0.05$), while using only a fraction of the parameters. The main qualitative difference between our synthetic vessel segmentations and the comparative models was in the sharper resolution of the CoW vessel segments, especially in the posterior circulation.
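A rough illustration of the local-attention-map idea, assuming a plain binary dilation of the vessel labels (the paper's maps are learned, and the iteration count here is arbitrary):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def local_attention_mask(label, iterations=5):
    """label: binary vessel segmentation (H, W) -> dilated attention mask."""
    return binary_dilation(label, iterations=iterations).astype(np.float32)

label = np.zeros((64, 64), dtype=bool)
label[30:34, 10:50] = True                      # toy vessel segment

mask = local_attention_mask(label)
pred, target = np.random.rand(64, 64), label.astype(np.float32)
masked_l1 = np.abs(pred - target) * mask        # loss contributions restricted to the vessel neighbourhood
print(mask.sum(), masked_l1.mean())
```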

Implicit Obstacle Map-driven Indoor Navigation Model for Robust Obstacle Avoidance

  • paper_url: http://arxiv.org/abs/2308.12845
  • repo_url: https://github.com/xwaiyy123/object-navigation
  • paper_authors: Wei Xie, Haobo Jiang, Shuo Gu, Jin Xie
  • for: Improving obstacle avoidance in goal-driven indoor navigation, particularly when obstacles are missing from the visual image or detections are missed.
  • methods: An implicit obstacle map-driven indoor navigation framework in which the implicit obstacle map is learned from historical trial-and-error experience rather than from the visual image, improving the robustness of obstacle avoidance. A non-local target memory aggregation module leverages a non-local network to model the relationship between the target semantics and target orientation clues during navigation, mining the most target-correlated object clues for the navigation decision.
  • results: Experiments on the AI2-THOR and RoboTHOR benchmarks demonstrate excellent obstacle avoidance and navigation efficiency of the proposed method.
    Abstract Robust obstacle avoidance is one of the critical steps for successful goal-driven indoor navigation tasks. Due to the obstacle missing in the visual image and the possible missed detection issue, visual image-based obstacle avoidance techniques still suffer from unsatisfactory robustness. To mitigate it, in this paper, we propose a novel implicit obstacle map-driven indoor navigation framework for robust obstacle avoidance, where an implicit obstacle map is learned based on the historical trial-and-error experience rather than the visual image. In order to further improve the navigation efficiency, a non-local target memory aggregation module is designed to leverage a non-local network to model the intrinsic relationship between the target semantic and the target orientation clues during the navigation process so as to mine the most target-correlated object clues for the navigation decision. Extensive experimental results on AI2-Thor and RoboTHOR benchmarks verify the excellent obstacle avoidance and navigation efficiency of our proposed method. The core source code is available at https://github.com/xwaiyy123/object-navigation.

EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting

  • paper_url: http://arxiv.org/abs/2308.12831
  • repo_url: None
  • paper_authors: Zitao Wang, Qiguang Miao, Yue Xi
  • for: Extracting an alpha matte with complete semantics and finely-detailed contours for portrait matting.
  • methods: Transformer self-attention, with its larger receptive field, captures long-range dependencies and low-frequency semantic information of a portrait. A semantic and contour detector (SCD) captures the distribution of semantic and contour features, and dedicated contour-edge extraction and semantic extraction branches refine the contour features and complete semantic information.
  • results: Improves the accuracy and completeness of predicted portrait contours, is end-to-end, and requires no trimap.
    Abstract The portrait matting task aims to extract an alpha matte with complete semantics and finely-detailed contours. In comparison to CNN-based approaches, transformers with self-attention allow a larger receptive field, enabling it to better capture long-range dependencies and low-frequency semantic information of a portrait. However, the recent research shows that self-attention mechanism struggle with modeling high-frequency information and capturing fine contour details, which can lead to bias while predicting the portrait's contours. To address the problem, we propose EFormer to enhance the model's attention towards semantic and contour features. Especially the latter, which is surrounded by a large amount of high-frequency details. We build a semantic and contour detector (SCD) to accurately capture the distribution of semantic and contour features. And we further design contour-edge extraction branch and semantic extraction branch for refining contour features and complete semantic information. Finally, we fuse the two kinds of features and leverage the segmentation head to generate the predicted portrait matte. Remarkably, EFormer is an end-to-end trimap-free method and boasts a simple structure. Experiments conducted on VideoMatte240K-JPEGSD and AIM datasets demonstrate that EFormer outperforms previous portrait matte methods.

Robotic Scene Segmentation with Memory Network for Runtime Surgical Context Inference

  • paper_url: http://arxiv.org/abs/2308.12789
  • repo_url: https://github.com/uva-dsa/runtime_robscene_seg_2context
  • paper_authors: Zongyu Li, Ian Reyes, Homa Alemzadeh
  • for: Addressing the challenge of runtime surgical context inference in robot-assisted surgery by improving the segmentation accuracy and temporal consistency of video data.
  • methods: A Space Time Correspondence Network (STCN), a memory network that performs binary segmentation and minimizes the effects of class imbalance. Its memory bank allows past image and segmentation information to be reused, ensuring consistency of the predicted masks (see the sketch after this entry).
  • results: Experiments on the publicly available JIGSAWS dataset show that STCN achieves superior segmentation performance for objects that are difficult to segment, such as needle and thread, and improves context inference. Segmentation and context inference can also be performed at runtime without compromising performance.
    Abstract Surgical context inference has recently garnered significant attention in robot-assisted surgery as it can facilitate workflow analysis, skill assessment, and error detection. However, runtime context inference is challenging since it requires timely and accurate detection of the interactions among the tools and objects in the surgical scene based on the segmentation of video data. On the other hand, existing state-of-the-art video segmentation methods are often biased against infrequent classes and fail to provide temporal consistency for segmented masks. This can negatively impact the context inference and accurate detection of critical states. In this study, we propose a solution to these challenges using a Space Time Correspondence Network (STCN). STCN is a memory network that performs binary segmentation and minimizes the effects of class imbalance. The use of a memory bank in STCN allows for the utilization of past image and segmentation information, thereby ensuring consistency of the masks. Our experiments using the publicly available JIGSAWS dataset demonstrate that STCN achieves superior segmentation performance for objects that are difficult to segment, such as needle and thread, and improves context inference compared to the state-of-the-art. We also demonstrate that segmentation and context inference can be performed at runtime without compromising performance.
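The memory-bank readout that gives a memory network its temporal consistency can be sketched as attention of the query frame over keys and values stored from past frames. The snippet below uses toy shapes and is a generic readout, not the STCN implementation:

```python
import torch
import torch.nn.functional as F

def memory_readout(query_key, mem_keys, mem_values):
    """query_key: (C, H*W); mem_keys: (C, T*H*W); mem_values: (Cv, T*H*W)."""
    affinity = mem_keys.t() @ query_key                 # (T*H*W, H*W) similarity of memory to query pixels
    weights = F.softmax(affinity, dim=0)                # soft correspondence to memory locations
    return mem_values @ weights                         # (Cv, H*W) value read out for the query frame

C, Cv, H, W, T = 64, 32, 24, 24, 4
query_key = torch.randn(C, H * W)
mem_keys = torch.randn(C, T * H * W)                    # keys encoded from past frames
mem_values = torch.randn(Cv, T * H * W)                 # mask features stored from past frames

readout = memory_readout(query_key, mem_keys, mem_values)
print(readout.shape)                                    # torch.Size([32, 576])
```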

On Offline Evaluation of 3D Object Detection for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.12779
  • repo_url: None
  • paper_authors: Tim Schreier, Katrin Renz, Andreas Geiger, Kashyap Chitta
  • for: Evaluating how indicative offline 3D object detection metrics are of downstream autonomous driving performance.
  • methods: 16 object detection models are integrated into a full self-driving stack, and extensive urban driving experiments in the CARLA simulator measure how predictive different detection metrics are of driving performance (a toy version of this analysis is sketched after this entry).
  • results: The nuScenes Detection Score correlates more strongly with driving performance than the widely used average precision metric, and the results call for caution regarding exclusive reliance on the emerging class of `planner-centric' metrics.
    Abstract Prior work in 3D object detection evaluates models using offline metrics like average precision since closed-loop online evaluation on the downstream driving task is costly. However, it is unclear how indicative offline results are of driving performance. In this work, we perform the first empirical evaluation measuring how predictive different detection metrics are of driving performance when detectors are integrated into a full self-driving stack. We conduct extensive experiments on urban driving in the CARLA simulator using 16 object detection models. We find that the nuScenes Detection Score has a higher correlation to driving performance than the widely used average precision metric. In addition, our results call for caution on the exclusive reliance on the emerging class of `planner-centric' metrics.
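The kind of analysis the paper performs can be mimicked on synthetic numbers: given an offline metric value and a closed-loop driving score for each detector, compute their correlation (the data below is fabricated purely for illustration):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
detection_metric = rng.uniform(0.3, 0.8, size=16)                  # e.g. AP or NDS for 16 detectors
driving_score = 0.9 * detection_metric + rng.normal(0, 0.05, 16)   # hypothetical closed-loop driving score

r, _ = pearsonr(detection_metric, driving_score)
rho, _ = spearmanr(detection_metric, driving_score)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```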

LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition

  • paper_url: http://arxiv.org/abs/2308.12774
  • repo_url: None
  • paper_authors: Changxu Cheng, Peng Wang, Cheng Da, Qi Zheng, Cong Yao
  • for: Improving scene text recognition (STR) on long text and enabling length extrapolation.
  • methods: A Length-Insensitive Scene TExt Recognizer (LISTER) consisting of a Neighbor Decoder and a Feature Enhancement Module. The Neighbor Decoder obtains accurate character attention maps with the assistance of a novel neighbor matrix, regardless of text length, while the Feature Enhancement Module models long-range dependency at low computational cost and can iterate with the Neighbor Decoder to enhance the feature map progressively.
  • results: Experiments show that LISTER exhibits a clear advantage on long text recognition and length extrapolation, while comparing favourably with previous state-of-the-art methods on standard STR benchmarks (mainly short text).
    Abstract The diversity in length constitutes a significant characteristic of text. Due to the long-tail distribution of text lengths, most existing methods for scene text recognition (STR) only work well on short or seen-length text, lacking the capability of recognizing longer text or performing length extrapolation. This is a crucial issue, since the lengths of the text to be recognized are usually not given in advance in real-world applications, but it has not been adequately investigated in previous works. Therefore, we propose in this paper a method called Length-Insensitive Scene TExt Recognizer (LISTER), which remedies the limitation regarding the robustness to various text lengths. Specifically, a Neighbor Decoder is proposed to obtain accurate character attention maps with the assistance of a novel neighbor matrix regardless of the text lengths. Besides, a Feature Enhancement Module is devised to model the long-range dependency with low computation cost, which is able to perform iterations with the neighbor decoder to enhance the feature map progressively. To the best of our knowledge, we are the first to achieve effective length-insensitive scene text recognition. Extensive experiments demonstrate that the proposed LISTER algorithm exhibits obvious superiority on long text recognition and the ability for length extrapolation, while comparing favourably with the previous state-of-the-art methods on standard benchmarks for STR (mainly short text).

IP-UNet: Intensity Projection UNet Architecture for 3D Medical Volume Segmentation

  • paper_url: http://arxiv.org/abs/2308.12761
  • repo_url: None
  • paper_authors: Nyothiri Aung, Tahar Kechadi, Liming Chen, Sahraoui Dhelim
  • for: Automatic breast calcification detection.
  • methods: The IP-UNet model, which performs multi-class segmentation on Intensity Projections (IP) of 3D volumetric data and trains within limited memory without losing the original 3D image resolution.
  • results: IP-UNet achieves segmentation accuracy similar to 3D-UNet but with much better performance, reducing training time by 70% and memory consumption by 92%.
    Abstract CNNs have been widely applied for medical image analysis. However, limited memory capacity is one of the most common drawbacks of processing high-resolution 3D volumetric data. 3D volumes are usually cropped or downsized first before processing, which can result in a loss of resolution, increase class imbalance, and affect the performance of the segmentation algorithms. In this paper, we propose an end-to-end deep learning approach called IP-UNet. IP-UNet is a UNet-based model that performs multi-class segmentation on Intensity Projection (IP) of 3D volumetric data instead of the memory-consuming 3D volumes. IP-UNet uses limited memory capability for training without losing the original 3D image resolution. We compare the performance of three models in terms of segmentation accuracy and computational cost: 1) Slice-by-slice 2D segmentation of the CT scan images using a conventional 2D UNet model. 2) IP-UNet that operates on data obtained by merging the extracted Maximum Intensity Projection (MIP), Closest Vessel Projection (CVP), and Average Intensity Projection (AvgIP) representations of the source 3D volumes, then applying the UNet model on the output IP images. 3) 3D-UNet model directly reads the 3D volumes constructed from a series of CT scan images and outputs the 3D volume of the predicted segmentation. We test the performance of these methods on 3D volumetric images for automatic breast calcification detection. Experimental results show that IP-Unet can achieve similar segmentation accuracy with 3D-Unet but with much better performance. It reduces the training time by 70\% and memory consumption by 92\%.
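The intensity projections IP-UNet operates on can be sketched in a few lines; the snippet below builds MIP and AvgIP channels from a toy volume (the Closest Vessel Projection used in the paper is omitted, and the volume here is random data):

```python
import numpy as np

volume = np.random.rand(128, 256, 256)        # toy CT volume: (slices, H, W)

mip = volume.max(axis=0)                      # Maximum Intensity Projection, (H, W)
avg_ip = volume.mean(axis=0)                  # Average Intensity Projection, (H, W)

# Stack projections as channels of a single 2D input for a 2D segmentation model.
ip_input = np.stack([mip, avg_ip], axis=0)    # (2, H, W); the paper also uses a CVP channel
print(ip_input.shape)
```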

PartSeg: Few-shot Part Segmentation via Part-aware Prompt Learning

  • paper_url: http://arxiv.org/abs/2308.12757
  • repo_url: None
  • paper_authors: Mengya Han, Heliang Zheng, Chaoyue Wang, Yong Luo, Han Hu, Jing Zhang, Yonggang Wen
  • for: Few-shot part segmentation, i.e., segmenting the different parts of an unseen object using very few labeled examples.
  • methods: A multimodal learning method named PartSeg. A part-aware prompt learning approach generates part-specific prompts that enable the CLIP model to better understand the concept of "part" and fully utilize its textual space, and relationships between the same part across different object categories are established during prompt learning (a plain zero-shot variant of part-prompt scoring is sketched after this entry).
  • results: Extensive experiments on the PartImageNet and Pascal_Part datasets show that the proposed method achieves state-of-the-art performance.
    Abstract In this work, we address the task of few-shot part segmentation, which aims to segment the different parts of an unseen object using very few labeled examples. It is found that leveraging the textual space of a powerful pre-trained image-language model (such as CLIP) can be beneficial in learning visual features. Therefore, we develop a novel method termed PartSeg for few-shot part segmentation based on multimodal learning. Specifically, we design a part-aware prompt learning method to generate part-specific prompts that enable the CLIP model to better understand the concept of ``part'' and fully utilize its textual space. Furthermore, since the concept of the same part under different object categories is general, we establish relationships between these parts during the prompt learning process. We conduct extensive experiments on the PartImageNet and Pascal$\_$Part datasets, and the experimental results demonstrated that our proposed method achieves state-of-the-art performance.
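For context, scoring an image against part prompts with an off-the-shelf CLIP model looks as follows. This is a plain zero-shot sketch with hand-written prompts and a hypothetical image path ("example.jpg"), not PartSeg's learned part-aware prompts:

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

part_prompts = ["a photo of the head of an animal",
                "a photo of the leg of an animal",
                "a photo of the tail of an animal"]
text = clip.tokenize(part_prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.t()).softmax(dim=-1)              # similarity to each part prompt

print(scores)
```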

Learning Heavily-Degraded Prior for Underwater Object Detection

  • paper_url: http://arxiv.org/abs/2308.12738
  • repo_url: https://github.com/xiaodetection/learning-heavily-degraed-prior
  • paper_authors: Chenping Fu, Xin Fan, Jiewen Xiao, Wanqi Yuan, Risheng Liu, Zhongxuan Luo
  • for: Addressing the degraded image quality in underwater object detection by exploiting the feature-distribution gap between heavily degraded regions of detector-friendly and underwater images.
  • methods: Based on statistical observations of degraded images, a residual feature transference module (RFTM) learns a mapping between deep representations of the heavily degraded patches of detector-friendly (DFUI) and underwater images; this mapping serves as a heavily degraded prior (HDP) that can be plugged into CNN-based feature extraction networks for underwater detection.
  • results: Evaluations on URPC2020 and UODD show that the method outperforms CNN-based detectors by a large margin, and with higher speed and fewer parameters it still performs better than transformer-based detectors.
    Abstract Underwater object detection suffers from low detection performance because the distance and wavelength dependent imaging process yield evident image quality degradations such as haze-like effects, low visibility, and color distortions. Therefore, we commit to resolving the issue of underwater object detection with compounded environmental degradations. Typical approaches attempt to develop sophisticated deep architecture to generate high-quality images or features. However, these methods are only work for limited ranges because imaging factors are either unstable, too sensitive, or compounded. Unlike these approaches catering for high-quality images or features, this paper seeks transferable prior knowledge from detector-friendly images. The prior guides detectors removing degradations that interfere with detection. It is based on statistical observations that, the heavily degraded regions of detector-friendly (DFUI) and underwater images have evident feature distribution gaps while the lightly degraded regions of them overlap each other. Therefore, we propose a residual feature transference module (RFTM) to learn a mapping between deep representations of the heavily degraded patches of DFUI- and underwater- images, and make the mapping as a heavily degraded prior (HDP) for underwater detection. Since the statistical properties are independent to image content, HDP can be learned without the supervision of semantic labels and plugged into popular CNNbased feature extraction networks to improve their performance on underwater object detection. Without bells and whistles, evaluations on URPC2020 and UODD show that our methods outperform CNN-based detectors by a large margin. Our method with higher speeds and less parameters still performs better than transformer-based detectors. Our code and DFUI dataset can be found in https://github.com/xiaoDetection/Learning-Heavily-Degraed-Prior.

FastSurfer-HypVINN: Automated sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain MRI

  • paper_url: http://arxiv.org/abs/2308.12736
  • repo_url: None
  • paper_authors: Santiago Estrada, David Kügler, Emad Bahrami, Peng Xu, Dilshad Mousa, Monique M. B. Breteler, N. Ahmad Aziz, Martin Reuter
  • for: Providing a fully automated method for sub-segmenting the hypothalamus and adjacent structures, enabling scalable and reproducible studies of hypothalamic structure and function.
  • methods: A fast, fully automated deep learning method (HypVINN) that operates on 0.8 mm isotropic T1w and T2w MR images and is robust to missing modalities.
  • results: The method achieves high segmentation accuracy and test-retest reliability, and its generalizability is validated on multiple datasets.
    Abstract The hypothalamus plays a crucial role in the regulation of a broad range of physiological, behavioural, and cognitive functions. However, despite its importance, only a few small-scale neuroimaging studies have investigated its substructures, likely due to the lack of fully automated segmentation tools to address scalability and reproducibility issues of manual segmentation. While the only previous attempt to automatically sub-segment the hypothalamus with a neural network showed promise for 1.0 mm isotropic T1-weighted (T1w) MRI, there is a need for an automated tool to sub-segment also high-resolutional (HiRes) MR scans, as they are becoming widely available, and include structural detail also from multi-modal MRI. We, therefore, introduce a novel, fast, and fully automated deep learning method named HypVINN for sub-segmentation of the hypothalamus and adjacent structures on 0.8 mm isotropic T1w and T2w brain MR images that is robust to missing modalities. We extensively validate our model with respect to segmentation accuracy, generalizability, in-session test-retest reliability, and sensitivity to replicate hypothalamic volume effects (e.g. sex-differences). The proposed method exhibits high segmentation performance both for standalone T1w images as well as for T1w/T2w image pairs. Even with the additional capability to accept flexible inputs, our model matches or exceeds the performance of state-of-the-art methods with fixed inputs. We, further, demonstrate the generalizability of our method in experiments with 1.0 mm MR scans from both the Rhineland Study and the UK Biobank. Finally, HypVINN can perform the segmentation in less than a minute (GPU) and will be available in the open source FastSurfer neuroimaging software suite, offering a validated, efficient, and scalable solution for evaluating imaging-derived phenotypes of the hypothalamus.

Ground-to-Aerial Person Search: Benchmark Dataset and Approach

  • paper_url: http://arxiv.org/abs/2308.12712
  • repo_url: https://github.com/yqc123456/hkd_for_person_search
  • paper_authors: Shizhou Zhang, Qingchun Yang, De Cheng, Yinghui Xing, Guoqiang Liang, Peng Wang, Yanning Zhang
  • for: Constructing a large-scale Ground-to-Aerial person search dataset (G2APS) to support cross-platform intelligent surveillance applications.
  • methods: Detailed analysis of current two-step and end-to-end person search methods, plus a simple yet effective knowledge distillation scheme on the head of the ReID network to improve person search performance (a generic distillation loss is sketched after this entry).
  • results: Analysis on the G2APS dataset and two public person search datasets, with the proposed knowledge distillation approach achieving state-of-the-art performance.
    Abstract In this work, we construct a large-scale dataset for Ground-to-Aerial Person Search, named G2APS, which contains 31,770 images of 260,559 annotated bounding boxes for 2,644 identities appearing in both of the UAVs and ground surveillance cameras. To our knowledge, this is the first dataset for cross-platform intelligent surveillance applications, where the UAVs could work as a powerful complement for the ground surveillance cameras. To more realistically simulate the actual cross-platform Ground-to-Aerial surveillance scenarios, the surveillance cameras are fixed about 2 meters above the ground, while the UAVs capture videos of persons at different location, with a variety of view-angles, flight attitudes and flight modes. Therefore, the dataset has the following unique characteristics: 1) drastic view-angle changes between query and gallery person images from cross-platform cameras; 2) diverse resolutions, poses and views of the person images under 9 rich real-world scenarios. On basis of the G2APS benchmark dataset, we demonstrate detailed analysis about current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performances on both of the G2APS and the previous two public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code available on \url{https://github.com/yqc123456/HKD_for_person_search}.
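The paper applies knowledge distillation on the ReID head; a generic soft-target distillation loss, with an assumed temperature and illustrative tensor shapes, looks like this:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-target KL divergence between teacher and student predictions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

student_logits = torch.randn(8, 256)   # e.g. ReID-head outputs for 8 query persons
teacher_logits = torch.randn(8, 256)
print(distillation_loss(student_logits, teacher_logits))
```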

A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions

  • paper_url: http://arxiv.org/abs/2308.12700
  • repo_url: None
  • paper_authors: Jiawei Lin, Jiaqi Guo, Shizhao Sun, Weijiang Xu, Ting Liu, Jian-Guang Lou, Dongmei Zhang
  • for: Generating graphic layouts from textual descriptions (Text-to-Layout) to lower design barriers.
  • methods: A two-stage parse-then-place approach. The parse stage converts a textual description into an intermediate representation (IR) that makes the implicit constraints in the text explicit; the place stage generates layouts from the IR with a Transformer-based layout generation model. Constraints and layouts are carefully represented as sequences to handle combined and incomplete constraints, and a pretrain-then-finetune strategy on large-scale unlabeled layouts boosts performance (an illustrative IR schema is sketched after this entry).
  • results: Experiments on two constructed Text-to-Layout datasets; quantitative results, qualitative analysis, and user studies demonstrate the effectiveness of the approach.
    Abstract Creating layouts is a fundamental step in graphic design. In this work, we propose to use text as the guidance to create graphic layouts, i.e., Text-to-Layout, aiming to lower the design barriers. Text-to-Layout is a challenging task, because it needs to consider the implicit, combined, and incomplete layout constraints from text, each of which has not been studied in previous work. To address this, we present a two-stage approach, named parse-then-place. The approach introduces an intermediate representation (IR) between text and layout to represent diverse layout constraints. With IR, Text-to-Layout is decomposed into a parse stage and a place stage. The parse stage takes a textual description as input and generates an IR, in which the implicit constraints from the text are transformed into explicit ones. The place stage generates layouts based on the IR. To model combined and incomplete constraints, we use a Transformer-based layout generation model and carefully design a way to represent constraints and layouts as sequences. Besides, we adopt the pretrain-then-finetune strategy to boost the performance of the layout generation model with large-scale unlabeled layouts. To evaluate our approach, we construct two Text-to-Layout datasets and conduct experiments on them. Quantitative results, qualitative analysis, and user studies demonstrate the effectiveness of our approach.
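One way to picture an intermediate representation between text and layout is as explicit element and relation constraints. The schema below is an illustrative assumption (field names and constraint vocabulary are invented for this sketch), not the paper's actual IR:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ElementConstraint:
    element_type: str                       # e.g. "title", "image", "button"
    size: Optional[str] = None              # e.g. "large"
    position: Optional[str] = None          # e.g. "top", "bottom-right"

@dataclass
class RelationConstraint:
    relation: str                           # e.g. "above", "left-of"
    subject: int                            # indices into the element list
    obj: int

@dataclass
class LayoutIR:
    elements: List[ElementConstraint] = field(default_factory=list)
    relations: List[RelationConstraint] = field(default_factory=list)

# "A large title above an image" -> explicit constraints the place stage can consume.
ir = LayoutIR(
    elements=[ElementConstraint("title", size="large"), ElementConstraint("image")],
    relations=[RelationConstraint("above", subject=0, obj=1)],
)
print(ir)
```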

A Continual Learning Approach for Cross-Domain White Blood Cell Classification

  • paper_url: http://arxiv.org/abs/2308.12679
  • repo_url: None
  • paper_authors: Ario Sadafi, Raheleh Salehi, Armin Gruber, Sayedali Shetab Boushehri, Pascal Giehr, Nassir Navab, Carsten Marr
  • for: Maintaining accurate white blood cell classification across evolving clinical settings, data sources, and disease classifications, which is essential for diagnosing hematological diseases.
  • methods: A rehearsal-based continual learning approach for class-incremental and domain-incremental scenarios that learns sequentially from incoming data streams without forgetting previously acquired knowledge. Exemplar sets from previous tasks are selected based on the model's predictions, combining the most confident samples with the most challenging samples identified through uncertainty estimation (see the sketch after this entry).
  • results: Evaluated on three white blood cell classification datasets differing in color, resolution, and class composition, including a long class-incremental experiment with both new domains and new classes, the approach outperforms established continual learning baselines such as iCaRL and EWC.
    Abstract Accurate classification of white blood cells in peripheral blood is essential for diagnosing hematological diseases. Due to constantly evolving clinical settings, data sources, and disease classifications, it is necessary to update machine learning classification models regularly for practical real-world use. Such models significantly benefit from sequentially learning from incoming data streams without forgetting previously acquired knowledge. However, models can suffer from catastrophic forgetting, causing a drop in performance on previous tasks when fine-tuned on new data. Here, we propose a rehearsal-based continual learning approach for class incremental and domain incremental scenarios in white blood cell classification. To choose representative samples from previous tasks, we employ exemplar set selection based on the model's predictions. This involves selecting the most confident samples and the most challenging samples identified through uncertainty estimation of the model. We thoroughly evaluated our proposed approach on three white blood cell classification datasets that differ in color, resolution, and class composition, including scenarios where new domains or new classes are introduced to the model with every task. We also test a long class incremental experiment with both new domains and new classes. Our results demonstrate that our approach outperforms established baselines in continual learning, including existing iCaRL and EWC methods for classifying white blood cells in cross-domain environments.
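The exemplar-set selection can be sketched as keeping both the most confident and the most uncertain samples of a previous task, with softmax entropy as an assumed uncertainty measure and arbitrary set sizes:

```python
import torch
import torch.nn.functional as F

def select_exemplars(logits, n_confident=10, n_uncertain=10):
    """logits: (N, num_classes) model outputs on a previous task's data."""
    probs = F.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values                       # top-class probability
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # predictive uncertainty
    confident_idx = confidence.topk(n_confident).indices        # most confident samples
    uncertain_idx = entropy.topk(n_uncertain).indices           # most challenging samples
    return torch.cat([confident_idx, uncertain_idx]).unique()

logits = torch.randn(500, 15)                                   # e.g. 15 white blood cell classes
exemplar_indices = select_exemplars(logits)
print(exemplar_indices.shape)
```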

A Study of Age and Sex Bias in Multiple Instance Learning based Classification of Acute Myeloid Leukemia Subtypes

  • paper_url: http://arxiv.org/abs/2308.12675
  • repo_url: None
  • paper_authors: Ario Sadafi, Matthias Hehr, Nassir Navab, Carsten Marr
  • for: Investigating whether age and sex bias affect the accuracy of Acute Myeloid Leukemia (AML) subtype classification, with the aim of improving clinical decision-making and patient care.
  • methods: Multiple Instance Learning (MIL) architectures: multiple MIL models are trained with different levels of sex imbalance in the training set and with certain age groups excluded, and their performance is evaluated on male and female test sets and on underrepresented age groups.
  • results: Sex and age bias significantly affect AML subtype classification. Female patients are more affected by sex-imbalanced training data, and certain age groups, such as patients aged 72 to 86 with the RUNX1::RUNX1T1 genetic subtype, are significantly affected by age bias in the training data. Ensuring inclusivity in the training data is therefore essential for reliable and equitable AML genetic subtype classification, ultimately benefiting diverse patient populations.
    Abstract Accurate classification of Acute Myeloid Leukemia (AML) subtypes is crucial for clinical decision-making and patient care. In this study, we investigate the potential presence of age and sex bias in AML subtype classification using Multiple Instance Learning (MIL) architectures. To that end, we train multiple MIL models using different levels of sex imbalance in the training set and excluding certain age groups. To assess the sex bias, we evaluate the performance of the models on male and female test sets. For age bias, models are tested against underrepresented age groups in the training data. We find a significant effect of sex and age bias on the performance of the model for AML subtype classification. Specifically, we observe that females are more likely to be affected by sex imbalance dataset and certain age groups, such as patients with 72 to 86 years of age with the RUNX1::RUNX1T1 genetic subtype, are significantly affected by an age bias present in the training data. Ensuring inclusivity in the training data is thus essential for generating reliable and equitable outcomes in AML genetic subtype classification, ultimately benefiting diverse patient populations.

Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition

  • paper_url: http://arxiv.org/abs/2308.12673
  • repo_url: None
  • paper_authors: Dimitrios Daskalakis, Nikolaos Gkalelis, Vasileios Mezaris
  • for: Proposing an unsupervised pre-training method that improves the starting point and overall accuracy of video event recognition models.
  • methods: Masked Feature Modelling (MFM) uses a pretrained visual tokenizer to reconstruct masked features of objects within a video, leveraging the MiniKinetics dataset; the pre-trained Graph Attention Network (GAT) block is then incorporated into the state-of-the-art bottom-up video event recognition architecture ViGAT.
  • results: Experimental evaluations on the YLI-MED dataset demonstrate that MFM improves event recognition performance.
    Abstract In this paper, we introduce Masked Feature Modelling (MFM), a novel approach for the unsupervised pre-training of a Graph Attention Network (GAT) block. MFM utilizes a pretrained Visual Tokenizer to reconstruct masked features of objects within a video, leveraging the MiniKinetics dataset. We then incorporate the pre-trained GAT block into a state-of-the-art bottom-up supervised video-event recognition architecture, ViGAT, to improve the model's starting point and overall accuracy. Experimental evaluations on the YLI-MED dataset demonstrate the effectiveness of MFM in improving event recognition performance.

An All Deep System for Badminton Game Analysis

  • paper_url: http://arxiv.org/abs/2308.12645
  • repo_url: None
  • paper_authors: Po-Yung Chou, Yu-Chun Lo, Bo-Zheng Xie, Cheng-Hung Lin, Yu-Yung Kao
  • for: automatic detection of events within badminton match videos, especially the shuttlecock
  • methods: modified TrackNet model and diverse data types to improve precision
  • results: score of 0.78 out of 1.0 in the challenge
    Abstract The CoachAI Badminton 2023 Track1 initiative aim to automatically detect events within badminton match videos. Detecting small objects, especially the shuttlecock, is of quite importance and demands high precision within the challenge. Such detection is crucial for tasks like hit count, hitting time, and hitting location. However, even after revising the well-regarded shuttlecock detecting model, TrackNet, our object detection models still fall short of the desired accuracy. To address this issue, we've implemented various deep learning methods to tackle the problems arising from noisy detectied data, leveraging diverse data types to improve precision. In this report, we detail the detection model modifications we've made and our approach to the 11 tasks. Notably, our system garnered a score of 0.78 out of 1.0 in the challenge.

Tag-Based Annotation for Avatar Face Creation

  • paper_url: http://arxiv.org/abs/2308.12642
  • repo_url: None
  • paper_authors: An Ngo, Daniel Phelps, Derrick Lai, Thanyared Wong, Lucas Mathias, Anish Shivamurthy, Mustafa Ajmal, Minghao Liu, James Davis
  • for: Automatically generating digital avatar faces from human images.
  • methods: Tag-based annotation is used to train a model that produces avatars from human images; tags are designed for three different facial features offered by Bitmoji, and a model is trained with tag-based annotation to predict the nose.
  • results: Tag-based annotation provides better annotator agreement, leading to less noisy data and higher-quality model predictions.
    Abstract Currently, digital avatars can be created manually using human images as reference. Systems such as Bitmoji are excellent producers of detailed avatar designs, with hundreds of choices for customization. A supervised learning model could be trained to generate avatars automatically, but the hundreds of possible options create difficulty in securing non-noisy data to train a model. As a solution, we train a model to produce avatars from human images using tag-based annotations. This method provides better annotator agreement, leading to less noisy data and higher quality model predictions. Our contribution is an application of tag-based annotation to train a model for avatar face creation. We design tags for 3 different facial facial features offered by Bitmoji, and train a model using tag-based annotation to predict the nose.

Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2308.12609
  • repo_url: None
  • paper_authors: Songchun Zhang, Chunhui Zhao
  • for: Improving the accuracy and efficiency of weakly supervised temporal action localization (WSTAL), which localizes actions in untrimmed videos using only video-level labels.
  • methods: An end-to-end framework comprising a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module, which explore and exploit the similarity and consistency of cross-video action features to learn a more structured and compact embedding space and to summarize and propagate representative action knowledge, reducing ambiguity in both classification learning and temporal localization.
  • results: Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction show that the method outperforms state-of-the-art approaches and can be easily plugged into other WSTAL methods.
    Abstract Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g. appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic action patterns understanding, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.

HR-Pro: Point-supervised Temporal Action Localization via Hierarchical Reliability Propagation

  • paper_url: http://arxiv.org/abs/2308.12608
  • repo_url: https://github.com/pipixin321/hr-pro
  • paper_authors: Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, Nong Sang
  • for: Point-supervised temporal action localization via hierarchical reliability propagation, improving the performance of label-efficient learning.
  • methods: Two reliability-aware stages, snippet-level discrimination learning and instance-level completeness learning, both of which propagate high-confidence cues from point annotations. An online-updated memory stores reliable snippet prototypes for each class, a reliability-aware attention block captures intra- and inter-video dependencies of snippets, and a point-based proposal generation approach connects snippets and instances.
  • results: Multi-level reliability-aware learning yields more reliable confidence scores and more accurate temporal boundaries of predicted proposals, achieving state-of-the-art performance on multiple challenging benchmarks, including an average mAP of 60.3% on THUMOS14.
    Abstract Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet-level or the instance-level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages: Snippet-level Discrimination Learning and Instance-level Completeness Learning, both stages explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class. We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representation. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries of predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully supervised methods. Code will be available at https://github.com/pipixin321/HR-Pro.

PoseSync: Robust pose based video synchronization

  • paper_url: http://arxiv.org/abs/2308.12600
  • repo_url: None
  • paper_authors: Rishit Javia, Falak Shah, Shivam Dave
  • for: This paper proposes an end-to-end pipeline for synchronizing videos based on pose.
  • methods: The pipeline first crops the person region in each image, then runs a pose detector on the cropped images, and finally applies Dynamic Time Warping (DTW) to angle/distance measures between pose keypoints, yielding a scale- and shift-invariant pose matching pipeline (see the sketch below).
  • results: The pipeline enables comparison and evaluation of human motion in domains such as gameplay performance evaluation, choreography, or guiding athletes.
    Abstract Pose based video sychronization can have applications in multiple domains such as gameplay performance evaluation, choreography or guiding athletes. The subject's actions could be compared and evaluated against those performed by professionals side by side. In this paper, we propose an end to end pipeline for synchronizing videos based on pose. The first step crops the region where the person present in the image followed by pose detection on the cropped image. This is followed by application of Dynamic Time Warping(DTW) on angle/ distance measures between the pose keypoints leading to a scale and shift invariant pose matching pipeline.
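A minimal sketch of the DTW matching step referenced above, assuming each video has already been reduced to a one-dimensional sequence of joint angles (e.g., one elbow angle per frame); the synthetic sequences and the scalar distance are illustrative placeholders, not the paper's exact angle/distance measures.

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Classic O(n*m) dynamic time warping between two scalar angle sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])      # local angle difference
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# The reference motion is sampled at 60 frames and the query at 45 frames,
# yet DTW still aligns the two sequences with a small distance.
reference = np.sin(np.linspace(0, 2 * np.pi, 60))     # pretend elbow angles (radians)
query = np.sin(np.linspace(0, 2 * np.pi, 45))
print(f"DTW distance: {dtw_distance(reference, query):.3f}")
```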

Logic-induced Diagnostic Reasoning for Semi-supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.12595
  • repo_url: None
  • paper_authors: Chen Liang, Wenguan Wang, Jiaxu Miao, Yi Yang
  • for: Improve semi-supervised semantic segmentation, where existing pseudo-labeling approaches compensate for limited labeled data but ignore the relational knowledge among semantic concepts.
  • methods: Proposes LogicDiag, a neural-logic semi-supervised learning framework that treats conflicts within pseudo labels, identified through symbolic knowledge, as learning signals and resolves them via logic-induced diagnoses, recovering erroneous pseudo labels and alleviating error accumulation.
  • results: Extensive experiments on three standard semi-supervised semantic segmentation benchmarks demonstrate the effectiveness and generality of LogicDiag, and highlight the promise of systematically integrating symbolic reasoning into prevailing statistical, neural learning approaches.
    Abstract Recent advances in semi-supervised semantic segmentation have been heavily reliant on pseudo labeling to compensate for limited labeled data, disregarding the valuable relational knowledge among semantic concepts. To bridge this gap, we devise LogicDiag, a brand new neural-logic semi-supervised learning framework. Our key insight is that conflicts within pseudo labels, identified through symbolic knowledge, can serve as strong yet commonly ignored learning signals. LogicDiag resolves such conflicts via reasoning with logic-induced diagnoses, enabling the recovery of (potentially) erroneous pseudo labels, ultimately alleviating the notorious error accumulation problem. We showcase the practical application of LogicDiag in the data-hungry segmentation scenario, where we formalize the structured abstraction of semantic concepts as a set of logic rules. Extensive experiments on three standard semi-supervised semantic segmentation benchmarks demonstrate the effectiveness and generality of LogicDiag. Moreover, LogicDiag highlights the promising opportunities arising from the systematic integration of symbolic reasoning into the prevalent statistical, neural learning approaches.

Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects

  • paper_url: http://arxiv.org/abs/2308.12590
  • repo_url: None
  • paper_authors: Baowen Zhang, Jiahe Li, Xiaoming Deng, Yinda Zhang, Cuixia Ma, Hongan Wang
  • for: Learning 3D shape representation with dense correspondence for deformable objects.
  • methods: Propose a novel self-supervised approach to learn neural implicit shape representation, which does not require prior knowledge of skeleton and skinning weight.
  • results: Experimental results show that the method can represent shapes with large deformations and support applications such as texture transfer and shape editing with competitive performance.
    Abstract Learning 3D shape representation with dense correspondence for deformable objects is a fundamental problem in computer vision. Existing approaches often need additional annotations of specific semantic domain, e.g., skeleton poses for human bodies or animals, which require extra annotation effort and suffer from error accumulation, and they are limited to specific domain. In this paper, we propose a novel self-supervised approach to learn neural implicit shape representation for deformable objects, which can represent shapes with a template shape and dense correspondence in 3D. Our method does not require the priors of skeleton and skinning weight, and only requires a collection of shapes represented in signed distance fields. To handle the large deformation, we constrain the learned template shape in the same latent space with the training shapes, design a new formulation of local rigid constraint that enforces rigid transformation in local region and addresses local reflection issue, and present a new hierarchical rigid constraint to reduce the ambiguity due to the joint learning of template shape and correspondences. Extensive experiments show that our model can represent shapes with large deformations. We also show that our shape representation can support two typical applications, such as texture transfer and shape editing, with competitive performance. The code and models are available at https://iscas3dv.github.io/deformshape

Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2308.12587
  • repo_url: https://github.com/csir1996/vln-gela
  • paper_authors: Yibo Cui, Liang Xie, Yakun Zhang, Meishan Zhang, Ye Yan, Erwei Yin
  • for: This work addresses fine-grained cross-modal alignment in Vision-and-Language Navigation (VLN).
  • methods: We propose a Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm that introduces grounded entity-landmark annotations (GEL-R2R) into the Room-to-Room dataset and adopts three entity-landmark adaptive pre-training objectives to explicitly supervise fine-grained cross-modal alignment between entity phrases and environment landmarks.
  • results: The GELA model achieves state-of-the-art results on two downstream tasks, R2R and CVDN, demonstrating its effectiveness and generalizability.
    Abstract Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or single sub-instruction to the corresponding trajectory. However, another critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To achieve the adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, named GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and dialogue instructions (CVDN). The comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.

LORD: Leveraging Open-Set Recognition with Unknown Data

  • paper_url: http://arxiv.org/abs/2308.12584
  • repo_url: None
  • paper_authors: Tobias Koch, Christian Riess, Thomas Köhler
  • for: This paper studies how to handle entirely unknown data so that deployed classifiers cope better with out-of-distribution inputs (open-set recognition).
  • methods: The LORD framework explicitly models open space during classifier training via model-agnostic strategies that exploit background data, and provides a systematic evaluation protocol for such approaches (see the mixup sketch below).
  • results: Testing across multiple evaluation protocols shows consistently improved recognition of unknown data, and using mixup as an off-the-shelf data generation technique reduces the dependency on large and costly background datasets.
    Abstract Handling entirely unknown data is a challenge for any deployed classifier. Classification models are typically trained on a static pre-defined dataset and are kept in the dark for the open unassigned feature space. As a result, they struggle to deal with out-of-distribution data during inference. Addressing this task on the class-level is termed open-set recognition (OSR). However, most OSR methods are inherently limited, as they train closed-set classifiers and only adapt the downstream predictions to OSR. This work presents LORD, a framework to Leverage Open-set Recognition by exploiting unknown Data. LORD explicitly models open space during classifier training and provides a systematic evaluation for such approaches. We identify three model-agnostic training strategies that exploit background data and applied them to well-established classifiers. Due to LORD's extensive evaluation protocol, we consistently demonstrate improved recognition of unknown data. The benchmarks facilitate in-depth analysis across various requirement levels. To mitigate dependency on extensive and costly background datasets, we explore mixup as an off-the-shelf data generation technique. Our experiments highlight mixup's effectiveness as a substitute for background datasets. Lightweight constraints on mixup synthesis further improve OSR performance.
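A minimal sketch of using mixup as an off-the-shelf substitute for background data, as discussed above: convex combinations of known-class images serve as surrogate open-space samples during training. The batch shapes, the Beta parameter, and how the pseudo-unknowns are labeled are assumptions for illustration.

```python
import torch

def mixup_unknowns(images: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Blend each image with a randomly paired one to synthesize open-space samples."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()   # mixing coefficient
    perm = torch.randperm(images.size(0))                          # random pairing
    return lam * images + (1.0 - lam) * images[perm]

# Usage: a batch of known-class images yields an equally sized batch of pseudo-unknowns,
# which could then be assigned to a dedicated "unknown" class or a uniform target.
batch = torch.rand(16, 3, 32, 32)
pseudo_unknowns = mixup_unknowns(batch, alpha=0.4)
print(pseudo_unknowns.shape)  # torch.Size([16, 3, 32, 32])
```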

StreamMapNet: Streaming Mapping Network for Vectorized Online HD Map Construction

  • paper_url: http://arxiv.org/abs/2308.12570
  • repo_url: None
  • paper_authors: Tianyuan Yuan, Yicheng Liu, Yue Wang, Yilun Wang, Hang Zhao
  • for: High-definition (HD) maps are a key component of autonomous driving systems; StreamMapNet provides a new online mapping pipeline that exploits long-sequence temporal information to improve stability and performance.
  • methods: StreamMapNet uses multi-point attention and temporal information to construct large-range local HD maps with high stability and to handle complex scenarios such as occlusion.
  • results: StreamMapNet outperforms existing methods across all settings while maintaining an online inference speed of 14.2 FPS.
    Abstract High-Definition (HD) maps are essential for the safety of autonomous driving systems. While existing techniques employ camera images and onboard sensors to generate vectorized high-precision maps, they are constrained by their reliance on single-frame input. This approach limits their stability and performance in complex scenarios such as occlusions, largely due to the absence of temporal information. Moreover, their performance diminishes when applied to broader perception ranges. In this paper, we present StreamMapNet, a novel online mapping pipeline adept at long-sequence temporal modeling of videos. StreamMapNet employs multi-point attention and temporal information which empowers the construction of large-range local HD maps with high stability and further addresses the limitations of existing methods. Furthermore, we critically examine widely used online HD Map construction benchmark and datasets, Argoverse2 and nuScenes, revealing significant bias in the existing evaluation protocols. We propose to resplit the benchmarks according to geographical spans, promoting fair and precise evaluations. Experimental results validate that StreamMapNet significantly outperforms existing methods across all settings while maintaining an online inference speed of $14.2$ FPS.

NOVA: NOvel View Augmentation for Neural Composition of Dynamic Objects

  • paper_url: http://arxiv.org/abs/2308.12560
  • repo_url: https://github.com/dakshitagrawal/nova
  • paper_authors: Dakshit Agrawal, Jiajie Xu, Siva Karthik Mustikovela, Ioannis Gkioulekas, Ashish Shrivastava, Yuning Chai
  • for: trains NeRFs for photo-realistic 3D composition of dynamic objects in a static scene
  • methods: uses a novel-view augmentation (NOVA) strategy
  • results: reduces blending artifacts, achieves comparable PSNR without additional ground truth modalities, and provides ease, flexibility, and scalability in neural composition.
    Abstract We propose a novel-view augmentation (NOVA) strategy to train NeRFs for photo-realistic 3D composition of dynamic objects in a static scene. Compared to prior work, our framework significantly reduces blending artifacts when inserting multiple dynamic objects into a 3D scene at novel views and times; achieves comparable PSNR without the need for additional ground truth modalities like optical flow; and overall provides ease, flexibility, and scalability in neural composition. Our codebase is on GitHub.

Hyperbolic Audio-visual Zero-shot Learning

  • paper_url: http://arxiv.org/abs/2308.12558
  • repo_url: None
  • paper_authors: Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, Lars Petersson
  • for: This paper explores hyperbolic geometric learning for audio-visual zero-shot learning, since the data exhibit a large degree of hyperbolicity and curvature-aware learning can better capture complex hierarchical structure.
  • methods: A novel loss function performs cross-modal alignment between video and audio features in hyperbolic space, and multiple adaptive curvatures are explored for the hyperbolic projections (see the sketch below).
  • results: The proposed hyperbolic approach improves the harmonic mean (HM) over the state of the art by roughly 3.0%, 7.0%, and 5.3% on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, respectively.
    Abstract Audio-visual zero-shot learning aims to classify samples consisting of a pair of corresponding audio and video sequences from classes that are not present during training. An analysis of the audio-visual data reveals a large degree of hyperbolicity, indicating the potential benefit of using a hyperbolic transformation to achieve curvature-aware geometric learning, with the aim of exploring more complex hierarchical data structures for this task. The proposed approach employs a novel loss function that incorporates cross-modality alignment between video and audio features in the hyperbolic space. Additionally, we explore the use of multiple adaptive curvatures for hyperbolic projections. The experimental results on this very challenging task demonstrate that our proposed hyperbolic approach for zero-shot learning outperforms the SOTA method on three datasets: VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL achieving a harmonic mean (HM) improvement of around 3.0%, 7.0%, and 5.3%, respectively.
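A minimal sketch of curvature-aware geometric learning of the kind described above: Euclidean features are projected onto the Poincaré ball via the exponential map at the origin, and a geodesic distance can then be used inside a cross-modal alignment loss. The curvature (fixed at 1 here), feature sizes, and the specific loss are illustrative assumptions.

```python
import torch

def expmap0(x: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of a Poincare ball with curvature -c."""
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * x / (sqrt_c * norm)

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance on the unit Poincare ball (curvature -1)."""
    diff2 = (u - v).pow(2).sum(-1)
    denom = (1 - u.pow(2).sum(-1)).clamp_min(eps) * (1 - v.pow(2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * diff2 / denom)

video_feat = expmap0(torch.randn(8, 512))   # hypothetical video embeddings
audio_feat = expmap0(torch.randn(8, 512))   # hypothetical audio embeddings
print(poincare_distance(video_feat, audio_feat).shape)  # torch.Size([8])
```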

Hybrid Models for Facial Emotion Recognition in Children

  • paper_url: http://arxiv.org/abs/2308.12547
  • repo_url: None
  • paper_authors: Rafael Zimmer, Marcos Sobral, Helio Azevedo
  • for: This study uses emotion recognition techniques to assist psychologists in performing children's therapy through remotely operated robot sessions.
  • methods: Embodied Conversational Agents (ECA) serve as an intermediary tool to help professionals connect with children who face social challenges such as ADHD or ASD, or who are physically unavailable due to armed conflict, natural disasters, or other circumstances; emotion recognition provides feedback to the psychotherapist.
  • results: The study first reviews the algorithms and datasets widely used for emotion recognition in children, then uses dense optical flow features (see the sketch below) to improve recognition in uncontrolled environments; the proposed HybridCNNFusion model fuses two intermediary features from a convolutional neural network before a final classifier, and initial results are reported on a dataset of Brazilian children.
    Abstract This paper focuses on the use of emotion recognition techniques to assist psychologists in performing children's therapy through remotely robot operated sessions. In the field of psychology, the use of agent-mediated therapy is growing increasingly given recent advances in robotics and computer science. Specifically, the use of Embodied Conversational Agents (ECA) as an intermediary tool can help professionals connect with children who face social challenges such as Attention Deficit Hyperactivity Disorder (ADHD), Autism Spectrum Disorder (ASD) or even who are physically unavailable due to being in regions of armed conflict, natural disasters, or other circumstances. In this context, emotion recognition represents an important feedback for the psychotherapist. In this article, we initially present the result of a bibliographical research associated with emotion recognition in children. This research revealed an initial overview on algorithms and datasets widely used by the community. Then, based on the analysis carried out on the results of the bibliographical research, we used the technique of dense optical flow features to improve the ability of identifying emotions in children in uncontrolled environments. From the output of a hybrid model of Convolutional Neural Network, two intermediary features are fused before being processed by a final classifier. The proposed architecture was called HybridCNNFusion. Finally, we present the initial results achieved in the recognition of children's emotions using a dataset of Brazilian children.
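A minimal sketch of extracting dense-optical-flow features as mentioned above, using OpenCV's Farneback algorithm on consecutive grayscale frames (assuming the opencv-python package is available); the synthetic frames and the choice of simple flow statistics are illustrative assumptions.

```python
import cv2
import numpy as np

def dense_flow_features(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Return a small motion descriptor (mean/std of flow magnitude and angle) per frame pair."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return np.array([mag.mean(), mag.std(), ang.mean(), ang.std()], dtype=np.float32)

# Usage on two synthetic frames (in practice: consecutive face crops from the video).
prev_frame = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
curr_frame = np.roll(prev_frame, shift=2, axis=1)   # simulate horizontal motion
print(dense_flow_features(prev_frame, curr_frame))
```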

Mutual-Guided Dynamic Network for Image Fusion

  • paper_url: http://arxiv.org/abs/2308.12538
  • repo_url: https://github.com/guanys-dar/mgdn
  • paper_authors: Yuanshen Guan, Ruikang Xu, Mingde Yao, Lizhi Wang, Zhiwei Xiong
  • for: This paper proposes a novel mutual-guided dynamic network (MGDN) for image fusion, which aims to generate high-quality images from multiple inputs captured under varying conditions.
  • methods: The proposed MGDN utilizes a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, which incorporates additional guidance from different inputs and generates spatial-variant kernels for different locations; a parallel feature fusion (PFF) module is introduced to effectively fuse local and global information of the extracted features.
  • results: Experimental results on five benchmark datasets demonstrate that MGDN outperforms existing methods on four image fusion tasks, showcasing its effectiveness in preserving complementary information while filtering out irrelevant information for the fused result.
    Abstract Image fusion aims to generate a high-quality image from multiple images captured under varying conditions. The key problem of this task is to preserve complementary information while filtering out irrelevant information for the fused result. However, existing methods address this problem by leveraging static convolutional neural networks (CNNs), suffering two inherent limitations during feature extraction, i.e., being unable to handle spatial-variant contents and lacking guidance from multiple inputs. In this paper, we propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs. Specifically, we design a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, composed of a mutual-guided cross-attention (MGCA) module and a dynamic filter predictor, where the former incorporates additional guidance from different inputs and the latter generates spatial-variant kernels for different locations. In addition, we introduce a parallel feature fusion (PFF) module to effectively fuse local and global information of the extracted features. To further reduce the redundancy among the extracted features while simultaneously preserving their shared structural information, we devise a novel loss function that combines the minimization of normalized mutual information (NMI) with an estimated gradient mask. Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks. The code and model are publicly available at: https://github.com/Guanys-dar/MGDN.

HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks

  • paper_url: http://arxiv.org/abs/2308.12537
  • repo_url: https://github.com/dzcgaara/HuBo-VLM
  • paper_authors: Zichao Dong, Weikun Zhang, Xufeng Huang, Hang Ji, Xin Zhan, Junbo Chen
  • for: This paper proposes a transformer-based vision-language model for human-robot interaction, helping robots understand natural-language instructions from humans and complete the corresponding tasks.
  • methods: A unified transformer-based vision-language model tackles the perception tasks of human-robot interaction, including object detection and visual grounding.
  • results: Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of the proposed approach.
    Abstract Human robot interaction is an exciting task, which aimed to guide robots following instructions from human. Since huge gap lies between human natural language and machine codes, end to end human robot interaction models is fair challenging. Further, visual information receiving from sensors of robot is also a hard language for robot to perceive. In this work, HuBo-VLM is proposed to tackle perception tasks associated with human robot interaction including object detection and visual grounding by a unified transformer based vision language model. Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of our approach. Code would be publicly available in https://github.com/dzcgaara/HuBo-VLM.

SCP: Spherical-Coordinate-based Learned Point Cloud Compression

  • paper_url: http://arxiv.org/abs/2308.12535
  • repo_url: https://github.com/luoao-kddi/SCP
  • paper_authors: Ao Luo, Linxin Song, Keisuke Nonaka, Kyohei Unno, Heming Sun, Masayuki Goto, Jiro Katto
  • for: This work targets learned point cloud compression, particularly for spinning LiDAR point clouds, which exhibit circular shapes and azimuthal-angle invariance.
  • methods: Spherical-Coordinate-based learned Point cloud compression (SCP) is a model-agnostic method that exploits these features (see the coordinate-conversion sketch below) and adds a multi-level octree to reduce reconstruction error in distant regions.
  • results: Experiments show that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.
    Abstract In recent years, the task of learned point cloud compression has gained prominence. An important type of point cloud, the spinning LiDAR point cloud, is generated by spinning LiDAR on vehicles. This process results in numerous circular shapes and azimuthal angle invariance features within the point clouds. However, these two features have been largely overlooked by previous methodologies. In this paper, we introduce a model-agnostic method called Spherical-Coordinate-based learned Point cloud compression (SCP), designed to leverage the aforementioned features fully. Additionally, we propose a multi-level Octree for SCP to mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree. SCP exhibits excellent universality, making it applicable to various learned point cloud compression techniques. Experimental results demonstrate that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.
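A minimal sketch of the Cartesian-to-spherical conversion underlying the spherical-coordinate representation discussed above, assuming the LiDAR sensor sits at the origin; the example ring of points is synthetic.

```python
import numpy as np

def cartesian_to_spherical(points: np.ndarray) -> np.ndarray:
    """Convert an (N, 3) array of x, y, z into (radius, azimuth, elevation)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)                       # angle around the spin axis
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    return np.stack([r, azimuth, elevation], axis=1)

# Points on a ring at constant radius map to (almost) constant r and elevation,
# which is exactly the regularity a spherical-coordinate codec can exploit.
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ring = np.stack([10 * np.cos(theta), 10 * np.sin(theta), np.full_like(theta, 1.5)], axis=1)
print(np.round(cartesian_to_spherical(ring), 3))
```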

Channel and Spatial Relation-Propagation Network for RGB-Thermal Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.12534
  • repo_url: None
  • paper_authors: Zikun Zhou, Shukun Wu, Guoqing Zhu, Hongpeng Wang, Zhenyu He
  • for: This paper proposes a Channel and Spatial Relation-Propagation Network (CSRPNet) for RGB-Thermal (RGB-T) semantic segmentation, exploiting the complementarity of the two modalities to improve segmentation accuracy.
  • methods: CSRPNet first performs relation propagation along the channel and spatial dimensions to capture modality-shared features, and then aggregates the shared features captured from one modality with the input features of the other, enhancing them without modality-specific contamination.
  • results: Experimental results show that CSRPNet performs favorably against state-of-the-art RGB-T semantic segmentation methods.
    Abstract RGB-Thermal (RGB-T) semantic segmentation has shown great potential in handling low-light conditions where RGB-based segmentation is hindered by poor RGB imaging quality. The key to RGB-T semantic segmentation is to effectively leverage the complementarity nature of RGB and thermal images. Most existing algorithms fuse RGB and thermal information in feature space via concatenation, element-wise summation, or attention operations in either unidirectional enhancement or bidirectional aggregation manners. However, they usually overlook the modality gap between RGB and thermal images during feature fusion, resulting in modality-specific information from one modality contaminating the other. In this paper, we propose a Channel and Spatial Relation-Propagation Network (CSRPNet) for RGB-T semantic segmentation, which propagates only modality-shared information across different modalities and alleviates the modality-specific information contamination issue. Our CSRPNet first performs relation-propagation in channel and spatial dimensions to capture the modality-shared features from the RGB and thermal features. CSRPNet then aggregates the modality-shared features captured from one modality with the input feature from the other modality to enhance the input feature without the contamination issue. While being fused together, the enhanced RGB and thermal features will be also fed into the subsequent RGB or thermal feature extraction layers for interactive feature fusion, respectively. We also introduce a dual-path cascaded feature refinement module that aggregates multi-layer features to produce two refined features for semantic and boundary prediction. Extensive experimental results demonstrate that CSRPNet performs favorably against state-of-the-art algorithms.

SieveNet: Selecting Point-Based Features for Mesh Networks

  • paper_url: http://arxiv.org/abs/2308.12530
  • repo_url: https://github.com/sievenet/sievenet.github.io
  • paper_authors: Shengchao Yuan, Yishun Dou, Rui Shi, Bingbing Ni, Zhong Zheng
  • for: Improve the use of meshes in 3D computer vision and graphics, where their irregular topology limits the application of existing neural network architectures.
  • methods: SieveNet combines the structured topology obtained from remeshing with accurate geometric information gathered through distortion-aware point sampling on the original mesh surface, balancing a regular structure with faithful geometry.
  • results: Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and superiority of SieveNet, without the need for hand-crafted feature engineering.
    Abstract Meshes are widely used in 3D computer vision and graphics, but their irregular topology poses challenges in applying them to existing neural network architectures. Recent advances in mesh neural networks turn to remeshing and push the boundary of pioneer methods that solely take the raw meshes as input. Although the remeshing offers a regular topology that significantly facilitates the design of mesh network architectures, features extracted from such remeshed proxies may struggle to retain the underlying geometry faithfully, limiting the subsequent neural network's capacity. To address this issue, we propose SieveNet, a novel paradigm that takes into account both the regular topology and the exact geometry. Specifically, this method utilizes structured mesh topology from remeshing and accurate geometric information from distortion-aware point sampling on the surface of the original mesh. Furthermore, our method eliminates the need for hand-crafted feature engineering and can leverage off-the-shelf network architectures such as the vision transformer. Comprehensive experimental results on classification and segmentation tasks well demonstrate the effectiveness and superiority of our method.

Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition

  • paper_url: http://arxiv.org/abs/2308.12522
  • repo_url: None
  • paper_authors: Siming Fu, Xiaoxuan He, Xinpeng Ding, Yuchen Cao, Hualiang Wang
  • for: This work addresses the feature-space bias caused by class imbalance in long-tail recognition, where training data skewed toward head classes distorts the representations of tail classes.
  • methods: A uniformly distributed category prototype-guided vision-language framework generates category prototypes uniformly spread over a hypersphere (see the sketch below) and makes the features of each class converge to its prototype, keeping the feature space uniform and improving class boundaries; an irrelevant text filtering and attribute enhancement module further lets the model ignore noisy text and focus on key attribute information.
  • results: The method outperforms previous vision-language approaches to long-tailed learning by a large margin, achieving state-of-the-art performance while remaining robust on tail classes.
    Abstract Recently, large-scale pre-trained vision-language models have presented benefits for alleviating class imbalance in long-tailed recognition. However, the long-tailed data distribution can corrupt the representation space, where the distance between head and tail categories is much larger than the distance between two tail categories. This uneven feature space distribution causes the model to exhibit unclear and inseparable decision boundaries on the uniformly distributed test set, which lowers its performance. To address these challenges, we propose the uniformly category prototype-guided vision-language framework to effectively mitigate feature space bias caused by data imbalance. Especially, we generate a set of category prototypes uniformly distributed on a hypersphere. Category prototype-guided mechanism for image-text matching makes the features of different classes converge to these distinct and uniformly distributed category prototypes, which maintain a uniform distribution in the feature space, and improve class boundaries. Additionally, our proposed irrelevant text filtering and attribute enhancement module allows the model to ignore irrelevant noisy text and focus more on key attribute information, thereby enhancing the robustness of our framework. In the image recognition fine-tuning stage, to address the positive bias problem of the learnable classifier, we design the class feature prototype-guided classifier, which compensates for the performance of tail classes while maintaining the performance of head classes. Our method outperforms previous vision-language methods for long-tailed learning work by a large margin and achieves state-of-the-art performance.
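A minimal sketch of one common way to realize the idea described above: class prototypes are optimized on a hypersphere so that the closest pair is pushed as far apart as possible, yielding an approximately uniform distribution. The optimizer, learning rate, and step count are assumptions, not the authors' recipe.

```python
import torch
import torch.nn.functional as F

def uniform_prototypes(num_classes: int, dim: int, steps: int = 2000, lr: float = 0.1) -> torch.Tensor:
    """Optimize unit-norm prototypes by minimizing each prototype's largest cosine similarity."""
    protos = torch.nn.Parameter(F.normalize(torch.randn(num_classes, dim), dim=1))
    opt = torch.optim.SGD([protos], lr=lr, momentum=0.9)
    for _ in range(steps):
        p = F.normalize(protos, dim=1)
        sim = p @ p.t() - 2.0 * torch.eye(num_classes)   # mask out self-similarity
        loss = sim.max(dim=1).values.mean()              # push each nearest neighbour apart
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(protos.detach(), dim=1)

protos = uniform_prototypes(num_classes=10, dim=128)
cos = protos @ protos.t() - 2.0 * torch.eye(10)
print(f"max pairwise cosine similarity: {cos.max().item():.3f}")  # well below 1.0
```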

Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval

  • paper_url: http://arxiv.org/abs/2308.12509
  • repo_url: None
  • paper_authors: Yuan Yuan, Yang Zhan, Zhitong Xiong
  • for: This work proposes an effective and efficient parameter-efficient transfer learning (PETL) method for transferring vision-language knowledge from the natural domain to remote sensing (RS) image-text retrieval, which is more practical for constantly updated RS data than full fine-tuning.
  • methods: The framework builds on the pretrained CLIP model with a multimodal remote sensing adapter (see the adapter sketch below) and a hybrid multi-modal contrastive (HMMC) learning objective; a simple yet effective HMMC loss addresses the high intra-modal similarity of RS data.
  • results: The model contains only 0.16M training parameters, a 98.9% reduction compared with full fine-tuning, substantially lowering training cost; retrieval performance exceeds traditional methods by 7-13% and is comparable to or better than full fine-tuning.
    Abstract Vision-and-language pre-training (VLP) models have experienced a surge in popularity recently. By fine-tuning them on specific datasets, significant performance improvements have been observed in various tasks. However, full fine-tuning of VLP models not only consumes a significant amount of computational resources but also has a significant environmental impact. Moreover, as remote sensing (RS) data is constantly being updated, full fine-tuning may not be practical for real-world applications. To address this issue, in this work, we investigate the parameter-efficient transfer learning (PETL) method to effectively and efficiently transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task. To this end, we make the following contributions. 1) We construct a novel and sophisticated PETL framework for the RS image-text retrieval (RSITR) task, which includes the pretrained CLIP model, a multimodal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective; 2) To deal with the problem of high intra-modal similarity in RS data, we design a simple yet effective HMMC loss; 3) We provide comprehensive empirical studies for PETL-based RS image-text retrieval. Our results demonstrate that the proposed method is promising and of great potential for practical applications. 4) We benchmark extensive state-of-the-art PETL methods on the RSITR task. Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning, resulting in substantial savings in training costs. Our retrieval performance exceeds traditional methods by 7-13% and achieves comparable or better performance than full fine-tuning. This work can provide new ideas and useful insights for RS vision-language tasks.
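A minimal sketch of the adapter idea behind parameter-efficient transfer learning as discussed above: a small bottleneck module with a residual connection is inserted into a frozen backbone, so only the adapter weights are trained. The bottleneck size, placement, and toy backbone are illustrative assumptions, not the paper's exact multimodal remote sensing adapter.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # residual keeps the frozen features

# Usage: freeze the backbone, train only the adapter parameters.
backbone = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad_(False)
adapter = Adapter(dim=512)
trainable = sum(p.numel() for p in adapter.parameters())
print(f"trainable adapter parameters: {trainable}")   # a tiny fraction of the backbone
```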

FFEINR: Flow Feature-Enhanced Implicit Neural Representation for Spatio-temporal Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.12508
  • repo_url: None
  • paper_authors: Chenyue Jiao, Chongke Bi, Lu Yang
  • for: Increase the spatial and temporal resolution of flow field data.
  • methods: A Feature-Enhanced Implicit Neural Representation (FFEINR) based on a fully connected network with periodic activation functions (see the sketch below), with feature enhancements on the input layer to complement the contextual information of the flow field.
  • results: Achieves significantly better results than trilinear interpolation.
    Abstract Large-scale numerical simulations are capable of generating data up to terabytes or even petabytes. As a promising method of data reduction, super-resolution (SR) has been widely studied in the scientific visualization community. However, most of them are based on deep convolutional neural networks (CNNs) or generative adversarial networks (GANs) and the scale factor needs to be determined before constructing the network. As a result, a single training session only supports a fixed factor and has poor generalization ability. To address these problems, this paper proposes a Feature-Enhanced Implicit Neural Representation (FFEINR) for spatio-temporal super-resolution of flow field data. It can take full advantage of the implicit neural representation in terms of model structure and sampling resolution. The neural representation is based on a fully connected network with periodic activation functions, which enables us to obtain lightweight models. The learned continuous representation can decode the low-resolution flow field input data to arbitrary spatial and temporal resolutions, allowing for flexible upsampling. The training process of FFEINR is facilitated by introducing feature enhancements for the input layer, which complements the contextual information of the flow field.To demonstrate the effectiveness of the proposed method, a series of experiments are conducted on different datasets by setting different hyperparameters. The results show that FFEINR achieves significantly better results than the trilinear interpolation method.
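A minimal sketch of an implicit neural representation with periodic (sine) activations, the kind of lightweight fully connected network described above: it maps continuous (x, y, z, t) coordinates to flow values and can therefore be queried at arbitrary spatial and temporal resolution. Layer sizes and the frequency factor omega are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, omega: float = 30.0):
        super().__init__()
        self.omega = omega
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.omega * self.linear(x))

class ImplicitFlowField(nn.Module):
    def __init__(self, coord_dim: int = 4, hidden: int = 128, out_dim: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            SineLayer(coord_dim, hidden),
            SineLayer(hidden, hidden),
            SineLayer(hidden, hidden),
            nn.Linear(hidden, out_dim),   # predicted velocity components
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords)

# Querying the trained network at a denser coordinate grid is what realizes the
# decoding to arbitrary spatial and temporal resolution.
model = ImplicitFlowField()
dense_coords = torch.rand(4096, 4)            # (x, y, z, t) samples in [0, 1]
print(model(dense_coords).shape)              # torch.Size([4096, 3])
```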

DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-based Human Action Recognition

  • paper_url: http://arxiv.org/abs/2308.12501
  • repo_url: https://github.com/shiyin-lc/dd-gcn
  • paper_authors: Chang Li, Qian Huang, Yingchi Mao
  • for: This paper aims to improve Graph Convolutional Networks (GCNs) for skeleton-based human action recognition.
  • methods: DD-GCN constructs a directed diffusion graph for action modeling, introduces an activity partition strategy to optimize the weight-sharing mechanism of graph convolution kernels, and adds a spatio-temporal synchronization encoder to embed synchronized spatio-temporal semantics.
  • results: Experiments on three public datasets (NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA) demonstrate state-of-the-art performance.
    Abstract Graph Convolutional Networks (GCNs) have been widely used in skeleton-based human action recognition. In GCN-based methods, the spatio-temporal graph is fundamental for capturing motion patterns. However, existing approaches ignore the physical dependency and synchronized spatio-temporal correlations between joints, which limits the representation capability of GCNs. To solve these problems, we construct the directed diffusion graph for action modeling and introduce the activity partition strategy to optimize the weight sharing mechanism of graph convolution kernels. In addition, we present the spatio-temporal synchronization encoder to embed synchronized spatio-temporal semantics. Finally, we propose Directed Diffusion Graph Convolutional Network (DD-GCN) for action recognition, and the experiments on three public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA, demonstrate the state-of-the-art performance of our method.

MOFA: A Model Simplification Roadmap for Image Restoration on Mobile Devices

  • paper_url: http://arxiv.org/abs/2308.12494
  • repo_url: None
  • paper_authors: Xiangyu Chen, Ruiwen Zhen, Shuai Li, Xiaotian Li, Guanghui Wang
  • for: restore high-quality images from degraded counterparts and improve the efficiency of image restoration models on mobile devices.
  • methods: add more parameters to partial convolutions on FLOPs non-sensitive layers, apply partial depthwise convolution coupled with decoupling upsampling/downsampling layers.
  • results: decreases runtime by up to 13% and reduces the number of parameters by up to 23%, while increasing PSNR and SSIM on several image restoration datasets.
    Abstract Image restoration aims to restore high-quality images from degraded counterparts and has seen significant advancements through deep learning techniques. The technique has been widely applied to mobile devices for tasks such as mobile photography. Given the resource limitations on mobile devices, such as memory constraints and runtime requirements, the efficiency of models during deployment becomes paramount. Nevertheless, most previous works have primarily concentrated on analyzing the efficiency of single modules and improving them individually. This paper examines the efficiency across different layers. We propose a roadmap that can be applied to further accelerate image restoration models prior to deployment while simultaneously increasing PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). The roadmap first increases the model capacity by adding more parameters to partial convolutions on FLOPs non-sensitive layers. Then, it applies partial depthwise convolution coupled with decoupling upsampling/downsampling layers to accelerate the model speed. Extensive experiments demonstrate that our approach decreases runtime by up to 13% and reduces the number of parameters by up to 23%, while increasing PSNR and SSIM on several image restoration datasets. Source Code of our method is available at \href{https://github.com/xiangyu8/MOFA}{https://github.com/xiangyu8/MOFA}.

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

  • paper_url: http://arxiv.org/abs/2308.12469
  • repo_url: None
  • paper_authors: Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco
  • for: Produce high-quality segmentation masks for zero-shot segmentation without any annotations, a fundamental problem in computer vision.
  • methods: The self-attention layers of a pre-trained Stable Diffusion model are exploited: attention maps are iteratively merged into valid segmentation masks by measuring the KL divergence among them (see the sketch below), requiring no training or language dependency.
  • results: On COCO-Stuff-27, the method surpasses the prior unsupervised zero-shot state of the art by an absolute 26% in pixel accuracy and 17% in mean IoU.
    Abstract Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any images. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU.
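A minimal sketch of the attention-map merging idea described above: pairwise symmetric KL divergence between spatial attention maps (treated as distributions) decides which maps get merged into the same mask. The map sizes, the threshold, and the simple greedy merge rule are assumptions for illustration.

```python
import torch

def sym_kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Symmetric KL divergence between two attention maps normalized to distributions."""
    p = p.flatten() / (p.sum() + eps)
    q = q.flatten() / (q.sum() + eps)
    kl_pq = (p * torch.log((p + eps) / (q + eps))).sum()
    kl_qp = (q * torch.log((q + eps) / (p + eps))).sum()
    return kl_pq + kl_qp

def merge_attention_maps(maps: torch.Tensor, threshold: float = 0.5) -> list:
    """Greedily merge maps whose symmetric KL to a group prototype falls below the threshold."""
    merged = []
    for m in maps:
        for group in merged:
            if sym_kl(group["proto"], m) < threshold:
                group["proto"] = (group["proto"] * group["n"] + m) / (group["n"] + 1)
                group["n"] += 1
                break
        else:
            merged.append({"proto": m.clone(), "n": 1})
    return [g["proto"] for g in merged]

maps = torch.rand(16, 32, 32).softmax(dim=-1)     # stand-ins for self-attention maps
masks = merge_attention_maps(maps, threshold=0.3)
print(len(masks), masks[0].shape)
```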

InverseSR: 3D Brain MRI Super-Resolution Using a Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.12465
  • repo_url: https://github.com/biomedai-ucsc/inversesr
  • paper_authors: Jueqi Wang, Jacob Levman, Walter Hugo Lopez Pinaya, Petru-Daniel Tudosiu, M. Jorge Cardoso, Razvan Marinescu
  • for: This paper proposes a deep-learning-based MRI super-resolution (SR) method to increase the resolution of routine clinical brain MRI scans.
  • methods: A state-of-the-art 3D brain latent diffusion model (LDM) trained on UK Biobank serves as a generative prior; two strategies are introduced, InverseSR(LDM) for sparser settings, which inverts through both the LDM decoder and a deterministic DDIM, and InverseSR(Decoder) for less sparse settings, which inverts only through the LDM decoder.
  • results: The method improves resolution across a variety of MRI SR problems, with the choice between the two strategies adapted to the setting; it is validated on over 100 brain T1w MRIs from the IXI dataset.
    Abstract High-resolution (HR) MRI scans obtained from research-grade medical centers provide precise information about imaged tissues. However, routine clinical MRI scans are typically in low-resolution (LR) and vary greatly in contrast and spatial resolution due to the adjustments of the scanning parameters to the local needs of the medical center. End-to-end deep learning methods for MRI super-resolution (SR) have been proposed, but they require re-training each time there is a shift in the input distribution. To address this issue, we propose a novel approach that leverages a state-of-the-art 3D brain generative model, the latent diffusion model (LDM) trained on UK BioBank, to increase the resolution of clinical MRI scans. The LDM acts as a generative prior, which has the ability to capture the prior distribution of 3D T1-weighted brain MRI. Based on the architecture of the brain LDM, we find that different methods are suitable for different settings of MRI SR, and thus propose two novel strategies: 1) for SR with more sparsity, we invert through both the decoder of the LDM and also through a deterministic Denoising Diffusion Implicit Models (DDIM), an approach we will call InverseSR(LDM); 2) for SR with less sparsity, we invert only through the LDM decoder, an approach we will call InverseSR(Decoder). These two approaches search different latent spaces in the LDM model to find the optimal latent code to map the given LR MRI into HR. The training process of the generative model is independent of the MRI under-sampling process, ensuring the generalization of our method to many MRI SR problems with different input measurements. We validate our method on over 100 brain T1w MRIs from the IXI dataset. Our method can demonstrate that powerful priors given by LDM can be used for MRI reconstruction.

Overcoming General Knowledge Loss with Selective Parameter Finetuning

  • paper_url: http://arxiv.org/abs/2308.12462
  • repo_url: None
  • paper_authors: Wenxuan Zhang, Paul Janson, Rahaf Aljundi, Mohamed Elhoseiny
  • for: Improve the ability to update foundation models so they accommodate new information while retaining their original knowledge.
  • methods: A new approach achieves continual model updates through localized modifications to a small subset of parameters: guided by prior analyses of foundation models, it first localizes a specific layer for refinement and then introduces an importance scoring mechanism to update only the most crucial weights (see the sketch below).
  • results: Extensive evaluation on foundational vision-language models across diverse continual learning tasks shows that the method improves existing continual learning methods by 0.5%-10% on average and reduces the loss of pre-trained knowledge from around 5% to 0.97%.
    Abstract Foundation models encompass an extensive knowledge base and offer remarkable transferability. However, this knowledge becomes outdated or insufficient over time. The challenge lies in updating foundation models to accommodate novel information while retaining their original ability. In this paper, we present a novel approach to achieving continual model updates by effecting localized modifications to a small subset of parameters. Guided by insights gleaned from prior analyses of foundational models, we first localize a specific layer for model refinement and then introduce an importance scoring mechanism designed to update only the most crucial weights. Our method is exhaustively evaluated on foundational vision-language models, measuring its efficacy in both learning new information and preserving pre-established knowledge across a diverse spectrum of continual learning tasks, including Aircraft, Birdsnap CIFAR-100, CUB, Cars, and GTSRB. The results show that our method improves the existing continual learning methods by 0.5\% - 10\% on average, and reduces the loss of pre-trained knowledge from around 5\% to 0.97\%. Comprehensive ablation studies substantiate our method design, shedding light on the contributions of each component to controllably learning new knowledge and mitigating the forgetting of pre-trained knowledge.
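A minimal sketch of importance-scored selective finetuning as described above: each weight of a chosen layer is scored by the magnitude of its accumulated gradient on the new task, and only the top fraction is allowed to move during updates. The layer choice, scoring rule, and top-k fraction are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

def importance_mask(layer: nn.Linear, data_loader, loss_fn, top_frac: float = 0.1) -> torch.Tensor:
    """Accumulate |gradient| over a few batches and keep the top fraction of weights."""
    score = torch.zeros_like(layer.weight)
    for x, y in data_loader:
        layer.zero_grad()
        loss_fn(layer(x), y).backward()
        score += layer.weight.grad.abs()
    k = max(1, int(top_frac * score.numel()))
    threshold = score.flatten().topk(k).values.min()
    return (score >= threshold).float()

# Usage: compute the mask once, then zero the gradients of non-selected weights
# before each optimizer step so only the crucial parameters move.
layer = nn.Linear(32, 10)
loader = [(torch.randn(8, 32), torch.randint(0, 10, (8,))) for _ in range(4)]
mask = importance_mask(layer, loader, nn.CrossEntropyLoss(), top_frac=0.1)
opt = torch.optim.SGD(layer.parameters(), lr=1e-3)
for x, y in loader:
    opt.zero_grad()
    nn.CrossEntropyLoss()(layer(x), y).backward()
    layer.weight.grad.mul_(mask)    # freeze the unimportant weights
    opt.step()
```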

ARF-Plus: Controlling Perceptual Factors in Artistic Radiance Fields for 3D Scene Stylization

  • paper_url: http://arxiv.org/abs/2308.12452
  • repo_url: None
  • paper_authors: Wenzhao Li, Tianhao Wu, Fangcheng Zhong, Cengiz Oztireli
  • for: Style transfer for 3D scenes using artistic radiance fields.
  • methods: Style transfer is performed on 3D neural radiance fields with four types of perceptual control: color preservation control, (style pattern) scale control, spatial (selective stylization area) control, and depth enhancement control.
  • results: Quantitative and qualitative evaluation on real-world datasets shows that ARF-Plus delivers the intended controls for 3D scene stylization, and multiple style effects can be applied simultaneously to create novel and eye-catching stylization effects.
    Abstract The radiance fields style transfer is an emerging field that has recently gained popularity as a means of 3D scene stylization, thanks to the outstanding performance of neural radiance fields in 3D reconstruction and view synthesis. We highlight a research gap in radiance fields style transfer, the lack of sufficient perceptual controllability, motivated by the existing concept in the 2D image style transfer. In this paper, we present ARF-Plus, a 3D neural style transfer framework offering manageable control over perceptual factors, to systematically explore the perceptual controllability in 3D scene stylization. Four distinct types of controls - color preservation control, (style pattern) scale control, spatial (selective stylization area) control, and depth enhancement control - are proposed and integrated into this framework. Results from real-world datasets, both quantitative and qualitative, show that the four types of controls in our ARF-Plus framework successfully accomplish their corresponding perceptual controls when stylizing 3D scenes. These techniques work well for individual style inputs as well as for the simultaneous application of multiple styles within a scene. This unlocks a realm of limitless possibilities, allowing customized modifications of stylization effects and flexible merging of the strengths of different styles, ultimately enabling the creation of novel and eye-catching stylistic effects on 3D scenes.

MOFO: MOtion FOcused Self-Supervision for Video Understanding

  • paper_url: http://arxiv.org/abs/2308.12447
  • repo_url: None
  • paper_authors: Mona Ahmadian, Frank Guerin, Andrew Gilbert
  • for: improve action recognition in video by focusing self-supervised representation learning on the motion areas of a video
  • methods: proposes MOFO (MOtion FOcused), a self-supervised method that automatically detects motion areas and uses them to guide the self-supervision task; a masked autoencoder masks a high proportion of the input sequence, with a specified percentage of the masked tokens forced to lie inside the motion area and the remainder outside, and motion information is further incorporated into finetuning to emphasise motion in the downstream task
  • results: the motion-focused design boosts the leading self-supervised method (VideoMAE), improving accuracy by +2.6%, +2.1%, and +1.3% on Epic-Kitchens verb, noun, and action classification and by +4.7% on Something-Something V2 action classification, indicating the importance of explicitly encoding motion in SSL
    Abstract Self-supervised learning (SSL) techniques have recently produced outstanding results in learning visual representations from unlabeled videos. Despite the importance of motion in supervised learning techniques for action recognition, SSL methods often do not explicitly consider motion information in videos. To address this issue, we propose MOFO (MOtion FOcused), a novel SSL method for focusing representation learning on the motion area of a video, for action recognition. MOFO automatically detects motion areas in videos and uses these to guide the self-supervision task. We use a masked autoencoder which randomly masks out a high proportion of the input sequence; we force a specified percentage of the inside of the motion area to be masked and the remainder from outside. We further incorporate motion information into the finetuning step to emphasise motion in the downstream task. We demonstrate that our motion-focused innovations can significantly boost the performance of the currently leading SSL method (VideoMAE) for action recognition. Our method improves the recent self-supervised Vision Transformer (ViT), VideoMAE, by achieving +2.6%, +2.1%, +1.3% accuracy on Epic-Kitchens verb, noun and action classification, respectively, and +4.7% accuracy on Something-Something V2 action classification. Our proposed approach significantly improves the performance of the current SSL method for action recognition, indicating the importance of explicitly encoding motion in SSL.
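
The motion-focused masking scheme can be sketched as follows: mask a fixed overall fraction of patch tokens, with a specified share of the masked tokens forced to come from inside the detected motion area. The ratios and the toy motion map below are placeholders, not MOFO's tuned values.

```python
import numpy as np

def motion_focused_mask(motion_map: np.ndarray, mask_ratio: float = 0.9,
                        inside_ratio: float = 0.75, rng=None) -> np.ndarray:
    """Return a boolean mask over patch tokens (True = masked).

    motion_map: boolean array of shape (num_tokens,), True inside the motion area.
    mask_ratio: overall fraction of tokens to mask.
    inside_ratio: fraction of the masked tokens drawn from inside the motion area.
    """
    rng = rng or np.random.default_rng()
    n = motion_map.size
    n_mask = int(round(mask_ratio * n))
    inside = np.flatnonzero(motion_map)
    outside = np.flatnonzero(~motion_map)
    n_inside = min(len(inside), int(round(inside_ratio * n_mask)))
    n_outside = min(len(outside), n_mask - n_inside)
    masked = np.concatenate([
        rng.choice(inside, size=n_inside, replace=False),
        rng.choice(outside, size=n_outside, replace=False),
    ])
    mask = np.zeros(n, dtype=bool)
    mask[masked] = True
    return mask

# Toy usage: 196 tokens (14x14 patches), motion in the upper-left quadrant.
motion = np.zeros((14, 14), dtype=bool)
motion[:7, :7] = True
mask = motion_focused_mask(motion.ravel())
```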

TAI-GAN: Temporally and Anatomically Informed GAN for early-to-late frame conversion in dynamic cardiac PET motion correction

  • paper_url: http://arxiv.org/abs/2308.12443
  • repo_url: https://github.com/gxq1998/tai-gan
  • paper_authors: Xueqi Guo, Luyao Shi, Xiongchao Chen, Bo Zhou, Qiong Liu, Huidong Xie, Yi-Hwa Liu, Richard Palyo, Edward J. Miller, Albert J. Sinusas, Bruce Spottiswoode, Chi Liu, Nicha C. Dvornek
  • for: address the rapid tracer kinetics and high cross-frame distribution variation in dynamic cardiac PET, which make inter-frame motion correction difficult, especially for early frames
  • methods: uses a generative approach to handle tracer distribution changes and assist existing registration methods; a Temporally and Anatomically Informed GAN (TAI-GAN) converts early frames into the late reference frame with an all-to-one mapping, using a feature-wise linear modulation layer conditioned on temporal tracer kinetics and rough cardiac segmentations with local shifts as anatomical information
  • results: on a clinical $^{82}$Rb PET dataset, TAI-GAN produces converted early frames with image quality comparable to the real reference frames; after conversion, motion estimation accuracy and clinical myocardial blood flow (MBF) quantification improve over using the original frames
    Abstract The rapid tracer kinetics of rubidium-82 ($^{82}$Rb) and high variation of cross-frame distribution in dynamic cardiac positron emission tomography (PET) raise significant challenges for inter-frame motion correction, particularly for the early frames where conventional intensity-based image registration techniques are not applicable. Alternatively, a promising approach utilizes generative methods to handle the tracer distribution changes to assist existing registration methods. To improve frame-wise registration and parametric quantification, we propose a Temporally and Anatomically Informed Generative Adversarial Network (TAI-GAN) to transform the early frames into the late reference frame using an all-to-one mapping. Specifically, a feature-wise linear modulation layer encodes channel-wise parameters generated from temporal tracer kinetics information, and rough cardiac segmentations with local shifts serve as the anatomical information. We validated our proposed method on a clinical $^{82}$Rb PET dataset and found that our TAI-GAN can produce converted early frames with high image quality, comparable to the real reference frames. After TAI-GAN conversion, motion estimation accuracy and clinical myocardial blood flow (MBF) quantification were improved compared to using the original frames. Our code is published at https://github.com/gxq1998/TAI-GAN.
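
A minimal sketch of a feature-wise linear modulation (FiLM) layer of the kind described above, where channel-wise scale and shift parameters are predicted from a conditioning vector standing in for temporal tracer-kinetics information; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift each channel of a
    feature map with parameters predicted from a conditioning vector."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feat + beta

# Toy usage inside a generator block.
film = FiLM(cond_dim=8, num_channels=32)
features = torch.randn(2, 32, 64, 64)
kinetics = torch.randn(2, 8)            # stand-in for tracer-kinetics features
modulated = film(features, kinetics)
```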

HNAS-reg: hierarchical neural architecture search for deformable medical image registration

  • paper_url: http://arxiv.org/abs/2308.12440
  • repo_url: None
  • paper_authors: Jiong Wu, Yong Fan
  • for: find an optimal deep learning architecture for deformable medical image registration
  • methods: a hierarchical NAS framework (HNAS-Reg) that searches over both convolutional operations and network topology; a partial channel strategy is used to reduce computational overhead and memory use without losing optimization quality
  • results: on three datasets comprising 636 T1-weighted MRIs, the searched model achieves higher registration accuracy with a smaller model size than state-of-the-art approaches, including a representative traditional method and two unsupervised learning-based methods
    Abstract Convolutional neural networks (CNNs) have been widely used to build deep learning models for medical image registration, but manually designed network architectures are not necessarily optimal. This paper presents a hierarchical NAS framework (HNAS-Reg), consisting of both convolutional operation search and network topology search, to identify the optimal network architecture for deformable medical image registration. To mitigate the computational overhead and memory constraints, a partial channel strategy is utilized without losing optimization quality. Experiments on three datasets, consisting of 636 T1-weighted magnetic resonance images (MRIs), have demonstrated that the proposed method can build a deep learning model with improved image registration accuracy and reduced model size, compared with state-of-the-art image registration approaches, including one representative traditional approach and two unsupervised learning-based approaches.
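
The partial channel strategy can be sketched in the spirit of PC-DARTS: only a 1/k fraction of the channels passes through the softmax-weighted candidate operations while the rest bypasses them, and the two parts are concatenated. The candidate operation set and k below are assumptions, not HNAS-Reg's search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialChannelMixedOp(nn.Module):
    """Mixed operation for differentiable NAS that searches over candidate
    ops using only a fraction 1/k of the input channels."""

    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        self.k = k
        c_part = channels // k
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(c_part, c_part, 3, padding=1, bias=False),
            nn.Conv2d(c_part, c_part, 5, padding=2, bias=False),
            nn.AvgPool2d(3, stride=1, padding=1),
        ])
        # Architecture parameters: one logit per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c_part = x.size(1) // self.k
        x_search, x_bypass = x[:, :c_part], x[:, c_part:]
        weights = F.softmax(self.alpha, dim=0)
        mixed = sum(w * op(x_search) for w, op in zip(weights, self.ops))
        return torch.cat([mixed, x_bypass], dim=1)

# Toy usage: 16-channel feature map, 1/4 of the channels go through the search.
op = PartialChannelMixedOp(channels=16, k=4)
y = op(torch.randn(2, 16, 32, 32))     # shape preserved: (2, 16, 32, 32)
```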

Characterising representation dynamics in recurrent neural networks for object recognition

  • paper_url: http://arxiv.org/abs/2308.12435
  • repo_url: None
  • paper_authors: Sushrut Thorat, Adrien Doerig, Tim C. Kietzmann
  • for: understand the representational dynamics of recurrent neural networks (RNNs) on challenging visual tasks, particularly in large-scale vision models
  • methods: RNNs are trained for object classification on MiniEcoset, a novel subset of ecoset, and "readout zones" are used to characterise the activation trajectories
  • results: representations continue to evolve after correct classification, suggesting the networks have no notion of being "done with classification"; misclassified representations have lower L2 norm and sit more peripherally in the readout zones, an arrangement that helps them move into the correct zones over time; the findings generalise to networks with lateral and top-down connections and may inform the study of representational dynamics in primate vision
    Abstract Recurrent neural networks (RNNs) have yielded promising results for both recognizing objects in challenging conditions and modeling aspects of primate vision. However, the representational dynamics of recurrent computations remain poorly understood, especially in large-scale visual models. Here, we studied such dynamics in RNNs trained for object classification on MiniEcoset, a novel subset of ecoset. We report two main insights. First, upon inference, representations continued to evolve after correct classification, suggesting a lack of the notion of being ``done with classification''. Second, focusing on ``readout zones'' as a way to characterize the activation trajectories, we observe that misclassified representations exhibit activation patterns with lower L2 norm, and are positioned more peripherally in the readout zones. Such arrangements help the misclassified representations move into the correct zones as time progresses. Our findings generalize to networks with lateral and top-down connections, and include both additive and multiplicative interactions with the bottom-up sweep. The results therefore contribute to a general understanding of RNN dynamics in naturalistic tasks. We hope that the analysis framework will aid future investigations of other types of RNNs, including understanding of representational dynamics in primate vision.
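
The readout-zone analysis can be illustrated on a toy recurrent classifier: record the readout activations at every time step and compare their L2 norms for correctly versus incorrectly classified inputs. The architecture and random data below are stand-ins, not MiniEcoset or the paper's networks.

```python
import torch
import torch.nn as nn

class TinyRecurrentClassifier(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_classes=10, steps=5):
        super().__init__()
        self.steps = steps
        self.inp = nn.Linear(in_dim, hidden)
        self.rec = nn.Linear(hidden, hidden)      # lateral/recurrent weights
        self.readout = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = torch.zeros(x.size(0), self.inp.out_features)
        logits_per_step = []
        for _ in range(self.steps):
            h = torch.relu(self.inp(x) + self.rec(h))
            logits_per_step.append(self.readout(h))
        return torch.stack(logits_per_step)       # (steps, B, n_classes)

model = TinyRecurrentClassifier()
x = torch.randn(128, 32)
labels = torch.randint(0, 10, (128,))
with torch.no_grad():
    logits = model(x)
pred = logits[-1].argmax(dim=1)
norms = logits.norm(dim=-1)                       # L2 norm of readout per step
correct, wrong = norms[:, pred == labels], norms[:, pred != labels]
print("mean readout norm per step (correct):", correct.mean(dim=1))
print("mean readout norm per step (wrong):  ", wrong.mean(dim=1))
```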

A Spatiotemporal Correspondence Approach to Unsupervised LiDAR Segmentation with Traffic Applications

  • paper_url: http://arxiv.org/abs/2308.12433
  • repo_url: None
  • paper_authors: Xiao Li, Pan He, Aotian Wu, Sanjay Ranka, Anand Rangarajan
  • for: unsupervised semantic segmentation of outdoor LiDAR point cloud sequences across diverse traffic scenarios in autonomous driving and intersection infrastructure
  • methods: exploits the spatiotemporal nature of dynamic point cloud sequences by establishing correspondences across multiple frames as strong augmentation, and dovetails clustering with pseudo-label learning, alternating between grouping points into semantic clusters and optimizing the model with point-wise pseudo-spatiotemporal labels
  • results: competitive segmentation performance on the Semantic-KITTI, SemanticPOSS, and FLORIDA benchmarks compared with many fully supervised methods; the framework points toward a unified representation learning approach for LiDAR point clouds that incorporates domain knowledge
    Abstract We address the problem of unsupervised semantic segmentation of outdoor LiDAR point clouds in diverse traffic scenarios. The key idea is to leverage the spatiotemporal nature of a dynamic point cloud sequence and introduce drastically stronger augmentation by establishing spatiotemporal correspondences across multiple frames. We dovetail clustering and pseudo-label learning in this work. Essentially, we alternate between clustering points into semantic groups and optimizing models using point-wise pseudo-spatiotemporal labels with a simple learning objective. Therefore, our method can learn discriminative features in an unsupervised learning fashion. We show promising segmentation performance on Semantic-KITTI, SemanticPOSS, and FLORIDA benchmark datasets covering scenarios in autonomous vehicle and intersection infrastructure, which is competitive when compared against many existing fully supervised learning methods. This general framework can lead to a unified representation learning approach for LiDAR point clouds incorporating domain knowledge.
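
A bare-bones sketch of the alternation between clustering and pseudo-label learning: features are clustered into pseudo semantic groups, the model is trained against those point-wise labels, and the loop repeats. The encoder, clustering method, and schedule are placeholders, and the cross-frame spatiotemporal correspondence step is omitted here.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def alternate_cluster_and_train(encoder: nn.Sequential, points: torch.Tensor,
                                n_clusters: int = 8, rounds: int = 3,
                                steps_per_round: int = 50):
    """points: (N, 3) LiDAR coordinates (a stand-in for richer point features)."""
    classifier = nn.Linear(encoder[-1].out_features, n_clusters)
    optim = torch.optim.Adam(list(encoder.parameters()) +
                             list(classifier.parameters()), lr=1e-3)
    for _ in range(rounds):
        # 1) Cluster current features into pseudo semantic groups.
        with torch.no_grad():
            feats = encoder(points).cpu().numpy()
        pseudo = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        pseudo = torch.as_tensor(pseudo, dtype=torch.long)
        # 2) Optimize the model against the point-wise pseudo-labels.
        for _ in range(steps_per_round):
            logits = classifier(encoder(points))
            loss = nn.functional.cross_entropy(logits, pseudo)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return encoder, classifier

# Toy usage with a random point cloud and a small MLP encoder.
encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 32))
pts = torch.randn(2048, 3)
alternate_cluster_and_train(encoder, pts)
```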

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

  • paper_url: http://arxiv.org/abs/2308.12408
  • repo_url: None
  • paper_authors: Matthew Martel, Jackson Wagner
  • for: develop a deep-learning framework that generates realistic audio effects for movies and other media directly from silent video
  • methods: explores several architectures that condition on both video context and previously generated audio, including a deep-fusion CNN, a dilated Wavenet CNN with visual context, and transformer-based models
  • results: the transformer-based architecture is the most promising, matching low-frequency content to visual patterns, but it fails to generate more nuanced waveforms
    Abstract Generating realistic audio effects for movies and other media is a challenging task that is accomplished today primarily through physical techniques known as Foley art. Foley artists create sounds with common objects (e.g., boxing gloves, broken glass) in time with video as it is playing to generate captivating audio tracks. In this work, we aim to develop a deep-learning based framework that does much the same - observes video in its natural sequence and generates realistic audio to accompany it. Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs (e.g., Wavenet conditioned on text). We explore several different model architectures to accomplish this task that process both previously-generated audio and video context. These include deep-fusion CNN, dilated Wavenet CNN with visual context, and transformer-based architectures. We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively, but failing to generate more nuanced waveforms.

FG-Net: Facial Action Unit Detection with Generalizable Pyramidal Features

  • paper_url: http://arxiv.org/abs/2308.12380
  • repo_url: https://github.com/ihp-lab/fg-net
  • paper_authors: Yufeng Yin, Di Chang, Guoxian Song, Shen Sang, Tiancheng Zhi, Jing Liu, Linjie Luo, Mohammad Soleymani
  • for: generalizable facial Action Unit (AU) detection for objective facial expression analysis
  • methods: extracts feature maps from a StyleGAN2 model pre-trained on a large and diverse face image dataset and detects AUs with a Pyramid CNN Interpreter
  • results: superior cross-domain performance and competitive within-domain performance on the DISFA and BP4D datasets compared with the state of the art, and competitive accuracy even when trained on only 1000 samples
    Abstract Automatic detection of facial Action Units (AUs) allows for objective facial expression analysis. Due to the high cost of AU labeling and the limited size of existing benchmarks, previous AU detection methods tend to overfit the dataset, resulting in a significant performance loss when evaluated across corpora. To address this problem, we propose FG-Net for generalizable facial action unit detection. Specifically, FG-Net extracts feature maps from a StyleGAN2 model pre-trained on a large and diverse face image dataset. Then, these features are used to detect AUs with a Pyramid CNN Interpreter, making the training efficient and capturing essential local features. The proposed FG-Net achieves a strong generalization ability for heatmap-based AU detection thanks to the generalizable and semantic-rich features extracted from the pre-trained generative model. Extensive experiments are conducted to evaluate within- and cross-corpus AU detection with the widely-used DISFA and BP4D datasets. Compared with the state-of-the-art, the proposed method achieves superior cross-domain performance while maintaining competitive within-domain performance. In addition, FG-Net is data-efficient and achieves competitive performance even when trained on 1000 samples. Our code will be released at \url{https://github.com/ihp-lab/FG-Net}
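
A hedged sketch of a pyramid-style interpreter over multi-scale feature maps (here random tensors standing in for frozen StyleGAN2 features): coarse features are upsampled and fused with finer ones before a final convolution predicts per-AU heatmaps. Channel counts, resolutions, and the head design are assumptions, not FG-Net's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidHeatmapHead(nn.Module):
    """Fuse multi-scale feature maps (e.g. from a frozen generative backbone)
    into per-AU heatmaps at the finest resolution."""

    def __init__(self, in_channels=(512, 256, 128), n_aus=12, width=64):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, width, kernel_size=1) for c in in_channels)
        self.out = nn.Conv2d(width, n_aus, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i), ordered coarse -> fine.
        fused = self.lateral[0](feats[0])
        for lat, f in zip(self.lateral[1:], feats[1:]):
            fused = F.interpolate(fused, size=f.shape[-2:], mode="bilinear",
                                  align_corners=False) + lat(f)
        return self.out(fused)          # (B, n_aus, H_fine, W_fine)

# Toy usage with random feature pyramids standing in for backbone features.
head = PyramidHeatmapHead()
feats = [torch.randn(1, 512, 16, 16),
         torch.randn(1, 256, 32, 32),
         torch.randn(1, 128, 64, 64)]
heatmaps = head(feats)                  # (1, 12, 64, 64)
```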

AdVerb: Visually Guided Audio Dereverberation

  • paper_url: http://arxiv.org/abs/2308.12370
  • repo_url: None
  • paper_authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
  • for: improve the quality of reverberant audio, making speech clearer and easier to recognise
  • methods: uses visual cues from an image of the recording environment together with the reverberant sound; a geometry-aware cross-modal transformer captures scene geometry and the audio-visual relationship to predict a complex ideal ratio mask that, applied to the reverberant audio, estimates the clean sound
  • results: outperforms traditional audio-only and audio-visual baselines with relative improvements of 18%-82% on speech enhancement, speech recognition, and speaker verification on the LibriSpeech test-clean set, and achieves highly satisfactory RT60 error scores on the AVSpeech dataset
    Abstract We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach incorporates the complementary visual modality to perform audio dereverberation. Given an image of the environment where the reverberated sound signal has been recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationship to generate a complex ideal ratio mask, which, when applied to the reverberant audio predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset.
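
The complex ratio mask step can be sketched directly: the reverberant waveform is transformed with an STFT, multiplied elementwise by a complex mask (here random, standing in for the network's prediction), and inverted back to a waveform. STFT parameters are illustrative.

```python
import torch

def apply_complex_mask(reverberant: torch.Tensor, mask: torch.Tensor,
                       n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """reverberant: (T,) waveform; mask: complex tensor matching the STFT shape."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(reverberant, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    dereverbed_spec = mask * spec          # complex elementwise multiplication
    return torch.istft(dereverbed_spec, n_fft=n_fft, hop_length=hop,
                       window=window, length=reverberant.shape[-1])

# Toy usage: 1 s of audio at 16 kHz and a random complex mask.
wav = torch.randn(16000)
spec_shape = torch.stft(wav, 512, 128, window=torch.hann_window(512),
                        return_complex=True).shape
mask = torch.randn(spec_shape, dtype=torch.complex64)
clean_estimate = apply_complex_mask(wav, mask)
```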

Continual Zero-Shot Learning through Semantically Guided Generative Random Walks

  • paper_url: http://arxiv.org/abs/2308.12366
  • repo_url: https://github.com/wx-zhang/igczsl
  • paper_authors: Wenxuan Zhang, Paul Janson, Kai Yi, Ivan Skorokhodov, Mohamed Elhoseiny
  • for: model the human ability to continually learn new concepts, retain previous knowledge, and apply it to future tasks, in the setting of continual zero-shot learning
  • methods: uses generative modeling, learning quality representations from seen classes to improve generative understanding of the unseen visual space, with a theoretically motivated, semantically guided Generative Random Walk (GRW) loss
  • results: the proposed algorithm achieves state-of-the-art performance on the AWA1, AWA2, CUB, and SUN datasets, surpassing existing CZSL methods by 3-7%
    Abstract Learning novel concepts, remembering previous knowledge, and adapting it to future tasks occur simultaneously throughout a human's lifetime. To model such comprehensive abilities, continual zero-shot learning (CZSL) has recently been introduced. However, most existing methods overuse unseen semantic information that may not be continually accessible in realistic settings. In this paper, we address the challenge of continual zero-shot learning where unseen information is not provided during training, by leveraging generative modeling. The heart of the generative-based methods is to learn quality representations from seen classes to improve the generative understanding of the unseen visual space. Motivated by this, we introduce generalization-bound tools and provide the first theoretical explanation for the benefits of generative modeling to CZSL tasks. Guided by the theoretical analysis, we then propose our learning algorithm that employs a novel semantically guided Generative Random Walk (GRW) loss. The GRW loss augments the training by continually encouraging the model to generate realistic and characterized samples to represent the unseen space. Our algorithm achieves state-of-the-art performance on AWA1, AWA2, CUB, and SUN datasets, surpassing existing CZSL methods by 3-7\%. The code has been made available here \url{https://github.com/wx-zhang/IGCZSL}

Saliency-based Video Summarization for Face Anti-spoofing

  • paper_url: http://arxiv.org/abs/2308.12364
  • repo_url: https://github.com/Usman1021/Saliency
  • paper_authors: Usman Muhammad, Mourad Oussalah, Md Ziaul Hoque, Jorma Laaksonen
  • for: improve the performance and efficiency of face anti-spoofing (presentation attack detection) models by leveraging visual saliency
  • methods: a video summarization method that extracts saliency from the difference between Laplacian and Wiener filter outputs of the source frames, decomposes each frame into base and detail layers, computes saliency-based weighting maps, and linearly combines the layers to fuse the video into a single representative image
  • results: with a simple CNN-RNN architecture, the approach achieves state-of-the-art performance on five challenging face anti-spoofing datasets
    Abstract Due to the growing availability of face anti-spoofing databases, researchers are increasingly focusing on video-based methods that use hundreds to thousands of images to assess their impact on performance. However, there is no clear consensus on the exact number of frames in a video required to improve the performance of face anti-spoofing tasks. Inspired by the visual saliency theory, we present a video summarization method for face anti-spoofing tasks that aims to enhance the performance and efficiency of deep learning models by leveraging visual saliency. In particular, saliency information is extracted from the differences between the Laplacian and Wiener filter outputs of the source images, enabling identification of the most visually salient regions within each frame. Subsequently, the source images are decomposed into base and detail layers, enhancing representation of important information. The weighting maps are then computed based on the saliency information, indicating the importance of each pixel in the image. By linearly combining the base and detail layers using the weighting maps, the method fuses the source images to create a single representative image that summarizes the entire video. The key contribution of our proposed method lies in demonstrating how visual saliency can be used as a data-centric approach to improve the performance and efficiency of face presentation attack detection models. By focusing on the most salient images or regions within the images, a more representative and diverse training set can be created, potentially leading to more effective models. To validate the method's effectiveness, a simple deep learning architecture (CNN-RNN) was used, and the experimental results showcased state-of-the-art performance on five challenging face anti-spoofing datasets.
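
A small sketch of the summarization pipeline described above: saliency from the difference between Laplacian- and Wiener-filtered frames, base/detail decomposition with a Gaussian filter, saliency-based weighting maps, and a weighted fusion into one summary image. Filter sizes and the exact weighting rule are assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace
from scipy.signal import wiener

def summarize_frames(frames: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """frames: (N, H, W) grayscale video frames in [0, 1] -> one fused image."""
    # Saliency per frame: difference between Laplacian and Wiener filter outputs.
    saliency = np.stack([np.abs(laplace(f) - wiener(f, 5)) for f in frames])
    # Normalize saliency into per-frame weights that sum to 1 at each pixel.
    weights = saliency + 1e-8
    weights /= weights.sum(axis=0, keepdims=True)
    # Base/detail decomposition of each frame.
    base = np.stack([gaussian_filter(f, sigma) for f in frames])
    detail = frames - base
    # Weighted fusion: average the base layers, weight the detail layers by saliency.
    fused = base.mean(axis=0) + (weights * detail).sum(axis=0)
    return np.clip(fused, 0.0, 1.0)

# Toy usage on random frames standing in for a face video.
video = np.random.rand(8, 64, 64)
summary = summarize_frames(video)
```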

Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.12350
  • repo_url: None
  • paper_authors: Duo Peng, Ping Hu, Qiuhong Ke, Jun Liu
  • for: improve the semantic consistency of images translated across domains for domain adaptive semantic segmentation
  • methods: uses source-domain labels as explicit guidance during image translation, formulating cross-domain translation as a denoising diffusion process with Semantic Gradient Guidance (SGG) and a Progressive Translation Learning (PTL) strategy
  • results: outperforms state-of-the-art methods in extensive experiments
    Abstract Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). However, existing methods still struggle to preserve semantically-consistent local details between the original and translated images. In this work, we present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation. Concretely, we formulate cross-domain image translation as a denoising diffusion process and utilize a novel Semantic Gradient Guidance (SGG) method to constrain the translation process, conditioning it on the pixel-wise source labels. Additionally, a Progressive Translation Learning (PTL) strategy is devised to enable the SGG method to work reliably across domains with large gaps. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods.
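
As a rough illustration of label-guided diffusion sampling in general (not the paper's exact SGG formulation), the sketch below nudges a denoising step with the gradient of a cross-entropy loss between a segmenter's prediction on the current clean-image estimate and the source labels. The `denoiser`, `segmenter`, and schedule value are stand-ins, and the step is classifier-guidance-style pseudocode.

```python
import torch

def guided_denoise_step(x_t, t, denoiser, segmenter, source_labels,
                        alpha_bar_t, guidance_scale=1.0):
    """One denoising step with semantic (label) gradient guidance.

    x_t: current noisy image (B, C, H, W); source_labels: (B, H, W) int labels.
    denoiser(x, t) predicts the noise; segmenter(x) predicts per-pixel logits.
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)
    # Clean-image estimate implied by the noise prediction.
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
    # Label-consistency loss between the estimate and the source segmentation.
    loss = torch.nn.functional.cross_entropy(segmenter(x0_hat), source_labels)
    grad = torch.autograd.grad(loss, x_t)[0]
    # Nudge the noise prediction against the gradient (classifier-guidance style).
    eps_guided = eps + guidance_scale * torch.sqrt(1 - alpha_bar_t) * grad
    # Guided clean-image estimate; a full sampler would plug this into the
    # reverse-process posterior to obtain x_{t-1}.
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_guided) / torch.sqrt(alpha_bar_t)

# Toy usage with stand-in networks.
denoiser = lambda x, t: torch.zeros_like(x)
segmenter = torch.nn.Conv2d(3, 19, kernel_size=1)
x = torch.randn(1, 3, 32, 32)
labels = torch.randint(0, 19, (1, 32, 32))
x0 = guided_denoise_step(x, 0, denoiser, segmenter, labels,
                         alpha_bar_t=torch.tensor(0.5))
```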

A Generative Approach for Image Registration of Visible-Thermal (VT) Cancer Faces

  • paper_url: http://arxiv.org/abs/2308.12271
  • repo_url: None
  • paper_authors: Catherine Ordun, Alexandra Cha, Edward Raff, Sanjay Purushotham, Karen Kwok, Mason Rule, James Gulley
  • for: advance AI-based pain research using visible and thermal facial images of cancer patients collected by the NIH
  • methods: applies and modifies a generative alignment algorithm to register Visible-Thermal (VT) face images, without needing a reference or alignment parameters, to correct misalignment caused by differing camera angles between the thermal and visible sensors
  • results: registering the VT faces improves the quality of thermal images produced by the downstream Visible-to-Thermal (V2T) translation task by up to 52.5% compared with no registration
    Abstract Since thermal imagery offers a unique modality to investigate pain, the U.S. National Institutes of Health (NIH) has collected a large and diverse set of cancer patient facial thermograms for AI-based pain research. However, differing angles from camera capture between thermal and visible sensors have led to misalignment between Visible-Thermal (VT) images. We modernize the classic computer vision task of image registration by applying and modifying a generative alignment algorithm to register VT cancer faces, without the need for a reference or alignment parameters. By registering VT faces, we demonstrate that the quality of thermal images produced in the generative AI downstream task of Visible-to-Thermal (V2T) image translation significantly improves, by up to 52.5\%, relative to using unregistered images. Images in this paper have been approved by the NIH NCI for public dissemination.

MolGrapher: Graph-based Visual Recognition of Chemical Structures

  • paper_url: http://arxiv.org/abs/2308.12234
  • repo_url: https://github.com/ds4sd/molgrapher
  • paper_authors: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valery Weber, Ingmar Meijer, Peter Staar, Fisher Yu
  • for: accelerate the automatic analysis of chemical literature to aid the discovery of new materials and drugs
  • methods: a deep keypoint detector locates atoms, candidate atoms and bonds are placed as nodes in a graph, and a Graph Neural Network classifies the atom and bond nodes; a synthetic data generation pipeline addresses the lack of real training data
  • results: extensive experiments on five datasets show the approach significantly outperforms classical and learning-based methods in most settings; a large-scale benchmark of annotated real molecule images, USPTO-30K, is also introduced
    Abstract The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diversity of drawing styles, and the need for training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and put them in a graph. This construct allows a natural graph representation of the molecule. Last, we classify atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline producing diverse and realistic results. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available.
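
A toy sketch of the graph construction described above, assuming atom keypoints are already detected: nearby atom pairs become candidate bond nodes, each bond node is wired to its two endpoint atoms, and a single mean-aggregation message-passing layer classifies all nodes. The distance threshold, node features, and network are illustrative, not MolGrapher's.

```python
import torch
import torch.nn as nn

def build_atom_bond_graph(keypoints: torch.Tensor, max_dist: float = 1.5):
    """keypoints: (N, 2) atom locations. Returns node features and directed
    edges, where both atoms and candidate bonds are nodes and each bond node
    connects to its two endpoint atoms."""
    n = keypoints.size(0)
    feats, edges = [keypoints[i] for i in range(n)], []
    for i in range(n):
        for j in range(i + 1, n):
            if torch.dist(keypoints[i], keypoints[j]) <= max_dist:
                bond_idx = len(feats)
                feats.append((keypoints[i] + keypoints[j]) / 2)  # bond node feature
                edges += [(bond_idx, i), (bond_idx, j), (i, bond_idx), (j, bond_idx)]
    return torch.stack(feats), torch.tensor(edges, dtype=torch.long)

class TinyGNN(nn.Module):
    def __init__(self, in_dim=2, hidden=32, n_classes=5):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.msg = nn.Linear(hidden, hidden)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, x, edges):
        h = torch.relu(self.enc(x))
        # Mean aggregation of messages from neighbours.
        agg = torch.zeros_like(h).index_add_(0, edges[:, 1], self.msg(h)[edges[:, 0]])
        deg = torch.zeros(h.size(0)).index_add_(
            0, edges[:, 1], torch.ones(edges.size(0))).clamp(min=1)
        return self.cls(torch.relu(h + agg / deg.unsqueeze(1)))

atoms = torch.tensor([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
x, edges = build_atom_bond_graph(atoms)
logits = TinyGNN()(x, edges)    # one class prediction per atom/bond node
```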

SPPNet: A Single-Point Prompt Network for Nuclei Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.12231
  • repo_url: https://github.com/xq141839/sppnet
  • paper_authors: Qing Xu, Wenwei Kuang, Zeyu Zhang, Xueyao Bao, Haoran Chen, Wenting Duan
  • for: a single-point prompt network (SPPNet) for nuclei image segmentation that avoids the heavy parameter count and training cost of current segment-anything-style models
  • methods: replaces the original image encoder with a lightweight vision transformer (ViT), adds an effective parallel convolutional block to recover the low-level semantic information lost with the smaller encoder, and introduces a new Gaussian-kernel-based point-sampling method
  • results: on the MoNuSeg-2018 dataset, SPPNet outperforms existing U-shape architectures and converges faster in training; compared with the segment anything model, inference is roughly 20 times faster with about 1/70 of the parameters and computational cost, and only one set of points is needed in both training and inference, which suits clinical use
    Abstract Image segmentation plays an essential role in nuclei image analysis. Recently, the segment anything model has made a significant breakthrough in such tasks. However, the current model has two major issues for cell segmentation: (1) the image encoder of the segment anything model involves a large number of parameters. Retraining or even fine-tuning the model still requires expensive computational resources. (2) in point prompt mode, points are sampled from the center of the ground truth and more than one set of points is expected to achieve reliable performance, which is not efficient for practical applications. In this paper, a single-point prompt network is proposed for nuclei image segmentation, called SPPNet. We replace the original image encoder with a lightweight vision transformer. Also, an effective convolutional block is added in parallel to extract the low-level semantic information from the image and compensate for the performance degradation due to the small image encoder. We propose a new point-sampling method based on the Gaussian kernel. The proposed model is evaluated on the MoNuSeg-2018 dataset. The results demonstrate that SPPNet outperforms existing U-shape architectures and shows faster convergence in training. Compared to the segment anything model, SPPNet shows roughly 20 times faster inference, with 1/70 parameters and computational cost. Particularly, only one set of points is required in both the training and inference phases, which is more reasonable for clinical applications. The code for our work and more technical details can be found at https://github.com/xq141839/SPPNet.
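
A small sketch of a Gaussian-kernel point-sampling rule of the kind the abstract mentions: a single prompt point is drawn from inside the ground-truth nucleus mask with probability decaying as a Gaussian of the distance to the mask centroid. The kernel width and centroid choice are assumptions, not necessarily SPPNet's exact scheme.

```python
import numpy as np

def sample_gaussian_point(mask: np.ndarray, sigma: float = 5.0,
                          rng=None) -> tuple:
    """Sample one (row, col) prompt point inside a binary mask, with
    probability falling off as a Gaussian of the distance to the centroid."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    weights = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    idx = rng.choice(len(ys), p=weights / weights.sum())
    return int(ys[idx]), int(xs[idx])

# Toy usage: a circular nucleus mask.
yy, xx = np.mgrid[:64, :64]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 < 15 ** 2
point = sample_gaussian_point(mask)   # single point prompt for the segmenter
```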