cs.CV - 2023-11-09

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

  • paper_url: http://arxiv.org/abs/2311.05779
  • repo_url: https://github.com/gtziafas/ocid-vlg
  • paper_authors: Georgios Tziafas, Yucheng Xu, Arushi Goel, Mohammadreza Kasaei, Zhibin Li, Hamidreza Kasaei
  • for: The paper aims to integrate visual grounding and grasping capabilities so that robots operating in human-centric environments can manipulate objects effectively according to user instructions.
  • methods: The paper proposes a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs.
  • results: Experiments show that CROG achieves significant improvements over existing multi-stage pipelines on a challenging benchmark of cluttered indoor scenes, and robot experiments in both simulation and on hardware demonstrate its effectiveness.
    Abstract Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated in private datasets or simulators that do not capture the complexity of natural indoor scenes. To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs. Our results show that vanilla integration of CLIP with pretrained models transfers poorly in our challenging benchmark, while CROG achieves significant improvements both in terms of grounding and grasping. Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object grasping scenarios that include clutter.
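A minimal sketch of the kind of CLIP-based grounding that CROG builds on: scoring a referring expression against candidate object crops with off-the-shelf CLIP. This is an illustrative baseline only, not the CROG architecture; the image path and crop boxes are placeholders.

```python
# Sketch: rank candidate object crops in a cluttered scene by a referring expression
# using off-the-shelf CLIP. Illustrative baseline only, not the CROG model itself.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

scene = Image.open("scene.png")                       # placeholder path
boxes = [(10, 20, 120, 140), (150, 30, 260, 160)]     # placeholder candidate boxes (x1, y1, x2, y2)
crops = torch.stack([preprocess(scene.crop(b)) for b in boxes]).to(device)
text = clip.tokenize(["the red mug next to the keyboard"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(crops)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = (img_feat @ txt_feat.T).squeeze(-1)      # cosine similarity per crop

best = scores.argmax().item()                         # crop most likely referred to
print(f"referred box: {boxes[best]}")
```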

PolyMaX: General Dense Prediction with Mask Transformer

  • paper_url: http://arxiv.org/abs/2311.05770
  • repo_url: https://github.com/google-research/deeplab2
  • paper_authors: Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen
  • for: The paper is written for dense prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction.
  • methods: The paper proposes a method based on the cluster-prediction paradigm, which is inspired by the success of DORN and AdaBins in depth estimation. The method discretizes the continuous output space and unifies dense prediction tasks with the mask transformer framework.
  • results: The proposed method, PolyMaX, demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset.
    Abstract Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation task, the community has been witnessing a shift of paradigm from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly the mask transformers, which directly predicts a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. This allows us to unify dense prediction tasks with the mask transformer framework. Remarkably, the resulting model PolyMaX demonstrates state-of-the-art performance on three benchmarks of NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. Code and model will be made available.
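The cluster-prediction idea for continuous outputs can be illustrated with the bin-based depth formulation that DORN and AdaBins popularized: discretize the depth range into bins and recover a continuous value as a probability-weighted sum of bin centers. The sketch below is a generic illustration of that discretization, not PolyMaX's mask-transformer formulation; the bin count and depth range are assumptions.

```python
# Sketch: turning continuous depth regression into per-bin classification and back.
# Generic DORN/AdaBins-style discretization, not the PolyMaX architecture itself.
import torch

K, d_min, d_max = 64, 0.1, 10.0                     # number of bins and depth range (assumed)
edges = torch.linspace(d_min, d_max, K + 1)
centers = 0.5 * (edges[:-1] + edges[1:])            # one representative depth per bin

def depth_to_bins(depth):                           # (H, W) depth -> (H, W) bin indices
    return torch.bucketize(depth.clamp(d_min, d_max), edges[1:-1])

def bins_to_depth(logits):                          # (K, H, W) logits -> (H, W) depth
    probs = logits.softmax(dim=0)
    return (probs * centers.view(K, 1, 1)).sum(dim=0)  # soft expectation over bin centers

depth = torch.rand(480, 640) * (d_max - d_min) + d_min
logits = torch.randn(K, 480, 640)                   # stand-in for network predictions
print(depth_to_bins(depth).shape, bins_to_depth(logits).shape)
```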

GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2311.05729
  • repo_url: https://github.com/hlr/gipcol
  • paper_authors: Guangyue Xu, Joyce Chai, Parisa Kordjamshidi
  • for: The paper aims to better exploit the compositional zero-shot learning (CZSL) ability of vision-language models (VLMs) within the prompt-based learning framework.
  • methods: The paper proposes Graph-Injected Soft Prompting for COmpositional Learning (GIPCOL), in which the soft prompt is structured as prefix learnable vectors plus an attribute label and an object label. The attribute and object labels are additionally treated as nodes in a compositional graph constructed from the compositional structure of objects and attributes in the training data, which feeds updated concept representations back into the soft prompt.
  • results: Compared with previous non-CLIP and CLIP-based methods, GIPCOL achieves state-of-the-art AUC on the MIT-States, UT-Zappos, and C-GQA datasets in both closed and open settings. An analysis of when and why GIPCOL works well, given the CLIP backbone and its training data limitations, provides guidance for designing more effective prompts for CZSL.
    Abstract Pre-trained vision-language models (VLMs) have achieved promising success in many fields, especially with prompt learning paradigm. In this work, we propose GIP-COL (Graph-Injected Soft Prompting for COmpositional Learning) to better explore the compositional zero-shot learning (CZSL) ability of VLMs within the prompt-based learning framework. The soft prompt in GIPCOL is structured and consists of the prefix learnable vectors, attribute label and object label. In addition, the attribute and object labels in the soft prompt are designated as nodes in a compositional graph. The compositional graph is constructed based on the compositional structure of the objects and attributes extracted from the training data and consequently feeds the updated concept representation into the soft prompt to capture this compositional structure for a better prompting for CZSL. With the new prompting strategy, GIPCOL achieves state-of-the-art AUC results on all three CZSL benchmarks, including MIT-States, UT-Zappos, and C-GQA datasets in both closed and open settings compared to previous non-CLIP as well as CLIP-based methods. We analyze when and why GIPCOL operates well given the CLIP backbone and its training data limitations, and our findings shed light on designing more effective prompts for CZSL
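A rough sketch of the structured soft prompt described above: learnable prefix vectors concatenated with attribute and object embeddings. The compositional-graph update is stubbed out with a linear layer here, and the dimensions and vocabulary sizes are assumptions, not the authors' exact configuration.

```python
# Sketch: structured soft prompt = [learnable prefix | attribute node | object node].
# The compositional-graph update is stubbed; dims and layer choices are assumptions.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_prefix=3, dim=512, n_attrs=115, n_objs=245):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_prefix, dim) * 0.02)  # learnable prefix vectors
        self.attr_emb = nn.Embedding(n_attrs, dim)                     # attribute concept nodes
        self.obj_emb = nn.Embedding(n_objs, dim)                       # object concept nodes
        self.gnn = nn.Linear(dim, dim)                                 # placeholder for the graph update

    def forward(self, attr_id, obj_id):
        attr = self.gnn(self.attr_emb(attr_id))    # graph-refined attribute representation
        obj = self.gnn(self.obj_emb(obj_id))       # graph-refined object representation
        # (B, n_prefix + 2, dim): fed to the text encoder in place of token embeddings
        return torch.cat([self.prefix.expand(attr.size(0), -1, -1),
                          attr.unsqueeze(1), obj.unsqueeze(1)], dim=1)

prompt = SoftPrompt()(torch.tensor([3]), torch.tensor([7]))
print(prompt.shape)  # torch.Size([1, 5, 512])
```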

Whole-body Detection, Recognition and Identification at Altitude and Range

  • paper_url: http://arxiv.org/abs/2311.05725
  • repo_url: None
  • paper_authors: Siyuan Huang, Ram Prabhakar Kathirvel, Chun Pong Lau, Rama Chellappa
  • for: The paper addresses whole-body biometric detection, recognition, and identification at distances of up to 500 m and pitch angles of up to 50 degrees.
  • methods: The authors propose an end-to-end system in which the detector is pre-trained on common image datasets and fine-tuned on the BRIAR dataset; after detection, body images are cropped and passed to a feature extractor for recognition.
  • results: Thorough evaluations across indoor, outdoor, and aerial scenarios at various ranges and angles show strong recognition accuracy and true acceptance rate at low false acceptance rates. On a test set of 100 subjects with 444 distractors, the model achieves a rank-20 recognition accuracy of 75.13% and a TAR@1%FAR of 54.09%.
    Abstract In this paper, we address the challenging task of whole-body biometric detection, recognition, and identification at distances of up to 500m and large pitch angles of up to 50 degree. We propose an end-to-end system evaluated on diverse datasets, including the challenging Biometric Recognition and Identification at Range (BRIAR) dataset. Our approach involves pre-training the detector on common image datasets and fine-tuning it on BRIAR's complex videos and images. After detection, we extract body images and employ a feature extractor for recognition. We conduct thorough evaluations under various conditions, such as different ranges and angles in indoor, outdoor, and aerial scenarios. Our method achieves an average F1 score of 98.29% at IoU = 0.7 and demonstrates strong performance in recognition accuracy and true acceptance rate at low false acceptance rates compared to existing models. On a test set of 100 subjects with 444 distractors, our model achieves a rank-20 recognition accuracy of 75.13% and a TAR@1%FAR of 54.09%.
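The TAR@1%FAR reported above can be computed from genuine and impostor similarity scores as in the generic sketch below; the synthetic score distributions are placeholders, not the authors' evaluation code.

```python
# Sketch: true acceptance rate at a fixed false acceptance rate (TAR@FAR),
# computed from verification scores. Synthetic scores for illustration only.
import numpy as np

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.15, 5000)    # same-identity similarity scores (assumed)
impostor = rng.normal(0.3, 0.15, 50000)  # different-identity similarity scores (assumed)

def tar_at_far(genuine, impostor, far=0.01):
    # pick the threshold so that only `far` of impostor scores are accepted
    thr = np.quantile(impostor, 1.0 - far)
    return (genuine >= thr).mean(), thr

tar, thr = tar_at_far(genuine, impostor, far=0.01)
print(f"TAR@1%FAR = {tar:.4f} at threshold {thr:.3f}")
```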

Intelligent Cervical Spine Fracture Detection Using Deep Learning Methods

  • paper_url: http://arxiv.org/abs/2311.05708
  • repo_url: None
  • paper_authors: Reza Behbahani Nejad, Amir Hossein Komijani, Esmaeil Najafi
  • for: Detection of cervical spine fractures in computed tomography images.
  • methods: A two-stage pipeline: a multi-input neural network based on the Global Context Vision Transformer that combines each image with its metadata to identify slices containing cervical vertebrae, followed by a YOLOv8 model (compared against YOLOv5) to localize fractures.
  • results: The proposed algorithm improves fracture detection accuracy and reduces the workload of radiologists.
    Abstract Cervical spine fractures constitute a critical medical emergency, with the potential for lifelong paralysis or even fatality if left untreated or undetected. Over time, these fractures can deteriorate without intervention. To address the lack of research on the practical application of deep learning techniques for the detection of spine fractures, this study leverages a dataset containing both cervical spine fractures and non-fractured computed tomography images. This paper introduces a two-stage pipeline designed to identify the presence of cervical vertebrae in each image slice and pinpoint the location of fractures. In the first stage, a multi-input network, incorporating image and image metadata, is trained. This network is based on the Global Context Vision Transformer, and its performance is benchmarked against popular deep learning image classification model. In the second stage, a YOLOv8 model is trained to detect fractures within the images, and its effectiveness is compared to YOLOv5. The obtained results indicate that the proposed algorithm significantly reduces the workload of radiologists and enhances the accuracy of fracture detection.
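For the second stage, a typical YOLOv8 training/inference call through the ultralytics package looks roughly like the sketch below; the dataset YAML, weights, image path, and hyperparameters are placeholders, not the authors' released code.

```python
# Sketch: fine-tuning and running YOLOv8 for fracture localization with the
# ultralytics package. Dataset config, paths, and hyperparameters are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                       # pretrained detector weights
model.train(data="cspine_fractures.yaml",        # hypothetical dataset config
            epochs=100, imgsz=640, batch=16)

results = model("slice_0421.png", conf=0.25)     # run on a CT slice (placeholder path)
for r in results:
    print(r.boxes.xyxy, r.boxes.conf)            # predicted fracture boxes and scores
```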

FMViT: A multiple-frequency mixing Vision Transformer

  • paper_url: http://arxiv.org/abs/2311.05707
  • repo_url: None
  • paper_authors: Wei Tan, Yifeng Geng, Xuansong Xie
  • for: An efficient vision backbone that improves the latency/accuracy trade-off for computer vision tasks in practical deployment scenarios.
  • methods: The paper proposes an efficient hybrid CNN-Transformer architecture, FMViT, which blends high-frequency and low-frequency features with varying frequencies and introduces deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and the Convolutional Fusion Block (CFB) to improve expressive power and reduce computational overhead.
  • results: FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrids in latency/accuracy trade-offs on TensorRT and CoreML. On ImageNet, it outperforms ResNet101 by 2.5% top-1 accuracy (83.3% vs. 80.8%) on TensorRT at similar inference latency, matches EfficientNet-B5 with a 43% improvement in inference speed, and on CoreML outperforms MobileOne by 2.6% top-1 accuracy (78.5% vs. 75.9%) at comparable latency.
    Abstract The transformer model has gained widespread adoption in computer vision tasks in recent times. However, due to the quadratic time and memory complexity of self-attention, which is proportional to the number of input tokens, most existing Vision Transformers (ViTs) encounter challenges in achieving efficient performance in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent attempts have been made to design CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To tackle these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model's expressive power by blending high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNNTransformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms Resnet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves comparable performance with EfficientNet-B5, but with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% in top-1 accuracy on the ImageNet dataset, with inference latency comparable to MobileOne (78.5% vs. 75.9%). Our code can be found at https://github.com/tany0699/FMViT.
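The high-/low-frequency mixing idea can be sketched generically: a local convolution branch keeps high-frequency detail while a downsample-upsample branch captures low-frequency context, and the two are fused. This is only an illustrative pattern, not FMViT's actual gMLP/RLMHSA/CFB blocks; all layer choices here are assumptions.

```python
# Sketch: mixing a high-frequency (local conv) branch with a low-frequency
# (downsample-upsample) branch. Illustrative pattern only, not FMViT's blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FreqMixBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.high = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # local detail
        self.low = nn.Conv2d(channels, channels, 1)                               # smooth/global context
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        hi = self.high(x)
        lo = F.avg_pool2d(x, 4)                       # keep only low-frequency content
        lo = F.interpolate(self.low(lo), size=x.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([hi, lo], dim=1))

print(FreqMixBlock(64)(torch.randn(1, 64, 56, 56)).shape)
```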

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

  • paper_url: http://arxiv.org/abs/2311.05698
  • repo_url: None
  • paper_authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
  • for: The paper addresses a central challenge of multimodal learning: combining heterogeneous modalities (video, audio, text) that differ in rate, volume, and temporal alignment.
  • methods: The authors decouple multimodal modeling into two focused autoregressive components: one for the time-synchronized modalities (audio and video) and one for context modalities that are sequential but not time-aligned. A Combiner mechanism jointly extracts audio and video features per snippet and fuses them into compact yet expressive representations.
  • results: The approach achieves state-of-the-art results on well-established multimodal benchmarks, outperforming much larger models, while effectively controlling the computational cost of media inputs and modeling their temporal dependencies.
    Abstract One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs, we propose to further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features producing compact but expressive representations per snippet. Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by both learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
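A toy sketch of the snippet-wise processing described above: time-aligned audio-video features are split into consecutive snippets and a combiner compresses each snippet into a few latent vectors before autoregressive modeling. This mirrors the general idea only; the actual Combiner design, sizes, and training are not reproduced here.

```python
# Sketch: partition time-aligned audio-video features into snippets and compress
# each snippet into a few latents. Illustrative of the idea, not Mirasol3B itself.
import torch
import torch.nn as nn

T, D = 128, 256                      # total timesteps and feature dim (assumed)
snippet_len, n_latents = 16, 4       # snippet size and latents per snippet (assumed)

av_feats = torch.randn(1, T, D)      # fused audio-video features over time
snippets = av_feats.view(1, T // snippet_len, snippet_len, D)   # (1, 8, 16, D)

combiner = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
query = nn.Parameter(torch.randn(n_latents, D))

compact = []
for s in snippets.unbind(dim=1):                     # process snippets one by one
    ctx = combiner(torch.cat([query.unsqueeze(0), s], dim=1))
    compact.append(ctx[:, :n_latents])               # keep only the learned latents
compact = torch.stack(compact, dim=1)                # (1, 8, 4, D): short sequence for the AR model
print(compact.shape)
```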

3DGAUnet: 3D generative adversarial networks with a 3D U-Net based generator to achieve the accurate and effective synthesis of clinical tumor image data for pancreatic cancer

  • paper_url: http://arxiv.org/abs/2311.05697
  • repo_url: None
  • paper_authors: Yu Shi, Hannah Tang, Michael Baine, Michael A. Hollingsworth, Huijing Du, Dandan Zheng, Chi Zhang, Hongfeng Yu
  • for: The study aims to develop a generative model that synthesizes realistic 3D CT images of PDAC tumors and pancreatic tissue, to improve detection and diagnosis.
  • methods: The model, 3DGAUnet, is a generative adversarial network (GAN) with a 3D U-Net based generator; it produces volumetric CT data with the inter-slice connectivity that existing 2D CT synthesis models lack.
  • results: The model generates high-quality 3D CT images that can alleviate data scarcity, improve PDAC tumor detection and diagnosis, and thereby support earlier detection and better patient outcomes.
    Abstract Pancreatic ductal adenocarcinoma (PDAC) presents a critical global health challenge, and early detection is crucial for improving the 5-year survival rate. Recent medical imaging and computational algorithm advances offer potential solutions for early diagnosis. Deep learning, particularly in the form of convolutional neural networks (CNNs), has demonstrated success in medical image analysis tasks, including classification and segmentation. However, the limited availability of clinical data for training purposes continues to provide a significant obstacle. Data augmentation, generative adversarial networks (GANs), and cross-validation are potential techniques to address this limitation and improve model performance, but effective solutions are still rare for 3D PDAC, where contrast is especially poor owing to the high heterogeneity in both tumor and background tissues. In this study, we developed a new GAN-based model, named 3DGAUnet, for generating realistic 3D CT images of PDAC tumors and pancreatic tissue, which can generate the interslice connection data that the existing 2D CT image synthesis models lack. Our innovation is to develop a 3D U-Net architecture for the generator to improve shape and texture learning for PDAC tumors and pancreatic tissue. Our approach offers a promising path to tackle the urgent requirement for creative and synergistic methods to combat PDAC. The development of this GAN-based model has the potential to alleviate data scarcity issues, elevate the quality of synthesized data, and thereby facilitate the progression of deep learning models to enhance the accuracy and early detection of PDAC tumors, which could profoundly impact patient outcomes. Furthermore, this model has the potential to be adapted to other types of solid tumors, hence making significant contributions to the field of medical imaging in terms of image processing models.
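A minimal 3D convolutional building block of the kind a 3D U-Net generator is assembled from; a generic sketch whose channel counts, normalization, and the full encoder-decoder with skip connections are assumptions, not the paper's exact design.

```python
# Sketch: a basic 3D conv block of the sort used to build a 3D U-Net generator.
# Generic illustration; not the exact 3DGAUnet architecture.
import torch
import torch.nn as nn

class ConvBlock3D(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

vol = torch.randn(1, 1, 64, 64, 64)          # a synthetic CT sub-volume (size assumed)
print(ConvBlock3D(1, 32)(vol).shape)          # torch.Size([1, 32, 64, 64, 64])
```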

Window Attention is Bugged: How not to Interpolate Position Embeddings

  • paper_url: http://arxiv.org/abs/2311.05613
  • repo_url: None
  • paper_authors: Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer
  • for: The paper improves modern transformer-based computer vision models by fixing an error in how position embeddings are interpolated when window attention is used.
  • methods: The paper studies models that combine window attention, position embeddings, and high-resolution finetuning, and introduces a simple absolute window position embedding strategy that fixes the interpolation bug.
  • results: With the proposed "absolute win" bug fix, the resulting model reaches 61.7 box mAP on COCO, state-of-the-art among models that use only ImageNet-1k pretraining.
    Abstract Window attention, position embeddings, and high resolution finetuning are core concepts in the modern transformer era of computer vision. However, we find that naively combining these near ubiquitous components can have a detrimental effect on performance. The issue is simple: interpolating position embeddings while using window attention is wrong. We study two state-of-the-art methods that have these three components, namely Hiera and ViTDet, and find that both do indeed suffer from this bug. To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet. We finally combine the two to obtain HieraDet, which achieves 61.7 box mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k pretraining. This all stems from what is essentially a 3 line bug fix, which we name "absolute win".
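The bug and fix can be sketched in a few lines: interpolating a global position embedding and then windowing it gives each window a different, resolution-dependent slice of the embedding, whereas an absolute window position embedding is defined per window and simply tiled. The code below illustrates this contrast generically; it is not the Hiera/ViTDet implementation, and the sizes are assumptions.

```python
# Sketch: why interpolating a global position embedding clashes with window
# attention, and the "absolute window" fix of tiling a per-window embedding.
# Generic illustration, not the Hiera/ViTDet code.
import torch
import torch.nn.functional as F

D = 96
pretrain_size, window = 56, 8                 # pretraining feature map and window size (assumed)
finetune_size = 64                            # higher-resolution finetuning feature map

# Buggy pattern: one global embedding, interpolated to the new resolution, then windowed.
global_pe = torch.randn(1, D, pretrain_size, pretrain_size)
interp_pe = F.interpolate(global_pe, size=(finetune_size, finetune_size),
                          mode="bicubic", align_corners=False)
# After interpolation, each 8x8 window sees a *different* slice of the embedding
# than it did at pretraining resolution, so the learned per-window pattern is broken.

# Fix: a window-sized absolute embedding that is tiled, so every window sees the
# same embedding at any resolution (optionally combined with a coarse global term).
window_pe = torch.randn(1, D, window, window)
tiled_pe = window_pe.repeat(1, 1, finetune_size // window, finetune_size // window)
print(interp_pe.shape, tiled_pe.shape)        # both (1, 96, 64, 64)
```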

What Do I Hear? Generating Sounds for Visuals with ChatGPT

  • paper_url: http://arxiv.org/abs/2311.05609
  • repo_url: None
  • paper_authors: David Chuan-En Lin, Nikolas Martelaro
  • for: The paper introduces a workflow for generating realistic soundscapes for visual media. Unlike prior work that focuses on matching sounds to on-screen visuals, the approach also suggests sounds that are not immediately visible but are essential to a convincing and immersive auditory environment.
  • methods: The workflow consists of creating a scene context, brainstorming sounds, and generating the sounds, leveraging the reasoning capabilities of language models such as ChatGPT.
  • results: The authors report that the workflow can produce convincing soundscape audio and help creators plan and craft auditory environments.
    Abstract This short paper introduces a workflow for generating realistic soundscapes for visual media. In contrast to prior work, which primarily focus on matching sounds for on-screen visuals, our approach extends to suggesting sounds that may not be immediately visible but are essential to crafting a convincing and immersive auditory environment. Our key insight is leveraging the reasoning capabilities of language models, such as ChatGPT. In this paper, we describe our workflow, which includes creating a scene context, brainstorming sounds, and generating the sounds.
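The brainstorming step can be approximated with a single chat-completion call like the sketch below; the prompt wording, model name, and client usage are assumptions for illustration, not the authors' exact workflow.

```python
# Sketch: asking a language model to brainstorm off-screen sounds for a scene.
# Prompt wording and model name are assumptions, not the paper's exact workflow.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

scene_context = ("Night-time street market in a coastal town; vendors closing up, "
                 "light rain starting, a ferry visible in the distance.")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You suggest sounds for film soundscapes."},
        {"role": "user", "content": (
            f"Scene: {scene_context}\n"
            "List 10 sounds that would make this scene feel real, including sounds "
            "whose sources are not visible on screen. One short phrase per line.")},
    ],
)
print(response.choices[0].message.content)
```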

3D-QAE: Fully Quantum Auto-Encoding of 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2311.05604
  • repo_url: None
  • paper_authors: Lakshika Rathi, Edith Tretschk, Christian Theobalt, Rishabh Dabral, Vladislav Golyanik
  • for: The paper introduces the first fully quantum auto-encoder for 3D point clouds (3D-QAE), used to produce compressed representations of 3D data.
  • methods: All data-processing components are designed for quantum hardware, and the model is trained on collections of 3D point clouds; the authors propose solutions to the core challenges of 3D data normalization and parameter optimization.
  • results: Experiments on simulated gate-based quantum hardware show that the method outperforms simple classical baselines, opening a new research direction for quantum 3D computer vision.
    Abstract Existing methods for learning 3D representations are deep neural networks trained and tested on classical hardware. Quantum machine learning architectures, despite their theoretically predicted advantages in terms of speed and the representational capacity, have so far not been considered for this problem nor for tasks involving 3D data in general. This paper thus introduces the first quantum auto-encoder for 3D point clouds. Our 3D-QAE approach is fully quantum, i.e. all its data processing components are designed for quantum hardware. It is trained on collections of 3D point clouds to produce their compressed representations. Along with finding a suitable architecture, the core challenges in designing such a fully quantum model include 3D data normalisation and parameter optimisation, and we propose solutions for both these tasks. Experiments on simulated gate-based quantum hardware demonstrate that our method outperforms simple classical baselines, paving the way for a new research direction in 3D computer vision. The source code is available at https://4dqv.mpi-inf.mpg.de/QAE3D/.
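One of the stated challenges, 3D data normalization for quantum processing, amounts to mapping point coordinates into a bounded range suitable for the chosen encoding. A simple classical preprocessing sketch follows; the range and the paper's exact normalization scheme are assumptions.

```python
# Sketch: normalizing a 3D point cloud so coordinates fall in a bounded range
# suitable for angle/amplitude-style encodings. The paper's exact scheme may differ.
import numpy as np

def normalize_points(points, lo=0.0, hi=np.pi):
    """Center the cloud, scale it into the unit sphere, then map to [lo, hi]."""
    centered = points - points.mean(axis=0)
    centered /= np.linalg.norm(centered, axis=1).max() + 1e-8   # fit inside the unit sphere
    unit = (centered + 1.0) / 2.0                               # now in [0, 1]
    return lo + unit * (hi - lo)

cloud = np.random.randn(1024, 3) * 5.0 + 2.0      # synthetic point cloud
norm = normalize_points(cloud)
print(norm.min(), norm.max())                     # within [0, pi]
```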

Reconstructing Objects in-the-wild for Realistic Sensor Simulation

  • paper_url: http://arxiv.org/abs/2311.05602
  • repo_url: None
  • paper_authors: Ze Yang, Sivabalan Manivasagam, Yun Chen, Jingkang Wang, Rui Hu, Raquel Urtasun
  • for: Reconstructing real-world objects and rendering them at novel views, to bring realism, diversity, and scale to simulation for robotics training and testing.
  • methods: NeuSim represents the object surface as a neural signed distance function and models appearance with a robust physics-inspired reflectance representation, leveraging both LiDAR and camera sensor data to reconstruct smooth, accurate geometry and normals.
  • results: NeuSim achieves strong view-synthesis performance even with sparse training views, and the reconstructed object assets can be composed into a virtual world to generate realistic multi-sensor data for evaluating self-driving perception models.
    Abstract Reconstructing objects from real world data and rendering them at novel views is critical to bringing realism, diversity and scale to simulation for robotics training and testing. In this work, we present NeuSim, a novel approach that estimates accurate geometry and realistic appearance from sparse in-the-wild data captured at distance and at limited viewpoints. Towards this goal, we represent the object surface as a neural signed distance function and leverage both LiDAR and camera sensor data to reconstruct smooth and accurate geometry and normals. We model the object appearance with a robust physics-inspired reflectance representation effective for in-the-wild data. Our experiments show that NeuSim has strong view synthesis performance on challenging scenarios with sparse training views. Furthermore, we showcase composing NeuSim assets into a virtual world and generating realistic multi-sensor data for evaluating self-driving perception models.
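A neural signed distance function of the kind mentioned above is, at its core, an MLP mapping a 3D point to a signed distance, with surface normals obtained from its gradient. A minimal sketch follows; NeuSim's positional encoding, reflectance model, and LiDAR/camera losses are omitted, and the layer sizes are assumptions.

```python
# Sketch: a minimal neural signed distance function (point -> signed distance).
# NeuSim's positional encoding, reflectance model, and sensor losses are omitted.
import torch
import torch.nn as nn

class NeuralSDF(nn.Module):
    def __init__(self, hidden=256, layers=4):
        super().__init__()
        dims = [3] + [hidden] * layers
        blocks = []
        for i in range(layers):
            blocks += [nn.Linear(dims[i], dims[i + 1]), nn.Softplus(beta=100)]
        self.net = nn.Sequential(*blocks, nn.Linear(hidden, 1))

    def forward(self, xyz):
        return self.net(xyz)              # signed distance; the zero level set is the surface

sdf = NeuralSDF()
pts = torch.rand(4096, 3, requires_grad=True) * 2 - 1
d = sdf(pts)
normals = torch.autograd.grad(d.sum(), pts, create_graph=True)[0]  # normals from the SDF gradient
print(d.shape, normals.shape)
```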

SigScatNet: A Siamese + Scattering based Deep Learning Approach for Signature Forgery Detection and Similarity Assessment

  • paper_url: http://arxiv.org/abs/2311.05579
  • repo_url: None
  • paper_authors: Anmol Chokshi, Vansh Jain, Rajas Bhope, Sudhir Dhage
  • for: The paper develops a solution for detecting signature forgery and assessing signature similarity, addressing the widespread inconvenience and challenges caused by counterfeit signatures.
  • methods: SigScatNet combines a Siamese deep learning network with Scattering wavelets, validating and comparing signatures through a comprehensive similarity index; the Scattering wavelets make the model efficient and light enough to run on cost-effective hardware.
  • results: SigScatNet detects forgeries and quantifies signature similarity with low Equal Error Rates and high computational efficiency: an EER of 3.689% on the ICDAR SigComp Dutch dataset and 0.0578% on the CEDAR dataset, setting a new state of the art for signature analysis.
    Abstract The surge in counterfeit signatures has inflicted widespread inconveniences and formidable challenges for both individuals and organizations. This groundbreaking research paper introduces SigScatNet, an innovative solution to combat this issue by harnessing the potential of a Siamese deep learning network, bolstered by Scattering wavelets, to detect signature forgery and assess signature similarity. The Siamese Network empowers us to ascertain the authenticity of signatures through a comprehensive similarity index, enabling precise validation and comparison. Remarkably, the integration of Scattering wavelets endows our model with exceptional efficiency, rendering it light enough to operate seamlessly on cost-effective hardware systems. To validate the efficacy of our approach, extensive experimentation was conducted on two open-sourced datasets: the ICDAR SigComp Dutch dataset and the CEDAR dataset. The experimental results demonstrate the practicality and resounding success of our proposed SigScatNet, yielding an unparalleled Equal Error Rate of 3.689% with the ICDAR SigComp Dutch dataset and an astonishing 0.0578% with the CEDAR dataset. Through the implementation of SigScatNet, our research spearheads a new state-of-the-art in signature analysis in terms of EER scores and computational efficiency, offering an advanced and accessible solution for detecting forgery and quantifying signature similarities. By employing cutting-edge Siamese deep learning and Scattering wavelets, we provide a robust framework that paves the way for secure and efficient signature verification systems.
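The Equal Error Rate reported above is the operating point where the false acceptance and false rejection rates coincide; it can be computed from the ROC curve as in this generic sketch (synthetic scores, not the authors' evaluation code).

```python
# Sketch: Equal Error Rate (EER) from verification scores via the ROC curve.
# Synthetic scores for illustration; not the SigScatNet evaluation code.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(2000), np.zeros(2000)])           # 1 = genuine pair, 0 = forgery
scores = np.concatenate([rng.normal(0.8, 0.1, 2000), rng.normal(0.4, 0.1, 2000)])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))                              # point where FAR == FRR
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER = {eer * 100:.3f}% at threshold {thresholds[idx]:.3f}")
```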

Exploring Emotion Expression Recognition in Older Adults Interacting with a Virtual Coach

  • paper_url: http://arxiv.org/abs/2311.05567
  • repo_url: None
  • paper_authors: Cristina Palmero, Mikel deVelasco, Mohamed Amine Hmani, Aymen Mtibaa, Leila Ben Letaifa, Pau Buch-Cardona, Raquel Justo, Terry Amorese, Eduardo González-Fraile, Begoña Fernández-Ruanova, Jofre Tenorio-Laranga, Anna Torp Johansen, Micaela Rodrigues da Silva, Liva Jenny Martinussen, Maria Stylianou Korsnes, Gennaro Cordasco, Anna Esposito, Mounim A. El-Yacoubi, Dijana Petrovska-Delacrétaz, M. Inés Torres, Sergio Escalera
  • for: This paper aims to develop an emotionally expressive virtual coach for healthy seniors to improve well-being and promote independent aging.
  • methods: The paper outlines the development of the emotion expression recognition module of the virtual coach, including data collection, annotation design, and a first methodological approach. The study uses various modalities such as speech from audio and facial expressions, gaze, and head dynamics from video to recognize emotional expressions.
  • results: The study found that the modalities studied were informative for the emotional categories considered, with multimodal methods generally outperforming others. The results are expected to contribute to the limited literature on emotion recognition applied to older adults in conversational human-machine interaction.
    Abstract The EMPATHIC project aimed to design an emotionally expressive virtual coach capable of engaging healthy seniors to improve well-being and promote independent aging. One of the core aspects of the system is its human sensing capabilities, allowing for the perception of emotional states to provide a personalized experience. This paper outlines the development of the emotion expression recognition module of the virtual coach, encompassing data collection, annotation design, and a first methodological approach, all tailored to the project requirements. With the latter, we investigate the role of various modalities, individually and combined, for discrete emotion expression recognition in this context: speech from audio, and facial expressions, gaze, and head dynamics from video. The collected corpus includes users from Spain, France, and Norway, and was annotated separately for the audio and video channels with distinct emotional labels, allowing for a performance comparison across cultures and label types. Results confirm the informative power of the modalities studied for the emotional categories considered, with multimodal methods generally outperforming others (around 68% accuracy with audio labels and 72-74% with video labels). The findings are expected to contribute to the limited literature on emotion recognition applied to older adults in conversational human-machine interaction.

High-Performance Transformers for Table Structure Recognition Need Early Convolutions

  • paper_url: http://arxiv.org/abs/2311.05565
  • repo_url: https://github.com/poloclub/tsr-convstem
  • paper_authors: ShengYun Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, Duen Horng Chau
  • for: The paper designs a lightweight visual encoder for table structure recognition (TSR) that improves training and inference speed without sacrificing expressive power.
  • methods: The paper replaces the classic CNN backbone with a convolutional stem, a much simpler model that matches CNN backbone performance by balancing a higher receptive-field ratio with a longer sequence length, letting it "see" an appropriate portion of the table while preserving enough context for the subsequent transformer.
  • results: Reproducible ablation studies show that the new visual encoder substantially reduces model parameters while preserving TSR performance; the code is open-sourced at https://github.com/poloclub/tsr-convstem to support further research and fair comparisons.
    Abstract Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://github.com/poloclub/tsr-convstem to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.
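A convolutional stem of the kind discussed above is essentially a short stack of stride-2 convolutions that turns the input image into a token sequence for the transformer. A generic sketch follows; the channel widths and depth are assumptions, not the paper's exact configuration.

```python
# Sketch: a convolutional stem that maps an image to a token sequence for a
# transformer decoder. Widths/depth are assumptions, not the paper's exact stem.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        chans = [3, 64, 128, 256, d_model]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                       nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
        self.stem = nn.Sequential(*layers)          # downsamples by 16x overall

    def forward(self, img):
        feat = self.stem(img)                        # (B, d_model, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)       # (B, sequence_length, d_model)

tokens = ConvStem()(torch.randn(1, 3, 448, 448))
print(tokens.shape)                                  # torch.Size([1, 784, 512])
```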

Disentangling Quantum and Classical Contributions in Hybrid Quantum Machine Learning Architectures

  • paper_url: http://arxiv.org/abs/2311.05559
  • repo_url: None
  • paper_authors: Michael Kölle, Jonas Maurer, Philipp Altmann, Leo Sünkel, Jonas Stein, Claudia Linnhoff-Popien
  • for: The paper examines hybrid transfer learning solutions that combine pre-trained classical models with variational quantum circuits, asking how much each component actually contributes to the results.
  • methods: The paper proposes a novel hybrid architecture that uses an autoencoder to compress the input data and feeds the compressed representation to the quantum component, and compares it against two state-of-the-art hybrid transfer learning architectures, two purely classical architectures, and one quantum architecture.
  • results: The results indicate that the classical components significantly influence classification in hybrid transfer learning, a contribution often mistakenly ascribed to the quantum element; across four datasets (Banknote Authentication, Breast Cancer Wisconsin, MNIST digits, and AudioMNIST), the model's accuracy is on par with a variational quantum circuit using amplitude embedding.
    Abstract Quantum computing offers the potential for superior computational capabilities, particularly for data-intensive tasks. However, the current state of quantum hardware puts heavy restrictions on input size. To address this, hybrid transfer learning solutions have been developed, merging pre-trained classical models, capable of handling extensive inputs, with variational quantum circuits. Yet, it remains unclear how much each component - classical and quantum - contributes to the model's results. We propose a novel hybrid architecture: instead of utilizing a pre-trained network for compression, we employ an autoencoder to derive a compressed version of the input data. This compressed data is then channeled through the encoder part of the autoencoder to the quantum component. We assess our model's classification capabilities against two state-of-the-art hybrid transfer learning architectures, two purely classical architectures and one quantum architecture. Their accuracy is compared across four datasets: Banknote Authentication, Breast Cancer Wisconsin, MNIST digits, and AudioMNIST. Our research suggests that classical components significantly influence classification in hybrid transfer learning, a contribution often mistakenly ascribed to the quantum element. The performance of our model aligns with that of a variational quantum circuit using amplitude embedding, positioning it as a feasible alternative.
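The compression step described above can be sketched as a small classical autoencoder whose bottleneck matches the number of features the quantum circuit can ingest. Latent size and layer widths are assumptions, and the quantum circuit itself is stubbed out.

```python
# Sketch: a classical autoencoder compressing inputs to a latent small enough for
# a variational quantum circuit. Sizes are assumptions; the quantum part is a stub.
import torch
import torch.nn as nn

class CompressingAE(nn.Module):
    def __init__(self, in_dim=784, latent=4):        # e.g. MNIST -> 4 features / qubits
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def quantum_classifier(z):
    # Stub standing in for the variational quantum circuit operating on the latent.
    return torch.sigmoid(z.sum(dim=-1, keepdim=True))

ae = CompressingAE()
x = torch.rand(32, 784)
recon, z = ae(x)
loss = nn.functional.mse_loss(recon, x)              # reconstruction objective for the AE
print(z.shape, quantum_classifier(z).shape, loss.item())
```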

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

  • paper_url: http://arxiv.org/abs/2311.05556
  • repo_url: https://github.com/luosiallen/latent-consistency-model
  • paper_authors: Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao
  • for: Fast, high-quality image generation: Latent Consistency Models (LCMs) cut the number of inference steps dramatically and are distilled from pre-trained latent diffusion models in only ~32 A100 GPU training hours.
  • methods: LoRA distillation is applied to Stable Diffusion models including SD-V1.5, SSD-1B, and SDXL, extending LCMs to larger models with significantly less memory consumption.
  • results: LCM distillation yields superior image generation quality, and the resulting LoRA parameters (LCM-LoRA) act as a universal Stable Diffusion acceleration module that can be plugged into fine-tuned models or other LoRAs without additional training.
    Abstract Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs), requiring only ~32 A100 GPU training hours. This report further extends LCMs' potential in two aspects: First, by applying LoRA distillation to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded LCM's scope to larger models with significantly less memory consumption, achieving superior image generation quality. Second, we identify the LoRA parameters obtained through LCM distillation as a universal Stable-Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, thus representing a universally applicable accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers such as DDIM, DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that possesses strong generalization abilities. Project page: https://github.com/luosiallen/latent-consistency-model.
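Typical usage with the diffusers library looks roughly like the sketch below: swap in the LCM scheduler, load the LCM-LoRA weights, and sample in a handful of steps. This follows the project's public examples; exact arguments may vary by library version.

```python
# Sketch: plugging LCM-LoRA into a Stable Diffusion pipeline with diffusers.
# Follows the project's public usage pattern; exact arguments may vary by version.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)   # LCM sampling
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")       # the acceleration module

image = pipe(
    "a photo of a lighthouse at sunset, highly detailed",
    num_inference_steps=4,      # LCMs need only a few steps
    guidance_scale=1.0,         # low CFG is recommended with LCM-LoRA
).images[0]
image.save("lighthouse.png")
```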

L-WaveBlock: A Novel Feature Extractor Leveraging Wavelets for Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2311.05548
  • repo_url: None
  • paper_authors: Mirat Shah, Vansh Jain, Anmol Chokshi, Guruprasad Parasnis, Pramod Bide
  • for: The paper proposes a novel feature extractor, L-WaveBlock, to improve the performance and convergence speed of GAN-based image generation.
  • methods: L-WaveBlock combines the Discrete Wavelet Transform (DWT) with deep learning, partitioning feature maps into orthogonal subbands across multiple scales to capture both structural and textural detail while accelerating generator convergence.
  • results: On three datasets (a road satellite imagery dataset, CelebA, and GoPro), L-WaveBlock leads to faster GAN convergence and achieves competitive results on each dataset.
    Abstract Generative Adversarial Networks (GANs) have risen to prominence in the field of deep learning, facilitating the generation of realistic data from random noise. The effectiveness of GANs often depends on the quality of feature extraction, a critical aspect of their architecture. This paper introduces L-WaveBlock, a novel and robust feature extractor that leverages the capabilities of the Discrete Wavelet Transform (DWT) with deep learning methodologies. L-WaveBlock is catered to quicken the convergence of GAN generators while simultaneously enhancing their performance. The paper demonstrates the remarkable utility of L-WaveBlock across three datasets, a road satellite imagery dataset, the CelebA dataset and the GoPro dataset, showcasing its ability to ease feature extraction and make it more efficient. By utilizing DWT, L-WaveBlock efficiently captures the intricate details of both structural and textural details, and further partitions feature maps into orthogonal subbands across multiple scales while preserving essential information at the same time. Not only does it lead to faster convergence, but also gives competent results on every dataset by employing the L-WaveBlock. The proposed method achieves an Inception Score of 3.6959 and a Structural Similarity Index of 0.4261 on the maps dataset, a Peak Signal-to-Noise Ratio of 29.05 and a Structural Similarity Index of 0.874 on the CelebA dataset. The proposed method performs competently to the state-of-the-art for the image denoising dataset, albeit not better, but still leads to faster convergence than conventional methods. With this, L-WaveBlock emerges as a robust and efficient tool for enhancing GAN-based image generation, demonstrating superior convergence speed and competitive performance across multiple datasets for image resolution, image generation and image denoising.
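The single-level 2D DWT that L-WaveBlock builds on splits a feature map into four orthogonal subbands (approximation plus horizontal, vertical, and diagonal detail). A minimal sketch with PyWavelets follows; how the subbands are consumed inside the GAN is not reproduced here, and the wavelet choice is an assumption.

```python
# Sketch: one level of the 2D discrete wavelet transform, yielding the four
# orthogonal subbands L-WaveBlock builds on. How the GAN consumes them is omitted.
import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)      # stand-in for a feature map

cA, (cH, cV, cD) = pywt.dwt2(img, "haar")               # approximation + H/V/D detail
print(cA.shape, cH.shape, cV.shape, cD.shape)           # each (128, 128)

recon = pywt.idwt2((cA, (cH, cV, cD)), "haar")          # the transform is invertible
print(np.allclose(recon, img, atol=1e-5))
```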

A Deep Learning Method for Simultaneous Denoising and Missing Wedge Reconstruction in Cryogenic Electron Tomography

  • paper_url: http://arxiv.org/abs/2311.05539
  • repo_url: https://github.com/mli-lab/deepdewedge
  • paper_authors: Simon Wiedemann, Reinhard Heckel
  • for: Improving the visual quality and resolution of cryo-ET tomograms.
  • methods: A deep-learning approach, DeepDeWedge, for simultaneous denoising and missing wedge reconstruction; it fits a neural network to the 2D projections with a self-supervised, noise2noise-like loss and requires no training or ground-truth data.
  • results: Competitive performance for deep learning-based denoising and missing wedge reconstruction on both synthetic and real cryo-ET data.
    Abstract Cryogenic electron tomography (cryo-ET) is a technique for imaging biological samples such as viruses, cells, and proteins in 3D. A microscope collects a series of 2D projections of the sample, and the goal is to reconstruct the 3D density of the sample called the tomogram. This is difficult as the 2D projections have a missing wedge of information and are noisy. Tomograms reconstructed with conventional methods, such as filtered back-projection, suffer from the noise, and from artifacts and anisotropic resolution due to the missing wedge of information. To improve the visual quality and resolution of such tomograms, we propose a deep-learning approach for simultaneous denoising and missing wedge reconstruction called DeepDeWedge. DeepDeWedge is based on fitting a neural network to the 2D projections with a self-supervised loss inspired by noise2noise-like methods. The algorithm requires no training or ground truth data. Experiments on synthetic and real cryo-ET data show that DeepDeWedge achieves competitive performance for deep learning-based denoising and missing wedge reconstruction of cryo-ET tomograms.
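The self-supervised, noise2noise-like objective mentioned above can be sketched generically: the data are split into two independently noisy observations of the same volume and the network is trained to predict one from the other. This toy sketch omits DeepDeWedge's missing-wedge handling and sub-tomogram logic; the tiny network and noise model are placeholders.

```python
# Sketch: a generic noise2noise-style training step on two independently noisy
# observations of the same volume. Omits DeepDeWedge's missing-wedge handling.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv3d(16, 1, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

clean = torch.rand(2, 1, 32, 32, 32)                 # unknown ground truth (simulation only)
obs_a = clean + 0.3 * torch.randn_like(clean)        # e.g. reconstruction from one half of the tilts
obs_b = clean + 0.3 * torch.randn_like(clean)        # e.g. reconstruction from the other half

pred = net(obs_a)
loss = nn.functional.mse_loss(pred, obs_b)           # no clean target is ever needed
loss.backward()
opt.step()
print(float(loss))
```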

Embedding Space Interpolation Beyond Mini-Batch, Beyond Pairs and Beyond Examples

  • paper_url: http://arxiv.org/abs/2311.05538
  • repo_url: https://github.com/shashankvkt/MultiMix_NeurIPS023
  • paper_authors: Shashanka Venkataramanan, Ewa Kijak, Laurent Amsaleg, Yannis Avrithis
  • for: The paper improves interpolation-based (mixup) data augmentation to strengthen model generalization.
  • methods: The paper introduces MultiMix, which generates an arbitrarily large number of interpolated examples beyond the mini-batch size and interpolates the entire mini-batch in the embedding space, effectively sampling over the convex hull of the mini-batch; Dense MultiMix further interpolates features and labels densely at each spatial location.
  • results: Experiments on four benchmarks show significant improvement over state-of-the-art mixup methods, and an analysis of the embedding space (classes more tightly clustered and uniformly spread) explains the improved behavior.
    Abstract Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Its extensions mostly focus on the definition of interpolation and the space (input or feature) where it takes place, while the augmentation process itself is less studied. In most methods, the number of generated examples is limited to the mini-batch size and the number of examples being interpolated is limited to two (pairs), in the input space. We make progress in this direction by introducing MultiMix, which generates an arbitrarily large number of interpolated examples beyond the mini-batch size and interpolates the entire mini-batch in the embedding space. Effectively, we sample on the entire convex hull of the mini-batch rather than along linear segments between pairs of examples. On sequence data, we further extend to Dense MultiMix. We densely interpolate features and target labels at each spatial location and also apply the loss densely. To mitigate the lack of dense labels, we inherit labels from examples and weight interpolation factors by attention as a measure of confidence. Overall, we increase the number of loss terms per mini-batch by orders of magnitude at little additional cost. This is only possible because of interpolating in the embedding space. We empirically show that our solutions yield significant improvement over state-of-the-art mixup methods on four different benchmarks, despite interpolation being only linear. By analyzing the embedding space, we show that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.
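The core idea of interpolating the whole mini-batch in embedding space can be sketched with Dirichlet-distributed convex combinations over the batch, applied to embeddings and one-hot targets alike. This is a simplified illustration with assumed sizes, not the exact MultiMix formulation.

```python
# Sketch: convex combinations over an entire mini-batch of embeddings and targets,
# sampled from a Dirichlet distribution. Simplified; not the exact MultiMix method.
import torch

n, d, c = 64, 512, 10                 # batch size, embedding dim, number of classes
m, alpha = 1000, 1.0                  # interpolated examples to generate, Dirichlet concentration

emb = torch.randn(n, d)               # embeddings of the mini-batch
targets = torch.nn.functional.one_hot(torch.randint(0, c, (n,)), c).float()

lam = torch.distributions.Dirichlet(torch.full((n,), alpha)).sample((m,))   # (m, n), rows sum to 1
mixed_emb = lam @ emb                 # (m, d): points inside the convex hull of the batch
mixed_targets = lam @ targets         # (m, c): matching soft labels

logits = mixed_emb @ torch.randn(d, c)                       # stand-in classifier
loss = torch.sum(-mixed_targets * logits.log_softmax(-1), dim=-1).mean()
print(mixed_emb.shape, mixed_targets.shape, float(loss))
```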

SeaTurtleID2022: A long-span dataset for reliable sea turtle re-identification

  • paper_url: http://arxiv.org/abs/2311.05524
  • repo_url: None
  • paper_authors: Lukáš Adam, Vojtěch Čermák, Kostas Papafitsoros, Lukáš Picek
  • for: The paper is written for researchers and practitioners working on animal re-identification, particularly those interested in sea turtles.
  • methods: The paper uses a large-scale, long-span dataset of sea turtle photographs captured in the wild, with various annotations such as identity, encounter timestamp, and body parts segmentation masks. The dataset is split into two realistic and ecologically motivated splits: a time-aware closed-set and a time-aware open-set. The paper also proposes an end-to-end system for sea turtle re-identification based on Hybrid Task Cascade for head instance segmentation and ArcFace-trained feature-extractor.
  • results: The paper reports an accuracy of 86.8% for the proposed end-to-end system, and provides baseline instance segmentation and re-identification performance over various body parts. The paper also shows that time-aware splits are essential for benchmarking re-identification methods, as random splits lead to performance overestimation.
    Abstract This paper introduces the first public large-scale, long-span dataset with sea turtle photographs captured in the wild -- SeaTurtleID2022 (https://www.kaggle.com/datasets/wildlifedatasets/seaturtleid2022). The dataset contains 8729 photographs of 438 unique individuals collected within 13 years, making it the longest-spanned dataset for animal re-identification. All photographs include various annotations, e.g., identity, encounter timestamp, and body parts segmentation masks. Instead of standard "random" splits, the dataset allows for two realistic and ecologically motivated splits: (i) a time-aware closed-set with training, validation, and test data from different days/years, and (ii) a time-aware open-set with new unknown individuals in test and validation sets. We show that time-aware splits are essential for benchmarking re-identification methods, as random splits lead to performance overestimation. Furthermore, a baseline instance segmentation and re-identification performance over various body parts is provided. Finally, an end-to-end system for sea turtle re-identification is proposed and evaluated. The proposed system based on Hybrid Task Cascade for head instance segmentation and ArcFace-trained feature-extractor achieved an accuracy of 86.8%.
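To illustrate why a time-aware split differs from a random one, here is a minimal sketch of a time-aware closed-set split; the record field names ("identity", "date") and the per-year cut-offs are hypothetical, not the dataset's actual schema:

```python
# Minimal sketch of a time-aware closed-set split: train/validation/test come
# from disjoint time periods, and only identities seen in training are kept in
# validation/test (closed set). Field names are hypothetical.
from datetime import date

def time_aware_closed_set_split(records, val_year, test_year):
    """records: iterable of dicts with 'identity' and 'date' (datetime.date)."""
    train = [r for r in records if r["date"].year < val_year]
    known = {r["identity"] for r in train}
    val = [r for r in records if r["date"].year == val_year and r["identity"] in known]
    test = [r for r in records if r["date"].year >= test_year and r["identity"] in known]
    return train, val, test

records = [
    {"identity": "t001", "date": date(2012, 6, 1)},
    {"identity": "t001", "date": date(2021, 7, 3)},
    {"identity": "t002", "date": date(2022, 5, 9)},
]
print([len(s) for s in time_aware_closed_set_split(records, val_year=2021, test_year=2022)])
```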

BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis

  • paper_url: http://arxiv.org/abs/2311.05521
  • repo_url: None
  • paper_authors: Hao-Bin Duan, Miao Wang, Jin-Chuan Shi, Xu-Chuan Chen, Yan-Pei Cao
  • for: Provide an efficient alternative to NeRF-based methods for real-time head avatar synthesis, meeting the needs of VR/AR, telepresence, and video game applications.
  • methods: BakedAvatar, a novel representation deployable in a standard polygon rasterization pipeline; deformable multi-layer meshes are extracted from learned isosurfaces of the head, and expression-, pose-, and view-dependent appearances are computed and baked into static textures for efficient rasterization.
  • results: Synthesis quality comparable to other state-of-the-art methods while significantly reducing inference time; head avatar synthesis from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates.
    Abstract Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avatar synthesis, deployable in a standard polygon rasterization pipeline. Our approach extracts deformable multi-layer meshes from learned isosurfaces of the head and computes expression-, pose-, and view-dependent appearances that can be baked into static textures for efficient rasterization. We thus propose a three-stage pipeline for neural head avatar synthesis, which includes learning continuous deformation, manifold, and radiance fields, extracting layered meshes and textures, and fine-tuning texture details with differential rasterization. Experimental results demonstrate that our representation generates synthesis results of comparable quality to other state-of-the-art methods while significantly reducing the inference time required. We further showcase various head avatar synthesis results from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates.

Multi-Modal Gaze Following in Conversational Scenarios

  • paper_url: http://arxiv.org/abs/2311.05669
  • repo_url: None
  • paper_authors: Yuqi Hou, Zhongqun Zhang, Nora Horanyi, Jaewon Moon, Yihua Cheng, Hyung Jin Chang
  • for: Improve gaze following for people in conversational scenes by exploiting audio, which provides crucial cues about human behavior.
  • methods: Based on the observation that audiences tend to focus on the speaker, the method first uses the correlation between audio and lips to classify speakers and listeners, then uses this identity information to enhance the scene images, estimates gaze candidates with a dedicated network, and matches subjects to candidates with an MLP.
  • results: On the newly collected VideoGazeSpeech (VGS) dataset, the first gaze following dataset with both images and audio, the method significantly outperforms existing approaches.
    Abstract Gaze following estimates gaze targets of in-scene persons by understanding human behavior and scene information. Existing methods usually analyze scene images for gaze following. However, compared with visual images, audio also provides crucial cues for determining human behavior. This suggests that we can further improve gaze following by considering audio cues. In this paper, we explore gaze following tasks in conversational scenarios. We propose a novel multi-modal gaze following framework based on our observation that "audiences tend to focus on the speaker". We first leverage the correlation between audio and lips, and classify speakers and listeners in a scene. We then use the identity information to enhance scene images and propose a gaze candidate estimation network. The network estimates gaze candidates from enhanced scene images and we use an MLP to match subjects with candidates as a classification task. Existing gaze following datasets focus on visual images while ignoring audio. To evaluate our method, we collect a conversational dataset, VideoGazeSpeech (VGS), which is the first gaze following dataset including images and audio. Our method significantly outperforms existing methods on the VGS dataset. The visualization results also demonstrate the advantage of audio cues in gaze following tasks. Our work will inspire more research in multi-modal gaze following estimation.

Object-centric Cross-modal Feature Distillation for Event-based Object Detection

  • paper_url: http://arxiv.org/abs/2311.05494
  • repo_url: None
  • paper_authors: Lei Li, Alexander Liniger, Mario Millhaeusler, Vagia Tsiminaki, Yuanyou Li, Dengxin Dai
  • for: Improve the performance of event-based object detectors for real-time object detection.
  • methods: A novel cross-modality knowledge distillation approach that focuses distillation on the regions where it helps most, compensating for the sparsity of event data and its missing visual details.
  • results: On a synthetic and a real event dataset, the object-centric slot attention mechanism, which iteratively decouples feature maps for distillation, improves the event-based student detector and nearly halves the performance gap to the teacher.
    Abstract Event cameras are gaining popularity due to their unique properties, such as their low latency and high dynamic range. One task where these benefits can be crucial is real-time object detection. However, RGB detectors still outperform event-based detectors due to the sparsity of the event data and missing visual details. In this paper, we develop a novel knowledge distillation approach to shrink the performance gap between these two modalities. To this end, we propose a cross-modality object detection distillation method that by design can focus on regions where the knowledge distillation works best. We achieve this by using an object-centric slot attention mechanism that can iteratively decouple features maps into object-centric features and corresponding pixel-features used for distillation. We evaluate our novel distillation approach on a synthetic and a real event dataset with aligned grayscale images as a teacher modality. We show that object-centric distillation allows to significantly improve the performance of the event-based student object detector, nearly halving the performance gap with respect to the teacher.
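As a rough illustration of the distillation idea (not the paper's implementation), the sketch below weights an L2 feature-distillation loss by soft object masks, e.g. as produced by a slot-attention module, so that distillation concentrates on object regions:

```python
# Minimal sketch of object-centric feature distillation: an L2 loss between
# student (event) and teacher (frame) feature maps, weighted by per-object masks.
import torch

def object_centric_distill_loss(student_feat, teacher_feat, object_masks, eps=1e-6):
    """student_feat, teacher_feat: (B, C, H, W); object_masks: (B, K, H, W) soft masks in [0, 1]."""
    per_pixel = (student_feat - teacher_feat.detach()).pow(2).mean(dim=1)  # (B, H, W)
    weight = object_masks.sum(dim=1).clamp(max=1.0)                        # (B, H, W)
    return (weight * per_pixel).sum() / (weight.sum() + eps)

s = torch.randn(2, 64, 32, 32)   # student features from event data
t = torch.randn(2, 64, 32, 32)   # teacher features from aligned grayscale frames
m = torch.rand(2, 5, 32, 32)     # stand-in for slot-attention object masks
print(object_centric_distill_loss(s, t, m).item())
```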

Retinal OCT Synthesis with Denoising Diffusion Probabilistic Models for Layer Segmentation

  • paper_url: http://arxiv.org/abs/2311.05479
  • repo_url: None
  • paper_authors: Yuli Wu, Weidong He, Dennis Eschweiler, Ningxin Dou, Zixin Fan, Shengli Mi, Peter Walter, Johannes Stegmaier
  • for: overcome the challenge of limited annotated data in deep biomedical image analysis
  • methods: utilize denoising diffusion probabilistic models (DDPMs) to automatically generate retinal optical coherence tomography (OCT) images
  • results: achieve comparable layer segmentation accuracy with a model trained solely on synthesized images, reducing the need for manual annotation of retinal OCT images.
    Abstract Modern biomedical image analysis using deep learning often encounters the challenge of limited annotated data. To overcome this issue, deep generative models can be employed to synthesize realistic biomedical images. In this regard, we propose an image synthesis method that utilizes denoising diffusion probabilistic models (DDPMs) to automatically generate retinal optical coherence tomography (OCT) images. By providing rough layer sketches, the trained DDPMs can generate realistic circumpapillary OCT images. We further find that more accurate pseudo labels can be obtained through knowledge adaptation, which greatly benefits the segmentation task. Through this, we observe a consistent improvement in layer segmentation accuracy, which is validated using various neural networks. Furthermore, we have discovered that a layer segmentation model trained solely with synthesized images can achieve comparable results to a model trained exclusively with real images. These findings demonstrate the promising potential of DDPMs in reducing the need for manual annotations of retinal OCT images.
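For context, the snippet below sketches the standard DDPM noise-prediction objective that such synthesis builds on; representing the layer-sketch condition as an extra input channel is an assumption for illustration only:

```python
# Minimal sketch of the DDPM training objective (noise prediction) with a rough
# layer sketch concatenated as a conditioning channel. The toy "model" is a
# placeholder for any network mapping (B,2,H,W) and a timestep to (B,1,H,W).
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0, cond):
    """x0: clean OCT images (B,1,H,W); cond: rough layer sketches (B,1,H,W)."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise    # forward diffusion
    pred = model(torch.cat([x_t, cond], dim=1), t)          # predict the added noise
    return F.mse_loss(pred, noise)

model = lambda x, t: F.conv2d(x, torch.randn(1, 2, 3, 3), padding=1)  # toy stand-in
print(ddpm_loss(model, torch.randn(4, 1, 32, 32), torch.rand(4, 1, 32, 32)).item())
```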

Robust Retraining-free GAN Fingerprinting via Personalized Normalization

  • paper_url: http://arxiv.org/abs/2311.05478
  • repo_url: None
  • paper_authors: Jianwei Fei, Zhihua Xia, Benedetta Tondi, Mauro Barni
  • for: Track and identify the user responsible for a GAN model copy in case of license violations or malicious use.
  • methods: A retraining-free GAN fingerprinting method that lets model developers easily generate copies with different fingerprints; additional Personalized Normalization (PN) layers are inserted into the generator, whose parameters (scaling and bias) are produced by two dedicated shallow networks (ParamGen Nets) taking the fingerprint as input, while a watermark decoder is trained jointly to extract the fingerprint from generated images.
  • results: Different fingerprints can be embedded into the GAN with just a feedforward pass, without fine-tuning or retraining, and robustness against both model-level and image-level attacks exceeds the state of the art.
    Abstract In recent years, there has been significant growth in the commercial applications of generative models, licensed and distributed by model developers to users, who in turn use them to offer services. In this scenario, there is a need to track and identify the responsible user in the presence of a violation of the license agreement or any kind of malicious usage. Although there are methods enabling Generative Adversarial Networks (GANs) to include invisible watermarks in the images they produce, generating a model with a different watermark, referred to as a fingerprint, for each user is time- and resource-consuming due to the need to retrain the model to include the desired fingerprint. In this paper, we propose a retraining-free GAN fingerprinting method that allows model developers to easily generate model copies with the same functionality but different fingerprints. The generator is modified by inserting additional Personalized Normalization (PN) layers whose parameters (scaling and bias) are generated by two dedicated shallow networks (ParamGen Nets) taking the fingerprint as input. A watermark decoder is trained simultaneously to extract the fingerprint from the generated images. The proposed method can embed different fingerprints inside the GAN by just changing the input of the ParamGen Nets and performing a feedforward pass, without finetuning or retraining. The performance of the proposed method in terms of robustness against both model-level and image-level attacks is also superior to the state-of-the-art.
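A minimal sketch of what a "personalized normalization" layer could look like, assuming instance normalization and two small MLPs standing in for the ParamGen Nets (layer sizes and the exact modulation form are assumptions, not the paper's implementation):

```python
# Minimal sketch of a fingerprint-conditioned normalization layer: instance norm
# whose scale and bias are generated from a fingerprint code by shallow networks.
import torch
import torch.nn as nn

class PersonalizedNorm(nn.Module):
    def __init__(self, num_channels, fingerprint_dim, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.scale_net = nn.Sequential(nn.Linear(fingerprint_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_channels))
        self.bias_net = nn.Sequential(nn.Linear(fingerprint_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, num_channels))

    def forward(self, x, fingerprint):
        # x: (B, C, H, W); fingerprint: (B, fingerprint_dim) binary or real code
        gamma = self.scale_net(fingerprint).unsqueeze(-1).unsqueeze(-1)
        beta = self.bias_net(fingerprint).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

pn = PersonalizedNorm(num_channels=32, fingerprint_dim=16)
x = torch.randn(2, 32, 8, 8)
code = torch.randint(0, 2, (2, 16)).float()
print(pn(x, code).shape)   # torch.Size([2, 32, 8, 8])
```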

Using ResNet to Utilize 4-class T2-FLAIR Slice Classification Based on the Cholinergic Pathways Hyperintensities Scale for Pathological Aging

  • paper_url: http://arxiv.org/abs/2311.05477
  • repo_url: None
  • paper_authors: Wei-Chun Kevin Tsai, Yi-Chien Liu, Ming-Chun Yu, Chia-Ju Chou, Sui-Hing Yan, Yang-Teng Fan, Yan-Hsiang Huang, Yen-Ling Chiu, Yi-Fang Chuang, Ran-Zan Wang, Yao-Chia Shih
  • for: Assess the extent of cholinergic white matter hyperintensities as an indicator of dementia severity, supporting diagnosis and risk assessment.
  • methods: A ResNet-based 4-class slice classification model (BSCA) that automatically identifies the four T2-FLAIR slices required for CHIPS rating.
  • results: Trained on the ADNI T2-FLAIR dataset (N=150) and tested on a local dataset (N=30), BSCA reached 99.82% accuracy and a 99.83% F1-score, showing its potential as an automatic screening tool for selecting the four key slices and helping clinicians assess the risk of clinical dementia efficiently.
    Abstract The Cholinergic Pathways Hyperintensities Scale (CHIPS) is a visual rating scale used to assess the extent of cholinergic white matter hyperintensities in T2-FLAIR images, serving as an indicator of dementia severity. However, the manual selection of four specific slices for rating throughout the entire brain is a time-consuming process. Our goal was to develop a deep learning-based model capable of automatically identifying the four slices relevant to CHIPS. To achieve this, we trained a 4-class slice classification model (BSCA) using the ADNI T2-FLAIR dataset (N=150) with the assistance of ResNet. Subsequently, we tested the model's performance on a local dataset (N=30). The results demonstrated the efficacy of our model, with an accuracy of 99.82% and an F1-score of 99.83%. This achievement highlights the potential impact of BSCA as an automatic screening tool, streamlining the selection of four specific T2-FLAIR slices that encompass white matter landmarks along the cholinergic pathways. Clinicians can leverage this tool to assess the risk of clinical dementia development efficiently.
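A minimal sketch of a ResNet-based 4-class slice classifier in the spirit of BSCA; the resnet18 backbone and single-channel stem are illustrative assumptions, not details reported in the paper:

```python
# Minimal sketch: adapt a torchvision ResNet to classify single-channel
# T2-FLAIR slices into 4 classes and compute a cross-entropy training loss.
import torch
import torch.nn as nn
from torchvision import models

def build_slice_classifier(num_classes=4):
    net = models.resnet18()
    # T2-FLAIR slices are single-channel; adapt the stem accordingly.
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

model = build_slice_classifier()
slices = torch.randn(8, 1, 224, 224)          # a batch of axial T2-FLAIR slices
logits = model(slices)                        # (8, 4) class scores
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,)))
print(logits.shape, loss.item())
```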

3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.05464
  • repo_url: https://github.com/yanghb22-fdu/3dstyle-diffusion-official
  • paper_authors: Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Tao Mei
  • for: High-quality 3D content creation, enabling fine-grained text-driven stylization of 3D meshes.
  • methods: A new 3DStyle-Diffusion model that parameterizes mesh texture with implicit MLP networks and uses a pre-trained controllable 2D diffusion model, conditioned on depth, to guide fine-grained stylization beyond CLIP-style semantic supervision.
  • results: Qualitative and quantitative experiments demonstrate the effectiveness of 3DStyle-Diffusion, and a new dataset derived from Objaverse with an evaluation protocol is established for this task.
    Abstract 3D content creation via text-driven stylization has played a fundamental challenge to multimedia and graphics community. Recent advances of cross-modal foundation models (e.g., CLIP) have made this problem feasible. Those approaches commonly leverage CLIP to align the holistic semantics of stylized mesh with the given text prompt. Nevertheless, it is not trivial to enable more controllable stylization of fine-grained details in 3D meshes solely based on such semantic-level cross-modal supervision. In this work, we propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D Diffusion models. Technically, 3DStyle-Diffusion first parameterizes the texture of 3D mesh into reflectance properties and scene lighting using implicit MLP networks. Meanwhile, an accurate depth map of each sampled view is achieved conditioned on 3D mesh. Then, 3DStyle-Diffusion leverages a pre-trained controllable 2D Diffusion model to guide the learning of rendered images, encouraging the synthesized image of each view semantically aligned with text prompt and geometrically consistent with depth map. This way elegantly integrates both image rendering via implicit MLP networks and diffusion process of image synthesis in an end-to-end fashion, enabling a high-quality fine-grained stylization of 3D meshes. We also build a new dataset derived from Objaverse and the evaluation protocol for this task. Through both qualitative and quantitative experiments, we validate the capability of our 3DStyle-Diffusion. Source code and data are available at \url{https://github.com/yanghb22-fdu/3DStyle-Diffusion-Official}.

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

  • paper_url: http://arxiv.org/abs/2311.05463
  • repo_url: None
  • paper_authors: Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
  • for: Propose a new task of "stylized" text-to-image generation: producing images that are semantically relevant to a text prompt while aligned in style with a reference style image.
  • methods: ControlStyle upgrades a pre-trained text-to-image diffusion model with a trainable modulation network that accepts both text prompts and style images, and introduces diffusion style and content regularizations to facilitate the learning of this modulation network.
  • results: ControlStyle produces more visually pleasing and artistic stylized results than a simple combination of a text-to-image model and conventional style transfer techniques.
    Abstract Recently, the multimedia community has witnessed the rise of diffusion models trained on large-scale multi-modal data for visual content creation, particularly in the field of text-to-image generation. In this paper, we propose a new task for ``stylizing'' text-to-image models, namely text-driven stylized image generation, that further enhances editability in content creation. Given input text prompt and style image, this task aims to produce stylized images which are both semantically relevant to input text prompt and meanwhile aligned with the style image in style. To achieve this, we present a new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image model with a trainable modulation network enabling more conditions of text prompts and style images. Moreover, diffusion style and content regularizations are simultaneously introduced to facilitate the learning of this modulation network with these diffusion priors, pursuing high-quality stylized text-to-image generation. Extensive experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results, surpassing a simple combination of text-to-image model and conventional style transfer techniques.

Control3D: Towards Controllable Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2311.05461
  • repo_url: None
  • paper_authors: Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, Tao Mei
  • for: Improve the controllability of text-to-3D generation by conditioning on an additional hand-drawn sketch.
  • methods: A 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a NeRF-parameterized 3D scene, and a pre-trained differentiable photo-to-sketch model directly estimates the sketch of each rendered view, which is enforced to be geometrically consistent with the given sketch.
  • results: Extensive experiments show that the method generates accurate and faithful 3D scenes that align closely with the input text prompts and sketches.
    Abstract Recent remarkable advances in large-scale text-to-image diffusion models have inspired a significant breakthrough in text-to-3D generation, pursuing 3D content creation solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively control and shape the synthetic 3D contents according to users' desired specifications (e.g., sketch). To alleviate this issue, we present the first attempt for text-to-3D generation conditioning on the additional hand-drawn sketch, namely Control3D, which enhances controllability for users. In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of 3D scene parameterized as NeRF, encouraging each view of 3D scene aligned with the given text prompt and hand-drawn sketch. Moreover, we exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of the rendered image over synthetic 3D scene. Such estimated sketch along with each sampled view is further enforced to be geometrically consistent with the given sketch, pursuing better controllable text-to-3D generation. Through extensive experiments, we demonstrate that our proposal can generate accurate and faithful 3D scenes that align closely with the input text prompts and sketches.

Transformer-based Model for Oral Epithelial Dysplasia Segmentation

  • paper_url: http://arxiv.org/abs/2311.05452
  • repo_url: None
  • paper_authors: Adam J Shephard, Hanya Mahmood, Shan E Ahmed Raza, Anna Luiza Damaceno Araujo, Alan Roger Santos-Silva, Marcio Ajudarte Lopes, Pablo Agustin Vargas, Kris McCombe, Stephanie Craig, Jacqueline James, Jill Brooks, Paul Nankivell, Hisham Mehanna, Syed Ali Khurram, Nasir M Rajpoot
  • for: Improve the accuracy of oral epithelial dysplasia (OED) diagnosis.
  • methods: A Transformer-based pipeline for detecting and segmenting OED in H&E-stained whole slide images.
  • results: Good generalizability on external test data (mean F1 of 0.81 internally, 0.71 externally) and state-of-the-art results; the first externally validated study to use Transformers for segmentation in precancerous histology images.
    Abstract Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. OED grading is subject to large inter/intra-rater variability, resulting in the under/over-treatment of patients. We developed a new Transformer-based pipeline to improve detection and segmentation of OED in haematoxylin and eosin (H&E) stained whole slide images (WSIs). Our model was trained on OED cases (n = 260) and controls (n = 105) collected using three different scanners, and validated on test data from three external centres in the United Kingdom and Brazil (n = 78). Our internal experiments yield a mean F1-score of 0.81 for OED segmentation, which reduced slightly to 0.71 on external testing, showing good generalisability, and gaining state-of-the-art results. This is the first externally validated study to use Transformers for segmentation in precancerous histology images. Our publicly available model shows great promise to be the first step of a fully-integrated pipeline, allowing earlier and more efficient OED diagnosis, ultimately benefiting patient outcomes.

Dual Pipeline Style Transfer with Input Distribution Differentiation

  • paper_url: http://arxiv.org/abs/2311.05432
  • repo_url: None
  • paper_authors: ShiQi Jiang, JunJie Kang, YuJian Li
  • for: Improve the colour and texture dual pipeline architecture (CTDP), which suppresses texture representations and artifacts through a masked total variation loss (Mtv).
  • methods: CTDP with the masked total variation loss, plus an input distribution differentiation training strategy (IDD).
  • results: With IDD, texture generation depends entirely on the noise input distribution while the smooth distribution produces no texture at all; using the smooth distribution at the forward inference stage completely eliminates texture representations and artifacts in colour transfer tasks.
    Abstract The color and texture dual pipeline architecture (CTDP) suppresses texture representation and artifacts through masked total variation loss (Mtv), and further experiments have shown that smooth input can almost completely eliminate texture representation. We have demonstrated through experiments that smooth input is not the key reason for removing texture representations, but rather the distribution differentiation of the training dataset. Based on this, we propose an input distribution differentiation training strategy (IDD), which forces the generation of textures to be completely dependent on the noise distribution, while the smooth distribution will not produce textures at all. Overall, our proposed distribution differentiation training strategy allows for two pre-defined input distributions to be responsible for two generation tasks, with noise distribution responsible for texture generation and smooth distribution responsible for color smooth transfer. Finally, we choose a smooth distribution as the input for the forward inference stage to completely eliminate texture representations and artifacts in color transfer tasks.
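For illustration, here is one plausible form of a masked total variation loss (the exact masking scheme of Mtv is not specified here, so this is an assumption): smoothness penalties are accumulated only where a mask is active.

```python
# Minimal sketch of a masked total variation (TV) loss: standard TV smoothness,
# but only counted where the mask is active, suppressing texture in those regions.
import torch

def masked_tv_loss(img, mask):
    """img: (B, C, H, W); mask: (B, 1, H, W) in [0, 1], 1 where smoothness is enforced."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs()   # vertical differences
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs()   # horizontal differences
    mh = torch.minimum(mask[:, :, 1:, :], mask[:, :, :-1, :])
    mw = torch.minimum(mask[:, :, :, 1:], mask[:, :, :, :-1])
    return (mh * dh).mean() + (mw * dw).mean()

img = torch.rand(1, 3, 64, 64, requires_grad=True)
mask = torch.ones(1, 1, 64, 64)
print(masked_tv_loss(img, mask).item())
```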

Active Mining Sample Pair Semantics for Image-text Matching

  • paper_url: http://arxiv.org/abs/2311.05425
  • repo_url: None
  • paper_authors: Yongfeng Chen, Jin Liu, Zhijing Yang, Ruihan Chen, Junpeng Tan
  • for: Improve the performance and generalization of image-text matching, in particular the handling of intractable negative sample pairs.
  • methods: The Active Mining Sample Pair Semantics image-text matching model (AMSPS), which uses an Adaptive Hierarchical Reinforcement Loss (AHRL) and adaptively mines hidden relevant semantic representations from uncommented items.
  • results: On the Flickr30K and MSCOCO datasets, the proposed method outperforms advanced comparison methods in both accuracy and generalization.
    Abstract Recently, commonsense learning has been a hot topic in image-text matching. Although it can describe more graphic correlations, commonsense learning still has some shortcomings: 1) The existing methods are based on triplet semantic similarity measurement loss, which cannot effectively match the intractable negative in image-text sample pairs. 2) The weak generalization ability of the model leads to the poor effect of image and text matching on large-scale datasets. According to these shortcomings. This paper proposes a novel image-text matching model, called Active Mining Sample Pair Semantics image-text matching model (AMSPS). Compared with the single semantic learning mode of the commonsense learning model with triplet loss function, AMSPS is an active learning idea. Firstly, the proposed Adaptive Hierarchical Reinforcement Loss (AHRL) has diversified learning modes. Its active learning mode enables the model to more focus on the intractable negative samples to enhance the discriminating ability. In addition, AMSPS can also adaptively mine more hidden relevant semantic representations from uncommented items, which greatly improves the performance and generalization ability of the model. Experimental results on Flickr30K and MSCOCO universal datasets show that our proposed method is superior to advanced comparison methods.
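For reference, the sketch below shows the standard triplet ranking loss with hardest-negative mining that such image-text matching models build on and criticise; the paper's AHRL loss itself is not reproduced here:

```python
# Minimal sketch of the VSE++-style triplet ranking loss with hardest negatives.
import torch

def hardest_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) L2-normalised embeddings; row i of each is a matched pair."""
    scores = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                      # matched-pair scores
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    # hinge terms against the hardest negative, in both retrieval directions
    cost_i2t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0).max(dim=1).values
    cost_t2i = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0).max(dim=0).values
    return (cost_i2t + cost_t2i).mean()

img = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
txt = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
print(hardest_negative_triplet_loss(img, txt).item())
```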

Linear Gaussian Bounding Box Representation and Ring-Shaped Rotated Convolution for Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2311.05410
  • repo_url: https://github.com/zhen6618/rotayolo
  • paper_authors: Zhen Zhou, Yunkai Ma, Junfeng Fan, Zhaoyang Liu, Fengshui Jing, Min Tan
  • for: Resolve the boundary discontinuity and numerical instability problems of existing oriented bounding box representations in oriented object detection.
  • methods: A new oriented bounding box representation, linear Gaussian bounding box (LGBB), obtained by linearly transforming the elements of the Gaussian bounding box (GBB) to avoid boundary discontinuity while remaining numerically stable; additionally, a ring-shaped rotated convolution (RRC) that adaptively rotates feature maps to arbitrary orientations under a ring-shaped receptive field, rapidly aggregating rotation-sensitive features and contextual information.
  • results: LGBB and RRC achieve state-of-the-art performance, and integrating them into various models effectively improves detection accuracy.
    Abstract In oriented object detection, current representations of oriented bounding boxes (OBBs) often suffer from boundary discontinuity problem. Methods of designing continuous regression losses do not essentially solve this problem. Although Gaussian bounding box (GBB) representation avoids this problem, directly regressing GBB is susceptible to numerical instability. We propose linear GBB (LGBB), a novel OBB representation. By linearly transforming the elements of GBB, LGBB avoids the boundary discontinuity problem and has high numerical stability. In addition, existing convolution-based rotation-sensitive feature extraction methods only have local receptive fields, resulting in slow feature aggregation. We propose ring-shaped rotated convolution (RRC), which adaptively rotates feature maps to arbitrary orientations to extract rotation-sensitive features under a ring-shaped receptive field, rapidly aggregating features and contextual information. Experimental results demonstrate that LGBB and RRC achieve state-of-the-art performance. Furthermore, integrating LGBB and RRC into various models effectively improves detection accuracy.
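To make the Gaussian bounding box (GBB) starting point concrete, the sketch below converts an oriented box (cx, cy, w, h, angle) into a Gaussian mean and covariance; the subsequent linear transform that defines LGBB is paper-specific and not reproduced here.

```python
# Minimal sketch: oriented box -> Gaussian representation (mean, covariance),
# with covariance = R diag(w^2/4, h^2/4) R^T as used in Gaussian-bbox literature.
import numpy as np

def obb_to_gbb(cx, cy, w, h, angle_rad):
    """Return (mean, covariance) of the Gaussian representation of an oriented box."""
    mean = np.array([cx, cy], dtype=float)
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, -s], [s, c]])
    D = np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2])
    cov = R @ D @ R.T
    return mean, cov

mean, cov = obb_to_gbb(100.0, 50.0, 40.0, 10.0, np.deg2rad(30))
print(mean, "\n", cov)
```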

SIRE: scale-invariant, rotation-equivariant estimation of artery orientations using graph neural networks

  • paper_url: http://arxiv.org/abs/2311.05400
  • repo_url: None
  • paper_authors: Dieuwertje Alblas, Julian Suk, Christoph Brune, Kak Khee Yeung, Jelmer M. Wolterink
  • for: Estimate blood vessel orientation in 3D medical images as a geometric descriptor for centerline extraction and subsequent segmentation and visualization.
  • methods: SIRE, a scale-invariant, rotation-equivariant orientation estimator built from a gauge-equivariant mesh CNN (GEM-CNN) operating on multiple nested spherical meshes in parallel, with weight sharing and a symmetric maximum over scales; this addresses the sensitivity of plain 3D CNNs to varying vessel sizes and orientations.
  • results: SIRE accurately estimates vessel orientations; embedded in a centerline tracker it accurately tracks abdominal aortic aneurysms (AAAs) regardless of the training data, and can track coronary arteries even when trained only with AAAs.
    Abstract Blood vessel orientation as visualized in 3D medical images is an important descriptor of its geometry that can be used for centerline extraction and subsequent segmentation and visualization. Arteries appear at many scales and levels of tortuosity, and determining their exact orientation is challenging. Recent works have used 3D convolutional neural networks (CNNs) for this purpose, but CNNs are sensitive to varying vessel sizes and orientations. We present SIRE: a scale-invariant, rotation-equivariant estimator for local vessel orientation. SIRE is modular and can generalise due to symmetry preservation. SIRE consists of a gauge equivariant mesh CNN (GEM-CNN) operating on multiple nested spherical meshes with different sizes in parallel. The features on each mesh are a projection of image intensities within the corresponding sphere. These features are intrinsic to the sphere and, in combination with the GEM-CNN, lead to SO(3)-equivariance. Approximate scale invariance is achieved by weight sharing and use of a symmetric maximum function to combine multi-scale predictions. Hence, SIRE can be trained with arbitrarily oriented vessels with varying radii to generalise to vessels with a wide range of calibres and tortuosity. We demonstrate the efficacy of SIRE using three datasets containing vessels of varying scales: the vascular model repository (VMR), the ASOCA coronary artery set, and a set of abdominal aortic aneurysms (AAAs). We embed SIRE in a centerline tracker which accurately tracks AAAs, regardless of the data SIRE is trained with. Moreover, SIRE can be used to track coronary arteries, even when trained only with AAAs. In conclusion, by incorporating SO(3) and scale symmetries, SIRE can determine the orientations of vessels outside of the training domain, forming a robust and data-efficient solution to geometric analysis of blood vessels in 3D medical images.

Improving Hand Recognition in Uncontrolled and Uncooperative Environments using Multiple Spatial Transformers and Loss Functions

  • paper_url: http://arxiv.org/abs/2311.05383
  • repo_url: None
  • paper_authors: Wojciech Michal Matkowski, Xiaojie Li, Adams Wai Kin Kong
  • for: Improve hand-based recognition in uncontrolled and uncooperative environments, e.g. for forensic investigation.
  • methods: An algorithm integrating a multi-spatial transformer network (MSTN) and multiple loss functions to fully exploit the information in full hand images.
  • results: Experiments on the NTU-PI-v1 database and six benchmark databases from different domains show that the proposed algorithm performs significantly better than existing methods in uncontrolled environments and generalizes well to samples from other domains.
    Abstract The prevalence of smartphone and consumer camera has led to more evidence in the form of digital images, which are mostly taken in uncontrolled and uncooperative environments. In these images, criminals likely hide or cover their faces while their hands are observable in some cases, creating a challenging use case for forensic investigation. Many existing hand-based recognition methods perform well for hand images collected in controlled environments with user cooperation. However, their performance deteriorates significantly in uncontrolled and uncooperative environments. A recent work has exposed the potential of hand recognition in these environments. However, only the palmar regions were considered, and the recognition performance is still far from satisfactory. To improve the recognition accuracy, an algorithm integrating a multi-spatial transformer network (MSTN) and multiple loss functions is proposed to fully utilize information in full hand images. MSTN is firstly employed to localize the palms and fingers and estimate the alignment parameters. Then, the aligned images are further fed into pretrained convolutional neural networks, where features are extracted. Finally, a training scheme with multiple loss functions is used to train the network end-to-end. To demonstrate the effectiveness of the proposed algorithm, the trained model is evaluated on NTU-PI-v1 database and six benchmark databases from different domains. Experimental results show that the proposed algorithm performs significantly better than the existing methods in these uncontrolled and uncooperative environments and has good generalization capabilities to samples from different domains.
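A minimal sketch of a single spatial transformer module of the kind an MSTN stacks several of: a small localisation network predicts affine parameters used to resample (align) the hand image. Layer sizes are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of a spatial transformer: localisation net -> affine grid -> resample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(10, 6))
        # initialise the regressed transform to the identity
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                    # affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)    # aligned image

stn = SpatialTransformer()
print(stn(torch.randn(2, 3, 128, 128)).shape)   # torch.Size([2, 3, 128, 128])
```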

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

  • paper_url: http://arxiv.org/abs/2311.05348
  • repo_url: None
  • paper_authors: Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
  • for: The paper is written to propose a new approach to adapt large language models (LLMs) to downstream tasks, specifically by using LLM as a bridge to connect multiple expert models.
  • methods: The proposed approach, called u-LLaVA, incorporates a modality alignment module and multi-task modules into LLM, and reorganizes or rebuilds multi-type public datasets to enable efficient modality alignment and instruction following.
  • results: The proposed approach achieves state-of-the-art performance across multiple benchmarks, and the authors make their model, the generated data, and the code base publicly available.
    Abstract Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into LLMs, yielding inspiring outcomes and giving rise to a new generation of multi-modal LLMs, or MLLMs. Nevertheless, these methods struggle with hallucinations and the mutual interference between tasks. To tackle these problems, we propose an efficient and accurate approach to adapt to downstream tasks by utilizing LLM as a bridge to connect multiple expert models, namely u-LLaVA. Firstly, we incorporate the modality alignment module and multi-task modules into LLM. Then, we reorganize or rebuild multi-type public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to different modules for solving downstream tasks. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also release our model, the generated data, and the code base publicly available.

SynFacePAD 2023: Competition on Face Presentation Attack Detection Based on Privacy-aware Synthetic Training Data

  • paper_url: http://arxiv.org/abs/2311.05336
  • repo_url: https://github.com/zi-yuanyang/ijcb-synfacepad-dig
  • paper_authors: Meiling Fang, Marco Huber, Julian Fierrez, Raghavendra Ramachandra, Naser Damer, Alhasan Alkhaddour, Maksim Kasantcev, Vasiliy Pryadchenko, Ziyuan Yang, Huijie Huangfu, Yingyu Chen, Yi Zhang, Yuchen Pan, Junjun Jiang, Xianming Liu, Xianyun Sun, Caiyong Wang, Xingyu Liu, Zhaohua Chang, Guangzhe Zhao, Juan Tapia, Lazaro Gonzalez-Soler, Carlos Aravena, Daniel Schulz
  • for: Motivate and attract face presentation attack detection solutions that rely on synthetic training data, addressing the privacy, legal, and ethical concerns associated with personal data.
  • methods: Participating teams trained novel detection approaches exclusively on the synthetic data provided by the organizers.
  • results: The submitted solutions outperformed the considered baseline on the investigated benchmarks.
    Abstract This paper presents a summary of the Competition on Face Presentation Attack Detection Based on Privacy-aware Synthetic Training Data (SynFacePAD 2023) held at the 2023 International Joint Conference on Biometrics (IJCB 2023). The competition attracted a total of 8 participating teams with valid submissions from academia and industry. The competition aimed to motivate and attract solutions that target detecting face presentation attacks while considering synthetic-based training data motivated by privacy, legal and ethical concerns associated with personal data. To achieve that, the training data used by the participants was limited to synthetic data provided by the organizers. The submitted solutions presented innovations and novel approaches that led to outperforming the considered baseline in the investigated benchmarks.

Spatial Attention-based Distribution Integration Network for Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.05323
  • repo_url: None
  • paper_authors: Sihan Gao, Jing Zhu, Xiaoxuan Zhuang, Zhaoyue Wang, Qijin Li
  • for: Improve the localization accuracy of human pose estimation and strengthen robustness to challenging scenarios such as occlusion, diverse appearances, illumination changes, and overlap.
  • methods: The Spatial Attention-based Distribution Integration Network (SADI-NET) with three modules: a receptive fortified module (RFM), a spatial fusion module (SFM), and a distribution learning module (DLM); building on the classic HourglassNet, the basic block is replaced by the RFM, which uses a dilated residual block and attention to expand receptive fields and enhance sensitivity to spatial information.
  • results: Extensive experiments on the MPII and LSP benchmarks; the model reaches 92.10% accuracy on the MPII test set, a significant improvement over existing models and state-of-the-art performance.
    Abstract In recent years, human pose estimation has made significant progress through the implementation of deep learning techniques. However, these techniques still face limitations when confronted with challenging scenarios, including occlusion, diverse appearances, variations in illumination, and overlap. To cope with such drawbacks, we present the Spatial Attention-based Distribution Integration Network (SADI-NET) to improve the accuracy of localization in such situations. Our network consists of three efficient models: the receptive fortified module (RFM), spatial fusion module (SFM), and distribution learning module (DLM). Building upon the classic HourglassNet architecture, we replace the basic block with our proposed RFM. The RFM incorporates a dilated residual block and attention mechanism to expand receptive fields while enhancing sensitivity to spatial information. In addition, the SFM incorporates multi-scale characteristics by employing both global and local attention mechanisms. Furthermore, the DLM, inspired by residual log-likelihood estimation (RLE), reconfigures a predicted heatmap using a trainable distribution weight. For the purpose of determining the efficacy of our model, we conducted extensive experiments on the MPII and LSP benchmarks. Particularly, our model obtained a remarkable $92.10\%$ percent accuracy on the MPII test dataset, demonstrating significant improvements over existing models and establishing state-of-the-art performance.

SPADES: A Realistic Spacecraft Pose Estimation Dataset using Event Sensing

  • paper_url: http://arxiv.org/abs/2311.05310
  • repo_url: None
  • paper_authors: Arunkumar Rathinam, Haytam Qadadri, Djamila Aouada
  • for: Improve autonomy for in-orbit operations such as rendezvous, docking, and proximity maneuvers using deep learning-based spacecraft pose estimation.
  • methods: Event sensing is used to reduce the domain gap between synthetic training data and real-world scenarios that domain adaptation techniques otherwise have to mitigate.
  • results: A new dataset, SPADES, comprising real event data acquired in a controlled laboratory environment and simulated event data with the same camera intrinsics; an effective data filtering method that improves training data quality and model performance; and an image-based event representation that outperforms existing representations.
    Abstract In recent years, there has been a growing demand for improved autonomy for in-orbit operations such as rendezvous, docking, and proximity maneuvers, leading to increased interest in employing Deep Learning-based Spacecraft Pose Estimation techniques. However, due to limited access to real target datasets, algorithms are often trained using synthetic data and applied in the real domain, resulting in a performance drop due to the domain gap. State-of-the-art approaches employ Domain Adaptation techniques to mitigate this issue. In the search for viable solutions, event sensing has been explored in the past and shown to reduce the domain gap between simulations and real-world scenarios. Event sensors have made significant advancements in hardware and software in recent years. Moreover, the characteristics of the event sensor offer several advantages in space applications compared to RGB sensors. To facilitate further training and evaluation of DL-based models, we introduce a novel dataset, SPADES, comprising real event data acquired in a controlled laboratory environment and simulated event data using the same camera intrinsics. Furthermore, we propose an effective data filtering method to improve the quality of training data, thus enhancing model performance. Additionally, we introduce an image-based event representation that outperforms existing representations. A multifaceted baseline evaluation was conducted using different event representations, event filtering strategies, and algorithmic frameworks, and the results are summarized. The dataset will be made available at http://cvi2.uni.lu/spades.
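For readers unfamiliar with image-based event representations, here is a generic sketch that accumulates (x, y, t, polarity) events into a two-channel count image; it is an illustration, not the specific representation proposed in the paper.

```python
# Minimal sketch: accumulate an event stream into a 2-channel count image
# (channel 0 = positive polarity, channel 1 = negative polarity).
import numpy as np

def events_to_count_image(events, height, width):
    """events: (N, 4) array of [x, y, t, polarity] with polarity in {-1, +1}."""
    img = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pol = events[:, 3] > 0
    np.add.at(img[0], (y[pol], x[pol]), 1.0)     # positive-event counts
    np.add.at(img[1], (y[~pol], x[~pol]), 1.0)   # negative-event counts
    return img

rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 64, 1000), rng.integers(0, 48, 1000),
               np.sort(rng.random(1000)), rng.choice([-1, 1], 1000)], axis=1)
print(events_to_count_image(ev, height=48, width=64).sum())   # 1000.0
```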

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

  • paper_url: http://arxiv.org/abs/2311.05298
  • repo_url: None
  • paper_authors: Cheng Yang, Rui Xu, Ye Guo, Peixiang Huang, Yiru Chen, Wenkui Ding, Zhongyuan Wang, Hong Zhou
  • for: Improve visual commonsense reasoning (VCR), a challenging multi-modal task that requires high-level cognition and commonsense reasoning about the real world.
  • methods: Construct a spatial relation graph from the given visual scene and design two pre-training tasks, object position regression (OPR) and spatial relation classification (SRC), that learn to reconstruct it.
  • results: The learned representations keep more spatial context and focus attention on the essential visual regions for reasoning, achieving state-of-the-art results on VCR and on two other vision-and-language reasoning tasks, VQA and NLVR.
    Abstract Visual commonsense reasoning (VCR) is a challenging multi-modal task, which requires high-level cognition and commonsense reasoning ability about the real world. In recent years, large-scale pre-training approaches have been developed and promoted the state-of-the-art performance of VCR. However, the existing approaches almost employ the BERT-like objectives to learn multi-modal representations. These objectives motivated from the text-domain are insufficient for the excavation on the complex scenario of visual modality. Most importantly, the spatial distribution of the visual objects is basically neglected. To address the above issue, we propose to construct the spatial relation graph based on the given visual scenario. Further, we design two pre-training tasks named object position regression (OPR) and spatial relation classification (SRC) to learn to reconstruct the spatial relation graph respectively. Quantitative analysis suggests that the proposed method can guide the representations to maintain more spatial context and facilitate the attention on the essential visual regions for reasoning. We achieve the state-of-the-art results on VCR and two other vision-and-language reasoning tasks VQA, and NLVR.
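As a toy illustration of the kind of supervision a spatial relation classification (SRC) task could use, the sketch below derives coarse relation labels between two detected boxes; the relation taxonomy is an assumption, not the paper's.

```python
# Minimal sketch: coarse spatial relation of box_a with respect to box_b,
# derived purely from bounding-box geometry.
def spatial_relation(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns a coarse relation of a w.r.t. b."""
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    if (box_a[0] >= box_b[0] and box_a[1] >= box_b[1]
            and box_a[2] <= box_b[2] and box_a[3] <= box_b[3]):
        return "inside"
    if abs(ax - bx) > abs(ay - by):
        return "left of" if ax < bx else "right of"
    return "above" if ay < by else "below"

print(spatial_relation((10, 10, 20, 20), (30, 5, 60, 40)))   # left of
```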

VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis

  • paper_url: http://arxiv.org/abs/2311.05289
  • repo_url: None
  • paper_authors: Sen Wang, Wei Zhang, Stefano Gasperini, Shun-Cheng Wu, Nassir Navab
  • for: Improve the quality and efficiency of novel view synthesis for immersive applications, particularly in indoor environments and for real-time deployment.
  • methods: A voxel-based volumetric representation of a structured scene geometry, with multi-resolution hash grids to adaptively capture spatial features and handle occlusions and the intricate geometry of indoor scenes, plus a voxel-guided efficient sampling technique that focuses computation on the most relevant ray segments.
  • results: On three public indoor datasets, VoxNeRF outperforms state-of-the-art methods while reducing both training and rendering times, surpassing even Instant-NGP in speed and bringing the technology closer to real time.
    Abstract Creating high-quality view synthesis is essential for immersive applications but continues to be problematic, particularly in indoor environments and for real-time deployment. Current techniques frequently require extensive computational time for both training and rendering, and often produce less-than-ideal 3D representations due to inadequate geometric structuring. To overcome this, we introduce VoxNeRF, a novel approach that leverages volumetric representations to enhance the quality and efficiency of indoor view synthesis. Firstly, VoxNeRF constructs a structured scene geometry and converts it into a voxel-based representation. We employ multi-resolution hash grids to adaptively capture spatial features, effectively managing occlusions and the intricate geometry of indoor scenes. Secondly, we propose a unique voxel-guided efficient sampling technique. This innovation selectively focuses computational resources on the most relevant portions of ray segments, substantially reducing optimization time. We validate our approach against three public indoor datasets and demonstrate that VoxNeRF outperforms state-of-the-art methods. Remarkably, it achieves these gains while reducing both training and rendering times, surpassing even Instant-NGP in speed and bringing the technology closer to real-time.

SAMVG: A Multi-stage Image Vectorization Model with the Segment-Anything Model

  • paper_url: http://arxiv.org/abs/2311.05276
  • repo_url: None
  • paper_authors: Haokun Zhu, Juang Ian Chong, Teng Hu, Ran Yi, Yu-Kun Lai, Paul L. Rosin
  • for: A multi-stage model for vectorizing raster images into high-quality Scalable Vector Graphics (SVG).
  • methods: SAMVG first uses general image segmentation from the Segment-Anything Model and a novel filtering method to identify the best dense segmentation map for the entire image, then identifies missing components and adds more detailed components to the SVG.
  • results: Extensive experiments show that SAMVG produces high-quality SVGs in any domain while requiring less computation time and complexity than previous state-of-the-art methods.
    Abstract Vector graphics are widely used in graphical designs and have received more and more attention. However, unlike raster images which can be easily obtained, acquiring high-quality vector graphics, typically through automatically converting from raster images remains a significant challenge, especially for more complex images such as photos or artworks. In this paper, we propose SAMVG, a multi-stage model to vectorize raster images into SVG (Scalable Vector Graphics). Firstly, SAMVG uses general image segmentation provided by the Segment-Anything Model and uses a novel filtering method to identify the best dense segmentation map for the entire image. Secondly, SAMVG then identifies missing components and adds more detailed components to the SVG. Through a series of extensive experiments, we demonstrate that SAMVG can produce high quality SVGs in any domain while requiring less computation time and complexity compared to previous state-of-the-art methods.

Single-shot Tomography of Discrete Dynamic Objects

  • paper_url: http://arxiv.org/abs/2311.05269
  • repo_url: None
  • paper_authors: Ajinkya Kadu, Felix Lucka, Kees Joost Batenburg
  • for: 高分辨率时间图像重建
  • methods: 使用水平集方法进行图像分割并以正弦基表示运动,构成一个计算高效、易于优化的变分框架
  • results: 在合成数据集和伪动态真实 X 射线断层扫描数据集上均优于现有方法,仅需每帧单次投影即可重建高质量的 2D 或 3D 图像序列
    Abstract This paper presents a novel method for the reconstruction of high-resolution temporal images in dynamic tomographic imaging, particularly for discrete objects with smooth boundaries that vary over time. Addressing the challenge of limited measurements per time point, we propose a technique that synergistically incorporates spatial and temporal information of the dynamic objects. This is achieved through the application of the level-set method for image segmentation and the representation of motion via a sinusoidal basis. The result is a computationally efficient and easily optimizable variational framework that enables the reconstruction of high-quality 2D or 3D image sequences with a single projection per frame. Compared to current methods, our proposed approach demonstrates superior performance on both synthetic and pseudo-dynamic real X-ray tomography datasets. The implications of this research extend to improved visualization and analysis of dynamic processes in tomographic imaging, finding potential applications in diverse scientific and industrial domains.
    摘要 本文提出了一种用于动态断层成像中高分辨率时间图像重建的新方法,特别针对边界平滑且随时间变化的离散物体。为应对每个时间点测量数据有限的挑战,我们提出了一种协同利用动态物体空间与时间信息的技术:采用水平集方法进行图像分割,并以正弦基表示运动。由此得到一个计算高效、易于优化的变分框架,仅需每帧单次投影即可重建高质量的 2D 或 3D 图像序列。与现有方法相比,所提方法在合成数据集和伪动态真实 X 射线断层扫描数据集上均表现更优。这项研究有助于改进断层成像中动态过程的可视化与分析,在多个科学与工业领域具有潜在应用。
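A toy illustration of the core representation — a level-set shape whose motion is expanded in a sinusoidal basis — is sketched below. The disc geometry, the two-term basis and all coefficients are assumptions chosen only to show how a small parameter vector yields one discrete object per time point (i.e. per projection).

```python
import numpy as np

def shape_at_time(t, grid_x, grid_y, radius=0.3, coeffs=((0.1, 1.0), (0.05, 2.0))):
    """Binary image of a moving disc: a level-set function shifted by a
    sinusoidal displacement.  Not the paper's parameterization, just a toy.
    """
    # displacement of the object centre expressed in a sinusoidal basis (amp, freq)
    dx = sum(a * np.sin(2 * np.pi * f * t) for a, f in coeffs)
    dy = sum(a * np.cos(2 * np.pi * f * t) for a, f in coeffs)
    # signed level-set function: positive inside the (shifted) disc
    phi = radius - np.sqrt((grid_x - dx) ** 2 + (grid_y - dy) ** 2)
    return (phi >= 0).astype(float)        # discrete object = zero super-level set

x, y = np.meshgrid(np.linspace(-1, 1, 128), np.linspace(-1, 1, 128))
frames = [shape_at_time(t, x, y) for t in np.linspace(0, 1, 8)]  # one frame per projection
print(frames[0].sum(), frames[4].sum())
```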

Widely Applicable Strong Baseline for Sports Ball Detection and Tracking

  • paper_url: http://arxiv.org/abs/2311.05237
  • repo_url: https://github.com/nttcom/wasb-sbdt
  • paper_authors: Shuhei Tarashima, Muhammad Abdul Haq, Yushan Wang, Norio Tagawa
  • for: 本研究提出了一种新的运动球检测和跟踪方法 (SBDT), 可以应用于不同的运动类别。
  • methods: 该方法包括高分辨率特征提取、位置意识模型训练和时间一致性推断,这三个部分组合成了一个新的 SBDT 基准。
  • results: 实验结果表明,我们的方法在数据集覆盖的所有运动类别上都显著优于现有方法;数据集与代码已在 GitHub 上公开。
    Abstract In this work, we present a novel Sports Ball Detection and Tracking (SBDT) method that can be applied to various sports categories. Our approach is composed of (1) high-resolution feature extraction, (2) position-aware model training, and (3) inference considering temporal consistency, all of which are put together as a new SBDT baseline. Besides, to validate the wide-applicability of our approach, we compare our baseline with 6 state-of-the-art SBDT methods on 5 datasets from different sports categories. We achieve this by newly introducing two SBDT datasets, providing new ball annotations for two datasets, and re-implementing all the methods to ease extensive comparison. Experimental results demonstrate that our approach is substantially superior to existing methods on all the sports categories covered by the datasets. We believe our proposed method can play as a Widely Applicable Strong Baseline (WASB) of SBDT, and our datasets and codebase will promote future SBDT research. Datasets and codes are available at https://github.com/nttcom/WASB-SBDT .
    摘要 在这项工作中,我们提出了一种新的运动球检测与跟踪(SBDT)方法,可应用于不同的运动类别。我们的方法包括(1)高分辨率特征提取、(2)位置感知模型训练和(3)考虑时间一致性的推理,三者共同构成一个新的 SBDT 基线。此外,为了验证方法的广泛适用性,我们在来自不同运动类别的 5 个数据集上与 6 种最先进的 SBDT 方法进行了比较。为此,我们新引入了两个 SBDT 数据集,为另外两个数据集提供了新的球标注,并重新实现了所有方法以便进行广泛比较。实验结果表明,我们的方法在数据集所覆盖的所有运动类别中均显著优于现有方法。我们相信所提出的方法可以作为 SBDT 的广泛适用强基线(WASB),我们的数据集和代码库也将推动未来的 SBDT 研究。数据集和代码可在 https://github.com/nttcom/WASB-SBDT 获取。
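Of the three components, the temporal-consistency inference is the easiest to illustrate. The sketch below is a generic gating heuristic, not the paper's exact procedure: per frame, candidates far from the previously accepted position are rejected and the most confident remaining one is kept; `max_jump` is an assumed hyper-parameter.

```python
import numpy as np

def track_ball(per_frame_detections, max_jump=40.0):
    """Greedy temporally-consistent selection of one ball position per frame.

    per_frame_detections: list over frames; each entry is a list of
    (x, y, confidence) candidates.  max_jump is a gating distance in pixels.
    """
    track = []
    last = None
    for candidates in per_frame_detections:
        best = None
        for x, y, conf in candidates:
            # reject candidates that are implausibly far from the previous position
            if last is not None and np.hypot(x - last[0], y - last[1]) > max_jump:
                continue
            if best is None or conf > best[2]:
                best = (x, y, conf)
        if best is not None:
            last = (best[0], best[1])
        track.append(best)                  # None means "no reliable ball this frame"
    return track

dets = [[(100, 100, 0.9)], [(105, 102, 0.8), (400, 50, 0.95)], [(110, 104, 0.7)]]
print(track_ball(dets))                     # the far-away 0.95 detection is gated out
```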

ConRad: Image Constrained Radiance Fields for 3D Generation from a Single Image

  • paper_url: http://arxiv.org/abs/2311.05230
  • repo_url: None
  • paper_authors: Senthil Purushwalkam, Nikhil Naik
  • for: 从单个RGB图像中重构3D物体
  • methods: 基于最新的图像生成模型,推理隐藏的3D结构,保持输入图像的准确性
  • results: 提供了一种简单而有效的 3D 表示方式,能够保留输入图像的细节并生成逼真的 3D 重建;与现有基线相比,重建结果更忠实于输入图像,3D 模型也更加一致。
    Abstract We present a novel method for reconstructing 3D objects from a single RGB image. Our method leverages the latest image generation models to infer the hidden 3D structure while remaining faithful to the input image. While existing methods obtain impressive results in generating 3D models from text prompts, they do not provide an easy approach for conditioning on input RGB data. Na\"ive extensions of these methods often lead to improper alignment in appearance between the input image and the 3D reconstructions. We address these challenges by introducing Image Constrained Radiance Fields (ConRad), a novel variant of neural radiance fields. ConRad is an efficient 3D representation that explicitly captures the appearance of an input image in one viewpoint. We propose a training algorithm that leverages the single RGB image in conjunction with pretrained Diffusion Models to optimize the parameters of a ConRad representation. Extensive experiments show that ConRad representations can simplify preservation of image details while producing a realistic 3D reconstruction. Compared to existing state-of-the-art baselines, we show that our 3D reconstructions remain more faithful to the input and produce more consistent 3D models while demonstrating significantly improved quantitative performance on a ShapeNet object benchmark.
    摘要 我们提出了一种从单张 RGB 图像重建 3D 物体的新方法。该方法利用最新的图像生成模型来推断隐藏的 3D 结构,同时保持对输入图像的忠实。现有方法虽然能够从文本提示生成令人印象深刻的 3D 模型,却没有提供以输入 RGB 数据为条件的简便途径;对这些方法的朴素扩展往往导致输入图像与 3D 重建在外观上对齐不佳。为了解决这些挑战,我们提出了图像约束辐射场(ConRad),一种新的神经辐射场变体。ConRad 是一种高效的 3D 表示,能够在某一视点下显式刻画输入图像的外观。我们提出了一种训练算法,利用单张 RGB 图像与预训练的扩散模型共同优化 ConRad 表示的参数。大量实验表明,ConRad 表示能够在生成逼真 3D 重建的同时更容易地保留图像细节。与现有最先进基线相比,我们的 3D 重建更忠实于输入、3D 模型更加一致,并在 ShapeNet 物体基准上取得了显著更好的量化性能。

Let’s Get the FACS Straight – Reconstructing Obstructed Facial Features

  • paper_url: http://arxiv.org/abs/2311.05221
  • repo_url: None
  • paper_authors: Tim Büchner, Sven Sickert, Gerd Fabian Volk, Christoph Anders, Orlando Guntinas-Lichius, Joachim Denzler
  • for: 提高机器学习方法对受阻面部表情的理解
  • methods: 使用 CycleGAN 架构实现样式传递,不需要匹配对
  • results: 可以达到与无阻挡记录相同的评价分数,提高面部表情分析的准确性
    Abstract The human face is one of the most crucial parts in interhuman communication. Even when parts of the face are hidden or obstructed the underlying facial movements can be understood. Machine learning approaches often fail in that regard due to the complexity of the facial structures. To alleviate this problem a common approach is to fine-tune a model for such a specific application. However, this is computational intensive and might have to be repeated for each desired analysis task. In this paper, we propose to reconstruct obstructed facial parts to avoid the task of repeated fine-tuning. As a result, existing facial analysis methods can be used without further changes with respect to the data. In our approach, the restoration of facial features is interpreted as a style transfer task between different recording setups. By using the CycleGAN architecture the requirement of matched pairs, which is often hard to fullfill, can be eliminated. To proof the viability of our approach, we compare our reconstructions with real unobstructed recordings. We created a novel data set in which 36 test subjects were recorded both with and without 62 surface electromyography sensors attached to their faces. In our evaluation, we feature typical facial analysis tasks, like the computation of Facial Action Units and the detection of emotions. To further assess the quality of the restoration, we also compare perceptional distances. We can show, that scores similar to the videos without obstructing sensors can be achieved.
    摘要 人类面部是交流中最重要的部分之一。即使面部部分被隐藏或堵塞,也可以理解下面部的运动。机器学习方法经常在这个方面失败,因为面部结构的复杂性。为解决这个问题,常见的方法是为每个特定应用进行精细调整。然而,这是计算昂贵的,并且可能需要重复进行每个分析任务。在这篇论文中,我们提议使用恢复隐藏的面部部分来避免多次精细调整。通过这种方式,现有的面部分析方法可以无需更改数据进行使用。在我们的方法中,恢复面部特征被解释为面部样式传递任务。通过使用 CycleGAN 架构,可以消除匹配对的要求,这经常是难以满足的。为证明我们的方法的可行性,我们比较了我们的恢复与没有隐藏感知器的实际录制视频。我们创建了一个新的数据集,其中有 36 名测试者在不同的录制设置下被录制。在我们的评估中,我们包括常见的面部分析任务,如计算面部动作单元和感情检测。为进一步评估恢复质量,我们还比较了感知距离。我们可以显示,我们的恢复视频与没有隐藏感知器的视频的分数相似。
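Since the restoration is cast as unpaired style transfer with a CycleGAN, the central training signal is the cycle-consistency loss. The sketch below shows that loss in isolation, with placeholder generators and the commonly used weighting as an assumption; the full system also has adversarial (and typically identity) terms.

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_obstructed2clean, G_clean2obstructed,
                           x_obstructed, x_clean, lam=10.0):
    """Core unpaired-translation objective used when sensor-obstructed and
    unobstructed recordings cannot be matched frame-by-frame.
    The generators are placeholders; lam is the usual CycleGAN weighting (assumed).
    """
    l1 = nn.L1Loss()
    # obstructed -> clean -> obstructed should reproduce the input, and vice versa
    cycle_a = l1(G_clean2obstructed(G_obstructed2clean(x_obstructed)), x_obstructed)
    cycle_b = l1(G_obstructed2clean(G_clean2obstructed(x_clean)), x_clean)
    return lam * (cycle_a + cycle_b)

# toy check with identity "generators": the loss is exactly zero
x_a, x_b = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(cycle_consistency_loss(nn.Identity(), nn.Identity(), x_a, x_b))
```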

BrainNetDiff: Generative AI Empowers Brain Network Generation via Multimodal Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.05199
  • repo_url: None
  • paper_authors: Yongcheng Zong, Shuqiang Wang
  • for: 本研究旨在提供一种新的脑网络分析方法,以 deeper understanding of brain functions and disease mechanisms.
  • methods: 该方法结合多头 Transformer 编码器与条件隐扩散模型,从 fMRI 时间序列中提取相关特征,并以图的形式生成脑网络。
  • results: 实验结果表明,该方法能够在健康人群与神经受损人群的真实数据集上有效生成脑网络,并在下游疾病分类任务中表现出色。
    Abstract Brain network analysis has emerged as pivotal method for gaining a deeper understanding of brain functions and disease mechanisms. Despite the existence of various network construction approaches, shortcomings persist in the learning of correlations between structural and functional brain imaging data. In light of this, we introduce a novel method called BrainNetDiff, which combines a multi-head Transformer encoder to extract relevant features from fMRI time series and integrates a conditional latent diffusion model for brain network generation. Leveraging a conditional prompt and a fusion attention mechanism, this method significantly improves the accuracy and stability of brain network generation. To the best of our knowledge, this represents the first framework that employs diffusion for the fusion of the multimodal brain imaging and brain network generation from images to graphs. We validate applicability of this framework in the construction of brain network across healthy and neurologically impaired cohorts using the authentic dataset. Experimental results vividly demonstrate the significant effectiveness of the proposed method across the downstream disease classification tasks. These findings convincingly emphasize the prospective value in the field of brain network research, particularly its key significance in neuroimaging analysis and disease diagnosis. This research provides a valuable reference for the processing of multimodal brain imaging data and introduces a novel, efficient solution to the field of neuroimaging.
    摘要 脑网络分析已经成为深入理解脑功能与疾病机制的关键方法。尽管已有多种网络构建方法,但在学习结构与功能脑影像数据之间的关联方面仍存在不足。为此,我们提出了一种名为 BrainNetDiff 的新方法:利用多头 Transformer 编码器从 fMRI 时间序列中提取相关特征,并结合条件隐扩散模型进行脑网络生成。借助条件提示与融合注意力机制,该方法显著提高了脑网络生成的准确性与稳定性。据我们所知,这是首个利用扩散模型融合多模态脑影像、并实现从图像到图的脑网络生成的框架。我们在真实数据集上验证了该框架在健康人群与神经受损人群中构建脑网络的适用性。实验结果清楚地表明,所提方法在下游疾病分类任务中效果显著,有力地凸显了其在脑网络研究,尤其是神经影像分析与疾病诊断中的潜在价值。本研究为多模态脑影像数据的处理提供了有价值的参考,并为神经影像领域带来了一种新颖、高效的解决方案。

Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding

  • paper_url: http://arxiv.org/abs/2311.05198
  • repo_url: https://github.com/jaygala223/cloud-adaptive-labeling
  • paper_authors: Jay Gala, Sauradip Nag, Huichou Huang, Ruirui Liu, Xiatian Zhu
  • for: 本研究旨在提高远程感知中的云分类精度,以便在气象和气候科学中进行细致的云分析,从而优化各种预测和管理应用。
  • methods: 我们提出了一种创新的、与模型无关的云自适应标注(CAL)方法,通过迭代更新云训练图像的标注来提升学习模型的性能。该方法首先使用原始标注训练云分割模型,然后引入可训练的像素强度阈值,在训练过程中动态地(on the fly)为云图像重新标注。
  • results: 我们在多个标准云分割基准上进行了广泛的实验,证明该方法能够显著提升现有分割模型的性能;与多种现有方法相比,CAL 取得了新的最先进结果。
    Abstract Cloud analysis is a critical component of weather and climate science, impacting various sectors like disaster management. However, achieving fine-grained cloud analysis, such as cloud segmentation, in remote sensing remains challenging due to the inherent difficulties in obtaining accurate labels, leading to significant labeling errors in training data. Existing methods often assume the availability of reliable segmentation annotations, limiting their overall performance. To address this inherent limitation, we introduce an innovative model-agnostic Cloud Adaptive-Labeling (CAL) approach, which operates iteratively to enhance the quality of training data annotations and consequently improve the performance of the learned model. Our methodology commences by training a cloud segmentation model using the original annotations. Subsequently, it introduces a trainable pixel intensity threshold for adaptively labeling the cloud training images on the fly. The newly generated labels are then employed to fine-tune the model. Extensive experiments conducted on multiple standard cloud segmentation benchmarks demonstrate the effectiveness of our approach in significantly boosting the performance of existing segmentation models. Our CAL method establishes new state-of-the-art results when compared to a wide array of existing alternatives.
    摘要 云分析是气象与气候科学中的关键组成部分,影响灾害管理等多个领域。然而,在遥感中实现细粒度的云分析(如云分割)仍然具有挑战性,因为准确标签难以获得,导致训练数据中存在大量标注错误。现有方法往往假设可以获得可靠的分割标注,从而限制了其整体性能。为了解决这一固有局限,我们提出了一种创新的、与模型无关的云自适应标注(CAL)方法,通过迭代提升训练数据标注的质量,进而提高所学模型的性能。该方法首先使用原始标注训练云分割模型,随后引入可训练的像素强度阈值,在训练过程中动态地为云训练图像重新标注,并利用新生成的标签对模型进行微调。我们在多个标准云分割基准上进行了广泛的实验,结果表明该方法能够显著提升现有分割模型的性能;与众多现有方法相比,CAL 取得了新的最先进结果。
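A minimal sketch of one adaptive-labeling round is given below: pixels the current model is confident about overwrite the original (possibly noisy) annotation, and the rest are kept. In CAL the intensity threshold is trainable; here it is a plain number, and the exact update rule is an assumption for illustration.

```python
import numpy as np

def adapt_labels(prob_map, original_label, threshold):
    """Refresh a (possibly noisy) binary cloud mask on the fly.

    prob_map: current model's per-pixel cloud probability in [0, 1].
    threshold: confidence cut-off (trainable in CAL, a constant here).
    """
    new_label = original_label.copy()
    new_label[prob_map > threshold] = 1        # confidently cloudy
    new_label[prob_map < 1.0 - threshold] = 0  # confidently clear
    return new_label                            # ambiguous pixels keep the original label

# toy round of the iterative scheme: relabel, then the model would be fine-tuned on new_label
rng = np.random.default_rng(0)
prob = rng.random((4, 4))
noisy = (rng.random((4, 4)) > 0.5).astype(int)
print(adapt_labels(prob, noisy, threshold=0.8))
```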

TransReg: Cross-transformer as auto-registration module for multi-view mammogram mass detection

  • paper_url: http://arxiv.org/abs/2311.05192
  • repo_url: None
  • paper_authors: Hoang C. Nguyen, Chi Phan, Hieu H. Pham
  • for: 这个研究旨在开发一个基于多视图乳腺X线照片的计算机辅助检测系统(CAD),用于早期乳腺癌筛查中的肿块检测。
  • methods: 该系统利用同侧两个视图(CC 与 MLO)之间的关联,通过交叉 Transformer 建模由孪生 Faster RCNN 提取的感兴趣区域之间的关系,以提高肿块检测的准确性并降低假阳性率。
  • results: 在 DDSM 和 VinDr-Mammo 数据集上,以 SwinT 作为特征提取器时,在每张图像 0.5 个假阳性的条件下,召回率分别达到 83.3% 和 79.7%。
    Abstract Screening mammography is the most widely used method for early breast cancer detection, significantly reducing mortality rates. The integration of information from multi-view mammograms enhances radiologists' confidence and diminishes false-positive rates since they can examine on dual-view of the same breast to cross-reference the existence and location of the lesion. Inspired by this, we present TransReg, a Computer-Aided Detection (CAD) system designed to exploit the relationship between craniocaudal (CC), and mediolateral oblique (MLO) views. The system includes cross-transformer to model the relationship between the region of interest (RoIs) extracted by siamese Faster RCNN network for mass detection problems. Our work is the first time cross-transformer has been integrated into an object detection framework to model the relation between ipsilateral views. Our experimental evaluation on DDSM and VinDr-Mammo datasets shows that our TransReg, equipped with SwinT as a feature extractor achieves state-of-the-art performance. Specifically, at the false positive rate per image at 0.5, TransReg using SwinT gets a recall at 83.3% for DDSM dataset and 79.7% for VinDr-Mammo dataset. Furthermore, we conduct a comprehensive analysis to demonstrate that cross-transformer can function as an auto-registration module, aligning the masses in dual-view and utilizing this information to inform final predictions. It is a replication diagnostic workflow of expert radiologists
    摘要 乳腺X线摄影筛查是早期乳腺癌检测中应用最广泛的方法,可显著降低死亡率。整合多视图乳腺X线照片的信息能够增强放射科医生的信心并降低假阳性率,因为他们可以在同一乳房的两个视图上交叉对照病灶的存在与位置。受此启发,我们提出了 TransReg,一种旨在利用头尾位(CC)与内外斜位(MLO)视图之间关系的计算机辅助检测(CAD)系统。该系统使用交叉 Transformer 建模由孪生 Faster RCNN 网络提取的感兴趣区域(RoI)之间的关系,用于肿块检测。据我们所知,这是首次将交叉 Transformer 集成到目标检测框架中以建模同侧视图之间的关系。在 DDSM 和 VinDr-Mammo 数据集上的实验评估表明,以 SwinT 作为特征提取器的 TransReg 取得了最先进的性能:在每张图像 0.5 个假阳性的条件下,DDSM 数据集上的召回率为 83.3%,VinDr-Mammo 数据集上为 79.7%。此外,我们的综合分析表明,交叉 Transformer 可以充当自动配准模块,对齐双视图中的肿块并利用这一信息辅助最终预测,从而复现了放射科专家的诊断流程。
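The cross-transformer can be pictured as cross-attention in which RoI features from one view query the RoI features of the ipsilateral view. The sketch below shows only that block; the feature width, head count and residual form are assumptions, and the attention weights are what lets it behave like a soft auto-registration between CC and MLO.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Minimal cross-attention between RoI features of the CC and MLO views.

    A simplified stand-in for the paper's cross-transformer, which has
    additional components.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cc_rois, mlo_rois):
        # each CC RoI queries the MLO RoIs, acting like a soft registration between views
        attended, weights = self.attn(query=cc_rois, key=mlo_rois, value=mlo_rois)
        return self.norm(cc_rois + attended), weights

cc = torch.randn(1, 12, 256)    # 12 candidate RoIs from the CC view
mlo = torch.randn(1, 15, 256)   # 15 candidate RoIs from the MLO view
fused, w = CrossViewAttention()(cc, mlo)
print(fused.shape, w.shape)     # (1, 12, 256), (1, 12, 15)
```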

Audio-visual Saliency for Omnidirectional Videos

  • paper_url: http://arxiv.org/abs/2311.05190
  • repo_url: https://github.com/FannyChao/AVS360_audiovisual_saliency_360
  • paper_authors: Yuxin Zhu, Xilei Zhu, Huiyu Duan, Jie Li, Kaiwei Zhang, Yucheng Zhu, Li Chen, Xiongkuo Min, Guangtao Zhai
  • for: Predicting visual saliency in omnidirectional videos (ODVs) and analyzing the influence of audio on visual attention.
  • methods: A large-scale audio-visual dataset (AVS-ODV) is used to analyze observers' visual attention behavior under various omnidirectional audio modalities and visual scenes; several state-of-the-art saliency prediction models are then compared on AVS-ODV to construct a new benchmark.
  • results: The paper establishes the largest audio-visual saliency dataset for ODVs, analyzes visual attention behavior under different audio modalities and visual scenes, and provides a new benchmark of saliency prediction models on AVS-ODV.
    Abstract Visual saliency prediction for omnidirectional videos (ODVs) has shown great significance and necessity for omnidirectional videos to help ODV coding, ODV transmission, ODV rendering, etc.. However, most studies only consider visual information for ODV saliency prediction while audio is rarely considered despite its significant influence on the viewing behavior of ODV. This is mainly due to the lack of large-scale audio-visual ODV datasets and corresponding analysis. Thus, in this paper, we first establish the largest audio-visual saliency dataset for omnidirectional videos (AVS-ODV), which comprises the omnidirectional videos, audios, and corresponding captured eye-tracking data for three video sound modalities including mute, mono, and ambisonics. Then we analyze the visual attention behavior of the observers under various omnidirectional audio modalities and visual scenes based on the AVS-ODV dataset. Furthermore, we compare the performance of several state-of-the-art saliency prediction models on the AVS-ODV dataset and construct a new benchmark. Our AVS-ODV datasets and the benchmark will be released to facilitate future research.
    摘要 全向视频(ODV)的视觉显著性预测对 ODV 编码、传输、渲染等具有重要意义和必要性。然而,大多数研究在进行 ODV 显著性预测时只考虑视觉信息;尽管音频对 ODV 的观看行为有显著影响,却很少被考虑,其主要原因在于缺乏大规模的视听 ODV 数据集及相应分析。因此,本文首先建立了目前规模最大的全向视频视听显著性数据集(AVS-ODV),其中包含全向视频、音频以及在静音、单声道和全景声(ambisonics)三种声音模态下采集的眼动数据。随后,我们基于 AVS-ODV 数据集分析了观察者在不同全向音频模态和视觉场景下的视觉注意行为。此外,我们比较了多种最先进的显著性预测模型在 AVS-ODV 数据集上的性能,并构建了新的基准。我们的 AVS-ODV 数据集和基准将公开发布,以促进未来的研究。

Dynamic Association Learning of Self-Attention and Convolution in Image Restoration

  • paper_url: http://arxiv.org/abs/2311.05147
  • repo_url: None
  • paper_authors: Kui Jiang, Xuemei Jia, Wenxin Huang, Wenbin Wang, Zheng Wang, Junjun Jiang
  • for: This paper proposes an association learning method to improve image deraining by utilizing the advantages of CNNs and Self-Attention, while suppressing their shortcomings.
  • methods: The proposed method uses a novel multi-input attention module to generate a degradation prior and produce a degradation mask, which helps to extract informative complementary components from the rainy input and restore accurate textures. The method also uses a hybrid fusion network that combines a residual Transformer branch and an encoder-decoder branch to encode global features of the image and represent contexture knowledge.
  • results: The proposed method achieves high-quality and efficient inpainting by associating rain streak removal and background recovery, and outperforms existing state-of-the-art methods in terms of both visual quality and computational efficiency.
    Abstract CNNs and Self attention have achieved great success in multimedia applications for dynamic association learning of self-attention and convolution in image restoration. However, CNNs have at least two shortcomings: 1) limited receptive field; 2) static weight of sliding window at inference, unable to cope with the content diversity.In view of the advantages and disadvantages of CNNs and Self attention, this paper proposes an association learning method to utilize the advantages and suppress their shortcomings, so as to achieve high-quality and efficient inpainting. We regard rain distribution reflects the degradation location and degree, in addition to the rain distribution prediction. Thus, we propose to refine background textures with the predicted degradation prior in an association learning manner. As a result, we accomplish image deraining by associating rain streak removal and background recovery, where an image deraining network and a background recovery network are designed for two subtasks. The key part of association learning is a novel multi-input attention module. It generates the degradation prior and produces the degradation mask according to the predicted rainy distribution. Benefited from the global correlation calculation of SA, MAM can extract the informative complementary components from the rainy input with the degradation mask, and then help accurate texture restoration. Meanwhile, SA tends to aggregate feature maps with self-attention importance, but convolution diversifies them to focus on the local textures. A hybrid fusion network involves one residual Transformer branch and one encoder-decoder branch. The former takes a few learnable tokens as input and stacks multi-head attention and feed-forward networks to encode global features of the image. The latter, conversely, leverages the multi-scale encoder-decoder to represent contexture knowledge.
    摘要 CNN 与自注意力在多媒体应用中取得了巨大成功,也被用于图像复原中卷积与自注意力的动态关联学习。然而,CNN 至少存在两个不足:1)感受野有限;2)推理时滑动窗口的权重固定,难以应对内容的多样性。鉴于 CNN 与自注意力各自的优缺点,本文提出一种关联学习方法,以发挥二者的优势并抑制其不足,从而实现高质量且高效的修复。我们认为雨纹分布反映了退化的位置与程度,因此除了预测雨纹分布外,还提出以预测得到的退化先验以关联学习的方式细化背景纹理。由此,我们将图像去雨分解为雨纹去除与背景恢复两个子任务,并分别设计了去雨网络与背景恢复网络。关联学习的关键部分是一个新颖的多输入注意力模块(MAM):它根据预测的雨纹分布生成退化先验并产生退化掩码。得益于自注意力的全局相关性计算,MAM 能够借助退化掩码从含雨输入中提取有用的互补成分,进而帮助实现准确的纹理恢复。同时,自注意力倾向于按注意力重要性聚合特征图,而卷积则使特征多样化、聚焦于局部纹理。混合融合网络包含一个残差 Transformer 分支和一个编码器-解码器分支:前者以少量可学习标记(token)作为输入,堆叠多头注意力与前馈网络来编码图像的全局特征;后者则利用多尺度编码器-解码器来表示上下文知识。

OW-SLR: Overlapping Windows on Semi-Local Region for Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.05146
  • repo_url: https://github.com/rishavbb/owslr
  • paper_authors: Rishav Bhardwaj, Janarthanam Jothi Balaji, Vasudevan Lakshminarayanan
  • for: 该论文目的是提出一种基于 semi-local 区域的 implicit neural representation 方法,以提高图像的缩放精度。
  • methods: 该方法使用 Overlapping Windows on Semi-Local Region (OW-SLR) 技术,在 latent space 中提取 semi-local 区域的特征,并使用这些特征来预测图像的 RGB 值。
  • results: 该方法对 OCT-A 图像进行任意分辨率的上采样后,在健康与患病视网膜图像(如糖尿病视网膜病变与正常)分类任务上表现出色,并且在 OCT500 数据集上优于现有方法。
    Abstract There has been considerable progress in implicit neural representation to upscale an image to any arbitrary resolution. However, existing methods are based on defining a function to predict the Red, Green and Blue (RGB) value from just four specific loci. Relying on just four loci is insufficient as it leads to losing fine details from the neighboring region(s). We show that by taking into account the semi-local region leads to an improvement in performance. In this paper, we propose applying a new technique called Overlapping Windows on Semi-Local Region (OW-SLR) to an image to obtain any arbitrary resolution by taking the coordinates of the semi-local region around a point in the latent space. This extracted detail is used to predict the RGB value of a point. We illustrate the technique by applying the algorithm to the Optical Coherence Tomography-Angiography (OCT-A) images and show that it can upscale them to random resolution. This technique outperforms the existing state-of-the-art methods when applied to the OCT500 dataset. OW-SLR provides better results for classifying healthy and diseased retinal images such as diabetic retinopathy and normals from the given set of OCT-A images. The project page is available at https://rishavbb.github.io/ow-slr/index.html
    摘要 隐式神经表示在将图像上采样到任意分辨率方面已取得了长足进展。然而,现有方法通常只依据四个特定位置来定义预测 RGB 值的函数。仅依赖四个位置并不充分,会导致丢失邻域的精细细节。我们发现,考虑半局部区域能够带来性能提升。本文提出一种名为半局部区域重叠窗口(OW-SLR)的新技术:在隐空间中取一点周围半局部区域的坐标,利用提取到的细节来预测该点的 RGB 值,从而将图像上采样到任意分辨率。我们将该算法应用于光学相干断层扫描血管成像(OCT-A)图像,结果表明它可以将其上采样到任意分辨率,并在 OCT500 数据集上优于现有最先进方法。OW-SLR 在给定的 OCT-A 图像上对健康与患病视网膜(如糖尿病视网膜病变与正常)进行分类时也取得了更好的结果。项目主页: https://rishavbb.github.io/ow-slr/index.html
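Roughly, OW-SLR decodes the RGB value of an arbitrary query coordinate from latent features gathered in a semi-local window around it, and the windows of neighbouring queries overlap. The sketch below illustrates that decoding step only; the window size, feature width, MLP shape and the zero-padding at borders are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SemiLocalDecoder(nn.Module):
    """Predict the RGB value at an arbitrary coordinate from the latent features
    of a semi-local window around it (a rough sketch of the OW-SLR idea).
    """
    def __init__(self, feat_dim=64, window=3):
        super().__init__()
        self.window = window
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * window * window + 2, 256), nn.ReLU(),
            nn.Linear(256, 3))

    def forward(self, latent, coord):
        # latent: (C, H, W) feature map; coord: (2,) continuous location in pixel units
        cy, cx = int(coord[0].item()), int(coord[1].item())
        r = self.window // 2
        y0, x0 = max(cy - r, 0), max(cx - r, 0)
        patch = latent[:, y0:y0 + self.window, x0:x0 + self.window]
        # zero-pad near borders so the flattened patch always has the same size
        pad = torch.zeros(latent.shape[0], self.window, self.window)
        pad[:, :patch.shape[1], :patch.shape[2]] = patch
        # the fractional offset inside the cell tells the MLP where in the window we are
        frac = coord - coord.floor()
        return self.mlp(torch.cat([pad.flatten(), frac]))

rgb = SemiLocalDecoder()(torch.randn(64, 32, 32), torch.tensor([10.3, 17.8]))
print(rgb.shape)  # torch.Size([3])
```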

SCAAT: Improving Neural Network Interpretability via Saliency Constrained Adaptive Adversarial Training

  • paper_url: http://arxiv.org/abs/2311.05143
  • repo_url: None
  • paper_authors: Rui Xu, Wenkang Qin, Peixiang Huang, Hao Wang, Lin Luo
  • for: 提高深度神经网络(DNN)的解释性,使其预测结果更加 transparent 和 understandable。
  • methods: 提出了一种模型无关学习方法called Saliency Constrained Adaptive Adversarial Training(SCAAT),通过构建对抗样本,从而提高DNN的解释性。
  • results: SCAAT 能够在不修改模型结构的情况下消除显著图中的大部分噪声,使其更稀疏、更可信;在多个领域和指标上对多种 DNN 的评估表明,SCAAT 可在不牺牲预测能力的前提下显著提升 DNN 的可解释性。
    Abstract Deep Neural Networks (DNNs) are expected to provide explanation for users to understand their black-box predictions. Saliency map is a common form of explanation illustrating the heatmap of feature attributions, but it suffers from noise in distinguishing important features. In this paper, we propose a model-agnostic learning method called Saliency Constrained Adaptive Adversarial Training (SCAAT) to improve the quality of such DNN interpretability. By constructing adversarial samples under the guidance of saliency map, SCAAT effectively eliminates most noise and makes saliency maps sparser and more faithful without any modification to the model architecture. We apply SCAAT to multiple DNNs and evaluate the quality of the generated saliency maps on various natural and pathological image datasets. Evaluations on different domains and metrics show that SCAAT significantly improves the interpretability of DNNs by providing more faithful saliency maps without sacrificing their predictive power.
    摘要 深度神经网络(DNN)需要为用户提供解释,以便理解其黑箱预测。显著图是一种常见的解释形式,用热力图展示特征归因,但在区分重要特征时容易受到噪声干扰。本文提出了一种与模型无关的学习方法——显著性约束自适应对抗训练(SCAAT),以提升此类 DNN 可解释性的质量。SCAAT 在显著图的引导下构造对抗样本,有效消除大部分噪声,使显著图更稀疏、更可信,且无需对模型结构做任何修改。我们将 SCAAT 应用于多种 DNN,并在多个自然图像与病理图像数据集上评估所生成显著图的质量。不同领域和指标上的评估表明,SCAAT 能在不牺牲预测能力的情况下提供更可信的显著图,从而显著提升 DNN 的可解释性。
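The following sketch shows one way to build a saliency-constrained adversarial sample, the key ingredient described above: the input gradient serves as a saliency estimate and only the less-salient pixels are perturbed. The FGSM-style step, the epsilon and the top-k fraction are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def saliency_guided_adversarial(model, x, y, eps=2.0 / 255, keep_top=0.1):
    """One saliency-constrained adversarial example (illustrative sketch).

    Only pixels outside the top `keep_top` fraction of gradient saliency are
    perturbed, so training on such samples discourages reliance on them.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]

    saliency = grad.abs().amax(dim=1, keepdim=True)             # (B,1,H,W) pixel saliency
    k = int(keep_top * saliency[0].numel())
    thresh = saliency.flatten(1).topk(k, dim=1).values[:, -1]   # per-image top-k cut-off
    low_saliency = (saliency < thresh.view(-1, 1, 1, 1)).float()

    x_adv = x + eps * grad.sign() * low_saliency                # perturb unimportant pixels only
    return x_adv.detach().clamp(0, 1)
```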

ScribblePolyp: Scribble-Supervised Polyp Segmentation through Dual Consistency Alignment

  • paper_url: http://arxiv.org/abs/2311.05122
  • repo_url: None
  • paper_authors: Zixun Zhang, Yuncheng Jiang, Jun Wei, Hannah Cui, Zhen Li
  • for: scribble-supervised polyp segmentation framework
  • methods: two-branch consistency alignment approach (transformation consistency alignment + affinity propagation)
  • results: Dice score of 0.8155 (with potential for a further 1.8% improvement through self-training)
    Abstract Automatic polyp segmentation models play a pivotal role in the clinical diagnosis of gastrointestinal diseases. In previous studies, most methods relied on fully supervised approaches, necessitating pixel-level annotations for model training. However, the creation of pixel-level annotations is both expensive and time-consuming, impeding the development of model generalization. In response to this challenge, we introduce ScribblePolyp, a novel scribble-supervised polyp segmentation framework. Unlike fully-supervised models, ScribblePolyp only requires the annotation of two lines (scribble labels) for each image, significantly reducing the labeling cost. Despite the coarse nature of scribble labels, which leave a substantial portion of pixels unlabeled, we propose a two-branch consistency alignment approach to provide supervision for these unlabeled pixels. The first branch employs transformation consistency alignment to narrow the gap between predictions under different transformations of the same input image. The second branch leverages affinity propagation to refine predictions into a soft version, extending additional supervision to unlabeled pixels. In summary, ScribblePolyp is an efficient model that does not rely on teacher models or moving average pseudo labels during training. Extensive experiments on the SUN-SEG dataset underscore the effectiveness of ScribblePolyp, achieving a Dice score of 0.8155, with the potential for a 1.8% improvement in the Dice score through a straightforward self-training strategy.
    摘要 自动息肉分割模型在胃肠疾病的临床诊断中起着关键作用。以往的研究大多依赖全监督方法,需要像素级标注来训练模型。然而,像素级标注的制作既昂贵又耗时,阻碍了模型泛化能力的发展。针对这一挑战,我们提出了 ScribblePolyp,一种新颖的涂鸦监督息肉分割框架。与全监督模型不同,ScribblePolyp 对每张图像只需标注两条线(涂鸦标签),显著降低了标注成本。尽管涂鸦标签较为粗糙、会留下大量未标注像素,我们提出了一种双分支一致性对齐方法来为这些未标注像素提供监督:第一个分支采用变换一致性对齐,缩小同一输入图像在不同变换下预测之间的差距;第二个分支利用亲和传播将预测细化为软标签,从而为未标注像素提供额外监督。总之,ScribblePolyp 是一个高效的模型,训练过程中不依赖教师模型或滑动平均伪标签。在 SUN-SEG 数据集上的大量实验表明了 ScribblePolyp 的有效性,其 Dice 分数达到 0.8155,并且通过简单的自训练策略还有望再提升 1.8%。
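The supervision described above amounts to a partial cross-entropy on the scribbled pixels plus a consistency term between predictions under different transformations. The sketch below uses a single horizontal flip and equal loss weighting as assumptions, and omits the affinity-propagation branch.

```python
import torch
import torch.nn.functional as F

def scribble_and_consistency_loss(model, image, scribble, ignore_index=255):
    """Sketch of the two supervision signals for scribble-supervised segmentation.

    scribble: (B, H, W) long tensor with class ids on annotated pixels and
    ignore_index everywhere else.
    """
    logits = model(image)                                   # (B, num_classes, H, W)
    # 1) supervised loss only where the scribble provides a label
    sup = F.cross_entropy(logits, scribble, ignore_index=ignore_index)

    # 2) predictions of the flipped image, flipped back, should agree everywhere
    logits_flip = model(torch.flip(image, dims=[3]))
    logits_flip = torch.flip(logits_flip, dims=[3])
    cons = F.mse_loss(logits.softmax(dim=1), logits_flip.softmax(dim=1))
    return sup + cons                                       # equal weighting is an assumption
```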

Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks

  • paper_url: http://arxiv.org/abs/2311.05109
  • repo_url: None
  • paper_authors: Kartik Gupta, Akshay Asthana
  • for: 这个论文目的是对适合边缘设备的量化网络进行优化,以减少计算和内存资源的消耗。
  • methods: 这个论文使用了量化训练(Quantization-Aware Training,QAT)来对网络进行量化,并提出了一些新的方法来缓解量化网络中的振荡现象,以提高量化网络的精度。
  • results: 这个论文的结果显示,使用了该些新方法后,可以对YOLO模型进行高效的量化,并在COCO dataset上进行了广泛的评估,获得了更高的精度和更低的错误率。
    Abstract Quantized networks use less computational and memory resources and are suitable for deployment on edge devices. While quantization-aware training QAT is the well-studied approach to quantize the networks at low precision, most research focuses on over-parameterized networks for classification with limited studies on popular and edge device friendly single-shot object detection and semantic segmentation methods like YOLO. Moreover, majority of QAT methods rely on Straight-through Estimator (STE) approximation which suffers from an oscillation phenomenon resulting in sub-optimal network quantization. In this paper, we show that it is difficult to achieve extremely low precision (4-bit and lower) for efficient YOLO models even with SOTA QAT methods due to oscillation issue and existing methods to overcome this problem are not effective on these models. To mitigate the effect of oscillation, we first propose Exponentially Moving Average (EMA) based update to the QAT model. Further, we propose a simple QAT correction method, namely QC, that takes only a single epoch of training after standard QAT procedure to correct the error induced by oscillating weights and activations resulting in a more accurate quantized model. With extensive evaluation on COCO dataset using various YOLO5 and YOLO7 variants, we show that our correction method improves quantized YOLO networks consistently on both object detection and segmentation tasks at low-precision (4-bit and 3-bit).
    摘要 量化网络占用更少的计算与内存资源,适合部署在边缘设备上。量化感知训练(QAT)是将网络量化到低比特精度的常用方法,但大多数研究集中在过参数化的分类网络上,对 YOLO 这类适合边缘设备的单阶段目标检测与语义分割方法研究较少。此外,多数 QAT 方法依赖直通估计器(STE)近似,其带来的振荡现象会导致次优的网络量化。本文指出,即便使用最先进的 QAT 方法,受振荡问题影响,高效的 YOLO 模型也难以实现极低比特(4 比特及以下)量化,而现有的缓解方法在这些模型上并不有效。为减轻振荡的影响,我们首先提出在 QAT 模型上采用指数滑动平均(EMA)更新;进一步提出一种简单的 QAT 校正方法(QC),在标准 QAT 流程之后仅需一个 epoch 的训练,即可校正由振荡的权重和激活引入的误差,从而得到更准确的量化模型。我们在 COCO 数据集上对多种 YOLO5 和 YOLO7 变体进行了广泛评估,结果表明该校正方法在低比特(4 比特与 3 比特)下能够持续提升量化 YOLO 网络在目标检测与分割任务上的表现。
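The EMA-based update proposed to damp oscillating weights can be sketched as a shadow copy of the model that tracks the quantization-aware training weights, as below; the decay value is an assumed hyper-parameter and the single-epoch QC correction is not shown.

```python
import copy
import torch

class WeightEMA:
    """Exponential moving average of model weights, used to damp the weight
    oscillations that appear during low-bit QAT (decay is an assumed value).
    """
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# usage inside a QAT loop:
#   ema = WeightEMA(model)
#   ... optimizer.step() ...
#   ema.update(model)          # call after every optimizer step; evaluate ema.shadow
```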

Self-similarity Prior Distillation for Unsupervised Remote Physiological Measurement

  • paper_url: http://arxiv.org/abs/2311.05100
  • repo_url: None
  • paper_authors: Xinyu Zhang, Weiyu Sun, Hao Lu, Ying Chen, Yun Ge, Xiaolin Huang, Jie Yuan, Yingcong Chen
  • for: 本研究旨在提出一种无需标注数据的无监督远程光电容积描记(rPPG)估计方法,通过利用生理信号固有的自相似性来提高估计精度。
  • methods: 我们提出了一种基于自相似性先验的框架,包括嵌入物理先验的数据增强技术、自相似性感知网络和层次自蒸馏范式。
  • results: 我们的方法在不同的测试数据集上实现了与标注方法相当或更高的性能,同时具有最低的推理时间和计算成本。
    Abstract Remote photoplethysmography (rPPG) is a noninvasive technique that aims to capture subtle variations in facial pixels caused by changes in blood volume resulting from cardiac activities. Most existing unsupervised methods for rPPG tasks focus on the contrastive learning between samples while neglecting the inherent self-similar prior in physiological signals. In this paper, we propose a Self-Similarity Prior Distillation (SSPD) framework for unsupervised rPPG estimation, which capitalizes on the intrinsic self-similarity of cardiac activities. Specifically, we first introduce a physical-prior embedded augmentation technique to mitigate the effect of various types of noise. Then, we tailor a self-similarity-aware network to extract more reliable self-similar physiological features. Finally, we develop a hierarchical self-distillation paradigm to assist the network in disentangling self-similar physiological patterns from facial videos. Comprehensive experiments demonstrate that the unsupervised SSPD framework achieves comparable or even superior performance compared to the state-of-the-art supervised methods. Meanwhile, SSPD maintains the lowest inference time and computation cost among end-to-end models. The source codes are available at https://github.com/LinXi1C/SSPD.
    摘要 远程光电容积描记(rPPG)是一种非接触技术,旨在捕捉由心脏活动引起的血容量变化在面部像素上造成的细微变化。现有的大多数无监督 rPPG 方法侧重于样本间的对比学习,而忽视了生理信号固有的自相似性先验。本文提出了一种用于无监督 rPPG 估计的自相似性先验蒸馏(SSPD)框架,充分利用心脏活动的内在自相似性。具体而言,我们首先引入一种嵌入物理先验的数据增强技术,以减轻各类噪声的影响;然后设计了一个自相似性感知网络,用于提取更可靠的自相似生理特征;最后,我们提出一种层次自蒸馏范式,帮助网络从面部视频中解耦出自相似的生理模式。综合实验表明,无监督的 SSPD 框架可取得与最先进的监督方法相当甚至更优的性能,同时在端到端模型中保持最低的推理时间与计算开销。源代码见 https://github.com/LinXi1C/SSPD 。
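The self-similarity prior is easy to visualize: sliding windows of a quasi-periodic pulse signal correlate strongly with one another. The sketch below computes such a self-similarity map for a synthetic waveform; the window and stride lengths are illustrative choices, not the paper's settings.

```python
import numpy as np

def self_similarity_map(signal, win=60, stride=10):
    """Correlation between all pairs of sliding windows of a pulse signal.

    For a quasi-periodic rPPG waveform this map shows strong diagonal banding,
    which is the self-similar structure the method exploits.
    """
    starts = range(0, len(signal) - win + 1, stride)
    segs = np.stack([signal[s:s + win] for s in starts])
    # zero-mean, unit-variance normalization per window, then pairwise correlation
    segs = (segs - segs.mean(axis=1, keepdims=True)) / (segs.std(axis=1, keepdims=True) + 1e-8)
    return segs @ segs.T / win                  # (n_windows, n_windows)

# toy pulse-like signal at ~1.2 Hz sampled at 30 fps
t = np.arange(0, 10, 1 / 30)
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(len(t))
print(self_similarity_map(ppg).shape)
```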

POISE: Pose Guided Human Silhouette Extraction under Occlusions

  • paper_url: http://arxiv.org/abs/2311.05077
  • repo_url: https://github.com/take2rohit/poise
  • paper_authors: Arindam Dutta, Rohit Lal, Dripta S. Raychaudhuri, Calvin Khang Ta, Amit K. Roy-Chowdhury
  • for: 该论文旨在提出一种用于遮挡条件下人体剪影提取的自监督融合方法,以提高剪影预测的准确性和鲁棒性。
  • methods: 该方法将分割模型给出的初始剪影估计与 2D 姿态估计模型给出的人体关节点预测相融合,利用两者的互补优势,将精确的身体形状信息与空间信息结合起来以应对遮挡。
  • results: 实验结果表明,该方法能够在遮挡条件下提升人体剪影提取的效果,并在步态识别等下游任务中取得了可喜的结果。
    Abstract Human silhouette extraction is a fundamental task in computer vision with applications in various downstream tasks. However, occlusions pose a significant challenge, leading to incomplete and distorted silhouettes. To address this challenge, we introduce POISE: Pose Guided Human Silhouette Extraction under Occlusions, a novel self-supervised fusion framework that enhances accuracy and robustness in human silhouette prediction. By combining initial silhouette estimates from a segmentation model with human joint predictions from a 2D pose estimation model, POISE leverages the complementary strengths of both approaches, effectively integrating precise body shape information and spatial information to tackle occlusions. Furthermore, the self-supervised nature of \POISE eliminates the need for costly annotations, making it scalable and practical. Extensive experimental results demonstrate its superiority in improving silhouette extraction under occlusions, with promising results in downstream tasks such as gait recognition. The code for our method is available https://github.com/take2rohit/poise.
    摘要 人体剪影提取是计算机视觉中的一项基础任务,在多种下游任务中有广泛应用。然而,遮挡带来了重大挑战,会导致剪影不完整、发生形变。为此,我们提出 POISE(Pose Guided Human Silhouette Extraction under Occlusions),一种新颖的自监督融合框架,用于提升人体剪影预测的准确性与鲁棒性。POISE 将分割模型给出的初始剪影估计与 2D 姿态估计模型给出的人体关节点预测相结合,充分利用两者的互补优势,有效整合精确的身体形状信息与空间信息以应对遮挡。此外,POISE 的自监督特性免去了昂贵的标注需求,使其具有可扩展性和实用性。大量实验结果表明,POISE 在遮挡条件下能显著改善剪影提取,并在步态识别等下游任务中取得了可喜的结果。代码见 https://github.com/take2rohit/poise 。

On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

  • paper_url: http://arxiv.org/abs/2311.05071
  • repo_url: None
  • paper_authors: Daniel Claborne, Eric Slyman, Karl Pazdernik
  • for: 本研究训练了一个身份验证模型,并评估了对模型中融合音频与视觉表示部分的若干修改,包括待比较的两个样本之一缺失某种输入的情形。
  • methods: 其中一种做法是对两种模态的输出嵌入取平均,以提升模型在完整模态以及单一模态缺失情况下的准确率。
  • results: 研究发现,对输出嵌入取平均比使用共享层能更充分地利用嵌入空间,并在完整模态与单一模态缺失两种情况下降低错误率。
    Abstract We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input is missing in either of two examples to be compared. We report results on the Voxceleb1-E test set that suggest averaging the output embeddings improves error rate in the full-modality setting and when a single modality is missing, and makes more complete use of the embedding space than systems which use shared layers and discuss possible reasons for this behavior.
    摘要 我们训练了一个身份验证架构,并评估了对模型中融合音频与视觉表示部分的修改,包括待比较的两个样本之一缺失某种输入的情形。我们在 Voxceleb1-E 测试集上的结果表明,对输出嵌入取平均可以在完整模态以及单一模态缺失的情况下降低错误率,并且比使用共享层的系统更充分地利用嵌入空间;我们还讨论了产生这种现象的可能原因。
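The averaging strategy discussed above is simple enough to show directly: the fused embedding is the mean of whichever modality embeddings are present, so a missing modality degrades gracefully. The unit normalization and the cosine threshold in the sketch are assumptions.

```python
import numpy as np

def fuse_embeddings(audio_emb=None, visual_emb=None):
    """Average the available modality embeddings; with one modality missing,
    the fused vector simply falls back to the other.
    """
    present = [e for e in (audio_emb, visual_emb) if e is not None]
    fused = np.mean(present, axis=0)
    return fused / (np.linalg.norm(fused) + 1e-12)   # unit-normalize (assumed convention)

def verify(emb_a, emb_b, threshold=0.5):
    """Cosine-similarity identity check between two fused embeddings."""
    return float(emb_a @ emb_b) >= threshold

a, v = np.random.randn(256), np.random.randn(256)
full = fuse_embeddings(a, v)
audio_only = fuse_embeddings(audio_emb=a)            # visual input missing
print(verify(full, audio_only))
```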