cs.CV - 2023-08-16

High-Fidelity Lake Extraction via Two-Stage Prompt Enhancement: Establishing a Novel Baseline and Benchmark

  • paper_url: http://arxiv.org/abs/2308.08443
  • repo_url: None
  • paper_authors: Ben Chen, Xuechao Zou, Kai Li, Yu Zhang, Junliang Xing, Pin Tao
  • for: This paper proposes a unified prompt-based dataset construction approach for automatically extracting lakes from remote sensing images.
  • methods: The method uses a two-stage prompt enhancement framework, LEPrompter, with a prompt-based stage and a prompt-free stage during training. In the prompt-based stage, a prompt encoder extracts prior information, and prompt tokens are fused with image embeddings through self- and cross-attention in the prompt decoder. In the prompt-free stage, prompts are disabled so the model is independent of them at inference time.
  • results: On the Surface Water and Qinghai-Tibet Plateau Lake datasets, LEPrompter improves on the previous state-of-the-art method without adding extra parameters or GFLOPs, reaching mIoU scores of 91.48% and 97.43%, respectively.
    Abstract The extraction of lakes from remote sensing images is a complex challenge due to the varied lake shapes and data noise. Current methods rely on multispectral image datasets, making it challenging to learn lake features accurately from pixel arrangements. This, in turn, affects model learning and the creation of accurate segmentation masks. This paper introduces a unified prompt-based dataset construction approach that provides approximate lake locations using point, box, and mask prompts. We also propose a two-stage prompt enhancement framework, LEPrompter, which involves prompt-based and prompt-free stages during training. The prompt-based stage employs a prompt encoder to extract prior information, integrating prompt tokens and image embeddings through self- and cross-attention in the prompt decoder. Prompts are deactivated once the model is trained to ensure independence during inference, enabling automated lake extraction. Evaluations on Surface Water and Qinghai-Tibet Plateau Lake datasets show consistent performance improvements compared to the previous state-of-the-art method. LEPrompter achieves mIoU scores of 91.48% and 97.43% on the respective datasets without introducing additional parameters or GFLOPs. Supplementary materials provide the source code, pre-trained models, and detailed user studies.
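Below is a minimal, illustrative PyTorch sketch (not the authors' code) of the prompt-decoder fusion described above: prompt tokens attend to themselves and then cross-attend to image embeddings. Module names, dimensions, and the layer layout are assumptions made for this example; in the paper the prompt path is disabled once training finishes.

```python
# Illustrative sketch: a toy prompt decoder fusing prompt tokens with image
# embeddings via self- and cross-attention, roughly mirroring the prompt-based
# training stage described in the abstract. Names and sizes are assumptions.
import torch
import torch.nn as nn

class ToyPromptDecoder(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, prompt_tokens, image_embed):
        # prompt_tokens: (B, P, C) encoded point/box/mask prompts
        # image_embed:   (B, HW, C) flattened image features
        x, _ = self.self_attn(prompt_tokens, prompt_tokens, prompt_tokens)
        x = self.norm1(prompt_tokens + x)
        # cross-attention: prompts query the image embedding
        y, _ = self.cross_attn(x, image_embed, image_embed)
        return self.norm2(x + y)

decoder = ToyPromptDecoder()
prompts = torch.randn(2, 4, 256)       # e.g. point/box prompt tokens
features = torch.randn(2, 64 * 64, 256)
fused = decoder(prompts, features)     # (2, 4, 256); the prompt path is dropped at inference
print(fused.shape)
```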

Integrating Visual and Semantic Similarity Using Hierarchies for Image Retrieval

  • paper_url: http://arxiv.org/abs/2308.08431
  • repo_url: https://github.com/vaishwarya96/hierarchy-image-retrieval
  • paper_authors: Aishwarya Venkataramanan, Martin Laviale, Cédric Pradalier
  • for: This paper aims to improve the accuracy of content-based image retrieval (CBIR) so that retrieved results better reflect both the visual and semantic intent of the query.
  • methods: The method trains a deep neural network for classification and merges classes with overlapping features in its latent space into a visual hierarchy. This hierarchy is then integrated into the distance metric used to rank the most similar images.
  • results: Experiments show that the method outperforms existing retrieval methods on the standard datasets CUB-200-2011 and CIFAR100, as well as on a real-life diatom microscopy image retrieval task.
    Abstract Most of the research in content-based image retrieval (CBIR) focus on developing robust feature representations that can effectively retrieve instances from a database of images that are visually similar to a query. However, the retrieved images sometimes contain results that are not semantically related to the query. To address this, we propose a method for CBIR that captures both visual and semantic similarity using a visual hierarchy. The hierarchy is constructed by merging classes with overlapping features in the latent space of a deep neural network trained for classification, assuming that overlapping classes share high visual and semantic similarities. Finally, the constructed hierarchy is integrated into the distance calculation metric for similarity search. Experiments on standard datasets: CUB-200-2011 and CIFAR100, and a real-life use case using diatom microscopy images show that our method achieves superior performance compared to the existing methods on image retrieval.
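The sketch below illustrates the general idea under stated assumptions: class prototypes in a latent space are clustered into a hierarchy, and a hierarchy-derived (cophenetic) class distance is blended with plain feature distance when ranking database images. The clustering method, the blend weight, and the use of cophenetic distance are illustrative choices, not the paper's exact construction.

```python
# Illustrative sketch: build a visual hierarchy over class prototypes and blend
# feature distance with a hierarchy-derived class distance for retrieval.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
num_classes, dim = 10, 64
class_prototypes = rng.normal(size=(num_classes, dim))   # mean latent feature per class

# Hierarchy over classes; cophenetic distance acts as a semantic distance.
Z = linkage(class_prototypes, method="average")
_, coph = cophenet(Z, pdist(class_prototypes))
semantic_dist = squareform(coph)                         # (num_classes, num_classes)

def retrieval_distance(query_feat, query_cls, db_feats, db_cls, lam=0.5):
    visual = np.linalg.norm(db_feats - query_feat, axis=1)
    semantic = semantic_dist[query_cls, db_cls]
    return visual + lam * semantic                       # smaller = more similar

db_feats = rng.normal(size=(100, dim))
db_cls = rng.integers(0, num_classes, size=100)
d = retrieval_distance(rng.normal(size=dim), 3, db_feats, db_cls)
print(np.argsort(d)[:5])                                 # indices of top-5 retrieved images
```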

ALIP: Adaptive Language-Image Pre-training with Synthetic Caption

  • paper_url: http://arxiv.org/abs/2308.08428
  • repo_url: https://github.com/deepglint/alip
  • paper_authors: Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
  • For: The paper aims to improve the performance of vision-language tasks by addressing the issue of intrinsic noise and unmatched image-text pairs in web data through a novel pre-training method called Adaptive Language-Image Pre-training (ALIP).
  • Methods: The paper proposes an ALIP model that integrates supervision from both raw text and synthetic captions, with core components such as the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) that dynamically adjust the weights of samples and image-text/caption pairs during training. An adaptive contrastive loss is also used to reduce the impact of noisy data and enhance the efficiency of pre-training.
  • Results: The paper achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe, and the code and pre-trained models are released for future research at https://github.com/deepglint/ALIP.
    Abstract Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss can effectively reduce the impact of noise data and enhances the efficiency of pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experiments results show that ALIP achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP.
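As a rough illustration of adaptive weighting (not ALIP's actual LCG/DCG definitions, which are only named in the abstract), the toy loss below down-weights image-text pairs whose raw-text and synthetic-caption embeddings disagree; the gating formula is an assumption made for this example.

```python
# Illustrative sketch: a toy "adaptive" contrastive loss with per-sample weights
# derived from the agreement between raw-text and synthetic-caption embeddings.
import torch
import torch.nn.functional as F

def adaptive_contrastive_loss(img, txt, cap, tau=0.07):
    # img, txt, cap: (B, D) L2-normalized embeddings of images, raw texts, synthetic captions
    logits_txt = img @ txt.t() / tau
    logits_cap = img @ cap.t() / tau
    targets = torch.arange(img.size(0), device=img.device)

    # consistency gate (assumed form): weight samples by raw-text/caption agreement
    with torch.no_grad():
        agree = F.cosine_similarity(txt, cap)            # (B,)
        w = torch.softmax(agree / 0.1, dim=0) * img.size(0)

    loss_txt = F.cross_entropy(logits_txt, targets, reduction="none")
    loss_cap = F.cross_entropy(logits_cap, targets, reduction="none")
    return (w * (loss_txt + loss_cap)).mean() / 2

B, D = 8, 128
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
cap = F.normalize(torch.randn(B, D), dim=-1)
print(adaptive_contrastive_loss(img, txt, cap))
```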

Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer

  • paper_url: http://arxiv.org/abs/2308.08414
  • repo_url: None
  • paper_authors: Guangyi Chen, Xiao Liu, Guangrun Wang, Kun Zhang, Philip H. S. Torr, Xiao-Ping Zhang, Yansong Tang
  • for: This work aims to improve video question answering (VideoQA) by transferring image pre-training knowledge while minimizing the gap between the image and video domains.
  • methods: The paper proposes Tem-Adapter, which learns temporal dynamics and complex semantics through a visual Temporal Aligner and a textual Semantic Aligner. To reduce the semantic gap and adapt the textual representation, a template design is introduced that fuses question-answer pairs into event descriptions.
  • results: Tem-Adapter and several pre-training transfer methods are evaluated on two VideoQA benchmarks, and the significant performance improvement demonstrates the effectiveness of the approach.
    Abstract Video-language pre-trained models have shown remarkable success in guiding video question-answering (VideoQA) tasks. However, due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones. This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains. To bridge these gaps, in this paper, we propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics by a visual Temporal Aligner and a textual Semantic Aligner. Unlike conventional pretrained knowledge adaptation methods that only concentrate on the downstream task objective, the Temporal Aligner introduces an extra language-guided autoregressive task aimed at facilitating the learning of temporal dependencies, with the objective of predicting future states based on historical clues and language guidance that describes event progression. Besides, to reduce the semantic gap and adapt the textual representation for better event description, we introduce a Semantic Aligner that first designs a template to fuse question and answer pairs as event descriptions and then learns a Transformer decoder with the whole video sequence as guidance for refinement. We evaluate Tem-Adapter and different pre-train transferring methods on two VideoQA benchmarks, and the significant performance improvement demonstrates the effectiveness of our method.

Prediction of post-radiotherapy recurrence volumes in head and neck squamous cell carcinoma using 3D U-Net segmentation

  • paper_url: http://arxiv.org/abs/2308.08396
  • repo_url: None
  • paper_authors: Denis Kutnár, Ivan R Vogelius, Katrin Elisabet Håkansson, Jens Petersen, Jeppe Friborg, Lena Specht, Mogens Bernsdorf, Anita Gothelf, Claus Kristensen, Abraham George Smith
  • for: Locoregional recurrence (LRR) remains a frequent cause of treatment failure in head and neck squamous cell carcinoma (HNSCC); identifying high-risk subvolumes from pre-treatment imaging is key to biologically targeted radiotherapy.
  • methods: A convolutional neural network (CNN) is used to predict LRR volumes from pre-treatment 18F-fluorodeoxyglucose positron emission tomography (FDG-PET)/computed tomography (CT) scans, and is compared with a pre-trained CNN, an SUVmax threshold approach, and the gross tumour volume (GTV) contour.
  • results: The CNN can predict LRR volumes from pre-treatment FDG-PET/CT scans, covering the relapse origin as often as the GTV contour while producing significantly smaller volumes, but further dataset development is needed to reach clinically useful prediction accuracy.
    Abstract Locoregional recurrences (LRR) are still a frequent site of treatment failure for head and neck squamous cell carcinoma (HNSCC) patients. Identification of high risk subvolumes based on pretreatment imaging is key to biologically targeted radiation therapy. We investigated the extent to which a Convolutional neural network (CNN) is able to predict LRR volumes based on pre-treatment 18F-fluorodeoxyglucose positron emission tomography (FDG-PET)/computed tomography (CT) scans in HNSCC patients and thus the potential to identify biological high risk volumes using CNNs. For 37 patients who had undergone primary radiotherapy for oropharyngeal squamous cell carcinoma, five oncologists contoured the relapse volumes on recurrence CT scans. Datasets of pre-treatment FDG-PET/CT, gross tumour volume (GTV) and contoured relapse for each of the patients were randomly divided into training (n=23), validation (n=7) and test (n=7) datasets. We compared a CNN trained from scratch, a pre-trained CNN, a SUVmax threshold approach, and using the GTV directly. The SUVmax threshold method included 5 out of the 7 relapse origin points within a volume of median 4.6 cubic centimetres (cc). Both the GTV contour and best CNN segmentations included the relapse origin 6 out of 7 times with median volumes of 28 and 18 cc respectively. The CNN included the same or greater number of relapse volume POs, with significantly smaller relapse volumes. Our novel findings indicate that CNNs may predict LRR, yet further work on dataset development is required to attain clinically useful prediction accuracy.
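For context, a minimal sketch of the SUVmax-threshold baseline mentioned in the abstract might look like the following; the 50% fraction and the voxel size are placeholders for illustration, not values taken from the paper.

```python
# Illustrative sketch of an SUVmax-threshold baseline: keep the connected PET
# region above a fixed fraction of the maximum SUV as a candidate high-risk volume.
import numpy as np
from scipy import ndimage

def suvmax_threshold_volume(pet, frac=0.5, voxel_cc=0.001):
    """Return a boolean mask of the thresholded region and its volume in cc."""
    mask = pet >= frac * pet.max()
    # keep only the connected component containing the SUVmax voxel
    labels, _ = ndimage.label(mask)
    peak_label = labels[np.unravel_index(np.argmax(pet), pet.shape)]
    region = labels == peak_label
    return region, region.sum() * voxel_cc

# toy PET volume with a bright "lesion"
pet = np.random.rand(64, 64, 32)
pet[30:36, 30:36, 10:14] += 5.0
region, volume_cc = suvmax_threshold_volume(pet)
print(volume_cc)
```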

SIGMA: Scale-Invariant Global Sparse Shape Matching

  • paper_url: http://arxiv.org/abs/2308.08393
  • repo_url: None
  • paper_authors: Maolin Gao, Paul Roetzer, Marvin Eisenberger, Zorah Lähner, Michael Moeller, Daniel Cremers, Florian Bernard
  • for: The paper proposes a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences between highly non-rigid shapes.
  • methods: The paper introduces a projected Laplace-Beltrami operator (PLBO) that combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. The PLBO is integrated, together with an orientation-aware regularizer, into a novel MIP formulation that can be solved to global optimality for many practical problems. Unlike previous methods, the approach is provably invariant to rigid transformations and global scaling, initialization-free, comes with optimality guarantees, and scales to high-resolution meshes with empirically linear time.
  • results: The method achieves state-of-the-art sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, and also handles mesh-to-point-cloud matching.
    Abstract We propose a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences for highly non-rigid shapes. To this end, we introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. We integrate the PLBO, together with an orientation-aware regulariser, into a novel MIP formulation that can be solved to global optimality for many practical problems. In contrast to previous methods, our approach is provably invariant to rigid transformations and global scaling, initialisation-free, has optimality guarantees, and scales to high resolution meshes with (empirically observed) linear time. We show state-of-the-art results for sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as applications in mesh-to-point-cloud matching.

Robust Autonomous Vehicle Pursuit without Expert Steering Labels

  • paper_url: http://arxiv.org/abs/2308.08380
  • repo_url: None
  • paper_authors: Jiaxin Pan, Changyao Zhou, Mariia Gladkova, Qadeer Khan, Daniel Cremers
  • for: This work targets lateral and longitudinal motion control of an ego-vehicle pursuing a target vehicle without a pre-defined route, while maintaining a safety distance.
  • methods: Instead of relying on steering labels recorded from an expert driver, a classical controller is used as an offline label generation tool. To mitigate errors in the predicted control values, which can lead to loss of tracking and crashes, an effective data augmentation approach is proposed so the network can handle different views of the target vehicle.
  • results: Extensive validation in the CARLA simulator on a wide range of terrains shows real-time performance and robustness to different scenarios, including unseen trajectories, with high route completion.
    Abstract In this work, we present a learning method for lateral and longitudinal motion control of an ego-vehicle for vehicle pursuit. The car being controlled does not have a pre-defined route, rather it reactively adapts to follow a target vehicle while maintaining a safety distance. To train our model, we do not rely on steering labels recorded from an expert driver but effectively leverage a classical controller as an offline label generation tool. In addition, we account for the errors in the predicted control values, which can lead to a loss of tracking and catastrophic crashes of the controlled vehicle. To this end, we propose an effective data augmentation approach, which allows to train a network capable of handling different views of the target vehicle. During the pursuit, the target vehicle is firstly localized using a Convolutional Neural Network. The network takes a single RGB image along with cars' velocities and estimates the target vehicle's pose with respect to the ego-vehicle. This information is then fed to a Multi-Layer Perceptron, which regresses the control commands for the ego-vehicle, namely throttle and steering angle. We extensively validate our approach using the CARLA simulator on a wide range of terrains. Our method demonstrates real-time performance and robustness to different scenarios including unseen trajectories and high route completion. The project page containing code and multimedia can be publicly accessed here: https://changyaozhou.github.io/Autonomous-Vehicle-Pursuit/.
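A minimal sketch of the final regression step, under assumed input/output conventions (the paper's CNN pose estimator and data augmentation pipeline are not reproduced here):

```python
# Illustrative sketch: a small MLP regressing throttle and steering angle from the
# target vehicle's estimated relative pose plus the two vehicles' velocities, as in
# the pipeline described in the abstract. Architecture and dimensions are assumptions.
import torch
import torch.nn as nn

class PursuitController(nn.Module):
    def __init__(self, in_dim=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),              # [throttle, steering]
        )

    def forward(self, x):
        out = self.net(x)
        throttle = torch.sigmoid(out[:, 0:1])  # throttle in [0, 1]
        steering = torch.tanh(out[:, 1:2])     # steering angle in [-1, 1]
        return torch.cat([throttle, steering], dim=1)

# relative pose (dx, dy, yaw) of the target + ego and target speeds
obs = torch.tensor([[5.0, 0.5, 0.05, 8.0, 7.5]])
controller = PursuitController()
print(controller(obs))
```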

Automated Semiconductor Defect Inspection in Scanning Electron Microscope Images: a Systematic Review

  • paper_url: http://arxiv.org/abs/2308.08376
  • repo_url: None
  • paper_authors: Thibault Lechien, Enrique Dehaerne, Bappaditya Dey, Victor Blanco, Sandip Halder, Stefan De Gendt, Wannes Meert
  • for: This work provides a systematic review of automated semiconductor defect inspection on scanning electron microscope (SEM) images, including the most recent innovations and developments.
  • methods: The reviewed approaches use machine learning algorithms, particularly convolutional neural networks, to automatically classify and locate defects in semiconductor samples.
  • results: 38 publications on SEM-based automated defect inspection are analyzed, and for each the application, methodology, dataset, results, limitations, and future work are summarized.
    Abstract A growing need exists for efficient and accurate methods for detecting defects in semiconductor materials and devices. These defects can have a detrimental impact on the efficiency of the manufacturing process, because they cause critical failures and wafer-yield limitations. As nodes and patterns get smaller, even high-resolution imaging techniques such as Scanning Electron Microscopy (SEM) produce noisy images due to operating close to sensitivity levels and due to varying physical properties of different underlayers or resist materials. This inherent noise is one of the main challenges for defect inspection. One promising approach is the use of machine learning algorithms, which can be trained to accurately classify and locate defects in semiconductor samples. Recently, convolutional neural networks have proved to be particularly useful in this regard. This systematic review provides a comprehensive overview of the state of automated semiconductor defect inspection on SEM images, including the most recent innovations and developments. 38 publications were selected on this topic, indexed in IEEE Xplore and SPIE databases. For each of these, the application, methodology, dataset, results, limitations and future work were summarized. A comprehensive overview and analysis of their methods is provided. Finally, promising avenues for future work in the field of SEM-based defect inspection are suggested.

Diff-CAPTCHA: An Image-based CAPTCHA with Security Enhanced by Denoising Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.08367
  • repo_url: None
  • paper_authors: Ran Jiang, Sanfeng Zhang, Linfeng Liu, Yanbing Peng
  • for: Enhancing the security of text-based CAPTCHAs.
  • methods: A denoising diffusion model generates CAPTCHA images in which the characters are deeply fused with the background image, weakening the character features available to machine learning and increasing the difficulty of automated breaking.
  • results: Compared with baseline schemes, Diff-CAPTCHA shows higher security while maintaining good usability, and can effectively resist end-to-end breaking algorithms.
    Abstract To enhance the security of text CAPTCHAs, various methods have been employed, such as adding the interference lines on the text, randomly distorting the characters, and overlapping multiple characters. These methods partly increase the difficulty of automated segmentation and recognition attacks. However, facing the rapid development of the end-to-end breaking algorithms, their security has been greatly weakened. The diffusion model is a novel image generation model that can generate the text images with deep fusion of characters and background images. In this paper, an image-click CAPTCHA scheme called Diff-CAPTCHA is proposed based on denoising diffusion models. The background image and characters of the CAPTCHA are treated as a whole to guide the generation process of a diffusion model, thus weakening the character features available for machine learning, enhancing the diversity of character features in the CAPTCHA, and increasing the difficulty of breaking algorithms. To evaluate the security of Diff-CAPTCHA, this paper develops several attack methods, including end-to-end attacks based on Faster R-CNN and two-stage attacks, and Diff-CAPTCHA is compared with three baseline schemes, including commercial CAPTCHA scheme and security-enhanced CAPTCHA scheme based on style transfer. The experimental results show that diffusion models can effectively enhance CAPTCHA security while maintaining good usability in human testing.

DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions

  • paper_url: http://arxiv.org/abs/2308.08365
  • repo_url: None
  • paper_authors: Nuno Pimpão Martins, Yannis Kalaidzidis, Marino Zerial, Florian Jug
  • for: This paper aims to improve microscopy image quality, enabling better inspection and characterization of cellular and tissue-level structures and functions.
  • methods: Deep-learning approaches typically require ground truth (GT) data for training, but clean GT cannot be acquired when imaging deep into thick samples. The authors therefore propose a new approach that circumvents the lack of GT data.
  • results: An approximate forward model is first used to synthetically degrade images, mimicking the blur and contrast loss seen deep in tissue; a neural network then learns the inverse of this degradation. The results show that such networks can be applied out-of-distribution (OOD) to improve the quality of less severely degraded raw microscopy images. With iterative predictions, image contrast keeps improving while fine details are progressively removed, so a balance between contrast enhancement and detail retention must be chosen depending on the downstream analysis.
    Abstract Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are unavoidably affected by image degradations, such as noise, blur, or others. Many such degradations also contribute to a loss of image contrast, which becomes especially pronounced in deeper regions of thick samples. Today, best performing methods to increase the quality of images are based on Deep Learning approaches, which typically require ground truth (GT) data during training. Our inability to counteract blurring and contrast loss when imaging deep into samples prevents the acquisition of such clean GT data. The fact that the forward process of blurring and contrast loss deep into tissue can be modeled, allowed us to propose a new method that can circumvent the problem of unobtainable GT data. To this end, we first synthetically degraded the quality of microscopy images even further by using an approximate forward model for deep tissue image degradations. Then we trained a neural network that learned the inverse of this degradation function from our generated pairs of raw and degraded images. We demonstrated that networks trained in this way can be used out-of-distribution (OOD) to improve the quality of less severely degraded images, e.g. the raw data imaged in a microscope. Since the absolute level of degradation in such microscopy images can be stronger than the additional degradation introduced by our forward model, we also explored the effect of iterative predictions. Here, we observed that in each iteration the measured image contrast kept improving while detailed structures in the images got increasingly removed. Therefore, dependent on the desired downstream analysis, a balance between contrast improvement and retention of image details has to be found.
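A toy version of such a synthetic forward model, with an assumed depth-dependent blur and contrast attenuation (not the paper's calibrated degradation), could be sketched as:

```python
# Illustrative sketch: further degrade a microscopy image with blur plus a
# contrast loss that grows with (assumed) imaging depth, producing (raw, degraded)
# training pairs. Kernel, attenuation law, and parameters are placeholders.
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(image, depth_frac, max_sigma=3.0, max_atten=0.7, noise_std=0.02):
    """depth_frac in [0, 1]: 0 = surface, 1 = deepest plane of the stack."""
    blurred = gaussian_filter(image, sigma=max_sigma * depth_frac)
    mean = blurred.mean()
    # contrast loss: pull intensities toward the mean as depth increases
    low_contrast = mean + (blurred - mean) * (1.0 - max_atten * depth_frac)
    noisy = low_contrast + np.random.normal(0.0, noise_std, image.shape)
    return np.clip(noisy, 0.0, 1.0)

raw = np.clip(np.random.rand(128, 128), 0, 1)           # stand-in for a raw slice
pairs = [(raw, degrade(raw, d)) for d in (0.25, 0.5, 0.9)]
print([p[1].std() for p in pairs])                      # contrast drops with depth
```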

KernelWarehouse: Towards Parameter-Efficient Dynamic Convolution

  • paper_url: http://arxiv.org/abs/2308.08361
  • repo_url: https://github.com/osvai/kernelwarehouse
  • paper_authors: Chao Li, Anbang Yao
  • for: Improving dynamic convolution so that a much larger number of kernels (e.g., n > 100 rather than the typical n < 10) can be used without the usual n-fold increase in convolutional parameters.
  • methods: KernelWarehouse redefines the notions of "kernels" and "assembling kernels" in dynamic convolution, reducing kernel dimension while greatly increasing kernel number, and strengthens parameter dependencies within and across layers via kernel partition and warehouse sharing.
  • results: On the ImageNet and MS-COCO datasets, KernelWarehouse attains state-of-the-art results against the baseline methods. For example, ResNet18 | ResNet50 | MobileNetV2 | ConvNeXt-Tiny models trained with KernelWarehouse on ImageNet reach 76.05% | 81.05% | 75.52% | 82.51% top-1 accuracy. KernelWarehouse can even shrink a ConvNet while improving accuracy: ResNet18 variants with 36.45% | 65.10% parameter reductions still gain 2.89% | 2.29% absolute top-1 accuracy over the baseline.
    Abstract Dynamic convolution learns a linear mixture of $n$ static kernels weighted with their sample-dependent attentions, demonstrating superior performance compared to normal convolution. However, existing designs are parameter-inefficient: they increase the number of convolutional parameters by $n$ times. This and the optimization difficulty lead to no research progress in dynamic convolution that can allow us to use a significant large value of $n$ (e.g., $n>100$ instead of typical setting $n<10$) to push forward the performance boundary. In this paper, we propose $KernelWarehouse$, a more general form of dynamic convolution, which can strike a favorable trade-off between parameter efficiency and representation power. Its key idea is to redefine the basic concepts of "$kernels$" and "$assembling$ $kernels$" in dynamic convolution from the perspective of reducing kernel dimension and increasing kernel number significantly. In principle, KernelWarehouse enhances convolutional parameter dependencies within the same layer and across successive layers via tactful kernel partition and warehouse sharing, yielding a high degree of freedom to fit a desired parameter budget. We validate our method on ImageNet and MS-COCO datasets with different ConvNet architectures, and show that it attains state-of-the-art results. For instance, the ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny model trained with KernelWarehouse on ImageNet reaches 76.05%|81.05%|75.52%|82.51% top-1 accuracy. Thanks to its flexible design, KernelWarehouse can even reduce the model size of a ConvNet while improving the accuracy, e.g., our ResNet18 model with 36.45%|65.10% parameter reduction to the baseline shows 2.89%|2.29% absolute improvement to top-1 accuracy.
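For reference, the plain dynamic convolution that the abstract starts from, in which n static kernels are mixed with sample-dependent attention, can be sketched as below; KernelWarehouse's kernel partition and warehouse sharing are not reproduced in this toy example.

```python
# Illustrative sketch of baseline dynamic convolution: n static kernels mixed with
# sample-dependent attention weights, then a single (grouped) convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, n_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, n_kernels), nn.Softmax(dim=1),
        )
        self.pad = k // 2

    def forward(self, x):
        b = x.size(0)
        a = self.attn(x)                                     # (B, n) sample-dependent mixture
        w = torch.einsum("bn,noikl->boikl", a, self.weight)  # per-sample mixed kernels
        # grouped-conv trick: fold the batch into channels to convolve each sample
        x = x.reshape(1, -1, *x.shape[2:])
        w = w.reshape(-1, *w.shape[2:])
        out = F.conv2d(x, w, padding=self.pad, groups=b)
        return out.reshape(b, -1, *out.shape[2:])

layer = DynamicConv2d(16, 32)
print(layer(torch.randn(2, 16, 8, 8)).shape)                 # torch.Size([2, 32, 8, 8])
```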

Membrane Potential Batch Normalization for Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2308.08359
  • repo_url: https://github.com/yfguo91/mpbn
  • paper_authors: Yufei Guo, Yuhan Zhang, Yuanpei Chen, Weihang Peng, Xiaode Liu, Liwen Zhang, Xuhui Huang, Zhe Ma
  • for: Spiking neural networks (SNNs) are an energy-efficient alternative to conventional neural networks; this work improves how batch normalization (BN) is used when training deep SNNs.
  • methods: In addition to the usual BN after the convolution layer, another BN layer (MPBN) is added before the firing function to normalize the membrane potential, and a training-inference-decoupled re-parameterization technique folds the trained MPBN into the firing threshold so it adds no extra time cost at inference.
  • results: Experiments show that MPBN performs well on both popular non-spiking static and neuromorphic datasets, and it can also take an element-wise form. The code is open-sourced on GitHub (https://github.com/yfguo91/MPBN).
    Abstract As one of the energy-efficient alternatives of conventional neural networks (CNNs), spiking neural networks (SNNs) have gained more and more interest recently. To train the deep models, some effective batch normalization (BN) techniques are proposed in SNNs. All these BNs are suggested to be used after the convolution layer as usually doing in CNNs. However, the spiking neuron is much more complex with the spatio-temporal dynamics. The regulated data flow after the BN layer will be disturbed again by the membrane potential updating operation before the firing function, i.e., the nonlinear activation. Therefore, we advocate adding another BN layer before the firing function to normalize the membrane potential again, called MPBN. To eliminate the induced time cost of MPBN, we also propose a training-inference-decoupled re-parameterization technique to fold the trained MPBN into the firing threshold. With the re-parameterization technique, the MPBN will not introduce any extra time burden in the inference. Furthermore, the MPBN can also adopt the element-wised form, while these BNs after the convolution layer can only use the channel-wised form. Experimental results show that the proposed MPBN performs well on both popular non-spiking static and neuromorphic datasets. Our code is open-sourced at \href{https://github.com/yfguo91/MPBN}{MPBN}.
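The folding step can be made concrete with a small worked example: if BN(u) = gamma*(u - mu)/sqrt(var + eps) + beta is compared against a threshold v_th, then (assuming gamma > 0) the same spike decision is obtained by comparing the raw membrane potential u against a per-channel threshold mu + (v_th - beta)*sqrt(var + eps)/gamma. The sketch below checks this equivalence with placeholder statistics; it is an illustration, not the repository's implementation.

```python
# Illustrative sketch: fold a trained BN on the membrane potential into
# per-channel firing thresholds (positive gamma assumed), so inference pays no
# extra cost compared with applying BN explicitly before the firing function.
import torch

def fold_mpbn_into_threshold(bn, v_th=1.0):
    """bn: a trained torch.nn.BatchNorm over channels; returns per-channel thresholds."""
    mu, var = bn.running_mean, bn.running_var
    gamma, beta, eps = bn.weight, bn.bias, bn.eps
    return mu + (v_th - beta) * torch.sqrt(var + eps) / gamma

bn = torch.nn.BatchNorm1d(4)
bn.eval()
with torch.no_grad():
    bn.running_mean.copy_(torch.tensor([0.1, -0.2, 0.0, 0.3]))
    bn.running_var.copy_(torch.tensor([1.0, 0.5, 2.0, 0.8]))
    bn.weight.copy_(torch.tensor([1.0, 0.9, 1.2, 1.1]))
    bn.bias.copy_(torch.tensor([0.0, 0.1, -0.1, 0.05]))

u = torch.randn(8, 4)                       # membrane potentials (batch, channels)
spikes_with_bn = (bn(u) >= 1.0).float()     # BN explicitly applied before firing
spikes_folded = (u >= fold_mpbn_into_threshold(bn, 1.0)).float()
print(torch.equal(spikes_with_bn, spikes_folded))   # True (up to numerics)
```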

GAEI-UNet: Global Attention and Elastic Interaction U-Net for Vessel Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.08345
  • repo_url: None
  • paper_authors: Ruiqiang Xiao, Zhuoyue Wan
  • for: This paper aims to improve the accuracy and connectivity of vessel image segmentation, providing more accurate and reliable diagnostic tools for the medical community.
  • methods: A new model, GAEI-UNet, combines global attention with an elastic interaction-based loss to improve both the precision of small-vessel segmentation and the connectivity between fine structures.
  • results: Evaluation on the DRIVE retinal vessel dataset shows that GAEI-UNet outperforms conventional deep-learning segmentation models in accuracy and connectivity of small structures without significantly increasing computational complexity.
    Abstract Vessel image segmentation plays a pivotal role in medical diagnostics, aiding in the early detection and treatment of vascular diseases. While segmentation based on deep learning has shown promising results, effectively segmenting small structures and maintaining connectivity between them remains challenging. To address these limitations, we propose GAEI-UNet, a novel model that combines global attention and elastic interaction-based techniques. GAEI-UNet leverages global spatial and channel context information to enhance high-level semantic understanding within the U-Net architecture, enabling precise segmentation of small vessels. Additionally, we adopt an elastic interaction-based loss function to improve connectivity among these fine structures. By capturing the forces generated by misalignment between target and predicted shapes, our model effectively learns to preserve the correct topology of vessel networks. Evaluation on retinal vessel dataset -- DRIVE demonstrates the superior performance of GAEI-UNet in terms of SE and connectivity of small structures, without significantly increasing computational complexity. This research aims to advance the field of vessel image segmentation, providing more accurate and reliable diagnostic tools for the medical community. The implementation code is available on Code.

Denoising Diffusion Probabilistic Model for Retinal Image Generation and Segmentation

  • paper_url: http://arxiv.org/abs/2308.08339
  • repo_url: https://github.com/aaleka/retree
  • paper_authors: Alnur Alimanov, Md Baharul Islam
  • For: Detection and diagnosis of eye, blood-circulation, and brain-related diseases from retinal images.
  • Methods: Whereas prior work synthesizes training data with Generative Adversarial Networks (GANs), this paper uses a Denoising Diffusion Probabilistic Model (DDPM) to generate retinal images and their vessel trees.
  • Results: A two-stage DDPM is proposed and a dataset called Retinal Trees (ReTree) is created, comprising retinal images, corresponding vessel trees, and a DDPM-based segmentation network trained on ReTree images. The dataset's effectiveness is validated by training the vessel segmentation model on synthetic data and testing it on authentic data.
    Abstract Experts use retinal images and vessel trees to detect and diagnose various eye, blood circulation, and brain-related diseases. However, manual segmentation of retinal images is a time-consuming process that requires high expertise and is difficult due to privacy issues. Many methods have been proposed to segment images, but the need for large retinal image datasets limits the performance of these methods. Several methods synthesize deep learning models based on Generative Adversarial Networks (GAN) to generate limited sample varieties. This paper proposes a novel Denoising Diffusion Probabilistic Model (DDPM) that outperformed GANs in image synthesis. We developed a Retinal Trees (ReTree) dataset consisting of retinal images, corresponding vessel trees, and a segmentation network based on DDPM trained with images from the ReTree dataset. In the first stage, we develop a two-stage DDPM that generates vessel trees from random numbers belonging to a standard normal distribution. Later, the model is guided to generate fundus images from given vessel trees and random distribution. The proposed dataset has been evaluated quantitatively and qualitatively. Quantitative evaluation metrics include Frechet Inception Distance (FID) score, Jaccard similarity coefficient, Cohen's kappa, Matthew's Correlation Coefficient (MCC), precision, recall, F1-score, and accuracy. We trained the vessel segmentation model with synthetic data to validate our dataset's efficiency and tested it on authentic data. Our developed dataset and source code is available at https://github.com/AAleka/retree.

Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

  • paper_url: http://arxiv.org/abs/2308.08333
  • repo_url: None
  • paper_authors: Jiawei Yao, Tong Wu, Xiaofeng Zhang
  • for: This study investigates the differences between Transformers and CNNs in monocular depth estimation and how Transformer models can be made to handle depth estimation better.
  • methods: A sparse-pixel approach is used to contrast the two model families, showing that Transformers excel at global context and intricate textures but lag behind CNNs in preserving depth gradient continuity. To close this gap, a Depth Gradient Refinement (DGR) module refines depth estimation through high-order differentiation, feature fusion, and recalibration. In addition, depth maps are treated as spatial probability distributions and the optimal transport distance is used as a loss function.
  • results: Experiments show that models equipped with the plug-and-play DGR module and the proposed loss function improve performance without increasing complexity or computational cost. The study offers fresh insight into the Transformer-vs-CNN distinction in depth estimation and paves the way for new depth estimation methodologies.
    Abstract Monocular depth estimation is an ongoing challenge in computer vision. Recent progress with Transformer models has demonstrated notable advantages over conventional CNNs in this area. However, there's still a gap in understanding how these models prioritize different regions in 2D images and how these regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we employ a sparse pixel approach to contrastively analyze the distinctions between the two. Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To further enhance the performance of Transformer models in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration. Additionally, we leverage optimal transport theory, treating depth maps as spatial probability distributions, and employ the optimal transport distance as a loss function to optimize our model. Experimental results demonstrate that models integrated with the plug-and-play Depth Gradient Refinement (DGR) module and the proposed loss function enhance performance without increasing complexity and computational costs. This research not only offers fresh insights into the distinctions between Transformers and CNNs in depth estimation but also paves the way for novel depth estimation methodologies.
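As a simple illustration of the kind of high-order gradient term involved (not the full DGR module, whose feature fusion, recalibration, and optimal-transport loss are omitted), a second-order depth-gradient-continuity penalty can be written as:

```python
# Illustrative sketch: a depth-gradient-continuity penalty based on second-order
# finite differences, usable as a differentiable regularizer on predicted depth.
import torch

def gradient_continuity_loss(depth):
    """depth: (B, 1, H, W) predicted depth map."""
    dzdx = depth[:, :, :, 1:] - depth[:, :, :, :-1]       # first-order gradients
    dzdy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    d2x = dzdx[:, :, :, 1:] - dzdx[:, :, :, :-1]          # second-order (gradient change)
    d2y = dzdy[:, :, 1:, :] - dzdy[:, :, :-1, :]
    return d2x.abs().mean() + d2y.abs().mean()

pred = torch.rand(2, 1, 32, 32, requires_grad=True)
loss = gradient_continuity_loss(pred)
loss.backward()                                           # differentiable end to end
print(loss.item())
```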

AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language Recognition

  • paper_url: http://arxiv.org/abs/2308.08327
  • repo_url: https://github.com/hulianyuyy/adabrowse
  • paper_authors: Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng
  • for: This work exploits the considerable feature redundancy in raw videos to make continuous sign language recognition (CSLR) inference more efficient.
  • methods: A novel adaptive model, AdaBrowse, casts the problem as a sequential decision task: a lightweight network scans the input video for coarse features, a policy network selects the most informative subsequence, and only that subsequence is processed by a normal CSLR model. AdaBrowse+ additionally selects the lowest sufficient input resolution for each sample.
  • results: On four large-scale CSLR datasets, AdaBrowse and AdaBrowse+ achieve accuracy comparable to state-of-the-art methods while delivering 1.44x the throughput with 2.12x fewer FLOPs.
    Abstract Raw videos have been proven to own considerable feature redundancy where in many cases only a portion of frames can already meet the requirements for accurate recognition. In this paper, we are interested in whether such redundancy can be effectively leveraged to facilitate efficient inference in continuous sign language recognition (CSLR). We propose a novel adaptive model (AdaBrowse) to dynamically select a most informative subsequence from input video sequences by modelling this problem as a sequential decision task. In specific, we first utilize a lightweight network to quickly scan input videos to extract coarse features. Then these features are fed into a policy network to intelligently select a subsequence to process. The corresponding subsequence is finally inferred by a normal CSLR model for sentence prediction. As only a portion of frames are processed in this procedure, the total computations can be considerably saved. Besides temporal redundancy, we are also interested in whether the inherent spatial redundancy can be seamlessly integrated together to achieve further efficiency, i.e., dynamically selecting a lowest input resolution for each sample, whose model is referred to as AdaBrowse+. Extensive experimental results on four large-scale CSLR datasets, i.e., PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, demonstrate the effectiveness of AdaBrowse and AdaBrowse+ by achieving comparable accuracy with state-of-the-art methods with 1.44$\times$ throughput and 2.12$\times$ fewer FLOPs. Comparisons with other commonly-used 2D CNNs and adaptive efficient methods verify the effectiveness of AdaBrowse. Code is available at \url{https://github.com/hulianyuyy/AdaBrowse}.

Visually-Aware Context Modeling for News Image Captioning

  • paper_url: http://arxiv.org/abs/2308.08325
  • repo_url: None
  • paper_authors: Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
  • for: News Image Captioning aims to generate an image caption from the content of both a news article and the accompanying image.
  • methods: A face-naming module learns better embeddings for faces in the image and names in the caption/article, and a CLIP-based retrieval strategy selects article sentences that are semantically close to the image.
  • results: Extensive experiments demonstrate the effectiveness of the framework; without additional paired data, it sets a new state of the art on two News Image Captioning datasets, exceeding the previous best by 5 CIDEr points.
    Abstract The goal of News Image Captioning is to generate an image caption according to the content of both a news article and an image. To leverage the visual information effectively, it is important to exploit the connection between the context in the articles/captions and the images. Psychological studies indicate that human faces in images draw higher attention priorities. On top of that, humans often play a central role in news stories, as also proven by the face-name co-occurrence pattern we discover in existing News Image Captioning datasets. Therefore, we design a face-naming module for faces in images and names in captions/articles to learn a better name embedding. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. Humans typically address this by searching for relevant information from the article based on the image. To emulate this thought process, we design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image. We conduct extensive experiments to demonstrate the efficacy of our framework. Without using additional paired data, we establish the new state-of-the-art performance on two News Image Captioning datasets, exceeding the previous state-of-the-art by 5 CIDEr points. We will release code upon acceptance.
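The retrieval step can be sketched as a plain cosine-similarity ranking over sentence embeddings from a shared image-text encoder such as CLIP; in the sketch below the embeddings are mocked and the top-k value is arbitrary.

```python
# Illustrative sketch: rank article sentences by cosine similarity to an image
# embedding and keep the top-k as context for the caption generator.
import numpy as np

def retrieve_sentences(image_emb, sentence_embs, sentences, k=2):
    img = image_emb / np.linalg.norm(image_emb)
    sents = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    sims = sents @ img                       # cosine similarity per sentence
    top = np.argsort(-sims)[:k]
    return [sentences[i] for i in top]

rng = np.random.default_rng(1)
sentences = [
    "The prime minister met union leaders on Tuesday.",
    "Protesters gathered outside the parliament building.",
    "The report was published last spring.",
]
image_emb = rng.normal(size=512)             # stand-in for the CLIP image embedding
sentence_embs = rng.normal(size=(3, 512))    # stand-ins for CLIP text embeddings
print(retrieve_sentences(image_emb, sentence_embs, sentences))
```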

Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations

  • paper_url: http://arxiv.org/abs/2308.08321
  • repo_url: None
  • paper_authors: Yuewei Yang, Hai Li, Yiran Chen
  • For: This paper aims to address the instability issues in discriminative self-supervised learning methods and improve the downstream performance of the learned representations.
  • Methods: The paper analyzes the unstable behaviors of discriminative self-supervised methods from a causal perspective and proposes solutions that temper a linear transformation with controlled synthetic data, applied during inference rather than training to improve time efficiency.
  • Results: Experiments on both controlled image datasets and realistic image datasets show that the proposed solutions are effective in addressing the instability issues and improving the downstream performance of the learned representations.
    Abstract In recent years, discriminative self-supervised methods have made significant strides in advancing various visual tasks. The central idea of learning a data encoder that is robust to data distortions/augmentations is straightforward yet highly effective. Although many studies have demonstrated the empirical success of various learning methods, the resulting learned representations can exhibit instability and hinder downstream performance. In this study, we analyze discriminative self-supervised methods from a causal perspective to explain these unstable behaviors and propose solutions to overcome them. Our approach draws inspiration from prior works that empirically demonstrate the ability of discriminative self-supervised methods to demix ground truth causal sources to some extent. Unlike previous work on causality-empowered representation learning, we do not apply our solutions during the training process but rather during the inference process to improve time efficiency. Through experiments on both controlled image datasets and realistic image datasets, we show that our proposed solutions, which involve tempering a linear transformation with controlled synthetic data, are effective in addressing these issues.

Dual-Stream Diffusion Net for Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2308.08316
  • repo_url: None
  • paper_authors: Binhui Liu, Xin Liu, Anbo Dai, Zhiyong Zeng, Zhen Cui, Jian Yang
  • for: Improving the content consistency of generated videos in text-to-video generation.
  • methods: The proposed dual-stream diffusion net (DSDN) comprises two diffusion streams, a video content branch and a motion branch, aligned through a designed cross-transformer interaction module; a motion decomposer and combiner further facilitate operations on video motion.
  • results: Experiments show that the method produces smoother continuous videos with fewer flickers.
    Abstract With the emerging diffusion models, recently, text-to-video generation has aroused increasing attention. But an important bottleneck therein is that generative videos often tend to carry some flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generating videos. In particular, the designed two diffusion streams, video content and motion branches, could not only run separately in their private spaces for producing personalized video variations as well as content, but also be well-aligned between the content and motion domains through leveraging our designed cross-transformer interaction module, which would benefit the smoothness of generated videos. Besides, we also introduce motion decomposer and combiner to faciliate the operation on video motion. Qualitative and quantitative experiments demonstrate that our method could produce amazing continuous videos with fewer flickers.

ECPC-IDS: A benchmark endometrial cancer PET/CT image dataset for evaluation of semantic segmentation and detection of hypermetabolic regions

  • paper_url: http://arxiv.org/abs/2308.08313
  • repo_url: None
  • paper_authors: Dechao Tang, Xuanyi Li, Tianming Du, Deguo Ma, Zhiyu Ma, Hongzan Sun, Marcin Grzegorzek, Huiyan Jiang, Chen Li
  • for: This paper aims to provide a publicly available dataset of endometrial cancer images for research and development of computer-assisted diagnostic techniques.
  • methods: The paper uses five classical deep learning semantic segmentation methods and six deep learning object detection methods to demonstrate the differences between various methods on the dataset.
  • results: The paper provides a large number of multiple images, including a large amount of information required for image and target detection, which can aid researchers in exploring new algorithms to enhance computer-assisted technology and improve the accuracy and objectivity of diagnosis.
    Abstract Endometrial cancer is one of the most common tumors in the female reproductive system and is the third most common gynecological malignancy that causes death after ovarian and cervical cancer. Early diagnosis can significantly improve the 5-year survival rate of patients. With the development of artificial intelligence, computer-assisted diagnosis plays an increasingly important role in improving the accuracy and objectivity of diagnosis, as well as reducing the workload of doctors. However, the absence of publicly available endometrial cancer image datasets restricts the application of computer-assisted diagnostic techniques.In this paper, a publicly available Endometrial Cancer PET/CT Image Dataset for Evaluation of Semantic Segmentation and Detection of Hypermetabolic Regions (ECPC-IDS) are published. Specifically, the segmentation section includes PET and CT images, with a total of 7159 images in multiple formats. In order to prove the effectiveness of segmentation methods on ECPC-IDS, five classical deep learning semantic segmentation methods are selected to test the image segmentation task. The object detection section also includes PET and CT images, with a total of 3579 images and XML files with annotation information. Six deep learning methods are selected for experiments on the detection task.This study conduct extensive experiments using deep learning-based semantic segmentation and object detection methods to demonstrate the differences between various methods on ECPC-IDS. As far as we know, this is the first publicly available dataset of endometrial cancer with a large number of multiple images, including a large amount of information required for image and target detection. ECPC-IDS can aid researchers in exploring new algorithms to enhance computer-assisted technology, benefiting both clinical doctors and patients greatly.

Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos

  • paper_url: http://arxiv.org/abs/2308.08303
  • repo_url: None
  • paper_authors: Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue
  • for: Anticipating short-term human-object interactions in egocentric videos, i.e., which object a person will interact with next and the action that follows.
  • methods: NAOGAT, a multi-modal end-to-end transformer network, attends to objects in the observed frames to anticipate the next-active-object (NAO) and then predicts context-aware future actions conditioned on these detections, leveraging object motion dynamics.
  • results: Compared with existing video modeling architectures, NAOGAT better captures the relationship between objects and the global scene context and exploits object motion dynamics to improve accuracy. Experiments show that it outperforms existing methods on the Ego4D and EpicKitchens-100 ("Unseen Set") datasets, as measured by several additional metrics such as time to contact (TTC) and next-active-object localization.
    Abstract Objects are crucial for understanding human-object interactions. By identifying the relevant objects, one can also predict potential future interactions or actions that may occur with these objects. In this paper, we study the problem of Short-Term Object interaction anticipation (STA) and propose NAOGAT (Next-Active-Object Guided Anticipation Transformer), a multi-modal end-to-end transformer network, that attends to objects in observed frames in order to anticipate the next-active-object (NAO) and, eventually, to guide the model to predict context-aware future actions. The task is challenging since it requires anticipating future action along with the object with which the action occurs and the time after which the interaction will begin, a.k.a. the time to contact (TTC). Compared to existing video modeling architectures for action anticipation, NAOGAT captures the relationship between objects and the global scene context in order to predict detections for the next active object and anticipate relevant future actions given these detections, leveraging the objects' dynamics to improve accuracy. One of the key strengths of our approach, in fact, is its ability to exploit the motion dynamics of objects within a given clip, which is often ignored by other models, and separately decoding the object-centric and motion-centric information. Through our experiments, we show that our model outperforms existing methods on two separate datasets, Ego4D and EpicKitchens-100 ("Unseen Set"), as measured by several additional metrics, such as time to contact, and next-active-object localization. The code will be available upon acceptance.

Improving Audio-Visual Segmentation with Bidirectional Generation

  • paper_url: http://arxiv.org/abs/2308.08288
  • repo_url: None
  • paper_authors: Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong
  • for: This paper aims to improve the precision of audio-visual segmentation (AVS), i.e., delineating audible objects in videos down to the pixel level.
  • methods: A bidirectional generation framework establishes robust correlations between an object's visual characteristics and its associated sound. It includes a visual-to-audio projection component that reconstructs audio features from object segmentation masks, and an implicit volumetric motion estimation module to handle temporal dynamics that are hard to capture with conventional optical flow.
  • results: Extensive experiments and analyses on the AVSBench benchmark establish a new state-of-the-art level of AVS performance, particularly on the challenging MS3 subset, which involves segmenting multiple sound sources.
    Abstract The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level in the AVS benchmark, particularly excelling in the challenging MS3 subset which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.
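As a rough illustration of the visual-to-audio projection idea described above, the following PyTorch sketch pools visual features under a predicted segmentation mask, projects them into an assumed audio-embedding space, and penalizes the reconstruction error. The architecture, dimensions, and audio-feature choice are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToAudioProjector(nn.Module):
    """Reconstruct an audio embedding from mask-pooled visual features."""
    def __init__(self, vis_dim=256, audio_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, audio_dim))

    def forward(self, vis_feat, mask):
        # vis_feat: (B, C, H, W), mask: (B, 1, H, W) soft segmentation mask in [0, 1]
        pooled = (vis_feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
        return self.proj(pooled)                       # (B, audio_dim)

projector = VisualToAudioProjector()
vis = torch.randn(4, 256, 28, 28)
mask = torch.rand(4, 1, 28, 28)
audio_gt = torch.randn(4, 128)                         # e.g. a log-mel or learned audio embedding
loss = F.mse_loss(projector(vis, mask), audio_gt)      # reconstruction error to be minimized
```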

CARE: A Large Scale CT Image Dataset and Clinical Applicable Benchmark Model for Rectal Cancer Segmentation

  • paper_url: http://arxiv.org/abs/2308.08283
  • repo_url: None
  • paper_authors: Hantao Zhang, Weidong Guo, Chenyang Qiu, Shouhong Wan, Bingbing Zou, Wanqin Wang, Peiquan Jin
  • for: Rectal cancer segmentation of CT images for timely clinical diagnosis, radiotherapy treatment, and follow-up.
  • methods: Proposed a novel large-scale rectal cancer CT image dataset CARE with pixel-level annotations, and a novel medical cancer lesion segmentation benchmark model named U-SAM that incorporates prompt information to tackle the challenges of intricate anatomical structures.
  • results: U-SAM outperformed state-of-the-art methods on the CARE dataset and demonstrated generalization on the WORD dataset through extensive experiments. These results can serve as a baseline for future research and clinical application development.
    Abstract Rectal cancer segmentation of CT image plays a crucial role in timely clinical diagnosis, radiotherapy treatment, and follow-up. Although current segmentation methods have shown promise in delineating cancerous tissues, they still encounter challenges in achieving high segmentation precision. These obstacles arise from the intricate anatomical structures of the rectum and the difficulties in performing differential diagnosis of rectal cancer. Additionally, a major obstacle is the lack of a large-scale, finely annotated CT image dataset for rectal cancer segmentation. To address these issues, this work introduces a novel large scale rectal cancer CT image dataset CARE with pixel-level annotations for both normal and cancerous rectum, which serves as a valuable resource for algorithm research and clinical application development. Moreover, we propose a novel medical cancer lesion segmentation benchmark model named U-SAM. The model is specifically designed to tackle the challenges posed by the intricate anatomical structures of abdominal organs by incorporating prompt information. U-SAM contains three key components: promptable information (e.g., points) to aid in target area localization, a convolution module for capturing low-level lesion details, and skip-connections to preserve and recover spatial information during the encoding-decoding process. To evaluate the effectiveness of U-SAM, we systematically compare its performance with several popular segmentation methods on the CARE dataset. The generalization of the model is further verified on the WORD dataset. Extensive experiments demonstrate that the proposed U-SAM outperforms state-of-the-art methods on these two datasets. These experiments can serve as the baseline for future research and clinical application development.

Computer vision-enriched discrete choice models, with an application to residential location choice

  • paper_url: http://arxiv.org/abs/2308.08276
  • repo_url: None
  • paper_authors: Sander van Cranenburgh, Francisco Garrido-Valenzuela
  • for: This paper aims to address the gap between traditional discrete choice models and real-world decision-making by incorporating computer vision into these models.
  • methods: The proposed “Computer Vision-enriched Discrete Choice Models” (CV-DCMs) integrate computer vision and traditional discrete choice models to handle choice tasks involving numeric attributes and images.
  • results: The proposed CV-DCMs are grounded in random utility maximization principles and demonstrate the potential to handle complex decision-making tasks involving visual imagery, as demonstrated through a novel stated choice experiment involving residential location choices.
    Abstract Visual imagery is indispensable to many multi-attribute decision situations. Examples of such decision situations in travel behaviour research include residential location choices, vehicle choices, tourist destination choices, and various safety-related choices. However, current discrete choice models cannot handle image data and thus cannot incorporate information embedded in images into their representations of choice behaviour. This gap between discrete choice models' capabilities and the real-world behaviour it seeks to model leads to incomplete and, possibly, misleading outcomes. To solve this gap, this study proposes "Computer Vision-enriched Discrete Choice Models" (CV-DCMs). CV-DCMs can handle choice tasks involving numeric attributes and images by integrating computer vision and traditional discrete choice models. Moreover, because CV-DCMs are grounded in random utility maximisation principles, they maintain the solid behavioural foundation of traditional discrete choice models. We demonstrate the proposed CV-DCM by applying it to data obtained through a novel stated choice experiment involving residential location choices. In this experiment, respondents faced choice tasks with trade-offs between commute time, monthly housing cost and street-level conditions, presented using images. As such, this research contributes to the growing body of literature in the travel behaviour field that seeks to integrate discrete choice modelling and machine learning.
    摘要 视觉图像在许多多属性决策情境中不可或缺。旅行行为研究中的例子包括居住地选择、车辆选择、旅游目的地选择以及各种与安全相关的选择。然而,现有的离散选择模型无法处理图像数据,因此无法将图像中蕴含的信息纳入其对选择行为的刻画。模型能力与其所要刻画的真实行为之间的这一差距,会导致不完整甚至可能具有误导性的结果。为弥补这一差距,本研究提出了"计算机视觉增强离散选择模型"(CV-DCM)。CV-DCM通过结合计算机视觉与传统离散选择模型,能够处理同时包含数值属性和图像的选择任务。此外,由于CV-DCM以随机效用最大化原则为基础,它保留了传统离散选择模型坚实的行为学基础。我们将CV-DCM应用于一项新的陈述性选择实验所获得的数据,实验涉及居住地选择:受访者需要在通勤时间、每月住房成本以及以图像呈现的街道环境之间进行权衡。因此,本研究为旅行行为领域中将离散选择建模与机器学习相结合的日益增长的文献做出了贡献。
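The random-utility backbone of a CV-DCM can be sketched as a multinomial logit whose systematic utility adds an image-derived term to the usual numeric attributes. The sketch below (NumPy) is a toy illustration under assumed attribute names, coefficients, and embedding sizes; the paper's actual specification and estimation procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def choice_probabilities(numeric_attrs, image_embeddings, beta, w):
    """Multinomial-logit choice probabilities with an image-based utility term.

    numeric_attrs:    (J, K) numeric attributes of J alternatives (e.g. commute time, cost)
    image_embeddings: (J, D) feature vectors extracted from street-level images by a CNN
    beta:             (K,)   taste parameters for the numeric attributes
    w:                (D,)   weights mapping image features to a utility contribution
    """
    v = numeric_attrs @ beta + image_embeddings @ w      # systematic utilities V_j
    v = v - v.max()                                      # numerical stability
    expv = np.exp(v)
    return expv / expv.sum()

# Toy example with 3 residential alternatives (values are illustrative only).
attrs = np.array([[30.0, 1200.0], [20.0, 1500.0], [45.0, 900.0]])   # minutes, monthly cost
img_feat = rng.normal(size=(3, 8))                                   # stand-in CNN embeddings
beta = np.array([-0.05, -0.002])                                     # negative: longer/costlier is worse
w = rng.normal(scale=0.1, size=8)
print(choice_probabilities(attrs, img_feat, beta, w))                # sums to 1
```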

Detecting Olives with Synthetic or Real Data? Olive the Above

  • paper_url: http://arxiv.org/abs/2308.08271
  • repo_url: None
  • paper_authors: Yianni Karabatis, Xiaomin Lin, Nitin J. Sanket, Michail G. Lagoudakis, Yiannis Aloimonos
  • for: 这个论文的目的是为了提高精准农业中的橄榄估算,但在橄榄业中,高度变化的橄榄颜色和背景叶子投影的问题却存在挑战。
  • methods: 这篇论文提出了一种无需手动标注数据的新型橄榄检测方法。该方法首先生成一个可自动标注的照片级真实感3D橄榄树模型,然后为实时渲染的目的简化其几何结构。
  • results: 实验表明,使用大量的synthetic数据和一小部分的真实数据可以提高橄榄检测精度,比使用只有一小部分的真实数据更好。
    Abstract Modern robotics has enabled the advancement in yield estimation for precision agriculture. However, when applied to the olive industry, the high variation of olive colors and their similarity to the background leaf canopy presents a challenge. Labeling several thousands of very dense olive grove images for segmentation is a labor-intensive task. This paper presents a novel approach to detecting olives without the need to manually label data. In this work, we present the world's first olive detection dataset comprised of synthetic and real olive tree images. This is accomplished by generating an auto-labeled photorealistic 3D model of an olive tree. Its geometry is then simplified for lightweight rendering purposes. In addition, experiments are conducted with a mix of synthetically generated and real images, yielding an improvement of up to 66% compared to when only using a small sample of real data. When access to real, human-labeled data is limited, a combination of mostly synthetic data and a small amount of real data can enhance olive detection.
    摘要 现代机器人技术推动了精准农业中产量估计的进步。然而,将其应用于橄榄产业时,橄榄颜色的高度变化以及橄榄与背景叶冠的相似性带来了挑战。为进行分割而标注数以千计、极为密集的橄榄园图像是一项劳动密集的任务。本文提出了一种无需手动标注数据的新型橄榄检测方法。在这项工作中,我们构建了世界上第一个由合成与真实橄榄树图像组成的橄榄检测数据集:首先生成一个可自动标注的照片级真实感3D橄榄树模型,随后为轻量级渲染简化其几何结构。此外,我们在合成图像与真实图像混合的设置下进行了实验,相比仅使用少量真实数据,性能最多提升66%。当可获得的人工标注真实数据有限时,以合成数据为主、辅以少量真实数据的组合能够提升橄榄检测效果。

OnUVS: Online Feature Decoupling Framework for High-Fidelity Ultrasound Video Synthesis

  • paper_url: http://arxiv.org/abs/2308.08269
  • repo_url: None
  • paper_authors: Han Zhou, Dong Ni, Ao Chang, Xinrui Zhou, Rusi Chen, Yanlin Chen, Lian Liu, Jiamin Liang, Yuhao Huang, Tong Han, Zhe Liu, Deng-Ping Fan, Xin Yang
  • for: The paper aims to address the challenges of synthesizing high-fidelity ultrasound (US) videos for clinical diagnosis, particularly in the context of limited availability of specific US video cases.
  • methods: The proposed method is an online feature-decoupling framework called OnUVS, which incorporates anatomic information into keypoint learning, uses a dual-decoder to decouple content and textural features, and employs a multiple-feature discriminator to enhance sharpness and fine details.
  • results: The paper reports that OnUVS synthesizes US videos with high fidelity, as demonstrated through validation and user studies on in-house echocardiographic and pelvic floor US videos.
    Abstract Ultrasound (US) imaging is indispensable in clinical practice. To diagnose certain diseases, sonographers must observe corresponding dynamic anatomic structures to gather comprehensive information. However, the limited availability of specific US video cases causes teaching difficulties in identifying corresponding diseases, which potentially impacts the detection rate of such cases. The synthesis of US videos may represent a promising solution to this issue. Nevertheless, it is challenging to accurately animate the intricate motion of dynamic anatomic structures while preserving image fidelity. To address this, we present a novel online feature-decoupling framework called OnUVS for high-fidelity US video synthesis. Our highlights can be summarized by four aspects. First, we introduced anatomic information into keypoint learning through a weakly-supervised training strategy, resulting in improved preservation of anatomical integrity and motion while minimizing the labeling burden. Second, to better preserve the integrity and textural information of US images, we implemented a dual-decoder that decouples the content and textural features in the generator. Third, we adopted a multiple-feature discriminator to extract a comprehensive range of visual cues, thereby enhancing the sharpness and fine details of the generated videos. Fourth, we constrained the motion trajectories of keypoints during online learning to enhance the fluidity of generated videos. Our validation and user studies on in-house echocardiographic and pelvic floor US videos showed that OnUVS synthesizes US videos with high fidelity.
    摘要 超声(US)成像是临床实践中不可或缺的工具。为了诊断某些疾病,超声医师需要观察相应的动态解剖结构以获取全面的信息。然而,特定超声视频病例的有限可用性会给对应疾病的教学带来困难,进而可能影响此类病例的检出率。合成超声视频有望解决这一问题;但在保持图像保真度的同时准确再现动态解剖结构的复杂运动具有挑战性。为此,我们提出了一种新的在线特征解耦框架OnUVS,用于高保真超声视频合成。其亮点可概括为四个方面:第一,我们通过弱监督训练策略将解剖信息引入关键点学习,在减少标注负担的同时更好地保持解剖完整性与运动;第二,为更好地保留超声图像的完整性与纹理信息,我们在生成器中采用双解码器将内容特征与纹理特征解耦;第三,我们采用多特征判别器提取更全面的视觉线索,从而增强生成视频的清晰度和细节;第四,我们在在线学习过程中约束关键点的运动轨迹,以提升生成视频的流畅性。在自有的超声心动图和盆底超声视频上的验证与用户研究表明,OnUVS能够合成高保真的超声视频。

SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes

  • paper_url: http://arxiv.org/abs/2308.08258
  • repo_url: None
  • paper_authors: Edith Tretschk, Vladislav Golyanik, Michael Zollhoefer, Aljaz Bozic, Christoph Lassner, Christian Theobalt
  • for: 本研究旨在实现一种以时间一致的方式重建一般性、非刚性变形场景的4D重建方法,以支持3D编辑、运动分析或虚拟资产创建等后续任务。
  • methods: 我们提出了一种名为SceNeRFlow的动态NeRF方法。它以多视角RGB视频以及来自已知相机参数的静态相机的背景图像作为输入,在线重建所估计的规范模型(几何与外观)的变形。由于该规范模型不随时间变化,我们即使在长时间、长距离的运动中也能获得对应关系。方法的各个组件均以神经场景表示进行参数化。
  • results: 实验表明,与仅能处理小幅运动的先前工作不同,我们的方法能够重建摄影棚尺度的运动。这得益于我们将变形分解为强正则化的粗尺度分量与弱正则化的细尺度分量,其中粗尺度分量还将变形场扩展到物体周围的空间,从而支持随时间的跟踪。
    Abstract Existing methods for the 4D reconstruction of general, non-rigidly deforming objects focus on novel-view synthesis and neglect correspondences. However, time consistency enables advanced downstream tasks like 3D editing, motion analysis, or virtual-asset creation. We propose SceNeRFlow to reconstruct a general, non-rigid scene in a time-consistent manner. Our dynamic-NeRF method takes multi-view RGB videos and background images from static cameras with known camera parameters as input. It then reconstructs the deformations of an estimated canonical model of the geometry and appearance in an online fashion. Since this canonical model is time-invariant, we obtain correspondences even for long-term, long-range motions. We employ neural scene representations to parametrize the components of our method. Like prior dynamic-NeRF methods, we use a backwards deformation model. We find non-trivial adaptations of this model necessary to handle larger motions: We decompose the deformations into a strongly regularized coarse component and a weakly regularized fine component, where the coarse component also extends the deformation field into the space surrounding the object, which enables tracking over time. We show experimentally that, unlike prior work that only handles small motion, our method enables the reconstruction of studio-scale motions.
    摘要 现有的针对一般性、非刚性变形物体的4D重建方法侧重于新视角合成,而忽略了对应关系。然而,时间一致性能够支撑3D编辑、运动分析或虚拟资产创建等高级后续任务。我们提出SceNeRFlow,以时间一致的方式重建一般性的非刚性场景。我们的动态NeRF方法以多视角RGB视频以及来自已知相机参数的静态相机的背景图像作为输入,在线重建所估计的规范模型(几何与外观)的变形。由于该规范模型不随时间变化,即使是长时间、长距离的运动我们也能获得对应关系。我们使用神经场景表示来参数化方法的各个组件。与先前的动态NeRF方法类似,我们采用反向变形模型,但为处理更大的运动,我们对该模型做了非平凡的改造:将变形分解为强正则化的粗尺度分量与弱正则化的细尺度分量,其中粗尺度分量还将变形场扩展到物体周围的空间,从而支持随时间的跟踪。实验表明,与仅能处理小幅运动的先前工作不同,我们的方法能够重建摄影棚尺度的运动。
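A minimal sketch of the coarse/fine split described above: two small MLPs predict backward deformation offsets for a point at a given time, and the two parts receive different regularization strengths. Network sizes, inputs, and the regularization weights are assumptions; the paper's actual fields also extend the coarse component into the surrounding space.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Backward deformation split into a strongly and a weakly regularized part."""
    def __init__(self, hidden=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))
        self.coarse = mlp()   # large-scale motion, heavily regularized
        self.fine = mlp()     # small residual motion, lightly regularized

    def forward(self, xyz, t):
        inp = torch.cat([xyz, t], dim=-1)                 # (N, 3) points + (N, 1) time
        dc, df = self.coarse(inp), self.fine(inp)
        return xyz + dc + df, dc, df                      # points warped to canonical space

field = DeformationField()
xyz = torch.rand(1024, 3)
t = torch.full((1024, 1), 0.3)
warped, dc, df = field(xyz, t)
# Different regularization strengths encode the coarse/fine split (weights are assumptions).
reg = 1.0 * dc.pow(2).mean() + 0.01 * df.pow(2).mean()
```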

MultiMediate’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions

  • paper_url: http://arxiv.org/abs/2308.08256
  • repo_url: None
  • paper_authors: Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
  • for: 这篇论文目的是为了研究人类社交行为的自动分析,以创建更有效的人机交互机器人。
  • methods: 这篇论文使用了 MultiMediate’23 挑战中的两个关键人类社交行为分析任务:参与度估计和身体行为识别。它们使用了 NOXI 数据库中的新注释,以及 MPIIGroupInteraction 数据库中的 BBSI 注释方案。
  • results: 这篇论文提供了 MultiMediate’23 挑战中的基准结果。
    Abstract Automatic analysis of human behaviour is a fundamental prerequisite for the creation of machines that can effectively interact with- and support humans in social interactions. In MultiMediate'23, we address two key human social behaviour analysis tasks for the first time in a controlled challenge: engagement estimation and bodily behaviour recognition in social interactions. This paper describes the MultiMediate'23 challenge and presents novel sets of annotations for both tasks. For engagement estimation we collected novel annotations on the NOvice eXpert Interaction (NOXI) database. For bodily behaviour recognition, we annotated test recordings of the MPIIGroupInteraction corpus with the BBSI annotation scheme. In addition, we present baseline results for both challenge tasks.
    摘要 自动分析人类行为,是构建能够在社交互动中与人有效交互并为人提供支持的机器的基本前提。在MultiMediate'23中,我们首次以受控挑战的形式研究了人类社交行为分析中的两个关键任务:社交互动中的参与度估计与身体行为识别。本文介绍了MultiMediate'23挑战,并为这两个任务提供了新的标注集。对于参与度估计任务,我们在NOvice eXpert Interaction(NOXI)数据库上收集了新的标注;对于身体行为识别任务,我们使用BBSI标注方案对MPIIGroupInteraction语料库的测试录像进行了标注。此外,我们还给出了两个挑战任务的基线结果。

Contrastive Learning for Lane Detection via Cross-Similarity

  • paper_url: http://arxiv.org/abs/2308.08242
  • repo_url: None
  • paper_authors: Ali Zoljodi, Sadegh Abadijou, Mina Alibeigi, Masoud Daneshtalab
  • for: 提高路径检测的可靠性,特别是在低可见性情况下,如阴影、天气、车辆、行人等因素所带来的挑战。
  • methods: 我们提出了一种自监学习方法,即对比学习(Contrastive Learning,CL),它通过将本地特征对照学习(Local Feature CL)与我们新提出的跨相似运算(Cross-similarity)结合,实现了对路径检测模型的提升。
  • results: 在基准数据集上,我们的CLLD方法优于现有的对比学习方法,尤其是在阴影等影响可见性的情况下。与监督学习相比,CLLD在阴影和拥挤场景等情形中表现更好。
    Abstract Detecting road lanes is challenging due to intricate markings vulnerable to unfavorable conditions. Lane markings have strong shape priors, but their visibility is easily compromised. Factors like lighting, weather, vehicles, pedestrians, and aging colors challenge the detection. A large amount of data is required to train a lane detection approach that can withstand natural variations caused by low visibility. This is because there are numerous lane shapes and natural variations that exist. Our solution, Contrastive Learning for Lane Detection via cross-similarity (CLLD), is a self-supervised learning method that tackles this challenge by enhancing lane detection models resilience to real-world conditions that cause lane low visibility. CLLD is a novel multitask contrastive learning that trains lane detection approaches to detect lane markings even in low visible situations by integrating local feature contrastive learning (CL) with our new proposed operation cross-similarity. Local feature CL focuses on extracting features for small image parts, which is necessary to localize lane segments, while cross-similarity captures global features to detect obscured lane segments using their surrounding. We enhance cross-similarity by randomly masking parts of input images for augmentation. Evaluated on benchmark datasets, CLLD outperforms state-of-the-art contrastive learning, especially in visibility-impairing conditions like shadows. Compared to supervised learning, CLLD excels in scenarios like shadows and crowded scenes.
    摘要 由于车道标线细节复杂且易受不利条件影响,车道检测颇具挑战。车道标线具有很强的形状先验,但其可见性很容易受到破坏:光照、天气、车辆、行人以及褪色的标线颜色都会给检测带来困难。由于存在大量的车道形状和自然变化,训练一个能够抵御低可见性所致自然变化的车道检测方法需要大量数据。我们的解决方案CLLD(基于跨相似性的车道检测对比学习)是一种自监督学习方法,通过增强车道检测模型对导致车道低可见性的真实环境的鲁棒性来应对这一挑战。CLLD是一种新颖的多任务对比学习方法,它将局部特征对比学习与我们新提出的跨相似性操作相结合,使车道检测方法即使在低可见性情况下也能检测车道标线。局部特征对比学习侧重于为图像小块提取特征,这是定位车道片段所必需的;跨相似性则捕捉全局特征,利用周围信息来检测被遮挡的车道片段。我们通过随机遮盖输入图像的部分区域来增强跨相似性。在基准数据集上的评估表明,CLLD优于最先进的对比学习方法,尤其是在阴影等损害可见性的条件下;与监督学习相比,CLLD在阴影和拥挤场景等情形中表现更优。
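To illustrate the general recipe of contrasting an image with a randomly masked view of itself, here is a small PyTorch sketch using a standard InfoNCE objective. The masking scheme, encoder, and loss are generic stand-ins; CLLD's actual local-feature CL and cross-similarity operation are more elaborate than this.

```python
import torch
import torch.nn.functional as F

def random_mask(images, patch=32, p=0.3):
    """Zero out a random square patch per image as the masking augmentation."""
    x = images.clone()
    b, _, h, w = x.shape
    for i in range(b):
        if torch.rand(1).item() < p:
            y0 = torch.randint(0, h - patch, (1,)).item()
            x0 = torch.randint(0, w - patch, (1,)).item()
            x[i, :, y0:y0 + patch, x0:x0 + patch] = 0.0
    return x

def info_nce(z1, z2, temperature=0.1):
    """Pull embeddings of an image and its masked view together (InfoNCE)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

encoder = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, stride=2, padding=1),
                              torch.nn.ReLU(), torch.nn.AdaptiveAvgPool2d(1),
                              torch.nn.Flatten(), torch.nn.Linear(16, 64))
imgs = torch.rand(8, 3, 128, 128)
loss = info_nce(encoder(imgs), encoder(random_mask(imgs)))
```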

DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field

  • paper_url: http://arxiv.org/abs/2308.08231
  • repo_url: None
  • paper_authors: Chenyangguang Zhang, Yan Di, Ruida Zhang, Guangyao Zhai, Fabian Manhardt, Federico Tombari, Xiangyang Ji
  • for: 重建手持物品的RGB图像是一个重要和挑战性的问题,现有的SDF方法存在局限性,无法同时捕捉手物交互的复杂性。
  • methods: 我们提出了DDF-HO方法,利用指向距离场(DDF)作为形式表示。与SDF不同的是,DDF将一个三维空间中的射线映射到相应的DDF值,包括一个二进制可见信号,判断射线是否与目标物体交叉,以及一个距离值,测量射线与目标之间的距离。
  • results: 我们在合成数据集和真实数据集上进行了广泛的实验,结果表明DDF-HO方法大幅优于基线方法,在Chamfer Distance指标上约有80%的提升。代码和训练好的模型即将发布。
    Abstract Reconstructing hand-held objects from a single RGB image is an important and challenging problem. Existing works utilizing Signed Distance Fields (SDF) reveal limitations in comprehensively capturing the complex hand-object interactions, since SDF is only reliable within the proximity of the target, and hence, infeasible to simultaneously encode local hand and object cues. To address this issue, we propose DDF-HO, a novel approach leveraging Directed Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in 3D space, consisting of an origin and a direction, to corresponding DDF values, including a binary visibility signal determining whether the ray intersects the objects and a distance value measuring the distance from origin to target in the given direction. We randomly sample multiple rays and collect local to global geometric features for them by introducing a novel 2D ray-based feature aggregation scheme and a 3D intersection-aware hand pose embedding, combining 2D-3D features to model hand-object interactions. Extensive experiments on synthetic and real-world datasets demonstrate that DDF-HO consistently outperforms all baseline methods by a large margin, especially under Chamfer Distance, with about 80% leap forward. Codes and trained models will be released soon.
    摘要 从单张RGB图像重建手持物体是一个重要而具有挑战性的问题。现有采用符号距离场(SDF)的工作存在局限:SDF只有在目标附近才可靠,因而难以同时编码局部的手部与物体线索,无法全面刻画复杂的手物交互。为解决这一问题,我们提出了DDF-HO方法,利用有向距离场(DDF)作为形状表示。与SDF不同,DDF将三维空间中由起点和方向构成的一条射线映射为相应的DDF值,其中包括判断射线是否与目标物体相交的二值可见性信号,以及沿给定方向度量起点到目标距离的距离值。我们随机采样多条射线,并通过一种新的基于2D射线的特征聚合方案和一种感知3D交点的手部姿态嵌入,为其收集由局部到全局的几何特征,将2D与3D特征相结合以建模手物交互。在合成数据集和真实数据集上的大量实验表明,DDF-HO始终大幅优于所有基线方法,尤其在Chamfer Distance指标上约有80%的提升。代码和训练好的模型即将发布。
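The DDF idea (a ray mapped to a visibility bit and a distance) can be made concrete with an analytic example. The sketch below evaluates the directed distance field of a sphere via ray-sphere intersection; in DDF-HO this mapping is predicted by a network rather than computed analytically.

```python
import numpy as np

def sphere_ddf(origin, direction, center, radius):
    """Directed distance field of a sphere: (visible?, distance along the ray)."""
    d = direction / np.linalg.norm(direction)
    oc = origin - center
    b = np.dot(oc, d)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - c                      # discriminant of the ray-sphere quadratic
    if disc < 0:
        return 0, np.inf                  # ray misses the object
    t = -b - np.sqrt(disc)                # nearest intersection
    if t < 0:
        t = -b + np.sqrt(disc)            # origin may be inside the sphere
    if t < 0:
        return 0, np.inf                  # sphere entirely behind the origin
    return 1, t

vis, dist = sphere_ddf(origin=np.array([0.0, 0.0, -3.0]),
                       direction=np.array([0.0, 0.0, 1.0]),
                       center=np.zeros(3), radius=1.0)
print(vis, dist)                          # 1 2.0
```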

Inherent Redundancy in Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2308.08227
  • repo_url: https://github.com/biclab/asa-snn
  • paper_authors: Man Yao, Jiakui Hu, Guangshe Zhao, Yaoyuan Wang, Ziyang Zhang, Bo Xu, Guoqi Li
  • for: 这篇论文旨在研究脉冲神经网络(SNN)中的内在冗余,以提升其准确率和能效性。
  • methods: 这篇论文使用了一种称为 Advance Spatial Attention(ASA)模块,来利用SNN中的内在重复性,并可以有效地控制噪声脉冲。
  • results: 实验结果表明,所提方法可以显著减少脉冲发放,并取得优于最先进SNN基线的性能。
    Abstract Spiking Neural Networks (SNNs) are well known as a promising energy-efficient alternative to conventional artificial neural networks. Subject to the preconceived impression that SNNs are sparse firing, the analysis and optimization of inherent redundancy in SNNs have been largely overlooked, thus the potential advantages of spike-based neuromorphic computing in accuracy and energy efficiency are interfered. In this work, we pose and focus on three key questions regarding the inherent redundancy in SNNs. We argue that the redundancy is induced by the spatio-temporal invariance of SNNs, which enhances the efficiency of parameter utilization but also invites lots of noise spikes. Further, we analyze the effect of spatio-temporal invariance on the spatio-temporal dynamics and spike firing of SNNs. Then, motivated by these analyses, we propose an Advance Spatial Attention (ASA) module to harness SNNs' redundancy, which can adaptively optimize their membrane potential distribution by a pair of individual spatial attention sub-modules. In this way, noise spike features are accurately regulated. Experimental results demonstrate that the proposed method can significantly drop the spike firing with better performance than state-of-the-art SNN baselines. Our code is available in \url{https://github.com/BICLab/ASA-SNN}.
    摘要 脉冲神经网络(SNN)以其低能耗特性被广泛认为是传统人工神经网络的一种有前景的替代方案。然而,受"SNN发放稀疏"这一先入为主印象的影响,SNN内在冗余的分析与优化在很大程度上被忽视,从而妨碍了基于脉冲的神经形态计算在准确率与能效方面潜在优势的发挥。在这项工作中,我们围绕SNN的内在冗余提出并聚焦三个关键问题。我们认为这种冗余源于SNN的时空不变性:它提高了参数利用效率,但也引入了大量噪声脉冲。随后,我们分析了时空不变性对SNN的时空动态和脉冲发放的影响。基于这些分析,我们提出了一种先进空间注意力(ASA)模块来利用SNN的冗余,它通过一对独立的空间注意力子模块自适应地优化膜电位分布,从而精准地调控噪声脉冲特征。实验结果表明,所提方法能够显著减少脉冲发放,并取得优于最先进SNN基线的性能。代码见 https://github.com/BICLab/ASA-SNN 。
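As a rough analogue of gating spike features with spatial attention, the sketch below applies a lightweight CBAM-style spatial attention mask to a binary spike tensor. It is not the paper's ASA module (which uses a pair of spatial attention sub-modules to shape the membrane potential distribution); the kernel size and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttentionGate(nn.Module):
    """Gate spike feature maps with a lightweight spatial attention mask."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, spikes):                       # spikes: (B, C, H, W), values in {0, 1}
        avg = spikes.mean(dim=1, keepdim=True)       # average firing rate per location
        mx, _ = spikes.max(dim=1, keepdim=True)      # strongest channel per location
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return spikes * attn                         # down-weight locations dominated by noise spikes

gate = SpatialAttentionGate()
spikes = (torch.rand(2, 32, 16, 16) > 0.8).float()   # sparse binary spike tensor
out = gate(spikes)
print(out.shape)                                     # torch.Size([2, 32, 16, 16])
```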

How To Overcome Confirmation Bias in Semi-Supervised Image Classification By Active Learning

  • paper_url: http://arxiv.org/abs/2308.08224
  • repo_url: None
  • paper_authors: Sandra Gilhuber, Rasmus Hvingelby, Mang Ling Ada Fok, Thomas Seidl
  • for: 本研究探讨在有限标注数据的设置下,既然已有强大的深度半监督方法,是否仍然需要主动学习。
  • methods: 本研究构建了实际应用中常见的三类数据挑战(类间不平衡、类内不平衡和类间相似),并在这些模拟数据挑战上对半监督学习(SSL)与主动学习(AL)进行了实验比较。
  • results: 实验表明,随机采样无法缓解确认偏差,在某些情况下甚至劣于监督学习;而主动学习能够在这些贴近现实的设置中克服SSL的确认偏差。
    Abstract Do we need active learning? The rise of strong deep semi-supervised methods raises doubt about the usability of active learning in limited labeled data settings. This is caused by results showing that combining semi-supervised learning (SSL) methods with a random selection for labeling can outperform existing active learning (AL) techniques. However, these results are obtained from experiments on well-established benchmark datasets that can overestimate the external validity. However, the literature lacks sufficient research on the performance of active semi-supervised learning methods in realistic data scenarios, leaving a notable gap in our understanding. Therefore we present three data challenges common in real-world applications: between-class imbalance, within-class imbalance, and between-class similarity. These challenges can hurt SSL performance due to confirmation bias. We conduct experiments with SSL and AL on simulated data challenges and find that random sampling does not mitigate confirmation bias and, in some cases, leads to worse performance than supervised learning. In contrast, we demonstrate that AL can overcome confirmation bias in SSL in these realistic settings. Our results provide insights into the potential of combining active and semi-supervised learning in the presence of common real-world challenges, which is a promising direction for robust methods when learning with limited labeled data in real-world applications.
    摘要 我们还需要主动学习吗?强大的深度半监督方法的兴起,使人们对在有限标注数据设置下使用主动学习的价值产生了质疑:已有结果表明,将半监督学习(SSL)方法与随机选择标注相结合,可以超越现有的主动学习(AL)技术。然而,这些结果是在成熟的基准数据集上获得的,可能高估了其外部有效性。文献中也缺乏对主动半监督学习方法在真实数据场景下性能的充分研究,这在我们的认识中留下了明显的空白。为此,我们提出了真实应用中常见的三类数据挑战:类间不平衡、类内不平衡和类间相似。这些挑战可能因确认偏差而损害SSL的性能。我们在模拟的数据挑战上对SSL与AL进行了实验,发现随机采样并不能缓解确认偏差,在某些情况下甚至比监督学习表现更差;相反,我们证明了AL能够在这些贴近现实的设置中克服SSL的确认偏差。我们的结果揭示了在常见真实挑战存在时将主动学习与半监督学习相结合的潜力,为真实应用中在有限标注数据下构建鲁棒方法指明了一个有前景的方向。
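For contrast with random selection, a classical active-learning acquisition step can be sketched in a few lines: score the unlabeled pool by predictive entropy and query the most uncertain samples. This is a generic criterion shown for illustration, not necessarily the AL strategy evaluated in the paper.

```python
import numpy as np

def select_most_uncertain(probs, k):
    """Pick the k unlabeled samples with the highest predictive entropy."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[::-1][:k]             # indices to send for labeling

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))                  # model outputs on the unlabeled pool
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
query = select_most_uncertain(probs, k=32)
print(query[:5])
```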

Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network

  • paper_url: http://arxiv.org/abs/2308.08220
  • repo_url: None
  • paper_authors: Yinglong Wang, Zhen Liu, Jianzhuang Liu, Songcen Xu, Shuaicheng Liu
  • for: 解决低光照图像增强问题
  • methods: integrate gamma correction with deep networks, use Taylor Series to approximate gamma correction, use a novel Transformer block to simulate pixel dependencies
  • results: outperform state-of-the-art methods on several benchmark datasets
    Abstract This paper presents a novel network structure with illumination-aware gamma correction and complete image modelling to solve the low-light image enhancement problem. Low-light environments usually lead to less informative large-scale dark areas, directly learning deep representations from low-light images is insensitive to recovering normal illumination. We propose to integrate the effectiveness of gamma correction with the strong modelling capacities of deep networks, which enables the correction factor gamma to be learned in a coarse to elaborate manner via adaptively perceiving the deviated illumination. Because exponential operation introduces high computational complexity, we propose to use Taylor Series to approximate gamma correction, accelerating the training and inference speed. Dark areas usually occupy large scales in low-light images, common local modelling structures, e.g., CNN, SwinIR, are thus insufficient to recover accurate illumination across whole low-light images. We propose a novel Transformer block to completely simulate the dependencies of all pixels across images via a local-to-global hierarchical attention mechanism, so that dark areas could be inferred by borrowing the information from far informative regions in a highly effective manner. Extensive experiments on several benchmark datasets demonstrate that our approach outperforms state-of-the-art methods.
    摘要 本文提出了一种结合光照感知伽马校正与完整图像建模的新型网络结构,用于解决低光照图像增强问题。低光照环境通常会产生大面积、信息量少的暗区,直接从低光照图像中学习深度表示难以恢复正常光照。我们提出将伽马校正的有效性与深度网络强大的建模能力相结合,通过自适应地感知偏离的光照,使校正因子伽马得以由粗到细地学习。由于指数运算带来较高的计算复杂度,我们提出用泰勒级数来近似伽马校正,从而加速训练与推理。暗区在低光照图像中往往占据较大范围,常见的局部建模结构(如CNN、SwinIR)不足以在整幅低光照图像上恢复准确的光照。为此,我们提出了一种新的Transformer模块,通过由局部到全局的分层注意力机制完整地模拟图像中所有像素之间的依赖关系,使暗区能够高效地借助远处信息丰富区域的信息进行推断。在多个基准数据集上的大量实验表明,我们的方法优于最先进的方法。
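The Taylor-series trick for gamma correction can be written down directly: since x^γ = exp(γ ln x), a truncated expansion of the exponential avoids the costly power operation. The expansion order and clipping below are assumptions; the paper's exact formulation (and how γ is predicted per pixel) is not reproduced here.

```python
import numpy as np
from math import factorial

def gamma_taylor(x, gamma, order=6):
    """Approximate x**gamma = exp(gamma * ln x) by a truncated Taylor series of exp."""
    x = np.clip(x, 1e-6, 1.0)            # pixel intensities in (0, 1]
    u = gamma * np.log(x)
    out = np.zeros_like(x)
    for k in range(order + 1):
        out += u ** k / factorial(k)     # sum_k u^k / k!
    return out

x = np.linspace(0.05, 1.0, 5)
print(gamma_taylor(x, gamma=0.4))        # truncated-series approximation
print(x ** 0.4)                          # exact gamma correction for comparison
```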

MEDOE: A Multi-Expert Decoder and Output Ensemble Framework for Long-tailed Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.08213
  • repo_url: None
  • paper_authors: Junao Shen, Long Chen, Kun Kuang, Fei Wu, Tian Feng, Wei Zhang
  • for: 解决长尾分布难以捕捉的问题,提高 semantic segmentation 的性能。
  • methods: 提出了一个名为 MEDOE 的新框架,通过Contextual Information Ensemble-and-Grouping 技术,使用多个专家来提高分类的准确性。
  • results: 实验结果表明,MEDOE 比现有方法在 Cityscapes 和 ADE20K 数据集上提高了1.78% 的 mIoU 和 5.89% 的 mAcc。
    Abstract Long-tailed distribution of semantic categories, which has been often ignored in conventional methods, causes unsatisfactory performance in semantic segmentation on tail categories. In this paper, we focus on the problem of long-tailed semantic segmentation. Although some long-tailed recognition methods (e.g., re-sampling/re-weighting) have been proposed in other problems, they can probably compromise crucial contextual information and are thus hardly adaptable to the problem of long-tailed semantic segmentation. To address this issue, we propose MEDOE, a novel framework for long-tailed semantic segmentation via contextual information ensemble-and-grouping. The proposed two-sage framework comprises a multi-expert decoder (MED) and a multi-expert output ensemble (MOE). Specifically, the MED includes several "experts". Based on the pixel frequency distribution, each expert takes the dataset masked according to the specific categories as input and generates contextual information self-adaptively for classification; The MOE adopts learnable decision weights for the ensemble of the experts' outputs. As a model-agnostic framework, our MEDOE can be flexibly and efficiently coupled with various popular deep neural networks (e.g., DeepLabv3+, OCRNet, and PSPNet) to improve their performance in long-tailed semantic segmentation. Experimental results show that the proposed framework outperforms the current methods on both Cityscapes and ADE20K datasets by up to 1.78% in mIoU and 5.89% in mAcc.
    摘要 长尾分布的 semantic category 问题,常被 conventional methods 忽略,导致 semantic segmentation 的性能不满意。在这篇论文中,我们关注了长尾 semantic segmentation 问题。尽管有一些长尾认知方法(例如重新批量/重新权重)在其他问题上提出,但它们可能会丢失重要的上下文信息,因此难以适应长尾 semantic segmentation 问题。为解决这个问题,我们提出了 MEDOE,一种基于上下文信息的集成和分组的novel框架。该框架包括一个多专家解码器(MED)和一个多专家输出集(MOE)。具体来说,MED 包括多个 "专家"。根据像素频率分布,每个专家都会根据特定类别为输入,在自适应的方式下生成上下文信息用于分类; MOE 采用可学习的决策权重,对专家们的输出进行集成。作为一个模型独立的框架,我们的 MEDOE 可以与各种流行的深度神经网络(例如 DeepLabv3+、OCRNet 和 PSPNet)模型结合,以提高它们在长尾 semantic segmentation 中的性能。实验结果表明,我们的提议方案在 Cityscapes 和 ADE20K 数据集上比现有方法提高了1.78%的 mIoU 和 5.89%的 mAcc。
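A minimal sketch of the output-ensemble step: each expert produces segmentation logits, and learnable decision weights (here, a per-class softmax over experts) combine them. The weighting granularity and shapes are assumptions; MEDOE's actual ensemble and grouping are more involved.

```python
import torch
import torch.nn as nn

class ExpertEnsemble(nn.Module):
    """Combine per-expert segmentation logits with learnable decision weights."""
    def __init__(self, num_experts, num_classes):
        super().__init__()
        self.decision = nn.Parameter(torch.zeros(num_experts, num_classes))

    def forward(self, expert_logits):                    # (E, B, C, H, W)
        w = torch.softmax(self.decision, dim=0)          # per-class weight over experts
        w = w.view(w.shape[0], 1, w.shape[1], 1, 1)      # broadcast to (E, 1, C, 1, 1)
        return (expert_logits * w).sum(dim=0)            # fused logits (B, C, H, W)

ens = ExpertEnsemble(num_experts=3, num_classes=19)
logits = torch.randn(3, 2, 19, 64, 64)                   # e.g. head/body/tail experts, Cityscapes-like classes
fused = ens(logits)
print(fused.shape)                                       # torch.Size([2, 19, 64, 64])
```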

Neural Spherical Harmonics for structurally coherent continuous representation of diffusion MRI signal

  • paper_url: http://arxiv.org/abs/2308.08210
  • repo_url: None
  • paper_authors: Tom Hendriks, Anna Vilanova, Maxime Chamberland
  • for: 这篇论文旨在提出一种建模扩散磁共振成像(dMRI)数据的新方法,该方法在仅使用单个受试者数据的同时,利用了人脑的结构一致性。
  • methods: 该方法使用神经网络参数化一个球谐函数级数(NeSH)来表示单个受试者的dMRI信号,该表示在角度域和空间域上均是连续的。
  • results: 用该方法重建的dMRI信号在结构上更为一致:梯度图像中的噪声被去除,纤维取向分布函数沿纤维束呈现平滑的方向变化。此外,重建结果还可用于计算平均扩散率、分数各向异性和总表观纤维密度。这些结果仅需单一的模型架构并只调节一个超参数即可获得。在角度域和空间域进行上采样得到的重建结果与现有方法相当或更优。
    Abstract We present a novel way to model diffusion magnetic resonance imaging (dMRI) datasets, that benefits from the structural coherence of the human brain while only using data from a single subject. Current methods model the dMRI signal in individual voxels, disregarding the intervoxel coherence that is present. We use a neural network to parameterize a spherical harmonics series (NeSH) to represent the dMRI signal of a single subject from the Human Connectome Project dataset, continuous in both the angular and spatial domain. The reconstructed dMRI signal using this method shows a more structurally coherent representation of the data. Noise in gradient images is removed and the fiber orientation distribution functions show a smooth change in direction along a fiber tract. We showcase how the reconstruction can be used to calculate mean diffusivity, fractional anisotropy, and total apparent fiber density. These results can be achieved with a single model architecture, tuning only one hyperparameter. In this paper we also demonstrate how upsampling in both the angular and spatial domain yields reconstructions that are on par or better than existing methods.
    摘要 我们提出了一种建模扩散磁共振成像(dMRI)数据集的新方法,该方法在仅使用单个受试者数据的同时利用了人脑的结构一致性。现有方法在单个体素内建模dMRI信号,忽略了体素之间的一致性。我们使用神经网络参数化一个球谐函数级数(NeSH),来表示来自Human Connectome Project数据集的单个受试者的dMRI信号,该表示在角度域和空间域上均是连续的。重建结果显示,该方法能够得到结构上更为一致的数据表示:梯度图像中的噪声被去除,纤维取向分布函数沿纤维束呈现平滑的方向变化。我们还展示了如何利用该重建结果计算平均扩散率、分数各向异性和总表观纤维密度。这些结果仅需单一的模型架构并只调节一个超参数即可获得。本文还表明,在角度域和空间域进行上采样得到的重建结果与现有方法相当或更优。

Self-Reference Deep Adaptive Curve Estimation for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.08197
  • repo_url: https://github.com/john-venti/self-dace
  • paper_authors: Jianyu Wen, Chenhao Wu, Tong Zhang, Yixuan Yu, Piotr Swierczynski
  • for: 提高低光照图像的显示品质
  • methods: 提出了一种名为自参考深度自适应曲线估计(Self-DACE)的两阶段低光照图像增强方法,其中包括一种直观、轻量、快速且无监督的亮度增强算法,以及一种基于简化物理模型的新损失函数,用于保持自然图像的颜色、结构和保真度;第二阶段再引入相应的去噪方案以去除暗部的潜在噪声。
  • results: 在多个真实数据集上进行的大量定性与定量分析表明,该方法优于现有的最先进算法。
    Abstract In this paper, we propose a 2-stage low-light image enhancement method called Self-Reference Deep Adaptive Curve Estimation (Self-DACE). In the first stage, we present an intuitive, lightweight, fast, and unsupervised luminance enhancement algorithm. The algorithm is based on a novel low-light enhancement curve that can be used to locally boost image brightness. We also propose a new loss function with a simplified physical model designed to preserve natural images' color, structure, and fidelity. We use a vanilla CNN to map each pixel through deep Adaptive Adjustment Curves (AAC) while preserving the local image structure. Secondly, we introduce the corresponding denoising scheme to remove the latent noise in the darkness. We approximately model the noise in the dark and deploy a Denoising-Net to estimate and remove the noise after the first stage. Exhaustive qualitative and quantitative analysis shows that our method outperforms existing state-of-the-art algorithms on multiple real-world datasets.
    摘要 在本文中,我们提出了一种两阶段低光照图像提升方法,称为自适应曲线估计(Self-DACE)。在第一阶段,我们提出了一种直观、轻量级、快速、不需要监督的亮度提升算法。该算法基于一个新的低光照增强曲线,可以地方增强图像亮度。我们还提出了一个新的损失函数,采用简化的物理模型,保持自然图像的颜色、结构和准确性。我们使用一个普通的Convolutional Neural Network(CNN)将每个像素通过深度适应曲线(AAC)进行映射,保持图像的地方结构。在第二阶段,我们引入了对应的干扰除方案,以除去黑暗中的秘密噪声。我们约化黑暗中的噪声,并部署了一个干扰除网络来估计和除去噪声。我们的方法在多个实际世界数据集上进行了系统的质量和量化分析,结果表明我们的方法在与现有状态艺术算法进行比较时表现出色。
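The "adjustment curve" family can be illustrated with the quadratic curve popularized by Zero-DCE-style methods, applied iteratively with per-pixel parameters that a network would predict. This is an assumed stand-in for Self-DACE's curve, shown only to make the curve-based enhancement idea concrete.

```python
import numpy as np

def adaptive_curve_enhance(x, alpha_maps):
    """Apply a pixel-wise quadratic adjustment curve several times.

    x:          image in [0, 1], shape (H, W, 3)
    alpha_maps: list of per-pixel curve parameters in [-1, 1], same shape as x
    """
    for alpha in alpha_maps:
        x = x + alpha * x * (1.0 - x)     # boosts mid/dark tones, leaves 0 and 1 fixed
        x = np.clip(x, 0.0, 1.0)
    return x

rng = np.random.default_rng(0)
low_light = rng.uniform(0.0, 0.3, size=(64, 64, 3))            # synthetic dark image
alphas = [np.full_like(low_light, 0.8) for _ in range(4)]      # a CNN would predict these per pixel
print(adaptive_curve_enhance(low_light, alphas).mean(), low_light.mean())
```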

Automatic Vision-Based Parking Slot Detection and Occupancy Classification

  • paper_url: http://arxiv.org/abs/2308.08192
  • repo_url: None
  • paper_authors: Ratko Grbić, Brando Koch
  • for: 这篇论文是为了提供一种自动化停车槽检测和占用分类算法,用于提高停车导航信息系统(PGI)的精度和效果。
  • methods: 该算法使用了摄像头拍摄停车场的图像,并对图像进行了一系列的处理和分析,包括车辆检测、bird’s eye view clustering、占用分类等步骤。
  • results: 在公开可用的PKLot和CNRPark+EXT数据集上的评估表明,该算法能够高效地检测停车位并进行占用分类,且对违规停放或过往车辆具有鲁棒性。
    Abstract Parking guidance information (PGI) systems are used to provide information to drivers about the nearest parking lots and the number of vacant parking slots. Recently, vision-based solutions started to appear as a cost-effective alternative to standard PGI systems based on hardware sensors mounted on each parking slot. Vision-based systems provide information about parking occupancy based on images taken by a camera that is recording a parking lot. However, such systems are challenging to develop due to various possible viewpoints, weather conditions, and object occlusions. Most notably, they require manual labeling of parking slot locations in the input image which is sensitive to camera angle change, replacement, or maintenance. In this paper, the algorithm that performs Automatic Parking Slot Detection and Occupancy Classification (APSD-OC) solely on input images is proposed. Automatic parking slot detection is based on vehicle detections in a series of parking lot images upon which clustering is applied in bird's eye view to detect parking slots. Once the parking slots positions are determined in the input image, each detected parking slot is classified as occupied or vacant using a specifically trained ResNet34 deep classifier. The proposed approach is extensively evaluated on well-known publicly available datasets (PKLot and CNRPark+EXT), showing high efficiency in parking slot detection and robustness to the presence of illegal parking or passing vehicles. Trained classifier achieves high accuracy in parking slot occupancy classification.
    摘要 停车引导信息(PGI)系统用于向驾驶员提供附近停车场及空余停车位数量的信息。近年来,基于视觉的方案开始出现,作为在每个停车位上安装硬件传感器的标准PGI系统的一种低成本替代。基于视觉的系统依据拍摄停车场的摄像头图像提供停车占用信息。然而,由于可能存在多种视角、天气条件和物体遮挡,这类系统的开发颇具挑战;尤其是它们需要在输入图像中手动标注停车位位置,而这一标注对摄像头角度变化、更换或维护十分敏感。本文提出了一种仅基于输入图像的自动停车位检测与占用分类算法(APSD-OC)。自动停车位检测基于对一系列停车场图像中的车辆检测,并在鸟瞰视图下对检测结果进行聚类以确定停车位。在输入图像中确定停车位位置后,使用专门训练的ResNet34深度分类器将每个检测到的停车位分类为已占用或空闲。所提方法在知名的公开数据集(PKLot和CNRPark+EXT)上进行了广泛评估,显示出高效的停车位检测能力,并对违规停放或过往车辆的存在具有鲁棒性。训练得到的分类器在停车位占用分类上达到了很高的准确率。
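The slot-detection step (clustering repeated vehicle detections in bird's-eye view) can be sketched with an off-the-shelf density-based clustering. The eps/min_samples values and the synthetic detections below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_parking_slots(bev_centroids, eps=1.0, min_samples=5):
    """Group vehicle-detection centroids (metres, bird's-eye view) into candidate slots.

    bev_centroids: (N, 2) positions accumulated over a series of parking-lot images;
    repeated detections at the same physical slot form a dense cluster.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(bev_centroids)
    slots = [bev_centroids[labels == k].mean(axis=0) for k in set(labels) if k != -1]
    return np.array(slots)                       # one (x, y) estimate per detected slot

rng = np.random.default_rng(0)
true_slots = np.array([[0.0, 0.0], [2.5, 0.0], [5.0, 0.0]])
detections = np.vstack([s + rng.normal(scale=0.2, size=(20, 2)) for s in true_slots])
print(cluster_parking_slots(detections))
```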

Unsupervised Domain Adaptive Detection with Network Stability Analysis

  • paper_url: http://arxiv.org/abs/2308.08182
  • repo_url: https://github.com/tiankongzhang/nsa
  • paper_authors: Wenzhang Zhou, Heng Fan, Tiejian Luo, Libo Zhang
  • for: 这篇研究旨在提升在有标注源域上学习得到的检测器在无标注目标域上的泛化能力。
  • methods: 本研究提出网络稳定性分析(Network Stability Analysis,NSA)来实现无监督域适应检测。NSA将不同域之间图像和区域层面的差异视为扰动,考察图像级(重度与轻度)及实例级三类扰动,并借助教师-学生模型对原始图像与扰动图像的输出进行外部一致性分析,以及对其特征进行内部一致性分析。
  • results: 这篇研究使用NSA整合Faster R-CNN,实现了顶尖的结果,包括在Cityscapes-to-FoggyCityscapes上的52.7% mAP记录。此外,NSA还可以应用于其他一阶检测器(例如FCOS),并且在实验中证明了这一点。
    Abstract Domain adaptive detection aims to improve the generality of a detector, learned from the labeled source domain, on the unlabeled target domain. In this work, drawing inspiration from the concept of stability from the control theory that a robust system requires to remain consistent both externally and internally regardless of disturbances, we propose a novel framework that achieves unsupervised domain adaptive detection through stability analysis. In specific, we treat discrepancies between images and regions from different domains as disturbances, and introduce a novel simple but effective Network Stability Analysis (NSA) framework that considers various disturbances for domain adaptation. Particularly, we explore three types of perturbations including heavy and light image-level disturbances and instancelevel disturbance. For each type, NSA performs external consistency analysis on the outputs from raw and perturbed images and/or internal consistency analysis on their features, using teacher-student models. By integrating NSA into Faster R-CNN, we immediately achieve state-of-the-art results. In particular, we set a new record of 52.7% mAP on Cityscapes-to-FoggyCityscapes, showing the potential of NSA for domain adaptive detection. It is worth noticing, our NSA is designed for general purpose, and thus applicable to one-stage detection model (e.g., FCOS) besides the adopted one, as shown by experiments. https://github.com/tiankongzhang/NSA.
    摘要 域适应检测旨在提升在有标注源域上学习得到的检测器在无标注目标域上的泛化能力。在这项工作中,我们受控制理论中"稳定性"概念(即鲁棒的系统无论受到何种扰动,都应在外部和内部保持一致)的启发,提出了一个通过稳定性分析实现无监督域适应检测的新框架。具体而言,我们将不同域之间图像和区域层面的差异视为扰动,并提出了一个简单而有效的网络稳定性分析(NSA)框架,在域适应中考虑多种扰动。我们特别研究了三类扰动:重度与轻度的图像级扰动以及实例级扰动。对于每一类扰动,NSA借助教师-学生模型,对原始图像与扰动图像的输出进行外部一致性分析,和/或对其特征进行内部一致性分析。将NSA集成到Faster R-CNN后,我们立即取得了最先进的结果;特别地,我们在Cityscapes-to-FoggyCityscapes上创造了52.7% mAP的新纪录,展示了NSA在域适应检测中的潜力。值得注意的是,NSA是为通用目的设计的,因此除所采用的检测器之外,也适用于单阶段检测模型(如FCOS),实验对此予以了验证。更多细节见 https://github.com/tiankongzhang/NSA 。
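External consistency between outputs on raw and perturbed images is typically written as a teacher-student consistency loss. The sketch below uses a KL term over classification logits; the actual NSA framework analyses several perturbation types and also an internal (feature-level) consistency, which are not shown.

```python
import torch
import torch.nn.functional as F

def external_consistency_loss(student_logits, teacher_logits, temperature=1.0):
    """Penalize disagreement between predictions on raw (teacher) and perturbed (student) images."""
    p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)   # teacher is not updated by this loss
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Toy usage: classification logits for 16 proposals over 9 classes (shapes are assumptions).
teacher_out = torch.randn(16, 9)          # model applied to the raw target-domain image
student_out = torch.randn(16, 9)          # model applied to a perturbed version of the same image
loss = external_consistency_loss(student_out, teacher_out)
# In practice the teacher is often an EMA copy of the student, e.g. updated per step as
# teacher_param.mul_(0.999).add_(student_param, alpha=0.001)
```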

AATCT-IDS: A Benchmark Abdominal Adipose Tissue CT Image Dataset for Image Denoising, Semantic Segmentation, and Radiomics Evaluation

  • paper_url: http://arxiv.org/abs/2308.08172
  • repo_url: None
  • paper_authors: Zhiyu Ma, Chen Li, Tianming Du, Le Zhang, Dechao Tang, Deguo Ma, Shanchuan Huang, Yan Liu, Yihao Sun, Zhihao Chen, Jin Yuan, Qianqing Nie, Marcin Grzegorzek, Hongzan Sun
  • for: 本研究构建该数据集是为了研究腹部脂肪组织的多维特征。
  • methods: 本研究构建了名为AATTCT-IDS的基准数据集,包含300名受试者的CT切片,研究人员对其中3,213张层间距相同的切片手动标注了皮下及内脏脂肪组织区域;并结合可视化结果与评估数据,比较分析了多种方法在不同任务上的表现,以验证该数据集在图像去噪、语义分割和影像组学三类任务中的研究潜力。
  • results: 在图像去噪的对比研究中,采用平滑策略的算法以牺牲图像细节为代价抑制混合噪声,评估指标更优,而BM3D等方法能更好地保留原始图像结构;在腹部脂肪组织语义分割中,BiSeNet以最短的训练时间取得仅略逊于U-Net的结果,并能有效分离细小、孤立的脂肪组织;基于AATTCT-IDS的影像组学研究揭示了受试人群中的三种脂肪分布模式。
    Abstract Methods: In this study, a benchmark \emph{Abdominal Adipose Tissue CT Image Dataset} (AATTCT-IDS) containing 300 subjects is prepared and published. AATTCT-IDS publics 13,732 raw CT slices, and the researchers individually annotate the subcutaneous and visceral adipose tissue regions of 3,213 of those slices that have the same slice distance to validate denoising methods, train semantic segmentation models, and study radiomics. For different tasks, this paper compares and analyzes the performance of various methods on AATTCT-IDS by combining the visualization results and evaluation data. Thus, verify the research potential of this data set in the above three types of tasks. Results: In the comparative study of image denoising, algorithms using a smoothing strategy suppress mixed noise at the expense of image details and obtain better evaluation data. Methods such as BM3D preserve the original image structure better, although the evaluation data are slightly lower. The results show significant differences among them. In the comparative study of semantic segmentation of abdominal adipose tissue, the segmentation results of adipose tissue by each model show different structural characteristics. Among them, BiSeNet obtains segmentation results only slightly inferior to U-Net with the shortest training time and effectively separates small and isolated adipose tissue. In addition, the radiomics study based on AATTCT-IDS reveals three adipose distributions in the subject population. Conclusion: AATTCT-IDS contains the ground truth of adipose tissue regions in abdominal CT slices. This open-source dataset can attract researchers to explore the multi-dimensional characteristics of abdominal adipose tissue and thus help physicians and patients in clinical practice. AATCT-IDS is freely published for non-commercial purpose at: \url{https://figshare.com/articles/dataset/AATTCT-IDS/23807256}.
    摘要 方法:本研究构建并发布了包含300名受试者的基准数据集"腹部脂肪组织CT图像数据集"(AATTCT-IDS)。AATTCT-IDS公开了13,732张原始CT切片,研究人员对其中3,213张层间距相同的切片分别标注了皮下和内脏脂肪组织区域,用于验证去噪方法、训练语义分割模型以及开展影像组学研究。针对不同任务,本文结合可视化结果与评估数据,比较并分析了多种方法在AATTCT-IDS上的表现,从而验证该数据集在上述三类任务中的研究潜力。结果:在图像去噪的对比研究中,采用平滑策略的算法以牺牲图像细节为代价抑制混合噪声,获得了更好的评估指标;BM3D等方法能更好地保留原始图像结构,尽管评估指标略低;各方法之间的结果差异显著。在腹部脂肪组织语义分割的对比研究中,各模型的分割结果呈现出不同的结构特征;其中BiSeNet以最短的训练时间取得仅略逊于U-Net的结果,并能有效分离细小、孤立的脂肪组织。此外,基于AATTCT-IDS的影像组学研究揭示了受试人群中的三种脂肪分布模式。结论:AATTCT-IDS包含腹部CT切片中脂肪组织区域的真值标注。这一开源数据集有望吸引研究人员探索腹部脂肪组织的多维特征,从而在临床实践中帮助医生和患者。AATTCT-IDS以非商业用途免费发布于:https://figshare.com/articles/dataset/AATTCT-IDS/23807256 。

Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis

  • paper_url: http://arxiv.org/abs/2308.08157
  • repo_url: https://github.com/pmh9960/GCDP
  • paper_authors: Minho Park, Jooyeol Yun, Seunghwan Choi, Jaegul Choo
  • for: 提高文本到图像生成的文本-图像匹配率,而不仅仅依靠大规模的文本-图像数据集。
  • methods: 提出了一种新的方法,即利用可用的semantic layout来增强文本到图像生成的文本-图像匹配率。Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates both images and corresponding layout pairs.
  • results: 我们的实验表明,我们可以通过训练模型生成 semantic labels for each pixel,使模型对不同图像区域的semantics有所了解。我们的方法在Multi-Modal CelebA-HQ和Cityscapes dataset上达到了更高的文本-图像匹配率,这些数据集上的文本-图像对是罕见的。
    Abstract Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence, largely benefiting from web-scale text-image datasets, which can include up to 5~billion pairs. However, text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and faces, still suffer from low text-image correspondence due to the lack of text-image pairs. Additionally, collecting billions of text-image pairs for a specific domain can be time-consuming and costly. Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates both images and corresponding layout pairs. Our experiments reveal that we can guide text-to-image generation models to be aware of the semantics of different image regions, by training the model to generate semantic labels for each pixel. We demonstrate that our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches in the Multi-Modal CelebA-HQ and the Cityscapes dataset, where text-image pairs are scarce. Codes are available in this https://pmh9960.github.io/research/GCDP
    摘要 现有的文本到图像生成方法已经设置了高标准 для光实感和文本图像对应,主要受益于网络规模的文本图像集,这些集可以包括多达50亿对。然而,文本到图像生成模型在域pecific的 dataset上,如城市场景、医学图像和人脸,仍然受到文本图像对应的低问题,因为缺乏文本图像对。此外,收集百亿对文本图像对可以是时间consuming和costly的。因此,保证高度文本图像对应而不依赖于网络规模的文本图像集是一个挑战。在这篇论文中,我们提出了一种新的方法,通过利用可用的semantic layout来提高文本图像对应。具体来说,我们提出了一个 Gaussian-categorical 扩散过程,同时生成图像和对应的布局对。我们的实验表明,我们可以通过训练模型生成semantic标签来引导文本到图像生成模型对不同图像区域的semantic有认知。我们的方法在Multi-Modal CelebA-HQ和Cityscapes dataset上表现出高度文本图像对应,这些数据集上文本图像对scarce。代码可以在以下链接获取:https://pmh9960.github.io/research/GCDP

Conditional Perceptual Quality Preserving Image Compression

  • paper_url: http://arxiv.org/abs/2308.08154
  • repo_url: None
  • paper_authors: Tongda Xu, Qian Zhang, Yanghao Li, Dailan He, Zhe Wang, Yuanyuan Wang, Hongwei Qin, Yan Wang, Jingjing Liu, Ya-Qin Zhang
  • for: 该文章提出了一种基于用户定义信息的 conditional perceptual quality(CPQ),用于保持高质量和 semantics 的图像压缩。
  • methods: 该文章使用了扩展了 Blau et al. 的 perceptual quality 定义,通过conditioning 用户定义的信息来提高压缩率。
  • results: 实验结果表明,该编解码器能够在所有码率下保持较高的感知质量与语义质量;此外,通过给出所需公共随机性的下界,解决了此前关于(条件)感知质量压缩中生成器是否应引入随机性的争论。
    Abstract We propose conditional perceptual quality, an extension of the perceptual quality defined in \citet{blau2018perception}, by conditioning it on user defined information. Specifically, we extend the original perceptual quality $d(p_{X},p_{\hat{X}})$ to the conditional perceptual quality $d(p_{X|Y},p_{\hat{X}|Y})$, where $X$ is the original image, $\hat{X}$ is the reconstructed, $Y$ is side information defined by user and $d(.,.)$ is divergence. We show that conditional perceptual quality has similar theoretical properties as rate-distortion-perception trade-off \citep{blau2019rethinking}. Based on these theoretical results, we propose an optimal framework for conditional perceptual quality preserving compression. Experimental results show that our codec successfully maintains high perceptual quality and semantic quality at all bitrate. Besides, by providing a lowerbound of common randomness required, we settle the previous arguments on whether randomness should be incorporated into generator for (conditional) perceptual quality compression. The source code is provided in supplementary material.
    摘要 我们提出条件感知质量,它在\citet{blau2018perception}所定义的感知质量的基础上,以用户定义的信息为条件加以扩展。具体来说,我们将原始的感知质量$d(p_{X},p_{\hat{X}})$扩展为条件感知质量$d(p_{X|Y},p_{\hat{X}|Y})$,其中$X$是原始图像,$\hat{X}$是重建图像,$Y$是用户定义的侧信息,$d(\cdot,\cdot)$为散度。我们证明条件感知质量具有与率-失真-感知权衡\citep{blau2019rethinking}类似的理论性质。基于这些理论结果,我们提出了一个保持条件感知质量的最优压缩框架。实验结果表明,我们的编解码器能够在所有码率下保持较高的感知质量与语义质量。此外,通过给出所需公共随机性的下界,我们解决了此前关于(条件)感知质量压缩中生成器是否应引入随机性的争论。源代码见补充材料。

SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation

  • paper_url: http://arxiv.org/abs/2308.08143
  • repo_url: None
  • paper_authors: Kai Li, Runxuan Yang, Xiaolin Hu
  • for: 这篇论文主要是为了提出一种新的多模态协同分离方法,以提高人工智能系统对听视环境的识别能力。
  • methods: 这篇论文提出了一种名为自我和交叉注意网络(SCANet)的新模型,该模型利用注意机制来有效地融合听视信号。SCANet包括两种注意块:自我注意(SA)块和交叉注意(CA)块,其中CA块分布在网络的顶部(TCA)、中部(MCA)和底部(BCA)。这些块使得模型可以学习不同的模式特征,并提取不同的 semantics 从听视特征。
  • results: 对于三个标准的听视分离测试集(LRS2、LRS3和VoxCeleb2),SCANet表现出色,超过现有的状态时之方法,并且保持相对的执行时间相对较短。
    Abstract The integration of different modalities, such as audio and visual information, plays a crucial role in human perception of the surrounding environment. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion architectures situated either at the top or bottom positions, rather than comprehensively considering multi-modal fusion at various hierarchical positions within the network. In this paper, we propose a novel model called self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, where the CA blocks are distributed at the top (TCA), middle (MCA) and bottom (BCA) of SCANet. These blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, outperforming existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
    摘要 人类对周围环境的识别受到不同modalities(如音频和视觉信息)的集成具有重要作用。现有研究已经在设计音视频演说分离模块方面做出了重要进步。然而,这些模块主要集中在网络的顶层或底层位置,而不是全面考虑多modalities在网络各级别位置的集成。在这篇论文中,我们提出了一种新的模型,即自我和交叉关注网络(SCANet),它利用关注机制来实现有效的音视频特征结合。SCANet包括两种关注块:自我关注(SA)和交叉关注(CA)块,其中CA块分布在SCANet的顶层(TCA)、中层(MCA)和底层(BCA)。这些块可以保持学习不同modalities特征,并允许从音视频特征中提取不同 semantics。我们对三个标准音视频分离标准(LRS2、LRS3和VoxCeleb2)进行了广泛的实验,并证明SCANet的效果比现有SOTA方法更好,同时保持相对的执行时间相对快。
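A cross-attention (CA) block of the kind described above can be sketched directly with PyTorch's built-in multi-head attention: visual tokens query audio tokens, followed by a residual connection and normalization. Token counts, dimensions, and the single-direction design are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Let visual tokens attend to audio tokens (one direction of a CA block)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, audio_tokens):
        fused, _ = self.attn(query=visual_tokens, key=audio_tokens, value=audio_tokens)
        return self.norm(visual_tokens + fused)          # residual connection + normalization

block = CrossAttentionBlock()
visual = torch.randn(2, 50, 256)                         # e.g. lip-region frame features
audio = torch.randn(2, 200, 256)                         # e.g. mixture spectrogram features
print(block(visual, audio).shape)                        # torch.Size([2, 50, 256])
```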

S2R: Exploring a Double-Win Transformer-Based Framework for Ideal and Blind Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.08142
  • repo_url: None
  • paper_authors: Minghao She, Wendong Mao, Huihong Shi, Zhongfeng Wang
  • for: 提高理想和盲 SR 任务的视觉效果(ideal and blind super-resolution)
  • methods: 提出了一种双赢框架(double-win framework),包括一个轻量级的基于Transformer的SR模型(S2R transformer)和一种新的由粗到细(coarse-to-fine)训练策略
  • results: 实验结果表明,所提出的S2R模型仅用578K参数便在理想SR条件下优于其他单图像SR模型;在盲模糊条件下,仅需10次梯度更新即可取得更好的视觉效果,收敛速度提升300倍,显著加速了真实场景中的迁移学习过程。
    Abstract Nowadays, deep learning based methods have demonstrated impressive performance on ideal super-resolution (SR) datasets, but most of these methods incur dramatically performance drops when directly applied in real-world SR reconstruction tasks with unpredictable blur kernels. To tackle this issue, blind SR methods are proposed to improve the visual results on random blur kernels, which causes unsatisfactory reconstruction effects on ideal low-resolution images similarly. In this paper, we propose a double-win framework for ideal and blind SR task, named S2R, including a light-weight transformer-based SR model (S2R transformer) and a novel coarse-to-fine training strategy, which can achieve excellent visual results on both ideal and random fuzzy conditions. On algorithm level, S2R transformer smartly combines some efficient and light-weight blocks to enhance the representation ability of extracted features with relatively low number of parameters. For training strategy, a coarse-level learning process is firstly performed to improve the generalization of the network with the help of a large-scale external dataset, and then, a fast fine-tune process is developed to transfer the pre-trained model to real-world SR tasks by mining the internal features of the image. Experimental results show that the proposed S2R outperforms other single-image SR models in ideal SR condition with only 578K parameters. Meanwhile, it can achieve better visual results than regular blind SR models in blind fuzzy conditions with only 10 gradient updates, which improve convergence speed by 300 times, significantly accelerating the transfer-learning process in real-world situations.
    摘要 现在,深度学习基本方法在理想的超分辨率(SR)数据集上已经表现出了惊人的表现,但大多数这些方法在实际的SR重建任务中直接应用时会导致表现下降,特别是在随机扭曲kernel的情况下。为解决这个问题,盲 SR 方法被提出来提高视觉效果,但这些方法在理想的低分辨率图像上也会导致不满足的重建效果。在这篇文章中,我们提出了一个双赢框架,名为 S2R,包括一个轻量级的 transformer 基本 SR 模型(S2R transformer)和一种新的 course-to-fine 训练策略,可以在理想和随机扭曲条件下实现出色的视觉效果。在算法层次,S2R transformer 智能地结合了一些高效和轻量级的块来增强提取的特征表示能力,同时减少参数的数量。在训练策略方面,我们首先在大规模的外部数据集上进行了粗级学习过程,以提高网络的通用性,然后,我们开发了一种快速的 fine-tune 过程,通过挖掘图像内部特征来转移预训练模型到实际SR任务中。实验结果显示,我们提出的 S2R 可以在理想SR条件下与只有 578K 参数的其他单图 SR 模型进行比较,同时在盲扭曲条件下,它可以在只有 10 梯度更新的情况下达到更好的视觉效果,提高了整体的转移学习过程的速度,从而实现了实际应用中的加速。

GPA-3D: Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds

  • paper_url: http://arxiv.org/abs/2308.08140
  • repo_url: https://github.com/liz66666/gpa3d
  • paper_authors: Ziyu Li, Jingming Guo, Tongtong Cao, Liu Bingbing, Wankou Yang
  • for: 提高LiDAR-based 3D检测的鲁棒性和可靠性在未看过的环境中
  • methods: 提出一种无监督领域自适应的 3D 检测框架,根据点云物体的几何结构为其分配可学习的原型(prototype),并通过原型对齐减少源域与目标域 BEV 特征的分布差异,从而实现跨域迁移
  • results: 在Waymo、nuScenes和KITTI等多个 benchmark上获得了比领先方法更好的适应性和性能
    Abstract LiDAR-based 3D detection has made great progress in recent years. However, the performance of 3D detectors is considerably limited when deployed in unseen environments, owing to the severe domain gap problem. Existing domain adaptive 3D detection methods do not adequately consider the problem of the distributional discrepancy in feature space, thereby hindering generalization of detectors across domains. In this work, we propose a novel unsupervised domain adaptive \textbf{3D} detection framework, namely \textbf{G}eometry-aware \textbf{P}rototype \textbf{A}lignment (\textbf{GPA-3D}), which explicitly leverages the intrinsic geometric relationship from point cloud objects to reduce the feature discrepancy, thus facilitating cross-domain transferring. Specifically, GPA-3D assigns a series of tailored and learnable prototypes to point cloud objects with distinct geometric structures. Each prototype aligns BEV (bird's-eye-view) features derived from corresponding point cloud objects on source and target domains, reducing the distributional discrepancy and achieving better adaptation. The evaluation results obtained on various benchmarks, including Waymo, nuScenes and KITTI, demonstrate the superiority of our GPA-3D over the state-of-the-art approaches for different adaptation scenarios. The MindSpore version code will be publicly available at \url{https://github.com/Liz66666/GPA3D}.
    摘要 基于 LiDAR 的 3D 检测技术近年来取得了很大进步。然而,由于严重的域差距(domain gap)问题,3D 检测器部署到未见过的环境时性能会明显受限。现有的域自适应 3D 检测方法没有充分考虑特征空间中的分布差异,从而阻碍了检测器在不同域之间的泛化。为此,我们提出了一种新的无监督域自适应 3D 检测框架,即几何感知原型对齐(GPA-3D)。GPA-3D 显式利用点云物体的内在几何关系来减少特征差异,从而促进跨域迁移。具体来说,GPA-3D 为具有不同几何结构的点云物体分配一系列定制且可学习的原型;每个原型将源域和目标域中对应点云物体的 BEV(鸟瞰)特征进行对齐,减少分布差异并实现更好的适应。在 Waymo、nuScenes 和 KITTI 等多个基准上的评估结果表明,GPA-3D 在不同的适应场景下均优于现有最先进方法。MindSpore 版本代码将在 \url{https://github.com/Liz66666/GPA3D} 公开。

View Consistent Purification for Accurate Cross-View Localization

  • paper_url: http://arxiv.org/abs/2308.08110
  • repo_url: https://github.com/ShanWang-Shan/PureACL-website
  • paper_authors: Shan Wang, Yanhao Zhang, Akhil Perincherry, Ankit Vora, Hongdong Li
  • for: 本研究提出了一种高精度自地localization方法,用于outdoor robotics,该方法利用了多个搭载在机器人上的摄像头和 readily accessible的卫星图像。
  • methods: 该方法检测并匹配地面视图与卫星视图中视图一致(view-consistent)的关键点及其深度特征,去除离地物体,并在两视图之间建立单应(homography)变换;此外,还利用相机内外参的空间嵌入来降低纯视觉匹配的歧义性。
  • results: 在 KITTI 和 Ford Multi-AV Seasonal 数据集上的大量实验表明,该方法能在户外环境中高效地完成自定位:横向与纵向的空间误差中位数低于 0.5 米,朝向误差中位数低于 2 度。
    Abstract This paper proposes a fine-grained self-localization method for outdoor robotics that utilizes a flexible number of onboard cameras and readily accessible satellite images. The proposed method addresses limitations in existing cross-view localization methods that struggle to handle noise sources such as moving objects and seasonal variations. It is the first sparse visual-only method that enhances perception in dynamic environments by detecting view-consistent key points and their corresponding deep features from ground and satellite views, while removing off-the-ground objects and establishing homography transformation between the two views. Moreover, the proposed method incorporates a spatial embedding approach that leverages camera intrinsic and extrinsic information to reduce the ambiguity of purely visual matching, leading to improved feature matching and overall pose estimation accuracy. The method exhibits strong generalization and is robust to environmental changes, requiring only geo-poses as ground truth. Extensive experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that our proposed method outperforms existing state-of-the-art methods, achieving median spatial accuracy errors below $0.5$ meters along the lateral and longitudinal directions, and a median orientation accuracy error below 2 degrees.
    摘要 本文提出了一种面向户外机器人的细粒度自定位方法,该方法使用数量灵活的机载摄像头和易于获取的卫星图像。该方法解决了现有跨视图定位方法难以应对移动物体、季节变化等噪声源的局限。它是首个仅依赖稀疏视觉信息的方法,通过在地面视图与卫星视图中检测视图一致的关键点及其深度特征、去除离地物体并在两视图之间建立单应变换,增强了动态环境下的感知能力。此外,该方法还引入了空间嵌入方法,利用相机的内参和外参信息来减少纯视觉匹配的歧义,从而提升特征匹配与整体位姿估计精度。该方法具有很强的泛化能力,对环境变化鲁棒,且只需地理位姿作为真值。实验表明,所提方法在 KITTI 和 Ford Multi-AV Seasonal 数据集上优于现有最先进方法,横向与纵向的空间误差中位数低于 0.5 米,朝向误差中位数低于 2 度。

Snapshot High Dynamic Range Imaging with a Polarization Camera

  • paper_url: http://arxiv.org/abs/2308.08094
  • repo_url: https://github.com/Intelligent-Sensing/polarization-hdr
  • paper_authors: Mingyang Xie, Matthew Chan, Christopher Metzler
  • for: 该论文旨在将普通的楔格相机转化为高性能HDR相机。
  • methods: 该方法在偏振相机前放置一个线性偏振片(linear polarizer),利用四个偏振方向同时获取四幅不同曝光的图像;随后通过一种抗离群且自校准的算法重建(单一偏振下的)HDR 图像。
  • results: 该方法通过实际的实验证明其效果。
    Abstract High dynamic range (HDR) images are important for a range of tasks, from navigation to consumer photography. Accordingly, a host of specialized HDR sensors have been developed, the most successful of which are based on capturing variable per-pixel exposures. In essence, these methods capture an entire exposure bracket sequence at once in a single shot. This paper presents a straightforward but highly effective approach for turning an off-the-shelf polarization camera into a high-performance HDR camera. By placing a linear polarizer in front of the polarization camera, we are able to simultaneously capture four images with varied exposures, which are determined by the orientation of the polarizer. We develop an outlier-robust and self-calibrating algorithm to reconstruct an HDR image (at a single polarity) from these measurements. Finally, we demonstrate the efficacy of our approach with extensive real-world experiments.
    摘要 高动态范围(HDR)图像在从导航到消费级摄影的各种任务中都非常重要。因此,人们开发了许多专门的 HDR 传感器,其中最成功的是基于逐像素可变曝光的方法,本质上是在单次拍摄中同时捕获整个曝光序列。在本文中,我们提出了一种简单而高效的方法,将现成的偏振相机转化为高性能 HDR 相机:在偏振相机前放置一个线性偏振片,即可同时捕获四幅曝光不同(由偏振方向决定)的图像。我们开发了一种抗离群且自校准的算法,从这些测量中重建(单一偏振下的)HDR 图像。最后,我们通过大量真实实验证明了该方法的有效性。
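To make the exposure-merging idea concrete, here is a minimal sketch (under simplifying assumptions, not the paper's algorithm) of a weighted HDR merge of four differently-exposed frames, where low confidence weights near saturation also suppress outliers; the exposure ratios are assumed to be known or calibrated.

```python
import numpy as np

def merge_hdr(images, exposures, eps=1e-6):
    """Weighted merge of differently exposed frames into a linear radiance map.

    images: list of float arrays in [0, 1] (e.g. the four polarizer-angle frames).
    exposures: relative effective exposure of each frame (assumed known).
    Pixels near 0 or 1 get low weight, which also down-weights outliers such as
    polarization-dependent reflections that violate the exposure model.
    """
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposures):
        w = np.clip(1.0 - np.abs(2.0 * img - 1.0), 0.0, 1.0) ** 2  # hat-shaped confidence
        num += w * img / t                                          # per-frame radiance estimate
        den += w
    return num / (den + eps)

# Toy example with synthetic data standing in for the four polarization channels.
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 4.0, size=(8, 8))             # "true" radiance
exposures = [0.25, 0.5, 1.0, 2.0]                       # assumed effective exposures
frames = [np.clip(scene * t, 0.0, 1.0) for t in exposures]
print(np.abs(merge_hdr(frames, exposures) - scene).mean())
```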

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

  • paper_url: http://arxiv.org/abs/2308.08089
  • repo_url: None
  • paper_authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
  • for: 这篇论文主要针对的是控制视频生成的精细控制问题,即在视频中实现细致的控制功能,以提高视频生成的灵活性和自然性。
  • methods: 该论文提出了一种基于扩散的视频生成模型,称为DragNUWA,它同时使用文本、图像和轨迹信息进行细致控制,从semantic、spatial和temporal三个方面提供细致控制。此外, DragNUWA还提出了三个方面的轨迹模型:Trajectory Sampler、Multiscale Fusion和Adaptive Training,以解决当前研究中对于开放领域轨迹控制的局限性。
  • results: 实验证明 DragNUWA 的效果,其能够在视频生成中实现细致的控制功能,并且在不同的材料和设定下都能够达到比较好的效果。
    Abstract Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, We propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is \url{https://www.microsoft.com/en-us/research/project/dragnuwa/}
    摘要 《可控视频生成技术在最近几年内得到了广泛关注。然而,两个主要限制仍然存在:首先,大多数现有的工作都是基于文本、图像或轨迹控制,导致精细控制视频内容的能力受到限制。其次,轨迹控制领域的研究仍然处于初 stages,大多数实验都是基于简单的 dataset like Human3.6M 进行的。这两个限制使得模型无法处理开放领域图像和复杂弯曲轨迹。在这篇论文中,我们提出了 DragNUWA,一种开放领域扩散基于视频生成模型。为了解决现有工作中的精细控制不足问题,我们同时引入文本、图像和轨迹信息,以提供从 semantic、空间和时间三个角度进行精细控制视频内容的能力。为了解决当前研究中对开放领域轨迹控制的限制,我们提出了轨迹模型,包括 Trajectory Sampler (TS) 以实现开放领域轨迹控制,Multiscale Fusion (MF) 以控制不同级别的轨迹,以及 Adaptive Training (AT) 策略以生成遵循轨迹的一致视频。我们的实验证明 DragNUWA 的效果,并示出其在精细控制视频生成方面的优越性。更多信息请访问我们的主页:

Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection

  • paper_url: http://arxiv.org/abs/2308.08088
  • repo_url: https://github.com/social-ai-studio/pro-cap
  • paper_authors: Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee, Jing Jiang
  • for: 本研究旨在提升仇恨表情包(hateful meme)检测任务的效果,通过以零样本方式更高效地利用冻结的预训练视觉-语言模型(PVLM),而非对其进行微调。
  • methods: 本研究提出了一种 probing-based captioning 方法,通过向冻结 PVLM 提问 hateful content-related 问题,并使用回答作为图像标题(我们称为 Pro-Cap),以便图像标题包含有关 hateful content 的信息。
  • results: 在三个标准测试集上,使用 Pro-Cap 方法的模型显示了良好的性能和泛化能力,证明了方法的有效性和通用性。
    Abstract Hateful meme detection is a challenging multimodal task that requires comprehension of both vision and language, as well as cross-modal interactions. Recent studies have tried to fine-tune pre-trained vision-language models (PVLMs) for this task. However, with increasing model sizes, it becomes important to leverage powerful PVLMs more efficiently, rather than simply fine-tuning them. Recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. This approach has shown good performance but suffers from non-informative image captions. Considering the two factors mentioned above, we propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner. Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions (which we call Pro-Cap), so that the captions contain information critical for hateful content detection. The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
    摘要 仇恨表情包检测是一项具有挑战性的多模态任务,需要同时理解视觉与语言并进行跨模态交互。近期研究尝试针对该任务微调预训练的视觉-语言模型(PVLM)。然而,随着模型规模不断增大,更重要的是高效地利用强大的 PVLM,而不是简单地对其微调。近来也有研究者尝试将表情包图像转换为文本描述,再提示语言模型进行预测;这种方法表现良好,但受限于信息量不足的图像描述。综合上述两点,我们提出一种基于探测(probing)的描述生成方法,以零样本视觉问答(VQA)的方式利用 PVLM:向冻结的 PVLM 提出与仇恨内容相关的问题,并将其回答作为图像描述(称为 Pro-Cap),使描述包含对仇恨内容检测至关重要的信息。在三个基准数据集上,使用 Pro-Cap 的模型表现良好,验证了该方法的有效性与泛化能力。
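A minimal sketch of the probing idea follows: ask a frozen VQA-capable vision-language model a set of content-related questions and join the answers into a Pro-Cap-style caption. The `vqa` callable and the question wording are placeholders/assumptions, not a specific library API or the released question set.

```python
from typing import Callable, List

# Probing questions in the spirit of the paper; the exact question set is an assumption.
PROBE_QUESTIONS: List[str] = [
    "What is shown in the image?",
    "Is there a person in the image? Who are they?",
    "What race or religion is referenced in the image?",
    "Is there any symbol or gesture in the image?",
]

def build_pro_cap(image, vqa: Callable[[object, str], str]) -> str:
    """Query a frozen vision-language model and join the answers into one caption.

    `vqa(image, question)` is a stand-in for any zero-shot VQA interface
    (e.g. a BLIP-2-style model); it is NOT a specific library call.
    """
    answers = [vqa(image, q).strip() for q in PROBE_QUESTIONS]
    return " ".join(a for a in answers if a)

# The resulting caption would then be concatenated with the meme text and fed to
# a text classifier, e.g.: classifier(f"{meme_text} [SEP] {build_pro_cap(img, vqa)}")
if __name__ == "__main__":
    fake_vqa = lambda img, q: "a cartoon frog" if "shown" in q else "none"
    print(build_pro_cap(None, fake_vqa))
```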

Deep Learning Framework for Spleen Volume Estimation from 2D Cross-sectional Views

  • paper_url: http://arxiv.org/abs/2308.08038
  • repo_url: None
  • paper_authors: Zhen Yuan, Esther Puyol-Anton, Haran Jogeesvaran, Baba Inusa, Andrew P. King
  • for: The paper aims to provide an automated method for measuring spleen volume from 2D cross-sectional segmentations obtained from ultrasound imaging, which can be used to assess splenomegaly and related clinical conditions.
  • methods: The proposed method uses a variational autoencoder-based framework to estimate spleen volume from single- or dual-view 2D spleen segmentations. The framework includes three volume estimation methods, which are evaluated against the clinical standard approach and a deep learning-based 2D-3D reconstruction approach.
  • results: The best model achieved mean relative volume accuracies of 86.62% and 92.58% for single- and dual-view segmentations, respectively, surpassing both the clinical standard approach and the comparative deep learning-based approach.
    Abstract Abnormal spleen enlargement (splenomegaly) is regarded as a clinical indicator for a range of conditions, including liver disease, cancer and blood diseases. While spleen length measured from ultrasound images is a commonly used surrogate for spleen size, spleen volume remains the gold standard metric for assessing splenomegaly and the severity of related clinical conditions. Computed tomography is the main imaging modality for measuring spleen volume, but it is less accessible in areas where there is a high prevalence of splenomegaly (e.g., the Global South). Our objective was to enable automated spleen volume measurement from 2D cross-sectional segmentations, which can be obtained from ultrasound imaging. In this study, we describe a variational autoencoder-based framework to measure spleen volume from single- or dual-view 2D spleen segmentations. We propose and evaluate three volume estimation methods within this framework. We also demonstrate how 95% confidence intervals of volume estimates can be produced to make our method more clinically useful. Our best model achieved mean relative volume accuracies of 86.62% and 92.58% for single- and dual-view segmentations, respectively, surpassing the performance of the clinical standard approach of linear regression using manual measurements and a comparative deep learning-based 2D-3D reconstruction-based approach. The proposed spleen volume estimation framework can be integrated into standard clinical workflows which currently use 2D ultrasound images to measure spleen length. To the best of our knowledge, this is the first work to achieve direct 3D spleen volume estimation from 2D spleen segmentations.
    摘要 异常的脾脏肿大(splenomegaly)被视为多种疾病的临床指标,包括肝病、癌症和血液疾病。从超声图像测得的脾脏长度是常用的脾脏大小替代指标,但脾脏体积仍是评估脾肿大及相关临床病情严重程度的金标准。计算机断层扫描(CT)是测量脾脏体积的主要成像方式,但在脾肿大高发地区(如全球南方)其可及性较差。我们的目标是基于可由超声成像获得的 2D 横断面分割,实现自动化的脾脏体积测量。本文提出了一种基于变分自编码器的框架,从单视或双视 2D 脾脏分割中估计脾脏体积,并在该框架内提出和评估了三种体积估计方法;我们还展示了如何给出体积估计的 95% 置信区间,以提高方法的临床实用性。最佳模型在单视和双视分割上分别达到 86.62% 和 92.58% 的平均相对体积准确率,优于使用手动测量进行线性回归的临床标准方法以及基于深度学习的 2D-3D 重建方法。所提出的脾脏体积估计框架可以集成到目前使用 2D 超声图像测量脾脏长度的标准临床工作流程中。据我们所知,这是首个直接从 2D 脾脏分割估计 3D 脾脏体积的工作。
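Below is a minimal sketch of one of the simplest variants consistent with the abstract: a VAE-style encoder over a 2D segmentation mask with a scalar volume head. The layer sizes, latent dimension, and single-view setup are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SegToVolume(nn.Module):
    """Encode a binary 2D spleen segmentation and regress a scalar volume."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mu = nn.Linear(64, latent_dim)       # mean of the latent distribution
        self.logvar = nn.Linear(64, latent_dim)   # log-variance (VAE-style)
        self.volume_head = nn.Linear(latent_dim, 1)

    def forward(self, seg: torch.Tensor):
        h = self.encoder(seg)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.volume_head(z).squeeze(-1), mu, logvar

# Training would combine an L1/MSE loss on volume with the usual KL term;
# a dual-view variant could simply concatenate the two encoded views.
model = SegToVolume()
vol, mu, logvar = model(torch.rand(4, 1, 128, 128))
print(vol.shape)  # torch.Size([4])
```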

Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction

  • paper_url: http://arxiv.org/abs/2308.08011
  • repo_url: None
  • paper_authors: Chaeyeon Chung, Yeojeong Park, Seunghwan Choi, Munkhsoyol Ganbat, Jaegul Choo
  • for: Video-to-video translation的计算效率提高
  • methods: 利用前一帧的特征近似当前帧的中间特征(shortcut),并引入 AdaBD 块对相邻帧特征进行自适应混合与变形
  • results: 相比原始模型,可节省 3.2-5.7 倍计算成本和 7.8-44 倍内存占用,同时保持相近的性能。
    Abstract Video-to-video translation aims to generate video frames of a target domain from an input video. Despite its usefulness, the existing networks require enormous computations, necessitating their model compression for wide use. While there exist compression methods that improve computational efficiency in various image/video tasks, a generally-applicable compression method for video-to-video translation has not been studied much. In response, we present Shortcut-V2V, a general-purpose compression framework for video-to-video translation. Shourcut-V2V avoids full inference for every neighboring video frame by approximating the intermediate features of a current frame from those of the previous frame. Moreover, in our framework, a newly-proposed block called AdaBD adaptively blends and deforms features of neighboring frames, which makes more accurate predictions of the intermediate features possible. We conduct quantitative and qualitative evaluations using well-known video-to-video translation models on various tasks to demonstrate the general applicability of our framework. The results show that Shourcut-V2V achieves comparable performance compared to the original video-to-video translation model while saving 3.2-5.7x computational cost and 7.8-44x memory at test time.
    摘要 视频到视频翻译旨在从输入视频生成目标域的视频帧。尽管其用途广泛,但现有网络需要巨大的计算量,因此必须进行模型压缩才能广泛应用。虽然在各类图像/视频任务中已有提升计算效率的压缩方法,但针对视频到视频翻译的通用压缩方法仍缺乏研究。为此,我们提出了 Shortcut-V2V,一个面向视频到视频翻译的通用压缩框架。Shortcut-V2V 通过利用前一帧的中间特征来近似当前帧的中间特征,避免对每个相邻视频帧进行完整推理;此外,框架中新提出的 AdaBD 块可对相邻帧特征进行自适应混合与变形,使中间特征的预测更加准确。我们在多个常见的视频到视频翻译模型和任务上进行了定量与定性评估,以展示该框架的通用性。结果表明,与原始视频到视频翻译模型相比,Shortcut-V2V 在测试时可达到相当的性能,同时节省 3.2-5.7 倍的计算成本和 7.8-44 倍的内存占用。
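As a rough illustration of reusing the previous frame's features, here is a minimal sketch in which a tiny network predicts a blending gate and a residual correction from the two frames' shallow features, so the expensive backbone only runs on keyframes. This simplifies the paper's AdaBD block, which additionally deforms features; all shapes and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class FeatureShortcut(nn.Module):
    """Approximate the current frame's deep features from the previous frame's."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                  nn.ReLU(), nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.residual = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, prev_deep: torch.Tensor, prev_shallow: torch.Tensor,
                curr_shallow: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([prev_shallow, curr_shallow], dim=1)
        alpha = self.gate(pair)                       # how much of the old features to keep
        return alpha * prev_deep + (1 - alpha) * self.residual(pair)

# Illustrative use: run the full generator on frame t, then approximate frames t+1..t+k.
deep_t = torch.randn(1, 128, 32, 32)
shallow_t, shallow_t1 = torch.randn(1, 128, 32, 32), torch.randn(1, 128, 32, 32)
print(FeatureShortcut()(deep_t, shallow_t, shallow_t1).shape)
```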

$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

  • paper_url: http://arxiv.org/abs/2308.07997
  • repo_url: None
  • paper_authors: Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen Liu, Mingkui Tan, Chuang Gan
  • for: 本文旨在解决零例视觉语言导航(ZS-VLN)问题,即一个实际困难的问题, Agent 学习从语言指令中导航,无需任何导航指令数据。
  • methods: 本文提出了一种基于基础模型的视觉语言能力($A^2$Nav),包括一个指令分析器和一个具有动作意识的导航策略。指令分析器利用大语言模型(如 GPT-3)的高级逻辑能力,将复杂的导航指令分解成一系列具有特定动作需求的对象导航子任务。
  • results: 实验表明,$A^2$Nav 在 R2R-Habitat 和 RxR-Habitat 数据集上达到了可观的 ZS-VLN 性能,甚至超过了指导学习方法。
    Abstract We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal distinct characteristics of each action demand. We use the learned navigation policy for executing sub-tasks sequentially to follow the navigation instruction. Extensive experiments show $A^2$Nav achieves promising ZS-VLN performance and even surpasses the supervised learning methods on R2R-Habitat and RxR-Habitat datasets.
    摘要 我们研究零shot视觉语言导航(ZS-VLN)任务,这是一个实际又具有挑战性的问题,在哪怕需要任务描述数据。通常,这些 instrucion 具有复杂的 grammatical structure 和各种动作描述(例如“继续前进”、“从而 Depart”)。正确地理解和执行这些动作需求是一个关键的问题,而且缺乏 annotated data 使其更加具有挑战性。注意,一个有良好教育的人可以轻松地理解路径 instrucion без需要任何特殊训练。在这篇论文中,我们提出一个动作感知的零shot VLN 方法($A^2$Nav),通过利用基础模型的见识和语言能力。具体来说,提案的方法包括一个 instruction parser 和一个动作感知的导航政策。 instruction parser 利用大型语言模型(例如 GPT-3)的进阶逻辑能力,将复杂的导航 instrucion 拆分为一系列动作特定的物品导航子任务。每个子任务需要代理人寻找物品并按照相应的动作需求 navigate 到特定的目标位置。为了完成这些子任务,我们从自由收集的动作特定 dataset 学习一个动作感知的导航政策。我们使用学习的导航政策来执行子任务依序,以实现following the navigation instruction。我们的实验结果显示,$A^2$Nav 可以实现出色的 ZS-VLN 性能,甚至超越了对 R2R-Habitat 和 RxR-Habitat dataset 的监督学习方法。
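A minimal sketch of the instruction-parsing stage follows: prompt a large language model to decompose a navigation instruction into ordered (action, landmark) sub-tasks. The `llm` callable, the prompt wording, and the action vocabulary are assumptions, not the paper's actual prompts or any specific API.

```python
import json
from typing import Callable, List, Tuple

PARSE_PROMPT = """Decompose the navigation instruction into an ordered list of
sub-tasks, each with an "action" (e.g. go_to, go_past, depart_from) and a
"landmark". Answer with a JSON list only.

Instruction: {instruction}
"""

def parse_instruction(instruction: str, llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """`llm(prompt) -> str` is a stand-in for any large-language-model interface."""
    reply = llm(PARSE_PROMPT.format(instruction=instruction))
    subtasks = json.loads(reply)
    return [(s["action"], s["landmark"]) for s in subtasks]

# Each sub-task would then be handed to an action-aware object-navigation policy:
#   for action, landmark in parse_instruction(instr, llm):
#       policy.navigate(action, landmark)
if __name__ == "__main__":
    fake_llm = lambda p: '[{"action": "go_past", "landmark": "sofa"}, {"action": "go_to", "landmark": "kitchen door"}]'
    print(parse_instruction("Walk past the sofa and stop at the kitchen door.", fake_llm))
```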

YODA: You Only Diffuse Areas. An Area-Masked Diffusion Approach For Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.07977
  • repo_url: None
  • paper_authors: Brian B. Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, Andreas Dengel
  • for: 该论文旨在提出一种基于部分扩散的单图像超分辨率(SISR)方法,即“You Only Diffuse Areas”(YODA)。
  • methods: 该方法基于由低分辨率图像和当前扩散时间步得到的注意力图,仅在选定的空间区域上进行扩散,从而更高效地转换为高分辨率输出。
  • results: 通过扩展 SR3 和 SRDiff 验证了 YODA 的性能提升,在人脸和通用 SR 的 PSNR、SSIM 和 LPIPS 指标上取得新的最优结果;此外,YODA 还能稳定训练过程,尤其是减少小批量训练时可能出现的颜色偏移。
    Abstract This work introduces "You Only Diffuse Areas" (YODA), a novel method for partial diffusion in Single-Image Super-Resolution (SISR). The core idea is to utilize diffusion selectively on spatial regions based on attention maps derived from the low-resolution image and the current time step in the diffusion process. This time-dependent targeting enables a more effective conversion to high-resolution outputs by focusing on areas that benefit the most from the iterative refinement process, i.e., detail-rich objects. We empirically validate YODA by extending leading diffusion-based SISR methods SR3 and SRDiff. Our experiments demonstrate new state-of-the-art performance gains in face and general SR across PSNR, SSIM, and LPIPS metrics. A notable finding is YODA's stabilization effect on training by reducing color shifts, especially when induced by small batch sizes, potentially contributing to resource-constrained scenarios. The proposed spatial and temporal adaptive diffusion mechanism opens promising research directions, including developing enhanced attention map extraction techniques and optimizing inference latency based on sparser diffusion.
    摘要 我们介绍了一种新的单图超解像方法“You Only Diffuse Areas”(YODA),它在单图超解像中实现了部分扩散。该方法基于低分辨率图像和当前扩散过程的注意力地图 selectively 实现了扩散。这种时间相关的目标设定,使得高分辨率输出更加有效地利用了迭代精度提高过程中的细节强项。我们经验 validate YODA 方法,并将其扩展到了领先的扩散基于 SISR 方法SR3和SRDiff。我们的实验表明,YODA 方法在 PSNR、SSIM 和 LPIPS 指标上均 achieve 新的状态泰施率表现。一个值得注意的发现是 YODA 方法在训练中的稳定化效果,尤其是在小批量引入时,可以降低颜色偏移,这可能对资源受限的场景产生贡献。The proposed spatial and temporal adaptive diffusion mechanism opens promising research directions, including developing enhanced attention map extraction techniques and optimizing inference latency based on sparser diffusion. 这种提出的空间和时间适应扩散机制,开启了许多有前途的研究方向,包括提高注意力地图提取技术和基于更加稀疏的扩散优化推理延迟。
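The following is a hedged sketch of applying a reverse-diffusion update only inside a time-dependent attention mask, keeping the remaining pixels tied to the low-resolution conditioning. The DDIM-style update is standard, but the way the mask threshold varies with the timestep and how untouched pixels are filled are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def masked_ddpm_step(x_t, eps_pred, attention_map, t, num_steps, alphas_cumprod, lr_upsampled):
    """One reverse-diffusion step restricted to high-attention regions.

    attention_map in [0, 1] marks detail-rich areas; the mask grows as t decreases,
    so late steps refine ever larger regions (an assumed implementation of the
    time dependence described in the abstract).
    """
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    # Deterministic DDIM-style estimate of x_{t-1} (eta = 0).
    x0_hat = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    x_prev = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps_pred

    threshold = 1.0 - (t + 1) / num_steps            # more pixels pass as t decreases
    mask = (attention_map >= threshold).float()
    return mask * x_prev + (1 - mask) * lr_upsampled  # non-diffused areas follow the LR input

if __name__ == "__main__":
    T = 50
    acp = torch.linspace(0.99, 0.01, T)
    x = torch.randn(1, 3, 8, 8)
    out = masked_ddpm_step(x, torch.randn_like(x), torch.rand(1, 1, 8, 8),
                           t=25, num_steps=T, alphas_cumprod=acp,
                           lr_upsampled=torch.zeros_like(x))
    print(out.shape)
```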

Boosting Cross-Quality Face Verification using Blind Face Restoration

  • paper_url: http://arxiv.org/abs/2308.07967
  • repo_url: None
  • paper_authors: Messaoud Bengherabi, Douaa Laib, Fella Souhila Lasnami, Ryma Boussaha
  • for: 本文旨在研究如何利用三种最先进的盲人脸修复(blind face restoration)技术,在极低质量图像条件下提升人脸验证系统的性能,同时保留有价值的身份信息。
  • methods: 本文评估了 GFP-GAN、GPEN 和 SGPN 三种盲人脸修复技术。
  • results: 实验结果表明,使用 GFP-GAN 技术可以大幅提高人脸识别系统的准确率,特别是在具有低质量图像的情况下。
    Abstract In recent years, various Blind Face Restoration (BFR) techniques were developed. These techniques transform low quality faces suffering from multiple degradations to more realistic and natural face images with high perceptual quality. However, it is crucial for the task of face verification to not only enhance the perceptual quality of the low quality images but also to improve the biometric-utility face quality metrics. Furthermore, preserving the valuable identity information is of great importance. In this paper, we investigate the impact of applying three state-of-the-art blind face restoration techniques namely, GFP-GAN, GPEN and SGPN on the performance of face verification system under very challenging environment characterized by very low quality images. Extensive experimental results on the recently proposed cross-quality LFW database using three state-of-the-art deep face recognition models demonstrate the effectiveness of GFP-GAN in boosting significantly the face verification accuracy.
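A minimal sketch of the evaluation pipeline implied by the abstract follows: restore the low-quality probe face, embed both images, and compare embeddings with cosine similarity. The `restore` and `embed` callables are placeholders for a blind face restoration model (e.g. GFP-GAN) and a face-recognition backbone, not real library APIs, and the threshold is illustrative.

```python
import numpy as np
from typing import Callable

def verify(img_lq, img_ref,
           restore: Callable[[np.ndarray], np.ndarray],
           embed: Callable[[np.ndarray], np.ndarray],
           threshold: float = 0.35) -> bool:
    """Cross-quality verification: restore the low-quality probe, then compare embeddings."""
    probe = restore(img_lq)                    # blind face restoration step
    e1, e2 = embed(probe), embed(img_ref)      # deep face embeddings
    cos = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-8))
    return cos >= threshold                    # threshold tuned on a validation set

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs; real use would plug in GFP-GAN-style
    # restoration and an ArcFace-style embedding network.
    rng = np.random.default_rng(0)
    identity = lambda x: x
    fake_embed = lambda x: x.reshape(-1)[:128]
    img = rng.random((16, 16, 3))
    print(verify(img, img, identity, fake_embed))
```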

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

  • paper_url: http://arxiv.org/abs/2308.07926
  • repo_url: https://github.com/qiuyu96/codef
  • paper_authors: Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen
  • for: 本文旨在提出一种新型的视频表示方法,称为内容形变场(Content Deformation Field, CoDeF),用于重建视频并支持时间一致的视频处理。
  • methods: 该方法通过精心设计的渲染管线联合优化规范内容场与时间形变场,并在优化中引入正则项,使规范内容场继承视频的语义(如物体形状);由此 CoDeF 可以自然地将图像算法提升(lift)用于视频处理。
  • results: 实验表明,CoDeF 能够将图像-to-图像翻译提升到视频-to-视频翻译,并且可以自动地进行关键点检测和跟踪。此外,CoDeF 可以提供更高的横向一致性,并且可以跟踪非RIGID 对象如水和雾。
    Abstract We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis.Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline.We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video.With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field.We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training.More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog.Project page can be found at https://qiuyu96.github.io/CoDeF/.
    摘要 我们介绍了一种新的视频表示方式:内容扭曲场(CoDeF),它包含一个核心内容场聚合整个视频中的静止内容,以及一个时间扭曲场记录每帧图像(即从核心内容场渲染后的图像)与时间轴上的变化。给定一个目标视频,这两个场合jointly 优化以重建它,通过我们特制的渲染管线。我们注意到了一些正则化,使得核心内容场继承视频中的 semantics(例如物体形状)。与此同时,我们还引入了一些正则化,使得核心内容场继承视频中的 semantics(例如物体形状)。这种设计使得CoDeF自然支持图像算法 для视频处理,即可以将图像算法应用于核心图像,并使用时间扭曲场将结果传播到整个视频中。我们实验表明,CoDeF可以将图像到图像翻译 lifted 到视频到视频翻译,并且可以使用图像算法来检测关键点,而不需要训练。此外,我们的提升策略只需要在一个图像上应用算法,从而实现了跨帧一致性更高的处理视频,甚至可以跟踪非RIGID的物体,如水和雾。项目页面可以在 找到。
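To illustrate the propagation idea, here is a minimal sketch: after an image algorithm is applied to the canonical image, each frame is reconstructed by sampling the edited canonical image at locations given by that frame's deformation field. Representing the deformation field as a dense sampling grid fed to `grid_sample` is a simplifying assumption standing in for the paper's learned fields.

```python
import torch
import torch.nn.functional as F

def propagate_edit(edited_canonical: torch.Tensor, deformation: torch.Tensor) -> torch.Tensor:
    """Warp an edited canonical image to one frame via its deformation field.

    edited_canonical: (1, C, H, W) canonical image after applying an image algorithm.
    deformation:      (1, H, W, 2) per-pixel sampling coordinates in [-1, 1]
                      (a dense stand-in for the learned temporal deformation field).
    """
    return F.grid_sample(edited_canonical, deformation,
                         mode="bilinear", align_corners=True)

if __name__ == "__main__":
    C, H, W = 3, 64, 64
    canonical = torch.rand(1, C, H, W)
    # Identity deformation: the frame samples the canonical image at its own locations.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    identity_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, H, W, 2), x before y
    frame = propagate_edit(canonical, identity_grid)
    print(torch.allclose(frame, canonical, atol=1e-5))
```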

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

  • paper_url: http://arxiv.org/abs/2308.07918
  • repo_url: https://github.com/chuhanxx/helping_hand_for_egocentric_videos
  • paper_authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman
  • for: 提高 egocentric 视频中的空间时间表示性能
  • methods: 使用对象感知解码器(object-aware decoder)进行训练:在有配对描述(caption)时,要求模型预测手部位置、物体位置及物体的语义标签,以增强模型的对象感知与视觉定位能力
  • results: 在零样本迁移测试中优于现有最先进方法,并在长时视频理解任务(如 Ego4D 的情景记忆)中表现出良好的表示能力。
    Abstract We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art -- even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detection as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as grounding the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
    摘要 我们提出一种对象感知解码器,用于提升第一人称视频的时空表示性能。其关键思想是在训练中增强对象感知:当存在配对描述时,要求模型预测手部位置、物体位置和物体的语义标签。在推理时,模型只需 RGB 帧作为输入,并能够跟踪和定位物体(尽管并未为此显式训练)。我们通过在多个视频-文本检索与分类基准上的零样本测试评估所学表示的强迁移能力,并将其作为长时视频理解任务(如 Ego4D 的情景记忆)的输入特征;在所有情形下,其性能均优于现有方法,甚至超过使用更大批量训练的网络。此外,我们还表明,以带噪的图像级检测结果作为伪标签进行训练,模型能够借助视频一致性给出更好的边界框,并将相关文本描述中的词语与图像内容对应(grounding)。总之,该模型可以直接替换第一人称视频模型,通过视觉-文本对应提升性能。

A Foundation LAnguage-Image model of the Retina (FLAIR): Encoding expert knowledge in text supervision

  • paper_url: http://arxiv.org/abs/2308.07898
  • repo_url: https://github.com/jusiro/flair
  • paper_authors: Julio Silva-Rodriguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, Ismail Ben Ayed
  • for: This paper develops a pre-trained vision-language model for universal retinal fundus image understanding, with the goal of improving the performance of domain-expert models in medical imaging.
  • methods: The paper uses a combination of open-access fundus imaging datasets and expert knowledge from clinical literature and community standards to pre-train and fine-tune a vision-language model called FLAIR. The model is adapted with a lightweight linear probe for zero-shot inference, and its performance is evaluated under various scenarios with domain shifts and unseen categories.
  • results: The paper reports that FLAIR outperforms fully-trained, dataset-focused models and more generalist, larger-scale image-language models in few-shot regimes, emphasizing the potential of embedding experts’ domain knowledge in medical imaging. Specifically, FLAIR achieves better performance in difficult scenarios with domain shifts or unseen categories, and outperforms more generalist models by a large margin.
    Abstract Foundation vision-language models are currently transforming computer vision, and are on the rise in medical imaging fueled by their very promising generalization capabilities. However, the initial attempts to transfer this new paradigm to medical imaging have shown less impressive performances than those observed in other domains, due to the significant domain shift and the complex, expert domain knowledge inherent to medical-imaging tasks. Motivated by the need for domain-expert foundation models, we present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. To this end, we compiled 37 open-access, mostly categorical fundus imaging datasets from various sources, with up to 97 different target conditions and 284,660 images. We integrate the expert's domain knowledge in the form of descriptive textual prompts, during both pre-training and zero-shot inference, enhancing the less-informative categorical supervision of the data. Such a textual expert's knowledge, which we compiled from the relevant clinical literature and community standards, describes the fine-grained features of the pathologies as well as the hierarchies and dependencies between them. We report comprehensive evaluations, which illustrate the benefit of integrating expert knowledge and the strong generalization capabilities of FLAIR under difficult scenarios with domain shifts or unseen categories. When adapted with a lightweight linear probe, FLAIR outperforms fully-trained, dataset-focused models, more so in the few-shot regimes. Interestingly, FLAIR outperforms by a large margin more generalist, larger-scale image-language models, which emphasizes the potential of embedding experts' domain knowledge and the limitations of generalist models in medical imaging.
    摘要 医学影像领域的基础视言模型目前正在改变计算机视觉领域,并在医学影像领域得到推动。然而,初始的尝试将这新的思维方式应用到医学影像领域表现不如其他领域的表现,这是因为医学影像领域的领域转移和专业知识的复杂性。为了解决这问题,我们提出了FLAIR,一个预训练的视言模型,用于通用的肉眼血管图像理解。为此,我们收集了37个开放访问的、主要是分类的肉眼影像Dataset,共计284,660张图像,并将专业知识 integrate到模型中,以增强数据的不具备信息的分类指导。这些文本描述了疾病的细腻特征以及疾病之间和层次结构的相互关系。我们对FLAIR进行了全面的评估,其中包括域转移和未经见过的类目的场景。我们发现,通过 integrate专业知识,FLAIR在域转移和少量学习场景下表现出了强大的泛化能力,并且在不同的域转移和未经见过的类目场景下,FLAIR可以与专门设计的Dataset-FOCUSED模型进行比较,并在这些场景下表现出优异的表现。此外,我们发现,当FLAIR被拓展为一个轻量级线性探针时,其表现更为出色,特别是在几何学学习场景下。这表明,将专业知识 embedding到模型中和通用模型的局限性在医学影像领域中具有潜在的优势。
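Below is a minimal sketch of CLIP-style zero-shot inference with expert text prompts: category descriptions are encoded once, then an image is classified by cosine similarity. The prompt wording and the `image_encoder`/`text_encoder` callables are illustrative assumptions standing in for FLAIR's trained networks and released prompt set.

```python
import torch
import torch.nn.functional as F
from typing import Callable, Dict

# Expert-knowledge prompts (wording here is illustrative, not the released prompt set).
PROMPTS: Dict[str, str] = {
    "no diabetic retinopathy": "no microaneurysms, hemorrhages or exudates",
    "mild diabetic retinopathy": "only a few microaneurysms are present",
    "glaucoma": "enlarged optic cup and increased cup-to-disc ratio",
}

@torch.no_grad()
def zero_shot_classify(image: torch.Tensor,
                       image_encoder: Callable[[torch.Tensor], torch.Tensor],
                       text_encoder: Callable[[str], torch.Tensor]) -> str:
    img_emb = F.normalize(image_encoder(image), dim=-1)            # (1, D)
    txt_emb = F.normalize(torch.stack([text_encoder(p) for p in PROMPTS.values()]), dim=-1)
    scores = img_emb @ txt_emb.T                                   # cosine similarities
    return list(PROMPTS.keys())[scores.argmax().item()]

if __name__ == "__main__":
    # Dummy encoders so the sketch runs; FLAIR would supply the trained ones.
    torch.manual_seed(0)
    fake_img_enc = lambda x: torch.randn(1, 512)
    fake_txt_enc = lambda s: torch.randn(512)
    print(zero_shot_classify(torch.rand(1, 3, 224, 224), fake_img_enc, fake_txt_enc))
```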

Memory-and-Anticipation Transformer for Online Action Understanding

  • paper_url: http://arxiv.org/abs/2308.07893
  • repo_url: https://github.com/echo0125/memory-and-anticipation-transformer
  • paper_authors: Jiahao Wang, Guo Chen, Yifei Huang, Limin Wang, Tong Lu
  • for: 这篇论文是为了提出一种基于记忆和预测的新方法,用于在线动作检测和预测任务。
  • methods: 该方法基于记忆和预测的思想,提出了一种名为Memory-and-Anticipation Transformer(MAT)的新方法,可以同时处理在线动作检测和预测任务。
  • results: 对四个具有挑战性的标准 benchmark(TVSeries、THUMOS’14、HDD、EPIC-Kitchens-100)进行测试,MAT模型在在线动作检测和预测任务中具有显著的优异性,与现有方法相比显著超越。
    Abstract Most existing forecasting systems are memory-based methods, which attempt to mimic human forecasting ability by employing various memory mechanisms and have progressed in temporal modeling for memory dependency. Nevertheless, an obvious weakness of this paradigm is that it can only model limited historical dependence and can not transcend the past. In this paper, we rethink the temporal dependence of event evolution and propose a novel memory-anticipation-based paradigm to model an entire temporal structure, including the past, present, and future. Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks. In addition, owing to the inherent superiority of MAT, it can process online action detection and anticipation tasks in a unified manner. The proposed MAT model is tested on four challenging benchmarks TVSeries, THUMOS'14, HDD, and EPIC-Kitchens-100, for online action detection and anticipation tasks, and it significantly outperforms all existing methods. Code is available at https://github.com/Echo0125/Memory-and-Anticipation-Transformer.
    摘要 现有的预测系统多数是记忆基本方法,尝试模拟人类预测能力 by 使用不同的记忆机制,并进步在时间模型中。然而,这个思维模型的明显弱点是它只能模型有限的历史依赖,无法突破过去。在这篇文章中,我们重新思考了事件演化的时间依赖,并提出了一个新的记忆预测基本方法,可以模型整个时间结构,包括过去、现在和未来。基于这个想法,我们提出了记忆预测变换器(MAT),一种记忆预测基本方法,用于线上动作检测和预测任务。此外,由于MAT的内在优势,可以在线上进行动作检测和预测任务的统一处理。我们在四个具有挑战性的参考标准(TVSeries、THUMOS'14、HDD和EPIC-Kitchens-100)上进行了MAT模型的评估,并与所有现有的方法进行比较。结果显示MAT模型在线上动作检测和预测任务上具有杰出的表现。代码可以在 获取。

The Challenge of Fetal Cardiac MRI Reconstruction Using Deep Learning

  • paper_url: http://arxiv.org/abs/2308.07885
  • repo_url: None
  • paper_authors: Denis Prokopenko, Kerstin Hammernik, Thomas Roberts, David F A Lloyd, Daniel Rueckert, Joseph V Hajnal
  • For: This paper explores the use of deep learning methods to improve the quality of non-gated kt-SENSE reconstruction for dynamic free-breathing fetal cardiac MRI.* Methods: The authors use supervised deep learning networks to reconstruct fully-sampled data from undersampled data, and consider various model architectures and training strategies for their application in a real clinical setup.* Results: The authors show that the best-performing models recover a detailed depiction of the maternal anatomy but underestimate the dynamic properties of the fetal heart, suggesting the need for more targeted training and evaluation methods for fetal heart applications.
    Abstract Dynamic free-breathing fetal cardiac MRI is one of the most challenging modalities, which requires high temporal and spatial resolution to depict rapid changes in a small fetal heart. The ability of deep learning methods to recover undersampled data could help to optimise the kt-SENSE acquisition strategy and improve non-gated kt-SENSE reconstruction quality. In this work, we explore supervised deep learning networks for reconstruction of kt-SENSE style acquired data using an extensive in vivo dataset. Having access to fully-sampled low-resolution multi-coil fetal cardiac MRI, we study the performance of the networks to recover fully-sampled data from undersampled data. We consider model architectures together with training strategies taking into account their application in the real clinical setup used to collect the dataset to enable networks to recover prospectively undersampled data. We explore a set of modifications to form a baseline performance evaluation for dynamic fetal cardiac MRI on real data. We systematically evaluate the models on coil-combined data to reveal the effect of the suggested changes to the architecture in the context of fetal heart properties. We show that the best-performers recover a detailed depiction of the maternal anatomy on a large scale, but the dynamic properties of the fetal heart are under-represented. Training directly on multi-coil data improves the performance of the models, allows their prospective application to undersampled data and makes them outperform CTFNet introduced for adult cardiac cine MRI. However, these models deliver similar qualitative performances recovering the maternal body very well but underestimating the dynamic properties of fetal heart. This dynamic feature of fast change of fetal heart that is highly localised suggests both more targeted training and evaluation methods might be needed for fetal heart application.
    摘要 干支持自由呼吸婴儿心脏MRI是最有挑战性的模式之一,需要高度的时间和空间分辨率来显示婴儿心脏中快速变化的小心脏。深度学习方法可以恢复扫描不足的数据,可以帮助优化kt-SENSE获取策略和非阻塞kt-SENSE重建质量。在这项工作中,我们使用了激活函数网络进行kt-SENSE样式数据重建,使用了大量的生物样本。由于我们拥有完整的低分辨率多极体婴儿心脏MRI数据,我们可以研究深度学习网络是否可以从扫描不足数据中恢复完整数据。我们考虑了模型架构和训练策略,以便在实际临床设置中收集数据时使用。我们系统地评估了模型在实际数据上的表现,并通过多极体数据组合来描述模型的改进效果。我们发现最佳表现者可以准确地重建大规模的 maternal anatomy,但是婴儿心脏的动态性尚未得到充分表现。通过直接在多极体数据上训练,模型可以预测扫描不足数据,并且在 adult cardiac cine MRI 中引入的 CTFNet 之上表现出色。然而,这些模型在Qualitative上具有类似表现,可以准确地重建 maternal body,但是婴儿心脏的动态特性尚未得到充分表现。这种婴儿心脏动态特性的快速变化和高度本地化表示,可能需要更Targeted的训练和评估方法来应用于婴儿心脏。

Advancements in Repetitive Action Counting: Joint-Based PoseRAC Model With Improved Performance

  • paper_url: http://arxiv.org/abs/2308.08632
  • repo_url: None
  • paper_authors: Haodong Chen, Ming C. Leu, Md Moniruzzaman, Zhaozheng Yin, Solmaz Hajmohammadi, Zhuoqing Chang
  • for: 这篇论文旨在提高重复动作计数(RepCount)的准确性和鲁棒性,以支持健身追踪和康复等应用。
  • methods: 该方法将关节角度与人体姿态关键点相结合,并在先前工作 [1] 的基础上,解决重复计数中的若干挑战,例如摄像头视角变化、过计数、漏计数、子动作难以区分、显著姿态识别不准确等。
  • results: 这篇论文在RepCount数据集上取得了比前方法更好的结果,其中MAE为0.211,OBO准确率为0.599。实验结果显示了方法的效iveness和可靠性。
    Abstract Repetitive counting (RepCount) is critical in various applications, such as fitness tracking and rehabilitation. Previous methods have relied on the estimation of red-green-and-blue (RGB) frames and body pose landmarks to identify the number of action repetitions, but these methods suffer from a number of issues, including the inability to stably handle changes in camera viewpoints, over-counting, under-counting, difficulty in distinguishing between sub-actions, inaccuracy in recognizing salient poses, etc. In this paper, based on the work done by [1], we integrate joint angles with body pose landmarks to address these challenges and achieve better results than the state-of-the-art RepCount methods, with a Mean Absolute Error (MAE) of 0.211 and an Off-By-One (OBO) counting accuracy of 0.599 on the RepCount data set [2]. Comprehensive experimental results demonstrate the effectiveness and robustness of our method.
    摘要 重复动作计数(RepCount)在健身追踪和康复等多种应用中至关重要。先前的方法依赖 RGB 帧和人体姿态关键点来估计动作重复次数,但存在诸多问题,包括难以稳定应对摄像头视角变化、过计数、漏计数、子动作难以区分、显著姿态识别不准确等。本文在 [1] 的基础上,将关节角度与人体姿态关键点相结合以应对上述挑战,在 RepCount 数据集 [2] 上取得优于现有方法的结果:MAE 为 0.211,OBO 计数准确率为 0.599。大量实验结果证明了该方法的有效性和鲁棒性。
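A minimal sketch of the joint-angle idea follows: compute the angle at a joint from three 2D pose landmarks (e.g. shoulder-elbow-wrist), then count a repetition each time the angle traverses two salient-pose thresholds. The landmark format and threshold values are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def joint_angle(a, b, c) -> float:
    """Angle at landmark b (degrees) formed by points a-b-c, e.g. shoulder-elbow-wrist."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def count_reps(angles, low=60.0, high=150.0) -> int:
    """Count a repetition each time the angle travels from below `low` to above `high`."""
    reps, flexed = 0, False
    for ang in angles:
        if ang < low:
            flexed = True
        elif ang > high and flexed:
            reps, flexed = reps + 1, False
    return reps

if __name__ == "__main__":
    print(joint_angle((0, 0), (1, 0), (1, 1)))            # 90.0 degrees
    print(count_reps([170, 50, 160, 40, 155, 45, 158]))   # 3 repetitions
```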

SEDA: Self-Ensembling ViT with Defensive Distillation and Adversarial Training for robust Chest X-rays Classification

  • paper_url: http://arxiv.org/abs/2308.07874
  • repo_url: https://github.com/razaimam45/seda
  • paper_authors: Raza Imam, Ibrahim Almakky, Salma Alrashdi, Baketah Alrashdi, Mohammad Yaqub
  • For: This paper aims to enhance the robustness of self-ensembling ViTs for the tuberculosis chest x-ray classification task, in order to improve the reliability of Deep Learning methods in medical settings.
  • Methods: The proposed method, SEDA, utilizes efficient CNN blocks to learn spatial features with various levels of abstraction, and leverages adversarial training in combination with defensive distillation for improved robustness against adversaries.
  • Results: Extensive experiments with the proposed architecture and training paradigm on a publicly available tuberculosis x-ray dataset show that SEDA matches SEViT's efficacy with a 70x lighter framework and improves robustness by +9%.
    Abstract Deep Learning methods have recently seen increased adoption in medical imaging applications. However, elevated vulnerabilities have been explored in recent Deep Learning solutions, which can hinder future adoption. Particularly, the vulnerability of Vision Transformer (ViT) to adversarial, privacy, and confidentiality attacks raise serious concerns about their reliability in medical settings. This work aims to enhance the robustness of self-ensembling ViTs for the tuberculosis chest x-ray classification task. We propose Self-Ensembling ViT with defensive Distillation and Adversarial training (SEDA). SEDA utilizes efficient CNN blocks to learn spatial features with various levels of abstraction from feature representations extracted from intermediate ViT blocks, that are largely unaffected by adversarial perturbations. Furthermore, SEDA leverages adversarial training in combination with defensive distillation for improved robustness against adversaries. Training using adversarial examples leads to better model generalizability and improves its ability to handle perturbations. Distillation using soft probabilities introduces uncertainty and variation into the output probabilities, making it more difficult for adversarial and privacy attacks. Extensive experiments performed with the proposed architecture and training paradigm on publicly available Tuberculosis x-ray dataset shows SOTA efficacy of SEDA compared to SEViT in terms of computational efficiency with 70x times lighter framework and enhanced robustness of +9%.
    摘要 SEDA uses efficient convolutional neural network (CNN) blocks to learn spatial features with different levels of abstraction from the feature representations extracted from intermediate ViT blocks. These features are less affected by adversarial perturbations. Additionally, SEDA uses adversarial training in combination with defensive distillation to improve the model's robustness against adversaries. Training with adversarial examples improves the model's ability to handle perturbations, while distillation using soft probabilities introduces uncertainty and variation into the output probabilities, making it more difficult for adversarial and privacy attacks.We evaluated the proposed architecture and training paradigm on a publicly available tuberculosis x-ray dataset and found that SEDA outperformed the existing SEViT method in terms of computational efficiency, with a 70 times lighter framework, and enhanced robustness, with a +9% improvement.
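As a rough illustration of combining defensive distillation with adversarial training, here is a minimal sketch: generate a one-step FGSM adversarial example and match the student's temperature-softened outputs to a teacher's on that example, plus a standard cross-entropy term on clean inputs. The temperature, epsilon, and loss weighting are assumptions, and the tiny stand-in classifiers are placeholders for SEDA's CNN blocks over ViT features.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=2 / 255):
    """One-step FGSM adversarial example (gradients w.r.t. the input only are used)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def seda_style_loss(student, teacher, x, y, T=3.0, alpha=0.5, eps=2 / 255):
    """Defensive distillation on adversarial inputs + standard CE on clean inputs."""
    x_adv = fgsm(student, x, y, eps)
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x_adv) / T, dim=1)
    distill = F.kl_div(F.log_softmax(student(x_adv) / T, dim=1),
                       soft_targets, reduction="batchmean") * T * T
    return alpha * distill + (1 - alpha) * F.cross_entropy(student(x), y)

if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
    teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
    x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 2, (4,))
    print(seda_style_loss(student, teacher, x, y).item())
```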

ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces

  • paper_url: http://arxiv.org/abs/2308.07868
  • repo_url: https://github.com/qianyiwu/objectsdf_plus
  • paper_authors: Qianyi Wu, Kaisiyuan Wang, Kejie Li, Jianmin Zheng, Jianfei Cai
  • for: This paper focuses on improving the performance of neural implicit surface-based methods for multi-view 3D reconstruction, with a particular emphasis on reconstructing individual objects within a scene.
  • methods: The proposed framework, called ObjectSDF++, utilizes an occlusion-aware object opacity rendering formulation and a novel regularization term for object distinction to improve the quality of object reconstruction.
  • results: The extensive experiments conducted in the paper demonstrate that ObjectSDF++ produces superior object reconstruction results and significantly improves the quality of scene reconstruction compared to the previous state-of-the-art method, ObjectSDF.
    Abstract In recent years, neural implicit surface reconstruction has emerged as a popular paradigm for multi-view 3D reconstruction. Unlike traditional multi-view stereo approaches, the neural implicit surface-based methods leverage neural networks to represent 3D scenes as signed distance functions (SDFs). However, they tend to disregard the reconstruction of individual objects within the scene, which limits their performance and practical applications. To address this issue, previous work ObjectSDF introduced a nice framework of object-composition neural implicit surfaces, which utilizes 2D instance masks to supervise individual object SDFs. In this paper, we propose a new framework called ObjectSDF++ to overcome the limitations of ObjectSDF. First, in contrast to ObjectSDF whose performance is primarily restricted by its converted semantic field, the core component of our model is an occlusion-aware object opacity rendering formulation that directly volume-renders object opacity to be supervised with instance masks. Second, we design a novel regularization term for object distinction, which can effectively mitigate the issue that ObjectSDF may result in unexpected reconstruction in invisible regions due to the lack of constraint to prevent collisions. Our extensive experiments demonstrate that our novel framework not only produces superior object reconstruction results but also significantly improves the quality of scene reconstruction. Code and more resources can be found in \url{https://qianyiwu.github.io/objectsdf++}
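To give a flavour of occlusion-aware object opacity rendering, here is a minimal sketch that maps per-object SDF samples along a ray to opacities and composites them front-to-back, so occluders naturally suppress objects behind them. The logistic density, fixed step size, and parameter values are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def sdf_to_alpha(sdf: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Map signed distances (negative inside) to per-sample opacity via a logistic density."""
    density = torch.sigmoid(-sdf / beta) / beta
    return 1.0 - torch.exp(-density * 0.01)            # assumes a fixed step size of 0.01

def render_object_opacity(object_sdfs: torch.Tensor) -> torch.Tensor:
    """Front-to-back compositing of per-object opacities along one ray.

    object_sdfs: (num_samples, num_objects) SDF values at the ray samples.
    Returns the accumulated opacity per object, which can be supervised with
    instance masks as described in the abstract.
    """
    alpha = sdf_to_alpha(object_sdfs)                   # (S, K)
    alpha_any = 1.0 - torch.prod(1.0 - alpha, dim=1)    # scene opacity at each sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha_any[:-1]]), dim=0)  # transmittance
    return (trans.unsqueeze(1) * alpha).sum(dim=0)      # (K,) per-object accumulated opacity

if __name__ == "__main__":
    s = torch.linspace(1.0, -1.0, 64).unsqueeze(1)      # ray entering object 0
    sdfs = torch.cat([s, s + 2.0], dim=1)               # object 1 stays far from the ray
    print(render_object_opacity(sdfs))                  # object 0 clearly visible, object 1 near 0
```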

StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.07863
  • repo_url: None
  • paper_authors: Zhizhong Wang, Lei Zhao, Wei Xing
  • for: 本研究旨在提出一种新的内容-风格(C-S)分离框架,用于实现风格传输。
  • methods: 该框架利用内存抽象和隐藏学习来实现C-S分离,并通过CLIP图像空间的自适应分离损失和样式重建优化来实现可控的C-S分离和风格传输。
  • results: 该框架超越了现有的最先进方法,可实现高质量的内容-风格(C-S)分离与风格迁移,并支持对分离程度与迁移效果的灵活权衡控制。
    Abstract Content and style (C-S) disentanglement is a fundamental problem and critical challenge of style transfer. Existing approaches based on explicit definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable nor easy to control, resulting in entangled representations and less satisfying results. In this paper, we propose a new C-S disentangled framework for style transfer without using previous assumptions. The key insight is to explicitly extract the content information and implicitly learn the complementary style information, yielding interpretable and controllable C-S disentanglement and style transfer. A simple yet effective CLIP-based style disentanglement loss coordinated with a style reconstruction prior is introduced to disentangle C-S in the CLIP image space. By further leveraging the powerful style removal and generative ability of diffusion models, our framework achieves superior results than state of the art and flexible C-S disentanglement and trade-off control. Our work provides new insights into the C-S disentanglement in style transfer and demonstrates the potential of diffusion models for learning well-disentangled C-S characteristics.
    摘要