cs.CV - 2023-07-20

Spinal nerve segmentation method and dataset construction in endoscopic surgical scenarios

  • paper_url: http://arxiv.org/abs/2307.10955
  • repo_url: https://github.com/zzzzzzpc/funet
  • paper_authors: Shaowu Peng, Pengcheng Zhao, Yongyu Ye, Junying Chen, Yunbing Chang, Xiaoqing Zheng
  • for: This work aims to provide a real-time segmentation method that helps surgeons avoid damaging spinal nerves during endoscopic surgery.
  • methods: A finely annotated segmentation dataset is constructed, and the Frame-Unet (FUnet) model built on it exploits inter-frame information and self-attention mechanisms to reach state-of-the-art performance.
  • results: Frame-Unet shows good generalization ability on a similar polyp endoscopy video dataset and performs well in real endoscopic surgery.
    Abstract Endoscopic surgery is currently an important treatment method in the field of spinal surgery and avoiding damage to the spinal nerves through video guidance is a key challenge. This paper presents the first real-time segmentation method for spinal nerves in endoscopic surgery, which provides crucial navigational information for surgeons. A finely annotated segmentation dataset of approximately 10,000 consecutive frames recorded during surgery is constructed for the first time for this field, addressing the problem of semantic segmentation. Based on this dataset, we propose FUnet (Frame-Unet), which achieves state-of-the-art performance by utilizing inter-frame information and self-attention mechanisms. We also conduct extended experiments on a similar polyp endoscopy video dataset and show that the model has good generalization ability with advantageous performance. The dataset and code of this work are presented at: https://github.com/zzzzzzpc/FUnet.

Soft-tissue Driven Craniomaxillofacial Surgical Planning

  • paper_url: http://arxiv.org/abs/2307.10954
  • repo_url: None
  • paper_authors: Xi Fang, Daeseung Kim, Xuanang Xu, Tianshu Kuang, Nathan Lampen, Jungwook Lee, Hannah H. Deng, Jaime Gateno, Michael A. K. Liebschner, James J. Xia, Pingkun Yan
  • for: Correction of facial deformities in craniomaxillofacial (CMF) surgery
  • methods: A soft-tissue driven framework that combines a bony planner network and a facial simulator network
  • results: Improved accuracy and efficacy of surgical planning compared to the conventional bone-driven approach
    Abstract In CMF surgery, the planning of bony movement to achieve a desired facial outcome is a challenging task. Current bone driven approaches focus on normalizing the bone with the expectation that the facial appearance will be corrected accordingly. However, due to the complex non-linear relationship between bony structure and facial soft-tissue, such bone-driven methods are insufficient to correct facial deformities. Despite efforts to simulate facial changes resulting from bony movement, surgical planning still relies on iterative revisions and educated guesses. To address these issues, we propose a soft-tissue driven framework that can automatically create and verify surgical plans. Our framework consists of a bony planner network that estimates the bony movements required to achieve the desired facial outcome and a facial simulator network that can simulate the possible facial changes resulting from the estimated bony movement plans. By combining these two models, we can verify and determine the final bony movement required for planning. The proposed framework was evaluated using a clinical dataset, and our experimental results demonstrate that the soft-tissue driven approach greatly improves the accuracy and efficacy of surgical planning when compared to the conventional bone-driven approach.
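
To make the plan-then-verify loop described in the abstract concrete, here is a minimal sketch that wires a bony planner to a facial simulator and scores the plan by how closely the simulated face matches the desired outcome. The network architectures, point-cloud shapes, and the simple point-wise error are illustrative assumptions, not the paper's actual models.

```python
import torch
import torch.nn as nn

class BonyPlanner(nn.Module):
    """Placeholder planner: estimates per-point bony movement from bone and desired face."""
    def __init__(self, d=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, d))
    def forward(self, bone, target_face):
        return self.mlp(torch.cat([bone, target_face], dim=-1))

class FacialSimulator(nn.Module):
    """Placeholder simulator: predicts the post-operative face from the planned movement."""
    def __init__(self, d=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, d))
    def forward(self, face, movement):
        return face + self.mlp(torch.cat([face, movement], dim=-1))

planner, simulator = BonyPlanner(), FacialSimulator()
bone = torch.randn(2, 1024, 3)          # pre-operative bony surface points
face = torch.randn(2, 1024, 3)          # pre-operative facial surface points
target_face = torch.randn(2, 1024, 3)   # desired facial outcome

plan = planner(bone, target_face)                 # 1) propose bony movement
simulated_face = simulator(face, plan)            # 2) simulate the facial change
error = (simulated_face - target_face).norm(dim=-1).mean()
print(error.item())                               # 3) verify the plan against the goal
```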

Improving Online Lane Graph Extraction by Object-Lane Clustering

  • paper_url: http://arxiv.org/abs/2307.10947
  • repo_url: https://github.com/ybarancan/object_lane
  • paper_authors: Yigit Baran Can, Alexander Liniger, Danda Pani Paudel, Luc Van Gool
  • for: Improving the accuracy of local scene understanding for autonomous driving
  • methods: Uses 3D object detection outputs to assign objects to centerlines and improve lane graph estimation
  • results: Substantially improves local lane graph estimation over prior state-of-the-art methods
    Abstract Autonomous driving requires accurate local scene understanding information. To this end, autonomous agents deploy object detection and online BEV lane graph extraction methods as a part of their perception stack. In this work, we propose an architecture and loss formulation to improve the accuracy of local lane graph estimates by using 3D object detection outputs. The proposed method learns to assign the objects to centerlines by considering the centerlines as cluster centers and the objects as data points to be assigned a probability distribution over the cluster centers. This training scheme ensures direct supervision on the relationship between lanes and objects, thus leading to better performance. The proposed method improves lane graph estimation substantially over state-of-the-art methods. The extensive ablations show that our method can achieve significant performance improvements by using the outputs of existing 3D object detection methods. Since our method uses the detection outputs rather than detection method intermediate representations, a single model of our method can use any detection method at test time.
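
As an illustration of the clustering formulation (centerlines as cluster centers, detected objects as data points assigned a probability distribution over the centers), the sketch below uses a simple squared-distance affinity and a cross-entropy loss; the feature dimensions, affinity choice, and loss are assumptions for illustration, not the paper's exact heads.

```python
import torch
import torch.nn.functional as F

def object_lane_assignment(object_feats, centerline_feats, gt_assignment):
    """Soft-assign detected objects to centerlines treated as cluster centers.

    object_feats:     (N_obj, D) embeddings of 3D detections
    centerline_feats: (N_lane, D) embeddings of estimated centerlines
    gt_assignment:    (N_obj,) ground-truth centerline index per object
    """
    dists = torch.cdist(object_feats, centerline_feats, p=2) ** 2  # (N_obj, N_lane)
    logits = -dists                                # closer centerline -> higher score
    probs = F.softmax(logits, dim=-1)              # distribution over cluster centers
    loss = F.cross_entropy(logits, gt_assignment)  # direct object-lane supervision
    return probs, loss

obj = torch.randn(8, 64)
lanes = torch.randn(5, 64)
gt = torch.randint(0, 5, (8,))
probs, loss = object_lane_assignment(obj, lanes, gt)
print(probs.shape, loss.item())
```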

OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured Traffic Scenarios

  • paper_url: http://arxiv.org/abs/2307.10934
  • repo_url: None
  • paper_authors: Aditya Nalgunda Ganesh, Dhruval Pobbathi Badrinath, Harshith Mohan Kumar, Priya SS, Surabhi Narayan
  • for: Vision-centric environment perception for autonomous navigation, addressing the limitations of disparity maps produced by self-supervised monocular depth estimation.
  • methods: Proposes OCTraN, a transformer architecture that uses iterative attention to convert 2D image features into 3D occupancy features, and uses convolution and transpose convolution to operate efficiently on spatial information.
  • results: Also develops a self-supervised training pipeline that generalizes the model to any scene by replacing LiDAR ground truth with pseudo-ground-truth labels obtained from boosted monocular depth estimation.
    Abstract Modern approaches for vision-centric environment perception for autonomous navigation make extensive use of self-supervised monocular depth estimation algorithms that output disparity maps. However, when this disparity map is projected onto 3D space, the errors in disparity are magnified, resulting in a depth estimation error that increases quadratically as the distance from the camera increases. Though Light Detection and Ranging (LiDAR) can solve this issue, it is expensive and not feasible for many applications. To address the challenge of accurate ranging with low-cost sensors, we propose, OCTraN, a transformer architecture that uses iterative-attention to convert 2D image features into 3D occupancy features and makes use of convolution and transpose convolution to efficiently operate on spatial information. We also develop a self-supervised training pipeline to generalize the model to any scene by eliminating the need for LiDAR ground truth by substituting it with pseudo-ground truth labels obtained from boosted monocular depth estimation.
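
The abstract's remark that disparity errors are magnified quadratically in depth follows from the standard relation between depth and disparity. With focal length $f$, baseline (or scale factor) $b$, disparity $d$, and depth $z$ (generic symbols, not taken from the paper):

```latex
z = \frac{f\,b}{d}
\quad\Longrightarrow\quad
\left|\frac{\partial z}{\partial d}\right| = \frac{f\,b}{d^{2}} = \frac{z^{2}}{f\,b}
\quad\Longrightarrow\quad
\Delta z \approx \frac{z^{2}}{f\,b}\,\Delta d .
```

A fixed disparity error therefore produces a depth error that grows with the square of the distance from the camera, which motivates predicting 3D occupancy directly.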

Modeling 3D cardiac contraction and relaxation with point cloud deformation networks

  • paper_url: http://arxiv.org/abs/2307.10927
  • repo_url: None
  • paper_authors: Marcel Beetz, Abhirup Banerjee, Vicente Grau
  • for: Proposes a point cloud deep learning method for modeling 3D cardiac contraction and relaxation across the cardiac cycle.
  • methods: Uses recent advances in point cloud-based deep learning in an encoder-decoder structure for efficient multi-scale feature learning directly on multi-class 3D point cloud representations of the cardiac anatomy.
  • results: Evaluated on over 10,000 cases from the UK Biobank study, PCD-Net accurately predicts 3D cardiac deformation, captures differences between normal subjects and myocardial infarction (MI) patients, and outperforms multiple clinical benchmarks for MI detection, MI prediction, and MI survival analysis.
    Abstract Global single-valued biomarkers of cardiac function typically used in clinical practice, such as ejection fraction, provide limited insight on the true 3D cardiac deformation process and hence, limit the understanding of both healthy and pathological cardiac mechanics. In this work, we propose the Point Cloud Deformation Network (PCD-Net) as a novel geometric deep learning approach to model 3D cardiac contraction and relaxation between the extreme ends of the cardiac cycle. It employs the recent advances in point cloud-based deep learning into an encoder-decoder structure, in order to enable efficient multi-scale feature learning directly on multi-class 3D point cloud representations of the cardiac anatomy. We evaluate our approach on a large dataset of over 10,000 cases from the UK Biobank study and find average Chamfer distances between the predicted and ground truth anatomies below the pixel resolution of the underlying image acquisition. Furthermore, we observe similar clinical metrics between predicted and ground truth populations and show that the PCD-Net can successfully capture subpopulation-specific differences between normal subjects and myocardial infarction (MI) patients. We then demonstrate that the learned 3D deformation patterns outperform multiple clinical benchmarks by 13% and 7% in terms of area under the receiver operating characteristic curve for the tasks of prevalent MI detection and incident MI prediction and by 7% in terms of Harrell's concordance index for MI survival analysis.
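
The evaluation relies on the Chamfer distance between predicted and ground-truth anatomies. Below is a minimal NumPy version of the symmetric Chamfer distance (averaging nearest-neighbour distances in both directions, which is one common convention; the paper may use a squared or otherwise normalized variant).

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (N, 3) and q (M, 3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise distances (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()          # nearest neighbours both ways

pred = np.random.rand(1000, 3)
gt = np.random.rand(1200, 3)
print(chamfer_distance(pred, gt))
```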

Confidence intervals for performance estimates in 3D medical image segmentation

  • paper_url: http://arxiv.org/abs/2307.10926
  • repo_url: https://github.com/rosanajurdi/SegVal_TMI
  • paper_authors: R. El Jurdi, G. Varoquaux, O. Colliot
  • for: Studies how performance estimates of medical image segmentation models should be reported, and how many test samples are needed to reach a given confidence interval width.
  • methods: Experiments use the nnU-Net framework, two datasets from the Medical Decathlon challenge, and two performance measures: the Dice accuracy and the Hausdorff distance.
  • results: Parametric confidence intervals are reasonable approximations of bootstrap estimates across test set sizes and performance spreads, and the test size needed for a given precision is often much lower than for classification; typically 100-200 test samples suffice when the spread is low, while more difficult segmentation tasks may require over 1000 samples.
    Abstract Medical segmentation models are evaluated empirically. As such an evaluation is based on a limited set of example images, it is unavoidably noisy. Beyond a mean performance measure, reporting confidence intervals is thus crucial. However, this is rarely done in medical image segmentation. The width of the confidence interval depends on the test set size and on the spread of the performance measure (its standard-deviation across of the test set). For classification, many test images are needed to avoid wide confidence intervals. Segmentation, however, has not been studied, and it differs by the amount of information brought by a given test image. In this paper, we study the typical confidence intervals in medical image segmentation. We carry experiments on 3D image segmentation using the standard nnU-net framework, two datasets from the Medical Decathlon challenge and two performance measures: the Dice accuracy and the Hausdorff distance. We show that the parametric confidence intervals are reasonable approximations of the bootstrap estimates for varying test set sizes and spread of the performance metric. Importantly, we show that the test size needed to achieve a given precision is often much lower than for classification tasks. Typically, a 1% wide confidence interval requires about 100-200 test samples when the spread is low (standard-deviation around 3%). More difficult segmentation tasks may lead to higher spreads and require over 1000 samples.
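
To illustrate the two interval estimates the paper compares, the sketch below computes a parametric (normal-approximation) 95% confidence interval and a percentile bootstrap interval for the mean Dice score of a test set. The Dice scores are simulated here, and the 1.96 factor and 10,000 bootstrap resamples are conventional choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dice = np.clip(rng.normal(loc=0.85, scale=0.03, size=150), 0, 1)  # simulated per-case Dice

# Parametric 95% CI: mean +/- 1.96 * std / sqrt(n)
n = len(dice)
half_width = 1.96 * dice.std(ddof=1) / np.sqrt(n)
param_ci = (dice.mean() - half_width, dice.mean() + half_width)

# Percentile bootstrap 95% CI
boot_means = np.array([rng.choice(dice, size=n, replace=True).mean() for _ in range(10_000)])
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print("parametric:", param_ci)
print("bootstrap :", boot_ci)
```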

Intrinsic Appearance Decomposition Using Point Cloud Representation

  • paper_url: http://arxiv.org/abs/2307.10924
  • repo_url: https://github.com/xyxingx/PoInt-Net
  • paper_authors: Xiaoyan Xing, Konrad Groh, Sezer Karaoglu, Theo Gevers
  • for: Intrinsic image decomposition, jointly predicting albedo, light source direction, and shading from point cloud data.
  • methods: Proposes a point cloud representation for intrinsic decomposition and introduces Point Intrinsic Net (PoInt-Net), which jointly predicts albedo, light source direction, and shading.
  • results: Across several datasets, PoInt-Net is accurate, efficient, and robust: it outperforms 2D-representation approaches on multiple metrics, trains on small-scale point clouds while running stably on point clouds of any scale, and generalizes reasonably to unseen objects and scenes despite training only on a single-object-level dataset.
    Abstract Intrinsic decomposition is to infer the albedo and shading from the image. Since it is a heavily ill-posed problem, previous methods rely on prior assumptions from 2D images, however, the exploration of the data representation itself is limited. The point cloud is known as a rich format of scene representation, which naturally aligns the geometric information and the color information of an image. Our proposed method, Point Intrinsic Net, in short, PoInt-Net, jointly predicts the albedo, light source direction, and shading, using point cloud representation. Experiments reveal the benefits of PoInt-Net, in terms of accuracy, it outperforms 2D representation approaches on multiple metrics across datasets; in terms of efficiency, it trains on small-scale point clouds and performs stably on any-scale point clouds; in terms of robustness, it only trains on single object level dataset, and demonstrates reasonable generalization ability for unseen objects and scenes.
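
Intrinsic decomposition assumes the observed colors factor into albedo and shading. A generic Lambertian reconstruction objective (not the paper's exact loss, and ignoring the light-direction branch) can be written per point as follows.

```python
import torch

def reconstruction_loss(colors, albedo, shading):
    """colors, albedo: (N, 3) per-point RGB; shading: (N, 1) per-point intensity."""
    recon = albedo * shading            # Lambertian image formation: I = A * S
    return torch.mean((recon - colors) ** 2)

colors = torch.rand(1024, 3)
albedo = torch.rand(1024, 3)
shading = torch.rand(1024, 1)
print(reconstruction_loss(colors, albedo, shading).item())
```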

Language-based Action Concept Spaces Improve Video Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2307.10922
  • repo_url: None
  • paper_authors: Kanchana Ranasinghe, Michael Ryoo
  • for: Learning highly transferable and robust video representations
  • methods: Language-tied self-supervised learning to adapt an image CLIP model to the video domain, with concept distillation and concept alignment objectives operating in an action concept space
  • results: Improved zero-shot and linear probing performance on three action recognition benchmarks
    Abstract Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space. Feature vectors of various action concepts extracted from a language encoder using relevant textual prompts construct this space. We introduce two train objectives, concept distillation and concept alignment, that retain generality of original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.

Revisiting Fine-Tuning Strategies for Self-supervised Medical Imaging Analysis

  • paper_url: http://arxiv.org/abs/2307.10915
  • repo_url: None
  • paper_authors: Muhammad Osama Khan, Yi Fang
  • for: Investigates whether end-to-end fine-tuning is really the best way to exploit self-supervised learning (SSL) pre-training in medical imaging, and which layers should be fine-tuned.
  • methods: Establishes strong contrastive and restorative SSL baselines and conducts an extensive fine-tuning analysis across multiple pre-training and fine-tuning datasets and fine-tuning dataset sizes.
  • results: Across four downstream tasks, fine-tuning intermediate layers is more effective than end-to-end fine-tuning, with the optimal layers differing between contrastive and restorative SSL; building on these findings, a simple yet effective method combines multiple SSL models for further gains in self-supervised medical imaging analysis.
    Abstract Despite the rapid progress in self-supervised learning (SSL), end-to-end fine-tuning still remains the dominant fine-tuning strategy for medical imaging analysis. However, it remains unclear whether this approach is truly optimal for effectively utilizing the pre-trained knowledge, especially considering the diverse categories of SSL that capture different types of features. In this paper, we first establish strong contrastive and restorative SSL baselines that outperform SOTA methods across four diverse downstream tasks. Building upon these strong baselines, we conduct an extensive fine-tuning analysis across multiple pre-training and fine-tuning datasets, as well as various fine-tuning dataset sizes. Contrary to the conventional wisdom of fine-tuning only the last few layers of a pre-trained network, we show that fine-tuning intermediate layers is more effective, with fine-tuning the second quarter (25-50%) of the network being optimal for contrastive SSL whereas fine-tuning the third quarter (50-75%) of the network being optimal for restorative SSL. Compared to the de-facto standard of end-to-end fine-tuning, our best fine-tuning strategy, which fine-tunes a shallower network consisting of the first three quarters (0-75%) of the pre-trained network, yields improvements of as much as 5.48%. Additionally, using these insights, we propose a simple yet effective method to leverage the complementary strengths of multiple SSL models, resulting in enhancements of up to 3.57% compared to using the best model alone. Hence, our fine-tuning strategies not only enhance the performance of individual SSL models, but also enable effective utilization of the complementary strengths offered by multiple SSL models, leading to significant improvements in self-supervised medical imaging analysis.
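
The partial fine-tuning strategy can be mimicked by freezing every parameter outside a chosen fraction of the backbone. The helper below is a generic sketch: layer position is approximated by parameter-list order and the backbone is a stand-in ResNet, not the paper's exact layer grouping or models.

```python
import torch.nn as nn
from torchvision.models import resnet50

def freeze_outside_fraction(model: nn.Module, start: float, end: float):
    """Keep only parameters whose depth-ordered index falls in [start, end) trainable."""
    params = list(model.parameters())
    n = len(params)
    for i, p in enumerate(params):
        frac = i / n
        p.requires_grad = start <= frac < end

model = resnet50()                            # stand-in backbone
freeze_outside_fraction(model, 0.25, 0.50)    # fine-tune only the second quarter
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```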

Diffusion Sampling with Momentum for Mitigating Divergence Artifacts

  • paper_url: http://arxiv.org/abs/2307.11118
  • repo_url: https://github.com/sWizad/momentum-diffusion
  • paper_authors: Suttisak Wizadwongsa, Worameth Chinchuthakun, Pramook Khungurn, Amit Raj, Supasorn Suwajanakorn
  • for: Accelerating diffusion model sampling, especially in the low-step regime where high-order solvers produce divergence artifacts.
  • methods: Attributes these artifacts to the small stability regions of high-order ODE/SDE solvers and proposes two techniques: incorporating Heavy Ball (HB) momentum into existing diffusion numerical methods (proven to have first-order convergence), and Generalized Heavy Ball (GHVB), a new high-order method with a variable trade-off between accuracy and artifact suppression.
  • results: Experiments show the techniques effectively reduce artifacts and improve image quality, surpassing state-of-the-art diffusion solvers on both pixel-based and latent-based diffusion models for low-step sampling.
    Abstract Despite the remarkable success of diffusion models in image generation, slow sampling remains a persistent issue. To accelerate the sampling process, prior studies have reformulated diffusion sampling as an ODE/SDE and introduced higher-order numerical methods. However, these methods often produce divergence artifacts, especially with a low number of sampling steps, which limits the achievable acceleration. In this paper, we investigate the potential causes of these artifacts and suggest that the small stability regions of these methods could be the principal cause. To address this issue, we propose two novel techniques. The first technique involves the incorporation of Heavy Ball (HB) momentum, a well-known technique for improving optimization, into existing diffusion numerical methods to expand their stability regions. We also prove that the resulting methods have first-order convergence. The second technique, called Generalized Heavy Ball (GHVB), constructs a new high-order method that offers a variable trade-off between accuracy and artifact suppression. Experimental results show that our techniques are highly effective in reducing artifacts and improving image quality, surpassing state-of-the-art diffusion solvers on both pixel-based and latent-based diffusion models for low-step sampling. Our research provides novel insights into the design of numerical methods for future diffusion work.
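
To show how Heavy Ball momentum can be folded into a diffusion solver, the sketch below adds a momentum buffer to a plain Euler update of a probability-flow ODE. The `score_fn`, sigma schedule, and damping value `beta` are placeholders for illustration and do not reproduce the paper's exact HB or GHVB formulations.

```python
import torch

def euler_hb_sampler(score_fn, x, sigmas, beta=0.8):
    """Euler ODE sampling with Heavy Ball momentum on the update direction.

    score_fn(x, sigma) -> estimated derivative dx/dsigma (placeholder model)
    sigmas: decreasing noise levels, e.g. torch.linspace(10.0, 0.01, 8)
    beta:   momentum coefficient (beta = 0 recovers plain Euler)
    """
    velocity = torch.zeros_like(x)
    for i in range(len(sigmas) - 1):
        d = score_fn(x, sigmas[i])                   # instantaneous direction
        velocity = beta * velocity + (1 - beta) * d  # Heavy Ball smoothing
        x = x + (sigmas[i + 1] - sigmas[i]) * velocity
    return x

score_fn = lambda x, sigma: -x / (sigma ** 2 + 1.0)  # dummy score function
x = torch.randn(1, 3, 8, 8)
out = euler_hb_sampler(score_fn, x, torch.linspace(10.0, 0.01, 8))
print(out.shape)
```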

WeakPolyp: You Only Look Bounding Box for Polyp Segmentation

  • paper_url: http://arxiv.org/abs/2307.10912
  • repo_url: https://github.com/weijun88/weakpolyp
  • paper_authors: Jun Wei, Yiwen Hu, Shuguang Cui, S. Kevin Zhou, Zhen Li
  • for: Proposes WeakPolyp, a polyp segmentation model learned purely from bounding box annotations, to reduce labeling cost.
  • methods: Uses a mask-to-box (M2B) transformation to mitigate the mismatch between coarse box labels and precise predictions, together with a scale consistency (SC) loss for dense supervision.
  • results: Experiments show the proposed WeakPolyp achieves performance comparable to a fully supervised model while requiring no mask annotations at all.
    Abstract Limited by expensive pixel-level labels, polyp segmentation models are plagued by data shortage and suffer from impaired generalization. In contrast, polyp bounding box annotations are much cheaper and more accessible. Thus, to reduce labeling cost, we propose to learn a weakly supervised polyp segmentation model (i.e., WeakPolyp) completely based on bounding box annotations. However, coarse bounding boxes contain too much noise. To avoid interference, we introduce the mask-to-box (M2B) transformation. By supervising the outer box mask of the prediction instead of the prediction itself, M2B greatly mitigates the mismatch between the coarse label and the precise prediction. But, M2B only provides sparse supervision, leading to non-unique predictions. Therefore, we further propose a scale consistency (SC) loss for dense supervision. By explicitly aligning predictions across the same image at different scales, the SC loss largely reduces the variation of predictions. Note that our WeakPolyp is a plug-and-play model, which can be easily ported to other appealing backbones. Besides, the proposed modules are only used during training, bringing no computation cost to inference. Extensive experiments demonstrate the effectiveness of our proposed WeakPolyp, which surprisingly achieves a comparable performance with a fully supervised model, requiring no mask annotations at all.
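
One way to read the mask-to-box idea of supervising the "outer box mask" is to collapse the predicted mask into its row-wise and column-wise maxima before comparing it to the box label, so only the predicted extent is penalized rather than every interior pixel. The sketch below follows that reading; the exact M2B transformation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def mask_to_box_loss(pred_mask, box_mask):
    """pred_mask: (B, 1, H, W) sigmoid probabilities; box_mask: (B, 1, H, W) binary box label."""
    pred_rows, box_rows = pred_mask.amax(dim=3), box_mask.amax(dim=3)  # (B, 1, H)
    pred_cols, box_cols = pred_mask.amax(dim=2), box_mask.amax(dim=2)  # (B, 1, W)
    # Supervise only the projected extent of the prediction
    return F.binary_cross_entropy(pred_rows, box_rows) + \
           F.binary_cross_entropy(pred_cols, box_cols)

pred = torch.rand(2, 1, 64, 64)
box = torch.zeros(2, 1, 64, 64)
box[:, :, 16:48, 20:50] = 1.0
print(mask_to_box_loss(pred, box).item())
```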

Variational Point Encoding Deformation for Dental Modeling

  • paper_url: http://arxiv.org/abs/2307.10895
  • repo_url: None
  • paper_authors: Johan Ziruo Ye, Thomas Ørkild, Peter Lempel Søndergaard, Søren Hauberg
  • for: Releases a new extensive dataset of tooth meshes to encourage further digital dentistry research and proposes Variational FoldingNet (VF-Net) for probabilistic learning of point cloud representations.
  • methods: VF-Net extends FoldingNet to enable probabilistic point cloud learning; it replaces explicit minimization of the Chamfer distance, which lacks a normalized distributional counterpart, with a suitable encoder, improving computational efficiency and simplifying the probabilistic extension.
  • results: Experiments show VF-Net outperforms existing models in dental scan reconstruction and extrapolation, and its latent representations are robust, demonstrating it as an effective and reliable method for point cloud reconstruction and analysis.
    Abstract Digital dentistry has made significant advancements in recent years, yet numerous challenges remain to be addressed. In this study, we release a new extensive dataset of tooth meshes to encourage further research. Additionally, we propose Variational FoldingNet (VF-Net), which extends FoldingNet to enable probabilistic learning of point cloud representations. A key challenge in existing latent variable models for point clouds is the lack of a 1-to-1 mapping between input points and output points. Instead, they must rely on optimizing Chamfer distances, a metric that does not have a normalized distributional counterpart, preventing its usage in probabilistic models. We demonstrate that explicit minimization of Chamfer distances can be replaced by a suitable encoder, which allows us to increase computational efficiency while simplifying the probabilistic extension. Our experimental findings present empirical evidence demonstrating the superior performance of VF-Net over existing models in terms of dental scan reconstruction and extrapolation. Additionally, our investigation highlights the robustness of VF-Net's latent representations. These results underscore the promising prospects of VF-Net as an effective and reliable method for point cloud reconstruction and analysis.

Human Motion Generation: A Survey

  • paper_url: http://arxiv.org/abs/2307.10894
  • repo_url: None
  • paper_authors: Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, Yizhou Wang
  • for: Provides a comprehensive literature review of human motion generation, giving researchers a broad overview of this rapidly evolving field and encouraging new research directions.
  • methods: Surveys representative methods for three mainstream conditional sub-tasks: text-conditioned, audio-conditioned, and scene-conditioned human motion generation, along with common datasets and evaluation metrics.
  • results: Finds that significant progress has been made in recent years, while open challenges remain due to the intricate nature of human motion and its implicit relationship with conditional signals.
    Abstract Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications. Substantial progress has been made recently in motion data collection technologies and generation methods, laying the foundation for increasing interest in human motion generation. Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts. While significant advancements have been made in recent years, the task continues to pose challenges due to the intricate nature of human motion and its implicit relationship with conditional signals. In this survey, we present a comprehensive literature review of human motion generation, which, to the best of our knowledge, is the first of its kind in this field. We begin by introducing the background of human motion and generative models, followed by an examination of representative methods for three mainstream sub-tasks: text-conditioned, audio-conditioned, and scene-conditioned human motion generation. Additionally, we provide an overview of common datasets and evaluation metrics. Lastly, we discuss open problems and outline potential future research directions. We hope that this survey could provide the community with a comprehensive glimpse of this rapidly evolving field and inspire novel ideas that address the outstanding challenges.

Risk-optimized Outlier Removal for Robust Point Cloud Classification

  • paper_url: http://arxiv.org/abs/2307.10875
  • repo_url: None
  • paper_authors: Xinke Li, Junchi Lu
  • for: Improving the reliability and security of point cloud deep models against intentional or naturally occurring point cloud noise.
  • methods: Proposes PointCVaR, an outlier removal method that lets standard-trained models eliminate additional outliers and restore the data; it attributes a risk score to each point based on its influence on the model output and optimizes the filtering of high-risk points using Conditional Value at Risk (CVaR) as the objective.
  • results: Without any additional training, PointCVaR achieves excellent results in various removal-and-classification experiments on point clouds corrupted by random, adversarial, and backdoor trigger noise, reaching 87% accuracy in defense against the backdoor attack by removing triggers.
    Abstract The popularity of point cloud deep models for safety-critical purposes has increased, but the reliability and security of these models can be compromised by intentional or naturally occurring point cloud noise. To combat this issue, we present a novel point cloud outlier removal method called PointCVaR, which empowers standard-trained models to eliminate additional outliers and restore the data. Our approach begins by conducting attribution analysis to determine the influence of each point on the model output, which we refer to as point risk. We then optimize the process of filtering high-risk points using Conditional Value at Risk (CVaR) as the objective. The rationale for this approach is based on the observation that noise points in point clouds tend to cluster in the tail of the risk distribution, with a low frequency but a high level of risk, resulting in significant interference with classification results. Despite requiring no additional training effort, our method produces exceptional results in various removal-and-classification experiments for noisy point clouds, which are corrupted by random noise, adversarial noise, and backdoor trigger noise. Impressively, it achieves 87% accuracy in defense against the backdoor attack by removing triggers. Overall, the proposed PointCVaR effectively eliminates noise points and enhances point cloud classification, making it a promising plug-in module for various models in different scenarios.
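
Conditional Value at Risk over per-point risk scores is simply the mean risk in the upper tail of the distribution. The sketch below illustrates CVaR and a tail-based filtering step with stand-in attribution scores; it is not the paper's optimization procedure, just the underlying quantity.

```python
import numpy as np

def cvar(risk, alpha=0.95):
    """Mean risk of the worst (1 - alpha) fraction of points (Conditional Value at Risk)."""
    var = np.quantile(risk, alpha)   # Value at Risk threshold
    return risk[risk >= var].mean()

def filter_high_risk_points(points, risk, alpha=0.95):
    """Drop the points whose risk falls in the CVaR tail."""
    keep = risk < np.quantile(risk, alpha)
    return points[keep]

points = np.random.rand(2048, 3)
risk = np.random.exponential(scale=1.0, size=2048)  # stand-in attribution scores
print("CVaR of risk:", cvar(risk))
print("points kept :", filter_high_risk_points(points, risk).shape[0])
```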

Conservative Estimation of Perception Relevance of Dynamic Objects for Safe Trajectories in Automotive Scenarios

  • paper_url: http://arxiv.org/abs/2307.10873
  • repo_url: None
  • paper_authors: Ken Mori, Kai Storms, Steven Peters
  • for: Proposes a methodology for specifying which dynamic objects are relevant for perception, addressing the challenge of defining clear testing requirements for automated driving.
  • methods: Decomposes the use case (collision safety in the highway domain) into functional scenarios and formalizes the possible actions of the ego vehicle and other dynamic objects as equations, constrained by traffic rules, to derive relevance criteria.
  • results: Yields a conservative estimation of which dynamic objects are relevant for perception and must be considered for a complete evaluation, providing requirements applicable to offline testing and validation of perception components; a visualization on examples from the highD dataset shows the plausibility of the results.
    Abstract Having efficient testing strategies is a core challenge that needs to be overcome for the release of automated driving. This necessitates clear requirements as well as suitable methods for testing. In this work, the requirements for perception modules are considered with respect to relevance. The concept of relevance currently remains insufficiently defined and specified. In this paper, we propose a novel methodology to overcome this challenge by exemplary application to collision safety in the highway domain. Using this general system and use case specification, a corresponding concept for relevance is derived. Irrelevant objects are thus defined as objects which do not limit the set of safe actions available to the ego vehicle under consideration of all uncertainties. As an initial step, the use case is decomposed into functional scenarios with respect to collision relevance. For each functional scenario, possible actions of both the ego vehicle and any other dynamic object are formalized as equations. This set of possible actions is constrained by traffic rules, yielding relevance criteria. As a result, we present a conservative estimation which dynamic objects are relevant for perception and need to be considered for a complete evaluation. The estimation provides requirements which are applicable for offline testing and validation of perception components. A visualization is presented for examples from the highD dataset, showing the plausibility of the results. Finally, a possibility for a future validation of the presented relevance concept is outlined.

BlendFace: Re-designing Identity Encoders for Face-Swapping

  • paper_url: http://arxiv.org/abs/2307.10854
  • repo_url: https://github.com/mapooon/blendface
  • paper_authors: Kaede Shiohara, Xingchao Yang, Takafumi Taketomi
  • for: Identity encoding for face-swapping that minimizes the attribute biases inherited from face recognition pre-training.
  • methods: Proposes BlendFace, a novel identity encoder trained on blended images whose attributes (e.g., hairstyles) are replaced with those of another person, which mitigates inter-personal attribute biases; BlendFace feeds disentangled identity features into generators and guides them as an identity loss function.
  • results: Experiments show BlendFace improves identity-attribute disentanglement in face-swapping models while maintaining quantitative performance comparable to previous methods.
    Abstract The great advancements of generative adversarial networks and face recognition models in computer vision have made it possible to swap identities on images from single sources. Although many studies seem to have proposed almost satisfactory solutions, we notice previous methods still suffer from an identity-attribute entanglement that causes undesired attribute swapping, because widely used identity encoders, e.g., ArcFace, have some crucial attribute biases owing to their pretraining on face recognition tasks. To address this issue, we design BlendFace, a novel identity encoder for face-swapping. The key idea behind BlendFace is that training face recognition models on blended images, whose attributes are replaced with those of another person, mitigates inter-personal biases such as hairstyles. BlendFace feeds disentangled identity features into generators and guides generators properly as an identity loss function. Extensive experiments demonstrate that BlendFace improves the identity-attribute disentanglement in face-swapping models, maintaining a comparable quantitative performance to previous methods.

Exploring Effective Priors and Efficient Models for Weakly-Supervised Change Detection

  • paper_url: http://arxiv.org/abs/2307.10853
  • repo_url: https://github.com/zhenghuizhao/transwcd
  • paper_authors: Zhenghui Zhao, Lixiang Ru, Chen Wu
  • for: Improving weakly-supervised change detection (WSCD), in particular addressing change missing and fabricating, i.e., the inconsistency between image-level annotations and pixel-level predictions.
  • methods: Proposes two components: a Dilated Prior (DP) decoder that decodes samples with a changed image-level label while assigning an all-unchanged pixel-level label to unchanged samples, and a Label Gated (LG) constraint derived from the correspondence between changed representations and image-level labels, which penalizes the model when it mispredicts the change status.
  • results: Integrated into the transformer-based TransWCD model to form TransWCD-DL, these components yield +6.33% and +9.55% F1 score improvements over the state of the art on the WHU-CD dataset, with some metrics even exceeding several fully-supervised change detection (FSCD) competitors.
    Abstract Weakly-supervised change detection (WSCD) aims to detect pixel-level changes with only image-level annotations. Owing to its label efficiency, WSCD is drawing increasing attention recently. However, current WSCD methods often encounter the challenge of change missing and fabricating, i.e., the inconsistency between image-level annotations and pixel-level predictions. Specifically, change missing refer to the situation that the WSCD model fails to predict any changed pixels, even though the image-level label indicates changed, and vice versa for change fabricating. To address this challenge, in this work, we leverage global-scale and local-scale priors in WSCD and propose two components: a Dilated Prior (DP) decoder and a Label Gated (LG) constraint. The DP decoder decodes samples with the changed image-level label, skips samples with the unchanged label, and replaces them with an all-unchanged pixel-level label. The LG constraint is derived from the correspondence between changed representations and image-level labels, penalizing the model when it mispredicts the change status. Additionally, we develop TransWCD, a simple yet powerful transformer-based model, showcasing the potential of weakly-supervised learning in change detection. By integrating the DP decoder and LG constraint into TransWCD, we form TransWCD-DL. Our proposed TransWCD and TransWCD-DL achieve significant +6.33% and +9.55% F1 score improvements over the state-of-the-art methods on the WHU-CD dataset, respectively. Some performance metrics even exceed several fully-supervised change detection (FSCD) competitors. Code will be available at https://github.com/zhenghuizhao/TransWCD.

Self-paced Weight Consolidation for Continual Learning

  • paper_url: http://arxiv.org/abs/2307.10845
  • repo_url: https://github.com/congwei45/spWC
  • paper_authors: Wei Cong, Yang Cong, Gan Sun, Yuyang Liu, Jiahua Dong
  • for: Proposes a robust continual learning framework that improves both performance and efficiency in sequential task learning.
  • methods: A self-paced Weight Consolidation (spWC) framework that evaluates the discriminative contributions of previous tasks via a key performance indicator (accuracy), sorts past tasks from "difficult" to "easy", and selectively consolidates knowledge from the more difficult ones, with model parameters and priority weights updated by an alternative convex search.
  • results: Experiments on several public benchmark datasets show the plug-and-play spWC framework effectively improves performance compared with other popular continual learning algorithms (e.g., EWC, MAS, and RCIL) at lower computational cost.
    Abstract Continual learning algorithms which keep the parameters of new tasks close to that of previous tasks, are popular in preventing catastrophic forgetting in sequential task learning settings. However, 1) the performance for the new continual learner will be degraded without distinguishing the contributions of previously learned tasks; 2) the computational cost will be greatly increased with the number of tasks, since most existing algorithms need to regularize all previous tasks when learning new tasks. To address the above challenges, we propose a self-paced Weight Consolidation (spWC) framework to attain robust continual learning via evaluating the discriminative contributions of previous tasks. To be specific, we develop a self-paced regularization to reflect the priorities of past tasks via measuring difficulty based on key performance indicator (i.e., accuracy). When encountering a new task, all previous tasks are sorted from "difficult" to "easy" based on the priorities. Then the parameters of the new continual learner will be learned via selectively maintaining the knowledge amongst more difficult past tasks, which could well overcome catastrophic forgetting with less computational cost. We adopt an alternative convex search to iteratively update the model parameters and priority weights in the bi-convex formulation. The proposed spWC framework is plug-and-play, which is applicable to most continual learning algorithms (e.g., EWC, MAS and RCIL) in different directions (e.g., classification and segmentation). Experimental results on several public benchmark datasets demonstrate that our proposed framework can effectively improve performance when compared with other popular continual learning algorithms.

Global Precipitation Nowcasting of Integrated Multi-satellitE Retrievals for GPM: A U-Net Convolutional LSTM Architecture

  • paper_url: http://arxiv.org/abs/2307.10843
  • repo_url: https://github.com/reyhaneh-92/genesis_nowcast
  • paper_authors: Reyhaneh Rahimi, Ardeshir Ebtehaj, Ali Behrangi, Jackson Tan
  • for: A deep learning architecture for near-global precipitation nowcasting every 30 minutes with a 4-hour lead time.
  • methods: The architecture fuses a U-Net with a convolutional long short-term memory (LSTM) network and is trained on IMERG data plus a few key precipitation drivers from the Global Forecast System (GFS); the impact of different training losses (mean-squared error for regression, focal loss for classification) on nowcast quality is studied.
  • results: The regression network captures light precipitation (below 1.6 mm/hr) well, while the classification network outperforms it for precipitation extremes (>8 mm/hr) in terms of the critical success index (CSI) and produces class probabilities closer to IMERG (Wasserstein distance); including the physical variables improves nowcasting, especially at longer lead times, and a fractions skill score (FSS) analysis shows the nowcasts remain skillful (FSS > 0.5) at 10 km resolution versus 50 km for GFS.
    Abstract This paper presents a deep learning architecture for nowcasting of precipitation almost globally every 30 min with a 4-hour lead time. The architecture fuses a U-Net and a convolutional long short-term memory (LSTM) neural network and is trained using data from the Integrated MultisatellitE Retrievals for GPM (IMERG) and a few key precipitation drivers from the Global Forecast System (GFS). The impacts of different training loss functions, including the mean-squared error (regression) and the focal-loss (classification), on the quality of precipitation nowcasts are studied. The results indicate that the regression network performs well in capturing light precipitation (below 1.6 mm/hr), but the classification network can outperform the regression network for nowcasting of precipitation extremes (>8 mm/hr), in terms of the critical success index (CSI).. Using the Wasserstein distance, it is shown that the predicted precipitation by the classification network has a closer class probability distribution to the IMERG than the regression network. It is uncovered that the inclusion of the physical variables can improve precipitation nowcasting, especially at longer lead times in both networks. Taking IMERG as a relative reference, a multi-scale analysis in terms of fractions skill score (FSS), shows that the nowcasting machine remains skillful (FSS > 0.5) at the resolution of 10 km compared to 50 km for GFS. For precipitation rates greater than 4~mm/hr, only the classification network remains FSS-skillful on scales greater than 50 km within a 2-hour lead time.
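
The fractions skill score used in the multi-scale analysis compares neighbourhood rain fractions between forecast and observation. Below is a compact generic implementation; the threshold, window size, and box-filter neighbourhood are illustrative choices rather than the paper's exact evaluation settings.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fractions_skill_score(forecast, observed, threshold=4.0, window=5):
    """FSS of two precipitation fields (mm/hr) at a rain-rate threshold and window size."""
    f_frac = uniform_filter((forecast > threshold).astype(float), size=window)
    o_frac = uniform_filter((observed > threshold).astype(float), size=window)
    mse = np.mean((f_frac - o_frac) ** 2)
    mse_ref = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else 1.0

forecast = np.random.gamma(shape=0.5, scale=3.0, size=(256, 256))
observed = np.random.gamma(shape=0.5, scale=3.0, size=(256, 256))
print(fractions_skill_score(forecast, observed))
```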

Label Calibration for Semantic Segmentation Under Domain Shift

  • paper_url: http://arxiv.org/abs/2307.10842
  • repo_url: None
  • paper_authors: Ondrej Bohdal, Da Li, Timothy Hospedales
  • for: Improving the performance of pre-trained semantic segmentation models on data from a new (shifted) domain.
  • methods: Adapts a pre-trained model to unlabelled target-domain data by computing soft-label prototypes under the domain shift and predicting according to the prototype closest to the vector of predicted class probabilities; the adaptation is fast and comes almost for free in terms of computational resources.
  • results: The procedure yields considerable performance improvements, demonstrated on the highly practical synthetic-to-real semantic segmentation problem.
    Abstract Performance of a pre-trained semantic segmentation model is likely to substantially decrease on data from a new domain. We show a pre-trained model can be adapted to unlabelled target domain data by calculating soft-label prototypes under the domain shift and making predictions according to the prototype closest to the vector with predicted class probabilities. The proposed adaptation procedure is fast, comes almost for free in terms of computational resources and leads to considerable performance improvements. We demonstrate the benefits of such label calibration on the highly-practical synthetic-to-real semantic segmentation problem.
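
A rough sketch of the prototype-based prediction step, operating on per-pixel class probabilities: prototypes are built by averaging the probability vectors of pixels sharing an argmax class, and each pixel is then relabelled by its nearest prototype. This is one plausible reading of the abstract, not necessarily the paper's exact procedure.

```python
import numpy as np

def build_soft_label_prototypes(probs):
    """probs: (N, C) predicted class probabilities for N target-domain pixels."""
    pseudo = probs.argmax(axis=1)
    C = probs.shape[1]
    protos = np.stack([probs[pseudo == c].mean(axis=0) if np.any(pseudo == c)
                       else np.full(C, 1.0 / C) for c in range(C)])
    return protos                                   # (C, C): one soft prototype per class

def calibrated_predict(probs, protos):
    """Assign each pixel to the class whose prototype is closest to its probability vector."""
    dists = np.linalg.norm(probs[:, None, :] - protos[None, :, :], axis=-1)  # (N, C)
    return dists.argmin(axis=1)

probs = np.random.dirichlet(alpha=np.ones(19), size=4096)  # e.g. 19 Cityscapes classes
protos = build_soft_label_prototypes(probs)
labels = calibrated_predict(probs, protos)
print(protos.shape, labels.shape)
```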

Parse and Recall: Towards Accurate Lung Nodule Malignancy Prediction like Radiologists

  • paper_url: http://arxiv.org/abs/2307.10824
  • repo_url: None
  • paper_authors: Jianpeng Zhang, Xianghua Ye, Jianfeng Zhang, Yuxing Tang, Minfeng Xu, Jianfei Guo, Xin Chen, Zaiyi Liu, Jingren Zhou, Le Lu, Ling Zhang
  • for: Improving the accuracy of early lung cancer screening, i.e., benign versus malignant lung nodule prediction, to improve survival outcomes.
  • methods: Proposes a radiologist-inspired method that simulates the diagnostic process with a context parsing module (segmenting and aggregating the contextual structure of a nodule) and a prototype recalling module (condensing previously learned cases into prototypes, updated online in a momentum way during training).
  • results: On large-scale datasets of 12,852 low-dose and 4,029 noncontrast CT nodules with pathology- or follow-up-confirmed labels, the method achieves advanced screening performance in both low-dose and noncontrast scenarios.
    Abstract Lung cancer is a leading cause of death worldwide and early screening is critical for improving survival outcomes. In clinical practice, the contextual structure of nodules and the accumulated experience of radiologists are the two core elements related to the accuracy of identification of benign and malignant nodules. Contextual information provides comprehensive information about nodules such as location, shape, and peripheral vessels, and experienced radiologists can search for clues from previous cases as a reference to enrich the basis of decision-making. In this paper, we propose a radiologist-inspired method to simulate the diagnostic process of radiologists, which is composed of context parsing and prototype recalling modules. The context parsing module first segments the context structure of nodules and then aggregates contextual information for a more comprehensive understanding of the nodule. The prototype recalling module utilizes prototype-based learning to condense previously learned cases as prototypes for comparative analysis, which is updated online in a momentum way during training. Building on the two modules, our method leverages both the intrinsic characteristics of the nodules and the external knowledge accumulated from other nodules to achieve a sound diagnosis. To meet the needs of both low-dose and noncontrast screening, we collect a large-scale dataset of 12,852 and 4,029 nodules from low-dose and noncontrast CTs respectively, each with pathology- or follow-up-confirmed labels. Experiments on several datasets demonstrate that our method achieves advanced screening performance on both low-dose and noncontrast scenarios.
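
The online, momentum-style prototype update mentioned in the abstract is commonly realized as an exponential moving average of class features; the sketch below shows that generic EMA form (the feature dimension, momentum value, and two-class setup are assumptions, not the paper's recalling module).

```python
import torch

def momentum_update_prototypes(prototypes, feats, labels, momentum=0.99):
    """prototypes: (C, D); feats: (B, D) batch features; labels: (B,) class indices."""
    for c in labels.unique():
        batch_mean = feats[labels == c].mean(dim=0)
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * batch_mean
    return prototypes

prototypes = torch.zeros(2, 128)            # e.g. benign / malignant prototypes
feats = torch.randn(16, 128)
labels = torch.randint(0, 2, (16,))
prototypes = momentum_update_prototypes(prototypes, feats, labels)
print(prototypes.norm(dim=1))
```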

Gradient-Semantic Compensation for Incremental Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.10822
  • repo_url: None
  • paper_authors: Wei Cong, Yang Cong, Jiahua Dong, Gan Sun, Henghui Ding
  • for: Incremental semantic segmentation that continually learns newly arriving classes while addressing catastrophic forgetting and background shift.
  • methods: Proposes a Gradient-Semantic Compensation (GSC) model with a step-aware gradient compensation that balances the forgetting paces of previously seen classes by re-weighting gradient back-propagation, a soft-sharp semantic relation distillation that distills consistent inter-class semantic relations via soft labels, and a prototypical pseudo re-labeling that provides strong semantic guidance to mitigate background shift.
  • results: Extensive experiments on three public datasets, Pascal VOC 2012, ADE20K, and Cityscapes, demonstrate the effectiveness of the proposed GSC model.
    Abstract Incremental semantic segmentation aims to continually learn the segmentation of new coming classes without accessing the training data of previously learned classes. However, most current methods fail to address catastrophic forgetting and background shift since they 1) treat all previous classes equally without considering different forgetting paces caused by imbalanced gradient back-propagation; 2) lack strong semantic guidance between classes. To tackle the above challenges, in this paper, we propose a Gradient-Semantic Compensation (GSC) model, which surmounts incremental semantic segmentation from both gradient and semantic perspectives. Specifically, to address catastrophic forgetting from the gradient aspect, we develop a step-aware gradient compensation that can balance forgetting paces of previously seen classes via re-weighting gradient backpropagation. Meanwhile, we propose a soft-sharp semantic relation distillation to distill consistent inter-class semantic relations via soft labels for alleviating catastrophic forgetting from the semantic aspect. In addition, we develop a prototypical pseudo re-labeling that provides strong semantic guidance to mitigate background shift. It produces high-quality pseudo labels for old classes in the background by measuring distances between pixels and class-wise prototypes. Extensive experiments on three public datasets, i.e., Pascal VOC 2012, ADE20K, and Cityscapes, demonstrate the effectiveness of our proposed GSC model.

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

  • paper_url: http://arxiv.org/abs/2307.10816
  • repo_url: https://github.com/showlab/boxdiff
  • paper_authors: Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou
  • for: A training-free method to control the objects and contexts synthesized by diffusion models so that they adhere to user-given spatial conditions such as boxes or scribbles.
  • methods: Three spatial constraints, Inner-Box, Outer-Box, and Corner Constraints, are designed and seamlessly integrated into the denoising step of diffusion models, requiring no additional training or massive annotated layout data.
  • results: Experiments show the constraints control what and where to present in the images while retaining the diffusion model's ability to synthesize with high fidelity and diverse concept coverage.
    Abstract Recent text-to-image diffusion models have demonstrated an astonishing capacity to generate high-quality images. However, researchers mainly studied the way of synthesizing images with only text prompts. While some works have explored using other modalities as conditions, considerable paired data, e.g., box/mask-image pairs, and fine-tuning time are required for nurturing models. As such paired data is time-consuming and labor-intensive to acquire and restricted to a closed set, this potentially becomes the bottleneck for applications in an open world. This paper focuses on the simplest form of user-provided conditions, e.g., box or scribble. To mitigate the aforementioned problem, we propose a training-free method to control objects and contexts in the synthesized images adhering to the given spatial conditions. Specifically, three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints, are designed and seamlessly integrated into the denoising step of diffusion models, requiring no additional training and massive annotated layout data. Extensive experimental results demonstrate that the proposed constraints can control what and where to present in the images while retaining the ability of Diffusion models to synthesize with high fidelity and diverse concept coverage. The code is publicly available at https://github.com/showlab/BoxDiff.
    摘要 近期的文本到图像扩散模型已经展现出惊人的高质量图像生成能力。然而,研究人员主要研究的是仅使用文本提示来生成图像。虽然一些工作已经探索了使用其他模态作为条件,但这通常需要大量的配对数据(例如框/掩码-图像对)以及微调时间。由于此类配对数据的获取费时费力且局限于封闭集合,这可能成为开放世界应用的瓶颈。本文关注最简单的用户提供条件,例如框或涂鸦。为了解决上述问题,我们提出了一种无需训练的方法,以控制生成图像中的对象和上下文,使其遵循给定的空间条件。具体来说,我们设计了三种空间约束,即 Inner-Box、Outer-Box 和 Corner 约束,并将其无缝集成到扩散模型的去噪步骤中,无需额外训练和大量标注的布局数据。广泛的实验结果表明,所提约束可以控制图像中呈现的内容及其位置,同时保持扩散模型高保真度和多样概念覆盖的生成能力。代码可在 GitHub 上获取:https://github.com/showlab/BoxDiff。
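The box constraints act on the cross-attention maps of the denoising network so that a prompt token's attention concentrates inside (Inner-Box) and stays out of (Outer-Box) the user-given box. Below is a minimal sketch of inner/outer box losses on a single attention map; the top-k aggregation, the names, and the idea of using the resulting gradient to nudge the latent are illustrative assumptions, not BoxDiff's exact procedure.

```python
import torch

def box_mask(h, w, box):
    """box = (x0, y0, x1, y1) in pixel coordinates of the attention map."""
    x0, y0, x1, y1 = box
    m = torch.zeros(h, w)
    m[y0:y1, x0:x1] = 1.0
    return m

def box_constraint_losses(attn, box, k=50):
    """attn: (h, w) cross-attention map of one prompt token, values in [0, 1]."""
    h, w = attn.shape
    m = box_mask(h, w, box).to(attn.device)
    inner = attn[m.bool()]             # responses inside the user box
    outer = attn[~m.bool()]            # responses outside the box
    # Inner-Box: the strongest responses inside the box should be high.
    l_inner = 1.0 - inner.topk(min(k, inner.numel())).values.mean()
    # Outer-Box: the strongest responses outside the box should be low.
    l_outer = outer.topk(min(k, outer.numel())).values.mean()
    return l_inner, l_outer

# Toy usage: one gradient step that would steer the latent update.
attn = torch.rand(64, 64, requires_grad=True)
l_in, l_out = box_constraint_losses(attn, box=(16, 16, 48, 48))
(l_in + l_out).backward()              # attn.grad is the constraint signal
```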

A novel integrated method of detection-grasping for specific object based on the box coordinate matching

  • paper_url: http://arxiv.org/abs/2307.11783
  • repo_url: None
  • paper_authors: Zongmin Liu, Jirui Wang, Jie Li, Zufeng Li, Kai Ren, Peng Shi
  • for: 提高服务机器人对特定物体的抓取能力,以更好地照顾老年和残疾人群。
  • methods: 提出了一种基于盒坐标匹配的检测-抓取融合方法(DG-BCM):通过加入通道注意力模块(CAM)和空间注意力模块(SAM)改进 SOLOv2 实例分割模型,并将 ASPP 和 CAM 添加到 GR-CNN 模型中,以优化抓取估计。
  • results: 对象检测和抓取估算分别进行了验证,并在模拟平台上进行了具体的抓取任务,证明了该方法的可行性和有效性。
    Abstract To better care for the elderly and disabled, it is essential for service robots to have an effective fusion method of object detection and grasp estimation. However, limited research has been observed on the combination of object detection and grasp estimation. To overcome this technical difficulty, a novel integrated method of detection-grasping for specific object based on the box coordinate matching is proposed in this paper. Firstly, the SOLOv2 instance segmentation model is improved by adding channel attention module (CAM) and spatial attention module (SAM). Then, the atrous spatial pyramid pooling (ASPP) and CAM are added to the generative residual convolutional neural network (GR-CNN) model to optimize grasp estimation. Furthermore, a detection-grasping integrated algorithm based on box coordinate matching (DG-BCM) is proposed to obtain the fusion model of object detection and grasp estimation. For verification, experiments on object detection and grasp estimation are conducted separately to verify the superiority of improved models. Additionally, grasping tasks for several specific objects are implemented on a simulation platform, demonstrating the feasibility and effectiveness of DG-BCM algorithm proposed in this paper.
    摘要 为了更好地照顾老年人和残障人士,服务机器人需要一种有效的物体检测与抓取估计融合方法。然而,目前将物体检测与抓取估计相结合的研究仍然有限。为了克服这一技术难点,本文提出了一种基于盒坐标匹配的特定物体检测-抓取一体化新方法。首先,通过加入通道注意力模块(CAM)和空间注意力模块(SAM)改进 SOLOv2 实例分割模型;然后,将空洞空间金字塔池化(ASPP)和 CAM 加入生成式残差卷积神经网络(GR-CNN)模型,以优化抓取估计。在此基础上,提出了基于盒坐标匹配的检测-抓取一体化算法(DG-BCM),得到物体检测与抓取估计的融合模型。为了验证,分别对物体检测和抓取估计进行了实验,证明了改进模型的优越性;此外,还在仿真平台上针对若干特定物体实施了抓取任务,验证了所提 DG-BCM 算法的可行性和有效性。
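The abstract does not spell out the matching rule, but "box coordinate matching" suggests pairing each detected object box with the grasp candidate whose rectangle best agrees with it in image coordinates. The sketch below pairs boxes by IoU as one plausible reading; the IoU criterion and all names are assumptions, not the authors' exact algorithm.

```python
def iou(a, b):
    """a, b: boxes as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_grasp_to_object(object_box, grasp_boxes, min_iou=0.1):
    """Return the index of the grasp rectangle that best overlaps the target."""
    scored = [(iou(object_box, g), i) for i, g in enumerate(grasp_boxes)]
    best_iou, best_i = max(scored)
    return best_i if best_iou >= min_iou else None

# Toy usage: the target detected at (40, 40, 120, 120), three grasp candidates.
grasps = [(0, 0, 30, 30), (50, 60, 110, 100), (200, 200, 240, 240)]
print(match_grasp_to_object((40, 40, 120, 120), grasps))  # -> 1
```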

Perceptual Quality Assessment of Omnidirectional Audio-visual Signals

  • paper_url: http://arxiv.org/abs/2307.10813
  • repo_url: None
  • paper_authors: Xilei Zhu, Huiyu Duan, Yuqin Cao, Yuxin Zhu, Yucheng Zhu, Jing Liu, Li Chen, Xiongkuo Min, Guangtao Zhai
  • for: 本研究旨在评估Omnidirectional视频(ODV)的质量,以提高用户的Quality of Experience(QoE)。
  • methods: 本研究设计了三种基线方法,通过多模态融合策略将现有最先进的单模态音频与视频质量评估(QA)模型相结合,以实现全参考全向音视频质量评估(OAVQA)。
  • results: 研究在大规模全向音视频质量评估数据集上验证了音视频多模态融合方法的有效性,为全向 QoE 评估提供了新的基准。
    Abstract Omnidirectional videos (ODVs) play an increasingly important role in the application fields of medical, education, advertising, tourism, etc. Assessing the quality of ODVs is significant for service-providers to improve the user's Quality of Experience (QoE). However, most existing quality assessment studies for ODVs only focus on the visual distortions of videos, while ignoring that the overall QoE also depends on the accompanying audio signals. In this paper, we first establish a large-scale audio-visual quality assessment dataset for omnidirectional videos, which includes 375 distorted omnidirectional audio-visual (A/V) sequences generated from 15 high-quality pristine omnidirectional A/V contents, and the corresponding perceptual audio-visual quality scores. Then, we design three baseline methods for full-reference omnidirectional audio-visual quality assessment (OAVQA), which combine existing state-of-the-art single-mode audio and video QA models via multimodal fusion strategies. We validate the effectiveness of the A/V multimodal fusion method for OAVQA on our dataset, which provides a new benchmark for omnidirectional QoE evaluation. Our dataset is available at https://github.com/iamazxl/OAVQA.
    摘要 “全方位视频”(ODV)在医疗、教育、广告、旅游等领域的应用越来越重要。评估ODV的质量非常重要,以提高用户的“用户体验质量”(QoE)。然而,现有的大多数质量评估研究只关注视频中的视觉扭曲,而忽略了音频信号对总体质量的影响。在本文中,我们首先建立了大规模的音频视频质量评估数据集,包括375个扭曲的全方位音频视频(A/V)序列,这些序列来自于15个高质量的原始全方位A/V内容。然后,我们设计了三种基线方法 для全方位音频视频质量评估(OAVQA),这些方法将现有单模式音频和视频质量评估模型通过多模式融合策略相结合。我们验证了这种A/V多模式融合方法的有效性,并提供了一个新的全方位QoE评估标准。我们的数据集可以在GitHub上获取:https://github.com/iamazxl/OAVQA。

HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces

  • paper_url: http://arxiv.org/abs/2307.10797
  • repo_url: https://github.com/stelabou/hyperreenact
  • paper_authors: Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, Georgios Tzimiropoulos
  • for: 本研究旨在提出一种基于 StyleGAN2 生成器的神经网络方法,用于实现真实的 talking head 图像生成,驱动目标表情pose。
  • methods: 我们提出了一种使用 StyleGAN2 生成器的强化版本,通过首先将真实图像转换为其latent space,然后使用 hypernetwork 进行:(i)源 identity 特征纠正和(ii)面部pose 重新目标,从而消除了依赖于外部编辑方法的需要。
  • results: 我们的方法可以在一shot setting 下运行,并且可以进行cross-subject reenactment,无需任何主体特定的 fine-tuning。我们在 VoxCeleb1 和 VoxCeleb2 标准准样上进行了评估,并与其他状态对照方法进行了比较,显示了我们的方法在生成 artifact-free 图像方面的超越性,并且在极端头pose 变化下展现出了remarkable 的Robustness。
    Abstract In this paper, we present our method for neural face reenactment, called HyperReenact, that aims to generate realistic talking head images of a source identity, driven by a target facial pose. Existing state-of-the-art face reenactment methods train controllable generative models that learn to synthesize realistic facial images, yet producing reenacted faces that are prone to significant visual artifacts, especially under the challenging condition of extreme head pose changes, or requiring expensive few-shot fine-tuning to better preserve the source identity characteristics. We propose to address these limitations by leveraging the photorealistic generation ability and the disentangled properties of a pretrained StyleGAN2 generator, by first inverting the real images into its latent space and then using a hypernetwork to perform: (i) refinement of the source identity characteristics and (ii) facial pose re-targeting, eliminating this way the dependence on external editing methods that typically produce artifacts. Our method operates under the one-shot setting (i.e., using a single source frame) and allows for cross-subject reenactment, without requiring any subject-specific fine-tuning. We compare our method both quantitatively and qualitatively against several state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and VoxCeleb2, demonstrating the superiority of our approach in producing artifact-free images, exhibiting remarkable robustness even under extreme head pose changes. We make the code and the pretrained models publicly available at: https://github.com/StelaBou/HyperReenact .
    摘要 在这篇论文中,我们提出了一种名为 HyperReenact 的神经人脸重演方法,旨在在目标面部姿态的驱动下生成源身份的真实讲话头像。现有的最先进人脸重演方法训练可控生成模型来合成逼真的人脸图像,但生成的重演人脸容易出现明显的视觉伪影,特别是在极端头部姿态变化的情况下,或者需要昂贵的少样本微调才能更好地保留源身份特征。我们提出利用预训练 StyleGAN2 生成器的真实感生成能力和解耦特性来解决这些局限:首先将真实图像反演到其隐空间,然后使用超网络完成(i)源身份特征的精炼和(ii)面部姿态的重定向,从而摆脱对通常会产生伪影的外部编辑方法的依赖。我们的方法在单样本设定下运行(即仅使用单帧源图像),并支持跨主体重演,无需任何针对特定主体的微调。我们在 VoxCeleb1 和 VoxCeleb2 标准基准上对方法进行了定量和定性的比较,结果表明我们的方法在生成无伪影图像方面更优,并且在极端头部姿态变化下表现出显著的鲁棒性。我们在 https://github.com/StelaBou/HyperReenact 上公开了代码和预训练模型。

Behavioral Analysis of Vision-and-Language Navigation Agents

  • paper_url: http://arxiv.org/abs/2307.10790
  • repo_url: https://github.com/yoark/vln-behave
  • paper_authors: Zijiao Yang, Arjun Majumdar, Stefan Lee
  • for: 这项研究的目的是研究视觉语言导航(VLN)智能体如何根据其所处环境来理解并落实(ground)指令。
  • methods: 这种方法是通过生成特定技能的干预来研究代理人的行为,并测量代理人的预测变化。
  • results: 研究发现,训练阶段的偏差会对智能体的行为产生持续影响,而现有模型能够将简单的指代表达与环境相对应。此外,通过比较多种模型,我们发现技能特定得分与整体 VLN 任务性能之间存在正相关关系。
    Abstract To be successful, Vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on their surroundings. In this work, we develop a methodology to study agent behavior on a skill-specific basis -- examining how well existing agents ground instructions about stopping, turning, and moving towards specified objects or rooms. Our approach is based on generating skill-specific interventions and measuring changes in agent predictions. We present a detailed case study analyzing the behavior of a recent agent and then compare multiple agents in terms of skill-specific competency scores. This analysis suggests that biases from training have lasting effects on agent behavior and that existing models are able to ground simple referring expressions. Our comparisons between models show that skill-specific scores correlate with improvements in overall VLN task performance.
    摘要 要取得成功,视觉语言导航(VLN)智能体必须能够根据周围环境将指令与行动相对应。在这项工作中,我们开发了一种按技能逐项研究智能体行为的方法,考察现有智能体对停止、转向以及朝指定物体或房间移动等指令的理解程度。我们的方法基于生成技能特定的干预并测量智能体预测的变化。我们对一个最新的智能体进行了详细的案例分析,然后从技能特定能力得分的角度比较了多个智能体。分析表明,训练阶段的偏差会对智能体行为产生持续影响,而现有模型能够正确关联简单的指代表达。模型之间的比较还显示,技能特定得分与整体 VLN 任务性能的提升呈正相关。

Feed-Forward Source-Free Domain Adaptation via Class Prototypes

  • paper_url: http://arxiv.org/abs/2307.10787
  • repo_url: None
  • paper_authors: Ondrej Bohdal, Da Li, Timothy Hospedales
  • for: 这篇论文旨在提出一种简单的前馈式无源域自适应方法,挑战传统基于反向传播的适应方法。
  • methods: 本研究使用预训练模型,在域偏移下计算各类别的原型,并据此进行分类,从而提高精度。
  • results: 相比于基于反向传播的现有域自适应方法,该方法取得了更高的精度,且只需其一小部分的时间。
    Abstract Source-free domain adaptation has become popular because of its practical usefulness and no need to access source data. However, the adaptation process still takes a considerable amount of time and is predominantly based on optimization that relies on back-propagation. In this work we present a simple feed-forward approach that challenges the need for back-propagation based adaptation. Our approach is based on computing prototypes of classes under the domain shift using a pre-trained model. It achieves strong improvements in accuracy compared to the pre-trained model and requires only a small fraction of time of existing domain adaptation methods.
    摘要 无源域自适应因其实用性以及无需访问源数据而受到广泛关注。然而,适应过程仍然相当耗时,并且主要依赖基于反向传播的优化。在这项工作中,我们提出了一种简单的前馈方法,质疑基于反向传播的适应是否必要。我们的方法利用预训练模型,在域偏移下计算各类别的原型。与预训练模型相比,该方法在精度上取得了显著提升,且仅需现有域自适应方法一小部分的时间。
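The feed-forward adaptation boils down to computing one prototype per class from the pre-trained model's (pseudo-labelled) target features and then classifying by nearest prototype, with no back-propagation. The sketch below shows that pipeline under the assumption that pseudo-labels come from the frozen source classifier; details such as the cosine similarity and normalization are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_prototypes(features, pseudo_labels, num_classes):
    """features: (N, D) target-domain features from the frozen backbone."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)
    return F.normalize(protos, dim=1)

@torch.no_grad()
def prototype_predict(features, prototypes):
    """Cosine similarity to each class prototype, argmax over classes."""
    sims = F.normalize(features, dim=1) @ prototypes.t()
    return sims.argmax(dim=1)

# Toy usage with random features standing in for backbone outputs.
feats = torch.randn(100, 64)
pseudo = torch.randint(0, 5, (100,))       # assumed source-classifier labels
protos = build_prototypes(feats, pseudo, num_classes=5)
preds = prototype_predict(feats, protos)
```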

SMURF: Spatial Multi-Representation Fusion for 3D Object Detection with 4D Imaging Radar

  • paper_url: http://arxiv.org/abs/2307.10784
  • repo_url: None
  • paper_authors: Jianan Liu, Qiuchi Zhao, Weiyi Xiong, Tao Huang, Qing-Long Han, Bing Zhu
  • for: This paper is written for the purpose of addressing the challenges of 3D object detection using 4D millimeter wave (mmWave) radar technology, specifically the issues of sparsity and noise in radar point cloud data.
  • methods: The paper proposes a novel approach called spatial multi-representation fusion (SMURF) that leverages multiple representations of radar detection points to improve the accuracy of 3D object detection. SMURF uses pillarization and density features of a multi-dimensional Gaussian mixture distribution through kernel density estimation (KDE) to mitigate measurement inaccuracy and alleviate point cloud sparsity.
  • results: The paper demonstrates the effectiveness and generalization ability of SMURF through experimental evaluations on two datasets, View-of-Delft (VoD) and TJ4DRadSet. SMURF outperforms recently proposed 4D imaging radar-based single-representation models and achieves comparable performance to the state-of-the-art 4D imaging radar and camera fusion-based method, with an increase of 1.22% in the mean average precision on bird’s-eye view of TJ4DRadSet dataset and 1.32% in the 3D mean average precision on the entire annotated area of VoD dataset. Additionally, SMURF has impressive inference time, with the inference time no more than 0.05 seconds for most scans on both datasets.
    Abstract The 4D Millimeter wave (mmWave) radar is a promising technology for vehicle sensing due to its cost-effectiveness and operability in adverse weather conditions. However, the adoption of this technology has been hindered by sparsity and noise issues in radar point cloud data. This paper introduces spatial multi-representation fusion (SMURF), a novel approach to 3D object detection using a single 4D imaging radar. SMURF leverages multiple representations of radar detection points, including pillarization and density features of a multi-dimensional Gaussian mixture distribution through kernel density estimation (KDE). KDE effectively mitigates measurement inaccuracy caused by limited angular resolution and multi-path propagation of radar signals. Additionally, KDE helps alleviate point cloud sparsity by capturing density features. Experimental evaluations on View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate the effectiveness and generalization ability of SMURF, outperforming recently proposed 4D imaging radar-based single-representation models. Moreover, while using 4D imaging radar only, SMURF still achieves comparable performance to the state-of-the-art 4D imaging radar and camera fusion-based method, with an increase of 1.22% in the mean average precision on bird's-eye view of TJ4DRadSet dataset and 1.32% in the 3D mean average precision on the entire annotated area of VoD dataset. Our proposed method demonstrates impressive inference time and addresses the challenges of real-time detection, with the inference time no more than 0.05 seconds for most scans on both datasets. This research highlights the benefits of 4D mmWave radar and is a strong benchmark for subsequent works regarding 3D object detection with 4D imaging radar.
    摘要 四维毫米波(4D mmWave)雷达凭借其成本效益以及在恶劣天气条件下的可用性,是一种很有前景的车辆感知技术。然而,雷达点云数据的稀疏和噪声问题阻碍了该技术的应用。本文介绍了空间多表示融合(SMURF),一种基于单个 4D 成像雷达的新型 3D 目标检测方法。SMURF 利用雷达检测点的多种表示,包括柱状化(pillarization)以及通过核密度估计(KDE)得到的多维高斯混合分布密度特征。KDE 能有效缓解由角分辨率受限和雷达信号多径传播造成的测量误差,并通过捕获密度特征缓解点云稀疏问题。在 View-of-Delft(VoD)和 TJ4DRadSet 数据集上的实验评估表明了 SMURF 的有效性和泛化能力,其性能优于近期提出的基于 4D 成像雷达的单表示模型。此外,在仅使用 4D 成像雷达的情况下,SMURF 仍能取得与最先进的 4D 成像雷达与相机融合方法相当的性能,在 TJ4DRadSet 数据集的鸟瞰视图平均精度上提升 1.22%,在 VoD 数据集整个标注区域的 3D 平均精度上提升 1.32%。我们的方法推理时间出色,能够应对实时检测的挑战,在两个数据集的大多数扫描上推理时间不超过 0.05 秒。这项研究凸显了 4D 毫米波雷达的优势,并为后续基于 4D 成像雷达的 3D 目标检测研究提供了有力的基准。
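The density branch of SMURF summarizes how crowded the radar returns are around each detection via kernel density estimation. The snippet below computes a per-point Gaussian-KDE density over the x-y plane with scikit-learn as one way to obtain such a feature; the bandwidth and the 2-D projection are assumptions, not SMURF's exact multi-dimensional mixture formulation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_density_feature(points_xy, bandwidth=0.5):
    """points_xy: (N, 2) radar detections projected to the ground plane.
    Returns one log-density value per point, usable as an extra channel."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(points_xy)
    return kde.score_samples(points_xy)    # (N,) log densities

# Toy usage: a sparse cloud where a small cluster should score higher.
pts = np.vstack([np.random.randn(20, 2) * 0.2 + 5.0,          # cluster
                 np.random.uniform(-10, 10, size=(30, 2))])   # clutter
density = kde_density_feature(pts)
features = np.concatenate([pts, density[:, None]], axis=1)    # (N, 3)
```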

See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data

  • paper_url: http://arxiv.org/abs/2307.10782
  • repo_url: https://github.com/4DVLab/See_More_Know_More
  • paper_authors: Yuhang Lu, Qi Jiang, Runnan Chen, Yuenan Hou, Xinge Zhu, Yuexin Ma
  • for: 本研究的目的是提出一种多模式零shot学习方法,使得深度模型能够识别未在训练阶段看到的物体。
  • methods: 本方法基于已经训练过的类别的标签,将视觉特征与 semantic特征进行对齐,使得模型可以更好地利用附近类别的信息来预测未经训练的类别。
  • results: 经过广泛的实验,相比当前 SOTA 方法,我们的方法在 SemanticKITTI 和 nuScenes 两个标准 benchmark 上,对未见类别的 mIoU 平均分别提升了 52% 和 49%。
    Abstract Zero-shot point cloud segmentation aims to make deep models capable of recognizing novel objects in point cloud that are unseen in the training phase. Recent trends favor the pipeline which transfers knowledge from seen classes with labels to unseen classes without labels. They typically align visual features with semantic features obtained from word embedding by the supervision of seen classes' annotations. However, point cloud contains limited information to fully match with semantic features. In fact, the rich appearance information of images is a natural complement to the textureless point cloud, which is not well explored in previous literature. Motivated by this, we propose a novel multi-modal zero-shot learning method to better utilize the complementary information of point clouds and images for more accurate visual-semantic alignment. Extensive experiments are performed in two popular benchmarks, i.e., SemanticKITTI and nuScenes, and our method outperforms current SOTA methods with 52% and 49% improvement on average for unseen class mIoU, respectively.
    摘要 zero-shot 点云 segmentation 目标是让深度模型能够在训练阶段未看过的 novel 对象上进行识别。 current trends 倾向于使用 seen class 的标签来传递知识到 unseen class 上。它们通常将视觉特征与 semantic feature 进行对齐,其中 semantic feature 通常来自于 word embedding 的监督。然而,点云具有有限的信息,不能完全与 semantic feature 匹配。实际上,图像中的丰富的外观信息是点云的自然补充,而 previous literature 中未得到充分利用。 motivated by this, we propose 一种 novel multi-modal zero-shot learning 方法,可以更好地利用点云和图像之间的协同信息,以实现更高的视觉semantic alignment。我们在 SemanticKITTI 和 nuScenes 两个 популяр的 benchmark 上进行了广泛的实验,并比前一代 SOTA 方法提高了52%和49%的均值提升,对 unseen class mIoU 的识别率。

Learned Thresholds Token Merging and Pruning for Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.10780
  • repo_url: https://github.com/mxbonn/ltmp
  • paper_authors: Maxim Bonnaerens, Joni Dambre
  • for: 这篇论文旨在提出一种名为learned Thresholds Token Merging and Pruning(LTMP)的新方法,以提高vision transformer在图像识别任务中的计算效率。
  • methods: 本文使用可学习阈值掩码模块,动态决定哪些 token 应当合并、哪些应当剪除,以提高计算效率。
  • results: 广泛的实验表明,LTMP 在各种缩减率下均达到了最先进的准确率,且只需单个微调 epoch,比以往方法快一个数量级。
    Abstract Vision transformers have demonstrated remarkable success in a wide range of computer vision tasks over the last years. However, their high computational costs remain a significant barrier to their practical deployment. In particular, the complexity of transformer models is quadratic with respect to the number of input tokens. Therefore techniques that reduce the number of input tokens that need to be processed have been proposed. This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning. LTMP uses learned threshold masking modules that dynamically determine which tokens to merge and which to prune. We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task. Our results demonstrate that LTMP achieves state-of-the-art accuracy across reduction rates while requiring only a single fine-tuning epoch, which is an order of magnitude faster than previous methods. Code is available at https://github.com/Mxbonn/ltmp .
    摘要 视觉 Transformer 在过去几年中于各类计算机视觉任务上表现出了惊人的成功,但其高计算成本仍然是实际部署的重要障碍。特别是,Transformer 模型的复杂度与输入 token 数量成平方关系,因此减少需要处理的输入 token 数量的技术受到了关注。本文提出了学习阈值的 token 合并与剪枝(LTMP),一种结合 token 合并与 token 剪枝两者优势的新方法。LTMP 使用可学习的阈值掩码模块,动态决定哪些 token 需要合并、哪些需要剪除。我们在 ImageNet 分类任务上对视觉 Transformer 进行了大量实验,结果表明 LTMP 在各种缩减率下都达到了最先进的准确率,且只需一个微调 epoch,比以往方法快一个数量级。代码可在 https://github.com/Mxbonn/ltmp 上获取。
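The core of LTMP is a masking module that compares a per-token importance score against a learned threshold and, since the resulting hard mask is not differentiable, lets gradients pass through a soft relaxation (a straight-through-style trick). The sketch below shows one way such a module could look; the sigmoid relaxation, temperature, and names are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LearnedThresholdMask(nn.Module):
    """Keep tokens whose importance exceeds a learned threshold."""
    def __init__(self, init_threshold=0.0, temperature=10.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, importance):
        """importance: (B, N) score per token (e.g. mean CLS attention)."""
        soft = torch.sigmoid(self.temperature * (importance - self.threshold))
        hard = (soft > 0.5).float()
        # Straight-through: forward uses the hard mask, backward sees `soft`.
        return hard + soft - soft.detach()

# Toy usage: zero out unimportant tokens before the next transformer block.
mask_module = LearnedThresholdMask()
tokens = torch.randn(2, 197, 384)                 # (batch, tokens, dim)
importance = tokens.norm(dim=-1)                  # stand-in importance score
mask = mask_module(importance)                    # (2, 197), {0, 1} in forward
pruned_tokens = tokens * mask.unsqueeze(-1)
```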

Urban Radiance Field Representation with Deformable Neural Mesh Primitives

  • paper_url: http://arxiv.org/abs/2307.10776
  • repo_url: None
  • paper_authors: Fan Lu, Yan Xu, Guang Chen, Hongsheng Li, Kwan-Yee Lin, Changjun Jiang
  • for: 高品质的都市级场景表示和Synthesis
  • methods: 使用Deformable Neural Mesh Primitive(DNMP) parameterize整个场景,DNMP是一种灵活和减少存储的神经variant mesh representation,通过缩放和截断来控制度量的自由度,并使用视角相关的MLP来解码渲染颜色
  • results: 实现高品质的新视图合成,并且具有低计算成本和快速渲染速度(2.07ms/1k像素),与高优化的Instant-NGP相当(0.61vs0.71ms/1k像素)
    Abstract Neural Radiance Fields (NeRFs) have achieved great success in the past few years. However, most current methods still require intensive resources due to ray marching-based rendering. To construct urban-level radiance fields efficiently, we design Deformable Neural Mesh Primitive~(DNMP), and propose to parameterize the entire scene with such primitives. The DNMP is a flexible and compact neural variant of classic mesh representation, which enjoys both the efficiency of rasterization-based rendering and the powerful neural representation capability for photo-realistic image synthesis. Specifically, a DNMP consists of a set of connected deformable mesh vertices with paired vertex features to parameterize the geometry and radiance information of a local area. To constrain the degree of freedom for optimization and lower the storage budgets, we enforce the shape of each primitive to be decoded from a relatively low-dimensional latent space. The rendering colors are decoded from the vertex features (interpolated with rasterization) by a view-dependent MLP. The DNMP provides a new paradigm for urban-level scene representation with appealing properties: $(1)$ High-quality rendering. Our method achieves leading performance for novel view synthesis in urban scenarios. $(2)$ Low computational costs. Our representation enables fast rendering (2.07ms/1k pixels) and low peak memory usage (110MB/1k pixels). We also present a lightweight version that can run 33$\times$ faster than vanilla NeRFs, and comparable to the highly-optimized Instant-NGP (0.61 vs 0.71ms/1k pixels). Project page: \href{https://dnmp.github.io/}{https://dnmp.github.io/}.
    摘要 神经辐射场(NeRF)在过去几年取得了巨大成功。然而,大多数现有方法由于基于光线步进(ray marching)的渲染,仍然需要大量计算资源。为了高效地构建城市级别的辐射场,我们设计了可变形神经网格基元(DNMP),并提出用这种基元来参数化整个场景。DNMP 是经典网格表示的一种灵活而紧凑的神经变体,既享有基于光栅化渲染的高效性,又具备用于照片级真实感图像合成的强大神经表示能力。具体来说,一个 DNMP 由一组相互连接的可变形网格顶点组成,每个顶点配有成对的顶点特征,用于参数化局部区域的几何与辐射信息。为了限制优化的自由度并降低存储开销,我们强制每个基元的形状从相对低维的隐空间中解码得到。渲染颜色则由视角相关的 MLP 从(经光栅化插值的)顶点特征中解码。DNMP 为城市级场景表示提供了一种新范式,具有以下优点:(1)高质量渲染:我们的方法在城市场景的新视图合成上取得了领先性能;(2)低计算开销:我们的表示支持快速渲染(2.07ms/1k 像素)和较低的峰值显存占用(110MB/1k 像素)。我们还提供了一个轻量级版本,其速度可达原始 NeRF 的 33 倍,并与高度优化的 Instant-NGP 相当(0.61 vs 0.71ms/1k 像素)。项目页面:https://dnmp.github.io/。

LBL: Logarithmic Barrier Loss Function for One-class Classification

  • paper_url: http://arxiv.org/abs/2307.10753
  • repo_url: https://github.com/ml-hdu/lbl_lblsig
  • paper_authors: Tianlei Wang, Dekang Liu, Wandong Zhang, Jiuwen Cao
  • for: 本研究旨在提出一种新的一类分类(OCC)损失函数,用于深度学习中的一类分类问题。
  • methods: 本研究提出了一种基于对数梯度函数的OCC损失函数(LBL),以及一种基于这个损失函数的修改后的OCC损失函数(LBLSig)。
  • results: 实验结果表明,提出的LBL和LBLSig损失函数比之前的OCC损失函数更加稳定和有效,并且在不同的网络结构上都有优秀的表现。
    Abstract One-class classification (OCC) aims to train a classifier only with the target class data and attracts great attention for its strong applicability in real-world application. Despite a lot of advances have been made in OCC, it still lacks the effective OCC loss functions for deep learning. In this paper, a novel logarithmic barrier function based OCC loss (LBL) that assigns large gradients to the margin samples and thus derives more compact hypersphere, is first proposed by approximating the OCC objective smoothly. But the optimization of LBL may be instability especially when samples lie on the boundary leading to the infinity loss. To address this issue, then, a unilateral relaxation Sigmoid function is introduced into LBL and a novel OCC loss named LBLSig is proposed. The LBLSig can be seen as the fusion of the mean square error (MSE) and the cross entropy (CE) and the optimization of LBLSig is smoother owing to the unilateral relaxation Sigmoid function. The effectiveness of the proposed LBL and LBLSig is experimentally demonstrated in comparisons with several state-of-the-art OCC algorithms on different network structures. The source code can be found at https://github.com/ML-HDU/LBL_LBLSig.
    摘要 一类分类(OCC)旨在仅利用目标类数据训练分类器,因其在实际应用中的强适用性而备受关注。尽管 OCC 已取得许多进展,但深度学习仍缺乏有效的 OCC 损失函数。本文首先通过对 OCC 目标的平滑近似,提出了一种基于对数障碍函数的 OCC 损失(LBL),它对边缘样本赋予较大的梯度,从而得到更紧凑的超球。然而,当样本落在边界上时,LBL 的优化可能不稳定并导致损失趋于无穷。为此,我们在 LBL 中引入单侧松弛的 Sigmoid 函数,提出了新的 OCC 损失 LBLSig。LBLSig 可视为均方误差(MSE)与交叉熵(CE)的融合,且由于单侧松弛 Sigmoid 函数的存在,其优化更加平滑。在多种网络结构上与多个最新 OCC 算法的对比实验验证了所提 LBL 和 LBLSig 的有效性。源码见 https://github.com/ML-HDU/LBL_LBLSig。
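For intuition, the loss can be pictured on a Deep-SVDD-style setup where target samples should fall inside a hypersphere of radius r around a center c: a logarithmic barrier on the normalized distance explodes as a sample approaches the boundary, which is exactly the instability the sigmoid relaxation of LBLSig is meant to tame. The snippet below is a rough sketch of that picture under the hypersphere assumption; the exact LBL/LBLSig expressions in the paper may differ.

```python
import torch

def lbl_loss(features, center, radius, eps=1e-6):
    """Log-barrier style OCC loss: penalizes samples as they approach the
    hypersphere boundary; diverges when a sample reaches the boundary."""
    d2 = ((features - center) ** 2).sum(dim=1)
    ratio = torch.clamp(d2 / (radius ** 2), max=1.0 - eps)
    return -torch.log(1.0 - ratio).mean()

def lblsig_loss(features, center, radius):
    """Relaxed variant: squash the normalized distance with a sigmoid so the
    barrier argument stays away from 1 and gradients remain bounded."""
    d2 = ((features - center) ** 2).sum(dim=1)
    relaxed = torch.sigmoid(d2 / (radius ** 2) - 1.0)   # in (0, 1), smooth
    return -torch.log(1.0 - relaxed).mean()

# Toy usage on random embeddings of the (single) target class.
feats = torch.randn(32, 128, requires_grad=True)
center = feats.detach().mean(dim=0)
loss = lblsig_loss(feats, center, radius=3.0)
loss.backward()
```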

EdgeAL: An Edge Estimation Based Active Learning Approach for OCT Segmentation

  • paper_url: http://arxiv.org/abs/2307.10745
  • repo_url: https://github.com/mak-ta-reque/edgeal
  • paper_authors: Md Abdul Kadir, Hasan Md Tusfiqur Alam, Daniel Sonntag
  • for: 这个论文是为了提出一种基于边缘信息的活动学习算法,以优化模型在具有有限数据的情况下的训练。
  • methods: 该论文使用了模型对边缘信息的分析,以量化模型对未见数据的不确定性。然后,使用这个量化来选择需要注释的超像素。
  • results: 论文在多类 Optical Coherence Tomography(OCT) segmentation任务中实现了99%的 dice 得分,同时降低了注释标签成本至12%, 2.3%, 和3%,分别在三个公共可用的数据集(Duke、AROI 和 UMN)上。
    Abstract Active learning algorithms have become increasingly popular for training models with limited data. However, selecting data for annotation remains a challenging problem due to the limited information available on unseen data. To address this issue, we propose EdgeAL, which utilizes the edge information of unseen images as {\it a priori} information for measuring uncertainty. The uncertainty is quantified by analyzing the divergence and entropy in model predictions across edges. This measure is then used to select superpixels for annotation. We demonstrate the effectiveness of EdgeAL on multi-class Optical Coherence Tomography (OCT) segmentation tasks, where we achieved a 99% dice score while reducing the annotation label cost to 12%, 2.3%, and 3%, respectively, on three publicly available datasets (Duke, AROI, and UMN). The source code is available at \url{https://github.com/Mak-Ta-Reque/EdgeAL}
    摘要 主动学习算法在训练数据有限的情况下越来越受到欢迎。然而,由于对未见数据的信息有限,如何选择需要标注的数据仍是一个难题。为了解决这一问题,我们提出了 EdgeAL,它将未见图像的边缘信息作为先验来度量不确定性。该不确定性通过分析模型预测在边缘处的分歧与熵来量化,并据此选择需要标注的超像素。我们在多类光学相干断层扫描(OCT)分割任务上验证了 EdgeAL 的有效性:在三个公开数据集(Duke、AROI 和 UMN)上,仅分别使用 12%、2.3% 和 3% 的标注成本即可达到 99% 的 Dice 得分。源码见 https://github.com/Mak-Ta-Reque/EdgeAL。
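EdgeAL's acquisition signal combines prediction entropy/divergence with edge information and is pooled over superpixels before selecting regions to annotate. The sketch below computes a simple entropy-at-edges score per superpixel as one plausible instantiation; the edge weighting, the mean pooling, and the names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, axis=0):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def edge_weighted_uncertainty(logits, edge_map, superpixels):
    """logits: (C, H, W) model output; edge_map: (H, W) edge strength in [0, 1];
    superpixels: (H, W) integer superpixel ids. Returns {id: score}."""
    probs = softmax(logits, axis=0)
    entropy = -(probs * np.log(probs + 1e-8)).sum(axis=0)   # (H, W)
    score_map = entropy * edge_map                           # emphasize edges
    return {int(sp): float(score_map[superpixels == sp].mean())
            for sp in np.unique(superpixels)}

# Toy usage: pick the most uncertain superpixel to send for annotation.
H, W, C = 64, 64, 4
logits = np.random.randn(C, H, W)
edges = np.random.rand(H, W)
sp_ids = (np.arange(H * W) // 256).reshape(H, W)   # fake 16 superpixels
scores = edge_weighted_uncertainty(logits, edges, sp_ids)
query = max(scores, key=scores.get)
```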

Comparison between transformers and convolutional models for fine-grained classification of insects

  • paper_url: http://arxiv.org/abs/2307.11112
  • repo_url: None
  • paper_authors: Rita Pucci, Vincent J. Kalkman, Dan Stowell
  • for: 本研究主要针对 insecta 分类问题,尤其是 Odonata 和 Coleoptera 等目的分类。
  • methods: 本研究使用了深度学习算法,主要比较了 transformer 层和 convolutional 层两种结构的性能。
  • results: 研究结果显示,混合模型在准确率和总性能方面表现最佳,而 transformer 层在 Speed 和推理时间方面表现最佳。
    Abstract Fine-grained classification is challenging due to the difficulty of finding discriminatory features. This problem is exacerbated when applied to identifying species within the same taxonomical class. This is because species are often sharing morphological characteristics that make them difficult to differentiate. We consider the taxonomical class of Insecta. The identification of insects is essential in biodiversity monitoring as they are one of the inhabitants at the base of many ecosystems. Citizen science is doing brilliant work of collecting images of insects in the wild giving the possibility to experts to create improved distribution maps in all countries. We have billions of images that need to be automatically classified and deep neural network algorithms are one of the main techniques explored for fine-grained tasks. At the SOTA, the field of deep learning algorithms is extremely fruitful, so how to identify the algorithm to use? We focus on Odonata and Coleoptera orders, and we propose an initial comparative study to analyse the two best-known layer structures for computer vision: transformer and convolutional layers. We compare the performance of T2TViT, a fully transformer-base, EfficientNet, a fully convolutional-base, and ViTAE, a hybrid. We analyse the performance of the three models in identical conditions evaluating the performance per species, per morph together with sex, the inference time, and the overall performance with unbalanced datasets of images from smartphones. Although we observe high performances with all three families of models, our analysis shows that the hybrid model outperforms the fully convolutional-base and fully transformer-base models on accuracy performance and the fully transformer-base model outperforms the others on inference speed and, these prove the transformer to be robust to the shortage of samples and to be faster at inference time.
    摘要 细粒度分类因难以找到判别性特征而具有挑战性,在识别同一分类学类别内的物种时尤为突出,因为这些物种往往共享形态特征,难以区分。我们研究昆虫纲(Insecta)的分类。昆虫识别对生物多样性监测至关重要,因为它们是许多生态系统底层的栖息者之一。公民科学在野外采集昆虫图像方面做出了出色的工作,使专家能够为各国绘制更完善的分布图。我们拥有数十亿张需要自动分类的图像,而深度神经网络算法是细粒度任务中最主要的技术之一。当前深度学习算法领域的成果极为丰富,那么应如何选择算法?我们聚焦于蜻蜓目(Odonata)和鞘翅目(Coleoptera),并提出了一项初步的比较研究,分析计算机视觉中两种最著名的层结构:Transformer 层与卷积层。我们比较了 T2TViT(纯 Transformer)、EfficientNet(纯卷积)和 ViTAE(混合结构)三种模型,在相同条件下分析它们按物种、按形态与性别的表现、推理时间,以及在来自智能手机的不均衡图像数据集上的整体性能。尽管三类模型都取得了较高的性能,但我们的分析表明,混合模型在准确率上优于纯卷积和纯 Transformer 模型,而纯 Transformer 模型在推理速度上优于其他模型,这说明 Transformer 对样本不足具有鲁棒性,且推理速度更快。

TwinLiteNet: An Efficient and Lightweight Model for Driveable Area and Lane Segmentation in Self-Driving Cars

  • paper_url: http://arxiv.org/abs/2307.10705
  • repo_url: https://github.com/chequanghuy/TwinLiteNet
  • paper_authors: Quang Huy Che, Dinh Phuc Nguyen, Minh Quan Pham, Duc Khai Lam
  • for: 本研究旨在提出一种轻量级的模型,用于自动驾驶车辆中的驾驶区域和车道线分割。
  • methods: 该模型使用了TwinLiteNet,一种低计算成本的模型,实现了高精度和高效性的分割结果。
  • results: 实验结果表明,TwinLiteNet 在 BDD100K 数据集上取得了与现代模型相当的性能(可行驶区域任务 mIoU 为 91.3%),而参数量仅约 40 万(0.4M),并在 RTX A5000 GPU 上达到 415 FPS。此外,TwinLiteNet 可以在计算能力有限的嵌入式设备上实时运行,特别是在 Jetson Xavier NX 上达到 60 FPS,使其成为自动驾驶汽车的理想解决方案。
    Abstract Semantic segmentation is a common task in autonomous driving to understand the surrounding environment. Driveable Area Segmentation and Lane Detection are particularly important for safe and efficient navigation on the road. However, original semantic segmentation models are computationally expensive and require high-end hardware, which is not feasible for embedded systems in autonomous vehicles. This paper proposes a lightweight model for the driveable area and lane line segmentation. TwinLiteNet is designed cheaply but achieves accurate and efficient segmentation results. We evaluate TwinLiteNet on the BDD100K dataset and compare it with modern models. Experimental results show that our TwinLiteNet performs similarly to existing approaches, requiring significantly fewer computational resources. Specifically, TwinLiteNet achieves a mIoU score of 91.3% for the Drivable Area task and 31.08% IoU for the Lane Detection task with only 0.4 million parameters and achieves 415 FPS on GPU RTX A5000. Furthermore, TwinLiteNet can run in real-time on embedded devices with limited computing power, especially since it achieves 60FPS on Jetson Xavier NX, making it an ideal solution for self-driving vehicles. Code is available: url{https://github.com/chequanghuy/TwinLiteNet}.
    摘要 语义分割是自动驾驶中用于理解周围环境的常见任务,其中可行驶区域分割与车道线检测对道路上安全高效的导航尤为重要。然而,原始的语义分割模型计算开销大、需要高端硬件,这对自动驾驶汽车中的嵌入式系统来说并不可行。本文提出了一种用于可行驶区域和车道线分割的轻量级模型 TwinLiteNet,其设计成本低廉,却能取得准确而高效的分割结果。我们在 BDD100K 数据集上评估 TwinLiteNet 并与现代模型进行比较。实验结果显示,TwinLiteNet 的表现与现有方法相近,但所需计算资源显著更少:在可行驶区域任务上 mIoU 达到 91.3%,在车道检测任务上 IoU 达到 31.08%,参数量仅 0.4M,并在 RTX A5000 GPU 上达到 415 FPS。此外,TwinLiteNet 可在计算能力有限的嵌入式设备上实时运行,尤其是在 Jetson Xavier NX 上达到 60 FPS,使其成为自动驾驶汽车的理想解决方案。代码见:https://github.com/chequanghuy/TwinLiteNet。

Reverse Knowledge Distillation: Training a Large Model using a Small One for Retinal Image Matching on Limited Data

  • paper_url: http://arxiv.org/abs/2307.10698
  • repo_url: https://github.com/SaharAlmahfouzNasser/MeDAL-Retina
  • paper_authors: Sahar Almahfouz Nasser, Nihar Gupte, Amit Sethi
  • for: 该论文旨在提出一种基于反向知识蒸馏的方法,用于在数据有限的情况下训练大型模型并避免过拟合。
  • methods: 该方法先改进一个基于 CNN 的模型,再用它来训练一个基于视觉 Transformer 编码器、计算量更大的模型。这种反向知识蒸馏出人意料地进一步提升了模型的泛化能力。
  • results: 该论文的实验结果表明,高维适应空间的匹配可以避免过拟合,而不是直接训练模型来匹配最终输出。此外,该论文还提供了一个公共数据集,用于推广适用于 retinal 图像关键点检测和匹配的算法研究。
    Abstract Retinal image matching plays a crucial role in monitoring disease progression and treatment response. However, datasets with matched keypoints between temporally separated pairs of images are not available in abundance to train transformer-based model. We propose a novel approach based on reverse knowledge distillation to train large models with limited data while preventing overfitting. Firstly, we propose architectural modifications to a CNN-based semi-supervised method called SuperRetina that help us improve its results on a publicly available dataset. Then, we train a computationally heavier model based on a vision transformer encoder using the lighter CNN-based model, which is counter-intuitive in the field knowledge-distillation research where training lighter models based on heavier ones is the norm. Surprisingly, such reverse knowledge distillation improves generalization even further. Our experiments suggest that high-dimensional fitting in representation space may prevent overfitting unlike training directly to match the final output. We also provide a public dataset with annotations for retinal image keypoint detection and matching to help the research community develop algorithms for retinal image applications.
    摘要 视网膜图像匹配在监测疾病进展和治疗反应方面起着关键作用。然而,带有时间间隔图像对之间匹配关键点标注的数据集并不充足,难以用于训练基于 Transformer 的模型。我们提出了一种基于反向知识蒸馏的新方法,在数据有限的情况下训练大模型并避免过拟合。首先,我们对基于 CNN 的半监督方法 SuperRetina 进行了结构改进,使其在公开数据集上取得更好的结果;接着,我们用这个较轻量的 CNN 模型去训练一个计算量更大的、基于视觉 Transformer 编码器的模型。这与知识蒸馏研究中惯用的"用较重的模型训练较轻的模型"相反,而令人意外的是,这种反向知识蒸馏进一步提升了泛化能力。我们的实验表明,在表示空间进行高维拟合可能比直接拟合最终输出更能避免过拟合。我们还公开了一个带有视网膜图像关键点检测与匹配标注的数据集,以帮助研究社区开发视网膜图像应用的算法。
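Reverse knowledge distillation here means the lighter CNN (the improved SuperRetina) acts as the teacher and the heavier ViT-based network is the student, trained to match the teacher's outputs in representation space rather than the final matches. A minimal sketch of such a distillation step is shown below; the MSE objective and the module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def reverse_kd_step(student_vit, teacher_cnn, images, optimizer):
    """One training step: heavy student mimics the light CNN teacher's
    descriptor map in feature space (no keypoint ground truth needed)."""
    teacher_cnn.eval()
    with torch.no_grad():
        t_desc = teacher_cnn(images)          # (B, D, H, W) teacher descriptors
    s_desc = student_vit(images)              # (B, D, H, W) student descriptors
    loss = nn.functional.mse_loss(s_desc, t_desc)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-in "teacher" and "student" modules.
teacher = nn.Conv2d(3, 64, 3, padding=1)
student = nn.Sequential(nn.Conv2d(3, 128, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(128, 64, 3, padding=1))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
loss = reverse_kd_step(student, teacher, torch.randn(2, 3, 128, 128), opt)
```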

SqueezerFaceNet: Reducing a Small Face Recognition CNN Even More Via Filter Pruning

  • paper_url: http://arxiv.org/abs/2307.10697
  • repo_url: None
  • paper_authors: Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Jose Maria Buades Rubio, Josef Bigun
  • for: 这个研究旨在提供一个轻量级的人脸识别网络,以满足现代移动设备上的各种数位服务需求。
  • methods: 本研究使用基于 Taylor 分数的网络剪枝方法,迭代地移除重要性得分较低的滤波器(filter),以缩减网络规模。
  • results: 研究发现,从一个本已很小的网络(1.24M 参数,基于 SqueezeNet)出发,这种剪枝方法还能进一步缩减多达 40% 的规模,而不会对人脸识别性能造成明显影响。
    Abstract The widespread use of mobile devices for various digital services has created a need for reliable and real-time person authentication. In this context, facial recognition technologies have emerged as a dependable method for verifying users due to the prevalence of cameras in mobile devices and their integration into everyday applications. The rapid advancement of deep Convolutional Neural Networks (CNNs) has led to numerous face verification architectures. However, these models are often large and impractical for mobile applications, reaching sizes of hundreds of megabytes with millions of parameters. We address this issue by developing SqueezerFaceNet, a light face recognition network which less than 1M parameters. This is achieved by applying a network pruning method based on Taylor scores, where filters with small importance scores are removed iteratively. Starting from an already small network (of 1.24M) based on SqueezeNet, we show that it can be further reduced (up to 40%) without an appreciable loss in performance. To the best of our knowledge, we are the first to evaluate network pruning methods for the task of face recognition.
    摘要 广泛的移动设备用于多种数字服务的使用,导致了可靠和实时人身认证的需求。在这种情况下,人脸识别技术作为可靠的用户验证方法得到了广泛的应用。由于移动设备内置了摄像头,并且与日常应用程序集成,人脸识别技术成为了可靠的用户验证方法。然而,这些模型往往很大,不适合移动应用程序,其参数数量可以达到百万个,占用内存空间很大。我们解决这个问题,通过开发一个轻量级的人脸识别网络——SqueezerFaceNet,其参数数量低于100万。我们采用基于Taylor分数的网络剪辑方法,通过逐步移除无关紧要的滤波器来实现这一目标。我们开始于一个已经很小的网络(1.24M),基于SqueezeNet,并证明可以进一步减少(达到40%),无需感知性的性能下降。到目前为止,我们是第一个评估网络剪辑方法的人脸识别任务。
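Taylor-score pruning estimates each filter's importance as the first-order change in the loss if that filter's activation were removed, roughly |activation x gradient| accumulated over a calibration set, and then drops the lowest-scoring filters iteratively. The sketch below computes such scores for one Conv2d layer; the normalization used by the authors is not reproduced and the names are assumptions.

```python
import torch
import torch.nn as nn

def taylor_filter_scores(model, layer, data_loader, loss_fn):
    """Accumulate |activation * gradient| per output filter of `layer`."""
    saved = {}

    def hook(module, inputs, output):
        output.retain_grad()          # keep the gradient of this activation
        saved["out"] = output

    handle = layer.register_forward_hook(hook)
    scores = torch.zeros(layer.out_channels)
    for images, labels in data_loader:
        model.zero_grad()
        loss_fn(model(images), labels).backward()
        out = saved["out"]                                   # (B, C, H, W)
        scores += (out * out.grad).abs().mean(dim=(0, 2, 3)).detach()
    handle.remove()
    return scores                                            # low => prunable

# Toy usage: score the first conv of a tiny model on random data.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5))
loader = [(torch.randn(4, 3, 32, 32), torch.randint(0, 5, (4,)))]
scores = taylor_filter_scores(model, model[0], loader, nn.CrossEntropyLoss())
k = max(1, int(0.1 * scores.numel()))        # e.g. prune the lowest 10%
prune_idx = scores.argsort()[:k]
```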

SLPD: Slide-level Prototypical Distillation for WSIs

  • paper_url: http://arxiv.org/abs/2307.10696
  • repo_url: https://github.com/carboxy/slpd
  • paper_authors: Zhimiao Yu, Tiancheng Lin, Yi Xu
  • for: 本文targets the problem of improving feature representation ability for whole slide pathological images (WSIs), with a focus on slide-level downstream tasks such as subtyping, grading, and staging.
  • methods: 本文提出了一种新的方法 called Slide-Level Prototypical Distillation (SLPD), which explores intra- and inter-slide semantic structures for context modeling on WSIs. The method iteratively performs intra-slide clustering to yield prototypes and assigns regions to their corresponding prototypes.
  • results: 本文在多个切片级基准上取得了最先进的结果,表明学习切片语义结构的表示可以作为 WSI 分析的合适代理任务。
    Abstract Improving the feature representation ability is the foundation of many whole slide pathological image (WSIs) tasks. Recent works have achieved great success in pathological-specific self-supervised learning (SSL). However, most of them only focus on learning patch-level representations, thus there is still a gap between pretext and slide-level downstream tasks, e.g., subtyping, grading and staging. Aiming towards slide-level representations, we propose Slide-Level Prototypical Distillation (SLPD) to explore intra- and inter-slide semantic structures for context modeling on WSIs. Specifically, we iteratively perform intra-slide clustering for the regions (4096x4096 patches) within each WSI to yield the prototypes and encourage the region representations to be closer to the assigned prototypes. By representing each slide with its prototypes, we further select similar slides by the set distance of prototypes and assign the regions by cross-slide prototypes for distillation. SLPD achieves state-of-the-art results on multiple slide-level benchmarks and demonstrates that representation learning of semantic structures of slides can make a suitable proxy task for WSI analysis. Code will be available at https://github.com/Carboxy/SLPD.
    摘要 提升特征表示能力是许多全切片病理图像(WSI)任务的基础。近期的工作在面向病理图像的自监督学习(SSL)方面取得了巨大成功,但其中大多数只关注 patch 级表示的学习,因此在预训练任务与切片级下游任务(如亚型分类、分级和分期)之间仍存在差距。为了获得切片级表示,我们提出了切片级原型蒸馏(SLPD),以探索 WSI 内部及 WSI 之间的语义结构并进行上下文建模。具体来说,我们对每张 WSI 内的区域(4096x4096 的 patch)迭代地进行切片内聚类以得到原型,并促使区域表示靠近其被分配的原型。在用原型表示每张切片之后,我们进一步通过原型集合之间的距离选择相似切片,并借助跨切片原型为区域分配蒸馏目标。SLPD 在多个切片级基准上取得了最先进的结果,表明学习切片语义结构的表示可以作为 WSI 分析的合适代理任务。代码将发布于 https://github.com/Carboxy/SLPD。
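The intra-slide step clusters the region embeddings of each WSI into a handful of prototypes, which then stand in for the whole slide when searching for similar slides and assigning cross-slide targets for distillation. The snippet below sketches the prototype construction and a set distance between two slides; k-means via scikit-learn and the Chamfer-style distance are assumptions for illustration, not necessarily the paper's exact choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def slide_prototypes(region_feats, num_prototypes=8):
    """region_feats: (N, D) embeddings of the 4096x4096 regions of one slide."""
    km = KMeans(n_clusters=num_prototypes, n_init=10).fit(region_feats)
    return km.cluster_centers_                        # (num_prototypes, D)

def slide_set_distance(protos_a, protos_b):
    """Symmetric Chamfer-style distance between two slides' prototype sets."""
    d = np.linalg.norm(protos_a[:, None, :] - protos_b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy usage: summarize two slides by their prototypes and compare them.
slide_a = np.random.randn(120, 256)
slide_b = np.random.randn(90, 256)
pa, pb = slide_prototypes(slide_a), slide_prototypes(slide_b)
print(slide_set_distance(pa, pb))
```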

Self2Self+: Single-Image Denoising with Self-Supervised Learning and Image Quality Assessment Loss

  • paper_url: http://arxiv.org/abs/2307.10695
  • repo_url: https://github.com/JK-the-Ko/Self2SelfPlus
  • paper_authors: Jaekyun Ko, Sanghwan Lee
  • for: 提高单图像的噪声去除性能
  • methods: 使用仅基于单张噪声图像的自监督学习:通过门控卷积(gated convolution)提取特征,并以无参考图像质量评估引导训练过程。
  • results: 在合成与真实数据集上均达到了最先进的去噪性能,凸显了所提方法在各类去噪任务中的有效性与实用性。
    Abstract Recently, denoising methods based on supervised learning have exhibited promising performance. However, their reliance on external datasets containing noisy-clean image pairs restricts their applicability. To address this limitation, researchers have focused on training denoising networks using solely a set of noisy inputs. To improve the feasibility of denoising procedures, in this study, we proposed a single-image self-supervised learning method in which only the noisy input image is used for network training. Gated convolution was used for feature extraction and no-reference image quality assessment was used for guiding the training process. Moreover, the proposed method sampled instances from the input image dataset using Bernoulli sampling with a certain dropout rate for training. The corresponding result was produced by averaging the generated predictions from various instances of the trained network with dropouts. The experimental results indicated that the proposed method achieved state-of-the-art denoising performance on both synthetic and real-world datasets. This highlights the effectiveness and practicality of our method as a potential solution for various noise removal tasks.
    摘要 近期,基于监督学习的去噪方法表现出色,但它们依赖包含噪声-干净图像对的外部数据集,限制了其适用性。为了解决这一限制,研究人员开始仅利用一组噪声输入来训练去噪网络。为进一步提高去噪流程的可行性,本研究提出了一种单图像自监督学习方法,仅使用噪声输入图像本身进行网络训练:采用门控卷积进行特征提取,并利用无参考图像质量评估来引导训练过程。此外,所提方法使用带有一定丢弃率的伯努利采样从输入图像中采样实例用于训练,最终结果由训练好的网络在多个带 dropout 的实例上的预测取平均得到。实验结果表明,所提方法在合成与真实数据集上均达到了最先进的去噪性能,凸显了其作为各类去噪任务潜在解决方案的有效性与实用性。
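Training on a single noisy image works by repeatedly dropping pixels with a Bernoulli mask, asking the network to predict the dropped pixels from the kept ones, and at test time averaging many mask-perturbed predictions. The sketch below shows that masking/averaging loop; the gated-convolution network and the IQA-guided loss of Self2Self+ are replaced here by a generic `net` placeholder, so this is an illustration of the sampling scheme, not the full method.

```python
import torch

def bernoulli_masked_step(net, noisy, optimizer, p_drop=0.3):
    """One self-supervised step: predict the pixels hidden by the mask."""
    mask = (torch.rand_like(noisy) > p_drop).float()   # 1 = kept, 0 = dropped
    pred = net(noisy * mask)
    # Supervise only on the dropped pixels, using the noisy image as target.
    loss = (((pred - noisy) ** 2) * (1.0 - mask)).sum() / (1.0 - mask).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def denoise_by_averaging(net, noisy, num_samples=50, p_drop=0.3):
    """Average predictions over many random masks."""
    preds = [net(noisy * (torch.rand_like(noisy) > p_drop).float())
             for _ in range(num_samples)]
    return torch.stack(preds).mean(dim=0)

# Toy usage with a tiny stand-in network.
net = torch.nn.Conv2d(3, 3, 3, padding=1)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
noisy = torch.rand(1, 3, 64, 64)
bernoulli_masked_step(net, noisy, opt)
clean_estimate = denoise_by_averaging(net, noisy, num_samples=8)
```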

Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2307.10685
  • repo_url: None
  • paper_authors: Yinghui Xing, Dexuan Kong, Shizhou Zhang, Geng Chen, Lingyan Ran, Peng Wang, Yanning Zhang
  • For: This paper is written for detecting camouflaged objects in images, which is a challenging task due to the similar patterns between the objects and the background.* Methods: The paper proposes a novel “pre-train, adapt and detect” paradigm, which uses a large pre-trained model to transfer knowledge from multi-modal data, and a lightweight parallel adapter to adjust the features for the downstream COD task.* Results: The paper shows that the proposed method outperforms existing state-of-the-art COD models by large margins on four challenging benchmark datasets, and also demonstrates the effectiveness of a multi-task learning scheme for improving the generalization ability of the model.
    Abstract Camouflaged object detection (COD), aiming to segment camouflaged objects which exhibit similar patterns with the background, is a challenging task. Most existing works are dedicated to establishing specialized modules to identify camouflaged objects with complete and fine details, while the boundary can not be well located for the lack of object-related semantics. In this paper, we propose a novel ``pre-train, adapt and detect" paradigm to detect camouflaged objects. By introducing a large pre-trained model, abundant knowledge learned from massive multi-modal data can be directly transferred to COD. A lightweight parallel adapter is inserted to adjust the features suitable for the downstream COD task. Extensive experiments on four challenging benchmark datasets demonstrate that our method outperforms existing state-of-the-art COD models by large margins. Moreover, we design a multi-task learning scheme for tuning the adapter to exploit the shareable knowledge across different semantic classes. Comprehensive experimental results showed that the generalization ability of our model can be substantially improved with multi-task adapter initialization on source tasks and multi-task adaptation on target tasks.
    摘要 隐形物体检测(COD),旨在分割隐形物体,其pattern和背景相似,是一项具有挑战性的任务。现有大多数工作都是专门设计用于完整、细节适应特定类型的特殊模块,而忽略物体边界的定位。在这篇论文中,我们提出了一种“预训练、适应并检测”的思路,用于检测隐形物体。我们引入了大量预训练的模型,以便直接将大量多样数据中学习的知识传递到COD任务。在下游任务中,我们插入了一个轻量级并行适应器,以适应COD任务的特点。我们对四个复杂的benchmark数据集进行了广泛的实验,结果显示,我们的方法在现有状态的COD模型之上大幅提高了性能。此外,我们还设计了一种多任务学习方案,用于调整适应器,以利用不同semantic类型之间的共享知识。实验结果表明,我们的模型的通用能力可以通过多任务适应器初始化和多任务调整来substantially提高。
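The lightweight parallel adapter sits beside the frozen pre-trained blocks and is the only part updated for the COD task. A minimal bottleneck-adapter sketch is shown below; the down/up projection sizes, the scaling factor, and the parallel wiring follow common adapter conventions and are assumptions here, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Bottleneck adapter added in parallel to a frozen pre-trained block."""
    def __init__(self, dim, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x):
        return self.scale * self.up(torch.relu(self.down(x)))

class AdaptedBlock(nn.Module):
    """Frozen backbone block + trainable parallel adapter."""
    def __init__(self, frozen_block, dim):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False          # only the adapter is tuned
        self.adapter = ParallelAdapter(dim)

    def forward(self, x):
        return self.block(x) + self.adapter(x)

# Toy usage: wrap a pretend transformer MLP block of width 768.
block = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
adapted = AdaptedBlock(block, dim=768)
out = adapted(torch.randn(4, 196, 768))
```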

Deep learning for classification of noisy QR codes

  • paper_url: http://arxiv.org/abs/2307.10677
  • repo_url: None
  • paper_authors: Rebecca Leygonie, Sylvain Lobry, Laurent Wendling (LIPADE)
  • for: 理解深度学习模型在抽象图像分类中的限制
  • methods: 使用QR码生成器生成QR码图像,并对图像进行深度学习模型训练和对比传统干扰方法
  • results: 研究结果表明,深度学习模型可以对抽象图像进行有效的分类,并且比传统干扰方法更具有抗干扰能力。
    Abstract We wish to define the limits of a classical classification model based on deep learning when applied to abstract images, which do not represent visually identifiable objects.QR codes (Quick Response codes) fall into this category of abstract images: one bit corresponding to one encoded character, QR codes were not designed to be decoded manually. To understand the limitations of a deep learning-based model for abstract image classification, we train an image classification model on QR codes generated from information obtained when reading a health pass. We compare a classification model with a classical (deterministic) decoding method in the presence of noise. This study allows us to conclude that a model based on deep learning can be relevant for the understanding of abstract images.
    摘要 我们想定义深度学习模型在抽象图像分类中的限制,当应用于不可识别目的的图像。QR codes(快速回应码)是这种抽象图像的一个例子:每个位元代表一个编码字符,QR codes不是设计来手动读取。通过对读取健康证书中的信息生成QR codes进行分类模型训练,我们可以评估深度学习模型在抽象图像分类中的限制,并与传统(决定的)解码方法进行比较,以评估模型在噪声存在的情况下的性能。这个研究可以帮助我们了解深度学习模型在抽象图像分类中的可行性。

Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image Sensors

  • paper_url: http://arxiv.org/abs/2307.10667
  • repo_url: None
  • paper_authors: Haechang Lee, Dongwon Park, Wongi Jeong, Kijeong Kim, Hyunwoo Je, Dongil Ryu, Se Young Chun
  • for: 本研究旨在提出一种可同时应用于传统 Bayer 与各种非 Bayer 彩色滤波阵列(CFA)模式、且适应不同工作模式的高效统一去马赛克方法。
  • methods: 本研究提出了基于知识学习的自适应模式去马赛克模型(KLAP),其针对每种 CFA 仅使用网络中 1% 的 CFA 自适应关键滤波器,即可有效处理所有 CFA 模式;在推理阶段进一步引入元学习(KLAP-M),以消除真实拍摄图像中传感器相关的未知伪影。
  • results: 实验结果表明,KLAP 与 KLAP-M 在 Bayer 及非 Bayer CFA 的合成与真实 RAW 数据上均达到了最先进的去马赛克性能。
    Abstract As the physical size of recent CMOS image sensors (CIS) gets smaller, the latest mobile cameras are adopting unique non-Bayer color filter array (CFA) patterns (e.g., Quad, Nona, QxQ), which consist of homogeneous color units with adjacent pixels. These non-Bayer sensors are superior to conventional Bayer CFA thanks to their changeable pixel-bin sizes for different light conditions but may introduce visual artifacts during demosaicing due to their inherent pixel pattern structures and sensor hardware characteristics. Previous demosaicing methods have primarily focused on Bayer CFA, necessitating distinct reconstruction methods for non-Bayer patterned CIS with various CFA modes under different lighting conditions. In this work, we propose an efficient unified demosaicing method that can be applied to both conventional Bayer RAW and various non-Bayer CFAs' RAW data in different operation modes. Our Knowledge Learning-based demosaicing model for Adaptive Patterns, namely KLAP, utilizes CFA-adaptive filters for only 1% key filters in the network for each CFA, but still manages to effectively demosaic all the CFAs, yielding comparable performance to the large-scale models. Furthermore, by employing meta-learning during inference (KLAP-M), our model is able to eliminate unknown sensor-generic artifacts in real RAW data, effectively bridging the gap between synthetic images and real sensor RAW. Our KLAP and KLAP-M methods achieved state-of-the-art demosaicing performance in both synthetic and real RAW data of Bayer and non-Bayer CFAs.
    摘要 随着近期 CMOS 图像传感器(CIS)的物理尺寸不断缩小,最新的移动相机开始采用由相邻同色像素单元构成的独特非 Bayer 彩色滤波阵列(CFA)模式(如 Quad、Nona、QxQ)。这类非 Bayer 传感器凭借可针对不同光照条件改变像素合并大小而优于传统 Bayer CFA,但由于其固有的像素排列结构和传感器硬件特性,在去马赛克过程中可能引入视觉伪影。以往的去马赛克方法主要针对 Bayer CFA,因此在不同光照条件下,各种 CFA 模式的非 Bayer CIS 需要各自不同的重建方法。在这项工作中,我们提出了一种高效的统一去马赛克方法,可同时应用于传统 Bayer RAW 与各种非 Bayer CFA 在不同工作模式下的 RAW 数据。我们基于知识学习的自适应模式去马赛克模型 KLAP,针对每种 CFA 仅使用网络中 1% 的 CFA 自适应关键滤波器,却仍能有效地对所有 CFA 去马赛克,取得与大规模模型相当的性能。此外,通过在推理阶段引入元学习(KLAP-M),我们的模型能够消除真实 RAW 数据中传感器相关的未知伪影,有效弥合合成图像与真实传感器 RAW 之间的差距。KLAP 与 KLAP-M 在 Bayer 及非 Bayer CFA 的合成与真实 RAW 数据上均取得了最先进的去马赛克性能。

Lighting up NeRF via Unsupervised Decomposition and Enhancement

  • paper_url: http://arxiv.org/abs/2307.10664
  • repo_url: https://github.com/onpix/LLNeRF
  • paper_authors: Haoyuan Wang, Xiaogang Xu, Ke Xu, Rynson WH. Lau
  • for: Enhancing the quality and detail of images captured in low-light scenes, and synthesizing high-quality normal-light novel views in an unsupervised manner.
  • methods: Proposes Low-Light NeRF (LLNeRF), which decomposes radiance-field learning so that illumination enhancement, noise reduction, and color correction are performed jointly with the NeRF optimization process.
  • results: Experiments show that the method produces novel views with proper lighting and vivid detail, outperforming existing low-light enhancement methods and NeRF baselines.
    Abstract Neural Radiance Field (NeRF) is a promising approach for synthesizing novel views, given a set of images and the corresponding camera poses of a scene. However, images photographed from a low-light scene can hardly be used to train a NeRF model to produce high-quality results, due to their low pixel intensities, heavy noise, and color distortion. Combining existing low-light image enhancement methods with NeRF methods also does not work well due to the view inconsistency caused by the individual 2D enhancement process. In this paper, we propose a novel approach, called Low-Light NeRF (or LLNeRF), to enhance the scene representation and synthesize normal-light novel views directly from sRGB low-light images in an unsupervised manner. The core of our approach is a decomposition of radiance field learning, which allows us to enhance the illumination, reduce noise and correct the distorted colors jointly with the NeRF optimization process. Our method is able to produce novel view images with proper lighting and vivid colors and details, given a collection of camera-finished low dynamic range (8-bits/channel) images from a low-light scene. Experiments demonstrate that our method outperforms existing low-light enhancement methods and NeRF methods.
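As a rough illustration of the decomposition described above, the sketch below splits the per-point radiance predicted by a NeRF-style MLP into a view-independent base color and an illumination gain that can be boosted at render time. This is only an assumption about the general flavor of the decomposition, not the paper's exact parameterization; all names are made up.

```python
import torch
import torch.nn as nn

class DecomposedRadianceHead(nn.Module):
    """Minimal sketch (not the paper's exact parameterization): the per-point color
    is split into a normalized base color and a scalar lighting gain, so that
    illumination can be enhanced while density/geometry stays shared."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.base_color = nn.Sequential(nn.Linear(feat_dim, 3), nn.Sigmoid())
        self.gain = nn.Sequential(nn.Linear(feat_dim, 1), nn.Softplus())
        self.density = nn.Sequential(nn.Linear(feat_dim, 1), nn.Softplus())

    def forward(self, feat, enhance=1.0):
        color = self.base_color(feat)            # view-independent chromaticity
        gain = self.gain(feat) * enhance         # illumination term (boosted at render time)
        return torch.clamp(color * gain, 0, 1), self.density(feat)

head = DecomposedRadianceHead()
feat = torch.randn(1024, 256)                    # per-sample MLP features along rays
rgb_lowlight, sigma = head(feat, enhance=1.0)    # reconstruct the dark training views
rgb_enhanced, _ = head(feat, enhance=4.0)        # render a brightened novel view
```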

RetouchingFFHQ: A Large-scale Dataset for Fine-grained Face Retouching Detection

  • paper_url: http://arxiv.org/abs/2307.10642
  • repo_url: None
  • paper_authors: Qichao Ying, Jiaxin Liu, Sheng Li, Haisheng Xu, Zhenxing Qian, Xinpeng Zhang
  • for: Addressing concerns about digital authenticity and deceptive advertising raised by the widespread use of face retouching filters on short-video platforms, by advancing face retouching detection.
  • methods: Introduces RetouchingFFHQ, a large-scale, fine-grained face retouching dataset containing over half a million conditionally-retouched images; it differs from previous datasets in its scale, quality, fine-grainedness, and customization.
  • results: By covering four typical retouching operations at different levels, the work extends binary face retouching detection to a fine-grained, multi-type, multi-level estimation problem; a Multi-granularity Attention Module (MAM) plugged into CNN backbones enhances cross-scale representation learning, and experiments with several baselines and the proposed method show decent detection performance on RetouchingFFHQ.
    Abstract The widespread use of face retouching filters on short-video platforms has raised concerns about the authenticity of digital appearances and the impact of deceptive advertising. To address these issues, there is a pressing need to develop advanced face retouching techniques. However, the lack of large-scale and fine-grained face retouching datasets has been a major obstacle to progress in this field. In this paper, we introduce RetouchingFFHQ, a large-scale and fine-grained face retouching dataset that contains over half a million conditionally-retouched images. RetouchingFFHQ stands out from previous datasets due to its large scale, high quality, fine-grainedness, and customization. By including four typical types of face retouching operations and different retouching levels, we extend the binary face retouching detection into a fine-grained, multi-retouching type, and multi-retouching level estimation problem. Additionally, we propose a Multi-granularity Attention Module (MAM) as a plugin for CNN backbones for enhanced cross-scale representation learning. Extensive experiments using different baselines as well as our proposed method on RetouchingFFHQ show decent performance on face retouching detection. With the proposed new dataset, we believe there is great potential for future work to tackle the challenging problem of real-world fine-grained face retouching detection.
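The paper only names the Multi-granularity Attention Module, so the block below is a hypothetical sketch of what a cross-scale attention plugin for a CNN backbone could look like: features pooled at several granularities each vote for a channel-attention vector that re-weights the original feature map. Treat every design choice here as an assumption rather than the paper's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityAttention(nn.Module):
    """Hypothetical sketch of a cross-scale attention plugin: the feature map is
    pooled at several granularities, each pooled view produces a channel-attention
    vector, and the fused attention re-weights the original features."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(channels, channels // 4), nn.ReLU(),
                           nn.Linear(channels // 4, channels)) for _ in scales]
        )

    def forward(self, x):                       # x: (B, C, H, W) from a CNN stage
        b, c, _, _ = x.shape
        att = 0
        for s, mlp in zip(self.scales, self.mlps):
            pooled = F.adaptive_avg_pool2d(x, s).flatten(2).mean(-1)  # (B, C) per granularity
            att = att + mlp(pooled)
        att = torch.sigmoid(att / len(self.scales)).view(b, c, 1, 1)
        return x * att                          # re-weighted features, same shape as input

block = MultiGranularityAttention(channels=256)
out = block(torch.randn(2, 256, 28, 28))
```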

Quantized Feature Distillation for Network Quantization

  • paper_url: http://arxiv.org/abs/2307.10638
  • repo_url: None
  • paper_authors: Ke Zhu, Yin-Yin He, Jianxin Wu
  • For: The paper aims to accelerate and trim full-precision neural network models using low bit approximations, and proposes a novel and highly effective quantization aware training (QAT) method called quantized feature distillation (QFD).
  • Methods: QFD trains a quantized representation as the teacher, then quantizes the network using knowledge distillation (KD). The method is more flexible and effective than previous quantization methods, and is applied to image classification, object detection, and image segmentation tasks.
  • Results: QFD surpasses existing methods by a noticeable margin on not only image classification but also object detection and segmentation tasks, and is the first method to quantize vision transformers for object detection and image segmentation tasks.
    Abstract Neural network quantization aims to accelerate and trim full-precision neural network models by using low bit approximations. Methods adopting the quantization aware training (QAT) paradigm have recently seen a rapid growth, but are often conceptually complicated. This paper proposes a novel and highly effective QAT method, quantized feature distillation (QFD). QFD first trains a quantized (or binarized) representation as the teacher, then quantize the network using knowledge distillation (KD). Quantitative results show that QFD is more flexible and effective (i.e., quantization friendly) than previous quantization methods. QFD surpasses existing methods by a noticeable margin on not only image classification but also object detection, albeit being much simpler. Furthermore, QFD quantizes ViT and Swin-Transformer on MS-COCO detection and segmentation, which verifies its potential in real world deployment. To the best of our knowledge, this is the first time that vision transformers have been quantized in object detection and image segmentation tasks.
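A hedged sketch of the two-step recipe described in the abstract: quantize features with a straight-through estimator and distill the student's quantized features toward a quantized teacher representation. The uniform symmetric quantizer and the MSE distillation loss are assumed choices, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def fake_quant(x, bits=4):
    """Uniform fake quantization with a straight-through estimator (an assumed choice)."""
    qmax = 2 ** bits - 1
    scale = x.detach().abs().max() / qmax + 1e-8
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (q - x).detach()          # gradients pass through unchanged

def qfd_loss(student_feat, teacher_feat, bits=4):
    """Feature distillation against a *quantized* teacher representation:
    the student's quantized features are pulled toward the teacher's quantized features."""
    s = fake_quant(student_feat, bits)
    t = fake_quant(teacher_feat, bits).detach()
    return F.mse_loss(s, t)

# Sketch of one QAT step (backbone/teacher definitions omitted):
# loss = task_loss(logits, labels) + qfd_loss(student_features, teacher_features, bits=4)
```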

Learning and Evaluating Human Preferences for Conversational Head Generation

  • paper_url: http://arxiv.org/abs/2307.10636
  • repo_url: https://github.com/dc3ea9f/PreferenceScore
  • paper_authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei
  • for: Proposing a new evaluation metric to support the development of conversational head video generation algorithms and systems.
  • methods: A learning-based metric, Preference Score (PS), is fitted to quantitative evaluations across different dimensions and requires no human annotation.
  • results: Experiments show that PS aligns well with human preference and remains robust and generalizable on unseen data, making it a valuable tool for advancing conversational head generation.
    Abstract A reliable and comprehensive evaluation metric that aligns with manual preference assessments is crucial for conversational head video synthesis methods development. Existing quantitative evaluations often fail to capture the full complexity of human preference, as they only consider limited evaluation dimensions. Qualitative evaluations and user studies offer a solution but are time-consuming and labor-intensive. This limitation hinders the advancement of conversational head generation algorithms and systems. In this paper, we propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions. PS can serve as a quantitative evaluation without the need for human annotation. Experimental results validate the superiority of Preference Score in aligning with human perception, and also demonstrate robustness and generalizability to unseen data, making it a valuable tool for advancing conversation head generation. We expect this metric could facilitate new advances in conversational head generation. Project Page: https://github.com/dc3ea9f/PreferenceScore.

Parallelization of a new embedded application for automatic meteor detection

  • paper_url: http://arxiv.org/abs/2307.10632
  • repo_url: None
  • paper_authors: Mathuran Kandeepan, Clara Ciocan, Adrien Cassagne, Lionel Lacassagne
  • for: Describing the parallelization of a new computer vision application that automatically detects meteors from non-stabilized cameras and noisy video sequences; the application targets weather balloons or airborne observation campaigns, so the final platform is a low-power system-on-chip (< 10 W) that must process the video stream in real time (> 25 frames per second).
  • methods: The application is first split into a task graph, and different parallelization techniques are then applied to it.
  • results: Experiments demonstrate the efficiency of the parallelization: on a Raspberry Pi 4 with an HD video sequence, the processing chain reaches 42 frames per second while consuming only 6 W.
    Abstract This article presents the methods used to parallelize a new computer vision application. The system is able to automatically detect meteor from non-stabilized cameras and noisy video sequences. The application is designed to be embedded in weather balloons or for airborne observation campaigns. Thus, the final target is a low power system-on-chip (< 10 Watts) while the software needs to compute a stream of frames in real-time (> 25 frames per second). For this, first the application is split in a tasks graph, then different parallelization techniques are applied. Experiment results demonstrate the efficiency of the parallelization methods. For instance, on the Raspberry Pi 4 and on a HD video sequence, the processing chain reaches 42 frames per second while it only consumes 6 Watts.
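To make the task-graph idea concrete, here is a minimal Python sketch of a three-stage processing chain run as a pipeline of processes connected by queues. The stage names (stabilize, threshold, track) are placeholders and not the paper's actual decomposition or parallelization strategy.

```python
import multiprocessing as mp

def stage(fn, q_in, q_out):
    """Run one task-graph node: pull a frame, process it, push the result."""
    while (item := q_in.get()) is not None:
        q_out.put(fn(item))
    q_out.put(None)                       # propagate end-of-stream

# Hypothetical stage functions standing in for the detection chain.
def stabilize(frame):  return frame       # e.g. global motion compensation
def threshold(frame):  return frame       # e.g. light-spot extraction
def track(frame):      return frame       # e.g. meteor track matching

if __name__ == "__main__":
    q0, q1, q2, q3 = (mp.Queue(maxsize=8) for _ in range(4))
    workers = [mp.Process(target=stage, args=(f, qi, qo))
               for f, qi, qo in [(stabilize, q0, q1), (threshold, q1, q2), (track, q2, q3)]]
    for w in workers:
        w.start()
    for frame_id in range(100):           # frames would come from the camera in practice
        q0.put(frame_id)
    q0.put(None)
    results = []
    while (r := q3.get()) is not None:
        results.append(r)
    for w in workers:
        w.join()
```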

Learning Discriminative Visual-Text Representation for Polyp Re-Identification

  • paper_url: http://arxiv.org/abs/2307.10625
  • repo_url: https://github.com/jeremyxsc/vt-reid
  • paper_authors: Suncheng Xiang, Cang Liu, Sijia Du, Dahong Qian
  • for: Improving the accuracy of colonoscopic polyp re-identification to aid the prevention and treatment of colorectal cancer.
  • methods: A training method named VT-ReID enriches visual representation learning with high-level semantic information, and a clustering mechanism introduces prior knowledge from textual data via contrastive learning.
  • results: Experiments show that the method outperforms current state-of-the-art approaches by a clear margin and generalizes better to new scenarios.
    Abstract Colonoscopic Polyp Re-Identification aims to match a specific polyp in a large gallery with different cameras and views, which plays a key role for the prevention and treatment of colorectal cancer in the computer-aided diagnosis. However, traditional methods mainly focus on the visual representation learning, while neglect to explore the potential of semantic features during training, which may easily leads to poor generalization capability when adapted the pretrained model into the new scenarios. To relieve this dilemma, we propose a simple but effective training method named VT-ReID, which can remarkably enrich the representation of polyp videos with the interchange of high-level semantic information. Moreover, we elaborately design a novel clustering mechanism to introduce prior knowledge from textual data, which leverages contrastive learning to promote better separation from abundant unlabeled text data. To the best of our knowledge, this is the first attempt to employ the visual-text feature with clustering mechanism for the colonoscopic polyp re-identification. Empirical results show that our method significantly outperforms current state-of-the art methods with a clear margin.

Joint Skeletal and Semantic Embedding Loss for Micro-gesture Classification

  • paper_url: http://arxiv.org/abs/2307.10624
  • repo_url: https://github.com/VUT-HFUT/MiGA2023_Track1
  • paper_authors: Kun Li, Dan Guo, Guoliang Chen, Xinge Peng, Meng Wang
  • for: Micro-gesture Classification Challenge at IJCAI 2023
  • methods: 3D-CNNs-based micro-gesture recognition network with skeletal and semantic embedding loss
  • results: Ranked 1st in the challenge with Top-1 accuracy 1.10% higher than the second-place team
    Abstract In this paper, we briefly introduce the solution of our team HFUT-VUT for the Micro-gesture Classification in the MiGA challenge at IJCAI 2023. The micro-gesture classification task aims at recognizing the action category of a given video based on the skeleton data. For this task, we propose a 3D-CNNs-based micro-gesture recognition network, which incorporates a skeletal and semantic embedding loss to improve action classification performance. Finally, we rank 1st in the Micro-gesture Classification Challenge, surpassing the second-place team in terms of Top-1 accuracy by 1.10%.

Quaternion tensor ring decomposition and application for color image inpainting

  • paper_url: http://arxiv.org/abs/2307.10620
  • repo_url: None
  • paper_authors: Jifei Miao, Kit Ian Kou
  • for: color image inpainting
  • methods: quaternion tensor ring (QTR) decomposition, low-rank quaternion tensor completion (LRQTC) model
  • results: highly competitive performance in color image inpainting tasks
    Abstract In recent years, tensor networks have emerged as powerful tools for solving large-scale optimization problems. One of the most promising tensor networks is the tensor ring (TR) decomposition, which achieves circular dimensional permutation invariance in the model through the utilization of the trace operation and equitable treatment of the latent cores. On the other hand, more recently, quaternions have gained significant attention and have been widely utilized in color image processing tasks due to their effectiveness in encoding color pixels. Therefore, in this paper, we propose the quaternion tensor ring (QTR) decomposition, which inherits the powerful and generalized representation abilities of the TR decomposition while leveraging the advantages of quaternions for color pixel representation. In addition to providing the definition of QTR decomposition and an algorithm for learning the QTR format, this paper also proposes a low-rank quaternion tensor completion (LRQTC) model and its algorithm for color image inpainting based on the QTR decomposition. Finally, extensive experiments on color image inpainting demonstrate that the proposed QTLRC method is highly competitive.
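For readers unfamiliar with the quaternion encoding of color pixels that QTR builds on, the snippet below shows the standard construction: a pixel becomes a pure quaternion 0 + r·i + g·j + b·k, and the Hamilton product acts on all three channels as a single algebraic operation. This is background material only, not the QTR decomposition or the LRQTC algorithm.

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) arrays."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ])

def pixel_to_quaternion(r, g, b):
    """A color pixel is encoded as a pure quaternion: 0 + r*i + g*j + b*k,
    so the three channels are processed as one algebraic object."""
    return np.array([0.0, r, g, b])

p = pixel_to_quaternion(0.8, 0.4, 0.1)
rot = np.array([np.cos(0.3), np.sin(0.3), 0.0, 0.0])   # an example unit quaternion
conj = rot * np.array([1, -1, -1, -1])                 # inverse of a unit quaternion
print(quat_mul(quat_mul(rot, p), conj))                # q p q^{-1}: mixes the G and B channels
```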

Hybrid Feature Embedding For Automatic Building Outline Extraction

  • paper_url: http://arxiv.org/abs/2307.10609
  • repo_url: None
  • paper_authors: Weihang Ran, Wei Yuan, Xiaodan Shi, Zipei Fan, Ryosuke Shibasaki
  • for: Improving the precision of building outlines extracted from high-resolution aerial images, for applications such as change detection and disaster assessment.
  • methods: A CNN- and Transformer-based model is combined with an active contour model, and a triple-branch decoder structure handles the different features produced by the encoder.
  • results: The model outperforms baseline models on two datasets, reaching 91.1% mIoU on Vaihingen and 83.8% on Bing Huts.
    Abstract Building outline extracted from high-resolution aerial images can be used in various application fields such as change detection and disaster assessment. However, traditional CNN model cannot recognize contours very precisely from original images. In this paper, we proposed a CNN and Transformer based model together with active contour model to deal with this problem. We also designed a triple-branch decoder structure to handle different features generated by encoder. Experiment results show that our model outperforms other baseline model on two datasets, achieving 91.1% mIoU on Vaihingen and 83.8% on Bing huts.

Physics-Driven Turbulence Image Restoration with Stochastic Refinement

  • paper_url: http://arxiv.org/abs/2307.10603
  • repo_url: https://github.com/vita-group/pirn
  • paper_authors: Ajay Jaiswal, Xingguang Zhang, Stanley H. Chan, Zhangyang Wang
  • for: This paper aims to improve the restoration of images degraded by atmospheric turbulence in long-range optical imaging systems.
  • methods: The Physics-integrated Restoration Network (PiRN) brings a physics-based turbulence simulator directly into the training process, helping the network disentangle the stochastic degradation from the underlying image; PiRN with Stochastic Refinement (PiRN-SR) further boosts perceptual quality.
  • results: The approach improves generalization to real-world unknown turbulence conditions and achieves state-of-the-art restoration in both pixel-wise accuracy and perceptual quality.
    Abstract Image distortion by atmospheric turbulence is a stochastic degradation, which is a critical problem in long-range optical imaging systems. A number of research has been conducted during the past decades, including model-based and emerging deep-learning solutions with the help of synthetic data. Although fast and physics-grounded simulation tools have been introduced to help the deep-learning models adapt to real-world turbulence conditions recently, the training of such models only relies on the synthetic data and ground truth pairs. This paper proposes the Physics-integrated Restoration Network (PiRN) to bring the physics-based simulator directly into the training process to help the network to disentangle the stochasticity from the degradation and the underlying image. Furthermore, to overcome the ``average effect" introduced by deterministic models and the domain gap between the synthetic and real-world degradation, we further introduce PiRN with Stochastic Refinement (PiRN-SR) to boost its perceptual quality. Overall, our PiRN and PiRN-SR improve the generalization to real-world unknown turbulence conditions and provide a state-of-the-art restoration in both pixel-wise accuracy and perceptual quality. Our codes are available at \url{https://github.com/VITA-Group/PiRN}.

Flatness-Aware Minimization for Domain Generalization

  • paper_url: http://arxiv.org/abs/2307.11108
  • repo_url: None
  • paper_authors: Xingxuan Zhang, Renzhe Xu, Han Yu, Yancheng Dong, Pengfei Tian, Peng Cu
  • for: Studying model robustness under distribution shift in domain generalization, and in particular how the choice of optimizer affects out-of-distribution performance.
  • methods: Proposes Flatness-Aware Minimization for Domain Generalization (FAD), which efficiently optimizes zeroth-order and first-order flatness simultaneously.
  • results: Experiments show that FAD outperforms alternatives on multiple domain generalization datasets and finds flatter minima than other flatness-aware optimization methods.
    Abstract Domain generalization (DG) seeks to learn robust models that generalize well under unknown distribution shifts. As a critical aspect of DG, optimizer selection has not been explored in depth. Currently, most DG methods follow the widely used benchmark, DomainBed, and utilize Adam as the default optimizer for all datasets. However, we reveal that Adam is not necessarily the optimal choice for the majority of current DG methods and datasets. Based on the perspective of loss landscape flatness, we propose a novel approach, Flatness-Aware Minimization for Domain Generalization (FAD), which can efficiently optimize both zeroth-order and first-order flatness simultaneously for DG. We provide theoretical analyses of the FAD's out-of-distribution (OOD) generalization error and convergence. Our experimental results demonstrate the superiority of FAD on various DG datasets. Additionally, we confirm that FAD is capable of discovering flatter optima in comparison to other zeroth-order and first-order flatness-aware optimization methods.
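The abstract does not give FAD's update rule, so the sketch below shows only the closely related zeroth-order flatness (sharpness-aware) step that such optimizers build on: ascend to an approximate worst-case point within a small radius, take the gradient there, and apply it at the original weights. FAD additionally optimizes first-order flatness, which is omitted here; the radius `rho` is an illustrative value.

```python
import torch

def flatness_aware_step(model, loss_fn, batch, optimizer, rho=0.05):
    """Sketch of a zeroth-order flatness-aware update (SAM-style); not FAD's exact rule."""
    x, y = batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    with torch.no_grad():
        grads = [p.grad.clone() if p.grad is not None else torch.zeros_like(p)
                 for p in model.parameters()]
        grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
        eps = [rho * g / grad_norm for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)                      # ascend to the local worst case
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()        # gradient of the perturbed loss
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                      # return to the original weights
    optimizer.step()
    return loss.item()
```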

SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval

  • paper_url: http://arxiv.org/abs/2307.10601
  • repo_url: None
  • paper_authors: Dongyun Lin, Yi Cheng, Aiyuan Guo, Shangbo Mao, Yiqun Li
  • for: 3D object retrieval.
  • methods: Self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet); deep features extracted from point clouds and multi-view images are fused by two types of feature aggregation modules, the In-Modality Aggregation Module (IMAM) and the Cross-Modality Aggregation Module (CMAM).
  • results: Superiority of the proposed SCA-PVNet over state-of-the-art methods, demonstrated by extensive experiments and analysis on three datasets ranging from small to large scale.
    Abstract To address 3D object retrieval, substantial efforts have been made to generate highly discriminative descriptors of 3D objects represented by a single modality, e.g., voxels, point clouds or multi-view images. It is promising to leverage the complementary information from multi-modality representations of 3D objects to further improve retrieval performance. However, multi-modality 3D object retrieval is rarely developed and analyzed on large-scale datasets. In this paper, we propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval. With deep features extracted from point clouds and multi-view images, we design two types of feature aggregation modules, namely the In-Modality Aggregation Module (IMAM) and the Cross-Modality Aggregation Module (CMAM), for effective feature fusion. IMAM leverages a self-attention mechanism to aggregate multi-view features while CMAM exploits a cross-attention mechanism to interact point cloud features with multi-view features. The final descriptor of a 3D object for object retrieval can be obtained via concatenating the aggregated features from both modules. Extensive experiments and analysis are conducted on three datasets, ranging from small to large scale, to show the superiority of the proposed SCA-PVNet over the state-of-the-art methods.
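A minimal sketch of the cross-modality aggregation idea: point-cloud tokens query multi-view image tokens through cross-attention and the result is fused residually. Dimensions, head counts, and the exact wiring are assumptions, not the paper's CMAM specification.

```python
import torch
import torch.nn as nn

class CrossModalityAggregation(nn.Module):
    """Sketch of a CMAM-style block: point-cloud tokens query multi-view image tokens
    through cross-attention, so each modality's descriptor absorbs the other's context."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pc_tokens, view_tokens):
        # pc_tokens:   (B, Np, C) features from a point-cloud encoder
        # view_tokens: (B, Nv, C) features from a multi-view image encoder
        fused, _ = self.attn(query=pc_tokens, key=view_tokens, value=view_tokens)
        return self.norm(pc_tokens + fused)            # residual fusion

cmam = CrossModalityAggregation()
pc = torch.randn(2, 128, 256)
views = torch.randn(2, 12, 256)                        # e.g. 12 rendered views
descriptor = cmam(pc, views).mean(dim=1)               # pooled retrieval descriptor
```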

Event Blob Tracking: An Asynchronous Real-Time Algorithm

  • paper_url: http://arxiv.org/abs/2307.10593
  • repo_url: https://github.com/ziweiwwang/event-blob-tracking
  • paper_authors: Ziwei Wang, Timothy Molloy, Pieter van Goor, Robert Mahony
  • for: Tracking fast-moving objects with event-based cameras.
  • methods: Raw events are processed asynchronously in real time, and a new algorithm combining nearest-neighbour data association with a Kalman filter tracks the state of each event blob.
  • results: The method achieves highly accurate tracking and event blob shape estimation, even under challenging lighting conditions and high-speed motion.
    Abstract Event-based cameras have become increasingly popular for tracking fast-moving objects due to their high temporal resolution, low latency, and high dynamic range. In this paper, we propose a novel algorithm for tracking event blobs using raw events asynchronously in real time. We introduce the concept of an event blob as a spatio-temporal likelihood of event occurrence where the conditional spatial likelihood is blob-like. Many real-world objects generate event blob data, for example, flickering LEDs such as car headlights or any small foreground object moving against a static or slowly varying background. The proposed algorithm uses a nearest neighbour classifier with a dynamic threshold criteria for data association coupled with a Kalman filter to track the event blob state. Our algorithm achieves highly accurate tracking and event blob shape estimation even under challenging lighting conditions and high-speed motions. The microsecond time resolution achieved means that the filter output can be used to derive secondary information such as time-to-contact or range estimation, that will enable applications to real-world problems such as collision avoidance in autonomous driving.
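The core loop described in the abstract, nearest-neighbour gating of incoming events followed by a Kalman update of the blob state, can be sketched with a constant-velocity model as below. The gate threshold, noise parameters, and state layout are illustrative assumptions; the paper also estimates blob shape, which is omitted here.

```python
import numpy as np

class EventBlobTracker:
    """Minimal constant-velocity Kalman filter over blob position, updated per event.
    The gating threshold and noise levels below are illustrative, not the paper's."""
    def __init__(self, x0, y0):
        self.x = np.array([x0, y0, 0.0, 0.0])          # [px, py, vx, vy]
        self.P = np.eye(4) * 10.0
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
        self.R = np.eye(2) * 2.0                       # event measurement noise (pixels^2)
        self.q = 1e-2                                  # process noise intensity

    def _F(self, dt):
        F = np.eye(4); F[0, 2] = dt; F[1, 3] = dt
        return F

    def update(self, ex, ey, dt, gate=15.0):
        F = self._F(dt)
        self.x = F @ self.x                            # predict to the event timestamp
        self.P = F @ self.P @ F.T + self.q * np.eye(4)
        innov = np.array([ex, ey]) - self.H @ self.x
        if np.linalg.norm(innov) > gate:               # nearest-neighbour style gating
            return False                               # event not associated with this blob
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innov
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return True

tracker = EventBlobTracker(120.0, 80.0)
for ex, ey, dt in [(121.0, 80.5, 1e-4), (122.3, 81.1, 1e-4), (300.0, 10.0, 1e-4)]:
    print(tracker.update(ex, ey, dt), tracker.x[:2])
```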

Reference-based Painterly Inpainting via Diffusion: Crossing the Wild Reference Domain Gap

  • paper_url: http://arxiv.org/abs/2307.10584
  • repo_url: None
  • paper_authors: Dejia Xu, Xingqian Xu, Wenyan Cong, Humphrey Shi, Zhangyang Wang
  • for: Proposing a new image inpainting task that implants novel reference objects into artworks, enabling creative painterly edits.
  • methods: A diffusion framework called RefPaint handles references with large domain gaps from the target artwork, using a ladder-side branch and a masked fusion mechanism to work with the inpainting mask.
  • results: Experiments show that RefPaint produces significantly better results than existing methods and enables reference-based painterly inpainting that would otherwise be difficult to achieve.
    Abstract Have you ever imagined how it would look if we placed new objects into paintings? For example, what would it look like if we placed a basketball into Claude Monet's ``Water Lilies, Evening Effect''? We propose Reference-based Painterly Inpainting, a novel task that crosses the wild reference domain gap and implants novel objects into artworks. Although previous works have examined reference-based inpainting, they are not designed for large domain discrepancies between the target and the reference, such as inpainting an artistic image using a photorealistic reference. This paper proposes a novel diffusion framework, dubbed RefPaint, to ``inpaint more wildly'' by taking such references with large domain gaps. Built with an image-conditioned diffusion model, we introduce a ladder-side branch and a masked fusion mechanism to work with the inpainting mask. By decomposing the CLIP image embeddings at inference time, one can manipulate the strength of semantic and style information with ease. Experiments demonstrate that our proposed RefPaint framework produces significantly better results than existing methods. Our method enables creative painterly image inpainting with reference objects that would otherwise be difficult to achieve. Project page: https://vita-group.github.io/RefPaint/

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

  • paper_url: http://arxiv.org/abs/2307.10567
  • repo_url: None
  • paper_authors: Qi Zhang, Sipeng Zheng, Qin Jin
  • for: Improving the accuracy of retrieving the time interval matching a language query in untrimmed videos, especially under a low Semantic Noise Ratio (SNR).
  • methods: A no-frills temporal video grounding model with two core modules: multi-scale neighboring attention and zoom-in boundary detection.
  • results: The model achieves competitive performance on different TVG benchmarks while offering faster inference and lighter model parameters than prior approaches.
    Abstract Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video. A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)", which results in worse performance with lower SNR. Prior works have addressed this challenge using sophisticated techniques. In this paper, we propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to only aggregate visual contexts from its neighbor, enabling the extraction of the most distinguishing information with multi-scale feature hierarchies from high-ratio noises. The zoom-in boundary detection then focuses on local-wise discrimination of the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model achieves competitive performance on different TVG benchmarks, while also having the advantage of faster inference speed and lighter model parameters, thanks to its lightweight architecture.
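As an illustration of the first module, the sketch below restricts self-attention over video tokens to a local temporal window and repeats it at a few window sizes. The mask-based implementation and the window sizes are assumptions; the zoom-in boundary detection module is not shown.

```python
import torch
import torch.nn as nn

class NeighboringAttention(nn.Module):
    """Sketch: self-attention over video tokens restricted to a local temporal window,
    repeated with several window sizes to form a multi-scale variant."""
    def __init__(self, dim=256, heads=4, windows=(5, 15, 45)):
        super().__init__()
        self.windows = windows
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in windows]
        )
        self.proj = nn.Linear(dim * len(windows), dim)

    def forward(self, x):                                  # x: (B, T, C) clip features
        T = x.size(1)
        idx = torch.arange(T, device=x.device)
        outs = []
        for w, attn in zip(self.windows, self.attns):
            # Mask out everything farther than w/2 frames from the query token.
            mask = (idx[None, :] - idx[:, None]).abs() > w // 2
            out, _ = attn(x, x, x, attn_mask=mask)
            outs.append(out)
        return self.proj(torch.cat(outs, dim=-1))

block = NeighboringAttention()
feats = torch.randn(2, 64, 256)                            # 64 video clip tokens
print(block(feats).shape)                                   # torch.Size([2, 64, 256])
```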

Interactive Segmentation for Diverse Gesture Types Without Context

  • paper_url: http://arxiv.org/abs/2307.10518
  • repo_url: None
  • paper_authors: Josh Myers-Dean, Yifei Fan, Brian Price, Wilson Chan, Danna Gurari
  • for: Addressing the limitations of existing interactive segmentation methods, which either support only one gesture type (e.g., clicks or scribbles) or require knowing which gesture type is being used.
  • methods: Proposes a simplified interactive segmentation task in which a user only marks an image, with any gesture type and without specifying it; the work also introduces the first interactive segmentation dataset with multiple gesture types and a new holistic evaluation metric.
  • results: Analysis of numerous interactive segmentation algorithms, including ones adapted to the new task, shows promising overall performance while highlighting areas for improvement; the dataset is publicly released at https://github.com/joshmyersdean/dig.
    Abstract Interactive segmentation entails a human marking an image to guide how a model either creates or edits a segmentation. Our work addresses limitations of existing methods: they either only support one gesture type for marking an image (e.g., either clicks or scribbles) or require knowledge of the gesture type being employed, and require specifying whether marked regions should be included versus excluded in the final segmentation. We instead propose a simplified interactive segmentation task where a user only must mark an image, where the input can be of any gesture type without specifying the gesture type. We support this new task by introducing the first interactive segmentation dataset with multiple gesture types as well as a new evaluation metric capable of holistically evaluating interactive segmentation algorithms. We then analyze numerous interactive segmentation algorithms, including ones adapted for our novel task. While we observe promising performance overall, we also highlight areas for future improvement. To facilitate further extensions of this work, we publicly share our new dataset at https://github.com/joshmyersdean/dig.

FedSoup: Improving Generalization and Personalization in Federated Learning via Selective Model Interpolation

  • paper_url: http://arxiv.org/abs/2307.10507
  • repo_url: None
  • paper_authors: Minghui Chen, Meirui Jiang, Qi Dou, Zehua Wang, Xiaoxiao Li
  • for: Addressing the trade-off between local and global performance that cross-silo federated learning (FL) models face under distribution shifts across data centers.
  • methods: Proposes a federated model soup method, i.e., selective interpolation of model parameters; during federated training, each client maintains its own global model pool and monitors the interpolated model between local and global weights, which alleviates overfitting and seeks flat minima.
  • results: Evaluated on retinal and pathological image classification tasks, the method achieves significant improvements in out-of-distribution generalization; code is available at https://github.com/ubc-tea/FedSoup.
    Abstract Cross-silo federated learning (FL) enables the development of machine learning models on datasets distributed across data centers such as hospitals and clinical research laboratories. However, recent research has found that current FL algorithms face a trade-off between local and global performance when confronted with distribution shifts. Specifically, personalized FL methods have a tendency to overfit to local data, leading to a sharp valley in the local model and inhibiting its ability to generalize to out-of-distribution data. In this paper, we propose a novel federated model soup method (i.e., selective interpolation of model parameters) to optimize the trade-off between local and global performance. Specifically, during the federated training phase, each client maintains its own global model pool by monitoring the performance of the interpolated model between the local and global models. This allows us to alleviate overfitting and seek flat minima, which can significantly improve the model's generalization performance. We evaluate our method on retinal and pathological image classification tasks, and our proposed method achieves significant improvements for out-of-distribution generalization. Our code is available at https://github.com/ubc-tea/FedSoup.
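A minimal sketch of the selective interpolation step on one client: interpolate between the local and global weights over a small grid of coefficients and keep the model that scores best on held-out local data. The coefficient grid and selection metric are assumptions, and the paper's pool of global models is simplified here to a single global model.

```python
import copy
import torch

@torch.no_grad()
def souped_model(local_model, global_model, alpha):
    """Parameter-wise interpolation: alpha=1 keeps the local model, alpha=0 the global one."""
    mixed = copy.deepcopy(local_model)
    for p_m, p_l, p_g in zip(mixed.parameters(), local_model.parameters(),
                             global_model.parameters()):
        p_m.copy_(alpha * p_l + (1.0 - alpha) * p_g)
    return mixed

@torch.no_grad()
def select_interpolation(local_model, global_model, val_loader, metric_fn,
                         alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Sketch of the selection step: try a small grid of interpolation coefficients
    and keep the one with the best held-out metric on this client's data."""
    best_alpha, best_score, best_model = None, -float("inf"), None
    for a in alphas:
        candidate = souped_model(local_model, global_model, a)
        score = metric_fn(candidate, val_loader)       # e.g. validation accuracy
        if score > best_score:
            best_alpha, best_score, best_model = a, score, candidate
    return best_model, best_alpha
```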

Is Grad-CAM Explainable in Medical Images?

  • paper_url: http://arxiv.org/abs/2307.10506
  • repo_url: None
  • paper_authors: Subhashis Suara, Aayush Jha, Pratik Sinha, Arif Ahmed Sekh
  • for: Written for the intersection of artificial intelligence (AI) and medical imaging, exploring the principles of Explainable Deep Learning and their relevance to medical imaging.
  • methods: Discusses various explainability techniques, with a focus on Grad-CAM, and their limitations in medical imaging applications.
  • results: The findings highlight the potential of Explainable Deep Learning and Grad-CAM for improving the accuracy and interpretability of deep learning models in medical imaging.
    Abstract Explainable Deep Learning has gained significant attention in the field of artificial intelligence (AI), particularly in domains such as medical imaging, where accurate and interpretable machine learning models are crucial for effective diagnosis and treatment planning. Grad-CAM is a baseline that highlights the most critical regions of an image used in a deep learning model's decision-making process, increasing interpretability and trust in the results. It is applied in many computer vision (CV) tasks such as classification and explanation. This study explores the principles of Explainable Deep Learning and its relevance to medical imaging, discusses various explainability techniques and their limitations, and examines medical imaging applications of Grad-CAM. The findings highlight the potential of Explainable Deep Learning and Grad-CAM in improving the accuracy and interpretability of deep learning models in medical imaging. The code will be made available.
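Since Grad-CAM itself is a standard, well-documented technique, a short PyTorch sketch is included here for reference: hook a late convolutional layer, pool the gradients of the target class score over space to weight that layer's activation maps, and upsample the ReLU-ed weighted sum into a heatmap. The ResNet-18 backbone and layer choice are placeholders, not the paper's setup.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
target_layer = model.layer4[-1]                      # a late conv block (placeholder choice)

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def grad_cam(x, class_idx=None):
    logits = model(x)
    cls = logits.argmax(1) if class_idx is None else torch.tensor([class_idx])
    model.zero_grad()
    logits[0, cls].sum().backward()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)        # global-average-pooled gradients
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heatmap in [0, 1]

heatmap = grad_cam(torch.randn(1, 3, 224, 224))       # replace with a preprocessed medical image
print(heatmap.shape)                                   # torch.Size([1, 1, 224, 224])
```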

Identifying Interpretable Subspaces in Image Representations

  • paper_url: http://arxiv.org/abs/2307.10504
  • repo_url: None
  • paper_authors: Neha Kalibhat, Shweta Bhardwaj, Bayan Bruss, Hamed Firooz, Maziar Sanjabi, Soheil Feizi
  • for: Proposing FALCON (Automatic Feature Explanation using Contrasting Concepts), an interpretability framework for explaining features of image representations.
  • methods: Highly activating cropped images for a target feature are captioned using a large captioning dataset (e.g., LAION-400m) and a pre-trained vision-language model (e.g., CLIP); each caption word is scored and ranked to yield a small set of shared, human-understandable concepts, and contrastive interpretation with lowly activating (counterfactual) images eliminates spurious concepts.
  • results: In state-of-the-art self-supervised and supervised models, less than 20% of the representation space can be explained by individual features, but features studied in groups become more interpretable and can be described with high-order scoring concepts through FALCON; the paper also presents a technique for transferring concepts from one explainable representation space to an unseen one via a simple linear transformation.
    Abstract We propose Automatic Feature Explanation using Contrasting Concepts (FALCON), an interpretability framework to explain features of image representations. For a target feature, FALCON captions its highly activating cropped images using a large captioning dataset (like LAION-400m) and a pre-trained vision-language model like CLIP. Each word among the captions is scored and ranked leading to a small number of shared, human-understandable concepts that closely describe the target feature. FALCON also applies contrastive interpretation using lowly activating (counterfactual) images, to eliminate spurious concepts. Although many existing approaches interpret features independently, we observe in state-of-the-art self-supervised and supervised models, that less than 20% of the representation space can be explained by individual features. We show that features in larger spaces become more interpretable when studied in groups and can be explained with high-order scoring concepts through FALCON. We discuss how extracted concepts can be used to explain and debug failures in downstream tasks. Finally, we present a technique to transfer concepts from one (explainable) representation space to another unseen representation space by learning a simple linear transformation.

Eye Disease Classification Using Deep Learning Techniques

  • paper_url: http://arxiv.org/abs/2307.10501
  • repo_url: https://github.com/akbarreis/Classification-eyes-disease
  • paper_authors: Tareq Babaqi, Manar Jaradat, Ayse Erdem Yildirim, Saif H. Al-Nimer, Daehan Won
  • for: Early detection and diagnosis of eye diseases to prevent vision loss or blindness.
  • methods: Utilized Convolutional Neural Networks (CNN) and transfer learning for multi-class classification.
  • results: Achieved high accuracy of 94%, outperforming traditional CNN at 84%.
    Abstract Eye is the essential sense organ for vision function. Due to the fact that certain eye disorders might result in vision loss, it is essential to diagnose and treat eye diseases early on. By identifying common eye illnesses and performing an eye check, eye care providers can safeguard patients against vision loss or blindness. Convolutional neural networks (CNN) and transfer learning were employed in this study to discriminate between a normal eye and one with diabetic retinopathy, cataract, or glaucoma disease. Using transfer learning for multi-class classification, high accuracy was achieved at 94% while the traditional CNN achieved 84% rate.
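A minimal sketch of the transfer-learning setup described above: an ImageNet-pretrained backbone is frozen and only a new 4-class head (normal, diabetic retinopathy, cataract, glaucoma) is trained. The ResNet-50 backbone, optimizer, and learning rate are assumptions; the paper does not specify its exact configuration here.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4          # normal, diabetic retinopathy, cataract, glaucoma

# Start from an ImageNet-pretrained backbone and replace only the classifier head.
model = models.resnet50(weights="IMAGENET1K_V2")
for p in model.parameters():
    p.requires_grad = False                                 # freeze the pretrained features
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)     # new trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of eye images (data loading not shown)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch just to show the shapes; real data would come from a DataLoader.
print(train_step(torch.randn(8, 3, 224, 224), torch.randint(0, NUM_CLASSES, (8,))))
```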

Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection

  • paper_url: http://arxiv.org/abs/2307.10499
  • repo_url: None
  • paper_authors: Guangzhi Wang, Yangyang Guo, Mohan Kankanhalli
  • For: The paper focuses on human-object interaction detection, which is crucial for human-centric scene understanding and has various applications.
  • Methods: The proposed method uses a Part Semantic Network (PSN) with a Conditional Part Attention (CPA) mechanism to automatically focus on the most informative human parts conditioned on the involved object, generating more semantically meaningful features for interaction recognition. Additionally, the Occluded Part Extrapolation (OPE) strategy is proposed to facilitate interaction recognition under occluded scenarios.
  • Results: The proposed method consistently outperforms prior approaches on the V-COCO and HICO-DET datasets without external data or extra annotations.
    Abstract Human-Object Interaction Detection is a crucial aspect of human-centric scene understanding, with important applications in various domains. Despite recent progress in this field, recognizing subtle and detailed interactions remains challenging. Existing methods try to use human-related clues to alleviate the difficulty, but rely heavily on external annotations or knowledge, limiting their practical applicability in real-world scenarios. In this work, we propose a novel Part Semantic Network (PSN) to solve this problem. The core of PSN is a Conditional Part Attention (CPA) mechanism, where human features are taken as keys and values, and the object feature is used as query for the computation in a cross-attention mechanism. In this way, our model learns to automatically focus on the most informative human parts conditioned on the involved object, generating more semantically meaningful features for interaction recognition. Additionally, we propose an Occluded Part Extrapolation (OPE) strategy to facilitate interaction recognition under occluded scenarios, which teaches the model to extrapolate detailed features from partially occluded ones. Our method consistently outperforms prior approaches on the V-COCO and HICO-DET datasets, without external data or extra annotations. Additional ablation studies validate the effectiveness of each component of our proposed method.
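The sketch below illustrates the Conditional Part Attention described in the abstract: the object feature acts as the query and the human-part features as keys and values, so the attention weights indicate which parts are most informative for the involved object. Feature dimensions and the number of parts are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalPartAttention(nn.Module):
    """Sketch of CPA: the object feature is the query, human-part features are keys
    and values, so attention picks the parts most informative for this object."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, object_feat, part_feats):
        # object_feat: (B, C)    one detected object per interaction candidate
        # part_feats:  (B, P, C) features of P human body parts
        q = self.q(object_feat).unsqueeze(1)                 # (B, 1, C)
        k, v = self.k(part_feats), self.v(part_feats)        # (B, P, C)
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)  # (B, 1, P)
        return (attn @ v).squeeze(1), attn.squeeze(1)        # aggregated part context, weights

cpa = ConditionalPartAttention()
ctx, weights = cpa(torch.randn(4, 256), torch.randn(4, 17, 256))  # e.g. 17 body parts
print(ctx.shape, weights.shape)                                   # (4, 256) (4, 17)
```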

Novel Batch Active Learning Approach and Its Application to Synthetic Aperture Radar Datasets

  • paper_url: http://arxiv.org/abs/2307.10495
  • repo_url: https://github.com/chapman20j/sar_bal
  • paper_authors: James Chapman, Bohan Chen, Zheng Tan, Jeff Calder, Kevin Miller, Andrea L. Bertozzi
  • for: This paper is written for improving the performance of machine learning methods on synthetic aperture radar (SAR) data using active learning techniques.
  • methods: The paper proposes a novel, two-part approach for batch active learning: Dijkstra's Annulus Core-Set (DAC) for core-set generation and LocalMax for batch sampling.
  • results: The approach achieves nearly identical accuracy to sequential active learning but is more efficient, proportional to the batch size; a pipeline built on transfer-learning feature embedding, graph learning, DAC, and LocalMax classifies the FUSAR-Ship and OpenSARShip datasets and outperforms state-of-the-art CNN-based methods.
    Abstract Active learning improves the performance of machine learning methods by judiciously selecting a limited number of unlabeled data points to query for labels, with the aim of maximally improving the underlying classifier's performance. Recent gains have been made using sequential active learning for synthetic aperture radar (SAR) data arXiv:2204.00005. In each iteration, sequential active learning selects a query set of size one while batch active learning selects a query set of multiple datapoints. While batch active learning methods exhibit greater efficiency, the challenge lies in maintaining model accuracy relative to sequential active learning methods. We developed a novel, two-part approach for batch active learning: Dijkstra's Annulus Core-Set (DAC) for core-set generation and LocalMax for batch sampling. The batch active learning process that combines DAC and LocalMax achieves nearly identical accuracy as sequential active learning but is more efficient, proportional to the batch size. As an application, a pipeline is built based on transfer learning feature embedding, graph learning, DAC, and LocalMax to classify the FUSAR-Ship and OpenSARShip datasets. Our pipeline outperforms the state-of-the-art CNN-based methods.

Confidence Estimation Using Unlabeled Data

  • paper_url: http://arxiv.org/abs/2307.10440
  • repo_url: https://github.com/topoxlab/consistency-ranking-loss
  • paper_authors: Chen Li, Xiaoling Hu, Chao Chen
  • for: Proposing a confidence estimation method for the semi-supervised setting, so that deep neural networks can be deployed more reliably in real-world applications.
  • methods: Prediction consistency over the course of training is used as a surrogate for confidence on unlabeled samples, and a consistency ranking loss is proposed for confidence estimation.
  • results: The method achieves state-of-the-art confidence estimation on both image classification and segmentation tasks, and its benefit is further demonstrated on a downstream active learning task.
    Abstract Overconfidence is a common issue for deep neural networks, limiting their deployment in real-world applications. To better estimate confidence, existing methods mostly focus on fully-supervised scenarios and rely on training labels. In this paper, we propose the first confidence estimation method for a semi-supervised setting, when most training labels are unavailable. We stipulate that even with limited training labels, we can still reasonably approximate the confidence of model on unlabeled samples by inspecting the prediction consistency through the training process. We use training consistency as a surrogate function and propose a consistency ranking loss for confidence estimation. On both image classification and segmentation tasks, our method achieves state-of-the-art performances in confidence estimation. Furthermore, we show the benefit of the proposed method through a downstream active learning task. The code is available at https://github.com/TopoXLab/consistency-ranking-loss
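A hedged sketch of the idea: prediction consistency across training checkpoints serves as a surrogate confidence target for unlabeled samples, and a pairwise margin ranking loss pushes the model's confidence scores to follow the same ordering. The exact form of the paper's consistency ranking loss may differ; everything below is illustrative.

```python
import torch
import torch.nn.functional as F

def training_consistency(pred_history):
    """Surrogate confidence target: how often each unlabeled sample kept the same
    predicted class across recorded training checkpoints. pred_history: (T, N) int labels."""
    final = pred_history[-1]
    return (pred_history == final).float().mean(dim=0)        # in [0, 1], shape (N,)

def consistency_ranking_loss(confidence, consistency, margin=0.05, num_pairs=256):
    """Pairwise ranking sketch: if sample i was more consistent during training than
    sample j, its predicted confidence should be higher by at least `margin`."""
    n = confidence.size(0)
    i = torch.randint(0, n, (num_pairs,))
    j = torch.randint(0, n, (num_pairs,))
    sign = torch.sign(consistency[i] - consistency[j])         # +1, 0, or -1
    return F.margin_ranking_loss(confidence[i], confidence[j], sign, margin=margin)

# Toy usage with made-up histories and confidence scores.
history = torch.randint(0, 10, (8, 128))          # predictions of 128 samples at 8 checkpoints
conf = torch.rand(128, requires_grad=True)        # confidence head output (placeholder)
loss = consistency_ranking_loss(conf, training_consistency(history))
loss.backward()
```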

POV-Surgery: A Dataset for Egocentric Hand and Tool Pose Estimation During Surgical Activities

  • paper_url: http://arxiv.org/abs/2307.10387
  • repo_url: None
  • paper_authors: Rui Wang, Sophokles Ktistakis, Siwei Zhang, Mirko Meboldt, Quentin Lohmeyer
  • for: POV-Surgery is a large-scale, synthetic, egocentric dataset for pose estimation of hands wearing surgical gloves together with three orthopedic surgical instruments (scalpel, friem, and diskplacer).
  • methods: The dataset consists of high-resolution RGB-D video streams with activity annotations, accurate 3D and 2D hand-object pose annotations, and 2D hand-object segmentation masks.
  • results: Current state-of-the-art methods fine-tuned on POV-Surgery are shown, through extensive evaluations, to generalize to real-life cases with surgical gloves and tools.
    Abstract The surgical usage of Mixed Reality (MR) has received growing attention in areas such as surgical navigation systems, skill assessment, and robot-assisted surgeries. For such applications, pose estimation for hand and surgical instruments from an egocentric perspective is a fundamental task and has been studied extensively in the computer vision field in recent years. However, the development of this field has been impeded by a lack of datasets, especially in the surgical field, where bloody gloves and reflective metallic tools make it hard to obtain 3D pose annotations for hands and objects using conventional methods. To address this issue, we propose POV-Surgery, a large-scale, synthetic, egocentric dataset focusing on pose estimation for hands with different surgical gloves and three orthopedic surgical instruments, namely scalpel, friem, and diskplacer. Our dataset consists of 53 sequences and 88,329 frames, featuring high-resolution RGB-D video streams with activity annotations, accurate 3D and 2D annotations for hand-object pose, and 2D hand-object segmentation masks. We fine-tune the current SOTA methods on POV-Surgery and further show the generalizability when applying to real-life cases with surgical gloves and tools by extensive evaluations. The code and the dataset are publicly available at batfacewayne.github.io/POV_Surgery_io/.

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

  • paper_url: http://arxiv.org/abs/2307.10373
  • repo_url: https://github.com/omerbt/TokenFlow
  • paper_authors: Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel
  • for: Improving the visual quality of video editing and the user's control over the generated content by leveraging text-to-image diffusion models.
  • methods: Consistency in the edited video is obtained by enforcing consistency in the diffusion feature space, explicitly propagating diffusion features across frames based on inter-frame correspondences, while preserving the spatial layout and motion of the input video.
  • results: The method achieves state-of-the-art editing results on a variety of real-world videos without any training or fine-tuning. Webpage: https://diffusion-tokenflow.github.io/
    Abstract The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/
    摘要 生成式AI的革命最近已经扩展到视频领域。然而,当前最先进的视频模型在视觉质量和用户对生成内容的控制方面仍落后于图像模型。在这项工作中,我们提出了一个利用文本到图像扩散模型进行文本驱动视频编辑的框架。具体而言,给定一个源视频和一个目标文本提示,我们的方法可以生成符合目标文本的高质量视频,同时保留输入视频的空间布局和运动。我们的方法基于一个关键观察:通过在扩散特征空间中强制保持一致性,即可获得编辑后视频的一致性。我们通过基于帧间对应关系(模型中现成可得)显式地传播扩散特征来实现这一点。因此,我们的框架无需任何训练或微调,并且可以与任何现成的文本到图像编辑方法配合使用。我们在多种真实视频上展示了最先进的编辑效果。网页:https://diffusion-tokenflow.github.io/
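
To make the core mechanism more concrete, the sketch below illustrates the idea of propagating diffusion features along inter-frame correspondences so that an edit applied to a keyframe stays consistent in other frames. The tensor shapes, the cosine nearest-neighbour matching, and the function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of TokenFlow-style feature propagation (illustrative only).
import torch
import torch.nn.functional as F

def propagate_features(src_tokens, key_tokens, edited_key_tokens):
    """Replace each source-frame token with the edited feature of its
    nearest-neighbour token in the keyframe, so edits stay consistent
    across frames.

    src_tokens:        (N, C) diffusion features of a non-key frame
    key_tokens:        (N, C) diffusion features of the keyframe (original video)
    edited_key_tokens: (N, C) diffusion features of the keyframe after editing
    """
    src = F.normalize(src_tokens, dim=-1)
    key = F.normalize(key_tokens, dim=-1)
    # Cosine-similarity correspondence field between the two frames.
    nn_idx = (src @ key.t()).argmax(dim=-1)           # (N,)
    # Propagate the *edited* keyframe features along those correspondences.
    return edited_key_tokens[nn_idx]                   # (N, C)

# Toy usage: 64 tokens with 320 channels.
src, key, key_edit = (torch.randn(64, 320) for _ in range(3))
out = propagate_features(src, key, key_edit)
print(out.shape)  # torch.Size([64, 320])
```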

DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering

  • paper_url: http://arxiv.org/abs/2307.10173
  • repo_url: https://github.com/DNA-Rendering/DNA-Rendering
  • paper_authors: Wei Cheng, Ruixiang Chen, Wanqi Yin, Siming Fan, Keyu Chen, Honglin He, Huiwen Luo, Zhongang Cai, Jingbo Wang, Yang Gao, Zhengming Yu, Zhengyu Lin, Daxuan Ren, Lei Yang, Ziwei Liu, Chen Change Loy, Chen Qian, Wayne Wu, Dahua Lin, Bo Dai, Kwan-Yee Lin
  • for: 该论文旨在提供一个大规模、高保真的以人为中心的渲染数据集,以进一步推动计算机视觉和计算机图形学领域的研究。
  • methods: 该论文构建了一个大规模、高保真的以人为中心的渲染数据集,并为每位受试者提供了丰富的数据资产,如2D/3D人体关键点、前景掩码、SMPLX模型、衣物/配饰材质、多视图图像和视频等。这些数据资产有助于提高当前方法在下游渲染任务中的精度。
  • results: 该论文提供了一个大规模、全面的以人为中心的渲染数据集,包括1500余名受试者、5000个动作序列和6750万(67.5M)帧的数据量。此外,论文还构建了一套专业的多视图捕捉系统,以获取高质量的数据资源用于任务训练和评估,并提供了大规模的定量基准,用于评估新视角合成、新姿态动画合成和新身份渲染等方法。
    Abstract Realistic human-centric rendering plays a key role in both computer vision and computer graphics. Rapid progress has been made in the algorithm aspect over the years, yet existing human-centric rendering datasets and benchmarks are rather impoverished in terms of diversity, which are crucial for rendering effect. Researchers are usually constrained to explore and evaluate a small set of rendering problems on current datasets, while real-world applications require methods to be robust across different scenarios. In this work, we present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering. DNA-Rendering presents several alluring attributes. First, our dataset contains over 1500 human subjects, 5000 motion sequences, and 67.5M frames' data volume. Second, we provide rich assets for each subject -- 2D/3D human body keypoints, foreground masks, SMPLX models, cloth/accessory materials, multi-view images, and videos. These assets boost the current method's accuracy on downstream rendering tasks. Third, we construct a professional multi-view system to capture data, which contains 60 synchronous cameras with max 4096 x 3000 resolution, 15 fps speed, and stern camera calibration steps, ensuring high-quality resources for task training and evaluation. Along with the dataset, we provide a large-scale and quantitative benchmark in full-scale, with multiple tasks to evaluate the existing progress of novel view synthesis, novel pose animation synthesis, and novel identity rendering methods. In this manuscript, we describe our DNA-Rendering effort as a revealing of new observations, challenges, and future directions to human-centric rendering. The dataset, code, and benchmarks will be publicly available at https://dna-rendering.github.io/
    摘要 逼真的以人为中心的渲染在计算机视觉和计算机图形学中都扮演着关键角色。多年来算法层面进展迅速,但现有的以人为中心的渲染数据集和基准在多样性方面相当匮乏,而多样性正是渲染效果的关键因素。研究人员通常只能在当前数据集上探索和评估少量渲染问题,而实际应用要求方法能够在不同场景下保持鲁棒。在这项工作中,我们提出了DNA-Rendering,一个面向神经演员渲染的大规模、高保真人体表演数据仓库。DNA-Rendering具有以下几个吸引人的特点:首先,我们的数据集包含超过1500名受试者、5000个动作序列和6750万(67.5M)帧的数据量。其次,我们为每名受试者提供了丰富的数据资产,包括2D/3D人体关键点、前景掩码、SMPLX模型、衣物/配饰材质、多视图图像和视频。这些资产有助于提高当前方法在下游渲染任务上的准确率。第三,我们构建了一套专业的多视图采集系统,包含60台同步相机,最高分辨率为4096 x 3000,帧率为15 fps,并经过严格的相机标定步骤,以保证用于任务训练和评估的高质量资源。除数据集之外,我们还提供了一个全面的大规模定量基准,包含多个任务,用于评估现有的新视角合成、新姿态动画合成和新身份渲染方法的进展。在这篇论文中,我们阐述了DNA-Rendering的工作,并揭示了以人为中心渲染领域的新观察、挑战和未来方向。数据集、代码和基准将在 https://dna-rendering.github.io/ 上公开。

Adversarial Latent Autoencoder with Self-Attention for Structural Image Synthesis

  • paper_url: http://arxiv.org/abs/2307.10166
  • repo_url: None
  • paper_authors: Jiajie Fan, Laure Vuaille, Hao Wang, Thomas Bäck
  • for: SA-ALAE is proposed to facilitate industrial engineering processes, particularly in generating feasible design images of complex engineering parts.
  • methods: SA-ALAE employs a novel Self-Attention Adversarial Latent Autoencoder architecture, which combines the strengths of adversarial training and latent space control to generate high-quality design images.
  • results: SA-ALAE is demonstrated to generate engineering blueprints in a real automotive design task, showcasing its potential in efficient industrial design exploration and novel variant generation.
    Abstract Generative Engineering Design approaches driven by Deep Generative Models (DGM) have been proposed to facilitate industrial engineering processes. In such processes, designs often come in the form of images, such as blueprints, engineering drawings, and CAD models depending on the level of detail. DGMs have been successfully employed for synthesis of natural images, e.g., displaying animals, human faces and landscapes. However, industrial design images are fundamentally different from natural scenes in that they contain rich structural patterns and long-range dependencies, which are challenging for convolution-based DGMs to generate. Moreover, DGM-driven generation process is typically triggered based on random noisy inputs, which outputs unpredictable samples and thus cannot perform an efficient industrial design exploration. We tackle these challenges by proposing a novel model Self-Attention Adversarial Latent Autoencoder (SA-ALAE), which allows generating feasible design images of complex engineering parts. With SA-ALAE, users can not only explore novel variants of an existing design, but also control the generation process by operating in latent space. The potential of SA-ALAE is shown by generating engineering blueprints in a real automotive design task.
    摘要 以深度生成模型(DGM)为驱动的生成式工程设计方法被提出,以促进工业工程流程。在这些流程中,设计通常以图像形式呈现,例如蓝图、工程图纸和CAD模型,具体形式取决于细节层级。DGM已被成功用于自然图像的合成,例如动物、人脸和风景。然而,工业设计图像与自然场景有着本质区别:它们包含丰富的结构模式和长距离依赖关系,这对基于卷积的DGM来说是难以生成的。此外,DGM驱动的生成过程通常由随机噪声输入触发,输出样本不可预测,因而无法进行高效的工业设计探索。为了解决这些挑战,我们提出了一种新模型,即自注意力对抗潜在自编码器(SA-ALAE),它能够生成复杂工程部件的可行设计图像。借助SA-ALAE,用户不仅可以探索现有设计的新变体,还可以通过在潜在空间中操作来控制生成过程。SA-ALAE的潜力在一个真实的汽车设计任务中通过生成工程蓝图得到了验证。
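
The abstract does not spell out the attention layer, but a SAGAN-style self-attention block is a common way to capture the long-range structural dependencies it mentions. The following PyTorch sketch is offered under that assumption; the channel reduction factor and residual gating are illustrative choices, not SA-ALAE's actual architecture.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over spatial positions; a sketch of the kind
    of layer SA-ALAE could use to model long-range structural dependencies."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)       # (B, HW, C//8)
        k = self.k(x).flatten(2)                       # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW)
        v = self.v(x).flatten(2)                       # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x

x = torch.randn(2, 64, 32, 32)
print(SelfAttention2d(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```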

Improving Multimodal Datasets with Image Captioning

  • paper_url: http://arxiv.org/abs/2307.10350
  • repo_url: None
  • paper_authors: Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt
  • for: 大型视觉语言模型的成功离不开海量的网络数据集。然而,原始网络数据噪声较大,现有的噪声过滤方法往往以牺牲数据多样性为代价。本文将caption质量视为噪声的一个主要来源,研究如何利用生成的caption提升文本信息贫乏的网络抓取数据点的效用。
  • methods: 通过探索原始caption与生成caption的不同混合策略,在1.28亿(128M)图像-文本对的候选池下,我们的最佳方法在ImageNet上比DataComp基准中提出的最佳过滤策略高出2%,在38个任务上平均高出4%;在Flickr和MS-COCO检索上也比之前的最佳策略好2倍。
  • results: 我们的实验表明,使用生成的caption可以提升多模态训练的性能。在对比不同的图像描述模型时,我们还发现模型在标准图像描述基准(如NoCaps CIDEr)上的表现,并不是其生成的caption用于多模态训练效果的可靠指标。最后,我们在DataComp的大规模设置(12.8亿图像-文本对)下进行了实验,揭示了合成文本的局限性,以及随着训练数据量增加,图像筛选的重要性。
    Abstract Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity.
    摘要 海量网络数据集在CLIP和Flamingo等大型视觉语言模型的成功中起到了关键作用。然而,原始网络数据噪声较大,而现有的降噪过滤方法往往以牺牲数据多样性为代价。我们的工作聚焦于caption质量这一主要噪声来源,研究生成的caption如何提升文本信息贫乏的网络抓取数据点的效用。通过探索原始caption与生成caption的不同混合策略,在1.28亿(128M)图像-文本对的候选池下,我们的方法在ImageNet上比DataComp基准提出的最佳过滤方法高出2%,在38个任务上平均高出4%;我们的最佳方法在Flickr和MS-COCO检索上也好2倍。随后,我们分析了合成caption成为有效文本监督来源的原因。在试验不同的图像描述模型时,我们还证明了模型在标准图像描述基准(如NoCaps CIDEr)上的表现并不是其生成caption用于多模态训练效用的可靠指标。最后,我们在DataComp的大规模设置(12.8亿图像-文本对)下使用生成caption的实验,揭示了合成文本的局限性,以及随着训练数据量增加、图像筛选的重要性。
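
As a rough illustration of what "mixing strategies for raw and generated captions" can mean in practice, the sketch below swaps in synthetic captions according to simple rules. The strategy names, the nondescript-text heuristic, and the probability parameter are assumptions for illustration, not the paper's or DataComp's exact recipes.

```python
import random

def mix_captions(samples, captioner, strategy="replace_nondescript", p=0.5, min_words=3):
    """Sketch of mixing raw web captions with model-generated ones.
    `captioner` is any callable image -> caption."""
    mixed = []
    for image, raw_caption in samples:
        if strategy == "replace_nondescript":
            # Swap in a synthetic caption only when the raw text is uninformative.
            use_synthetic = len(raw_caption.split()) < min_words
        elif strategy == "random_mix":
            use_synthetic = random.random() < p
        else:  # "synthetic_only"
            use_synthetic = True
        caption = captioner(image) if use_synthetic else raw_caption
        mixed.append((image, caption))
    return mixed

# Toy usage with a dummy captioner.
data = [("img0.jpg", "IMG_1234"), ("img1.jpg", "a dog running on the beach")]
print(mix_captions(data, captioner=lambda img: f"a photo ({img})"))
```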

Drone navigation and license place detection for vehicle location in indoor spaces

  • paper_url: http://arxiv.org/abs/2307.10165
  • repo_url: None
  • paper_authors: Moa Arvidsson, Sithichot Sawirot, Cristofer Englund, Fernando Alonso-Fernandez, Martin Torstensson, Boris Duran
  • for: 该研究旨在创建一个基于纳米无人机的解决方案,用于在室内停放区域中检测车辆的位置。
  • methods: 该方案使用墙面跟随算法进行导航,并用一个卷积神经网络(CNN)检测车牌。所有计算均在无人机上实时完成,无人机只需将位置和检测到的图像发送给主机。
  • results: 在八个测试案例中(包含多排车辆、不同的无人机速度和低光照条件),通过汇聚多次飞行的测量结果,该方案能够读取所有车牌。
    Abstract Millions of vehicles are transported every year, tightly parked in vessels or boats. To reduce the risks of associated safety issues like fires, knowing the location of vehicles is essential, since different vehicles may need different mitigation measures, e.g. electric cars. This work is aimed at creating a solution based on a nano-drone that navigates across rows of parked vehicles and detects their license plates. We do so via a wall-following algorithm, and a CNN trained to detect license plates. All computations are done in real-time on the drone, which just sends position and detected images that allow the creation of a 2D map with the position of the plates. Our solution is capable of reading all plates across eight test cases (with several rows of plates, different drone speeds, or low light) by aggregation of measurements across several drone journeys.
    摘要 每年有数百万辆汽车被运输,它们紧密地停放在船舶中。为了降低火灾等相关安全风险,掌握车辆的位置至关重要,因为不同类型的车辆(例如电动车)可能需要不同的处置措施。这项工作旨在创建一种基于纳米无人机的解决方案:无人机沿着成排停放的车辆飞行,并检测它们的车牌。我们通过墙面跟随算法实现导航,并用一个经过训练的CNN检测车牌。所有计算均在无人机上实时完成,无人机只需发送位置和检测到的图像,以便据此生成一张包含车牌位置的2D地图。通过汇聚多次飞行的测量结果,我们的方案在八个测试案例(包含多排车牌、不同的无人机速度和低光照)中均能读取所有车牌。
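
The abstract notes that all plates are read "by aggregation of measurements across several drone journeys". A toy sketch of such aggregation is shown below; the grid discretisation and majority voting are assumptions for illustration, not the authors' pipeline.

```python
from collections import defaultdict

def aggregate_detections(journeys, cell_size=1.0):
    """Fuse per-journey plate readings into a 2D map.
    Each journey is a list of (x, y, plate_text) tuples produced on board by the
    wall-following drone; readings falling into the same grid cell are merged
    by majority vote, so occasional misreads are filtered out."""
    votes = defaultdict(lambda: defaultdict(int))
    for journey in journeys:
        for x, y, plate in journey:
            cell = (round(x / cell_size), round(y / cell_size))
            votes[cell][plate] += 1
    # Keep the most frequently read plate per grid cell.
    return {cell: max(readings, key=readings.get) for cell, readings in votes.items()}

journeys = [
    [(0.4, 1.1, "ABC123"), (2.6, 1.0, "XYZ789")],
    [(0.3, 0.9, "ABC123"), (2.7, 1.2, "XYZ788")],  # one noisy misread
    [(2.6, 1.1, "XYZ789")],
]
print(aggregate_detections(journeys))
# {(0, 1): 'ABC123', (3, 1): 'XYZ789'}
```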

FABRIC: Personalizing Diffusion Models with Iterative Feedback

  • paper_url: http://arxiv.org/abs/2307.10159
  • repo_url: https://github.com/sd-fabric/fabric
  • paper_authors: Dimitri von Rütte, Elisabetta Fedele, Jonathan Thomm, Lukas Wolf
  • for: 该研究旨在探讨如何将人类反馈整合到基于扩散的文本到图像模型中,以提升用户体验和输出质量。
  • methods: 该研究提出了一种无需训练的方法,可应用于多种流行的扩散模型。它利用最常用架构中的自注意力层,以一组反馈图像对扩散过程进行条件控制,并通过多轮迭代反馈逐步改进生成结果。
  • results: 研究表明,通过多轮反馈,生成结果可以得到改进,并且可以适应用户的个性化需求。
    Abstract In an era where visual content generation is increasingly driven by machine learning, the integration of human feedback into generative models presents significant opportunities for enhancing user experience and output quality. This study explores strategies for incorporating iterative human feedback into the generative process of diffusion-based text-to-image models. We propose FABRIC, a training-free approach applicable to a wide range of popular diffusion models, which exploits the self-attention layer present in the most widely used architectures to condition the diffusion process on a set of feedback images. To ensure a rigorous assessment of our approach, we introduce a comprehensive evaluation methodology, offering a robust mechanism to quantify the performance of generative visual models that integrate human feedback. We show that generation results improve over multiple rounds of iterative feedback through exhaustive analysis, implicitly optimizing arbitrary user preferences. The potential applications of these findings extend to fields such as personalized content creation and customization.
    摘要 在视觉内容生成日益由机器学习驱动的时代,将人类反馈整合到生成模型中,为提升用户体验和输出质量提供了重要机会。本研究探讨了将迭代式人类反馈纳入基于扩散的文本到图像模型生成过程的策略。我们提出了FABRIC,一种无需训练、适用于多种流行扩散模型的方法,它利用最常用架构中的自注意力层,以一组反馈图像对扩散过程进行条件控制。为了严谨地评估我们的方法,我们引入了一套完整的评估方法学,为整合人类反馈的生成式视觉模型提供了可靠的性能量化机制。详尽的分析表明,生成结果会随着多轮迭代反馈而不断改进,从而隐式地优化任意的用户偏好。这些发现可应用于个性化内容创作和定制等领域。
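
A minimal sketch of the general idea of conditioning a U-Net self-attention layer on feedback images is given below, assuming that keys and values derived from the feedback images are appended to the layer's own keys and values and reweighted. The shapes, the scalar bias, and the function name are illustrative assumptions rather than FABRIC's actual code.

```python
import torch

def feedback_attention(q, k, v, fb_k, fb_v, liked=True, scale=0.8):
    """Append feedback-image keys/values to a self-attention layer and bias the
    attention toward them (liked images) or away from them (disliked).

    q, k, v:    (B, N, C) attention inputs of the current denoising step
    fb_k, fb_v: (B, M, C) keys/values extracted from feedback images
    """
    weight = scale if liked else -scale
    k_all = torch.cat([k, fb_k], dim=1)
    v_all = torch.cat([v, fb_v], dim=1)
    attn = (q @ k_all.transpose(1, 2)) / q.shape[-1] ** 0.5   # (B, N, N+M)
    # Bias attention scores on the feedback tokens before the softmax.
    attn[..., k.shape[1]:] += weight
    return torch.softmax(attn, dim=-1) @ v_all                 # (B, N, C)

q, k, v = (torch.randn(1, 64, 320) for _ in range(3))
fb_k, fb_v = (torch.randn(1, 128, 320) for _ in range(2))
print(feedback_attention(q, k, v, fb_k, fb_v).shape)  # torch.Size([1, 64, 320])
```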

Leveraging Visemes for Better Visual Speech Representation and Lip Reading

  • paper_url: http://arxiv.org/abs/2307.10157
  • repo_url: None
  • paper_authors: Javad Peymanfard, Vahid Saeedi, Mohammad Reza Mohammadi, Hossein Zeinali, Nasser Mozayani
  • for: 唇读任务,其应用包括语音识别、人机交互和安全系统等
  • methods: 利用visemes(发音相近的唇形组合)提取更具判别性和鲁棒性的视频特征,以实现更高精度的唇读
  • results: 在单词级和句子级唇读任务以及Arman-AV数据集上的视听语音识别任务中,所提方法均超越了当前最先进的方法,并将唇读词错误率(WER)相对降低了9.1%。
    Abstract Lip reading is a challenging task that has many potential applications in speech recognition, human-computer interaction, and security systems. However, existing lip reading systems often suffer from low accuracy due to the limitations of video features. In this paper, we propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading. We evaluate our approach on various tasks, including word-level and sentence-level lip reading, and audiovisual speech recognition using the Arman-AV dataset, a largescale Persian corpus. Our experimental results show that our viseme based approach consistently outperforms the state-of-theart methods in all these tasks. The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
    摘要 唇读是一项具有广泛应用前景的挑战性任务,其潜在应用包括语音识别、人机交互和安全系统。然而,受视频特征的限制,现有的唇读系统往往准确率较低。在本文中,我们提出了一种新方法,利用viseme(即发音相近的唇形组合)提取更具判别性和鲁棒性的视频特征用于唇读。我们在多项任务上评估了该方法,包括单词级和句子级唇读,以及在Arman-AV(一个大规模波斯语语料库)上的视听语音识别。实验结果表明,基于viseme的方法在所有这些任务中均持续优于当前最先进的方法。与此前最佳方法相比,所提方法将唇读词错误率(WER)相对降低了9.1%。
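
To make the viseme idea concrete, the toy sketch below collapses a phoneme sequence into viseme classes (groups of visually similar lip shapes) and merges repeats. The particular phoneme-to-viseme mapping is illustrative only; the paper derives its own grouping.

```python
# Illustrative phoneme-to-viseme grouping (an assumption, not the paper's clustering).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar", "n": "alveolar",
    "k": "velar", "g": "velar",
    "ch": "postalveolar", "jh": "postalveolar", "sh": "postalveolar",
    "aa": "open_vowel", "ae": "open_vowel",
    "iy": "spread_vowel", "ih": "spread_vowel",
    "uw": "rounded_vowel", "ow": "rounded_vowel",
}

def phonemes_to_visemes(phoneme_seq):
    """Collapse a phoneme sequence to viseme classes and merge repeats, yielding
    coarser but visually more separable targets for the video model."""
    visemes = [PHONEME_TO_VISEME.get(p, "other") for p in phoneme_seq]
    return [v for i, v in enumerate(visemes) if i == 0 or v != visemes[i - 1]]

print(phonemes_to_visemes(["p", "ae", "t"]))  # ['bilabial', 'open_vowel', 'alveolar']
print(phonemes_to_visemes(["b", "ae", "d"]))  # same viseme sequence: "pat" and "bad" look alike
```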

An Improved NeuMIP with Better Accuracy

  • paper_url: http://arxiv.org/abs/2307.10135
  • repo_url: None
  • paper_authors: Bowen Xue, Shuang Zhao, Henrik Wann Jensen, Zahra Montazeri
  • for: 增强神经反射(neural reflectance)模型的精度和细节表现,特别是对高光泽(glossy)材质的处理。
  • methods: 受NeRF启发,将输入数据编码到频率空间以更好地保留细节;引入基于梯度的损失,并在多个阶段使用,使其随学习进程自适应;此外还可选用基于Inception模块的解码器扩展,以换取更高精度。
  • results: 通过多种合成与真实示例,验证了该方法的有效性和精度,尤其是在处理高光泽材质方面。
    Abstract Neural reflectance models are capable of accurately reproducing the spatially-varying appearance of many real-world materials at different scales. However, existing methods have difficulties handling highly glossy materials. To address this problem, we introduce a new neural reflectance model which, compared with existing methods, better preserves not only specular highlights but also fine-grained details. To this end, we enhance the neural network performance by encoding input data to frequency space, inspired by NeRF, to better preserve the details. Furthermore, we introduce a gradient-based loss and employ it in multiple stages, adaptive to the progress of the learning phase. Lastly, we utilize an optional extension to the decoder network using the Inception module for more accurate yet costly performance. We demonstrate the effectiveness of our method using a variety of synthetic and real examples.
    摘要 神经反射模型能够在不同尺度上准确再现许多真实世界材质随空间变化的外观。然而,现有方法难以处理高光泽材质。为解决这一问题,我们提出了一种新的神经反射模型,与现有方法相比,它不仅能更好地保留镜面高光,还能保留细粒度的细节。为此,我们受NeRF启发,将输入数据编码到频率空间,以提升网络保留细节的能力。此外,我们引入了一种基于梯度的损失,并在多个阶段使用它,使其随学习阶段的进展自适应。最后,我们还可选地使用基于Inception模块的解码器网络扩展,以获得更高的精度,但代价是更高的计算开销。我们通过多种合成与真实示例展示了该方法的有效性。
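
The frequency-space encoding mentioned above is in the spirit of NeRF's positional encoding; a minimal sketch follows. The number of frequency bands and the inclusion of the raw input alongside the encoded features are assumptions, not the paper's exact configuration.

```python
import math
import torch

def frequency_encode(x, num_bands=6):
    """NeRF-style positional encoding of raw inputs (e.g. UV coordinates or
    query/light directions) into sine/cosine frequency features.
    x: (..., D) raw inputs; output: (..., D + 2 * num_bands * D)."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=x.dtype)        # 1, 2, 4, ...
    scaled = x[..., None, :] * freqs[:, None] * math.pi          # (..., num_bands, D)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return torch.cat([x, enc.flatten(-2)], dim=-1)

uv = torch.rand(4, 2)                 # four UV query locations
print(frequency_encode(uv).shape)     # torch.Size([4, 26])
```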

General vs. Long-Tailed Age Estimation: An Approach to Kill Two Birds with One Stone

  • paper_url: http://arxiv.org/abs/2307.10129
  • repo_url: None
  • paper_authors: Zenghao Bao, Zichang Tan, Jun Li, Jun Wan, Xibo Ma, Zhen Lei
  • for: 这个研究旨在提出一个简单、有效、灵活的训练范式,以提升人脸年龄估计的性能,并同时兼顾头部类别(多数样本)与尾部类别(少数样本)的表现。
  • methods: 该研究提出了一种名为GLAE的训练范式。GLAE简单、有效且灵活,由两部分组成,旨在用同一框架同时做好一般年龄估计与长尾年龄估计。
  • results: 结果显示,GLAE在Morph II上取得了最低的MAE和CMAE,分别为1.14岁和1.27岁;与此前最佳方法相比,MAE最多下降了34%,并且MAE首次接近1岁。此外,GLAE在CACD、MIVIA和Chalearn LAP 2015等其他年龄估计数据集上也显著优于当前最先进的方法。
    Abstract Facial age estimation has received a lot of attention for its diverse application scenarios. Most existing studies treat each sample equally and aim to reduce the average estimation error for the entire dataset, which can be summarized as General Age Estimation. However, due to the long-tailed distribution prevalent in the dataset, treating all samples equally will inevitably bias the model toward the head classes (usually the adult with a majority of samples). Driven by this, some works suggest that each class should be treated equally to improve performance in tail classes (with a minority of samples), which can be summarized as Long-tailed Age Estimation. However, Long-tailed Age Estimation usually faces a performance trade-off, i.e., achieving improvement in tail classes by sacrificing the head classes. In this paper, our goal is to design a unified framework to perform well on both tasks, killing two birds with one stone. To this end, we propose a simple, effective, and flexible training paradigm named GLAE, which is two-fold. Our GLAE provides a surprising improvement on Morph II, reaching the lowest MAE and CMAE of 1.14 and 1.27 years, respectively. Compared to the previous best method, MAE dropped by up to 34%, which is an unprecedented improvement, and for the first time, MAE is close to 1 year old. Extensive experiments on other age benchmark datasets, including CACD, MIVIA, and Chalearn LAP 2015, also indicate that GLAE outperforms the state-of-the-art approaches significantly.
    摘要 人脸年龄估计因其多样的应用场景而受到广泛关注。大多数现有研究对每个样本一视同仁,旨在降低整个数据集上的平均估计误差,这可以概括为一般年龄估计(General Age Estimation)。然而,由于数据集中普遍存在长尾分布,对所有样本一视同仁不可避免地会使模型偏向头部类别(通常是拥有多数样本的成年人)。受此驱动,一些工作提出应对每个类别一视同仁,以提升尾部类别(样本较少)的性能,这可以概括为长尾年龄估计(Long-tailed Age Estimation)。然而,长尾年龄估计通常面临性能权衡,即以牺牲头部类别为代价换取尾部类别的提升。在本文中,我们的目标是设计一个统一框架,在两类任务上都取得良好表现,一举两得。为此,我们提出了一种简单、有效且灵活的训练范式GLAE,它由两部分组成。GLAE在Morph II上带来了惊人的提升,分别取得了1.14岁和1.27岁的最低MAE和CMAE。与此前最佳方法相比,MAE最多下降了34%,这是前所未有的提升,并且MAE首次接近1岁。在CACD、MIVIA和Chalearn LAP 2015等其他年龄基准数据集上的大量实验同样表明,GLAE显著优于当前最先进的方法。
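
The results are reported in MAE and CMAE. Assuming CMAE denotes the class-wise (per-age) mean absolute error, the toy computation below shows why the two metrics diverge on long-tailed data: MAE is dominated by the head class, while CMAE exposes tail-class errors.

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error over all samples (the General Age Estimation view)."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return np.abs(pred - target).mean()

def cmae(pred, target):
    """Class-wise MAE: average the per-age MAE so rare (tail) ages count as much
    as frequent ones. Treating CMAE this way is an assumption about the metric."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    per_class = [np.abs(pred[target == a] - a).mean() for a in np.unique(target)]
    return float(np.mean(per_class))

# Toy example: many adults (head class) predicted well, few children (tail) poorly.
target = [30] * 8 + [5] * 2
pred   = [30] * 8 + [12, 13]
print(round(mae(pred, target), 2))   # 1.5  -> dominated by the head class
print(round(cmae(pred, target), 2))  # 3.75 -> exposes the tail-class error
```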

Two Approaches to Supervised Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.10123
  • repo_url: https://github.com/USTCPCS/CVPR2018_attention
  • paper_authors: Alexandre Benatti, Luciano da F. Costa
  • for: 图像分割 (image segmentation)
  • methods: 深度学习 (deep learning) 与多集神经元方法 (multiset neurons method)
  • results: 与深度学习相比,多集神经元方法在消耗更少计算资源的情况下取得了更高的准确率 (higher accuracy with less computational resources)。
    Abstract Though performed almost effortlessly by humans, segmenting 2D gray-scale or color images into respective regions of interest (e.g.~background, objects, or portions of objects) constitutes one of the greatest challenges in science and technology as a consequence of several effects including dimensionality reduction(3D to 2D), noise, reflections, shades, and occlusions, among many other possibilities. While a large number of interesting related approaches have been suggested along the last decades, it was mainly thanks to the recent development of deep learning that more effective and general solutions have been obtained, currently constituting the basic comparison reference for this type of operation. Also developed recently, a multiset-based methodology has been described that is capable of encouraging image segmentation performance combining spatial accuracy, stability, and robustness while requiring little computational resources (hardware and/or training and recognition time). The interesting features of the multiset neurons methodology mostly follow from the enhanced selectivity and sensitivity, as well as good robustness to data perturbations and outliers, allowed by the coincidence similarity index on which the multiset approach to supervised image segmentation is founded. After describing the deep learning and multiset neurons approaches, the present work develops comparison experiments between them which are primarily aimed at illustrating their respective main interesting features when applied to the adopted specific type of data and parameter configurations. While the deep learning approach confirmed its potential for performing image segmentation, the alternative multiset methodology allowed for enhanced accuracy while requiring little computational resources.
    摘要 虽然人类几乎可以毫不费力地完成,但将2D灰度或彩色图像分割为相应的感兴趣区域(例如背景、物体或物体的部分)仍是科学和技术中的一大挑战,其原因包括维度降低(3D到2D)、噪声、反射、阴影和遮挡等诸多因素。虽然过去几十年中已经提出了大量有趣的相关方法,但主要得益于近年来深度学习的发展,才获得了更有效、更通用的解决方案,其目前构成了此类任务的基本比较基准。此外,最近提出的一种基于多重集(multiset)的方法能够在仅需少量计算资源(硬件和/或训练与识别时间)的情况下,兼顾空间准确性、稳定性和鲁棒性,从而提升图像分割性能。多集神经元方法的这些优点主要源自其所基于的重合相似度指数(coincidence similarity index)带来的更强的选择性和敏感性,以及对数据扰动和离群值的良好鲁棒性。在介绍深度学习方法和多集神经元方法之后,本文对二者进行了对比实验,主要目的是展示它们在所采用的特定数据类型和参数配置下各自的主要特点。实验结果表明,深度学习方法证实了其进行图像分割的潜力,而作为替代方案的多集方法在仅需少量计算资源的同时获得了更高的准确率。
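
The coincidence similarity index underpinning the multiset approach is, as far as we understand related work by the same group, a product of a multiset Jaccard index and an interiority (overlap) index; the sketch below states that formulation explicitly as an assumption rather than the paper's definition.

```python
import numpy as np

def coincidence_similarity(x, y, eps=1e-12):
    """Sketch of a multiset coincidence similarity between two non-negative
    feature vectors: multiset Jaccard (sum of minima over sum of maxima)
    multiplied by an interiority index. This formulation is an assumption."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    common = np.minimum(x, y).sum()
    jaccard = common / (np.maximum(x, y).sum() + eps)
    interiority = common / (min(x.sum(), y.sum()) + eps)
    return jaccard * interiority

pixel = np.array([0.9, 0.1, 0.4])   # e.g. a colour feature to classify
proto = np.array([1.0, 0.0, 0.5])   # class reference taken from training strokes
print(round(coincidence_similarity(pixel, proto), 3))
```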

Boundary-Refined Prototype Generation: A General End-to-End Paradigm for Semi-Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.10097
  • repo_url: https://github.com/djh-dzxw/BRPG
  • paper_authors: Junhao Dong, Zhu Meng, Delong Liu, Zhicheng Zhao, Fei Su
  • for: 本研究主要针对 semi-supervised semantic segmentation 领域,提出了一种基于 Prototype-based Classification 的新方法,以提高现有方法的性能。
  • methods: 该方法提出了一种名为边界精化原型生成(boundary-refined prototype generation, BRPG)的新方法:基于置信度阈值,将高置信度特征与低置信度特征分别进行采样和聚类,从而生成更贴近类别边界的原型。此外,还提出了一种自适应原型优化策略,对特征分布分散的类别进行原型增强。
  • results: 在PASCAL VOC 2012和Cityscapes数据集上的大量实验表明,该方法优于当前最先进的方法,并具有良好的可扩展性。
    Abstract Prototype-based classification is a classical method in machine learning, and recently it has achieved remarkable success in semi-supervised semantic segmentation. However, the current approach isolates the prototype initialization process from the main training framework, which appears to be unnecessary. Furthermore, while the direct use of K-Means algorithm for prototype generation has considered rich intra-class variance, it may not be the optimal solution for the classification task. To tackle these problems, we propose a novel boundary-refined prototype generation (BRPG) method, which is incorporated into the whole training framework. Specifically, our approach samples and clusters high- and low-confidence features separately based on a confidence threshold, aiming to generate prototypes closer to the class boundaries. Moreover, an adaptive prototype optimization strategy is introduced to make prototype augmentation for categories with scattered feature distributions. Extensive experiments on the PASCAL VOC 2012 and Cityscapes datasets demonstrate the superiority and scalability of the proposed method, outperforming the current state-of-the-art approaches. The code is available at xxxxxxxxxxxxxx.
    摘要 基于原型的分类是机器学习中的一种经典方法,最近在半监督语义分割中取得了显著成功。然而,当前方法将原型初始化过程与主训练框架相互隔离,这似乎并无必要。此外,尽管直接使用K-Means算法生成原型考虑了丰富的类内方差,但它未必是分类任务的最优解。为了解决这些问题,我们提出了一种新的边界精化原型生成(BRPG)方法,并将其融入整个训练框架。具体而言,我们的方法基于置信度阈值,分别对高置信度和低置信度特征进行采样和聚类,旨在生成更贴近类别边界的原型。此外,我们引入了一种自适应原型优化策略,对特征分布分散的类别进行原型增强。在PASCAL VOC 2012和Cityscapes数据集上的大量实验表明了所提方法的优越性和可扩展性,其性能超越了当前最先进的方法。代码可在 xxxxxxxxxxxxxx 获取。
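
A minimal sketch of the confidence-split prototype generation idea for one class is given below: high- and low-confidence features are clustered separately so that some prototypes land nearer the class boundary. The threshold, the cluster counts, and the use of scikit-learn's KMeans are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def boundary_refined_prototypes(features, confidences, tau=0.8, k_high=2, k_low=2):
    """Cluster high- and low-confidence features of one class separately and
    return the concatenated cluster centers as that class's prototypes.

    features:    (N, D) per-pixel features of a single class
    confidences: (N,)   classifier confidence for those pixels
    """
    high, low = features[confidences >= tau], features[confidences < tau]
    protos = []
    for group, k in ((high, k_high), (low, k_low)):
        if len(group) >= k:  # skip a split that has too few samples to cluster
            protos.append(KMeans(n_clusters=k, n_init=10).fit(group).cluster_centers_)
    return np.vstack(protos)  # (<= k_high + k_low, D)

feats = np.random.rand(200, 16)
conf = np.random.rand(200)
print(boundary_refined_prototypes(feats, conf).shape)  # (4, 16)
```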

Make-A-Volume: Leveraging Latent Diffusion Models for Cross-Modality 3D Brain MRI Synthesis

  • paper_url: http://arxiv.org/abs/2307.10094
  • repo_url: None
  • paper_authors: Lingting Zhu, Zeyue Xue, Zhenchao Jin, Xian Liu, Jingzhen He, Ziwei Liu, Lequan Yu
  • for: 该论文旨在提出一种新的跨模态医学影像合成方法,以更好地处理不同模态的医学影像。
  • methods: 该论文提出了一个基于扩散模型的框架,利用潜在扩散(latent diffusion)模型学习逐切片(slice-wise)映射,并插入体积层以缓解体积不一致问题。
  • results: 实验结果表明,该方法能够更好地合成3D医学影像,并保持体积一致性。
    Abstract Cross-modality medical image synthesis is a critical topic and has the potential to facilitate numerous applications in the medical imaging field. Despite recent successes in deep-learning-based generative models, most current medical image synthesis methods rely on generative adversarial networks and suffer from notorious mode collapse and unstable training. Moreover, the 2D backbone-driven approaches would easily result in volumetric inconsistency, while 3D backbones are challenging and impractical due to the tremendous memory cost and training difficulty. In this paper, we introduce a new paradigm for volumetric medical data synthesis by leveraging 2D backbones and present a diffusion-based framework, Make-A-Volume, for cross-modality 3D medical image synthesis. To learn the cross-modality slice-wise mapping, we employ a latent diffusion model and learn a low-dimensional latent space, resulting in high computational efficiency. To enable the 3D image synthesis and mitigate volumetric inconsistency, we further insert a series of volumetric layers in the 2D slice-mapping model and fine-tune them with paired 3D data. This paradigm extends the 2D image diffusion model to a volumetric version with a slightly increasing number of parameters and computation, offering a principled solution for generic cross-modality 3D medical image synthesis. We showcase the effectiveness of our Make-A-Volume framework on an in-house SWI-MRA brain MRI dataset and a public T1-T2 brain MRI dataset. Experimental results demonstrate that our framework achieves superior synthesis results with volumetric consistency.
    摘要 跨模态医学影像合成是一个关键课题,有潜力推动医学影像领域的众多应用。尽管基于深度学习的生成模型近来取得了成功,但当前大多数医学影像合成方法依赖于生成对抗网络,饱受模式坍塌和训练不稳定之苦。此外,以2D骨干为主的方法容易导致体积不一致,而3D骨干由于巨大的显存开销和训练难度而并不实用。在本文中,我们提出了一种利用2D骨干进行体积医学数据合成的新范式,并提出了一个基于扩散模型的框架Make-A-Volume,用于跨模态3D医学影像合成。为了学习跨模态的逐切片映射,我们采用潜在扩散模型并学习一个低维潜在空间,从而获得很高的计算效率。为了实现3D影像合成并缓解体积不一致问题,我们进一步在2D切片映射模型中插入一系列体积层,并使用配对的3D数据对其进行微调。这一范式将2D图像扩散模型扩展为体积版本,仅需略微增加参数量和计算量,为通用的跨模态3D医学影像合成提供了一个有原则的解决方案。我们在一个内部的SWI-MRA脑MRI数据集和一个公开的T1-T2脑MRI数据集上展示了Make-A-Volume框架的有效性。实验结果表明,我们的框架能够在保持体积一致性的同时取得更优的合成结果。
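
The abstract describes inserting volumetric layers into a 2D slice-mapping model and fine-tuning them on paired 3D data. One plausible form of such a layer, sketched below as an assumption, is a 1D convolution along the slice axis initialised as an identity mapping so that the pretrained 2D behaviour is preserved before 3D fine-tuning; the actual layer design may differ.

```python
import torch
import torch.nn as nn

class VolumetricLayer(nn.Module):
    """Mix features of shape (B, C, S, H, W) along the slice axis S with a 1D
    convolution that starts as identity, so inserting it does not change the
    pretrained 2D slice-wise model until fine-tuning updates it."""
    def __init__(self, channels: int, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        nn.init.dirac_(self.conv.weight)   # identity initialisation
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):                   # x: (B, C, S, H, W)
        b, c, s, h, w = x.shape
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, s)  # conv along slices
        x = self.conv(x)
        return x.reshape(b, h, w, c, s).permute(0, 3, 4, 1, 2)

vol = torch.randn(1, 8, 16, 32, 32)         # 16 slices of 32x32 latent features
print(VolumetricLayer(8)(vol).shape)        # torch.Size([1, 8, 16, 32, 32])
```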