cs.CV - 2023-08-15

CCD-3DR: Consistent Conditioning in Diffusion for Single-Image 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.07837
  • repo_url: None
  • paper_authors: Yan Di, Chenyangguang Zhang, Pengyuan Wang, Guangyao Zhai, Ruida Zhang, Fabian Manhardt, Benjamin Busam, Xiangyang Ji, Federico Tombari
  • for: The paper presents a diffusion-based method for reconstructing a sparse 3D point cloud of an object captured in a single RGB image.
  • methods: The method exploits a novel centered diffusion probabilistic model for consistent local feature conditioning. The noise and the sampled point cloud are constrained to a subspace in which the point cloud center remains unchanged during the forward and reverse diffusion processes, so the stable center serves as an anchor for aligning each point with its local projection-based features (see the sketch below).
  • results: On the synthetic ShapeNet-R2N2 benchmark, CCD-3DR outperforms all competitors by a large margin, with over 40% improvement. The authors also report results on the real-world Pix3D dataset to demonstrate the method's potential in practical applications.
    Abstract In this paper, we present a novel shape reconstruction method leveraging diffusion model to generate 3D sparse point cloud for the object captured in a single RGB image. Recent methods typically leverage global embedding or local projection-based features as the condition to guide the diffusion model. However, such strategies fail to consistently align the denoised point cloud with the given image, leading to unstable conditioning and inferior performance. In this paper, we present CCD-3DR, which exploits a novel centered diffusion probabilistic model for consistent local feature conditioning. We constrain the noise and sampled point cloud from the diffusion model into a subspace where the point cloud center remains unchanged during the forward diffusion process and reverse process. The stable point cloud center further serves as an anchor to align each point with its corresponding local projection-based features. Extensive experiments on synthetic benchmark ShapeNet-R2N2 demonstrate that CCD-3DR outperforms all competitors by a large margin, with over 40% improvement. We also provide results on real-world dataset Pix3D to thoroughly demonstrate the potential of CCD-3DR in real-world applications. Codes will be released soon
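The centering mechanism summarized above lends itself to a compact illustration: if the noise added during forward diffusion is projected onto the zero-mean subspace, the centroid of an already-centered point cloud never drifts. The sketch below is a minimal PyTorch rendering of that idea under assumed tensor shapes, not the authors' code.

```python
import torch

def centered_noise_like(x):
    """Sample Gaussian noise and project it onto the zero-mean subspace, so that
    adding it to a zero-centered point cloud never moves the centroid."""
    eps = torch.randn_like(x)                    # (B, N, 3)
    return eps - eps.mean(dim=1, keepdim=True)   # remove the per-cloud mean

def forward_diffuse(x0, alpha_bar_t):
    """One draw from q(x_t | x_0) with center-preserving noise; alpha_bar_t is the
    cumulative noise-schedule coefficient at step t (a scalar tensor, assumed)."""
    x0 = x0 - x0.mean(dim=1, keepdim=True)       # start from a centered cloud
    eps = centered_noise_like(x0)
    xt = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps                               # centroid of xt stays at the origin
```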

Learning Better Keypoints for Multi-Object 6DoF Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.07827
  • repo_url: None
  • paper_authors: Yangzheng Wu, Michael Greenspan
  • for: This work investigates the impact of pre-defined keypoints on pose estimation and finds that accuracy and efficiency can be improved by training a graph network (KeyGNet) to select a set of dispersed keypoints.
  • methods: KeyGNet is supervised by a combined loss measuring both Wasserstein distance and dispersion, and learns color and geometry features of the target objects to estimate optimal keypoint locations.
  • results: Experiments show that keypoints selected by KeyGNet improve accuracy for all evaluation metrics on all seven datasets tested. Notably, on the Occlusion LINEMOD dataset, ADD(S) improves by +16.4% on PVN3D.
    Abstract We investigate the impact of pre-defined keypoints for pose estimation, and found that accuracy and efficiency can be improved by training a graph network to select a set of disperse keypoints with similarly distributed votes. These votes, learned by a regression network to accumulate evidence for the keypoint locations, can be regressed more accurately compared to previous heuristic keypoint algorithms. The proposed KeyGNet, supervised by a combined loss measuring both Wasserstein distance and dispersion, learns the color and geometry features of the target objects to estimate optimal keypoint locations. Experiments demonstrate the keypoints selected by KeyGNet improved the accuracy for all evaluation metrics of all seven datasets tested, for three keypoint voting methods. The challenging Occlusion LINEMOD dataset notably improved ADD(S) by +16.4% on PVN3D, and all core BOP datasets showed an AR improvement for all objects, of between +1% and +21.5%. There was also a notable increase in performance when transitioning from single object to multiple object training using KeyGNet keypoints, essentially eliminating the SISO-MIMO gap for Occlusion LINEMOD.

ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition

  • paper_url: http://arxiv.org/abs/2308.07815
  • repo_url: https://github.com/cool-xuan/imbalanced_sam
  • paper_authors: Yixuan Zhou, Yi Qu, Xing Xu, Hengtao Shen
  • for: Addressing the challenge of class imbalance in recognition tasks, specifically the generalization issues that arise when tail classes have limited training data.
  • methods: Proposes a class-aware smoothness optimization algorithm called Imbalanced-SAM (ImbSAM) that leverages class priors to restrict the generalization scope of the class-agnostic SAM, improving generalization targeting tail classes.
  • results: Demonstrates remarkable performance improvements for tail classes and anomaly detection in two prototypical applications of class-imbalanced recognition: long-tailed classification and semi-supervised anomaly detection.
    Abstract Class imbalance is a common challenge in real-world recognition tasks, where the majority of classes have few samples, also known as tail classes. We address this challenge with the perspective of generalization and empirically find that the promising Sharpness-Aware Minimization (SAM) fails to address generalization issues under the class-imbalanced setting. Through investigating this specific type of task, we identify that its generalization bottleneck primarily lies in the severe overfitting for tail classes with limited training data. To overcome this bottleneck, we leverage class priors to restrict the generalization scope of the class-agnostic SAM and propose a class-aware smoothness optimization algorithm named Imbalanced-SAM (ImbSAM). With the guidance of class priors, our ImbSAM specifically improves generalization targeting tail classes. We also verify the efficacy of ImbSAM on two prototypical applications of class-imbalanced recognition: long-tailed classification and semi-supervised anomaly detection, where our ImbSAM demonstrates remarkable performance improvements for tail classes and anomaly. Our code implementation is available at https://github.com/cool-xuan/Imbalanced_SAM.
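The class-aware restriction of SAM described above can be sketched as a two-step update in which only the tail-class portion of the loss drives the sharpness-aware perturbation, while head classes contribute an ordinary gradient. This is a plausible reading of the paper under assumed helpers (criterion, optimizer), not the released implementation in the linked repository.

```python
import torch

def imbsam_step(model, images, labels, tail_classes, criterion, optimizer, rho=0.05):
    """One hedged sketch of a class-aware SAM update (ImbSAM-style)."""
    logits = model(images)
    is_tail = torch.isin(labels, tail_classes)            # tail set derived from class priors
    loss_tail = criterion(logits[is_tail], labels[is_tail]) if is_tail.any() else logits.sum() * 0.0

    # 1) Ascend to a worst-case point in a rho-ball, driven by the tail loss only.
    loss_tail.backward()
    grads = [p for p in model.parameters() if p.grad is not None]
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack([p.grad.norm() for p in grads])) + 1e-12
        eps = [rho * p.grad / grad_norm for p in grads]
        for p, e in zip(grads, eps):
            p.add_(e)
    optimizer.zero_grad()

    # 2) The full (head + tail) loss at the perturbed weights gives the update direction.
    criterion(model(images), labels).backward()
    with torch.no_grad():
        for p, e in zip(grads, eps):
            p.sub_(e)                                      # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
```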

Grasp Transfer based on Self-Aligning Implicit Representations of Local Surfaces

  • paper_url: http://arxiv.org/abs/2308.07807
  • repo_url: None
  • paper_authors: Ahmet Tekden, Marc Peter Deisenroth, Yasemin Bekiroglu
  • for: This work addresses the problem of transferring a grasp experience or demonstration to a novel object that shares shape similarities with objects the robot has previously encountered.
  • methods: A single expert grasp demonstration is used to learn an implicit local surface representation model from a small dataset of object meshes; at inference time, grasps are transferred to novel objects by identifying the surfaces most geometrically similar to the demonstrated one.
  • results: Evaluations indicate that grasp transfer to unseen object categories succeeds both in simulation and in real-world experiments, with better spatial precision and grasp accuracy than a baseline approach.
    Abstract Objects we interact with and manipulate often share similar parts, such as handles, that allow us to transfer our actions flexibly due to their shared functionality. This work addresses the problem of transferring a grasp experience or a demonstration to a novel object that shares shape similarities with objects the robot has previously encountered. Existing approaches for solving this problem are typically restricted to a specific object category or a parametric shape. Our approach, however, can transfer grasps associated with implicit models of local surfaces shared across object categories. Specifically, we employ a single expert grasp demonstration to learn an implicit local surface representation model from a small dataset of object meshes. At inference time, this model is used to transfer grasps to novel objects by identifying the most geometrically similar surfaces to the one on which the expert grasp is demonstrated. Our model is trained entirely in simulation and is evaluated on simulated and real-world objects that are not seen during training. Evaluations indicate that grasp transfer to unseen object categories using this approach can be successfully performed both in simulation and real-world experiments. The simulation results also show that the proposed approach leads to better spatial precision and grasp accuracy compared to a baseline approach.

Neuromorphic Seatbelt State Detection for In-Cabin Monitoring with Event Cameras

  • paper_url: http://arxiv.org/abs/2308.07802
  • repo_url: None
  • paper_authors: Paul Kielty, Cian Ryan, Mehdi Sefidgar Dilmaghani, Waseem Shariff, Joe Lemley, Peter Corcoran
  • for: This paper investigates how event cameras can be used for seatbelt state detection in in-cabin driver monitoring systems.
  • methods: A dataset of synthetic neuromorphic frames is generated with an event simulator from a near-infrared (NIR) dataset, and a detection algorithm based on a recurrent convolutional neural network is trained on it (a sketch of such a model follows the abstract below).
  • results: In a binary classification task, fastened/unfastened frames were identified with F1 scores of 0.989 and 0.944 on the simulated and real test sets, respectively. When the problem was extended to also classify the action of fastening/unfastening the seatbelt, F1 scores of 0.964 and 0.846 were achieved.
    Abstract Neuromorphic vision sensors, or event cameras, differ from conventional cameras in that they do not capture images at a specified rate. Instead, they asynchronously log local brightness changes at each pixel. As a result, event cameras only record changes in a given scene, and do so with very high temporal resolution, high dynamic range, and low power requirements. Recent research has demonstrated how these characteristics make event cameras extremely practical sensors in driver monitoring systems (DMS), enabling the tracking of high-speed eye motion and blinks. This research provides a proof of concept to expand event-based DMS techniques to include seatbelt state detection. Using an event simulator, a dataset of 108,691 synthetic neuromorphic frames of car occupants was generated from a near-infrared (NIR) dataset, and split into training, validation, and test sets for a seatbelt state detection algorithm based on a recurrent convolutional neural network (CNN). In addition, a smaller set of real event data was collected and reserved for testing. In a binary classification task, the fastened/unfastened frames were identified with an F1 score of 0.989 and 0.944 on the simulated and real test sets respectively. When the problem extended to also classify the action of fastening/unfastening the seatbelt, respective F1 scores of 0.964 and 0.846 were achieved.
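A recurrent convolutional classifier of the general kind described above, with frame-wise CNN features summarized by a GRU, might look like the following. All layer sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SeatbeltStateNet(nn.Module):
    """Hedged sketch of a recurrent CNN for per-clip seatbelt-state classification
    from synthetic neuromorphic (event) frames."""
    def __init__(self, num_classes=2, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # per-frame feature: (B*T, 32)
        )
        self.rnn = nn.GRU(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                             # clips: (B, T, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(feats)                            # last hidden state summarizes the clip
        return self.head(h[-1])                           # (B, num_classes)
```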

Handwritten Stenography Recognition and the LION Dataset

  • paper_url: http://arxiv.org/abs/2308.07799
  • repo_url: https://zenodo.org/record/8249818
  • paper_authors: Raphaela Heil, Malin Nauwerck
  • for: The goal of this paper is to establish a baseline for handwritten stenography recognition, using the novel LION dataset.
  • methods: A state-of-the-art text recognition model is trained as a baseline; four different encoding methods transform the target sequences into representations that approximate selected aspects of the writing system, and a pre-training scheme based on synthetic data is added.
  • results: The baseline model achieves an average test character error rate (CER) of 29.81% and a word error rate (WER) of 55.14%. Combining stenography-specific target sequence encodings with pre-training and fine-tuning reduces test error rates considerably, to CERs of 24.5%-26% and WERs of 44.8%-48.2% (a small helper computing CER is sketched below).
    Abstract Purpose: In this paper, we establish a baseline for handwritten stenography recognition, using the novel LION dataset, and investigate the impact of including selected aspects of stenographic theory into the recognition process. We make the LION dataset publicly available with the aim of encouraging future research in handwritten stenography recognition. Methods: A state-of-the-art text recognition model is trained to establish a baseline. Stenographic domain knowledge is integrated by applying four different encoding methods that transform the target sequence into representations, which approximate selected aspects of the writing system. Results are further improved by integrating a pre-training scheme, based on synthetic data. Results: The baseline model achieves an average test character error rate (CER) of 29.81% and a word error rate (WER) of 55.14%. Test error rates are reduced significantly by combining stenography-specific target sequence encodings with pre-training and fine-tuning, yielding CERs in the range of 24.5% - 26% and WERs of 44.8% - 48.2%. Conclusion: The obtained results demonstrate the challenging nature of stenography recognition. Integrating stenography-specific knowledge, in conjunction with pre-training and fine-tuning on synthetic data, yields considerable improvements. Together with our precursor study on the subject, this is the first work to apply modern handwritten text recognition to stenography. The dataset and our code are publicly available via Zenodo.
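For reference, the CER metric quoted above is a normalized character-level edit distance; the helper below computes it, and WER is the same computation over whitespace-separated tokens. This is a generic utility, not the authors' evaluation code.

```python
def character_error_rate(reference, hypothesis):
    """Levenshtein edit distance between two sequences, normalized by the reference
    length. Pass strings for CER, or token lists (text.split()) for WER."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))                   # row 0 of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + cost)          # substitution / match
            prev = cur
    return dp[n] / max(m, 1)
```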

DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

  • paper_url: http://arxiv.org/abs/2308.07787
  • repo_url: https://github.com/joannahong/diffv2s
  • paper_authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro
  • for: This paper proposes a vision-guided speaker embedding extractor, built on a self-supervised pre-trained model and prompt tuning, so that no external audio information is needed at inference time.
  • methods: The extracted vision-guided speaker embeddings, together with visual representations from the input video, condition a diffusion-based video-to-speech synthesis model (DiffV2S).
  • results: Experiments show that DiffV2S preserves the phoneme details contained in the input video frames while producing highly intelligible mel-spectrograms in which the identities of multiple speakers are preserved, achieving state-of-the-art performance compared to previous video-to-speech synthesis techniques.
    Abstract Recent research has demonstrated impressive results in video-to-speech synthesis which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve the issue, they have adopted an extra speaker embedding as a speaking style guidance from a reference auditory information. Nevertheless, it is not always possible to obtain the audio information from the corresponding video input, especially during the inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique. In doing so, the rich speaker embedding information can be produced solely from input visual information, and the extra audio information is not necessary during the inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, so called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only maintains phoneme details contained in the input video frames, but also creates a highly intelligible mel-spectrogram in which the speaker identities of the multiple speakers are all preserved. Our experimental results show that DiffV2S achieves the state-of-the-art performance compared to the previous video-to-speech synthesis technique.

Future Video Prediction from a Single Frame for Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.07783
  • repo_url: None
  • paper_authors: Mohammad Baradaran, Robert Bergevin
  • for: This paper introduces future video prediction from a single frame as a novel proxy-task for video anomaly detection (VAD), in order to learn longer-term motion patterns.
  • methods: A semi-supervised anomaly detection method uses this proxy-task and replaces the initial and future raw frames with their semantic segmentation maps, making the model aware of object classes and simplifying the prediction task (see the scoring sketch below).
  • results: Experiments on benchmark datasets show that the method effectively learns long-term motion patterns and outperforms state-of-the-art prediction-based VAD methods.
    Abstract Video anomaly detection (VAD) is an important but challenging task in computer vision. The main challenge rises due to the rarity of training samples to model all anomaly cases. Hence, semi-supervised anomaly detection methods have gotten more attention, since they focus on modeling normals and they detect anomalies by measuring the deviations from normal patterns. Despite impressive advances of these methods in modeling normal motion and appearance, long-term motion modeling has not been effectively explored so far. Inspired by the abilities of the future frame prediction proxy-task, we introduce the task of future video prediction from a single frame, as a novel proxy-task for video anomaly detection. This proxy-task alleviates the challenges of previous methods in learning longer motion patterns. Moreover, we replace the initial and future raw frames with their corresponding semantic segmentation map, which not only makes the method aware of object class but also makes the prediction task less complex for the model. Extensive experiments on the benchmark datasets (ShanghaiTech, UCSD-Ped1, and UCSD-Ped2) show the effectiveness of the method and the superiority of its performance compared to SOTA prediction-based VAD methods.
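A prediction-based anomaly score of the kind used above can be reduced to the error between the predicted and observed future segmentation maps: a model trained only on normal clips predicts anomalous futures poorly. The per-pixel MSE choice in this sketch is an assumption, not the paper's exact scoring function.

```python
import torch
import torch.nn.functional as F

def anomaly_score(pred_future_seg, observed_future_seg):
    """Per-clip anomaly score from the future-prediction error (higher = more anomalous)."""
    err = F.mse_loss(pred_future_seg, observed_future_seg, reduction="none")
    return err.flatten(1).mean(dim=1)          # one scalar score per clip in the batch
```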

Learning Image Deraining Transformer Network with Dynamic Dual Self-Attention

  • paper_url: http://arxiv.org/abs/2308.07781
  • repo_url: None
  • paper_authors: Zhentao Fan, Hongming Chen, Yufeng Li
  • for: This work proposes an effective single-image deraining Transformer with dynamic dual self-attention (DDSA), combining dense and sparse attention strategies to better facilitate clear image reconstruction.
  • methods: Only the most useful similarity values are selected via a top-k approximate calculation to achieve sparse attention (see the sketch below), and a novel spatial-enhanced feed-forward network (SEFN) is developed to obtain a more accurate representation for high-quality derained results.
  • results: Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method across diverse rain conditions.
    Abstract Recently, Transformer-based architecture has been introduced into single image deraining task due to its advantage in modeling non-local information. However, existing approaches tend to integrate global features based on a dense self-attention strategy since it tend to uses all similarities of the tokens between the queries and keys. In fact, this strategy leads to ignoring the most relevant information and inducing blurry effect by the irrelevant representations during the feature aggregation. To this end, this paper proposes an effective image deraining Transformer with dynamic dual self-attention (DDSA), which combines both dense and sparse attention strategies to better facilitate clear image reconstruction. Specifically, we only select the most useful similarity values based on top-k approximate calculation to achieve sparse attention. In addition, we also develop a novel spatial-enhanced feed-forward network (SEFN) to further obtain a more accurate representation for achieving high-quality derained results. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed method.
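The top-k selection behind the sparse branch can be written in a few lines of PyTorch: keep only the k largest similarities per query and mask the rest before the softmax, so aggregation ignores weakly related (potentially rain-corrupted) tokens. The sketch below is a generic single-head version, not the paper's DDSA module.

```python
import torch

def topk_sparse_attention(q, k, v, topk=8):
    """Sparse attention: only the top-k similarity values per query survive the softmax."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # (B, N, N) scaled dot products
    kth = scores.topk(topk, dim=-1).values[..., -1:]          # k-th largest score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))  # drop everything below it
    return torch.softmax(scores, dim=-1) @ v
```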

An Interpretable Machine Learning Model with Deep Learning-based Imaging Biomarkers for Diagnosis of Alzheimer’s Disease

  • paper_url: http://arxiv.org/abs/2308.07778
  • repo_url: None
  • paper_authors: Wenjie Kang, Bo Li, Janne M. Papma, Lize C. Jiskoot, Peter Paul De Deyn, Geert Jan Biessels, Jurgen A. H. R. Claassen, Huub A. M. Middelkoop, Wiesje M. van der Flier, Inez H. G. B. Ramakers, Stefan Klein, Esther E. Bron
  • for: This study proposes an interpretable machine learning framework for automatic early diagnosis of Alzheimer's disease (AD).
  • methods: An Explainable Boosting Machine (EBM) is combined with deep learning-based feature extraction, so the model reports the importance of each feature (a sketch of this combination follows the abstract below).
  • results: On the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the framework achieves an accuracy of 0.883 and an area under the curve (AUC) of 0.970 for AD vs. control classification, and on an external test set an accuracy of 0.778 and AUC of 0.887 for AD vs. subjective cognitive decline (SCD) classification. It outperforms an EBM using volume biomarkers instead of deep learning-based features, as well as an end-to-end convolutional neural network (CNN) with optimized architecture.
    Abstract Machine learning methods have shown large potential for the automatic early diagnosis of Alzheimer's Disease (AD). However, some machine learning methods based on imaging data have poor interpretability because it is usually unclear how they make their decisions. Explainable Boosting Machines (EBMs) are interpretable machine learning models based on the statistical framework of generalized additive modeling, but have so far only been used for tabular data. Therefore, we propose a framework that combines the strength of EBM with high-dimensional imaging data using deep learning-based feature extraction. The proposed framework is interpretable because it provides the importance of each feature. We validated the proposed framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, achieving accuracy of 0.883 and area-under-the-curve (AUC) of 0.970 on AD and control classification. Furthermore, we validated the proposed framework on an external testing set, achieving accuracy of 0.778 and AUC of 0.887 on AD and subjective cognitive decline (SCD) classification. The proposed framework significantly outperformed an EBM model using volume biomarkers instead of deep learning-based features, as well as an end-to-end convolutional neural network (CNN) with optimized architecture.
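The framework described above pairs a deep feature extractor with an Explainable Boosting Machine operating on the resulting tabular features. The sketch below shows such a pairing with the interpretml package; the frozen extractor and the pooling choice are assumptions, not the paper's pipeline.

```python
import numpy as np
import torch
from interpret.glassbox import ExplainableBoostingClassifier  # interpretml package

def extract_features(cnn, loader, device="cpu"):
    """Pool deep-learning imaging biomarkers into a tabular matrix; 'cnn' stands in
    for any frozen feature extractor (an assumption, not the paper's own network)."""
    cnn.eval().to(device)
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(cnn(x.to(device)).flatten(1).cpu().numpy())
            labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# The interpretable classifier then operates on those tabular features and exposes
# per-feature importances, as in the framework described above:
# X, y = extract_features(cnn, train_loader)
# ebm = ExplainableBoostingClassifier()
# ebm.fit(X, y)
```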

Dual-path TokenLearner for Remote Photoplethysmography-based Physiological Measurement with Facial Videos

  • paper_url: http://arxiv.org/abs/2308.07771
  • repo_url: None
  • paper_authors: Wei Qian, Dan Guo, Kun Li, Xilan Tian, Meng Wang
  • for: This paper proposes a Transformer-based framework for remote photoplethysmography (rPPG) measurement that reduces the impact of disturbances such as illumination variations, facial occlusions and head movements, improving measurement accuracy.
  • methods: Two TokenLearners are used: a Spatial TokenLearner (S-TL) explores associations between different facial ROIs, and a Temporal TokenLearner (T-TL) infers the quasi-periodic pattern of heartbeats; they run in a dual-path mode to suppress disturbances (a generic TokenLearner sketch follows the abstract below).
  • results: Extensive experiments on four physiological measurement benchmark datasets show that Dual-TL achieves state-of-the-art performance in both intra- and cross-dataset testing, demonstrating its potential as a basic backbone for rPPG measurement.
    Abstract Remote photoplethysmography (rPPG) based physiological measurement is an emerging yet crucial vision task, whose challenge lies in exploring accurate rPPG prediction from facial videos accompanied by noises of illumination variations, facial occlusions, head movements, \etc, in a non-contact manner. Existing mainstream CNN-based models make efforts to detect physiological signals by capturing subtle color changes in facial regions of interest (ROI) caused by heartbeats. However, such models are constrained by the limited local spatial or temporal receptive fields in the neural units. Unlike them, a native Transformer-based framework called Dual-path TokenLearner (Dual-TL) is proposed in this paper, which utilizes the concept of learnable tokens to integrate both spatial and temporal informative contexts from the global perspective of the video. Specifically, the proposed Dual-TL uses a Spatial TokenLearner (S-TL) to explore associations in different facial ROIs, which promises the rPPG prediction far away from noisy ROI disturbances. Complementarily, a Temporal TokenLearner (T-TL) is designed to infer the quasi-periodic pattern of heartbeats, which eliminates temporal disturbances such as head movements. The two TokenLearners, S-TL and T-TL, are executed in a dual-path mode. This enables the model to reduce noise disturbances for final rPPG signal prediction. Extensive experiments on four physiological measurement benchmark datasets are conducted. The Dual-TL achieves state-of-the-art performances in both intra- and cross-dataset testings, demonstrating its immense potential as a basic backbone for rPPG measurement. The source code is available at \href{https://github.com/VUT-HFUT/Dual-TL}{https://github.com/VUT-HFUT/Dual-TL}
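A generic TokenLearner-style module, in which a small head predicts spatial attention maps that each pool the feature map into one learnable token, can illustrate the building block referenced above; the sizes and single-conv head are assumptions and do not reproduce the paper's S-TL/T-TL design.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Predict S spatial attention maps; each map pools the feature map into one token."""
    def __init__(self, channels, num_tokens=8):
        super().__init__()
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, x):                             # x: (B, C, H, W)
        a = self.attn(x).flatten(2).softmax(-1)       # (B, S, H*W), one map per token
        feats = x.flatten(2)                          # (B, C, H*W)
        return torch.einsum("bsn,bcn->bsc", a, feats) # (B, S, C) learned tokens
```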

Multi-scale Promoted Self-adjusting Correlation Learning for Facial Action Unit Detection

  • paper_url: http://arxiv.org/abs/2308.07770
  • repo_url: https://github.com/yuankaishen2001/Self-adjusting-AU
  • paper_authors: Xin Liu, Kaishen Yuan, Xuesong Niu, Jingang Shi, Zitong Yu, Huanjing Yue, Jingyu Yang
  • for: This paper proposes a novel self-adjusting AU-correlation learning (SACL) method to improve the accuracy and efficiency of facial action unit (AU) detection.
  • methods: SACL adaptively learns and updates AU correlation graphs by leveraging AU motion and emotion representation information extracted at different stages of the network, and a simple yet effective multi-scale feature learning (MSFL) method promotes better correlation information extraction.
  • results: Experiments show that the proposed method outperforms state-of-the-art methods on widely used AU detection benchmark datasets, while using only 28.7% of the parameters and 12.0% of the FLOPs of the best competing method.
    Abstract Facial Action Unit (AU) detection is a crucial task in affective computing and social robotics as it helps to identify emotions expressed through facial expressions. Anatomically, there are innumerable correlations between AUs, which contain rich information and are vital for AU detection. Previous methods used fixed AU correlations based on expert experience or statistical rules on specific benchmarks, but it is challenging to comprehensively reflect complex correlations between AUs via hand-crafted settings. There are alternative methods that employ a fully connected graph to learn these dependencies exhaustively. However, these approaches can result in a computational explosion and high dependency with a large dataset. To address these challenges, this paper proposes a novel self-adjusting AU-correlation learning (SACL) method with less computation for AU detection. This method adaptively learns and updates AU correlation graphs by efficiently leveraging the characteristics of different levels of AU motion and emotion representation information extracted in different stages of the network. Moreover, this paper explores the role of multi-scale learning in correlation information extraction, and design a simple yet effective multi-scale feature learning (MSFL) method to promote better performance in AU detection. By integrating AU correlation information with multi-scale features, the proposed method obtains a more robust feature representation for the final AU detection. Extensive experiments show that the proposed method outperforms the state-of-the-art methods on widely used AU detection benchmark datasets, with only 28.7\% and 12.0\% of the parameters and FLOPs of the best method, respectively. The code for this method is available at \url{https://github.com/linuxsino/Self-adjusting-AU}.

Whale Detection Enhancement through Synthetic Satellite Images

  • paper_url: http://arxiv.org/abs/2308.07766
  • repo_url: https://github.com/prgumd/seadronesim2
  • paper_authors: Akshaj Gaur, Cheng Liu, Xiaomin Lin, Nare Karapetyan, Yiannis Aloimonos
  • for: This work introduces SeaDroneSim2, a benchmark suite and dataset intended to improve whale detection while reducing the effort required to collect training data.
  • methods: Aerial and satellite synthetic image datasets are generated and used, alongside real images, to train machine learning detection models (a sketch of the real/synthetic mixing idea follows the abstract below).
  • results: Augmenting 10% real data with the synthetic dataset yields a 15% performance boost on whale detection compared to training on the real data alone, without additional data collection effort.
    Abstract With a number of marine populations in rapid decline, collecting and analyzing data about marine populations has become increasingly important to develop effective conservation policies for a wide range of marine animals, including whales. Modern computer vision algorithms allow us to detect whales in images in a wide range of domains, further speeding up and enhancing the monitoring process. However, these algorithms heavily rely on large training datasets, which are challenging and time-consuming to collect particularly in marine or aquatic environments. Recent advances in AI however have made it possible to synthetically create datasets for training machine learning algorithms, thus enabling new solutions that were not possible before. In this work, we present a solution - SeaDroneSim2 benchmark suite, which addresses this challenge by generating aerial, and satellite synthetic image datasets to improve the detection of whales and reduce the effort required for training data collection. We show that we can achieve a 15% performance boost on whale detection compared to using the real data alone for training, by augmenting a 10% real data. We open source both the code of the simulation platform SeaDroneSim2 and the dataset generated through it.
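The augmentation recipe reported above (the full synthetic set plus roughly 10% of the real data) can be expressed as a simple dataset composition; the function names and the subsetting strategy below are assumptions, not the SeaDroneSim2 code.

```python
import random
from torch.utils.data import ConcatDataset, DataLoader, Subset

def mixed_training_set(real_ds, synthetic_ds, real_fraction=0.10, seed=0):
    """Combine all synthetic images with a small random subset of real images."""
    rng = random.Random(seed)
    n_real = int(len(real_ds) * real_fraction)
    idx = rng.sample(range(len(real_ds)), n_real)
    return ConcatDataset([synthetic_ds, Subset(real_ds, idx)])

# loader = DataLoader(mixed_training_set(real_ds, synthetic_ds), batch_size=16, shuffle=True)
```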

CASPNet++: Joint Multi-Agent Motion Prediction

  • paper_url: http://arxiv.org/abs/2308.07751
  • repo_url: None
  • paper_authors: Maximilian Schäfer, Kun Zhao, Anton Kummert
  • for: This work supports advanced driver-assistance and autonomous driving by predicting the future motion of road users.
  • methods: CASPNet++, an improved version of the Context-Aware Scene Prediction Network (CASPNet), enhances interaction modeling and scene understanding to support joint prediction of all road users in a scene using spatiotemporal grids that model future occupancy, and adds an instance-based output head providing multi-modal trajectories for agents of interest.
  • results: In extensive quantitative and qualitative analysis, CASPNet++ demonstrates its ability to use and fuse diverse environmental input sources such as HD maps, radar detections and lidar segmentation, reaching state-of-the-art performance on the urban-focused nuScenes prediction dataset. The model has been deployed in a test vehicle and runs in real time with moderate computational resources.
    Abstract The prediction of road users' future motion is a critical task in supporting advanced driver-assistance systems (ADAS). It plays an even more crucial role for autonomous driving (AD) in enabling the planning and execution of safe driving maneuvers. Based on our previous work, Context-Aware Scene Prediction Network (CASPNet), an improved system, CASPNet++, is proposed. In this work, we focus on further enhancing the interaction modeling and scene understanding to support the joint prediction of all road users in a scene using spatiotemporal grids to model future occupancy. Moreover, an instance-based output head is introduced to provide multi-modal trajectories for agents of interest. In extensive quantitative and qualitative analysis, we demonstrate the scalability of CASPNet++ in utilizing and fusing diverse environmental input sources such as HD maps, Radar detection, and Lidar segmentation. Tested on the urban-focused prediction dataset nuScenes, CASPNet++ reaches state-of-the-art performance. The model has been deployed in a testing vehicle, running in real-time with moderate computational resources.

ChartDETR: A Multi-shape Detection Network for Visual Chart Recognition

  • paper_url: http://arxiv.org/abs/2308.07743
  • repo_url: None
  • paper_authors: Wenyuan Xue, Dapeng Chen, Baosheng Yu, Yifei Chen, Sai Zhou, Wei Peng
  • for: This paper proposes a transformer-based multi-shape detector for automatically recognizing table headers and data elements from chart images.
  • methods: The method addresses the grouping errors of existing keypoint-based approaches by introducing query groups in set prediction, predicting all data element shapes at once and eliminating post-processing.
  • results: The method achieves competitive results on three datasets, including an F1 score of 0.98 on Adobe Synthetic (a significant improvement over the previous best model's 0.71) and a new state-of-the-art result of 0.97 on ExcelChart400k.
    Abstract Visual chart recognition systems are gaining increasing attention due to the growing demand for automatically identifying table headers and values from chart images. Current methods rely on keypoint detection to estimate data element shapes in charts but suffer from grouping errors in post-processing. To address this issue, we propose ChartDETR, a transformer-based multi-shape detector that localizes keypoints at the corners of regular shapes to reconstruct multiple data elements in a single chart image. Our method predicts all data element shapes at once by introducing query groups in set prediction, eliminating the need for further postprocessing. This property allows ChartDETR to serve as a unified framework capable of representing various chart types without altering the network architecture, effectively detecting data elements of diverse shapes. We evaluated ChartDETR on three datasets, achieving competitive results across all chart types without any additional enhancements. For example, ChartDETR achieved an F1 score of 0.98 on Adobe Synthetic, significantly outperforming the previous best model with a 0.71 F1 score. Additionally, we obtained a new state-of-the-art result of 0.97 on ExcelChart400k. The code will be made publicly available.

Identity-Consistent Aggregation for Video Object Detection

  • paper_url: http://arxiv.org/abs/2308.07737
  • repo_url: https://github.com/bladewaltz1/clipvid
  • paper_authors: Chaorui Deng, Da Chen, Qi Wu
  • for: In video object detection (VID), the rich temporal context of a video is commonly used to enhance per-frame object representations. Existing methods aggregate temporal context from different objects indiscriminately and ignore their identities, whereas aggregating local views of the same object across frames should intuitively yield a better understanding of it. This paper therefore aims to let the model focus on identity-consistent temporal context for each object, producing more comprehensive object representations and handling rapid appearance variations such as occlusion and motion blur.
  • methods: The proposed ClipVID model is equipped with Identity-Consistent Aggregation (ICA) layers that mine fine-grained, identity-consistent temporal context. A set-prediction strategy removes redundant region proposals, making the ICA layers efficient and enabling an architecture that makes parallel clip-wise predictions for the whole video clip.
  • results: The method achieves state-of-the-art performance (84.7% mAP) on the ImageNet VID dataset while running about 7x faster (39.3 fps) than previous state-of-the-art approaches.
    Abstract In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities. While intuitively, aggregating local views of the same object in different frames may facilitate a better understanding of the object. Thus, in this paper, we aim to enable the model to focus on the identity-consistent temporal contexts of each object to obtain more comprehensive object representations and handle the rapid object appearance variations such as occlusion, motion blur, etc. However, realizing this goal on top of existing VID models faces low-efficiency problems due to their redundant region proposals and nonparallel frame-wise prediction manner. To aid this, we propose ClipVID, a VID model equipped with Identity-Consistent Aggregation (ICA) layers specifically designed for mining fine-grained and identity-consistent temporal contexts. It effectively reduces the redundancies through the set prediction strategy, making the ICA layers very efficient and further allowing us to design an architecture that makes parallel clip-wise predictions for the whole video clip. Extensive experimental results demonstrate the superiority of our method: a state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running at a speed about 7x faster (39.3 fps) than previous SOTAs.

Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression

  • paper_url: http://arxiv.org/abs/2308.07733
  • repo_url: https://github.com/llvy21/duic
  • paper_authors: Yue Lv, Jinxi Xiang, Jun Zhang, Wenming Yang, Xiao Han, Wei Yang
  • for: The paper aims to address the domain gap between training and inference datasets in neural image compression and to improve the rate-distortion performance of out-of-domain images.
  • methods: The proposed method uses low-rank adaptation and a dynamic gating network to update the adaptation parameters of the client's decoder; the low-rank constraint keeps the bit rate overhead small, and the gating network decides which decoder layers should employ adaptation (see the sketch below).
  • results: The method significantly mitigates the domain gap, outperforming non-adaptive and instance-adaptive methods with average BD-rate improvements of approximately 19% and 5%, respectively. Ablation studies confirm the method's universality across various image compression architectures.
    Abstract The latest advancements in neural image compression show great potential in surpassing the rate-distortion performance of conventional standard codecs. Nevertheless, there exists an indelible domain gap between the datasets utilized for training (i.e., natural images) and those utilized for inference (e.g., artistic images). Our proposal involves a low-rank adaptation approach aimed at addressing the rate-distortion drop observed in out-of-domain datasets. Specifically, we perform low-rank matrix decomposition to update certain adaptation parameters of the client's decoder. These updated parameters, along with image latents, are encoded into a bitstream and transmitted to the decoder in practical scenarios. Due to the low-rank constraint imposed on the adaptation parameters, the resulting bit rate overhead is small. Furthermore, the bit rate allocation of low-rank adaptation is \emph{non-trivial}, considering the diverse inputs require varying adaptation bitstreams. We thus introduce a dynamic gating network on top of the low-rank adaptation method, in order to decide which decoder layer should employ adaptation. The dynamic adaptation network is optimized end-to-end using rate-distortion loss. Our proposed method exhibits universality across diverse image datasets. Extensive results demonstrate that this paradigm significantly mitigates the domain gap, surpassing non-adaptive methods with an average BD-rate improvement of approximately $19\%$ across out-of-domain images. Furthermore, it outperforms the most advanced instance adaptive methods by roughly $5\%$ BD-rate. Ablation studies confirm our method's ability to universally enhance various image compression architectures.
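The low-rank adaptation of decoder parameters described above can be sketched as a frozen base layer plus a trainable rank-r update whose few parameters are overfitted per image and signalled in the bitstream; the additive linear form below is an assumption, not the released implementation.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen base weight W augmented with a trainable rank-r update B @ A."""
    def __init__(self, base_linear: nn.Linear, rank=4):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                        # only the adapter is updated per instance
        self.A = nn.Parameter(torch.zeros(rank, base_linear.in_features))
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, rank))
        nn.init.normal_(self.A, std=1e-3)

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).t()    # W x + B A x
```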

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation

  • paper_url: http://arxiv.org/abs/2308.07732
  • repo_url: https://github.com/haiyang-w/unitr
  • paper_authors: Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang
  • for: The paper develops an efficient multi-modal backbone for outdoor 3D perception that can handle a variety of modalities with unified modeling and shared parameters.
  • methods: A modality-agnostic transformer encoder handles view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps, together with a novel multi-modal integration strategy that considers the semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations.
  • results: The method sets a new state-of-the-art on the nuScenes benchmark, with +1.1 NDS for 3D object detection and +12.0 mIoU for BEV map segmentation, at lower inference latency.
    Abstract Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data. In this paper, we present an efficient multi-modal backbone for outdoor 3D perception named UniTR, which processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy by both considering semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 higher mIoU for BEV map segmentation with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR .

Context-Aware Pseudo-Label Refinement for Source-Free Domain Adaptive Fundus Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.07731
  • repo_url: https://github.com/xmed-lab/cpr
  • paper_authors: Zheang Huai, Xinpeng Ding, Yi Li, Xiaomeng Li
  • for: This paper addresses source-free unsupervised domain adaptation, where the source data is unavailable on the target side (e.g., for privacy or intellectual property reasons) and the source-trained model must be adapted using only unlabeled target data and the pseudo-labels it generates.
  • methods: A context-aware pseudo-label refinement method is proposed: a context-similarity learning module learns context relations, the pseudo-labels are revised with those relations (see the sketch below), the revised labels are calibrated to compensate for wrong revisions caused by inaccurate context relations, and a pixel-level and class-level denoising scheme selects reliable pseudo-labels for adaptation.
  • results: Experiments on cross-domain fundus image segmentation show that the approach achieves state-of-the-art results.
    Abstract In the domain adaptation problem, source data may be unavailable to the target client side due to privacy or intellectual property issues. Source-free unsupervised domain adaptation (SF-UDA) aims at adapting a model trained on the source side to align the target distribution with only the source model and unlabeled target data. The source model usually produces noisy and context-inconsistent pseudo-labels on the target domain, i.e., neighbouring regions that have a similar visual appearance are annotated with different pseudo-labels. This observation motivates us to refine pseudo-labels with context relations. Another observation is that features of the same class tend to form a cluster despite the domain gap, which implies context relations can be readily calculated from feature distances. To this end, we propose a context-aware pseudo-label refinement method for SF-UDA. Specifically, a context-similarity learning module is developed to learn context relations. Next, pseudo-label revision is designed utilizing the learned context relations. Further, we propose calibrating the revised pseudo-labels to compensate for wrong revision caused by inaccurate context relations. Additionally, we adopt a pixel-level and class-level denoising scheme to select reliable pseudo-labels for domain adaptation. Experiments on cross-domain fundus images indicate that our approach yields the state-of-the-art results. Code is available at https://github.com/xmed-lab/CPR.
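One way to realize the context-relation idea above is to re-estimate each pixel's soft pseudo-label from the labels of its most similar pixels in feature space, so visually similar neighbouring regions receive consistent labels. The weighting scheme in this sketch is an assumption, not the released CPR code.

```python
import torch
import torch.nn.functional as F

def refine_pseudo_labels(features, soft_labels, k=8, tau=0.1):
    """features: (B, C, H, W); soft_labels: (B, K, H, W) class probabilities.
    Returns refined per-pixel soft labels of shape (B, H*W, K)."""
    f = F.normalize(features.flatten(2).transpose(1, 2), dim=-1)    # (B, N, C)
    y = soft_labels.flatten(2).transpose(1, 2)                      # (B, N, K)
    sim = f @ f.transpose(1, 2)                                     # cosine similarity (B, N, N)
    topv, topi = sim.topk(k, dim=-1)                                # k most similar pixels
    w = torch.softmax(topv / tau, dim=-1)                           # (B, N, k) neighbour weights
    neigh = torch.stack([y[b][topi[b]] for b in range(y.size(0))])  # (B, N, k, K)
    return (w.unsqueeze(-1) * neigh).sum(dim=2)                     # weighted label aggregation
```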

Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability

  • paper_url: http://arxiv.org/abs/2308.07728
  • repo_url: None
  • paper_authors: Seokhyeon Ha, Sunbeom Jung, Jungwoo Lee
  • for: This work aims to improve model performance during fine-tuning, particularly on new target domains, while mitigating the distortion of pre-trained feature extractors.
  • methods: The proposed Domain-Aware Fine-Tuning (DAFT) combines a batch normalization conversion method, which reduces modifications to the network during fine-tuning, with the integration of linear probing and fine-tuning, so the head layer is optimized while the feature extractor adapts gradually (a sketch of the linear-probe-then-fine-tune schedule follows the abstract below).
  • results: DAFT significantly outperforms several baseline methods on both in-distribution and out-of-distribution datasets.
    Abstract Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can lead to the distortion of pre-trained feature extractors that already possess strong generalization capabilities. Mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, our DAFT significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baseline methods, demonstrating its effectiveness in not only improving performance but also mitigating feature distortion.
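The linear-probing-then-fine-tuning schedule referenced above can be sketched as two stages with different trainable parameter sets and learning rates. DAFT's batch-normalization conversion is omitted here, and train_one_epoch is an assumed helper; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

def lp_then_ft(model: nn.Module, head_name: str, loaders, epochs_lp=5, epochs_ft=10):
    """Stage 1: train only the head on frozen features. Stage 2: fine-tune everything."""
    head = getattr(model, head_name)

    # Stage 1: linear probing -- backbone frozen, only the head learns.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in head.parameters():
        p.requires_grad_(True)
    opt = torch.optim.SGD(head.parameters(), lr=1e-2, momentum=0.9)
    for _ in range(epochs_lp):
        train_one_epoch(model, loaders["train"], opt)      # assumed helper

    # Stage 2: fine-tuning -- all parameters trainable, smaller learning rate.
    for p in model.parameters():
        p.requires_grad_(True)
    opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    for _ in range(epochs_ft):
        train_one_epoch(model, loaders["train"], opt)
    return model
```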

Real-time Automatic M-mode Echocardiography Measurement with Panel Attention from Local-to-Global Pixels

  • paper_url: http://arxiv.org/abs/2308.07717
  • repo_url: https://github.com/hanktseng131415go/ramem
  • paper_authors: Ching-Hsun Tseng, Shao-Ju Chien, Po-Shen Wang, Shin-Jye Lee, Wei-Huan Hu, Bin Pu, Xiao-jun Zeng
  • for: This paper proposes a real-time automatic M-mode echocardiography measurement scheme, addressing three obstacles: the lack of an open dataset for building such automation, the time-consuming manual labelling of M-mode echocardiograms, and the inefficiency of existing backbones (e.g., ResNet) whose limited receptive fields cannot cover large objects such as a full valve-movement period.
  • methods: The work contributes MEIS, an instance-segmentation dataset of M-mode echocardiograms; panel attention, a local-to-global efficient attention built on pixel-unshuffling and an updated UPANets V2 within a real-time instance segmentation (RIS) scheme for big-object detection with a global receptive field (see the sketch below); and AMEM, an efficient algorithm for fast and accurate automatic labelling during diagnosis.
  • results: Experiments show that RAMEM surpasses existing RIS backbones (with non-local attention) on PASCAL 2012 SBD and exceeds human performance on real-time MEIS tests.
    Abstract Motion mode (M-mode) recording is an essential part of echocardiography to measure cardiac dimension and function. However, the current diagnosis cannot build an automatic scheme, as there are three fundamental obstructs: Firstly, there is no open dataset available to build the automation for ensuring constant results and bridging M-mode echocardiography with real-time instance segmentation (RIS); Secondly, the examination is involving the time-consuming manual labelling upon M-mode echocardiograms; Thirdly, as objects in echocardiograms occupy a significant portion of pixels, the limited receptive field in existing backbones (e.g., ResNet) composed from multiple convolution layers are inefficient to cover the period of a valve movement. Existing non-local attentions (NL) compromise being unable real-time with a high computation overhead or losing information from a simplified version of the non-local block. Therefore, we proposed RAMEM, a real-time automatic M-mode echocardiography measurement scheme, contributes three aspects to answer the problems: 1) provide MEIS, a dataset of M-mode echocardiograms for instance segmentation, to enable consistent results and support the development of an automatic scheme; 2) propose panel attention, local-to-global efficient attention by pixel-unshuffling, embedding with updated UPANets V2 in a RIS scheme toward big object detection with global receptive field; 3) develop and implement AMEM, an efficient algorithm of automatic M-mode echocardiography measurement enabling fast and accurate automatic labelling among diagnosis. The experimental results show that RAMEM surpasses existing RIS backbones (with non-local attention) in PASCAL 2012 SBD and human performances in real-time MEIS tested. The code of MEIS and dataset are available at https://github.com/hanktseng131415go/RAME.
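The pixel-unshuffling step behind panel attention trades spatial resolution for channels, so a subsequent attention layer sees far fewer tokens, each covering a larger region. The token construction after the unshuffle in this sketch is an assumption, not the released RAMEM code.

```python
import torch
import torch.nn.functional as F

def panel_tokens(x, downscale=4):
    """Fold space into channels, then flatten the coarse grid into attention tokens."""
    x = F.pixel_unshuffle(x, downscale)          # (B, C*r*r, H/r, W/r)
    return x.flatten(2).transpose(1, 2)          # (B, (H/r)*(W/r), C*r*r) tokens
```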

Enhancing Network Initialization for Medical AI Models Using Large-Scale, Unlabeled Natural Images

  • paper_url: http://arxiv.org/abs/2308.07688
  • repo_url: None
  • paper_authors: Soroosh Tayebi Arasteh, Leo Misera, Jakob Nikolas Kather, Daniel Truhn, Sven Nebelung
  • for: This study tests whether self-supervised learning (SSL) pre-training on non-medical images can be applied to chest radiographs, and how it compares with supervised pre-training on non-medical images and on medical images.
  • methods: A vision transformer is initialized with weights from (i) SSL pre-training on natural images (DINOv2), (ii) supervised pre-training on natural images (ImageNet), and (iii) supervised pre-training on chest radiographs from the MIMIC-CXR database (an initialization sketch follows the abstract below).
  • results: Across over 800,000 chest radiographs from six large global datasets covering more than 20 imaging findings, SSL pre-training on curated natural images not only outperformed ImageNet-based pre-training (P<0.001 for all datasets) but in certain cases also exceeded supervised pre-training on MIMIC-CXR.
    Abstract Pre-training datasets, like ImageNet, have become the gold standard in medical image analysis. However, the emergence of self-supervised learning (SSL), which leverages unlabeled data to learn robust features, presents an opportunity to bypass the intensive labeling process. In this study, we explored if SSL for pre-training on non-medical images can be applied to chest radiographs and how it compares to supervised pre-training on non-medical images and on medical images. We utilized a vision transformer and initialized its weights based on (i) SSL pre-training on natural images (DINOv2), (ii) SL pre-training on natural images (ImageNet dataset), and (iii) SL pre-training on chest radiographs from the MIMIC-CXR database. We tested our approach on over 800,000 chest radiographs from six large global datasets, diagnosing more than 20 different imaging findings. Our SSL pre-training on curated images not only outperformed ImageNet-based pre-training (P<0.001 for all datasets) but, in certain cases, also exceeded SL on the MIMIC-CXR dataset. Our findings suggest that selecting the right pre-training strategy, especially with SSL, can be pivotal for improving artificial intelligence (AI)'s diagnostic accuracy in medical imaging. By demonstrating the promise of SSL in chest radiograph analysis, we underline a transformative shift towards more efficient and accurate AI models in medical imaging.
    摘要 预训练数据集(如 ImageNet)已成为医学图像分析中的黄金标准。然而,自监督学习(SSL)能够利用无标注数据学习鲁棒特征,为绕过繁重的标注过程提供了机会。在本研究中,我们探讨了在非医学图像上进行的 SSL 预训练能否应用于胸部 X 光片,以及它与在非医学图像和医学图像上进行监督预训练的比较。我们使用视觉 Transformer,并分别基于以下方式初始化其权重:(i)在自然图像上的 SSL 预训练(DINOv2);(ii)在自然图像(ImageNet 数据集)上的监督预训练;(iii)在 MIMIC-CXR 数据库的胸部 X 光片上的监督预训练。我们在来自六个大型全球数据集的 80 多万张胸部 X 光片上测试了该方法,诊断 20 余种不同的影像学表现。在精选图像上的 SSL 预训练不仅优于基于 ImageNet 的预训练(所有数据集 P<0.001),在某些情况下甚至超过了在 MIMIC-CXR 上的监督预训练。这些发现表明,选择合适的预训练策略(尤其是 SSL)对提高人工智能在医学影像中的诊断准确率至关重要。通过展示 SSL 在胸部 X 光片分析中的潜力,我们强调了医学影像领域向更高效、更准确的 AI 模型的转变。
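A hedged sketch of how the three initialisation strategies could be wired up for multi-label chest-radiograph classification. The torch.hub DINOv2 entry point and the timm ImageNet ViT are public models; the MIMIC-CXR supervised weights, the 20-finding head, and the assumption that both backbones return pooled 768-d features are illustrative and may differ from the authors' setup.

```python
import torch
import torch.nn as nn

NUM_FINDINGS = 20  # multi-label chest-radiograph findings (assumed count)

def build_backbone(init: str) -> nn.Module:
    """Return a ViT backbone initialised according to one pre-training strategy."""
    if init == "dinov2_ssl":
        # Self-supervised ViT-B/14 weights from natural images (DINOv2, via torch.hub).
        return torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
    if init == "imagenet_sl":
        import timm
        # Supervised ImageNet weights; num_classes=0 returns pooled features.
        return timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
    raise ValueError("expected 'dinov2_ssl' or 'imagenet_sl' (MIMIC-CXR SL weights are not public here)")

class ChestClassifier(nn.Module):
    def __init__(self, init: str, feat_dim: int = 768):
        super().__init__()
        self.backbone = build_backbone(init)
        self.head = nn.Linear(feat_dim, NUM_FINDINGS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)          # (B, feat_dim) pooled token features (assumed)
        return self.head(feats)           # multi-label logits

# Training would optimise a multi-label objective, e.g.:
# loss = nn.BCEWithLogitsLoss()(model(images), targets.float())
```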

A Review of Adversarial Attacks in Computer Vision

  • paper_url: http://arxiv.org/abs/2308.07673
  • repo_url: None
  • paper_authors: Yutong Zhang, Yao Li, Yin Li, Zhichang Guo
  • for: 本文综述深度神经网络面临的对抗样本攻击问题,尤其是自动驾驶等安全关键场景中的威胁。
  • methods: 重点讨论黑盒设定,即攻击者无法获知模型参数与梯度,只能访问模型的输入和输出。
  • results: 综述指出,对抗样本在不同的深度学习与机器学习模型之间具有可迁移性,并且可以在现实世界中实现攻击。
    Abstract Deep neural networks have been widely used in various downstream tasks, especially those safety-critical scenario such as autonomous driving, but deep networks are often threatened by adversarial samples. Such adversarial attacks can be invisible to human eyes, but can lead to DNN misclassification, and often exhibits transferability between deep learning and machine learning models and real-world achievability. Adversarial attacks can be divided into white-box attacks, for which the attacker knows the parameters and gradient of the model, and black-box attacks, for the latter, the attacker can only obtain the input and output of the model. In terms of the attacker's purpose, it can be divided into targeted attacks and non-targeted attacks, which means that the attacker wants the model to misclassify the original sample into the specified class, which is more practical, while the non-targeted attack just needs to make the model misclassify the sample. The black box setting is a scenario we will encounter in practice.
    摘要 深度神经网络在各种下游任务中广泛应用,特别是安全关键的情况下,如自动驾驶等,但深度网络受到反对攻击的威胁。这些反对攻击可能会在人类眼中不可见,但可能导致神经网络误分类,并且常常具有神经网络和机器学习模型之间的传播性和实际应用性。反对攻击可以分为白盒攻击和黑盒攻击两类,其中白盒攻击者知道模型的参数和梯度,黑盒攻击者只能获得输入和输出。根据攻击者的目的,反对攻击可以分为targeted攻击和非targeted攻击。targeted攻击需要模型误分类原始样本为指定的类别,更加实际;非targeted攻击只需要模型误分类样本。黑盒设定是我们在实践中会遇到的情况。
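To make the targeted / non-targeted distinction above concrete, here is a minimal white-box FGSM sketch; in the black-box setting discussed in the review, the same perturbation would typically be crafted on a surrogate model and then transferred to the victim.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255, target=None):
    """One-step FGSM. Untargeted if `target` is None, targeted otherwise.

    White-box setting: gradients of the victim model are available.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    logits = model(x_adv)
    if target is None:
        loss = F.cross_entropy(logits, y)          # move away from the true label
        direction = 1.0
    else:
        loss = F.cross_entropy(logits, target)     # move toward the chosen label
        direction = -1.0
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + direction * eps * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)              # stay a valid image
    return x_adv.detach()
```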

Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training

  • paper_url: http://arxiv.org/abs/2308.07665
  • repo_url: https://github.com/ximinng/inversion-by-inversion
  • paper_authors: Ximing Xing, Chuang Wang, Haitao Zhou, Zhihao Hu, Chongxuan Li, Dong Xu, Qian Yu
  • for: 基于示例图将草图合成为照片级真实的图像。
  • methods: 提出两阶段的“Inversion-by-Inversion”方法,包括形状增强反演与全控制反演两个阶段,分别通过形状能量函数和外观能量函数控制生成图像的形状与外观特征,且无需训练。
  • results: 实验结果表明,所提方法能够生成高质量的真实图像,并可根据不同的示例图控制图像的颜色和纹理特征。
    Abstract Exemplar-based sketch-to-photo synthesis allows users to generate photo-realistic images based on sketches. Recently, diffusion-based methods have achieved impressive performance on image generation tasks, enabling highly-flexible control through text-driven generation or energy functions. However, generating photo-realistic images with color and texture from sketch images remains challenging for diffusion models. Sketches typically consist of only a few strokes, with most regions left blank, making it difficult for diffusion-based methods to produce photo-realistic images. In this work, we propose a two-stage method named ``Inversion-by-Inversion" for exemplar-based sketch-to-photo synthesis. This approach includes shape-enhancing inversion and full-control inversion. During the shape-enhancing inversion process, an uncolored photo is generated with the guidance of a shape-energy function. This step is essential to ensure control over the shape of the generated photo. In the full-control inversion process, we propose an appearance-energy function to control the color and texture of the final generated photo.Importantly, our Inversion-by-Inversion pipeline is training-free and can accept different types of exemplars for color and texture control. We conducted extensive experiments to evaluate our proposed method, and the results demonstrate its effectiveness.
    摘要 基于示例的草图到照片合成可以让用户根据草图生成照片级真实的图像。最近,基于扩散模型的方法在图像生成任务上取得了出色的表现,可通过文本驱动生成或能量函数实现高度灵活的控制。然而,对于扩散模型而言,从草图生成具有颜色和纹理的照片级真实图像仍然具有挑战性:草图通常只有少量笔画,大部分区域为空白,这使得基于扩散的方法难以生成照片级真实的图像。在这项工作中,我们提出了一种名为“Inversion-by-Inversion”的两阶段方法,用于基于示例的草图到照片合成。该方法包括形状增强反演和全控制反演两个阶段:在形状增强反演过程中,在形状能量函数的引导下生成一张未上色的照片,这一步对于控制生成照片的形状至关重要;在全控制反演过程中,我们提出外观能量函数来控制最终生成照片的颜色和纹理。重要的是,我们的 Inversion-by-Inversion 流程无需训练,并可接受不同类型的示例图用于颜色和纹理控制。我们进行了大量实验来评估所提方法,结果证明了其有效性。

Gradient-Based Post-Training Quantization: Challenging the Status Quo

  • paper_url: http://arxiv.org/abs/2308.07662
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: The paper focuses on gradient-based post-training quantization (GPTQ) methods for efficient deployment of deep neural networks.
  • methods: The paper challenges common choices in GPTQ methods and derives best practices for designing more efficient and scalable GPTQ methods, covering both the problem formulation and the optimization process.
  • results: The paper proposes a novel importance-based mixed-precision technique and shows significant performance improvements on all tested state-of-the-art GPTQ methods and networks, e.g. +6.819 points on ViT for 4-bit quantization.
    Abstract Quantization has become a crucial step for the efficient deployment of deep neural networks, where floating point operations are converted to simpler fixed point operations. In its most naive form, it simply consists in a combination of scaling and rounding transformations, leading to either a limited compression rate or a significant accuracy drop. Recently, Gradient-based post-training quantization (GPTQ) methods appears to be constitute a suitable trade-off between such simple methods and more powerful, yet expensive Quantization-Aware Training (QAT) approaches, particularly when attempting to quantize LLMs, where scalability of the quantization process is of paramount importance. GPTQ essentially consists in learning the rounding operation using a small calibration set. In this work, we challenge common choices in GPTQ methods. In particular, we show that the process is, to a certain extent, robust to a number of variables (weight selection, feature augmentation, choice of calibration set). More importantly, we derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or optimization process (choice of variable and optimizer). Lastly, we propose a novel importance-based mixed-precision technique. Those guidelines lead to significant performance improvements on all the tested state-of-the-art GPTQ methods and networks (e.g. +6.819 points on ViT for 4-bit quantization), paving the way for the design of scalable, yet effective quantization methods.
    摘要 量化已成为深度神经网络的有效部署步骤,将浮点运算转换为简单的固定点运算。最简单的方式是通过缩放和四舍五入变换,但这将导致压缩率有限或准确率下降。现在,使用梯度based后期量化(GPTQ)方法可以实现一个适当的平衡,特别是在尝试量化LLMs(大型语言模型)时,因为量化过程的扩展性是非常重要。GPTQ通过学习缩放操作使用小量训练集来实现。在这个工作中,我们挑战了GPTQ方法的常见选择。具体来说,我们发现这个过程在一定程度上是Robust,即选择特征、增强特征和calibration集的变量的影响相对较小。此外,我们还提出了一些设计更高效和可扩展的GPTQ方法的最佳实践,包括问题定义(损失、自由度、非对称量化方案)和优化过程(变量和优化器)中的一些变量。最后,我们提出了一种新的重要性基于混合精度技术。这些指南导致所有测试的State-of-the-art GPTQ方法和网络(如+6.819点的ViT для4位量化)获得显著性能提高,为设计可扩展、有效的量化方法铺平道路。
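A toy illustration of the core GPTQ idea of "learning the rounding operation using a small calibration set": a per-weight rounding offset is optimised so the quantized layer reproduces the float outputs on calibration data. This is a generic sketch (closer in spirit to AdaRound-style learned rounding) and not the specific formulations or best practices studied in the paper.

```python
import torch
import torch.nn.functional as F

def learn_rounding(weight, calib_x, bits=4, steps=200, lr=1e-2):
    """Toy gradient-based post-training quantization of one linear layer's weight."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    base = torch.floor(weight / scale)                 # floor; the learned offset decides up/down
    v = torch.zeros_like(weight, requires_grad=True)   # learnable rounding logits
    opt = torch.optim.Adam([v], lr=lr)
    target = calib_x @ weight.t()                      # float reference outputs
    for _ in range(steps):
        w_q = (base + torch.sigmoid(v)).clamp(-qmax - 1, qmax) * scale
        loss = F.mse_loss(calib_x @ w_q.t(), target)   # match float layer on calibration data
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Harden the offsets to {0, 1} for the final integer weights.
    w_int = (base + (torch.sigmoid(v) > 0.5).float()).clamp(-qmax - 1, qmax)
    return w_int * scale

if __name__ == "__main__":
    w = torch.randn(64, 128)
    x = torch.randn(32, 128)                           # small calibration batch
    w_hat = learn_rounding(w, x)
    print((x @ w.t() - x @ w_hat.t()).abs().mean())    # reconstruction error after quantization
```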

Geometry of the Visual Cortex with Applications to Image Inpainting and Enhancement

  • paper_url: http://arxiv.org/abs/2308.07652
  • repo_url: https://github.com/ballerin/v1diffusion
  • paper_authors: Francesco Ballerin, Erlend Grong
  • for: 提出受视觉皮层 V1 启发、基于旋转平移群 $SE(2)$ 的图像修复(inpainting)与增强算法。
  • methods: 基于次椭圆(hypoelliptic)扩散,提出能防止褪色并得到更锐利结果的 WaxOn-WaxOff 流程,并利用 $SE(2)$ 上的子黎曼结构定义一种全新的反锐化(unsharp)滤波器用于图像增强。
  • results: 在视网膜扫描图像的血管增强任务上演示了该方法,得到了更锐利的结果。
    Abstract Equipping the rototranslation group $SE(2)$ with a sub-Riemannian structure inspired by the visual cortex V1, we propose algorithms for image inpainting and enhancement based on hypoelliptic diffusion. We innovate on previous implementations of the methods by Citti, Sarti and Boscain et al., by proposing an alternative that prevents fading and capable of producing sharper results in a procedure that we call WaxOn-WaxOff. We also exploit the sub-Riemannian structure to define a completely new unsharp using $SE(2)$, analogous of the classical unsharp filter for 2D image processing, with applications to image enhancement. We demonstrate our method on blood vessels enhancement in retinal scans.
    摘要 将$SE(2)$拓扑群受到视觉核V1的启发下的半里曼尼拓扑结构,我们提出了基于液体扩散的图像填充和改善算法。我们在之前的实现方法(Citti、Sarti和Boscain等人的方法)的基础上做出了修改,以避免模糊和生成更加锐利的结果,我们称之为“WaxOn-WaxOff”过程。我们还利用了子拓扑结构来定义一种全新的不锐化器,类似于传统的2D图像处理中的不锐化过滤器,并应用于图像增强。我们在血管扩大retinal扫描中进行了示例。
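For reference, a sketch of the classical 2D unsharp filter that the paper generalises to SE(2); the sub-Riemannian version replaces the Gaussian blur with hypoelliptic diffusion on the roto-translation group, which this sketch does not implement.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp(image: np.ndarray, sigma: float = 2.0, amount: float = 1.0) -> np.ndarray:
    """Classical 2D unsharp masking: sharpened = image + amount * (image - blurred)."""
    blurred = gaussian_filter(image.astype(np.float64), sigma=sigma)
    return image + amount * (image - blurred)

# Usage: enhanced = unsharp(retinal_scan, sigma=2.0, amount=1.5)
```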

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

  • paper_url: http://arxiv.org/abs/2308.07648
  • repo_url: https://github.com/bladewaltz1/promptswitch
  • paper_authors: Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu
  • for: 本文主要研究text-video retrieval领域中的问题,即如何使用预训练的文本-图像基础模型(如CLIP)在视频领域中进行有效的学习。
  • methods: 本文提出了一种新的方法,即在CLIP图像Encoder中引入空间-时间”Prompt Cube”,以快速包含全视频 semantics在帧表示中。此外,本文还提出了一种auxiliary video captioning目标函数,以帮助学习详细的视频 semantics。
  • results: 通过使用本文提出的方法,仅需简单的时间融合策略(即 mean-pooling),即可在三个标准基准数据集(MSR-VTT、MSVD、LSMDC)上取得最先进(state-of-the-art)的性能。
    Abstract In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations. We then propose to apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
    摘要 在文本-视频检索中,近期工作受益于预训练文本-图像基础模型(如 CLIP)强大的学习能力,将其适配到视频领域。其中一个关键问题是如何利用 CLIP 的图像编码器有效捕捉视频内部丰富的语义。为此,现有最先进方法采用复杂的跨模态建模技术,将文本信息融合进视频帧表示中;然而,这在大规模检索系统中带来严重的效率问题,因为每次文本查询都必须在线重新计算视频表示。在本文中,我们摒弃这种有问题的跨模态融合过程,转而纯粹从视频中学习语义增强的表示,使视频表示可以离线计算并被不同文本复用。具体而言,我们首先在 CLIP 图像编码器中引入一个时空“Prompt Cube”,并在编码器各层中迭代交换它,以高效地将全局视频语义注入帧表示;随后,我们提出一个辅助的视频描述生成目标来训练帧表示,通过在语义空间提供细粒度的指导来促进对细节视频语义的学习。在增强后的帧表示上仅使用朴素的时间融合策略(即 mean-pooling),我们便在 MSR-VTT、MSVD 和 LSMDC 三个基准数据集上取得了最先进的性能。
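A schematic of the offline/online split that makes this approach efficient: video representations are mean-pooled from frame features once and cached, and each text query needs only a single encoder pass plus a similarity lookup. The frame_encoder and text_encoder below are placeholders standing in for the CLIP encoders (with the Prompt Cube), not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_video_embeddings(frame_encoder, videos):
    """Offline stage: encode frames once per video and mean-pool over time."""
    cache = []
    for frames in videos:                       # frames: (T, 3, H, W)
        feats = frame_encoder(frames)           # (T, D) per-frame features
        video_vec = feats.mean(dim=0)           # naive temporal fusion: mean-pooling
        cache.append(F.normalize(video_vec, dim=-1))
    return torch.stack(cache)                   # (N_videos, D), reusable for all queries

@torch.no_grad()
def retrieve(text_encoder, query: str, video_bank: torch.Tensor, topk: int = 5):
    """Online stage: one text forward pass, then a cosine-similarity lookup."""
    q = F.normalize(text_encoder(query), dim=-1)        # (D,)
    scores = video_bank @ q                              # cosine similarities
    return scores.topk(topk)
```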

Backpropagation Path Search On Adversarial Transferability

  • paper_url: http://arxiv.org/abs/2308.07625
  • repo_url: None
  • paper_authors: Zhuoer Xu, Zhangxuan Gu, Jianping Zhang, Shiwen Cui, Changhua Meng, Weiqiang Wang
  • for: 深度神经网络容易受到对抗样本的攻击,因此需要在部署前测试模型的鲁棒性。
  • methods: 基于迁移的攻击者先针对替代(surrogate)模型构造对抗样本,再将其迁移到黑盒环境下部署的受害模型。为提高对抗迁移性,基于结构的攻击者会调整反向传播路径以避免攻击过拟合替代模型;但现有方法既未探索 CNN 中的卷积模块,又只是启发式地修改反向传播图,效果有限。本文提出 backPropagation pAth Search(PAS):用 SkipConv 通过结构重参数化调整卷积的反向传播路径,并构建基于 DAG 的搜索空间,结合一步近似的路径评估与贝叶斯优化来搜索最优路径。
  • results: 在多种迁移设置下的大量实验表明,PAS 对正常训练模型和防御模型都能大幅提升攻击成功率。
    Abstract Deep neural networks are vulnerable to adversarial examples, dictating the imperativeness to test the model's robustness before deployment. Transfer-based attackers craft adversarial examples against surrogate models and transfer them to victim models deployed in the black-box situation. To enhance the adversarial transferability, structure-based attackers adjust the backpropagation path to avoid the attack from overfitting the surrogate model. However, existing structure-based attackers fail to explore the convolution module in CNNs and modify the backpropagation graph heuristically, leading to limited effectiveness. In this paper, we propose backPropagation pAth Search (PAS), solving the aforementioned two problems. We first propose SkipConv to adjust the backpropagation path of convolution by structural reparameterization. To overcome the drawback of heuristically designed backpropagation paths, we further construct a DAG-based search space, utilize one-step approximation for path evaluation and employ Bayesian Optimization to search for the optimal path. We conduct comprehensive experiments in a wide range of transfer settings, showing that PAS improves the attack success rate by a huge margin for both normally trained and defense models.
    摘要 深度神经网络容易受到反例攻击,需要在部署之前测试模型的可靠性。转移基于攻击者通过对代理模型创建反例,并将其传递到黑盒环境中部署的受害者模型。为增强反例传递性,结构基于攻击者可以修改反例传递的背景干扰路径,以避免攻击过拟合代理模型。然而,现有的结构基于攻击者未能探索 convolution 模块在 CNN 中,并修改背景干扰路径的方法,导致有限的效果。在这篇论文中,我们提出了 backPropagation pAth Search (PAS),解决以下两个问题。我们首先提出 SkipConv,用于调整 convolution 模块的背景干扰路径。为了超越轮循的设计方法,我们进一步建立了 DAG 型搜索空间,利用一步逼近方法来评估路径,并使用 Bayesian 优化来搜索最佳路径。我们在各种转移设置下进行了广泛的实验,结果显示,PAS 可以在各种转移设置下提高攻击成功率,并且在防御模型上也有显著改善。

Self-Prompting Large Vision Models for Few-Shot Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.07624
  • repo_url: https://github.com/peteryyzhang/few-shot-self-prompt-sam
  • paper_authors: Qi Wu, Yuyao Zhang, Marawan Elbatel
  • for: 将大型基础模型 Segment Anything Model(SAM)应用于医学领域,以提升少样本医学图像分割的性能。
  • methods: 提出一种新的自提示(self-prompting)方法:利用 SAM 的嵌入空间,通过一个简单而有效的线性像素级分类器为 SAM 自身生成提示,同时保留大模型的编码能力、解码器的上下文信息及其交互式提示能力。
  • results: 在多个数据集(少样本医学图像分割任务)上取得有竞争力的结果,相比仅用少量图像微调掩码解码器的方法提升约 15%。
    Abstract Recent advancements in large foundation models have shown promising potential in the medical industry due to their flexible prompting capability. One such model, the Segment Anything Model (SAM), a prompt-driven segmentation model, has shown remarkable performance improvements, surpassing state-of-the-art approaches in medical image segmentation. However, existing methods primarily rely on tuning strategies that require extensive data or prior prompts tailored to the specific task, making it particularly challenging when only a limited number of data samples are available. In this paper, we propose a novel perspective on self-prompting in medical vision applications. Specifically, we harness the embedding space of SAM to prompt itself through a simple yet effective linear pixel-wise classifier. By preserving the encoding capabilities of the large model, the contextual information from its decoder, and leveraging its interactive promptability, we achieve competitive results on multiple datasets (i.e. improvement of more than 15% compared to fine-tuning the mask decoder using a few images).
    摘要 大型基础模型凭借其灵活的提示能力,在医学领域展现出可观的潜力。其中,提示驱动的分割模型 Segment Anything Model(SAM)表现尤为突出,在医学图像分割上超过了现有最先进方法。然而,现有方法主要依赖需要大量数据或针对特定任务定制先验提示的调优策略,在仅有少量样本可用时尤其困难。本文提出医学视觉应用中自提示(self-prompting)的新视角:利用 SAM 的嵌入空间,通过一个简单而有效的线性像素级分类器为其自身生成提示。通过保留大模型的编码能力、解码器的上下文信息并利用其交互式提示能力,我们在多个数据集上取得了有竞争力的结果(相比仅用少量图像微调掩码解码器的方法提升超过 15%)。
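A hedged sketch of the self-prompting recipe: fit a linear pixel-wise classifier on frozen image-encoder embeddings from the few labelled images, then turn its coarse prediction into a prompt (here a bounding box) that can be fed back to SAM. The embedding shapes, threshold, and box prompt are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pixel_classifier(embeddings, masks):
    """Few-shot stage: fit a linear pixel-wise classifier on frozen embeddings.

    embeddings: list of (C, H, W) image-encoder features (e.g. from SAM);
    masks: list of (H, W) binary labels for the few annotated images.
    """
    X = np.concatenate([e.reshape(e.shape[0], -1).T for e in embeddings])  # (N_pix, C)
    y = np.concatenate([m.reshape(-1) for m in masks])                     # (N_pix,)
    return LogisticRegression(max_iter=1000).fit(X, y)

def coarse_mask_and_box(clf, embedding, thr=0.5):
    """Inference: predict a coarse mask, then derive a box prompt for SAM."""
    c, h, w = embedding.shape
    prob = clf.predict_proba(embedding.reshape(c, -1).T)[:, 1].reshape(h, w)
    mask = prob > thr
    ys, xs = np.where(mask)
    if len(xs) == 0:
        return mask, None
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # x0, y0, x1, y1
    return mask, box   # the box (rescaled to image resolution) can prompt SAM's decoder
```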

Self-supervised Hypergraphs for Learning Multiple World Interpretations

  • paper_url: http://arxiv.org/abs/2308.07615
  • repo_url: None
  • paper_authors: Alina Marcu, Mihai Pirvu, Dragos Costea, Emanuela Haller, Emil Slusanschi, Ahmed Nabil Belbachir, Rahul Sukthankar, Marius Leordeanu
  • for: 在仅有少量标注的情况下学习场景的多种表示(解释层)。
  • methods: 利用不同场景表示之间的关系构建多任务超图,并利用该超图在无需额外标注数据的情况下改进一个强大的预训练 VisTransformer 模型。
  • results: 在不同类型的超边和集成模型下进行自监督学习,表现优于该领域其他多任务图模型;同时提出了 Dronescapes,一个由无人机在多种复杂真实场景中采集、包含多种表示、适合多任务学习的大型视频数据集。
    Abstract We present a method for learning multiple scene representations given a small labeled set, by exploiting the relationships between such representations in the form of a multi-task hypergraph. We also show how we can use the hypergraph to improve a powerful pretrained VisTransformer model without any additional labeled data. In our hypergraph, each node is an interpretation layer (e.g., depth or segmentation) of the scene. Within each hyperedge, one or several input nodes predict the layer at the output node. Thus, each node could be an input node in some hyperedges and an output node in others. In this way, multiple paths can reach the same node, to form ensembles from which we obtain robust pseudolabels, which allow self-supervised learning in the hypergraph. We test different ensemble models and different types of hyperedges and show superior performance to other multi-task graph models in the field. We also introduce Dronescapes, a large video dataset captured with UAVs in different complex real-world scenes, with multiple representations, suitable for multi-task learning.
    摘要 我们提出了一种方法,通过利用场景表示的关系形式为多任务 гиперграフ来学习多个场景表示。我们还示出了如何使用 гиперграフ来提高一种强大预训练 VisTransformer 模型,无需任何额外的标注数据。在我们的 гиперграフ中,每个节点是一个解释层(例如深度或分割)的场景表示。在每个 гипер边上,一个或多个输入节点预测输出节点的层。因此,每个节点可以是输入节点在某些 гипер边上,并且是输出节点在其他 гипер边上。这样,多个路径可以达到同一个节点,从而形成ensemble,并使用这些ensemble来获得Robustpseudolabel,以实现自动标注学习在 гиперграフ中。我们测试了不同的ensemble模型和不同类型的 гипер边,并显示了与其他多任务图模型在领域中的超越性。我们还介绍了 Dronescapes,一个大量视频数据集,captured with UAVs在不同的复杂实际场景中,具有多种表示,适合多任务学习。

  • paper_url: http://arxiv.org/abs/2308.07611
  • repo_url: None
  • paper_authors: Po-Jui Lu, Benjamin Odry, Muhamed Barakovic, Matthias Weigel, Robin Sandkühler, Reza Rahmanzadeh, Xinjie Chen, Mario Ocampo-Pineda, Jens Kuhle, Ludwig Kappos, Philippe Cattin, Cristina Granziera
  • for: The paper aims to identify disability-related brain changes in multiple sclerosis (MS) patients using whole-brain quantitative MRI (qMRI) and a novel comprehensive approach called GAMER-MRIL.
  • methods: The approach uses a gated-attention-based convolutional neural network (CNN) to select patch-based qMRI important for a given task/question, and incorporates a structure-aware interpretability method, Layer-wise Relevance Propagation (LRP), to identify disability-related brain regions.
  • results: The approach achieved an AUC of 0.885; the most disability-sensitive measures were qT1 and NDI. The proposed LRP approach obtained more specifically relevant regions than the saliency map, the integrated gradients, and the original LRP, including the corticospinal tract, where average qT1 and NDI significantly correlated with patients' disability scores ($\rho$=-0.37 and 0.44).
    Abstract Objective: Identifying disability-related brain changes is important for multiple sclerosis (MS) patients. Currently, there is no clear understanding about which pathological features drive disability in single MS patients. In this work, we propose a novel comprehensive approach, GAMER-MRIL, leveraging whole-brain quantitative MRI (qMRI), convolutional neural network (CNN), and an interpretability method from classifying MS patients with severe disability to investigating relevant pathological brain changes. Methods: One-hundred-sixty-six MS patients underwent 3T MRI acquisitions. qMRI informative of microstructural brain properties was reconstructed, including quantitative T1 (qT1), myelin water fraction (MWF), and neurite density index (NDI). To fully utilize the qMRI, GAMER-MRIL extended a gated-attention-based CNN (GAMER-MRI), which was developed to select patch-based qMRI important for a given task/question, to the whole-brain image. To find out disability-related brain regions, GAMER-MRIL modified a structure-aware interpretability method, Layer-wise Relevance Propagation (LRP), to incorporate qMRI. Results: The test performance was AUC=0.885. qT1 was the most sensitive measure related to disability, followed by NDI. The proposed LRP approach obtained more specifically relevant regions than other interpretability methods, including the saliency map, the integrated gradients, and the original LRP. The relevant regions included the corticospinal tract, where average qT1 and NDI significantly correlated with patients' disability scores ($\rho$=-0.37 and 0.44). Conclusion: These results demonstrated that GAMER-MRIL can classify patients with severe disability using qMRI and subsequently identify brain regions potentially important to the integrity of the mobile function. Significance: GAMER-MRIL holds promise for developing biomarkers and increasing clinicians' trust in NN.
    摘要 目标:identifying multiple sclerosis (MS) 患者中 relate to disability 的 brain changes是非常重要的。目前,没有明确的认知关于单个 MS 患者中哪些病理特征驱动残疾。在这种工作中,我们提出了一种全新的 comprehensive 方法,GAMER-MRIL,通过整个大脑量化MRI (qMRI)、卷积神经网络 (CNN) 和可解释方法来从MS患者中分类患者严重残疾。方法:一百六十六名 MS 患者通过3T MRI成像。qMRI 中提供了微结构脑 Properties 的信息,包括量化T1 (qT1)、myelin water fraction (MWF) 和 neurite density index (NDI)。为了完全利用 qMRI,GAMER-MRIL 扩展了一种闭合注意力基于CNN (GAMER-MRI),将其应用到整个大脑图像。为了找出残疾相关的脑区,GAMER-MRIL 修改了结构意识 interpretability 方法,卷积层感知 propagation (LRP),以包含 qMRI。结果:测试性能为 AUC=0.885。qT1 是残疾相关度最高的度量,其次是 NDI。提出的 LRP 方法在特定的脑区中获得了更多的相关区域,比其他可解释方法更加具有特点。这些相关区域包括 corticospinal tract,其中 qT1 和 NDI 与患者残疾分数相关性 (-0.37 和 0.44)。结论:这些结果表明GAMER-MRIL 可以使用 qMRI 分类患者严重残疾,并在脑区中寻找可能与 mobil 功能完整性相关的区域。意义:GAMER-MRIL 具有发展生物标志物和提高临床医生对NN的信任的潜在价值。

AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

  • paper_url: http://arxiv.org/abs/2308.07593
  • repo_url: None
  • paper_authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
  • for: 面向视觉语音识别(唇读),通过引入音频知识弥补视觉模态中语音信息不足的问题,从而提高识别精度。
  • methods: 提出音频知识增强的视觉语音识别框架 AKVSR:利用大规模预训练音频模型编码丰富的音频知识,通过量化去除音频中的非语言信息并将语言信息保存在紧凑的音频记忆中,再通过 Audio Bridging Module 从紧凑音频记忆中检索最匹配的音频特征,使得在音频记忆构建完成后无需音频输入即可完成训练。
  • results: 大量实验验证了所提方法的有效性,在广泛使用的 LRS2 和 LRS3 数据集上取得了新的最优性能。
    Abstract Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different from the previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in compact audio memory by discarding the non-linguistic information from the audio through quantization, and 3) includes Audio Bridging Module which can find the best-matched audio features from the compact audio memory, which makes our training possible without audio inputs, once after the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used datasets, LRS2 and LRS3.
    摘要 visual speech recognition (VSR) 是指从舌头运动中预测说话的任务。 VSR 被视为一个具有挑战性的任务,因为舌头运动的信息不够。 在这篇论文中,我们提议了一个听音知识强化的视频语音识别框架(AKVSR),用于补充视觉模式中的不够的语音信息。 与前一些方法不同,我们的 AKVSR 具有以下特点:1. 利用大规模预训练的音频模型编码的丰富听音知识。2. 通过归约非语言信息,将音频信息储存在高效的音频内存中,以便在训练时不需要音频输入。3. 包括听音桥接模块,可以在训练时找到最佳匹配的音频特征,从而实现无需音频输入的训练。我们通过广泛的实验 validate 了我们的提议,并在常用的 datasets 上达到了新的state-of-the-art 性能。

Graph-Segmenter: Graph Transformer with Boundary-aware Attention for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.07592
  • repo_url: None
  • paper_authors: Zizhang Wu, Yuanzhu Gan, Tianhao Xu, Fan Wang
  • for: 提高 semantic segmentation 的性能
  • methods: 使用 Graph Transformer 和 Boundary-aware Attention 模块
  • results: 在三个 widely used semantic segmentation dataset 上达到 state-of-the-art 性能
    Abstract The transformer-based semantic segmentation approaches, which divide the image into different regions by sliding windows and model the relation inside each window, have achieved outstanding success. However, since the relation modeling between windows was not the primary emphasis of previous work, it was not fully utilized. To address this issue, we propose a Graph-Segmenter, including a Graph Transformer and a Boundary-aware Attention module, which is an effective network for simultaneously modeling the more profound relation between windows in a global view and various pixels inside each window as a local one, and for substantial low-cost boundary adjustment. Specifically, we treat every window and pixel inside the window as nodes to construct graphs for both views and devise the Graph Transformer. The introduced boundary-aware attention module optimizes the edge information of the target objects by modeling the relationship between the pixel on the object's edge. Extensive experiments on three widely used semantic segmentation datasets (Cityscapes, ADE-20k and PASCAL Context) demonstrate that our proposed network, a Graph Transformer with Boundary-aware Attention, can achieve state-of-the-art segmentation performance.
    摘要 基于 Transformer 的语义分割方法通过滑动窗口将图像划分为不同区域并对每个窗口内部的关系建模,已取得了突出的成功。然而,由于以往工作并未将窗口之间的关系建模作为重点,这一信息没有得到充分利用。为解决这一问题,我们提出 Graph-Segmenter,包含 Graph Transformer 和边界感知注意力模块,能够同时在全局视角下建模窗口之间更深层的关系、在局部视角下建模每个窗口内部各像素之间的关系,并以很低的代价进行边界调整。具体而言,我们将每个窗口以及窗口内的每个像素视为节点,为两种视角分别构建图,并设计 Graph Transformer;引入的边界感知注意力模块通过建模目标物体边缘像素之间的关系来优化边缘信息。在三个广泛使用的语义分割数据集(Cityscapes、ADE-20k 和 PASCAL Context)上的大量实验表明,所提出的带边界感知注意力的 Graph Transformer 网络能够取得最先进的分割性能。

ADD: An Automatic Desensitization Fisheye Dataset for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.07590
  • repo_url: None
  • paper_authors: Zizhang Wu, Chenxin Yuan, Hongyang Wei, Fan Song, Tianhao Xu
  • for: 提供一个大 FoV 鱼眼相机拍摄的自动驾驶环境中数据保护的解决方案,以满足法规要求。
  • methods: 基于大 FoV 鱼眼相机的自动驾驶拍摄数据,构建了首个 Autopilot Desensitization Dataset (ADD),并提出了一种深度学习基于图像感知的图像隐藏框架。
  • results: 在 ADD 数据集上,提出了一种高效的多任务感知网络(DesCenterNet),可以同时实现人脸和车牌检测和隐藏任务。对于图像隐藏任务,我们提出了一种新的评价标准,并进行了广泛的比较实验,证明了我们的方法的有效性和超越性。
    Abstract Autonomous driving systems require many images for analyzing the surrounding environment. However, there is fewer data protection for private information among these captured images, such as pedestrian faces or vehicle license plates, which has become a significant issue. In this paper, in response to the call for data security laws and regulations and based on the advantages of large Field of View(FoV) of the fisheye camera, we build the first Autopilot Desensitization Dataset, called ADD, and formulate the first deep-learning-based image desensitization framework, to promote the study of image desensitization in autonomous driving scenarios. The compiled dataset consists of 650K images, including different face and vehicle license plate information captured by the surround-view fisheye camera. It covers various autonomous driving scenarios, including diverse facial characteristics and license plate colors. Then, we propose an efficient multitask desensitization network called DesCenterNet as a benchmark on the ADD dataset, which can perform face and vehicle license plate detection and desensitization tasks. Based on ADD, we further provide an evaluation criterion for desensitization performance, and extensive comparison experiments have verified the effectiveness and superiority of our method on image desensitization.
    摘要 自动驾驶系统需要大量图像来分析周围环境。然而, captured 图像中的private信息,如行人脸或车辆识别号,却受到较少的数据保护,这成为了一个重要的问题。在这篇论文中,我们根据宽视场(FoV)大的鱼眼镜头的优点,建立了首个Autopilot Desensitization Dataset(ADD),并提出了首个深度学习基于图像抑制框架。通过ADD集成了650000张图像,包括不同的脸和车辆识别号信息, captured by surround-view fisheye camera。它覆盖了各种自动驾驶场景,包括多样化的脸容特征和车辆识别号颜色。然后,我们提出了一种高效的多任务抑制网络, called DesCenterNet,作为ADD集成的benchmark,可以同时完成脸和车辆识别号检测和抑制任务。基于ADD,我们还提供了图像抑制性评价标准,并进行了广泛的比较实验,证明了我们的方法在图像抑制方面的效果和优势。
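A minimal sketch of the desensitization step itself, assuming face and license-plate boxes are already provided by a detector such as DesCenterNet; only the blurring of the detected regions is shown here.

```python
import cv2
import numpy as np

def desensitize(image: np.ndarray, boxes) -> np.ndarray:
    """Blur detected face / license-plate regions in a fisheye frame.

    `boxes` is an iterable of (x0, y0, x1, y1) pixel coordinates, assumed to
    come from a detection network; the detection step is not reproduced here.
    """
    out = image.copy()
    h, w = out.shape[:2]
    for x0, y0, x1, y1 in boxes:
        x0, y0 = max(0, int(x0)), max(0, int(y0))
        x1, y1 = min(w, int(x1)), min(h, int(y1))
        if x1 <= x0 or y1 <= y0:
            continue
        roi = out[y0:y1, x0:x1]
        k = max(3, ((x1 - x0) // 4) * 2 + 1)        # odd Gaussian kernel scaled to box size
        out[y0:y1, x0:x1] = cv2.GaussianBlur(roi, (k, k), 0)
    return out
```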

Synthetic data generation method for hybrid image-tabular data using two generative adversarial networks

  • paper_url: http://arxiv.org/abs/2308.07573
  • repo_url: None
  • paper_authors: Tomohiro Kikuchi, Shouhei Hanaoka, Takahiro Nakao, Tomomi Takenaga, Yukihiro Nomura, Harushi Mori, Takeharu Yoshikawa
  • for: 这篇论文旨在提出一种生成医疗资料的方法,以便解决医疗领域中隐私问题和促进数据共享。
  • methods: 使用两种生成对抗网络(GAN):自编码 GAN(αGAN)与条件表格 GAN(CTGAN),生成由胸部 X 光图像与结构化表格数据组成的混合合成医疗记录。
  • results: 成功生成了多样化的混合合成记录,包括胸部 X 光图像(CXR)与结构化数据(人体测量数据和实验室检查),并保持了二者之间的对应关系。
    Abstract The generation of synthetic medical records using generative adversarial networks (GANs) has become increasingly important for addressing privacy concerns and promoting data sharing in the medical field. In this paper, we propose a novel method for generating synthetic hybrid medical records consisting of chest X-ray images (CXRs) and structured tabular data (including anthropometric data and laboratory tests) using an auto-encoding GAN ({\alpha}GAN) and a conditional tabular GAN (CTGAN). Our approach involves training a {\alpha}GAN model on a large public database (pDB) to reduce the dimensionality of CXRs. We then applied the trained encoder of the GAN model to the images in original database (oDB) to obtain the latent vectors. These latent vectors were combined with tabular data in oDB, and these joint data were used to train the CTGAN model. We successfully generated diverse synthetic records of hybrid CXR and tabular data, maintaining correspondence between them. We evaluated this synthetic database (sDB) through visual assessment, distribution of interrecord distances, and classification tasks. Our evaluation results showed that the sDB captured the features of the oDB while maintaining the correspondence between the images and tabular data. Although our approach relies on the availability of a large-scale pDB containing a substantial number of images with the same modality and imaging region as those in the oDB, this method has the potential for the public release of synthetic datasets without compromising the secondary use of data.
    摘要 现代生成技术在医疗领域中得到了广泛应用,尤其是通过生成对抗网络(GAN)来解决隐私问题和促进数据共享。本文提出了一种新的方法,使用自动编码GAN(αGAN)和条件表格GAN(CTGAN)生成混合类医疗记录,包括胸部X射线图像(CXR)和结构化表格数据(包括人体测量数据和实验室测试结果)。我们的方法是使用大规模公共数据库(pDB)来减少CXR的维度,然后使用训练过的GAN模型的编码器对oDB中的图像进行编码,得到了潜在 вектор。这些潜在 вектор与表格数据进行结合,并将这些联合数据用于CTGAN模型的训练。我们成功地生成了多样化的医疗记录,保持了图像和表格数据之间的协调。我们对这个synthetic数据库(sDB)进行了视觉评估、记录间距离分布和分类任务的评估。我们的评估结果表明,sDB捕捉了oDB中的特征,同时保持了图像和表格数据之间的协调。虽然我们的方法需要一个大规模的pDB,但这种方法具有公开 синтетиче数据库的潜在优势,不需要牺牲第二次使用数据的隐私。

Ske2Grid: Skeleton-to-Grid Representation Learning for Action Recognition

  • paper_url: http://arxiv.org/abs/2308.07571
  • repo_url: https://github.com/osvai/ske2grid
  • paper_authors: Dongqi Cai, Yangyuxuan Kang, Anbang Yao, Yurong Chen
  • for: 提出一种新的表示学习框架 Ske2Grid,用于改进基于人体骨架的动作识别。
  • methods: 通过三种新设计——图节点索引变换(GIT)、上采样变换(UPT)和渐进学习策略(PLS)——构建并学习一种表达能力更强的紧凑网格表示,并在其上定义规则卷积运算。
  • results: 在主流图卷积网络之上构建网络并在六个主流的骨架动作识别数据集上进行实验,结果表明 Ske2Grid 在不加任何额外技巧的情况下显著优于现有基于 GCN 的方法。
    Abstract This paper presents Ske2Grid, a new representation learning framework for improved skeleton-based action recognition. In Ske2Grid, we define a regular convolution operation upon a novel grid representation of human skeleton, which is a compact image-like grid patch constructed and learned through three novel designs. Specifically, we propose a graph-node index transform (GIT) to construct a regular grid patch through assigning the nodes in the skeleton graph one by one to the desired grid cells. To ensure that GIT is a bijection and enrich the expressiveness of the grid representation, an up-sampling transform (UPT) is learned to interpolate the skeleton graph nodes for filling the grid patch to the full. To resolve the problem when the one-step UPT is aggressive and further exploit the representation capability of the grid patch with increasing spatial size, a progressive learning strategy (PLS) is proposed which decouples the UPT into multiple steps and aligns them to multiple paired GITs through a compact cascaded design learned progressively. We construct networks upon prevailing graph convolution networks and conduct experiments on six mainstream skeleton-based action recognition datasets. Experiments show that our Ske2Grid significantly outperforms existing GCN-based solutions under different benchmark settings, without bells and whistles. Code and models are available at https://github.com/OSVAI/Ske2Grid
    摘要 本文提出 Ske2Grid,一种用于改进基于骨架的动作识别的新表示学习框架。在 Ske2Grid 中,我们在一种新的人体骨架网格表示上定义规则卷积运算,该网格是通过三种新设计构建并学习得到的紧凑的类图像网格块。首先,我们提出图节点索引变换(GIT),将骨架图中的节点逐一指派到目标网格单元,从而构建规则网格块;为保证 GIT 是双射并增强网格表示的表达能力,我们进一步学习上采样变换(UPT)对骨架图节点进行插值,将网格块填满。其次,针对一步 UPT 过于激进的问题,并为了在更大空间尺寸下进一步挖掘网格块的表示能力,我们提出渐进学习策略(PLS),将 UPT 解耦为多步,并通过逐步学习的紧凑级联设计将其与多组成对的 GIT 对齐。我们在主流图卷积网络之上构建网络,并在六个主流的骨架动作识别数据集上进行实验。实验表明,Ske2Grid 在不同基准设置下显著优于现有基于 GCN 的方法,且不依赖任何额外技巧。代码与模型见 https://github.com/OSVAI/Ske2Grid。
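To make the grid representation concrete, here is a naive sketch that lays skeleton joints out on an image-like grid so regular 2D convolutions can be applied. In Ske2Grid the node-to-cell assignment (GIT) and the filling of empty cells (UPT/PLS) are learned; the identity assignment and zero filling below are purely illustrative.

```python
import numpy as np

def skeleton_to_grid(joints: np.ndarray, grid_hw=(5, 5)) -> np.ndarray:
    """Naively place skeleton joints into an image-like grid patch.

    joints: (T, V, C) sequence with V joints and C channels (e.g. x, y, conf).
    Returns (T, C, H, W) so that a regular 2D convolution can run over it.
    """
    t, v, c = joints.shape
    h, w = grid_hw
    assert h * w >= v, "grid must have at least as many cells as joints"
    grid = np.zeros((t, h * w, c), dtype=joints.dtype)
    grid[:, :v] = joints                              # identity assignment, empty cells zero-filled
    return grid.reshape(t, h, w, c).transpose(0, 3, 1, 2)

if __name__ == "__main__":
    seq = np.random.randn(64, 17, 3)                  # 64 frames, 17 COCO-style joints
    print(skeleton_to_grid(seq).shape)                # (64, 3, 5, 5)
```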

Improved mirror ball projection for more accurate merging of multiple camera outputs and process monitoring

  • paper_url: http://arxiv.org/abs/2308.10991
  • repo_url: https://github.com/FrostKiwi/Mirrorball
  • paper_authors: Wladislav Artsimovich, Yoko Hirono
  • for: 用圆镜代替宽角摄像机,实现低成本的生产过程监测在危险环境中,包括高温、真空和强电磁场环境。
  • methods: 使用圆镜反射将多种摄像机类型(如彩色图像、近红外、长波长红外、 ultraviolet)集成到单一宽角输出中,并考虑不同摄像机位置和镜头使用。
  • results: 研究表明,使用圆镜反射可以减少不同摄像机位置引入的视角偏移,具体取决于镜子大小和监测目标距离。此外,本文还介绍了一种受限于投影镜球的扭曲问题的变种,并评估了过程监测via圆镜球的效果。
    Abstract Using spherical mirrors in place of wide-angle cameras allows for cost-effective monitoring of manufacturing processes in hazardous environment, where a camera would normally not operate. This includes environments of high heat, vacuum and strong electromagnetic fields. Moreover, it allows the layering of multiple camera types (e.g., color image, near-infrared, long-wavelength infrared, ultraviolet) into a single wide-angle output, whilst accounting for the different camera placements and lenses used. Normally, the different camera positions introduce a parallax shift between the images, but with a spherical projection as produced by a spherical mirror, this parallax shift is reduced, depending on mirror size and distance to the monitoring target. This paper introduces a variation of the 'mirror ball projection', that accounts for distortion produced by a perspective camera at the pole of the projection. Finally, the efficacy of process monitoring via a mirror ball is evaluated.
    摘要 用球面镜代替广角相机,可以在相机通常无法工作的危险环境(如高温、真空和强电磁场)中以较低成本监测制造过程。此外,它还允许将多种相机类型(如彩色、近红外、长波红外、紫外)叠加到同一个广角输出中,同时兼顾不同的相机位置与镜头。通常,不同的相机位置会在图像之间引入视差偏移,而球面镜形成的球面投影可以减小这种视差偏移,其程度取决于镜面尺寸与到监测目标的距离。本文提出“镜球投影”的一种变体,用于校正透视相机在投影极点处产生的畸变,并评估了通过镜球进行过程监测的有效性。

SST: A Simplified Swin Transformer-based Model for Taxi Destination Prediction based on Existing Trajectory

  • paper_url: http://arxiv.org/abs/2308.07555
  • repo_url: None
  • paper_authors: Zepu Wang, Yifei Sun, Zhiyu Lei, Xincheng Zhu, Peng Sun
  • for: 准确预测出租车轨迹的目的地可为智能位置服务带来多方面的好处。
  • methods: 将出租车轨迹转换为二维网格,并采用计算机视觉技术进行预测;由于轨迹数据本质上是连续的,提出不使用移动窗口(shifted window)机制的简化 Swin Transformer(SST)结构。
  • results: 基于真实轨迹数据的大量实验表明,SST 的预测准确率高于现有最先进方法。
    Abstract Accurately predicting the destination of taxi trajectories can have various benefits for intelligent location-based services. One potential method to accomplish this prediction is by converting the taxi trajectory into a two-dimensional grid and using computer vision techniques. While the Swin Transformer is an innovative computer vision architecture with demonstrated success in vision downstream tasks, it is not commonly used to solve real-world trajectory problems. In this paper, we propose a simplified Swin Transformer (SST) structure that does not use the shifted window idea in the traditional Swin Transformer, as trajectory data is consecutive in nature. Our comprehensive experiments, based on real trajectory data, demonstrate that SST can achieve higher accuracy compared to state-of-the-art methods.
    摘要 准确预测出租车轨迹的目的地可为智能位置服务带来多方面的好处。实现这一预测的一种可行方法是将出租车轨迹转换为二维网格,并采用计算机视觉技术。尽管 Swin Transformer 是一种在视觉下游任务中表现出色的创新计算机视觉架构,但它很少被用于解决现实世界的轨迹问题。由于轨迹数据本质上是连续的,本文提出一种不使用传统 Swin Transformer 中移动窗口思想的简化 Swin Transformer(SST)结构。基于真实轨迹数据的大量实验表明,SST 的准确率高于现有最先进方法。
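A sketch of the kind of preprocessing implied above: rasterising the partial trajectory into a fixed 2D grid that a vision backbone such as the simplified Swin Transformer can consume. The grid size, bounding box, and visit-order encoding are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def trajectory_to_grid(points, bounds, grid_hw=(224, 224)) -> np.ndarray:
    """Rasterise a partial taxi trajectory into a 2D grid image.

    points: (N, 2) array of (longitude, latitude) fixes observed so far.
    bounds: (lon_min, lat_min, lon_max, lat_max) of the covered city area.
    Later points get larger values so the model can read the direction of travel.
    """
    lon_min, lat_min, lon_max, lat_max = bounds
    h, w = grid_hw
    grid = np.zeros((h, w), dtype=np.float32)
    n = len(points)
    for i, (lon, lat) in enumerate(points):
        col = int((lon - lon_min) / (lon_max - lon_min) * (w - 1))
        row = int((lat_max - lat) / (lat_max - lat_min) * (h - 1))
        if 0 <= row < h and 0 <= col < w:
            grid[row, col] = (i + 1) / n          # later points -> brighter cells
    return grid                                    # feed to the vision backbone
```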

Multi-view 3D Face Reconstruction Based on Flame

  • paper_url: http://arxiv.org/abs/2308.07551
  • repo_url: None
  • paper_authors: Wenzhuo Zheng, Junhao Zhao, Xiaohong Liu, Yongyang Pan, Zhenghao Gan, Haozhe Han, Ning Liu
  • for: 通过将多视图训练框架与人脸参数化模型 Flame 相结合来提升人脸 3D 重建质量,提出多视图训练与测试模型 MFNet。
  • methods: 构建自监督训练框架,引入多视图光流损失与人脸关键点损失等约束,并创新地实现多视图光流损失与共视(covisible)掩码,最终得到完整的 MFNet。
  • results: 在 AFLW 和 facescape 数据集上测试模型,并在尽量贴近实际场景的条件下用自拍人脸图像进行 3D 重建,均取得了良好效果。
    Abstract At present, face 3D reconstruction has broad application prospects in various fields, but the research on it is still in the development stage. In this paper, we hope to achieve better face 3D reconstruction quality by combining multi-view training framework with face parametric model Flame, propose a multi-view training and testing model MFNet (Multi-view Flame Network). We build a self-supervised training framework and implement constraints such as multi-view optical flow loss function and face landmark loss, and finally obtain a complete MFNet. We propose innovative implementations of multi-view optical flow loss and the covisible mask. We test our model on AFLW and facescape datasets and also take pictures of our faces to reconstruct 3D faces while simulating actual scenarios as much as possible, which achieves good results. Our work mainly addresses the problem of combining parametric models of faces with multi-view face 3D reconstruction and explores the implementation of a Flame based multi-view training and testing framework for contributing to the field of face 3D reconstruction.
    摘要 当前,人脸 3D 重建在各领域具有广阔的应用前景,但相关研究仍处于发展阶段。本文希望通过将多视图训练框架与人脸参数化模型 Flame 相结合来获得更好的人脸 3D 重建质量,提出了多视图训练与测试模型 MFNet(多视图 Flame 网络)。我们构建了自监督训练框架,引入多视图光流损失函数与人脸关键点损失等约束,最终得到完整的 MFNet;同时创新地实现了多视图光流损失与共视掩码。我们在 AFLW 和 facescape 数据集上测试了模型,并在尽量模拟真实场景的条件下拍摄自己的人脸图像进行 3D 重建,取得了良好效果。我们的工作主要解决了人脸参数化模型与多视图人脸 3D 重建相结合的问题,并探索了基于 Flame 的多视图训练与测试框架,为人脸 3D 重建领域做出贡献。

3DHacker: Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud Attack

  • paper_url: http://arxiv.org/abs/2308.07546
  • repo_url: None
  • paper_authors: Yunbo Tao, Daizong Liu, Pan Zhou, Yulai Xie, Wei Du, Wei Hu
  • for: 随着深度传感器的成熟,3D 点云模型在自动驾驶、机器人导航等应用中的对抗安全性受到越来越多的关注。
  • methods: 提出一种更具挑战性也更贴近实际的攻击设定,即仅能获得输入预测标签的黑盒硬标签 3D 点云攻击,并提出 3DHacker:先在谱域随机融合两个不同类别的点云得到高隐蔽性的中间样本,再通过二分搜索将其投影到决策边界上,并用迭代优化策略沿决策边界移动样本,以生成扰动最小的对抗点云。
  • results: 大量评估表明,即使在困难的硬标签设定下,3DHacker 在攻击性能和对抗样本质量上仍然优于现有的 3D 攻击方法。
    Abstract With the maturity of depth sensors, the vulnerability of 3D point cloud models has received increasing attention in various applications such as autonomous driving and robot navigation. Previous 3D adversarial attackers either follow the white-box setting to iteratively update the coordinate perturbations based on gradients, or utilize the output model logits to estimate noisy gradients in the black-box setting. However, these attack methods are hard to be deployed in real-world scenarios since realistic 3D applications will not share any model details to users. Therefore, we explore a more challenging yet practical 3D attack setting, \textit{i.e.}, attacking point clouds with black-box hard labels, in which the attacker can only have access to the prediction label of the input. To tackle this setting, we propose a novel 3D attack method, termed \textbf{3D} \textbf{H}ard-label att\textbf{acker} (\textbf{3DHacker}), based on the developed decision boundary algorithm to generate adversarial samples solely with the knowledge of class labels. Specifically, to construct the class-aware model decision boundary, 3DHacker first randomly fuses two point clouds of different classes in the spectral domain to craft their intermediate sample with high imperceptibility, then projects it onto the decision boundary via binary search. To restrict the final perturbation size, 3DHacker further introduces an iterative optimization strategy to move the intermediate sample along the decision boundary for generating adversarial point clouds with smallest trivial perturbations. Extensive evaluations show that, even in the challenging hard-label setting, 3DHacker still competitively outperforms existing 3D attacks regarding the attack performance as well as adversary quality.
    摘要 随着深度感知器的成熟,3D点云模型的漏洞受到了各种应用程序中的关注,如自动驾驶和机器人导航。先前的3D反击器都是采用白盒设定来逐渐更新坐标偏移量基于梯度,或者使用输出模型的логи值来估计噪声梯度在黑盒设定下。然而,这些攻击方法在实际应用场景中很难实施,因为实际的3D应用程序不会分享任何模型细节给用户。因此,我们研究一种更加具有挑战性且实用的3D攻击设定,即在黑盒硬标记下攻击点云,在这个设定下,攻击者只有访问输入的预测标签。为解决这个设定,我们提出了一种新的3D攻击方法,即3D硬标记攻击者(3DHacker),基于已发展的决策边界算法来生成反击样本,只需通过知道类别标签来生成对抗样本。具体来说,为构建类别意识模型的决策边界,3DHacker首先随机将两个不同类型的点云在spectral domain中混合为中间样本,然后将其投射到决策边界 via binary search。为限制最终的偏移量,3DHacker进一步引入了一种迭代优化策略,将中间样本在决策边界上移动,以生成最小的极小偏移量。广泛的评估表明,即使在挑战性的硬标记设定下,3DHacker仍然可以与现有3D攻击相比,在攻击性和对手质量方面具有竞争力。

Multimodal Dataset Distillation for Image-Text Retrieval

  • paper_url: http://arxiv.org/abs/2308.07545
  • repo_url: None
  • paper_authors: Xindi Wu, Zhiwei Deng, Olga Russakovsky
  • for: 将数据集蒸馏(dataset distillation)方法从图像分类扩展到视觉-语言模型的训练,使得用蒸馏出的小规模数据即可从零训练新模型。
  • methods: 在轨迹匹配思想的基础上提出多模态数据集蒸馏方法,以对比(contrastive)形式联合蒸馏图像及其对应的文本描述。
  • results: 在具有挑战性的 Flickr30K 和 COCO 检索基准上显著优于三种核心集选择(对训练集的策略性子采样)方法:最好的核心集选择方法用 1000 个图文对训练仅能达到 5.6% 的图像到文本检索精度(recall@1),而所提蒸馏方法只用 100 个训练对(少一个数量级)就几乎将该精度翻倍。
    Abstract Dataset distillation methods offer the promise of reducing a large-scale dataset down to a significantly smaller set of (potentially synthetic) training examples, which preserve sufficient information for training a new model from scratch. So far dataset distillation methods have been developed for image classification. However, with the rise in capabilities of vision-language models, and especially given the scale of datasets necessary to train these models, the time is ripe to expand dataset distillation methods beyond image classification. In this work, we take the first steps towards this goal by expanding on the idea of trajectory matching to create a distillation method for vision-language datasets. The key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed multimodal dataset distillation method jointly distill the images and their corresponding language descriptions in a contrastive formulation. Since there are no existing baselines, we compare our approach to three coreset selection methods (strategic subsampling of the training dataset), which we adapt to the vision-language setting. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmark: the best coreset selection method which selects 1000 image-text pairs for training is able to achieve only 5.6% image-to-text retrieval accuracy (recall@1); in contrast, our dataset distillation approach almost doubles that with just 100 (an order of magnitude fewer) training pairs.
    摘要 dataset 简化方法可以将大规模 dataset 缩小到一个较小的(可能是人工生成的)训练示例集,保留足够的信息来训练一个新模型从头开始。目前,dataset 简化方法已经被开发出来用于图像分类。然而,随着视觉语言模型的能力的提高,特别是对于训练这些模型所需的数据集的规模的增长,现在是时候扩展 dataset 简化方法到更多领域。在这项工作中,我们做出了首先的尝试,扩展了路径匹配的想法,以创建一种用于视觉语言 dataset 的简化方法。主要挑战在于视觉语言 dataset 没有固定的分类集。为了解决这个问题,我们提出了一种多Modal 的 dataset 简化方法,通过对图像和其相应的语言描述进行joint降维来实现。由于没有现有的基准,我们对这种方法进行比较,并将其与三种核心选择方法(策略性抽样)进行比较。我们在复杂的 Flickr30K 和 COCO 检索benchmark上显示出了显著的改善:最佳核心选择方法,选择 1000 个图像-文本对作为训练集,只能达到 5.6% 的图像-文本检索精度(recall@1);相比之下,我们的 dataset 简化方法可以在 100 个训练对(一个小数量)下达到同样的精度。

Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond

  • paper_url: http://arxiv.org/abs/2308.07539
  • repo_url: None
  • paper_authors: Chen Shuai, Meng Fanman, Zhang Runtong, Qiu Heqian, Li Hongliang, Wu Qingbo, Xu Linfeng
  • for: The paper addresses few-shot segmentation (FSS) tasks, specifically enhancing the generalization ability of FSS models using CLIP.
  • methods: The proposed PGMA-Net employs a class-agnostic mask assembly process to alleviate bias towards base classes and formulates diverse tasks in a unified manner by assembling the prior through affinity. It includes a Prior-Guided Mask Assemble Module (PGMAM) with multiple General Assemble Units (GAUs) that consider diverse and plug-and-play interactions, and a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) to flexibly exploit assembled masks and low-level features.
  • results: PGMA-Net achieves new state-of-the-art results on the FSS task, with mIoU of $77.6$ on $\text{PASCAL-}5^i$ and $59.4$ on $\text{COCO-}20^i$ in the 1-shot scenario. Without extra re-training, it can also solve bbox-level and cross-domain FSS, co-segmentation, and zero-shot segmentation (ZSS) tasks, yielding an any-shot segmentation framework.
    Abstract Few-shot segmentation (FSS) aims to segment the novel classes with a few annotated images. Due to CLIP's advantages of aligning visual and textual information, the integration of CLIP can enhance the generalization ability of FSS model. However, even with the CLIP model, the existing CLIP-based FSS methods are still subject to the biased prediction towards base classes, which is caused by the class-specific feature level interactions. To solve this issue, we propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the bias, and formulates diverse tasks into a unified manner by assembling the prior through affinity. Specifically, the class-relevant textual and visual features are first transformed to class-agnostic prior in the form of probability map. Then, a Prior-Guided Mask Assemble Module (PGMAM) including multiple General Assemble Units (GAUs) is introduced. It considers diverse and plug-and-play interactions, such as visual-textual, inter- and intra-image, training-free, and high-order ones. Lastly, to ensure the class-agnostic ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed to flexibly exploit the assembled masks and low-level features, without relying on any class-specific information. It achieves new state-of-the-art results in the FSS task, with mIoU of $77.6$ on $\text{PASCAL-}5^i$ and $59.4$ on $\text{COCO-}20^i$ in 1-shot scenario. Beyond this, we show that without extra re-training, the proposed PGMA-Net can solve bbox-level and cross-domain FSS, co-segmentation, zero-shot segmentation (ZSS) tasks, leading an any-shot segmentation framework.
    摘要 “几shot分类(FSS)的目标是使用几个标注图像来分类新的类别。由于CLIP的优点,将CLIP与FSS模型结合可以提高模型的扩展能力。然而,即使使用CLIP模型,现有的CLIP-based FSS方法仍然受到基本类别的预测偏好,这是由于类别特定的层次交互所致。为解决这个问题,我们提出了一个可视和文本对照的 Prior Guided Mask Assemble Network (PGMA-Net)。它使用一个类别不偏的掩模过程来减少偏好,并将多种任务转换为一个统一的形式。具体来说,首先将类别相关的文本和可见特征转换为类别不偏的机会地图。然后,我们引入一个 Prior-Guided Mask Assemble Module (PGMAM),包括多个通用组合单元 (GAUs)。它考虑了多种不同和可插入的交互,例如可见文本、间隔和内部图像、训练无须、高阶的交互。最后,为保持类别不偏的能力,我们提出了一个弹性调节的高级解码器 (HDCDM),以灵活地利用掩模和低层特征,不需要靠类别特定的信息。它实现了新的顶尖成绩在FSS任务中,具体为PASCAL-$5^i$中的$77.6$和COCO-$20^i$中的$59.4$在1架构enario中。此外,我们显示了在无需额外重训的情况下,提案的PGMA-Net可以解决矩形范围内的FSS、共 segmentation、零shot segmentation (ZSS)任务,实现一个任何shot segmentation框架。”

AttMOT: Improving Multiple-Object Tracking by Introducing Auxiliary Pedestrian Attributes

  • paper_url: http://arxiv.org/abs/2308.07537
  • repo_url: None
  • paper_authors: Yunhao Li, Zhen Xiao, Lin Yang, Dan Meng, Xin Zhou, Heng Fan, Libo Zhang
  • for: The paper aims to address the gap in exploring pedestrian attributes in multi-object tracking (MOT) and proposes a method to predict pedestrian attributes to support general Re-ID embedding.
  • methods: The proposed method AAM explores different approaches to fuse Re-ID embedding and pedestrian attributes, including attention mechanisms, to improve the performance of MOT.
  • results: AAM achieves consistent improvements in MOTA, HOTA, AssA, IDs, and IDF1 scores on several representative pedestrian multi-object tracking benchmarks, including MOT17 and MOT20, when applied to state-of-the-art trackers.
    Abstract Multi-object tracking (MOT) is a fundamental problem in computer vision with numerous applications, such as intelligent surveillance and automated driving. Despite the significant progress made in MOT, pedestrian attributes, such as gender, hairstyle, body shape, and clothing features, which contain rich and high-level information, have been less explored. To address this gap, we propose a simple, effective, and generic method to predict pedestrian attributes to support general Re-ID embedding. We first introduce AttMOT, a large, highly enriched synthetic dataset for pedestrian tracking, containing over 80k frames and 6 million pedestrian IDs with different time, weather conditions, and scenarios. To the best of our knowledge, AttMOT is the first MOT dataset with semantic attributes. Subsequently, we explore different approaches to fuse Re-ID embedding and pedestrian attributes, including attention mechanisms, which we hope will stimulate the development of attribute-assisted MOT. The proposed method AAM demonstrates its effectiveness and generality on several representative pedestrian multi-object tracking benchmarks, including MOT17 and MOT20, through experiments on the AttMOT dataset. When applied to state-of-the-art trackers, AAM achieves consistent improvements in MOTA, HOTA, AssA, IDs, and IDF1 scores. For instance, on MOT17, the proposed method yields a +1.1 MOTA, +1.7 HOTA, and +1.8 IDF1 improvement when used with FairMOT. To encourage further research on attribute-assisted MOT, we will release the AttMOT dataset.
    摘要 多bject tracking (MOT) 是计算机视觉中的基本问题,具有许多应用,如智能监控和自动驾驶。 DESPITE 在 MOT 中做出了 significan progress, pedestrian 特征,如性别、发型、身体形态和服装特征,具有丰富和高级信息,却得到了更少的关注。为了解决这一漏洞,我们提出了一种简单、有效和通用的方法,可以预测 pedestrian 特征,以支持通用 Re-ID 嵌入。我们首先介绍 AttMOT,一个大型、高度充实的人工synthetic dataset for pedestrian tracking,包含了80k帧和6000万个 pedestrian ID,具有不同的时间、天气和场景。我们知道 AttMOT 是首个具有semantic attribute的 MOT dataset。然后,我们探索了不同的方法来融合 Re-ID 嵌入和 pedestrian 特征,包括注意力机制。我们希望这种方法能够激发attribute-assisted MOT的发展。我们提出的方法 AAM 在多个表现 pedestrian multi-object tracking benchmarks,包括 MOT17 和 MOT20,通过在 AttMOT dataset上进行实验,实现了显著的改进。例如,在 MOT17 上,我们的方法可以提高 +1.1 MOTA、+1.7 HOTA 和 +1.8 IDF1 分数。为了鼓励 attribute-assisted MOT 的进一步研究,我们将在 AttMOT dataset上发布 AttMOT。

Improved Region Proposal Network for Enhanced Few-Shot Object Detection

  • paper_url: http://arxiv.org/abs/2308.07535
  • repo_url: https://github.com/zshanggu/htrpn
  • paper_authors: Zeyu Shangguan, Mohammad Rostami
  • for: 这项研究旨在缓解基于深度学习的监督目标检测方法对大量标注数据的依赖,提升少样本目标检测(FSOD)的性能。
  • methods: 该研究提出了一种半监督方法,在训练中利用未标注的新类实例作为正样本来提升FSOD性能。具体而言,作者开发了层次三元分类区域提议网络(HTRPN),用于定位未标注的新类实例并为其分配新的objectness标签,同时改进了RPN的层次采样策略以增强对大目标的感知能力。
  • results: 在COCO和PASCAL VOC基准数据集上的测试结果表明,该方法有效提升了少样本目标检测性能,并超越了现有的最先进(SOTA)FSOD方法。
    Abstract Despite significant success of deep learning in object detection tasks, the standard training of deep neural networks requires access to a substantial quantity of annotated images across all classes. Data annotation is an arduous and time-consuming endeavor, particularly when dealing with infrequent objects. Few-shot object detection (FSOD) methods have emerged as a solution to the limitations of classic object detection approaches based on deep learning. FSOD methods demonstrate remarkable performance by achieving robust object detection using a significantly smaller amount of training data. A challenge for FSOD is that instances from novel classes that do not belong to the fixed set of training classes appear in the background and the base model may pick them up as potential objects. These objects behave similarly to label noise because they are classified as one of the training dataset classes, leading to FSOD performance degradation. We develop a semi-supervised algorithm to detect and then utilize these unlabeled novel objects as positive samples during the FSOD training stage to improve FSOD performance. Specifically, we develop a hierarchical ternary classification region proposal network (HTRPN) to localize the potential unlabeled novel objects and assign them new objectness labels to distinguish these objects from the base training dataset classes. Our improved hierarchical sampling strategy for the region proposal network (RPN) also boosts the perception ability of the object detection model for large objects. We test our approach and COCO and PASCAL VOC baselines that are commonly used in FSOD literature. Our experimental results indicate that our method is effective and outperforms the existing state-of-the-art (SOTA) FSOD methods. Our implementation is provided as a supplement to support reproducibility of the results.
    摘要 尽管深度学习在目标检测任务中取得了显著成功,但深度神经网络的标准训练需要获取所有类别的大量标注图像。数据标注是一项费时费力的工作,尤其是面对出现频率较低的目标时。少样本目标检测(FSOD)方法应运而生,以克服经典深度学习目标检测方法的这一局限:FSOD方法仅需显著更少的训练数据即可实现稳健的目标检测。然而,FSOD面临的一个挑战是,不属于固定训练类别集合的新类实例会出现在背景中,基础模型可能将其当作潜在目标检出。这些实例的行为类似于标签噪声,因为它们会被归为训练集中的某个类别,从而导致FSOD性能下降。我们提出了一种半监督算法,在FSOD训练阶段检测这些未标注的新类目标,并将其作为正样本加以利用,以提升FSOD性能。具体而言,我们设计了层次三元分类区域提议网络(HTRPN),用于定位潜在的未标注新类目标,并为其分配新的objectness标签,以将其与基础训练集中的类别区分开来。我们还改进了区域提议网络(RPN)的层次采样策略,增强了检测模型对大目标的感知能力。我们在FSOD文献中常用的COCO和PASCAL VOC基准上测试了该方法,实验结果表明该方法有效,并优于现有的最先进(SOTA)FSOD方法。我们随文提供了实现代码,以支持结果的可复现性。
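The hierarchical ternary objectness idea can be pictured with a small sketch: an RPN-style head that scores each anchor as background, base-class object, or potential unlabeled novel object. This is an assumed simplification for illustration; the real HTRPN (see the linked repo) is more involved.

```python
import torch
import torch.nn as nn

class TernaryObjectnessHead(nn.Module):
    """Illustrative sketch (assumption, not the authors' HTRPN code): an RPN head that scores
    each anchor as background, base-class object, or potential unlabeled novel object."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors * 3, 1)   # 3-way logits per anchor
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)  # standard box regression

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        logits = self.objectness(x)        # (B, A*3, H, W): bg / base / novel
        deltas = self.bbox_deltas(x)       # (B, A*4, H, W)
        return logits, deltas

# anchors whose "novel" score dominates could then be treated as positive pseudo-labels during training
```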

Inverse Lithography Physics-informed Deep Neural Level Set for Mask Optimization

  • paper_url: http://arxiv.org/abs/2308.12299
  • repo_url: None
  • paper_authors: Xing-Yu Ma, Shaogang Hao
  • for: 作为光刻工艺中的分辨率增强技术,优化掩模以确保高可印性(印刷保真度)。
  • methods: 将深度学习(DL)方法与基于水平集的反向光刻技术(ILT)相结合,实现掩模优化。
  • results: 与纯DL和纯ILT方法相比,ILDLS方法可将计算时间降低数个数量级,同时提升可印性并扩大工艺窗口(PW)。
    Abstract As the feature size of integrated circuits continues to decrease, optical proximity correction (OPC) has emerged as a crucial resolution enhancement technology for ensuring high printability in the lithography process. Recently, level set-based inverse lithography technology (ILT) has drawn considerable attention as a promising OPC solution, showcasing its powerful pattern fidelity, especially in advanced process. However, massive computational time consumption of ILT limits its applicability to mainly correcting partial layers and hotspot regions. Deep learning (DL) methods have shown great potential in accelerating ILT. However, lack of domain knowledge of inverse lithography limits the ability of DL-based algorithms in process window (PW) enhancement and etc. In this paper, we propose an inverse lithography physics-informed deep neural level set (ILDLS) approach for mask optimization. This approach utilizes level set based-ILT as a layer within the DL framework and iteratively conducts mask prediction and correction to significantly enhance printability and PW in comparison with results from pure DL and ILT. With this approach, computation time is reduced by a few orders of magnitude versus ILT. By gearing up DL with knowledge of inverse lithography physics, ILDLS provides a new and efficient mask optimization solution.
    摘要 随着集成电路特征尺寸的不断缩小,光学邻近校正(OPC)已成为保证光刻工艺高可印性的关键分辨率增强技术。近年来,基于水平集的反向光刻技术(ILT)作为一种有前景的OPC方案受到广泛关注,尤其在先进制程中展现出强大的图形保真度。然而,ILT巨大的计算时间开销限制了其应用范围,通常只用于修正部分版图层和热点区域。深度学习(DL)方法在加速ILT方面展现出巨大潜力,但由于缺乏反向光刻的领域知识,基于DL的算法在工艺窗口(PW)提升等方面能力有限。本文提出了一种反向光刻物理信息驱动的深度神经水平集(ILDLS)掩模优化方法。该方法将基于水平集的ILT作为DL框架中的一个层,迭代地进行掩模预测与校正,与纯DL和纯ILT的结果相比显著提升了可印性和工艺窗口。与ILT相比,该方法将计算时间降低了数个数量级。通过为DL注入反向光刻物理知识,ILDLS提供了一种新的高效掩模优化方案。

Confidence Contours: Uncertainty-Aware Annotation for Medical Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.07528
  • repo_url: None
  • paper_authors: Andre Ye, Quan Ze Chen, Amy Zhang
  • for: 本研究旨在提出一种新的医学图像分割标注与表示方法,直接刻画标注中的不确定性,从而更好地处理视觉歧义。
  • methods: 本研究提出了一种新的分割表示方法,称为置信轮廓(Confidence Contours),通过高置信度和低置信度两条"轮廓"直接刻画不确定性;同时,研究人员还开发了一套用于收集轮廓标注的新标注系统。
  • results: 研究人员在肺部图像数据集联盟(LIDC)数据集和一个合成数据集上进行了评估。一项有30名标注者参与的研究表明,置信轮廓具有很强的表示能力,且不会显著增加标注者的工作量;通用分割模型学习置信轮廓的性能与学习标准单一标注相当。此外,对5名医学专家的访谈表明,由于能够表达结构性不确定性,置信轮廓图比贝叶斯不确定性图更易于解读。
    Abstract Medical image segmentation modeling is a high-stakes task where understanding of uncertainty is crucial for addressing visual ambiguity. Prior work has developed segmentation models utilizing probabilistic or generative mechanisms to infer uncertainty from labels where annotators draw a singular boundary. However, as these annotations cannot represent an individual annotator's uncertainty, models trained on them produce uncertainty maps that are difficult to interpret. We propose a novel segmentation representation, Confidence Contours, which uses high- and low-confidence ``contours'' to capture uncertainty directly, and develop a novel annotation system for collecting contours. We conduct an evaluation on the Lung Image Dataset Consortium (LIDC) and a synthetic dataset. From an annotation study with 30 participants, results show that Confidence Contours provide high representative capacity without considerably higher annotator effort. We also find that general-purpose segmentation models can learn Confidence Contours at the same performance level as standard singular annotations. Finally, from interviews with 5 medical experts, we find that Confidence Contour maps are more interpretable than Bayesian maps due to representation of structural uncertainty.
    摘要 医学图像分割建模是一项高风险任务,理解不确定性对于处理视觉歧义至关重要。已有工作利用概率或生成机制,从标注者绘制的单一边界标签中推断不确定性;然而,这类标注无法表达单个标注者自身的不确定性,因此在其上训练的模型产生的不确定性图难以解读。我们提出了一种新的分割表示方式——置信轮廓(Confidence Contours),利用高置信度和低置信度两条"轮廓"直接刻画不确定性,并开发了一套用于收集轮廓的标注系统。我们在肺部图像数据集联盟(LIDC)数据集和一个合成数据集上进行了评估。一项有30名参与者的标注研究表明,置信轮廓在不显著增加标注工作量的前提下具有很强的表示能力。我们还发现,通用分割模型学习置信轮廓的性能与学习标准单一标注相当。最后,对5名医学专家的访谈显示,由于表达了结构性不确定性,置信轮廓图比贝叶斯不确定性图更易于解读。
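One way to picture how a general-purpose segmentation model could learn Confidence Contours is to supervise a two-channel head with the high- and low-confidence masks. The loss below is a hypothetical sketch under that assumption, not the paper's training code.

```python
import torch
import torch.nn as nn

class ConfidenceContourLoss(nn.Module):
    """Hypothetical sketch: supervise a 2-channel segmentation head with a high-confidence
    mask and a low-confidence (maximal-extent) mask."""
    def __init__(self):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits, high_mask, low_mask):
        # logits: (B, 2, H, W); channel 0 -> high-confidence region, channel 1 -> low-confidence region
        loss_high = self.bce(logits[:, 0], high_mask.float())
        loss_low = self.bce(logits[:, 1], low_mask.float())
        return loss_high + loss_low

# usage: loss = ConfidenceContourLoss()(model_output, hi_gt, lo_gt)
```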

Benchmarking Scalable Epistemic Uncertainty Quantification in Organ Segmentation

  • paper_url: http://arxiv.org/abs/2308.07506
  • repo_url: https://github.com/jadie1/medseguq
  • paper_authors: Jadie Adams, Shireen Y. Elhabian
  • for: 本文旨在评测深度学习自动器官分割中的认知(epistemic)不确定性量化方法,以便为临床应用提供可靠且稳健的模型。
  • methods: 本文对多种认知不确定性量化方法进行了基准比较,包括贝叶斯神经网络、蒙特卡洛dropout和深度集成(Deep Ensembles)等。
  • results: 研究发现,深度集成(Deep Ensembles)在准确率和不确定性校准方面表现最佳,而贝叶斯神经网络在分布外检测方面表现最好;本文还讨论了各方法的优缺点,并给出了未来改进的建议。
    Abstract Deep learning based methods for automatic organ segmentation have shown promise in aiding diagnosis and treatment planning. However, quantifying and understanding the uncertainty associated with model predictions is crucial in critical clinical applications. While many techniques have been proposed for epistemic or model-based uncertainty estimation, it is unclear which method is preferred in the medical image analysis setting. This paper presents a comprehensive benchmarking study that evaluates epistemic uncertainty quantification methods in organ segmentation in terms of accuracy, uncertainty calibration, and scalability. We provide a comprehensive discussion of the strengths, weaknesses, and out-of-distribution detection capabilities of each method as well as recommendations for future improvements. These findings contribute to the development of reliable and robust models that yield accurate segmentations while effectively quantifying epistemic uncertainty.
    摘要 基于深度学习的自动器官分割方法在辅助诊断和治疗规划方面展现出良好前景。然而,在关键的临床应用中,量化并理解模型预测所伴随的不确定性至关重要。尽管已有许多用于认知(模型相关)不确定性估计的技术被提出,但在医学图像分析场景中哪种方法更优尚不明确。本文开展了一项全面的基准研究,从准确率、不确定性校准和可扩展性三个方面评估了器官分割中的认知不确定性量化方法。我们详细讨论了每种方法的优势、劣势与分布外检测能力,并给出了未来改进的建议。这些发现有助于开发既能产生准确分割、又能有效量化认知不确定性的可靠且稳健的模型。
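For context, one common epistemic-uncertainty estimator that a benchmark like this typically includes is Monte Carlo dropout. A minimal sketch follows, assuming a softmax segmentation model with dropout layers; the function name and sample count are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model: nn.Module, image: torch.Tensor, n_samples: int = 20):
    """Illustrative sketch of Monte Carlo dropout for epistemic uncertainty in segmentation."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(image), dim=1) for _ in range(n_samples)])
    mean_prob = probs.mean(dim=0)                 # averaged segmentation prediction (B, C, H, W)
    epistemic = probs.var(dim=0).mean(dim=1)      # per-pixel predictive variance as uncertainty (B, H, W)
    return mean_prob, epistemic
```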

ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection

  • paper_url: http://arxiv.org/abs/2308.07504
  • repo_url: https://github.com/chanchanchan97/icafusion
  • paper_authors: Jifeng Shen, Yifei Chen, Yue Liu, Xin Zuo, Heng Fan, Wankou Yang
  • for: 本研究旨在改进多光谱图像的特征融合,以提升多光谱目标检测的精度。
  • methods: 提出了一种基于双重交叉注意力Transformer的新型特征融合框架,对全局特征交互进行建模,同时捕捉跨模态的互补信息,从而增强目标特征的判别性;并引入迭代交互机制,在多模态Transformer块之间共享参数,以降低模型复杂度和计算开销。
  • results: 在KAIST、FLIR和VEDAI数据集上的实验结果表明,所提方法性能更优且推理速度更快,适用于多种实际应用场景。
    Abstract Effective feature fusion of multispectral images plays a crucial role in multi-spectral object detection. Previous studies have demonstrated the effectiveness of feature fusion using convolutional neural networks, but these methods are sensitive to image misalignment due to the inherent deficiency in local-range feature interaction, resulting in performance degradation. To address this issue, a novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interaction and capture complementary information across modalities simultaneously. This framework enhances the discriminability of object features through the query-guided cross-attention mechanism, leading to improved performance. However, stacking multiple transformer blocks for feature enhancement incurs a large number of parameters and high spatial complexity. To handle this, inspired by the human process of reviewing knowledge, an iterative interaction mechanism is proposed to share parameters among block-wise multimodal transformers, reducing model complexity and computation cost. The proposed method is general and effective to be integrated into different detection frameworks and used with different backbones. Experimental results on KAIST, FLIR, and VEDAI datasets show that the proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios. Code will be available at https://github.com/chanchanchan97/ICAFusion.
    摘要 多光谱图像的有效特征融合在多光谱目标检测中起着关键作用。已有研究证明了基于卷积神经网络的特征融合的有效性,但由于其固有的局部特征交互能力不足,这类方法对图像错位较为敏感,从而导致性能下降。为解决这一问题,本文提出了一种基于双重交叉注意力Transformer的新型特征融合框架,可同时建模全局特征交互并捕捉跨模态的互补信息。该框架通过查询引导的交叉注意力机制增强目标特征的判别性,从而提升性能。然而,堆叠多个Transformer块进行特征增强会带来大量参数和较高的空间复杂度。受人类复习知识的过程启发,本文提出了一种迭代交互机制,在逐块的多模态Transformer之间共享参数,从而降低模型复杂度与计算开销。所提方法具有通用性,可集成到不同的检测框架中,并与不同的骨干网络配合使用。在KAIST、FLIR和VEDAI数据集上的实验结果表明,所提方法取得了更优的性能和更快的推理速度,适用于多种实际应用场景。代码将在 https://github.com/chanchanchan97/ICAFusion 公开。
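The parameter-sharing idea can be sketched as a single pair of cross-attention blocks applied iteratively across the two modalities. This is an assumed simplification of the released ICAFusion code, with made-up dimensions and iteration count.

```python
import torch
import torch.nn as nn

class IterativeCrossAttentionFusion(nn.Module):
    """Sketch only (assumed structure, not the released ICAFusion code): one cross-attention
    block per direction whose parameters are reused for several fusion iterations."""
    def __init__(self, dim=256, heads=8, iterations=3):
        super().__init__()
        self.iterations = iterations
        self.rgb_from_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ir_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, ir_tokens):
        # rgb_tokens, ir_tokens: (B, N, C) flattened feature maps of the two spectra
        for _ in range(self.iterations):          # same weights reused -> fewer parameters
            rgb_upd, _ = self.rgb_from_ir(rgb_tokens, ir_tokens, ir_tokens)
            ir_upd, _ = self.ir_from_rgb(ir_tokens, rgb_tokens, rgb_tokens)
            rgb_tokens = rgb_tokens + rgb_upd     # residual update with complementary cues
            ir_tokens = ir_tokens + ir_upd
        return rgb_tokens, ir_tokens
```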

SpecTracle: Wearable Facial Motion Tracking from Unobtrusive Peripheral Cameras

  • paper_url: http://arxiv.org/abs/2308.07502
  • repo_url: None
  • paper_authors: Yinan Xuan, Varun Viswanath, Sunny Chu, Owen Bartolf, Jessica Echterhoff, Edward Wang
  • for: 这篇论文旨在通过头戴式显示器中的面部运动跟踪,实现虚拟环境中沉浸式的"面对面"交互。
  • methods: 该系统(SpecTracle)使用安装在Hololens面罩两侧的两个广角相机进行面部运动跟踪,避免了在面部前方外伸相机。
  • results: 基于神经网络的模型可在移动GPU上以每秒24帧实时运行,并能以用户无关的模型分别跟踪面部不同部位的独立运动;经过简短的个性化校准后,跟踪性能相比用户无关模型提升42.3%。
    Abstract Facial motion tracking in head-mounted displays (HMD) has the potential to enable immersive "face-to-face" interaction in a virtual environment. However, current works on facial tracking are not suitable for unobtrusive augmented reality (AR) glasses or do not have the ability to track arbitrary facial movements. In this work, we demonstrate a novel system called SpecTracle that tracks a user's facial motions using two wide-angle cameras mounted right next to the visor of a Hololens. Avoiding the usage of cameras extended in front of the face, our system greatly improves the feasibility to integrate full-face tracking into a low-profile form factor. We also demonstrate that a neural network-based model processing the wide-angle cameras can run in real-time at 24 frames per second (fps) on a mobile GPU and track independent facial movement for different parts of the face with a user-independent model. Using a short personalized calibration, the system improves its tracking performance by 42.3% compared to the user-independent model.
    摘要 头戴式显示器(HMD)中的面部运动跟踪有望在虚拟环境中实现沉浸式的"面对面"交互。然而,现有的面部跟踪工作要么不适用于外形低调的增强现实(AR)眼镜,要么无法跟踪任意的面部运动。在这项工作中,我们展示了一个名为SpecTracle的新系统,它利用安装在Hololens面罩两侧的两个广角相机来跟踪用户的面部运动。由于避免了在面部前方外伸相机,我们的系统大大提升了将全脸跟踪集成到低调外形设备中的可行性。我们还证明,处理广角相机画面的神经网络模型可以在移动GPU上以每秒24帧(fps)实时运行,并使用用户无关的模型独立跟踪面部不同部位的运动。经过简短的个性化校准后,系统的跟踪性能相比用户无关模型提升了42.3%。

BSED: Baseline Shapley-Based Explainable Detector

  • paper_url: http://arxiv.org/abs/2308.07490
  • repo_url: None
  • paper_authors: Michihiro Kuroki, Toshihiko Yamasaki
  • for: 这篇论文旨在提升可解释人工智能(XAI)在目标检测领域的可解释性,提出一种基于基线Shapley值的可解释检测器(BSED),使解释满足可解释性公理。
  • methods: 该论文将Shapley值扩展到目标检测,可以以模型无关的方式应用于各种检测器,并且无需精细的参数调整即可解释各类检测目标。
  • results: 结果表明,BSED能给出更有效的解释,且其计算开销处于合理范围,而原始Shapley值的计算代价过高;论文还展示了基于BSED解释结果修正检测等应用。
    Abstract Explainable artificial intelligence (XAI) has witnessed significant advances in the field of object recognition, with saliency maps being used to highlight image features relevant to the predictions of learned models. Although these advances have made AI-based technology more interpretable to humans, several issues have come to light. Some approaches present explanations irrelevant to predictions, and cannot guarantee the validity of XAI (axioms). In this study, we propose the Baseline Shapley-based Explainable Detector (BSED), which extends the Shapley value to object detection, thereby enhancing the validity of interpretation. The Shapley value can attribute the prediction of a learned model to a baseline feature while satisfying the explainability axioms. The processing cost for the BSED is within the reasonable range, while the original Shapley value is prohibitively computationally expensive. Furthermore, BSED is a generalizable method that can be applied to various detectors in a model-agnostic manner, and interpret various detection targets without fine-grained parameter tuning. These strengths can enable the practical applicability of XAI. We present quantitative and qualitative comparisons with existing methods to demonstrate the superior performance of our method in terms of explanation validity. Moreover, we present some applications, such as correcting detection based on explanations from our method.
    摘要 可解释人工智能(XAI)在目标识别领域已取得显著进展,显著性图被用于突出与模型预测相关的图像特征。尽管这些进展使基于AI的技术对人类更加可解释,但也暴露出一些问题:部分方法给出的解释与预测无关,且无法保证解释满足XAI公理。在本研究中,我们提出了基于基线Shapley值的可解释检测器(BSED),将Shapley值扩展到目标检测,从而提升解释的有效性。Shapley值可以在满足可解释性公理的前提下,将学习模型的预测归因到基线特征上。BSED的计算开销处于合理范围,而原始Shapley值的计算代价则高得难以承受。此外,BSED是一种可泛化的方法,能够以模型无关的方式应用于各种检测器,并且无需精细的参数调整即可解释各类检测目标。这些优势使XAI具备了实际可用性。我们与现有方法进行了定量和定性比较,证明了本方法在解释有效性方面的优越性。此外,我们还展示了一些应用,例如基于本方法给出的解释来修正检测结果。
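For intuition, baseline-referenced Shapley attribution can be approximated by permutation sampling of features relative to a baseline. The generic sketch below illustrates that idea only; it is not the BSED algorithm, which extends the computation to object detection at a much lower cost.

```python
import numpy as np

def sampled_shapley(score_fn, features, baseline, n_samples=100, seed=0):
    """Generic sampling approximation of Shapley values w.r.t. a baseline feature vector.
    Illustrative only -- BSED refines this idea for detectors."""
    rng = np.random.default_rng(seed)
    d = len(features)
    phi = np.zeros(d)
    for _ in range(n_samples):
        order = rng.permutation(d)
        current = baseline.copy()
        prev_score = score_fn(current)
        for j in order:                       # add features one by one in random order
            current[j] = features[j]
            new_score = score_fn(current)
            phi[j] += new_score - prev_score  # marginal contribution of feature j
            prev_score = new_score
    return phi / n_samples

# usage: phi = sampled_shapley(lambda x: float(x.sum()), np.ones(8), np.zeros(8))
```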

Space Object Identification and Classification from Hyperspectral Material Analysis

  • paper_url: http://arxiv.org/abs/2308.07481
  • repo_url: None
  • paper_authors: Massimiliano Vasile, Lewis Walker, Andrew Campbell, Simao Marto, Paul Murray, Stephen Marshall, Vasili Savitski
  • for: 这篇论文设计了一条数据处理管线,用于从未知空间目标的高光谱特征中提取信息,并判定其材料组成。
  • methods: 论文使用了两种材料识别与分类技术:一种基于机器学习,另一种基于与已知光谱库的最小二乘匹配;随后利用监督机器学习算法,根据检测到的材料对目标进行类别划分。
  • results: 论文考察了材料分类方法在非理想条件下(材料老化、训练光谱库缺失目标材料)的表现,并展示了空间目标识别与分类的初步结果。
    Abstract This paper presents a data processing pipeline designed to extract information from the hyperspectral signature of unknown space objects. The methodology proposed in this paper determines the material composition of space objects from single pixel images. Two techniques are used for material identification and classification: one based on machine learning and the other based on a least square match with a library of known spectra. From this information, a supervised machine learning algorithm is used to classify the object into one of several categories based on the detection of materials on the object. The behaviour of the material classification methods is investigated under non-ideal circumstances, to determine the effect of weathered materials, and the behaviour when the training library is missing a material that is present in the object being observed. Finally the paper will present some preliminary results on the identification and classification of space objects.
    摘要 本文提出了一条数据处理管线,用于从未知空间目标的高光谱特征中提取信息。该方法可从单像素图像中判定空间目标的材料组成。材料识别与分类采用两种技术:一种基于机器学习,另一种基于与已知光谱库的最小二乘匹配。在此基础上,利用监督机器学习算法,根据目标上检测到的材料将其划分到若干类别之一。文中还考察了材料分类方法在非理想条件下的行为,以确定材料老化的影响,以及当训练光谱库缺失被观测目标上实际存在的材料时的表现。最后,论文给出了空间目标识别与分类的初步结果。
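The least-squares matching branch of such a pipeline can be illustrated as fitting each library spectrum to the observed single-pixel spectrum and picking the smallest residual. The sketch below is an assumed simplification; the material names in the usage line are hypothetical.

```python
import numpy as np

def match_spectrum(observed, library):
    """Minimal sketch of least-squares matching of an observed spectrum against a library
    of known material spectra (an assumed simplification of the paper's pipeline)."""
    # observed: (n_bands,), library: dict name -> (n_bands,) reference spectrum
    residuals = {}
    for name, ref in library.items():
        scale, *_ = np.linalg.lstsq(ref[:, None], observed, rcond=None)  # best scalar fit
        residuals[name] = float(np.sum((observed - ref * scale[0]) ** 2))
    return min(residuals, key=residuals.get), residuals

# usage: best, res = match_spectrum(np.random.rand(50), {"aluminum": np.random.rand(50), "kapton": np.random.rand(50)})
```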

Probabilistic MIMO U-Net: Efficient and Accurate Uncertainty Estimation for Pixel-wise Regression

  • paper_url: http://arxiv.org/abs/2308.07477
  • repo_url: https://github.com/antonbaumann/mimo-unet
  • paper_authors: Anton Baumann, Thomas Roßberg, Michael Schmitt
  • for: 提升机器学习模型不确定性估计的质量与效率,从而增强模型在高风险实际应用场景中的可靠性与可解释性。
  • methods: 基于多输入多输出(MIMO)框架,利用深度神经网络的过参数化,将其适配到像素级回归任务:采用U-Net架构,在单个模型中训练多个子网络,并提出了一种同步子网络性能的新流程。
  • results: 在两个互不相关的数据集上的全面评估表明,所得到的MIMO U-Net与现有模型准确率相当,在分布内数据上具有更好的校准性、稳健的分布外检测能力,并在参数量和推理时间上有明显优势。代码见 github.com/antonbaumann/MIMO-Unet。
    Abstract Uncertainty estimation in machine learning is paramount for enhancing the reliability and interpretability of predictive models, especially in high-stakes real-world scenarios. Despite the availability of numerous methods, they often pose a trade-off between the quality of uncertainty estimation and computational efficiency. Addressing this challenge, we present an adaptation of the Multiple-Input Multiple-Output (MIMO) framework -- an approach exploiting the overparameterization of deep neural networks -- for pixel-wise regression tasks. Our MIMO variant expands the applicability of the approach from simple image classification to broader computer vision domains. For that purpose, we adapted the U-Net architecture to train multiple subnetworks within a single model, harnessing the overparameterization in deep neural networks. Additionally, we introduce a novel procedure for synchronizing subnetwork performance within the MIMO framework. Our comprehensive evaluations of the resulting MIMO U-Net on two orthogonal datasets demonstrate comparable accuracy to existing models, superior calibration on in-distribution data, robust out-of-distribution detection capabilities, and considerable improvements in parameter size and inference time. Code available at github.com/antonbaumann/MIMO-Unet
    摘要 机器学习中的不确定性估计对于提升预测模型的可靠性与可解释性至关重要,尤其是在高风险的实际应用场景中。尽管已有众多方法,它们往往需要在不确定性估计质量与计算效率之间进行权衡。针对这一挑战,我们将多输入多输出(MIMO)框架——一种利用深度神经网络过参数化的方法——适配到像素级回归任务,将该方法的适用范围从简单的图像分类拓展到更广泛的计算机视觉领域。为此,我们改造了U-Net架构,在单个模型中训练多个子网络,充分利用深度神经网络的过参数化;同时提出了一种在MIMO框架内同步子网络性能的新流程。在两个互不相关的数据集上的全面评估表明,所得到的MIMO U-Net与现有模型准确率相当,在分布内数据上具有更优的校准性、稳健的分布外检测能力,并在参数量和推理时间上有显著改进。代码见 github.com/antonbaumann/MIMO-Unet。
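A minimal sketch of the MIMO idea for pixel-wise regression: stack M inputs, share one backbone, and read out M output maps whose disagreement serves as epistemic uncertainty. The module below is illustrative only (assumed shapes and a placeholder backbone); the released MIMO U-Net differs in detail.

```python
import torch
import torch.nn as nn

class MIMOHead(nn.Module):
    """Sketch of a MIMO wrapper for pixel-wise regression (not the released MIMO U-Net)."""
    def __init__(self, backbone: nn.Module, in_ch=3, feat_ch=64, num_members=3):
        super().__init__()
        self.num_members = num_members
        self.stem = nn.Conv2d(in_ch * num_members, feat_ch, 3, padding=1)   # stacked member inputs
        self.backbone = backbone                     # e.g. a U-Net body mapping feat_ch -> feat_ch
        self.head = nn.Conv2d(feat_ch, num_members, 1)                       # one output map per member

    def forward(self, images):
        # images: list of M tensors (B, C, H, W); at test time the same image is repeated M times
        x = self.stem(torch.cat(images, dim=1))
        preds = self.head(self.backbone(x))           # (B, M, H, W)
        mean = preds.mean(dim=1)                      # final prediction
        epistemic = preds.var(dim=1)                  # disagreement between members as uncertainty
        return mean, epistemic
```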

Reducing Training Demands for 3D Gait Recognition with Deep Koopman Operator Constraints

  • paper_url: http://arxiv.org/abs/2308.07468
  • repo_url: None
  • paper_authors: Cole Hill, Mauricio Pamplona Segundo, Sudeep Sarkar
  • for: 本研究旨在利用深度学习实现3D步态识别,并通过线性动力系统(LDS)模块与基于Koopman算子理论的损失函数来保证时间一致性并降低训练数据需求。
  • methods: 使用深度神经网络将3D可变形人体模型拟合到步态视频中,得到每帧解耦的形状与姿态表示;引入LDS模块及相应损失,为步态的周期性提供无监督的运动正则化,并具备对步态序列的预测延拓能力。
  • results: 在USF HumanID和CASIA-B数据集上的比较表明,LDS方法可以在更少训练数据的条件下取得更高的准确率;同时,3D建模方法在视角变化、背包及换装等条件下也优于其他3D步态方法。
    Abstract Deep learning research has made many biometric recognition solution viable, but it requires vast training data to achieve real-world generalization. Unlike other biometric traits, such as face and ear, gait samples cannot be easily crawled from the web to form massive unconstrained datasets. As the human body has been extensively studied for different digital applications, one can rely on prior shape knowledge to overcome data scarcity. This work follows the recent trend of fitting a 3D deformable body model into gait videos using deep neural networks to obtain disentangled shape and pose representations for each frame. To enforce temporal consistency in the network, we introduce a new Linear Dynamical Systems (LDS) module and loss based on Koopman operator theory, which provides an unsupervised motion regularization for the periodic nature of gait, as well as a predictive capacity for extending gait sequences. We compare LDS to the traditional adversarial training approach and use the USF HumanID and CASIA-B datasets to show that LDS can obtain better accuracy with less training data. Finally, we also show that our 3D modeling approach is much better than other 3D gait approaches in overcoming viewpoint variation under normal, bag-carrying and clothing change conditions.
    摘要 深度学习研究使许多生物特征识别方案成为可能,但要获得真实场景下的泛化能力需要海量训练数据。与人脸、耳廓等其他生物特征不同,步态样本难以通过网络爬取形成大规模的无约束数据集。由于人体已在各类数字应用中被广泛研究,我们可以借助先验的形状知识来克服数据稀缺问题。本工作沿用了近期的思路,利用深度神经网络将3D可变形人体模型拟合到步态视频中,从而为每一帧获得解耦的形状与姿态表示。为了在网络中保证时间一致性,我们引入了一个新的线性动力系统(LDS)模块以及基于Koopman算子理论的损失函数,它既为步态的周期性提供了无监督的运动正则化,也具备延拓步态序列的预测能力。我们将LDS与传统的对抗训练方法进行了比较,并在USF HumanID和CASIA-B数据集上表明,LDS能够在更少训练数据的条件下取得更高的准确率。最后,我们还表明,在正常、背包和换装等条件下,我们的3D建模方法在克服视角变化方面明显优于其他3D步态方法。

There Is a Digital Art History

  • paper_url: http://arxiv.org/abs/2308.07464
  • repo_url: https://github.com/Gracetyty/art-gallery
  • paper_authors: Leonardo Impett, Fabian Offert
  • for: 本研究在大规模Transformer视觉模型兴起的背景下,重新审视Johanna Drucker十年前提出的问题:"是否存在数字艺术史?"
  • methods: 研究主要从两个方面展开:一是分析大规模视觉模型中新近编码的视觉文化图库对数字艺术史的影响;二是通过两个技术案例,利用当代大规模视觉模型来考察艺术史与城市研究领域的基本问题。
  • results: 研究结果表明,大规模视觉模型可能推动数字艺术史发生范式转变:它们能够提取并自动化不同形式的视觉逻辑,且已在数字生活的各个方面广泛应用;同时,这类系统需要一种新的批判方法论,将模型及其应用之间的认知纠缠纳入考量。
    Abstract In this paper, we revisit Johanna Drucker's question, "Is there a digital art history?" -- posed exactly a decade ago -- in the light of the emergence of large-scale, transformer-based vision models. While more traditional types of neural networks have long been part of digital art history, and digital humanities projects have recently begun to use transformer models, their epistemic implications and methodological affordances have not yet been systematically analyzed. We focus our analysis on two main aspects that, together, seem to suggest a coming paradigm shift towards a "digital" art history in Drucker's sense. On the one hand, the visual-cultural repertoire newly encoded in large-scale vision models has an outsized effect on digital art history. The inclusion of significant numbers of non-photographic images allows for the extraction and automation of different forms of visual logics. Large-scale vision models have "seen" large parts of the Western visual canon mediated by Net visual culture, and they continuously solidify and concretize this canon through their already widespread application in all aspects of digital life. On the other hand, based on two technical case studies of utilizing a contemporary large-scale visual model to investigate basic questions from the fields of art history and urbanism, we suggest that such systems require a new critical methodology that takes into account the epistemic entanglement of a model and its applications. This new methodology reads its corpora through a neural model's training data, and vice versa: the visual ideologies of research datasets and training datasets become entangled.
    摘要 在本文中,我们在大规模、基于Transformer的视觉模型兴起的背景下,重新审视Johanna Drucker整整十年前提出的问题:"是否存在数字艺术史?"虽然较为传统的神经网络早已是数字艺术史的一部分,且数字人文项目近来也开始使用Transformer模型,但其认识论意涵与方法论可能性尚未得到系统分析。我们的分析聚焦于两个主要方面,二者共同预示着一场向Drucker意义上的"数字"艺术史的范式转变。一方面,大规模视觉模型中新近编码的视觉文化图库对数字艺术史产生了巨大影响:大量非摄影图像的纳入,使得不同形式的视觉逻辑得以被提取和自动化;大规模视觉模型已经"看过"了经由网络视觉文化中介的西方视觉正典的很大一部分,并通过其在数字生活各方面的广泛应用,不断固化与具体化这一正典。另一方面,基于两个利用当代大规模视觉模型考察艺术史与城市研究基本问题的技术案例,我们认为此类系统需要一种新的批判方法论,将模型与其应用之间的认知纠缠纳入考量:这种新方法论透过神经模型的训练数据来阅读其研究语料,反之亦然——研究数据集与训练数据集的视觉意识形态彼此纠缠。

U-Turn Diffusion

  • paper_url: http://arxiv.org/abs/2308.07421
  • repo_url: None
  • paper_authors: Hamidreza Behjoo, Michael Chertkov
  • for: 这篇论文对基于分数的扩散模型生成合成图像进行了系统考察。这类模型依赖由随机微分方程驱动的辅助时间机制,其分数函数从输入图像中学习得到。
  • methods: 论文提出了评价基于分数的扩散模型效率的一个判据:生成过程的能力取决于其在反向/去噪阶段拆解快速相关性的能力。此外,作者提出了"U-Turn Diffusion"技术:先执行时长较常规设置更短的标准前向扩散,再以前向过程的终态为初始条件执行标准反向动力学,从而生成近似于输入样本所隐含分布的独立同分布(i.i.d.)样本。
  • results: 论文利用自相关分析、分数函数加权范数分析和Kolmogorov-Smirnov高斯性检验等分析工具研究相关时间尺度,结果表明,在最优的U-turn时刻,衡量合成样本与真实数据样本质量差异的核交叉距离(Kernel Intersection Distance)达到最小。
    Abstract We present a comprehensive examination of score-based diffusion models of AI for generating synthetic images. These models hinge upon a dynamic auxiliary time mechanism driven by stochastic differential equations, wherein the score function is acquired from input images. Our investigation unveils a criterion for evaluating efficiency of the score-based diffusion models: the power of the generative process depends on the ability to de-construct fast correlations during the reverse/de-noising phase. To improve the quality of the produced synthetic images, we introduce an approach coined "U-Turn Diffusion". The U-Turn Diffusion technique starts with the standard forward diffusion process, albeit with a condensed duration compared to conventional settings. Subsequently, we execute the standard reverse dynamics, initialized with the concluding configuration from the forward process. This U-Turn Diffusion procedure, combining forward, U-turn, and reverse processes, creates a synthetic image approximating an independent and identically distributed (i.i.d.) sample from the probability distribution implicitly described via input samples. To analyze relevant time scales we employ various analytical tools, including auto-correlation analysis, weighted norm of the score-function analysis, and Kolmogorov-Smirnov Gaussianity test. The tools guide us to establishing that the Kernel Intersection Distance, a metric comparing the quality of synthetic samples with real data samples, is minimized at the optimal U-turn time.
    摘要 我们对用于生成合成图像的基于分数的AI扩散模型进行了全面考察。这类模型依赖由随机微分方程驱动的动态辅助时间机制,其中分数函数从输入图像中学习得到。我们的研究揭示了评价基于分数的扩散模型效率的一个判据:生成过程的能力取决于其在反向/去噪阶段拆解快速相关性的能力。为提升生成合成图像的质量,我们提出了名为"U-Turn Diffusion"的方法:先执行标准的前向扩散过程,但其时长相比常规设置大幅压缩;随后以前向过程的终态配置为初始条件,执行标准的反向动力学。这一由前向、U-turn与反向过程组成的流程,可生成近似于输入样本所隐含概率分布的独立同分布(i.i.d.)合成图像。为分析相关的时间尺度,我们使用了多种分析工具,包括自相关分析、分数函数加权范数分析以及Kolmogorov-Smirnov高斯性检验。这些工具帮助我们确定:在最优的U-turn时刻,比较合成样本与真实数据样本质量的核交叉距离(Kernel Intersection Distance)达到最小。
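Assuming DDPM-style discrete notation and a deterministic DDIM-like reverse step (both assumptions, since the paper works with stochastic differential equations), the U-turn procedure can be sketched as: forward-diffuse a real image only up to an intermediate time, then run the standard reverse dynamics from that state.

```python
import torch

@torch.no_grad()
def u_turn_sample(x_real, noise_model, alphas_cumprod, t_u):
    """Conceptual sketch of the U-turn procedure (assumed DDPM/DDIM notation, not the paper's code).
    alphas_cumprod: (T+1,) tensor with alphas_cumprod[0] ~= 1; noise_model(x, t) predicts the noise."""
    a_bar = alphas_cumprod[t_u]
    noise = torch.randn_like(x_real)
    x_t = a_bar.sqrt() * x_real + (1 - a_bar).sqrt() * noise   # shortened forward diffusion to t_u
    for t in range(t_u, 0, -1):                                # standard reverse dynamics from t_u
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = noise_model(x_t, t)
        x0_hat = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
        x_t = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps  # deterministic DDIM-style step
    return x_t
```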

Semantify: Simplifying the Control of 3D Morphable Models using CLIP

  • paper_url: http://arxiv.org/abs/2308.07415
  • repo_url: https://github.com/Omergral/Semantify
  • paper_authors: Omer Gralnik, Guy Gafni, Ariel Shamir
  • for: 这篇论文旨在简化3D可变形模型(3DMM)的控制,利用CLIP语言-视觉基础模型的语义能力进行自监督学习。
  • methods: 提出的方法Semantify首先选取一小组语义明确且相互解耦的描述词来刻画3DMM,随后在CLIP隐空间中计算渲染形状与这些描述词的相似度得分,并在无人工参与的情况下训练神经网络,学习从该组得分到给定3DMM参数系数的非线性映射;训练数据通过随机采样模型参数、生成各种形状并渲染得到。
  • results: 论文在多种3D可变形模型(人体形状模型、人脸形状与表情模型以及动物形状)上给出了结果,表明该方法定义了一个便于直观建模的简单滑杆界面,并可将3D参数化人体形状即时拟合到真实场景图像上。
    Abstract We present Semantify: a self-supervised method that utilizes the semantic power of CLIP language-vision foundation model to simplify the control of 3D morphable models. Given a parametric model, training data is created by randomly sampling the model's parameters, creating various shapes and rendering them. The similarity between the output images and a set of word descriptors is calculated in CLIP's latent space. Our key idea is first to choose a small set of semantically meaningful and disentangled descriptors that characterize the 3DMM, and then learn a non-linear mapping from scores across this set to the parametric coefficients of the given 3DMM. The non-linear mapping is defined by training a neural network without a human-in-the-loop. We present results on numerous 3DMMs: body shape models, face shape and expression models, as well as animal shapes. We demonstrate how our method defines a simple slider interface for intuitive modeling, and show how the mapping can be used to instantly fit a 3D parametric body shape to in-the-wild images.
    摘要 我们提出Semantify:一种自监督方法,利用CLIP语言-视觉基础模型的语义能力来简化3D可变形模型(3DMM)的控制。给定一个参数化模型,我们通过随机采样其参数、生成各种形状并进行渲染来构建训练数据,随后在CLIP的隐空间中计算输出图像与一组文字描述词之间的相似度。我们的核心思想是:首先选取一小组语义明确且相互解耦、能够刻画该3DMM的描述词,然后学习一个从这组描述词得分到给定3DMM参数系数的非线性映射。该非线性映射通过训练一个神经网络得到,全程无需人工参与。我们在多种3DMM上给出了结果,包括人体形状模型、人脸形状与表情模型以及动物形状。我们展示了该方法如何定义一个便于直观建模的简单滑杆界面,并演示了如何利用该映射将3D参数化人体形状即时拟合到真实场景图像上。
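The learned mapping from descriptor scores to morphable-model coefficients can be pictured as a small MLP. Layer sizes and the descriptor examples below are hypothetical; this is not the released Semantify code.

```python
import torch
import torch.nn as nn

class Scores2Coefficients(nn.Module):
    """Illustrative sketch of the Semantify idea (assumed architecture): map CLIP similarity
    scores for a few semantic descriptors to 3DMM shape coefficients."""
    def __init__(self, num_descriptors=10, num_coeffs=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_descriptors, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_coeffs),
        )

    def forward(self, clip_scores):
        # clip_scores: (B, num_descriptors) cosine similarities between a rendered shape
        # and word descriptors such as "tall", "muscular", ... (hypothetical examples)
        return self.mlp(clip_scores)

# training pairs come from randomly sampled 3DMM coefficients and the CLIP scores of their renderings
```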

A Unified Query-based Paradigm for Camouflaged Instance Segmentation

  • paper_url: http://arxiv.org/abs/2308.07392
  • repo_url: https://github.com/dongbo811/uqformer
  • paper_authors: Do Dong, Jialun Pei, Rongrong Gao, Tian-Zhu Xiang, Shuo Wang, Huan Xiong
  • for: 提升伪装实例分割的精度。
  • methods: 提出基于查询的多任务学习框架UQFormer,设计多尺度统一学习Transformer解码器与组合查询学习范式,通过掩码查询与边界查询的交叉注意力交互,捕捉伪装目标的区域与边界特征,同时完成实例分割与实例边界检测。
  • results: 与14种最先进方法相比,在伪装实例分割任务上取得了显著提升。
    Abstract Due to the high similarity between camouflaged instances and the background, the recently proposed camouflaged instance segmentation (CIS) faces challenges in accurate localization and instance segmentation. To this end, inspired by query-based transformers, we propose a unified query-based multi-task learning framework for camouflaged instance segmentation, termed UQFormer, which builds a set of mask queries and a set of boundary queries to learn a shared composed query representation and efficiently integrates global camouflaged object region and boundary cues, for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. Specifically, we design a composed query learning paradigm that learns a shared representation to capture object region and boundary features by the cross-attention interaction of mask queries and boundary queries in the designed multi-scale unified learning transformer decoder. Then, we present a transformer-based multi-task learning framework for simultaneous camouflaged instance segmentation and camouflaged instance boundary detection based on the learned composed query representation, which also forces the model to learn a strong instance-level query representation. Notably, our model views the instance segmentation as a query-based direct set prediction problem, without other post-processing such as non-maximal suppression. Compared with 14 state-of-the-art approaches, our UQFormer significantly improves the performance of camouflaged instance segmentation. Our code will be available at https://github.com/dongbo811/UQFormer.
    摘要 由于伪装实例与背景高度相似,近期提出的伪装实例分割(CIS)在精确定位与实例分割方面面临挑战。为此,受基于查询的Transformer启发,我们提出了一个统一的基于查询的多任务学习框架UQFormer:构建一组掩码查询和一组边界查询,学习共享的组合查询表示,并高效整合全局的伪装目标区域与边界线索,从而在伪装场景下同时进行实例分割与实例边界检测。具体而言,我们设计了一种组合查询学习范式,在所设计的多尺度统一学习Transformer解码器中,通过掩码查询与边界查询的交叉注意力交互,学习能够刻画目标区域与边界特征的共享表示。随后,我们基于所学的组合查询表示,提出了一个基于Transformer的多任务学习框架,同时进行伪装实例分割与伪装实例边界检测,并促使模型学习到强的实例级查询表示。值得注意的是,我们的模型将实例分割视为基于查询的直接集合预测问题,无需非极大值抑制等额外后处理。与14种最先进方法相比,UQFormer显著提升了伪装实例分割的性能。代码将在 https://github.com/dongbo811/UQFormer 公开。

DISBELIEVE: Distance Between Client Models is Very Essential for Effective Local Model Poisoning Attacks

  • paper_url: http://arxiv.org/abs/2308.07387
  • repo_url: None
  • paper_authors: Indu Joshi, Priyank Upadhya, Gaurav Kumar Nayak, Peter Schüffler, Nassir Navab
  • for: This paper focuses on privacy and robustness issues in federated learning, specifically in the medical image analysis domain, and proposes a local model poisoning attack called DISBELIEVE that exposes the vulnerability of robust aggregation methods.
  • methods: The proposed DISBELIEVE attack crafts malicious parameters or gradients that stay close to the benign clients' parameters or gradients while still having a strong adverse effect on the global model's performance.
  • results: The attack significantly lowers the performance of state-of-the-art robust aggregation methods for medical image analysis on three publicly available datasets, and is also effective on natural images for multi-class classification on the benchmark dataset CIFAR-10.
    Abstract Federated learning is a promising direction to tackle the privacy issues related to sharing patients' sensitive data. Often, federated systems in the medical image analysis domain assume that the participating local clients are \textit{honest}. Several studies report mechanisms through which a set of malicious clients can be introduced that can poison the federated setup, hampering the performance of the global model. To overcome this, robust aggregation methods have been proposed that defend against those attacks. We observe that most of the state-of-the-art robust aggregation methods are heavily dependent on the distance between the parameters or gradients of malicious clients and benign clients, which makes them prone to local model poisoning attacks when the parameters or gradients of malicious and benign clients are close. Leveraging this, we introduce DISBELIEVE, a local model poisoning attack that creates malicious parameters or gradients such that their distance to benign clients' parameters or gradients is low respectively but at the same time their adverse effect on the global model's performance is high. Experiments on three publicly available medical image datasets demonstrate the efficacy of the proposed DISBELIEVE attack as it significantly lowers the performance of the state-of-the-art \textit{robust aggregation} methods for medical image analysis. Furthermore, compared to state-of-the-art local model poisoning attacks, DISBELIEVE attack is also effective on natural images where we observe a severe drop in classification performance of the global model for multi-class classification on benchmark dataset CIFAR-10.
    摘要 联邦学习是解决患者敏感数据共享所涉及隐私问题的一个有前景的方向。在医学图像分析领域,联邦系统通常假设参与的本地客户端是"诚实"的。然而,已有研究表明,可以引入一组恶意客户端来毒化联邦训练,从而损害全局模型的性能。为应对这一问题,人们提出了能够抵御此类攻击的鲁棒聚合方法。我们观察到,大多数最先进的鲁棒聚合方法严重依赖恶意客户端与良性客户端之间参数或梯度的距离,因此当恶意客户端的参数或梯度与良性客户端接近时,这些方法容易受到本地模型投毒攻击。基于这一点,我们提出了名为DISBELIEVE的本地模型投毒攻击:它构造的恶意参数或梯度与良性客户端的参数或梯度距离很小,却能对全局模型的性能造成严重影响。在三个公开医学图像数据集上的实验证明了DISBELIEVE攻击的有效性,它显著降低了最先进鲁棒聚合方法在医学图像分析中的性能。此外,与现有的本地模型投毒攻击相比,DISBELIEVE在自然图像上同样有效:在基准数据集CIFAR-10的多类分类任务中,我们观察到全局模型的分类性能大幅下降。

The Devil in the Details: Simple and Effective Optical Flow Synthetic Data Generation

  • paper_url: http://arxiv.org/abs/2308.07378
  • repo_url: None
  • paper_authors: Kwon Byung-Ki, Kim Sung-Bin, Tae-Hyun Oh
  • for: 这篇论文面向稠密光流研究:监督学习方法虽取得显著进展,但需要大量标注数据,而真实数据获取成本高昂,通常依赖计算机图形学构建合成数据集。
  • methods: 论文提出了一种更简单的合成数据生成方法,通过基本操作的组合即可达到一定的真实感,并系统分析了生成合成数据集时最简单却最关键的因素;此外还提出了在监督训练中利用遮挡掩码的新方法——抑制遮挡区域上的梯度,可在课程学习的意义上提供一个强有力的初始状态。
  • results: 实验表明,先在该数据集上训练的RAFT网络,在最具挑战性的两个在线基准MPI Sintel和KITTI 2015上超越了原始RAFT。
    Abstract Recent work on dense optical flow has shown significant progress, primarily in a supervised learning manner requiring a large amount of labeled data. Due to the expensiveness of obtaining large scale real-world data, computer graphics are typically leveraged for constructing datasets. However, there is a common belief that synthetic-to-real domain gaps limit generalization to real scenes. In this paper, we show that the required characteristics in an optical flow dataset are rather simple and present a simpler synthetic data generation method that achieves a certain level of realism with compositions of elementary operations. With 2D motion-based datasets, we systematically analyze the simplest yet critical factors for generating synthetic datasets. Furthermore, we propose a novel method of utilizing occlusion masks in a supervised method and observe that suppressing gradients on occluded regions serves as a powerful initial state in the curriculum learning sense. The RAFT network initially trained on our dataset outperforms the original RAFT on the two most challenging online benchmarks, MPI Sintel and KITTI 2015.
    摘要 近期关于稠密光流的工作取得了显著进展,但主要依赖监督学习,需要大量标注数据。由于获取大规模真实数据成本高昂,通常借助计算机图形学来构建数据集。然而,人们普遍认为合成与真实数据之间的域差会限制模型在真实场景上的泛化能力。本文表明,光流数据集所需的特性其实相当简单,并提出了一种更简单的合成数据生成方法,仅通过基本操作的组合即可达到一定的真实感。基于2D运动的数据集,我们系统地分析了生成合成数据集时最简单却最关键的因素。此外,我们提出了一种在监督方法中利用遮挡掩码的新方式,并观察到:抑制遮挡区域上的梯度,可在课程学习的意义上提供一个强有力的初始状态。先在我们的数据集上训练的RAFT网络,在两个最具挑战性的在线基准MPI Sintel和KITTI 2015上超越了原始RAFT。
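A minimal sketch of gradient suppression on occluded regions, assuming an end-point-error supervision term: average the error only over visible pixels so that occluded pixels contribute no gradient. This is an assumed formulation for illustration, not the paper's exact loss.

```python
import torch

def occlusion_masked_epe(flow_pred, flow_gt, occlusion_mask):
    """Illustrative masked end-point-error loss: occluded pixels are excluded from the average."""
    # flow_pred, flow_gt: (B, 2, H, W); occlusion_mask: (B, 1, H, W), 1 = occluded
    epe = torch.norm(flow_pred - flow_gt, dim=1, keepdim=True)   # per-pixel end-point error
    visible = 1.0 - occlusion_mask.float()
    return (epe * visible).sum() / visible.sum().clamp(min=1.0)  # no gradient flows from occluded pixels
```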

Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2308.07316
  • repo_url: https://github.com/alexmartin1722/Revive-2I
  • paper_authors: Alexander Martin, Haitian Zheng, Jie An, Jiebo Luo
  • for: 这篇论文旨在实现跨越大域差距的零样本图像到图像翻译(longI2I),即需要生成大量新的视觉特征与几何结构才能进入目标域的翻译任务。
  • methods: 论文使用文本提示引导的隐空间扩散模型实现零样本I2I,提出了新任务Skull2Animal(在头骨与活体动物之间进行翻译),并给出了新的基准模型Revive-2I。
  • results: 研究发现,无引导的生成对抗网络(GAN)无法跨越大域差距完成翻译,而引导式扩散与图像编辑模型可以胜任零样本I2I;此外,提示(prompting)能够以最佳且最可扩展的方式提供关于目标域的先验知识,这是弥合大域差距所必需的。
    Abstract With a strong understanding of the target domain from natural language, we produce promising results in translating across large domain gaps and bringing skeletons back to life. In this work, we use text-guided latent diffusion models for zero-shot image-to-image translation (I2I) across large domain gaps (longI2I), where large amounts of new visual features and new geometry need to be generated to enter the target domain. Being able to perform translations across large domain gaps has a wide variety of real-world applications in criminology, astrology, environmental conservation, and paleontology. In this work, we introduce a new task Skull2Animal for translating between skulls and living animals. On this task, we find that unguided Generative Adversarial Networks (GANs) are not capable of translating across large domain gaps. Instead of these traditional I2I methods, we explore the use of guided diffusion and image editing models and provide a new benchmark model, Revive-2I, capable of performing zero-shot I2I via text-prompting latent diffusion models. We find that guidance is necessary for longI2I because, to bridge the large domain gap, prior knowledge about the target domain is needed. In addition, we find that prompting provides the best and most scalable information about the target domain as classifier-guided diffusion models require retraining for specific use cases and lack stronger constraints on the target domain because of the wide variety of images they are trained on.
    摘要 凭借从自然语言中获得的对目标域的充分理解,我们在跨越大域差距的翻译中取得了可喜的结果,让"骨骼重获新生"。在这项工作中,我们使用文本引导的隐空间扩散模型,完成跨越大域差距的零样本图像到图像翻译(longI2I):此类任务需要生成大量新的视觉特征与新的几何结构才能进入目标域。能够跨越大域差距进行翻译,在犯罪学、占星学、环境保护与古生物学等领域具有广泛的现实应用。本文提出了一个新任务Skull2Animal,用于在头骨与活体动物之间进行翻译。我们发现,无引导的生成对抗网络(GAN)无法跨越如此大的域差距;因此,我们转而探索引导式扩散与图像编辑模型,并提出了新的基准模型Revive-2I,它通过文本提示驱动隐空间扩散模型来完成零样本I2I。我们发现,引导对longI2I是必要的,因为要弥合巨大的域差距,需要关于目标域的先验知识;此外,提示(prompting)提供了关于目标域的最佳且最可扩展的信息,因为基于分类器引导的扩散模型需要针对特定用例重新训练,且由于其训练图像种类繁杂,对目标域缺乏更强的约束。

Dual Associated Encoder for Face Restoration

  • paper_url: http://arxiv.org/abs/2308.07314
  • repo_url: None
  • paper_authors: Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin C. K. Chan, Ming-Hsuan Yang
  • for: 恢复低质量(LQ)图像中的人脸细节。
  • methods: 提出双分支框架DAEFR:主分支从高质量(HQ)图像中提取特征,辅助分支从低质量(LQ)图像中提取关键信息,并通过关联训练促进两个分支的协同,从而提升码本预测与输出质量。
  • results: 在合成与真实数据集上的实验表明,DAEFR在恢复人脸细节方面表现优越。
    Abstract Restoring facial details from low-quality (LQ) images has remained a challenging problem due to its ill-posedness induced by various degradations in the wild. The existing codebook prior mitigates the ill-posedness by leveraging an autoencoder and learned codebook of high-quality (HQ) features, achieving remarkable quality. However, existing approaches in this paradigm frequently depend on a single encoder pre-trained on HQ data for restoring HQ images, disregarding the domain gap between LQ and HQ images. As a result, the encoding of LQ inputs may be insufficient, resulting in suboptimal performance. To tackle this problem, we propose a novel dual-branch framework named DAEFR. Our method introduces an auxiliary LQ branch that extracts crucial information from the LQ inputs. Additionally, we incorporate association training to promote effective synergy between the two branches, enhancing code prediction and output quality. We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets, demonstrating its superior performance in restoring facial details.
    摘要 从低质量(LQ)图像中恢复人脸细节一直是一个具有挑战性的问题:真实环境中的各种退化使其具有病态性(ill-posedness)。现有的码本先验方法借助自编码器与学习得到的高质量(HQ)特征码本来缓解这种病态性,取得了出色的质量。然而,这一范式下的现有方法往往依赖单个在HQ数据上预训练的编码器来恢复HQ图像,忽视了LQ与HQ图像之间的域差。其结果是,对LQ输入的编码可能不够充分,导致性能次优。为解决这一问题,我们提出了名为DAEFR的双分支框架。该方法引入一个辅助的LQ分支,从LQ输入中提取关键信息;同时,我们加入关联训练,促进两个分支之间的有效协同,从而提升码本预测与输出质量。我们在合成与真实数据集上评估了DAEFR的有效性,证明其在恢复人脸细节方面具有优越的性能。

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.07313
  • repo_url: https://github.com/michel-liu/grouppose-paddle
  • paper_authors: Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang
  • for: 研究端到端多人姿态估计问题。现有的最先进方案多采用DETR式框架,并主要致力于设计复杂的解码器。
  • methods: 提出一种简单而有效的Transformer方法Group Pose:将 $K$ 关键点姿态估计视为预测 $N\times K$ 个关键点位置,每个关键点由一个关键点查询预测,同时每个姿态由一个实例查询表示,用于对 $N$ 个姿态预测进行打分;并将解码器中覆盖全部查询的单一自注意力替换为两组顺序执行的组自注意力(实例内自注意力与同类型跨实例自注意力),去除不同类型跨实例查询之间的交互。
  • results: 在不使用人体框监督的情况下,该方法在MS COCO和CrowdPose上优于以往使用复杂解码器的方法,甚至略优于使用人体框监督的ED-Pose。代码见 $\href{https://github.com/Michel-liu/GroupPose-Paddle}{\rm Paddle}$ 与 $\href{https://github.com/Michel-liu/GroupPose}{\rm PyTorch}$。
    Abstract In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective transformer approach, named Group Pose. We simply regard $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from a keypoint query, as well as representing each pose with an instance query for scoring $N$ pose predictions. Motivated by the intuition that the interaction, among across-instance queries of different types, is not directly helpful, we make a simple modification to decoder self-attention. We replace single self-attention over all the $N\times(K+1)$ queries with two subsequent group self-attentions: (i) $N$ within-instance self-attention, with each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$ same-type across-instance self-attention, each over $N$ queries of the same type. The resulting decoder removes the interaction among across-instance type-different queries, easing the optimization and thus improving the performance. Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods with complex decoders, and even is slightly better than ED-Pose that uses human box supervision. $\href{https://github.com/Michel-liu/GroupPose-Paddle}{\rm Paddle}$ and $\href{https://github.com/Michel-liu/GroupPose}{\rm PyTorch}$ code are available.
    摘要 本文研究端到端多人姿态估计问题。现有的最先进方案多采用DETR式框架,并主要致力于设计复杂的解码器,例如ED-Pose将姿态估计视为关键点框检测并与人体检测相结合,PETR则利用姿态解码器和关节(关键点)解码器进行分层预测。我们提出了一种简单而有效的Transformer方法,称为Group Pose。我们直接将 $K$ 关键点姿态估计视为预测 $N\times K$ 个关键点位置,每个位置由一个关键点查询预测,同时用一个实例查询表示每个姿态,以对 $N$ 个姿态预测进行打分。基于"不同类型的跨实例查询之间的交互并无直接帮助"这一直觉,我们对解码器的自注意力做了简单修改:将覆盖全部 $N\times(K+1)$ 个查询的单一自注意力,替换为两组顺序执行的组自注意力——(i) $N$ 组实例内自注意力,每组作用于 $K$ 个关键点查询和一个实例查询;(ii) $(K+1)$ 组同类型跨实例自注意力,每组作用于 $N$ 个同类型查询。由此得到的解码器去除了不同类型跨实例查询之间的交互,简化了优化,从而提升了性能。在MS COCO和CrowdPose上的实验结果表明,我们的方法在不使用人体框监督的情况下优于以往使用复杂解码器的方法,甚至略优于使用人体框监督的ED-Pose。代码见 $\href{https://github.com/Michel-liu/GroupPose-Paddle}{\rm Paddle}$ 与 $\href{https://github.com/Michel-liu/GroupPose}{\rm PyTorch}$。
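The two-stage grouped self-attention can be sketched as follows; this is a simplified reading of the description above (assumed shapes and a plain residual structure), not the released Group Pose code.

```python
import torch
import torch.nn as nn

class GroupSelfAttention(nn.Module):
    """Sketch: within-instance attention over K+1 queries per pose, then same-type
    attention across the N instances (not the official implementation)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.within_instance = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.across_instance = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries):
        # queries: (B, N, K+1, C) -- N instances, K keypoint queries + 1 instance query each
        b, n, kp1, c = queries.shape
        x = queries.reshape(b * n, kp1, c)                    # (i) self-attention within each instance
        x = x + self.within_instance(x, x, x)[0]
        x = x.reshape(b, n, kp1, c).permute(0, 2, 1, 3)       # group tokens of the same type
        x = x.reshape(b * kp1, n, c)                          # (ii) same-type across-instance attention
        x = x + self.across_instance(x, x, x)[0]
        return x.reshape(b, kp1, n, c).permute(0, 2, 1, 3)    # back to (B, N, K+1, C)
```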

A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis

  • paper_url: http://arxiv.org/abs/2308.07301
  • repo_url: None
  • paper_authors: Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee
  • for: 这篇论文提出了一种新的任务无关的人体运动合成模型UNIMASK-M,能够以统一架构有效解决人体运动预测、关键姿势条件下的中间姿势补全等问题。
  • methods: 受ViT启发,该模型将人体姿势分解为身体部位,以利用人体运动中的时空关系;并将各类姿势条件下的运动合成任务统一表述为以不同掩码模式为输入的重建问题,通过显式告知模型被掩码的关节,增强其对遮挡的鲁棒性。
  • results: 实验结果表明,UNIMASK-M在Human3.6M数据集上成功预测人体运动,并在LaFAN1数据集上取得了最先进的运动补全(in-betweening)结果,尤其是在较长的过渡时段。更多信息见项目网站 https://sites.google.com/view/estevevallsmascaro/publications/unimask-m。
    Abstract The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly in long transition periods. More information can be found on the project website https://sites.google.com/view/estevevallsmascaro/publications/unimask-m.
    摘要 人体运动合成传统上由面向特定任务的模型来解决,这些模型各自聚焦于预测未来运动、或在已知关键姿势条件下补全中间姿势等具体挑战。本文提出了一种新的任务无关模型UNIMASK-M,能够以统一的架构有效应对上述挑战,并在各项任务上取得与最先进方法相当或更好的性能。受视觉Transformer(ViT)启发,UNIMASK-M将人体姿势分解为身体部位,以利用人体运动中存在的时空关系。此外,我们将各类姿势条件下的运动合成任务重新表述为以不同掩码模式作为输入的重建问题;通过显式告知模型哪些关节被掩码,UNIMASK-M对遮挡更加鲁棒。实验结果表明,我们的模型在Human3.6M数据集上成功预测人体运动,并在LaFAN1数据集上取得了最先进的运动补全结果,尤其是在较长的过渡时段。更多信息见项目网站 https://sites.google.com/view/estevevallsmascaro/publications/unimask-m。

Accurate Eye Tracking from Dense 3D Surface Reconstructions using Single-Shot Deflectometry

  • paper_url: http://arxiv.org/abs/2308.07298
  • repo_url: None
  • paper_authors: Jiazhang Wang, Tianfu Wang, Bingjie Xu, Oliver Cossairt, Florian Willomitzer
  • for: 提升虚拟现实设备、神经科学研究和心理学中眼动跟踪的精度与速度。
  • methods: 基于单帧相位测量偏折术(PMD)的新方法,在单个相机帧内同时获取角膜与巩膜表面的稠密3D信息,用于评估注视方向。
  • results: 所获取的反射表面点("glints")数量可轻松提升超过 $3300\times$;实验评估的注视误差仅为 $\leq 0.25^\circ$,较当前最先进方法有显著改进。
    Abstract Eye-tracking plays a crucial role in the development of virtual reality devices, neuroscience research, and psychology. Despite its significance in numerous applications, achieving an accurate, robust, and fast eye-tracking solution remains a considerable challenge for current state-of-the-art methods. While existing reflection-based techniques (e.g., "glint tracking") are considered the most accurate, their performance is limited by their reliance on sparse 3D surface data acquired solely from the cornea surface. In this paper, we rethink the way how specular reflections can be used for eye tracking: We propose a novel method for accurate and fast evaluation of the gaze direction that exploits teachings from single-shot phase-measuring-deflectometry (PMD). In contrast to state-of-the-art reflection-based methods, our method acquires dense 3D surface information of both cornea and sclera within only one single camera frame (single-shot). Improvements in acquired reflection surface points("glints") of factors $>3300 \times$ are easily achievable. We show the feasibility of our approach with experimentally evaluated gaze errors of only $\leq 0.25^\circ$ demonstrating a significant improvement over the current state-of-the-art.
    摘要 眼动跟踪在虚拟现实设备研发、神经科学研究和心理学中扮演着关键角色。尽管其在众多应用中意义重大,但实现精确、鲁棒且快速的眼动跟踪方案,对当前最先进的方法而言仍是相当大的挑战。现有的基于反射的技术(如"闪光点跟踪")被认为是最精确的,但其性能受限于仅能从角膜表面获取稀疏的3D表面数据。在本文中,我们重新思考了如何利用镜面反射进行眼动跟踪:我们提出了一种新方法,借鉴单帧相位测量偏折术(PMD)的思想,实现精确而快速的注视方向评估。与现有的基于反射的方法不同,我们的方法只需单个相机帧(single-shot)即可同时获取角膜与巩膜表面的稠密3D信息,所获取的反射表面点("glints")数量可轻松提升超过 $3300\times$。实验评估的注视误差仅为 $\leq 0.25^\circ$,证明了该方法相较当前最先进技术的显著提升与可行性。

A Robust Approach Towards Distinguishing Natural and Computer Generated Images using Multi-Colorspace fused and Enriched Vision Transformer

  • paper_url: http://arxiv.org/abs/2308.07279
  • repo_url: https://github.com/manjaryp/mce-vit
  • paper_authors: Manjary P Gangan, Anoop Kadan, Lajish V L
  • for: 区分自然图像与计算机生成图像(包括计算机图形图像和GAN生成图像)。
  • methods: 融合两个视觉Transformer,其中一个工作在RGB色彩空间,另一个工作在YCbCr色彩空间,并将二者的特征进行融合。
  • results: 相比一组基线方法,所提方法取得了显著的性能提升,并对JPEG压缩、高斯噪声等旨在欺骗取证算法的后处理操作表现出更高的鲁棒性与泛化能力。
    Abstract The works in literature classifying natural and computer generated images are mostly designed as binary tasks either considering natural images versus computer graphics images only or natural images versus GAN generated images only, but not natural images versus both classes of the generated images. Also, even though this forensic classification task of distinguishing natural and computer generated images gets the support of the new convolutional neural networks and transformer based architectures that can give remarkable classification accuracies, they are seen to fail over the images that have undergone some post-processing operations usually performed to deceive the forensic algorithms, such as JPEG compression, gaussian noise, etc. This work proposes a robust approach towards distinguishing natural and computer generated images including both, computer graphics and GAN generated images using a fusion of two vision transformers where each of the transformer networks operates in different color spaces, one in RGB and the other in YCbCr color space. The proposed approach achieves high performance gain when compared to a set of baselines, and also achieves higher robustness and generalizability than the baselines. The features of the proposed model when visualized are seen to obtain higher separability for the classes than the input image features and the baseline features. This work also studies the attention map visualizations of the networks of the fused model and observes that the proposed methodology can capture more image information relevant to the forensic task of classifying natural and generated images.
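
A minimal sketch of the fusion idea follows: one backbone sees the RGB image, a second sees its YCbCr conversion, and the two pooled feature vectors are concatenated for a binary natural-vs-generated classifier. It uses `timm` ViT backbones as stand-ins and a fixed BT.601 conversion matrix; the exact architectures, fusion head, and training recipe of the released model (see the repo above) may differ.

```python
import torch
import torch.nn as nn
import timm

# ITU-R BT.601 RGB -> YCbCr (inputs assumed in [0, 1]).
_M = torch.tensor([[0.299, 0.587, 0.114],
                   [-0.168736, -0.331264, 0.5],
                   [0.5, -0.418688, -0.081312]])
_OFF = torch.tensor([0.0, 0.5, 0.5])

def rgb_to_ycbcr(x):                      # x: (B, 3, H, W)
    y = torch.einsum("ij,bjhw->bihw", _M.to(x), x)
    return y + _OFF.to(x).view(1, 3, 1, 1)

class DualColorspaceViT(nn.Module):
    def __init__(self, backbone="vit_base_patch16_224", num_classes=2):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits;
        # set pretrained=True to start from ImageNet weights.
        self.vit_rgb = timm.create_model(backbone, pretrained=False, num_classes=0)
        self.vit_ycbcr = timm.create_model(backbone, pretrained=False, num_classes=0)
        self.head = nn.Linear(2 * self.vit_rgb.num_features, num_classes)

    def forward(self, rgb):
        f_rgb = self.vit_rgb(rgb)
        f_ycc = self.vit_ycbcr(rgb_to_ycbcr(rgb))
        return self.head(torch.cat([f_rgb, f_ycc], dim=1))

model = DualColorspaceViT()
logits = model(torch.rand(2, 3, 224, 224))  # (2, 2): natural vs. generated
```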

Diving with Penguins: Detecting Penguins and their Prey in Animal-borne Underwater Videos via Deep Learning

  • paper_url: http://arxiv.org/abs/2308.07267
  • repo_url: None
  • paper_authors: Kejia Zhang, Mingyu Yang, Stephen D. J. Lang, Alistair M. McInnes, Richard B. Sherley, Tilo Burghardt
  • for: Aims to provide a reliable underwater penguin detector and a fish detector, and to automatically recognize penguin predation behaviour
  • methods: Uses modern bio-logging technology (animal-borne video) together with a deep learning system to detect penguins and fish
  • results: Delivers a highly reliable underwater penguin detector and a first automatic recognition of predation behaviour; further work is needed before the technique is useful in real-world scenarios
    Abstract African penguins (Spheniscus demersus) are an endangered species. Little is known regarding their underwater hunting strategies and associated predation success rates, yet this is essential for guiding conservation. Modern bio-logging technology has the potential to provide valuable insights, but manually analysing large amounts of data from animal-borne video recorders (AVRs) is time-consuming. In this paper, we publish an animal-borne underwater video dataset of penguins and introduce a ready-to-deploy deep learning system capable of robustly detecting penguins (mAP50@98.0%) and also instances of fish (mAP50@73.3%). We note that the detectors benefit explicitly from air-bubble learning to improve accuracy. Extending this detector towards a dual-stream behaviour recognition network, we also provide the first results for identifying predation behaviour in penguin underwater videos. Whilst results are promising, further work is required for useful applicability of predation behaviour detection in field scenarios. In summary, we provide a highly reliable underwater penguin detector, a fish detector, and a valuable first attempt towards an automated visual detection of complex behaviours in a marine predator. We publish the networks, the DivingWithPenguins video dataset, annotations, splits, and weights for full reproducibility and immediate usability by practitioners.
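
As a rough illustration of how a frame-level detector is applied to animal-borne footage, the sketch below runs a generic torchvision detector over video frames and keeps high-confidence boxes. It is only a stand-in pipeline: the released DivingWithPenguins networks and weights (penguin/fish classes, air-bubble learning) are the ones to use in practice, and the video path here is hypothetical.

```python
import cv2
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Generic COCO-pretrained detector as a placeholder for the penguin/fish model.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

cap = cv2.VideoCapture("penguin_dive.mp4")  # hypothetical AVR clip
detections = []
with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
        out = model([tensor])[0]
        keep = out["scores"] > 0.5          # confidence threshold
        detections.append({k: v[keep].cpu() for k, v in out.items()})
cap.release()
print(f"processed {len(detections)} frames")
```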

Efficient Real-time Smoke Filtration with 3D LiDAR for Search and Rescue with Autonomous Heterogeneous Robotic Systems

  • paper_url: http://arxiv.org/abs/2308.07264
  • repo_url: None
  • paper_authors: Alexander Kyuroson, Anton Koval, George Nikolakopoulos
  • for: Improving the accuracy of autonomous navigation and localization for robots in smoke-filled subterranean environments
  • methods: Proposes a modular, platform-agnostic filtration pipeline that uses intensity and spatial information (e.g., local point density) to remove smoke particles and improve point cloud quality
  • results: Experimentally evaluated across multiple frontier-exploration missions, with comparisons against other methods in terms of computational impact and value for safe autonomous navigation
    Abstract Search and Rescue (SAR) missions in harsh and unstructured Sub-Terranean (Sub-T) environments in the presence of aerosol particles have recently become a main focus in the field of robotics. Aerosol particles such as smoke and dust directly affect the performance of any mobile robotic platform due to its reliance on onboard perception systems for autonomous navigation and localization in Global Navigation Satellite System (GNSS)-denied environments. Although obstacle avoidance and object detection algorithms are robust to the presence of noise to some degree, their performance directly relies on the quality of the data captured by onboard sensors such as Light Detection And Ranging (LiDAR) and cameras. Thus, this paper proposes a novel modular agnostic filtration pipeline based on intensity and spatial information, such as local point density, for removing detected smoke particles from the Point Cloud (PCL) prior to its use for collision detection. Furthermore, the efficacy of the proposed framework in the presence of smoke during multiple frontier exploration missions is investigated, and experimental results are presented to facilitate comparison with other methodologies and their computational impact. This provides valuable insight to the research community for better utilization of filtration schemes based on available computational resources while considering the safe autonomous navigation of mobile robots.
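
The filtration idea, dropping LiDAR returns that look like aerosol (low intensity in regions of low local point density), can be sketched with plain NumPy/SciPy. This is a minimal illustration of intensity- and density-based filtering under assumed thresholds, not the authors' modular pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_smoke(points, intensity, radius=0.3, min_neighbors=8, min_intensity=10.0):
    """Keep points that are either bright enough or locally dense enough.

    points:    (N, 3) LiDAR coordinates
    intensity: (N,)   return intensities
    """
    tree = cKDTree(points)
    # Local density = number of neighbors within `radius` of each point.
    neighbor_counts = np.array(
        [len(idx) - 1 for idx in tree.query_ball_point(points, r=radius)]
    )
    smoke_like = (intensity < min_intensity) & (neighbor_counts < min_neighbors)
    return points[~smoke_like], intensity[~smoke_like]

# Hypothetical scan: a dense wall plus sparse, dim "smoke" returns.
rng = np.random.default_rng(1)
wall = np.column_stack([np.full(2000, 5.0), rng.uniform(-2, 2, 2000), rng.uniform(0, 2, 2000)])
smoke = rng.uniform(0, 4, size=(300, 3))
pts = np.vstack([wall, smoke])
inten = np.concatenate([rng.uniform(40, 100, 2000), rng.uniform(0, 8, 300)])
clean_pts, clean_inten = filter_smoke(pts, inten)
print(pts.shape[0], "->", clean_pts.shape[0], "points after filtering")
```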

Large-kernel Attention for Efficient and Robust Brain Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2308.07251
  • repo_url: https://github.com/liamchalcroft/mdunet
  • paper_authors: Liam Chalcroft, Ruben Lourenço Pereira, Mikael Brudfors, Andrew S. Kayser, Mark D’Esposito, Cathy J. Price, Ioannis Pappas, John Ashburner
  • for: Proposes a transformer-block-based U-Net architecture for 3D brain lesion segmentation
  • methods: Uses an all-convolutional variant of the transformer block, mixing convolutional and transformer-style components to model long-range interactions
  • results: Provides the best compromise for 3D brain lesion segmentation: performance competitive with the state of the art, the parameter efficiency of a CNN, and the favourable inductive bias of translational invariance
    Abstract Vision transformers are effective deep learning models for vision tasks, including medical image segmentation. However, they lack efficiency and translational invariance, unlike convolutional neural networks (CNNs). To model long-range interactions in 3D brain lesion segmentation, we propose an all-convolutional transformer block variant of the U-Net architecture. We demonstrate that our model provides the greatest compromise in three factors: performance competitive with the state-of-the-art; parameter efficiency of a CNN; and the favourable inductive biases of a transformer. Our public implementation is available at https://github.com/liamchalcroft/MDUNet .
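
For intuition, a large-kernel attention block of the kind used in such all-convolutional designs can be written as a decomposed depthwise convolution (local depthwise + dilated depthwise + pointwise) whose output gates the input. The 3D sketch below is a generic block in that spirit, not necessarily the exact block used in MDUNet; kernel sizes and dilation are illustrative.

```python
import torch
import torch.nn as nn

class LargeKernelAttention3D(nn.Module):
    """Decomposed large-kernel attention: a 5x5x5 depthwise conv, a dilated
    7x7x7 depthwise conv, and a 1x1x1 pointwise conv; the result multiplicatively
    gates the input, keeping the whole block fully convolutional."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv3d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv3d(dim, dim, kernel_size=7, padding=9,
                                    dilation=3, groups=dim)
        self.pw = nn.Conv3d(dim, dim, kernel_size=1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn                      # element-wise gating

block = LargeKernelAttention3D(dim=32)
feat = torch.randn(1, 32, 24, 24, 24)        # (B, C, D, H, W) lesion features
print(block(feat).shape)                     # torch.Size([1, 32, 24, 24, 24])
```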

AAFACE: Attribute-aware Attentional Network for Face Recognition

  • paper_url: http://arxiv.org/abs/2308.07243
  • repo_url: None
  • paper_authors: Niloufar Alipour Talemi, Hossein Kashiani, Sahar Rahimi Malakshan, Mohammad Saeed Ebrahimi Saadabadi, Nima Najafzadeh, Mohammad Akyash, Nasser M. Nasrabadi
  • for: Proposes a new multi-branch neural network that simultaneously performs soft-biometric (SB) prediction and face recognition (FR)
  • methods: Uses SB features to enhance the discriminative ability of the FR representation. Specifically, an attribute-aware attentional integration (AAI) module performs a weighted integration of the FR and SB feature maps; the AAI module is fully context-aware and learns complex relationships between the input features
  • results: The proposed network outperforms state-of-the-art SB prediction and FR methods
    Abstract In this paper, we present a new multi-branch neural network that simultaneously performs soft biometric (SB) prediction as an auxiliary modality and face recognition (FR) as the main task. Our proposed network named AAFace utilizes SB attributes to enhance the discriminative ability of FR representation. To achieve this goal, we propose an attribute-aware attentional integration (AAI) module to perform weighted integration of FR with SB feature maps. Our proposed AAI module is not only fully context-aware but also capable of learning complex relationships between input features by means of the sequential multi-scale channel and spatial sub-modules. Experimental results verify the superiority of our proposed network compared with the state-of-the-art (SoTA) SB prediction and FR methods.
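
A minimal sketch of attribute-aware integration in this spirit: channel attention followed by spatial attention, both computed from the concatenated FR and SB feature maps, weights the SB contribution before it is added to the FR branch. The module layout and names below are assumptions for illustration; the paper's AAI module uses its own sequential multi-scale channel and spatial sub-modules.

```python
import torch
import torch.nn as nn

class AttributeAwareIntegration(nn.Module):
    """Illustrative weighted fusion of face-recognition (FR) and soft-biometric
    (SB) feature maps via sequential channel and spatial attention."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_attn = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, fr_feat, sb_feat):
        ctx = torch.cat([fr_feat, sb_feat], dim=1)
        sb_weighted = sb_feat * self.channel_attn(ctx)      # which attributes matter
        sb_weighted = sb_weighted * self.spatial_attn(ctx)  # where they matter
        return fr_feat + sb_weighted                        # enriched FR representation

aai = AttributeAwareIntegration(channels=256)
fused = aai(torch.randn(4, 256, 14, 14), torch.randn(4, 256, 14, 14))
print(fused.shape)  # torch.Size([4, 256, 14, 14])
```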

UniWorld: Autonomous Driving Pre-training via World Models

  • paper_url: http://arxiv.org/abs/2308.07234
  • repo_url: https://github.com/chaytonmin/uniworld
  • paper_authors: Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai
  • for: This paper is written for those interested in developing world models for robots, specifically for autonomous driving.
  • methods: The paper proposes a unified pre-training framework called UniWorld, which uses a spatial-temporal world model to perceive the surroundings and predict the future behavior of other participants. The framework is based on Alberto Elfes’ pioneering work in 1989 and uses a label-free pre-training process to build a foundational model.
  • results: The proposed method demonstrates promising results in key tasks such as motion prediction, multi-camera 3D object detection, and surrounding semantic scene completion. Compared to monocular pre-training methods on the nuScenes dataset, UniWorld shows a significant improvement of about 1.5% in IoU for motion prediction, 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion. Additionally, the method achieves a 25% reduction in 3D training annotation costs, offering significant practical value for real-world autonomous driving.
    Abstract In this paper, we draw inspiration from Alberto Elfes' pioneering work in 1989, where he introduced the concept of the occupancy grid as a World Model for robots. We imbue the robot with a spatial-temporal world model, termed UniWorld, to perceive its surroundings and predict the future behavior of other participants. UniWorld involves initially predicting 4D geometric occupancy as the World Model in a foundational stage and subsequently fine-tuning on downstream tasks. UniWorld can estimate missing information concerning the world state and predict plausible future states of the world. Besides, UniWorld's pre-training process is label-free, enabling the utilization of massive amounts of image-LiDAR pairs to build a Foundational Model. The proposed unified pre-training framework demonstrates promising results in key tasks such as motion prediction, multi-camera 3D object detection, and surrounding semantic scene completion. When compared to monocular pre-training methods on the nuScenes dataset, UniWorld shows a significant improvement of about 1.5% in IoU for motion prediction, 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion. By adopting our unified pre-training method, a 25% reduction in 3D training annotation costs can be achieved, offering significant practical value for the implementation of real-world autonomous driving. Codes are publicly available at https://github.com/chaytonmin/UniWorld.
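
The label-free pre-training signal is geometric occupancy that can be derived directly from LiDAR, so a toy version of the target construction and loss looks like the sketch below: voxelize a LiDAR sweep into a binary occupancy grid and supervise an occupancy head with a binary cross-entropy loss. The grid resolution, the stand-in head, and all names are illustrative; UniWorld's actual 4D spatio-temporal formulation and camera-LiDAR setup are described in the paper and repo.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def lidar_to_occupancy(points, grid=(50, 50, 8), extent=((-25, 25), (-25, 25), (-2, 2))):
    """Binary occupancy grid from a LiDAR point cloud (the label-free target)."""
    occ = np.zeros(grid, dtype=np.float32)
    for axis in range(3):                                   # crop to the grid extent
        lo, hi = extent[axis]
        points = points[(points[:, axis] >= lo) & (points[:, axis] < hi)]
    idx = np.stack([
        ((points[:, a] - extent[a][0]) / (extent[a][1] - extent[a][0]) * grid[a]).astype(int)
        for a in range(3)
    ], axis=1)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return torch.from_numpy(occ)

class OccupancyHead(nn.Module):
    """Stand-in 'world model' head: maps a feature vector to an occupancy volume."""
    def __init__(self, in_dim=256, grid=(50, 50, 8)):
        super().__init__()
        self.grid = grid
        self.fc = nn.Linear(in_dim, int(np.prod(grid)))

    def forward(self, feats):                        # feats: (B, in_dim)
        return self.fc(feats).view(-1, *self.grid)   # occupancy logits

head = OccupancyHead()
points = np.random.uniform(-25, 25, size=(20000, 3)) * np.array([1, 1, 0.08])
target = lidar_to_occupancy(points).unsqueeze(0)     # (1, 50, 50, 8)
logits = head(torch.randn(1, 256))
loss = F.binary_cross_entropy_with_logits(logits, target)
loss.backward()                                      # one label-free pre-training step
```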

RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs

  • paper_url: http://arxiv.org/abs/2308.07228
  • repo_url: None
  • paper_authors: Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, Ping Luo
  • for: Recovering high-quality face images from inputs with unknown degradations (blind face restoration)
  • methods: Proposes RestoreFormer++, a new blind face restoration algorithm that models contextual information in the face with a cross-attention mechanism and uses an extended degradation model to generate more realistic degraded images, improving robustness to real-world scenarios
  • results: Compared with current algorithms, RestoreFormer++ offers several advantages, including higher realness and fidelity as well as better robustness and generalization
    Abstract Blind face restoration aims at recovering high-quality face images from those with unknown degradations. Current algorithms mainly introduce priors to complement high-quality details and achieve impressive progress. However, most of these algorithms ignore abundant contextual information in the face and its interplay with the priors, leading to sub-optimal performance. Moreover, they pay less attention to the gap between synthetic and real-world scenarios, limiting robustness and generalization in real-world applications. In this work, we propose RestoreFormer++, which on the one hand introduces fully-spatial attention mechanisms to model the contextual information and its interplay with the priors, and on the other hand explores an extended degradation model to help generate more realistic degraded face images, alleviating the synthetic-to-real-world gap. Compared with current algorithms, RestoreFormer++ has several crucial benefits. First, instead of using a multi-head self-attention mechanism like the traditional visual transformer, we introduce multi-head cross-attention over multi-scale features to fully explore spatial interactions between corrupted information and high-quality priors. In this way, RestoreFormer++ can restore face images with higher realness and fidelity. Second, in contrast to a recognition-oriented dictionary, we learn a reconstruction-oriented dictionary as priors, which contains more diverse high-quality facial details and better accords with the restoration target. Third, we introduce an extended degradation model that covers more realistic degradation scenarios for synthesizing training data, and thus helps to enhance the robustness and generalization of our RestoreFormer++ model. Extensive experiments show that RestoreFormer++ outperforms state-of-the-art algorithms on both synthetic and real-world datasets.
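
The key mechanism, multi-head cross-attention in which degraded-face features act as queries while a reconstruction-oriented dictionary of high-quality facial priors supplies keys and values, can be sketched in a few lines with `torch.nn.MultiheadAttention`. The dictionary size, feature dimensions, and single-scale layout below are simplifications; RestoreFormer++ applies this over multi-scale features inside a full restoration network.

```python
import torch
import torch.nn as nn

class PriorCrossAttention(nn.Module):
    """Degraded features (queries) attend over a learned dictionary of
    high-quality facial priors (keys/values)."""
    def __init__(self, dim=256, num_priors=1024, num_heads=8):
        super().__init__()
        self.prior_dict = nn.Parameter(torch.randn(num_priors, dim))  # reconstruction-oriented dictionary
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat):                 # feat: (B, C, H, W) degraded-face features
        b, c, h, w = feat.shape
        q = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) query tokens
        kv = self.prior_dict.unsqueeze(0).expand(b, -1, -1)
        restored, _ = self.attn(q, kv, kv)   # fuse priors into corrupted regions
        out = self.norm(q + restored)        # residual connection
        return out.transpose(1, 2).view(b, c, h, w)

layer = PriorCrossAttention()
restored = layer(torch.randn(2, 256, 16, 16))
print(restored.shape)  # torch.Size([2, 256, 16, 16])
```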