cs.LG - 2023-08-12

CoverNav: Cover Following Navigation Planning in Unstructured Outdoor Environment with Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.06594
  • repo_url: None
  • paper_authors: Jumman Hossain, Abu-Zaher Faridee, Nirmalya Roy, Anjan Basak, Derrik E. Asher
  • for: Proposes a Deep Reinforcement Learning (DRL) based algorithm for navigating off-road environments while remaining hidden from outside observers, safely reaching a predefined destination even when observers are present.
  • methods: The algorithm computes a local cost map from 3D point cloud data, the robot's pose, and directed goal information, and uses it to select the path offering maximal covertness at low cost.
  • results: Evaluated in a Unity simulation environment, CoverNav maintains dynamically feasible velocities in the terrain, achieves a maximum goal distance of 12 meters, and attains success rates across different elevation scenarios comparable to state-of-the-art (SOTA) methods.
    Abstract Autonomous navigation in offroad environments has been extensively studied in the robotics field. However, navigation in covert situations where an autonomous vehicle needs to remain hidden from outside observers remains an underexplored area. In this paper, we propose a novel Deep Reinforcement Learning (DRL) based algorithm, called CoverNav, for identifying covert and navigable trajectories with minimal cost in offroad terrains and jungle environments in the presence of observers. CoverNav focuses on unmanned ground vehicles seeking shelters and taking covers while safely navigating to a predefined destination. Our proposed DRL method computes a local cost map that helps distinguish which path will grant the maximal covertness while maintaining a low cost trajectory using an elevation map generated from 3D point cloud data, the robot's pose, and directed goal information. CoverNav helps robot agents to learn the low elevation terrain using a reward function while penalizing it proportionately when it experiences high elevation. If an observer is spotted, CoverNav enables the robot to select natural obstacles (e.g., rocks, houses, disabled vehicles, trees, etc.) and use them as shelters to hide behind. We evaluate CoverNav using the Unity simulation environment and show that it guarantees dynamically feasible velocities in the terrain when fed with an elevation map generated by another DRL based navigation algorithm. Additionally, we evaluate CoverNav's effectiveness in achieving a maximum goal distance of 12 meters and its success rate in different elevation scenarios with and without cover objects. We observe competitive performance comparable to state of the art (SOTA) methods without compromising accuracy.
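    Illustrative sketch (not from the paper): the abstract describes a reward that encourages goal progress, penalizes elevation proportionately, and favors taking cover when an observer is spotted. A minimal Python rendering of such a reward, with hypothetical weights:

```python
def covernav_reward(elevation, goal_dist, prev_goal_dist,
                    observer_visible, in_cover,
                    w_elev=1.0, w_goal=2.0, w_cover=5.0):
    """Sketch of a CoverNav-style step reward (all weights are illustrative).

    Rewards progress toward the goal, penalizes elevation proportionally,
    and rewards being in cover (or penalizes exposure) when an observer
    is spotted.
    """
    r = w_goal * (prev_goal_dist - goal_dist)   # progress toward the goal
    r -= w_elev * max(elevation, 0.0)           # proportional elevation penalty
    if observer_visible:
        r += w_cover if in_cover else -w_cover  # sheltered vs. exposed
    return r
```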

Value-Distributional Model-Based Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.06590
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters
  • for: This work addresses uncertainty quantification about a policy's long-term performance in sequential decision-making, from a model-based Bayesian reinforcement learning perspective.
  • methods: Drawing on distributional reinforcement learning, it introduces a Bellman operator whose fixed point is the value distribution function, and learns the posterior distribution over value functions induced by epistemic (parameter) uncertainty.
  • results: Evaluation across several continuous-control tasks shows that the proposed EQR algorithm outperforms established model-based and model-free algorithms.
    Abstract Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks. We study the problem from a model-based Bayesian reinforcement learning perspective, where the goal is to learn the posterior distribution over value functions induced by parameter (epistemic) uncertainty of the Markov decision process. Previous work restricts the analysis to a few moments of the distribution over values or imposes a particular distribution shape, e.g., Gaussians. Inspired by distributional reinforcement learning, we introduce a Bellman operator whose fixed-point is the value distribution function. Based on our theory, we propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function that can be used for policy optimization. Evaluation across several continuous-control tasks shows performance benefits with respect to established model-based and model-free algorithms.
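    Illustrative sketch (not from the paper): EQR builds on quantile regression of value distributions. Below is a minimal PyTorch pinball (quantile-regression) loss of the kind such methods use; the paper's specific Bellman-target construction is omitted:

```python
import torch

def quantile_regression_loss(pred_quantiles, targets, taus):
    """Pinball loss for learning a distribution over values as quantiles.

    pred_quantiles: (batch, n_quantiles) predicted value quantiles
    targets:        (batch, n_targets)   sampled Bellman targets
    taus:           (n_quantiles,)       quantile fractions in (0, 1)
    """
    # Pairwise errors between every target and every quantile
    u = targets.unsqueeze(-1) - pred_quantiles.unsqueeze(1)  # (batch, n_targets, n_quantiles)
    # Asymmetric weight: tau on underestimates, (1 - tau) on overestimates
    return (torch.abs(taus - (u.detach() < 0).float()) * torch.abs(u)).mean()

taus = torch.linspace(0.05, 0.95, 10)
loss = quantile_regression_loss(torch.randn(32, 10), torch.randn(32, 8), taus)
```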

Approximate Answering of Graph Queries

  • paper_url: http://arxiv.org/abs/2308.06585
  • repo_url: None
  • paper_authors: Michael Cochez, Dimitrios Alivanistos, Erik Arakelyan, Max Berrendorf, Daniel Daza, Mikhail Galkin, Pasquale Minervini, Mathias Niepert, Hongyu Ren
  • for: Answering queries in an incomplete knowledge graph (KG) setting.
  • methods: Several methods have been proposed to answer queries in an incomplete KG setting, including approaches based on semantic search, knowledge graph completion, and embedding-based methods.
  • results: These methods have been shown to be effective in answering queries in an incomplete KG setting, but they have limitations in terms of expressiveness, supported graph types, and inference capabilities.
    Abstract Knowledge graphs (KGs) are inherently incomplete because of incomplete world knowledge and bias in what is the input to the KG. Additionally, world knowledge constantly expands and evolves, making existing facts deprecated or introducing new ones. However, we would still want to be able to answer queries as if the graph were complete. In this chapter, we will give an overview of several methods which have been proposed to answer queries in such a setting. We will first provide an overview of the different query types which can be supported by these methods and datasets typically used for evaluation, as well as an insight into their limitations. Then, we give an overview of the different approaches and describe them in terms of expressiveness, supported graph types, and inference capabilities.

A new solution and concrete implementation steps for Artificial General Intelligence

  • paper_url: http://arxiv.org/abs/2308.09721
  • repo_url: None
  • paper_authors: Yongcong Chen, Ting Zeng, Jun Zhang
  • for: This paper aims to address the defects of existing techniques in order to achieve Artificial General Intelligence (AGI) applicable to any field.
  • methods: The paper analyzes the limitations of the large-model technical route and proposes solutions that combine existing technologies with remedies for their defects.
  • results: The paper proposes solutions to the inherent defects of large models and outlines step-by-step how to achieve true AGI.
    Abstract At present, the mainstream artificial intelligence generally adopts the technical path of "attention mechanism + deep learning" + "reinforcement learning". It has made great progress in the field of AIGC (Artificial Intelligence Generated Content), setting off the technical wave of big models [2][13]. But in areas that need to interact with the actual environment, such as elderly care, home nanny, agricultural production, and vehicle driving, trial and error are expensive and a reinforcement learning process that requires much trial and error is difficult to achieve. Therefore, in order to achieve Artificial General Intelligence (AGI) that can be applied to any field, we need to use both existing technologies and solve the defects of existing technologies, so as to further develop the technological wave of artificial intelligence. In this paper, we analyze the limitations of the technical route of large models, and by addressing these limitations, we propose solutions, thus solving the inherent defects of large models. In this paper, we will reveal how to achieve true AGI step by step.

EquiDiff: A Conditional Equivariant Diffusion Model For Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2308.06564
  • repo_url: None
  • paper_authors: Kehua Chen, Xianda Chen, Zihan Yu, Meixin Zhu, Hai Yang
  • for: Predicting the future trajectories of autonomous vehicles to ensure safe and efficient operation.
  • methods: Proposes EquiDiff, a conditional diffusion model that generates future trajectories from historical information and random Gaussian noise, with an SO(2)-equivariant transformer backbone plus Recurrent Neural Networks and Graph Attention Networks to extract social interactions from historical trajectories.
  • results: On the NGSIM dataset, EquiDiff outperforms baselines in short-term prediction but shows slightly higher errors for long-term prediction. An ablation study investigates each component's contribution to prediction accuracy, and a visualization of the diffusion generation process illustrates the uncertainty of the predictions.
    Abstract Accurate trajectory prediction is crucial for the safe and efficient operation of autonomous vehicles. The growing popularity of deep learning has led to the development of numerous methods for trajectory prediction. While deterministic deep learning models have been widely used, deep generative models have gained popularity as they learn data distributions from training data and account for trajectory uncertainties. In this study, we propose EquiDiff, a deep generative model for predicting future vehicle trajectories. EquiDiff is based on the conditional diffusion model, which generates future trajectories by incorporating historical information and random Gaussian noise. The backbone model of EquiDiff is an SO(2)-equivariant transformer that fully utilizes the geometric properties of location coordinates. In addition, we employ Recurrent Neural Networks and Graph Attention Networks to extract social interactions from historical trajectories. To evaluate the performance of EquiDiff, we conduct extensive experiments on the NGSIM dataset. Our results demonstrate that EquiDiff outperforms other baseline models in short-term prediction, but has slightly higher errors for long-term prediction. Furthermore, we conduct an ablation study to investigate the contribution of each component of EquiDiff to the prediction accuracy. Additionally, we present a visualization of the generation process of our diffusion model, providing insights into the uncertainty of the prediction.

Human Behavior-based Personalized Meal Recommendation and Menu Planning Social System

  • paper_url: http://arxiv.org/abs/2308.06549
  • repo_url: None
  • paper_authors: Tanvir Islam, Anika Rahman Joyita, Md. Golam Rabiul Alam, Mohammad Mehedi Hassan, Md. Rafiul Hassan, Raffaele Gravina
  • for: This study provides an affective-computing-based meal recommendation and menu planning method that accounts for users' varying emotional responses to food.
  • methods: Meal preferences are captured through questionnaires and preference awareness, while affect toward different foods is detected from electroencephalography (EEG) signals measured with a 14-channel wireless Emotive Epoc+ headset; a hierarchical ensemble predicts affectivity and TOPSIS ranks the food list.
  • results: Experimental findings show that the proposed affective computing, meal recommendation, and menu planning algorithms perform well across a variety of assessment parameters.
    Abstract The traditional dietary recommendation systems are basically nutrition or health-aware where the human feelings on food are ignored. Human affects vary when it comes to food cravings, and not all foods are appealing in all moods. A questionnaire-based and preference-aware meal recommendation system can be a solution. However, automated recognition of social affects on different foods and planning the menu considering nutritional demand and social-affect has some significant benefits of the questionnaire-based and preference-aware meal recommendations. A patient with severe illness, a person in a coma, or patients with locked-in syndrome and amyotrophic lateral sclerosis (ALS) cannot express their meal preferences. Therefore, the proposed framework includes a social-affective computing module to recognize the affects of different meals where the person's affect is detected using electroencephalography signals. EEG allows to capture the brain signals and analyze them to anticipate affective toward a food. In this study, we have used a 14-channel wireless Emotive Epoc+ to measure affectivity for different food items. A hierarchical ensemble method is applied to predict affectivity upon multiple feature extraction methods and TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution) is used to generate a food list based on the predicted affectivity. In addition to the meal recommendation, an automated menu planning approach is also proposed considering a person's energy intake requirement, affectivity, and nutritional values of the different menus. The bin-packing algorithm is used for the personalized menu planning of breakfast, lunch, dinner, and snacks. The experimental findings reveal that the suggested affective computing, meal recommendation, and menu planning algorithms perform well across a variety of assessment parameters.
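    Illustrative sketch (not from the paper): TOPSIS ranks alternatives by closeness to an ideal solution. A generic implementation with made-up criteria and weights:

```python
import numpy as np

def topsis(scores, weights, benefit):
    """Rank alternatives by closeness to the ideal solution.

    scores:  (n_alternatives, n_criteria) decision matrix
    weights: (n_criteria,) criterion weights
    benefit: (n_criteria,) True where higher is better
    """
    v = scores / np.linalg.norm(scores, axis=0) * weights  # weighted, normalized
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti, axis=1)
    return d_neg / (d_pos + d_neg)                         # higher = better

# Rank 4 hypothetical food items on predicted affectivity (maximize) and calories (minimize)
scores = np.array([[0.9, 450.0], [0.6, 300.0], [0.8, 700.0], [0.4, 200.0]])
closeness = topsis(scores, np.array([0.7, 0.3]), np.array([True, False]))
print(np.argsort(-closeness))  # best-to-worst indices
```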

Digital elevation model correction in urban areas using extreme gradient boosting, land cover and terrain parameters

  • paper_url: http://arxiv.org/abs/2308.06545
  • repo_url: None
  • paper_authors: Chukwuma Okolie, Jon Mills, Adedayo Adeleke, Julian Smit
  • for: This paper aims to enhance the accuracy of medium-resolution digital elevation models (DEMs) in urban areas using the extreme gradient boosting (XGBoost) ensemble algorithm.
  • methods: The XGBoost algorithm was applied to two medium-resolution DEMs over Cape Town, South Africa, using eleven predictor variables, including elevation, urban footprints, and terrain features.
  • results: The correction achieved significant accuracy gains: the root mean square error (RMSE) of the two DEMs improved by 46-53% and 72-73% respectively, competitive with other proposed methods. These results demonstrate the potential of gradient boosted trees for enhancing the quality of DEMs and improving hydrological modelling in urban catchments.
    Abstract The accuracy of digital elevation models (DEMs) in urban areas is influenced by numerous factors including land cover and terrain irregularities. Moreover, building artifacts in global DEMs cause artificial blocking of surface flow pathways. This compromises their quality and adequacy for hydrological and environmental modelling in urban landscapes where precise and accurate terrain information is needed. In this study, the extreme gradient boosting (XGBoost) ensemble algorithm is adopted for enhancing the accuracy of two medium-resolution 30m DEMs over Cape Town, South Africa: Copernicus GLO-30 and ALOS World 3D (AW3D). XGBoost is a scalable, portable and versatile gradient boosting library that can solve many environmental modelling problems. The training datasets are comprised of eleven predictor variables including elevation, urban footprints, slope, aspect, surface roughness, topographic position index, terrain ruggedness index, terrain surface texture, vector roughness measure, forest cover and bare ground cover. The target variable (elevation error) was calculated with respect to highly accurate airborne LiDAR. After training and testing, the model was applied for correcting the DEMs at two implementation sites. The correction achieved significant accuracy gains which are competitive with other proposed methods. The root mean square error (RMSE) of Copernicus DEM improved by 46 to 53% while the RMSE of AW3D DEM improved by 72 to 73%. These results showcase the potential of gradient boosted trees for enhancing the quality of DEMs, and for improved hydrological modelling in urban catchments.
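    Illustrative sketch (not from the paper): the described workflow trains XGBoost on the predictor variables to regress the LiDAR-referenced elevation error, then subtracts the predicted error from the DEM. A minimal version with placeholder data and hyperparameters:

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

# X: (n_cells, 11) predictor stack (elevation, urban footprint, slope, aspect, ...)
# y: (n_cells,) elevation error = DEM elevation - LiDAR reference elevation
X, y = np.random.rand(10000, 11), np.random.randn(10000)  # placeholder data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.05)
model.fit(X_tr, y_tr)

pred_error = model.predict(X_te)
corrected_dem = X_te[:, 0] - pred_error                  # assuming column 0 is DEM elevation
rmse_before = np.sqrt(np.mean(y_te ** 2))                # error of the raw DEM
rmse_after = np.sqrt(np.mean((y_te - pred_error) ** 2))  # residual after correction
print(f"RMSE: {rmse_before:.2f} -> {rmse_after:.2f}")
```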

Dealing with Small Datasets for Deep Learning in Medical Imaging: An Evaluation of Self-Supervised Pre-Training on CT Scans Comparing Contrastive and Masked Autoencoder Methods for Convolutional Models

  • paper_url: http://arxiv.org/abs/2308.06534
  • repo_url: https://github.com/wolfda95/ssl-medicalimagining-cl-mae
  • paper_authors: Daniel Wolf, Tristan Payer, Catharina Silvia Lisson, Christoph Gerhard Lisson, Meinrad Beer, Timo Ropinski, Michael Götz
  • for: This work explores deep learning models for medical imaging, which can reduce radiologist workload, accelerate diagnosis, and minimize the risk of diagnostic errors.
  • methods: Self-supervised learning methods from natural image processing are used to pre-train deep models on large unannotated medical image datasets, followed by fine-tuning on small annotated datasets; contrastive methods are compared against the masked autoencoder approach SparK for convolutional models.
  • results: The SparK pre-training method proves more robust to small annotated fine-tuning datasets than the contrastive methods, with the gap widening as the dataset shrinks; the study therefore recommends SparK pre-training for medical imaging tasks with only small annotated datasets.
    Abstract Deep learning in medical imaging has the potential to minimize the risk of diagnostic errors, reduce radiologist workload, and accelerate diagnosis. Training such deep learning models requires large and accurate datasets, with annotations for all training samples. However, in the medical imaging domain, annotated datasets for specific tasks are often small due to the high complexity of annotations, limited access, or the rarity of diseases. To address this challenge, deep learning models can be pre-trained on large image datasets without annotations using methods from the field of self-supervised learning. After pre-training, small annotated datasets are sufficient to fine-tune the models for a specific task. The most popular self-supervised pre-training approaches in medical imaging are based on contrastive learning. However, recent studies in natural image processing indicate a strong potential for masked autoencoder approaches. Our work compares state-of-the-art contrastive learning methods with the recently introduced masked autoencoder approach "SparK" for convolutional neural networks (CNNs) on medical images. Therefore we pre-train on a large unannotated CT image dataset and fine-tune on several CT classification tasks. Due to the challenge of obtaining sufficient annotated training data in medical imaging, it is of particular interest to evaluate how the self-supervised pre-training methods perform when fine-tuning on small datasets. By experimenting with gradually reducing the training dataset size for fine-tuning, we find that the reduction has different effects depending on the type of pre-training chosen. The SparK pre-training method is more robust to the training dataset size than the contrastive methods. Based on our results, we propose the SparK pre-training for medical imaging tasks with only small annotated datasets.

Learning Abstract Visual Reasoning via Task Decomposition: A Case Study in Raven Progressive Matrices

  • paper_url: http://arxiv.org/abs/2308.06528
  • repo_url: https://github.com/jakubkwiatkowski/abstract_compositional_transformer
  • paper_authors: Jakub Kwiatkowski, Krzysztof Krawiec
  • for: This work addresses abstract visual reasoning, solving Raven Progressive Matrices (RPM) by predicting the visual properties and arrangements of the objects in the images.
  • methods: A deep learning architecture based on the transformer blueprint predicts the visual properties of individual objects and their arrangements; several ways of parsing the visual input into tokens and several input-masking regimes for self-supervised training are considered.
  • results: Experiments show that the approach not only outperforms state-of-the-art methods but also provides interesting insights and partial explanations about the inference, while remaining immune to biases known to exist in some RPM benchmarks.
    Abstract One of the challenges in learning to perform abstract reasoning is that problems are often posed as monolithic tasks, with no intermediate subgoals. In Raven Progressive Matrices (RPM), the task is to choose one of the available answers given a context, where both contexts and answers are composite images featuring multiple objects in various spatial arrangements. As this high-level goal is the only guidance available, learning is challenging and most contemporary solvers tend to be opaque. In this study, we propose a deep learning architecture based on the transformer blueprint which, rather than directly making the above choice, predicts the visual properties of individual objects and their arrangements. The multidimensional predictions obtained in this way are then directly juxtaposed to choose the answer. We consider a few ways in which the model parses the visual input into tokens and several regimes of masking parts of the input in self-supervised training. In experimental assessment, the models not only outperform state-of-the-art methods but also provide interesting insights and partial explanations about the inference. The design of the method also makes it immune to biases that are known to exist in some RPM benchmarks.

SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models

  • paper_url: http://arxiv.org/abs/2308.06522
  • repo_url: None
  • paper_authors: Sara Babakniya, Ahmed Roushdy Elkordy, Yahya H. Ezzeldin, Qingfeng Liu, Kee-Bong Song, Mostafa El-Khamy, Salman Avestimehr
  • for: This work explores applying parameter-efficient fine-tuning (PEFT) methods to distributed language tasks, improving the feasibility and efficiency of Federated Learning (FL).
  • methods: It studies PEFT methods in different FL settings and proposes SLoRA, which overcomes the key limitations of LoRA under high data heterogeneity via a novel data-driven initialization technique.
  • results: Experiments show that SLoRA matches full fine-tuning with highly sparse updates of roughly ~1% density, improves performance in highly heterogeneous data scenarios, and reduces training time by up to 90%.
    Abstract Transfer learning via fine-tuning pre-trained transformer models has gained significant success in delivering state-of-the-art results across various NLP tasks. In the absence of centralized data, Federated Learning (FL) can benefit from distributed and private data of the FL edge clients for fine-tuning. However, due to the limited communication, computation, and storage capabilities of edge devices and the huge sizes of popular transformer models, efficient fine-tuning is crucial to make federated training feasible. This work explores the opportunities and challenges associated with applying parameter efficient fine-tuning (PEFT) methods in different FL settings for language tasks. Specifically, our investigation reveals that as the data across users becomes more diverse, the gap between fully fine-tuning the model and employing PEFT methods widens. To bridge this performance gap, we propose a method called SLoRA, which overcomes the key limitations of LoRA in high heterogeneous data scenarios through a novel data-driven initialization technique. Our experimental results demonstrate that SLoRA achieves performance comparable to full fine-tuning, with significant sparse updates with approximately $\sim 1\%$ density while reducing training time by up to $90\%$.
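    Illustrative sketch (not from the paper): a standard LoRA adapter of the kind SLoRA builds on. SLoRA's data-driven initialization is paper-specific, so B uses the usual zero init here, with a comment marking where it would plug in:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        # SLoRA would replace this zero init of B with its data-driven initialization
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # adapter parameters only: far sparser than full fine-tuning
```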

One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training

  • paper_url: http://arxiv.org/abs/2308.07934
  • repo_url: https://github.com/jianshuod/tba
  • paper_authors: Jianshuo Dong, Han Qiu, Yiming Li, Tianwei Zhang, Yuanjie Li, Zeqi Lai, Chao Zhang, Shu-Tao Xia
  • for: Security of deep neural networks (DNNs) deployed on real-world devices.
  • methods: Uses memory fault injection techniques such as row hammer to attack quantized models at the deployment stage by modifying their weights, with the adversary additionally involved in the training stage to build and release a high-risk model.
  • results: Flipping only one critical bit on average converts the released high-risk but normally behaving model into a malicious one, and the attack can evade various detection methods and remains a significant threat even when defenses are employed.
    Abstract Deep neural networks (DNNs) are widely deployed on real-world devices. Concerns regarding their security have gained great attention from researchers. Recently, a new weight modification attack called bit flip attack (BFA) was proposed, which exploits memory fault inject techniques such as row hammer to attack quantized models in the deployment stage. With only a few bit flips, the target model can be rendered useless as a random guesser or even be implanted with malicious functionalities. In this work, we seek to further reduce the number of bit flips. We propose a training-assisted bit flip attack, in which the adversary is involved in the training stage to build a high-risk model to release. This high-risk model, obtained coupled with a corresponding malicious model, behaves normally and can escape various detection methods. The results on benchmark datasets show that an adversary can easily convert this high-risk but normal model to a malicious one on victim's side by flipping only one critical bit on average in the deployment stage. Moreover, our attack still poses a significant threat even when defenses are employed. The codes for reproducing main experiments are available at https://github.com/jianshuod/TBA.
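    Illustrative sketch (not from the paper): identifying the critical bit is the paper's contribution; this only shows the mechanics of flipping one bit of an int8 quantized weight, as a row-hammer fault would:

```python
import numpy as np

def flip_bit(weights_int8: np.ndarray, index: int, bit: int) -> np.ndarray:
    """Flip one bit of one int8 weight, simulating a row-hammer memory fault."""
    flat = weights_int8.copy().ravel()
    flat.view(np.uint8)[index] ^= np.uint8(1 << bit)  # reinterpret bits, flip in place
    return flat.reshape(weights_int8.shape)

w = np.array([12, -3, 77], dtype=np.int8)
print(w, flip_bit(w, index=2, bit=7))  # flipping the MSB: 77 -> -51
```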

Performance Analysis for Resource Constrained Decentralized Federated Learning Over Wireless Networks

  • paper_url: http://arxiv.org/abs/2308.06496
  • repo_url: None
  • paper_authors: Zhigang Yan, Dong Li
  • for: This work analyzes communication-efficiency optimization for resource-constrained decentralized federated learning (DFL) systems over wireless networks.
  • methods: It derives convergence bounds for both digital and analog transmission schemes, analyzing resource allocation between computation and communication for the digital case and the impact of channel fading and noise for the analog case.
  • results: Numerical simulations with CNNs and Vision Transformers on Fashion-MNIST and CIFAR-10 validate the analysis and show how performance can be improved by optimizing system parameters under different communication conditions.
    Abstract Federated learning (FL) can lead to significant communication overhead and reliance on a central server. To address these challenges, decentralized federated learning (DFL) has been proposed as a more resilient framework. DFL involves parameter exchange between devices through a wireless network. This study analyzes the performance of resource-constrained DFL using different communication schemes (digital and analog) over wireless networks to optimize communication efficiency. Specifically, we provide convergence bounds for both digital and analog transmission approaches, enabling analysis of the model performance trained on DFL. Furthermore, for digital transmission, we investigate and analyze resource allocation between computation and communication and convergence rates, obtaining its communication complexity and the minimum probability of correction communication required for convergence guarantee. For analog transmission, we discuss the impact of channel fading and noise on the model performance and the maximum errors accumulation with convergence guarantee over fading channels. Finally, we conduct numerical simulations to evaluate the performance and convergence rate of convolutional neural networks (CNNs) and Vision Transformer (ViT) trained in the DFL framework on fashion-MNIST and CIFAR-10 datasets. Our simulation results validate our analysis and discussion, revealing how to improve performance by optimizing system parameters under different communication conditions.
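    Illustrative sketch (not from the paper): one decentralized FL round, i.e., a local gradient step followed by gossip averaging with wireless neighbors via a doubly stochastic mixing matrix; the paper's digital/analog channel models are omitted:

```python
import numpy as np

def dfl_round(params, W, grads, lr=0.1):
    """One DFL round: local gradient step, then neighbor averaging.

    params: (n_devices, d) current model parameters per device
    W:      (n_devices, n_devices) doubly stochastic mixing matrix;
            W[i, j] > 0 only if devices i and j share a wireless link
    grads:  (n_devices, d) local gradients
    """
    local = params - lr * grads  # local update on each device
    return W @ local             # consensus (gossip) averaging step

# Ring topology over 4 devices, each averaging with its two neighbors
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
params = np.random.randn(4, 10)
params = dfl_round(params, W, grads=np.random.randn(4, 10))
```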

Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

  • paper_url: http://arxiv.org/abs/2308.06472
  • repo_url: None
  • paper_authors: Kumari Nishu, Minsik Cho, Paul Dixon, Devang Naik
  • for: This paper focuses on efficiently detecting arbitrary keywords across audio and text modalities, using an audio-compliant text encoder that reduces the mismatch between text and audio embeddings.
  • methods: The proposed architecture converts text to phonemes with a grapheme-to-phoneme (G2P) model and then to an embedding using representative phoneme vectors extracted from a paired audio encoder trained on rich speech datasets; confusable keyword generation is further used to build an audio-text embedding verifier with strong discriminative power.
  • results: On the Libriphrase hard dataset, the method raises the Area Under the ROC Curve (AUC) from the previous state-of-the-art 84.21% to 92.7% and reduces the Equal-Error-Rate (EER) from 23.36% to 14.4%.
    Abstract Spotting user-defined/flexible keywords represented in text frequently uses an expensive text encoder for joint analysis with an audio encoder in an embedding space, which can suffer from heterogeneous modality representation (i.e., large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder which inherently has homogeneous representation with audio embedding, and it is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors, extracted from the paired audio encoder on rich speech datasets. We further augment our method with confusable keyword generation to develop an audio-text embedding verifier with strong discriminative power. Experimental results show that our scheme outperforms the state-of-the-art results on Libriphrase hard dataset, increasing Area Under the ROC Curve (AUC) metric from 84.21% to 92.7% and reducing Equal-Error-Rate (EER) metric from 23.36% to 14.4%.
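    Illustrative sketch (not from the paper): the matching step, text → phonemes → averaged phoneme embeddings, scored against an audio embedding by cosine similarity. The G2P lexicon and phoneme vectors below are random stand-ins for the paper's learned components:

```python
import numpy as np

# Stand-ins: a tiny G2P lexicon and per-phoneme vectors that the paper
# extracts from a paired audio encoder trained on rich speech datasets.
G2P = {"hey": ["HH", "EY"], "siri": ["S", "IH", "R", "IY"]}
rng = np.random.default_rng(0)
PHONEME_VECS = {p: rng.normal(size=64) for p in ["HH", "EY", "S", "IH", "R", "IY"]}

def encode_text(keyword: str) -> np.ndarray:
    """Map a keyword to an audio-compliant embedding via its phonemes."""
    phonemes = [p for word in keyword.lower().split() for p in G2P[word]]
    return np.mean([PHONEME_VECS[p] for p in phonemes], axis=0)

def keyword_score(audio_emb: np.ndarray, keyword: str) -> float:
    t = encode_text(keyword)
    return float(audio_emb @ t / (np.linalg.norm(audio_emb) * np.linalg.norm(t)))

audio_emb = rng.normal(size=64)  # placeholder output of the audio encoder
print(keyword_score(audio_emb, "hey siri"))
```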

Volterra Accentuated Non-Linear Dynamical Admittance (VANYA) to model Deforestation: An Exemplification from the Amazon Rainforest

  • paper_url: http://arxiv.org/abs/2308.06471
  • repo_url: None
  • paper_authors: Karthik R., Ramamoorthy A.
  • for: This work aims to forecast rainforest cover by modeling deforestation with prey-predator dynamics.
  • methods: The VANYA model, which incorporates prey-predator dynamics, is applied to Amazon Rainforest data.
  • results: VANYA outperforms other forecasters such as Long Short-Term Memory, N-BEATS, and RCN at predicting forest cover.
    Abstract Intelligent automation supports us against cyclones, droughts, and seismic events with recent technology advancements. Algorithmic learning has advanced fields like neuroscience, genetics, and human-computer interaction. Time-series data boosts progress. Challenges persist in adopting these approaches in traditional fields. Neural networks face comprehension and bias issues. AI's expansion across scientific areas is due to adaptable descriptors and combinatorial argumentation. This article focuses on modeling Forest loss using the VANYA Model, incorporating Prey Predator Dynamics. VANYA predicts forest cover, demonstrated on Amazon Rainforest data against other forecasters like Long Short-Term Memory, N-BEATS, RCN.
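    Illustrative sketch (not from the paper): the prey-predator (Lotka-Volterra) dynamics the model incorporates, with forest cover cast as "prey" and deforestation pressure as "predator"; all coefficients are made up:

```python
import numpy as np

def lotka_volterra(x0, y0, alpha, beta, delta, gamma, dt=0.01, steps=10000):
    """Euler integration of dx/dt = a*x - b*x*y, dy/dt = d*x*y - g*y."""
    x, y = np.empty(steps), np.empty(steps)
    x[0], y[0] = x0, y0
    for t in range(steps - 1):
        x[t + 1] = x[t] + dt * (alpha * x[t] - beta * x[t] * y[t])
        y[t + 1] = y[t] + dt * (delta * x[t] * y[t] - gamma * y[t])
    return x, y

# Forest cover (prey) vs. deforestation pressure (predator)
cover, pressure = lotka_volterra(x0=1.0, y0=0.5, alpha=1.1, beta=0.4,
                                 delta=0.1, gamma=0.4)
print(cover[-1], pressure[-1])
```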

Tiny and Efficient Model for the Edge Detection Generalization

  • paper_url: http://arxiv.org/abs/2308.06468
  • repo_url: https://github.com/xavysp/teed
  • paper_authors: Xavier Soria, Yachuan Li, Mohammad Rouhani, Angel D. Sappa
  • for: Improving edge detection generalization while keeping the model simple and efficient.
  • methods: Proposes Tiny and Efficient Edge Detector (TEED), a light convolutional neural network with only 58K parameters, far fewer than state-of-the-art models.
  • results: Training on BIPED takes less than 30 minutes, with each epoch under 5 minutes, and the predicted edge maps are crisp and of high quality; a newly proposed test set evaluates the generalization of edge detection models.
    Abstract Most high-level computer vision tasks rely on low-level image operations as their initial processes. Operations such as edge detection, image enhancement, and super-resolution, provide the foundations for higher level image analysis. In this work we address edge detection considering three main objectives: simplicity, efficiency, and generalization, since current state-of-the-art (SOTA) edge detection models are increased in complexity for better accuracy. To achieve this, we present Tiny and Efficient Edge Detector (TEED), a light convolutional neural network with only $58K$ parameters, less than $0.2$% of the state-of-the-art models. Training on the BIPED dataset takes less than 30 minutes, with each epoch requiring less than 5 minutes. Our proposed model is easy to train and it quickly converges within the very first few epochs, while the predicted edge-maps are crisp and of high quality. Additionally, we propose a new dataset to test the generalization of edge detection, which comprises samples from popular images used in edge detection and image segmentation. The source code is available in https://github.com/xavysp/TEED.

Not So Robust After All: Evaluating the Robustness of Deep Neural Networks to Unseen Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2308.06467
  • repo_url: None
  • paper_authors: Roman Garaev, Bader Rasheed, Adil Khan
  • for: This study challenges the efficacy and generalization of contemporary defense mechanisms against adversarial attacks.
  • methods: It uses adversarial attacks to probe the hypothesis of Ilyas et al. that DNN image features are either robust or non-robust, with attacks targeting the latter.
  • results: Training DNNs on a dataset of only robust features does not universally yield models resistant to adversarial attacks; moreover, $L_2$ and $L_{\infty}$ norm attacks affect DNN representations differently, suggesting the danger of $L_{\infty}$ attacks has been underestimated by the research community.
    Abstract Deep neural networks (DNNs) have gained prominence in various applications, such as classification, recognition, and prediction, prompting increased scrutiny of their properties. A fundamental attribute of traditional DNNs is their vulnerability to modifications in input data, which has resulted in the investigation of adversarial attacks. These attacks manipulate the data in order to mislead a DNN. This study aims to challenge the efficacy and generalization of contemporary defense mechanisms against adversarial attacks. Specifically, we explore the hypothesis proposed by Ilyas et al., which posits that DNN image features can be either robust or non-robust, with adversarial attacks targeting the latter. This hypothesis suggests that training a DNN on a dataset consisting solely of robust features should produce a model resistant to adversarial attacks. However, our experiments demonstrate that this is not universally true. To gain further insights into our findings, we analyze the impact of adversarial attack norms on DNN representations, focusing on samples subjected to $L_2$ and $L_{\infty}$ norm attacks. Further, we employ canonical correlation analysis, visualize the representations, and calculate the mean distance between these representations and various DNN decision boundaries. Our results reveal a significant difference between $L_2$ and $L_{\infty}$ norms, which could provide insights into the potential dangers posed by $L_{\infty}$ norm attacks, previously underestimated by the research community.
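    Illustrative sketch (not from the paper): the two attack flavors the study compares, PGD under an L-inf budget (sign steps, box projection) versus an L2 budget (normalized steps, ball projection), in PyTorch:

```python
import torch

def pgd(model, x, y, eps, step, iters=20, norm="linf"):
    """PGD attack under an L-inf or L2 budget eps (NCHW inputs in [0, 1])."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(iters):
        (g,) = torch.autograd.grad(loss_fn(model(x + delta), y), delta)
        with torch.no_grad():
            if norm == "linf":
                delta += step * g.sign()
                delta.clamp_(-eps, eps)              # project onto the L-inf ball
            else:
                g_norm = g.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
                delta += step * g / g_norm           # normalized gradient step
                d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
                delta *= (eps / d_norm).clamp(max=1.0)  # project onto the L2 ball
    return (x + delta).clamp(0, 1).detach()
```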

A One-dimensional HEVC video steganalysis method using the Optimality of Predicted Motion Vectors

  • paper_url: http://arxiv.org/abs/2308.06464
  • repo_url: None
  • paper_authors: Jun Li, Minqing Zhang, Ke Niu, Yingnan Zhang, Xiaoyuan Yang
  • for: Improving the detection of motion-vector-based video steganography in the HEVC standard.
  • methods: A one-dimensional steganalysis feature is extracted from the local optimality of predicted motion vectors (MVPs).
  • results: Detection experiments on two general datasets against three popular steganography methods, compared with four state-of-the-art steganalysis methods, show that the optimal rate of MVP is 100% for all cover videos and less than 100% for all stego videos, so cover and stego videos can be distinguished accurately, with no model training and low computational complexity in practical scenarios.
    Abstract Among steganalysis techniques, detection against motion vector (MV) domain-based video steganography in High Efficiency Video Coding (HEVC) standard remains a hot and challenging issue. For the purpose of improving the detection performance, this paper proposes a steganalysis feature based on the optimality of predicted MVs with a dimension of one. Firstly, we point out that the motion vector prediction (MVP) of the prediction unit (PU) encoded using the Advanced Motion Vector Prediction (AMVP) technique satisfies the local optimality in the cover video. Secondly, we analyze that in HEVC video, message embedding either using MVP index or motion vector differences (MVD) may destroy the above optimality of MVP. And then, we define the optimal rate of MVP in HEVC video as a steganalysis feature. Finally, we conduct steganalysis detection experiments on two general datasets for three popular steganography methods and compare the performance with four state-of-the-art steganalysis methods. The experimental results show that the proposed optimal rate of MVP for all cover videos is 100%, while the optimal rate of MVP for all stego videos is less than 100%. Therefore, the proposed steganalysis scheme can accurately distinguish between cover videos and stego videos, and it is efficiently applied to practical scenarios with no model training and low computational complexity.

Multi-Label Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2308.06453
  • repo_url: https://github.com/penghui-yang/l2d
  • paper_authors: Penghui Yang, Ming-Kun Xie, Chen-Chen Zong, Lei Feng, Gang Niu, Masashi Sugiyama, Sheng-Jun Huang
  • for: This work addresses knowledge distillation for multi-label learning, where conventional distillation methods are hard to apply because prediction probabilities do not sum to one and whole-example feature maps may ignore minor classes.
  • methods: A novel multi-label knowledge distillation method divides the multi-label learning problem into a set of binary classification problems to exploit the semantic knowledge in the logits, and leverages the structural information of label-wise embeddings to enhance the distinctiveness of the learned feature representations.
  • results: Experiments on multiple benchmark datasets show that the proposed method avoids knowledge counteraction among labels and achieves superior performance against diverse comparing methods.
    Abstract Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning. However, these methods can hardly be extended to the multi-label learning scenario, where each instance is associated with multiple semantic labels, because the prediction probabilities do not sum to one and feature maps of the whole example may ignore minor classes in such a scenario. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels, thus achieving superior performance against diverse comparing methods. Our code is available at: https://github.com/penghui-yang/L2D
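    Illustrative sketch (not from the paper): one way to render the per-label binary decomposition is a binary KL divergence between teacher and student sigmoid probabilities for each label; the paper's label-embedding structural term is omitted:

```python
import torch

def multilabel_binary_kd(student_logits, teacher_logits, T=2.0):
    """Per-label binary distillation: KL(teacher || student) for each label.

    Both inputs are (batch, n_labels) logits; each label is treated as an
    independent binary classification problem.
    """
    p_t = torch.sigmoid(teacher_logits / T)
    p_s = torch.sigmoid(student_logits / T)
    eps = 1e-7
    kl = p_t * torch.log((p_t + eps) / (p_s + eps)) + \
         (1 - p_t) * torch.log((1 - p_t + eps) / (1 - p_s + eps))
    return (T * T) * kl.mean()

s = torch.randn(16, 20, requires_grad=True)
t = torch.randn(16, 20)
multilabel_binary_kd(s, t).backward()
```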

Latent Random Steps as Relaxations of Max-Cut, Min-Cut, and More

  • paper_url: http://arxiv.org/abs/2308.06448
  • repo_url: None
  • paper_authors: Sudhanshu Chanpuriya, Cameron Musco
  • for: This work proposes a probabilistic model based on non-negative matrix factorization that unifies clustering and graph simplification.
  • methods: The model factorizes the process of taking a random walk on the graph, permits an unconstrained parametrization, and is optimized via simple gradient descent.
  • results: By relaxing hard clustering to soft clustering, the algorithm turns potentially hard clustering problems into tractable ones, demonstrated on a synthetic graph and on simple unsupervised tasks involving bipartite and tripartite clustering of orthographic and phonological data.
    Abstract Algorithms for node clustering typically focus on finding homophilous structure in graphs. That is, they find sets of similar nodes with many edges within, rather than across, the clusters. However, graphs often also exhibit heterophilous structure, as exemplified by (nearly) bipartite and tripartite graphs, where most edges occur across the clusters. Grappling with such structure is typically left to the task of graph simplification. We present a probabilistic model based on non-negative matrix factorization which unifies clustering and simplification, and provides a framework for modeling arbitrary graph structure. Our model is based on factorizing the process of taking a random walk on the graph. It permits an unconstrained parametrization, allowing for optimization via simple gradient descent. By relaxing the hard clustering to a soft clustering, our algorithm relaxes potentially hard clustering problems to tractable ones. We illustrate our algorithm's capabilities on a synthetic graph, as well as simple unsupervised learning tasks involving bipartite and tripartite clustering of orthographic and phonological data.
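    Illustrative sketch (not from the paper): the core idea of factorizing a random walk, here NMF applied to the one-step transition matrix of a small bipartite graph with scikit-learn; the paper's own parametrization and gradient-descent optimization differ:

```python
import numpy as np
from sklearn.decomposition import NMF

# Adjacency of a small bipartite graph: all edges cross between the two sides
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)  # one-step random-walk transition matrix

nmf = NMF(n_components=2, init="nndsvda", max_iter=1000)
W = nmf.fit_transform(P)              # soft "from-cluster" memberships
H = nmf.components_                   # soft "to-cluster" profiles
print(W.round(2))
```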

A Sequential Meta-Transfer (SMT) Learning to Combat Complexities of Physics-Informed Neural Networks: Application to Composites Autoclave Processing

  • paper_url: http://arxiv.org/abs/2308.06447
  • repo_url: https://github.com/miladramzy/sequentialmetatransferpinns
  • paper_authors: Milad Ramezankhani, Abbas S. Milani
  • for: Fast and accurate solution of nonlinear partial differential equations (PDEs) for scientific and engineering applications.
  • methods: Builds on Physics-Informed Neural Networks (PINNs), which integrate physical laws into network training; the proposed Sequential Meta-Transfer (SMT) framework decomposes the PDE's time domain into smaller segments, trains a meta-learner per segment, and applies transfer learning across segments.
  • results: In a composites autoclave processing case study, SMT markedly enhances the adaptability of PINNs to changes in the nonlinear system while reducing computational cost by a factor of 100.
    Abstract Physics-Informed Neural Networks (PINNs) have gained popularity in solving nonlinear partial differential equations (PDEs) via integrating physical laws into the training of neural networks, making them superior in many scientific and engineering applications. However, conventional PINNs still fall short in accurately approximating the solution of complex systems with strong nonlinearity, especially in long temporal domains. Besides, since PINNs are designed to approximate a specific realization of a given PDE system, they lack the necessary generalizability to efficiently adapt to new system configurations. This entails computationally expensive re-training from scratch for any new change in the system. To address these shortfalls, in this work a novel sequential meta-transfer (SMT) learning framework is proposed, offering a unified solution for both fast training and efficient adaptation of PINNs in highly nonlinear systems with long temporal domains. Specifically, the framework decomposes PDE's time domain into smaller time segments to create "easier" PDE problems for PINNs training. Then for each time interval, a meta-learner is assigned and trained to achieve an optimal initial state for rapid adaptation to a range of related tasks. Transfer learning principles are then leveraged across time intervals to further reduce the computational cost. Through a composites autoclave processing case study, it is shown that SMT is clearly able to enhance the adaptability of PINNs while significantly reducing computational cost, by a factor of 100.

Neural Latent Aligner: Cross-trial Alignment for Learning Representations of Complex, Naturalistic Neural Data

  • paper_url: http://arxiv.org/abs/2308.06443
  • repo_url: None
  • paper_authors: Cheol Jun Cho, Edward F. Chang, Gopala K. Anumanchipalli
  • for: This study aims at understanding the neural implementation of complex, naturalistic human behaviors by finding well-constrained, behaviorally relevant neural representations.
  • methods: It proposes Neural Latent Aligner (NLA), an unsupervised framework that aligns representations across repeated trials to learn cross-trial consistent information, together with a fully differentiable time warping model (TWM) that resolves the temporal misalignment of trials.
  • results: Applied to intracranial ECoG of natural speaking, the model learns better representations for decoding behavior than baseline models, especially in lower-dimensional space; the TWM is empirically validated by behavioral coherence between aligned trials, and the visualized manifold reveals shared neural trajectories across trials.
    Abstract Understanding the neural implementation of complex human behaviors is one of the major goals in neuroscience. To this end, it is crucial to find a true representation of the neural data, which is challenging due to the high complexity of behaviors and the low signal-to-ratio (SNR) of the signals. Here, we propose a novel unsupervised learning framework, Neural Latent Aligner (NLA), to find well-constrained, behaviorally relevant neural representations of complex behaviors. The key idea is to align representations across repeated trials to learn cross-trial consistent information. Furthermore, we propose a novel, fully differentiable time warping model (TWM) to resolve the temporal misalignment of trials. When applied to intracranial electrocorticography (ECoG) of natural speaking, our model learns better representations for decoding behaviors than the baseline models, especially in lower dimensional space. The TWM is empirically validated by measuring behavioral coherence between aligned trials. The proposed framework learns more cross-trial consistent representations than the baselines, and when visualized, the manifold reveals shared neural trajectories across trials.

A Domain-adaptive Physics-informed Neural Network for Inverse Problems of Maxwell’s Equations in Heterogeneous Media

  • paper_url: http://arxiv.org/abs/2308.06436
  • repo_url: None
  • paper_authors: Shiyuan Piao, Hong Gu, Aina Wang, Pan Qin
  • for: Solving inverse problems of Maxwell's equations in heterogeneous media.
  • methods: Uses a physics-informed neural network (PINN) with a media-interface location parameter that decomposes the domain into sub-domains, electromagnetic interface conditions incorporated into the loss function, and a domain-adaptive training strategy.
  • results: The proposed domain-adaptive PINN (da-PINN) is shown to be effective in two case studies.
    Abstract Maxwell's equations are a collection of coupled partial differential equations (PDEs) that, together with the Lorentz force law, constitute the basis of classical electromagnetism and electric circuits. Effectively solving Maxwell's equations is crucial in various fields, like electromagnetic scattering and antenna design optimization. Physics-informed neural networks (PINNs) have shown powerful ability in solving PDEs. However, PINNs still struggle to solve Maxwell's equations in heterogeneous media. To this end, we propose a domain-adaptive PINN (da-PINN) to solve inverse problems of Maxwell's equations in heterogeneous media. First, we propose a location parameter of media interface to decompose the whole domain into several sub-domains. Furthermore, the electromagnetic interface conditions are incorporated into a loss function to improve the prediction performance near the interface. Then, we propose a domain-adaptive training strategy for da-PINN. Finally, the effectiveness of da-PINN is verified with two case studies.
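    Illustrative sketch (not from the paper): the flavor of incorporating interface conditions into the loss, with one network per sub-domain and a continuity penalty at the media interface (1D toy in PyTorch; the Maxwell residual terms are placeholders):

```python
import torch
import torch.nn as nn

def mlp():
    return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

net_left, net_right = mlp(), mlp()  # one sub-domain network per medium
x_iface = torch.tensor([[0.5]])     # media-interface location parameter

def interface_loss():
    # Penalize field discontinuity across the interface (continuity condition)
    return ((net_left(x_iface) - net_right(x_iface)) ** 2).mean()

# Total loss = per-sub-domain PDE residuals (placeholders here) + interface term
pde_left, pde_right = torch.tensor(0.0), torch.tensor(0.0)
loss = pde_left + pde_right + 10.0 * interface_loss()
loss.backward()
```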

Learn Single-horizon Disease Evolution for Predictive Generation of Post-therapeutic Neovascular Age-related Macular Degeneration

  • paper_url: http://arxiv.org/abs/2308.06432
  • repo_url: None
  • paper_authors: Yuhan Zhang, Kun Huang, Mingchao Li, Songtao Yuan, Qiang Chen
  • for: Predicting the evolution of neovascular age-related macular degeneration (nAMD) by generating post-therapeutic SD-OCT images from pre-therapeutic ones.
  • methods: Proposes a single-horizon disease evolution network (SHENet) consisting of a feature encoder, a graph evolution module, and a feature decoder, with an evolution reinforcement module and adversarial training to ensure effective disease-evolution learning and realistic images.
  • results: Validated on 383 SD-OCT cubes from 22 nAMD patients under three well-designed schemes, SHENet produces generated SD-OCT images with the highest image quality among compared generative methods, the best structure protection and content prediction, and a better visual effect in qualitative evaluations.
    Abstract Most of the existing disease prediction methods in the field of medical image processing fall into two classes, namely image-to-category predictions and image-to-parameter predictions. Few works have focused on image-to-image predictions. Different from multi-horizon predictions in other fields, ophthalmologists prefer to show more confidence in single-horizon predictions due to the low tolerance of predictive risk. We propose a single-horizon disease evolution network (SHENet) to predictively generate post-therapeutic SD-OCT images by inputting pre-therapeutic SD-OCT images with neovascular age-related macular degeneration (nAMD). In SHENet, a feature encoder converts the input SD-OCT images to deep features, then a graph evolution module predicts the process of disease evolution in high-dimensional latent space and outputs the predicted deep features, and lastly, feature decoder recovers the predicted deep features to SD-OCT images. We further propose an evolution reinforcement module to ensure the effectiveness of disease evolution learning and obtain realistic SD-OCT images by adversarial training. SHENet is validated on 383 SD-OCT cubes of 22 nAMD patients based on three well-designed schemes based on the quantitative and qualitative evaluations. Compared with other generative methods, the generative SD-OCT images of SHENet have the highest image quality. Besides, SHENet achieves the best structure protection and content prediction. Qualitative evaluations also demonstrate that SHENet has a better visual effect than other methods. SHENet can generate post-therapeutic SD-OCT images with both high prediction performance and good image quality, which has great potential to help ophthalmologists forecast the therapeutic effect of nAMD.

Genetic heterogeneity analysis using genetic algorithm and network science

  • paper_url: http://arxiv.org/abs/2308.06429
  • repo_url: None
  • paper_authors: Zhendong Sha, Yuanzhu Chen, Ting Hu
  • for: Identifying disease-susceptibility genetic variants through genome-wide association studies (GWAS)
  • methods: Proposes a novel feature selection mechanism, the Feature Co-selection Network (FCS-Net), to extract heterogeneous subsets of genetic variants; FCS-Net combines a genetic algorithm (GA) with a non-linear machine learning algorithm to detect feature interactions
  • results: Experiments show that FCS-Net effectively detects feature interactions and extracts new synthetic features from a case-control colorectal cancer GWAS dataset; these synthetic features can be used to explain the genetic heterogeneity of colorectal cancer
    Abstract Through genome-wide association studies (GWAS), disease-susceptible genetic variables can be identified by comparing the genetic data of individuals with and without a specific disease. However, the discovery of these associations poses a significant challenge due to genetic heterogeneity and feature interactions. Genetic variables intertwined with these effects often exhibit lower effect sizes, and thus can be difficult to detect using machine learning feature selection methods. To address these challenges, this paper introduces a novel feature selection mechanism for GWAS, named Feature Co-selection Network (FCS-Net). FCS-Net is designed to extract heterogeneous subsets of genetic variables from a network constructed from multiple independent feature selection runs based on a genetic algorithm (GA), an evolutionary learning algorithm. We employ a non-linear machine learning algorithm to detect feature interaction. We introduce the Community Risk Score (CRS), a synthetic feature designed to quantify the collective disease association of each variable subset. Our experiment showcases the effectiveness of the utilized GA-based feature selection method in identifying feature interactions through synthetic data analysis. Furthermore, we apply our novel approach to a case-control colorectal cancer GWAS dataset. The resulting synthetic features are then used to explain the genetic heterogeneity in an additional case-only GWAS dataset.
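As a rough illustration of the co-selection idea, the sketch below builds a network from multiple independent feature-selection runs by connecting features that are repeatedly selected together. The thresholded co-occurrence graph is a simplified stand-in for the paper's construction, and the GA runs themselves are assumed given.

```python
import numpy as np

def co_selection_graph(runs, n_features, min_count=2):
    """Adjacency of a feature co-selection network.

    runs: list of selected-feature index lists from independent GA-based
    feature-selection runs. Two features are connected if they were
    co-selected in at least `min_count` runs.
    """
    A = np.zeros((n_features, n_features), dtype=int)
    for sel in runs:
        for i in sel:
            for j in sel:
                if i != j:
                    A[i, j] += 1          # count co-selections
    return A >= min_count                 # boolean adjacency matrix

# Toy example: features 0, 3, 7 tend to be co-selected.
runs = [[0, 3, 7], [0, 3, 9], [2, 5], [0, 7]]
print(co_selection_graph(runs, n_features=10).sum())
```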

Multiclass Learnability Does Not Imply Sample Compression

  • paper_url: http://arxiv.org/abs/2308.06424
  • repo_url: None
  • paper_authors: Chirag Pabbaraju
  • for: Studies the problem of sample compression: for every sample labeled by a hypothesis from a class, whether one can retain only a small subsample from which the labels on the entire sample can be inferred
  • methods: Uses the VC dimension, a measure of the complexity of a binary hypothesis class, and the DS dimension, its analog for multiclass hypothesis classes
  • results: Every learnable binary hypothesis class admits a sample compression scheme whose size is only a finite function of its VC dimension; in contrast, learnable multiclass hypothesis classes do not admit compression schemes whose size depends only on the DS dimension
    Abstract A hypothesis class admits a sample compression scheme, if for every sample labeled by a hypothesis from the class, it is possible to retain only a small subsample, using which the labels on the entire sample can be inferred. The size of the compression scheme is an upper bound on the size of the subsample produced. Every learnable binary hypothesis class (which must necessarily have finite VC dimension) admits a sample compression scheme of size only a finite function of its VC dimension, independent of the sample size. For multiclass hypothesis classes, the analog of VC dimension is the DS dimension. We show that the analogous statement pertaining to sample compression is not true for multiclass hypothesis classes: every learnable multiclass hypothesis class, which must necessarily have finite DS dimension, does not admit a sample compression scheme of size only a finite function of its DS dimension.
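For reference, the notion of a sample compression scheme used here can be written out as follows (a standard formulation consistent with the abstract, not the paper's verbatim definition):

```latex
% A compression scheme of size k for a hypothesis class H is a pair
% (\kappa, \rho): for every sample S = ((x_1, h(x_1)), \ldots, (x_n, h(x_n)))
% realizable by some h \in H, the compression map keeps a subsample
% \kappa(S) \subseteq S with |\kappa(S)| \le k, and the reconstruction map
% \rho recovers all labels from the subsample alone:
\[
  \rho\bigl(\kappa(S)\bigr)(x_i) = h(x_i) \qquad \text{for all } i = 1, \ldots, n.
\]
```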

Sensitivity-Aware Mixed-Precision Quantization and Width Optimization of Deep Neural Networks Through Cluster-Based Tree-Structured Parzen Estimation

  • paper_url: http://arxiv.org/abs/2308.06422
  • repo_url: None
  • paper_authors: Seyedarmin Azizi, Mahdi Nazemi, Arash Fayyazi, Massoud Pedram
  • for: Improving the efficiency of deep neural networks by automatically selecting the best bit-width and layer-width for each layer
  • methods: Uses Hessian-based pruning to shrink the search space and a cluster-based tree-structured Parzen estimator to develop surrogate models
  • results: Rigorous tests on well-known datasets show a 20% reduction in model size and a 12x reduction in search time compared with existing methods, a significant advance in deep learning model design optimization
    Abstract As the complexity and computational demands of deep learning models rise, the need for effective optimization methods for neural network designs becomes paramount. This work introduces an innovative search mechanism for automatically selecting the best bit-width and layer-width for individual neural network layers. This leads to a marked enhancement in deep neural network efficiency. The search domain is strategically reduced by leveraging Hessian-based pruning, ensuring the removal of non-crucial parameters. Subsequently, we detail the development of surrogate models for favorable and unfavorable outcomes by employing a cluster-based tree-structured Parzen estimator. This strategy allows for a streamlined exploration of architectural possibilities and swift pinpointing of top-performing designs. Through rigorous testing on well-known datasets, our method proves its distinct advantage over existing methods. Compared to leading compression strategies, our approach records an impressive 20% decrease in model size without compromising accuracy. Additionally, our method boasts a 12x reduction in search time relative to the best search-focused strategies currently available. As a result, our proposed method represents a leap forward in neural network design optimization, paving the way for quick model design and implementation in settings with limited resources, thereby propelling the potential of scalable deep learning solutions.
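The search component can be sketched with an off-the-shelf tree-structured Parzen estimator. Below, Optuna's TPESampler picks per-layer bit-widths and width multipliers against a placeholder objective; the paper's Hessian-based pruning, clustering, and real accuracy/size evaluation would replace the toy objective.

```python
import optuna

N_LAYERS = 4  # illustrative network depth

def objective(trial):
    # TPE proposes a bit-width and a width multiplier per layer.
    bits = [trial.suggest_categorical(f"bits_{i}", [2, 4, 8])
            for i in range(N_LAYERS)]
    width = [trial.suggest_float(f"width_{i}", 0.25, 1.0)
             for i in range(N_LAYERS)]
    # Placeholder: trade a crude size penalty against a capacity proxy.
    # In practice this would be measured accuracy of the quantized model.
    size_penalty = sum(b * w for b, w in zip(bits, width))
    return sum(width) - 0.01 * size_penalty

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```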

Pedestrian Trajectory Prediction in Pedestrian-Vehicle Mixed Environments: A Systematic Review

  • paper_url: http://arxiv.org/abs/2308.06419
  • repo_url: None
  • paper_authors: Mahsa Golchoubian, Moojan Ghafurian, Kerstin Dautenhahn, Nasser Lashgarian Azad
  • for: Supporting trajectory planning for autonomous vehicles (AVs) in spaces shared with pedestrians
  • methods: Systematically reviews methods proposed in the literature for modelling pedestrian trajectory prediction in the presence of vehicles, applicable to unstructured environments
  • results: Examines specific considerations for pedestrian-vehicle interaction (as compared with pedestrian-pedestrian interaction) and reviews how variables such as prediction uncertainty and behavioural differences are accounted for in previously proposed prediction models
    Abstract Planning an autonomous vehicle's (AV) path in a space shared with pedestrians requires reasoning about pedestrians' future trajectories. A practical pedestrian trajectory prediction algorithm for the use of AVs needs to consider the effect of the vehicle's interactions with the pedestrians on pedestrians' future motion behaviours. In this regard, this paper systematically reviews different methods proposed in the literature for modelling pedestrian trajectory prediction in presence of vehicles that can be applied for unstructured environments. This paper also investigates specific considerations for pedestrian-vehicle interaction (compared with pedestrian-pedestrian interaction) and reviews how different variables such as prediction uncertainties and behavioural differences are accounted for in the previously proposed prediction models. PRISMA guidelines were followed. Articles that did not consider vehicle and pedestrian interactions or actual trajectories, and articles that only focused on road crossing were excluded. A total of 1260 unique peer-reviewed articles from ACM Digital Library, IEEE Xplore, and Scopus databases were identified in the search. 64 articles were included in the final review as they met the inclusion and exclusion criteria. An overview of datasets containing trajectory data of both pedestrians and vehicles used by the reviewed papers has been provided. Research gaps and directions for future work, such as having more effective definition of interacting agents in deep learning methods and the need for gathering more datasets of mixed traffic in unstructured environments are discussed.

Learning Bayesian Networks with Heterogeneous Agronomic Data Sets via Mixed-Effect Models and Hierarchical Clustering

  • paper_url: http://arxiv.org/abs/2308.06399
  • repo_url: None
  • paper_authors: Lorenzo Vallegi, Marco Scutari, Federico Mattia Stefanini
  • for: Researchers and practitioners who work with complex data sets in various fields, particularly agronomic studies; the paper provides a novel approach for modelling causal relationships using Bayesian networks (BNs) and demonstrates its effectiveness on hierarchical data
  • methods: Integrates random effects into BN learning; the approach is rooted in linear mixed-effects models and uses directed acyclic graphs to illustrate the connections between variables, handling complex networks of causal relationships
  • results: Employing the approach enhances structural learning, leading to the discovery of new connections and improved model specification, and reduces prediction errors from 28% to 17%, suggesting it is effective for complex, hierarchical data sets
    Abstract Research involving diverse but related data sets, where associations between covariates and outcomes may vary, is prevalent in various fields including agronomic studies. In these scenarios, hierarchical models, also known as multilevel models, are frequently employed to assimilate information from different data sets while accommodating their distinct characteristics. However, their structure extends beyond simple heterogeneity, as variables often form complex networks of causal relationships. Bayesian networks (BNs) provide a powerful framework for modelling such relationships using directed acyclic graphs to illustrate the connections between variables. This study introduces a novel approach that integrates random effects into BN learning. Rooted in linear mixed-effects models, this approach is particularly well-suited for handling hierarchical data. Results from a real-world agronomic trial suggest that employing this approach enhances structural learning, leading to the discovery of new connections and improved model specification. Furthermore, we observe a reduction in prediction errors from 28\% to 17\%. By extending the applicability of BNs to complex data set structures, this approach contributes to the effective utilisation of BNs for hierarchical agronomic data. This, in turn, enhances their value as decision-support tools in the field.
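A minimal sketch of the two-stage intuition, under the assumption that a random intercept per group absorbs the hierarchical structure before any network-structure scoring; the variable names and the residual-based second stage are illustrative, not the paper's exact algorithm.

```python
import statsmodels.formula.api as smf

def mixed_effect_residuals(df, response, predictors, group):
    """Fit response ~ predictors with a random intercept per group
    (linear mixed-effects model) and return residuals with the group-level
    effects absorbed, ready for downstream BN structure learning."""
    formula = f"{response} ~ " + " + ".join(predictors)
    result = smf.mixedlm(formula, df, groups=df[group]).fit()
    return df[response] - result.fittedvalues

# Hypothetical usage on an agronomic trial data frame:
# resid = mixed_effect_residuals(trial_df, "yield", ["nitrogen", "water"], "field")
```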

Detecting and Preventing Hallucinations in Large Vision Language Models

  • paper_url: http://arxiv.org/abs/2308.06394
  • repo_url: None
  • paper_authors: Anisha Gunjal, Jihan Yin, Erhan Bas
  • for: Detecting and preventing hallucinations in instruction-tuned large vision-language models (LVLMs) on multimodal tasks, particularly visual question answering (VQA)
  • methods: Optimizes the InstructBLIP model with a novel Fine-grained Direct Preference Optimization (FDPO) and fine-grained multimodal reward models to avoid hallucination
  • results: FDPO and rejection sampling reduce InstructBLIP's hallucination rate by 41% and 55%, respectively; the reward model generalizes to other multimodal models, reducing hallucination rates in LLaVA and mPLUG-OWL by 15% and 57%
    Abstract Instruction tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded is still a challenging task for these models. We find that even the current state-of-the-art LVLMs (InstructBLIP) still contain a staggering 30 percent of the hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a (M)ultimodal (Hal)lucination (Detect)ion Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that only consider object hallucination, we additionally annotate both entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling. We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively, and has strong correlation with human evaluated accuracy scores.
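The fine-grained preference objective can be sketched as a DPO-style loss applied at the segment level; the exact segment weighting in the paper's FDPO is not reproduced here, so treat this as a hedged approximation.

```python
import torch.nn.functional as F

def fdpo_style_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO-style objective on fine-grained (segment-level) log-probs.

    logp_* are summed token log-probabilities of preferred (accurate) and
    dispreferred (hallucinated) segments under the policy; ref_logp_* are
    the same quantities under a frozen reference model. This mirrors the
    standard DPO loss; applying it per annotated segment is an assumption.
    """
    pos_ratio = logp_pos - ref_logp_pos   # log pi/pi_ref for preferred
    neg_ratio = logp_neg - ref_logp_neg   # log pi/pi_ref for dispreferred
    return -F.logsigmoid(beta * (pos_ratio - neg_ratio)).mean()
```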

Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

  • paper_url: http://arxiv.org/abs/2308.06382
  • repo_url: https://github.com/PhonemeHallucinator/Phoneme_Hallucinator
  • paper_authors: Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva
  • for: Resolving the dilemma in existing voice conversion (VC) methods between content intelligibility and speaker similarity
  • methods: Proposes a new one-shot VC model, the Phoneme Hallucinator, which hallucinates diversified and high-fidelity target-speaker phonemes from a short target-speaker recording (e.g., 3 seconds)
  • results: Outperforms existing VC methods on both intelligibility and speaker similarity
    Abstract Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method \textit{Phoneme Hallucinator} that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g. 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Objective and subjective evaluations show that \textit{Phoneme Hallucinator} outperforms existing VC methods for both intelligibility and speaker similarity.
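The neighbor-based conversion step can be sketched as a k-nearest-neighbor regression over frame features, assuming the hallucination model has already expanded the short target recording into a large set of target-speaker features; the plain k-NN averaging below is a generic stand-in for the paper's matching procedure.

```python
import numpy as np

def knn_voice_conversion(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    source_feats: (n_src, d) frame features of the source utterance.
    target_feats: (n_tgt, d) target-speaker features, assumed to include
    the hallucinated (set-expanded) phoneme features.
    """
    # Pairwise squared distances, shape (n_src, n_tgt).
    d = ((source_feats[:, None, :] - target_feats[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]        # k nearest target frames
    return target_feats[idx].mean(axis=1)     # converted frame features
```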

DCNFIS: Deep Convolutional Neuro-Fuzzy Inference System

  • paper_url: http://arxiv.org/abs/2308.06378
  • repo_url: None
  • paper_authors: Mojtaba Yeganejou, Kimia Honari, Ryan Kluzinski, Scott Dick, Michael Lipsett, James Miller
  • for: Improving the transparency of artificial intelligence so that humans can directly understand how an algorithm works, rather than receiving only post-hoc explanations
  • methods: Designs a deep convolutional neuro-fuzzy inference system (DCNFIS) by hybridizing fuzzy logic and deep learning models, improving transparency without sacrificing accuracy
  • results: DCNFIS performs as accurately as three existing convolutional neural networks on four well-known datasets, outperforms state-of-the-art deep fuzzy systems, and provides explanations in the form of saliency maps derived from its fuzzy rules
    Abstract A key challenge in eXplainable Artificial Intelligence is the well-known tradeoff between the transparency of an algorithm (i.e., how easily a human can directly understand the algorithm, as opposed to receiving a post-hoc explanation), and its accuracy. We report on the design of a new deep network that achieves improved transparency without sacrificing accuracy. We design a deep convolutional neuro-fuzzy inference system (DCNFIS) by hybridizing fuzzy logic and deep learning models and show that DCNFIS performs as accurately as three existing convolutional neural networks on four well-known datasets. We furthermore show that DCNFIS outperforms state-of-the-art deep fuzzy systems. We then exploit the transparency of fuzzy logic by deriving explanations, in the form of saliency maps, from the fuzzy rules encoded in DCNFIS. We investigate the properties of these explanations in greater depth using the Fashion-MNIST dataset.
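A minimal neuro-fuzzy head of the kind DCNFIS places on top of convolutional features might look as follows; the Gaussian memberships, product t-norm, and softmax normalization are standard choices assumed here, not necessarily the paper's exact layer.

```python
import torch
import torch.nn as nn

class FuzzyRuleLayer(nn.Module):
    """Gaussian membership functions with product t-norm rule firing."""
    def __init__(self, in_dim, n_rules, n_classes):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_rules, in_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_rules, in_dim))
        self.rule_out = nn.Linear(n_rules, n_classes)

    def forward(self, x):                       # x: (batch, in_dim)
        diff = x[:, None, :] - self.centers     # (batch, rules, in_dim)
        # Log of the product of Gaussian memberships = sum of log-memberships.
        log_fire = -0.5 * (diff / self.log_sigma.exp()).pow(2).sum(-1)
        firing = torch.softmax(log_fire, dim=1)  # normalized rule strengths
        return self.rule_out(firing)             # class scores from rules
```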

UAMM: UBET Automated Market Maker

  • paper_url: http://arxiv.org/abs/2308.06375
  • repo_url: None
  • paper_authors: Daniel Jiwoong Im, Alexander Kondratskiy, Vincent Harvey, Hsuan-Wei Fu
  • for: Automated market makers (AMMs), the pricing mechanisms used by decentralized exchanges (DEXs)
  • methods: Proposes a new pricing approach, the UBET AMM (UAMM), which computes prices by taking into account external market prices and the impermanent loss of the liquidity pool
  • results: Shows that the method eliminates arbitrage opportunities when external market prices are efficient
    Abstract Automated market makers (AMMs) are pricing mechanisms utilized by decentralized exchanges (DEX). Traditional AMM approaches are constrained by pricing solely based on their own liquidity pool, without consideration of external markets or risk management for liquidity providers. In this paper, we propose a new approach known as UBET AMM (UAMM), which calculates prices by considering external market prices and the impermanent loss of the liquidity pool. Despite relying on external market prices, our method maintains the desired properties of a constant product curve when computing slippages. The key element of UAMM is determining the appropriate slippage amount based on the desired target balance, which encourages the liquidity pool to minimize impermanent loss. We demonstrate that our approach eliminates arbitrage opportunities when external market prices are efficient.
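To make the pricing setting concrete, here is a toy constant-product quote plus a comparison against an external market price; the comparison is a simplified stand-in for UAMM's slippage rule, which additionally steers the pool toward a target balance.

```python
def constant_product_quote(x_reserve, y_reserve, dx):
    """Amount of Y received for dx of X on a constant-product (x*y = k) curve."""
    k = x_reserve * y_reserve
    return y_reserve - k / (x_reserve + dx)

def quote_vs_external(x_reserve, y_reserve, dx, external_price):
    """Gap between the pool's quote and the external market price for the
    same trade, the kind of signal a UAMM-style rule can use when setting
    slippage (this comparison step is a simplified stand-in)."""
    dy = constant_product_quote(x_reserve, y_reserve, dx)
    return external_price * dx - dy   # > 0 means the pool pays less than market

# Example: a 1000 X / 1000 Y pool quoting a 10-X trade vs. external price 1.0.
print(quote_vs_external(1000.0, 1000.0, 10.0, 1.0))
```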

Topic-Level Bayesian Surprise and Serendipity for Recommender Systems

  • paper_url: http://arxiv.org/abs/2308.06368
  • repo_url: https://github.com/ton-moy/surprise-and-serendipity
  • paper_authors: Tonmoy Hasan, Razvan Bunescu
  • for: Mitigating the filter-bubble problem in personalized recommender systems by recommending surprising items that are likely to be highly rated
  • methods: Uses Bayesian surprise to quantify the unexpectedness of items after they are consumed and rated, combined with a collaborative-filtering component that identifies similar users, to find items with high potential for serendipity
  • results: Models that use Bayesian surprise correlate much better with manual annotations of topic-level surprise than distance-based heuristics, and also achieve better serendipitous recommendation performance
    Abstract A recommender system that optimizes its recommendations solely to fit a user's history of ratings for consumed items can create a filter bubble, wherein the user does not get to experience items from novel, unseen categories. One approach to mitigate this undesired behavior is to recommend items with high potential for serendipity, namely surprising items that are likely to be highly rated. In this paper, we propose a content-based formulation of serendipity that is rooted in Bayesian surprise and use it to measure the serendipity of items after they are consumed and rated by the user. When coupled with a collaborative-filtering component that identifies similar users, this enables recommending items with high potential for serendipity. To facilitate the evaluation of topic-level models for surprise and serendipity, we introduce a dataset of book reading histories extracted from Goodreads, containing over 26 thousand users and close to 1.3 million books, where we manually annotate 449 books read by 4 users in terms of their time-dependent, topic-level surprise. Experimental evaluations show that models that use Bayesian surprise correlate much better with the manual annotations of topic-level surprise than distance-based heuristics, and also obtain better serendipitous item recommendation performance.
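Bayesian surprise is the KL divergence from a prior belief to the posterior after observing an item. As an illustration, modelling a user's topic-level beliefs as a Dirichlet and updating it with an item's topic counts gives the closed form below; the Dirichlet choice is an assumption consistent with the abstract, not the paper's exact model.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha_post, alpha_prior):
    """KL(Dir(alpha_post) || Dir(alpha_prior)), i.e., Bayesian surprise."""
    a = np.asarray(alpha_post, dtype=float)
    b = np.asarray(alpha_prior, dtype=float)
    return (gammaln(a.sum()) - gammaln(b.sum())
            - (gammaln(a) - gammaln(b)).sum()
            + ((a - b) * (digamma(a) - digamma(a.sum()))).sum())

# Example: surprise of a book whose topic counts shift a uniform belief.
prior = np.ones(5)                              # uniform over 5 topics
posterior = prior + np.array([0, 0, 3, 1, 0])   # item's topic counts
print(dirichlet_kl(posterior, prior))
```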

Learning Distributions via Monte-Carlo Marginalization

  • paper_url: http://arxiv.org/abs/2308.06352
  • repo_url: None
  • paper_authors: Chenqiu Zhao, Guanfang Dong, Anup Basu
  • for: Learning intractable distributions from their samples
  • methods: Approximates intractable distributions with a parametric model (e.g., a Gaussian Mixture Model) by minimizing the KL-divergence, using Monte-Carlo Marginalization and Kernel Density Estimation to address the computational complexity and keep the optimization differentiable
  • results: Proposes a fully differentiable distribution-learning method, validated on standard datasets and synthetic data; it can also replace variational inference in VAEs and generate better images
    Abstract We propose a novel method to learn intractable distributions from their samples. The main idea is to use a parametric distribution model, such as a Gaussian Mixture Model (GMM), to approximate intractable distributions by minimizing the KL-divergence. Based on this idea, there are two challenges that need to be addressed. First, the computational complexity of KL-divergence is unacceptable when the dimensions of distributions increases. The Monte-Carlo Marginalization (MCMarg) is proposed to address this issue. The second challenge is the differentiability of the optimization process, since the target distribution is intractable. We handle this problem by using Kernel Density Estimation (KDE). The proposed approach is a powerful tool to learn complex distributions and the entire process is differentiable. Thus, it can be a better substitute of the variational inference in variational auto-encoders (VAE). One strong evidence of the benefit of our method is that the distributions learned by the proposed approach can generate better images even based on a pre-trained VAE's decoder. Based on this point, we devise a distribution learning auto-encoder which is better than VAE under the same network architecture. Experiments on standard dataset and synthetic data demonstrate the efficiency of the proposed approach.
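A minimal version of the core idea, assuming 1-D data: fit a Gaussian mixture by gradient descent on the Monte-Carlo estimate of the forward KL (equivalently, the negative log-likelihood of the samples); the KDE smoothing step from the paper is omitted for brevity.

```python
import torch

def fit_gmm(samples, k=3, steps=2000, lr=0.05):
    """Fit a 1-D k-component GMM to samples via gradient descent."""
    mu = torch.randn(k, requires_grad=True)
    log_sigma = torch.zeros(k, requires_grad=True)
    logits = torch.zeros(k, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma, logits], lr=lr)
    x = samples[:, None]                            # (n, 1)
    for _ in range(steps):
        log_w = torch.log_softmax(logits, dim=0)    # mixture weights
        comp = torch.distributions.Normal(mu, log_sigma.exp())
        log_px = torch.logsumexp(log_w + comp.log_prob(x), dim=1)
        loss = -log_px.mean()       # MC estimate of KL up to a constant
        opt.zero_grad(); loss.backward(); opt.step()
    return mu, log_sigma.exp(), torch.softmax(logits, dim=0)

# Example: samples from a bimodal target distribution.
samples = torch.cat([torch.randn(500) - 2.0, torch.randn(500) + 2.0])
mu, sigma, w = fit_gmm(samples)
```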

Mirror Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.06342
  • repo_url: https://github.com/cran/DIMORA
  • paper_authors: Jaesung Tae
  • for: Proposing a method for generating discrete categorical data with diffusion models and providing a theoretical framework for adapting diffusion to constrained domains
  • methods: Inspired by the mirror Langevin algorithm, demonstrated in the context of simplex diffusion, with natural extensions to popular domains such as image and text generation
  • results: Enables high-quality generation of categorical data on constrained domains such as the simplex
    Abstract Diffusion models have successfully been applied to generative tasks in various continuous domains. However, applying diffusion to discrete categorical data remains a non-trivial task. Moreover, generation in continuous domains often requires clipping in practice, which motivates the need for a theoretical framework for adapting diffusion to constrained domains. Inspired by the mirror Langevin algorithm for the constrained sampling problem, in this theoretical report we propose Mirror Diffusion Models (MDMs). We demonstrate MDMs in the context of simplex diffusion and propose natural extensions to popular domains such as image and text generation.
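For intuition, the entropic mirror map for the probability simplex sends points to an unconstrained dual space where ordinary Gaussian noise can be applied, and the softmax maps back; this standard mirror-Langevin construction is shown here as an assumption about how MDMs handle the simplex.

```python
import numpy as np

def to_mirror(x, eps=1e-12):
    """Entropic mirror map: grad of sum(x_i log x_i) up to a constant."""
    return np.log(x + eps)

def from_mirror(y):
    """Inverse mirror map: numerically stable softmax back onto the simplex."""
    z = np.exp(y - y.max())
    return z / z.sum()

x = np.array([0.7, 0.2, 0.1])
y = to_mirror(x) + 0.1 * np.random.randn(3)   # one noisy step in dual space
print(from_mirror(y))                          # still a valid simplex point
```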

Size Lowerbounds for Deep Operator Networks

  • paper_url: http://arxiv.org/abs/2308.06338
  • repo_url: None
  • paper_authors: Anirbit Mukherjee, Amartya Roy
  • for: Establishing a first-of-its-kind data-dependent lower bound on the size of DeepONets required to reduce empirical error on noisy data
  • methods: Uses the Deep Operator Network (DeepONet) paradigm, which solves regression in infinite dimensions and hence families of PDEs in one shot
  • results: Shows that obtaining low training error on $n$ data points requires the common output dimension of the branch and trunk nets to scale as $\Omega\left(\sqrt{n}\right)$; experiments on an advection-diffusion-reaction PDE suggest that, at a fixed model size, the training set may need to grow quadratically with this common output dimension to keep lowering the training error
    Abstract Deep Operator Networks are an increasingly popular paradigm for solving regression in infinite dimensions and hence solve families of PDEs in one shot. In this work, we aim to establish a first-of-its-kind data-dependent lower bound on the size of DeepONets required for them to be able to reduce empirical error on noisy data. In particular, we show that for low training errors to be obtained on $n$ data points it is necessary that the common output dimension of the branch and the trunk net scale as $\Omega\left(\sqrt{n}\right)$. This inspires our experiments with DeepONets solving the advection-diffusion-reaction PDE, where we demonstrate the possibility that at a fixed model size, to leverage an increase in this common output dimension and obtain a monotonic lowering of the training error, the size of the training data might necessarily need to scale quadratically with it.
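For context, a minimal DeepONet looks as follows: the prediction is the dot product of branch and trunk embeddings over a common output dimension p, the quantity the paper's lower bound says must scale as Omega(sqrt(n)); the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Branch net encodes the input function sampled at m sensors; trunk net
    encodes the query location; the output is their inner product."""
    def __init__(self, m_sensors, p=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m_sensors, 128), nn.Tanh(),
                                    nn.Linear(128, p))
        self.trunk = nn.Sequential(nn.Linear(1, 128), nn.Tanh(),
                                   nn.Linear(128, p))

    def forward(self, u_sensors, y):     # u: (batch, m), y: (batch, 1)
        return (self.branch(u_sensors) * self.trunk(y)).sum(-1)
```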

Foundation Model is Efficient Multimodal Multitask Model Selector

  • paper_url: http://arxiv.org/abs/2308.06262
  • repo_url: https://github.com/opengvlab/multitask-model-selector
  • paper_authors: Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang, Yu Qiao, Ping Luo
  • for: Studying an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multimodal task without fine-tuning them
  • methods: Proposes an efficient multi-task model selector (EMMS) that uses large-scale foundation models to transform diverse downstream label formats into a unified noisy label embedding; transferability is estimated through a simple weighted linear regression, solved efficiently by an alternating minimization algorithm
  • results: Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic. Compared with the state-of-the-art LogME method enhanced with these label embeddings, EMMS achieves 9.0%, 26.3%, 20.1%, 54.8%, and 12.2% performance gains on image recognition, referring, captioning, visual question answering, and text question answering, with 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedups in wall-clock time. Code is available at https://github.com/OpenGVLab/Multitask-Model-Selector
    Abstract This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering. A brute-force approach is to fine-tune all models on all target datasets, bringing high computational costs. Although recently advanced approaches employed lightweight metrics to measure models' transferability, they often depend heavily on the prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform diverse label formats such as categories, texts, and bounding boxes of different downstream tasks into a unified noisy label embedding. EMMS can estimate a model's transferability through a simple weighted linear regression, which can be efficiently solved by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0\%, 26.3\%, 20.1\%, 54.8\%, and 12.2\% performance gains on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedup in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.

Predicting Resilience with Neural Networks

  • paper_url: http://arxiv.org/abs/2308.06309
  • repo_url: None
  • paper_authors: Karen da Mata, Priscila Silva, Lance Fiondella
  • for: This paper aims to propose and evaluate alternative neural network (NN) approaches to model and predict system performance, including negative and positive factors driving resilience, in order to quantify the impact of disruptive events and restorative activities.
  • methods: The paper proposes three alternative NN approaches, including Artificial Neural Networks, Recurrent Neural Networks, and Long-Short Term Memory (LSTM), to model and predict system performance.
  • results: The results show that NN models outperformed a classical statistical model on all goodness-of-fit measures, with LSTMs achieving an over 60% higher adjusted R squared and a 34-fold decrease in predictive error compared to the traditional method. These results suggest that NN models are both feasible and accurate for predicting resilience and may find practical use in many important domains.
    Abstract Resilience engineering studies the ability of a system to survive and recover from disruptive events, which finds applications in several domains. Most studies emphasize resilience metrics to quantify system performance, whereas recent studies propose statistical modeling approaches to project system recovery time after degradation. Moreover, past studies are either performed on data after recovering or limited to idealized trends. Therefore, this paper proposes three alternative neural network (NN) approaches including (i) Artificial Neural Networks, (ii) Recurrent Neural Networks, and (iii) Long-Short Term Memory (LSTM) to model and predict system performance, including negative and positive factors driving resilience to quantify the impact of disruptive events and restorative activities. Goodness-of-fit measures are computed to evaluate the models and compared with a classical statistical model, including mean squared error and adjusted R squared. Our results indicate that NN models outperformed the traditional model on all goodness-of-fit measures. More specifically, LSTMs achieved an over 60\% higher adjusted R squared, and decreased predictive error by 34-fold compared to the traditional method. These results suggest that NN models to predict resilience are both feasible and accurate and may find practical use in many important domains.
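A sketch of the LSTM variant, assuming a window of past performance plus disruption/restoration indicators as input features; the feature layout and layer sizes are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResilienceLSTM(nn.Module):
    """Map a window of past system performance and disruptive/restorative
    event indicators to the next performance value."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])     # predicted next performance
```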

FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

  • paper_url: http://arxiv.org/abs/2308.06248
  • repo_url: https://github.com/visinf/funnybirds
  • paper_authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth
  • for: This paper aims to address the challenge of evaluating the quality of explainable artificial intelligence (XAI) methods, which is an important problem in safety-critical domains where XAI is used.
  • methods: The paper proposes a novel synthetic vision dataset called FunnyBirds, as well as accompanying automatic evaluation protocols. The dataset allows for semantically meaningful image interventions, such as removing individual object parts, which enables the analysis of explanations on a part level and the estimation of ground-truth part importances.
  • results: The paper reports results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner. The results show that the proposed evaluation protocols can provide valuable insights into the quality of XAI methods and can help to identify areas for improvement.
    Abstract The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation protocols. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications. First, it enables analyzing explanations on a part level, which is closer to human comprehension than existing methods that evaluate on a pixel level. Second, by comparing the model output for inputs with removed parts, we can estimate ground-truth part importances that should be reflected in the explanations. Third, by mapping individual explanations into a common space of part importances, we can analyze a variety of different explanation types in a single common framework. Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner.
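The intervention idea behind the estimated ground-truth part importances can be sketched as follows; mask-based in-filling stands in for FunnyBirds' rendered part removals, and the score definition is an assumption for illustration.

```python
import torch

def part_importance(model, image, part_masks, target_class, fill=0.0):
    """Estimate part importances by intervention: remove one part at a time
    and record the drop in the target-class score."""
    with torch.no_grad():
        base = model(image.unsqueeze(0)).softmax(-1)[0, target_class]
        scores = {}
        for name, mask in part_masks.items():    # mask: 1 where the part is
            edited = image * (1 - mask) + fill * mask
            out = model(edited.unsqueeze(0)).softmax(-1)[0, target_class]
            scores[name] = (base - out).item()   # larger drop = more important
    return scores
```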

Private Distribution Learning with Public Data: The View from Sample Compression

  • paper_url: http://arxiv.org/abs/2308.06239
  • repo_url: None
  • paper_authors: Shai Ben-David, Alex Bie, Clément L. Canonne, Gautam Kamath, Vikrant Singhal
  • for: Studying public-private learning, a setting in which the learner has access to public and private samples drawn from an unknown distribution $p$ and must output an estimate of $p$ while satisfying privacy constraints only with respect to the private samples
  • methods: Uses pure differential privacy for the private samples and connects public-private learnability to sample compression schemes and an intermediate notion called list learning
  • results: Derives new results, including sample-complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift-resistant learners, and closure properties of public-private learnability under mixtures and products; it also shows that for Gaussians over $\mathbb R^d$ at least $d$ public samples are necessary for private learnability, close to the known upper bound of $d+1$
    Abstract We study the problem of private distribution learning with access to public data. In this setup, which we refer to as public-private learning, the learner is given public and private samples drawn from an unknown distribution $p$ belonging to a class $\mathcal Q$, with the goal of outputting an estimate of $p$ while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class $\mathcal Q$ is connected to the existence of a sample compression scheme for $\mathcal Q$, as well as to an intermediate notion we refer to as list learning. Leveraging this connection: (1) approximately recovers previous results on Gaussians over $\mathbb R^d$; and (2) leads to new ones, including sample complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift resistant learners, as well as closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in $\mathbb R^d$, at least $d$ public samples are necessary for private learnability, which is close to the known upper bound of $d+1$ public samples.

MaxFloodCast: Ensemble Machine Learning Model for Predicting Peak Inundation Depth And Decoding Influencing Features

  • paper_url: http://arxiv.org/abs/2308.06228
  • repo_url: None
  • paper_authors: Cheng-Chun Lee, Lipai Huang, Federico Antolini, Matthew Garcia, Andrew Juanb, Samuel D. Brody, Ali Mostafavi
  • for: This study aims to provide efficient and interpretable flood inundation depth predictions using a machine learning model, MaxFloodCast, which can support near-time floodplain management and emergency operations.
  • methods: The study uses physics-based hydrodynamic simulations to train the MaxFloodCast model, which achieves reliable flood inundation depth predictions with an average R-squared of 0.949 and a Root Mean Square Error of 0.61 ft on unseen data.
  • results: The study validates the MaxFloodCast model against Hurricane Harvey and Storm Imelda, demonstrating its potential in supporting flood risk management and emergency operations. The model provides critical information for decision-makers to prioritize areas with critical facilities and to examine how rainfall in other watersheds influences flood exposure in one area.
    Abstract Timely, accurate, and reliable information is essential for decision-makers, emergency managers, and infrastructure operators during flood events. This study demonstrates that a proposed machine learning model, MaxFloodCast, trained on physics-based hydrodynamic simulations in Harris County, offers efficient and interpretable flood inundation depth predictions. Achieving an average R-squared of 0.949 and a Root Mean Square Error of 0.61 ft on unseen data, it proves reliable in forecasting peak flood inundation depths. Validated against Hurricane Harvey and Storm Imelda, MaxFloodCast shows potential in supporting near-time floodplain management and emergency operations. The model's interpretability aids decision-makers by offering critical information to inform flood mitigation strategies, to prioritize areas with critical facilities, and to examine how rainfall in other watersheds influences flood exposure in one area. The MaxFloodCast model enables accurate and interpretable inundation depth predictions while significantly reducing computational time, thereby supporting emergency response efforts and flood risk management more effectively.

Automated Sizing and Training of Efficient Deep Autoencoders using Second Order Algorithms

  • paper_url: http://arxiv.org/abs/2308.06221
  • repo_url: None
  • paper_authors: Kanishka Tyagi, Chinmay Rane, Michael Manry
  • for: Designing generalized linear classifiers and, building on them, efficient deep autoencoders
  • methods: A multi-step training method: an initial multi-class linear classifier is found through regression, validation error is minimized by pruning unnecessary inputs, desired outputs are improved via a Ho-Kashyap-like rule, and batch training algorithms optimize the hidden-layer size and number of training epochs
  • results: Obtains an efficient deep model via multi-step training and second-order algorithms, with performance gains demonstrated at each step on multiple public datasets
    Abstract We propose a multi-step training method for designing generalized linear classifiers. First, an initial multi-class linear classifier is found through regression. Then validation error is minimized by pruning unnecessary inputs. Simultaneously, desired outputs are improved via a method similar to the Ho-Kashyap rule. Next, the output discriminants are scaled to be net functions of sigmoidal output units in a generalized linear classifier. We then develop a family of batch training algorithms for the multilayer perceptron that optimize its hidden-layer size and number of training epochs. Next, we combine pruning with a growing approach. Later, the input units are scaled to be the net functions of the sigmoidal output units, which are then fed as input to the MLP. We then propose resulting improvements in each of the deep learning blocks, thereby improving the overall performance of the deep architecture. We discuss the principles and formulation of learning algorithms for deep autoencoders. We investigate several problems in deep autoencoder networks, including training issues, the theoretical, mathematical, and experimental justification that the networks are linear, optimizing the number of hidden units in each layer, and determining the depth of the deep learning model. A direct implication of the current work is the ability to construct fast deep learning models using desktop-level computational resources. This, in our opinion, promotes our design philosophy of building small but powerful algorithms. Performance gains are demonstrated at each step. Using widely available datasets, the final network's ten-fold testing error is shown to be less than that of several other linear, generalized linear classifiers, multilayer perceptrons, and deep learners reported in the literature.

Change Point Detection With Conceptors

  • paper_url: http://arxiv.org/abs/2308.06213
  • repo_url: https://github.com/noahgade/changepointdetectionwithconceptors
  • paper_authors: Noah D. Gade, Jordan Rodu
  • for: Identifying points in a time series where the data-generating process changes
  • methods: A conceptor matrix learns the characteristic dynamics of a specified training window; the associated random recurrent neural network featurizes the data, and change points are identified from a univariate quantification of the distance between the featurization and the space spanned by a representative conceptor matrix
  • results: Provides a consistent estimate of the true change point under mild assumptions, with quantile estimates and Type 1 error control produced via a moving block bootstrap; performance is evaluated on simulations from several classes of processes and on real data
    Abstract Offline change point detection seeks to identify points in a time series where the data generating process changes. This problem is well studied for univariate i.i.d. data, but becomes challenging with increasing dimension and temporal dependence. For the at most one change point problem, we propose the use of a conceptor matrix to learn the characteristic dynamics of a specified training window in a time series. The associated random recurrent neural network acts as a featurizer of the data, and change points are identified from a univariate quantification of the distance between the featurization and the space spanned by a representative conceptor matrix. This model agnostic method can suggest potential locations of interest that warrant further study. We prove that, under mild assumptions, the method provides a consistent estimate of the true change point, and quantile estimates for statistics are produced via a moving block bootstrap of the original data. The method is tested on simulations from several classes of processes, and we evaluate performance with clustering metrics, graphical methods, and observed Type 1 error control. We apply our method to publicly available neural data from rats experiencing bouts of non-REM sleep prior to exploration of a radial maze.
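The conceptor itself has a closed form: for a window of echo-state-network states X (dim x T) with correlation matrix R = XX^T/T, the conceptor is C = R(R + a^-2 I)^-1 with aperture a. The sketch below computes it, with a simple Frobenius distance as a stand-in for the paper's exact statistic; the ESN producing the states is assumed given.

```python
import numpy as np

def conceptor(states, aperture=10.0):
    """Conceptor matrix of a window of reservoir states (dim x T)."""
    dim, T = states.shape
    R = states @ states.T / T                     # state correlation matrix
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(dim))

def conceptor_distance(C_train, states_window, aperture=10.0):
    """Scalar dissimilarity between a new window and the training dynamics;
    Frobenius distance is an illustrative choice, not the paper's statistic."""
    C_new = conceptor(states_window, aperture)
    return np.linalg.norm(C_train - C_new, ord="fro")
```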

Safety in Traffic Management Systems: A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2308.06204
  • repo_url: None
  • paper_authors: Wenlu Du, Ankan Dash, Jing Li, Hua Wei, Guiling Wang
  • for: Providing a comprehensive literature review of safety in traffic management systems, to better understand their safety issues and how to address them
  • methods: A literature survey of the safety issues arising in traffic management systems and an analysis of the current state of research
  • results: Identifies the safety challenges introduced by advanced technologies in traffic management systems, surveys the techniques and methods proposed to ensure their safety, and points out limitations of existing research together with future research directions
    Abstract Traffic management systems play a vital role in ensuring safe and efficient transportation on roads. However, the use of advanced technologies in traffic management systems has introduced new safety challenges. Therefore, it is important to ensure the safety of these systems to prevent accidents and minimize their impact on road users. In this survey, we provide a comprehensive review of the literature on safety in traffic management systems. Specifically, we discuss the different safety issues that arise in traffic management systems, the current state of research on safety in these systems, and the techniques and methods proposed to ensure the safety of these systems. We also identify the limitations of the existing research and suggest future research directions.

eess.IV - 2023-08-12

Semantic Communications with Explicit Semantic Base for Image Transmission

  • paper_url: http://arxiv.org/abs/2308.06599
  • repo_url: None
  • paper_authors: Yuan Zheng, Fengyu Wang, Wenjun Xu, Miao Pan, Ping Zhang
  • for: Improving the reliability and efficiency of next-generation communication systems by strengthening the semantic representation and synchronization of information
  • methods: Proposes a semantic image-transmission framework built on an explicit semantic base (Seb), with adaptive Seb generation and use for representing and transmitting images
  • results: Experiments show that the proposed framework outperforms state-of-the-art methods by 0.5-1.5 dB in peak signal-to-noise ratio (PSNR) across different signal-to-noise ratios (SNRs), improving image-reconstruction fidelity
    Abstract Semantic communications, aiming at ensuring the successful delivery of the meaning of information, are expected to be one of the potential techniques for the next generation communications. However, the knowledge forming and synchronizing mechanism that enables semantic communication systems to extract and interpret the semantics of information according to the communication intents is still immature. In this paper, we propose a semantic image transmission framework with explicit semantic base (Seb), where Sebs are generated and employed as the knowledge shared between the transmitter and the receiver with flexible granularity. To represent images with Sebs, a novel Seb-based reference image generator is proposed to generate Sebs and then decompose the transmitted images. To further encode/decode the residual information for precise image reconstruction, a Seb-based image encoder/decoder is proposed. The key components of the proposed framework are optimized jointly by end-to-end (E2E) training, where the loss function is specifically designed to tackle the problem of non-differentiable operations in the Seb-based reference image generator by introducing a gradient approximation mechanism. Extensive experiments show that the proposed framework outperforms state-of-the-art works by 0.5 - 1.5 dB in peak signal-to-noise ratio (PSNR) across different signal-to-noise ratios (SNRs).

On Versatile Video Coding at UHD with Machine-Learning-Based Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.06570
  • repo_url: None
  • paper_authors: Kristian Fischer, Christian Herglotz, André Kaup
  • for: Improving the coding quality of 4K video
  • methods: Combines the next-generation VVC codec with machine-learning-based single-image super-resolution, spatially downscaling the 4K data before coding and upscaling after decoding
  • results: Achieves up to 12% and 18% Bjontegaard delta rate gains (for QP values above 34 and 42, respectively) while reducing compression artifacts and loss of detail
    Abstract Coding 4K data has become of vital interest in recent years, since the amount of 4K data is significantly increasing. We propose a coding chain with spatial down- and upscaling that combines the next-generation VVC codec with machine learning based single image super-resolution algorithms for 4K. The investigated coding chain, which spatially downscales the 4K data before coding, shows superior quality compared to the conventional VVC reference software for low bitrate scenarios. Throughout several tests, we find that up to 12 % and 18 % Bjontegaard delta rate gains can be achieved on average when coding 4K sequences with VVC and QP values above 34 and 42, respectively. Additionally, the investigated scenario with up- and downscaling helps to reduce the loss of details and compression artifacts, as shown in a visual example.

Three-dimensional echo-shifted EPI with simultaneous blip-up and blip-down acquisitions for correcting geometric distortion

  • paper_url: http://arxiv.org/abs/2308.06557
  • repo_url: None
  • paper_authors: Kaibao Sun, Zhifeng Chen, Guangyu Dan, Qingfei Luo, Lirong Yan, Feng Liu, Xiaohong Joe Zhou
  • for: This study aims to overcome the doubled scan time and degraded temporal resolution of BUDA, a major challenge for functional MRI applications, by developing a three-dimensional echo-shifted EPI BUDA (esEPI-BUDA) technique that acquires both blip-up and blip-down datasets in a single shot.
  • methods: The 3D esEPI-BUDA pulse sequence uses an echo-shifting strategy to produce two EPI readout trains whose k-space trajectories are interleaved with opposite phase-encoding gradient directions. The two k-space datasets are reconstructed separately with a 3D SENSE algorithm, from which time-resolved B0-field maps are derived and fed into a forward model of joint parallel imaging reconstruction to correct geometric distortion.
  • results: In a phantom and a human brain fMRI study, geometric distortions were effectively corrected, and visual activation volumes and their BOLD responses were comparable to those from conventional 3D echo-planar images.
    Abstract Purpose: Echo-planar imaging (EPI) with blip-up/down acquisition (BUDA) can provide high-quality images with minimal distortions by using two readout trains with opposing phase-encoding gradients. Because of the need for two separate acquisitions, BUDA doubles the scan time and degrades the temporal resolution when compared to single-shot EPI, presenting a major challenge for many applications, particularly functional MRI (fMRI). This study aims at overcoming this challenge by developing an echo-shifted EPI BUDA (esEPI-BUDA) technique to acquire both blip-up and blip-down datasets in a single shot. Methods: A three-dimensional (3D) esEPI-BUDA pulse sequence was designed by using an echo-shifting strategy to produce two EPI readout trains. These readout trains produced a pair of k-space datasets whose k-space trajectories were interleaved with opposite phase-encoding gradient directions. The two k-space datasets were separately reconstructed using a 3D SENSE algorithm, from which time-resolved B0-field maps were derived using TOPUP in FSL and then input into a forward model of joint parallel imaging reconstruction to correct for geometric distortion. In addition, Hankel structured low-rank constraint was incorporated into the reconstruction framework to improve image quality by mitigating the phase errors between the two interleaved k-space datasets. Results: The 3D esEPI-BUDA technique was demonstrated in a phantom and an fMRI study on healthy human subjects. Geometric distortions were effectively corrected in both phantom and human brain images. In the fMRI study, the visual activation volumes and their BOLD responses were comparable to those from conventional 3D echo-planar images. Conclusion: The improved imaging efficiency and dynamic distortion correction capability afforded by 3D esEPI-BUDA are expected to benefit many EPI applications.

The Color Clifford Hardy Signal: Application to Color Edge Detection and Optical Flow

  • paper_url: http://arxiv.org/abs/2308.06485
  • repo_url: None
  • paper_authors: Xiaoxiao Hu, Kit Ian Kou, Cuiming Zou, Dong Cheng
  • for: This paper introduces a new approach to processing color images using the color Clifford Hardy signal, which is a high-dimensional analytic function.
  • methods: The paper proposes five methods for edge detection in color images based on the local feature representation of the color Clifford Hardy signal. These methods utilize the multi-scale structure of the signal to resist noise and improve edge detection accuracy.
  • results: The proposed methods are evaluated using image quality assessment criteria and are shown to be superior to traditional edge detection methods in terms of robustness to noise and accuracy. Additionally, an example application of color optical flow detection using the proposed approach is provided.
    Abstract This paper introduces the idea of the color Clifford Hardy signal, which can be used to process color images. As a complex analytic function's high-dimensional analogue, the color Clifford Hardy signal inherits many desirable qualities of analyticity. A crucial tool for getting the color and structural data is the local feature representation of a color image in the color Clifford Hardy signal. By looking at the extended Cauchy-Riemann equations in the high-dimensional space, it is possible to see the connection between the different parts of the color Clifford Hardy signal. Based on the distinctive and important local amplitude and local phase generated by the color Clifford Hardy signal, we propose five methods to identify the edges of color images with relation to a certain color. To prove the superiority of the offered methodologies, numerous comparative studies employing image quality assessment criteria are used. Specifically by using the multi-scale structure of the color Clifford Hardy signal, the proposed approaches are resistant to a variety of noises. In addition, a color optical flow detection method with anti-noise ability is provided as an example of application.
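The paper's color Clifford Hardy construction is not reproduced here, but the flavor of its local amplitude and local phase features can be illustrated with the closely related monogenic signal (Riesz transform) on a single grayscale channel. The sketch below is an analogue under that assumption, not the paper's operator:

```python
import numpy as np

def monogenic_amplitude_phase(img):
    """Local amplitude/phase of a 2D image via the Riesz transform, a
    real-valued analogue of the analytic-signal features that the color
    Clifford Hardy signal provides for color images (illustrative only)."""
    f = np.fft.fft2(img)
    h, w = img.shape
    u = np.fft.fftfreq(h)[:, None]
    v = np.fft.fftfreq(w)[None, :]
    norm = np.sqrt(u**2 + v**2)
    norm[0, 0] = 1.0  # avoid division by zero at the DC component
    r1 = np.real(np.fft.ifft2(-1j * u / norm * f))  # Riesz component, y
    r2 = np.real(np.fft.ifft2(-1j * v / norm * f))  # Riesz component, x
    amplitude = np.sqrt(img**2 + r1**2 + r2**2)      # local amplitude
    phase = np.arctan2(np.sqrt(r1**2 + r2**2), img)  # local phase
    return amplitude, phase

amp, ph = monogenic_amplitude_phase(np.random.rand(64, 64))
```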

Out-of-distribution multi-view auto-encoders for prostate cancer lesion detection

  • paper_url: http://arxiv.org/abs/2308.06481
  • repo_url: None
  • paper_authors: Alvaro Fernandez-Quilez, Linas Vidziunas, Ørjan Kløvfjell Thoresen, Ketil Oppedal, Svein Reidar Kjosavik, Trygve Eftestøl
  • for: This work proposes an unsupervised out-of-distribution (OOD) detection approach for deep learning in the medical domain, where annotated data are scarce.
  • methods: A multi-stream approach that accommodates different T2w directions to improve prostate cancer (PCa) lesion detection in an OOD setting.
  • results: Evaluated on a publicly available dataset, the multi-stream approach achieves better detection results than a single-direction approach (AUC of 82.3% vs. 73.1%).
    Abstract Traditional deep learning (DL) approaches based on supervised learning paradigms require large amounts of annotated data that are rarely available in the medical domain. Unsupervised Out-of-distribution (OOD) detection is an alternative that requires less annotated data. Further, OOD applications exploit the class skewness commonly present in medical data. Magnetic resonance imaging (MRI) has proven to be useful for prostate cancer (PCa) diagnosis and management, but current DL approaches rely on T2w axial MRI, which suffers from low out-of-plane resolution. We propose a multi-stream approach to accommodate different T2w directions to improve the performance of PCa lesion detection in an OOD approach. We evaluate our approach on a publicly available data-set, obtaining better detection results in terms of AUC when compared to a single direction approach (73.1 vs 82.3). Our results show the potential of OOD approaches for PCa lesion detection based on MRI.
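As a rough illustration of the OOD idea the abstract describes, the sketch below scores MRI slices by the reconstruction error of a toy auto-encoder; the paper's multi-stream, multi-direction architecture is collapsed into a single stream, and the model here is untrained (in practice it would be trained on lesion-free data so that lesions reconstruct poorly):

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """Toy convolutional auto-encoder; a stand-in for one stream of the
    paper's multi-stream design (one stream per T2w direction)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 1, 4, 2, 1))

    def forward(self, x):
        return self.dec(self.enc(x))

def ood_score(model, x):
    """Per-slice anomaly score: inputs unlike the (in-distribution)
    training data should reconstruct poorly, giving a higher score."""
    with torch.no_grad():
        recon = model(x)
    return ((x - recon) ** 2).mean(dim=(1, 2, 3))  # MSE per sample

scores = ood_score(TinyAE().eval(), torch.randn(4, 1, 64, 64))
```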

Leveraging multi-view data without annotations for prostate MRI segmentation: A contrastive approach

  • paper_url: http://arxiv.org/abs/2308.06477
  • repo_url: None
  • paper_authors: Tim Nikolass Lindeijer, Tord Martin Ytredal, Trygve Eftestøl, Tobias Nordström, Fredrik Jäderling, Martin Eklund, Alvaro Fernandez-Quilez
  • for: To improve automatic prostate delineation and volume characterization in MRI in support of the clinical assessment of prostate cancer.
  • methods: A contrastive approach at training time that leverages multi-view data (axial, sagittal, coronal) without annotations, remaining flexible at deployment time in the event of missing views.
  • results: The contrastive approach improves segmentation accuracy over an axial-only baseline and achieves good volumetric consistency across views.
    Abstract An accurate prostate delineation and volume characterization can support the clinical assessment of prostate cancer. A large number of automatic prostate segmentation tools consider exclusively the axial MRI direction in spite of the availability as per acquisition protocols of multi-view data. Further, when multi-view data is exploited, manual annotations and availability at test time for all the views are commonly assumed. In this work, we explore a contrastive approach at training time to leverage multi-view data without annotations and provide flexibility at deployment time in the event of missing views. We propose a triplet encoder and single decoder network based on U-Net, tU-Net (triplet U-Net). Our proposed architecture is able to exploit non-annotated sagittal and coronal views via contrastive learning to improve the segmentation from a volumetric perspective. For that purpose, we introduce the concept of inter-view similarity in the latent space. To guide the training, we combine a dice score loss calculated with respect to the axial view and its manual annotations together with a multi-view contrastive loss. tU-Net shows statistical improvement in dice score coefficient (DSC) with respect to only axial view (91.25+-0.52% compared to 86.40+-1.50%, P<.001). Sensitivity analysis reveals the volumetric positive impact of the contrastive loss when paired with tU-Net (2.85+-1.34% compared to 3.81+-1.88%, P<.001). Further, our approach shows good external volumetric generalization in an in-house dataset when tested with multi-view data (2.76+-1.89% compared to 3.92+-3.31%, P=.002), showing the feasibility of exploiting non-annotated multi-view data through contrastive learning whilst providing flexibility at deployment in the event of missing views.
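The training objective combines a dice loss on the annotated axial view with a multi-view contrastive loss enforcing inter-view similarity in the latent space. The sketch below shows one plausible instantiation using an InfoNCE-style contrastive term; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss against the axial manual annotations."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def multiview_contrastive(z_ax, z_sag, z_cor, tau=0.1):
    """Pull the latent codes of the three views of the same volume
    together (inter-view similarity). InfoNCE-style stand-in, not
    necessarily the paper's exact loss."""
    z = F.normalize(torch.stack([z_ax, z_sag, z_cor]), dim=-1)  # (3, B, D)
    loss = 0.0
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            logits = z[i] @ z[j].t() / tau       # (B, B) similarities
            labels = torch.arange(z.size(1))     # positives on the diagonal
            loss = loss + F.cross_entropy(logits, labels)
    return loss / 6

B, D = 4, 128
total = (dice_loss(torch.rand(B, 1, 64, 64), torch.randint(0, 2, (B, 1, 64, 64)).float())
         + 0.1 * multiview_contrastive(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)))
```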

CATS v2: Hybrid encoders for robust medical segmentation

  • paper_url: http://arxiv.org/abs/2308.06377
  • repo_url: https://github.com/haoli12345/cats
  • paper_authors: Hao Li, Han Liu, Dewei Hu, Xing Yao, Jiacheng Wang, Ipek Oguz
  • for: This paper proposes a new method for 3D medical image segmentation, specifically for vestibular schwannoma (VS) and prostate segmentation.
  • methods: A hybrid encoder consisting of a CNN-based encoder path paralleled with a transformer path with a shifted window, to leverage both local and global information.
  • results: Superior performance in terms of higher Dice scores compared to state-of-the-art methods on two public challenge datasets (CrossMoDA and MSD-5) for VS and prostate segmentation.
    Abstract Convolutional Neural Networks (CNNs) have exhibited strong performance in medical image segmentation tasks by capturing high-level (local) information, such as edges and textures. However, due to the limited field of view of convolution kernel, it is hard for CNNs to fully represent global information. Recently, transformers have shown good performance for medical image segmentation due to their ability to better model long-range dependencies. Nevertheless, transformers struggle to capture high-level spatial features as effectively as CNNs. A good segmentation model should learn a better representation from local and global features to be both precise and semantically accurate. In our previous work, we proposed CATS, which is a U-shaped segmentation network augmented with transformer encoder. In this work, we further extend this model and propose CATS v2 with hybrid encoders. Specifically, hybrid encoders consist of a CNN-based encoder path paralleled to a transformer path with a shifted window, which better leverage both local and global information to produce robust 3D medical image segmentation. We fuse the information from the convolutional encoder and the transformer at the skip connections of different resolutions to form the final segmentation. The proposed method is evaluated on two public challenge datasets: Cross-Modality Domain Adaptation (CrossMoDA) and task 5 of Medical Segmentation Decathlon (MSD-5), to segment vestibular schwannoma (VS) and prostate, respectively. Compared with the state-of-the-art methods, our approach demonstrates superior performance in terms of higher Dice scores.
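A minimal sketch of fusing CNN-path and transformer-path features at one skip connection, as the hybrid encoder does at each resolution; the additive 1x1-projection fusion below is an assumption, not necessarily CATS v2's exact fusion operator:

```python
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    """Fuse a CNN skip feature with a transformer-path feature at one
    resolution before handing it to the U-Net decoder. Simple additive
    fusion after 1x1 channel projections (illustrative)."""
    def __init__(self, c_cnn, c_tr, c_out):
        super().__init__()
        self.proj_cnn = nn.Conv3d(c_cnn, c_out, kernel_size=1)
        self.proj_tr = nn.Conv3d(c_tr, c_out, kernel_size=1)

    def forward(self, f_cnn, f_tr):
        return self.proj_cnn(f_cnn) + self.proj_tr(f_tr)

fused = HybridFusion(64, 96, 64)(torch.randn(1, 64, 16, 32, 32),
                                 torch.randn(1, 96, 16, 32, 32))
```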

Deep Learning-Based Open Source Toolkit for Eosinophil Detection in Pediatric Eosinophilic Esophagitis

  • paper_url: http://arxiv.org/abs/2308.06333
  • repo_url: https://github.com/hrlblab/open-eoe
  • paper_authors: Juming Xiong, Yilin Liu, Ruining Deng, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Yuankai Huo
  • for: To develop an open-source toolkit (Open-EoE) for detecting eosinophils (Eos) in whole slide images (WSIs) of esophageal biopsies.
  • methods: The toolkit supports three state-of-the-art deep learning-based object detection models and implements an ensemble learning strategy to improve performance.
  • results: Open-EoE efficiently detects Eos on a test set of 289 WSIs, achieving 91% accuracy at the accepted diagnostic threshold, in decent agreement with pathologist evaluations.
    Abstract Eosinophilic Esophagitis (EoE) is a chronic, immune/antigen-mediated esophageal disease, characterized by symptoms related to esophageal dysfunction and histological evidence of eosinophil-dominant inflammation. Owing to the intricate microscopic representation of EoE in imaging, current methodologies which depend on manual identification are not only labor-intensive but also prone to inaccuracies. In this study, we develop an open-source toolkit, named Open-EoE, to perform end-to-end whole slide image (WSI) level eosinophil (Eos) detection using one line of command via Docker. Specifically, the toolkit supports three state-of-the-art deep learning-based object detection models. Furthermore, Open-EoE optimizes performance by implementing an ensemble learning strategy, enhancing the precision and reliability of the results. The experimental results demonstrated that the Open-EoE toolkit can efficiently detect Eos on a testing set with 289 WSIs. At the widely accepted threshold of >= 15 Eos per high power field (HPF) for diagnosing EoE, Open-EoE achieved an accuracy of 91%, showing decent consistency with pathologist evaluations. This suggests a promising avenue for integrating machine learning methodologies into the diagnostic process for EoE. The Docker image and source code have been made publicly available at https://github.com/hrlblab/Open-EoE.
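The diagnostic rule and the ensemble idea can be made concrete with a small sketch. The median-count ensemble below is an illustrative simplification; the toolkit's actual ensemble strategy operates on the detectors' outputs and is more involved:

```python
def eoe_positive(eos_counts_per_hpf, threshold=15):
    """Clinical rule from the abstract: the case is EoE-positive if any
    high-power field (HPF) contains >= 15 detected eosinophils."""
    return max(eos_counts_per_hpf) >= threshold

def ensemble_count(per_model_counts):
    """Toy ensemble: median of the per-model Eos counts for one HPF
    (a stand-in for the toolkit's ensemble learning strategy)."""
    s = sorted(per_model_counts)
    return s[len(s) // 2]

# Counts from three hypothetical detectors over two HPFs.
hpf_counts = [ensemble_count(c) for c in [(12, 14, 13), (17, 16, 18)]]
print(eoe_positive(hpf_counts))  # True: the second HPF reaches the threshold
```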

Revolutionizing Space Health (Swin-FSR): Advancing Super-Resolution of Fundus Images for SANS Visual Assessment Technology

  • paper_url: http://arxiv.org/abs/2308.06332
  • repo_url: https://github.com/FarihaHossain/SwinFSR
  • paper_authors: Khondker Fariha Hossain, Sharif Amit Kamran, Joshua Ong, Andrew G. Lee, Alireza Tavakkoli
  • for: This paper develops a novel model for fundus image super-resolution, specifically using a Swin Transformer with spatial and depth-wise attention.
  • methods: A novel model called Swin-FSR, which combines a Swin Transformer with spatial and depth-wise attention for fundus image super-resolution.
  • results: Peak signal-to-noise ratio (PSNR) of 47.89, 49.00 and 45.32 on three public datasets (iChallenge-AMD, iChallenge-PM, and G1020); the model also showed comparable results against previous architectures on a privately held dataset for Spaceflight-associated Neuro-ocular Syndrome (SANS) provided by NASA.
    Abstract The rapid accessibility of portable and affordable retinal imaging devices has made early differential diagnosis easier. For example, color funduscopy imaging is readily available in remote villages, which can help to identify diseases like age-related macular degeneration (AMD), glaucoma, or pathological myopia (PM). On the other hand, astronauts at the International Space Station utilize this camera for identifying spaceflight-associated neuro-ocular syndrome (SANS). However, due to the unavailability of experts in these locations, the data has to be transferred to an urban healthcare facility (AMD and glaucoma) or a terrestrial station (e.g, SANS) for more precise disease identification. Moreover, due to low bandwidth limits, the imaging data has to be compressed for transfer between these two places. Different super-resolution algorithms have been proposed throughout the years to address this. Furthermore, with the advent of deep learning, the field has advanced so much that x2 and x4 compressed images can be decompressed to their original form without losing spatial information. In this paper, we introduce a novel model called Swin-FSR that utilizes Swin Transformer with spatial and depth-wise attention for fundus image super-resolution. Our architecture achieves Peak signal-to-noise-ratio (PSNR) of 47.89, 49.00 and 45.32 on three public datasets, namely iChallenge-AMD, iChallenge-PM, and G1020. Additionally, we tested the model's effectiveness on a privately held dataset for SANS provided by NASA and achieved comparable results against previous architectures.
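Since PSNR is the reported metric, here is the standard computation for reference, on toy data with images assumed normalized to [0, 1]:

```python
import numpy as np

def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB, the quality metric the paper
    reports (47.89 / 49.00 / 45.32 on the three public datasets)."""
    mse = np.mean((reference - reconstruction) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

hr = np.random.rand(128, 128)
sr = hr + 0.01 * np.random.randn(128, 128)  # stand-in for a SR output
print(f"{psnr(hr, sr):.2f} dB")
```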

A Hierarchical Descriptor Framework for On-the-Fly Anatomical Location Matching between Longitudinal Studies

  • paper_url: http://arxiv.org/abs/2308.07337
  • repo_url: None
  • paper_authors: Halid Ziya Yerebakan, Yoshihisa Shinagawa, Mahesh Ranganath, Simon Allen-Raffl, Gerardo Hermosillo Valadez
  • for: On-the-fly matching of anatomical locations between longitudinal medical imaging studies.
  • methods: Descriptor computation based on hierarchical sparse sampling of image intensities, combined with a hierarchical search.
  • results: Point-mapping time is reduced to the millisecond scale on a single CPU, without pretraining, resampling, or multi-modality registration; matching on the Deep Lesion Tracking dataset is more accurate than Deep Lesion Tracker while being 24 times faster than the most precise algorithm reported there.
    Abstract We propose a method to match anatomical locations between pairs of medical images in longitudinal comparisons. The matching is made possible by computing a descriptor of the query point in a source image based on a hierarchical sparse sampling of image intensities that encode the location information. Then, a hierarchical search operation finds the corresponding point with the most similar descriptor in the target image. This simple yet powerful strategy reduces the computational time of mapping points to a millisecond scale on a single CPU. Thus, radiologists can compare similar anatomical locations in near real-time without requiring extra architectural costs for precomputing or storing deformation fields from registrations. Our algorithm does not require prior training, resampling, segmentation, or affine transformation steps. We have tested our algorithm on the recently published Deep Lesion Tracking dataset annotations. We observed more accurate matching compared to Deep Lesion Tracker while being 24 times faster than the most precise algorithm reported therein. We also investigated the matching accuracy on CT and MR modalities and compared the proposed algorithm's accuracy against ground truth consolidated from multiple radiologists.
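A toy sketch of the two ingredients the abstract names: a descriptor built by hierarchical sparse sampling of intensities around a query point, and a coarse-to-fine search for the best-matching point in the target volume. The sampling pattern and search schedule below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def descriptor(vol, p, radii=(1, 2, 4, 8), n=32, seed=0):
    """Hierarchical sparse sampling: read intensities at fixed
    pseudo-random offsets around point p, at several scales."""
    rng = np.random.default_rng(seed)          # same offsets every call
    offs = rng.integers(-1, 2, size=(len(radii), n, 3))
    feats = []
    for r, o in zip(radii, offs):
        q = np.clip(p + r * o, 0, np.array(vol.shape) - 1)
        feats.append(vol[q[:, 0], q[:, 1], q[:, 2]])
    return np.concatenate(feats)

def hierarchical_search(src, p_src, tgt, step=8):
    """Coarse-to-fine search in the target volume for the point whose
    descriptor best matches the query point's descriptor."""
    d_q = descriptor(src, p_src)
    best = np.array(tgt.shape) // 2
    dirs = np.array(np.meshgrid(*[[-1, 0, 1]] * 3)).T.reshape(-1, 3)
    while step >= 1:
        cands = [np.clip(best + step * d, 0, np.array(tgt.shape) - 1)
                 for d in dirs]
        best = min(cands, key=lambda c: np.sum((descriptor(tgt, c) - d_q) ** 2))
        step //= 2
    return best

vol = np.random.rand(32, 32, 32)
print(hierarchical_search(vol, np.array([16, 16, 16]), vol))
```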

cs.SD - 2023-08-11

Improving Joint Speech-Text Representations Without Alignment

  • paper_url: http://arxiv.org/abs/2308.06125
  • repo_url: None
  • paper_authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho
  • for: This paper studies the cross-modal representation space in which speech and text are jointly encoded for ASR.
  • methods: Joint speech-text encoders trained on both unpaired speech and text, which scale to very large parameter models; unlike prior work, the inherent sequence-length mismatch between speech and text is not handled by up-sampling heuristics or an explicit alignment model.
  • results: Evidence that joint speech-text encoders naturally achieve consistent representations across modalities when sequence length is disregarded, and that a consistency loss which simply assumes the best alignment improves downstream WER in both a large-parameter monolingual and a multilingual system.
    Abstract The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.
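One simple way to realize a consistency loss that disregards sequence length and "assumes the best alignment" is to compare pooled utterance-level embeddings. The sketch below is such a stand-in, not necessarily the loss used in the paper:

```python
import torch
import torch.nn.functional as F

def length_agnostic_consistency(speech_emb, text_emb):
    """Consistency loss between a speech encoding (B, T_s, D) and a text
    encoding (B, T_t, D) that sidesteps the length mismatch by mean-pooling
    over time before comparing the two modalities."""
    s = F.normalize(speech_emb.mean(dim=1), dim=-1)  # (B, D)
    t = F.normalize(text_emb.mean(dim=1), dim=-1)    # (B, D)
    return (1 - (s * t).sum(dim=-1)).mean()          # cosine distance

loss = length_agnostic_consistency(torch.randn(2, 480, 256),
                                   torch.randn(2, 37, 256))
```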

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

  • paper_url: http://arxiv.org/abs/2308.06112
  • repo_url: None
  • paper_authors: Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Haithem Boussaid, Ebtessam Almazrouei, Merouane Debbah
  • for: A simple visual speech recognition (VSR) approach that avoids the reliance on labeled data for fully training or fine-tuning models, and the resulting performance degradation under out-of-distribution scenarios.
  • methods: Learning a prior model that maps the latent representations of a robust visual speech encoder to the corresponding latents of the paired audio, which are sufficiently invariant for effective text decoding by an off-the-shelf ASR model.
  • results: On the LRS3 dataset the proposed method reaches 26 WER, comparing favorably with fully-supervised methods, and unlike SoTA approaches it keeps reasonable performance on the VoxCeleb test set.
    Abstract Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degeneration under out-of-distribution challenging scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to their corresponding latents from the audio pair, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset achieving 26 WER. Unlike SoTA approaches, our model keeps a reasonable performance on the VoxCeleb test set. We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
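A minimal sketch of the latent-to-latent idea: a small prior network regresses audio-encoder latents from visual-speech latents, after which an off-the-shelf ASR decoder takes over. The architecture and dimensions below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    """Map visual-speech latents to audio latents; a small transformer
    stands in for the prior network (dimensions are made up)."""
    def __init__(self, d_vis=512, d_aud=768):
        super().__init__()
        self.proj = nn.Linear(d_vis, d_aud)
        layer = nn.TransformerEncoderLayer(d_aud, nhead=8, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, lip_latents):               # (B, T, d_vis)
        return self.net(self.proj(lip_latents))   # (B, T, d_aud)

# Training would regress the (frozen) audio encoder's latents from lip
# latents; at inference, the predicted audio latents are decoded to text
# by an off-the-shelf ASR model.
pred = LatentPrior()(torch.randn(2, 75, 512))
```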

An Autoethnographic Exploration of XAI in Algorithmic Composition

  • paper_url: http://arxiv.org/abs/2308.06089
  • repo_url: None
  • paper_authors: Ashley Noel-Hirst, Nick Bryan-Kinns
  • for: To explore how explainable AI (XAI) generative models can be used in music making.
  • methods: An autoethnographic study of the MeasureVAE generative music XAI model with interpretable latent dimensions, trained on Irish folk music.
  • results: The exploratory nature of the music-making workflow foregrounds musical features of the training dataset rather than features of the generative model itself; appropriating an XAI model within an iterative workflow highlights the potential of such models to form part of a richer and more complex workflow than they were initially designed for.
    Abstract Machine Learning models are capable of generating complex music across a range of genres from folk to classical music. However, current generative music AI models are typically difficult to understand and control in meaningful ways. Whilst research has started to explore how explainable AI (XAI) generative models might be created for music, no generative XAI models have been studied in music making practice. This paper introduces an autoethnographic study of the use of the MeasureVAE generative music XAI model with interpretable latent dimensions trained on Irish folk music. Findings suggest that the exploratory nature of the music-making workflow foregrounds musical features of the training dataset rather than features of the generative model itself. The appropriation of an XAI model within an iterative workflow highlights the potential of XAI models to form part of a richer and more complex workflow than they were initially designed for.

Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model

  • paper_url: http://arxiv.org/abs/2308.05995
  • repo_url: None
  • paper_authors: Fan Zhang, Naye Ji, Fuxing Gao, Siyuan Zhao, Zhaohan Wang, Shunman Li
  • for: Co-speech gesture generation for digital humans.
  • methods: A speech-conditional, diffusion-based, non-autoregressive transformer generative model using the WavLM pre-trained model to extract low-level and high-level audio information, with an adaptive layer norm architecture to learn the relationship between speech information and accompanying gestures.
  • results: Evaluations on the Trinity, ZEGGS, and BEAT datasets show the model synthesizes natural, stylized full-body co-speech gestures from raw speech audio alone.
    Abstract The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by using acoustic and semantic information as input and adopting classification methods to identify the person's ID and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay between co-speech gestures, speech acoustics, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional diffusion-based and non-autoregressive transformer-based generative model with WavLM pre-trained model. It can produce individual and stylized full-body co-speech gestures using only raw speech audio, eliminating the need for complex multimodal processing and manual annotation. Firstly, considering that speech audio not only contains acoustic and semantic features but also conveys personality traits, emotions, and more subtle information related to accompanying gestures, we pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract low-level and high-level audio information. Secondly, we introduce an adaptive layer norm architecture in the transformer-based layer to learn the relationship between speech information and accompanying gestures. Extensive subjective evaluation experiments are conducted on the Trinity, ZEGGS, and BEAT datasets to confirm WavLM's and the model's ability to synthesize natural co-speech gestures with various styles.
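The adaptive layer norm that lets speech features condition the transformer blocks can be sketched in the AdaLN style, where scale and shift are predicted from the conditioning vector. The details below are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from a speech feature,
    so the audio conditions each transformer block (AdaLN-style sketch)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):   # x: (B, T, dim), cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

out = AdaptiveLayerNorm(256, 1024)(torch.randn(2, 60, 256),
                                   torch.randn(2, 1024))
```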

Advancing the study of Large-Scale Learning in Overlapped Speech Detection

  • paper_url: http://arxiv.org/abs/2308.05987
  • repo_url: None
  • paper_authors: Zhaohui Yin, Jingguang Tian, Xinhui Hu, Xinkang Xu
  • for: Overlapped speech detection (OSD) in multi-party conversations; most existing OSD models are trained and evaluated on specific datasets, which limits their application scenarios.
  • methods: A study of large-scale learning (LSL) for OSD using 522 hours of labeled audio in different languages and styles, with rigorous comparative experiments over OSD models based on different deep neural networks.
  • results: LSL significantly improves the performance and robustness of OSD models; the Conformer-based CF-OSD model with LSL is currently the best 16K single-channel OSD model, reaching F1-scores of 80.8% and 52.0% on the Alimeeting test set and DIHARD II evaluation set, respectively.
    Abstract Overlapped Speech Detection (OSD) is an important part of speech applications involving analysis of multi-party conversations. However, most of the existing OSD models are trained and evaluated on specific datasets, which limits the application scenarios of these models. In order to solve this problem, we conduct a study of large-scale learning (LSL) in OSD and propose a more general 16K single-channel OSD model. In our study, 522 hours of labeled audio in different languages and styles are collected and used as the large-scale dataset. Rigorous comparative experiments are designed and used to evaluate the effectiveness of LSL in the OSD task and the performance of OSD models based on different deep neural networks. The results show that LSL can significantly improve the performance and robustness of OSD models, and the OSD model based on Conformer (CF-OSD) with LSL is currently the best 16K single-channel OSD model. Moreover, the CF-OSD with LSL establishes a state-of-the-art performance with an F1-score of 80.8% and 52.0% on the Alimeeting test set and DIHARD II evaluation set, respectively.

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

  • paper_url: http://arxiv.org/abs/2308.05734
  • repo_url: https://github.com/haoheliu/AudioLDM2
  • paper_authors: Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
  • for: A framework that generates speech, music, and sound effects with the same learning method, despite the differing objectives and biases of each audio type.
  • methods: A general representation of audio, the language of audio (LOA), into which any audio can be translated via AudioMAE, a self-supervised pre-trained representation learning model; a GPT-2 model translates other modalities into LOA, and a latent diffusion model conditioned on LOA performs self-supervised audio generation learning.
  • results: New state-of-the-art or competitive performance on the major benchmarks of text-to-audio, text-to-music, and text-to-speech. Code and demo are available at https://audioldm.github.io/audioldm2.
    Abstract Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches. Our demo and code are available at https://audioldm.github.io/audioldm2.

cs.CV - 2023-08-11

DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity

  • paper_url: http://arxiv.org/abs/2308.06198
  • repo_url: None
  • paper_authors: Melissa Hall, Candace Ross, Adina Williams, Nicolas Carion, Michal Drozdzal, Adriana Romero Soriano
  • for: To evaluate the realism, diversity, and prompt-generation consistency of text-to-image generative systems when prompted to generate objects from across the world.
  • methods: Three indicators that enable automatic and efficient benchmarking of geographic disparities, complementing qualitative analysis of the broader impact of such systems.
  • results: Models show less realism and diversity of generations when prompted for Africa and West Asia than for Europe; prompting with geographic information comes at a cost to prompt-consistency and diversity; models exhibit more region-level disparities for some objects than others; and progress in image generation quality appears to have come at the cost of real-world geographic representation.
    Abstract The unprecedented photorealistic results achieved by recent text-to-image generative systems and their increasing use as plug-and-play content creation solutions make it crucial to understand their potential biases. In this work, we introduce three indicators to evaluate the realism, diversity and prompt-generation consistency of text-to-image generative systems when prompted to generate objects from across the world. Our indicators complement qualitative analysis of the broader impact of such systems by enabling automatic and efficient benchmarking of geographic disparities, an important step towards building responsible visual content creation systems. We use our proposed indicators to analyze potential geographic biases in state-of-the-art visual content creation systems and find that: (1) models have less realism and diversity of generations when prompting for Africa and West Asia than Europe, (2) prompting with geographic information comes at a cost to prompt-consistency and diversity of generated images, and (3) models exhibit more region-level disparities for some objects than others. Perhaps most interestingly, our indicators suggest that progress in image generation quality has come at the cost of real-world geographic representation. Our comprehensive evaluation constitutes a crucial step towards ensuring a positive experience of visual content creation for everyone.

Towards Packaging Unit Detection for Automated Palletizing Tasks

  • paper_url: http://arxiv.org/abs/2308.06306
  • repo_url: None
  • paper_authors: Markus Völk, Kilian Kleeberger, Werner Kraus, Richard Bormann
  • for: Detecting packaging units, a crucial step preceding their handling by an industrial robot in automated palletizing tasks.
  • methods: An approach fully trained on synthetically generated data that can be applied robustly to arbitrary real-world packaging units without further training or setup effort; it handles sparse and low-quality sensor data, can exploit prior knowledge if available, and generalizes to a wide range of products and application scenarios.
  • results: An extensive evaluation on real-world data with a wide range of retail products demonstrates the practical use of the approach; it has been integrated into a lab demonstrator, and a commercial solution will be marketed through an industrial partner.
    Abstract For various automated palletizing tasks, the detection of packaging units is a crucial step preceding the actual handling of the packaging units by an industrial robot. We propose an approach to this challenging problem that is fully trained on synthetically generated data and can be robustly applied to arbitrary real world packaging units without further training or setup effort. The proposed approach is able to handle sparse and low quality sensor data, can exploit prior knowledge if available and generalizes well to a wide range of products and application scenarios. To demonstrate the practical use of our approach, we conduct an extensive evaluation on real-world data with a wide range of different retail products. Further, we integrated our approach in a lab demonstrator and a commercial solution will be marketed through an industrial partner.

Discovering Local Binary Pattern Equation for Foreground Object Removal in Videos

  • paper_url: http://arxiv.org/abs/2308.06305
  • repo_url: None
  • paper_authors: Caroline Pacheco do Espirito Silva, Andrews Cordolino Sobral, Antoine Vacavant, Thierry Bouwmans, Felippe De Souza
  • for: Automatically discovering Local Binary Pattern (LBP) formulas that remove the moving parts of a scene by segmenting it into a background and a foreground.
  • methods: A novel symbolic regression that discovers LBP formulas automatically, replacing the tedious trial and error otherwise required of human experts.
  • results: Experiments on real videos of outdoor urban scenes under various conditions show the discovered LBPs significantly outperform previous state-of-the-art LBP descriptors both qualitatively and quantitatively.
    Abstract Designing a novel Local Binary Pattern (LBP) process usually relies heavily on human experts' knowledge and experience in the area. Even experts are often left with tedious episodes of trial and error until they identify an optimal LBP for a particular dataset. To address this problem, we present a novel symbolic regression able to automatically discover LBP formulas to remove the moving parts of a scene by segmenting it into a background and a foreground. Experimental results conducted on real videos of outdoor urban scenes under various conditions show that the LBPs discovered by the proposed approach significantly outperform the previous state-of-the-art LBP descriptors both qualitatively and quantitatively. Our source code and data will be available online.
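For context, here is the classic 8-neighbour LBP operator whose hand-designed thresholding/encoding rule the paper's symbolic regression searches over and generalizes:

```python
import numpy as np

def lbp8(img):
    """Classic 8-neighbour Local Binary Pattern on a 2D array: threshold
    each neighbour against the centre pixel and pack the bits into a
    per-pixel 8-bit code (border pixels are left as zero)."""
    out = np.zeros(img.shape, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    c = img[1:-1, 1:-1]  # centre pixels
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        out[1:-1, 1:-1] |= ((nb >= c).astype(np.uint8) << bit)
    return out

codes = lbp8(np.random.rand(64, 64))
```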

Rethinking the Localization in Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2308.06161
  • repo_url: https://github.com/tzzcl/PSOL
  • paper_authors: Rui Xu, Yong Luo, Han Hu, Bo Du, Jialie Shen, Yonggang Wen
  • for: Weakly supervised object localization (WSOL), one of the most popular and challenging tasks in computer vision: localizing objects in images given only image-level supervision.
  • methods: Two improvements to the state-of-the-art two-part pipeline (class-agnostic localization plus classification): replacing single-class regression (SCR) with a binary-class detector (BCD) trained to discriminate foreground from background, so that multiple objects can be localized per image; and a weighted entropy (WE) loss on unlabeled data to reduce the negative impact of noisy pseudo bounding boxes.
  • results: Extensive experiments on the popular CUB-200-2011 and ImageNet-1K datasets demonstrate the effectiveness of the method.
    Abstract Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision. This task is to localize the objects in the images given only the image-level supervision. Recently, dividing WSOL into two parts (class-agnostic object localization and object classification) has become the state-of-the-art pipeline for this task. However, existing solutions under this pipeline usually suffer from the following drawbacks: 1) they are not flexible since they can only localize one object for each image due to the adopted single-class regression (SCR) for localization; 2) the generated pseudo bounding boxes may be noisy, but the negative impact of such noise is not well addressed. To remedy these drawbacks, we first propose to replace SCR with a binary-class detector (BCD) for localizing multiple objects, where the detector is trained by discriminating the foreground and background. Then we design a weighted entropy (WE) loss using the unlabeled data to reduce the negative impact of noisy bounding boxes. Extensive experiments on the popular CUB-200-2011 and ImageNet-1K datasets demonstrate the effectiveness of our method.
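The weighted entropy (WE) loss is not specified in the abstract beyond its purpose. The sketch below shows one plausible reading, a quality-weighted binary entropy over foreground/background predictions, and should not be taken as the paper's exact formula:

```python
import torch

def weighted_entropy_loss(fg_logits, box_quality):
    """Encourage confident foreground/background predictions on unlabeled
    proposals while down-weighting ones whose pseudo boxes look noisy.
    A plausible stand-in for the paper's WE loss (illustrative)."""
    p = torch.sigmoid(fg_logits)                        # P(foreground)
    entropy = -(p * torch.log(p + 1e-8)
                + (1 - p) * torch.log(1 - p + 1e-8))    # per-proposal entropy
    return (box_quality * entropy).mean()               # quality-weighted

loss = weighted_entropy_loss(torch.randn(16), torch.rand(16))
```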

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.06160
  • repo_url: https://github.com/showlab/datasetdm
  • paper_authors: Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen
  • for: A generic dataset generation model that produces diverse synthetic images with perception annotations for training perception models on a variety of downstream tasks.
  • methods: Building on a pre-trained diffusion model, text-guided image synthesis is extended to perception data generation (e.g., segmentation masks and depth) by decoding the diffusion model's rich latent code with a decoder module.
  • results: Training the decoder requires less than 1% (around 100) manually labeled images, enabling an infinitely large annotated dataset; the synthetic data yield state-of-the-art results on semantic and instance segmentation, significantly more robust domain generalization than real data alone, state-of-the-art zero-shot segmentation, and flexibility for efficient application and novel task composition such as image editing.
    Abstract Current deep networks are very data-hungry and benefit from training on largescale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks, and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder only needs less than 1% (around 100 images) manually labeled images, enabling the generation of an infinitely large annotated dataset. Then these synthetic data can be used for training various perception models for downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly more robust on domain generalization than using the real data alone; and state-of-the-art results in zero-shot segmentation setting; and 3) flexibility for efficient application and novel task composition (e.g., image editing). The project website and code can be found at https://weijiawu.github.io/DatasetDM_page/ and https://github.com/showlab/DatasetDM, respectively

Efficient Large-scale AUV-based Visual Seafloor Mapping

  • paper_url: http://arxiv.org/abs/2308.06147
  • repo_url: None
  • paper_authors: Mengkun She, Yifan Song, David Nakath, Kevin Köser
  • for: Efficient, automated robotic 3D reconstruction of hectares of deep seafloor with AUVs.
  • methods: A system that incorporates recent developments in underwater imaging and visual mapping; it detects and reconsiders difficult, weakly registered areas, and leverages a hybrid approach combining the benefits of SLAM and Structure-from-Motion.
  • results: Extensive testing and evaluation during several research cruises demonstrate the system's robustness and practicality in real-world conditions; it runs much faster than incremental reconstructions while achieving at least on-par performance.
    Abstract Driven by the increasing number of marine data science applications, there is a growing interest in surveying and exploring the vast, uncharted terrain of the deep sea with robotic platforms. Despite impressive results achieved by many on-land visual mapping algorithms in the past decades, transferring these methods from land to the deep sea remains a challenge due to harsh environmental conditions. Typically, deep-sea exploration involves the use of autonomous underwater vehicles (AUVs) equipped with high-resolution cameras and artificial illumination systems. However, images obtained in this manner often suffer from heterogeneous illumination and quality degradation due to attenuation and scattering, on top of refraction of light rays. All of this together often lets on-land SLAM approaches fail underwater or makes Structure-from-Motion approaches drift or omit difficult images, resulting in gaps, jumps or weakly registered areas. In this work, we present a system that incorporates recent developments in underwater imaging and visual mapping to facilitate automated robotic 3D reconstruction of hectares of seafloor. Our approach is efficient in that it detects and reconsiders difficult, weakly registered areas, to avoid omitting images and to make better use of limited dive time; on the other hand it is computationally efficient; leveraging a hybrid approach combining benefits from SLAM and Structure-from-Motion that runs much faster than incremental reconstructions while achieving at least on-par performance. The proposed system has been extensively tested and evaluated during several research cruises, demonstrating its robustness and practicality in real-world conditions.

CompTLL-UNet: Compressed Domain Text-Line Localization in Challenging Handwritten Documents using Deep Feature Learning from JPEG Coefficients

  • paper_url: http://arxiv.org/abs/2308.06142
  • repo_url: None
  • paper_authors: Bulla Rajesh, Sk Mahafuz Zaman, Mohammed Javed, P. Nagabhushan
  • for: Localizing text-lines in challenging handwritten document images directly from their JPEG compressed coefficients, without full decompression.
  • methods: A modified U-Net architecture, the Compressed Text-Line Localization Network (CompTLL-UNet), that learns deep features directly from JPEG compressed coefficients.
  • results: Trained and tested on JPEG compressed versions of the ICDAR2017 (cBAD) and ICDAR2019 (cBAD) benchmarks, it achieves state-of-the-art performance with reduced storage and computational costs in the JPEG compressed domain.
    Abstract Automatic localization of text-lines in handwritten documents is still an open and challenging research problem. Various writing issues such as uneven spacing between the lines, oscillating and touching text, and the presence of skew become much more challenging when the case of complex handwritten document images are considered for segmentation directly in their respective compressed representation. This is because, the conventional way of processing compressed documents is through decompression, but here in this paper, we propose an idea that employs deep feature learning directly from the JPEG compressed coefficients without full decompression to accomplish text-line localization in the JPEG compressed domain. A modified U-Net architecture known as Compressed Text-Line Localization Network (CompTLL-UNet) is designed to accomplish it. The model is trained and tested with JPEG compressed version of benchmark datasets including ICDAR2017 (cBAD) and ICDAR2019 (cBAD), reporting the state-of-the-art performance with reduced storage and computational costs in the JPEG compressed domain.
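A small sketch of producing the kind of JPEG-domain input such a network consumes: blockwise 8x8 DCT coefficients of a grayscale page. Note that real JPEG files additionally quantize and entropy-code these coefficients, so a full pipeline would read them out of the bitstream rather than recompute them as below:

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct(img, block=8):
    """8x8 block DCT coefficients of a grayscale image, the JPEG-style
    representation a compressed-domain network can take as input
    (quantization and entropy coding are omitted in this sketch)."""
    h, w = (d - d % block for d in img.shape)  # crop to block multiples
    img = img[:h, :w]
    blocks = img.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return dctn(blocks, axes=(-2, -1), norm="ortho")

coeffs = blockwise_dct(np.random.rand(128, 128))
print(coeffs.shape)  # (16, 16, 8, 8): one 8x8 coefficient block per patch
```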

Uncertainty Quantification for Image-based Traffic Prediction across Cities

  • paper_url: http://arxiv.org/abs/2308.06129
  • repo_url: https://github.com/alextimans/traffic4cast-uncertainty
  • paper_authors: Alexander Timans, Nina Wiedemann, Nishant Kumar, Ye Hong, Martin Raubal
  • for: This paper investigates the application of uncertainty quantification (UQ) methods for traffic prediction and evaluates their effectiveness in providing meaningful uncertainty estimates for city-wide traffic dynamics.
  • methods: Two epistemic and two aleatoric UQ methods are compared on both temporal and spatio-temporal transfer tasks, and the uncertainty estimates are employed for unsupervised outlier detection on changes in city traffic dynamics.
  • results: Meaningful uncertainty estimates can be recovered with UQ methods, and these estimates capture both temporal and spatial effects on traffic behavior in a representative case study for the city of Moscow.
    Abstract Despite the strong predictive performance of deep learning models for traffic prediction, their widespread deployment in real-world intelligent transportation systems has been restrained by a lack of interpretability. Uncertainty quantification (UQ) methods provide an approach to induce probabilistic reasoning, improve decision-making and enhance model deployment potential. To gain a comprehensive picture of the usefulness of existing UQ methods for traffic prediction and the relation between obtained uncertainties and city-wide traffic dynamics, we investigate their application to a large-scale image-based traffic dataset spanning multiple cities and time periods. We compare two epistemic and two aleatoric UQ methods on both temporal and spatio-temporal transfer tasks, and find that meaningful uncertainty estimates can be recovered. We further demonstrate how uncertainty estimates can be employed for unsupervised outlier detection on changes in city traffic dynamics. We find that our approach can capture both temporal and spatial effects on traffic behaviour in a representative case study for the city of Moscow. Our work presents a further step towards boosting uncertainty awareness in traffic prediction tasks, and aims to highlight the value contribution of UQ methods to a better understanding of city traffic dynamics.
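One widely used epistemic UQ method of the kind such comparisons include is Monte Carlo dropout (whether it is among the paper's two epistemic methods is an assumption). The toy sketch below keeps dropout active at test time and reads uncertainty off the spread of repeated predictions:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=20):
    """Epistemic uncertainty via Monte Carlo dropout: dropout stays active
    at test time and the spread of repeated forward passes is the
    uncertainty estimate."""
    model.train()  # enables dropout; safe here since the toy model has no batchnorm
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # prediction, uncertainty

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                    nn.Dropout(0.2), nn.Linear(64, 1))
mean, std = mc_dropout_predict(net, torch.randn(4, 8))
```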

Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow

  • paper_url: http://arxiv.org/abs/2308.06101
  • repo_url: https://github.com/bcmi/DCI-VTON-Virtual-Try-On
  • paper_authors: Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, Liqing Zhang
  • for: Virtual try-on: transferring clothes from one image to another while preserving the details of both the person and the clothes.
  • methods: A diffusion model guided by a warping module: the clothes are first warped to preserve their local details, then combined with a clothes-agnostic person image plus noise as the diffusion model's input, with the warped clothes also serving as local conditions at each denoising step.
  • results: The approach, Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), produces high-quality and realistic virtual try-on results; experiments on VITON-HD demonstrate its effectiveness and superiority.
    Abstract Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both humans and clothes. While many existing methods rely on Generative Adversarial Networks (GANs) to achieve this, flaws can still occur, particularly at high resolutions. Recently, the diffusion model has emerged as a promising alternative for generating high-quality images in various applications. However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively. The warping module performs initial processing on the clothes, which helps to preserve the local details of the clothes. We then combine the warped clothes with clothes-agnostic person image and add noise as the input of diffusion model. Additionally, the warped clothes is used as local conditions for each denoising process to ensure that the resulting output retains as much detail as possible. Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model, and the incorporation of the warping module helps to produce high-quality and realistic virtual try-on results. Experimental results on VITON-HD demonstrate the effectiveness and superiority of our method.

Diffusion-based Visual Counterfactual Explanations – Towards Systematic Quantitative Evaluation

  • paper_url: http://arxiv.org/abs/2308.06100
  • repo_url: https://github.com/cairo-thws/dbvce_eval
  • paper_authors: Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, Magda Gregorova
  • for: To enable systematic, quantitative evaluation of the latest visual counterfactual explanation (VCE) methods, together with a minimal set of metrics to be used.
  • methods: The proposed framework is used to explore the effects of crucial design choices in the latest diffusion-based generative models for VCEs of natural image classification (ImageNet), through a battery of ablation-like experiments generating thousands of VCEs for classifiers of various complexity, accuracy, and robustness.
  • results: The findings suggest multiple directions for future advancement of VCE methods; the shared methodology and complete code base offer guidance for consistent and transparent assessment of counterfactual explanations.
    Abstract Latest methods for visual counterfactual explanations (VCE) harness the power of deep generative models to synthesize new examples of high-dimensional images of impressive quality. However, it is currently difficult to compare the performance of these VCE methods as the evaluation procedures largely vary and often boil down to visual inspection of individual examples and small scale user studies. In this work, we propose a framework for systematic, quantitative evaluation of the VCE methods and a minimal set of metrics to be used. We use this framework to explore the effects of certain crucial design choices in the latest diffusion-based generative models for VCEs of natural image classification (ImageNet). We conduct a battery of ablation-like experiments, generating thousands of VCEs for a suite of classifiers of various complexity, accuracy and robustness. Our findings suggest multiple directions for future advancements and improvements of VCE methods. By sharing our methodology and our approach to tackle the computational challenges of such a study on a limited hardware setup (including the complete code base), we offer a valuable guidance for researchers in the field fostering consistency and transparency in the assessment of counterfactual explanations.

Automated Construction of Time-Space Diagrams for Traffic Analysis Using Street-View Video Sequence

  • paper_url: http://arxiv.org/abs/2308.06098
  • repo_url: None
  • paper_authors: Tanay Rastogi, Mårten Björkman
  • for: This paper constructs time-space diagrams from street-view video sequences captured by cameras mounted on moving vehicles, in order to analyze traffic flow and optimize transportation infrastructure and traffic management strategies.
  • methods: The study infers vehicle trajectories from the video data using state-of-the-art YOLOv5, StrongSORT, and photogrammetry-based distance estimation, and generates time-space diagrams from the resulting trajectories.
  • results: Evaluation shows that the method can extract vehicle trajectories from video data, with some residual errors that can be mitigated by improving the detector, tracker, and distance-calculation components.
    Abstract Time-space diagrams are essential tools for analyzing traffic patterns and optimizing transportation infrastructure and traffic management strategies. Traditional data collection methods for these diagrams have limitations in terms of temporal and spatial coverage. Recent advancements in camera technology have overcome these limitations and provided extensive urban data. In this study, we propose an innovative approach to constructing time-space diagrams by utilizing street-view video sequences captured by cameras mounted on moving vehicles. Using the state-of-the-art YOLOv5, StrongSORT, and photogrammetry techniques for distance calculation, we can infer vehicle trajectories from the video data and generate time-space diagrams. To evaluate the effectiveness of our proposed method, we utilized datasets from the KITTI computer vision benchmark suite. The evaluation results demonstrate that our approach can generate trajectories from video data, although there are some errors that can be mitigated by improving the performance of the detector, tracker, and distance calculation components. In conclusion, the utilization of street-view video sequences captured by cameras mounted on moving vehicles, combined with state-of-the-art computer vision techniques, has immense potential for constructing comprehensive time-space diagrams. These diagrams offer valuable insights into traffic patterns and contribute to the design of transportation infrastructure and traffic management strategies.

RIGID: Recurrent GAN Inversion and Editing of Real Face Videos

  • paper_url: http://arxiv.org/abs/2308.06097
  • repo_url: None
  • paper_authors: Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo
  • for: This work aims to bring the powerful editability of GANs to real videos; existing methods invert video frames individually, which often leads to temporal inconsistency.
  • methods: The authors propose a unified recurrent framework, RIGID, that explicitly and simultaneously enforces temporally coherent GAN inversion and facial editing, modeling temporal relations between frames from three aspects.
  • results: The method outperforms existing approaches both qualitatively and quantitatively, and can be applied to a variety of editing tasks.
    Abstract GAN inversion is indispensable for applying the powerful editability of GAN to real images. However, existing methods invert video frames individually often leading to undesired inconsistent results over time. In this paper, we propose a unified recurrent framework, named \textbf{R}ecurrent v\textbf{I}deo \textbf{G}AN \textbf{I}nversion and e\textbf{D}iting (RIGID), to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos. Our approach models the temporal relations between current and previous frames from three aspects. To enable a faithful real video reconstruction, we first maximize the inversion fidelity and consistency by learning a temporal compensated latent code. Second, we observe incoherent noises lie in the high-frequency domain that can be disentangled from the latent space. Third, to remove the inconsistency after attribute manipulation, we propose an \textit{in-between frame composition constraint} such that the arbitrary frame must be a direct composite of its neighboring frames. Our unified framework learns the inherent coherence between input frames in an end-to-end manner, and therefore it is agnostic to a specific attribute and can be applied to arbitrary editing of the same video without re-training. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods qualitatively and quantitatively in both inversion and editing tasks. The deliverables can be found in \url{https://cnnlstm.github.io/RIGID}

Experts Weights Averaging: A New General Training Scheme for Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.06093
  • repo_url: None
  • paper_authors: Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Wanli Ouyang
  • for: This paper proposes a new general training scheme for Vision Transformers (ViTs) that improves performance without increasing inference cost.
  • methods: The scheme exploits the Mixture-of-Experts (MoE) mechanism, using MoEs in place of some FFNs during training and converting them back to FFNs for inference, improving performance while keeping inference cost unchanged.
  • results: Experiments show that the scheme improves performance across multiple 2D and 3D visual tasks, ViT architectures, and datasets, and can also be applied when fine-tuning ViTs for further gains.
    Abstract Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs), which achieves performance improvement without increasing inference cost. As Vision Transformers (ViTs) are gradually surpassing CNNs in various visual tasks, one may question: if a training scheme specifically for ViTs exists that can also achieve performance improvement without increasing inference cost? Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we affirmatively answer these questions, with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into original ViT for inference. We further provide a theoretical analysis to show why and how it works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. Besides, our training scheme can also be applied to improve performance when fine-tuning ViTs. Lastly, but equally important, the proposed EWA technique can significantly improve the effectiveness of naive MoE in various 2D visual small datasets and 3D visual tasks.
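The train/infer decoupling lends itself to a short sketch. Below is a minimal PyTorch illustration of the two weight-averaging operations implied by the abstract: pulling each expert's weights toward the experts' mean at the end of each iteration (one plausible reading of EWA), and collapsing a trained MoE back into a single FFN for inference. The momentum `beta` and module layout are assumptions, not the paper's settings:

```python
import copy
import torch

@torch.no_grad()
def ewa_step(experts, beta=0.99):
    """Pull every expert's parameters toward the experts' mean."""
    avg = {k: sum(e.state_dict()[k] for e in experts) / len(experts)
           for k in experts[0].state_dict()}
    for e in experts:
        sd = e.state_dict()
        for k in sd:
            sd[k].mul_(beta).add_(avg[k], alpha=1 - beta)

@torch.no_grad()
def collapse_moe_to_ffn(experts):
    """After training, convert the MoE back into one FFN by exact averaging,
    so inference runs on the original ViT architecture."""
    ffn = copy.deepcopy(experts[0])
    sd = ffn.state_dict()
    for k in sd:
        sd[k].copy_(sum(e.state_dict()[k] for e in experts) / len(experts))
    return ffn
```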

Versatile Face Animator: Driving Arbitrary 3D Facial Avatar in RGBD Space

  • paper_url: http://arxiv.org/abs/2308.06076
  • repo_url: None
  • paper_authors: Haoyu Wang, Haozhe Wu, Junliang Xing, Jia Jia
  • for: This paper proposes a new automated facial animation framework for creating realistic 3D facial animation in the film and gaming industries, reducing cost and increasing efficiency.
  • methods: The method combines facial motion capture and motion retargeting in an end-to-end manner, generating face animation without blendshapes or rigs. Specifically, the authors propose an RGBD animation module that learns facial motion from raw RGBD videos via hierarchical motion dictionaries and animates RGBD images rendered from the 3D facial mesh coarse-to-fine, so that high-quality 3D facial animation can be produced regardless of the model's topology, textures, blendshapes, or rigs.
  • results: Experiments show that the proposed framework produces compelling 3D facial animation results, highlighting its potential as a promising solution for cost-effective and efficient production of facial animation in the metaverse.
    Abstract Creating realistic 3D facial animation is crucial for various applications in the movie production and gaming industry, especially with the burgeoning demand in the metaverse. However, prevalent methods such as blendshape-based approaches and facial rigging techniques are time-consuming, labor-intensive, and lack standardized configurations, making facial animation production challenging and costly. In this paper, we propose a novel self-supervised framework, Versatile Face Animator, which combines facial motion capture with motion retargeting in an end-to-end manner, eliminating the need for blendshapes or rigs. Our method has the following two main characteristics: 1) we propose an RGBD animation module to learn facial motion from raw RGBD videos by hierarchical motion dictionaries and animate RGBD images rendered from 3D facial mesh coarse-to-fine, enabling facial animation on arbitrary 3D characters regardless of their topology, textures, blendshapes, and rigs; and 2) we introduce a mesh retarget module to utilize RGBD animation to create 3D facial animation by manipulating facial mesh with controller transformations, which are estimated from dense optical flow fields and blended together with geodesic-distance-based weights. Comprehensive experiments demonstrate the effectiveness of our proposed framework in generating impressive 3D facial animation results, highlighting its potential as a promising solution for the cost-effective and efficient production of facial animation in the metaverse.

Out-of-Distribution Detection for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2308.06072
  • repo_url: None
  • paper_authors: Julia Hornauer, Adrian Holzbock, Vasileios Belagiannis
  • for: This work improves uncertainty estimation for monocular depth estimation, where prior approaches mainly target the data uncertainty introduced by image noise.
  • methods: Anomaly detection techniques are used to detect out-of-distribution (OOD) images for encoder-decoder depth estimation models: features are extracted with the frozen depth encoder, and an image decoder is trained for reconstruction using only in-distribution data. OOD images therefore produce high reconstruction errors, which distinguish in-distribution from OOD samples.
  • results: Experiments on the standard NYU Depth V2 and KITTI benchmarks show that this post hoc method performs remarkably well without modifying the trained encoder-decoder depth estimation model, and outperforms existing uncertainty estimation approaches.
    Abstract In monocular depth estimation, uncertainty estimation approaches mainly target the data uncertainty introduced by image noise. In contrast to prior work, we address the uncertainty due to lack of knowledge, which is relevant for the detection of data not represented by the training distribution, the so-called out-of-distribution (OOD) data. Motivated by anomaly detection, we propose to detect OOD images from an encoder-decoder depth estimation model based on the reconstruction error. Given the features extracted with the fixed depth encoder, we train an image decoder for image reconstruction using only in-distribution data. Consequently, OOD images result in a high reconstruction error, which we use to distinguish between in- and out-of-distribution samples. We built our experiments on the standard NYU Depth V2 and KITTI benchmarks as in-distribution data. Our post hoc method performs astonishingly well on different models and outperforms existing uncertainty estimation approaches without modifying the trained encoder-decoder depth estimation model.
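A minimal sketch of the post hoc pipeline, under the assumption of a frozen depth encoder, an auxiliary image decoder trained only on in-distribution data, and an L1 reconstruction error as the OOD score; module names and the loss choice are illustrative:

```python
import torch
import torch.nn.functional as F

def train_reconstruction_decoder(depth_encoder, decoder, loader, optimizer, epochs=10):
    depth_encoder.eval()  # frozen: only in-distribution reconstruction is learned
    for _ in range(epochs):
        for images, _ in loader:
            with torch.no_grad():
                feats = depth_encoder(images)
            recon = decoder(feats)
            loss = F.l1_loss(recon, images)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

@torch.no_grad()
def ood_score(depth_encoder, decoder, image):
    """High reconstruction error => likely out-of-distribution."""
    recon = decoder(depth_encoder(image))
    return F.l1_loss(recon, image, reduction="mean").item()
```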

Head Rotation in Denoising Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.06057
  • repo_url: https://github.com/asperti/head-rotation
  • paper_authors: Andrea Asperti, Gabriele Colasuonno, Antonio Guerra
  • for: This work explores Denoising Diffusion Models (DDMs) for deep generative modeling, in particular manipulating and editing attributes of generated images, with a focus on face rotation.
  • methods: The study leverages a recent embedding technique for Denoising Diffusion Implicit Models (DDIM) to explore the semantics of the high-dimensional latent space and edit attributes.
  • results: By fitting linear regressions over latent representations of dataset samples, rotations over a wide yaw range of $\pm 30^\circ$ can be achieved while preserving the individual's distinct characteristics. As a byproduct, the study provides a labeling of CelebA that categorizes images into three major groups by illumination direction: left, center, and right.
    Abstract Denoising Diffusion Models (DDM) are emerging as the cutting-edge technology in the realm of deep generative modeling, challenging the dominance of Generative Adversarial Networks. However, effectively exploring the latent space's semantics and identifying compelling trajectories for manipulating and editing important attributes of the generated samples remains challenging, primarily due to the high-dimensional nature of the latent space. In this study, we specifically concentrate on face rotation, which is known to be one of the most intricate editing operations. By leveraging a recent embedding technique for Denoising Diffusion Implicit Models (DDIM), we achieve, in many cases, noteworthy manipulations encompassing a wide rotation angle of $\pm 30^o$, preserving the distinct characteristics of the individual. Our methodology exploits the computation of trajectories approximating clouds of latent representations of dataset samples with different yaw rotations through linear regression. Specific trajectories are obtained by restricting the analysis to subsets of data sharing significant attributes with the source image. One of these attributes is the light provenance: a byproduct of our research is a labeling of CelebA, categorizing images into three major groups based on the illumination direction: left, center, and right.
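The regression step can be illustrated in a few lines of NumPy: fit a line through latent codes parameterized by yaw, then shift a new latent along the fitted direction. This is a sketch of the idea only; the paper additionally restricts the fit to data subsets that share attributes (such as light provenance) with the source image:

```python
import numpy as np

def fit_yaw_direction(latents, yaws):
    """latents: (N, D) flattened DDIM embeddings; yaws: (N,) angles in degrees.
    Least-squares line through the latent cloud, parameterized by yaw."""
    X = np.stack([yaws.astype(float), np.ones(len(yaws))], axis=1)   # (N, 2)
    coef, *_ = np.linalg.lstsq(X, latents, rcond=None)               # (2, D)
    direction, intercept = coef[0], coef[1]
    return direction, intercept

def rotate_latent(z, direction, delta_yaw):
    """Shift a latent by delta_yaw degrees along the regression direction."""
    return z + delta_yaw * direction
```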

Computer-Aided Cytology Diagnosis in Animals: CNN-Based Image Quality Assessment for Accurate Disease Classification

  • paper_url: http://arxiv.org/abs/2308.06055
  • repo_url: None
  • paper_authors: Jan Krupiński, Maciej Wielgosz, Szymon Mazurek, Krystian Strzałka, Paweł Russek, Jakub Caputa, Daria Łukasik, Jakub Grzeszczyk, Michał Karwatowski, Rafał Fraczek, Ernest Jamro, Marcin Pietroń, Sebastian Koryciak, Agnieszka Dąbrowska-Boruch, Kazimierz Wiatr
  • for: This paper develops a computer-aided cytology diagnosis system for animals.
  • methods: The system performs image quality assessment (IQA) with convolutional neural networks (CNNs).
  • results: The study finds that CNN-based IQA improves the accuracy of disease classification.
    Abstract This paper presents a computer-aided cytology diagnosis system designed for animals, focusing on image quality assessment (IQA) using Convolutional Neural Networks (CNNs). The system's building blocks are tailored to seamlessly integrate IQA, ensuring reliable performance in disease classification. We extensively investigate the CNN's ability to handle various image variations and scenarios, analyzing the impact on detecting low-quality input data. Additionally, the network's capacity to differentiate valid cellular samples from those with artifacts is evaluated. Our study employs a ResNet18 network architecture and explores the effects of input sizes and cropping strategies on model performance. The research sheds light on the significance of CNN-based IQA in computer-aided cytology diagnosis for animals, enhancing the accuracy of disease classification.

Hardware Accelerators in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.06054
  • repo_url: None
  • paper_authors: Ken Power, Shailendra Deva, Ting Wang, Julius Li, Ciarán Eising
  • for: This work surveys machine learning accelerators in autonomous vehicles, aimed at improving the performance and reliability of machine vision tasks.
  • methods: The paper reviews the use of machine learning models and special-purpose processors to accelerate machine vision tasks.
  • results: The paper offers recommendations for researchers and practitioners and highlights directions for ongoing and future research on machine learning accelerators for autonomous driving.
    Abstract Computing platforms in autonomous vehicles record large amounts of data from many sensors, process the data through machine learning models, and make decisions to ensure the vehicle's safe operation. Fast, accurate, and reliable decision-making is critical. Traditional computer processors lack the power and flexibility needed for the perception and machine vision demands of advanced autonomous driving tasks. Hardware accelerators are special-purpose coprocessors that help autonomous vehicles meet performance requirements for higher levels of autonomy. This paper provides an overview of ML accelerators with examples of their use for machine vision in autonomous vehicles. We offer recommendations for researchers and practitioners and highlight a trajectory for ongoing and future research in this emerging field.

Towards Instance-adaptive Inference for Federated Learning

  • paper_url: http://arxiv.org/abs/2308.06051
  • repo_url: https://github.com/chunmeifeng/fedins
  • paper_authors: Chun-Mei Feng, Kai Yu, Nian Liu, Xinxing Xu, Salman Khan, Wangmeng Zuo
  • for: This paper addresses client data heterogeneity in federated learning (FL), in particular the intra-client heterogeneity that arises in complex real-world data.
  • methods: The proposed FL algorithm, FedIns, handles intra-client data heterogeneity by enabling instance-adaptive inference within the FL framework, reducing the impact of data heterogeneity.
  • results: Experiments show that FedIns outperforms state-of-the-art methods, e.g., a 6.64% improvement over the top-performing method with less than 15% of its communication cost.
    Abstract Federated learning (FL) is a distributed learning paradigm that enables multiple clients to learn a powerful global model by aggregating local training. However, the performance of the global model is often hampered by non-i.i.d. distribution among the clients, requiring extensive efforts to mitigate inter-client data heterogeneity. Going beyond inter-client data heterogeneity, we note that intra-client heterogeneity can also be observed on complex real-world data and seriously deteriorate FL performance. In this paper, we present a novel FL algorithm, i.e., FedIns, to handle intra-client data heterogeneity by enabling instance-adaptive inference in the FL framework. Instead of huge instance-adaptive models, we resort to a parameter-efficient fine-tuning method, i.e., scale and shift deep features (SSF), upon a pre-trained model. Specifically, we first train an SSF pool for each client, and aggregate these SSF pools on the server side, thus still maintaining a low communication cost. To enable instance-adaptive inference, for a given instance, we dynamically find the best-matched SSF subsets from the pool and aggregate them to generate an adaptive SSF specified for the instance, thereby reducing the intra-client as well as the inter-client heterogeneity. Extensive experiments show that our FedIns outperforms state-of-the-art FL algorithms, e.g., a 6.64\% improvement against the top-performing method with less than 15\% communication cost on Tiny-ImageNet. Our code and models will be publicly released.
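A minimal sketch of the SSF building block and of the fusion into an instance-adaptive module, assuming simple uniform averaging over the best-matched subset (the paper's matching and aggregation are more elaborate):

```python
import torch
import torch.nn as nn

class SSF(nn.Module):
    """Scale-and-shift of deep features: tiny, cheap to communicate,
    applied on top of a frozen pre-trained backbone."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):            # x: (..., dim)
        return x * self.scale + self.shift

def aggregate_ssf(ssf_subset):
    """Instance-adaptive inference: blend the best-matched SSF modules
    from the server-side pool into one adaptive SSF for a test instance."""
    dim = ssf_subset[0].scale.numel()
    fused = SSF(dim)
    with torch.no_grad():
        fused.scale.copy_(torch.stack([m.scale for m in ssf_subset]).mean(0))
        fused.shift.copy_(torch.stack([m.shift for m in ssf_subset]).mean(0))
    return fused
```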

Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning

  • paper_url: http://arxiv.org/abs/2308.06038
  • repo_url: https://github.com/chunmeifeng/DiffTPT
  • paper_authors: Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, Wangmeng Zuo
  • for: This paper proposes DiffTPT, a new test-time prompt tuning (TPT) method for adapting pre-trained vision-language models to test samples from unseen new domains.
  • methods: The method leverages pre-trained diffusion models to generate diverse and informative new data, expanding the model's ability to adapt to unseen new domains.
  • results: On the test datasets, DiffTPT improves zero-shot accuracy by an average of 5.13% over the state-of-the-art TPT method.
    Abstract Benefiting from prompt tuning, recent years have witnessed the promising performance of pre-trained vision-language models, e.g., CLIP, on versatile downstream tasks. In this paper, we focus on a particular setting of learning adaptive prompts on the fly for each test sample from an unseen new domain, which is known as test-time prompt tuning (TPT). Existing TPT methods typically rely on data augmentation and confidence selection. However, conventional data augmentation techniques, e.g., random resized crops, suffers from the lack of data diversity, while entropy-based confidence selection alone is not sufficient to guarantee prediction fidelity. To address these issues, we propose a novel TPT method, named DiffTPT, which leverages pre-trained diffusion models to generate diverse and informative new data. Specifically, we incorporate augmented data by both conventional method and pre-trained stable diffusion to exploit their respective merits, improving the models ability to adapt to unknown new test data. Moreover, to ensure the prediction fidelity of generated data, we introduce a cosine similarity-based filtration technique to select the generated data with higher similarity to the single test sample. Our experiments on test datasets with distribution shifts and unseen categories demonstrate that DiffTPT improves the zero-shot accuracy by an average of 5.13\% compared to the state-of-the-art TPT method. Our code and models will be publicly released.
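The filtration step is easy to sketch: score each diffusion-generated view by the cosine similarity of its CLIP image feature to the test sample's feature, and keep only the top fraction. The `keep_ratio` and the encoder interface are assumptions for illustration; `generated_images` is taken to be an (N, 3, H, W) tensor:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_augmentations(clip_image_encoder, test_image, generated_images, keep_ratio=0.5):
    f_test = F.normalize(clip_image_encoder(test_image.unsqueeze(0)), dim=-1)  # (1, D)
    f_gen = F.normalize(clip_image_encoder(generated_images), dim=-1)          # (N, D)
    sims = (f_gen @ f_test.t()).squeeze(-1)                                    # (N,)
    k = max(1, int(keep_ratio * len(generated_images)))
    keep = sims.topk(k).indices          # keep views most faithful to the test sample
    return generated_images[keep]
```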

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2308.06027
  • repo_url: None
  • paper_authors: Yuki Endo
  • for: This paper proposes a method for spatially controlling text-to-image generation without additional training of diffusion models.
  • methods: The method builds on the positional relationships encoded in cross-attention maps, first controlling attention by directly swapping cross-attention maps with constant maps computed from the semantic regions. It further proposes masked-attention guidance, which controls attention by manipulating the noise images fed to the diffusion model, producing images more faithful to the semantic masks.
  • results: Experiments show the method achieves more accurate spatial control than baselines and generates images more faithful to the semantic masks.
    Abstract Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone has high spatial ambiguity and limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g, sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that the cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. Moreover, we propose masked-attention guidance, which can generate images more faithful to semantic masks than the first approach. Masked-attention guidance indirectly controls attention to each word and pixel according to the semantic regions by manipulating noise images fed to diffusion models. Experiments show that our method enables more accurate spatial control than baselines qualitatively and quantitatively.
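A minimal sketch of the direct-swap variant described above: the cross-attention column of a word token is replaced by a constant map that is uniform over the token's semantic region. The attention layout (heads, pixels, tokens) and the mask format are assumptions for illustration:

```python
import torch

def swap_attention_with_mask(attn, token_index, region_mask):
    """attn: (heads, pixels, tokens) softmaxed cross-attention;
    region_mask: (pixels,) float 0/1 mask of the word's semantic region."""
    const_map = region_mask / region_mask.sum().clamp(min=1)  # uniform over region
    attn = attn.clone()
    attn[:, :, token_index] = const_map.unsqueeze(0)          # broadcast over heads
    return attn
```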

Spatial-information Guided Adaptive Context-aware Network for Efficient RGB-D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.06024
  • repo_url: https://github.com/mvme-hbut/sgacnet
  • paper_authors: Yang Zhang, Chenyun Xiong, Junjie Liu, Xuhui Ye, Guodong Sun
  • for: This paper addresses RGB-D semantic segmentation for mobile robots, i.e., segmenting objects and scenes in the environment and classifying them.
  • methods: The paper proposes an efficient encoder-decoder network that captures multi-level RGB-D features with channel and spatial fusion attention modules, together with a globally guided local affinity context module to obtain sufficient high-level context information.
  • results: Experiments show the method achieves a better balance among segmentation accuracy, inference time, and parameter count than state-of-the-art methods on the NYUv2, SUN RGB-D, and Cityscapes datasets.
    Abstract Efficient RGB-D semantic segmentation has received considerable attention in mobile robots, which plays a vital role in analyzing and recognizing environmental information. According to previous studies, depth information can provide corresponding geometric relationships for objects and scenes, but actual depth data usually exist as noise. To avoid unfavorable effects on segmentation accuracy and computation, it is necessary to design an efficient framework to leverage cross-modal correlations and complementary cues. In this paper, we propose an efficient lightweight encoder-decoder network that reduces the computational parameters and guarantees the robustness of the algorithm. Working with channel and spatial fusion attention modules, our network effectively captures multi-level RGB-D features. A globally guided local affinity context module is proposed to obtain sufficient high-level context information. The decoder utilizes a lightweight residual unit that combines short- and long-distance information with a few redundant computations. Experimental results on NYUv2, SUN RGB-D, and Cityscapes datasets show that our method achieves a better trade-off among segmentation accuracy, inference time, and parameters than the state-of-the-art methods. The source code will be at https://github.com/MVME-HBUT/SGACNet

Scale-Preserving Automatic Concept Extraction (SPACE)

  • paper_url: http://arxiv.org/abs/2308.06022
  • repo_url: https://github.com/data-science-in-mechanical-engineering/space
  • paper_authors: Andrés Felipe Posada-Moreno, Lukas Kreisköther, Tassilo Glander, Sebastian Trimpe
  • for: This work improves the reliability and transparency of convolutional neural networks (CNNs) in Industry 4.0, reducing economic losses and risks to human life.
  • methods: The method operates on square slices of input images, avoiding scale distortion and providing comprehensive concept descriptions.
  • results: On three image classification datasets in the domain of industrial quality control, SPACE outperforms other methods and provides actionable concept explanations that help understand the decision mechanisms of CNNs.
    Abstract Convolutional Neural Networks (CNN) have become a common choice for industrial quality control, as well as other critical applications in the Industry 4.0. When these CNNs behave in ways unexpected to human users or developers, severe consequences can arise, such as economic losses or an increased risk to human life. Concept extraction techniques can be applied to increase the reliability and transparency of CNNs through generating global explanations for trained neural network models. The decisive features of image datasets in quality control often depend on the feature's scale; for example, the size of a hole or an edge. However, existing concept extraction methods do not correctly represent scale, which leads to problems interpreting these models as we show herein. To address this issue, we introduce the Scale-Preserving Automatic Concept Extraction (SPACE) algorithm, as a state-of-the-art alternative concept extraction technique for CNNs, focused on industrial applications. SPACE is specifically designed to overcome the aforementioned problems by avoiding scale changes throughout the concept extraction process. SPACE proposes an approach based on square slices of input images, which are selected and then tiled before being clustered into concepts. Our method provides explanations of the models' decision-making process in the form of human-understandable concepts. We evaluate SPACE on three image classification datasets in the context of industrial quality control. Through experimental results, we illustrate how SPACE outperforms other methods and provides actionable insights on the decision mechanisms of CNNs. Finally, code for the implementation of SPACE is provided.

Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation

  • paper_url: http://arxiv.org/abs/2308.06015
  • repo_url: https://github.com/liuxuannan/stochastic-gradient-aggregation
  • paper_authors: Xuannan Liu, Yaoyao Zhong, Yuhang Zhang, Lixiong Qin, Weihong Deng
  • for: This work improves the generalization ability of universal adversarial perturbations (UAPs), addressing the gradient vanishing and poor local optima problems of UAP generation methods.
  • methods: The proposed Stochastic Gradient Aggregation (SGA) method performs multiple inner iterations of small-batch training and aggregates the inner gradients, stabilizing the gradient and reducing quantization errors to improve UAP generalization.
  • results: Extensive experiments on the standard ImageNet dataset demonstrate that the method significantly enhances the generalization ability of UAPs and outperforms other state-of-the-art methods.
    Abstract Deep neural networks are vulnerable to universal adversarial perturbation (UAP), an instance-agnostic perturbation capable of fooling the target model for most samples. Compared to instance-specific adversarial examples, UAP is more challenging as it needs to generalize across various samples and models. In this paper, we examine the serious dilemma of UAP generation methods from a generalization perspective -- the gradient vanishing problem using small-batch stochastic gradient optimization and the local optima problem using large-batch optimization. To address these problems, we propose a simple and effective method called Stochastic Gradient Aggregation (SGA), which alleviates the gradient vanishing and escapes from poor local optima at the same time. Specifically, SGA employs the small-batch training to perform multiple iterations of inner pre-search. Then, all the inner gradients are aggregated as a one-step gradient estimation to enhance the gradient stability and reduce quantization errors. Extensive experiments on the standard ImageNet dataset demonstrate that our method significantly enhances the generalization ability of UAP and outperforms other state-of-the-art methods. The code is available at https://github.com/liuxuannan/Stochastic-Gradient-Aggregation.
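A minimal sketch of one SGA outer step, assuming a sign-based ascent on the perturbation under an L-infinity budget (standard in UAP work); the step sizes and the budget `eps` are illustrative:

```python
import torch

def sga_update(model, loss_fn, batches, delta, inner_lr=1e-2, outer_lr=1e-2, eps=10/255):
    grads = []
    delta_inner = delta.clone().detach()
    for x, y in batches:                       # small-batch inner pre-search
        delta_inner.requires_grad_(True)
        loss = loss_fn(model(x + delta_inner), y)
        g = torch.autograd.grad(loss, delta_inner)[0]
        grads.append(g)
        with torch.no_grad():                  # inner ascent step (maximize loss)
            delta_inner = (delta_inner + inner_lr * g.sign()).clamp(-eps, eps)
    agg = torch.stack(grads).mean(0)           # one-step aggregated gradient estimate
    with torch.no_grad():
        delta = (delta + outer_lr * agg.sign()).clamp(-eps, eps)
    return delta
```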

ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

  • paper_url: http://arxiv.org/abs/2308.06009
  • repo_url: None
  • paper_authors: Kun Li, Dan Guo, Meng Wang
  • for: This work targets the video grounding (VG) task, which aims to locate a queried action or event in an untrimmed video based on rich linguistic descriptions.
  • methods: The authors propose a novel boundary regression paradigm that performs regression token learning in a transformer: a simple but effective proposal-free framework, Video Grounding Transformer (ViGT), predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features.
  • results: ViGT performs well on three public datasets (ANet Captions, TACoS, and YouCookII), with extensive ablation studies and qualitative analysis validating its interpretability.
    Abstract The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions. Existing proposal-free methods are trapped in complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. Particularly, we present a simple but effective proposal-free framework, namely Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are manifested as follows. (1) The token is unrelated to the video or the query and avoids data bias toward the original video and query. (2) The token simultaneously performs global context aggregation from video and query features. First, we employed a sharing feature encoder to project both video and query into a joint feature space before performing cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Furthermore, we concatenated a learnable regression token [REG] with the video and query features as the input of a vision-language transformer. Finally, we utilized the token [REG] to predict the target moment and visual features to constrain the foreground and background probabilities at each timestamp. The proposed ViGT performed well on three public datasets: ANet Captions, TACoS and YouCookII. Extensive ablation studies and qualitative analysis further validated the interpretability of ViGT.
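A minimal PyTorch sketch of the [REG]-token idea: one learnable token is concatenated with the (already co-attended) video and query features, and the boundary is regressed from its output state. Dimensions, layer counts, and the prediction head are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TinyViGT(nn.Module):
    def __init__(self, dim=256, layers=4, heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2), nn.Sigmoid())  # (start, end) in [0, 1]

    def forward(self, video_feats, query_feats):
        # video_feats: (B, Tv, dim); query_feats: (B, Tq, dim), already co-attended
        reg = self.reg_token.expand(video_feats.size(0), -1, -1)
        x = torch.cat([reg, video_feats, query_feats], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])   # predict normalized boundary from the [REG] state
```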

Image-based Geolocalization by Ground-to-2.5D Map Matching

  • paper_url: http://arxiv.org/abs/2308.05993
  • repo_url: None
  • paper_authors: Mengjie Zhou, Liu Liu, Yiran Zhong
  • for: This paper tackles image-based geolocalization, i.e., matching ground-view query images against cartographic maps.
  • methods: The authors propose a new approach that exploits height information in 2.5D space to improve cross-view matching. 2D maps are first aligned to ground-view panoramic images via a polar transform to reduce the gap between panoramas and maps, and global fusion is then used to fuse multi-modal features from the 2D and 2.5D maps, increasing the distinctiveness of location embeddings.
  • results: On two common localization settings, single-image localization and route-based localization, the method achieves significantly higher localization accuracy and faster convergence than previous 2D-map-based approaches.
    Abstract We study the image-based geolocalization problem that aims to locate ground-view query images on cartographic maps. Previous methods often utilize cross-view localization techniques to match ground-view query images with 2D maps. However, the performance of these methods is frequently unsatisfactory due to the significant cross-view appearance differences. In this paper, we extend cross-view matching to 2.5D spaces, where the heights of the structures - such as trees, buildings, and other objects - can provide additional information to guide the cross-view matching. We present a new approach to learning representative embeddings from multi-model data. Specifically, we first align 2D maps to ground-view panoramic images with polar transform to reduce the gap between panoramic images and maps. Then we leverage global fusion to fuse the multi-modal features from 2D and 2.5D maps to increase the distinctiveness of location embeddings. We construct the first large-scale ground-to-2.5D map geolocalization dataset to validate our method and facilitate the research. We test our learned embeddings on two popular localization approaches, i.e., single-image based localization, and route based localization. Extensive experiments demonstrate that our proposed method achieves significantly higher localization accuracy and faster convergence than previous 2D map-based approaches.

Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection

  • paper_url: http://arxiv.org/abs/2308.05991
  • repo_url: https://github.com/yinyf0804/wsod-cbl
  • paper_authors: Yufei Yin, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li
  • for: This work improves the accuracy of weakly supervised object detection by enhancing the quality of the pseudo-labels produced by the multiple instance detection network (MIDN).
  • methods: The authors propose a new weakly supervised object detection framework, Cyclic-Bootstrap Labeling (CBL), which optimizes MIDN using rank information from a reliable teacher network. The teacher is obtained via a weighted exponential moving average strategy that exploits the outputs of various refinement modules, and a novel class-specific ranking distillation algorithm lets MIDN benefit from the teacher's outputs.
  • results: Extensive experiments on the PASCAL VOC 2007 & 2012 and COCO datasets demonstrate the superior performance of the CBL framework for weakly supervised object detection.
    Abstract Recent progress in weakly supervised object detection is featured by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, with only image-level annotation, MIDN inevitably assigns high scores to some unexpected region proposals when generating pseudo labels. These inaccurate high-scoring region proposals will mislead the training of subsequent refinement modules and thus hamper the detection performance. In this work, we explore how to ameliorate the quality of pseudo-labeling in MIDN. Formally, we devise Cyclic-Bootstrap Labeling (CBL), a novel weakly supervised object detection pipeline, which optimizes MIDN with rank information from a reliable teacher network. Specifically, we obtain this teacher network by introducing a weighted exponential moving average strategy to take advantage of various refinement modules. A novel class-specific ranking distillation algorithm is proposed to leverage the output of weighted ensembled teacher network for distilling MIDN with rank information. As a result, MIDN is guided to assign higher scores to accurate proposals among their neighboring ones, thus benefiting the subsequent pseudo labeling. Extensive experiments on the prevalent PASCAL VOC 2007 \& 2012 and COCO datasets demonstrate the superior performance of our CBL framework. Code will be available at https://github.com/Yinyf0804/WSOD-CBL/.
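Two ingredients of CBL are easy to sketch: the exponential-moving-average teacher update, and a generic listwise ranking-distillation loss standing in for the paper's class-specific algorithm. Both are illustrative simplifications, not the paper's exact formulation:

```python
import torch

@torch.no_grad()
def update_ema_teacher(teacher, student, momentum=0.999):
    """Teacher parameters track an exponential moving average of the student's."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

def ranking_distillation_loss(student_scores, teacher_scores):
    """Encourage the student (MIDN) to rank proposals like the teacher:
    a soft listwise surrogate (cross-entropy between score distributions)."""
    t = teacher_scores.softmax(dim=-1).detach()
    return -(t * student_scores.log_softmax(dim=-1)).sum(dim=-1).mean()
```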

Automatic Classification of Blood Cell Images Using Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2308.06300
  • repo_url: None
  • paper_authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Abeera Mahfooz
  • for: The aim of this study is the automatic classification of blood cells.
  • methods: The study applies transfer learning with convolutional neural network (CNN) models, including VGG16, VGG19, ResNet-50, ResNet-101, ResNet-152, InceptionV3, MobileNetV2, and DenseNet-20, to the normal DIB of the PBC dataset.
  • results: Experiments show that the proposed CNN model reaches 99.91% accuracy on the PBC dataset, performing competitively in blood cell classification against results previously reported in the literature.
    Abstract Human blood primarily comprises plasma, red blood cells, white blood cells, and platelets. It plays a vital role in transporting nutrients to different organs, where it stores essential health-related data about the human body. Blood cells are utilized to defend the body against diverse infections, including fungi, viruses, and bacteria. Hence, blood analysis can help physicians assess an individual's physiological condition. Blood cells have been sub-classified into eight groups: Neutrophils, eosinophils, basophils, lymphocytes, monocytes, immature granulocytes (promyelocytes, myelocytes, and metamyelocytes), erythroblasts, and platelets or thrombocytes on the basis of their nucleus, shape, and cytoplasm. Traditionally, pathologists and hematologists in laboratories have examined these blood cells using a microscope before manually classifying them. The manual approach is slower and more prone to human error. Therefore, it is essential to automate this process. In our paper, transfer learning with CNN pre-trained models. VGG16, VGG19, ResNet-50, ResNet-101, ResNet-152, InceptionV3, MobileNetV2, and DenseNet-20 applied to the PBC dataset's normal DIB. The overall accuracy achieved with these models lies between 91.375 and 94.72%. Hence, inspired by these pre-trained architectures, a model has been proposed to automatically classify the ten types of blood cells with increased accuracy. A novel CNN-based framework has been presented to improve accuracy. The proposed CNN model has been tested on the PBC dataset normal DIB. The outcomes of the experiments demonstrate that our CNN-based framework designed for blood cell classification attains an accuracy of 99.91% on the PBC dataset. Our proposed convolutional neural network model performs competitively when compared to earlier results reported in the literature.

MS3D++: Ensemble of Experts for Multi-Source Unsupervised Domain Adaption in 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.05988
  • repo_url: https://github.com/darrenjkt/ms3d
  • paper_authors: Darren Tsai, Julie Stephany Berrio, Mao Shan, Eduardo Nebot, Stewart Worrall
  • for: addresses the issue of 3D detectors’ poor performance in unfamiliar domains due to the domain gap
  • methods: introduced MS3D++, a self-training framework for multi-source unsupervised domain adaptation in 3D object detection, which generates high-quality pseudo-labels and fuses predictions of an ensemble of multi-frame pre-trained detectors
  • results: achieved state-of-the-art performance, comparable to training with human-annotated labels in Bird’s Eye View (BEV) evaluation for both low and high density lidar, on Waymo, nuScenes and Lyft datasets.
    Abstract Deploying 3D detectors in unfamiliar domains has been demonstrated to result in a drastic drop of up to 70-90% in detection rate due to variations in lidar, geographical region, or weather conditions from their original training dataset. This domain gap leads to missing detections for densely observed objects, misaligned confidence scores, and increased high-confidence false positives, rendering the detector highly unreliable. To address this, we introduce MS3D++, a self-training framework for multi-source unsupervised domain adaptation in 3D object detection. MS3D++ provides a straightforward approach to domain adaptation by generating high-quality pseudo-labels, enabling the adaptation of 3D detectors to a diverse range of lidar types, regardless of their density. Our approach effectively fuses predictions of an ensemble of multi-frame pre-trained detectors from different source domains to improve domain generalization. We subsequently refine the predictions temporally to ensure temporal consistency in box localization and object classification. Furthermore, we present an in-depth study into the performance and idiosyncrasies of various 3D detector components in a cross-domain context, providing valuable insights for improved cross-domain detector ensembling. Experimental results on Waymo, nuScenes and Lyft demonstrate that detectors trained with MS3D++ pseudo-labels achieve state-of-the-art performance, comparable to training with human-annotated labels in Bird's Eye View (BEV) evaluation for both low and high density lidar.

Zero-shot Text-driven Physically Interpretable Face Editing

  • paper_url: http://arxiv.org/abs/2308.05976
  • repo_url: None
  • paper_authors: Yapeng Meng, Songru Yang, Xu Hu, Rui Zhao, Lincheng Li, Zhenwei Shi, Zhengxia Zou
  • for: This paper proposes a novel face editing method driven by arbitrary free-text prompts.
  • methods: The paper introduces a new vector flow field formulation for face editing: editing is achieved by controlling per-pixel offsets of spatial coordinates and color.
  • results: Compared with existing text-driven face editing methods, this approach produces editing results with high identity consistency and image quality, and can be extended to real-time video face editing.
    Abstract This paper proposes a novel and physically interpretable method for face editing based on arbitrary text prompts. Different from previous GAN-inversion-based face editing methods that manipulate the latent space of GANs, or diffusion-based methods that model image manipulation as a reverse diffusion process, we regard the face editing process as imposing vector flow fields on face images, representing the offset of spatial coordinates and color for each image pixel. Under the above-proposed paradigm, we represent the vector flow field in two ways: 1) explicitly represent the flow vectors with rasterized tensors, and 2) implicitly parameterize the flow vectors as continuous, smooth, and resolution-agnostic neural fields, by leveraging the recent advances of implicit neural representations. The flow vectors are iteratively optimized under the guidance of the pre-trained Contrastive Language-Image Pretraining~(CLIP) model by maximizing the correlation between the edited image and the text prompt. We also propose a learning-based one-shot face editing framework, which is fast and adaptable to any text prompt input. Our method can also be flexibly extended to real-time video face editing. Compared with state-of-the-art text-driven face editing methods, our method can generate physically interpretable face editing results with high identity consistency and image quality. Our code will be made publicly available.
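A minimal sketch of the explicit (rasterized) variant: per-pixel spatial and color offsets are optimized directly against a CLIP similarity score. The `clip_score` callable, the regularizer, and all hyper-parameters are placeholders, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def edit_with_flow(image, clip_score, steps=200, lr=0.05):
    """image: (1, 3, H, W) in [0, 1]; clip_score(img) -> scalar similarity to the prompt."""
    _, _, H, W = image.shape
    base = F.affine_grid(torch.eye(2, 3).unsqueeze(0), image.shape, align_corners=False)
    flow = torch.zeros(1, H, W, 2, requires_grad=True)     # spatial coordinate offsets
    color = torch.zeros_like(image, requires_grad=True)    # per-pixel color offsets
    opt = torch.optim.Adam([flow, color], lr=lr)
    for _ in range(steps):
        warped = F.grid_sample(image, base + flow, align_corners=False)
        edited = (warped + color).clamp(0, 1)
        loss = -clip_score(edited) + 0.1 * flow.pow(2).mean()  # keep the flow small/smooth
        opt.zero_grad()
        loss.backward()
        opt.step()
    return edited.detach()
```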

Focused Specific Objects NeRF

  • paper_url: http://arxiv.org/abs/2308.05970
  • repo_url: None
  • paper_authors: Yuesong Li, Feng Pan, Helong Yan, Xiuli Xin, Xiaoxue Feng
  • for: This work improves fast training and high-quality rendering of NeRF models for complex scenes.
  • methods: Scene semantic priors are exploited to speed up training, letting the network focus only on the specific targets without being affected by complex backgrounds. Training can be accelerated by 7.78x with better rendering, small-to-medium targets render faster, and the improvement applies to all NeRF-based models.
  • results: Weak supervision via sparsely sampled negative rays further accelerates training while maintaining rendering quality. The paper also proposes a new scene editing technique that can uniquely display or mask specific semantic targets during rendering, and designs a self-supervised loop combining morphological operations and clustering to address incorrect inferences in unsupervised regions.
    Abstract Most NeRF-based models are designed for learning the entire scene, and complex scenes can lead to longer learning times and poorer rendering effects. This paper utilizes scene semantic priors to make improvements in fast training, allowing the network to focus on the specific targets and not be affected by complex backgrounds. The training speed can be increased by 7.78 times with better rendering effect, and small to medium sized targets can be rendered faster. In addition, this improvement applies to all NeRF-based models. Considering the inherent multi-view consistency and smoothness of NeRF, this paper also studies weak supervision by sparsely sampling negative ray samples. With this method, training can be further accelerated and rendering quality can be maintained. Finally, this paper extends pixel semantic and color rendering formulas and proposes a new scene editing technique that can achieve unique displays of the specific semantic targets or masking them in rendering. To address the problem of unsupervised regions incorrect inferences in the scene, we also designed a self-supervised loop that combines morphological operations and clustering.

YOLOrtho – A Unified Framework for Teeth Enumeration and Dental Disease Detection

  • paper_url: http://arxiv.org/abs/2308.05967
  • repo_url: None
  • paper_authors: Shenxiao Mei, Chenglong Ma, Feihong Shen, Huikai Wu
  • for: The goal of this work is a unified framework that combines teeth enumeration with dental disease detection, improving the efficiency and accuracy of dentists’ diagnoses.
  • methods: The model adopts a CoordConv-based architecture and inserts an additional upsampling layer, making fuller use of the data while learning teeth detection and disease identification simultaneously.
  • results: In experiments, the model outperforms a larger diffusion-based model and accurately identifies teeth and dental diseases.
    Abstract Detecting dental diseases through panoramic X-rays images is a standard procedure for dentists. Normally, a dentist need to identify diseases and find the infected teeth. While numerous machine learning models adopting this two-step procedure have been developed, there has not been an end-to-end model that can identify teeth and their associated diseases at the same time. To fill the gap, we develop YOLOrtho, a unified framework for teeth enumeration and dental disease detection. We develop our model on Dentex Challenge 2023 data, which consists of three distinct types of annotated data. The first part is labeled with quadrant, and the second part is labeled with quadrant and enumeration and the third part is labeled with quadrant, enumeration and disease. To further improve detection, we make use of Tufts Dental public dataset. To fully utilize the data and learn both teeth detection and disease identification simultaneously, we formulate diseases as attributes attached to their corresponding teeth. Due to the nature of position relation in teeth enumeration, We replace convolution layer with CoordConv in our model to provide more position information for the model. We also adjust the model architecture and insert one more upsampling layer in FPN in favor of large object detection. Finally, we propose a post-process strategy for teeth layout that corrects teeth enumeration based on linear sum assignment. Results from experiments show that our model exceeds large Diffusion-based model.

Compositional Learning in Transformer-Based Human-Object Interaction Detection

  • paper_url: http://arxiv.org/abs/2308.05961
  • repo_url: None
  • paper_authors: Zikun Zhuang, Ruihao Qian, Chi Xie, Shuang Liang
  • for: This study addresses the long-tailed distribution problem in human-object interaction (HOI) detection, improving performance through heuristic-inspired compositional learning.
  • methods: The authors propose a compositional HOI learning method within a transformer framework, re-composing human-object pair representations and interaction representations across different HOI instances to obtain richer contextual information and better knowledge generalization.
  • results: Experiments show the simple yet effective method achieves leading performance, particularly on rare HOI classes.
    Abstract Human-object interaction (HOI) detection is an important part of understanding human activities and visual scenes. The long-tailed distribution of labeled instances is a primary challenge in HOI detection, promoting research in few-shot and zero-shot learning. Inspired by the combinatorial nature of HOI triplets, some existing approaches adopt the idea of compositional learning, in which object and action features are learned individually and re-composed as new training samples. However, these methods follow the CNN-based two-stage paradigm with limited feature extraction ability, and often rely on auxiliary information for better performance. Without introducing any additional information, we creatively propose a transformer-based framework for compositional HOI learning. Human-object pair representations and interaction representations are re-composed across different HOI instances, which involves richer contextual information and promotes the generalization of knowledge. Experiments show our simple but effective method achieves state-of-the-art performance, especially on rare HOI classes.

Classification of White Blood Cells Using Machine and Deep Learning Models: A Systematic Review

  • paper_url: http://arxiv.org/abs/2308.06296
  • repo_url: None
  • paper_authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Arslan Shaukat
  • for: helping to improve the accuracy of white blood cell classification in medical image analysis.
  • methods: a systematic review of modern machine learning (ML) and deep learning (DL) techniques applied to medical imaging data, including blood smear images, MRI, and X-rays.
  • results: the review finds that the continued development and adoption of ML and DL techniques has markedly improved white blood cell classification accuracy, but challenges remain, such as dataset availability and the medical training of researchers.
    Abstract Machine learning (ML) and deep learning (DL) models have been employed to significantly improve analyses of medical imagery, with these approaches used to enhance the accuracy of prediction and classification. Model predictions and classifications assist diagnoses of various cancers and tumors. This review presents an in-depth analysis of modern techniques applied within the domain of medical image analysis for white blood cell classification. The methodologies that use blood smear images, magnetic resonance imaging (MRI), X-rays, and similar medical imaging domains are identified and discussed, with a detailed analysis of ML/DL techniques applied to the classification of white blood cells (WBCs) representing the primary focus of the review. The data utilized in this research has been extracted from a collection of 136 primary papers published between 2006 and 2023. The most widely used techniques and best-performing white blood cell classification methods are identified. While the use of ML and DL for white blood cell classification has increased and improved in recent years, significant challenges remain: 1) the availability of appropriate datasets remains the primary challenge and may be addressed using data augmentation techniques; 2) medical training of researchers is recommended to improve current understanding of white blood cell structure and the selection of appropriate classification models; 3) advanced DL networks, including Generative Adversarial Networks, R-CNN, Fast R-CNN, and Faster R-CNN, will likely be increasingly employed to supplement or replace current techniques.

Learned Point Cloud Compression for Classification

  • paper_url: http://arxiv.org/abs/2308.05959
  • repo_url: https://github.com/multimedialabsfu/learned-point-cloud-compression-for-classification
  • paper_authors: Mateen Ulhaq, Ivan V. Bajić
  • for: proposing a highly specialized point cloud codec for transmitting point cloud data to a server for deep-learning-based machine vision tasks.
  • methods: the codec is based on PointNet and implements a highly specialized point cloud encoding scheme to achieve a better rate-accuracy trade-off.
  • results: compared with non-specialized codecs, the codec achieves a 94% BD-bitrate reduction on the ModelNet40 dataset while maintaining high accuracy; for low-resource end devices, two lightweight encoder configurations achieve similar BD-bitrate reductions (93% and 92%) with small top-1 accuracy drops (3% and 5%) while consuming only 0.470 and 0.048 encoder-side kMACs/point.
    Abstract Deep learning is increasingly being used to perform machine vision tasks such as classification, object detection, and segmentation on 3D point cloud data. However, deep learning inference is computationally expensive. The limited computational capabilities of end devices thus necessitate a codec for transmitting point cloud data over the network for server-side processing. Such a codec must be lightweight and capable of achieving high compression ratios without sacrificing accuracy. Motivated by this, we present a novel point cloud codec that is highly specialized for the machine task of classification. Our codec, based on PointNet, achieves a significantly better rate-accuracy trade-off in comparison to alternative methods. In particular, it achieves a 94% reduction in BD-bitrate over non-specialized codecs on the ModelNet40 dataset. For low-resource end devices, we also propose two lightweight configurations of our encoder that achieve similar BD-bitrate reductions of 93% and 92% with 3% and 5% drops in top-1 accuracy, while consuming only 0.470 and 0.048 encoder-side kMACs/point, respectively. Our codec demonstrates the potential of specialized codecs for machine analysis of point clouds, and provides a basis for extension to more complex tasks and datasets in the future.
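
For orientation, here is a minimal PointNet-style encoder of the kind the codec builds on: shared pointwise MLPs followed by a permutation-invariant global max-pool. The layer widths are illustrative assumptions, not the paper's configuration.

```python
# Minimal PointNet-style encoder: shared pointwise MLP + global max-pool.
# Widths are illustrative, not the codec's actual configuration.
import torch
import torch.nn as nn

class TinyPointNetEncoder(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1), nn.ReLU(),
        )

    def forward(self, pts):                    # pts: (B, N, 3) point coordinates
        x = self.mlp(pts.transpose(1, 2))      # (B, feat_dim, N) pointwise features
        return x.max(dim=2).values             # permutation-invariant code (B, feat_dim)
```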

Uncertainty-Aware Cross-Modal Transfer Network for Sketch-Based 3D Shape Retrieval

  • paper_url: http://arxiv.org/abs/2308.05948
  • repo_url: None
  • paper_authors: Yiyang Cai, Jiaming Lu, Jiewen Wang, Shuang Liang
  • for: addressing low-quality and noisy samples in sketch data to improve the accuracy of sketch-based 3D shape retrieval.
  • methods: an uncertainty-aware cross-modal transfer network (UACTN) that decouples the representation learning of sketches and 3D shapes into two separate tasks: classification-based sketch uncertainty learning and 3D shape feature transfer. An end-to-end classification approach jointly learns sketch features and uncertainty, so that noisy sketches are down-weighted relative to clean ones and overfitting is prevented; 3D shape features are then mapped into the pre-learned sketch embedding space for feature alignment.
  • results: extensive experiments and ablation studies on two benchmarks demonstrate higher accuracy and robustness than prior state-of-the-art methods.
    Abstract In recent years, sketch-based 3D shape retrieval has attracted growing attention. While many previous studies have focused on cross-modal matching between hand-drawn sketches and 3D shapes, the critical issue of how to handle low-quality and noisy samples in sketch data has been largely neglected. This paper presents an uncertainty-aware cross-modal transfer network (UACTN) that addresses this issue. UACTN decouples the representation learning of sketches and 3D shapes into two separate tasks: classification-based sketch uncertainty learning and 3D shape feature transfer. We first introduce an end-to-end classification-based approach that simultaneously learns sketch features and uncertainty, allowing uncertainty to prevent overfitting noisy sketches by assigning different levels of importance to clean and noisy sketches. Then, 3D shape features are mapped into the pre-learned sketch embedding space for feature alignment. Extensive experiments and ablation studies on two benchmarks demonstrate the superiority of our proposed method compared to state-of-the-art methods.
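
To make the uncertainty idea concrete, the sketch below shows one common way to let a classifier learn per-sample uncertainty that down-weights noisy inputs (a heteroscedastic-style loss); whether UACTN uses exactly this form is an assumption on our part.

```python
# One common formulation of learned per-sample uncertainty: a predicted
# log-variance shrinks the loss of hard/noisy samples but pays a
# regularization cost. Whether UACTN uses exactly this form is an assumption.
import torch
import torch.nn.functional as F

def uncertainty_weighted_ce(logits, log_var, targets):
    """logits: (B, C); log_var: (B,) predicted log-uncertainty; targets: (B,)."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample loss
    return (torch.exp(-log_var) * ce + 0.5 * log_var).mean()
```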

Generalizing Event-Based Motion Deblurring in Real-World Scenarios

  • paper_url: http://arxiv.org/abs/2308.05932
  • repo_url: https://github.com/xiangz-0/gem
  • paper_authors: Xiang Zhang, Lei Yu, Wen Yang, Jianzhuang Liu, Gui-Song Xia
  • for: generalizing event-based motion deblurring performance to real-world scenarios with varying spatial and temporal scales of motion blur, including in the presence of observers of different resolutions.
  • methods: a scale-aware network that accepts flexible input spatial scales and learns from different temporal scales of motion blur, together with a two-stage self-supervised training scheme that fits real-world data distributions.
  • results: the method efficiently restores the brightness and structure of latent images and, via self-distillation, generalizes deblurring to the varying spatial and temporal scales of motion blur found in real-world scenes.
    Abstract Event-based motion deblurring has shown promising results by exploiting low-latency events. However, current approaches are limited in their practical usage, as they assume the same spatial resolution of inputs and specific blurriness distributions. This work addresses these limitations and aims to generalize the performance of event-based deblurring in real-world scenarios. We propose a scale-aware network that allows flexible input spatial scales and enables learning from different temporal scales of motion blur. A two-stage self-supervised learning scheme is then developed to fit real-world data distribution. By utilizing the relativity of blurriness, our approach efficiently ensures the restored brightness and structure of latent images and further generalizes deblurring performance to handle varying spatial and temporal scales of motion blur in a self-distillation manner. Our method is extensively evaluated, demonstrating remarkable performance, and we also introduce a real-world dataset consisting of multi-scale blurry frames and events to facilitate research in event-based deblurring.

CaPhy: Capturing Physical Properties for Animatable Human Avatars

  • paper_url: http://arxiv.org/abs/2308.05925
  • repo_url: None
  • paper_authors: Zhaoqi Su, Liangxiao Hu, Siyou Lin, Hongwen Zhang, Shengping Zhang, Justus Thies, Yebin Liu
  • for: reconstruction of animatable human avatars with realistic dynamic properties for clothing
  • methods: combination of unsupervised training with physics-based losses and 3D-supervised training using scanned data; optimization of physical parameters using gradient constraints
  • results: ability to generalize to novel poses with realistic dynamic cloth deformations; superior quantitative and qualitative results compared with previous methods
    Abstract We present CaPhy, a novel method for reconstructing animatable human avatars with realistic dynamic properties for clothing. Specifically, we aim for capturing the geometric and physical properties of the clothing from real observations. This allows us to apply novel poses to the human avatar with physically correct deformations and wrinkles of the clothing. To this end, we combine unsupervised training with physics-based losses and 3D-supervised training using scanned data to reconstruct a dynamic model of clothing that is physically realistic and conforms to the human scans. We also optimize the physical parameters of the underlying physical model from the scans by introducing gradient constraints of the physics-based losses. In contrast to previous work on 3D avatar reconstruction, our method is able to generalize to novel poses with realistic dynamic cloth deformations. Experiments on several subjects demonstrate that our method can estimate the physical properties of the garments, resulting in superior quantitative and qualitative results compared with previous methods.

BATINet: Background-Aware Text to Image Synthesis and Manipulation Network

  • paper_url: http://arxiv.org/abs/2308.05921
  • repo_url: None
  • paper_authors: Ryugo Morita, Zhiqiang Zhang, Jinjia Zhou
  • for: generating text-described foreground content that matches the style of a given background image.
  • methods: a Background-Aware Text to Image synthesis and manipulation Network (BATINet) with two key components: a Position Detect Network (PDN) that locates the most plausible position of the text-relevant object in the background image, and a Harmonize Network (HN) that harmonizes the generated content with background style information.
  • results: qualitative and quantitative evaluations on the CUB dataset show the model outperforms other state-of-the-art methods; the model also applies to text-guided image manipulation, solving the most challenging task of manipulating object shape.
    Abstract Background-Induced Text2Image (BIT2I) aims to generate foreground content according to the text on the given background image. Most studies focus on generating high-quality foreground content, although they ignore the relationship between the two contents. In this study, we analyzed a novel Background-Aware Text2Image (BAT2I) task in which the generated content matches the input background. We proposed a Background-Aware Text to Image synthesis and manipulation Network (BATINet), which contains two key components: Position Detect Network (PDN) and Harmonize Network (HN). The PDN detects the most plausible position of the text-relevant object in the background image. The HN harmonizes the generated content referring to background style information. Finally, we reconstructed the generation network, which consists of the multi-GAN and attention module to match more user preferences. Moreover, we can apply BATINet to text-guided image manipulation. It solves the most challenging task of manipulating the shape of an object. We demonstrated through qualitative and quantitative evaluations on the CUB dataset that the proposed model outperforms other state-of-the-art methods.

Semantics2Hands: Transferring Hand Motion Semantics between Avatars

  • paper_url: http://arxiv.org/abs/2308.05920
  • repo_url: https://github.com/abcyzj/Semantics2Hands
  • paper_authors: Zijie Ye, Jia Jia, Junliang Xing
  • for: transferring hand motion semantics between avatars with different hand models.
  • methods: an anatomy-based semantic matrix (ASM) that quantitatively encodes hand motion semantics for precise retargeting, together with an anatomy-based semantics reconstruction network (ASRN) that maps a source ASM to target hand joint rotations.
  • results: evaluations on intra-domain and cross-domain hand motion retargeting tasks show significant qualitative and quantitative superiority over the state of the art.
    Abstract Human hands, the primary means of non-verbal communication, convey intricate semantics in various scenarios. Due to the high sensitivity of individuals to hand motions, even minor errors in hand motions can significantly impact the user experience. Real applications often involve multiple avatars with varying hand shapes, highlighting the importance of maintaining the intricate semantics of hand motions across the avatars. Therefore, this paper aims to transfer the hand motion semantics between diverse avatars based on their respective hand models. To address this problem, we introduce a novel anatomy-based semantic matrix (ASM) that encodes the semantics of hand motions. The ASM quantifies the positions of the palm and other joints relative to the local frame of the corresponding joint, enabling precise retargeting of hand motions. Subsequently, we obtain a mapping function from the source ASM to the target hand joint rotations by employing an anatomy-based semantics reconstruction network (ASRN). We train the ASRN using a semi-supervised learning strategy on the Mixamo and InterHand2.6M datasets. We evaluate our method in intra-domain and cross-domain hand motion retargeting tasks. The qualitative and quantitative results demonstrate the significant superiority of our ASRN over the state-of-the-arts.
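
The core geometric operation behind such an anatomy-based matrix is expressing joint positions in the local frame of a reference joint. A hedged sketch follows, with the choice of reference joints and any normalization left as assumptions.

```python
# Hedged sketch: express joint positions in the local frame of a reference
# joint, the basic operation behind an anatomy-based semantic matrix.
import numpy as np

def local_frame_coords(joint_pos, frame_origin, frame_rot):
    """joint_pos: (J, 3) world positions; frame_origin: (3,) frame origin;
    frame_rot: (3, 3) rotation whose columns are the local frame axes."""
    return (joint_pos - frame_origin) @ frame_rot  # (J, 3) local coordinates
```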

Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2308.05911
  • repo_url: None
  • paper_authors: Yiheng Liu, Junta Wu, Yi Fu
  • for: reducing the computational, storage, and power costs of edge devices by enabling efficient multi-object tracking (MOT) at low frame rates.
  • methods: collaborative tracking learning (ColTrack), in which multiple historical queries of the same target jointly track it; an information refinement module inserted between every two temporal blocking decoders better fuses temporal clues and refines features, and a tracking-object consistency loss guides the interaction between historical queries.
  • results: in high-frame-rate videos, ColTrack outperforms state-of-the-art methods on the large-scale Dancetrack and BDD100K datasets and surpasses existing end-to-end methods on MOT17; in low-frame-rate videos it holds a clear advantage, allowing faster processing by relaxing frame-rate requirements while maintaining higher performance.
    Abstract Multi-object tracking (MOT) at low frame rates can reduce computational, storage and power overhead to better meet the constraints of edge devices. Many existing MOT methods suffer from significant performance degradation in low-frame-rate videos due to significant location and appearance changes between adjacent frames. To this end, we propose to explore collaborative tracking learning (ColTrack) for frame-rate-insensitive MOT in a query-based end-to-end manner. Multiple historical queries of the same target jointly track it with richer temporal descriptions. Meanwhile, we insert an information refinement module between every two temporal blocking decoders to better fuse temporal clues and refine features. Moreover, a tracking object consistency loss is proposed to guide the interaction between historical queries. Extensive experimental results demonstrate that in high-frame-rate videos, ColTrack obtains higher performance than state-of-the-art methods on large-scale datasets Dancetrack and BDD100K, and outperforms the existing end-to-end methods on MOT17. More importantly, ColTrack has a significant advantage over state-of-the-art methods in low-frame-rate videos, which allows it to obtain faster processing speeds by reducing frame-rate requirements while maintaining higher performance. Code will be released at https://github.com/yolomax/ColTrack

Semantic-embedded Similarity Prototype for Scene Recognition

  • paper_url: http://arxiv.org/abs/2308.05896
  • repo_url: None
  • paper_authors: Chuanxin Song, Hanbo Wu, Xin Ma
  • for: improving scene recognition accuracy without adding network parameters.
  • methods: a statistical strategy that represents semantic knowledge in scenes as class-level semantic representations, which are used to explore inter-class correlations and construct a similarity prototype that supports network training.
  • results: comprehensive evaluations on multiple benchmarks confirm that the similarity prototype boosts the performance of existing networks without adding any computational burden.
    Abstract Due to the high inter-class similarity caused by the complex composition within scenes and the co-existing objects across scenes, various studies have explored object semantic knowledge within scenes to improve scene recognition. However, a resulting issue arises as semantic segmentation or object detection techniques demand heavy computational power, thereby burdening the network considerably. This limitation often renders object-assisted approaches incompatible with edge devices. In contrast, this paper proposes a semantic-based similarity prototype that assists the scene recognition network to achieve higher accuracy without increasing network parameters. It is simple and can be plug-and-played into existing pipelines. More specifically, a statistical strategy is introduced to depict semantic knowledge in scenes as class-level semantic representations. These representations are utilized to explore inter-class correlations, ultimately constructing a similarity prototype. Furthermore, we propose two ways to use the similarity prototype to support network training from the perspective of gradient label softening and batch-level contrastive loss, respectively. Comprehensive evaluations on multiple benchmarks show that our similarity prototype enhances the performance of existing networks without adding any computational burden. Code and the statistical similarity prototype will be available soon.
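
One plausible reading of the gradient-label-softening idea is to blend one-hot targets with rows of a class-similarity prototype. The sketch below illustrates that reading only; the blending weight `alpha` and the exact form are our assumptions, not the paper's formulation.

```python
# Illustrative label softening with a class-similarity prototype S (C x C,
# rows normalized): blend one-hot targets with similarity rows. The blend
# weight alpha and this exact form are assumptions.
import torch
import torch.nn.functional as F

def soften_targets(targets, S, alpha=0.1):
    """targets: (B,) class ids; S: (C, C) similarity prototype."""
    one_hot = F.one_hot(targets, num_classes=S.size(0)).float()
    return (1 - alpha) * one_hot + alpha * S[targets]      # (B, C) soft labels

def soft_cross_entropy(logits, soft_labels):
    return -(soft_labels * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```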

Aphid Cluster Recognition and Detection in the Wild Using Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.05881
  • repo_url: None
  • paper_authors: Tianxiao Zhang, Kaidong Li, Xiangyu Chen, Cuncong Zhong, Bo Luo, Ivan Grijalva, Brian McCornack, Daniel Flippo, Ajay Sharda, Guanghui Wang
  • for: detecting aphid clusters to make chemical pest control more targeted, efficient, and environmentally sustainable.
  • methods: deep learning models for aphid cluster detection; images are cropped into patches, producing 151,380 annotated image patches.
  • results: experiments with four state-of-the-art object detection models (VFNet, GFLV2, PAA, and ATSS) show stable, similar performance on the aphid dataset, and merging close neighboring clusters while removing tiny clusters boosts performance by around 17%.
    Abstract Aphid infestation poses a significant threat to crop production, rural communities, and global food security. While chemical pest control is crucial for maximizing yields, applying chemicals across entire fields is both environmentally unsustainable and costly. Hence, precise localization and management of aphids are essential for targeted pesticide application. The paper primarily focuses on using deep learning models for detecting aphid clusters. We propose a novel approach for estimating infection levels by detecting aphid clusters. To facilitate this research, we have captured a large-scale dataset from sorghum fields, manually selected 5,447 images containing aphids, and annotated each individual aphid cluster within these images. To facilitate the use of machine learning models, we further process the images by cropping them into patches, resulting in a labeled dataset comprising 151,380 image patches. Then, we implemented and compared the performance of four state-of-the-art object detection models (VFNet, GFLV2, PAA, and ATSS) on the aphid dataset. Extensive experimental results show that all models yield stable similar performance in terms of average precision and recall. We then propose to merge close neighboring clusters and remove tiny clusters caused by cropping, and the performance is further boosted by around 17%. The study demonstrates the feasibility of automatically detecting and managing insects using machine learning models. The labeled dataset will be made openly available to the research community.
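
The cluster post-processing can be illustrated in a few lines: greedily merge overlapping boxes and drop tiny fragments created by cropping. The IoU and area thresholds below are illustrative assumptions, not the paper's values.

```python
def merge_and_filter(boxes, iou_thr=0.3, min_area=100):
    """boxes: list of (x1, y1, x2, y2). Greedily merge overlapping boxes,
    then drop tiny ones. Thresholds are illustrative assumptions."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    merged, changed = [list(b) for b in boxes], True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if iou(merged[i], merged[j]) > iou_thr:
                    a, b = merged[i], merged.pop(j)
                    merged[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3])]
                    changed = True
                    break
            if changed:
                break
    return [b for b in merged if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]
```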

Vision Backbone Enhancement via Multi-Stage Cross-Scale Attention

  • paper_url: http://arxiv.org/abs/2308.05872
  • repo_url: None
  • paper_authors: Liang Shang, Yanli Liu, Zhengyang Lou, Shuxue Quan, Nagesh Adluru, Bochen Guan, William A. Sethares
  • for: improving the performance of CNNs and ViTs on vision tasks by adding multi-stage and cross-scale interactions.
  • methods: a simple add-on attention module that enables multi-stage and cross-scale interactions between feature maps from different stages and scales.
  • results: experiments on several downstream tasks show that MSCSA provides a significant performance boost with modest additional FLOPs and runtime.
    Abstract Convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable success in various vision tasks. However, many architectures do not consider interactions between feature maps from different stages and scales, which may limit their performance. In this work, we propose a simple add-on attention module to overcome these limitations via multi-stage and cross-scale interactions. Specifically, the proposed Multi-Stage Cross-Scale Attention (MSCSA) module takes feature maps from different stages to enable multi-stage interactions and achieves cross-scale interactions by computing self-attention at different scales based on the multi-stage feature maps. Our experiments on several downstream tasks show that MSCSA provides a significant performance boost with modest additional FLOPs and runtime.
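
A toy version of cross-scale attention helps fix the idea: queries from one stage attend to keys and values from another scale. The dimensions, head count, and residual wiring below are illustrative assumptions, not the MSCSA design.

```python
# Toy cross-scale attention: fine-stage queries attend to coarse-stage
# keys/values. Dimensions and wiring are illustrative assumptions.
import torch
import torch.nn as nn

class ToyCrossScaleAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_hi, feat_lo):
        """feat_hi: (B, C, H, W) fine stage; feat_lo: (B, C, h, w) coarse stage."""
        B, C, H, W = feat_hi.shape
        q = feat_hi.flatten(2).transpose(1, 2)        # (B, H*W, C) queries
        kv = feat_lo.flatten(2).transpose(1, 2)       # (B, h*w, C) keys/values
        out, _ = self.attn(q, kv, kv)                 # cross-scale attention
        return out.transpose(1, 2).reshape(B, C, H, W) + feat_hi  # residual
```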

The Multi-modality Cell Segmentation Challenge: Towards Universal Solutions

  • paper_url: http://arxiv.org/abs/2308.05864
  • repo_url: None
  • paper_authors: Jun Ma, Ronald Xie, Shamini Ayyadhury, Cheng Ge, Anubha Gupta, Ritu Gupta, Song Gu, Yao Zhang, Gihun Lee, Joonkee Kim, Wei Lou, Haofeng Li, Eric Upschulte, Timo Dickscheid, José Guilherme de Almeida, Yixin Wang, Lin Han, Xin Yang, Marco Labagnara, Sahand Jamal Rahi, Carly Kempster, Alice Pollitt, Leon Espinosa, Tâm Mignot, Jan Moritz Middeke, Jan-Niklas Eckardt, Wangkai Li, Zhaoyang Li, Xiaochen Cai, Bizhe Bai, Noah F. Greenwald, David Van Valen, Erin Weisbart, Beth A. Cimini, Zhuoshi Li, Chao Zuo, Oscar Brück, Gary D. Bader, Bo Wang
  • for: providing a multi-modality cell segmentation benchmark for quantitative single-cell analysis in microscopy images.
  • methods: a Transformer-based deep learning algorithm that applies across microscopy platforms and tissue types without manual parameter adjustment.
  • results: the algorithm exceeds existing methods, delivering more accurate and versatile cell analysis across modalities.
    Abstract Cell segmentation is a critical step for quantitative single-cell analysis in microscopy images. Existing cell segmentation methods are often tailored to specific modalities or require manual interventions to specify hyperparameters in different experimental settings. Here, we present a multi-modality cell segmentation benchmark, comprising over 1500 labeled images derived from more than 50 diverse biological experiments. The top participants developed a Transformer-based deep-learning algorithm that not only exceeds existing methods, but can also be applied to diverse microscopy images across imaging platforms and tissue types without manual parameter adjustments. This benchmark and the improved algorithm offer promising avenues for more accurate and versatile cell analysis in microscopy imaging.

SegDA: Maximum Separable Segment Mask with Pseudo Labels for Domain Adaptive Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.05851
  • repo_url: None
  • paper_authors: Anant Khandelwal
  • for: improving the transfer performance of Unsupervised Domain Adaptation (UDA) methods, addressing label scarcity in the target domain.
  • methods: a SegDA module that enhances UDA methods by learning maximally separable segment representations.
  • results: gains of +2.2 mIoU on GTA -> Cityscapes, +2.0 mIoU on Synthia -> Cityscapes, +5.9 mIoU on Cityscapes -> DarkZurich, and +2.6 mIoU on Cityscapes -> ACDC across four UDA benchmarks.
    Abstract Unsupervised Domain Adaptation (UDA) aims to solve the problem of label scarcity in the target domain by transferring knowledge from the label-rich source domain. Usually, the source domain consists of synthetic images for which annotation is easily obtained using well-known computer graphics techniques. However, obtaining annotation for real-world images (the target domain) requires a lot of manual effort and is very time-consuming, because it requires per-pixel annotation. To address this problem, we propose the SegDA module, which enhances the transfer performance of UDA methods by learning a maximally separable segment representation. This resolves the problem of identifying visually similar classes such as pedestrian/rider or sidewalk/road. We leveraged an Equiangular Tight Frame (ETF) classifier, inspired by Neural Collapse, for maximal separation between segment classes. This causes the source-domain pixel representations to collapse to single vectors forming simplex vertices aligned with the maximally separable ETF classifier. We use this phenomenon to propose a novel architecture for domain adaptation of segment representations to the target domain. Additionally, we propose to estimate the noise in the labelling of target-domain images and update the decoder for noise correction, which encourages the discovery of pixels for classes not identified in pseudo labels. We used four UDA benchmarks simulating synthetic-to-real, daytime-to-nighttime, and clear-to-adverse-weather scenarios. Our proposed approach outperforms the state of the art by +2.2 mIoU on GTA -> Cityscapes, +2.0 mIoU on Synthia -> Cityscapes, +5.9 mIoU on Cityscapes -> DarkZurich, and +2.6 mIoU on Cityscapes -> ACDC.
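
For reference, a simplex equiangular tight frame of the kind used as a maximally separable classifier can be constructed as follows. This is the standard neural-collapse construction; any resemblance to SegDA's exact implementation is an assumption.

```python
# Standard simplex ETF: C maximally separated unit vectors whose pairwise
# cosine similarity is -1/(C-1). Requires dim >= num_classes here.
import torch

def simplex_etf(num_classes, dim):
    U = torch.linalg.qr(torch.randn(dim, num_classes)).Q        # orthonormal cols
    M = U @ (torch.eye(num_classes)
             - torch.full((num_classes, num_classes), 1.0 / num_classes))
    return M * (num_classes / (num_classes - 1)) ** 0.5         # (dim, C) columns
```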

Recognizing Handwritten Mathematical Expressions of Vertical Addition and Subtraction

  • paper_url: http://arxiv.org/abs/2308.05820
  • repo_url: https://github.com/danielgol/hme-vas
  • paper_authors: Daniel Rosa, Filipe R. Cordeiro, Ruan Carvalho, Everton Souza, Sergio Chevtchenko, Luiz Rodrigues, Marcelo Marinho, Thales Vieira, Valmir Macario
  • for: proposing a new handwritten mathematical expression recognition method that can recognize vertical addition and subtraction expressions.
  • methods: evaluation of existing object detection algorithms (YOLO v7, YOLO v8, YOLO-NAS, NanoDet, and FCOS) for symbol detection, plus a new transcription method that maps the detected bounding boxes to a LaTeX markup sequence.
  • results: the approach recognizes handwritten vertical expressions efficiently, achieving a high expression recognition rate, and can be extended to other settings.
    Abstract Handwritten Mathematical Expression Recognition (HMER) is a challenging task with many educational applications. Recent methods for HMER have been developed for complex mathematical expressions in standard horizontal format. However, solutions for elementary mathematical expressions, such as vertical addition and subtraction, have not been explored in the literature. This work proposes a new handwritten elementary mathematical expression dataset composed of addition and subtraction expressions in a vertical format. We also extended the MNIST dataset to generate artificial images with this structure. Furthermore, we propose a solution for offline HMER that can recognize vertical addition and subtraction expressions. Our analysis evaluated the object detection algorithms YOLO v7, YOLO v8, YOLO-NAS, NanoDet, and FCOS for identifying the mathematical symbols. We also propose a transcription method to map the bounding boxes from the object detection stage to a mathematical expression in LaTeX markup. Results show that our approach is efficient, achieving a high expression recognition rate. The code and dataset are available at https://github.com/Danielgol/HME-VAS

Absorption-Based, Passive Range Imaging from Hyperspectral Thermal Measurements

  • paper_url: http://arxiv.org/abs/2308.05818
  • repo_url: None
  • paper_authors: Unay Dorken Gallastegi, Hoover Rueda-Chacon, Martin J. Stevens, Vivek K Goyal
  • for: developing a novel passive range imaging method based on atmospheric absorption of ambient thermal radiance, which recovers range features of remote objects in natural scenes without active illumination.
  • methods: a computational approach that separates the effects of remote object material composition, temperature, and range on the spectrum of thermal radiance, exploiting atmospheric absorption to mitigate noise in low-contrast scenarios; range and intrinsic object properties are estimated jointly from a variety of absorption lines spread over the infrared spectrum.
  • results: range features from 15 m to 150 m are recovered from long-wave infrared (8-13 $\mu$m) hyperspectral image data acquired from natural scenes with no active illumination, with a good qualitative match to unaligned lidar data.
    Abstract Passive hyperspectral long-wave infrared measurements are remarkably informative about the surroundings, such as remote object material composition, temperature, and range; and air temperature and gas concentrations. Remote object material and temperature determine the spectrum of thermal radiance, and range, air temperature, and gas concentrations determine how this spectrum is modified by propagation to the sensor. We computationally separate these phenomena, introducing a novel passive range imaging method based on atmospheric absorption of ambient thermal radiance. Previously demonstrated passive absorption-based ranging methods assume hot and highly emitting objects. However, the temperature variation in natural scenes is usually low, making range imaging challenging. Our method benefits from explicit consideration of air emission and parametric modeling of atmospheric absorption. To mitigate noise in low-contrast scenarios, we jointly estimate range and intrinsic object properties by exploiting a variety of absorption lines spread over the infrared spectrum. Along with Monte Carlo simulations that demonstrate the importance of regularization, temperature differentials, and availability of many spectral bands, we apply this method to long-wave infrared (8--13 $\mu$m) hyperspectral image data acquired from natural scenes with no active illumination. Range features from 15m to 150m are recovered, with good qualitative match to unaligned lidar data.
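
A simplified single-layer, homogeneous-path radiance model captures the physics the method exploits; treat this as a schematic, since the paper's parametric atmospheric model may include additional terms:

```latex
L_{\mathrm{sensor}}(\nu) = L_{\mathrm{obj}}(\nu)\, e^{-\kappa(\nu)\, r}
  + L_{\mathrm{air}}(\nu)\,\bigl(1 - e^{-\kappa(\nu)\, r}\bigr)
```

Here $\kappa(\nu)$ is the spectral absorption coefficient of air (set by air temperature and gas concentrations) and $r$ is the range; range information enters through the depth of the absorption lines relative to the object's emitted spectrum.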

Spintronics for image recognition : performance benchmarking via ultrafast data-driven simulations

  • paper_url: http://arxiv.org/abs/2308.05810
  • repo_url: None
  • paper_authors: Anatole Moureaux, Chloé Chopin, Laurent Jacques, Flavio Abreu Araujo
  • for: image classification.
  • methods: a hardware-based echo-state network (ESN) built on vortex-based spin-torque oscillators (STVOs), with the STVO dynamics simulated by the ultrafast data-driven Thiele equation approach (DD-TEA).
  • results: using DD-TEA to simulate the STVO dynamics, the ESN achieves high accuracy on MNIST and lower performance on EMNIST-letters and Fashion MNIST, attributed to the simplicity of the system architecture relative to the increased task complexity.
    Abstract We present a demonstration of image classification using a hardware-based echo-state network (ESN) that relies on spintronic nanostructures known as vortex-based spin-torque oscillators (STVOs). Our network is realized using a single STVO multiplexed in time. To circumvent the challenges associated with repeated experimental manipulation of such a nanostructured system, we employ an ultrafast data-driven simulation framework called the data-driven Thiele equation approach (DD-TEA) to simulate the STVO dynamics. We use this approach to efficiently develop, optimize and test an STVO-based ESN for image classification using the MNIST dataset. We showcase the versatility of our solution by successfully applying it to solve classification challenges with the EMNIST-letters and Fashion MNIST datasets. Through our simulations, we determine that within a large ESN the results obtained using the STVO dynamics as an activation function are comparable to the ones obtained with other conventional nonlinear activation functions like the reLU and the sigmoid. While achieving state-of-the-art accuracy levels on the MNIST dataset, our model's performance on EMNIST-letters and Fashion MNIST is lower due to the relative simplicity of the system architecture and the increased complexity of the tasks. We expect that the DD-TEA framework will enable the exploration of more specialized neural architectures, ultimately leading to improved classification accuracy. This approach also holds promise for investigating and developing dedicated learning rules to further enhance classification performance.
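
For orientation, a generic (software) echo-state network update looks like the following; in the paper the reservoir nonlinearity is realized physically by the STVO dynamics, so the `tanh` here is a stand-in assumption.

```python
# Generic leaky echo-state network update; tanh stands in for the physical
# oscillator nonlinearity (an assumption on our part).
import numpy as np

def esn_step(x, u, W, W_in, leak=0.3):
    """x: (N,) reservoir state; u: (M,) input; W: (N, N), W_in: (N, M) fixed
    random weights (W scaled to preserve the echo-state property)."""
    return (1 - leak) * x + leak * np.tanh(W @ x + W_in @ u)
```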

Iterative Reweighted Least Squares Networks With Convergence Guarantees for Solving Inverse Imaging Problems

  • paper_url: http://arxiv.org/abs/2308.05745
  • repo_url: None
  • paper_authors: Iaroslav Koshelev, Stamatios Lefkimmiatis
  • for: image reconstruction tasks under analysis-based image regularization that promotes sparse and/or low-rank solutions in a learned transform domain.
  • methods: a novel optimization strategy that parameterizes such regularizers with potential functions and extends the Iteratively Reweighted Least Squares (IRLS) method, typically used for synthesis-based $\ell_p$ and $\mathcal{S}_p$ norm and analysis-based $\ell_1$ and nuclear-norm regularization.
  • results: the minimization algorithm is proven to converge linearly to a stationary point, with an upper bound on the convergence rate; the regularizer parameters are learned from training data by casting supervised learning as a stochastic bilevel optimization problem, which the convergence guarantees make tractable via a memory-efficient implicit back-propagation scheme. The learned IRLS variants, implemented as recurrent networks, compare favorably with other state-of-the-art learned reconstruction methods.
    Abstract In this work we present a novel optimization strategy for image reconstruction tasks under analysis-based image regularization, which promotes sparse and/or low-rank solutions in some learned transform domain. We parameterize such regularizers using potential functions that correspond to weighted extensions of the $\ell_p^p$-vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms with $0 < p \le 1$. Our proposed minimization strategy extends the Iteratively Reweighted Least Squares (IRLS) method, typically used for synthesis-based $\ell_p$ and $\mathcal{S}_p$ norm and analysis-based $\ell_1$ and nuclear norm regularization. We prove that under mild conditions our minimization algorithm converges linearly to a stationary point, and we provide an upper bound for its convergence rate. Further, to select the parameters of the regularizers that deliver the best results for the problem at hand, we propose to learn them from training data by formulating the supervised learning process as a stochastic bilevel optimization problem. We show that thanks to the convergence guarantees of our proposed minimization strategy, such optimization can be successfully performed with a memory-efficient implicit back-propagation scheme. We implement our learned IRLS variants as recurrent networks and assess their performance on the challenging image reconstruction tasks of non-blind deblurring, super-resolution and demosaicking. The comparisons against other existing learned reconstruction approaches demonstrate that our overall method is very competitive and in many cases outperforms existing unrolled networks, whose number of parameters is orders of magnitude higher than in our case.
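
To ground the discussion, here is a textbook dense-matrix IRLS for an analysis-based weighted $\ell_p$ objective, i.e. the classical scheme the paper extends. It is a minimal sketch, not the authors' learned, convergence-guaranteed variant.

```python
# Classical IRLS for min_x ||Ax - y||^2 + lam * sum_i (|(Lx)_i|^2 + eps)^(p/2),
# with A: (m, n) forward operator and L: (k, n) analysis operator.
import numpy as np

def irls(A, y, L, lam=0.1, p=1.0, eps=1e-6, iters=50):
    x = np.linalg.lstsq(A, y, rcond=None)[0]            # least-squares init
    for _ in range(iters):
        w = (np.abs(L @ x) ** 2 + eps) ** (p / 2 - 1)   # majorizer reweighting
        H = A.T @ A + lam * L.T @ (w[:, None] * L)      # weighted normal equations
        x = np.linalg.solve(H, A.T @ y)
    return x
```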

PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views with Learnt Shape Programs

  • paper_url: http://arxiv.org/abs/2308.05744
  • repo_url: https://github.com/manycore-research/PlankAssembly
  • paper_authors: Wentao Hu, Jia Zheng, Zixin Zhang, Xiaojun Yuan, Jian Yin, Zihan Zhou
  • for: automatically converting 2D line drawings from three orthographic views into 3D CAD models.
  • methods: a Transformer-based sequence generation model with learnt shape programs.
  • results: the method significantly outperforms existing approaches when the inputs are noisy or incomplete.
    Abstract In this paper, we develop a new method to automatically convert 2D line drawings from three orthographic views into 3D CAD models. Existing methods for this problem reconstruct 3D models by back-projecting the 2D observations into 3D space while maintaining explicit correspondence between the input and output. Such methods are sensitive to errors and noises in the input, thus often fail in practice where the input drawings created by human designers are imperfect. To overcome this difficulty, we leverage the attention mechanism in a Transformer-based sequence generation model to learn flexible mappings between the input and output. Further, we design shape programs which are suitable for generating the objects of interest to boost the reconstruction accuracy and facilitate CAD modeling applications. Experiments on a new benchmark dataset show that our method significantly outperforms existing ones when the inputs are noisy or incomplete.

Zero Grads Ever Given: Learning Local Surrogate Losses for Non-Differentiable Graphics

  • paper_url: http://arxiv.org/abs/2308.05739
  • repo_url: None
  • paper_authors: Michael Fischer, Tobias Ritschel
  • for: optimizing graphics problems with undefined or zero gradients.
  • methods: an automatically learned neural surrogate of the objective function, trained online with self-supervision on an actively smoothed objective, together with an efficient locality-focused sampling scheme.
  • results: the approach optimizes diverse non-convex, non-differentiable black-box graphics problems, such as visibility in rendering, discrete parameter spaces in procedural modelling, and optimal control in physics-driven animation, and scales well to higher-dimensional problems.
    Abstract Gradient-based optimization is now ubiquitous across graphics, but unfortunately can not be applied to problems with undefined or zero gradients. To circumvent this issue, the loss function can be manually replaced by a "surrogate" that has similar minima but is differentiable. Our proposed framework, ZeroGrads, automates this process by learning a neural approximation of the objective function, the surrogate, which in turn can be used to differentiate through arbitrary black-box graphics pipelines. We train the surrogate on an actively smoothed version of the objective and encourage locality, focusing the surrogate's capacity on what matters at the current training episode. The fitting is performed online, alongside the parameter optimization, and self-supervised, without pre-computed data or pre-trained models. As sampling the objective is expensive (it requires a full rendering or simulator run), we devise an efficient sampling scheme that allows for tractable run-times and competitive performance at little overhead. We demonstrate optimizing diverse non-convex, non-differentiable black-box problems in graphics, such as visibility in rendering, discrete parameter spaces in procedural modelling or optimal control in physics-driven animation. In contrast to more traditional algorithms, our approach scales well to higher dimensions, which we demonstrate on problems with up to 35k interlinked variables.
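
The following sketch captures the spirit of the method: fit a local neural surrogate of the black-box objective around the current parameters, then descend through the surrogate. The helper names, sampling scale, and sample count are our assumptions, not the paper's implementation.

```python
# Spirit-of-the-method sketch: (1) self-supervised fit of a surrogate near
# the current parameters, (2) gradient step through the surrogate.
import torch

def surrogate_descent_step(theta, f_blackbox, surrogate, opt_surr, opt_theta,
                           sigma=0.05, n_samples=8):
    # 1) Sample the (expensive, non-differentiable) objective locally.
    samples = theta.detach() + sigma * torch.randn(n_samples, theta.numel())
    targets = torch.tensor([float(f_blackbox(s)) for s in samples])
    fit_loss = ((surrogate(samples).squeeze(-1) - targets) ** 2).mean()
    opt_surr.zero_grad(); fit_loss.backward(); opt_surr.step()
    # 2) Descend on the differentiable surrogate instead of f itself.
    opt_theta.zero_grad()
    surrogate(theta.unsqueeze(0)).squeeze().backward()
    opt_theta.step()
```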

Follow Anything: Open-set detection, tracking, and following in real-time

  • paper_url: http://arxiv.org/abs/2308.05737
  • repo_url: https://github.com/alaamaalouf/followanything
  • paper_authors: Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M. Vogt, Robert J. Wood, Antonio Torralba, Daniela Rus
  • for: developing a robotic system that detects, tracks, and follows any object in real time, serving use cases from industrial automation and logistics to healthcare and security.
  • methods: an open-vocabulary, multimodal model that applies to novel classes at inference time via text, image, or click queries; it leverages rich visual descriptors from large-scale pre-trained (foundation) models to detect and segment objects in an input image sequence and tracks them across frames while handling occlusion and object re-emergence.
  • results: deployed on a real-world robotic system (a micro aerial vehicle), FAn seamlessly follows objects of interest in a real-time control loop, running on a laptop with a lightweight (6-8 GB) GPU at 6-20 frames per second. To promote rapid adoption, deployment, and extension, all code is open-sourced on the project page (https://github.com/alaamaalouf/FollowAnything), with a 5-minute explainer video (https://www.youtube.com/watch?v=6Mgt3EPytrw).
    Abstract Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model -- it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage at https://github.com/alaamaalouf/FollowAnything. We also encourage the reader to watch our 5-minute explainer video at https://www.youtube.com/watch?v=6Mgt3EPytrw.

MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

  • paper_url: http://arxiv.org/abs/2308.05736
  • repo_url: https://github.com/hustvl/maptr
  • paper_authors: Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, Xinggang Wang
  • for: providing an online framework for vectorized HD map construction, a fundamental component for planning in autonomous driving systems.
  • methods: a unified permutation-equivalent modeling approach that represents each map element as a point set with a group of equivalent permutations, accurately describing element shape and stabilizing learning; a hierarchical query embedding scheme and hierarchical bipartite matching flexibly encode structured map information, with auxiliary one-to-many matching and dense supervision introduced to speed up convergence.
  • results: the method runs at real-time inference speed, achieves state-of-the-art performance on the nuScenes and Argoverse2 datasets, and produces stable, robust map construction quality in complex and diverse driving scenes.
    Abstract High-definition (HD) map provides abundant and precise static environmental information of the driving scene, serving as a fundamental and indispensable component for planning in autonomous driving system. In this paper, we present MapTRansformer (MapTR), an end-to-end framework for online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, i.e., modeling map element as a point set with a group of equivalent permutations, which accurately describes the shape of map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. To speed up convergence, we further introduce auxiliary one-to-many matching and dense supervision. The proposed method well copes with various map elements with arbitrary shapes. It runs at real-time inference speed and achieves state-of-the-art performance on both nuScenes and Argoverse2 datasets. Abundant qualitative results show stable and robust map construction quality in complex and various driving scenes. Code and more demos are available at https://github.com/hustvl/MapTR for facilitating further studies and applications.
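
The permutation-equivalent cost can be sketched compactly: the regression cost to a polyline is the minimum over its equivalent orderings. As a simplification, only the two traversal directions of an open polyline are considered below; for closed shapes the permutation group also includes cyclic shifts.

```python
# Sketch of permutation-equivalent point-set matching: take the minimum cost
# over equivalent orderings (here, just forward vs. reversed traversal).
import torch

def pointset_cost(pred, gt):
    """pred, gt: (N, 2) ordered polyline points; returns the L1 set cost."""
    cost_fwd = (pred - gt).abs().sum()
    cost_rev = (pred - gt.flip(0)).abs().sum()   # reversed traversal order
    return torch.minimum(cost_fwd, cost_rev)
```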

FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models

  • paper_url: http://arxiv.org/abs/2308.05733
  • repo_url: https://github.com/aim-uofa/FrozenRecon
  • paper_authors: Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, Feng Zhao
  • for: the task of 3D scene reconstruction, specifically addressing the challenge of robustly obtaining camera poses and achieving dense scene reconstruction in diverse real-world scenarios.
  • methods: a novel test-time optimization approach that leverages pre-trained affine-invariant depth models, such as LeReS, to ensure inter-frame consistency and achieve robust scene reconstruction; the depth predictions are frozen, rectified with a geometric consistency alignment module, and the resulting scale-consistent depth maps are employed to obtain camera poses and reconstruct the scene.
  • results: state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets, demonstrating the effectiveness of the approach in handling diverse real-world scenarios and improving the robustness of 3D scene reconstruction.
    Abstract 3D scene reconstruction is a long-standing vision task. Existing approaches can be categorized into geometry-based and learning-based methods. The former leverages multi-view geometry but can face catastrophic failures due to the reliance on accurate pixel correspondence across views. The latter was proffered to mitigate these issues by learning 2D or 3D representation directly. However, without a large-scale video or 3D training data, it can hardly generalize to diverse real-world scenarios due to the presence of tens of millions or even billions of optimization parameters in the deep network. Recently, robust monocular depth estimation models trained with large-scale datasets have been proven to possess weak 3D geometry prior, but they are insufficient for reconstruction due to the unknown camera parameters, the affine-invariant property, and inter-frame inconsistency. Here, we propose a novel test-time optimization approach that can transfer the robustness of affine-invariant depth models such as LeReS to challenging diverse scenes while ensuring inter-frame consistency, with only dozens of parameters to optimize per video frame. Specifically, our approach involves freezing the pre-trained affine-invariant depth model's depth predictions, rectifying them by optimizing the unknown scale-shift values with a geometric consistency alignment module, and employing the resulting scale-consistent depth maps to robustly obtain camera poses and achieve dense scene reconstruction, even in low-texture regions. Experiments show that our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
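
The rectification step can be illustrated with a closed-form least-squares scale/shift alignment of an affine-invariant depth map to anchor depths; the paper optimizes scale and shift jointly with camera poses, so this is a simplified stand-in.

```python
# Closed-form least-squares scale/shift alignment of an affine-invariant
# depth prediction to anchor depths over valid pixels.
import numpy as np

def align_scale_shift(pred, anchor, mask):
    """pred, anchor: (H, W) depth maps; mask: (H, W) boolean valid pixels."""
    d, t = pred[mask], anchor[mask]
    A = np.stack([d, np.ones_like(d)], axis=1)          # design matrix [d, 1]
    (scale, shift), *_ = np.linalg.lstsq(A, t, rcond=None)
    return scale * pred + shift
```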

Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction

  • paper_url: http://arxiv.org/abs/2308.05721
  • repo_url: https://github.com/yangyangxu0/demtg
  • paper_authors: Yangyang Xu, Yibo Yang, Bernard Ghanemm, Lefei Zhang, Du Bo, Dacheng Tao
  • for: developing a new multi-task learning (MTL) model for dense prediction that combines the merits of deformable CNNs and query-based Transformers with shared gating.
  • methods: a simple and effective encoder-decoder architecture that unifies convolution and attention mechanisms, providing deformable and comprehensive features for all tasks from both local and global perspectives.
  • results: experiments show that the proposed DeMTG outperforms current Transformer-based and CNN-based competitive models on a variety of metrics across three dense prediction datasets, while using fewer GFLOPs; code and models are available at https://github.com/yangyangxu0/DeMTG.
    Abstract CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL). Most of the current studies on MTL solely rely on CNN or Transformer. In this work, we present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction. This combination may offer a simple and efficient solution owing to its powerful and flexible task-specific learning and advantages of lower cost, less complexity and smaller parameters than the traditional MTL methods. We introduce deformable mixer Transformer with gating (DeMTG), a simple and effective encoder-decoder architecture up-to-date that incorporates the convolution and attention mechanism in a unified network for MTL. It is exquisitely designed to use advantages of each block, and provide deformable and comprehensive features for all tasks from local and global perspective. First, the deformable mixer encoder contains two types of operators: the channel-aware mixing operator leveraged to allow communication among different channels, and the spatial-aware deformable operator with deformable convolution applied to efficiently sample more informative spatial locations. Second, the task-aware gating transformer decoder is used to perform the task-specific predictions, in which task interaction block integrated with self-attention is applied to capture task interaction features, and the task query block integrated with gating attention is leveraged to select corresponding task-specific features. Further, the experiment results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.

Temporally-Adaptive Models for Efficient Video Understanding

  • paper_url: http://arxiv.org/abs/2308.05787
  • repo_url: https://github.com/alibaba-mmai-research/TAdaConv
  • paper_authors: Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Yingya Zhang, Ziwei Liu, Marcelo H. Ang Jr
  • for: improving temporal modeling in video understanding models for better accuracy and efficiency.
  • methods: Temporally-Adaptive Convolutions (TAdaConv), which calibrate the convolution weights for each frame according to its local and global temporal context, endowing spatial convolutions with temporal modeling ability.
  • results: TAdaConvNeXtV2 and TAdaFormer perform competitively against state-of-the-art convolutional and Transformer-based models across video understanding benchmarks, with higher accuracy and efficiency on several tasks.
    Abstract Spatial convolutions are extensively used in numerous deep video models. It fundamentally assumes spatio-temporal invariance, i.e., using shared weights for every location in different frames. This work presents Temporally-Adaptive Convolutions (TAdaConv) for video understanding, which shows that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modeling complex temporal dynamics in videos. Specifically, TAdaConv empowers spatial convolutions with temporal modeling abilities by calibrating the convolution weights for each frame according to its local and global temporal context. Compared to existing operations for temporal modeling, TAdaConv is more efficient as it operates over the convolution kernels instead of the features, whose dimension is an order of magnitude smaller than the spatial resolutions. Further, kernel calibration brings an increased model capacity. Based on this readily plug-in operation TAdaConv as well as its extension, i.e., TAdaConvV2, we construct TAdaBlocks to empower ConvNeXt and Vision Transformer to have strong temporal modeling capabilities. Empirical results show TAdaConvNeXtV2 and TAdaFormer perform competitively against state-of-the-art convolutional and Transformer-based models in various video understanding benchmarks. Our codes and models are released at: https://github.com/alibaba-mmai-research/TAdaConv.
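
A stripped-down sketch of the idea follows, assuming a per-frame output-channel calibration predicted from pooled temporal context; the published operator also uses global descriptors and a more elaborate calibration head, so treat this as an approximation.

```python
# Simplified temporally-adaptive convolution: a shared spatial kernel whose
# output channels are calibrated per frame from pooled temporal context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTAdaConv(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)
        self.calib = nn.Conv1d(channels, channels, 3, padding=1)  # temporal head

    def forward(self, x):                                   # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        alpha = torch.sigmoid(self.calib(x.mean(dim=(3, 4))))     # (B, C, T)
        y = F.conv2d(x.transpose(1, 2).reshape(B * T, C, H, W),
                     self.weight, padding=self.weight.shape[-1] // 2)
        y = y.reshape(B, T, C, H, W).transpose(1, 2)              # (B, C, T, H, W)
        # Scaling the kernel's output channels == scaling the conv output.
        return y * alpha.unsqueeze(-1).unsqueeze(-1)
```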
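
The core idea, calibrating a shared spatial kernel per frame, can be sketched as follows; the calibration head (a 1D temporal convolution over pooled frame descriptors) and all shapes are simplifying assumptions rather than the released TAdaConv implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAdaConv2dSketch(nn.Module):
    """Minimal sketch of a temporally-adaptive convolution: a shared base
    kernel is scaled per frame by a factor predicted from temporal context."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.base = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.02)
        # temporal conv over per-frame descriptors -> one factor per out-channel
        self.calib = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)
        self.k = k

    def forward(self, x):                                  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        ctx = x.mean(dim=(3, 4)).transpose(1, 2)           # (B, C, T) frame context
        alpha = 1 + torch.tanh(self.calib(ctx))            # (B, c_out, T)
        alpha = alpha.permute(0, 2, 1)[..., None, None, None]  # (B, T, c_out, 1, 1, 1)
        w_t = alpha * self.base                            # per-frame kernels
        # grouped conv applies a different kernel to each (batch, frame) slice
        x = x.reshape(1, b * t * c, h, w)
        w_t = w_t.reshape(b * t * self.base.shape[0], c, self.k, self.k)
        y = F.conv2d(x, w_t, padding=self.k // 2, groups=b * t)
        return y.reshape(b, t, -1, h, w)

video = torch.randn(2, 8, 16, 32, 32)                      # (batch, frames, C, H, W)
print(TAdaConv2dSketch(16, 32)(video).shape)               # torch.Size([2, 8, 32, 32, 32])
```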

Spatial Pathomics Toolkit for Quantitative Analysis of Podocyte Nuclei with Histology and Spatial Transcriptomics Data in Renal Pathology

  • paper_url: http://arxiv.org/abs/2308.06288
  • repo_url: https://github.com/hrlblab/spatial_pathomics
  • paper_authors: Jiayuan Chen, Yu Wang, Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Yilin Liu, Jianyong Zhong, Agnes B. Fogo, Haichun Yang, Shilin Zhao, Yuankai Huo
  • for: This paper introduces a new toolkit for comprehensively assessing podocyte phenotypes in renal pathology.
  • methods: The toolkit consists of three main components: (1) instance object segmentation to precisely identify podocyte nuclei; (2) pathomics feature generation, which extracts an array of quantitative features from the identified nuclei; and (3) robust statistical analyses for a comprehensive exploration of spatial relationships between morphological and spatial transcriptomics features.
  • results: The toolkit successfully extracted and analyzed morphological and textural features of podocyte nuclei, revealing a range of podocyte morphomic features through statistical analysis. It also uncovered spatial information inherent to podocyte distribution, offering a new perspective on glomerular injury.
    Abstract Podocytes, specialized epithelial cells that envelop the glomerular capillaries, play a pivotal role in maintaining renal health. The current description and quantification of features on pathology slides are limited, prompting the need for innovative solutions to comprehensively assess diverse phenotypic attributes within Whole Slide Images (WSIs). In particular, understanding the morphological characteristics of podocytes, terminally differentiated glomerular epithelial cells, is crucial for studying glomerular injury. This paper introduces the Spatial Pathomics Toolkit (SPT) and applies it to podocyte pathomics. The SPT consists of three main components: (1) instance object segmentation, enabling precise identification of podocyte nuclei; (2) pathomics feature generation, extracting a comprehensive array of quantitative features from the identified nuclei; and (3) robust statistical analyses, facilitating a comprehensive exploration of spatial relationships between morphological and spatial transcriptomics features. The SPT successfully extracted and analyzed morphological and textural features from podocyte nuclei, revealing a multitude of podocyte morphomic features through statistical analysis. Additionally, we demonstrated the SPT's ability to unravel spatial information inherent to podocyte distribution, shedding light on spatial patterns associated with glomerular injury. By disseminating the SPT, our goal is to provide the research community with a powerful and user-friendly resource that advances cellular spatial pathomics in renal pathology. The implementation and its complete source code of the toolkit are made openly accessible at https://github.com/hrlblab/spatial_pathomics.
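
A hedged sketch of the feature-generation step using scikit-image's regionprops; the toolkit's actual feature set is richer than the five illustrative properties shown here:

```python
import numpy as np
from skimage import measure

def nucleus_pathomics(label_image: np.ndarray, intensity: np.ndarray):
    """Given an instance-segmentation label map of nuclei and the grayscale
    slide patch, extract per-nucleus morphology/texture features plus
    centroids for downstream spatial statistics."""
    feats = []
    for r in measure.regionprops(label_image, intensity_image=intensity):
        feats.append({
            "label": r.label,
            "area": r.area,                      # size in pixels
            "eccentricity": r.eccentricity,      # shape elongation
            "solidity": r.solidity,              # convexity
            "mean_intensity": r.mean_intensity,  # simple texture proxy
            "centroid": r.centroid,              # for spatial analysis
        })
    return feats

# toy example: two rectangular "nuclei"
lbl = np.zeros((32, 32), dtype=int)
lbl[2:8, 2:8] = 1
lbl[20:30, 18:26] = 2
img = np.random.rand(32, 32)
for f in nucleus_pathomics(lbl, img):
    print(f)
```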

Shadow Datasets, New challenging datasets for Causal Representation Learning

  • paper_url: http://arxiv.org/abs/2308.05707
  • repo_url: https://github.com/Jiagengzhu/Shadow-dataset-for-crl
  • paper_authors: Jiageng Zhu, Hanchen Xie, Jianhua Wu, Jiazhi Li, Mahyar Khayatkhoei, Mohamed E. Hussein, Wael AbdAlmageed
  • for: This study investigates causal relations among semantic factors and the evaluation of weakly supervised causal representation learning (CRL) methods.
  • methods: A weakly supervised, GAN-based CRL approach is evaluated on four existing datasets: Pendulum, Flow, CelebA(BEARD), and CelebA(SMILE).
  • results: The authors find that existing CRL datasets are limited, and propose two new datasets with a larger number of diverse generative factors and more sophisticated causal graphs. They also modify the originally proposed causal graphs of the real datasets CelebA(BEARD) and CelebA(SMILE) to better match the dataset distributions.
    Abstract Discovering causal relations among semantic factors is an emergent topic in representation learning. Most causal representation learning (CRL) methods are fully supervised, which is impractical due to costly labeling. To resolve this restriction, weakly supervised CRL methods were introduced. To evaluate CRL performance, four existing datasets, Pendulum, Flow, CelebA(BEARD) and CelebA(SMILE), are utilized. However, existing CRL datasets are limited to simple graphs with few generative factors. Thus we propose two new datasets with a larger number of diverse generative factors and more sophisticated causal graphs. In addition, for the current real datasets, CelebA(BEARD) and CelebA(SMILE), the originally proposed causal graphs are not aligned with the dataset distributions. Thus, we propose modifications to them.

Masked Diffusion as Self-supervised Representation Learner

  • paper_url: http://arxiv.org/abs/2308.05695
  • repo_url: None
  • paper_authors: Zixuan Pan, Jianxu Chen, Yiyu Shi
  • for: This paper examines the relationship between the generative capability and the representation learning ability of diffusion models, and how a masking mechanism can strengthen the latter.
  • methods: The work uses diffusion models but replaces the conventional additive Gaussian noise with a masking mechanism for self-supervised learning.
  • results: The approach achieves strong performance on medical and natural image semantic segmentation tasks, particularly in few-shot scenarios.
    Abstract Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and been used as strong pixel-level representation learners. This paper decomposes the interrelation between the generative capability and representation learning ability inherent in diffusion models. We present masked diffusion model (MDM), a scalable self-supervised representation learner that substitutes the conventional additive Gaussian noise of traditional diffusion with a masking mechanism. Our proposed approach convincingly surpasses prior benchmarks, demonstrating remarkable advancements in both medical and natural image semantic segmentation tasks, particularly within the context of few-shot scenario.
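
The corruption step the abstract describes, masking instead of additive Gaussian noise, might look like the following sketch; the patch size and masking schedule are assumptions:

```python
import torch

def mask_corrupt(x: torch.Tensor, t: torch.Tensor, patch: int = 4):
    """Drop a timestep-dependent fraction of image patches instead of
    adding Gaussian noise.

    x: (B, C, H, W) images; t in [0, 1]: per-sample corruption level.
    Returns the masked images and the binary mask (1 = kept).
    """
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep_prob = (1.0 - t).view(b, 1, 1)                 # more masking at larger t
    mask = (torch.rand(b, gh, gw) < keep_prob).float()  # patch-level mask
    mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return x * mask[:, None], mask

x = torch.randn(8, 3, 32, 32)
t = torch.rand(8)
x_masked, m = mask_corrupt(x, t)
# a reconstruction model is then trained to recover x from x_masked
print(x_masked.shape, m.mean().item())
```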

Leverage Weakly Annotation to Pixel-wise Annotation via Zero-shot Segment Anything Model for Molecular-empowered Learning

  • paper_url: http://arxiv.org/abs/2308.05785
  • repo_url: None
  • paper_authors: Xueyuan Li, Ruining Deng, Yucheng Tang, Shunxing Bao, Haichun Yang, Yuankai Huo
  • for: This work aims to make pixel-level annotation for multiple cell types in giga-pixel whole slide images (WSIs) scalable by reducing the burden on annotators.
  • methods: Building on molecular-empowered learning, in which lay annotators label cells using extra immunofluorescence images for reference, the method applies the Segment Anything Model (SAM) in a zero-shot fashion to turn weak box annotations into pixel-level labels, which are then used to train a segmentation model (SAM-L).
  • results: SAM-L reduces the labeling effort for lay annotators to weak box annotations only, without compromising annotation accuracy or the performance of the deep learning-based segmentation.
    Abstract Precise identification of multiple cell classes in high-resolution Giga-pixel whole slide imaging (WSI) is critical for various clinical scenarios. Building an AI model for this purpose typically requires pixel-level annotations, which are often unscalable and must be done by skilled domain experts (e.g., pathologists). However, these annotations can be prone to errors, especially when distinguishing between intricate cell types (e.g., podocytes and mesangial cells) using only visual inspection. Interestingly, a recent study showed that lay annotators, when using extra immunofluorescence (IF) images for reference (referred to as molecular-empowered learning), can sometimes outperform domain experts in labeling. Despite this, the resource-intensive task of manual delineation remains a necessity during the annotation process. In this paper, we explore the potential of bypassing pixel-level delineation by employing the recent segment anything model (SAM) on weak box annotation in a zero-shot learning approach. Specifically, we harness SAM's ability to produce pixel-level annotations from box annotations and utilize these SAM-generated labels to train a segmentation model. Our findings show that the proposed SAM-assisted molecular-empowered learning (SAM-L) can diminish the labeling efforts for lay annotators by only requiring weak box annotations. This is achieved without compromising annotation accuracy or the performance of the deep learning-based segmentation. This research represents a significant advancement in democratizing the annotation process for training pathological image segmentation, relying solely on non-expert annotators.
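
The box-to-mask step can be sketched with the public segment_anything package; the checkpoint path and box coordinates below are placeholders:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Convert a lay annotator's weak box into a pixel-level mask, zero-shot.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

def box_to_mask(image_rgb: np.ndarray, box_xyxy: np.ndarray) -> np.ndarray:
    predictor.set_image(image_rgb)                 # (H, W, 3) uint8 RGB patch
    masks, scores, _ = predictor.predict(
        box=box_xyxy,                              # [x0, y0, x1, y1]
        multimask_output=False,                    # one mask per box
    )
    return masks[0]                                # (H, W) boolean mask

# SAM-generated pseudo-labels then supervise an ordinary segmentation model
image = np.zeros((512, 512, 3), dtype=np.uint8)
pseudo_mask = box_to_mask(image, np.array([100, 120, 220, 260]))
print(pseudo_mask.shape, pseudo_mask.dtype)
```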

High-performance Data Management for Whole Slide Image Analysis in Digital Pathology

  • paper_url: http://arxiv.org/abs/2308.05784
  • repo_url: https://github.com/hrlblab/adios
  • paper_authors: Haoju Leng, Ruining Deng, Shunxing Bao, Dazheng Fang, Bryan A. Millis, Yucheng Tang, Haichun Yang, Xiao Wang, Yifan Peng, Lipeng Wan, Yuankai Huo
  • for: The paper addresses the input-output (I/O) bottleneck encountered when deploying image analysis algorithms on whole-slide images (WSIs).
  • methods: It proposes using the Adaptable IO System version 2 (ADIOS2) to streamline data management across WSIs and reduce data retrieval times.
  • results: ADIOS2 achieves a two-fold speed-up over the brute-force approach in a CPU-based image analysis scenario, and performs on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS), in a GPU-based deep learning scenario.
    Abstract When dealing with giga-pixel digital pathology in whole-slide imaging, a notable proportion of data records holds relevance during each analysis operation. For instance, when deploying an image analysis algorithm on whole-slide images (WSI), the computational bottleneck often lies in the input-output (I/O) system. This is particularly notable as patch-level processing introduces a considerable I/O load onto the computer system. However, this data management process could be further paralleled, given the typical independence of patch-level image processes across different patches. This paper details our endeavors in tackling this data access challenge by implementing the Adaptable IO System version 2 (ADIOS2). Our focus has been constructing and releasing a digital pathology-centric pipeline using ADIOS2, which facilitates streamlined data management across WSIs. Additionally, we've developed strategies aimed at curtailing data retrieval times. The performance evaluation encompasses two key scenarios: (1) a pure CPU-based image analysis scenario ("CPU scenario"), and (2) a GPU-based deep learning framework scenario ("GPU scenario"). Our findings reveal noteworthy outcomes. Under the CPU scenario, ADIOS2 showcases an impressive two-fold speed-up compared to the brute-force approach. In the GPU scenario, its performance stands on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS). From what we know, this appears to be among the initial instances, if any, of utilizing ADIOS2 within the field of digital pathology. The source code has been made publicly available at https://github.com/hrlblab/adios.
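
The access pattern that makes this work, reading only the patch a worker needs instead of decoding the whole WSI, can be sketched with ADIOS2's high-level Python API; note that API names vary across ADIOS2 versions, and the file and variable names below are placeholders:

```python
import adios2  # high-level Python API (ADIOS2 2.x; names may differ by version)

def read_patch(path, var, y, x, h, w):
    """Selective read: only the requested (start, count) block travels
    through the I/O system, not the whole gigapixel image."""
    with adios2.open(path, "r") as fh:
        return fh.read(var, start=[y, x], count=[h, w])

# patch-level workers each pull just their own region
patch = read_patch("wsi.bp", "image", y=4096, x=8192, h=512, w=512)
print(patch.shape)  # (512, 512)
```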

Multi-scale Multi-site Renal Microvascular Structures Segmentation for Whole Slide Imaging in Renal Pathology

  • paper_url: http://arxiv.org/abs/2308.05782
  • repo_url: None
  • paper_authors: Franklin Hu, Ruining Deng, Shunxing Bao, Haichun Yang, Yuankai Huo
  • for: This paper is written for renal pathologists who need a computational tool for the quantitative analysis of renal microvascular structures.
  • methods: The paper uses a novel single dynamic network method called Omni-Seg, which capitalizes on multi-site, multi-scale training data and utilizes partially labeled images to segment microvascular structures.
  • results: The experimental results indicate that Omni-Seg outperforms other methods in terms of both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU).
    Abstract Segmentation of microvascular structures, such as arterioles, venules, and capillaries, from human kidney whole slide images (WSI) has become a focal point in renal pathology. Current manual segmentation techniques are time-consuming and not feasible for large-scale digital pathology images. While deep learning-based methods offer a solution for automatic segmentation, most suffer from a limitation: they are designed for and restricted to training on single-site, single-scale data. In this paper, we present Omni-Seg, a novel single dynamic network method that capitalizes on multi-site, multi-scale training data. Unique to our approach, we utilize partially labeled images, where only one tissue type is labeled per training image, to segment microvascular structures. We train a singular deep network using images from two datasets, HuBMAP and NEPTUNE, across different magnifications (40x, 20x, 10x, and 5x). Experimental results indicate that Omni-Seg outperforms in terms of both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). Our proposed method provides renal pathologists with a powerful computational tool for the quantitative analysis of renal microvascular structures.
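
Training with partially labeled images can be sketched as a loss that only touches the one annotated class per image; this is a simplification of Omni-Seg's dynamic-network design:

```python
import torch
import torch.nn.functional as F

def partial_label_bce(logits: torch.Tensor, mask: torch.Tensor,
                      labeled_class: torch.Tensor) -> torch.Tensor:
    """logits: (B, K, H, W) predictions for K microvascular classes;
    mask: (B, H, W) binary ground truth for the *one* labeled class;
    labeled_class: (B,) index of the class annotated in each image.
    Only the annotated class's channel receives a loss signal."""
    idx = labeled_class.view(-1, 1, 1, 1).expand(-1, 1, *logits.shape[2:])
    picked = logits.gather(1, idx).squeeze(1)          # (B, H, W)
    return F.binary_cross_entropy_with_logits(picked, mask.float())

logits = torch.randn(4, 3, 64, 64)                     # e.g. arteriole/venule/capillary
mask = torch.rand(4, 64, 64) > 0.5
labeled_class = torch.tensor([0, 2, 1, 0])             # one tissue type per image
print(partial_label_bce(logits, mask, labeled_class))
```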

2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds

  • paper_url: http://arxiv.org/abs/2308.05667
  • repo_url: None
  • paper_authors: Minhao Li, Zheng Qin, Zhirui Gao, Renjiao Yi, Chenyang Zhu, Yulan Guo, Kai Xu
  • for: Accurate and robust registration between images and point clouds (cross-modality registration).
  • methods: The paper proposes a detection-free method, 2D3D-MATR, which first computes coarse correspondences between downsampled patches of the input image and the point cloud and then extends them into dense pixel-point correspondences within the patch region. A transformer jointly learns global contextual constraints with self-attention and cross-modality correlations with cross-attention; a multi-scale pyramid for each image patch resolves scale ambiguity by finding, for each point patch, the best-matching image patch at a proper resolution level.
  • results: Extensive experiments on two public benchmarks show that 2D3D-MATR outperforms the previous state-of-the-art P2-Net by around 20 percentage points on inlier ratio and over 10 points on registration recall.
    Abstract The commonly adopted detect-then-match approach to registration finds difficulties in the cross-modality cases due to the incompatible keypoint detection and inconsistent feature description. We propose, 2D3D-MATR, a detection-free method for accurate and robust registration between images and point clouds. Our method adopts a coarse-to-fine pipeline where it first computes coarse correspondences between downsampled patches of the input image and the point cloud and then extends them to form dense correspondences between pixels and points within the patch region. The coarse-level patch matching is based on transformer which jointly learns global contextual constraints with self-attention and cross-modality correlations with cross-attention. To resolve the scale ambiguity in patch matching, we construct a multi-scale pyramid for each image patch and learn to find for each point patch the best matching image patch at a proper resolution level. Extensive experiments on two public benchmarks demonstrate that 2D3D-MATR outperforms the previous state-of-the-art P2-Net by around $20$ percentage points on inlier ratio and over $10$ points on registration recall. Our code and models are available at https://github.com/minhaolee/2D3DMATR.

cs.AI - 2023-08-11

Towards a Causal Probabilistic Framework for Prediction, Action-Selection & Explanations for Robot Block-Stacking Tasks

  • paper_url: http://arxiv.org/abs/2308.06203
  • repo_url: None
  • paper_authors: Ricardo Cannizzaro, Jonathan Routley, Lars Kunze
  • for: This paper provides a causal probabilistic framework that embeds physics simulation into a structural causal model, so that a robot in a block-stacking task can perceive the current state, select the next-best action, and explain the consequences.
  • methods: The framework combines causal inference with Bayesian networks to encode formal knowledge of causal relationships alongside probabilistic representations of noise and uncertainty. Physics simulation models the current state of the stacking task, and post-hoc counterfactual explanations are generated.
  • results: The paper presents the novel causal probabilistic framework with exemplar next-best action selection results, and outlines planned experimentation in simulated and real-world robot block-stacking tasks.
    Abstract Uncertainties in the real world mean that is impossible for system designers to anticipate and explicitly design for all scenarios that a robot might encounter. Thus, robots designed like this are fragile and fail outside of highly-controlled environments. Causal models provide a principled framework to encode formal knowledge of the causal relationships that govern the robot's interaction with its environment, in addition to probabilistic representations of noise and uncertainty typically encountered by real-world robots. Combined with causal inference, these models permit an autonomous agent to understand, reason about, and explain its environment. In this work, we focus on the problem of a robot block-stacking task due to the fundamental perception and manipulation capabilities it demonstrates, required by many applications including warehouse logistics and domestic human support robotics. We propose a novel causal probabilistic framework to embed a physics simulation capability into a structural causal model to permit robots to perceive and assess the current state of a block-stacking task, reason about the next-best action from placement candidates, and generate post-hoc counterfactual explanations. We provide exemplar next-best action selection results and outline planned experimentation in simulated and real-world robot block-stacking tasks.

Exploring Predicate Visual Context in Detecting of Human-Object Interactions

  • paper_url: http://arxiv.org/abs/2308.06202
  • repo_url: https://github.com/fredzzhang/pvic
  • paper_authors: Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Gould
  • for: This paper studies human-object interaction (HOI) detection, in particular two-stage transformer-based HOI detectors.
  • methods: Through visualisations and carefully designed experiments, the work studies how best to re-introduce image features via cross-attention, with an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, yielding enhanced predicate visual context (PViC).
  • results: The model outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks while maintaining low training cost.
    Abstract Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

  • paper_url: http://arxiv.org/abs/2308.06197
  • repo_url: https://github.com/angusmaiden/complex-fer
  • paper_authors: Angus Maiden, Bahareh Nakisa
  • for: This paper proposes a novel continual learning method, inspired by human cognition and learning, that can accurately recognise new compound expression classes from only a few training samples.
  • methods: The method builds on and retains knowledge of basic expression classes through knowledge distillation and a novel Predictive Sorting Memory Replay, and uses GradCAM visualisations to expose the relationship between basic and compound facial expressions.
  • results: The method achieves the state of the art in continual learning for complex facial expression recognition with 74.28% overall accuracy on new classes, improving on state-of-the-art non-continual learning methods by 13.95%. Applying few-shot learning, it reaches 100% accuracy using a single training sample per expression class.
    Abstract Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in this domain as a human, it may need to synthesise knowledge and understand new concepts in real-time as humans do. Humans are able to learn new concepts using only few examples, by distilling the important information from memories and discarding the rest. Similarly, continual learning methods learn new classes whilst retaining the knowledge of known classes, whilst few-shot learning methods are able to learn new classes using very few training examples. We propose a novel continual learning method inspired by human cognition and learning that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. Using GradCAM visualisations, we demonstrate the relationship between basic and compound facial expressions, which our method leverages through knowledge distillation and a novel Predictive Sorting Memory Replay. Our method achieves the current state-of-the-art in continual learning for complex facial expression recognition with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. To the best of our knowledge, our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using a single training sample for each expression class.
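
The distillation component that carries basic-expression knowledge into compound-expression learning can be sketched with a standard temperature-scaled KD loss; the temperature and weighting values here are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Blend a soft target (teacher's tempered distribution) with the
    usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # standard temperature scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 7)                 # e.g. 7 basic expression classes
t = torch.randn(8, 7)
y = torch.randint(0, 7, (8,))
print(distillation_loss(s, t, y))
```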

Software Doping Analysis for Human Oversight

  • paper_url: http://arxiv.org/abs/2308.06186
  • repo_url: None
  • paper_authors: Sebastian Biewer, Kevin Baum, Sarah Sterz, Holger Hermanns, Sven Hetmank, Markus Langer, Anne Lauber-Rönsberg, Franz Lehr
  • for: This article introduces a framework for mitigating societal risks that software can pose, covering software doping as well as unfairness and discrimination in high-risk decision-making systems.
  • methods: It combines the formal foundations of software doping analysis with established probabilistic falsification techniques into a black-box analysis technique for identifying undesired effects of software, applied to emission cleaning systems in diesel cars and to high-risk systems that evaluate humans in possibly unfair or discriminating ways.
  • results: The approach can assist humans-in-the-loop in making better-informed and more responsible decisions, promoting the effective human oversight that the European Union's upcoming AI Act will require. The technical contribution is complemented by a juridically, philosophically, and psychologically informed perspective on the potential problems caused by such systems.
    Abstract This article introduces a framework that is meant to assist in mitigating societal risks that software can pose. Concretely, this encompasses facets of software doping as well as unfairness and discrimination in high-risk decision-making systems. The term software doping refers to software that contains surreptitiously added functionality that is against the interest of the user. A prominent example of software doping are the tampered emission cleaning systems that were found in millions of cars around the world when the diesel emissions scandal surfaced. The first part of this article combines the formal foundations of software doping analysis with established probabilistic falsification techniques to arrive at a black-box analysis technique for identifying undesired effects of software. We apply this technique to emission cleaning systems in diesel cars but also to high-risk systems that evaluate humans in a possibly unfair or discriminating way. We demonstrate how our approach can assist humans-in-the-loop to make better informed and more responsible decisions. This is to promote effective human oversight, which will be a central requirement enforced by the European Union's upcoming AI Act. We complement our technical contribution with a juridically, philosophically, and psychologically informed perspective on the potential problems caused by such systems.

Physical Adversarial Attacks For Camera-based Smart Systems: Current Trends, Categorization, Applications and Research Challenges

  • paper_url: http://arxiv.org/abs/2308.06173
  • repo_url: None
  • paper_authors: Amira Guesmi, Muhammad Abdullah Hanif, Bassem Ouni, Muhammed Shafique
  • for: This paper provides a comprehensive survey of physical adversarial attacks, intended as a resource for researchers, practitioners, and policymakers to understand their characteristics and challenges and to support the development of robust and secure deep-learning systems.
  • methods: The survey covers physical adversarial attack methods categorized by target task across applications including classification, detection, face recognition, semantic segmentation, and depth estimation, assessing each in terms of effectiveness, stealthiness, and robustness.
  • results: The analysis shows how these attack methods perform across different application scenarios, identifies current challenges and potential vulnerabilities, and outlines future directions such as enhanced defense mechanisms, novel attack strategies, and standardized benchmarks and evaluation criteria.
    Abstract In this paper, we present a comprehensive survey of the current trends focusing specifically on physical adversarial attacks. We aim to provide a thorough understanding of the concept of physical adversarial attacks, analyzing their key characteristics and distinguishing features. Furthermore, we explore the specific requirements and challenges associated with executing attacks in the physical world. Our article delves into various physical adversarial attack methods, categorized according to their target tasks in different applications, including classification, detection, face recognition, semantic segmentation and depth estimation. We assess the performance of these attack methods in terms of their effectiveness, stealthiness, and robustness. We examine how each technique strives to ensure the successful manipulation of DNNs while mitigating the risk of detection and withstanding real-world distortions. Lastly, we discuss the current challenges and outline potential future research directions in the field of physical adversarial attacks. We highlight the need for enhanced defense mechanisms, the exploration of novel attack strategies, the evaluation of attacks in different application domains, and the establishment of standardized benchmarks and evaluation criteria for physical adversarial attacks. Through this comprehensive survey, we aim to provide a valuable resource for researchers, practitioners, and policymakers to gain a holistic understanding of physical adversarial attacks in computer vision and facilitate the development of robust and secure DNN-based systems.
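
As a concrete instance of the attack family the survey covers, here is a sketch of one optimization step for an adversarial patch with randomized placement (a simple expectation-over-transformation); the model, transform set, and hyperparameters are toy placeholders:

```python
import torch
import torch.nn.functional as F

def patch_attack_step(model, images, labels, patch, opt):
    """One step of untargeted patch optimization: paste the patch at a
    random location (approximating physical variation) and *maximize*
    the true-class loss."""
    b, _, h, w = images.shape
    y0 = torch.randint(0, h - patch.shape[-2], (1,)).item()
    x0 = torch.randint(0, w - patch.shape[-1], (1,)).item()
    adv = images.clone()
    adv[:, :, y0:y0 + patch.shape[-2], x0:x0 + patch.shape[-1]] = patch
    loss = -F.cross_entropy(model(adv), labels)    # minimize the negative loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        patch.clamp_(0, 1)                         # keep the patch printable
    return -loss.item()

# toy usage: a linear "classifier" stands in for the target model
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10))
patch = torch.rand(3, 16, 16, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)
imgs, ys = torch.rand(8, 3, 64, 64), torch.randint(0, 10, (8,))
for _ in range(3):
    print(patch_attack_step(model, imgs, ys, patch, opt))
```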

Phased Deep Spatio-temporal Learning for Highway Traffic Volume Prediction

  • paper_url: http://arxiv.org/abs/2308.06155
  • repo_url: None
  • paper_authors: Weilong Ding, Tianpu Zhang, Zhe Wang
  • for: Predicting daily traffic volume on inter-city highways, a routine analysis that supports modern urban life.
  • methods: A phased deep spatio-temporal learning method: a feature pre-processing phase that carefully normalizes data according to its latent long-tail distribution; a spatio-temporal learning phase using a hybrid FCN-LSTM model that considers time, space, meteorology, and calendar features from heterogeneous data; and a decision phase that predicts next-day traffic volumes at network-wide toll stations, specially calibrated for the vital few stations.
  • results: Extensive experiments on real-world data from a Chinese provincial highway show distinct accuracy improvements over various traditional models, reaching 5.269 and 0.997 on the MPAE and R-square metrics, respectively.
    Abstract Inter-city highway transportation is significant for citizens' modern urban life and generates heterogeneous sensory data with spatio-temporal characteristics. As a routine analysis in the transportation domain, daily traffic volume estimation at highway toll stations faces challenges, including insufficient exploration of correlated spatio-temporal features from a long-term perspective and a lack of effective means to handle the data imbalance that degrades predictive performance. In this paper, a deep spatio-temporal learning method is proposed to predict daily traffic volume in three phases. In the feature pre-processing phase, data is normalized elaborately according to its latent long-tail distribution. In the spatio-temporal learning phase, a hybrid model is employed combining a fully convolutional network (FCN) and long short-term memory (LSTM), which considers time, space, meteorology, and calendar features from heterogeneous data. In the decision phase, traffic volumes on a coming day at network-wide toll stations are obtained effectively, with calibration especially for the vital few highway stations. Using real-world data from one Chinese provincial highway, extensive experiments show our method has a distinct improvement in predictive accuracy over various traditional models, reaching 5.269 and 0.997 on the MPAE and R-square metrics, respectively.

Application of Artificial Neural Networks for Investigation of Pressure Filtration Performance, a Zinc Leaching Filter Cake Moisture Modeling

  • paper_url: http://arxiv.org/abs/2308.06138
  • repo_url: None
  • paper_authors: Masoume Kazemi, Davood Moradkhani, Alireza A. Alipour
  • for: This study develops an artificial neural network model to predict filter cake moisture in the pressure filtration process of zinc production.
  • methods: An ANN was trained and tested on 288 experiments conducted with two fabric types, polypropylene (S1) and polyester (S2), using seven input parameters.
  • results: The model predicted cake moisture with high accuracy, achieving R2 values of 0.88 and 0.83, MSE values of 6.243x10-07 and 1.086x10-06, and MAE values of 0.00056 and 0.00088 for S1 and S2, respectively.
    Abstract Machine Learning (ML) is a powerful tool for material science applications. Artificial Neural Network (ANN) is a machine learning technique that can provide high prediction accuracy. This study aimed to develop an ANN model to predict the cake moisture of the pressure filtration process of zinc production. The cake moisture was influenced by seven parameters: temperature (35 and 65 Celsius), solid concentration (0.2 and 0.38 g/L), pH (2, 3.5, and 5), air-blow time (2, 10, and 15 min), cake thickness (14, 20, 26, and 34 mm), pressure, and filtration time. The study conducted 288 tests using two types of fabrics: polypropylene (S1) and polyester (S2). The ANN model was evaluated by the Coefficient of determination (R2), the Mean Square Error (MSE), and the Mean Absolute Error (MAE) metrics for both datasets. The results showed R2 values of 0.88 and 0.83, MSE values of 6.243x10-07 and 1.086x10-06, and MAE values of 0.00056 and 0.00088 for S1 and S2, respectively. These results indicated that the ANN model could predict the cake moisture of pressure filtration in the zinc leaching process with high accuracy.
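
A minimal sketch of such a regressor on the paper's seven inputs, with synthetic data standing in for the 288 experiments; the paper's exact architecture is not given in the abstract, so the layer sizes here are assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Seven inputs: temperature, solid concentration, pH, air-blow time,
# cake thickness, pressure, filtration time (ranges from the abstract
# where stated; pressure and filtration time ranges are placeholders).
rng = np.random.default_rng(0)
X = rng.uniform([35, 0.2, 2, 2, 14, 1, 1],
                [65, 0.38, 5, 15, 34, 5, 10], (288, 7))
y = 0.3 - 0.002 * X[:, 3] + 0.001 * X[:, 4] + rng.normal(0, 0.005, 288)  # toy target

model = make_pipeline(
    StandardScaler(),                 # inputs live on very different scales
    MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=5000, random_state=0),
)
model.fit(X[:230], y[:230])
print("R^2 on held-out tests:", model.score(X[230:], y[230:]))
```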

A Game-Theoretic Framework for Joint Forecasting and Planning

  • paper_url: http://arxiv.org/abs/2308.06137
  • repo_url: https://github.com/portal-cornell/game-theoretic-forecasting-planning
  • paper_authors: Kushal Kedia, Prithwish Dan, Sanjiban Choudhury
  • for: This work targets safer robot motion planning in the presence of humans by coupling planning with forecasting.
  • methods: A novel game-theoretic framework for joint planning and forecasting, in which the payoff is the planner's performance against the demonstrator, together with practical algorithms for end-to-end training.
  • results: The proposed algorithm produces safer plans in a crowd navigation simulator and on real-world pedestrian-motion datasets, handling long-tail events of human behavior better. Code is released at https://github.com/portal-cornell/Game-Theoretic-Forecasting-Planning.
    Abstract Planning safe robot motions in the presence of humans requires reliable forecasts of future human motion. However, simply predicting the most likely motion from prior interactions does not guarantee safety. Such forecasts fail to model the long tail of possible events, which are rarely observed in limited datasets. On the other hand, planning for worst-case motions leads to overtly conservative behavior and a ``frozen robot''. Instead, we aim to learn forecasts that predict counterfactuals that humans guard against. We propose a novel game-theoretic framework for joint planning and forecasting with the payoff being the performance of the planner against the demonstrator, and present practical algorithms to train models in an end-to-end fashion. We demonstrate that our proposed algorithm results in safer plans in a crowd navigation simulator and real-world datasets of pedestrian motion. We release our code at https://github.com/portal-cornell/Game-Theoretic-Forecasting-Planning.

Improving Joint Speech-Text Representations Without Alignment

  • paper_url: http://arxiv.org/abs/2308.06125
  • repo_url: None
  • paper_authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho
  • for: This paper aims to improve joint speech-text representations for ASR without requiring explicit alignment between the modalities.
  • methods: Joint speech-text encoders are trained on both unpaired speech and text in a shared representation space; a consistency loss forgives sequence-length differences and simply assumes the best alignment, instead of up-sampling heuristics or an explicit alignment model.
  • results: The consistency loss improves downstream WER in both a large-parameter monolingual system and a multilingual system.
    Abstract The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.

Learning Deductive Reasoning from Synthetic Corpus based on Formal Logic

  • paper_url: http://arxiv.org/abs/2308.07336
  • repo_url: https://github.com/hitachi-nlp/fld
  • paper_authors: Terufumi Morishita, Gaku Morio, Atsuki Yamaguchi, Yasuhiro Sogawa
  • for: This work aims to endow language models (LMs) with logical deductive reasoning ability.
  • methods: The authors adopt a well-grounded set of deduction rules based on formal logic theory, from which any other deduction rule can be derived when combined in a multistep way, and generate a synthetic corpus accordingly.
  • results: Experiments show that LMs trained on the proposed FLD corpora acquire more generalizable deductive reasoning ability; the authors also identify which aspects of deductive reasoning deduction corpora can enhance in LMs and which they cannot.
    Abstract We study a synthetic corpus-based approach for language models (LMs) to acquire logical deductive reasoning ability. The previous studies generated deduction examples using specific sets of deduction rules. However, these rules were limited or otherwise arbitrary. This can limit the generalizability of acquired deductive reasoning ability. We rethink this and adopt a well-grounded set of deduction rules based on formal logic theory, which can derive any other deduction rules when combined in a multistep way. We empirically verify that LMs trained on the proposed corpora, which we name $\textbf{FLD}$ ($\textbf{F}$ormal $\textbf{L}$ogic $\textbf{D}$eduction), acquire more generalizable deductive reasoning ability. Furthermore, we identify the aspects of deductive reasoning ability on which deduction corpora can enhance LMs and those on which they cannot. Finally, on the basis of these results, we discuss the future directions for applying deduction corpora or other approaches for each aspect. We release the code, data, and models.
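
A toy sketch of multistep deduction-example generation in the spirit of FLD; the real corpus generator uses a much richer formal-logic rule set than the single rule shown here:

```python
# Facts are atoms ("A") or implications ("imp", premise, conclusion).
RULES = {
    "modus_ponens": lambda facts, p, q:
        q if (p in facts and ("imp", p, q) in facts) else None,
}

def derive(facts, steps):
    """Chain rule applications, recording the proof as (rule, premise, goal)."""
    proof = []
    for rule, p, q in steps:
        concl = RULES[rule](facts, p, q)
        assert concl is not None, f"{rule} does not apply to ({p}, {q})"
        facts.add(concl)
        proof.append((rule, p, q))
    return facts, proof

facts = {"A", ("imp", "A", "B"), ("imp", "B", "C")}
facts, proof = derive(facts, [("modus_ponens", "A", "B"),
                              ("modus_ponens", "B", "C")])
print("C" in facts, proof)   # True, a two-step deduction A -> B -> C
```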

Improving Zero-Shot Text Matching for Financial Auditing with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.06111
  • repo_url: None
  • paper_authors: Lars Hillebrand, Armin Berger, Tobias Deußer, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Maren Pielka, David Leonhard, Christian Bauckhage, Rafet Sifa
  • for: Assisting the auditing of financial documents, improving both efficiency and accuracy.
  • methods: An AI-based recommender suggests relevant text passages from a report for each legal requirement of rigorous accounting standards, combining a custom BERT-based retriever with a large language model (LLM) filter.
  • results: The two-step approach yields significant performance improvements over existing methods, helping auditors complete their tasks faster.
    Abstract Auditing financial documents is a very tedious and time-consuming process. As of today, it can already be simplified by employing AI-based solutions to recommend relevant text passages from a report for each legal requirement of rigorous accounting standards. However, these methods need to be fine-tuned regularly, and they require abundant annotated data, which is often lacking in industrial environments. Hence, we present ZeroShotALI, a novel recommender system that leverages a state-of-the-art large language model (LLM) in conjunction with a domain-specifically optimized transformer-based text-matching solution. We find that a two-step approach of first retrieving a number of best matching document sections per legal requirement with a custom BERT-based model and second filtering these selections using an LLM yields significant performance improvements over existing approaches.
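
The two-step retrieve-then-filter pipeline can be sketched as below; a public sentence-transformers encoder stands in for the paper's custom BERT model, and ask_llm is a placeholder for whatever LLM client is available:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the custom BERT

def retrieve(requirement: str, passages: list[str], k: int = 5) -> list[str]:
    """Step 1: bi-encoder retrieval of the best-matching report sections."""
    req = encoder.encode([requirement], normalize_embeddings=True)
    docs = encoder.encode(passages, normalize_embeddings=True)
    scores = (docs @ req.T).ravel()                # cosine similarity
    return [passages[i] for i in np.argsort(-scores)[:k]]

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def zero_shot_match(requirement: str, passages: list[str]) -> list[str]:
    """Step 2: the LLM filters the retrieved candidates."""
    keep = []
    for c in retrieve(requirement, passages):
        verdict = ask_llm(
            f"Does this passage satisfy the requirement?\n"
            f"Requirement: {requirement}\nPassage: {c}\nAnswer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            keep.append(c)
    return keep
```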

Neural Conversation Models and How to Rein Them in: A Survey of Failures and Fixes

  • paper_url: http://arxiv.org/abs/2308.06095
  • repo_url: None
  • paper_authors: Fabian Galetzka, Anne Beyer, David Schlangen
  • for: This survey examines open-domain conversational systems built on powerful language models that aim to imitate an interlocutor by generating appropriate contributions to a written dialogue.
  • methods: Interpreting Grice's maxims of cooperative conversation for this research area, the survey systematizes the literature by what makes a contribution appropriate -- fluent, informative, consistent, coherent, and following social norms -- and by the intervention points (data, training regime, decoding) at which recent approaches tame the underlying language models.
  • results: The survey discusses promising attempts and suggests novel directions for future research on improving the quality of model contributions.
    Abstract Recent conditional language models are able to continue any kind of text source in an often seemingly fluent way. This fact encouraged research in the area of open-domain conversational systems that are based on powerful language models and aim to imitate an interlocutor by generating appropriate contributions to a written dialogue. From a linguistic perspective, however, the complexity of contributing to a conversation is high. In this survey, we interpret Grice's maxims of cooperative conversation from the perspective of this specific research area and systematize the literature under the aspect of what makes a contribution appropriate: A neural conversation model has to be fluent, informative, consistent, coherent, and follow social norms. In order to ensure these qualities, recent approaches try to tame the underlying language models at various intervention points, such as data, training regime or decoding. Sorted by these categories and intervention points, we discuss promising attempts and suggest novel ways for future research.

Reinforcement Logic Rule Learning for Temporal Point Processes

  • paper_url: http://arxiv.org/abs/2308.06094
  • repo_url: None
  • paper_authors: Chao Yang, Lu Wang, Kun Gao, Shuang Li
  • for: Incrementally expanding an explanatory temporal logic rule set to explain the occurrence of temporal events.
  • methods: Built on temporal point process modeling and learning, the algorithm alternates between a master problem that updates the current rule-set weights and a subproblem that searches for a new rule to include, with a neural search policy trained by reinforcement learning to generate rule content, until the likelihood of the observed event sequences is optimal.
  • results: The method obtains promising results on both synthetic and real healthcare datasets.
    Abstract We propose a framework that can incrementally expand the explanatory temporal logic rule set to explain the occurrence of temporal events. Leveraging the temporal point process modeling and learning framework, the rule content and weights will be gradually optimized until the likelihood of the observational event sequences is optimal. The proposed algorithm alternates between a master problem, where the current rule set weights are updated, and a subproblem, where a new rule is searched and included to best increase the likelihood. The formulated master problem is convex and relatively easy to solve using continuous optimization, whereas the subproblem requires searching the huge combinatorial rule predicate and relationship space. To tackle this challenge, we propose a neural search policy to learn to generate the new rule content as a sequence of actions. The policy parameters will be trained end-to-end using the reinforcement learning framework, where the reward signals can be efficiently queried by evaluating the subproblem objective. The trained policy can be used to generate new rules in a controllable way. We evaluate our methods on both synthetic and real healthcare datasets, obtaining promising results.

Toward a Better Understanding of Loss Functions for Collaborative Filtering

  • paper_url: http://arxiv.org/abs/2308.06091
  • repo_url: https://github.com/psm1206/mawu
  • paper_authors: Seongmin Park, Mincheol Yoon, Jae-woong Lee, Hogun Park, Jongwuk Lee
  • for: This paper examines the three components of the learning process of collaborative filtering (CF) models: the interaction encoder, the loss function, and negative sampling.
  • methods: It analyzes the relationships among existing loss functions and proposes a new one, Margin-aware Alignment and Weighted Uniformity (MAWU), which accounts for the unique patterns of datasets: margin-aware alignment mitigates user/item-specific popularity biases, and weighted uniformity adjusts the relative importance of user and item uniformity.
  • results: Experiments show that MF and LightGCN equipped with MAWU are comparable or superior to state-of-the-art CF models with various loss functions on three public datasets, particularly in many real-world settings.
    Abstract Collaborative filtering (CF) is a pivotal technique in modern recommender systems. The learning process of CF models typically consists of three components: interaction encoder, loss function, and negative sampling. Although many existing studies have proposed various CF models to design sophisticated interaction encoders, recent work shows that simply reformulating the loss functions can achieve significant performance gains. This paper delves into analyzing the relationship among existing loss functions. Our mathematical analysis reveals that the previous loss functions can be interpreted as alignment and uniformity functions: (i) the alignment matches user and item representations, and (ii) the uniformity disperses user and item distributions. Inspired by this analysis, we propose a novel loss function that improves the design of alignment and uniformity considering the unique patterns of datasets called Margin-aware Alignment and Weighted Uniformity (MAWU). The key novelty of MAWU is two-fold: (i) margin-aware alignment (MA) mitigates user/item-specific popularity biases, and (ii) weighted uniformity (WU) adjusts the significance between user and item uniformities to reflect the inherent characteristics of datasets. Extensive experimental results show that MF and LightGCN equipped with MAWU are comparable or superior to state-of-the-art CF models with various loss functions on three public datasets.
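
The alignment-and-uniformity view, plus the two MAWU modifications, can be sketched as a loss on L2-normalized embeddings of positive user-item pairs; the single global margin and the gamma weight here are simplifications of the paper's user/item-aware versions:

```python
import torch
import torch.nn.functional as F

def mawu_loss(user: torch.Tensor, item: torch.Tensor,
              margin: float = 0.1, gamma: float = 0.5):
    """user, item: (B, D) embeddings of positive interaction pairs."""
    user = F.normalize(user, dim=-1)
    item = F.normalize(item, dim=-1)
    # alignment: pull matched pairs together, but only beyond a margin
    align = ((user - item).norm(dim=1).pow(2) - margin).clamp(min=0).mean()

    def uniformity(x):  # disperse embeddings on the hypersphere
        return torch.pdist(x, p=2).pow(2).mul(-2).exp().mean().log()

    # weighted uniformity: gamma trades off user- vs item-side dispersion
    return align + gamma * uniformity(user) + (1 - gamma) * uniformity(item)

u, i = torch.randn(128, 64), torch.randn(128, 64)
print(mawu_loss(u, i))
```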

An Autoethnographic Exploration of XAI in Algorithmic Composition

  • paper_url: http://arxiv.org/abs/2308.06089
  • repo_url: None
  • paper_authors: Ashley Noel-Hirst, Nick Bryan-Kinns
  • for: This study explores how an explainable AI (XAI) generative model can be used in the practice of composing traditional Irish folk music.
  • methods: It uses the MeasureVAE generative model, which has interpretable latent dimensions and is trained on Irish folk music, within an autoethnographic music-making study.
  • results: Findings suggest that the exploratory music-making workflow foregrounds musical features of the training dataset rather than features of the generative model itself, and that appropriating an XAI model within an iterative workflow lets it form part of a richer and more complex workflow than it was initially designed for.
    Abstract Machine Learning models are capable of generating complex music across a range of genres from folk to classical music. However, current generative music AI models are typically difficult to understand and control in meaningful ways. Whilst research has started to explore how explainable AI (XAI) generative models might be created for music, no generative XAI models have been studied in music making practice. This paper introduces an autoethnographic study of the use of the MeasureVAE generative music XAI model with interpretable latent dimensions trained on Irish folk music. Findings suggest that the exploratory nature of the music-making workflow foregrounds musical features of the training dataset rather than features of the generative model itself. The appropriation of an XAI model within an iterative workflow highlights the potential of XAI models to form part of a richer and more complex workflow than they were initially designed for.

Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters

  • paper_url: http://arxiv.org/abs/2308.06088
  • repo_url: None
  • paper_authors: Arne Bewersdorff, Kathrin Seßler, Armin Baur, Enkelejda Kasneci, Claudia Nerdel
  • for: This study investigates a large language model (LLM)-based approach to automatically identifying student errors in experimentation protocols, as a foundation for productive, personalized feedback.
  • methods: An AI system based on the GPT-3.5 and GPT-4 series was developed on a dataset of 65 student protocols and compared against human raters.
  • results: The system reliably detects several fundamental student errors, e.g., focusing the hypothesis on an expected observation rather than the dependent variable (acc. = 0.90), modifying trials during an ongoing investigation (acc. = 1), and conducting valid test trials (acc. = 0.82); detecting more complex errors, such as whether a student conducts a valid control trial (acc. = 0.60), remains more challenging.
    Abstract Identifying logical errors in complex, incomplete or even contradictory and overall heterogeneous data like students' experimentation protocols is challenging. Recognizing the limitations of current evaluation methods, we investigate the potential of Large Language Models (LLMs) for automatically identifying student errors and streamlining teacher assessments. Our aim is to provide a foundation for productive, personalized feedback. Using a dataset of 65 student protocols, an Artificial Intelligence (AI) system based on the GPT-3.5 and GPT-4 series was developed and tested against human raters. Our results indicate varying levels of accuracy in error detection between the AI system and human raters. The AI system can accurately identify many fundamental student errors, for instance, the AI system identifies when a student is focusing the hypothesis not on the dependent variable but solely on an expected observation (acc. = 0.90), when a student modifies the trials in an ongoing investigation (acc. = 1), and whether a student is conducting valid test trials (acc. = 0.82) reliably. The identification of other, usually more complex errors, like whether a student conducts a valid control trial (acc. = .60), poses a greater challenge. This research explores not only the utility of AI in educational settings, but also contributes to the understanding of the capabilities of LLMs in error detection in inquiry-based learning like experimentation.

Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

  • paper_url: http://arxiv.org/abs/2308.06087
  • repo_url: https://github.com/visualaikhu/sira-ssl
  • paper_authors: Sung Jin Um, Dongjin Kim, Jung Uk Kim
  • for: This work aims to combine the audio and visual modalities to detect the location of sound-making objects within a visual scene.
  • methods: The proposed audio-visual spatial integration network uses spatial cues from both modalities, mimicking how humans locate sound sources, and a recursive attention network imitates humans' iterative focusing on objects for more accurate attention regions; an audio-visual pair matching loss and a spatial region alignment loss encode spatial information from both modalities.
  • results: Comprehensive experiments on the Flickr SoundNet and VGG-Sound Source datasets show the method localizes sound sources more robustly than existing approaches. Code is available at https://github.com/VisualAIKHU/SIRA-SSL.
    Abstract The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio as an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic human behavior of iterative focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose audio-visual pair matching loss and spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL
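
The basic audio-visual spatial cue both modalities supply can be sketched as a cosine-similarity heatmap; the paper's integration and recursive-attention stages, omitted here, refine this map:

```python
import torch
import torch.nn.functional as F

def av_localization_map(visual_feats: torch.Tensor,
                        audio_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the audio embedding and every visual
    location yields a (B, H, W) sound-source heatmap in [-1, 1].

    visual_feats: (B, C, H, W) visual feature map; audio_embed: (B, C).
    """
    v = F.normalize(visual_feats, dim=1)
    a = F.normalize(audio_embed, dim=1)[..., None, None]   # (B, C, 1, 1)
    return (v * a).sum(dim=1)

vis, aud = torch.randn(2, 512, 14, 14), torch.randn(2, 512)
heat = av_localization_map(vis, aud)
# a recursive-attention stage would re-pool `vis` weighted by `heat`
# and repeat, iteratively sharpening the localization
print(heat.shape)    # torch.Size([2, 14, 14])
```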

Cost-effective On-device Continual Learning over Memory Hierarchy with Miro

  • paper_url: http://arxiv.org/abs/2308.06053
  • repo_url: None
  • paper_authors: Xinyue Ma, Suyeon Jeong, Minjia Zhang, Di Wang, Jonghyun Choi, Myeongjae Jeon
  • for: This paper focuses on training neural network models incrementally on edge devices using continual learning (CL), with the goal of achieving cost-effectiveness while maintaining high model accuracy.
  • methods: The paper explores the design space of hierarchical memory replay-based CL and presents a novel system runtime called Miro that dynamically configures the CL system based on resource states for the best cost-effectiveness. Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead.
  • results: The paper shows that Miro significantly outperforms baseline systems in terms of cost-effectiveness, achieving higher accuracy while using less energy on edge devices.
    Abstract Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy while not compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework by enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms baseline systems we build for comparison, consistently achieving higher cost-effectiveness.
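
The hierarchical-memory idea can be pictured with a small sketch: a "hot" replay buffer lives in RAM and overflow samples are spilled to slower flash/disk storage, so replay batches pay I/O cost only for the cold share. Buffer sizes, the eviction rule, and the hot/cold sampling ratio here are illustrative assumptions, not Miro's actual policy.

```python
import os, pickle, random, tempfile

class HierarchicalReplayBuffer:
    def __init__(self, ram_capacity=256, storage_dir=None):
        self.ram = []                                 # fast tier (RAM)
        self.ram_capacity = ram_capacity
        self.dir = storage_dir or tempfile.mkdtemp()  # slow tier (flash/disk)
        self.n_cold = 0

    def add(self, sample):
        if len(self.ram) < self.ram_capacity:
            self.ram.append(sample)
        else:                                         # evict a random RAM sample to storage
            idx = random.randrange(self.ram_capacity)
            evicted, self.ram[idx] = self.ram[idx], sample
            with open(os.path.join(self.dir, f"{self.n_cold}.pkl"), "wb") as f:
                pickle.dump(evicted, f)
            self.n_cold += 1

    def sample(self, batch_size, hot_ratio=0.75):
        n_hot = min(int(batch_size * hot_ratio), len(self.ram))
        batch = random.sample(self.ram, n_hot)
        for _ in range(batch_size - n_hot):           # I/O cost only for the cold share
            if self.n_cold == 0:
                break
            k = random.randrange(self.n_cold)
            with open(os.path.join(self.dir, f"{k}.pkl"), "rb") as f:
                batch.append(pickle.load(f))
        return batch
```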

Learning to Guide Human Experts via Personalized Large Language Models

  • paper_url: http://arxiv.org/abs/2308.06039
  • repo_url: None
  • paper_authors: Debodeep Banerjee, Stefano Teso, Andrea Passerini
  • for: Learning to defer, where a predictor identifies risky decisions and defers them to a human expert.
  • methods: The paper proposes learning to guide (LTG), a framework in which the machine, rather than suggesting ready-made decisions, provides guidance useful for decision-making, while the human remains entirely responsible for the final decision.
  • results: The paper introduces SLOG, an implementation that uses a small amount of human supervision to convert a generic large language model into a module capable of generating textual guidance, with preliminary but promising results on a medical diagnosis task.
    Abstract In learning to defer, a predictor identifies risky decisions and defers them to a human expert. One key issue with this setup is that the expert may end up over-relying on the machine's decisions, due to anchoring bias. At the same time, whenever the machine chooses the deferral option the expert has to take decisions entirely unassisted. As a remedy, we propose learning to guide (LTG), an alternative framework in which -- rather than suggesting ready-made decisions -- the machine provides guidance useful to guide decision-making, and the human is entirely responsible for coming up with a decision. We also introduce SLOG, an LTG implementation that leverages (a small amount of) human supervision to convert a generic large language model into a module capable of generating textual guidance, and present preliminary but promising results on a medical diagnosis task.

Deep Context Interest Network for Click-Through Rate Prediction

  • paper_url: http://arxiv.org/abs/2308.06037
  • repo_url: None
  • paper_authors: Xuyang Hou, Zhe Wang, Qi Liu, Tan Qu, Jia Cheng, Jun Lei
  • for: Click-Through Rate (CTR) prediction, that is, estimating the probability that a user clicks on an item, to improve performance in industrial applications such as online advertising.
  • methods: A new model, the Deep Context Interest Network (DCIN), integrally models a user's clicks together with their display context to learn context-aware interests. DCIN has three key modules: 1) a Position-aware Context Aggregation Module (PCAM) that aggregates display items with an attention mechanism; 2) a Feedback-Context Fusion Module (FCFM) that fuses click and display-context representations through non-linear feature interaction; and 3) an Interest Matching Module (IMM) that activates interests related to the target item.
  • results: DCIN shows significant improvements in both offline and online evaluations; deployed at large industrial scale, it lifts CTR by 1.5% and RPM by 1.5%.
    Abstract Click-Through Rate (CTR) prediction, estimating the probability of a user clicking on an item, is essential in industrial applications, such as online advertising. Many works focus on user behavior modeling to improve CTR prediction performance. However, most of those methods only model users' positive interests from users' click items while ignoring the context information, which is the display items around the clicks, resulting in inferior performance. In this paper, we highlight the importance of context information on user behavior modeling and propose a novel model named Deep Context Interest Network (DCIN), which integrally models the click and its display context to learn users' context-aware interests. DCIN consists of three key modules: 1) Position-aware Context Aggregation Module (PCAM), which performs aggregation of display items with an attention mechanism; 2) Feedback-Context Fusion Module (FCFM), which fuses the representation of clicks and display contexts through non-linear feature interaction; 3) Interest Matching Module (IMM), which activates interests related with the target item. Moreover, we provide our hands-on solution to implement our DCIN model on large-scale industrial systems. The significant improvements in both offline and online evaluations demonstrate the superiority of our proposed DCIN method. Notably, DCIN has been deployed on our online advertising system serving the main traffic, which brings 1.5% CTR and 1.5% RPM lift.
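
As a rough illustration of the position-aware aggregation step (PCAM), the sketch below embeds each display item together with its position and pools the items with attention computed against the clicked item. Dimensions and the scoring function are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PositionAwareContextAggregation(nn.Module):
    def __init__(self, dim, n_positions=10):
        super().__init__()
        self.pos_emb = nn.Embedding(n_positions, dim)

    def forward(self, click, display):
        """click: (B, D) clicked-item embedding; display: (B, N, D) display items."""
        B, N, D = display.shape
        keys = display + self.pos_emb(torch.arange(N))      # position-aware items
        scores = torch.einsum("bd,bnd->bn", click, keys) / D ** 0.5
        weights = torch.softmax(scores, dim=-1)             # attend w.r.t. the click
        return torch.einsum("bn,bnd->bd", weights, keys)    # aggregated display context

pcam = PositionAwareContextAggregation(dim=32)
ctx = pcam(torch.randn(4, 32), torch.randn(4, 8, 32))       # 8 display items per click
```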

Evidence of Human-Like Visual-Linguistic Integration in Multimodal Large Language Models During Predictive Language Processing

  • paper_url: http://arxiv.org/abs/2308.06035
  • repo_url: None
  • paper_authors: Viktor Kewenig, Christopher Edwards, Quitterie Lacome DEstalenx, Akilles Rechardt, Jeremy I Skipper, Gabriella Vigliocco
  • for: Investigating whether the language-processing abilities of large language models (LLMs) can replicate human-like cognitive processes.
  • methods: The study uses a multimodal language model (mLLM) that combines visual and linguistic embedding spaces with a transformer-type attention mechanism for next-word prediction.
  • results: Human predictability estimates align significantly with CLIP scores, but not with a unimodal LLM of comparable parameter size. The alignment vanishes when CLIP's visual attention weights are perturbed and when the same input is fed to a multimodal model without attention; CLIP's visual attention weights also overlap spatially with human eye-tracking data. These results suggest that comparable processes of attention-guided multimodal integration support predictive language processing in mLLMs and humans.
    Abstract The advanced language processing abilities of large language models (LLMs) have stimulated debate over their capacity to replicate human-like cognitive processes. One differentiating factor between language processing in LLMs and humans is that language input is often grounded in more than one perceptual modality, whereas most LLMs process solely text-based information. Multimodal grounding allows humans to integrate - e.g. visual context with linguistic information and thereby place constraints on the space of upcoming words, reducing cognitive load and improving perception and comprehension. Recent multimodal LLMs (mLLMs) combine visual and linguistic embedding spaces with a transformer type attention mechanism for next-word prediction. To what extent does predictive language processing based on multimodal input align in mLLMs and humans? To answer this question, 200 human participants watched short audio-visual clips and estimated the predictability of an upcoming verb or noun. The same clips were processed by the mLLM CLIP, with predictability scores based on a comparison of image and text feature vectors. Eye-tracking was used to estimate what visual features participants attended to, and CLIP's visual attention weights were recorded. We find that human estimates of predictability align significantly with CLIP scores, but not for a unimodal LLM of comparable parameter size. Further, alignment vanished when CLIP's visual attention weights were perturbed, and when the same input was fed to a multimodal model without attention. Analysing attention patterns, we find a significant spatial overlap between CLIP's visual attention weights and human eye-tracking data. Results suggest that comparable processes of integrating multimodal information, guided by attention to relevant visual features, supports predictive language processing in mLLMs and humans.

Large Language Models in Cryptocurrency Securities Cases: Can ChatGPT Replace Lawyers?

  • paper_url: http://arxiv.org/abs/2308.06032
  • repo_url: None
  • paper_authors: Arianna Trozze, Toby Davies, Bennett Kleinberg
  • for: The paper aims to study the effectiveness of large language models (LLMs) in conducting legal tasks, specifically in the context of securities cases involving cryptocurrencies.
  • methods: The paper uses GPT-3.5, a large language model, to evaluate its legal reasoning and drafting capabilities in real-life cases. The authors compare the performance of GPT-3.5 with human lawyers in terms of determining potential violations and drafting legal complaints.
  • results: The paper finds that GPT-3.5's legal reasoning skills are weak, as it misses additional correct violations, though it performs better at legal drafting. The authors also find that jurors' decisions are not statistically significantly associated with the author of the document upon which they based their decisions. Overall, the paper suggests that LLMs cannot yet satisfactorily conduct legal reasoning tasks but could provide access to justice for more individuals by reducing the cost of legal services.
    Abstract Large Language Models (LLMs) could enhance access to the legal system. However, empirical research on their effectiveness in conducting legal tasks is scant. We study securities cases involving cryptocurrencies as one of numerous contexts where AI could support the legal process, studying LLMs' legal reasoning and drafting capabilities. We examine whether a) an LLM can accurately determine which laws are potentially being violated from a fact pattern, and b) whether there is a difference in juror decision-making based on complaints written by a lawyer compared to an LLM. We feed fact patterns from real-life cases to GPT-3.5 and evaluate its ability to determine correct potential violations from the scenario and exclude spurious violations. Second, we had mock jurors assess complaints written by the LLM and lawyers. GPT-3.5's legal reasoning skills proved weak, though we expect improvement in future models, particularly given the violations it suggested tended to be correct (it merely missed additional, correct violations). GPT-3.5 performed better at legal drafting, and jurors' decisions were not statistically significantly associated with the author of the document upon which they based their decisions. Because LLMs cannot satisfactorily conduct legal reasoning tasks, they would be unable to replace lawyers at this stage. However, their drafting skills (though, perhaps, still inferior to lawyers), could provide access to justice for more individuals by reducing the cost of legal services. Our research is the first to systematically study LLMs' legal drafting and reasoning capabilities in litigation, as well as in securities law and cryptocurrency-related misconduct.

AI-Assisted Investigation of On-Chain Parameters: Risky Cryptocurrencies and Price Factors

  • paper_url: http://arxiv.org/abs/2308.08554
  • repo_url: None
  • paper_authors: Abdulrezzak Zekiye, Semih Utku, Fadi Amroush, Oznur Ozkasap
  • for: Helping investors make informed decisions by analyzing the factors that affect cryptocurrency prices and by identifying risky cryptocurrencies.
  • methods: The study applies artificial intelligence algorithms to historical on-chain data to determine price factors and measure risk.
  • results: Roughly a third (39%) of cryptocurrencies disappeared from the market, while only a small fraction (10%) survived for more than 1000 days. The analysis reveals a significant negative correlation between price and both maximum and total supply, and a weak positive correlation between price and 24-hour trading volume. Clustering the coins into five groups by their on-chain parameters gives investors a more comprehensive picture, and among several classifiers predicting whether a cryptocurrency is risky, K-Nearest Neighbor achieves the best f1-score of 76%.
    Abstract Cryptocurrencies have become a popular and widely researched topic of interest in recent years for investors and scholars. In order to make informed investment decisions, it is essential to comprehend the factors that impact cryptocurrency prices and to identify risky cryptocurrencies. This paper focuses on analyzing historical data and using artificial intelligence algorithms on on-chain parameters to identify the factors affecting a cryptocurrency's price and to find risky cryptocurrencies. We conducted an analysis of historical cryptocurrencies' on-chain data and measured the correlation between the price and other parameters. In addition, we used clustering and classification in order to get a better understanding of a cryptocurrency and classify it as risky or not. The analysis revealed that a significant proportion of cryptocurrencies (39%) disappeared from the market, while only a small fraction (10%) survived for more than 1000 days. Our analysis revealed a significant negative correlation between cryptocurrency price and maximum and total supply, as well as a weak positive correlation between price and 24-hour trading volume. Moreover, we clustered cryptocurrencies into five distinct groups using their on-chain parameters, which provides investors with a more comprehensive understanding of a cryptocurrency when compared to those clustered with it. Finally, by implementing multiple classifiers to predict whether a cryptocurrency is risky or not, we obtained the best f1-score of 76% using K-Nearest Neighbor.
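
The classification step can be illustrated with a short sketch: a K-Nearest Neighbor model predicts whether a coin is "risky" from on-chain features and is scored with f1, mirroring the 76% result reported above. The features and labels below are synthetic stand-ins, not the study's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.lognormal(mean=10, sigma=2, size=(500, 3))   # e.g. max supply, total supply, 24h volume
y = rng.integers(0, 2, size=500)                     # 1 = risky, 0 = not risky (synthetic)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("f1 =", f1_score(y_test, clf.predict(X_test)))
```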

Controlling Character Motions without Observable Driving Source

  • paper_url: http://arxiv.org/abs/2308.06025
  • repo_url: None
  • paper_authors: Weiyuan Li, Bin Dai, Ziyi Zhou, Qi Yao, Baoyuan Wang
  • for: Generating diverse, life-like, and unlimited-length head/body motion sequences without any driving source.
  • methods: A systematic framework that marries a VQ-VAE with a token-level control policy trained by reinforcement learning with carefully designed reward functions.
  • results: Comprehensive evaluations show that the framework addresses the out-of-distribution (OOD) issue, the lack of diversity, and the undesired periodic patterns that arise in driving-source-free generation, significantly outperforming strong baselines.
    Abstract How to generate diverse, life-like, and unlimited long head/body sequences without any driving source? We argue that this under-investigated research problem is non-trivial at all, and has unique technical challenges behind it. Without semantic constraints from the driving sources, using the standard autoregressive model to generate infinitely long sequences would easily result in 1) out-of-distribution (OOD) issue due to the accumulated error, 2) insufficient diversity to produce natural and life-like motion sequences and 3) undesired periodic patterns along the time. To tackle the above challenges, we propose a systematic framework that marries the benefits of VQ-VAE and a novel token-level control policy trained with reinforcement learning using carefully designed reward functions. A high-level prior model can be easily injected on top to generate unlimited long and diverse sequences. Although we focus on no driving sources now, our framework can be generalized for controlled synthesis with explicit driving sources. Through comprehensive evaluations, we conclude that our proposed framework can address all the above-mentioned challenges and outperform other strong baselines very significantly.
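
At the core of the framework is a discrete motion vocabulary; the sketch below shows the vector-quantization step on which a VQ-VAE rests, snapping each continuous motion feature to its nearest codebook entry so that a token-level policy can act on discrete ids. Codebook size and feature dimension are illustrative assumptions.

```python
import torch

codebook = torch.randn(512, 64)        # 512 motion tokens of dimension 64 (assumed sizes)

def quantize(z):
    """z: (B, 64) continuous motion features -> (token ids, quantized vectors)."""
    d = torch.cdist(z, codebook)       # distances to every codebook entry
    ids = d.argmin(dim=-1)             # nearest-neighbor token per feature
    return ids, codebook[ids]

ids, z_q = quantize(torch.randn(8, 64))
print(ids)                             # discrete tokens a control policy can select over
```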

Optimizing transformer-based machine translation model for single GPU training: a hyperparameter ablation study

  • paper_url: http://arxiv.org/abs/2308.06017
  • repo_url: None
  • paper_authors: Luv Verma, Ketaki N. Kolhatkar
  • for: explore the relationship between model complexity and performance in machine translation tasks
  • methods: systematic investigation using ablation and a single NVIDIA A100 GPU
  • results: unexpected insight that smaller models can be more effective, and the importance of precise hyperparameter tuning over mere scaling
    Abstract In machine translation tasks, the relationship between model complexity and performance is often presumed to be linear, driving an increase in the number of parameters and consequent demands for computational resources like multiple GPUs. To explore this assumption, this study systematically investigates the effects of hyperparameters through ablation on a sequence-to-sequence machine translation pipeline, utilizing a single NVIDIA A100 GPU. Contrary to expectations, our experiments reveal that combinations with the most parameters were not necessarily the most effective. This unexpected insight prompted a careful reduction in parameter sizes, uncovering "sweet spots" that enable training sophisticated models on a single GPU without compromising translation quality. The findings demonstrate an intricate relationship between hyperparameter selection, model size, and computational resource needs. The insights from this study contribute to the ongoing efforts to make machine translation more accessible and cost-effective, emphasizing the importance of precise hyperparameter tuning over mere scaling.
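
The ablation style described above boils down to a grid sweep on one GPU; a schematic loop follows. The grid values and the scoring stub are placeholders, not the paper's actual search space.

```python
import itertools

grid = {"layers": [2, 4, 6], "d_model": [256, 512], "heads": [4, 8]}

def train_and_score(cfg):
    # stand-in for a full training run that returns, e.g., validation BLEU
    return 0.0

results = []
for layers, d_model, heads in itertools.product(*grid.values()):
    cfg = {"layers": layers, "d_model": d_model, "heads": heads}
    results.append((train_and_score(cfg), cfg))

best_score, best_cfg = max(results, key=lambda t: t[0])   # the "sweet spot", not the largest model
```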

Large Language Models for Telecom: Forthcoming Impact on the Industry

  • paper_url: http://arxiv.org/abs/2308.06013
  • repo_url: None
  • paper_authors: Ali Maatouk, Nicola Piovesan, Fadhel Ayed, Antonio De Domenico, Merouane Debbah
  • for: The paper explores the potential impact of Large Language Models (LLMs) on the telecom industry and provides insights into their current capabilities and limitations.
  • methods: It examines use cases that can be readily implemented in the telecom industry, streamlining numerous tasks that currently hinder operational efficiency and demand significant manpower and engineering expertise.
  • results: It uncovers essential research directions that deal with the distinctive challenges of utilizing LLMs within the telecom domain; addressing these challenges is a significant stride towards fully harnessing the potential of LLMs in telecom.
    Abstract Large Language Models (LLMs) have emerged as a transformative force, revolutionizing numerous fields well beyond the conventional domain of Natural Language Processing (NLP) and garnering unprecedented attention. As LLM technology continues to progress, the telecom industry is facing the prospect of its potential impact on its landscape. To elucidate these implications, we delve into the inner workings of LLMs, providing insights into their current capabilities and limitations. We also examine the use cases that can be readily implemented in the telecom industry, streamlining numerous tasks that currently hinder operational efficiency and demand significant manpower and engineering expertise. Furthermore, we uncover essential research directions that deal with the distinctive challenges of utilizing the LLMs within the telecom domain. Addressing these challenges represents a significant stride towards fully harnessing the potential of LLMs and unlocking their capabilities to the fullest extent within the telecom domain.

Deep Task-specific Bottom Representation Network for Multi-Task Recommendation

  • paper_url: http://arxiv.org/abs/2308.05996
  • repo_url: None
  • paper_authors: Qi Liu, Zhilong Zhou, Gangwei Jiang, Tiezheng Ge, Defu Lian
  • for: Improving recommendation-system performance by alleviating the negative transfer problem between tasks in multi-task learning.
  • methods: The paper proposes the Deep Task-specific Bottom Representation Network (DTRN), which lets each task learn its own dedicated representation at the bottom representation modeling stage, relieving mutual task interference.
  • results: Experiments show that DTRN improves recommendation performance and can be flexibly combined with existing multi-task learning methods.
    Abstract Neural-based multi-task learning (MTL) has gained significant improvement, and it has been successfully applied to recommendation system (RS). Recent deep MTL methods for RS (e.g. MMoE, PLE) focus on designing soft gating-based parameter-sharing networks that implicitly learn a generalized representation for each task. However, MTL methods may suffer from performance degeneration when dealing with conflicting tasks, as negative transfer effects can occur on the task-shared bottom representation. This can result in a reduced capacity for MTL methods to capture task-specific characteristics, ultimately impeding their effectiveness and hindering the ability to generalize well on all tasks. In this paper, we focus on the bottom representation learning of MTL in RS and propose the Deep Task-specific Bottom Representation Network (DTRN) to alleviate the negative transfer problem. DTRN obtains task-specific bottom representation explicitly by making each task have its own representation learning network in the bottom representation modeling stage. Specifically, it extracts the user's interests from multiple types of behavior sequences for each task through the parameter-efficient hypernetwork. To further obtain the dedicated representation for each task, DTRN refines the representation of each feature by employing a SENet-like network for each task. The two proposed modules can achieve the purpose of getting task-specific bottom representation to relieve tasks' mutual interference. Moreover, the proposed DTRN is flexible to combine with existing MTL methods. Experiments on one public dataset and one industrial dataset demonstrate the effectiveness of the proposed DTRN.

Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model

  • paper_url: http://arxiv.org/abs/2308.05995
  • repo_url: None
  • paper_authors: Fan Zhang, Naye Ji, Fuxing Gao, Siyuan Zhao, Zhaohan Wang, Shunman Li
  • for: Proposing a diffusion-based, non-autoregressive transformer model for generating natural co-speech gestures from speech.
  • methods: The model uses the pre-trained WavLM model to extract low-level and high-level speech information, and an adaptive layer-norm scheme to learn the relationship between speech information and the accompanying gestures.
  • results: On the Trinity, ZEGGS, and BEAT datasets, the model synthesizes natural, stylized full-body co-speech gestures.
    Abstract The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by using acoustic and semantic information as input and adopting classify method to identify the person's ID and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay between co-speech gestures, speech acoustic, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional diffusion-based and non-autoregressive transformer-based generative model with WavLM pre-trained model. It can produce individual and stylized full-body co-speech gestures only using raw speech audio, eliminating the need for complex multimodal processing and manually annotated. Firstly, considering that speech audio not only contains acoustic and semantic features but also conveys personality traits, emotions, and more subtle information related to accompanying gestures, we pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract low-level and high-level audio information. Secondly, we introduce an adaptive layer norm architecture in the transformer-based layer to learn the relationship between speech information and accompanying gestures. Extensive subjective evaluation experiments are conducted on the Trinity, ZEGGS, and BEAT datasets to confirm the WavLM and the model's ability to synthesize natural co-speech gestures with various styles.

Defensive Perception: Estimation and Monitoring of Neural Network Performance under Deployment

  • paper_url: http://arxiv.org/abs/2308.06299
  • repo_url: None
  • paper_authors: Hendrik Vogt, Stefan Buehler, Mark Schutera
  • for: Addressing unnoticed catastrophic deployment and domain shift in neural networks for semantic segmentation in autonomous driving.
  • methods: The approach treats deep-learning-based perception for autonomous driving as inherently uncertain and estimates epistemic uncertainty with the Monte Carlo Dropout approach. It requires no modification of the deployed neural network and guarantees expected model performance.
  • results: The defensive perception envelope can estimate a neural network's performance, enabling monitoring and notification when the vehicle enters domains of reduced performance under deployment. Novel extensions reduce compute expenses and confine estimation noise, and applicability is demonstrated for deployment shifts relevant to autonomous driving, such as transitions into night, rain, or snow. Overall, the approach enables operational design domain recognition via uncertainty, supporting defensive perception, safe-state triggers, warning notifications, and feedback for testing, development, and adaptation of the perception stack.
    Abstract In this paper, we propose a method for addressing the issue of unnoticed catastrophic deployment and domain shift in neural networks for semantic segmentation in autonomous driving. Our approach is based on the idea that deep learning-based perception for autonomous driving is uncertain and best represented as a probability distribution. As autonomous vehicles' safety is paramount, it is crucial for perception systems to recognize when the vehicle is leaving its operational design domain, anticipate hazardous uncertainty, and reduce the performance of the perception system. To address this, we propose to encapsulate the neural network under deployment within an uncertainty estimation envelope that is based on the epistemic uncertainty estimation through the Monte Carlo Dropout approach. This approach does not require modification of the deployed neural network and guarantees expected model performance. Our defensive perception envelope has the capability to estimate a neural network's performance, enabling monitoring and notification of entering domains of reduced neural network performance under deployment. Furthermore, our envelope is extended by novel methods to improve the application in deployment settings, including reducing compute expenses and confining estimation noise. Finally, we demonstrate the applicability of our method for multiple different potential deployment shifts relevant to autonomous driving, such as transitions into the night, rainy, or snowy domain. Overall, our approach shows great potential for application in deployment settings and enables operational design domain recognition via uncertainty, which allows for defensive perception, safe state triggers, warning notifications, and feedback for testing or development and adaptation of the perception stack.
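
The envelope's core mechanism, Monte Carlo Dropout, is easy to sketch: dropout stays active at inference, and the spread across T stochastic forward passes is read as epistemic uncertainty. The toy network, T = 20, and the entropy-based score are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 4))

def mc_dropout_predict(model, x, T=20):
    model.train()                                   # keep dropout active at inference
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(T)])
    mean = probs.mean(dim=0)                        # averaged prediction over T passes
    # predictive entropy of the mean as a simple uncertainty proxy
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy

mean, uncertainty = mc_dropout_predict(net, torch.randn(8, 16))
print(uncertainty)   # high values would flag inputs outside the operational design domain
```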

TrajPAC: Towards Robustness Verification of Pedestrian Trajectory Prediction Models

  • paper_url: http://arxiv.org/abs/2308.05985
  • repo_url: https://github.com/zl-helios/trajpac
  • paper_authors: Liang Zhang, Nathaniel Xu, Pengfei Yang, Gaojie Jin, Cheng-Chao Huang, Lijun Zhang
  • for: Improving the safety of autonomous vehicles through robustness verification of pedestrian trajectory prediction models.
  • methods: The study gives formal definitions of two kinds of robustness, label robustness and pure robustness, and uses a probably approximately correct (PAC) framework for robustness verification, which can both identify potential counterexamples and provide interpretable analyses of the original methods.
  • results: With the prototype tool TrajPAC, the robustness of four state-of-the-art trajectory prediction models is evaluated on trajectories from five scenes of the ETH/UCY dataset and scenes of the Stanford Drone Dataset, and various factors that could influence robustness performance are studied experimentally.
    Abstract Robust pedestrian trajectory forecasting is crucial to developing safe autonomous vehicles. Although previous works have studied adversarial robustness in the context of trajectory forecasting, some significant issues remain unaddressed. In this work, we try to tackle these crucial problems. Firstly, the previous definitions of robustness in trajectory prediction are ambiguous. We thus provide formal definitions for two kinds of robustness, namely label robustness and pure robustness. Secondly, as previous works fail to consider robustness about all points in a disturbance interval, we utilise a probably approximately correct (PAC) framework for robustness verification. Additionally, this framework can not only identify potential counterexamples, but also provides interpretable analyses of the original methods. Our approach is applied using a prototype tool named TrajPAC. With TrajPAC, we evaluate the robustness of four state-of-the-art trajectory prediction models -- Trajectron++, MemoNet, AgentFormer, and MID -- on trajectories from five scenes of the ETH/UCY dataset and scenes of the Stanford Drone Dataset. Using our framework, we also experimentally study various factors that could influence robustness performance.
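
To convey the PAC flavor without the paper's machinery, the sketch below uses the standard scenario-approach bound: if N >= ln(1/eta)/eps sampled disturbances all satisfy the property, it holds with confidence at least 1 - eta up to violation probability eps. The predictor, property check, and uniform disturbance model are placeholders.

```python
import math, random

def pac_verify(predict, check, x, radius, eps=0.01, eta=0.01):
    n = math.ceil(math.log(1 / eta) / eps)        # scenario-approach sample size
    for _ in range(n):
        x_pert = [xi + random.uniform(-radius, radius) for xi in x]
        if not check(predict(x_pert)):            # found a counterexample
            return False, x_pert
    return True, None                             # PAC-style guarantee holds

# toy usage: the "prediction" is a sum, the property is staying below a threshold
ok, counterexample = pac_verify(lambda z: sum(z), lambda y: y < 10.0,
                                x=[1.0, 2.0], radius=0.5)
```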

Contrastive Explanations of Multi-agent Optimization Solutions

  • paper_url: http://arxiv.org/abs/2308.05984
  • repo_url: None
  • paper_authors: Parisa Zehtabi, Alberto Pozanco, Ayala Bloch, Daniel Borrajo, Sarit Kraus
  • for: Providing a domain-independent approach for obtaining contrastive explanations of multi-agent optimization solutions.
  • methods: The approach generates a new solution in which the questioned property is enforced while the differences from the original solution are minimized, and then highlights the differences between the two solutions.
  • results: A computational evaluation shows the method can generate contrastive explanations for large multi-agent optimization problems, and a user study across four domains shows that, after seeing these explanations, humans' satisfaction with the original solution increases.
    Abstract In many real-world scenarios, agents are involved in optimization problems. Since most of these scenarios are over-constrained, optimal solutions do not always satisfy all agents. Some agents might be unhappy and ask questions of the form ``Why does solution $S$ not satisfy property $P$?''. In this paper, we propose MAoE, a domain-independent approach to obtain contrastive explanations by (i) generating a new solution $S^\prime$ where the property $P$ is enforced, while also minimizing the differences between $S$ and $S^\prime$; and (ii) highlighting the differences between the two solutions. Such explanations aim to help agents understanding why the initial solution is better than what they expected. We have carried out a computational evaluation that shows that MAoE can generate contrastive explanations for large multi-agent optimization problems. We have also performed an extensive user study in four different domains that shows that, after being presented with these explanations, humans' satisfaction with the original solution increases.

Face Encryption via Frequency-Restricted Identity-Agnostic Attacks

  • paper_url: http://arxiv.org/abs/2308.05983
  • repo_url: None
  • paper_authors: Xin Dong, Rui Wang, Siyuan Liang, Aishan Liu, Lihua Jing
  • for: Preventing unauthorized face recognition systems from stealing sensitive biometric information from shared face images.
  • methods: A frequency-restricted identity-agnostic (FRIA) framework encrypts face images without access to personal information: it uses the average feature, obtained via a dataset crawled from the Internet, as the attack target (agnostic to the identities of unknown face recognition systems), and restricts perturbations in the low-frequency facial regions via the discrete cosine transform to guarantee visual naturalness.
  • results: Experiments on several face recognition models show that FRIA generates more natural encrypted faces than other state-of-the-art methods while attaining a high black-box attack success rate of 96%; its efficacy is also validated against a real-world commercial black-box API, indicating practical potential.
    Abstract Billions of people are sharing their daily live images on social media everyday. However, malicious collectors use deep face recognition systems to easily steal their biometric information (e.g., faces) from these images. Some studies are being conducted to generate encrypted face photos using adversarial attacks by introducing imperceptible perturbations to reduce face information leakage. However, existing studies need stronger black-box scenario feasibility and more natural visual appearances, which challenge the feasibility of privacy protection. To address these problems, we propose a frequency-restricted identity-agnostic (FRIA) framework to encrypt face images from unauthorized face recognition without access to personal information. As for the weak black-box scenario feasibility, we obverse that representations of the average feature in multiple face recognition models are similar, thus we propose to utilize the average feature via the crawled dataset from the Internet as the target to guide the generation, which is also agnostic to identities of unknown face recognition systems; in nature, the low-frequency perturbations are more visually perceptible by the human vision system. Inspired by this, we restrict the perturbation in the low-frequency facial regions by discrete cosine transform to achieve the visual naturalness guarantee. Extensive experiments on several face recognition models demonstrate that our FRIA outperforms other state-of-the-art methods in generating more natural encrypted faces while attaining high black-box attack success rates of 96%. In addition, we validate the efficacy of FRIA using real-world black-box commercial API, which reveals the potential of FRIA in practice. Our codes can be found in https://github.com/XinDong10/FRIA.
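
One reading of the frequency restriction is sketched below: the perturbation is moved into the DCT domain and its low-frequency coefficients, the ones most visible to the human eye, are suppressed before the perturbation is added to the face image. The cutoff and image size are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def suppress_low_frequencies(delta, cutoff=16):
    """Zero the top-left (low-frequency) DCT coefficients of a perturbation."""
    coeffs = dctn(delta, norm="ortho")
    coeffs[:cutoff, :cutoff] = 0.0                # low frequencies are the most perceptible
    return idctn(coeffs, norm="ortho")

face = np.random.rand(112, 112)                   # stand-in grayscale face crop
delta = 0.05 * np.random.randn(112, 112)          # raw adversarial perturbation
encrypted = np.clip(face + suppress_low_frequencies(delta), 0.0, 1.0)
```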

CyberForce: A Federated Reinforcement Learning Framework for Malware Mitigation

  • paper_url: http://arxiv.org/abs/2308.05978
  • repo_url: None
  • paper_authors: Chao Feng, Alberto Huertas Celdran, Pedro Miguel Sanchez Sanchez, Jan Kreischer, Jan von der Assen, Gerome Bovet, Gregorio Martinez Perez, Burkhard Stiller
  • for: Improving the cybersecurity of Internet-of-Things (IoT) devices against malware.
  • methods: CyberForce combines Federated Reinforcement Learning (FRL) with device fingerprinting and anomaly detection to collectively and privately determine suitable Moving Target Defense (MTD) techniques for mitigating diverse zero-day attacks, rewarding or penalizing the MTD mechanisms chosen by the FRL-based agent.
  • results: In experiments on a federation of ten devices of a real IoT platform, with six malware samples, CyberForce precisely learns optimal MTD mitigation strategies. When all clients are affected by all attacks, the FRL agent achieves high accuracy and shorter training time than a centralized RL agent; when different clients experience distinct attacks, clients benefit from knowledge transferred by others with similar attack behavior. CyberForce also shows notable robustness against data poisoning attacks.
    Abstract The expansion of the Internet-of-Things (IoT) paradigm is inevitable, but vulnerabilities of IoT devices to malware incidents have become an increasing concern. Recent research has shown that the integration of Reinforcement Learning with Moving Target Defense (MTD) mechanisms can enhance cybersecurity in IoT devices. Nevertheless, the numerous new malware attacks and the time that agents take to learn and select effective MTD techniques make this approach impractical for real-world IoT scenarios. To tackle this issue, this work presents CyberForce, a framework that employs Federated Reinforcement Learning (FRL) to collectively and privately determine suitable MTD techniques for mitigating diverse zero-day attacks. CyberForce integrates device fingerprinting and anomaly detection to reward or penalize MTD mechanisms chosen by an FRL-based agent. The framework has been evaluated in a federation consisting of ten devices of a real IoT platform. A pool of experiments with six malware samples affecting the devices has demonstrated that CyberForce can precisely learn optimum MTD mitigation strategies. When all clients are affected by all attacks, the FRL agent exhibits high accuracy and reduced training time when compared to a centralized RL agent. In cases where different clients experience distinct attacks, the CyberForce clients gain benefits through the transfer of knowledge from other clients and similar attack behavior. Additionally, CyberForce showcases notable robustness against data poisoning attacks.

Tweet Sentiment Extraction using Viterbi Algorithm with Transfer Learning

  • paper_url: http://arxiv.org/abs/2308.05973
  • repo_url: https://github.com/Zied130/Tweet_Sentiment-
  • paper_authors: Zied Baklouti
  • for: Identifying the part of a tweet sentence that carries the sentiment.
  • methods: The study builds on the Viterbi algorithm, previously modified by the author to accept pre-trained model parameters, and introduces a confidence score and confidence vector as internal evaluation indicators, together with a method to fine-tune this nonparametric model.
  • results: The model becomes highly explainable: the confidence score vector reveals precisely where the least-confident predicted states are and whether the approved modifications improve the confidence score or the tuning is heading in the wrong direction.
    Abstract Tweet sentiment extraction extracts the most significant portion of the sentence, determining whether the sentiment is positive or negative. This research aims to identify the part of tweet sentences that strikes any emotion. To reach this objective, we continue improving the Viterbi algorithm previously modified by the author to make it able to receive pre-trained model parameters. We introduce the confidence score and vector as two indicators responsible for evaluating the model internally before assessing the final results. We then present a method to fine-tune this nonparametric model. We found that the model gets highly explainable as the confidence score vector reveals precisely where the least confidence predicted states are and if the modifications approved ameliorate the confidence score or if the tuning is going in the wrong direction.
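
For readers unfamiliar with the underlying decoder, here is a generic Viterbi sketch extended with a per-step confidence score (the margin between the best and second-best path scores), illustrating the kind of internal indicator described above. It is not the author's modified algorithm.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, obs):
    """log_start: (S,), log_trans: (S, S), log_emit: (S, V), obs: sequence of symbol ids."""
    dp = log_start + log_emit[:, obs[0]]
    back, confidence = [], []
    for o in obs[1:]:
        scores = dp[:, None] + log_trans          # scores[i, j]: best path ending i -> j
        back.append(scores.argmax(axis=0))
        dp = scores.max(axis=0) + log_emit[:, o]
        top2 = np.sort(dp)[-2:]
        confidence.append(float(top2[1] - top2[0]))   # small margin = low confidence
    path = [int(dp.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1], confidence
```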

An Encoder-Decoder Approach for Packing Circles

  • paper_url: http://arxiv.org/abs/2308.07335
  • repo_url: None
  • paper_authors: Akshay Kiran Jose, Gangadhar Karevvanavar, Rajshekhar V Bhat
  • for: Tackling the long-studied packing problem of placing smaller objects completely inside a larger object with no overlap, or with minimum overlap, between them.
  • methods: A novel encoder-decoder architecture, consisting of an encoder block, a perturbation block, and a decoder block, packs identical circles within a larger circle. The encoder takes the index of a circle as input and outputs its center through a normalization layer; the perturbation layer adds controlled perturbations that keep the center within the radius of the smaller circle; and the decoder takes the perturbed center as input and estimates the index of the intended circle.
  • results: The method generalizes to objects of higher dimensions and different shapes through suitable normalization and perturbation layers, giving sub-optimal packings with performance competitive with classical methods.
    Abstract The problem of packing smaller objects within a larger object has been of interest since decades. In these problems, in addition to the requirement that the smaller objects must lie completely inside the larger objects, they are expected to not overlap or have minimum overlap with each other. Due to this, the problem of packing turns out to be a non-convex problem, obtaining whose optimal solution is challenging. As such, several heuristic approaches have been used for obtaining sub-optimal solutions in general, and provably optimal solutions for some special instances. In this paper, we propose a novel encoder-decoder architecture consisting of an encoder block, a perturbation block and a decoder block, for packing identical circles within a larger circle. In our approach, the encoder takes the index of a circle to be packed as an input and outputs its center through a normalization layer, the perturbation layer adds controlled perturbations to the center, ensuring that it does not deviate beyond the radius of the smaller circle to be packed, and the decoder takes the perturbed center as input and estimates the index of the intended circle for packing. We parameterize the encoder and decoder by a neural network and optimize it to reduce an error between the decoder's estimated index and the actual index of the circle provided as input to the encoder. The proposed approach can be generalized to pack objects of higher dimensions and different shapes by carefully choosing normalization and perturbation layers. The approach gives a sub-optimal solution and is able to pack smaller objects within a larger object with competitive performance with respect to classical methods.
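
The two geometric constraints in the pipeline are easy to sketch: a normalization step maps an unconstrained 2-D output to a center strictly inside the container (radius R minus the small-circle radius r), and the perturbation step adds noise whose norm stays below r. The radii and the specific squashing functions are illustrative assumptions.

```python
import torch

def normalize_center(raw, R=1.0, r=0.1):
    """Map an unconstrained 2-D output to a feasible circle center."""
    direction = raw / (raw.norm(dim=-1, keepdim=True) + 1e-8)
    radius = (R - r) * torch.sigmoid(raw.norm(dim=-1, keepdim=True))
    return direction * radius                       # strictly inside radius R - r

def perturb(center, r=0.1):
    noise = torch.randn_like(center)
    noise = r * noise / (noise.norm(dim=-1, keepdim=True) + 1.0)   # ||noise|| < r
    return center + noise

centers = normalize_center(torch.randn(5, 2))       # one center per circle index
print(perturb(centers))
```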

Decentralised Governance for Foundation Model based Systems: Exploring the Role of Blockchain in Responsible AI

  • paper_url: http://arxiv.org/abs/2308.05962
  • repo_url: None
  • paper_authors: Yue Liu, Qinghua Lu, Liming Zhu, Hye-Young Paik
  • for: Examining how foundation-model-based AI systems should be governed to ensure their trustworthiness and to prevent misuse that could harm humans, society, and the environment.
  • methods: The paper identifies eight governance challenges across the entire lifecycle of foundation-model-based AI systems, spanning the three fundamental dimensions of governance: decision rights, incentives, and accountability. It then explores the potential of blockchain, as a distributed ledger, to enable decentralized governance.
  • results: An architecture demonstrating how blockchain can be leveraged to realize governance in foundation-model-based AI systems.
    Abstract Foundation models are increasingly attracting interest worldwide for their distinguished capabilities and potential to perform a wide variety of tasks. Nevertheless, people are concerned about whether foundation model based AI systems are properly governed to ensure trustworthiness of foundation model based AI systems and to prevent misuse that could harm humans, society and the environment. In this paper, we identify eight governance challenges in the entire lifecycle of foundation model based AI systems regarding the three fundamental dimensions of governance: decision rights, incentives, and accountability. Furthermore, we explore the potential of blockchain as a solution to address the challenges by providing a distributed ledger to facilitate decentralised governance. We present an architecture that demonstrates how blockchain can be leveraged to realise governance in foundation model based AI systems.

BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents

  • paper_url: http://arxiv.org/abs/2308.05960
  • repo_url: https://github.com/salesforce/bolaa
  • paper_authors: Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese
  • for: Comprehensively comparing LLM-augmented Autonomous Agents (LAAs) in terms of both agent architectures and LLM backbones, and proposing a new strategy for orchestrating multiple agents in decision-making and multi-step reasoning environments.
  • methods: Multiple agent architectures and LLM backbones are evaluated through extensive simulations; the proposed BOLAA strategy has each labor LAA focus on one type of action, with a controller managing the communication among the agents.
  • results: BOLAA improves LAA performance across environments, and the results provide quantitative suggestions for designing LAA architectures, for the optimal choice of LLMs, and for the compatibility of both.
    Abstract The massive successes of large language models (LLMs) encourage the emerging exploration of LLM-augmented Autonomous Agents (LAAs). An LAA is able to generate actions with its core LLM and interact with environments, which facilitates the ability to resolve complex tasks by conditioning on past interactions such as observations and actions. Since the investigation of LAA is still very recent, limited explorations are available. Therefore, we provide a comprehensive comparison of LAA in terms of both agent architectures and LLM backbones. Additionally, we propose a new strategy to orchestrate multiple LAAs such that each labor LAA focuses on one type of action, \textit{i.e.} BOLAA, where a controller manages the communication among multiple agents. We conduct simulations on both decision-making and multi-step reasoning environments, which comprehensively justify the capacity of LAAs. Our performance results provide quantitative suggestions for designing LAA architectures and the optimal choice of LLMs, as well as the compatibility of both. We release our implementation code of LAAs to the public at \url{https://github.com/salesforce/BOLAA}.
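
The orchestration pattern is simple to express: a controller routes each step to the one "labor" agent specialized in that action type. The toy interfaces below are invented for illustration; the released repository contains the actual design.

```python
class LaborAgent:
    """One agent per action type, backed by an LLM callable."""
    def __init__(self, action_type, llm):
        self.action_type, self.llm = action_type, llm

    def act(self, observation):
        return self.llm(f"[{self.action_type}] {observation}")

class Controller:
    """Manages communication: each step is routed to exactly one labor agent."""
    def __init__(self, agents):
        self.agents = {a.action_type: a for a in agents}

    def step(self, action_type, observation):
        return self.agents[action_type].act(observation)

fake_llm = lambda prompt: f"action for: {prompt}"    # stand-in for an LLM backbone
controller = Controller([LaborAgent("search", fake_llm), LaborAgent("click", fake_llm)])
print(controller.step("search", "find red shoes under $50"))
```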

FoodSAM: Any Food Segmentation

  • paper_url: http://arxiv.org/abs/2308.05938
  • repo_url: https://github.com/jamesjg/foodsam
  • paper_authors: Xing Lan, Jiayi Lyu, Hanyu Jiang, Kun Dong, Zehai Niu, Yi Zhang, Jian Xue
  • for: Exploring the zero-shot capability of the Segment Anything Model (SAM) for food image segmentation.
  • methods: The proposed FoodSAM framework integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality. Treating the ingredients in food as independent individuals, it also performs instance segmentation, and it extends to panoptic segmentation by incorporating an object detector.
  • results: Extensive experiments show that FoodSAM effectively segments food items at multiple levels of granularity, and it is the first work to achieve instance, panoptic, and promptable segmentation on food images. Code: https://github.com/jamesjg/FoodSAM.
    Abstract In this paper, we explore the zero-shot capability of the Segment Anything Model (SAM) for food image segmentation. To address the lack of class-specific information in SAM-generated masks, we propose a novel framework, called FoodSAM. This innovative approach integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality. Besides, we recognize that the ingredients in food can be supposed as independent individuals, which motivated us to perform instance segmentation on food images. Furthermore, FoodSAM extends its zero-shot capability to encompass panoptic segmentation by incorporating an object detector, which renders FoodSAM to effectively capture non-food object information. Drawing inspiration from the recent success of promptable segmentation, we also extend FoodSAM to promptable segmentation, supporting various prompt variants. Consequently, FoodSAM emerges as an all-encompassing solution capable of segmenting food items at multiple levels of granularity. Remarkably, this pioneering framework stands as the first-ever work to achieve instance, panoptic, and promptable segmentation on food images. Extensive experiments demonstrate the feasibility and impressing performance of FoodSAM, validating SAM's potential as a prominent and influential tool within the domain of food image segmentation. We release our code at https://github.com/jamesjg/FoodSAM.
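
A schematic of the mask-enhancement step: each class-agnostic SAM mask votes for the majority label inside its region on a coarse semantic map, turning unlabeled segments into labeled food items. The background convention and majority-vote rule are assumptions for illustration.

```python
import numpy as np

def label_sam_masks(sam_masks, semantic_map, background=0):
    """sam_masks: iterable of (H, W) boolean masks; semantic_map: (H, W) int labels."""
    labeled = []
    for m in sam_masks:
        if not m.any():
            continue
        labels, counts = np.unique(semantic_map[m], return_counts=True)
        majority = int(labels[counts.argmax()])     # dominant semantic class in the mask
        if majority != background:                  # discard non-food regions
            labeled.append((m, majority))
    return labeled
```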

A Deep Recurrent-Reinforcement Learning Method for Intelligent AutoScaling of Serverless Functions

  • paper_url: http://arxiv.org/abs/2308.05937
  • repo_url: None
  • paper_authors: Siddharth Agarwal, Maria A. Rodriguez, Rajkumar Buyya
  • for: Function autoscaling in cloud environments, specifically for IoT-edge data processing and anomaly detection.
  • methods: Model-free Recurrent RL agent and Proximal Policy Optimization (PPO) algorithm.
  • results: Compared to commercially used threshold-based function autoscaling, the LSTM-based agent improves throughput by 18% and function execution by 13%, and accounts for 8.4% more function instances.
    Abstract Function-as-a-Service (FaaS) introduces a lightweight, function-based cloud execution model that finds its relevance in applications like IoT-edge data processing and anomaly detection. While CSP offer a near-infinite function elasticity, these applications often experience fluctuating workloads and stricter performance constraints. A typical CSP strategy is to empirically determine and adjust desired function instances, "autoscaling", based on monitoring-based thresholds such as CPU or memory, to cope with demand and performance. However, threshold configuration either requires expert knowledge, historical data or a complete view of environment, making autoscaling a performance bottleneck lacking an adaptable solution.RL algorithms are proven to be beneficial in analysing complex cloud environments and result in an adaptable policy that maximizes the expected objectives. Most realistic cloud environments usually involve operational interference and have limited visibility, making them partially observable. A general solution to tackle observability in highly dynamic settings is to integrate Recurrent units with model-free RL algorithms and model a decision process as a POMDP. Therefore, in this paper, we investigate a model-free Recurrent RL agent for function autoscaling and compare it against the model-free Proximal Policy Optimisation (PPO) algorithm. We explore the integration of a LSTM network with the state-of-the-art PPO algorithm to find that under our experimental and evaluation settings, recurrent policies were able to capture the environment parameters and show promising results for function autoscaling. We further compare a PPO-based autoscaling agent with commercially used threshold-based function autoscaling and posit that a LSTM-based autoscaling agent is able to improve throughput by 18%, function execution by 13% and account for 8.4% more function instances.
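
The recurrent-policy idea can be condensed into a few lines: an LSTM summarizes the partially observable history of workload and performance metrics, and actor/critic heads (as used by PPO) pick a scaling action such as removing, keeping, or adding a function instance. Layer sizes and the action set are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class RecurrentScalingPolicy(nn.Module):
    def __init__(self, obs_dim=6, hidden=64, n_actions=3):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, n_actions)   # scale down / hold / scale up
        self.critic = nn.Linear(hidden, 1)          # value head for PPO updates

    def forward(self, obs_seq, state=None):
        h, state = self.lstm(obs_seq, state)        # memory over the metric history
        return self.actor(h[:, -1]), self.critic(h[:, -1]), state

policy = RecurrentScalingPolicy()
logits, value, state = policy(torch.randn(1, 10, 6))    # 10 monitoring intervals
action = torch.distributions.Categorical(logits=logits).sample()
```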

LittleMu: Deploying an Online Virtual Teaching Assistant via Heterogeneous Sources Integration and Chain of Teach Prompts

  • paper_url: http://arxiv.org/abs/2308.05935
  • repo_url: https://github.com/thu-keg/vta
  • paper_authors: Shangqing Tu, Zheyuan Zhang, Jifan Yu, Chunyang Li, Siyu Zhang, Zijun Yao, Lei Hou, Juanzi Li
  • for: Providing a virtual MOOC teaching assistant, built with minimal labeled training data, to support learning for massive online students.
  • methods: The system comprises two interactive modules: heterogeneous retrieval, which integrates structured, semi-structured, and unstructured knowledge sources to support accurate answers for a wide range of questions, and "Chain of Teach" prompts, which exploit a large-scale pre-trained model to handle complex uncollected questions.
  • results: Performance is verified through both offline evaluation and online deployment; since May 2020, the system has served over 80,000 users with over 300,000 queries from over 500 courses on the XuetangX MOOC platform.
    Abstract Teaching assistants have played essential roles in the long history of education. However, few MOOC platforms are providing human or virtual teaching assistants to support learning for massive online students due to the complexity of real-world online education scenarios and the lack of training data. In this paper, we present a virtual MOOC teaching assistant, LittleMu with minimum labeled training data, to provide question answering and chit-chat services. Consisting of two interactive modules of heterogeneous retrieval and language model prompting, LittleMu first integrates structural, semi- and unstructured knowledge sources to support accurate answers for a wide range of questions. Then, we design delicate demonstrations named "Chain of Teach" prompts to exploit the large-scale pre-trained model to handle complex uncollected questions. Except for question answering, we develop other educational services such as knowledge-grounded chit-chat. We test the system's performance via both offline evaluation and online deployment. Since May 2020, our LittleMu system has served over 80,000 users with over 300,000 queries from over 500 courses on XuetangX MOOC platform, which continuously contributes to a more convenient and fair education. Our code, services, and dataset will be available at https://github.com/THU-KEG/VTA.

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT

  • paper_url: http://arxiv.org/abs/2308.06294
  • repo_url: None
  • paper_authors: Jingye Yang, Cong Liu, Wendy Deng, Da Wu, Chunhua Weng, Yunyun Zhou, Kai Wang
  • for: 本研究旨在开发基于 Transformer 架构的大语言模型(LLM),以自动检测临床表型术语,包括 HPO 中未收录的术语。
  • methods: 我们开发了两种模型:PhenoBCBERT 和 PhenoGPT。PhenoBCBERT 以 Bio+Clinical BERT 为预训练模型,而 PhenoGPT 可从多种 GPT 模型初始化,包括 GPT-J、Falcon 和 LLaMA 等开源版本,以及 GPT-3 和 GPT-3.5 等闭源版本。
  • results: 我们发现我们的方法可以从临床记录中提取更多表型概念,包括 HPO 未收录的新术语。我们还在生物医学文献上进行了案例研究,以展示新表型信息的识别和提取。我们从模型架构、内存占用、速度、准确率和隐私保护等多个方面比较了现有的基于 BERT 与基于 GPT 的表型标注模型。
    Abstract We hypothesize that large language models (LLMs) based on the transformer architecture can enable automated detection of clinical phenotype terms, including terms not documented in the HPO. In this study, we developed two types of models: PhenoBCBERT, a BERT-based model, utilizing Bio+Clinical BERT as its pre-trained model, and PhenoGPT, a GPT-based model that can be initialized from diverse GPT models, including open-source versions such as GPT-J, Falcon, and LLaMA, as well as closed-source versions such as GPT-3 and GPT-3.5. We compared our methods with PhenoTagger, a recently developed HPO recognition tool that combines rule-based and deep learning methods. We found that our methods can extract more phenotype concepts, including novel ones not characterized by HPO. We also performed case studies on biomedical literature to illustrate how new phenotype information can be recognized and extracted. We compared current BERT-based versus GPT-based models for phenotype tagging, in multiple aspects including model architecture, memory usage, speed, accuracy, and privacy protection. We also discussed the addition of a negation step and an HPO normalization layer to the transformer models for improved HPO term tagging. In conclusion, PhenoBCBERT and PhenoGPT enable the automated discovery of phenotype terms from clinical notes and biomedical literature, facilitating automated downstream tasks to derive new biological insights on human diseases.
    摘要 我们假设基于 Transformer 架构的大语言模型(LLM)能够自动检测临床表型术语,包括 HPO 中未收录的术语。在本研究中,我们开发了两类模型:PhenoBCBERT(以 Bio+Clinical BERT 为预训练模型的 BERT 类模型)和 PhenoGPT(可从多种 GPT 模型初始化的 GPT 类模型,包括 GPT-J、Falcon、LLaMA 等开源版本,以及 GPT-3、GPT-3.5 等闭源版本)。我们将所提方法与 PhenoTagger(一种结合规则与深度学习的 HPO 识别工具)进行了比较,发现我们的方法能够提取更多表型概念,包括 HPO 未刻画的新概念。我们还在生物医学文献上开展案例研究,展示如何识别并提取新的表型信息。我们从模型架构、内存占用、速度、准确率和隐私保护等多个方面比较了当前基于 BERT 与基于 GPT 的表型标注模型,并讨论了在 Transformer 模型中加入否定判断步骤和 HPO 规范化层以改进 HPO 术语标注。总之,PhenoBCBERT 和 PhenoGPT 能够从临床记录和生物医学文献中自动发现表型术语,促进下游自动化任务,进而获得关于人类疾病的新生物学见解。
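
As a rough illustration of the PhenoBCBERT setup, the sketch below wires the public Bio+Clinical BERT checkpoint into a HuggingFace token-classification head with an assumed BIO phenotype label scheme; the head is untrained here, and the paper's actual training pipeline is not shown.

```python
# Sketch of a Bio+Clinical BERT token-classification setup in the spirit of
# PhenoBCBERT. The BIO label set and the untrained classification head are
# assumptions; the paper's data and training loop are not shown.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

labels = ["O", "B-PHENOTYPE", "I-PHENOTYPE"]
tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=len(labels)
)

text = "Patient presents with macrocephaly and global developmental delay."
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0]        # per-token label ids
for token, label_id in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), pred):
    print(token, labels[int(label_id)])             # random tags until fine-tuned
```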

Learning to Team-Based Navigation: A Review of Deep Reinforcement Learning Techniques for Multi-Agent Pathfinding

  • paper_url: http://arxiv.org/abs/2308.05893
  • repo_url: None
  • paper_authors: Jaehoon Chung, Jamil Fayyad, Younes Al Younes, Homayoun Najjaran
  • for: 本论文主要探讨了多智能体路径规划(MAPF)领域中深度强化学习(DRL)的应用。
  • methods: 本文系统回顾了用 DRL 技术解决 MAPF 中复杂问题的方法,并梳理了用于评估不同 MAPF 算法的统一评估指标。
  • results: 本文提供了一套评估不同 MAPF 算法的统一指标,并介绍了基于模型的 DRL 作为未来研究的可能方向及其所需的基础理解。
    Abstract Multi-agent pathfinding (MAPF) is a critical field in many large-scale robotic applications, often being the fundamental step in multi-agent systems. The increasing complexity of MAPF in complex and crowded environments, however, critically diminishes the effectiveness of existing solutions. In contrast to other studies that have either presented a general overview of the recent advancements in MAPF or extensively reviewed Deep Reinforcement Learning (DRL) within multi-agent system settings independently, our work presented in this review paper focuses on highlighting the integration of DRL-based approaches in MAPF. Moreover, we aim to bridge the current gap in evaluating MAPF solutions by addressing the lack of unified evaluation metrics and providing comprehensive clarification on these metrics. Finally, our paper discusses the potential of model-based DRL as a promising future direction and provides its required foundational understanding to address current challenges in MAPF. Our objective is to assist readers in gaining insight into the current research direction, providing unified metrics for comparing different MAPF algorithms and expanding their knowledge of model-based DRL to address the existing challenges in MAPF.
    摘要 多智能体路径规划(MAPF)是许多大规模机器人应用中的关键领域,通常是多智能体系统的基础环节。然而,在复杂拥挤环境中,MAPF 问题日益复杂,严重削弱了现有解决方案的有效性。与以往或综述 MAPF 最新进展、或单独系统回顾多智能体环境下深度强化学习(DRL)的研究不同,本综述聚焦于 DRL 方法在 MAPF 中的融合。此外,针对 MAPF 解决方案缺乏统一评估指标的问题,我们对这些指标给出了全面的阐释,以弥合当前评估上的差距。最后,本文讨论了基于模型的 DRL 作为有前景的未来方向,并提供了应对当前 MAPF 挑战所需的基础理解。我们的目标是帮助读者了解当前研究方向,提供比较不同 MAPF 算法的统一指标,并拓展其对基于模型的 DRL 的认识,以应对 MAPF 中的现有挑战。

DF2: Distribution-Free Decision-Focused Learning

  • paper_url: http://arxiv.org/abs/2308.05889
  • repo_url: None
  • paper_authors: Lingkai Kong, Wenhao Mu, Jiaming Cui, Yuchen Zhuang, B. Aditya Prakash, Bo Dai, Chao Zhang
  • for: 这篇论文旨在解决预测-优化(predict-then-optimize)问题中决策聚焦学习(DFL)方法的三大瓶颈。
  • methods: 该论文提出了一种新的分布无关决策聚焦学习方法(DF2),用于解决模型失配误差、样本平均近似误差和梯度近似误差。
  • results: 该论文通过在一个合成问题、一个风电竞标问题和一个非凸疫苗分配问题上的实验,证明了 DF2 的有效性。
    Abstract Decision-focused learning (DFL) has recently emerged as a powerful approach for predict-then-optimize problems by customizing a predictive model to a downstream optimization task. However, existing end-to-end DFL methods are hindered by three significant bottlenecks: model mismatch error, sample average approximation error, and gradient approximation error. Model mismatch error stems from the misalignment between the model's parameterized predictive distribution and the true probability distribution. Sample average approximation error arises when using finite samples to approximate the expected optimization objective. Gradient approximation error occurs as DFL relies on the KKT condition for exact gradient computation, while most methods approximate the gradient for backpropagation in non-convex objectives. In this paper, we present DF2 -- the first \textit{distribution-free} decision-focused learning method explicitly designed to address these three bottlenecks. Rather than depending on a task-specific forecaster that requires precise model assumptions, our method directly learns the expected optimization function during training. To efficiently learn the function in a data-driven manner, we devise an attention-based model architecture inspired by the distribution-based parameterization of the expected objective. Our method is, to the best of our knowledge, the first to address all three bottlenecks within a single model. We evaluate DF2 on a synthetic problem, a wind power bidding problem, and a non-convex vaccine distribution problem, demonstrating the effectiveness of DF2.
    摘要 决策聚焦学习(DFL)最近成为解决预测-优化问题的一种有力方法,其通过针对下游优化任务定制预测模型来实现。然而,现有的端到端 DFL 方法受到三大瓶颈的制约:模型失配误差、样本平均近似误差和梯度近似误差。模型失配误差来自模型参数化的预测分布与真实概率分布之间的不一致;样本平均近似误差产生于使用有限样本来近似期望优化目标;梯度近似误差则源于 DFL 依赖 KKT 条件进行精确梯度计算,而大多数方法在非凸目标下以近似梯度进行反向传播。在这篇论文中,我们提出了 DF2——首个显式针对上述三大瓶颈设计的分布无关(distribution-free)决策聚焦学习方法。我们的方法不依赖需要精确模型假设的任务特定预测器,而是在训练中直接学习期望优化函数。为了以数据驱动的方式高效地学习该函数,我们设计了一种受期望目标的分布式参数化启发的注意力模型架构。据我们所知,我们的方法是首个在单一模型内同时解决三大瓶颈的方法。我们在一个合成问题、一个风电竞标问题和一个非凸疫苗分配问题上评估了 DF2,证明了其有效性。

Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips

  • paper_url: http://arxiv.org/abs/2308.05869
  • repo_url: None
  • paper_authors: Ismet Dagli, Mehmet Belviranli
  • for: 这篇论文是为了提出一种新的策略来管理移动和自动化系统中的多个工作负荷,以提高系统的性能和资源利用率。
  • methods: 该论文使用了一种新的策略,即HaX-CoNN,来映射具有不同加速器的concurrently执行深度神经网络(DNN)推理任务到SoC中的多种加速器中。该策略考虑了每层执行特性、共享内存(SM)竞争和间接加速器转换,以找到最佳调度。
  • results: 实验结果表明,HaX-CoNN 最多可将 SM 竞争降低 45%,并相比现有方法将延迟和总吞吐量分别改善最多 32% 和 29%。
    Abstract Two distinguishing features of state-of-the-art mobile and autonomous systems are 1) there are often multiple workloads, mainly deep neural network (DNN) inference, running concurrently and continuously; and 2) they operate on shared memory system-on-chips (SoC) that embed heterogeneous accelerators tailored for specific operations. State-of-the-art lacks efficient performance and resource management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers in concurrently executing DNN inference workloads to a diverse set of accelerators within a SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN minimizes memory contention by up to 45% and can improve latency and total throughput by up to 32% and 29%, respectively, compared to the state-of-the-art approaches.
    摘要 最新的移动与自主系统有两个显著特点:1)通常有多个工作负载(主要是深度神经网络(DNN)推理)持续并发运行;2)它们运行在共享内存的片上系统(SoC)上,该系统嵌入了为特定操作定制的异构加速器。现有技术缺乏高效的性能与资源管理手段,难以最大化系统总吞吐量或最小化端到端工作负载延迟。在本工作中,我们提出了 HaX-CoNN 方案,它刻画并将并发执行的 DNN 推理工作负载中的各层映射到 SoC 中多种不同的加速器上。我们的方案独特地考虑了每层的执行特性、共享内存(SM)竞争以及加速器间的转换,以找到最优调度。我们在 NVIDIA Orin、NVIDIA Xavier 和 Qualcomm Snapdragon 865 SoC 上对 HaX-CoNN 进行了实验。实验结果表明,与现有方法相比,HaX-CoNN 最多可将内存竞争降低 45%,并可将延迟和总吞吐量分别提升最多 32% 和 29%。
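
A toy version of the underlying scheduling problem, under assumed per-layer runtimes, contention, and transition penalties (the paper's cost model and solver are considerably more elaborate):

```python
# Toy illustration of the scheduling problem HaX-CoNN solves: assign each DNN
# layer group to an accelerator, scoring per-layer runtimes plus assumed
# shared-memory contention and inter-accelerator transition penalties. The
# numbers and the exhaustive search are illustrative, not the paper's solver.
from itertools import product

runtime = {  # ms per layer group on each accelerator (made-up numbers)
    "GPU": [1.0, 4.0, 2.0],
    "DLA": [2.5, 1.5, 3.5],
}
TRANSITION_MS = 0.8   # cost of moving activations between accelerators
CONTENTION_MS = 0.6   # penalty when a concurrent workload shares memory bandwidth

def schedule_cost(assignment):
    cost = sum(runtime[acc][i] + CONTENTION_MS for i, acc in enumerate(assignment))
    cost += sum(TRANSITION_MS for a, b in zip(assignment, assignment[1:]) if a != b)
    return cost

best = min(product(runtime, repeat=3), key=schedule_cost)
print(best, round(schedule_cost(best), 2))   # e.g. ('GPU', 'DLA', 'GPU')
```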

Unleashing the Strengths of Unlabeled Data in Pan-cancer Abdominal Organ Quantification: the FLARE22 Challenge

  • paper_url: http://arxiv.org/abs/2308.05862
  • repo_url: https://github.com/junma11/flare
  • paper_authors: Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Shihao Ma, Adamo Young, Cheng Zhu, Kangkang Meng, Xin Yang, Ziyan Huang, Fan Zhang, Wentao Liu, YuanKe Pan, Shoujin Huang, Jiacheng Wang, Mingze Sun, Weixin Xu, Dengqiang Jia, Jae Won Choi, Natália Alves, Bram de Wilde, Gregor Koehler, Yajun Wu, Manuel Wiesenfarth, Qiongjie Zhu, Guoqiang Dong, Jian He, the FLARE Challenge Consortium, Bo Wang
  • for: 这篇论文的目的是要探讨自动化腹部疾病诊断和治疗规划中的量化器官评估。
  • methods: 这篇论文使用了许多人工智能(AI)算法,以测试它们在实际世界中的多元国际设定下的精度和效率。
  • results: 这篇论文发现了一些 AI 算法可以实现高度的准确性和效率,并且可以在不同种族、疾病、阶段和设备制造商的 CT 扫描图像中保持一致性。这些算法还可以自动提取腹部器官的关键生物特征,而这在传统人工测量中十分费时费力。
    Abstract Quantitative organ assessment is an essential step in automated abdominal disease diagnosis and treatment planning. Artificial intelligence (AI) has shown great potential to automatize this process. However, most existing AI algorithms rely on many expert annotations and lack a comprehensive evaluation of accuracy and efficiency in real-world multinational settings. To overcome these limitations, we organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. We constructed an intercontinental and multinational dataset from more than 50 medical groups, including Computed Tomography (CT) scans with different races, diseases, phases, and manufacturers. We independently validated that a set of AI algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0\% by using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. The best-performing algorithms successfully generalized to holdout external validation sets, achieving a median DSC of 89.5\%, 90.9\%, and 88.3\% on North American, European, and Asian cohorts, respectively. They also enabled automatic extraction of key organ biology features, which was labor-intensive with traditional manual measurements. This opens the potential to use unlabeled data to boost performance and alleviate annotation shortages for modern AI models.
    摘要 量化器官评估是自动化腹部疾病诊断和治疗规划中的关键步骤。人工智能(AI)在自动化这一过程方面展现出了巨大潜力。然而,现有的大多数 AI 算法依赖大量专家标注,且缺乏在真实世界多国场景下对精度和效率的全面评估。为克服这些局限,我们组织了 FLARE 2022 挑战赛,这是迄今规模最大的腹部器官分析挑战赛,用于评测快速、低资源、准确、标注高效且具有泛化性的 AI 算法。我们构建了来自 50 多个医疗团队的跨洲多国数据集,包括不同种族、疾病、阶段和设备制造商的 CT 扫描图像。我们独立验证了一组 AI 算法仅使用 50 个标注扫描和 2000 个无标注扫描,即可使中位数 Dice 相似系数(DSC)达到 90.0%,从而显著降低标注需求。表现最好的算法成功泛化到留出的外部验证集,在北美、欧洲和亚洲队列上的中位 DSC 分别达到 89.5%、90.9% 和 88.3%。这些算法还能够自动提取关键器官生物特征,而这在传统人工测量中十分费力。这为利用无标注数据提升性能并缓解现代 AI 模型标注短缺问题开启了可能。
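
The challenge's headline metric, the Dice Similarity Coefficient, is straightforward to compute; a NumPy version for binary segmentation masks:

```python
# Dice Similarity Coefficient (DSC) on binary segmentation masks.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

pred = np.zeros((64, 64), dtype=bool); pred[16:48, 16:48] = True
gt = np.zeros((64, 64), dtype=bool);   gt[20:52, 20:52] = True
print(f"DSC = {dice(pred, gt):.3f}")   # overlap of two offset squares
```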

Are We Closing the Loop Yet? Gaps in the Generalizability of VIS4ML Research

  • paper_url: http://arxiv.org/abs/2308.06290
  • repo_url: None
  • paper_authors: Hariharan Subramonyam, Jessica Hullman
  • for: 这篇论文旨在考察面向机器学习的可视化(VIS4ML)研究,该类研究帮助专家运用先验知识来开发、理解和改进机器学习模型。
  • methods: 作者综述了近期的 VIS4ML 论文,考察其中刻画人类知识以支持人在回路任务、设计交互式可视化使 ML 组件可解释并引出知识,以及评估人与模型交互有效性的做法。
  • results: 作者发现,目前 VIS4ML 研究的范围与其实践应用的愿景之间存在差距:许多论文的结论往往过拟合于非代表性场景,基于与少数 ML 专家和充分理解的数据集的交互,未能承认关键依赖关系,且依赖缺乏论证的决策。
    Abstract Visualization for machine learning (VIS4ML) research aims to help experts apply their prior knowledge to develop, understand, and improve the performance of machine learning models. In conceiving VIS4ML systems, researchers characterize the nature of human knowledge to support human-in-the-loop tasks, design interactive visualizations to make ML components interpretable and elicit knowledge, and evaluate the effectiveness of human-model interchange. We survey recent VIS4ML papers to assess the generalizability of research contributions and claims in enabling human-in-the-loop ML. Our results show potential gaps between the current scope of VIS4ML research and aspirations for its use in practice. We find that while papers motivate that VIS4ML systems are applicable beyond the specific conditions studied, conclusions are often overfitted to non-representative scenarios, are based on interactions with a small set of ML experts and well-understood datasets, fail to acknowledge crucial dependencies, and hinge on decisions that lack justification. We discuss approaches to close the gap between aspirations and research claims and suggest documentation practices to report generality constraints that better acknowledge the exploratory nature of VIS4ML research.
    摘要 面向机器学习的可视化(VIS4ML)研究旨在帮助专家运用先验知识来开发、理解和改进机器学习模型的性能。在构思 VIS4ML 系统时,研究者刻画人类知识的特性以支持人在回路(human-in-the-loop)任务,设计交互式可视化使 ML 组件可解释并引出知识,并评估人与模型交互的有效性。我们综述了近期的 VIS4ML 论文,以评估其研究贡献与论断在支持人在回路 ML 方面的可推广性。结果显示,当前 VIS4ML 研究的范围与其实践应用的愿景之间可能存在差距。我们发现,尽管论文声称 VIS4ML 系统的适用性超出所研究的特定条件,其结论往往过拟合于非代表性场景,基于与少数 ML 专家和充分理解的数据集的交互,未能承认关键依赖关系,并依赖缺乏论证的决策。我们讨论了弥合愿景与研究论断之间差距的途径,并建议通过文档化实践报告一般性约束,以更好地承认 VIS4ML 研究的探索性质。

Knowledge Propagation over Conditional Independence Graphs

  • paper_url: http://arxiv.org/abs/2308.05857
  • repo_url: None
  • paper_authors: Urszula Chajewska, Harsh Shrivastava
  • for: 这篇论文主要提出了在条件独立图(CI 图)上进行知识传播的算法。
  • methods: 该论文使用无向图对特征之间的相互关系进行建模,边权表示特征间的偏相关强度。
  • results: 实验结果表明,该算法在公开可用的 Cora 和 PubMed 数据集上优于现有技术。
    Abstract Conditional Independence (CI) graph is a special type of a Probabilistic Graphical Model (PGM) where the feature connections are modeled using an undirected graph and the edge weights show the partial correlation strength between the features. Since the CI graphs capture direct dependence between features, they have been garnering increasing interest within the research community for gaining insights into the systems from various domains, in particular discovering the domain topology. In this work, we propose algorithms for performing knowledge propagation over the CI graphs. Our experiments demonstrate that our techniques improve upon the state-of-the-art on the publicly available Cora and PubMed datasets.
    摘要 条件独立(CI)图是一类特殊的概率图模型(PGM),其特征间的连接用无向图建模,边权表示特征之间的偏相关强度。由于 CI 图刻画了特征之间的直接依赖关系,它在各领域的系统洞察研究中(尤其是发现领域拓扑结构方面)受到越来越多的关注。在本工作中,我们提出了在 CI 图上进行知识传播的算法。实验结果表明,我们的技术在公开可用的 Cora 和 PubMed 数据集上超越了当前最优水平。
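
A small illustrative pipeline, assuming a threshold on partial correlations to build the CI graph and a simple iterative propagation rule (the paper's actual propagation algorithm may differ):

```python
# Illustrative pipeline: estimate a CI graph from the precision (inverse
# covariance) matrix via partial correlations, then propagate node labels
# over the resulting weighted undirected graph. The thresholding and the
# simple diffusion rule are assumptions, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 0.8 * X[:, 0]                              # induce a direct dependence 0-1

theta = np.linalg.inv(np.cov(X, rowvar=False))        # precision matrix
d = np.sqrt(np.diag(theta))
partial_corr = -theta / np.outer(d, d)                # edge weights of the CI graph
np.fill_diagonal(partial_corr, 0.0)
W = np.abs(partial_corr) * (np.abs(partial_corr) > 0.2)

labels = np.array([1.0, 0.0, 0.0, 0.0, 0.0])          # seed knowledge at node 0
mask = np.array([True, False, False, False, False])   # which labels are clamped
for _ in range(20):                                   # diffuse scores to neighbours
    spread = W @ labels / np.maximum(W.sum(1), 1e-8)
    labels = np.where(mask, labels, spread)
print(np.round(labels, 3))                            # node 1 picks up the signal
```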

Seed Kernel Counting using Domain Randomization and Object Tracking Neural Networks

  • paper_url: http://arxiv.org/abs/2308.05846
  • repo_url: None
  • paper_authors: Venkat Margapuri, Prapti Thapaliya, Mitchell Neilsen
  • for: 这篇论文面向种子生产行业,尤其是无力负担高价机械化种子籽粒计数器的小规模种子生产企业。
  • methods: 论文提出使用 YOLO 等目标跟踪神经网络模型来低成本地估计谷物产量,并使用合成图像作为可行的替代方案来训练用于目标跟踪的神经网络。
  • results: 论文实现了一套由低成本机械料斗、训练好的 YOLOv8 神经网络模型以及 StrongSORT 和 ByteTrack 目标跟踪算法组成的系统,用于从视频中估计谷物产量。结果显示,使用 StrongSORT 算法对大豆和小麦的计数准确率分别为 95.2% 和 93.2%,使用 ByteTrack 算法则分别为 96.8% 和 92.4%。
    Abstract High-throughput phenotyping (HTP) of seeds, also known as seed phenotyping, is the comprehensive assessment of complex seed traits such as growth, development, tolerance, resistance, ecology, yield, and the measurement of parameters that form more complex traits. One of the key aspects of seed phenotyping is cereal yield estimation that the seed production industry relies upon to conduct their business. While mechanized seed kernel counters are available in the market currently, they are often priced high and sometimes outside the range of small scale seed production firms' affordability. The development of object tracking neural network models such as You Only Look Once (YOLO) enables computer scientists to design algorithms that can estimate cereal yield inexpensively. The key bottleneck with neural network models is that they require a plethora of labelled training data before they can be put to task. We demonstrate that the use of synthetic imagery serves as a feasible substitute to train neural networks for object tracking that includes the tasks of object classification and detection. Furthermore, we propose a seed kernel counter that uses a low-cost mechanical hopper, trained YOLOv8 neural network model, and object tracking algorithms on StrongSORT and ByteTrack to estimate cereal yield from videos. The experiment yields a seed kernel count with an accuracy of 95.2\% and 93.2\% for Soy and Wheat respectively using the StrongSORT algorithm, and an accuray of 96.8\% and 92.4\% for Soy and Wheat respectively using the ByteTrack algorithm.
    摘要 种子的高通量表型分析(HTP),也称种子表型分析,是对种子生长、发育、耐受性、抗性、生态、产量等复杂性状以及构成这些性状的参数的综合评估。种子表型分析的一个关键环节是谷物产量估计,种子生产行业的业务正依赖于此。虽然市场上已有机械化种子籽粒计数器,但其价格往往较高,有时超出小规模种子生产企业的承受范围。YOLO(You Only Look Once)等目标跟踪神经网络模型的发展,使计算机科学家能够设计低成本估计谷物产量的算法。神经网络模型的主要瓶颈在于投入使用前需要大量标注训练数据。我们证明,使用合成图像作为可行的替代方案,可以训练涵盖目标分类与检测任务的目标跟踪神经网络。此外,我们提出了一种种子籽粒计数器,它使用低成本机械料斗、训练好的 YOLOv8 神经网络模型,以及 StrongSORT 和 ByteTrack 上的目标跟踪算法,从视频中估计谷物产量。实验得到的籽粒计数,使用 StrongSORT 算法时对大豆和小麦的准确率分别为 95.2% 和 93.2%,使用 ByteTrack 算法时分别为 96.8% 和 92.4%。
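
A minimal counting loop in the same spirit, assuming the ultralytics YOLOv8 tracking API with its bundled ByteTrack configuration; the weights file and the unique-track-ID counting rule are placeholders, not the paper's pipeline.

```python
# Sketch: count seed kernels in a video by tracking detections and counting
# unique track IDs. "seed_yolov8.pt" is a placeholder for a detector
# fine-tuned on (possibly synthetic) seed imagery.
from ultralytics import YOLO
import cv2

model = YOLO("seed_yolov8.pt")            # hypothetical fine-tuned detector
seen_ids = set()

cap = cv2.VideoCapture("hopper_video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = model.track(frame, persist=True, tracker="bytetrack.yaml", verbose=False)
    boxes = results[0].boxes
    if boxes.id is not None:              # tracker assigns a persistent ID per kernel
        seen_ids.update(int(i) for i in boxes.id.tolist())
cap.release()
print(f"estimated kernel count: {len(seen_ids)}")
```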

DiLogics: Creating Web Automation Programs With Diverse Logics

  • paper_url: http://arxiv.org/abs/2308.05828
  • repo_url: None
  • paper_authors: Kevin Pu, Jim Yang, Angel Yuan, Minyi Ma, Rui Dong, Xinyu Wang, Yan Chen, Tovi Grossman
  • for: 论文面向经常遇到重复性网页数据录入任务、希望通过网页自动化提升工作效率的知识工作者。
  • methods: 论文提出了一种名为 DiLogics 的演示编程(programming-by-demonstration)系统,借助自然语言处理(NLP)帮助用户创建能够处理多样化规格的网页自动化程序。
  • results: 论文表明,非专业人员能够有效使用 DiLogics 创建满足多样化输入指令的自动化程序;DiLogics 为开发满足多样化规格的网页自动化程序提供了高效、直观且表达力强的方法。
    Abstract Knowledge workers frequently encounter repetitive web data entry tasks, like updating records or placing orders. Web automation increases productivity, but translating tasks to web actions accurately and extending to new specifications is challenging. Existing tools can automate tasks that perform the same logical trace of UI actions (e.g., input text in each field in order), but do not support tasks requiring different executions based on varied input conditions. We present DiLogics, a programming-by-demonstration system that utilizes NLP to assist users in creating web automation programs that handle diverse specifications. DiLogics first semantically segments input data to structured task steps. By recording user demonstrations for each step, DiLogics generalizes the web macros to novel but semantically similar task requirements. Our evaluation showed that non-experts can effectively use DiLogics to create automation programs that fulfill diverse input instructions. DiLogics provides an efficient, intuitive, and expressive method for developing web automation programs satisfying diverse specifications.
    摘要 知识工作者经常遇到重复性的网页数据录入任务,如更新记录或提交订单。网页自动化可以提高生产力,但准确地将任务转化为网页操作并推广到新的规格是一个挑战。现有工具可以自动执行具有相同 UI 操作逻辑轨迹的任务(例如按顺序在每个字段中输入文本),但不支持需要根据不同输入条件采取不同执行方式的任务。我们提出了 DiLogics,一个基于演示编程的系统,利用自然语言处理(NLP)帮助用户创建能处理多样化规格的网页自动化程序。DiLogics 首先对输入数据进行语义分段,得到结构化的任务步骤;随后通过为每个步骤记录用户演示,将网页宏推广到新的、语义相近的任务需求。我们的评估显示,非专业人员可以有效地使用 DiLogics 创建满足多样化输入指令的自动化程序。DiLogics 为开发满足多样化规格的网页自动化程序提供了高效、直观且表达力强的方法。

Encode-Store-Retrieve: Enhancing Memory Augmentation through Language-Encoded Egocentric Perception

  • paper_url: http://arxiv.org/abs/2308.05822
  • repo_url: None
  • paper_authors: Junxiao Shen, John Dudley, Per Ola Kristensson
  • for: 增强人类记忆能力,尤其是对于长期内存和生活记忆。
  • methods: 利用增强现实头戴式显示器捕捉和保存生活视频,并使用自然语言编码将其存储在矢量数据库中。
  • results: 比较出色的结果,BLEU分数达8.3,超过了传统机器学习模型的3.4-5.8分数。在用户研究中,人工系统的响应得分为4.13/5,而人类参与者的得分为2.46/5。
    Abstract We depend on our own memory to encode, store, and retrieve our experiences. However, memory lapses can occur. One promising avenue for achieving memory augmentation is through the use of augmented reality head-mounted displays to capture and preserve egocentric videos, a practice commonly referred to as life logging. However, a significant challenge arises from the sheer volume of video data generated through life logging, as the current technology lacks the capability to encode and store such large amounts of data efficiently. Further, retrieving specific information from extensive video archives requires substantial computational power, further complicating the task of quickly accessing desired content. To address these challenges, we propose a memory augmentation system that involves leveraging natural language encoding for video data and storing them in a vector database. This approach harnesses the power of large vision language models to perform the language encoding process. Additionally, we propose using large language models to facilitate natural language querying. Our system underwent extensive evaluation using the QA-Ego4D dataset and achieved state-of-the-art results with a BLEU score of 8.3, outperforming conventional machine learning models that scored between 3.4 and 5.8. Additionally, in a user study, our system received a higher mean response score of 4.13/5 compared to the human participants' score of 2.46/5 on real-life episodic memory tasks.
    摘要 我们依赖自己的记忆来编码、存储和检索经验,然而记忆偏差难免发生。实现记忆增强的一条有前景的途径,是利用增强现实头戴式显示器捕捉并保存第一人称视频,这种做法通常被称为生活记录(life logging)。然而,生活记录产生的海量视频数据带来了重大挑战:现有技术无法高效地编码和存储如此大量的数据;而从庞大的视频档案中检索特定信息又需要可观的计算能力,使快速获取所需内容变得更加困难。为应对这些挑战,我们提出了一种记忆增强系统,其利用自然语言对视频数据进行编码,并将其存储在向量数据库中。这种方法借助大型视觉语言模型来完成语言编码过程。此外,我们还提出使用大型语言模型来支持自然语言查询。我们的系统在 QA-Ego4D 数据集上经过了广泛评估,取得了当前最优结果,BLEU 分数达 8.3,超越了得分介于 3.4 到 5.8 之间的传统机器学习模型。此外,在用户研究中,对于真实生活情景记忆任务,我们的系统获得了 4.13/5 的平均回答评分,高于人类参与者的 2.46/5。
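
A compact sketch of the encode-store-retrieve loop, assuming a small sentence-embedding model and an in-memory store in place of the paper's vision-language encoder and dedicated vector database:

```python
# Sketch of encode-store-retrieve: egocentric clips are summarized as
# natural-language captions, embedded into vectors, and later retrieved by
# cosine similarity to a query. Captions and model choice are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Encode + store: one caption per life-log segment.
captions = [
    "I placed my keys on the kitchen counter next to the kettle.",
    "Meeting with Alice about the quarterly report in room B.",
    "Bought oat milk and coffee beans at the corner shop.",
]
store = encoder.encode(captions, normalize_embeddings=True)   # (n, dim) matrix

# Retrieve: embed the query and rank stored memories by cosine similarity.
query = encoder.encode(["where did I leave my keys?"], normalize_embeddings=True)
scores = store @ query.T
print(captions[int(np.argmax(scores))])
```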

Neural Progressive Meshes

  • paper_url: http://arxiv.org/abs/2308.05741
  • repo_url: None
  • paper_authors: Yun-Chun Chen, Vladimir G. Kim, Noam Aigerman, Alec Jacobson
  • for: efficiently transmitting large geometric data (e.g., 3D meshes) over the Internet
  • methods: subdivision-based encoder-decoder architecture trained on a large collection of surfaces, with progressive transmission of residual features
  • results: outperforms baselines in terms of compression ratio and reconstruction quality
    Abstract The recent proliferation of 3D content that can be consumed on hand-held devices necessitates efficient tools for transmitting large geometric data, e.g., 3D meshes, over the Internet. Detailed high-resolution assets can pose a challenge to storage as well as transmission bandwidth, and level-of-detail techniques are often used to transmit an asset using an appropriate bandwidth budget. It is especially desirable for these methods to transmit data progressively, improving the quality of the geometry with more data. Our key insight is that the geometric details of 3D meshes often exhibit similar local patterns even across different shapes, and thus can be effectively represented with a shared learned generative space. We learn this space using a subdivision-based encoder-decoder architecture trained in advance on a large collection of surfaces. We further observe that additional residual features can be transmitted progressively between intermediate levels of subdivision that enable the client to control the tradeoff between bandwidth cost and quality of reconstruction, providing a neural progressive mesh representation. We evaluate our method on a diverse set of complex 3D shapes and demonstrate that it outperforms baselines in terms of compression ratio and reconstruction quality.
    摘要 近来可在手持设备上消费的 3D 内容激增,这需要高效的工具来在互联网上传输大规模几何数据(如 3D 网格)。细节丰富的高分辨率资产对存储和传输带宽都是挑战,因此常用多细节层次(level-of-detail)技术在合适的带宽预算内传输资产。这类方法尤其希望能渐进地传输数据,即随着数据增多不断提升几何质量。我们的关键洞察是,3D 网格的几何细节即使在不同形状之间也往往呈现相似的局部模式,因此可以用一个共享的学习生成空间来有效表示。我们使用基于细分的编码器-解码器架构,在大量表面数据上预先训练来学习这一空间。我们进一步观察到,可以在细分的中间层级之间渐进地传输额外的残差特征,使客户端能够控制带宽成本与重建质量之间的权衡,从而提供一种神经渐进式网格表示。我们在一组多样化的复杂 3D 形状上评估了该方法,结果表明其在压缩率和重建质量方面均优于基线方法。

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

  • paper_url: http://arxiv.org/abs/2308.05734
  • repo_url: https://github.com/haoheliu/AudioLDM2
  • paper_authors: Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
  • for: 这篇论文针对语音、音乐和音效等不同类型声音的生成模型设计,提出了一个让这些模型共享同一种学习方法的框架。
  • methods: 该框架使用一种称为"音频语言"(language of audio,LOA)的通用表示:任何音频都可以转换为 LOA,并用 GPT-2 模型进行自监督学习;在生成阶段,使用以 LOA 为条件的潜在扩散模型将任意模态转换为音频。
  • results: 实验结果显示,该框架能够取得最新的最优性能或与先前方法相竞争的性能,并具备上下文学习能力以及可复用的自监督预训练模型。代码和示例可在 https://audioldm.github.io/audioldm2 获取。
    Abstract Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches. Our demo and code are available at https://audioldm.github.io/audioldm2.
    摘要 尽管语音、音乐和音效等不同类型的音频生成存在共通之处,但为每种类型设计模型时,都需要仔细考虑其特定目标和偏置,而这些可能与其他类型存在很大差异。为了让我们更接近统一的音频生成视角,这篇论文提出了一个对语音、音乐和音效生成使用同一种学习方法的框架。我们的框架引入了一种称为"音频语言"(language of audio,LOA)的通用音频表示。基于自监督预训练表示学习模型 AudioMAE,任何音频都可以被转换为 LOA 表示。在生成过程中,我们使用 GPT-2 模型将任意模态转换为 LOA,并以 LOA 为条件的潜在扩散模型进行自监督音频生成学习。所提出的框架自然带来了上下文学习能力,以及可复用的自监督预训练 AudioMAE 和潜在扩散模型等优点。在文本到音频、文本到音乐和文本到语音的主要基准上的实验取得了最新的最优或具有竞争力的性能。我们的演示和代码可在 https://audioldm.github.io/audioldm2 获取。

PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers

  • paper_url: http://arxiv.org/abs/2308.05732
  • repo_url: None
  • paper_authors: Phillip Lippe, Bastiaan S. Veeling, Paris Perdikaris, Richard E. Turner, Johannes Brandstetter
  • for: 这个论文的目的是提出一种基于深度神经网络的 partial differential equation(PDE)解决方案,以提高解决PDE问题的计算效率和准确性。
  • methods: 这个论文使用了一种基于扩散模型的方法,通过多步细化过程来更好地建模 PDE 解的所有频率成分。
  • results: 论文通过在复杂的流体动力学基准上进行验证,表明 PDE-Refiner 能进行稳定且准确的长时间推演(rollout),并且可以超越现有的神经、数值和神经-数值混合模型。此外,PDE-Refiner 可以大幅提高数据效率,并能准确地评估模型的预测不确定性。
    Abstract Time-dependent partial differential equations (PDEs) are ubiquitous in science and engineering. Recently, mostly due to the high computational cost of traditional solution techniques, deep neural network based surrogates have gained increased interest. The practical utility of such neural PDE solvers relies on their ability to provide accurate, stable predictions over long time horizons, which is a notoriously hard problem. In this work, we present a large-scale analysis of common temporal rollout strategies, identifying the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the primary pitfall limiting stable, accurate rollout performance. Based on these insights, we draw inspiration from recent advances in diffusion models to introduce PDE-Refiner; a novel model class that enables more accurate modeling of all frequency components via a multistep refinement process. We validate PDE-Refiner on challenging benchmarks of complex fluid dynamics, demonstrating stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. We further demonstrate that PDE-Refiner greatly enhances data efficiency, since the denoising objective implicitly induces a novel form of spectral data augmentation. Finally, PDE-Refiner's connection to diffusion models enables an accurate and efficient assessment of the model's predictive uncertainty, allowing us to estimate when the surrogate becomes inaccurate.
    摘要 时间依赖的偏微分方程(PDE)在科学和工程中无处不在。近年来,主要由于传统求解技术的高计算成本,基于深度神经网络的代理模型受到越来越多的关注。这类神经 PDE 求解器的实用价值取决于其能否在长时间范围内给出准确、稳定的预测,而这是一个公认的难题。在本工作中,我们对常见的时间推演(rollout)策略进行了大规模分析,发现忽略非主导空间频率信息(通常对应 PDE 解中的高频成分)是限制稳定、准确推演性能的首要症结。基于这些洞察,我们从扩散模型的最新进展中汲取灵感,提出了 PDE-Refiner:一类新的模型,通过多步细化过程更准确地建模所有频率成分。我们在复杂流体动力学的高难度基准上验证了 PDE-Refiner,其推演稳定而准确,持续优于包括神经、数值及神经-数值混合架构在内的最先进模型。我们进一步证明,PDE-Refiner 能大幅提升数据效率,因为其去噪目标隐式地引入了一种新形式的谱数据增强。最后,PDE-Refiner 与扩散模型的联系使其能够准确且高效地评估模型的预测不确定性,从而估计代理模型何时变得不准确。
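
A schematic of the multistep refinement idea, with a stub network and an assumed decreasing noise schedule (training and the actual architecture are omitted):

```python
# Schematic of PDE-Refiner-style rollout: a one-step prediction is repeatedly
# refined by a network that removes noise added at decreasing amplitudes,
# which pushes it to model low- and high-frequency components alike. The stub
# network and noise schedule are illustrative assumptions.
import torch
import torch.nn as nn

class Refiner(nn.Module):
    def __init__(self, n=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * n, 128), nn.GELU(), nn.Linear(128, n))

    def forward(self, u_prev, u_noisy):
        # Predict the next state from the previous state and the noisy estimate.
        return self.net(torch.cat([u_prev, u_noisy], dim=-1))

def rollout_step(model, u_prev, noise_amplitudes=(1.0, 0.1, 0.01)):
    u = torch.zeros_like(u_prev)                  # initial coarse estimate
    for k, sigma in enumerate(noise_amplitudes):  # refine at ever finer scales
        u_noisy = u + sigma * torch.randn_like(u) if k > 0 else u
        u = model(u_prev, u_noisy)
    return u

model = Refiner()
u0 = torch.sin(torch.linspace(0, 6.28, 64)).unsqueeze(0)
u1 = rollout_step(model, u0)                      # untrained: shapes only
print(u1.shape)
```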

Rethinking Integration of Prediction and Planning in Deep Learning-Based Automated Driving Systems: A Review

  • paper_url: http://arxiv.org/abs/2308.05731
  • repo_url: None
  • paper_authors: Steffen Hagedorn, Marcel Hallgarten, Martin Stoll, Alexandru Condurache
  • for: 提高自动驾驶系统的安全性、效率和舒适性。
  • methods: 使用深度学习模型进行预测和规划,并将两者 integrate 为一个互相关联的模型。
  • results: 通过对现有模型进行系统性的审视和分析,本文讨论了不同集成方法的研究空白与未来挑战,并指出了未来研究的潜在方向。
    Abstract Automated driving has the potential to revolutionize personal, public, and freight mobility. Besides the enormous challenge of perception, i.e. accurately perceiving the environment using available sensor data, automated driving comprises planning a safe, comfortable, and efficient motion trajectory. To promote safety and progress, many works rely on modules that predict the future motion of surrounding traffic. Modular automated driving systems commonly handle prediction and planning as sequential separate tasks. While this accounts for the influence of surrounding traffic on the ego-vehicle, it fails to anticipate the reactions of traffic participants to the ego-vehicle's behavior. Recent works suggest that integrating prediction and planning in an interdependent joint step is necessary to achieve safe, efficient, and comfortable driving. While various models implement such integrated systems, a comprehensive overview and theoretical understanding of different principles are lacking. We systematically review state-of-the-art deep learning-based prediction, planning, and integrated prediction and planning models. Different facets of the integration ranging from model architecture and model design to behavioral aspects are considered and related to each other. Moreover, we discuss the implications, strengths, and limitations of different integration methods. By pointing out research gaps, describing relevant future challenges, and highlighting trends in the research field, we identify promising directions for future research.
    摘要 自动驾驶有可能彻底改变个人、公共和货运出行。除了感知(即利用可用传感器数据准确感知环境)这一巨大挑战之外,自动驾驶还包括规划一条安全、舒适且高效的运动轨迹。为促进安全和进步,许多工作依赖预测周围交通未来运动的模块。模块化的自动驾驶系统通常将预测和规划作为先后独立的任务来处理。这种方法虽然考虑了周围交通对自车的影响,却无法预判交通参与者对自车行为的反应。近期研究表明,要实现安全、高效且舒适的驾驶,必须将预测与规划整合为相互依赖的联合步骤。虽然已有多种模型实现了此类一体化系统,但目前仍缺乏对不同原理的全面综述和理论理解。我们系统地回顾了最先进的基于深度学习的预测、规划以及预测-规划一体化模型,考察了从模型架构、模型设计到行为层面的多个整合维度,并将它们相互关联。此外,我们讨论了不同整合方法的含义、优势与局限。通过指出研究空白、描述相关的未来挑战并强调该研究领域的趋势,我们指明了未来研究的有前景方向。

Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems

  • paper_url: http://arxiv.org/abs/2308.05713
  • repo_url: None
  • paper_authors: Ernest Davis, Scott Aaronson
  • for: 测试GPT-4语言模型在科学和数学领域的105个原始问题中的能力,包括高中和大学水平。
  • methods: 使用Wolfram Alpha和Code Interpreter插件。
  • results: 测试结果表明,插件可以显著提高 GPT 解决这些问题的能力,但仍经常出现"接口"失败,即 GPT 往往难以以能从插件获得有用答案的方式来表述问题。
    Abstract This report describes a test of the large language model GPT-4 with the Wolfram Alpha and the Code Interpreter plug-ins on 105 original problems in science and math, at the high school and college levels, carried out in June-August 2023. Our tests suggest that the plug-ins significantly enhance GPT's ability to solve these problems. Having said that, there are still often "interface" failures; that is, GPT often has trouble formulating problems in a way that elicits useful answers from the plug-ins. Fixing these interface failures seems like a central challenge in making GPT a reliable tool for college-level calculation problems.
    摘要 这份报告描述了在 2023 年 6 月至 8 月间,使用 Wolfram Alpha 和 Code Interpreter 插件对 GPT-4 语言模型在 105 个高中和大学水平的科学与数学原创问题上进行的测试。我们的测试表明,插件显著增强了 GPT 解决这些问题的能力。尽管如此,仍经常出现"接口"失败:即 GPT 往往难以以能从插件获得有用答案的方式来表述问题。解决这些接口失败,似乎是让 GPT 成为大学水平计算问题可靠工具的核心挑战。

Exploring the Potential of World Models for Anomaly Detection in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.05701
  • repo_url: None
  • paper_authors: Daniel Bogdoll, Lukas Bosch, Tim Joseph, Helen Gremmelmaier, Yitian Yang, J. Marius Zöllner
  • for: 本研究旨在探讨世界模型如何应用于自动驾驶系统中的异常检测。
  • methods: 本研究使用世界模型来检测自动驾驶系统中的异常。
  • results: 本研究提供了世界模型在自动驾驶系统中异常检测的概述,并将各个组件与前期异常检测研究相关联,以便进一步探讨这一领域。
    Abstract In recent years there have been remarkable advancements in autonomous driving. While autonomous vehicles demonstrate high performance in closed-set conditions, they encounter difficulties when confronted with unexpected situations. At the same time, world models emerged in the field of model-based reinforcement learning as a way to enable agents to predict the future depending on potential actions. This led to outstanding results in sparse reward and complex control tasks. This work provides an overview of how world models can be leveraged to perform anomaly detection in the domain of autonomous driving. We provide a characterization of world models and relate individual components to previous works in anomaly detection to facilitate further research in the field.
    摘要 近年来,自动驾驶取得了显著进展。尽管自动驾驶车辆在封闭集条件下表现出色,但在面对意外情况时仍会遇到困难。与此同时,世界模型在基于模型的强化学习领域兴起,使智能体能够根据潜在动作预测未来,并在稀疏奖励和复杂控制任务中取得了出色成果。本工作概述了如何利用世界模型在自动驾驶领域进行异常检测。我们对世界模型进行了刻画,并将其各个组成部分与以往的异常检测工作相关联,以促进该领域的进一步研究。

SSLRec: A Self-Supervised Learning Library for Recommendation

  • paper_url: http://arxiv.org/abs/2308.05697
  • repo_url: https://github.com/hkuds/sslrec
  • paper_authors: Xubin Ren, Lianghao Xia, Yuhao Yang, Wei Wei, Tianle Wang, Xuheng Cai, Chao Huang
  • for: This paper is written to address the lack of unified frameworks for evaluating self-supervised learning (SSL) recommendation algorithms across different domains.
  • methods: The paper introduces SSLRec, a novel benchmark platform that provides a standardized, flexible, and comprehensive framework for evaluating various SSL-enhanced recommenders. The platform features a modular architecture and a complete set of data augmentation and self-supervised toolkits.
  • results: The paper provides a comprehensive set of state-of-the-art SSL-enhanced recommendation models across different scenarios, enabling researchers to evaluate these cutting-edge models and drive further innovation in the field. The paper also simplifies the process of training and evaluating different recommendation models with consistent and fair settings.
    Abstract Self-supervised learning (SSL) has gained significant interest in recent years as a solution to address the challenges posed by sparse and noisy data in recommender systems. Despite the growing number of SSL algorithms designed to provide state-of-the-art performance in various recommendation scenarios (e.g., graph collaborative filtering, sequential recommendation, social recommendation, KG-enhanced recommendation), there is still a lack of unified frameworks that integrate recommendation algorithms across different domains. Such a framework could serve as the cornerstone for self-supervised recommendation algorithms, unifying the validation of existing methods and driving the design of new ones. To address this gap, we introduce SSLRec, a novel benchmark platform that provides a standardized, flexible, and comprehensive framework for evaluating various SSL-enhanced recommenders. The SSLRec library features a modular architecture that allows users to easily evaluate state-of-the-art models and a complete set of data augmentation and self-supervised toolkits to help create SSL recommendation models with specific needs. Furthermore, SSLRec simplifies the process of training and evaluating different recommendation models with consistent and fair settings. Our SSLRec platform covers a comprehensive set of state-of-the-art SSL-enhanced recommendation models across different scenarios, enabling researchers to evaluate these cutting-edge models and drive further innovation in the field. Our implemented SSLRec framework is available at the source code repository https://github.com/HKUDS/SSLRec.
    摘要 自我监督学习(SSL)在过去几年内受到了广泛关注,以解决推荐系统中稀缺和噪音数据的挑战。虽然有一大量的SSL算法,用于在不同领域提供状态对抗性的表现(如图像协同推荐、序列推荐、社交推荐、知识 graphs 增强推荐),但是还没有一个统一的框架,可以将推荐算法集成到不同领域。这样的框架可以作为推荐算法的基础,统一验证现有方法,并驱动新的方法的设计。为解决这个差距,我们介绍了SSLRec,一个新的测试平台,它提供了标准化、灵活、全面的评估推荐算法的框架。SSLRec 库具有可扩展的架构,allowing users to easily evaluate state-of-the-art models,并且提供了完整的数据增强和自我监督工具kit,帮助用户创建特定需求的SSL推荐模型。此外,SSLRec 简化了不同推荐模型的训练和评估过程,使得模型的评估具有共同和公正的设置。我们的SSLRec 平台覆盖了不同enario 中的 cutting-edge SSL-enhanced recommendation models, allowing researchers to evaluate these models and drive further innovation in the field。我们实现的SSLRec 框架可以在 上获取。

Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with Skeleton-Motion-Informed Gradient

  • paper_url: http://arxiv.org/abs/2308.05681
  • repo_url: https://github.com/luyg45/hardnoboxattack
  • paper_authors: Zhengzhi Lu, He Wang, Ziyi Chang, Guoan Yang, Hubert P. H. Shum
  • for: 这 paper 的目的是证明 skeleton-based 人员活动识别方法存在敏感性,并提出一种新的攻击任务,即攻击者没有访问受试者模型或训练数据或标签。
  • methods: 这 paper 使用了一种新的攻击方法,即基于动作演示的攻击方法,称为 SMI 梯度。这种攻击方法可以在不知情的情况下对skeleton-based 人员活动识别模型进行攻击。
  • results: experiments 表明,这种攻击方法可以对skeleton-based 人员活动识别模型造成实际的威胁,并且可以在不知情的情况下进行 Transfer-based 和 black-box 攻击。此外,这种攻击方法还可以提高攻击样本的可识别性和透明度。
    Abstract Recently, methods for skeleton-based human activity recognition have been shown to be vulnerable to adversarial attacks. However, these attack methods require either the full knowledge of the victim (i.e. white-box attacks), access to training data (i.e. transfer-based attacks) or frequent model queries (i.e. black-box attacks). All their requirements are highly restrictive, raising the question of how detrimental the vulnerability is. In this paper, we show that the vulnerability indeed exists. To this end, we consider a new attack task: the attacker has no access to the victim model or the training data or labels, where we coin the term hard no-box attack. Specifically, we first learn a motion manifold where we define an adversarial loss to compute a new gradient for the attack, named skeleton-motion-informed (SMI) gradient. Our gradient contains information of the motion dynamics, which is different from existing gradient-based attack methods that compute the loss gradient assuming each dimension in the data is independent. The SMI gradient can augment many gradient-based attack methods, leading to a new family of no-box attack methods. Extensive evaluation and comparison show that our method imposes a real threat to existing classifiers. They also show that the SMI gradient improves the transferability and imperceptibility of adversarial samples in both no-box and transfer-based black-box settings.
    摘要 近期,基于骨架的人体活动识别方法被发现容易受到敌意攻击。然而,这些攻击方法具有限制性的要求,包括受害者(white-box攻击)、训练数据(transfer-based攻击)或模型查询频繁(black-box攻击)。这些要求都是非常困难,这引发了对攻击性能的评估。在这篇论文中,我们表明了这种漏洞的存在。为此,我们倡议了一个新的攻击任务:攻击者无法访问受害者模型或训练数据或标签。我们称之为“硬无框攻击”。我们首先学习了一个动作拟合空间,并定义了一种对抗损失函数来计算一个新的攻击方向,称之为“骨动作信息”(SMI)损失函数。我们的损失函数含有动作动力信息,与现有的梯度基于攻击方法不同。SMI损失函数可以增强许多梯度基于攻击方法,导致一个新的无框攻击家族。我们进行了广泛的评估和比较,并证明了我们的方法对现有分类器具有真正的威胁。此外,我们还证明了SMI损失函数可以提高黑框和转移黑框无框攻击的可读性和隐蔽性。

Exploring Deep Learning Approaches to Predict Person and Vehicle Trips: An Analysis of NHTS Data

  • paper_url: http://arxiv.org/abs/2308.05665
  • repo_url: None
  • paper_authors: Kojo Adu-Gyamfi, Sharma Anuj
  • for: 这项研究旨在探讨深度学习技术在出行预测中的潜在应用,以提高交通规划的准确性和可靠性。
  • methods: 研究使用全国家庭出行调查(NHTS)数据集,开发并训练了一个用于预测个人出行与车辆出行的深度学习模型。该模型利用 NHTS 数据中的海量信息,捕捉传统模型所忽略的复杂非线性关系。
  • results: 该深度学习模型在个人出行预测上达到 98% 的准确率,在车辆出行估计上达到 96% 的准确率,相比传统交通规划模型有显著提升,表明深度学习在该领域的应用价值。
    Abstract Modern transportation planning relies heavily on accurate predictions of person and vehicle trips. However, traditional planning models often fail to account for the intricacies and dynamics of travel behavior, leading to less-than-optimal accuracy in these predictions. This study explores the potential of deep learning techniques to transform the way we approach trip predictions, and ultimately, transportation planning. Utilizing a comprehensive dataset from the National Household Travel Survey (NHTS), we developed and trained a deep learning model for predicting person and vehicle trips. The proposed model leverages the vast amount of information in the NHTS data, capturing complex, non-linear relationships that were previously overlooked by traditional models. As a result, our deep learning model achieved an impressive accuracy of 98% for person trip prediction and 96% for vehicle trip estimation. This represents a significant improvement over the performances of traditional transportation planning models, thereby demonstrating the power of deep learning in this domain. The implications of this study extend beyond just more accurate predictions. By enhancing the accuracy and reliability of trip prediction models, planners can formulate more effective, data-driven transportation policies, infrastructure, and services. As such, our research underscores the need for the transportation planning field to embrace advanced techniques like deep learning. The detailed methodology, along with a thorough discussion of the results and their implications, are presented in the subsequent sections of this paper.
    摘要 现代交通规划高度依赖对个人出行和车辆出行的准确预测。然而,传统的规划模型往往无法考虑出行行为的细节和动态特征,导致预测准确性欠佳。本研究探讨了深度学习技术改变出行预测乃至交通规划方式的潜力。我们使用全国家庭出行调查(NHTS)数据集,开发并训练了一个用于预测个人出行与车辆出行的深度学习模型。我们提出的模型利用 NHTS 数据中的海量信息,捕捉了以往被传统模型忽略的复杂非线性关系。因此,我们的深度学习模型在个人出行预测中达到了 98% 的准确率,在车辆出行估计中达到了 96% 的准确率,相比传统交通规划模型有显著提升,展现了深度学习在该领域的威力。本研究的意义不止于更准确的预测:通过提高出行预测模型的准确性和可靠性,规划者可以制定更有效的、数据驱动的交通政策、基础设施和服务。因此,我们的研究强调交通规划领域应当拥抱深度学习等先进技术。详细的方法、结果及其意义的深入讨论见本文后续章节。
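
For illustration, a small feed-forward regressor on synthetic stand-ins for NHTS-style features; the paper's reported accuracies come from its own model and data, not this sketch.

```python
# Toy trip-prediction setup: a feed-forward network regressing daily person
# trips from survey-style features. Features and coefficients are synthetic
# placeholders for NHTS variables.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
X = np.column_stack([
    rng.integers(1, 7, n),        # household size
    rng.integers(0, 4, n),        # vehicles owned
    rng.integers(16, 90, n),      # age of respondent
    rng.integers(0, 2, n),        # urban (1) vs rural (0)
])
y = 0.9 * X[:, 0] + 1.4 * X[:, 1] + 0.5 * X[:, 3] + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
model.fit(X_tr, y_tr)
print(f"R^2 on held-out households: {model.score(X_te, y_te):.3f}")
```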

cs.CL - 2023-08-11

Weakly Supervised Text Classification on Free Text Comments in Patient-Reported Outcome Measures

  • paper_url: http://arxiv.org/abs/2308.06199
  • repo_url: None
  • paper_authors: Anna-Grace Linton, Vania Dimitrova, Amy Downing, Richard Wagland, Adam Glaser
  • for: 这个论文是为了分析患有Rectal cancer的病人的自报症状数据中的FTC数据,以提高健康质量生活(HRQoL)的评估。
  • methods: 这个论文使用了五种weakly supervised text classification(WSTC)技术来分类医疗领域特定的FTC数据,以提取报告的健康相关质量生活(HRQoL)主题。
  • results: 研究发现,使用WSTC技术可以在医疗领域特定的FTC数据中提取健康相关质量生活(HRQoL)主题,但是模型精度和主题之间存在差异。
    Abstract Free text comments (FTC) in patient-reported outcome measures (PROMs) data are typically analysed using manual methods, such as content analysis, which is labour-intensive and time-consuming. Machine learning analysis methods are largely unsupervised, necessitating post-analysis interpretation. Weakly supervised text classification (WSTC) can be a valuable method of analysis to classify domain-specific text data in which there is limited labelled data. In this paper, we apply five WSTC techniques to FTC in PROMs data to identify health-related quality of life (HRQoL) themes reported by colorectal cancer patients. The WSTC methods label all the themes mentioned in the FTC. The results showed moderate performance on the PROMs data, mainly due to the precision of the models, and variation between themes. Evaluation of the classification performance illustrated the potential and limitations of keyword based WSTC to label PROMs FTC when labelled data is limited.
    摘要 患者自报结局量表(PROMs)数据中的自由文本评论(FTC)通常采用内容分析等人工方法进行分析,费时费力。机器学习分析方法大多是无监督的,需要在分析后进行解读。在标注数据有限的情况下,弱监督文本分类(WSTC)是对领域特定文本数据进行分类的一种有价值的分析方法。在这篇论文中,我们将五种 WSTC 技术应用于 PROMs 数据中的 FTC,以识别结直肠癌患者所报告的健康相关生活质量(HRQoL)主题。WSTC 方法对 FTC 中提及的所有主题进行标注。结果显示,这些方法在 PROMs 数据上的表现中等,主要受限于模型的精确率,且不同主题之间存在差异。对分类性能的评估展示了在标注数据有限时,基于关键词的 WSTC 标注 PROMs FTC 的潜力与局限。
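
A minimal keyword-based weak labelling function of the kind evaluated here, with an illustrative seed lexicon:

```python
# Keyword-based weak labelling: each HRQoL theme is seeded with indicative
# keywords, and a free-text comment receives every theme whose keywords it
# mentions. The seed lexicon below is illustrative only.
THEME_KEYWORDS = {
    "bowel_function": {"bowel", "stoma", "toilet", "diarrhoea"},
    "fatigue": {"tired", "fatigue", "exhausted", "energy"},
    "emotional_wellbeing": {"anxious", "worry", "depressed", "mood"},
}

def weak_label(comment: str) -> list[str]:
    tokens = set(comment.lower().split())
    return [theme for theme, kws in THEME_KEYWORDS.items() if tokens & kws]

print(weak_label("I feel tired all the time and worry about my stoma"))
# -> ['bowel_function', 'fatigue', 'emotional_wellbeing']
```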

Assessing Guest Nationality Composition from Hotel Reviews

  • paper_url: http://arxiv.org/abs/2308.06175
  • repo_url: None
  • paper_authors: Fabian Gröger, Marc Pouly, Flavia Tinner, Leif Brandes
  • for: 这个论文是为了研究如何使用机器学习来监测和评估具体客户来源的酒店业务竞争力。
  • methods: 该论文使用预训练的词向量和堆叠 LSTM 层从文本评论中提取客人国籍信息,以动态评估和监测具体客源构成的变化。
  • results: 研究发现,相比更复杂的最先进语言模型,较简单的架构能提供更好的性能与运行时间权衡。
    Abstract Many hotels target guest acquisition efforts to specific markets in order to best anticipate individual preferences and needs of their guests. Likewise, such strategic positioning is a prerequisite for efficient marketing budget allocation. Official statistics report on the number of visitors from different countries, but no fine-grained information on the guest composition of individual businesses exists. There is, however, growing interest in such data from competitors, suppliers, researchers and the general public. We demonstrate how machine learning can be leveraged to extract references to guest nationalities from unstructured text reviews in order to dynamically assess and monitor the dynamics of guest composition of individual businesses. In particular, we show that a rather simple architecture of pre-trained embeddings and stacked LSTM layers provides a better performance-runtime tradeoff than more complex state-of-the-art language models.
    摘要 许多酒店将获客工作瞄准特定市场,以便更好地预判客人的个性化偏好与需求。同样,这种战略定位也是高效分配营销预算的前提。官方统计报告了来自不同国家的游客数量,但不存在关于单个酒店客源构成的细粒度信息。然而,竞争对手、供应商、研究者和公众对此类数据的兴趣日益增长。我们展示了如何利用机器学习从非结构化文本评论中提取对客人国籍的提及,从而动态评估和监测单个酒店客源构成的变化。特别地,我们表明,由预训练词向量与堆叠 LSTM 层组成的相对简单的架构,比更复杂的最先进语言模型提供了更好的性能-运行时间权衡。
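
A sketch of the favoured architecture: pre-trained (here frozen, randomly initialized as a stand-in) embeddings feeding stacked LSTM layers and a nationality classifier; all dimensions are placeholders.

```python
# Sketch: frozen pre-trained word embeddings -> stacked LSTM -> nationality
# classifier. The random embedding matrix stands in for real pre-trained
# vectors; vocabulary size and class count are placeholders.
import torch
import torch.nn as nn

class NationalityClassifier(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden=128, n_classes=50):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)     # load pre-trained weights here
        self.emb.weight.requires_grad = False            # keep embeddings frozen
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):
        out, _ = self.lstm(self.emb(token_ids))
        return self.head(out[:, -1])                     # classify from final state

model = NationalityClassifier()
fake_review = torch.randint(0, 20000, (1, 40))           # 40 token ids
print(model(fake_review).shape)                          # (1, 50) class logits
```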

Task Conditioned BERT for Joint Intent Detection and Slot-filling

  • paper_url: http://arxiv.org/abs/2308.06165
  • repo_url: None
  • paper_authors: Diogo Tavares, Pedro Azevedo, David Semedo, Ricardo Sousa, João Magalhães
  • for: 本研究旨在解决对话系统中的不可预测用户意图和多个插槽的多样性问题,以实现对话状态的跟踪和用户喜好的理解。
  • methods: 该研究提出了一种基于 Transformer 编码器的原则性模型,在多个任务上训练,并通过丰富的输入使模型以目标推理为条件。
  • results: 实验结果表明,让模型以更多的对话推理任务为条件,可以提升 MultiWOZ 数据集上的联合意图与插槽检测:以意图、插槽以及两者同时为条件,分别带来 3.2%、10.8% 和 14.4% 的提升。此外,在与 Farfetch 顾客的真实对话中,所提出的条件化 BERT 也能在对话全程保持较高的联合目标与意图检测性能。
    Abstract Dialogue systems need to deal with the unpredictability of user intents to track dialogue state and the heterogeneity of slots to understand user preferences. In this paper we investigate the hypothesis that solving these challenges as one unified model will allow the transfer of parameter support data across the different tasks. The proposed principled model is based on a Transformer encoder, trained on multiple tasks, and leveraged by a rich input that conditions the model on the target inferences. Conditioning the Transformer encoder on multiple target inferences over the same corpus, i.e., intent and multiple slot types, allows learning richer language interactions than a single-task model would be able to. In fact, experimental results demonstrate that conditioning the model on an increasing number of dialogue inference tasks leads to improved results: on the MultiWOZ dataset, the joint intent and slot detection can be improved by 3.2\% by conditioning on intent, 10.8\% by conditioning on slot and 14.4\% by conditioning on both intent and slots. Moreover, on real conversations with Farfetch costumers, the proposed conditioned BERT can achieve high joint-goal and intent detection performance throughout a dialogue.
    摘要 对话系统需要应对用户意图的不可预测性以跟踪对话状态,并应对插槽的异质性以理解用户偏好。在这篇论文中,我们研究这样一个假设:将这些挑战作为一个统一模型来解决,可以让参数支持数据在不同任务之间迁移。我们提出的模型基于 Transformer 编码器,在多个任务上训练,并通过丰富的输入使模型以目标推理为条件。让 Transformer 编码器在同一语料上以多个目标推理(即意图和多种插槽类型)为条件,能够学习到比单任务模型更丰富的语言交互。实验结果也证明了这一点:随着作为条件的对话推理任务数量增加,结果不断改善——在 MultiWOZ 数据集上,以意图为条件可将联合意图与插槽检测提升 3.2%,以插槽为条件可提升 10.8%,同时以意图和插槽为条件可提升 14.4%。此外,在与 Farfetch 顾客的真实对话中,所提出的条件化 BERT 能在整个对话过程中取得较高的联合目标与意图检测性能。
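
One plausible reading of the conditioning mechanism: prepend a task token naming the target inference to a shared encoder's input. The special tokens below are assumptions, not the paper's exact scheme.

```python
# Sketch: a shared BERT encoder conditioned on the target inference by a
# prepended task token. Tokens "[INTENT]"/"[SLOT:...]" are hypothetical.
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.add_special_tokens({"additional_special_tokens": ["[INTENT]", "[SLOT:area]", "[SLOT:food]"]})
model = AutoModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tok))

utterance = "book a cheap italian restaurant in the centre"
for task in ["[INTENT]", "[SLOT:area]", "[SLOT:food]"]:
    enc = tok(f"{task} {utterance}", return_tensors="pt")
    with torch.no_grad():
        h = model(**enc).last_hidden_state[:, 0]   # [CLS] vector, conditioned on the task
    print(task, h.shape)                           # one shared encoder, three inferences
```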

Identification of the Relevance of Comments in Codes Using Bag of Words and Transformer Based Models

  • paper_url: http://arxiv.org/abs/2308.06144
  • repo_url: https://github.com/sruthisudheer/comment-classification-of-c-code
  • paper_authors: Sruthi S, Tanmay Basu
  • for: 本研究的目的是分类代码段落的注释是否相关。
  • methods: 本研究使用了不同的特征工程方案和文本分类技术,包括经典的袋子字符模型和基于转换器的模型。
  • results: 研究发现,使用袋子字符模型在训练集上表现最佳,但模型在训练和测试集上的表现并不理想。
    Abstract The Forum for Information Retrieval (FIRE) started a shared task this year for classification of comments of different code segments. This is binary text classification task where the objective is to identify whether comments given for certain code segments are relevant or not. The BioNLP-IISERB group at the Indian Institute of Science Education and Research Bhopal (IISERB) participated in this task and submitted five runs for five different models. The paper presents the overview of the models and other significant findings on the training corpus. The methods involve different feature engineering schemes and text classification techniques. The performance of the classical bag of words model and transformer-based models were explored to identify significant features from the given training corpus. We have explored different classifiers viz., random forest, support vector machine and logistic regression using the bag of words model. Furthermore, the pre-trained transformer based models like BERT, RoBERT and ALBERT were also used by fine-tuning them on the given training corpus. The performance of different such models over the training corpus were reported and the best five models were implemented on the given test corpus. The empirical results show that the bag of words model outperforms the transformer based models, however, the performance of our runs are not reasonably well in both training and test corpus. This paper also addresses the limitations of the models and scope for further improvement.
    摘要 信息检索论坛(FIRE)今年发起了一项针对代码片段注释分类的共享任务。这是一个二分类文本分类任务,目标是判断给定代码片段的注释是否相关。印度科学教育与研究学院博帕尔分院(IISERB)的 BioNLP-IISERB 小组参加了该任务,并提交了五个模型的五次运行。本文概述了这些模型及在训练语料上的其他重要发现。方法涉及不同的特征工程方案和文本分类技术。我们探索了经典词袋模型和基于转换器的模型在给定训练语料上识别显著特征的表现。基于词袋模型,我们尝试了随机森林、支持向量机和逻辑回归等分类器;此外,我们还在给定训练语料上微调了 BERT、RoBERTa 和 ALBERT 等预训练的基于转换器的模型。我们报告了这些模型在训练语料上的表现,并将最佳的五个模型应用于测试语料。实验结果表明,词袋模型优于基于转换器的模型,但我们提交的运行在训练和测试语料上的表现都不够理想。本文还讨论了这些模型的局限性以及进一步改进的空间。
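
The classical bag-of-words baseline is easy to reproduce as a scikit-learn pipeline (TF-IDF into logistic regression; the other classifiers swap in the same way):

```python
# Bag-of-words baseline for comment-relevance classification: TF-IDF features
# into logistic regression. The two toy examples stand in for the FIRE corpus.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

comments = [
    "increments the retry counter before re-establishing the connection",  # relevant
    "TODO remove this later",                                              # not relevant
]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, labels)
print(clf.predict(["counts retries and reconnects on failure"]))
```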

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

  • paper_url: http://arxiv.org/abs/2308.06112
  • repo_url: None
  • paper_authors: Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Haithem Boussaid, Ebtessam Almazrouei, Merouane Debbah
  • for: 本文目的是提出一种简单的方法,以便在视频序列中进行语音识别。
  • methods: 该方法基于学习一个先验模型,将唇部序列经视觉语音编码器得到的潜在表示映射到对应音频的潜在表示,然后使用现成的自动语音识别(ASR)模型将生成的音频表示解码为文本。
  • results: 该方法在 LRS3 数据集上优于先前方法,达到 26 的词错误率(WER);与 SOTA 方法不同,该模型在 VoxCeleb 测试集上仍保持合理的性能。
    Abstract Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degeneration under out-of-distribution challenging scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to their corresponding latents from the audio pair, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset achieving 26 WER. Unlike SoTA approaches, our model keeps a reasonable performance on the VoxCeleb test set. We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.

Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling

  • paper_url: http://arxiv.org/abs/2308.06077
  • repo_url: None
  • paper_authors: Marija Šakota, Maxime Peyrard, Robert West
  • for: This work proposes a framework for Cost-Effective Language Model Choice (CELMOC), which selects an appropriate language model (LM) per input to achieve high overall performance at low cost.
  • methods: Given candidate LMs of very different sizes and costs, a so-called meta-model predicts which LM will do well on each input; CELMOC then assigns inputs accordingly, with a user-tunable cost-performance trade-off.
  • results: Across 14 datasets covering five natural language tasks and four candidate LMs, the framework matches the performance of the largest available LM while cutting cost by 63%, letting researchers and practitioners save substantial money without sacrificing performance.
    Abstract Generative language models (LMs) have become omnipresent across data science. For a wide variety of tasks, inputs can be phrased as natural language prompts for an LM, from whose output the solution can then be extracted. LM performance has consistently been increasing with model size - but so has the monetary cost of querying the ever larger models. Importantly, however, not all inputs are equally hard: some require larger LMs for obtaining a satisfactory solution, whereas for others smaller LMs suffice. Based on this fact, we design a framework for Cost-Effective Language Model Choice (CELMOC). Given a set of inputs and a set of candidate LMs, CELMOC judiciously assigns each input to an LM predicted to do well on the input according to a so-called meta-model, aiming to achieve high overall performance at low cost. The cost-performance trade-off can be flexibly tuned by the user. Options include, among others, maximizing total expected performance (or the number of processed inputs) while staying within a given cost budget, or minimizing total cost while processing all inputs. We evaluate CELMOC on 14 datasets covering five natural language tasks, using four candidate LMs of vastly different size and cost. With CELMOC, we match the performance of the largest available LM while achieving a cost reduction of 63%. Via our publicly available library, researchers as well as practitioners can thus save large amounts of money without sacrificing performance.
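The hypothetical sketch below illustrates the routing idea under simple assumptions: a meta-model scores each (input, LM) pair and the cheapest LM predicted to succeed is chosen. The model names, costs, threshold rule, and toy meta-model are all invented for illustration; CELMOC's actual assignment objectives are richer (e.g., budget-constrained maximization).

```python
def route(inputs, predict_success, lms, threshold=0.8):
    """lms: list of (name, cost_per_query) sorted from cheapest to priciest."""
    assignments, total_cost = [], 0.0
    for x in inputs:
        chosen = lms[-1]  # fall back to the largest model
        for name, cost in lms:
            if predict_success(x, name) >= threshold:  # meta-model's estimate
                chosen = (name, cost)
                break
        assignments.append((x, chosen[0]))
        total_cost += chosen[1]
    return assignments, total_cost

lms = [("small", 0.001), ("medium", 0.01), ("large", 0.1)]
# Toy meta-model: short inputs are assumed easy; the largest LM always succeeds.
fake_meta = lambda x, name: 0.9 if (len(x) < 40 or name == "large") else 0.5

plan, cost = route(
    ["short prompt", "a long, intricate, multi-step reasoning prompt about tax law"],
    fake_meta, lms)
print(plan, cost)  # -> small model for the easy input, large for the hard one
```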

A Case Study on Context Encoding in Multi-Encoder based Document-Level Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2308.06063
  • repo_url: None
  • paper_authors: Ramakrishna Appicharla, Baban Gain, Santanu Pal, Asif Ekbal
  • for: The authors investigate how multi-encoder models behave under different context settings, to better understand context encoding in document-level translation.
  • methods: Multi-encoder models are trained on three context settings (the previous two sentences, two random sentences, and a mix of both) and evaluated on context-aware pronoun translation.
  • results: The model can perform well on the ContraPro test set even when the context is random, and mixing the selected context with random context is generally better than the other settings.
    Abstract Recent studies have shown that the multi-encoder models are agnostic to the choice of context, and the context encoder generates noise which helps improve the models in terms of BLEU score. In this paper, we further explore this idea by evaluating with context-aware pronoun translation test set by training multi-encoder models trained on three different context settings viz, previous two sentences, random two sentences, and a mix of both as context. Specifically, we evaluate the models on the ContraPro test set to study how different contexts affect pronoun translation accuracy. The results show that the model can perform well on the ContraPro test set even when the context is random. We also analyze the source representations to study whether the context encoder generates noise. Our analysis shows that the context encoder provides sufficient information to learn discourse-level information. Additionally, we observe that mixing the selected context (the previous two sentences in this case) and the random context is generally better than the other settings.

Evaluating Picture Description Speech for Dementia Detection using Image-text Alignment

  • paper_url: http://arxiv.org/abs/2308.07933
  • repo_url: None
  • paper_authors: Youxiang Zhu, Nana Lin, Xiaohui Liang, John A. Batsis, Robert M. Roth, Brian MacWhinney
  • for: Detecting dementia from picture-description speech.
  • methods: Samples are pre-processed using the relevance of the description sentences to the picture and the picture's focused areas, leveraging large pre-trained image-text alignment models.
  • results: The proposed models, the first to take both the picture and the description text as inputs, achieve state-of-the-art performance with a detection accuracy of 83.44%, higher than the 79.91% of the text-only baseline.
    Abstract Using picture description speech for dementia detection has been studied for 30 years. Despite the long history, previous models focus on identifying the differences in speech patterns between healthy subjects and patients with dementia but do not utilize the picture information directly. In this paper, we propose the first dementia detection models that take both the picture and the description texts as inputs and incorporate knowledge from large pre-trained image-text alignment models. We observe the difference between dementia and healthy samples in terms of the text's relevance to the picture and the focused area of the picture. We thus consider such a difference could be used to enhance dementia detection accuracy. Specifically, we use the text's relevance to the picture to rank and filter the sentences of the samples. We also identified focused areas of the picture as topics and categorized the sentences according to the focused areas. We propose three advanced models that pre-processed the samples based on their relevance to the picture, sub-image, and focused areas. The evaluation results show that our advanced models, with knowledge of the picture and large image-text alignment models, achieve state-of-the-art performance with the best detection accuracy at 83.44%, which is higher than the text-only baseline model at 79.91%. Lastly, we visualize the sample and picture results to explain the advantages of our models.
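A minimal sketch of the relevance-ranking step, assuming an off-the-shelf CLIP model from the transformers library as the image-text alignment backbone; the paper's exact model choice, filtering rules, and the example sentences below are assumptions, not its reported setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # stand-in for the picture being described
sentences = ["The boy is taking cookies from the jar.",
             "I had eggs for breakfast this morning."]

inputs = processor(text=sentences, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image[0]  # one relevance score per sentence

ranked = sorted(zip(sentences, sims.tolist()), key=lambda p: -p[1])
print(ranked)  # sentences ranked by relevance to the picture
```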

PIPPA: A Partially Synthetic Conversational Dataset

  • paper_url: http://arxiv.org/abs/2308.05884
  • repo_url: None
  • paper_authors: Tear Gosling, Alpin Dale, Yinhe Zheng
  • for: This work provides a conversational and role-play dataset to support the development of AI systems for role-play scenarios.
  • methods: The dataset was built through a community-driven crowdsourcing effort involving a group of role-play enthusiasts, yielding over 1 million utterances across 26,000 conversation sessions.
  • results: The resulting partially synthetic dataset, named PIPPA, offers a rich resource for researchers and AI developers exploring conversational AI in role-play settings.
    Abstract With the emergence of increasingly powerful large language models, there is a burgeoning interest in leveraging these models for casual conversation and role-play applications. However, existing conversational and role-playing datasets often fail to capture the diverse and nuanced interactions typically exhibited by real-world role-play participants. To address this limitation and contribute to the rapidly growing field, we introduce a partially-synthetic dataset named PIPPA (Personal Interaction Pairs between People and AI). PIPPA is a result of a community-driven crowdsourcing effort involving a group of role-play enthusiasts. The dataset comprises over 1 million utterances that are distributed across 26,000 conversation sessions and provides a rich resource for researchers and AI developers to explore and refine conversational AI systems in the context of role-play scenarios.

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

  • paper_url: http://arxiv.org/abs/2308.05725
  • repo_url: None
  • paper_authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux
  • for: This paper studies high-quality expressive methods and datasets for textless speech synthesis.
  • methods: Speech is resynthesized from low-bitrate discrete units learned in a self-supervised fashion, with 26 spontaneous expressive styles used to render diverse expression.
  • results: The paper introduces a high-quality expressive speech dataset comprising both read speech and improvised dialogues, together with an expressive resynthesis benchmark for evaluating different self-supervised discrete encoders.
    Abstract Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. All the dataset, evaluation metrics and baseline models are open source

A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment

  • paper_url: http://arxiv.org/abs/2308.05696
  • repo_url: https://github.com/alibabaresearch/damo-convai
  • paper_authors: Yingxiu Zhao, Bowen Yu, Binyuan Hui, Haiyang Yu, Fei Huang, Yongbin Li, Nevin L. Zhang
  • for: Improving the training of large language models (LLMs) on open-domain instruction data so that they align better with end tasks and user preferences.
  • methods: A method called tree-instruct enhances performance by controlling the complexity of instruction data: it adds a specified number of nodes to an instruction's semantic tree to generate new instruction data, with the number of added nodes controlling the difficulty level.
  • results: Increasing complexity consistently improves performance; for example, with 1,000 instruction samples and 10 added nodes the win rate rises by 24%. Under the same token budget, a few complex instructions outperform diverse but simple ones, and curriculum instruction tuning may not yield the anticipated results, as increasing complexity appears to be the key.
    Abstract Training large language models (LLMs) with open-domain instruction data has yielded remarkable success in aligning to end tasks and user preferences. Extensive research has highlighted that enhancing the quality and diversity of instruction data consistently improves performance. However, the impact of data complexity, as a crucial metric, remains relatively unexplored in three aspects: (1) scaling law, where the sustainability of performance improvements with increasing complexity is uncertain, (2) additional tokens, whether the improvement brought by complexity comes from introducing more training tokens, and (3) curriculum tuning, where the potential advantages of incorporating instructions ranging from easy to difficult are not yet fully understood. In this paper, we propose \textit{tree-instruct} to systematically enhance the complexity of instruction data in a controllable manner. This approach adds a specified number of nodes into the instruction semantic tree, yielding new instruction data based on the modified tree. By adjusting the number of added nodes, we can control the difficulty level in the modified instruction data. Our preliminary experiments reveal the following insights: (1) Increasing complexity consistently leads to sustained performance improvements. For instance, using 1,000 instruction data and 10 nodes resulted in a substantial 24\% increase in win rate. (2) Under the same token budget, a few complex instructions outperform diverse yet simple instructions. (3) Curriculum instruction tuning might not yield the anticipated results; focusing on increasing complexity appears to be the key.

Finding Already Debunked Narratives via Multistage Retrieval: Enabling Cross-Lingual, Cross-Dataset and Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2308.05680
  • repo_url: None
  • paper_authors: Iknoor Singh, Carolina Scarton, Xingyi Song, Kalina Bontcheva
  • for: This paper aims to detect stories that have already been debunked to reduce the manual efforts of professional fact-checkers and slow the spread of misinformation.
  • methods: The paper creates a novel dataset for cross-lingual retrieval of already debunked narratives using tweets as queries to a database of fact-checking articles. It also presents an extensive experiment to benchmark fine-tuned and off-the-shelf multilingual pre-trained Transformer models for this task.
  • results: The results show that the task of cross-lingual retrieval of already debunked narratives is challenging, and off-the-shelf Transformer models fail to outperform a strong lexical-based baseline (BM25). However, the paper’s multistage retrieval framework is robust and outperforms BM25 in most scenarios, enabling cross-domain and zero-shot learning without significantly harming the model’s performance.
    Abstract The task of retrieving already debunked narratives aims to detect stories that have already been fact-checked. The successful detection of claims that have already been debunked not only reduces the manual efforts of professional fact-checkers but can also contribute to slowing the spread of misinformation. Mainly due to the lack of readily available data, this is an understudied problem, particularly when considering the cross-lingual task, i.e. the retrieval of fact-checking articles in a language different from the language of the online post being checked. This paper fills this gap by (i) creating a novel dataset to enable research on cross-lingual retrieval of already debunked narratives, using tweets as queries to a database of fact-checking articles; (ii) presenting an extensive experiment to benchmark fine-tuned and off-the-shelf multilingual pre-trained Transformer models for this task; and (iii) proposing a novel multistage framework that divides this cross-lingual debunk retrieval task into refinement and re-ranking stages. Results show that the task of cross-lingual retrieval of already debunked narratives is challenging and off-the-shelf Transformer models fail to outperform a strong lexical-based baseline (BM25). Nevertheless, our multistage retrieval framework is robust, outperforming BM25 in most scenarios and enabling cross-domain and zero-shot learning, without significantly harming the model's performance.
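The sketch below shows a generic two-stage retrieve-then-re-rank pipeline of the kind benchmarked here, assuming the rank_bm25 and sentence-transformers libraries; the re-ranker checkpoint named is a common off-the-shelf choice, not necessarily the multilingual model the paper fine-tunes, and the toy documents are invented.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

fact_checks = ["Claim X about vaccines was debunked because ...",
               "Claim Y about the election is false since ..."]
tokenized = [doc.lower().split() for doc in fact_checks]
bm25 = BM25Okapi(tokenized)

query = "viral tweet repeating claim X about vaccines"

# Stage 1 (refinement): cheap lexical retrieval narrows the candidate pool.
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(fact_checks)), key=lambda i: -scores[i])[:10]

# Stage 2 (re-ranking): a cross-encoder rescores each query-candidate pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(query, fact_checks[i]) for i in candidates])
best_idx, _ = max(zip(candidates, pair_scores), key=lambda p: p[1])
print(fact_checks[best_idx])
```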

cs.LG - 2023-08-11

Towards a Causal Probabilistic Framework for Prediction, Action-Selection & Explanations for Robot Block-Stacking Tasks

  • paper_url: http://arxiv.org/abs/2308.06203
  • repo_url: None
  • paper_authors: Ricardo Cannizzaro, Jonathan Routley, Lars Kunze
  • for: This work aims to provide a causal probabilistic framework for autonomous robot block-stacking tasks, allowing a robot to perceive, reason about, and explain the current state of the task and to select the next-best action.
  • methods: Causal inference is combined with a physics simulation capability embedded in a structural causal model, so the robot can reason about the causal relationships governing the stacking task and choose the next-best action from placement candidates, as well as generate post-hoc counterfactual explanations.
  • results: The paper proposes this novel framework, presents exemplar next-best action selection results, and outlines planned experiments in simulated and real-world robot block-stacking tasks.
    Abstract Uncertainties in the real world mean that is impossible for system designers to anticipate and explicitly design for all scenarios that a robot might encounter. Thus, robots designed like this are fragile and fail outside of highly-controlled environments. Causal models provide a principled framework to encode formal knowledge of the causal relationships that govern the robot's interaction with its environment, in addition to probabilistic representations of noise and uncertainty typically encountered by real-world robots. Combined with causal inference, these models permit an autonomous agent to understand, reason about, and explain its environment. In this work, we focus on the problem of a robot block-stacking task due to the fundamental perception and manipulation capabilities it demonstrates, required by many applications including warehouse logistics and domestic human support robotics. We propose a novel causal probabilistic framework to embed a physics simulation capability into a structural causal model to permit robots to perceive and assess the current state of a block-stacking task, reason about the next-best action from placement candidates, and generate post-hoc counterfactual explanations. We provide exemplar next-best action selection results and outline planned experimentation in simulated and real-world robot block-stacking tasks.
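As a toy illustration only (not the paper's framework), the sketch below embeds a crude stability criterion inside a Monte Carlo estimate of P(stable | placement), mimicking simulation-in-the-loop next-best-action selection; all geometry and noise parameters are invented, whereas the paper uses a full physics simulator inside a structural causal model.

```python
import random

def p_stable(offset, block_width=4.0, noise_sd=0.5, n=2000):
    """Estimate P(stack stays stable | intended lateral offset of the new block)."""
    stable = 0
    for _ in range(n):
        actual = offset + random.gauss(0.0, noise_sd)  # exogenous actuation noise
        if abs(actual) < block_width / 2:              # centre of mass over support
            stable += 1
    return stable / n

candidates = [0.0, 1.0, 1.8]                           # candidate placements
best = max(candidates, key=p_stable)
print({c: round(p_stable(c), 3) for c in candidates}, "-> place at", best)
```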

Exploring Predicate Visual Context in Detecting of Human-Object Interactions

  • paper_url: http://arxiv.org/abs/2308.06202
  • repo_url: https://github.com/fredzzhang/pvic
  • paper_authors: Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Gould
  • for: This work examines the DETR framework for human-object interaction (HOI) detection, in particular the performant two-stage transformer-based detectors.
  • methods: Through visualizations and carefully designed experiments, the authors study how best to re-introduce image features via cross-attention, improving the query design, extensively exploring keys and values, and using box pair positional embeddings as spatial guidance in the predicate visual context (PViC) model.
  • results: On the HICO-DET and V-COCO benchmarks, the PViC model outperforms state-of-the-art methods while maintaining low training cost.
    Abstract Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

  • paper_url: http://arxiv.org/abs/2308.06197
  • repo_url: https://github.com/angusmaiden/complex-fer
  • paper_authors: Angus Maiden, Bahareh Nakisa
  • for: This work proposes a novel method inspired by human cognition and learning that accurately recognizes compound expression classes from few training samples.
  • methods: The method combines continual learning with few-shot learning and is the first to apply few-shot learning to complex facial expression recognition; knowledge distillation and a novel Predictive Sorting Memory Replay allow new expression classes to be learned quickly while retaining knowledge of the basic expression classes.
  • results: The method achieves 74.28% overall accuracy on new classes, the state of the art in continual learning for complex facial expression recognition and 13.95% above non-continual learning methods; in the few-shot setting it reaches 100% accuracy using a single training sample per expression class.
    Abstract Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in this domain as a human, it may need to synthesise knowledge and understand new concepts in real-time as humans do. Humans are able to learn new concepts using only few examples, by distilling the important information from memories and discarding the rest. Similarly, continual learning methods learn new classes whilst retaining the knowledge of known classes, whilst few-shot learning methods are able to learn new classes using very few training examples. We propose a novel continual learning method inspired by human cognition and learning that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. Using GradCAM visualisations, we demonstrate the relationship between basic and compound facial expressions, which our method leverages through knowledge distillation and a novel Predictive Sorting Memory Replay. Our method achieves the current state-of-the-art in continual learning for complex facial expression recognition with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. To the best of our knowledge, our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using a single training sample for each expression class.
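A standard knowledge-distillation loss, given as a hedged sketch of the "retain basic-expression knowledge while learning compound classes" idea; the class counts and weighting are illustrative assumptions, and the paper's Predictive Sorting Memory Replay and continual-learning schedule are not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from a (frozen) teacher trained on the basic expressions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)  # labels for the new classes
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(16, 13)           # e.g. 6 basic + 7 compound classes (illustrative)
t = torch.randn(16, 13)
y = torch.randint(0, 13, (16,))
print(distillation_loss(s, t, y).item())
```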

Assessing Guest Nationality Composition from Hotel Reviews

  • paper_url: http://arxiv.org/abs/2308.06175
  • repo_url: None
  • paper_authors: Fabian Gröger, Marc Pouly, Flavia Tinner, Leif Brandes
  • for: This study uses machine learning to assess the nationality composition of hotel guests, enabling businesses to dynamically monitor their guest profiles.
  • methods: Pre-trained embeddings and stacked LSTM layers automatically extract references to guest nationalities from unstructured text reviews.
  • results: The rather simple architecture provides a better performance-runtime trade-off than more complex state-of-the-art language models.
    Abstract Many hotels target guest acquisition efforts to specific markets in order to best anticipate individual preferences and needs of their guests. Likewise, such strategic positioning is a prerequisite for efficient marketing budget allocation. Official statistics report on the number of visitors from different countries, but no fine-grained information on the guest composition of individual businesses exists. There is, however, growing interest in such data from competitors, suppliers, researchers and the general public. We demonstrate how machine learning can be leveraged to extract references to guest nationalities from unstructured text reviews in order to dynamically assess and monitor the dynamics of guest composition of individual businesses. In particular, we show that a rather simple architecture of pre-trained embeddings and stacked LSTM layers provides a better performance-runtime tradeoff than more complex state-of-the-art language models.

  • paper_url: http://arxiv.org/abs/2308.06173
  • repo_url: None
  • paper_authors: Amira Guesmi, Muhammad Abdullah Hanif, Bassem Ouni, Muhammed Shafique
  • for: This paper presents a comprehensive survey of physical adversarial attacks, providing a thorough understanding of attacks in the physical world, including their key characteristics and categorization.
  • methods: The survey analyzes physical adversarial attack methods across application domains, including classification, detection, face recognition, semantic segmentation, and depth estimation, assessing them in terms of effectiveness, stealthiness, and robustness.
  • results: The paper examines the requirements and challenges of executing attacks in the physical world and outlines future research directions, including enhanced defense mechanisms, novel attack strategies, evaluation across application domains, and standardized benchmarks and evaluation criteria for physical adversarial attacks.
    Abstract In this paper, we present a comprehensive survey of the current trends focusing specifically on physical adversarial attacks. We aim to provide a thorough understanding of the concept of physical adversarial attacks, analyzing their key characteristics and distinguishing features. Furthermore, we explore the specific requirements and challenges associated with executing attacks in the physical world. Our article delves into various physical adversarial attack methods, categorized according to their target tasks in different applications, including classification, detection, face recognition, semantic segmentation and depth estimation. We assess the performance of these attack methods in terms of their effectiveness, stealthiness, and robustness. We examine how each technique strives to ensure the successful manipulation of DNNs while mitigating the risk of detection and withstanding real-world distortions. Lastly, we discuss the current challenges and outline potential future research directions in the field of physical adversarial attacks. We highlight the need for enhanced defense mechanisms, the exploration of novel attack strategies, the evaluation of attacks in different application domains, and the establishment of standardized benchmarks and evaluation criteria for physical adversarial attacks. Through this comprehensive survey, we aim to provide a valuable resource for researchers, practitioners, and policymakers to gain a holistic understanding of physical adversarial attacks in computer vision and facilitate the development of robust and secure DNN-based systems.

Phased Deep Spatio-temporal Learning for Highway Traffic Volume Prediction

  • paper_url: http://arxiv.org/abs/2308.06155
  • repo_url: None
  • paper_authors: Weilong Ding, Tianpu Zhang, Zhe Wang
  • for: This paper aims to estimate daily traffic volume at highway toll stations by learning spatio-temporal features in depth, improving prediction accuracy.
  • methods: A phased deep spatio-temporal learning approach is used: data is carefully normalized according to its latent long-tail distribution, spatio-temporal features from heterogeneous data are fused, and a hybrid model combining an FCN and an LSTM performs the learning.
  • results: The method clearly improves predictive accuracy over traditional models, reaching 5.269 and 0.997 on the MPAE and R-square metrics, respectively.
    Abstract Inter-city highway transportation is significant for citizens' modern urban life and generates heterogeneous sensory data with spatio-temporal characteristics. As a routine analysis in the transportation domain, daily traffic volume estimation at highway toll stations faces two challenges: a lack of exploration of correlative spatio-temporal features from a long-term perspective, and a lack of effective means to deal with data imbalance, which always deteriorates predictive performance. In this paper, a deep spatio-temporal learning method is proposed to predict daily traffic volume in three phases. In the feature pre-processing phase, data is normalized elaborately according to its latent long-tail distribution. In the spatio-temporal learning phase, a hybrid model combining a fully convolutional network (FCN) and a long short-term memory network (LSTM) is employed, which considers time, space, meteorology, and calendar information from heterogeneous data. In the decision phase, traffic volumes for the coming day at network-wide toll stations are obtained effectively, with special calibration for the vital few highway stations. Using real-world data from one Chinese provincial highway, extensive experiments show that our method clearly improves predictive accuracy over various traditional models, reaching 5.269 and 0.997 on the MPAE and R-square metrics, respectively.

Gaussian Process Regression for Maximum Entropy Distribution

  • paper_url: http://arxiv.org/abs/2308.06149
  • repo_url: None
  • paper_authors: Mohsen Sadr, Manuel Torrilhon, M. Hossein Gorji
  • for: Approximating maximum-entropy distributions in moment closure problems.
  • methods: Gaussian priors are used to approximate the Lagrange multipliers as a map of a given set of moments, with the hyperparameters of various kernel functions optimized by maximizing the log-likelihood.
  • results: The performance of the data-driven closure is studied on several test cases, including the relaxation of non-equilibrium distributions governed by the Bhatnagar-Gross-Krook and Boltzmann kinetic equations.
    Abstract Maximum-entropy distributions offer an attractive family of probability densities suitable for moment closure problems. Yet finding the Lagrange multipliers which parametrize these distributions turns out to be a computational bottleneck for practical closure settings. Motivated by the recent success of Gaussian processes, we investigate the suitability of Gaussian priors to approximate the Lagrange multipliers as a map of a given set of moments. Examining various kernel functions, the hyperparameters are optimized by maximizing the log-likelihood. The performance of the devised data-driven maximum-entropy closure is studied for a couple of test cases, including the relaxation of non-equilibrium distributions governed by the Bhatnagar-Gross-Krook and Boltzmann kinetic equations.
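A minimal scikit-learn sketch of the regression step, mapping moment vectors to Lagrange multipliers with a GP; the data below are synthetic placeholders, and scikit-learn's fit likewise tunes kernel hyperparameters by maximizing the log marginal likelihood.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
moments = rng.uniform(-1, 1, size=(200, 4))             # inputs: moment vectors
lagrange = np.tanh(moments @ rng.normal(size=(4, 3)))   # stand-in multipliers

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=1.0),
                              normalize_y=True)
gp.fit(moments, lagrange)        # hyperparameters tuned via log marginal likelihood

mu, sd = gp.predict(moments[:5], return_std=True)       # mean and uncertainty
print(mu.shape, sd.shape)
```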

A New Approach to Overcoming Zero Trade in Gravity Models to Avoid Indefinite Values in Linear Logarithmic Equations and Parameter Verification Using Machine Learning

  • paper_url: http://arxiv.org/abs/2308.06303
  • repo_url: None
  • paper_authors: Mikrajuddin Abdullah
  • for: Resolving the indefinite values that zero trade flows produce in the logarithmic linear equation, so that gravity models can be used to describe international trade.
  • methods: A two-step technique determines the gravity parameters: local linear regression first establishes a dummy value to substitute for zero trade flows, and iterative techniques then estimate the gravity parameters.
  • results: Machine learning tests the estimated parameters by analyzing their position in the cluster; the powers of GDP and distance fall in the same cluster with values of roughly one, and the strategy can be applied to other log-linear regression problems.
    Abstract The presence of a high number of zero flow trades continues to provide a challenge in identifying gravity parameters to explain international trade using the gravity model. Linear regression with a logarithmic linear equation encounters an indefinite value on the logarithmic trade. Although several approaches to solving this problem have been proposed, the majority of them are no longer based on linear regression, making the process of finding solutions more complex. In this work, we suggest a two-step technique for determining the gravity parameters: first, perform linear regression locally to establish a dummy value to substitute trade flow zero, and then estimating the gravity parameters. Iterative techniques are used to determine the optimum parameters. Machine learning is used to test the estimated parameters by analyzing their position in the cluster. We calculated international trade figures for 2004, 2009, 2014, and 2019. We just examine the classic gravity equation and discover that the powers of GDP and distance are in the same cluster and are both worth roughly one. The strategy presented here can be used to solve other problems involving log-linear regression.
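A hedged numpy sketch of the two-step idea on the classic gravity equation log T_ij = b0 + b1*log(GDP_i) + b2*log(GDP_j) - b3*log(D_ij); the local-regression step for the dummy value is simplified here to a nearest-in-distance average, an assumption rather than the paper's exact procedure, and the data are synthetic.

```python
import numpy as np

def fit_gravity(gdp_i, gdp_j, dist, trade):
    trade = trade.astype(float).copy()
    zero = trade == 0
    # Step 1: give each zero flow a dummy value estimated from similar pairs
    # (simplified here to the mean of nearby-in-distance non-zero flows).
    for k in np.where(zero)[0]:
        near = np.abs(dist - dist[k]).argsort()
        donors = [i for i in near if not zero[i]][:5]
        trade[k] = np.mean(trade[donors])
    # Step 2: ordinary least squares on the log-linear gravity equation.
    X = np.column_stack([np.ones_like(dist), np.log(gdp_i),
                         np.log(gdp_j), np.log(dist)])
    beta, *_ = np.linalg.lstsq(X, np.log(trade), rcond=None)
    return beta  # [b0, b1, b2, -b3]

rng = np.random.default_rng(1)
g_i, g_j = rng.uniform(1, 100, 50), rng.uniform(1, 100, 50)
d = rng.uniform(1, 20, 50)
t = (g_i * g_j / d) * rng.lognormal(0, 0.1, 50)
t[:5] = 0  # some zero trade flows
print(fit_gravity(g_i, g_j, d, t))  # expect roughly [0, 1, 1, -1]
```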

Identification of the Relevance of Comments in Codes Using Bag of Words and Transformer Based Models

  • paper_url: http://arxiv.org/abs/2308.06144
  • repo_url: https://github.com/sruthisudheer/comment-classification-of-c-code
  • paper_authors: Sruthi S, Tanmay Basu
  • for: The goal of this work is to classify whether comments given for certain code segments are relevant or not.
  • methods: Different feature engineering schemes and text classification techniques are explored, including the classical bag of words model with several classifiers and fine-tuned transformer-based models.
  • results: The bag of words model outperforms the transformer-based models on the training corpus, although the performance of the submitted runs is not particularly strong on either the training or the test corpus.
    Abstract The Forum for Information Retrieval (FIRE) started a shared task this year for the classification of comments on different code segments. This is a binary text classification task where the objective is to identify whether the comments given for certain code segments are relevant or not. The BioNLP-IISERB group at the Indian Institute of Science Education and Research Bhopal (IISERB) participated in this task and submitted five runs for five different models. The paper presents an overview of the models and other significant findings on the training corpus. The methods involve different feature engineering schemes and text classification techniques. The performance of the classical bag of words model and transformer-based models was explored to identify significant features from the given training corpus. We explored different classifiers, viz., random forest, support vector machine and logistic regression, using the bag of words model. Furthermore, pre-trained transformer-based models like BERT, RoBERTa and ALBERT were also used by fine-tuning them on the given training corpus. The performance of the different models over the training corpus was reported and the best five models were applied to the given test corpus. The empirical results show that the bag of words model outperforms the transformer-based models; however, the performance of our runs is not particularly strong on either the training or the test corpus. This paper also addresses the limitations of the models and the scope for further improvement.
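A minimal sketch of the kind of bag-of-words pipeline the abstract describes, built from standard scikit-learn components; the hyperparameters and toy comments are illustrative assumptions, not the submitted runs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = ["increment the loop counter before the next pass",
            "TODO remove this later"]
labels = [1, 0]  # 1 = relevant to the code segment, 0 = not relevant

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # bag-of-words / n-gram features
    LogisticRegression(max_iter=1000),
)
clf.fit(comments, labels)
print(clf.predict(["advance the index counter"]))
```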

CompTLL-UNet: Compressed Domain Text-Line Localization in Challenging Handwritten Documents using Deep Feature Learning from JPEG Coefficients

  • paper_url: http://arxiv.org/abs/2308.06142
  • repo_url: None
  • paper_authors: Bulla Rajesh, Sk Mahafuz Zaman, Mohammed Javed, P. Nagabhushan
  • for: This work proposes localizing text-lines directly from JPEG compressed coefficients, improving the efficiency and scalability of document processing.
  • methods: A modified U-Net architecture, the Compressed Text-Line Localization Network (CompTLL-UNet), learns deep features directly from the JPEG compressed coefficients without full decompression.
  • results: CompTLL-UNet achieves state-of-the-art performance on JPEG compressed versions of the ICDAR2017 (cBAD) and ICDAR2019 (cBAD) benchmarks while reducing storage and computational costs.
    Abstract Automatic localization of text-lines in handwritten documents is still an open and challenging research problem. Various writing issues such as uneven spacing between the lines, oscillating and touching text, and the presence of skew become much more challenging when the case of complex handwritten document images are considered for segmentation directly in their respective compressed representation. This is because, the conventional way of processing compressed documents is through decompression, but here in this paper, we propose an idea that employs deep feature learning directly from the JPEG compressed coefficients without full decompression to accomplish text-line localization in the JPEG compressed domain. A modified U-Net architecture known as Compressed Text-Line Localization Network (CompTLL-UNet) is designed to accomplish it. The model is trained and tested with JPEG compressed version of benchmark datasets including ICDAR2017 (cBAD) and ICDAR2019 (cBAD), reporting the state-of-the-art performance with reduced storage and computational costs in the JPEG compressed domain.

Application of Artificial Neural Networks for Investigation of Pressure Filtration Performance, a Zinc Leaching Filter Cake Moisture Modeling

  • paper_url: http://arxiv.org/abs/2308.06138
  • repo_url: None
  • paper_authors: Masoume Kazemi, Davood Moradkhani, Alireza A. Alipour
  • for: This study develops an artificial neural network model to predict the cake moisture of the pressure filtration process in zinc production.
  • methods: The ANN model was trained and evaluated on 288 tests using two fabric types, polypropylene (S1) and polyester (S2), with seven input parameters describing the filtration conditions.
  • results: The model predicts cake moisture with high accuracy, with R2 values of 0.88 and 0.83, MSE values of 6.243x10-07 and 1.086x10-06, and MAE values of 0.00056 and 0.00088 for S1 and S2, respectively.
    Abstract Machine Learning (ML) is a powerful tool for material science applications. Artificial Neural Network (ANN) is a machine learning technique that can provide high prediction accuracy. This study aimed to develop an ANN model to predict the cake moisture of the pressure filtration process of zinc production. The cake moisture was influenced by seven parameters: temperature (35 and 65 Celsius), solid concentration (0.2 and 0.38 g/L), pH (2, 3.5, and 5), air-blow time (2, 10, and 15 min), cake thickness (14, 20, 26, and 34 mm), pressure, and filtration time. The study conducted 288 tests using two types of fabrics: polypropylene (S1) and polyester (S2). The ANN model was evaluated by the Coefficient of determination (R2), the Mean Square Error (MSE), and the Mean Absolute Error (MAE) metrics for both datasets. The results showed R2 values of 0.88 and 0.83, MSE values of 6.243x10-07 and 1.086x10-06, and MAE values of 0.00056 and 0.00088 for S1 and S2, respectively. These results indicated that the ANN model could predict the cake moisture of pressure filtration in the zinc leaching process with high accuracy.
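A hedged sketch of the regression setup, fitting a small feed-forward network on the seven process parameters and scoring it with the same three metrics; the data are synthetic and the network topology is an assumption, not the paper's exact ANN.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
# Columns: temperature, solid concentration, pH, air-blow time, cake thickness,
# pressure, filtration time (all scaled to [0, 1] here).
X = rng.uniform(0, 1, size=(288, 7))
y = 0.2 + 0.1 * X[:, 0] - 0.05 * X[:, 4] + rng.normal(0, 0.005, 288)

model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=5000, random_state=0)
model.fit(X[:230], y[:230])
pred = model.predict(X[230:])
print(r2_score(y[230:], pred),
      mean_squared_error(y[230:], pred),
      mean_absolute_error(y[230:], pred))
```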

PDE Discovery for Soft Sensors Using Coupled Physics-Informed Neural Network with Akaike’s Information Criterion

  • paper_url: http://arxiv.org/abs/2308.06132
  • repo_url: None
  • paper_authors: Aina Wang, Pan Qin, Xi-Ming Sun
  • for: This paper addresses PDE structure discovery for soft sensors in industrial processes with spatiotemporal dependence.
  • methods: A coupled physics-informed neural network with Akaike's information criterion (CPINN-AIC) combines physical laws with neural networks to discover proper PDE structures, including the differential operators and source terms, for soft sensors.
  • results: Experiments on artificial and practical datasets verify the feasibility and effectiveness of CPINN-AIC for soft sensors.
    Abstract Soft sensors have been extensively used to monitor key variables using easy-to-measure variables and mathematical models. Partial differential equations (PDEs) are model candidates for soft sensors in industrial processes with spatiotemporal dependence. However, gaps often exist between idealized PDEs and practical situations. Discovering proper structures of PDEs, including the differential operators and source terms, can remedy the gaps. To this end, a coupled physics-informed neural network with Akaike's criterion information (CPINN-AIC) is proposed for PDE discovery of soft sensors. First, CPINN is adopted for obtaining solutions and source terms satisfying PDEs. Then, we propose a data-physics-hybrid loss function for training CPINN, in which undetermined combinations of differential operators are involved. Consequently, AIC is used to discover the proper combination of differential operators. Finally, the artificial and practical datasets are used to verify the feasibility and effectiveness of CPINN-AIC for soft sensors. The proposed CPINN-AIC is a data-driven method to discover proper PDE structures and neural network-based solutions for soft sensors.
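The model-selection step alone can be illustrated with AIC = 2k - 2 ln(L) scored over candidate subsets of differential operators; the operator names and log-likelihood values below are placeholders, whereas in CPINN-AIC they come from the trained network.

```python
# Hypothetical fitted log-likelihoods for a few candidate operator subsets.
log_likelihood = {
    ("u_t", "u_xx"): -120.0,
    ("u_t", "u_x", "u_xx"): -118.5,
    ("u_t", "u_x", "u_xx", "u*u_x"): -118.4,
}

def aic(subset, logL):
    return 2 * len(subset) - 2 * logL  # penalizes extra operators

for subset, logL in log_likelihood.items():
    print(subset, round(aic(subset, logL), 1))

best = min(log_likelihood, key=lambda s: aic(s, log_likelihood[s]))
print("selected structure:", best)  # the parsimonious 3-operator subset wins
```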

Uncertainty Quantification for Image-based Traffic Prediction across Cities

  • paper_url: http://arxiv.org/abs/2308.06129
  • repo_url: https://github.com/alextimans/traffic4cast-uncertainty
  • paper_authors: Alexander Timans, Nina Wiedemann, Nishant Kumar, Ye Hong, Martin Raubal
  • for: This paper explores the application of uncertainty quantification (UQ) methods to traffic prediction tasks, with the goal of improving model interpretability and decision support.
  • methods: Two epistemic and two aleatoric UQ methods are compared on temporal and spatio-temporal transfer tasks over a large-scale image-based traffic dataset spanning multiple cities and time periods.
  • results: Meaningful uncertainty estimates can be recovered and employed for unsupervised outlier detection on changes in city traffic dynamics; a case study for the city of Moscow captures both temporal and spatial effects on traffic behaviour.
    Abstract Despite the strong predictive performance of deep learning models for traffic prediction, their widespread deployment in real-world intelligent transportation systems has been restrained by a lack of interpretability. Uncertainty quantification (UQ) methods provide an approach to induce probabilistic reasoning, improve decision-making and enhance model deployment potential. To gain a comprehensive picture of the usefulness of existing UQ methods for traffic prediction and the relation between obtained uncertainties and city-wide traffic dynamics, we investigate their application to a large-scale image-based traffic dataset spanning multiple cities and time periods. We compare two epistemic and two aleatoric UQ methods on both temporal and spatio-temporal transfer tasks, and find that meaningful uncertainty estimates can be recovered. We further demonstrate how uncertainty estimates can be employed for unsupervised outlier detection on changes in city traffic dynamics. We find that our approach can capture both temporal and spatial effects on traffic behaviour in a representative case study for the city of Moscow. Our work presents a further step towards boosting uncertainty awareness in traffic prediction tasks, and aims to highlight the value contribution of UQ methods to a better understanding of city traffic dynamics.
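A minimal MC-dropout sketch, one common epistemic UQ technique of the kind such studies compare; the tiny regression network below is a stand-in, as the paper's traffic models operate on images rather than feature vectors.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                    nn.Dropout(0.2), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # predictive mean, epistemic spread

x = torch.randn(4, 10)
mean, std = mc_dropout_predict(net, x)
print(mean.squeeze(), std.squeeze())
```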

Learning Control Policies for Variable Objectives from Offline Data

  • paper_url: http://arxiv.org/abs/2308.06127
  • repo_url: None
  • paper_authors: Marc Weber, Phillip Swazinna, Daniel Hein, Steffen Udluft, Volkmar Sterzing
  • for: This work provides a viable approach to obtaining advanced control strategies for complex dynamical systems, in particular when direct interaction with the environment is not available.
  • methods: A conceptual extension for model-based policy search, the variable objective policy (VOP), trains policies to generalize efficiently over a variety of objectives that parameterize the reward function.
  • results: By altering the objectives passed as input to the policy, users can adjust its behavior or re-balance optimization targets at runtime, without collecting additional observation batches or re-training.
    Abstract Offline reinforcement learning provides a viable approach to obtain advanced control strategies for dynamical systems, in particular when direct interaction with the environment is not available. In this paper, we introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP). With this approach, policies are trained to generalize efficiently over a variety of objectives, which parameterize the reward function. We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime, without need for collecting additional observation batches or re-training.

Learning Deductive Reasoning from Synthetic Corpus based on Formal Logic

  • paper_url: http://arxiv.org/abs/2308.07336
  • repo_url: https://github.com/hitachi-nlp/fld
  • paper_authors: Terufumi Morishita, Gaku Morio, Atsuki Yamaguchi, Yasuhiro Sogawa
  • for: This work aims to help language models (LMs) acquire logical deductive reasoning ability, so they can better understand and reason over logical relations in text.
  • methods: A well-grounded set of deduction rules based on formal logic theory, which can derive any other deduction rules when combined in a multistep way, is used to generate a synthetic deduction corpus, named FLD (Formal Logic Deduction), for training LMs.
  • results: Experiments show that LMs trained on FLD acquire more generalizable deductive reasoning ability; the paper also identifies the aspects of deductive reasoning on which deduction corpora can enhance LMs and those on which they cannot.
    Abstract We study a synthetic corpus-based approach for language models (LMs) to acquire logical deductive reasoning ability. The previous studies generated deduction examples using specific sets of deduction rules. However, these rules were limited or otherwise arbitrary. This can limit the generalizability of acquired deductive reasoning ability. We rethink this and adopt a well-grounded set of deduction rules based on formal logic theory, which can derive any other deduction rules when combined in a multistep way. We empirically verify that LMs trained on the proposed corpora, which we name $\textbf{FLD}$ ($\textbf{F}$ormal $\textbf{L}$ogic $\textbf{D}$eduction), acquire more generalizable deductive reasoning ability. Furthermore, we identify the aspects of deductive reasoning ability on which deduction corpora can enhance LMs and those on which they cannot. Finally, on the basis of these results, we discuss the future directions for applying deduction corpora or other approaches for each aspect. We release the code, data, and models.

Hawkes Processes with Delayed Granger Causality

  • paper_url: http://arxiv.org/abs/2308.06106
  • repo_url: None
  • paper_authors: Chao Yang, Hengyuan Miao, Shuang Li
  • for: This work explicitly models delayed Granger causal effects based on multivariate Hawkes processes, motivated by the fact that a causal event usually takes some time to exert an effect; studying this time lag is itself of scientific interest.
  • methods: The time lags are treated as latent variables, and a Variational Auto-Encoder (VAE) algorithm approximates their posterior distribution, allowing inference of how this distribution varies across scenarios; the identifiability of the delay parameter is proven under mild conditions.
  • results: Experiments on synthetic and real data achieve promising event prediction and time-lag inference accuracy.
    Abstract We aim to explicitly model the delayed Granger causal effects based on multivariate Hawkes processes. The idea is inspired by the fact that a causal event usually takes some time to exert an effect. Studying this time lag itself is of interest. Given the proposed model, we first prove the identifiability of the delay parameter under mild conditions. We further investigate a model estimation method under a complex setting, where we want to infer the posterior distribution of the time lags and understand how this distribution varies across different scenarios. We treat the time lags as latent variables and formulate a Variational Auto-Encoder (VAE) algorithm to approximate the posterior distribution of the time lags. By explicitly modeling the time lags in Hawkes processes, we add flexibility to the model. The inferred time-lag posterior distributions are of scientific meaning and help trace the original causal time that supports the root cause analysis. We empirically evaluate our model's event prediction and time-lag inference accuracy on synthetic and real data, achieving promising results.
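For a univariate special case, a delayed exponential kernel gives the intensity lambda(t) = mu + sum over {t_j < t - d} of alpha * beta * exp(-beta * (t - t_j - d)), the quantity whose delay parameter d the paper infers; the sketch below evaluates it directly, with all parameter values illustrative.

```python
import math

def intensity(t, history, mu=0.2, alpha=0.5, beta=1.0, d=2.0):
    rate = mu
    for t_j in history:
        lag = t - t_j - d
        if lag > 0:  # an event only starts exciting after the delay d
            rate += alpha * beta * math.exp(-beta * lag)
    return rate

events = [1.0, 2.5, 4.0]
print([round(intensity(t, events), 3) for t in (3.0, 5.0, 7.0)])
```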

Composable Function-preserving Expansions for Transformer Architectures

  • paper_url: http://arxiv.org/abs/2308.06103
  • repo_url: None
  • paper_authors: Andrea Gesmundo, Kaitlin Maile
  • for: The goal of this paper is to incrementally increase the scale of transformer-based neural networks while preserving the function they compute.
  • methods: Six composable transformations are proposed for incrementally increasing model size, each accompanied by a proof of exact function preservation under minimal initialization constraints.
  • results: The proposed methods may enable efficient training pipelines for larger and more powerful models by progressively expanding the architecture throughout training.
    Abstract Training state-of-the-art neural networks requires a high cost in terms of compute and time. Model scale is recognized to be a critical factor to achieve and improve the state-of-the-art. Increasing the scale of a neural network normally requires restarting from scratch by randomly initializing all the parameters of the model, as this implies a change of architecture's parameters that does not allow for a straightforward transfer of knowledge from smaller size models. In this work, we propose six composable transformations to incrementally increase the size of transformer-based neural networks while preserving functionality, allowing to expand the capacity of the model as needed. We provide proof of exact function preservation under minimal initialization constraints for each transformation. The proposed methods may enable efficient training pipelines for larger and more powerful models by progressively expanding the architecture throughout training.
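One classic function-preserving transformation, Net2Net-style width expansion, can be checked numerically: duplicate a hidden unit and halve its outgoing weights, and the network computes exactly the same function. The sketch below does this for a toy ReLU MLP; the paper's six transformations target transformer components (attention heads, MLP width, depth) rather than this toy network.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)  # in=3, hidden=4
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)  # hidden=4, out=2

def forward(x, W1, b1, W2, b2):
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU MLP
    return h @ W2 + b2

# Expand hidden width 4 -> 5: copy unit 0, then split its outgoing weights.
W1e = np.concatenate([W1, W1[:, :1]], axis=1)
b1e = np.concatenate([b1, b1[:1]])
W2e = np.concatenate([W2, W2[:1]], axis=0)
W2e[0] /= 2.0
W2e[4] /= 2.0

x = rng.normal(size=(5, 3))
assert np.allclose(forward(x, W1, b1, W2, b2), forward(x, W1e, b1e, W2e, b2))
print("function preserved after width expansion")
```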

Diffusion-based Visual Counterfactual Explanations – Towards Systematic Quantitative Evaluation

  • paper_url: http://arxiv.org/abs/2308.06100
  • repo_url: https://github.com/cairo-thws/dbvce_eval
  • paper_authors: Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, Magda Gregorova
  • for: This work proposes a framework for systematic, quantitative evaluation of visual counterfactual explanation (VCE) methods, together with a minimal set of metrics to be used.
  • methods: Using the framework, the authors explore the effects of crucial design choices in the latest diffusion-based generative models for VCEs of natural image classification (ImageNet), running a battery of ablation-like experiments that generate thousands of VCEs for classifiers of various complexity, accuracy, and robustness.
  • results: The findings suggest multiple directions for future advancements of VCE methods, and the shared methodology and code base offer guidance for consistent and transparent assessment of counterfactual explanations.
    Abstract Latest methods for visual counterfactual explanations (VCE) harness the power of deep generative models to synthesize new examples of high-dimensional images of impressive quality. However, it is currently difficult to compare the performance of these VCE methods as the evaluation procedures largely vary and often boil down to visual inspection of individual examples and small scale user studies. In this work, we propose a framework for systematic, quantitative evaluation of the VCE methods and a minimal set of metrics to be used. We use this framework to explore the effects of certain crucial design choices in the latest diffusion-based generative models for VCEs of natural image classification (ImageNet). We conduct a battery of ablation-like experiments, generating thousands of VCEs for a suite of classifiers of various complexity, accuracy and robustness. Our findings suggest multiple directions for future advancements and improvements of VCE methods. By sharing our methodology and our approach to tackle the computational challenges of such a study on a limited hardware setup (including the complete code base), we offer a valuable guidance for researchers in the field fostering consistency and transparency in the assessment of counterfactual explanations.

Neural Conversation Models and How to Rein Them in: A Survey of Failures and Fixes

  • paper_url: http://arxiv.org/abs/2308.06095
  • repo_url: None
  • paper_authors: Fabian Galetzka, Anne Beyer, David Schlangen
  • for: This survey examines open-domain dialogue systems built on powerful language models and uses Grice's maxims of cooperative conversation to assess what makes their contributions appropriate.
  • methods: A systematic literature review is used to categorize and discuss existing approaches and to identify new research directions.
  • results: The survey summarizes the strengths and weaknesses of existing methods and proposes avenues for future research on taming language models so that dialogue stays fluent, informative, consistent, coherent, and in line with social norms.
    Abstract Recent conditional language models are able to continue any kind of text source in an often seemingly fluent way. This fact encouraged research in the area of open-domain conversational systems that are based on powerful language models and aim to imitate an interlocutor by generating appropriate contributions to a written dialogue. From a linguistic perspective, however, the complexity of contributing to a conversation is high. In this survey, we interpret Grice's maxims of cooperative conversation from the perspective of this specific research area and systematize the literature under the aspect of what makes a contribution appropriate: A neural conversation model has to be fluent, informative, consistent, coherent, and follow social norms. In order to ensure these qualities, recent approaches try to tame the underlying language models at various intervention points, such as data, training regime or decoding. Sorted by these categories and intervention points, we discuss promising attempts and suggest novel ways for future research.

Reinforcement Logic Rule Learning for Temporal Point Processes

  • paper_url: http://arxiv.org/abs/2308.06094
  • repo_url: None
  • paper_authors: Chao Yang, Lu Wang, Kun Gao, Shuang Li
  • for: The purpose of this paper is to propose a framework for incrementally expanding an explanatory temporal logic rule set to explain the occurrence of temporal events.
  • methods: The method uses the temporal point process modeling and learning framework to gradually optimize the rule set content and weights until the likelihood of the observed event sequences is maximized.
  • results: The method obtains promising results on both synthetic and real healthcare datasets.
    Abstract We propose a framework that can incrementally expand the explanatory temporal logic rule set to explain the occurrence of temporal events. Leveraging the temporal point process modeling and learning framework, the rule content and weights will be gradually optimized until the likelihood of the observational event sequences is optimal. The proposed algorithm alternates between a master problem, where the current rule set weights are updated, and a subproblem, where a new rule is searched and included to best increase the likelihood. The formulated master problem is convex and relatively easy to solve using continuous optimization, whereas the subproblem requires searching the huge combinatorial rule predicate and relationship space. To tackle this challenge, we propose a neural search policy to learn to generate the new rule content as a sequence of actions. The policy parameters will be trained end-to-end using the reinforcement learning framework, where the reward signals can be efficiently queried by evaluating the subproblem objective. The trained policy can be used to generate new rules in a controllable way. We evaluate our methods on both synthetic and real healthcare datasets, obtaining promising results.

Experts Weights Averaging: A New General Training Scheme for Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.06093
  • repo_url: None
  • paper_authors: Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Wanli Ouyang
  • for: This paper proposes a new general training scheme that improves the performance of Vision Transformers (ViTs) without increasing inference cost.
  • methods: The scheme draws on Mixture-of-Experts (MoE): during training, some FFNs of the ViT are replaced with specially designed MoEs that assign tokens to experts by random uniform partition, and Experts Weights Averaging (EWA) is performed on these MoEs at the end of each iteration. After training, each MoE is converted back into an FFN by averaging its experts, restoring the original ViT structure for inference.
  • results: Experiments across various 2D and 3D visual tasks, ViT architectures, and datasets show that the scheme improves ViT performance and also helps when fine-tuning ViTs; in addition, the EWA technique substantially improves naive MoE on small 2D visual datasets and on 3D visual tasks.
    Abstract Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs), which achieves performance improvement without increasing inference cost. As Vision Transformers (ViTs) are gradually surpassing CNNs in various visual tasks, one may question: if a training scheme specifically for ViTs exists that can also achieve performance improvement without increasing inference cost? Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we affirmatively answer these questions, with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into original ViT for inference. We further provide a theoretical analysis to show why and how it works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. Besides, our training scheme can also be applied to improve performance when fine-tuning ViTs. Lastly, but equally important, the proposed EWA technique can significantly improve the effectiveness of naive MoE in various 2D visual small datasets and 3D visual tasks.
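
The inference-time conversion is the easiest half of the scheme to picture: since every expert in a branch shares the same FFN shape, averaging their parameters collapses the MoE back into one plain feed-forward block. A minimal sketch of that collapse step follows; the random uniform token routing and the per-iteration EWA during training are not shown, and the two-layer MLP expert is an assumed simplification.

```python
import copy
import torch
import torch.nn as nn

def average_experts(experts: nn.ModuleList) -> nn.Module:
    """Collapse identically-shaped experts into a single FFN by parameter
    averaging, turning the MoE branch back into a plain feed-forward block."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, p in merged.named_parameters():
            stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
            p.copy_(stacked.mean(dim=0))
    return merged

experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)) for _ in range(4)]
)
ffn = average_experts(experts)  # same architecture as one expert, averaged weights
```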

Toward a Better Understanding of Loss Functions for Collaborative Filtering

  • paper_url: http://arxiv.org/abs/2308.06091
  • repo_url: https://github.com/psm1206/mawu
  • paper_authors: Seongmin Park, Mincheol Yoon, Jae-woong Lee, Hogun Park, Jongwuk Lee
  • for: This work studies collaborative filtering in modern recommender systems, analyzing its three main components: the interaction encoder, the loss function, and negative sampling.
  • methods: A mathematical analysis shows that existing loss functions can be interpreted as alignment (matching user and item representations) and uniformity (dispersing the user and item distributions). Building on this, the authors propose a new loss, Margin-aware Alignment and Weighted Uniformity (MAWU), which adapts alignment and uniformity to the unique patterns of each dataset.
  • results: Extensive experiments on three public datasets show that MF and LightGCN equipped with MAWU are comparable or superior to state-of-the-art CF models with various loss functions.
    Abstract Collaborative filtering (CF) is a pivotal technique in modern recommender systems. The learning process of CF models typically consists of three components: interaction encoder, loss function, and negative sampling. Although many existing studies have proposed various CF models to design sophisticated interaction encoders, recent work shows that simply reformulating the loss functions can achieve significant performance gains. This paper delves into analyzing the relationship among existing loss functions. Our mathematical analysis reveals that the previous loss functions can be interpreted as alignment and uniformity functions: (i) the alignment matches user and item representations, and (ii) the uniformity disperses user and item distributions. Inspired by this analysis, we propose a novel loss function that improves the design of alignment and uniformity considering the unique patterns of datasets called Margin-aware Alignment and Weighted Uniformity (MAWU). The key novelty of MAWU is two-fold: (i) margin-aware alignment (MA) mitigates user/item-specific popularity biases, and (ii) weighted uniformity (WU) adjusts the significance between user and item uniformities to reflect the inherent characteristics of datasets. Extensive experimental results show that MF and LightGCN equipped with MAWU are comparable or superior to state-of-the-art CF models with various loss functions on three public datasets.
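
The alignment and uniformity terms referenced here follow the standard hypersphere formulation: alignment pulls matched user-item pairs together, and uniformity spreads each embedding set apart via a Gaussian potential. A minimal sketch of the two base losses; MAWU's margin-aware and weighting refinements are omitted.

```python
import torch
import torch.nn.functional as F

def alignment(user_emb, item_emb):
    """Mean squared distance between L2-normalized positive user/item pairs."""
    u, i = F.normalize(user_emb, dim=-1), F.normalize(item_emb, dim=-1)
    return (u - i).pow(2).sum(dim=-1).mean()

def uniformity(emb, t: float = 2.0):
    """Log of the mean Gaussian potential over all pairs; lower = more uniform."""
    e = F.normalize(emb, dim=-1)
    return torch.pdist(e, p=2).pow(2).mul(-t).exp().mean().log()

u_batch, i_batch = torch.randn(128, 64), torch.randn(128, 64)  # positive pairs
loss = alignment(u_batch, i_batch) + 0.5 * (uniformity(u_batch) + uniformity(i_batch))
```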

Safeguarding Learning-based Control for Smart Energy Systems with Sampling Specifications

  • paper_url: http://arxiv.org/abs/2308.06069
  • repo_url: None
  • paper_authors: Chih-Hong Cheng, Venkatesh Prasad Venkataramanan, Pragya Kirti Gupta, Yun-Fei Hsu, Simon Burton
  • for: This work studies the challenges of using reinforcement learning to control energy systems, where safety requirements such as avoiding blackouts must be satisfied alongside performance requirements.
  • methods: The paper details how real-time temporal-logic safety requirements can be strengthened via discretization into linear temporal logic (LTL), so that satisfying the LTL formulae implies satisfying the original requirements. The discretization enables advanced engineering methods such as synthesizing shields for safe reinforcement learning and formal verification, where the probabilistic guarantee from statistical LTL model checking forms a lower bound on the satisfaction of the original real-time safety requirements.
  • results: The analysis shows that discretizing real-time requirements into linear temporal logic improves safety and reliability while reducing risk and cost.
    Abstract We study challenges using reinforcement learning in controlling energy systems, where apart from performance requirements, one has additional safety requirements such as avoiding blackouts. We detail how these safety requirements in real-time temporal logic can be strengthened via discretization into linear temporal logic (LTL), such that the satisfaction of the LTL formulae implies the satisfaction of the original safety requirements. The discretization enables advanced engineering methods such as synthesizing shields for safe reinforcement learning as well as formal verification, where for statistical model checking, the probabilistic guarantee acquired by LTL model checking forms a lower bound for the satisfaction of the original real-time safety requirements.
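
Shielding, one of the engineering methods the discretization enables, can be pictured as a thin wrapper that vetoes any action the safety monitor rejects and substitutes a known-safe fallback. The skeleton below is generic; `agent.act`, `is_safe` (standing in for the LTL-derived monitor), and `fallback_action` are hypothetical interfaces, not the paper's implementation.

```python
def shielded_action(agent, state, is_safe, fallback_action):
    """Return the agent's proposed action if the safety monitor accepts it,
    otherwise fall back to a known-safe action (e.g., curtail load)."""
    action = agent.act(state)
    return action if is_safe(state, action) else fallback_action(state)
```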

Deep learning-based flow disaggregation for hydropower plant management

  • paper_url: http://arxiv.org/abs/2308.11631
  • repo_url: None
  • paper_authors: Duo Zhang
  • for: This study develops a deep learning-based time series disaggregation model that extracts higher temporal resolution flow information from daily flow data.
  • methods: The model leverages the feature-extraction power of deep learning to disaggregate daily flow data into flow time series of finer (hourly) resolution.
  • results: Preliminary results on flow data from a Norwegian flow station show some promising aspects of the proposed model.
    Abstract High temporal resolution data is a vital resource for hydropower plant management. Currently, only daily resolution data are available for most Norwegian hydropower plants; however, to achieve more accurate management, sub-daily resolution data are often required. To address the widespread absence of sub-daily data, time series disaggregation is a potential tool. In this study, we propose a time series disaggregation model based on deep learning; the model is tested using flow data from a Norwegian flow station to disaggregate daily flow into hourly flow. Preliminary results show some promising aspects of the proposed model.
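
One natural way to set up such a disaggregation model (assumed here, not necessarily the paper's architecture) is to predict a 24-way split of each day's total, so the hourly estimates conserve the observed daily volume by construction:

```python
import torch
import torch.nn as nn

class FlowDisaggregator(nn.Module):
    """Map a window of daily flows to 24 hourly shares for the center day;
    a softmax makes the shares sum to 1, so hourly flows sum to the daily flow."""
    def __init__(self, window: int = 7, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window, hidden), nn.ReLU(), nn.Linear(hidden, 24)
        )

    def forward(self, daily_window, center_daily_flow):
        shares = torch.softmax(self.net(daily_window), dim=-1)
        return shares * center_daily_flow.unsqueeze(-1)  # mass-conserving hourly flows

model = FlowDisaggregator()
hourly = model(torch.randn(32, 7), torch.rand(32) * 100.0)  # shape (32, 24)
```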

Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

  • paper_url: http://arxiv.org/abs/2308.06058
  • repo_url: None
  • paper_authors: Xiaowen Jiang, Sebastian U. Stich
  • for: This paper addresses the shortcomings of SGD with stochastic Polyak stepsize and stochastic line-search when training in non-interpolation settings, particularly for over-parameterized models.
  • methods: Two new variants of the stochastic Polyak stepsize (SPS) and stochastic line-search (SLS), called AdaSPS and AdaSLS, are proposed; they guarantee convergence in non-interpolation settings and maintain sub-linear and linear convergence rates for convex and strongly convex functions when training over-parameterized models.
  • results: Equipped with a novel variance reduction technique, the algorithms require $\smash{\widetilde{\mathcal{O}}(n+1/\epsilon)}$ gradient evaluations to achieve $\mathcal{O}(\epsilon)$-suboptimality for convex functions, improving on the slower $\mathcal{O}(1/\epsilon^2)$ rates of AdaSPS and AdaSLS without variance reduction in non-interpolation regimes, and matching the fast rates of AdaSVRG without its inner-outer-loop structure. Numerical experiments validate the theory and demonstrate the effectiveness and robustness of the algorithms.
    Abstract The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al. [2022]), this approach results in slower convergence rates for convex and over-parameterized models. In this work, we make two contributions: Firstly, we propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which guarantee convergence in non-interpolation settings and maintain sub-linear and linear convergence rates for convex and strongly convex functions when training over-parameterized models. AdaSLS requires no knowledge of problem-dependent parameters, and AdaSPS requires only a lower bound of the optimal function value as input. Secondly, we equip AdaSPS and AdaSLS with a novel variance reduction technique and obtain algorithms that require $\smash{\widetilde{\mathcal{O}}(n+1/\epsilon)}$ gradient evaluations to achieve an $\mathcal{O}(\epsilon)$-suboptimality for convex functions, which improves upon the slower $\mathcal{O}(1/\epsilon^2)$ rates of AdaSPS and AdaSLS without variance reduction in the non-interpolation regimes. Moreover, our result matches the fast rates of AdaSVRG but removes the inner-outer-loop structure, which is easier to implement and analyze. Finally, numerical experiments on synthetic and real datasets validate our theory and demonstrate the effectiveness and robustness of our algorithms.
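
For context, the stochastic Polyak stepsize these methods build on sets the step from the sampled loss and gradient, $\gamma_t = (f_i(x_t) - \ell_i^*)/(c\,\|\nabla f_i(x_t)\|^2)$. A minimal numpy sketch of SGD with this stepsize follows; the AdaSPS/AdaSLS accumulators and the variance-reduction machinery are omitted.

```python
import numpy as np

def sgd_with_sps(x0, loss_grad, n_steps: int, c: float = 0.5,
                 lb: float = 0.0, gamma_max: float = 10.0):
    """SGD with the stochastic Polyak stepsize.

    loss_grad(x) -> (f_i(x), grad_i(x)) for a sampled mini-batch i;
    lb is a lower bound on the optimal mini-batch loss (0 is valid for
    non-negative losses); gamma_max caps the step as in SPS_max."""
    x = x0.copy()
    for _ in range(n_steps):
        f, g = loss_grad(x)
        step = min((f - lb) / (c * np.dot(g, g) + 1e-12), gamma_max)
        x -= step * g
    return x
```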

Cost-effective On-device Continual Learning over Memory Hierarchy with Miro

  • paper_url: http://arxiv.org/abs/2308.06053
  • repo_url: None
  • paper_authors: Xinyue Ma, Suyeon Jeong, Minjia Zhang, Di Wang, Jonghyun Choi, Myeongjae Jeon
  • for: This work aims to improve the cost-effectiveness of continual learning (CL) systems on edge devices.
  • methods: Hierarchical memory replay is used to retain previously learned knowledge, and online profiling on the device identifies parameter values with the best accuracy-energy trade-off, allowing the CL system to be configured dynamically based on resource states.
  • results: Compared with baseline systems, the proposed system consistently achieves higher cost-effectiveness.
    Abstract Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy while not compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework by enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms baseline systems we build for comparison, consistently achieving higher cost-effectiveness.

Towards Instance-adaptive Inference for Federated Learning

  • paper_url: http://arxiv.org/abs/2308.06051
  • repo_url: https://github.com/chunmeifeng/fedins
  • paper_authors: Chun-Mei Feng, Kai Yu, Nian Liu, Xinxing Xu, Salman Khan, Wangmeng Zuo
  • for: Improving federated learning performance, particularly in the presence of intra-client data heterogeneity.
  • methods: A parameter-efficient fine-tuning method based on scaling and shifting deep features (SSF) is proposed, enabling instance-adaptive inference on the client side.
  • results: Outperforms the top-performing method on Tiny-ImageNet by 6.64% with less than 15% of the communication cost.
    Abstract Federated learning (FL) is a distributed learning paradigm that enables multiple clients to learn a powerful global model by aggregating local training. However, the performance of the global model is often hampered by non-i.i.d. distribution among the clients, requiring extensive efforts to mitigate inter-client data heterogeneity. Going beyond inter-client data heterogeneity, we note that intra-client heterogeneity can also be observed on complex real-world data and seriously deteriorate FL performance. In this paper, we present a novel FL algorithm, i.e., FedIns, to handle intra-client data heterogeneity by enabling instance-adaptive inference in the FL framework. Instead of huge instance-adaptive models, we resort to a parameter-efficient fine-tuning method, i.e., scale and shift deep features (SSF), upon a pre-trained model. Specifically, we first train an SSF pool for each client, and aggregate these SSF pools on the server side, thus still maintaining a low communication cost. To enable instance-adaptive inference, for a given instance, we dynamically find the best-matched SSF subsets from the pool and aggregate them to generate an adaptive SSF specified for the instance, thereby reducing the intra-client as well as the inter-client heterogeneity. Extensive experiments show that our FedIns outperforms state-of-the-art FL algorithms, e.g., a 6.64\% improvement against the top-performing method with less than 15\% communication cost on Tiny-ImageNet. Our code and models will be publicly released.
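
The SSF building block at the heart of FedIns is tiny: a learnable per-channel scale and shift applied to frozen deep features, so a client trains and communicates only these few parameters. A minimal sketch of the module itself; the SSF pooling and instance-matching logic of FedIns is not shown.

```python
import torch
import torch.nn as nn

class SSF(nn.Module):
    """Scale-and-shift of deep features: y = gamma * x + beta, per channel.
    Only gamma/beta are trained; the pre-trained backbone stays frozen."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):  # x: (..., dim)
        return x * self.gamma + self.beta
```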

AI-Assisted Investigation of On-Chain Parameters: Risky Cryptocurrencies and Price Factors

  • paper_url: http://arxiv.org/abs/2308.08554
  • repo_url: None
  • paper_authors: Abdulrezzak Zekiye, Semih Utku, Fadi Amroush, Oznur Ozkasap
  • for: This paper aims to analyze historical data and use artificial intelligence algorithms to identify factors affecting cryptocurrency prices and to classify cryptocurrencies as risky or not.
  • methods: The authors use on-chain parameters and artificial intelligence techniques, including clustering and classification, to analyze the data. They also implement multiple classifiers to predict the risk of a cryptocurrency.
  • results: The analysis reveals that a significant proportion of cryptocurrencies disappear from the market, and that there is a negative correlation between price and maximum and total supply, as well as a weak positive correlation with 24-hour trading volume. The authors also cluster cryptocurrencies into five distinct groups using their on-chain parameters, and achieve an f1-score of 76% using K-Nearest Neighbor to predict the risk of a cryptocurrency.
    Abstract Cryptocurrencies have become a popular and widely researched topic of interest in recent years for investors and scholars. In order to make informed investment decisions, it is essential to comprehend the factors that impact cryptocurrency prices and to identify risky cryptocurrencies. This paper focuses on analyzing historical data and using artificial intelligence algorithms on on-chain parameters to identify the factors affecting a cryptocurrency's price and to find risky cryptocurrencies. We conducted an analysis of historical cryptocurrencies' on-chain data and measured the correlation between the price and other parameters. In addition, we used clustering and classification in order to get a better understanding of a cryptocurrency and classify it as risky or not. The analysis revealed that a significant proportion of cryptocurrencies (39%) disappeared from the market, while only a small fraction (10%) survived for more than 1000 days. Our analysis revealed a significant negative correlation between cryptocurrency price and maximum and total supply, as well as a weak positive correlation between price and 24-hour trading volume. Moreover, we clustered cryptocurrencies into five distinct groups using their on-chain parameters, which provides investors with a more comprehensive understanding of a cryptocurrency when compared to those clustered with it. Finally, by implementing multiple classifiers to predict whether a cryptocurrency is risky or not, we obtained the best f1-score of 76% using K-Nearest Neighbor.
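
The pipeline (correlations over on-chain parameters, clustering into five groups, then risk classification) maps onto standard scikit-learn tooling. A hedged sketch follows; the input file and column names are hypothetical stand-ins for the paper's on-chain features.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("onchain.csv")  # hypothetical table of on-chain parameters
features = ["max_supply", "total_supply", "volume_24h", "market_cap"]

# correlation of price with the on-chain features
print(df[features + ["price"]].corr()["price"])

# cluster coins into five groups on their on-chain parameters
df["cluster"] = KMeans(n_clusters=5, n_init=10).fit_predict(df[features])

# predict risk with KNN; "risky" is a hypothetical 0/1 label column
X_tr, X_te, y_tr, y_te = train_test_split(df[features], df["risky"], test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("f1:", f1_score(y_te, knn.predict(X_te)))  # the paper reports 76% with KNN
```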

Controlling Character Motions without Observable Driving Source

  • paper_url: http://arxiv.org/abs/2308.06025
  • repo_url: None
  • paper_authors: Weiyuan Li, Bin Dai, Ziyi Zhou, Qi Yao, Baoyuan Wang
  • for: generating diverse, life-like, and unlimited-length head/body motion sequences without any driving source
  • methods: combines a VQ-VAE with a token-level control policy trained via reinforcement learning with carefully designed reward functions, addressing challenges such as the out-of-distribution (OOD) issue, insufficient diversity, and undesired periodic patterns
  • results: significantly outperforms other strong baselines in generating diverse and natural motion sequences
    Abstract How to generate diverse, life-like, and unlimited long head/body sequences without any driving source? We argue that this under-investigated research problem is non-trivial at all, and has unique technical challenges behind it. Without semantic constraints from the driving sources, using the standard autoregressive model to generate infinitely long sequences would easily result in 1) out-of-distribution (OOD) issue due to the accumulated error, 2) insufficient diversity to produce natural and life-like motion sequences and 3) undesired periodic patterns along the time. To tackle the above challenges, we propose a systematic framework that marries the benefits of VQ-VAE and a novel token-level control policy trained with reinforcement learning using carefully designed reward functions. A high-level prior model can be easily injected on top to generate unlimited long and diverse sequences. Although we focus on no driving sources now, our framework can be generalized for controlled synthesis with explicit driving sources. Through comprehensive evaluations, we conclude that our proposed framework can address all the above-mentioned challenges and outperform other strong baselines very significantly.

Evaluating Picture Description Speech for Dementia Detection using Image-text Alignment

  • paper_url: http://arxiv.org/abs/2308.07933
  • repo_url: None
  • paper_authors: Youxiang Zhu, Nana Lin, Xiaohui Liang, John A. Batsis, Robert M. Roth, Brian MacWhinney
  • for: This study aims to improve the accuracy of dementia detection from picture description speech by using the picture itself alongside the transcribed text.
  • methods: The approach leverages knowledge from large pre-trained image-text alignment models and pre-processes samples by identifying focused areas of the picture and scoring each sentence's relevance to the picture.
  • results: Using both the picture and the text improves detection accuracy, reaching a state-of-the-art 83.44%, higher than the 79.91% of a text-only baseline.
    Abstract Using picture description speech for dementia detection has been studied for 30 years. Despite the long history, previous models focus on identifying the differences in speech patterns between healthy subjects and patients with dementia but do not utilize the picture information directly. In this paper, we propose the first dementia detection models that take both the picture and the description texts as inputs and incorporate knowledge from large pre-trained image-text alignment models. We observe the difference between dementia and healthy samples in terms of the text's relevance to the picture and the focused area of the picture. We thus consider such a difference could be used to enhance dementia detection accuracy. Specifically, we use the text's relevance to the picture to rank and filter the sentences of the samples. We also identified focused areas of the picture as topics and categorized the sentences according to the focused areas. We propose three advanced models that pre-processed the samples based on their relevance to the picture, sub-image, and focused areas. The evaluation results show that our advanced models, with knowledge of the picture and large image-text alignment models, achieve state-of-the-art performance with the best detection accuracy at 83.44%, which is higher than the text-only baseline model at 79.91%. Lastly, we visualize the sample and picture results to explain the advantages of our models.
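
The sentence-to-picture relevance scoring can be approximated with any off-the-shelf image-text alignment model. The sketch below ranks transcript sentences against the stimulus picture using a generic CLIP checkpoint; the file name is hypothetical, and this is an approximation of the paper's pre-processing, not its exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cookie_theft.png")  # hypothetical stimulus picture
sentences = ["the boy is on a stool", "I had tea this morning"]

inputs = processor(text=sentences, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image[0]  # relevance of each sentence

ranked = sorted(zip(scores.tolist(), sentences), reverse=True)  # filter/rank by relevance
```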

Large Language Models for Telecom: Forthcoming Impact on the Industry

  • paper_url: http://arxiv.org/abs/2308.06013
  • repo_url: None
  • paper_authors: Ali Maatouk, Nicola Piovesan, Fadhel Ayed, Antonio De Domenico, Merouane Debbah
  • for: This work examines the forthcoming impact of large language models (LLMs) on the telecom industry and how the technology can be leveraged to improve operational efficiency and make better use of engineering expertise.
  • methods: The paper reviews the inner workings of LLMs and their current capabilities and limitations, surveys readily implementable use cases, and discusses essential research directions.
  • results: The paper identifies use cases that can be readily implemented in telecom, streamlining tasks that currently demand significant manpower and engineering expertise, and uncovers research directions that address the distinctive challenges of applying LLMs within the telecom domain.
    Abstract Large Language Models (LLMs) have emerged as a transformative force, revolutionizing numerous fields well beyond the conventional domain of Natural Language Processing (NLP) and garnering unprecedented attention. As LLM technology continues to progress, the telecom industry is facing the prospect of its potential impact on its landscape. To elucidate these implications, we delve into the inner workings of LLMs, providing insights into their current capabilities and limitations. We also examine the use cases that can be readily implemented in the telecom industry, streamlining numerous tasks that currently hinder operational efficiency and demand significant manpower and engineering expertise. Furthermore, we uncover essential research directions that deal with the distinctive challenges of utilizing the LLMs within the telecom domain. Addressing these challenges represents a significant stride towards fully harnessing the potential of LLMs and unlocking their capabilities to the fullest extent within the telecom domain.

Does AI for science need another ImageNet Or totally different benchmarks? A case study of machine learning force fields

  • paper_url: http://arxiv.org/abs/2308.05999
  • repo_url: None
  • paper_authors: Yatao Li, Wanling Gao, Lei Wang, Lixin Sun, Zun Wang, Jianfeng Zhan
  • for: This paper investigates how to effectively benchmark AI for science (AI4S), where machine learning methods are used to improve the accuracy and speed of scientific computing tasks.
  • methods: Using machine learning force fields (MLFF) as a case study, the paper identifies scientifically meaningful benchmarking opportunities that current evaluations miss and proposes solutions for evaluating MLFF models.
  • results: The paper proposes performance metrics covering sample efficiency, time-domain sensitivity, and cross-dataset generalization, which better reflect how models perform in real scientific applications.
    Abstract AI for science (AI4S) is an emerging research field that aims to enhance the accuracy and speed of scientific computing tasks using machine learning methods. Traditional AI benchmarking methods struggle to adapt to the unique challenges posed by AI4S because they assume data in training, testing, and future real-world queries are independent and identically distributed, while AI4S workloads anticipate out-of-distribution problem instances. This paper investigates the need for a novel approach to effectively benchmark AI for science, using the machine learning force field (MLFF) as a case study. MLFF is a method to accelerate molecular dynamics (MD) simulation with low computational cost and high accuracy. We identify various missed opportunities in scientifically meaningful benchmarking and propose solutions to evaluate MLFF models, specifically in the aspects of sample efficiency, time domain sensitivity, and cross-dataset generalization capabilities. By setting up the problem instantiation similar to the actual scientific applications, more meaningful performance metrics from the benchmark can be achieved. This suite of metrics has demonstrated a better ability to assess a model's performance in real-world scientific applications, in contrast to traditional AI benchmarking methodologies. This work is a component of the SAIBench project, an AI4S benchmarking suite. The project homepage is https://www.computercouncil.org/SAIBench.

Automatic Classification of Blood Cell Images Using Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2308.06300
  • repo_url: None
  • paper_authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Abeera Mahfooz
  • for: automatic classification of ten types of blood cells
  • methods: utilized transfer learning with pre-trained CNN models (VGG16, VGG19, ResNet-50, ResNet-101, ResNet-152, InceptionV3, MobileNetV2, and DenseNet-20) and proposed a novel CNN-based framework
  • results: achieved an accuracy of 99.91% on the PBC dataset, outperforming earlier results reported in the literature
    Abstract Human blood primarily comprises plasma, red blood cells, white blood cells, and platelets. It plays a vital role in transporting nutrients to different organs, where it stores essential health-related data about the human body. Blood cells are utilized to defend the body against diverse infections, including fungi, viruses, and bacteria. Hence, blood analysis can help physicians assess an individual's physiological condition. Blood cells have been sub-classified into eight groups: Neutrophils, eosinophils, basophils, lymphocytes, monocytes, immature granulocytes (promyelocytes, myelocytes, and metamyelocytes), erythroblasts, and platelets or thrombocytes on the basis of their nucleus, shape, and cytoplasm. Traditionally, pathologists and hematologists in laboratories have examined these blood cells using a microscope before manually classifying them. The manual approach is slower and more prone to human error. Therefore, it is essential to automate this process. In our paper, transfer learning with CNN pre-trained models. VGG16, VGG19, ResNet-50, ResNet-101, ResNet-152, InceptionV3, MobileNetV2, and DenseNet-20 applied to the PBC dataset's normal DIB. The overall accuracy achieved with these models lies between 91.375 and 94.72%. Hence, inspired by these pre-trained architectures, a model has been proposed to automatically classify the ten types of blood cells with increased accuracy. A novel CNN-based framework has been presented to improve accuracy. The proposed CNN model has been tested on the PBC dataset normal DIB. The outcomes of the experiments demonstrate that our CNN-based framework designed for blood cell classification attains an accuracy of 99.91% on the PBC dataset. Our proposed convolutional neural network model performs competitively when compared to earlier results reported in the literature.
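
Transfer learning of this kind reduces to loading an ImageNet-pretrained backbone and swapping its classification head for one sized to the blood-cell classes. A minimal torchvision sketch with one of the backbones the paper compares; the freezing choice is an illustrative option, not the paper's stated recipe.

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # ten blood cell types, per the paper's task
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# optionally freeze everything except the new head for feature extraction
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc.")
```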

Fast and Accurate Transferability Measurement by Evaluating Intra-class Feature Variance

  • paper_url: http://arxiv.org/abs/2308.05986
  • repo_url: https://github.com/snudatalab/TMI
  • paper_authors: Huiwen Xu, U Kang
  • for: This paper aims to quickly and accurately find the most useful pre-trained model for a downstream task by measuring transferability.
  • methods: The proposed method is called TMI (TRANSFERABILITY MEASUREMENT WITH INTRA-CLASS FEATURE VARIANCE), which measures transferability by evaluating intra-class feature variance.
  • results: Extensive experiments on real-world datasets show that TMI outperforms competitors for selecting the top-5 best models, and exhibits consistently better correlation in 13 out of 17 cases.
    Abstract Given a set of pre-trained models, how can we quickly and accurately find the most useful pre-trained model for a downstream task? Transferability measurement is to quantify how transferable is a pre-trained model learned on a source task to a target task. It is used for quickly ranking pre-trained models for a given task and thus becomes a crucial step for transfer learning. Existing methods measure transferability as the discrimination ability of a source model for a target data before transfer learning, which cannot accurately estimate the fine-tuning performance. Some of them restrict the application of transferability measurement in selecting the best supervised pre-trained models that have classifiers. It is important to have a general method for measuring transferability that can be applied in a variety of situations, such as selecting the best self-supervised pre-trained models that do not have classifiers, and selecting the best transferring layer for a target task. In this work, we propose TMI (TRANSFERABILITY MEASUREMENT WITH INTRA-CLASS FEATURE VARIANCE), a fast and accurate algorithm to measure transferability. We view transferability as the generalization of a pre-trained model on a target task by measuring intra-class feature variance. Intra-class variance evaluates the adaptability of the model to a new task, which measures how transferable the model is. Compared to previous studies that estimate how discriminative the models are, intra-class variance is more accurate than those as it does not require an optimal feature extractor and classifier. Extensive experiments on real-world datasets show that TMI outperforms competitors for selecting the top-5 best models, and exhibits consistently better correlation in 13 out of 17 cases.
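
The core quantity TMI evaluates, the intra-class variance of target features under a pre-trained extractor, is simple to compute. A hedged numpy sketch of that measurement; TMI's exact normalization and aggregation may differ.

```python
import numpy as np

def intra_class_variance(features: np.ndarray, labels: np.ndarray) -> float:
    """Average per-class variance of target features extracted by a
    pre-trained model; tighter classes suggest better adaptability."""
    per_class = []
    for c in np.unique(labels):
        fc = features[labels == c]
        per_class.append(((fc - fc.mean(axis=0)) ** 2).sum(axis=1).mean())
    return float(np.mean(per_class))
```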

Defensive Perception: Estimation and Monitoring of Neural Network Performance under Deployment

  • paper_url: http://arxiv.org/abs/2308.06299
  • repo_url: None
  • paper_authors: Hendrik Vogt, Stefan Buehler, Mark Schutera
  • for: Addressing the issue of unnoticed catastrophic deployment and domain shift in neural networks for semantic segmentation in autonomous driving.
  • methods: Based on the idea that deep learning-based perception for autonomous driving is uncertain and best represented as a probability distribution, the paper proposes to encapsulate the neural network under deployment within an uncertainty estimation envelope using Monte Carlo Dropout.
  • results: The proposed method can estimate a neural network’s performance and monitor and notify of entering domains of reduced neural network performance under deployment, with the potential to improve safety and adaptability of autonomous driving systems.
    Abstract In this paper, we propose a method for addressing the issue of unnoticed catastrophic deployment and domain shift in neural networks for semantic segmentation in autonomous driving. Our approach is based on the idea that deep learning-based perception for autonomous driving is uncertain and best represented as a probability distribution. As autonomous vehicles' safety is paramount, it is crucial for perception systems to recognize when the vehicle is leaving its operational design domain, anticipate hazardous uncertainty, and reduce the performance of the perception system. To address this, we propose to encapsulate the neural network under deployment within an uncertainty estimation envelope that is based on the epistemic uncertainty estimation through the Monte Carlo Dropout approach. This approach does not require modification of the deployed neural network and guarantees expected model performance. Our defensive perception envelope has the capability to estimate a neural network's performance, enabling monitoring and notification of entering domains of reduced neural network performance under deployment. Furthermore, our envelope is extended by novel methods to improve the application in deployment settings, including reducing compute expenses and confining estimation noise. Finally, we demonstrate the applicability of our method for multiple different potential deployment shifts relevant to autonomous driving, such as transitions into the night, rainy, or snowy domain. Overall, our approach shows great potential for application in deployment settings and enables operational design domain recognition via uncertainty, which allows for defensive perception, safe state triggers, warning notifications, and feedback for testing or development and adaptation of the perception stack.
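
The envelope's uncertainty signal comes from Monte Carlo Dropout: keep dropout active at inference, run the same input through the network several times, and read the spread of the softmax outputs as epistemic uncertainty. A minimal sketch of that estimator:

```python
import torch

def enable_dropout(model):
    """Switch only dropout layers to train mode so MC sampling works at inference."""
    for m in model.modules():
        if m.__class__.__name__.startswith("Dropout"):
            m.train()

def mc_dropout_uncertainty(model, x, n_samples: int = 20):
    """Run n stochastic forward passes and return the mean prediction plus
    the per-class predictive variance (the epistemic uncertainty signal)."""
    model.eval()
    enable_dropout(model)
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0)
```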

An Encoder-Decoder Approach for Packing Circles

  • paper_url: http://arxiv.org/abs/2308.07335
  • repo_url: None
  • paper_authors: Akshay Kiran Jose, Gangadhar Karevvanavar, Rajshekhar V Bhat
  • for: packing smaller objects within a larger object, with the requirement that the smaller objects must not overlap and must lie completely inside the larger object.
  • methods: a novel encoder-decoder architecture consisting of an encoder block, a perturbation block, and a decoder block, with the encoder and decoder parameterized by a neural network and optimized to reduce an error between the decoder’s estimated index and the actual index of the circle provided as input to the encoder.
  • results: a sub-optimal solution that can pack smaller objects within a larger object with competitive performance compared to classical methods. The approach can be generalized to pack objects of higher dimensions and different shapes by carefully choosing normalization and perturbation layers.
    Abstract The problem of packing smaller objects within a larger object has been of interest for decades. In these problems, in addition to the requirement that the smaller objects must lie completely inside the larger objects, they are expected to not overlap or have minimum overlap with each other. Due to this, the problem of packing turns out to be a non-convex problem, whose optimal solution is challenging to obtain. As such, several heuristic approaches have been used for obtaining sub-optimal solutions in general, and provably optimal solutions for some special instances. In this paper, we propose a novel encoder-decoder architecture consisting of an encoder block, a perturbation block and a decoder block, for packing identical circles within a larger circle. In our approach, the encoder takes the index of a circle to be packed as an input and outputs its center through a normalization layer, the perturbation layer adds controlled perturbations to the center, ensuring that it does not deviate beyond the radius of the smaller circle to be packed, and the decoder takes the perturbed center as input and estimates the index of the intended circle for packing. We parameterize the encoder and decoder by a neural network and optimize it to reduce an error between the decoder's estimated index and the actual index of the circle provided as input to the encoder. The proposed approach can be generalized to pack objects of higher dimensions and different shapes by carefully choosing normalization and perturbation layers. The approach gives a sub-optimal solution and is able to pack smaller objects within a larger object with competitive performance with respect to classical methods.

Learning nonparametric DAGs with incremental information via high-order HSIC

  • paper_url: http://arxiv.org/abs/2308.05969
  • repo_url: None
  • paper_authors: Yafei Wang, Jianguo Liu
  • for: This work aims to improve structure identification in Bayesian network (BN) learning, which maximizes a global score function.
  • methods: A two-phase optimal-tuning (OT) algorithm is proposed: it first uses an identifiability condition based on a determined subset of parents to identify the underlying directed acyclic graph (DAG), and then combines global optimization with local tuning to improve the score.
  • results: Numerical experiments on synthetic and real-world data show that in the Sigmoid Mix model with graph size d = 40, the structure intervention distance (SID) of the OT algorithm is 329.7 smaller than that of CAM, indicating that the OT algorithm recovers the DAG structure more accurately.
    Abstract Score-based methods for learning Bayesian networks (BN) aim to maximize the global score function. However, if local variables have direct and indirect dependence simultaneously, the global optimization on score functions misses edges between variables with indirect dependent relationships, whose scores are smaller than those with direct dependent relationships. In this paper, we present an identifiability condition based on a determined subset of parents to identify the underlying DAG. By the identifiability condition, we develop a two-phase algorithm, namely the optimal-tuning (OT) algorithm, to locally amend the global optimization. In the optimal phase, an optimization problem based on the first-order Hilbert-Schmidt independence criterion (HSIC) gives an estimated skeleton as the initial determined parents subset. In the tuning phase, the skeleton is locally tuned by deletion, addition and DAG-formalization strategies using the theoretically proved incremental properties of high-order HSIC. Numerical experiments for different synthetic datasets and real-world datasets show that the OT algorithm outperforms existing methods. Especially in the Sigmoid Mix model with the size of the graph being ${\rm\bf d=40}$, the structure intervention distance (SID) of the OT algorithm is 329.7 smaller than the one obtained by CAM, which indicates that the graph estimated by the OT algorithm misses fewer edges compared with CAM.
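
The first-order HSIC used in the optimal phase has a standard biased estimator: with Gram matrices $K$, $L$ and centering matrix $H = I - \mathbf{1}\mathbf{1}^\top/n$, $\mathrm{HSIC} = \operatorname{tr}(KHLH)/(n-1)^2$. A compact numpy sketch with Gaussian kernels; the kernel and bandwidth choices here are illustrative.

```python
import numpy as np

def rbf_gram(x: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian kernel Gram matrix for samples x of shape (n, d)."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased HSIC estimator: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)
```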

Classification of White Blood Cells Using Machine and Deep Learning Models: A Systematic Review

  • paper_url: http://arxiv.org/abs/2308.06296
  • repo_url: None
  • paper_authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Arslan Shaukat
  • for: This review surveys modern approaches to white blood cell classification in medical image analysis.
  • methods: It covers methods based on blood smear images, magnetic resonance imaging (MRI), X-rays, and similar medical imaging modalities, with a detailed analysis of machine learning (ML) and deep learning (DL) techniques applied to white blood cell (WBC) classification.
  • results: The review finds that the use and performance of ML and DL techniques for white blood cell classification have steadily grown over the past 17 years, but challenges remain, including the availability of suitable datasets and the medical training of researchers.
    Abstract Machine learning (ML) and deep learning (DL) models have been employed to significantly improve analyses of medical imagery, with these approaches used to enhance the accuracy of prediction and classification. Model predictions and classifications assist diagnoses of various cancers and tumors. This review presents an in-depth analysis of modern techniques applied within the domain of medical image analysis for white blood cell classification. The methodologies that use blood smear images, magnetic resonance imaging (MRI), X-rays, and similar medical imaging domains are identified and discussed, with a detailed analysis of ML/DL techniques applied to the classification of white blood cells (WBCs) representing the primary focus of the review. The data utilized in this research has been extracted from a collection of 136 primary papers that were published between the years 2006 and 2023. The most widely used techniques and best-performing white blood cell classification methods are identified. While the use of ML and DL for white blood cell classification has concurrently increased and improved in recent years, significant challenges remain - 1) Availability of appropriate datasets remains the primary challenge, and may be resolved using data augmentation techniques. 2) Medical training of researchers is recommended to improve current understanding of white blood cell structure and subsequent selection of appropriate classification models. 3) Advanced DL networks including Generative Adversarial Networks, R-CNN, Fast R-CNN, and Faster R-CNN will likely be increasingly employed to supplement or replace current techniques.

Learned Point Cloud Compression for Classification

  • paper_url: http://arxiv.org/abs/2308.05959
  • repo_url: https://github.com/multimedialabsfu/learned-point-cloud-compression-for-classification
  • paper_authors: Mateen Ulhaq, Ivan V. Bajić
  • for: A codec for machine analysis of 3D point cloud data, targeting machine vision tasks such as classification, object detection, and segmentation.
  • methods: A specialized PointNet-based encoder achieves high compression ratios at low computational cost and can be configured for the hardware resources available.
  • results: On the ModelNet40 dataset it achieves a 94% BD-bitrate reduction over non-specialized codecs while maintaining high accuracy; for low-resource end devices, two lightweight encoder configurations achieve 93% and 92% BD-bitrate reductions with only 3% and 5% drops in top-1 accuracy.
    Abstract Deep learning is increasingly being used to perform machine vision tasks such as classification, object detection, and segmentation on 3D point cloud data. However, deep learning inference is computationally expensive. The limited computational capabilities of end devices thus necessitate a codec for transmitting point cloud data over the network for server-side processing. Such a codec must be lightweight and capable of achieving high compression ratios without sacrificing accuracy. Motivated by this, we present a novel point cloud codec that is highly specialized for the machine task of classification. Our codec, based on PointNet, achieves a significantly better rate-accuracy trade-off in comparison to alternative methods. In particular, it achieves a 94% reduction in BD-bitrate over non-specialized codecs on the ModelNet40 dataset. For low-resource end devices, we also propose two lightweight configurations of our encoder that achieve similar BD-bitrate reductions of 93% and 92% with 3% and 5% drops in top-1 accuracy, while consuming only 0.470 and 0.048 encoder-side kMACs/point, respectively. Our codec demonstrates the potential of specialized codecs for machine analysis of point clouds, and provides a basis for extension to more complex tasks and datasets in the future.

Node Embedding for Homophilous Graphs with ARGEW: Augmentation of Random walks by Graph Edge Weights

  • paper_url: http://arxiv.org/abs/2308.05957
  • repo_url: https://github.com/ncsoft/argew
  • paper_authors: Jun Hee Kim, Jaeman Son, Hyunsoo Kim, Eunjo Lee
  • for: This paper proposes ARGEW, a novel augmentation of random walks by graph edge weights that improves node embedding quality.
  • methods: ARGEW expands the random-walk corpus so that nodes connected by larger edge weights end up with closer embeddings; it is independent of the sampling strategy and works on top of any random walk based node embedding method.
  • results: On several real-world networks, ARGEW makes the desired pattern (closer embeddings for strongly weighted node pairs) much clearer, and node2vec with ARGEW performs strongly and stably on node classification, achieving results similar to a supervised GCN.
    Abstract Representing nodes in a network as dense vectors node embeddings is important for understanding a given network and solving many downstream tasks. In particular, for weighted homophilous graphs where similar nodes are connected with larger edge weights, we desire node embeddings where node pairs with strong weights have closer embeddings. Although random walk based node embedding methods like node2vec and node2vec+ do work for weighted networks via including edge weights in the walk transition probabilities, our experiments show that the embedding result does not adequately reflect edge weights. In this paper, we propose ARGEW (Augmentation of Random walks by Graph Edge Weights), a novel augmentation method for random walks that expands the corpus in such a way that nodes with larger edge weights end up with closer embeddings. ARGEW can work with any random walk based node embedding method, because it is independent of the random sampling strategy itself and works on top of the already-performed walks. With several real-world networks, we demonstrate that with ARGEW, compared to not using it, the desired pattern that node pairs with larger edge weights have closer embeddings is much clearer. We also examine ARGEW's performance in node classification: node2vec with ARGEW outperforms pure node2vec and is not sensitive to hyperparameters (i.e. consistently good). In fact, it achieves similarly good results as supervised GCN, even without any node feature or label information during training. Finally, we explain why ARGEW works consistently well by exploring the coappearance distributions using a synthetic graph with clear structural roles.
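
The augmentation idea, namely making node pairs joined by heavier edges co-occur more often in the training corpus, can be sketched by appending extra co-occurrence pairs in proportion to edge weight. This is a hedged illustration of the effect; ARGEW's actual expansion rule is more carefully calibrated.

```python
import networkx as nx

def augment_walks(G: nx.Graph, walks, max_copies: int = 5):
    """Append extra (u, v) co-occurrence pairs so that heavier edges appear
    more often in the corpus fed to skip-gram training. Assumes every edge
    carries a "weight" attribute."""
    w_max = max(d["weight"] for _, _, d in G.edges(data=True))
    corpus = [list(walk) for walk in walks]
    for walk in walks:
        for u, v in zip(walk, walk[1:]):
            copies = round(max_copies * G[u][v]["weight"] / w_max)
            corpus.extend([[u, v]] * copies)
    return corpus
```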

INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing

  • paper_url: http://arxiv.org/abs/2308.05930
  • repo_url: None
  • paper_authors: Stefan Abi-Karam, Rishov Sarkar, Dejia Xu, Zhiwen Fan, Zhangyang Wang, Cong Hao
  • for: A dataflow architecture and compiler framework designed to make nth-order gradient computation efficient.
  • methods: The architecture uses FIFO streams and an optimized computation kernel library to achieve high memory efficiency and parallel computation.
  • results: Benchmarked on INR editing workloads, it achieves 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines, with 3.1-8.9x and 1.7-4.3x lower memory usage and 1.7-11.3x and 5.5-32.8x lower energy-delay product, respectively.
    Abstract An increasing number of researchers are finding use for nth-order gradient computations for a wide variety of applications, including graphics, meta-learning (MAML), scientific computing, and most recently, implicit neural representations (INRs). Recent work shows that the gradient of an INR can be used to edit the data it represents directly without needing to convert it back to a discrete representation. However, given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient due to the higher demand for computing power and higher complexity in data movement. This makes it a promising target for FPGA acceleration. In this work, we introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We address this problem in two phases. First, we design a dataflow architecture that uses FIFO streams and an optimized computation kernel library, ensuring high memory efficiency and parallel computation. Second, we propose a compiler that extracts and optimizes computation graphs, automatically configures hardware parameters such as latency and stream depths to optimize throughput, while ensuring deadlock-free operation, and outputs High-Level Synthesis (HLS) code for FPGA implementation. We utilize INR editing as our benchmark, presenting results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively. Furthermore, we obtain 3.1-8.9x and 1.7-4.3x lower memory usage, and 1.7-11.3x and 5.5-32.8x lower energy-delay product. Our framework will be made open-source and available on GitHub.
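
The workload being accelerated, arbitrary-order gradients through a network, is what repeated automatic differentiation with `create_graph=True` computes in PyTorch. A small reference implementation of the software baseline that INR-Arch targets:

```python
import torch

def nth_order_grad(f, x: torch.Tensor, n: int) -> torch.Tensor:
    """Differentiate f at x a total of n times via autograd, rebuilding the
    graph at each step so higher orders remain differentiable."""
    x = x.requires_grad_(True)
    g = f(x)
    for _ in range(n):
        (g,) = torch.autograd.grad(g.sum(), x, create_graph=True)
    return g

x = torch.linspace(-1, 1, 5)
print(nth_order_grad(torch.sin, x, 2))  # d^2/dx^2 sin(x) = -sin(x)
```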
    摘要 越来越多的研究人员发现n阶梯度计算可用于各种应用领域,包括图形学、元学习(MAML)、科学计算以及最近的隐式神经表示(INR)。最新研究表明,INR的梯度可以直接编辑其所表示的数据,而无需将其转换回离散表示。然而,对于以计算图表示的函数,传统架构在计算其n阶梯度时面临更高的算力需求和更复杂的数据搬运,这使其成为FPGA加速的理想目标。在这项工作中,我们介绍INR-Arch框架,它将n阶梯度的计算图转换为硬件优化的数据流体系结构。我们分两个阶段解决该问题。首先,我们设计了一种数据流体系结构,使用FIFO流和优化的计算核函数库,以确保高内存效率和并行计算。其次,我们提出了一种编译器,可以提取并优化计算图,自动配置延迟和流深度等硬件参数以优化吞吐量,同时保证无死锁运行,并输出用于FPGA实现的高层次综合(HLS)代码。我们以INR编辑作为基准任务,结果显示与CPU和GPU基线相比,分别实现1.8-4.8倍和1.5-3.6倍的加速;此外,内存占用降低3.1-8.9倍和1.7-4.3倍,能量-延迟积降低1.7-11.3倍和5.5-32.8倍。我们的框架将开源并在GitHub上提供。
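
To make concrete what an nth-order gradient computation graph looks like before it is compiled to hardware, here is a minimal PyTorch sketch; PyTorch is our illustration only, since the paper targets FPGA dataflow rather than this API. Repeated `autograd.grad` calls with `create_graph=True` build exactly the kind of nested graph INR-Arch extracts and optimizes.

```python
import torch

def nth_order_grad(f, x, n):
    """d^n f / dx^n of a scalar-per-sample network, via repeated
    reverse-mode differentiation; each pass enlarges the graph."""
    x = x.requires_grad_(True)
    y = f(x)
    for _ in range(n):
        (y,) = torch.autograd.grad(y.sum(), x, create_graph=True)
    return y

inr = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
coords = torch.linspace(0.0, 1.0, 8).unsqueeze(-1)
print(nth_order_grad(inr, coords, n=3).shape)  # torch.Size([8, 1])
```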

On the equivalence of Occam algorithms

  • paper_url: http://arxiv.org/abs/2308.05906
  • repo_url: None
  • paper_authors: Zaman Keinath-Esmail
  • for: 本研究旨在为以Board和Pitt的部分逆定理为基础的各种理论结果和算法设计方法提供事后正当性。
  • methods: 本研究分析了Occam算法,以及在例外列表下封闭的概念类。
  • results: 本研究证明了该部分逆定理同样适用于输出复杂度与$\delta$无关的Occam算法。
    Abstract Blumer et al. (1987, 1989) showed that any concept class that is learnable by Occam algorithms is PAC learnable. Board and Pitt (1990) showed a partial converse of this theorem: for concept classes that are closed under exception lists, any class that is PAC learnable is learnable by an Occam algorithm. However, their Occam algorithm outputs a hypothesis whose complexity is $\delta$-dependent, which is an important limitation. In this paper, we show that their partial converse applies to Occam algorithms with $\delta$-independent complexities as well. Thus, we provide a posteriori justification of various theoretical results and algorithm design methods which use the partial converse as a basis for their work.
    摘要 布卢默等人(1987、1989)证明,任何可由Occam算法学习的概念类都是PAC可学习的。Board和Pitt(1990)证明了该定理的部分逆定理:对于在例外列表下封闭的概念类,任何PAC可学习的概念类都可以由Occam算法学习。然而,他们的Occam算法所输出假设的复杂度依赖于$\delta$,这是一个重要的限制。在本文中,我们证明他们的部分逆定理同样适用于复杂度与$\delta$无关的Occam算法。因此,我们为以该部分逆定理为基础的各种理论结果和算法设计方法提供了事后正当性。
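
For readers unfamiliar with the objects involved, the following sketch states the standard definitions; the notation and constants are a reconstruction from the usual literature and may differ slightly from Blumer et al. and from Board and Pitt.

```latex
% An (alpha, beta)-Occam algorithm and the Occam -> PAC direction
% (notation reconstructed from the standard literature).
An algorithm $L$ is an \emph{Occam algorithm} for concept class
$\mathcal{C}$ with constants $\alpha \ge 1$ and $0 \le \beta < 1$ if,
given $m$ examples of any $c \in \mathcal{C}_n$, it outputs a consistent
hypothesis $h$ with
\[
  \mathrm{size}(h) \;\le\; \big(n \cdot \mathrm{size}(c)\big)^{\alpha}\, m^{\beta}.
\]
Blumer et al.\ show that such an $L$ PAC-learns $\mathcal{C}$ using
\[
  m \;=\; O\!\left(\frac{1}{\varepsilon}\log\frac{1}{\delta}
      + \Big(\frac{(n \cdot \mathrm{size}(c))^{\alpha}}{\varepsilon}\Big)^{1/(1-\beta)}\right)
\]
examples. The paper's contribution concerns the converse direction, where
the output-complexity bound above must not be allowed to depend on $\delta$.
```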

Comparing the quality of neural network uncertainty estimates for classification problems

  • paper_url: http://arxiv.org/abs/2308.05903
  • repo_url: None
  • paper_authors: Daniel Ries, Joshua Michalenko, Tyler Ganter, Rashad Imad-Fayez Baiyasi, Jason Adams
  • for: 这个论文的目的是评估深度学习模型的不确定性评估方法的质量。
  • methods: 这篇论文使用频率学派的统计方法,即区间覆盖率和区间宽度,来评估可信区间的质量;同时使用预期校准误差(expected calibration error)来评估分类预测置信度。
  • results: 研究发现,不同的深度学习模型在同一数据集上会产生质量差异显著的不确定性估计:作为金标准的MCMC整体表现最佳,bootstrapped NN紧随其后,且计算开销与DE相当。由此表明,需要一种有原则的评估方法来选择合适的UQ方法。
    Abstract Traditional deep learning (DL) models are powerful classifiers, but many approaches do not provide uncertainties for their estimates. Uncertainty quantification (UQ) methods for DL models have received increased attention in the literature due to their usefulness in decision making, particularly for high-consequence decisions. However, there has been little research done on how to evaluate the quality of such methods. We use statistical methods of frequentist interval coverage and interval width to evaluate the quality of credible intervals, and expected calibration error to evaluate classification predicted confidence. These metrics are evaluated on Bayesian neural networks (BNN) fit using Markov Chain Monte Carlo (MCMC) and variational inference (VI), bootstrapped neural networks (NN), Deep Ensembles (DE), and Monte Carlo (MC) dropout. We apply these different UQ for DL methods to a hyperspectral image target detection problem and show the inconsistency of the different methods' results and the necessity of a UQ quality metric. To reconcile these differences and choose a UQ method that appropriately quantifies the uncertainty, we create a simulated data set with fully parameterized probability distribution for a two-class classification problem. The gold standard MCMC performs the best overall, and the bootstrapped NN is a close second, requiring the same computational expense as DE. Through this comparison, we demonstrate that, for a given data set, different models can produce uncertainty estimates of markedly different quality. This in turn points to a great need for principled assessment methods of UQ quality in DL applications.
    摘要 传统深度学习(DL)模型是强大的分类器,但许多方法并不为其估计提供不确定性。由于不确定性量化(UQ)方法在决策(尤其是高风险决策)中非常有用,DL模型的UQ方法在文献中受到越来越多的关注;然而,关于如何评估这类方法质量的研究却很少。我们使用频率学派的区间覆盖率和区间宽度来评估可信区间的质量,并使用预期校准误差来评估分类预测置信度。这些指标在以下模型上进行评估:用马尔可夫链蒙特卡洛(MCMC)和变分推断(VI)训练的贝叶斯神经网络(BNN)、bootstrapped神经网络(NN)、深度集成(DE)以及蒙特卡洛(MC)dropout。我们将这些不同的DL不确定性量化方法应用于一个高光谱图像目标检测问题,展示了不同方法结果之间的不一致性以及UQ质量指标的必要性。为调和这些差异并选择能够恰当量化不确定性的UQ方法,我们为一个二分类问题构造了概率分布完全参数化的模拟数据集。作为金标准的MCMC整体表现最佳,bootstrapped NN紧随其后,且其计算开销与DE相当。通过这一比较,我们证明:对于给定数据集,不同模型可以产生质量差异显著的不确定性估计,这也表明DL应用中迫切需要有原则的UQ质量评估方法。
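
The evaluation metrics named in the abstract are standard and easy to compute. A minimal sketch follows; the array shapes, bin count, and the synthetic usage example are our assumptions for illustration.

```python
import numpy as np

def interval_coverage_and_width(samples, y_true, level=0.95):
    """Frequentist quality of credible intervals built from posterior
    predictive samples; samples has shape [n_draws, n_points]."""
    lo = np.quantile(samples, (1 - level) / 2, axis=0)
    hi = np.quantile(samples, 1 - (1 - level) / 2, axis=0)
    coverage = np.mean((y_true >= lo) & (y_true <= hi))  # want ~= level
    width = np.mean(hi - lo)                              # smaller is sharper
    return coverage, width

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: population-weighted |accuracy - mean confidence| over bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = (rng.uniform(size=1000) < conf).astype(float)  # well calibrated
print(expected_calibration_error(conf, correct))          # should be small
```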

Target Detection on Hyperspectral Images Using MCMC and VI Trained Bayesian Neural Networks

  • paper_url: http://arxiv.org/abs/2308.06293
  • repo_url: None
  • paper_authors: Daniel Ries, Jason Adams, Joshua Zollweg
  • for: 这个论文是为了提供一种 bayesian neural network (BNN) 的 uncertainty quantification (UQ) 方法,以便在图像分类任务中提供更加可靠的预测和估计。
  • methods: 这个论文使用了 Markov Chain Monte Carlo (MCMC) 和 variational inference (VI) 两种方法来训练 BNN,并对这两种方法的效果进行比较。
  • results: 论文在 hyperspectral imagery (HSI) 领域中进行了 target detection 任务,并通过对 MCMC- 和 VI-trained BNN 的比较来显示这两种方法在不同场景下的效果。结果显示,两种方法都能够在一个高精度 HSI 目标检测场景中表现良好。
    Abstract Neural networks (NN) have become almost ubiquitous with image classification, but in their standard form produce point estimates, with no measure of confidence. Bayesian neural networks (BNN) provide uncertainty quantification (UQ) for NN predictions and estimates through the posterior distribution. As NN are applied in more high-consequence applications, UQ is becoming a requirement. BNN provide a solution to this problem by not only giving accurate predictions and estimates, but also an interval that includes reasonable values within a desired probability. Despite their positive attributes, BNN are notoriously difficult and time consuming to train. Traditional Bayesian methods use Markov Chain Monte Carlo (MCMC), but this is often brushed aside as being too slow. The most common method is variational inference (VI) due to its fast computation, but there are multiple concerns with its efficacy. We apply and compare MCMC- and VI-trained BNN in the context of target detection in hyperspectral imagery (HSI), where materials of interest can be identified by their unique spectral signature. This is a challenging field, due to the numerous permuting effects practical collection of HSI has on measured spectra. Both models are trained using out-of-the-box tools on a high fidelity HSI target detection scene. Both MCMC- and VI-trained BNN perform well overall at target detection on a simulated HSI scene. This paper provides an example of how to utilize the benefits of UQ, but also to increase awareness that different training methods can give different results for the same model. If sufficient computational resources are available, the best approach rather than the fastest or most efficient should be used, especially for high consequence problems.
    摘要 神经网络(NN)在图像分类中几乎无处不在,但其标准形式只产生点估计,不提供置信度度量。贝叶斯神经网络(BNN)通过后验分布为NN的预测和估计提供不确定性量化(UQ)。随着NN被应用于更多高风险场景,UQ正在成为一项必要要求。BNN不仅能给出准确的预测和估计,还能给出一个在期望概率下包含合理取值的区间,从而解决这一问题。尽管具有这些优点,BNN的训练却以困难和耗时著称。传统贝叶斯方法使用马尔可夫链蒙特卡洛(MCMC),但它常因速度过慢而被搁置;最常用的方法是计算快速的变分推断(VI),但其有效性存在诸多疑虑。我们在高光谱图像(HSI)目标检测的背景下应用并比较了MCMC与VI训练的BNN;在该领域中,感兴趣的材料可通过其独特的光谱特征来识别。由于HSI的实际采集会对测得光谱产生多种扰动效应,这是一个具有挑战性的领域。两种模型均使用现成工具在高保真HSI目标检测场景上训练,并且在模拟HSI场景的目标检测中总体表现良好。本文不仅展示了如何利用UQ的优势,还提醒研究者:同一模型采用不同训练方法可能得到不同结果。在计算资源充足时,应选择最佳方法而非最快或最高效的方法,尤其是对于高风险问题。

The divergence time of protein structures modelled by Markov matrices and its relation to the divergence of sequences

  • paper_url: http://arxiv.org/abs/2308.06292
  • repo_url: None
  • paper_authors: Sandun Rajapaksa, Lloyd Allison, Peter J. Stuckey, Maria Garcia de la Banda, Arun S. Konagurthu
  • for: 这篇论文旨在建立一个时间参数化的统计模型,以量化蛋白质结构的分歧演化。
  • methods: 作者利用大量蛋白质三维结构比对,在最小消息长度(Minimum Message Length)的贝叶斯与信息论框架下,推断出时间参数化的随机矩阵及相应的Dirichlet模型。
  • results: 该方法在二级结构预测上与常用于该任务的神经网络架构相比具有竞争力,并得出了结构的马尔可夫分歧时间与序列分歧时间之间的关系。
    Abstract A complete time-parameterized statistical model quantifying the divergent evolution of protein structures in terms of the patterns of conservation of their secondary structures is inferred from a large collection of protein 3D structure alignments. This provides a better alternative to time-parameterized sequence-based models of protein relatedness, that have clear limitations dealing with twilight and midnight zones of sequence relationships. Since protein structures are far more conserved due to the selection pressure directly placed on their function, divergence time estimates can be more accurate when inferred from structures. We use the Bayesian and information-theoretic framework of Minimum Message Length to infer a time-parameterized stochastic matrix (accounting for perturbed structural states of related residues) and associated Dirichlet models (accounting for insertions and deletions during the evolution of protein domains). These are used in concert to estimate the Markov time of divergence of tertiary structures, a task previously only possible using proxies (like RMSD). By analyzing one million pairs of homologous structures, we yield a relationship between the Markov divergence time of structures and of sequences. Using these inferred models and the relationship between the divergence of sequences and structures, we demonstrate a competitive performance in secondary structure prediction against neural network architectures commonly employed for this task. The source code and supplementary information are downloadable from \url{http://lcb.infotech.monash.edu.au/sstsum}.
    摘要 我们从大量蛋白质三维结构比对中推断出一个完整的时间参数化统计模型,以二级结构的保守模式来量化蛋白质结构的分歧演化。这为基于序列的时间参数化蛋白质相关性模型提供了更好的替代方案,后者在处理序列关系的"黄昏区"和"午夜区"时存在明显局限。由于蛋白质结构直接承受作用于其功能的选择压力而保守得多,从结构推断的分歧时间估计可以更加准确。我们使用最小消息长度的贝叶斯与信息论框架,推断一个时间参数化的随机矩阵(刻画相关残基结构状态的扰动)以及相应的Dirichlet模型(刻画蛋白质结构域演化过程中的插入与删除)。两者结合用于估计三级结构的马尔可夫分歧时间;此前该任务只能借助代理指标(如RMSD)完成。通过分析一百万对同源结构,我们得出了结构马尔可夫分歧时间与序列分歧时间之间的关系。利用这些推断出的模型以及序列与结构分歧之间的关系,我们在二级结构预测上展示了与该任务常用的神经网络架构相当的竞争力。源代码和补充信息可从 \url{http://lcb.infotech.monash.edu.au/sstsum} 下载。
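
As a simplified stand-in for the paper's MML-based inference, the sketch below estimates a Markov divergence time by maximum likelihood under a time-parameterized stochastic matrix M(t) = expm(t*Q); deriving Q from a unit-time matrix via `logm`, the toy states, and the search bounds are all illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.optimize import minimize_scalar

def divergence_time(M1, pairs):
    """MLE of the Markov divergence time t for aligned state pairs (a, b),
    under M(t) = expm(t * Q), with Q reconstructed from the unit-time
    stochastic matrix M1. A toy stand-in for the paper's MML framework."""
    Q = logm(M1).real                       # implied rate (generator) matrix
    a, b = np.asarray(pairs).T

    def nll(t):
        p = np.clip(expm(t * Q)[a, b], 1e-12, None)
        return -np.log(p).sum()

    return minimize_scalar(nll, bounds=(1e-3, 10.0), method="bounded").x

# Toy 3-state example: aligned pairs drawn near one unit of evolution.
M1 = np.array([[0.90, 0.05, 0.05],
               [0.05, 0.90, 0.05],
               [0.05, 0.05, 0.90]])
print(divergence_time(M1, [(0, 0), (1, 1), (2, 0), (0, 0)]))
```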

Learning to Team-Based Navigation: A Review of Deep Reinforcement Learning Techniques for Multi-Agent Pathfinding

  • paper_url: http://arxiv.org/abs/2308.05893
  • repo_url: None
  • paper_authors: Jaehoon Chung, Jamil Fayyad, Younes Al Younes, Homayoun Najjaran
  • for: 本文主要探讨多智能体路径规划(MAPF)领域中基于深度强化学习(DRL)方法的应用。
  • methods: 本文重点介绍将DRL整合进MAPF解决方案的研究,并提供一套统一且解释清晰的评价指标。
  • results: 本文通过评述多种MAPF算法,指出了现有解决方案的不足,并论证了以基于模型的DRL作为未来发展方向的潜力。
    Abstract Multi-agent pathfinding (MAPF) is a critical field in many large-scale robotic applications, often being the fundamental step in multi-agent systems. The increasing complexity of MAPF in complex and crowded environments, however, critically diminishes the effectiveness of existing solutions. In contrast to other studies that have either presented a general overview of the recent advancements in MAPF or extensively reviewed Deep Reinforcement Learning (DRL) within multi-agent system settings independently, our work presented in this review paper focuses on highlighting the integration of DRL-based approaches in MAPF. Moreover, we aim to bridge the current gap in evaluating MAPF solutions by addressing the lack of unified evaluation metrics and providing comprehensive clarification on these metrics. Finally, our paper discusses the potential of model-based DRL as a promising future direction and provides its required foundational understanding to address current challenges in MAPF. Our objective is to assist readers in gaining insight into the current research direction, providing unified metrics for comparing different MAPF algorithms and expanding their knowledge of model-based DRL to address the existing challenges in MAPF.
    摘要 多智能体路径规划(MAPF)是许多大规模机器人应用中的关键领域,通常是多智能体系统的基础步骤。然而,在复杂和拥挤环境中,MAPF的复杂性不断增加,严重削弱了现有解决方案的有效性。其他研究或是概述了MAPF的最新进展,或是单独全面回顾了多智能体系统背景下的DRL;与之不同,本综述重点介绍基于DRL的方法在MAPF中的整合。此外,针对MAPF解决方案评价中缺乏统一评价指标的问题,我们提供了对这些指标的全面阐释,以弥合当前差距。最后,本文讨论了基于模型的DRL作为有前景的未来方向的潜力,并提供了应对当前MAPF挑战所需的基础理解。我们的目标是帮助读者了解当前研究方向,提供比较不同MAPF算法的统一指标,并扩展其对基于模型的DRL的认识,以解决MAPF中的现有挑战。

DF2: Distribution-Free Decision-Focused Learning

  • paper_url: http://arxiv.org/abs/2308.05889
  • repo_url: None
  • paper_authors: Lingkai Kong, Wenhao Mu, Jiaming Cui, Yuchen Zhuang, B. Aditya Prakash, Bo Dai, Chao Zhang
  • for: 这篇论文旨在解决"先预测后优化"问题中的三个瓶颈,即模型失配误差、样本平均近似误差和梯度近似误差。
  • methods: 我们提出了DF2方法,这是第一个分布无关(distribution-free)的决策关注学习方法,无需任务特定的预测器,而是在训练中直接学习期望优化函数。我们开发了一种基于注意力的分布参数化模型架构,以便高效地学习该函数。
  • results: 我们在一个合成问题、一个风力发电竞价问题和一个非凸疫苗分配问题上评估了DF2方法,结果显示DF2方法有效。
    Abstract Decision-focused learning (DFL) has recently emerged as a powerful approach for predict-then-optimize problems by customizing a predictive model to a downstream optimization task. However, existing end-to-end DFL methods are hindered by three significant bottlenecks: model mismatch error, sample average approximation error, and gradient approximation error. Model mismatch error stems from the misalignment between the model's parameterized predictive distribution and the true probability distribution. Sample average approximation error arises when using finite samples to approximate the expected optimization objective. Gradient approximation error occurs as DFL relies on the KKT condition for exact gradient computation, while most methods approximate the gradient for backpropagation in non-convex objectives. In this paper, we present DF2 -- the first \textit{distribution-free} decision-focused learning method explicitly designed to address these three bottlenecks. Rather than depending on a task-specific forecaster that requires precise model assumptions, our method directly learns the expected optimization function during training. To efficiently learn the function in a data-driven manner, we devise an attention-based model architecture inspired by the distribution-based parameterization of the expected objective. Our method is, to the best of our knowledge, the first to address all three bottlenecks within a single model. We evaluate DF2 on a synthetic problem, a wind power bidding problem, and a non-convex vaccine distribution problem, demonstrating the effectiveness of DF2.
    摘要 决策关注学习(DFL)近来成为解决"先预测后优化"问题的有力方法,通过将预测模型定制到下游优化任务。然而,现有的端到端DFL方法受到三大瓶颈制约:模型失配误差、样本平均近似误差和梯度近似误差。模型失配误差源于模型参数化的预测分布与真实概率分布之间的不一致;样本平均近似误差出现在使用有限样本近似期望优化目标时;梯度近似误差则是因为DFL依赖KKT条件进行精确梯度计算,而大多数方法在非凸目标上只能近似梯度以进行反向传播。在本文中,我们提出DF2,第一个显式针对这三个瓶颈设计的分布无关决策关注学习方法。我们的方法不依赖需要精确模型假设的任务特定预测器,而是在训练中直接学习期望优化函数。为了以数据驱动方式高效学习该函数,我们设计了一种受期望目标的分布式参数化启发的注意力模型架构。据我们所知,我们的方法是第一个在单一模型中同时解决全部三个瓶颈的方法。我们在一个合成问题、一个风力发电竞价问题和一个非凸疫苗分配问题上评估了DF2,证明了其有效性。

GPLaSDI: Gaussian Process-based Interpretable Latent Space Dynamics Identification through Deep Autoencoder

  • paper_url: http://arxiv.org/abs/2308.05882
  • repo_url: https://github.com/llnl/gplasdi
  • paper_authors: Christophe Bonneville, Youngsoo Choi, Debojyoti Ghosh, Jonathan L. Belof
  • for: This paper aims to develop a novel reduced-order model (ROM) framework called GPLaSDI, which leverages Gaussian processes (GPs) for latent space ODE interpolations and provides accurate and efficient predictions for partial differential equations (PDEs).
  • methods: The proposed method, GPLaSDI, uses autoencoders to map full-order PDE solutions to a latent space and learns the system of ODEs governing the latent space dynamics. The method then interpolates and solves the ODE system in the reduced latent space, allowing for fast and accurate ROM predictions.
  • results: The proposed method is demonstrated on three problems, including the Burgers equation, Vlasov equation for plasma physics, and a rising thermal bubble problem, and achieves between 200 and 100,000 times speed-up with up to 7% relative error. The method provides accurate and efficient predictions for PDEs without prior knowledge of the underlying equations.
    Abstract Numerically solving partial differential equations (PDEs) can be challenging and computationally expensive. This has led to the development of reduced-order models (ROMs) that are accurate but faster than full order models (FOMs). Recently, machine learning advances have enabled the creation of non-linear projection methods, such as Latent Space Dynamics Identification (LaSDI). LaSDI maps full-order PDE solutions to a latent space using autoencoders and learns the system of ODEs governing the latent space dynamics. By interpolating and solving the ODE system in the reduced latent space, fast and accurate ROM predictions can be made by feeding the predicted latent space dynamics into the decoder. In this paper, we introduce GPLaSDI, a novel LaSDI-based framework that relies on Gaussian process (GP) for latent space ODE interpolations. Using GPs offers two significant advantages. First, it enables the quantification of uncertainty over the ROM predictions. Second, leveraging this prediction uncertainty allows for efficient adaptive training through a greedy selection of additional training data points. This approach does not require prior knowledge of the underlying PDEs. Consequently, GPLaSDI is inherently non-intrusive and can be applied to problems without a known PDE or its residual. We demonstrate the effectiveness of our approach on the Burgers equation, Vlasov equation for plasma physics, and a rising thermal bubble problem. Our proposed method achieves between 200 and 100,000 times speed-up, with up to 7% relative error.
    摘要 数值求解偏微分方程(PDE)往往困难且计算成本高,这推动了降阶模型(ROM)的发展:ROM在保持精度的同时比全阶模型(FOM)更快。近来,机器学习的进展催生了非线性投影方法,如潜空间动力学辨识(LaSDI)。LaSDI利用自编码器将全阶PDE解映射到潜空间,并学习支配潜空间动力学的常微分方程(ODE)系统;通过在降维潜空间中插值并求解该ODE系统,再将预测的潜空间动力学送入解码器,即可快速而准确地进行ROM预测。本文介绍GPLaSDI,一种基于高斯过程(GP)进行潜空间ODE插值的新型LaSDI框架。使用GP有两大优点:其一,可以量化ROM预测的不确定性;其二,利用该预测不确定性,可以通过贪心选择额外训练数据点实现高效的自适应训练。该方法不需要了解底层PDE,因而本质上是非侵入式的,可应用于PDE或其残差未知的问题。我们在Burgers方程、等离子体物理中的Vlasov方程以及上升热气泡问题上验证了方法的有效性,实现了200至100,000倍的加速,相对误差最高为7%。
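
The GP interpolation and greedy sampling loop at the heart of the framework can be sketched with scikit-learn. The parameter grid, the RBF kernel, and the choice of fitting one GP per latent-ODE coefficient are illustrative assumptions, not the released implementation (see the repo above for that).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Assume each training PDE parameter p_i already has fitted latent-ODE
# coefficients c_i (e.g. from autoencoder + ODE identification).
params = np.array([[0.1], [0.3], [0.5], [0.9]])
coeffs = np.array([[1.2, -0.30], [1.0, -0.25], [0.8, -0.20], [0.5, -0.10]])

gps = [GaussianProcessRegressor(kernel=RBF(0.2), alpha=1e-6)
           .fit(params, coeffs[:, j]) for j in range(coeffs.shape[1])]

# Interpolate coefficients (with uncertainty) at unseen parameters.
candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
stds = np.zeros(len(candidates))
for gp in gps:
    _, s = gp.predict(candidates, return_std=True)
    stds += s

# Greedy adaptive training: run the next full-order simulation where
# the GP is least certain about the latent dynamics.
print("next parameter to simulate:", candidates[np.argmax(stds)])
```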

Aphid Cluster Recognition and Detection in the Wild Using Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.05881
  • repo_url: None
  • paper_authors: Tianxiao Zhang, Kaidong Li, Xiangyu Chen, Cuncong Zhong, Bo Luo, Ivan Grijalva, Brian McCornack, Daniel Flippo, Ajay Sharda, Guanghui Wang
  • for: 本研究旨在使用深度学习模型检测蚜虫团簇,以实现精准施药。
  • methods: 我们从高粱田采集了大规模图像数据,手动挑选出5447张含有蚜虫的图像,并对其中每个蚜虫团簇进行标注。随后将图像裁剪成patch,得到151380个带标注的图像patch。最后,我们在该蚜虫数据集上实现并比较了四种最先进的目标检测模型(VFNet、GFLV2、PAA和ATSS)。
  • results: 实验结果表明,这四种模型在蚜虫数据集上的平均精度和召回率表现稳定且相近。此外,我们提出合并相邻较近的团簇并去除裁剪造成的细小团簇,使性能进一步提升约17%。本研究展示了利用机器学习模型自动检测和管理害虫的可行性。
    Abstract Aphid infestation poses a significant threat to crop production, rural communities, and global food security. While chemical pest control is crucial for maximizing yields, applying chemicals across entire fields is both environmentally unsustainable and costly. Hence, precise localization and management of aphids are essential for targeted pesticide application. The paper primarily focuses on using deep learning models for detecting aphid clusters. We propose a novel approach for estimating infection levels by detecting aphid clusters. To facilitate this research, we have captured a large-scale dataset from sorghum fields, manually selected 5,447 images containing aphids, and annotated each individual aphid cluster within these images. To facilitate the use of machine learning models, we further process the images by cropping them into patches, resulting in a labeled dataset comprising 151,380 image patches. Then, we implemented and compared the performance of four state-of-the-art object detection models (VFNet, GFLV2, PAA, and ATSS) on the aphid dataset. Extensive experimental results show that all models yield stable similar performance in terms of average precision and recall. We then propose to merge close neighboring clusters and remove tiny clusters caused by cropping, and the performance is further boosted by around 17%. The study demonstrates the feasibility of automatically detecting and managing insects using machine learning models. The labeled dataset will be made openly available to the research community.
    摘要 蚜虫侵害对农业生产、农村社区和全球粮食安全构成重大威胁。虽然化学防治对于最大化产量至关重要,但在整片田地喷洒农药既不环保又成本高昂。因此,精准定位和管理蚜虫是实现定点施药的关键。本文主要关注使用深度学习模型检测蚜虫团簇,并提出了一种通过检测蚜虫团簇来估计侵害程度的新方法。为开展这项研究,我们从高粱田采集了大规模数据集,手动挑选出5447张含有蚜虫的图像,并对图像中每个蚜虫团簇进行了标注。为便于机器学习模型使用,我们进一步将图像裁剪成patch,得到包含151380个带标注图像patch的数据集。随后,我们在该蚜虫数据集上实现并比较了四种最先进的目标检测模型(VFNet、GFLV2、PAA和ATSS)。大量实验结果表明,所有模型在平均精度和召回率方面表现稳定且相近。我们进而提出合并相邻较近的团簇并去除裁剪造成的细小团簇,使性能进一步提升约17%。该研究表明了利用机器学习模型自动检测和管理害虫的可行性。标注数据集将向研究社区开放。
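
The two post-processing steps credited with the roughly 17% boost can be sketched directly; the distance and area thresholds below are placeholders, not values from the paper.

```python
import numpy as np

def postprocess_boxes(boxes, merge_dist=30.0, min_area=400.0):
    """Merge detections whose centers lie within merge_dist pixels, then
    drop tiny clusters likely caused by patch cropping. Thresholds are
    illustrative placeholders."""
    boxes = [list(map(float, b)) for b in boxes]  # [x1, y1, x2, y2]
    merged = True
    while merged:                                  # repeat until stable
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                bi, bj = boxes[i], boxes[j]
                ci = ((bi[0] + bi[2]) / 2, (bi[1] + bi[3]) / 2)
                cj = ((bj[0] + bj[2]) / 2, (bj[1] + bj[3]) / 2)
                if np.hypot(ci[0] - cj[0], ci[1] - cj[1]) < merge_dist:
                    boxes[i] = [min(bi[0], bj[0]), min(bi[1], bj[1]),
                                max(bi[2], bj[2]), max(bi[3], bj[3])]
                    boxes.pop(j)
                    merged = True
                    break
            if merged:
                break
    return [b for b in boxes if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]

# Two nearby boxes merge; the 4x4-pixel box is discarded as a crop artifact.
print(postprocess_boxes([[0, 0, 20, 20], [15, 15, 40, 40], [100, 100, 104, 104]]))
```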

Composable Core-sets for Diversity Approximation on Multi-Dataset Streams

  • paper_url: http://arxiv.org/abs/2308.05878
  • repo_url: None
  • paper_authors: Stephanie Wang, Michael Flynn, Fangyu Luo
  • for: 这篇论文主要针对机器学习模型的实时训练,尤其是在传感器数据量很大的情况下。
  • methods: 该方法提出一种可组合核心集构造算法,用于概括流式数据,以便在主动学习环境中使用。
  • results: 该方法可以减少训练时间,并可用于实时训练。实证分析表明,该方法在大量传感器数据下能够给出可靠的结果。
    Abstract Core-sets refer to subsets of data that maximize some function that is commonly a diversity or group requirement. These subsets are used in place of the original data to accomplish a given task with comparable or even enhanced performance if biases are removed. Composable core-sets are core-sets with the property that subsets of the core set can be unioned together to obtain an approximation for the original data; lending themselves to be used for streamed or distributed data. Recent work has focused on the use of core-sets for training machine learning models. Preceding solutions such as CRAIG have been proven to approximate gradient descent while providing a reduced training time. In this paper, we introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments. If combined with techniques such as CRAIG and heuristics to enhance construction speed, composable core-sets could be used for real time training of models when the amount of sensor data is large. We provide empirical analysis by considering extrapolated data for the runtime of such a brute force algorithm. This algorithm is then analyzed for efficiency through averaged empirical regression and key results and improvements are suggested for further research on the topic.
    摘要 核心集(core-set)指能够最大化某个函数(通常是多样性或分组要求)的数据子集。在消除偏差的前提下,这些子集可以替代原始数据完成给定任务,并取得相当甚至更好的性能。可组合核心集(composable core-sets)是具有如下性质的核心集:核心集的子集可以取并集来近似原始数据,因而适用于流式或分布式数据。近期工作关注利用核心集训练机器学习模型;先前的方案(如CRAIG)已被证明可以在近似梯度下降的同时缩短训练时间。在本文中,我们提出一种可组合核心集构造算法,用于概括流式数据,以便在主动学习环境中使用。若与CRAIG等技术以及加速构造的启发式方法相结合,可组合核心集可用于传感器数据量巨大时的模型实时训练。我们通过外推数据对这种暴力算法的运行时间进行了实证分析,随后利用平均实证回归分析其效率,给出关键结果与改进建议,为该方向的后续研究提供参考。
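
A classic composable core-set for diversity is greedy k-center. The sketch below shows both the construction and the union-then-reduce composition step the abstract relies on; the paper's exact diversity function may differ.

```python
import numpy as np

def k_center_coreset(points, k):
    """Greedy k-center (Gonzalez) picks a diverse subset: a classic
    composable core-set construction for diversity maximization."""
    idx = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))               # farthest point so far
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return points[idx]

def compose(coresets, k):
    """Composability: union the per-stream core-sets, then reduce again."""
    return k_center_coreset(np.vstack(coresets), k)

streams = [np.random.randn(1000, 8) for _ in range(4)]
summary = compose([k_center_coreset(s, 32) for s in streams], 32)
print(summary.shape)  # (32, 8): one summary for all four streams
```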

Revisiting N-CNN for Clinical Practice

  • paper_url: http://arxiv.org/abs/2308.05877
  • repo_url: None
  • paper_authors: Leonardo Antunes Ferreira, Lucas Pereira Carlini, Gabriel de Almeida Sá Coutrin, Tatiany Marcondes Heideirich, Marina Carvalho de Moraes Barros, Ruth Guinsburg, Carlos Eduardo Thomaz
  • for: 这篇论文旨在通过优化深度学习模型的超参数,并评估其对分类指标、可解释性和校准的影响,来提高新生儿疼痛评估模型的性能与可靠性。
  • methods: 作者使用新生儿卷积神经网络(N-CNN),通过逐一评估每个超参数对F1分数的提升来优化超参数;他们还应用了源自新生儿面部编码系统(Neonatal Facial Coding System)的软标签来改进模型性能。
  • results: 调优后的N-CNN在分类指标和可解释性上有所改进,但校准性能并未直接提升。作者认为这些发现有助于开发更可靠的新生儿疼痛评估工具,帮助医护人员采取恰当干预并改善患者结局。
    Abstract This paper revisits the Neonatal Convolutional Neural Network (N-CNN) by optimizing its hyperparameters and evaluating how they affect its classification metrics, explainability and reliability, discussing their potential impact in clinical practice. We have chosen hyperparameters that do not modify the original N-CNN architecture, but mainly modify its learning rate and training regularization. The optimization was done by evaluating the improvement in F1 Score for each hyperparameter individually, and the best hyperparameters were chosen to create a Tuned N-CNN. We also applied soft labels derived from the Neonatal Facial Coding System, proposing a novel approach for training facial expression classification models for neonatal pain assessment. Interestingly, while the Tuned N-CNN results point towards improvements in classification metrics and explainability, these improvements did not directly translate to calibration performance. We believe that such insights might have the potential to contribute to the development of more reliable pain evaluation tools for newborns, aiding healthcare professionals in delivering appropriate interventions and improving patient outcomes.
    摘要 本文通过优化新生儿卷积神经网络(N-CNN)的超参数,重新审视该模型,评估超参数对分类指标、可解释性和可靠性的影响,并讨论其在临床实践中的潜在意义。我们选择的超参数不改变原始N-CNN架构,主要调整其学习率和训练正则化。优化通过逐一评估每个超参数带来的F1分数提升来进行,并选取最佳超参数构建调优后的N-CNN。我们还应用了源自新生儿面部编码系统的软标签,提出了一种训练面部表情分类模型用于新生儿疼痛评估的新方法。有趣的是,尽管调优后的N-CNN在分类指标和可解释性上有所改进,但这些改进并未直接转化为校准性能的提升。我们认为这些发现有助于开发更可靠的新生儿疼痛评估工具,帮助医护人员采取恰当干预并改善患者结局。
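
Training with NFCS-derived soft labels can be sketched as follows. The linear mapping from NFCS score to pain probability and the 0-8 scale are assumptions for illustration, not the authors' actual mapping.

```python
import torch
import torch.nn.functional as F

# Hypothetical soft labels: map an NFCS score (assumed 0..8) to a pain
# probability instead of a hard {no-pain, pain} target.
nfcs = torch.tensor([0.0, 2.0, 5.0, 8.0])
p_pain = (nfcs / 8.0).clamp(0.0, 1.0)
soft_targets = torch.stack([1.0 - p_pain, p_pain], dim=1)

logits = torch.randn(4, 2, requires_grad=True)  # stand-in for N-CNN outputs
loss = F.kl_div(F.log_softmax(logits, dim=1), soft_targets,
                reduction="batchmean")          # soft-target cross entropy
loss.backward()
print(float(loss))
```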

UFed-GAN: A Secure Federated Learning Framework with Constrained Computation and Unlabeled Data

  • paper_url: http://arxiv.org/abs/2308.05870
  • repo_url: None
  • paper_authors: Achintha Wijesinghe, Songyang Zhang, Siyu Qi, Zhi Ding
  • for: 本研究旨在解决计算资源受限与数据无标签的问题,提升云端多媒体数据分类的可行性并加强隐私保护。
  • methods: 本文提出的无监督联邦生成对抗网络(UFed-GAN)可以在资源受限且无标签数据的环境下捕捉用户侧数据分布,而无需本地分类训练。
  • results: 实验结果表明,UFed-GAN在应对有限计算资源和无标签数据的同时保护用户隐私方面具有强大潜力。
    Abstract To satisfy the broad applications and insatiable hunger for deploying low latency multimedia data classification and data privacy in a cloud-based setting, federated learning (FL) has emerged as an important learning paradigm. For the practical cases involving limited computational power and only unlabeled data in many wireless communications applications, this work investigates FL paradigm in a resource-constrained and label-missing environment. Specifically, we propose a novel framework of UFed-GAN: Unsupervised Federated Generative Adversarial Network, which can capture user-side data distribution without local classification training. We also analyze the convergence and privacy of the proposed UFed-GAN. Our experimental results demonstrate the strong potential of UFed-GAN in addressing limited computational resources and unlabeled data while preserving privacy.
    摘要 为满足在云端环境中部署低时延多媒体数据分类并保护数据隐私的广泛需求,联邦学习(FL)已成为一种重要的学习范式。针对许多无线通信应用中计算能力有限且只有无标签数据的实际情形,本文研究了资源受限、标签缺失环境下的FL范式。具体而言,我们提出了无监督联邦生成对抗网络(UFed-GAN)这一新框架,它无需本地分类训练即可捕捉用户侧数据分布。我们还分析了UFed-GAN的收敛性与隐私性。实验结果表明,UFed-GAN在应对有限计算资源和无标签数据的同时保护隐私方面具有强大潜力。

Using Twitter Data to Determine Hurricane Category: An Experiment

  • paper_url: http://arxiv.org/abs/2308.05866
  • repo_url: None
  • paper_authors: Songhui Yue, Jyothsna Kondari, Aibek Musaev, Randy K. Smith, Songqing Yue
  • for: 本研究旨在探索社交媒体数据与灾害严重程度之间的关系。
  • methods: 本研究使用Twitter数据,借助数据挖掘技术寻找特定区域的社交媒体数据与灾害严重程度之间的关联。
  • results: 实验结果表明,特定区域的Twitter数据与该区域的飓风等级之间存在正相关关系。此外,本研究还提出了一种使用相关Twitter数据预测特定区域飓风等级的方法。
    Abstract Social media posts contain an abundant amount of information about public opinion on major events, especially natural disasters such as hurricanes. Posts related to an event, are usually published by the users who live near the place of the event at the time of the event. Special correlation between the social media data and the events can be obtained using data mining approaches. This paper presents research work to find the mappings between social media data and the severity level of a disaster. Specifically, we have investigated the Twitter data posted during hurricanes Harvey and Irma, and attempted to find the correlation between the Twitter data of a specific area and the hurricane level in that area. Our experimental results indicate a positive correlation between them. We also present a method to predict the hurricane category for a specific area using relevant Twitter data.
    摘要 社交媒体帖子中含有大量的公众意见信息,特别是自然灾害如飓风。帖子与事件之间存在特殊的相关性,可以通过数据挖掘方法获得。这篇论文探讨了找到社交媒体数据与灾害严重程度之间的映射。我们对飓风哈维和飓风艾尔玛期间的推特数据进行了研究,并发现了这些数据与灾害严重程度之间的正相关性。我们还提出了使用相关的推特数据预测灾害分类的方法。

The Multi-modality Cell Segmentation Challenge: Towards Universal Solutions

  • paper_url: http://arxiv.org/abs/2308.05864
  • repo_url: None
  • paper_authors: Jun Ma, Ronald Xie, Shamini Ayyadhury, Cheng Ge, Anubha Gupta, Ritu Gupta, Song Gu, Yao Zhang, Gihun Lee, Joonkee Kim, Wei Lou, Haofeng Li, Eric Upschulte, Timo Dickscheid, José Guilherme de Almeida, Yixin Wang, Lin Han, Xin Yang, Marco Labagnara, Sahand Jamal Rahi, Carly Kempster, Alice Pollitt, Leon Espinosa, Tâm Mignot, Jan Moritz Middeke, Jan-Niklas Eckardt, Wangkai Li, Zhaoyang Li, Xiaochen Cai, Bizhe Bai, Noah F. Greenwald, David Van Valen, Erin Weisbart, Beth A. Cimini, Zhuoshi Li, Chao Zuo, Oscar Brück, Gary D. Bader, Bo Wang
  • for: 这篇论文旨在提供一个多模态细胞分割基准,以提升细胞分析的准确性和通用性。
  • methods: 本研究使用基于Transformer的深度学习算法,可应用于多种显微成像平台和组织类型,而无需人工调整参数。
  • results: 研究发现,该新算法不仅超越了现有方法,还能在多种显微图像上应用而无需人工调参。该基准与改进后的算法为细胞分析带来了有前景的可能性。
    Abstract Cell segmentation is a critical step for quantitative single-cell analysis in microscopy images. Existing cell segmentation methods are often tailored to specific modalities or require manual interventions to specify hyperparameters in different experimental settings. Here, we present a multi-modality cell segmentation benchmark, comprising over 1500 labeled images derived from more than 50 diverse biological experiments. The top participants developed a Transformer-based deep-learning algorithm that not only exceeds existing methods, but can also be applied to diverse microscopy images across imaging platforms and tissue types without manual parameter adjustments. This benchmark and the improved algorithm offer promising avenues for more accurate and versatile cell analysis in microscopy imaging.
    摘要 细胞分割是显微图像定量单细胞分析的关键步骤。现有的细胞分割方法往往针对特定成像模态定制,或需要人工干预以在不同实验条件下设定超参数。本文提出一个多模态细胞分割基准,包含来自50多个不同生物实验的1500余张标注图像。排名靠前的参赛者开发了一种基于Transformer的深度学习算法,不仅超越了现有方法,还可在不同成像平台和组织类型的显微图像上应用,而无需人工调整参数。该基准和改进后的算法为显微成像中更准确、更通用的细胞分析提供了有前景的途径。

Knowledge Propagation over Conditional Independence Graphs

  • paper_url: http://arxiv.org/abs/2308.05857
  • repo_url: None
  • paper_authors: Urszula Chajewska, Harsh Shrivastava
  • for: 本研究旨在提出针对 Conditional Independence (CI) 图的知识传播算法,用于从不同领域中的系统中提取有价值的信息。
  • methods: 本研究使用一种基于CI图的知识传播算法,通过特征之间的部分相关性来刻画系统的领域拓扑。
  • results: 实验结果表明,本研究的方法在公开available的 Cora 和 PubMed 数据集上表现出色,与现有技术相比有所提高。
    Abstract Conditional Independence (CI) graph is a special type of a Probabilistic Graphical Model (PGM) where the feature connections are modeled using an undirected graph and the edge weights show the partial correlation strength between the features. Since the CI graphs capture direct dependence between features, they have been garnering increasing interest within the research community for gaining insights into the systems from various domains, in particular discovering the domain topology. In this work, we propose algorithms for performing knowledge propagation over the CI graphs. Our experiments demonstrate that our techniques improve upon the state-of-the-art on the publicly available Cora and PubMed datasets.
    摘要 条件独立(CI)图是一种特殊的概率图模型(PGM),它使用无向图对特征之间的连接建模,边权表示特征之间的部分相关强度。由于CI图能够刻画特征之间的直接依赖关系,它在各领域系统(尤其是领域拓扑发现)的研究中受到越来越多的关注。在这项工作中,我们提出了在CI图上进行知识传播的算法。实验表明,我们的技术在公开的Cora和PubMed数据集上超越了现有最佳方法。
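
A CI graph's edge weights follow directly from the precision matrix as partial correlations. The propagation rule below is plain label propagation and only approximates the spirit of the paper's algorithm; the thresholds are illustrative.

```python
import numpy as np

def partial_correlations(X):
    """CI-graph edge weights: partial correlations from the precision
    (inverse covariance) matrix of data X with shape [n_samples, n_feats]."""
    P = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(P))
    R = -P / np.outer(d, d)
    np.fill_diagonal(R, 0.0)
    return R

def propagate(R, labels, steps=10, alpha=0.5, thresh=0.1):
    """Plain label propagation over the CI graph (illustrative rule)."""
    W = np.abs(R) * (np.abs(R) > thresh)     # keep strong partial corrs
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    z = labels.astype(float).copy()
    for _ in range(steps):
        z = alpha * (W @ z) + (1 - alpha) * labels
    return z

X = np.random.default_rng(0).normal(size=(500, 5))
print(propagate(partial_correlations(X), np.array([1.0, 0, 0, 0, 0])))
```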

CSPM: A Contrastive Spatiotemporal Preference Model for CTR Prediction in On-Demand Food Delivery Services

  • paper_url: http://arxiv.org/abs/2308.08446
  • repo_url: None
  • paper_authors: Guyu Jiang, Xiaoyun Li, Rongrong Jing, Ruoqi Zhao, Xingliang Ni, Guodong Cao, Ning Hu
  • for: 预测用户点击率 (CTR) 在在线快递食品平台上是一项重要的任务,以准确地估算用户点击食品项目的概率。与通用的电子商务平台如淘宝和amazon不同,用户在快递食品平台上的行为和兴趣受到地点和时间的限制,导致现有的 CTR 预测算法在 OFD 场景下缺乏效果。
  • methods: 本文提出了基于对比学习框架的对比时空偏好模型(CSPM),用于建模用户在不同搜索状态下的时空偏好。CSPM包括三个模块:对比时空表示学习(CSRL)、时空偏好提取器(StPE)和时空信息过滤器(StIF)。CSRL使用对比学习框架为搜索行为生成时空激活表示(SAR);StPE利用SAR,通过多头注意力机制从历史行为序列中激活用户与地点和时间相关的多样化偏好;StIF将SAR整合进门控网络,以自动捕捉具有潜在时空效应的重要特征。
  • results: 在两个大规模真实数据集上,CSPM均取得了最先进的表现。特别地,CSPM已成功部署在阿里巴巴在线OFD平台饿了么(Ele.me)上,带来0.88%的CTR显著提升,具有重要的业务意义。
    Abstract Click-through rate (CTR) prediction is a crucial task in the context of an online on-demand food delivery (OFD) platform for precisely estimating the probability of a user clicking on food items. Unlike universal e-commerce platforms such as Taobao and Amazon, user behaviors and interests on the OFD platform are more location and time-sensitive due to limited delivery ranges and regional commodity supplies. However, existing CTR prediction algorithms in OFD scenarios concentrate on capturing interest from historical behavior sequences, which fails to effectively model the complex spatiotemporal information within features, leading to poor performance. To address this challenge, this paper introduces CSPM, a contrastive spatiotemporal preference model that captures users' preferences under different search states using three modules: contrastive spatiotemporal representation learning (CSRL), spatiotemporal preference extractor (StPE), and spatiotemporal information filter (StIF). CSRL utilizes a contrastive learning framework to generate a spatiotemporal activation representation (SAR) for the search action. StPE employs SAR to activate users' diverse preferences related to location and time from the historical behavior sequence field, using a multi-head attention mechanism. StIF incorporates SAR into a gating network to automatically capture important features with latent spatiotemporal effects. Extensive experiments conducted on two large-scale industrial datasets demonstrate the state-of-the-art performance of CSPM. Notably, CSPM has been successfully deployed in Alibaba's online OFD platform Ele.me, resulting in a significant 0.88% lift in CTR, which has substantial business implications.
    摘要 点击率(CTR)预测是在线即时配送外卖(OFD)平台上的关键任务,用于准确估计用户点击餐品的概率。与淘宝、亚马逊等通用电商平台不同,受限于配送范围和区域商品供给,OFD平台上用户的行为和兴趣对地点和时间更为敏感。然而,现有OFD场景下的CTR预测算法专注于从历史行为序列中捕捉兴趣,未能有效建模特征中复杂的时空信息,导致性能不佳。为解决这一挑战,本文提出了对比时空偏好模型(CSPM),用于建模用户在不同搜索状态下的偏好,包括三个模块:对比时空表示学习(CSRL)、时空偏好提取器(StPE)和时空信息过滤器(StIF)。CSRL使用对比学习框架为搜索行为生成时空激活表示(SAR);StPE利用SAR,通过多头注意力机制从历史行为序列字段中激活用户与地点和时间相关的多样化偏好;StIF将SAR整合进门控网络,以自动捕捉具有潜在时空效应的重要特征。在两个大规模工业数据集上的大量实验证明了CSPM的最先进性能。尤其值得一提的是,CSPM已成功部署在阿里巴巴在线OFD平台饿了么(Ele.me)上,带来0.88%的CTR显著提升,具有重要的业务意义。

GaborPINN: Efficient physics informed neural networks using multiplicative filtered networks

  • paper_url: http://arxiv.org/abs/2308.05843
  • repo_url: None
  • paper_authors: Xinquan Huang, Tariq Alkhalifah
  • for: 快速求解地震波场计算问题,以服务于全波形反演等实际应用。
  • methods: 使用改进的物理信息神经网络(PINN),即乘法滤波网络(MFN),并利用Gabor基函数来表示波场。
  • results: 与传统PINN相比,所提方法收敛更快,收敛速度最多可提升两个数量级。
    Abstract The computation of the seismic wavefield by solving the Helmholtz equation is crucial to many practical applications, e.g., full waveform inversion. Physics-informed neural networks (PINNs) provide functional wavefield solutions represented by neural networks (NNs), but their convergence is slow. To address this problem, we propose a modified PINN using multiplicative filtered networks, which embeds some of the known characteristics of the wavefield in training, e.g., frequency, to achieve much faster convergence. Specifically, we use the Gabor basis function due to its proven ability to represent wavefields accurately and refer to the implementation as GaborPINN. Meanwhile, we incorporate prior information on the frequency of the wavefield into the design of the method to mitigate the influence of the discontinuity of the represented wavefield by GaborPINN. The proposed method achieves up to a two-magnitude increase in the speed of convergence as compared with conventional PINNs.
    摘要 通过求解亥姆霍兹方程计算地震波场对许多实际应用(例如全波形反演)至关重要。物理信息神经网络(PINN)能以神经网络表示函数形式的波场解,但其收敛速度较慢。为解决这一问题,我们提出一种改进的PINN,采用乘法滤波网络,在训练中嵌入波场的已知特征(如频率),从而实现快得多的收敛。具体而言,鉴于Gabor基函数已被证明能够准确表示波场,我们采用Gabor基函数,并将该实现称为GaborPINN。同时,我们在方法设计中引入波场频率的先验信息,以减轻GaborPINN所表示波场的不连续性带来的影响。与传统PINN相比,所提方法的收敛速度最多可提升两个数量级。
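
A Gabor multiplicative filter network can be written compactly. The sketch below follows the generic MFN recipe (Gabor filters multiplied into linear layers), with a `freq_scale` knob standing in for the paper's frequency prior; the initialization details are assumptions.

```python
import torch
import torch.nn as nn

class GaborFilter(nn.Module):
    """Gabor filter g(x) = exp(-0.5 * gamma * ||x - mu||^2) * sin(Wx + b)."""
    def __init__(self, in_dim, out_dim, freq_scale=30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.linear.weight.data.mul_(freq_scale)  # stands in for a frequency prior
        self.mu = nn.Parameter(2 * torch.rand(out_dim, in_dim) - 1)
        self.gamma = nn.Parameter(torch.rand(out_dim))

    def forward(self, x):
        dist = ((x[:, None, :] - self.mu[None]) ** 2).sum(-1)
        return torch.exp(-0.5 * self.gamma * dist) * torch.sin(self.linear(x))

class GaborNet(nn.Module):
    """Multiplicative filter network: z_{i+1} = g_{i+1}(x) * (W_i z_i + b_i)."""
    def __init__(self, in_dim=2, hidden=64, layers=3):
        super().__init__()
        self.filters = nn.ModuleList(GaborFilter(in_dim, hidden) for _ in range(layers))
        self.linears = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(layers - 1))
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        z = self.filters[0](x)
        for lin, filt in zip(self.linears, self.filters[1:]):
            z = filt(x) * lin(z)
        return self.out(z)

net = GaborNet()
print(net(torch.rand(128, 2)).shape)  # torch.Size([128, 1]), e.g. wavefield at (x, z)
```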

FLShield: A Validation Based Federated Learning Framework to Defend Against Poisoning Attacks

  • paper_url: http://arxiv.org/abs/2308.05832
  • repo_url: None
  • paper_authors: Ehsanul Kabir, Zeyu Song, Md Rafi Ur Rashid, Shagufta Mehnaz
  • for: 本研究旨在提出一种新的 Federated Learning 框架,以确保 Collaborative Learning 系统的安全性和可靠性。
  • methods: 该框架利用参与者的良性数据,在将本地模型纳入全局模型之前先进行验证,以抵御恶意参与者的行为。
  • results: 经过广泛的实验评估,FLShield 框架能够有效防止多种毒化和后门攻击,并保持本地数据的隐私。
    Abstract Federated learning (FL) is revolutionizing how we learn from data. With its growing popularity, it is now being used in many safety-critical domains such as autonomous vehicles and healthcare. Since thousands of participants can contribute in this collaborative setting, it is, however, challenging to ensure security and reliability of such systems. This highlights the need to design FL systems that are secure and robust against malicious participants' actions while also ensuring high utility, privacy of local data, and efficiency. In this paper, we propose a novel FL framework dubbed as FLShield that utilizes benign data from FL participants to validate the local models before taking them into account for generating the global model. This is in stark contrast with existing defenses relying on server's access to clean datasets -- an assumption often impractical in real-life scenarios and conflicting with the fundamentals of FL. We conduct extensive experiments to evaluate our FLShield framework in different settings and demonstrate its effectiveness in thwarting various types of poisoning and backdoor attacks including a defense-aware one. FLShield also preserves privacy of local data against gradient inversion attacks.
    摘要 联邦学习(FL)正在改变我们从数据中学习的方式。随着其日益普及,它已被用于自动驾驶和医疗等许多安全关键领域。然而,由于成千上万的参与者可以加入这种协作环境,保证此类系统的安全性和可靠性颇具挑战。这凸显了设计此类FL系统的必要性:既要对恶意参与者的行为安全稳健,又要保证高实用性、本地数据隐私和效率。在本文中,我们提出一种名为FLShield的新型FL框架,它利用FL参与者的良性数据,在将本地模型纳入全局模型生成之前先对其进行验证。这与现有依赖服务器访问干净数据集的防御方法形成鲜明对比:该假设在现实场景中往往不切实际,且与FL的基本原则相冲突。我们进行了大量实验,在不同设置下评估FLShield框架,证明其能有效挫败各类投毒和后门攻击(包括一种防御感知攻击)。FLShield还能保护本地数据隐私,抵御梯度反演攻击。
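
The validation-based filtering idea might look like the sketch below. The median-score threshold and plain averaging are simplifications we assume for illustration; FLShield's actual scoring and aggregation are more involved.

```python
import copy
import statistics
import torch

def validate_and_aggregate(global_model, client_states, val_loaders):
    """Score each candidate local model on benign validators' data and
    average only the updates that pass a (here: median) threshold."""
    scores = []
    for state in client_states:
        model = copy.deepcopy(global_model)
        model.load_state_dict(state)
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for loader in val_loaders:          # one loader per validator
                for x, y in loader:
                    correct += (model(x).argmax(1) == y).sum().item()
                    total += y.numel()
        scores.append(correct / max(total, 1))
    thr = statistics.median(scores)             # drop below-median updates
    kept = [s for s, sc in zip(client_states, scores) if sc >= thr]
    return {k: sum(s[k] for s in kept) / len(kept) for k in kept[0]}
```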

Neural Progressive Meshes

  • paper_url: http://arxiv.org/abs/2308.05741
  • repo_url: None
  • paper_authors: Yun-Chun Chen, Vladimir G. Kim, Noam Aigerman, Alec Jacobson
  • for: 提高3D内容的传输效率,特别是针对3D网格等大型几何数据。
  • methods: 使用学习的生成模型来分解和重建3D模型,以实现渐进式传输和高质量重建。
  • results: 在压缩率和重建质量方面优于基线方法。
    Abstract The recent proliferation of 3D content that can be consumed on hand-held devices necessitates efficient tools for transmitting large geometric data, e.g., 3D meshes, over the Internet. Detailed high-resolution assets can pose a challenge to storage as well as transmission bandwidth, and level-of-detail techniques are often used to transmit an asset using an appropriate bandwidth budget. It is especially desirable for these methods to transmit data progressively, improving the quality of the geometry with more data. Our key insight is that the geometric details of 3D meshes often exhibit similar local patterns even across different shapes, and thus can be effectively represented with a shared learned generative space. We learn this space using a subdivision-based encoder-decoder architecture trained in advance on a large collection of surfaces. We further observe that additional residual features can be transmitted progressively between intermediate levels of subdivision that enable the client to control the tradeoff between bandwidth cost and quality of reconstruction, providing a neural progressive mesh representation. We evaluate our method on a diverse set of complex 3D shapes and demonstrate that it outperforms baselines in terms of compression ratio and reconstruction quality.
    摘要 近来可在手持设备上消费的3D内容迅速增多,这需要高效的工具在互联网上传输大型几何数据(如3D网格)。高分辨率的精细资产对存储和传输带宽都构成挑战,因此常用细节层次(level-of-detail)技术在给定带宽预算下传输资产。尤其理想的是渐进式传输:随着数据增多,几何质量不断提升。我们的核心洞察是:3D网格的几何细节往往在不同形状间呈现相似的局部模式,因此可以用一个共享的学习生成空间有效表示。我们使用基于细分的编码器-解码器架构,在大量曲面集合上预先训练来学习该空间。我们进一步观察到,额外的残差特征可以在细分的中间层级之间渐进传输,使客户端能够控制带宽开销与重建质量之间的权衡,从而提供一种神经渐进网格表示。我们在一组多样的复杂3D形状上评估了该方法,结果表明其在压缩率和重建质量上均优于基线。

Zero Grads Ever Given: Learning Local Surrogate Losses for Non-Differentiable Graphics

  • paper_url: http://arxiv.org/abs/2308.05739
  • repo_url: None
  • paper_authors: Michael Fischer, Tobias Ritschel
  • for: ZeroGrads is a framework for optimizing non-convex, non-differentiable black-box problems in graphics, such as visibility in rendering, discrete parameter spaces in procedural modelling, or optimal control in physics-driven animation.
  • methods: ZeroGrads uses a neural approximation of the objective function, called a surrogate, to circumvent the issue of undefined or zero gradients in gradient-based optimization. The surrogate is learned online and self-supervised, without pre-computed data or pre-trained models.
  • results: ZeroGrads demonstrates competitive performance at little overhead, scaling well to higher dimensions with up to 35k interlinked variables. It is able to optimize diverse non-convex, non-differentiable black-box problems in graphics, including visibility in rendering, discrete parameter spaces in procedural modelling, and optimal control in physics-driven animation.
    Abstract Gradient-based optimization is now ubiquitous across graphics, but unfortunately can not be applied to problems with undefined or zero gradients. To circumvent this issue, the loss function can be manually replaced by a "surrogate" that has similar minima but is differentiable. Our proposed framework, ZeroGrads, automates this process by learning a neural approximation of the objective function, the surrogate, which in turn can be used to differentiate through arbitrary black-box graphics pipelines. We train the surrogate on an actively smoothed version of the objective and encourage locality, focusing the surrogate's capacity on what matters at the current training episode. The fitting is performed online, alongside the parameter optimization, and self-supervised, without pre-computed data or pre-trained models. As sampling the objective is expensive (it requires a full rendering or simulator run), we devise an efficient sampling scheme that allows for tractable run-times and competitive performance at little overhead. We demonstrate optimizing diverse non-convex, non-differentiable black-box problems in graphics, such as visibility in rendering, discrete parameter spaces in procedural modelling or optimal control in physics-driven animation. In contrast to more traditional algorithms, our approach scales well to higher dimensions, which we demonstrate on problems with up to 35k interlinked variables.
    摘要 基于梯度的优化如今在图形学中无处不在,但遗憾的是,它无法应用于梯度未定义或为零的问题。为绕过这一问题,可以手动将损失函数替换为一个具有相似极小值但可微的"代理"函数。我们提出的框架ZeroGrads将这一过程自动化:它学习目标函数的神经近似(即代理函数),进而可用于对任意黑盒图形管线进行微分。我们在目标函数的主动平滑版本上训练代理函数,并鼓励局部性,使代理函数的容量集中于当前训练阶段的关键之处。拟合与参数优化同步在线进行,且是自监督的,无需预先计算的数据或预训练模型。由于对目标函数采样的成本很高(需要完整的渲染或仿真运行),我们设计了一种高效的采样方案,以很小的开销实现可接受的运行时间和有竞争力的性能。我们在图形学中多种非凸、不可微的黑盒问题上进行了优化实验,例如渲染中的可见性、程序化建模中的离散参数空间以及物理驱动动画中的最优控制。与更传统的算法相比,我们的方法能良好地扩展到更高维度,并在多达35k个相互关联变量的问题上进行了演示。
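
One optimization step of the surrogate idea might look as follows in PyTorch; the isotropic Gaussian smoothing, the sample count, and the plain MSE fit are our simplifications of the paper's actively-smoothed, locality-weighted objective. The toy black-box loss uses `round`, so its true gradient is zero almost everywhere, which is exactly the motivating case.

```python
import torch

def surrogate_step(theta, black_box_loss, surrogate, opt_theta, opt_surr,
                   sigma=0.05, n_samples=8):
    """One step: (1) sample the non-differentiable objective around theta,
    (2) fit the neural surrogate to those samples (self-supervised, online),
    (3) descend the surrogate's gradient instead of the true gradient."""
    with torch.no_grad():
        pts = theta.detach() + sigma * torch.randn(n_samples, theta.numel())
        vals = torch.tensor([black_box_loss(p) for p in pts])
    opt_surr.zero_grad()
    ((surrogate(pts).squeeze(-1) - vals) ** 2).mean().backward()
    opt_surr.step()
    opt_theta.zero_grad()
    surrogate(theta.unsqueeze(0)).sum().backward()
    opt_theta.step()

theta = torch.nn.Parameter(torch.zeros(2))
surrogate = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 1))
opt_theta = torch.optim.Adam([theta], lr=0.05)
opt_surr = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
for _ in range(100):
    surrogate_step(theta, lambda p: float((p.round() - 1.0).abs().sum()),
                   surrogate, opt_theta, opt_surr)
print(theta.data)  # should drift toward the flat optimum near (1, 1)
```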

Follow Anything: Open-set detection, tracking, and following in real-time

  • paper_url: http://arxiv.org/abs/2308.05737
  • repo_url: https://github.com/alaamaalouf/followanything
  • paper_authors: Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M. Vogt, Robert J. Wood, Antonio Torralba, Daniela Rus
  • for: 这篇论文旨在提出一种能够实时检测、跟踪并跟随任意目标对象的机器人系统。
  • methods: 该系统使用名为"follow anything"(FAn)的开放词汇、多模态模型,不局限于训练时见过的概念,可在推理时通过文本、图像或点击查询应用于新类别。该模型利用大规模预训练模型(基础模型)的丰富视觉描述符,在输入图像序列中检测并分割目标对象,并跨帧跟踪它们。
  • results: 作者在一个真实世界机器人系统(微型飞行器)上演示了FAn,证明其能在实时控制回路中无缝跟随感兴趣的目标。FAn可部署在配备轻量级(6-8 GB)显卡的笔记本电脑上,达到每秒6-20帧的吞吐量。
    Abstract Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed ``follow anything'' (FAn), is an open-vocabulary and multimodal model -- it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage at https://github.com/alaamaalouf/FollowAnything . We also encourage the reader the watch our 5-minutes explainer video in this https://www.youtube.com/watch?v=6Mgt3EPytrw .
    摘要 检测并跟随感兴趣的目标对象对许多机器人应用场景至关重要,涵盖工业自动化、物流仓储、医疗和安防等领域。在本文中,我们提出了一种能够实时检测、跟踪并跟随任意目标的机器人系统。我们的方法名为"follow anything"(FAn),是一种开放词汇、多模态模型:它不局限于训练时见过的概念,可在推理时通过文本、图像或点击查询应用于新类别。借助大规模预训练模型(基础模型)提供的丰富视觉描述符,FAn可以将多模态查询(文本、图像、点击)与输入图像序列进行匹配,从而检测并分割目标对象。这些被检测和分割的目标在图像帧间被持续跟踪,同时考虑遮挡和目标重新出现的情况。我们在一个真实世界机器人系统(微型飞行器)上演示了FAn,并报告了其在实时控制回路中无缝跟随目标的能力。FAn可以部署在配备轻量级(6-8 GB)显卡的笔记本电脑上,达到每秒6-20帧的吞吐量。为促进快速采用、部署和扩展,我们在项目网页 https://github.com/alaamaalouf/FollowAnything 上开源了全部代码,并邀请读者观看我们的5分钟讲解视频:https://www.youtube.com/watch?v=6Mgt3EPytrw 。

PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers

  • paper_url: http://arxiv.org/abs/2308.05732
  • repo_url: None
  • paper_authors: Phillip Lippe, Bastiaan S. Veeling, Paris Perdikaris, Richard E. Turner, Johannes Brandstetter
  • for: This paper aims to improve the accuracy and stability of deep neural network-based surrogates for solving partial differential equations (PDEs) by addressing the neglect of non-dominant spatial frequency information.
  • methods: The authors use a large-scale analysis of common temporal rollout strategies and draw inspiration from recent advances in diffusion models to introduce a novel model class called PDE-Refiner, which uses a multistep refinement process to accurately model all frequency components of PDE solutions.
  • results: The authors validate PDE-Refiner on challenging benchmarks of complex fluid dynamics and demonstrate stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. Additionally, PDE-Refiner is shown to greatly enhance data efficiency by implicitly inducing a novel form of spectral data augmentation.
    Abstract Time-dependent partial differential equations (PDEs) are ubiquitous in science and engineering. Recently, mostly due to the high computational cost of traditional solution techniques, deep neural network based surrogates have gained increased interest. The practical utility of such neural PDE solvers relies on their ability to provide accurate, stable predictions over long time horizons, which is a notoriously hard problem. In this work, we present a large-scale analysis of common temporal rollout strategies, identifying the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the primary pitfall limiting stable, accurate rollout performance. Based on these insights, we draw inspiration from recent advances in diffusion models to introduce PDE-Refiner; a novel model class that enables more accurate modeling of all frequency components via a multistep refinement process. We validate PDE-Refiner on challenging benchmarks of complex fluid dynamics, demonstrating stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. We further demonstrate that PDE-Refiner greatly enhances data efficiency, since the denoising objective implicitly induces a novel form of spectral data augmentation. Finally, PDE-Refiner's connection to diffusion models enables an accurate and efficient assessment of the model's predictive uncertainty, allowing us to estimate when the surrogate becomes inaccurate.
    摘要 时间依赖的偏微分方程(PDE)在科学和工程中无处不在。近来,主要由于传统求解技术的高计算成本,基于深度神经网络的代理模型受到越来越多的关注。这类神经PDE求解器的实用价值取决于其能否在长时间范围内给出准确、稳定的预测,而这是一个众所周知的难题。在这项工作中,我们对常见的时间滚动策略进行了大规模分析,发现忽略非主导的空间频率信息(通常对应PDE解中的高频成分)是限制稳定、准确滚动性能的主要症结。基于这些发现,我们借鉴扩散模型的最新进展,提出PDE-Refiner:一类通过多步细化过程更准确地建模所有频率成分的新型模型。我们在复杂流体动力学的高难度基准上验证了PDE-Refiner,其滚动稳定而准确,持续优于包括神经、数值以及神经-数值混合架构在内的最先进模型。我们进一步证明,PDE-Refiner能显著提升数据效率,因为其去噪目标隐式地引入了一种新形式的谱数据增强。最后,PDE-Refiner与扩散模型的联系使我们能够准确而高效地评估模型的预测不确定性,从而估计代理模型何时变得不准确。
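
An inference-time rollout in the spirit of PDE-Refiner might look like the sketch below, where `model(noisy, prev, k)` is an assumed interface that predicts the next state at k = 0 and the injected noise at refinement steps k >= 1; the geometric noise schedule is also an assumption and only loosely follows the paper.

```python
import torch

@torch.no_grad()
def refined_rollout(model, u0, n_steps, n_refine=3, sigma0=0.1):
    """Multistep-refinement rollout (loosely following PDE-Refiner):
    each one-step prediction is re-estimated at shrinking noise levels
    so that low-amplitude, high-frequency content is modelled too."""
    u, traj = u0, [u0]
    for _ in range(n_steps):
        pred = model(torch.zeros_like(u), u, 0)       # k = 0: predict u_{t+1}
        for k in range(1, n_refine + 1):
            sigma = sigma0 ** k                        # shrinking noise scale
            noisy = pred + sigma * torch.randn_like(pred)
            pred = noisy - sigma * model(noisy, u, k)  # k >= 1: denoise
        u = pred
        traj.append(u)
    return torch.stack(traj)
```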

Rethinking Integration of Prediction and Planning in Deep Learning-Based Automated Driving Systems: A Review

  • paper_url: http://arxiv.org/abs/2308.05731
  • repo_url: None
  • paper_authors: Steffen Hagedorn, Marcel Hallgarten, Martin Stoll, Alexandru Condurache
  • for: 本研究旨在探讨自动驾驶系统中的预测和规划模型,以提高安全、舒适、高效的驾驶体验。
  • methods: 本研究使用深度学习技术来实现预测和规划模型的集成,并对不同的集成方法进行系统性的回顾和分析。
  • results: 研究发现,集成预测和规划模型可以提高自动驾驶系统的安全性、舒适性和高效性,但同时也存在一些研究挑战和限制。
    Abstract Automated driving has the potential to revolutionize personal, public, and freight mobility. Besides the enormous challenge of perception, i.e. accurately perceiving the environment using available sensor data, automated driving comprises planning a safe, comfortable, and efficient motion trajectory. To promote safety and progress, many works rely on modules that predict the future motion of surrounding traffic. Modular automated driving systems commonly handle prediction and planning as sequential separate tasks. While this accounts for the influence of surrounding traffic on the ego-vehicle, it fails to anticipate the reactions of traffic participants to the ego-vehicle's behavior. Recent works suggest that integrating prediction and planning in an interdependent joint step is necessary to achieve safe, efficient, and comfortable driving. While various models implement such integrated systems, a comprehensive overview and theoretical understanding of different principles are lacking. We systematically review state-of-the-art deep learning-based prediction, planning, and integrated prediction and planning models. Different facets of the integration ranging from model architecture and model design to behavioral aspects are considered and related to each other. Moreover, we discuss the implications, strengths, and limitations of different integration methods. By pointing out research gaps, describing relevant future challenges, and highlighting trends in the research field, we identify promising directions for future research.

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

  • paper_url: http://arxiv.org/abs/2308.05725
  • repo_url: None
  • paper_authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux
  • for: Researchers and developers working on speech synthesis, particularly those interested in textless and expressive speech synthesis.
  • methods: The paper uses low-bitrate discrete units that have been learned in a self-supervised fashion to resynthesize high-quality speech, and introduces a new dataset called Expresso, which includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
  • results: The paper presents an expressive resynthesis benchmark that evaluates the quality of resynthesized speech using automatic metrics for different self-supervised discrete encoders, exploring tradeoffs between quality, bitrate, and invariance to speaker and style.
    Abstract Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. The dataset, evaluation metrics, and baseline models are all open source.

Optimizing Performance of Feedforward and Convolutional Neural Networks through Dynamic Activation Functions

  • paper_url: http://arxiv.org/abs/2308.05724
  • repo_url: None
  • paper_authors: Chinmay Rane, Kanishka Tyagi, Michael Manry
  • for: This paper addresses the success of deep learning training algorithms across many fields and the ever-growing depth of deep convolutional neural networks (CNNs), focusing on the role of activation functions.
  • methods: Piece-wise linear (PWL) activation functions are used in the hidden layers and compared against the common ReLU activation.
  • results: PWL activations improve network performance for both convolutional neural networks and multilayer perceptrons; comparisons of deep and shallow CNNs in PyTorch further support this finding.
    Abstract Deep learning training algorithms have been a huge success in recent years in many fields including speech, text, image, and video. Deeper and deeper layers have been proposed with huge success, with ResNet structures having around 152 layers. Shallow convolutional neural networks (CNNs) are still an active research area, where some phenomena remain unexplained. The activation functions used in a network are of utmost importance, as they provide its non-linearity. ReLUs are the most commonly used activation function. We propose a complex piece-wise linear (PWL) activation in the hidden layer and show that these PWL activations work much better than ReLU activations in our networks, for both convolutional neural networks and multilayer perceptrons. Result comparisons in PyTorch for shallow and deep CNNs are given to further strengthen our case.
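A piece-wise linear activation can be written as a sum of shifted ReLU hinges with learnable slopes. The sketch below is one common construction (fixed breakpoints, learned slopes and bias), not necessarily the exact parameterization used by the authors.

```python
import torch
import torch.nn as nn

class PWLActivation(nn.Module):
    """f(x) = b + a0*x + sum_i a_i * relu(x - c_i): piece-wise linear in x."""
    def __init__(self, n_hinges=5, lo=-2.0, hi=2.0):
        super().__init__()
        # fixed breakpoints; slopes and bias are learned
        self.register_buffer("c", torch.linspace(lo, hi, n_hinges))
        self.a0 = nn.Parameter(torch.ones(1))
        self.a = nn.Parameter(torch.zeros(n_hinges))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        hinges = torch.relu(x.unsqueeze(-1) - self.c)   # (..., n_hinges)
        return self.b + self.a0 * x + (hinges * self.a).sum(-1)

# usage: drop-in replacement for nn.ReLU in a hidden layer
act = PWLActivation()
y = act(torch.randn(8, 16))
```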

A Comparison of Classical and Deep Reinforcement Learning Methods for HVAC Control

  • paper_url: http://arxiv.org/abs/2308.05711
  • repo_url: None
  • paper_authors: Marshall Wang, John Willes, Thomas Jiralerspong, Matin Moezzi
  • for: Optimizing the performance and energy efficiency of HVAC control systems.
  • methods: Classical and deep RL methods (Q-Learning and Deep Q-Networks) are benchmarked across multiple HVAC environments, with attention to practical hyper-parameter selection and reward tuning.
  • results: The findings provide guidance for configuring RL agents in HVAC systems, promoting energy-efficient and cost-effective operation.
    Abstract Reinforcement learning (RL) is a promising approach for optimizing HVAC control. RL offers a framework for improving system performance, reducing energy consumption, and enhancing cost efficiency. We benchmark two popular classical and deep RL methods (Q-Learning and Deep-Q-Networks) across multiple HVAC environments and explore the practical consideration of model hyper-parameter selection and reward tuning. The findings provide insight for configuring RL agents in HVAC systems, promoting energy-efficient and cost-effective operation.
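To make the classical baseline concrete, here is a minimal tabular Q-learning loop over a discretized HVAC environment. The `env` interface (reset/step returning state, reward, done) and the state/action discretization are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning (sketch). env is a toy interface:
    env.reset() -> state index; env.step(a) -> (state, reward, done),
    where the reward could mix occupant comfort and energy cost."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = env.step(a)
            # one-step temporal-difference update
            Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
            s = s2
    return Q
```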

Shadow Datasets, New challenging datasets for Causal Representation Learning

  • paper_url: http://arxiv.org/abs/2308.05707
  • repo_url: https://github.com/Jiagengzhu/Shadow-dataset-for-crl
  • paper_authors: Jiageng Zhu, Hanchen Xie, Jianhua Wu, Jiazhi Li, Mahyar Khayatkhoei, Mohamed E. Hussein, Wael AbdAlmageed
  • for: This work aims to explore causal relations among semantic factors to improve causal understanding in representation learning.
  • methods: Weakly supervised CRL methods are considered, avoiding the high cost of full annotation.
  • results: Two new CRL datasets are proposed, together with modifications to existing datasets, to better evaluate CRL performance.
    Abstract Discovering causal relations among semantic factors is an emergent topic in representation learning. Most causal representation learning (CRL) methods are fully supervised, which is impractical due to costly labeling. To resolve this restriction, weakly supervised CRL methods were introduced. To evaluate CRL performance, four existing datasets, Pendulum, Flow, CelebA(BEARD) and CelebA(SMILE), are utilized. However, existing CRL datasets are limited to simple graphs with few generative factors. Thus we propose two new datasets with a larger number of diverse generative factors and more sophisticated causal graphs. In addition, for the current real datasets, CelebA(BEARD) and CelebA(SMILE), the originally proposed causal graphs are not aligned with the dataset distributions. Thus, we propose modifications to them.

Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with Skeleton-Motion-Informed Gradient

  • paper_url: http://arxiv.org/abs/2308.05681
  • repo_url: https://github.com/luyg45/hardnoboxattack
  • paper_authors: Zhengzhi Lu, He Wang, Ziyi Chang, Guoan Yang, Hubert P. H. Shum
  • for: This work demonstrates the vulnerability of skeleton-based human action recognition methods to adversarial attacks.
  • methods: A new attack task is studied in which the attacker has no access to the victim model, training data, or labels, termed the "hard no-box attack". A motion manifold is first learned, and an adversarial loss is defined to compute a new attack direction, the skeleton-motion-informed (SMI) gradient. This gradient incorporates motion dynamics, unlike existing gradient-based attack methods that assume each data dimension is independent.
  • results: The method poses a real threat to existing classifiers, and the SMI gradient improves the transferability and imperceptibility of adversarial samples in both no-box and transfer-based black-box settings.
    Abstract Recently, methods for skeleton-based human activity recognition have been shown to be vulnerable to adversarial attacks. However, these attack methods require either the full knowledge of the victim (i.e. white-box attacks), access to training data (i.e. transfer-based attacks) or frequent model queries (i.e. black-box attacks). All their requirements are highly restrictive, raising the question of how detrimental the vulnerability is. In this paper, we show that the vulnerability indeed exists. To this end, we consider a new attack task: the attacker has no access to the victim model or the training data or labels, where we coin the term hard no-box attack. Specifically, we first learn a motion manifold where we define an adversarial loss to compute a new gradient for the attack, named skeleton-motion-informed (SMI) gradient. Our gradient contains information of the motion dynamics, which is different from existing gradient-based attack methods that compute the loss gradient assuming each dimension in the data is independent. The SMI gradient can augment many gradient-based attack methods, leading to a new family of no-box attack methods. Extensive evaluation and comparison show that our method imposes a real threat to existing classifiers. They also show that the SMI gradient improves the transferability and imperceptibility of adversarial samples in both no-box and transfer-based black-box settings.

Finding Already Debunked Narratives via Multistage Retrieval: Enabling Cross-Lingual, Cross-Dataset and Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2308.05680
  • repo_url: None
  • paper_authors: Iknoor Singh, Carolina Scarton, Xingyi Song, Kalina Bontcheva
  • for: This work investigates cross-lingual retrieval of already debunked narratives, to reduce the manual effort of professional fact-checkers and slow the rapid spread of misinformation.
  • methods: A novel dataset is created in which tweets serve as queries against a database of fact-checking articles, and large-scale experiments benchmark multiple pre-trained Transformer models on this cross-lingual task.
  • results: Cross-lingual retrieval of already debunked narratives proves challenging, and off-the-shelf Transformer models fail to outperform a strong lexical baseline (BM25). However, the proposed multistage retrieval framework closes this gap, outperforming BM25 in most scenarios and enabling cross-domain and zero-shot learning.
    Abstract The task of retrieving already debunked narratives aims to detect stories that have already been fact-checked. The successful detection of claims that have already been debunked not only reduces the manual efforts of professional fact-checkers but can also contribute to slowing the spread of misinformation. Mainly due to the lack of readily available data, this is an understudied problem, particularly when considering the cross-lingual task, i.e. the retrieval of fact-checking articles in a language different from the language of the online post being checked. This paper fills this gap by (i) creating a novel dataset to enable research on cross-lingual retrieval of already debunked narratives, using tweets as queries to a database of fact-checking articles; (ii) presenting an extensive experiment to benchmark fine-tuned and off-the-shelf multilingual pre-trained Transformer models for this task; and (iii) proposing a novel multistage framework that divides this cross-lingual debunk retrieval task into refinement and re-ranking stages. Results show that the task of cross-lingual retrieval of already debunked narratives is challenging and off-the-shelf Transformer models fail to outperform a strong lexical-based baseline (BM25). Nevertheless, our multistage retrieval framework is robust, outperforming BM25 in most scenarios and enabling cross-domain and zero-shot learning, without significantly harming the model's performance.
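The refine-then-rerank idea can be sketched in a few lines: a lexical BM25 pass produces a candidate pool, and a multilingual sentence encoder re-ranks it. The packages and model name below (`rank_bm25`, `sentence-transformers`, `paraphrase-multilingual-MiniLM-L12-v2`) are illustrative choices, not necessarily the components used in the paper.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def retrieve_debunks(tweet, articles, top_k=50):
    """Two-stage retrieval sketch: BM25 refinement, then dense re-ranking."""
    # stage 1 (refinement): lexical recall with BM25 over whitespace tokens
    bm25 = BM25Okapi([a.lower().split() for a in articles])
    scores = bm25.get_scores(tweet.lower().split())
    candidates = sorted(range(len(articles)), key=lambda i: -scores[i])[:top_k]

    # stage 2 (re-ranking): multilingual dense similarity over the candidates,
    # which can match a tweet to fact-checks written in another language
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    q = model.encode(tweet, convert_to_tensor=True)
    d = model.encode([articles[i] for i in candidates], convert_to_tensor=True)
    sims = util.cos_sim(q, d)[0]
    order = sims.argsort(descending=True)
    return [candidates[int(i)] for i in order]
```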

eess.IV - 2023-08-11

Towards Packaging Unit Detection for Automated Palletizing Tasks

  • paper_url: http://arxiv.org/abs/2308.06306
  • repo_url: None
  • paper_authors: Markus Völk, Kilian Kleeberger, Werner Kraus, Richard Bormann
  • for: automated palletizing tasks
  • methods: fully trained on synthetic data, can handle sparse and low-quality sensor data, can exploit prior knowledge
  • results: generalizes to a wide range of products and application scenarios; demonstrated on real-world data with many different retail products; integrated into a lab demonstrator, and a commercial solution will be marketed through an industrial partner.
    Abstract For various automated palletizing tasks, the detection of packaging units is a crucial step preceding the actual handling of the packaging units by an industrial robot. We propose an approach to this challenging problem that is fully trained on synthetically generated data and can be robustly applied to arbitrary real world packaging units without further training or setup effort. The proposed approach is able to handle sparse and low quality sensor data, can exploit prior knowledge if available and generalizes well to a wide range of products and application scenarios. To demonstrate the practical use of our approach, we conduct an extensive evaluation on real-world data with a wide range of different retail products. Further, we integrated our approach in a lab demonstrator and a commercial solution will be marketed through an industrial partner.

A Self-supervised SAR Image Despeckling Strategy Based on Parameter-sharing Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2308.05975
  • repo_url: None
  • paper_authors: Liang Chen, Yifei Yin, Hao Shi, Qingqing Sheng, Wei Li
  • for: This paper proposes a self-supervised SAR despeckling strategy that can be trained without speckle-free images.
  • methods: A sub-sampler based on the adjacent-syntropy criterion generates training image pairs from real-world SAR images, and parameter-sharing convolutional neural networks make full use of these pairs. A multi-feature loss function combining a despeckling term, a regular term, and a perception term constrains the gap between the generated paired images while preserving edge and texture features.
  • results: The proposed method is validated on real-world SAR images and outperforms several advanced SAR image despeckling methods, with improved preservation of edge and texture features.
    Abstract Speckle noise arises from the SAR imaging mechanism and complicates SAR image interpretation. Hence, despeckling is a helpful step in SAR pre-processing. Deep learning has proven to be a promising approach for SAR image despeckling, but most deep learning despeckling methods are based on supervised learning, which requires original SAR images paired with speckle-free SAR images to train the network. Since speckle-free SAR images are generally unavailable, prior work has synthetically added multiplicative noise to optical images to simulate speckled images. SAR image despeckling therefore faces the following challenges: (1) the lack of speckle-free SAR images; (2) the difficulty of preserving details such as edges and textures in heterogeneous areas. To address these issues, we propose a self-supervised SAR despeckling strategy that can be trained without speckle-free images. Firstly, the feasibility of SAR image despeckling without speckle-free images is proved theoretically. Then, a sub-sampler based on the adjacent-syntropy criterion is proposed: training image pairs are generated by the sub-sampler from real-world SAR images to estimate the noise distribution. Furthermore, to make full use of the training pairs, parameter-sharing convolutional neural networks are adopted. Finally, according to the characteristics of SAR images, a multi-feature loss function is proposed, composed of a despeckling term, a regular term, and a perception term, to constrain the gap between the generated paired images while improving the preservation of edge and texture features. Qualitative and quantitative experiments on real-world SAR images show better performance than several advanced SAR image despeckling methods.
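The pair-generation step can be pictured with a Neighbor2Neighbor-style random adjacent sub-sampler, sketched below for a single-channel image. The paper's adjacent-syntropy criterion replaces the purely random pixel pick with a similarity-based one, so treat this as a simplified stand-in.

```python
import numpy as np

def adjacent_subsample(img, rng=None):
    """Create two half-resolution 'sibling' images from one speckled image.

    Each 2x2 cell contributes one pixel to each sibling; training a network
    to map one sibling to the other requires no speckle-free reference.
    """
    rng = rng or np.random.default_rng(0)
    H, W = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2  # crop to even size
    a = np.empty((H // 2, W // 2), img.dtype)
    b = np.empty_like(a)
    for i in range(H // 2):
        for j in range(W // 2):
            cell = [(2*i, 2*j), (2*i, 2*j+1), (2*i+1, 2*j), (2*i+1, 2*j+1)]
            p, q = rng.choice(4, size=2, replace=False)  # two distinct pixels
            a[i, j] = img[cell[p]]
            b[i, j] = img[cell[q]]
    return a, b
```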

Classification Method of Road Surface Condition and Type with LiDAR Using Spatiotemporal Information

  • paper_url: http://arxiv.org/abs/2308.05965
  • repo_url: None
  • paper_authors: Ju Won Seo, Jin Sung Kim, Chung Choo Chung
  • for: This paper classifies road surface conditions and types with a deep neural network (DNN) using information provided by LiDAR.
  • methods: LiDAR reflectivity and point-cloud counts are used, with the front road area divided into four subregions. Feature vectors are constructed from each subregion's reflectivity, point-cloud count, and in-vehicle information; a DNN classifies each subregion; finally, the DNN outputs feed a spatiotemporal process that produces the final classification from vehicle speed and the softmax probabilities of the output layer.
  • results: In a comparative study, the proposed DNN achieved the highest accuracy (98.0% and 98.6%) for the two subregions nearest the vehicle, and an implementation on the Jetson TX2 board confirmed real-time feasibility.
    Abstract This paper proposes a spatiotemporal architecture with a deep neural network (DNN) for classifying road surface conditions and types using LiDAR. LiDAR provides reflectivity and point-cloud counts that depend on the road surface, and this paper utilizes that information to classify the surface. We divided the front road area into four subregions. First, we constructed feature vectors using each subregion's reflectivity, point-cloud count, and in-vehicle information. Second, the DNN classifies road surface conditions and types for each subregion. Finally, the output of the DNN feeds into the spatiotemporal process to make the final classification, reflecting vehicle speed and the probabilities given by the softmax functions of the DNN output layer. To validate the effectiveness of the proposed method, we performed a comparative study with five other algorithms. With the proposed DNN, we obtained the highest accuracies of 98.0% and 98.6% for the two subregions near the vehicle. In addition, we implemented the proposed method on the Jetson TX2 board to confirm that it is applicable in real-time.

Unleashing the Strengths of Unlabeled Data in Pan-cancer Abdominal Organ Quantification: the FLARE22 Challenge

  • paper_url: http://arxiv.org/abs/2308.05862
  • repo_url: https://github.com/junma11/flare
  • paper_authors: Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Shihao Ma, Adamo Young, Cheng Zhu, Kangkang Meng, Xin Yang, Ziyan Huang, Fan Zhang, Wentao Liu, YuanKe Pan, Shoujin Huang, Jiacheng Wang, Mingze Sun, Weixin Xu, Dengqiang Jia, Jae Won Choi, Natália Alves, Bram de Wilde, Gregor Koehler, Yajun Wu, Manuel Wiesenfarth, Qiongjie Zhu, Guoqiang Dong, Jian He, the FLARE Challenge Consortium, Bo Wang
  • for: The paper aims to evaluate the accuracy and efficiency of artificial intelligence (AI) algorithms in automated abdominal disease diagnosis and treatment planning, particularly in real-world multinational settings.
  • methods: The authors organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. They constructed an intercontinental and multinational dataset from more than 50 medical groups, including Computed Tomography (CT) scans with different races, diseases, phases, and manufacturers.
  • results: The best-performing algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0% by using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. The algorithms also successfully generalized to holdout external validation sets, achieving a median DSC of 89.5%, 90.9%, and 88.3% on North American, European, and Asian cohorts, respectively. Additionally, the algorithms enabled automatic extraction of key organ biology features, which was labor-intensive with traditional manual measurements.
    Abstract Quantitative organ assessment is an essential step in automated abdominal disease diagnosis and treatment planning. Artificial intelligence (AI) has shown great potential to automatize this process. However, most existing AI algorithms rely on many expert annotations and lack a comprehensive evaluation of accuracy and efficiency in real-world multinational settings. To overcome these limitations, we organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. We constructed an intercontinental and multinational dataset from more than 50 medical groups, including Computed Tomography (CT) scans with different races, diseases, phases, and manufacturers. We independently validated that a set of AI algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0% by using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. The best-performing algorithms successfully generalized to holdout external validation sets, achieving a median DSC of 89.5%, 90.9%, and 88.3% on North American, European, and Asian cohorts, respectively. They also enabled automatic extraction of key organ biology features, which was labor-intensive with traditional manual measurements. This opens the potential to use unlabeled data to boost performance and alleviate annotation shortages for modern AI models.
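For reference, the headline metric is the Dice Similarity Coefficient, DSC = 2|A ∩ B| / (|A| + |B|). A minimal per-organ computation over two label volumes might look like the following generic sketch (not the challenge's official evaluation code, which also handles surface metrics and edge cases).

```python
import numpy as np

def dice(pred, gt, label):
    """DSC = 2|A ∩ B| / (|A| + |B|) for one organ label in two label volumes."""
    a, b = pred == label, gt == label
    denom = a.sum() + b.sum()
    # convention: if the organ is absent from both volumes, score it as 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# usage: median DSC across cases for, say, label 3
# scores = [dice(p, g, 3) for p, g in zip(predictions, ground_truths)]
# print(np.median(scores))
```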

End-to-End Optimization of JPEG-Based Deep Learning Process for Image Classification

  • paper_url: http://arxiv.org/abs/2308.05840
  • repo_url: None
  • paper_authors: Siyu Qi, Lahiru D. Chamain, Zhi Ding
  • for: This paper targets distributed learning in deep learning applications, where image classification requires effective image compression codecs on low-cost sensing devices for efficient transmission and storage.
  • methods: A unified end-to-end trainable model is used, consisting of a JPEG image codec and a deep learning-based classifier; the model tunes widely deployed JPEG codec settings to improve classification accuracy under bandwidth constraints.
  • results: Tests on CIFAR-100 and ImageNet show improved validation accuracy over preset JPEG configurations.
    Abstract Among major deep learning (DL) applications, distributed learning involving image classification requires effective image compression codecs deployed on low-cost sensing devices for efficient transmission and storage. Traditional codecs such as JPEG, designed for perceptual quality, are not configured for DL tasks. This work introduces an integrative end-to-end trainable model for image compression and classification consisting of a JPEG image codec and a DL-based classifier. We demonstrate how this model can optimize the widely deployed JPEG codec settings to improve classification accuracy in consideration of bandwidth constraint. Our tests on CIFAR-100 and ImageNet also demonstrate improved validation accuracy over preset JPEG configuration.
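The accuracy/bandwidth tradeoff that the paper optimizes end to end can be probed with a simple, non-differentiable sweep over JPEG's quality setting. The sketch below uses Pillow's real JPEG encoder; the `classifier` callable is an assumed placeholder, and the loop only illustrates the tradeoff rather than the paper's jointly trained codec.

```python
import io
import numpy as np
from PIL import Image

def sweep_jpeg_quality(images, labels, classifier, qualities=(10, 30, 50, 75, 95)):
    """Measure accuracy and mean compressed size at each JPEG quality.

    images: list of PIL RGB images; labels: int class labels;
    classifier: callable mapping an HxWx3 uint8 array to a predicted label.
    """
    results = []
    for q in qualities:
        correct, nbytes = 0, 0
        for img, y in zip(images, labels):
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=q)  # encode at quality q
            nbytes += buf.tell()
            buf.seek(0)
            decoded = np.asarray(Image.open(buf))
            correct += int(classifier(decoded) == y)
        results.append((q, correct / len(labels), nbytes / len(labels)))
    return results  # (quality, accuracy, mean bytes per image) triples
```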

Spatial Pathomics Toolkit for Quantitative Analysis of Podocyte Nuclei with Histology and Spatial Transcriptomics Data in Renal Pathology

  • paper_url: http://arxiv.org/abs/2308.06288
  • repo_url: https://github.com/hrlblab/spatial_pathomics
  • paper_authors: Jiayuan Chen, Yu Wang, Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Yilin Liu, Jianyong Zhong, Agnes B. Fogo, Haichun Yang, Shilin Zhao, Yuankai Huo
  • for: This study develops a new toolkit that helps researchers better characterize podocytes in renal disease.
  • methods: The toolkit comprises three main components: (1) instance object segmentation, enabling precise identification of podocyte nuclei; (2) pathomics feature generation, extracting an array of quantitative features from the identified nuclei; and (3) robust statistical analyses, enabling a comprehensive exploration of spatial relationships.
  • results: The toolkit successfully extracted and analyzed the morphological and textural features of podocytes, revealing the diversity of podocyte morphology, and the statistical analyses uncovered spatial patterns in podocyte distribution.
    Abstract Podocytes, specialized epithelial cells that envelop the glomerular capillaries, play a pivotal role in maintaining renal health. The current description and quantification of features on pathology slides are limited, prompting the need for innovative solutions to comprehensively assess diverse phenotypic attributes within Whole Slide Images (WSIs). In particular, understanding the morphological characteristics of podocytes, terminally differentiated glomerular epithelial cells, is crucial for studying glomerular injury. This paper introduces the Spatial Pathomics Toolkit (SPT) and applies it to podocyte pathomics. The SPT consists of three main components: (1) instance object segmentation, enabling precise identification of podocyte nuclei; (2) pathomics feature generation, extracting a comprehensive array of quantitative features from the identified nuclei; and (3) robust statistical analyses, facilitating a comprehensive exploration of spatial relationships between morphological and spatial transcriptomics features.The SPT successfully extracted and analyzed morphological and textural features from podocyte nuclei, revealing a multitude of podocyte morphomic features through statistical analysis. Additionally, we demonstrated the SPT's ability to unravel spatial information inherent to podocyte distribution, shedding light on spatial patterns associated with glomerular injury. By disseminating the SPT, our goal is to provide the research community with a powerful and user-friendly resource that advances cellular spatial pathomics in renal pathology. The implementation and its complete source code of the toolkit are made openly accessible at https://github.com/hrlblab/spatial_pathomics.

Leverage Weakly Annotation to Pixel-wise Annotation via Zero-shot Segment Anything Model for Molecular-empowered Learning

  • paper_url: http://arxiv.org/abs/2308.05785
  • repo_url: None
  • paper_authors: Xueyuan Li, Ruining Deng, Yucheng Tang, Shunxing Bao, Haichun Yang, Yuankai Huo
  • for: This study aims to improve the accuracy and efficiency of annotation and to open the process to non-expert (lay) annotators, making annotation more accessible and scalable.
  • methods: The SAM model is used to generate pixel-level annotations from weak box annotations, and these SAM-generated labels are then used to train a segmentation model.
  • results: The proposed SAM-L method shifts the labeling effort from expert to lay annotators, requiring only weak box annotations instead of pixel-level delineation, without compromising annotation accuracy.
    Abstract Precise identification of multiple cell classes in high-resolution Giga-pixel whole slide imaging (WSI) is critical for various clinical scenarios. Building an AI model for this purpose typically requires pixel-level annotations, which are often unscalable and must be done by skilled domain experts (e.g., pathologists). However, these annotations can be prone to errors, especially when distinguishing between intricate cell types (e.g., podocytes and mesangial cells) using only visual inspection. Interestingly, a recent study showed that lay annotators, when using extra immunofluorescence (IF) images for reference (referred to as molecular-empowered learning), can sometimes outperform domain experts in labeling. Despite this, the resource-intensive task of manual delineation remains a necessity during the annotation process. In this paper, we explore the potential of bypassing pixel-level delineation by employing the recent segment anything model (SAM) on weak box annotation in a zero-shot learning approach. Specifically, we harness SAM's ability to produce pixel-level annotations from box annotations and utilize these SAM-generated labels to train a segmentation model. Our findings show that the proposed SAM-assisted molecular-empowered learning (SAM-L) can diminish the labeling efforts for lay annotators by only requiring weak box annotations. This is achieved without compromising annotation accuracy or the performance of the deep learning-based segmentation. This research represents a significant advancement in democratizing the annotation process for training pathological image segmentation, relying solely on non-expert annotators.
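The zero-shot box-to-mask step can be reproduced with the public `segment-anything` package and its documented predictor API; the checkpoint path and box coordinates below are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# load a SAM backbone (checkpoint path is a placeholder)
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def box_to_mask(image_rgb, box_xyxy):
    """Turn a weak box annotation into a pixel-level mask via zero-shot SAM.

    image_rgb: HxWx3 uint8 RGB array; box_xyxy: (x0, y0, x1, y1).
    """
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        box=np.asarray(box_xyxy),   # length-4 box prompt in XYXY format
        multimask_output=False,
    )
    return masks[0]                 # boolean HxW mask, usable as a training label
```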

High-performance Data Management for Whole Slide Image Analysis in Digital Pathology

  • paper_url: http://arxiv.org/abs/2308.05784
  • repo_url: https://github.com/hrlblab/adios
  • paper_authors: Haoju Leng, Ruining Deng, Shunxing Bao, Dazheng Fang, Bryan A. Millis, Yucheng Tang, Haichun Yang, Xiao Wang, Yifan Peng, Lipeng Wan, Yuankai Huo
  • for: This paper addresses the data-access challenge in whole slide image analysis, where the I/O system becomes the computational bottleneck when deploying image analysis algorithms.
  • methods: The ADIOS2 system is employed, and a data management pipeline designed specifically for whole slide images is built on top of it.
  • results: Experiments show reduced data access times: ADIOS2 is twice as fast as the brute-force approach in the CPU scenario and on par with NVIDIA Magnum IO GPU Direct Storage (GDS) in the GPU scenario.
    Abstract When dealing with giga-pixel digital pathology in whole-slide imaging, a notable proportion of data records holds relevance during each analysis operation. For instance, when deploying an image analysis algorithm on whole-slide images (WSI), the computational bottleneck often lies in the input-output (I/O) system. This is particularly notable as patch-level processing introduces a considerable I/O load onto the computer system. However, this data management process could be further parallelized, given the typical independence of patch-level image processes across different patches. This paper details our endeavors in tackling this data access challenge by implementing the Adaptable IO System version 2 (ADIOS2). Our focus has been constructing and releasing a digital pathology-centric pipeline using ADIOS2, which facilitates streamlined data management across WSIs. Additionally, we've developed strategies aimed at curtailing data retrieval times. The performance evaluation encompasses two key scenarios: (1) a pure CPU-based image analysis scenario ("CPU scenario"), and (2) a GPU-based deep learning framework scenario ("GPU scenario"). Our findings reveal noteworthy outcomes. Under the CPU scenario, ADIOS2 showcases an impressive two-fold speed-up compared to the brute-force approach. In the GPU scenario, its performance stands on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS). To the best of our knowledge, this appears to be among the first uses of ADIOS2 in the field of digital pathology. The source code has been made publicly available at https://github.com/hrlblab/adios.
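As a rough illustration of the ADIOS2 usage pattern, the sketch below writes a patch array and reads back a sub-region through the high-level `adios2.open` file API of the 2.x Python bindings. The variable name and chunking are assumptions, and the paper's pipeline adds pathology-specific layout and prefetching on top of this.

```python
import numpy as np
import adios2

patch = np.random.randint(0, 255, (4096, 4096, 3), dtype=np.uint8)
shape = list(patch.shape)            # global shape of the variable
start = [0, 0, 0]                    # offset of this block
count = list(patch.shape)            # extent of this block

# write one patch as an ADIOS2 variable
with adios2.open("wsi_patches.bp", "w") as fw:
    fw.write("patch", patch, shape, start, count)

# read back only a 512x512 crop, without loading the full array
with adios2.open("wsi_patches.bp", "r") as fr:
    crop = fr.read("patch", [0, 0, 0], [512, 512, 3])
```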

Multi-scale Multi-site Renal Microvascular Structures Segmentation for Whole Slide Imaging in Renal Pathology

  • paper_url: http://arxiv.org/abs/2308.05782
  • repo_url: None
  • paper_authors: Franklin Hu, Ruining Deng, Shunxing Bao, Haichun Yang, Yuankai Huo
  • for: automatic segmentation of microvascular structures in human kidney whole slide images
  • methods: uses a novel single dynamic network method that capitalizes on multi-site, multi-scale training data and partially labeled images
  • results: outperforms other methods in terms of both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU)
    Abstract Segmentation of microvascular structures, such as arterioles, venules, and capillaries, from human kidney whole slide images (WSI) has become a focal point in renal pathology. Current manual segmentation techniques are time-consuming and not feasible for large-scale digital pathology images. While deep learning-based methods offer a solution for automatic segmentation, most suffer from a limitation: they are designed for and restricted to training on single-site, single-scale data. In this paper, we present Omni-Seg, a novel single dynamic network method that capitalizes on multi-site, multi-scale training data. Unique to our approach, we utilize partially labeled images, where only one tissue type is labeled per training image, to segment microvascular structures. We train a singular deep network using images from two datasets, HuBMAP and NEPTUNE, across different magnifications (40x, 20x, 10x, and 5x). Experimental results indicate that Omni-Seg outperforms in terms of both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). Our proposed method provides renal pathologists with a powerful computational tool for the quantitative analysis of renal microvascular structures.

cs.SD - 2023-08-10

Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets

  • paper_url: http://arxiv.org/abs/2308.05429
  • repo_url: https://github.com/groupmm/stabilizing_sdtw
  • paper_authors: Johannes Zeitler, Simon Deniffel, Michael Krause, Meinard Müller
  • for: This paper aims to stabilize the training of neural networks on weakly aligned data.
  • methods: The soft dynamic time warping (SDTW) loss is used, and three different stabilization strategies are investigated to counter the incorrect parameter updates caused by mismatches between soft alignments and reference alignments in the early training stage.
  • results: The three stabilization strategies make training markedly more stable, and experiments demonstrate their effectiveness.
    Abstract Soft dynamic time warping (SDTW) is a differentiable loss function that allows for training neural networks from weakly aligned data. Typically, SDTW is used to iteratively compute and refine soft alignments that compensate for temporal deviations between the training data and its weakly annotated targets. One major problem is that a mismatch between the estimated soft alignments and the reference alignments in the early training stage leads to incorrect parameter updates, making the overall training procedure unstable. In this paper, we investigate such stability issues by considering the task of pitch class estimation from music recordings as an illustrative case study. In particular, we introduce and discuss three conceptually different strategies (a hyperparameter scheduling, a diagonal prior, and a sequence unfolding strategy) with the objective of stabilizing intermediate soft alignment results. Finally, we report on experiments that demonstrate the effectiveness of the strategies and discuss efficiency and implementation issues.
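To make one of the strategies concrete, here is a minimal NumPy forward pass of soft-DTW in which a penalty proportional to the distance from the main diagonal is added to the local cost, discouraging degenerate alignments early in training. The weight `w` and smoothing `gamma` are free hyperparameters; this is a generic sketch of the diagonal-prior idea, not the authors' exact implementation.

```python
import numpy as np

def softmin(a, b, c, gamma):
    """Smooth minimum: -gamma * logsumexp(-x / gamma), numerically stable."""
    vals = -np.array([a, b, c]) / gamma
    m = vals.max()
    return -gamma * (m + np.log(np.exp(vals - m).sum()))

def sdtw_diagonal_prior(D, gamma=1.0, w=0.0):
    """Soft-DTW forward pass over a pairwise cost matrix D of shape (n, m),
    with an extra penalty w * |i/n - j/m| that biases alignments toward
    the main diagonal (the 'diagonal prior')."""
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prior = w * abs((i - 1) / (n - 1 + 1e-9) - (j - 1) / (m - 1 + 1e-9))
            R[i, j] = D[i - 1, j - 1] + prior + softmin(
                R[i - 1, j], R[i, j - 1], R[i - 1, j - 1], gamma)
    return R[n, m]  # differentiable alignment cost (loss value)
```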

A Novel Self-training Approach for Low-resource Speech Recognition

  • paper_url: http://arxiv.org/abs/2308.05269
  • repo_url: None
  • paper_authors: Satwinder Singh, Feng Hou, Ruili Wang
  • for: Improving the accuracy of automatic speech recognition (ASR) in low-resource settings.
  • methods: A self-training approach is proposed that uses abundant unlabeled speech data to generate highly accurate pseudo-labels, improving the accuracy of the ASR system.
  • results: Experimental analysis shows improved word error rate on four real speech datasets, a relative improvement of 14.94% over the baseline model; the proposed approach also reports the best results on the Common Voice Punjabi dataset.
    Abstract In this paper, we propose a self-training approach for automatic speech recognition (ASR) for low-resource settings. While self-training approaches have been extensively developed and evaluated for high-resource languages such as English, their applications to low-resource languages like Punjabi have been limited, despite the language being spoken by millions globally. The scarcity of annotated data has hindered the development of accurate ASR systems, especially for low-resource languages (e.g., Punjabi and Māori languages). To address this issue, we propose an effective self-training approach that generates highly accurate pseudo-labels for unlabeled low-resource speech. Our experimental analysis demonstrates that our approach significantly improves word error rate, achieving a relative improvement of 14.94% compared to a baseline model across four real speech datasets. Further, our proposed approach reports the best results on the Common Voice Punjabi dataset.
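The generic pseudo-labeling loop behind such self-training can be sketched as follows. `asr_model.fit` and `transcribe_with_confidence` are hypothetical placeholders for a trainable ASR system; the paper's contribution lies in how pseudo-labels are generated and selected, which this sketch reduces to a simple confidence threshold.

```python
def self_train(asr_model, labeled, unlabeled, rounds=3, threshold=0.9):
    """Generic self-training loop: label, filter, retrain (a sketch).

    labeled: list of (audio, transcript) pairs; unlabeled: list of audio clips.
    """
    model = asr_model.fit(labeled)                 # seed model on labeled data
    for _ in range(rounds):
        pseudo = []
        for audio in unlabeled:
            text, conf = model.transcribe_with_confidence(audio)
            if conf >= threshold:                  # keep only confident labels
                pseudo.append((audio, text))
        model = asr_model.fit(labeled + pseudo)    # retrain on the union
    return model
```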

Separate Anything You Describe

  • paper_url: http://arxiv.org/abs/2308.05037
  • repo_url: https://github.com/audio-agi/audiosep
  • paper_authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
  • for: This work develops a computational auditory scene analysis (CASA) system that separates a target sound from an audio mixture given a natural language query.
  • methods: The AudioSep foundation model is trained on large-scale multimodal datasets and extensively evaluated on tasks including audio event separation, musical instrument separation, and speech enhancement.
  • results: AudioSep demonstrates strong separation performance and impressive zero-shot generalization using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models.
    Abstract Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.

cs.CV - 2023-08-10

AD-CLIP: Adapting Domains in Prompt Space Using CLIP

  • paper_url: http://arxiv.org/abs/2308.05659
  • repo_url: None
  • paper_authors: Mainak Singha, Harsh Pal, Ankit Jha, Biplab Banerjee
  • for: Addressing the generalization problem when the training (source) domain and test (target) domain differ.
  • methods: The frozen vision backbone of CLIP is used to extract image style (domain) and content information, from which prompt tokens are learned. The prompts are designed to be domain-invariant and class-generalizable by conditioning prompt learning on image style and content features simultaneously.
  • results: AD-CLIP outperforms the existing literature on three standard domain adaptation benchmark datasets.
    Abstract Although deep learning models have shown impressive performance on supervised learning tasks, they often struggle to generalize well when the training (source) and test (target) domains differ. Unsupervised domain adaptation (DA) has emerged as a popular solution to this problem. However, current DA techniques rely on visual backbones, which may lack semantic richness. Despite the potential of large-scale vision-language foundation models like CLIP, their effectiveness for DA has yet to be fully explored. To address this gap, we introduce AD-CLIP, a domain-agnostic prompt learning strategy for CLIP that aims to solve the DA problem in the prompt space. We leverage the frozen vision backbone of CLIP to extract both image style (domain) and content information, which we apply to learn prompt tokens. Our prompts are designed to be domain-invariant and class-generalizable, by conditioning prompt learning on image style and content features simultaneously. We use standard supervised contrastive learning in the source domain, while proposing an entropy minimization strategy to align domains in the embedding space given the target domain data. We also consider a scenario where only target domain samples are available during testing, without any source domain data, and propose a cross-domain style mapping network to hallucinate domain-agnostic tokens. Our extensive experiments on three benchmark DA datasets demonstrate the effectiveness of AD-CLIP compared to existing literature.

Attention-based 3D CNN with Multi-layer Features for Alzheimer’s Disease Diagnosis using Brain Images

  • paper_url: http://arxiv.org/abs/2308.05655
  • repo_url: None
  • paper_authors: Yanteng Zhang, Qizhi Teng, Xiaohai He, Tong Niu, Lipei Zhang, Yan Liu, Chao Ren
  • for: This work aims to improve the diagnostic accuracy for Alzheimer's disease by combining multi-layer features with an attention mechanism, extracting better features from brain images to capture disease-related characteristics.
  • methods: An end-to-end 3D CNN framework based on ResNet is used, fusing multi-layer features through an attention mechanism to better capture subtle differences in brain images.
  • results: Ablation experiments on two imaging modalities from 792 subjects of the ADNI database achieved AD diagnostic accuracies of 89.71% and 91.18%, outperforming several state-of-the-art methods.
    Abstract Structural MRI and PET imaging play an important role in the diagnosis of Alzheimer's disease (AD), showing the morphological changes and glucose metabolism changes in the brain, respectively. The manifestations in the brain images of some cognitively impaired patients are relatively inconspicuous; for example, accurate diagnosis through sMRI alone remains difficult in clinical practice. With the emergence of deep learning, convolutional neural networks (CNNs) have become a valuable method in AD-aided diagnosis, but some CNN methods cannot effectively learn the features of brain images, so the diagnosis of AD still presents challenges. In this work, we propose an end-to-end 3D CNN framework for AD diagnosis based on ResNet, which integrates multi-layer features obtained under the effect of the attention mechanism to better capture subtle differences in brain images. The attention maps show that our model can focus on key brain regions related to the disease diagnosis. Our method was verified in ablation experiments with two modality images on 792 subjects from the ADNI database, where AD diagnostic accuracies of 89.71% and 91.18% were achieved based on sMRI and PET respectively, outperforming some state-of-the-art methods.

Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization

  • paper_url: http://arxiv.org/abs/2308.05648
  • repo_url: https://github.com/sldz0306/ccr
  • paper_authors: Zezhong Lv, Bing Su, Ji-Rong Wen
  • for: This work aims to improve weakly supervised video moment localization by better aligning video and natural language.
  • methods: Cross-modality similarity matching and counterfactual cross-modality reasoning are used.
  • results: Experiments show that the proposed method mitigates spurious correlations in the reconstruction of video proposals and improves localization accuracy.
    Abstract Video moment localization aims to retrieve the target segment of an untrimmed video according to a natural language query. Weakly supervised methods have gained attention recently, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges for weakly supervised methods lies in the mismatch between the video and language induced by the coarse temporal annotations. To refine the vision-language alignment, recent works contrast the cross-modality similarities driven by reconstructing masked queries between positive and negative video proposals. However, the reconstruction may be influenced by the latent spurious correlation between the unmasked and the masked parts, which distorts the restoring process and further degrades the efficacy of contrastive learning, since the masked words are not completely reconstructed from the cross-modality knowledge. In this paper, we discover and mitigate this spurious correlation through a novel proposed counterfactual cross-modality reasoning method. Specifically, we first formulate query reconstruction as an aggregated causal effect of cross-modality and query knowledge. Then by introducing counterfactual cross-modality knowledge into this aggregation, the spurious impact of the unmasked part contributing to the reconstruction is explicitly modeled. Finally, by suppressing the unimodal effect of the masked query, we can rectify the reconstructions of video proposals to perform reasonable contrastive learning. Extensive experimental evaluations demonstrate the effectiveness of our proposed method. The code is available at \href{https://github.com/sLdZ0306/CCR}{https://github.com/sLdZ0306/CCR}.

Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network

  • paper_url: http://arxiv.org/abs/2308.05605
  • repo_url: https://github.com/wencheng256/daccn
  • paper_authors: Wencheng Han, Junbo Yin, Jianbing Shen
  • for: Improving the accuracy of self-supervised monocular depth estimation.
  • methods: A new direction-aware module and an improved cumulative convolution are proposed.
  • results: Experiments show that the method achieves new state-of-the-art performance on three widely used benchmarks (KITTI, Cityscapes, and Make3D), with significant improvements under all three types of self-supervision.
    Abstract Monocular depth estimation is known as an ill-posed task in which objects in a 2D image usually do not contain sufficient information to predict their depth. Thus, it acts differently from other tasks (e.g., classification and segmentation) in many ways. In this paper, we find that self-supervised monocular depth estimation shows a direction sensitivity and environmental dependency in the feature representation. But the current backbones borrowed from other tasks pay less attention to handling different types of environmental information, limiting the overall depth accuracy. To bridge this gap, we propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth feature representation in two aspects. First, we propose a direction-aware module, which can learn to adjust the feature extraction in each direction, facilitating the encoding of different types of information. Secondly, we design a new cumulative convolution to improve the efficiency for aggregating important environmental information. Experiments show that our method achieves significant improvements on three widely used benchmarks, KITTI, Cityscapes, and Make3D, setting a new state-of-the-art performance on the popular benchmarks with all three types of self-supervision.
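One plausible reading of "cumulative convolution" is a set of directional running means fused by a 1x1 convolution, sketched below in PyTorch. The module structure and names are assumptions for illustration; the real DaCCN blocks are parameterized differently.

```python
import torch
import torch.nn as nn

class DirectionalCumulative(nn.Module):
    """Aggregate features as running means along four directions, then fuse.

    A sketch of the direction-aware cumulative idea: each direction gets its
    own aggregation path, so differently oriented environmental information
    can be encoded before a learned 1x1 convolution mixes them.
    """
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        nw = torch.arange(1, W + 1, device=x.device, dtype=x.dtype)
        nh = torch.arange(1, H + 1, device=x.device, dtype=x.dtype)
        l2r = x.cumsum(-1) / nw                              # left -> right
        r2l = x.flip(-1).cumsum(-1).flip(-1) / nw.flip(0)    # right -> left
        t2b = x.cumsum(-2) / nh[:, None]                     # top -> bottom
        b2t = x.flip(-2).cumsum(-2).flip(-2) / nh.flip(0)[:, None]  # bottom -> top
        return self.fuse(torch.cat([l2r, r2l, t2b, b2t], dim=1))
```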

Object Goal Navigation with Recursive Implicit Maps

  • paper_url: http://arxiv.org/abs/2308.05602
  • repo_url: None
  • paper_authors: Shizhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid
  • for: Helping an agent locate objects of a given category in unseen environments, improving navigation capability.
  • methods: An implicit map recursively updated with new observations, with auxiliary tasks introduced to encourage spatial reasoning.
  • results: State-of-the-art performance on the MP3D dataset, and successful object goal navigation in real scenes using only a few real-world demonstrations.
    Abstract Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments. Classical methods explicitly build maps of environments and require extensive engineering while lacking semantic information for object-oriented exploration. On the other hand, end-to-end learning methods alleviate manual map design and predict actions using implicit representations. Such methods, however, lack an explicit notion of geometry and may have limited ability to encode navigation history. In this work, we propose an implicit spatial map for object goal navigation. Our implicit map is recursively updated with new observations at each step using a transformer. To encourage spatial reasoning, we introduce auxiliary tasks and train our model to reconstruct explicit maps as well as to predict visual features, semantic labels and actions. Our method significantly outperforms the state of the art on the challenging MP3D dataset and generalizes well to the HM3D dataset. We successfully deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes using only a few real-world demonstrations. Code, trained models and videos are available at \url{https://www.di.ens.fr/willow/research/onav_rim/}.

NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search

  • paper_url: http://arxiv.org/abs/2308.05600
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: Reducing the computational cost and latency of deploying deep neural networks (DNNs) on hardware devices.
  • methods: Non-uniform post-training quantization, converting floating-point representations to low bit-width fixed-point ones for compact DNN inference.
  • results: State-of-the-art compression rates in both data-free and data-driven configurations.
    Abstract Deep neural network (DNN) deployment has been confined to larger hardware devices due to their expensive computational requirements. This challenge has recently reached another scale with the emergence of large language models (LLMs). In order to reduce both their memory footprint and latency, a promising technique is quantization. It consists in converting floating point representations to low bit-width fixed point representations, usually by assuming a uniform mapping onto a regular grid. This process, referred to in the literature as uniform quantization, may however be ill-suited as most DNN weights and activations follow a bell-shaped distribution. This is even worse on LLMs whose weight distributions are known to exhibit large, high impact, outlier values. In this work, we propose an improvement over the most commonly adopted way to tackle this limitation in deep learning models quantization, namely, non-uniform quantization. NUPES leverages automorphisms to preserve the scalar multiplications. Such transformations are derived from power functions. However, the optimization of the exponent parameter and weight values remains a challenging and novel problem which could not be solved with previous post training optimization techniques which only learn to round up or down weight values in order to preserve the predictive function. We circumvent this limitation with a new paradigm: learning new quantized weights over the entire quantized space. Similarly, we enable the optimization of the power exponent, i.e. the optimization of the quantization operator itself during training by alleviating all the numerical instabilities. The resulting predictive function is compatible with integer-only low-bit inference. We show the ability of the method to achieve state-of-the-art compression rates in both, data-free and data-driven configurations.
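
To make the power-exponent idea concrete, here is a minimal NumPy sketch of power-law (companding) quantization: values are warped by |w|^a, uniformly quantized on the warped axis, and warped back. NUPES additionally learns the exponent and the quantized weights end to end; the fixed exponent and simple rounding below are our simplifications.

```python
import numpy as np

def power_quantize(w: np.ndarray, n_bits: int = 4, exponent: float = 0.5) -> np.ndarray:
    """Power-law non-uniform quantization: compand by |w|**exponent, quantize
    uniformly on the companded axis, then expand back (sign kept separately)."""
    s = np.abs(w).max()                                    # per-tensor scale
    u = np.sign(w) * (np.abs(w) / s) ** exponent           # compand into [-1, 1]
    levels = 2 ** (n_bits - 1) - 1
    q = np.round(u * levels) / levels                      # uniform grid, signed n_bits
    return np.sign(q) * np.abs(q) ** (1.0 / exponent) * s  # expand back

w = np.random.randn(10_000) * 0.1                          # bell-shaped weights
err = np.abs(w - power_quantize(w)).max()
print(f"max reconstruction error: {err:.4f}")
```

With exponent < 1 the grid is denser near zero, which matches the bell-shaped weight distributions the abstract points out.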

Test-Time Selection for Robust Skin Lesion Analysis

  • paper_url: http://arxiv.org/abs/2308.05595
  • repo_url: https://github.com/alceubissoto/skin-tts
  • paper_authors: Alceu Bissoto, Catarina Barata, Eduardo Valle, Sandra Avila
  • for: Reducing bias in skin lesion classification models so that predictions rely on clinically relevant features.
  • methods: A human-in-the-loop method (TTS) that leverages positive (e.g., lesion area) and negative (e.g., artifacts) keypoints in test samples to keep the model from exploiting spurious factors.
  • results: Mitigates bias in skin lesion classification without retraining, with stable performance across different annotation budgets and bias levels.
    Abstract Skin lesion analysis models are biased by artifacts placed during image acquisition, which influence model predictions despite carrying no clinical information. Solutions that address this problem by regularizing models to prevent learning those spurious features achieve only partial success, and existing test-time debiasing techniques are inappropriate for skin lesion analysis due to either making unrealistic assumptions on the distribution of test data or requiring laborious annotation from medical practitioners. We propose TTS (Test-Time Selection), a human-in-the-loop method that leverages positive (e.g., lesion area) and negative (e.g., artifacts) keypoints in test samples. TTS effectively steers models away from exploiting spurious artifact-related correlations without retraining, and with less annotation requirements. Our solution is robust to a varying availability of annotations, and different levels of bias. We showcase on the ISIC2019 dataset (for which we release a subset of annotated images) how our model could be deployed in the real-world for mitigating bias.

Category Feature Transformer for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.05581
  • repo_url: None
  • paper_authors: Quan Tang, Chuanjian Liu, Fagui Liu, Yifan Liu, Jun Jiang, Bowen Zhang, Kai Han, Yunhe Wang
  • for: Improving semantic segmentation performance through better feature aggregation.
  • methods: The Category Feature Transformer (CFT), which uses multi-head attention to model the flow of category embeddings and transformations across multi-stage features, learning a unified embedding per semantic category from high-level features.
  • results: Superior performance over point-wise summation or concatenation across a broad range of backbones, including a compelling 55.1% mIoU on the challenging ADE20K dataset with greatly reduced parameters and computations.
    Abstract Aggregation of multi-stage features has been revealed to play a significant role in semantic segmentation. Unlike previous methods employing point-wise summation or concatenation for feature aggregation, this study proposes the Category Feature Transformer (CFT) that explores the flow of category embedding and transformation among multi-stage features through the prevalent multi-head attention mechanism. CFT learns unified feature embeddings for individual semantic categories from high-level features during each aggregation process and dynamically broadcasts them to high-resolution features. Integrating the proposed CFT into a typical feature pyramid structure exhibits superior performance over a broad range of backbone networks. We conduct extensive experiments on popular semantic segmentation benchmarks. Specifically, the proposed CFT obtains a compelling 55.1% mIoU with greatly reduced model parameters and computations on the challenging ADE20K dataset.
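
A minimal sketch of the underlying mechanism — learnable per-category queries cross-attending over high-level features, with the resulting category embeddings compared against high-resolution features — might look as follows in PyTorch. All module names and dimensions are illustrative assumptions, not the paper's exact CFT.

```python
import torch
import torch.nn as nn

class CategoryAttention(nn.Module):
    """Illustrative core of category-wise aggregation: per-class queries
    attend over high-level features; the resulting class embeddings score
    each high-resolution location."""

    def __init__(self, num_classes: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats_hl: torch.Tensor, feats_hr: torch.Tensor) -> torch.Tensor:
        # feats_hl: (B, N, D) flattened high-level features (keys/values)
        # feats_hr: (B, M, D) flattened high-resolution features
        q = self.queries.unsqueeze(0).expand(feats_hl.size(0), -1, -1)
        cat_emb, _ = self.attn(q, feats_hl, feats_hl)     # (B, K, D) class embeddings
        return feats_hr @ cat_emb.transpose(1, 2)         # (B, M, K) per-pixel class scores

logits = CategoryAttention(num_classes=150, dim=256)(
    torch.randn(2, 64, 256), torch.randn(2, 4096, 256))
```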

Cross-Domain Product Representation Learning for Rich-Content E-Commerce

  • paper_url: http://arxiv.org/abs/2308.05550
  • repo_url: https://github.com/adxcreative/cope
  • paper_authors: Xuehan Bai, Yan Li, Yanhua Cheng, Wenjie Yang, Quan Chen, Han Li
  • for: Addressing the inconsistent presentation of products across media channels in rich-content e-commerce, enabling cross-domain product recognition for a better search experience and more effective recommendations.
  • methods: A Cross-dOmain Product rEpresentation (COPE) framework that unifies product representations across media channels through multimodal (text and vision) learning.
  • results: Experiments show that COPE learns a joint feature space for product recognition and recommendation across media channels.
    Abstract The proliferation of short video and live-streaming platforms has revolutionized how consumers engage in online shopping. Instead of browsing product pages, consumers are now turning to rich-content e-commerce, where they can purchase products through dynamic and interactive media like short videos and live streams. This emerging form of online shopping has introduced technical challenges, as products may be presented differently across various media domains. Therefore, a unified product representation is essential for achieving cross-domain product recognition to ensure an optimal user search experience and effective product recommendations. Despite the urgent industrial need for a unified cross-domain product representation, previous studies have predominantly focused only on product pages without taking into account short videos and live streams. To fill the gap in the rich-content e-commerce area, in this paper, we introduce a large-scale cRoss-dOmain Product Ecognition dataset, called ROPE. ROPE covers a wide range of product categories and contains over 180,000 products, corresponding to millions of short videos and live streams. It is the first dataset to cover product pages, short videos, and live streams simultaneously, providing the basis for establishing a unified product representation across different media domains. Furthermore, we propose a Cross-dOmain Product rEpresentation framework, namely COPE, which unifies product representations in different domains through multimodal learning including text and vision. Extensive experiments on downstream tasks demonstrate the effectiveness of COPE in learning a joint feature space for all product domains.
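
Although the exact objective is in the paper, a standard recipe for pulling two domains into one feature space is a symmetric InfoNCE over paired embeddings, sketched below; COPE's full objective also spans live streams and text.

```python
import torch
import torch.nn.functional as F

def cross_domain_infonce(z_page: torch.Tensor, z_video: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two domains: row i of each batch embeds the
    same product seen on a product page and in a short video."""
    z_page = F.normalize(z_page, dim=1)
    z_video = F.normalize(z_video, dim=1)
    logits = z_page @ z_video.T / temperature             # (B, B) similarities
    labels = torch.arange(z_page.size(0), device=z_page.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```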

Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring

  • paper_url: http://arxiv.org/abs/2308.05543
  • repo_url: None
  • paper_authors: Liang Chen, Jiawei Zhang, Zhenhua Li, Yunxuan Wei, Faming Fang, Jimmy Ren, Jinshan Pan
  • for: Deblurring images captured under low light, where blur and saturated pixels co-occur.
  • methods: A data-driven approach that models saturated pixels with a learned latent map, casting non-blind deblurring as a maximum a posteriori (MAP) problem solved by iteratively computing the latent map and the latent image.
  • results: Experiments show the method delivers high-quality deblurring on synthetic and real-world images, without amplified artifacts, compared with existing algorithms.
    Abstract Images taken under the low-light condition often contain blur and saturated pixels at the same time. Deblurring images with saturated pixels is quite challenging. Because of the limited dynamic range, the saturated pixels are usually clipped in the imaging process and thus cannot be modeled by the linear blur model. Previous methods use manually designed smooth functions to approximate the clipping procedure. Their deblurring processes often require empirically defined parameters, which may not be the optimal choices for different images. In this paper, we develop a data-driven approach to model the saturated pixels by a learned latent map. Based on the new model, the non-blind deblurring task can be formulated into a maximum a posterior (MAP) problem, which can be effectively solved by iteratively computing the latent map and the latent image. Specifically, the latent map is computed by learning from a map estimation network (MEN), and the latent image estimation process is implemented by a Richardson-Lucy (RL)-based updating scheme. To estimate high-quality deblurred images without amplified artifacts, we develop a prior estimation network (PEN) to obtain prior information, which is further integrated into the RL scheme. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art algorithms both quantitatively and qualitatively on synthetic and real-world images.
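
For reference, the classical Richardson-Lucy iteration that the updating scheme builds on is x_{k+1} = x_k · K^T(y / (K x_k)). A textbook NumPy implementation with a known, normalized blur kernel follows; the paper's learned latent map (for saturated pixels) and prior estimation network are not reproduced here.

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(y: np.ndarray, k: np.ndarray, n_iter: int = 30,
                    eps: float = 1e-8) -> np.ndarray:
    """Classical Richardson-Lucy deconvolution of blurry image y with a
    known blur kernel k (assumed non-negative and normalized to sum to 1)."""
    x = np.full_like(y, y.mean())                         # flat initialization
    k_adj = k[::-1, ::-1]                                 # adjoint = flipped kernel
    for _ in range(n_iter):
        est = fftconvolve(x, k, mode="same")              # K x_k
        x = x * fftconvolve(y / (est + eps), k_adj, mode="same")
    return x
```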

Robust Asymmetric Loss for Multi-Label Long-Tailed Learning

  • paper_url: http://arxiv.org/abs/2308.05542
  • repo_url: https://github.com/kalelpark/RAL
  • paper_authors: Wongi Park, Inhyuk Park, Sungeun Kim, Jongbin Ryu
  • for: Addressing long-tailed class distributions and multi-label classification jointly.
  • methods: A Robust Asymmetric Loss (RAL) that tackles both problems simultaneously.
  • results: Strong performance on various long-tailed single-label datasets, and a Top-5 result on the CXR-LT dataset of the ICCV CVAMD 2023 competition.
    Abstract In real medical data, training samples typically show long-tailed distributions with multiple labels. Class distribution of the medical data has a long-tailed shape, in which the incidence of different diseases is quite varied, and at the same time, it is not unusual for images taken from symptomatic patients to be multi-label diseases. Therefore, in this paper, we concurrently address these two issues by putting forth a robust asymmetric loss on the polynomial function. Since our loss tackles both long-tailed and multi-label classification problems simultaneously, it leads to a complex design of the loss function with a large number of hyper-parameters. Although a model can be highly fine-tuned due to a large number of hyper-parameters, it is difficult to optimize all hyper-parameters at the same time, and there might be a risk of overfitting a model. Therefore, we regularize the loss function using the Hill loss approach, which is beneficial to be less sensitive against the numerous hyper-parameters so that it reduces the risk of overfitting the model. For this reason, the proposed loss is a generic method that can be applied to most medical image classification tasks and does not make the training process more time-consuming. We demonstrate that the proposed robust asymmetric loss performs favorably against the long-tailed with multi-label medical image classification in addition to the various long-tailed single-label datasets. Notably, our method achieves Top-5 results on the CXR-LT dataset of the ICCV CVAMD 2023 competition. We opensource our implementation of the robust asymmetric loss in the public repository: https://github.com/kalelpark/RAL.
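
The exact RAL formulation (polynomial terms plus Hill-style regularization) is in the paper; as a reference point, the generic asymmetric multi-label loss this family builds on can be sketched as follows, with separate focusing exponents for positives and negatives and a probability shift for negatives.

```python
import torch

def asymmetric_loss(logits: torch.Tensor, targets: torch.Tensor,
                    gamma_pos: float = 0.0, gamma_neg: float = 4.0,
                    clip: float = 0.05) -> torch.Tensor:
    """Generic asymmetric multi-label loss: negatives get a stronger focusing
    exponent and a probability shift, down-weighting easy negatives from head
    classes. `targets` is a multi-hot tensor shaped like `logits`."""
    p = torch.sigmoid(logits)
    p_shift = (p - clip).clamp(min=0)                     # shifted negative probability
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
    loss_neg = (1 - targets) * p_shift.pow(gamma_neg) * torch.log((1 - p_shift).clamp(min=1e-8))
    return -(loss_pos + loss_neg).mean()
```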

Is there progress in activity progress prediction?

  • paper_url: http://arxiv.org/abs/2308.05533
  • repo_url: https://github.com/frans-db/progress-prediction
  • paper_authors: Frans de Boer, Jan C. van Gemert, Jouke Dijkstra, Silvia L. Pintea
  • for: Estimating what percentage of an activity has been completed.
  • methods: Machine learning methods trained and evaluated on complicated, realistic video collections.
  • results: Existing progress prediction methods fail to extract useful visual information on these datasets and cannot beat simple baselines. On a precisely controlled synthetic dataset, the considered methods do exploit visual information when it relates directly to progress. The authors conclude the task is ill-posed on the real-world datasets currently in use and, to fairly measure activity progression, recommend a simple but effective frame-counting baseline.
    Abstract Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently this is done with machine learning approaches, trained and evaluated on complicated and realistic video datasets. The videos in these datasets vary drastically in length and appearance. And some of the activities have unanticipated developments, making activity progression difficult to estimate. In this work, we examine the results obtained by existing progress prediction methods on these datasets. We find that current progress prediction methods seem not to extract useful visual information for the progress prediction task. Therefore, these methods fail to exceed simple frame-counting baselines. We design a precisely controlled dataset for activity progress prediction and on this synthetic dataset we show that the considered methods can make use of the visual information, when this directly relates to the progress prediction. We conclude that the progress prediction task is ill-posed on the currently used real-world datasets. Moreover, to fairly measure activity progression we advise to consider a, simple but effective, frame-counting baseline.
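
The recommended frame-counting baseline is simple enough to state in a few lines; the two variants below (known test-video length vs. average training length) are our reading of how such a baseline is typically instantiated.

```python
def frame_count_progress(t: int, total_frames: int) -> float:
    """Progress after frame t (1-indexed) when the video's length is known."""
    return t / total_frames

def average_length_progress(t: int, mean_train_length: float) -> float:
    """Variant for unknown test-video length: assume the mean length seen
    during training and cap at 100%."""
    return min(t / mean_train_length, 1.0)
```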

Critical Points ++: An Agile Point Cloud Importance Measure for Robust Classification, Adversarial Defense and Explainable AI

  • paper_url: http://arxiv.org/abs/2308.05525
  • repo_url: https://github.com/yossilevii100/critical_points2
  • paper_authors: Meir Yossef Levi, Guy Gilboa
  • for: Studying the interplay between critical points of 3D point clouds and out-of-distribution (OOD) samples.
  • methods: First showing that common corruptions and outliers are often interpreted as critical points, then generalizing critical points into importance measures. Training a classification network only on less important points dramatically improves robustness at a minor cost on the clean set; normalized entropy proves highly informative for corruption analysis, and an adaptive threshold on it selects the set of uncritical points.
  • results: The importance measure is extremely fast to compute and supports a variety of applications, including Explainable AI (XAI), outlier removal, uncertainty estimation, robust classification, and adversarial defense, reaching state-of-the-art (SOTA) results on the latter two.
    Abstract The ability to cope accurately and fast with Out-Of-Distribution (OOD) samples is crucial in real-world safety demanding applications. In this work we first study the interplay between critical points of 3D point clouds and OOD samples. Our findings are that common corruptions and outliers are often interpreted as critical points. We generalize the notion of critical points into importance measures. We show that training a classification network based only on less important points dramatically improves robustness, at a cost of minor performance loss on the clean set. We observe that normalized entropy is highly informative for corruption analysis. An adaptive threshold based on normalized entropy is suggested for selecting the set of uncritical points. Our proposed importance measure is extremely fast to compute. We show it can be used for a variety of applications, such as Explainable AI (XAI), Outlier Removal, Uncertainty Estimation, Robust Classification and Adversarial Defense. We reach SOTA results on the two latter tasks. Code is available at: https://github.com/yossilevii100/critical_points2
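
One possible instantiation of a normalized-entropy importance measure is sketched below: per-point entropy of a predictive distribution, scaled to [0, 1], with a mean-plus-deviation threshold separating uncritical from critical points. The exact definitions and threshold form belong to the paper; the ones here are assumptions.

```python
import numpy as np

def normalized_entropy(logits: np.ndarray) -> np.ndarray:
    """Per-point normalized entropy in [0, 1]: entropy of the softmax over
    each point's K logits, divided by log(K)."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)
    return h / np.log(logits.shape[1])

def keep_uncritical(points: np.ndarray, logits: np.ndarray,
                    alpha: float = 0.5) -> np.ndarray:
    """Adaptive threshold (assumed form): keep points whose normalized
    entropy falls below mean + alpha * std."""
    ne = normalized_entropy(logits)
    return points[ne <= ne.mean() + alpha * ne.std()]
```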

Look at the Neighbor: Distortion-aware Unsupervised Domain Adaptation for Panoramic Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.05493
  • repo_url: None
  • paper_authors: Xu Zheng, Tianbo Pan, Yunhao Luo, Lin Wang
  • for: Improving panoramic semantic segmentation and addressing the domain gap between pinhole and panoramic images.
  • methods: A new unsupervised domain adaptation framework with two modules, Distortion-Aware Attention (DA) and Class-Wise Feature Aggregation (CFA). It requires no geometric constraints and better handles the non-uniform pixel distribution of equirectangular projection.
  • results: Achieves new state-of-the-art performance while reducing parameters by 80%.
    Abstract Endeavors have been recently made to transfer knowledge from the labeled pinhole image domain to the unlabeled panoramic image domain via Unsupervised Domain Adaptation (UDA). The aim is to tackle the domain gaps caused by the style disparities and distortion problem from the non-uniformly distributed pixels of equirectangular projection (ERP). Previous works typically focus on transferring knowledge based on geometric priors with specially designed multi-branch network architectures. As a result, considerable computational costs are induced, and meanwhile, their generalization abilities are profoundly hindered by the variation of distortion among pixels. In this paper, we find that the pixels' neighborhood regions of the ERP indeed introduce less distortion. Intuitively, we propose a novel UDA framework that can effectively address the distortion problems for panoramic semantic segmentation. In comparison, our method is simpler, easier to implement, and more computationally efficient. Specifically, we propose distortion-aware attention (DA) capturing the neighboring pixel distribution without using any geometric constraints. Moreover, we propose a class-wise feature aggregation (CFA) module to iteratively update the feature representations with a memory bank. As such, the feature similarity between two domains can be consistently optimized. Extensive experiments show that our method achieves new state-of-the-art performance while remarkably reducing 80% parameters.

YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection

  • paper_url: http://arxiv.org/abs/2308.05480
  • repo_url: https://github.com/fishandwasabi/yolo-ms
  • paper_authors: Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, Ming-Ming Cheng
  • for: Providing the object detection community with an efficient and performant object detector, YOLO-MS.
  • methods: The core design is based on investigations into how convolutions with different kernel sizes affect detection performance at different scales, yielding a strategy that enhances multi-scale feature representations for real-time object detectors.
  • results: YOLO-MS outperforms recent state-of-the-art real-time object detectors, including YOLO-v7 and RTMDet, with a comparable number of parameters and FLOPs. The XS version achieves an AP score of 43%+ on MS COCO, about 2%+ higher than RTMDet at the same model size, and the method also serves as a plug-and-play module for other YOLO models, improving their AP scores with fewer parameters and FLOPs.
    Abstract We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations on how convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can strongly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our strategy, we build a network architecture, termed YOLO-MS. We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets, like ImageNet, or pre-trained weights. Without bells and whistles, our YOLO-MS outperforms the recent state-of-the-art real-time object detectors, including YOLO-v7 and RTMDet, when using a comparable number of parameters and FLOPs. Taking the XS version of YOLO-MS as an example, with only 4.5M learnable parameters and 8.7G FLOPs, it can achieve an AP score of 43%+ on MS COCO, which is about 2%+ higher than RTMDet with the same model size. Moreover, our work can also be used as a plug-and-play module for other YOLO models. Typically, our method significantly improves the AP of YOLOv8 from 37%+ to 40%+ with even fewer parameters and FLOPs. Code is available at https://github.com/FishAndWasabi/YOLO-MS.

Surface Masked AutoEncoder: Self-Supervision for Cortical Imaging Data

  • paper_url: http://arxiv.org/abs/2308.05474
  • repo_url: https://github.com/metrics-lab/surface-vision-transformers
  • paper_authors: Simon Dahan, Mariana da Silva, Daniel Rueckert, Emma C Robinson
  • for: Addressing the lack of inductive biases in vision transformer architectures, which limits generalization on small datasets, in the context of cortical imaging where phenotypes are complex and data are scarce.
  • methods: Building on recent work translating vision transformers to surface meshes, investigating Masked AutoEncoder (MAE) self-supervision for learning cortical structure.
  • results: Reconstructing masked surface data yields strong representations that improve downstream performance. On cortical phenotype regression with the developing Human Connectome Project (dHCP), pre-training improves performance by 26% and converges 80% faster than training from scratch; pre-training on large datasets such as the UK Biobank yields robust representations for fine-tuning in low-data scenarios.
    Abstract Self-supervision has been widely explored as a means of addressing the lack of inductive biases in vision transformer architectures, which limits generalisation when networks are trained on small datasets. This is crucial in the context of cortical imaging, where phenotypes are complex and heterogeneous, but the available datasets are limited in size. This paper builds upon recent advancements in translating vision transformers to surface meshes and investigates the potential of Masked AutoEncoder (MAE) self-supervision for cortical surface learning. By reconstructing surface data from a masked version of the input, the proposed method effectively models cortical structure to learn strong representations that translate to improved performance in downstream tasks. We evaluate our approach on cortical phenotype regression using the developing Human Connectome Project (dHCP) and demonstrate that pre-training leads to a 26\% improvement in performance, with an 80\% faster convergence, compared to models trained from scratch. Furthermore, we establish that pre-training vision transformer models on large datasets, such as the UK Biobank (UKB), enables the acquisition of robust representations for finetuning in low-data scenarios. Our code and pre-trained models are publicly available at \url{https://github.com/metrics-lab/surface-vision-transformers}.

Comprehensive Analysis of Network Robustness Evaluation Based on Convolutional Neural Networks with Spatial Pyramid Pooling

  • paper_url: http://arxiv.org/abs/2308.08012
  • repo_url: None
  • paper_authors: Wenjun Jiang, Tianlong Fan, Changhao Li, Chuanfu Zhang, Tao Zhang, Zong-fu Luo
  • for: Improving the efficiency and practicality of connectivity robustness evaluation for complex networks.
  • methods: Machine learning techniques applied to the challenges of connectivity robustness evaluation.
  • results: The proposed convolutional neural network (CNN) with spatial pyramid pooling (SPP-net) effectively addresses the high computational cost of robustness evaluation across different network types, component types, and failure scenarios.
    Abstract Connectivity robustness, a crucial aspect for understanding, optimizing, and repairing complex networks, has traditionally been evaluated through time-consuming and often impractical simulations. Fortunately, machine learning provides a new avenue for addressing this challenge. However, several key issues remain unresolved, including the performance in more general edge removal scenarios, capturing robustness through attack curves instead of directly training for robustness, scalability of predictive tasks, and transferability of predictive capabilities. In this paper, we address these challenges by designing a convolutional neural networks (CNN) model with spatial pyramid pooling networks (SPP-net), adapting existing evaluation metrics, redesigning the attack modes, introducing appropriate filtering rules, and incorporating the value of robustness as training data. The results demonstrate the thoroughness of the proposed CNN framework in addressing the challenges of high computational time across various network types, failure component types and failure scenarios. However, the performance of the proposed CNN model varies: for evaluation tasks that are consistent with the trained network type, the proposed CNN model consistently achieves accurate evaluations of both attack curves and robustness values across all removal scenarios. When the predicted network type differs from the trained network, the CNN model still demonstrates favorable performance in the scenario of random node failure, showcasing its scalability and performance transferability. Nevertheless, the performance falls short of expectations in other removal scenarios. This observed scenario-sensitivity in the evaluation of network features has been overlooked in previous studies and necessitates further attention and optimization. Lastly, we discuss important unresolved questions and further investigation.
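
The spatial pyramid pooling component is standard, and it is what lets a single CNN consume inputs from networks of different sizes by producing a fixed-length vector; a minimal PyTorch version is shown below.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
    """Standard SPP: max-pool the feature map (B, C, H, W) onto several fixed
    grids and concatenate, giving a fixed-length vector regardless of H and W."""
    b, c = x.shape[:2]
    pooled = [F.adaptive_max_pool2d(x, g).reshape(b, c * g * g) for g in levels]
    return torch.cat(pooled, dim=1)                       # (B, C * sum(g^2))

vec = spatial_pyramid_pool(torch.randn(2, 32, 50, 50))    # shape (2, 32 * 21)
```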

KS-APR: Keyframe Selection for Robust Absolute Pose Regression

  • paper_url: http://arxiv.org/abs/2308.05459
  • repo_url: None
  • paper_authors: Changkun Liu, Yukun Zhao, Tristan Braud
  • for: Improving the accuracy of mobile augmented reality (AR)
  • methods: Using accuracy checks and rejecting unreliable pose estimates
  • results: Improved accuracy and reduced proportion of large errors for all models
    Abstract Markerless Mobile Augmented Reality (AR) aims to anchor digital content in the physical world without using specific 2D or 3D objects. Absolute Pose Regressors (APR) are end-to-end machine learning solutions that infer the device's pose from a single monocular image. Thanks to their low computation cost, they can be directly executed on the constrained hardware of mobile AR devices. However, APR methods tend to yield significant inaccuracies for input images that are too distant from the training set. This paper introduces KS-APR, a pipeline that assesses the reliability of an estimated pose with minimal overhead by combining the inference results of the APR and the prior images in the training set. Mobile AR systems tend to rely upon visual-inertial odometry to track the relative pose of the device during the experience. As such, KS-APR favours reliability over frequency, discarding unreliable poses. This pipeline can integrate most existing APR methods to improve accuracy by filtering unreliable images with their pose estimates. We implement the pipeline on three types of APR models on indoor and outdoor datasets. The median error on position and orientation is reduced for all models, and the proportion of large errors is minimized across datasets. Our method enables state-of-the-art APRs such as DFNetdm to outperform single-image and sequential APR methods. These results demonstrate the scalability and effectiveness of KS-APR for visual localization tasks that do not require one-shot decisions.
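
Our loose reading of the keyframe-selection gate can be sketched as follows: accept the APR pose only if the query image lies close, in feature space, to a training image whose pose roughly agrees with the estimate, and otherwise fall back to visual-inertial odometry. Every function and threshold here is a hypothetical placeholder, not the paper's pipeline.

```python
import numpy as np

def ks_apr_gate(query_feat, train_feats, train_positions, apr_position,
                feat_tau=0.3, pos_tau=1.0):
    """Hypothetical reliability gate for an APR position estimate."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)  # distances to keyframes
    i = int(d.argmin())                                   # nearest training image
    agrees = np.linalg.norm(train_positions[i] - apr_position) < pos_tau
    # Return the pose only if both checks pass; None signals "use VIO instead".
    return apr_position if (d[i] < feat_tau and agrees) else None
```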

Transforming Breast Cancer Diagnosis: Towards Real-Time Ultrasound to Mammogram Conversion for Cost-Effective Diagnosis

  • paper_url: http://arxiv.org/abs/2308.05449
  • repo_url: None
  • paper_authors: Sahar Almahfouz Nasser, Ashutosh Sharma, Anmol Saraf, Amruta Mahendra Parulekar, Purvi Haria, Amit Sethi
  • for: Providing surgeons with mammogram-like image quality in real time from noisy ultrasound (US) images.
  • methods: The Stride software numerically solves the forward model, generating ultrasound images from mammogram images by solving wave equations; domain adaptation enhances the realism of the simulated ultrasound images, and generative adversarial networks (GANs) tackle the inverse problem of generating mammogram-quality images from ultrasound images.
  • results: The resulting images have considerably more discernible details than the original US images.
    Abstract Ultrasound (US) imaging is better suited for intraoperative settings because it is real-time and more portable than other imaging techniques, such as mammography. However, US images are characterized by lower spatial resolution noise-like artifacts. This research aims to address these limitations by providing surgeons with mammogram-like image quality in real-time from noisy US images. Unlike previous approaches for improving US image quality that aim to reduce artifacts by treating them as (speckle noise), we recognize their value as informative wave interference pattern (WIP). To achieve this, we utilize the Stride software to numerically solve the forward model, generating ultrasound images from mammograms images by solving wave-equations. Additionally, we leverage the power of domain adaptation to enhance the realism of the simulated ultrasound images. Then, we utilize generative adversarial networks (GANs) to tackle the inverse problem of generating mammogram-quality images from ultrasound images. The resultant images have considerably more discernible details than the original US images.

A Generalized Physical-knowledge-guided Dynamic Model for Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.05447
  • repo_url: None
  • paper_authors: Pan Mu, Hanning Xu, Zheyuan Liu, Zheng Wang, Sixian Chan, Cong Bai
  • for: Enhancing the color and contrast of underwater images and handling their diverse degradation types.
  • methods: A physical-knowledge-guided dynamic model (GUPDM) with three parts: an Atmosphere-based Dynamic Structure (ADS), a Transmission-guided Dynamic Structure (TDS), and a Prior-based Multi-scale Structure (PMS). The formation model simulates various underwater image types, and dynamic convolutions adaptively extract priors for different water types.
  • results: Experiments show that GUPDM effectively improves color and contrast and adapts across underwater image types.
    Abstract Underwater images often suffer from color distortion and low contrast resulting in various image types, due to the scattering and absorption of light by water. While it is difficult to obtain high-quality paired training samples with a generalized model. To tackle these challenges, we design a Generalized Underwater image enhancement method via a Physical-knowledge-guided Dynamic Model (short for GUPDM), consisting of three parts: Atmosphere-based Dynamic Structure (ADS), Transmission-guided Dynamic Structure (TDS), and Prior-based Multi-scale Structure (PMS). In particular, to cover complex underwater scenes, this study changes the global atmosphere light and the transmission to simulate various underwater image types (e.g., the underwater image color ranging from yellow to blue) through the formation model. We then design ADS and TDS that use dynamic convolutions to adaptively extract prior information from underwater images and generate parameters for PMS. These two modules enable the network to select appropriate parameters for various water types adaptively. Besides, the multi-scale feature extraction module in PMS uses convolution blocks with different kernel sizes and obtains weights for each feature map via channel attention block and fuses them to boost the receptive field of the network. The source code will be available at \href{https://github.com/shiningZZ/GUPDM}{https://github.com/shiningZZ/GUPDM}.

Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation

  • paper_url: http://arxiv.org/abs/2308.05441
  • repo_url: None
  • paper_authors: Hao Liang, Pietro Perona, Guha Balakrishnan
  • for: Measuring bias in face recognition systems. Existing benchmarks collected in the wild and annotated for protected (e.g., race, gender) and non-protected (e.g., pose, lighting) attributes only permit correlational conclusions, such as "Algorithm A's accuracy differs on female and male faces in dataset X"; an experimental approach that manipulates attributes individually permits causal conclusions, such as "Algorithm A's accuracy is affected by gender and skin color."
  • methods: A neural face generator synthesizes face images in which each attribute of interest is modified independently while all others are held constant; human observers provide ground truth on perceptual identity similarity between synthetic image pairs. The method is validated quantitatively by evaluating race and gender biases of three research-grade face recognition models.
  • results: The synthetic pipeline reveals lower accuracy for Black and East Asian population subgroups for these algorithms, and also quantifies how perceptual attribute changes affect the face identity distances the models report. The dataset of 48,000 synthetic face image pairs (10,200 unique synthetic faces) and 555,000 human annotations (individual attributes and pairwise identity comparisons) is available to researchers.
    Abstract We propose an experimental method for measuring bias in face recognition systems. Existing methods to measure bias depend on benchmark datasets that are collected in the wild and annotated for protected (e.g., race, gender) and non-protected (e.g., pose, lighting) attributes. Such observational datasets only permit correlational conclusions, e.g., "Algorithm A's accuracy is different on female and male faces in dataset X.". By contrast, experimental methods manipulate attributes individually and thus permit causal conclusions, e.g., "Algorithm A's accuracy is affected by gender and skin color." Our method is based on generating synthetic faces using a neural face generator, where each attribute of interest is modified independently while leaving all other attributes constant. Human observers crucially provide the ground truth on perceptual identity similarity between synthetic image pairs. We validate our method quantitatively by evaluating race and gender biases of three research-grade face recognition models. Our synthetic pipeline reveals that for these algorithms, accuracy is lower for Black and East Asian population subgroups. Our method can also quantify how perceptual changes in attributes affect face identity distances reported by these models. Our large synthetic dataset, consisting of 48,000 synthetic face image pairs (10,200 unique synthetic faces) and 555,000 human annotations (individual attributes and pairwise identity comparisons) is available to researchers in this important area.

Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.05438
  • repo_url: https://github.com/junzastar/dftr_voting
  • paper_authors: Jun Zhou, Kai Chen, Linlin Xu, Qi Dou, Jing Qin
  • for: Improving the efficiency and accuracy of 6D object pose estimation from a single RGBD image, in particular the effective fusion of the two modalities.
  • methods: A Deep Fusion Transformer (DFTr) block that aggregates cross-modality features by leveraging their semantic similarity, plus a weighted vector-wise voting algorithm that uses a non-iterative global optimization strategy to localize 3D keypoints precisely with near real-time inference.
  • results: Experiments show the proposed 3D keypoint voting algorithm generalizes strongly and runs efficiently; results on four widely used benchmarks show the method outperforms state-of-the-art methods by large margins.
    Abstract One critical challenge in 6D object pose estimation from a single RGBD image is efficient integration of two different modalities, i.e., color and depth. In this work, we tackle this problem by a novel Deep Fusion Transformer~(DFTr) block that can aggregate cross-modality features for improving pose estimation. Unlike existing fusion methods, the proposed DFTr can better model cross-modality semantic correlation by leveraging their semantic similarity, such that globally enhanced features from different modalities can be better integrated for improved information extraction. Moreover, to further improve robustness and efficiency, we introduce a novel weighted vector-wise voting algorithm that employs a non-iterative global optimization strategy for precise 3D keypoint localization while achieving near real-time inference. Extensive experiments show the effectiveness and strong generalization capability of our proposed 3D keypoint voting algorithm. Results on four widely used benchmarks also demonstrate that our method outperforms the state-of-the-art methods by large margins.

Ensemble Modeling for Multimodal Visual Action Recognition

  • paper_url: http://arxiv.org/abs/2308.05430
  • repo_url: None
  • paper_authors: Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah
  • for: An ensemble modeling approach for multimodal action recognition.
  • methods: Individual modality models are trained with a focal-loss variant tailored to the long-tailed distribution of the MECCANO dataset, using an exponentially decaying focal loss, and the RGB and Depth predictions are combined by late fusion.
  • results: Experimental results demonstrate the effectiveness of the approach.
    Abstract In this work, we propose an ensemble modeling approach for multimodal action recognition. We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset. Based on the underlying principle of focal loss, which captures the relationship between tail (scarce) classes and their prediction difficulties, we propose an exponentially decaying variant of focal loss for our current task. It initially emphasizes learning from the hard misclassified examples and gradually adapts to the entire range of examples in the dataset. This annealing process encourages the model to strike a balance between focusing on the sparse set of hard samples, while still leveraging the information provided by the easier ones. Additionally, we opt for the late fusion strategy to combine the resultant probability distributions from RGB and Depth modalities for final action prediction. Experimental evaluations on the MECCANO dataset demonstrate the effectiveness of our approach.
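
An exponentially decaying focal loss can be sketched directly from this description: start with a large focusing parameter gamma (hard examples dominate) and anneal it toward zero (plain cross-entropy). The decay schedule below is our assumption; the paper defines its own variant.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float) -> torch.Tensor:
    """Multi-class focal loss: cross-entropy scaled by (1 - p_t)^gamma."""
    logp_t = F.log_softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    return -((1 - logp_t.exp()) ** gamma * logp_t).mean()

def annealed_gamma(epoch: int, total_epochs: int, gamma0: float = 2.0,
                   floor: float = 0.01) -> float:
    """Exponential decay of the focusing parameter from gamma0 toward ~0,
    shifting emphasis from hard examples to the whole dataset over training."""
    return gamma0 * floor ** (epoch / total_epochs)
```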

Speech-Driven 3D Face Animation with Composite and Regional Facial Movements

  • paper_url: http://arxiv.org/abs/2308.05428
  • repo_url: https://github.com/wuhaozhe/audio2face_mm2023
  • paper_authors: Haozhe Wu, Songtao Zhou, Jia Jia, Junliang Xing, Qi Wen, Xiang Wen
  • for: Speech-driven 3D face animation with vivid results and efficient computation.
  • methods: An adaptive modulation module that uses arbitrary facial movements to dynamically adjust speech-driven facial movements across frames on a global scale; per-frame facial features that focus on the local spatial movements of 3D faces; and a non-autoregressive backbone for translating audio to 3D facial movements that preserves high-frequency nuances and enables efficient inference.
  • results: Comprehensive experiments and user studies show the method surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively.
    Abstract Speech-driven 3D face animation poses significant challenges due to the intricacy and variability inherent in human facial movements. This paper emphasizes the importance of considering both the composite and regional natures of facial movements in speech-driven 3D face animation. The composite nature pertains to how speech-independent factors globally modulate speech-driven facial movements along the temporal dimension. Meanwhile, the regional nature alludes to the notion that facial movements are not globally correlated but are actuated by local musculature along the spatial dimension. It is thus indispensable to incorporate both natures for engendering vivid animation. To address the composite nature, we introduce an adaptive modulation module that employs arbitrary facial movements to dynamically adjust speech-driven facial movements across frames on a global scale. To accommodate the regional nature, our approach ensures that each constituent of the facial features for every frame focuses on the local spatial movements of 3D faces. Moreover, we present a non-autoregressive backbone for translating audio to 3D facial movements, which maintains high-frequency nuances of facial movements and facilitates efficient inference. Comprehensive experiments and user studies demonstrate that our method surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively.

Adaptive Low Rank Adaptation of Segment Anything to Salient Object Detection

  • paper_url: http://arxiv.org/abs/2308.05426
  • repo_url: None
  • paper_authors: Ruikai Cui, Siyuan He, Shi Qiu
  • for: Improving salient object detection (SOD) performance.
  • methods: Adaptively fine-tuning the Segment Anything Model (SAM) for salient object detection by exploiting the low-rank structure inherent in deep learning.
  • results: Comprehensive qualitative and quantitative evaluations on five challenging RGB benchmark datasets show the approach clearly surpasses state-of-the-art methods.
    Abstract Foundation models, such as OpenAI's GPT-3 and GPT-4, Meta's LLaMA, and Google's PaLM2, have revolutionized the field of artificial intelligence. A notable paradigm shift has been the advent of the Segment Anything Model (SAM), which has exhibited a remarkable capability to segment real-world objects, trained on 1 billion masks and 11 million images. Although SAM excels in general object segmentation, it lacks the intrinsic ability to detect salient objects, resulting in suboptimal performance in this domain. To address this challenge, we present the Segment Salient Object Model (SSOM), an innovative approach that adaptively fine-tunes SAM for salient object detection by harnessing the low-rank structure inherent in deep learning. Comprehensive qualitative and quantitative evaluations across five challenging RGB benchmark datasets demonstrate the superior performance of our approach, surpassing state-of-the-art methods.
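
Low-rank adaptation of a frozen pretrained layer — the general mechanism that adaptive low-rank fine-tuning of SAM relies on — is easy to sketch in PyTorch. The module below is a generic LoRA linear layer; SSOM's adaptive scheme and SAM-specific wiring are not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic low-rank adaptation of a frozen linear layer: the pretrained
    weight stays fixed; only the rank-r factors A and B are trained."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because B starts at zero, the adapted model initially reproduces the frozen backbone exactly, and the low-rank update is learned on the downstream SOD data.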

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

  • paper_url: http://arxiv.org/abs/2308.05421
  • repo_url: https://github.com/gewu-lab/pstp-net
  • paper_authors: Guangyao Li, Wenxuan Hou, Di Hu
  • for: Proposing a Progressive Spatio-Temporal Perception Network (PSTP-Net) for answering questions about videos.
  • methods: Three modules: a temporal segment selection module first picks the audio-visual segments most relevant to the question; a spatial region selection module then chooses question-relevant regions within those segments; and an audio-guided visual attention module captures the association between audio and the selected spatial regions.
  • results: Extensive experiments on the public MUSIC-AVQA and AVQA datasets show PSTP-Net answers questions efficiently and accurately.
    Abstract Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, where most of which could be unrelated to the given questions, or even play as interference in answering the content of interest. Oppositely, only focusing on the question-aware audio-visual content could get rid of influence, meanwhile enabling the model to answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions w.r.t. questions. Specifically, a temporal segment selection module is first introduced to select the most relevant audio-visual segments related to the given question. Then, a spatial region selection module is utilized to choose the most relevant regions associated with the question from the selected temporal segments. To further refine the selection of features, an audio-guided visual attention module is employed to perceive the association between auido and selected spatial regions. Finally, the spatio-temporal features from these modules are integrated for answering the question. Extensive experimental results on the public MUSIC-AVQA and AVQA datasets provide compelling evidence of the effectiveness and efficiency of PSTP-Net. Code is available at: \href{https://github.com/GeWu-Lab/PSTP-Net}{https://github.com/GeWu-Lab/PSTP-Net}
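
In the simplest reading, the first module's behavior — keep only the audio-visual segments most relevant to the question — reduces to top-k selection under a question-segment similarity score. The plain cosine scorer below is an assumption; the paper learns its own.

```python
import torch
import torch.nn.functional as F

def select_segments(seg_feats: torch.Tensor, q_feat: torch.Tensor, k: int = 3):
    """Keep the k audio-visual segments most similar to the question embedding.
    seg_feats: (T, D) per-segment features; q_feat: (D,) question feature."""
    sim = F.cosine_similarity(seg_feats, q_feat.unsqueeze(0), dim=1)  # (T,)
    idx = sim.topk(k).indices
    return seg_feats[idx], idx
```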

SC3K: Self-supervised and Coherent 3D Keypoints Estimation from Rotated, Noisy, and Decimated Point Cloud Data

  • paper_url: http://arxiv.org/abs/2308.05410
  • repo_url: https://github.com/iit-pavis/sc3k
  • paper_authors: Mohammad Zohaib, Alessio Del Bue
  • for: A new method to infer keypoints of arbitrary object categories from point cloud data (PCD) that is noisy, down-sampled, and arbitrarily rotated, as in practical scenarios.
  • methods: A fully unsupervised (annotation-free) self-supervised training strategy that assumes no prior knowledge of the object class, paired with a model architecture whose coupled auxiliary losses promote the desired keypoint properties.
  • results: Experiments show the approach estimates keypoints with improved coverage (+9.41%) while being semantically consistent (+4.66%), best characterizing the object's 3D shape for downstream tasks, compared with state-of-the-art unsupervised methods. Code and data are available at https://github.com/IITPAVIS/SC3K.
    Abstract This paper proposes a new method to infer keypoints from arbitrary object categories in practical scenarios where point cloud data (PCD) are noisy, down-sampled and arbitrarily rotated. Our proposed model adheres to the following principles: i) keypoints inference is fully unsupervised (no annotation given), ii) keypoints position error should be low and resilient to PCD perturbations (robustness), iii) keypoints should not change their indexes for the intra-class objects (semantic coherence), iv) keypoints should be close to or proximal to PCD surface (compactness). We achieve these desiderata by proposing a new self-supervised training strategy for keypoints estimation that does not assume any a priori knowledge of the object class, and a model architecture with coupled auxiliary losses that promotes the desired keypoints properties. We compare the keypoints estimated by the proposed approach with those of the state-of-the-art unsupervised approaches. The experiments show that our approach outperforms by estimating keypoints with improved coverage (+9.41%) while being semantically consistent (+4.66%) that best characterizes the object's 3D shape for downstream tasks. Code and data are available at: https://github.com/IITPAVIS/SC3K
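The coupled auxiliary losses can be pictured as follows. These are plausible forms of the compactness, coverage, and semantic-coherence terms implied by the stated desiderata, not the paper's exact formulations.

```python
import torch

def compactness_loss(keypoints, pcd):
    # keypoints: (B, K, 3); pcd: (B, N, 3)
    # penalize keypoints that drift away from the point cloud surface
    d = torch.cdist(keypoints, pcd)       # (B, K, N) pairwise distances
    return d.min(dim=-1).values.mean()    # distance to the nearest PCD point

def coverage_loss(keypoints, pcd):
    # encourage keypoints to spread so every PCD point has a nearby keypoint
    d = torch.cdist(pcd, keypoints)       # (B, N, K)
    return d.min(dim=-1).values.mean()

def coherence_loss(kp_view1, kp_view2_aligned):
    # keypoints from two augmented views (noise/decimation/rotation undone)
    # should match index-by-index, enforcing a consistent semantic ordering
    return ((kp_view1 - kp_view2_aligned) ** 2).sum(-1).mean()
```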

Enhancing Low-light Light Field Images with A Deep Compensation Unfolding Network

  • paper_url: http://arxiv.org/abs/2308.05404
  • repo_url: https://github.com/lyuxianqiang/lfll-dcu
  • paper_authors: Xianqiang Lyu, Junhui Hou
  • for: Proposes a novel, interpretable end-to-end learning framework, the deep compensation unfolding network (DCUNet), for restoring light field (LF) images captured under low-light conditions.
  • methods: DCUNet adopts a multi-stage architecture that mimics the optimization process of solving an inverse imaging problem, and includes a content-associated deep compensation module at each stage to suppress noise and errors in the illumination map estimation and intermediate results (one stage is sketched below).
  • results: On both simulated and real datasets, DCUNet outperforms state-of-the-art methods while better preserving the essential geometric structure of LF images. Code will be released at https://github.com/lyuxianqiang/LFLL-DCU.
    Abstract This paper presents a novel and interpretable end-to-end learning framework, called the deep compensation unfolding network (DCUNet), for restoring light field (LF) images captured under low-light conditions. DCUNet is designed with a multi-stage architecture that mimics the optimization process of solving an inverse imaging problem in a data-driven fashion. The framework uses the intermediate enhanced result to estimate the illumination map, which is then employed in the unfolding process to produce a new enhanced result. Additionally, DCUNet includes a content-associated deep compensation module at each optimization stage to suppress noise and illumination map estimation errors. To properly mine and leverage the unique characteristics of LF images, this paper proposes a pseudo-explicit feature interaction module that comprehensively exploits redundant information in LF images. The experimental results on both simulated and real datasets demonstrate the superiority of our DCUNet over state-of-the-art methods, both qualitatively and quantitatively. Moreover, DCUNet preserves the essential geometric structure of enhanced LF images much better. The code will be publicly available at https://github.com/lyuxianqiang/LFLL-DCU.
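A loose sketch of one unfolding stage is below, assuming a Retinex-style illumination update followed by a learned compensation module. The paper derives its actual update rule from its inverse-problem formulation, so treat this only as an illustration of the stage structure.

```python
import torch
import torch.nn as nn

class UnfoldingStage(nn.Module):
    """One data-driven unfolding step: estimate an illumination map from the
    current enhanced result, use it to refine the estimate, then apply a
    learned compensation module to suppress noise and estimation errors."""
    def __init__(self, ch=32):
        super().__init__()
        self.illum = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())
        self.compensate = nn.Sequential(nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, low_light, current):
        L = self.illum(current).clamp(min=1e-3)   # illumination map in (0, 1]
        refined = low_light / L                   # Retinex-style brightening update
        return refined + self.compensate(torch.cat([refined, current], dim=1))
```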

Learning Gabor Texture Features for Fine-Grained Recognition

  • paper_url: http://arxiv.org/abs/2308.05396
  • repo_url: None
  • paper_authors: Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, Jun Liu
  • for: Improving fine-grained recognition performance.
  • methods: A Gabor-filter texture branch complements the CNN semantic branch (a fixed filter bank is sketched below); several designs enhance the Gabor filters, including parameter-value constraints, a learning method for the optimal parameters, a statistical feature extractor, and a gate selection mechanism.
  • results: State-of-the-art performance on multiple datasets (CUB-200-2011, NA-bird, Stanford Dogs, and GTOS-mobile).
    Abstract Extracting and using class-discriminative features is critical for fine-grained recognition. Existing works have demonstrated the possibility of applying deep CNNs to exploit features that distinguish similar classes. However, CNNs suffer from problems including frequency bias and loss of detailed local information, which restricts the performance of recognizing fine-grained categories. To address the challenge, we propose a novel texture branch as complimentary to the CNN branch for feature extraction. We innovatively utilize Gabor filters as a powerful extractor to exploit texture features, motivated by the capability of Gabor filters in effectively capturing multi-frequency features and detailed local information. We implement several designs to enhance the effectiveness of Gabor filters, including imposing constraints on parameter values and developing a learning method to determine the optimal parameters. Moreover, we introduce a statistical feature extractor to utilize informative statistical information from the signals captured by Gabor filters, and a gate selection mechanism to enable efficient computation by only considering qualified regions as input for texture extraction. Through the integration of features from the Gabor-filter-based texture branch and CNN-based semantic branch, we achieve comprehensive information extraction. We demonstrate the efficacy of our method on multiple datasets, including CUB-200-2011, NA-bird, Stanford Dogs, and GTOS-mobile. State-of-the-art performance is achieved using our approach.
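For reference, a fixed Gabor filter bank applied as a convolution looks like this. The paper instead learns the filter parameters under value constraints, so the hand-set bank below is only a stand-in.

```python
import math
import torch
import torch.nn.functional as F

def gabor_kernel(ksize=15, sigma=3.0, theta=0.0, lambd=8.0, gamma=0.5, psi=0.0):
    """Real Gabor kernel: a Gaussian envelope modulated by a sinusoid."""
    half = ksize // 2
    y, x = torch.meshgrid(torch.arange(-half, half + 1).float(),
                          torch.arange(-half, half + 1).float(), indexing="ij")
    xr = x * math.cos(theta) + y * math.sin(theta)    # rotate coordinates by theta
    yr = -x * math.sin(theta) + y * math.cos(theta)
    return torch.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2)) \
           * torch.cos(2 * math.pi * xr / lambd + psi)

# A small fixed bank over 4 orientations x 2 wavelengths -> 8 texture channels
bank = torch.stack([gabor_kernel(theta=t, lambd=l)
                    for t in (0, math.pi / 4, math.pi / 2, 3 * math.pi / 4)
                    for l in (4.0, 8.0)]).unsqueeze(1)   # (8, 1, 15, 15)

gray = torch.rand(1, 1, 64, 64)                # stand-in grayscale image
responses = F.conv2d(gray, bank, padding=7)    # (1, 8, 64, 64) multi-frequency texture maps
```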

Robust Localization with Visual-Inertial Odometry Constraints for Markerless Mobile AR

  • paper_url: http://arxiv.org/abs/2308.05394
  • repo_url: None
  • paper_authors: Changkun Liu, Yukun Zhao, Tristan Braud
  • for: Improving localization accuracy for markerless mobile Augmented Reality (AR) applications.
  • methods: A new framework, VIO-APR, combines an absolute pose regressor (APR) with a local VIO tracking system: VIO assesses the reliability of the APR, while the APR identifies and compensates for VIO drift (the feedback loop is sketched below).
  • results: VIO-APR improves median accuracy by up to 36% in position and 29% in orientation, increases the percentage of frames at the high (0.25 m, 2°) accuracy level by up to 112%, and greatly reduces the percentage of frames below the low (5 m, 10°) accuracy level. Implemented in a Unity-based mobile AR application, it yields noticeably more accurate localization and a more stable overall experience.
    Abstract Visual Inertial Odometry (VIO) is an essential component of modern Augmented Reality (AR) applications. However, VIO only tracks the relative pose of the device, leading to drift over time. Absolute pose estimation methods infer the device's absolute pose, but their accuracy depends on the input quality. This paper introduces VIO-APR, a new framework for markerless mobile AR that combines an absolute pose regressor (APR) with a local VIO tracking system. VIO-APR uses VIO to assess the reliability of the APR and the APR to identify and compensate for VIO drift. This feedback loop results in more accurate positioning and more stable AR experiences. To evaluate VIO-APR, we created a dataset that combines camera images with ARKit's VIO system output for six indoor and outdoor scenes of various scales. Over this dataset, VIO-APR improves the median accuracy of popular APR by up to 36\% in position and 29\% in orientation, increases the percentage of frames in the high ($0.25 m, 2^{\circ}$) accuracy level by up to 112\% and reduces the percentage of frames predicted below the low ($5 m, 10^\circ$) accuracy greatly. We implement VIO-APR into a mobile AR application using Unity to demonstrate its capabilities. VIO-APR results in noticeably more accurate localization and a more stable overall experience.
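A minimal sketch of the feedback loop, assuming 4x4 homogeneous poses and a scalar APR confidence. The threshold and the omitted VIO-side consistency check on APR outputs are assumptions, not the paper's exact logic.

```python
import numpy as np

def fuse_pose(T_vio, T_apr, apr_conf, T_corr, conf_thresh=0.5):
    """T_vio: drifting VIO pose in its local frame; T_apr: absolute pose from
    the regressor with confidence apr_conf; T_corr: running correction that
    maps the VIO frame into the absolute map frame. In the paper, VIO motion
    would additionally gate unreliable APR predictions (omitted here)."""
    if apr_conf > conf_thresh:
        T_corr = T_apr @ np.linalg.inv(T_vio)   # APR compensates accumulated VIO drift
    return T_corr @ T_vio, T_corr               # drift-corrected absolute pose

# toy usage: VIO at its local origin, a confident APR fix 1 m along x
T_apr = np.eye(4); T_apr[0, 3] = 1.0
T_abs, T_corr = fuse_pose(np.eye(4), T_apr, apr_conf=0.9, T_corr=np.eye(4))
```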

Product Review Image Ranking for Fashion E-commerce

  • paper_url: http://arxiv.org/abs/2308.05390
  • repo_url: None
  • paper_authors: Sangeet Jaiswal, Dhruv Patel, Sreekanth Vempati, Konduru Saiswaroop
  • for: The paper aims to improve the ranking of customer images on a fashion e-commerce platform, as the reliance on User Generated Content (UGC) has increased and the number of customer images has grown.
  • methods: The proposed method trains a network to rank customer images, using selected distortion techniques on studio and highly engaged UGC images to synthesize bad-quality counterparts, so the network learns to rank bad-quality images below high-quality ones (a sketch of the ranking loss follows the abstract).
  • results: The proposed method outperforms baseline models on two metrics, correlation coefficient and accuracy, by substantial margins.
    Abstract In a fashion e-commerce platform where customers can't physically examine the products on their own, being able to see other customers' text and image reviews of the product is critical while making purchase decisions. Given the high reliance on these reviews, over the years we have observed customers proactively sharing their reviews. With an increase in the coverage of User Generated Content (UGC), there has been a corresponding increase in the number of customer images. It is thus imperative to display the most relevant images on top as it may influence users' online shopping choices and behavior. In this paper, we propose a simple yet effective training procedure for ranking customer images. We created a dataset consisting of Myntra (A Major Indian Fashion e-commerce company) studio posts and highly engaged (upvotes/downvotes) UGC images as our starting point and used selected distortion techniques on the images of the above dataset to bring their quality at par with those of bad UGC images. We train our network to rank bad-quality images lower than high-quality ones. Our proposed method outperforms the baseline models on two metrics, namely correlation coefficient, and accuracy, by substantial margins.
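The pairwise idea can be sketched with a standard margin ranking loss: a clean studio/high-engagement image and its distorted copy form a (good, bad) pair, and the scorer learns to rank the clean one higher. Here `scorer` is any network producing one quality score per image, and the margin value is an assumption.

```python
import torch
import torch.nn as nn

ranker = nn.MarginRankingLoss(margin=0.5)    # margin is an assumed value

def ranking_step(scorer, clean_imgs, distorted_imgs):
    # clean studio/high-engagement images vs. their distorted copies
    s_good = scorer(clean_imgs)              # (B,) predicted quality scores
    s_bad = scorer(distorted_imgs)           # (B,)
    target = torch.ones_like(s_good)         # +1: first argument should rank higher
    return ranker(s_good, s_bad, target)

# toy usage with a stand-in scorer that averages pixel intensities
scorer = lambda x: x.flatten(1).mean(dim=1)
loss = ranking_step(scorer, torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32) * 0.5)
```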

HGDNet: A Height-Hierarchy Guided Dual-Decoder Network for Single View Building Extraction and Height Estimation

  • paper_url: http://arxiv.org/abs/2308.05387
  • repo_url: None
  • paper_authors: Chaoran Lu, Ningning Cao, Pan Zhang, Ting Liu, Baochai Peng, Guozhang Liu, Mengke Yuan, Sen Zhang, Simin Huang, Tao Wang
  • for: Unifying the correlated tasks of building height estimation and building extraction from single-view satellite imagery to improve large-scale urban 3D reconstruction.
  • methods: A Height-hierarchy Guided Dual-decoder Network (HGDNet): guided by a synthesized discrete height-hierarchy nDSM, an auxiliary height-hierarchical building extraction branch imposes implicit constraints on the height estimation branch, improving height accuracy by more than 6% on the DFC 2023 Track 2 dataset; an additional two-stage cascade architecture yields more accurate building extraction.
  • results: On the DFC 2023 Track 2 dataset, the method achieves building height estimation δ1: 0.8012, instance extraction AP50: 0.7730, and a final average score of 0.7871, ranking first in the test phase.
    Abstract Unifying the correlative single-view satellite image building extraction and height estimation tasks indicates a promising way to share representations and acquire generalist model for large-scale urban 3D reconstruction. However, the common spatial misalignment between building footprints and stereo-reconstructed nDSM height labels incurs degraded performance on both tasks. To address this issue, we propose a Height-hierarchy Guided Dual-decoder Network (HGDNet) to estimate building height. Under the guidance of synthesized discrete height-hierarchy nDSM, auxiliary height-hierarchical building extraction branch enhance the height estimation branch with implicit constraints, yielding an accuracy improvement of more than 6% on the DFC 2023 track2 dataset. Additional two-stage cascade architecture is adopted to achieve more accurate building extraction. Experiments on the DFC 2023 Track 2 dataset shows the superiority of the proposed method in building height estimation ({\delta}1:0.8012), instance extraction (AP50:0.7730), and the final average score 0.7871 ranks in the first place in test phase.

Interaction-aware Joint Attention Estimation Using People Attributes

  • paper_url: http://arxiv.org/abs/2308.05382
  • repo_url: https://github.com/chihina/pjae
  • paper_authors: Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita
  • for: Joint attention estimation in a single image. Unlike related work that only uses people's gaze-related attributes independently, their locations and actions also serve as contextual cues for weighting those attributes, and interactions among all of these attributes are explicitly modeled.
  • methods: A novel Transformer-based attention network encodes joint attention as low-dimensional features; a specialized MLP head with positional embedding predicts pixelwise confidence of joint attention to generate the confidence heatmap, and the estimate is further integrated with general image-based attention estimation.
  • results: The method outperforms SOTA methods quantitatively in comparative experiments. Code: https://anonymous.4open.science/r/anonymized_codes-ECA4.
    Abstract This paper proposes joint attention estimation in a single image. Different from related work in which only the gaze-related attributes of people are independently employed, (i) their locations and actions are also employed as contextual cues for weighting their attributes, and (ii) interactions among all of these attributes are explicitly modeled in our method. For the interaction modeling, we propose a novel Transformer-based attention network to encode joint attention as low-dimensional features. We introduce a specialized MLP head with positional embedding to the Transformer so that it predicts pixelwise confidence of joint attention for generating the confidence heatmap. This pixelwise prediction improves the heatmap accuracy by avoiding the ill-posed problem in which the high-dimensional heatmap is predicted from the low-dimensional features. The estimated joint attention is further improved by being integrated with general image-based attention estimation. Our method outperforms SOTA methods quantitatively in comparative experiments. Code: https://anonymous.4open.science/r/anonymized_codes-ECA4.

Flexible Isosurface Extraction for Gradient-Based Mesh Optimization

  • paper_url: http://arxiv.org/abs/2308.05371
  • repo_url: None
  • paper_authors: Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, Jun Gao
  • for: Gradient-based mesh optimization, where a 3D surface mesh is represented as the isosurface of a scalar field, an increasingly common paradigm in photogrammetry, generative modeling, and inverse physics. Existing implementations adapt classic isosurface extraction algorithms such as Marching Cubes or Dual Contouring, which were designed to extract meshes from fixed, known fields; in the optimization setting they lack the degrees of freedom to represent high-quality feature-preserving meshes, or suffer from numerical instabilities.
  • methods: FlexiCubes introduces additional carefully-chosen parameters into the representation that allow local flexible adjustments to the extracted mesh geometry and connectivity; these parameters are updated along with the underlying scalar field via automatic differentiation when optimizing for a downstream task. The extraction scheme builds on Dual Marching Cubes for improved topological properties, with extensions that optionally generate tetrahedral and hierarchically-adaptive meshes.
  • results: Extensive experiments on synthetic benchmarks and real-world applications show significant improvements in mesh quality and geometric fidelity.
    Abstract This work considers gradient-based mesh optimization, where we iteratively optimize for a 3D surface mesh by representing it as the isosurface of a scalar field, an increasingly common paradigm in applications including photogrammetry, generative modeling, and inverse physics. Existing implementations adapt classic isosurface extraction algorithms like Marching Cubes or Dual Contouring; these techniques were designed to extract meshes from fixed, known fields, and in the optimization setting they lack the degrees of freedom to represent high-quality feature-preserving meshes, or suffer from numerical instabilities. We introduce FlexiCubes, an isosurface representation specifically designed for optimizing an unknown mesh with respect to geometric, visual, or even physical objectives. Our main insight is to introduce additional carefully-chosen parameters into the representation, which allow local flexible adjustments to the extracted mesh geometry and connectivity. These parameters are updated along with the underlying scalar field via automatic differentiation when optimizing for a downstream task. We base our extraction scheme on Dual Marching Cubes for improved topological properties, and present extensions to optionally generate tetrahedral and hierarchically-adaptive meshes. Extensive experiments validate FlexiCubes on both synthetic benchmarks and real-world applications, showing that it offers significant improvements in mesh quality and geometric fidelity.

TriDo-Former: A Triple-Domain Transformer for Direct PET Reconstruction from Low-Dose Sinograms

  • paper_url: http://arxiv.org/abs/2308.05365
  • repo_url: https://github.com/gluucose/TriDoFormer
  • paper_authors: Jiaqi Cui, Pinxian Zeng, Xinyi Zeng, Peng Wang, Xi Wu, Jiliu Zhou, Yan Wang, Dinggang Shen
  • for: Obtaining high-quality standard-dose PET (SPET) images while minimizing radiation exposure, by reconstructing them directly from low-dose PET (LPET) sinograms.
  • methods: A transformer-based model, TriDo-Former, unites the sinogram, image, and frequency domains: a sinogram enhancement transformer (SE-Former) denoises the LPET sinogram along 1D projection view angles, and a spatial-spectral reconstruction transformer (SSR-Former) with learnable global frequency parsers (GFPs) reconstructs the SPET image while restoring high-frequency details (a GFP-style frequency filter is sketched below).
  • results: On a clinical dataset, TriDo-Former outperforms state-of-the-art methods both qualitatively and quantitatively.
    Abstract To obtain high-quality positron emission tomography (PET) images while minimizing radiation exposure, various methods have been proposed for reconstructing standard-dose PET (SPET) images from low-dose PET (LPET) sinograms directly. However, current methods often neglect boundaries during sinogram-to-image reconstruction, resulting in high-frequency distortion in the frequency domain and diminished or fuzzy edges in the reconstructed images. Furthermore, the convolutional architectures, which are commonly used, lack the ability to model long-range non-local interactions, potentially leading to inaccurate representations of global structures. To alleviate these problems, we propose a transformer-based model that unites triple domains of sinogram, image, and frequency for direct PET reconstruction, namely TriDo-Former. Specifically, the TriDo-Former consists of two cascaded networks, i.e., a sinogram enhancement transformer (SE-Former) for denoising the input LPET sinograms and a spatial-spectral reconstruction transformer (SSR-Former) for reconstructing SPET images from the denoised sinograms. Different from the vanilla transformer that splits an image into 2D patches, based specifically on the PET imaging mechanism, our SE-Former divides the sinogram into 1D projection view angles to maintain its inner-structure while denoising, preventing the noise in the sinogram from prorogating into the image domain. Moreover, to mitigate high-frequency distortion and improve reconstruction details, we integrate global frequency parsers (GFPs) into SSR-Former. The GFP serves as a learnable frequency filter that globally adjusts the frequency components in the frequency domain, enforcing the network to restore high-frequency details resembling real SPET images. Validations on a clinical dataset demonstrate that our TriDo-Former outperforms the state-of-the-art methods qualitatively and quantitatively.
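A plausible reading of a global frequency parser as a learnable filter in the FFT domain is sketched below; the shapes and the complex-weight parameterization are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class GlobalFrequencyParser(nn.Module):
    """Sketch: transform features to the frequency domain, reweight every
    frequency component with a learned map, and transform back."""
    def __init__(self, h, w, ch):
        super().__init__()
        # one complex-valued weight per (channel, frequency) of the rFFT grid
        self.weight = nn.Parameter(torch.ones(ch, h, w // 2 + 1, dtype=torch.cfloat))

    def forward(self, x):                       # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.weight               # globally adjust frequency components
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

x = torch.rand(2, 8, 32, 32)
y = GlobalFrequencyParser(32, 32, 8)(x)         # same shape, re-weighted frequencies
```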

Pseudo-label Alignment for Semi-supervised Instance Segmentation

  • paper_url: http://arxiv.org/abs/2308.05359
  • repo_url: https://github.com/hujiecpp/pais
  • paper_authors: Jie Hu, Chen Chen, Liujuan Cao, Shengchuan Zhang, Annan Shu, Guannan Jiang, Rongrong Ji
  • for: Improving semi-supervised instance segmentation, particularly when labeled data is severely limited.
  • methods: A pseudo-label aligning instance segmentation (PAIS) framework with a dynamic aligning loss (DALoss) that adjusts the weights of the semi-supervised loss terms for varying class and mask score pairs, instead of filtering out mismatched pseudo-labels (an illustrative weighting is sketched below).
  • results: Extensive experiments on COCO and Cityscapes show PAIS is a promising framework, especially with limited labels: with just 1% labeled COCO data it reaches 21.2 mAP (Mask-RCNN) and 19.9 mAP (K-Net), beating the previous state of the art, NoisyBoundary, by over 12 points.
    Abstract Pseudo-labeling is significant for semi-supervised instance segmentation, which generates instance masks and classes from unannotated images for subsequent training. However, in existing pipelines, pseudo-labels that contain valuable information may be directly filtered out due to mismatches in class and mask quality. To address this issue, we propose a novel framework, called pseudo-label aligning instance segmentation (PAIS), in this paper. In PAIS, we devise a dynamic aligning loss (DALoss) that adjusts the weights of semi-supervised loss terms with varying class and mask score pairs. Through extensive experiments conducted on the COCO and Cityscapes datasets, we demonstrate that PAIS is a promising framework for semi-supervised instance segmentation, particularly in cases where labeled data is severely limited. Notably, with just 1\% labeled data, PAIS achieves 21.2 mAP (based on Mask-RCNN) and 19.9 mAP (based on K-Net) on the COCO dataset, outperforming the current state-of-the-art model, \ie, NoisyBoundary with 7.7 mAP, by a margin of over 12 points. Code is available at: \url{https://github.com/hujiecpp/PAIS}.
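One way to picture the dynamic aligning idea: keep all pseudo-labels but weight their loss terms by the agreement of class and mask scores, instead of hard-threshold filtering. The product form and exponent below are illustrative assumptions, not the published DALoss.

```python
import torch

def dynamic_aligning_weight(cls_score, mask_score, gamma=2.0):
    """Per-instance weights that shrink when class and mask quality disagree
    or are jointly low; gamma controls how aggressively they shrink."""
    return (cls_score * mask_score) ** gamma             # (N,)

def semi_supervised_loss(cls_loss, mask_loss, cls_score, mask_score):
    # cls_loss / mask_loss: (N,) unreduced losses on pseudo-labeled instances
    w = dynamic_aligning_weight(cls_score, mask_score)
    return (w * (cls_loss + mask_loss)).sum() / w.sum().clamp(min=1e-6)
```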

Fine-grained building roof instance segmentation based on domain adapted pretraining and composite dual-backbone

  • paper_url: http://arxiv.org/abs/2308.05358
  • repo_url: None
  • paper_authors: Guozhang Liu, Baochai Peng, Ting Liu, Pan Zhang, Mengke Yuan, Chaoran Lu, Ningning Cao, Sen Zhang, Simin Huang, Tao Wang
  • for: Building a robust and accurate building roof instance segmentor that performs semantic interpretation of individual buildings in high-resolution optical satellite imagery.
  • methods: A domain-adapted pretraining strategy and a composite dual-backbone facilitate discriminative feature learning, complemented by a new data augmentation pipeline, stochastic weight averaging (SWA) training, and instance-segmentation model ensembling at test time (the SWA recipe is sketched below).
  • results: The approach ranks first in the 2023 IEEE GRSS Data Fusion Contest (DFC) Track 1 test phase (mAP50: 50.6%); the authors also explore multimodal fusion of optical satellite imagery with SAR data.
    Abstract The diversity of building architecture styles of global cities situated on various landforms, the degraded optical imagery affected by clouds and shadows, and the significant inter-class imbalance of roof types pose challenges for designing a robust and accurate building roof instance segmentor. To address these issues, we propose an effective framework to fulfill semantic interpretation of individual buildings with high-resolution optical satellite imagery. Specifically, the leveraged domain adapted pretraining strategy and composite dual-backbone greatly facilitate discriminative feature learning. Moreover, a new data augmentation pipeline, stochastic weight averaging (SWA) training, and instance-segmentation-based model ensembling at test time are utilized to acquire an additional performance boost. Experimental results show that our approach ranks first in the 2023 IEEE GRSS Data Fusion Contest (DFC) Track 1 test phase ($mAP_{50}$: 50.6\%). Notably, we have also explored the potential of multimodal data fusion with both optical satellite imagery and SAR data.
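The SWA ingredient is standard and can be reproduced with PyTorch's built-in utilities; the toy model, loss, and schedule below are placeholders, not the competition configuration.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8),
                      nn.ReLU(), nn.Conv2d(8, 1, 1))    # stand-in for the segmentor
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
swa_model = AveragedModel(model)                         # running weight average
swa_scheduler = SWALR(optimizer, swa_lr=0.001)
loader = [(torch.rand(2, 3, 32, 32), torch.rand(2, 1, 32, 32)) for _ in range(4)]

swa_start, epochs = 3, 6
for epoch in range(epochs):
    for images, targets in loader:
        optimizer.zero_grad()
        loss = ((model(images) - targets) ** 2).mean()   # placeholder loss
        loss.backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)   # average weights along the SGD trajectory
        swa_scheduler.step()

update_bn(loader, swa_model)   # recompute BatchNorm stats for the averaged weights
```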

TCSloT: Text Guided 3D Context and Slope Aware Triple Network for Dental Implant Position Prediction

  • paper_url: http://arxiv.org/abs/2308.05355
  • repo_url: None
  • paper_authors: Xinquan Yang, Jinheng Xie, Xuechen Li, Xuguang Li, Linlin Shen, Yongqiang Deng
  • for: Improving the accuracy of implant position prediction for surgical guides in implant prosthesis treatment.
  • methods: A Text Guided 3D Context and Slope Aware Triple Network (TCSloT) perceives contextual information from multiple adjacent slices and the variation of implant slopes: a Texture Variation Perception (TVP) module captures texture variation among slices, a Slope-Aware Loss (SAL) dynamically assigns varying weights to the regression head, and a conditional text guidance (CTG) module integrates text conditions (left, middle, right) from CLIP.
  • results: Under five-fold cross-validation on a dental implant dataset, TCSloT outperforms existing methods.
    Abstract In implant prosthesis treatment, a surgical guide is used to ensure accurate implantation. However, such designs heavily rely on manually locating the implant position. While deep neural networks have been proposed to assist dentists in locating the implant position, most take a single slice as input, failing to fully explore 3D contextual information and ignoring the influence of implant slope. In this paper, we design a Text Guided 3D Context and Slope Aware Triple Network (TCSloT) which enables the perception of contextual information from multiple adjacent slices and awareness of variation of implant slopes. A Texture Variation Perception (TVP) module is correspondingly elaborated to process the multiple slices and capture the texture variation among slices, and a Slope-Aware Loss (SAL) is proposed to dynamically assign varying weights for the regression head. Additionally, we design a conditional text guidance (CTG) module to integrate the text condition (i.e., left, middle and right) from the CLIP for assisting the implant position prediction. Extensive experiments on a dental implant dataset through five-fold cross-validation demonstrated that the proposed TCSloT achieves superior performance to existing methods.

Towards General and Fast Video Derain via Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2308.05346
  • repo_url: None
  • paper_authors: Defang Cai, Pan Mu, Sixian Chan, Zhanpeng Shao, Cong Bai
  • for: Video deraining: rain obscures video frames and degrades visual systems, and the wide variety of natural rain streak types makes removal difficult.
  • methods: A Rain Review-based General video derain Network via knowledge distillation (RRGNet) handles different rain streak types with one pre-training weight: a frame-grouping encoder-decoder exploits the video's temporal information, the old-task model guides the current model when learning new rain streak types to avoid forgetting, and a rain review module plays back data from old tasks (the replay-and-distill step is sketched below).
  • results: The developed general method achieves the best results in both running speed and deraining effect.
    Abstract As a common natural weather condition, rain can obscure video frames and thus affect the performance of the visual system, so video derain receives a lot of attention. In natural environments, rain has a wide variety of streak types, which increases the difficulty of the rain removal task. In this paper, we propose a Rain Review-based General video derain Network via knowledge distillation (named RRGNet) that handles different rain streak types with one pre-training weight. Specifically, we design a frame grouping-based encoder-decoder network that makes full use of the temporal information of the video. Further, we use the old task model to guide the current model in learning new rain streak types while avoiding forgetting. To consolidate the network's ability to derain, we design a rain review module to play back data from old tasks for the current model. The experimental results show that our developed general method achieves the best results in terms of running speed and derain effect.
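The review-and-distill idea can be sketched as follows, assuming the frozen old-task model supervises the student on replayed clips so that learning a new streak type does not erase the old ones. The L1 loss choice and equal weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def rain_review_step(student, teacher, new_rainy, new_clean, replay_rainy):
    """student: current derain model; teacher: frozen old-task model;
    new_*: a clip of the newly introduced rain type with its ground truth;
    replay_rainy: a clip replayed from previously learned rain types."""
    loss_new = F.l1_loss(student(new_rainy), new_clean)    # learn the new rain type
    with torch.no_grad():
        old_target = teacher(replay_rainy)                 # old model's derained output
    loss_review = F.l1_loss(student(replay_rainy), old_target)  # don't forget old types
    return loss_new + loss_review
```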

Prostate Age Gap (PAG): An MRI surrogate marker of aging for prostate cancer detection

  • paper_url: http://arxiv.org/abs/2308.05344
  • repo_url: None
  • paper_authors: Alvaro Fernandez-Quilez, Tobias Nordström, Fredrik Jäderling, Svein Reidar Kjosavik, Martin Eklund
  • for: To obtain and evaluate the Prostate Age Gap (PAG) as an MRI marker of clinically significant prostate cancer (csPC) risk.
  • methods: A deep learning model was trained on 3223 MRI slices cropped around the gland (from 81 low-grade PC and 131 negative cases) to predict patient age, and tested on the remaining 256 of 468 biopsied participants (7243 slices in total); PAG is defined as the model-predicted age minus the patient's chronological age, and multivariate logistic regression estimated its association with csPC (the computation is sketched below).
  • results: PAG was significantly associated with csPC risk (OR = 3.78, 95% CI: 2.32-6.16, P < .001) and showed better predictive ability than PI-RADS >= 3 when adjusted for other risk factors including PSA levels: AUC = 0.981 vs 0.704, p < .001.
    Abstract Background: Prostate cancer (PC) MRI-based risk calculators are commonly based on biological (e.g. PSA), MRI markers (e.g. volume), and patient age. Whilst patient age measures the amount of years an individual has existed, biological age (BA) might better reflect the physiology of an individual. However, surrogates from prostate MRI and linkage with clinically significant PC (csPC) remain to be explored. Purpose: To obtain and evaluate Prostate Age Gap (PAG) as an MRI marker tool for csPC risk. Study type: Retrospective. Population: A total of 7243 prostate MRI slices from 468 participants who had undergone prostate biopsies. A deep learning model was trained on 3223 MRI slices cropped around the gland from 81 low-grade PC (ncsPC, Gleason score <=6) and 131 negative cases and tested on the remaining 256 participants. Assessment: Chronological age was defined as the age of the participant at the time of the visit and used to train the deep learning model to predict the age of the patient. Following, we obtained PAG, defined as the model predicted age minus the patient's chronological age. Multivariate logistic regression models were used to estimate the association through odds ratio (OR) and predictive value of PAG and compared against PSA levels and PI-RADS>=3. Statistical tests: T-test, Mann-Whitney U test, Permutation test and ROC curve analysis. Results: The multivariate adjusted model showed a significant difference in the odds of clinically significant PC (csPC, Gleason score >=7) (OR =3.78, 95% confidence interval (CI):2.32-6.16, P <.001). PAG showed a better predictive ability when compared to PI-RADS>=3 and adjusted by other risk factors, including PSA levels: AUC =0.981 vs AUC =0.704, p<.001. Conclusion: PAG was significantly associated with the risk of clinically significant PC and outperformed other well-established PC risk factors.
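The marker itself is simple to compute once the age regressor exists. The sketch below, with synthetic numbers rather than study data, shows PAG and a univariate logistic regression (the paper's model is multivariate-adjusted) whose exponentiated coefficient gives the odds ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# the deep model predicts age from prostate MRI; PAG is its gap to chronology
predicted_age = np.array([66.0, 58.2, 71.5, 63.1])     # synthetic values
chronological_age = np.array([62.0, 59.0, 65.0, 64.0])
pag = predicted_age - chronological_age                # positive gap = "older" prostate

# PAG enters a logistic regression for clinically significant PC
y = np.array([1, 0, 1, 0])                             # synthetic csPC labels (Gleason >= 7)
clf = LogisticRegression().fit(pag.reshape(-1, 1), y)
odds_ratio = np.exp(clf.coef_[0][0])                   # OR per one-year increase in PAG
```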

RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation

  • paper_url: http://arxiv.org/abs/2308.05318
  • repo_url: https://github.com/irmvlab/rlsac
  • paper_authors: Chang Nie, Guangming Wang, Zhe Liu, Luca Cavalli, Marc Pollefeys, Hesheng Wang
  • for: RLSAC, a reinforcement-learning-enhanced sample consensus framework for end-to-end robust estimation: a graph neural network exploits both data and historical (memory) information to guide the exploration directions for sampling the next minimum set.
  • methods: The feedback of downstream tasks serves as the reward for unsupervised training, so RLSAC avoids differentiating through the features and the downstream feedback; a state transition module encodes both data and memory features.
  • results: Experiments show RLSAC learns from features to gradually explore better hypotheses, and it transfers easily to other sample-consensus-based robust estimation tasks.
    Abstract Robust estimation is a crucial and still challenging task, which involves estimating model parameters in noisy environments. Although conventional sampling consensus-based algorithms sample several times to achieve robustness, these algorithms cannot use data features and historical information effectively. In this paper, we propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation. RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set. The feedback of downstream tasks serves as the reward for unsupervised training. Therefore, RLSAC can avoid differentiating to learn the features and the feedback of downstream tasks for end-to-end robust estimation. In addition, RLSAC integrates a state transition module that encodes both data and memory features. Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis. Through analysis, it is apparent that RLSAC can be easily transferred to other sampling consensus-based robust estimation tasks. To the best of our knowledge, RLSAC is also the first method that uses reinforcement learning to sample consensus for end-to-end robust estimation. We release our codes at https://github.com/IRMVLab/RLSAC.

Deep Semantic Graph Matching for Large-scale Outdoor Point Clouds Registration

  • paper_url: http://arxiv.org/abs/2308.05314
  • repo_url: None
  • paper_authors: Shaocong Liu, Tao Wang, Yan Zhang, Ruqin Zhou, Li Li, Chenguang Dai, Yongsheng Zhang, Hanyun Wang
  • for: Registration of large-scale outdoor point clouds.
  • methods: A large-scale point cloud semantic segmentation network provides semantic category labels, and adjacent points with the same label are clustered into semantic instances by Euclidean clustering. A graph convolutional network then learns high-dimensional features over the semantic adjacency graph built from the spatial adjacency of instances, combining geometric shape, semantic category, and spatial distribution features enhanced by an attention mechanism. Finally, semantic instance matching is modeled as an optimal transport problem solved with an optimal matching layer, after which the transformation is obtained by SVD and refined by ICP.
  • results: On the KITTI Odometry dataset, the method achieves an average relative translation error of 6.6 cm and an average relative rotation error of 0.229°.
    Abstract The current point cloud registration methods are mainly based on geometric information and usually ignore the semantic information in the point clouds. In this paper, we treat the point cloud registration problem as a semantic instance matching and registration task, and propose a deep semantic graph matching method for large-scale outdoor point cloud registration. Firstly, the semantic category labels of 3D point clouds are obtained by utilizing a large-scale point cloud semantic segmentation network. The adjacent points with the same category labels are then clustered together by using the Euclidean clustering algorithm to obtain the semantic instances. Secondly, the semantic adjacency graph is constructed based on the spatial adjacency relation of semantic instances. Three kinds of high-dimensional features including geometric shape features, semantic categorical features and spatial distribution features are learned through a graph convolutional network, and enhanced with an attention mechanism. Thirdly, the semantic instance matching problem is modeled as an optimal transport problem, and solved through an optimal matching layer. Finally, according to the matched semantic instances, the geometric transformation matrix between two point clouds is first obtained by the SVD algorithm and then refined by the ICP algorithm. The experiments are conducted on the KITTI Odometry dataset, and the average relative translation error and average relative rotation error of the proposed method are 6.6cm and 0.229{\deg} respectively.

DAOT: Domain-Agnostically Aligned Optimal Transport for Domain-Adaptive Crowd Counting

  • paper_url: http://arxiv.org/abs/2308.05311
  • repo_url: https://github.com/HopooLinZ/DAOT
  • paper_authors: Huilin Zhu, Jingling Yuan, Xian Zhong, Zhengwei Yang, Zheng Wang, Shengfeng He
  • for: Addressing domain gaps in cross-domain crowd counting, including the intra-dataset differences that previous domain adaptation methods overlook.
  • methods: A Domain-agnostically Aligned Optimal Transport (DAOT) strategy aligns domain-agnostic factors (e.g., density, surveillance perspective, scale) between domains: individual-level differences are measured with structural similarity (SSIM), an optimal transport strategy finds the optimal domain-to-domain misalignment with outliers removed via a virtual "dustbin" column (the OT step is sketched below), and the model is retrained on the aligned factors to bridge the gap across domains.
  • results: Extensive experiments on five standard crowd-counting benchmarks demonstrate strong generalizability across diverse datasets.
    Abstract Domain adaptation is commonly employed in crowd counting to bridge the domain gaps between different datasets. However, existing domain adaptation methods tend to focus on inter-dataset differences while overlooking the intra-differences within the same dataset, leading to additional learning ambiguities. These domain-agnostic factors, e.g., density, surveillance perspective, and scale, can cause significant in-domain variations, and the misalignment of these factors across domains can lead to a drop in performance in cross-domain crowd counting. To address this issue, we propose a Domain-agnostically Aligned Optimal Transport (DAOT) strategy that aligns domain-agnostic factors between domains. The DAOT consists of three steps. First, individual-level differences in domain-agnostic factors are measured using structural similarity (SSIM). Second, the optimal transfer (OT) strategy is employed to smooth out these differences and find the optimal domain-to-domain misalignment, with outlier individuals removed via a virtual "dustbin" column. Third, knowledge is transferred based on the aligned domain-agnostic factors, and the model is retrained for domain adaptation to bridge the gap across domains. We conduct extensive experiments on five standard crowd-counting benchmarks and demonstrate that the proposed method has strong generalizability across diverse datasets. Our code will be available at: https://github.com/HopooLinZ/DAOT/.
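The OT-with-dustbin step can be sketched with entropic regularization and Sinkhorn iterations. The cost would come from the SSIM measurements (e.g., 1 - SSIM between individuals of the two domains); the uniform marginals, dustbin cost, and iteration count below are assumptions.

```python
import torch

def sinkhorn_with_dustbin(cost, n_iters=50, eps=0.1, dustbin_cost=1.0):
    """Entropic OT with a virtual 'dustbin' column: outlier rows can route
    their mass to the dustbin instead of being force-matched."""
    n, m = cost.shape
    # append a dustbin column with a fixed cost for declaring an outlier
    cost = torch.cat([cost, torch.full((n, 1), dustbin_cost)], dim=1)
    K = torch.exp(-cost / eps)                # Gibbs kernel
    a = torch.full((n,), 1.0 / n)             # uniform source mass
    b = torch.full((m + 1,), 1.0 / (m + 1))   # targets + dustbin
    u, v = torch.ones(n), torch.ones(m + 1)
    for _ in range(n_iters):                  # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]        # transport plan, shape (n, m + 1)

# e.g. plan = sinkhorn_with_dustbin(1 - ssim_matrix)
```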

From CNN to Transformer: A Review of Medical Image Segmentation Models

  • paper_url: http://arxiv.org/abs/2308.05305
  • repo_url: None
  • paper_authors: Wenjian Yao, Jiajun Bai, Wei Liao, Yuheng Chen, Mengjuan Liu, Yao Xie
  • for: Researchers in medical image segmentation, particularly those interested in applying deep learning models to this task.
  • methods: The paper surveys four of the most representative recent medical image segmentation models, including U-Net and its variants as well as transformer-based models such as TransUNet.
  • results: It quantitatively evaluates these models on two benchmark datasets (Tuberculosis Chest X-rays and ovarian tumors) and discusses the main challenges and future trends in medical image segmentation.
    Abstract Medical image segmentation is an important step in medical image analysis, especially as a crucial prerequisite for efficient disease diagnosis and treatment. The use of deep learning for image segmentation has become a prevalent trend. The widely adopted approach currently is U-Net and its variants. Additionally, with the remarkable success of pre-trained models in natural language processing tasks, transformer-based models like TransUNet have achieved desirable performance on multiple medical image segmentation datasets. In this paper, we conduct a survey of the most representative four medical image segmentation models in recent years. We theoretically analyze the characteristics of these models and quantitatively evaluate their performance on two benchmark datasets (i.e., Tuberculosis Chest X-rays and ovarian tumors). Finally, we discuss the main challenges and future trends in medical image segmentation. Our work can assist researchers in the related field to quickly establish medical segmentation models tailored to specific regions.

Multi-Visual-Inertial System: Analysis, Calibration and Estimation

  • paper_url: http://arxiv.org/abs/2308.05303
  • repo_url: None
  • paper_authors: Yulin Yang, Patrick Geneva, Guoquan Huang
  • for: State estimation of multi-visual-inertial systems (MVIS) and sensor fusion algorithms that optimally fuse an arbitrary number of asynchronous IMUs or gyroscopes with global and/or rolling-shutter cameras, including full calibration of the associated visual-inertial sensors.
  • methods: A new analytic combined IMU integration with intrinsics (ACI3) preintegrates IMU measurements and is leveraged to fuse auxiliary IMUs and/or gyroscopes alongside a base IMU; the model includes all necessary inertial intrinsics and IMU-IMU spatiotemporal extrinsics while exploiting IMU-IMU rigid-body constraints to eliminate auxiliary inertial poses and reduce computational complexity.
  • results: Extensive simulations and over 25 real-world datasets collected with a self-built MVIS sensor rig verify the proposed calibration, which achieves accuracy competitive with state-of-the-art methods such as Kalibr, with improved convergence and repeatability; the implementation is open sourced.
    Abstract In this paper, we study state estimation of multi-visual-inertial systems (MVIS) and develop sensor fusion algorithms to optimally fuse an arbitrary number of asynchronous inertial measurement units (IMUs) or gyroscopes and global and/or rolling shutter cameras. We are especially interested in the full calibration of the associated visual-inertial sensors, including the IMU or camera intrinsics and the IMU-IMU (or camera) spatiotemporal extrinsics as well as the image readout time of rolling-shutter cameras (if used). To this end, we develop a new analytic combined IMU integration with intrinsics, termed ACI3, to preintegrate IMU measurements, which is leveraged to fuse auxiliary IMUs and/or gyroscopes alongside a base IMU. We model the multi-inertial measurements to include all the necessary inertial intrinsic and IMU-IMU spatiotemporal extrinsic parameters, while leveraging IMU-IMU rigid-body constraints to eliminate the necessity of auxiliary inertial poses and thus reducing computational complexity. By performing observability analysis of MVIS, we prove that the standard four unobservable directions remain no matter how many inertial sensors are used, and also identify, for the first time, degenerate motions for IMU-IMU spatiotemporal extrinsics and auxiliary inertial intrinsics. In addition to the extensive simulations that validate our analysis and algorithms, we have built our own MVIS sensor rig and collected over 25 real-world datasets to experimentally verify the proposed calibration against the state-of-the-art calibration method such as Kalibr. We show that the proposed MVIS calibration is able to achieve competing accuracy with improved convergence and repeatability, which is open sourced to better benefit the community.

Fine-Grained Self-Supervised Learning with Jigsaw Puzzles for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2308.05770
  • repo_url: https://github.com/kalelpark/FG-SSL
  • paper_authors: Wongi Park, Jongbin Ryu
  • for: Improving classification accuracy for subtle, fine-grained lesions in medical images, where inter-class differences are minor.
  • methods: A Fine-Grained Self-Supervised Learning (FG-SSL) method learns the model progressively through hierarchical blocks so that the cross-correlation between the fine-grained Jigsaw puzzle and the regularized original images approaches the identity matrix (a loss in this spirit is sketched below); hierarchical blocks are also applied to supervised learning to extract different information at each step and discover subtle differences.
  • results: Experiments show the method performs favorably against state-of-the-art approaches on the widely used ISIC2018, APTOS2019, and ISIC2017 datasets.
    Abstract Classifying fine-grained lesions is challenging due to minor and subtle differences in medical images. This is because learning features of fine-grained lesions with highly minor differences is very difficult in training deep neural networks. Therefore, in this paper, we introduce Fine-Grained Self-Supervised Learning(FG-SSL) method for classifying subtle lesions in medical images. The proposed method progressively learns the model through hierarchical block such that the cross-correlation between the fine-grained Jigsaw puzzle and regularized original images is close to the identity matrix. We also apply hierarchical block for progressive fine-grained learning, which extracts different information in each step, to supervised learning for discovering subtle differences. Our method does not require an asymmetric model, nor does a negative sampling strategy, and is not sensitive to batch size. We evaluate the proposed fine-grained self-supervised learning method on comprehensive experiments using various medical image recognition datasets. In our experiments, the proposed method performs favorably compared to existing state-of-the-art approaches on the widely-used ISIC2018, APTOS2019, and ISIC2017 datasets.
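The "cross-correlation close to the identity" objective reads like a Barlow Twins-style loss between embeddings of the jigsaw view and the regularized original; the normalization and weighting below are assumptions in that spirit, not the paper's exact loss.

```python
import torch

def cross_correlation_loss(z_jigsaw, z_orig, lambd=5e-3):
    """z_jigsaw / z_orig: (N, D) embeddings of the jigsaw-permuted image and
    the regularized original. Features should correlate with themselves
    (diagonal -> 1) but not across dimensions (off-diagonal -> 0)."""
    n, d = z_jigsaw.shape
    z1 = (z_jigsaw - z_jigsaw.mean(0)) / (z_jigsaw.std(0) + 1e-6)
    z2 = (z_orig - z_orig.mean(0)) / (z_orig.std(0) + 1e-6)
    c = (z1.t() @ z2) / n                                   # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag
```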

Informative Scene Graph Generation via Debiasing

  • paper_url: http://arxiv.org/abs/2308.05286
  • repo_url: None
  • paper_authors: Lianli Gao, Xinyu Lyu, Yuyu Guo, Yuxuan Hu, Yuan-Fang Li, Lu Xu, Heng Tao Shen, Jingkuan Song
  • for: Addressing biases in scene graph generation models, which tend to predict common predicates (e.g., "on", "at") instead of informative ones (e.g., "standing on", "looking at"), losing precise information and hurting overall performance.
  • methods: The proposed DB-SGG framework targets the semantic-space and training-sample imbalances with two components: Semantic Debiasing (SD), which uses a confusion matrix and a bipartite graph to construct predicate relationships, and Balanced Predicate Learning (BPL), which adopts a random undersampling strategy and an ambiguity removing strategy to focus on informative predicates.
  • results: The model-agnostic method outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 across three SGG sub-tasks on the SGG-VG dataset, and is further verified on the SGG-GQA dataset and two downstream tasks (sentence-to-graph retrieval and image captioning).
    Abstract Scene graph generation aims to detect visual relationship triplets, (subject, predicate, object). Due to biases in data, current models tend to predict common predicates, e.g. "on" and "at", instead of informative ones, e.g. "standing on" and "looking at". This tendency results in the loss of precise information and overall performance. If a model only uses "stone on road" rather than "stone blocking road" to describe an image, it may be a grave misunderstanding. We argue that this phenomenon is caused by two imbalances: semantic space level imbalance and training sample level imbalance. For this problem, we propose DB-SGG, an effective framework based on debiasing but not the conventional distribution fitting. It integrates two components: Semantic Debiasing (SD) and Balanced Predicate Learning (BPL), for these imbalances. SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity removing strategy to focus on informative predicates. Benefiting from the model-agnostic process, our method can be easily applied to SGG models and outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 at three SGG sub-tasks on the SGG-VG dataset. Our method is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).

Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2308.05274
  • repo_url: None
  • paper_authors: Xinyu Lyu, Jingwei Liu, Yuyu Guo, Lianli Gao
  • for: Improving dynamic scene graph generation (DynSGG) and addressing the failure of previous models to predict tail predicates.
  • methods: A new multi-task learning DynSGG model (DynSGG-MTL) introduces local interaction information and global human-action interaction information: interactions between objects and frame features let the model understand the visual context of single frames more fully, while long-temporal human actions supervise the model to generate multiple scene graphs that conform to global constraints.
  • results: Extensive experiments on the Action Genome dataset demonstrate that the framework not only improves dynamic scene graph generation but also alleviates the long-tail problem.
    Abstract The task of dynamic scene graph generation (DynSGG) aims to generate scene graphs for given videos, which involves modeling the spatial-temporal information in the video. However, due to the long-tailed distribution of samples in the dataset, previous DynSGG models fail to predict the tail predicates. We argue that this phenomenon is due to previous methods that only pay attention to the local spatial-temporal information and neglect the consistency of multiple frames. To solve this problem, we propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information. The interaction between objects and frame features makes the model more fully understand the visual context of the single image. Long-temporal human actions supervise the model to generate multiple scene graphs that conform to the global constraints and avoid the model being unable to learn the tail predicates. Extensive experiments on Action Genome dataset demonstrate the efficacy of our proposed framework, which not only improves the dynamic scene graph generation but also alleviates the long-tail problem.

TrainFors: A Large Benchmark Training Dataset for Image Manipulation Detection and Localization

  • paper_url: http://arxiv.org/abs/2308.05264
  • repo_url: None
  • paper_authors: Soumyaroop Nandi, Prem Natarajan, Wael Abd-Almageed
  • for: Providing a standardized benchmark training dataset for image manipulation detection and localization (IMDL), so that existing methods can be evaluated fairly.
  • methods: The authors propose the TrainFors1 dataset covering image splicing, copy-move forgery, removal forgery, and image enhancement forgery, identify the problems with existing IMDL datasets along with the required modifications, and train state-of-the-art IMDL methods on the new dataset.
  • results: The study reports the actual performance of these methods under similar conditions, showing that results depend heavily on the training dataset as well as the model architecture.
    Abstract The evaluation datasets and metrics for image manipulation detection and localization (IMDL) research have been standardized. But the training dataset for such a task is still nonstandard. Previous researchers have used unconventional and deviating datasets to train neural networks for detecting image forgeries and localizing pixel maps of manipulated regions. For a fair comparison, the training set, test set, and evaluation metrics should be persistent. Hence, comparing the existing methods may not seem fair as the results depend heavily on the training datasets as well as the model architecture. Moreover, none of the previous works release the synthetic training dataset used for the IMDL task. We propose a standardized benchmark training dataset for image splicing, copy-move forgery, removal forgery, and image enhancement forgery. Furthermore, we identify the problems with the existing IMDL datasets and propose the required modifications. We also train the state-of-the-art IMDL methods on our proposed TrainFors dataset for a fair evaluation and report the actual performance of these methods under similar conditions.

Advancing Early Detection of Virus Yellows: Developing a Hybrid Convolutional Neural Network for Automatic Aphid Counting in Sugar Beet Fields

  • paper_url: http://arxiv.org/abs/2308.05257
  • repo_url: https://github.com/junfenggaolab/counting-aphids
  • paper_authors: Xumin Gao, Wenxin Xue, Callum Lennox, Mark Stevens, Junfeng Gao
  • for: Providing an effective automatic aphid counting method to give early warning of the virus yellows risk in sugar beet fields.
  • methods: A hybrid automatic aphid counting network architecture that integrates a detection network with a density map estimation network: an improved YOLOv5 counts aphids when their density is low, and CSRNet takes over when the density is high.
  • results: Comparison experiments show the method outperforms all others in counting aphids, achieving the lowest MAE and RMSE values: 2.93 and 4.01 on the standard aphid dataset, and 34.19 and 38.66 on the high-density dataset. The AP of the improved YOLOv5 is 5% higher than the original YOLOv5, with markedly better detection of extremely small and densely distributed aphids.
    Abstract Aphids are efficient vectors to transmit virus yellows in sugar beet fields. Timely monitoring and control of their populations are thus critical to prevent the large-scale outbreak of virus yellows. However, the manual counting of aphids, which is the most common practice, is labor-intensive and time-consuming. Additionally, two of the biggest challenges in aphid counting are that aphids are small objects and their density distributions are varied in different areas of the field. To address these challenges, we proposed a hybrid automatic aphid counting network architecture which integrates the detection network and the density map estimation network. When the distribution density of aphids is low, it utilizes an improved Yolov5 to count aphids. Conversely, when the distribution density of aphids is high, it switches to CSRNet to count aphids. To the best of our knowledge, this is the first framework integrating the detection network and the density map estimation network for counting tasks. Through comparison experiments of counting aphids, it was verified that our proposed approach outperforms all other methods in counting aphids. It achieved the lowest MAE and RMSE values for both the standard and high-density aphid datasets: 2.93 and 4.01 (standard), and 34.19 and 38.66 (high-density), respectively. Moreover, the AP of the improved Yolov5 is 5% higher than that of the original Yolov5. Especially for extremely small aphids and densely distributed aphids, the detection performance of the improved Yolov5 is significantly better than the original Yolov5. This work provides an effective early warning for the virus yellows risk caused by aphids in sugar beet fields, offering protection for sugar beet growth and ensuring sugar beet yield. The datasets and project code are released at: https://github.com/JunfengGaolab/Counting-Aphids.
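
The density-based switching is the heart of the hybrid design. Here is a minimal runnable sketch of the two-branch logic; `detector`, `density_net`, and the switching threshold are stand-ins, and the released repository should be consulted for the actual components.

```python
import torch

def hybrid_aphid_count(image, detector, density_net, switch_threshold=30):
    """Sketch of the two-branch counting logic: run the detector first and
    fall back to density-map regression when the scene looks crowded. The
    threshold value and both networks are placeholders for illustration."""
    boxes = detector(image)               # improved-YOLOv5 stand-in: (N, 5) boxes
    if len(boxes) < switch_threshold:
        return len(boxes)                 # sparse scene: count detections directly
    density_map = density_net(image)      # CSRNet stand-in: (1, 1, H, W) density
    return float(density_map.sum())       # crowded scene: integrate the density map

# Toy stand-ins so the sketch runs end to end.
detector = lambda img: torch.zeros(40, 5)                # pretends 40 detections
density_net = lambda img: torch.full((1, 1, 8, 8), 0.6)  # pretends ~38.4 aphids
print(hybrid_aphid_count(torch.zeros(1, 3, 64, 64), detector, density_net))
```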

Spatial Gated Multi-Layer Perceptron for Land Use and Land Cover Mapping

  • paper_url: http://arxiv.org/abs/2308.05235
  • repo_url: https://github.com/aj1365/sgumlp
  • paper_authors: Ali Jamali, Swalpa Kumar Roy, Danfeng Hong, Peter M Atkinson, Pedram Ghamisi
  • for: Developing a model that accurately maps land use and land cover (LULC).
  • methods: Combining multi-layer perceptrons (MLPs) with spatial gating units (SGUs) to achieve precise LULC classification.
  • results: The proposed SGU-MLP classifier outperforms CNN- and CNN-ViT-based models (including HybridSN, ResNet, iFormer, EfficientFormer, and CoAtNet) across three experiments in Houston, Berlin, and Augsburg; in the Houston experiment, for example, SGU-MLP outperforms HybridSN, CoAtNet, EfficientFormer, iFormer, and ResNet by approximately 15%, 19%, 20%, 21%, and 25% in average accuracy, respectively.
    Abstract Convolutional Neural Networks (CNNs) are models that are utilized extensively for the hierarchical extraction of features. Vision transformers (ViTs), through the use of a self-attention mechanism, have recently achieved superior modeling of global contextual information compared to CNNs. However, to realize their image classification strength, ViTs require substantial training datasets. Where the available training data are limited, current advanced multi-layer perceptrons (MLPs) can provide viable alternatives to both deep CNNs and ViTs. In this paper, we developed the SGU-MLP, a learning algorithm that effectively uses both MLPs and spatial gating units (SGUs) for precise land use land cover (LULC) mapping. Results illustrated the superiority of the developed SGU-MLP classification algorithm over several CNN and CNN-ViT-based models, including HybridSN, ResNet, iFormer, EfficientFormer and CoAtNet. The proposed SGU-MLP algorithm was tested through three experiments in Houston, USA, Berlin, Germany and Augsburg, Germany. The SGU-MLP classification model was found to consistently outperform the benchmark CNN and CNN-ViT-based algorithms. For example, for the Houston experiment, SGU-MLP significantly outperformed HybridSN, CoAtNet, Efficientformer, iFormer and ResNet by approximately 15%, 19%, 20%, 21%, and 25%, respectively, in terms of average accuracy. The code will be made publicly available at https://github.com/aj1365/SGUMLP
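
For readers unfamiliar with spatial gating, the sketch below shows a generic gMLP-style spatial gating unit: the channels are split in half, one half is projected along the token axis, and the result gates the other half. This is a common SGU formulation assumed for illustration, not the exact block from the paper.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """gMLP-style spatial gating unit: split channels, spatially project one
    half over the token axis, and use it to gate the other half."""
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.spatial_proj = nn.Linear(num_tokens, num_tokens)
        nn.init.zeros_(self.spatial_proj.weight)   # start close to identity gating
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):                 # x: (batch, tokens, dim)
        u, v = x.chunk(2, dim=-1)         # split the channel dimension in half
        v = self.norm(v).transpose(1, 2)  # project across the spatial/token axis
        v = self.spatial_proj(v).transpose(1, 2)
        return u * v                      # element-wise spatial gate

sgu = SpatialGatingUnit(dim=64, num_tokens=49)   # e.g. 7x7 image patches
print(sgu(torch.randn(2, 49, 64)).shape)         # -> torch.Size([2, 49, 32])
```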

SegMatch: A semi-supervised learning method for surgical instrument segmentation

  • paper_url: http://arxiv.org/abs/2308.05232
  • repo_url: None
  • paper_authors: Meng Wei, Charlie Budd, Luis C. Garcia-Peraza-Herrera, Reuben Dorent, Miaojing Shi, Tom Vercauteren
  • for: Improving the accuracy of surgical instrument segmentation in laparoscopic and robotic surgical images while reducing the need for expensive annotation.
  • methods: SegMatch, a semi-supervised learning method adapted from FixMatch that enforces an unsupervised loss via weak augmentation and pseudo-labelling on high-confidence pixels, carefully accounting for the equivariance and invariance properties of the augmentations and adding a trainable adversarial augmentation strategy.
  • results: On the MICCAI instrument segmentation challenge datasets Robust-MIS 2019 and EndoVis 2017, adding unlabelled data lets the method surpass fully supervised approaches that are limited by the available training data, and SegMatch outperforms state-of-the-art semi-supervised semantic segmentation models across different labelled-to-unlabelled data ratios.
    Abstract Surgical instrument segmentation is recognised as a key enabler to provide advanced surgical assistance and improve computer assisted interventions. In this work, we propose SegMatch, a semi-supervised learning method to reduce the need for expensive annotation for laparoscopic and robotic surgical images. SegMatch builds on FixMatch, a widespread semi-supervised classification pipeline combining consistency regularization and pseudo-labelling, and adapts it for the purpose of segmentation. In our proposed SegMatch, the unlabelled images are weakly augmented and fed into the segmentation model to generate a pseudo-label to enforce the unsupervised loss against the output of the model for the adversarial augmented image on the pixels with a high confidence score. Our adaptation for segmentation tasks includes carefully considering the equivariance and invariance properties of the augmentation functions we rely on. To increase the relevance of our augmentations, we depart from using only handcrafted augmentations and introduce a trainable adversarial augmentation strategy. Our algorithm was evaluated on the MICCAI Instrument Segmentation Challenge datasets Robust-MIS 2019 and EndoVis 2017. Our results demonstrate that adding unlabelled data for training purposes allows us to surpass the performance of fully supervised approaches which are limited by the availability of training data in these challenges. SegMatch also outperforms a range of state-of-the-art semi-supervised learning semantic segmentation models in different labelled to unlabelled data ratios.
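
The core unsupervised term is FixMatch-style confidence-thresholded pseudo-labelling applied per pixel. A minimal sketch follows; the threshold, the augmentations, and the toy model are placeholders rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def segmatch_unsup_loss(model, weak_img, strong_img, conf_threshold=0.95):
    """FixMatch-style consistency loss adapted to per-pixel segmentation:
    pseudo-labels come from the weakly augmented view and supervise the
    strongly/adversarially augmented view, but only on pixels whose
    confidence clears the threshold."""
    with torch.no_grad():
        probs = model(weak_img).softmax(dim=1)   # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)          # per-pixel label + confidence
    logits_strong = model(strong_img)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")  # (B, H, W)
    mask = (conf >= conf_threshold).float()      # keep confident pixels only
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

# Toy run with a trivial "model" (a fixed 1x1 conv) and two augmented views.
model = torch.nn.Conv2d(3, 4, kernel_size=1)
print(segmatch_unsup_loss(model, torch.randn(2, 3, 32, 32),
                          torch.randn(2, 3, 32, 32)).item())
```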

A Unified Interactive Model Evaluation for Classification, Object Detection, and Instance Segmentation in Computer Vision

  • paper_url: http://arxiv.org/abs/2308.05168
  • repo_url: None
  • paper_authors: Changjian Chen, Yukai Guo, Fengyuan Tian, Shilong Liu, Weikai Yang, Zhaowei Wang, Jing Wu, Hang Su, Hanspeter Pfister, Shixia Liu
  • for: Existing model evaluation tools in computer vision focus on classification models and leave a gap for more complex models such as object detectors; this work aims to close that gap.
  • methods: Uni-Evaluator, an open-source visual analysis tool supporting unified model evaluation for classification, object detection, and instance segmentation; the key idea is to formulate both discrete and continuous predictions from different tasks as unified probability distributions.
  • results: Two case studies demonstrate that Uni-Evaluator effectively evaluates model performance and supports informed improvements.
    Abstract Existing model evaluation tools mainly focus on evaluating classification models, leaving a gap in evaluating more complex models, such as object detection. In this paper, we develop an open-source visual analysis tool, Uni-Evaluator, to support a unified model evaluation for classification, object detection, and instance segmentation in computer vision. The key idea behind our method is to formulate both discrete and continuous predictions in different tasks as unified probability distributions. Based on these distributions, we develop 1) a matrix-based visualization to provide an overview of model performance; 2) a table visualization to identify the problematic data subsets where the model performs poorly; 3) a grid visualization to display the samples of interest. These visualizations work together to facilitate the model evaluation from a global overview to individual samples. Two case studies demonstrate the effectiveness of Uni-Evaluator in evaluating model performance and making informed improvements.
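
To make the unified-distribution idea concrete, one plausible reading is that classification outputs are already categorical distributions, while detection outputs can be mapped to one distribution per ground-truth object. The mapping below is our assumption for illustration, not necessarily Uni-Evaluator's exact formulation.

```python
import torch

def classification_to_dist(logits):
    """Classification already yields a categorical distribution per sample."""
    return logits.softmax(dim=-1)

def detection_to_dist(pred_scores, pred_ious, num_classes, iou_threshold=0.5):
    """One plausible mapping (our assumption) from detection output for a single
    ground-truth object to a categorical distribution: take the class scores of
    the best-overlapping predicted box, or uniform if nothing matches."""
    if len(pred_ious) == 0 or pred_ious.max() < iou_threshold:
        return torch.full((num_classes,), 1.0 / num_classes)  # "missed" object
    best = pred_ious.argmax()
    return pred_scores[best].softmax(dim=-1)

# Both tasks now share one representation, so one matrix/table/grid view works.
print(classification_to_dist(torch.randn(2, 5)))
print(detection_to_dist(torch.randn(3, 5), torch.tensor([0.2, 0.7, 0.4]), 5))
```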

Deep Learning for Morphological Identification of Extended Radio Galaxies using Weak Labels

  • paper_url: http://arxiv.org/abs/2308.05166
  • repo_url: https://github.com/nikhel1/gal-cam
  • paper_authors: Nikhel Gupta, Zeeshan Hayder, Ray P. Norris, Minh Huynh, Lars Petersson, X. Rosalind Wang, Heinz Andernach, Bärbel S. Koribalski, Miranda Yew, Evan J. Crawford
  • for: Developing a weakly-supervised deep learning algorithm that reduces the cost of labelling pixel-level masks for complex radio galaxies with multiple components.
  • methods: Training on weak class-level labels of radio galaxies to obtain class activation maps (CAMs), which are then refined with an inter-pixel relations network (IRNet) to produce instance segmentation masks over radio galaxies and the positions of their infrared hosts.
  • results: Using data from the Australian Square Kilometre Array Pathfinder (ASKAP) telescope, specifically the Evolutionary Map of the Universe (EMU) Pilot Survey covering 270 square degrees of sky at an RMS sensitivity of 25-35 $\mu$Jy/beam, the model predicts pixel-level information with high accuracy, reaching mAP$_{50}$ scores of 67.5% for radio masks and 76.8% for infrared host positions. The network architecture is available at: https://github.com/Nikhel1/Gal-CAM
    Abstract The present work discusses the use of a weakly-supervised deep learning algorithm that reduces the cost of labelling pixel-level masks for complex radio galaxies with multiple components. The algorithm is trained on weak class-level labels of radio galaxies to get class activation maps (CAMs). The CAMs are further refined using an inter-pixel relations network (IRNet) to get instance segmentation masks over radio galaxies and the positions of their infrared hosts. We use data from the Australian Square Kilometre Array Pathfinder (ASKAP) telescope, specifically the Evolutionary Map of the Universe (EMU) Pilot Survey, which covered a sky area of 270 square degrees with an RMS sensitivity of 25-35 $\mu$Jy/beam. We demonstrate that weakly-supervised deep learning algorithms can achieve high accuracy in predicting pixel-level information, including masks for the extended radio emission encapsulating all galaxy components and the positions of the infrared host galaxies. We evaluate the performance of our method using mean Average Precision (mAP) across multiple classes at a standard intersection over union (IoU) threshold of 0.5. We show that the model achieves a mAP$_{50}$ of 67.5\% and 76.8\% for radio masks and infrared host positions, respectively. The network architecture can be found at the following link: https://github.com/Nikhel1/Gal-CAM
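
The CAM stage of this pipeline is the standard construction of Zhou et al. (2016): weight the final convolutional feature maps by the classifier weights of the target class. The sketch below shows only this stage; the IRNet refinement is not reproduced, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weight, class_idx, out_size):
    """Standard CAM: weight the last conv layer's feature maps by the
    classifier weights of the target class, then upsample to image size."""
    # feature_maps: (C, h, w) from the last conv layer; fc_weight: (num_cls, C)
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], feature_maps)
    cam = F.relu(cam)                                      # keep positive evidence
    cam = cam / cam.max().clamp(min=1e-8)                  # normalise to [0, 1]
    return F.interpolate(cam[None, None], size=out_size,   # upsample to image size
                         mode="bilinear", align_corners=False)[0, 0]

cam = class_activation_map(torch.randn(256, 16, 16).abs(),
                           torch.randn(4, 256), class_idx=2, out_size=(128, 128))
print(cam.shape, float(cam.max()))
```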

Scene-Generalizable Interactive Segmentation of Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.05104
  • repo_url: None
  • paper_authors: Songlin Tang, Wenjie Pei, Xin Tao, Tanghui Jia, Guangming Lu, Yu-Wing Tai
  • for: A scene-generalizable method for interactive 3D object segmentation in radiance fields.
  • methods: Cross-Dimension Guidance Propagation to encode sparse 2D user clicks into informative 3D guidance representations; an Uncertainty-Eliminated 3D Segmentation module for efficient yet effective 3D segmentation; and a Concealment-Revealed Supervised Learning scheme that reveals and corrects concealed 3D segmentation errors supervised only by 2D mask annotations.
  • results: Extensive experiments on two real-world challenging benchmarks covering diverse scenes demonstrate the effectiveness and scene-generalizability of the proposed method, with favorable performance compared to classical methods that require scene-specific optimization.
    Abstract Existing methods for interactive segmentation in radiance fields entail scene-specific optimization and thus cannot generalize across different scenes, which greatly limits their applicability. In this work we make the first attempt at Scene-Generalizable Interactive Segmentation in Radiance Fields (SGISRF) and propose a novel SGISRF method, which can perform 3D object segmentation for novel (unseen) scenes represented by radiance fields, guided by only a few interactive user clicks in a given set of multi-view 2D images. In particular, the proposed SGISRF focuses on addressing three crucial challenges with three specially designed techniques. First, we devise the Cross-Dimension Guidance Propagation to encode the scarce 2D user clicks into informative 3D guidance representations. Second, the Uncertainty-Eliminated 3D Segmentation module is designed to achieve efficient yet effective 3D segmentation. Third, the Concealment-Revealed Supervised Learning scheme is proposed to reveal and correct the concealed 3D segmentation errors resulting from the supervision in 2D space with only 2D mask annotations. Extensive experiments on two real-world challenging benchmarks covering diverse scenes demonstrate 1) the effectiveness and scene-generalizability of the proposed method, and 2) favorable performance compared to classical methods requiring scene-specific optimization.

A degree of image identification at sub-human scales could be possible with more advanced clusters

  • paper_url: http://arxiv.org/abs/2308.05092
  • repo_url: https://github.com/prateekjannu/imagescale2
  • paper_authors: Prateek Y J
  • for: Determining whether currently available self-supervised learning techniques can achieve human-level comprehension of visual images using the same degree and amount of sensory input that people receive.
  • methods: Scaling experiments over both data volume and image quality, carried out without any outside financing.
  • results: Scaling up data volume and image resolution at the same time enables human-level object detection performance at sub-human scales; vision transformers were trained on up to 200,000 images at up to 256 ppi.
    Abstract The purpose of the research is to determine if currently available self-supervised learning techniques can accomplish human level comprehension of visual images using the same degree and amount of sensory input that people acquire. Initial research on this topic solely considered data volume scaling. Here, we scale both the volume of data and the quality of the image. This scaling experiment is a self-supervised learning method that may be done without any outside financing. We find that scaling up data volume and picture resolution at the same time enables human-level item detection performance at sub-human sizes. We run a scaling experiment with vision transformers trained on up to 200,000 images at up to 256 ppi.

Volumetric Fast Fourier Convolution for Detecting Ink on the Carbonized Herculaneum Papyri

  • paper_url: http://arxiv.org/abs/2308.05070
  • repo_url: https://github.com/aimagelab/vffc
  • paper_authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
  • for: Proposing a deep learning approach, built on the Fast Fourier Convolution operator, for detecting ink on the carbonized Herculaneum papyri.
  • methods: A modification of the Fast Fourier Convolution operator for volumetric data, applied within a deep segmentation architecture to automatically detect ink on the papyri.
  • results: Deep experimental analysis demonstrates the suitability of the method for ink detection on these highly damaged documents.
    Abstract Recent advancements in Digital Document Restoration (DDR) have led to significant breakthroughs in analyzing highly damaged written artifacts. Among those, there has been an increasing interest in applying Artificial Intelligence techniques for virtually unwrapping and automatically detecting ink on the Herculaneum papyri collection. This collection consists of carbonized scrolls and fragments of documents, which have been digitized via X-ray tomography to allow the development of ad-hoc deep learning-based DDR solutions. In this work, we propose a modification of the Fast Fourier Convolution operator for volumetric data and apply it in a segmentation architecture for ink detection on the challenging Herculaneum papyri, demonstrating its suitability via deep experimental analysis. To encourage the research on this task and the application of the proposed operator to other tasks involving volumetric data, we will release our implementation (https://github.com/aimagelab/vffc)
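
The distinguishing ingredient is a Fourier-domain convolution extended to 3D volumes. Below is a minimal sketch of such a volumetric spectral convolution, covering only the global branch; the full operator in the authors' repository adds a local branch and further structure.

```python
import torch
import torch.nn as nn

class VolumetricSpectralConv(nn.Module):
    """Minimal sketch of the global branch of a volumetric Fast Fourier
    Convolution: transform the 3D volume to the frequency domain, apply a
    pointwise convolution over stacked real/imaginary parts, and transform
    back. See the authors' repository for the actual VFFC operator."""
    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked on the channel axis.
        self.conv = nn.Conv3d(2 * channels, 2 * channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (B, C, D, H, W)
        spec = torch.fft.rfftn(x, dim=(-3, -2, -1))
        z = torch.cat([spec.real, spec.imag], dim=1)
        z = self.act(self.conv(z))
        real, imag = z.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfftn(spec, s=x.shape[-3:], dim=(-3, -2, -1))

block = VolumetricSpectralConv(channels=8)
print(block(torch.randn(1, 8, 16, 32, 32)).shape)  # global receptive field in one op
```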

Geometric Learning-Based Transformer Network for Estimation of Segmentation Errors

  • paper_url: http://arxiv.org/abs/2308.05068
  • repo_url: None
  • paper_authors: Sneha Sree C, Mohammad Al Fahim, Keerthi Ram, Mohanasankar Sivaprakasam
  • for: Reducing and accelerating the effort specialists at hospitals and clinical institutions spend on image segmentation, and giving clinicians a tool for assessing and correcting segmentation errors.
  • methods: An approach that identifies and quantifies erroneous regions in a 3D volumetric segmentation map, using a transformer-based graph neural network to estimate and classify errors at any point or node of a 3D mesh.
  • results: Evaluated on a high-resolution micro-CT dataset with simulated erroneous 3D segmentation maps, the method identifies and quantifies erroneous regions with higher accuracy and precision than other graph neural networks.
    Abstract Many segmentation networks have been proposed for 3D volumetric segmentation of tumors and organs at risk. Hospitals and clinical institutions seek to accelerate and minimize the efforts of specialists in image segmentation. Still, in case of errors generated by these networks, clinicians would have to manually edit the generated segmentation maps. Given a 3D volume and its putative segmentation map, we propose an approach to identify and measure erroneous regions in the segmentation map. Our method can estimate error at any point or node in a 3D mesh generated from a possibly erroneous volumetric segmentation map, serving as a Quality Assurance tool. We propose a graph neural network-based transformer based on the Nodeformer architecture to measure and classify the segmentation errors at any point. We have evaluated our network on a high-resolution micro-CT dataset of the human inner-ear bony labyrinth structure by simulating erroneous 3D segmentation maps. Our network incorporates a convolutional encoder to compute node-centric features from the input micro-CT data, the Nodeformer to learn the latent graph embeddings, and a Multi-Layer Perceptron (MLP) to compute and classify the node-wise errors. Our network achieves a mean absolute error of ~0.042 in estimating the node-wise errors and an accuracy of 79.53% in classifying them, outperforming other Graph Neural Networks (GNNs). We also put forth vertex-normal prediction as a custom pretext task for pre-training the CNN encoder to improve the network's overall performance. Qualitative analysis shows the efficiency of our network in correctly classifying errors and reducing misclassifications.
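
The final stage, an MLP over per-node embeddings, is easy to sketch. The two-head layout below (error class plus error magnitude) and all sizes are our assumptions, not the published architecture; in the paper the embeddings come from the CNN encoder and Nodeformer.

```python
import torch
import torch.nn as nn

class NodeErrorHead(nn.Module):
    """Sketch of a node-wise error head: given per-node latent embeddings,
    an MLP predicts an error class and a continuous error magnitude for
    every vertex of the segmentation mesh."""
    def __init__(self, embed_dim, num_error_classes, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, num_error_classes)  # error type per node
        self.reg_head = nn.Linear(hidden, 1)                  # error magnitude

    def forward(self, node_embeddings):           # (num_nodes, embed_dim)
        h = self.trunk(node_embeddings)
        return self.cls_head(h), self.reg_head(h).squeeze(-1)

head = NodeErrorHead(embed_dim=64, num_error_classes=3)
logits, magnitude = head(torch.randn(500, 64))    # a 500-vertex surface mesh
print(logits.shape, magnitude.shape)
```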

A Novel Method for improving accuracy in neural network by reinstating traditional back propagation technique

  • paper_url: http://arxiv.org/abs/2308.05059
  • repo_url: None
  • paper_authors: Gokulprasath R
  • for: Proposing a fast and effective way to train deep neural networks that addresses the computational overhead and vanishing-gradient problems faced by conventional training.
  • methods: An instant parameter update technique that eliminates the need to compute gradients at each layer, accelerating learning and avoiding the vanishing gradient problem.
  • results: The method outperforms state-of-the-art approaches on benchmark datasets, showing faster learning and better stability.
    Abstract Deep learning has revolutionized industries like computer vision, natural language processing, and speech recognition. However, back propagation, the main method for training deep neural networks, faces challenges like computational overhead and vanishing gradients. In this paper, we propose a novel instant parameter update methodology that eliminates the need for computing gradients at each layer. Our approach accelerates learning, avoids the vanishing gradient problem, and outperforms state-of-the-art methods on benchmark data sets. This research presents a promising direction for efficient and effective deep neural network training.

PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

  • paper_url: http://arxiv.org/abs/2308.05051
  • repo_url: None
  • paper_authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton
  • for: Improving the detection of complex temporal co-occurrence action dependencies in videos by exploiting multi-scale temporal features to strengthen the temporal self-attention mechanism.
  • methods: PAT, a transformer-based network that embeds relative positional encoding in the self-attention mechanism and exploits multi-scale temporal relationships through a novel non-hierarchical design.
  • results: On two challenging dense multi-label benchmark datasets, PAT improves the state-of-the-art mAP by 1.1% on Charades and 0.6% on MultiTHUMOS, reaching new state-of-the-art results of 26.5% and 44.6%, respectively.
    Abstract We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
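
The first fix, relative positional encoding inside self-attention, can be sketched generically: a learned bias indexed by the temporal offset between two frames is added to the attention logits. This is a common relative-position scheme in the spirit of PAT's motivation, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class RelPosTemporalAttention(nn.Module):
    """Single-head temporal self-attention with a learned relative positional
    bias added to the attention logits, so the mechanism keeps track of how
    far apart two frames are."""
    def __init__(self, dim, max_len):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        # One learnable bias per relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x):                           # x: (batch, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale         # (B, T, T)
        offsets = torch.arange(T)[:, None] - torch.arange(T)[None, :]
        logits = logits + self.rel_bias[offsets + self.rel_bias.numel() // 2]
        return logits.softmax(dim=-1) @ v

attn = RelPosTemporalAttention(dim=32, max_len=64)
print(attn(torch.randn(2, 16, 32)).shape)           # -> torch.Size([2, 16, 32])
```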