cs.CV - 2023-07-24

Automotive Object Detection via Learning Sparse Events by Temporal Dynamics of Spiking Neurons

  • paper_url: http://arxiv.org/abs/2307.12900
  • repo_url: None
  • paper_authors: Hu Zhang, Luziwei Leng, Kaiwei Che, Qian Liu, Jie Cheng, Qinghai Guo, Jiangxing Liao, Ran Cheng
  • for: Applying Spiking Neural Networks (SNNs) to object detection on event-based data from high-speed platforms such as vehicles and drones.
  • methods: Exploits the membrane potential dynamics of spiking neurons to modulate network activity under fluctuating events and strengthen features of sparse input, uses a spike-triggered adaptive threshold to stabilize training, and builds an efficient spiking feature pyramid network on top of these mechanisms.
  • results: The proposed SNN achieves 47.7% mean average precision (mAP50) on the Gen1 benchmark dataset, surpassing the previous best SNN by 9.7% and outperforming sophisticated ANNs with attention mechanisms, at much lower computation cost thanks to sparse computation.
    Abstract Event-based sensors, with their high temporal resolution (1us) and dynamical range (120dB), have the potential to be deployed in high-speed platforms such as vehicles and drones. However, the highly sparse and fluctuating nature of events poses challenges for conventional object detection techniques based on Artificial Neural Networks (ANNs). In contrast, Spiking Neural Networks (SNNs) are well-suited for representing event-based data due to their inherent temporal dynamics. In particular, we demonstrate that the membrane potential dynamics can modulate network activity upon fluctuating events and strengthen features of sparse input. In addition, the spike-triggered adaptive threshold can stabilize training which further improves network performance. Based on this, we develop an efficient spiking feature pyramid network for event-based object detection. Our proposed SNN outperforms previous SNNs and sophisticated ANNs with attention mechanisms, achieving a mean average precision (map50) of 47.7% on the Gen1 benchmark dataset. This result significantly surpasses the previous best SNN by 9.7% and demonstrates the potential of SNNs for event-based vision. Our model has a concise architecture while maintaining high accuracy and much lower computation cost as a result of sparse computation. Our code will be publicly available.
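As an illustration of the spike-triggered adaptive threshold idea the abstract describes, here is a minimal leaky integrate-and-fire sketch in PyTorch; the time constants, threshold increment `beta`, and reset rule are illustrative assumptions, not the paper's exact neuron model.

```python
import torch

def lif_adaptive_threshold(inputs, tau_mem=10.0, tau_adapt=50.0, v_th0=1.0, beta=0.2):
    """Leaky integrate-and-fire dynamics with a spike-triggered adaptive threshold.
    inputs: (T, N) input currents over T timesteps for N neurons."""
    T, N = inputs.shape
    v = torch.zeros(N)                                 # membrane potential
    a = torch.zeros(N)                                 # adaptive threshold increment
    spikes = []
    for t in range(T):
        v = v * (1.0 - 1.0 / tau_mem) + inputs[t]      # leaky integration of sparse input
        s = (v >= v_th0 + a).float()                   # spike when potential crosses threshold
        v = v * (1.0 - s)                              # reset membrane where a spike occurred
        a = a * (1.0 - 1.0 / tau_adapt) + beta * s     # threshold rises after each spike, then decays
        spikes.append(s)
    return torch.stack(spikes)                         # (T, N) binary spike trains
```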

Data-free Black-box Attack based on Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.12872
  • repo_url: None
  • paper_authors: Mingwen Shao, Lingzhuang Meng, Yuanjian Qiao, Lixu Zhang, Wangmeng Zuo
  • for: Improving the efficiency and accuracy of data-free black-box attacks by using a diffusion model, instead of GANs, to generate data for training the substitute model.
  • methods: Generates training data with a diffusion model and proposes a Latent Code Augmentation (LCA) method to guide the generation so that the produced samples meet the discriminative criteria of the target model while remaining diverse.
  • results: With LCA guidance, the diffusion-generated data better fits the target model, yielding higher black-box attack success rates and smaller query budgets than GAN-based schemes across different target models.
    Abstract Since the training data for the target model in a data-free black-box attack is not available, most recent schemes utilize GANs to generate data for training substitute model. However, these GANs-based schemes suffer from low training efficiency as the generator needs to be retrained for each target model during the substitute training process, as well as low generation quality. To overcome these limitations, we consider utilizing the diffusion model to generate data, and propose a data-free black-box attack scheme based on diffusion model to improve the efficiency and accuracy of substitute training. Despite the data generated by the diffusion model exhibits high quality, it presents diverse domain distributions and contains many samples that do not meet the discriminative criteria of the target model. To further facilitate the diffusion model to generate data suitable for the target model, we propose a Latent Code Augmentation (LCA) method to guide the diffusion model in generating data. With the guidance of LCA, the data generated by the diffusion model not only meets the discriminative criteria of the target model but also exhibits high diversity. By utilizing this data, it is possible to train substitute model that closely resemble the target model more efficiently. Extensive experiments demonstrate that our LCA achieves higher attack success rates and requires fewer query budgets compared to GANs-based schemes for different target models.

Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry

  • paper_url: http://arxiv.org/abs/2307.12868
  • repo_url: None
  • paper_authors: Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, Youngjung Uh
  • for: Gaining a better understanding of the latent space of diffusion models (DMs) by analyzing it from a geometrical perspective.
  • methods: Uses the pullback metric to find the local latent basis in $\mathcal{X}$ and the corresponding local tangent basis in $\mathcal{H}$, the intermediate feature maps of DMs.
  • results: Demonstrates unsupervised image editing through traversal in $\mathbf{x}$-space, and analyzes how the discovered geometric structure evolves over diffusion timesteps and how it changes under text conditioning in Stable Diffusion.
    Abstract Despite the success of diffusion models (DMs), we still lack a thorough understanding of their latent space. To understand the latent space $\mathbf{x}_t \in \mathcal{X}$, we analyze them from a geometrical perspective. Specifically, we utilize the pullback metric to find the local latent basis in $\mathcal{X}$ and their corresponding local tangent basis in $\mathcal{H}$, the intermediate feature maps of DMs. The discovered latent basis enables unsupervised image editing capability through latent space traversal. We investigate the discovered structure from two perspectives. First, we examine how geometric structure evolves over diffusion timesteps. Through analysis, we show that 1) the model focuses on low-frequency components early in the generative process and attunes to high-frequency details later; 2) At early timesteps, different samples share similar tangent spaces; and 3) The simpler datasets that DMs trained on, the more consistent the tangent space for each timestep. Second, we investigate how the geometric structure changes based on text conditioning in Stable Diffusion. The results show that 1) similar prompts yield comparable tangent spaces; and 2) the model depends less on text conditions in later timesteps. To the best of our knowledge, this paper is the first to present image editing through $\mathbf{x}$-space traversal and provide thorough analyses of the latent structure of DMs.
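The pullback-metric analysis can be sketched as follows: treat an intermediate feature map of the denoising network as a function of the latent $\mathbf{x}_t$, take its Jacobian, and use the top right-singular vectors as a local latent basis for editing traversals. Here `feature_fn` (e.g., a U-Net bottleneck) and `x_t` are placeholders, and a full Jacobian is only practical for small latents; this illustrates the idea rather than the paper's implementation.

```python
import torch
from torch.autograd.functional import jacobian

def local_latent_basis(feature_fn, x_t, n_dirs=3):
    """Top editing directions in x-space from the SVD of the Jacobian dh/dx_t,
    where h = feature_fn(x_t) is an intermediate feature map of the diffusion model."""
    J = jacobian(lambda z: feature_fn(z).flatten(), x_t)   # (dim_H, *x_t.shape)
    J = J.reshape(J.shape[0], -1)                          # (dim_H, dim_X)
    _, _, Vh = torch.linalg.svd(J, full_matrices=False)    # right-singular vectors span the local basis
    return Vh[:n_dirs].reshape(n_dirs, *x_t.shape)         # directions to traverse in x-space
```

Traversing $\mathbf{x}_t$ along such directions and continuing the reverse diffusion is the kind of latent-space traversal the abstract refers to.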

Treatment Outcome Prediction for Intracerebral Hemorrhage via Generative Prognostic Model with Imaging and Tabular Data

  • paper_url: http://arxiv.org/abs/2307.12858
  • repo_url: https://github.com/med-air/top-gpm
  • paper_authors: Wenao Ma, Cheng Chen, Jill Abrigo, Calvin Hoi-Kwan Mak, Yuqi Gong, Nga Yan Chan, Chu Han, Zaiyi Liu, Qi Dou
  • for: Predicting treatment outcomes for intracerebral hemorrhage (ICH) from imaging and tabular data.
  • methods: Trains a generative prognostic model on observational data from non-randomized controlled trials, using a variational autoencoder to produce a low-dimensional prognostic score that addresses selection bias, together with a variational distributions combination module that fuses imaging data, non-imaging clinical data, and treatment assignment.
  • results: Extensive experiments on a real-world clinical ICH dataset show a substantial improvement in treatment outcome prediction over existing state-of-the-art approaches.
    Abstract Intracerebral hemorrhage (ICH) is the second most common and deadliest form of stroke. Despite medical advances, predicting treatment outcomes for ICH remains a challenge. This paper proposes a novel prognostic model that utilizes both imaging and tabular data to predict treatment outcome for ICH. Our model is trained on observational data collected from non-randomized controlled trials, providing reliable predictions of treatment success. Specifically, we propose to employ a variational autoencoder model to generate a low-dimensional prognostic score, which can effectively address the selection bias resulting from the non-randomized controlled trials. Importantly, we develop a variational distributions combination module that combines the information from imaging data, non-imaging clinical data, and treatment assignment to accurately generate the prognostic score. We conducted extensive experiments on a real-world clinical dataset of intracerebral hemorrhage. Our proposed method demonstrates a substantial improvement in treatment outcome prediction compared to existing state-of-the-art approaches. Code is available at https://github.com/med-air/TOP-GPM

Multiscale Video Pretraining for Long-Term Activity Forecasting

  • paper_url: http://arxiv.org/abs/2307.12854
  • repo_url: None
  • paper_authors: Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani
  • for: Long-term forecasting of human activities, with better generalization to unseen data than strongly supervised approaches.
  • methods: Proposes Multiscale Video Pretraining (MVP), a self-supervised pretraining approach that learns robust representations by predicting contextualized representations of future video clips over multiple timescales.
  • results: On the Ego4D and Epic-Kitchens-55/100 datasets, MVP outperforms state-of-the-art self-supervised video learning methods on long-term action anticipation and video summary prediction, with a relative gain of over 20% on video summary forecasting.
    Abstract Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issue, we propose Multiscale Video Pretraining (MVP), a novel self-supervised pretraining approach that learns robust representations for forecasting by learning to predict contextualized representations of future video clips over multiple timescales. MVP is based on our observation that actions in videos have a multiscale nature, where atomic actions typically occur at a short timescale and more complex actions may span longer timescales. We compare MVP to state-of-the-art self-supervised video learning approaches on downstream long-term forecasting tasks including long-term action anticipation and video summary prediction. Our comprehensive experiments across the Ego4D and Epic-Kitchens-55/100 datasets demonstrate that MVP out-performs state-of-the-art methods by significant margins. Notably, MVP obtains a relative performance gain of over 20% accuracy in video summary forecasting over existing methods.
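A rough sketch of the multiscale future-prediction objective described above; the cosine loss, mean-pooled targets, and the horizon-conditioned `predictor` head are assumptions for illustration, not MVP's exact formulation.

```python
import torch
import torch.nn.functional as F

def multiscale_prediction_loss(clip_feats, predictor, horizons=(1, 4, 16)):
    """clip_feats: (B, T, D) per-clip features from a video encoder.
    For each horizon h, predict an aggregate of the next h clip features from each
    current clip and penalize the mismatch, covering both short and long timescales."""
    B, T, D = clip_feats.shape
    loss, used = 0.0, 0
    for h in horizons:
        if T <= h:
            continue
        ctx = clip_feats[:, :T - h]                                   # clips that still have h future clips
        future = torch.stack([clip_feats[:, t + 1:t + 1 + h].mean(dim=1)
                              for t in range(T - h)], dim=1)          # (B, T-h, D) pooled future targets
        pred = predictor(ctx, h)                                      # horizon-conditioned prediction head
        loss = loss + (1 - F.cosine_similarity(pred, future.detach(), dim=-1)).mean()
        used += 1
    return loss / max(used, 1)
```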

Spatiotemporal Modeling Encounters 3D Medical Image Analysis: Slice-Shift UNet with Multi-View Fusion

  • paper_url: http://arxiv.org/abs/2307.12853
  • repo_url: None
  • paper_authors: C. I. Ugwu, S. Casarin, O. Lanz
  • for: Proposing a 2D-CNN-based method for 3D medical image segmentation that keeps the efficiency required for clinical practice while capturing volumetric context.
  • methods: Introduces Slice SHift UNet (SSH-UNet), which learns multi-view features by performing 2D convolutions along the three orthogonal planes of a volume with a weight-sharing mechanism, and reincorporates the third dimension by shifting a portion of the feature maps along the slice axis.
  • results: On the Multi-Modality Abdominal Multi-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) datasets, SSH-UNet is more efficient while on par in performance with state-of-the-art architectures.
    Abstract As a fundamental part of computational healthcare, Computer Tomography (CT) and Magnetic Resonance Imaging (MRI) provide volumetric data, making the development of algorithms for 3D image analysis a necessity. Despite being computationally cheap, 2D Convolutional Neural Networks can only extract spatial information. In contrast, 3D CNNs can extract three-dimensional features, but they have higher computational costs and latency, which is a limitation for clinical practice that requires fast and efficient models. Inspired by the field of video action recognition we propose a new 2D-based model dubbed Slice SHift UNet (SSH-UNet) which encodes three-dimensional features at 2D CNN's complexity. More precisely multi-view features are collaboratively learned by performing 2D convolutions along the three orthogonal planes of a volume and imposing a weights-sharing mechanism. The third dimension, which is neglected by the 2D convolution, is reincorporated by shifting a portion of the feature maps along the slices' axis. The effectiveness of our approach is validated in Multi-Modality Abdominal Multi-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) datasets, showing that SSH-UNet is more efficient while on par in performance with state-of-the-art architectures.
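The slice-shift operation can be illustrated in a few lines of PyTorch; the shifted channel fraction and the zero padding at the volume boundaries are illustrative choices in the spirit of temporal shift modules, not necessarily SSH-UNet's exact layer.

```python
import torch

def slice_shift(feat: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """feat: (B, C, D, H, W) feature volume processed slice-by-slice with 2D convolutions.
    Shift a fraction of the channels by +/-1 along the slice (depth) axis so each slice
    mixes information from its neighbours, reintroducing the third dimension at 2D cost."""
    B, C, D, H, W = feat.shape
    k = int(C * shift_ratio) // 2
    out = feat.clone()
    out[:, :k, 1:] = feat[:, :k, :-1]          # first k channels shifted forward along depth
    out[:, :k, :1] = 0
    out[:, k:2 * k, :-1] = feat[:, k:2 * k, 1:]  # next k channels shifted backward
    out[:, k:2 * k, -1:] = 0
    return out                                  # remaining channels stay in place
```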

Multi-View Vertebra Localization and Identification from CT Images

  • paper_url: http://arxiv.org/abs/2307.12845
  • repo_url: https://github.com/shanghaitech-impact/multi-view-vertebra-localization-and-identification-from-ct-images
  • paper_authors: Han Wu, Jiadong Zhang, Yu Fang, Zhentao Liu, Nizhuan Wang, Zhiming Cui, Dinggang Shen
  • for: Accurately localizing and identifying vertebrae in CT images to support various clinical applications.
  • methods: Converts the 3D problem into 2D localization and identification tasks on multiple views, avoiding cropped-patch operations so that multi-view global information is learned naturally; a multi-view contrastive learning strategy pre-trains the backbone, and a Sequence Loss preserves the sequential structure along the vertebrae.
  • results: With only two 2D networks, the method accurately localizes and identifies vertebrae in CT images and consistently outperforms state-of-the-art methods. Code: https://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images
    Abstract Accurately localizing and identifying vertebrae from CT images is crucial for various clinical applications. However, most existing efforts are performed on 3D with cropping patch operation, suffering from the large computation costs and limited global information. In this paper, we propose a multi-view vertebra localization and identification from CT images, converting the 3D problem into a 2D localization and identification task on different views. Without the limitation of the 3D cropped patch, our method can learn the multi-view global information naturally. Moreover, to better capture the anatomical structure information from different view perspectives, a multi-view contrastive learning strategy is developed to pre-train the backbone. Additionally, we further propose a Sequence Loss to maintain the sequential structure embedded along the vertebrae. Evaluation results demonstrate that, with only two 2D networks, our method can localize and identify vertebrae in CT images accurately, and outperforms the state-of-the-art methods consistently. Our code is available at https://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images.

Learning Provably Robust Estimators for Inverse Problems via Jittering

  • paper_url: http://arxiv.org/abs/2307.12822
  • repo_url: https://github.com/mli-lab/robust_reconstructors_via_jittering
  • paper_authors: Anselm Krainovic, Mahdi Soltanolkotabi, Reinhard Heckel
  • for: Investigating whether jittering, a simple regularization technique, can be used to train efficient worst-case robust deep neural networks for inverse problems.
  • methods: Adds isotropic Gaussian noise to the measurements during training (jittering) and analytically characterizes the optimal $\ell_2$-worst-case robust estimator for linear denoising.
  • results: Jittering yields optimal robust denoisers in the linear case; experiments training U-Nets for natural image denoising, deconvolution, and accelerated MRI show that jittering significantly enhances worst-case robustness but can be suboptimal for inverse problems beyond denoising, and that training on real data, which often contains slight noise, is somewhat robustness-enhancing.
    Abstract Deep neural networks provide excellent performance for inverse problems such as denoising. However, neural networks can be sensitive to adversarial or worst-case perturbations. This raises the question of whether such networks can be trained efficiently to be worst-case robust. In this paper, we investigate whether jittering, a simple regularization technique that adds isotropic Gaussian noise during training, is effective for learning worst-case robust estimators for inverse problems. While well studied for prediction in classification tasks, the effectiveness of jittering for inverse problems has not been systematically investigated. In this paper, we present a novel analytical characterization of the optimal $\ell_2$-worst-case robust estimator for linear denoising and show that jittering yields optimal robust denoisers. Furthermore, we examine jittering empirically via training deep neural networks (U-nets) for natural image denoising, deconvolution, and accelerated magnetic resonance imaging (MRI). The results show that jittering significantly enhances the worst-case robustness, but can be suboptimal for inverse problems beyond denoising. Moreover, our results imply that training on real data which often contains slight noise is somewhat robustness enhancing.
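The jittering recipe itself is simple: add isotropic Gaussian noise on top of the (already noisy) measurements during training. A minimal sketch, where the forward operator `A`, the noise levels, and the loss are placeholders.

```python
import torch

def jittering_step(model, A, x, loss_fn, sigma_z=0.05, sigma_w=0.1):
    """One training step for an inverse problem y = A(x) + z with jittering:
    extra Gaussian noise of standard deviation sigma_w is added on top of the
    measurement noise during training only, acting as a robustness regularizer."""
    y_clean = A(x)
    y = y_clean + sigma_z * torch.randn_like(y_clean)   # simulated measurement noise
    y = y + sigma_w * torch.randn_like(y_clean)         # jittering noise (training only)
    x_hat = model(y)
    return loss_fn(x_hat, x)
```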

Exposing the Troublemakers in Described Object Detection

  • paper_url: http://arxiv.org/abs/2307.12813
  • repo_url: https://github.com/shikras/d-cube
  • paper_authors: Chi Xie, Zhao Zhang, Yixuan Wu, Feng Zhu, Rui Zhao, Shuang Liang
  • for: Advancing Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC) to the more practical setting of Described Object Detection (DOD), where category names are expanded to flexible language expressions and all described objects must be detected.
  • methods: Constructs the Description Detection Dataset ($D^3$), featuring flexible language expressions and complete annotation of all described objects, evaluates previous SOTA methods on it, and proposes a baseline that reconstructs the training data and introduces a binary classification sub-task to substantially improve REC methods.
  • results: Existing REC methods struggle with confidence scores, rejecting negative instances, and multi-target scenarios on $D^3$; OVD methods are constrained by long and complex descriptions; recent bi-functional methods also underperform because of their separated training procedures and inference strategies. The proposed baseline largely outperforms existing methods on $D^3$.
    Abstract Detecting objects based on language descriptions is a popular task that includes Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC). In this paper, we advance them to a more practical setting called Described Object Detection (DOD) by expanding category names to flexible language expressions for OVD and overcoming the limitation of REC to only grounding the pre-existing object. We establish the research foundation for DOD tasks by constructing a Description Detection Dataset ($D^3$), featuring flexible language expressions and annotating all described objects without omission. By evaluating previous SOTA methods on $D^3$, we find some troublemakers that fail current REC, OVD, and bi-functional methods. REC methods struggle with confidence scores, rejecting negative instances, and multi-target scenarios, while OVD methods face constraints with long and complex descriptions. Recent bi-functional methods also do not work well on DOD due to their separated training procedures and inference strategies for REC and OVD tasks. Building upon the aforementioned findings, we propose a baseline that largely improves REC methods by reconstructing the training data and introducing a binary classification sub-task, outperforming existing methods. Data and code is available at https://github.com/shikras/d-cube.

Compact & Capable: Harnessing Graph Neural Networks and Edge Convolution for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.12790
  • repo_url: https://github.com/anonrepo-keeper/gcnn-ec
  • paper_authors: Aryan Singh, Pepijn Van de Ven, Ciarán Eising, Patrick Denny
  • for: Exploring graph-based learning for medical image classification, specifically Graph Neural Networks (GNNs) combined with edge convolution.
  • methods: Proposes a GNN model that combines graph convolution with edge convolution, leveraging the interconnectedness of RGB channel feature values to strongly represent connections between crucial graph nodes.
  • results: The Graph Convolutional Neural Network (GCNN) performs on par with pre-trained DNNs on the MedMNIST dataset while using 1000 times fewer parameters, reducing training time and data requirements.
    Abstract Graph-based neural network models are gaining traction in the field of representation learning due to their ability to uncover latent topological relationships between entities that are otherwise challenging to identify. These models have been employed across a diverse range of domains, encompassing drug discovery, protein interactions, semantic segmentation, and fluid dynamics research. In this study, we investigate the potential of Graph Neural Networks (GNNs) for medical image classification. We introduce a novel model that combines GNNs and edge convolution, leveraging the interconnectedness of RGB channel feature values to strongly represent connections between crucial graph nodes. Our proposed model not only performs on par with state-of-the-art Deep Neural Networks (DNNs) but does so with 1000 times fewer parameters, resulting in reduced training time and data requirements. We compare our Graph Convolutional Neural Network (GCNN) to pre-trained DNNs for classifying MedMNIST dataset classes, revealing promising prospects for GNNs in medical image analysis. Our results also encourage further exploration of advanced graph-based models such as Graph Attention Networks (GAT) and Graph Auto-Encoders in the medical imaging domain. The proposed model yields more reliable, interpretable, and accurate outcomes for tasks like semantic segmentation and image classification compared to simpler GCNNs
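A self-contained edge-convolution layer over a kNN graph, roughly the style of operation this work builds on; the kNN construction, the single-linear-layer MLP, and max aggregation are assumptions for illustration rather than the authors' exact GCNN block.

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """Minimal edge convolution: each node aggregates an MLP of [x_i, x_j - x_i]
    over its k nearest neighbours with max pooling."""
    def __init__(self, in_dim, out_dim, k=8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):                       # x: (N, in_dim) node features
        dist = torch.cdist(x, x)                # (N, N) pairwise distances
        idx = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # k nearest, self excluded
        neighbours = x[idx]                     # (N, k, in_dim)
        center = x.unsqueeze(1).expand_as(neighbours)
        edge_feat = torch.cat([center, neighbours - center], dim=-1)
        return self.mlp(edge_feat).max(dim=1).values                # (N, out_dim)
```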

Fast Full-frame Video Stabilization with Iterative Optimization

  • paper_url: http://arxiv.org/abs/2307.12774
  • repo_url: None
  • paper_authors: Weiyue Zhao, Xin Li, Zhan Peng, Xianrui Luo, Xinyi Ye, Hao Lu, Zhiguo Cao
  • for: A fast full-frame video stabilization method that strikes a good trade-off between visual quality and computational speed via iterative optimization.
  • methods: Trains on synthetic datasets with a two-level (coarse-to-fine) stabilization algorithm based on a probabilistic flow field, exploits the optical-flow confidence map to guide the search for shared regions through backpropagation, and uses a multiframe fusion strategy to render full-frame stabilized views.
  • results: Delivers efficient, visually pleasing stabilization; experiments demonstrate superiority over prior work in both computational speed and visual quality.
    Abstract Video stabilization refers to the problem of transforming a shaky video into a visually pleasing one. The question of how to strike a good trade-off between visual quality and computational speed has remained one of the open challenges in video stabilization. Inspired by the analogy between wobbly frames and jigsaw puzzles, we propose an iterative optimization-based learning approach using synthetic datasets for video stabilization, which consists of two interacting submodules: motion trajectory smoothing and full-frame outpainting. First, we develop a two-level (coarse-to-fine) stabilizing algorithm based on the probabilistic flow field. The confidence map associated with the estimated optical flow is exploited to guide the search for shared regions through backpropagation. Second, we take a divide-and-conquer approach and propose a novel multiframe fusion strategy to render full-frame stabilized views. An important new insight brought about by our iterative optimization approach is that the target video can be interpreted as the fixed point of nonlinear mapping for video stabilization. We formulate video stabilization as a problem of minimizing the amount of jerkiness in motion trajectories, which guarantees convergence with the help of fixed-point theory. Extensive experimental results are reported to demonstrate the superiority of the proposed approach in terms of computational speed and visual quality. The code will be available on GitHub.
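The fixed-point view of stabilization can be illustrated with a toy trajectory smoother that repeatedly applies an update whose fixed point balances fidelity to the raw camera path against low jerkiness; `lam` and the simple neighbour average are illustrative, not the paper's actual two-level algorithm.

```python
import numpy as np

def smooth_trajectory(traj, lam=5.0, n_iters=200):
    """traj: (T, D) raw per-frame motion parameters. Each iteration pulls the
    smoothed path toward the average of its temporal neighbours while staying
    close to the raw path; the result is the fixed point of the update."""
    smooth = traj.astype(float).copy()
    for _ in range(n_iters):
        neighbour_avg = smooth.copy()
        neighbour_avg[1:-1] = 0.5 * (smooth[:-2] + smooth[2:])   # average of adjacent frames
        smooth = (traj + lam * neighbour_avg) / (1.0 + lam)      # fixed-point update
    return smooth
```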

LiDAR Meta Depth Completion

  • paper_url: http://arxiv.org/abs/2307.12761
  • repo_url: https://github.com/wbkit/reslan
  • paper_authors: Wolfgang Boettcher, Lukas Hoyer, Ozan Unal, Ke Li, Dengxin Dai
  • for: Improving depth estimation for mobile autonomous systems via LiDAR depth completion.
  • methods: A meta depth completion network that uses data patterns to learn a task network which alters the weights of the main depth completion network, dynamically adapting it to the LiDAR scanning pattern in use.
  • results: A single model works across multiple LiDAR scanning patterns and generalizes to patterns unseen during training, yielding significantly better results than a non-adaptive baseline and outperforming LiDAR-specific expert models in very sparse cases.
    Abstract Depth estimation is one of the essential tasks to be addressed when creating mobile autonomous systems. While monocular depth estimation methods have improved in recent times, depth completion provides more accurate and reliable depth maps by additionally using sparse depth information from other sensors such as LiDAR. However, current methods are specifically trained for a single LiDAR sensor. As the scanning pattern differs between sensors, every new sensor would require re-training a specialized depth completion model, which is computationally inefficient and not flexible. Therefore, we propose to dynamically adapt the depth completion model to the used sensor type enabling LiDAR adaptive depth completion. Specifically, we propose a meta depth completion network that uses data patterns derived from the data to learn a task network to alter weights of the main depth completion network to solve a given depth completion task effectively. The method demonstrates a strong capability to work on multiple LiDAR scanning patterns and can also generalize to scanning patterns that are unseen during training. While using a single model, our method yields significantly better results than a non-adaptive baseline trained on different LiDAR patterns. It outperforms LiDAR-specific expert models for very sparse cases. These advantages allow flexible deployment of a single depth completion model on different sensors, which could also prove valuable to process the input of nascent LiDAR technology with adaptive instead of fixed scanning patterns.

ICF-SRSR: Invertible scale-Conditional Function for Self-Supervised Real-world Single Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.12751
  • repo_url: None
  • paper_authors: Reyhaneh Neshatavar, Mohsen Yavartanoo, Sanghyun Son, Kyoung Mu Lee
  • for: Single image super-resolution (SISR), i.e., up-sampling a low-resolution (LR) image to its high-resolution (HR) counterpart, in real-world settings without paired training data.
  • methods: Proposes an Invertible scale-Conditional Function (ICF) that scales an input image and restores the original input under different scale conditions, and builds on it a self-supervised SISR framework (ICF-SRSR) that handles real-world SR without any paired or unpaired training data.
  • results: ICF-SRSR handles SISR in a fully self-supervised manner, outperforms methods trained on synthetic paired images in real-world scenarios, and generates realistic LR-HR pairs that make existing supervised SISR networks more robust.
    Abstract Single image super-resolution (SISR) is a challenging ill-posed problem that aims to up-sample a given low-resolution (LR) image to a high-resolution (HR) counterpart. Due to the difficulty in obtaining real LR-HR training pairs, recent approaches are trained on simulated LR images degraded by simplified down-sampling operators, e.g., bicubic. Such an approach can be problematic in practice because of the large gap between the synthesized and real-world LR images. To alleviate the issue, we propose a novel Invertible scale-Conditional Function (ICF), which can scale an input image and then restore the original input with different scale conditions. By leveraging the proposed ICF, we construct a novel self-supervised SISR framework (ICF-SRSR) to handle the real-world SR task without using any paired/unpaired training data. Furthermore, our ICF-SRSR can generate realistic and feasible LR-HR pairs, which can make existing supervised SISR networks more robust. Extensive experiments demonstrate the effectiveness of the proposed method in handling SISR in a fully self-supervised manner. Our ICF-SRSR demonstrates superior performance compared to the existing methods trained on synthetic paired images in real-world scenarios and exhibits comparable performance compared to state-of-the-art supervised/unsupervised methods on public benchmark datasets.

CLIP-KD: An Empirical Study of Distilling CLIP Models

  • paper_url: http://arxiv.org/abs/2307.12732
  • repo_url: https://github.com/winycg/CLIP-KD
  • paper_authors: Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Yongjun Xu
  • for: Distilling small CLIP models supervised by a large teacher CLIP model.
  • methods: Examines several distillation strategies, including relation-, feature-, gradient- and contrastive-based paradigms, to assess their impact on CLIP distillation.
  • results: The simplest feature mimicry with an MSE loss works best, and interactive contrastive learning and relation-based distillation also help; applied to several student networks trained on 15 million (image, text) pairs, distillation consistently improves zero-shot ImageNet classification and cross-modal retrieval.
    Abstract CLIP has become a promising language-supervised visual pre-training framework and achieves excellent performance over a wide range of tasks. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigm, to examine the impact on CLIP distillation. We show that the simplest feature mimicry with MSE loss performs best. Moreover, interactive contrastive learning and relation-based distillation are also critical in performance improvement. We apply the unified method to distill several student networks trained on 15 million (image, text) pairs. Distillation improves the student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. We hope our empirical study will become an important baseline for future CLIP distillation research. The code is available at \url{https://github.com/winycg/CLIP-KD}.
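The best-performing strategy the authors report, plain feature mimicry with an MSE loss, fits in a few lines; the projection layer and the 512/768 dimensions below are assumptions for illustration, not CLIP-KD's exact heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMimicryKD(nn.Module):
    """Student features are projected to the teacher's embedding dimension and
    matched with an MSE loss against the (frozen) teacher features."""
    def __init__(self, student_dim=512, teacher_dim=768):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat, teacher_feat):
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())
```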

COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts

  • paper_url: http://arxiv.org/abs/2307.12730
  • repo_url: https://github.com/alibaba/easyrobust
  • paper_authors: Xiaofeng Mao, Yuefeng Chen, Yao Zhu, Da Chen, Hang Su, Rong Zhang, Hui Xue
  • for: Providing a comprehensive assessment of object detectors' robustness under natural distribution shifts, and introducing COCO-O, a test dataset for benchmarking out-of-distribution (OOD) robustness.
  • methods: Builds COCO-O on top of COCO with six types of natural distribution shifts, evaluates the OOD robustness of more than 100 modern object detectors, and studies how the backbone, detection head, augmentation, and pre-training choices affect robustness.
  • results: COCO-O causes a 55.7% relative performance drop for a Faster R-CNN detector; most classic detectors show weak OOD generalization; the backbone is the most important part for robustness, an end-to-end detection transformer design brings no enhancement and may even reduce robustness, and large-scale foundation models have made a great leap on robust object detection. Dataset: https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o
    Abstract Practical object detection application can lose its effectiveness on image inputs with natural distribution shifts. This problem leads the research community to pay more attention on the robustness of detectors under Out-Of-Distribution (OOD) inputs. Existing works construct datasets to benchmark the detector's OOD robustness for a specific application scenario, e.g., Autonomous Driving. However, these datasets lack universality and are hard to benchmark general detectors built on common tasks such as COCO. To give a more comprehensive robustness assessment, we introduce COCO-O(ut-of-distribution), a test dataset based on COCO with 6 types of natural distribution shifts. COCO-O has a large distribution gap with training data and results in a significant 55.7% relative performance drop on a Faster R-CNN detector. We leverage COCO-O to conduct experiments on more than 100 modern object detectors to investigate if their improvements are credible or just over-fitting to the COCO test set. Unfortunately, most classic detectors in early years do not exhibit strong OOD generalization. We further study the robustness effect on recent breakthroughs of detector's architecture design, augmentation and pre-training techniques. Some empirical findings are revealed: 1) Compared with detection head or neck, backbone is the most important part for robustness; 2) An end-to-end detection transformer design brings no enhancement, and may even reduce robustness; 3) Large-scale foundation models have made a great leap on robust object detection. We hope our COCO-O could provide a rich testbed for robustness study of object detection. The dataset will be available at https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o.

Persistent-Transient Duality: A Multi-mechanism Approach for Modeling Human-Object Interaction

  • paper_url: http://arxiv.org/abs/2307.12729
  • repo_url: None
  • paper_authors: Hung Tran, Vuong Le, Svetha Venkatesh, Truyen Tran
  • for: Modeling the multi-mechanism nature of human behavior in human-object interaction (HOI) activities through a Persistent-Transient Duality.
  • methods: A parent-child neural network with a Persistent channel running continually on the global scale and Transient channels operating intermittently on the local context, plus a dedicated neural module for dynamic mechanism switching.
  • results: On two rich datasets and a wide variety of settings, the model consistently delivers superior performance in HOI motion forecasting, proving its suitability for the task.
    Abstract Humans are highly adaptable, swiftly switching between different modes to progressively handle different tasks, situations and contexts. In Human-object interaction (HOI) activities, these modes can be attributed to two mechanisms: (1) the large-scale consistent plan for the whole activity and (2) the small-scale children interactive actions that start and end along the timeline. While neuroscience and cognitive science have confirmed this multi-mechanism nature of human behavior, machine modeling approaches for human motion are trailing behind. While attempted to use gradually morphing structures (e.g., graph attention networks) to model the dynamic HOI patterns, they miss the expeditious and discrete mode-switching nature of the human motion. To bridge that gap, this work proposes to model two concurrent mechanisms that jointly control human motion: the Persistent process that runs continually on the global scale, and the Transient sub-processes that operate intermittently on the local context of the human while interacting with objects. These two mechanisms form an interactive Persistent-Transient Duality that synergistically governs the activity sequences. We model this conceptual duality by a parent-child neural network of Persistent and Transient channels with a dedicated neural module for dynamic mechanism switching. The framework is trialed on HOI motion forecasting. On two rich datasets and a wide variety of settings, the model consistently delivers superior performances, proving its suitability for the challenge.

AMAE: Adaptation of Pre-Trained Masked Autoencoder for Dual-Distribution Anomaly Detection in Chest X-Rays

  • paper_url: http://arxiv.org/abs/2307.12721
  • repo_url: None
  • paper_authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Jean-Philippe Thiran
  • for: Unsupervised anomaly detection in medical images, particularly chest X-rays, using both normal and unlabeled training images (dual-distribution setting).
  • methods: Proposes AMAE, a two-stage adaptation of a pre-trained masked autoencoder (MAE): it first synthesizes anomalies from normal training images and trains a lightweight classifier on frozen transformer features, then assigns pseudo-labels to unlabeled images and uses two separate MAE-based modules to model the normal and anomalous distributions.
  • results: Evaluated under different anomaly ratios in the unlabeled training set, AMAE yields consistent gains over competing self-supervised and dual-distribution anomaly detection methods, setting a new state of the art on three public chest X-ray benchmarks: RSNA, NIH-CXR, and VinDr-CXR.
    Abstract Unsupervised anomaly detection in medical images such as chest radiographs is stepping into the spotlight as it mitigates the scarcity of the labor-intensive and costly expert annotation of anomaly data. However, nearly all existing methods are formulated as a one-class classification trained only on representations from the normal class and discard a potentially significant portion of the unlabeled data. This paper focuses on a more practical setting, dual distribution anomaly detection for chest X-rays, using the entire training data, including both normal and unlabeled images. Inspired by a modern self-supervised vision transformer model trained using partial image inputs to reconstruct missing image regions -- we propose AMAE, a two-stage algorithm for adaptation of the pre-trained masked autoencoder (MAE). Starting from MAE initialization, AMAE first creates synthetic anomalies from only normal training images and trains a lightweight classifier on frozen transformer features. Subsequently, we propose an adaptation strategy to leverage unlabeled images containing anomalies. The adaptation scheme is accomplished by assigning pseudo-labels to unlabeled images and using two separate MAE based modules to model the normative and anomalous distributions of pseudo-labeled images. The effectiveness of the proposed adaptation strategy is evaluated with different anomaly ratios in an unlabeled training set. AMAE leads to consistent performance gains over competing self-supervised and dual distribution anomaly detection methods, setting the new state-of-the-art on three public chest X-ray benchmarks: RSNA, NIH-CXR, and VinDr-CXR.
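The pseudo-labeling step of the adaptation stage can be caricatured as thresholding a stage-1 anomaly score over the unlabeled set; the quantile threshold below is an assumption for illustration, not AMAE's actual scheme.

```python
import torch

@torch.no_grad()
def pseudo_label_unlabeled(scores: torch.Tensor, anomaly_quantile: float = 0.9):
    """scores: (N,) anomaly scores for the unlabeled images, e.g. from a classifier
    trained on synthetic anomalies. Images above a score quantile are treated as
    pseudo-abnormal, the rest as pseudo-normal, so that two separate modules can
    then model the two distributions."""
    threshold = torch.quantile(scores, anomaly_quantile)
    return (scores > threshold).long()        # 1 = pseudo-abnormal, 0 = pseudo-normal
```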

CarPatch: A Synthetic Benchmark for Radiance Field Evaluation on Vehicle Components

  • paper_url: http://arxiv.org/abs/2307.12718
  • repo_url: None
  • paper_authors: Davide Di Nucci, Alessandro Simoni, Matteo Tomei, Luca Ciuffreda, Roberto Vezzani, Rita Cucchiara
  • for: Providing a synthetic benchmark dataset for radiance-field evaluation on vehicle components, to be used for assessing and comparing different techniques in vehicle inspection scenarios.
  • methods: Releases CarPatch, a set of vehicle views annotated with intrinsic and extrinsic camera parameters, together with depth maps and semantic segmentation masks for each view, and defines global and part-based metrics to evaluate NeRF-based reconstructions.
  • results: A publicly available benchmark (https://aimagelab.ing.unimore.it/go/carpatch) that serves as an evaluation guide and as a baseline for future work on this challenging topic.
    Abstract Neural Radiance Fields (NeRFs) have gained widespread recognition as a highly effective technique for representing 3D reconstructions of objects and scenes derived from sets of images. Despite their efficiency, NeRF models can pose challenges in certain scenarios such as vehicle inspection, where the lack of sufficient data or the presence of challenging elements (e.g. reflections) strongly impact the accuracy of the reconstruction. To this aim, we introduce CarPatch, a novel synthetic benchmark of vehicles. In addition to a set of images annotated with their intrinsic and extrinsic camera parameters, the corresponding depth maps and semantic segmentation masks have been generated for each view. Global and part-based metrics have been defined and used to evaluate, compare, and better characterize some state-of-the-art techniques. The dataset is publicly released at https://aimagelab.ing.unimore.it/go/carpatch and can be used as an evaluation guide and as a baseline for future work on this challenging topic.

Dense Transformer based Enhanced Coding Network for Unsupervised Metal Artifact Reduction

  • paper_url: http://arxiv.org/abs/2307.12717
  • repo_url: None
  • paper_authors: Wangduo Xie, Matthew B. Blaschko
  • for: Unsupervised metal artifact reduction for CT images, where paired training data is hard to collect in clinical settings.
  • methods: Proposes a Dense Transformer based Enhanced Coding Network (DTEC-Net) with a Hierarchical Disentangling Encoder, supported by a high-order dense process and a transformer, to obtain densely encoded sequences with long-range correspondence, plus a second-order disentanglement method to improve the decoding of those dense sequences.
  • results: Experiments and model discussions show that DTEC-Net outperforms previous state-of-the-art methods on a benchmark dataset, greatly reducing metal artifacts while restoring richer texture details.
    Abstract CT images corrupted by metal artifacts have serious negative effects on clinical diagnosis. Considering the difficulty of collecting paired data with ground truth in clinical settings, unsupervised methods for metal artifact reduction are of high interest. However, it is difficult for previous unsupervised methods to retain structural information from CT images while handling the non-local characteristics of metal artifacts. To address these challenges, we proposed a novel Dense Transformer based Enhanced Coding Network (DTEC-Net) for unsupervised metal artifact reduction. Specifically, we introduce a Hierarchical Disentangling Encoder, supported by the high-order dense process, and transformer to obtain densely encoded sequences with long-range correspondence. Then, we present a second-order disentanglement method to improve the dense sequence's decoding process. Extensive experiments and model discussions illustrate DTEC-Net's effectiveness, which outperforms the previous state-of-the-art methods on a benchmark dataset, and greatly reduces metal artifacts while restoring richer texture details.

Damage Vision Mining Opportunity for Imbalanced Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.12676
  • repo_url: None
  • paper_authors: Takato Yasuno
  • for: Examining the imbalanced-data problem in damage vision mining for industrial applications of deep learning, such as predictive maintenance and proactive repair.
  • methods: Categorizes imbalanced-data problems into four types (missing target/label ranges, majority-minority class imbalance, foreground-background spatial imbalance, long-tailed pixel-wise imbalance) and studies one-class anomaly detection on imbalanced vision datasets including blood smear, lung infection, hazardous driving, wood and concrete deterioration, river sludge, and disaster damage.
  • results: Compared with the balanced case of positive ratio 1/1, there is an applicable range of positive ratios where anomaly detection accuracy stays consistently high, supporting the hypothesis that a more effective positive ratio yields a higher accuracy gain.
    Abstract In past decade, previous balanced datasets have been used to advance algorithms for classification, object detection, semantic segmentation, and anomaly detection in industrial applications. Specifically, for condition-based maintenance, automating visual inspection is crucial to ensure high quality. Deterioration prognostic attempts to optimize the fine decision process for predictive maintenance and proactive repair. In civil infrastructure and living environment, damage data mining cannot avoid the imbalanced data issue because of rare unseen events and high quality status by improved operations. For visual inspection, deteriorated class acquired from the surface of concrete and steel components are occasionally imbalanced. From numerous related surveys, we summarize that imbalanced data problems can be categorized into four types; 1) missing range of target and label valuables, 2) majority-minority class imbalance, 3) foreground-background of spatial imbalance, 4) long-tailed class of pixel-wise imbalance. Since 2015, there has been many imbalanced studies using deep learning approaches that includes regression, image classification, object detection, semantic segmentation. However, anomaly detection for imbalanced data is not yet well known. In the study, we highlight one-class anomaly detection application whether anomalous class or not, and demonstrate clear examples on imbalanced vision datasets: blood smear, lung infection, hazardous driving, wooden, concrete deterioration, river sludge, and disaster damage. Illustrated in Fig.1, we provide key results on damage vision mining advantage, hypothesizing that the more effective range of positive ratio, the higher accuracy gain of anomaly detection application. In our imbalanced studies, compared with the balanced case of positive ratio 1/1, we find that there is applicable positive ratio, where the accuracy are consistently high.

Industrial Segment Anything – a Case Study in Aircraft Manufacturing, Intralogistics, Maintenance, Repair, and Overhaul

  • paper_url: http://arxiv.org/abs/2307.12674
  • repo_url: None
  • paper_authors: Keno Moenck, Arne Wendt, Philipp Prünte, Julian Koch, Arne Sahrhage, Johann Gierecker, Ole Schmedemann, Falko Kähler, Dirk Holst, Martin Gomse, Thorsten Schüppstuhl, Daniel Schoepflin
  • for: Addressing the training data availability problem when deploying deep learning applications in the specialized domain of aircraft production.
  • methods: Exploits the zero-shot capabilities of Vision Foundation Models (VFMs), specifically the Segment Anything Model (SAM), to cope with the variety of data, contexts, and sensors.
  • results: Surveys SAM applications across manufacturing, intralogistics, and maintenance, repair, and overhaul use cases, also representative of neighboring industrial domains, and discusses the injection of domain knowledge.
    Abstract Deploying deep learning-based applications in specialized domains like the aircraft production industry typically suffers from the training data availability problem. Only a few datasets represent non-everyday objects, situations, and tasks. Recent advances in research around Vision Foundation Models (VFM) have opened a new area of tasks and models with high generalization capabilities in non-semantic and semantic predictions. As recently demonstrated by the Segment Anything Project, exploiting VFM's zero-shot capabilities is a promising direction in tackling the boundaries spanned by data, context, and sensor variety. However, investigating its application within specific domains is subject to ongoing research. This paper contributes here by surveying applications of the SAM in aircraft production-specific use cases. We include manufacturing, intralogistics, as well as maintenance, repair, and overhaul processes, also representing a variety of other neighboring industrial domains. Besides presenting the various use cases, we further discuss the injection of domain knowledge.
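For the zero-shot use cases discussed here, the publicly released segment-anything package can be used roughly as below; the checkpoint file name and image path are placeholders, and filtering of the returned masks with domain knowledge is left to the application.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

image = cv2.cvtColor(cv2.imread("aircraft_part.jpg"), cv2.COLOR_BGR2RGB)   # HxWx3 RGB uint8
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

masks = mask_generator.generate(image)   # one dict per proposed mask
# each entry carries a binary 'segmentation' map plus scores such as 'predicted_iou'
# and 'stability_score' that can be filtered with domain-specific rules
largest = max(masks, key=lambda m: m["area"])
```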

Global k-Space Interpolation for Dynamic MRI Reconstruction using Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2307.12672
  • repo_url: None
  • paper_authors: Jiazhen Pan, Suprosanna Shit, Özgün Turgut, Wenqi Huang, Hongwei Bran Li, Nil Stolt-Ansó, Thomas Küstner, Kerstin Hammernik, Daniel Rueckert
  • for: This paper focuses on improving dynamic magnetic resonance imaging (MRI) reconstruction by interpolating undersampled k-space data before obtaining images with Fourier transform.
  • methods: The proposed approach uses a Transformer-based k-space Global Interpolation Network (k-GIN) to learn global dependencies among low- and high-frequency components of 2D+t k-space, and a novel k-space Iterative Refinement Module (k-IRM) to enhance high-frequency components learning.
  • results: The proposed method outperforms baseline methods in terms of both quantitative and qualitative measures, and achieves higher robustness and generalizability in highly-undersampled MR data.
    Abstract In dynamic Magnetic Resonance Imaging (MRI), k-space is typically undersampled due to limited scan time, resulting in aliasing artifacts in the image domain. Hence, dynamic MR reconstruction requires not only modeling spatial frequency components in the x and y directions of k-space but also considering temporal redundancy. Most previous works rely on image-domain regularizers (priors) to conduct MR reconstruction. In contrast, we focus on interpolating the undersampled k-space before obtaining images with Fourier transform. In this work, we connect masked image modeling with k-space interpolation and propose a novel Transformer-based k-space Global Interpolation Network, termed k-GIN. Our k-GIN learns global dependencies among low- and high-frequency components of 2D+t k-space and uses it to interpolate unsampled data. Further, we propose a novel k-space Iterative Refinement Module (k-IRM) to enhance the high-frequency components learning. We evaluate our approach on 92 in-house 2D+t cardiac MR subjects and compare it to MR reconstruction methods with image-domain regularizers. Experiments show that our proposed k-space interpolation method quantitatively and qualitatively outperforms baseline methods. Importantly, the proposed approach achieves substantially higher robustness and generalizability in cases of highly-undersampled MR data.
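The masked-modeling flavour of the interpolation task can be sketched as follows; the Cartesian line mask, the real/imaginary stacking, and the plain MSE on missing lines are simplifying assumptions, not k-GIN's transformer architecture or training details.

```python
import torch

def masked_kspace_step(kspace, model, accel=4):
    """kspace: complex tensor (B, T, H, W) of fully sampled 2D+t k-space.
    A random subset of phase-encoding lines is kept (roughly 1/accel of them),
    the network predicts the full k-space from the undersampled input, and the
    loss is evaluated on the missing lines only."""
    B, T, H, W = kspace.shape
    keep = torch.rand(B, 1, 1, W, device=kspace.device) < 1.0 / accel
    inputs = torch.view_as_real(kspace * keep.to(kspace.dtype))   # (B, T, H, W, 2)
    target = torch.view_as_real(kspace)
    pred = model(inputs)                                          # same shape as target
    miss = (~keep).unsqueeze(-1).expand_as(target)                # loss only on masked lines
    return ((pred - target)[miss] ** 2).mean()
```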

A Theoretically Guaranteed Quaternion Weighted Schatten p-norm Minimization Method for Color Image Restoration

  • paper_url: http://arxiv.org/abs/2307.12656
  • repo_url: https://github.com/qiuxuanzhizi/qwsnm
  • paper_authors: Qing-Hua Zhang, Liang-Tian He, Yi-Lun Wang, Liang-Jian Deng, Jun Liu
  • for: 这篇论文主要针对彩色图像修复(CIR)问题,研究如何在加权核范数最小化(WNNM)和加权 Schatten $p$-范数最小化(WSNM)框架下求解 CIR。
  • methods: 论文提出了一种基于四元数的 WSNM 模型(QWSNM):与已有的四元数 WNNM 方法(QWNNM)一样,它将彩色图像作为整体在四元数域中表示,从而保留三个颜色通道之间的内在相关性。
  • results: 论文在彩色图像去噪和去模糊两类 CIR 任务上进行了广泛实验,表明 QWSNM 在定量与定性评价上均优于许多现有方法;此外,论文还给出了初步的理论收敛分析,证明 QWNNM 与 QWSNM 的求解序列均具有不动点收敛保证。
    Abstract Inspired by the fact that the matrix formulated by nonlocal similar patches in a natural image is of low rank, the rank approximation issue have been extensively investigated over the past decades, among which weighted nuclear norm minimization (WNNM) and weighted Schatten $p$-norm minimization (WSNM) are two prevailing methods have shown great superiority in various image restoration (IR) problems. Due to the physical characteristic of color images, color image restoration (CIR) is often a much more difficult task than its grayscale image counterpart. However, when applied to CIR, the traditional WNNM/WSNM method only processes three color channels individually and fails to consider their cross-channel correlations. Very recently, a quaternion-based WNNM approach (QWNNM) has been developed to mitigate this issue, which is capable of representing the color image as a whole in the quaternion domain and preserving the inherent correlation among the three color channels. Despite its empirical success, unfortunately, the convergence behavior of QWNNM has not been strictly studied yet. In this paper, on the one side, we extend the WSNM into quaternion domain and correspondingly propose a novel quaternion-based WSNM model (QWSNM) for tackling the CIR problems. Extensive experiments on two representative CIR tasks, including color image denoising and deblurring, demonstrate that the proposed QWSNM method performs favorably against many state-of-the-art alternatives, in both quantitative and qualitative evaluations. On the other side, more importantly, we preliminarily provide a theoretical convergence analysis, that is, by modifying the quaternion alternating direction method of multipliers (QADMM) through a simple continuation strategy, we theoretically prove that both the solution sequences generated by the QWNNM and QWSNM have fixed-point convergence guarantees.
    摘要 受自然图像中由非局部相似图块构成的矩阵具有低秩这一事实的启发,秩逼近问题在过去几十年中得到了广泛研究,其中加权核范数最小化(WNNM)和加权 Schatten $p$-范数最小化(WSNM)是两种主流方法,在各类图像修复(IR)问题中表现出显著优势。由于彩色图像的物理特性,彩色图像修复(CIR)往往比灰度图像修复困难得多。然而,传统的 WNNM/WSNM 方法在应用于 CIR 时,只是对三个颜色通道分别处理,未能考虑通道之间的相关性。最近,一种基于四元数的 WNNM 方法(QWNNM)被提出以缓解这一问题,它能够在四元数域中将彩色图像作为整体表示,并保留三个颜色通道之间的内在相关性。尽管 QWNNM 在实验上取得了成功,其收敛行为尚未得到严格研究。在本文中,一方面,我们将 WSNM 推广到四元数域,相应地提出了一种新的基于四元数的 WSNM 模型(QWSNM)来处理 CIR 问题;在彩色图像去噪和去模糊两个代表性 CIR 任务上的大量实验表明,所提出的 QWSNM 方法在定量和定性评价上均优于许多最先进的替代方法。另一方面,更重要的是,我们初步给出了理论收敛分析:通过一种简单的延拓策略对四元数交替方向乘子法(QADMM)进行修改,我们从理论上证明了 QWNNM 与 QWSNM 产生的解序列都具有不动点收敛保证。
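
For reference, a sketch of the weighted Schatten $p$-norm regularized objective that WSNM-type methods minimize, written for an ordinary real matrix $X$ with observation $Y$; QWSNM applies the quaternion analogue of this term to the quaternion matrix assembled from the three color channels, so the expression below is an illustration rather than the paper's exact model.

```latex
% Weighted Schatten p-norm of X with singular values \sigma_i(X) and
% non-negative weights w_i (real-valued sketch; QWSNM uses the quaternion analogue):
\|X\|_{w,S_p}^p = \sum_{i} w_i \, \sigma_i(X)^p ,
\qquad
\min_{X} \; \frac{1}{2}\|Y - X\|_F^2 + \lambda \, \|X\|_{w,S_p}^p .
```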

PG-RCNN: Semantic Surface Point Generation for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.12637
  • repo_url: https://github.com/quotation2520/pg-rcnn
  • paper_authors: Inyong Koo, Inyoung Lee, Se-Ho Kim, Hee-Seon Kim, Woo-jin Jeon, Changick Kim
  • for: 该论文旨在解决基于 LiDAR 的三维物体检测中,因距离远和遮挡导致物体空间信息不完整的问题。
  • methods: 论文提出了 Point Generation R-CNN(PG-RCNN),一种端到端检测器,通过联合训练的 RoI 点生成模块利用 RoI 的上下文信息,估计前景物体的完整形状与位移。
  • results: PG-RCNN 能为前景物体生成带有前景概率语义特征的表面点,在 KITTI 基准上取得了有竞争力的性能,且参数量显著少于最先进模型。
    Abstract One of the main challenges in LiDAR-based 3D object detection is that the sensors often fail to capture the complete spatial information about the objects due to long distance and occlusion. Two-stage detectors with point cloud completion approaches tackle this problem by adding more points to the regions of interest (RoIs) with a pre-trained network. However, these methods generate dense point clouds of objects for all region proposals, assuming that objects always exist in the RoIs. This leads to the indiscriminate point generation for incorrect proposals as well. Motivated by this, we propose Point Generation R-CNN (PG-RCNN), a novel end-to-end detector that generates semantic surface points of foreground objects for accurate detection. Our method uses a jointly trained RoI point generation module to process the contextual information of RoIs and estimate the complete shape and displacement of foreground objects. For every generated point, PG-RCNN assigns a semantic feature that indicates the estimated foreground probability. Extensive experiments show that the point clouds generated by our method provide geometrically and semantically rich information for refining false positive and misaligned proposals. PG-RCNN achieves competitive performance on the KITTI benchmark, with significantly fewer parameters than state-of-the-art models. The code is available at https://github.com/quotation2520/PG-RCNN.
    摘要 基于 LiDAR 的三维物体检测面临的主要挑战之一,是由于距离较远和遮挡,传感器往往无法捕捉物体完整的空间信息。采用点云补全的两阶段检测器通过预训练网络向感兴趣区域(RoI)中添加更多的点来应对这一问题,但这些方法会为所有候选区域生成稠密的物体点云,默认物体总是存在于 RoI 中,从而对错误的候选区域也进行了无差别的点生成。受此启发,我们提出了 Point Generation R-CNN(PG-RCNN),一种新的端到端检测器,为前景物体生成语义表面点以实现精确检测。我们的方法利用联合训练的 RoI 点生成模块处理 RoI 的上下文信息,估计前景物体的完整形状与位移;对于每个生成的点,PG-RCNN 还会为其分配一个表示前景概率的语义特征。大量实验表明,该方法生成的点云在几何与语义上都提供了丰富的信息,可用于修正误检和错位的候选框。PG-RCNN 在 KITTI 基准上取得了有竞争力的性能,且参数量显著少于最先进模型。代码见 https://github.com/quotation2520/PG-RCNN。

Automatic lobe segmentation using attentive cross entropy and end-to-end fissure generation

  • paper_url: http://arxiv.org/abs/2307.12634
  • repo_url: https://github.com/htytewx/softcam
  • paper_authors: Qi Su, Na Wang, Jiawen Xie, Yinan Chen, Xiaofan Zhang
  • for: automatic lung lobe segmentation algorithm for the diagnosis and treatment of lung diseases
  • methods: task-specific loss function to pay attention to the area around the pulmonary fissure, end-to-end pulmonary fissure generation method, registration-based loss function to alleviate convergence difficulty
  • results: dice scores of 97.83% on private dataset STLB and 94.75% on public LUNA16 dataset
    Abstract The automatic lung lobe segmentation algorithm is of great significance for the diagnosis and treatment of lung diseases, however, which has great challenges due to the incompleteness of pulmonary fissures in lung CT images and the large variability of pathological features. Therefore, we propose a new automatic lung lobe segmentation framework, in which we urge the model to pay attention to the area around the pulmonary fissure during the training process, which is realized by a task-specific loss function. In addition, we introduce an end-to-end pulmonary fissure generation method in the auxiliary pulmonary fissure segmentation task, without any additional network branch. Finally, we propose a registration-based loss function to alleviate the convergence difficulty of the Dice loss supervised pulmonary fissure segmentation task. We achieve 97.83% and 94.75% dice scores on our private dataset STLB and public LUNA16 dataset respectively.
    摘要 自动肺叶分割算法对肺部疾病的诊断与治疗具有重要意义,但由于肺部 CT 影像中肺裂常常不完整、病理特征变化又很大,这一任务面临很大挑战。为此,我们提出了一个新的自动肺叶分割框架:通过任务特定的损失函数,促使模型在训练过程中关注肺裂附近的区域;同时,我们在辅助的肺裂分割任务中引入了一种端到端的肺裂生成方法,无需任何额外的网络分支;最后,我们提出了一种基于配准的损失函数,以缓解由 Dice 损失监督的肺裂分割任务的收敛困难。我们在私有数据集 STLB 和公共数据集 LUNA16 上分别取得了 97.83% 和 94.75% 的 Dice 分数。
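
A hedged sketch of an "attentive" cross-entropy that up-weights pixels near the pulmonary fissure; the distance map, the exponential weighting function, and the `sigma` parameter are illustrative assumptions rather than the paper's exact formulation.

```python
# Distance-weighted cross-entropy sketch (distance map and weighting are assumptions).
import torch
import torch.nn.functional as F

B, C, H, W = 2, 6, 64, 64                       # 5 lobes + background, toy size
logits = torch.randn(B, C, H, W, requires_grad=True)
target = torch.randint(0, C, (B, H, W))
dist_to_fissure = torch.rand(B, H, W) * 20.0    # hypothetical distance map (voxels)

# Larger weight close to the fissure, decaying with distance (sigma is an assumption).
sigma = 5.0
weight = 1.0 + torch.exp(-dist_to_fissure / sigma)

per_pixel_ce = F.cross_entropy(logits, target, reduction="none")   # (B, H, W)
attentive_ce = (weight * per_pixel_ce).mean()
attentive_ce.backward()
print(float(attentive_ce))
```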

Semi-Supervised Medical Image Segmentation with Co-Distribution Alignment

  • paper_url: http://arxiv.org/abs/2307.12630
  • repo_url: None
  • paper_authors: Tao Wang, Zhongzheng Huang, Jiawei Wu, Yuanzheng Cai, Zuoyong Li
  • for: 这篇论文旨在提出一种基于半监督学习的医学影像分割方法,以便在缺乏大量标注数据的情况下进行医学影像分割。
  • methods: 论文提出了一种名为 Co-Distribution Alignment(Co-DA)的方法,在半监督设置下提升医学影像分割性能:利用两个初始化不同的模型,按类别将未标注数据与已标注数据的边缘预测进行对齐,并用一个模型生成的伪标签监督另一个模型;此外,还提出了一种过期望交叉熵损失,用于过滤未标注像素,降低其伪标签中的噪声。
  • results: 实验结果表明,Co-DA 方法在三个公共数据集上均优于现有的半监督分割方法:在 2D CaDIS 数据集上仅用 24% 的标注数据即可达到 0.8515 的 mIoU;在 3D LGE-MRI 和 ACDC 数据集上仅用 20% 的数据即可分别达到 0.8824 和 0.8773 的 Dice 分数。
    Abstract Medical image segmentation has made significant progress when a large amount of labeled data are available. However, annotating medical image segmentation datasets is expensive due to the requirement of professional skills. Additionally, classes are often unevenly distributed in medical images, which severely affects the classification performance on minority classes. To address these problems, this paper proposes Co-Distribution Alignment (Co-DA) for semi-supervised medical image segmentation. Specifically, Co-DA aligns marginal predictions on unlabeled data to marginal predictions on labeled data in a class-wise manner with two differently initialized models before using the pseudo-labels generated by one model to supervise the other. Besides, we design an over-expectation cross-entropy loss for filtering the unlabeled pixels to reduce noise in their pseudo-labels. Quantitative and qualitative experiments on three public datasets demonstrate that the proposed approach outperforms existing state-of-the-art semi-supervised medical image segmentation methods on both the 2D CaDIS dataset and the 3D LGE-MRI and ACDC datasets, achieving an mIoU of 0.8515 with only 24% labeled data on CaDIS, and a Dice score of 0.8824 and 0.8773 with only 20% data on LGE-MRI and ACDC, respectively.
    摘要 在拥有大量标注数据时,医学图像分割已经取得了显著进展。然而,标注医学图像分割数据集代价高昂,因为需要专业技能;此外,医学图像中的类别往往分布不均,严重影响少数类别的分割性能。为了解决这些问题,本文提出了协同分布对齐(Co-Distribution Alignment, Co-DA)方法,用于半监督医学图像分割。具体来说,Co-DA 利用两个初始化不同的模型,按类别将未标注数据上的边缘预测与已标注数据上的边缘预测对齐,然后用一个模型生成的伪标签监督另一个模型。此外,我们还设计了过期望交叉熵损失,用于筛选未标注像素,以降低其伪标签中的噪声。在三个公共数据集上的定量与定性实验表明:在 2D CaDIS 数据集上,仅使用 24% 的标注数据即可达到 0.8515 的 mIoU;在 3D LGE-MRI 和 ACDC 数据集上,仅使用 20% 的数据即可分别取得 0.8824 和 0.8773 的 Dice 分数。
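
A hedged sketch of cross-model pseudo-label supervision with confidence filtering on unlabeled pixels, in the spirit of Co-DA's noise filtering; the threshold `tau` is an assumption, and the paper's class-wise distribution alignment and over-expectation cross-entropy are not reproduced here.

```python
# Confidence-filtered cross-model pseudo-labeling sketch (generic stand-in, not the paper's loss).
import torch
import torch.nn.functional as F

B, C, H, W = 2, 4, 64, 64
logits_a = torch.randn(B, C, H, W, requires_grad=True)   # model A on unlabeled images
logits_b = torch.randn(B, C, H, W)                        # model B (pseudo-label source)

prob_b = logits_b.softmax(dim=1)
conf, pseudo = prob_b.max(dim=1)                          # per-pixel confidence and label, (B, H, W)

tau = 0.9                                                 # confidence threshold (assumption)
keep = (conf > tau).float()

ce = F.cross_entropy(logits_a, pseudo, reduction="none")  # (B, H, W)
loss = (keep * ce).sum() / keep.sum().clamp(min=1.0)
loss.backward()
print(float(loss), "kept pixels:", int(keep.sum()))
```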

Phase Matching for Out-of-Distribution Generalization

  • paper_url: http://arxiv.org/abs/2307.12622
  • repo_url: None
  • paper_authors: Chengming Hu, Yeqian Du, Rui Wang, Hao Chen
  • for: 本研究旨在厘清域泛化(DG)与频率成分之间的关系,并探索相位谱中的空间关系,以解释卷积神经网络(CNN)在分布外的泛化行为。
  • methods: 本研究使用傅里叶变换分解视觉信号,提出了一个将相位谱视为半因果因素、幅度谱视为非因果因素的结构因果模型,并在此基础上提出相位匹配(PhaMa)方法来解决 DG 问题。
  • results: 实验证明,所提出的方法在多个标准 benchmark 上的域泛化和分布外(OOD)鲁棒性任务中均达到领先性能。
    Abstract The Fourier transform, serving as an explicit decomposition method for visual signals, has been employed to explain the out-of-distribution generalization behaviors of Convolutional Neural Networks (CNNs). Previous studies have indicated that the amplitude spectrum is susceptible to the disturbance caused by distribution shifts. On the other hand, the phase spectrum preserves highly-structured spatial information, which is crucial for robust visual representation learning. However, the spatial relationships of phase spectrum remain unexplored in previous researches. In this paper, we aim to clarify the relationships between Domain Generalization (DG) and the frequency components, and explore the spatial relationships of the phase spectrum. Specifically, we first introduce a Fourier-based structural causal model which interprets the phase spectrum as semi-causal factors and the amplitude spectrum as non-causal factors. Then, we propose Phase Matching (PhaMa) to address DG problems. Our method introduces perturbations on the amplitude spectrum and establishes spatial relationships to match the phase components. Through experiments on multiple benchmarks, we demonstrate that our proposed method achieves state-of-the-art performance in domain generalization and out-of-distribution robustness tasks.
    摘要 傅里叶变换作为视觉信号的一种显式分解方法,已被用于解释卷积神经网络(CNN)在分布外的泛化行为。已有研究表明,幅度谱容易受到分布偏移的干扰,而相位谱保留了高度结构化的空间信息,这对学习鲁棒的视觉表示至关重要;然而,相位谱中的空间关系在以往研究中尚未得到探讨。本文旨在厘清域泛化(DG)与各频率成分之间的关系,并探索相位谱的空间关系。具体而言,我们首先引入一个基于傅里叶变换的结构因果模型,将相位谱解释为半因果因素、幅度谱解释为非因果因素;然后提出相位匹配(PhaMa)方法来解决 DG 问题:该方法对幅度谱施加扰动,并建立空间关系以匹配相位分量。在多个基准上的实验表明,所提方法在域泛化与分布外(OOD)鲁棒性任务中均取得了最先进的性能。
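
A small sketch of the Fourier decomposition that PhaMa builds on: split an image into amplitude and phase, perturb only the amplitude, and keep the phase intact. The mixing coefficient `lam` and the use of a second image as the amplitude source are illustrative assumptions, not the paper's exact perturbation scheme.

```python
# Amplitude/phase decomposition and amplitude-only perturbation sketch.
import torch

x = torch.rand(1, 3, 64, 64)                   # image whose phase we want to keep
x_other = torch.rand(1, 3, 64, 64)             # source of the perturbing amplitude (assumption)

fx = torch.fft.fft2(x)
amp, phase = fx.abs(), fx.angle()
amp_other = torch.fft.fft2(x_other).abs()

lam = 0.3                                      # amplitude mixing strength (assumption)
amp_perturbed = (1 - lam) * amp + lam * amp_other

# Recombine the perturbed amplitude with the original phase and invert the FFT.
fx_new = amp_perturbed * torch.exp(1j * phase)
x_perturbed = torch.fft.ifft2(fx_new).real

print(x_perturbed.shape)                       # same spatial structure, shifted "style"
```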

Sparse annotation strategies for segmentation of short axis cardiac MRI

  • paper_url: http://arxiv.org/abs/2307.12619
  • repo_url: None
  • paper_authors: Josh Stein, Maxime Di Folco, Julia Schnabel
  • for: 这个论文的目的是研究使用减少数据量和标注量来实现高精度的心脏MRI分割。
  • methods: 论文通过减少标注的病例数(稀疏体积)和每个病例标注的切片数(稀疏标注)来降低标注成本,并用最先进的 nnU-Net 模型评估不同标注策略的分割性能。
  • results: 研究发现,仅在 48 个标注体积上训练即可获得高于 0.85 的 Dice 分数,结果与使用完整数据集(分别为 160 和 240 个体积)相当;标注体积中部的切片对分割性能最有利,而心尖区域的切片贡献最小;在标注更多体积与标注更多切片之间,优先标注尽可能多的切片是更好的策略。
    Abstract Short axis cardiac MRI segmentation is a well-researched topic, with excellent results achieved by state-of-the-art models in a supervised setting. However, annotating MRI volumes is time-consuming and expensive. Many different approaches (e.g. transfer learning, data augmentation, few-shot learning, etc.) have emerged in an effort to use fewer annotated data and still achieve similar performance as a fully supervised model. Nevertheless, to the best of our knowledge, none of these works focus on which slices of MRI volumes are most important to annotate for yielding the best segmentation results. In this paper, we investigate the effects of training with sparse volumes, i.e. reducing the number of cases annotated, and sparse annotations, i.e. reducing the number of slices annotated per case. We evaluate the segmentation performance using the state-of-the-art nnU-Net model on two public datasets to identify which slices are the most important to annotate. We have shown that training on a significantly reduced dataset (48 annotated volumes) can give a Dice score greater than 0.85 and results comparable to using the full dataset (160 and 240 volumes for each dataset respectively). In general, training on more slice annotations provides more valuable information compared to training on more volumes. Further, annotating slices from the middle of volumes yields the most beneficial results in terms of segmentation performance, and the apical region the worst. When evaluating the trade-off between annotating volumes against slices, annotating as many slices as possible instead of annotating more volumes is a better strategy.
    摘要 短轴心脏 MRI 分割是一个研究较为充分的课题,最先进的模型在全监督设置下已取得出色的结果。然而,标注 MRI 体积数据既耗时又昂贵。为了用更少的标注数据达到与全监督模型相近的性能,已经出现了多种方法(如迁移学习、数据增强、小样本学习等),但据我们所知,这些工作都没有关注应当标注 MRI 体积中的哪些切片才能获得最佳分割效果。本文研究了稀疏体积(减少标注病例数)与稀疏标注(减少每个病例标注的切片数)两种训练方式的影响。我们使用最先进的 nnU-Net 模型在两个公共数据集上评估分割性能,以确定哪些切片最值得标注。结果表明,在显著缩减的数据集(48 个标注体积)上训练即可获得大于 0.85 的 Dice 分数,结果与使用完整数据集(两个数据集分别为 160 和 240 个体积)相当。总体而言,相比标注更多体积,标注更多切片能够提供更有价值的信息;此外,标注体积中部的切片对分割性能最有益,而心尖区域的切片最差。在标注更多体积与标注更多切片之间权衡时,尽可能多地标注切片是更好的策略。

Attribute Regularized Soft Introspective VAE: Towards Cardiac Attribute Regularization Through MRI Domains

  • paper_url: http://arxiv.org/abs/2307.12618
  • repo_url: None
  • paper_authors: Maxime Di Folco, Cosmin Bercea, Julia A. Schnabel
  • for: 这篇论文旨在提高深度生成模型的可控性,通过修改数据特征来控制数据生成。
  • methods: 这篇论文在 Soft Introspective VAE 框架中加入属性正则化损失,提出了 Attribute-Regularized Soft Introspective VAE(Attri-SIVAE)模型,以增强 VAE 的可控性。
  • results: 实验表明,Attri-SIVAE 在重建与正则化性能上与最先进的属性正则化 VAE 相当,并且与后者不同,在另一数据集上测试时仍能保持同等的正则化水平。
    Abstract Deep generative models have emerged as influential instruments for data generation and manipulation. Enhancing the controllability of these models by selectively modifying data attributes has been a recent focus. Variational Autoencoders (VAEs) have shown promise in capturing hidden attributes but often produce blurry reconstructions. Controlling these attributes through different imaging domains is difficult in medical imaging. Recently, Soft Introspective VAE leverage the benefits of both VAEs and Generative Adversarial Networks (GANs), which have demonstrated impressive image synthesis capabilities, by incorporating an adversarial loss into VAE training. In this work, we propose the Attributed Soft Introspective VAE (Attri-SIVAE) by incorporating an attribute regularized loss, into the Soft-Intro VAE framework. We evaluate experimentally the proposed method on cardiac MRI data from different domains, such as various scanner vendors and acquisition centers. The proposed method achieves similar performance in terms of reconstruction and regularization compared to the state-of-the-art Attributed regularized VAE but additionally also succeeds in keeping the same regularization level when tested on a different dataset, unlike the compared method.
    摘要 深度生成模型已成为数据生成与编辑的重要工具,通过有选择地修改数据属性来增强其可控性是近期的研究焦点。变分自编码器(VAE)在捕捉隐含属性方面展现出潜力,但往往产生模糊的重建结果;在医学影像中,跨不同成像域控制这些属性尤为困难。最近,Soft Introspective VAE 通过在 VAE 训练中引入对抗损失,兼具 VAE 与生成对抗网络(GAN)的优点,后者已展示出出色的图像合成能力。在这项工作中,我们在 Soft-Intro VAE 框架中加入属性正则化损失,提出了 Attri-SIVAE。我们在来自不同扫描仪厂商和采集中心的心脏 MRI 数据上对该方法进行了实验评估:在重建与正则化性能方面,它与最先进的属性正则化 VAE 相当,但不同于后者,当在另一数据集上测试时仍能保持同等的正则化水平。

ExWarp: Extrapolation and Warping-based Temporal Supersampling for High-frequency Displays

  • paper_url: http://arxiv.org/abs/2307.12607
  • repo_url: None
  • paper_authors: Akanksha Dixit, Yashashwee Chakrabarty, Smruti R. Sarangi
  • for: The paper aims to increase the frame rate of high-frequency displays by 4x with minimal reduction in perceived image quality.
  • methods: The paper proposes using reinforcement learning (RL) to intelligently choose between slower DNN-based extrapolation and faster warping-based methods.
  • results: The proposed approach, called ExWarp, achieves a 4x increase in frame rate with an almost negligible reduction in perceived image quality.
    Abstract High-frequency displays are gaining immense popularity because of their increasing use in video games and virtual reality applications. However, the issue is that the underlying GPUs cannot continuously generate frames at this high rate -- this results in a less smooth and responsive experience. Furthermore, if the frame rate is not synchronized with the refresh rate, the user may experience screen tearing and stuttering. Previous works propose increasing the frame rate to provide a smooth experience on modern displays by predicting new frames based on past or future frames. Interpolation and extrapolation are two widely used algorithms that predict new frames. Interpolation requires waiting for the future frame to make a prediction, which adds additional latency. On the other hand, extrapolation provides a better quality of experience because it relies solely on past frames -- it does not incur any additional latency. The simplest method to extrapolate a frame is to warp the previous frame using motion vectors; however, the warped frame may contain improperly rendered visual artifacts due to dynamic objects -- this makes it very challenging to design such a scheme. Past work has used DNNs to get good accuracy, however, these approaches are slow. This paper proposes Exwarp -- an approach based on reinforcement learning (RL) to intelligently choose between the slower DNN-based extrapolation and faster warping-based methods to increase the frame rate by 4x with an almost negligible reduction in the perceived image quality.
    摘要 高刷新率显示器因其在游戏和虚拟现实应用中的广泛使用而越来越受欢迎。但问题在于,底层 GPU 无法持续以如此高的速率生成帧,导致体验不够流畅、响应不够及时;而且当帧率与刷新率不同步时,用户还可能遇到画面撕裂和卡顿。已有工作提出通过基于过去或未来帧预测新帧来提高帧率,从而在现代显示器上提供流畅体验。插值和外推是两种广泛使用的帧预测算法:插值需要等待未来帧才能进行预测,会引入额外延迟;外推仅依赖过去帧,不产生额外延迟,因而体验更好。外推一帧最简单的方法是利用运动向量对上一帧进行扭曲,但由于动态物体,扭曲后的帧可能包含渲染不正确的视觉伪影,这使得此类方案的设计极具挑战。以往工作使用 DNN 获得了较好的精度,但速度较慢。本文提出 ExWarp,一种基于强化学习(RL)的方法,在较慢的基于 DNN 的外推与较快的基于扭曲的方法之间进行智能选择,将帧率提升 4 倍,而感知图像质量几乎没有下降。
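
A minimal sketch of the warping-based extrapolation path: resample the previous frame along per-pixel motion vectors with `grid_sample`. The motion vectors here are random placeholders; a real pipeline would take them from the engine or an optical-flow estimator, and the RL-based selection between warping and DNN extrapolation is not shown.

```python
# Motion-vector warping sketch (random flow is a placeholder).
import torch
import torch.nn.functional as F

B, C, H, W = 1, 3, 120, 160
prev_frame = torch.rand(B, C, H, W)
flow = torch.randn(B, 2, H, W) * 2.0           # (dx, dy) per pixel, illustrative

# Base sampling grid in the normalized [-1, 1] coordinates expected by grid_sample.
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
)
base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)

# Convert pixel displacements to normalized offsets and sample the previous frame.
norm_flow = torch.stack(
    (flow[:, 0] * 2.0 / (W - 1), flow[:, 1] * 2.0 / (H - 1)), dim=-1
)
warped = F.grid_sample(prev_frame, base + norm_flow, align_corners=True)
print(warped.shape)                            # extrapolated frame, (B, C, H, W)
```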

SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.12591
  • repo_url: https://github.com/ucsc-vlaa/swinmm
  • paper_authors: Yiqing Wang, Zihan Li, Jieru Mei, Zihao Wei, Li Liu, Chen Wang, Shengtian Sang, Alan Yuille, Cihang Xie, Yuyin Zhou
  • for: 这篇论文的主要目的是改进用于医疗影像分割的自监督学习方法。
  • methods: 论文使用的方法包括 masked multi-view encoder 和 cross-view decoder,以及一种新的多视图互学习方法。
  • results: 与此前最先进的自监督学习方法 Swin UNETR 相比,论文显示出明显优势,能够更好地整合多视图信息,提高模型的准确率和数据效率。
    Abstract Recent advancements in large-scale Vision Transformers have made significant strides in improving pre-trained models for medical image segmentation. However, these methods face a notable challenge in acquiring a substantial amount of pre-training data, particularly within the medical field. To address this limitation, we present Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view pipeline for enabling accurate and data-efficient self-supervised medical image analysis. Our strategy harnesses the potential of multi-view information by incorporating two principal components. In the pre-training phase, we deploy a masked multi-view encoder devised to concurrently train masked multi-view observations through a range of diverse proxy tasks. These tasks span image reconstruction, rotation, contrastive learning, and a novel task that employs a mutual learning paradigm. This new task capitalizes on the consistency between predictions from various perspectives, enabling the extraction of hidden multi-view information from 3D medical data. In the fine-tuning stage, a cross-view decoder is developed to aggregate the multi-view information through a cross-attention block. Compared with the previous state-of-the-art self-supervised learning method Swin UNETR, SwinMM demonstrates a notable advantage on several medical image segmentation tasks. It allows for a smooth integration of multi-view information, significantly boosting both the accuracy and data-efficiency of the model. Code and models are available at https://github.com/UCSC-VLAA/SwinMM/.
    摘要 大规模视觉 Transformer 的最新进展显著改进了用于医学图像分割的预训练模型,但这些方法面临一个突出的挑战:在医疗领域很难获得足够的预训练数据。为了解决这一限制,我们提出了 Masked Multi-view with Swin Transformers(SwinMM),一种新颖的多视图流程,用于实现准确且数据高效的自监督医学图像分析。我们的策略通过两个主要部分挖掘多视图信息的潜力:在预训练阶段,我们部署一个掩码多视图编码器,通过一系列不同的代理任务(图像重建、旋转、对比学习,以及一个利用互学习范式的新任务)同时训练掩码后的多视图观测;该新任务利用不同视角预测之间的一致性,从 3D 医学数据中提取隐藏的多视图信息。在微调阶段,我们设计了一个跨视图解码器,通过交叉注意力模块聚合多视图信息。与此前最先进的自监督学习方法 Swin UNETR 相比,SwinMM 在多个医学图像分割任务上展现出明显优势,能够平滑地整合多视图信息,显著提升模型的准确率与数据效率。代码与模型见 https://github.com/UCSC-VLAA/SwinMM/。

PRIOR: Prototype Representation Joint Learning from Medical Images and Reports

  • paper_url: http://arxiv.org/abs/2307.12577
  • repo_url: https://github.com/qtacierp/prior
  • paper_authors: Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, Xiaoying Tang
  • for: 本文提出了一种基于对比学习的视觉-语言联合预训练框架,用于学习医学图像与报告之间的对应关系。
  • methods: 该方法在全局对齐的基础上,引入细粒度的局部对齐模块,以同时学习高层临床语言特征与低层视觉特征;此外,还设计了跨模态条件重建模块,在训练阶段实现模态间的信息交换。
  • results: 实验结果表明,所提方法在监督分类、零样本分类、图像到文本检索、语义分割和目标检测五个下游任务中均表现出色,并且在不同的数据集规模设置下都具有优势。
    Abstract Contrastive learning based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation learning framework incorporating both global and local alignment between medical images and reports. In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation. Furthermore, a cross-modality conditional reconstruction module is designed to interchange information across modalities in the training phase by reconstructing masked images and reports. For reconstructing long reports, a sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features. Additionally, a non-auto-regressive generation paradigm is proposed for reconstructing non-sequential reports. Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings. The code is available at https://github.com/QtacierP/PRIOR.
    摘要 基于对比学习的视觉-语言联合预训练已成为一种成功的表示学习策略。本文提出了一种结合医学图像与报告之间全局与局部对齐的原型表示学习框架。与标准的全局多模态对齐方法不同,我们引入局部对齐模块以获得细粒度表示;此外,我们设计了跨模态条件重建模块,在训练阶段通过重建被掩码的图像和报告来实现模态间的信息交换。为了重建较长的报告,我们构建了句子级原型记忆库,使网络能够同时关注低层的局部视觉特征与高层的临床语言特征;我们还提出了一种非自回归生成范式,用于重建非顺序的报告。在监督分类、零样本分类、图像到文本检索、语义分割和目标检测五个下游任务上的实验结果表明,所提方法在多个数据集和不同数据规模设置下均优于其他最先进方法。代码见 https://github.com/QtacierP/PRIOR。
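
A hedged sketch of the global image-report alignment that frameworks like PRIOR start from: a symmetric InfoNCE loss over paired embeddings. The temperature value is an assumption, and the local alignment, prototype memory bank, and conditional reconstruction modules are not reproduced here.

```python
# Symmetric image-text contrastive (InfoNCE) sketch.
import torch
import torch.nn.functional as F

B, D = 8, 256
img_emb = F.normalize(torch.randn(B, D, requires_grad=True), dim=-1)
txt_emb = F.normalize(torch.randn(B, D, requires_grad=True), dim=-1)

temperature = 0.07                              # assumption
logits = img_emb @ txt_emb.t() / temperature    # (B, B) similarity matrix
targets = torch.arange(B)                       # matching pairs lie on the diagonal

loss = 0.5 * (F.cross_entropy(logits, targets) +
              F.cross_entropy(logits.t(), targets))
loss.backward()
print(float(loss))
```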

A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.12574
  • repo_url: None
  • paper_authors: Jinjing Zhu, Yunhao Luo, Xu Zheng, Hao Wang, Lin Wang
  • for: 本研究旨在回答:如何让基于卷积神经网络(CNN)和视觉 Transformer(ViT)的模型在语义分割中协同学习,并在二者之间选择和交换可靠的知识?
  • methods: 我们提出了一个在线知识蒸馏(KD)框架,可同时学习紧凑而高效的 CNN 与 ViT 模型,并通过两项关键技术突破二者的局限:其一是异构特征蒸馏(HFD),通过模仿 CNN 与 ViT 之间的异构特征,提高两个学生在低层特征空间的一致性;其二是双向选择性蒸馏(BSD),能够动态地传递被选中的知识,包括 1)区域级 BSD 在特征空间中确定对应区域之间知识传递的方向,2)像素级 BSD 在 logit 空间中判别哪些预测知识应当被传递。
  • results: 在三个基准数据集上的大量实验表明,所提框架大幅超越现有的在线蒸馏方法,并证明了 ViT 模型与 CNN 模型之间协同学习的有效性。
    Abstract In this paper, we strive to answer the question "how to collaboratively learn convolutional neural network (CNN)-based and vision transformer (ViT)-based models by selecting and exchanging the reliable knowledge between them for semantic segmentation?" Accordingly, we propose an online knowledge distillation (KD) framework that can simultaneously learn compact yet effective CNN-based and ViT-based models with two key technical breakthroughs to take full advantage of CNNs and ViT while compensating their limitations. Firstly, we propose heterogeneous feature distillation (HFD) to improve students' consistency in low-layer feature space by mimicking heterogeneous features between CNNs and ViT. Secondly, to facilitate the two students to learn reliable knowledge from each other, we propose bidirectional selective distillation (BSD) that can dynamically transfer selective knowledge. This is achieved by 1) region-wise BSD determining the directions of knowledge transferred between the corresponding regions in the feature space and 2) pixel-wise BSD discerning which of the prediction knowledge to be transferred in the logit space. Extensive experiments on three benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art online distillation methods by a large margin, and shows its efficacy in learning collaboratively between ViT-based and CNN-based models.
    摘要 在本文中,我们致力于回答“如何通过选择和交换可靠知识,让基于卷积神经网络(CNN)和视觉 Transformer(ViT)的模型协同学习以进行语义分割?”为此,我们提出了一个在线知识蒸馏(KD)框架,可同时学习紧凑而高效的 CNN 与 ViT 模型,并通过两项关键技术突破,充分发挥 CNN 与 ViT 的优势,同时弥补各自的局限。首先,我们提出异构特征蒸馏(HFD),通过模仿 CNN 与 ViT 之间的异构特征,提高两个学生在低层特征空间的一致性。其次,为了让两个学生彼此学习可靠的知识,我们提出双向选择性蒸馏(BSD),动态地传递被选中的知识:1)区域级 BSD 在特征空间中确定对应区域之间知识传递的方向;2)像素级 BSD 在 logit 空间中判别哪些预测知识应当被传递。在三个基准数据集上的大量实验表明,所提框架大幅超越现有的在线蒸馏方法,并展示了 ViT 模型与 CNN 模型协同学习的有效性。
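
A simplified sketch of pixel-wise selective distillation between two students: for each pixel, the more confident model supervises the other with a KL term. This is a stand-in to illustrate the idea of bidirectional, selective transfer; the paper's region-wise BSD and heterogeneous feature distillation are omitted, and the confidence-based selection rule is an assumption.

```python
# Pixel-wise bidirectional selective distillation sketch.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 19, 32, 32
logits_cnn = torch.randn(B, C, H, W, requires_grad=True)
logits_vit = torch.randn(B, C, H, W, requires_grad=True)

conf_cnn = logits_cnn.softmax(dim=1).max(dim=1).values     # (B, H, W)
conf_vit = logits_vit.softmax(dim=1).max(dim=1).values
cnn_teaches = (conf_cnn > conf_vit).float()                 # per-pixel direction of transfer

def kl(student_logits, teacher_logits):
    # per-pixel KL(teacher || student), teacher detached so only the student is updated
    return F.kl_div(
        student_logits.log_softmax(dim=1),
        teacher_logits.detach().softmax(dim=1),
        reduction="none",
    ).sum(dim=1)                                             # (B, H, W)

loss = (cnn_teaches * kl(logits_vit, logits_cnn) +
        (1.0 - cnn_teaches) * kl(logits_cnn, logits_vit)).mean()
loss.backward()
print(float(loss))
```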

MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary

  • paper_url: http://arxiv.org/abs/2307.12571
  • repo_url: None
  • paper_authors: Beiya Dai, Xing li, Qunyi Xie, Yulin Li, Xiameng Qin, Chengquan Zhang, Kun Yao, Junyu Han
  • for: DocUNet, DIR300, WarpDoc datasets
  • methods: margin regularization, background consistency, word position consistency
  • results: superior performance on documents with incomplete boundaries
    Abstract Document dewarping from a distorted camera-captured image is of great value for OCR and document understanding. The document boundary plays an important role which is more evident than the inner region in document dewarping. Current learning-based methods mainly focus on complete boundary cases, leading to poor document correction performance of documents with incomplete boundaries. In contrast to these methods, this paper proposes MataDoc, the first method focusing on arbitrary boundary document dewarping with margin and text aware regularizations. Specifically, we design the margin regularization by explicitly considering background consistency to enhance boundary perception. Moreover, we introduce word position consistency to keep text lines straight in rectified document images. To produce a comprehensive evaluation of MataDoc, we propose a novel benchmark ArbDoc, mainly consisting of document images with arbitrary boundaries in four typical scenarios. Extensive experiments confirm the superiority of MataDoc with consideration for the incomplete boundary on ArbDoc and also demonstrate the effectiveness of the proposed method on DocUNet, DIR300, and WarpDoc datasets.
    摘要 从失真的相机拍摄图像中对文档进行去扭曲,对 OCR 和文档理解具有重要价值。在文档去扭曲中,文档边界的作用比内部区域更为明显。现有基于学习的方法主要针对边界完整的情况,导致对边界不完整的文档矫正效果较差。与这些方法不同,本文提出了 MataDoc,首个面向任意边界文档去扭曲、同时引入边缘与文本感知正则化的方法。具体而言,我们通过显式考虑背景一致性来设计边缘正则项,以增强对边界的感知;此外,我们引入文字位置一致性约束,使矫正后图像中的文本行保持平直。为了对 MataDoc 进行全面评估,我们提出了新的基准 ArbDoc,主要包含四种典型场景下具有任意边界的文档图像。大量实验证明了 MataDoc 在处理不完整边界时的优越性,并在 DocUNet、DIR300 和 WarpDoc 数据集上验证了所提方法的有效性。

Interpolating between Images with Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.12560
  • repo_url: None
  • paper_authors: Clinton J. Wang, Polina Golland
  • for: 这篇论文研究图像生成与编辑中尚缺失的一项能力:在两张输入图像之间进行插值,以拓展图像生成模型的创作应用。
  • methods: 该论文提出了一种基于潜在扩散模型的零样本插值方法:在一系列递减的噪声水平下对潜在空间进行插值,再以文本反演得到的文本嵌入(可选地结合主体姿态)为条件进行去噪。
  • results: 论文在多种主体姿态、图像风格和图像内容上进行了插值,并用 CLIP 从多个候选中挑选质量最高的图像,结果表明该方法能够获得令人信服的插值效果。
    Abstract One little-explored frontier of image generation and editing is the task of interpolating between two input images, a feature missing from all currently deployed image generation pipelines. We argue that such a feature can expand the creative applications of such models, and propose a method for zero-shot interpolation using latent diffusion models. We apply interpolation in the latent space at a sequence of decreasing noise levels, then perform denoising conditioned on interpolated text embeddings derived from textual inversion and (optionally) subject poses. For greater consistency, or to specify additional criteria, we can generate several candidates and use CLIP to select the highest quality image. We obtain convincing interpolations across diverse subject poses, image styles, and image content, and show that standard quantitative metrics such as FID are insufficient to measure the quality of an interpolation. Code and data are available at https://clintonjwang.github.io/interpolation.
    摘要 图像生成与编辑中一个鲜有探索的前沿,是在两张输入图像之间进行插值,而目前已部署的图像生成流程均不具备这一能力。我们认为这一能力可以拓展此类模型的创作应用,并提出了一种利用潜在扩散模型进行零样本插值的方法。我们在一系列递减的噪声水平下对潜在空间进行插值,然后以文本反演得到的文本嵌入(可选地结合主体姿态)为条件进行去噪。为了获得更好的一致性或满足额外要求,可以生成多个候选并用 CLIP 挑选质量最高的图像。我们在多种主体姿态、图像风格和图像内容上获得了令人信服的插值结果,并指出 FID 等标准量化指标不足以衡量插值质量。代码与数据见 https://clintonjwang.github.io/interpolation。
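
A small sketch of spherical linear interpolation (slerp) between two diffusion latents, the usual way to traverse between noised encodings of the input images; the text-embedding interpolation, the denoising loop, and the CLIP-based reranking are omitted, and the toy latent shape is an assumption.

```python
# Slerp between two latents; intermediate points stay on (roughly) the same sphere.
import torch

def slerp(z0, z1, t, eps=1e-7):
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos = torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm() + eps)
    theta = torch.acos(cos.clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - t) * theta) * z0 + torch.sin(t * theta) * z1) / torch.sin(theta)

latent_a = torch.randn(4, 64, 64)   # noised latent of image A (toy shape)
latent_b = torch.randn(4, 64, 64)   # noised latent of image B

frames = [slerp(latent_a, latent_b, t) for t in torch.linspace(0, 1, 5)]
print(len(frames), frames[2].shape)
```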

Revisiting Event-based Video Frame Interpolation

  • paper_url: http://arxiv.org/abs/2307.12558
  • repo_url: None
  • paper_authors: Jiaben Chen, Yichen Zhu, Dongze Lian, Jiaqi Yang, Yifu Wang, Renrui Zhang, Xinhang Liu, Shenhan Qian, Laurent Kneip, Shenghua Gao
  • for: 用于提高视频插值的精度和真实性
  • methods: 提出在事件引导的光流细化策略中融合 RGB 信息,并针对事件信号的准连续特性,采用分而治之的策略,将基于事件的中间帧合成拆分为多个简化阶段逐步完成
  • results: 比前方法更加可靠和真实地生成中间帧结果,并且在实验中表明了考虑事件特征的重要性
    Abstract Dynamic vision sensors or event cameras provide rich complementary information for video frame interpolation. Existing state-of-the-art methods follow the paradigm of combining both synthesis-based and warping networks. However, few of those methods fully respect the intrinsic characteristics of events streams. Given that event cameras only encode intensity changes and polarity rather than color intensities, estimating optical flow from events is arguably more difficult than from RGB information. We therefore propose to incorporate RGB information in an event-guided optical flow refinement strategy. Moreover, in light of the quasi-continuous nature of the time signals provided by event cameras, we propose a divide-and-conquer strategy in which event-based intermediate frame synthesis happens incrementally in multiple simplified stages rather than in a single, long stage. Extensive experiments on both synthetic and real-world datasets show that these modifications lead to more reliable and realistic intermediate frame results than previous video frame interpolation methods. Our findings underline that a careful consideration of event characteristics such as high temporal density and elevated noise benefits interpolation accuracy.
    摘要 动态视觉传感器(事件相机)能够为视频插帧提供丰富的互补信息。现有最先进的方法通常将基于合成的网络与基于扭曲的网络相结合,但其中很少有方法充分顾及事件流的固有特性。由于事件相机仅记录亮度变化及其极性而非颜色强度,从事件中估计光流可以说比从 RGB 信息中估计更为困难。因此,我们提出在事件引导的光流细化策略中融合 RGB 信息。此外,鉴于事件相机提供的时间信号具有准连续性,我们提出一种分而治之的策略:基于事件的中间帧合成在多个简化阶段中逐步完成,而不是在单个冗长的阶段中一次完成。在合成与真实数据集上的大量实验表明,这些改进使得中间帧结果比以往的视频插帧方法更可靠、更真实。我们的发现也强调,细致考虑事件的高时间密度和较高噪声等特性,有助于提升插帧精度。

MFMAN-YOLO: A Method for Detecting Pole-like Obstacles in Complex Environment

  • paper_url: http://arxiv.org/abs/2307.12548
  • repo_url: None
  • paper_authors: Lei Cai, Hao Wang, Congling Zhou, Yongqiang Wang, Boyu Liu
  • for: 解决复杂环境中杆体物体特征信息易丢失问题,提高探测精度和实时性。
  • methods: 提出了一种多尺度混合注意力机制检测算法:引入最优传输的 Monge-Kantorovich(MK)函数解决多个预测框重叠时的最优匹配问题并加以正则化,按照优化后的高效多尺度特征金字塔分别对不同尺度特征进行上采样,并利用混合注意力机制在复杂环境中增强多尺度特征空间与通道信息的提取。
  • results: 实验结果显示,方法的检测精度、召回率和平均精度分别为 94.7%、93.1% 和 97.4%,检测帧率达 400 f/s,能够实时且准确地检测复杂道路环境中的杆状障碍物。
    Abstract In real-world traffic, there are various uncertainties and complexities in road and weather conditions. To solve the problem that the feature information of pole-like obstacles in complex environments is easily lost, resulting in low detection accuracy and low real-time performance, a multi-scale hybrid attention mechanism detection algorithm is proposed in this paper. First, the optimal transport function Monge-Kantorovich (MK) is incorporated not only to solve the problem of overlapping multiple prediction frames with optimal matching but also the MK function can be regularized to prevent model over-fitting; then, the features at different scales are up-sampled separately according to the optimized efficient multi-scale feature pyramid. Finally, the extraction of multi-scale feature space channel information is enhanced in complex environments based on the hybrid attention mechanism, which suppresses the irrelevant complex environment background information and focuses the feature information of pole-like obstacles. Meanwhile, this paper conducts real road test experiments in a variety of complex environments. The experimental results show that the detection precision, recall, and average precision of the method are 94.7%, 93.1%, and 97.4%, respectively, and the detection frame rate is 400 f/s. This research method can detect pole-like obstacles in a complex road environment in real time and accurately, which further promotes innovation and progress in the field of automatic driving.
    摘要 实际交通中的道路与天气状况存在各种不确定性和复杂性,杆状障碍物的特征信息在复杂环境中容易丢失,导致检测精度低、实时性差。为此,本文提出了一种多尺度混合注意力机制检测算法。首先,引入最优传输的 Monge-Kantorovich(MK)函数,不仅解决了多个预测框重叠时的最优匹配问题,还可对 MK 函数进行正则化以防止模型过拟合;然后,按照优化后的高效多尺度特征金字塔,分别对不同尺度的特征进行上采样;最后,基于混合注意力机制增强复杂环境下多尺度特征空间与通道信息的提取,抑制无关的复杂背景信息,聚焦杆状障碍物的特征信息。本文还在多种复杂环境下进行了实际道路测试,实验结果表明:该方法的检测精度、召回率和平均精度分别为 94.7%、93.1% 和 97.4%,检测帧率为 400 f/s。该方法能够实时、准确地检测复杂道路环境中的杆状障碍物,进一步推动了自动驾驶领域的创新与进步。

Towards Generalizable Deepfake Detection by Primary Region Regularization

  • paper_url: http://arxiv.org/abs/2307.12534
  • repo_url: None
  • paper_authors: Harry Cheng, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli
  • for: 提高深度伪造检测方法对未见过的伪造与篡改手段的泛化能力
  • methods: 从正则化的角度出发,通过移除图像中的主要区域来进行数据增强,防止检测器过度依赖数据偏差;方法包含主要区域图的静态定位与主要区域掩码的动态利用两个阶段
  • results: 在三个常用深伪数据集、五种骨干网络上,平均性能提升 6%,并与多个最先进基线方法相比具有竞争力
    Abstract The existing deepfake detection methods have reached a bottleneck in generalizing to unseen forgeries and manipulation approaches. Based on the observation that the deepfake detectors exhibit a preference for overfitting the specific primary regions in input, this paper enhances the generalization capability from a novel regularization perspective. This can be simply achieved by augmenting the images through primary region removal, thereby preventing the detector from over-relying on data bias. Our method consists of two stages, namely the static localization for primary region maps, as well as the dynamic exploitation of primary region masks. The proposed method can be seamlessly integrated into different backbones without affecting their inference efficiency. We conduct extensive experiments over three widely used deepfake datasets - DFDC, DF-1.0, and Celeb-DF with five backbones. Our method demonstrates an average performance improvement of 6% across different backbones and performs competitively with several state-of-the-art baselines.
    摘要 现有的深度伪造检测方法在泛化到未见过的伪造与篡改手段时已遇到瓶颈。基于“深伪检测器倾向于过拟合输入中特定主要区域”这一观察,本文从一个新的正则化视角提升其泛化能力:通过移除图像中的主要区域来进行数据增强,从而防止检测器过度依赖数据偏差。我们的方法包含两个阶段,即主要区域图的静态定位,以及主要区域掩码的动态利用。所提方法可以无缝集成到不同的骨干网络中,且不影响其推理效率。我们在 DFDC、DF-1.0 和 Celeb-DF 三个常用深伪数据集、五种骨干网络上进行了大量实验:该方法在不同骨干网络上平均带来 6% 的性能提升,并与多个最先进的基线方法相比具有竞争力。
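
A hedged sketch of the primary-region-removal idea: zero out the region the detector relies on most so that training cannot overfit to it. The saliency map here is random and `drop_ratio` is an assumption; the paper obtains primary region maps through its static localization stage, which is not reproduced.

```python
# Primary-region-removal augmentation sketch (saliency map is a random placeholder).
import torch

def remove_primary_region(images, saliency, drop_ratio=0.1):
    """Zero out the top `drop_ratio` most salient pixels of each image."""
    B, C, H, W = images.shape
    flat = saliency.reshape(B, -1)
    k = max(1, int(drop_ratio * flat.shape[1]))
    thresh = flat.topk(k, dim=1).values[:, -1]                 # per-image threshold
    keep = (saliency < thresh.view(B, 1, 1)).float()           # 1 = keep, 0 = drop
    return images * keep.unsqueeze(1)

imgs = torch.rand(4, 3, 224, 224)
sal = torch.rand(4, 224, 224)                                  # stand-in primary region map
aug = remove_primary_region(imgs, sal)
print(aug.shape, float((aug == 0).float().mean()))
```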

On the Connection between Pre-training Data Diversity and Fine-tuning Robustness

  • paper_url: http://arxiv.org/abs/2307.12532
  • repo_url: None
  • paper_authors: Vivek Ramanujan, Thao Nguyen, Sewoong Oh, Ludwig Schmidt, Ali Farhadi
  • for: 这个论文的目的是研究预训练分布的性质对下游微调模型鲁棒性的影响。
  • methods: 作者使用多种自然和合成数据源构造不同的预训练分布,考察其标签空间、标签语义、图像多样性、数据领域和数据量等属性,并主要以 iWildCam-WILDS 分布偏移来评估下游模型的有效鲁棒性。
  • results: 研究发现,预训练数据量是影响下游有效鲁棒性的主要因素,其他因素的影响有限。例如,将 ImageNet 预训练类别数减少为四分之一、同时将每类图像数增加四倍(即保持总数据量不变),并不会影响微调模型的鲁棒性。
    Abstract Pre-training has been widely adopted in deep learning to improve model performance, especially when the training data for a target task is limited. In our work, we seek to understand the implications of this training strategy on the generalization properties of downstream models. More specifically, we ask the following question: how do properties of the pre-training distribution affect the robustness of a fine-tuned model? The properties we explore include the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution. We find that the primary factor influencing downstream effective robustness (Taori et al., 2020) is data quantity, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4x while increasing the number of images per class by 4x (that is, keeping total data quantity fixed) does not impact the robustness of fine-tuned models. We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources, primarily using the iWildCam-WILDS distribution shift as a test for downstream robustness.
    摘要 预训练已被广泛用于深度学习中以提升模型性能,尤其是在目标任务训练数据有限的情况下。在这项工作中,我们试图理解这种训练策略对下游模型泛化性质的影响,具体而言:预训练分布的哪些性质会影响微调后模型的鲁棒性?我们考察的性质包括预训练分布的标签空间、标签语义、图像多样性、数据领域以及数据量。我们发现,影响下游有效鲁棒性(Taori et al., 2020)的首要因素是数据量,其他因素的作用有限。例如,将 ImageNet 预训练类别数减少为原来的四分之一、同时将每类图像数增加四倍(即保持总数据量不变),并不会影响微调模型的鲁棒性。我们在来自多种自然与合成数据源的预训练分布上验证了这些结论,主要以 iWildCam-WILDS 分布偏移作为下游鲁棒性的测试。

Entropy Transformer Networks: A Learning Approach via Tangent Bundle Data Manifold

  • paper_url: http://arxiv.org/abs/2307.12517
  • repo_url: https://github.com/ijcnn2023/ESTN
  • paper_authors: Pourya Shamsolmoali, Masoumeh Zareapoor
  • for: 该论文致力于为卷积神经网络(CNN)架构设计一种精确且高效的图像变换(插值)方法。
  • methods: 论文提出了新颖的 Entropy Spatial Transformer Networks(ESTN),在数据流形分布上进行插值:为每个像素在数据流形的切空间中生成随机样本,对其灰度值做线性近似并加入熵正则项来计算变换参数;同时提出了一种简单有效的技术,对卷积运算的非零值进行归一化,以便在训练中对梯度范数进行正则化。
  • results: 实验表明,ESTN 能在图像重建与分类等多种计算机视觉任务中提升预测精度,同时降低计算成本。
    Abstract This paper focuses on an accurate and fast interpolation approach for image transformation employed in the design of CNN architectures. Standard Spatial Transformer Networks (STNs) use bilinear or linear interpolation as their interpolation, with unrealistic assumptions about the underlying data distributions, which leads to poor performance under scale variations. Moreover, STNs do not preserve the norm of gradients in propagation due to their dependency on sparse neighboring pixels. To address this problem, a novel Entropy STN (ESTN) is proposed that interpolates on the data manifold distributions. In particular, random samples are generated for each pixel in association with the tangent space of the data manifold and construct a linear approximation of their intensity values with an entropy regularizer to compute the transformer parameters. A simple yet effective technique is also proposed to normalize the non-zero values of the convolution operation, to fine-tune the layers for gradients' norm-regularization during training. Experiments on challenging benchmarks show that the proposed ESTN can improve predictive accuracy over a range of computer vision tasks, including image reconstruction, and classification, while reducing the computational cost.
    摘要 本文关注用于 CNN 架构设计中的图像变换的精确且快速的插值方法。标准的空间变换网络(STN)采用双线性或线性插值,对底层数据分布做出了不切实际的假设,导致在尺度变化下性能不佳;此外,由于依赖稀疏的邻近像素,STN 在传播过程中无法保持梯度的范数。为了解决这些问题,本文提出了一种新颖的熵空间变换网络(ESTN),在数据流形分布上进行插值:为每个像素在数据流形的切空间中生成随机样本,对其灰度值构造线性近似,并加入熵正则项来计算变换参数。本文还提出了一种简单而有效的技术,对卷积运算的非零值进行归一化,以便在训练中对各层的梯度范数进行正则化微调。在多个具有挑战性的基准上的实验表明,ESTN 能在图像重建与分类等一系列计算机视觉任务中提升预测精度,同时降低计算成本。

Cross Contrasting Feature Perturbation for Domain Generalization

  • paper_url: http://arxiv.org/abs/2307.12502
  • repo_url: https://github.com/hackmebroo/ccfp
  • paper_authors: Chenming Li, Daoan Zhang, Wenjian Huang, Jianguo Zhang
  • for: This paper focuses on the problem of domain generalization (DG) and proposes a novel framework called Cross Contrasting Feature Perturbation (CCFP) to simulate domain shift and improve the robustness of the model.
  • methods: The proposed CCFP framework uses an online one-stage approach and generates perturbed features in the latent space while regularizing the model prediction against domain shift. The framework includes learnable feature perturbations and semantic consistency constraints to improve the quality of the perturbed features.
  • results: The proposed method outperforms the previous state-of-the-art on the standard DomainBed benchmark with a strict evaluation protocol. Quantitative analyses show that the method can effectively alleviate the domain shift problem in out-of-distribution (OOD) scenarios.
    Abstract Domain generalization (DG) aims to learn a robust model from source domains that generalize well on unseen target domains. Recent studies focus on generating novel domain samples or features to diversify distributions complementary to source domains. Yet, these approaches can hardly deal with the restriction that the samples synthesized from various domains can cause semantic distortion. In this paper, we propose an online one-stage Cross Contrasting Feature Perturbation (CCFP) framework to simulate domain shift by generating perturbed features in the latent space while regularizing the model prediction against domain shift. Different from the previous fixed synthesizing strategy, we design modules with learnable feature perturbations and semantic consistency constraints. In contrast to prior work, our method does not use any generative-based models or domain labels. We conduct extensive experiments on a standard DomainBed benchmark with a strict evaluation protocol for a fair comparison. Comprehensive experiments show that our method outperforms the previous state-of-the-art, and quantitative analyses illustrate that our approach can alleviate the domain shift problem in out-of-distribution (OOD) scenarios.
    摘要 域泛化(DG)的目标是从源域中学习一个能在未见目标域上良好泛化的鲁棒模型。近期研究主要通过生成新的域样本或特征,来补充源域之外的分布多样性,但这些方法很难避免跨域合成样本带来的语义失真问题。在本文中,我们提出了在线单阶段的交叉对比特征扰动(CCFP)框架:在潜在空间中生成扰动特征来模拟域偏移,同时对模型预测施加抗域偏移的正则化。与以往固定的合成策略不同,我们设计了带可学习特征扰动和语义一致性约束的模块。与先前工作不同,我们的方法不使用任何生成式模型或域标签。我们在标准的 DomainBed 基准上按严格的评估协议进行了大量实验,结果表明我们的方法超越了此前的最优方法;定量分析进一步表明,该方法能够缓解分布外(OOD)场景中的域偏移问题。

AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.12499
  • repo_url: None
  • paper_authors: Xuelong Dai, Kaisheng Liang, Bin Xiao
  • for: 本研究旨在提出一种新的方法 AdvDiff,利用扩散模型生成无限制对抗样本,以攻击深度学习模型及其对抗防御技术。
  • methods: 本研究设计了两种新的对抗引导技术,在扩散模型的反向生成过程中进行对抗采样;它们通过可解释地融合目标分类器的梯度,稳定地生成高质量、逼真的对抗样本。
  • results: 在 MNIST 和 ImageNet 数据集上的实验表明,AdvDiff 能高效地生成无限制对抗样本,其攻击性能和生成质量均优于基于 GAN 的方法。
    Abstract Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques. They pose severe security problems for deep learning applications because they can effectively bypass defense mechanisms. However, previous attack methods often utilize Generative Adversarial Networks (GANs), which are not theoretically provable and thus generate unrealistic examples by incorporating adversarial objectives, especially for large-scale datasets like ImageNet. In this paper, we propose a new method, called AdvDiff, to generate unrestricted adversarial examples with diffusion models. We design two novel adversarial guidance techniques to conduct adversarial sampling in the reverse generation process of diffusion models. These two techniques are effective and stable to generate high-quality, realistic adversarial examples by integrating gradients of the target classifier interpretably. Experimental results on MNIST and ImageNet datasets demonstrate that AdvDiff is effective to generate unrestricted adversarial examples, which outperforms GAN-based methods in terms of attack performance and generation quality.
    摘要 无限制对抗攻击对深度学习模型和对抗防御技术构成严重威胁:它们能够有效绕过防御机制,给深度学习应用带来严重的安全问题。然而,以往的攻击方法大多依赖生成对抗网络(GAN),其缺乏理论可证明性,在加入对抗目标后往往生成不真实的样本,在 ImageNet 等大规模数据集上尤为明显。在本文中,我们提出了一种新方法 AdvDiff,利用扩散模型生成无限制对抗样本。我们设计了两种新的对抗引导技术,在扩散模型的反向生成过程中进行对抗采样;这两种技术通过可解释地融合目标分类器的梯度,能够稳定地生成高质量、逼真的对抗样本。在 MNIST 和 ImageNet 数据集上的实验结果表明,AdvDiff 能有效生成无限制对抗样本,其攻击性能与生成质量均优于基于 GAN 的方法。
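
A hedged, classifier-guidance-style sketch of adversarial guidance during reverse diffusion: at each denoising step the intermediate sample is nudged by the gradient of the target classifier's log-probability for the adversarial label. The denoiser, classifier, and `scale` below are placeholders; the paper's two specific guidance techniques and schedules are not reproduced.

```python
# Adversarially guided sampling sketch (denoiser and classifier are stand-ins).
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in target model

def denoise_step(x_t, t):
    # Placeholder for one reverse-diffusion step of a pretrained diffusion model.
    return x_t - 0.01 * torch.randn_like(x_t)

def adversarial_guided_step(x_t, t, target_label, scale=2.0):
    x_t = x_t.detach().requires_grad_(True)
    log_prob = classifier(x_t).log_softmax(dim=-1)[:, target_label].sum()
    grad = torch.autograd.grad(log_prob, x_t)[0]
    # Push the sample toward being classified as `target_label` while denoising.
    return denoise_step(x_t.detach() + scale * grad, t)

x = torch.randn(1, 3, 32, 32)
for t in reversed(range(10)):
    x = adversarial_guided_step(x, t, target_label=3)
print(x.shape)
```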

Rethinking Data Distillation: Do Not Overlook Calibration

  • paper_url: http://arxiv.org/abs/2307.12463
  • repo_url: https://github.com/dongyaozhu/calibrate-networks-trained-on-distilled-datasets
  • paper_authors: Dongyao Zhu, Bowen Lei, Jie Zhang, Yanbo Fang, Ruqi Zhang, Yiqun Xie, Dongkuan Xu
  • for: 本文旨在解决在蒸馏数据上训练的神经网络经常产生过度自信输出、且难以用现有校准方法修正的问题。
  • methods: 本文提出了 Masked Temperature Scaling(MTS)和 Masked Distillation Training(MDT)两种方法,用以缓解蒸馏数据带来的局限,使网络可以被有效校准。
  • results: 实验结果表明,MTS 和 MDT 能够有效校准在蒸馏数据上训练的神经网络,取得更好的校准效果,同时保持数据集蒸馏的效率。
    Abstract Neural networks trained on distilled data often produce over-confident output and require correction by calibration methods. Existing calibration methods such as temperature scaling and mixup work well for networks trained on original large-scale data. However, we find that these methods fail to calibrate networks trained on data distilled from large source datasets. In this paper, we show that distilled data lead to networks that are not calibratable due to (i) a more concentrated distribution of the maximum logits and (ii) the loss of information that is semantically meaningful but unrelated to classification tasks. To address this problem, we propose Masked Temperature Scaling (MTS) and Masked Distillation Training (MDT) which mitigate the limitations of distilled data and achieve better calibration results while maintaining the efficiency of dataset distillation.
    摘要 在蒸馏数据上训练的神经网络往往产生过度自信的输出,需要借助校准方法进行修正。现有的校准方法(如温度缩放和 mixup)对在原始大规模数据上训练的网络效果良好,但我们发现,这些方法无法校准在由大型源数据集蒸馏而来的数据上训练的网络。本文指出,蒸馏数据导致网络不可校准的原因有二:(一)最大 logit 的分布更加集中;(二)丢失了与分类任务无关但具有语义意义的信息。为了解决这一问题,我们提出了 Masked Temperature Scaling(MTS)和 Masked Distillation Training(MDT),它们缓解了蒸馏数据的上述局限,在保持数据集蒸馏效率的同时取得了更好的校准效果。
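
For context, a sketch of ordinary temperature scaling on held-out logits, the calibration baseline the paper starts from; Masked Temperature Scaling modifies this procedure for networks trained on distilled data, and that masking step is not shown here.

```python
# Plain temperature scaling sketch: fit a single temperature on validation logits.
import torch
import torch.nn.functional as F

val_logits = torch.randn(512, 10)                    # logits of a trained network on a val set
val_labels = torch.randint(0, 10, (512,))

log_T = torch.zeros(1, requires_grad=True)           # optimize log T so that T stays positive
opt = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

def closure():
    opt.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
    loss.backward()
    return loss

opt.step(closure)
T = float(log_T.exp())
print("calibration temperature:", T)
calibrated_probs = (val_logits / T).softmax(dim=-1)  # use the fitted T at inference time
```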

Robust face anti-spoofing framework with Convolutional Vision Transformer

  • paper_url: http://arxiv.org/abs/2307.12459
  • repo_url: None
  • paper_authors: Yunseung Lee, Youngjun Kwak, Jinho Shin
  • for: 本研究旨在提高人脸验证过程中的防御性能,对抗真实的演示攻击。
  • methods: 本研究使用自注意力与卷积层对人脸图像同时进行全局与局部学习,以提高人脸防伪检测对域偏移的鲁棒性。
  • results: 该模型在不同数据集中的跨域设定中表现出了7.3%$p$和12.9%$p$的提升,并在九个参考模型中得到了最高的平均排名。
    Abstract Owing to the advances in image processing technology and large-scale datasets, companies have implemented facial authentication processes, thereby stimulating increased focus on face anti-spoofing (FAS) against realistic presentation attacks. Recently, various attempts have been made to improve face recognition performance using both global and local learning on face images; however, to the best of our knowledge, this is the first study to investigate whether the robustness of FAS against domain shifts is improved by considering global information and local cues in face images captured using self-attention and convolutional layers. This study proposes a convolutional vision transformer-based framework that achieves robust performance for various unseen domain data. Our model resulted in 7.3%$p$ and 12.9%$p$ increases in FAS performance compared to models using only a convolutional neural network or vision transformer, respectively. It also shows the highest average rank in sub-protocols of cross-dataset setting over the other nine benchmark models for domain generalization.
    摘要 随着图像处理技术和大规模数据集的发展,许多公司已经部署了人脸认证流程,这也促使人们更加关注能够抵御真实呈现攻击的人脸防伪(FAS)技术。近来,已有多种尝试同时利用人脸图像的全局信息与局部线索来提升人脸识别性能;但据我们所知,本文是首个研究通过自注意力层与卷积层结合全局信息与局部线索,是否能提升 FAS 对域偏移鲁棒性的工作。本研究提出了一种基于卷积视觉 Transformer 的框架,在各种未见域数据上取得了鲁棒的性能。与仅使用卷积神经网络或仅使用视觉 Transformer 的模型相比,我们的模型分别带来 7.3%p 和 12.9%p 的 FAS 性能提升;在跨数据集设置的各个子协议中,它在九个基准模型中也取得了最高的平均排名。

EnTri: Ensemble Learning with Tri-level Representations for Explainable Scene Recognition

  • paper_url: http://arxiv.org/abs/2307.12442
  • repo_url: None
  • paper_authors: Amirhossein Aminimehr, Amirali Molaei, Erik Cambria
  • for: 提高场景识别的可解释性,同时提升分类精度。
  • methods: 使用集成学习,在像素级、语义分割级以及物体类别与频次级三个层次上编码视觉特征,并结合不同复杂度的特征编码策略;同时设计扩展算法生成视觉与文本解释。
  • results: 在 MIT67、SUN397 和 UIUC8 数据集上分别取得 87.69%、75.56% 和 99.17% 的分类精度,与最先进方法相比具有竞争力。
    Abstract Scene recognition based on deep-learning has made significant progress, but there are still limitations in its performance due to challenges posed by inter-class similarities and intra-class dissimilarities. Furthermore, prior research has primarily focused on improving classification accuracy, yet it has given less attention to achieving interpretable, precise scene classification. Therefore, we are motivated to propose EnTri, an ensemble scene recognition framework that employs ensemble learning using a hierarchy of visual features. EnTri represents features at three distinct levels of detail: pixel-level, semantic segmentation-level, and object class and frequency level. By incorporating distinct feature encoding schemes of differing complexity and leveraging ensemble strategies, our approach aims to improve classification accuracy while enhancing transparency and interpretability via visual and textual explanations. To achieve interpretability, we devised an extension algorithm that generates both visual and textual explanations highlighting various properties of a given scene that contribute to the final prediction of its category. This includes information about objects, statistics, spatial layout, and textural details. Through experiments on benchmark scene classification datasets, EnTri has demonstrated superiority in terms of recognition accuracy, achieving competitive performance compared to state-of-the-art approaches, with an accuracy of 87.69%, 75.56%, and 99.17% on the MIT67, SUN397, and UIUC8 datasets, respectively.
    摘要 基于深度学习的场景识别已取得显著进展,但由于类间相似与类内差异带来的挑战,其性能仍存在局限;此外,已有研究主要着眼于提升分类精度,较少关注可解释、精确的场景分类。为此,我们提出了 EnTri:一个利用层次化视觉特征进行集成学习的场景识别框架。EnTri 在三个不同的细节层次上表示特征:像素级、语义分割级,以及物体类别与频次级。通过结合不同复杂度的特征编码方案并利用集成策略,我们的方法旨在提升分类精度,同时借助视觉与文本解释增强透明性与可解释性。为了实现可解释性,我们设计了一种扩展算法,能够生成视觉和文本两种解释,突出场景中对最终类别预测有贡献的各种属性,包括物体、统计信息、空间布局和纹理细节等。在基准场景分类数据集上的实验表明,EnTri 在识别精度上具有优势,与最先进方法相比表现有竞争力,在 MIT67、SUN397 和 UIUC8 数据集上分别取得 87.69%、75.56% 和 99.17% 的精度。

SwIPE: Efficient and Robust Medical Image Segmentation with Implicit Patch Embeddings

  • paper_url: http://arxiv.org/abs/2307.12429
  • repo_url: None
  • paper_authors: Yejia Zhang, Pengfei Gu, Nishchal Sapkota, Danny Z. Chen
  • for: 这项研究旨在提出一种新的医学图像分割方法,改进现有基于离散表示的方法,在获得更精细的局部边界刻画的同时保持全局形状的一致性。
  • methods: 该方法利用隐式神经表示(INR)学习连续表示,并在图块(patch)级别——而非点级别或整图级别——预测形状,从而兼顾局部边界的精细刻画与全局形状的一致性。
  • results: 实验表明,该方法在 2D 息肉分割和 3D 腹部器官分割两项任务上显著优于近期的隐式方法,并以明显更少的参数量超过最先进的离散方法;同时展现出更高的数据效率,以及对图像分辨率与数据集变化的更强鲁棒性。
    Abstract Modern medical image segmentation methods primarily use discrete representations in the form of rasterized masks to learn features and generate predictions. Although effective, this paradigm is spatially inflexible, scales poorly to higher-resolution images, and lacks direct understanding of object shapes. To address these limitations, some recent works utilized implicit neural representations (INRs) to learn continuous representations for segmentation. However, these methods often directly adopted components designed for 3D shape reconstruction. More importantly, these formulations were also constrained to either point-based or global contexts, lacking contextual understanding or local fine-grained details, respectively--both critical for accurate segmentation. To remedy this, we propose a novel approach, SwIPE (Segmentation with Implicit Patch Embeddings), that leverages the advantages of INRs and predicts shapes at the patch level--rather than at the point level or image level--to enable both accurate local boundary delineation and global shape coherence. Extensive evaluations on two tasks (2D polyp segmentation and 3D abdominal organ segmentation) show that SwIPE significantly improves over recent implicit approaches and outperforms state-of-the-art discrete methods with over 10x fewer parameters. Our method also demonstrates superior data efficiency and improved robustness to data shifts across image resolutions and datasets. Code is available on Github.
    摘要 现代医学图像分割方法主要使用栅格化掩码(rasterized masks)形式的离散表示来学习特征和生成预测。虽然有效,但这种方法在空间上不够灵活,难以扩展到高分辨率图像,并且缺乏对物体形状的直接理解。为了解决这些限制,一些最近的研究使用了隐式神经表示(INR)来学习连续表示。然而,这些方法通常直接采用了为三维形状重建设计的组件,而且受限于点级或全局上下文,分别缺乏上下文理解或局部细节,而两者对精确分割都至关重要。为了改善这个问题,我们提出了一种新的方法:SwIPE(Segmentation with Implicit Patch Embeddings),它利用INR的优点,在图像块(patch)级别(而不是点级或整图级)预测形状,以实现准确的局部边界刻画和全局形状一致。我们对两个任务(2D息肉分割和3D腹部器官分割)进行了广泛的评估,结果表明SwIPE显著优于最近的隐式方法,并以少于十分之一的参数量超过了最先进的离散方法。我们的方法还在不同图像分辨率和数据集的偏移下表现出更好的数据效率和鲁棒性。代码可以在Github上获取。
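To make the "patch-level implicit prediction" idea concrete, here is a minimal sketch of a patch-conditioned implicit decoder: given a patch embedding and a continuous coordinate inside that patch, it predicts per-class occupancy at that point. The layer sizes, the conditioning by concatenation, and the two-layer MLP are illustrative assumptions, not SwIPE's exact architecture.

```python
import torch
import torch.nn as nn

class PatchImplicitDecoder(nn.Module):
    """Patch-conditioned implicit decoder: (patch embedding, continuous
    coordinate in [-1, 1]) -> per-class occupancy logits at that point."""
    def __init__(self, embed_dim=128, coord_dim=2, num_classes=2, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, patch_embed, coords):
        # patch_embed: (B, embed_dim); coords: (B, N, coord_dim)
        cond = patch_embed.unsqueeze(1).expand(-1, coords.shape[1], -1)
        return self.mlp(torch.cat([cond, coords], dim=-1))  # (B, N, num_classes)

decoder = PatchImplicitDecoder()
logits = decoder(torch.randn(4, 128), torch.rand(4, 512, 2) * 2 - 1)
print(logits.shape)  # torch.Size([4, 512, 2])
```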

Augmented Box Replay: Overcoming Foreground Shift for Incremental Object Detection

  • paper_url: http://arxiv.org/abs/2307.12427
  • repo_url: https://github.com/YuyangSunshine/ABR_IOD
  • paper_authors: Liu Yuyang, Cong Yang, Goswami Dipam, Liu Xialei, Joost van de Weijer
  • for: 这篇论文的目的是解决增量目标检测(incremental object detection, IOD)中的灾难性遗忘(catastrophic forgetting)问题。
  • methods: 本文提出了一个称为Augmented Box Replay(ABR)的新方法,它仅储存和重播过去任务中的前景物体,从而避免前景偏移(foreground shift)问题。此外,本文也提出了一种创新的Attentive RoI Distillation损失,它利用感兴趣区域(RoI)特征的空间注意力,约束当前模型关注旧模型中最重要的信息。
  • results: ABR有效地降低了对之前类别的遗忘,同时保持了当前类别的高可塑性。与标准的图像重播相比,ABR还显著降低了储存需求。实验结果表明,本文的模型在Pascal-VOC和COCO数据集上达到了最先进的表现。
    Abstract In incremental learning, replaying stored samples from previous tasks together with current task samples is one of the most efficient approaches to address catastrophic forgetting. However, unlike incremental classification, image replay has not been successfully applied to incremental object detection (IOD). In this paper, we identify the overlooked problem of foreground shift as the main reason for this. Foreground shift only occurs when replaying images of previous tasks and refers to the fact that their background might contain foreground objects of the current task. To overcome this problem, a novel and efficient Augmented Box Replay (ABR) method is developed that only stores and replays foreground objects and thereby circumvents the foreground shift problem. In addition, we propose an innovative Attentive RoI Distillation loss that uses spatial attention from region-of-interest (RoI) features to constrain current model to focus on the most important information from old model. ABR significantly reduces forgetting of previous classes while maintaining high plasticity in current classes. Moreover, it considerably reduces the storage requirements when compared to standard image replay. Comprehensive experiments on Pascal-VOC and COCO datasets support the state-of-the-art performance of our model.
    摘要 在增量学习中,将之前任务的存储样本与当前任务样本一起重播,是解决灾难性遗忘(catastrophic forgetting)最有效的方法之一。然而,与增量分类不同,图像重播在增量物体检测(IOD)中尚未得到成功。在这篇论文中,我们指出被忽视的前景偏移(foreground shift)问题是其主要原因。前景偏移仅发生在重播之前任务的图像时,指的是这些图像的背景中可能包含当前任务的前景对象。为解决这个问题,我们开发了一种新的、高效的增广盒子重播(Augmented Box Replay, ABR)方法,该方法仅存储和重播前景对象,因此可以避免前景偏移问题。此外,我们提出了一种创新的Attentive RoI Distillation损失,该损失利用感兴趣区域(RoI)特征的空间注意力,约束当前模型关注旧模型中最重要的信息。ABR显著减少了对之前类别的遗忘,同时保持当前类别的高可塑性。此外,它比标准图像重播显著减少了存储需求。我们在Pascal-VOC和COCO数据集上进行了广泛的实验,证明了我们模型的最先进表现。
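A minimal sketch of the box-replay idea is shown below: stored foreground crops from earlier tasks are pasted into a current-task image and its annotations are extended accordingly. Pasting whole rectangular crops at random positions is a simplification of ABR's mixing strategy, and all names and sizes here are illustrative assumptions.

```python
import numpy as np

def augmented_box_replay(image, boxes, labels, replay_crops, seed=0):
    """Paste stored foreground crops (crop_array, class_id) from previous
    tasks into a current-task image at random positions and extend its box
    annotations, so only foreground objects need to be stored and replayed."""
    h, w = image.shape[:2]
    out = image.copy()
    boxes, labels = list(boxes), list(labels)
    rng = np.random.default_rng(seed)
    for crop, cls in replay_crops:
        ch, cw = crop.shape[:2]
        if ch > h or cw > w:
            continue  # skip crops larger than the target image
        y = rng.integers(0, h - ch + 1)
        x = rng.integers(0, w - cw + 1)
        out[y:y + ch, x:x + cw] = crop
        boxes.append([x, y, x + cw, y + ch])
        labels.append(cls)
    return out, np.array(boxes), np.array(labels)

img = np.zeros((480, 640, 3), dtype=np.uint8)
mixed, bxs, lbs = augmented_box_replay(img, [[10, 10, 100, 100]], [3],
                                        [(np.full((64, 64, 3), 255, np.uint8), 7)])
print(bxs, lbs)
```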

TransNet: Transparent Object Manipulation Through Category-Level Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.12400
  • repo_url: None
  • paper_authors: Huijie Zhang, Anthony Opipari, Xiaotong Chen, Jiyue Zhu, Zeren Yu, Odest Chadwicke Jenkins
  • for: 本研究旨在提高自动化透明物体检测和操作系统的可靠性和精度,特别是针对透明物体进行类别级(category-level)位姿估计。
  • methods: 本研究提出了一种名为TransNet的两阶段管道,利用局部化深度补全(depth completion)和表面法向估计(surface normal estimation)来估计类别级透明物体位姿。
  • results: 对于一个大规模的透明物体 dataset,TransNet achieved improved pose estimation accuracy compared to a state-of-the-art category-level pose estimation approach. In addition, TransNet was used to build an autonomous transparent object manipulation system for robotic pick-and-place and pouring tasks, which demonstrated its effectiveness in real-world applications.
    Abstract Transparent objects present multiple distinct challenges to visual perception systems. First, their lack of distinguishing visual features makes transparent objects harder to detect and localize than opaque objects. Even humans find certain transparent surfaces with little specular reflection or refraction, like glass doors, difficult to perceive. A second challenge is that depth sensors typically used for opaque object perception cannot obtain accurate depth measurements on transparent surfaces due to their unique reflective properties. Stemming from these challenges, we observe that transparent object instances within the same category, such as cups, look more similar to each other than to ordinary opaque objects of that same category. Given this observation, the present paper explores the possibility of category-level transparent object pose estimation rather than instance-level pose estimation. We propose TransNet, a two-stage pipeline that estimates category-level transparent object pose using localized depth completion and surface normal estimation. TransNet is evaluated in terms of pose estimation accuracy on a large-scale transparent object dataset and compared to a state-of-the-art category-level pose estimation approach. Results from this comparison demonstrate that TransNet achieves improved pose estimation accuracy on transparent objects. Moreover, we use TransNet to build an autonomous transparent object manipulation system for robotic pick-and-place and pouring tasks.

Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

  • paper_url: http://arxiv.org/abs/2307.12392
  • repo_url: https://github.com/cv516buaa/ir-vg
  • paper_authors: Menghao Li, Chunlei Wang, Wenquan Feng, Shuchang Lyu, Guangliang Cheng, Xiangtai Li, Binghao Liu, Qi Zhao
  • for: 本研究旨在解决现有视觉定位(Visual Grounding)方法的问题,即在给定不准确或无关描述时生成假阳性对象。
  • methods: 本研究提出了一种Iterative Robust Visual Grounding(IR-VG)框架,包括迭代多层视觉-语言融合(IMVF)和基于掩码参考的中心点监督(MRCS)等技术,以提高对描述的对齐和对图像细粒度特征的捕捉。
  • results: 对五个常规视觉定位数据集和两个新构建的鲁棒视觉定位数据集进行了广泛的实验,取得了新的最佳性能,相比之前的最佳方法在两个新提出的鲁棒视觉定位数据集上分别提高了25%和10%。此外,该方法在五个常规视觉定位数据集上也验证了其有效性。
    Abstract Visual Grounding (VG) aims at localizing target objects from an image based on given expressions and has made significant progress with the development of detection and vision transformer. However, existing VG methods tend to generate false-alarm objects when presented with inaccurate or irrelevant descriptions, which commonly occur in practical applications. Moreover, existing methods fail to capture fine-grained features, accurate localization, and sufficient context comprehension from the whole image and textual descriptions. To address both issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS). The framework introduces iterative multi-level vision-language fusion (IMVF) for better alignment. We use MRCS to achieve more accurate localization with point-wise feature supervision. Then, to improve the robustness of VG, we also present a multi-stage false-alarm sensitive decoder (MFSD) to prevent the generation of false-alarm objects when presented with inaccurate expressions. The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets. Extensive experiments demonstrate that IR-VG achieves new state-of-the-art (SOTA) results, with improvements of 25\% and 10\% compared to existing SOTA approaches on the two newly proposed robust VG datasets. Moreover, the proposed framework is also verified effective on five regular VG datasets. Codes and models will be publicly available at https://github.com/cv516Buaa/IR-VG.

Assessing Intra-class Diversity and Quality of Synthetically Generated Images in a Biomedical and Non-biomedical Setting

  • paper_url: http://arxiv.org/abs/2308.02505
  • repo_url: None
  • paper_authors: Muhammad Muneeb Saad, Mubashir Husain Rehmani, Ruairi O’Reilly
  • for: This paper aims to evaluate the effectiveness of using Generative Adversarial Networks (GANs) for data augmentation in biomedical image analysis, and to investigate the impact of different sample sizes on the diversity and quality of synthetic images.
  • methods: The paper uses Multi-scale Structural Similarity Index Measure, Cosine Distance, and Frechet Inception Distance to evaluate the diversity and quality of synthetic images generated by a Deep Convolutional GAN in both biomedical and non-biomedical imaging modalities.
  • results: The results show that the metrics scores for diversity and quality vary significantly across biomedical-to-biomedical and biomedical-to-non-biomedical imaging modalities, and that the diversity and quality of synthetic images are affected by the sample size used for training the GAN.
    Abstract In biomedical image analysis, data imbalance is common across several imaging modalities. Data augmentation is one of the key solutions in addressing this limitation. Generative Adversarial Networks (GANs) are increasingly being relied upon for data augmentation tasks. Biomedical image features are sensitive to evaluating the efficacy of synthetic images. These features can have a significant impact on metric scores when evaluating synthetic images across different biomedical imaging modalities. Synthetically generated images can be evaluated by comparing the diversity and quality of real images. Multi-scale Structural Similarity Index Measure and Cosine Distance are used to evaluate intra-class diversity, while Frechet Inception Distance is used to evaluate the quality of synthetic images. Assessing these metrics for biomedical and non-biomedical imaging is important to investigate an informed strategy in evaluating the diversity and quality of synthetic images. In this work, an empirical assessment of these metrics is conducted for the Deep Convolutional GAN in a biomedical and non-biomedical setting. The diversity and quality of synthetic images are evaluated using different sample sizes. This research intends to investigate the variance in diversity and quality across biomedical and non-biomedical imaging modalities. Results demonstrate that the metrics scores for diversity and quality vary significantly across biomedical-to-biomedical and biomedical-to-non-biomedical imaging modalities.
    摘要 在生物医学影像分析中,数据不均衡是广泛存在的问题,而数据扩充是解决这个问题的关键方法之一。生成对抗网络(GANs)在数据扩充任务中得到了越来越多的应用。生物医学影像特征对于评估合成图像的效果非常敏感,这些特征在评估不同生物医学成像模式下的合成图像时会对指标分数产生显著影响。合成图像可以通过与真实图像比较多样性和质量来评估:多尺度结构相似度指标(MS-SSIM)和余弦距离用于评估类内多样性,而Frechet Inception Distance(FID)则用于评估合成图像的质量。为了制定评估合成图像多样性和质量的合理策略,需要在生物医学和非生物医学成像上考察这些指标。本研究对深度卷积GAN在生物医学和非生物医学设置下的这些指标进行了实证评估,并使用不同的样本大小来评估多样性和质量的变化。结果表明,在生物医学到生物医学以及生物医学到非生物医学的成像模式之间,多样性和质量的指标分数差异显著。
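The two evaluation ingredients named above have simple closed forms once image features are available. The sketch below computes FID from the Gaussian statistics of two feature sets and an average pairwise cosine distance as an intra-class diversity score; in practice the features would come from an Inception-style network and MS-SSIM from an image-metrics library, both of which are assumptions not shown here.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.spatial.distance import cosine

def frechet_distance(feats_real, feats_fake):
    """FID between two feature sets of shape (N, D):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c_r @ c_f)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(c_r + c_f - 2.0 * covmean))

def mean_pairwise_cosine_distance(feats):
    """Average pairwise cosine distance within one class: higher values
    indicate more intra-class diversity among generated samples."""
    n = len(feats)
    dists = [cosine(feats[i], feats[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

rng = np.random.default_rng(0)
real, fake = rng.normal(size=(200, 64)), rng.normal(loc=0.1, size=(200, 64))
print(frechet_distance(real, fake), mean_pairwise_cosine_distance(fake[:50]))
```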

cs.AI - 2023-07-24

QAmplifyNet: Pushing the Boundaries of Supply Chain Backorder Prediction Using Interpretable Hybrid Quantum - Classical Neural Network

  • paper_url: http://arxiv.org/abs/2307.12906
  • repo_url: None
  • paper_authors: Md Abrar Jahin, Md Sakib Hossain Shovon, Md. Saiful Islam, Jungpil Shin, M. F. Mridha, Yuichi Okuyama
  • for: 这个研究是为了改进供应链管理中的缺货订单(backorder)预测,以便优化库存控制、降低成本和提高顾客满意度。
  • methods: 本研究提出了一个新的方法论框架,使用量子启发技术,在量子-经典混合神经网络中预测缺货订单。
  • results: 实验评估表明,QAmplifyNet模型在短小且不平衡的数据集上预测缺货订单的性能优于经典模型、量子集成、量子神经网络和深度强化学习模型。此外,研究还使用可解释人工智能技术提高了模型的可解释性。实际应用中,QAmplifyNet模型可以实现有效的库存控制、减少缺货订单并提高运营效率。未来的工作包括进一步探索量子启发技术、扩大数据集和探索其他供应链应用。
    Abstract Supply chain management relies on accurate backorder prediction for optimizing inventory control, reducing costs, and enhancing customer satisfaction. However, traditional machine-learning models struggle with large-scale datasets and complex relationships, hindering real-world data collection. This research introduces a novel methodological framework for supply chain backorder prediction, addressing the challenge of handling large datasets. Our proposed model, QAmplifyNet, employs quantum-inspired techniques within a quantum-classical neural network to predict backorders effectively on short and imbalanced datasets. Experimental evaluations on a benchmark dataset demonstrate QAmplifyNet's superiority over classical models, quantum ensembles, quantum neural networks, and deep reinforcement learning. Its proficiency in handling short, imbalanced datasets makes it an ideal solution for supply chain management. To enhance model interpretability, we use Explainable Artificial Intelligence techniques. Practical implications include improved inventory control, reduced backorders, and enhanced operational efficiency. QAmplifyNet seamlessly integrates into real-world supply chain management systems, enabling proactive decision-making and efficient resource allocation. Future work involves exploring additional quantum-inspired techniques, expanding the dataset, and investigating other supply chain applications. This research unlocks the potential of quantum computing in supply chain optimization and paves the way for further exploration of quantum-inspired machine learning models in supply chain management. Our framework and QAmplifyNet model offer a breakthrough approach to supply chain backorder prediction, providing superior performance and opening new avenues for leveraging quantum-inspired techniques in supply chain management.
    摘要 供应链管理需要准确预测缺货订单,以优化库存控制、降低成本和提高客户满意度。然而,传统的机器学习模型难以处理大规模数据和复杂关系,阻碍了在真实数据上的应用。本研究提出了一种新的方法论框架,用于供应链缺货订单预测,解决了处理大规模数据的挑战。我们提出的模型QAmplifyNet利用量子启发技术,在量子-经典神经网络中对短小且不平衡的数据集有效地预测缺货订单。在基准数据集上的实验评估表明,QAmplifyNet优于经典模型、量子集成、量子神经网络和深度强化学习。其处理短小、不平衡数据集的能力使其非常适合供应链管理。为提高模型可解释性,我们使用了可解释人工智能技术。实际意义包括改善库存控制、减少缺货订单和提高运营效率。QAmplifyNet可无缝集成到实际供应链管理系统中,支持主动决策和有效的资源分配。未来工作包括进一步探索量子启发技术、扩大数据集和探索其他供应链应用。本研究释放了量子计算在供应链优化中的潜力,为量子启发机器学习模型在供应链管理中的进一步探索开辟了道路。我们的框架和QAmplifyNet模型为缺货订单预测提供了一种突破性方法,性能更优,并为在供应链管理中利用量子启发技术开启了新的途径。
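The abstract does not detail the circuit, so the sketch below only illustrates the generic hybrid quantum-classical pattern: classical features are angle-encoded into a small variational circuit whose expectation values feed a classical logistic readout. It is written against PennyLane's standard templates; the ansatz, qubit count, and readout are assumptions for illustration, not QAmplifyNet's actual design.

```python
import pennylane as qml
import numpy as np

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_layer(features, weights):
    # Encode classical features as rotation angles, apply a trainable
    # entangling ansatz, then read out Pauli-Z expectation values.
    qml.AngleEmbedding(features, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

def hybrid_predict(x, q_weights, w_out, b_out):
    """Quantum layer followed by a classical logistic readout producing a
    binary backorder probability (illustrative readout, not the paper's)."""
    z = np.array(quantum_layer(x, q_weights))
    return 1.0 / (1.0 + np.exp(-(z @ w_out + b_out)))

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
rng = np.random.default_rng(0)
p = hybrid_predict(rng.uniform(0, np.pi, n_qubits), rng.normal(size=shape),
                   rng.normal(size=n_qubits), 0.0)
print(p)  # backorder probability for one random feature vector
```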

Towards Bridging the FL Performance-Explainability Trade-Off: A Trustworthy 6G RAN Slicing Use-Case

  • paper_url: http://arxiv.org/abs/2307.12903
  • repo_url: None
  • paper_authors: Swastika Roy, Hatim Chergui, Christos Verikoukis
  • for: sixth-generation (6G) networks 与多元网络slice 共存下,AI驱动的零touch管理和orchestration (MANO) 成为重要的。但是,确保AI黑盒子在实际应用中的可靠性是问题。
  • methods: this paper presents a novel explanation-guided in-hoc federated learning (FL) approach, which combines a constrained resource allocation model and an explainer exchange in a closed loop (CL) fashion to achieve transparent 6G network slicing resource management in a RAN-Edge setup under non-independent identically distributed (non-IID) datasets.
  • results: the proposed approach achieves a balance between AI performance and explainability, and outperforms the unconstrained Integrated-Gradient post-hoc FL baseline in terms of faithfulness of explanations and overall training process.
    Abstract In the context of sixth-generation (6G) networks, where diverse network slices coexist, the adoption of AI-driven zero-touch management and orchestration (MANO) becomes crucial. However, ensuring the trustworthiness of AI black-boxes in real deployments is challenging. Explainable AI (XAI) tools can play a vital role in establishing transparency among the stakeholders in the slicing ecosystem. But there is a trade-off between AI performance and explainability, posing a dilemma for trustworthy 6G network slicing because the stakeholders require both highly performing AI models for efficient resource allocation and explainable decision-making to ensure fairness, accountability, and compliance. To balance this trade off and inspired by the closed loop automation and XAI methodologies, this paper presents a novel explanation-guided in-hoc federated learning (FL) approach where a constrained resource allocation model and an explainer exchange -- in a closed loop (CL) fashion -- soft attributions of the features as well as inference predictions to achieve a transparent 6G network slicing resource management in a RAN-Edge setup under non-independent identically distributed (non-IID) datasets. In particular, we quantitatively validate the faithfulness of the explanations via the so-called attribution-based confidence metric that is included as a constraint to guide the overall training process in the run-time FL optimization task. In this respect, Integrated-Gradient (IG) as well as Input $\times$ Gradient and SHAP are used to generate the attributions for our proposed in-hoc scheme, wherefore simulation results under different methods confirm its success in tackling the performance-explainability trade-off and its superiority over the unconstrained Integrated-Gradient post-hoc FL baseline.
    摘要 在第六代(6G)网络中,多种网络切片共存的情况下,采用AI驱动的零接触管理与编排(MANO)变得非常重要。然而,在实际部署中确保AI黑盒的可信性是一项挑战。可解释AI(XAI)工具可以在切片生态系统的各利益方之间建立透明度,但AI性能与可解释性之间存在权衡,这给可信的6G网络切片带来了两难:各方既需要高性能的AI模型来高效分配资源,又需要可解释的决策以确保公平、问责与合规。为了平衡这一权衡,受闭环自动化和XAI方法的启发,我们提出了一种新的解释引导的in-hoc联邦学习(FL)方法:一个受约束的资源分配模型和一个解释器以闭环(CL)方式交换特征的软归因和推理预测结果,以在非独立同分布(non-IID)数据集下、在RAN-Edge场景中实现透明的6G网络切片资源管理。特别是,我们通过基于归因的置信度指标定量验证解释的忠实性,并将其作为约束纳入运行时FL优化的整体训练过程。我们使用Integrated-Gradient(IG)、Input × Gradient和SHAP等方法为所提出的in-hoc方案生成归因,不同方法下的仿真结果证明了该方法成功地解决了性能-可解释性权衡,并优于不受约束的Integrated-Gradient post-hoc FL基线。

As Time Goes By: Adding a Temporal Dimension Towards Resolving Delegations in Liquid Democracy

  • paper_url: http://arxiv.org/abs/2307.12898
  • repo_url: None
  • paper_authors: Evangelos Markakis, Georgios Papasotiropoulos
  • for: This paper aims to integrate a time horizon into decision-making problems in Liquid Democracy systems to enhance participation.
  • methods: The paper uses temporal graph theory to analyze the computational complexity of Liquid Democracy systems with a time horizon.
  • results: The paper shows that adding a time horizon can increase the number of possible delegation paths and reduce the loss of votes due to delegation cycles or abstaining agents, ultimately enhancing participation in Liquid Democracy systems.
    Abstract In recent years, the study of various models and questions related to Liquid Democracy has been of growing interest among the community of Computational Social Choice. A concern that has been raised, is that current academic literature focuses solely on static inputs, concealing a key characteristic of Liquid Democracy: the right for a voter to change her mind as time goes by, regarding her options of whether to vote herself or delegate her vote to other participants, till the final voting deadline. In real life, a period of extended deliberation preceding the election-day motivates voters to adapt their behaviour over time, either based on observations of the remaining electorate or on information acquired for the topic at hand. By adding a temporal dimension to Liquid Democracy, such adaptations can increase the number of possible delegation paths and reduce the loss of votes due to delegation cycles or delegating paths towards abstaining agents, ultimately enhancing participation. Our work takes a first step to integrate a time horizon into decision-making problems in Liquid Democracy systems. Our approach, via a computational complexity analysis, exploits concepts and tools from temporal graph theory which turn out to be convenient for our framework.
    摘要
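A small sketch of the temporal-delegation idea follows: each voter's choices are time-stamped, the latest action before the deadline applies, delegation chains are followed, and ballots caught in delegation cycles or chains ending in abstention are lost. These resolution semantics are one simple reading of the model described in the abstract; the data structures and tie-breaking are illustrative assumptions.

```python
def resolve_votes(choices, deadline):
    """choices maps voter -> list of (timestamp, action) where action is
    ('vote', option), ('delegate', other_voter) or ('abstain',).
    Returns (tally of options, number of lost ballots)."""
    latest = {}
    for voter, history in choices.items():
        valid = [(t, a) for t, a in history if t <= deadline]
        if valid:
            latest[voter] = max(valid)[1]   # most recent action before deadline

    tally, lost = {}, 0
    for voter in choices:
        seen, current = set(), voter
        while True:
            if current in seen or current not in latest:
                lost += 1                   # cycle, or chain hits an inactive voter
                break
            seen.add(current)
            action = latest[current]
            if action[0] == "vote":
                tally[action[1]] = tally.get(action[1], 0) + 1
                break
            if action[0] == "abstain":
                lost += 1
                break
            current = action[1]             # follow the delegation edge

    return tally, lost

choices = {
    "a": [(1, ("delegate", "b")), (3, ("vote", "yes"))],
    "b": [(2, ("delegate", "c"))],
    "c": [(2, ("delegate", "b"))],          # b and c form a delegation cycle
}
print(resolve_votes(choices, deadline=5))   # ({'yes': 1}, 2)
```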

Anytime Model Selection in Linear Bandits

  • paper_url: http://arxiv.org/abs/2307.12897
  • repo_url: None
  • paper_authors: Parnian Kassraie, Aldo Pacchiano, Nicolas Emmenegger, Andreas Krause
  • for: 本文研究了赌博机(bandit)优化中的模型选择问题,即不仅在动作选择上、也在模型选择上平衡探索与利用,以便从不同的模型中选出最佳的一个。
  • methods: 作者使用在线学习算法,将不同的模型当作专家进行处理。然而,现有方法的遗憾(regret)随模型数量 $M$ 以 $\text{poly}(M)$ 的速度增长。
  • results: 作者提出了ALEXP方法,其遗憾对 $M$ 仅有 $\log M$ 的依赖,并具有任意时间(anytime)保证。此外,ALEXP方法不需要知道时间范围(horizon)$n$,也不需要早期的纯探索阶段。
    Abstract Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat different models as experts. Existing methods, however, scale poorly ($\text{poly}M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, which has an exponentially improved ($\log M$) dependence on $M$ for its regret. ALEXP has anytime guarantees on its regret, and neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.
    摘要 在赌博机(bandit)优化的背景下,模型选择是一个具有挑战性的问题,因为它不仅需要为动作选择、还需要为模型选择平衡探索与利用。一个自然的方法是使用在线学习算法,将不同的模型当作专家。然而,现有方法的遗憾(regret)随模型数量 $M$ 以 $\text{poly}(M)$ 的速度增长。我们的关键发现是,对于线性赌博机中的模型选择,可以以有利的偏差-方差权衡为在线学习者模拟全信息反馈。这使我们得以开发 ALEXP,其遗憾对 $M$ 的依赖呈指数级改进(降为 $\log M$)。ALEXP 对其遗憾具有任意时间(anytime)保证,既不需要知道时间范围 $n$,也不依赖初始的纯探索阶段。我们的方法利用了对Lasso的一种新的时间一致(time-uniform)分析,建立了在线学习与高维统计之间的新联系。
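The log(M) dependence comes from treating candidate models as experts under exponential weighting. The sketch below is the classical exponentiated-weights (Hedge) scheme over M candidate models, shown only to illustrate where the log(M) scaling originates; it is not ALEXP's Lasso-based estimator, and the loss functions, learning rate, and toy data are assumptions.

```python
import numpy as np

def exponential_weights_model_selection(loss_fns, contexts, rewards, eta=0.5):
    """Hedge-style aggregation over M candidate models: weights are
    proportional to exp(-eta * cumulative loss), which yields a log(M)
    dependence of the regret on the number of models."""
    M = len(loss_fns)
    cum_loss = np.zeros(M)
    picks = []
    for x, r in zip(contexts, rewards):
        weights = np.exp(-eta * (cum_loss - cum_loss.min()))
        weights /= weights.sum()
        picks.append(int(np.argmax(weights)))            # currently favoured model
        cum_loss += np.array([f(x, r) for f in loss_fns])
    return picks, cum_loss

# Toy example: model 0 predicts r = x well, model 1 predicts a constant.
loss_fns = [lambda x, r: (r - x) ** 2, lambda x, r: (r - 0.5) ** 2]
xs = np.linspace(0, 1, 20)
rs = xs + 0.01 * np.random.default_rng(0).normal(size=20)
print(exponential_weights_model_selection(loss_fns, xs, rs))
```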

Interpretable Stereotype Identification through Reasoning

  • paper_url: http://arxiv.org/abs/2308.00071
  • repo_url: None
  • paper_authors: Jacob-Junqi Tian, Omkar Dige, David Emerson, Faiza Khan Khattak
  • for: 这篇研究的目的是探讨语言模型中的偏见,并在其开发中整合公平性,以确保这些模型无偏见且公平。
  • methods: 本研究基于Vicuna-13B-v1.3进行零样本(zero-shot)刻板印象识别,并评估了从13B扩展到33B的影响。
  • results: 研究发现,引入推理带来的改善超过了从13B扩展到33B带来的改善,这表明推理可能是使语言模型在域外任务上超越缩放定律的关键因素。此外,通过对部分推理轨迹的定性分析,我们展示了推理不仅能提高准确率,还能提高决策的可解释性。
    Abstract Given that language models are trained on vast datasets that may contain inherent biases, there is a potential danger of inadvertently perpetuating systemic discrimination. Consequently, it becomes essential to examine and address biases in language models, integrating fairness into their development to ensure these models are equitable and free from bias. In this work, we demonstrate the importance of reasoning in zero-shot stereotype identification based on Vicuna-13B-v1.3. While we do observe improved accuracy by scaling from 13B to 33B, we show that the performance gain from reasoning significantly exceeds the gain from scaling up. Our findings suggest that reasoning could be a key factor that enables LLMs to transcend the scaling law on out-of-domain tasks such as stereotype identification. Additionally, through a qualitative analysis of select reasoning traces, we highlight how reasoning enhances not just accuracy but also the interpretability of the decision.
    摘要 由于语言模型是在可能含有固有偏见的海量数据集上训练的,它们有在无意中延续系统性歧视的风险。因此,检查和解决语言模型中的偏见、并在开发中融入公平性,以确保这些模型公正无偏,变得非常重要。在这项工作中,我们展示了推理在基于 Vicuna-13B-v1.3 的零样本刻板印象识别中的重要性。虽然我们确实观察到了从 13B 扩展到 33B 带来的准确率提升,但我们发现,推理带来的性能提升远超过扩展带来的提升。我们的发现表明,推理可能是使LLM在刻板印象识别等域外任务上超越缩放定律的关键因素。此外,通过对部分推理轨迹的定性分析,我们指出了推理不仅能提高准确率,还能提高决策的可解释性。

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

  • paper_url: http://arxiv.org/abs/2307.12856
  • repo_url: None
  • paper_authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
  • for: 这篇论文的目的是提出一种基于大语言模型(LLM)的自主网络浏览器,可以根据自然语言指令完成真实网站上的任务。
  • methods: 这篇论文使用了Flan-U-PaLM和HTML-T5两种大语言模型:Flan-U-PaLM用于有依据的(grounded)代码生成,HTML-T5用于长HTML文档的规划与摘要;HTML-T5采用局部和全局注意力机制,并使用长跨度去噪目标的混合进行预训练。
  • results: 论文的实验表明,使用WebAgent可以在真实网站上提高成功率超过50%,并且HTML-T5模型在解决HTML基于任务上的精度高于之前的SoTA。
    Abstract Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web navigation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that can complete the tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via generated Python programs from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our recipe improves the success on a real website by over 50%, and that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9% higher success rate than prior SoTA on the MiniWoB web navigation benchmark and better accuracy on offline task planning evaluation.

EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge: Mixed Sequences Prediction

  • paper_url: http://arxiv.org/abs/2307.12837
  • repo_url: None
  • paper_authors: Amirshayan Nasirimajd, Simone Alberto Peirone, Chiara Plizzari, Barbara Caputo
  • For: 本研究是为了解决Unsupervised Domain Adaptation (UDA) Challenge in Action Recognition 中的问题。* Methods: 我们采用了一种基于序列的方法,即将源频率和目标频率 randomly combine 生成一个修改后的序列,然后使用标准的 pseudo-labeling 策略提取目标频率中的动作标签。* Results: 我们的提交(名为 ‘sshayan’)可以在领导人员中找到,目前在 ‘verb’ 和 ‘noun’ 两个分类中排名第二和第四。
    Abstract This report presents the technical details of our approach for the EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action Recognition. Our approach is based on the idea that the order in which actions are performed is similar between the source and target domains. Based on this, we generate a modified sequence by randomly combining actions from the source and target domains. As only unlabelled target data are available under the UDA setting, we use a standard pseudo-labeling strategy for extracting action labels for the target. We then ask the network to predict the resulting action sequence. This allows to integrate information from both domains during training and to achieve better transfer results on target. Additionally, to better incorporate sequence information, we use a language model to filter unlikely sequences. Lastly, we employed a co-occurrence matrix to eliminate unseen combinations of verbs and nouns. Our submission, labeled as 'sshayan', can be found on the leaderboard, where it currently holds the 2nd position for 'verb' and the 4th position for both 'noun' and 'action'.
    摘要 这份报告介绍了我们在EPIC-Kitchens-100无监督域自适应(UDA)动作识别挑战赛中所用方法的技术细节。我们的方法基于源域和目标域中动作执行顺序相似这一假设。据此,我们随机组合源域和目标域中的动作以生成修改后的序列;由于UDA设置下只有无标注的目标数据,我们使用标准的伪标签策略提取目标域的动作标签,然后让网络预测生成的动作序列。这使我们能够在训练中整合来自两个域的信息,从而在目标域上获得更好的迁移效果。此外,我们还使用语言模型筛除不太可能出现的序列,并利用共现矩阵消除未出现过的动词与名词组合。我们的提交(标记为'sshayan')可以在排行榜上找到,目前在'verb'上排名第二,在'noun'和'action'上均排名第四。
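A minimal sketch of the sequence-mixing step is given below: labelled source clips and pseudo-labelled target clips are randomly interleaved into one sequence for sequence-prediction training. The interleaving probability, sequence length, and the stand-in pseudo-labeller are illustrative assumptions rather than the exact challenge recipe.

```python
import numpy as np

def build_mixed_sequence(source_clips, source_labels, target_clips,
                         pseudo_label_fn, seq_len=5, seed=0):
    """Randomly interleave labelled source clips and pseudo-labelled target
    clips into one modified action sequence; returns clips, labels, domains."""
    rng = np.random.default_rng(seed)
    clips, labels, domains = [], [], []
    for _ in range(seq_len):
        if rng.random() < 0.5 and len(source_clips) > 0:
            i = rng.integers(len(source_clips))
            clips.append(source_clips[i])
            labels.append(source_labels[i])
            domains.append("source")
        else:
            j = rng.integers(len(target_clips))
            clips.append(target_clips[j])
            labels.append(pseudo_label_fn(target_clips[j]))  # unlabelled target
            domains.append("target")
    return clips, labels, domains

# Toy usage: clips are feature vectors; the pseudo-labeller stands in for a
# model prediction followed by confidence thresholding.
src = [np.ones(8) * k for k in range(3)]
tgt = [np.zeros(8) for _ in range(3)]
_, labels, domains = build_mixed_sequence(src, [0, 1, 2], tgt,
                                          pseudo_label_fn=lambda c: 1)
print(labels, domains)
```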

Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

  • paper_url: http://arxiv.org/abs/2307.13012
  • repo_url: None
  • paper_authors: Martin Lebourdais, Théo Mariotte, Marie Tahon, Anthony Larcher, Antoine Laurent, Silvio Montresor, Sylvain Meignier, Jean-Hugh Thomas
  • for: 这篇论文主要针对的是语音活动检测(VAD)和重叠语音检测(OSD)这两项预处理任务,以提高说话人分离(speaker diarization)的最终分割性能。
  • methods: 这篇论文提出了一个全新的基准,用于在多种音频设置(单/多通道)和语音领域上评估不同的VAD和OSD模型。这些模型将时间卷积网络(Temporal Convolutional Network)与适应于具体设置的语音表示相结合,可以达到最先进的性能水平。
  • results: 实验结果显示,将VAD和OSD联合训练为一个模型,可以在F1分数上达到与两个专用系统相近的性能,同时降低训练成本。此外,这种统一的架构还可以用于单通道和多通道语音处理。
    Abstract Voice activity and overlapped speech detection (respectively VAD and OSD) are key pre-processing tasks for speaker diarization. The final segmentation performance highly relies on the robustness of these sub-tasks. Recent studies have shown VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, lacking information about the generalization capacities of the systems. This paper proposes a complete and new benchmark of different VAD and OSD models, on multiple audio setups (single/multi-channel) and speech domains (e.g. media, meeting...). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that the joint training of these two tasks offers similar performances in terms of F1-score to two dedicated VAD and OSD systems while reducing the training cost. This unique architecture can also be used for single and multichannel speech processing.
    摘要 语音活动检测和重叠语音检测(分别简称VAD和OSD)是说话人分离(speaker diarization)的关键预处理任务,最终的分割性能在很大程度上取决于这两个子任务的鲁棒性。近期研究表明,VAD和OSD可以使用多类分类模型联合训练。然而,这些工作通常局限于特定的语音领域,缺乏关于系统泛化能力的信息。本文提出了一个全新的完整基准,在多种音频设置(单/多通道)和语音领域(如媒体、会议等)上评估不同的VAD和OSD模型。我们的2/3类系统将时间卷积网络与适应于具体设置的语音表示相结合,性能超过了当前最先进的结果。我们还表明,这两项任务的联合训练在F1分数上可达到与两个专用VAD和OSD系统相近的性能,同时降低了训练成本。这一统一架构也可用于单通道和多通道语音处理。
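To illustrate the "2/3-class joint model" idea, here is a minimal temporal-convolutional frame classifier with three classes (non-speech, speech, overlapped speech), so a single model covers both VAD and OSD. The feature dimension, channel sizes, and dilation pattern are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class JointVadOsd(nn.Module):
    """Dilated 1D convolutional stack producing per-frame logits over
    {non-speech, speech, overlapped speech}."""
    def __init__(self, feat_dim=80, hidden=128, num_classes=3):
        super().__init__()
        layers, in_ch = [], feat_dim
        for dilation in (1, 2, 4, 8):
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3,
                                 padding=dilation, dilation=dilation),
                       nn.ReLU(), nn.BatchNorm1d(hidden)]
            in_ch = hidden
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, feats):                 # feats: (batch, feat_dim, frames)
        return self.head(self.tcn(feats))     # logits: (batch, 3, frames)

model = JointVadOsd()
logits = model(torch.randn(2, 80, 500))
frame_labels = logits.argmax(dim=1)           # 0 = non-speech, 1 = speech, 2 = overlap
print(logits.shape, frame_labels.shape)
```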

End-to-End Deep Transfer Learning for Calibration-free Motor Imagery Brain Computer Interfaces

  • paper_url: http://arxiv.org/abs/2307.12827
  • repo_url: None
  • paper_authors: Maryam Alimardani, Steven Kocken, Nikki Leeuwis
  • for: 这个研究的目的是开发一种无需校准(calibration-free)、与被试无关的运动想象脑机接口(MI-BCI)分类器。
  • methods: 这个研究采用深度迁移学习,在原始EEG信号上使用端到端的深度学习方法。三种深度学习模型(MIN2Net、EEGNet 和 DeepConvNet)使用一个公开可用的数据集进行训练和比较。
  • results: 在留一被试交叉验证(leave-one-subject-out cross-validation)中,MIN2Net 无法区分新用户的左右手运动想象,中位准确率为51.7%。另外两种模型表现更好,EEGNet 和 DeepConvNet 的中位准确率分别为62.5%和59.2%。这些准确率虽未达到实现有效控制所需的70%阈值,但与这些模型在其他数据集上不使用迁移学习时的准确率相近。
    Abstract A major issue in Motor Imagery Brain-Computer Interfaces (MI-BCIs) is their poor classification accuracy and the large amount of data that is required for subject-specific calibration. This makes BCIs less accessible to general users in out-of-the-lab applications. This study employed deep transfer learning for development of calibration-free subject-independent MI-BCI classifiers. Unlike earlier works that applied signal preprocessing and feature engineering steps in transfer learning, this study adopted an end-to-end deep learning approach on raw EEG signals. Three deep learning models (MIN2Net, EEGNet and DeepConvNet) were trained and compared using an openly available dataset. The dataset contained EEG signals from 55 subjects who conducted a left- vs. right-hand motor imagery task. To evaluate the performance of each model, a leave-one-subject-out cross validation was used. The results of the models differed significantly. MIN2Net was not able to differentiate right- vs. left-hand motor imagery of new users, with a median accuracy of 51.7%. The other two models performed better, with median accuracies of 62.5% for EEGNet and 59.2% for DeepConvNet. These accuracies do not reach the required threshold of 70% needed for significant control, however, they are similar to the accuracies of these models when tested on other datasets without transfer learning.
    摘要 运动想象脑机接口(MI-BCI)的一个主要问题是其分类准确率较低,且需要大量数据进行被试特定的校准。这使得BCI更难以为普通用户在实验室外的应用中使用。这项研究使用深度迁移学习开发了无需校准、与被试无关的MI-BCI分类器。与之前在迁移学习中采用信号预处理和特征工程步骤的工作不同,本研究在原始EEG信号上使用了端到端的深度学习方法。研究使用一个公开可用的数据集训练并比较了三个深度学习模型(MIN2Net、EEGNet 和 DeepConvNet)。数据集包含55名被试完成左手 vs. 右手运动想象任务的EEG信号。为评估每个模型的性能,使用了留一被试交叉验证。结果表明,各模型的表现差异显著:MIN2Net 无法区分新用户的左右手运动想象,中位准确率为51.7%;另外两个模型表现更好,中位准确率分别为62.5%(EEGNet)和59.2%(DeepConvNet)。这些准确率没有达到实现有效控制所需的70%阈值,但与这些模型在其他数据集上不使用迁移学习时的准确率相近。
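The evaluation protocol above (leave-one-subject-out cross-validation) is easy to set up with standard tooling. The sketch below shows the LOSO loop with a simple logistic-regression stand-in for the deep models; the feature shapes, subject grouping, and classifier are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def loso_accuracies(X, y, subject_ids,
                    make_model=lambda: LogisticRegression(max_iter=1000)):
    """Leave-one-subject-out evaluation: train on all subjects except one and
    test on the held-out subject, as used to assess calibration-free transfer."""
    accs = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        held_out = subject_ids[test_idx][0]
        accs[held_out] = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    return accs

# Toy data: 5 subjects, 40 trials each, 32-dimensional EEG features,
# binary left- vs. right-hand motor imagery labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = rng.integers(0, 2, size=200)
groups = np.repeat(np.arange(5), 40)
print(loso_accuracies(X, y, groups))
```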

Performance of Large Language Models in a Computer Science Degree Program

  • paper_url: http://arxiv.org/abs/2308.02432
  • repo_url: None
  • paper_authors: Tim Krüger, Michael Gref
  • for: 这个论文的目的是评估不同大型语言模型在一所应用科学大学的计算机科学本科学位课程中的有效性。
  • methods: 这个论文使用了不同大型自然语言模型作为教学工具,并通过提示模型 lecture material、运动任务和过去考试来评估它们在不同计算机科学领域的能力。
  • results: 研究发现,ChatGPT-3.5在10个测试模块中的平均分为79.9%,BingAI为68.4%,LLaMa(650亿参数版本)为20%。尽管这些结果令人信服,但即使是GPT-4.0也无法通过该学位课程,因为它在数学计算方面存在局限。
    Abstract Large language models such as ChatGPT-3.5 and GPT-4.0 are ubiquitous and dominate the current discourse. Their transformative capabilities have led to a paradigm shift in how we interact with and utilize (text-based) information. Each day, new possibilities to leverage the capabilities of these models emerge. This paper presents findings on the performance of different large language models in a university of applied sciences' undergraduate computer science degree program. Our primary objective is to assess the effectiveness of these models within the curriculum by employing them as educational aids. By prompting the models with lecture material, exercise tasks, and past exams, we aim to evaluate their proficiency across different computer science domains. We showcase the strong performance of current large language models while highlighting limitations and constraints within the context of such a degree program. We found that ChatGPT-3.5 averaged 79.9% of the total score in 10 tested modules, BingAI achieved 68.4%, and LLaMa, in the 65 billion parameter variant, 20%. Despite these convincing results, even GPT-4.0 would not pass the degree program - due to limitations in mathematical calculations.
    摘要 大型语言模型如ChatGPT-3.5和GPT-4.0无处不在,并主导着当前的讨论,其变革性能力带来了我们与(基于文本的)信息交互和利用方式的范式转变。每天都会涌现利用这些模型能力的新可能。本文介绍了不同大型语言模型在一所应用科学大学的计算机科学本科学位课程中的表现。我们的主要目标是通过将这些模型用作教学辅助工具,评估它们在课程中的有效性。我们用讲义材料、练习任务和过往考试来提问这些模型,以评估它们在不同计算机科学领域中的能力。我们展示了当前大型语言模型的强大表现,同时指出了它们在这类学位课程背景下的限制和约束。我们发现ChatGPT-3.5在10个测试模块中的平均分为79.9%,BingAI为68.4%,LLaMa(650亿参数版本)为20%。尽管这些结果令人信服,但即使是GPT-4.0也无法通过该学位课程,因为它在数学计算方面存在局限。

Maximal Independent Sets for Pooling in Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.13011
  • repo_url: None
  • paper_authors: Stevan Stanovic, Benoit Gaüzère, Luc Brun
  • for: 本文为了解决图(graph)池化问题,提出了三种基于最大独立集的池化方法,以避免传统图池化方法的缺点。
  • methods: 本文使用了三种基于最大独立集的池化方法,即Maximal Independent Set Pooling(MISP)、Maximal Independent Set Aggregation(MISA)和Maximal Independent Set Based Pooling(MIBP)。
  • results: 实验结果表明,这些池化方法能够避免图的断连或过度连接、保持合理的压缩率,并且不会删除图的大部分结构;实验也证实了最大独立集约束对图池化的有效性。
    Abstract Convolutional Neural Networks (CNNs) have enabled major advances in image classification through convolution and pooling. In particular, image pooling transforms a connected discrete lattice into a reduced lattice with the same connectivity and allows reduction functions to consider all pixels in an image. However, there is no pooling that satisfies these properties for graphs. In fact, traditional graph pooling methods suffer from at least one of the following drawbacks: Graph disconnection or overconnection, low decimation ratio, and deletion of large parts of graphs. In this paper, we present three pooling methods based on the notion of maximal independent sets that avoid these pitfalls. Our experimental results confirm the relevance of maximal independent set constraints for graph pooling.
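Since the three methods all build on maximal independent sets, the sketch below shows the shared ingredient: a greedy maximal independent set on a graph, plus a cluster assignment that maps every node to a surviving MIS node, which is the kind of coarsening a MIS-based pooling layer can use. The greedy order and the assignment rule are illustrative assumptions, not the paper's exact procedures.

```python
def greedy_maximal_independent_set(adjacency):
    """Greedy maximal independent set on an undirected graph given as
    {node: set(neighbours)}: pick an available node, remove its neighbours,
    repeat. Selected nodes are pairwise non-adjacent and cover the graph."""
    mis, available = set(), set(adjacency)
    for node in sorted(adjacency, key=lambda n: len(adjacency[n])):  # low degree first
        if node in available:
            mis.add(node)
            available.discard(node)
            available -= adjacency[node]
    return mis

def mis_pooling_assignment(adjacency):
    """MIS nodes map to themselves; every other node maps to one adjacent MIS
    node (it always has one, by maximality), giving a pooling assignment that
    keeps connectivity and avoids deleting large parts of the graph."""
    mis = greedy_maximal_independent_set(adjacency)
    assignment = {node: node if node in mis else next(iter(neigh & mis))
                  for node, neigh in adjacency.items()}
    return mis, assignment

# Path graph 0-1-2-3-4: the MIS keeps non-adjacent nodes covering the graph.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(mis_pooling_assignment(adj))
```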

Analyzing the Strategy of Propaganda using Inverse Reinforcement Learning: Evidence from the 2022 Russian Invasion of Ukraine

  • paper_url: http://arxiv.org/abs/2307.12788
  • repo_url: None
  • paper_authors: Dominique Geissler, Stefan Feuerriegel
  • for: This study aims to understand the strategy behind the pro-Russian propaganda campaign on social media during the 2022 Russian invasion of Ukraine.
  • methods: The study uses an inverse reinforcement learning (IRL) approach to model online behavior as a Markov decision process and infer the underlying reward structure that guides propagandists when interacting with users with a supporting or opposing stance toward the invasion.
  • results: The study finds that bots and humans follow different strategies in responding to pro-Russian propaganda. Bots primarily respond to pro-invasion messages, suggesting they seek to drive virality, while messages indicating opposition primarily elicit responses from humans, suggesting they tend to engage in critical discussions.
    Abstract The 2022 Russian invasion of Ukraine was accompanied by a large-scale, pro-Russian propaganda campaign on social media. However, the strategy behind the dissemination of propaganda has remained unclear, particularly how the online discourse was strategically shaped by the propagandists' community. Here, we analyze the strategy of the Twitter community using an inverse reinforcement learning (IRL) approach. Specifically, IRL allows us to model online behavior as a Markov decision process, where the goal is to infer the underlying reward structure that guides propagandists when interacting with users with a supporting or opposing stance toward the invasion. Thereby, we aim to understand empirically whether and how between-user interactions are strategically used to promote the proliferation of Russian propaganda. For this, we leverage a large-scale dataset with 349,455 posts with pro-Russian propaganda from 132,131 users. We show that bots and humans follow a different strategy: bots respond predominantly to pro-invasion messages, suggesting that they seek to drive virality; while messages indicating opposition primarily elicit responses from humans, suggesting that they tend to engage in critical discussions. To the best of our knowledge, this is the first study analyzing the strategy behind propaganda from the 2022 Russian invasion of Ukraine through the lens of IRL.
    摘要 俄罗斯入侵乌克兰的2022年社交媒体宣传活动中,涉及到了大规模的俄罗斯支持者在社交媒体上的宣传活动。然而,这些宣传活动的战略具体如何实施,特别是如何通过在用户之间的互动来推动俄罗斯宣传的普及,这些问题仍未得到了清楚的回答。在这里,我们使用 inverse reinforcement learning(IRL)方法来分析推特社区的宣传策略。具体来说,IRL方法允许我们将在线行为模型为Markov决策过程,并且目的是从推特用户的支持或反对姿态来推断宣传者在互动中的奖励结构。因此,我们可以理解propagandists在互动中是如何用between-user互动来推动俄罗斯宣传的。为此,我们利用了349,455条推特帖子和132,131名用户的大规模数据集。我们发现,机器人和人类使用者采取了不同的策略:机器人尽量回应支持入侵的消息,表明它们想要驱动病毒性;而反对消息主要引起人类用户的回应,表明人类用户更倾向于进行批评讨论。到目前为止,这是我们分析2022年俄罗斯入侵乌克兰的宣传策略的第一项研究,通过IRL镜像来分析。

Is attention all you need in medical image analysis? A review

  • paper_url: http://arxiv.org/abs/2307.12775
  • repo_url: None
  • paper_authors: Giorgos Papanastasiou, Nikolaos Dikaios, Jiahao Huang, Chengjia Wang, Guang Yang
  • for: 这篇论文旨在概述现有的hybrid CNN-Transf/Attention模型,以及对这些模型的架构设计、突破点和应用前景。
  • methods: 这篇论文采用系统性的文献综述方法,对hybrid CNN-Transf/Attention模型进行架构分析和综述,并提出了一个关于泛化机会(科学与临床影响)的综合分析框架。
  • results: 论文梳理了hybrid CNN-Transf/Attention模型在医学图像分析中的关键架构设计、突破与当前及未来的机遇和挑战,并指出所提出的分析框架可以启发新的数据驱动领域泛化与适应方法。
    Abstract Medical imaging is a key component in clinical diagnosis, treatment planning and clinical trial design, accounting for almost 90% of all healthcare data. CNNs achieved performance gains in medical image analysis (MIA) over the last years. CNNs can efficiently model local pixel interactions and be trained on small-scale MI data. The main disadvantage of typical CNN models is that they ignore global pixel relationships within images, which limits their generalisation ability to understand out-of-distribution data with different 'global' information. The recent progress of Artificial Intelligence gave rise to Transformers, which can learn global relationships from data. However, full Transformer models need to be trained on large-scale data and involve tremendous computational complexity. Attention and Transformer compartments (Transf/Attention) which can well maintain properties for modelling global relationships, have been proposed as lighter alternatives of full Transformers. Recently, there is an increasing trend to co-pollinate complementary local-global properties from CNN and Transf/Attention architectures, which led to a new era of hybrid models. The past years have witnessed substantial growth in hybrid CNN-Transf/Attention models across diverse MIA problems. In this systematic review, we survey existing hybrid CNN-Transf/Attention models, review and unravel key architectural designs, analyse breakthroughs, and evaluate current and future opportunities as well as challenges. We also introduced a comprehensive analysis framework on generalisation opportunities of scientific and clinical impact, based on which new data-driven domain generalisation and adaptation methods can be stimulated.
    摘要 医疗影像是临床诊断、治疗规划和临床试验设计中的关键组成部分,占全部医疗数据的近90%。过去几年,卷积神经网络(CNN)在医疗影像分析(MIA)中取得了性能提升。CNN可以高效地建模图像中的局部像素交互,并可以在小规模的医学影像数据上训练。然而,典型的CNN模型忽略了图像中的全局像素关系,这限制了它们对具有不同"全局"信息的分布外数据的泛化能力。随着人工智能的发展,出现了能够从数据中学习全局关系的Transformer模型。然而,完整的Transformer模型需要在大规模数据上训练,并且计算复杂度巨大。注意力与Transformer组件(Transf/Attention)能够较好地保持建模全局关系的特性,被提出作为完整Transformer的轻量级替代方案。最近,将CNN与Transf/Attention架构中互补的局部-全局特性相结合的趋势日益明显,由此开启了混合模型的新时代。过去几年,混合CNN-Transf/Attention模型在多种MIA问题上取得了显著发展。在这篇系统性综述中,我们调研了现有的混合CNN-Transf/Attention模型,梳理并剖析其关键架构设计,分析了突破性进展,并评估了当前与未来的机遇和挑战。此外,我们还提出了一个关于科学与临床影响的泛化机会的综合分析框架,可据此启发新的数据驱动领域泛化与适应方法。
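The local-global pattern the review surveys can be shown in a few lines: a convolutional stem captures local pixel interactions, then a Transformer encoder layer models global relations between the resulting spatial tokens. The dimensions below are illustrative and the block stands for the generic CNN-Transf/Attention hybrid, not any specific model from the review.

```python
import torch
import torch.nn as nn

class HybridCnnTransformerBlock(nn.Module):
    """Convolutional stem (local features) followed by a Transformer encoder
    layer over flattened spatial tokens (global relations)."""
    def __init__(self, in_ch=1, embed_dim=64, num_heads=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
        )
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        feat = self.stem(x)                        # (B, D, H/4, W/4) local features
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W/16, D) spatial tokens
        return self.encoder(tokens)                # globally mixed tokens

block = HybridCnnTransformerBlock()
out = block(torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 256, 64])
```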

Adaptation of Whisper models to child speech recognition

  • paper_url: http://arxiv.org/abs/2307.13008
  • repo_url: https://github.com/c3imaging/whisper_child_speech
  • paper_authors: Rishabh Jain, Andrei Barcovschi, Mariam Yiwere, Peter Corcoran, Horia Cucu
  • for: 提高儿童语音识别(ASR)系统对儿童语音的识别精度
  • methods: 利用在大量有标注成人语音数据上训练得到的多语言模型,并在儿童语音上对Whisper模型进行微调(finetuning),同时与在儿童语音上微调的自监督wav2vec2模型进行比较
  • results: 在儿童语音上,微调后的Whisper模型相比未微调的Whisper模型在ASR性能上有显著改善;而在儿童语音上微调的自监督wav2vec2模型表现更好,超过了微调的Whisper模型
    Abstract Automatic Speech Recognition (ASR) systems often struggle with transcribing child speech due to the lack of large child speech datasets required to accurately train child-friendly ASR models. However, there are huge amounts of annotated adult speech datasets which were used to create multilingual ASR models, such as Whisper. Our work aims to explore whether such models can be adapted to child speech to improve ASR for children. In addition, we compare Whisper child-adaptations with finetuned self-supervised models, such as wav2vec2. We demonstrate that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech, compared to non finetuned Whisper models. Additionally, utilizing self-supervised Wav2vec2 models that have been finetuned on child speech outperforms Whisper finetuning.
    摘要 自动语音识别(ASR)系统经常因缺乏大规模儿童语音数据而难以准确转写儿童语音。然而,存在大量有标注的成人语音数据,这些数据被用来创建多语言的ASR模型,如Whisper。我们的工作旨在探索这类模型能否适应儿童语音,以提高面向儿童的ASR性能。此外,我们还将Whisper的儿童语音适应与经过微调的自监督模型(如wav2vec2)进行了比较。我们的结果显示,在儿童语音上微调Whisper可以显著提高其在儿童语音上的ASR性能,相比未微调的Whisper模型。此外,在儿童语音上微调过的自监督wav2vec2模型表现更好,超越了微调的Whisper。
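A minimal sketch of the finetuning setup follows, using the Hugging Face transformers interface to Whisper: one audio-transcript pair is processed and the forward pass with labels yields the cross-entropy loss that a finetuning loop would optimise. The checkpoint name ("openai/whisper-small"), the placeholder audio, and the transcript are assumptions for illustration; the paper's actual model sizes and training loop are not shown.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load a pretrained multilingual checkpoint (one public option, not
# necessarily the size used in the paper).
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One hypothetical child-speech example: 16 kHz audio plus its transcript.
audio = torch.zeros(16000 * 5).numpy()        # placeholder for a 5 s recording
transcript = "the cat sat on the mat"

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

# Forward pass with labels returns the loss used for finetuning; a real run
# would iterate over a child-speech dataset with an optimizer.
outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()
print(float(outputs.loss))
```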

Nonparametric Linear Feature Learning in Regression Through Regularisation

  • paper_url: http://arxiv.org/abs/2307.12754
  • repo_url: https://github.com/bertillefollain/regfeal
  • paper_authors: Bertille Follain, Umut Simsekli, Francis Bach
  • for: 这个论文主要针对高维数据的自动特征选择问题,具体来说是多指标模型中的适用学习问题。
  • methods: 该论文提出了一种新的线性特征学习与非参数预测相结合的方法,可以同时估计预测函数和相关的线性子空间。该方法采用经验风险最小化(empirical risk minimization),并加入对函数导数的惩罚项,以保证方法的通用性。
  • results: 该论文提供了一些实验结果,证明了RegFeaL 可以在不同的实际场景中表现出色,并且可以准确地估计相关维度。
    Abstract Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternative minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments.
    摘要 《学习表示在自动选择特征中扮演重要角色,特别是在高维数据的情况下,非 Parametric 方法经常陷入困难。在这项研究中,我们关注supervised learning情况下,关键信息都集中在数据中的一个lower-dimensional linear subspace,即多index模型。如果这个subspace已知,它会大大提高预测、计算和解释。为解决这个挑战,我们提出了一种新的方法 для线性特征学习,同时估算预测函数和linear subspace。我们的方法使用empirical risk minimization,并添加了函数导数的罚因,以确保多样化。利用 Hermite polynomials的正交性和旋转不变性,我们引入了我们的估计器,名为RegFeaL。通过使用alternative minimization,我们可以逐步旋转数据,以便更好地与主导方向相align并准确地估算实际情况中的相关维度。我们证明了我们的方法可以生成一个consistent的预测函数估计器,并且提供了explicit rates。此外,我们还提供了一系列实验结果,证明RegFeaL的性能。

Introducing CALMED: Multimodal Annotated Dataset for Emotion Detection in Children with Autism

  • paper_url: http://arxiv.org/abs/2307.13706
  • repo_url: None
  • paper_authors: Annanda Sousa, Karen Young, Mathieu D’aquin, Manel Zarrouk, Jennifer Holloway
  • for: 这个论文的目的是提高人际交流的自动情感识别系统,以提供个性化的用户体验。
  • methods: 本论文使用了多种数据收集和处理技术,包括录音和视频特征提取和分类。
  • results: 本论文描述了一个基于多Modal的情感识别数据集,包括8-12岁的儿童患有Autism Spectrum Disorder(ASD)的记录例子。该数据集包括4个target类别的注解,共计57,012个示例,每个示例代表200ms(0.2秒)的时间窗口。
    Abstract Automatic Emotion Detection (ED) aims to build systems to identify users' emotions automatically. This field has the potential to enhance HCI, creating an individualised experience for the user. However, ED systems tend to perform poorly on people with Autism Spectrum Disorder (ASD). Hence, the need to create ED systems tailored to how people with autism express emotions. Previous works have created ED systems tailored for children with ASD but did not share the resulting dataset. Sharing annotated datasets is essential to enable the development of more advanced computer models for ED within the research community. In this paper, we describe our experience establishing a process to create a multimodal annotated dataset featuring children with a level 1 diagnosis of autism. In addition, we introduce CALMED (Children, Autism, Multimodal, Emotion, Detection), the resulting multimodal emotion detection dataset featuring children with autism aged 8-12. CALMED includes audio and video features extracted from recording files of study sessions with participants, together with annotations provided by their parents into four target classes. The generated dataset includes a total of 57,012 examples, with each example representing a time window of 200ms (0.2s). Our experience and methods described here, together with the dataset shared, aim to contribute to future research applications of affective computing in ASD, which has the potential to create systems to improve the lives of people with ASD.
    摘要 自动情感检测(ED)目标是建立自动识别用户情感的系统。这个领域有可能提高人机交互(HCI),创造个性化的用户体验。然而,ED系统通常在Autism Spectrum Disorder(ASD)人群表现不佳。因此,需要开发特化于人们表达情感的ED系统。先前的工作已经创建了特化于儿童ASD的ED系统,但没有公布数据集。分享标注数据集是开发更先进的计算机模型的关键。在这篇论文中,我们描述了我们在建立一个多模态注释数据集的过程中的经验。此外,我们还介绍了CALMED(儿童Autism、多模态、情感检测)数据集,这是8-12岁的儿童ASD的多模态情感检测数据集。CALMED包括录音和视频特征,从参与者录制的文件中提取,以及由参与者的父母提供的四个目标类别的注释。生成的数据集包括57,012个示例,每个示例表示200毫秒(0.2秒)的时间窗口。我们的经验和方法描述以及分享的数据,希望能为ASD的情感计算机科学研究提供贡献,以创造用于改善ASD人群生活质量的系统。

MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

  • paper_url: http://arxiv.org/abs/2307.12698
  • repo_url: None
  • paper_authors: Adrien Bardes, Jean Ponce, Yann LeCun
  • for: 学习视觉表示,强调了学习内容特征,而不是捕捉物体运动或位置信息。
  • methods: 我们提出了 MC-JEPA 方法,即共同嵌入预测架构和自动学习方法,以同时学习光流和内容特征,并证明了这两个关联目标互助彼此,从而学习了包含运动信息的内容特征。
  • results: 我们的方法可以与现有的无监督光流标准做比较,以及与常见的自动学习方法在图像和视频 semantic segmentation 任务上表现相当。
    Abstract Self-supervised learning of visual representations has been focusing on learning content features, which do not capture object motion or location, and focus on identifying and differentiating objects in images and videos. On the other hand, optical flow estimation is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach to jointly learn optical flow and content features within a shared encoder, demonstrating that the two associated objectives; the optical flow estimation objective and the self-supervised learning objective; benefit from each other and thus learn content features that incorporate motion information. The proposed approach achieves performance on-par with existing unsupervised optical flow benchmarks, as well as with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.
    摘要 视觉表示的自监督学习一直把注意力集中在学习内容特征上,这些特征不捕捉物体的运动或位置信息,而侧重于识别和区分图像和视频中的物体。另一方面,光流估计是一项不需要理解图像内容的任务。我们将这两种方法统一起来,引入 MC-JEPA,一种联合嵌入预测架构和自监督学习方法,在共享编码器中同时学习光流和内容特征。我们证明这两个相关联的目标,即光流估计目标和自监督学习目标,可以互相促进,从而学到包含运动信息的内容特征。所提出的方法在现有的无监督光流基准上达到了相当的性能,并在图像和视频语义分割等下游任务上与常见的自监督学习方法表现相当。
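The shared-encoder idea can be sketched as one convolutional encoder feeding two heads: a flow head predicting per-pixel 2D motion between two frames and a content head producing an embedding for self-supervised objectives. The layer sizes, the placeholder losses, and the way the two terms are combined are illustrative assumptions, not MC-JEPA's actual architecture or objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderTwoHeads(nn.Module):
    """One encoder shared by an optical-flow head and a content head."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.flow_head = nn.Conv2d(2 * dim, 2, 3, padding=1)      # (dx, dy) per pixel
        self.content_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                          nn.Linear(dim, 128))

    def forward(self, frame1, frame2):
        f1, f2 = self.encoder(frame1), self.encoder(frame2)
        flow = self.flow_head(torch.cat([f1, f2], dim=1))
        return flow, self.content_head(f1), self.content_head(f2)

model = SharedEncoderTwoHeads()
x1, x2 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
flow, z1, z2 = model(x1, x2)
# Placeholder joint objective: a flow regulariser plus a content-invariance term.
loss = flow.abs().mean() + (1 - F.cosine_similarity(z1, z2).mean())
print(flow.shape, z1.shape, float(loss))
```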

Addressing the Impact of Localized Training Data in Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12689
  • repo_url: https://github.com/akanshaaga/reg_appnp
  • paper_authors: Singh Akansha
  • for: 本文旨在评估 Graph Neural Networks (GNNs) 在不同区域的图数据上的性能,以及如何在受限的训练数据情况下提高 GNN 的适应性和泛化能力。
  • methods: 本文提出了一种基于 distribution alignment 的正则化方法,用于解决 GNN 在本地化训练数据上的性能下降问题。
  • results: 经过广泛测试,结果表明该正则化方法可以有效地提高 GNN 在分布外(OOD)数据上的性能,并在三个引文网络基准数据集上带来显著的性能提升,帮助 GNN 更好地适应和泛化到不同的图数据。
    Abstract Graph Neural Networks (GNNs) have achieved notable success in learning from graph-structured data, owing to their ability to capture intricate dependencies and relationships between nodes. They excel in various applications, including semi-supervised node classification, link prediction, and graph generation. However, it is important to acknowledge that the majority of state-of-the-art GNN models are built upon the assumption of an in-distribution setting, which hinders their performance on real-world graphs with dynamic structures. In this article, we aim to assess the impact of training GNNs on localized subsets of the graph. Such restricted training data may lead to a model that performs well in the specific region it was trained on but fails to generalize and make accurate predictions for the entire graph. In the context of graph-based semi-supervised learning (SSL), resource constraints often lead to scenarios where the dataset is large, but only a portion of it can be labeled, affecting the model's performance. This limitation affects tasks like anomaly detection or spam detection when labeling processes are biased or influenced by human subjectivity. To tackle the challenges posed by localized training data, we approach the problem as an out-of-distribution (OOD) data issue by by aligning the distributions between the training data, which represents a small portion of labeled data, and the graph inference process that involves making predictions for the entire graph. We propose a regularization method to minimize distributional discrepancies between localized training data and graph inference, improving model performance on OOD data. Extensive tests on popular GNN models show significant performance improvement on three citation GNN benchmark datasets. The regularization approach effectively enhances model adaptation and generalization, overcoming challenges posed by OOD data.
    摘要 图神经网络 (GNNs) 已经取得了很大的成功,它们可以从图结构数据中学习,因为它们能够捕捉图中节点之间的复杂关系和依赖关系。它们在半监督节点分类、链接预测和图生成等应用中表现出色。然而需要注意的是,大多数当前最先进的 GNN 模型都建立在同分布 (in-distribution) 假设之上,这限制了它们在结构动态变化的真实图上的性能。在这篇文章中,我们试图评估在图的局部子集上训练 GNN 模型的影响。这种受限的训练数据可能导致模型在其训练的特定区域中表现出色,却无法泛化并对整个图做出准确的预测。在基于图的半监督学习 (SSL) 中,资源约束常常导致数据集很大但只有一部分可以被标注,从而影响模型的性能;当标注过程存在偏差或受人为主观性影响时,这一限制会波及异常检测、垃圾信息检测等任务。为了解决局部化训练数据所带来的挑战,我们将该问题视为一个分布外 (OOD) 数据问题,通过对齐局部化训练数据(即一小部分已标注数据)与涉及整图预测的图推理过程之间的分布来加以处理。我们提出一种正则化方法,以降低局部训练数据与整图推理之间的分布差异,从而提高模型在 OOD 数据上的表现。我们在主流 GNN 模型上进行了广泛的测试,发现这种正则化方法可以在三个引文网络 GNN 基准数据集上显著提高表现。该正则化方法有效增强了模型的适应与泛化能力,克服了 OOD 数据带来的挑战。
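
The paper's key ingredient is a regularizer that pulls the distribution of the small labeled (localized) subgraph towards the distribution seen at whole-graph inference time. The abstract does not spell out the exact discrepancy measure, so the sketch below uses an RBF maximum mean discrepancy between labeled-node embeddings and all-node embeddings as one plausible instantiation; the kernel bandwidth and weight `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def rbf_mmd(x, y, sigma=1.0):
    """Squared maximum mean discrepancy between two embedding sets under an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def localized_training_loss(node_embeddings, logits, labels, train_mask, lam=0.5):
    """Cross-entropy on the localized labeled nodes plus an alignment penalty towards the whole graph."""
    ce = F.cross_entropy(logits[train_mask], labels[train_mask])
    align = rbf_mmd(node_embeddings[train_mask], node_embeddings)
    return ce + lam * align
```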

IteraTTA: An interface for exploring both text prompts and audio priors in generating music with text-to-audio models

  • paper_url: http://arxiv.org/abs/2307.13005
  • repo_url: None
  • paper_authors: Hiromu Yakura, Masataka Goto
  • for: 帮助用户自由生成音乐音频,不需具备音乐知识,通过不同文本提示和音频先导来探索音频生成空间。
  • methods: 使用文本-到-音频生成技术,并提供用户可以进行双重探索(文本提示和音频先导),以便理解不同文本提示和音频先导对生成结果的影响,并逐渐实现用户的模糊化目标。
  • results: 通过提供特定的音频先导和文本提示,用户可以逐渐理解和探索音频生成空间,并通过反复比较不同的文本提示和音频先导来了解它们对生成结果的影响。
    Abstract Recent text-to-audio generation techniques have the potential to allow novice users to freely generate music audio. Even if they do not have musical knowledge, such as about chord progressions and instruments, users can try various text prompts to generate audio. However, compared to the image domain, gaining a clear understanding of the space of possible music audios is difficult because users cannot listen to the variations of the generated audios simultaneously. We therefore facilitate users in exploring not only text prompts but also audio priors that constrain the text-to-audio music generation process. This dual-sided exploration enables users to discern the impact of different text prompts and audio priors on the generation results through iterative comparison of them. Our developed interface, IteraTTA, is specifically designed to aid users in refining text prompts and selecting favorable audio priors from the generated audios. With this, users can progressively reach their loosely-specified goals while understanding and exploring the space of possible results. Our implementation and discussions highlight design considerations that are specifically required for text-to-audio models and how interaction techniques can contribute to their effectiveness.
    摘要 现代文本到音频生成技术具有让新手无需音乐知识也可以自由生成音乐音频的潜力。用户可以通过不同的文本提示来尝试生成音频,而不需要了解和弦进行、乐器等音乐知识。然而,与图像领域相比,清楚地把握可能生成的音乐音频空间是困难的,因为用户无法同时听到生成音频的各种变化。因此,我们让用户不仅能够探索文本提示,还能够探索约束文本到音频音乐生成过程的音频先验。这种双向探索使得用户可以通过反复比较不同的文本提示和音频先验来了解它们对生成结果的影响。我们开发的界面 IteraTTA 专门设计用于帮助用户精细调整文本提示,并从生成的音频中选择合适的音频先验。通过这种方式,用户可以逐步实现自己模糊设定的目标,同时理解和探索可能的结果空间。我们的实现和讨论突出了文本到音频模型所特有的设计考虑,以及交互技术如何提升其效果。

Control and Monitoring of Artificial Intelligence Algorithms

  • paper_url: http://arxiv.org/abs/2307.13705
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Carlos Mario Braga Ortuño, Blanza Martinez Donoso, Belén Muñiz Villanueva
  • for: 本研究阐述了在训练完成后,如何监督人工智能模型的运行,并处理可能出现的数据分布的变化。
  • methods: 本研究介绍了一些用于评估模型表现的指标,以及适当的数据分布基础。
  • results: 研究发现,监督模型的运行可以帮助检测和处理数据分布的变化,并且可以提高模型的表现。
    Abstract This paper elucidates the importance of governing an artificial intelligence model post-deployment and overseeing potential fluctuations in the distribution of present data in contrast to the training data. The concepts of data drift and concept drift are explicated, along with their respective foundational distributions. Furthermore, a range of metrics is introduced, which can be utilized to scrutinize the model's performance concerning potential temporal variations.
    摘要 本文阐述了在人工智能模型部署后对其进行治理的重要性,以及监控当前数据分布相对于训练数据可能出现的波动。文中解释了数据漂移和概念漂移的概念及其各自的基础分布,并介绍了一系列可用于检查模型性能在潜在时间变化下表现的指标。
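
Since the paper is about metrics for monitoring data drift after deployment, here is a small illustrative example of one widely used drift statistic, the population stability index (PSI), computed between a training-time feature distribution and a live one; the quantile binning and the informal 0.2 alert threshold are conventional choices, not values taken from the paper.

```python
import numpy as np

def population_stability_index(train_col, live_col, n_bins=10, eps=1e-6):
    """PSI of one feature between its training-time and serving-time distributions."""
    edges = np.quantile(train_col, np.linspace(0, 1, n_bins + 1))
    p = np.histogram(np.clip(train_col, edges[0], edges[-1]), bins=edges)[0] / len(train_col)
    q = np.histogram(np.clip(live_col, edges[0], edges[-1]), bins=edges)[0] / len(live_col)
    p, q = p + eps, q + eps
    return float(np.sum((p - q) * np.log(p / q)))

# Toy check: serving data shifted by half a standard deviation.
rng = np.random.default_rng(0)
train, live = rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000)
print(population_stability_index(train, live))   # values above ~0.2 are often treated as drift
```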

Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation of rPPG

  • paper_url: http://arxiv.org/abs/2307.12644
  • repo_url: https://github.com/remotebiosensing/rppg
  • paper_authors: Dae-Yeol Kim, Eunsu Goh, KwangKee Lee, JongEui Chae, JongHyeon Mun, Junyeong Na, Chae-bong Sohn, Do-Yup Kim
  • For: This paper aims to provide a benchmarking framework for evaluating the performance of remote photoplethysmography (rPPG) techniques across a wide range of datasets, to facilitate fair comparison and progress in the field.* Methods: The paper uses a variety of datasets and benchmarking metrics to evaluate the performance of both conventional non-deep neural network (non-DNN) and deep neural network (DNN) methods for rPPG.* Results: The paper provides a comprehensive evaluation of the performance of different rPPG techniques on a wide range of datasets, and highlights the need for fair and evaluable benchmarking to overcome challenges in the field and make meaningful progress.
    Abstract rPPG (Remote photoplethysmography) is a technology that measures and analyzes BVP (Blood Volume Pulse) by using the light absorption characteristics of hemoglobin captured through a camera. Analyzing the measured BVP can derive various physiological signals such as heart rate, stress level, and blood pressure, which can be applied to various applications such as telemedicine, remote patient monitoring, and early prediction of cardiovascular disease. rPPG is rapidly evolving and attracting great attention from both academia and industry by providing great usability and convenience as it can measure biosignals using a camera-equipped device without medical or wearable devices. Despite extensive efforts and advances in this field, serious challenges remain, including issues related to skin color, camera characteristics, ambient lighting, and other sources of noise and artifacts, which degrade accuracy performance. We argue that fair and evaluable benchmarking is urgently required to overcome these challenges and make meaningful progress from both academic and commercial perspectives. In most existing work, models are trained, tested, and validated only on limited datasets. Even worse, some studies lack available code or reproducibility, making it difficult to fairly evaluate and compare performance. Therefore, the purpose of this study is to provide a benchmarking framework to evaluate various rPPG techniques across a wide range of datasets for fair evaluation and comparison, including both conventional non-deep neural network (non-DNN) and deep neural network (DNN) methods. GitHub URL: https://github.com/remotebiosensing/rppg
    摘要 远程光电容积脉搏波描记 (rPPG) 是一种利用摄像机捕捉血红蛋白的光吸收特性,进而测量并分析血容量脉搏 (BVP) 的技术。通过分析测量到的 BVP,可以推导出心率、压力水平和血压等多种生理信号,这些信号可以应用于远程医疗、远程患者监控和心血管疾病的早期预测等多个场景。rPPG 在学术界和业界都受到广泛关注,因为它只需带摄像头的设备即可测量生物信号,而不需要医疗器械或穿戴式设备,提供了很好的可用性和便利性。然而,这个领域仍然面临许多挑战,包括肤色、摄像机特性、环境照明以及其他噪声和伪影等问题,这些问题会降低准确性。我们认为,要克服这些挑战并在学术和商业层面取得有意义的进展,公平且可评估的基准测试是当务之急。大多数现有的工作都只在有限的数据集上进行模型的训练、测试和验证;更糟糕的是,一些研究缺乏可用的代码或可复现性,使得公平地评估和比较性能变得困难。因此,本研究的目的是提供一个基准测试框架,以在广泛的数据集上评估各种 rPPG 技术,实现公平的评估与比较,其中既包括传统的非深度神经网络 (non-DNN) 方法,也包括深度神经网络 (DNN) 方法。GitHub URL: https://github.com/remotebiosensing/rppg
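
To give a feel for what an rPPG benchmark must standardize, the sketch below estimates heart rate from a BVP trace via its dominant spectral peak and scores predictions with mean absolute error in bpm, a common headline metric; the 30 Hz sampling rate, the 0.7-3.0 Hz band, and the synthetic signal are illustrative assumptions rather than the framework's actual pipeline.

```python
import numpy as np

def estimate_hr_bpm(bvp, fs=30.0, band=(0.7, 3.0)):
    """Heart rate (bpm) from the dominant frequency of a BVP signal within a plausible band."""
    bvp = bvp - np.mean(bvp)
    freqs = np.fft.rfftfreq(len(bvp), d=1.0 / fs)
    power = np.abs(np.fft.rfft(bvp)) ** 2
    mask = (freqs >= band[0]) & (freqs <= band[1])      # roughly 42-180 bpm
    return 60.0 * freqs[mask][np.argmax(power[mask])]

def hr_mae(pred_signals, true_hrs, fs=30.0):
    """Mean absolute error in bpm over a set of predicted BVP traces."""
    preds = [estimate_hr_bpm(s, fs) for s in pred_signals]
    return float(np.mean(np.abs(np.array(preds) - np.array(true_hrs))))

# Toy check with a synthetic 72 bpm pulse (1.2 Hz) sampled at 30 Hz for 10 s.
t = np.arange(0, 10, 1 / 30.0)
print(hr_mae([np.sin(2 * np.pi * 1.2 * t)], [72.0]))
```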

Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

  • paper_url: http://arxiv.org/abs/2307.12626
  • repo_url: None
  • paper_authors: Jingxuan Wei, Cheng Tan, Zhangyang Gao, Linzhuang Sun, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li
  • for: This paper aims to address the lack of comprehensive evaluation of diverse approaches in multimodal scientific question answering, by presenting a novel dataset (COCO Multi-Modal Reasoning Dataset) that includes open-ended questions, rationales, and answers derived from the large object dataset COCO.
  • methods: The proposed dataset pioneers the use of open-ended questions in the context of multimodal chain-of-thought, which introduces a more challenging problem that effectively assesses the reasoning capability of CoT models. The authors propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
  • results: Extensive experiments demonstrate the efficacy of the proposed dataset and techniques, offering novel perspectives for advancing multimodal reasoning. The proposed methods and dataset provide valuable insights and offer a more challenging problem for advancing the field of multimodal reasoning.
    Abstract Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches. To address this gap, we present COCO Multi-Modal Reasoning Dataset(COCO-MMRD), a novel dataset that encompasses an extensive collection of open-ended questions, rationales, and answers derived from the large object dataset COCO. Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the context of multimodal CoT, introducing a more challenging problem that effectively assesses the reasoning capability of CoT models. Through comprehensive evaluations and detailed analyses, we provide valuable insights and propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders. Extensive experiments demonstrate the efficacy of the proposed dataset and techniques, offering novel perspectives for advancing multimodal reasoning.
    摘要 多模态推理是人工智能系统实现类人智能的关键组成部分,尤其是在处理复杂任务时。尽管链式思考 (CoT) 技术已受到广泛关注,但现有的 ScienceQA 数据集仅关注来自中小学教科书的多模态科学问题及其解释,缺乏对多种方法的全面评估。为填补这一空白,我们提出了 COCO Multi-Modal Reasoning Dataset (COCO-MMRD),这是一个基于大型物体数据集 COCO 构建的新数据集,包含大量开放式问题、推理依据和答案。不同于以往依赖选择题的数据集,我们的数据集率先在多模态 CoT 场景中使用开放式问题,提出了一个更具挑战性的问题设定,能够有效评估 CoT 模型的推理能力。通过全面的评估和详细的分析,我们提供了有价值的见解,并提出了包括多跳跨模态注意力和句子级对比学习在内的创新技术,以增强图像和文本编码器。大量实验证明了所提数据集和技术的有效性,为推进多模态推理提供了新的视角。

De-confounding Representation Learning for Counterfactual Inference on Continuous Treatment via Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2307.12625
  • repo_url: None
  • paper_authors: Yonghe Zhao, Qiang Huang, Haolong Zeng, Yun Pen, Huiyan Sun
  • for: 这种论文主要针对的是如何对连续型干预变量进行Counterfactual推断,而现实世界中更常见的是连续型干预变量的Counterfactual推断任务。
  • methods: 我们提出了一种基于De-confounding Representation Learning(DRL)的框架,通过生成与干预变量分离的covariate表示来消除干预变量与covariate之间的相关性。DRL是一种非 Parametric 模型,可以消除连续型干预变量与covariate之间的线性和非线性相关性。
  • results: 在 synthetic 数据集上进行了广泛的实验,发现 DRL 模型在学习分离表示的同时,也可以超越当前Counterfactual推断模型的性能。此外,我们还应用了 DRL 模型到一个实际的医疗数据集 MIMIC,并显示出了连续型红细胞宽度分布和死亡率之间的详细 causal 关系。
    Abstract Counterfactual inference for continuous rather than binary treatment variables is more common in real-world causal inference tasks. While there are already some sample reweighting methods based on Marginal Structural Model for eliminating the confounding bias, they generally focus on removing the treatment's linear dependence on confounders and rely on the accuracy of the assumed parametric models, which are usually unverifiable. In this paper, we propose a de-confounding representation learning (DRL) framework for counterfactual outcome estimation of continuous treatment by generating the representations of covariates disentangled with the treatment variables. The DRL is a non-parametric model that eliminates both linear and nonlinear dependence between treatment and covariates. Specifically, we train the correlations between the de-confounded representations and the treatment variables against the correlations between the covariate representations and the treatment variables to eliminate confounding bias. Further, a counterfactual inference network is embedded into the framework to make the learned representations serve both de-confounding and trusted inference. Extensive experiments on synthetic datasets show that the DRL model performs superiorly in learning de-confounding representations and outperforms state-of-the-art counterfactual inference models for continuous treatment variables. In addition, we apply the DRL model to a real-world medical dataset MIMIC and demonstrate a detailed causal relationship between red cell width distribution and mortality.
    摘要 在现实世界的因果推断任务中,针对连续型(而非二元)干预变量的反事实推断更为常见。尽管已有一些基于边际结构模型 (Marginal Structural Model) 的样本重加权方法用于消除混杂偏差,但它们通常只关注消除干预对混杂变量的线性依赖,并依赖于所假设的参数模型的准确性,而这些假设通常无法验证。在这篇论文中,我们提出了一种去混杂表示学习 (DRL) 框架,通过生成与干预变量解耦的协变量表示,来估计连续型干预下的反事实结果。DRL 是一种非参数模型,能够同时消除干预与协变量之间的线性和非线性依赖。具体来说,我们将去混杂表示与干预变量之间的相关性,与协变量表示与干预变量之间的相关性进行对抗训练,以消除混杂偏差。此外,我们还在框架中嵌入了一个反事实推断网络,使学习到的表示既能服务于去混杂,又能支持可信的推断。在合成数据集上进行的大量实验表明,DRL 模型在学习去混杂表示方面表现出色,并超越了现有最先进的连续干预反事实推断模型。此外,我们还将 DRL 模型应用于真实世界的医疗数据集 MIMIC,展示了红细胞分布宽度与死亡率之间的详细因果关系。
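
The core idea is to learn covariate representations whose dependence on the continuous treatment has been removed. The paper does this with an adversarial, GAN-style comparison of correlations; the sketch below shows a deliberately simplified variant that penalizes the Pearson correlation between each representation dimension and the treatment on top of a factual outcome loss, with the encoder, outcome head, and weight `lam` as placeholders.

```python
import torch
import torch.nn.functional as F

def abs_pearson_corr(z, t):
    """Mean absolute Pearson correlation between each representation dimension and the treatment."""
    z = (z - z.mean(0)) / (z.std(0) + 1e-8)
    t = (t - t.mean()) / (t.std() + 1e-8)
    return (z * t.unsqueeze(1)).mean(0).abs().mean()

def drl_style_loss(encoder, outcome_head, x, t, y, lam=1.0):
    """Factual outcome regression plus a penalty decorrelating the representation from the treatment."""
    z = encoder(x)                                              # de-confounded covariate representation
    y_hat = outcome_head(torch.cat([z, t.unsqueeze(1)], dim=1)).squeeze(1)
    return F.mse_loss(y_hat, y) + lam * abs_pearson_corr(z, t)
```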

Past-present temporal programs over finite traces

  • paper_url: http://arxiv.org/abs/2307.12620
  • repo_url: None
  • paper_authors: Pedro Cabalar, Martín Diéguez, François Laferrière, Torsten Schaub
  • for: 这个论文旨在探讨逻辑编程的扩展,使用时间逻辑的语言结构来模型动态应用。
  • methods: 这篇论文使用了TELf语义,对过去和当前的逻辑编程规则进行研究,并将过去和当前分为不同的语义级别。
  • results: 论文提出了一种基于LTLf表达式的完成和循环式表达式的定义,以捕捉过去和当前 temporal 稳定模型。
    Abstract Extensions of Answer Set Programming with language constructs from temporal logics, such as temporal equilibrium logic over finite traces (TELf), provide an expressive computational framework for modeling dynamic applications. In this paper, we study the so-called past-present syntactic subclass, which consists of a set of logic programming rules whose body references to the past and head to the present. Such restriction ensures that the past remains independent of the future, which is the case in most dynamic domains. We extend the definitions of completion and loop formulas to the case of past-present formulas, which allows capturing the temporal stable models of a set of past-present temporal programs by means of an LTLf expression.
    摘要 以时序逻辑的语言构造扩展回答集编程,例如有限轨迹上的时序均衡逻辑 (TELf),为动态应用的建模提供了表达能力强大的计算框架。本文研究所谓的"过去-当前"句法子类,它由一组规则体引用过去、规则头引用当前的逻辑编程规则组成。这种限制保证了过去独立于未来,而这正是大多数动态领域的情形。我们将补全公式和循环公式的定义扩展到过去-当前公式,从而能够用一个 LTLf 表达式刻画一组过去-当前时序程序的时序稳定模型。

CTVIS: Consistent Training for Online Video Instance Segmentation

  • paper_url: http://arxiv.org/abs/2307.12616
  • repo_url: https://github.com/kainingying/ctvis
  • paper_authors: Kaining Ying, Qing Zhong, Weian Mao, Zhenhua Wang, Hao Chen, Lin Yuanbo Wu, Yifan Liu, Chengxiang Fan, Yunzhi Zhuge, Chunhua Shen
  • for: 本研究旨在提高在线视频实例分割(VIS)中的实例嵌入差异,以便在不同时刻进行实例关联。
  • methods: 本研究使用了对比损失来直接监督实例嵌入学习,并使用了一种叫做“一致训练”的简单而有效的训练策略,以提高实例嵌入的可靠性。
  • results: 实验表明,使用“一致训练”策略可以提高实例嵌入的可靠性,并在三个 VIS 测试套件中提高了 SOTA 模型的性能,包括 YTVIS19(55.1% AP)、YTVIS21(50.1% AP)和 OVIS(35.5% AP)。此外,我们还发现使用 pseudo-video 从图像转换而来的模型可以训练出比 Fully-supervised 模型更加强大的模型。
    Abstract The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon the contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To this end, we propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which devotes to aligning the training and inference pipelines in terms of building CIs. Specifically, CTVIS constructs CIs by referring inference the momentum-averaged embedding and the memory bank storage mechanisms, and adding noise to the relevant embeddings. Such an extension allows a reliable comparison between embeddings of current instances and the stable representations of historical instances, thereby conferring an advantage in modeling VIS challenges such as occlusion, re-identification, and deformation. Empirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three VIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS (35.5% AP). Furthermore, we find that pseudo-videos transformed from images can train robust models surpassing fully-supervised ones.
    摘要 在在线视频实例分割 (VIS) 中,实例嵌入的判别性对于跨时刻关联实例起着关键作用。实例嵌入学习由对比损失直接监督,其中对比项 (CI) 是一组锚点/正例/负例嵌入的集合。近期的在线 VIS 方法仅使用来自单个参考帧的 CIs,我们认为这不足以学习高判别性的嵌入。为此,我们提出了一种简单而有效的训练策略,称为在线 VIS 的一致训练 (CTVIS),其目的是在构建 CI 的方式上对齐训练和推理管道。具体来说,CTVIS 参照推理阶段的动量平均嵌入和记忆库存储机制来构建 CI,并向相关嵌入加入噪声。这种扩展使得当前实例的嵌入能够与历史实例的稳定表示进行可靠的比较,从而在遮挡、重识别和形变等 VIS 挑战的建模上获得优势。实验表明,CTVIS 在三个 VIS 基准上超越了当前最先进的 VIS 模型最多 +5.0 个点,包括 YTVIS19(55.1% AP)、YTVIS21(50.1% AP)和 OVIS(35.5% AP)。此外,我们发现由图像转换而来的伪视频可以训练出超越全监督模型的稳健模型。
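
To illustrate how inference-style contrastive items might be built during training, the sketch below keeps one momentum-averaged embedding per instance in a small memory bank, adds noise, and uses the stored embeddings as positives/negatives in an InfoNCE-style loss; the momentum, noise level, temperature, and dictionary-based bank are simplifications of the paper's mechanism, not its exact implementation.

```python
import torch
import torch.nn.functional as F

class MomentumMemoryBank:
    """One momentum-averaged embedding per tracked instance, mimicking the inference-time memory."""
    def __init__(self, momentum=0.9, noise_std=0.05):
        self.momentum, self.noise_std, self.bank = momentum, noise_std, {}

    def update(self, instance_id, embedding):
        old = self.bank.get(instance_id, embedding.detach())
        self.bank[instance_id] = self.momentum * old + (1 - self.momentum) * embedding.detach()

    def contrastive_items(self, instance_id):
        """Positive (same identity, slightly noised) and negatives (other identities) from the bank."""
        pos = self.bank[instance_id] + self.noise_std * torch.randn_like(self.bank[instance_id])
        negs = [v for k, v in self.bank.items() if k != instance_id]
        return pos, torch.stack(negs) if negs else None

def instance_contrastive_loss(embedding, pos, negs, tau=0.1):
    """InfoNCE-style loss comparing a current embedding to stable historical representations."""
    candidates = torch.cat([pos.unsqueeze(0), negs])            # positive sits at index 0
    sims = F.cosine_similarity(embedding.unsqueeze(0), candidates, dim=1) / tau
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```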

Regulating AI manipulation: Applying Insights from behavioral economics and psychology to enhance the practicality of the EU AI Act

  • paper_url: http://arxiv.org/abs/2308.02041
  • repo_url: None
  • paper_authors: Huixin Zhong
  • For: 这篇论文的目的是为了解释和增强欧盟人工智能法规第五条的执行,以防止人工智能操纵的可能有害后果。* Methods: 这篇论文使用了认知心理学和行为经济学的研究来解释潜意识技术和相关表达的概念,并将行为经济学中的决策简化技巧推广到操纵技术的领域。* Results: 这篇论文提出了五种经典的决策简化技巧和其相应的示例,以便用户、开发者、算法审核人和法律专业人员识别操纵技术并采取对策。此外,论文还对欧盟人工智能法规第五条的保护效果进行了批判性评估,并提出了特定的修改建议以增强保护效果。
    Abstract The EU AI Act Article 5 is designed to regulate AI manipulation to prevent potential harmful consequences. However, the practical implementation of this legislation is challenging due to the ambiguous terminologies and the unclear presentations of manipulative techniques. Moreover, the Article 5 also suffers criticize of inadequate protective efficacy. This paper attempts to clarify terminologies and to enhance the protective efficacy by integrating insights from psychology and behavioral economics. Firstly, this paper employs cognitive psychology research to elucidate the term subliminal techniques and its associated representation. Additionally, this paper extends the study of heuristics: a set of thinking shortcuts which can be aroused for behavior changing from behavior economics to the realm of manipulative techniques. The elucidation and expansion of terminologies not only provide a more accurate understanding of the legal provision but also enhance its protective efficacy. Secondly, this paper proposes five classical heuristics and their associated examples to illustrate how can AI arouse those heuristics to alter users behavior. The enumeration of heuristics serves as a practical guide for stakeholders such as AI developers, algorithm auditors, users, and legal practitioners, enabling them to identify manipulative techniques and implement countermeasures. Finally, this paper critically evaluates the protective efficacy of Article 5 for both the general public and vulnerable groups. This paper argues that the current protective efficacy of Article 5 is insufficient and thus proposes specific revision suggestions to terms a and b in Article 5 to enhance its protective efficacy. This work contributes to the ongoing discourse on AI ethics and legal regulations, providing a practical guide for interpreting and applying the EU AI Act Article 5.
    摘要 欧盟人工智能法 Article 5 是为了规范人工智能操纵,避免可能的有害后果。然而,实施这一法律的具体方法是困难的,因为涉及的术语含义模糊,操纵技巧的表述也不清楚。此外,Article 5 还受到了保护效果不足的批评。这篇论文试图借助认知心理学和行为经济学的见解来澄清术语并增强保护效果。首先,这篇论文使用认知心理学研究来解释"潜意识技巧"的概念及其相关表现形式。此外,这篇论文将行为经济学中的决策简化技巧 (heuristics) 扩展到了操纵技巧的领域。术语的澄清与扩展不仅可以提供对法律条款更加准确的理解,还可以提高其保护效果。其次,这篇论文提出五种经典的决策简化技巧及其相应的示例,以说明 AI 如何利用这些技巧来改变用户的行为。这份技巧清单为人工智能开发者、算法审计人员、用户和法律从业者等利益相关方提供了实用指南,帮助他们识别操纵技巧并采取应对措施。最后,这篇论文批判性地评估了 Article 5 对公众和弱势群体的保护效果,认为其现有保护效果不足,并针对 Article 5 中的 (a)、(b) 两款提出了具体的修订建议,以增强其保护效果。本工作为关于人工智能伦理与法律监管的持续讨论做出了贡献,并为解读和应用欧盟人工智能法 Article 5 提供了一份实用指南。

Less is More: Focus Attention for Efficient DETR

  • paper_url: http://arxiv.org/abs/2307.12612
  • repo_url: https://github.com/huawei-noah/noah-research
  • paper_authors: Dehua Zheng, Wenhui Dong, Hailin Hu, Xinghao Chen, Yunhe Wang
    for: 这个论文的目的是提高DETR-like模型的计算效率,同时保持模型的准确率。methods: 这个论文使用了一种名为Focus-DETR的方法,它使用了双重注意 Mechanism来注意更有用的 токен,从而提高计算效率。具体来说,它首先使用了一种名为Token Scoring Mechanism来评估每个 токен的重要性,然后使用了一种名为Enhanced Semantic Interaction Mechanism来提高对象的 semantic interaction。results: Comparing with state-of-the-art sparse DETR-like detectors under the same setting, Focus-DETR achieves 50.4AP (+2.2) on COCO, with comparable complexity.
    Abstract DETR-like models have significantly boosted the performance of detectors and even outperformed classical convolutional models. However, all tokens are treated equally without discrimination brings a redundant computational burden in the traditional encoder structure. The recent sparsification strategies exploit a subset of informative tokens to reduce attention complexity maintaining performance through the sparse encoder. But these methods tend to rely on unreliable model statistics. Moreover, simply reducing the token population hinders the detection performance to a large extent, limiting the application of these sparse models. We propose Focus-DETR, which focuses attention on more informative tokens for a better trade-off between computation efficiency and model accuracy. Specifically, we reconstruct the encoder with dual attention, which includes a token scoring mechanism that considers both localization and category semantic information of the objects from multi-scale feature maps. We efficiently abandon the background queries and enhance the semantic interaction of the fine-grained object queries based on the scores. Compared with the state-of-the-art sparse DETR-like detectors under the same setting, our Focus-DETR gets comparable complexity while achieving 50.4AP (+2.2) on COCO. The code is available at https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR and https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR.
    摘要 类 DETR 模型显著提升了检测器的性能,甚至超越了经典的卷积模型。然而,传统编码器结构对所有 token 一视同仁,带来了冗余的计算负担。近期的稀疏化策略利用一部分信息量高的 token 来降低注意力复杂度,并借助稀疏编码器维持性能,但这些方法往往依赖不可靠的模型统计量。此外,简单地减少 token 数量会在很大程度上损害检测性能,限制了这类稀疏模型的应用。我们提出 Focus-DETR,将注意力集中在更具信息量的 token 上,以在计算效率和模型精度之间取得更好的平衡。具体而言,我们用双重注意力重构编码器,其中包含一个同时考虑多尺度特征图中目标的定位与类别语义信息的 token 评分机制。我们据此高效地舍弃背景查询,并基于评分增强细粒度目标查询之间的语义交互。在相同设置下,与最先进的稀疏类 DETR 检测器相比,Focus-DETR 在复杂度相当的情况下在 COCO 上取得了 50.4 AP(+2.2)。代码见 https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR 与 https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR。
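
As a rough sketch of the token-selection idea (scoring tokens and keeping only the most informative fraction before the heavier attention layers), the snippet below ranks tokens with a learned score head and gathers the top keep-ratio of them; the linear score head and the 0.3 ratio are illustrative assumptions, and the real model scores tokens with both localization and category semantics across scales.

```python
import torch

def select_foreground_tokens(tokens, score_head, keep_ratio=0.3):
    """Score every token and keep only the top fraction, discarding likely background tokens.

    tokens: (B, N, C) flattened multi-scale features; score_head: e.g. torch.nn.Linear(C, 1).
    """
    scores = score_head(tokens).squeeze(-1)                     # (B, N) foreground scores
    k = max(1, int(keep_ratio * tokens.shape[1]))
    top = torch.topk(scores, k, dim=1).indices                  # indices of the most informative tokens
    idx = top.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    kept = torch.gather(tokens, dim=1, index=idx)               # (B, k, C) tokens fed to the encoder
    return kept, top, scores

# Example: keep 30% of 1000 tokens with 256 channels.
kept, top, scores = select_foreground_tokens(torch.randn(2, 1000, 256), torch.nn.Linear(256, 1))
```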

SL: Stable Learning in Source-Free Domain Adaption for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.12580
  • repo_url: None
  • paper_authors: Yixin Chen, Yan Wang
  • for: 这篇论文是针对医疗影像分析中的深度学习技术,尤其是面临域名变化问题。
  • methods: 本研究提出了一个名为“稳定学习”(Stable Learning)策略,用于解决“长期训练,差化表现”的问题。这个策略包括对预测重量进行整合和增加熵。
  • results: 比较实验显示了这个策略的有效性。此外,研究者还进行了广泛的剥离实验,以评估不同方法之间的比较。
    Abstract Deep learning techniques for medical image analysis usually suffer from the domain shift between source and target data. Most existing works focus on unsupervised domain adaptation (UDA). However, in practical applications, privacy issues are much more severe. For example, the data of different hospitals have domain shifts due to equipment problems, and data of the two domains cannot be available simultaneously because of privacy. In this challenge defined as Source-Free UDA, the previous UDA medical methods are limited. Although a variety of medical source-free unsupervised domain adaption (MSFUDA) methods have been proposed, we found they fall into an over-fitting dilemma called "longer training, worse performance." Therefore, we propose the Stable Learning (SL) strategy to address the dilemma. SL is a scalable method and can be integrated with other research, which consists of Weight Consolidation and Entropy Increase. First, we apply Weight Consolidation to retain domain-invariant knowledge and then we design Entropy Increase to avoid over-learning. Comparative experiments prove the effectiveness of SL. We also have done extensive ablation experiments. Besides, We will release codes including a variety of MSFUDA methods.
    摘要 用于医疗影像分析的深度学习技术通常受到源数据与目标数据之间域偏移的影响。大多数现有工作集中在无监督域自适应 (UDA) 上。然而,在实际应用中,隐私问题要严重得多。例如,不同医院的数据因设备差异而存在域偏移,而出于隐私考虑,两个域的数据又无法同时获得。在这种被称为无源 UDA (Source-Free UDA) 的挑战下,以往的 UDA 医疗方法作用有限。尽管已经提出了多种无源医疗无监督域自适应 (MSFUDA) 方法,我们发现它们会陷入"训练越久,性能越差"的过拟合困境。因此,我们提出稳定学习 (SL) 策略来解决这一困境。SL 是一种可扩展的方法,可以与其他研究集成,它由权重固化 (Weight Consolidation) 和熵增加 (Entropy Increase) 两部分组成。首先,我们应用权重固化来保留域不变知识,然后设计熵增加以避免过度学习。对比实验证明了 SL 的有效性。此外,我们还进行了广泛的消融实验,并将发布包含多种 MSFUDA 方法的代码。
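
The strategy has two ingredients: weight consolidation, which keeps the adapted model near the frozen source weights, and entropy increase, which discourages over-confident predictions during long target-domain training. A minimal sketch of both terms is below; the quadratic penalty form, the coefficients, and how they are combined with the adaptation loss are assumptions rather than the paper's exact formulation.

```python
import torch

def weight_consolidation(model, source_state, lam=1e-3):
    """L2 pull towards the frozen source weights, retaining domain-invariant knowledge."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + ((param - source_state[name].to(param.device)) ** 2).sum()
    return lam * penalty

def entropy_increase(probs, beta=0.1, eps=1e-8):
    """Subtracting prediction entropy from the loss rewards less confident (less over-fit) outputs."""
    entropy = -(probs * (probs + eps).log()).sum(dim=1).mean()
    return -beta * entropy

def stable_learning_loss(adaptation_loss, model, source_state, probs):
    """Target-domain objective = adaptation loss + weight consolidation + entropy-increase term."""
    return adaptation_loss + weight_consolidation(model, source_state) + entropy_increase(probs)
```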

Continuation Path Learning for Homotopy Optimization

  • paper_url: http://arxiv.org/abs/2307.12551
  • repo_url: https://github.com/xi-l/cpl
  • paper_authors: Xi Lin, Zhiyuan Yang, Xiaoyuan Zhang, Qingfu Zhang
  • for: 解决复杂优化问题,提高homotopy优化的效果和可靠性。
  • methods: 提出了一种基于模型的方法,可以同时优化原始问题和所有优化子问题,并在实时生成任何中间解决方案。
  • results: 实验表明,提议的方法可以大幅提高homotopy优化的性能,并提供更多有用信息,以支持更好的决策。
    Abstract Homotopy optimization is a traditional method to deal with a complicated optimization problem by solving a sequence of easy-to-hard surrogate subproblems. However, this method can be very sensitive to the continuation schedule design and might lead to a suboptimal solution to the original problem. In addition, the intermediate solutions, often ignored by classic homotopy optimization, could be useful for many real-world applications. In this work, we propose a novel model-based approach to learn the whole continuation path for homotopy optimization, which contains infinite intermediate solutions for any surrogate subproblems. Rather than the classic unidirectional easy-to-hard optimization, our method can simultaneously optimize the original problem and all surrogate subproblems in a collaborative manner. The proposed model also supports real-time generation of any intermediate solution, which could be desirable for many applications. Experimental studies on different problems show that our proposed method can significantly improve the performance of homotopy optimization and provide extra helpful information to support better decision-making.
    摘要 同伦优化是一种处理复杂优化问题的传统方法,其通过求解一系列由易到难的代理子问题来逼近原问题。然而,该方法对延拓调度的设计非常敏感,可能导致原问题的次优解。此外,经典同伦优化通常忽略的中间解在许多现实应用中其实很有价值。在本工作中,我们提出一种新颖的基于模型的方法来学习同伦优化的整条延拓路径,其中包含任意代理子问题的无穷多个中间解。不同于经典的单向由易到难优化,我们的方法能够以协同的方式同时优化原问题和所有代理子问题。所提模型还支持实时生成任意中间解,这在许多应用中都是有益的。在不同问题上的实验研究表明,我们提出的方法能够显著提升同伦优化的性能,并提供额外的有用信息以支持更好的决策。
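
A toy illustration of learning the whole continuation path is given below: a small network maps the continuation level t in [0,1] to a candidate solution, and training samples t at random so that the easy (t=0) and hard (t=1) surrogates are optimized jointly and any intermediate solution can be generated on demand. The 1-D toy objective, network size, and training budget are made up for the example.

```python
import torch

def homotopy_objective(x, t):
    """Toy surrogate family: t=0 is a smooth easy problem, t=1 the rugged original."""
    easy = (x ** 2).sum(dim=1)
    hard = (x ** 2).sum(dim=1) + torch.sin(5 * x).sum(dim=1)
    return (1 - t.squeeze(1)) * easy + t.squeeze(1) * hard

path_model = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(path_model.parameters(), lr=1e-2)

for step in range(2000):
    t = torch.rand(128, 1)                       # sample continuation levels instead of a fixed schedule
    x = path_model(t)                            # candidate solution for every sampled level
    loss = homotopy_objective(x, t).mean()       # all surrogate subproblems optimized collaboratively
    opt.zero_grad()
    loss.backward()
    opt.step()

print(path_model(torch.tensor([[0.5]])))         # any intermediate solution is available in real time
```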

Knapsack: Connectedness, Path, and Shortest-Path

  • paper_url: http://arxiv.org/abs/2307.12547
  • repo_url: None
  • paper_authors: Palash Dey, Sudeshna Kolay, Sipra Singh
  • for: 这个论文研究了带有图理解的背包问题。 specifically, it aims to find a connected subset of items of maximum value that satisfies the knapsack constraint.
  • methods: 这个论文使用了图论的方法来解决这个问题。 specifically, it shows that the problem is strongly NP-complete even for graphs of maximum degree four and NP-complete even for star graphs.
  • results: 这个论文得到了一个时间复杂度为 $O\left(2^{tw\log tw}\cdot\text{poly}(\min{s^2,d^2})\right)$ 的算法,以及一个 $(1-\epsilon)$ 因数逼近算法时间复杂度为 $O\left(2^{tw\log tw}\cdot\text{poly}(n,1/\epsilon)\right)$ для每个 $\epsilon>0$。 Additionally, it shows that connected-knapsack is computationally hardest followed by path-knapsack and shortestpath-knapsack.
    Abstract We study the knapsack problem with graph theoretic constraints. That is, we assume that there exists a graph structure on the set of items of knapsack and the solution also needs to satisfy certain graph theoretic properties on top of knapsack constraints. In particular, we need to compute in the connected knapsack problem a connected subset of items which has maximum value subject to the size of knapsack constraint. We show that this problem is strongly NP-complete even for graphs of maximum degree four and NP-complete even for star graphs. On the other hand, we develop an algorithm running in time $O\left(2^{tw\log tw}\cdot\text{poly}(\min\{s^2,d^2\})\right)$ where $tw,s,d$ are respectively treewidth of the graph, size, and target value of the knapsack. We further exhibit a $(1-\epsilon)$ factor approximation algorithm running in time $O\left(2^{tw\log tw}\cdot\text{poly}(n,1/\epsilon)\right)$ for every $\epsilon>0$. We show similar results for several other graph theoretic properties, namely path and shortest-path under the problem names path-knapsack and shortestpath-knapsack. Our results seems to indicate that connected-knapsack is computationally hardest followed by path-knapsack and shortestpath-knapsack.
    摘要 我们研究带有图论约束的背包问题。具体来说,我们假设背包的物品集合上存在一个图结构,解除了满足背包约束之外还需要满足一定的图论性质。特别地,在连通背包问题中,我们需要在背包容量约束下计算一个价值最大的连通物品子集。我们证明了该问题即使在最大度为四的图上也是强 NP 完全的,即使在星图上也是 NP 完全的。另一方面,我们给出了一个运行时间为 $O\left(2^{tw\log tw}\cdot\text{poly}(\min\{s^2,d^2\})\right)$ 的算法,其中 $tw,s,d$ 分别是图的树宽、背包的容量和目标价值。我们进一步给出了一个对任意 $\epsilon>0$ 运行时间为 $O\left(2^{tw\log tw}\cdot\text{poly}(n,1/\epsilon)\right)$ 的 $(1-\epsilon)$ 因子近似算法。我们还对其他几种图论性质,即路径和最短路径,得到了类似的结果,对应的问题分别称为路径背包和最短路径背包。我们的结果表明,连通背包在计算上最困难,其次是路径背包和最短路径背包。
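
To make the connected-knapsack problem statement concrete, here is a brute-force reference implementation that enumerates item subsets, keeps those that fit the capacity and induce a connected subgraph, and returns the best one. It is exponential in the number of items and only meant for tiny instances; the treewidth-parameterized algorithm from the paper is far more involved.

```python
from itertools import combinations
import networkx as nx

def connected_knapsack_bruteforce(graph, values, sizes, capacity):
    """Best connected item subset fitting the knapsack; graph nodes are the items."""
    best_value, best_set = 0, set()
    nodes = list(graph.nodes)
    for r in range(1, len(nodes) + 1):
        for subset in combinations(nodes, r):
            if sum(sizes[v] for v in subset) > capacity:
                continue
            if not nx.is_connected(graph.subgraph(subset)):
                continue
            value = sum(values[v] for v in subset)
            if value > best_value:
                best_value, best_set = value, set(subset)
    return best_value, best_set

# Tiny example on the path graph 0-1-2-3: the best feasible connected set is {2, 3}.
g = nx.path_graph(4)
print(connected_knapsack_bruteforce(g, values={0: 3, 1: 1, 2: 4, 3: 5},
                                    sizes={0: 2, 1: 2, 2: 3, 3: 4}, capacity=7))
```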

Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model

  • paper_url: http://arxiv.org/abs/2307.12545
  • repo_url: None
  • paper_authors: Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, Yanning Zhang
  • for: 这个研究的目的是为了提出一个新的错误检测任务,即错误视频搜寻(VAR),这个任务的目的是实用地搜寻错误的视频,并且使用语言描述和同步的声音进行跨Modalities的搜寻。
  • methods: 这个研究使用了一个名为“错误引导排序”的类别学习模型,它使用了一个新的错误排序方法来将错误的部分对应到视频中的特定时间点。此外,这个模型还使用了一个内建的类别排序方法来将视频中的内容与语言描述进行对应。
  • results: 实验结果显示,这个新的错误检测任务(VAR)是一个非常有挑战性的任务,并且证明了这个任务的重要性。此外,实验结果还显示了这个模型的优秀性,它可以实现高度的准确率和高效率。
    Abstract Video anomaly detection (VAD) has been paid increasing attention due to its potential applications, its current dominant tasks focus on online detecting anomalies at the frame level, which can be roughly interpreted as the binary or multiple event classification. However, such a setup that builds relationships between complicated anomalous events and single labels, e.g., ``vandalism'', is superficial, since single labels are deficient to characterize anomalous events. In reality, users tend to search a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and positive but few researches focus on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos by cross-modalities, e.g., language descriptions and synchronous audios. Unlike the current video retrieval where videos are assumed to be temporally well-trimmed with short duration, VAR is devised to retrieve long untrimmed videos which may be partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose an anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between video-text fine-grained representations. Besides, we leverage two complementary alignments to further match cross-modal contents. Experimental results on two benchmarks reveal the challenges of VAR task and also demonstrate the advantages of our tailored method.
    摘要 视频异常检测 (VAD) 因其潜在的应用价值而受到越来越多的关注。目前主流的 VAD 任务都是在帧级别上进行在线异常检测,可以大致理解为二分类或多类别事件分类。但是,这种把复杂的异常事件与单一标签(如"破坏行为")建立联系的设置较为表面化,因为单一标签不足以刻画异常事件的复杂性。在实际应用中,用户通常会搜索特定的视频,而不是一系列近似的视频。因此,使用详细描述来检索异常事件是实用且有积极意义的,但鲜有研究关注这一点。在这一背景下,我们提出了一个新任务:视频异常检索 (VAR),其目标是借助跨模态信息(如语言描述和同步音频)实用地检索相关的异常视频。不同于现有的视频检索假设视频是经过良好时序裁剪的短视频,VAR 旨在检索可能只与给定查询部分相关的未裁剪长视频。为实现这一目标,我们在常用异常数据集之上构建了两个大规模 VAR 基准:UCFCrime-AR 和 XDViolence-AR。同时,我们为 VAR 设计了一个名为异常引导对齐网络 (ALAN) 的模型。在 ALAN 中,我们提出了一种异常引导采样方法,以关注未裁剪长视频中的关键片段;随后,我们引入了一个高效的前置任务,以增强视频与文本细粒度表示之间的语义关联;此外,我们利用两种互补的对齐方式进一步匹配跨模态内容。在两个基准上的实验结果揭示了 VAR 任务的挑战性,同时也证明了我们定制方法的优势。

Client-Level Differential Privacy via Adaptive Intermediary in Federated Medical Imaging

  • paper_url: http://arxiv.org/abs/2307.12542
  • repo_url: https://github.com/med-air/client-dp-fl
  • paper_authors: Meirui Jiang, Yuan Zhong, Anjie Le, Xiaoxiao Li, Qi Dou
  • for: This paper aims to optimize the trade-off between privacy protection and performance in federated learning (FL) for medical imaging under the context of client-level differential privacy (DP).
  • methods: The proposed approach is based on an adaptive intermediary strategy that splits clients into sub-clients, which serve as intermediaries between hospitals and the server to mitigate the noises introduced by DP without harming privacy.
  • results: The proposed approach is empirically evaluated on both classification and segmentation tasks using two public datasets, and its effectiveness is demonstrated with significant performance improvements and comprehensive analytical studies.
    Abstract Despite recent progress in enhancing the privacy of federated learning (FL) via differential privacy (DP), the trade-off of DP between privacy protection and performance is still underexplored for real-world medical scenario. In this paper, we propose to optimize the trade-off under the context of client-level DP, which focuses on privacy during communications. However, FL for medical imaging involves typically much fewer participants (hospitals) than other domains (e.g., mobile devices), thus ensuring clients be differentially private is much more challenging. To tackle this problem, we propose an adaptive intermediary strategy to improve performance without harming privacy. Specifically, we theoretically find splitting clients into sub-clients, which serve as intermediaries between hospitals and the server, can mitigate the noises introduced by DP without harming privacy. Our proposed approach is empirically evaluated on both classification and segmentation tasks using two public datasets, and its effectiveness is demonstrated with significant performance improvements and comprehensive analytical studies. Code is available at: https://github.com/med-air/Client-DP-FL.
    摘要 尽管最近在通过差分隐私 (DP) 增强联邦学习 (FL) 隐私性方面取得了进展,但在真实世界医疗场景中,DP 在隐私保护与性能之间的权衡仍未得到充分探讨。在这篇论文中,我们提出在客户端级 DP(侧重通信过程中的隐私)的背景下优化这一权衡。然而,医疗影像 FL 的参与方(医院)通常远少于其他领域(如移动设备),因此确保各客户端满足差分隐私要求更具挑战性。为解决这个问题,我们提出一种自适应中介策略,在不损害隐私的前提下提升性能。具体来说,我们从理论上发现,将客户端拆分为子客户端,让它们充当医院与服务器之间的中介,可以在不损害隐私的情况下缓解 DP 引入的噪声。我们的方法在两个公共数据集上的分类和分割任务中进行了实证评估,并通过显著的性能提升和全面的分析研究证明了其有效性。代码见:https://github.com/med-air/Client-DP-FL。
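
A minimal sketch of why sub-clients help is below: in client-level DP aggregation each participant's update is clipped and Gaussian noise scaled to the clipping norm is added to the average, so when each hospital is represented by several sub-clients the same noise is spread over more updates. The clipping norm, noise multiplier, and equal participation are illustrative, and the actual privacy accounting in the paper is more careful than shown.

```python
import torch

def dp_fedavg_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0):
    """Client-level DP aggregation: clip each participant's update, average, then add Gaussian noise."""
    clipped = []
    for u in updates:
        norm = torch.linalg.vector_norm(u)
        clipped.append(u * min(1.0, float(clip_norm / (norm + 1e-12))))
    avg = torch.stack(clipped).mean(dim=0)
    noise = torch.randn_like(avg) * noise_multiplier * clip_norm / len(updates)
    return avg + noise

# With few hospitals the noise term dominates; splitting each hospital into sub-clients
# (each training on a data shard and submitting its own update) raises len(updates) and
# shrinks the injected noise per aggregated update.
hospital_updates = [torch.randn(10) for _ in range(4)]            # 4 hospitals
subclient_updates = [torch.randn(10) for _ in range(4 * 5)]       # 5 sub-clients per hospital
print(dp_fedavg_aggregate(hospital_updates).norm(), dp_fedavg_aggregate(subclient_updates).norm())
```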

SelFormaly: Towards Task-Agnostic Unified Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.12540
  • repo_url: None
  • paper_authors: Yujin Lee, Harin Lim, Hyunsoo Yoon
  • for: 这篇论文旨在提出一个通用且强大的问题检测框架,以扩展previous任务特定的问题检测方法。
  • methods: 这篇论文使用了自我supervised ViTs,以及back-patch masking和top k-ratio feature matching等技术来实现通用的问题检测。
  • results: 这篇论文在不同的数据集上都 achieved state-of-the-art 的结果,并且适用于多种任务,包括问题检测、Semantic anomaly detection、多类问题检测和问题聚合。
    Abstract The core idea of visual anomaly detection is to learn the normality from normal images, but previous works have been developed specifically for certain tasks, leading to fragmentation among various tasks: defect detection, semantic anomaly detection, multi-class anomaly detection, and anomaly clustering. This one-task-one-model approach is resource-intensive and incurs high maintenance costs as the number of tasks increases. This paper presents SelFormaly, a universal and powerful anomaly detection framework. We emphasize the necessity of our off-the-shelf approach by pointing out a suboptimal issue with fluctuating performance in previous online encoder-based methods. In addition, we question the effectiveness of using ConvNets as previously employed in the literature and confirm that self-supervised ViTs are suitable for unified anomaly detection. We introduce back-patch masking and discover the new role of top k-ratio feature matching to achieve unified and powerful anomaly detection. Back-patch masking eliminates irrelevant regions that possibly hinder target-centric detection with representations of the scene layout. The top k-ratio feature matching unifies various anomaly levels and tasks. Finally, SelFormaly achieves state-of-the-art results across various datasets for all the aforementioned tasks.
    摘要 视觉异常检测的核心思想是从正常图像中学习"正常性",但以往的工作都是针对特定任务开发的,导致缺陷检测、语义异常检测、多类异常检测和异常聚类等各种任务相互割裂。这种一个任务一个模型的做法资源消耗大,且随着任务数量增加维护成本也会升高。本文提出 SelFormaly,一个通用且强大的异常检测框架。我们通过指出以往基于在线编码器的方法存在性能波动的次优问题,强调了采用现成 (off-the-shelf) 方案的必要性。此外,我们质疑了文献中惯用的 ConvNets 的有效性,并确认自监督 ViTs 适用于统一的异常检测。我们引入背向补丁掩码 (back-patch masking),并发现了 top k-ratio 特征匹配在实现统一而强大的异常检测中的新作用。背向补丁掩码借助场景布局的表示去除可能干扰以目标为中心检测的无关区域;top k-ratio 特征匹配则统一了不同的异常层级和任务。最终,SelFormaly 在上述所有任务的多个数据集上都取得了最先进的结果。
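
The sketch below illustrates one way top k-ratio feature matching can turn patch features into an image-level anomaly score: each test patch is compared to a memory of normal patch features, and only the top fraction of the largest nearest-neighbour distances contribute to the score. The ViT feature dimensions, memory construction, and 0.1 ratio are assumptions; back-patch masking would remove irrelevant patches before this step.

```python
import torch

def topk_ratio_anomaly_score(test_patches, normal_memory, k_ratio=0.1):
    """Image-level anomaly score from patch-to-normal-memory distances.

    test_patches: (N, D) patch embeddings of the test image (e.g. from a self-supervised ViT).
    normal_memory: (M, D) patch embeddings collected from normal training images.
    """
    dists = torch.cdist(test_patches, normal_memory)       # (N, M) pairwise distances
    nn_dist = dists.min(dim=1).values                       # nearest-normal distance per patch
    k = max(1, int(k_ratio * len(nn_dist)))
    return torch.topk(nn_dist, k).values.mean()             # average over the most suspicious patches

# Toy usage with random features.
print(float(topk_ratio_anomaly_score(torch.randn(196, 384), torch.randn(5000, 384))))
```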

Rethinking Medical Report Generation: Disease Revealing Enhancement with Knowledge Graph

  • paper_url: http://arxiv.org/abs/2307.12526
  • repo_url: https://github.com/wangyixinxin/mrg-kg
  • paper_authors: Yixin Wang, Zihao Lin, Haoyu Dong
  • for: 这个研究的目的是提高医疗报告生成(MRG)过程中的知识图(KG)的完整性和应用。
  • methods: 这个研究使用了一个完整的知识图,包括137种疾病和异常性,以帮助指导MRG过程。此外,研究还引入了一种新的增强策略,以便增强疾病类型在分布的表现。
  • results: 研究发现,使用提案的两个阶段生成方法和增强策略,可以显著提高生成报告中的疾病匹配度和多样性。这表明,这种方法可以有效地减少对疾病分布的长尾问题。
    Abstract Knowledge Graph (KG) plays a crucial role in Medical Report Generation (MRG) because it reveals the relations among diseases and thus can be utilized to guide the generation process. However, constructing a comprehensive KG is labor-intensive and its applications on the MRG process are under-explored. In this study, we establish a complete KG on chest X-ray imaging that includes 137 types of diseases and abnormalities. Based on this KG, we find that the current MRG data sets exhibit a long-tailed problem in disease distribution. To mitigate this problem, we introduce a novel augmentation strategy that enhances the representation of disease types in the tail-end of the distribution. We further design a two-stage MRG approach, where a classifier is first trained to detect whether the input images exhibit any abnormalities. The classified images are then independently fed into two transformer-based generators, namely, ``disease-specific generator" and ``disease-free generator" to generate the corresponding reports. To enhance the clinical evaluation of whether the generated reports correctly describe the diseases appearing in the input image, we propose diverse sensitivity (DS), a new metric that checks whether generated diseases match ground truth and measures the diversity of all generated diseases. Results show that the proposed two-stage generation framework and augmentation strategies improve DS by a considerable margin, indicating a notable reduction in the long-tailed problem associated with under-represented diseases.
    摘要 知识图谱 (KG) 在医学报告生成 (MRG) 中发挥关键作用,因为它揭示了疾病之间的关系,可以用于指导生成过程。然而,构建完整的 KG 十分耗费人力,其在 MRG 过程中的应用也尚未得到充分探索。在本研究中,我们建立了一个涵盖 137 种疾病和异常的完整胸部 X 光 KG。基于这个 KG,我们发现现有的 MRG 数据集在疾病分布上存在长尾问题。为了缓解这个问题,我们提出了一种新的扩充策略,以增强分布尾部疾病类型的表达。我们还设计了一种两阶段的 MRG 方法:首先训练一个分类器来检测输入图像是否存在异常;分类后的图像再分别送入两个基于 Transformer 的生成器,即"疾病特定生成器"和"无疾病生成器",以生成相应的报告。为了增强对生成报告是否正确描述输入图像中疾病的临床评估,我们提出了多样敏感度 (DS) 这一新指标,它既检查生成的疾病是否与真实疾病匹配,又衡量所有生成疾病的多样性。结果表明,我们提出的两阶段生成框架和扩充策略可以大幅提高 DS,表明代表性不足的疾病所带来的长尾问题得到了显著缓解。

FaFCNN: A General Disease Classification Framework Based on Feature Fusion Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12518
  • repo_url: None
  • paper_authors: Menglin Kong, Shaojie Zhao, Juan Cheng, Xingquan Li, Ri Su, Muzhou Hou, Cong Cao
  • for: 本研究旨在解决应用深度学习/机器学习方法于疾病分类任务中的两个基本问题,即训练样本数量和质量的不足,以及如何有效地融合多源特征并训练稳定的分类模型。
  • methods: 我们提出了一种基于人类学习知识的Feature-aware Fusion Correlation Neural Network (FaFCNN)框架,包括特征意识互动模块和域对抗学习基于特征对齐模块。
  • results: 实验结果表明,通过使用预训练梯度提升树的扩充特征,FaFCNN在低质量 dataset 上实现了更高的性能提升,并且对于竞争对手基线方法进行了一致性优化。此外,广泛的实验还证明了提案的方法的稳定性和模型中每个组件的有效性。
    Abstract There are two fundamental problems in applying deep learning/machine learning methods to disease classification tasks, one is the insufficient number and poor quality of training samples; another one is how to effectively fuse multiple source features and thus train robust classification models. To address these problems, inspired by the process of human learning knowledge, we propose the Feature-aware Fusion Correlation Neural Network (FaFCNN), which introduces a feature-aware interaction module and a feature alignment module based on domain adversarial learning. This is a general framework for disease classification, and FaFCNN improves the way existing methods obtain sample correlation features. The experimental results show that training using augmented features obtained by pre-training gradient boosting decision tree yields more performance gains than random-forest based methods. On the low-quality dataset with a large amount of missing data in our setup, FaFCNN obtains a consistently optimal performance compared to competitive baselines. In addition, extensive experiments demonstrate the robustness of the proposed method and the effectiveness of each component of the model\footnote{Accepted in IEEE SMC2023}.
    摘要 将深度学习/机器学习方法应用于疾病分类任务存在两个基本问题:一是训练样本数量不足且质量较差;二是如何有效融合多源特征并训练出稳健的分类模型。受人类学习知识过程的启发,我们提出了特征感知融合相关神经网络 (FaFCNN) 来解决这些问题,其引入了特征感知交互模块和基于域对抗学习的特征对齐模块。这是一个通用的疾病分类框架,FaFCNN 改进了现有方法获取样本相关特征的方式。实验结果表明,使用预训练梯度提升决策树得到的增强特征进行训练,比基于随机森林的方法带来更多的性能提升。在我们设置的含有大量缺失数据的低质量数据集上,FaFCNN 相比有竞争力的基线方法取得了一致最优的性能。此外,大量消融实验证明了所提方法的稳健性以及模型各组件的有效性(已被 IEEE SMC2023 接收)。

Gradient-Based Word Substitution for Obstinate Adversarial Examples Generation in Language Models

  • paper_url: http://arxiv.org/abs/2307.12507
  • repo_url: None
  • paper_authors: Yimu Wang, Peng Shi, Hongyang Zhang
  • for: This paper aims to address the problem of generating obstinate adversarial examples in NLP by introducing a novel word substitution method named GradObstinate, which automatically generates obstinate adversarial examples without any constraints on the search space or the need for manual design principles.
  • methods: The proposed GradObstinate method uses a gradient-based approach to automatically generate obstinate adversarial examples. It does not rely on any manual design principles or constraints on the search space, making it more practical and applicable in real-world scenarios.
  • results: The proposed GradObstinate method is evaluated on five representative NLP models and four benchmarks, and the results show that it generates more powerful obstinate adversarial examples with a higher attack success rate compared to antonym-based methods. Additionally, the obstinate substitutions found by GradObstinate are transferable to other models in black-box settings, including even GPT-3 and ChatGPT.
    Abstract In this paper, we study the problem of generating obstinate (over-stability) adversarial examples by word substitution in NLP, where input text is meaningfully changed but the model's prediction does not, even though it should. Previous word substitution approaches have predominantly focused on manually designed antonym-based strategies for generating obstinate adversarial examples, which hinders its application as these strategies can only find a subset of obstinate adversarial examples and require human efforts. To address this issue, in this paper, we introduce a novel word substitution method named GradObstinate, a gradient-based approach that automatically generates obstinate adversarial examples without any constraints on the search space or the need for manual design principles. To empirically evaluate the efficacy of GradObstinate, we conduct comprehensive experiments on five representative models (Electra, ALBERT, Roberta, DistillBERT, and CLIP) finetuned on four NLP benchmarks (SST-2, MRPC, SNLI, and SQuAD) and a language-grounding benchmark (MSCOCO). Extensive experiments show that our proposed GradObstinate generates more powerful obstinate adversarial examples, exhibiting a higher attack success rate compared to antonym-based methods. Furthermore, to show the transferability of obstinate word substitutions found by GradObstinate, we replace the words in four representative NLP benchmarks with their obstinate substitutions. Notably, obstinate substitutions exhibit a high success rate when transferred to other models in black-box settings, including even GPT-3 and ChatGPT. Examples of obstinate adversarial examples found by GradObstinate are available at https://huggingface.co/spaces/anonauthors/SecretLanguage.
    摘要 在这篇论文中,我们研究了对话语言处理(NLP)领域中生成顽固(过度稳定)攻击示例的问题,通过单词替换而生成这些示例。现有的单词替换方法主要采用人工设计的反义策略来生成顽固攻击示例,这限制了其应用,因为这些策略只能找到一部分顽固攻击示例,并且需要人工劳动。为解决这问题,在这篇论文中,我们提出了一种新的单词替换方法,即GradObstinate,它是基于梯度的方法,可以自动生成顽固攻击示例,不需要任何限制或人工设计原则。为证明GradObstinate的有效性,我们在五种代表性模型(Electra、ALBERT、Roberta、DistillBERT和CLIP)上进行了广泛的实验,这些模型在四个NLPBenchmark(SST-2、MRPC、SNLI和SQuAD)和一个语言固定 benchmark(MSCOCO)上进行了finetuning。实验结果表明,我们提出的GradObstinate可以更好地生成顽固攻击示例,对于反义策略来说,攻击成功率更高。此外,我们还证明了GradObstinate生成的顽固替换示例在黑盒Setting中的传送性,包括GPT-3和ChatGPT等模型。详细的顽固攻击示例可以在https://huggingface.co/spaces/anonauthors/SecretLanguage上找到。
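
As a rough sketch of the gradient-guided substitution idea (not necessarily GradObstinate's exact procedure), the snippet below ranks vocabulary replacements for one position by a first-order estimate of how much the loss would change, preferring candidates that barely change the prediction, which is the obstinate direction. It assumes the loss was computed from an embedding tensor with gradients enabled, and in practice candidates would be filtered to exclude the original token and trivial variants.

```python
import torch

def rank_obstinate_substitutions(loss, input_embeds, position, embedding_matrix, top_n=10):
    """Rank replacement tokens for `position` by a first-order estimate of the loss change.

    Replacing embedding e_i with e_w shifts the loss by roughly (e_w - e_i) . grad_i, so the
    candidates with the smallest absolute shift are the ones most likely to leave the
    prediction unchanged even though the input word changes.
    """
    grad = torch.autograd.grad(loss, input_embeds, retain_graph=True)[0][0, position]   # (D,)
    e_i = input_embeds[0, position].detach()                                            # (D,)
    est_change = (embedding_matrix - e_i) @ grad                                         # (V,)
    return torch.topk(-est_change.abs(), top_n).indices                                  # most "stable" token ids
```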

TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition

  • paper_url: http://arxiv.org/abs/2307.12493
  • repo_url: https://github.com/Shilin-LU/TF-ICON
  • paper_authors: Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong
  • for: 这个论文旨在提出一个无需训练的图像调和框架,将文本驱动的填充模型应用于跨领域图像导向作业。
  • methods: 这个框架使用的方法是使用文本驱动的填充模型,不需要进一步的训练、调整或优化。
  • results: 实验结果显示,将Stable Diffusion与特别提示(Exceptional Prompt)搭配可以超越现有的对应方法,而TF-ICON在多个视觉领域中也表现出优越的可 versatility。
    Abstract Text-driven diffusion models have exhibited impressive generative capabilities, enabling various image editing tasks. In this paper, we propose TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition. This task aims to seamlessly integrate user-provided objects into a specific visual context. Current diffusion-based methods often involve costly instance-based optimization or finetuning of pretrained models on customized datasets, which can potentially undermine their rich prior. In contrast, TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization. Moreover, we introduce the exceptional prompt, which contains no information, to facilitate text-driven diffusion models in accurately inverting real images into latent representations, forming the basis for compositing. Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ, COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile visual domains. Code is available at https://github.com/Shilin-LU/TF-ICON
    摘要 文本驱动的扩散模型已经展示出了令人印象深刻的生成能力,可以完成多种图像编辑任务。在这篇论文中,我们提出了 TF-ICON,一种新颖的免训练图像组合 (Training-Free Image COmpositioN) 框架,利用文本驱动扩散模型进行跨域图像引导组合。该任务的目标是将用户提供的对象无缝地融合到特定的视觉上下文中。当前基于扩散的方法通常需要代价高昂的逐实例优化,或在定制数据集上微调预训练模型,这可能会削弱其丰富的先验知识。相比之下,TF-ICON 可以直接利用现成的扩散模型完成跨域图像引导组合,而不需要额外的训练、微调或优化。此外,我们引入了不包含任何信息的特殊提示 (exceptional prompt),以便文本驱动扩散模型能够准确地将真实图像反演为潜在表示,为组合奠定基础。我们的实验表明,在多个数据集(CelebA-HQ、COCO 和 ImageNet)上,配备特殊提示的 Stable Diffusion 超越了最先进的反演方法,而 TF-ICON 也在多种视觉领域中超越了先前的基线方法。代码可以在 https://github.com/Shilin-LU/TF-ICON 找到。

ChatGPT for Software Security: Exploring the Strengths and Limitations of ChatGPT in the Security Applications

  • paper_url: http://arxiv.org/abs/2307.12488
  • repo_url: None
  • paper_authors: Zhilong Wang, Lan Zhang, Peng Liu
  • for: The paper aims to evaluate ChatGPT’s capabilities in security-oriented program analysis, specifically from the perspectives of both attackers and security analysts.
  • methods: The paper uses a case study approach, presenting several security-oriented program analysis tasks and deliberately introducing challenges to assess ChatGPT’s responses.
  • results: The paper examines the quality of answers provided by ChatGPT to gain a clearer understanding of its strengths and limitations in the realm of security-oriented program analysis.
  • for: 本文旨在评估ChatGPT在安全关注程序分析方面的能力,具体来说是从攻击者和安全分析员两个角度出发。
  • methods: 本文采用 caso study方法,通过提出多个安全关注程序分析任务,故意引入挑战来评估ChatGPT的回答质量。
  • results: 本文通过分析ChatGPT的回答来了解它在安全关注程序分析方面的优劣点。
    Abstract ChatGPT, as a versatile large language model, has demonstrated remarkable potential in addressing inquiries across various domains. Its ability to analyze, comprehend, and synthesize information from both online sources and user inputs has garnered significant attention. Previous research has explored ChatGPT's competence in code generation and code reviews. In this paper, we delve into ChatGPT's capabilities in security-oriented program analysis, focusing on perspectives from both attackers and security analysts. We present a case study involving several security-oriented program analysis tasks while deliberately introducing challenges to assess ChatGPT's responses. Through an examination of the quality of answers provided by ChatGPT, we gain a clearer understanding of its strengths and limitations in the realm of security-oriented program analysis.
    摘要 chatgpt 作为一种多能语言模型,在各个领域的问题上表现出了惊人的潜力。它可以分析、理解和合成来自线上源和用户输入的信息,吸引了广泛的关注。以前的研究探讨了 chatgpt 在代码生成和代码审查方面的能力。在这篇论文中,我们探究 chatgpt 在安全关注程序分析方面的能力,具体来说是从攻击者和安全分析员的视角来评估 chatgpt 的回答质量。我们通过对多个安全关注程序分析任务的挑战性评估,了解 chatgpt 在安全关注程序分析领域的优势和局限性。

ProtoFL: Unsupervised Federated Learning via Prototypical Distillation

  • paper_url: http://arxiv.org/abs/2307.12450
  • repo_url: None
  • paper_authors: Hansol Kim, Youngjun Kwak, Minyoung Jung, Jinho Shin, Youngsung Kim, Changick Kim
  • for: 提高数据隐私保护和一类分类性能
  • methods: 提出了基于原型表示(prototypical representation)蒸馏的联邦学习方法,以增强全局模型的表示能力并降低通信成本
  • results: 对五种广泛使用的基准数据集进行了广泛的实验,证明了提议的框架在先前的方法中表现出色
    Abstract Federated learning (FL) is a promising approach for enhancing data privacy preservation, particularly for authentication systems. However, limited round communications, scarce representation, and scalability pose significant challenges to its deployment, hindering its full potential. In this paper, we propose 'ProtoFL', Prototypical Representation Distillation based unsupervised Federated Learning to enhance the representation power of a global model and reduce round communication costs. Additionally, we introduce a local one-class classifier based on normalizing flows to improve performance with limited data. Our study represents the first investigation of using FL to improve one-class classification performance. We conduct extensive experiments on five widely used benchmarks, namely MNIST, CIFAR-10, CIFAR-100, ImageNet-30, and Keystroke-Dynamics, to demonstrate the superior performance of our proposed framework over previous methods in the literature.
    摘要 联邦学习(FL)是一种有前途的方法,能够增强数据隐私保护,特别是在身份验证系统中。然而,有限的回合通信、稀缺的表示和可扩展性问题给其部署带来了挑战,限制了其全部潜力。在本文中,我们提出了"ProtoFL",一种基于原型表示蒸馏的无监督联邦学习方法,以提高全局模型的表示能力并降低回合通信成本。此外,我们还引入了基于归一化流(normalizing flows)的本地一类分类器,以提高有限数据下的性能。我们的研究是首次利用FL提高一类分类性能的工作。我们在五个广泛使用的基准数据集(MNIST、CIFAR-10、CIFAR-100、ImageNet-30和Keystroke-Dynamics)上进行了大量实验,证明所提框架优于以往方法。
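An illustrative sketch of the prototype-based exchange suggested by the abstract: clients send normalized mean features (prototypes) instead of raw data, and local training pulls representations toward the aggregated global prototype. All interfaces are assumptions rather than the authors' code.

```python
# Sketch of prototype computation and a distillation-style local objective.
import torch
import torch.nn.functional as F

def local_prototype(encoder, loader, device="cpu"):
    feats = []
    encoder.eval()
    with torch.no_grad():
        for x, _ in loader:
            feats.append(F.normalize(encoder(x.to(device)), dim=-1))
    return torch.cat(feats).mean(dim=0)               # client prototype sent to the server

def distillation_loss(student_encoder, x, global_prototype, temperature=0.1):
    z = F.normalize(student_encoder(x), dim=-1)
    # Pull local representations toward the aggregated global prototype,
    # which acts as the "teacher" signal exchanged between rounds.
    sim = z @ F.normalize(global_prototype, dim=-1) / temperature
    return -sim.mean()
```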

SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

  • paper_url: http://arxiv.org/abs/2307.12445
  • repo_url: None
  • paper_authors: Ivan Vallés-Pérez, Grzegorz Beringer, Piotr Bilinski, Gary Cook, Roberto Barra-Chicote
  • for: 本研究旨在将CLIP模型应用到语音领域,以学习共同的phonetic和acoustic空间表示。
  • methods: 本研究使用CLIP模型,通过对语音资料进行对映和调整,以学习共同的phonetic和acoustic空间表示。
  • results: 研究获得的结果显示,提案的模型具有辨识phonetic变化的能力,并且具有对不同类型噪音的抗响性。此外,研究还证明了模型的对下游应用的有用性,如语音识别和生成等。
    Abstract Numerous examples in the literature proved that deep learning models have the ability to work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim to learn shared representations of phonetic and acoustic spaces. The results show that the proposed model is sensible to phonetic changes, with a 91% of score drops when replacing 20% of the phonemes at random, while providing substantial robustness against different kinds of noise, with a 10% performance drop when mixing the audio with 75% of Gaussian noise. We also provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications, such as intelligibility evaluation and the ability to leverage rich pre-trained phonetic embeddings in speech generation task. Finally, we discuss potential applications with interesting implications for the speech generation and recognition fields.
    摘要 文献中的许多例子证明深度学习模型能够很好地处理多模态数据。近期,CLIP使深度学习系统能够学习图像和文本描述之间的共享潜在空间,并在下游任务中取得出色的零样本或少样本结果。在这篇论文中,我们探索了与CLIP相同的想法,但应用到语音领域,其中音素空间和声学空间通常共存。我们训练了一个基于CLIP的模型,以学习音素空间和声学空间的共享表示。结果显示,所提模型对音素变化敏感:随机替换20%的音素时,得分下降91%;同时对各种噪声具有较强的鲁棒性,在音频中混入75%的高斯噪声时,性能仅下降10%。我们还提供了实验证据,表明所得到的嵌入对多种下游应用有用,例如可懂度评估,以及在语音生成任务中利用丰富的预训练音素嵌入。最后,我们讨论了对语音生成和识别领域具有重要意义的潜在应用。
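The shared phonetic/acoustic space described above can be trained with a CLIP-style symmetric contrastive objective; a minimal sketch follows, where the encoder outputs are assumed to be utterance-level embeddings and the architectures are placeholders.

```python
# Minimal CLIP-style contrastive objective between phoneme and acoustic embeddings.
import torch
import torch.nn.functional as F

def scraps_contrastive_loss(phoneme_emb, acoustic_emb, logit_scale=14.3):
    # phoneme_emb, acoustic_emb: (batch, dim) utterance-level embeddings
    p = F.normalize(phoneme_emb, dim=-1)
    a = F.normalize(acoustic_emb, dim=-1)
    logits = logit_scale * p @ a.t()                   # (batch, batch) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric cross-entropy: matched phoneme/audio pairs are the positives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```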

AMaizeD: An End to End Pipeline for Automatic Maize Disease Detection

  • paper_url: http://arxiv.org/abs/2308.03766
  • repo_url: None
  • paper_authors: Anish Mall, Sanchit Kabra, Ankur Lhila, Pawan Ajmera
  • for: automating maize disease detection using multispectral imagery captured by drones.
  • methods: combines convolutional neural networks (CNNs) as feature extractors with segmentation techniques to identify maize plants and their associated diseases.
  • results: detects a range of maize diseases, including rust, anthracnose, and leaf blight, achieving state-of-the-art performance on the custom hand-collected dataset.
    Abstract This research paper presents AMaizeD: An End to End Pipeline for Automatic Maize Disease Detection, an automated framework for early detection of diseases in maize crops using multispectral imagery obtained from drones. A custom hand-collected dataset focusing specifically on maize crops was meticulously gathered by expert researchers and agronomists. The dataset encompasses a diverse range of maize varieties, cultivation practices, and environmental conditions, capturing various stages of maize growth and disease progression. By leveraging multispectral imagery, the framework benefits from improved spectral resolution and increased sensitivity to subtle changes in plant health. The proposed framework employs a combination of convolutional neural networks (CNNs) as feature extractors and segmentation techniques to identify both the maize plants and their associated diseases. Experimental results demonstrate the effectiveness of the framework in detecting a range of maize diseases, including powdery mildew, anthracnose, and leaf blight. The framework achieves state-of-the-art performance on the custom hand-collected dataset and contributes to the field of automated disease detection in agriculture, offering a practical solution for early identification of diseases in maize crops advanced machine learning techniques and deep learning architectures.

Framing Relevance for Safety-Critical Autonomous Systems

  • paper_url: http://arxiv.org/abs/2307.14355
  • repo_url: None
  • paper_authors: Astrid Rakow
  • for: 这个论文是为了研究如何确定自动化系统在当前任务中需要的信息,以建立适当的世界观并实现任务目标。
  • methods: 这篇论文使用了正式方法来确定自动化系统需要的信息,包括对各种信息源的分类和选择,以及如何将信息整合到自动化系统中。
  • results: 这篇论文的结果表明,使用正式方法可以有效地确定自动化系统需要的信息,并且可以帮助自动化系统在充满信息的环境中做出更加有效的决策。
    Abstract We are in the process of building complex highly autonomous systems that have build-in beliefs, perceive their environment and exchange information. These systems construct their respective world view and based on it they plan their future manoeuvres, i.e., they choose their actions in order to establish their goals based on their prediction of the possible futures. Usually these systems face an overwhelming flood of information provided by a variety of sources where by far not everything is relevant. The goal of our work is to develop a formal approach to determine what is relevant for a safety critical autonomous system at its current mission, i.e., what information suffices to build an appropriate world view to accomplish its mission goals.
    摘要 我们正在建设复杂高自动化系统,这些系统具有内置的信念和环境感知功能,并且能够交换信息。这些系统根据自己的世界观建立未来行动计划,即选择行动以实现目标基于预测未来的可能性。通常这些系统面临着极大的信息泛洪,其中大多数信息并不相关。我们的工作目标是开发一种正式的方法,以确定一个安全关键自动化系统当前任务中需要的信息,以建立合适的世界观以实现任务目标。

Implementing Smart Contracts: The case of NFT-rental with pay-per-like

  • paper_url: http://arxiv.org/abs/2308.02424
  • repo_url: https://github.com/asopi/rental-project
  • paper_authors: Alfred Sopi, Johannes Schneider, Jan vom Brocke
  • for: The paper aims to address the challenges of lending and renting non-fungible tokens (NFTs) for marketing purposes, such as the risk of items not being returned and the difficulty in anticipating the impact of artworks.
  • methods: The paper introduces an NFT rental solution based on a pay-per-like pricing model using blockchain technology and smart contracts on the Ethereum chain.
  • results: The paper finds that blockchain solutions enjoy many advantages, but also observes dark sides such as large blockchain fees, which can be unfair to niche artists and potentially hamper cultural diversity. Additionally, a trust-cost tradeoff arises to handle fraud caused by manipulation from parties outside the blockchain.
  • for: 论文目的是解决非同质化代币(NFT)的出借和租赁问题,如物品不归还的风险以及艺术作品影响难以预测。
  • methods: 论文提出了基于按赞付费(pay-per-like)定价模式的NFT租赁解决方案,使用区块链技术和基于Ethereum链的智能合约。
  • results: 论文发现区块链解决方案具有许多优点,但也注意到其阴暗面,如高昂的区块链费用,可能对小众艺术家不公平,并可能妨碍文化多样性。此外,为处理来自链外各方操纵所引起的欺诈,还存在信任与成本的权衡。
    Abstract Non-fungible tokens(NFTs) are on the rise. They can represent artworks exhibited for marketing purposes on webpages of companies or online stores -- analogously to physical artworks. Lending of NFTs is an attractive form of passive income for owners but comes with risks (e.g., items are not returned) and costs for escrow agents. Similarly, renters have difficulties in anticipating the impact of artworks, e.g., how spectators of NFTs perceive them. To address these challenges, we introduce an NFT rental solution based on a pay-per-like pricing model using blockchain technology, i.e., smart contracts based on the Ethereum chain. We find that blockchain solutions enjoy many advantages also reported for other applications, but interestingly, we also observe dark sides of (large) blockchain fees. Blockchain solutions appear unfair to niche artists and potentially hamper cultural diversity. Furthermore, a trust-cost tradeoff arises to handle fraud caused by manipulation from parties outside the blockchain. All code for the solution is publicly available at: https://github.com/asopi/rental-project
    摘要 非同质化代币(NFT)正在兴起。它们可以代表在公司或在线商店网页上展示的艺术作品,类似于实体艺术作品。出借NFT对持有者来说是一种有吸引力的被动收入,但也伴随着风险(例如物品不被归还)和托管代理的成本。同时,租用者也难以预测艺术作品的影响,例如观众会如何看待这些NFT。为解决这些挑战,我们介绍了一种基于按赞付费(pay-per-like)定价模式的NFT租赁解决方案,使用区块链技术,即基于Ethereum链的智能合约。我们发现,区块链解决方案具有其他应用中报道过的许多优点,但有趣的是,我们也观察到高昂区块链费用带来的阴暗面:区块链解决方案可能对小众艺术家不公平,并可能妨碍文化多样性。此外,为处理来自链外各方操纵所引起的欺诈,还存在信任与成本的权衡。所有代码公开于:https://github.com/asopi/rental-project。

Validation of a Zero-Shot Learning Natural Language Processing Tool for Data Abstraction from Unstructured Healthcare Data

  • paper_url: http://arxiv.org/abs/2308.00107
  • repo_url: https://github.com/kaufmannb/PDF-Extractor
  • paper_authors: Basil Kaufmann, Dallin Busby, Chandan Krushna Das, Neeraja Tillu, Mani Menon, Ashutosh K. Tewari, Michael A. Gorin
  • for: The paper is written to describe the development and validation of a zero-shot learning natural language processing (NLP) tool for abstracting data from unstructured text contained within PDF documents, such as those found within electronic health records.
  • methods: The data abstraction tool was based on the GPT-3.5 model from OpenAI, and was compared to three physician human abstractors in terms of time to task completion and accuracy for abstracting data on 14 unique variables from a set of 199 de-identified radical prostatectomy pathology reports. The reports were processed by the software tool in vectorized and scanned formats to establish the impact of optical character recognition on data abstraction.
  • results: The software tool required a mean of 12.8 s to process the vectorized reports and a mean of 15.8 s to process the scanned reports, which was significantly faster than the human abstractors (mean time of 101 s). The tool had an overall accuracy of 94.2% for the vectorized reports and 88.7% for the scanned reports, which was non-inferior to 2 out of 3 human abstractors.
    Abstract Objectives: To describe the development and validation of a zero-shot learning natural language processing (NLP) tool for abstracting data from unstructured text contained within PDF documents, such as those found within electronic health records. Materials and Methods: A data abstraction tool based on the GPT-3.5 model from OpenAI was developed and compared to three physician human abstractors in terms of time to task completion and accuracy for abstracting data on 14 unique variables from a set of 199 de-identified radical prostatectomy pathology reports. The reports were processed by the software tool in vectorized and scanned formats to establish the impact of optical character recognition on data abstraction. The tool was assessed for superiority for data abstraction speed and non-inferiority for accuracy. Results: The human abstractors required a mean of 101s per report for data abstraction, with times varying from 15 to 284 s. In comparison, the software tool required a mean of 12.8 s to process the vectorized reports and a mean of 15.8 to process the scanned reports (P < 0.001). The overall accuracies of the three human abstractors were 94.7%, 97.8%, and 96.4% for the combined set of 2786 datapoints. The software tool had an overall accuracy of 94.2% for the vectorized reports, proving to be non-inferior to the human abstractors at a margin of -10% ($\alpha$=0.025). The tool had a slightly lower accuracy of 88.7% using the scanned reports, proving to be non-inferiority to 2 out of 3 human abstractors. Conclusion: The developed zero-shot learning NLP tool affords researchers comparable levels of accuracy to that of human abstractors, with significant time savings benefits. Because of the lack of need for task-specific model training, the developed tool is highly generalizable and can be used for a wide variety of data abstraction tasks, even outside the field of medicine.
    摘要 Materials and Methods: We developed a data abstraction tool based on the GPT-3.5 model from OpenAI and compared its performance to three human abstractors in terms of time and accuracy for abstracting data from 14 variables in 199 de-identified radical prostatectomy pathology reports. The reports were processed in both vectorized and scanned formats to assess the impact of optical character recognition (OCR) on data abstraction. We evaluated the tool's superiority in speed and non-inferiority in accuracy.Results: The human abstractors took a mean of 101 seconds per report, with times ranging from 15 to 284 seconds. In contrast, the software tool took a mean of 12.8 seconds to process vectorized reports and 15.8 seconds for scanned reports (p < 0.001). The tool's overall accuracy was 94.2% for vectorized reports, proving non-inferiority to the human abstractors at a margin of -10% (α = 0.025). For scanned reports, the tool's accuracy was 88.7%, proving non-inferiority to two out of three human abstractors.Conclusion: The developed zero-shot learning NLP tool provides researchers with a time-saving solution that affords comparable levels of accuracy to human abstractors. The tool's lack of need for task-specific model training makes it highly generalizable and suitable for a wide range of data abstraction tasks, both within and outside the field of medicine.
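A hedged sketch of how such zero-shot abstraction could be prompted through a chat-completions call. The variable list, prompt wording, and model name are illustrative, and the exact client syntax depends on the openai library version; this is not the authors' tool.

```python
# Illustrative zero-shot field extraction from a pathology report with a chat model.
import json
import openai

VARIABLES = ["gleason_primary", "gleason_secondary", "margin_status"]  # illustrative subset

def extract_fields(report_text: str) -> dict:
    prompt = (
        "Extract the following variables from the pathology report below. "
        "Return strict JSON with one key per variable; use null if absent.\n"
        f"Variables: {', '.join(VARIABLES)}\n\nReport:\n{report_text}"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                       # deterministic abstraction
    )
    return json.loads(resp["choices"][0]["message"]["content"])
```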

Uncertainty-aware Grounded Action Transformation towards Sim-to-Real Transfer for Traffic Signal Control

  • paper_url: http://arxiv.org/abs/2307.12388
  • repo_url: None
  • paper_authors: Longchao Da, Hao Mei, Romir Sharma, Hua Wei
  • for: 提高RL在实际道路上的应用性能
  • methods: 使用模拟到现实(sim-to-real)转移方法,将在模拟环境中学习的策略动态迁移到实际环境中,以缓解转移动态上的域差距(domain gap)
  • results: 在模拟交通环境中评估了UGAT方法,显示UGAT方法可以在实际环境中提高RL策略的性能
    Abstract Traffic signal control (TSC) is a complex and important task that affects the daily lives of millions of people. Reinforcement Learning (RL) has shown promising results in optimizing traffic signal control, but current RL-based TSC methods are mainly trained in simulation and suffer from the performance gap between simulation and the real world. In this paper, we propose a simulation-to-real-world (sim-to-real) transfer approach called UGAT, which transfers a learned policy trained from a simulated environment to a real-world environment by dynamically transforming actions in the simulation with uncertainty to mitigate the domain gap of transition dynamics. We evaluate our method on a simulated traffic environment and show that it significantly improves the performance of the transferred RL policy in the real world.
    摘要 交通信号控制(TSC)是一项复杂且重要的任务,影响了数百万人的日常生活。人工智能学习(RL)已经在优化交通信号控制方面表现出了扎实的成果,但现有RL基于TSC方法主要在模拟环境中训练,它们在实际世界中的性能差异很大。在这篇论文中,我们提出了一种从模拟环境到实际世界(sim-to-real)的转移方法,称为UGAT,它通过在模拟环境中动态地转换行动,以减少领域差距,将学习的RL策略在实际世界中表现出较好的性能。我们对一个模拟交通环境进行评估,并显示了UGAT方法在实际世界中的性能提升。
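A conceptual sketch of uncertainty-aware grounded action transformation: an action chosen in simulation is remapped so the simulator reproduces what a model of real-world dynamics predicts, and the remapping is skipped when an ensemble of dynamics models disagrees too much. The model classes and the gating rule below are assumptions, not the paper's exact procedure.

```python
# Sketch of an uncertainty-gated grounded action transformation step.
import torch

def grounded_action(state, action, real_dynamics_ensemble, sim_inverse_model,
                    uncertainty_threshold=0.1):
    # Predict the real-world next state with an ensemble of learned dynamics models.
    preds = torch.stack([m(state, action) for m in real_dynamics_ensemble])
    next_state_real = preds.mean(dim=0)
    uncertainty = preds.var(dim=0).mean()

    if uncertainty > uncertainty_threshold:
        return action                         # too uncertain: fall back to the raw action
    # Inverse simulator model: which simulated action produces that next state?
    return sim_inverse_model(state, next_state_real)
```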

In-Context Learning in Large Language Models Learns Label Relationships but Is Not Conventional Learning

  • paper_url: http://arxiv.org/abs/2307.12375
  • repo_url: None
  • paper_authors: Jannik Kossen, Tom Rainforth, Yarin Gal
  • for: 这种研究旨在探讨大语言模型(LLMs)在下游任务中表现提高的原因,以及如何更好地理解和调控LLMs的行为。
  • methods: 这篇论文使用了实验方法,检查了LLMs在不同情况下如何处理输入和标签的关系,并分析了模型如何在不同阶段学习和使用标签信息。
  • results: 研究发现,LLMs通常会在输入中使用标签信息,但是在预训练和输入中的标签关系是不同的,并且模型不会对所有输入信息进行平等处理。这些结论可以帮助我们更好地理解和调控LLMs的行为。
    Abstract The performance of Large Language Models (LLMs) on downstream tasks often improves significantly when including examples of the input-label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works: for example, while Xie et al. (2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022b) argue ICL does not even learn label relationships from in-context examples. In this paper, we study (1) how labels of in-context examples affect predictions, (2) how label relationships learned during pre-training interact with input-label examples provided in-context, and (3) how ICL aggregates label information across in-context examples. Our findings suggests LLMs usually incorporate information from in-context labels, but that pre-training and in-context label relationships are treated differently, and that the model does not consider all in-context information equally. Our results give insights into understanding and aligning LLM behavior.
    摘要 大型语言模型(LLM)在下游任务中表现往往会得到显著改善,当包含输入-标签关系的示例在内部时。然而,目前没有一致的看法,即如何实现这种在 контексте学习(ICL)能力:例如,希耶等(2021)将 ICL 比作一种通用学习算法,而民等(2022b)则认为 ICL 不会从内部示例中学习标签关系。在这篇论文中,我们研究以下几点:1. 如何 Labels of in-context examples affect predictions.2. 如何在预训练时学习的标签关系与输入-标签示例提供在内部交互。3. ICL 如何对各个内部示例的标签信息进行汇总。我们的发现表明,LLMs 通常会在内部示例中使用标签信息,但是预训练和内部标签关系被处理不同,并且模型不会对所有内部信息进行平等考虑。我们的结果为理解和调整 LLM 行为提供了新的视角。

Early Prediction of Alzheimers Disease Leveraging Symptom Occurrences from Longitudinal Electronic Health Records of US Military Veterans

  • paper_url: http://arxiv.org/abs/2307.12369
  • repo_url: None
  • paper_authors: Rumeng Li, Xun Wang, Dan Berlowitz, Brian Silver, Wen Hu, Heather Keating, Raelene Goodwin, Weisong Liu, Honghuang Lin, Hong Yu
  • for: 这项研究的目的是使用机器学习方法分析阿尔茨海默病(AD)患者的长期电子健康记录(EHR),以更早地预测AD的发病。
  • methods: 研究采用病例对照设计,使用来自美国退伍军人事务部退伍军人健康管理局(VHA)2004年至2021年的长期EHR数据。研究使用一组AD相关关键词,并分析这些词随时间的出现特征以预测AD的发病。
  • results: 研究发现,随着诊断日期临近,AD患者每年出现的AD相关关键词数量迅速增长,从约10个增至超过40个,而对照组保持在约10个。最佳模型具有很高的判别准确度(ROCAUC 0.997),并且在不同年龄、性别和种族/民族亚组中表现一致(65岁以下患者除外)。
    Abstract Early prediction of Alzheimer's disease (AD) is crucial for timely intervention and treatment. This study aims to use machine learning approaches to analyze longitudinal electronic health records (EHRs) of patients with AD and identify signs and symptoms that can predict AD onset earlier. We used a case-control design with longitudinal EHRs from the U.S. Department of Veterans Affairs Veterans Health Administration (VHA) from 2004 to 2021. Cases were VHA patients with AD diagnosed after 1/1/2016 based on ICD-10-CM codes, matched 1:9 with controls by age, sex and clinical utilization with replacement. We used a panel of AD-related keywords and their occurrences over time in a patient's longitudinal EHRs as predictors for AD prediction with four machine learning models. We performed subgroup analyses by age, sex, and race/ethnicity, and validated the model in a hold-out and "unseen" VHA stations group. Model discrimination, calibration, and other relevant metrics were reported for predictions up to ten years before ICD-based diagnosis. The study population included 16,701 cases and 39,097 matched controls. The average number of AD-related keywords (e.g., "concentration", "speaking") per year increased rapidly for cases as diagnosis approached, from around 10 to over 40, while remaining flat at 10 for controls. The best model achieved high discriminative accuracy (ROCAUC 0.997) for predictions using data from at least ten years before ICD-based diagnoses. The model was well-calibrated (Hosmer-Lemeshow goodness-of-fit p-value = 0.99) and consistent across subgroups of age, sex and race/ethnicity, except for patients younger than 65 (ROCAUC 0.746). Machine learning models using AD-related keywords identified from EHR notes can predict future AD diagnoses, suggesting its potential use for identifying AD risk using EHR notes, offering an affordable way for early screening on large population.
    摘要 早期预测阿尔茨海默病(AD)对于及时干预和治疗非常重要。本研究旨在使用机器学习方法分析AD患者的长期电子健康记录(EHR),识别能够更早预测AD发病的症状和体征。我们采用病例对照设计,使用2004年至2021年美国退伍军人事务部退伍军人健康管理局(VHA)的长期EHR数据。病例为2016年1月1日之后根据ICD-10-CM代码被诊断为AD的VHA患者,并按年龄、性别和临床就诊情况以1:9比例匹配对照。我们使用一组AD相关关键词及其在患者长期EHR中随时间的出现情况作为预测特征,训练了四种机器学习模型。我们按年龄、性别和种族/民族进行了亚组分析,并在留出集和"未见过"的VHA站点组中验证了模型,报告了最多至诊断前十年的预测判别力、校准度及其他相关指标。研究人群包括16,701个病例和39,097个匹配对照。随着诊断临近,病例每年出现的AD相关关键词(如"concentration"、"speaking")从约10个迅速增至超过40个,而对照组保持在约10个。最佳模型在使用ICD诊断前至少十年的数据进行预测时达到了很高的判别准确度(ROCAUC 0.997)。模型校准良好(Hosmer-Lemeshow拟合优度检验 p 值 = 0.99),并在各年龄、性别和种族/民族亚组中表现一致,65岁以下患者除外(ROCAUC 0.746)。基于EHR笔记中AD相关关键词的机器学习模型能够预测未来的AD诊断,表明其有潜力利用EHR笔记识别AD风险,为大规模人群的早期筛查提供一种经济可行的方式。
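A sketch of the keyword-trajectory features implied by the abstract: yearly counts of AD-related keywords per patient, fed to a simple classifier. The keyword list, column names, and the choice of logistic regression are illustrative stand-ins; the paper evaluates four machine learning models.

```python
# Illustrative keyword-count feature construction from longitudinal notes.
import pandas as pd
from sklearn.linear_model import LogisticRegression

KEYWORDS = ["concentration", "speaking", "memory", "confusion"]   # illustrative subset

def keyword_features(notes: pd.DataFrame, years_back=10) -> pd.DataFrame:
    # notes columns (assumed): patient_id, years_before_index, text
    notes = notes[notes.years_before_index <= years_back].copy()
    for kw in KEYWORDS:
        notes[kw] = notes.text.str.lower().str.count(kw)
    counts = (notes.groupby(["patient_id", "years_before_index"])[KEYWORDS]
                   .sum().unstack(fill_value=0))
    counts.columns = [f"{kw}_y{yr}" for kw, yr in counts.columns]
    return counts

def fit(features: pd.DataFrame, labels: pd.Series) -> LogisticRegression:
    # labels assumed to be indexed by patient_id (1 = AD case, 0 = control)
    return LogisticRegression(max_iter=1000).fit(features, labels.loc[features.index])
```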

Deployment of Leader-Follower Automated Vehicle Systems for Smart Work Zone Applications with a Queuing-based Traffic Assignment Approach

  • paper_url: http://arxiv.org/abs/2308.03764
  • repo_url: None
  • paper_authors: Qing Tang, Xianbiao Hu
  • for: 这篇论文旨在优化Autonomous Truck Mounted Attenuator(ATMA)车辆系统中的routing,以最小化在交通基础设施维护期间的系统成本。
  • methods: 这篇论文使用了连接和自动化车辆技术,并提出了一种基于队列的交通分配方法,以考虑ATMA车辆的运行速度差异。
  • results: 研究发现,通过模拟不同路线的选择,ATMA车辆系统可以减少交通系统的成本,并且可以通过队列基于的交通分配方法来实现这一目标。
    Abstract The emerging technology of the Autonomous Truck Mounted Attenuator (ATMA), a leader-follower style vehicle system, utilizes connected and automated vehicle capabilities to enhance safety during transportation infrastructure maintenance in work zones. However, the speed difference between ATMA vehicles and general vehicles creates a moving bottleneck that reduces capacity and increases queue length, resulting in additional delays. The different routes taken by ATMA cause diverse patterns of time-varying capacity drops, which may affect the user equilibrium traffic assignment and lead to different system costs. This manuscript focuses on optimizing the routing for ATMA vehicles in a network to minimize the system cost associated with the slow-moving operation. To achieve this, a queuing-based traffic assignment approach is proposed to identify the system cost caused by the ATMA system. A queuing-based time-dependent (QBTD) travel time function, considering capacity drop, is introduced and applied in the static user equilibrium traffic assignment problem, with a result of adding dynamic characteristics. Subsequently, we formulate the queuing-based traffic assignment problem and solve it using a modified path-based algorithm. The methodology is validated using a small-size and a large-size network and compared with two benchmark models to analyze the benefit of capacity drop modeling and QBTD travel time function. Furthermore, the approach is applied to quantify the impact of different routes on the traffic system and identify an optimal route for ATMA vehicles performing maintenance work. Finally, sensitivity analysis is conducted to explore how the impact changes with variations in traffic demand and capacity reduction.
    摘要 新兴技术自动化卡车拥挤器(ATMA),一种领头随员式车辆系统,通过连接和自动化车辆能力来提高交通基础设施维护工区的安全性。然而,ATMA车辆的速度与普通车辆的速度差距创造了运动瓶颈,导致交通容量下降和排队较长,从而增加延迟。ATMA车辆采取不同的路线,导致时间变化的容量下降,这可能影响用户均衡交通分配和导致不同的系统成本。本文关注优化ATMA车辆网络路径,以最小化由慢速运行引起的系统成本。为此,我们提出了基于队列的交通分配方法,以识别由ATMA系统引起的系统成本。我们引入了考虑容量下降的队列基于时间依赖(QBTD)旅行时间函数,并应用于静态用户均衡交通分配问题。通过修改的路径基本算法,我们解决了队列基于交通分配问题。我们验证了方法使用小型和大型网络,并与两个参考模型进行比较,以分析容器下降模型和QBTD旅行时间函数的利好。此外,我们还应用该方法来评估不同路线对交通系统的影响,并确定最佳维护工区路线。最后,我们进行敏感分析,以explore系统成本变化的影响因素。
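The queuing-based time-dependent (QBTD) travel time is sketched below as a BPR-style congestion term plus a deterministic queueing delay while the slow-moving ATMA convoy reduces link capacity. The functional form and parameter values are placeholders for illustration; the paper's exact QBTD function may differ.

```python
# Illustrative QBTD-style travel time with a temporary capacity drop.
def qbtd_travel_time(flow, free_flow_time, capacity,
                     capacity_drop=0.4, atma_duration=0.5,
                     alpha=0.15, beta=4.0):
    reduced_capacity = capacity * (1.0 - capacity_drop)
    # Standard BPR congestion term evaluated against the reduced capacity.
    bpr = free_flow_time * (1.0 + alpha * (flow / reduced_capacity) ** beta)
    # Deterministic queue that builds while demand exceeds the reduced capacity
    # for the duration (in hours) of the moving bottleneck; return average extra delay.
    excess = max(flow - reduced_capacity, 0.0)
    queue_delay = 0.5 * atma_duration * excess / reduced_capacity
    return bpr + queue_delay
```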

cs.CL - 2023-07-24

Corrections of Zipf’s and Heaps’ Laws Derived from Hapax Rate Models

  • paper_url: http://arxiv.org/abs/2307.12896
  • repo_url: https://github.com/lukasz-debowski/zipfanatomy
  • paper_authors: Łukasz Dębowski
  • for: 这篇论文旨在基于罕用词(hapax)率的系统模型,对Zipf定律和Heaps定律进行修正。
  • methods: 这篇论文基于两个假设:第一个是标准的瓮模型(urn model),认为短文本的边缘词频分布如同从给定的长文本中盲目抽样词例;第二个假设是罕用词率是文本大小的简单函数。
  • results: 这篇论文显示,使用了Logistic模型可以得到最佳的适应。
    Abstract The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the hapax rate. The derivation rests on two assumptions: The first one is the standard urn model which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second assumption posits that the rate of hapaxes is a simple function of the text size. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
    摘要 本文基于罕用词(hapax)率的系统模型,提出了对Zipf定律和Heaps定律的修正。推导基于两个假设:第一,标准瓮模型(urn model)认为短文本的边缘词频分布如同从给定的长文本中盲目抽样词例;第二,罕用词率是文本大小的简单函数。文中讨论了四种这样的函数:常数模型、Davis模型、线性模型和逻辑斯蒂模型。结果表明,逻辑斯蒂模型的拟合效果最好。
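A sketch of fitting a logistic model of the hapax rate as a function of text size, the variant reported above to fit best. The logistic-in-log-size parametrization below is an illustrative stand-in rather than the paper's exact definition; `sizes` and `hapax_rates` are assumed empirical arrays.

```python
# Fit a logistic hapax-rate model to (text size, hapax rate) observations.
import numpy as np
from scipy.optimize import curve_fit

def logistic_hapax_rate(n, upper, slope, midpoint):
    # hapax rate as a logistic function of log text size (illustrative form)
    return upper / (1.0 + np.exp(slope * (np.log(n) - midpoint)))

def fit_logistic(sizes, hapax_rates):
    params, _ = curve_fit(logistic_hapax_rate, sizes, hapax_rates,
                          p0=[1.0, 1.0, np.log(np.median(sizes))],
                          maxfev=10000)
    residuals = hapax_rates - logistic_hapax_rate(sizes, *params)
    return params, float(np.sum(residuals ** 2))   # compare this RSS across candidate models
```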

Joint Dropout: Improving Generalizability in Low-Resource Neural Machine Translation through Phrase Pair Variables

  • paper_url: http://arxiv.org/abs/2307.12835
  • repo_url: None
  • paper_authors: Ali Araabi, Vlad Niculae, Christof Monz
  • for: 提高低资源语言对翻译机器翻译的性能
  • methods: 使用联合Dropout方法,将短语替换为变量,提高翻译机器翻译的可组合性
  • results: 对低资源语言对翻译机器翻译进行了重要改进,为语言对翻译机器翻译带来了显著提高,并且在不同领域中也具有了更好的鲁棒性和适应性。
    Abstract Despite the tremendous success of Neural Machine Translation (NMT), its performance on low-resource language pairs still remains subpar, partly due to the limited ability to handle previously unseen inputs, i.e., generalization. In this paper, we propose a method called Joint Dropout, that addresses the challenge of low-resource neural machine translation by substituting phrases with variables, resulting in significant enhancement of compositionality, which is a key aspect of generalization. We observe a substantial improvement in translation quality for language pairs with minimal resources, as seen in BLEU and Direct Assessment scores. Furthermore, we conduct an error analysis, and find Joint Dropout to also enhance generalizability of low-resource NMT in terms of robustness and adaptability across different domains
    摘要 尽管神经机器翻译(NMT)已经取得了很大的成功,但它在低资源语言对的表现仍然较差,一个原因是对未经见过的输入的处理能力有限,即通用性。在这篇论文中,我们提出了一种方法called Joint Dropout,该方法通过将短语替换为变量,从而提高了语言对的复合性,这是通用性的关键特征。我们发现,对具有最少资源的语言对,使用Joint Dropout可以得到显著提高翻译质量,按照BLEU和直接评估得分来看。此外,我们进行了错误分析,发现Joint Dropout还可以提高低资源NMT的通用性,包括鲁棒性和适应性 across different domains。
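A sketch of the phrase-pair-to-variable substitution idea: aligned source and target phrases are jointly replaced by the same indexed placeholder token during training. Phrase alignment is assumed to be computed elsewhere, and the placeholder format and drop rate are illustrative choices, not the paper's exact settings.

```python
# Illustrative joint replacement of aligned phrase pairs with shared variables.
import random

def joint_dropout(src_tokens, tgt_tokens, phrase_pairs, drop_rate=0.2):
    """phrase_pairs: list of ((src_start, src_end), (tgt_start, tgt_end)) spans,
    assumed non-overlapping on each side."""
    chosen = [(i, spans) for i, spans in enumerate(phrase_pairs)
              if random.random() < drop_rate]
    src, tgt = list(src_tokens), list(tgt_tokens)
    # Replace right-to-left on each side independently so indices stay valid.
    for i, ((ss, se), _) in sorted(chosen, key=lambda c: -c[1][0][0]):
        src[ss:se] = [f"<X{i}>"]                 # same variable on both sides
    for i, (_, (ts, te)) in sorted(chosen, key=lambda c: -c[1][1][0]):
        tgt[ts:te] = [f"<X{i}>"]
    return src, tgt
```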

Guidance in Radiology Report Summarization: An Empirical Evaluation and Error Analysis

  • paper_url: http://arxiv.org/abs/2307.12803
  • repo_url: https://github.com/jantrienes/inlg2023-radsum
  • paper_authors: Jan Trienes, Paul Youssef, Jörg Schlötterer, Christin Seifert
  • for: automatization of radiology report summarization to reduce clinicians’ manual work and improve reporting consistency
  • methods: variable-length extractive summaries as a domain-agnostic guidance signal, competitive with domain-specific methods
  • results: improved summarization quality compared to unguided summarization, but still limited by content selection and corpus-level inconsistencies
    Abstract Automatically summarizing radiology reports into a concise impression can reduce the manual burden of clinicians and improve the consistency of reporting. Previous work aimed to enhance content selection and factuality through guided abstractive summarization. However, two key issues persist. First, current methods heavily rely on domain-specific resources to extract the guidance signal, limiting their transferability to domains and languages where those resources are unavailable. Second, while automatic metrics like ROUGE show progress, we lack a good understanding of the errors and failure modes in this task. To bridge these gaps, we first propose a domain-agnostic guidance signal in form of variable-length extractive summaries. Our empirical results on two English benchmarks demonstrate that this guidance signal improves upon unguided summarization while being competitive with domain-specific methods. Additionally, we run an expert evaluation of four systems according to a taxonomy of 11 fine-grained errors. We find that the most pressing differences between automatic summaries and those of radiologists relate to content selection including omissions (up to 52%) and additions (up to 57%). We hypothesize that latent reporting factors and corpus-level inconsistencies may limit models to reliably learn content selection from the available data, presenting promising directions for future work.
    摘要 自动概括 radiology 报告可以减少临床医生的手动劳动和提高报告的一致性。过去的工作是通过引导抽象SUMMARIZATION提高内容选择和事实性。然而,两个关键问题仍然存在。首先,当前的方法听命于域特定资源提取指导信号,限制其在领域和语言中的传输性。其次,虽然自动度量器Like ROUGE表现出进步,但我们对这个任务中的错误和失败模式几乎没有良好的理解。为了bridging这些差距,我们首先提议一种域无关的引导信号,即变量长抽取SUMMARIES。我们的实验结果表明,这种引导信号可以超过无引导抽取SUMMARIES,并与域特定方法竞争。此外,我们运行了四种系统的专家评估,根据报告11种细腻错误的税onomy。我们发现,自动报告与医生的报告之间最主要的差异在于内容选择,包括漏掉(最多52%)和添加(最多57%)。我们推测,隐藏的报告因素和 corpus 级别的不一致性可能限制模型从可用数据中学习内容选择,提供了可能的未来工作方向。

RRAML: Reinforced Retrieval Augmented Machine Learning

  • paper_url: http://arxiv.org/abs/2307.12798
  • repo_url: None
  • paper_authors: Andrea Bacciu, Florin Cuconasu, Federico Siciliano, Fabrizio Silvestri, Nicola Tonellotto, Giovanni Trappolini
  • for: 本研究旨在推广人工智能领域中的大语言模型(LLMs)的应用,提高其在理解、生成和修改人语言方面的能力。
  • methods: 本研究提出了一种新的框架,即强化检索增强机器学习(RRAML),它将LLMs的理解能力与一个特制的检索器连接起来,从一个大量的用户提供的数据库中提取支持信息。
  • results: RRAML可以减少LLMs的训练和重新训练的需求,同时也可以避免访问LLMs的梯度,从而提高其应用的效率和可扩展性。此外,RRAML还可以减少检索结果中的幻见和不相关信息,提高检索的准确率和有用性。
    Abstract The emergence of large language models (LLMs) has revolutionized machine learning and related fields, showcasing remarkable abilities in comprehending, generating, and manipulating human language. However, their conventional usage through API-based text prompt submissions imposes certain limitations in terms of context constraints and external source availability. To address these challenges, we propose a novel framework called Reinforced Retrieval Augmented Machine Learning (RRAML). RRAML integrates the reasoning capabilities of LLMs with supporting information retrieved by a purpose-built retriever from a vast user-provided database. By leveraging recent advancements in reinforcement learning, our method effectively addresses several critical challenges. Firstly, it circumvents the need for accessing LLM gradients. Secondly, our method alleviates the burden of retraining LLMs for specific tasks, as it is often impractical or impossible due to restricted access to the model and the computational intensity involved. Additionally we seamlessly link the retriever's task with the reasoner, mitigating hallucinations and reducing irrelevant, and potentially damaging retrieved documents. We believe that the research agenda outlined in this paper has the potential to profoundly impact the field of AI, democratizing access to and utilization of LLMs for a wide range of entities.
    摘要 大型语言模型(LLM)的出现给机器学习及相关领域带来了革命性的变革,展示了其在理解、生成和修改人类语言方面的强大能力。然而,通过API提交文本提示来使用LLM存在一些限制,包括上下文长度约束和外部资源的可用性。为解决这些挑战,我们提出了一个新的框架,称为强化检索增强机器学习(RRAML)。RRAML将LLM的推理能力与由专门构建的检索器从用户提供的大规模数据库中取回的支持信息结合起来。通过利用强化学习的最新进展,我们的方法有效地解决了多个关键问题:首先,它绕过了访问LLM梯度的需求;其次,它减轻了为特定任务重新训练LLM的负担,因为受限的模型访问权限和高昂的计算开销往往使重新训练不切实际甚至不可能;此外,我们将检索器的任务与推理者紧密衔接,从而减少幻觉以及不相关、甚至可能有害的检索文档。我们认为本文提出的研究议程有潜力深刻影响AI领域,使各类主体都能更广泛地访问和利用LLM。
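A conceptual sketch of how a retriever could be trained from a black-box LLM's answers with a REINFORCE-style update, which needs no access to LLM gradients, in the spirit of the framework above. The retriever interface, reward function, and scoring are assumptions, not the authors' implementation.

```python
# One reinforcement-learning step for a retriever driven by LLM answer quality.
import torch

def rraml_step(retriever, optimizer, query, corpus_embs, corpus_texts,
               llm_answer_fn, reward_fn, k=5):
    q = retriever(query)                                    # (dim,) query embedding
    scores = corpus_embs @ q                                # (num_docs,)
    probs = torch.softmax(scores, dim=0)
    idx = torch.multinomial(probs, k, replacement=False)    # sample k documents

    # Black-box LLM call: generate an answer conditioned on the retrieved passages.
    answer = llm_answer_fn(query, [corpus_texts[i] for i in idx.tolist()])
    reward = reward_fn(answer)                              # scalar, e.g. a task metric

    # REINFORCE: raise the log-probability of retrievals that led to good answers.
    loss = -reward * torch.log(probs[idx] + 1e-9).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```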

Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN

  • paper_url: http://arxiv.org/abs/2307.12759
  • repo_url: https://github.com/sage-khan/code-switched-noisy-urdu-asr
  • paper_authors: Muhammad Danyal Khan, Raheem Ali, Arshad Aziz
  • for: 这个论文是为了提出一种资源有效的自动语音识别系统,以便在呼叫中心环境中准确地转录电话对话,并且能够进行自动化的电话监测、关键词搜索和情感分析。
  • methods: 这个论文使用了链式混合HMM和CNN-TDNN来实现资源有效的自动语音识别系统,并且在呼叫中心环境中进行了测试和评估。
  • results: 根据论文的描述,在呼叫中心环境中,使用链式混合HMM和CNN-TDNN来实现自动语音识别系统,可以达到5.2%的Word Error Rate(WER),包括干净环境和噪音环境下的 isolated words和连续杂音speech。
    Abstract Call Centers have huge amount of audio data which can be used for achieving valuable business insights and transcription of phone calls is manually tedious task. An effective Automated Speech Recognition system can accurately transcribe these calls for easy search through call history for specific context and content allowing automatic call monitoring, improving QoS through keyword search and sentiment analysis. ASR for Call Center requires more robustness as telephonic environment are generally noisy. Moreover, there are many low-resourced languages that are on verge of extinction which can be preserved with help of Automatic Speech Recognition Technology. Urdu is the $10^{th}$ most widely spoken language in the world, with 231,295,440 worldwide still remains a resource constrained language in ASR. Regional call-center conversations operate in local language, with a mix of English numbers and technical terms generally causing a "code-switching" problem. Hence, this paper describes an implementation framework of a resource efficient Automatic Speech Recognition/ Speech to Text System in a noisy call-center environment using Chain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. Using Hybrid HMM-DNN approach allowed us to utilize the advantages of Neural Network with less labelled data. Adding CNN with TDNN has shown to work better in noisy environment due to CNN's additional frequency dimension which captures extra information from noisy speech, thus improving accuracy. We collected data from various open sources and labelled some of the unlabelled data after analysing its general context and content from Urdu language as well as from commonly used words from other languages, primarily English and were able to achieve WER of 5.2% with noisy as well as clean environment in isolated words or numbers as well as in continuous spontaneous speech.
    摘要 Call Centers possess vast amounts of audio data that can be leveraged for gaining valuable business insights, and the manual transcription of phone calls is a tedious task. An effective Automatic Speech Recognition (ASR) system can accurately transcribe these calls, enabling easy search through call history for specific context and content, and allowing for automatic call monitoring, improving quality of service (QoS) through keyword search and sentiment analysis. However, ASR systems for call centers must be more robust due to the noisy telephonic environment. Moreover, there are many low-resource languages that are on the verge of extinction, and ASR technology can help preserve these languages. Urdu, the 10th most widely spoken language in the world with 231,295,440 speakers, remains a resource-constrained language in ASR. Regional call-center conversations often operate in local languages, with a mix of English and technical terms, causing a "code-switching" problem.To address these challenges, this paper proposes an implementation framework for a resource-efficient ASR/Speech-to-Text system in a noisy call-center environment using Chain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. By combining Hybrid HMM-DNN and CNN-TDNN, we can leverage the advantages of neural networks with less labeled data. Additionally, the CNN-TDNN approach has shown to work better in noisy environments due to the CNN's additional frequency dimension, which captures extra information from noisy speech, improving accuracy.We collected data from various open sources and labeled some of the unlabeled data after analyzing its general context and content from Urdu language as well as from commonly used words from other languages, primarily English. Our results achieved a Word Error Rate (WER) of 5.2% with both noisy and clean environments in isolated words or numbers as well as in continuous spontaneous speech.

A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization

  • paper_url: http://arxiv.org/abs/2307.12659
  • repo_url: None
  • paper_authors: Edward Fish, Umberto Michieli, Mete Ozay
  • for: 这个研究是为了提高自动语音识别(ASR)模型的实时部署,以及实现个人化模型适应。
  • methods: 这个研究使用了混合精度优化方法(myQASR),可以根据不同的用户和内存需求生成个性化的混合精度优化方案。myQASR 使用了全精度活动值分析来自动评估网络层的优化感受,并生成个性化的混合精度优化方案。
  • results: 研究结果显示,使用 myQASR 可以提高特定的性别、语言和说话者的表现,并且不需要组数调整。
    Abstract Recent advancement in Automatic Speech Recognition (ASR) has produced large AI models, which become impractical for deployment in mobile devices. Model quantization is effective to produce compressed general-purpose models, however such models may only be deployed to a restricted sub-domain of interest. We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain. To this end, we propose myQASR, a mixed-precision quantization method that generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning. myQASR automatically evaluates the quantization sensitivity of network layers by analysing the full-precision activation values. We are then able to generate a personalised mixed-precision quantization scheme for any pre-determined memory budget. Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.
    摘要 myQASR evaluates the quantization sensitivity of network layers by analyzing full-precision activation values, and generates a personalized mixed-precision quantization scheme for any pre-determined memory budget. Our results show that myQASR improves performance for specific genders, languages, and speakers.
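A sketch of a label-free mixed-precision recipe in the spirit described above: rank layers by a statistic of their full-precision activations on a few unlabelled user samples, then assign lower bit-widths to the least sensitive layers until a memory budget is met. The sensitivity statistic and allocation rule are illustrative, not the paper's exact procedure.

```python
# Activation-based sensitivity ranking and greedy bit-width assignment.
import torch

def activation_sensitivity(model, calib_batches, layers):
    stats = {name: [] for name in layers}
    hooks = [dict(model.named_modules())[name].register_forward_hook(
                 lambda m, i, o, n=name: stats[n].append(o.detach().abs().mean()))
             for name in layers]
    with torch.no_grad():
        for batch in calib_batches:          # small set of unlabelled user samples
            model(batch)
    for h in hooks:
        h.remove()
    return {n: torch.stack(v).mean().item() for n, v in stats.items()}

def assign_bitwidths(sensitivity, param_counts, budget_bits, high=8, low=4):
    plan = {n: high for n in sensitivity}
    # Demote the least sensitive layers first until the memory budget is satisfied.
    for name in sorted(sensitivity, key=sensitivity.get):
        if sum(plan[n] * param_counts[n] for n in plan) <= budget_bits:
            break
        plan[name] = low
    return plan
```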

Fake News Detection Through Graph-based Neural Networks: A Survey

  • paper_url: http://arxiv.org/abs/2307.12639
  • repo_url: None
  • paper_authors: Shuzhi Gong, Richard O. Sinnott, Jianzhong Qi, Cecile Paris
  • for: 本研究主要旨在探讨基于图структуры和深度学习技术的假新闻检测方法。
  • methods: 基于图的方法包括知识驱动方法、基于传播的方法和基于异质社交上下文的方法,它们根据不同的图结构来建模新闻相关信息的流动。
  • results: 研究发现,基于图的方法在假新闻检测中取得了显著成果,特别是在建模社交媒体传播过程方面。但仍存在一些挑战和未解决的问题,如假新闻的定义和识别、社交媒体平台的差异以及数据的可靠性等。
    Abstract The popularity of online social networks has enabled rapid dissemination of information. People now can share and consume information much more rapidly than ever before. However, low-quality and/or accidentally/deliberately fake information can also spread rapidly. This can lead to considerable and negative impacts on society. Identifying, labelling and debunking online misinformation as early as possible has become an increasingly urgent problem. Many methods have been proposed to detect fake news including many deep learning and graph-based approaches. In recent years, graph-based methods have yielded strong results, as they can closely model the social context and propagation process of online news. In this paper, we present a systematic review of fake news detection studies based on graph-based and deep learning-based techniques. We classify existing graph-based methods into knowledge-driven methods, propagation-based methods, and heterogeneous social context-based methods, depending on how a graph structure is constructed to model news related information flows. We further discuss the challenges and open problems in graph-based fake news detection and identify future research directions.
    摘要 在线社交网络的流行化使得信息的传播变得非常快速,人们可以更快地分享和消耗信息。然而,低质量和/或意外或故意假的信息也可以快速传播,这可能会对社会产生重大和负面的影响。正确地识别、标注和驳斥在线谣言已成为一项急需解决的问题。许多方法已经被提议来检测假新闻,其中包括深度学习和图基于的方法。在过去几年中,图基于的方法在检测假新闻方面取得了强劲的结果,因为它们可以准确地模拟在线新闻的社交上下文和传播过程。本文提供一个系统性的审查,检测基于图和深度学习的假新闻检测研究。我们将现有的图基于方法分为知识驱动的方法、传播基于方法和多元社交上下文基于方法,根据如何构建图来模型新闻相关信息的流动。我们还讨论了假新闻检测中的挑战和未解决的问题,并确定了未来研究的方向。

Tachikuma: Understading Complex Interactions with Multi-Character and Novel Objects by Large Language Models

  • paper_url: http://arxiv.org/abs/2307.12573
  • repo_url: None
  • paper_authors: Yuanzhi Liang, Linchao Zhu, Yi Yang
  • for: 这篇论文旨在提高人工智能代理在虚拟世界中互动的复杂性和灵活性,特别是在涉及多个角色和新型对象的情况下。
  • methods: 论文提出在代理的世界模型中引入虚拟游戏主持人(GM)的想法,由其负责信息把关、估计玩家意图、提供环境描述和给予反馈,从而弥补当前世界模型的不足。
  • results: 论文提出了一个名为Tachikuma的基准,包括一个多角色与新对象互动估计(MOE)任务及配套数据集。MOE要求模型理解角色的意图,并在涉及多角色和新对象的复杂情境中准确判断其行为。此外,数据集收录了游戏过程中的实时交流记录,为后续研究提供了多样、贴近真实且复杂的互动。最后,论文给出了一个简单的提示基线并评估其性能,展示了其在增强互动理解方面的效果。
    Abstract Recent advancements in natural language and Large Language Models (LLMs) have enabled AI agents to simulate human-like interactions within virtual worlds. However, these interactions still face limitations in complexity and flexibility, particularly in scenarios involving multiple characters and novel objects. Pre-defining all interactable objects in the agent's world model presents challenges, and conveying implicit intentions to multiple characters through complex interactions remains difficult. To address these issues, we propose integrating virtual Game Masters (GMs) into the agent's world model, drawing inspiration from Tabletop Role-Playing Games (TRPGs). GMs play a crucial role in overseeing information, estimating players' intentions, providing environment descriptions, and offering feedback, compensating for current world model deficiencies. To facilitate future explorations for complex interactions, we introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation (MOE) task and a supporting dataset. MOE challenges models to understand characters' intentions and accurately determine their actions within intricate contexts involving multi-character and novel object interactions. Besides, the dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations. Finally, we present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding. We hope that our dataset and task will inspire further research in complex interactions with natural language, fostering the development of more advanced AI agents.
    摘要 To facilitate future explorations for complex interactions, we introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation (MOE) task and a supporting dataset. MOE challenges models to understand characters' intentions and accurately determine their actions within intricate contexts involving multi-character and novel object interactions. Besides, the dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.Finally, we present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding. We hope that our dataset and task will inspire further research in complex interactions with natural language, fostering the development of more advanced AI agents.Translation in Simplified Chinese:最近的自然语言和大型语言模型(LLMs)的进步,使得AI代理人能够在虚拟世界中模拟人类化的互动。然而,这些互动仍面临复杂性和灵活性的限制,特别是在多个角色和新的物品的情况下。将所有互动的物品都嵌入代理人的世界模型中存在挑战,而且通过复杂的互动传递多个角色的意图仍然具有挑战性。为解决这些问题,我们提出了在代理人的世界模型中 integrate 虚拟游戏大师(GMs)的想法, draw inspirations from 桌上角色扮演游戏(TRPGs)。GMs 在虚拟世界中扮演着重要的角色,负责资讯的监督、玩家的意图的估计、环境描述和回应,以补偿现有世界模型的不足。为了促进未来的复杂互动探索,我们提出了一个名为 Tachikuma 的benchmark,包括一个多个角色和新的物品基本互动Estimation(MOE)任务和一个支持 datasets。MOE 挑战模型能够理解角色的意图和精确地决定他们在复杂的多个角色和新的物品互动中的动作。此外, datasets capture 游戏中的实时通讯记录,提供多样化、根据现实的互动进行探索。最后,我们提出了一个简单的提示基eline,评估其表现,证明其能够增强互动理解。我们希望这个dataset和任务能够鼓励更多的研究在复杂互动中的自然语言,推动更进步的 AI 代理人的发展。

Towards Generalising Neural Topical Representations

  • paper_url: http://arxiv.org/abs/2307.12564
  • repo_url: None
  • paper_authors: Xiaohao Yang, He Zhao, Dinh Phung, Lan Du
  • for: 提高神经话题模型(NTM)的泛化能力,使其在不同文库和任务中产生质量话题表示。
  • methods: 使用数据扩充 durante el entrenamiento para模型 similar documents,并使用 Hierarchical Topic Transport Distance (HOTT) 测量文档之间的semantical distance。
  • results: 对多个NTMs进行了广泛的实验,并证明了框架可以significantly improve neural topical representation的泛化能力 across corpora。
    Abstract Topic models have evolved from conventional Bayesian probabilistic models to Neural Topic Models (NTMs) over the last two decays. Although NTMs have achieved promising performance when trained and tested on a specific corpus, their generalisation ability across corpora is rarely studied. In practice, we often expect that an NTM trained on a source corpus can still produce quality topical representation for documents in a different target corpus without retraining. In this work, we aim to improve NTMs further so that their benefits generalise reliably across corpora and tasks. To do so, we propose to model similar documents by minimising their semantical distance when training NTMs. Specifically, similar documents are created by data augmentation during training; The semantical distance between documents is measured by the Hierarchical Topic Transport Distance (HOTT), which computes the Optimal Transport (OT) distance between the topical representations. Our framework can be readily applied to most NTMs as a plug-and-play module. Extensive experiments show that our framework significantly improves the generalisation ability regarding neural topical representation across corpora.
    摘要 To achieve this, we propose to model similar documents by minimizing their semantic distance during training. Specifically, we create similar documents by performing data augmentation during training, and we measure the semantic distance between documents using the Hierarchical Topic Transport Distance (HOTT), which computes the Optimal Transport (OT) distance between the topical representations. Our framework can be easily applied to most NTMs as a plug-and-play module.Extensive experiments show that our framework significantly improves the generalization ability of neural topical representation across corpora.
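A sketch of an optimal-transport distance between the topical representations of two documents, in the spirit of the HOTT-style guidance described above. It uses the POT library; the topic-to-topic cost below is a simple Euclidean distance between topic embeddings, which is an illustrative choice rather than the paper's exact construction.

```python
# Optimal-transport distance between two documents' topic distributions.
import numpy as np
import ot   # POT: Python Optimal Transport

def topical_ot_distance(doc_topics_a, doc_topics_b, topic_embeddings):
    # doc_topics_*: (K,) topic proportions; topic_embeddings: (K, d)
    a = np.asarray(doc_topics_a, dtype=float)
    b = np.asarray(doc_topics_b, dtype=float)
    cost = ot.dist(topic_embeddings, topic_embeddings, metric="euclidean")
    return ot.emd2(a / a.sum(), b / b.sum(), cost)   # exact OT cost

# During NTM training, this distance between a document and its augmented copy
# can be added to the loss so that similar documents receive similar topic vectors.
```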

Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation

  • paper_url: http://arxiv.org/abs/2307.12520
  • repo_url: https://github.com/neelbhandari6/nmt_text_attack
  • paper_authors: Neel Bhandari, Pin-Yu Chen
  • for: This paper aims to study the robustness of current text adversarial attacks to round-trip translation and to introduce an intervention-based solution to improve the robustness of adversarial examples.
  • methods: The paper uses six state-of-the-art text-based adversarial attacks and integrates machine translation into the process of adversarial example generation to improve the robustness of adversarial examples.
  • results: The paper demonstrates that finding adversarial examples robust to translation can help identify the insufficiency of language models that is common across languages, and motivate further research into multilingual adversarial attacks.Here’s the text in Simplified Chinese:
  • for: 这篇论文目标是研究当前文本 adversarial 攻击的翻译循环Robustness,并提出一种干预方法来提高攻击示例的Robustness。
  • methods: 论文使用了六种当前最佳文本基于攻击方法,并将机器翻译integrated into the process of adversarial example generation以提高攻击示例的Robustness。
  • results: 论文表明,找到可以在翻译中维持Robustness的攻击示例可以帮助发现语言模型的共同缺陷,并促进多语言攻击的研究。
    Abstract Language Models today provide a high accuracy across a large number of downstream tasks. However, they remain susceptible to adversarial attacks, particularly against those where the adversarial examples maintain considerable similarity to the original text. Given the multilingual nature of text, the effectiveness of adversarial examples across translations and how machine translations can improve the robustness of adversarial examples remain largely unexplored. In this paper, we present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation. We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation. Furthermore, we introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation and demonstrating increased robustness to round-trip translation. Our results indicate that finding adversarial examples robust to translation can help identify the insufficiency of language models that is common across languages, and motivate further research into multilingual adversarial attacks.
    摘要 现代语言模型在许多下游任务上具有很高的准确率,但它们仍然容易受到对抗攻击,特别是对抗样本与原文高度相似的情形。由于文本具有多语言特性,对抗样本在翻译之后是否仍然有效,以及机器翻译能否提高对抗样本的鲁棒性,这些问题在很大程度上尚未被探索。在这篇论文中,我们对当前文本对抗攻击在往返翻译(round-trip translation)下的鲁棒性进行了全面研究。我们发现6种最先进的文本对抗攻击在往返翻译后都失去了效力。此外,我们还提出了一种基于干预的解决方案,将机器翻译整合到对抗样本的生成过程中,并证明该方法能够提高对往返翻译的鲁棒性。我们的结果表明,寻找对翻译鲁棒的对抗样本有助于发现语言模型跨语言共有的不足,并推动对多语言对抗攻击的进一步研究。
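A sketch of the intervention described above: during adversarial example generation, keep only candidates that remain adversarial after a round-trip translation. The `translate` and `attack_candidates` helpers are hypothetical placeholders for an MT system and an existing text attack; they are not a specific library's API.

```python
# Filter adversarial candidates by round-trip-translation survival.
def round_trip(text, translate, pivot_lang="de"):
    return translate(translate(text, src="en", tgt=pivot_lang),
                     src=pivot_lang, tgt="en")

def robust_adversarial_examples(victim_model, text, label,
                                attack_candidates, translate):
    robust = []
    for candidate in attack_candidates(text, label):
        if victim_model(candidate) == label:
            continue                                  # not adversarial at all
        back = round_trip(candidate, translate)
        if victim_model(back) != label:               # still fools the model after RTT
            robust.append(candidate)
    return robust
```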

Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training

  • paper_url: http://arxiv.org/abs/2307.12498
  • repo_url: https://github.com/WAPATASR/WAPAT
  • paper_authors: Gege Qi, Yuefeng Chen, Xiaofeng Mao, Xiaojun Jia, Ranjie Duan, Rong Zhang, Hui Xue
  • for: 提高自动话语识别(ASR)模型的实际抗坏性,使其在小量干扰和大范围频率变化下保持原有性能。
  • methods: 提出了一种新的WavAugment导向的phoneme adversarial Training(wapat)方法,通过在phoneme空间中使用对抗示例来使模型具有小量干扰和大范围频率变化下的抗坏性,并且通过使用振荡示例来引导对抗生成,以获得更好的普适性。
  • results: 在End-to-end Speech Challenge Benchmark(ESB)上进行了广泛的实验,结果表明,SpeechLM-wapat模型比原始模型减少了6.28%的Word Error Rate(WER),达到了新的状态态-of-the-art。
    Abstract Developing a practically-robust automatic speech recognition (ASR) is challenging since the model should not only maintain the original performance on clean samples, but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat use adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in phoneme representation and preserve the performance on clean samples. In addition, wapat utilizes the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient-directions, resulting in improved generalization. Extensive experiments demonstrate the effectiveness of wapat on End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat outperforms the original model by 6.28% WER reduction on ESB, achieving the new state-of-the-art.
    摘要 开发一个实用robust的自动语音识别(ASR)系统是具有搅乱的挑战,因为模型需要不仅保持干净样本的原始性能,还需要在小量扰动和大域转换下实现一致的效果。为解决这个问题,我们提出了一种新的WavAugment导向的phoneme adversarial training(wapat)方法。wapat使用phoneme空间的对抗样本作为增强元素,使模型对phoneme表示的小变化具有抗衰减性,并保持干净样本的性能。此外,wapat利用增强后的phoneme表示导向对抗生成,以找到更稳定和多样的梯度方向,从而提高泛化能力。广泛的实验表明,wapat在End-to-end Speech Challenge Benchmark(ESB)上具有显著的效果,SpeechLM-wapat比原始模型减少6.28%的WRR,实现新的州际顶峰性。
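A hedged sketch of phoneme-space adversarial training: the phoneme (text-side) embeddings are perturbed in the gradient direction, with the adversarial direction taken from a WavAugment-ed copy of the utterance so that the perturbation reflects realistic acoustic variation. The model interfaces and the exact guidance term are assumptions, not the paper's released code.

```python
# One FGSM-style adversarial training step in phoneme embedding space.
import torch

def wapat_step(asr_model, audio, augmented_audio, phoneme_ids, epsilon=0.01):
    # Phoneme-side embeddings for the transcript (requires_grad to get a direction).
    phon_emb = asr_model.phoneme_embeddings(phoneme_ids).detach().requires_grad_(True)

    # Guidance: take the adversarial direction from the WavAugment-ed utterance.
    asr_model.loss(augmented_audio, phoneme_embeddings=phon_emb).backward()
    adv_emb = (phon_emb + epsilon * phon_emb.grad.sign()).detach()   # FGSM-style step

    # Outer minimization: train the model to stay correct on the clean audio
    # under the perturbed phoneme representation.
    return asr_model.loss(audio, phoneme_embeddings=adv_emb)
```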

On the Effectiveness of Offline RL for Dialogue Response Generation

  • paper_url: http://arxiv.org/abs/2307.12425
  • repo_url: https://github.com/asappresearch/dialogue-offline-rl
  • paper_authors: Paloma Sodhi, Felix Wu, Ethan R. Elenberg, Kilian Q. Weinberger, Ryan McDonald
  • for: 研究 teacher forcing 的替代方法,以提高对话响应生成的性能。
  • methods: 使用了多种离线束规学学习(RL)方法,以优化对话响应生成的序列水平目标。
  • results: 研究发现,离线RL可以明显提高对话响应生成的性能,而不会导致训练不稳定或减少实际训练时间。
    Abstract A common training technique for language models is teacher forcing (TF). TF attempts to match human language exactly, even though identical meanings can be expressed in different ways. This motivates use of sequence-level objectives for dialogue response generation. In this paper, we study the efficacy of various offline reinforcement learning (RL) methods to maximize such objectives. We present a comprehensive evaluation across multiple datasets, models, and metrics. Offline RL shows a clear performance improvement over teacher forcing while not inducing training instability or sacrificing practical training budgets.
    摘要 一种常见的语言模型训练技巧是教师强制(TF)。TF尝试匹配人类语言完全一致,即使同一个意思可以表达在不同的方式。这种motivation使我们使用序列级目标来生成对话响应。在这篇论文中,我们研究了多种离线强化学习(RL)方法,以最大化这些目标。我们在多个数据集、模型和指标上进行了全面的评估。离线RL显示了与教师强制相比的表现提升,而不会导致训练不稳定或浪费实际训练预算。

Testing Hateful Speeches against Policies

  • paper_url: http://arxiv.org/abs/2307.12418
  • repo_url: https://github.com/htytewx/softcam
  • paper_authors: Jiangrui Zheng, Xueqing Liu, Girish Budhrani, Wei Yang, Ravishka Rathnasuriya
  • for: 这个论文的目的是研究基于深度学习技术的 AI 系统如何对于基于自然语言规则的需求或政策进行行为。
  • methods: 该论文使用了人工批准和 OpenAI 的大语言模型自动匹配新的示例和政策来扩展 HateModerate 数据集。
  • results: 研究发现现有的 hate speech 检测软件对于某些政策有高失败率,而自动匹配新的示例和政策可以提高 AI 系统对于需求或政策的traceability。
    Abstract In the recent years, many software systems have adopted AI techniques, especially deep learning techniques. Due to their black-box nature, AI-based systems brought challenges to traceability, because AI system behaviors are based on models and data, whereas the requirements or policies are rules in the form of natural or programming language. To the best of our knowledge, there is a limited amount of studies on how AI and deep neural network-based systems behave against rule-based requirements/policies. This experience paper examines deep neural network behaviors against rule-based requirements described in natural language policies. In particular, we focus on a case study to check AI-based content moderation software against content moderation policies. First, using crowdsourcing, we collect natural language test cases which match each moderation policy, we name this dataset HateModerate; second, using the test cases in HateModerate, we test the failure rates of state-of-the-art hate speech detection software, and we find that these models have high failure rates for certain policies; finally, since manual labeling is costly, we further proposed an automated approach to augument HateModerate by finetuning OpenAI's large language models to automatically match new examples to policies. The dataset and code of this work can be found on our anonymous website: \url{https://sites.google.com/view/content-moderation-project}.

CommonsenseVIS: Visualizing and Understanding Commonsense Reasoning Capabilities of Natural Language Models

  • paper_url: http://arxiv.org/abs/2307.12382
  • repo_url: None
  • paper_authors: Xingbo Wang, Renfei Huang, Zhihua Jin, Tianqing Fang, Huamin Qu
  • for: 本研究旨在提供一个可视化的解释系统,以协助NLP专家对模型对概念的关系进行可视化分析。
  • methods: 本研究使用了外部常识库来将模型的行为与人类知识相互对映,以提高模型的可视化解释。
  • results: 经过User Study,我们发现CommonsenseVIS可以帮助NLP专家在不同情况下进行系统性和批量的可视化分析,从而更好地理解模型对概念的关系。
    Abstract Recently, large pretrained language models have achieved compelling performance on commonsense benchmarks. Nevertheless, it is unclear what commonsense knowledge the models learn and whether they solely exploit spurious patterns. Feature attributions are popular explainability techniques that identify important input concepts for model outputs. However, commonsense knowledge tends to be implicit and rarely explicitly presented in inputs. These methods cannot infer models' implicit reasoning over mentioned concepts. We present CommonsenseVIS, a visual explanatory system that utilizes external commonsense knowledge bases to contextualize model behavior for commonsense question-answering. Specifically, we extract relevant commonsense knowledge in inputs as references to align model behavior with human knowledge. Our system features multi-level visualization and interactive model probing and editing for different concepts and their underlying relations. Through a user study, we show that CommonsenseVIS helps NLP experts conduct a systematic and scalable visual analysis of models' relational reasoning over concepts in different situations.
    摘要 To address this challenge, we propose CommonsenseVIS, a visual explanatory system that leverages external common sense knowledge bases to contextualize model behavior for common sense question-answering. Specifically, we extract relevant common sense knowledge from inputs and use it to align the model's behavior with human knowledge. Our system features multi-level visualization and interactive model probing and editing for different concepts and their underlying relations.Through a user study, we demonstrate that CommonsenseVIS helps NLP experts conduct a systematic and scalable visual analysis of the models' relational reasoning over concepts in different situations. By providing a visual interface for exploring the models' behavior, CommonsenseVIS enables experts to gain a deeper understanding of how the models are using common sense knowledge to make predictions. This can help improve the models' performance and ensure that they are making accurate and informed decisions.

Evaluating Emotional Nuances in Dialogue Summarization

  • paper_url: http://arxiv.org/abs/2307.12371
  • repo_url: None
  • paper_authors: Yongxin Zhou, Fabien Ringeval, François Portet
  • for: 本研究旨在提高对人工对话的自动概要summarization,以保留对话中情感内容的信息。
  • methods: 本文提出了一组名为$PEmo$的衡量量表,用于衡量对话概要中情感内容的保留情况。
  • results: 研究发现,现有的概要模型不太好地保留对话中情感内容,而且通过减少训练集中不情感对话,可以更好地保留情感内容,同时保留最重要的事实信息。
    Abstract Automatic dialogue summarization is a well-established task that aims to identify the most important content from human conversations to create a short textual summary. Despite recent progress in the field, we show that most of the research has focused on summarizing the factual information, leaving aside the affective content, which can yet convey useful information to analyse, monitor, or support human interactions. In this paper, we propose and evaluate a set of measures $PEmo$, to quantify how much emotion is preserved in dialog summaries. Results show that, summarization models of the state-of-the-art do not preserve well the emotional content in the summaries. We also show that by reducing the training set to only emotional dialogues, the emotional content is better preserved in the generated summaries, while conserving the most salient factual information.
    摘要 自动对话摘要是一个已经成熟的任务,目的是从人类对话中提取最重要的内容,创建简短的文本摘要。尽管最近的进步在这个领域,但大多数研究仍然专注于摘要的事实信息,忽略了情感内容,这种内容可以带来有用的信息,分析、监测或支持人类交流。在这篇论文中,我们提出并评估了一组测量方法$PEmo$,以量化对话摘要中情感内容的保留程度。结果表明,现有的摘要模型并不能很好地保留对话中的情感内容。我们还表明,通过将训练集限制为只包含情感对话,可以更好地保留对话摘要中的情感内容,同时保留最重要的事实信息。
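To make "how much emotion is preserved" concrete, a minimal sketch is shown below: predict per-utterance emotion labels on the dialogue and on its summary, and compare the two label distributions. The preservation score (one minus total variation distance) and the keyword classifier are illustrative stand-ins, not the $PEmo$ measures defined in the paper.

```python
import numpy as np
from collections import Counter

EMOTIONS = ["anger", "joy", "sadness", "fear", "neutral"]

def emotion_distribution(utterances, classify_emotion):
    """Normalized histogram of per-utterance emotion labels.

    classify_emotion is a hypothetical callable (e.g. a fine-tuned classifier)
    mapping a sentence to one label in EMOTIONS.
    """
    counts = Counter(classify_emotion(u) for u in utterances)
    total = sum(counts.values())
    return np.array([counts[e] / total for e in EMOTIONS])

def emotion_preservation(dialogue, summary, classify_emotion):
    """Illustrative preservation score: 1 - total variation distance between
    the dialogue's and the summary's emotion distributions."""
    p = emotion_distribution(dialogue, classify_emotion)
    q = emotion_distribution(summary, classify_emotion)
    return 1.0 - 0.5 * np.abs(p - q).sum()

# Toy stand-in classifier based on keywords (purely for the example).
def classify_emotion(sentence):
    s = sentence.lower()
    if "great" in s or "thanks" in s:
        return "joy"
    if "sorry" in s or "sad" in s:
        return "sadness"
    return "neutral"

dialogue = ["Thanks, that's great news!", "Sorry to hear about the delay.", "Let's meet at 3pm."]
summary = ["They agree to meet at 3pm."]
print(emotion_preservation(dialogue, summary, classify_emotion))
```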

cs.LG - 2023-07-24

QAmplifyNet: Pushing the Boundaries of Supply Chain Backorder Prediction Using Interpretable Hybrid Quantum - Classical Neural Network

  • paper_url: http://arxiv.org/abs/2307.12906
  • repo_url: None
  • paper_authors: Md Abrar Jahin, Md Sakib Hossain Shovon, Md. Saiful Islam, Jungpil Shin, M. F. Mridha, Yuichi Okuyama
  • for: 这个研究的目的是为供应链管理系统提供精确的货物预测,以便优化存储控制、降低成本和提高顾客满意度。
  • methods: 本研究提出了一个新的方法框架 (methodological framework),使用量子灵感技术实现供应链货物预测,并且可以处理短小且不平衡的数据集。
  • results: 实验结果显示,QAmplifyNet模型在短小且不平衡的数据集上的预测效果比 classical models、量子对集、量子神经网和深度强化学习模型更好。这个模型的可解释性和可扩展性使其成为供应链管理中的理想解决方案。
    Abstract Supply chain management relies on accurate backorder prediction for optimizing inventory control, reducing costs, and enhancing customer satisfaction. However, traditional machine-learning models struggle with large-scale datasets and complex relationships, hindering real-world data collection. This research introduces a novel methodological framework for supply chain backorder prediction, addressing the challenge of handling large datasets. Our proposed model, QAmplifyNet, employs quantum-inspired techniques within a quantum-classical neural network to predict backorders effectively on short and imbalanced datasets. Experimental evaluations on a benchmark dataset demonstrate QAmplifyNet's superiority over classical models, quantum ensembles, quantum neural networks, and deep reinforcement learning. Its proficiency in handling short, imbalanced datasets makes it an ideal solution for supply chain management. To enhance model interpretability, we use Explainable Artificial Intelligence techniques. Practical implications include improved inventory control, reduced backorders, and enhanced operational efficiency. QAmplifyNet seamlessly integrates into real-world supply chain management systems, enabling proactive decision-making and efficient resource allocation. Future work involves exploring additional quantum-inspired techniques, expanding the dataset, and investigating other supply chain applications. This research unlocks the potential of quantum computing in supply chain optimization and paves the way for further exploration of quantum-inspired machine learning models in supply chain management. Our framework and QAmplifyNet model offer a breakthrough approach to supply chain backorder prediction, providing superior performance and opening new avenues for leveraging quantum-inspired techniques in supply chain management.
    摘要 供应链管理需要准确预测营销订单,以优化存储控制、降低成本和提高客户满意度。然而,传统的机器学习模型在大规模数据集和复杂关系下难以处理实际数据收集。本研究提出了一种新的方法框架(methodological framework),用于 Supply Chain Backorder Prediction,解决大数据集处理的挑战。我们的提议模型,QAmplifyNet,在短时间和不均衡数据集上预测营销订单非常有效。实验评估表明QAmplifyNet在经典模型、量子ensemble、量子神经网络和深度强化学习方面具有突出的优势。由于它可以处理短时间和不均衡数据集,因此在供应链管理中是一个理想的解决方案。为了提高模型可读性,我们使用了可解释人工智能技术。实际应用包括改善存储控制、减少营销订单和提高运营效率。QAmplifyNet可以轻松整合到实际供应链管理系统中,允许执行投入式决策和有效资源分配。未来的工作包括探索更多的量子静止技术、扩大数据集和探索其他供应链应用。本研究开启了量子计算在供应链优化中的潜力,为了 leveraging量子静止机器学习模型在供应链管理中提供了一个突破性的方法。我们的框架和QAmplifyNet模型为供应链营销订单预测提供了超越性能,开创了新的可能性,以及可以在供应链管理中应用量子静止技术。

Universal Approximation Theorem and error bounds for quantum neural networks and quantum reservoirs

  • paper_url: http://arxiv.org/abs/2307.12904
  • repo_url: None
  • paper_authors: Lukas Gonon, Antoine Jacquier
  • for: 这个论文是为了证明Quantum Neural Network可以用于精确地预测函数的目的。
  • methods: 这篇论文使用了Parameterised quantum circuits和随机量子Circuits来 aproximate classical functions。
  • results: 这篇论文提供了具体的错误 bound,证明一个Quantum Neural Network可以在某些情况下以 $\mathcal{O}(\varepsilon^{-2})$ 的参数和 $\mathcal{O} (\lceil \log_2(\varepsilon^{-1}) \rceil)$ 个量子比特来实现精度 $\varepsilon>0$ 的函数预测。
    Abstract Universal approximation theorems are the foundations of classical neural networks, providing theoretical guarantees that the latter are able to approximate maps of interest. Recent results have shown that this can also be achieved in a quantum setting, whereby classical functions can be approximated by parameterised quantum circuits. We provide here precise error bounds for specific classes of functions and extend these results to the interesting new setup of randomised quantum circuits, mimicking classical reservoir neural networks. Our results show in particular that a quantum neural network with $\mathcal{O}(\varepsilon^{-2})$ weights and $\mathcal{O} (\lceil \log_2(\varepsilon^{-1}) \rceil)$ qubits suffices to achieve accuracy $\varepsilon>0$ when approximating functions with integrable Fourier transform.
    摘要 “universal approximation 定理是 classical neural network 的基础,提供了理论保证后者能够逼近感兴趣的映射。最近的结果表明这也可以在量子设置下实现,其中 classical 函数可以通过参数化 quantum circuit 的方式进行逼近。我们在这里为特定的函数类型提供了具体的误差 bound,并将其扩展到 randomized quantum circuit 中,模拟 classical reservoir neural network。我们的结果显示,一个 quantum neural network 只需 $\mathcal{O}(\varepsilon^{-2})$ weights 和 $\mathcal{O}(\lceil \log_2(\varepsilon^{-1}) \rceil)$ qubits,就能够在逼近具有可积傅里叶变换的函数时达到精度 $\varepsilon>0$。”

Anytime Model Selection in Linear Bandits

  • paper_url: http://arxiv.org/abs/2307.12897
  • repo_url: None
  • paper_authors: Parnian Kassraie, Aldo Pacchiano, Nicolas Emmenegger, Andreas Krause
  • for: This paper is written for solving the problem of model selection in the context of bandit optimization, which is a challenging problem that requires balancing exploration and exploitation not only for action selection, but also for model selection.
  • methods: The paper proposes a new method called ALEXP, which uses online learning algorithms that treat different models as experts and emulates full-information feedback to the online learner with a favorable bias-variance trade-off.
  • results: The paper shows that ALEXP has an exponentially improved ($\log M$) dependence on the number of models $M$ for its regret, and has anytime guarantees on its regret without requiring knowledge of the horizon $n$ or relying on an initial purely exploratory stage.
    Abstract Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat different models as experts. Existing methods, however, scale poorly ($\text{poly}M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, which has an exponentially improved ($\log M$) dependence on $M$ for its regret. ALEXP has anytime guarantees on its regret, and neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.
    摘要 Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, whose regret depends only logarithmically ($\log M$) on the number of models $M$, an exponential improvement over prior work. ALEXP has anytime guarantees on its regret, and does not require knowledge of the horizon $n$ or an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.

A Statistical View of Column Subset Selection

  • paper_url: http://arxiv.org/abs/2307.12892
  • repo_url: https://github.com/anavsood/css
  • paper_authors: Anav Sood, Trevor Hastie
  • for: 选择一小集合的表变量从大数据集中
  • methods: 使用某些简统统计方法,如某些简单的概率模型,来实现维度减少
  • results: 该paper表明了CSS和主要变量选择是等价的,并且两者都可以视为最大化信息的最优化问题。此外,paper还介绍了如何使用这些连接来快速完成CSS,包括使用摘要统计数据进行CSS和在缺失和/或抑制数据的情况下进行CSS。
    Abstract We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
    摘要 我们考虑从大数据集中选择一小部分代表变量的问题。在计算机科学文献中,这个维度减少问题通常被形式化为列子集选择(CSS)。在统计学文献中,这个问题通常被形式化为找到最大化信息的主要变量(Principal Variables)。这篇论文表明了这两种方法是等价的,并且它们都可以被视为某种半参数(semi-parametric)模型中的最大似然估计。利用这些联系,我们展示了如何:1. 仅使用原始数据集的摘要统计来高效完成 CSS;2. 在存在缺失和/或截断数据的情况下进行 CSS;3. 在假设检验框架中选择 CSS 的子集大小。
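The claim that CSS can be run from summary statistics alone can be illustrated with a greedy forward-selection variant that operates directly on the sample covariance matrix. This is a sketch of the general idea under simple assumptions (a greedy explained-variance criterion), not the estimator or the subset-size test from the paper.

```python
import numpy as np

def greedy_css(Sigma, k):
    """Greedily pick k column indices using only the covariance matrix Sigma.

    At each step, choose the column that explains the most residual variance
    across all variables, then deflate the covariance. Illustrative sketch only.
    """
    p = Sigma.shape[0]
    selected = []
    residual = Sigma.copy()          # covariance of residuals after projection
    for _ in range(k):
        scores = np.full(p, -np.inf)
        for j in range(p):
            if j in selected or residual[j, j] <= 1e-12:
                continue
            # Variance explained by regressing every residual variable on column j.
            scores[j] = np.sum(residual[:, j] ** 2) / residual[j, j]
        j_star = int(np.argmax(scores))
        selected.append(j_star)
        # Deflate: remove the component of every variable explained by column j*.
        v = residual[:, j_star] / np.sqrt(residual[j_star, j_star])
        residual = residual - np.outer(v, v)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 5] = X[:, 0] + 0.1 * rng.normal(size=500)   # make column 5 nearly redundant
Sigma = np.cov(X, rowvar=False)                  # summary statistic: covariance only
print(greedy_css(Sigma, 3))
```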

Interpretable Stereotype Identification through Reasoning

  • paper_url: http://arxiv.org/abs/2308.00071
  • repo_url: None
  • paper_authors: Jacob-Junqi Tian, Omkar Dige, David Emerson, Faiza Khan Khattak
  • for: 这篇研究的目的是探讨语言模型中的偏见,并将公平性集成到语言模型的开发过程中,以确保这些模型是不偏不倾的。
  • methods: 本研究使用了Vicuna-13B-v1.3进行零样本 (zero-shot) stereotype 识别任务,并比较了 scaling up from 13B to 33B 与 reasoning 的效果。
  • results: 研究发现,reasoning 可以帮助语言模型在离 Domain tasks 中提高精度,并且可以增加模型的解释力。
    Abstract Given that language models are trained on vast datasets that may contain inherent biases, there is a potential danger of inadvertently perpetuating systemic discrimination. Consequently, it becomes essential to examine and address biases in language models, integrating fairness into their development to ensure these models are equitable and free from bias. In this work, we demonstrate the importance of reasoning in zero-shot stereotype identification based on Vicuna-13B-v1.3. While we do observe improved accuracy by scaling from 13B to 33B, we show that the performance gain from reasoning significantly exceeds the gain from scaling up. Our findings suggest that reasoning could be a key factor that enables LLMs to transcend the scaling law on out-of-domain tasks such as stereotype identification. Additionally, through a qualitative analysis of select reasoning traces, we highlight how reasoning enhances not just accuracy but also the interpretability of the decision.

Data-free Black-box Attack based on Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.12872
  • repo_url: None
  • paper_authors: Mingwen Shao, Lingzhuang Meng, Yuanjian Qiao, Lixu Zhang, Wangmeng Zuo
  • for: 本研究旨在提高数据free黑盒攻击的效率和准确性,通过使用扩散模型生成数据进行训练代理模型。
  • methods: 本研究使用扩散模型生成数据,并提出了一种干扰代码增强(LCA)方法,以指导扩散模型生成数据。LCA方法可以使得生成的数据符合目标模型的批判标准,同时保持高的多样性。
  • results: 对于不同的目标模型,我们的LCA方法可以获得更高的攻击成功率,并且需要更少的查询预算。EXTensive experiments表明,我们的LCA方法可以提高数据free黑盒攻击的效率和准确性。
    Abstract Since the training data for the target model in a data-free black-box attack is not available, most recent schemes utilize GANs to generate data for training substitute model. However, these GANs-based schemes suffer from low training efficiency as the generator needs to be retrained for each target model during the substitute training process, as well as low generation quality. To overcome these limitations, we consider utilizing the diffusion model to generate data, and propose a data-free black-box attack scheme based on diffusion model to improve the efficiency and accuracy of substitute training. Despite the data generated by the diffusion model exhibits high quality, it presents diverse domain distributions and contains many samples that do not meet the discriminative criteria of the target model. To further facilitate the diffusion model to generate data suitable for the target model, we propose a Latent Code Augmentation (LCA) method to guide the diffusion model in generating data. With the guidance of LCA, the data generated by the diffusion model not only meets the discriminative criteria of the target model but also exhibits high diversity. By utilizing this data, it is possible to train substitute model that closely resemble the target model more efficiently. Extensive experiments demonstrate that our LCA achieves higher attack success rates and requires fewer query budgets compared to GANs-based schemes for different target models.
    摘要 因为目标模型的训练数据不可获得,大多数最新的方案使用GANs生成数据来训练代理模型。然而,这些GANs基于的方案受到低训练效率和低生成质量的限制。为了突破这些限制,我们考虑使用扩散模型生成数据,并提出了基于扩散模型的数据 свобо black-box攻击方案,以提高代理训练的效率和准确性。尽管扩散模型生成的数据具有高质量,但它们具有多样的领域分布和含有许多不符合目标模型的扩散标准的样本。为了使扩散模型更加适合目标模型,我们提出了幽默代码修饰(LCA)方法,以导引扩散模型生成数据。通过LCA的引导,扩散模型生成的数据不仅满足目标模型的扩散标准,而且具有高多样性。通过这些数据,我们可以更加快速地训练符合目标模型的代理模型。我们的LCA在不同的目标模型上实现了更高的攻击成功率和更少的查询预算。

Stochastic Step-wise Feature Selection for Exponential Random Graph Models (ERGMs)

  • paper_url: http://arxiv.org/abs/2307.12862
  • repo_url: None
  • paper_authors: Helal El-Zaatari, Fei Yu, Michael R Kosorok
  • for: 这 paper 的目的是提供一种改进的 exponential random graph models (ERGMs) 模型,以更好地Capture 社交网络中的依赖关系。
  • methods: 该 paper 使用了一种新的方法,即选择内生变量(endogenous variable selection),以解决 ERGMs 中的degeneracy问题,并提高了网络模型的准确性。
  • results: 经验测试表明,该方法可以有效地避免 ERGMs 中的degeneracy问题,并提高了网络模型的准确性和可靠性。
    Abstract Statistical analysis of social networks provides valuable insights into complex network interactions across various scientific disciplines. However, accurate modeling of networks remains challenging due to the heavy computational burden and the need to account for observed network dependencies. Exponential Random Graph Models (ERGMs) have emerged as a promising technique used in social network modeling to capture network dependencies by incorporating endogenous variables. Nevertheless, using ERGMs poses multiple challenges, including the occurrence of ERGM degeneracy, which generates unrealistic and meaningless network structures. To address these challenges and enhance the modeling of collaboration networks, we propose and test a novel approach that focuses on endogenous variable selection within ERGMs. Our method aims to overcome the computational burden and improve the accommodation of observed network dependencies, thereby facilitating more accurate and meaningful interpretations of network phenomena in various scientific fields. We conduct empirical testing and rigorous analysis to contribute to the advancement of statistical techniques and offer practical insights for network analysis.
    摘要 (Simplified Chinese translation)社交网络统计分析提供了许多有价值的网络互动现象的理解,但是准确地模型网络仍然是一项挑战,因为计算负担重要和需要考虑观察到的网络依赖关系。扩展随机图模型(ERGMs)在社交网络模型中表现出了扩展的潜力,可以通过包含内生变量来捕捉网络依赖关系。然而,使用ERGMs也存在多种挑战,包括ERGM异常性,这会生成无意义和不切实际的网络结构。为了解决这些挑战并改进协作网络的模型,我们提出了一种新的方法,即内生变量选择在ERGMs中。我们的方法目的是减少计算负担和更好地考虑观察到的网络依赖关系,从而为不同科学领域中的网络现象提供更准确和有意义的解释。我们进行了实际测试和严格分析,以贡献到统计技术的进步和为网络分析提供实用的指导。

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

  • paper_url: http://arxiv.org/abs/2307.12856
  • repo_url: None
  • paper_authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
  • for: 本研究旨在提高自主浏览器的自然语言指令下的实际website上的性能。
  • methods: 研究人员提出了WebAgent,该Agent可以根据自然语言指令完成实际website上的任务。WebAgent使用Flan-U-PaLM进行code生成,使用HTML-T5进行规划和摘要;HTML-T5采用本地和全局注意力机制以及混合长span去噪目标来处理长HTML文档。
  • results: 研究人员通过实验表明,WebAgent在真实的website上提高了成功率超过50%,并且HTML-T5在解决HTML基本任务方面的成功率高于之前的SoTA,达到14.9%。
    Abstract Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web navigation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that can complete the tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via generated Python programs from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our recipe improves the success on a real website by over 50%, and that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9% higher success rate than prior SoTA on the MiniWoB web navigation benchmark and better accuracy on offline task planning evaluation.
    摘要 各种大型自然语言模型(LLM)在自主网络浏览方面最近很有进步,但实际网站上的性能仍然受到以下三种因素的影响:开放性、限制的上下文长度和HTML的适应性。我们介绍了WebAgent,一个根据自然语言指令完成实际网站上的任务的LML-驱动的代理人。WebAgent通过将指令分解成标准化的子指令、将长HTML文档摘要成任务相关的短报道,并通过生成的Python程序来操作网站。我们为WebAgent设计了Flan-U-PaLM,用于锚定代码生成,以及HTML-T5,一种新的适应HTML文档的预训练语言模型,使用本地和全球注意力机制,并结合长时间排除目标来计划和摘要。我们实际证明了我们的配方可以在真实网站上提高成功率高于50%,并且HTML-T5是解决HTML基本任务的最佳模型,比前一个SoTA在小型网站浏览 benchmark上的成功率高14.9%。

Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization

  • paper_url: http://arxiv.org/abs/2307.12851
  • repo_url: None
  • paper_authors: Hancheng Min, René Vidal, Enrique Mallada
  • for: 这 paper 研究了使用梯度流的两层 ReLU 网络进行二分类训练,并且使用小初值。
  • methods: 我们分析了训练集中输入向量之间的相关性,并通过对神经元方向动态的仔细分析,给出了所有神经元达到良好对齐所需训练时间的 $\mathcal{O}(\frac{\log n}{\sqrt{\mu}})$ 上界,其中 $n$ 是数据点的数量,$\mu$ 衡量数据的分离程度。
  • results: 我们的分析表明,在训练的早期阶段,神经元在第一层会尝试与输入数据进行对齐,并且在训练的晚期阶段,损失函数会逐渐逼近零,并且第一层的权重矩阵会变得相对低矩。数据实验表明了我们的理论发现。
    Abstract This paper studies the problem of training a two-layer ReLU network for binary classification using gradient flow with small initialization. We consider a training dataset with well-separated input vectors: Any pair of input data with the same label are positively correlated, and any pair with different labels are negatively correlated. Our analysis shows that, during the early phase of training, neurons in the first layer try to align with either the positive data or the negative data, depending on its corresponding weight on the second layer. A careful analysis of the neurons' directional dynamics allows us to provide an $\mathcal{O}(\frac{\log n}{\sqrt{\mu}})$ upper bound on the time it takes for all neurons to achieve good alignment with the input data, where $n$ is the number of data points and $\mu$ measures how well the data are separated. After the early alignment phase, the loss converges to zero at a $\mathcal{O}(\frac{1}{t})$ rate, and the weight matrix on the first layer is approximately low-rank. Numerical experiments on the MNIST dataset illustrate our theoretical findings.
    摘要 这篇论文研究了使用梯度流和小初始化训练两层ReLU网络用于二分类问题。我们考虑了一个具有良好分离的训练集:任何同样标签的输入数据对都是正相关的,任何不同标签的输入数据对都是负相关的。我们的分析表明,在训练的早期阶段,第一层神经元会根据其在第二层上对应的权重,尝试与正类数据或负类数据对齐。通过仔细分析神经元的方向动态,我们可以给出所有神经元达到良好对齐所需时间的 $\mathcal{O}(\frac{\log n}{\sqrt{\mu}})$ 上界,其中 $n$ 是数据点的数量,$\mu$ 衡量数据的分离程度。在早期对齐阶段之后,损失以 $\mathcal{O}(\frac{1}{t})$ 的速率收敛到零,并且第一层的权重矩阵近似为低秩矩阵。在 MNIST 数据集上的数值实验验证了这些理论发现。
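A small numerical sketch of the regime described above is given below: a two-layer ReLU network with small initialization is trained by gradient descent on two well-separated clusters, and the alignment of first-layer neurons with the cluster direction is tracked. The data generation, step size, and alignment measure are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, h = 20, 40, 50                       # input dim, samples, hidden width

# Two well-separated clusters: same-label points positively correlated,
# opposite labels negatively correlated (illustrative construction).
mu = rng.normal(size=d); mu /= np.linalg.norm(mu)
X = np.vstack([ mu + 0.05 * rng.normal(size=(n // 2, d)),
               -mu + 0.05 * rng.normal(size=(n // 2, d))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

# Small initialization, as in the analyzed regime.
W = 1e-3 * rng.normal(size=(h, d))         # first layer
v = 1e-3 * rng.normal(size=h)              # second layer

lr = 0.05
for step in range(20001):
    pre = X @ W.T                          # (n, h) pre-activations
    act = np.maximum(pre, 0.0)             # ReLU
    err = act @ v - y                      # gradient of 0.5 * squared loss
    grad_v = act.T @ err / n
    grad_W = ((err[:, None] * v[None, :] * (pre > 0)).T @ X) / n
    v -= lr * grad_v
    W -= lr * grad_W
    if step % 5000 == 0:
        # Alignment of each first-layer neuron with the positive-cluster direction.
        cos = (W @ mu) / (np.linalg.norm(W, axis=1) * np.linalg.norm(mu) + 1e-12)
        loss = 0.5 * np.mean(err ** 2)
        print(f"step {step:6d}  loss {loss:.4f}  mean |cos(w_i, mu)| {np.abs(cos).mean():.3f}")
```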

Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials

  • paper_url: http://arxiv.org/abs/2307.12840
  • repo_url: None
  • paper_authors: Ilias Diakonikolas, Daniel M. Kane
  • for: 学习一个 linear combination of $k$ ReLU 活化器在标准 Gaussian 分布上的 $\mathbb{R}^d$ 上的问题,使用标准损失函数。
  • methods: 使用 tensor decomposition 技术来识别一个子空间,使得所有 $O(k)$-order моменты在正交方向上都很小。
  • results: 提供了一个高效的算法,其复杂度为 $(dk/\epsilon)^{O(k)}$, 比之前的算法更为优化。
    Abstract We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss. Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy. Prior work had given an algorithm for this problem with complexity $(dk/\epsilon)^{h(k)}$, where the function $h(k)$ scales super-polynomially in $k$. Interestingly, the complexity of our algorithm is near-optimal within the class of Correlational Statistical Query algorithms. At a high-level, our algorithm uses tensor decomposition to identify a subspace such that all the $O(k)$-order moments are small in the orthogonal directions. Its analysis makes essential use of the theory of Schur polynomials to show that the higher-moment error tensors are small given that the lower-order ones are.
    摘要 我们研究一个PAC学习问题,即以$k$个ReLU激活函数为线性结构,在标准 Gaussian 分布下的 $\mathbb{R}^d$ 上对对于方差损失函数进行学习。我们的主要结果是一个有效的学习算法,其sample和computational Complexity为 $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ 是目标精度。对比之下,先前的算法的复杂度为 $(dk/\epsilon)^{h(k)}$, where $h(k)$ scales 超 polynomial 地增长。有趣的是,我们的算法的复杂度几乎是near-optimal within the class of Correlational Statistical Query algorithms。在高阶概念上,我们的算法使用了维度分解来识别一个子空间,使得所有 $O(k)$-order moments 在这个orthogonal direction 上都是小的。其分析将使用Schur多项式理论来显示,在这个子空间上,更高阶的error tensors 是小的,只要Lower-order ones 是。

Learning Provably Robust Estimators for Inverse Problems via Jittering

  • paper_url: http://arxiv.org/abs/2307.12822
  • repo_url: https://github.com/mli-lab/robust_reconstructors_via_jittering
  • paper_authors: Anselm Krainovic, Mahdi Soltanolkotabi, Reinhard Heckel
  • for: 这篇论文 investigate whether jittering, a simple regularization technique, can be used to train deep neural networks to be worst-case robust for inverse problems.
  • methods: 论文使用了jittering regularization technique during training, and presents a novel analytical characterization of the optimal $\ell_2$-worst-case robust estimator for linear denoising.
  • results: 研究发现,jittering可以增强worst-case robustness,但可能不适用于 inverse problems beyond denoising。同时,论文还发现,使用实际数据进行训练可以提供一定的 robustness enhancement.
    Abstract Deep neural networks provide excellent performance for inverse problems such as denoising. However, neural networks can be sensitive to adversarial or worst-case perturbations. This raises the question of whether such networks can be trained efficiently to be worst-case robust. In this paper, we investigate whether jittering, a simple regularization technique that adds isotropic Gaussian noise during training, is effective for learning worst-case robust estimators for inverse problems. While well studied for prediction in classification tasks, the effectiveness of jittering for inverse problems has not been systematically investigated. In this paper, we present a novel analytical characterization of the optimal $\ell_2$-worst-case robust estimator for linear denoising and show that jittering yields optimal robust denoisers. Furthermore, we examine jittering empirically via training deep neural networks (U-nets) for natural image denoising, deconvolution, and accelerated magnetic resonance imaging (MRI). The results show that jittering significantly enhances the worst-case robustness, but can be suboptimal for inverse problems beyond denoising. Moreover, our results imply that training on real data which often contains slight noise is somewhat robustness enhancing.
    摘要 深度神经网络在反向问题中表现出色,但是神经网络可能对抗性或最坏情况的扰动敏感。这引起了训练神经网络是否可以有效地培养最坏情况Robust的问题。在这篇论文中,我们调查了在反向问题中是否可以使用扰动,一种简单的规范技术,来学习最坏情况Robust的估计器。虽然在预测类型任务中well studied,但是反向问题中扰动的效iveness尚未系统地研究。在这篇论文中,我们提供了一种新的分析 Characterization of the optimal $\ell_2$-worst-case robust estimator for linear denoising,并证明了扰动可以生成最优的Robust denoiser。此外,我们通过训练深度神经网络(U-net)对自然图像杂谔、减 convolution和加速核磁共振成像(MRI)进行实验。结果表明,扰动可以强化最坏情况的Robust性,但可能不适用于反向问题 beyond denoising。此外,我们的结果也表明,训练在真实数据上,通常含有些许噪声,可以提高Robust性。
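Jittering as studied here amounts to adding fresh isotropic Gaussian noise to the network input at every training step. A minimal PyTorch-style sketch of one jittered training step for a denoiser is shown below; the toy architecture, noise level, and loss are placeholders, not the U-Net setup from the paper.

```python
import torch
import torch.nn as nn

# Toy reconstruction network standing in for a U-Net; purely illustrative.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sigma_jitter = 0.1          # std of the Gaussian jitter added during training

def train_step(y_noisy, x_clean):
    """One step of jittered training for a denoiser x_hat = f(y).

    y_noisy : measurements (clean signal + measurement noise)
    x_clean : ground-truth signals
    """
    # Jittering: perturb the input with fresh isotropic Gaussian noise.
    y_jittered = y_noisy + sigma_jitter * torch.randn_like(y_noisy)
    x_hat = model(y_jittered)
    loss = ((x_hat - x_clean) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with synthetic data.
x = torch.randn(32, 64)
y = x + 0.05 * torch.randn_like(x)        # simulated measurement noise
print(train_step(y, x))
```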

Maximal Independent Sets for Pooling in Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.13011
  • repo_url: None
  • paper_authors: Stevan Stanovic, Benoit Gaüzère, Luc Brun
  • for: 图神经网络中的图池化方法改进
  • methods: 基于最大独立集的三种图池化方法
  • results: 实验结果证明了最大独立集约束对图池化的相关性
    Abstract Convolutional Neural Networks (CNNs) have enabled major advances in image classification through convolution and pooling. In particular, image pooling transforms a connected discrete lattice into a reduced lattice with the same connectivity and allows reduction functions to consider all pixels in an image. However, there is no pooling that satisfies these properties for graphs. In fact, traditional graph pooling methods suffer from at least one of the following drawbacks: Graph disconnection or overconnection, low decimation ratio, and deletion of large parts of graphs. In this paper, we present three pooling methods based on the notion of maximal independent sets that avoid these pitfalls. Our experimental results confirm the relevance of maximal independent set constraints for graph pooling.
    摘要 卷积神经网络(CNNs)已经为图像分类带来重要的进步,通过卷积和聚合。特别是图像聚合将连接的离散网络转换为减少的网络,让减少函数考虑整个图像中的所有像素。然而,为图集而设计的pooling方法存在一些缺点,包括图集分离或过度连接、低减少比率和删除大量图集。在这篇论文中,我们提出了基于最大独立集的三种pooling方法,避免了这些缺点。我们的实验结果证明了最大独立集约束对图集聚合具有重要性。
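The building block behind these pooling methods, a maximal independent set (no two selected nodes are adjacent, and every unselected node has a selected neighbour), can be computed greedily. A minimal sketch follows; the degree-based visiting order is only an illustrative heuristic, not the selection rule of the proposed pooling layers.

```python
def maximal_independent_set(adj):
    """Greedy maximal independent set of an undirected graph.

    adj: dict mapping node -> set of neighbours.
    Returns a set S such that no two nodes in S are adjacent and every node
    outside S has a neighbour in S (maximality).
    """
    selected, blocked = set(), set()
    # Visiting low-degree nodes first tends to keep more nodes (illustrative heuristic).
    for node in sorted(adj, key=lambda u: len(adj[u])):
        if node in blocked:
            continue
        selected.add(node)
        blocked.add(node)
        blocked.update(adj[node])        # neighbours can no longer be selected
    return selected

# Small example: a path graph 0 - 1 - 2 - 3 - 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(maximal_independent_set(adj))      # e.g. {0, 2, 4}
```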

Causal Fair Machine Learning via Rank-Preserving Interventional Distributions

  • paper_url: http://arxiv.org/abs/2307.12797
  • repo_url: https://github.com/slds-lmu/paper_2023_cfml
  • paper_authors: Ludwig Bothmann, Susanne Dandl, Michael Schomaker
  • for: 该论文旨在设计机器学习模型,以减少自动决策系统中的不公正性。
  • methods: 该论文提出了一种基于 causal thinking 的方法,通过引入保护属性来定义个体是否是normatively equal。该方法使用rank-preserving interventional distributions来定义一个FiND世界,并使用扭曲方法进行估计。
  • results: 该论文通过实验和实际数据 validate了该方法和模型的评价标准,并显示了该方法能够减少不公正性。
    Abstract A decision can be defined as fair if equal individuals are treated equally and unequals unequally. Adopting this definition, the task of designing machine learning models that mitigate unfairness in automated decision-making systems must include causal thinking when introducing protected attributes. Following a recent proposal, we define individuals as being normatively equal if they are equal in a fictitious, normatively desired (FiND) world, where the protected attribute has no (direct or indirect) causal effect on the target. We propose rank-preserving interventional distributions to define an estimand of this FiND world and a warping method for estimation. Evaluation criteria for both the method and resulting model are presented and validated through simulations and empirical data. With this, we show that our warping approach effectively identifies the most discriminated individuals and mitigates unfairness.
    摘要 一个决策可以被定义为公平的,如果对等的人进行对等的待遇,不同的人则不同的待遇。在设计自动化决策系统中减少不公的方面,我们应该采用 causal 思维。根据最近的提议,我们定义了一个人为在一个虚拟、normatively 愿望的 (FiND) 世界中是否是等值的。我们提议使用排名保持分布来定义这个FiND世界的估计量,并使用扭曲方法进行估计。我们对方法和模型的评价标准和验证结果进行了说明和验证,并通过实验数据和仿真数据来显示我们的扭曲方法能够有效地找到最受歧视的个体并减少不公。
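The warping idea (moving individuals' values to their counterparts in a fictitious world while preserving ranks) can be illustrated with a simple quantile mapping between groups. This is only a schematic stand-in for the paper's rank-preserving interventional distributions; the choice of reference distribution below is an assumption made for the example.

```python
import numpy as np

def rank_preserving_warp(x_group, x_reference):
    """Map each value in x_group to the value at the same rank (quantile) in
    x_reference. Ranks within the group are preserved, but the marginal
    distribution is moved to that of the reference ("FiND-world" stand-in)."""
    ranks = np.argsort(np.argsort(x_group))                # 0 .. n-1
    quantiles = (ranks + 0.5) / len(x_group)
    return np.quantile(x_reference, quantiles)

rng = np.random.default_rng(1)
# Feature with a group-dependent shift induced by a protected attribute.
x_a = rng.normal(loc=0.0, scale=1.0, size=1000)            # reference group
x_b = rng.normal(loc=0.7, scale=1.2, size=1000)            # shifted group
x_b_warped = rank_preserving_warp(x_b, x_a)

print("means before:", x_a.mean().round(2), x_b.mean().round(2))
print("mean after warping group b:", x_b_warped.mean().round(2))
# Ranks within group b are unchanged by the warp:
print(np.array_equal(np.argsort(x_b), np.argsort(x_b_warped)))
```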

Compact & Capable: Harnessing Graph Neural Networks and Edge Convolution for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.12790
  • repo_url: https://github.com/anonrepo-keeper/gcnn-ec
  • paper_authors: Aryan Singh, Pepijn Van de Ven, Ciarán Eising, Patrick Denny
  • for: 这个研究探索了图形神经网络(Graph Neural Network,GNN)在医疗影像分类中的潜力。
  • methods: 我们提出了一种新的模型,将图神经网络与边卷积(edge convolution)相结合,利用RGB通道特征值之间的相互关系,更好地表示关键图节点之间的连接。
  • results: 我们的模型不仅与现有的深度神经网络(Deep Neural Network,DNN)相比,表现优化,且仅需1000个参数,训练时间和数据需求都被降低了。我们将这个GCNN模型与预训练的DNN进行比较,发现GCNN在医疗影像分类任务中表现出色,并鼓励进一步探索更进阶的图形基于模型,如图形注意力网络(Graph Attention Network,GAT)和图形自动编码器(Graph Auto-Encoder)在医疗影像领域的应用。
    Abstract Graph-based neural network models are gaining traction in the field of representation learning due to their ability to uncover latent topological relationships between entities that are otherwise challenging to identify. These models have been employed across a diverse range of domains, encompassing drug discovery, protein interactions, semantic segmentation, and fluid dynamics research. In this study, we investigate the potential of Graph Neural Networks (GNNs) for medical image classification. We introduce a novel model that combines GNNs and edge convolution, leveraging the interconnectedness of RGB channel feature values to strongly represent connections between crucial graph nodes. Our proposed model not only performs on par with state-of-the-art Deep Neural Networks (DNNs) but does so with 1000 times fewer parameters, resulting in reduced training time and data requirements. We compare our Graph Convolutional Neural Network (GCNN) to pre-trained DNNs for classifying MedMNIST dataset classes, revealing promising prospects for GNNs in medical image analysis. Our results also encourage further exploration of advanced graph-based models such as Graph Attention Networks (GAT) and Graph Auto-Encoders in the medical imaging domain. The proposed model yields more reliable, interpretable, and accurate outcomes for tasks like semantic segmentation and image classification compared to simpler GCNNs
    摘要 “基于图的神经网络模型在知识学习领域受到广泛应用,因为它们可以捕捉难以识别的实体之间的隐藏 topological 关系。这些模型在药物发现、蛋白质交互、semantic segmentation 和 fluid dynamics 等领域中得到应用。在本研究中,我们调查了医学图像分类中的可能性,并提出了一种新的模型,该模型将基于图的神经网络(GCNN)和边 convolution 结合在一起,通过RGB通道特征值之间的连接来强大地表示关键图节点之间的连接。我们的提出的模型不仅与现有的深度神经网络(DNN)性能相似,而且具有1000倍少的参数,从而减少了训练时间和数据需求。我们对MedMNIST 数据集类别进行比较,发现GCNN在医学图像分类中有良好的前景,并且鼓励进一步探索更高级的图基于模型,如图注意力网络(GAT)和图自动编码器(GAE)在医学图像分类领域。GCNN 模型在 semantic segmentation 和图像分类任务中提供了更可靠、可解释、高精度的结果,相比于简单的 GCNN 模型”

Deep neural network improves the estimation of polygenic risk scores for breast cancer

  • paper_url: http://arxiv.org/abs/2307.13010
  • repo_url: None
  • paper_authors: Adrien Badré, Li Zhang, Wellington Muchero, Justin C. Reynolds, Chongle Pan
  • for: 这个研究用于比较多种计算模型来计算乳腺癌风险分数(PRS)。
  • methods: 这个研究使用了深度神经网络(DNN)和其他机器学习技术以及统计学方法,包括BLUP、BayesA和LDpred。
  • results: DNN在测试群体中表现出色,其AUC为67.4%,比其他方法高。此外,DNN还能够区分患者子群体,并且在90%精确率下达到18.8%的召回率。这些结果表明,DNN可以更好地预测乳腺癌风险。
    Abstract Polygenic risk scores (PRS) estimate the genetic risk of an individual for a complex disease based on many genetic variants across the whole genome. In this study, we compared a series of computational models for estimation of breast cancer PRS. A deep neural network (DNN) was found to outperform alternative machine learning techniques and established statistical algorithms, including BLUP, BayesA and LDpred. In the test cohort with 50% prevalence, the Area Under the receiver operating characteristic Curve (AUC) were 67.4% for DNN, 64.2% for BLUP, 64.5% for BayesA, and 62.4% for LDpred. BLUP, BayesA, and LPpred all generated PRS that followed a normal distribution in the case population. However, the PRS generated by DNN in the case population followed a bi-modal distribution composed of two normal distributions with distinctly different means. This suggests that DNN was able to separate the case population into a high-genetic-risk case sub-population with an average PRS significantly higher than the control population and a normal-genetic-risk case sub-population with an average PRS similar to the control population. This allowed DNN to achieve 18.8% recall at 90% precision in the test cohort with 50% prevalence, which can be extrapolated to 65.4% recall at 20% precision in a general population with 12% prevalence. Interpretation of the DNN model identified salient variants that were assigned insignificant p-values by association studies, but were important for DNN prediction. These variants may be associated with the phenotype through non-linear relationships.
    摘要 多因素风险分数(PRS)用于估计个体复杂疾病的遗传风险,基于整个基因组中的多个遗传变异。本研究比较了多种计算模型来估计乳腺癌PRS。深度神经网络(DNN)被发现超过了其他机器学习技术和确立的统计算法,包括BLUP、BayesA和LDpred。在测试群中的50%预测率下,DNN的AUC分数为67.4%,BLUP的AUC分数为64.2%,BayesA的AUC分数为64.5%,LDpred的AUC分数为62.4%。BLUP、BayesA和LPpred都生成的PRS在正例群中遵循正态分布。然而,DNN在疾病群中生成的PRS遵循了二元分布,由两个正态分布组成,其中一个分布的mean值明显高于控制群的mean值,另一个分布的mean值与控制群的mean值类似。这表明DNN能够将疾病群分为高遗传风险子群和正常遗传风险子群,其中高遗传风险子群的PRS平均值明显高于控制群的PRS,而正常遗传风险子群的PRS与控制群的PRS类似。这使得DNN在50%预测率下实现了18.8%的回归率,在20%预测率下可以扩展到65.4%的回归率。DNN模型的解释发现了一些被关键性评估为无关的变异,但对DNN预测是重要的。这些变异可能与现象之间存在非线性关系。
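As a sketch of the modelling setup (a feed-forward network mapping genome-wide variant dosages to a risk score), the snippet below trains a small classifier on synthetic SNP data. Network size, data generation, and training details are placeholders, not the architecture or cohort reported in the paper.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
n, p = 2000, 500                                           # individuals, SNPs (tiny synthetic example)
G = rng.integers(0, 3, size=(n, p)).astype(np.float32)     # genotype dosages 0/1/2
beta = np.zeros(p, dtype=np.float32); beta[:20] = 0.3      # 20 causal variants
logits = G @ beta - (G @ beta).mean()
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(np.float32)

model = nn.Sequential(nn.Linear(p, 64), nn.ReLU(),
                      nn.Linear(64, 16), nn.ReLU(),
                      nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

Gt, yt = torch.from_numpy(G), torch.from_numpy(y)
for epoch in range(200):
    opt.zero_grad()
    out = model(Gt).squeeze(1)
    loss = loss_fn(out, yt)
    loss.backward()
    opt.step()

with torch.no_grad():
    prs = torch.sigmoid(model(Gt)).squeeze(1)               # network-based risk score
print("training loss:", float(loss),
      "mean score cases vs controls:", float(prs[yt == 1].mean()), float(prs[yt == 0].mean()))
```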

Analyzing the Strategy of Propaganda using Inverse Reinforcement Learning: Evidence from the 2022 Russian Invasion of Ukraine

  • paper_url: http://arxiv.org/abs/2307.12788
  • repo_url: None
  • paper_authors: Dominique Geissler, Stefan Feuerriegel
  • for: 这个研究旨在分析2022年俄罗斯入侵乌克兰的社交媒体宣传活动的战略。
  • methods: 这个研究使用反强化学习(IRL)方法来分析社交媒体上的宣传行为。
  • results: 研究发现,负面宣传的机器人和人类用户采取不同策略:机器人主要回应支持入侵的消息,而人类用户主要回应反对消息,这表明机器人寻求把消息推广,而人类用户更倾向于进行批评讨论。
    Abstract The 2022 Russian invasion of Ukraine was accompanied by a large-scale, pro-Russian propaganda campaign on social media. However, the strategy behind the dissemination of propaganda has remained unclear, particularly how the online discourse was strategically shaped by the propagandists' community. Here, we analyze the strategy of the Twitter community using an inverse reinforcement learning (IRL) approach. Specifically, IRL allows us to model online behavior as a Markov decision process, where the goal is to infer the underlying reward structure that guides propagandists when interacting with users with a supporting or opposing stance toward the invasion. Thereby, we aim to understand empirically whether and how between-user interactions are strategically used to promote the proliferation of Russian propaganda. For this, we leverage a large-scale dataset with 349,455 posts with pro-Russian propaganda from 132,131 users. We show that bots and humans follow a different strategy: bots respond predominantly to pro-invasion messages, suggesting that they seek to drive virality; while messages indicating opposition primarily elicit responses from humans, suggesting that they tend to engage in critical discussions. To the best of our knowledge, this is the first study analyzing the strategy behind propaganda from the 2022 Russian invasion of Ukraine through the lens of IRL.
    摘要 俄罗斯入侵乌克兰的2022年社交媒体宣传活动拥有了大规模的俄罗斯支持者宣传运动。然而,这些宣传的执行策略仍然不清楚,特别是在线媒体互动如何被推动者社区战略性地形成。在这里,我们使用反向强化学习(IRL)方法来分析推特社区的策略。具体来说,IRL允许我们将在线行为视为Markov决策过程,其中的目标是推断推特用户在与支持或反对入侵的用户互动时的奖励结构。由此,我们希望理解在线互动是如何被推动者用于推广俄罗斯宣传的。为此,我们利用了349,455条推特文章和132,131名用户的大规模数据集。我们发现,机器人和人类采取不同策略:机器人主要回应支持入侵的消息,表明它们想要驱动病毒性;而表达反对的消息主要引起人类的回应,表明人类更倾向于进行批评讨论。根据我们所知,这是第一篇分析2022年俄罗斯入侵乌克兰宣传策略的IRL研究。

Is attention all you need in medical image analysis? A review

  • paper_url: http://arxiv.org/abs/2307.12775
  • repo_url: None
  • paper_authors: Giorgos Papanastasiou, Nikolaos Dikaios, Jiahao Huang, Chengjia Wang, Guang Yang
  • for: This paper reviews and analyzes existing hybrid CNN-Transf/Attention models for medical image analysis (MIA) problems, and discusses their generalization opportunities for scientific and clinical impact.
  • methods: The paper uses a comprehensive analysis framework to evaluate the architectural designs, breakthroughs, and opportunities of hybrid CNN-Transf/Attention models in MIA.
  • results: The paper provides a systematic review of existing hybrid CNN-Transf/Attention models, and discusses their strengths and limitations in terms of generalization ability and clinical impact.
    Abstract Medical imaging is a key component in clinical diagnosis, treatment planning and clinical trial design, accounting for almost 90% of all healthcare data. CNNs achieved performance gains in medical image analysis (MIA) over the last years. CNNs can efficiently model local pixel interactions and be trained on small-scale MI data. The main disadvantage of typical CNN models is that they ignore global pixel relationships within images, which limits their generalisation ability to understand out-of-distribution data with different 'global' information. The recent progress of Artificial Intelligence gave rise to Transformers, which can learn global relationships from data. However, full Transformer models need to be trained on large-scale data and involve tremendous computational complexity. Attention and Transformer compartments (Transf/Attention) which can well maintain properties for modelling global relationships, have been proposed as lighter alternatives of full Transformers. Recently, there is an increasing trend to co-pollinate complementary local-global properties from CNN and Transf/Attention architectures, which led to a new era of hybrid models. The past years have witnessed substantial growth in hybrid CNN-Transf/Attention models across diverse MIA problems. In this systematic review, we survey existing hybrid CNN-Transf/Attention models, review and unravel key architectural designs, analyse breakthroughs, and evaluate current and future opportunities as well as challenges. We also introduced a comprehensive analysis framework on generalisation opportunities of scientific and clinical impact, based on which new data-driven domain generalisation and adaptation methods can be stimulated.
    摘要 医疗影像是诊断、治疗规划和临床试验设计中的关键组成部分,占健康保健数据的大约90%。过去几年,深度学习(CNN)在医疗影像分析(MIA)中获得了性能提升。CNN可以高效地模型影像中的局部像素互动,并可以在小规模的MI数据上进行训练。然而,典型的CNN模型忽略了影像中的全局像素关系,这限制了它们的泛化能力,不能理解不同的全局信息。随着人工智能的发展,转换器(Transformers)在数据中学习全局关系的能力得到了提升。然而,全Transformers模型需要大规模的训练数据和巨大的计算复杂度。为了维护模型的全局性和可扩展性,人们提出了Attention和Transformers组件(Transf/Attention)。最近几年, hybrid CNN-Transf/Attention模型在多个MIA问题上得到了广泛应用。在这篇系统评影卷中,我们对现有的hybrid CNN-Transf/Attention模型进行了抽样、回顾和分析,并评估了这些模型的当前和未来的机遇和挑战。此外,我们还提出了一种全面的分析框架,以便根据这些模型的泛化机会,推动数据驱动的领域泛化和适应方法的发展。

Detecting disturbances in network-coupled dynamical systems with machine learning

  • paper_url: http://arxiv.org/abs/2307.12771
  • repo_url: None
  • paper_authors: Per Sebastian Skardal, Juan G. Restrepo
  • for: identifying disturbances in network-coupled dynamical systems without knowledge of the disturbances or underlying dynamics
  • methods: model-free method based on machine learning using prior observations of the system when forced by a known training function
  • results: able to identify the locations and properties of many different types of unknown disturbances using a variety of known forcing functions, both with linear and nonlinear disturbances using food web and neuronal activity models.
    Abstract Identifying disturbances in network-coupled dynamical systems without knowledge of the disturbances or underlying dynamics is a problem with a wide range of applications. For example, one might want to know which nodes in the network are being disturbed and identify the type of disturbance. Here we present a model-free method based on machine learning to identify such unknown disturbances based only on prior observations of the system when forced by a known training function. We find that this method is able to identify the locations and properties of many different types of unknown disturbances using a variety of known forcing functions. We illustrate our results both with linear and nonlinear disturbances using food web and neuronal activity models. Finally, we discuss how to scale our method to large networks.
    摘要 在不了解干扰或底层动力学的情况下,识别网络耦合动力系统中的干扰是一个具有广泛应用的问题。例如,我们可能想知道网络中哪些节点受到干扰,并识别干扰的类型。在这里,我们提出了一种基于机器学习的无模型方法,仅基于系统在已知训练强制函数驱动下的先前观测来识别这类未知干扰。我们发现这种方法能够使用多种已知强制函数识别多种不同类型未知干扰的位置和性质。我们使用食物网和神经活动模型来展示线性和非线性干扰下的结果。最后,我们讨论如何将我们的方法扩展到大型网络。
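The model-free recipe described above (learn, from a period in which the forcing is known, a map from observed trajectories to the per-node forcing, then apply it when an unknown disturbance acts) can be sketched with a ridge-regression readout in place of the paper's learning machinery. The dynamics, training forcing, and readout below are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_nodes, dt, n_steps = 10, 0.01, 5000

# Coupled linear dynamics dx/dt = A x + u(t); A is unknown to the method.
A = -np.eye(n_nodes) + 0.05 * rng.standard_normal((n_nodes, n_nodes))

def simulate(u_fn):
    X = np.zeros((n_steps + 1, n_nodes))
    U = np.zeros((n_steps, n_nodes))
    for t in range(n_steps):
        U[t] = u_fn(t * dt)
        X[t + 1] = X[t] + dt * (A @ X[t] + U[t])            # Euler step
    return X, U

# Training phase: force the system with a known function on every node.
train_u = lambda t: np.sin(2 * np.pi * (0.5 + np.arange(n_nodes)) * t)
X_tr, U_tr = simulate(train_u)
feats_tr = np.hstack([X_tr[:-1], X_tr[1:]])                 # consecutive states as features
model = Ridge(alpha=1e-6).fit(feats_tr, U_tr)               # learn states -> forcing

# Test phase: an unknown localized disturbance acts on node 3 only.
test_u = lambda t: np.eye(n_nodes)[3] * 0.8 * np.sin(7.0 * t)
X_te, _ = simulate(test_u)
u_hat = model.predict(np.hstack([X_te[:-1], X_te[1:]]))
print("per-node RMS of inferred disturbance:", np.sqrt((u_hat ** 2).mean(axis=0)).round(3))
```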

Nonparametric Linear Feature Learning in Regression Through Regularisation

  • paper_url: http://arxiv.org/abs/2307.12754
  • repo_url: https://github.com/bertillefollain/regfeal
  • paper_authors: Bertille Follain, Umut Simsekli, Francis Bach
  • for: 这个论文的目的是提出一种新的非 Parametric 特征选择方法,用于在高维数据中进行预测、计算和解释。
  • methods: 该方法使用经验风险最小化(Empirical Risk Minimization),并添加了对函数导数的 penalty term 以保证方法的 versatility。在使用 Hermite 多项式时,我们引入了一个新的估计器,名为 RegFeaL。
  • results: 我们的实验结果表明,RegFeaL 可以在各种实验中达到高效的预测性和精度。此外,我们还提供了一些实验结果,证明了我们的方法的可靠性和稳定性。
    Abstract Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternative minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments.
    摘要 学习表示在自动选择特征中扮演着关键角色,尤其在高维数据的情况下。在这种情况下,非 Parametric 方法经常陷入困难。在这项研究中,我们关注supervised学习场景,其中相关信息归结于数据中的一个低维线性子空间,即多指标模型。如果这个子空间知道,那么预测、计算和解释都将得到极大提高。为解决这个挑战,我们提出了一种新的方法,即linear feature学习方法,该方法同时估算预测函数和线性子空间。我们的方法使用empirical risk minimization,加上函数导数的罚函数,以确保多样性。通过 Hermite polynomials 的正交性和旋转不变性,我们引入了我们的估计器,名为RegFeaL。通过alternative minimization,我们可以逐步旋转数据,以便更好地与主要方向align,并准确地估算实际情况中的相关维度。我们证明了我们的方法可以得到一个consistent的预测函数估算器,并且提供了explicit rates。此外,我们还提供了许多实际 экспериментов的结果,以证明 RegFeaL 的性能。

Concept-based explainability for an EEG transformer model

  • paper_url: http://arxiv.org/abs/2307.12745
  • repo_url: https://github.com/andersgmadsen/tcav-bendr
  • paper_authors: Anders Gjølbye Madsen, William Theodor Lehn-Schiøler, Áshildur Jónsdóttir, Bergdís Arnardóttir, Lars Kai Hansen
  • for: 这个论文的目的是解释深度学习模型内部的状态,以便更好地理解它们如何处理数据。
  • methods: 该论文使用了Concept Activation Vectors(CAVs)方法来解释深度学习模型。CAVs是基于人类可理解的概念的方法,通过利用欧几何分布来定义内部状态。
  • results: 研究人员通过使用外部标注的EEG数据集和基于生物学结构的概念来定义概念,并证明这两种方法都可以提供深度EEG模型学习的有价值信息。
    Abstract Deep learning models are complex due to their size, structure, and inherent randomness in training procedures. Additional complexity arises from the selection of datasets and inductive biases. Addressing these challenges for explainability, Kim et al. (2018) introduced Concept Activation Vectors (CAVs), which aim to understand deep models' internal states in terms of human-aligned concepts. These concepts correspond to directions in latent space, identified using linear discriminants. Although this method was first applied to image classification, it was later adapted to other domains, including natural language processing. In this work, we attempt to apply the method to electroencephalogram (EEG) data for explainability in Kostas et al.'s BENDR (2021), a large-scale transformer model. A crucial part of this endeavor involves defining the explanatory concepts and selecting relevant datasets to ground concepts in the latent space. Our focus is on two mechanisms for EEG concept formation: the use of externally labeled EEG datasets, and the application of anatomically defined concepts. The former approach is a straightforward generalization of methods used in image classification, while the latter is novel and specific to EEG. We present evidence that both approaches to concept formation yield valuable insights into the representations learned by deep EEG models.
    摘要 深度学习模型因其大小、结构和训练过程中的随机性而复杂。这些复杂性来自数据选择和归纳偏置(inductive bias)。为了解释这些复杂性,金等人(2018)提出了概念活化向量(CAV),该方法通过将深度模型内部状态转化为人类可理解的概念来解释深度模型的行为。这些概念与 latent space 中的方向相对应,通过使用线性投影来确定。这种方法最初应用于图像分类任务,后来扩展到其他领域,包括自然语言处理。在这项工作中,我们尝试将该方法应用于 Kostas 等人(2021)的 BENDR 模型,这是一个大规模的 transformer 模型。我们的注重点在于选择合适的解释概念和使用外部标注的 EEG 数据集来固定概念。前一种方法是一种直观的推广,而后一种方法是特定于 EEG 的新领域。我们显示两种方法都可以为深度 EEG 模型学习的表征提供有价值的解释。
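The CAV recipe itself is simple once layer activations are available: fit a linear classifier separating concept examples from random examples in activation space and take the unit-normalised normal of its decision boundary; a TCAV-style score then counts how often class-logit gradients point along that direction. The sketch below follows this standard recipe on placeholder activations; the EEG model, concepts, and data from the paper are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_concept, acts_random):
    """CAV = unit normal of a linear boundary separating concept vs random
    activations at a chosen layer."""
    X = np.vstack([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

def tcav_score(grads, cav):
    """Fraction of inputs whose class-logit gradient (w.r.t. the layer
    activations) has a positive directional derivative along the CAV."""
    return float(np.mean(grads @ cav > 0))

# Placeholder activations/gradients standing in for one layer of the EEG model.
rng = np.random.default_rng(0)
d = 128
acts_concept = rng.normal(0.5, 1.0, size=(200, d))   # e.g. "alpha rhythm" segments
acts_random  = rng.normal(0.0, 1.0, size=(200, d))
grads        = rng.normal(0.1, 1.0, size=(300, d))   # d(logit)/d(activation) per input

cav = concept_activation_vector(acts_concept, acts_random)
print("TCAV score:", tcav_score(grads, cav))
```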

Sparse-firing regularization methods for spiking neural networks with time-to-first spike coding

  • paper_url: http://arxiv.org/abs/2307.13007
  • repo_url: None
  • paper_authors: Yusuke Sakemi, Kakei Yamamoto, Takeo Hosomi, Kazuyuki Aihara
  • for: 这种研究旨在提高多层脉冲神经网络(SNN)的训练效果,特别是使用错误反射算法来实现理想的时间编码。
  • methods: 这种方法使用时间到初始脉冲(TTFS)编码,每个神经元只能发射一次,这种限制使得信息可以在非常低的脉冲频率下处理。
  • results: 通过两种基于脉冲时间的稀发(SSR)规范方法来进一步降低TTFS-编码SNNs的脉冲频率,并在MNIST、Fashion-MNIST和CIFAR-10 datasets上使用多层感知器网络和卷积神经网络结构进行研究。
    Abstract The training of multilayer spiking neural networks (SNNs) using the error backpropagation algorithm has made significant progress in recent years. Among the various training schemes, the error backpropagation method that directly uses the firing time of neurons has attracted considerable attention because it can realize ideal temporal coding. This method uses time-to-first spike (TTFS) coding, in which each neuron fires at most once, and this restriction on the number of firings enables information to be processed at a very low firing frequency. This low firing frequency increases the energy efficiency of information processing in SNNs, which is important not only because of its similarity with information processing in the brain, but also from an engineering point of view. However, only an upper limit has been provided for TTFS-coded SNNs, and the information-processing capability of SNNs at lower firing frequencies has not been fully investigated. In this paper, we propose two spike timing-based sparse-firing (SSR) regularization methods to further reduce the firing frequency of TTFS-coded SNNs. The first is the membrane potential-aware SSR (M-SSR) method, which has been derived as an extreme form of the loss function of the membrane potential value. The second is the firing condition-aware SSR (F-SSR) method, which is a regularization function obtained from the firing conditions. Both methods are characterized by the fact that they only require information about the firing timing and associated weights. The effects of these regularization methods were investigated on the MNIST, Fashion-MNIST, and CIFAR-10 datasets using multilayer perceptron networks and convolutional neural network structures.
    摘要 多层脉冲神经网络(SNN)的训练使用错误归散算法在过去几年来有了 significiant progress。多种训练方案中,使用神经元发射时间的错误归散方法吸引了较大的关注,因为它可以实现理想的时间编码。这种方法使用时间到第一脉冲(TTFS)编码,每个神经元只能发射一次,这种限制神经元发射数量使得信息可以在非常低的发射频率下处理。这种低发射频率提高了SNNs中信息处理的能效性,这不仅与神经元处理信息的方式相似,还从工程角度来看是非常重要。然而,只有提供了TTFS编码SNNs的Upper bound,它们在lower firing frequency下的信息处理能力还未得到了全面的研究。在这篇论文中,我们提出了两种基于发射时间的稀发射(SSR)规范,以进一步降低TTFS编码SNNs的发射频率。第一种是膜电压意识SSR(M-SSR)方法,它是膜电压值的极限形式的损失函数。第二种是发射条件意识SSR(F-SSR)方法,它是基于发射条件获得的规范函数。两种方法都是基于发射时间和相关权重的信息。我们在MNIST、Fashion-MNIST和CIFAR-10 datasets上使用多层报告网络和卷积神经网络结构来研究这两种规范的效果。

Safety Performance of Neural Networks in the Presence of Covariate Shift

  • paper_url: http://arxiv.org/abs/2307.12716
  • repo_url: None
  • paper_authors: Chih-Hong Cheng, Harald Ruess, Konstantinos Theodorou
  • for: This paper aims to address the issue of covariate shift’s impact on the operational safety performance of neural networks, and proposes a method to reshape the initial test set based on an approximation of the operational data.
  • methods: The proposed method uses finite binning and static dataflow analysis to derive conservative bounds on the values of neurons, and formulates a mixed integer linear programming (MILP) constraint to construct the minimum set of data points to be removed in the test set.
  • results: The proposed method can re-evaluate the safety performance of neural networks in the presence of covariate shift by using the reshaped test set, and can potentially reduce the need for collecting new operational data and creating corresponding ground truth labels.
    Abstract Covariate shift may impact the operational safety performance of neural networks. A re-evaluation of the safety performance, however, requires collecting new operational data and creating corresponding ground truth labels, which often is not possible during operation. We are therefore proposing to reshape the initial test set, as used for the safety performance evaluation prior to deployment, based on an approximation of the operational data. This approximation is obtained by observing and learning the distribution of activation patterns of neurons in the network during operation. The reshaped test set reflects the distribution of neuron activation values as observed during operation, and may therefore be used for re-evaluating safety performance in the presence of covariate shift. First, we derive conservative bounds on the values of neurons by applying finite binning and static dataflow analysis. Second, we formulate a mixed integer linear programming (MILP) constraint for constructing the minimum set of data points to be removed in the test set, such that the difference between the discretized test and operational distributions is bounded. We discuss potential benefits and limitations of this constraint-based approach based on our initial experience with an implemented research prototype.
    摘要 covariate shift可能会影响神经网络的操作安全性表现。然而,为了重新评估安全性表现,通常需要收集新的操作数据并创建相应的地面真实标签,这并不是在运行时可行。我们因此提议将初始测试集重新分配,基于运行时神经网络活动 patrerns的approximation。这种approximation可以通过观察和学习神经网络在运行时的活动模式来获得。重新分配的测试集尝试反映了在运行时神经网络活动值的分布,可以用于重新评估安全性表现在covariate shift的情况下。首先,我们通过finite binning和静态数据流分析来 derive保守的神经元值 bounds。其次,我们将mix integer linear programming(MILP)约束构造最小的数据点删除集,以使得测试集和运行时分布之间的差异保持在bound。我们对实际研究版本中的初步体验提出了可能的优点和限制。

Policy Gradient Optimal Correlation Search for Variance Reduction in Monte Carlo simulation and Maximum Optimal Transport

  • paper_url: http://arxiv.org/abs/2307.12703
  • repo_url: None
  • paper_authors: Pierre Bras, Gilles Pagès
  • for: 估计 $f(X_T)$ 的方法,其中 $X$ 是某个随机微分方程的解,$f$ 是测试函数。
  • methods: 使用 $(f(X^1_T) + f(X^2_T))/2$ 作为新的估计器,其中 $X^1$ 和 $X^2$ 具有同样的分布,但其路径相关,以降低方差。采用深度神经网络来逼近最优相关函数 $\rho$,并使用策略梯度和强化学习技术来校准 $\rho$。
  • results: 通过策略梯度和强化学习技术准确地校准最优相关函数 $\rho$,实现了降低方差的目标。
    Abstract We propose a new algorithm for variance reduction when estimating $f(X_T)$ where $X$ is the solution to some stochastic differential equation and $f$ is a test function. The new estimator is $(f(X^1_T) + f(X^2_T))/2$, where $X^1$ and $X^2$ have same marginal law as $X$ but are pathwise correlated so that to reduce the variance. The optimal correlation function $\rho$ is approximated by a deep neural network and is calibrated along the trajectories of $(X^1, X^2)$ by policy gradient and reinforcement learning techniques. Finding an optimal coupling given marginal laws has links with maximum optimal transport.
    摘要 我们提出了一种新的算法,用于在估计 $f(X_T)$ 时降低方差,其中 $X$ 是某个随机微分方程的解,$f$ 是测试函数。新的估计器是 $(f(X^1_T) + f(X^2_T))/2$,其中 $X^1$ 和 $X^2$ 具有相同的边缘分布,但路径相关,以降低方差。最优相关函数 $\rho$ 由深度神经网络近似,并通过策略梯度和强化学习技术沿 $(X^1, X^2)$ 的轨迹进行校准。在给定边缘分布下寻找最优耦合的问题与最大最优运输相关。
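A small numerical sketch of the paired-path estimator. It uses the fixed antithetic choice (the second path is driven by the negated Brownian increments, i.e. a constant pathwise correlation of −1) as the simplest member of the family of correlated pairs; the paper instead learns an optimal correlation function ρ with a deep network trained by policy gradient. The SDE, payoff, and parameters below are toy assumptions.

```python
# Sketch: variance reduction with pairwise-correlated paths, using dW^2 = -dW^1
# as a stand-in for the learned correlation rho.
# SDE: dX = mu*X dt + sigma*X dW (geometric Brownian motion), f = call payoff.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, x0, T, K = 0.05, 0.4, 1.0, 1.0, 1.0
n_steps, n_pairs = 100, 20_000
dt = T / n_steps

def euler_paths(dW):
    """Euler-Maruyama terminal values X_T for a batch of Brownian increment paths."""
    x = np.full(dW.shape[0], x0)
    for k in range(n_steps):
        x = x + mu * x * dt + sigma * x * dW[:, k]
    return x

f = lambda x: np.maximum(x - K, 0.0)                       # test function (call payoff)

dW = rng.normal(0.0, np.sqrt(dt), size=(n_pairs, n_steps))
dW_indep = rng.normal(0.0, np.sqrt(dt), size=(n_pairs, n_steps))

# Baseline: two independent copies of X.
est_indep = 0.5 * (f(euler_paths(dW)) + f(euler_paths(dW_indep)))
# Correlated pair: second path driven by the negated increments (rho = -1).
est_anti = 0.5 * (f(euler_paths(dW)) + f(euler_paths(-dW)))

print("mean (independent):", est_indep.mean(), " variance:", est_indep.var())
print("mean (antithetic): ", est_anti.mean(),  " variance:", est_anti.var())
```

Both pairings leave the estimator unbiased (each path has the marginal law of X); only the variance changes, which is exactly the quantity the learned ρ is trained to minimise.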

MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

  • paper_url: http://arxiv.org/abs/2307.12698
  • repo_url: None
  • paper_authors: Adrien Bardes, Jean Ponce, Yann LeCun
  • for: 本研究旨在jointly学习视觉表示和流体动作 estimation,并证明这两个目标互相帮助进行学习,从而学习出包含运动信息的内容特征。
  • methods: 本研究提出了MC-JEPA模型,即共同嵌入预测建模和自然学习方法,在共同编码器中同时学习视觉表示和流体动作 estimation。
  • results: 实验结果表明,MC-JEPA模型可以在无监督的情况下实现视觉表示和流体动作 estimation的同时学习,并且在下游任务中,如图像和视频Semantic segmentation等任务中,可以达到与现有无监督光流估计 benchmark和常见自然学习方法相当的性能。
    Abstract Self-supervised learning of visual representations has been focusing on learning content features, which do not capture object motion or location, and focus on identifying and differentiating objects in images and videos. On the other hand, optical flow estimation is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach to jointly learn optical flow and content features within a shared encoder, demonstrating that the two associated objectives; the optical flow estimation objective and the self-supervised learning objective; benefit from each other and thus learn content features that incorporate motion information. The proposed approach achieves performance on-par with existing unsupervised optical flow benchmarks, as well as with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.
    摘要 自适应学习视觉表示法中心在学习内容特征,这些特征不包括物体运动或位置信息,而是通过识别和区分图像和视频中的对象来学习。相反,光流估计是一个不需要理解图像内容的任务。我们将这两种方法结合起来,并介绍MC-JEPA,一种共享编码器中的共同预测建筑和自适应学习方法,以jointly学习光流和内容特征。我们发现这两个相关的目标,即光流估计目标和自适应学习目标,在共同学习中互相帮助,因此学习的内容特征包含运动信息。我们的方法可以与现有的无监督光流标准做比较,以及常见的自适应学习方法在图像和视频Semantic segmentation任务上的性能。

Addressing the Impact of Localized Training Data in Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12689
  • repo_url: https://github.com/akanshaaga/reg_appnp
  • paper_authors: Singh Akansha
  • for: 本研究旨在评估图神经网络(GNNs)在本地化训练数据下的性能。
  • methods: 我们提出了一种常见GNN模型的补做方法,以适应本地化训练数据下的挑战。
  • results: 我们在三个标准图神经网络 benchmark dataset上进行了广泛的测试,并得到了显著的性能提升。
    Abstract Graph Neural Networks (GNNs) have achieved notable success in learning from graph-structured data, owing to their ability to capture intricate dependencies and relationships between nodes. They excel in various applications, including semi-supervised node classification, link prediction, and graph generation. However, it is important to acknowledge that the majority of state-of-the-art GNN models are built upon the assumption of an in-distribution setting, which hinders their performance on real-world graphs with dynamic structures. In this article, we aim to assess the impact of training GNNs on localized subsets of the graph. Such restricted training data may lead to a model that performs well in the specific region it was trained on but fails to generalize and make accurate predictions for the entire graph. In the context of graph-based semi-supervised learning (SSL), resource constraints often lead to scenarios where the dataset is large, but only a portion of it can be labeled, affecting the model's performance. This limitation affects tasks like anomaly detection or spam detection when labeling processes are biased or influenced by human subjectivity. To tackle the challenges posed by localized training data, we approach the problem as an out-of-distribution (OOD) data issue by by aligning the distributions between the training data, which represents a small portion of labeled data, and the graph inference process that involves making predictions for the entire graph. We propose a regularization method to minimize distributional discrepancies between localized training data and graph inference, improving model performance on OOD data. Extensive tests on popular GNN models show significant performance improvement on three citation GNN benchmark datasets. The regularization approach effectively enhances model adaptation and generalization, overcoming challenges posed by OOD data.
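A minimal sketch of the alignment idea: penalise the distributional discrepancy between the embeddings of the (localized) labeled nodes and the embeddings of all graph nodes. Maximum mean discrepancy (MMD) is used here only as one common choice of discrepancy; the paper's actual regularizer and the `gnn`, `classifier`, `train_idx` names in the usage comment are assumptions (see the linked repository for the authors' implementation).

```python
# Sketch: an RBF-kernel MMD penalty between labeled-node embeddings and
# all-node embeddings, added to the usual node-classification loss.
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD between samples x and y with an RBF kernel."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2.0 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Usage inside a training step (h = node embeddings from any GNN encoder):
#   h = gnn(features, edge_index)                      # [num_nodes, d]
#   loss = cross_entropy(classifier(h[train_idx]), labels[train_idx]) \
#        + lam * rbf_mmd2(h[train_idx], h)             # align local training dist. with full graph
```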

An Estimator for the Sensitivity to Perturbations of Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12679
  • repo_url: None
  • paper_authors: Naman Maheshwari, Nicholas Malaya, Scott Moe, Jaydeep P. Kulkarni, Sudhanva Gurumurthi
  • for: 这篇论文的目的是为了评估深度神经网络(DNNs)在安全关键应用中的稳定性,如自动驾驶车和疾病诊断。
  • methods: 这篇论文使用了一种能够预测DNN对输入和模型参数的敏感性的估计器。该估计器基于不等式和矩阵范数,其结果类似于神经网络的condition number。
  • results: 在测试了AlexNet和VGG-19 convolutional neural networks(CNNs)以及ImageNet dataset时,这种估计器能够准确地预测DNN对输入和模型参数的敏感性。此外,通过随机偏移和攻击测试,这种估计器的紧密性也得到了证明。
    Abstract For Deep Neural Networks (DNNs) to become useful in safety-critical applications, such as self-driving cars and disease diagnosis, they must be stable to perturbations in input and model parameters. Characterizing the sensitivity of a DNN to perturbations is necessary to determine minimal bit-width precision that may be used to safely represent the network. However, no general result exists that is capable of predicting the sensitivity of a given DNN to round-off error, noise, or other perturbations in input. This paper derives an estimator that can predict such quantities. The estimator is derived via inequalities and matrix norms, and the resulting quantity is roughly analogous to a condition number for the entire neural network. An approximation of the estimator is tested on two Convolutional Neural Networks, AlexNet and VGG-19, using the ImageNet dataset. For each of these networks, the tightness of the estimator is explored via random perturbations and adversarial attacks.
    摘要 ( Deep Neural Networks 必须在安全关键应用中稳定,如自动驾驶车和疾病诊断。因此,必须了解 DNN 对输入和模型参数的敏感度,以确定安全地表示网络所需的最小位数准确性。然而,没有一个通用的结果可以预测给定 DNN 对轮减错误、噪声或其他输入中的敏感度。这篇文章提出了一个估计器,可以预测这些量。估计器是通过不等式和矩阵范数 derive,其结果类似于整个神经网络的condition number。这个估计器的紧密性在使用 ImageNet 数据集上对 AlexNet 和 VGG-19 两个卷积神经网络进行随机扰动和攻击性测试中被探索。)
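Two related, standard sensitivity proxies for a small network are sketched below: a product-of-spectral-norms upper bound on the input-output Lipschitz constant, and an empirical finite-difference sensitivity under random input perturbations. These are not the estimator derived in the paper (which is built from inequalities and matrix norms over the whole network), only an illustration of the same condition-number-like idea; the toy MLP and perturbation scale are assumptions.

```python
# Sketch: two simple sensitivity proxies for a feed-forward network.
#   (1) product of layer spectral norms -> upper bound on the Lipschitz constant,
#   (2) empirical output change under small random input perturbations.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# (1) Norm-based bound (ReLU is 1-Lipschitz, so it does not enter the product).
lipschitz_bound = 1.0
for m in model:
    if isinstance(m, nn.Linear):
        lipschitz_bound *= torch.linalg.matrix_norm(m.weight, ord=2).item()

# (2) Empirical sensitivity: ||f(x + d) - f(x)|| / ||d|| over small random directions d.
x = torch.randn(256, 64)
eps = 1e-3
d = torch.randn_like(x)
d = eps * d / d.norm(dim=1, keepdim=True)
with torch.no_grad():
    ratio = (model(x + d) - model(x)).norm(dim=1) / d.norm(dim=1)

print(f"spectral-norm bound: {lipschitz_bound:.2f}")
print(f"empirical sensitivity: mean {ratio.mean():.2f}, max {ratio.max():.2f}")
```

The gap between the bound and the empirical ratios illustrates why a tighter, network-wide estimator (as proposed in the paper) is useful when deciding minimal bit-width precision.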

Global k-Space Interpolation for Dynamic MRI Reconstruction using Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2307.12672
  • repo_url: None
  • paper_authors: Jiazhen Pan, Suprosanna Shit, Özgün Turgut, Wenqi Huang, Hongwei Bran Li, Nil Stolt-Ansó, Thomas Küstner, Kerstin Hammernik, Daniel Rueckert
  • for: 这篇论文的目的是为了提高动力磁共振成像(MRI)中的数据探测率,以解决因时间限制而导致的抽象项目残影。
  • methods: 本文使用的方法是将受测空间探测短缺的数据进行插值,并且使用一个新的Transformer-based k-space Global Interpolation Network(k-GIN)来学习全球的低频和高频成像结构。此外,我们还提出了一个k-space Iterative Refinement Module(k-IRM)来强化高频成像的学习。
  • results: 我们的方法与基准方法相比,在92个内部2D+t心脏MRI试验中表现出了优化的成像质量和更高的类别化能力。特别是在具有高度受测空间探测短缺的情况下,我们的方法具有更高的类别化能力和普遍性。
    Abstract In dynamic Magnetic Resonance Imaging (MRI), k-space is typically undersampled due to limited scan time, resulting in aliasing artifacts in the image domain. Hence, dynamic MR reconstruction requires not only modeling spatial frequency components in the x and y directions of k-space but also considering temporal redundancy. Most previous works rely on image-domain regularizers (priors) to conduct MR reconstruction. In contrast, we focus on interpolating the undersampled k-space before obtaining images with Fourier transform. In this work, we connect masked image modeling with k-space interpolation and propose a novel Transformer-based k-space Global Interpolation Network, termed k-GIN. Our k-GIN learns global dependencies among low- and high-frequency components of 2D+t k-space and uses it to interpolate unsampled data. Further, we propose a novel k-space Iterative Refinement Module (k-IRM) to enhance the high-frequency components learning. We evaluate our approach on 92 in-house 2D+t cardiac MR subjects and compare it to MR reconstruction methods with image-domain regularizers. Experiments show that our proposed k-space interpolation method quantitatively and qualitatively outperforms baseline methods. Importantly, the proposed approach achieves substantially higher robustness and generalizability in cases of highly-undersampled MR data.
    摘要 在动态磁共振成像(MRI)中,通常因为扫描时间有限,会导致卷积空间下折射样本受到假象 artifacts。因此,动态MR重建需要不仅考虑 x 和 y 方向的空间频率组件,还需要考虑时间重复性。大多数前一些工作都是通过图像领域的正则化(约束)来进行MR重建。相比之下,我们注意到 interpolating 未折射的卷积空间,并提出了一种基于 Transformer 的全域卷积global interpolation network,称之为 k-GIN。我们的 k-GIN 学习了 2D+t 卷积空间中低频和高频组件之间的全局依赖关系,并使用其来 interpolate 未折射数据。此外,我们还提出了一种 k-space 迭代优化模块(k-IRM),以提高高频组件的学习。我们对 92 个室内 2D+t 心脏 MRI 测试数据进行了评估,并与使用图像领域正则化的 MR 重建方法进行比较。实验表明,我们提出的方法在量化和质量上都有显著提高,并且在高度受折射影响的 MR 数据中具有更高的robustness和普适性。

Control and Monitoring of Artificial Intelligence Algorithms

  • paper_url: http://arxiv.org/abs/2307.13705
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Carlos Mario Braga Ortuño, Blanza Martinez Donoso, Belén Muñiz Villanueva
  • for: 这篇论文强调了在人工智能模型部署后进行监管和评估数据分布的变化。
  • methods: 文章介绍了数据漂移和概念漂移的概念,以及他们的基础分布。同时,文章还提出了一些用于评估模型性能对于时间变化的指标。
  • results: 文章通过介绍不同的指标和方法,探讨了模型在不同情况下的性能。
    Abstract This paper elucidates the importance of governing an artificial intelligence model post-deployment and overseeing potential fluctuations in the distribution of present data in contrast to the training data. The concepts of data drift and concept drift are explicated, along with their respective foundational distributions. Furthermore, a range of metrics is introduced, which can be utilized to scrutinize the model's performance concerning potential temporal variations.
    摘要 这篇论文强调了在人工智能模型部署后进行管理,并监测当前数据分布相对训练数据可能发生的变化。文中介绍了数据漂移和概念漂移的概念及其基础分布。此外,文中还提出了一些指标,可以用来评估模型在可能的时间变化情况下的性能。
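The article discusses metrics for monitoring drift after deployment; as an illustration, the sketch below computes two widely used checks for covariate (data) drift on a single feature, the population stability index (PSI) and the two-sample Kolmogorov–Smirnov test. These specific metrics, thresholds, and the synthetic data are assumptions and may not match the exact metric list in the paper.

```python
# Sketch: two common drift checks comparing the training-time distribution of a
# feature with data observed in operation.
import numpy as np
from scipy.stats import ks_2samp

def psi(train: np.ndarray, live: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index using quantile bins fitted on the training data."""
    edges = np.quantile(train, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p = np.histogram(train, bins=edges)[0] / len(train)
    q = np.histogram(live,  bins=edges)[0] / len(live)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)   # avoid log(0)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
live = rng.normal(0.3, 1.2, 10_000)          # shifted and rescaled operational data

print("PSI:", round(psi(train, live), 3))    # values above ~0.2 are often treated as drift
res = ks_2samp(train, live)
print("KS statistic:", res.statistic, " p-value:", res.pvalue)
```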

TransFusion: Generating Long, High Fidelity Time Series using Diffusion Models with Transformers

  • paper_url: http://arxiv.org/abs/2307.12667
  • repo_url: https://github.com/fahim-sikder/TransFusion
  • paper_authors: Md Fahim Sikder, Resmi Ramachandranpillai, Fredrik Heintz
  • for: 本研究旨在生成高质量、长序时间序数据,应用广泛。
  • methods: 我们提出了一种基于协同扩散和变换器的生成模型,称为TransFusion。
  • results: TransFusion可以生成高质量的长序时间序数据,并且在许多视觉和实验性指标上表现出优于之前的状态 искусственный智能。
    Abstract The generation of high-quality, long-sequenced time-series data is essential due to its wide range of applications. In the past, standalone Recurrent and Convolutional Neural Network-based Generative Adversarial Networks (GAN) were used to synthesize time-series data. However, they are inadequate for generating long sequences of time-series data due to limitations in the architecture. Furthermore, GANs are well known for their training instability and mode collapse problem. To address this, we propose TransFusion, a diffusion, and transformers-based generative model to generate high-quality long-sequence time-series data. We have stretched the sequence length to 384, and generated high-quality synthetic data. To the best of our knowledge, this is the first study that has been done with this long-sequence length. Also, we introduce two evaluation metrics to evaluate the quality of the synthetic data as well as its predictive characteristics. We evaluate TransFusion with a wide variety of visual and empirical metrics, and TransFusion outperforms the previous state-of-the-art by a significant margin.
    摘要 “高质量、长序时间序数据的生成是非常重要,因为它具有广泛的应用领域。在过去,单独的循环神经网和卷积神经网基于的生成对抗网(GAN)被用来合成时间序数据。但是,它们因架构限制而无法生成长序时间序数据,并且GAN在训练时会出现不稳定和模式崩溃问题。为解决这问题,我们提出了TransFusion,一个扩散和卷积变数基于的生成模型,可以生成高质量的长序时间序数据。我们已经将序列长度延长到384,并生成了高质量的 sintetic数据。到目前为止,这是第一篇使用这长序长度的研究。此外,我们也引入了两个评估 metric 来评估生成的质量和预测特性。我们将TransFusion评估使用广泛的视觉和实验 metric,并证明TransFusion在前一代的state-of-the-art上出现著标准的差异。”
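As a small illustration of the diffusion side of TransFusion, the sketch below applies only the standard closed-form forward (noising) process of a DDPM to a toy long time series. The linear beta schedule, step count, and toy signal are assumptions; the Transformer denoiser and training loop of TransFusion itself are in the linked repository.

```python
# Sketch: the closed-form forward diffusion step on a time series,
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
import numpy as np

T_steps, seq_len = 1000, 384                 # 384 = sequence length used in the paper
betas = np.linspace(1e-4, 2e-2, T_steps)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0, 8 * np.pi, seq_len))          # toy long-sequence sample

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) for diffusion step t (0-indexed)."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

for t in (0, 250, 999):
    xt = q_sample(x0, t)
    print(f"t={t:4d}  signal scale={np.sqrt(alpha_bar[t]):.3f}  sample std={xt.std():.3f}")
```

A generative model such as TransFusion is then trained to reverse this process, predicting the injected noise from x_t and t.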

Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics

  • paper_url: http://arxiv.org/abs/2307.12660
  • repo_url: https://github.com/umbertomichieli/tap-slda
  • paper_authors: Umberto Michieli, Pablo Peso Parada, Mete Ozay
  • for: 本研究旨在提高附加在设备上的语音识别(KWS)模型的适应速度,使其能够快速适应用户定义的新词语,而不会忘记之前的词语。
  • methods: 本研究提出了一种名为Temporal Aware Pooling(TAP)的方法,它在采用冻结的后向传播模型(backbone)的基础上,通过计算高阶语音特征的时间相关特征空间来扩充特征空间。然后,对这个扩充后的特征空间进行Gaussian模型的更新,以便更好地利用语音表示。
  • results: 实验分析表明,TAP-SLDA方法在几个设置、后向传播模型和基础上都显示出了明显的优异性,相对于竞争者的平均提升率为11.3%。
    Abstract Keyword Spotting (KWS) models on embedded devices should adapt fast to new user-defined words without forgetting previous ones. Embedded devices have limited storage and computational resources, thus, they cannot save samples or update large models. We consider the setup of embedded online continual learning (EOCL), where KWS models with frozen backbone are trained to incrementally recognize new words from a non-repeated stream of samples, seen one at a time. To this end, we propose Temporal Aware Pooling (TAP) which constructs an enriched feature space computing high-order moments of speech features extracted by a pre-trained backbone. Our method, TAP-SLDA, updates a Gaussian model for each class on the enriched feature space to effectively use audio representations. In experimental analyses, TAP-SLDA outperforms competitors on several setups, backbones, and baselines, bringing a relative average gain of 11.3% on the GSC dataset.
    摘要 关键词检测(KWS)模型在嵌入式设备上应该快速适应新用户定义的词语,而不会忘记之前的词语。嵌入式设备具有有限的存储和计算资源,因此无法保存样本或更新大型模型。我们考虑了嵌入式在线持续学习(EOCL)的设置,其中带有冻结骨干网络的KWS模型被训练为从不重复的样本流中逐一、增量地识别新的词语。为此,我们提出了时间感知汇聚(TAP),它在预训练骨干网络提取的语音特征之上计算高阶矩,构建更丰富的特征空间。我们的方法TAP-SLDA在该扩充特征空间上为每个类别更新高斯模型,以有效利用语音表示。在实验分析中,TAP-SLDA在多种设置、骨干网络和基线上均优于竞争方法,在GSC数据集上取得了11.3%的相对平均提升。
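A minimal sketch of the pooling idea: collapse a variable-length sequence of frame-level backbone features into a fixed vector of per-dimension temporal statistics (mean, std, skewness, kurtosis), then maintain a streaming per-class mean on that enriched space. The exact set of moments and the full SLDA classifier (shared covariance, streaming updates) follow the paper and its repository; the feature sizes here are assumptions.

```python
# Sketch: temporal high-order-moment pooling of frozen-backbone features,
# followed by a streaming (SLDA-style) per-class mean update.
import numpy as np
from scipy.stats import kurtosis, skew

def temporal_moment_pooling(frames: np.ndarray) -> np.ndarray:
    """frames: [T, D] features from a frozen backbone -> pooled vector [4*D]."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        skew(frames, axis=0),
        kurtosis(frames, axis=0),
    ])

def update_class_mean(mean: np.ndarray, count: int, z: np.ndarray):
    """Running mean update for one class after observing pooled sample z."""
    count += 1
    mean = mean + (z - mean) / count
    return mean, count

frames = np.random.default_rng(0).normal(size=(98, 256))   # e.g. 98 frames of 256-d features
z = temporal_moment_pooling(frames)
print("pooled feature dimension:", z.shape[0])              # 4 * 256 = 1024
```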

Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation of rPPG

  • paper_url: http://arxiv.org/abs/2307.12644
  • repo_url: https://github.com/remotebiosensing/rppg
  • paper_authors: Dae-Yeol Kim, Eunsu Goh, KwangKee Lee, JongEui Chae, JongHyeon Mun, Junyeong Na, Chae-bong Sohn, Do-Yup Kim
  • for: This study provides a benchmarking framework for evaluating the performance of remote photoplethysmography (rPPG) techniques across a wide range of datasets, to ensure fair and meaningful comparison and progress in the field.
  • methods: The study evaluates both conventional non-deep neural network (non-DNN) and deep neural network (DNN) methods on a variety of datasets, providing a comprehensive benchmarking framework for rPPG techniques.
  • results: The study delivers a fair and evaluable benchmarking framework for rPPG techniques, addressing the challenges of skin color, camera characteristics, ambient lighting, and other sources of noise and artifacts.
    Abstract rPPG (Remote photoplethysmography) is a technology that measures and analyzes BVP (Blood Volume Pulse) by using the light absorption characteristics of hemoglobin captured through a camera. Analyzing the measured BVP can derive various physiological signals such as heart rate, stress level, and blood pressure, which can be applied to various applications such as telemedicine, remote patient monitoring, and early prediction of cardiovascular disease. rPPG is rapidly evolving and attracting great attention from both academia and industry by providing great usability and convenience as it can measure biosignals using a camera-equipped device without medical or wearable devices. Despite extensive efforts and advances in this field, serious challenges remain, including issues related to skin color, camera characteristics, ambient lighting, and other sources of noise and artifacts, which degrade accuracy performance. We argue that fair and evaluable benchmarking is urgently required to overcome these challenges and make meaningful progress from both academic and commercial perspectives. In most existing work, models are trained, tested, and validated only on limited datasets. Even worse, some studies lack available code or reproducibility, making it difficult to fairly evaluate and compare performance. Therefore, the purpose of this study is to provide a benchmarking framework to evaluate various rPPG techniques across a wide range of datasets for fair evaluation and comparison, including both conventional non-deep neural network (non-DNN) and deep neural network (DNN) methods. GitHub URL: https://github.com/remotebiosensing/rppg
    摘要 rPPG(远程光电容积描记)是一种利用摄像头捕捉血红蛋白的光吸收特性来测量和分析血容积脉搏(BVP)的技术。通过分析测得的BVP,可以推导出心率、压力水平和血压等多种生理信号,应用于远程医疗、远程患者监护和心血管疾病的早期预测等领域。rPPG无需医疗设备或可穿戴设备,仅凭配备摄像头的设备即可测量生物信号,具有很好的可用性和便利性,因此正在快速发展并受到学术界和产业界的广泛关注。然而,该领域仍存在严重的挑战,包括皮肤颜色、摄像头特性、环境光照以及其他噪声和伪影来源等问题,这些问题会降低精度。我们认为,公平且可评估的基准测试是迫切需要的,以克服这些挑战并在学术和商业层面取得有意义的进展。现有的大多数工作仅在有限的数据集上进行训练、测试和验证,有些研究甚至缺乏可用代码或可复现性,使得公平的评估和比较变得困难。因此,本研究的目的是提供一个基准测试框架,以在各种数据集上评估不同的rPPG技术,包括传统的非深度神经网络(non-DNN)方法和深度神经网络(DNN)方法。GitHub URL:https://github.com/remotebiosensing/rppg

Fake News Detection Through Graph-based Neural Networks: A Survey

  • paper_url: http://arxiv.org/abs/2307.12639
  • repo_url: None
  • paper_authors: Shuzhi Gong, Richard O. Sinnott, Jianzhong Qi, Cecile Paris
  • for: 本研究评估了基于图структуры的新闻假消息检测方法和深度学习方法,以及它们在新闻传播过程中的应用。
  • methods: 本研究分类了现有的图结构基于的新闻假消息检测方法,包括知识驱动方法、传播基于方法和多元社交 контекст基于方法。
  • results: 本研究评估了现有的图结构基于的新闻假消息检测方法,并提出了未来研究方向。
    Abstract The popularity of online social networks has enabled rapid dissemination of information. People now can share and consume information much more rapidly than ever before. However, low-quality and/or accidentally/deliberately fake information can also spread rapidly. This can lead to considerable and negative impacts on society. Identifying, labelling and debunking online misinformation as early as possible has become an increasingly urgent problem. Many methods have been proposed to detect fake news including many deep learning and graph-based approaches. In recent years, graph-based methods have yielded strong results, as they can closely model the social context and propagation process of online news. In this paper, we present a systematic review of fake news detection studies based on graph-based and deep learning-based techniques. We classify existing graph-based methods into knowledge-driven methods, propagation-based methods, and heterogeneous social context-based methods, depending on how a graph structure is constructed to model news related information flows. We further discuss the challenges and open problems in graph-based fake news detection and identify future research directions.
    摘要 在线社交网络的流行使得信息可以快速传播,人们分享和获取信息的速度前所未有。但是,低质量以及无意或故意伪造的信息也会快速扩散,对社会造成相当大的负面影响。因此,尽早识别、标注并驳斥网络虚假信息已成为一个日益紧迫的问题。目前已有许多检测假新闻的方法被提出,其中包括大量基于深度学习和图结构的方法。近年来,基于图的方法取得了很好的效果,因为它们能够较为精确地建模在线新闻的社交语境和传播过程。在这篇文章中,我们对基于图和深度学习的假新闻检测研究进行了系统性综述,按照图结构如何建模新闻相关信息流,将现有基于图的方法分为知识驱动方法、基于传播的方法和基于异质社交语境的方法,并讨论了基于图的假新闻检测所面临的挑战、尚未解决的问题以及未来研究方向。

Identifying drivers and mitigators for congestion and redispatch in the German electric power system with explainable AI

  • paper_url: http://arxiv.org/abs/2307.12636
  • repo_url: None
  • paper_authors: Maurizio Titz, Sebastian Pütz, Dirk Witthaut
  • for: 这篇论文旨在分析德国传输电网中的压力峰值和对负面影响,以及可能的市场设计变更以缓解压力峰值。
  • methods: 该论文使用可解释的机器学习模型来预测每小时的重新配置和对贸易量。模型分析了压力峰值的驱动因素和缓解因素,并评估了它们的影响。
  • results: 研究发现,风力电力生产是压力峰值的主要驱动因素,而水力电力和跨国电力贸易也扮演着重要的缓解作用。然而,太阳能电力没有缓解压力峰值的效果。结果表明,市场设计的变更可以缓解压力峰值。
    Abstract The transition to a sustainable energy supply challenges the operation of electric power systems in manifold ways. Transmission grid loads increase as wind and solar power are often installed far away from the consumers. In extreme cases, system operators must intervene via countertrading or redispatch to ensure grid stability. In this article, we provide a data-driven analysis of congestion in the German transmission grid. We develop an explainable machine learning model to predict the volume of redispatch and countertrade on an hourly basis. The model reveals factors that drive or mitigate grid congestion and quantifies their impact. We show that, as expected, wind power generation is the main driver, but hydropower and cross-border electricity trading also play an essential role. Solar power, on the other hand, has no mitigating effect. Our results suggest that a change to the market design would alleviate congestion.
    摘要 向可持续能源供应的转型在多个方面给电力系统的运行带来挑战。由于风电和光伏电站通常安装在远离用电负荷的地区,输电网络的负载随之增加。在极端情况下,系统运营商必须通过反向交易(countertrading)或再调度(redispatch)进行干预以确保电网稳定。在这篇文章中,我们对德国输电网的阻塞情况进行了数据驱动分析,并建立了一个可解释的机器学习模型,用于按小时预测再调度和反向交易的电量。该模型揭示了驱动或缓解电网阻塞的因素并量化其影响。结果显示,正如预期的那样,风电是主要驱动因素,但水电和跨境电力交易也起着重要作用;而光伏发电则没有缓解效果。我们的结果表明,调整市场设计可以缓解电网阻塞。

De-confounding Representation Learning for Counterfactual Inference on Continuous Treatment via Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2307.12625
  • repo_url: None
  • paper_authors: Yonghe Zhao, Qiang Huang, Haolong Zeng, Yun Pen, Huiyan Sun
  • for: This paper aims to address the problem of counterfactual inference for continuous treatment variables, which is more common in real-world causal inference tasks.
  • methods: The proposed method is called de-confounding representation learning (DRL), which generates representations of covariates that are disentangled from the treatment variables. The DRL model is a non-parametric model that eliminates both linear and nonlinear dependence between treatment and covariates.
  • results: The DRL model outperforms state-of-the-art counterfactual inference models for continuous treatment variables in extensive experiments on synthetic datasets. Additionally, the DRL model is applied to a real-world medical dataset MIMIC and demonstrates a detailed causal relationship between red cell width distribution and mortality.
    Abstract Counterfactual inference for continuous rather than binary treatment variables is more common in real-world causal inference tasks. While there are already some sample reweighting methods based on Marginal Structural Model for eliminating the confounding bias, they generally focus on removing the treatment's linear dependence on confounders and rely on the accuracy of the assumed parametric models, which are usually unverifiable. In this paper, we propose a de-confounding representation learning (DRL) framework for counterfactual outcome estimation of continuous treatment by generating the representations of covariates disentangled with the treatment variables. The DRL is a non-parametric model that eliminates both linear and nonlinear dependence between treatment and covariates. Specifically, we train the correlations between the de-confounded representations and the treatment variables against the correlations between the covariate representations and the treatment variables to eliminate confounding bias. Further, a counterfactual inference network is embedded into the framework to make the learned representations serve both de-confounding and trusted inference. Extensive experiments on synthetic datasets show that the DRL model performs superiorly in learning de-confounding representations and outperforms state-of-the-art counterfactual inference models for continuous treatment variables. In addition, we apply the DRL model to a real-world medical dataset MIMIC and demonstrate a detailed causal relationship between red cell width distribution and mortality.
    摘要 在实际的因果推断任务中,针对连续(而非二元)治疗变量的反事实推断更为常见。现有一些基于Marginal Structural Model的样本重加权方法可以消除混杂偏差,但它们通常只关注消除治疗变量对混杂因素的线性依赖,并依赖所假设的参数模型的准确性,而这些假设通常无法验证。在这篇论文中,我们提出了一种去混杂表示学习(de-confounding representation learning,DRL)框架,通过生成与治疗变量解耦的协变量表示,用于连续治疗变量的反事实结果估计。DRL是一种非参数模型,能够消除治疗变量与协变量之间的线性和非线性依赖。具体来说,我们让去混杂表示与治疗变量之间的相关性对抗协变量表示与治疗变量之间的相关性进行训练,以消除混杂偏差。此外,我们还在框架中嵌入了一个反事实推断网络,使学习到的表示既能去混杂又能用于可信的推断。在合成数据集上的大量实验表明,DRL模型在学习去混杂表示方面表现出色,并优于当前最先进的连续治疗反事实推断模型。此外,我们将DRL模型应用于真实的医疗数据集MIMIC,展示了红细胞分布宽度与死亡率之间的详细因果关系。

Predicting Ordinary Differential Equations with Transformers

  • paper_url: http://arxiv.org/abs/2307.12617
  • repo_url: None
  • paper_authors: Sören Becker, Michal Klein, Alexander Neitz, Giambattista Parascandolo, Niki Kilbertus
  • for: recuperates scalar ordinary differential equations (ODEs) in symbolic form from irregularly sampled and noisy observations of a single solution trajectory
  • methods: transformer-based sequence-to-sequence model
  • results: better or on par with existing methods in terms of accurate recovery, and efficiently scalable after one-time pretraining on a large set of ODEs
    Abstract We develop a transformer-based sequence-to-sequence model that recovers scalar ordinary differential equations (ODEs) in symbolic form from irregularly sampled and noisy observations of a single solution trajectory. We demonstrate in extensive empirical evaluations that our model performs better or on par with existing methods in terms of accurate recovery across various settings. Moreover, our method is efficiently scalable: after one-time pretraining on a large set of ODEs, we can infer the governing law of a new observed solution in a few forward passes of the model.
    摘要 我们开发了一种基于转换器的序列到序列模型,可以从不规则采样和噪声观测数据中精确地回归Scalar常微方程(ODEs)的符号形式。我们在广泛的实验中证明了我们的模型与现有方法相比,在不同的设置下都能够更高效地回归精度。另外,我们的方法可以高效扩展:只需一次预训练于大量ODEs后,我们就可以在几个前向传播中快速地推断新观测数据的管理法律。
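A short sketch of the data side of this setup: integrate a known scalar ODE, sample the solution at irregular times, and corrupt it with noise, yielding an (observations, symbolic ODE) pair of the kind such a sequence-to-sequence model is pretrained on. The specific ODE, noise level, and sampling scheme below are toy assumptions; the transformer itself is not shown.

```python
# Sketch: build one (noisy irregular trajectory -> symbolic ODE) training example.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)

target_expr = "dy/dt = -0.5*y + sin(t)"                 # ground-truth law (toy choice)
f = lambda t, y: -0.5 * y + np.sin(t)

sol = solve_ivp(f, t_span=(0.0, 10.0), y0=[1.0], dense_output=True)

t_obs = np.sort(rng.uniform(0.0, 10.0, size=40))        # irregular sampling times
y_obs = sol.sol(t_obs)[0] + rng.normal(0.0, 0.05, size=40)   # noisy observations

# (t_obs, y_obs) is the encoder input sequence; target_expr is the decoder target string.
print(list(zip(t_obs[:3].round(2), y_obs[:3].round(3))), "->", target_expr)
```

After one-time pretraining on a large corpus of such pairs, recovering the governing law of a new trajectory reduces to a few forward passes of the trained model.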

ExWarp: Extrapolation and Warping-based Temporal Supersampling for High-frequency Displays

  • paper_url: http://arxiv.org/abs/2307.12607
  • repo_url: None
  • paper_authors: Akanksha Dixit, Yashashwee Chakrabarty, Smruti R. Sarangi
  • for: 提高高频显示器的帧率,提供更平滑、更灵敏的用户体验。
  • methods: 结合基于深度神经网络(DNN)的外推和基于运动向量的帧扭曲(warping)两种方法,并使用强化学习(RL)策略在两者之间智能选择,在保持图像质量的同时提高帧率。
  • results: 与传统方法相比,ExWarp 可以将帧率提高 4 倍,且感知图像质量几乎不受影响。
    Abstract High-frequency displays are gaining immense popularity because of their increasing use in video games and virtual reality applications. However, the issue is that the underlying GPUs cannot continuously generate frames at this high rate -- this results in a less smooth and responsive experience. Furthermore, if the frame rate is not synchronized with the refresh rate, the user may experience screen tearing and stuttering. Previous works propose increasing the frame rate to provide a smooth experience on modern displays by predicting new frames based on past or future frames. Interpolation and extrapolation are two widely used algorithms that predict new frames. Interpolation requires waiting for the future frame to make a prediction, which adds additional latency. On the other hand, extrapolation provides a better quality of experience because it relies solely on past frames -- it does not incur any additional latency. The simplest method to extrapolate a frame is to warp the previous frame using motion vectors; however, the warped frame may contain improperly rendered visual artifacts due to dynamic objects -- this makes it very challenging to design such a scheme. Past work has used DNNs to get good accuracy, however, these approaches are slow. This paper proposes Exwarp -- an approach based on reinforcement learning (RL) to intelligently choose between the slower DNN-based extrapolation and faster warping-based methods to increase the frame rate by 4x with an almost negligible reduction in the perceived image quality.
    摘要 高频显示器目前在游戏和虚拟现实应用中得到了广泛的推广,但是这些显示器的后置GPU无法持续生成这高的帧率,这会导致用户体验不平滑和不响应。此外,如果帧率与刷新率不同步,用户可能会经历屏渲染和颤抖现象。以往的工作建议通过预测新帧来提高现代显示器的帧率,以提供柔顺的用户体验。插值和拟合是两种广泛使用的预测算法。插值需要等待未来帧来作预测,这会添加额外的延迟。拟合则提供了更高质量的用户体验,因为它仅基于过去帧进行预测,不增加额外的延迟。最简单的拟合方法是通过运动向量来扭曲上一帧,以生成下一帧。但是,扭曲后的帧可能包含不正确渲染的视觉artifacts,这使得设计这种方案非常困难。过去的工作使用深度神经网络(DNN)来获得高精度,但这些方法较慢。这篇论文提出了Exwarp方法,基于强化学习(RL)来智能选择 slower DNN-based extrapolation和 faster warping-based方法,以提高帧率4倍,并且几乎无法感受到图像质量的下降。
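A minimal sketch of the cheap warping path: extrapolate the next frame by backward-warping the previous frame along per-pixel motion vectors with bilinear sampling. Occlusion and disocclusion handling, the hard part that motivates the DNN path and the RL selector in the paper, is deliberately omitted; the toy frame and flow are assumptions.

```python
# Sketch: dense backward warp of the previous frame along per-pixel motion vectors.
import numpy as np
from scipy.ndimage import map_coordinates

def warp_frame(prev_frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """prev_frame: [H, W], flow: [H, W, 2] with (dy, dx) motion per pixel."""
    h, w = prev_frame.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # The pixel at (y, x) in the new frame comes from (y - dy, x - dx) in the old one.
    src_y = yy - flow[..., 0]
    src_x = xx - flow[..., 1]
    return map_coordinates(prev_frame, [src_y, src_x], order=1, mode="nearest")

# Toy example: a bright square moving 3 pixels to the right per frame.
prev = np.zeros((64, 64)); prev[20:30, 20:30] = 1.0
flow = np.zeros((64, 64, 2)); flow[..., 1] = 3.0
nxt = warp_frame(prev, flow)
print("square columns before:", prev[25].nonzero()[0].min(), "-", prev[25].nonzero()[0].max())
print("square columns after: ", nxt[25].nonzero()[0].min(), "-", nxt[25].nonzero()[0].max())
```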

Concept backpropagation: An Explainable AI approach for visualising learned concepts in neural network models

  • paper_url: http://arxiv.org/abs/2307.12601
  • repo_url: https://github.com/patrik-ha/concept-backpropagation
  • paper_authors: Patrik Hammersborg, Inga Strümke
  • for: This paper aims to provide a method for visualizing the information that a neural network model depends on to represent a given concept.
  • methods: The method used in this paper is called concept backpropagation, which involves perturbing the model input in a way that maximizes the detected concept.
  • results: The paper presents results for this method applied to a variety of input modalities, and discusses how the method can be used to visualize the information that trained concept probes use and the degree to which the representation of the probed concept is entangled within the neural network model.
    Abstract Neural network models are widely used in a variety of domains, often as black-box solutions, since they are not directly interpretable for humans. The field of explainable artificial intelligence aims at developing explanation methods to address this challenge, and several approaches have been developed over the recent years, including methods for investigating what type of knowledge these models internalise during the training process. Among these, the method of concept detection, investigates which \emph{concepts} neural network models learn to represent in order to complete their tasks. In this work, we present an extension to the method of concept detection, named \emph{concept backpropagation}, which provides a way of analysing how the information representing a given concept is internalised in a given neural network model. In this approach, the model input is perturbed in a manner guided by a trained concept probe for the described model, such that the concept of interest is maximised. This allows for the visualisation of the detected concept directly in the input space of the model, which in turn makes it possible to see what information the model depends on for representing the described concept. We present results for this method applied to a various set of input modalities, and discuss how our proposed method can be used to visualise what information trained concept probes use, and the degree as to which the representation of the probed concept is entangled within the neural network model itself.
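A minimal sketch of the input-perturbation step: run gradient ascent on the input so that a trained, frozen concept probe attached to an intermediate layer of the frozen model is maximised, then inspect how the input changed. The toy backbone, probe shape, step size, and proximity regulariser are placeholders; the authors' exact procedure is in the linked repository.

```python
# Sketch: perturb the input to maximise a trained concept probe on a frozen model,
# keeping the perturbed input close to the original so the change is interpretable.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten())
probe = nn.Linear(8 * 32 * 32, 1)                   # concept probe on intermediate features
for p in list(backbone.parameters()) + list(probe.parameters()):
    p.requires_grad_(False)                          # model and probe stay frozen

x0 = torch.rand(1, 3, 32, 32)
x = x0.clone().requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(100):
    opt.zero_grad()
    concept_score = probe(backbone(x))
    loss = -concept_score.mean() + 0.01 * (x - x0).pow(2).mean()   # maximise concept, stay near x0
    loss.backward()
    opt.step()
    x.data.clamp_(0.0, 1.0)

print("concept score before/after:",
      probe(backbone(x0)).item(), probe(backbone(x)).item())
```

The difference x − x0 can then be visualised directly in input space to show what information the probed concept depends on.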

Optimized data collection and analysis process for studying solar-thermal desalination by machine learning

  • paper_url: http://arxiv.org/abs/2307.12594
  • repo_url: None
  • paper_authors: Guilong Peng, Senshan Sun, Yangjun Qin, Zhenwei Xu, Juxin Du, Swellam W. sharshir, A. W. Kandel, A. E. Kabeel, Nuo Yang
  • for: 这个研究的目的是提高机器学习在太阳蒸馏净水方面的应用,通过大量的实验数据收集和分析。
  • methods: 这个研究使用了修改后的实验数据收集和分析过程,通过加速数据收集和减少时间83.3%来收集超过一千个实验数据,比前一个研究的平均数据量大得多。同时,研究者使用了三种算法,包括人工神经网络、多ivariate 回归和随机森林,来研究数据特征的影响。
  • results: 研究结果表明,使用人工神经网络和随机森林算法时,大量数据可以显著提高预测精度。此外,研究还发现数据规模和范围对预测精度和影响因素排名的影响很大。同时,研究发现人工神经网络在推广范围上的描述性能受到数据范围的影响。这些结果表明,大量的实验数据收集和分析,以及数据特征的影响分析是机器学习在太阳蒸馏净水领域的重要步骤,可以推广机器学习在这个领域的应用。
    Abstract An effective interdisciplinary study between machine learning and solar-thermal desalination requires a sufficiently large and well-analyzed experimental datasets. This study develops a modified dataset collection and analysis process for studying solar-thermal desalination by machine learning. Based on the optimized water condensation and collection process, the proposed experimental method collects over one thousand datasets, which is ten times more than the average number of datasets in previous works, by accelerating data collection and reducing the time by 83.3%. On the other hand, the effects of dataset features are investigated by using three different algorithms, including artificial neural networks, multiple linear regressions, and random forests. The investigation focuses on the effects of dataset size and range on prediction accuracy, factor importance ranking, and the model's generalization ability. The results demonstrate that a larger dataset can significantly improve prediction accuracy when using artificial neural networks and random forests. Additionally, the study highlights the significant impact of dataset size and range on ranking the importance of influence factors. Furthermore, the study reveals that the extrapolation data range significantly affects the extrapolation accuracy of artificial neural networks. Based on the results, massive dataset collection and analysis of dataset feature effects are important steps in an effective and consistent machine learning process flow for solar-thermal desalination, which can promote machine learning as a more general tool in the field of solar-thermal desalination.
    摘要 要有效地结合机器学习和太阳蒸馈淡水,需要一个足够大、且具有分析力的实验数据集。本研究提出了一种修改后的数据采集和分析过程,用于通过机器学习研究太阳蒸馈淡水。基于优化的水蒸馈和收集过程,该方法收集了超过一千个数据集,比前一个平均数据集的十倍多,并将采集时间减少了83.3%。而且,该研究通过使用三种不同的算法,包括人工神经网络、多元线性回归和随机森林,研究数据集大小和范围对预测精度、因素重要性排名和模型泛化能力的影响。结果表明,大量数据集可以在使用人工神经网络和随机森林时显著提高预测精度。此外,研究还发现数据集大小和范围对因素重要性排名产生了重要影响。此外,研究还发现人工神经网络抽象数据范围对抽象预测精度产生了重要影响。根据结果,大量数据采集和分析数据集特征效果是机器学习过程中不可或缺的一步,可以推动机器学习在太阳蒸馈淡水领域的普遍应用。

InVAErt networks: a data-driven framework for emulation, inference and identifiability analysis

  • paper_url: http://arxiv.org/abs/2307.12586
  • repo_url: None
  • paper_authors: Guoxiang Grayson Tong, Carlos A. Sing Long, Daniele E. Schiavazzi
  • for: 本研究旨在推广使用生成模型和深度学习来解决物理系统的设计和分析问题,而不仅仅是模拟任务。
  • methods: 该研究提出了一种名为inVAErt网络的框架,该框架使用确定性编码器和解码器来表示前向和反向解决 Map,使用流变换模型来捕捉系统输出的概率分布,并使用变量编码器来学习减少输入和输出之间的不一致性。
  • results: 研究人员通过数值实验证明了inVAErt网络的可行性和灵活性,并发现选择罚分 coefficient和积分空间抽取策略对训练和测试性能有重要影响。
    Abstract Use of generative models and deep learning for physics-based systems is currently dominated by the task of emulation. However, the remarkable flexibility offered by data-driven architectures would suggest to extend this representation to other aspects of system synthesis including model inversion and identifiability. We introduce inVAErt (pronounced \emph{invert}) networks, a comprehensive framework for data-driven analysis and synthesis of parametric physical systems which uses a deterministic encoder and decoder to represent the forward and inverse solution maps, normalizing flow to capture the probabilistic distribution of system outputs, and a variational encoder designed to learn a compact latent representation for the lack of bijectivity between inputs and outputs. We formally investigate the selection of penalty coefficients in the loss function and strategies for latent space sampling, since we find that these significantly affect both training and testing performance. We validate our framework through extensive numerical examples, including simple linear, nonlinear, and periodic maps, dynamical systems, and spatio-temporal PDEs.
    摘要 使用生成模型和深度学习来处理物理系统的应用主要是 emulator。然而,这些数据驱动架构的灵活性表示可以扩展到其他系统设计方面,包括模型反转和可识别性。我们介绍inVAErt(pronounced inverse)网络,一个涵盖数据驱动分析和设计参数物理系统的框架,使用决定性编码器和解码器表示前向和反向解决Map,使用正态流捕捉系统输出的概率分布,并使用可变编码器学习减少输入和输出之间的不一致。我们正式调查损害征素在损失函数中的选择和latent空间抽样策略,因为我们发现这些对训练和测试性能有很大影响。我们通过大量的数字例子验证了我们的框架,包括简单的线性、非线性和 periodic maps,动力系统和时空PDEs。

Self-refining of Pseudo Labels for Music Source Separation with Noisy Labeled Data

  • paper_url: http://arxiv.org/abs/2307.12576
  • repo_url: None
  • paper_authors: Junghyun Koo, Yunkee Chae, Chang-Bin Jeon, Kyogu Lee
  • for: 提高音乐源分离(MSS)性能,增加大数据集来改进MSS模型的训练
  • methods: 自动地对含有噪声标签的数据集进行自我反射,提高MSS模型的识别精度
  • results: 使用自我反射的数据集可以达到与使用干净标签的数据集相同的识别精度,而且在只有噪声标签数据集的情况下,MSS模型训练在自我反射数据集上可以超过使用干净标签数据集训练的性能。
    Abstract Music source separation (MSS) faces challenges due to the limited availability of correctly-labeled individual instrument tracks. With the push to acquire larger datasets to improve MSS performance, the inevitability of encountering mislabeled individual instrument tracks becomes a significant challenge to address. This paper introduces an automated technique for refining the labels in a partially mislabeled dataset. Our proposed self-refining technique, employed with a noisy-labeled dataset, results in only a 1% accuracy degradation in multi-label instrument recognition compared to a classifier trained on a clean-labeled dataset. The study demonstrates the importance of refining noisy-labeled data in MSS model training and shows that utilizing the refined dataset leads to comparable results derived from a clean-labeled dataset. Notably, upon only access to a noisy dataset, MSS models trained on a self-refined dataset even outperform those trained on a dataset refined with a classifier trained on clean labels.
    摘要 音乐源分离(MSS)面临限量正确标注个 instrumente 轨迹的问题。随着提高 MSS性能的努力,遇到带有错误标注的个 instrumente 轨迹的可能性变得非常重要。这篇文章介绍了一种自动刷新标注的技术,可以在带有噪声标注的 dataset 上进行刷新。我们的提议的自我刷新技术与噪声标注 dataset 上的类ifier 结合使用,对多个标签 instrumente 识别中的准确率进行了1%的下降。这种研究表明了刷新噪声标注数据的重要性,并证明了使用刷新后的数据可以达到与清晰标注数据相同的结果。甚至只有带有噪声标注的数据,MSS模型在使用自我刷新数据进行训练后会比使用刷新后的数据进行训练后更高的性能。

Towards Generalising Neural Topical Representations

  • paper_url: http://arxiv.org/abs/2307.12564
  • repo_url: None
  • paper_authors: Xiaohao Yang, He Zhao, Dinh Phung, Lan Du
  • for: 提高 neural topic model(NTM)的通用能力,使其可以在不同的资料集中具有可靠的泛化能力。
  • methods: 使用数据扩充和层次话题交通距离(HOTT)计算优化运输(OT)距离,以iminimize similar documents的semantic distance during training NTMs。
  • results: 对NTMs进行了扩展,使其在不同的资料集中具有显著提高的泛化能力。
    Abstract Topic models have evolved from conventional Bayesian probabilistic models to Neural Topic Models (NTMs) over the last two decays. Although NTMs have achieved promising performance when trained and tested on a specific corpus, their generalisation ability across corpora is rarely studied. In practice, we often expect that an NTM trained on a source corpus can still produce quality topical representation for documents in a different target corpus without retraining. In this work, we aim to improve NTMs further so that their benefits generalise reliably across corpora and tasks. To do so, we propose to model similar documents by minimising their semantical distance when training NTMs. Specifically, similar documents are created by data augmentation during training; The semantical distance between documents is measured by the Hierarchical Topic Transport Distance (HOTT), which computes the Optimal Transport (OT) distance between the topical representations. Our framework can be readily applied to most NTMs as a plug-and-play module. Extensive experiments show that our framework significantly improves the generalisation ability regarding neural topical representation across corpora.
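A small sketch of the base distance underlying HOTT: entropy-regularised optimal transport (Sinkhorn) between the topical representations of two documents, with a cost matrix built from topic-embedding distances. HOTT in the paper is a hierarchical version of this quantity; the topic embeddings, regularisation strength, and iteration count here are assumptions.

```python
# Sketch: Sinkhorn OT distance between two documents' topic distributions.
import numpy as np

def sinkhorn(p, q, C, reg=0.1, n_iter=200):
    """Entropic-OT cost <T, C> between histograms p, q under cost matrix C."""
    K = np.exp(-C / reg)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    T = u[:, None] * K * v[None, :]
    return float((T * C).sum())

rng = np.random.default_rng(0)
n_topics = 10
topic_emb = rng.normal(size=(n_topics, 50))                      # topic embeddings (toy)
C = np.linalg.norm(topic_emb[:, None] - topic_emb[None, :], axis=-1)
C = C / C.max()                                                  # normalise costs for stability

doc_a = rng.dirichlet(np.ones(n_topics))                         # topical representation of a document
doc_b = rng.dirichlet(np.ones(n_topics))                         # ... and of its augmented copy
print("OT distance between topic distributions:", round(sinkhorn(doc_a, doc_b, C), 3))
```

During NTM training, this distance between a document and its augmented version is the semantical distance being minimised.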

DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for Automatic Protein Function Prediction

  • paper_url: http://arxiv.org/abs/2307.13004
  • repo_url: None
  • paper_authors: Zihao Li, Changkun Jiang, Jianqiang Li
  • for: automatic protein function prediction (AFP)
  • methods: sequence-based hierarchical prediction method using graph attention networks (GATs) and contrastive learning
  • results: better scalability in GO term enrichment analysis on large-scale datasets
    Abstract Automatic protein function prediction (AFP) is classified as a large-scale multi-label classification problem aimed at automating protein enrichment analysis to eliminate the current reliance on labor-intensive wet-lab methods. Currently, popular methods primarily combine protein-related information and Gene Ontology (GO) terms to generate final functional predictions. For example, protein sequences, structural information, and protein-protein interaction networks are integrated as prior knowledge to fuse with GO term embeddings and generate the ultimate prediction results. However, these methods are limited by the difficulty in obtaining structural information or network topology information, as well as the accuracy of such data. Therefore, more and more methods that only use protein sequences for protein function prediction have been proposed, which is a more reliable and computationally cheaper approach. However, the existing methods fail to fully extract feature information from protein sequences or label data because they do not adequately consider the intrinsic characteristics of the data itself. Therefore, we propose a sequence-based hierarchical prediction method, DeepGATGO, which processes protein sequences and GO term labels hierarchically, and utilizes graph attention networks (GATs) and contrastive learning for protein function prediction. Specifically, we compute embeddings of the sequence and label data using pre-trained models to reduce computational costs and improve the embedding accuracy. Then, we use GATs to dynamically extract the structural information of non-Euclidean data, and learn general features of the label dataset with contrastive learning by constructing positive and negative example samples. Experimental results demonstrate that our proposed model exhibits better scalability in GO term enrichment analysis on large-scale datasets.
    摘要 自动蛋白功能预测(AFP)被分类为大规模多标签分类问题,旨在自动化蛋白聚集分析,以消除现有的人工劳动密集方法。现有的popular方法主要结合蛋白质相关信息和生物学功能 ontology(GO)标签来生成最终的功能预测结果。例如,蛋白序列、结构信息和蛋白蛋白交互网络被融合到GO标签嵌入中,以生成最终的预测结果。然而,这些方法受到蛋白质结构信息或网络拓扑信息的困难性和准确性的限制。因此,越来越多的方法只使用蛋白序列进行蛋白功能预测,这是一种更可靠和计算成本更低的方法。然而,现有的方法无法充分EXTRACT蛋白序列和标签数据中的特征信息。因此,我们提出了一种遵循蛋白序列层次预测方法,深度GATGO,该方法可以处理蛋白序列和GO标签数据层次,并使用图注意力网络(GATs)和对比学习来进行蛋白功能预测。具体来说,我们使用预训练模型计算蛋白序列和标签数据的嵌入,以降低计算成本并提高嵌入精度。然后,我们使用GATs动态提取蛋白序列非几何数据的结构信息,并通过对比学习学习标签数据的通用特征。实验结果表明,我们提出的模型在大规模GO标签浸泡分析中展现出较好的扩展性。

Homophily-Driven Sanitation View for Robust Graph Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.12555
  • repo_url: https://github.com/htytewx/softcam
  • paper_authors: Yulin Zhu, Xing Ai, Yevgeniy Vorobeychik, Kai Zhou
  • for: 这个论文旨在探讨Graph Contrastive Learning(GCL)对于结构攻击的 adversarial robustness。
  • methods: 这篇论文使用了一系列的攻击分析和理论分析,揭示了现有攻击的弱点和如何降低GCL的性能。此外,它还提出了一种robust GCL框架,该框架通过 integrate homophily-driven sanitation view来增强GCL的鲁棒性。然而,sanitation objective的非导数性带来了一些挑战,以下是一些解决这些挑战的技巧。
  • results: 我们的实验结果表明,GCHS(Graph Contrastive Learning with Homophily-driven Sanitation View)在两种状态之前的顶尖模型面前占据了优势,并在生成节点 embedding 和两个重要的下游任务上表现出色。
    Abstract We investigate adversarial robustness of unsupervised Graph Contrastive Learning (GCL) against structural attacks. First, we provide a comprehensive empirical and theoretical analysis of existing attacks, revealing how and why they downgrade the performance of GCL. Inspired by our analytic results, we present a robust GCL framework that integrates a homophily-driven sanitation view, which can be learned jointly with contrastive learning. A key challenge this poses, however, is the non-differentiable nature of the sanitation objective. To address this challenge, we propose a series of techniques to enable gradient-based end-to-end robust GCL. Moreover, we develop a fully unsupervised hyperparameter tuning method which, unlike prior approaches, does not require knowledge of node labels. We conduct extensive experiments to evaluate the performance of our proposed model, GCHS (Graph Contrastive Learning with Homophily-driven Sanitation View), against two state of the art structural attacks on GCL. Our results demonstrate that GCHS consistently outperforms all state of the art baselines in terms of the quality of generated node embeddings as well as performance on two important downstream tasks.
    摘要 我们研究无监督图对比学习(GCL)在结构攻击下的对抗鲁棒性。首先,我们对现有攻击进行了全面的实验与理论分析,揭示了它们如何以及为何降低GCL的性能。受分析结果启发,我们提出了一个鲁棒的GCL框架,将同质性驱动的净化视图与对比学习联合学习。然而,净化目标的不可微性带来了关键挑战;为此,我们提出了一系列技术,使基于梯度的端到端鲁棒GCL成为可能。此外,我们还开发了一种完全无监督的超参数调优方法,与先前方法不同,它不需要节点标签。我们进行了大量实验,针对两种最先进的GCL结构攻击评估了所提出的模型GCHS(Graph Contrastive Learning with Homophily-driven Sanitation View)。结果表明,GCHS在生成节点嵌入的质量以及两个重要下游任务的表现上均持续优于所有最先进的基线。

Continuation Path Learning for Homotopy Optimization

  • paper_url: http://arxiv.org/abs/2307.12551
  • repo_url: https://github.com/xi-l/cpl
  • paper_authors: Xi Lin, Zhiyuan Yang, Xiaoyuan Zhang, Qingfu Zhang
  • for: 提高Homotopy优化的效果和可用性,并提供更多的解决方案选择机会。
  • methods: 提出了一种基于模型的方法,可以同时优化原始问题和所有优化子问题,并实时生成任意中间解决方案。
  • results: 实验表明,该方法可以明显提高Homotopy优化的性能,并提供更多的有用信息支持更好的决策。
    Abstract Homotopy optimization is a traditional method to deal with a complicated optimization problem by solving a sequence of easy-to-hard surrogate subproblems. However, this method can be very sensitive to the continuation schedule design and might lead to a suboptimal solution to the original problem. In addition, the intermediate solutions, often ignored by classic homotopy optimization, could be useful for many real-world applications. In this work, we propose a novel model-based approach to learn the whole continuation path for homotopy optimization, which contains infinite intermediate solutions for any surrogate subproblems. Rather than the classic unidirectional easy-to-hard optimization, our method can simultaneously optimize the original problem and all surrogate subproblems in a collaborative manner. The proposed model also supports real-time generation of any intermediate solution, which could be desirable for many applications. Experimental studies on different problems show that our proposed method can significantly improve the performance of homotopy optimization and provide extra helpful information to support better decision-making.
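For contrast with the learned continuation path, the sketch below shows the classic unidirectional easy-to-hard homotopy schedule that the paper generalises: a hard multimodal objective is solved by following a sequence of surrogates g_t = (1 − t)·f_easy + t·f_hard, warm-starting each subproblem at the previous solution. The objective, schedule, and step sizes are toy assumptions; the model-based CPL approach itself is in the linked repository.

```python
# Sketch: classic homotopy optimisation on a 1-D multimodal objective.
import numpy as np

f_easy = lambda x: (x - 3.0) ** 2                          # convex surrogate
f_hard = lambda x: (x - 3.0) ** 2 + 2.0 * np.sin(5.0 * x)  # multimodal target
grad = lambda x, t: 2.0 * (x - 3.0) + t * 10.0 * np.cos(5.0 * x)   # gradient of (1-t)*easy + t*hard

def minimise(x, t, lr=0.01, steps=500):
    for _ in range(steps):
        x = x - lr * grad(x, t)
    return x

x = -5.0                                                   # poor initial point
for t in np.linspace(0.0, 1.0, 11):                        # continuation schedule, easy -> hard
    x = minimise(x, t)                                     # warm start from the previous t
direct = minimise(-5.0, 1.0)                               # plain descent on the hard problem

print("homotopy solution:", round(x, 3), " f_hard:", round(f_hard(x), 3))
print("direct solution:  ", round(direct, 3), " f_hard:", round(f_hard(direct), 3))
```

The intermediate iterates produced along the schedule are exactly the solutions that classic homotopy discards and that the proposed continuation path learning makes available on demand.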

On the Connection between Pre-training Data Diversity and Fine-tuning Robustness

  • paper_url: http://arxiv.org/abs/2307.12532
  • repo_url: None
  • paper_authors: Vivek Ramanujan, Thao Nguyen, Sewoong Oh, Ludwig Schmidt, Ali Farhadi
  • for: 了解预训练策略对下游模型的泛化性质的影响。
  • methods: 研究预训练分布的属性对下游模型的可靠性的影响,包括标签空间、标签 semantics、图像多样性、数据领域和数据量等因素。
  • results: 发现预训练数据量是下游模型的可靠性的关键因素,其他因素具有有限的影响。例如,将 ImageNet 预训练类减少到 4 倍,同时将每个类的图像数量增加到 4 倍(即保持总数据量不变)不会影响 fine-tuned 模型的可靠性。通过使用不同的自然和Synthetic 数据源预训练分布,主要通过 iWildCam-WILDS 分布转换测试下游模型的可靠性。
    Abstract Pre-training has been widely adopted in deep learning to improve model performance, especially when the training data for a target task is limited. In our work, we seek to understand the implications of this training strategy on the generalization properties of downstream models. More specifically, we ask the following question: how do properties of the pre-training distribution affect the robustness of a fine-tuned model? The properties we explore include the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution. We find that the primary factor influencing downstream effective robustness (Taori et al., 2020) is data quantity, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4x while increasing the number of images per class by 4x (that is, keeping total data quantity fixed) does not impact the robustness of fine-tuned models. We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources, primarily using the iWildCam-WILDS distribution shift as a test for downstream robustness.
    摘要 <>将文本翻译成简化中文。>预训练已广泛应用于深度学习中,以提高模型性能,特别是当目标任务的训练数据scarce时。在我们的工作中,我们想要了解预训练策略对下游模型的泛化性质产生的影响。更 Specifically,我们问的问题是:预训练分布的属性如何影响下游模型的可靠性?我们探讨的属性包括标签空间、标签 semantics、图像多样性、数据领域和数据量。我们发现预训练数据量是下游可靠性的主要因素,而其他因素具有有限的意义。例如,将 ImageNet 预训练类别数量减少到 4 倍,同时图像每类数量增加 4 倍(即保持总数据量不变),不会影响 Fine-tune 模型的可靠性。我们通过不同的自然和 sintetic 数据源中的预训练分布,主要使用 iWildCam-WILDS 分布转换为下游可靠性的测试。

Rethinking Medical Report Generation: Disease Revealing Enhancement with Knowledge Graph

  • paper_url: http://arxiv.org/abs/2307.12526
  • repo_url: https://github.com/wangyixinxin/mrg-kg
  • paper_authors: Yixin Wang, Zihao Lin, Haoyu Dong
  • for: 这个研究旨在提高医疗报告生成(MRG)的品质,特别是透过知识图(KG)来导向生成过程。
  • methods: 本研究使用了一个完整的KG,包括137种疾病和问题,并导入了一个新的增强描述疾病类型的增强策略,以解决长条形分布问题。
  • results: 研究发现,提案的两阶段生成框架和增强策略可以提高生成的多样性和准确性,并有着显著的改善效果。
    Abstract Knowledge Graph (KG) plays a crucial role in Medical Report Generation (MRG) because it reveals the relations among diseases and thus can be utilized to guide the generation process. However, constructing a comprehensive KG is labor-intensive and its applications on the MRG process are under-explored. In this study, we establish a complete KG on chest X-ray imaging that includes 137 types of diseases and abnormalities. Based on this KG, we find that the current MRG data sets exhibit a long-tailed problem in disease distribution. To mitigate this problem, we introduce a novel augmentation strategy that enhances the representation of disease types in the tail-end of the distribution. We further design a two-stage MRG approach, where a classifier is first trained to detect whether the input images exhibit any abnormalities. The classified images are then independently fed into two transformer-based generators, namely, ``disease-specific generator" and ``disease-free generator" to generate the corresponding reports. To enhance the clinical evaluation of whether the generated reports correctly describe the diseases appearing in the input image, we propose diverse sensitivity (DS), a new metric that checks whether generated diseases match ground truth and measures the diversity of all generated diseases. Results show that the proposed two-stage generation framework and augmentation strategies improve DS by a considerable margin, indicating a notable reduction in the long-tailed problem associated with under-represented diseases.
    摘要 医疗报告生成(MRG)中知识图(KG)扮演着关键性的角色,因为它揭示疾病之间的关系,可以用于导航生成过程。然而,建立全面的KG是劳动密集的,而其在MRG过程中的应用还尚未得到了充分的探索。本研究中,我们建立了包含137种疾病和异常的完整KG,基于这个KG,我们发现现有的MRG数据集具有长尾分布问题。为了解决这个问题,我们提出了一种新的增强策略,增强疾病类型在分布尾部的表示。此外,我们设计了两个阶段的MRG方法,其中第一阶段使用分类器来检测输入图像是否具有任何异常。经过分类后,图像分别被独立地传递到两个基于转换器的生成器,即“疾病特定生成器”和“疾病无效生成器”,以生成对应的报告。为了提高生成报告的临床评估,我们提出了多样性敏感度(DS),一种新的指标,用于检查生成的疾病与实际情况是否匹配,并测量所有生成的疾病的多样性。结果显示,我们的两个阶段生成框架和增强策略可以大幅提高DS, indicating a considerable reduction in the long-tailed problem associated with under-represented diseases.
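An illustrative sketch of a diverse-sensitivity-style check, assuming disease mentions have already been extracted from each generated and reference report (for example, by matching against the knowledge graph's 137 disease and abnormality terms). The exact DS definition in the paper may differ; this only captures its two ingredients, whether generated diseases match the ground truth and how diverse the generated diseases are.

```python
# Sketch: per-report match rate of generated disease mentions against ground truth,
# plus the diversity (number of distinct diseases) across all generated reports.
from typing import List, Set

def match_rate(gen: Set[str], ref: Set[str]) -> float:
    """Fraction of ground-truth diseases that the generated report mentions."""
    return 1.0 if not ref else len(gen & ref) / len(ref)

def diverse_sensitivity(generated: List[Set[str]], references: List[Set[str]]) -> dict:
    per_report = [match_rate(g, r) for g, r in zip(generated, references)]
    distinct = set().union(*generated) if generated else set()
    return {"mean_match": sum(per_report) / len(per_report),
            "distinct_generated_diseases": len(distinct)}

generated = [{"cardiomegaly", "effusion"}, {"no finding"}, {"pneumothorax"}]
references = [{"cardiomegaly"}, {"no finding"}, {"pneumothorax", "effusion"}]
print(diverse_sensitivity(generated, references))
```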

Landslide Surface Displacement Prediction Based on VSXC-LSTM Algorithm

  • paper_url: http://arxiv.org/abs/2307.12524
  • repo_url: None
  • paper_authors: Menglin Kong, Ruichen Li, Fan Liu, Xingquan Li, Juan Cheng, Muzhou Hou, Cong Cao
  • for: 预测地面滑坡表面变位
  • methods: 基于变形模式分解(VMD)、SegSigmoid函数、XGBoost算法和嵌入LSTM neural network的时序预测框架(VSXC-LSTM)
  • results: 在测试集上,模型表现良好,除了随机项 subsequences 难以适应外, periodic item subsequence 和 trend item subsequence 的 RMSE 和 MAPE 都小于 0.1, periodic item prediction module 基于 XGBoost 的 RMSE 为 0.006。
    Abstract Landslide is a natural disaster that can easily threaten local ecology, people's lives and property. In this paper, we conduct modelling research on real unidirectional surface displacement data of recent landslides in the research area and propose a time series prediction framework named VMD-SegSigmoid-XGBoost-ClusterLSTM (VSXC-LSTM) based on variational mode decomposition, which can predict the landslide surface displacement more accurately. The model performs well on the test set. Except for the random item subsequence that is hard to fit, the root mean square error (RMSE) and the mean absolute percentage error (MAPE) of the trend item subsequence and the periodic item subsequence are both less than 0.1, and the RMSE is as low as 0.006 for the periodic item prediction module based on XGBoost\footnote{Accepted in ICANN2023}.
    摘要 地面滑坡是自然灾害,容易威胁当地生态、人们生命和财产。在这篇论文中,我们基于实际的单向表面偏移数据进行模拟研究,并提出了一种基于变幅模式分解的时间序列预测框架,称为VMD-SegSigmoid-XGBoost-ClusterLSTM(VSXC-LSTM)。这种模型可以更准确地预测滑坡表面偏移。测试集上的表现良好,只有随机项子序列难以适应,RMSE和MAPE值均小于0.1, periodic item prediction module based on XGBoost的RMSE值为0.006( Accepted in ICANN2023)。

Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation

  • paper_url: http://arxiv.org/abs/2307.12520
  • repo_url: https://github.com/neelbhandari6/nmt_text_attack
  • paper_authors: Neel Bhandari, Pin-Yu Chen
  • for: Investigates the robustness of existing textual adversarial attacks, particularly adversarial examples that maintain considerable similarity to the original text.
  • methods: Evaluates six state-of-the-art text-based adversarial attacks under round-trip translation, and proposes a machine-translation-based intervention to make adversarial examples more robust.
  • results: The six attacks lose their efficacy after round-trip translation, while integrating machine translation into adversarial example generation improves robustness; translation-robust adversarial examples help expose weaknesses of language models that are shared across languages and motivate further research on multilingual adversarial attacks.
    Abstract Language Models today provide a high accuracy across a large number of downstream tasks. However, they remain susceptible to adversarial attacks, particularly against those where the adversarial examples maintain considerable similarity to the original text. Given the multilingual nature of text, the effectiveness of adversarial examples across translations and how machine translations can improve the robustness of adversarial examples remain largely unexplored. In this paper, we present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation. We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation. Furthermore, we introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation and demonstrating increased robustness to round-trip translation. Our results indicate that finding adversarial examples robust to translation can help identify the insufficiency of language models that is common across languages, and motivate further research into multilingual adversarial attacks.
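
A minimal sketch of the intervention described above: generated adversarial candidates are kept only if the attack still succeeds after round-trip translation. The `translate` and `classify` callables are placeholders for a real machine translation system and the victim classifier.

```python
# Sketch of the round-trip-translation filter: an adversarial candidate is kept
# only if it fools the classifier both before and after a round trip through a
# pivot language. `translate` and `classify` are placeholder callables.
from typing import Callable, Iterable, List

def round_trip(text: str,
               translate: Callable[[str, str, str], str],
               pivot: str = "de") -> str:
    """English -> pivot language -> English."""
    return translate(translate(text, "en", pivot), pivot, "en")

def filter_robust_adversarials(candidates: Iterable[str],
                               true_label: int,
                               classify: Callable[[str], int],
                               translate: Callable[[str, str, str], str]) -> List[str]:
    robust = []
    for text in candidates:
        if classify(text) != true_label and \
           classify(round_trip(text, translate)) != true_label:
            robust.append(text)
    return robust
```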

DEPHN: Different Expression Parallel Heterogeneous Network using virtual gradient optimization for Multi-task Learning

  • paper_url: http://arxiv.org/abs/2307.12519
  • repo_url: None
  • paper_authors: Menglin Kong, Ri Su, Shaojie Zhao, Muzhou Hou
  • for: This paper proposes a new method for multi-task learning (MTL) recommendation systems to better understand user behavior in complex scenarios.
  • methods: The proposed method, called Different Expression Parallel Heterogeneous Network (DEPHN), uses different feature interaction methods to improve the generalization ability of shared information flow and adaptively adjusts the learning intensity of gated units based on task correlation.
  • results: Extensive experiments on artificial and real-world datasets demonstrate that DEPHN can capture task correlation in complex situations and achieve better performance than baseline models.
    Abstract Recommendation algorithms based on multi-task learning (MTL) are the main tool Internet operators use to understand users and predict their behaviors in multi-behavior platform scenarios. Task correlation is an important consideration in MTL: traditional models use shared-bottom architectures and gating experts to realize shared representation learning and information differentiation. However, the relationships between real-world tasks are often more complex than these approaches assume, and existing methods do not handle information sharing properly. In this paper, we propose a Different Expression Parallel Heterogeneous Network (DEPHN) to model multiple tasks simultaneously. DEPHN constructs the experts at the bottom of the model with different feature interaction methods to improve the generalization ability of the shared information flow. To strengthen the model's ability to differentiate between task information flows, DEPHN uses feature explicit mapping and a virtual gradient coefficient for expert gating during training, and adaptively adjusts the learning intensity of the gated units according to the difference in gating values and task correlation. Extensive experiments on artificial and real-world datasets demonstrate that our proposed method can capture task correlation in complex situations and achieves better performance than baseline models (accepted at IJCNN 2023).
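
A schematic of the gated parallel-expert structure that DEPHN builds on is sketched below; the heterogeneous feature-interaction experts, the feature explicit mapping, and the virtual gradient coefficients of the paper are simplified away, so this only illustrates task-specific gating over shared experts.

```python
# Schematic parallel-expert network with per-task gates. In DEPHN the experts
# use different feature interaction methods and gating is controlled by feature
# explicit mapping and virtual gradient coefficients; both are simplified here.
import torch
import torch.nn as nn

class GatedMultiTaskNet(nn.Module):
    def __init__(self, in_dim: int, hidden: int, n_experts: int, n_tasks: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()) for _ in range(n_experts)]
        )
        self.gates = nn.ModuleList([nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        self.towers = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, x: torch.Tensor):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, H)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)     # (B, E, 1)
            mixed = (weights * expert_out).sum(dim=1)                  # (B, H)
            outputs.append(torch.sigmoid(tower(mixed)))                # per-task score
        return outputs
```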

FaFCNN: A General Disease Classification Framework Based on Feature Fusion Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12518
  • repo_url: None
  • paper_authors: Menglin Kong, Shaojie Zhao, Juan Cheng, Xingquan Li, Ri Su, Muzhou Hou, Cong Cao
  • for: Addresses two fundamental problems in applying deep learning/machine learning methods to disease classification tasks, namely insufficient training samples and effective feature fusion.
  • methods: Proposes the Feature-aware Fusion Correlation Neural Network (FaFCNN), which introduces a feature-aware interaction module and a feature alignment module based on domain adversarial learning.
  • results: Training on augmented features yields larger performance gains, and FaFCNN achieves consistently optimal disease classification performance, especially on low-quality datasets; extensive experiments further demonstrate the robustness of the model and the effectiveness of each component.
    Abstract There are two fundamental problems in applying deep learning/machine learning methods to disease classification tasks, one is the insufficient number and poor quality of training samples; another one is how to effectively fuse multiple source features and thus train robust classification models. To address these problems, inspired by the process of human learning knowledge, we propose the Feature-aware Fusion Correlation Neural Network (FaFCNN), which introduces a feature-aware interaction module and a feature alignment module based on domain adversarial learning. This is a general framework for disease classification, and FaFCNN improves the way existing methods obtain sample correlation features. The experimental results show that training using augmented features obtained by pre-training gradient boosting decision tree yields more performance gains than random-forest based methods. On the low-quality dataset with a large amount of missing data in our setup, FaFCNN obtains a consistently optimal performance compared to competitive baselines. In addition, extensive experiments demonstrate the robustness of the proposed method and the effectiveness of each component of the model (accepted at IEEE SMC 2023).
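
FaFCNN's feature alignment module relies on domain adversarial learning. The standard gradient-reversal layer below illustrates that general idea, assuming the usual DANN-style setup; it is not the authors' exact alignment module.

```python
# Standard gradient-reversal layer used in domain-adversarial training (DANN);
# shown only to illustrate the feature-alignment idea, not the authors' module.
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient so the feature extractor is pushed
        # toward domain-invariant representations against a domain discriminator.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

class DomainDiscriminator(nn.Module):
    def __init__(self, feat_dim: int, n_domains: int = 2, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_domains))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(grad_reverse(features, self.lambd))
```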

An Empirical Evaluation of Temporal Graph Benchmark

  • paper_url: http://arxiv.org/abs/2307.12510
  • repo_url: https://github.com/yule-BUAA/DyGLib_TGB
  • paper_authors: Le Yu
  • for: An empirical evaluation of the Temporal Graph Benchmark (TGB), carried out by extending our Dynamic Graph Library (DyGLib) to TGB.
  • methods: Includes eleven popular dynamic graph learning methods for a more exhaustive comparison than the baselines reported in TGB.
  • results: Different models show varying performance across datasets, in line with previous observations, and the performance of some baselines can be significantly improved over the results reported in TGB when using DyGLib.
    Abstract In this paper, we conduct an empirical evaluation of Temporal Graph Benchmark (TGB) by extending our Dynamic Graph Library (DyGLib) to TGB. Compared with TGB, we include eleven popular dynamic graph learning methods for more exhaustive comparisons. Through the experiments, we find that (1) different models depict varying performance across various datasets, which is in line with previous observations; (2) the performance of some baselines can be significantly improved over the reported results in TGB when using DyGLib. This work aims to ease the researchers' efforts in evaluating various dynamic graph learning methods on TGB and attempts to offer results that can be directly referenced in the follow-up research. All the used resources in this project are publicly available at https://github.com/yule-BUAA/DyGLib_TGB. This work is in progress, and feedback from the community is welcomed for improvements.

AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.12499
  • repo_url: None
  • paper_authors: Xuelong Dai, Kaisheng Liang, Bin Xiao
  • for: Research on adversarial attacks against deep learning models and defense techniques
  • methods: Using diffusion models to generate unrestricted adversarial examples, and proposing two novel adversarial guidance techniques for reverse generation
  • results: Experimental results on MNIST and ImageNet datasets show that AdvDiff is effective in generating high-quality, realistic adversarial examples that outperform GAN-based methods in attack performance and generation quality.
    Abstract Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques. They pose severe security problems for deep learning applications because they can effectively bypass defense mechanisms. However, previous attack methods often utilize Generative Adversarial Networks (GANs), which are not theoretically provable and thus generate unrealistic examples by incorporating adversarial objectives, especially for large-scale datasets like ImageNet. In this paper, we propose a new method, called AdvDiff, to generate unrestricted adversarial examples with diffusion models. We design two novel adversarial guidance techniques to conduct adversarial sampling in the reverse generation process of diffusion models. These two techniques are effective and stable to generate high-quality, realistic adversarial examples by integrating gradients of the target classifier interpretably. Experimental results on MNIST and ImageNet datasets demonstrate that AdvDiff is effective to generate unrestricted adversarial examples, which outperforms GAN-based methods in terms of attack performance and generation quality.
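
The following sketch shows the general shape of adversarial guidance inside a reverse diffusion loop: at each denoising step the sample is nudged along the gradient of the target classifier's log-probability for an attacker-chosen label. The denoiser, classifier, noise schedule, and guidance scale are placeholders, and the two specific guidance techniques proposed in AdvDiff are not reproduced here.

```python
# Schematic reverse-diffusion sampling with adversarial guidance toward an
# attacker-chosen target label. `denoiser` and `classifier` are placeholder
# callables; the noise schedule and guidance scale are arbitrary choices.
import torch

def adversarial_guided_sampling(denoiser, classifier, target_label: int,
                                shape=(1, 3, 32, 32), steps: int = 50,
                                guidance_scale: float = 2.0) -> torch.Tensor:
    x_t = torch.randn(shape)
    for t in reversed(range(steps)):
        with torch.no_grad():
            x_t = denoiser(x_t, t)                 # one DDPM-style denoising step
        # Adversarial guidance: step toward higher log-probability of the
        # target label under the victim classifier.
        x_t = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_t), dim=-1)
        grad = torch.autograd.grad(log_probs[:, target_label].sum(), x_t)[0]
        x_t = (x_t + guidance_scale * grad).detach()
        if t > 0:
            x_t = x_t + 0.01 * torch.randn_like(x_t)   # re-inject a little noise
    return x_t
```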

A faster and simpler algorithm for learning shallow networks

  • paper_url: http://arxiv.org/abs/2307.12496
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Sitan Chen, Shyam Narayanan
  • for: Learning a linear combination of $k$ ReLU activations given labeled examples drawn from the standard $d$-dimensional Gaussian distribution.
  • methods: Builds on the algorithm of Chen et al., which runs in $\text{poly}(d,1/\varepsilon)$ time for $k = O(1)$ and learns over multiple stages.
  • results: Shows that a much simpler one-stage version of the algorithm suffices, and that its runtime is only $(d/\varepsilon)^{O(k^2)}$.
    Abstract We revisit the well-studied problem of learning a linear combination of $k$ ReLU activations given labeled examples drawn from the standard $d$-dimensional Gaussian measure. Chen et al. [CDG+23] recently gave the first algorithm for this problem to run in $\text{poly}(d,1/\varepsilon)$ time when $k = O(1)$, where $\varepsilon$ is the target error. More precisely, their algorithm runs in time $(d/\varepsilon)^{\mathrm{quasipoly}(k)}$ and learns over multiple stages. Here we show that a much simpler one-stage version of their algorithm suffices, and moreover its runtime is only $(d/\varepsilon)^{O(k^2)}$.

Learning Universal and Robust 3D Molecular Representations with Graph Convolutional Networks

  • paper_url: http://arxiv.org/abs/2307.12491
  • repo_url: None
  • paper_authors: Shuo Zhang, Yang Liu, Li Xie, Lei Xie
  • for: Learning accurate molecular representations that account for both chemical and geometric features.
  • methods: A Directional Node Pair (DNP) descriptor built on graph representations of 3D molecules, and a Robust Molecular Graph Convolutional Network (RoM-GCN) that takes both node and edge features into account.
  • results: Evaluations on protein and small-molecule datasets show that the DNP descriptor robustly incorporates 3D geometric information and that RoM-GCN outperforms all compared baselines.
    Abstract To learn accurate representations of molecules, it is essential to consider both chemical and geometric features. To encode geometric information, many descriptors have been proposed in constrained circumstances for specific types of molecules and do not have the properties to be ``robust": 1. Invariant to rotations and translations; 2. Injective when embedding molecular structures. In this work, we propose a universal and robust Directional Node Pair (DNP) descriptor based on the graph representations of 3D molecules. Our DNP descriptor is robust compared to previous ones and can be applied to multiple molecular types. To combine the DNP descriptor and chemical features in molecules, we construct the Robust Molecular Graph Convolutional Network (RoM-GCN) which is capable to take both node and edge features into consideration when generating molecule representations. We evaluate our model on protein and small molecule datasets. Our results validate the superiority of the DNP descriptor in incorporating 3D geometric information of molecules. RoM-GCN outperforms all compared baselines.

Learning Resource Allocation Policy: Vertex-GNN or Edge-GNN?

  • paper_url: http://arxiv.org/abs/2307.12480
  • repo_url: None
  • paper_authors: Yao Peng, Jia Guo, Chenyang Yang
  • for: Studies learning wireless resource allocation policies with graph neural networks (GNNs).
  • methods: Analyzes the expressive power of Vertex-GNNs and Edge-GNNs for learning three representative wireless policies: link scheduling, power control, and precoding.
  • results: The expressive power of the GNNs depends on the linearity and output dimensions of the processing and combination functions. With linear processors, Vertex-GNNs cannot differentiate all channel matrices due to the loss of channel information, whereas Edge-GNNs can; for the precoding policy, even Vertex-GNNs with non-linear processors may lack expressive power because of dimension compression. The paper provides necessary conditions for GNNs to learn the precoding policy well, and simulations show that Edge-GNNs match the performance of Vertex-GNNs with much lower training and inference time.
    Abstract Graph neural networks (GNNs) update the hidden representations of vertices (called Vertex-GNNs) or hidden representations of edges (called Edge-GNNs) by processing and pooling the information of neighboring vertices and edges and combining to incorporate graph topology. When learning resource allocation policies, GNNs cannot perform well if their expressive power is weak, i.e., if they cannot differentiate all input features such as channel matrices. In this paper, we analyze the expressive power of the Vertex-GNNs and Edge-GNNs for learning three representative wireless policies: link scheduling, power control, and precoding policies. We find that the expressive power of the GNNs depends on the linearity and output dimensions of the processing and combination functions. When linear processors are used, the Vertex-GNNs cannot differentiate all channel matrices due to the loss of channel information, while the Edge-GNNs can. When learning the precoding policy, even the Vertex-GNNs with non-linear processors may not have strong expressive ability due to the dimension compression. We proceed to provide necessary conditions for the GNNs to learn the precoding policy well. Simulation results validate the analyses and show that the Edge-GNNs can achieve the same performance as the Vertex-GNNs with much lower training and inference time.

Model-free generalized fiducial inference

  • paper_url: http://arxiv.org/abs/2307.12472
  • repo_url: None
  • paper_authors: Jonathan P Williams
  • for: Developing safe and reliable methods for uncertainty quantification in machine learning.
  • methods: A model-free statistical framework for imprecise probabilistic prediction inference, which yields prediction sets with finite-sample control of type 1 errors while offering more versatile tools for imprecise probabilistic reasoning.
  • results: Proposes a precise probabilistic approximation to the belief/plausibility measure pair, namely a probability measure in the credal set that is optimal in some sense, a resolution needed for broader adoption of imprecise probabilistic approaches to inference.
    Abstract Motivated by the need for the development of safe and reliable methods for uncertainty quantification in machine learning, I propose and develop ideas for a model-free statistical framework for imprecise probabilistic prediction inference. This framework facilitates uncertainty quantification in the form of prediction sets that offer finite sample control of type 1 errors, a property shared with conformal prediction sets, but this new approach also offers more versatile tools for imprecise probabilistic reasoning. Furthermore, I propose and consider the theoretical and empirical properties of a precise probabilistic approximation to the model-free imprecise framework. Approximating a belief/plausibility measure pair by an [optimal in some sense] probability measure in the credal set is a critical resolution needed for the broader adoption of imprecise probabilistic approaches to inference in statistical and machine learning communities. It is largely undetermined in the statistical and machine learning literatures, more generally, how to properly quantify uncertainty in that there is no generally accepted standard of accountability of stated uncertainties. The research I present in this manuscript is aimed at motivating a framework for statistical inference with reliability and accountability as the guiding principles.

Rethinking Data Distillation: Do Not Overlook Calibration

  • paper_url: http://arxiv.org/abs/2307.12463
  • repo_url: https://github.com/dongyaozhu/calibrate-networks-trained-on-distilled-datasets
  • paper_authors: Dongyao Zhu, Bowen Lei, Jie Zhang, Yanbo Fang, Ruqi Zhang, Yiqun Xie, Dongkuan Xu
  • for: Addresses the over-confident outputs of neural networks trained on distilled data, proposing two new calibration methods: Masked Temperature Scaling (MTS) and Masked Distillation Training (MDT).
  • methods: Trains neural networks on distilled data and examines why existing calibration methods such as temperature scaling and mixup fail to calibrate them, then introduces masked variants of temperature scaling and distillation training.
  • results: Masked Temperature Scaling (MTS) and Masked Distillation Training (MDT) calibrate networks trained on distilled data more effectively while maintaining the efficiency of dataset distillation.
    Abstract Neural networks trained on distilled data often produce over-confident output and require correction by calibration methods. Existing calibration methods such as temperature scaling and mixup work well for networks trained on original large-scale data. However, we find that these methods fail to calibrate networks trained on data distilled from large source datasets. In this paper, we show that distilled data lead to networks that are not calibratable due to (i) a more concentrated distribution of the maximum logits and (ii) the loss of information that is semantically meaningful but unrelated to classification tasks. To address this problem, we propose Masked Temperature Scaling (MTS) and Masked Distillation Training (MDT) which mitigate the limitations of distilled data and achieve better calibration results while maintaining the efficiency of dataset distillation.
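
For reference, the baseline that Masked Temperature Scaling extends is standard temperature scaling, which fits a single temperature on held-out logits by minimizing negative log-likelihood; the masking strategy itself is not shown here.

```python
# Standard temperature scaling: fit a single temperature T on held-out logits
# by minimizing NLL, then divide test logits by T before the softmax.
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor,
                    iters: int = 200, lr: float = 0.01) -> float:
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return float(log_t.exp())

# Usage: probs = torch.softmax(test_logits / T, dim=-1), with T from fit_temperature.
```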

Rates of Approximation by ReLU Shallow Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12461
  • repo_url: None
  • paper_authors: Tong Mao, Ding-Xuan Zhou
  • for: Investigates the efficiency of shallow neural networks with one hidden layer in approximating functions from Hölder spaces.
  • methods: Uses ReLU neural networks with $m$ hidden neurons to approximate functions from $W_\infty^r([-1, 1]^d)$ and provides rates of uniform approximation.
  • results: Shows that ReLU shallow neural networks can uniformly approximate functions from $W_\infty^r([-1, 1]^d)$ with rates $O((\log m)^{\frac{1}{2}+d}\,m^{-\frac{r}{d}\frac{d+2}{d+4}})$ when $r<d/2+2$, which is very close to the optimal rate $O(m^{-\frac{r}{d}})$ when the dimension $d$ is large.
    Abstract Neural networks activated by the rectified linear unit (ReLU) play a central role in the recent development of deep learning. The topic of approximating functions from H\"older spaces by these networks is crucial for understanding the efficiency of the induced learning algorithms. Although the topic has been well investigated in the setting of deep neural networks with many layers of hidden neurons, it is still open for shallow networks having only one hidden layer. In this paper, we provide rates of uniform approximation by these networks. We show that ReLU shallow neural networks with $m$ hidden neurons can uniformly approximate functions from the H\"older space $W_\infty^r([-1, 1]^d)$ with rates $O((\log m)^{\frac{1}{2}+d}\,m^{-\frac{r}{d}\frac{d+2}{d+4}})$ when $r<d/2+2$, which is very close to the optimal rate $O(m^{-\frac{r}{d}})$ when the dimension $d$ is large.

Information-theoretic Analysis of Test Data Sensitivity in Uncertainty

  • paper_url: http://arxiv.org/abs/2307.12456
  • repo_url: None
  • paper_authors: Futoshi Futami, Tomoharu Iwata
  • for: Quantifying uncertainty in Bayesian inference and analyzing its two types: aleatoric and epistemic uncertainty.
  • methods: Rigorously decomposes the predictive uncertainty of Bayesian inference into two components, representing the inherent randomness of the data-generating process and the variability due to insufficient data, using an information-theoretic analysis.
  • results: Successfully defines uncertainty sensitivity using information-theoretic quantities, extends the existing analysis of Bayesian meta-learning, and shows novel sensitivities among tasks for the first time.
    Abstract Bayesian inference is often utilized for uncertainty quantification tasks. A recent analysis by Xu and Raginsky 2022 rigorously decomposed the predictive uncertainty in Bayesian inference into two uncertainties, called aleatoric and epistemic uncertainties, which represent the inherent randomness in the data-generating process and the variability due to insufficient data, respectively. They analyzed those uncertainties in an information-theoretic way, assuming that the model is well-specified and treating the model's parameters as latent variables. However, the existing information-theoretic analysis of uncertainty cannot explain the widely believed property of uncertainty, known as the sensitivity between the test and training data. It implies that when test data are similar to training data in some sense, the epistemic uncertainty should become small. In this work, we study such uncertainty sensitivity using our novel decomposition method for the predictive uncertainty. Our analysis successfully defines such sensitivity using information-theoretic quantities. Furthermore, we extend the existing analysis of Bayesian meta-learning and show the novel sensitivities among tasks for the first time.

DiAMoNDBack: Diffusion-denoising Autoregressive Model for Non-Deterministic Backmapping of Cα Protein Traces

  • paper_url: http://arxiv.org/abs/2307.12451
  • repo_url: https://github.com/ferg-lab/diamondback
  • paper_authors: Michael S. Jones, Kirill Shmilovich, Andrew L. Ferguson
  • for: Restoring all-atom detail to coarse-grained protein models, which enable simulation of long-timescale processes such as aggregation and folding.
  • methods: DiAMoNDBack (Diffusion-denoising Autoregressive Model for Non-Deterministic Backmapping), which recovers all-atom resolution from coarse-grained representations through a denoising diffusion process conditioned on the Cα trace and the local protein structure.
  • results: DiAMoNDBack achieves state-of-the-art reconstruction performance in terms of correct bond formation, avoidance of side-chain clashes, and diversity of generated side-chain configurations, and transfers across different protein structures and simulation data.
    Abstract Coarse-grained molecular models of proteins permit access to length and time scales unattainable by all-atom models and the simulation of processes that occur on long-time scales such as aggregation and folding. The reduced resolution realizes computational accelerations but an atomistic representation can be vital for a complete understanding of mechanistic details. Backmapping is the process of restoring all-atom resolution to coarse-grained molecular models. In this work, we report DiAMoNDBack (Diffusion-denoising Autoregressive Model for Non-Deterministic Backmapping) as an autoregressive denoising diffusion probability model to restore all-atom details to coarse-grained protein representations retaining only C{\alpha} coordinates. The autoregressive generation process proceeds from the protein N-terminus to C-terminus in a residue-by-residue fashion conditioned on the C{\alpha} trace and previously backmapped backbone and side chain atoms within the local neighborhood. The local and autoregressive nature of our model makes it transferable between proteins. The stochastic nature of the denoising diffusion process means that the model generates a realistic ensemble of backbone and side chain all-atom configurations consistent with the coarse-grained C{\alpha} trace. We train DiAMoNDBack over 65k+ structures from Protein Data Bank (PDB) and validate it in applications to a hold-out PDB test set, intrinsically-disordered protein structures from the Protein Ensemble Database (PED), molecular dynamics simulations of fast-folding mini-proteins from DE Shaw Research, and coarse-grained simulation data. We achieve state-of-the-art reconstruction performance in terms of correct bond formation, avoidance of side chain clashes, and diversity of the generated side chain configurational states. We make DiAMoNDBack model publicly available as a free and open source Python package.

ProtoFL: Unsupervised Federated Learning via Prototypical Distillation

  • paper_url: http://arxiv.org/abs/2307.12450
  • repo_url: None
  • paper_authors: Hansol Kim, Youngjun Kwak, Minyoung Jung, Jinho Shin, Youngsung Kim, Changick Kim
  • for: Enhancing data privacy preservation and the performance of authentication systems
  • methods: Proposes prototypical-representation-distillation-based unsupervised federated learning (ProtoFL) and a local one-class classifier based on normalizing flows
  • results: Demonstrates on five widely used benchmark datasets that the proposed framework outperforms previous methods in the literature
    Abstract Federated learning (FL) is a promising approach for enhancing data privacy preservation, particularly for authentication systems. However, limited round communications, scarce representation, and scalability pose significant challenges to its deployment, hindering its full potential. In this paper, we propose 'ProtoFL', Prototypical Representation Distillation based unsupervised Federated Learning to enhance the representation power of a global model and reduce round communication costs. Additionally, we introduce a local one-class classifier based on normalizing flows to improve performance with limited data. Our study represents the first investigation of using FL to improve one-class classification performance. We conduct extensive experiments on five widely used benchmarks, namely MNIST, CIFAR-10, CIFAR-100, ImageNet-30, and Keystroke-Dynamics, to demonstrate the superior performance of our proposed framework over previous methods in the literature.

WEPRO: Weight Prediction for Efficient Optimization of Hybrid Quantum-Classical Algorithms

  • paper_url: http://arxiv.org/abs/2307.12449
  • repo_url: None
  • paper_authors: Satwik Kundu, Debarshi Kundu, Swaroop Ghosh
  • for: Accelerating the training of variational quantum algorithms such as quantum neural networks, VQE, and QAOA, while improving their accuracy and efficiency.
  • methods: Proposes WEPRO, which exploits regular trends in the parameter weights to accelerate the training of variational quantum circuits, with two prediction techniques: Naive Prediction and Adaptive Prediction.
  • results: Training multiple quantum neural network models shows that WEPRO speeds up training by roughly 2.25x while improving accuracy and loss with low storage and computational overheads; it also accelerates VQE and QAOA training and improves their accuracy.
    Abstract The exponential run time of quantum simulators on classical machines and long queue depths and high costs of real quantum devices present significant challenges in the effective training of Variational Quantum Algorithms (VQAs) like Quantum Neural Networks (QNNs), Variational Quantum Eigensolver (VQE) and Quantum Approximate Optimization Algorithm (QAOA). To address these limitations, we propose a new approach, WEPRO (Weight Prediction), which accelerates the convergence of VQAs by exploiting regular trends in the parameter weights. We introduce two techniques for optimal prediction performance namely, Naive Prediction (NaP) and Adaptive Prediction (AdaP). Through extensive experimentation and training of multiple QNN models on various datasets, we demonstrate that WEPRO offers a speedup of approximately $2.25\times$ compared to standard training methods, while also providing improved accuracy (up to $2.3\%$ higher) and loss (up to $6.1\%$ lower) with low storage and computational overheads. We also evaluate WEPRO's effectiveness in VQE for molecular ground-state energy estimation and in QAOA for graph MaxCut. Our results show that WEPRO leads to speed improvements of up to $3.1\times$ for VQE and $2.91\times$ for QAOA, compared to traditional optimization techniques, while using up to $3.3\times$ less number of shots (i.e., repeated circuit executions) per training iteration.
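
The core idea of predicting parameter weights from their optimization trend can be sketched as a simple per-parameter linear extrapolation over a short history, as below. The paper's actual Naive and Adaptive Prediction rules may differ; this only illustrates the mechanism.

```python
# Illustrative weight-prediction step in the spirit of WEPRO: keep a short
# history of parameter vectors, fit a per-parameter linear trend, and jump
# ahead by `horizon` optimization steps along that trend.
import numpy as np
from collections import deque

class WeightPredictor:
    def __init__(self, history_len: int = 5, horizon: int = 3):
        self.history = deque(maxlen=history_len)
        self.horizon = horizon

    def update(self, params: np.ndarray) -> None:
        self.history.append(np.asarray(params, dtype=float).copy())

    def predict(self) -> np.ndarray:
        if len(self.history) < 2:
            return self.history[-1].copy()
        hist = np.stack(self.history)                 # (T, n_params)
        t = np.arange(len(hist))
        slope = np.polyfit(t, hist, deg=1)[0]         # per-parameter slope
        return hist[-1] + self.horizon * slope

# Inside a VQA training loop one would call predictor.update(theta) after each
# optimizer step and periodically replace theta with predictor.predict().
```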

Multifidelity Covariance Estimation via Regression on the Manifold of Symmetric Positive Definite Matrices

  • paper_url: http://arxiv.org/abs/2307.12438
  • repo_url: None
  • paper_authors: Aimee Maurais, Terrence Alsup, Benjamin Peherstorfer, Youssef Marzouk
  • for: Proposes a multifidelity estimator of covariance matrices.
  • methods: Formulates the estimator as the solution to a regression problem on the manifold of symmetric positive definite matrices.
  • results: Numerical examples show that the estimator can reduce squared estimation error by up to an order of magnitude relative to single-fidelity and other multifidelity estimators, while preserving positive definiteness, making it compatible with downstream tasks such as data assimilation and metric learning.
    Abstract We introduce a multifidelity estimator of covariance matrices formulated as the solution to a regression problem on the manifold of symmetric positive definite matrices. The estimator is positive definite by construction, and the Mahalanobis distance minimized to obtain it possesses properties which enable practical computation. We show that our manifold regression multifidelity (MRMF) covariance estimator is a maximum likelihood estimator under a certain error model on manifold tangent space. More broadly, we show that our Riemannian regression framework encompasses existing multifidelity covariance estimators constructed from control variates. We demonstrate via numerical examples that our estimator can provide significant decreases, up to one order of magnitude, in squared estimation error relative to both single-fidelity and other multifidelity covariance estimators. Furthermore, preservation of positive definiteness ensures that our estimator is compatible with downstream tasks, such as data assimilation and metric learning, in which this property is essential.

A Generalized Schwarz-type Non-overlapping Domain Decomposition Method using Physics-constrained Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12435
  • repo_url: https://github.com/hipersimlab/pecann
  • paper_authors: Shamsulhaq Basir, Inanc Senocak
  • for: Solving forward and inverse problems involving partial differential equations (PDEs) with a meshless Schwarz-type non-overlapping domain decomposition method based on artificial neural networks.
  • methods: A generalized Robin-type interface condition in which unique Robin parameters are assigned to each subdomain and learned to minimize the mismatch on the interface; each subdomain solution is represented by an independent neural network trained to minimize the loss on the governing PDE while strictly enforcing boundary and interface conditions through an augmented Lagrangian formalism.
  • results: Extensive experiments on forward and inverse problems, including one-way and two-way decompositions with crosspoints, demonstrate the versatility and performance of the approach; the learned Robin parameters adapt to the local behavior of the solution, the domain partitioning, and the subdomain location relative to the overall domain.
    Abstract We present a meshless Schwarz-type non-overlapping domain decomposition method based on artificial neural networks for solving forward and inverse problems involving partial differential equations (PDEs). To ensure the consistency of solutions across neighboring subdomains, we adopt a generalized Robin-type interface condition, assigning unique Robin parameters to each subdomain. These subdomain-specific Robin parameters are learned to minimize the mismatch on the Robin interface condition, facilitating efficient information exchange during training. Our method is applicable to both the Laplace's and Helmholtz equations. It represents local solutions by an independent neural network model which is trained to minimize the loss on the governing PDE while strictly enforcing boundary and interface conditions through an augmented Lagrangian formalism. A key strength of our method lies in its ability to learn a Robin parameter for each subdomain, thereby enhancing information exchange with its neighboring subdomains. We observe that the learned Robin parameters adapt to the local behavior of the solution, domain partitioning and subdomain location relative to the overall domain. Extensive experiments on forward and inverse problems, including one-way and two-way decompositions with crosspoints, demonstrate the versatility and performance of our proposed approach.
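
A minimal sketch of a learnable Robin-type interface mismatch between two subdomain networks is given below, assuming PyTorch networks u_i and u_j evaluated at shared interface points with outward normals; the paper's generalized Robin condition and augmented Lagrangian treatment are more involved than this.

```python
# Sketch of a learnable Robin-type interface mismatch between two subdomain
# networks u_i and u_j at shared interface points x_iface with outward normals.
# alpha_i is the Robin parameter assigned to subdomain i and is learned jointly
# with the network weights; the augmented Lagrangian terms are omitted.
import torch

def robin_trace(u_net, x: torch.Tensor, normal: torch.Tensor,
                alpha: torch.Tensor) -> torch.Tensor:
    """Return du/dn + alpha * u evaluated at the interface points x."""
    x = x.detach().requires_grad_(True)
    u = u_net(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    dudn = (grad_u * normal).sum(dim=-1, keepdim=True)
    return dudn + alpha * u

def robin_interface_loss(u_i, u_j, x_iface: torch.Tensor,
                         normal: torch.Tensor, alpha_i: torch.Tensor) -> torch.Tensor:
    """Mismatch of the Robin trace across the interface, using subdomain i's
    learnable Robin parameter for both traces."""
    return (robin_trace(u_i, x_iface, normal, alpha_i)
            - robin_trace(u_j, x_iface, normal, alpha_i)).pow(2).mean()

# Example: alpha_i = torch.nn.Parameter(torch.tensor(1.0))
```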

Augmented Box Replay: Overcoming Foreground Shift for Incremental Object Detection

  • paper_url: http://arxiv.org/abs/2307.12427
  • repo_url: https://github.com/YuyangSunshine/ABR_IOD
  • paper_authors: Liu Yuyang, Cong Yang, Goswami Dipam, Liu Xialei, Joost van de Weijer
  • for: Addressing catastrophic forgetting in incremental learning, specifically in incremental object detection (IOD), where images from previous tasks are replayed together with current task images.
  • methods: A novel and efficient Augmented Box Replay (ABR) method that stores and replays only the foreground objects of previous tasks, thereby circumventing the foreground shift problem, together with an Attentive RoI Distillation loss that constrains the current model to focus on the most important information from the old model.
  • results: Experiments show that ABR significantly reduces forgetting of previous classes while maintaining high plasticity on current classes, considerably lowers storage requirements, and achieves state-of-the-art performance on the Pascal-VOC and COCO datasets.
    Abstract In incremental learning, replaying stored samples from previous tasks together with current task samples is one of the most efficient approaches to address catastrophic forgetting. However, unlike incremental classification, image replay has not been successfully applied to incremental object detection (IOD). In this paper, we identify the overlooked problem of foreground shift as the main reason for this. Foreground shift only occurs when replaying images of previous tasks and refers to the fact that their background might contain foreground objects of the current task. To overcome this problem, a novel and efficient Augmented Box Replay (ABR) method is developed that only stores and replays foreground objects and thereby circumvents the foreground shift problem. In addition, we propose an innovative Attentive RoI Distillation loss that uses spatial attention from region-of-interest (RoI) features to constrain current model to focus on the most important information from old model. ABR significantly reduces forgetting of previous classes while maintaining high plasticity in current classes. Moreover, it considerably reduces the storage requirements when compared to standard image replay. Comprehensive experiments on Pascal-VOC and COCO datasets support the state-of-the-art performance of our model.

  • paper_url: http://arxiv.org/abs/2307.12417
  • repo_url: None
  • paper_authors: Kasidis Arunruangsirilert, Jiro Katto
  • for: Predicting the future uplink throughput of User Equipment (UE) in 5G NR networks to maximize users' quality of experience (QoE).
  • methods: A ConvLSTM neural network that predicts future uplink throughput from past uplink throughput and RF parameters, trained on drive-test data from commercial 5G SA networks and restricted to information available via the Android API.
  • results: The model reaches an average prediction accuracy of 98.9% with an average RMSE of 1.80 Mbps across all unseen evaluation scenarios.
    Abstract While the 5G New Radio (NR) network promises a huge uplift of the uplink throughput, the improvement can only be seen when the User Equipment (UE) is connected to the high-frequency millimeter wave (mmWave) band. With the rise of uplink-intensive smartphone applications such as the real-time transmission of UHD 4K/8K videos, and Virtual Reality (VR)/Augmented Reality (AR) contents, uplink throughput prediction plays a huge role in maximizing the users' quality of experience (QoE). In this paper, we propose using a ConvLSTM-based neural network to predict the future uplink throughput based on past uplink throughput and RF parameters. The network is trained using the data from real-world drive tests on commercial 5G SA networks while riding commuter trains, which accounted for various frequency bands, handover, and blind spots. To make sure our model can be practically implemented, we then limited our model to only use the information available via Android API, then evaluate our model using the data from both commuter trains and other methods of transportation. The results show that our model reaches an average prediction accuracy of 98.9\% with an average RMSE of 1.80 Mbps across all unseen evaluation scenarios.

A Machine Learning Approach to Two-Stage Adaptive Robust Optimization

  • paper_url: http://arxiv.org/abs/2307.12409
  • repo_url: https://github.com/molyswu/hand_detection
  • paper_authors: Dimitris Bertsimas, Cheol Woo Kim
  • for: Solving two-stage linear adaptive robust optimization (ARO) problems with binary here-and-now variables and polyhedral uncertainty sets.
  • methods: Encodes the optimal here-and-now decisions, the associated worst-case scenarios, and the optimal wait-and-see decisions into a strategy; solves similar ARO instances in advance with a column-and-constraint generation algorithm to build a training set, and trains a machine learning model to predict high-quality strategies.
  • results: Applied to facility location, multi-item inventory control, and unit commitment problems, the approach solves ARO problems drastically faster than state-of-the-art algorithms with high accuracy.
    Abstract We propose an approach based on machine learning to solve two-stage linear adaptive robust optimization (ARO) problems with binary here-and-now variables and polyhedral uncertainty sets. We encode the optimal here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the optimal wait-and-see decisions into what we denote as the strategy. We solve multiple similar ARO instances in advance using the column and constraint generation algorithm and extract the optimal strategies to generate a training set. We train a machine learning model that predicts high-quality strategies for the here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the wait-and-see decisions. We also introduce an algorithm to reduce the number of different target classes the machine learning algorithm needs to be trained on. We apply the proposed approach to the facility location, the multi-item inventory control and the unit commitment problems. Our approach solves ARO problems drastically faster than the state-of-the-art algorithms with high accuracy.

Optimal Control of Multiclass Fluid Queueing Networks: A Machine Learning Approach

  • paper_url: http://arxiv.org/abs/2307.12405
  • repo_url: None
  • paper_authors: Dimitris Bertsimas, Cheol Woo Kim
  • for: A machine learning approach to the optimal control of multiclass fluid queueing networks (MFQNETs) that provides explicit and insightful control policies.
  • methods: Uses Optimal Classification Trees with hyperplane splits (OCT-H), trained on numerical solutions of MFQNET control problems, to learn explicit control policies.
  • results: The learned policies achieve 100% accuracy on the test set for networks with up to 33 servers and 99 classes, and online application takes only milliseconds.
    Abstract We propose a machine learning approach to the optimal control of multiclass fluid queueing networks (MFQNETs) that provides explicit and insightful control policies. We prove that a threshold type optimal policy exists for MFQNET control problems, where the threshold curves are hyperplanes passing through the origin. We use Optimal Classification Trees with hyperplane splits (OCT-H) to learn an optimal control policy for MFQNETs. We use numerical solutions of MFQNET control problems as a training set and apply OCT-H to learn explicit control policies. We report experimental results with up to 33 servers and 99 classes that demonstrate that the learned policies achieve 100\% accuracy on the test set. While the offline training of OCT-H can take days in large networks, the online application takes milliseconds.

Uncertainty-aware Grounded Action Transformation towards Sim-to-Real Transfer for Traffic Signal Control

  • paper_url: http://arxiv.org/abs/2307.12388
  • repo_url: None
  • paper_authors: Longchao Da, Hao Mei, Romir Sharma, Hua Wei
  • for: Improving the real-world performance of reinforcement learning (RL) for traffic signal control
  • methods: A simulation-to-real-world (sim-to-real) transfer approach that dynamically transforms actions in simulation with uncertainty, mitigating the domain gap in transition dynamics when transferring a policy learned in simulation to the real environment
  • results: Evaluated in a simulated traffic environment, UGAT significantly improves the performance of the transferred RL policy in the real world
    Abstract Traffic signal control (TSC) is a complex and important task that affects the daily lives of millions of people. Reinforcement Learning (RL) has shown promising results in optimizing traffic signal control, but current RL-based TSC methods are mainly trained in simulation and suffer from the performance gap between simulation and the real world. In this paper, we propose a simulation-to-real-world (sim-to-real) transfer approach called UGAT, which transfers a learned policy trained from a simulated environment to a real-world environment by dynamically transforming actions in the simulation with uncertainty to mitigate the domain gap of transition dynamics. We evaluate our method on a simulated traffic environment and show that it significantly improves the performance of the transferred RL policy in the real world.

In-Context Learning in Large Language Models Learns Label Relationships but Is Not Conventional Learning

  • paper_url: http://arxiv.org/abs/2307.12375
  • repo_url: None
  • paper_authors: Jannik Kossen, Tom Rainforth, Yarin Gal
  • for: Investigating the in-context learning (ICL) ability of large language models (LLMs) on downstream tasks, in particular how the input-label relationships of in-context examples affect LLM predictions.
  • methods: Studies how labels of in-context examples affect predictions, how label relationships learned during pre-training interact with input-label examples provided in context, and how ICL aggregates label information across in-context examples.
  • results: LLMs usually incorporate information from in-context labels, but pre-training and in-context label relationships are treated differently, and the model does not consider all in-context information equally; these findings help in understanding and aligning LLM behavior.
    Abstract The performance of Large Language Models (LLMs) on downstream tasks often improves significantly when including examples of the input-label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works: for example, while Xie et al. (2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022b) argue ICL does not even learn label relationships from in-context examples. In this paper, we study (1) how labels of in-context examples affect predictions, (2) how label relationships learned during pre-training interact with input-label examples provided in-context, and (3) how ICL aggregates label information across in-context examples. Our findings suggests LLMs usually incorporate information from in-context labels, but that pre-training and in-context label relationships are treated differently, and that the model does not consider all in-context information equally. Our results give insights into understanding and aligning LLM behavior.

Assessing Intra-class Diversity and Quality of Synthetically Generated Images in a Biomedical and Non-biomedical Setting

  • paper_url: http://arxiv.org/abs/2308.02505
  • repo_url: None
  • paper_authors: Muhammad Muneeb Saad, Mubashir Husain Rehmani, Ruairi O’Reilly
  • for: Assessing the efficacy of Generative Adversarial Networks (GANs) for augmenting hard-to-obtain biomedical imaging data.
  • methods: Uses the Multi-scale Structural Similarity Index Measure and Cosine Distance to evaluate intra-class diversity, and the Frechet Inception Distance to evaluate the quality of synthetic images.
  • results: Metric scores for diversity and quality vary significantly across biomedical and non-biomedical imaging modalities, and the sample size used also affects the measured quality and diversity of synthetic images.
    Abstract In biomedical image analysis, data imbalance is common across several imaging modalities. Data augmentation is one of the key solutions in addressing this limitation. Generative Adversarial Networks (GANs) are increasingly being relied upon for data augmentation tasks. Biomedical image features are sensitive to evaluating the efficacy of synthetic images. These features can have a significant impact on metric scores when evaluating synthetic images across different biomedical imaging modalities. Synthetically generated images can be evaluated by comparing the diversity and quality of real images. Multi-scale Structural Similarity Index Measure and Cosine Distance are used to evaluate intra-class diversity, while Frechet Inception Distance is used to evaluate the quality of synthetic images. Assessing these metrics for biomedical and non-biomedical imaging is important to investigate an informed strategy in evaluating the diversity and quality of synthetic images. In this work, an empirical assessment of these metrics is conducted for the Deep Convolutional GAN in a biomedical and non-biomedical setting. The diversity and quality of synthetic images are evaluated using different sample sizes. This research intends to investigate the variance in diversity and quality across biomedical and non-biomedical imaging modalities. Results demonstrate that the metrics scores for diversity and quality vary significantly across biomedical-to-biomedical and biomedical-to-non-biomedical imaging modalities.
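
As a concrete example of an intra-class diversity score of the kind used above, the snippet below computes the average pairwise cosine distance between feature embeddings of synthetic images from one class; the feature extractor is assumed rather than specified, and multi-scale SSIM would be applied pairwise in the same fashion.

```python
# Average pairwise cosine distance over feature embeddings of one class's
# synthetic images; larger values indicate higher intra-class diversity.
import torch
import torch.nn.functional as F

def mean_pairwise_cosine_distance(features: torch.Tensor) -> float:
    """features: (N, D) tensor, one row per synthetic image."""
    x = F.normalize(features, dim=1)
    sim = x @ x.t()                                   # (N, N) cosine similarities
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]   # drop self-similarities
    return float((1.0 - off_diag).mean())

# Example with random 512-dim embeddings of 16 synthetic samples:
print("intra-class diversity:", mean_pairwise_cosine_distance(torch.randn(16, 512)))
```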

Early Prediction of Alzheimers Disease Leveraging Symptom Occurrences from Longitudinal Electronic Health Records of US Military Veterans

  • paper_url: http://arxiv.org/abs/2307.12369
  • repo_url: None
  • paper_authors: Rumeng Li, Xun Wang, Dan Berlowitz, Brian Silver, Wen Hu, Heather Keating, Raelene Goodwin, Weisong Liu, Honghuang Lin, Hong Yu
  • for: Early prediction of Alzheimer's disease (AD) is crucial for timely intervention and treatment; this study uses machine learning on longitudinal electronic health records (EHRs) to identify signs and symptoms that can predict AD onset earlier.
  • methods: A case-control design on longitudinal EHRs from the U.S. Department of Veterans Affairs Veterans Health Administration (VHA) from 2004 to 2021; cases are patients diagnosed with AD after 2016 based on ICD-10-CM codes, matched 1:9 with controls on age, sex, and clinical utilization, and the occurrences of AD-related keywords over time are used as predictors in four machine learning models.
  • results: The temporal pattern of AD-related keywords predicts AD diagnosis, with the best model reaching an ROC AUC of 0.997; performance is consistent across age, sex, and race/ethnicity subgroups except for patients younger than 65 (ROC AUC 0.746), suggesting an affordable way to screen large populations for AD risk using EHR notes.
    Abstract Early prediction of Alzheimer's disease (AD) is crucial for timely intervention and treatment. This study aims to use machine learning approaches to analyze longitudinal electronic health records (EHRs) of patients with AD and identify signs and symptoms that can predict AD onset earlier. We used a case-control design with longitudinal EHRs from the U.S. Department of Veterans Affairs Veterans Health Administration (VHA) from 2004 to 2021. Cases were VHA patients with AD diagnosed after 1/1/2016 based on ICD-10-CM codes, matched 1:9 with controls by age, sex and clinical utilization with replacement. We used a panel of AD-related keywords and their occurrences over time in a patient's longitudinal EHRs as predictors for AD prediction with four machine learning models. We performed subgroup analyses by age, sex, and race/ethnicity, and validated the model in a hold-out and "unseen" VHA stations group. Model discrimination, calibration, and other relevant metrics were reported for predictions up to ten years before ICD-based diagnosis. The study population included 16,701 cases and 39,097 matched controls. The average number of AD-related keywords (e.g., "concentration", "speaking") per year increased rapidly for cases as diagnosis approached, from around 10 to over 40, while remaining flat at 10 for controls. The best model achieved high discriminative accuracy (ROCAUC 0.997) for predictions using data from at least ten years before ICD-based diagnoses. The model was well-calibrated (Hosmer-Lemeshow goodness-of-fit p-value = 0.99) and consistent across subgroups of age, sex and race/ethnicity, except for patients younger than 65 (ROCAUC 0.746). Machine learning models using AD-related keywords identified from EHR notes can predict future AD diagnoses, suggesting its potential use for identifying AD risk using EHR notes, offering an affordable way for early screening on large population.

eess.IV - 2023-07-24

Conditional Residual Coding: A Remedy for Bottleneck Problems in Conditional Inter Frame Coding

  • paper_url: http://arxiv.org/abs/2307.12864
  • repo_url: None
  • paper_authors: Fabian Brand, Jürgen Seiler, André Kaup
  • for: Improving the efficiency of neural-network-based video coding by addressing bottleneck problems in conditional inter frame coding.
  • methods: Compares conditional coding with residual coding and proposes a conditional residual coding concept, derived from information-theoretic properties of the conditional coder, to reduce information bottlenecks in the prediction path.
  • results: Theoretical analysis and a practical example show that conditional residual coding significantly reduces the influence of bottlenecks while maintaining the theoretical performance of conditional coding, making it "the best from both worlds" between residual and conditional coding.
    Abstract Conditional coding is a new video coding paradigm enabled by neural-network-based compression. It can be shown that conditional coding is in theory better than the traditional residual coding, which is widely used in video compression standards like HEVC or VVC. However, on closer inspection, it becomes clear that conditional coders can suffer from information bottlenecks in the prediction path, i.e., that due to the data processing inequality not all information from the prediction signal can be passed to the reconstructed signal, thereby impairing the coder performance. In this paper we propose the conditional residual coding concept, which we derive from information theoretical properties of the conditional coder. This coder significantly reduces the influence of bottlenecks, while maintaining the theoretical performance of the conditional coder. We provide a theoretical analysis of the coding paradigm and demonstrate the performance of the conditional residual coder in a practical example. We show that conditional residual coders alleviate the disadvantages of conditional coders while being able to maintain their advantages over residual coders. In the spectrum of residual and conditional coding, we can therefore consider them as ``the best from both worlds''.
    摘要 新的条件编码方式是基于神经网络的压缩,可以证明这种条件编码在理论上比传统的差异编码(如HEVC或VVC中的差异编码)更好。然而,在更加仔细的分析下,可以发现条件编码器可能会在预测路径中遇到信息瓶颈,即由数据处理不对称性导致的信息无法传递到重建信号中,从而影响编码器性能。在这篇论文中,我们提出了条件差异编码概念,该概念基于条件编码器的信息学性质。这种编码器可以减少预测路径中的瓶颈影响,同时保持条件编码器的理论性能。我们对这种编码器进行了理论分析,并在实践中示出了其性能。我们发现,条件差异编码器可以消除条件编码器的缺点,同时保持条件编码器比差异编码器更好的优势。因此,在差异和条件编码之间的谱spectrum中,我们可以视之为“最佳的两个世界”。
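
To make the distinction concrete, here is a toy PyTorch sketch of the conditional-residual idea: the coder transmits the residual x - pred, but both encoder and decoder are additionally conditioned on the prediction signal. Layer sizes and the absence of any entropy model are assumptions for illustration; this is not the paper's codec.

```python
import torch
import torch.nn as nn

class ConditionalResidualCoder(nn.Module):
    """Toy sketch: code the residual x - pred, while letting both encoder and
    decoder also see the prediction (the condition). Hypothetical layer sizes."""
    def __init__(self, ch=64):
        super().__init__()
        # encoder sees the residual AND the prediction (condition)
        self.enc = nn.Sequential(nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 8, 3, padding=1))
        # decoder reconstructs the residual from the latent AND the prediction
        self.dec = nn.Sequential(nn.Conv2d(8 + 3, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x, pred):
        res = x - pred                                  # residual-coding part
        y = self.enc(torch.cat([res, pred], dim=1))     # conditioning on pred
        res_hat = self.dec(torch.cat([y, pred], dim=1))
        return pred + res_hat                           # reconstruction

x, pred = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(ConditionalResidualCoder()(x, pred).shape)        # torch.Size([1, 3, 64, 64])
```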

Spatiotemporal Modeling Encounters 3D Medical Image Analysis: Slice-Shift UNet with Multi-View Fusion

  • paper_url: http://arxiv.org/abs/2307.12853
  • repo_url: None
  • paper_authors: C. I. Ugwu, S. Casarin, O. Lanz
  • for: 这paper的目的是提出一种基于2D Convolutional Neural Networks的多模态脐椎像分割模型,以提高计算医学中的图像分析效能。
  • methods: 这paper使用了一种名为Slice SHift UNet(SSH-UNet)的新模型,它通过在多个视角上进行2D卷积,共同学习多个视角的特征,并通过在层次轴上偏移特征图来重新包含第三维度信息。
  • results: 该paper在Multi-Modality Abdominal Multi-Organ Segmentation(AMOS)和Multi-Atlas Labeling Beyond the Cranial Vault(BTCV) datasets上进行了实验,并证明了SSH-UNet的效果与现有的模型相当,而且更高效。
    Abstract As a fundamental part of computational healthcare, Computer Tomography (CT) and Magnetic Resonance Imaging (MRI) provide volumetric data, making the development of algorithms for 3D image analysis a necessity. Despite being computationally cheap, 2D Convolutional Neural Networks can only extract spatial information. In contrast, 3D CNNs can extract three-dimensional features, but they have higher computational costs and latency, which is a limitation for clinical practice that requires fast and efficient models. Inspired by the field of video action recognition we propose a new 2D-based model dubbed Slice SHift UNet (SSH-UNet) which encodes three-dimensional features at 2D CNN's complexity. More precisely multi-view features are collaboratively learned by performing 2D convolutions along the three orthogonal planes of a volume and imposing a weights-sharing mechanism. The third dimension, which is neglected by the 2D convolution, is reincorporated by shifting a portion of the feature maps along the slices' axis. The effectiveness of our approach is validated in Multi-Modality Abdominal Multi-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) datasets, showing that SSH-UNet is more efficient while on par in performance with state-of-the-art architectures.
    摘要 computer tomography (CT) 和 магнитная резонансная томография (MRI) 提供了体积数据,因此开发三维图像分析算法是必需的基础部分。 although 2D convolutional neural networks (CNNs) 可以提取空间信息,但它们只能提取二维特征。 相比之下,三维 CNNs 可以提取三维特征,但它们的计算成本和延迟更高,这限制了临床实践中的快速和高效模型。 inspirited by the field of video action recognition, we propose a new 2D-based model called Slice SHift UNet (SSH-UNet),它在 2D CNN 的复杂性下编码三维特征。 more precisely, multi-view features are collaboratively learned by performing 2D convolutions along the three orthogonal planes of a volume and imposing a weights-sharing mechanism. the third dimension, which is neglected by the 2D convolution, is reincorporated by shifting a portion of the feature maps along the slices' axis. the effectiveness of our approach is validated in Multi-Modality Abdominal Multi-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) datasets, showing that SSH-UNet is more efficient while on par in performance with state-of-the-art architectures.
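
A minimal sketch of the slice-shift idea, assuming a (B, C, D, H, W) feature map: a fraction of the channels is shifted by one position along the slice axis so that per-slice 2D convolutions can still exchange information across neighbouring slices, mirroring the temporal-shift trick from video models. The fold_div hyper-parameter and the exact shifting pattern are assumptions, not taken from the paper.

```python
import torch

def slice_shift(feat: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of the channels of a (B, C, D, H, W) feature map by +/-1
    along the slice axis D; the remaining channels are left untouched."""
    b, c, d, h, w = feat.shape
    fold = c // fold_div
    out = torch.zeros_like(feat)
    out[:, :fold, 1:] = feat[:, :fold, :-1]                   # shift forward along slices
    out[:, fold:2 * fold, :-1] = feat[:, fold:2 * fold, 1:]   # shift backward
    out[:, 2 * fold:] = feat[:, 2 * fold:]                    # untouched channels
    return out

x = torch.randn(2, 32, 16, 64, 64)   # batch, channels, slices, H, W
print(slice_shift(x).shape)          # torch.Size([2, 32, 16, 64, 64])
```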

Multi-View Vertebra Localization and Identification from CT Images

  • paper_url: http://arxiv.org/abs/2307.12845
  • repo_url: https://github.com/shanghaitech-impact/multi-view-vertebra-localization-and-identification-from-ct-images
  • paper_authors: Han Wu, Jiadong Zhang, Yu Fang, Zhentao Liu, Nizhuan Wang, Zhiming Cui, Dinggang Shen
  • for: 本研究旨在提出一种基于多视图的 vertebra 定位和识别方法,以解决现有方法的大量计算成本和局部信息有限问题。
  • methods: 该方法将3D问题转化为2D定位和识别任务,并采用多视图对准学习策略来学习全局信息。此外,还提出了一种序列损失来保持vertebrae中的序列结构。
  • results: 评估结果表明,只使用两个2D网络,该方法可以准确地定位和识别CT图像中的vertebrae,并在比较现有方法的情况下卓越表现。
    Abstract Accurately localizing and identifying vertebrae from CT images is crucial for various clinical applications. However, most existing efforts are performed on 3D with cropping patch operation, suffering from the large computation costs and limited global information. In this paper, we propose a multi-view vertebra localization and identification from CT images, converting the 3D problem into a 2D localization and identification task on different views. Without the limitation of the 3D cropped patch, our method can learn the multi-view global information naturally. Moreover, to better capture the anatomical structure information from different view perspectives, a multi-view contrastive learning strategy is developed to pre-train the backbone. Additionally, we further propose a Sequence Loss to maintain the sequential structure embedded along the vertebrae. Evaluation results demonstrate that, with only two 2D networks, our method can localize and identify vertebrae in CT images accurately, and outperforms the state-of-the-art methods consistently. Our code is available at https://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images.
    摘要 通过CT图像进行精准地Localizing和识别脊梗是许多临床应用中的关键。然而,大多数现有的尝试都是基于3D的剪辑补丁操作,它们受到大量计算成本和有限的全局信息的限制。在这篇论文中,我们提出了基于多视图的脊梗Localization和识别方法,将3D问题转化为2D的Localization和识别任务。不同于剪辑补丁限制,我们的方法可以自然地学习多视图的全局信息。此外,为了更好地捕捉不同视角的解剖结构信息,我们还提出了一种多视图对比学习策略来预训练脊梗。此外,我们还提出了一种序列损失,以维护链接在脊梗上的序列结构。评估结果表明,只有两个2D网络,我们的方法可以在CT图像中准确地Localizing和识别脊梗,并在状态艺术方法上一致性地表现出优于其他方法。我们的代码可以在https://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images上获取。

Deep Homography Prediction for Endoscopic Camera Motion Imitation Learning

  • paper_url: http://arxiv.org/abs/2307.12792
  • repo_url: None
  • paper_authors: Martin Huber, Sebastien Ourselin, Christos Bergeles, Tom Vercauteren
  • for: 这个研究探讨了透过从逆向录影中学习自动化 Laparoscopic 镜头运动。
  • methods: 该研究提出一种新方法,通过基于 homographies 的对象运动不变图像配准,在图像空间中学习增强外科医生的行为;该方法不做任何几何假设、不需要深度信息,可直接迁移到机器人平台,也不依赖跟随手术器械等人工设定的目标或场景先验。
  • results: 在 Cholec80 和 HeiChole 数据集上相对两个基线取得显著提升,较摄像头运动延续基线提升 47%,并在 AutoLaparo 数据集的公开运动分类标签上验证了摄像头运动预测的正确性。
    Abstract In this work, we investigate laparoscopic camera motion automation through imitation learning from retrospective videos of laparoscopic interventions. A novel method is introduced that learns to augment a surgeon's behavior in image space through object motion invariant image registration via homographies. Contrary to existing approaches, no geometric assumptions are made and no depth information is necessary, enabling immediate translation to a robotic setup. Deviating from the dominant approach in the literature which consist of following a surgical tool, we do not handcraft the objective and no priors are imposed on the surgical scene, allowing the method to discover unbiased policies. In this new research field, significant improvements are demonstrated over two baselines on the Cholec80 and HeiChole datasets, showcasing an improvement of 47% over camera motion continuation. The method is further shown to indeed predict camera motion correctly on the public motion classification labels of the AutoLaparo dataset. All code is made accessible on GitHub.
    摘要 在这项研究中,我们探讨了通过对腹腔镜手术回顾性视频进行模仿学习来实现腹腔镜摄像头运动的自动化。我们提出了一种新方法,利用基于 homographies 的对象运动不变图像配准,在图像空间中学习增强外科医生的行为。与现有方法不同,我们不做任何几何假设,也不需要深度信息,因此可以直接迁移到机器人平台。与文献中以跟随手术器械为主的做法不同,我们没有手工设计目标,也没有对手术场景施加先验,因此方法能够发现无偏的策略。在这一新的研究领域中,我们在 Cholec80 和 HeiChole 数据集上相对两个基线取得显著提升,较摄像头运动延续(camera motion continuation)提升 47%。此外,我们还证明该方法能在 AutoLaparo 数据集的公开运动分类标签上正确预测摄像头运动。所有代码均已公开在 GitHub。
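
For readers unfamiliar with homography-based registration, the snippet below estimates a homography between two frames from matched ORB features and warps one frame onto the other with OpenCV. In the paper a network predicts the homography instead; the estimation shown here is only a stand-in, and the file names are hypothetical.

```python
import cv2
import numpy as np

img1 = cv2.imread("frame_t.png")       # hypothetical file names
img2 = cv2.imread("frame_t1.png")

orb = cv2.ORB_create(1000)
k1, d1 = orb.detectAndCompute(img1, None)
k2, d2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # 3x3 homography

h, w = img2.shape[:2]
warped = cv2.warpPerspective(img1, H, (w, h))          # img1 aligned to img2
```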

Synthetic white balancing for intra-operative hyperspectral imaging

  • paper_url: http://arxiv.org/abs/2307.12791
  • repo_url: None
  • paper_authors: Anisha Bahl, Conor C. Horgan, Mirek Janatka, Oscar J. MacCormac, Philip Noonan, Yijing Xie, Jianrong Qiu, Nicola Cavalcanti, Philipp Fürnstahl, Michael Ebner, Mads S. Bergholt, Jonathan Shapey, Tom Vercauteren
  • For: The paper is written for the purpose of demonstrating the need for in situ white references in hyperspectral imaging for surgical applications, and proposing a novel, sterile, synthetic reference construction algorithm to address this need.
  • Methods: The paper uses a composite image from a video of a standard sterile ruler to create the synthetic reference, and models the reference as the product of independent spatial and spectral components, with a scalar factor accounting for gain, exposure, and light intensity.
  • Results: The paper shows that the synthetic references achieve median pixel-by-pixel errors lower than 6.5% and produce similar reconstructions and errors to an ideal reference, and that the algorithm integrated well into surgical workflow with median pixel-by-pixel errors of 4.77%, while maintaining good spectral and color reconstruction.
    Abstract Hyperspectral imaging shows promise for surgical applications to non-invasively provide spatially-resolved, spectral information. For calibration purposes, a white reference image of a highly-reflective Lambertian surface should be obtained under the same imaging conditions. Standard white references are not sterilizable, and so are unsuitable for surgical environments. We demonstrate the necessity for in situ white references and address this by proposing a novel, sterile, synthetic reference construction algorithm. The use of references obtained at different distances and lighting conditions to the subject were examined. Spectral and color reconstructions were compared with standard measurements qualitatively and quantitatively, using $\Delta E$ and normalised RMSE respectively. The algorithm forms a composite image from a video of a standard sterile ruler, whose imperfect reflectivity is compensated for. The reference is modelled as the product of independent spatial and spectral components, and a scalar factor accounting for gain, exposure, and light intensity. Evaluation of synthetic references against ideal but non-sterile references is performed using the same metrics alongside pixel-by-pixel errors. Finally, intraoperative integration is assessed though cadaveric experiments. Improper white balancing leads to increases in all quantitative and qualitative errors. Synthetic references achieve median pixel-by-pixel errors lower than 6.5% and produce similar reconstructions and errors to an ideal reference. The algorithm integrated well into surgical workflow, achieving median pixel-by-pixel errors of 4.77%, while maintaining good spectral and color reconstruction.
    摘要 高spectral成像显示在手术应用中具有潜在的优势,能够非侵入式地在空间上提供 spectral信息。为了进行准确的均衡,需要在同一种 imaging 条件下获得一个白色参照图像,但标准的白色参照图像不能sterilizable,因此不适用于手术环境。我们提出了一种新的、sterile、Synthetic参照图像建构算法。我们测试了不同距离和照明条件下的参照图像的使用,并与标准测量进行比较。我们使用了ΔE和normalized RMSE两种指标进行评估。我们的算法使用了一个标准 sterile 的测量仪表,并对其进行了补做。参照图像被视为独立的空间和spectral组分的乘积,以及一个权值补做照明、曝光和光强。我们对synthetic参照图像与理想 pero non-sterile 参照图像进行了比较,并使用了相同的指标进行评估。最后,我们通过实验评估了这种算法在手术过程中的integrability。不当的白平衡会导致所有量化和质量错误的增加。synthetic参照图像的 median 像素误差低于6.5%,并且生成了与理想参照图像类似的重建和错误。我们的算法在手术工作流中融合了良好的 spectral和color重建,并且 median 像素误差为4.77%。
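
A minimal sketch of reflectance calibration against a white reference, with the synthetic reference factorised into independent spatial and spectral components times a scalar gain as the abstract describes; array shapes and variable names are assumptions, not the authors' implementation.

```python
import numpy as np

def white_balance(raw, white_ref, dark_ref=None):
    """Standard reflectance calibration of a hyperspectral cube (H, W, bands)
    against a white (and optionally dark) reference image."""
    raw = raw.astype(np.float64)
    white = white_ref.astype(np.float64)
    if dark_ref is not None:
        raw = raw - dark_ref
        white = white - dark_ref
    return np.clip(raw / np.maximum(white, 1e-8), 0.0, None)

# Synthetic reference factorised into independent spatial and spectral parts
spatial = np.random.rand(128, 128, 1)        # e.g. illumination fall-off
spectral = np.random.rand(1, 1, 100)         # e.g. lamp spectrum
gain = 0.9                                   # exposure / intensity scalar
synthetic_ref = gain * spatial * spectral

cube = np.random.rand(128, 128, 100)
reflectance = white_balance(cube, synthetic_ref)
print(reflectance.shape)                     # (128, 128, 100)
```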

ICF-SRSR: Invertible scale-Conditional Function for Self-Supervised Real-world Single Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.12751
  • repo_url: None
  • paper_authors: Reyhaneh Neshatavar, Mohsen Yavartanoo, Sanghyun Son, Kyoung Mu Lee
  • for: 提高单张图像超分辨率(SISR)的性能,不使用任何对应的训练数据。
  • methods: 提出了一种新的可逆扩率函数(ICF),可以扩大输入图像,然后使用不同的扩率条件恢复原始输入图像。基于该ICF,提出了一种新的无监督SISR框架(ICF-SRSR)。
  • results: 经验表明,提出的ICF-SRSR方法在实际世界 scenarios中可以很好地处理SISR任务,并且与现有的监督/无监督方法在公共 benchmark datasets上展现了相似的性能。
    Abstract Single image super-resolution (SISR) is a challenging ill-posed problem that aims to up-sample a given low-resolution (LR) image to a high-resolution (HR) counterpart. Due to the difficulty in obtaining real LR-HR training pairs, recent approaches are trained on simulated LR images degraded by simplified down-sampling operators, e.g., bicubic. Such an approach can be problematic in practice because of the large gap between the synthesized and real-world LR images. To alleviate the issue, we propose a novel Invertible scale-Conditional Function (ICF), which can scale an input image and then restore the original input with different scale conditions. By leveraging the proposed ICF, we construct a novel self-supervised SISR framework (ICF-SRSR) to handle the real-world SR task without using any paired/unpaired training data. Furthermore, our ICF-SRSR can generate realistic and feasible LR-HR pairs, which can make existing supervised SISR networks more robust. Extensive experiments demonstrate the effectiveness of the proposed method in handling SISR in a fully self-supervised manner. Our ICF-SRSR demonstrates superior performance compared to the existing methods trained on synthetic paired images in real-world scenarios and exhibits comparable performance compared to state-of-the-art supervised/unsupervised methods on public benchmark datasets.
    摘要 Single image super-resolution (SISR) 是一个具有挑战性的不定系数问题,旨在将给定的低分辨率 (LR) 图像提升到高分辨率 (HR) 对应的图像。由于实际获得LR-HR训练对的困难,现有的方法通常是通过简化的下采样算法,如比 Example: bicubic,进行训练。这种方法在实践中可能会存在问题,因为生成的Synthesized和实际世界LR图像之间存在很大的差距。为了解决这个问题,我们提出了一种新的减少函数 (ICF),可以将输入图像缩放,然后使用不同的缩放比例来恢复原始输入。通过利用我们提出的ICF,我们建立了一种新的自动编码SR框架 (ICF-SRSR),可以在不使用任何paired/unpaired训练数据的情况下进行SR任务。此外,我们的ICF-SRSR可以生成可靠和可行的LR-HR对,这可以使现有的supervised SR网络更加可靠。我们的实验表明,我们的ICF-SRSR可以在不使用任何训练数据的情况下处理SR任务,并且在实际世界 scenario 中表现出色。我们的ICF-SRSR在与现有的方法进行比较时,在公共的benchmark datasets上表现出了相当的性能。

Dense Transformer based Enhanced Coding Network for Unsupervised Metal Artifact Reduction

  • paper_url: http://arxiv.org/abs/2307.12717
  • repo_url: None
  • paper_authors: Wangduo Xie, Matthew B. Blaschko
  • for: 针对CT图像损坏的金属artifacts,提高临床诊断的精度。
  • methods: 提出了一种基于Dense Transformer的增强编码网络(DTEC-Net),利用高阶杂分解编码器和转换器来获得长距离匹配的紧密编码序列。然后,提出了第二阶杂分解方法来改进密集序列的解码过程。
  • results: 对一个标准测试集进行了广泛的实验和模型说明,证明DTEC-Net的有效性,其在降低金属artifacts的同时保留了更多的细节Texture。与之前的状态统计方法相比,DTEC-Net显著提高了图像质量。
    Abstract CT images corrupted by metal artifacts have serious negative effects on clinical diagnosis. Considering the difficulty of collecting paired data with ground truth in clinical settings, unsupervised methods for metal artifact reduction are of high interest. However, it is difficult for previous unsupervised methods to retain structural information from CT images while handling the non-local characteristics of metal artifacts. To address these challenges, we proposed a novel Dense Transformer based Enhanced Coding Network (DTEC-Net) for unsupervised metal artifact reduction. Specifically, we introduce a Hierarchical Disentangling Encoder, supported by the high-order dense process, and transformer to obtain densely encoded sequences with long-range correspondence. Then, we present a second-order disentanglement method to improve the dense sequence's decoding process. Extensive experiments and model discussions illustrate DTEC-Net's effectiveness, which outperforms the previous state-of-the-art methods on a benchmark dataset, and greatly reduces metal artifacts while restoring richer texture details.
    摘要 受金属伪影污染的CT图像会严重影响临床诊断。考虑到临床环境中难以获得带有真值的配对数据,无监督的金属伪影消除方法备受关注。然而,已有的无监督方法难以在处理金属伪影的非局部特性的同时保留CT图像的结构信息。为此,我们提出了一种基于密集Transformer的增强编码网络(DTEC-Net),用于无监督金属伪影消除。具体而言,我们引入由高阶密集过程支持的分层解耦编码器和Transformer,以获得具有长程对应关系的密集编码序列;随后提出二阶解耦方法来改进密集序列的解码过程。大量实验和模型分析表明,DTEC-Net在基准数据集上优于以往的最新方法,在大幅消除金属伪影的同时恢复了更丰富的纹理细节。

Low-complexity Overfitted Neural Image Codec

  • paper_url: http://arxiv.org/abs/2307.12706
  • repo_url: https://github.com/Orange-OpenSource/Cool-Chic
  • paper_authors: Thomas Leguay, Théo Ladune, Pierrick Philippe, Gordon Clare, Félix Henry
  • for: 这个论文是为了提出一种具有减少复杂度的神经网络图像编码器,该编码器可以对输入图像进行适应参数过滤。
  • methods: 该论文使用了自适应神经网络,并通过优化训练过程和使用轻量级模块来降低编码器的复杂度。
  • results: 该论文的方法可以与 autoencoder 和 HEVC 比肩,并且在不同的编码条件下具有14%的rate reduction,同时保持相似的复杂度。
    Abstract We propose a neural image codec at reduced complexity which overfits the decoder parameters to each input image. While autoencoders perform up to a million multiplications per decoded pixel, the proposed approach only requires 2300 multiplications per pixel. Albeit low-complexity, the method rivals autoencoder performance and surpasses HEVC performance under various coding conditions. Additional lightweight modules and an improved training process provide a 14% rate reduction with respect to previous overfitted codecs, while offering a similar complexity. This work is made open-source at https://orange-opensource.github.io/Cool-Chic/
    摘要 我们提出了一种低复杂度的神经图像编码器,其将解码器参数过拟合到每幅输入图像。自编码器可能需要对每个解码像素进行多达一百万次乘法运算,而我们的方法只需对每个解码像素进行2300次乘法运算。虽然复杂度较低,我们的方法性能与自编码器相当,并在多种编码条件下超过HEVC的性能。此外,额外的轻量级模块和改进的训练过程使码率相对此前的过拟合编码器降低14%,同时保持相近的复杂度。该工作已开源于 https://orange-opensource.github.io/Cool-Chic/。
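
A toy sketch of per-image overfitting in the spirit of this codec: a small latent grid and a deliberately tiny decoder are optimised for a single image under a rate-distortion objective. The latent shape, decoder layers, and the crude rate proxy are assumptions; Cool-Chic's actual latent pyramid, quantisation, and entropy model are not reproduced here.

```python
import torch
import torch.nn as nn

img = torch.rand(1, 3, 64, 64)                      # the single image to encode
latent = nn.Parameter(torch.zeros(1, 8, 16, 16))    # per-image latent (assumed shape)
decoder = nn.Sequential(                            # deliberately very small decoder
    nn.Upsample(scale_factor=4, mode="bilinear"),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
opt = torch.optim.Adam([latent, *decoder.parameters()], lr=1e-2)
lmbda = 1e-3                                        # rate-distortion trade-off

for step in range(200):
    opt.zero_grad()
    rec = decoder(latent)
    distortion = nn.functional.mse_loss(rec, img)
    rate_proxy = latent.abs().mean()                # crude stand-in for an entropy model
    (distortion + lmbda * rate_proxy).backward()
    opt.step()
```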

Bayesian Based Unrolling for Reconstruction and Super-resolution of Single-Photon Lidar Systems

  • paper_url: http://arxiv.org/abs/2307.12700
  • repo_url: None
  • paper_authors: Abderrahim Halimi, Jakeoung Koo, Stephen McLaughlin
  • for: 这篇论文主要用于描述一种基于深度学习的3D单光子探测器的重建和超分辨率方法。
  • methods: 该方法基于一种卷积 Bayesian 模型,可以在高噪音环境下提供最佳估计,同时具有改进的网络解释性。
  • results: 与现有的学习基于方法相比,该算法具有减少可训练参数数量、更高的噪音耐受度和系统响应函数模型化不足的问题,同时提供了更多的估计信息,包括不确定性度量。 Synthetic and real data 比较表明,该算法可以与现有算法相比,提供类似的推理质量和计算复杂度。
    Abstract Deploying 3D single-photon Lidar imaging in real world applications faces several challenges due to imaging in high noise environments and with sensors having limited resolution. This paper presents a deep learning algorithm based on unrolling a Bayesian model for the reconstruction and super-resolution of 3D single-photon Lidar. The resulting algorithm benefits from the advantages of both statistical and learning based frameworks, providing best estimates with improved network interpretability. Compared to existing learning-based solutions, the proposed architecture requires a reduced number of trainable parameters, is more robust to noise and mismodelling of the system impulse response function, and provides richer information about the estimates including uncertainty measures. Results on synthetic and real data show competitive results regarding the quality of the inference and computational complexity when compared to state-of-the-art algorithms. This short paper is based on contributions published in [1] and [2].
    摘要 将3D单光子激光雷达成像部署于实际应用中面临多重挑战,包括高噪声环境以及探测器分辨率有限。本文提出了一种基于展开贝叶斯模型的深度学习算法,用于3D单光子激光雷达的重建与超分辨率。该算法兼具统计框架与学习框架的优点,在提供最优估计的同时改进了网络的可解释性。与现有的基于学习的方案相比,所提架构需要的可训练参数更少,对噪声和系统脉冲响应函数的建模误差更加鲁棒,并能给出包括不确定度度量在内的更丰富的估计信息。在合成数据和真实数据上的结果显示,其推断质量和计算复杂度与当前最优算法相当。这篇短文基于[1]和[2]中发表的工作。

Automatic lobe segmentation using attentive cross entropy and end-to-end fissure generation

  • paper_url: http://arxiv.org/abs/2307.12634
  • repo_url: https://github.com/htytewx/softcam
  • paper_authors: Qi Su, Na Wang, Jiawen Xie, Yinan Chen, Xiaofan Zhang
  • For: automatic lung lobe segmentation algorithm for the diagnosis and treatment of lung diseases
  • Methods: task-specific loss function to pay attention to the area around the pulmonary fissure, end-to-end pulmonary fissure generation method, registration-based loss function to alleviate convergence difficulty
  • Results: achieved 97.83% and 94.75% dice scores on the private dataset STLB and the public LUNA16 dataset respectively.
    Abstract The automatic lung lobe segmentation algorithm is of great significance for the diagnosis and treatment of lung diseases, however, which has great challenges due to the incompleteness of pulmonary fissures in lung CT images and the large variability of pathological features. Therefore, we propose a new automatic lung lobe segmentation framework, in which we urge the model to pay attention to the area around the pulmonary fissure during the training process, which is realized by a task-specific loss function. In addition, we introduce an end-to-end pulmonary fissure generation method in the auxiliary pulmonary fissure segmentation task, without any additional network branch. Finally, we propose a registration-based loss function to alleviate the convergence difficulty of the Dice loss supervised pulmonary fissure segmentation task. We achieve 97.83% and 94.75% dice scores on our private dataset STLB and public LUNA16 dataset respectively.
    摘要 自动肺叶分割算法对肺部疾病的诊断和治疗具有重要意义,但由于肺CT图像中肺裂常不完整、病理特征变化巨大,该任务面临很大挑战。为此,我们提出了一种新的自动肺叶分割框架:通过任务特定的损失函数,促使模型在训练过程中关注肺裂附近的区域;同时在辅助的肺裂分割任务中引入一种无需额外网络分支的端到端肺裂生成方法;并提出一种基于配准的损失函数,以缓解Dice损失监督的肺裂分割任务的收敛困难。在我们的私有数据集STLB和公共数据集LUNA16上,该方法分别取得了97.83%和94.75%的Dice分数。
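
One plausible reading of the "attentive cross entropy" idea is a voxel-weighted cross-entropy that up-weights a band around the pulmonary fissure; the sketch below implements that generic variant. The boost factor and the mask construction are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def attentive_ce(logits, target, fissure_mask, boost=4.0):
    """Weighted cross-entropy whose per-voxel weight is increased where
    fissure_mask == 1 (a neighbourhood around the fissure), 1 elsewhere.
    logits: (B, C, ...), target: (B, ...) with class indices."""
    ce = F.cross_entropy(logits, target, reduction="none")   # per-voxel CE
    weight = 1.0 + boost * fissure_mask.float()              # emphasise fissure area
    return (weight * ce).sum() / weight.sum()

logits = torch.randn(2, 6, 32, 32, 32)          # 5 lobes + background (assumed)
target = torch.randint(0, 6, (2, 32, 32, 32))
fissure = (torch.rand(2, 32, 32, 32) > 0.9)     # dummy fissure-neighbourhood mask
print(attentive_ce(logits, target, fissure).item())
```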

Sparse annotation strategies for segmentation of short axis cardiac MRI

  • paper_url: http://arxiv.org/abs/2307.12619
  • repo_url: None
  • paper_authors: Josh Stein, Maxime Di Folco, Julia Schnabel
  • for: 本研究旨在探讨使用少量标注数据进行心脏MRI分割的方法,以优化标注成本和提高分割性能。
  • methods: 我们系统地研究了两种稀疏标注策略:减少标注的病例数量(sparse volumes)和减少每个病例中标注的切片数量(sparse annotations),并使用最新的 nnU-Net 模型在两个公开数据集上评估分割性能,以确定哪些切片最值得标注。
  • results: 我们的实验结果表明,训练使用少量标注数据可以达到0.85的Dice分数和与全数据集相当的性能。此外,我们发现,在中部层的标注更加有价值,而胸部区域的标注最差。在评估量据集对比中,更多的层标注比更多的量据集具有更高的分割性能。因此,建议在标注时尽量标注中部层,而不是标注更多的量据集。
    Abstract Short axis cardiac MRI segmentation is a well-researched topic, with excellent results achieved by state-of-the-art models in a supervised setting. However, annotating MRI volumes is time-consuming and expensive. Many different approaches (e.g. transfer learning, data augmentation, few-shot learning, etc.) have emerged in an effort to use fewer annotated data and still achieve similar performance as a fully supervised model. Nevertheless, to the best of our knowledge, none of these works focus on which slices of MRI volumes are most important to annotate for yielding the best segmentation results. In this paper, we investigate the effects of training with sparse volumes, i.e. reducing the number of cases annotated, and sparse annotations, i.e. reducing the number of slices annotated per case. We evaluate the segmentation performance using the state-of-the-art nnU-Net model on two public datasets to identify which slices are the most important to annotate. We have shown that training on a significantly reduced dataset (48 annotated volumes) can give a Dice score greater than 0.85 and results comparable to using the full dataset (160 and 240 volumes for each dataset respectively). In general, training on more slice annotations provides more valuable information compared to training on more volumes. Further, annotating slices from the middle of volumes yields the most beneficial results in terms of segmentation performance, and the apical region the worst. When evaluating the trade-off between annotating volumes against slices, annotating as many slices as possible instead of annotating more volumes is a better strategy.
    摘要 短轴心臓MRI分割是一个广泛研究的话题,现有一些最新的模型在指导下达到了出色的结果。然而,对MRIVolume进行标注是时间consuming和expensive。许多不同的方法(如转移学习、数据扩展、少数学习等)在尝试使用 fewer annotated data 并且达到类似于全指导模型的性能。然而,据我们所知,这些工作没有关注于哪些MRI Volume slice是最重要的标注,以达到最佳分割结果。在这篇文章中,我们 investigate了在减少 annotated volume 和 sparse annotations 下的训练效果。我们使用了state-of-the-art nnU-Net模型对两个公共数据集进行评估,以确定哪些slice是最重要的标注。我们发现,通过减少数据集至48个标注Volume可以达到Dice分数大于0.85,并且与使用全数据集(160和240个Volume)的结果相当。总的来说,训练更多的slice标注比训练更多的Volume更有价值的信息。此外,从MRI Volume 中间部分标注slice最有利于分割性能,而apical区域最差。当评估 annotating Volume 和 slice 之间的负担比,更好的策略是annotating as many slices as possible 而不是 annotating more Volume。
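
Since the study reports Dice scores throughout (e.g. > 0.85 with only 48 annotated volumes), here is the standard Dice coefficient on binary masks for reference; this is the common definition, not code from the paper.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

# Toy check on random masks
a = np.random.rand(128, 128) > 0.5
b = np.random.rand(128, 128) > 0.5
print(dice_score(a, b))   # ~0.5 for independent random masks
```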

Attribute Regularized Soft Introspective VAE: Towards Cardiac Attribute Regularization Through MRI Domains

  • paper_url: http://arxiv.org/abs/2307.12618
  • repo_url: None
  • paper_authors: Maxime Di Folco, Cosmin Bercea, Julia A. Schnabel
  • for: 本研究旨在提高深度生成模型的控制性,通过选择性地修改数据特征进行数据生成和修饰。
  • methods: 本研究使用了Variational Autoencoders (VAEs),并通过添加对偏好损失的限制来提高模型的控制性。
  • results: 实验表明,提出的Attributed Soft Introspective VAE(Attri-SIVAE)方法可以在不同的MRI数据集上达到同等的重建和规范化性,而且在不同的数据集上也可以保持同等的规范化水平,不同于相比方法。
    Abstract Deep generative models have emerged as influential instruments for data generation and manipulation. Enhancing the controllability of these models by selectively modifying data attributes has been a recent focus. Variational Autoencoders (VAEs) have shown promise in capturing hidden attributes but often produce blurry reconstructions. Controlling these attributes through different imaging domains is difficult in medical imaging. Recently, Soft Introspective VAE leverage the benefits of both VAEs and Generative Adversarial Networks (GANs), which have demonstrated impressive image synthesis capabilities, by incorporating an adversarial loss into VAE training. In this work, we propose the Attributed Soft Introspective VAE (Attri-SIVAE) by incorporating an attribute regularized loss, into the Soft-Intro VAE framework. We evaluate experimentally the proposed method on cardiac MRI data from different domains, such as various scanner vendors and acquisition centers. The proposed method achieves similar performance in terms of reconstruction and regularization compared to the state-of-the-art Attributed regularized VAE but additionally also succeeds in keeping the same regularization level when tested on a different dataset, unlike the compared method.
    摘要 深度生成模型已经成为数据生成和修饰的重要工具。提高这些模型的可控性,通过选择性地修改数据属性,是最近的研究焦点。变量自动编码器(VAEs)可以捕捉隐藏属性,但经常生成模糊的重建。在医学成像中,控制这些属性通过不同的成像频谱是困难的。最近,软 introspective VAE 利用了 VAEs 和生成对抗网络(GANs)的优点,通过在 VAE 训练中添加对抗损失来提高图像生成能力。在这项工作中,我们提出了具有属性规则化损失的 Attributed Soft Introspective VAE(Attri-SIVAE)。我们通过实验评估该方法在不同的cardiac MRI数据集上的性能。该方法与状态uset-of-the-art Attributed regularized VAE 相似的重建和规则化性能,并且在不同的数据集上保持了同等的规则化水平,不同于相比方法。

AMaizeD: An End to End Pipeline for Automatic Maize Disease Detection

  • paper_url: http://arxiv.org/abs/2308.03766
  • repo_url: None
  • paper_authors: Anish Mall, Sanchit Kabra, Ankur Lhila, Pawan Ajmera
  • for: 这篇研究论文旨在提供一个端到端的玉米病害自动检测框架,用于早期检测玉米作物中的病害。
  • methods: 该框架使用无人机采集的多光谱图像,结合卷积神经网络特征提取与分割方法,以识别玉米植株及其相关病害。
  • results: 实验结果表明,该框架可以有效检测玉米作物中的多种病害,包括白粉病、炭疽病和叶枯病等。
    Abstract This research paper presents AMaizeD: An End to End Pipeline for Automatic Maize Disease Detection, an automated framework for early detection of diseases in maize crops using multispectral imagery obtained from drones. A custom hand-collected dataset focusing specifically on maize crops was meticulously gathered by expert researchers and agronomists. The dataset encompasses a diverse range of maize varieties, cultivation practices, and environmental conditions, capturing various stages of maize growth and disease progression. By leveraging multispectral imagery, the framework benefits from improved spectral resolution and increased sensitivity to subtle changes in plant health. The proposed framework employs a combination of convolutional neural networks (CNNs) as feature extractors and segmentation techniques to identify both the maize plants and their associated diseases. Experimental results demonstrate the effectiveness of the framework in detecting a range of maize diseases, including powdery mildew, anthracnose, and leaf blight. The framework achieves state-of-the-art performance on the custom hand-collected dataset and contributes to the field of automated disease detection in agriculture, offering a practical solution for early identification of diseases in maize crops advanced machine learning techniques and deep learning architectures.
    摘要 本文提出了AMaizeD:一个端到端的玉米病害自动检测流程,利用无人机采集的多光谱影像对玉米作物进行病害早期检测。研究人员与农学专家精心采集并标注了一个专门针对玉米的数据集,涵盖多种玉米品种、栽培方式和环境条件,记录了玉米生长和病害发展的不同阶段。借助多光谱影像,该框架获得了更高的光谱分辨率,对植株健康状况的细微变化更为敏感。框架结合卷积神经网络(CNN)特征提取与分割技术来识别玉米植株及其病害。实验结果表明,该框架能够有效检测白粉病、炭疽病和叶枯病等多种玉米病害,在自采数据集上达到了领先性能,为玉米病害的早期识别提供了实用方案。

Development Of Automated Cardiac Arrhythmia Detection Methods Using Single Channel ECG Signal

  • paper_url: http://arxiv.org/abs/2308.02405
  • repo_url: None
  • paper_authors: Arpita Paul, Avik Kumar Das, Manas Rakshit, Ankita Ray Chowdhury, Susmita Saha, Hrishin Roy, Sajal Sarkar, Dongiri Prasanth, Eravelli Saicharan
  • for: 心律失常的自动检测和分类有助于降低心脏疾病的死亡率。本研究提出了基于单通道心电图(ECG)信号的多类心律失常检测算法。
  • methods: 研究利用心率变异性(HRV)、形态特征和小波系数特征,通过基于机器学习的随机森林分类器对9类心律失常进行检测。
  • results: 使用HRV和时域形态特征时,获得了85.11%的准确率、85.11%的敏感度、85.07%的精度和85.00%的F1分数;使用HRV和小波系数特征时,性能提升至90.91%的准确率、90.91%的敏感度、90.96%的精度和90.87%的F1分数。实验结果表明,该方案能够有效地从单通道ECG记录中检测多种类型的心律失常。
    Abstract Arrhythmia, an abnormal cardiac rhythm, is one of the most common types of cardiac disease. Automatic detection and classification of arrhythmia can be significant in reducing deaths due to cardiac diseases. This work proposes a multi-class arrhythmia detection algorithm using single channel electrocardiogram (ECG) signal. In this work, heart rate variability (HRV) along with morphological features and wavelet coefficient features are utilized for detection of 9 classes of arrhythmia. Statistical, entropy and energy-based features are extracted and applied to machine learning based random forest classifiers. Data used in both works is taken from 4 broad databases (CPSC and CPSC extra, PTB-XL, G12EC and Chapman-Shaoxing and Ningbo Database) made available by Physionet. With HRV and time domain morphological features, an average accuracy of 85.11%, sensitivity of 85.11%, precision of 85.07% and F1 score of 85.00% is obtained whereas with HRV and wavelet coefficient features, the performance obtained is 90.91% accuracy, 90.91% sensitivity, 90.96% precision and 90.87% F1 score. The detailed analysis of simulation results affirms that the presented scheme effectively detects broad categories of arrhythmia from single-channel ECG records. In the last part of the work, the proposed classification schemes are implemented on hardware using Raspberry Pi for real time ECG signal classification.
    摘要 心动过速病(Arrhythmia)是心血管疾病中最常见的一种。自动检测和识别Arrhythmia可以有效降低心血管疾病的死亡率。这项工作提出了基于单通道电cardiogram(ECG)信号的多类Arrhythmia检测算法。在这项工作中,利用心跳变化(HRV)以及形态特征和wavelet幅特征来检测9种类型的Arrhythmia。通过提取统计、熵和能量基本特征,并应用机器学习基于Random Forest分类器,实现了高精度的Arrhythmia检测。数据来源于Physionet提供的4个广泛数据库(CPSC和CPSC extra、PTB-XL、G12EC和Chapman-Shaoxing和Ningbo数据库)。使用HRV和时域形态特征时,取得了85.11%的准确率、85.11%的敏感度、85.07%的精度和85.00%的F1分数,而使用HRV和wavelet幅特征时,取得了90.91%的准确率、90.91%的敏感度、90.96%的精度和90.87%的F1分数。etailed分析结果表明,提出的方案可以有效地从单通道ECG记录中检测广泛的Arrhythmia类型。最后,提出的分类方案在硬件上使用Raspberry Pi实现了实时ECG信号分类。
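
A small end-to-end sketch of the classical pipeline described here: compute a few time-domain HRV features from RR intervals and train a random-forest classifier over 9 classes. The RR series and labels below are synthetic, and the morphological and wavelet-coefficient features used in the paper are omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def hrv_features(rr_ms: np.ndarray) -> list:
    """A few standard time-domain HRV features from an RR-interval series (ms):
    mean RR, SDNN, RMSSD, pNN50."""
    diff = np.diff(rr_ms)
    return [rr_ms.mean(),
            rr_ms.std(ddof=1),                       # SDNN
            np.sqrt(np.mean(diff ** 2)),             # RMSSD
            np.mean(np.abs(diff) > 50.0)]            # pNN50

# Dummy dataset: 200 synthetic recordings, 9 arrhythmia classes (as in the paper)
rng = np.random.default_rng(0)
X = np.array([hrv_features(rng.normal(800, 40, size=300)) for _ in range(200)])
y = rng.integers(0, 9, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```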

4D Feet: Registering Walking Foot Shapes Using Attention Enhanced Dynamic-Synchronized Graph Convolutional LSTM Network

  • paper_url: http://arxiv.org/abs/2307.12377
  • repo_url: None
  • paper_authors: Farzam Tajdari, Toon Huysmans, Xinhe Yao, Jun Xu, Yu Song
  • for: 该论文旨在帮助研究人员更好地理解动态弹性人体部件的特征,通过基于多个异步摄像机捕获的4D扫描数据进行重建。
  • methods: 该论文提出了一种通用框架,包括:1)使用非RIGID迭代最近最远点对精度找到和对准不同摄像机捕获的3D扫描数据中的动态特征;2)使用一种新型的ADGC-LSTM网络将不同摄像机捕获的3D扫描数据同步到特定摄像机的时间轴上;3)使用非RIGID注准方法将同步化的3D扫描数据注准到高质量模板中。
  • results: 该论文采用了一种新开发的4D脚部扫描仪,并将数据集分为58名参与者的15帧/秒4D形态数据集(共116个脚部,包括5147帧的3D扫描数据),覆盖了脚步征的重要阶段。结果表明提出的方法有效地同步异步的4D扫描数据,特别是通过使用提出的ADGC-LSTM网络进行同步。
    Abstract 4D scans of dynamic deformable human body parts help researchers have a better understanding of spatiotemporal features. However, reconstructing 4D scans based on multiple asynchronous cameras encounters two main challenges: 1) finding the dynamic correspondences among different frames captured by each camera at the timestamps of the camera in terms of dynamic feature recognition, and 2) reconstructing 3D shapes from the combined point clouds captured by different cameras at asynchronous timestamps in terms of multi-view fusion. In this paper, we introduce a generic framework that is able to 1) find and align dynamic features in the 3D scans captured by each camera using the nonrigid iterative closest-farthest points algorithm; 2) synchronize scans captured by asynchronous cameras through a novel ADGC-LSTM-based network, which is capable of aligning 3D scans captured by different cameras to the timeline of a specific camera; and 3) register a high-quality template to synchronized scans at each timestamp to form a high-quality 3D mesh model using a non-rigid registration method. With a newly developed 4D foot scanner, we validate the framework and create the first open-access data-set, namely the 4D feet. It includes 4D shapes (15 fps) of the right and left feet of 58 participants (116 feet in total, including 5147 3D frames), covering significant phases of the gait cycle. The results demonstrate the effectiveness of the proposed framework, especially in synchronizing asynchronous 4D scans using the proposed ADGC-LSTM network.
    摘要 4D扫描技术为研究人体动态变形带来了更好的认知,但是通过多个异步相机重建4D扫描存在两大挑战:1)在不同相机拍摄时间点找到动态匹配,并通过动态特征识别将它们相互对应;2)将不同相机拍摄的点云数据 fusion 到一起,以便形成高质量的3D模型。在这篇论文中,我们提出了一种通用的框架,可以1)使用非RIGID迭代最近最远点算法来在不同相机拍摄的3D扫描中找到和对应动态特征;2)使用一种新型的ADGC-LSTM网络将不同相机拍摄的3D扫描同步到同一个时间轴上;3)使用非RIGID注册方法将同步化后的3D扫描与高质量模板进行对应,以形成高质量的3D mesh模型。我们使用一种新开发的4D脚部扫描仪来验证该框架,并创建了首个公共数据集,即4D脚部(15帧/秒),包括58名参与者的右和左脚的4D形状(共116个脚,包括5147帧),覆盖了走势过程中重要的阶段。结果表明提出的框架具有良好的效果,特别是在同步异步4D扫描中使用提出的ADGC-LSTM网络。

cs.SD - 2023-07-23

A meta learning scheme for fast accent domain expansion in Mandarin speech recognition

  • paper_url: http://arxiv.org/abs/2307.12262
  • repo_url: None
  • paper_authors: Ziwei Zhu, Changhao Shan, Bihong Zhang, Jian Yu
  • for: 这 paper 是为了解决中文识别技术中的方言域扩展问题,提高方言识别精度。
  • methods: 这 paper 使用了元学习技术,包括模型冻结和多元学习,实现快速方言域扩展。
  • results: 该方法在方言域扩展任务上达到了3%的相对提升,相比基线模型,在同样的测试集上提高了37%。此外,该方法也在大量数据上实现了4%的相对提升。
    Abstract Spoken languages show significant variation across mandarin and accent. Despite the high performance of mandarin automatic speech recognition (ASR), accent ASR is still a challenge task. In this paper, we introduce meta-learning techniques for fast accent domain expansion in mandarin speech recognition, which expands the field of accents without deteriorating the performance of mandarin ASR. Meta-learning or learn-to-learn can learn general relation in multi domains not only for over-fitting a specific domain. So we select meta-learning in the domain expansion task. This more essential learning will cause improved performance on accent domain extension tasks. We combine the methods of meta learning and freeze of model parameters, which makes the recognition performance more stable in different cases and the training faster about 20%. Our approach significantly outperforms other methods about 3% relatively in the accent domain expansion task. Compared to the baseline model, it improves relatively 37% under the condition that the mandarin test set remains unchanged. In addition, it also proved this method to be effective on a large amount of data with a relative performance improvement of 4% on the accent test set.
    摘要 口语在普通话与带口音语音之间存在显著差异。尽管普通话自动语音识别(ASR)已经具有很高的性能,带口音的ASR仍然是一项具有挑战性的任务。本文将元学习技术引入普通话语音识别的口音领域快速扩展中,在不损害普通话ASR性能的前提下扩展口音覆盖范围。元学习(learn-to-learn)能够学习多个领域之间的一般关系,而不仅仅是过拟合某个特定领域,因此我们在领域扩展任务中选择了元学习。我们将元学习与模型参数冻结相结合,使识别性能在不同情况下更加稳定,并使训练速度加快约20%。在口音领域扩展任务上,该方法相对其他方法约有3%的相对提升;在普通话测试集保持不变的条件下,相对基线模型提升37%。此外,该方法在大规模数据上同样有效,在口音测试集上取得了4%的相对性能提升。

MyVoice: Arabic Speech Resource Collaboration Platform

  • paper_url: http://arxiv.org/abs/2308.02503
  • repo_url: None
  • paper_authors: Yousseif Elshahawy, Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
  • for: 增强阿拉伯语言技术的发展,收集和整理阿拉伯语言的语音数据。
  • methods: 使用互联网平台,征集大量的阿拉伯语言口音录音,并提供城市/国家精细的口音选择功能。用户可以 switching roles,从记录者变成评估者,并提供反馈。平台还包括质量检查系统,过滤掉低质量和假 recording。
  • results: 实现了收集大量的阿拉伯语言口音数据,提供城市/国家精细的口音选择功能,并且可以进行多方合作,汇集多种阿拉伯语言数据。
    Abstract We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. This platform offers an opportunity to design large dialectal speech datasets; and makes them publicly available. MyVoice allows contributors to select city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributors and annotators. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback which is then reviewed by administrators. Furthermore, the platform offers flexibility to admin roles to add new data or tasks beyond dialectal speech and word collection, which are displayed to contributors. Thus, enabling collaborative efforts in gathering diverse and large Arabic speech data.
    摘要 我团队介绍MyVoice,一个招待人寄语的平台,用于提高阿拉伯语言口音技术。该平台提供了大量地方口音数据的设计机会,并将其公共地发布。MyVoice让参与者可以选择城市/国家精细口音,并录制显示的语音。用户可以在角色之间切换,包括参与者和注释者。平台包含一个质量保证系统,过滤掉低质量和假语音记录,然后将其发送给验证。在验证阶段,参与者可以评估语音质量,注释和提供反馈,这些反馈会被管理员审核。此外,平台允许管理员添加新的数据或任务,以外语言和词汇收集,这些任务将被显示给参与者。因此,MyVoice平台可以促进多方合作,收集多样化和大量的阿拉伯语言数据。

Signal Reconstruction from Mel-spectrogram Based on Bi-level Consistency of Full-band Magnitude and Phase

  • paper_url: http://arxiv.org/abs/2307.12232
  • repo_url: https://github.com/YoshikiMas/signal-reconstruction-from-mel-spectrogram
  • paper_authors: Yoshiki Masuyama, Natsuki Ueno, Nobutaka Ono
  • for: 从低维梅尔谱图(mel-spectrogram)中重建时域信号
  • methods: 利用丰富的听取关系和时域信号之间的双层关系,并使用优化问题的形式来重建全带幅强度和相位信息
  • results: 对话、音乐和环境信号都有较好的重建效果
    Abstract We propose an optimization-based method for reconstructing a time-domain signal from a low-dimensional spectral representation such as a mel-spectrogram. Phase reconstruction has been studied to reconstruct a time-domain signal from the full-band short-time Fourier transform (STFT) magnitude. The Griffin-Lim algorithm (GLA) has been widely used because it relies only on the redundancy of STFT and is applicable to various audio signals. In this paper, we jointly reconstruct the full-band magnitude and phase by considering the bi-level relationships among the time-domain signal, its STFT coefficients, and its mel-spectrogram. The proposed method is formulated as a rigorous optimization problem and estimates the full-band magnitude based on the criterion used in GLA. Our experiments demonstrate the effectiveness of the proposed method on speech, music, and environmental signals.
    摘要 我们提出了一种基于优化的方法,用于从梅尔谱图等低维谱表示中重建时域信号。相位重建研究的是从全带短时傅里叶变换(STFT)幅度谱重建时域信号。Griffin-Lim算法(GLA)被广泛使用,因为它只依赖STFT的冗余性,适用于各种音频信号。在本文中,我们通过考虑时域信号、其STFT系数与其梅尔谱图之间的双层关系,联合重建全带幅度与相位。所提方法被表述为严格的优化问题,并基于GLA所用的准则估计全带幅度。实验表明,所提方法在语音、音乐和环境信号上均有效。
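
For context, the conventional Griffin-Lim-style baseline that this paper improves on can be run with librosa as below: a mel-spectrogram is inverted to an approximate full-band magnitude and the phase is recovered iteratively. The proposed joint magnitude/phase optimisation is not implemented here; n_fft, hop length, and iteration count are arbitrary choices.

```python
import librosa

sr = 22050
y = librosa.tone(440.0, sr=sr, duration=2.0)                 # simple test signal
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# mel -> approximate linear magnitude (filterbank pseudo-inverse) -> Griffin-Lim
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256, n_iter=32)
print(len(y), len(y_hat))
```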

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

  • paper_url: http://arxiv.org/abs/2307.12231
  • repo_url: None
  • paper_authors: Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe
  • for: 这篇论文研究语音分离与自动语音识别(ASR)的集成,以提高多说话人语音识别性能。
  • methods: 论文考察了多通道分离方法、基于掩码的波束形成(mask-based beamforming)和复数谱映射(complex spectral mapping),并探讨了ASR后端模型应使用的最佳特征。
  • results: 研究使用最新的自监督学习表示(SSLR)替代filterbank特征以提升识别性能,并通过精心设计的训练策略集成语音分离与识别。该策略在含噪混响的WHAMR!测试集上取得了2.5%的词错误率,显著优于现有的基于掩码的MVDR波束形成加filterbank集成方案(28.9%)。
    Abstract Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance from the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set, significantly outperforming an existing mask-based MVDR beamforming and filterbank integration (28.9%).
    摘要 neuronal speech separation 已经取得了很大的进步,其与自动语音识别(ASR)的结合是实现多 speaker ASR 的重要方向。本工作提供了深入的Investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end。具体来说,我们探讨了多通道分离方法,Mask-based beamforming和复杂的 spectral mapping,以及ASR back-end模型中最佳的特征。我们使用了最近的自然语言学习表示(SSLR)作为特征,并从filterbank特征中提高了认知性能。为了进一步提高多 speaker recognition性能,我们提出了一种优化的训练策略,将 speech separation和recognition与SSLR结合使用。我们使用TF-GridNet-based complex spectral mapping和WavLM-based SSLR,在抗噪抗干扰 WHAMR! 测试集上实现了2.5% 词错率,与现有的mask-based MVDR beamforming和filterbank结合(28.9%)相比,显著超越了。

Backdoor Attacks against Voice Recognition Systems: A Survey

  • paper_url: http://arxiv.org/abs/2307.13643
  • repo_url: None
  • paper_authors: Baochen Yan, Jiahe Lan, Zheng Yan
  • for: This paper aims to provide a comprehensive survey on backdoor attacks against Voice Recognition Systems (VRSs) and to discuss the feasibility of deploying classic backdoor defense methods and generic audio defense techniques on VRSs.
  • methods: The paper employs a comprehensive taxonomy of backdoor attacks against VRSs from different perspectives, and analyzes the characteristic of different categories. It also reviews existing attack methods and classic backdoor defense methods, and discusses the feasibility of deploying them on VRSs.
  • results: The paper provides a thorough review of backdoor attacks against VRSs, and discusses the open issues and future research directions in this field. It also provides a comprehensive understanding of the vulnerabilities of VRSs to backdoor attacks and the potential solutions to mitigate these attacks.
    Abstract Voice Recognition Systems (VRSs) employ deep learning for speech recognition and speaker recognition. They have been widely deployed in various real-world applications, from intelligent voice assistance to telephony surveillance and biometric authentication. However, prior research has revealed the vulnerability of VRSs to backdoor attacks, which pose a significant threat to the security and privacy of VRSs. Unfortunately, existing literature lacks a thorough review on this topic. This paper fills this research gap by conducting a comprehensive survey on backdoor attacks against VRSs. We first present an overview of VRSs and backdoor attacks, elucidating their basic knowledge. Then we propose a set of evaluation criteria to assess the performance of backdoor attack methods. Next, we present a comprehensive taxonomy of backdoor attacks against VRSs from different perspectives and analyze the characteristic of different categories. After that, we comprehensively review existing attack methods and analyze their pros and cons based on the proposed criteria. Furthermore, we review classic backdoor defense methods and generic audio defense techniques. Then we discuss the feasibility of deploying them on VRSs. Finally, we figure out several open issues and further suggest future research directions to motivate the research of VRSs security.
    摘要 声认系统(VRS)利用深度学习进行语音识别和说话人识别。它们在各种现实应用中广泛应用,从智能语音助手到电信监测和生物认证。然而,先前的研究表明,VRS受到后门攻击的威胁,这对VRS的安全性和隐私具有重要性。然而,现有的文献缺乏对这个话题的全面审查。这篇论文填补了这个研究空白,通过进行VRS对后门攻击的全面评估。我们首先提供VRS和后门攻击的概述,并提出评估后门攻击方法的评价标准。然后,我们提出了VRS对后门攻击的多维分类,并分析不同类别的特点。接着,我们对现有的攻击方法进行了全面的审查,并分析了它们的优缺点。此外,我们还评估了经典的后门防御方法和通用音频防御技术,并评估了它们在VRS上的可行性。最后,我们提出了一些未解决的问题,并建议未来研究VRS的安全性。

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

  • paper_url: http://arxiv.org/abs/2307.12134
  • repo_url: None
  • paper_authors: Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer
  • for: 提高 END-to-END(E2E)语音理解(SLU)系统的 Robustness,使其在听写识别(ASR)错误时仍能准确理解语音。
  • methods: 我们提出了一种新的 E2E SLU 系统,利用 audio 和文本表示,并基于 ASR 假设的模态信息确定精度。我们采用了两种新技术:1)有效地编码 ASR 假设质量,2)有效地将其集成到 E2E SLU 模型中。
  • results: 我们在 STOP 数据集上进行了实验,并发现我们的方法可以提高准确率。我们还进行了分析,以证明我们的方法的有效性。
    Abstract End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently. This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition models (ASR), and outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness when text representation quality is low due to ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. We show accuracy improvements on STOP dataset and share the analysis to demonstrate the effectiveness of our approach.
    摘要 最近,端到端(E2E)的语音理解(SLU)系统在使用预训练的语音识别(ASR)模型时,变得更加有前途。这种方法使用单个模型,利用语音和文本表示从预训练的ASR模型中获取,并在设备上流动enario下超越传统的管道SLU系统。然而,E2E SLU系统仍然在文本表示质量低下时表现弱,这是因为ASR识别错误。为了解决这个问题,我们提出了一种新的E2E SLU系统,增强了对ASR错误的抗钝性。我们介绍了两种新技术:1)一种有效的ASR假设质量编码方法,2)一种有效的将其集成到E2E SLU模型中的方法。我们在STOP数据集上显示了准确性改进,并提供分析,以证明我们的方法的有效性。

Estimating speaker direction on a humanoid robot with binaural acoustic signals

  • paper_url: http://arxiv.org/abs/2307.12129
  • repo_url: None
  • paper_authors: Pranav Barot, Katja Mombaur, Ewen MacDonald
  • for: 这篇论文主要用于讲述一种用于人类对话者的位置估计方法,以实现人类与机器人之间的对话。
  • methods: 这篇论文使用了一种基于眼睛音源的方法来估计对话者的位置,并考虑了实时应用场景。这种方法在机器人人类头上实现了双耳声音源定位框架。
  • results: 经过实验和分析,这种方法可以在实时应用场景中提供有效的位置估计结果,并且可以适应不同的对话场景。同时,这种方法也可以减少延迟时间,以便实现实时对话。
    Abstract To achieve human-like behaviour during speech interactions, it is necessary for a humanoid robot to estimate the location of a human talker. Here, we present a method to optimize the parameters used for the direction of arrival (DOA) estimation, while also considering real-time applications for human-robot interaction scenarios. This method is applied to binaural sound source localization framework on a humanoid robotic head. Real data is collected and annotated for this work. Optimizations are performed via a brute force method and a Bayesian model based method, results are validated and discussed, and effects on latency for real-time use are also explored.
    摘要 为实现人类样式的语音互动, robot需要估计人类说话者的位置。我们现在提出一种优化DOA估计参数的方法,同时考虑实时应用场景。这种方法应用于人型机器人头部上的双耳声源定位框架。实际数据收集和标注,并对其进行优化。我们使用枚举方法和 Bayesian 模型基于方法进行优化,并对结果进行验证和讨论。我们还探讨了在实时使用中的延迟影响。
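
A textbook building block of binaural DOA estimation, for illustration only: estimate the inter-aural time difference with GCC-PHAT and convert it to an azimuth under a far-field assumption. The microphone spacing, sampling rate, and synthetic signals are assumptions; the paper is about tuning such parameters for real-time use on a specific robot head, which is not reproduced here.

```python
import numpy as np

def gcc_phat_tdoa(sig_l, sig_r, fs):
    """Time difference of arrival between left/right signals via GCC-PHAT."""
    n = len(sig_l) + len(sig_r)
    X = np.fft.rfft(sig_l, n=n) * np.conj(np.fft.rfft(sig_r, n=n))
    cc = np.fft.irfft(X / (np.abs(X) + 1e-12), n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def tdoa_to_angle(tdoa, mic_distance=0.18, c=343.0):
    """Far-field TDOA -> azimuth; the 0.18 m 'head width' is an assumption."""
    return np.degrees(np.arcsin(np.clip(tdoa * c / mic_distance, -1.0, 1.0)))

fs = 16000
src = np.random.randn(fs)                       # 1 s of dummy source signal
left, right = src, np.roll(src, 8)              # 8-sample inter-aural delay
print(tdoa_to_angle(gcc_phat_tdoa(left, right, fs)))
```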

cs.CV - 2023-07-23

ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

  • paper_url: http://arxiv.org/abs/2307.12349
  • repo_url: https://github.com/lartpang/comptr
  • paper_authors: Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu
  • for: 这个研究是为了开发一个能够同时处理多种不同任务的复合式对话模型,以提高深度学习(DL)在紧密预测领域的表现。
  • methods: 本研究使用了一种名为ComPtr的复合式对话模型,它基于信息补充的概念,并具有两个组件:协调强化和差异意识部分。这两个组件可以帮助ComPtr从不同的图像来源中获取重要的视觉 semantic 讯号,并将其转换为多个任务中的有用信息。
  • results: 在多个代表性的视觉任务中,例如离散检测、RGB-T人数掌握、RGB-D/T焦点物探测以及RGB-D semantics 类别分类,ComPtr 都能够获得良好的表现。
    Abstract Deep learning (DL) has advanced the field of dense prediction, while gradually dissolving the inherent barriers between different tasks. However, most existing works focus on designing architectures and constructing visual cues only for the specific task, which ignores the potential uniformity introduced by the DL paradigm. In this paper, we attempt to construct a novel \underline{ComP}lementary \underline{tr}ansformer, \textbf{ComPtr}, for diverse bi-source dense prediction tasks. Specifically, unlike existing methods that over-specialize in a single task or a subset of tasks, ComPtr starts from the more general concept of bi-source dense prediction. Based on the basic dependence on information complementarity, we propose consistency enhancement and difference awareness components with which ComPtr can evacuate and collect important visual semantic cues from different image sources for diverse tasks, respectively. ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer. This task-generic design provides a smooth foundation for constructing the unified model that can simultaneously deal with various bi-source information. In extensive experiments across several representative vision tasks, i.e. remote sensing change detection, RGB-T crowd counting, RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed method consistently obtains favorable performance. The code will be available at \url{https://github.com/lartpang/ComPtr}.
    摘要 深度学习(DL)在密集预测方面取得了 significiant 进步,逐渐消除了不同任务之间的自然障碍。然而,大多数现有的工作都是为特定任务或子集任务设计特有的建筑和视觉提示,忽视了深度学习 парадиг中的可能性。在这篇论文中,我们尝试构建一种新的 ComPlementary trasnformer,即 ComPtr,用于多种生物源密集预测任务。Specifically, unlike existing methods that over-specialize in a single task or a subset of tasks, ComPtr starts from the more general concept of bi-source dense prediction. Based on the basic dependence on information complementarity, we propose consistency enhancement and difference awareness components with which ComPtr can evacuate and collect important visual semantic cues from different image sources for diverse tasks, respectively. ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer. This task-generic design provides a smooth foundation for constructing the unified model that can simultaneously deal with various bi-source information. In extensive experiments across several representative vision tasks, i.e. 远程感知变化检测、RGB-T人群计数、RGB-D/T突出物检测和RGB-D semantic segmentation, the proposed method consistently obtains favorable performance. Code will be available at \url{https://github.com/lartpang/ComPtr}.

ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting

  • paper_url: http://arxiv.org/abs/2307.12348
  • repo_url: https://github.com/zsyoaoa/resshift
  • paper_authors: Zongsheng Yue, Jianyi Wang, Chen Change Loy
  • for: 提高图像超分辨率(SR)方法的执行速度,解决现有方法的执行慢速问题。
  • methods: 提出一种新的和高效的扩散模型,通过减少扩散步数,消除post加速的需求和相关性能下降。
  • results: 实验表明,提出的方法在 synthetic 和实际数据集上具有较高或相当于当前状态艺术方法的性能,只需15步扩散。
    Abstract Diffusion-based image super-resolution (SR) methods are mainly limited by the low inference speed due to the requirements of hundreds or even thousands of sampling steps. Existing acceleration sampling techniques inevitably sacrifice performance to some extent, leading to over-blurry SR results. To address this issue, we propose a novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps, thereby eliminating the need for post-acceleration during inference and its associated performance deterioration. Our method constructs a Markov chain that transfers between the high-resolution image and the low-resolution image by shifting the residual between them, substantially improving the transition efficiency. Additionally, an elaborate noise schedule is developed to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experiments demonstrate that the proposed method obtains superior or at least comparable performance to current state-of-the-art methods on both synthetic and real-world datasets, even only with 15 sampling steps. Our code and model are available at https://github.com/zsyOAOA/ResShift.
    摘要 Diffusion-based图像超分辨 (SR) 方法主要受限于推断速度较低,因为需要数百或者千个抽象步骤。现有的加速抽象技术无论如何,都会 sacrificing performance一定程度,导致SR结果过度模糊。为解决这个问题,我们提出了一种新的和高效的Diffusion模型,可以大幅减少抽象步骤数量,从而消除推断过程中的后加速和其相关的性能下降。我们的方法构建了一个Markov链,将高分辨图像和低分辨图像之间的差异转移到高分辨图像上,大幅提高了转移效率。此外,我们还开发了一种灵活控制抽象速度和噪声强度的附加noise schedule。广泛的实验表明,我们的方法可以在现有的State-of-the-art方法的基础上实现更好的或者相当的性能,只需要15个抽象步骤。我们的代码和模型可以在https://github.com/zsyOAOA/ResShift上下载。
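
A rough sketch of the residual-shifting forward process as described in the abstract: the state moves from the HR image toward the LR image by shifting the residual between them, with noise controlled by a schedule. The specific update rule, schedule, and noise scaling below are assumptions written for illustration, not the paper's exact formulation.

```python
import torch

def resshift_forward(x0, y, eta_t, kappa=1.0):
    """Illustrative forward step: shift the residual e0 = y - x0 by eta_t and add
    noise whose scale grows with eta_t, roughly
    x_t = x0 + eta_t * (y - x0) + kappa * sqrt(eta_t) * noise."""
    e0 = y - x0
    return x0 + eta_t * e0 + kappa * (eta_t ** 0.5) * torch.randn_like(x0)

x_hr = torch.rand(1, 3, 64, 64)
x_lr_up = torch.rand(1, 3, 64, 64)          # LR image upsampled to HR resolution
for eta in torch.linspace(0.01, 0.99, 15):  # 15 steps, matching the paper's setting
    x_t = resshift_forward(x_hr, x_lr_up, eta.item())
print(x_t.shape)
```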

Rapid detection of soil carbonates by means of NIR spectroscopy, deep learning methods and phase quantification by powder Xray diffraction

  • paper_url: http://arxiv.org/abs/2307.12341
  • repo_url: None
  • paper_authors: Lykourgos Chiniadis, Petros Tamvakis
  • for: 提高农业生产和土壤属性分析,为生态可持续发展提供关键前提。
  • methods: 使用FT NIR反射光谱和深度学习方法来预测土壤碳酸盐含量;采用多种机器学习算法,如多层感知机(MLP)回归器和卷积神经网络(CNN),并与传统的偏最小二乘回归(PLSR)、Cubist和支持向量机(SVM)进行比较。
  • results: 基于FT NIR反射光谱和深度学习方法可以快速高效地预测土壤碳酸盐含量,并在未见过的土壤样本上取得了良好的预测性能;与X射线衍射定量结果相比,MLP预测值与实际值之间的相对误差在5%以内;结果表明,深度学习模型可以作为快速高效的预测工具,尤其适用于没有容量法测定可用、仅有NIR吸收光谱数据的情况。
    Abstract Soil NIR spectral absorbance/reflectance libraries are utilized towards improving agricultural production and analysis of soil properties which are key prerequisite for agroecological balance and environmental sustainability. Carbonates in particular, represent a soil property which is mostly affected even by mild, let alone extreme, changes of environmental conditions during climate change. In this study we propose a rapid and efficient way to predict carbonates content in soil by means of FT NIR reflectance spectroscopy and by use of deep learning methods. We exploited multiple machine learning methods, such as: 1) a MLP Regressor and 2) a CNN and compare their performance with other traditional ML algorithms such as PLSR, Cubist and SVM on the combined dataset of two NIR spectral libraries: KSSL (USDA), a dataset of soil samples reflectance spectra collected nationwide, and LUCAS TopSoil (European Soil Library) which contains soil sample absorbance spectra from all over the European Union, and use them to predict carbonate content on never before seen soil samples. Soil samples in KSSL and in TopSoil spectral libraries were acquired in the spectral region of visNIR, however in this study, only the NIR spectral region was utilized. Quantification of carbonates by means of Xray Diffraction is in good agreement with the volumetric method and the MLP prediction. Our work contributes to rapid carbonates content prediction in soil samples in cases where: 1) no volumetric method is available and 2) only NIR spectra absorbance data are available. Up till now and to the best of our knowledge, there exists no other study, that presents a prediction model trained on such an extensive dataset with such promising results on unseen data, undoubtedly supporting the notion that deep learning models present excellent prediction tools for soil carbonates content.
    摘要 soil NIR spectral absorbance/reflectance 图书馆是用于提高农业生产和土壤属性分析的重要途径,这些属性是生态平衡和环境可持续性的关键因素。碳酸盐 particularly 是在气候变化中环境条件轻度到极端变化时受到影响的土壤属性。本研究提出了一种快速和高效的碳酸盐含量预测方法,通过FT NIR 反射спектроскопия和深度学习方法。我们利用了多种机器学习方法,如:1)多层感知网络(MLP)回归器和2)卷积神经网络(CNN),并与传统的机器学习算法如:PLSR、Cubist和SVM进行比较,使用了合并的KSSL(美国农业部)和LUCAS TopSoil(欧盟土壤图书馆)的两个NIR spectral库,用于预测碳酸盐含量。KSSL和TopSoil spectral库中的土壤样本反射спектроскопия数据收集在VisNIRspectral区间内,但在本研究中仅使用NIRspectral区间。X射 diffraction 测量和MLP预测结果表明,我们的方法可以准确预测碳酸盐含量。我们的工作支持在没有体积方法可用时,只有NIR Spectra absorbance数据可用时,可以快速预测碳酸盐含量。到目前为止,我们知道没有其他研究,可以在such an extensive dataset上提出类似的预测模型,并且模型在未见数据上表现了惊人的好准确性,无疑地支持深度学习模型在土壤碳酸盐含量预测中的出色表现。
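
A minimal workflow sketch comparing two of the evaluated model families (an MLP regressor and PLSR) on spectra-like inputs with scikit-learn; the data are synthetic and the hyper-parameters are placeholders, so this only illustrates the fitting procedure, not the reported results.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Toy stand-in for "NIR spectra -> carbonate content" regression
rng = np.random.default_rng(0)
X = rng.random((500, 200))                                     # 500 samples x 200 bands
y = X[:, 50] * 3 + X[:, 120] * 2 + rng.normal(0, 0.05, 500)    # fake target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)
pls = PLSRegression(n_components=10).fit(X_tr, y_tr)

print("MLP  R2:", r2_score(y_te, mlp.predict(X_te)))
print("PLSR R2:", r2_score(y_te, pls.predict(X_te).ravel()))
```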

Learning Navigational Visual Representations with Semantic Map Supervision

  • paper_url: http://arxiv.org/abs/2307.12335
  • repo_url: https://github.com/yiconghong/ego2map-navit
  • paper_authors: Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, Hao Tan
  • for: 本研究旨在提高家用机器人视觉导航能力,尤其是在室内环境中。
  • methods: 我们提出了一种基于 egocentric 视图和 semantic 地图(Ego$^2$-Map)的视觉表示学习方法,以增强机器人的导航能力。
  • results: 我们的实验表明,使用我们学习的表示可以在 object-goal 导航任务中表现出优于最新的视觉预训练方法,并在 continuous 环境中实现新的state-of-the-art 结果。
    Abstract Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, neglecting the spatial relationships that are essential to the learning of navigation. Inspired by the behavior that humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper, we propose a novel navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps (Ego$^2$-Map). We apply the visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server.
    摘要 能够感知环境的语义和空间结构是家用机器人视觉导航的重要前提。然而,现有的大多数工作仅使用在独立图像分类任务上预训练、或通过自监督学习适配到室内导航领域的视觉骨干网络,忽略了导航学习所必需的空间关系。受人类在导航过程中会在大脑中自然构建语义和空间上有意义的认知地图这一行为启发,本文提出了一种新的面向导航的视觉表示学习方法,通过对比智能体的第一人称视图与语义地图 (Ego$^2$-Map) 来学习表示。我们采用视觉Transformer作为骨干编码器,并使用从大规模Habitat-Matterport3D环境中收集的数据训练模型。Ego$^2$-Map学习将地图中紧凑而丰富的信息 (如物体、结构和转移关系) 迁移到智能体的第一人称表示中用于导航。实验表明,使用我们学习的表示的智能体在物体目标导航任务上优于近期的视觉预训练方法;同时,我们的表示显著提升了连续环境中高层与低层动作空间下的视觉语言导航性能,在测试服务器上取得了47% SR和41% SPL的新的最优结果。
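The paper contrasts egocentric views with semantic maps; its exact objective is not reproduced here, but a standard symmetric InfoNCE loss of the kind such view-map contrastive methods typically use can be sketched as follows (names and the temperature value are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def view_map_infonce(view_emb, map_emb, temperature=0.07):
    """Symmetric InfoNCE between egocentric-view and semantic-map embeddings.

    view_emb, map_emb: (B, D) features of matched pairs; the i-th view is the
    positive of the i-th map, all other pairs in the batch act as negatives.
    """
    v = F.normalize(view_emb, dim=-1)
    m = F.normalize(map_emb, dim=-1)
    logits = v @ m.t() / temperature              # (B, B) similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings.
loss = view_map_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```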

ES2Net: An Efficient Spectral-Spatial Network for Hyperspectral Image Change Detection

  • paper_url: http://arxiv.org/abs/2307.12327
  • repo_url: None
  • paper_authors: Qingren Yao, Yuan Zhou, Wei Xiang
  • for: Hyperspectral image change detection (HSI-CD), i.e., identifying the differences between bitemporal HSIs.
  • methods: An end-to-end efficient spectral-spatial change detection network (ES2Net) that includes a learnable band selection module and a cluster-wise spatial attention mechanism.
  • results: Experiments on three widely used HSI-CD datasets demonstrate the effectiveness and superiority of the proposed method compared with other state-of-the-art methods.
    Abstract Hyperspectral image change detection (HSI-CD) aims to identify the differences in bitemporal HSIs. To mitigate spectral redundancy and improve the discriminativeness of changing features, some methods introduced band selection technology to select bands conducive for CD. However, these methods are limited by the inability to end-to-end training with the deep learning-based feature extractor and lack considering the complex nonlinear relationship among bands. In this paper, we propose an end-to-end efficient spectral-spatial change detection network (ES2Net) to address these issues. Specifically, we devised a learnable band selection module to automatically select bands conducive to CD. It can be jointly optimized with a feature extraction network and capture the complex nonlinear relationships among bands. Moreover, considering the large spatial feature distribution differences among different bands, we design the cluster-wise spatial attention mechanism that assigns a spatial attention factor to each individual band to individually improve the feature discriminativeness for each band. Experiments on three widely used HSI-CD datasets demonstrate the effectiveness and superiority of this method compared with other state-of-the-art methods.
    摘要 高光谱图像变化检测 (HSI-CD) 的目标是识别双时相高光谱图像之间的差异。为了减少光谱冗余并提高变化特征的判别性,一些方法引入了波段选择技术来挑选有利于变化检测的波段。然而,这些方法无法与基于深度学习的特征提取器进行端到端训练,也未考虑波段之间复杂的非线性关系。本文提出了一种端到端的高效光谱-空间变化检测网络 (ES2Net) 来解决上述问题。具体而言,我们设计了一个可学习的波段选择模块,能够自动选择有利于变化检测的波段;该模块可与特征提取网络联合优化,并能刻画波段之间复杂的非线性关系。此外,考虑到不同波段之间空间特征分布差异较大,我们设计了按簇划分的空间注意力机制,为每个波段分配独立的空间注意力因子,从而分别提升各波段特征的判别性。在三个广泛使用的HSI-CD数据集上的实验验证了该方法相对其他最先进方法的有效性和优越性。
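The abstract does not spell out how the learnable band selection module is implemented; one common way to make band selection differentiable and jointly trainable with the feature extractor is a per-band learnable gate, sketched below as an assumption rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class LearnableBandGate(nn.Module):
    """Soft, differentiable selection of spectral bands via learnable gates."""

    def __init__(self, num_bands: int, keep_ratio: float = 0.3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_bands))
        self.keep_ratio = keep_ratio

    def forward(self, x):                          # x: (B, num_bands, H, W)
        gates = torch.sigmoid(self.logits)         # per-band weight in (0, 1)
        return x * gates.view(1, -1, 1, 1), gates

    def sparsity_loss(self, gates):
        # Encourage roughly keep_ratio of the bands to remain active.
        return (gates.mean() - self.keep_ratio).abs()

gate = LearnableBandGate(num_bands=154)
x = torch.randn(2, 154, 64, 64)                    # e.g. a bitemporal difference cube
x_sel, g = gate(x)
print(x_sel.shape, gate.sparsity_loss(g).item())
```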

Development of pericardial fat count images using a combination of three different deep-learning models

  • paper_url: http://arxiv.org/abs/2307.12316
  • repo_url: None
  • paper_authors: Takaaki Matsunaga, Atsushi Kono, Hidetoshi Matsuo, Kaoru Kitagawa, Mizuho Nishio, Hiromi Hashimura, Yu Izawa, Takayoshi Toba, Kazuki Ishikawa, Akie Katsuki, Kazuyuki Ohmura, Takamichi Murakami
  • for: The paper aims to generate pericardial fat count images (PFCIs) from chest radiographs (CXRs) using a dedicated deep-learning model, in order to evaluate pericardial fat (PF) and its potential role in the development of coronary artery disease.
  • methods: The proposed method uses three different deep-learning models, including CycleGAN, to generate PFCIs from CXRs. The method first projects the three-dimensional CT images onto a two-dimensional plane, and then uses the deep-learning models to generate PFCIs from the projected images. The performance of the proposed method is evaluated using structural similarity index measure (SSIM), mean squared error (MSE), and mean absolute error (MAE).
  • results: The results show that the PFCIs generated using the proposed method have better performance than those generated using a single CycleGAN-based model, as measured by SSIM, MSE, and MAE. The proposed method also shows the potential for evaluating PF without the need for CT scans.
    Abstract Rationale and Objectives: Pericardial fat (PF), the thoracic visceral fat surrounding the heart, promotes the development of coronary artery disease by inducing inflammation of the coronary arteries. For evaluating PF, this study aimed to generate pericardial fat count images (PFCIs) from chest radiographs (CXRs) using a dedicated deep-learning model. Materials and Methods: The data of 269 consecutive patients who underwent coronary computed tomography (CT) were reviewed. Patients with metal implants, pleural effusion, history of thoracic surgery, or that of malignancy were excluded. Thus, the data of 191 patients were used. PFCIs were generated from the projection of three-dimensional CT images, where fat accumulation was represented by a high pixel value. Three different deep-learning models, including CycleGAN, were combined in the proposed method to generate PFCIs from CXRs. A single CycleGAN-based model was used to generate PFCIs from CXRs for comparison with the proposed method. To evaluate the image quality of the generated PFCIs, structural similarity index measure (SSIM), mean squared error (MSE), and mean absolute error (MAE) of (i) the PFCI generated using the proposed method and (ii) the PFCI generated using the single model were compared. Results: The mean SSIM, MSE, and MAE were as follows: 0.856, 0.0128, and 0.0357, respectively, for the proposed model; and 0.762, 0.0198, and 0.0504, respectively, for the single CycleGAN-based model. Conclusion: PFCIs generated from CXRs with the proposed model showed better performance than those with the single model. PFCI evaluation without CT may be possible with the proposed method.
    摘要 目的与背景:心包脂肪 (PF) 是包绕心脏的胸腔内脏脂肪,可通过诱发冠状动脉炎症促进冠状动脉疾病的发生。为评估PF,本研究旨在利用专门的深度学习模型,从胸部X射线片 (CXR) 生成心包脂肪计数图像 (PFCI)。材料与方法:回顾了269例连续接受冠状动脉CT检查患者的数据,排除体内金属植入物、胸腔积液、胸部手术史或恶性肿瘤史的患者后,共纳入191例。PFCI由三维CT图像投影生成,其中脂肪堆积以高像素值表示。所提方法组合了三种不同的深度学习模型 (包括CycleGAN) 从CXR生成PFCI,并与仅使用单个CycleGAN模型生成的结果进行比较。采用结构相似性指数 (SSIM)、均方误差 (MSE) 和平均绝对误差 (MAE) 评估生成图像的质量。结果:所提方法的SSIM、MSE和MAE分别为0.856、0.0128和0.0357;单一CycleGAN模型则分别为0.762、0.0198和0.0504。结论:所提方法从CXR生成的PFCI性能优于单一模型,提示无需CT即可评估PF的可能性。
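The quality metrics reported above (SSIM, MSE, MAE) can be computed on a generated/reference image pair as in the short sketch below; the function name and the toy data are illustrative, not part of the study.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate_pfci(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Compare a generated pericardial-fat count image with its CT-derived reference.
    Both inputs are 2-D float arrays scaled to [0, 1]."""
    return {
        "SSIM": float(ssim(reference, generated, data_range=1.0)),
        "MSE": float(np.mean((reference - generated) ** 2)),
        "MAE": float(np.mean(np.abs(reference - generated))),
    }

# Toy usage with random stand-in images.
rng = np.random.default_rng(0)
ref = rng.random((256, 256))
gen = np.clip(ref + rng.normal(scale=0.05, size=ref.shape), 0.0, 1.0)
print(evaluate_pfci(gen, ref))
```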

Building Extraction from Remote Sensing Images via an Uncertainty-Aware Network

  • paper_url: http://arxiv.org/abs/2307.12309
  • repo_url: https://github.com/henryjiepanli/uncertainty-aware-network
  • paper_authors: Wei He, Jiepan Li, Weinan Cao, Liangpei Zhang, Hongyan Zhang
  • for: 减少建筑物识别错误率
  • methods: 使用uncertainty-aware网络(UANet)
  • results: 在三个公开建筑物数据集上大幅优于其他最先进方法
    Abstract Building extraction aims to segment building pixels from remote sensing images and plays an essential role in many applications, such as city planning and urban dynamic monitoring. Over the past few years, deep learning methods with encoder-decoder architectures have achieved remarkable performance due to their powerful feature representation capability. Nevertheless, due to the varying scales and styles of buildings, conventional deep learning models always suffer from uncertain predictions and cannot accurately distinguish the complete footprints of the building from the complex distribution of ground objects, leading to a large degree of omission and commission. In this paper, we realize the importance of uncertain prediction and propose a novel and straightforward Uncertainty-Aware Network (UANet) to alleviate this problem. To verify the performance of our proposed UANet, we conduct extensive experiments on three public building datasets, including the WHU building dataset, the Massachusetts building dataset, and the Inria aerial image dataset. Results demonstrate that the proposed UANet outperforms other state-of-the-art algorithms by a large margin.
    摘要 建筑物提取旨在从遥感影像中分割出建筑物像素,在城市规划和城市动态监测等诸多应用中发挥着重要作用。近年来,采用编码器-解码器结构的深度学习方法凭借强大的特征表示能力取得了显著性能。然而,由于建筑物尺度和风格各异,传统深度学习模型的预测往往存在不确定性,难以在复杂的地物分布中准确区分完整的建筑物轮廓,导致大量漏检与误检。本文认识到不确定预测的重要性,提出了一种新颖而简洁的不确定性感知网络 (UANet) 来缓解该问题。为验证UANet的性能,我们在三个公开建筑物数据集 (WHU建筑物数据集、Massachusetts建筑物数据集和Inria航空影像数据集) 上进行了大量实验,结果表明UANet大幅优于其他最先进算法。

RANSAC-NN: Unsupervised Image Outlier Detection using RANSAC

  • paper_url: http://arxiv.org/abs/2307.12301
  • repo_url: https://github.com/mxtsai/ransac-nn
  • paper_authors: Chen-Han Tsai, Yu-Shao Peng
  • for: This paper proposes an unsupervised outlier detection algorithm specifically designed for image data, called RANSAC-NN.
  • methods: The proposed algorithm uses a RANSAC-based approach to compare images and predict the outlier score without additional training or label information.
  • results: The proposed algorithm consistently performs favorably against state-of-the-art outlier detection algorithms on 15 diverse datasets without any hyperparameter tuning, and it has potential applications in image mislabeled detection.
    Abstract Image outlier detection (OD) is crucial for ensuring the quality and accuracy of image datasets used in computer vision tasks. The majority of OD algorithms, however, have not been targeted toward image data. Consequently, the results of applying such algorithms to images are often suboptimal. In this work, we propose RANSAC-NN, a novel unsupervised OD algorithm specifically designed for images. By comparing images in a RANSAC-based approach, our algorithm automatically predicts the outlier score of each image without additional training or label information. We evaluate RANSAC-NN against state-of-the-art OD algorithms on 15 diverse datasets. Without any hyperparameter tuning, RANSAC-NN consistently performs favorably in contrast to other algorithms in almost every dataset category. Furthermore, we provide a detailed analysis to understand each RANSAC-NN component, and we demonstrate its potential applications in image mislabeled detection. Code for RANSAC-NN is provided at https://github.com/mxtsai/ransac-nn
    摘要 图像异常检测 (OD) 对保证计算机视觉任务所用图像数据集的质量与准确性至关重要。然而,大多数异常检测算法并非针对图像数据设计,直接应用于图像时效果往往欠佳。本文提出RANSAC-NN,一种专为图像设计的新型无监督异常检测算法。该算法以基于RANSAC的方式对图像进行比较,无需额外训练或标签信息即可自动预测每张图像的异常分数。我们在15个多样化数据集上将RANSAC-NN与最先进的异常检测算法进行了对比,在不进行任何超参数调节的情况下,RANSAC-NN在几乎所有数据集类别中均表现出色。此外,我们对RANSAC-NN的各个组成部分进行了详细分析,并展示了其在图像错误标注检测中的潜在应用。代码见 https://github.com/mxtsai/ransac-nn 。
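RANSAC-NN's exact procedure is in the linked repository; the sketch below only illustrates the general RANSAC-style idea of scoring images by how often they fall outside consensus sets fitted on random subsets of pre-extracted feature vectors. Thresholds, subset sizes, and the use of plain centroids are simplifying assumptions.

```python
import numpy as np

def ransac_outlier_scores(features, n_iters=200, subset_size=20, inlier_quantile=0.7, seed=0):
    """Score each image (row of `features`) by how often it disagrees with
    centroids fitted on random subsets; frequent disagreement => likely outlier."""
    rng = np.random.default_rng(seed)
    n = len(features)
    scores = np.zeros(n)
    for _ in range(n_iters):
        idx = rng.choice(n, size=subset_size, replace=False)
        centroid = features[idx].mean(axis=0)
        dist = np.linalg.norm(features - centroid, axis=1)
        thresh = np.quantile(dist[idx], inlier_quantile)   # consensus scale from the subset
        scores += (dist > thresh).astype(float)
    return scores / n_iters

# Toy usage: 95 inlier feature vectors plus 5 shifted outliers.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(size=(95, 128)), rng.normal(loc=4.0, size=(5, 128))])
scores = ransac_outlier_scores(feats)
print("most suspicious indices:", np.argsort(scores)[-5:])
```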

Hybrid-CSR: Coupling Explicit and Implicit Shape Representation for Cortical Surface Reconstruction

  • paper_url: http://arxiv.org/abs/2307.12299
  • repo_url: None
  • paper_authors: Shanlin Sun, Thanh-Tung Le, Chenyu You, Hao Tang, Kun Han, Haoyu Ma, Deying Kong, Xiangyi Yan, Xiaohui Xie
  • for: cortical surface reconstruction
  • methods: geometric deep-learning model combining explicit and implicit shape representations, mesh-based deformation module, optimization-based diffeomorphic surface registration
  • results: surpasses existing implicit and explicit cortical surface reconstruction methods in numeric metrics, including accuracy, regularity, and consistency.
    Abstract We present Hybrid-CSR, a geometric deep-learning model that combines explicit and implicit shape representations for cortical surface reconstruction. Specifically, Hybrid-CSR begins with explicit deformations of template meshes to obtain coarsely reconstructed cortical surfaces, based on which the oriented point clouds are estimated for the subsequent differentiable poisson surface reconstruction. By doing so, our method unifies explicit (oriented point clouds) and implicit (indicator function) cortical surface reconstruction. Compared to explicit representation-based methods, our hybrid approach is more friendly to capture detailed structures, and when compared with implicit representation-based methods, our method can be topology aware because of end-to-end training with a mesh-based deformation module. In order to address topology defects, we propose a new topology correction pipeline that relies on optimization-based diffeomorphic surface registration. Experimental results on three brain datasets show that our approach surpasses existing implicit and explicit cortical surface reconstruction methods in numeric metrics in terms of accuracy, regularity, and consistency.
    摘要 我们提出Hybrid-CSR,一种将显式与隐式形状表示相结合、用于皮层表面重建的几何深度学习模型。具体而言,Hybrid-CSR首先对模板网格进行显式形变,得到粗略重建的皮层表面,并据此估计带方向的点云,用于后续可微分的泊松表面重建。如此一来,我们的方法将显式 (带方向点云) 与隐式 (指示函数) 皮层表面重建统一起来。与基于显式表示的方法相比,我们的混合方法更易于捕捉细节结构;与基于隐式表示的方法相比,由于端到端训练中引入了基于网格的形变模块,我们的方法能够保持拓扑感知。为修复拓扑缺陷,我们还提出了一条基于优化的微分同胚表面配准的拓扑校正流程。在三个脑部数据集上的实验表明,我们的方法在精度、规整性和一致性等数值指标上均优于现有的隐式与显式皮层表面重建方法。

Simultaneous temperature estimation and nonuniformity correction from multiple frames

  • paper_url: http://arxiv.org/abs/2307.12297
  • repo_url: None
  • paper_authors: Navot Oz, Omri Berman, Nir Sochen, David Mendelovich, Iftach Klapp
  • for: 这个论文的目的是提出一种同时进行温度估计和不均匀性修正的方法,以提高低成本的红外摄像机在各种应用中的精度和效率。
  • methods: 该方法基于深度学习核函数网络(KPN),利用摄像机的物理图像捕获模型,并通过一个新的偏移块来 incorporate ambient temperature。
  • results: 对实际数据进行测试,该方法可以 achieve 高精度和高效率的温度估计和不均匀性修正,相比 vanilla KPN 有显著的改善。
    Abstract Infrared (IR) cameras are widely used for temperature measurements in various applications, including agriculture, medicine, and security. Low-cost IR camera have an immense potential to replace expansive radiometric cameras in these applications, however low-cost microbolometer-based IR cameras are prone to spatially-variant nonuniformity and to drift in temperature measurements, which limits their usability in practical scenarios. To address these limitations, we propose a novel approach for simultaneous temperature estimation and nonuniformity correction from multiple frames captured by low-cost microbolometer-based IR cameras. We leverage the physical image acquisition model of the camera and incorporate it into a deep learning architecture called kernel estimation networks (KPN), which enables us to combine multiple frames despite imperfect registration between them. We also propose a novel offset block that incorporates the ambient temperature into the model and enables us to estimate the offset of the camera, which is a key factor in temperature estimation. Our findings demonstrate that the number of frames has a significant impact on the accuracy of temperature estimation and nonuniformity correction. Moreover, our approach achieves a significant improvement in performance compared to vanilla KPN, thanks to the offset block. The method was tested on real data collected by a low-cost IR camera mounted on a UAV, showing only a small average error of $0.27^\circ C-0.54^\circ C$ relative to costly scientific-grade radiometric cameras. Our method provides an accurate and efficient solution for simultaneous temperature estimation and nonuniformity correction, which has important implications for a wide range of practical applications.
    摘要 红外 (IR) 相机被广泛用于农业、医疗和安防等多种应用中的温度测量。低成本红外相机具有在这些应用中取代昂贵辐射计量级相机的巨大潜力,然而基于微测辐射热计 (microbolometer) 的低成本红外相机存在空间非均匀性以及温度测量漂移的问题,限制了其在实际场景中的可用性。为解决这些限制,我们提出了一种利用低成本微测辐射热计红外相机拍摄的多帧图像同时进行温度估计与非均匀性校正的新方法。我们利用相机的物理成像模型,并将其融入称为核估计网络 (KPN) 的深度学习架构中,使得即便多帧之间配准不完美也能将其融合。我们还提出了一个新的偏移模块,将环境温度纳入模型,从而估计相机的偏移量——这是温度估计中的关键因素。我们的结果表明,帧数对温度估计和非均匀性校正的精度有显著影响;并且得益于偏移模块,我们的方法相对于原始KPN取得了显著的性能提升。该方法在搭载于无人机的低成本红外相机采集的真实数据上进行了测试,相对昂贵的科研级辐射计量相机,平均误差仅为$0.27^\circ C-0.54^\circ C$。我们的方法为同时温度估计与非均匀性校正提供了准确而高效的解决方案,对广泛的实际应用具有重要意义。

TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering

  • paper_url: http://arxiv.org/abs/2307.12291
  • repo_url: https://github.com/pansanity666/TransHuman
  • paper_authors: Xiao Pan, Zongxin Yang, Jianxin Ma, Chang Zhou, Yi Yang
  • for: 本研究关注可泛化的神经人体渲染任务,即利用不同人物的多视图视频训练条件神经辐射场 (NeRF) 模型,使其能够泛化到新的人物上进行渲染。
  • methods: 本研究使用了一种brand-new的框架,名为 TransHuman,它通过 Transformer-based Human Encoding (TransHE)、Deformable Partial Radiance Fields (DPaRF) 和 Fine-grained Detail Integration (FDI) 来学习涂抹 SMPL 图像,并捕捉人体部件之间的全局关系。
  • results: 实验表明,TransHuman 在 ZJU-MoCap 和 H36M 数据集上取得了新的最先进性能,同时具有较高的效率。项目页面:https://pansanity666.github.io/TransHuman/
    Abstract In this paper, we focus on the task of generalizable neural human rendering which trains conditional Neural Radiance Fields (NeRF) from multi-view videos of different characters. To handle the dynamic human motion, previous methods have primarily used a SparseConvNet (SPC)-based human representation to process the painted SMPL. However, such SPC-based representation i) optimizes under the volatile observation space which leads to the pose-misalignment between training and inference stages, and ii) lacks the global relationships among human parts that is critical for handling the incomplete painted SMPL. Tackling these issues, we present a brand-new framework named TransHuman, which learns the painted SMPL under the canonical space and captures the global relationships between human parts with transformers. Specifically, TransHuman is mainly composed of Transformer-based Human Encoding (TransHE), Deformable Partial Radiance Fields (DPaRF), and Fine-grained Detail Integration (FDI). TransHE first processes the painted SMPL under the canonical space via transformers for capturing the global relationships between human parts. Then, DPaRF binds each output token with a deformable radiance field for encoding the query point under the observation space. Finally, the FDI is employed to further integrate fine-grained information from reference images. Extensive experiments on ZJU-MoCap and H36M show that our TransHuman achieves a significantly new state-of-the-art performance with high efficiency. Project page: https://pansanity666.github.io/TransHuman/
    摘要 在本文中,我们关注可泛化的神经人体渲染任务,即利用不同人物的多视图视频训练条件神经辐射场 (NeRF)。为了处理人体的动态运动,以往方法主要使用基于SparseConvNet (SPC) 的人体表示来处理涂抹后的SMPL (painted SMPL)。然而,这种基于SPC的表示存在两个问题:一是在不稳定的观察空间下进行优化,导致训练与推理阶段的姿态不对齐;二是缺乏人体部位之间的全局关系,而这对处理不完整的painted SMPL至关重要。为解决这些问题,我们提出了一个全新的框架TransHuman,它在规范空间 (canonical space) 中学习painted SMPL,并利用Transformer捕捉人体部位之间的全局关系。TransHuman主要由基于Transformer的人体编码 (TransHE)、可形变局部辐射场 (DPaRF) 和细粒度细节融合 (FDI) 组成。TransHE首先在规范空间中用Transformer处理painted SMPL,以捕捉人体部位之间的全局关系;随后,DPaRF将每个输出token绑定到一个可形变的辐射场,用于在观察空间中编码查询点;最后,FDI用于进一步融合来自参考图像的细粒度信息。我们在ZJU-MoCap和H36M上进行了大量实验,证明TransHuman以较高的效率取得了显著领先的最先进性能。项目页面:https://pansanity666.github.io/TransHuman/

Downstream-agnostic Adversarial Examples

  • paper_url: http://arxiv.org/abs/2307.12280
  • repo_url: https://github.com/cgcl-codes/advencoder
  • paper_authors: Ziqi Zhou, Shengshan Hu, Ruizhi Zhao, Qian Wang, Leo Yu Zhang, Junhui Hou, Hai Jin
  • for: 本研究旨在提出一个基于预训模型的攻击框架,可以对具有预训模型的下游任务进行 Universial Adversarial Examples 攻击。
  • methods: 本研究使用了高频率成分信息来引导生成攻击例子,然后设计了一个生成攻击框架,以学习攻击类别dataset的分布,以提高攻击成功率和传播性。
  • results: 研究结果显示,攻击者可以成功攻击下游任务,不需要知道预训dataset或下游dataset的详细信息。此外,研究者还提出了四种防护方法,其结果进一步证明了 AdvEncoder 的攻击能力。
    Abstract Self-supervised learning usually uses a large amount of unlabeled data to pre-train an encoder which can be used as a general-purpose feature extractor, such that downstream users only need to perform fine-tuning operations to enjoy the benefit of "large model". Despite this promising prospect, the security of pre-trained encoder has not been thoroughly investigated yet, especially when the pre-trained encoder is publicly available for commercial use. In this paper, we propose AdvEncoder, the first framework for generating downstream-agnostic universal adversarial examples based on the pre-trained encoder. AdvEncoder aims to construct a universal adversarial perturbation or patch for a set of natural images that can fool all the downstream tasks inheriting the victim pre-trained encoder. Unlike traditional adversarial example works, the pre-trained encoder only outputs feature vectors rather than classification labels. Therefore, we first exploit the high frequency component information of the image to guide the generation of adversarial examples. Then we design a generative attack framework to construct adversarial perturbations/patches by learning the distribution of the attack surrogate dataset to improve their attack success rates and transferability. Our results show that an attacker can successfully attack downstream tasks without knowing either the pre-training dataset or the downstream dataset. We also tailor four defenses for pre-trained encoders, the results of which further prove the attack ability of AdvEncoder.
    摘要 自监督学习通常使用大量未标注数据预训练一个编码器,将其用作通用特征提取器,使下游用户只需进行微调即可享受"大模型"的好处。尽管前景可观,预训练编码器的安全性尚未得到充分研究,尤其是在其被公开用于商业用途的情况下。本文提出AdvEncoder,首个基于预训练编码器、与下游任务无关的通用对抗样本生成框架。AdvEncoder的目标是为一组自然图像构建一个通用对抗扰动或对抗补丁,足以欺骗所有继承受害者预训练编码器的下游任务。与传统对抗样本工作不同,预训练编码器只输出特征向量而非分类标签。因此,我们首先利用图像的高频成分信息来引导对抗样本的生成;随后设计了一个生成式攻击框架,通过学习攻击代理数据集的分布来提高攻击成功率和可迁移性。我们的结果表明,攻击者无需知道预训练数据集或下游数据集即可成功攻击下游任务。我们还针对预训练编码器设计了四种防御措施,其结果进一步证明了AdvEncoder的攻击能力。
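AdvEncoder itself uses a generative attack framework guided by high-frequency components; the minimal sketch below only illustrates the simpler, underlying idea of optimizing a single universal perturbation that pushes a frozen encoder's features away from their clean values. The ResNet-18 surrogate, the L_inf budget, and the loss are assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen surrogate standing in for a publicly released pre-trained encoder.
encoder = models.resnet18(weights=None)
encoder.fc = nn.Identity()                    # use penultimate features as the embedding
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)

images = torch.rand(8, 3, 128, 128)           # stand-in unlabeled images
delta = torch.zeros(1, 3, 128, 128, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-2)
eps = 8 / 255                                 # L_inf budget for the universal perturbation

for step in range(30):
    clean = encoder(images)
    adv = encoder((images + delta).clamp(0.0, 1.0))
    loss = -nn.functional.mse_loss(adv, clean)        # push perturbed features away from clean ones
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                       # keep the perturbation within budget

print("feature shift achieved:", -loss.item())
```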

FDCT: Fast Depth Completion for Transparent Objects

  • paper_url: http://arxiv.org/abs/2307.12274
  • repo_url: https://github.com/nonmy/fdct
  • paper_authors: Tianan Li, Zhehan Chen, Huan Liu, Chen Wang
  • for: 这篇论文的目的是提出一种快速的深度完成框架,用于处理透明物体的RGB-D图像。
  • methods: 该方法设计了一个新的融合分支和快捷连接来利用低级特征,并引入一种损失函数来抑制过拟合。
  • results: 与之前的方法相比,该方法能够以约70帧/秒的速度提供更高精度的深度补全结果,并能改善物体抓取任务中的姿态估计。
    Abstract Depth completion is crucial for many robotic tasks such as autonomous driving, 3-D reconstruction, and manipulation. Despite the significant progress, existing methods remain computationally intensive and often fail to meet the real-time requirements of low-power robotic platforms. Additionally, most methods are designed for opaque objects and struggle with transparent objects due to the special properties of reflection and refraction. To address these challenges, we propose a Fast Depth Completion framework for Transparent objects (FDCT), which also benefits downstream tasks like object pose estimation. To leverage local information and avoid overfitting issues when integrating it with global information, we design a new fusion branch and shortcuts to exploit low-level features and a loss function to suppress overfitting. This results in an accurate and user-friendly depth rectification framework which can recover dense depth estimation from RGB-D images alone. Extensive experiments demonstrate that FDCT can run about 70 FPS with a higher accuracy than the state-of-the-art methods. We also demonstrate that FDCT can improve pose estimation in object grasping tasks. The source code is available at https://github.com/Nonmy/FDCT
    摘要 深度补全对于自动驾驶、三维重建和机器人操作等许多机器人任务至关重要。尽管已有显著进展,现有方法仍然计算量较大,往往难以满足低功耗机器人平台的实时性要求;并且大多数方法针对不透明物体设计,由于反射和折射等特殊性质,在透明物体上表现不佳。为解决这些挑战,我们提出了一个面向透明物体的快速深度补全框架 (FDCT),该框架同样有利于物体姿态估计等下游任务。为了利用局部信息并避免与全局信息融合时的过拟合问题,我们设计了新的融合分支和快捷连接来利用低级特征,并设计了一个损失函数来抑制过拟合。由此得到一个准确且易用的深度校正框架,仅凭RGB-D图像即可恢复稠密的深度估计。大量实验表明,FDCT 能够以约70 FPS 运行,并且精度高于现有最先进方法。我们还展示了 FDCT 能够改善物体抓取任务中的姿态估计。源代码见 https://github.com/Nonmy/FDCT 。

Context Perception Parallel Decoder for Scene Text Recognition

  • paper_url: http://arxiv.org/abs/2307.12270
  • repo_url: None
  • paper_authors: Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, Yu-Gang Jiang
  • for: 这篇论文主要研究Scene Text Recognition(STR)方法的高精度和快速推断问题。
  • methods: 本文提出了一种新的AR模型,并进行了实验研究,发现AR模型的成功不仅归功于语言模型,还归功于视觉上下文的感知。为此,本文提出了Context Perception Parallel Decoder(CPPD)模型,通过计算字符出现频率和字符顺序来提供上下文信息,并且与字符预测任务结合,以准确地推断字符序列。
  • results: 实验结果表明,CPPD模型在英文和中文benchmark上达到了非常竞争性的准确率,并且比AR模型快约7倍。此外,CPPD模型也是当前最快的recognizer之一。代码将很快发布。
    Abstract Scene text recognition (STR) methods have struggled to attain high accuracy and fast inference speed. Autoregressive (AR)-based STR model uses the previously recognized characters to decode the next character iteratively. It shows superiority in terms of accuracy. However, the inference speed is slow also due to this iteration. Alternatively, parallel decoding (PD)-based STR model infers all the characters in a single decoding pass. It has advantages in terms of inference speed but worse accuracy, as it is difficult to build a robust recognition context in such a pass. In this paper, we first present an empirical study of AR decoding in STR. In addition to constructing a new AR model with the top accuracy, we find out that the success of AR decoder lies also in providing guidance on visual context perception rather than language modeling as claimed in existing studies. As a consequence, we propose Context Perception Parallel Decoder (CPPD) to decode the character sequence in a single PD pass. CPPD devises a character counting module and a character ordering module. Given a text instance, the former infers the occurrence count of each character, while the latter deduces the character reading order and placeholders. Together with the character prediction task, they construct a context that robustly tells what the character sequence is and where the characters appear, well mimicking the context conveyed by AR decoding. Experiments on both English and Chinese benchmarks demonstrate that CPPD models achieve highly competitive accuracy. Moreover, they run approximately 7x faster than their AR counterparts, and are also among the fastest recognizers. The code will be released soon.
    摘要 场景文本识别 (STR) 方法长期以来难以同时取得高精度与快速推理。基于自回归 (AR) 的STR模型利用已识别的字符逐个解码下一个字符,精度较高,但也正因这种迭代方式推理速度较慢。与之相对,基于并行解码 (PD) 的STR模型在单次解码中推断出全部字符,推理速度占优,但由于难以在单次解码中构建稳健的识别上下文,精度相对较差。在本文中,我们首先对STR中的AR解码进行了实证研究;除了构建一个精度领先的新AR模型之外,我们还发现AR解码器的成功不仅在于现有研究所说的语言建模,更在于其为视觉上下文感知提供了引导。基于此,我们提出了上下文感知并行解码器 (CPPD),在单次并行解码中推断字符序列。CPPD设计了字符计数模块和字符排序模块:给定一个文本实例,前者推断每个字符的出现次数,后者推断字符的阅读顺序与占位符。二者与字符预测任务一起,构建出能够稳健刻画字符序列内容及其出现位置的上下文,很好地模拟了AR解码所传递的上下文。在英文和中文基准上的实验表明,CPPD模型取得了极具竞争力的精度,且运行速度约为对应AR模型的7倍,也是目前最快的识别器之一。代码即将发布。

ResWCAE: Biometric Pattern Image Denoising Using Residual Wavelet-Conditioned Autoencoder

  • paper_url: http://arxiv.org/abs/2307.12255
  • repo_url: None
  • paper_authors: Youzhi Liang, Wen Liang
  • for: 提出一种轻量级且稳健的深度学习架构,用于解决小型物联网 (IoT) 设备中指纹图像去噪的问题。
  • methods: 该架构为带Kullback-Leibler散度正则的残差小波条件卷积自编码器 (Res-WCAE),包括图像编码器和小波编码器两个编码器以及一个解码器;图像编码器与解码器之间的残差连接用于保留细粒度空间特征,瓶颈层则以小波编码器在小波变换域提取的压缩特征为条件。
  • results: 与多种最先进去噪方法相比,Res-WCAE表现更优,尤其是在高噪声水平下严重退化的指纹图像上;总体而言,Res-WCAE有望解决小型IoT设备中生物特征认证系统面临的图像质量问题。
    Abstract The utilization of biometric authentication with pattern images is increasingly popular in compact Internet of Things (IoT) devices. However, the reliability of such systems can be compromised by image quality issues, particularly in the presence of high levels of noise. While state-of-the-art deep learning algorithms designed for generic image denoising have shown promise, their large number of parameters and lack of optimization for unique biometric pattern retrieval make them unsuitable for these devices and scenarios. In response to these challenges, this paper proposes a lightweight and robust deep learning architecture, the Residual Wavelet-Conditioned Convolutional Autoencoder (Res-WCAE) with a Kullback-Leibler divergence (KLD) regularization, designed specifically for fingerprint image denoising. Res-WCAE comprises two encoders - an image encoder and a wavelet encoder - and one decoder. Residual connections between the image encoder and decoder are leveraged to preserve fine-grained spatial features, where the bottleneck layer conditioned on the compressed representation of features obtained from the wavelet encoder using approximation and detail subimages in the wavelet-transform domain. The effectiveness of Res-WCAE is evaluated against several state-of-the-art denoising methods, and the experimental results demonstrate that Res-WCAE outperforms these methods, particularly for heavily degraded fingerprint images in the presence of high levels of noise. Overall, Res-WCAE shows promise as a solution to the challenges faced by biometric authentication systems in compact IoT devices.
    摘要 基于图案图像的生物特征认证在小型物联网 (IoT) 设备中的应用日益普遍。然而,此类系统的可靠性可能受到图像质量问题的影响,尤其是在噪声水平较高的情况下。虽然面向通用图像去噪的最先进深度学习算法已展现出潜力,但其参数量庞大,且并未针对生物特征图案的恢复进行优化,因此并不适用于此类设备与场景。为应对这些挑战,本文提出了一种轻量级且稳健的深度学习架构——带Kullback-Leibler散度 (KLD) 正则的残差小波条件卷积自编码器 (Res-WCAE),专门用于指纹图像去噪。Res-WCAE由图像编码器和小波编码器两个编码器以及一个解码器组成;图像编码器与解码器之间通过残差连接保留细粒度空间特征,瓶颈层则以小波编码器利用小波变换域的近似与细节子图得到的压缩特征表示为条件。与多种最先进去噪方法的对比实验表明,Res-WCAE表现更优,尤其是在高噪声水平下严重退化的指纹图像上。总体而言,Res-WCAE有望成为小型IoT设备中生物特征认证系统所面临挑战的一种解决方案。

Explainable Depression Detection via Head Motion Patterns

  • paper_url: http://arxiv.org/abs/2307.12241
  • repo_url: None
  • paper_authors: Monika Gahalawat, Raul Fernandez Rojas, Tanaya Guha, Ramanathan Subramanian, Roland Goecke
  • for: 检测抑郁症状
  • methods: 基于head motion数据的基本运动单元(kinemes)和机器学习方法
  • results: 头部运动模式是识别抑郁症状的有效标志,且能观察到与先前研究一致的可解释kineme模式。
    Abstract While depression has been studied via multimodal non-verbal behavioural cues, head motion behaviour has not received much attention as a biomarker. This study demonstrates the utility of fundamental head-motion units, termed \emph{kinemes}, for depression detection by adopting two distinct approaches, and employing distinctive features: (a) discovering kinemes from head motion data corresponding to both depressed patients and healthy controls, and (b) learning kineme patterns only from healthy controls, and computing statistics derived from reconstruction errors for both the patient and control classes. Employing machine learning methods, we evaluate depression classification performance on the \emph{BlackDog} and \emph{AVEC2013} datasets. Our findings indicate that: (1) head motion patterns are effective biomarkers for detecting depressive symptoms, and (2) explanatory kineme patterns consistent with prior findings can be observed for the two classes. Overall, we achieve peak F1 scores of 0.79 and 0.82, respectively, over BlackDog and AVEC2013 for binary classification over episodic \emph{thin-slices}, and a peak F1 of 0.72 over videos for AVEC2013.
    摘要 尽管抑郁症已通过多模态非语言行为线索得到研究,但头部运动作为一种生物标志尚未受到足够关注。本研究通过两种不同的方法及相应特征,展示了基本头部运动单元 (kinemes) 在抑郁检测中的作用:(a) 同时从抑郁患者与健康对照组的头部运动数据中发现kinemes;(b) 仅从健康对照组学习kineme模式,并基于重建误差为患者和对照两类分别计算统计特征。我们采用机器学习方法,在BlackDog和AVEC2013数据集上评估了抑郁分类性能。研究发现:(1) 头部运动模式是检测抑郁症状的有效生物标志;(2) 两类样本上均可观察到与先前研究一致的可解释kineme模式。总体而言,在片段级thin-slice二分类任务上,我们在BlackDog和AVEC2013上分别取得0.79和0.82的最高F1分数;在AVEC2013的整段视频级分类上取得0.72的最高F1分数。
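To make the two-stage idea concrete, the sketch below treats kinemes as k-means clusters of short head-pose windows, learns the codebook from controls only, and summarizes each recording by statistics of its reconstruction (nearest-centre) error. Window length, the number of kinemes, and the chosen statistics are assumptions rather than the study's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def windowize(pose_series, win=30, hop=15):
    """Cut a (T, 3) head-pose series (pitch, yaw, roll) into flattened windows."""
    return np.array([pose_series[s:s + win].ravel()
                     for s in range(0, len(pose_series) - win + 1, hop)])

def fit_kinemes(control_series, n_kinemes=16, win=30):
    """Learn a kineme codebook from healthy-control head motion only."""
    X = np.vstack([windowize(s, win) for s in control_series])
    return KMeans(n_clusters=n_kinemes, n_init=10, random_state=0).fit(X)

def reconstruction_stats(series, codebook, win=30):
    """Per-recording statistics of the distance to the nearest kineme centre."""
    X = windowize(series, win)
    d = np.min(np.linalg.norm(X[:, None, :] - codebook.cluster_centers_[None], axis=2), axis=1)
    return np.array([d.mean(), d.std(), d.max()])     # features for a downstream classifier

# Toy usage with random pose series.
rng = np.random.default_rng(0)
controls = [rng.normal(size=(600, 3)) for _ in range(5)]
patient = rng.normal(loc=0.3, size=(600, 3))
codebook = fit_kinemes(controls)
print(reconstruction_stats(patient, codebook))
```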

Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation

  • paper_url: http://arxiv.org/abs/2307.12239
  • repo_url: https://github.com/bytedance/dq-det
  • paper_authors: Yiming Cui, Linjie Yang, Haichao Yu
  • for: 这种方法用于提高DETR基本模型的性能,包括对象检测、实例分割、精准分割和视频实例分割等多个任务。
  • methods: 该类方法使用一组可学习的检测查询从Transformer网络中提取信息,并学习预测图像中物体的位置与类别。我们实证发现这些已学习查询的随机凸组合同样有效,进而提出基于图像高层语义、以动态系数学习凸组合;生成的调制查询能更好地刻画不同图像中物体位置与类别的先验。
  • results: 借助我们的调制查询,各类基于DETR的模型在物体检测、实例分割、全景分割和视频实例分割等多个任务上取得了一致且更优的性能。
    Abstract Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network and learn to predict the location and category of one specific object from each query. We empirically find that random convex combinations of the learned queries are still good for the corresponding models. We then propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image. The generated dynamic queries, named modulated queries, better capture the prior of object locations and categories in the different images. Equipped with our modulated queries, a wide range of DETR-based models achieve consistent and superior performance across multiple tasks including object detection, instance segmentation, panoptic segmentation, and video instance segmentation.
    摘要 基于Transformer的检测与分割方法使用一组可学习的检测查询从Transformer网络中获取信息,并学习预测图像中物体的位置与类别。我们实证发现,这些已学习查询的随机凸组合对相应模型仍然有效。据此,我们提出基于图像高层语义、以动态系数学习凸组合;生成的动态查询称为调制查询 (modulated queries),能更好地捕捉不同图像中物体位置与类别的先验。借助调制查询,各种基于DETR的模型在物体检测、实例分割、全景分割和视频实例分割等多个任务上取得了一致且更优的表现。
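A minimal sketch of the convex-combination idea: combination weights are predicted from a pooled image feature and applied to several groups of learned base queries. The grouping, dimensions, and names are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class ModulatedQueries(nn.Module):
    """Dynamic convex combinations of learned detection queries, with
    mixing coefficients predicted from a global image feature."""

    def __init__(self, num_queries=300, dim=256, num_groups=4):
        super().__init__()
        self.base_queries = nn.Parameter(torch.randn(num_groups, num_queries, dim))
        self.coeff_head = nn.Linear(dim, num_groups)       # image-conditioned coefficients

    def forward(self, global_feat):                        # global_feat: (B, dim)
        coeffs = self.coeff_head(global_feat).softmax(dim=-1)   # (B, num_groups), convex weights
        # (B, G, 1, 1) * (1, G, Q, D) summed over groups -> (B, Q, D)
        return (coeffs[:, :, None, None] * self.base_queries[None]).sum(dim=1)

mq = ModulatedQueries()
queries = mq(torch.randn(2, 256))
print(queries.shape)    # torch.Size([2, 300, 256])
```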

Multi-Modal Machine Learning for Assessing Gaming Skills in Online Streaming: A Case Study with CS:GO

  • paper_url: http://arxiv.org/abs/2307.12236
  • repo_url: None
  • paper_authors: Longxiang Zhang, Wenping Wang
  • for: 这种研究是为了评估在视频流处理中的游戏技能,以便向流服务提供者提供个性化推荐和服务促销。
  • methods: 该研究使用了最新的终端模型,以学习多Modalities的联合表示。在数据集中,研究人员首先识别数据集中的漏洞,然后手动清理数据。
  • results: 经过广泛的实验,研究人员证明了他们的提议的有效性。然而,研究人员还发现了他们的模型偏向用户标识而不是学习有意义的表示。
    Abstract Online streaming is an emerging market that address much attention. Assessing gaming skills from videos is an important task for streaming service providers to discover talented gamers. Service providers require the information to offer customized recommendation and service promotion to their customers. Meanwhile, this is also an important multi-modal machine learning tasks since online streaming combines vision, audio and text modalities. In this study we begin by identifying flaws in the dataset and proceed to clean it manually. Then we propose several variants of latest end-to-end models to learn joint representation of multiple modalities. Through our extensive experimentation, we demonstrate the efficacy of our proposals. Moreover, we identify that our proposed models is prone to identifying users instead of learning meaningful representations. We purpose future work to address the issue in the end.
    摘要 在线直播是一个备受关注的新兴市场。从视频中评估游戏技能,是流媒体服务商发掘有潜力玩家的重要任务;服务商需要这些信息来为用户提供个性化推荐和服务推广。同时,由于在线直播融合了视觉、音频和文本等模态,这也是一项重要的多模态机器学习任务。在本研究中,我们首先识别数据集中的缺陷并进行手动清理,随后提出了多种最新端到端模型的变体来学习多模态的联合表示。通过大量实验,我们验证了所提方案的有效性。此外,我们发现所提模型倾向于去识别用户身份,而非学习有意义的表示,我们计划在后续工作中解决这一问题。

EchoGLAD: Hierarchical Graph Neural Networks for Left Ventricle Landmark Detection on Echocardiograms

  • paper_url: http://arxiv.org/abs/2307.12229
  • repo_url: https://github.com/masoudmo/echoglad
  • paper_authors: Masoud Mokhtari, Mobina Mahdavi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang, Renjie Liao
  • for: The paper aims to automate the task of detecting four landmark locations and measuring the internal dimension of the left ventricle and the approximate mass of the surrounding muscle in the heart, using machine learning.
  • methods: The proposed method uses an echocardiogram-based, hierarchical graph neural network (GNN) for left ventricle landmark detection, which includes a hierarchical graph representation learning framework for multi-resolution landmark detection via GNNs, and induced hierarchical supervision at different levels of granularity using a multi-level loss.
  • results: The paper achieves state-of-the-art mean absolute errors (MAEs) of 1.46 mm and 1.86 mm on two datasets under the in-distribution (ID) setting, and shows better out-of-distribution (OOD) generalization than prior works with a testing MAE of 4.3 mm.
    Abstract The functional assessment of the left ventricle chamber of the heart requires detecting four landmark locations and measuring the internal dimension of the left ventricle and the approximate mass of the surrounding muscle. The key challenge of automating this task with machine learning is the sparsity of clinical labels, i.e., only a few landmark pixels in a high-dimensional image are annotated, leading many prior works to heavily rely on isotropic label smoothing. However, such a label smoothing strategy ignores the anatomical information of the image and induces some bias. To address this challenge, we introduce an echocardiogram-based, hierarchical graph neural network (GNN) for left ventricle landmark detection (EchoGLAD). Our main contributions are: 1) a hierarchical graph representation learning framework for multi-resolution landmark detection via GNNs; 2) induced hierarchical supervision at different levels of granularity using a multi-level loss. We evaluate our model on a public and a private dataset under the in-distribution (ID) and out-of-distribution (OOD) settings. For the ID setting, we achieve the state-of-the-art mean absolute errors (MAEs) of 1.46 mm and 1.86 mm on the two datasets. Our model also shows better OOD generalization than prior works with a testing MAE of 4.3 mm.
    摘要 左心室功能评估需要检测四个标志点,并测量左心室内径以及周围心肌的近似质量。利用机器学习自动化该任务的主要难点在于临床标注稀疏:高维图像中仅有少量标志点像素被标注,因此许多已有工作严重依赖各向同性标签平滑。然而,这种标签平滑策略忽略了图像的解剖信息并引入了一定偏差。为解决这一挑战,我们提出了一种基于超声心动图的层次图神经网络 (GNN) 左心室标志点检测模型 (EchoGLAD)。我们的主要贡献包括:1. 基于层次图表示学习框架,通过GNN实现多分辨率标志点检测;2. 通过多级损失,在不同粒度层级上施加层次监督。我们在一个公开和一个私有数据集上,于分布内 (ID) 与分布外 (OOD) 两种设置下评估了模型。在ID设置下,我们在两个数据集上分别取得1.46 mm和1.86 mm的最先进平均绝对误差 (MAE);在OOD设置下,模型的泛化能力也优于已有工作,测试MAE为4.3 mm。

The identification of garbage dumps in the rural areas of Cyprus through the application of deep learning to satellite imagery

  • paper_url: http://arxiv.org/abs/2308.02502
  • repo_url: None
  • paper_authors: Andrew Keith Wilkinson
  • for: The paper aims to investigate the use of artificial intelligence techniques and satellite imagery to identify illegal garbage dumps in rural areas of Cyprus.
  • methods: The paper uses a novel dataset of images, data augmentation techniques, and an artificial neural network (specifically, a convolutional neural network) to recognize the presence or absence of garbage in new images.
  • results: The resulting deep learning model can correctly identify images containing garbage in approximately 90% of cases, and could form the basis of a future system for systematically analyzing the entire landscape of Cyprus to build a comprehensive "garbage" map of the island.
    Abstract Garbage disposal is a challenging problem throughout the developed world. In Cyprus, as elsewhere, illegal ``fly-tipping" is a significant issue, especially in rural areas where few legal garbage disposal options exist. However, there is a lack of studies that attempt to measure the scale of this problem, and few resources available to address it. A method of automating the process of identifying garbage dumps would help counter this and provide information to the relevant authorities. The aim of this study was to investigate the degree to which artificial intelligence techniques, together with satellite imagery, can be used to identify illegal garbage dumps in the rural areas of Cyprus. This involved collecting a novel dataset of images that could be categorised as either containing, or not containing, garbage. The collection of such datasets in sufficient raw quantities is time consuming and costly. Therefore a relatively modest baseline set of images was collected, then data augmentation techniques used to increase the size of this dataset to a point where useful machine learning could occur. From this set of images an artificial neural network was trained to recognise the presence or absence of garbage in new images. A type of neural network especially suited to this task known as ``convolutional neural networks" was used. The efficacy of the resulting model was evaluated using an independently collected dataset of test images. The result was a deep learning model that could correctly identify images containing garbage in approximately 90\% of cases. It is envisaged that this model could form the basis of a future system that could systematically analyse the entire landscape of Cyprus to build a comprehensive ``garbage" map of the island.
    摘要 垃圾处理是整个发达世界面临的一个难题。在塞浦路斯,和其他地方一样,非法倾倒 (fly-tipping) 是一个突出问题,尤其是在缺乏合法垃圾处理途径的农村地区。然而,目前几乎没有研究尝试衡量该问题的规模,也缺乏应对它的资源。一种能够自动识别垃圾倾倒点的方法将有助于应对这一问题,并为相关部门提供信息。本研究旨在考察人工智能技术结合卫星影像在多大程度上可以识别塞浦路斯农村地区的非法垃圾倾倒点。为此,我们收集了一个新的图像数据集,图像被标注为包含或不包含垃圾。由于大规模采集此类数据既耗时又昂贵,我们先收集了一个规模适中的基础图像集,再通过数据增强技术将其扩充到足以进行有效机器学习的规模。随后,我们在该图像集上训练了一个人工神经网络来识别新图像中是否存在垃圾,所用网络为特别适合此类任务的卷积神经网络 (CNN)。我们使用独立采集的测试图像集评估了模型效果,所得深度学习模型能在约90%的情况下正确识别包含垃圾的图像。我们设想该模型可以作为未来系统的基础,对塞浦路斯全境进行系统性分析,从而构建一幅覆盖全岛的"垃圾"地图。
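A minimal sketch of the augmentation-plus-CNN recipe described above, using a ResNet-18 as a stand-in convolutional network for binary garbage / no-garbage classification. The specific augmentations, backbone, and hyper-parameters are generic choices for illustration and not necessarily those used in the study.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Augmentation pipeline used to enlarge a modest set of satellite tiles
# (would normally be plugged into a Dataset / DataLoader).
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(90),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Binary classifier built on a convolutional backbone.
model = models.resnet18(weights=None)          # ImageNet weights could be loaded instead
model.fc = nn.Linear(model.fc.in_features, 2)  # {0: no garbage, 1: garbage}
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One supervised step on a batch of augmented tiles."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.rand(4, 3, 224, 224), torch.tensor([0, 1, 1, 0])))
```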

ASCON: Anatomy-aware Supervised Contrastive Learning Framework for Low-dose CT Denoising

  • paper_url: http://arxiv.org/abs/2307.12225
  • repo_url: https://github.com/hao1635/ASCON
  • paper_authors: Zhihao Chen, Qi Gao, Yi Zhang, Hongming Shan
  • for: 低剂量计算机断层扫描 (CT) 图像去噪
  • methods: 提出了一种新的解剖感知监督对比学习框架 (ASCON),能够利用图像中的解剖语义信息进行去噪,同时提供解剖层面的可解释性。
  • results: 在两个公开的低剂量CT去噪数据集上进行了大量实验,证明ASCON性能超过现有最先进模型;此外,ASCON首次为低剂量CT去噪提供了解剖可解释性。
    Abstract While various deep learning methods have been proposed for low-dose computed tomography (CT) denoising, most of them leverage the normal-dose CT images as the ground-truth to supervise the denoising process. These methods typically ignore the inherent correlation within a single CT image, especially the anatomical semantics of human tissues, and lack the interpretability on the denoising process. In this paper, we propose a novel Anatomy-aware Supervised CONtrastive learning framework, termed ASCON, which can explore the anatomical semantics for low-dose CT denoising while providing anatomical interpretability. The proposed ASCON consists of two novel designs: an efficient self-attention-based U-Net (ESAU-Net) and a multi-scale anatomical contrastive network (MAC-Net). First, to better capture global-local interactions and adapt to the high-resolution input, an efficient ESAU-Net is introduced by using a channel-wise self-attention mechanism. Second, MAC-Net incorporates a patch-wise non-contrastive module to capture inherent anatomical information and a pixel-wise contrastive module to maintain intrinsic anatomical consistency. Extensive experimental results on two public low-dose CT denoising datasets demonstrate superior performance of ASCON over state-of-the-art models. Remarkably, our ASCON provides anatomical interpretability for low-dose CT denoising for the first time. Source code is available at https://github.com/hao1635/ASCON.
    摘要 尽管针对低剂量计算机断层扫描 (CT) 去噪已提出多种深度学习方法,但其中大多数以正常剂量CT图像作为真值来监督去噪过程。这些方法通常忽略了单张CT图像内部固有的相关性,尤其是人体组织的解剖语义,并且缺乏对去噪过程的可解释性。本文提出一种新颖的解剖感知监督对比学习框架ASCON,能够在低剂量CT去噪中挖掘解剖语义,同时提供解剖层面的可解释性。ASCON包含两个新设计:高效的基于自注意力的U-Net (ESAU-Net) 和多尺度解剖对比网络 (MAC-Net)。首先,为更好地捕捉全局-局部交互并适应高分辨率输入,我们引入采用通道自注意力机制的高效ESAU-Net;其次,MAC-Net包含一个基于图像块的非对比模块以捕捉固有的解剖信息,以及一个基于像素的对比模块以保持内在的解剖一致性。在两个公开的低剂量CT去噪数据集上的大量实验表明,ASCON优于现有最先进模型。值得注意的是,ASCON首次为低剂量CT去噪提供了解剖可解释性。源代码见 https://github.com/hao1635/ASCON 。

LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference

  • paper_url: http://arxiv.org/abs/2307.12217
  • repo_url: None
  • paper_authors: Cong Wang, Yu-Ping Wang, Dinesh Manocha
  • for: 本研究旨在提出一种新的方法,即LoLep,可以从单个RGB图像中推断高精度的场景表示,并生成更好的新视图。
  • methods: 为解决在缺乏深度信息的情况下推断合适平面位置的问题,我们将视差空间预先划分为若干区间 (bins),并设计了一种视差采样器来推断每个区间内多个平面的局部偏移;此外,我们还结合数据集的不同视差分布提出了两种优化策略,并引入遮挡感知重投影损失作为一种简单而有效的几何监督技术。
  • results: 我们的方法能够生成高精度的场景表示,并在不同的数据集上取得了最先进的结果。与MINE相比,我们的方法将LPIPS降低了4.8%-9.0%,RV降低了73.9%-83.5%。此外,我们还在真实图像上评估了性能,证明了LoLep的优势。
    Abstract We propose a novel method, LoLep, which regresses Locally-Learned planes from a single RGB image to represent scenes accurately, thus generating better novel views. Without the depth information, regressing appropriate plane locations is a challenging problem. To solve this issue, we pre-partition the disparity space into bins and design a disparity sampler to regress local offsets for multiple planes in each bin. However, only using such a sampler makes the network not convergent; we further propose two optimizing strategies that combine with different disparity distributions of datasets and propose an occlusion-aware reprojection loss as a simple yet effective geometric supervision technique. We also introduce a self-attention mechanism to improve occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module to address the problem of applying self-attention to large feature maps. We demonstrate the effectiveness of our approach and generate state-of-the-art results on different datasets. Compared to MINE, our approach has an LPIPS reduction of 4.8%-9.0% and an RV reduction of 73.9%-83.5%. We also evaluate the performance on real-world images and demonstrate the benefits.
    摘要 我们提出了一种新方法,LoLep,该方法从单个RGB图像中回归地方精度的计划,以便更加准确地表示场景,并生成更好的新视图。在不知道深度信息的情况下,回归合适的平面位置是一个具有挑战性的问题。为解决这个问题,我们先对差分空间进行预分区,并设计了差分抽样器来回归多个平面在每个分区中的本地偏移。然而,只使用这种抽样器将网络训练不整合;我们还提出了两种优化策略,其中一种是基于不同差分分布的数据集的优化策略,另一种是一种简单 yet有效的干扰抑制损失技术。我们还引入了自注意机制,以改善干扰推断,并提出了块抽样自注意模块(BS-SA),以解决应用自注意到大特征地图时存在的问题。我们证明了我们的方法的有效性,并在不同的数据集上达到了领先的结果。相比于MINE,我们的方法有LPIPS减少4.8%-9.0%和RV减少73.9%-83.5%。我们还评估了实际图像上的性能,并证明了其利好。

LIST: Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2307.12194
  • repo_url: None
  • paper_authors: Mohammad Samiul Arshad, William J. Beksi
  • for: 用于重构3D对象的几何和 topological结构 from a single 2D图像
  • methods: 利用本地和全局图像特征,通过一种新的神经网络架构来准确重构3D对象的几何和 topological结构
  • results: 对比现有方法,本研究的模型能够更高精度地重构3D对象的几何和 topological结构,不需要摄像头估计或像素对齐。
    Abstract Accurate reconstruction of both the geometric and topological details of a 3D object from a single 2D image embodies a fundamental challenge in computer vision. Existing explicit/implicit solutions to this problem struggle to recover self-occluded geometry and/or faithfully reconstruct topological shape structures. To resolve this dilemma, we introduce LIST, a novel neural architecture that leverages local and global image features to accurately reconstruct the geometric and topological structure of a 3D object from a single image. We utilize global 2D features to predict a coarse shape of the target object and then use it as a base for higher-resolution reconstruction. By leveraging both local 2D features from the image and 3D features from the coarse prediction, we can predict the signed distance between an arbitrary point and the target surface via an implicit predictor with great accuracy. Furthermore, our model does not require camera estimation or pixel alignment. It provides an uninfluenced reconstruction from the input-view direction. Through qualitative and quantitative analysis, we show the superiority of our model in reconstructing 3D objects from both synthetic and real-world images against the state of the art.
    摘要 从单张2D图像准确重建3D物体的几何与拓扑细节是计算机视觉中的一项基本挑战。现有的显式/隐式解决方案往往难以恢复自遮挡的几何结构,或难以忠实地重建物体的拓扑形态。为解决这一矛盾,我们提出了LIST,一种新的神经网络架构,利用局部与全局图像特征从单张图像中准确重建3D物体的几何与拓扑结构。我们先用全局2D特征预测目标物体的粗略形状,再以其为基础进行更高分辨率的重建。通过同时利用图像的局部2D特征与粗略预测得到的3D特征,我们可以借助隐式预测器高精度地预测任意点到目标表面的有符号距离。此外,我们的模型无需相机估计或像素对齐,可以提供不受输入视角影响的重建。通过定性与定量分析,我们证明了该模型在合成图像和真实图像上的3D物体重建均优于现有最先进方法。

An X3D Neural Network Analysis for Runner’s Performance Assessment in a Wild Sporting Environment

  • paper_url: http://arxiv.org/abs/2307.12183
  • repo_url: None
  • paper_authors: David Freire-Obregón, Javier Lorenzo-Navarro, Oliverio J. Santana, Daniel Hernández-Sosa, Modesto Castrillón-Santana
  • for: 本研究旨在体育环境中对扩展3D (X3D) 神经网络进行迁移学习分析。
  • methods: 该方法使用一个动作识别网络来估计长跑运动员的总赛时(CRT)。
  • results: 研究发现,使用X3D神经网络可以提供出色的表现,对于短视频输入, Mean Absolute Error为12分钟半。此外,X3D神经网络需要的内存比前一作少得多,可以达到更高的精度。
    Abstract We present a transfer learning analysis on a sporting environment of the expanded 3D (X3D) neural networks. Inspired by action quality assessment methods in the literature, our method uses an action recognition network to estimate athletes' cumulative race time (CRT) during an ultra-distance competition. We evaluate the performance considering the X3D, a family of action recognition networks that expand a small 2D image classification architecture along multiple network axes, including space, time, width, and depth. We demonstrate that the resulting neural network can provide remarkable performance for short input footage, with a mean absolute error of 12 minutes and a half when estimating the CRT for runners who have been active from 8 to 20 hours. Our most significant discovery is that X3D achieves state-of-the-art performance while requiring almost seven times less memory to achieve better precision than previous work.
    摘要 我们在体育环境中对扩展3D (X3D) 神经网络进行了迁移学习分析。受文献中动作质量评估方法的启发,我们的方法使用动作识别网络来估计超长距离比赛中运动员的累计比赛时间 (CRT)。我们基于X3D家族评估了性能,该家族的动作识别网络将一个小型2D图像分类架构沿空间、时间、宽度和深度等多个网络维度进行扩展。结果表明,对于较短的输入视频片段,该神经网络可以取得出色的性能:在估计已持续跑动8到20小时的选手的CRT时,平均绝对误差约为12.5分钟。我们最重要的发现是,X3D在取得最先进性能的同时,所需内存仅约为以往工作的七分之一,且精度更高。
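A sketch of the regression setup: clip features from a 3D video backbone feed a linear head trained with an L1 (MAE) objective, mirroring the reported evaluation metric. A torchvision R3D-18 is used here purely as a stand-in for X3D, and all hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18   # stand-in 3D backbone; the paper uses X3D

class CRTRegressor(nn.Module):
    """Predict a runner's cumulative race time (in minutes) from a short clip."""

    def __init__(self):
        super().__init__()
        backbone = r3d_18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier
        self.head = nn.Linear(512, 1)

    def forward(self, clip):                   # clip: (B, 3, T, H, W)
        feat = self.features(clip).flatten(1)
        return self.head(feat).squeeze(-1)

model = CRTRegressor()
clips = torch.rand(2, 3, 16, 112, 112)         # two 16-frame clips
target_crt = torch.tensor([540.0, 760.0])      # cumulative race time in minutes
loss = nn.functional.l1_loss(model(clips), target_crt)   # MAE objective, matching the evaluation
print(loss.item())
```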

Prototype-Driven and Multi-Expert Integrated Multi-Modal MR Brain Tumor Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.12180
  • repo_url: https://github.com/linzy0227/pdminet
  • paper_authors: Yafei Zhang, Zhiyuan Li, Huafeng Li, Dapeng Tao
  • for: 这个论文的目的是提出一种多模态核磁共振(MR)脑肿瘤图像分割方法,以便更好地识别和定位脑肿瘤子区域。
  • methods: 该方法首先提取输入图像中的特征,然后使用脑肿瘤prototype来引导和融合不同模态特征,以便高亮每个脑肿瘤子区域的特征。
  • results: 实验结果表明,提出的方法在三个竞赛脑肿瘤分割数据集上具有更高的分割精度和稳定性。
    Abstract For multi-modal magnetic resonance (MR) brain tumor image segmentation, current methods usually directly extract the discriminative features from input images for tumor sub-region category determination and localization. However, the impact of information aliasing caused by the mutual inclusion of tumor sub-regions is often ignored. Moreover, existing methods usually do not take tailored efforts to highlight the single tumor sub-region features. To this end, a multi-modal MR brain tumor segmentation method with tumor prototype-driven and multi-expert integration is proposed. It could highlight the features of each tumor sub-region under the guidance of tumor prototypes. Specifically, to obtain the prototypes with complete information, we propose a mutual transmission mechanism to transfer different modal features to each other to address the issues raised by insufficient information on single-modal features. Furthermore, we devise a prototype-driven feature representation and fusion method with the learned prototypes, which implants the prototypes into tumor features and generates corresponding activation maps. With the activation maps, the sub-region features consistent with the prototype category can be highlighted. A key information enhancement and fusion strategy with multi-expert integration is designed to further improve the segmentation performance. The strategy can integrate the features from different layers of the extra feature extraction network and the features highlighted by the prototypes. Experimental results on three competition brain tumor segmentation datasets prove the superiority of the proposed method.
    摘要 对于多模态磁共振(MR)脑肿瘤图像分割,现有方法通常直接从输入图像中提取判别特征,用于肿瘤子区域的类别判定和定位,却往往忽略了肿瘤子区域相互包含所造成的信息混叠的影响;同时,现有方法通常也没有针对单个肿瘤子区域特征进行有针对性的强化。为此,我们提出了一种基于肿瘤原型驱动和多专家集成的多模态MR脑肿瘤分割方法,它能够在肿瘤原型的引导下突出每个肿瘤子区域的特征。具体来说,为了获得信息完整的原型,我们提出了一种相互传递机制,将不同模态的特征互相传递,以解决单模态特征信息不足的问题。此外,我们设计了一种基于所学原型的原型驱动特征表示与融合方法,将原型嵌入肿瘤特征并生成相应的激活图;借助激活图,可以突出与原型类别一致的子区域特征。为了进一步提高分割性能,我们还设计了一种结合多专家集成的关键信息增强与融合策略,该策略可以融合额外特征提取网络不同层次的特征以及由原型突出的特征。在三个竞赛脑肿瘤分割数据集上的实验结果证明了所提方法的优越性。
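
The prototype-driven step described above amounts to comparing voxel features with one learned prototype per tumor sub-region and turning the similarities into activation maps that highlight the matching sub-region. A hedged sketch of that computation follows; tensor sizes and the cosine-similarity/softmax choices are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prototype_activation(features, prototypes):
    """features: (B, C, D, H, W) fused multi-modal features.
    prototypes: (K, C) one learned prototype per tumor sub-region.
    Returns activation maps (B, K, D, H, W) and prototype-highlighted features."""
    B, C, D, H, W = features.shape
    flat = features.flatten(2)                             # (B, C, D*H*W)
    # cosine similarity between every voxel feature and every prototype
    sim = torch.einsum('kc,bcn->bkn',
                       F.normalize(prototypes, dim=1),
                       F.normalize(flat, dim=1))
    act = sim.view(B, -1, D, H, W).softmax(dim=1)          # (B, K, D, H, W)
    highlighted = features.unsqueeze(1) * act.unsqueeze(2) # (B, K, C, D, H, W)
    return act, highlighted

# toy usage with random stand-in tensors (3 sub-regions, 16 channels)
act, hi = prototype_activation(torch.randn(1, 16, 8, 32, 32), torch.randn(3, 16))
print(act.shape, hi.shape)
```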

Leveraging Knowledge Graphs for Zero-Shot Object-agnostic State Classification

  • paper_url: http://arxiv.org/abs/2307.12179
  • repo_url: None
  • paper_authors: Filipos Gouidis, Theodore Patkos, Antonis Argyros, Dimitris Plexousakis
  • for: 本研究强调解决对象状态分类(OSC)问题,具体来说是一种零例学习问题,即不需要知道对象的类别来预测对象的状态。
  • methods: 我们提出了首个对象无关状态分类(Object-agnostic State Classification, OaSC)方法,即不需要对象类别知识或估计即可预测对象的状态。我们利用知识图谱(KGs)来结构化和组织知识,并与视觉信息结合,从而能够推断训练集中未出现过的对象/状态对中对象的状态。
  • results: 实验结果表明,对象类知识不是预测对象状态的决定因素。此外,我们的OaSC方法在所有数据集和benchmark中都超越了现有方法,差距很大。
    Abstract We investigate the problem of Object State Classification (OSC) as a zero-shot learning problem. Specifically, we propose the first Object-agnostic State Classification (OaSC) method that infers the state of a certain object without relying on the knowledge or the estimation of the object class. In that direction, we capitalize on Knowledge Graphs (KGs) for structuring and organizing knowledge, which, in combination with visual information, enable the inference of the states of objects in object/state pairs that have not been encountered in the method's training set. A series of experiments investigate the performance of the proposed method in various settings, against several hypotheses and in comparison with state of the art approaches for object attribute classification. The experimental results demonstrate that the knowledge of an object class is not decisive for the prediction of its state. Moreover, the proposed OaSC method outperforms existing methods in all datasets and benchmarks by a great margin.
    摘要 我们研究对象状态分类(OSC)问题,并将其视为零例学习问题。具体来说,我们提出了首个对象不依赖类别知识的对象状态分类方法(OaSC)。这种方法可以基于知识图(KGs)结构和组织知识,并结合视觉信息,对未在方法训练集中出现的对象/状态对进行状态推理。我们进行了一系列实验,测试方法在不同的设置、假设和现有的对象属性分类方法的比较中的性能。实验结果表明,对象类知识不是决定对象状态预测的关键因素。此外,我们的OaSC方法在所有数据集和标准准则上都超越了现有方法,差距很大。
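
Conceptually, object-agnostic state classification scores an image embedding against state embeddings derived from a knowledge graph, so no object-class label is needed at inference time. The sketch below shows that matching step only; the embedding dimensions, the cosine-similarity scoring, and the random vectors standing in for learned KG/visual embeddings are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def predict_state(image_emb, state_kg_embs):
    """image_emb: (D,) visual embedding of the input image.
    state_kg_embs: dict mapping state name -> (D,) embedding derived from a knowledge graph.
    Returns the most similar state; no object-class information is used."""
    names = list(state_kg_embs.keys())
    mat = torch.stack([state_kg_embs[n] for n in names])               # (S, D)
    scores = F.cosine_similarity(image_emb.unsqueeze(0), mat, dim=1)   # (S,)
    return names[scores.argmax().item()], scores

# toy usage with random stand-in embeddings
states = {s: torch.randn(128) for s in ["open", "closed", "folded", "cut"]}
state, scores = predict_state(torch.randn(128), states)
print(state)
```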

Challenges for Monocular 6D Object Pose Estimation in Robotics

  • paper_url: http://arxiv.org/abs/2307.12172
  • repo_url: None
  • paper_authors: Stefan Thalhammer, Dominik Bauer, Peter Hönig, Jean-Baptiste Weibel, José García-Rodríguez, Markus Vincze
  • for: 本研究旨在探讨单视模式下的物体 pose 估算问题,即 robotics 应用中的核心识别任务。
  • methods: 该研究使用了现成的 RGB 感知器和 CNN 进行快速推理,以及提出了一种综合视角和数据集的评估方法。
  • results: 研究发现,虽然现有的多模态和单目方法已经达到了最新水平,但它们在 robotics 应用中仍面临遮挡处理、新的位姿表示方法、类别级位姿估计的形式化与改进等挑战。此外,大规模对象集、新型对象、折射材料、不确定性估计等问题也仍未得到解决。
    Abstract Object pose estimation is a core perception task that enables, for example, object grasping and scene understanding. The widely available, inexpensive and high-resolution RGB sensors and CNNs that allow for fast inference based on this modality make monocular approaches especially well suited for robotics applications. We observe that previous surveys on object pose estimation establish the state of the art for varying modalities, single- and multi-view settings, and datasets and metrics that consider a multitude of applications. We argue, however, that those works' broad scope hinders the identification of open challenges that are specific to monocular approaches and the derivation of promising future challenges for their application in robotics. By providing a unified view on recent publications from both robotics and computer vision, we find that occlusion handling, novel pose representations, and formalizing and improving category-level pose estimation are still fundamental challenges that are highly relevant for robotics. Moreover, to further improve robotic performance, large object sets, novel objects, refractive materials, and uncertainty estimates are central, largely unsolved open challenges. In order to address them, ontological reasoning, deformability handling, scene-level reasoning, realistic datasets, and the ecological footprint of algorithms need to be improved.
    摘要 We argue that there are still several fundamental challenges that need to be addressed in order to improve the performance of monocular object pose estimation in robotics. These challenges include occlusion handling, the development of novel pose representations, and the formalization and improvement of category-level pose estimation. Additionally, there are several central, largely unsolved open challenges that must be addressed, including the need for large object sets, the ability to handle novel objects and refractive materials, and the inclusion of uncertainty estimates.To address these challenges, we propose several areas of improvement, including ontological reasoning, deformability handling, scene-level reasoning, the creation of realistic datasets, and the reduction of the ecological footprint of algorithms. By focusing on these areas, we believe that the performance of monocular object pose estimation in robotics can be significantly improved, enabling more advanced and capable robots.

Facial Point Graphs for Amyotrophic Lateral Sclerosis Identification

  • paper_url: http://arxiv.org/abs/2307.12159
  • repo_url: None
  • paper_authors: Nícolas Barbosa Gomes, Arissa Yoshida, Mateus Roder, Guilherme Camargo de Oliveira, João Paulo Papa
  • for: 早期诊断肌萎缩侧索硬化症(ALS)有助于确定治疗的开始时机、改善病人的预后和整体健康状况。
  • methods: 该论文提出使用计算机方法分析病人的脸部表达来自动识别阿LS。
  • results: 实验结果表明,该方法在多伦多Neuroface数据集上表现出色,超过了现有的最佳成绩,为早期诊断ALS带来了有前途的发展。
    Abstract Identifying Amyotrophic Lateral Sclerosis (ALS) in its early stages is essential for establishing the beginning of treatment, enriching the outlook, and enhancing the overall well-being of those affected individuals. However, early diagnosis and detecting the disease's signs is not straightforward. A simpler and cheaper way arises by analyzing the patient's facial expressions through computational methods. When a patient with ALS engages in specific actions, e.g., opening their mouth, the movement of specific facial muscles differs from that observed in a healthy individual. This paper proposes Facial Point Graphs to learn information from the geometry of facial images to identify ALS automatically. The experimental outcomes in the Toronto Neuroface dataset show the proposed approach outperformed state-of-the-art results, fostering promising developments in the area.
    摘要 早期诊断肌萎缩侧索硬化症(ALS)对患者的治疗开始时机、预后和整体健康状况有着重要的影响。然而,早期诊断和识别该病的症状并非易事。本研究提出使用计算方法分析病人的面部表情来自动识别ALS:当患者做出特定动作(如张开嘴巴)时,其特定面部肌肉的运动方式与健康人不同。本文提出的面部点图(Facial Point Graphs)方法从面部图像的几何信息中学习,并在多伦多Neuroface数据集上进行实验,结果超越了当前最佳成绩,显示了该方法在ALS自动识别方面的潜在价值。

Real-Time Neural Video Recovery and Enhancement on Mobile Devices

  • paper_url: http://arxiv.org/abs/2307.12152
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Zhaoyuan He, Yifan Yang, Lili Qiu, Kyoungjun Park
  • for: 提高移动设备上视频流式传输的流畅体验
  • methods: 提出一种新的视频帧恢复算法、一种新的超分辨率算法和一种接收端增强感知的视频码率自适应算法
  • results: 实现了30帧/秒的实时增强,在不同的网络环境下测试,实现了视频流经验质量(Quality of Experience,QoE)的显著提高(24%-82%)
    Abstract As mobile devices become increasingly popular for video streaming, it's crucial to optimize the streaming experience for these devices. Although deep learning-based video enhancement techniques are gaining attention, most of them cannot support real-time enhancement on mobile devices. Additionally, many of these techniques are focused solely on super-resolution and cannot handle partial or complete loss or corruption of video frames, which is common on the Internet and wireless networks. To overcome these challenges, we present a novel approach in this paper. Our approach consists of (i) a novel video frame recovery scheme, (ii) a new super-resolution algorithm, and (iii) a receiver enhancement-aware video bit rate adaptation algorithm. We have implemented our approach on an iPhone 12, and it can support 30 frames per second (FPS). We have evaluated our approach in various networks such as WiFi, 3G, 4G, and 5G networks. Our evaluation shows that our approach enables real-time enhancement and results in a significant increase in video QoE (Quality of Experience) of 24\% - 82\% in our video streaming system.
    摘要 “随着移动设备在影像流媒体中的普及,实时优化影像流媒体的体验成为了非常重要的。深度学习基本的影像增强技术在获得注目,但大多数这些技术无法在移动设备上支持实时优化。此外,许多这些技术仅专注于超解析,而无法处理部分或完全的影像帧损失或腐败,这是互联网和无线网络上很常见的问题。”“为了解决这些挑战,我们在这篇论文中提出了一个新的方法。我们的方法包括:(i)一个新的影像帧恢复算法,(ii)一个新的超解析算法,以及(iii)一个受到接收端优化影像比特率改变算法的影像流媒体实时优化系统。我们在iPhone 12上实现了我们的方法,并且可以支持30帧每秒(FPS)。我们在WiFi、3G、4G和5G网络中进行了评估,我们的评估结果表明,我们的方法可以实现实时优化,并且导致影像流媒体系统中的影像质量经验(Quality of Experience,QoE)增加了24%-82%。”

Does color modalities affect handwriting recognition? An empirical study on Persian handwritings using convolutional neural networks

  • paper_url: http://arxiv.org/abs/2307.12150
  • repo_url: None
  • paper_authors: Abbas Zohrevand, Zahra Imani, Javad Sadri, Ching Y. Suen
  • for: 这篇论文研究手写数字和单词的颜色模态是否会影响其识别准确率或速度。
  • methods: 使用卷积神经网络(CNNs)作为模拟人眼的工具,在一个新的波斯语手写数据库上进行测试。
  • results: 结果表明,在黑白(BW)数字和单词图像上训练的 CNN 在测试集上的识别性能略高于其他两种颜色模态,但总体差异并不显著;比较三种颜色模态下的训练时间可以发现,使用 BW 图像训练的效率最高。
    Abstract Most of the methods on handwritten recognition in the literature are focused and evaluated on Black and White (BW) image databases. In this paper we try to answer a fundamental question in document recognition. Using Convolutional Neural Networks (CNNs), as eye simulator, we investigate to see whether color modalities of handwritten digits and words affect their recognition accuracy or speed? To the best of our knowledge, so far this question has not been answered due to the lack of handwritten databases that have all three color modalities of handwritings. To answer this question, we selected 13,330 isolated digits and 62,500 words from a novel Persian handwritten database, which have three different color modalities and are unique in term of size and variety. Our selected datasets are divided into training, validation, and testing sets. Afterwards, similar conventional CNN models are trained with the training samples. While the experimental results on the testing set show that CNN on the BW digit and word images has a higher performance compared to the other two color modalities, in general there are no significant differences for network accuracy in different color modalities. Also, comparisons of training times in three color modalities show that recognition of handwritten digits and words in BW images using CNN is much more efficient.
    摘要 大多数现成的手写识别方法都是在黑白图像库中进行研究和评估。在这篇论文中,我们试图回答一个基本的问题:使用卷积神经网络(CNN)作为眼动模拟器,我们调查了手写数字和字符的三种颜色模式是否影响了识别精度或速度?至于这个问题,我们认为现有的手写库缺乏三种颜色模式的手写数据,因此这个问题尚未得到解答。为了回答这个问题,我们选择了13330个隔离的数字和62500个字符从一个新的波斯语手写库中,这些数据库具有三种颜色模式和各种大小和样式。我们选择的数据库被分成了训练集、验证集和测试集。然后,我们使用同样的传统CNN模型在训练样本上进行训练。在测试集上的实验结果表明,使用CNN对黑白数字和字符图像进行识别的精度较高,而在其他两种颜色模式下的识别精度相对较低。此外,在三种颜色模式下的训练时间进行比较,发现使用黑白图像进行识别的训练时间相对较短。

Learned Gridification for Efficient Point Cloud Processing

  • paper_url: http://arxiv.org/abs/2307.14354
  • repo_url: https://github.com/computri/gridifier
  • paper_authors: Putri A. van der Linden, David W. Romero, Erik J. Bekkers
  • for: 将点云数据转换为可以进行操作的稠密grid数据,以提高操作效率和可扩展性。
  • methods: 提出了一种名为”learnable gridification”的方法,用于将点云数据转换为稠密grid数据,并在后续层使用常见的grid-based操作,如Conv3D。此外,还提出了一种名为”learnable de-gridification”的方法,用于将稠密grid数据还原回原始点云数据。
  • results: 经过理论和实验分析,显示了gridified网络在内存和时间方面的扩展性和可扩展性,而且可以达到竞争性的结果。
    Abstract Neural operations that rely on neighborhood information are much more expensive when deployed on point clouds than on grid data due to the irregular distances between points in a point cloud. In a grid, on the other hand, we can compute the kernel only once and reuse it for all query positions. As a result, operations that rely on neighborhood information scale much worse for point clouds than for grid data, specially for large inputs and large neighborhoods. In this work, we address the scalability issue of point cloud methods by tackling its root cause: the irregularity of the data. We propose learnable gridification as the first step in a point cloud processing pipeline to transform the point cloud into a compact, regular grid. Thanks to gridification, subsequent layers can use operations defined on regular grids, e.g., Conv3D, which scale much better than native point cloud methods. We then extend gridification to point cloud to point cloud tasks, e.g., segmentation, by adding a learnable de-gridification step at the end of the point cloud processing pipeline to map the compact, regular grid back to its original point cloud form. Through theoretical and empirical analysis, we show that gridified networks scale better in terms of memory and time than networks directly applied on raw point cloud data, while being able to achieve competitive results. Our code is publicly available at https://github.com/computri/gridifier.
    摘要 依赖邻域信息的神经操作在点云上比在网格数据上昂贵得多,因为点云中点与点之间的距离不规则;而在网格上,我们只需计算一次卷积核,即可在所有查询位置重复使用。因此,基于邻域信息的操作在点云上的扩展性远差于网格数据,尤其是对于大输入和大邻域。在这项工作中,我们从数据不规则这一根源出发,解决点云方法的扩展性问题:我们提出可学习的gridification作为点云处理流程的第一步,将点云转换成一个紧凑的规则网格;得益于gridification,后续层可以使用定义在规则网格上的操作(如Conv3D),其扩展性远优于原生的点云方法。随后,我们将gridification扩展到点云到点云的任务(例如分割),在点云处理流程的末尾添加一个可学习的de-gridification步骤,将紧凑的规则网格还原回原始点云形式。我们通过理论和实验分析表明,gridified网络在内存和时间方面比直接应用于原始点云数据的网络具有更好的扩展性,同时能够取得有竞争力的结果。我们的代码已公开:https://github.com/computri/gridifier。
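
A rough intuition for the gridification step: scatter each point's features into the cell of a regular grid that contains it, so later layers can run ordinary Conv3D. The sketch below uses simple per-cell averaging as a stand-in for the learnable, message-passing gridifier in the paper; the resolution and feature sizes are illustrative assumptions.

```python
import torch

def gridify(points, feats, resolution=8):
    """points: (N, 3) coordinates in [0, 1]^3; feats: (N, C) point features.
    Returns a dense (C, R, R, R) grid by average-pooling features per cell."""
    N, C = feats.shape
    idx = (points.clamp(0, 1 - 1e-6) * resolution).long()   # (N, 3) cell indices
    flat = idx[:, 0] * resolution**2 + idx[:, 1] * resolution + idx[:, 2]
    grid = torch.zeros(resolution**3, C).index_add_(0, flat, feats)
    counts = torch.zeros(resolution**3).index_add_(0, flat, torch.ones(N))
    grid = grid / counts.clamp(min=1).unsqueeze(1)           # mean feature per occupied cell
    return grid.T.reshape(C, resolution, resolution, resolution)

grid = gridify(torch.rand(2048, 3), torch.randn(2048, 16))
print(grid.shape)  # torch.Size([16, 8, 8, 8]) -- ready for regular Conv3D layers
```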

A Vision for Cleaner Rivers: Harnessing Snapshot Hyperspectral Imaging to Detect Macro-Plastic Litter

  • paper_url: http://arxiv.org/abs/2307.12145
  • repo_url: https://github.com/river-lab/hyperspectral_macro_plastic_detection
  • paper_authors: Nathaniel Hanson, Ahmet Demirkaya, Deniz Erdoğmuş, Aron Stubbins, Taşkın Padır, Tales Imbiriba
  • for: 这个研究旨在解决水体中塑料垃圾的监测问题,以改善当地生态和经济环境的健康状况。
  • methods: 这个研究使用计算成像技术来检测水体中的大型塑料垃圾。研究人员使用快照式可见-短波红外高光谱成像技术,并利用机器学习分类方法来实现高精度检测。
  • results: 实验结果表明,使用高光谱数据和非线性分类方法可以在具有挑战性的场景下实现高精度的检测,特别是在检测部分浸没的塑料垃圾时。
    Abstract Plastic waste entering the riverine harms local ecosystems leading to negative ecological and economic impacts. Large parcels of plastic waste are transported from inland to oceans leading to a global scale problem of floating debris fields. In this context, efficient and automatized monitoring of mismanaged plastic waste is paramount. To address this problem, we analyze the feasibility of macro-plastic litter detection using computational imaging approaches in river-like scenarios. We enable near-real-time tracking of partially submerged plastics by using snapshot Visible-Shortwave Infrared hyperspectral imaging. Our experiments indicate that imaging strategies associated with machine learning classification approaches can lead to high detection accuracy even in challenging scenarios, especially when leveraging hyperspectral data and nonlinear classifiers. All code, data, and models are available online: https://github.com/RIVeR-Lab/hyperspectral_macro_plastic_detection.
    摘要 塑料垃圾进入河流会危害当地生态系统,造成生态和经济上的负面影响。大量塑料垃圾从内陆被输送到海洋,形成全球性的漂浮垃圾带问题。在此背景下,对管理不善的塑料垃圾进行高效、自动化的监测至关重要。为此,我们分析了在类河流场景中利用计算成像方法检测大型塑料垃圾的可行性。我们利用快照式可见-短波红外高光谱成像,实现对部分浸没塑料的近实时跟踪。实验表明,结合机器学习分类方法的成像策略即使在具有挑战性的场景下也能达到很高的检测精度,尤其是在利用高光谱数据和非线性分类器时。所有代码、数据和模型均已在线公开:https://github.com/RIVeR-Lab/hyperspectral_macro_plastic_detection。
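
At its core the detection task is per-pixel classification of hyperspectral signatures with a nonlinear classifier. The scikit-learn sketch below shows that pipeline shape on synthetic stand-in data; the band count, labels, and MLP classifier are assumptions for illustration, and the real data and models live in the linked repository.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for VSWIR hyperspectral pixels: 150 bands per pixel,
# labels plastic (1) vs water/background (0). Entirely synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 150))
y = (X[:, 40:60].mean(axis=1) > 0.1).astype(int)  # synthetic labels, illustration only

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)                                # nonlinear per-pixel classifier
print("pixel-wise accuracy:", clf.score(X_te, y_te))
```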

SCPAT-GAN: Structural Constrained and Pathology Aware Convolutional Transformer-GAN for Virtual Histology Staining of Human Coronary OCT images

  • paper_url: http://arxiv.org/abs/2307.12138
  • repo_url: None
  • paper_authors: Xueshen Li, Hongshan Liu, Xiaoyu Song, Brigitta C. Brott, Silvio H. Litovsky, Yu Gan
  • for: 为冠状动脉光学相干断层扫描(OCT)图像提供虚拟组织学染色信息,以辅助冠状动脉疾病的诊断和治疗
  • methods: 使用基于 transformer 的生成对抗网络(GAN)模型,将 OCT 图像转换为虚拟染色的 H&E 组织学图像
  • results: 实现了为冠状动脉 OCT 图像生成虚拟组织学信息,且生成结果能够反映相应的病理特征
    Abstract There is a significant need for the generation of virtual histological information from coronary optical coherence tomography (OCT) images to better guide the treatment of coronary artery disease. However, existing methods either require a large pixel-wisely paired training dataset or have limited capability to map pathological regions. To address these issues, we proposed a structural constrained, pathology aware, transformer generative adversarial network, namely SCPAT-GAN, to generate virtual stained H&E histology from OCT images. The proposed SCPAT-GAN advances existing methods via a novel design to impose pathological guidance on structural layers using transformer-based network.
    摘要 从冠状动脉光学相干断层扫描(OCT)图像中生成虚拟组织学信息,对更好地指导冠状动脉疾病的治疗有重要意义。然而,现有方法要么需要大规模像素级配对的训练数据集,要么在刻画病理区域方面能力有限。为解决这些问题,我们提出了一种结构约束、病理感知的transformer生成对抗网络SCPAT-GAN,用于从OCT图像生成虚拟H&E染色组织学图像。SCPAT-GAN通过一种新颖的设计,利用基于transformer的网络在结构层面上施加病理引导,从而改进了现有方法。

Improving temperature estimation in low-cost infrared cameras using deep neural networks

  • paper_url: http://arxiv.org/abs/2307.12130
  • repo_url: None
  • paper_authors: Navot Oz, Nir Sochen, David Mendelovich, Iftach Klapp
  • for: 提高低成本热相机的温度精度和修正不均匀性
  • methods: 开发了一个考虑环境温度的非均匀性模拟器,并提出了一种基于端到端神经网络的方法,仅使用单张图像和相机自身测量的环境温度来估算对象的温度并校正非均匀性
  • results: 与以往方法相比,平均温度误差降低了约 $1^\circ C$;在网络中加入物理约束后,误差又进一步降低了 $4%$。在大规模验证集上,平均温度误差为 $0.37^\circ C$,并在实际场景中得到了等效的结果。
    Abstract Low-cost thermal cameras are inaccurate (usually $\pm 3^\circ C$) and have space-variant nonuniformity across their detector. Both inaccuracy and nonuniformity are dependent on the ambient temperature of the camera. The main goal of this work was to improve the temperature accuracy of low-cost cameras and rectify the nonuniformity. A nonuniformity simulator that accounts for the ambient temperature was developed. An end-to-end neural network that incorporates the ambient temperature at image acquisition was introduced. The neural network was trained with the simulated nonuniformity data to estimate the object's temperature and correct the nonuniformity, using only a single image and the ambient temperature measured by the camera itself. Results show that the proposed method lowered the mean temperature error by approximately $1^\circ C$ compared to previous works. In addition, applying a physical constraint on the network lowered the error by an additional $4\%$. The mean temperature error over an extensive validation dataset was $0.37^\circ C$. The method was verified on real data in the field and produced equivalent results.
    摘要 低成本热像仪的精度受环境温度的影响(误差通常在 $\pm 3^\circ C$ 左右),并且其探测器上存在空间非均匀性,这两个问题都与相机所处的环境温度相关。本工作的主要目标是提高低成本热像仪的温度精度并校正非均匀性。我们开发了一个考虑环境温度的非均匀性模拟器,并引入了一个在图像采集时结合环境温度的端到端神经网络。这个神经网络使用模拟的非均匀性数据进行训练,仅需一张图像和相机自带的环境温度读数,即可估算对象温度并校正非均匀性。结果显示,与以往工作相比,我们的方法将平均温度误差降低了约 $1^\circ C$;此外,通过施加物理约束,误差又进一步降低了约 $4%$。在大规模验证数据集上,平均温度误差为 $0.37^\circ C$。该方法还在实际野外数据上进行了验证,并获得了同等水平的结果。
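
The key design point is feeding the camera's own ambient-temperature reading into the network together with the raw frame, so the learned correction is ambient-aware. A minimal PyTorch sketch of that input pathway follows; the layer sizes and frame resolution are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AmbientAwareCorrector(nn.Module):
    """Sketch: map a raw thermal frame plus the ambient temperature to a
    nonuniformity-corrected temperature map (architecture is illustrative)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, frame, ambient_c):
        # frame: (B, 1, H, W) raw readings; ambient_c: (B,) camera temperature in deg C
        amb = ambient_c.view(-1, 1, 1, 1).expand_as(frame)  # broadcast as a second channel
        return self.net(torch.cat([frame, amb], dim=1))

model = AmbientAwareCorrector()
out = model(torch.randn(4, 1, 120, 160), torch.tensor([25.0, 30.0, 18.0, 22.0]))
print(out.shape)  # corrected temperature map per frame
```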

InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing

  • paper_url: http://arxiv.org/abs/2308.00135
  • repo_url: https://github.com/infusion-zero-edit/InFusion
  • paper_authors: Anant Khandelwal
  • for: 这篇论文的目的是提出一个框架,以便透过文本提示进行类型控制的视频编辑,并且在不需要训练的情况下实现高品质的视频编辑。
  • methods: 这篇论文使用了大型预训的文本扩散模型,并且提出了一个内部插入(Injection)的方法,允许在视频中编辑多个概念,并且具有像素级的控制。
  • results: 论文的实验结果显示,这个框架可以实现高品质和时间含意的视频编辑,并且可以与现有的图像扩散技术进行整合。
    Abstract Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images. Additionally, these models have been successfully leveraged to edit input images by just changing the text prompt. But when these models are applied to videos, the main challenge is to ensure temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing leveraging large pre-trained image diffusion models. Our framework specifically supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt. Specifically, we inject the difference in features obtained with source and edit prompts from U-Net residual blocks of decoder layers. When these are combined with injected attention features, it becomes feasible to query the source contents and scale edited concepts along with the injection of unedited parts. The editing is further controlled in a fine-grained manner with mask extraction and attention fusion, which cut the edited part from the source and paste it into the denoising pipeline for the editing prompt. Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training. We demonstrated complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. Adaptation is compatible with all the existing image diffusion techniques. Extensive experimental results demonstrate the effectiveness of existing methods in rendering high-quality and temporally consistent videos.
    摘要 大型文本到图像扩散模型已经实现了惊人的成功,可以生成多样化、高质量的图像。此外,这些模型还可以通过修改输入文本来编辑图像。但是在视频 editing 中,主要挑战是保证时间协调一致和各帧的一致性。在这篇论文中,我们提出了 InFusion 框架,一种基于大型预训练的图像扩散模型来实现零搅 diffusion 的文本基于视频编辑。我们的框架具有多个概念编辑、像素级控制的特点,可以在编辑提示中提出多个概念,并且可以在精细化的方式下控制编辑。具体来说,我们将源和编辑提示中的特征差拟合到 U-Net 径弧层的解码层中,并与注入的注意力特征相结合,使得可以查询源内容并将编辑概念扩大到不变的部分。此外,我们还使用抽取Mask和注意力融合来进一步控制编辑。我们的框架是一种低成本的替代方案,不需要训练。我们使用了 Stable Diffusion v1.5 总体图像模型进行复杂概念编辑,并证明了与现有图像扩散技术相容。我们的实验结果表明,InFusion 可以生成高质量和时间协调一致的视频。

Synthesis of Batik Motifs using a Diffusion – Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2307.12122
  • repo_url: https://github.com/octadion/diffusion-stylegan2-ada-pytorch
  • paper_authors: One Octadion, Novanto Yudistira, Diva Kurnianingtyas
  • for: 本研究旨在帮助 batik 设计师或手工艺术家创造独特和高品质的 batik 模样,并实现有效率的生产时间和成本。
  • methods: 本研究使用了 StyleGAN2-Ada 和扩散技术,实现了生成高品质和真实的 synthetic batik 模样。StyleGAN2-Ada 是一种分离 Style 和 Content 两个方面的 GAN 模型,而扩散技术则引入了随机噪音到数据中。
  • results: 根据质量和量度评估,模型测试过程中生成的 batik 模样具有细节丰富和艺术多样性,并且能够实现有效率的生产时间和成本。
    Abstract Batik, a unique blend of art and craftsmanship, is a distinct artistic and technological creation for Indonesian society. Research on batik motifs is primarily focused on classification. However, further studies may extend to the synthesis of batik patterns. Generative Adversarial Networks (GANs) have been an important deep learning model for generating synthetic data, but often face challenges in the stability and consistency of results. This research focuses on the use of StyleGAN2-Ada and Diffusion techniques to produce realistic and high-quality synthetic batik patterns. StyleGAN2-Ada is a variation of the GAN model that separates the style and content aspects in an image, whereas diffusion techniques introduce random noise into the data. In the context of batik, StyleGAN2-Ada and Diffusion are used to produce realistic synthetic batik patterns. This study also made adjustments to the model architecture and used a well-curated batik dataset. The main goal is to assist batik designers or craftsmen in producing unique and quality batik motifs with efficient production time and costs. Based on qualitative and quantitative evaluations, the results show that the model tested is capable of producing authentic and quality batik patterns, with finer details and rich artistic variations. The dataset and code can be accessed here:https://github.com/octadion/diffusion-stylegan2-ada-pytorch
    摘要 《独特的抽象艺术——batik的研究》batik是印度尼西亚社会独特的艺术和手工艺术材料,研究主要集中在纹理的分类。然而,进一步的研究可能会扩展到纹理的合成。生成对抗网络(GANs)是深度学习模型,用于生成合成数据,但经常面临稳定性和一致性问题。本研究使用StyleGAN2-Ada和扩散技术生成真实和高质量的合成纹理图案。StyleGAN2-Ada是GAN模型中分离风格和内容的变种,而扩散技术引入随机噪声到数据中。在batik中,StyleGAN2-Ada和扩散被用生成真实的合成纹理图案。本研究还对模型结构进行了调整,使用了优化的batik数据集。主要目标是协助batik设计师或手工艺术家生成独特和高质量的纹理图案,减少生产时间和成本。根据质量和量度评价,结果显示,试用的模型能够生成authentic和高质量的纹理图案,细节更加细腻,艺术变化更加丰富。数据集和代码可以在以下链接获取:https://github.com/octadion/diffusion-stylegan2-ada-pytorch

Pyramid Semantic Graph-based Global Point Cloud Registration with Low Overlap

  • paper_url: http://arxiv.org/abs/2307.12116
  • repo_url: https://github.com/hkust-aerial-robotics/pagor
  • paper_authors: Zhijian Qiao, Zehuan Yu, Huan Yin, Shaojie Shen
  • for: 这篇论文是关于全球点云注册的,用于绕过视点变化和 occlusion 等问题,以实现 loop closing 和 relocalization。
  • methods: 该论文提出了一种基于图论的全球点云注册方法,使用了 robust 的数据关联和可靠的姿态估计,以及semantic 信息来减少点云数据的精度。
  • results: 实验结果表明,该方法在自行收集的indoor数据集和公共的 KITTI 数据集上具有最高成功率,即使点云之间的重叠率低、semantic质量低。代码已经开源在 GitHub 上(https://github.com/HKUST-Aerial-Robotics/Pagor)。
    Abstract Global point cloud registration is essential in many robotics tasks like loop closing and relocalization. Unfortunately, the registration often suffers from the low overlap between point clouds, a frequent occurrence in practical applications due to occlusion and viewpoint change. In this paper, we propose a graph-theoretic framework to address the problem of global point cloud registration with low overlap. To this end, we construct a consistency graph to facilitate robust data association and employ graduated non-convexity (GNC) for reliable pose estimation, following the state-of-the-art (SoTA) methods. Unlike previous approaches, we use semantic cues to scale down the dense point clouds, thus reducing the problem size. Moreover, we address the ambiguity arising from the consistency threshold by constructing a pyramid graph with multi-level consistency thresholds. Then we propose a cascaded gradient ascend method to solve the resulting densest clique problem and obtain multiple pose candidates for every consistency threshold. Finally, fast geometric verification is employed to select the optimal estimation from multiple pose candidates. Our experiments, conducted on a self-collected indoor dataset and the public KITTI dataset, demonstrate that our method achieves the highest success rate despite the low overlap of point clouds and low semantic quality. We have open-sourced our code https://github.com/HKUST-Aerial-Robotics/Pagor for this project.
    摘要 To achieve this, we construct a consistency graph to facilitate robust data association and employ graduated non-convexity (GNC) for reliable pose estimation, following state-of-the-art (SoTA) methods. Unlike previous approaches, we use semantic cues to scale down the dense point clouds, reducing the problem size. Additionally, we address the ambiguity arising from the consistency threshold by constructing a pyramid graph with multi-level consistency thresholds.We then propose a cascaded gradient ascend method to solve the resulting densest clique problem and obtain multiple pose candidates for every consistency threshold. Finally, fast geometric verification is employed to select the optimal estimation from multiple pose candidates. Our experiments, conducted on a self-collected indoor dataset and the public KITTI dataset, show that our method achieves the highest success rate despite the low overlap of point clouds and low semantic quality. We have open-sourced our code at https://github.com/HKUST-Aerial-Robotics/Pagor for this project.
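
The registration pipeline builds on a consistency graph over putative correspondences: two correspondences are connected if they preserve pairwise distances, and a large mutually consistent set (a clique) yields the inliers used for pose estimation. The sketch below shows only this graph-and-clique core with a plain maximum-clique search; the paper's semantic down-sampling, pyramid of consistency thresholds, cascaded solver, and GNC pose estimation are beyond this illustration, and the threshold value is an assumption.

```python
import numpy as np
import networkx as nx

def consistency_graph(src, dst, tau=0.05):
    """src, dst: (N, 3) points of N putative correspondences in two clouds.
    Edge (i, j) means |d(src_i, src_j) - d(dst_i, dst_j)| < tau,
    i.e. the two correspondences are mutually distance-consistent."""
    g = nx.Graph()
    g.add_nodes_from(range(len(src)))
    for i in range(len(src)):
        for j in range(i + 1, len(src)):
            d_src = np.linalg.norm(src[i] - src[j])
            d_dst = np.linalg.norm(dst[i] - dst[j])
            if abs(d_src - d_dst) < tau:
                g.add_edge(i, j)
    return g

rng = np.random.default_rng(1)
src = rng.normal(size=(30, 3))
dst = src + 0.5                               # a pure translation keeps pairwise distances
g = consistency_graph(src, dst)
inliers = max(nx.find_cliques(g), key=len)    # largest clique = mutually consistent set
print(len(inliers), "consistent correspondences")
```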

cs.AI - 2023-07-23

Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations?

  • paper_url: http://arxiv.org/abs/2307.12344
  • repo_url: https://github.com/ss-sun/right-for-the-wrong-reason
  • paper_authors: Susu Sun, Lisa M. Koch, Christian F. Baumgartner
  • for: 本研究旨在评估不同可解释机器学习技术检测虚假相关的能力。
  • methods: 本研究使用了五种 post-hoc 解释技术和一种内在可解释方法,来检测胸部 X 光诊断任务中人工添加的三种混淆因素。
  • results: 研究发现,post-hoc 解释技术 SHAP 和内在可解释的 Attri-Net 能够准确地检测出胸部 X 光诊断中的虚假相关,可以被用来可靠地识别模型的错误行为。
    Abstract While deep neural network models offer unmatched classification performance, they are prone to learning spurious correlations in the data. Such dependencies on confounding information can be difficult to detect using performance metrics if the test data comes from the same distribution as the training data. Interpretable ML methods such as post-hoc explanations or inherently interpretable classifiers promise to identify faulty model reasoning. However, there is mixed evidence whether many of these techniques are actually able to do so. In this paper, we propose a rigorous evaluation strategy to assess an explanation technique's ability to correctly identify spurious correlations. Using this strategy, we evaluate five post-hoc explanation techniques and one inherently interpretable method for their ability to detect three types of artificially added confounders in a chest x-ray diagnosis task. We find that the post-hoc technique SHAP, as well as the inherently interpretable Attri-Net provide the best performance and can be used to reliably identify faulty model behavior.
    摘要 深度神经网络模型具有无可比拟的分类性能,但它们容易学习数据中的虚假相关。当测试数据与训练数据来自同一分布时,这种对混淆信息的依赖很难通过性能指标来发现。可解释的机器学习方法,如事后解释或内在可解释的分类器,有望识别模型的错误推理;然而,关于其中许多技术是否真的能做到这一点,目前的证据并不一致。在这篇论文中,我们提出了一种严格的评估策略,用于评估一种解释技术能否正确识别虚假相关,并用它评估了五种事后解释技术和一种内在可解释方法在胸部X光诊断任务中检测三种人工添加的混淆因素的能力。我们发现,事后解释技术 SHAP 以及内在可解释的 Attri-Net 表现最佳,可以被用来可靠地识别模型的错误行为。
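
To reproduce the flavor of this check outside the chest-x-ray setting, one can inject a synthetic confounder into tabular data, fit a model, and inspect whether SHAP attributions concentrate on the confounder. The sketch below does exactly that with the shap library on a logistic-regression model; the data, the injected confounder, and the model choice are stand-ins for illustration, not the paper's setup.

```python
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

# Synthetic tabular stand-in: feature 0 is an injected confounder that tracks
# the label almost perfectly; feature 1 carries a weaker genuine signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
X = rng.normal(size=(2000, 5))
X[:, 0] = y + rng.normal(scale=0.1, size=2000)   # spurious shortcut
X[:, 1] += 0.5 * y                               # weak genuine signal

model = LogisticRegression(max_iter=1000).fit(X, y)
explainer = shap.LinearExplainer(model, X)       # background data = training set
sv = np.abs(explainer.shap_values(X)).mean(axis=0)   # mean |SHAP| per feature
print(dict(enumerate(np.round(sv, 3))))
# If the confounder (feature 0) dominates the attributions, the model is
# "right for the wrong reason" -- the same kind of check the paper runs on images.
```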

Towards Generic and Controllable Attacks Against Object Detection

  • paper_url: http://arxiv.org/abs/2307.12342
  • repo_url: https://github.com/liguopeng0923/LGP
  • paper_authors: Guopeng Li, Yue Xu, Jian Ding, Gui-Song Xia
  • for: 这个论文的目的是设计一种可控的攻击方法来攻击主流的物件探测器(Object Detectors,OD),以实现对OD的攻击。
  • methods: 这个论文使用了一种称为LGP(Local Perturbations with Adaptively Global Attacks)的白盒攻击方法,它可以对OD进行攻击,并且可以控制攻击的方向和大小。LGP使用了高品质的提案和三种不同的损失函数来实现攻击。
  • results: 实验结果显示,LGP可以成功攻击十六种主流的物件探测器,包括MS-COCO和DOTA datasets。此外,LGP也可以实现了不可见和传递性的攻击。codes可以在https://github.com/liguopeng0923/LGP.git中取得。
    Abstract Existing adversarial attacks against Object Detectors (ODs) suffer from two inherent limitations. Firstly, ODs have complicated meta-structure designs, hence most advanced attacks for ODs concentrate on attacking specific detector-intrinsic structures, which makes it hard for them to work on other detectors and motivates us to design a generic attack against ODs. Secondly, most works against ODs make Adversarial Examples (AEs) by generalizing image-level attacks from classification to detection, which brings redundant computations and perturbations in semantically meaningless areas (e.g., backgrounds) and leads to an emergency for seeking controllable attacks for ODs. To this end, we propose a generic white-box attack, LGP (local perturbations with adaptively global attacks), to blind mainstream object detectors with controllable perturbations. For a detector-agnostic attack, LGP tracks high-quality proposals and optimizes three heterogeneous losses simultaneously. In this way, we can fool the crucial components of ODs with a part of their outputs without the limitations of specific structures. Regarding controllability, we establish an object-wise constraint that exploits foreground-background separation adaptively to induce the attachment of perturbations to foregrounds. Experimentally, the proposed LGP successfully attacked sixteen state-of-the-art object detectors on MS-COCO and DOTA datasets, with promising imperceptibility and transferability obtained. Codes are publicly released in https://github.com/liguopeng0923/LGP.git
    摘要 现有的对象检测器(OD)的敌对攻击受到两种内在的限制。首先,OD有复杂的元结构设计,因此大多数对OD的高级攻击都是针对特定的检测器结构,这使得它们难以在其他检测器上工作,激励我们设计一种通用的攻击方法。其次,大多数对OD的攻击是将图像级别的攻击扩展到检测,这会带来 redundant computations和在意义不明的区域(如背景)中的扰动,从而导致对OD的攻击控制不足。为此,我们提出了一种通用的白盒攻击方法——本地加工攻击(LGP),可以让主流的对象检测器感受到控制的扰动。为了实现检测器无关的攻击,LGP跟踪高质量的提案并同时优化三种多元损失。这样,我们可以使对OD的核心组件的输出中的一部分被扰动,而不是特定的结构。在控制性方面,我们建立了一种对象强制约束,以便随时 Adaptively 调整扰动的分布。实验结果表明,我们提出的LGP方法成功地击败了MS-COCO和DOTA数据集上十六个状态对象检测器,并且在透明度和传输性方面取得了惊人的成绩。代码在https://github.com/liguopeng0923/LGP.git中公开发布。

Tackling the Curse of Dimensionality with Physics-Informed Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12306
  • repo_url: None
  • paper_authors: Zheyuan Hu, Khemraj Shukla, George Em Karniadakis, Kenji Kawaguchi
  • for: 解决高维偏微分方程(PDE)的计算问题,提高计算效率和可扩展性。
  • methods: 使用随机维度梯度下降(Stochastic Dimension Gradient Descent, SDGD)方法,即将梯度分解成对应不同维度的部分,然后在每次训练中随机采样其中一部分维度进行计算。
  • results: 能够快速解决许多困难的高维 PDE 问题,如 Hamilton-Jacobi-Bellman(HJB)方程和 Schrödinger 方程,并且可以在单个 GPU 上完成计算。例如,在 100,000 维中求解了一个非线性 HJB 方程和一个 Black-Scholes 方程,并在 6 小时内完成计算。
    Abstract The curse-of-dimensionality (CoD) taxes computational resources heavily with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs as Richard Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. In this paper, we develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding to different dimensions and samples randomly a subset of these dimensional pieces in each iteration of training PINNs. We theoretically prove the convergence guarantee and other desired properties of the proposed method. We experimentally demonstrate that the proposed method allows us to solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schr\"{o}dinger equations in thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. For instance, we solve nontrivial nonlinear PDEs (one HJB equation and one Black-Scholes equation) in 100,000 dimensions in 6 hours on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology of PINNs, SDGD can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.
    摘要 维度灾难(CoD)使计算成本随维度增加而呈指数增长,给计算资源带来沉重负担。正如理查德·贝尔曼在60多年前指出的,这给求解高维偏微分方程(PDE)带来了巨大挑战。尽管近年来在高维PDE的数值求解方面取得了一些进展,但此类计算代价高昂,一般非线性PDE在高维情形下的真正可扩展求解从未实现。在这篇论文中,我们提出了一种将物理信息神经网络(PINNs)扩展到任意高维PDE的新方法。该方法称为随机维度梯度下降(SDGD),它将PDE的梯度分解为对应不同维度的若干部分,并在每次训练PINNs的迭代中随机采样其中一部分维度。我们从理论上证明了该方法的收敛性保证及其他期望的性质。实验表明,该方法使我们能够在单个GPU上利用PINNs的无网格方式快速求解许多出了名难解的高维PDE,包括数千维的Hamilton-Jacobi-Bellman(HJB)方程和Schrödinger方程。例如,我们使用SDGD与PINNs在6小时内于单个GPU上求解了100,000维的非平凡非线性PDE(一个HJB方程和一个Black-Scholes方程)。由于SDGD是一种通用的PINNs训练方法,它可以应用于当前及未来的任何PINNs变体,将它们扩展到任意高维PDE。
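
The core trick of SDGD is easy to state: at every step, sample a random subset of input dimensions and build the PDE residual only from second derivatives along those dimensions, rescaled so the estimate stays unbiased. The PyTorch sketch below applies this to a toy Poisson-type residual; the specific PDE, network, and dimension counts are illustrative assumptions, not the paper's benchmark problems.

```python
import torch

def sdgd_residual(u_net, x, n_sampled_dims=8):
    """Sketch of Stochastic Dimension Gradient Descent for a Poisson-type PDE:
    estimate the Laplacian of u_net at points x using only a random subset of
    dimensions, rescaled so the estimate is unbiased in expectation."""
    d = x.shape[1]
    dims = torch.randperm(d)[:n_sampled_dims]
    x = x.clone().requires_grad_(True)
    u = u_net(x)
    grad = torch.autograd.grad(u.sum(), x, create_graph=True)[0]   # (N, d) first derivatives
    lap = 0.0
    for i in dims:  # second derivative only along the sampled dimensions
        lap = lap + torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
    lap = lap * (d / n_sampled_dims)          # unbiased rescaling to the full Laplacian
    return lap + 1.0                          # residual of the toy PDE: Laplace(u) + 1 = 0

# toy usage: a small PINN in 1000 dimensions
net = torch.nn.Sequential(torch.nn.Linear(1000, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
x = torch.rand(128, 1000)                     # 1000-dimensional collocation points
loss = sdgd_residual(net, x).pow(2).mean()
loss.backward()
```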

Controller Synthesis for Timeline-based Games

  • paper_url: http://arxiv.org/abs/2307.12289
  • repo_url: None
  • paper_authors: Renato Acampora, Luca Geatti, Nicola Gigante, Angelo Montanari, Valentino Picotti
  • for: 这篇论文旨在提供一种有效和计算优化的控制器生成方法来解决时间轴基于游戏中的游戏策略问题。
  • methods: 该论文使用的方法包括时间轴基于游戏和控制器生成。
  • results: 该论文提出的控制器生成方法可以有效地解决时间轴基于游戏中的游戏策略问题,并且computational complexity是2EXPTIME-complete。
    Abstract In the timeline-based approach to planning, the evolution over time of a set of state variables (the timelines) is governed by a set of temporal constraints. Traditional timeline-based planning systems excel at the integration of planning with execution by handling temporal uncertainty. In order to handle general nondeterminism as well, the concept of timeline-based games has been recently introduced. It has been proved that finding whether a winning strategy exists for such games is 2EXPTIME-complete. However, a concrete approach to synthesize controllers implementing such strategies is missing. This paper fills this gap, by providing an effective and computationally optimal approach to controller synthesis for timeline-based games.
    摘要 在时间轴基本方法中,时间变量集(时间轴)的演化遵循一组时间约束。传统的时间轴基本方法具有融合规划与执行的能力,可以处理时间不确定性。为了处理通用非决定性,时间轴基本方法中的游戏概念被最近引入。已证明找到赢家策略的存在是2EXPTIME-完善。但是,实现这种策略的控制器合成方法缺失。这篇论文填补了这个空白,提供了有效和计算优化的控制器合成方法 для时间轴基本方法中的游戏。

Decentralized Adaptive Formation via Consensus-Oriented Multi-Agent Communication

  • paper_url: http://arxiv.org/abs/2307.12287
  • repo_url: None
  • paper_authors: Yuming Xiang, Sizhao Li, Rongpeng Li, Zhifeng Zhao, Honggang Zhang
  • for: 这篇论文旨在提出一种自适应的多智能体编队控制方法,使编队能够以去中心化的方式随智能体数量的变化灵活调整,并在通信受限的环境下实现快速、稳定的编队控制。
  • methods: 该论文提出了一种新的多智能体强化学习方法,即面向共识的多智能体通信(Consensus-oriented Multi-Agent Communication, ConsMAC),通过有效聚合邻居消息,使智能体能够从局部状态感知全局信息并建立共识;此外,论文还利用策略蒸馏(policy distillation)来实现自适应的编队调整。
  • results: 实验结果表明,提出的方法在高速稳定性方面具有出色的表现,并且可以快速适应多 аген系统中 agent 的数量变化。
    Abstract Adaptive multi-agent formation control, which requires the formation to flexibly adjust along with the quantity variations of agents in a decentralized manner, belongs to one of the most challenging issues in multi-agent systems, especially under communication-limited constraints. In this paper, we propose a novel Consensus-based Decentralized Adaptive Formation (Cons-DecAF) framework. Specifically, we develop a novel multi-agent reinforcement learning method, Consensus-oriented Multi-Agent Communication (ConsMAC), to enable agents to perceive global information and establish the consensus from local states by effectively aggregating neighbor messages. Afterwards, we leverage policy distillation to accomplish the adaptive formation adjustment. Meanwhile, instead of pre-assigning specific positions of agents, we employ a displacement-based formation by Hausdorff distance to significantly improve the formation efficiency. The experimental results through extensive simulations validate that the proposed method has achieved outstanding performance in terms of both speed and stability.
    摘要 《适应多智能体formation控制》是多智能体系统中最为复杂的问题之一,尤其是在有限通信环境下。在这篇论文中,我们提出了一种新的Consensus-based Decentralized Adaptive Formation(Cons-DecAF)框架。具体来说,我们开发了一种新的多智能体学习方法——Consensus-oriented Multi-Agent Communication(ConsMAC),使智能体能够从本地状态中获得全局信息并达成一致。接着,我们通过策略填充来实现形态调整。而不是先行指定智能体的具体位置,我们employs a displacement-based formation by Hausdorff distance,以大幅提高形态效率。实验结果表明,我们提出的方法在速度和稳定性两个方面具有出色的表现。

Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education

  • paper_url: http://arxiv.org/abs/2307.12267
  • repo_url: https://github.com/douglashiwo/BoundaryDetectionFromHybridText
  • paper_authors: Zijie Zeng, Lele Sha, Yuheng Li, Kaixun Yang, Dragan Gašević, Guanliang Chen
  • for: 本研究旨在探讨AI生成文本检测在 hybrid 文本中的应用,即检测人类和生成LLMs共同写作的文本中的AI生成部分。
  • methods: 本研究提出了一种两步方法,包括在encoder训练过程中分离AI生成内容和人类写作内容,然后计算每两个相邻的聚合函数间的距离,并假设存在两个聚合函数间最远的距离处的边界。
  • results: 实验结果表明,提出的方法在不同的实验设置下 consistently 超过基eline方法的性能,并且 encoder 训练过程可以明显提高方法的性能。 在检测单边 hybrid 文本中的边界时,可以采用一定的 prototype 大小来进一步提高方法的性能,升师22% 在 Domain 评估中和18% 在 Out-of-Domain 评估中。
    Abstract The recent large language models (LLMs), e.g., ChatGPT, have been able to generate human-like and fluent responses when provided with specific instructions. While admitting the convenience brought by technological advancement, educators also have concerns that students might leverage LLMs to complete their writing assignments and pass them off as their original work. Although many AI content detection studies have been conducted as a result of such concerns, most of these prior studies modeled AI content detection as a classification problem, assuming that a text is either entirely human-written or entirely AI-generated. In this study, we investigated AI content detection in a rarely explored yet realistic setting where the text to be detected is collaboratively written by human and generative LLMs (i.e., hybrid text). We first formalized the detection task as identifying the transition points between human-written content and AI-generated content from a given hybrid text (boundary detection). Then we proposed a two-step approach where we (1) separated AI-generated content from human-written content during the encoder training process; and (2) calculated the distances between every two adjacent prototypes and assumed that the boundaries exist between the two adjacent prototypes that have the furthest distance from each other. Through extensive experiments, we observed the following main findings: (1) the proposed approach consistently outperformed the baseline methods across different experiment settings; (2) the encoder training process can significantly boost the performance of the proposed approach; (3) when detecting boundaries for single-boundary hybrid essays, the proposed approach could be enhanced by adopting a relatively large prototype size, leading to a 22% improvement in the In-Domain evaluation and an 18% improvement in the Out-of-Domain evaluation.
    摘要 Recent large language models (LLMs), such as ChatGPT, have been able to generate human-like and fluent responses when provided with specific instructions. However, educators have concerns that students may use LLMs to complete their writing assignments and pass them off as their own work. To address this issue, many AI content detection studies have been conducted, but most of these prior studies modeled AI content detection as a classification problem, assuming that a text is either entirely human-written or entirely AI-generated.In this study, we investigated AI content detection in a realistic setting where the text to be detected is collaboratively written by humans and generative LLMs (hybrid text). We formalized the detection task as identifying the transition points between human-written content and AI-generated content in a given hybrid text (boundary detection).To solve this problem, we proposed a two-step approach:1. Separate AI-generated content from human-written content during the encoder training process.2. Calculate the distances between every two adjacent prototypes and assume that the boundaries exist between the two adjacent prototypes with the furthest distance from each other.Through extensive experiments, we found the following main findings:1. Our proposed approach consistently outperformed baseline methods across different experiment settings.2. The encoder training process can significantly boost the performance of our proposed approach.3. When detecting boundaries for single-boundary hybrid essays, our proposed approach can be enhanced by adopting a relatively large prototype size, leading to a 22% improvement in the In-Domain evaluation and an 18% improvement in the Out-of-Domain evaluation.
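
Step two of the proposed approach, locating the boundary at the pair of adjacent prototypes that are furthest apart, reduces to a simple distance computation over per-segment embeddings. A hedged sketch follows; the encoder that would produce these embeddings (the paper's separately trained component) is replaced by synthetic vectors here.

```python
import numpy as np

def detect_boundary(prototypes):
    """prototypes: (S, D) one embedding per consecutive text segment, produced
    by an encoder trained to separate human- and AI-generated writing.
    Returns the index i such that the boundary is assumed to fall between
    segment i and segment i + 1 (largest adjacent-prototype distance)."""
    diffs = np.linalg.norm(np.diff(prototypes, axis=0), axis=1)  # distances between neighbors
    return int(np.argmax(diffs))

# toy embeddings: segments 0-3 cluster together, segments 4-7 cluster elsewhere
rng = np.random.default_rng(0)
protos = np.vstack([rng.normal(0, 0.1, size=(4, 16)),
                    rng.normal(3, 0.1, size=(4, 16))])
print("boundary after segment", detect_boundary(protos))   # -> 3
```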

Building-road Collaborative Extraction from Remotely Sensed Images via Cross-Interaction

  • paper_url: http://arxiv.org/abs/2307.12256
  • repo_url: None
  • paper_authors: Haonan Guo, Xin Su, Chen Wu, Bo Du, Liangpei Zhang
  • for: 本研究旨在提高从高分辨率遥感图像中提取建筑和道路的精度和效率,通过建筑和道路之间的协同抽取方法。
  • methods: 本研究提出了一种基于多任务和跨尺度特征交互的建筑-道路协同提取方法,通过任务间的信息交互和自适应感受野来提高每个任务的准确性。
  • results: 实验表明,提出的方法可以在各种城市和农村场景下实现出色的建筑-道路抽取性能和效率。
    Abstract Buildings are the basic carrier of social production and human life; roads are the links that interconnect social networks. Building and road information has important application value in the frontier fields of regional coordinated development, disaster prevention, auto-driving, etc. Mapping buildings and roads from very high-resolution (VHR) remote sensing images have become a hot research topic. However, the existing methods often ignore the strong spatial correlation between roads and buildings and extract them in isolation. To fully utilize the complementary advantages between buildings and roads, we propose a building-road collaborative extraction method based on multi-task and cross-scale feature interaction to improve the accuracy of both tasks in a complementary way. A multi-task interaction module is proposed to interact information across tasks and preserve the unique information of each task, which tackle the seesaw phenomenon in multitask learning. By considering the variation in appearance and structure between buildings and roads, a cross-scale interaction module is designed to automatically learn the optimal reception field for different tasks. Compared with many existing methods that train each task individually, the proposed collaborative extraction method can utilize the complementary advantages between buildings and roads by the proposed inter-task and inter-scale feature interactions, and automatically select the optimal reception field for different tasks. Experiments on a wide range of urban and rural scenarios show that the proposed algorithm can achieve building-road extraction with outstanding performance and efficiency.
    摘要 建筑和路径是社会生产和人类生活的基础载体,路径是社会网络之间的连接。建筑和路径信息在前沿领域的区域协调发展、灾害预防、自动驾驶等领域有重要应用价值。从very high-resolution(VHR)Remote sensing图像中提取建筑和路径信息已成为热点研究话题。然而,现有方法 часто忽略了道路和建筑之间的强相关性,单独提取它们。为了充分利用建筑和道路之间的补做优势,我们提议一种建筑道路共同提取方法,基于多任务和跨比例特征互动来提高两个任务的准确率。我们提出的多任务互动模块可以在不同任务之间交换信息,保持每个任务的独特信息,解决多任务学习中的摇摆现象。通过考虑建筑和道路之间的外观和结构变化,我们设计了一种跨比例互动模块,自动学习不同任务的最佳接收频率。与许多现有方法不同,我们的共同提取方法可以利用建筑和道路之间的补做优势,通过我们提出的交互特征互动和自动选择最佳接收频率,提高两个任务的准确率。在各种都市和农村场景下进行了广泛的实验,我们的算法可以实现出色的建筑道路提取和高效率。

Nature and the Machines

  • paper_url: http://arxiv.org/abs/2308.04440
  • repo_url: https://github.com/face-analysis/emonet
  • paper_authors: Huw Price, Matthew Connolly
  • for: 这篇论文讨论人工智能(AI)是否对人类构成生存性风险;一些批评者认为这个问题受到了过多关注,希望将其搁置,转而聚焦 AI 当下带来的风险。
  • methods: 论文通过论证与说理,指出《Nature》杂志在这一问题上的判断失误,并阐述 AI 的风险及其潜在后果。
  • results: 论文的结论是,《Nature》的这一判断失误十分严重,因为 AI 的风险不仅是当下的问题,也是未来的问题;我们不应忽视 AI 的潜在风险,而应考虑犯错的代价并采取相应的缓解措施。
    Abstract Does artificial intelligence (AI) pose existential risks to humanity? Some critics feel this question is getting too much attention, and want to push it aside in favour of conversations about the immediate risks of AI. These critics now include the journal Nature, where a recent editorial urges us to 'stop talking about tomorrow's AI doomsday when AI poses risks today.' We argue that this is a serious failure of judgement, on Nature's part. In science, as in everyday life, we expect influential actors to consider the consequences of error. As the world's leading scientific journal, Nature is certainly an influential actor, especially so in the absence of robust global regulation of AI. Yet it has manifestly failed to consider the cost of error in this case.
    摘要 人工智能(AI)是否对人类构成生存性风险?一些批评者认为这个问题受到了过多关注,希望将其搁置,转而讨论 AI 当下的风险。这些批评者如今包括《Nature》杂志——其最近的一篇社论呼吁我们"在 AI 今天已经带来风险的时候,停止谈论明天的 AI 末日"。我们认为,这是《Nature》的一次严重的判断失误。无论在科学中还是在日常生活中,我们都期望有影响力的行动者考虑犯错的后果。作为世界领先的科学期刊,《Nature》无疑是有影响力的行动者,在缺乏健全的全球 AI 监管的情况下更是如此;然而在这一问题上,它显然没有考虑犯错的代价。

MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems

  • paper_url: http://arxiv.org/abs/2307.12234
  • repo_url: None
  • paper_authors: Guan Shen, Jieru Zhao, Zeke Wang, Zhe Lin, Wenchao Ding, Chentao Wu, Quan Chen, Minyi Guo
  • for: 这个研究旨在提出一个名为 MARS 的新的映射框架,用于在多个加速器系统中选择适当的加速器设计,并对 DNN 进行有效的分割策略,以最大化并行度。
  • methods: 这个研究使用了 computation-aware 加速器选择策略和 communication-aware 分割策略,以提高 DNN 的执行效率。
  • results: 实验结果显示,对于常见的 DNN 负载,MARS 相比基线方法平均降低 32.2% 的延迟;对于异构模型,相比相应的最新方法降低 59.4% 的延迟。
    Abstract Along with the fast evolution of deep neural networks, the hardware system is also developing rapidly. As a promising solution achieving high scalability and low manufacturing cost, multi-accelerator systems widely exist in data centers, cloud platforms, and SoCs. Thus, a challenging problem arises in multi-accelerator systems: selecting a proper combination of accelerators from available designs and searching for efficient DNN mapping strategies. To this end, we propose MARS, a novel mapping framework that can perform computation-aware accelerator selection, and apply communication-aware sharding strategies to maximize parallelism. Experimental results show that MARS can achieve 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method.
    摘要 “深度神经网络的快速演化,硬件系统也在快速发展。作为数据中心、云平台和SoC中高扩展性低生产成本的解决方案,多加速器系统广泛存在。因此,多加速器系统中的一个挑战是选择合适的加速器设计,并搜索高度平行的DNN映射策略。为此,我们提出了MARS,一种新的映射框架,可以实现计算意识加速器选择,以及通信意识分割策略,以最大化并行性。实验结果显示,MARS可以相对基eline方法平均减少32.2%的延迟,对于常见的DNN任务,并且相对相对国际先进方法,对于多元模型,可以减少59.4%的延迟。”

Geometry-Aware Adaptation for Pretrained Models

  • paper_url: http://arxiv.org/abs/2307.12226
  • repo_url: None
  • paper_authors: Nicholas Roberts, Xintong Li, Dyah Adila, Sonia Cromp, Tzu-Heng Huang, Jitian Zhao, Frederic Sala
  • for: The paper aims to improve the performance of machine learning models, specifically zero-shot models, in predicting new classes without any additional training.
  • methods: The proposed approach, called Loki, adapts the trained model to predict new classes by swapping the standard prediction rule (argmax) with the Fréchet mean. The approach is a drop-in replacement and does not require any additional training data.
  • results: The paper shows that Loki achieves up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no external metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
    Abstract Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes -- or, in the case of zero-shot prediction, to improve its performance -- without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping argmax with the Fr\'echet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
    摘要 机器学习模型——包括知名的零样本模型——往往在标签只占更大标签空间一小部分的数据集上训练。这类标签空间通常配有一个通过标签间距离相互关联的度量。我们提出一种简单的方法,利用这一信息使训练好的模型能够可靠地预测新类别——或在零样本预测场景下提升其性能——而无需任何额外训练。我们的技术可直接替换标准预测规则,将 argmax 换成 Fréchet 均值。我们为该方法提供了全面的理论分析,包括:(i) 在标签空间直径、样本复杂度与模型维度之间进行权衡的学习理论结果;(ii) 对能够预测任意未观测类别的全部情形的刻画;(iii) 当无法覆盖全部未观测类别时,一种类似主动学习的最优下一类别选择流程,用于获得最优的训练类别。实验方面,借助易于获得的外部度量,我们提出的方法 Loki 在 ImageNet 上相对 SimCLR 最多提升 29.7%,并可扩展到数十万个类别;在没有外部度量时,Loki 可使用由类别嵌入自行导出的度量,在 CLIP 等预训练零样本模型上取得 10.5% 的提升。
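
The Fréchet-mean prediction rule described above is easy to illustrate. The sketch below is not the paper's code: it assumes a finite label space with a precomputed pairwise distance matrix and simply replaces argmax over observed classes with the label minimizing the expected squared distance, so even a never-observed class can be returned.

```python
import numpy as np

def frechet_mean_predict(probs, observed_classes, all_classes, dist):
    """Replace argmax with a Frechet mean over a label metric space.

    probs: (n_observed,) model probabilities over the observed/training classes.
    all_classes may include labels never seen in training.
    dist: dist[a][b] = metric distance between labels a and b.
    Returns the label minimising the expected squared distance under probs."""
    costs = []
    for y in all_classes:
        cost = sum(p * dist[y][c] ** 2 for p, c in zip(probs, observed_classes))
        costs.append(cost)
    return all_classes[int(np.argmin(costs))]

# Toy 1-D label metric: labels live on a line, class 2 was never observed.
labels = [0, 1, 2, 3]
dist = {a: {b: abs(a - b) for b in labels} for a in labels}
probs = np.array([0.05, 0.5, 0.45])        # model trained only on classes 0, 1, 3
print(frechet_mean_predict(probs, [0, 1, 3], labels, dist))  # -> 2 (unobserved class)
```
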

FATRER: Full-Attention Topic Regularizer for Accurate and Robust Conversational Emotion Recognition

  • paper_url: http://arxiv.org/abs/2307.12221
  • repo_url: https://github.com/ludybupt/FATRER
  • paper_authors: Yuzhao Mao, Di Lu, Xiaojie Wang, Yang Zhang
  • for: 这篇论文的目的是理解对话语句中所引发的对话者情绪。
  • methods: 这篇论文使用一种由全注意力话题正则化器增强的情绪识别器,以同时提高模型的鲁棒性和准确性。
  • results: 实验显示,该模型比现有模型更能抵抗三种对抗攻击,并在情绪识别任务上取得了更好的效果。
    Abstract This paper concentrates on the understanding of interlocutors' emotions evoked in conversational utterances. Previous studies in this literature mainly focus on more accurate emotional predictions, while ignoring model robustness when the local context is corrupted by adversarial attacks. To maintain robustness while ensuring accuracy, we propose an emotion recognizer augmented by a full-attention topic regularizer, which enables an emotion-related global view when modeling the local context in a conversation. A joint topic modeling strategy is introduced to implement regularization from both representation and loss perspectives. To avoid over-regularization, we drop the constraints on prior distributions that exist in traditional topic modeling and perform probabilistic approximations based entirely on attention alignment. Experiments show that our models obtain more favorable results than state-of-the-art models, and gain convincing robustness under three types of adversarial attacks.
    摘要 本文关注对话语句中所引发的对话者情绪的理解。以往研究主要追求更准确的情绪预测,而忽略了当局部上下文受到对抗攻击破坏时模型的鲁棒性。为了在保证准确性的同时保持鲁棒性,我们提出了一种由全注意力话题正则化器增强的情绪识别器,使模型在建模对话局部上下文时能够获得与情绪相关的全局视角。我们引入联合话题建模策略,从表示和损失两个角度实现正则化。为避免过度正则化,我们去除了传统话题建模中对先验分布的约束,完全基于注意力对齐进行概率近似。实验表明,我们的模型取得了优于最新模型的结果,并在三种对抗攻击下表现出令人信服的鲁棒性。

Expediting Building Footprint Segmentation from High-resolution Remote Sensing Images via progressive lenient supervision

  • paper_url: http://arxiv.org/abs/2307.12220
  • repo_url: https://github.com/haonanguo/bfseg-efficient-building-footprint-segmentation-framework
  • paper_authors: Haonan Guo, Bo Du, Chen Wu, Xin Su, Liangpei Zhang
  • for: 本研究旨在提高从遥感图像中提取建筑印迹的效率和准确率。
  • methods: 本文提出了一种高效的建筑物印迹分割框架,包括一种密集连接的粗到细特征融合解码网络,以及一种宽松的深度监督与蒸馏策略。
  • results: 根据实验结果,提出的建筑印迹分割框架可以高效地提取建筑印迹,并且可以在不同的encoder网络上达到出色的性能和效率。
    Abstract The efficacy of building footprint segmentation from remotely sensed images has been hindered by model transfer effectiveness. Many existing building segmentation methods were developed upon the encoder-decoder architecture of U-Net, in which the encoder is finetuned from the newly developed backbone networks that are pre-trained on ImageNet. However, the heavy computational burden of the existing decoder designs hampers the successful transfer of these modern encoder networks to remote sensing tasks. Even the widely-adopted deep supervision strategy fails to mitigate these challenges due to its invalid loss in hybrid regions where foreground and background pixels are intermixed. In this paper, we conduct a comprehensive evaluation of existing decoder network designs for building footprint segmentation and propose an efficient framework denoted as BFSeg to enhance learning efficiency and effectiveness. Specifically, a densely-connected coarse-to-fine feature fusion decoder network that facilitates easy and fast feature fusion across scales is proposed. Moreover, considering the invalidity of hybrid regions in the down-sampled ground truth during the deep supervision process, we present a lenient deep supervision and distillation strategy that enables the network to learn proper knowledge from deep supervision. Building upon these advancements, we have developed a new family of building segmentation networks, which consistently surpass prior works with outstanding performance and efficiency across a wide range of newly developed encoder networks. The code will be released on https://github.com/HaonanGuo/BFSeg-Efficient-Building-Footprint-Segmentation-Framework.
    摘要 遥感图像中建筑物印迹分割的效果一直受到模型迁移有效性的限制。现有的许多建筑物分割方法基于 U-Net 的编码器-解码器架构,其编码器由在 ImageNet 上预训练的新型骨干网络微调而来。然而,现有解码器设计带来的沉重计算负担,阻碍了这些现代编码器网络向遥感任务的成功迁移。即便是被广泛采用的深度监督策略也无法缓解这些挑战,因为它在前景与背景像素混杂的混合区域中的损失是无效的。在本文中,我们对现有的建筑物印迹分割解码器设计进行了全面评估,并提出了一个名为 BFSeg 的高效框架,以提升学习的效率与效果。具体而言,我们提出了一种密集连接的粗到细特征融合解码网络,便于在不同尺度间轻松而快速地融合特征。此外,考虑到深度监督过程中下采样真值中混合区域的无效性,我们提出了一种宽松的深度监督与蒸馏策略,使网络能够从深度监督中学习到正确的知识。基于这些改进,我们开发了一个新的建筑物分割网络系列,在各种新型编码器网络上均以出色的性能和效率超越了以往工作。代码将在 https://github.com/HaonanGuo/BFSeg-Efficient-Building-Footprint-Segmentation-Framework 发布。
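
To make the "lenient deep supervision" idea concrete, here is a hypothetical PyTorch sketch of an auxiliary loss that ignores hybrid regions of the down-sampled ground truth; the pooling-based purity test and the `tol` threshold are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def lenient_deep_supervision_loss(aux_logits, gt, tol=0.05):
    """Auxiliary BCE loss at a lower resolution that skips 'hybrid' cells, i.e.
    down-sampled ground-truth cells mixing foreground and background.
    aux_logits: (B,1,h,w) decoder side-output; gt: (B,1,H,W) binary mask."""
    scale = gt.shape[-1] // aux_logits.shape[-1]
    pooled = F.avg_pool2d(gt.float(), kernel_size=scale)   # foreground fraction per cell
    pure_fg = pooled > 1.0 - tol
    pure_bg = pooled < tol
    valid = (pure_fg | pure_bg).float()                    # 1 for pure cells, 0 for hybrid
    target = pure_fg.float()
    loss = F.binary_cross_entropy_with_logits(aux_logits, target, reduction="none")
    return (loss * valid).sum() / valid.sum().clamp(min=1.0)

# Toy usage: a 64x64 mask supervised at 1/4 resolution.
gt = (torch.rand(2, 1, 64, 64) > 0.5).float()
aux = torch.randn(2, 1, 16, 16)
print(lenient_deep_supervision_loss(aux, gt))
```
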

A Comprehensive Review and Systematic Analysis of Artificial Intelligence Regulation Policies

  • paper_url: http://arxiv.org/abs/2307.12218
  • repo_url: None
  • paper_authors: Weiyue Wu, Shaoshan Liu
  • for: This paper aims to help governing bodies understand and regulate AI technologies in a chaotic global regulatory space.
  • methods: The paper presents a comprehensive review of AI regulation proposals from different geographical locations and cultural backgrounds, and develops a framework for analyzing these proposals.
  • results: The paper performs a systematic analysis of AI regulation proposals to identify potential failures and provide insights for governing bodies to untangle the AI regulatory chaos.
  • for: 这篇论文目标是帮助管理机构理解和调控全球AI regulatory空间中的混乱。
  • methods: 论文首先提供了AI regulatory proposal的全面回顾,然后开发了一个分析这些提案的框架。
  • results: 论文进行了系统性的AI regulatory proposal分析,以便发现可能的失败并为管理机构提供干预措施。
    Abstract Due to the cultural and governance differences of countries around the world, there currently exists a wide spectrum of AI regulation policy proposals that have created a chaos in the global AI regulatory space. Properly regulating AI technologies is extremely challenging, as it requires a delicate balance between legal restrictions and technological developments. In this article, we first present a comprehensive review of AI regulation proposals from different geographical locations and cultural backgrounds. Then, drawing from historical lessons, we develop a framework to facilitate a thorough analysis of AI regulation proposals. Finally, we perform a systematic analysis of these AI regulation proposals to understand how each proposal may fail. This study, containing historical lessons and analysis methods, aims to help governing bodies untangling the AI regulatory chaos through a divide-and-conquer manner.
    摘要 由于世界各国在文化与治理上的差异,目前全球 AI 监管领域存在范围广泛的监管政策提案,造成了混乱的局面。恰当地监管 AI 技术极具挑战性,因为这需要在法律约束与技术发展之间取得微妙的平衡。在本文中,我们首先对来自不同地域和文化背景的 AI 监管提案进行了全面回顾;随后,借鉴历史经验,我们构建了一个便于深入分析 AI 监管提案的框架;最后,我们对这些提案进行了系统分析,以理解每项提案可能失败的原因。本研究包含历史经验与分析方法,旨在帮助治理机构以分而治之的方式理清 AI 监管乱局。

Mental Workload Estimation with Electroencephalogram Signals by Combining Multi-Space Deep Models

  • paper_url: http://arxiv.org/abs/2308.02409
  • repo_url: None
  • paper_authors: Hong-Hai Nguyen, Ngumimi Karen Iyortsuun, Hyung-Jeong Yang, Guee-Sang Lee, Soo-Hyung Kim
  • for: 这篇论文旨在预测 mental workload 的三个状态和级别,以便提高 mental health 评估的准确性。
  • methods: 该论文使用 Temporal Convolutional Networks 和 Multi-Dimensional Residual Block 等方法,综合利用多个维度空间以实现最佳的心理负荷估计。
  • results: 该论文通过将 mental workload 分类为三个状态并估计级别,帮助早期发现 mental health 问题,从而预防严重的健康问题并提高生活质量。
    Abstract The human brain is in a continuous state of activity during both work and rest. Mental activity is a daily process, and when the brain is overworked, it can have negative effects on human health. In recent years, great attention has been paid to early detection of mental health problems because it can help prevent serious health problems and improve quality of life. Several signals are used to assess mental state, but the electroencephalogram (EEG) is widely used by researchers because of the large amount of information it provides about the brain. This paper aims to classify mental workload into three states and estimate continuum levels. Our method combines multiple dimensions of space to achieve the best results for mental estimation. In the time domain approach, we use Temporal Convolutional Networks, and in the frequency domain, we propose a new architecture called the Multi-Dimensional Residual Block, which combines residual blocks.
    摘要 人脑在工作和休息时都处于持续活动状态。心理活动是每天都在进行的过程,当大脑过度劳累时,可能会对人体健康产生负面影响。近年来,心理健康问题的早期发现受到了广泛关注,因为它有助于预防严重的健康问题并提高生活质量。可用于评估心理状态的信号有多种,其中脑电图(EEG)因能提供大量关于大脑的信息而被研究者广泛使用。本文旨在将心理负荷分类为三种状态并估计其连续水平。我们的方法结合多个维度空间,以获得最佳的心理负荷估计结果。在时域方法中,我们使用时间卷积网络(Temporal Convolutional Networks);在频域中,我们提出了一种称为多维残差块(Multi-Dimensional Residual Block)的新架构,它将多个残差块组合在一起。
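
As a rough illustration of the time-domain branch, the sketch below builds a TCN-style dilated residual block over EEG channels; the channel count, kernel size and dilation schedule are placeholders and do not reproduce the paper's Multi-Dimensional Residual Block.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """A generic TCN-style residual block for EEG sequences: two dilated 1-D
    convolutions plus a skip connection (illustrative configuration only)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
        )
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, time)
        return self.act(x + self.net(x))

# Stack blocks with growing dilation to cover long EEG windows.
backbone = nn.Sequential(*[DilatedResidualBlock(32, dilation=2 ** i) for i in range(4)])
eeg = torch.randn(8, 32, 512)            # batch of 8, 32 channels, 512 samples
print(backbone(eeg).shape)               # torch.Size([8, 32, 512])
```
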

Traffic Flow Simulation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2307.16762
  • repo_url: None
  • paper_authors: Junfeng Li, Changqing Yan
  • for: This paper aims to provide a simulation environment for testing and evaluating the development of autonomous driving technology.
  • methods: The paper uses micro-traffic flow modeling and cellular automata to build the simulation environment, and it also employs a vehicle motion model based on bicycle intelligence.
  • results: The paper develops a simulation environment for autonomous vehicle flow, which can accurately control the acceleration, braking, steering, and lighting actions of the vehicle based on the bus instructions issued by the decision-making system.
  • for: 这篇论文是为了提供自动驾驶技术的测试和评估技术的实验环境。
  • methods: 这篇论文使用微车流模型和细胞自动机来构建实验环境,同时还使用基于自行车智能的车辆动态模型。
  • results: 这篇论文建立了一个可以准确控制车辆加速、减速、转向和灯光动作的自动驾驶流 simulator。
    Abstract A traffic system is a random and complex large system, which is difficult to conduct repeated modelling and control research in a real traffic environment. With the development of automatic driving technology, the requirements for testing and evaluating the development of automatic driving technology are getting higher and higher, so the application of computer technology for traffic simulation has become a very effective technical means. Based on the micro-traffic flow modelling, this paper adopts the vehicle motion model based on cellular automata and the theory of bicycle intelligence to build the simulation environment of autonomous vehicle flow. The architecture of autonomous vehicles is generally divided into a perception system, decision system and control system. The perception system is generally divided into many subsystems, responsible for autonomous vehicle positioning, obstacle recognition, traffic signal detection and recognition and other tasks. Decision systems are typically divided into many subsystems that are responsible for tasks such as path planning, path planning, behavior selection, motion planning, and control. The control system is the basis of the selfdriving car, and each control system of the vehicle needs to be connected with the decision-making system through the bus, and can accurately control the acceleration degree, braking degree, steering amplitude, lighting control and other driving actions according to the bus instructions issued by the decision-making system, so as to achieve the autonomous driving of the vehicle.
    摘要 交通系统是一个随机且复杂的大型系统,很难在真实交通环境中进行重复的建模与控制研究。随着自动驾驶技术的发展,对自动驾驶技术测试与评估的要求越来越高,因此利用计算机技术进行交通仿真成为一种非常有效的技术手段。本文基于微观交通流建模,采用基于元胞自动机的车辆运动模型与自行车智能理论,构建了自动驾驶车流的仿真环境。自动驾驶车辆的架构通常分为感知系统、决策系统和控制系统。感知系统通常分为多个子系统,负责车辆定位、障碍物识别、交通信号检测与识别等任务;决策系统通常分为多个子系统,负责路径规划、行为选择、运动规划与控制等任务;控制系统是自动驾驶汽车的基础,车辆的各个控制系统需要通过总线与决策系统相连,并能根据决策系统下发的总线指令精确控制加速程度、制动程度、转向幅度、灯光控制等驾驶动作,从而实现车辆的自动驾驶。
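
A cellular-automaton vehicle motion model of the kind mentioned above can be sketched with the classic Nagel–Schreckenberg update; this is a generic stand-in for illustration, not the rules used in the paper.

```python
import numpy as np

def nagel_schreckenberg(road, v, v_max=5, p_slow=0.3, rng=None):
    """One step of the Nagel-Schreckenberg cellular automaton on a ring road.
    road: (L,) int array, -1 for empty cells, otherwise the vehicle index.
    v: dict vehicle index -> current speed in cells per step."""
    rng = rng or np.random.default_rng()
    L = len(road)
    positions = {int(road[i]): i for i in range(L) if road[i] >= 0}
    new_road = -np.ones(L, dtype=int)
    for car, pos in positions.items():
        # 1) accelerate, 2) brake to the gap ahead, 3) random slowdown, 4) move
        gap = 1
        while road[(pos + gap) % L] == -1 and gap <= v_max:
            gap += 1
        speed = min(v[car] + 1, v_max, gap - 1)
        if speed > 0 and rng.random() < p_slow:
            speed -= 1
        v[car] = speed
        new_road[(pos + speed) % L] = car
    return new_road

road = -np.ones(50, dtype=int)
road[[0, 10, 20, 30]] = [0, 1, 2, 3]
speeds = {0: 0, 1: 0, 2: 0, 3: 0}
for _ in range(100):
    road = nagel_schreckenberg(road, speeds)
print(speeds)
```
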

DeepCL: Deep Change Feature Learning on Remote Sensing Images in the Metric Space

  • paper_url: http://arxiv.org/abs/2307.12208
  • repo_url: https://github.com/haonanguo/deepcl
  • paper_authors: Haonan Guo, Bo Du, Chen Wu, Chengxi Han, Liangpei Zhang
  • for: 本研究旨在提高自动变化检测(CD)的精度和可解释性,以便更好地监测地表动态变化。
  • methods: 我们结合度量学习的强时间关系模型和分割的优势,提出了深度变化特征学习(DeepCL)框架。我们设计了一种具有强时间相关性的强对比损失函数,以显著提高对比样本的重要性。此外,我们利用模型的时间关系知识来引导分割过程,以更好地检测变化区域。
  • results: 我们对 DeepCL 框架进行了严格的理论和实验评估,结果显示,DeepCL 在特征识别率、鲁棒性和可解释性等方面具有明显的优势,并且可以与多种 CD 方法进行结合使用。广泛的比较实验证明 DeepCL 在 CD 领域的精度和可解释性都达到了国际先进水平。
    Abstract Change detection (CD) is an important yet challenging task in the Earth observation field for monitoring Earth surface dynamics. The advent of deep learning techniques has recently propelled automatic CD into a technological revolution. Nevertheless, deep learning-based CD methods are still plagued by two primary issues: 1) insufficient temporal relationship modeling and 2) pseudo-change misclassification. To address these issues, we complement the strong temporal modeling ability of metric learning with the prominent fitting ability of segmentation and propose a deep change feature learning (DeepCL) framework for robust and explainable CD. Firstly, we designed a hard sample-aware contrastive loss, which reweights the importance of hard and simple samples. This loss allows for explicit modeling of the temporal correlation between bi-temporal remote sensing images. Furthermore, the modeled temporal relations are utilized as knowledge prior to guide the segmentation process for detecting change regions. The DeepCL framework is thoroughly evaluated both theoretically and experimentally, demonstrating its superior feature discriminability, resilience against pseudo changes, and adaptability to a variety of CD algorithms. Extensive comparative experiments substantiate the quantitative and qualitative superiority of DeepCL over state-of-the-art CD approaches.
    摘要 变化检测(CD)是地球观测领域中重要而又具有挑战性的任务,用于监测地表动态。深度学习技术的出现推动自动化变化检测进入了技术变革阶段。然而,基于深度学习的变化检测方法仍受两个主要问题困扰:1)时间关系建模不足;2)伪变化误分类。为了解决这些问题,我们将度量学习强大的时间建模能力与分割出色的拟合能力相结合,提出了一个用于鲁棒且可解释变化检测的深度变化特征学习(DeepCL)框架。首先,我们设计了一种难样本感知的对比损失,对难样本和简单样本的重要性进行重新加权。该损失能够显式建模双时相遥感图像之间的时间相关性。此外,建模得到的时间关系被用作先验知识,引导分割过程以检测变化区域。我们从理论和实验两方面对 DeepCL 框架进行了全面评估,结果表明其具有更强的特征判别能力、对伪变化的鲁棒性,以及对多种变化检测算法的适应性。大量对比实验在定量和定性上均证明了 DeepCL 优于最新的变化检测方法。
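
The hard-sample-aware contrastive loss can be sketched as a margin-based contrastive loss between bi-temporal feature maps with a hardness-dependent reweighting. The focal-style weighting below is an assumption; the paper's exact reweighting is not reproduced here.

```python
import torch

def hard_aware_contrastive_loss(feat_t1, feat_t2, change_mask, margin=2.0, gamma=2.0):
    """Sketch of a hard-sample-aware contrastive loss: unchanged pixels are pulled
    together, changed pixels pushed apart by a margin, and each pixel is reweighted
    by how 'hard' it currently is.
    feat_t1, feat_t2: (B, C, H, W) embeddings; change_mask: (B, 1, H, W) in {0,1}."""
    d = torch.norm(feat_t1 - feat_t2, dim=1, keepdim=True)          # (B,1,H,W)
    pos_loss = d ** 2                                                # unchanged: want d -> 0
    neg_loss = torch.clamp(margin - d, min=0.0) ** 2                 # changed:   want d >= margin
    # hardness: changed pairs that are still close, or unchanged pairs that are far apart
    hardness = torch.where(change_mask > 0.5,
                           torch.clamp(1.0 - d / margin, 0.0, 1.0),
                           torch.clamp(d / margin, 0.0, 1.0))
    weight = (1.0 + hardness) ** gamma
    loss = torch.where(change_mask > 0.5, neg_loss, pos_loss) * weight
    return loss.mean()

f1, f2 = torch.randn(2, 16, 64, 64), torch.randn(2, 16, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.9).float()
print(hard_aware_contrastive_loss(f1, f2, mask))
```
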

Monadic Deep Learning

  • paper_url: http://arxiv.org/abs/2307.12187
  • repo_url: https://github.com/ThoughtWorksInc/monadic-deep-learning
  • paper_authors: Bo Yang, Zhihao Zhang, Kirisame Marisa, Kai Shi
  • for: 本研究旨在提供一个可靠的、类型安全的深度学习框架,用于在 Scala 中实现动态神经网络。
  • methods: 本研究提出了一种新方法,可以对包含多个可训练变量的静态类型函数自动进行反向模式微分,并能与元语言自由互操作。此外,研究还设计了一组单子(monad)和单子变换器,使用户能够创建表示动态神经网络的单子表达式。
  • results: 研究得到了一个可靠且类型安全的深度学习框架,可以在 Scala 中表达复杂的神经网络。用户可以使用该框架创建包含多个可训练变量的神经网络,同时保持类型安全。
    Abstract The Java and Scala community has built a very successful big data ecosystem. However, most of neural networks running on it are modeled in dynamically typed programming languages. These dynamically typed deep learning frameworks treat neural networks as differentiable expressions that contain many trainable variable, and perform automatic differentiation on those expressions when training them. Until 2019, none of the learning frameworks in statically typed languages provided the expressive power of traditional frameworks. Their users are not able to use custom algorithms unless creating plenty of boilerplate code for hard-coded back-propagation. We solved this problem in DeepLearning.scala 2. Our contributions are: 1. We discovered a novel approach to perform automatic differentiation in reverse mode for statically typed functions that contain multiple trainable variable, and can interoperate freely with the metalanguage. 2. We designed a set of monads and monad transformers, which allow users to create monadic expressions that represent dynamic neural networks. 3. Along with these monads, we provide some applicative functors, to perform multiple calculations in parallel. With these features, users of DeepLearning.scala were able to create complex neural networks in an intuitive and concise way, and still maintain type safety.
    摘要 Java 和 Scala 社区已经构建了非常成功的大数据生态系统。然而,运行在其上的神经网络大多是用动态类型编程语言建模的。这些动态类型深度学习框架将神经网络视为包含许多可训练变量的可微分表达式,并在训练时对这些表达式自动求导。直到 2019 年,静态类型语言中的学习框架都没有提供与传统框架相当的表达能力,其用户若要使用自定义算法,只能为硬编码的反向传播编写大量样板代码。我们在 DeepLearning.scala 2 中解决了这一问题。我们的贡献包括:1. 我们发现了一种新方法,可以对包含多个可训练变量的静态类型函数进行反向模式自动微分,并能与元语言自由互操作;2. 我们设计了一组单子(monad)和单子变换器,允许用户创建表示动态神经网络的单子表达式;3. 配合这些单子,我们还提供了一些应用函子(applicative functor),用于并行执行多个计算。借助这些特性,DeepLearning.scala 的用户能够以直观而简洁的方式创建复杂的神经网络,同时保持类型安全。

Machine learning discovers invariants of braids and flat braids

  • paper_url: http://arxiv.org/abs/2307.12185
  • repo_url: None
  • paper_authors: Alexei Lisitsa, Mateo Salles, Alexei Vernitski
  • for: 这篇论文使用机器学习对辫子(braid)或平坦辫子(flat braid)的例子进行分类,以判断它们是否平凡。
  • methods: 这篇论文使用神经网络(多层感知器)进行监督学习,以获得良好的分类结果。
  • results: 通过这项研究,作者发现了新的、便于使用的辫子不变量,其中包括平坦辫子的完全不变量。
    Abstract We use machine learning to classify examples of braids (or flat braids) as trivial or non-trivial. Our ML takes form of supervised learning using neural networks (multilayer perceptrons). When they achieve good results in classification, we are able to interpret their structure as mathematical conjectures and then prove these conjectures as theorems. As a result, we find new convenient invariants of braids, including a complete invariant of flat braids.
    摘要 我们使用机器学习将辫子(或平坦辫子)的例子分类为平凡或非平凡。我们的机器学习采用神经网络(多层感知器)的监督学习形式。当它们在分类中取得良好结果时,我们可以将其结构解释为数学猜想,进而将这些猜想证明为定理。由此,我们发现了新的、便于使用的辫子不变量,其中包括平坦辫子的完全不变量。

On the Expressivity of Multidimensional Markov Reward

  • paper_url: http://arxiv.org/abs/2307.12184
  • repo_url: None
  • paper_authors: Shuwa Miura
  • for: 本研究考虑了马尔可夫奖励在不确定条件下的序贯决策中的表达能力。
  • methods: 我们将马尔可夫决策过程(MDP)中的奖励函数视为刻画智能体期望行为的一种方式。假设期望行为由一组可接受的策略给出,我们研究是否存在标量或多维马尔可夫奖励函数,使该集合中的策略比其他策略更可取。我们的主要结果给出了这类奖励函数存在的充分必要条件。
  • results: 我们还证明,对于任何非退化的确定性策略集合,都存在一个能够刻画它的多维马尔可夫奖励函数。
    Abstract We consider the expressivity of Markov rewards in sequential decision making under uncertainty. We view reward functions in Markov Decision Processes (MDPs) as a means to characterize desired behaviors of agents. Assuming desired behaviors are specified as a set of acceptable policies, we investigate if there exists a scalar or multidimensional Markov reward function that makes the policies in the set more desirable than the other policies. Our main result states both necessary and sufficient conditions for the existence of such reward functions. We also show that for every non-degenerate set of deterministic policies, there exists a multidimensional Markov reward function that characterizes it
    摘要 我们考虑马尔可夫奖励在不确定条件下序贯决策中的表达能力。我们将马尔可夫决策过程(MDP)中的奖励函数视为刻画智能体期望行为的一种方式。假设期望行为由一组可接受的策略给出,我们研究是否存在标量或多维马尔可夫奖励函数,使该集合中的策略比其他策略更可取。我们的主要结果给出了此类奖励函数存在的充分必要条件。我们还证明,对于每个非退化的确定性策略集合,都存在一个能够刻画它的多维马尔可夫奖励函数。
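
The expressivity question can be made concrete with a brute-force check on a tiny MDP: given a candidate reward, does every policy in the acceptable set obtain strictly higher value than every other deterministic policy? The sketch below uses a scalar reward and a fixed start state purely for illustration; it is not the paper's characterization.

```python
import itertools
import numpy as np

def policy_value(P, r, policy, gamma=0.9):
    """Exact policy evaluation for a deterministic policy in a finite MDP.
    P: (S, A, S) transition probabilities, r: (S, A) reward, policy: (S,) actions."""
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]           # (S, S)
    r_pi = r[np.arange(S), policy]           # (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def reward_expresses(P, r, acceptable, start=0, gamma=0.9):
    """Does the candidate reward make every 'acceptable' deterministic policy score
    strictly higher (from the start state) than every other one?"""
    S, A = r.shape
    all_policies = [np.array(p) for p in itertools.product(range(A), repeat=S)]
    acc = {tuple(p) for p in acceptable}
    good = min(policy_value(P, r, p, gamma)[start] for p in all_policies if tuple(p) in acc)
    bad = max(policy_value(P, r, p, gamma)[start] for p in all_policies if tuple(p) not in acc)
    return good > bad

# Tiny 2-state, 2-action MDP: action 0 stays put, action 1 moves to the other state.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)
P[0, 1, 1] = P[1, 1, 0] = 1.0
r = np.array([[0.0, 1.0], [0.0, 1.0]])       # reward the 'move' action in either state
print(reward_expresses(P, r, acceptable=[np.array([1, 1])]))   # True
```
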

Security and Privacy Issues of Federated Learning

  • paper_url: http://arxiv.org/abs/2307.12181
  • repo_url: https://github.com/JiangChSo/PFLM
  • paper_authors: Jahid Hasan
  • for: 本研究旨在提供一个涵盖多种机器学习模型的 Federated Learning(FL)安全性和隐私性挑战的全面分类。
  • methods: 本研究梳理了多种攻击与防御策略,包括投毒攻击、后门攻击、成员推断攻击、基于生成对抗网络(GAN)的攻击以及差分隐私攻击。
  • results: 本研究提出了一些新的研究方向,旨在强化FL系统的安全性和隐私性,以保护分布式学习环境中的敏感数据confidentiality。
    Abstract Federated Learning (FL) has emerged as a promising approach to address data privacy and confidentiality concerns by allowing multiple participants to construct a shared model without centralizing sensitive data. However, this decentralized paradigm introduces new security challenges, necessitating a comprehensive identification and classification of potential risks to ensure FL's security guarantees. This paper presents a comprehensive taxonomy of security and privacy challenges in Federated Learning (FL) across various machine learning models, including large language models. We specifically categorize attacks performed by the aggregator and participants, focusing on poisoning attacks, backdoor attacks, membership inference attacks, generative adversarial network (GAN) based attacks, and differential privacy attacks. Additionally, we propose new directions for future research, seeking innovative solutions to fortify FL systems against emerging security risks and uphold sensitive data confidentiality in distributed learning environments.
    摘要 联邦学习(FL)作为一种解决数据隐私与保密问题的有前景方法,允许多个参与方在不集中敏感数据的情况下共同构建共享模型。然而,这种去中心化范式带来了新的安全挑战,需要对潜在风险进行全面的识别与分类,以确保联邦学习的安全保证。本文针对包括大语言模型在内的多种机器学习模型,给出了联邦学习中安全与隐私挑战的完整分类。我们特别对聚合方和参与方发起的攻击进行了归类,重点关注投毒攻击、后门攻击、成员推断攻击、基于生成对抗网络(GAN)的攻击以及差分隐私攻击。此外,我们还提出了未来研究的新方向,寻求创新的解决方案,以加固联邦学习系统抵御新兴安全风险,并在分布式学习环境中维护敏感数据的保密性。

Named Entity Resolution in Personal Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2307.12173
  • repo_url: None
  • paper_authors: Mayank Kejriwal
  • for: 本文主要针对个人知识图(PKG)中的命名实体解析(ER)问题。
  • methods: 本文首先提供了高质量和高效的ER问题定义和必需组件。然后,总结了现有技术在PKG中的应用可能性,以及预期在Web级数据中出现的挑战。
  • results: 本文结束后,提供了一些应用和未来研究的可能性。
    Abstract Entity Resolution (ER) is the problem of determining when two entities refer to the same underlying entity. The problem has been studied for over 50 years, and most recently, has taken on new importance in an era of large, heterogeneous 'knowledge graphs' published on the Web and used widely in domains as wide ranging as social media, e-commerce and search. This chapter will discuss the specific problem of named ER in the context of personal knowledge graphs (PKGs). We begin with a formal definition of the problem, and the components necessary for doing high-quality and efficient ER. We also discuss some challenges that are expected to arise for Web-scale data. Next, we provide a brief literature review, with a special focus on how existing techniques can potentially apply to PKGs. We conclude the chapter by covering some applications, as well as promising directions for future research.
    摘要 实体消解(ER)是判断两个实体是否指向同一底层实体的问题。该问题已被研究了 50 多年,而在大规模、异构的"知识图谱"被发布到网络并广泛应用于社交媒体、电子商务和搜索等领域的当下,它又获得了新的重要性。本章讨论个人知识图谱(PKG)中命名实体消解这一具体问题。我们首先给出问题的形式化定义,以及实现高质量、高效率实体消解所需的组成部分,并讨论在 Web 规模数据上预计会出现的挑战。接着,我们提供简要的文献综述,特别关注现有技术如何应用于个人知识图谱。最后,我们介绍一些应用场景以及未来研究中有前景的方向。

Optimized Network Architectures for Large Language Model Training with Billions of Parameters

  • paper_url: http://arxiv.org/abs/2307.12169
  • repo_url: None
  • paper_authors: Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
  • for: 这个论文挑战了训练大语言模型(LLM)的传统建立任意对任意网络的思路。
  • methods: 我们表明了 LLM 的通信模式:只有小组 GPU 之间需要高带宽的任意对任意通信,即可达到接近最优的训练性能;而在这些 GPU 组之间,通信量很小、稀疏且均匀。我们据此提出一种与 LLM 通信需求相匹配的新网络架构:将集群划分为若干 GPU 组(HB 域),组内使用非阻塞的任意对任意高带宽互连;在 HB 域之间,网络只连接存在通信需求的 GPU。我们将这种网络称为 "rail-only" 连接,并证明所提架构可将网络成本降低最多 75%,且不影响 LLM 的训练性能。
  • results: 我们的实验结果表明,我们的提议架构可以减少网络成本至最多 75%,而无需增加训练时间或缓存大小。此外,我们的架构还可以在不同的 GPU 分布和通信占比情况下保持较高的性能稳定性。
    Abstract This paper challenges the well-established paradigm for building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern where only small groups of GPUs require high-bandwidth any-to-any communication within them, to achieve near-optimal training performance. Across these groups of GPUs, the communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely resembles the communication requirement of LLMs. Our architecture partitions the cluster into sets of GPUs interconnected with non-blocking any-to-any high-bandwidth interconnects that we call HB domains. Across the HB domains, the network only connects GPUs with communication demands. We call this network a "rail-only" connection, and show that our proposed architecture reduces the network cost by up to 75% compared to the state-of-the-art any-to-any Clos networks without compromising the performance of LLM training.
    摘要 本文挑战了为训练大语言模型(LLM)构建任意对任意网络这一既定范式。我们发现 LLM 具有独特的通信模式:只有小组 GPU 之间需要高带宽的任意对任意通信,即可达到接近最优的训练性能;而这些 GPU 组之间的通信则稀疏、均匀且可以忽略不计。我们据此提出一种与 LLM 通信需求高度匹配的新网络架构:将集群划分为若干由非阻塞任意对任意高带宽互连连接的 GPU 组(HB 域);在 HB 域之间,网络只连接存在通信需求的 GPU,我们称之为 "rail-only" 连接。实验表明,与最先进的任意对任意 Clos 网络相比,所提架构可将网络成本降低最多 75%,且不影响 LLM 的训练性能。

Hallucination Improves the Performance of Unsupervised Visual Representation Learning

  • paper_url: http://arxiv.org/abs/2307.12168
  • repo_url: None
  • paper_authors: Jing Wu, Jennifer Hobbs, Naira Hovakimyan
  • for: 提高自主学习中的对比学习性能
  • methods: 基于Siamese结构的对比学习模型,加入Hallucinator来生成额外正例样本,提高对比学习的semantic对比和鲁棒性
  • results: 在不同的对比学习模型和数据集上,通过Hallucinator来生成额外正例样本,可以提高对比学习模型的稳定性和对应性,并且在下游任务中也可以看到明显的提升。
    Abstract Contrastive learning models based on Siamese structure have demonstrated remarkable performance in self-supervised learning. Such a success of contrastive learning relies on two conditions, a sufficient number of positive pairs and adequate variations between them. If the conditions are not met, these frameworks will lack semantic contrast and be fragile on overfitting. To address these two issues, we propose Hallucinator that could efficiently generate additional positive samples for further contrast. The Hallucinator is differentiable and creates new data in the feature space. Thus, it is optimized directly with the pre-training task and introduces nearly negligible computation. Moreover, we reduce the mutual information of hallucinated pairs and smooth them through non-linear operations. This process helps avoid over-confident contrastive learning models during the training and achieves more transformation-invariant feature embeddings. Remarkably, we empirically prove that the proposed Hallucinator generalizes well to various contrastive learning models, including MoCoV1&V2, SimCLR and SimSiam. Under the linear classification protocol, a stable accuracy gain is achieved, ranging from 0.3% to 3.0% on CIFAR10&100, Tiny ImageNet, STL-10 and ImageNet. The improvement is also observed in transferring pre-train encoders to the downstream tasks, including object detection and segmentation.
    摘要 基于孪生结构的对比学习模型在自监督学习中展现出了卓越的性能。对比学习的成功依赖两个条件:足够多的正样本对,以及正样本对之间充分的差异。如果这两个条件得不到满足,这些框架将缺乏语义上的对比,并且容易过拟合。为了解决这两个问题,我们提出了 Hallucinator,它能够高效地生成额外的正样本以用于进一步对比。Hallucinator 是可微的,并且直接在特征空间中生成新数据,因此它可以随预训练任务一同优化,且几乎不引入额外计算。此外,我们降低了幻觉正样本对之间的互信息,并通过非线性操作对其进行平滑。这一过程有助于避免训练过程中产生过度自信的对比学习模型,并得到更具变换不变性的特征嵌入。值得注意的是,我们通过实验证明所提出的 Hallucinator 可以很好地推广到多种对比学习模型,包括 MoCoV1&V2、SimCLR 和 SimSiam。在线性分类评测协议下,模型在 CIFAR10&100、Tiny ImageNet、STL-10 和 ImageNet 上取得了 0.3% 到 3.0% 的稳定精度提升;在将预训练编码器迁移到目标检测与分割等下游任务时,也同样观察到了性能提升。
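
To illustrate generating extra positives directly in feature space, the sketch below mixes the two augmented views, smooths the result with a non-linearity, and treats it as an additional positive in a SimCLR-style InfoNCE loss. The Hallucinator in the paper is a learned differentiable module, so this is only a schematic approximation.

```python
import torch
import torch.nn.functional as F

def hallucinate(z1, z2, alpha_low=0.6, alpha_high=0.9):
    """Create an extra positive per anchor in feature space by interpolating the
    two augmented views and smoothing the result with a non-linearity."""
    alpha = torch.empty(z1.size(0), 1).uniform_(alpha_low, alpha_high).to(z1)
    z_h = alpha * z1 + (1.0 - alpha) * z2          # mix the views -> new positive
    return torch.tanh(z_h)                          # non-linear smoothing

def info_nce_with_hallucination(z1, z2, temperature=0.2):
    """SimCLR-style InfoNCE loss where each anchor additionally treats a
    hallucinated embedding as a positive."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z_h = F.normalize(hallucinate(z1, z2), dim=1)
    logits = z1 @ torch.cat([z2, z_h], dim=0).T / temperature   # (N, 2N)
    n = z1.size(0)
    labels = torch.arange(n)
    # positives: the matching view (column i) and the matching hallucination (column n+i)
    loss_view = F.cross_entropy(logits, labels)
    loss_hall = F.cross_entropy(logits, labels + n)
    return 0.5 * (loss_view + loss_hall)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(info_nce_with_hallucination(z1, z2))
```
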

The Imitation Game: Detecting Human and AI-Generated Texts in the Era of Large Language Models

  • paper_url: http://arxiv.org/abs/2307.12166
  • repo_url: None
  • paper_authors: Kadhim Hayawi, Sakib Shahriar, Sujith Samuel Mathew
  • for: 本研究旨在探讨人工智能(AI)基于大型语言模型(LLM)在教育、研究和实践中的潜在潜力,但分辨人类写作和AI生成文本已成为一项重要任务。
  • methods: 本研究采用了多种机器学习模型来分类文本,并 introduce了一个新的人类写作和LLM生成文本的数据集,包括了不同类型的文章、短篇小说、诗歌和Python代码。
  • results: 结果表明这些机器学习模型在分类文本时表现出色,即使数据集的样本数较少,但在分类GPT生成文本时,特别是在故事写作方面,任务变得更加困难。结果还表明这些模型在二分类任务中,如人类写作与特定LLM之间的分类,表现出更高的性能,而在多类任务中,如分辨人类写作和多个LLM之间的分类,任务变得更加复杂。
    Abstract The potential of artificial intelligence (AI)-based large language models (LLMs) holds considerable promise in revolutionizing education, research, and practice. However, distinguishing between human-written and AI-generated text has become a significant task. This paper presents a comparative study, introducing a novel dataset of human-written and LLM-generated texts in different genres: essays, stories, poetry, and Python code. We employ several machine learning models to classify the texts. Results demonstrate the efficacy of these models in discerning between human and AI-generated text, despite the dataset's limited sample size. However, the task becomes more challenging when classifying GPT-generated text, particularly in story writing. The results indicate that the models exhibit superior performance in binary classification tasks, such as distinguishing human-generated text from a specific LLM, compared to the more complex multiclass tasks that involve discerning among human-generated and multiple LLMs. Our findings provide insightful implications for AI text detection while our dataset paves the way for future research in this evolving area.
    摘要 人工智能(AI)基于大型语言模型(LLM)的潜力在教育、研究和实践中具有巨大的推动力。然而,分辨人类写作和AI生成文本已成为一项重要的任务。这篇论文介绍了一项比较研究,推出了一个新的人类写作和LLM生成文本的数据集,包括了不同类型的文章、故事、诗歌和Python代码。我们使用了多种机器学习模型来分类文本。结果表明这些模型在分类人类写作和AI生成文本的任务中具有remarkable的表现,即使数据集的样本数较少。然而,当分类GPT生成文本时,特别是在故事创作中,任务变得更加困难。结果表明这些模型在二分类任务中,如人类写作与特定LLM的分类,表现更加出色,与多类任务,如分类人类写作和多个LLM之间的分类,相比较,任务变得更加复杂。我们的发现对AI文本检测具有深刻的意义,而我们的数据集也为未来这一领域的研究开辟了新的可能性。

DIP-RL: Demonstration-Inferred Preference Learning in Minecraft

  • paper_url: http://arxiv.org/abs/2307.12158
  • repo_url: None
  • paper_authors: Ellen Novoseller, Vinicius G. Goecks, David Watkins, Josh Miller, Nicholas Waytowich
  • for: The paper is written for researchers and practitioners in the field of reinforcement learning and machine learning, particularly those interested in using human demonstrations to guide learning in unstructured and open-ended environments.
  • methods: The paper presents Demonstration-Inferred Preference Reinforcement Learning (DIP-RL), a novel algorithm that leverages human demonstrations in three distinct ways: training an autoencoder, seeding reinforcement learning (RL) training batches with demonstration data, and inferring preferences over behaviors to learn a reward function to guide RL.
  • results: The paper evaluates DIP-RL in a tree-chopping task in Minecraft, and finds that the method can guide an RL agent to learn a reward function that reflects human preferences. Additionally, DIP-RL performs competitively relative to baselines. Example trajectory rollouts of DIP-RL and baselines are available online.
    Abstract In machine learning for sequential decision-making, an algorithmic agent learns to interact with an environment while receiving feedback in the form of a reward signal. However, in many unstructured real-world settings, such a reward signal is unknown and humans cannot reliably craft a reward signal that correctly captures desired behavior. To solve tasks in such unstructured and open-ended environments, we present Demonstration-Inferred Preference Reinforcement Learning (DIP-RL), an algorithm that leverages human demonstrations in three distinct ways, including training an autoencoder, seeding reinforcement learning (RL) training batches with demonstration data, and inferring preferences over behaviors to learn a reward function to guide RL. We evaluate DIP-RL in a tree-chopping task in Minecraft. Results suggest that the method can guide an RL agent to learn a reward function that reflects human preferences and that DIP-RL performs competitively relative to baselines. DIP-RL is inspired by our previous work on combining demonstrations and pairwise preferences in Minecraft, which was awarded a research prize at the 2022 NeurIPS MineRL BASALT competition, Learning from Human Feedback in Minecraft. Example trajectory rollouts of DIP-RL and baselines are located at https://sites.google.com/view/dip-rl.
    摘要 在面向序贯决策的机器学习中,算法智能体通过与环境交互并以奖励信号的形式获得反馈来进行学习。然而,在许多非结构化的现实场景中,这样的奖励信号是未知的,人们也难以可靠地设计出能够正确刻画期望行为的奖励信号。为了解决这类非结构化、开放环境中的任务,我们提出了示范推断偏好强化学习(DIP-RL)算法,它以三种不同的方式利用人类示范:训练自编码器、用示范数据为强化学习(RL)训练批次提供种子,以及从行为间推断偏好以学习用于引导 RL 的奖励函数。我们在 Minecraft 的砍树任务中对 DIP-RL 进行了评估。结果表明,该方法能够引导 RL 智能体学习到反映人类偏好的奖励函数,且 DIP-RL 相对基线方法具有竞争力。DIP-RL 的灵感来自我们此前在 Minecraft 中结合示范与成对偏好的工作,该工作在 2022 年 NeurIPS MineRL BASALT 竞赛 "Learning from Human Feedback in Minecraft" 中获得了研究奖。DIP-RL 与基线方法的示例轨迹见 https://sites.google.com/view/dip-rl。
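
One common way to turn demonstrations into preference supervision, as the abstract suggests, is to treat demonstration segments as preferred over agent rollouts and fit a reward model with a Bradley-Terry objective. The sketch below follows that recipe on random stand-in features; it is not DIP-RL's exact training pipeline.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small reward model r(s, a); the real system operates on Minecraft
    observations, here we just assume a flat feature vector."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(reward_net, seg_preferred, seg_other):
    """Bradley-Terry loss: maximise the likelihood that the preferred segment
    (here, a demonstration) has higher total reward than the other segment.
    Each segment: (T, dim) tensor of state-action features."""
    r_pref = reward_net(seg_preferred).sum()
    r_other = reward_net(seg_other).sum()
    return -torch.log(torch.sigmoid(r_pref - r_other) + 1e-8)

reward_net = RewardNet(dim=16)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
for _ in range(200):
    demo_seg = torch.randn(20, 16) + 1.0      # stand-in for demonstration segments
    agent_seg = torch.randn(20, 16)           # stand-in for agent rollout segments
    loss = preference_loss(reward_net, demo_seg, agent_seg)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```
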

Emergence of Adaptive Circadian Rhythms in Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.12143
  • repo_url: https://github.com/aqeel13932/mn_project
  • paper_authors: Aqeel Labash, Florian Fletzer, Daniel Majoral, Raul Vicente
  • for: 这篇论文的目的是研究深度强化学习智能体在具有可靠周期性变化的环境中学习觅食任务时,是否会涌现出类似昼夜节律的节律。
  • methods: 作者将深度强化学习智能体部署在一个具有可靠周期性变化的环境中来完成觅食任务。在学习过程中,作者系统地刻画了智能体的行为,并证明智能体内部的节律是内生且可被牵引(entrainable)的。
  • results: 研究发现,智能体在学习过程中涌现出一种自适应的节律,它无需重新训练即可适应环境信号相位的偏移。此外,作者通过分岔分析和相位响应曲线分析,表明人工神经元发展出的动力学特性支撑了对环境节律的内化。
    Abstract Adapting to regularities of the environment is critical for biological organisms to anticipate events and plan. A prominent example is the circadian rhythm corresponding to the internalization by organisms of the $24$-hour period of the Earth's rotation. In this work, we study the emergence of circadian-like rhythms in deep reinforcement learning agents. In particular, we deployed agents in an environment with a reliable periodic variation while solving a foraging task. We systematically characterize the agent's behavior during learning and demonstrate the emergence of a rhythm that is endogenous and entrainable. Interestingly, the internal rhythm adapts to shifts in the phase of the environmental signal without any re-training. Furthermore, we show via bifurcation and phase response curve analyses how artificial neurons develop dynamics to support the internalization of the environmental rhythm. From a dynamical systems view, we demonstrate that the adaptation proceeds by the emergence of a stable periodic orbit in the neuron dynamics with a phase response that allows an optimal phase synchronisation between the agent's dynamics and the environmental rhythm.
    摘要 适应环境的规律性对于生物体预测事件并进行规划至关重要。一个突出的例子是昼夜节律,即生物体对地球 24 小时自转周期的内化。在本工作中,我们研究了深度强化学习智能体中类似昼夜节律的节律的涌现。具体而言,我们将智能体部署在一个具有可靠周期性变化的环境中求解觅食任务。我们系统地刻画了智能体在学习过程中的行为,并证明其涌现出一种内生且可被牵引的节律。有趣的是,这种内部节律无需任何重新训练即可适应环境信号相位的偏移。此外,我们通过分岔分析和相位响应曲线分析,展示了人工神经元如何发展出支撑环境节律内化的动力学。从动力系统的角度看,这种适应是通过神经元动力学中出现的稳定周期轨道实现的,其相位响应使智能体的动力学能够与环境节律实现最优的相位同步。

Route Planning Using Nature-Inspired Algorithms

  • paper_url: http://arxiv.org/abs/2307.12133
  • repo_url: None
  • paper_authors: Priyansh Saxena, Raahat Gupta, Akshat Maheshwari
  • for: 本文主要用于介绍 Nature-Inspired Algorithms(NIAs),以及它们在路径规划问题中的应用。
  • methods: 本文使用了多种 Nature-Inspired Algorithms,包括遗传算法、社会间气候算法、蜂群算法等。
  • results: 本文通过对路径规划问题的解决,显示了 NIAs 的优化性和可靠性。
    Abstract There are many different heuristic algorithms for solving combinatorial optimization problems that are commonly described as Nature-Inspired Algorithms (NIAs). Generally, they are inspired by some natural phenomenon, and due to their inherent converging and stochastic nature, they are known to give optimal results when compared to classical approaches. There are a large number of applications of NIAs, perhaps the most popular being route planning problems in robotics - problems that require a sequence of translation and rotation steps from the start to the goal in an optimized manner while avoiding obstacles in the environment. In this chapter, we will first give an overview of Nature-Inspired Algorithms, followed by their classification and common examples. We will then discuss how the NIAs have applied to solve the route planning problem.
    摘要 解决组合优化问题的启发式算法有很多种,它们通常被统称为自然启发算法(NIA)。一般而言,这类算法受到某种自然现象的启发,由于其固有的收敛性和随机性,与经典方法相比往往能给出更优的结果。NIA 的应用非常广泛,其中最为流行的或许是机器人学中的路径规划问题——即在避开环境障碍物的同时,以优化的方式规划从起点到终点的一系列平移和旋转步骤。在本章中,我们首先概述自然启发算法,然后介绍其分类和常见示例,最后讨论如何将 NIA 应用于路径规划问题。
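
As a minimal example of a nature-inspired algorithm applied to route planning, the sketch below runs a bare-bones genetic algorithm over fixed-length move strings on a small grid with obstacles; the grid, operators and hyper-parameters are arbitrary choices for illustration, not anything from the chapter.

```python
import random

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
START, GOAL = (0, 0), (6, 6)
OBSTACLES = {(3, 3), (3, 4), (4, 3)}

def simulate(path):
    """Follow a move string from START, stopping at obstacles or the grid edge."""
    x, y = START
    for m in path:
        dx, dy = MOVES[m]
        nx, ny = x + dx, y + dy
        if (nx, ny) in OBSTACLES or not (0 <= nx <= 7 and 0 <= ny <= 7):
            break
        x, y = nx, ny
    return x, y

def fitness(path):
    x, y = simulate(path)
    return -(abs(x - GOAL[0]) + abs(y - GOAL[1]))     # closer to the goal is better

def evolve(pop_size=60, length=20, generations=80):
    """Tournament selection, one-point crossover, per-gene mutation."""
    pop = ["".join(random.choice("UDLR") for _ in range(length)) for _ in range(pop_size)]
    for _ in range(generations):
        def pick():
            return max(random.sample(pop, 3), key=fitness)
        children = []
        while len(children) < pop_size:
            a, b = pick(), pick()
            cut = random.randrange(1, length)
            child = a[:cut] + b[cut:]
            child = "".join(c if random.random() > 0.05 else random.choice("UDLR") for c in child)
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = evolve()
print(best, simulate(best))
```
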

AI on the Road: A Comprehensive Analysis of Traffic Accidents and Accident Detection System in Smart Cities

  • paper_url: http://arxiv.org/abs/2307.12128
  • repo_url: None
  • paper_authors: Victor Adewopo, Nelly Elsayed, Zag Elsayed, Murat Ozer, Victoria Wangia-Anderson, Ahmed Abdelgawad
  • for: 这篇论文主要是为了提高交通管理和交通事故预防。
  • methods: 该论文提出了一种基于交通监测摄像头和动作识别系统的交通事故探测和应对方案。
  • results: 该方案可以减少交通事故的频率和严重程度,提高交通管理的效率和安全性。
    Abstract Accident detection and traffic analysis is a critical component of smart city and autonomous transportation systems that can reduce accident frequency, severity and improve overall traffic management. This paper presents a comprehensive analysis of traffic accidents in different regions across the United States using data from the National Highway Traffic Safety Administration (NHTSA) Crash Report Sampling System (CRSS). To address the challenges of accident detection and traffic analysis, this paper proposes a framework that uses traffic surveillance cameras and action recognition systems to detect and respond to traffic accidents spontaneously. Integrating the proposed framework with emergency services will harness the power of traffic cameras and machine learning algorithms to create an efficient solution for responding to traffic accidents and reducing human errors. Advanced intelligence technologies, such as the proposed accident detection systems in smart cities, will improve traffic management and traffic accident severity. Overall, this study provides valuable insights into traffic accidents in the US and presents a practical solution to enhance the safety and efficiency of transportation systems.
    摘要 意外探测和交通分析是智能城市和自动化交通系统的关键组成部分,可以降低意外频率、严重程度并改善总体交通管理。这篇论文对美国各地的交通意外进行了全面分析,使用国家公路安全管理局(NHTSA)的交通事故报告采样系统(CRSS)的数据。为了解决意外探测和交通分析的挑战,这篇论文提出了一个框架,该框架使用交通监测摄像头和动作认知系统来自动探测和应对交通意外。将该框架与急救服务集成,可以利用交通摄像头和机器学习算法来创造一个高效的交通意外应急处理解决方案。智能技术,如提议的交通意外探测系统,将改善交通管理和交通意外严重程度。总之,这篇研究对美国交通意外提供了有价值的意见,并提出了一个实用的解决方案,以提高交通系统的安全性和效率。

cs.CL - 2023-07-23

X-CapsNet For Fake News Detection

  • paper_url: http://arxiv.org/abs/2307.12332
  • repo_url: None
  • paper_authors: Mohammad Hadi Goldani, Reza Safabakhsh, Saeedeh Momtazi
  • for: 本研究旨在帮助减少社交媒体和网络论坛上的谣言对用户决策产生的影响,通过自动检测和抵御假新闻。
  • methods: 该研究提出了一种基于 Transformer 的模型,称为 X-CapsNet。该模型包括一个带有动态路由算法的胶囊神经网络(CapsNet),并与一个基于文本长度的分类器并行使用。
  • results: 研究使用了 Covid-19 和 Liar 数据集进行评估,结果表明,模型在 Covid-19 数据集上的 F1 分数和 Liar 数据集上的准确率都高于现有基线。
    Abstract News consumption has significantly increased with the growing popularity and use of web-based forums and social media. This sets the stage for misinforming and confusing people. To help reduce the impact of misinformation on users' potential health-related decisions and other intents, it is desired to have machine learning models to detect and combat fake news automatically. This paper proposes a novel transformer-based model using Capsule neural Networks(CapsNet) called X-CapsNet. This model includes a CapsNet with dynamic routing algorithm paralyzed with a size-based classifier for detecting short and long fake news statements. We use two size-based classifiers, a Deep Convolutional Neural Network (DCNN) for detecting long fake news statements and a Multi-Layer Perceptron (MLP) for detecting short news statements. To resolve the problem of representing short news statements, we use indirect features of news created by concatenating the vector of news speaker profiles and a vector of polarity, sentiment, and counting words of news statements. For evaluating the proposed architecture, we use the Covid-19 and the Liar datasets. The results in terms of the F1-score for the Covid-19 dataset and accuracy for the Liar dataset show that models perform better than the state-of-the-art baselines.
    摘要 随着网络论坛和社交媒体的普及与使用,新闻消费量显著增加,这也为误导和混淆公众创造了条件。为了减轻虚假信息对用户健康相关决策及其他意图的影响,需要能够自动检测并对抗假新闻的机器学习模型。本文提出了一种基于 Transformer 的新模型,称为 X-CapsNet。该模型包括一个带有动态路由算法的胶囊神经网络(CapsNet),并与一个基于文本长度的分类器并行使用:使用深度卷积神经网络(DCNN)检测较长的假新闻陈述,使用多层感知器(MLP)检测较短的新闻陈述。为了解决短新闻陈述的表示问题,我们使用由新闻发言人画像向量与新闻陈述的极性、情感及词数向量拼接而成的间接特征。我们在 Covid-19 和 Liar 数据集上对所提架构进行了评估,在 Covid-19 数据集上的 F1 分数和 Liar 数据集上的准确率均表明,模型的表现优于当前最先进的基线方法。

Milimili. Collecting Parallel Data via Crowdsourcing

  • paper_url: http://arxiv.org/abs/2307.12282
  • repo_url: https://github.com/alantonov/milimili
  • paper_authors: Alexander Antonov
  • for: 这项研究旨在提出一种通过众包收集并构建平行语料库的方法,比聘请专业翻译人员更加经济。
  • methods: 这种方法利用了互联网平台,通过吸引志愿者参与翻译来收集数据,并使用机器学习算法来进行自动评分。
  • results: 研究人员通过实验对Chechen-Russian和Fula-English语种的平行数据进行了收集和分析,并发现这种方法可以提供高质量的平行数据,但需要进一步的优化和纠正。
    Abstract We present a methodology for gathering a parallel corpus through crowdsourcing, which is more cost-effective than hiring professional translators, albeit at the expense of quality. Additionally, we have made available experimental parallel data collected for Chechen-Russian and Fula-English language pairs.
    摘要 我们提出了一种通过众包收集平行语料的方法,它比聘请专业翻译人员更具成本效益,但可能以牺牲部分质量为代价。此外,我们还公开了为车臣语-俄语和富拉语-英语两个语言对收集的实验性平行数据。

Transformer-based Joint Source Channel Coding for Textual Semantic Communication

  • paper_url: http://arxiv.org/abs/2307.12266
  • repo_url: None
  • paper_authors: Shicong Liu, Zhen Gao, Gaojie Chen, Yu Su, Lu Peng
  • for: 本文提出了一种文本语义传输框架,以提高在干扰环境下文本传输的可靠性和效率。
  • methods: 本文使用先进的自然语言处理技术对句子进行建模和编码:先用 wordpiece 算法将文本句子切分为子词,再通过基于 Transformer 的编码器进行语义提取;编码后的数据被量化为固定长度的二进制序列,并在二进制擦除、对称和删除信道上进行传输。
  • results: 仿真结果表明,所提模型在语义传输中具有更高的可靠性和效率,并能在存在干扰的无线环境中有效地对抗干扰。
    Abstract The Space-Air-Ground-Sea integrated network calls for more robust and secure transmission techniques against jamming. In this paper, we propose a textual semantic transmission framework for robust transmission, which utilizes the advanced natural language processing techniques to model and encode sentences. Specifically, the textual sentences are firstly split into tokens using wordpiece algorithm, and are embedded to token vectors for semantic extraction by Transformer-based encoder. The encoded data are quantized to a fixed length binary sequence for transmission, where binary erasure, symmetric, and deletion channels are considered for transmission. The received binary sequences are further decoded by the transformer decoders into tokens used for sentence reconstruction. Our proposed approach leverages the power of neural networks and attention mechanism to provide reliable and efficient communication of textual data in challenging wireless environments, and simulation results on semantic similarity and bilingual evaluation understudy prove the superiority of the proposed model in semantic transmission.
    摘要 天空地海一体化网络需要更加鲁棒与安全的传输技术来对抗干扰。本文提出了一种用于鲁棒传输的文本语义传输框架,利用先进的自然语言处理技术对句子进行建模与编码。具体而言,文本句子首先使用 wordpiece 算法切分为子词,并嵌入为词向量,由基于 Transformer 的编码器进行语义提取;编码后的数据被量化为固定长度的二进制序列进行传输,并考虑了二进制擦除、对称和删除三种信道。接收到的二进制序列再由 Transformer 解码器解码为用于句子重建的子词。所提方法利用神经网络和注意力机制,在恶劣的无线环境中提供可靠而高效的文本数据传输;在语义相似度和 BLEU 指标上的仿真结果证明了所提模型在语义传输方面的优越性。
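
The three channels considered for the quantized binary sequence are straightforward to simulate; the snippet below is a generic illustration (error probabilities are arbitrary), independent of the paper's Transformer encoder and decoder.

```python
import random

def binary_symmetric(bits, p=0.05, rng=random):
    """Flip each bit independently with probability p."""
    return [b ^ 1 if rng.random() < p else b for b in bits]

def binary_erasure(bits, p=0.05, rng=random):
    """Replace each bit with an erasure marker (None) with probability p."""
    return [None if rng.random() < p else b for b in bits]

def deletion(bits, p=0.05, rng=random):
    """Drop each bit with probability p, so the received sequence may shrink."""
    return [b for b in bits if rng.random() >= p]

# Stand-in for the quantised encoder output: a fixed-length binary sequence.
codeword = [random.randint(0, 1) for _ in range(64)]
print(len(deletion(codeword)), binary_symmetric(codeword)[:8], binary_erasure(codeword)[:8])
```
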

A meta learning scheme for fast accent domain expansion in Mandarin speech recognition

  • paper_url: http://arxiv.org/abs/2307.12262
  • repo_url: None
  • paper_authors: Ziwei Zhu, Changhao Shan, Bihong Zhang, Jian Yu
  • for: 这篇论文主要探讨普通话自动语音识别(ASR)中的口音领域扩展问题。
  • methods: 该论文使用元学习技术实现快速的口音领域扩展,并结合了模型参数冻结。
  • results: 在口音领域扩展任务中,该方法相比其他方法取得了约 3% 的相对提升;在普通话测试集保持不变的条件下相对基线模型提升 37%,并在大规模数据上于口音测试集取得 4% 的相对提升。
    Abstract Spoken languages show significant variation across mandarin and accent. Despite the high performance of mandarin automatic speech recognition (ASR), accent ASR is still a challenge task. In this paper, we introduce meta-learning techniques for fast accent domain expansion in mandarin speech recognition, which expands the field of accents without deteriorating the performance of mandarin ASR. Meta-learning or learn-to-learn can learn general relation in multi domains not only for over-fitting a specific domain. So we select meta-learning in the domain expansion task. This more essential learning will cause improved performance on accent domain extension tasks. We combine the methods of meta learning and freeze of model parameters, which makes the recognition performance more stable in different cases and the training faster about 20%. Our approach significantly outperforms other methods about 3% relatively in the accent domain expansion task. Compared to the baseline model, it improves relatively 37% under the condition that the mandarin test set remains unchanged. In addition, it also proved this method to be effective on a large amount of data with a relative performance improvement of 4% on the accent test set.
    摘要 口语在普通话与带口音语音之间存在显著差异。尽管普通话自动语音识别(ASR)已取得很高的性能,带口音语音的识别仍然是一项具有挑战性的任务。本文将元学习技术引入普通话语音识别的快速口音领域扩展,在不损害普通话 ASR 性能的前提下扩展所覆盖的口音范围。元学习(learn-to-learn)能够学习多个领域之间的一般关系,而不仅仅是过拟合某个特定领域,因此我们在领域扩展任务中选择了元学习,这种更本质的学习方式能够提升口音领域扩展任务的性能。我们将元学习方法与模型参数冻结相结合,使识别性能在不同情况下更加稳定,并将训练速度提升约 20%。在口音领域扩展任务中,我们的方法相对其他方法取得了约 3% 的显著提升;在普通话测试集保持不变的条件下,相对基线模型提升 37%;在大规模数据上,该方法同样被证明有效,在口音测试集上取得 4% 的相对性能提升。
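
A Reptile-style sketch of meta-learning across accent domains with part of the model frozen is shown below. The specific meta-algorithm, the frozen layer prefix, the toy model and all hyper-parameters are assumptions for illustration, not the paper's published procedure.

```python
import torch
import torch.nn as nn

def reptile_accent_step(model, accent_batches, loss_fn, inner_lr=1e-2, meta_lr=0.1,
                        inner_steps=3, freeze_prefix="0"):
    """One Reptile-style meta-update over several accent domains, keeping every
    parameter whose name starts with freeze_prefix frozen.
    accent_batches: list of (inputs, targets), one batch per accent domain."""
    meta_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    trainable = [n for n, _ in model.named_parameters() if not n.startswith(freeze_prefix)]
    for inputs, targets in accent_batches:
        model.load_state_dict(meta_state)                      # start from meta-parameters
        params = [p for n, p in model.named_parameters() if n in trainable]
        opt = torch.optim.SGD(params, lr=inner_lr)
        for _ in range(inner_steps):                           # adapt to this accent
            opt.zero_grad()
            loss_fn(model(inputs), targets).backward()
            opt.step()
        adapted = model.state_dict()
        with torch.no_grad():                                  # Reptile outer update
            for name in trainable:
                meta_state[name] += meta_lr * (adapted[name] - meta_state[name]) / len(accent_batches)
    model.load_state_dict(meta_state)

# Toy acoustic model and per-accent batches (random stand-ins for features / labels).
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
batches = [(torch.randn(16, 40), torch.randint(0, 10, (16,))) for _ in range(4)]
reptile_accent_step(model, batches, nn.CrossEntropyLoss())
print("meta-update applied")
```
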

MyVoice: Arabic Speech Resource Collaboration Platform

  • paper_url: http://arxiv.org/abs/2308.02503
  • repo_url: None
  • paper_authors: Yousseif Elshahawy, Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
  • for: 增强阿拉伯语言技术的研究
  • methods: 使用拥有者参与的人群征集平台收集阿拉伯语言录音,并提供公共数据集
  • results: 成功构建了大规模的方言语音录音数据集,提供了在贡献者与标注者角色之间切换的功能,并通过质量筛选与反馈系统保证数据质量。
    Abstract We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. This platform offers an opportunity to design large dialectal speech datasets; and makes them publicly available. MyVoice allows contributors to select city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributors and annotators. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback which is then reviewed by administrators. Furthermore, the platform offers flexibility to admin roles to add new data or tasks beyond dialectal speech and word collection, which are displayed to contributors. Thus, enabling collaborative efforts in gathering diverse and large Arabic speech data.
    摘要 我们介绍 MyVoice,一个旨在收集阿拉伯语语音以促进方言语音技术发展的众包平台。该平台为构建大规模方言语音数据集提供了机会,并将这些数据集公开发布。MyVoice 允许贡献者选择城市/国家级别的细粒度方言,并录制所显示的语句;用户可以在贡献者与标注者角色之间切换。平台内置质量保障系统,在送审之前过滤掉低质量和无效的录音。在审核阶段,贡献者可以评估录音质量、进行标注并提供反馈,这些反馈随后由管理员审阅。此外,平台还允许管理员在方言语音与词汇收集之外添加新的数据或任务并展示给贡献者,从而支持协作收集多样化、大规模的阿拉伯语语音数据。

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

  • paper_url: http://arxiv.org/abs/2307.12231
  • repo_url: None
  • paper_authors: Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe
  • for: 这篇论文的目的是构建一个基于自监督学习表示的多说话人自动语音识别系统。
  • methods: 本文探索了多通道分离方法、基于掩码的波束成形和复数谱映射,以及 ASR 后端模型中最适合使用的特征。
  • results: 研究者使用最新的自监督学习表示(SSLR)作为特征,使识别性能优于使用滤波器组特征的情形;并通过精心设计的训练策略将语音分离与识别同 SSLR 结合,在 WHAMR! 混响测试集上取得 2.5% 的词错误率,显著优于现有的基于掩码的 MVDR 波束成形与滤波器组特征的组合(28.9%)。
    Abstract Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance from the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set, significantly outperforming an existing mask-based MVDR beamforming and filterbank integration (28.9%).
    摘要 神经语音分离已取得显著进展,它与自动语音识别(ASR)的结合是实现多说话人 ASR 的重要方向。本文对混响及含噪混响场景下作为 ASR 前端的语音分离进行了深入研究。具体而言,我们探索了多通道分离方法、基于掩码的波束成形与复数谱映射,以及 ASR 后端模型中最适合使用的特征。我们采用最新的自监督学习表示(SSLR)作为特征,使识别性能优于使用滤波器组特征的情形。为进一步提升多说话人识别性能,我们提出了一种精心设计的训练策略,将语音分离与识别同 SSLR 相结合。所提出的基于 TF-GridNet 复数谱映射与基于 WavLM 的 SSLR 的组合,在 WHAMR! 混响测试集上取得 2.5% 的词错误率,显著优于现有的基于掩码的 MVDR 波束成形与滤波器组特征的组合(28.9%)。

Identifying Misinformation on YouTube through Transcript Contextual Analysis with Transformer Models

  • paper_url: http://arxiv.org/abs/2307.12155
  • repo_url: https://github.com/christoschr97/misinf-detection-llms
  • paper_authors: Christos Christodoulou, Nikos Salamanos, Pantelitsa Leonidou, Michail Papadakis, Michael Sirivianos
  • for: 本研究旨在提出一种新的视频分类方法,以确定视频内容的真实性。
  • methods: 本方法利用视频转cript中的文本内容,将传统的视频分类任务转化为文本分类任务。采用高级机器学习技术,如传输学习和少量学习。
  • results: 在三个dataset上进行评估,包括YouTube疫苗谣言相关视频、YouTube pseudoscience视频和一个新闻假消息集合。 fine-tuned模型的 Matthews Correlation Coefficient>0.81,准确率>0.90和F1 score>0.90。而少量学习模型在YouTube pseudoscience数据集上比 fine-tuned模型高20%的准确率和F1 score。
    Abstract Misinformation on YouTube is a significant concern, necessitating robust detection strategies. In this paper, we introduce a novel methodology for video classification, focusing on the veracity of the content. We convert the conventional video classification task into a text classification task by leveraging the textual content derived from the video transcripts. We employ advanced machine learning techniques like transfer learning to solve the classification challenge. Our approach incorporates two forms of transfer learning: (a) fine-tuning base transformer models such as BERT, RoBERTa, and ELECTRA, and (b) few-shot learning using sentence-transformers MPNet and RoBERTa-large. We apply the trained models to three datasets: (a) YouTube Vaccine-misinformation related videos, (b) YouTube Pseudoscience videos, and (c) Fake-News dataset (a collection of articles). Including the Fake-News dataset extended the evaluation of our approach beyond YouTube videos. Using these datasets, we evaluated the models distinguishing valid information from misinformation. The fine-tuned models yielded Matthews Correlation Coefficient>0.81, accuracy>0.90, and F1 score>0.90 in two of three datasets. Interestingly, the few-shot models outperformed the fine-tuned ones by 20% in both Accuracy and F1 score for the YouTube Pseudoscience dataset, highlighting the potential utility of this approach -- especially in the context of limited training data.
    摘要 YouTube 上的虚假信息是一个重要问题,需要鲁棒的检测策略。本文提出了一种聚焦于内容真实性的新型视频分类方法:通过利用视频转写文本,将传统的视频分类任务转化为文本分类任务。我们采用迁移学习等先进的机器学习技术来解决分类问题,具体包括两种形式:(a)微调 BERT、RoBERTa 和 ELECTRA 等基础 Transformer 模型;(b)使用 sentence-transformers 的 MPNet 和 RoBERTa-large 进行小样本学习。我们将训练好的模型应用于三个数据集:(a)YouTube 疫苗虚假信息相关视频,(b)YouTube 伪科学视频,以及(c)假新闻数据集(一个文章集合);引入假新闻数据集使评估范围超出了 YouTube 视频。在这些数据集上,我们评估了模型区分可靠信息与虚假信息的能力:微调模型在三个数据集中的两个上取得 Matthews 相关系数 >0.81、准确率 >0.90、F1 分数 >0.90。有趣的是,在 YouTube 伪科学数据集上,小样本模型的准确率和 F1 分数比微调模型高出 20%,凸显了该方法在训练数据有限的情况下的潜在价值。
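
The fine-tuning setup is the standard sequence-classification recipe; a minimal sketch with Hugging Face Transformers is shown below, where the model name, the tiny in-line dataset and the hyper-parameters are placeholders rather than the paper's configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Binary "reliable vs. misinformation" classifier over transcript text (toy data).
texts = ["vaccines are rigorously tested before approval",
         "this one weird trick cures every virus overnight"]
labels = torch.tensor([0, 1])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    outputs = model(**batch, labels=labels)     # returns loss and logits
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**tokenizer(texts, padding=True, truncation=True, return_tensors="pt")).logits
print(logits.argmax(dim=-1))
```
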

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

  • paper_url: http://arxiv.org/abs/2307.12134
  • repo_url: None
  • paper_authors: Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer
  • for: 提高端到端口语理解系统的鲁棒性,增强其对 ASR 转写错误的容忍能力。
  • methods: 提出一种新的端到端口语理解系统,依据 ASR 假设的模态置信度估计来融合音频和文本表示,从而增强对 ASR 错误的鲁棒性;并引入两项新技术:1)有效地编码 ASR 假设的质量,2)将其有效地整合进端到端口语理解模型。
  • results: 在 STOP 数据集上实现了准确率的提高,并进行了分析,证明了我们的方法的有效性。
    Abstract End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently. This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition models (ASR), and outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness when text representation quality is low due to ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. We show accuracy improvements on STOP dataset and share the analysis to demonstrate the effectiveness of our approach.
    摘要 近年来,能够直接从语音生成语义解析的端到端(E2E)口语理解(SLU)系统变得愈发有前景。这类方法使用单一模型,利用来自预训练语音识别模型(ASR)的音频和文本表示,并在设备端流式场景中超越了传统的级联式 SLU 系统。然而,当 ASR 转写错误导致文本表示质量下降时,E2E SLU 系统仍然表现不佳。为了解决这一问题,我们提出了一种新的 E2E SLU 系统,根据 ASR 假设的模态置信度估计来融合音频和文本表示,从而增强对 ASR 错误的鲁棒性。我们引入了两项新技术:1)一种有效编码 ASR 假设质量的方法;2)一种将其有效整合进 E2E SLU 模型的方法。我们在 STOP 数据集上展示了准确率的提升,并通过分析证明了所提方法的有效性。
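
A schematic sketch of confidence-aware fusion: a scalar ASR confidence gates how much the fused representation relies on the text branch versus the audio branch. The gate design, dimensions and class count below are assumptions; the paper learns richer encodings of hypothesis quality.

```python
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    """Fuse audio and ASR-text representations using an estimate of how much the
    ASR hypothesis can be trusted (one scalar per utterance in this sketch)."""
    def __init__(self, dim, num_classes=8):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(1, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_repr, text_repr, asr_confidence):
        # asr_confidence: (B, 1) in [0, 1]; low confidence shifts weight to audio
        g = self.gate(asr_confidence)              # (B, dim)
        fused = g * text_repr + (1.0 - g) * audio_repr
        return self.classifier(fused)

model = ConfidenceGatedFusion(dim=256)
audio = torch.randn(4, 256)
text = torch.randn(4, 256)
conf = torch.tensor([[0.95], [0.2], [0.7], [0.5]])
print(model(audio, text, conf).shape)              # torch.Size([4, 8])
```
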

Explainable Topic-Enhanced Argument Mining from Heterogeneous Sources

  • paper_url: http://arxiv.org/abs/2307.12131
  • repo_url: None
  • paper_authors: Jiasheng Si, Yingjie Zhu, Xingyu Shi, Deyu Zhou, Yulan He
  • for: 本文提出了一种新的可解释话题增强的论据挖掘方法,以提高论据挖掘的精度和效果。
  • methods: 本文使用了神经网络话题模型和语言模型,将目标信息补充了可解释话题表示,并通过共同学习来捕捉在论据中的句子水平话题信息。
  • results: 实验结果表明,提出的方法在benchmark数据集上在各种设置下都有显著优势,与现有基线模型相比。
    Abstract Given a controversial target such as ``nuclear energy'', argument mining aims to identify the argumentative text from heterogeneous sources. Current approaches focus on exploring better ways of integrating the target-associated semantic information with the argumentative text. Despite their empirical successes, two issues remain unsolved: (i) a target is represented by a word or a phrase, which is insufficient to cover a diverse set of target-related subtopics; (ii) the sentence-level topic information within an argument, which we believe is crucial for argument mining, is ignored. To tackle the above issues, we propose a novel explainable topic-enhanced argument mining approach. Specifically, with the use of the neural topic model and the language model, the target information is augmented by explainable topic representations. Moreover, the sentence-level topic information within the argument is captured by minimizing the distance between its latent topic distribution and its semantic representation through mutual learning. Experiments have been conducted on the benchmark dataset in both the in-target setting and the cross-target setting. Results demonstrate the superiority of the proposed model against the state-of-the-art baselines.
    摘要 给定一个具有争议性的目标(例如"核能"),论辩挖掘旨在从异构来源中识别论辩性文本。现有方法侧重于探索更好的方式,将与目标相关的语义信息与论辩文本相融合。尽管这些方法在实证上取得了成功,仍有两个问题尚未解决:(i)目标通常仅由一个词或短语表示,不足以覆盖与目标相关的多样化子话题;(ii)论辩内部句子级的话题信息被忽略,而我们认为这对论辩挖掘至关重要。为了解决上述问题,我们提出了一种新颖的、可解释的话题增强论辩挖掘方法。具体而言,借助神经话题模型与语言模型,用可解释的话题表示来增强目标信息;同时,通过互学习最小化句子的潜在话题分布与其语义表示之间的距离,从而捕获论辩内部句子级的话题信息。我们在基准数据集上进行了目标内与跨目标两种设置的实验,结果表明所提模型优于最新的基线方法。

cs.LG - 2023-07-23

Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations?

  • paper_url: http://arxiv.org/abs/2307.12344
  • repo_url: https://github.com/ss-sun/right-for-the-wrong-reason
  • paper_authors: Susu Sun, Lisa M. Koch, Christian F. Baumgartner
  • for: This paper aims to evaluate the ability of various explanation techniques to identify spurious correlations in deep neural network models.
  • methods: The paper proposes a rigorous evaluation strategy to assess the effectiveness of post-hoc explanation techniques and inherently interpretable classifiers in detecting artificially added confounders in a chest x-ray diagnosis task.
  • results: The paper finds that the post-hoc technique SHAP and the inherently interpretable Attri-Net provide the best performance in identifying faulty model behavior and can be used to reliably detect spurious correlations.
    Abstract While deep neural network models offer unmatched classification performance, they are prone to learning spurious correlations in the data. Such dependencies on confounding information can be difficult to detect using performance metrics if the test data comes from the same distribution as the training data. Interpretable ML methods such as post-hoc explanations or inherently interpretable classifiers promise to identify faulty model reasoning. However, there is mixed evidence whether many of these techniques are actually able to do so. In this paper, we propose a rigorous evaluation strategy to assess an explanation technique's ability to correctly identify spurious correlations. Using this strategy, we evaluate five post-hoc explanation techniques and one inherently interpretable method for their ability to detect three types of artificially added confounders in a chest x-ray diagnosis task. We find that the post-hoc technique SHAP, as well as the inherently interpretable Attri-Net provide the best performance and can be used to reliably identify faulty model behavior.
    摘要 深度神经网络模型可以提供出色的分类性能,但它们容易学习数据中的虚假相关。当测试数据与训练数据来自同一分布时,这种对混杂信息的依赖很难通过性能指标来发现。可解释机器学习方法,如事后解释或内在可解释的分类器,有望识别模型的错误推理。然而,现有证据并不一致,许多技术是否真正具备这种能力尚不清楚。在本文中,我们提出了一种严格的评估策略,用于评估解释技术能否正确识别虚假相关。基于该策略,我们在胸部X光诊断任务中评估了五种事后解释技术和一种内在可解释方法检测三种人为添加的混杂因素的能力。我们发现,事后解释技术SHAP以及内在可解释的Attri-Net表现最佳,可用于可靠地识别模型的错误行为。
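The evaluation strategy above hinges on knowing where the artificial confounder is, so that an explanation can be scored by how much attribution it places there. The toy sketch below (not the paper's chest x-ray setup; it uses a simple occlusion-style attribution in place of SHAP, and all names and sizes are illustrative) shows that idea on synthetic tabular data.

```python
# Toy sketch of the evaluation idea: inject a known confounder, train a model,
# then measure how much attribution mass an explanation places on that feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
confounder_idx = d - 1
X[:, confounder_idx] = y + 0.1 * rng.normal(size=n)   # artificially added, label-leaking feature

clf = LogisticRegression(max_iter=1000).fit(X, y)

def occlusion_attribution(model, x, baseline):
    """Per-feature attribution: drop in p(class 1) when the feature is replaced by a baseline value."""
    p_ref = model.predict_proba(x[None])[0, 1]
    attr = np.zeros(len(x))
    for j in range(len(x)):
        x_occ = x.copy()
        x_occ[j] = baseline[j]
        attr[j] = p_ref - model.predict_proba(x_occ[None])[0, 1]
    return attr

baseline = X.mean(axis=0)
scores = np.abs(np.stack([occlusion_attribution(clf, x, baseline) for x in X[:200]]))
# "Confounder localization": share of total attribution falling on the injected confounder.
localization = scores[:, confounder_idx].sum() / scores.sum()
print(f"attribution mass on the injected confounder: {localization:.1%}")
```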

Self-Supervised Learning for Audio-Based Emotion Recognition

  • paper_url: http://arxiv.org/abs/2307.12343
  • repo_url: None
  • paper_authors: Peranut Nimitsurachat, Peter Washington
  • for: 这个研究的目的是发展一个基于音频资料的情绪识别模型,以便在心理健康、市场营销、游戏和社交媒体分析等领域中建立互动系统。
  • methods: 这个研究使用了自监督学习(SSL)方法,通过预测资料本身的特性来学习,不需要大量的监督标签。
  • results: 这个研究发现,使用SSL方法可以在少量标注数据上提高模型的性能,特别是对于较易分类的情绪。此外,这个研究还证明了SSL方法应用于嵌入特征表示(而非原始输入)时同样有效。
    Abstract Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieve consistently high-performance models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods which can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we have applied self-supervised learning pre-training to the classification of emotions from the CMU- MOSEI's acoustic modality. Unlike prior papers that have experimented with raw acoustic data, our technique has been applied to encoded acoustic data. Our model is first pretrained to uncover the randomly-masked timestamps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is then evaluated via several evaluation metrics against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics. This work shows the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small, and that the effect is most pronounced for emotions which are easier to classify such as happy, sad and anger. This work further demonstrates that self-supervised learning works when applied to embedded feature representations rather than the traditional approach of pre-training on the raw input space.
    摘要 基于音频数据的情绪识别模型可以支持在心理健康、市场营销、游戏和社交媒体分析等领域构建交互系统。尽管基于音频的情感计算研究已相当丰富,但要获得稳定的高性能模型,主要障碍在于可用的训练标签稀缺。自监督学习(SSL)是一类通过预测数据自身属性来学习的方法,能够在监督标签不足的情况下进行训练。为了了解自监督学习在基于音频的情绪识别中的作用,我们将自监督预训练应用于CMU-MOSEI声学模态的情绪分类。与以往直接使用原始声学数据的工作不同,我们的方法作用于编码后的声学特征。模型首先通过预测被随机遮蔽的时间帧进行预训练,然后使用少量标注数据进行微调,并用多个评估指标与具有相同骨干架构的基线深度学习模型进行比较。我们发现,自监督学习在所有指标上都能稳定提升模型性能。这项工作表明了自监督学习在情感计算中的价值:当训练样本数量较少时,自监督学习最为有用,并且对于较容易分类的情绪(如快乐、悲伤和愤怒)效果最为明显。该工作还表明,自监督学习可以直接应用于嵌入特征表示,而不必像传统做法那样在原始输入空间上进行预训练。
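As a rough illustration of the pretraining objective described above, the sketch below masks random timestamps of pre-encoded acoustic features and reconstructs them; the 74-dimensional feature size, the GRU encoder, and all hyperparameters are assumptions rather than the authors' configuration.

```python
# Minimal masked-timestamp pretraining sketch on encoded acoustic features (PyTorch).
import torch
import torch.nn as nn

class MaskedFramePretrainer(nn.Module):
    def __init__(self, feat_dim=74, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, feat_dim)      # reconstruct masked frames

    def forward(self, feats, mask):
        x = feats.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked timestamps
        h, _ = self.encoder(x)
        return self.head(h)

model = MaskedFramePretrainer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(8, 50, 74)             # (batch, frames, encoded acoustic feature dim)
mask = torch.rand(8, 50) < 0.15            # mask roughly 15% of timestamps
pred = model(feats, mask)
loss = ((pred - feats) ** 2)[mask].mean()  # reconstruction loss on masked frames only
opt.zero_grad(); loss.backward(); opt.step()
# After pretraining, the encoder would be fine-tuned on a small emotion-labelled sample.
```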

Rapid detection of soil carbonates by means of NIR spectroscopy, deep learning methods and phase quantification by powder Xray diffraction

  • paper_url: http://arxiv.org/abs/2307.12341
  • repo_url: None
  • paper_authors: Lykourgos Chiniadis, Petros Tamvakis
  • for: 这个研究旨在改进农业生产和土壤特性分析,以实现农业生态平衡和环境可持续性。
  • methods: 本研究使用FT NIR反射光谱和深度学习方法来预测土壤碳酸盐含量。
  • results: 研究获得了优异的预测结果,并且在无法使用容量法的情况下,可以快速且有效地预测土壤碳酸盐含量。
    Abstract Soil NIR spectral absorbance/reflectance libraries are utilized towards improving agricultural production and analysis of soil properties which are key prerequisite for agroecological balance and environmental sustainability. Carbonates in particular, represent a soil property which is mostly affected even by mild, let alone extreme, changes of environmental conditions during climate change. In this study we propose a rapid and efficient way to predict carbonates content in soil by means of FT NIR reflectance spectroscopy and by use of deep learning methods. We exploited multiple machine learning methods, such as: 1) a MLP Regressor and 2) a CNN and compare their performance with other traditional ML algorithms such as PLSR, Cubist and SVM on the combined dataset of two NIR spectral libraries: KSSL (USDA), a dataset of soil samples reflectance spectra collected nationwide, and LUCAS TopSoil (European Soil Library) which contains soil sample absorbance spectra from all over the European Union, and use them to predict carbonate content on never before seen soil samples. Soil samples in KSSL and in TopSoil spectral libraries were acquired in the spectral region of visNIR, however in this study, only the NIR spectral region was utilized. Quantification of carbonates by means of Xray Diffraction is in good agreement with the volumetric method and the MLP prediction. Our work contributes to rapid carbonates content prediction in soil samples in cases where: 1) no volumetric method is available and 2) only NIR spectra absorbance data are available. Up till now and to the best of our knowledge, there exists no other study, that presents a prediction model trained on such an extensive dataset with such promising results on unseen data, undoubtedly supporting the notion that deep learning models present excellent prediction tools for soil carbonates content.
    摘要 土壤近红外(NIR)吸收/反射光谱库被用于改进农业生产和土壤属性分析,而这些属性是农业生态平衡与环境可持续性的关键前提。其中,碳酸盐是即便在气候变化带来的轻微环境条件变化下也最容易受影响的土壤属性之一。本研究提出了一种利用FT-NIR反射光谱和深度学习方法快速、高效预测土壤碳酸盐含量的方案。我们使用了多种机器学习方法,包括:1)MLP回归器和2)CNN,并与PLSR、Cubist和SVM等传统机器学习算法进行比较。实验数据来自两个NIR光谱库的合并数据集:KSSL(USDA),一个覆盖全美国的土壤样本反射光谱数据集;以及LUCAS TopSoil(欧洲土壤库),其中包含来自欧盟各地土壤样本的吸收光谱。我们用这些数据预测从未见过的土壤样本的碳酸盐含量。KSSL和TopSoil光谱库中的土壤样本是在vis-NIR波段采集的,但本研究只使用了NIR波段。通过X射线衍射(XRD)进行的碳酸盐定量与容量法及MLP预测结果吻合良好。我们的工作有助于在以下情况下快速预测土壤样本的碳酸盐含量:1)没有容量法可用;2)只有NIR吸收光谱数据可用。据我们所知,目前还没有其他研究在如此庞大的数据集上训练预测模型并在未见数据上取得如此令人鼓舞的结果,这支持了深度学习模型是预测土壤碳酸盐含量的优秀工具这一观点。
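A minimal sketch of the regression setup described above, using scikit-learn's MLPRegressor on synthetic stand-in spectra (the band count, target construction, and hyperparameters are invented for illustration, not taken from the paper):

```python
# Illustrative MLP regression from NIR spectra to carbonate content.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_samples, n_bands = 500, 200                    # hypothetical spectra
X = rng.random((n_samples, n_bands))             # NIR absorbance per wavelength band
y = X[:, 40:60].mean(axis=1) * 30 + rng.normal(0, 0.5, n_samples)  # stand-in target (% carbonates)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
mlp = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)
print("R2 on held-out spectra:", r2_score(y_te, mlp.predict(X_te)))
```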

TabADM: Unsupervised Tabular Anomaly Detection with Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.12336
  • repo_url: None
  • paper_authors: Guy Zamberg, Moshe Salhov, Ofir Lindenbaum, Amir Averbuch
  • for: 本研究旨在提出一种基于扩散的 probabilistic 模型,用于不监督的异常检测。
  • methods: 我们的模型通过使用特殊的拒绝机制,使正常样本的浓度估计免受异常样本的影响。在推断阶段,我们可以通过查找低浓度区域的样本来识别异常样本。
  • results: 我们使用实际数据进行测试,发现我们的方法可以提高异常检测的能力,并且相比基eline,我们的方法更加稳定和不需要较多的超参数调整。
    Abstract Tables are an abundant form of data with use cases across all scientific fields. Real-world datasets often contain anomalous samples that can negatively affect downstream analysis. In this work, we only assume access to contaminated data and present a diffusion-based probabilistic model effective for unsupervised anomaly detection. Our model is trained to learn the density of normal samples by utilizing a unique rejection scheme to attenuate the influence of anomalies on the density estimation. At inference, we identify anomalies as samples in low-density regions. We use real data to demonstrate that our method improves detection capabilities over baselines. Furthermore, our method is relatively stable to the dimension of the data and does not require extensive hyperparameter tuning.

An axiomatized PDE model of deep neural networks

  • paper_url: http://arxiv.org/abs/2307.12333
  • repo_url: None
  • paper_authors: Tangjun Wang, Wenqi Tao, Chenglong Bao, Zuoqiang Shi
  • for: 研究深度神经网络(DNN)与 partial differential equations(PDEs)之间的关系,尤其是 DNN 的普遍形式 PDE 模型。
  • methods: 将 DNN 视为作用于一个简单基础模型之上的演化算子;在若干合理假设下,证明该演化算子实际上由对流-扩散方程所决定。
  • results: 基于对流-扩散方程模型,提出了一种新的 ResNet 训练方法,并通过实验验证了该方法的有效性。
    Abstract Inspired by the relation between deep neural network (DNN) and partial differential equations (PDEs), we study the general form of the PDE models of deep neural networks. To achieve this goal, we formulate DNN as an evolution operator from a simple base model. Based on several reasonable assumptions, we prove that the evolution operator is actually determined by convection-diffusion equation. This convection-diffusion equation model gives mathematical explanation for several effective networks. Moreover, we show that the convection-diffusion model improves the robustness and reduces the Rademacher complexity. Based on the convection-diffusion equation, we design a new training method for ResNets. Experiments validate the performance of the proposed method.
    摘要 受深度神经网络(DNN)与偏微分方程(PDE)之间关系的启发,我们研究了深度神经网络的一般 PDE 模型。为此,我们将 DNN 表述为作用于一个简单基础模型之上的演化算子。在若干合理假设下,我们证明了该演化算子实际上由对流-扩散方程所决定。这一对流-扩散方程模型为若干有效的网络结构提供了数学解释。此外,我们还证明了对流-扩散模型能够提升鲁棒性并降低 Rademacher 复杂度。基于对流-扩散方程,我们为 ResNet 设计了一种新的训练方法,实验结果验证了该方法的性能。
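For reference, the convection-diffusion form referred to above can be written in standard notation as follows (this is the textbook statement of the equation, not necessarily the paper's exact formulation; in the continuum/ResNet view, t plays the role of network depth, u the evolving feature, v a velocity field, and sigma a diffusion coefficient):

```latex
\frac{\partial u(x,t)}{\partial t}
  \;=\; \underbrace{v(x,t)\cdot \nabla u(x,t)}_{\text{convection}}
  \;+\; \underbrace{\sigma\,\Delta u(x,t)}_{\text{diffusion}}
```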

Tackling the Curse of Dimensionality with Physics-Informed Neural Networks

  • paper_url: http://arxiv.org/abs/2307.12306
  • repo_url: None
  • paper_authors: Zheyuan Hu, Khemraj Shukla, George Em Karniadakis, Kenji Kawaguchi
  • for: 解决高维纬度的物理学定义问题 (solving high-dimensional physical definition problems)
  • methods: 使用Stochastic Dimension Gradient Descent (SDGD)方法,即将梯度分解成不同维度的部分,并随机选择每个训练轮中的一部分维度进行训练physics-informed neural networks (PINNs)。 (using Stochastic Dimension Gradient Descent (SDGD) method, which decomposes the gradient into parts corresponding to different dimensions and randomly selects a subset of these dimensional parts for training physics-informed neural networks (PINNs))
  • results: 可以很快地解决很多难以解决的高维度纬度的非线性Partial Differential Equations (PDEs),例如Hamilton-Jacobi-Bellman (HJB)和Schrödinger方程在千个维度中的解决。 (can solve many notoriously hard high-dimensional PDEs, such as the Hamilton-Jacobi-Bellman (HJB) and the Schrödinger equations in thousands of dimensions very fast)
    Abstract The curse-of-dimensionality (CoD) taxes computational resources heavily with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs as Richard Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. In this paper, we develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding to different dimensions and samples randomly a subset of these dimensional pieces in each iteration of training PINNs. We theoretically prove the convergence guarantee and other desired properties of the proposed method. We experimentally demonstrate that the proposed method allows us to solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schr\"{o}dinger equations in thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. For instance, we solve nontrivial nonlinear PDEs (one HJB equation and one Black-Scholes equation) in 100,000 dimensions in 6 hours on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology of PINNs, SDGD can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.
    摘要 “维数灾难”(CoD)会大量消耗计算资源,计算成本随维度增加呈指数增长。正如 Richard Bellman 在60多年前指出的那样,这给求解高维偏微分方程(PDE)带来了巨大挑战。虽然近年来在数值求解高维 PDE 方面取得了一些进展,但这类计算的代价仍然过于昂贵,一般非线性 PDE 向高维的真正扩展从未实现过。在这篇论文中,我们开发了一种新的方法,即随机维度梯度下降(SDGD),用于扩展物理信息神经网络(PINN)以求解任意高维 PDE。SDGD 将 PDE 的梯度分解为对应不同维度的部分,并在每次训练 PINN 的迭代中随机采样其中一部分维度。我们从理论上证明了该方法的收敛性及其他期望性质。实验表明,该方法可以在单个 GPU 上使用 PINN 的无网格方式,非常快速地求解许多公认困难的高维 PDE,包括数千维的 Hamilton-Jacobi-Bellman(HJB)方程和薛定谔方程。例如,我们使用 SDGD 与 PINN,在单个 GPU 上仅用 6 小时便求解了 100,000 维的非平凡非线性 PDE(一个 HJB 方程和一个 Black-Scholes 方程)。由于 SDGD 是 PINN 的一种通用训练方法,它可以应用于当前和未来的任何 PINN 变体,将其扩展到任意高维 PDE。
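The core sampling idea, estimating the high-dimensional Laplacian in the PDE residual from a random subset of dimensions and rescaling it to stay unbiased, can be sketched in PyTorch as follows (a toy Poisson-style residual with invented sizes, not the paper's implementation; boundary-condition terms are omitted for brevity):

```python
import torch

d, batch, k = 1000, 64, 16                       # dimensions, collocation points, sampled dims
net = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.Tanh(), torch.nn.Linear(128, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def source_term(x):                              # hypothetical right-hand side f(x)
    return torch.ones(x.shape[0], device=x.device)

for step in range(100):
    x = torch.rand(batch, d, requires_grad=True)
    u = net(x).squeeze(-1)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]   # (batch, d)

    dims = torch.randperm(d)[:k]                 # sample a random subset of dimensions
    lap_est = 0.0
    for i in dims:                               # second derivative along each sampled dim
        d2u = torch.autograd.grad(grad_u[:, i].sum(), x, create_graph=True)[0][:, i]
        lap_est = lap_est + d2u
    lap_est = lap_est * (d / k)                  # rescale: unbiased estimate of the full Laplacian

    residual = lap_est - source_term(x)          # e.g. Poisson-type equation: Laplacian(u) = f
    loss = (residual ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```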

Physics-Informed Machine Learning of Argon Gas-Driven Melt Pool Dynamics

  • paper_url: http://arxiv.org/abs/2307.12304
  • repo_url: None
  • paper_authors: R. Sharma, W. Grace Guo, M. Raissi, Y. B. Guo
  • for: 这 paper 是关于 metal 添加印制 (AM) 过程中溶融池动态的研究,它们的目的是提高过程的稳定性、微结构形成和印制物的性能。
  • methods: 这 paper 使用了物理学习 (PIML) 方法,通过将神经网络与物理法律相结合来预测溶融池动态,包括温度、速度和压力等参数。PIML 方法可以避免使用数学模拟方法,从而大幅降低计算成本。
  • results: 该 paper 通过数据驱动发现了模型常数,并且通过优化 PINN 模型来提高模型训练效率。PIML 方法可以高效地预测溶融池动态,并且可以提供更好的初始条件和边界条件。
    Abstract Melt pool dynamics in metal additive manufacturing (AM) is critical to process stability, microstructure formation, and final properties of the printed materials. Physics-based simulation including computational fluid dynamics (CFD) is the dominant approach to predict melt pool dynamics. However, the physics-based simulation approaches suffer from the inherent issue of very high computational cost. This paper provides a physics-informed machine learning (PIML) method by integrating neural networks with the governing physical laws to predict the melt pool dynamics such as temperature, velocity, and pressure without using any training data on velocity. This approach avoids solving the highly non-linear Navier-Stokes equation numerically, which significantly reduces the computational cost. The difficult-to-determine model constants of the governing equations of the melt pool can also be inferred through data-driven discovery. In addition, the physics-informed neural network (PINN) architecture has been optimized for efficient model training. The data-efficient PINN model is attributed to the soft penalty by incorporating governing partial differential equations (PDEs), initial conditions, and boundary conditions in the PINN model.
    摘要 金属增材制造(AM)中的熔池动力学对过程稳定性、微观组织形成以及打印材料的最终性能至关重要。包括计算流体力学(CFD)在内的基于物理的模拟是预测熔池动力学的主流方法,但这类方法存在计算成本极高的固有问题。本文提出了一种物理信息机器学习(PIML)方法,通过将神经网络与控制物理规律相结合,在不使用任何速度训练数据的情况下预测熔池的温度、速度和压力等动力学量。该方法避免了对高度非线性的 Navier-Stokes 方程进行数值求解,从而显著降低了计算成本。熔池控制方程中难以确定的模型常数也可以通过数据驱动的方式推断得到。此外,我们还对物理信息神经网络(PINN)的架构进行了优化,以提高模型训练效率。通过在 PINN 模型中以软约束的形式引入控制偏微分方程(PDE)、初始条件和边界条件,该 PINN 模型具有较高的数据效率。

RANSAC-NN: Unsupervised Image Outlier Detection using RANSAC

  • paper_url: http://arxiv.org/abs/2307.12301
  • repo_url: https://github.com/mxtsai/ransac-nn
  • paper_authors: Chen-Han Tsai, Yu-Shao Peng
  • for: 这个论文旨在提出一种专门为图像数据设计的异常检测算法,以确保计算机视觉任务中使用的图像数据质量和准确性。
  • methods: 该算法基于RANSAC的方法进行比较图像,自动预测每个图像的异常分数而无需额外训练或标签信息。
  • results: 在15个多样化的数据集上,与当前最先进算法相比,RANSAC-NN 无需任何超参数调整即可一致地表现出色。此外,文章还提供了每个 RANSAC-NN 组件的详细分析,并展示了其在图像标注错误检测中的潜在应用。
    Abstract Image outlier detection (OD) is crucial for ensuring the quality and accuracy of image datasets used in computer vision tasks. The majority of OD algorithms, however, have not been targeted toward image data. Consequently, the results of applying such algorithms to images are often suboptimal. In this work, we propose RANSAC-NN, a novel unsupervised OD algorithm specifically designed for images. By comparing images in a RANSAC-based approach, our algorithm automatically predicts the outlier score of each image without additional training or label information. We evaluate RANSAC-NN against state-of-the-art OD algorithms on 15 diverse datasets. Without any hyperparameter tuning, RANSAC-NN consistently performs favorably in contrast to other algorithms in almost every dataset category. Furthermore, we provide a detailed analysis to understand each RANSAC-NN component, and we demonstrate its potential applications in image mislabeled detection. Code for RANSAC-NN is provided at https://github.com/mxtsai/ransac-nn
    摘要 图像异常检测(OD)对于保证计算机视觉任务所用图像数据集的质量和准确性至关重要。然而,大多数 OD 算法并非专门针对图像数据设计,因此直接将其应用于图像时效果往往欠佳。在这项工作中,我们提出了 RANSAC-NN,一种专为图像设计的新型无监督 OD 算法。通过以类 RANSAC 的方式对图像进行比较,我们的算法无需额外训练或标签信息即可自动预测每幅图像的异常分数。我们在15个多样化的数据集上将 RANSAC-NN 与最先进的 OD 算法进行了比较。在不进行任何超参数调整的情况下,RANSAC-NN 在几乎所有数据集类别中都稳定地优于其他算法。此外,我们对 RANSAC-NN 的各个组件进行了详细分析,并展示了其在图像标注错误检测中的潜在应用。RANSAC-NN 的代码可在 https://github.com/mxtsai/ransac-nn 获取。
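Since the abstract only states that images are compared "in a RANSAC-based approach", the following is a heavily simplified, hypothetical illustration of RANSAC-style outlier scoring over image feature vectors; the actual RANSAC-NN procedure lives in the linked repository and differs from this sketch.

```python
# Hypothetical RANSAC-style scoring: fit crude consensus models on random subsets
# and score each sample by its average distance to those consensus models.
import numpy as np

def ransac_outlier_scores(features, n_trials=50, subset_size=20, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(features)
    scores = np.zeros(n)
    for _ in range(n_trials):
        subset = rng.choice(n, size=subset_size, replace=False)
        consensus = features[subset].mean(axis=0)             # crude "model" of the inliers
        scores += np.linalg.norm(features - consensus, axis=1)
    return scores / n_trials                                   # higher = more likely an outlier

feats = np.vstack([np.random.randn(95, 64), np.random.randn(5, 64) + 6.0])  # 5 planted outliers
scores = ransac_outlier_scores(feats)
print("top-5 outlier indices:", np.argsort(scores)[-5:])
```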

ResWCAE: Biometric Pattern Image Denoising Using Residual Wavelet-Conditioned Autoencoder

  • paper_url: http://arxiv.org/abs/2307.12255
  • repo_url: None
  • paper_authors: Youzhi Liang, Wen Liang
  • For: This paper proposes a deep learning architecture for fingerprint image denoising in compact IoT devices, aiming to improve the reliability of biometric authentication systems.
  • Methods: The proposed method, called Residual Wavelet-Conditioned Convolutional Autoencoder (Res-WCAE), combines image and wavelet encoders with a Kullback-Leibler divergence regularization. It leverages residual connections and wavelet-transform domain features to preserve fine-grained spatial information.
  • Results: The experimental results show that Res-WCAE outperforms several state-of-the-art denoising methods, particularly for heavily degraded fingerprint images with high levels of noise, demonstrating promise for improving the reliability of biometric authentication in compact IoT devices.
    Abstract The utilization of biometric authentication with pattern images is increasingly popular in compact Internet of Things (IoT) devices. However, the reliability of such systems can be compromised by image quality issues, particularly in the presence of high levels of noise. While state-of-the-art deep learning algorithms designed for generic image denoising have shown promise, their large number of parameters and lack of optimization for unique biometric pattern retrieval make them unsuitable for these devices and scenarios. In response to these challenges, this paper proposes a lightweight and robust deep learning architecture, the Residual Wavelet-Conditioned Convolutional Autoencoder (Res-WCAE) with a Kullback-Leibler divergence (KLD) regularization, designed specifically for fingerprint image denoising. Res-WCAE comprises two encoders - an image encoder and a wavelet encoder - and one decoder. Residual connections between the image encoder and decoder are leveraged to preserve fine-grained spatial features, where the bottleneck layer conditioned on the compressed representation of features obtained from the wavelet encoder using approximation and detail subimages in the wavelet-transform domain. The effectiveness of Res-WCAE is evaluated against several state-of-the-art denoising methods, and the experimental results demonstrate that Res-WCAE outperforms these methods, particularly for heavily degraded fingerprint images in the presence of high levels of noise. Overall, Res-WCAE shows promise as a solution to the challenges faced by biometric authentication systems in compact IoT devices.
    摘要 在小型物联网(IoT)设备中,基于图案图像的生物特征认证日益普及。然而,图像质量问题(尤其是高噪声)会影响此类系统的可靠性。针对通用图像去噪设计的最先进深度学习算法虽然展现出潜力,但其参数量庞大,且未针对特定生物特征图案的恢复进行优化,因此并不适用于这类设备和场景。为应对这些挑战,本文提出了一种轻量且稳健的深度学习架构,即带 Kullback-Leibler 散度(KLD)正则化的残差小波条件卷积自编码器(Res-WCAE),专门用于指纹图像去噪。Res-WCAE 包含两个编码器(图像编码器和小波编码器)以及一个解码器。图像编码器与解码器之间的残差连接用于保留细粒度的空间特征,而瓶颈层则以小波编码器在小波变换域中由近似子图与细节子图得到的压缩特征表示为条件。我们将 Res-WCAE 与多种最先进的去噪方法进行了比较,实验结果表明,Res-WCAE 优于这些方法,尤其是在高噪声水平下严重退化的指纹图像上。总体而言,Res-WCAE 有望解决小型 IoT 设备中生物特征认证系统所面临的挑战。

Explainable Depression Detection via Head Motion Patterns

  • paper_url: http://arxiv.org/abs/2307.12241
  • repo_url: None
  • paper_authors: Monika Gahalawat, Raul Fernandez Rojas, Tanaya Guha, Ramanathan Subramanian, Roland Goecke
  • for: The paper is written to detect depression symptoms using head motion data and machine learning methods.
  • methods: The paper uses two approaches to detect depression: (a) discovering kinemes from head motion data corresponding to both depressed patients and healthy controls, and (b) learning kineme patterns only from healthy controls, and computing statistics derived from reconstruction errors for both the patient and control classes.
  • results: The paper finds that head motion patterns are effective biomarkers for detecting depressive symptoms, and that explanatory kineme patterns consistent with prior findings can be observed for the two classes. The paper achieves peak F1 scores of 0.79 and 0.82, respectively, over BlackDog and AVEC2013 datasets for binary classification over episodic thin-slices, and a peak F1 of 0.72 over videos for AVEC2013.
    Abstract While depression has been studied via multimodal non-verbal behavioural cues, head motion behaviour has not received much attention as a biomarker. This study demonstrates the utility of fundamental head-motion units, termed \emph{kinemes}, for depression detection by adopting two distinct approaches, and employing distinctive features: (a) discovering kinemes from head motion data corresponding to both depressed patients and healthy controls, and (b) learning kineme patterns only from healthy controls, and computing statistics derived from reconstruction errors for both the patient and control classes. Employing machine learning methods, we evaluate depression classification performance on the \emph{BlackDog} and \emph{AVEC2013} datasets. Our findings indicate that: (1) head motion patterns are effective biomarkers for detecting depressive symptoms, and (2) explanatory kineme patterns consistent with prior findings can be observed for the two classes. Overall, we achieve peak F1 scores of 0.79 and 0.82, respectively, over BlackDog and AVEC2013 for binary classification over episodic \emph{thin-slices}, and a peak F1 of 0.72 over videos for AVEC2013.
    摘要 虽然已有研究通过多模态非语言行为线索来研究抑郁症,但头部运动作为一种生物标志尚未受到足够关注。本研究采用两种不同的方法和特征,证明了基本头部运动单元(kinemes)在抑郁检测中的有用性:(a)从抑郁患者与健康对照组的头部运动数据中共同发现 kinemes;(b)仅从健康对照组学习 kineme 模式,并基于患者与对照两类样本的重构误差计算统计特征。我们使用机器学习方法,在 BlackDog 和 AVEC2013 数据集上评估了抑郁分类性能。结果表明:(1)头部运动模式是检测抑郁症状的有效生物标志;(2)两类样本中可以观察到与已有研究结论一致的、具有解释性的 kineme 模式。总体而言,对于 episodic thin-slices 的二分类,我们在 BlackDog 和 AVEC2013 上分别取得了 0.79 和 0.82 的最高 F1 分数;在 AVEC2013 的视频级分类上取得了 0.72 的最高 F1 分数。
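As a simplified stand-in for the kineme pipeline (the paper's actual kineme discovery differs), the sketch below clusters short head-pose windows into discrete units and classifies recordings by their unit histograms; all data, labels, and sizes are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_videos, n_frames, window = 40, 300, 30
poses = rng.normal(size=(n_videos, n_frames, 3))           # pitch/yaw/roll per frame (toy data)
labels = rng.integers(0, 2, n_videos)                       # 0 = control, 1 = depressed (toy labels)

# Cut each recording into fixed-length windows and cluster them into kineme-like units.
windows = poses.reshape(n_videos, n_frames // window, window * 3)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(windows.reshape(-1, window * 3))

def kineme_histogram(video_windows):
    ids = kmeans.predict(video_windows)
    return np.bincount(ids, minlength=8) / len(ids)

features = np.stack([kineme_histogram(v) for v in windows])
clf = LogisticRegression().fit(features, labels)            # depressed vs. healthy control
print("train accuracy:", clf.score(features, labels))
```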

Demonstration of a Response Time Based Remaining Useful Life (RUL) Prediction for Software Systems

  • paper_url: http://arxiv.org/abs/2307.12237
  • repo_url: None
  • paper_authors: Ray Islam, Peter Sandborn
  • for: 这个论文旨在应用PHM概念到软件系统中,以预测问题和计算系统的RUL。
  • methods: 本论文使用了usage参数(例如发布数量和类别)和性能参数(例如响应时间)来预测RUL。
  • results: 研究人员通过对实际数据进行比较,发现PHM概念可以应用于软件系统,并且可以计算出RUL来做系统管理决策。
    Abstract Prognostic and Health Management (PHM) has been widely applied to hardware systems in the electronics and non-electronics domains but has not been explored for software. While software does not decay over time, it can degrade over release cycles. Software health management is confined to diagnostic assessments that identify problems, whereas prognostic assessment potentially indicates when in the future a problem will become detrimental. Relevant research areas such as software defect prediction, software reliability prediction, predictive maintenance of software, software degradation, and software performance prediction, exist, but all of these represent diagnostic models built upon historical data, none of which can predict an RUL for software. This paper addresses the application of PHM concepts to software systems for fault predictions and RUL estimation. Specifically, this paper addresses how PHM can be used to make decisions for software systems such as version update and upgrade, module changes, system reengineering, rejuvenation, maintenance scheduling, budgeting, and total abandonment. This paper presents a method to prognostically and continuously predict the RUL of a software system based on usage parameters (e.g., the numbers and categories of releases) and performance parameters (e.g., response time). The model developed has been validated by comparing actual data, with the results that were generated by predictive models. Statistical validation (regression validation, and k-fold cross validation) has also been carried out. A case study, based on publicly available data for the Bugzilla application is presented. This case study demonstrates that PHM concepts can be applied to software systems and RUL can be calculated to make system management decisions.
    摘要 预测和健康管理(PHM)已广泛应用于硬件系统中,但尚未探讨软件领域。虽然软件不会逝减,但可能会逐渐下降。软件健康管理仅仅是诊断评估,而预测评估可能会预测未来哪一天问题会变得严重。有关研究领域包括软件缺陷预测、软件可靠性预测、软件维护预测、软件衰老和软件性能预测,但这些都是基于历史数据建立的诊断模型,无法预测软件的寿命。本文探讨将PHM概念应用于软件系统中,以预测问题和计算软件系统的寿命。具体来说,本文探讨了如何使用PHM来做软件系统的决策,如版本更新和升级、模块更改、系统重构、重新生成、维护计划、预算和完全废弃。本文提出了一种基于使用量和性能参数预测软件系统的寿命的方法。该模型已经验证了,并通过与预测模型生成的结果进行比较。此外,还进行了统计验证(回归验证和Kfold跨验证)。一个基于公共数据的 Bugzilla 应用程序的案例研究也被提出,这个案例示出了PHM概念可以应用于软件系统,并且可以计算软件系统的寿命以进行系统管理决策。
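To make the response-time-based idea concrete, here is an illustrative sketch only (not the paper's validated model): fit a degradation trend over releases and extrapolate to a usability threshold to obtain an RUL-style estimate. The numbers and threshold are invented.

```python
import numpy as np

releases = np.arange(1, 11)                                              # hypothetical release index
response_ms = 120 + 6.5 * releases + np.random.normal(0, 4, releases.size)  # degrading response time
threshold_ms = 250                                                        # point at which the system is "unusable"

slope, intercept = np.polyfit(releases, response_ms, deg=1)               # linear degradation trend
if slope > 0:
    release_at_threshold = (threshold_ms - intercept) / slope
    rul_releases = max(0.0, release_at_threshold - releases[-1])
    print(f"estimated RUL: about {rul_releases:.1f} more releases before the threshold")
else:
    print("no degradation trend detected; RUL not estimable from this data")
```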

Multi-Modal Machine Learning for Assessing Gaming Skills in Online Streaming: A Case Study with CS:GO

  • paper_url: http://arxiv.org/abs/2307.12236
  • repo_url: None
  • paper_authors: Longxiang Zhang, Wenping Wang
  • for: 本研究旨在为串流服务提供商评估电竞技巧,以便为客户提供个性化推荐和服务促销。
  • methods: 本研究使用最新的端到端模型学习joint representation of multiple modalities,并进行了大量的实验证明其效果。
  • results: 研究发现,提议的模型具有识别用户的弱点,而不是学习有意义的表示。未来工作将解决这个问题。
    Abstract Online streaming is an emerging market that attracts much attention. Assessing gaming skills from videos is an important task for streaming service providers to discover talented gamers. Service providers require the information to offer customized recommendation and service promotion to their customers. Meanwhile, this is also an important multi-modal machine learning task since online streaming combines vision, audio and text modalities. In this study we begin by identifying flaws in the dataset and proceed to clean it manually. Then we propose several variants of the latest end-to-end models to learn joint representations of multiple modalities. Through our extensive experimentation, we demonstrate the efficacy of our proposals. Moreover, we identify that our proposed models are prone to identifying users instead of learning meaningful representations. We propose future work to address this issue.
    摘要 在线流媒体是一个新兴市场,受到广泛关注。通过视频评估游戏技巧,是流媒体服务提供商发掘有天赋玩家的一项重要任务:服务提供商需要这些信息来为客户提供个性化推荐和服务推广。同时,由于在线流媒体结合了视觉、音频和文本模态,这也是一项重要的多模态机器学习任务。在本研究中,我们首先识别并手动清理了数据集中的缺陷,然后提出了多种最新端到端模型的变体,用于学习多模态的联合表示。通过大量实验,我们证明了这些方案的有效性。此外,我们发现所提出的模型倾向于识别用户身份,而非学习有意义的表示;我们在文末提出了解决这一问题的未来工作。

EchoGLAD: Hierarchical Graph Neural Networks for Left Ventricle Landmark Detection on Echocardiograms

  • paper_url: http://arxiv.org/abs/2307.12229
  • repo_url: https://github.com/masoudmo/echoglad
  • paper_authors: Masoud Mokhtari, Mobina Mahdavi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang, Renjie Liao
  • for: 这 paper 的目的是自动检测心脏左心室的四个标志点和测量左心室内部的尺寸和周围肌肉的大约质量。
  • methods: 这 paper 使用了一种基于 echo cardiogram 的层次 graph neural network (GNN),以实现左心室标志点检测。
  • results: 这 paper 在一个公共数据集和一个私有数据集上进行了评估,在内分布 (ID) 和外分布 (OOD) 两种设置下, achieved state-of-the-art 的 Mean Absolute Error (MAE) 值为 1.46 mm 和 1.86 mm,并且在 OOD 设置下表现更好。
    Abstract The functional assessment of the left ventricle chamber of the heart requires detecting four landmark locations and measuring the internal dimension of the left ventricle and the approximate mass of the surrounding muscle. The key challenge of automating this task with machine learning is the sparsity of clinical labels, i.e., only a few landmark pixels in a high-dimensional image are annotated, leading many prior works to heavily rely on isotropic label smoothing. However, such a label smoothing strategy ignores the anatomical information of the image and induces some bias. To address this challenge, we introduce an echocardiogram-based, hierarchical graph neural network (GNN) for left ventricle landmark detection (EchoGLAD). Our main contributions are: 1) a hierarchical graph representation learning framework for multi-resolution landmark detection via GNNs; 2) induced hierarchical supervision at different levels of granularity using a multi-level loss. We evaluate our model on a public and a private dataset under the in-distribution (ID) and out-of-distribution (OOD) settings. For the ID setting, we achieve the state-of-the-art mean absolute errors (MAEs) of 1.46 mm and 1.86 mm on the two datasets. Our model also shows better OOD generalization than prior works with a testing MAE of 4.3 mm.

The identification of garbage dumps in the rural areas of Cyprus through the application of deep learning to satellite imagery

  • paper_url: http://arxiv.org/abs/2308.02502
  • repo_url: None
  • paper_authors: Andrew Keith Wilkinson
  • for: 这个研究旨在使用人工智能技术和卫星图像来识别Cyprus农村地区的非法垃圾弃置。
  • methods: 这个研究使用了人工智能技术和卫星图像来识别垃圾,首先收集了一个小型数据集,然后使用数据扩展技术来增加数据量,然后训练了一个 convolutional neural network(CNN)来识别垃圾。
  • results: 这个研究得到了一个深度学习模型,可以在90%的情况下正确地识别垃圾图像。这个模型可以成为未来Cyprus岛上的垃圾映射系统的基础。
    Abstract Garbage disposal is a challenging problem throughout the developed world. In Cyprus, as elsewhere, illegal ``fly-tipping" is a significant issue, especially in rural areas where few legal garbage disposal options exist. However, there is a lack of studies that attempt to measure the scale of this problem, and few resources available to address it. A method of automating the process of identifying garbage dumps would help counter this and provide information to the relevant authorities. The aim of this study was to investigate the degree to which artificial intelligence techniques, together with satellite imagery, can be used to identify illegal garbage dumps in the rural areas of Cyprus. This involved collecting a novel dataset of images that could be categorised as either containing, or not containing, garbage. The collection of such datasets in sufficient raw quantities is time consuming and costly. Therefore a relatively modest baseline set of images was collected, then data augmentation techniques used to increase the size of this dataset to a point where useful machine learning could occur. From this set of images an artificial neural network was trained to recognise the presence or absence of garbage in new images. A type of neural network especially suited to this task known as ``convolutional neural networks" was used. The efficacy of the resulting model was evaluated using an independently collected dataset of test images. The result was a deep learning model that could correctly identify images containing garbage in approximately 90\% of cases. It is envisaged that this model could form the basis of a future system that could systematically analyse the entire landscape of Cyprus to build a comprehensive ``garbage" map of the island.
    摘要 垃圾处理是发达国家的一个挑战。在塞浦路斯,如其他地方一样,非法“飞tipping”是一个严重的问题,特别是在农村地区,其法定垃圾处理选择较少。然而,有很少的研究尝试量化这个问题,而且有限的资源来解决它。这项研究的目的是使用人工智能技术和卫星图像来识别塞浦路斯农村地区非法垃圾排放。这包括收集一个新的图像集,这些图像可以分为含垃圾和不含垃圾两类。收集这些图像集的过程是时间consuming和成本高的。因此,我们只收集了一个相对较小的基线集的图像,然后使用数据扩展技术来增加这个集的大小,以便进行有用的机器学习。从这些图像中,我们用人工神经网络来识别新图像中是否含垃圾。我们使用的是一种适合这种任务的特殊类型的神经网络,即“卷积神经网络”。我们评估了这种模型的效果,使用独立收集的测试集。结果是一个深度学习模型,可以在90%的情况下正确地识别含垃圾的图像。我们可以基于这个模型,建立一个将系统地分析整个塞浦路斯岛的系统,并建立一个“垃圾”地图。

Geometry-Aware Adaptation for Pretrained Models

  • paper_url: http://arxiv.org/abs/2307.12226
  • repo_url: None
  • paper_authors: Nicholas Roberts, Xintong Li, Dyah Adila, Sonia Cromp, Tzu-Heng Huang, Jitian Zhao, Frederic Sala
  • for: 提高零shot预测性能和适应新类预测
  • methods: 利用度量学空间信息进行适应和预测改进
  • results: 在ImageNet上实现了29.7%的相对改进,并且可以扩展到千万类的预测任务。当无外部度量时,可以使用自动生成的度量从类嵌入中获得10.5%的改进。
    Abstract Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes -- or, in the case of zero-shot prediction, to improve its performance -- without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping argmax with the Fr\'echet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
    摘要 机器学习模型(包括知名的零样本模型)所训练的数据集,其标签往往只占一个更大标签空间中的一小部分,而这类标签空间通常配有一个刻画标签间距离的度量。我们提出了一种简单的方法,利用这一信息使训练好的模型能够可靠地预测新类别,或在零样本预测的情形下提升其性能,而无需任何额外训练。我们的技术可以直接替换标准预测规则,即用弗雷歇均值(Fréchet mean)取代 argmax。我们为该方法提供了全面的理论分析,研究了(i)在标签空间直径、样本复杂度和模型维度之间进行权衡的学习理论结果,(ii)能够预测任何未观察类别的全部情形的刻画,以及(iii)当无法预测全部未观察类别时,用于选取最优训练类别的一种类似主动学习的下一类别选择过程。实验上,借助易于获得的外部度量,我们提出的方法 Loki 在 ImageNet 上相对 SimCLR 取得了最高 29.7% 的相对提升,并可扩展到数十万个类别。当没有这样的度量可用时,Loki 可以使用由类别嵌入自行推导的度量,在 CLIP 等预训练零样本模型上取得 10.5% 的提升。
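The swapped-in prediction rule is easy to state concretely. The sketch below (a toy example, not the authors' code) replaces argmax over class probabilities with the Fréchet mean of the label space under a given label-to-label distance matrix, which can return a label never seen in training.

```python
import numpy as np

def frechet_mean_predict(probs, dist):
    """probs: (n_observed,) model probabilities over the observed classes.
    dist: (n_candidates, n_observed) distances from each candidate label
    (possibly never seen in training) to each observed class."""
    costs = (dist ** 2) @ probs        # expected squared distance for each candidate
    return int(np.argmin(costs))       # Frechet mean = candidate minimizing that cost

# Toy line metric: two observed classes at positions 0 and 3, four candidate labels.
probs = np.array([0.6, 0.4])
candidates = np.array([0.0, 1.0, 2.0, 3.0])
observed = np.array([0.0, 3.0])
dist = np.abs(candidates[:, None] - observed[None, :])
print(frechet_mean_predict(probs, dist))   # 1 -> an unobserved, in-between label is returned
```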

Improving Out-of-Distribution Robustness of Classifiers via Generative Interpolation

  • paper_url: http://arxiv.org/abs/2307.12219
  • repo_url: None
  • paper_authors: Haoyue Bai, Ceyuan Yang, Yinghao Xu, S. -H. Gary Chan, Bolei Zhou
  • for: 提高神经网络模型对于不同分布数据的鲁棒性
  • methods: 使用生成模型作为数据增强源,通过混合多个域的生成模型并在 interpolate 模型参数来生成多元的OoD样本
  • results: 实验结果显示,提出的方法可以明显提高神经网络模型对于不同分布数据的鲁棒性,并且可以控制增强的方向和强度
    Abstract Deep neural networks achieve superior performance for learning from independent and identically distributed (i.i.d.) data. However, their performance deteriorates significantly when handling out-of-distribution (OoD) data, where the training and test are drawn from different distributions. In this paper, we explore utilizing the generative models as a data augmentation source for improving out-of-distribution robustness of neural classifiers. Specifically, we develop a simple yet effective method called Generative Interpolation to fuse generative models trained from multiple domains for synthesizing diverse OoD samples. Training a generative model directly on the source domains tends to suffer from mode collapse and sometimes amplifies the data bias. Instead, we first train a StyleGAN model on one source domain and then fine-tune it on the other domains, resulting in many correlated generators where their model parameters have the same initialization thus are aligned. We then linearly interpolate the model parameters of the generators to spawn new sets of generators. Such interpolated generators are used as an extra data augmentation source to train the classifiers. The interpolation coefficients can flexibly control the augmentation direction and strength. In addition, a style-mixing mechanism is applied to further improve the diversity of the generated OoD samples. Our experiments show that the proposed method explicitly increases the diversity of training domains and achieves consistent improvements over baselines across datasets and multiple different distribution shifts.
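The parameter-interpolation step lends itself to a short sketch. The code below (assumed, not the released implementation) blends two generators that were fine-tuned from the same initialization by linearly interpolating their state dicts; stand-in MLP "generators" replace the StyleGAN models for brevity.

```python
import copy
import torch

def interpolate_generators(gen_a, gen_b, alpha):
    """Return a new generator whose weights are (1 - alpha) * A + alpha * B."""
    gen_new = copy.deepcopy(gen_a)
    sd_a, sd_b = gen_a.state_dict(), gen_b.state_dict()
    sd_new = {k: (1.0 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
    gen_new.load_state_dict(sd_new)
    return gen_new

# Usage with stand-in generators (real ones would be StyleGAN models fine-tuned
# per source domain from the same initialization, so their parameters stay aligned).
gen_a = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64))
gen_b = copy.deepcopy(gen_a)
for p in gen_b.parameters():
    p.data += 0.01 * torch.randn_like(p)          # pretend fine-tuning on domain B

augmenter = interpolate_generators(gen_a, gen_b, alpha=0.3)
fake = augmenter(torch.randn(4, 64))              # samples used as extra OoD training data
```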

Mental Workload Estimation with Electroencephalogram Signals by Combining Multi-Space Deep Models

  • paper_url: http://arxiv.org/abs/2308.02409
  • repo_url: None
  • paper_authors: Hong-Hai Nguyen, Ngumimi Karen Iyortsuun, Hyung-Jeong Yang, Guee-Sang Lee, Soo-Hyung Kim
  • for: 本研究旨在将心理工作负荷分类为三种状态,并估计连续的负荷水平。
  • methods: 该方法结合多个空间维度以获得最佳的负荷估计结果。在时域方法中,我们使用时间卷积网络(Temporal Convolutional Networks);在频域方法中,我们提出了一种新的多维残差块(Multi-Dimensional Residual Block)架构。
  • results: 我们的方法能够准确地将心理工作负荷分类为三种状态,并准确地估计连续的负荷水平。
    Abstract The human brain is in a continuous state of activity during both work and rest. Mental activity is a daily process, and when the brain is overworked, it can have negative effects on human health. In recent years, great attention has been paid to early detection of mental health problems because it can help prevent serious health problems and improve quality of life. Several signals are used to assess mental state, but the electroencephalogram (EEG) is widely used by researchers because of the large amount of information it provides about the brain. This paper aims to classify mental workload into three states and estimate continuum levels. Our method combines multiple dimensions of space to achieve the best results for mental estimation. In the time domain approach, we use Temporal Convolutional Networks, and in the frequency domain, we propose a new architecture called the Multi-Dimensional Residual Block, which combines residual blocks.
    摘要 人类大脑在工作和休息时都处于持续活动的状态。心理活动是每天都在发生的过程,当大脑过度劳累时,可能会对人体健康产生负面影响。近年来,心理健康问题的早期发现受到了广泛关注,因为这有助于预防严重的健康问题并提高生活质量。可用于评估心理状态的信号有多种,其中脑电图(EEG)因能够提供大量关于大脑的信息而被研究者广泛使用。本文的目标是将心理工作负荷分类为三种状态,并估计连续的负荷水平。我们的方法结合多个空间维度,以获得最佳的负荷估计结果。在时域方法中,我们使用时间卷积网络(Temporal Convolutional Networks);在频域方法中,我们提出了一种称为多维残差块(Multi-Dimensional Residual Block)的新架构,该架构对残差块进行了组合。

Adversarial Agents For Attacking Inaudible Voice Activated Devices

  • paper_url: http://arxiv.org/abs/2307.12204
  • repo_url: None
  • paper_authors: Forrest McKee, David Noever
  • For: This paper explores the threat posed by inaudible attacks on voice-activated Internet of Things devices.
  • Methods: The paper applies reinforcement learning to model and exploit these inaudible attacks.
  • Results: The study finds that deep reinforcement learning can quickly gain control of all nodes, and achieves this in fewer steps.
    Abstract The paper applies reinforcement learning to novel Internet of Thing configurations. Our analysis of inaudible attacks on voice-activated devices confirms the alarming risk factor of 7.6 out of 10, underlining significant security vulnerabilities scored independently by NIST National Vulnerability Database (NVD). Our baseline network model showcases a scenario in which an attacker uses inaudible voice commands to gain unauthorized access to confidential information on a secured laptop. We simulated many attack scenarios on this baseline network model, revealing the potential for mass exploitation of interconnected devices to discover and own privileged information through physical access without adding new hardware or amplifying device skills. Using Microsoft's CyberBattleSim framework, we evaluated six reinforcement learning algorithms and found that Deep-Q learning with exploitation proved optimal, leading to rapid ownership of all nodes in fewer steps. Our findings underscore the critical need for understanding non-conventional networks and new cybersecurity measures in an ever-expanding digital landscape, particularly those characterized by mobile devices, voice activation, and non-linear microphones susceptible to malicious actors operating stealth attacks in the near-ultrasound or inaudible ranges. By 2024, this new attack surface might encompass more digital voice assistants than people on the planet yet offer fewer remedies than conventional patching or firmware fixes since the inaudible attacks arise inherently from the microphone design and digital signal processing.
    摘要 本文将强化学习应用于新型物联网设备配置。我们对语音激活设备所受的不可听(无声)攻击进行了分析,确认其风险因子高达 7.6/10,凸显了 NIST 国家漏洞数据库(NVD)独立评分所揭示的严重安全漏洞。我们的基线网络模型展示了攻击者利用不可听的语音命令在受保护的笔记本电脑上非法获取机密信息的场景。我们在该基线网络模型上模拟了大量攻击场景,结果表明,攻击者无需添加新硬件或增强设备能力,仅通过物理接近即可大规模利用互联设备来发现并获取特权信息。借助 Microsoft 的 CyberBattleSim 框架,我们评估了六种强化学习算法,发现带有利用策略的深度 Q 学习效果最佳,能够以更少的步骤快速控制所有节点。我们的发现强调,在不断扩张的数字环境中,尤其是在包含移动设备、语音激活以及易被近超声或不可听频段隐蔽攻击利用的非线性麦克风的环境中,理解非常规网络并建立新的网络安全措施至关重要。到 2024 年,这一新的攻击面所覆盖的数字语音助手数量可能超过全球人口,而可用的补救手段却少于常规的补丁或固件修复,因为不可听攻击本质上源于麦克风设计与数字信号处理。

NCART: Neural Classification and Regression Tree for Tabular Data

  • paper_url: http://arxiv.org/abs/2307.12198
  • repo_url: None
  • paper_authors: Jiaqi Luo, Shixin Xu
  • for: 这 paper 旨在提出一种可解释性强的深度学习模型,以解决深度学习模型在大规模或高维数据集中的计算成本高和可解释性差的问题。
  • methods: 该 paper 提出了一种名为 Neural Classification and Regression Tree (NCART) 的新型可解释性神经网络,它将多层感知网络替换为多个可导的无知决策树。通过将决策树 integrate 到网络架构中,NCART 保持了可解释性,同时具有神经网络的综合能力。
  • results: 数值实验表明,NCART 比现有的深度学习模型具有更高的性能,并且在不同的数据集中表现出色,建立了 NCART 作为树状模型的强大竞争对手。
    Abstract Deep learning models have become popular in the analysis of tabular data, as they address the limitations of decision trees and enable valuable applications like semi-supervised learning, online learning, and transfer learning. However, these deep-learning approaches often encounter a trade-off. On one hand, they can be computationally expensive when dealing with large-scale or high-dimensional datasets. On the other hand, they may lack interpretability and may not be suitable for small-scale datasets. In this study, we propose a novel interpretable neural network called Neural Classification and Regression Tree (NCART) to overcome these challenges. NCART is a modified version of Residual Networks that replaces fully-connected layers with multiple differentiable oblivious decision trees. By integrating decision trees into the architecture, NCART maintains its interpretability while benefiting from the end-to-end capabilities of neural networks. The simplicity of the NCART architecture makes it well-suited for datasets of varying sizes and reduces computational costs compared to state-of-the-art deep learning models. Extensive numerical experiments demonstrate the superior performance of NCART compared to existing deep learning models, establishing it as a strong competitor to tree-based models.
    摘要 深度学习模型在表格数据分析中越来越受欢迎,因为它们克服了决策树的局限,并支持半监督学习、在线学习和迁移学习等有价值的应用。然而,这些深度学习方法往往面临一种权衡:一方面,在处理大规模或高维数据集时计算代价可能很高;另一方面,它们可能缺乏可解释性,且不适用于小规模数据集。在本研究中,我们提出了一种新的可解释神经网络,即神经分类与回归树(NCART),以应对这些挑战。NCART 是残差网络(Residual Networks)的一种改进版本,它将全连接层替换为多个可微分的遗忘(oblivious)决策树。通过将决策树融入网络架构,NCART 在保持可解释性的同时,受益于神经网络的端到端能力。NCART 架构简单,适用于不同规模的数据集,并且与最先进的深度学习模型相比能降低计算成本。大量数值实验表明,NCART 的性能优于现有的深度学习模型,使其成为树模型的有力竞争者。
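As one way to picture a "differentiable oblivious decision tree", the toy layer below uses a shared soft split per tree level and mixes leaf values by the resulting soft routing probabilities; the details (soft feature selection, sigmoid splits) are assumptions for illustration, not NCART's exact design.

```python
import torch
import torch.nn as nn

class SoftObliviousTree(nn.Module):
    def __init__(self, in_dim, depth=3, out_dim=1):
        super().__init__()
        self.depth = depth
        self.feature_logits = nn.Parameter(torch.randn(depth, in_dim))  # soft feature choice per level
        self.thresholds = nn.Parameter(torch.zeros(depth))
        self.leaves = nn.Parameter(torch.randn(2 ** depth, out_dim))

    def forward(self, x):
        # One shared (oblivious) split per level: sigmoid(<soft feature> - threshold).
        feat = torch.softmax(self.feature_logits, dim=-1)        # (depth, in_dim)
        split = torch.sigmoid(x @ feat.t() - self.thresholds)    # (batch, depth), P(go right)
        probs = torch.ones(x.shape[0], 1, device=x.device)
        for level in range(self.depth):
            right = split[:, level:level + 1]
            probs = torch.cat([probs * (1 - right), probs * right], dim=1)  # (batch, 2^(level+1))
        return probs @ self.leaves                                # soft mixture of leaf values

tree = SoftObliviousTree(in_dim=10, depth=3)
out = tree(torch.randn(32, 10))                                   # (32, 1), fully differentiable
```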

Monadic Deep Learning

  • paper_url: http://arxiv.org/abs/2307.12187
  • repo_url: https://github.com/ThoughtWorksInc/monadic-deep-learning
  • paper_authors: Bo Yang, Zhihao Zhang Kirisame Marisa, Kai Shi
  • for: 这个论文的目的是解决 dynamically typed 编程语言中的神经网络模型问题,使得用户可以使用 statically typed 语言来创建神经网络模型。
  • methods: 这篇论文提出了一种新方法,可对包含多个可训练变量的静态类型函数自动进行反向模式微分;并设计了一组单子(monad)和单子变换器(monad transformer),使用户能够以直观且简洁的方式创建神经网络模型。
  • results: 该论文的实验结果表明,使用 DeepLearning.scala 可以帮助用户创建复杂的神经网络模型,并且仍然保持类型安全性。
    Abstract The Java and Scala community has built a very successful big data ecosystem. However, most of neural networks running on it are modeled in dynamically typed programming languages. These dynamically typed deep learning frameworks treat neural networks as differentiable expressions that contain many trainable variable, and perform automatic differentiation on those expressions when training them. Until 2019, none of the learning frameworks in statically typed languages provided the expressive power of traditional frameworks. Their users are not able to use custom algorithms unless creating plenty of boilerplate code for hard-coded back-propagation. We solved this problem in DeepLearning.scala 2. Our contributions are: 1. We discovered a novel approach to perform automatic differentiation in reverse mode for statically typed functions that contain multiple trainable variable, and can interoperate freely with the metalanguage. 2. We designed a set of monads and monad transformers, which allow users to create monadic expressions that represent dynamic neural networks. 3. Along with these monads, we provide some applicative functors, to perform multiple calculations in parallel. With these features, users of DeepLearning.scala were able to create complex neural networks in an intuitive and concise way, and still maintain type safety.
    摘要 Java 和 Scala 社区构建了非常成功的大数据生态系统,然而其上运行的神经网络大多是用动态类型语言建模的。直到2019年,静态类型语言中的学习框架都无法提供传统框架的表达能力,用户若要使用自定义算法,只能编写大量样板代码来硬编码反向传播。我们在 DeepLearning.scala 2 中解决了这一问题。我们的贡献包括:1. 我们发现了一种新方法,可以对包含多个可训练变量的静态类型函数进行反向模式自动微分,并能与元语言自由互操作;2. 我们设计了一组单子(monad)和单子变换器(monad transformer),使用户可以创建表示动态神经网络的单子表达式;3. 此外,我们还提供了一些应用函子(applicative functor),用于并行执行多个计算。借助这些特性,DeepLearning.scala 的用户能够以直观且简洁的方式创建复杂的神经网络,同时保持类型安全。

Machine learning discovers invariants of braids and flat braids

  • paper_url: http://arxiv.org/abs/2307.12185
  • repo_url: None
  • paper_authors: Alexei Lisitsa, Mateo Salles, Alexei Vernitski
  • for: Using machine learning to classify examples of braids (or flat braids) as trivial or non-trivial.
  • methods: Using supervised learning with neural networks (multilayer perceptrons).
  • results: Discovering new convenient invariants of braids, including a complete invariant of flat braids.
    Abstract We use machine learning to classify examples of braids (or flat braids) as trivial or non-trivial. Our ML takes form of supervised learning using neural networks (multilayer perceptrons). When they achieve good results in classification, we are able to interpret their structure as mathematical conjectures and then prove these conjectures as theorems. As a result, we find new convenient invariants of braids, including a complete invariant of flat braids.
    摘要 我们使用机器学习将辫子(braid,或平坦辫子)的例子分类为平凡或非平凡。我们的机器学习采用基于神经网络(多层感知器)的监督学习。当分类取得良好效果时,我们能够将网络结构解读为数学猜想,并进而将这些猜想证明为定理。由此,我们发现了若干便于使用的新辫子不变量,其中包括平坦辫子的一个完全不变量。

Prototype-Driven and Multi-Expert Integrated Multi-Modal MR Brain Tumor Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.12180
  • repo_url: https://github.com/linzy0227/pdminet
  • paper_authors: Yafei Zhang, Zhiyuan Li, Huafeng Li, Dapeng Tao
  • for: 这种多模态 MR 脑肿瘤图像分割方法旨在解决现有方法直接从输入图像中提取判别特征来确定和定位肿瘤子区域类别,却忽略了子区域相互包含造成的信息混叠影响的问题。
  • methods: 该方法提议使用肿瘤原型驱动的多专家结合,使得每个肿瘤子区域特征得到高亮显示。具体来说,我们提出了一种互传机制,将不同modal的特征传递给每个modal,以解决单modal特征不充分的问题。此外,我们还提出了一种使用学习的肿瘤原型来驱动特征表示和融合方法,使得肿瘤特征得到了融合。
  • results: 实验结果表明,该方法在三个竞赛肿瘤分割数据集上具有优秀的性能。
    Abstract For multi-modal magnetic resonance (MR) brain tumor image segmentation, current methods usually directly extract the discriminative features from input images for tumor sub-region category determination and localization. However, the impact of information aliasing caused by the mutual inclusion of tumor sub-regions is often ignored. Moreover, existing methods usually do not take tailored efforts to highlight the single tumor sub-region features. To this end, a multi-modal MR brain tumor segmentation method with tumor prototype-driven and multi-expert integration is proposed. It could highlight the features of each tumor sub-region under the guidance of tumor prototypes. Specifically, to obtain the prototypes with complete information, we propose a mutual transmission mechanism to transfer different modal features to each other to address the issues raised by insufficient information on single-modal features. Furthermore, we devise a prototype-driven feature representation and fusion method with the learned prototypes, which implants the prototypes into tumor features and generates corresponding activation maps. With the activation maps, the sub-region features consistent with the prototype category can be highlighted. A key information enhancement and fusion strategy with multi-expert integration is designed to further improve the segmentation performance. The strategy can integrate the features from different layers of the extra feature extraction network and the features highlighted by the prototypes. Experimental results on three competition brain tumor segmentation datasets prove the superiority of the proposed method.
    摘要 对于多模态 MR 脑肿瘤图像分割,当前方法通常直接从输入图像中提取判别特征来确定肿瘤子区域的类别并进行定位,然而往往忽略了肿瘤子区域相互包含所造成的信息混叠的影响,也很少专门突出单个肿瘤子区域的特征。为此,我们提出了一种肿瘤原型驱动、多专家集成的多模态 MR 脑肿瘤分割方法,它能够在肿瘤原型的指导下突出各个肿瘤子区域的特征。具体来说,为了获得信息完整的原型,我们提出了一种相互传递机制,在不同模态之间传递特征,以解决单一模态特征信息不足的问题。此外,我们设计了一种由学习得到的原型驱动的特征表示与融合方法,将原型植入肿瘤特征并生成相应的激活图;借助激活图,可以突出与原型类别一致的子区域特征。我们还设计了一种结合多专家集成的关键信息增强与融合策略,将额外特征提取网络不同层次的特征与原型突出的特征进行融合,以进一步提升分割性能。在三个竞赛脑肿瘤分割数据集上的实验结果证明了所提方法的优越性。

Learn to Compress (LtC): Efficient Learning-based Streaming Video Analytics

  • paper_url: http://arxiv.org/abs/2307.12171
  • repo_url: None
  • paper_authors: Quazi Mishkatul Alam, Israat Haque, Nael Abu-Ghazaleh
  • for: 这篇论文的目的是建立一个高效的云端流式视频分析框架，以减少视频流传输所需的带宽和能耗。
  • methods: 该框架在视频源端部署一个由服务器端教师模型训练的轻量级学生神经网络，用于理解视频中各区域的语义重要性，从而有差别地保留关键区域、对其余区域进行激进压缩；此外还使用一种基于特征差分的时间滤波算法，跳过不带来新信息的帧。
  • results: 与最近发表的先进流式分析框架相比，该框架可减少 28-35% 的带宽，并将响应延迟最多缩短 45%，同时保持相近的分析性能。
    Abstract Video analytics are often performed as cloud services in edge settings, mainly to offload computation, and also in situations where the results are not directly consumed at the video sensors. Sending high-quality video data from the edge devices can be expensive both in terms of bandwidth and power use. In order to build a streaming video analytics pipeline that makes efficient use of these resources, it is therefore imperative to reduce the size of the video stream. Traditional video compression algorithms are unaware of the semantics of the video, and can be both inefficient and harmful for the analytics performance. In this paper, we introduce LtC, a collaborative framework between the video source and the analytics server, that efficiently learns to reduce the video streams within an analytics pipeline. Specifically, LtC uses the full-fledged analytics algorithm at the server as a teacher to train a lightweight student neural network, which is then deployed at the video source. The student network is trained to comprehend the semantic significance of various regions within the videos, which is used to differentially preserve the crucial regions in high quality while the remaining regions undergo aggressive compression. Furthermore, LtC also incorporates a novel temporal filtering algorithm based on feature-differencing to omit transmitting frames that do not contribute new information. Overall, LtC is able to use 28-35% less bandwidth and has up to 45% shorter response delay compared to recently published state of the art streaming frameworks while achieving similar analytics performance.
    摘要 视频分析通常作为云服务在边缘场景中进行，主要是为了卸载计算，也适用于分析结果不直接在视频传感器端使用的情况。将高质量视频数据从边缘设备发送出去，在带宽和功耗方面都可能代价高昂。为了构建高效利用这些资源的流式视频分析管道，必须减小视频流的体积。传统的视频压缩算法不了解视频的语义，既可能效率低下，也可能损害分析性能。在这篇论文中，我们介绍了 LtC，一个视频源与分析服务器之间的协同框架，它能在分析管道内高效地学习压缩视频流。具体来说，LtC 使用服务器端完整的分析算法作为教师，训练一个部署在视频源端的轻量级学生神经网络。学生网络被训练以理解视频中各区域的语义重要性，并据此有差别地以高质量保留关键区域，而其余区域则进行激进压缩。此外，LtC 还包含一种基于特征差分的时间滤波算法，用于跳过不带来新信息的帧。总体而言，与最近发表的先进流式框架相比，LtC 可减少 28-35% 的带宽，并将响应延迟最多缩短 45%，同时达到相近的分析性能。
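
The feature-differencing temporal filter mentioned in the abstract can be sketched as follows; the tiny source-side feature extractor and the relative-change threshold are hypothetical stand-ins for LtC's actual components.

```python
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(4), nn.Flatten())   # tiny source-side extractor

def filter_frames(frames, threshold=0.1):
    """frames: (T, 3, H, W); returns indices of frames worth transmitting."""
    keep, last = [0], feat(frames[0:1])
    for t in range(1, frames.shape[0]):
        cur = feat(frames[t:t+1])
        # transmit only if the feature difference suggests new information
        if torch.norm(cur - last) / torch.norm(last) > threshold:
            keep.append(t)
            last = cur
    return keep

video = torch.randn(16, 3, 64, 64)
print(filter_frames(video))
```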

Optimized Network Architectures for Large Language Model Training with Billions of Parameters

  • paper_url: http://arxiv.org/abs/2307.12169
  • repo_url: None
  • paper_authors: Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
  • for: 这篇论文挑战了在训练大型自然语言模型(LLM)时建立任意对任意网络的惯例。
  • methods: 我们发现 LLM 训练具有独特的通信模式：只有小组内的 GPU 之间需要高带宽的任意对任意通信即可达到接近最优的训练性能，而这些 GPU 小组之间的通信量很小、稀疏且均匀。我们据此提出一种与 LLM 通信需求高度匹配的新网络架构：将集群划分为若干 HB domain，每个 HB domain 内部由非阻塞的任意对任意高带宽互连连接；在 HB domain 之间，网络仅连接有通信需求的 GPU。我们将这种连接称为 "rail-only" 连接，并证明所提架构可在不影响 LLM 训练性能的前提下，将网络成本降低最多 75%。
  • results: 我们的实验结果表明,我们的提议架构可以减少网络成本,同时保持 LLM 训练的性能。
    Abstract This paper challenges the well-established paradigm for building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern where only small groups of GPUs require high-bandwidth any-to-any communication within them, to achieve near-optimal training performance. Across these groups of GPUs, the communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely resembles the communication requirement of LLMs. Our architecture partitions the cluster into sets of GPUs interconnected with non-blocking any-to-any high-bandwidth interconnects that we call HB domains. Across the HB domains, the network only connects GPUs with communication demands. We call this network a "rail-only" connection, and show that our proposed architecture reduces the network cost by up to 75% compared to the state-of-the-art any-to-any Clos networks without compromising the performance of LLM training.
    摘要 本文挑战了为训练大型语言模型（LLM）构建任意对任意（any-to-any）网络的既有范式。我们发现 LLM 表现出一种独特的通信模式：只有小组内的 GPU 之间需要高带宽的任意对任意通信即可达到接近最优的训练性能；而在这些 GPU 小组之间，通信量很小、稀疏且均匀。我们提出一种与 LLM 通信需求高度匹配的新网络架构：将集群划分为若干由非阻塞任意对任意高带宽互连连接的 GPU 集合（称为 HB domain）；在 HB domain 之间，网络仅连接有通信需求的 GPU。我们将这种连接称为 "rail-only" 连接，并证明所提架构可在不影响 LLM 训练性能的前提下，将网络成本降低最多 75%。
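
A back-of-the-envelope sketch of why a rail-only design needs far less connectivity than a full any-to-any fabric: it enumerates which GPU pairs still require a network path when any-to-any links exist only inside each HB domain and only same-rank ("rail") GPUs are connected across domains. The domain count and size are arbitrary; this is a counting illustration, not the paper's cost model.

```python
from itertools import combinations

def anytoany_pairs(n_domains, domain_size):
    gpus = [(d, r) for d in range(n_domains) for r in range(domain_size)]
    return {frozenset(p) for p in combinations(gpus, 2)}

def rail_only_pairs(n_domains, domain_size):
    pairs = set()
    for d1, d2 in combinations(range(n_domains), 2):
        for r in range(domain_size):               # only same-rank ("rail") GPUs talk across domains
            pairs.add(frozenset({(d1, r), (d2, r)}))
    for d in range(n_domains):                     # any-to-any inside each HB domain
        for r1, r2 in combinations(range(domain_size), 2):
            pairs.add(frozenset({(d, r1), (d, r2)}))
    return pairs

full, rail = anytoany_pairs(4, 8), rail_only_pairs(4, 8)
print(len(rail) / len(full))   # fraction of pairwise connectivity still required
```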

Facial Point Graphs for Amyotrophic Lateral Sclerosis Identification

  • paper_url: http://arxiv.org/abs/2307.12159
  • repo_url: None
  • paper_authors: Nícolas Barbosa Gomes, Arissa Yoshida, Mateus Roder, Guilherme Camargo de Oliveira, João Paulo Papa
  • for: 这篇论文的目的是找到早期诊断amyotrophic lateral sclerosis (ALS)的方法,以提高病人的预后和生活质量。
  • methods: 这篇论文使用computational方法来分析病人的脸部表情,以检测ALS的症状。具体来说,这篇论文使用Facial Point Graphs来学习脸部图像的几何特征,以自动识别ALS。
  • results: 论文的实验结果显示,提案的方法在测试数据集Toronto Neuroface Dataset中,与现有方法相比,有着更高的准确性和效率。这些结果显示出这种方法的潜力,并带来了领域的发展。
    Abstract Identifying Amyotrophic Lateral Sclerosis (ALS) in its early stages is essential for establishing the beginning of treatment, enriching the outlook, and enhancing the overall well-being of those affected individuals. However, early diagnosis and detecting the disease's signs is not straightforward. A simpler and cheaper way arises by analyzing the patient's facial expressions through computational methods. When a patient with ALS engages in specific actions, e.g., opening their mouth, the movement of specific facial muscles differs from that observed in a healthy individual. This paper proposes Facial Point Graphs to learn information from the geometry of facial images to identify ALS automatically. The experimental outcomes in the Toronto Neuroface dataset show the proposed approach outperformed state-of-the-art results, fostering promising developments in the area.
    摘要 在早期识别肌萎缩侧索硬化症（ALS）非常重要，有助于确定治疗的开始时机、改善预后并提升患者的整体生活质量。然而，早期诊断和识别病状并不简单。一种更简单、成本更低的途径是通过计算方法分析患者的面部表情：当 ALS 患者进行特定动作（例如张开嘴巴）时，特定面部肌肉的运动与健康人不同。本文提出使用面部点图（Facial Point Graphs）从面部图像的几何信息中学习，以自动识别 ALS。在 Toronto Neuroface 数据集上的实验结果表明，所提方法超越了现有的最佳结果，为该领域带来了有前景的发展。
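
A minimal sketch of the general facial-point-graph recipe: landmarks become graph nodes, a hand-built adjacency links nearby points, and a single graph-convolution layer aggregates geometric information before a linear head scores ALS vs. healthy. The landmark count, distance-based adjacency, and layer sizes are illustrative assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn

N = 68                                                 # e.g. 68 facial landmarks per frame
coords = torch.randn(N, 2)                             # (x, y) positions from a landmark detector
adj = (torch.cdist(coords, coords) < 0.5).float()      # connect nearby points (threshold is arbitrary)
adj = adj / adj.sum(dim=1, keepdim=True)               # row-normalise for mean aggregation

class GraphConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.lin = nn.Linear(c_in, c_out)
    def forward(self, x, a):
        return torch.relu(self.lin(a @ x))             # aggregate neighbours, then project

layer = GraphConv(2, 16)
node_feats = layer(coords, adj)
score = nn.Linear(16, 2)(node_feats.mean(dim=0))       # ALS vs. healthy logits
print(score.shape)
```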

DIP-RL: Demonstration-Inferred Preference Learning in Minecraft

  • paper_url: http://arxiv.org/abs/2307.12158
  • repo_url: None
  • paper_authors: Ellen Novoseller, Vinicius G. Goecks, David Watkins, Josh Miller, Nicholas Waytowich
  • for: 本研究旨在解决在无结构化真实世界中,RL算法学习Sequential Decision-Making时,因缺乏奖励信号而难以学习的问题。
  • methods: 本研究提出了Demonstration-Inferred Preference Reinforcement Learning(DIP-RL)算法,利用人类示范来推导RL算法学习。DIP-RL在三种不同的方式使用示范数据,包括训练自动编码器、RL训练批处理中使用示范数据,以及推导RL奖励函数。
  • results: 实验结果表明,DIP-RL可以引导RL算法学习人类偏好,并且与基elines相比,DIP-RL在树割任务中表现竞争力强。例如轨迹满足扩展可以在https://sites.google.com/view/dip-rl。
    Abstract In machine learning for sequential decision-making, an algorithmic agent learns to interact with an environment while receiving feedback in the form of a reward signal. However, in many unstructured real-world settings, such a reward signal is unknown and humans cannot reliably craft a reward signal that correctly captures desired behavior. To solve tasks in such unstructured and open-ended environments, we present Demonstration-Inferred Preference Reinforcement Learning (DIP-RL), an algorithm that leverages human demonstrations in three distinct ways, including training an autoencoder, seeding reinforcement learning (RL) training batches with demonstration data, and inferring preferences over behaviors to learn a reward function to guide RL. We evaluate DIP-RL in a tree-chopping task in Minecraft. Results suggest that the method can guide an RL agent to learn a reward function that reflects human preferences and that DIP-RL performs competitively relative to baselines. DIP-RL is inspired by our previous work on combining demonstrations and pairwise preferences in Minecraft, which was awarded a research prize at the 2022 NeurIPS MineRL BASALT competition, Learning from Human Feedback in Minecraft. Example trajectory rollouts of DIP-RL and baselines are located at https://sites.google.com/view/dip-rl.
    摘要 在面向序列决策的机器学习中，算法代理通过与环境交互并接收奖励信号形式的反馈来进行学习。但在许多非结构化的真实场景中，这种奖励信号是未知的，人们也难以可靠地设计一个能正确刻画期望行为的奖励信号。为了在这类非结构化、开放式的环境中完成任务，我们提出了示范推断偏好强化学习（DIP-RL）算法，它以三种不同方式利用人类示范：训练自编码器、用示范数据填充 RL 训练批次，以及通过推断对行为的偏好来学习一个奖励函数以引导 RL。我们在 Minecraft 的砍树任务中评估了 DIP-RL，结果表明该方法能够引导 RL 代理学习一个反映人类偏好的奖励函数，并且 DIP-RL 相对于基线表现具有竞争力。DIP-RL 受我们此前在 Minecraft 中结合示范与成对偏好的工作启发，该工作在 2022 年 NeurIPS MineRL BASALT 竞赛（Learning from Human Feedback in Minecraft）中获得研究奖。DIP-RL 与基线的示例轨迹回放见 https://sites.google.com/view/dip-rl。
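
One of the three uses of demonstrations, inferring preferences to fit a reward function, can be sketched with a Bradley-Terry-style loss that prefers demonstration segments over agent-collected segments. The observation encoding size, network, and optimizer settings are assumptions for illustration only, not DIP-RL's actual components.

```python
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=3e-4)

def segment_return(obs_segment):                 # obs_segment: (T, 64) encoded observations
    return reward_net(obs_segment).sum()

def preference_loss(demo_seg, agent_seg):
    # treat "demonstration preferred over agent rollout" as the inferred preference label
    logits = torch.stack([segment_return(demo_seg), segment_return(agent_seg)]).unsqueeze(0)
    return nn.functional.cross_entropy(logits, torch.tensor([0]))

demo, rollout = torch.randn(20, 64), torch.randn(20, 64)
opt.zero_grad()
loss = preference_loss(demo, rollout)
loss.backward()
opt.step()
print(float(loss))
```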

Identifying contributors to supply chain outcomes in a multi-echelon setting: a decentralised approach

  • paper_url: http://arxiv.org/abs/2307.12157
  • repo_url: None
  • paper_authors: Stefan Schoepf, Jack Foster, Alexandra Brintrup
  • for: 本研究旨在帮助企业快速准确地确定生产过程中metric变化的原因,尤其是在多层供应链中,只能部分可见。
  • methods: 本研究提议使用可解释人工智能来实现分布式计算估算变量的贡献,以解决数据隐私问题。
  • results: 实验结果表明,分布式计算可以更有效地检测质量变化的起源,比中央化方法使用Shapley添加itive解释。
    Abstract Organisations often struggle to identify the causes of change in metrics such as product quality and delivery duration. This task becomes increasingly challenging when the cause lies outside of company borders in multi-echelon supply chains that are only partially observable. Although traditional supply chain management has advocated for data sharing to gain better insights, this does not take place in practice due to data privacy concerns. We propose the use of explainable artificial intelligence for decentralised computing of estimated contributions to a metric of interest in a multi-stage production process. This approach mitigates the need to convince supply chain actors to share data, as all computations occur in a decentralised manner. Our method is empirically validated using data collected from a real multi-stage manufacturing process. The results demonstrate the effectiveness of our approach in detecting the source of quality variations compared to a centralised approach using Shapley additive explanations.
    摘要 企业往往难以确定产品质量和交付时长等指标发生变化的原因。当原因位于企业边界之外、处于仅部分可观测的多层级供应链中时，这一任务变得更加困难。传统的供应链管理提倡通过数据共享来获得更好的洞察，但出于数据隐私方面的顾虑，这在实践中并未实现。我们建议使用可解释人工智能，以去中心化的方式计算多阶段生产过程中各变量对目标指标的贡献估计。由于所有计算都以去中心化方式进行，这一方法避免了说服供应链参与方共享数据的需要。我们使用来自真实多阶段制造过程的数据对该方法进行了实证验证。结果表明，与使用 Shapley 加性解释（SHAP）的中心化方法相比，我们的方法能够更有效地检测质量波动的来源。
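
A toy sketch of the decentralised idea: each supply-chain stage computes a permutation-style contribution score for its own (private) process variables against the shared quality metric and shares only the scores, never the raw data. The synthetic data and the simple linear surrogate are illustrative; the paper's explainability machinery is more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
stage_a = rng.normal(size=(n, 2))            # stage A's private process variables
stage_b = rng.normal(size=(n, 2))            # stage B's private process variables
quality = 2.0 * stage_a[:, 0] + 0.1 * stage_b[:, 1] + rng.normal(scale=0.1, size=n)

def local_contribution(local_x, y):
    """Permutation-style importance computed locally; only the scores leave the stage."""
    coef, *_ = np.linalg.lstsq(local_x, y, rcond=None)
    base = np.mean((local_x @ coef - y) ** 2)
    scores = []
    for j in range(local_x.shape[1]):
        xp = local_x.copy()
        xp[:, j] = rng.permutation(xp[:, j])
        scores.append(np.mean((xp @ coef - y) ** 2) - base)
    return scores

print("stage A:", local_contribution(stage_a, quality))
print("stage B:", local_contribution(stage_b, quality))
```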

Real-Time Neural Video Recovery and Enhancement on Mobile Devices

  • paper_url: http://arxiv.org/abs/2307.12152
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Zhaoyuan He, Yifan Yang, Lili Qiu, Kyoungjun Park
  • for: 提高移动设备上视频流式的流畅体验
  • methods: 提出了一种新的视频帧恢复方案、一种新的超分辨率算法和一种接受器增强视频比特率调整算法
  • results: 实现了30帧/秒的实时增强,在不同的网络环境下提高了视频流程的质量经验(QoE),具体的提高率为24%-82%。
    Abstract As mobile devices become increasingly popular for video streaming, it's crucial to optimize the streaming experience for these devices. Although deep learning-based video enhancement techniques are gaining attention, most of them cannot support real-time enhancement on mobile devices. Additionally, many of these techniques are focused solely on super-resolution and cannot handle partial or complete loss or corruption of video frames, which is common on the Internet and wireless networks. To overcome these challenges, we present a novel approach in this paper. Our approach consists of (i) a novel video frame recovery scheme, (ii) a new super-resolution algorithm, and (iii) a receiver enhancement-aware video bit rate adaptation algorithm. We have implemented our approach on an iPhone 12, and it can support 30 frames per second (FPS). We have evaluated our approach in various networks such as WiFi, 3G, 4G, and 5G networks. Our evaluation shows that our approach enables real-time enhancement and results in a significant increase in video QoE (Quality of Experience) of 24\% - 82\% in our video streaming system.
    摘要 随着移动设备越来越多地用于视频流媒体，优化这些设备上的流媒体体验变得至关重要。虽然基于深度学习的视频增强技术受到越来越多的关注，但其中大多数无法在移动设备上实现实时增强。此外，许多此类技术只关注超分辨率，无法处理视频帧的部分或完全丢失与损坏，而这在互联网和无线网络中十分常见。为了解决这些挑战，我们在本文中提出了一种新的方法，包括：(i) 一种新的视频帧恢复方案；(ii) 一种新的超分辨率算法；(iii) 一种感知接收端增强能力的视频比特率自适应算法。我们在 iPhone 12 上实现了该方法，可支持每秒 30 帧（FPS）。我们在 WiFi、3G、4G 和 5G 等多种网络中进行了评估，结果表明该方法能够实现实时增强，并使我们视频流系统的体验质量（QoE）显著提升 24%-82%。
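
The receiver-enhancement-aware bitrate adaptation can be sketched as choosing the bitrate that maximises post-enhancement quality within the available bandwidth, rather than raw decoded quality. The bitrate/PSNR profile below is entirely made up for illustration and is not from the paper.

```python
def choose_bitrate(bandwidth_kbps, table):
    # pick the feasible bitrate whose quality AFTER on-device enhancement is highest
    feasible = [b for b in table if b <= bandwidth_kbps]
    return max(feasible, key=lambda b: table[b]["enhanced_psnr"]) if feasible else min(table)

# hypothetical profile: PSNR before / after on-device super-resolution at each bitrate
profiles = {
    400:  {"decoded_psnr": 28.1, "enhanced_psnr": 31.5},
    800:  {"decoded_psnr": 30.4, "enhanced_psnr": 32.9},
    1600: {"decoded_psnr": 33.0, "enhanced_psnr": 33.8},
}
print(choose_bitrate(1000, profiles))   # -> 800 under this toy profile
```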

Learned Gridification for Efficient Point Cloud Processing

  • paper_url: http://arxiv.org/abs/2307.14354
  • repo_url: https://github.com/computri/gridifier
  • paper_authors: Putri A. van der Linden, David W. Romero, Erik J. Bekkers
  • for: 这篇论文旨在解决点云处理中由数据不规则性导致的可扩展性问题，提高点云处理方法在大规模输入和大邻域下的扩展能力。
  • methods: 该论文提出了一种名为”学习gridification”的方法,即将点云转化为一个紧凑、规则的网格,以便在网格上使用已有的深度学习方法。
  • results: 经过理论和实验分析，该论文表明，使用学习 gridification 方法可以在内存和时间方面获得更好的扩展性，同时取得与直接在原始点云数据上运行的网络相当的结果。
    Abstract Neural operations that rely on neighborhood information are much more expensive when deployed on point clouds than on grid data due to the irregular distances between points in a point cloud. In a grid, on the other hand, we can compute the kernel only once and reuse it for all query positions. As a result, operations that rely on neighborhood information scale much worse for point clouds than for grid data, specially for large inputs and large neighborhoods. In this work, we address the scalability issue of point cloud methods by tackling its root cause: the irregularity of the data. We propose learnable gridification as the first step in a point cloud processing pipeline to transform the point cloud into a compact, regular grid. Thanks to gridification, subsequent layers can use operations defined on regular grids, e.g., Conv3D, which scale much better than native point cloud methods. We then extend gridification to point cloud to point cloud tasks, e.g., segmentation, by adding a learnable de-gridification step at the end of the point cloud processing pipeline to map the compact, regular grid back to its original point cloud form. Through theoretical and empirical analysis, we show that gridified networks scale better in terms of memory and time than networks directly applied on raw point cloud data, while being able to achieve competitive results. Our code is publicly available at https://github.com/computri/gridifier.
    摘要 依赖邻域信息的神经操作在点云上比在网格数据上昂贵得多，因为点云中点与点之间的距离不规则；而在网格中，卷积核只需计算一次即可在所有查询位置复用。因此，依赖邻域信息的操作在点云上的扩展性远差于网格数据，尤其是在输入规模和邻域都很大时。在这项工作中，我们从根源入手解决点云方法的扩展性问题，即数据的不规则性。我们提出可学习的 gridification 作为点云处理管道的第一步，将点云转换为紧凑、规则的网格。得益于 gridification，后续层可以使用定义在规则网格上的操作（例如 Conv3D），其扩展性远优于原生的点云方法。我们还将 gridification 扩展到点云到点云的任务（例如分割），在点云处理管道的末端加入可学习的 de-gridification 步骤，将紧凑的规则网格映射回其原始的点云形式。通过理论和实验分析，我们表明 gridified 网络在内存和时间上比直接作用于原始点云数据的网络具有更好的扩展性，同时能够取得有竞争力的结果。我们的代码公开在 https://github.com/computri/gridifier。
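
A minimal sketch of the gridification idea (not the gridifier repo's code): per-point features from a small MLP are scattered into a regular voxel grid so that subsequent layers can apply an ordinary Conv3d. Grid resolution, channel sizes, and the sum-pooling scatter are assumptions made for illustration.

```python
import torch
import torch.nn as nn

P, C, G = 2048, 16, 16                          # points, channels, grid resolution
points = torch.rand(P, 3)                       # normalised xyz in [0, 1)
point_mlp = nn.Sequential(nn.Linear(3, C), nn.ReLU(), nn.Linear(C, C))
feats = point_mlp(points)                       # learned per-point features (P, C)

idx = (points * G).long().clamp(max=G - 1)      # voxel index per point
flat = idx[:, 0] * G * G + idx[:, 1] * G + idx[:, 2]
grid = torch.zeros(C, G * G * G).index_add_(1, flat, feats.t())   # sum features into voxels
grid = grid.view(1, C, G, G, G)

out = nn.Conv3d(C, 32, kernel_size=3, padding=1)(grid)            # regular-grid convolution
print(out.shape)   # torch.Size([1, 32, 16, 16, 16])
```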

CorrFL: Correlation-Based Neural Network Architecture for Unavailability Concerns in a Heterogeneous IoT Environment

  • paper_url: http://arxiv.org/abs/2307.12149
  • repo_url: https://github.com/Western-OC2-Lab/CorrFL
  • paper_authors: Ibrahim Shaer, Abdallah Shami
  • for: 解决联邦学习（FL）中本地模型架构异构以及物联网（IoT）节点因连接问题而不可用所带来的挑战。
  • methods: 提出一种基于相关性的联邦学习方法（Correlation-based FL, CorrFL），通过将异构模型的权重映射到共同的潜在空间来处理模型异构问题；其损失函数在模型缺失时最小化重建损失，并最大化所生成模型之间的相关性。
  • results: 在一个 IoT 设备不可用且活动水平升高的真实用例上进行评估，CorrFL 为不可用设备生成的模型在预测误差（MAE）和数据交换量对预测性能提升的影响等指标上均优于基准模型。
    Abstract The Federated Learning (FL) paradigm faces several challenges that limit its application in real-world environments. These challenges include the local models' architecture heterogeneity and the unavailability of distributed Internet of Things (IoT) nodes due to connectivity problems. These factors posit the question of "how can the available models fill the training gap of the unavailable models?". This question is referred to as the "Oblique Federated Learning" problem. This problem is encountered in the studied environment that includes distributed IoT nodes responsible for predicting CO2 concentrations. This paper proposes the Correlation-based FL (CorrFL) approach influenced by the representational learning field to address this problem. CorrFL projects the various model weights to a common latent space to address the model heterogeneity. Its loss function minimizes the reconstruction loss when models are absent and maximizes the correlation between the generated models. The latter factor is critical because of the intersection of the feature spaces of the IoT devices. CorrFL is evaluated on a realistic use case, involving the unavailability of one IoT device and heightened activity levels that reflect occupancy. The generated CorrFL models for the unavailable IoT device from the available ones trained on the new environment are compared against models trained on different use cases, referred to as the benchmark model. The evaluation criteria combine the mean absolute error (MAE) of predictions and the impact of the amount of exchanged data on the prediction performance improvement. Through a comprehensive experimental procedure, the CorrFL model outperformed the benchmark model in every criterion.
    摘要 联邦学习（FL）范式面临多项限制其在真实环境中应用的挑战，其中包括本地模型的架构异构性，以及分布式物联网（IoT）节点因连接问题而不可用。这些因素引出了"可用的模型如何填补不可用模型留下的训练空缺"这一问题，即"斜向联邦学习（Oblique Federated Learning）"问题。该问题出现在本文所研究的环境中：分布式 IoT 节点负责预测 CO2 浓度。本文提出受表示学习领域启发的基于相关性的联邦学习方法（CorrFL）来解决这一问题。CorrFL 将各模型的权重投影到共同的潜在空间以处理模型异构性；其损失函数在模型缺失时最小化重建损失，并最大化所生成模型之间的相关性。后一因素至关重要，因为各 IoT 设备的特征空间存在交叠。我们在一个真实用例上评估了 CorrFL，该用例包含一个 IoT 设备不可用，以及反映人员活动（occupancy）的活动水平升高。将由可用模型在新环境中训练得到的、针对不可用 IoT 设备生成的 CorrFL 模型，与在不同用例上训练的基准模型进行比较。评估标准综合了预测的平均绝对误差（MAE）以及数据交换量对预测性能提升的影响。通过完整的实验流程，CorrFL 模型在每项标准上都优于基准模型。
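
A rough sketch of the core CorrFL mechanics as described above: per-client autoencoders map differently-sized flattened weights into a shared latent space, and the loss combines reconstruction error with a term that encourages the latent codes to be correlated. Cosine similarity is used here as a stand-in correlation measure, and all shapes and loss weights are assumptions.

```python
import torch
import torch.nn as nn

def autoencoder(dim_in, dim_latent=32):
    enc = nn.Sequential(nn.Linear(dim_in, 64), nn.ReLU(), nn.Linear(64, dim_latent))
    dec = nn.Sequential(nn.Linear(dim_latent, 64), nn.ReLU(), nn.Linear(64, dim_in))
    return enc, dec

w_a, w_b = torch.randn(1, 120), torch.randn(1, 200)     # flattened weights, heterogeneous sizes
enc_a, dec_a = autoencoder(120)
enc_b, dec_b = autoencoder(200)

z_a, z_b = enc_a(w_a), enc_b(w_b)                       # common latent space
recon = nn.functional.mse_loss(dec_a(z_a), w_a) + nn.functional.mse_loss(dec_b(z_b), w_b)
corr = nn.functional.cosine_similarity(z_a, z_b).mean() # proxy for cross-client correlation
loss = recon - 0.1 * corr                               # minimise reconstruction, maximise correlation
loss.backward()
print(float(loss))
```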

Applications of Machine Learning to Modelling and Analysing Dynamical Systems

  • paper_url: http://arxiv.org/abs/2308.03763
  • repo_url: None
  • paper_authors: Vedanta Thapar
  • for: 本研究使用物理信息神经网络（Physics Informed Neural Networks）分析具有一个运动第一积分的非线性哈密顿动力系统。
  • methods: 本文提出一种将现有哈密顿神经网络结构组合为可适应辛递归神经网络（Adaptable Symplectic Recurrent Neural Networks）的架构，能够在对整个参数空间预测动力学的同时保持哈密顿方程以及相空间的辛结构。该架构在预测哈密顿动力学方面明显优于此前提出的神经网络，特别是在含有多个参数的势能中。
  • results: 研究表明，该架构能够高效地预测哈密顿动力学，尤其是在含有多个参数的势能中；此外，作者还展示了基于延迟嵌入与辛网络相结合的方法在单参数势能中即使经过很长时间也能给出准确预测。
    Abstract We explore the use of Physics Informed Neural Networks to analyse nonlinear Hamiltonian Dynamical Systems with a first integral of motion. In this work, we propose an architecture which combines existing Hamiltonian Neural Network structures into Adaptable Symplectic Recurrent Neural Networks which preserve Hamilton's equations as well as the symplectic structure of phase space while predicting dynamics for the entire parameter space. This architecture is found to significantly outperform previously proposed neural networks when predicting Hamiltonian dynamics especially in potentials which contain multiple parameters. We demonstrate its robustness using the nonlinear Henon-Heiles potential under chaotic, quasiperiodic and periodic conditions. The second problem we tackle is whether we can use the high dimensional nonlinear capabilities of neural networks to predict the dynamics of a Hamiltonian system given only partial information of the same. Hence we attempt to take advantage of Long Short Term Memory networks to implement Takens' embedding theorem and construct a delay embedding of the system followed by mapping the topologically invariant attractor to the true form. This architecture is then layered with Adaptable Symplectic nets to allow for predictions which preserve the structure of Hamilton's equations. We show that this method works efficiently for single parameter potentials and provides accurate predictions even over long periods of time.
    摘要 我们探讨使用物理信息神经网络分析具有一个运动第一积分的非线性哈密顿动力系统。在这项工作中，我们提出一种将现有哈密顿神经网络结构组合为可适应辛递归神经网络的架构，该架构在对整个参数空间预测动力学的同时，保持哈密顿方程以及相空间的辛结构。这种架构在预测哈密顿动力学方面明显优于此前提出的神经网络，尤其是在含有多个参数的势能中。我们使用非线性 Henon-Heiles 势能在混沌、准周期和周期条件下验证了其稳健性。我们处理的第二个问题是：能否利用神经网络的高维非线性能力，在仅有部分信息的情况下预测哈密顿系统的动力学。为此，我们尝试利用长短期记忆（LSTM）网络实现 Takens 嵌入定理，构建系统的延迟嵌入，并将拓扑不变的吸引子映射回其真实形式。随后在该架构上叠加可适应辛网络，使预测保持哈密顿方程的结构。我们证明该方法对单参数势能十分有效，即使在很长的时间范围内也能提供准确预测。
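
A generic Hamiltonian-neural-network sketch of the ingredients the abstract combines (not the paper's adaptable symplectic RNN): a network outputs a scalar H(q, p), Hamilton's equations are recovered via autograd, and states are advanced with a symplectic leapfrog-style step. Dimensions and network sizes are assumptions.

```python
import torch
import torch.nn as nn

h_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # H(q, p) for 2-dim q and p

def time_derivatives(q, p):
    qp = torch.cat([q, p], dim=-1).requires_grad_(True)
    H = h_net(qp).sum()
    dH = torch.autograd.grad(H, qp, create_graph=True)[0]   # create_graph allows training through the step
    dHdq, dHdp = dH[..., :2], dH[..., 2:]
    return dHdp, -dHdq              # Hamilton's equations: dq/dt = dH/dp, dp/dt = -dH/dq

def leapfrog_step(q, p, dt=0.01):
    _, dp = time_derivatives(q, p)
    p = p + 0.5 * dt * dp           # half kick
    dq, _ = time_derivatives(q, p)
    q = q + dt * dq                 # drift
    _, dp = time_derivatives(q, p)
    return q, p + 0.5 * dt * dp     # half kick

q, p = torch.randn(1, 2), torch.randn(1, 2)
print(leapfrog_step(q, p))
```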

A Vision for Cleaner Rivers: Harnessing Snapshot Hyperspectral Imaging to Detect Macro-Plastic Litter

  • paper_url: http://arxiv.org/abs/2307.12145
  • repo_url: https://github.com/river-lab/hyperspectral_macro_plastic_detection
  • paper_authors: Nathaniel Hanson, Ahmet Demirkaya, Deniz Erdoğmuş, Aron Stubbins, Taşkın Padır, Tales Imbiriba
  • for: 本研究旨在开发一种高效自动化的浮游垃圾监测方法,以解决水体中垃圾杂物的监测问题。
  • methods: 本研究使用计算成像技术进行浮游垃圾物质的检测,包括可见短波谱成像和可见短波谱识别方法。
  • results: 实验结果表明,使用Snapshot可见短波谱成像和机器学习分类方法可以在实际场景中实现高检测精度,特别是在具有挑战性的场景下。
    Abstract Plastic waste entering the riverine harms local ecosystems leading to negative ecological and economic impacts. Large parcels of plastic waste are transported from inland to oceans leading to a global scale problem of floating debris fields. In this context, efficient and automatized monitoring of mismanaged plastic waste is paramount. To address this problem, we analyze the feasibility of macro-plastic litter detection using computational imaging approaches in river-like scenarios. We enable near-real-time tracking of partially submerged plastics by using snapshot Visible-Shortwave Infrared hyperspectral imaging. Our experiments indicate that imaging strategies associated with machine learning classification approaches can lead to high detection accuracy even in challenging scenarios, especially when leveraging hyperspectral data and nonlinear classifiers. All code, data, and models are available online: https://github.com/RIVeR-Lab/hyperspectral_macro_plastic_detection.
    摘要 塑料废弃物进入河流环境会对当地生态系统造成负面影响,导致生态和经济问题。大量塑料废弃物从陆地传输到海洋,导致全球范围内漂浮垃圾场景。在这种情况下,高效和自动化的废弃塑料监测变得非常重要。为解决这个问题,我们分析了使用计算成像方法检测大型塑料废弃物的可能性。我们使用快照可见短波谱 hyperspectral成像进行近实时检测半潜水塑料。我们的实验表明,通过使用机器学习分类方法和非线性分类器,可以在具有挑战性的情况下实现高检测精度。所有代码、数据和模型都可以在 GitHub 上下载:https://github.com/RIVeR-Lab/hyperspectral_macro_plastic_detection。
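
A toy per-pixel classification sketch in the spirit of the abstract: each pixel's visible-SWIR spectrum is passed through a small nonlinear classifier to produce a plastic-vs-background mask. The band count, image size, and classifier are placeholder assumptions, not the authors' models.

```python
import torch
import torch.nn as nn

bands, H, W = 40, 64, 64
cube = torch.rand(bands, H, W)                       # one hyperspectral snapshot
pixels = cube.flatten(1).t()                         # (H*W, bands) spectra

clf = nn.Sequential(nn.Linear(bands, 64), nn.ReLU(), nn.Linear(64, 2))  # plastic vs. background
mask = clf(pixels).argmax(dim=1).view(H, W)          # per-pixel prediction map
print(mask.shape, mask.float().mean())
```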

Emergence of Adaptive Circadian Rhythms in Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.12143
  • repo_url: https://github.com/aqeel13932/mn_project
  • paper_authors: Aqeel Labash, Florian Fletzer, Daniel Majoral, Raul Vicente
  • for: 本文研究深度强化学习代理中类昼夜节律（circadian-like rhythms）的涌现。
  • methods: 作者让代理在一个具有可靠周期性变化的环境中完成觅食任务进行训练，系统地刻画了代理在学习过程中的行为，并通过分岔分析与相位响应曲线分析研究内源性节律的涌现。
  • results: 论文表明，内部节律无需任何重新训练即可适应环境信号相位的偏移；人工神经元发展出支持内化环境节律的动力学特性。从动力系统的角度看，这种适应通过神经元动力学中出现稳定周期轨道来实现，其相位响应使代理动力学与环境节律之间达到最优的相位同步。
    Abstract Adapting to regularities of the environment is critical for biological organisms to anticipate events and plan. A prominent example is the circadian rhythm corresponding to the internalization by organisms of the $24$-hour period of the Earth's rotation. In this work, we study the emergence of circadian-like rhythms in deep reinforcement learning agents. In particular, we deployed agents in an environment with a reliable periodic variation while solving a foraging task. We systematically characterize the agent's behavior during learning and demonstrate the emergence of a rhythm that is endogenous and entrainable. Interestingly, the internal rhythm adapts to shifts in the phase of the environmental signal without any re-training. Furthermore, we show via bifurcation and phase response curve analyses how artificial neurons develop dynamics to support the internalization of the environmental rhythm. From a dynamical systems view, we demonstrate that the adaptation proceeds by the emergence of a stable periodic orbit in the neuron dynamics with a phase response that allows an optimal phase synchronisation between the agent's dynamics and the environmental rhythm.
    摘要 适应环境的规律性对生物体预测事件和进行规划至关重要，一个突出的例子是昼夜节律，即生物体对地球自转 24 小时周期的内化。在这项工作中，我们研究深度强化学习代理中类昼夜节律的涌现。具体来说，我们将代理部署在一个具有可靠周期性变化的环境中完成觅食任务。我们系统地刻画了代理在学习过程中的行为，并证明了一种内源且可被同步（entrainable）的节律的涌现。有趣的是，这种内部节律无需任何重新训练即可适应环境信号相位的偏移。此外，我们通过分岔分析与相位响应曲线分析展示了人工神经元如何发展出支持内化环境节律的动力学特性。从动力系统的角度看，这种适应通过神经元动力学中出现稳定周期轨道来实现，其相位响应使代理动力学与环境节律之间达到最优的相位同步。

Unlocking Carbon Reduction Potential with Reinforcement Learning for the Three-Dimensional Loading Capacitated Vehicle Routing Problem

  • paper_url: http://arxiv.org/abs/2307.12136
  • repo_url: None
  • paper_authors: Stefan Schoepf, Stephen Mak, Julian Senoner, Liming Xu, Netland Torbjörn, Alexandra Brintrup
  • for: 提高重型货车的装载与运输效率，从而释放碳减排潜力
  • methods: 使用强化学习模型
  • results: 与现有方法相比,平均差距在3.83%到8.10%之间
    Abstract Heavy goods vehicles are vital backbones of the supply chain delivery system but also contribute significantly to carbon emissions with only 60% loading efficiency in the United Kingdom. Collaborative vehicle routing has been proposed as a solution to increase efficiency, but challenges remain to make this a possibility. One key challenge is the efficient computation of viable solutions for co-loading and routing. Current operations research methods suffer from non-linear scaling with increasing problem size and are therefore bound to limited geographic areas to compute results in time for day-to-day operations. This only allows for local optima in routing and leaves global optimisation potential untouched. We develop a reinforcement learning model to solve the three-dimensional loading capacitated vehicle routing problem in approximately linear time. While this problem has been studied extensively in operations research, no publications on solving it with reinforcement learning exist. We demonstrate the favourable scaling of our reinforcement learning model and benchmark our routing performance against state-of-the-art methods. The model performs within an average gap of 3.83% to 8.10% compared to established methods. Our model not only represents a promising first step towards large-scale logistics optimisation with reinforcement learning but also lays the foundation for this research stream.
    摘要 重型货车是供应链配送体系的重要支柱，但也显著贡献碳排放——在英国其装载效率仅为 60%。协同车辆路径规划被提出作为提升效率的方案，但要实现这一目标仍面临挑战，其中一个关键挑战是如何高效地计算可行的共同装载与路径方案。当前的运筹学方法随着问题规模增大呈非线性扩展，因此只能局限在有限的地理范围内，以便在日常运营时限内得到结果；这只能获得局部最优的路径，全局优化潜力尚未被挖掘。我们开发了一种强化学习模型，以近似线性的时间求解三维装载约束车辆路径问题。尽管该问题在运筹学中已被广泛研究，但尚无使用强化学习求解该问题的文献。我们展示了所提强化学习模型良好的扩展性，并将其路径规划性能与最先进的方法进行了对比：该模型与既有方法的平均差距在 3.83% 至 8.10% 之间。我们的模型不仅是迈向基于强化学习的大规模物流优化的有前景的第一步，也为这一研究方向奠定了基础。
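
A generic pointer-style decoding step, shown only to illustrate how an RL policy for a capacitated routing problem can mask infeasible choices; the embeddings, state summary, and masking rule are placeholders and do not reflect the paper's architecture or its handling of 3D loading constraints.

```python
import torch
import torch.nn as nn

n_customers, d = 10, 16
embed = torch.randn(n_customers, d)                         # customer embeddings (placeholder)
query = nn.Linear(d, d)(embed.mean(dim=0, keepdim=True))    # crude summary of the current state

demand = torch.rand(n_customers)
visited = torch.zeros(n_customers, dtype=torch.bool)
visited[0] = True                                           # depot / already-served customer
remaining_capacity = 0.6

scores = (embed @ query.t()).squeeze(-1)                    # score each candidate next customer
mask = visited | (demand > remaining_capacity)              # infeasible: visited or over capacity
scores = scores.masked_fill(mask, float("-inf"))
action = torch.distributions.Categorical(logits=scores).sample()
print(int(action))
```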

The Sample Complexity of Multi-Distribution Learning for VC Classes

  • paper_url: http://arxiv.org/abs/2307.12135
  • repo_url: None
  • paper_authors: Pranjal Awasthi, Nika Haghtalab, Eric Zhao
  • for: 多 Distribution Learning 是一种自然推广 PAC 学习到多个数据分布的设置中。
  • methods: 使用游戏动力学来解决这个问题,并讨论了一些基本障碍。
  • results: 研究表明，已知的样本复杂度上界为 $O(\epsilon^{-2} \ln(k)(d + k) + \min\{\epsilon^{-1} dk, \epsilon^{-4} \ln(k) d\})$，而目前最好的下界是 $\Omega(\epsilon^{-2}(d + k \ln(k)))$，两者之间仍存在明显差距。
    Abstract Multi-distribution learning is a natural generalization of PAC learning to settings with multiple data distributions. There remains a significant gap between the known upper and lower bounds for PAC-learnable classes. In particular, though we understand the sample complexity of learning a VC dimension d class on $k$ distributions to be $O(\epsilon^{-2} \ln(k)(d + k) + \min\{\epsilon^{-1} dk, \epsilon^{-4} \ln(k) d\})$, the best lower bound is $\Omega(\epsilon^{-2}(d + k \ln(k)))$. We discuss recent progress on this problem and some hurdles that are fundamental to the use of game dynamics in statistical learning.
    摘要 多分布学习是将 PAC 学习自然推广到具有多个数据分布的设置。对于 PAC 可学习的类，已知的上界与下界之间仍存在明显差距。具体而言，尽管我们已知在 $k$ 个分布上学习 VC 维为 $d$ 的类的样本复杂度为 $O(\epsilon^{-2} \ln(k)(d + k) + \min\{\epsilon^{-1} dk, \epsilon^{-4} \ln(k) d\})$，但目前最好的下界是 $\Omega(\epsilon^{-2}(d + k \ln(k)))$。我们讨论了该问题的最新进展，以及在统计学习中使用博弈动力学所面临的一些根本性障碍。

AI on the Road: A Comprehensive Analysis of Traffic Accidents and Accident Detection System in Smart Cities

  • paper_url: http://arxiv.org/abs/2307.12128
  • repo_url: None
  • paper_authors: Victor Adewopo, Nelly Elsayed, Zag Elsayed, Murat Ozer, Victoria Wangia-Anderson, Ahmed Abdelgawad
  • for: 本研究旨在提高交通管理和交通事故减少,通过分析不同地区的交通事故数据,提出一个基于交通监控摄像头和动作识别系统的交通事故探测和应答框架。
  • methods: 本研究使用了国家公路交通安全管理局(NHTSA)的交通事故报告采样系统(CRSS)数据进行交通事故分析,并提出了一种基于机器学习算法和交通监控摄像头的交通事故探测和应答框架。
  • results: 本研究发现了不同地区的交通事故特征和趋势,并提出了一种基于交通监控摄像头和动作识别系统的交通事故探测和应答框架,可以减少交通事故的频率和严重程度,提高交通管理的效率和安全性。
    Abstract Accident detection and traffic analysis is a critical component of smart city and autonomous transportation systems that can reduce accident frequency, severity and improve overall traffic management. This paper presents a comprehensive analysis of traffic accidents in different regions across the United States using data from the National Highway Traffic Safety Administration (NHTSA) Crash Report Sampling System (CRSS). To address the challenges of accident detection and traffic analysis, this paper proposes a framework that uses traffic surveillance cameras and action recognition systems to detect and respond to traffic accidents spontaneously. Integrating the proposed framework with emergency services will harness the power of traffic cameras and machine learning algorithms to create an efficient solution for responding to traffic accidents and reducing human errors. Advanced intelligence technologies, such as the proposed accident detection systems in smart cities, will improve traffic management and traffic accident severity. Overall, this study provides valuable insights into traffic accidents in the US and presents a practical solution to enhance the safety and efficiency of transportation systems.
    摘要 智能城市和自动交通系统中的事故探测和交通分析是一个关键组成部分,可以降低事故频率、严重程度并改善总体交通管理。这篇论文对美国各地的交通事故进行了全面的分析,使用国家公路安全管理局(NHTSA)的事故报告采样系统(CRSS)的数据。为了解决事故探测和交通分析的挑战,该论文提出了一个框架,使用交通监控摄像头和动作认知系统来自动探测和应对交通事故。将该框架与紧急服务集成,可以利用交通摄像头和机器学习算法创造一种高效的交通事故应对解决方案,减少人类错误。高级智能技术,如智能城市中的事故探测系统,将改善交通管理和交通事故严重程度。总的来说,这篇研究提供了美国交通事故的有价值的视角,并提出了实用的解决方案,以提高交通系统的安全和效率。

Synthesis of Batik Motifs using a Diffusion – Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2307.12122
  • repo_url: https://github.com/octadion/diffusion-stylegan2-ada-pytorch
  • paper_authors: One Octadion, Novanto Yudistira, Diva Kurnianingtyas
  • for: assist batik designers or craftsmen in producing unique and quality batik motifs with efficient production time and costs.
  • methods: using StyleGAN2-Ada and Diffusion techniques to produce realistic and high-quality synthetic batik patterns, with adjustments to the model architecture and a well-curated batik dataset.
  • results: capable of producing authentic and quality batik patterns, with finer details and rich artistic variations.
    Abstract Batik, a unique blend of art and craftsmanship, is a distinct artistic and technological creation for Indonesian society. Research on batik motifs is primarily focused on classification. However, further studies may extend to the synthesis of batik patterns. Generative Adversarial Networks (GANs) have been an important deep learning model for generating synthetic data, but often face challenges in the stability and consistency of results. This research focuses on the use of StyleGAN2-Ada and Diffusion techniques to produce realistic and high-quality synthetic batik patterns. StyleGAN2-Ada is a variation of the GAN model that separates the style and content aspects in an image, whereas diffusion techniques introduce random noise into the data. In the context of batik, StyleGAN2-Ada and Diffusion are used to produce realistic synthetic batik patterns. This study also made adjustments to the model architecture and used a well-curated batik dataset. The main goal is to assist batik designers or craftsmen in producing unique and quality batik motifs with efficient production time and costs. Based on qualitative and quantitative evaluations, the results show that the model tested is capable of producing authentic and quality batik patterns, with finer details and rich artistic variations. The dataset and code can be accessed here:https://github.com/octadion/diffusion-stylegan2-ada-pytorch
    摘要 蜡染（Batik）是艺术与工艺的独特结合，是印度尼西亚社会独具特色的艺术与技术创造。关于蜡染纹样的研究主要集中在分类上，而进一步的研究可以扩展到蜡染图案的合成。生成对抗网络（GAN）是生成合成数据的重要深度学习模型，但常常面临结果稳定性与一致性方面的挑战。本研究聚焦于使用 StyleGAN2-Ada 与扩散（Diffusion）技术来生成逼真且高质量的合成蜡染图案。StyleGAN2-Ada 是 GAN 模型的一个变体，能够将图像中的风格与内容分离，而扩散技术则向数据中引入随机噪声。在蜡染场景中，StyleGAN2-Ada 与扩散技术被用来生成逼真的合成蜡染图案。本研究还对模型结构进行了调整，并使用了精心整理的蜡染数据集。主要目标是帮助蜡染设计师或工匠以更短的生产时间和更低的成本创作出独特且高质量的蜡染纹样。基于定性与定量评估，结果表明所测试的模型能够生成真实且高质量的蜡染图案，具有更精细的细节与丰富的艺术变化。数据集与代码可在此获取：https://github.com/octadion/diffusion-stylegan2-ada-pytorch
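
One common way to combine diffusion noise with GAN training (in the spirit of Diffusion-GAN, not necessarily the repo's exact code) is to perturb both real images and generator outputs with a closed-form forward-diffusion step before the discriminator sees them. A minimal sketch with an assumed linear beta schedule:

```python
import torch

def diffuse(x, t, betas):
    """Closed-form forward diffusion q(x_t | x_0): sqrt(a_bar)*x + sqrt(1-a_bar)*noise."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(x)
    return alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * noise

betas = torch.linspace(1e-4, 0.02, steps=100)          # assumed linear schedule
real_batik = torch.rand(4, 3, 64, 64) * 2 - 1          # images scaled to [-1, 1]
t = torch.randint(0, 100, (1,)).item()                 # shared noising level for this step
noisy_real = diffuse(real_batik, t, betas)             # same op would be applied to generator outputs
print(noisy_real.shape, t)
```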