cs.CV - 2023-09-04

NLLB-CLIP – train performant multilingual image retrieval model on a budget

  • paper_url: http://arxiv.org/abs/2309.01859
  • repo_url: None
  • paper_authors: Alexander Visheratin
  • for: To investigate whether someone without access to massive computing resources can make a valuable scientific contribution in the field of multilingual image retrieval.
  • methods: Trained a CLIP model with a text encoder from the NLLB model on an automatically created dataset of 106,246 good-quality images with captions in 201 languages, using image and text encoders of various sizes and freezing different parts of the model during training.
  • results: NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
    Abstract Today, the exponential rise of large models developed by academic and industrial institutions with the help of massive computing resources raises the question of whether someone without access to such resources can make a valuable scientific contribution. To explore this, we tried to solve the challenging task of multilingual image retrieval having a limited budget of $1,000. As a result, we present NLLB-CLIP - CLIP model with a text encoder from the NLLB model. To train the model, we used an automatically created dataset of 106,246 good-quality images with captions in 201 languages derived from the LAION COCO dataset. We trained multiple models using image and text encoders of various sizes and kept different parts of the model frozen during the training. We thoroughly analyzed the trained models using existing evaluation datasets and newly created XTD200 and Flickr30k-200 datasets. We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
    Note: "NLLB" stands for "No Language Left Behind", Meta AI's massively multilingual machine translation model family; "CLIP" stands for "Contrastive Language-Image Pre-training".
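    As background, a minimal sketch of the symmetric contrastive (InfoNCE) objective used in CLIP-style training to align an image encoder with a text encoder such as NLLB's; the function name and temperature are illustrative choices, not the paper's code:

    ```python
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # L2-normalize both embedding batches (shape: [batch, dim])
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Cosine-similarity logits between every image/text pair in the batch
        logits = image_emb @ text_emb.t() / temperature
        # Matching image/text pairs sit on the diagonal
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: image-to-text and text-to-image retrieval
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2
    ```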

Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations

  • paper_url: http://arxiv.org/abs/2309.01858
  • repo_url: None
  • paper_authors: Nikolaos-Antonios Ypsilantis, Kaifeng Chen, Bingyi Cao, Mário Lipovský, Pelin Dogan-Schönberger, Grzegorz Makosa, Boris Bluntschli, Mojtaba Seyedhosseini, Ondřej Chum, André Araujo
  • for: The paper aims to address the problem of universal image embedding, where a single universal model is trained and used in multiple domains.
  • methods: The paper proposes a new large-scale public benchmark for the evaluation of universal image embeddings, with 241k query images, 1.4M index images, and 2.8M training images across 8 different domains and 349k classes. The authors also provide a comprehensive experimental evaluation on the new dataset and conduct a public research competition to foster future research in this area.
  • results: The paper shows that existing approaches and simplistic extensions lead to worse performance than an assembly of models trained for each domain separately. Additionally, the public research competition attracted the participation of more than 1k teams worldwide and generated many interesting research ideas and findings.
    Abstract Fine-grained and instance-level recognition methods are commonly trained and evaluated on specific domains, in a model per domain scenario. Such an approach, however, is impractical in real large-scale applications. In this work, we address the problem of universal image embedding, where a single universal model is trained and used in multiple domains. First, we leverage existing domain-specific datasets to carefully construct a new large-scale public benchmark for the evaluation of universal image embeddings, with 241k query images, 1.4M index images and 2.8M training images across 8 different domains and 349k classes. We define suitable metrics, training and evaluation protocols to foster future research in this area. Second, we provide a comprehensive experimental evaluation on the new dataset, demonstrating that existing approaches and simplistic extensions lead to worse performance than an assembly of models trained for each domain separately. Finally, we conducted a public research competition on this topic, leveraging industrial datasets, which attracted the participation of more than 1k teams worldwide. This exercise generated many interesting research ideas and findings which we present in detail. Project webpage: https://cmp.felk.cvut.cz/univ_emb/

SMPLitex: A Generative Model and Dataset for 3D Human Texture Estimation from Single Image

  • paper_url: http://arxiv.org/abs/2309.01855
  • repo_url: None
  • paper_authors: Dan Casas, Marc Comino-Trinidad
  • for: To estimate and manipulate the complete 3D appearance of humans captured from a single image.
  • methods: The method builds upon recently proposed generative models for 2D images and extends them to the 3D domain through pixel-to-surface correspondences computed on the input image; a generative model of complete 3D human appearance is trained first and then fitted to the input image by conditioning on the visible parts of the subject.
  • results: Quantitative and qualitative evaluation on 3 publicly available datasets shows that SMPLitex significantly outperforms existing methods for human texture estimation while enabling a wider variety of tasks, such as editing, synthesis, and manipulation.
    Abstract We propose SMPLitex, a method for estimating and manipulating the complete 3D appearance of humans captured from a single image. SMPLitex builds upon the recently proposed generative models for 2D images, and extends their use to the 3D domain through pixel-to-surface correspondences computed on the input image. To this end, we first train a generative model for complete 3D human appearance, and then fit it into the input image by conditioning the generative model to the visible parts of the subject. Furthermore, we propose a new dataset of high-quality human textures built by sampling SMPLitex conditioned on subject descriptions and images. We quantitatively and qualitatively evaluate our method in 3 publicly available datasets, demonstrating that SMPLitex significantly outperforms existing methods for human texture estimation while allowing for a wider variety of tasks such as editing, synthesis, and manipulation

Uncertainty in AI: Evaluating Deep Neural Networks on Out-of-Distribution Images

  • paper_url: http://arxiv.org/abs/2309.01850
  • repo_url: None
  • paper_authors: Jamiu Idowu, Ahmed Almasoud
  • for: This paper investigates the inconsistent performance of deep neural networks (ResNet-50, VGG16, DenseNet121, AlexNet, and GoogleNet) on out-of-distribution (OOD) or perturbed data.
  • methods: Three experiments are used. First, the pretrained models classify OOD images to assess their performance. Second, an ensemble of the models' predictions is built using probabilistic averaging for consensus, with the ensemble's uncertainty quantified by average probability, variance, and entropy metrics (see the sketch after the abstract). Third, robustness is tested by adding perturbations (filters, rotations, etc.) to new DALL-E-generated or real-world images.
  • results: ResNet-50 was the most accurate single model on OOD images, but the ensemble performed even better, correctly classifying all images. After perturbation, ResNet-50 (chosen as the best-performing model) classified 4 of 5 unperturbed images correctly but misclassified all perturbed ones, revealing a significant vulnerability; these misclassifications are obvious to human observers, highlighting the limitations of AI models. Saliency maps were used to identify the image regions the model considered important for its decisions.
    Abstract As AI models are increasingly deployed in critical applications, ensuring the consistent performance of models when exposed to unusual situations such as out-of-distribution (OOD) or perturbed data, is important. Therefore, this paper investigates the uncertainty of various deep neural networks, including ResNet-50, VGG16, DenseNet121, AlexNet, and GoogleNet, when dealing with such data. Our approach includes three experiments. First, we used the pretrained models to classify OOD images generated via DALL-E to assess their performance. Second, we built an ensemble from the models' predictions using probabilistic averaging for consensus due to its advantages over plurality or majority voting. The ensemble's uncertainty was quantified using average probabilities, variance, and entropy metrics. Our results showed that while ResNet-50 was the most accurate single model for OOD images, the ensemble performed even better, correctly classifying all images. Third, we tested model robustness by adding perturbations (filters, rotations, etc.) to new epistemic images from DALL-E or real-world captures. ResNet-50 was chosen for this being the best performing model. While it classified 4 out of 5 unperturbed images correctly, it misclassified all of them post-perturbation, indicating a significant vulnerability. These misclassifications, which are clear to human observers, highlight AI models' limitations. Using saliency maps, we identified regions of the images that the model considered important for their decisions.
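    The ensemble uncertainty metrics named above (probabilistic averaging, variance, entropy) are standard quantities; a minimal sketch, assuming each model outputs one softmax vector per image:

    ```python
    import numpy as np

    def ensemble_uncertainty(prob_stack):
        """prob_stack: (n_models, n_classes) array of per-model softmax outputs."""
        mean_probs = prob_stack.mean(axis=0)              # probabilistic averaging
        prediction = int(mean_probs.argmax())             # consensus class
        variance = float(prob_stack.var(axis=0).mean())   # cross-model disagreement
        entropy = float(-(mean_probs * np.log(mean_probs + 1e-12)).sum())
        return prediction, variance, entropy
    ```

    Higher variance or entropy flags inputs the ensemble is unsure about, which is how OOD or perturbed images can be triaged.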

StereoFlowGAN: Co-training for Stereo and Flow with Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.01842
  • repo_url: None
  • paper_authors: Zhexiao Xiong, Feng Qiao, Yu Zhang, Nathan Jacobs
  • for: This paper proposes a novel training strategy for stereo matching and optical flow estimation based on image-to-image translation, enabling strong performance in real-image scenarios.
  • methods: Image-to-image translation between synthetic and real image domains allows models to be trained for real scenes while relying solely on ground truth from synthetic images; a bidirectional feature warping module handles both the left-right and forward-backward directions (a simplified warping sketch follows the abstract).
  • results: Experimental results show competitive performance over previous domain-translation-based methods, substantiating the efficacy of the proposed framework.
    Abstract We introduce a novel training strategy for stereo matching and optical flow estimation that utilizes image-to-image translation between synthetic and real image domains. Our approach enables the training of models that excel in real image scenarios while relying solely on ground-truth information from synthetic images. To facilitate task-agnostic domain adaptation and the training of task-specific components, we introduce a bidirectional feature warping module that handles both left-right and forward-backward directions. Experimental results show competitive performance over previous domain translation-based methods, which substantiate the efficacy of our proposed framework, effectively leveraging the benefits of unsupervised domain adaptation, stereo matching, and optical flow estimation.
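    For intuition, a minimal sketch of left-right feature warping with a disparity map; the paper's bidirectional module also handles the forward-backward (optical-flow) direction, and this simplified single-direction version is our assumption, not the authors' code:

    ```python
    import torch
    import torch.nn.functional as F

    def warp_right_to_left(feat_right, disparity):
        """feat_right: (B, C, H, W) features; disparity: (B, 1, H, W) left-view disparity."""
        b, c, h, w = feat_right.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        xs = xs.unsqueeze(0).float().to(feat_right.device) - disparity.squeeze(1)
        ys = ys.unsqueeze(0).float().to(feat_right.device).expand_as(xs)
        # Normalize pixel coordinates to [-1, 1] as grid_sample expects
        grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
        return F.grid_sample(feat_right, grid, align_corners=True)
    ```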

On the fly Deep Neural Network Optimization Control for Low-Power Computer Vision

  • paper_url: http://arxiv.org/abs/2309.01824
  • repo_url: None
  • paper_authors: Ishmeet Kaur, Adwaita Janardhan Jadhav
  • for: This paper aims to improve the deployability of state-of-the-art computer vision techniques on resource-constrained edge devices.
  • methods: The paper proposes a novel technique called AdaptiveActivation, which dynamically adjusts the sparsity and precision of a DNN’s activation function during run-time to improve accuracy and energy consumption.
  • results: The authors conduct experiments on popular edge devices and show that their approach achieves accuracy within 1.5% of the baseline while requiring 10%–38% less memory, providing more accuracy-efficiency tradeoff options.
    Abstract Processing visual data on mobile devices has many applications, e.g., emergency response and tracking. State-of-the-art computer vision techniques rely on large Deep Neural Networks (DNNs) that are usually too power-hungry to be deployed on resource-constrained edge devices. Many techniques improve the efficiency of DNNs by using sparsity or quantization. However, the accuracy and efficiency of these techniques cannot be adapted for diverse edge applications with different hardware constraints and accuracy requirements. This paper presents a novel technique to allow DNNs to adapt their accuracy and energy consumption during run-time, without the need for any re-training. Our technique called AdaptiveActivation introduces a hyper-parameter that controls the output range of the DNNs' activation function to dynamically adjust the sparsity and precision in the DNN. AdaptiveActivation can be applied to any existing pre-trained DNN to improve their deployability in diverse edge environments. We conduct experiments on popular edge devices and show that the accuracy is within 1.5% of the baseline. We also show that our approach requires 10%--38% less memory than the baseline techniques leading to more accuracy-efficiency tradeoff options
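    The paper's exact mechanism is not public (repo_url is None), but the described idea, a hyper-parameter bounding the activation's output range to control sparsity and precision at run time, can be sketched as a clamped ReLU; the class name and default bound are hypothetical:

    ```python
    import torch
    import torch.nn as nn

    class AdaptiveActivationSketch(nn.Module):
        """Clamped ReLU whose upper bound can be retuned at inference time,
        trading accuracy against memory/energy without any re-training."""
        def __init__(self, max_out=6.0):
            super().__init__()
            self.max_out = max_out  # run-time knob: a tighter bound coarsens the range

        def forward(self, x):
            return torch.clamp(x, min=0.0, max=self.max_out)
    ```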

Multi-dimension unified Swin Transformer for 3D Lesion Segmentation in Multiple Anatomical Locations

  • paper_url: http://arxiv.org/abs/2309.01823
  • repo_url: None
  • paper_authors: Shaoyan Pan, Yiqiao Liu, Sarah Halek, Michal Tomaszewski, Shubing Wang, Richard Baumgartner, Jianda Yuan, Gregory Goldmacher, Antong Chen
  • for: This study aims to improve the accuracy of 3D lesion segmentation in CT imaging, providing more data for lesion growth kinetics modeling in oncology research.
  • methods: A new model, the multi-dimension unified Swin transformer (MDU-ST), pairs a Swin-transformer encoder with a CNN decoder so that both 2D and 3D inputs can be learned in the same encoder; the underlying pattern of lesion anatomy is first learned from large amounts of unlabeled 3D lesion volumes via self-supervised pretext tasks, before fine-tuning on 2D RECIST slices and then on labeled 3D volumes.
  • results: Evaluated with the Dice similarity coefficient (DSC) and Hausdorff distance (HD) on an internal dataset of 593 lesions from multiple anatomical locations, the model shows significant improvement over competing models and can support automated 3D lesion segmentation for radiomics and tumor growth modeling studies.
    Abstract In oncology research, accurate 3D segmentation of lesions from CT scans is essential for the modeling of lesion growth kinetics. However, following the RECIST criteria, radiologists routinely only delineate each lesion on the axial slice showing the largest transverse area, and delineate a small number of lesions in 3D for research purposes. As a result, we have plenty of unlabeled 3D volumes and labeled 2D images, and scarce labeled 3D volumes, which makes training a deep-learning 3D segmentation model a challenging task. In this work, we propose a novel model, denoted a multi-dimension unified Swin transformer (MDU-ST), for 3D lesion segmentation. The MDU-ST consists of a Shifted-window transformer (Swin-transformer) encoder and a convolutional neural network (CNN) decoder, allowing it to adapt to 2D and 3D inputs and learn the corresponding semantic information in the same encoder. Based on this model, we introduce a three-stage framework: 1) leveraging large amount of unlabeled 3D lesion volumes through self-supervised pretext tasks to learn the underlying pattern of lesion anatomy in the Swin-transformer encoder; 2) fine-tune the Swin-transformer encoder to perform 2D lesion segmentation with 2D RECIST slices to learn slice-level segmentation information; 3) further fine-tune the Swin-transformer encoder to perform 3D lesion segmentation with labeled 3D volumes. The network's performance is evaluated by the Dice similarity coefficient (DSC) and Hausdorff distance (HD) using an internal 3D lesion dataset with 593 lesions extracted from multiple anatomical locations. The proposed MDU-ST demonstrates significant improvement over the competing models. The proposed method can be used to conduct automated 3D lesion segmentation to assist radiomics and tumor growth modeling studies. This paper has been accepted by the IEEE International Symposium on Biomedical Imaging (ISBI) 2023.

Instant Continual Learning of Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.01811
  • repo_url: None
  • paper_authors: Ryan Po, Zhengyang Dong, Alexander W. Bergman, Gordon Wetzstein
  • for: novel-view synthesis and 3D scene reconstruction
  • methods: replay-based methods combined with a hybrid explicit–implicit scene representation
  • results: higher reconstruction quality and faster training than previous methods
    Abstract Neural radiance fields (NeRFs) have emerged as an effective method for novel-view synthesis and 3D scene reconstruction. However, conventional training methods require access to all training views during scene optimization. This assumption may be prohibitive in continual learning scenarios, where new data is acquired in a sequential manner and a continuous update of the NeRF is desired, as in automotive or remote sensing applications. When naively trained in such a continual setting, traditional scene representation frameworks suffer from catastrophic forgetting, where previously learned knowledge is corrupted after training on new data. Prior works in alleviating forgetting with NeRFs suffer from low reconstruction quality and high latency, making them impractical for real-world application. We propose a continual learning framework for training NeRFs that leverages replay-based methods combined with a hybrid explicit--implicit scene representation. Our method outperforms previous methods in reconstruction quality when trained in a continual setting, while having the additional benefit of being an order of magnitude faster.
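    A minimal sketch of the replay idea the method builds on: keep a buffer of rays (or pixels) from earlier views and mix them into each new training batch. The reservoir-sampled buffer below is a generic design under our assumptions, not the authors' implementation:

    ```python
    import random

    class RayReplayBuffer:
        """Reservoir-sampled buffer of past training rays, used to revisit
        earlier views and mitigate catastrophic forgetting."""
        def __init__(self, capacity=100_000):
            self.capacity, self.items, self.seen = capacity, [], 0

        def add(self, ray):
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(ray)
            else:
                j = random.randrange(self.seen)  # classic reservoir sampling
                if j < self.capacity:
                    self.items[j] = ray

        def sample(self, k):
            return random.sample(self.items, min(k, len(self.items)))
    ```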

Accuracy and Consistency of Space-based Vegetation Height Maps for Forest Dynamics in Alpine Terrain

  • paper_url: http://arxiv.org/abs/2309.01797
  • repo_url: None
  • paper_authors: Yuchang Jiang, Marius Rüetschi, Vivien Sainte Fare Garnot, Mauro Marty, Konrad Schindler, Christian Ginzler, Jan D. Wegner
  • for: To increase the temporal resolution of the Swiss National Forest Inventory's countrywide vegetation height mapping.
  • methods: Spaceborne remote sensing (Sentinel-2 imagery) and deep learning are used to generate annual, countrywide vegetation height maps at a 10-meter ground sampling distance for 2017-2020, validated by a stratified analysis against a precise Airborne Laser Scanning reference dataset.
  • results: The resulting annual, countrywide maps support change detection down to changes as small as 250 square meters, and larger-scale changes caused by a winter storm are detected with an F1-score of 0.77.
    Abstract Monitoring and understanding forest dynamics is essential for environmental conservation and management. This is why the Swiss National Forest Inventory (NFI) provides countrywide vegetation height maps at a spatial resolution of 0.5 m. Its long update time of 6 years, however, limits the temporal analysis of forest dynamics. This can be improved by using spaceborne remote sensing and deep learning to generate large-scale vegetation height maps in a cost-effective way. In this paper, we present an in-depth analysis of these methods for operational application in Switzerland. We generate annual, countrywide vegetation height maps at a 10-meter ground sampling distance for the years 2017 to 2020 based on Sentinel-2 satellite imagery. In comparison to previous works, we conduct a large-scale and detailed stratified analysis against a precise Airborne Laser Scanning reference dataset. This stratified analysis reveals a close relationship between the model accuracy and the topology, especially slope and aspect. We assess the potential of deep learning-derived height maps for change detection and find that these maps can indicate changes as small as 250 $m^2$. Larger-scale changes caused by a winter storm are detected with an F1-score of 0.77. Our results demonstrate that vegetation height maps computed from satellite imagery with deep learning are a valuable, complementary, cost-effective source of evidence to increase the temporal resolution for national forest assessments.

Safe and Robust Watermark Injection with a Single OoD Image

  • paper_url: http://arxiv.org/abs/2309.01786
  • repo_url: None
  • paper_authors: Shuyang Yu, Junyuan Hong, Haobo Zhang, Haotao Wang, Zhangyang Wang, Jiayu Zhou
  • for: Protecting the intellectual property and commercial ownership of deep neural network models.
  • methods: A single out-of-distribution (OoD) image serves as a secret key for IP verification, and model parameters are randomly perturbed during watermark injection to defend against common watermark removal attacks such as fine-tuning, pruning, and model extraction (see the sketch after the abstract).
  • results: The proposed watermark injection technique is safe and robust, requires no training data, and keeps the watermark verifiable across modified model versions while remaining time- and sample-efficient.
    Abstract Training a high-performance deep neural network requires large amounts of data and computational resources. Protecting the intellectual property (IP) and commercial ownership of a deep model is challenging yet increasingly crucial. A major stream of watermarking strategies implants verifiable backdoor triggers by poisoning training samples, but these are often unrealistic due to data privacy and safety concerns and are vulnerable to minor model changes such as fine-tuning. To overcome these challenges, we propose a safe and robust backdoor-based watermark injection technique that leverages the diverse knowledge from a single out-of-distribution (OoD) image, which serves as a secret key for IP verification. The independence of training data makes it agnostic to third-party promises of IP security. We induce robustness via random perturbation of model parameters during watermark injection to defend against common watermark removal attacks, including fine-tuning, pruning, and model extraction. Our experimental results demonstrate that the proposed watermarking approach is not only time- and sample-efficient without training data, but also robust against the watermark removal attacks above.
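    A minimal sketch of the random parameter perturbation applied during watermark injection so the watermark survives later weight changes; the Gaussian form and noise scale are our illustrative assumptions:

    ```python
    import torch

    def perturb_parameters(model, sigma=0.01):
        """Add small Gaussian noise to every weight, simulating the kind of
        parameter drift (fine-tuning, pruning) the watermark must survive."""
        with torch.no_grad():
            for p in model.parameters():
                p.add_(torch.randn_like(p) * sigma)
    ```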

StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

  • paper_url: http://arxiv.org/abs/2309.01770
  • repo_url: None
  • paper_authors: Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo
  • for: This work proposes a LoRA-free image stylization method that generates an output image from a text prompt and style reference images in a single pass, without training a separate LoRA for each style.
  • methods: The method uses two components: a two-path cross-attention module (TPCA) and three decoupling strategies. These let the model process the text prompt and style reference features separately and reduce the strong coupling between semantic and style information in the style references, improving image quality and diversity.
  • results: Experiments show the method generates high-quality images that match the content of the prompts and adopt the style of the references (including unseen styles) in a single pass, making it more flexible and efficient than existing methods.
    Abstract This paper presents a LoRA-free method for stylized image generation that takes a text prompt and style reference images as inputs and produces an output image in a single pass. Unlike existing methods that rely on training a separate LoRA for each style, our method can adapt to various styles with a unified model. However, this poses two challenges: 1) the prompt loses controllability over the generated content, and 2) the output image inherits both the semantic and style features of the style reference image, compromising its content fidelity. To address these challenges, we introduce StyleAdapter, a model that comprises two components: a two-path cross-attention module (TPCA) and three decoupling strategies. These components enable our model to process the prompt and style reference features separately and reduce the strong coupling between the semantic and style information in the style references. StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods. Experiments have been conducted to demonstrate the superiority of our method over previous works.

BLiSS: Bootstrapped Linear Shape Space

  • paper_url: http://arxiv.org/abs/2309.01765
  • repo_url: None
  • paper_authors: Sanjeev Muralikrishnan, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra
  • for: Creating morphable models of human shape without the tedious and expensive manual effort of establishing dense correspondences across raw scans.
  • methods: Starting from a small set of manually registered scans to bootstrap the process, the shape space is progressively enriched and then used to bring new unregistered scans into correspondence automatically; a non-linear deformation model captures details missed by the low-dimensional shape space.
  • results: The proposed BLiSS method solves shape-space construction and dense correspondence jointly and progressively, automatically registering new scans against previously registered ones.
    Abstract Morphable models are fundamental to numerous human-centered processes as they offer a simple yet expressive shape space. Creating such morphable models, however, is both tedious and expensive. The main challenge is establishing dense correspondences across raw scans that capture sufficient shape variation. This is often addressed using a mix of significant manual intervention and non-rigid registration. We observe that creating a shape space and solving for dense correspondence are tightly coupled -- while dense correspondence is needed to build shape spaces, an expressive shape space provides a reduced dimensional space to regularize the search. We introduce BLiSS, a method to solve both progressively. Starting from a small set of manually registered scans to bootstrap the process, we enrich the shape space and then use that to get new unregistered scans into correspondence automatically. The critical component of BLiSS is a non-linear deformation model that captures details missed by the low-dimensional shape space, thus allowing progressive enrichment of the space.

Multispectral Indices for Wildfire Management

  • paper_url: http://arxiv.org/abs/2309.01751
  • repo_url: None
  • paper_authors: Afonso Oliveira, João P. Matos-Carvalho, Filipe Moutinho, Nuno Fachada
  • for: This paper summarizes the most important multispectral indices and associated methodologies for wildfire prevention and management.
  • methods: The paper surveys the fields where multispectral indices align with wildfire prevention and management, including vegetation and soil attribute extraction, water feature mapping, artificial structure identification, and post-fire burnt area estimation.
  • results: The paper emphasizes the versatility and effectiveness of multispectral indices in wildfire management and suggests concrete indices for each task, such as the NDVI and the NDWI (formulas sketched after the abstract). To enhance accuracy and address the limitations of individual indices, it recommends integrating complementary processing solutions and additional data sources such as high-resolution imagery and ground-based measurements.
    Abstract This paper highlights and summarizes the most important multispectral indices and associated methodologies for fire management. Various fields of study are examined where multispectral indices align with wildfire prevention and management, including vegetation and soil attribute extraction, water feature mapping, artificial structure identification, and post-fire burnt area estimation. The versatility and effectiveness of multispectral indices in addressing specific issues in wildfire management are emphasized. Fundamental insights for optimizing data extraction are presented. Concrete indices for each task, including the NDVI and the NDWI, are suggested. Moreover, to enhance accuracy and address inherent limitations of individual index applications, the integration of complementary processing solutions and additional data sources like high-resolution imagery and ground-based measurements is recommended. This paper aims to be an immediate and comprehensive reference for researchers and stakeholders working on multispectral indices related to the prevention and management of fires.
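    The two indices recommended above have standard closed forms; a minimal sketch (the small epsilon guarding against division by zero is ours):

    ```python
    import numpy as np

    def ndvi(nir, red):
        """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
        return (nir - red) / (nir + red + 1e-9)

    def ndwi(green, nir):
        """Normalized Difference Water Index (McFeeters): (Green - NIR) / (Green + NIR)."""
        return (green - nir) / (green + nir + 1e-9)
    ```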

Generative-based Fusion Mechanism for Multi-Modal Tracking

  • paper_url: http://arxiv.org/abs/2309.01728
  • repo_url: https://github.com/zhangyong-tang/gmmt
  • paper_authors: Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Xiao-Jun Wu, Josef Kittler
  • for: This work explores how generative model techniques can address the information fusion challenge in multi-modal tracking.
  • methods: Two prominent generative techniques are studied, namely Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs). Unlike the standard fusion process, where each modality's features are fed directly into the fusion block, the multi-modal features are conditioned with random noise in the generative framework, effectively transforming the original training samples into harder instances (a minimal sketch follows the abstract).
  • results: Experiments across two multi-modal tracking tasks, three baseline methods, and three challenging benchmarks show that the generative fusion mechanism achieves state-of-the-art performance, setting new records on LasHeR and RGBD1K.
    Abstract Generative models (GMs) have received increasing research interest for their remarkable capacity to achieve comprehensive understanding. However, their potential application in the domain of multi-modal tracking has remained relatively unexplored. In this context, we seek to uncover the potential of harnessing generative techniques to address the critical challenge, information fusion, in multi-modal tracking. In this paper, we delve into two prominent GM techniques, namely, Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs). Different from the standard fusion process where the features from each modality are directly fed into the fusion block, we condition these multi-modal features with random noise in the GM framework, effectively transforming the original training samples into harder instances. This design excels at extracting discriminative clues from the features, enhancing the ultimate tracking performance. To quantitatively gauge the effectiveness of our approach, we conduct extensive experiments across two multi-modal tracking tasks, three baseline methods, and three challenging benchmarks. The experimental results demonstrate that the proposed generative-based fusion mechanism achieves state-of-the-art performance, setting new records on LasHeR and RGBD1K.
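    A sketch of the noise-conditioning idea described above: mixing each modality's features with Gaussian noise, diffusion-style, to turn training samples into harder instances. The variance-preserving schedule is our assumption, not the paper's exact formulation:

    ```python
    import torch

    def noise_condition(features, t):
        """t in [0, 1]: 0 keeps the features intact, 1 replaces them with noise.
        Variance-preserving mix, as in diffusion forward processes."""
        alpha = (1.0 - t) ** 0.5
        noise = torch.randn_like(features)
        return alpha * features + (1.0 - alpha ** 2) ** 0.5 * noise
    ```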

SAF-IS: a Spatial Annotation Free Framework for Instance Segmentation of Surgical Tools

  • paper_url: http://arxiv.org/abs/2309.01723
  • repo_url: None
  • paper_authors: Luca Sestini, Benoit Rosa, Elena De Momi, Giancarlo Ferrigno, Nicolas Padoy
  • for: This paper aims to develop a framework for instance segmentation of surgical instruments without requiring expensive pixel-level annotations.
  • methods: The proposed solution uses binary tool masks and binary tool presence labels to train a tool instance classifier, leveraging unsupervised binary segmentation models to obtain the masks.
  • results: The approach outperforms several state-of-the-art fully-supervised segmentation methods and is completely free from spatial annotations.
    Abstract Instance segmentation of surgical instruments is a long-standing research problem, crucial for the development of many applications for computer-assisted surgery. This problem is commonly tackled via fully-supervised training of deep learning models, requiring expensive pixel-level annotations to train. In this work, we develop a framework for instance segmentation not relying on spatial annotations for training. Instead, our solution only requires binary tool masks, obtainable using recent unsupervised approaches, and binary tool presence labels, freely obtainable in robot-assisted surgery. Based on the binary mask information, our solution learns to extract individual tool instances from single frames, and to encode each instance into a compact vector representation, capturing its semantic features. Such representations guide the automatic selection of a tiny number of instances (8 only in our experiments), displayed to a human operator for tool-type labelling. The gathered information is finally used to match each training instance with a binary tool presence label, providing an effective supervision signal to train a tool instance classifier. We validate our framework on the EndoVis 2017 and 2018 segmentation datasets. We provide results using binary masks obtained either by manual annotation or as predictions of an unsupervised binary segmentation model. The latter solution yields an instance segmentation approach completely free from spatial annotations, outperforming several state-of-the-art fully-supervised segmentation approaches.

ControlMat: A Controlled Generative Approach to Material Capture

  • paper_url: http://arxiv.org/abs/2309.01700
  • repo_url: None
  • paper_authors: Giuseppe Vecchio, Rosalie Martin, Arthur Roullier, Adrien Kaiser, Romain Rouffet, Valentin Deschaintre, Tamy Boubekeur
  • for: To generate plausible, tileable, high-resolution physically-based digital materials from a single photograph with uncontrolled illumination, formulating this ill-posed reconstruction problem as controlled synthesis.
  • methods: A diffusion model is conditioned on the input photograph; the behavior of diffusion models with multi-channel outputs is carefully analyzed, the sampling process is adapted to fuse multi-scale information, and rolled diffusion is introduced to enable both tileability and patched diffusion for high-resolution outputs.
  • results: The approach outperforms recent inference and latent-space-optimization methods, the diffusion process design choices are carefully validated, and the generative formulation further permits exploring a variety of materials consistent with the input image, mitigating the unknown lighting conditions.
    Abstract Material reconstruction from a photograph is a key component of 3D content creation democratization. We propose to formulate this ill-posed problem as a controlled synthesis one, leveraging the recent progress in generative deep networks. We present ControlMat, a method which, given a single photograph with uncontrolled illumination as input, conditions a diffusion model to generate plausible, tileable, high-resolution physically-based digital materials. We carefully analyze the behavior of diffusion models for multi-channel outputs, adapt the sampling process to fuse multi-scale information and introduce rolled diffusion to enable both tileability and patched diffusion for high-resolution outputs. Our generative approach further permits exploration of a variety of materials which could correspond to the input image, mitigating the unknown lighting conditions. We show that our approach outperforms recent inference and latent-space-optimization methods, and carefully validate our diffusion process design choices. Supplemental materials and additional details are available at: https://gvecchio.com/controlmat/.

Mask-Attention-Free Transformer for 3D Instance Segmentation

  • paper_url: http://arxiv.org/abs/2309.01692
  • repo_url: https://github.com/dvlab-research/mask-attention-free-transformer
  • paper_authors: Xin Lai, Yuhui Yuan, Ruihang Chu, Yukang Chen, Han Hu, Jiaya Jia
  • for: To speed up convergence and improve accuracy on the 3D instance segmentation task, where low-recall initial instance masks slow down mask-attention pipelines.
  • methods: The mask-attention design is abandoned in favor of an auxiliary center regression task: a dense spatial distribution of 3D locations is learned as initial position queries, and cross-attention is performed with a positional prior, using relative position encoding and iterative refinement of the queries.
  • results: The method converges 4x faster than existing work, sets a new state of the art on the ScanNetv2 3D instance segmentation benchmark, and shows superior performance across various datasets.
    Abstract Recently, transformer-based methods have dominated 3D instance segmentation, where mask attention is commonly involved. Specifically, object queries are guided by the initial instance masks in the first cross-attention, and then iteratively refine themselves in a similar manner. However, we observe that the mask-attention pipeline usually leads to slow convergence due to low-recall initial instance masks. Therefore, we abandon the mask attention design and resort to an auxiliary center regression task instead. Through center regression, we effectively overcome the low-recall issue and perform cross-attention by imposing positional prior. To reach this goal, we develop a series of position-aware designs. First, we learn a spatial distribution of 3D locations as the initial position queries. They spread over the 3D space densely, and thus can easily capture the objects in a scene with a high recall. Moreover, we present relative position encoding for the cross-attention and iterative refinement for more accurate position queries. Experiments show that our approach converges 4x faster than existing work, sets a new state of the art on ScanNetv2 3D instance segmentation benchmark, and also demonstrates superior performance across various datasets. Code and models are available at https://github.com/dvlab-research/Mask-Attention-Free-Transformer.

Prior Knowledge Guided Network for Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.01682
  • repo_url: None
  • paper_authors: Zhewen Deng, Dongyue Chen, Shizhuo Deng
  • for: video anomaly detection (VAD)
  • methods: A teacher-student architecture with an auto-encoder learns two proxy tasks (future frame prediction and teacher network imitation) for better generalization to unknown samples, and knowledge distillation on proper feature blocks increases the model's multi-scale detection ability; prediction error and teacher-student feature inconsistency are combined to score anomalies.
  • results: Experimental results on three public benchmarks show the method detects anomalous video events more accurately and efficiently than recent state-of-the-art approaches.
    Abstract Video Anomaly Detection (VAD) involves detecting anomalous events in videos, presenting a significant and intricate task within intelligent video surveillance. Existing studies often concentrate solely on features acquired from limited normal data, disregarding the latent prior knowledge present in extensive natural image datasets. To address this constraint, we propose a Prior Knowledge Guided Network(PKG-Net) for the VAD task. First, an auto-encoder network is incorporated into a teacher-student architecture to learn two designated proxy tasks: future frame prediction and teacher network imitation, which can provide better generalization ability on unknown samples. Second, knowledge distillation on proper feature blocks is also proposed to increase the multi-scale detection ability of the model. In addition, prediction error and teacher-student feature inconsistency are combined to evaluate anomaly scores of inference samples more comprehensively. Experimental results on three public benchmarks validate the effectiveness and accuracy of our method, which surpasses recent state-of-the-arts.

Building Footprint Extraction in Dense Areas using Super Resolution and Frame Field Learning

  • paper_url: http://arxiv.org/abs/2309.01656
  • repo_url: None
  • paper_authors: Vuong Nguyen, Anh Ho, Duc-Anh Vu, Nguyen Thi Ngoc Anh, Tran Ngoc Thang
  • for: To extract accurate and fine-grained building footprints in dense areas, where overlapping buildings, challenging scene properties, and limited data defeat current methods.
  • methods: Super resolution enhances the spatial resolution of aerial imagery, which then feeds a multitask learning module consisting of a segmentation head and a frame field learning head to handle irregular building structures; the model is supervised with adaptive loss weighting to extract sharp edges and fine-grained polygons.
  • results: Experiments on a slum area in India that mimics a dense area show the proposed approach significantly outperforms current state-of-the-art methods by a large margin.
    Abstract Despite notable results on standard aerial datasets, current state-of-the-arts fail to produce accurate building footprints in dense areas due to challenging properties posed by these areas and limited data availability. In this paper, we propose a framework to address such issues in polygonal building extraction. First, super resolution is employed to enhance the spatial resolution of aerial image, allowing for finer details to be captured. This enhanced imagery serves as input to a multitask learning module, which consists of a segmentation head and a frame field learning head to effectively handle the irregular building structures. Our model is supervised by adaptive loss weighting, enabling extraction of sharp edges and fine-grained polygons which is difficult due to overlapping buildings and low data quality. Extensive experiments on a slum area in India that mimics a dense area demonstrate that our proposed approach significantly outperforms the current state-of-the-art methods by a large margin.

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

  • paper_url: http://arxiv.org/abs/2309.03350
  • repo_url: https://github.com/THUDM/RelayDiffusion
  • paper_authors: Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang
  • for: This paper addresses the challenge diffusion models face in high-resolution image generation.
  • methods: The Relay Diffusion Model (RDM) transfers a low-resolution image or noise into an equivalent high-resolution one via blurring diffusion and block noise, so the diffusion process can continue seamlessly at any new resolution or in any new model without restarting from pure noise or low-resolution conditioning.
  • results: RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM, and DiT by a large margin.
    Abstract Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that \emph{the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain}. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at \url{https://github.com/THUDM/RelayDiffusion}.
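    The frequency-domain observation can be checked directly: under an orthonormal DCT, additive white Gaussian noise stays white, so per-coefficient SNR depends only on how the image's energy concentrates across frequencies. A minimal sketch (function name ours):

    ```python
    import numpy as np
    from scipy.fft import dctn

    def dct_snr_map(image, noise_sigma):
        """Per-coefficient SNR of `image` under additive white Gaussian noise.
        White noise has flat power noise_sigma**2 in every orthonormal DCT bin,
        so at higher resolutions more signal energy lands in low-frequency bins,
        raising their SNR -- the effect RDM is designed around."""
        coeffs = dctn(image.astype(np.float64), norm="ortho")
        return coeffs ** 2 / noise_sigma ** 2
    ```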

ReLoc-PDR: Visual Relocalization Enhanced Pedestrian Dead Reckoning via Graph Optimization

  • paper_url: http://arxiv.org/abs/2309.01646
  • repo_url: None
  • paper_authors: Zongyang Chen, Xianfei Pan, Changhao Chen
  • for: Accurately and reliably positioning pedestrians in satellite-denied conditions using low-cost inertial sensors.
  • methods: Pedestrian dead reckoning (PDR) is combined with visual relocalization through graph optimization; time-correlated visual observations and learned descriptors provide robust positioning in visually-degraded environments, and a Tukey-kernel fusion mechanism corrects cumulative errors while mitigating abnormal visual observations.
  • results: Real-world experiments show that ReLoc-PDR surpasses representative methods in accuracy and robustness, achieving accurate pedestrian positioning with only a smartphone in challenging environments such as less-textured corridors and dark nighttime scenarios.
    Abstract Accurately and reliably positioning pedestrians in satellite-denied conditions remains a significant challenge. Pedestrian dead reckoning (PDR) is commonly employed to estimate pedestrian location using low-cost inertial sensor. However, PDR is susceptible to drift due to sensor noise, incorrect step detection, and inaccurate stride length estimation. This work proposes ReLoc-PDR, a fusion framework combining PDR and visual relocalization using graph optimization. ReLoc-PDR leverages time-correlated visual observations and learned descriptors to achieve robust positioning in visually-degraded environments. A graph optimization-based fusion mechanism with the Tukey kernel effectively corrects cumulative errors and mitigates the impact of abnormal visual observations. Real-world experiments demonstrate that our ReLoc-PDR surpasses representative methods in accuracy and robustness, achieving accurte and robust pedestrian positioning results using only a smartphone in challenging environments such as less-textured corridors and dark nighttime scenarios.
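    For context, a single PDR propagation step, the operation whose errors accumulate into drift and which the visual relocalization corrects; the fixed stride length is an illustrative assumption (real systems estimate it per step):

    ```python
    import numpy as np

    def pdr_step(position_xy, heading_rad, stride_m=0.7):
        """Advance the 2D position by one detected step along the current heading."""
        return position_xy + stride_m * np.array(
            [np.cos(heading_rad), np.sin(heading_rad)])
    ```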

Cross-Consistent Deep Unfolding Network for Adaptive All-In-One Video Restoration

  • paper_url: http://arxiv.org/abs/2309.01627
  • repo_url: None
  • paper_authors: Yuanshuo Cheng, Mingwen Shao, Yecong Wan, Lixu Zhang, Wangmeng Zuo, Deyu Meng
  • for: To make video restoration (VR) methods practical by removing the need to deploy a separate model for each adverse-weather degradation.
  • methods: The proposed Cross-consistent Deep Unfolding Network (CDUN) removes diverse degradations with a single model: an iterative optimization framework restores frames according to degradation features estimated in advance by a Sequence-wise Adaptive Degradation Estimator (SADE), and a window-based inter-frame fusion strategy progressively stacks temporal windows to enlarge the temporal receptive field.
  • results: Extensive experiments show the proposed method achieves state-of-the-art performance in All-In-One VR.
    Abstract Existing Video Restoration (VR) methods always necessitate the individual deployment of models for each adverse weather to remove diverse adverse weather degradations, lacking the capability for adaptive processing of degradations. Such limitation amplifies the complexity and deployment costs in practical applications. To overcome this deficiency, in this paper, we propose a Cross-consistent Deep Unfolding Network (CDUN) for All-In-One VR, which enables the employment of a single model to remove diverse degradations for the first time. Specifically, the proposed CDUN accomplishes a novel iterative optimization framework, capable of restoring frames corrupted by corresponding degradations according to the degradation features given in advance. To empower the framework for eliminating diverse degradations, we devise a Sequence-wise Adaptive Degradation Estimator (SADE) to estimate degradation features for the input corrupted video. By orchestrating these two cascading procedures, CDUN achieves adaptive processing for diverse degradation. In addition, we introduce a window-based inter-frame fusion strategy to utilize information from more adjacent frames. This strategy involves the progressive stacking of temporal windows in multiple iterations, effectively enlarging the temporal receptive field and enabling each frame's restoration to leverage information from distant frames. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance in All-In-One VR.

AGG-Net: Attention Guided Gated-convolutional Network for Depth Image Completion

  • paper_url: http://arxiv.org/abs/2309.01624
  • repo_url: None
  • paper_authors: Dongyue Chen, Tingxuan Huang, Zhimin Song, Shizhuo Deng, Tong Jia
  • for: Improving the quality of depth images captured by RGB-D cameras, which inevitably contain invalid data such as weak reflections, boundary shadows, and artifacts.
  • methods: An Attention Guided Gated-convolutional Network (AGG-Net) with a UNet-like architecture of two parallel depth and color branches; an Attention Guided Gated-Convolution (AG-GConv) module fuses depth and color features at different scales to reduce the impact of invalid depth data, and an Attention Guided Skip Connection (AG-SC) module avoids introducing depth-irrelevant features into the reconstruction (a generic gated-convolution sketch follows the abstract).
  • results: The method outperforms state-of-the-art approaches on the NYU-Depth V2, DIML, and SUN RGB-D benchmarks.
    Abstract Recently, stereo vision based on lightweight RGBD cameras has been widely used in various fields. However, limited by the imaging principles, the commonly used RGB-D cameras based on TOF, structured light, or binocular vision acquire some invalid data inevitably, such as weak reflection, boundary shadows, and artifacts, which may bring adverse impacts to the follow-up work. In this paper, we propose a new model for depth image completion based on the Attention Guided Gated-convolutional Network (AGG-Net), through which more accurate and reliable depth images can be obtained from the raw depth maps and the corresponding RGB images. Our model employs a UNet-like architecture which consists of two parallel branches of depth and color features. In the encoding stage, an Attention Guided Gated-Convolution (AG-GConv) module is proposed to realize the fusion of depth and color features at different scales, which can effectively reduce the negative impacts of invalid depth data on the reconstruction. In the decoding stage, an Attention Guided Skip Connection (AG-SC) module is presented to avoid introducing too many depth-irrelevant features to the reconstruction. The experimental results demonstrate that our method outperforms the state-of-the-art methods on the popular benchmarks NYU-Depth V2, DIML, and SUN RGB-D.
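    As referenced in the methods bullet, here is a generic gated convolution, the building block that AG-GConv extends with attention guidance; this plain version is our sketch, not the paper's module, and shows how a learned soft gate can down-weight features from invalid depth regions:

    ```python
    import torch
    import torch.nn as nn

    class GatedConv2d(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
            self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

        def forward(self, x):
            # Sigmoid gate in [0, 1] suppresses features from unreliable pixels
            return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))
    ```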

Hindering Adversarial Attacks with Multiple Encrypted Patch Embeddings

  • paper_url: http://arxiv.org/abs/2309.01620
  • repo_url: None
  • paper_authors: AprilPyone MaungMaung, Isao Echizen, Hitoshi Kiya
  • for: This paper proposes a new key-based defense against adversarial examples that targets both efficiency and robustness.
  • methods: Building on previous key-based defenses, two major improvements are made: (1) efficient training and (2) optional randomization. The defense uses one or more secret patch embeddings and classifier heads with a pre-trained isotropic network; with multiple secret embeddings, randomization at inference becomes possible.
  • results: Evaluated on ImageNet against an arsenal of state-of-the-art attacks, including adaptive ones, the defense achieves high robust accuracy and clean accuracy comparable to the previous key-based defense.
    Abstract In this paper, we propose a new key-based defense focusing on both efficiency and robustness. Although the previous key-based defense seems effective in defending against adversarial examples, carefully designed adaptive attacks can bypass the previous defense, and it is difficult to train the previous defense on large datasets like ImageNet. We build upon the previous defense with two major improvements: (1) efficient training and (2) optional randomization. The proposed defense utilizes one or more secret patch embeddings and classifier heads with a pre-trained isotropic network. When more than one secret embeddings are used, the proposed defense enables randomization on inference. Experiments were carried out on the ImageNet dataset, and the proposed defense was evaluated against an arsenal of state-of-the-art attacks, including adaptive ones. The results show that the proposed defense achieves a high robust accuracy and a comparable clean accuracy compared to the previous key-based defense.

On the Query Strategies for Efficient Online Active Distillation

  • paper_url: http://arxiv.org/abs/2309.01612
  • repo_url: None
  • paper_authors: Michele Boldo, Enrico Martini, Mirco De Marchi, Stefano Aldegheri, Nicola Bombieri
  • for: To improve the training efficiency and real-time adaptation of Human Pose Estimation (HPE) models through Active Learning (AL) and online distillation.
  • methods: A set of query strategies for selecting training frames is evaluated using two approaches: a classical offline method and online evaluation through a continual learning approach employing knowledge distillation, on a popular state-of-the-art HPE dataset (the standard distillation loss is sketched after the abstract).
  • results: Selecting the right frames for training reduces computational cost while preserving accuracy, enabling lightweight models to be trained at the edge and adapted effectively to new contexts in real time.
    Abstract Deep Learning (DL) requires lots of time and data, resulting in high computational demands. Recently, researchers employ Active Learning (AL) and online distillation to enhance training efficiency and real-time model adaptation. This paper evaluates a set of query strategies to achieve the best training results. It focuses on Human Pose Estimation (HPE) applications, assessing the impact of selected frames during training using two approaches: a classical offline method and a online evaluation through a continual learning approach employing knowledge distillation, on a popular state-of-the-art HPE dataset. The paper demonstrates the possibility of enabling training at the edge lightweight models, adapting them effectively to new contexts in real-time.
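    The online distillation step relies on the standard soft-target loss (Hinton et al.); a minimal sketch with an assumed temperature:

    ```python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=4.0):
        """KL divergence between temperature-softened teacher and student
        distributions; the T*T factor keeps gradient magnitudes comparable."""
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
    ```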

Segmentation of 3D pore space from CT images using curvilinear skeleton: application to numerical simulation of microbial decomposition

  • paper_url: http://arxiv.org/abs/2309.01611
  • repo_url: None
  • paper_authors: Olivier Monga, Zakaria Belghali, Mouad Klai, Lucie Druoton, Dominique Michelucci, Valerie Pot
  • for: The paper presents a new method for describing the pore space of soil using the curvilinear skeleton, improving the accuracy and efficiency of numerical simulations of microbial decomposition and diffusion processes.
  • methods: The authors extract the pore space of soil from 3D X-ray CT scanner images and segment it into connected regions using the curvilinear skeleton; the results are compared with methods based on other geometric representations of pore space, such as balls and voxels.
  • results: The authors validate the simulation outputs across the different pore-space geometrical representations, showing that the curvilinear skeleton-based method yields more accurate and efficient simulations of microbial decomposition and diffusion in soil.
    Abstract Recent advances in 3D X-ray Computed Tomographic (CT) sensors have stimulated research efforts to unveil the extremely complex micro-scale processes that control the activity of soil microorganisms. Voxel-based description (up to hundreds millions voxels) of the pore space can be extracted, from grey level 3D CT scanner images, by means of simple image processing tools. Classical methods for numerical simulation of biological dynamics using mesh of voxels, such as Lattice Boltzmann Model (LBM), are too much time consuming. Thus, the use of more compact and reliable geometrical representations of pore space can drastically decrease the computational cost of the simulations. Several recent works propose basic analytic volume primitives (e.g. spheres, generalized cylinders, ellipsoids) to define a piece-wise approximation of pore space for numerical simulation of draining, diffusion and microbial decomposition. Such approaches work well but the drawback is that it generates approximation errors. In the present work, we study another alternative where pore space is described by means of geometrically relevant connected subsets of voxels (regions) computed from the curvilinear skeleton. Indeed, many works use the curvilinear skeleton (3D medial axis) for analyzing and partitioning 3D shapes within various domains (medicine, material sciences, petroleum engineering, etc.) but only a few ones in soil sciences. Within the context of soil sciences, most studies dealing with 3D medial axis focus on the determination of pore throats. Here, we segment pore space using curvilinear skeleton in order to achieve numerical simulation of microbial decomposition (including diffusion processes). We validate simulation outputs by comparison with other methods using different pore space geometrical representations (balls, voxels).
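
A rough sketch of the skeleton-based partitioning pipeline, assuming a recent scikit-image whose `skeletonize` accepts 3D volumes, and using a synthetic binary volume in place of real CT data:

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize

# Stand-in binary pore space (True = pore); in practice this comes from
# thresholding the grey-level 3D CT image.
pore = np.random.rand(64, 64, 64) > 0.7

# Curvilinear skeleton (3D medial axis) of the pore network.
skeleton = skeletonize(pore)

# Partition the skeleton into connected components; each component seeds
# one geometrically relevant connected subset of voxels (a region).
labels, num_regions = ndimage.label(skeleton)
print(f"{num_regions} connected skeleton regions")
```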

Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models

  • paper_url: http://arxiv.org/abs/2309.01590
  • repo_url: https://github.com/kdst-team/probablistic_precision_recall
  • paper_authors: Dogyun Park, Suhyun Kim
  • for: This paper focuses on evaluating the fidelity and diversity of generative models, specifically addressing the limitations of existing k-Nearest Neighbor (kNN) based precision-recall metrics.
  • methods: The authors propose novel metrics, P-precision and P-recall (PP&PR), based on a probabilistic approach to address the oversimplified assumptions and undesirable properties of kNN.
  • results: The authors show through extensive toy experiments and state-of-the-art generative models that their PP&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics.
    Abstract Assessing the fidelity and diversity of generative models is a difficult but important issue for technological advancement. Recent papers have therefore introduced k-Nearest Neighbor ($k$NN) based precision-recall metrics to break down the statistical distance into fidelity and diversity. While they provide an intuitive method, we thoroughly analyze these metrics and identify oversimplified assumptions and undesirable properties of kNN that result in unreliable evaluation, such as susceptibility to outliers and insensitivity to distributional changes. Thus, we propose novel metrics, P-precision and P-recall (PP\&PR), based on a probabilistic approach that addresses these problems. Through extensive investigations on toy experiments and state-of-the-art generative models, we show that our PP\&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics. The codes are available at \url{https://github.com/kdst-team/Probablistic_precision_recall}.
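
For reference, a compact sketch of the kNN-based precision-recall estimator that the paper critiques: a generated sample counts as "precise" if it falls inside the k-NN hypersphere of at least one real sample, and symmetrically for recall. A single outlying real sample inflates its hypersphere and hence the scores, which is exactly the failure mode PP&PR targets. The implementation below is a generic reconstruction, not the authors' code.

```python
import numpy as np

def knn_radii(feats: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour (self excluded)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d.sort(axis=1)          # column 0 is the point itself (distance 0)
    return d[:, k]

def knn_precision_recall(real: np.ndarray, fake: np.ndarray, k: int = 3):
    """Fidelity (precision) and diversity (recall) via kNN hyperspheres."""
    r_real = knn_radii(real, k)
    r_fake = knn_radii(fake, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    precision = (d <= r_real[None, :]).any(axis=1).mean()
    recall = (d.T <= r_fake[None, :]).any(axis=1).mean()
    return precision, recall
```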

SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices

  • paper_url: http://arxiv.org/abs/2309.01587
  • repo_url: None
  • paper_authors: Alexander Montgomerie-Corcoran, Petros Toupas, Zhewen Yu, Christos-Savvas Bouganis
  • for: Solving the object detection problem at the heart of modern intelligent vision and image processing tasks, enabling real-life applications ranging from autonomous driving to medical imaging.
  • methods: A streaming architecture and an automated toolflow that accelerate YOLO models, tackling the challenge of deploying modern object detection models onto FPGA devices.
  • results: The toolflow generates high-performance FPGA accelerators that are competitive with GPU devices and outperform the current state-of-the-art FPGA accelerators.
    Abstract AI has led to significant advancements in computer vision and image processing tasks, enabling a wide range of applications in real-life scenarios, from autonomous vehicles to medical imaging. Many of those applications require efficient object detection algorithms and complementary real-time, low-latency hardware to perform inference of these algorithms. The YOLO family of models is considered the most efficient for object detection, having only a single model pass. Despite this, the complexity and size of YOLO models can be too computationally demanding for current edge-based platforms. To address this, we present SATAY: a Streaming Architecture Toolflow for Accelerating YOLO. This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low latency applications, enabling real-time, edge-based object detection. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. These accelerators are generated using an automated toolflow, and can target a range of suitable FPGA devices. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources. Our toolflow is able to generate accelerator designs which demonstrate competitive performance and energy characteristics to GPU devices, and which outperform current state-of-the-art FPGA accelerators.

Improving Visual Quality and Transferability of Adversarial Attacks on Face Recognition Simultaneously with Adversarial Restoration

  • paper_url: http://arxiv.org/abs/2309.01582
  • repo_url: None
  • paper_authors: Fengfan Zhou, Hefei Ling, Yuxuan Shi, Jiazhong Chen, Ping Li
  • for: Improving the visual quality and transferability of adversarial face examples.
  • methods: A novel adversarial attack technique, Adversarial Restoration (AdvRestore), which leverages a face Restoration Latent Diffusion Model (RLDM) to enhance both the visual quality and the transferability of adversarial face examples.
  • results: Experiments show that AdvRestore substantially improves both the transferability and the visual quality of adversarial face examples.
    Abstract Adversarial face examples possess two critical properties: Visual Quality and Transferability. However, existing approaches rarely address these properties simultaneously, leading to subpar results. To address this issue, we propose a novel adversarial attack technique known as Adversarial Restoration (AdvRestore), which enhances both visual quality and transferability of adversarial face examples by leveraging a face restoration prior. In our approach, we initially train a Restoration Latent Diffusion Model (RLDM) designed for face restoration. Subsequently, we employ the inference process of RLDM to generate adversarial face examples. The adversarial perturbations are applied to the intermediate features of RLDM. Additionally, by treating RLDM face restoration as a sibling task, the transferability of the generated adversarial face examples is further improved. Our experimental results validate the effectiveness of the proposed attack method.
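
Since RLDM itself is not described in implementation detail here, the sketch below only illustrates the general mechanism of optimizing an additive adversarial perturbation on an intermediate feature map through a PyTorch forward hook; the model, layer, and loss are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

def attack_intermediate_features(model: nn.Module, layer: nn.Module,
                                 x: torch.Tensor, loss_fn,
                                 steps: int = 10, alpha: float = 1e-2):
    """I-FGSM-style ascent on a perturbation added to `layer`'s output,
    rather than to the input image."""
    delta = None

    def hook(_module, _inputs, output):
        nonlocal delta
        if delta is None:  # lazily match the feature shape on first pass
            delta = torch.zeros_like(output, requires_grad=True)
        return output + delta

    handle = layer.register_forward_hook(hook)
    try:
        for _ in range(steps):
            loss = loss_fn(model(x))
            (grad,) = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += alpha * grad.sign()  # ascend to maximize the loss
    finally:
        handle.remove()
    return delta.detach()
```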

DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion

  • paper_url: http://arxiv.org/abs/2309.01575
  • repo_url: None
  • paper_authors: Cédric Rommel, Eduardo Valle, Mickaël Chen, Souhaiel Khalfaoui, Renaud Marlet, Matthieu Cord, Patrick Pérez
  • for: Proposing DiffHPE, a novel 3D human pose estimation method that leverages diffusion models to improve the accuracy, robustness, and coherence of pose estimates.
  • methods: Combines diffusion models with existing supervised models to refine the precision and reliability of pose estimation.
  • results: Diffusion improves the time-coherence and sagittal symmetry of predictions and yields more stable estimates under occlusion; the method performs strongly on the Human3.6M dataset and remains robust even when occlusion patterns differ between training and inference.
    Abstract We present an innovative approach to 3D Human Pose Estimation (3D-HPE) by integrating cutting-edge diffusion models, which have revolutionized diverse fields, but are relatively unexplored in 3D-HPE. We show that diffusion models enhance the accuracy, robustness, and coherence of human pose estimations. We introduce DiffHPE, a novel strategy for harnessing diffusion models in 3D-HPE, and demonstrate its ability to refine standard supervised 3D-HPE. We also show how diffusion models lead to more robust estimations in the face of occlusions, and improve the time-coherence and the sagittal symmetry of predictions. Using the Human3.6M dataset, we illustrate the effectiveness of our approach and its superiority over existing models, even under adverse situations where the occlusion patterns in training do not match those in inference. Our findings indicate that while standalone diffusion models provide commendable performance, their accuracy is even better in combination with supervised models, opening exciting new avenues for 3D-HPE research.

Raw Data Is All You Need: Virtual Axle Detector with Enhanced Receptive Field

  • paper_url: http://arxiv.org/abs/2309.01574
  • repo_url: None
  • paper_authors: Henik Riedel, Robert Steven Lorenzen, Clemens Hübler
  • for: Developing a new axle detection method that enables real-time application of Bridge Weigh-In-Motion (BWIM) systems without dedicated axle detectors.
  • methods: Adapts the Virtual Axle Detector (VAD) model to process raw acceleration data, which enlarges the receptive field.
  • results: Compared with the state-of-the-art VAD, the proposed Virtual Axle Detector with Enhanced Receptive field (VADER) improves the \(F_1\) score by 73% and spatial accuracy by 39%, while cutting computational and memory costs by 99%. With a representative training set and functional sensors, VADER reaches an \(F_1\) score of 99.4% and a spatial error of 4.13 cm. The paper also proposes an object-size-driven receptive-field rule for CNN design, whose results suggest that raw-data models can outperform spectrogram-based ones, making a case for raw data as input.
    Abstract Rising maintenance costs of ageing infrastructure necessitate innovative monitoring techniques. This paper presents a new approach for axle detection, enabling real-time application of Bridge Weigh-In-Motion (BWIM) systems without dedicated axle detectors. The proposed method adapts the Virtual Axle Detector (VAD) model to handle raw acceleration data, which allows the receptive field to be increased. The proposed Virtual Axle Detector with Enhanced Receptive field (VADER) improves the \(F_1\) score by 73\% and spatial accuracy by 39\%, while cutting computational and memory costs by 99\% compared to the state-of-the-art VAD. VADER reaches a \(F_1\) score of 99.4\% and a spatial error of 4.13 cm when using a representative training set and functional sensors. We also introduce a novel receptive field (RF) rule for an object-size driven design of Convolutional Neural Network (CNN) architectures. Based on this rule, our results suggest that models using raw data could achieve better performance than those using spectrograms, offering a compelling reason to consider raw data as input.
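
The object-size-driven RF rule presupposes computing a stack's receptive field; the standard recurrence is RF_l = RF_{l-1} + (k_l - 1) * j_{l-1}, where j is the cumulative stride ("jump"). A minimal sketch with a made-up layer configuration, not VADER's:

```python
def receptive_field(layers) -> int:
    """Receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# e.g. check how many raw acceleration samples a stack actually "sees":
print(receptive_field([(7, 2), (3, 2), (3, 2), (3, 2), (3, 1)]))  # -> 67
```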

Locality-Aware Hyperspectral Classification

  • paper_url: http://arxiv.org/abs/2309.01561
  • repo_url: https://github.com/zhoufangqin/hylite
  • paper_authors: Fangqin Zhou, Mert Kilickaya, Joaquin Vanschoren
  • for: Improving hyperspectral image classification accuracy by exploiting locality in vision-transformer-based classification.
  • methods: Three contributions: (i) the Hyperspectral Locality-aware Image TransformEr (HyLITE), which models both local and spectral information; (ii) a novel regularization function that promotes the integration of local-to-global information; (iii) an extensive comparison against competing baselines.
  • results: The proposed approach outperforms competing baselines by a significant margin in hyperspectral classification, with accuracy gains of up to 10%.
    Abstract Hyperspectral image classification is gaining popularity for high-precision vision tasks in remote sensing, thanks to the ability of hyperspectral images to capture visual information across a wide continuum of spectra. Researchers have been working on automating hyperspectral image classification, with recent efforts leveraging Vision Transformers. However, most research models only spectral information and neglects locality (i.e., neighboring pixels), which may not be sufficiently discriminative, resulting in performance limitations. To address this, we present three contributions: i) We introduce the Hyperspectral Locality-aware Image TransformEr (HyLITE), a vision transformer that models both local and spectral information, ii) A novel regularization function that promotes the integration of local-to-global information, and iii) Our proposed approach outperforms competing baselines by a significant margin, achieving up to 10% gains in accuracy. The trained models and the code are available at HyLITE.

TSTTC: A Large-Scale Dataset for Time-to-Contact Estimation in Driving Scenarios

  • paper_url: http://arxiv.org/abs/2309.01539
  • repo_url: https://github.com/tusen-ai/TSTTC
  • paper_authors: Yuheng Shi, Zehao Huang, Yan Yan, Naiyan Wang, Xiaojie Guo
  • for: Providing a large-scale object-oriented Time-to-Contact (TTC) dataset in driving scenarios to promote research and development of TTC estimation methods.
  • methods: Mines thousands of hours of driving data and augments small-TTC cases using recent neural rendering techniques.
  • results: A large-scale TTC dataset together with several simple yet effective TTC estimation baselines, evaluated extensively on the proposed dataset to demonstrate their effectiveness.
    Abstract Time-to-Contact (TTC) estimation is a critical task for assessing collision risk and is widely used in various driver assistance and autonomous driving systems. The past few decades have witnessed the development of related theories and algorithms. The prevalent learning-based methods call for a large-scale TTC dataset in real-world scenarios. In this work, we present a large-scale object-oriented TTC dataset in driving scenes to promote TTC estimation with a monocular camera. To collect valuable samples and keep data with different TTC values relatively balanced, we went through thousands of hours of driving data and selected over 200K sequences with a preset data distribution. To augment the quantity of small-TTC cases, we also generated clips using the latest neural rendering methods. Additionally, we provide several simple yet effective TTC estimation baselines and evaluate them extensively on the proposed dataset to demonstrate their effectiveness. The proposed dataset is publicly available at https://open-dataset.tusen.ai/TSTTC.
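
A classical monocular baseline of the kind such a dataset benchmarks derives TTC from apparent scale change: if an object's image size grows by a factor s over a frame gap Δt, then TTC ≈ Δt / (s - 1). A minimal sketch (not necessarily one of the paper's baselines):

```python
def ttc_from_scale(h_prev: float, h_curr: float, dt: float) -> float:
    """Time-to-contact from the scale change of a bounding box.

    By similar triangles the image height h is proportional to 1/Z, so
    s = h_curr / h_prev = Z_prev / Z_curr, and for constant closing speed
    TTC = dt / (s - 1).
    """
    s = h_curr / h_prev
    if s <= 1.0:
        return float("inf")  # object is not approaching
    return dt / (s - 1.0)

# A box growing from 80 px to 84 px over a 0.1 s frame gap:
print(ttc_from_scale(80, 84, 0.1))  # -> 2.0 seconds
```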

On the use of Mahalanobis distance for out-of-distribution detection with neural networks for medical imaging

  • paper_url: http://arxiv.org/abs/2309.01488
  • repo_url: https://github.com/harryanthony/mahalanobis-ood-detection
  • paper_authors: Harry Anthony, Konstantinos Kamnitsas
  • for: Investigating how neural networks used in medical applications can detect inputs that differ significantly from the training data, to avoid unreliable predictions.
  • methods: Distance-based out-of-distribution (OOD) detection, specifically the Mahalanobis distance between input features and the training distribution.
  • results: Using the Mahalanobis distance at a single fixed layer is not a one-fits-all solution; instead, the optimal layer (or combination of layers) depends on the type of OOD pattern, and splitting the OOD detector across multiple network depths improves robustness. These findings are validated on real-world OOD tasks with CheXpert chest X-rays, using unseen pacemakers and unseen sex as OOD cases.
    Abstract Implementing neural networks for clinical use in medical applications necessitates the ability for the network to detect when input data differs significantly from the training data, with the aim of preventing unreliable predictions. The community has developed several methods for out-of-distribution (OOD) detection, within which distance-based approaches - such as Mahalanobis distance - have shown potential. This paper challenges the prevailing community understanding that there is an optimal layer, or combination of layers, of a neural network for applying Mahalanobis distance for detection of any OOD pattern. Using synthetic artefacts to emulate OOD patterns, this paper shows the optimum layer to apply Mahalanobis distance changes with the type of OOD pattern, showing there is no one-fits-all solution. This paper also shows that separating this OOD detector into multiple detectors at different depths of the network can enhance the robustness for detecting different OOD patterns. These insights were validated on real-world OOD tasks, training models on CheXpert chest X-rays with no support devices, then using scans with unseen pacemakers (we manually labelled 50% of CheXpert for this research) and unseen sex as OOD cases. The results inform best-practices for the use of Mahalanobis distance for OOD detection. The manually annotated pacemaker labels and the project's code are available at: https://github.com/HarryAnthony/Mahalanobis-OOD-detection.
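
The detector studied here fits class-conditional Gaussians with a shared covariance to features from a chosen layer and scores an input by its minimum Mahalanobis distance to any class mean; a minimal NumPy sketch (layer selection and feature preprocessing omitted):

```python
import numpy as np

def fit_gaussian(features: np.ndarray, labels: np.ndarray):
    """Per-class means and a single shared (tied) covariance."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate(
        [features[labels == c] - means[c] for c in classes])
    inv_cov = np.linalg.pinv(np.cov(centered, rowvar=False))
    return means, inv_cov

def mahalanobis_score(x: np.ndarray, means: dict, inv_cov: np.ndarray) -> float:
    """OOD score: larger means further from the training distribution."""
    return min(np.sqrt((x - m) @ inv_cov @ (x - m)) for m in means.values())
```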

GenSelfDiff-HIS: Generative Self-Supervision Using Diffusion for Histopathological Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.01487
  • repo_url: None
  • paper_authors: Vishnuvardhan Purma, Suhas Srinath, Seshan Srirangarajan, Aanchal Kakkar, Prathosh A. P
  • for: Proposing a self-supervised approach to histopathological image segmentation that reduces the burden of traditional manual analysis.
  • methods: Generative diffusion models trained on unannotated data serve as the pretext task, followed by fine-tuning with a multi-loss function.
  • results: The method performs well on two publicly available datasets as well as on a newly proposed head and neck (HN) cancer dataset.
    Abstract Histopathological image segmentation is a laborious and time-intensive task, often requiring analysis from experienced pathologists for accurate examinations. To reduce this burden, supervised machine-learning approaches have been adopted using large-scale annotated datasets for histopathological image analysis. However, in several scenarios, the availability of large-scale annotated data is a bottleneck while training such models. Self-supervised learning (SSL) is an alternative paradigm that provides some respite by constructing models utilizing only the unannotated data which is often abundant. The basic idea of SSL is to train a network to perform one or many pseudo or pretext tasks on unannotated data and use it subsequently as the basis for a variety of downstream tasks. It is seen that the success of SSL depends critically on the considered pretext task. While there have been many efforts in designing pretext tasks for classification problems, there have been few attempts at SSL for histopathological segmentation. Motivated by this, we propose an SSL approach for segmenting histopathological images via generative diffusion models in this paper. Our method is based on the observation that diffusion models effectively solve an image-to-image translation task akin to a segmentation task. Hence, we propose generative diffusion as the pretext task for histopathological image segmentation. We also propose a multi-loss function-based fine-tuning for the downstream task. We validate our method using several metrics on two publicly available datasets along with a newly proposed head and neck (HN) cancer dataset containing hematoxylin and eosin (H\&E) stained images along with annotations. Codes will be made public at https://github.com/PurmaVishnuVardhanReddy/GenSelfDiff-HIS.git.

CA2: Class-Agnostic Adaptive Feature Adaptation for One-class Classification

  • paper_url: http://arxiv.org/abs/2309.01483
  • repo_url: None
  • paper_authors: Zilong Zhang, Zhibin Zhao, Deyu Meng, Xingwu Zhang, Xuefeng Chen
  • for: Improving one-class classification (OCC) so that machine learning models can be deployed reliably in the real world.
  • methods: Adapts pre-trained features to the target dataset and generalizes the center-based method to an unknown number of classes.
  • results: Consistently outperforms state-of-the-art methods across training data spanning 1 to 1024 classes.
    Abstract One-class classification (OCC), i.e., identifying whether an example belongs to the same distribution as the training data, is essential for deploying machine learning models in the real world. Adapting the pre-trained features on the target dataset has proven to be a promising paradigm for improving OCC performance. Existing methods are constrained by assumptions about the number of classes. This contradicts the real scenario where the number of classes is unknown. In this work, we propose a simple class-agnostic adaptive feature adaptation method (CA2). We generalize the center-based method to unknown classes and optimize this objective based on the prior existing in the pre-trained network, i.e., pre-trained features that belong to the same class are adjacent. CA2 is validated to consistently improve OCC performance across a spectrum of training data classes, spanning from 1 to 1024, outperforming current state-of-the-art methods. Code is available at https://github.com/zhangzilongc/CA2.
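
CA2 generalizes the center-based objective; as background, a minimal DeepSVDD-style sketch of that baseline is shown below: adapted features are pulled toward a fixed center and test samples are scored by squared distance to it. The encoder, loader, and hyperparameters are placeholders, and CA2's class-agnostic extension is not reproduced here.

```python
import torch
import torch.nn as nn

def adapt_features(encoder: nn.Module, loader, epochs: int = 5,
                   lr: float = 1e-4) -> torch.Tensor:
    """Center-based adaptation of pre-trained features to the target data."""
    device = next(encoder.parameters()).device
    with torch.no_grad():  # fix the center from the initial features
        center = torch.cat([encoder(x.to(device)) for (x,) in loader]).mean(0)
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for (x,) in loader:
            z = encoder(x.to(device))
            loss = ((z - center) ** 2).sum(dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return center

def occ_score(encoder: nn.Module, x: torch.Tensor, center: torch.Tensor):
    """Anomaly score: squared distance of the embedding to the center."""
    with torch.no_grad():
        return ((encoder(x) - center) ** 2).sum(dim=1)
```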

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

  • paper_url: http://arxiv.org/abs/2309.01479
  • repo_url: None
  • paper_authors: Qiong Wu, Wei Yu, Yiyi Zhou, Shubin Huang, Xiaoshuai Sun, Rongrong Ji
  • for: Proposing Parameter and Computation Efficient Transfer Learning (PCETL) to improve the adaptation of Vision-Language Pre-trained (VLP) models to downstream tasks.
  • methods: A Dynamic Architecture Skipping (DAS) approach that observes, via a reinforcement learning (RL) based process, how significant each VLP module is to the downstream task, and skips redundant modules with lightweight adapters. This reduces the number of trainable parameters while preserving downstream performance.
  • results: Experiments show that DAS reduces computational complexity (e.g., -11.97% FLOPs of METER on VQA2.0) while remaining competitive with existing PETL methods in both parameter scale and performance.
    Abstract With ever increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaption. Recent endeavors mainly focus on parameter efficient transfer learning (PETL) for VLP models by only updating a small number of parameters. However, excessive computational overhead still plagues the application of VLPs. In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for VLP models. In particular, PCETL not only needs to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling a more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significances of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones with lightweight networks, i.e., adapters, according to the obtained rewards. In this case, the VLP model can well maintain the scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a bunch of VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g. -11.97% FLOPs of METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our source code is given in our appendix.

Defect Detection in Synthetic Fibre Ropes using Detectron2 Framework

  • paper_url: http://arxiv.org/abs/2309.01469
  • repo_url: None
  • paper_authors: Anju Rani, Daniel O. Arroyo, Petar Durdevic
  • for: This paper aims to develop an automated and efficient method for detecting defects in synthetic fibre ropes (SFRs) using deep learning (DL) models, specifically the Detectron2 library with Mask R-CNN architecture.
  • methods: The study uses an experimentally obtained dataset of high-dimensional images of SFRs, with seven damage classes, to train and test Mask R-CNN with various backbone configurations.
  • results: The use of Detectron2 and Mask R-CNN with different backbone configurations can effectively detect defects in SFRs, enhancing the inspection process and ensuring the safety of the fibre ropes.
    Abstract Fibre ropes with the latest technology have emerged as an appealing alternative to steel ropes for offshore industries due to their lightweight and high tensile strength. At the same time, frequent inspection of these ropes is essential to ensure the proper functioning and safety of the entire system. The development of deep learning (DL) models in condition monitoring (CM) applications offers a simpler and more effective approach for defect detection in synthetic fibre ropes (SFRs). The present paper investigates the performance of Detectron2, a state-of-the-art library for defect detection and instance segmentation. Detectron2 with Mask R-CNN architecture is used for segmenting defects in SFRs. Mask R-CNN with various backbone configurations has been trained and tested on an experimentally obtained dataset comprising 1,803 high-dimensional images containing seven damage classes (loop high, loop medium, loop low, compression, core out, abrasion, and normal respectively) for SFRs. By leveraging the capabilities of Detectron2, this study aims to develop an automated and efficient method for detecting defects in SFRs, enhancing the inspection process, and ensuring the safety of the fibre ropes.
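
Inference with Detectron2's Mask R-CNN follows the library's standard config-and-predictor pattern; a sketch along those lines, where the weight path, image file, and score threshold are hypothetical values rather than the paper's settings:

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 7            # the seven SFR damage classes
cfg.MODEL.WEIGHTS = "output/model_final.pth"   # hypothetical fine-tuned weights
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
image = cv2.imread("rope_sample.png")          # hypothetical SFR image (BGR)
instances = predictor(image)["instances"]
print(instances.pred_classes, instances.pred_masks.shape)
```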

Toward Defensive Letter Design

  • paper_url: http://arxiv.org/abs/2309.01452
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Rentaro Kataoka, Akisato Kimura, Seiichi Uchida
  • for: Defending letter images against adversarial attacks.
  • methods: Measures the defensibility of letter images using the Iterative Fast Gradient Sign Method (I-FGSM), builds a deep regression model to estimate the defensibility of a given letter image before attack, and proposes a two-step method based on a generative adversarial network (GAN) to generate more defensive letter images.
  • results: Through these measurements and experiments, the paper demonstrates a letter-based defense mechanism that can help resist adversarial attacks.
    Abstract A major approach for defending against adversarial attacks aims at controlling only image classifiers to be more resilient, and it does not care about visual objects, such as pandas and cars, in images. This means that visual objects themselves cannot take any defensive actions, and they are still vulnerable to adversarial attacks. In contrast, letters are artificial symbols, and we can freely control their appearance unless losing their readability. In other words, we can make the letters more defensive to the attacks. This paper poses three research questions related to the adversarial vulnerability of letter images: (1) How defensive are the letters against adversarial attacks? (2) Can we estimate how defensive a given letter image is before attacks? (3) Can we control the letter images to be more defensive against adversarial attacks? For answering the first and second questions, we measure the defensibility of letters by employing Iterative Fast Gradient Sign Method (I-FGSM) and then build a deep regression model for estimating the defensibility of each letter image. We also propose a two-step method based on a generative adversarial network (GAN) for generating character images with higher defensibility, which solves the third research question.
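
Defensibility is measured by attacking a recognizer with I-FGSM; the standard attack is only a few lines of PyTorch (the eps, alpha, and step values below are conventional choices, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Iterative FGSM: repeated signed-gradient ascent on the loss,
    projected into an L-inf ball of radius eps around x, clipped to [0, 1]."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        (grad,) = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.clamp(x + torch.clamp(x_adv - x, -eps, eps), 0.0, 1.0)
    return x_adv
```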

Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN

  • paper_url: http://arxiv.org/abs/2309.01439
  • repo_url: None
  • paper_authors: Kin Wai Lau, Lai-Man Po, Yasar Abbas Ur Rehman
  • for: Improving the performance of Visual Attention Networks (VAN) on vision tasks while reducing their computational and memory footprints.
  • methods: Proposes the Large Separable Kernel Attention (LSKA) module, which decomposes the 2D convolutional kernel of the depth-wise convolutional layer into cascaded horizontal and vertical 1-D kernels, allowing large kernels in the attention module without extra blocks.
  • results: With LSKA, VAN matches the standard LKA module while incurring lower computational complexity and memory footprints as kernel size grows, and it outperforms ViTs and ConvNeXt on object recognition, object detection, semantic segmentation, and robustness tests.
    Abstract Visual Attention Networks (VAN) with Large Kernel Attention (LKA) modules have been shown to provide remarkable performance, that surpasses Vision Transformers (ViTs), on a range of vision-based tasks. However, the depth-wise convolutional layer in these LKA modules incurs a quadratic increase in the computational and memory footprints with increasing convolutional kernel size. To mitigate these problems and to enable the use of extremely large convolutional kernels in the attention modules of VAN, we propose a family of Large Separable Kernel Attention modules, termed LSKA. LSKA decomposes the 2D convolutional kernel of the depth-wise convolutional layer into cascaded horizontal and vertical 1-D kernels. In contrast to the standard LKA design, the proposed decomposition enables the direct use of the depth-wise convolutional layer with large kernels in the attention module, without requiring any extra blocks. We demonstrate that the proposed LSKA module in VAN can achieve comparable performance with the standard LKA module and incur lower computational complexity and memory footprints. We also find that the proposed LSKA design biases the VAN more toward the shape of the object than the texture with increasing kernel size. Additionally, we benchmark the robustness of the LKA and LSKA in VAN, ViTs, and the recent ConvNeXt on the five corrupted versions of the ImageNet dataset that are largely unexplored in the previous works. Our extensive experimental results show that the proposed LSKA module in VAN provides a significant reduction in computational complexity and memory footprints with increasing kernel size while outperforming ViTs, ConvNeXt, and providing similar performance compared to the LKA module in VAN on object recognition, object detection, semantic segmentation, and robustness tests.
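
The decomposition is straightforward in PyTorch: each k x k depth-wise convolution of LKA becomes a 1 x k followed by a k x 1 depth-wise convolution, cutting per-layer cost from O(k^2) to O(2k). The sketch below uses the usual LKA kernel split (a 5-kernel local branch, a 7-kernel branch with dilation 3, then a point-wise convolution); the exact sizes in the paper may differ.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention (sketch)."""

    def __init__(self, dim: int, k_local: int = 5, k_dilated: int = 7,
                 dilation: int = 3):
        super().__init__()
        p = k_local // 2
        self.h0 = nn.Conv2d(dim, dim, (1, k_local), padding=(0, p), groups=dim)
        self.v0 = nn.Conv2d(dim, dim, (k_local, 1), padding=(p, 0), groups=dim)
        pd = (k_dilated // 2) * dilation
        self.h1 = nn.Conv2d(dim, dim, (1, k_dilated), padding=(0, pd),
                            dilation=(1, dilation), groups=dim)
        self.v1 = nn.Conv2d(dim, dim, (k_dilated, 1), padding=(pd, 0),
                            dilation=(dilation, 1), groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)   # point-wise mixing, as in LKA

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.v0(self.h0(x))         # separable "local" kernel
        attn = self.v1(self.h1(attn))      # separable dilated kernel
        attn = self.pw(attn)
        return x * attn                    # attention re-weights the input
```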

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

  • paper_url: http://arxiv.org/abs/2309.01430
  • repo_url: https://github.com/leaplabthu/dat
  • paper_authors: Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
  • for: Proposing a reliable and efficient vision backbone that resolves the tension between global attention and the ability to adaptively focus on relevant regions.
  • methods: A novel deformable multi-head attention module that allocates the positions of key and value pairs in self-attention adaptively, in a data-dependent way, enabling attention to dynamically focus on relevant regions while retaining the representation power of global attention.
  • results: Experiments show that the proposed DAT++ achieves state-of-the-art results on multiple visual recognition benchmarks: 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
    Abstract Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that are beyond the region of interests. On the other hand, the handcrafted attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long-range relations. To solve this dilemma, we propose a novel deformable multi-head attention module, where the positions of key and value pairs in self-attention are adaptively allocated in a data-dependent way. This flexible scheme enables the proposed deformable attention to dynamically focus on relevant regions while maintains the representation power of global attention. On this basis, we present Deformable Attention Transformer (DAT), a general vision backbone efficient and effective for visual recognition. We further build an enhanced version DAT++. Extensive experiments show that our DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.

Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2309.01429
  • repo_url: https://github.com/ggsding/sam-cd
  • paper_authors: Lei Ding, Kun Zhu, Daifeng Peng, Hao Tang, Haitao Guo
  • for: Applying Vision Foundation Models (VFMs) to change detection in high-resolution remote sensing images (RSIs).
  • methods: Uses the visual encoder of FastSAM to extract visual representations of RS scenes, a convolutional adaptor to aggregate task-oriented change information, and a task-agnostic semantic learning branch that models the semantic latent space of bi-temporal RSIs.
  • results: Achieves higher accuracy than state-of-the-art methods and shows sample-efficient learning comparable to semi-supervised CD methods; to the authors' knowledge, this is the first work adapting VFMs to change detection in HR RSIs.
    Abstract Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) allow zero-shot or interactive segmentation of visual contents, thus they are quickly applied in a variety of visual scenes. However, their direct use in many Remote Sensing (RS) applications is often unsatisfactory due to the special imaging characteristics of RS images. In this work, we aim to utilize the strong visual recognition capabilities of VFMs to improve the change detection of high-resolution Remote Sensing Images (RSIs). We employ the visual encoder of FastSAM, an efficient variant of the SAM, to extract visual representations in RS scenes. To adapt FastSAM to focus on some specific ground objects in the RS scenes, we propose a convolutional adaptor to aggregate the task-oriented change information. Moreover, to utilize the semantic representations that are inherent to SAM features, we introduce a task-agnostic semantic learning branch to model the semantic latent in bi-temporal RSIs. The resulting method, SAMCD, obtains superior accuracy compared to the SOTA methods and exhibits a sample-efficient learning ability that is comparable to semi-supervised CD methods. To the best of our knowledge, this is the first work that adapts VFMs for the CD of HR RSIs.

Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification

  • paper_url: http://arxiv.org/abs/2309.01420
  • repo_url: None
  • paper_authors: Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, Jingdong Wang
  • for: Improving text-to-image person re-identification (T2I-ReID), which suffers from two underlying problems: data inconsistency and training inconsistency between pre-training and the downstream task.
  • methods: A new unified pre-training pipeline (UniPT), built on a large-scale text-labeled person dataset, "LUPerson-T", whose pseudo-textual image descriptions are generated automatically via the CLIP paradigm, combined with a simple vision-and-language pre-training framework that aligns the feature spaces of the image and text modalities.
  • results: Without any bells and whistles, UniPT achieves competitive Rank-1 accuracy of 68.50%, 60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
    Abstract The pre-training task is indispensable for the text-to-image person re-identification (T2I-ReID) task. However, there are two underlying inconsistencies between these two tasks that may impact the performance; i) Data inconsistency. A large domain gap exists between the generic images/texts used in public pre-trained models and the specific person data in the T2I-ReID task. This gap is especially severe for texts, as general textual data are usually unable to describe specific people in fine-grained detail. ii) Training inconsistency. The processes of pre-training of images and texts are independent, despite cross-modality learning being critical to T2I-ReID. To address the above issues, we present a new unified pre-training pipeline (UniPT) designed specifically for the T2I-ReID task. We first build a large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual descriptions of images are automatically generated by the CLIP paradigm using a divide-conquer-combine strategy. Benefiting from this dataset, we then utilize a simple vision-and-language pre-training framework to explicitly align the feature space of the image and text modalities during pre-training. In this way, the pre-training task and the T2I-ReID task are made consistent with each other on both data and training levels. Without the need for any bells and whistles, our UniPT achieves competitive Rank-1 accuracy of 68.50%, 60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Both the LUPerson-T dataset and code are available at https://github.com/ZhiyinShao-H/UniPT.

Implicit Neural Image Stitching With Enhanced and Blended Feature Reconstruction

  • paper_url: http://arxiv.org/abs/2309.01409
  • repo_url: https://github.com/minshu-kim/NIS
  • paper_authors: Minsu Kim, Jaewon Lee, Byeonghun Lee, Sunghoon Im, Kyong Hwan Jin
  • for: Improving the quality and accuracy of image stitching, addressing the blurry artifacts and the disparities in illumination and depth level that affect conventional frameworks.
  • methods: An implicit neural image stitching method that estimates Fourier coefficients of images for quality-enhancing warps, blends color mismatches and misalignment in the latent space, and decodes the features into the RGB values of the stitched image.
  • results: Resolves the low-definition imaging of previous deep image stitching more effectively than conventional frameworks, and can be combined with accelerated image-enhancement methods to produce higher-quality stitched images.
    Abstract Existing frameworks for image stitching often produce visually reasonable results. However, they suffer from blurry artifacts and disparities in illumination, depth level, etc. Although recent learning-based stitching methods relax such disparities, they sacrifice image quality and fail to capture high-frequency details in stitched images. To address the problem, we propose a novel approach, implicit Neural Image Stitching (NIS), that extends arbitrary-scale super-resolution. Our method estimates Fourier coefficients of images for quality-enhancing warps. Then, the suggested model blends color mismatches and misalignment in the latent space and decodes the features into RGB values of stitched images. Our experiments show that our approach improves the low-definition imaging of previous deep image stitching and benefits from accelerated image-enhancing methods. Our source code is available at https://github.com/minshu-kim/NIS.

Leveraging Self-Supervised Vision Transformers for Neural Transfer Function Design

  • paper_url: http://arxiv.org/abs/2309.01408
  • repo_url: None
  • paper_authors: Dominik Engel, Leon Sick, Timo Ropinski
  • for: Classifying structures of interest and assigning optical properties in volume rendering.
  • methods: A transfer function design method that leverages the feature extraction capabilities of self-supervised pre-trained vision transformers.
  • results: Reduces annotation effort and compute time while improving segmentation accuracy and the user experience.
    Abstract In volume rendering, transfer functions are used to classify structures of interest, and to assign optical properties such as color and opacity. They are commonly defined as 1D or 2D functions that map simple features to these optical properties. As the process of designing a transfer function is typically tedious and unintuitive, several approaches have been proposed for their interactive specification. In this paper, we present a novel method to define transfer functions for volume rendering by leveraging the feature extraction capabilities of self-supervised pre-trained vision transformers. To design a transfer function, users simply select the structures of interest in a slice viewer, and our method automatically selects similar structures based on the high-level features extracted by the neural network. Contrary to previous learning-based transfer function approaches, our method does not require training of models and allows for quick inference, enabling an interactive exploration of the volume data. Our approach reduces the amount of necessary annotations by interactively informing the user about the current classification, so they can focus on annotating the structures of interest that still require annotation. In practice, this allows users to design transfer functions within seconds, instead of minutes. We compare our method to existing learning-based approaches in terms of annotation and compute time, as well as with respect to segmentation accuracy. Our accompanying video showcases the interactivity and effectiveness of our method.

Learning Residual Elastic Warps for Image Stitching under Dirichlet Boundary Condition

  • paper_url: http://arxiv.org/abs/2309.01406
  • repo_url: https://github.com/minshu-kim/REwarp
  • paper_authors: Minsu Kim, Yongjun Lee, Woo Kyoung Han, Kyong Hwan Jin
  • for: Addressing the holes and discontinuities caused by large parallax errors in deep image stitching, improving both accuracy and efficiency.
  • methods: Recurrent Elastic Warps (REwarp), which use a Dirichlet boundary condition and residual learning for recurrent misalignment correction, predicting a homography and a Thin-plate Spline (TPS) for discontinuity- and hole-free stitching.
  • results: Experiments show favorable alignments and competitive computational cost compared with existing stitching methods.
    Abstract Recent learning-based elastic warps enable deep image stitching to align images exposed to large parallax errors. Despite the remarkable alignments, such methods struggle with occasional holes or discontinuity between overlapping and non-overlapping regions of a target image, as the applied training strategy mostly focuses on overlap-region alignment. As a result, they require additional modules such as a seam finder and image inpainting for hiding discontinuity and filling holes, respectively. In this work, we suggest Recurrent Elastic Warps (REwarp), which address the problem with a Dirichlet boundary condition and boost performance through residual learning for recurrent misalign correction. Specifically, REwarp predicts a homography and a Thin-plate Spline (TPS) under the boundary constraint for discontinuity- and hole-free image stitching. Our experiments show the favorable alignments and the competitive computational costs of REwarp compared to the existing stitching methods. Our source code is available at https://github.com/minshu-kim/REwarp.
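
REwarp predicts its homography and TPS with a network, but the underlying two-stage geometry (a coarse homography followed by a residual elastic correction) can be sketched with classical OpenCV tools; the filenames and ORB/RANSAC settings below are illustrative only.

```python
import cv2
import numpy as np

img_a = cv2.imread("left.jpg")    # hypothetical input pair
img_b = cv2.imread("right.jpg")

# 1) Match features and fit a homography (the coarse alignment stage).
orb = cv2.ORB_create(2000)
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# 2) Warp one image into the other's frame; a learned residual TPS warp
#    would then correct the remaining parallax near the boundary.
h, w = img_b.shape[:2]
warped = cv2.warpPerspective(img_a, H, (w, h))
```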

SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations

  • paper_url: http://arxiv.org/abs/2309.01391
  • repo_url: None
  • paper_authors: Tanvir Mahmud, Chun-Hao Liu, Burhaneddin Yaman, Diana Marculescu
  • for: Proposing a semi-supervised video object detection method that addresses the limitations of existing approaches, such as the need for large numbers of annotated frames for good supervised performance.
  • methods: A flow-based strategy that uses flow-warped predictions from nearby frames to select suitable pseudo-labels on large numbers of unlabeled frames, with two selection mechanisms: flow-warped prediction matching, and cross-IoU together with cross-divergence based selection.
  • results: The paper reports significant performance improvements on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS, e.g., mAP gains of 10.3% on ImageNet-VID, 13.1% on Epic-KITCHENS, and 11.4% on YouTube-VIS.
    Abstract Despite significant progress in semi-supervised learning for image object detection, several key issues are yet to be addressed for video object detection: (1) Achieving good performance for supervised video object detection greatly depends on the availability of annotated frames. (2) Despite having large inter-frame correlations in a video, collecting annotations for a large number of frames per video is expensive, time-consuming, and often redundant. (3) Existing semi-supervised techniques on static images can hardly exploit the temporal motion dynamics inherently present in videos. In this paper, we introduce SSVOD, an end-to-end semi-supervised video object detection framework that exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations. To selectively assemble robust pseudo-labels across groups of frames, we introduce \textit{flow-warped predictions} from nearby frames for temporal-consistency estimation. In particular, we introduce cross-IoU and cross-divergence based selection methods over a set of estimated predictions to include robust pseudo-labels for bounding boxes and class labels, respectively. To strike a balance between confirmation bias and uncertainty noise in pseudo-labels, we propose confidence threshold based combination of hard and soft pseudo-labels. Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS datasets. Code and pre-trained models will be released.
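
The cross-IoU selection reduces to matching current detections against flow-warped predictions from neighbouring frames and keeping those with sufficient overlap; a minimal sketch in which the box format and threshold are assumptions:

```python
import numpy as np

def iou_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between two box sets in (x1, y1, x2, y2) format."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def select_consistent(boxes_t, boxes_warped, iou_thresh: float = 0.6):
    """Keep frame-t detections whose best match among flow-warped
    predictions from a neighbouring frame clears the IoU threshold."""
    if len(boxes_t) == 0 or len(boxes_warped) == 0:
        return np.zeros(len(boxes_t), dtype=bool)
    return iou_matrix(boxes_t, boxes_warped).max(axis=1) >= iou_thresh
```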

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

  • paper_url: http://arxiv.org/abs/2309.01380
  • repo_url: None
  • paper_authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
  • for: Exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, for answering video questions based on textual content.
  • methods: Experiments with BERT-QA, a text-only model, on both datasets, where it performs comparably to the original methods.
  • results: Training on M4-ViteVQA does not transfer directly to NewsVideoQA (and vice versa), showing that out-of-domain training requires adaptation.
    Abstract Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer pairs related to the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories like vlogging, traveling, and shopping. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions. Additionally, the study includes experimentation with BERT-QA, a text-only model, which demonstrates comparable performance to the original methods on both datasets, indicating the shortcomings in the formulation of these datasets. Furthermore, we also look into the domain adaptation aspect by examining the effectiveness of training on M4-ViteVQA and evaluating on NewsVideoQA and vice-versa, thereby shedding light on the challenges and potential benefits of out-of-domain training.

ImmersiveNeRF: Hybrid Radiance Fields for Unbounded Immersive Light Field Reconstruction

  • paper_url: http://arxiv.org/abs/2309.01374
  • repo_url: None
  • paper_authors: Xiaohang Yu, Haoxiang Wang, Yuqi Han, Lei Yang, Tao Yu, Qionghai Dai
  • for: This paper proposes a method for unbounded immersive light field reconstruction, which supports high-quality rendering and aggressive view extrapolation.
  • methods: The method uses a hybrid radiance field representation, with separate radiance fields for the foreground and background, and adaptive sampling and segmentation regularization to improve performance.
  • results: The proposed method achieves strong performance for unbounded immersive light field reconstruction, and contributes a novel dataset for further research and applications in the immersive light field domain.
    Abstract This paper proposes a hybrid radiance field representation for unbounded immersive light field reconstruction which supports high-quality rendering and aggressive view extrapolation. The key idea is to first formally separate the foreground and the background and then adaptively balance learning of them during the training process. To fulfill this goal, we represent the foreground and background as two separate radiance fields with two different spatial mapping strategies. We further propose an adaptive sampling strategy and a segmentation regularizer for more clear segmentation and robust convergence. Finally, we contribute a novel immersive light field dataset, named THUImmersive, with the potential to achieve much larger space 6DoF immersive rendering effects compared with existing datasets, by capturing multiple neighboring viewpoints for the same scene, to stimulate the research and AR/VR applications in the immersive light field domain. Extensive experiments demonstrate the strong performance of our method for unbounded immersive light field reconstruction.

DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion

  • paper_url: http://arxiv.org/abs/2309.01372
  • repo_url: https://github.com/axdfhj/mdd
  • paper_authors: Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang
  • for: Generating high-quality human motions conditioned on textual descriptions while preserving motion diversity.
  • methods: A Hierarchical Semantic Aggregation (HSA) module to capture fine-grained text semantics and a Motion Discrete Diffusion (MDD) framework that balances motion quality and diversity.
  • results: Experiments show that the method achieves state-of-the-art motion quality and competitive motion diversity on HumanML3D and KIT-ML.
    Abstract We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite the recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity. Consequently, striking a balance between motion quality and diversity remains an unresolved challenge. This problem is compounded by two key factors: 1) the lack of diversity in motion-caption pairs in existing benchmarks and 2) the unilateral and biased semantic understanding of the text prompt, focusing primarily on the verb component while neglecting the nuanced distinctions indicated by other words. In response to the first issue, we construct a large-scale Wild Motion-Caption dataset (WMC) to extend the restricted action boundary of existing well-annotated datasets, enabling the learning of diverse motions through a more extensive range of actions. To this end, a motion BLIP is trained upon a pretrained vision-language model, and we then automatically generate diverse motion captions for the collected motion sequences. As a result, we build a dataset comprising 8,888 motions coupled with 141k texts. To comprehensively understand the text command, we propose a Hierarchical Semantic Aggregation (HSA) module to capture fine-grained semantics. Finally, we incorporate the above two designs into an effective Motion Discrete Diffusion (MDD) framework to strike a balance between motion quality and diversity. Extensive experiments on HumanML3D and KIT-ML show that our DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity. The dataset, code, and pretrained models will be released to reproduce all of our results.

Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion

  • paper_url: http://arxiv.org/abs/2309.01369
  • repo_url: None
  • paper_authors: Ryota Yoshihashi, Yuya Otsuka, Kenji Doi, Tomohiro Tanaka
  • for: Proposing a semantic segmentation training method that relies on neither real images nor manual annotation.
  • methods: Uses images generated by a text-to-image diffusion model together with the model's internal text-to-image cross-attention as supervisory pseudo-masks.
  • results: attn2mask achieves promising results on PASCAL VOC without using any real training data, scales up to a more-class scenario (ImageNet segmentation), and shows adaptation ability via LoRA-based fine-tuning, enabling transfer to a distant domain such as Cityscapes.
    Abstract Although recent advancements in diffusion models enabled high-fidelity and diverse image generation, training of discriminative models largely depends on collections of massive real images and their manual annotation. Here, we present a training method for semantic segmentation that neither relies on real images nor manual annotation. The proposed method attn2mask utilizes images generated by a text-to-image diffusion model in combination with its internal text-to-image cross-attention as supervisory pseudo-masks. Since the text-to-image generator is trained with image-caption pairs but without pixel-wise labels, attn2mask can be regarded as a weakly supervised segmentation method overall. Experiments show that attn2mask achieves promising results in PASCAL VOC for not using real training data for segmentation at all, and it is also useful to scale up segmentation to a more-class scenario, i.e., ImageNet segmentation. It also shows adaptation ability with LoRA-based fine-tuning, which enables the transfer to a distant domain, i.e., Cityscapes.
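
A concrete way to picture the pseudo-mask construction above is to upsample and threshold a text-to-image cross-attention map from the diffusion model. The sketch below does exactly that; the tensor layout, token choice, and threshold are illustrative assumptions, not the authors' exact pipeline.

```python
# Turn a diffusion-model cross-attention map into a binary pseudo-mask.
import torch
import torch.nn.functional as F

def attention_to_pseudo_mask(cross_attn: torch.Tensor,
                             token_idx: int,
                             out_size: int = 512,
                             threshold: float = 0.5) -> torch.Tensor:
    """cross_attn: (heads, H*W, num_tokens) attention from a UNet layer."""
    heads, hw, _ = cross_attn.shape
    side = int(hw ** 0.5)
    # Average over heads and keep the attention paid to one prompt token
    # (e.g., the class noun in "a photo of a dog").
    attn = cross_attn.mean(dim=0)[:, token_idx].reshape(1, 1, side, side)
    attn = F.interpolate(attn, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    return (attn > threshold).long().squeeze()  # (out_size, out_size) mask
```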

High Frequency, High Accuracy Pointing onboard Nanosats using Neuromorphic Event Sensing and Piezoelectric Actuation

  • paper_url: http://arxiv.org/abs/2309.01361
  • repo_url: None
  • paper_authors: Yasir Latif, Peter Anastasiou, Yonhon Ng, Zebb Prime, Tien-Fu Lu, Matthew Tetlow, Robert Mahony, Tat-Jun Chin
  • for: To improve the pointing stability of nanosatellites so that space domain awareness (SDA) tasks can be performed with higher pointing accuracy.
  • methods: The paper proposes a novel payload that pairs a neuromorphic event sensor with a piezoelectric stage in a closed loop for high-accuracy, high-frequency relative attitude estimation and control.
  • results: Experiments show the payload achieves stable pointing with an accuracy of 1-5 arcseconds at operating frequencies of up to 50 Hz.
    Abstract As satellites become smaller, the ability to maintain stable pointing decreases as external forces acting on the satellite come into play. At the same time, reaction wheels used in the attitude determination and control system (ADCS) introduce high frequency jitter which can disrupt pointing stability. For space domain awareness (SDA) tasks that track objects tens of thousands of kilometres away, the pointing accuracy offered by current nanosats, typically in the range of 10 to 100 arcseconds, is not sufficient. In this work, we develop a novel payload that utilises a neuromorphic event sensor (for high frequency and highly accurate relative attitude estimation) paired in a closed loop with a piezoelectric stage (for active attitude corrections) to provide highly stable sensor-specific pointing. Event sensors are especially suited for space applications due to their desirable characteristics of low power consumption, asynchronous operation, and high dynamic range. We use the event sensor to first estimate a reference background star field from which instantaneous relative attitude is estimated at high frequency. The piezoelectric stage works in a closed control loop with the event sensor to perform attitude corrections based on the discrepancy between the current and desired attitude. Results in a controlled setting show that we can achieve a pointing accuracy in the range of 1-5 arcseconds using our novel payload at an operating frequency of up to 50Hz using a prototype built from commercial-off-the-shelf components. Further details can be found at https://ylatif.github.io/ultrafinestabilisation
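
The closed loop described above behaves like a conventional feedback controller: the event sensor supplies a high-rate attitude error, and the piezoelectric stage consumes a correction command. Below is a minimal sketch; the PI structure, gains, and units are assumptions, not the flight implementation.

```python
import numpy as np

class PointingLoop:
    """Toy PI loop closing the event-sensor estimate onto the piezo stage."""

    def __init__(self, kp: float = 0.8, ki: float = 0.1, rate_hz: float = 50.0):
        self.kp, self.ki, self.dt = kp, ki, 1.0 / rate_hz
        self.integral = np.zeros(2)  # accumulated (x, y) error, arcseconds

    def step(self, measured: np.ndarray, desired: np.ndarray) -> np.ndarray:
        """Both arguments are (x, y) attitude offsets in arcseconds."""
        error = desired - measured
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral  # piezo command
```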

Adapting Classifiers To Changing Class Priors During Deployment

  • paper_url: http://arxiv.org/abs/2309.01357
  • repo_url: None
  • paper_authors: Natnael Daba, Bruce McIntosh, Abhijit Mahalanobis
  • for: To study how a general-purpose classifier can adapt to deployment scenarios whose class priors differ from those seen during training.
  • methods: The paper estimates the deployment-time class priors from the classifier's own outputs.
  • results: Incorporating the estimated class priors into the decision scheme increases the classifier's run-time accuracy in its deployment scenario.
    Abstract Conventional classifiers are trained and evaluated using balanced data sets in which all classes are equally present. Classifiers are now trained on large data sets such as ImageNet, and are now able to classify hundreds (if not thousands) of different classes. On one hand, it is desirable to train such general-purpose classifier on a very large number of classes so that it performs well regardless of the settings in which it is deployed. On the other hand, it is unlikely that all classes known to the classifier will occur in every deployment scenario, or that they will occur with the same prior probability. In reality, only a relatively small subset of the known classes may be present in a particular setting or environment. For example, a classifier will encounter mostly animals if its deployed in a zoo or for monitoring wildlife, aircraft and service vehicles at an airport, or various types of automobiles and commercial vehicles if it is used for monitoring traffic. Furthermore, the exact class priors are generally unknown and can vary over time. In this paper, we explore different methods for estimating the class priors based on the output of the classifier itself. We then show that incorporating the estimated class priors in the overall decision scheme enables the classifier to increase its run-time accuracy in the context of its deployment scenario.
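
A classic instantiation of "estimating class priors from the classifier's own outputs" is the EM procedure of Saerens et al. (2002); the sketch below shows that variant and then rescales the posteriors, with the caveat that the paper may use a different estimator.

```python
import numpy as np

def adapt_priors(probs: np.ndarray, train_priors: np.ndarray,
                 n_iter: int = 50):
    """probs: (N, C) softmax outputs produced under the training priors."""
    new_priors = train_priors.copy()
    adjusted = probs
    for _ in range(n_iter):
        # E-step: reweight posteriors by the ratio of new to training priors.
        adjusted = probs * (new_priors / train_priors)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: the new priors are the mean adjusted posterior per class.
        new_priors = adjusted.mean(axis=0)
    return new_priors, adjusted  # deployment priors, corrected posteriors
```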

Real-time pedestrian recognition on low computational resources

  • paper_url: http://arxiv.org/abs/2309.01353
  • repo_url: None
  • paper_authors: Guifan Weng
  • for: To achieve real-time pedestrian recognition on small mobile devices, benefiting applications such as security and autonomous driving.
  • methods: Three methods are presented: improved Local Binary Pattern (LBP) features with an AdaBoost classifier, an optimized Histogram of Oriented Gradients (HOG) with a Support Vector Machine, and fast Convolutional Neural Networks (CNNs).
  • results: All three methods achieve real-time pedestrian recognition on a small physical-size platform, with accuracy above 95% and speed above 5 fps, and can readily be applied to small mobile devices with high compatibility and generality.
    Abstract Pedestrian recognition has successfully been applied to security, autonomous cars, and aerial photographs. For most applications, pedestrian recognition on small mobile devices is important. However, the limitations of the computing hardware make this a challenging task. In this work, we investigate real-time pedestrian recognition on small physical-size computers with low computational resources for faster speed. This paper presents three methods that work on the small physical size CPUs system. First, we improved the Local Binary Pattern (LBP) features and Adaboost classifier. Second, we optimized the Histogram of Oriented Gradients (HOG) and Support Vector Machine. Third, we implemented fast Convolutional Neural Networks (CNNs). The results demonstrate that the three methods achieved real-time pedestrian recognition at an accuracy of more than 95% and a speed of more than 5 fps on a small physical size computational platform with a 1.8 GHz Intel i5 CPU. Our methods can be easily applied to small mobile devices with high compatibility and generality.
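
For orientation, the HOG + SVM branch of the pipeline has a well-known off-the-shelf counterpart in OpenCV's default people detector; the snippet below is only a stand-in for the authors' optimized variants, which are not public.

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("frame.jpg")  # placeholder input frame
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```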

Adv3D: Generating 3D Adversarial Examples in Driving Scenarios with NeRF

  • paper_url: http://arxiv.org/abs/2309.01351
  • repo_url: None
  • paper_authors: Leheng Li, Qing Lian, Ying-Cong Chen
  • for: To probe the vulnerability of DNN-based autonomous driving stacks (i.e., 3D object detection) to adversarial examples.
  • methods: Adversarial examples are modeled as Neural Radiance Fields (NeRFs), with primitive-aware sampling and semantic-guided regularization used to generate physically realizable examples.
  • results: Experiments show the trained adversarial NeRF causes large performance drops across poses, scenes, and 3D detectors; a defense based on adversarial training through data augmentation is also proposed.
    Abstract Deep neural networks (DNNs) have been proven extremely susceptible to adversarial examples, which raises special safety-critical concerns for DNN-based autonomous driving stacks (i.e., 3D object detection). Although there are extensive works on image-level attacks, most are restricted to 2D pixel spaces, and such attacks are not always physically realistic in our 3D world. Here we present Adv3D, the first exploration of modeling adversarial examples as Neural Radiance Fields (NeRFs). Advances in NeRF provide photorealistic appearances and 3D accurate generation, yielding a more realistic and realizable adversarial example. We train our adversarial NeRF by minimizing the surrounding objects' confidence predicted by 3D detectors on the training set. Then we evaluate Adv3D on the unseen validation set and show that it can cause a large performance reduction when rendering NeRF in any sampled pose. To generate physically realizable adversarial examples, we propose primitive-aware sampling and semantic-guided regularization that enable 3D patch attacks with camouflage adversarial texture. Experimental results demonstrate that the trained adversarial NeRF generalizes well to different poses, scenes, and 3D detectors. Finally, we provide a defense method to our attacks that involves adversarial training through data augmentation. Project page: https://len-li.github.io/adv3d-web

Adaptive Parametric Prototype Learning for Cross-Domain Few-Shot Classification

  • paper_url: http://arxiv.org/abs/2309.01342
  • repo_url: None
  • paper_authors: Marzi Heidari, Abdullah Alchihabi, Qing En, Yuhong Guo
  • for: To address the cross-domain few-shot classification problem.
  • methods: The paper proposes Adaptive Parametric Prototype Learning (APPL) under the meta-learning convention. Unlike existing prototypical few-shot methods, class prototypes are learned parametrically from the concatenated features of the support set, and prototype-based regularization is enforced on the query set.
  • results: Experiments on multiple cross-domain few-shot benchmarks show that APPL outperforms many state-of-the-art cross-domain few-shot learning methods.
    Abstract Cross-domain few-shot classification induces a much more challenging problem than its in-domain counterpart due to the existence of domain shifts between the training and test tasks. In this paper, we develop a novel Adaptive Parametric Prototype Learning (APPL) method under the meta-learning convention for cross-domain few-shot classification. Different from existing prototypical few-shot methods that use the averages of support instances to calculate the class prototypes, we propose to learn class prototypes from the concatenated features of the support set in a parametric fashion and meta-learn the model by enforcing prototype-based regularization on the query set. In addition, we fine-tune the model in the target domain in a transductive manner using a weighted-moving-average self-training approach on the query instances. We conduct experiments on multiple cross-domain few-shot benchmark datasets. The empirical results demonstrate that APPL yields superior performance than many state-of-the-art cross-domain few-shot learning methods.
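
The parametric-prototype idea can be sketched as a small network that maps the concatenated support features of each class to its prototype, with queries classified by nearest prototype; the head architecture and dimensions below are assumptions, not the published model.

```python
import torch
import torch.nn as nn

class ParametricPrototype(nn.Module):
    def __init__(self, feat_dim: int, k_shot: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim * k_shot, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, support: torch.Tensor) -> torch.Tensor:
        """support: (n_way, k_shot, feat_dim) -> prototypes (n_way, feat_dim)."""
        n_way, k_shot, d = support.shape
        return self.head(support.reshape(n_way, k_shot * d))

def classify(query: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Nearest-prototype logits: negative squared Euclidean distance."""
    return -torch.cdist(query, prototypes) ** 2
```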

MDSC: Towards Evaluating the Style Consistency Between Music and Dance

  • paper_url: http://arxiv.org/abs/2309.01340
  • repo_url: https://github.com/zixiangzhou916/mdsc
  • paper_authors: Zixiang Zhou, Baoyuan Wang
  • for: Assessing the matching degree between music and dance styles.
  • methods: Using music and motion encoders for matching and alignment in a joint embedding space.
  • results: Proposed a new evaluation metric, Music-Dance-Style Consistency (MDSC), and validated its effectiveness through user studies, demonstrating its ability to accurately assess the matching degree between music and dance styles.
    Abstract We propose MDSC (Music-Dance-Style Consistency), the first evaluation metric which assesses to what degree the dance moves and music match. Existing metrics can only evaluate the fidelity and diversity of motion and the degree of rhythmic matching between music and motion. MDSC measures how stylistically correlated the generated dance motion sequences and the conditioning music sequences are. We found that directly measuring the embedding distance between motion and music is not an optimal solution. We instead tackle this through modelling it as a clustering problem. Specifically, 1) we pre-train a music encoder and a motion encoder, then 2) we learn to map and align the motion and music embedding in joint space by jointly minimizing the intra-cluster distance and maximizing the inter-cluster distance, and 3) for evaluation purposes, we encode the dance moves into embedding and measure the intra-cluster and inter-cluster distances, as well as the ratio between them. We evaluate our metric on the results of several music-conditioned motion generation methods, combined with user study, we found that our proposed metric is a robust evaluation metric in measuring the music-dance style correlation. The code is available at: https://github.com/zixiangzhou916/MDSC.
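
The evaluation step above, scoring style consistency from intra- versus inter-cluster distances of motion embeddings grouped by their conditioning music, might look like the sketch below; the final aggregation into the MDSC score is an assumption here.

```python
import numpy as np

def style_consistency(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """embeddings: (N, D) motion embeddings; labels: (N,) music-style ids."""
    centroids = {c: embeddings[labels == c].mean(axis=0)
                 for c in np.unique(labels)}
    # Mean distance of each embedding to its own cluster centroid.
    intra = np.mean([np.linalg.norm(e - centroids[c])
                     for e, c in zip(embeddings, labels)])
    # Mean pairwise distance between distinct cluster centroids.
    cents = np.stack(list(centroids.values()))
    pair = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=-1)
    inter = pair[np.triu_indices(len(cents), k=1)].mean()
    return intra / inter  # lower = tighter, better-separated style clusters
```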

Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2309.01331
  • repo_url: None
  • paper_authors: Yiwen Cao, Yukun Su, Wenjun Wang, Yanxia Liu, Qingyao Wu
  • for: To address the partial activation problem in weakly supervised object localization (WSOL), where a detector must localize objects using only image-level supervision.
  • methods: A Vision Transformer captures long-range feature dependencies through self-attention; a local patch shuffle strategy disrupts local patches while guaranteeing global consistency, and a semantic-constraint matching module mines the co-object parts.
  • results: Experiments show the method achieves new state-of-the-art performance on the CUB-200-2011 and ILSVRC datasets, outperforming previous methods by a large margin.
    Abstract Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision. Due to the local receptive fields generated by convolution operations, previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope. Benefiting from the capability of the self-attention mechanism to acquire long-range feature dependencies, Vision Transformer has been recently applied to alleviate the local activation drawbacks. However, since the transformer lacks the inductive localization bias that are inherent in CNNs, it may cause a divergent activation problem resulting in an uncertain distinction between foreground and background. In this work, we proposed a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation. Specifically, we first propose a local patch shuffle strategy to construct the image pairs, disrupting local patches while guaranteeing global consistency. The paired images that contain the common object in spatial are then fed into the Siamese network encoder. We further design a semantic-constraint matching module, which aims to mine the co-object part by matching the coarse class activation maps (CAMs) extracted from the pair images, thus implicitly guiding and calibrating the transformer network to alleviate the divergent activation. Extensive experimental results conducted on two challenging benchmarks, including CUB-200-2011 and ILSVRC datasets show that our method can achieve the new state-of-the-art performance and outperform the previous method by a large margin.
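
The local patch shuffle strategy can be sketched as a grid permutation that scrambles local structure while keeping global content; the grid size below is an illustrative assumption.

```python
import torch

def local_patch_shuffle(img: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """img: (C, H, W) with H and W divisible by `grid`."""
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    # Split into a (grid x grid) tile grid, flatten, and permute the tiles.
    tiles = img.reshape(c, grid, ph, grid, pw).permute(1, 3, 0, 2, 4)
    tiles = tiles.reshape(grid * grid, c, ph, pw)
    tiles = tiles[torch.randperm(grid * grid)]
    # Reassemble the shuffled tiles into an image of the original size.
    tiles = tiles.reshape(grid, grid, c, ph, pw).permute(2, 0, 3, 1, 4)
    return tiles.reshape(c, h, w)
```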

SKoPe3D: A Synthetic Dataset for Vehicle Keypoint Perception in 3D from Traffic Monitoring Cameras

  • paper_url: http://arxiv.org/abs/2309.01324
  • repo_url: None
  • paper_authors: Himanshu Pahadia, Duo Lu, Bharatesh Chakravarthi, Yezhou Yang
  • for: SKoPe3D is a synthetic vehicle keypoint dataset generated using the CARLA simulator, aiming to address the challenges of vehicle keypoint detection in vision-based vehicle monitoring for ITS.
  • methods: The dataset includes generated images with bounding boxes, tracking IDs, and 33 keypoints for each vehicle, spanning over 25k images across 28 scenes with over 150k vehicle instances and 4.9 million keypoints.
  • results: The dataset has the potential to enable advancements in vehicle keypoint detection for ITS, as demonstrated by training a keypoint R-CNN model on it as a baseline and conducting a thorough evaluation; the dataset's applicability and the potential for knowledge transfer between synthetic and real-world data are highlighted.
    Abstract Intelligent transportation systems (ITS) have revolutionized modern road infrastructure, providing essential functionalities such as traffic monitoring, road safety assessment, congestion reduction, and law enforcement. Effective vehicle detection and accurate vehicle pose estimation are crucial for ITS, particularly using monocular cameras installed on the road infrastructure. One fundamental challenge in vision-based vehicle monitoring is keypoint detection, which involves identifying and localizing specific points on vehicles (such as headlights, wheels, taillights, etc.). However, this task is complicated by vehicle model and shape variations, occlusion, weather, and lighting conditions. Furthermore, existing traffic perception datasets for keypoint detection predominantly focus on frontal views from ego vehicle-mounted sensors, limiting their usability in traffic monitoring. To address these issues, we propose SKoPe3D, a unique synthetic vehicle keypoint dataset generated using the CARLA simulator from a roadside perspective. This comprehensive dataset includes generated images with bounding boxes, tracking IDs, and 33 keypoints for each vehicle. Spanning over 25k images across 28 scenes, SKoPe3D contains over 150k vehicle instances and 4.9 million keypoints. To demonstrate its utility, we trained a keypoint R-CNN model on our dataset as a baseline and conducted a thorough evaluation. Our experiments highlight the dataset's applicability and the potential for knowledge transfer between synthetic and real-world data. By leveraging the SKoPe3D dataset, researchers and practitioners can overcome the limitations of existing datasets, enabling advancements in vehicle keypoint detection for ITS.
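
The keypoint R-CNN baseline mentioned above can be instantiated directly with torchvision; the class count and training setup below are assumptions beyond the 33 keypoints stated in the abstract.

```python
from torchvision.models.detection import keypointrcnn_resnet50_fpn

model = keypointrcnn_resnet50_fpn(
    weights=None,
    num_classes=2,     # background + vehicle (an assumption)
    num_keypoints=33,  # SKoPe3D annotates 33 keypoints per vehicle
)
model.train()  # then fit on (image, target) pairs with boxes and keypoints
```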

FAU-Net: An Attention U-Net Extension with Feature Pyramid Attention for Prostate Cancer Segmentation

  • paper_url: http://arxiv.org/abs/2309.01322
  • repo_url: None
  • paper_authors: Pablo Cesar Quihui-Rubio, Daniel Flores-Araiza, Miguel Gonzalez-Mendoza, Christian Mata, Gilberto Ochoa-Ruiz
  • for: To propose a deep-learning-based method for segmenting prostate zones in MRI images, improving the workflow of prostate cancer detection and diagnosis.
  • methods: A U-Net is combined with additive and feature pyramid attention modules to improve segmentation accuracy.
  • results: Compared against seven U-Net-based architectures, the proposed method achieves a mean DSC of 84.15% and an IoU of 76.9% on the test set, outperforming most of the studied models except the R2U-Net and attention R2U-Net architectures.
    Abstract This contribution presents a deep learning method for the segmentation of prostate zones in MRI images based on U-Net using additive and feature pyramid attention modules, which can improve the workflow of prostate cancer detection and diagnosis. The proposed model is compared to seven different U-Net-based architectures. The automatic segmentation performance of each model of the central zone (CZ), peripheral zone (PZ), transition zone (TZ) and Tumor were evaluated using Dice Score (DSC), and the Intersection over Union (IoU) metrics. The proposed alternative achieved a mean DSC of 84.15% and IoU of 76.9% in the test set, outperforming most of the studied models in this work except from R2U-Net and attention R2U-Net architectures.
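
The two reported metrics, DSC and IoU, written out for binary masks (a minimal NumPy sketch, not the authors' evaluation code):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)
```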

An FPGA smart camera implementation of segmentation models for drone wildfire imagery

  • paper_url: http://arxiv.org/abs/2309.01318
  • repo_url: None
  • paper_authors: Eduardo Guarduño-Martinez, Jorge Ciprian-Sanchez, Gerardo Valente, Vazquez-Garcia, Gerardo Rodriguez-Hernandez, Adriana Palacios-Rosas, Lucile Rossi-Tisson, Gilberto Ochoa-Ruiz
  • for: To develop a feasible, low-power computing architecture for onboard wildfire detection and assessment with drones.
  • methods: A smart camera based on low-power field-programmable gate arrays (FPGAs) is combined with binarized neural networks (BNNs) for efficient execution at the edge.
  • results: By optimizing and compressing the original model, throughput increased from 8 frames per second (FPS) to 33.63 FPS with no loss in segmentation performance: the model scored 0.912 in Matthews correlation coefficient (MCC), 0.915 in F1 score, and 0.870 in the Hafiane quality index (HAF), comparable to the original full-precision model.
    Abstract Wildfires represent one of the most relevant natural disasters worldwide, due to their impact on various societal and environmental levels. Thus, a significant amount of research has been carried out to investigate and apply computer vision techniques to address this problem. One of the most promising approaches for wildfire fighting is the use of drones equipped with visible and infrared cameras for the detection, monitoring, and fire spread assessment in a remote manner but in close proximity to the affected areas. However, implementing effective computer vision algorithms on board is often prohibitive since deploying full-precision deep learning models running on GPU is not a viable option, due to their high power consumption and the limited payload a drone can handle. Thus, in this work, we posit that smart cameras, based on low-power consumption field-programmable gate arrays (FPGAs), in tandem with binarized neural networks (BNNs), represent a cost-effective alternative for implementing onboard computing on the edge. Herein we present the implementation of a segmentation model applied to the Corsican Fire Database. We optimized an existing U-Net model for such a task and ported the model to an edge device (a Xilinx Ultra96-v2 FPGA). By pruning and quantizing the original model, we reduce the number of parameters by 90%. Furthermore, additional optimizations enabled us to increase the throughput of the original model from 8 frames per second (FPS) to 33.63 FPS without loss in the segmentation performance: our model obtained 0.912 in Matthews correlation coefficient (MCC),0.915 in F1 score and 0.870 in Hafiane quality index (HAF), and comparable qualitative segmentation results when contrasted to the original full-precision model. The final model was integrated into a low-cost FPGA, which was used to implement a neural network accelerator.

Enhancing Automated and Early Detection of Alzheimer’s Disease Using Out-Of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2309.01312
  • repo_url: None
  • paper_authors: Audrey Paleczny, Shubham Parab, Maxwell Zhang
  • for: To enable early diagnosis and treatment of Alzheimer's disease in people aged 65 and older.
  • methods: Supervised Random Forest models and a Convolutional Neural Network (CNN) are trained on magnetic resonance imaging (MRI) data, with out-of-distribution (OOD) detection applied to the CNN.
  • results: OOD detection reduces false diagnoses and improves reliability: the CNN-based model reaches 98% detection and 95% classification accuracy, outperforming the segmented-volume model (93% and 87%, respectively).
    Abstract More than 10.7% of people aged 65 and older are affected by Alzheimer's disease. Early diagnosis and treatment are crucial as most Alzheimer's patients are unaware of having it until the effects become detrimental. AI has been known to use magnetic resonance imaging (MRI) to diagnose Alzheimer's. However, models which produce low rates of false diagnoses are critical to prevent unnecessary treatments. Thus, we trained supervised Random Forest models with segmented brain volumes and Convolutional Neural Network (CNN) outputs to classify different Alzheimer's stages. We then applied out-of-distribution (OOD) detection to the CNN model, enabling it to report OOD if misclassification is likely, thereby reducing false diagnoses. With an accuracy of 98% for detection and 95% for classification, our model based on CNN results outperformed our segmented volume model, which had detection and classification accuracies of 93% and 87%, respectively. Applying OOD detection to the CNN model enabled it to flag brain tumor images as OOD with 96% accuracy and minimal overall accuracy reduction. By using OOD detection to enhance the reliability of MRI classification using CNNs, we lowered the rate of false positives and eliminated a significant disadvantage of using Machine Learning models for healthcare tasks. Source code available upon request.
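
The deployment-time behavior described above can be sketched as a confidence threshold on the softmax output: below the threshold the model abstains rather than emit a likely-wrong diagnosis. The max-softmax criterion and threshold value are assumptions, since the paper's exact OOD score is not specified here.

```python
import numpy as np

def predict_with_ood(probs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """probs: (N, C) softmax outputs; returns class id, or -1 for OOD."""
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    preds[conf < threshold] = -1  # -1 marks "OOD, refer to a clinician"
    return preds
```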

EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity

  • paper_url: http://arxiv.org/abs/2309.01296
  • repo_url: None
  • paper_authors: Zijie Jiang, Masatoshi Okutomi
  • for: To improve the accuracy of existing self-supervised scene flow estimation methods by borrowing network architecture advantages from supervised learning and exploiting ego-motion rigidity while suppressing the influence of dynamic regions.
  • methods: The proposed EMR-MSF model introduces explicit, robust geometric constraints through an ego-motion aggregation module with a rigidity soft mask that filters out dynamic regions, together with a motion consistency loss and a mask regularization loss to fully exploit static regions.
  • results: The method improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44% on the KITTI scene flow benchmark and matches supervised methods on sub-tasks such as depth and visual odometry.
    Abstract Self-supervised monocular scene flow estimation, aiming to understand both 3D structures and 3D motions from two temporally consecutive monocular images, has received increasing attention for its simple and economical sensor setup. However, the accuracy of current methods suffers from the bottleneck of less-efficient network architecture and lack of motion rigidity for regularization. In this paper, we propose a superior model named EMR-MSF by borrowing the advantages of network architecture design under the scope of supervised learning. We further impose explicit and robust geometric constraints with an elaborately constructed ego-motion aggregation module where a rigidity soft mask is proposed to filter out dynamic regions for stable ego-motion estimation using static regions. Moreover, we propose a motion consistency loss along with a mask regularization loss to fully exploit static regions. Several efficient training strategies are integrated including a gradient detachment technique and an enhanced view synthesis process for better performance. Our proposed method outperforms the previous self-supervised works by a large margin and catches up to the performance of supervised methods. On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44% and demonstrates superior performance across sub-tasks including depth and visual odometry, amongst other self-supervised single-task or multi-task methods.

cs.AI - 2023-09-04

Efficient Defense Against Model Stealing Attacks on Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2309.01838
  • repo_url: https://github.com/kacemkhaled/defending-extraction
  • paper_authors: Kacem Khaled, Mouna Dhaouadi, Felipe Gohring de Magalhães, Gabriela Nicolescu
  • for: To propose a simple yet effective and efficient defense against model stealing attacks on deep learning models.
  • methods: The paper introduces a heuristic approach that perturbs the model's output probabilities to defend against stealing attacks; it can be integrated into models without additional training.
  • results: The proposed defense is effective against three state-of-the-art stealing attacks and outperforms state-of-the-art defenses with ×37 faster inference latency, no auxiliary model, and low impact on the model's performance; it is also effective for quantized CNNs targeting edge devices.
    Abstract Model stealing attacks have become a serious concern for deep learning models, where an attacker can steal a trained model by querying its black-box API. This can lead to intellectual property theft and other security and privacy risks. The current state-of-the-art defenses against model stealing attacks suggest adding perturbations to the prediction probabilities. However, they suffer from heavy computations and make impracticable assumptions about the adversary. They often require the training of auxiliary models. This can be time-consuming and resource-intensive which hinders the deployment of these defenses in real-world applications. In this paper, we propose a simple yet effective and efficient defense alternative. We introduce a heuristic approach to perturb the output probabilities. The proposed defense can be easily integrated into models without additional training. We show that our defense is effective in defending against three state-of-the-art stealing attacks. We evaluate our approach on large and quantized (i.e., compressed) Convolutional Neural Networks (CNNs) trained on several vision datasets. Our technique outperforms the state-of-the-art defenses with a $\times37$ faster inference latency without requiring any additional model and with a low impact on the model's performance. We validate that our defense is also effective for quantized CNNs targeting edge devices.
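
In the same spirit as the proposed heuristic, a defense can perturb the returned probabilities while preserving the top-1 label, so that benign accuracy is unaffected while a thief's training signal is poisoned; the noise model below is an assumption, not the paper's exact perturbation.

```python
import numpy as np

def perturb_probs(probs: np.ndarray, scale: float = 0.2,
                  rng=np.random.default_rng(0)) -> np.ndarray:
    """probs: (N, C) softmax outputs; returns noisy, argmax-preserving probs."""
    top1 = probs.argmax(axis=1)
    noisy = probs + rng.uniform(0.0, scale, size=probs.shape)
    noisy /= noisy.sum(axis=1, keepdims=True)
    # Restore the original argmax so the defended API stays accurate.
    rows = np.arange(len(probs))
    noisy[rows, top1] = np.maximum(noisy[rows, top1],
                                   noisy.max(axis=1) + 1e-3)
    return noisy / noisy.sum(axis=1, keepdims=True)
```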

Smoothing ADMM for Sparse-Penalized Quantile Regression with Non-Convex Penalties

  • paper_url: http://arxiv.org/abs/2309.03094
  • repo_url: None
  • paper_authors: Reza Mirzaeifard, Naveen K. D. Venkategowda, Vinay Chakravarthi Gogineni, Stefan Werner
  • for: To investigate quantile regression in the presence of non-convex and non-smooth sparse penalties, and to propose SIAD, a novel single-loop smoothing ADMM algorithm that accelerates convergence.
  • methods: The paper builds on iterative techniques such as coordinate descent and local linear approximation, and on the alternating direction method of multipliers (ADMM).
  • results: Numerical results show that SIAD is faster and more stable than existing methods, providing a better solution for sparse-penalized quantile regression.
    Abstract This paper investigates quantile regression in the presence of non-convex and non-smooth sparse penalties, such as the minimax concave penalty (MCP) and smoothly clipped absolute deviation (SCAD). The non-smooth and non-convex nature of these problems often leads to convergence difficulties for many algorithms. While iterative techniques like coordinate descent and local linear approximation can facilitate convergence, the process is often slow. This sluggish pace is primarily due to the need to run these approximation techniques until full convergence at each step, a requirement we term as a \emph{secondary convergence iteration}. To accelerate the convergence speed, we employ the alternating direction method of multipliers (ADMM) and introduce a novel single-loop smoothing ADMM algorithm with an increasing penalty parameter, named SIAD, specifically tailored for sparse-penalized quantile regression. We first delve into the convergence properties of the proposed SIAD algorithm and establish the necessary conditions for convergence. Theoretically, we confirm a convergence rate of $o\big({k^{-\frac{1}{4}}\big)$ for the sub-gradient bound of augmented Lagrangian. Subsequently, we provide numerical results to showcase the effectiveness of the SIAD algorithm. Our findings highlight that the SIAD method outperforms existing approaches, providing a faster and more stable solution for sparse-penalized quantile regression.

One Wide Feedforward is All You Need

  • paper_url: http://arxiv.org/abs/2309.01826
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan
  • for: To investigate the redundancy of the Feed Forward Network (FFN) in the Transformer architecture and show that reducing its parameter count can improve both accuracy and latency.
  • methods: The FFN is removed from the decoder layers and a single FFN is shared across the encoder; the shared FFN's hidden dimension is then increased to scale the architecture back to its original size.
  • results: Removing and sharing FFNs substantially reduces the parameter count with only a modest drop in accuracy, and the scaled-up shared-FFN model achieves substantial gains in both accuracy and latency over the original Transformer Big.
    Abstract The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
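
The weight-sharing recipe can be sketched in PyTorch by passing one FFN module object to every encoder layer; the sizes below are placeholders rather than the paper's configuration.

```python
import torch.nn as nn

d_model, d_ff, n_layers = 512, 8192, 6  # one wide FFN instead of 6 narrow ones
shared_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

class EncoderLayer(nn.Module):
    def __init__(self, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8,
                                          batch_first=True)
        self.ffn = ffn  # same module object in every layer -> shared weights
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ffn(x))

encoder = nn.ModuleList([EncoderLayer(shared_ffn) for _ in range(n_layers)])
```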

Towards Foundational AI Models for Additive Manufacturing: Language Models for G-Code Debugging, Manipulation, and Comprehension

  • paper_url: http://arxiv.org/abs/2309.02465
  • repo_url: https://github.com/idealab-isu/llm4g-code
  • paper_authors: Anushrut Jignasu, Kelly Marshall, Baskar Ganapathysubramanian, Aditya Balu, Chinmay Hegde, Adarsh Krishnamurthy
  • for: To describe how existing large language models (LLMs) can be used to comprehend and manipulate G-code files for 3D printing.
  • methods: Six state-of-the-art LLMs are evaluated using effective prompts designed to let the models understand and manipulate G-code.
  • results: A comprehensive performance evaluation of the six LLMs analyzes their strengths and weaknesses in comprehending complete G-code files.
    Abstract 3D printing or additive manufacturing is a revolutionary technology that enables the creation of physical objects from digital models. However, the quality and accuracy of 3D printing depend on the correctness and efficiency of the G-code, a low-level numerical control programming language that instructs 3D printers how to move and extrude material. Debugging G-code is a challenging task that requires a syntactic and semantic understanding of the G-code format and the geometry of the part to be printed. In this paper, we present the first extensive evaluation of six state-of-the-art foundational large language models (LLMs) for comprehending and debugging G-code files for 3D printing. We design effective prompts to enable pre-trained LLMs to understand and manipulate G-code and test their performance on various aspects of G-code debugging and manipulation, including detection and correction of common errors and the ability to perform geometric transformations. We analyze their strengths and weaknesses for understanding complete G-code files. We also discuss the implications and limitations of using LLMs for G-code comprehension.
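
The prompting setup implied by the evaluation can be as simple as wrapping a G-code excerpt in a debugging instruction before sending it to a chat LLM; the wording below is illustrative, not one of the paper's prompts.

```python
def build_gcode_debug_prompt(gcode_snippet: str) -> str:
    """Wrap raw G-code in a debugging instruction for an LLM."""
    return (
        "You are an expert in 3D-printer G-code.\n"
        "Find and correct any errors in the following G-code, and briefly "
        "explain each fix:\n\n"
        f"{gcode_snippet}"
    )

prompt = build_gcode_debug_prompt("G1 X50 Y25.3 E22.4 ; extrude move")
```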

DiscoverPath: A Knowledge Refinement and Retrieval System for Interdisciplinarity on Biomedical Research

  • paper_url: http://arxiv.org/abs/2309.01808
  • repo_url: https://github.com/ynchuang/discoverpath
  • paper_authors: Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Kwei-Herng Lai, Daochen Zha, Ruixiang Tang, Fan Yang, Alfredo Costilla Reyes, Kaixiong Zhou, Xiaoqian Jiang, Xia Hu
  • for: To improve the efficiency of literature retrieval in biomedical research, especially across interdisciplinary fields, by using a knowledge graph to enhance the user experience.
  • methods: Named Entity Recognition (NER) and part-of-speech (POS) tagging extract terminologies and relationships from article abstracts, which are assembled into a knowledge graph.
  • results: The system provides an open-source graphical user interface that helps users find relevant articles and supports iterative knowledge exploration.
    Abstract The exponential growth in scholarly publications necessitates advanced tools for efficient article retrieval, especially in interdisciplinary fields where diverse terminologies are used to describe similar research. Traditional keyword-based search engines often fall short in assisting users who may not be familiar with specific terminologies. To address this, we present a knowledge graph-based paper search engine for biomedical research to enhance the user experience in discovering relevant queries and articles. The system, dubbed DiscoverPath, employs Named Entity Recognition (NER) and part-of-speech (POS) tagging to extract terminologies and relationships from article abstracts to create a KG. To reduce information overload, DiscoverPath presents users with a focused subgraph containing the queried entity and its neighboring nodes and incorporates a query recommendation system, enabling users to iteratively refine their queries. The system is equipped with an accessible Graphical User Interface that provides an intuitive visualization of the KG, query recommendations, and detailed article information, enabling efficient article retrieval, thus fostering interdisciplinary knowledge exploration. DiscoverPath is open-sourced at https://github.com/ynchuang/DiscoverPath.

Marginalized Importance Sampling for Off-Environment Policy Evaluation

  • paper_url: http://arxiv.org/abs/2309.01807
  • repo_url: None
  • paper_authors: Pulkit Katdare, Nan Jiang, Katherine Driggs-Campbell
  • for: To evaluate the real-world performance of RL policies without deploying them in the real world.
  • methods: The Marginalized Importance Sampling (MIS) framework is used, combining a simulator with real-world offline data to evaluate the real-world performance of any policy.
  • results: The approach is evaluated empirically on several Sim2Sim environments and target policies with different offline data-collection policies, and on a Sim2Real task assessing a 7-degree-of-freedom robotic arm.
    Abstract Reinforcement Learning (RL) methods are typically sample-inefficient, making it challenging to train and deploy RL-policies in real world robots. Even a robust policy trained in simulation, requires a real-world deployment to assess their performance. This paper proposes a new approach to evaluate the real-world performance of agent policies without deploying them in the real world. The proposed approach incorporates a simulator along with real-world offline data to evaluate the performance of any policy using the framework of Marginalized Importance Sampling (MIS). Existing MIS methods face two challenges: (1) large density ratios that deviate from a reasonable range and (2) indirect supervision, where the ratio needs to be inferred indirectly, thus exacerbating estimation error. Our approach addresses these challenges by introducing the target policy's occupancy in the simulator as an intermediate variable and learning the density ratio as the product of two terms that can be learned separately. The first term is learned with direct supervision and the second term has a small magnitude, thus making it easier to run. We analyze the sample complexity as well as error propagation of our two step-procedure. Furthermore, we empirically evaluate our approach on Sim2Sim environments such as Cartpole, Reacher and Half-Cheetah. Our results show that our method generalizes well across a variety of Sim2Sim gap, target policies and offline data collection policies. We also demonstrate the performance of our algorithm on a Sim2Real task of validating the performance of a 7 DOF robotic arm using offline data along with a gazebo based arm simulator.
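
In the standard MIS notation (an assumption here, following the off-policy evaluation literature), the estimator reweights offline rewards by a state-action occupancy ratio:

```latex
\hat{J}(\pi) = \frac{1}{n}\sum_{i=1}^{n} w(s_i, a_i)\, r_i,
\qquad
w(s, a) = \frac{d^{\pi}(s, a)}{d^{D}(s, a)},
```

where d^π is the occupancy of the target policy and d^D that of the offline data; the two-step procedure described above learns w as the product of a simulator-occupancy term (learned with direct supervision) and a second, small-magnitude correction term.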

Neural-Singular-Hessian: Implicit Neural Representation of Unoriented Point Clouds by Enforcing Singular Hessian

  • paper_url: http://arxiv.org/abs/2309.01793
  • repo_url: https://github.com/bearprin/Neural-Singular-Hessian
  • paper_authors: Zixiong Wang, Yunxiao Zhang, Rui Xu, Fan Zhang, Pengshuai Wang, Shuangmin Chen, Shiqing Xin, Wenping Wang, Changhe Tu
  • for: To address the problem of reconstructing surfaces from point clouds.
  • methods: The approach combines regularization terms, such as the Eikonal and Laplacian energy terms, to make the learned neural function behave like a Signed Distance Function (SDF). In addition, it enforces the Hessian of the neural implicit function to have a zero determinant for points near the surface, which aligns the gradients of a near-surface point and its on-surface projection, producing a rough but faithful shape.
  • results: Experiments show the method effectively suppresses ghost geometry and recovers details from unoriented point clouds with better expressiveness than existing fitting-based methods.
    Abstract Neural implicit representation is a promising approach for reconstructing surfaces from point clouds. Existing methods combine various regularization terms, such as the Eikonal and Laplacian energy terms, to enforce the learned neural function to possess the properties of a Signed Distance Function (SDF). However, inferring the actual topology and geometry of the underlying surface from poor-quality unoriented point clouds remains challenging. In accordance with Differential Geometry, the Hessian of the SDF is singular for points within the differential thin-shell space surrounding the surface. Our approach enforces the Hessian of the neural implicit function to have a zero determinant for points near the surface. This technique aligns the gradients for a near-surface point and its on-surface projection point, producing a rough but faithful shape within just a few iterations. By annealing the weight of the singular-Hessian term, our approach ultimately produces a high-fidelity reconstruction result. Extensive experimental results demonstrate that our approach effectively suppresses ghost geometry and recovers details from unoriented point clouds with better expressiveness than existing fitting-based methods.
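
The singular-Hessian term can be sketched as a penalty on the determinant of the Hessian of the implicit function at near-surface samples; the torch.func-based code below (PyTorch 2.0+) is a simple, per-point sketch for clarity, not the authors' implementation.

```python
import torch
from torch.func import hessian

def singular_hessian_loss(sdf_net, points: torch.Tensor) -> torch.Tensor:
    """points: (N, 3) samples in the thin shell around the surface."""
    f = lambda p: sdf_net(p.unsqueeze(0)).squeeze()  # scalar SDF at one point
    dets = torch.stack([torch.det(hessian(f)(p)) for p in points])
    return (dets ** 2).mean()  # drive det(H) toward zero near the surface
```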

3D View Prediction Models of the Dorsal Visual Stream

  • paper_url: http://arxiv.org/abs/2309.01782
  • repo_url: None
  • paper_authors: Gabriel Sarch, Hsiao-Yu Fish Tung, Aria Wang, Jacob Prince, Michael Tarr
  • for: To test whether a model trained to perceive 3D scene geometry aligns better with the functional properties of the brain's dorsal visual areas.
  • methods: A self-supervised geometry-aware recurrent neural network (GRNN) is trained to predict novel camera views using a 3D feature memory.
  • results: GRNN predicts novel camera views well and accounts for a greater proportion of variance in dorsal brain regions than baseline models.
    Abstract Deep neural network representations align well with brain activity in the ventral visual stream. However, the primate visual system has a distinct dorsal processing stream with different functional properties. To test if a model trained to perceive 3D scene geometry aligns better with neural responses in dorsal visual areas, we trained a self-supervised geometry-aware recurrent neural network (GRNN) to predict novel camera views using a 3D feature memory. We compared GRNN to self-supervised baseline models that have been shown to align well with ventral regions using the large-scale fMRI Natural Scenes Dataset (NSD). We found that while the baseline models accounted better for ventral brain regions, GRNN accounted for a greater proportion of variance in dorsal brain regions. Our findings demonstrate the potential for using task-relevant models to probe representational differences across visual streams.

On the size of irredundant propagation complete CNF formulas

  • paper_url: http://arxiv.org/abs/2309.01750
  • repo_url: None
  • paper_authors: Petr Savický
  • for: This paper investigates propagation complete (PC) CNF formulas for a symmetric definite Horn function of n variables.
  • methods: The minimum size of these formulas is related to specific covering numbers, namely the smallest number of k-subsets of an n-set covering all (k-1)-subsets for a suitable k.
  • results: The paper exhibits an irredundant PC formula whose size exceeds that of a smallest PC formula for the same function by a factor of Ω(n/ln n), complementing a known polynomial upper bound on this factor.
    Abstract We investigate propagation complete (PC) CNF formulas for a symmetric definite Horn function of $n$ variables and demonstrate that the minimum size of these formulas is closely related to specific covering numbers, namely, to the smallest number of $k$-subsets of an $n$-set covering all $(k-1)$-subsets for a suitable $k$. As a consequence, we demonstrate an irredundant PC formula whose size is larger than the size of a smallest PC formula for the same function by a factor $\Omega(n/\ln n)$. This complements a known polynomial upper bound on this factor.

Hybrid data driven/thermal simulation model for comfort assessment

  • paper_url: http://arxiv.org/abs/2309.01734
  • repo_url: None
  • paper_authors: Romain Barbedienne, Sara Yasmine Ouerk, Mouadh Yagoubi, Hassan Bouia, Aurelie Kaemmerlen, Benoit Charrier
  • for: To improve the speed and quality of physical models for thermal comfort prediction.
  • methods: Real data are hybridized with simulated data (generated in the Modelica language) to predict indoor thermal comfort, and several machine learning methods are benchmarked.
  • results: Results look promising, with an F1 score of 0.999 obtained using the random forest model.
    Abstract Machine learning models improve the speed and quality of physical models. However, they require a large amount of data, which is often difficult and costly to acquire. Predicting thermal comfort, for example, requires a controlled environment, with participants presenting various characteristics (age, gender, ...). This paper proposes a method for hybridizing real data with simulated data for thermal comfort prediction. The simulations are performed using Modelica Language. A benchmarking study is realized to compare different machine learning methods. Obtained results look promising with an F1 score of 0.999 obtained using the random forest model.
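
The benchmarking step reduces to standard scikit-learn usage; the features and synthetic labels below are placeholders for the real and simulated comfort records.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)      # e.g., air temperature, humidity, age, ...
y = (X[:, 0] > 0.5).astype(int)  # placeholder comfort label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```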

Softmax Bias Correction for Quantized Generative Models

  • paper_url: http://arxiv.org/abs/2309.01729
  • repo_url: None
  • paper_authors: Nilesh Prasad Pandey, Marios Fournarakis, Chirag Patel, Markus Nagel
  • for: To improve the runtime and power efficiency of large generative models, such as Stable Diffusion or large language models, on edge devices.
  • methods: The paper investigates the source of the softmax's sensitivity to quantization and proposes an offline bias correction, absorbed into the quantization parameters at deployment, that improves the quantizability of the softmax.
  • results: Significant accuracy improvements for the 8-bit quantized softmax are demonstrated on Stable Diffusion v1.5 and a 125M-parameter OPT language model.
    Abstract Post-training quantization (PTQ) is the go-to compression technique for large generative models, such as stable diffusion or large language models. PTQ methods commonly keep the softmax activation in higher precision as it has been shown to be very sensitive to quantization noise. However, this can lead to a significant runtime and power overhead during inference on resource-constraint edge devices. In this work, we investigate the source of the softmax sensitivity to quantization and show that the quantization operation leads to a large bias in the softmax output, causing accuracy degradation. To overcome this issue, we propose an offline bias correction technique that improves the quantizability of softmax without additional compute during deployment, as it can be readily absorbed into the quantization parameters. We demonstrate the effectiveness of our method on stable diffusion v1.5 and 125M-size OPT language model, achieving significant accuracy improvement for 8-bit quantized softmax.

Interdisciplinary Fairness in Imbalanced Research Proposal Topic Inference: A Hierarchical Transformer-based Method with Selective Interpolation

  • paper_url: http://arxiv.org/abs/2309.01717
  • repo_url: None
  • paper_authors: Meng Xiao, Min Wu, Ziyue Qiao, Yanjie Fu, Zhiyuan Ning, Yi Du, Yuanchun Zhou
  • for: Improve the fairness of automated topic inference for research proposals and mitigate the bias and unfairness introduced by manual topic filling.
  • methods: A topic label inference system based on a Transformer encoder-decoder architecture; interpolation techniques generate pseudo-interdisciplinary proposals from non-interdisciplinary ones during training to reduce model bias.
  • results: Extensive experiments on a real-world dataset show that the proposed method significantly mitigates unfairness in the automated topic inference task.
    Abstract The objective of topic inference in research proposals aims to obtain the most suitable disciplinary division from the discipline system defined by a funding agency. The agency will subsequently find appropriate peer review experts from their database based on this division. Automated topic inference can reduce human errors caused by manual topic filling, bridge the knowledge gap between funding agencies and project applicants, and improve system efficiency. Existing methods focus on modeling this as a hierarchical multi-label classification problem, using generative models to iteratively infer the most appropriate topic information. However, these methods overlook the gap in scale between interdisciplinary research proposals and non-interdisciplinary ones, leading to an unjust phenomenon where the automated inference system categorizes interdisciplinary proposals as non-interdisciplinary, causing unfairness during the expert assignment. How can we address this data imbalance issue under a complex discipline system and hence resolve this unfairness? In this paper, we implement a topic label inference system based on a Transformer encoder-decoder architecture. Furthermore, we utilize interpolation techniques to create a series of pseudo-interdisciplinary proposals from non-interdisciplinary ones during training based on non-parametric indicators such as cross-topic probabilities and topic occurrence probabilities. This approach aims to reduce the bias of the system during model training. Finally, we conduct extensive experiments on a real-world dataset to verify the effectiveness of the proposed method. The experimental results demonstrate that our training strategy can significantly mitigate the unfairness generated in the topic inference task.
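The interpolation step can be pictured as a mixup-style blend of two non-interdisciplinary proposals; a minimal sketch under that assumption (the paper drives the mixing with non-parametric indicators such as cross-topic and topic occurrence probabilities, whereas the fixed weight below is a simplification):

```python
import numpy as np

def pseudo_interdisciplinary(x_a, y_a, x_b, y_b, lam=0.5):
    """Blend two non-interdisciplinary proposals (embeddings x, multi-hot
    topic labels y) into one pseudo-interdisciplinary training sample."""
    x_mix = lam * x_a + (1.0 - lam) * x_b
    y_mix = np.clip(y_a + y_b, 0, 1)  # union of the two topic label sets
    return x_mix, y_mix
```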

On the Robustness of Post-hoc GNN Explainers to Label Noise

  • paper_url: http://arxiv.org/abs/2309.01706
  • repo_url: None
  • paper_authors: Zhiqiang Zhong, Yangqianzi Jiang, Davide Mottin
  • for: Investigate the robustness of post-hoc graph neural network (GNN) explainers under label noise.
  • methods: A systematic empirical study of diverse post-hoc GNN explainers across varying degrees of label noise.
  • results: Post-hoc GNN explainers are susceptible to label perturbations; even minor label noise that is inconsequential to GNN performance substantially harms explanation quality; the paper also discusses the progressive recovery of explanation effectiveness as noise levels escalate.
    Abstract Proposed as a solution to the inherent black-box limitations of graph neural networks (GNNs), post-hoc GNN explainers aim to provide precise and insightful explanations of the behaviours exhibited by trained GNNs. Despite their recent notable advancements in academic and industrial contexts, the robustness of post-hoc GNN explainers remains unexplored when confronted with label noise. To bridge this gap, we conduct a systematic empirical investigation to evaluate the efficacy of diverse post-hoc GNN explainers under varying degrees of label noise. Our results reveal several key insights: Firstly, post-hoc GNN explainers are susceptible to label perturbations. Secondly, even minor levels of label noise, inconsequential to GNN performance, harm the quality of generated explanations substantially. Lastly, we engage in a discourse regarding the progressive recovery of explanation effectiveness with escalating noise levels.
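The perturbation protocol amounts to flipping a fraction of training labels before re-running each explainer; a minimal sketch (the uniform-flip noise model is an assumption):

```python
import numpy as np

def flip_labels(y, noise_rate, num_classes, seed=0):
    """Uniformly flip a fraction of integer labels to a different class."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    idx = rng.choice(len(y), size=int(noise_rate * len(y)), replace=False)
    for i in idx:
        choices = [c for c in range(num_classes) if c != y[i]]
        y[i] = rng.choice(choices)
    return y

# y_noisy = flip_labels(y_train, noise_rate=0.05, num_classes=7)
```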

No Data Augmentation? Alternative Regularizations for Effective Training on Small Datasets

  • paper_url: http://arxiv.org/abs/2309.01694
  • repo_url: None
  • paper_authors: Lorenzo Brigato, Stavroula Mougiakakou
  • for: Solve image classification with small training sets, an open challenge for modern computer vision.
  • methods: Alternative regularization strategies: scaling the model size and training schedule, plus a heuristic that selects (semi-)optimal learning-rate and weight-decay couples via the norm of the model parameters.
  • results: Training on only 1% of the original CIFAR-10 training set (i.e., 50 images per class) and testing on ciFAIR-10 reaches a test accuracy of 66.5%, on par with the best state-of-the-art methods.
    Abstract Solving image classification tasks given small training datasets remains an open challenge for modern computer vision. Aggressive data augmentation and generative models are among the most straightforward approaches to overcoming the lack of data. However, the first fails to be agnostic to varying image domains, while the latter requires additional compute and careful design. In this work, we study alternative regularization strategies to push the limits of supervised learning on small image classification datasets. In particular, along with the model size and training schedule scaling, we employ a heuristic to select (semi) optimal learning rate and weight decay couples via the norm of model parameters. By training on only 1% of the original CIFAR-10 training set (i.e., 50 images per class) and testing on ciFAIR-10, a variant of the original CIFAR without duplicated images, we reach a test accuracy of 66.5%, on par with the best state-of-the-art methods.
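One way to read the parameter-norm heuristic (a guess at the mechanics, not the paper's exact rule): run short trials over candidate (learning rate, weight decay) couples and keep the couple whose final weight norm lands closest to a target value:

```python
import itertools

def select_lr_wd(short_train, lrs, wds, target_norm):
    """short_train(lr, wd) -> final L2 norm of the model parameters after a
    brief training run (assumed provided by the caller)."""
    best, best_gap = None, float("inf")
    for lr, wd in itertools.product(lrs, wds):
        gap = abs(short_train(lr, wd) - target_norm)
        if gap < best_gap:
            best, best_gap = (lr, wd), gap
    return best
```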

Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models

  • paper_url: http://arxiv.org/abs/2309.01674
  • repo_url: https://github.com/hassanhajj910/prompt-me-a-dataset
  • paper_authors: Hassan El-Hajj, Matteo Valleriani
  • for: Present a foundation-model pipeline for extracting images from historical documents and evaluate the effectiveness of text-image prompting on humanities datasets of varying complexity.
  • methods: The pipeline relies on GroundDINO and Meta's Segment-Anything-Model (SAM) to retrieve large amounts of visual data from historical documents, and evaluates the effect of different linguistic prompts on the resulting detections.
  • results: Text-image prompting enables effective image extraction for downstream development tasks and dataset creation across humanities datasets of varying complexity.
    Abstract In this paper, we present a pipeline for image extraction from historical documents using foundation models, and evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity. The motivation for this approach stems from the high interest of historians in visual elements printed alongside historical texts on the one hand, and from the relative lack of well-annotated datasets within the humanities when compared to other domains. We propose a sequential approach that relies on GroundDINO and Meta's Segment-Anything-Model (SAM) to retrieve a significant portion of visual data from historical documents that can then be used for downstream development tasks and dataset creation, as well as evaluate the effect of different linguistic prompts on the resulting detections.

Fine-grained Affective Processing Capabilities Emerging from Large Language Models

  • paper_url: http://arxiv.org/abs/2309.01664
  • repo_url: None
  • paper_authors: Joost Broekens, Bernhard Hilpert, Suzan Verberne, Kim Baraka, Patrick Gebhard, Aske Plaat
  • for: Explore ChatGPT's zero-shot ability to perform affective computing tasks using prompting alone.
  • methods: Prompt ChatGPT to perform sentiment analysis, represent emotions, and carry out appraisal-based emotion elicitation via a prompt-based computational implementation of the OCC appraisal model.
  • results: ChatGPT performs meaningful sentiment analysis in the Valence, Arousal and Dominance dimensions, has meaningful emotion representations in terms of emotion categories and these affective dimensions, and can perform basic appraisal-based emotion elicitation of situations; these findings are relevant for sentiment analysis, socially interactive agents, and social robotics.
    Abstract Large language models, in particular generative pre-trained transformers (GPTs), show impressive results on a wide variety of language-related tasks. In this paper, we explore ChatGPT's zero-shot ability to perform affective computing tasks using prompting alone. We show that ChatGPT a) performs meaningful sentiment analysis in the Valence, Arousal and Dominance dimensions, b) has meaningful emotion representations in terms of emotion categories and these affective dimensions, and c) can perform basic appraisal-based emotion elicitation of situations based on a prompt-based computational implementation of the OCC appraisal model. These findings are highly relevant: First, they show that the ability to solve complex affect processing tasks emerges from language-based token prediction trained on extensive data sets. Second, they show the potential of large language models for simulating, processing and analyzing human emotions, which has important implications for various applications such as sentiment analysis, socially interactive agents, and social robotics.
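A sketch of the kind of zero-shot VAD prompt involved; `query_llm` is a placeholder for any chat-model API call, and the exact wording and scales are assumptions, not the paper's prompts:

```python
def vad_prompt(text: str) -> str:
    """Build a zero-shot Valence/Arousal/Dominance rating prompt."""
    return (
        "Rate the following text on Valence, Arousal and Dominance, "
        "each on a scale from 1 (low) to 9 (high). "
        "Answer only in the form 'V=<v>, A=<a>, D=<d>'.\n\n"
        f"Text: {text}"
    )

# response = query_llm(vad_prompt("I finally passed my exam!"))
```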

Unveiling Theory of Mind in Large Language Models: A Parallel to Single Neurons in the Human Brain

  • paper_url: http://arxiv.org/abs/2309.01660
  • repo_url: None
  • paper_authors: Mohsen Jamali, Ziv M. Williams, Jing Cai
  • for: This paper explores the ability of large language models (LLMs) to exhibit a Theory of Mind (ToM), a cognitive capacity related to our conscious mind that allows us to infer another's beliefs and perspective.
  • methods: The authors drew inspiration from the dorsal medial prefrontal cortex (dmPFC) neurons subserving human ToM and employed a similar methodology, analyzing the hidden embeddings (artificial neurons) within LLMs to see whether they can represent another's perspective.
  • results: The hidden embeddings within LLMs exhibited significant responsiveness to either true- or false-belief trials, suggesting an ability to represent another's perspective; these responses correlated with the LLMs' performance on ToM tasks and depended on model size, and another's beliefs could be accurately decoded using the entire embeddings, indicating a population-level ToM capability and offering initial evidence of a parallel between the artificial model and neurons in the human brain.
    Abstract With their recent development, large language models (LLMs) have been found to exhibit a certain level of Theory of Mind (ToM), a complex cognitive capacity that is related to our conscious mind and that allows us to infer another's beliefs and perspective. While human ToM capabilities are believed to derive from the neural activity of a broadly interconnected brain network, including that of dorsal medial prefrontal cortex (dmPFC) neurons, the precise processes underlying LLM's capacity for ToM or their similarities with that of humans remains largely unknown. In this study, we drew inspiration from the dmPFC neurons subserving human ToM and employed a similar methodology to examine whether LLMs exhibit comparable characteristics. Surprisingly, our analysis revealed a striking resemblance between the two, as hidden embeddings (artificial neurons) within LLMs started to exhibit significant responsiveness to either true- or false-belief trials, suggesting their ability to represent another's perspective. These artificial embedding responses were closely correlated with the LLMs' performance during the ToM tasks, a property that was dependent on the size of the models. Further, the other's beliefs could be accurately decoded using the entire embeddings, indicating the presence of the embeddings' ToM capability at the population level. Together, our findings revealed an emergent property of LLMs' embeddings that modified their activities in response to ToM features, offering initial evidence of a parallel between the artificial model and neurons in the human brain.
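The population-level decoding described can be sketched as a linear probe over hidden embeddings; the arrays below are assumed to have been extracted beforehand:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X: (n_trials, hidden_dim) embeddings from true-/false-belief trials,
# y: (n_trials,) binary belief labels. Both assumed precomputed.
probe = LogisticRegression(max_iter=1000)
print("decoding accuracy:", cross_val_score(probe, X, y, cv=5).mean())
```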

Which algorithm to select in sports timetabling?

  • paper_url: http://arxiv.org/abs/2309.03229
  • repo_url: https://github.com/robertomrosati/sa4stt
  • paper_authors: David Van Bulck, Dries Goossens, Jan-Patrick Clarner, Angelos Dimitsas, George H. G. Fonseca, Carlos Lamas-Fernandez, Martin Mariusz Lester, Jaap Pedersen, Antony E. Phillips, Roberto Maria Rosati
  • for: Sports timetabling.
  • methods: Machine learning techniques (instance space analysis and algorithm selection).
  • results: Proposed an algorithm selection system that predicts the best-performing algorithm from the characteristics of a sports timetabling problem instance; identified the features that matter most for that prediction, providing insights into the algorithms' performance and suggestions for improvement; and empirically evaluated the hardness of the instances.
    Abstract Any sports competition needs a timetable, specifying when and where teams meet each other. The recent International Timetabling Competition (ITC2021) on sports timetabling showed that, although it is possible to develop general algorithms, the performance of each algorithm varies considerably over the problem instances. This paper provides an instance space analysis for sports timetabling, resulting in powerful insights into the strengths and weaknesses of eight state-of-the-art algorithms. Based on machine learning techniques, we propose an algorithm selection system that predicts which algorithm is likely to perform best when given the characteristics of a sports timetabling problem instance. Furthermore, we identify which characteristics are important in making that prediction, providing insights in the performance of the algorithms, and suggestions to further improve them. Finally, we assess the empirical hardness of the instances. Our results are based on large computational experiments involving about 50 years of CPU time on more than 500 newly generated problem instances.
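The selection system boils down to a supervised model mapping instance features to the best-performing algorithm; a minimal sketch (the feature matrix and per-instance winner labels are assumed available):

```python
from sklearn.ensemble import RandomForestClassifier

# X: (n_instances, n_features) instance-space features of timetabling
# problems; y: index of the algorithm that performed best on each instance.
selector = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
# predicted_algorithm = selector.predict(features_of_new_instance)
```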

Design of Recognition and Evaluation System for Table Tennis Players’ Motor Skills Based on Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2309.07141
  • repo_url: None
  • paper_authors: Zhuo-yong Shi, Ye-tao Jia, Ke-xin Zhang, Ding-han Wang, Long-meng Ji, Yong Wu
  • for: Improve the ability of wearable devices to recognize and analyze the movements of a specific sport.
  • methods: An AI-based device collects the motion information of table tennis players; a sliding window divides the collected data into a feature database of six benchmark movements; motion features are constructed via feature engineering and, after dimensionality reduction, used to identify motor skills; a hierarchical evaluation system is built from the loss functions of different evaluation indexes.
  • results: The feature-based BP neural network achieves higher recognition accuracy and stronger generalization than a traditional convolutional neural network for recognizing table tennis players' motor skills.
    Abstract With the rapid development of electronic science and technology, the research on wearable devices is constantly updated, but for now, it is not comprehensive for wearable devices to recognize and analyze the movement of specific sports. Based on this, this paper improves wearable devices of table tennis sport, and realizes the pattern recognition and evaluation of table tennis players' motor skills through artificial intelligence. Firstly, a device is designed to collect the movement information of table tennis players and the actual movement data is processed. Secondly, a sliding window is made to divide the collected motion data into a characteristic database of six table tennis benchmark movements. Thirdly, motion features were constructed based on feature engineering, and motor skills were identified for different models after dimensionality reduction. Finally, the hierarchical evaluation system of motor skills is established with the loss functions of different evaluation indexes. The results show that in the recognition of table tennis players' motor skills, the feature-based BP neural network proposed in this paper has higher recognition accuracy and stronger generalization ability than the traditional convolutional neural network.
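The sliding-window segmentation step is straightforward; the window and step sizes below are placeholders, not the paper's settings:

```python
import numpy as np

def sliding_windows(signal, window, step):
    """Split a (T, channels) motion signal into overlapping windows."""
    return np.stack([signal[i:i + window]
                     for i in range(0, len(signal) - window + 1, step)])

# windows = sliding_windows(imu_signal, window=128, step=64)
```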

Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

  • paper_url: http://arxiv.org/abs/2309.01640
  • repo_url: None
  • paper_authors: Etay Livne, Gal Kaplun, Eran Malach, Shai Shalev-Shwartz
  • for: Improve the efficiency of data access when training machine learning models with Stochastic Gradient Descent (SGD) on large datasets stored in the cloud.
  • methods: Builds on the online shuffling algorithm CorgiPile, which improves data-access efficiency at the cost of some performance loss, and proposes a new two-step partial data shuffling strategy that combines an offline iteration of the CorgiPile method with a subsequent online iteration.
  • results: A comprehensive theoretical analysis of the convergence properties, plus experiments showing performance similar to SGD with random access (even for homogeneous data) without compromising CorgiPile's data-access efficiency.
    Abstract When using Stochastic Gradient Descent (SGD) for training machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient. A recent work \cite{corgi}, proposed an online shuffling algorithm called CorgiPile, which greatly improves efficiency of data access, at the cost of some performance loss, which is particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we introduce a novel two-step partial data shuffling strategy for SGD which combines an offline iteration of the CorgiPile method with a subsequent online iteration. Our approach enjoys the best of both worlds: it performs similarly to SGD with random access (even for homogeneous data) without compromising the data access efficiency of CorgiPile. We provide a comprehensive theoretical analysis of the convergence properties of our method and demonstrate its practical advantages through experimental results.
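A rough sketch of the two-stage idea: shuffle the order of storage chunks offline, then stream them through a bounded in-memory buffer that is shuffled online. This is a generic chunk-then-buffer scheme under assumed sizes, not the paper's exact algorithm:

```python
import random

def two_step_stream(chunks, buffer_size, seed=0):
    """chunks: list of lists of examples as laid out in cloud storage."""
    rng = random.Random(seed)
    order = list(range(len(chunks)))
    rng.shuffle(order)                 # offline step: permute chunk order
    buf = []
    for c in order:                    # online step: sequential chunk reads
        buf.extend(chunks[c])
        while len(buf) > buffer_size:  # emit examples from a shuffled buffer
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)
    yield from buf
```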

Concepts is All You Need: A More Direct Path to AGI

  • paper_url: http://arxiv.org/abs/2309.01622
  • repo_url: None
  • paper_authors: Peter Voss, Mladjan Jovanovic
  • for: Expedite the development of Artificial General Intelligence (AGI), i.e., computers with human-level intelligence.
  • methods: A Cognitive AI approach, rather than the currently favored statistical and generative efforts, that identifies the core requirements of human-like intelligence, in particular the central role of concepts in human-like cognition.
  • results: An architecture and development plan, together with some preliminary results, offering a much more direct path to full Human-Level AI (HLAI)/AGI.
    Abstract Little demonstrable progress has been made toward AGI (Artificial General Intelligence) since the term was coined some 20 years ago. In spite of the fantastic breakthroughs in Statistical AI such as AlphaZero, ChatGPT, and Stable Diffusion none of these projects have, or claim to have, a clear path to AGI. In order to expedite the development of AGI it is crucial to understand and identify the core requirements of human-like intelligence as it pertains to AGI. From that one can distill which particular development steps are necessary to achieve AGI, and which are a distraction. Such analysis highlights the need for a Cognitive AI approach rather than the currently favored statistical and generative efforts. More specifically it identifies the central role of concepts in human-like cognition. Here we outline an architecture and development plan, together with some preliminary results, that offers a much more direct path to full Human-Level AI (HLAI)/ AGI.

DeViL: Decoding Vision features into Language

  • paper_url: http://arxiv.org/abs/2309.01617
  • repo_url: https://github.com/ExplainableML/DeViL
  • paper_authors: Meghal Dani, Isabel Rio-Torto, Stephan Alaniz, Zeynep Akata
  • for: Provide natural language descriptions of what the different layers of a vision backbone have learned.
  • methods: DeViL decodes vision features into language, highlighting attribution locations and generating textual descriptions of visual features at different layers; a transformer network translates individual image features of any vision layer into a prompt that an off-the-shelf pre-trained language model decodes into natural language, with dropout applied per layer and per spatial location so that training on image-text pairs generalizes to localized explanations.
  • results: DeViL generates textual descriptions relevant to the image content on CC3M, surpassing previous lightweight captioning models, produces attribution maps that uncover the concepts learned by the vision backbone, and outperforms the current state of the art on neuron-wise descriptions on the MILANNOTATIONS dataset.
    Abstract Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned. Our DeViL method decodes vision features into language, not only highlighting the attribution locations but also generating textual descriptions of visual features at different layers of the network. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language. By employing dropout both per-layer and per-spatial-location, our model can generalize training on image-text pairs to generate localized explanations. As it uses a pre-trained language model, our approach is fast to train, can be applied to any vision backbone, and produces textual descriptions at different layers of the vision network. Moreover, DeViL can create open-vocabulary attribution maps corresponding to words or phrases even outside the training scope of the vision model. We demonstrate that DeViL generates textual descriptions relevant to the image content on CC3M surpassing previous lightweight captioning models and attribution maps uncovering the learned concepts of the vision backbone. Finally, we show DeViL also outperforms the current state-of-the-art on the neuron-wise descriptions of the MILANNOTATIONS dataset. Code available at https://github.com/ExplainableML/DeViL
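One plausible shape of the translation module (the dimensions and prompt length are assumptions, not the released architecture): project a vision-layer feature into a short sequence of soft prompt embeddings consumed by a frozen language model:

```python
import torch
import torch.nn as nn

class FeatureToPrompt(nn.Module):
    """Map a vision feature vector to a sequence of LM prompt embeddings."""
    def __init__(self, feat_dim, lm_dim, prompt_len):
        super().__init__()
        self.proj = nn.Linear(feat_dim, lm_dim * prompt_len)
        self.prompt_len, self.lm_dim = prompt_len, lm_dim

    def forward(self, feat):  # feat: (B, feat_dim)
        return self.proj(feat).view(-1, self.prompt_len, self.lm_dim)
```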

Deep Learning Overloaded Vehicle Identification for Long Span Bridges Based on Structural Health Monitoring Data

  • paper_url: http://arxiv.org/abs/2309.01593
  • repo_url: None
  • paper_authors: Yuqin Li, Jun Liu, Shengliang Zhong, Licheng Zhou, Shoubin Dong, Zejia Liu, Liqun Tang
  • for: Identify overloaded vehicles on long-span bridges using structural health monitoring data.
  • methods: A deep-learning-based overloaded vehicle identification approach (DOVI) that uses temporal convolutional architectures to extract spatial and temporal features from input sequence data, providing an end-to-end solution that needs neither influence lines nor prior velocity and wheelbase information and can be applied when multiple vehicles are present.
  • results: Evaluations on a simply supported beam and a long-span cable-stayed bridge under random traffic flow show that the proposed approach is more effective and robust than other machine learning and deep learning methods.
    Abstract Overloaded vehicles bring great harm to transportation infrastructures. BWIM (bridge weigh-in-motion) method for overloaded vehicle identification is getting more popular because it can be implemented without interruption to the traffic. However, its application is still limited because its effectiveness largely depends on professional knowledge and extra information, and is susceptible to occurrence of multiple vehicles. In this paper, a deep learning based overloaded vehicle identification approach (DOVI) is proposed, with the purpose of overloaded vehicle identification for long-span bridges by the use of structural health monitoring data. The proposed DOVI model uses temporal convolutional architectures to extract the spatial and temporal features of the input sequence data, thus provides an end-to-end overloaded vehicle identification solution which neither needs the influence line nor needs to obtain velocity and wheelbase information in advance and can be applied under the occurrence of multiple vehicles. Model evaluations are conducted on a simply supported beam and a long-span cable-stayed bridge under random traffic flow. Results demonstrate that the proposed deep-learning overloaded vehicle identification approach has better effectiveness and robustness, compared with other machine learning and deep learning approaches.

Les Houches Lectures on Deep Learning at Large & Infinite Width

  • paper_url: http://arxiv.org/abs/2309.01592
  • repo_url: None
  • paper_authors: Yasaman Bahri, Boris Hanin, Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon
  • for: These lectures focus on the infinite-width limit and large-width regime of deep neural networks.
  • methods: Topics covered include properties of random deep neural networks and the connections between trained deep neural networks, linear models, kernels, and Gaussian processes that arise in the infinite-width limit.
  • results: A discussion of the statistical and dynamical properties of these networks at infinite width, as well as perturbative and non-perturbative treatments of large but finite-width networks, at initialization and after training.
    Abstract These lectures, presented at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, focus on the infinite-width limit and large-width regime of deep neural networks. Topics covered include various statistical and dynamical properties of these networks. In particular, the lecturers discuss properties of random deep neural networks; connections between trained deep neural networks, linear models, kernels, and Gaussian processes that arise in the infinite-width limit; and perturbative and non-perturbative treatments of large but finite-width networks, at initialization and after training.
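The kernel/Gaussian-process connection at infinite width rests on the standard NNGP recursion (textbook background, not a result specific to these lectures): for a fully connected network with nonlinearity $\phi$ and weight/bias variances $\sigma_w^2, \sigma_b^2$,

```latex
K^{(\ell+1)}(x, x') \;=\; \sigma_b^2 \;+\; \sigma_w^2\,
\mathbb{E}_{f \sim \mathcal{N}(0,\,K^{(\ell)})}\!\left[\phi\big(f(x)\big)\,\phi\big(f(x')\big)\right].
```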

Rail Crack Propagation Forecasting Using Multi-horizons RNNs

  • paper_url: http://arxiv.org/abs/2309.01569
  • repo_url: None
  • paper_authors: Sara Yasmine Ouerk, Olivier Vo Van, Mouadh Yagoubi
  • for: Forecast the propagation of rail crack length, which is crucial for maintenance and for assessing the safety of materials and structures.
  • methods: Machine learning, in particular recurrent neural networks (RNNs), to model time-series data and incorporate exogenous variables; a Bayesian multi-horizon recurrent architecture trained on real crack-length measurements from the French rail network.
  • results: The multi-horizon model outperforms state-of-the-art models such as LSTM and GRU.
    Abstract The prediction of rail crack length propagation plays a crucial role in the maintenance and safety assessment of materials and structures. Traditional methods rely on physical models and empirical equations such as Paris law, which often have limitations in capturing the complex nature of crack growth. In recent years, machine learning techniques, particularly Recurrent Neural Networks (RNNs), have emerged as promising methods for time series forecasting. They allow to model time series data, and to incorporate exogenous variables into the model. The proposed approach involves collecting real data on the French rail network that includes historical crack length measurements, along with relevant exogenous factors that may influence crack growth. First, a pre-processing phase was performed to prepare a consistent data set for learning. Then, a suitable Bayesian multi-horizons recurrent architecture was designed to model the crack propagation phenomenon. Obtained results show that the Multi-horizons model outperforms state-of-the-art models such as LSTM and GRU.
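A generic multi-horizon recurrent forecaster, as a minimal sketch (not the paper's Bayesian variant); exogenous variables would simply be concatenated into the input features:

```python
import torch
import torch.nn as nn

class MultiHorizonRNN(nn.Module):
    """GRU encoder with one linear head per forecast horizon."""
    def __init__(self, n_features, hidden, n_horizons):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(hidden, 1)
                                    for _ in range(n_horizons)])

    def forward(self, x):            # x: (B, T, n_features)
        _, h = self.rnn(x)           # h: (num_layers, B, hidden)
        return torch.cat([head(h[-1]) for head in self.heads], dim=-1)
```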

OutRank: Speeding up AutoML-based Model Search for Large Sparse Data sets with Cardinality-aware Feature Ranking

  • paper_url: http://arxiv.org/abs/2309.01552
  • repo_url: https://github.com/outbrain/outrank
  • paper_authors: Blaž Škrlj, Blaž Mramor
  • for: Speed up AutoML-based model search for modern recommender systems, whose large, sparse, and noisy datasets make it hard to identify meaningful signals.
  • methods: OutRank, a system for versatile feature ranking and data-quality-related anomaly detection; built with categorical data in mind, it uses a variant of mutual information normalized with regard to the noise produced by features of the same cardinality, and extends the similarity measure with feature similarity and combined relevance information.
  • results: OutRank speeds up a state-of-the-art AutoML system on a synthetic dataset with no performance loss and outperforms strong random-forest-based baselines on a real click-through-rate prediction dataset, enabling exploration of up to 300% larger feature spaces than AutoML-only approaches on off-the-shelf hardware.
    Abstract The design of modern recommender systems relies on understanding which parts of the feature space are relevant for solving a given recommendation task. However, real-world data sets in this domain are often characterized by their large size, sparsity, and noise, making it challenging to identify meaningful signals. Feature ranking represents an efficient branch of algorithms that can help address these challenges by identifying the most informative features and facilitating the automated search for more compact and better-performing models (AutoML). We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. OutRank was built with categorical data in mind, utilizing a variant of mutual information that is normalized with regard to the noise produced by features of the same cardinality. We further extend the similarity measure by incorporating information on feature similarity and combined relevance. The proposed approach's feasibility is demonstrated by speeding up the state-of-the-art AutoML system on a synthetic data set with no performance loss. Furthermore, we considered a real-life click-through-rate prediction data set where it outperformed strong baselines such as random forest-based approaches. The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches, enabling faster search for better models on off-the-shelf hardware.
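One plausible reading of the cardinality-aware normalization, as a sketch (not OutRank's exact formula): divide a feature's mutual information with the target by the average MI achieved by random features of the same cardinality, i.e., by its noise floor:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cardinality_adjusted_mi(feature, target, n_random=30, seed=0):
    """MI of a categorical feature with the target, normalized by the mean
    MI of random features of the same cardinality (the noise floor)."""
    rng = np.random.default_rng(seed)
    mi = mutual_info_score(feature, target)
    card = len(set(feature))
    noise = np.mean([
        mutual_info_score(rng.integers(0, card, len(feature)), target)
        for _ in range(n_random)
    ])
    return mi / max(noise, 1e-12)
```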

ChatRule: Mining Logical Rules with Large Language Models for Knowledge Graph Reasoning

  • paper_url: http://arxiv.org/abs/2309.01538
  • repo_url: None
  • paper_authors: Linhao Luo, Jiaxin Ju, Bo Xiong, Yuan-Fang Li, Gholamreza Haffari, Shirui Pan
  • for: mines logical rules over knowledge graphs (KGs) to improve reasoning performance and provide interpretable results.
  • methods: uses large language models (LLMs) to generate logical rules, leveraging both the semantic and structural information of KGs, and incorporates facts from existing KGs to refine the generated rules.
  • results: evaluates the effectiveness and scalability of the proposed method on four large-scale KGs, showing impressive performance and outperforming existing methods.
    Abstract Logical rules are essential for uncovering the logical connections between relations, which could improve the reasoning performance and provide interpretable results on knowledge graphs (KGs). Although there have been many efforts to mine meaningful logical rules over KGs, existing methods suffer from the computationally intensive searches over the rule space and a lack of scalability for large-scale KGs. Besides, they often ignore the semantics of relations which is crucial for uncovering logical connections. Recently, large language models (LLMs) have shown impressive performance in the field of natural language processing and various applications, owing to their emergent ability and generalizability. In this paper, we propose a novel framework, ChatRule, unleashing the power of large language models for mining logical rules over knowledge graphs. Specifically, the framework is initiated with an LLM-based rule generator, leveraging both the semantic and structural information of KGs to prompt LLMs to generate logical rules. To refine the generated rules, a rule ranking module estimates the rule quality by incorporating facts from existing KGs. Last, a rule validator harnesses the reasoning ability of LLMs to validate the logical correctness of ranked rules through chain-of-thought reasoning. ChatRule is evaluated on four large-scale KGs, w.r.t. different rule quality metrics and downstream tasks, showing the effectiveness and scalability of our method.

Are We Using Autoencoders in a Wrong Way?

  • paper_url: http://arxiv.org/abs/2309.01532
  • repo_url: https://github.com/GabMartino/icrst_trst_autoencoder
  • paper_authors: Gabriele Martino, Davide Moroni, Massimo Martinelli
  • for: Revisit the standard training of undercomplete autoencoders, modifying the shape of the latent space without using any explicit regularization term in the loss function.
  • methods: The standard undercomplete autoencoder setup, but the model is trained to reconstruct not the same observation given as input, but another one sampled from the same class distribution.
  • results: The paper also explores the behavior of the latent space when reconstructing a random sample from the whole dataset.
    Abstract Autoencoders are certainly among the most studied and used Deep Learning models: the idea behind them is to train a model in order to reconstruct the same input data. The peculiarity of these models is to compress the information through a bottleneck, creating what is called Latent Space. Autoencoders are generally used for dimensionality reduction, anomaly detection and feature extraction. These models have been extensively studied and updated, given their high simplicity and power. Examples are (i) the Denoising Autoencoder, where the model is trained to reconstruct an image from a noisy one; (ii) Sparse Autoencoder, where the bottleneck is created by a regularization term in the loss function; (iii) Variational Autoencoder, where the latent space is used to generate new consistent data. In this article, we revisited the standard training for the undercomplete Autoencoder modifying the shape of the latent space without using any explicit regularization term in the loss function. We forced the model to reconstruct not the same observation in input, but another one sampled from the same class distribution. We also explored the behaviour of the latent space in the case of reconstruction of a random sample from the whole dataset.
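The modified objective can be sketched as swapping each reconstruction target for another sample of the same class; a minimal sketch of the training signal, not the authors' full setup:

```python
import torch

def class_swap_targets(x, y):
    """For each input, pick a (usually different) sample of the same class
    as the reconstruction target. A random permutation within each class
    may occasionally map a sample to itself."""
    targets = torch.empty_like(x)
    for c in y.unique():
        idx = (y == c).nonzero(as_tuple=True)[0]
        perm = idx[torch.randperm(len(idx), device=idx.device)]
        targets[idx] = x[perm]
    return targets

# loss = torch.nn.functional.mse_loss(autoencoder(x), class_swap_targets(x, y))
```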

MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval

  • paper_url: http://arxiv.org/abs/2309.01516
  • repo_url: https://github.com/longkukuhi/multiway-adapter
  • paper_authors: Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa
  • for: Address the computationally and memory-intensive challenge of adapting pre-trained Large Multi-Modal Models (LMMs) to specialized tasks, enabling efficient adaptation and knowledge transfer to new tasks.
  • methods: Multiway-Adapter, an innovative framework incorporating an 'Alignment Enhancer' that deepens the alignment between modalities, adding fewer than 1.25% additional parameters to LMMs (exemplified by the BEiT-3 model) without tuning the pre-trained parameters.
  • results: Superior zero-shot image-text retrieval performance compared to fully fine-tuned models, with up to a 57% reduction in fine-tuning time, offering a resource-efficient and effective adaptation pathway for LMMs.
    Abstract As the size of Large Multi-Modal Models (LMMs) increases consistently, the adaptation of these pre-trained models to specialized tasks has become a computationally and memory-intensive challenge. Traditional fine-tuning methods require isolated, exhaustive retuning for each new task, limiting the models' versatility. Moreover, current efficient adaptation techniques often overlook modality alignment, focusing only on the knowledge extraction of new tasks. To tackle these issues, we introduce Multiway-Adapter, an innovative framework incorporating an 'Alignment Enhancer' to deepen modality alignment, enabling high transferability without tuning pre-trained parameters. Our method adds fewer than 1.25\% of additional parameters to LMMs, exemplified by the BEiT-3 model in our study. This leads to superior zero-shot image-text retrieval performance compared to fully fine-tuned models, while achieving up to a 57\% reduction in fine-tuning time. Our approach offers a resource-efficient and effective adaptation pathway for LMMs, broadening their applicability. The source code is publicly available at: \url{https://github.com/longkukuhi/MultiWay-Adapter}.
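For context, a generic bottleneck adapter (the family this method belongs to) can be sketched as follows; the actual Multiway-Adapter adds the cross-modal Alignment Enhancer on top, which is omitted here:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a frozen transformer sub-layer."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection
```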

RGI-Net: 3D Room Geometry Inference from Room Impulse Responses in the Absence of First-order Echoes

  • paper_url: http://arxiv.org/abs/2309.01513
  • repo_url: None
  • paper_authors: Inmo Yeon, Jung-Woo Choi
  • for: Propose a room geometry inference method that assumes neither a convex room shape nor prior knowledge of the number of walls.
  • methods: A deep neural network, RGI-Net, estimates room geometry from room impulse responses (RIRs); it learns and exploits the complex relationships between high-order reflections, so it can estimate room shapes even when the shape is non-convex or first-order reflections are missing from the RIRs.
  • results: The network works with RIRs measured from a compact audio device equipped with a circular microphone array and a single loudspeaker, which greatly improves practical applicability; an evaluation network separately estimates the presence probability of walls, so geometry inference is possible without knowing the number of walls in advance.
    Abstract Room geometry is important prior information for implementing realistic 3D audio rendering. For this reason, various room geometry inference (RGI) methods have been developed by utilizing the time of arrival (TOA) or time difference of arrival (TDOA) information in room impulse responses. However, the conventional RGI technique poses several assumptions, such as convex room shapes, the number of walls known in priori, and the visibility of first-order reflections. In this work, we introduce the deep neural network (DNN), RGI-Net, which can estimate room geometries without the aforementioned assumptions. RGI-Net learns and exploits complex relationships between high-order reflections in room impulse responses (RIRs) and, thus, can estimate room shapes even when the shape is non-convex or first-order reflections are missing in the RIRs. The network takes RIRs measured from a compact audio device equipped with a circular microphone array and a single loudspeaker, which greatly improves its practical applicability. RGI-Net includes the evaluation network that separately evaluates the presence probability of walls, so the geometry inference is possible without prior knowledge of the number of walls.

Neural Vector Fields: Generalizing Distance Vector Fields by Codebooks and Zero-Curl Regularization

  • paper_url: http://arxiv.org/abs/2309.01512
  • repo_url: None
  • paper_authors: Xianghui Yang, Guosheng Lin, Zhenghao Chen, Luping Zhou
  • for: Propose Neural Vector Fields (NVF), a novel 3D representation that combines an explicit learning process for manipulating meshes with an implicit unsigned distance function (UDF) representation, breaking barriers in resolution and topology.
  • methods: Directly predict displacements from surface queries and model shapes as vector fields, so that computing direction fields is differentiation-free and the non-trivial surface extraction step is circumvented; two types of shape codebooks, NVF (Lite or Ultra), encode cross-object priors for cross-category reconstruction, and a new regularization exploits the zero-curl property of NVFs.
  • results: Strong performance across four surface reconstruction scenarios: watertight vs non-watertight shapes, category-agnostic vs category-unseen reconstruction, category-specific reconstruction, and cross-domain reconstruction.
    Abstract Recent neural networks based surface reconstruction can be roughly divided into two categories, one warping templates explicitly and the other representing 3D surfaces implicitly. To enjoy the advantages of both, we propose a novel 3D representation, Neural Vector Fields (NVF), which adopts the explicit learning process to manipulate meshes and implicit unsigned distance function (UDF) representation to break the barriers in resolution and topology. This is achieved by directly predicting the displacements from surface queries and modeling shapes as Vector Fields, rather than relying on network differentiation to obtain direction fields as most existing UDF-based methods do. In this way, our approach is capable of encoding both the distance and the direction fields so that the calculation of direction fields is differentiation-free, circumventing the non-trivial surface extraction step. Furthermore, building upon NVFs, we propose to incorporate two types of shape codebooks, \ie, NVFs (Lite or Ultra), to promote cross-category reconstruction through encoding cross-object priors. Moreover, we propose a new regularization based on analyzing the zero-curl property of NVFs, and implement this through the fully differentiable framework of our NVF (ultra). We evaluate both NVFs on four surface reconstruction scenarios, including watertight vs non-watertight shapes, category-agnostic reconstruction vs category-unseen reconstruction, category-specific, and cross-domain reconstruction.

Memory Efficient Optimizers with 4-bit States

  • paper_url: http://arxiv.org/abs/2309.01507
  • repo_url: https://github.com/thu-ml/low-bit-optimizers
  • paper_authors: Bingrui Li, Jianfei Chen, Jun Zhu
  • for: Reduce the memory consumed by optimizer states during neural network training, raising the maximum trainable model size within a given memory budget.
  • methods: Push the optimizer-state bitwidth down to 4 bits through a detailed empirical analysis of the first and second moments; because moments have complicated outlier patterns that block-wise quantization cannot accurately approximate, a smaller block size is used together with both row-wise and column-wise information, and a linear quantizer that excludes the zero point solves the zero-point problem of quantizing the second moment.
  • results: Evaluated on natural language understanding, machine translation, image classification, and instruction tuning, the 4-bit optimizers achieve accuracy comparable to their full-precision counterparts with better memory efficiency.
    Abstract Optimizer states are a major source of memory consumption for training neural networks, limiting the maximum trainable model within given memory budget. Compressing the optimizer states from 32-bit floating points to lower bitwidth is promising to reduce the training memory footprint, while the current lowest achievable bitwidth is 8-bit. In this work, we push optimizer states bitwidth down to 4-bit through a detailed empirical analysis of first and second moments. Specifically, we find that moments have complicated outlier patterns, that current block-wise quantization cannot accurately approximate. We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization. We further identify a zero point problem of quantizing the second moment, and solve this problem with a linear quantizer that excludes the zero point. Our 4-bit optimizer is evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all the tasks our optimizers can achieve comparable accuracy with their full-precision counterparts, while enjoying better memory efficiency.
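One plausible reading of "a linear quantizer that excludes the zero point" for the strictly positive second moment (this is an interpretation for illustration, not the released implementation): reserve no level for zero, so small-but-nonzero values are never rounded to exactly zero:

```python
import numpy as np

def quantize_linear_no_zero(x, bits=4):
    """Map positive values onto levels {1, ..., 2**bits - 1}; level 0 is
    unused, so nonzero second moments never collapse to zero."""
    levels = 2**bits - 1
    scale = max(float(x.max()), 1e-12) / levels
    q = np.clip(np.round(x / scale), 1, levels).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```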

BadSQA: Stealthy Backdoor Attacks Using Presence Events as Triggers in Non-Intrusive Speech Quality Assessment

  • paper_url: http://arxiv.org/abs/2309.01480
  • repo_url: None
  • paper_authors: Ying Ren, Kailai Shen, Zhe Ye, Diqun Yan
  • for: Mount stealthy backdoor attacks against non-intrusive speech quality assessment (NISQA) systems.
  • methods: A novel backdoor attack that uses presence events as triggers, achieving highly stealthy attacks on the NISQA regression task, to which existing classification-oriented backdoor attacks do not directly apply.
  • results: Experiments on four benchmark datasets with two state-of-the-art NISQA models show an average attack success rate of up to 99% with a poisoning rate of only 3%.
    Abstract Non-Intrusive speech quality assessment (NISQA) has gained significant attention for predicting the mean opinion score (MOS) of speech without requiring the reference speech. In practical NISQA scenarios, untrusted third-party resources are often employed during deep neural network training to reduce costs. However, it would introduce a potential security vulnerability as specially designed untrusted resources can launch backdoor attacks against NISQA systems. Existing backdoor attacks primarily focus on classification tasks and are not directly applicable to NISQA which is a regression task. In this paper, we propose a novel backdoor attack on NISQA tasks, leveraging presence events as triggers to achieving highly stealthy attacks. To evaluate the effectiveness of our proposed approach, we conducted experiments on four benchmark datasets and employed two state-of-the-art NISQA models. The results demonstrate that the proposed backdoor attack achieved an average attack success rate of up to 99% with a poisoning rate of only 3%.

Pure Monte Carlo Counterfactual Regret Minimization

  • paper_url: http://arxiv.org/abs/2309.03084
  • repo_url: None
  • paper_authors: Ju Qi, Ting Feng, Falun Hei, Zhemei Fang, Yunfeng Luo
  • for: solves large-scale incomplete information games
  • methods: builds upon CFR and Fictitious Play, combines counterfactual regret and best response strategy
  • results: achieves better performance, reduces time and space complexity, and converges faster than MCCFR with a new warm-start algorithm
    Abstract Counterfactual Regret Minimization (CFR) and its variants are the best algorithms so far for solving large-scale incomplete information games. Building upon CFR, this paper proposes a new algorithm named Pure CFR (PCFR) for achieving better performance. PCFR can be seen as a combination of CFR and Fictitious Play (FP), inheriting the concept of counterfactual regret (value) from CFR, and using the best response strategy instead of the regret matching strategy for the next iteration. Our theoretical proof that PCFR can achieve Blackwell approachability enables PCFR's ability to combine with any CFR variant including Monte Carlo CFR (MCCFR). The resultant Pure MCCFR (PMCCFR) can significantly reduce time and space complexity. Particularly, the convergence speed of PMCCFR is at least three times more than that of MCCFR. In addition, since PMCCFR does not pass through the path of strictly dominated strategies, we developed a new warm-start algorithm inspired by the strictly dominated strategies elimination method. Consequently, the PMCCFR with new warm start algorithm can converge by two orders of magnitude faster than the CFR+ algorithm.
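To make the CFR/Fictitious Play combination concrete: regret matching mixes over positive cumulative regrets, while a fictitious-play-style update commits to a single action. The argmax-over-cumulative-values reading below is an assumption about PCFR's update, sketched for contrast:

```python
import numpy as np

def regret_matching(cum_regrets):
    """CFR's mixed strategy: proportional to positive cumulative regrets."""
    pos = np.maximum(cum_regrets, 0.0)
    if pos.sum() == 0:
        return np.full(len(cum_regrets), 1.0 / len(cum_regrets))
    return pos / pos.sum()

def pure_best_response(cum_values):
    """Deterministic update: play the action with the highest cumulative
    counterfactual value (fictitious-play style)."""
    s = np.zeros(len(cum_values))
    s[np.argmax(cum_values)] = 1.0
    return s
```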

Interactive Graph Convolutional Filtering

  • paper_url: http://arxiv.org/abs/2309.01453
  • repo_url: None
  • paper_authors: Jin Zhang, Defu Lian, Hong Xie, Yawen Li, Enhong Chen
  • for: Address the cold-start and data-sparsity problems that limit interactive recommender systems, particularly in interactive collaborative filtering.
  • methods: An interactive graph convolutional filtering model that extends interactive collaborative filtering into a graph model; variational inference techniques overcome the computational hurdles posed by non-linear models, Bayesian meta-learning effectively addresses the cold-start problem, and theoretical regret bounds guarantee robust performance.
  • results: Extensive experiments on three real-world datasets validate the method and demonstrate its superiority over existing baselines.
    Abstract Interactive Recommender Systems (IRS) have been increasingly used in various domains, including personalized article recommendation, social media, and online advertising. However, IRS faces significant challenges in providing accurate recommendations under limited observations, especially in the context of interactive collaborative filtering. These problems are exacerbated by the cold start problem and data sparsity problem. Existing Multi-Armed Bandit methods, despite their carefully designed exploration strategies, often struggle to provide satisfactory results in the early stages due to the lack of interaction data. Furthermore, these methods are computationally intractable when applied to non-linear models, limiting their applicability. To address these challenges, we propose a novel method, the Interactive Graph Convolutional Filtering model. Our proposed method extends interactive collaborative filtering into the graph model to enhance the performance of collaborative filtering between users and items. We incorporate variational inference techniques to overcome the computational hurdles posed by non-linear models. Furthermore, we employ Bayesian meta-learning methods to effectively address the cold-start problem and derive theoretical regret bounds for our proposed method, ensuring a robust performance guarantee. Extensive experimental results on three real-world datasets validate our method and demonstrate its superiority over existing baselines.

Effective Multi-Graph Neural Networks for Illicit Account Detection on Cryptocurrency Transaction Networks

  • paper_url: http://arxiv.org/abs/2309.02460
  • repo_url: None
  • paper_authors: Zhihao Ding, Jieming Shi, Qing Li, Jiannong Cao
  • for: Detect illicit accounts on cryptocurrency transaction networks, where surging illicit activity has cost normal users billions in losses.
  • methods: DIAM, a multi-graph neural network whose Edge2Seq module automatically learns node representations preserving the intrinsic transaction patterns of parallel edges (considering both edge attributes and directed edge-sequence dependencies), and whose Multigraph Discrepancy (MGD) module uses an attention-supported message-passing mechanism to capture the discrepant features between normal and illicit nodes.
  • results: Compared against 14 existing solutions on 4 large Bitcoin and Ethereum datasets, DIAM consistently achieves the best performance in accurately detecting illicit accounts while being efficient; for example, on a Bitcoin dataset with 20 million nodes and 203 million edges, DIAM reaches an F1 score of 96.55%, significantly higher than the best competitor's 83.92%.
    Abstract We study illicit account detection on transaction networks of cryptocurrencies that are increasingly important in online financial markets. The surge of illicit activities on cryptocurrencies has resulted in billions of losses from normal users. Existing solutions either rely on tedious feature engineering to get handcrafted features, or are inadequate to fully utilize the rich semantics of cryptocurrency transaction data, and consequently, yield sub-optimal performance. In this paper, we formulate the illicit account detection problem as a classification task over directed multigraphs with edge attributes, and present DIAM, a novel multi-graph neural network model to effectively detect illicit accounts on large transaction networks. First, DIAM includes an Edge2Seq module that automatically learns effective node representations preserving intrinsic transaction patterns of parallel edges, by considering both edge attributes and directed edge sequence dependencies. Then utilizing the multigraph topology, DIAM employs a new Multigraph Discrepancy (MGD) module with a well-designed message passing mechanism to capture the discrepant features between normal and illicit nodes, supported by an attention mechanism. Assembling all techniques, DIAM is trained in an end-to-end manner. Extensive experiments, comparing against 14 existing solutions on 4 large cryptocurrency datasets of Bitcoin and Ethereum, demonstrate that DIAM consistently achieves the best performance to accurately detect illicit accounts, while being efficient. For instance, on a Bitcoin dataset with 20 million nodes and 203 million edges, DIAM achieves F1 score 96.55%, significantly higher than the F1 score 83.92% of the best competitor.
    摘要 我们研究加密货币交易网络上的非法账户检测,这些网络在在线金融市场中正变得越来越重要。加密货币上非法活动的激增已给正常用户造成数十亿的损失。现有的解决方案要么依赖繁琐的特征工程来获取手工特征,要么不能充分利用加密货币交易数据的丰富语义,因此效果欠佳。在本文中,我们将非法账户检测问题形式化为带边属性的有向多重图上的分类任务,并提出了一种新的多重图神经网络模型DIAM,用于在大规模交易网络上有效检测非法账户。首先,DIAM包含一个Edge2Seq模块,通过同时考虑边属性和有向边序列依赖,自动学习能保留平行边内在交易模式的有效节点表示。然后,DIAM利用多重图拓扑,采用带有精心设计的消息传递机制和注意力机制的新型多重图差异(MGD)模块,捕捉正常节点与非法节点之间的差异特征。综合所有技术,DIAM以端到端方式训练。在Bitcoin和Ethereum的4个大型加密货币数据集上与14种现有方案进行的大量实验表明,DIAM始终取得最佳性能,能够准确检测非法账户,同时保持高效。例如,在一个包含2000万个节点和2.03亿条边的Bitcoin数据集上,DIAM的F1分数达到96.55%,远高于最佳竞争者的83.92%。

Social Factors in P2P Energy Trading Using Hedonic Games

  • paper_url: http://arxiv.org/abs/2309.01418
  • repo_url: None
  • paper_authors: Dan Mitrea, Viorica Chifu, Tudor Cioara, Ionut Anghel, Cristina Pop
  • for: propose a hedonic-game-based P2P energy trading model that enables prospective buyers and sellers in an energy community to trade energy while taking social factors into account.
  • methods: use hedonic game theory for coordination and cooperation, considering energy prices and social preferences within social relationships, and realize P2P energy trading on top of blockchain technology.
  • results: in the evaluation, the model increased the total amount of energy transacted in a market session within the energy community by 5%; positive social dynamics helped increase the amount of energy transacted by more than 10%, while contributing to a better balance between energy demand and supply within the community.
    Abstract Lately, the energy communities have gained a lot of attention as they have the potential to significantly contribute to the resilience and flexibility of the energy system, facilitating widespread integration of intermittent renewable energy sources. Within these communities the prosumers can engage in peer-to-peer trading, fostering local collaborations and increasing awareness about energy usage and flexible consumption. However, even under these favorable conditions, prosumer engagement levels remain low, requiring trading mechanisms that are aligned with their social values and expectations. In this paper, we introduce an innovative hedonic game coordination and cooperation model for P2P energy trading among prosumers which considers the social relationships within an energy community to create energy coalitions and facilitate energy transactions among them. We defined a heuristic that optimizes the prosumers coalitions, considering their social and energy price preferences and balancing the energy demand and supply within the community. We integrated the proposed hedonic game model into a state-of-the-art blockchain-based P2P energy flexibility market and evaluated its performance within an energy community of prosumers. The evaluation results on a blockchain-based P2P energy flexibility market show the effectiveness in considering social factors when creating coalitions, increasing the total amount of energy transacted in a market session by 5% compared with other game theory-based solutions. Finally, it shows the importance of the social dimensions of P2P energy transactions, the positive social dynamics in the energy community increasing the amount of energy transacted by more than 10% while contributing to a more balanced energy demand and supply within the community.
    摘要 近来,能源社区受到了广泛关注,因为它们有潜力显著提升能源系统的韧性与灵活性,促进间歇性可再生能源的大规模接入。在这些社区中,产消者(prosumer)可以进行点对点(P2P)交易,促进本地协作,并提高对能源使用和灵活消费的认识。然而,即便在这些有利条件下,产消者的参与度仍然较低,需要与其社会价值观和期望相匹配的交易机制。在本文中,我们提出了一种创新的享乐博弈(hedonic game)协调与合作模型,用于产消者之间的P2P能源交易,该模型考虑能源社区内的社会关系,以组建能源联盟并促进联盟间的能源交易。我们定义了一种启发式方法来优化产消者联盟,兼顾其社会偏好和能源价格偏好,并平衡社区内的能源供需。我们将所提出的享乐博弈模型集成到一个先进的基于区块链的P2P能源灵活性市场中,并在一个产消者能源社区中评估了其性能。评估结果表明,在组建联盟时考虑社会因素是有效的,与其他基于博弈论的方案相比,单个市场交易时段内的能源交易总量提高了5%。最后,结果还表明了P2P能源交易的社会维度的重要性:能源社区中积极的社会动态使能源交易量提高了10%以上,同时促进了社区内能源供需的更好平衡。
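
As a rough illustration of the coalition heuristic described above, the sketch below scores a prosumer's utility for a coalition as a weighted mix of social affinity and price agreement and grows coalitions greedily. The prosumer data, the utility weighting, and the greedy join rule are illustrative assumptions, not the paper's algorithm:

```python
# Toy prosumers: price preference plus social ties (affinities) to others.
prosumers = {
    "a": {"price": 0.10, "ties": {"b": 0.9, "c": 0.1}},
    "b": {"price": 0.12, "ties": {"a": 0.9, "c": 0.4}},
    "c": {"price": 0.11, "ties": {"a": 0.1, "b": 0.4}},
}

def utility(member, coalition, alpha=0.5):
    """Hedonic utility of `member` for a coalition: average social affinity
    to its other members minus average price disagreement with them."""
    others = [p for p in coalition if p != member]
    if not others:
        return 0.0
    social = sum(prosumers[member]["ties"].get(o, 0.0) for o in others) / len(others)
    price_gap = sum(abs(prosumers[member]["price"] - prosumers[o]["price"])
                    for o in others) / len(others)
    return alpha * social - (1 - alpha) * price_gap

# Greedy heuristic: each prosumer joins the coalition that maximizes its utility,
# or starts a new singleton coalition if no join is beneficial.
coalitions = [{"a"}]
for p in ["b", "c"]:
    best = max(coalitions, key=lambda c: utility(p, c | {p}))
    if utility(p, best | {p}) > 0:
        best.add(p)
    else:
        coalitions.append({p})
print(coalitions)  # here all three prosumers end up in one coalition
```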

Towards frugal unsupervised detection of subtle abnormalities in medical imaging

  • paper_url: http://arxiv.org/abs/2309.02458
  • repo_url: https://github.com/geoffroyo/onlineem
  • paper_authors: Geoffroy Oudoumanessah, Carole Lartizien, Michel Dojat, Florence Forbes
  • for: propose an unsupervised anomaly detection method based on mixtures of probability distributions to analyze abnormalities in medical imaging.
  • methods: use mixtures of probability distributions for anomaly detection; they can represent complex multivariate reference models with far fewer parameters and better interpretability, but standard estimation procedures such as the Expectation-Maximization algorithm do not suit large data volumes because of their high memory usage, so inferential quantities are computed incrementally.
  • results: the approach achieves accurate anomaly detection and adapts to different medical imaging data; experiments show it detects brain abnormalities in Parkinsonian patients that are consistent with the Hoehn and Yahr disease progression scale.
    Abstract Anomaly detection in medical imaging is a challenging task in contexts where abnormalities are not annotated. This problem can be addressed through unsupervised anomaly detection (UAD) methods, which identify features that do not match with a reference model of normal profiles. Artificial neural networks have been extensively used for UAD but they do not generally achieve an optimal trade-off between accuracy and computational demand. As an alternative, we investigate mixtures of probability distributions whose versatility has been widely recognized for a variety of data and tasks, while not requiring excessive design effort or tuning. Their expressivity makes them good candidates to account for complex multivariate reference models. Their much smaller number of parameters makes them more amenable to interpretation and efficient learning. However, standard estimation procedures, such as the Expectation-Maximization algorithm, do not scale well to large data volumes as they require high memory usage. To address this issue, we propose to incrementally compute inferential quantities. This online approach is illustrated on the challenging detection of subtle abnormalities in MR brain scans for the follow-up of newly diagnosed Parkinsonian patients. The identified structural abnormalities are consistent with the disease progression, as accounted by the Hoehn and Yahr scale.
    摘要 在异常未被标注的情况下,医学影像中的异常检测是一项挑战。这一问题可以通过无监督异常检测(UAD)方法来解决,这类方法识别与正常模式参考模型不符的特征。人工神经网络已被广泛用于UAD,但它们通常无法在准确性和计算需求之间取得最佳平衡。作为替代方案,我们研究了概率分布混合模型,其多样性已在各类数据和任务中得到广泛认可,且无需过多的设计或调参。其表达能力使其能够很好地刻画复杂的多变量参考模型;其参数数量少得多,也更易于解释和高效学习。然而,标准的估计过程(如期望最大化算法)由于内存占用高,无法很好地扩展到大规模数据。为解决这一问题,我们提出增量计算推断量。这一在线方法在一个具有挑战性的任务上得到了演示:在新诊断帕金森病患者的随访中检测MR脑部扫描中的细微异常。所识别出的结构异常与疾病进展相符,与Hoehn-Yahr量表的评估一致。
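
The incremental computation of inferential quantities mentioned above can be sketched for a one-dimensional Gaussian mixture: each observation updates running sufficient statistics with a decaying step size and the parameters are refreshed from them, so no batch of data needs to be held in memory. The initialization and step-size schedule below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stream: a mixture of two Gaussians standing in for "normal" intensities.
stream = np.concatenate([rng.normal(0, 1, 5000), rng.normal(5, 1, 5000)])
rng.shuffle(stream)

K = 2
pi = np.full(K, 1.0 / K)            # mixture weights
mu = np.array([-1.0, 1.0])          # component means (rough initialization)
var = np.ones(K)                    # component variances
s0, s1, s2 = pi.copy(), pi * mu, pi * (var + mu**2)  # running sufficient statistics

for t, x in enumerate(stream, start=1):
    # E-step on a single point: responsibilities under the current model.
    logp = -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var)) + np.log(pi)
    r = np.exp(logp - logp.max()); r /= r.sum()
    # Stochastic-approximation update of the sufficient statistics.
    eta = t ** -0.7                  # decaying step size (a common online-EM choice)
    s0 = (1 - eta) * s0 + eta * r
    s1 = (1 - eta) * s1 + eta * r * x
    s2 = (1 - eta) * s2 + eta * r * x * x
    # M-step: refresh the parameters from the running statistics.
    pi = s0 / s0.sum()
    mu = s1 / s0
    var = np.maximum(s2 / s0 - mu**2, 1e-6)

print(pi.round(2), mu.round(2), var.round(2))  # roughly recovers the two components
```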

Metric Learning for Projections Bias of Generalized Zero-shot Learning

  • paper_url: http://arxiv.org/abs/2309.01390
  • repo_url: None
  • paper_authors: Chong Zhang, Mingyu Jin, Qinkai Yu, Haochen Xue, Xiaobo Jin
  • for: improve the reliability and effectiveness of generalized zero-shot learning (GZSL) models so that they can correctly recognize unseen classes.
  • methods: within the VAEGAN (Variational Autoencoder & Generative Adversarial Networks) framework, propose a new parameterized Mahalanobis distance representation to reduce bias during inference, and improve the VAEGAN network structure so that two branches separately predict seen samples and the unseen samples generated from them.
  • results: comprehensive evaluations on four datasets demonstrate the superiority of the method over state-of-the-art counterparts; the code is available at https://anonymous.4open.science/r/111hxr.
    Abstract Generalized zero-shot learning models (GZSL) aim to recognize samples from seen or unseen classes using only samples from seen classes as training data. During inference, GZSL methods are often biased towards seen classes due to the visibility of seen class samples during training. Most current GZSL methods try to learn an accurate projection function (from visual space to semantic space) to avoid bias and ensure the effectiveness of GZSL methods. However, during inference, the computation of distance will be important when we classify the projection of any sample into its nearest class since we may learn a biased projection function in the model. In our work, we attempt to learn a parameterized Mahalanobis distance within the framework of VAEGAN (Variational Autoencoder \& Generative Adversarial Networks), where the weight matrix depends on the network's output. In particular, we improved the network structure of VAEGAN to leverage the discriminative models of two branches to separately predict the seen samples and the unseen samples generated by this seen one. We proposed a new loss function with two branches to help us learn the optimized Mahalanobis distance representation. Comprehensive evaluation benchmarks on four datasets demonstrate the superiority of our method over the state-of-the-art counterparts. Our codes are available at https://anonymous.4open.science/r/111hxr.
    摘要 广义零样本学习模型(GZSL)旨在仅使用已见类样本作为训练数据,来识别来自已见类或未见类的样本。在推理阶段,由于训练时只能看到已见类样本,GZSL方法往往偏向已见类。当前大多数GZSL方法试图学习一个准确的投影函数(从视觉空间到语义空间),以避免偏差并保证GZSL方法的有效性。然而,在推理时,当我们把任一样本的投影归类到最近的类别时,距离的计算非常重要,因为模型学到的投影函数可能是有偏的。在我们的工作中,我们尝试在VAEGAN(变分自编码器与生成对抗网络)框架内学习一个参数化的马哈拉诺比斯距离,其权重矩阵依赖于网络的输出。具体来说,我们改进了VAEGAN的网络结构,利用两个分支的判别模型,分别预测已见样本和由已见样本生成的未见样本。我们提出了一个新的双分支损失函数,以帮助学习优化的马哈拉诺比斯距离表示。在四个数据集上的全面评估表明,我们的方法优于当前最先进的方法。我们的代码可在https://anonymous.4open.science/r/111hxr获取。
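
A minimal sketch of the parameterized Mahalanobis distance idea: writing the weight matrix as M = W^T W keeps it positive semi-definite, with W standing in for the network's output. The dimensions and the toy nearest-class-mean step are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # embedding dimension

# W would be produced by the network; parameterizing M = W^T W keeps M PSD,
# so the induced quadratic form is a valid squared Mahalanobis distance.
W = rng.normal(size=(d, d))
M = W.T @ W

def mahalanobis_sq(x, y, M):
    diff = x - y
    return diff @ M @ diff

# Toy nearest-class-mean classification under the learned metric.
class_means = {"seen": rng.normal(size=d), "unseen": rng.normal(size=d)}
sample = rng.normal(size=d)
pred = min(class_means, key=lambda c: mahalanobis_sq(sample, class_means[c], M))
print(pred)
```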

LoRA-like Calibration for Multimodal Deception Detection using ATSFace Data

  • paper_url: http://arxiv.org/abs/2309.01383
  • repo_url: None
  • paper_authors: Shun-Wen Hsiao, Cheng-Yuan Sun
  • for: develop a model that effectively detects deception in human videos while providing interpretability.
  • methods: propose an attention-aware neural network that pinpoints deceptive cues by continuously assessing visual, audio, and text features, and employ a multimodal fusion strategy to improve accuracy.
  • results: achieve a 92% accuracy rate on a real-life trial dataset; the model also indicates the attention focus in the videos, providing valuable insights into deception cues.
    Abstract Recently, deception detection in human videos has become an eye-catching technique that can serve many applications. AI models in this domain demonstrate high accuracy, but AI tends to be a non-interpretable black box. We introduce an attention-aware neural network addressing challenges inherent in video data and deception dynamics. This model, through its continuous assessment of visual, audio, and text features, pinpoints deceptive cues. We employ a multimodal fusion strategy that enhances accuracy; our approach yields a 92\% accuracy rate on a real-life trial dataset. Most important of all, the model indicates the attention focus in the videos, providing valuable insights on deception cues. Hence, our method adeptly detects deceit and elucidates the underlying process. We further enriched our study with an experiment involving students answering questions either truthfully or deceitfully, resulting in a new dataset of 309 video clips, named ATSFace. Using this, we also introduced a calibration method, which is inspired by Low-Rank Adaptation (LoRA), to refine individual-based deception detection accuracy.
    摘要 近来,人类视频中的欺骗检测是一项引人注目的技术,可服务于许多应用。该领域的AI模型展现出很高的准确率,但AI往往是不可解释的黑箱。我们提出了一种注意力感知神经网络,以应对视频数据和欺骗动态中固有的挑战。该模型通过持续评估视觉、音频和文本特征来定位欺骗线索。我们采用多模态融合策略来提升准确率,在一个真实庭审数据集上取得了92%的准确率。最重要的是,该模型能够指出视频中的注意力焦点,为欺骗线索提供有价值的洞察。因此,我们的方法既能娴熟地检测欺骗,又能阐明其底层过程。我们还通过一项实验丰富了研究:让学生对问题作出诚实或欺骗性的回答,得到了一个包含309个视频片段的新数据集ATSFace。基于该数据集,我们还引入了一种受低秩自适应(LoRA)启发的校准方法,以细化基于个体的欺骗检测准确率。
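
As a rough sketch of the LoRA-inspired calibration idea (adapting a frozen weight matrix with a low-rank update B·A, here imagined as fitted per individual), consider the following; the shapes, rank, and scaling are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4            # r << min(d_out, d_in): low-rank bottleneck

W = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, r))              # trainable up-projection, zero-init so the
                                      # adapted model starts identical to the base
alpha = 8.0                           # LoRA scaling factor

def adapted_forward(x):
    """Base projection plus the low-rank, per-individual calibration term."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(adapted_forward(x), W @ x))  # True before any calibration step
```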

Memory augment is All You Need for image restoration

  • paper_url: http://arxiv.org/abs/2309.01377
  • repo_url: https://github.com/zhangbaijin/memorynet
  • paper_authors: Xiao Feng Zhang, Chao Chen Gu, Shan Ying Zhu
  • for: propose an image restoration method based on a three-granularity memory layer to improve restoration performance.
  • methods: the method, named MemoryNet, combines a three-granularity memory layer with a contrastive learning strategy that divides samples into positive, negative, and actual samples and shapes the learned features through contrastive learning.
  • results: experiments on deraining, shadow removal, and deblurring show improved restoration performance, with significant PSNR and SSIM gains on three datasets with different degradation types, strong evidence that the recovered images are perceptually realistic.
    Abstract Image restoration is a low-level vision task; most CNN methods are designed as black boxes, lacking transparency and internal aesthetics. Although some methods combining traditional optimization algorithms with DNNs have been proposed, they all have some limitations. In this paper, we propose a three-granularity memory layer and contrastive learning scheme named MemoryNet. Specifically, we divide the samples into positive, negative, and actual samples for contrastive learning, where the memory layer is able to preserve the deep features of the image and the contrastive learning converges the learned features to a balance. Experiments on deraining/deshadowing/deblurring tasks demonstrate that these methods are effective in improving restoration performance. In addition, this paper's model obtains significant PSNR and SSIM gains on three datasets with different degradation types, which is strong proof that the recovered images are perceptually realistic. The source code of MemoryNet can be obtained from https://github.com/zhangbaijin/MemoryNet
    摘要 图像修复是一项低级视觉任务,大多数CNN方法都被设计成黑箱,缺乏透明度和内部美感。虽然已有一些将传统优化算法与DNN结合的方法被提出,但它们都存在一定的局限。在本文中,我们提出了名为MemoryNet的三粒度记忆层与对比学习方案,具体来说,将样本分为正样本、负样本和实际样本三类进行对比学习:记忆层能够保留图像的深层特征,对比学习则使学到的特征趋于平衡。在去雨/去阴影/去模糊任务上的实验表明,这些方法能够有效提升修复性能。此外,本文的模型在三个不同退化类型的数据集上获得了显著的PSNR、SSIM提升,这有力地证明了修复后的图像在感知上是真实的。MemoryNet的源代码可从GitHub获取:https://github.com/zhangbaijin/MemoryNet
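
A minimal sketch of contrastive learning over positive, negative, and actual (anchor) samples as described above, using a margin-based triplet loss on feature vectors; the margin and feature shapes are illustrative assumptions, not MemoryNet's exact loss:

```python
import numpy as np

def triplet_contrastive_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor (restored-image features) toward the positive
    (clean image) and push it away from the negative (degraded image)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

rng = np.random.default_rng(0)
feat = rng.normal(size=128)
loss = triplet_contrastive_loss(
    anchor=feat + 0.1 * rng.normal(size=128),  # restored output
    positive=feat,                             # ground-truth clean image
    negative=rng.normal(size=128),             # degraded input
)
# The loss is zero once the anchor is closer to the positive than to the
# negative by at least the margin, as in this toy example.
print(round(loss, 3))
```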

ReOnto: A Neuro-Symbolic Approach for Biomedical Relation Extraction

  • paper_url: http://arxiv.org/abs/2309.01370
  • repo_url: https://github.com/kracr/reonto-relation-extraction
  • paper_authors: Monika Jain, Kuldeep Singh, Raghava Mutharaju
  • for: improve the performance of relation extraction on biomedical text by using neuro-symbolic knowledge to handle the particular nature of biomedical relations.
  • methods: the new technique, called ReOnto, uses a graph neural network to obtain sentence representations and leverages publicly accessible ontologies as prior knowledge to identify the sentential relation between two entities.
  • results: experimental results show that combining symbolic knowledge from ontologies with graph neural networks outperforms all baselines (by approximately 3%).
    Abstract Relation Extraction (RE) is the task of extracting semantic relationships between entities in a sentence and aligning them to relations defined in a vocabulary, which is generally in the form of a Knowledge Graph (KG) or an ontology. Various approaches have been proposed so far to address this task. However, applying these techniques to biomedical text often yields unsatisfactory results because it is hard to infer relations directly from sentences due to the nature of the biomedical relations. To address these issues, we present a novel technique called ReOnto, that makes use of neuro symbolic knowledge for the RE task. ReOnto employs a graph neural network to acquire the sentence representation and leverages publicly accessible ontologies as prior knowledge to identify the sentential relation between two entities. The approach involves extracting the relation path between the two entities from the ontology. We evaluate the effect of using symbolic knowledge from ontologies with graph neural networks. Experimental results on two public biomedical datasets, BioRel and ADE, show that our method outperforms all the baselines (approximately by 3\%).
    摘要 关系抽取(RE)的任务是抽取句子中实体之间的语义关系,并将其对齐到词表(通常是知识图谱(KG)或本体)中定义的关系。目前已经提出了多种方法来解决这一任务。但是将这些技术应用于生物医学文本时,结果往往并不理想,因为生物医学关系的特性使其很难直接从句子中推断。为了解决这些问题,我们提出了一种名为ReOnto的新技术,它利用神经符号知识来完成RE任务。ReOnto使用图神经网络获取句子表示,并利用公开可访问的本体作为先验知识,来确定句子中两个实体之间的关系。该方法包括从本体中抽取两个实体之间的关系路径。我们实验评估了将本体中的符号知识与图神经网络结合的效果。在两个公开生物医学数据集(BioRel和ADE)上的实验结果表明,我们的方法优于所有基线方法(约3%)。
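
A minimal sketch of extracting the relation path between two entities from an ontology, here with networkx on a toy graph whose nodes and relation labels are invented for illustration:

```python
import networkx as nx

# Toy ontology fragment; nodes and relations are invented for illustration.
onto = nx.DiGraph()
onto.add_edge("aspirin", "NSAID", relation="is_a")
onto.add_edge("NSAID", "anti-inflammatory", relation="has_effect")
onto.add_edge("anti-inflammatory", "inflammation", relation="treats")

def relation_path(graph, head, tail):
    """Return the chain of relation labels along a shortest path head -> tail."""
    nodes = nx.shortest_path(graph, head, tail)
    return [graph.edges[u, v]["relation"] for u, v in zip(nodes, nodes[1:])]

# The extracted path can then serve as prior knowledge when classifying
# the sentential relation between the two entities.
print(relation_path(onto, "aspirin", "inflammation"))
# ['is_a', 'has_effect', 'treats']
```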

Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.01365
  • repo_url: https://github.com/hbing-l/rtpca
  • paper_authors: Hanbing Liu, Wangmeng Xiang, Jun-Yan He, Zhi-Qi Cheng, Bin Luo, Yifeng Geng, Xuansong Xie
  • for: improve the accuracy and structure of 3D human pose estimation with the transformer-based Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) model.
  • methods: exploiting the temporal dimension, RTPCA enhances intra-block temporal modeling via a Temporal Pyramidal Compression-and-Amplification (TPCA) block and refines inter-block feature interaction with a Cross-Layer Refinement (XLR) module; the TPCA block uses a temporal pyramid paradigm to reinforce key and value representation capabilities and extract spatial semantics from motion sequences, while XLR promotes rich semantic representations through continuous interaction of queries, keys, and values.
  • results: achieves state-of-the-art results on the Human3.6M, HumanEva-I, and MPI-INF-3DHP benchmarks with less computational overhead than other transformer-based methods.
    Abstract Accurately estimating the 3D pose of humans in video sequences requires both accuracy and a well-structured architecture. With the success of transformers, we introduce the Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends intra-block temporal modeling via its Temporal Pyramidal Compression-and-Amplification (TPCA) structure and refines inter-block feature interaction with a Cross-Layer Refinement (XLR) module. In particular, TPCA block exploits a temporal pyramid paradigm, reinforcing key and value representation capabilities and seamlessly extracting spatial semantics from motion sequences. We stitch these TPCA blocks with XLR that promotes rich semantic representation through continuous interaction of queries, keys, and values. This strategy embodies early-stage information with current flows, addressing typical deficits in detail and stability seen in other transformer-based methods. We demonstrate the effectiveness of RTPCA by achieving state-of-the-art results on Human3.6M, HumanEva-I, and MPI-INF-3DHP benchmarks with minimal computational overhead. The source code is available at https://github.com/hbing-l/RTPCA.
    摘要 准确估计视频序列中人体的3D姿态,既需要高精度,也需要结构良好的架构。基于transformer的成功,我们提出了精炼时间金字塔压缩-放大(RTPCA)transformer。RTPCA利用时间维度,通过其时间金字塔压缩-放大(TPCA)结构扩展块内时间建模,并通过跨层精炼(XLR)模块细化块间特征交互。特别地,TPCA块采用时间金字塔范式,强化键和值的表示能力,并从运动序列中无缝提取空间语义。我们用XLR将这些TPCA块连接起来,通过查询、键和值之间的持续交互,促进丰富的语义表示。该策略将早期信息与当前信息流相结合,解决了其他基于transformer的方法中常见的细节缺失和稳定性不足问题。我们在Human3.6M、HumanEva-I和MPI-INF-3DHP基准上取得了最先进的结果,且计算开销极小。代码可以在https://github.com/hbing-l/RTPCA中获取。

Self-driven Grounding: Large Language Model Agents with Automatical Language-aligned Skill Learning

  • paper_url: http://arxiv.org/abs/2309.01352
  • repo_url: None
  • paper_authors: Shaohui Peng, Xing Hu, Qi Yi, Rui Zhang, Jiaming Guo, Di Huang, Zikang Tian, Ruizhi Chen, Zidong Du, Qi Guo, Yunji Chen, Ling Li
  • for: improve the ability of large language models to be applied in real-world environments.
  • methods: automatically propose sub-goals, verify them through interaction with the environment, and learn language-aligned skills in a self-driven manner.
  • results: on the well-known instruction-following task set BabyAI, the approach performs comparably to imitation learning methods while requiring far less demonstration data, proving the effectiveness of the learned skills and demonstrating the feasibility and efficiency of the framework.
    Abstract Large language models (LLMs) show their powerful automatic reasoning and planning capability with a wealth of semantic knowledge about the human world. However, the grounding problem still hinders the applications of LLMs in the real-world environment. Existing studies try to fine-tune the LLM or utilize pre-defined behavior APIs to bridge the LLMs and the environment, which not only costs huge human efforts to customize for every single task but also weakens the generality strengths of LLMs. To autonomously ground the LLM onto the environment, we proposed the Self-Driven Grounding (SDG) framework to automatically and progressively ground the LLM with self-driven skill learning. SDG first employs the LLM to propose the hypothesis of sub-goals to achieve tasks and then verify the feasibility of the hypothesis via interacting with the underlying environment. Once verified, SDG can then learn generalized skills with the guidance of these successfully grounded subgoals. These skills can be further utilized to accomplish more complex tasks which fail to pass the verification phase. Verified in the famous instruction following task set-BabyAI, SDG achieves comparable performance in the most challenging tasks compared with imitation learning methods that cost millions of demonstrations, proving the effectiveness of learned skills and showing the feasibility and efficiency of our framework.
    摘要 大型语言模型(LLM)凭借对人类世界丰富的语义知识,展现出强大的自动推理与规划能力。然而,接地(grounding)问题仍然阻碍着LLM在真实环境中的应用。现有研究试图通过微调LLM或利用预定义的行为API来衔接LLM与环境,这不仅需要为每个任务耗费大量人力进行定制,还削弱了LLM的通用性优势。为了让LLM自主地在环境中接地,我们提出了自驱动接地(SDG)框架,通过自驱动的技能学习,自动且渐进地实现LLM的接地。SDG首先利用LLM提出完成任务的子目标假设,然后通过与底层环境交互来验证假设的可行性。一旦通过验证,SDG就能在这些成功接地的子目标的引导下学习泛化技能,这些技能可进一步用于完成未能通过验证阶段的更复杂任务。在著名的指令跟随任务集BabyAI上的验证表明,与耗费数百万演示的模仿学习方法相比,SDG在最具挑战性的任务上取得了相当的性能,证明了所学技能的有效性,并展示了我们框架的可行性与高效性。

UniSA: Unified Generative Framework for Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2309.01339
  • repo_url: https://github.com/dawn0815/saeval-benchmark
  • paper_authors: Zaijing Li, Ting-En Lin, Yuchuan Wu, Meng Liu, Fengxiao Tang, Ming Zhao, Yongbin Li
  • for: address the coordination problem among the various sentiment analysis subtasks and improve multimodal sentiment analysis performance.
  • methods: propose a task-specific prompting method, introduce a multimodal generative framework named UniSA, and organize a new sentiment analysis evaluation benchmark, SAEval.
  • results: experimental results show that UniSA performs comparably to the state of the art on all sentiment analysis subtasks and generalizes well across subtasks.
    Abstract Sentiment analysis is a crucial task that aims to understand people's emotional states and predict emotional categories based on multimodal information. It consists of several subtasks, such as emotion recognition in conversation (ERC), aspect-based sentiment analysis (ABSA), and multimodal sentiment analysis (MSA). However, unifying all subtasks in sentiment analysis presents numerous challenges, including modality alignment, unified input/output forms, and dataset bias. To address these challenges, we propose a Task-Specific Prompt method to jointly model subtasks and introduce a multimodal generative framework called UniSA. Additionally, we organize the benchmark datasets of main subtasks into a new Sentiment Analysis Evaluation benchmark, SAEval. We design novel pre-training tasks and training methods to enable the model to learn generic sentiment knowledge among subtasks to improve the model's multimodal sentiment perception ability. Our experimental results show that UniSA performs comparably to the state-of-the-art on all subtasks and generalizes well to various subtasks in sentiment analysis.
    摘要 情感分析是一项非常重要的任务,旨在理解人们的情感状态,并根据多模态信息预测情感类别。它包括多个子任务,如对话情感识别(ERC)、基于方面的情感分析(ABSA)和多模态情感分析(MSA)。然而,在情感分析中统一所有子任务面临许多挑战,包括模态对齐、统一的输入/输出形式以及数据集偏差。为了应对这些挑战,我们提出了任务特定提示方法来联合建模各子任务,并引入了一个名为UniSA的多模态生成框架。此外,我们还将主要子任务的基准数据集组织成一个新的情感分析评估基准SAEval。我们设计了新的预训练任务和训练方法,使模型能够在子任务间学习通用的情感知识,从而提升模型的多模态情感感知能力。实验结果表明,UniSA在所有子任务上的表现与当前最先进水平相当,并能很好地泛化到情感分析的各个子任务。

Learning for Interval Prediction of Electricity Demand: A Cluster-based Bootstrapping Approach

  • paper_url: http://arxiv.org/abs/2309.01336
  • repo_url: None
  • paper_authors: Rohit Dube, Natarajan Gautam, Amarnath Banerjee, Harsha Nagarajan
  • for: provide a residual-bootstrap-based method for interval estimation of day-ahead electricity demand, supporting operations in small aggregation load settings.
  • methods: use a machine learning algorithm to obtain point estimates of day-ahead electricity demand, and generate interval estimates from these point estimates and the corresponding residuals; specifically, an unsupervised learning algorithm first groups days with similar demand patterns into clusters, which are then used to generate the interval estimates.
  • results: evaluated on real electricity demand data, the method maintains the accuracy and stability of the interval estimates better than other bootstrapping methods; in particular, it provides more precise interval estimates across different confidence intervals and avoids errors caused by biased point estimates.
    Abstract Accurate predictions of electricity demands are necessary for managing operations in a small aggregation load setting like a Microgrid. Due to low aggregation, the electricity demands can be highly stochastic and point estimates would lead to inflated errors. Interval estimation in this scenario would provide a range of values within which the future values might lie and helps quantify the errors around the point estimates. This paper introduces a residual bootstrap algorithm to generate interval estimates of day-ahead electricity demand. A machine learning algorithm is used to obtain the point estimates of electricity demand and respective residuals on the training set. The obtained residuals are stored in memory and the memory is further partitioned. Days with similar demand patterns are grouped in clusters using an unsupervised learning algorithm and these clusters are used to partition the memory. The point estimates for the test day are used to find the closest cluster of similar days and the residuals are bootstrapped from the chosen cluster. This algorithm is evaluated on real electricity demand data from EULR (End Use Load Research) and is compared to other bootstrapping methods for varying confidence intervals.
    摘要 准确的电力需求预测对于微电网等小规模聚合负荷场景的运营管理十分必要。由于聚合程度低,电力需求可能高度随机,点估计会带来较大的误差。在这种情况下,区间估计可以给出未来值可能落入的范围,并帮助量化点估计附近的误差。本文提出一种残差自助(bootstrap)算法,用于生成日前电力需求的区间估计。首先用机器学习算法在训练集上获得电力需求的点估计及相应的残差,并将得到的残差存入内存,再对内存进行分区:使用无监督学习算法将需求模式相似的日期聚成簇,并以这些簇来划分内存。对测试日,利用其点估计找到需求模式最接近的簇,并从所选簇中自助抽取残差。该算法在EULR(End Use Load Research)的真实电力需求数据上进行了评估,并在不同置信区间下与其他自助方法进行了比较。
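
A minimal sketch of the cluster-based residual bootstrap described above: days are clustered by their demand profiles, and prediction intervals for a test day are formed by resampling residuals from the most similar cluster. The toy profiles, residuals, and cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy history: daily demand profiles (24 hourly values) and forecast residuals.
profiles = rng.normal(size=(200, 24))          # historical daily demand patterns
residuals = rng.normal(scale=0.5, size=200)    # point-forecast residuals per day

# Group days with similar demand patterns; the cluster count is an assumption.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(profiles)

def interval_estimate(test_profile, point_forecast, level=0.9, n_boot=1000):
    """Bootstrap residuals from the cluster of days most similar to the test day."""
    cluster = kmeans.predict(test_profile.reshape(1, -1))[0]
    pool = residuals[kmeans.labels_ == cluster]
    draws = point_forecast + rng.choice(pool, size=n_boot, replace=True)
    lo, hi = np.percentile(draws, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return lo, hi

print(interval_estimate(rng.normal(size=24), point_forecast=10.0))
```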

Can I Trust Your Answer? Visually Grounded Video Question Answering

  • paper_url: http://arxiv.org/abs/2309.01327
  • repo_url: https://github.com/doc-doc/next-gqa
  • paper_authors: Junbin Xiao, Angela Yao, Yicong Li, Tat Seng Chua
  • for: examine the trend of using pretraining techniques for video-language understanding; specifically, require vision-language models (VLMs) to answer questions while simultaneously providing visual evidence, to determine whether the predictions of such techniques are genuinely supported by the video content rather than by spurious language or visual correlations.
  • methods: construct the NExT-GQA dataset to scrutinize state-of-the-art VLMs; post-hoc attention analysis reveals that these models are weak in substantiating their answers, indicating unreliable predictions, so a video grounding mechanism comprising Gaussian mask optimization and cross-modal learning is proposed.
  • results: experiments with different backbones demonstrate that the grounding mechanism improves both video grounding and question answering.
    Abstract We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content, versus spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are weak in substantiating the answers despite their strong QA performance. This exposes a severe limitation of these models in making reliable predictions. As a remedy, we further explore and suggest a video grounding mechanism via Gaussian mask optimization and cross-modal learning. Experiments with different backbones demonstrate that this grounding mechanism improves both video grounding and QA. Our dataset and code are released. With these efforts, we aim to push towards the reliability of deploying VLMs in VQA systems.
    摘要 随着利用预训练技术进行视频-语言理解的趋势兴起,我们研究了视觉定位的视频问答(VideoQA)。具体来说,通过要求视觉-语言模型(VLM)在回答问题的同时提供视觉证据,我们试图确定这类技术的预测在多大程度上真正立足于相关的视频内容,而非来自语言或无关视觉上下文的虚假相关。为此,我们构建了NExT-GQA,即NExT-QA的扩展,包含与原问答对绑定的10.5K个时间定位(位置)标签。基于NExT-GQA,我们细致考察了多种最先进的VLM。通过事后注意力分析,我们发现这些模型尽管问答性能很强,却难以为答案提供依据,这暴露了这些模型在做出可靠预测方面的严重局限。作为补救,我们进一步探索并提出了一种基于高斯掩码优化和跨模态学习的视频定位机制。在不同骨干网络上的实验表明,该定位机制同时改善了视频定位和问答。我们已发布数据集和代码。通过这些努力,我们希望推动VLM在VQA系统中的可靠部署。
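
A minimal sketch of Gaussian-mask temporal grounding: a soft, differentiable mask over frames whose center and width could be optimized end-to-end so the QA head attends to the grounded span. The shapes and parameter values are illustrative assumptions:

```python
import numpy as np

def gaussian_temporal_mask(n_frames, center, width):
    """Soft temporal mask over frames; center/width would be learned end-to-end."""
    t = np.linspace(0.0, 1.0, n_frames)
    return np.exp(-0.5 * ((t - center) / width) ** 2)

rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 512))            # toy per-frame video features
mask = gaussian_temporal_mask(32, center=0.6, width=0.1)

# Weight frame features by the mask so answering relies on the grounded span;
# differentiability of the Gaussian lets center/width be optimized by gradients.
grounded = (mask[:, None] * frames).sum(axis=0) / mask.sum()
print(grounded.shape)  # (512,)
```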

Learning a Patent-Informed Biomedical Knowledge Graph Reveals Technological Potential of Drug Repositioning Candidates

  • paper_url: http://arxiv.org/abs/2309.03227
  • repo_url: https://github.com/ysjegal/ysjegal-drug-repositioning
  • paper_authors: Yongseung Jegal, Jaewoong Choi, Jiho Lee, Ki-Su Park, Seyoung Lee, Janghyeok Yoon
  • for: This paper aims to present a novel protocol for identifying drug repositioning candidates with both technological potential and scientific evidence.
  • methods: The protocol involves constructing a scientific biomedical knowledge graph (s-BKG) and a patent-informed biomedical knowledge graph (p-BKG), and using a graph embedding protocol to evaluate the relevance scores of potential drug candidates.
  • results: The case study on Alzheimer’s disease demonstrates the efficacy and feasibility of the proposed method, and the quantitative outcomes and systematic methods are expected to bridge the gap between computational discoveries and successful market applications in drug repositioning research.
    Abstract Drug repositioning-a promising strategy for discovering new therapeutic uses for existing drugs-has been increasingly explored in the computational science literature using biomedical databases. However, the technological potential of drug repositioning candidates has often been overlooked. This study presents a novel protocol to comprehensively analyse various sources such as pharmaceutical patents and biomedical databases, and identify drug repositioning candidates with both technological potential and scientific evidence. To this end, first, we constructed a scientific biomedical knowledge graph (s-BKG) comprising relationships between drugs, diseases, and genes derived from biomedical databases. Our protocol involves identifying drugs that exhibit limited association with the target disease but are closely located in the s-BKG, as potential drug candidates. We constructed a patent-informed biomedical knowledge graph (p-BKG) by adding pharmaceutical patent information. Finally, we developed a graph embedding protocol to ascertain the structure of the p-BKG, thereby calculating the relevance scores of those candidates with target disease-related patents to evaluate their technological potential. Our case study on Alzheimer's disease demonstrates its efficacy and feasibility, while the quantitative outcomes and systematic methods are expected to bridge the gap between computational discoveries and successful market applications in drug repositioning research.
    摘要 药物再定位(为现有药物发现新治疗用途的有前景的策略)已在计算科学文献中借助生物医学数据库得到越来越多的探索。然而,药物再定位候选药物的技术潜力往往被忽视。本研究提出了一种新的流程,全面分析药品专利和生物医学数据库等多种来源,识别兼具技术潜力与科学证据的药物再定位候选。为此,我们首先基于生物医学数据库构建了一个科学生物医学知识图谱(s-BKG),其中包含药物、疾病和基因之间的关系。我们的流程将与目标疾病关联有限、但在s-BKG中位置邻近的药物识别为潜在候选药物。随后,我们加入药品专利信息,构建了专利感知的生物医学知识图谱(p-BKG)。最后,我们开发了一种图嵌入方法来刻画p-BKG的结构,进而计算候选药物与目标疾病相关专利的相关性得分,以评估其技术潜力。我们在阿尔茨海默病上的案例研究证明了该方法的有效性和可行性,其量化结果和系统化方法有望弥合药物再定位研究中计算发现与成功市场应用之间的差距。

Code Representation Pre-training with Complements from Program Executions

  • paper_url: http://arxiv.org/abs/2309.09980
  • repo_url: None
  • paper_authors: Jiabo Huang, Jianyu Zhao, Yuyang Rong, Yiwen Guo, Yifeng He, Hao Chen
  • for: advance code intelligence research with large language models (LLMs).
  • methods: generate test cases with a customized fuzzer and use them to pre-train code representations.
  • results: compared with counterparts trained with only source code or ASTs, FuzzPretrain improves mAP on code search by more than 6%/9%, respectively.
    Abstract Large language models (LLMs) for natural language processing have been grafted onto programming language modeling for advancing code intelligence. Although it can be represented in the text format, code is syntactically more rigorous in order to be properly compiled or interpreted to perform a desired set of behaviors given any inputs. In this case, existing works benefit from syntactic representations to learn from code less ambiguously in the forms of abstract syntax tree, control-flow graph, etc. However, programs with the same purpose can be implemented in various ways showing different syntactic representations while the ones with similar implementations can have distinct behaviors. Though trivially demonstrated during executions, such semantics about functionality are challenging to be learned directly from code, especially in an unsupervised manner. Hence, in this paper, we propose FuzzPretrain to explore the dynamic information of programs revealed by their test cases and embed it into the feature representations of code as complements. The test cases are obtained with the assistance of a customized fuzzer and are only required during pre-training. FuzzPretrain yielded more than 6%/9% mAP improvements on code search over its counterparts trained with only source code or AST, respectively. Our extensive experimental results show the benefits of learning discriminative code representations with program executions.
    摘要 用于自然语言处理的大型语言模型(LLM)已被移植到编程语言建模上,以推进代码智能。虽然代码可以表示为文本格式,但它在语法上更为严格,以便被正确编译或解释,从而在任意输入下执行所需的行为。在这种情况下,现有工作受益于抽象语法树、控制流图等语法表示,能够更少歧义地从代码中学习。然而,实现相同目的的程序可以有多种写法,呈现不同的语法表示;而实现相似的程序也可能有截然不同的行为。这类关于功能的语义虽然在执行时显而易见,却很难直接从代码中学习,尤其是在无监督的情况下。因此,本文提出FuzzPretrain,利用测试用例所揭示的程序动态信息,并将其作为补充嵌入代码的特征表示中。这些测试用例借助定制的模糊测试器获得,且仅在预训练阶段需要。相比仅使用源代码或AST训练的对应方法,FuzzPretrain在代码搜索上分别带来了超过6%/9%的mAP提升。大量实验结果表明了借助程序执行学习有判别力的代码表示的优势。

ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer

  • paper_url: http://arxiv.org/abs/2309.01310
  • repo_url: None
  • paper_authors: Gyeongdong Yang, Yungwook Kwon, Hyunjin Kim
  • for: improve the performance of mobile-friendly vision transformers while keeping the computational burden small.
  • methods: use average-pooling results to expand the channel count of the final classifier, reusing information from early attention stages.
  • results: compared with the original MobileViT, accuracy improves with only about 5% additional parameters.
    Abstract The paper proposes an efficient structure for enhancing the performance of mobile-friendly vision transformers with small computational overhead. The vision transformer (ViT) is very attractive in that it achieves superior results in image classification compared to conventional convolutional neural networks (CNNs). Due to its need for high computational resources, MobileNet-based ViT models such as MobileViT-S have been developed. However, their performance cannot reach that of the original ViT model. The proposed structure relieves the above weakness by storing the information from early attention stages and reusing it in the final classifier. This paper is motivated by the idea that the data itself from early attention stages can have important meaning for the final classification. In order to reuse the early information in attention stages, the average pooling results of various scaled features from early attention stages are used to expand channels in the fully-connected layer of the final classifier. It is expected that the inductive bias introduced by the averaged features can enhance the final performance. Because the proposed structure only needs the average pooling of features from the attention stages and channel expansions in the final classifier, its computational and storage overheads are very small, keeping the benefits of low-cost MobileNet-based ViT (MobileViT). Compared with the original MobileViTs on the ImageNet dataset, the proposed ExMobileViT has noticeable accuracy enhancements, having only about 5% additional parameters.
    摘要 本文提出了一种高效的结构,在计算开销很小的情况下提升移动端友好的视觉transformer的性能。视觉transformer(ViT)非常有吸引力,因为与传统卷积神经网络(CNN)相比,它在图像分类上能取得更优的结果。由于ViT需要大量计算资源,人们开发了基于MobileNet的ViT模型,如MobileViT-S。然而,它们的性能无法达到原始ViT模型的水平。所提出的结构通过存储早期注意力阶段的信息并在最终分类器中复用,缓解了上述弱点。本文的出发点是:早期注意力阶段的数据本身可能对最终分类具有重要意义。为了复用注意力阶段的早期信息,我们将早期注意力阶段中不同尺度特征的平均池化结果用于扩展最终分类器全连接层的通道。预计由平均特征引入的归纳偏置能够提升最终性能。由于所提出的结构只需要对注意力阶段的特征做平均池化,并在最终分类器中扩展通道,其计算和存储开销都非常小,保留了低成本MobileNet系ViT(MobileViT)的优点。与原始MobileViT在ImageNet数据集上相比,所提出的ExMobileViT在仅增加约5%参数的情况下取得了明显的精度提升。
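
A minimal PyTorch-style sketch of the channel-expansion idea above: average-pool the feature maps from several early attention stages and concatenate them before the final fully-connected classifier. The stage channel counts and class count are illustrative assumptions, not ExMobileViT's exact configuration:

```python
import torch
import torch.nn as nn

class EarlyFeatureClassifier(nn.Module):
    """Classifier head that reuses average-pooled features from early stages."""
    def __init__(self, stage_channels=(64, 128, 256), num_classes=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # The FC layer's input channels are expanded by the early-stage features.
        self.fc = nn.Linear(sum(stage_channels), num_classes)

    def forward(self, stage_features):
        # stage_features: list of (B, C_i, H_i, W_i) maps from attention stages.
        pooled = [self.pool(f).flatten(1) for f in stage_features]
        return self.fc(torch.cat(pooled, dim=1))

head = EarlyFeatureClassifier()
feats = [torch.randn(2, c, s, s) for c, s in [(64, 28), (128, 14), (256, 7)]]
print(head(feats).shape)  # torch.Size([2, 1000])
```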

Partial Proof of a Conjecture with Implications for Spectral Majorization

  • paper_url: http://arxiv.org/abs/2309.01302
  • repo_url: None
  • paper_authors: Jeffrey Uhlmann
  • for: investigate a conjecture regarding properties of $n\times n$ ($n\leq 6$) positive definite matrices.
  • methods: use computer-assisted sum-of-squares (SoS) methods for proving polynomial nonnegativity.
  • results: identify a new family of matrices whose diagonals majorize their spectrum, and show that this family can be extended via Kronecker composition to $n>6$ while retaining the special majorization property.
    Abstract In this paper we report on new results relating to a conjecture regarding properties of $n\times n$, $n\leq 6$, positive definite matrices. The conjecture has been proven for $n\leq 4$ using computer-assisted sum of squares (SoS) methods for proving polynomial nonnegativity. Based on these proven cases, we report on the recent identification of a new family of matrices with the property that their diagonals majorize their spectrum. We then present new results showing that this family can extended via Kronecker composition to $n>6$ while retaining the special majorization property. We conclude with general considerations on the future of computer-assisted and AI-based proofs.
    摘要 在本文中,我们报告了关于$n\times n$($n\leq 6$)正定矩阵性质的一个猜想的新结果。对于$n\leq 4$的情形,该猜想已通过计算机辅助的平方和(SoS)多项式非负性证明方法得到证明。基于这些已证明的情形,我们报告了最近发现的一个新矩阵族,其对角线majorize其谱。随后我们给出新结果,表明该矩阵族可以通过克罗内克积组合扩展到$n>6$,同时保持这一特殊的majorization性质。最后,我们对计算机辅助证明和基于人工智能的证明的未来作了一般性的讨论。
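
The special property named above, a diagonal that majorizes the spectrum, can be checked numerically; by Schur's theorem the majorization normally runs the other way for symmetric matrices, which is what makes such a family special. A minimal sketch of the check:

```python
import numpy as np

def diagonal_majorizes_spectrum(A, tol=1e-9):
    """Check whether diag(A) majorizes the eigenvalues of symmetric A:
    equal totals (the trace identity) and every partial sum of the sorted
    diagonal at least the corresponding partial sum of the sorted spectrum."""
    d = np.sort(np.diag(A))[::-1]
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]
    partial_ok = np.all(np.cumsum(d)[:-1] >= np.cumsum(lam)[:-1] - tol)
    totals_ok = abs(d.sum() - lam.sum()) < tol  # always holds: trace = sum of eigs
    return partial_ok and totals_ok

A = np.diag([3.0, 2.0, 1.0])          # diagonal matrix: diagonal equals spectrum
print(diagonal_majorizes_spectrum(A))  # True (trivially, both orderings coincide)
```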

AlphaZero Gomoku

  • paper_url: http://arxiv.org/abs/2309.01294
  • repo_url: https://github.com/suragnair/alpha-zero-general
  • paper_authors: Wen Liang, Chao Yu, Brian Whiteaker, Inyoung Huh, Hua Shao, Youzhi Liang
  • for: explore the performance of the AlphaZero algorithm on the board game Gomoku.
  • methods: apply AlphaZero, which combines deep learning with Monte Carlo tree search, to Gomoku.
  • results: tests show that AlphaZero performs strongly on Gomoku and maintains a stable, high level of play across different game settings.
    Abstract In the past few years, AlphaZero's exceptional capability in mastering intricate board games has garnered considerable interest. Initially designed for the game of Go, this revolutionary algorithm merges deep learning techniques with the Monte Carlo tree search (MCTS) to surpass earlier top-tier methods. In our study, we broaden the use of AlphaZero to Gomoku, an age-old tactical board game also referred to as "Five in a Row." Intriguingly, Gomoku has innate challenges due to a bias towards the initial player, who has a theoretical advantage. To add value, we strive for a balanced game-play. Our tests demonstrate AlphaZero's versatility in adapting to games other than Go. MCTS has become a predominant algorithm for decision processes in intricate scenarios, especially board games. MCTS creates a search tree by examining potential future actions and uses random sampling to predict possible results. By leveraging the best of both worlds, the AlphaZero technique fuses deep learning from Reinforcement Learning with the balancing act of MCTS, establishing a fresh standard in game-playing AI. Its triumph is notably evident in board games such as Go, chess, and shogi.
    摘要 近年来,AlphaZero在复杂棋盘游戏中的出色能力得到了广泛关注。AlphaZero最初是为围棋设计的,这一革命性的算法将深度学习技术与蒙特卡洛树搜索(MCTS)相结合,超越了之前的顶尖方法。在我们的研究中,我们将AlphaZero的应用范围扩展到五子棋(Gomoku),一种古老的策略性棋盘游戏。有趣的是,五子棋存在先手偏向的内在挑战:先手玩家具有理论上的优势。为了增加价值,我们力求实现平衡的对局。我们的测试表明AlphaZero能够很好地适应围棋之外的游戏。MCTS已成为复杂场景(特别是棋盘游戏)决策过程的主流算法:它通过考察潜在的未来动作构建搜索树,并使用随机采样来预测可能的结果。AlphaZero技术集两者之长,将强化学习中的深度学习与MCTS的权衡机制相融合,为博弈AI树立了新标准。其成功在围棋、国际象棋和将棋等棋盘游戏中表现得尤为显著。
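
AlphaZero's search augments MCTS selection with a learned policy prior and value (PUCT); as a minimal sketch of the underlying selection step, here is plain UCT on toy move statistics, with the node dictionaries standing in for a real search tree:

```python
import math

def uct_select(children, c=1.4):
    """Pick the child maximizing the UCT score: exploitation + exploration."""
    total_visits = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")        # always try unvisited moves first
        exploit = ch["wins"] / ch["visits"]
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

# Toy statistics for three candidate Gomoku moves at the root.
children = [
    {"move": (7, 7), "wins": 30, "visits": 50},
    {"move": (7, 8), "wins": 10, "visits": 40},
    {"move": (8, 8), "wins": 0,  "visits": 0},
]
print(uct_select(children)["move"])  # (8, 8): unvisited, so it is explored first
```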

cs.CL - 2023-09-04

Attention-Driven Multi-Modal Fusion: Enhancing Sign Language Recognition and Translation

  • paper_url: http://arxiv.org/abs/2309.01860
  • repo_url: None
  • paper_authors: Zaber Ibn Abdul Hakim, Rasman Mubtasim Swargo, Muhammad Abdullah Adnan
  • for: extend an existing continuous sign language recognition and translation pipeline to incorporate multi-modal information.
  • methods: use a cross-modal encoder to integrate optical flow information with RGB images into the feature set, improving the accuracy of sign language recognition and translation.
  • results: incorporating multi-modal information improves both recognition and translation; on the RWTH-PHOENIX-2014 dataset for recognition the approach reduces the WER by 0.9, and on the RWTH-PHOENIX-2014T dataset for translation it increases most BLEU scores by about 0.6.
    Abstract In this paper, we devise a mechanism for the addition of multi-modal information with an existing pipeline for continuous sign language recognition and translation. In our procedure, we have incorporated optical flow information with RGB images to enrich the features with movement-related information. This work studies the feasibility of such modality inclusion using a cross-modal encoder. The plugin we have used is very lightweight and doesn't need to include a separate feature extractor for the new modality in an end-to-end manner. We have applied the changes in both sign language recognition and translation, improving the result in each case. We have evaluated the performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduced the WER by 0.9, and on the translation task, our approach increased most of the BLEU scores by ~0.6 on the test set.
    摘要 在本文中,我们设计了一种机制,在现有的连续手语识别与翻译流程中加入多模态信息。在我们的流程中,我们将光流信息与RGB图像相结合,以运动相关信息丰富特征。本工作利用一个跨模态编码器来研究这种模态加入的可行性。我们使用的插件非常轻量,无需以端到端方式为新模态单独引入特征提取器。我们将这些改动同时应用于手语识别和翻译,并在两种情形下都提升了结果。我们在RWTH-PHOENIX-2014数据集上评估了手语识别性能,在RWTH-PHOENIX-2014T数据集上评估了翻译性能。在识别任务上,我们的方法将WER降低了0.9;在翻译任务上,我们的方法将测试集上大多数BLEU分数提高了约0.6。

Minimal Effective Theory for Phonotactic Memory: Capturing Local Correlations due to Errors in Speech

  • paper_url: http://arxiv.org/abs/2309.02466
  • repo_url: None
  • paper_authors: Paul Myles Eugenio
  • for: explore how the economy of speech shapes the evolution of spoken language, and how the resulting effects make spoken words easier to learn.
  • methods: use a tensor-network-based locally-connected model that exploits local phonetic correlations in speech to facilitate word learning.
  • results: the model facilitates the learning of spoken words and can generate new words that are phonetically reasonable for the target language; it also provides a hierarchy of the most likely errors that could occur during speech.
    Abstract Spoken language evolves constrained by the economy of speech, which depends on factors such as the structure of the human mouth. This gives rise to local phonetic correlations in spoken words. Here we demonstrate that these local correlations facilitate the learning of spoken words by reducing their information content. We do this by constructing a locally-connected tensor-network model, inspired by similar variational models used for many-body physics, which exploits these local phonetic correlations to facilitate the learning of spoken words. The model is therefore a minimal model of phonetic memory, where "learning to pronounce" and "learning a word" are one and the same. A consequence of which is the learned ability to produce new words which are phonetically reasonable for the target language; as well as providing a hierarchy of the most likely errors that could be produced during the action of speech. We test our model against Latin and Turkish words. (The code is available on GitHub.)
    摘要 口语的演化受言语经济性的约束,而后者取决于人类口腔结构等因素。这使口语单词中产生了局部语音相关性。我们在此证明,这些局部相关性通过降低单词的信息量,使口语单词更易学习。为此,我们构建了一个局部连接的张量网络模型(其灵感来自多体物理中类似的变分模型),利用这些局部语音相关性来促进口语单词的学习。因此,该模型是语音记忆的最小模型,其中"学会发音"与"学会单词"是一回事。其结果之一是模型学会了生成对目标语言而言语音上合理的新词,并给出言语过程中最可能出现的错误的层级。我们在拉丁语和土耳其语单词上测试了该模型。(代码可在GitHub上获取。)

Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical Texts

  • paper_url: http://arxiv.org/abs/2309.01812
  • repo_url: https://github.com/ylaboratory/flambe
  • paper_authors: Ruth Dannenfelser, Jeffrey Zhong, Ran Zhang, Vicky Yao
  • for: provide a dataset for procedural knowledge extraction in the biomedical domain, to support further development of natural language processing (NLP) models.
  • methods: the dataset is built from academic papers in the biomedical domain through expert manual curation complemented by semi-automatic methods, covering tasks such as named entity recognition and disambiguation.
  • results: the dataset provides large manually curated named entity recognition and disambiguation corpora, enabling further development of NLP models and helping to improve reproducibility in biomedical research.
    Abstract Many of the most commonly explored natural language processing (NLP) information extraction tasks can be thought of as evaluations of declarative knowledge, or fact-based information extraction. Procedural knowledge extraction, i.e., breaking down a described process into a series of steps, has received much less attention, perhaps in part due to the lack of structured datasets that capture the knowledge extraction process from end-to-end. To address this unmet need, we present FlaMBé (Flow annotations for Multiverse Biological entities), a collection of expert-curated datasets across a series of complementary tasks that capture procedural knowledge in biomedical texts. This dataset is inspired by the observation that one ubiquitous source of procedural knowledge that is described as unstructured text is within academic papers describing their methodology. The workflows annotated in FlaMBé are from texts in the burgeoning field of single cell research, a research area that has become notorious for the number of software tools and complexity of workflows used. Additionally, FlaMBé provides, to our knowledge, the largest manually curated named entity recognition (NER) and disambiguation (NED) datasets for tissue/cell type, a fundamental biological entity that is critical for knowledge extraction in the biomedical research domain. Beyond providing a valuable dataset to enable further development of NLP models for procedural knowledge extraction, automating the process of workflow mining also has important implications for advancing reproducibility in biomedical research.
    摘要 许多最常研究的自然语言处理(NLP)信息抽取任务都可以被视为对陈述性知识(即基于事实的信息抽取)的评估。而过程性知识抽取,即把所描述的过程分解为一系列步骤,受到的关注要少得多,部分原因可能是缺乏端到端刻画知识抽取过程的结构化数据集。为满足这一需求,我们提出了FlaMBé(Flow annotations for Multiverse Biological entities),它是一组由专家整理、覆盖一系列互补任务的数据集,用于刻画生物医学文本中的过程性知识。该数据集的灵感来自这样一个观察:学术论文中描述研究方法的部分是一类普遍存在、以非结构化文本形式呈现的过程性知识来源。FlaMBé中标注的工作流来自快速发展的单细胞研究领域的文本,该领域因其使用的软件工具数量之多和工作流之复杂而著称。此外,据我们所知,FlaMBé还为组织/细胞类型(生物医学研究领域知识抽取的关键基础生物实体)提供了规模最大的人工整理的命名实体识别(NER)与实体消歧(NED)数据集。除了为过程性知识抽取的NLP模型开发提供宝贵数据集之外,自动化工作流挖掘也对提升生物医学研究的可复现性具有重要意义。

Are Emergent Abilities in Large Language Models just In-Context Learning?

  • paper_url: http://arxiv.org/abs/2309.01809
  • repo_url: https://github.com/ukplab/on-emergence
  • paper_authors: Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Iryna Gurevych
  • for: This paper aims to provide a comprehensive examination of the emergent abilities of large language models, specifically looking at the role of in-context learning in their performance.
  • methods: The authors use a set of 18 models with varying parameters (60 million to 175 billion) and test them on a set of 22 tasks. They conduct over 1,000 experiments to evaluate the models’ performance and determine the underlying mechanisms driving their emergent abilities.
  • results: The authors find that the emergent abilities of large language models can primarily be attributed to in-context learning, and there is no evidence for the emergence of reasoning abilities. This provides valuable insights into the use of these models and alleviates safety concerns regarding their performance.
    Abstract Large language models have exhibited emergent abilities, demonstrating exceptional performance across diverse tasks for which they were not explicitly trained, including those that require complex reasoning abilities. The emergence of such abilities carries profound implications for the future direction of research in NLP, especially as the deployment of such models becomes more prevalent. However, one key challenge is that the evaluation of these abilities is often confounded by competencies that arise in models through alternative prompting techniques, such as in-context learning and instruction following, which also emerge as the models are scaled up. In this study, we provide the first comprehensive examination of these emergent abilities while accounting for various potentially biasing factors that can influence the evaluation of models. We conduct rigorous tests on a set of 18 models, encompassing a parameter range from 60 million to 175 billion parameters, across a comprehensive set of 22 tasks. Through an extensive series of over 1,000 experiments, we provide compelling evidence that emergent abilities can primarily be ascribed to in-context learning. We find no evidence for the emergence of reasoning abilities, thus providing valuable insights into the underlying mechanisms driving the observed abilities and thus alleviating safety concerns regarding their use.
    摘要 大型语言模型展现出了涌现能力:在并未显式训练过的多种任务(包括需要复杂推理能力的任务)上表现出色。这类能力的出现对NLP研究的未来方向有深远影响,尤其是在此类模型的部署日益普及的背景下。然而,一个关键挑战在于,对这些能力的评估常常被模型通过其他提示技术(如上下文学习和指令跟随)获得的能力所混淆,而这些能力同样随模型规模的扩大而出现。在本研究中,我们在考虑各种可能影响模型评估的偏差因素的前提下,首次对这些涌现能力进行了全面考察。我们对18个模型(参数量从6000万到1750亿)在22项任务上进行了严格测试。通过1000多次实验,我们提供了有力证据,表明涌现能力主要可归因于上下文学习。我们没有发现推理能力涌现的证据,从而为理解所观察到的能力背后的机制提供了有价值的洞察,并缓解了关于其使用的安全性担忧。

An Empirical Analysis for Zero-Shot Multi-Label Classification on COVID-19 CT Scans and Uncurated Reports

  • paper_url: http://arxiv.org/abs/2309.01740
  • repo_url: None
  • paper_authors: Ethan Dack, Lorenzo Brigato, Matthew McMurray, Matthias Fontanellaz, Thomas Frauenfelder, Hanno Hoppe, Aristomenis Exadaktylos, Thomas Geiser, Manuela Funke-Chambour, Andreas Christe, Lukas Ebner, Stavroula Mougiakakou
  • for: study zero-shot multi-label classification methods based on contrastive visual language learning to help radiologists diagnose COVID-19 and identify fine-grained lung lesions.
  • methods: use unstructured data and CT imaging for zero-shot multi-label classification, investigating multiple zero-shot models in collaboration with human experts.
  • results: the empirical analysis shows that zero-shot multi-label classification can help radiologists better diagnose COVID-19 and identify fine-grained lung lesions.
    Abstract The pandemic resulted in vast repositories of unstructured data, including radiology reports, due to increased medical examinations. Previous research on automated diagnosis of COVID-19 primarily focuses on X-ray images, despite their lower precision compared to computed tomography (CT) scans. In this work, we leverage unstructured data from a hospital and harness the fine-grained details offered by CT scans to perform zero-shot multi-label classification based on contrastive visual language learning. In collaboration with human experts, we investigate the effectiveness of multiple zero-shot models that aid radiologists in detecting pulmonary embolisms and identifying intricate lung details like ground glass opacities and consolidations. Our empirical analysis provides an overview of the possible solutions to target such fine-grained tasks, so far overlooked in the medical multimodal pretraining literature. Our investigation promises future advancements in the medical image analysis community by addressing some challenges associated with unstructured data and fine-grained multi-label classification.
    摘要 疫情期间医学检查增多,产生了包括放射学报告在内的大量非结构化数据。先前关于COVID-19自动诊断的研究主要集中在X光图像上,尽管其精度低于计算机断层扫描(CT)。在这项工作中,我们利用一家医院的非结构化数据,并借助CT扫描提供的细粒度细节,基于对比视觉语言学习进行零样本多标签分类。我们与人类专家合作,研究多种零样本模型在帮助放射科医生检测肺栓塞以及识别磨玻璃影和实变等复杂肺部细节方面的有效性。我们的实证分析概述了针对这类细粒度任务的可能解决方案,而这些任务迄今在医学多模态预训练文献中一直被忽视。我们的研究有望解决非结构化数据和细粒度多标签分类相关的一些挑战,从而推动医学图像分析领域的未来进展。
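
A minimal sketch of contrastive vision-language zero-shot multi-label scoring: cosine similarity between an image embedding and per-finding text-prompt embeddings, with an independent sigmoid per label so several findings can co-occur. The random embeddings below stand in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512                                       # shared embedding dimension

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for a contrastive vision-language model's encoders.
image_emb = normalize(rng.normal(size=dim))     # embedding of one CT scan
labels = ["pulmonary embolism", "ground glass opacity", "consolidation"]
text_emb = normalize(rng.normal(size=(len(labels), dim)))  # prompt embeddings

# Multi-label zero-shot scoring: one independent score per finding, so several
# labels can be positive at once (unlike softmax single-label classification).
logits = 100.0 * text_emb @ image_emb           # temperature-scaled cosine sims
probs = 1.0 / (1.0 + np.exp(-logits))           # per-label sigmoid
for name, p in zip(labels, probs):
    print(f"{name}: {p:.2f}")
```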

Prompting or Fine-tuning? A Comparative Study of Large Language Models for Taxonomy Construction

  • paper_url: http://arxiv.org/abs/2309.01715
  • repo_url: https://github.com/20001LastOrder/Taxonomy-GPT
  • paper_authors: Boqi Chen, Fandi Yi, Dániel Varró
  • for: propose an automated taxonomy construction framework that satisfies structural constraints, improving the usefulness of taxonomies across software modeling and natural language processing (NLP) activities.
  • methods: use appropriate user inputs (called prompts) to guide large language models (LLMs) such as GPT-3 across NLP tasks without explicit (re-)training, and systematically compare prompting with fine-tuning.
  • results: without explicit training on the dataset, the prompting approach outperforms fine-tuning-based approaches on a hypernym taxonomy and a computer science taxonomy dataset, with the gap widening when the training set is small; however, taxonomies generated by fine-tuning can easily be post-processed to satisfy all constraints.
    Abstract Taxonomies represent hierarchical relations between entities, frequently applied in various software modeling and natural language processing (NLP) activities. They are typically subject to a set of structural constraints restricting their content. However, manual taxonomy construction can be time-consuming, incomplete, and costly to maintain. Recent studies of large language models (LLMs) have demonstrated that appropriate user inputs (called prompting) can effectively guide LLMs, such as GPT-3, in diverse NLP tasks without explicit (re-)training. However, existing approaches for automated taxonomy construction typically involve fine-tuning a language model by adjusting model parameters. In this paper, we present a general framework for taxonomy construction that takes into account structural constraints. We subsequently conduct a systematic comparison between the prompting and fine-tuning approaches performed on a hypernym taxonomy and a novel computer science taxonomy dataset. Our result reveals the following: (1) Even without explicit training on the dataset, the prompting approach outperforms fine-tuning-based approaches. Moreover, the performance gap between prompting and fine-tuning widens when the training dataset is small. However, (2) taxonomies generated by the fine-tuning approach can be easily post-processed to satisfy all the constraints, whereas handling violations of the taxonomies produced by the prompting approach can be challenging. These evaluation findings provide guidance on selecting the appropriate method for taxonomy construction and highlight potential enhancements for both approaches.
    摘要 分类体系(taxonomy)表示实体之间的层级关系,常用于各类软件建模和自然语言处理(NLP)活动中,并且通常受到一组限制其内容的结构约束。然而,人工构建分类体系可能耗时、不完整且维护成本高。最近对大型语言模型(LLM)的研究表明,适当的用户输入(称为提示)可以有效引导GPT-3等LLM完成多种NLP任务,而无需显式(重新)训练。不过,现有的自动分类体系构建方法通常需要通过调整模型参数来微调语言模型。在本文中,我们提出了一个考虑结构约束的通用分类体系构建框架。随后,我们在一个上位词分类体系和一个新的计算机科学分类体系数据集上,系统地比较了提示与微调两种方法。我们的结果表明:(1)即使没有在数据集上显式训练,提示方法也优于基于微调的方法,并且当训练数据集较小时,二者的性能差距会进一步拉大。然而,(2)微调方法生成的分类体系可以很容易地通过后处理满足所有约束,而处理提示方法所生成分类体系中的约束违例则可能颇具挑战。这些评估结果为选择合适的分类体系构建方法提供了指导,并指出了两种方法的潜在改进方向。

MathAttack: Attacking Large Language Models Towards Math Solving Ability

  • paper_url: http://arxiv.org/abs/2309.01686
  • repo_url: None
  • paper_authors: Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, Kaizhu Huang
  • for: examine the security of large language models (LLMs) with respect to their math problem-solving ability.
  • methods: propose MathAttack, a model for attacking math word problem samples while preserving their mathematical logic; logical entity recognition first identifies and freezes logical entries, and a word-level attacker then attacks the remaining text.
  • results: experiments show that MathAttack effectively attacks the math-solving ability of LLMs; adversarial samples generated from higher-accuracy LLMs also attack lower-accuracy ones (e.g., transferring from larger to smaller models, or from few-shot to zero-shot prompts), complex problems (more solving steps, longer text, more numbers) are more vulnerable, and LLM robustness can be improved by using the adversarial samples in few-shot prompts.
    Abstract With the boom of Large Language Models (LLMs), the research of solving Math Word Problem (MWP) has recently made great progress. However, there are few studies to examine the security of LLMs in math solving ability. Instead of attacking prompts in the use of LLMs, we propose a MathAttack model to attack MWP samples which are closer to the essence of security in solving math problems. Compared to traditional text adversarial attack, it is essential to preserve the mathematical logic of original MWPs during the attacking. To this end, we propose logical entity recognition to identify logical entries which are then frozen. Subsequently, the remaining text are attacked by adopting a word-level attacker. Furthermore, we propose a new dataset RobustMath to evaluate the robustness of LLMs in math solving ability. Extensive experiments on our RobustMath and two another math benchmark datasets GSM8K and MultiArith show that MathAttack could effectively attack the math solving ability of LLMs. In the experiments, we observe that (1) Our adversarial samples from higher-accuracy LLMs are also effective for attacking LLMs with lower accuracy (e.g., transfer from larger to smaller-size LLMs, or from few-shot to zero-shot prompts); (2) Complex MWPs (such as more solving steps, longer text, more numbers) are more vulnerable to attack; (3) We can improve the robustness of LLMs by using our adversarial samples in few-shot prompts. Finally, we hope our practice and observation can serve as an important attempt towards enhancing the robustness of LLMs in math solving ability. We will release our code and dataset.
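
To make the two-stage attack concrete, here is a minimal sketch: numbers stand in for the frozen logical entities, and a greedy word-level substitution searches for a rewrite that flips a toy solver's answer. The solver, synonym table, and freezing rule are all hypothetical stand-ins for the paper's components.

```python
import re

# Toy stand-ins: a deliberately brittle MWP "solver" and a tiny synonym
# table. A real attack would query an actual LLM instead.
def query_solver(problem: str) -> str:
    return "7" if "purchased" not in problem else "5"

SYNONYMS = {"bought": ["purchased", "acquired"], "gave": ["handed", "passed"]}
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def frozen_entities(text: str) -> set:
    """Freeze logical entities; numbers serve as a crude proxy here."""
    return set(NUMBER.findall(text))

def word_level_attack(problem: str, gold_answer: str):
    frozen = frozen_entities(problem)
    words = problem.split()
    for i, w in enumerate(words):
        if w in frozen or w.lower() not in SYNONYMS:
            continue
        for candidate in SYNONYMS[w.lower()]:
            attacked = " ".join(words[:i] + [candidate] + words[i + 1:])
            if query_solver(attacked) != gold_answer:   # answer flipped
                return attacked
    return None   # no successful adversarial rewrite found

problem = "Tom bought 3 apples and gave 2 away . How many remain ?"
print(word_level_attack(problem, gold_answer=query_solver(problem)))
```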

CRUISE-Screening: Living Literature Reviews Toolbox

  • paper_url: http://arxiv.org/abs/2309.01684
  • repo_url: https://github.com/projectdossier/cruise-screening
  • paper_authors: Wojciech Kusa, Petr Knoth, Allan Hanbury
  • for: Help researchers quickly find relevant studies and improve the efficiency and effectiveness of conducting literature reviews.
  • methods: Uses text classification and question answering models to assist in screening relevant publications, and connects to several search engines via an API so that search results can be updated periodically.
  • results: A web-based application for searching and screening in living literature reviews that spares researchers manual screening and searching, improving their productivity.
    Abstract Keeping up with research and finding related work is still a time-consuming task for academics. Researchers sift through thousands of studies to identify a few relevant ones. Automation techniques can help by increasing the efficiency and effectiveness of this task. To this end, we developed CRUISE-Screening, a web-based application for conducting living literature reviews - a type of literature review that is continuously updated to reflect the latest research in a particular field. CRUISE-Screening is connected to several search engines via an API, which allows for updating the search results periodically. Moreover, it can facilitate the process of screening for relevant publications by using text classification and question answering models. CRUISE-Screening can be used both by researchers conducting literature reviews and by those working on automating the citation screening process to validate their algorithms. The application is open-source: https://github.com/ProjectDoSSIER/cruise-screening, and a demo is available under this URL: https://citation-screening.ec.tuwien.ac.at. We discuss the limitations of our tool in Appendix A.

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

  • paper_url: http://arxiv.org/abs/2309.01669
  • repo_url: None
  • paper_authors: Leon Weber-Genzel, Robert Litschko, Ekaterina Artemova, Barbara Plank
  • for: This paper focuses on how to apply Annotation Error Detection (AED) methods in generative settings to improve the training of Large Language Models (LLMs).
  • methods: The paper uses three instruction-tuning datasets annotated by experts and semi-automatic methods, and proposes and comprehensively evaluates four AED baseline methods.
  • results: The paper finds that choosing the right AED method and model size is crucial for improving instruction tuning, and provides a first case study of how the quality of instruction-tuning datasets influences downstream performance.
    Abstract Instruction-tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality issues of gold-standard labels. But so far, the application of AED methods is limited to discriminative settings. It is an open question how well AED methods generalize to generative settings which are becoming widespread via generative LLMs. In this work, we present a first and new benchmark for AED on instruction-tuning data: Donkii. It encompasses three instruction-tuning datasets enriched with annotations by experts and semi-automatic methods. We find that all three datasets contain clear-cut errors that sometimes directly propagate into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them comprehensively on the newly introduced dataset. Our results demonstrate that choosing the right AED method and model size is indeed crucial, thereby deriving practical recommendations. To gain insights, we provide a first case-study to examine how the quality of the instruction-tuning datasets influences downstream performance.
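
One plausible AED baseline for the generative setting can be sketched as ranking examples by the language-model loss of the response given its prompt, flagging the highest-loss pairs for review. The model choice (GPT-2) and the scoring rule below are illustrative assumptions, not necessarily among the paper's four baselines.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_loss(prompt: str, response: str) -> float:
    """Per-token LM loss of the response given the prompt (approximate split)."""
    ids = tok(prompt + " " + response, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prompt_len] = -100          # ignore prompt tokens in the loss
    with torch.no_grad():
        return model(ids, labels=labels).loss.item()

examples = [
    {"prompt": "Translate to French: cat", "response": "chat"},
    {"prompt": "Translate to French: dog", "response": "airplane"},  # suspicious
]
# Highest-loss examples surface first for manual review.
for ex in sorted(examples, key=lambda e: -response_loss(e["prompt"], e["response"])):
    print(f'{response_loss(ex["prompt"], ex["response"]):.2f}', ex["response"])
```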

Evolving linguistic divergence on polarizing social media

  • paper_url: http://arxiv.org/abs/2309.01659
  • repo_url: https://github.com/andreskarjus/evolving_divergence
  • paper_authors: Andres Karjus, Christine Cuskley
  • for: This study examines linguistic divergence across the partisan divide in the United States, focusing on differences in language use on social media platforms.
  • methods: The study combines social media data mining, lexicostatistics, machine learning, large language models, and a systematic human annotation approach to characterize and quantify linguistic divergence.
  • results: The study finds linguistic divergence between US partisan groups, especially in topics and themes of conversation, in line with previous research; these findings suggest that ongoing polarization may eventually lead to linguistic divergence in American English.
    Abstract Language change is influenced by many factors, but often starts from synchronic variation, where multiple linguistic patterns or forms coexist, or where different speech communities use language in increasingly different ways. Besides regional or economic reasons, communities may form and segregate based on political alignment. The latter, referred to as political polarization, is of growing societal concern across the world. Here we map and quantify linguistic divergence across the partisan left-right divide in the United States, using social media data. We develop a general methodology to delineate (social) media users by their political preference, based on which (potentially biased) news media accounts they do and do not follow on a given platform. Our data consists of 1.5M short posts by 10k users (about 20M words) from the social media platform Twitter (now "X"). Delineating this sample involved mining the platform for the lists of followers (n=422M) of 72 large news media accounts. We quantify divergence in topics of conversation and word frequencies, messaging sentiment, and lexical semantics of words and emoji. We find signs of linguistic divergence across all these aspects, especially in topics and themes of conversation, in line with previous research. While US American English remains largely intelligible within its large speech community, our findings point at areas where miscommunication may eventually arise given ongoing polarization and therefore potential linguistic divergence. Our methodology - combining data mining, lexicostatistics, machine learning, large language models and a systematic human annotation approach - is largely language and platform agnostic. In other words, while we focus here on US political divides and US English, the same approach is applicable to other countries, languages, and social media platforms.
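
The delineation step can be illustrated with a small sketch: a user is labeled by comparing the overlap of the accounts they follow with left- and right-leaning media lists. The account lists and margin threshold below are invented for illustration; the paper instead mines the follower lists of 72 real news outlets.

```python
# Illustrative media lists and threshold, not the paper's curated outlets.
LEFT = {"nytimes", "maddow", "motherjones"}
RIGHT = {"foxnews", "breitbartnews", "dailycaller"}

def partisan_label(followed, min_margin=2):
    """Label a user when one side clearly dominates, else exclude them."""
    left, right = len(followed & LEFT), len(followed & RIGHT)
    if left - right >= min_margin:
        return "left"
    if right - left >= min_margin:
        return "right"
    return None   # ambiguous: dropped from the delineated sample

print(partisan_label({"nytimes", "maddow", "nasa"}))   # left
print(partisan_label({"foxnews", "nytimes"}))          # None
```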

Exploring the effectiveness of ChatGPT-based feedback compared with teacher feedback and self-feedback: Evidence from Chinese to English translation

  • paper_url: http://arxiv.org/abs/2309.01645
  • repo_url: None
  • paper_authors: Siyi Cao, Linping Zhong
  • for: This study compares the effectiveness of ChatGPT-based feedback with teacher feedback (TF) and self-feedback (SF) for Chinese Master of Translation and Interpretation (MTI) students, who learned English as a second/foreign language, revising Chinese-to-English translations.
  • methods: The study uses BLEU scores to gauge translation quality and Coh-Metrix to analyze linguistic features of the translated texts.
  • results: TF- and SF-guided translations were of higher overall quality than those revised with ChatGPT-based feedback, but ChatGPT-based feedback better improved lexical capability and referential cohesion, while TF and SF were more effective for syntax-related skills, particularly correct use of the passive voice.
    Abstract ChatGPT, a cutting-edge AI-powered chatbot, can quickly generate responses to given commands. While it has been reported that ChatGPT has the capacity to deliver useful feedback, its effectiveness compared with conventional feedback approaches, such as teacher feedback (TF) and self-feedback (SF), remains unclear. To address this issue, this study compared the revised Chinese-to-English translation texts produced by Chinese Master of Translation and Interpretation (MTI) students, who learned English as a Second/Foreign Language (ESL/EFL), based on three feedback types (i.e., ChatGPT-based feedback, TF, and SF). The data were analyzed using the BLEU score to gauge overall translation quality as well as Coh-Metrix to examine linguistic features across three dimensions: lexicon, syntax, and cohesion. The findings revealed that TF- and SF-guided translation texts surpassed those with ChatGPT-based feedback, as indicated by the BLEU score. In terms of linguistic features, ChatGPT-based feedback demonstrated superiority, particularly in enhancing lexical capability and referential cohesion in the translation texts. However, TF and SF proved more effective in developing syntax-related skills, as they addressed instances of incorrect usage of the passive voice. These diverse outcomes indicate ChatGPT's potential as a supplementary resource, complementing traditional teacher-led methods in translation practice.

Critical Behavioral Traits Foster Peer Engagement in Online Mental Health Communities

  • paper_url: http://arxiv.org/abs/2309.01618
  • repo_url: None
  • paper_authors: Aseem Srivastava, Tanya Gupta, Alison Cerezo, Sarah Peregrine Lord, Md Shad Akhtar, Tanmoy Chakraborty
  • for: This paper aims to explore the factors that drive peer engagement within counseling threads on online mental health communities, such as Reddit, to enhance our understanding of this critical phenomenon.
  • methods: The study uses a novel dataset called BeCOPE, which consists of over 10,000 posts and 58,000 comments from 21 mental health-specific subreddits, annotated with three major fine-grained behavior labels (intent, criticism, and readability) and emotion labels.
  • results: The study finds that self-criticism is the most prevalent form of criticism expressed by help-seekers, and that individuals who explicitly express their need for help are more likely to receive assistance compared to those who present surveys or engage in rants. Additionally, the study highlights the importance of well-articulated problem descriptions in receiving support.
    Abstract Online Mental Health Communities (OMHCs), such as Reddit, have witnessed a surge in popularity as go-to platforms for seeking information and support in managing mental health needs. Platforms like Reddit offer immediate interactions with peers, granting users a vital space for seeking mental health assistance. However, the largely unregulated nature of these platforms introduces intricate challenges for both users and society at large. This study explores the factors that drive peer engagement within counseling threads, aiming to enhance our understanding of this critical phenomenon. We introduce BeCOPE, a novel behavior encoded Peer counseling dataset comprising over 10,118 posts and 58,279 comments sourced from 21 mental health-specific subreddits. The dataset is annotated using three major fine-grained behavior labels: (a) intent, (b) criticism, and (c) readability, along with the emotion labels. Our analysis indicates the prominence of ``self-criticism'' as the most prevalent form of criticism expressed by help-seekers, accounting for a significant 43% of interactions. Intriguingly, we observe that individuals who explicitly express their need for help are 18.01% more likely to receive assistance compared to those who present ``surveys'' or engage in ``rants.'' Furthermore, we highlight the pivotal role of well-articulated problem descriptions, showing that superior readability effectively doubles the likelihood of receiving the sought-after support. Our study emphasizes the essential role of OMHCs in offering personalized guidance and unveils behavior-driven engagement patterns.

Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking

  • paper_url: http://arxiv.org/abs/2309.01606
  • repo_url: None
  • paper_authors: Yong Cao, Ruixue Ding, Boli Chen, Xianzhi Li, Min Chen, Daniel Hershcovich, Pengjun Xie, Fei Huang
  • for: Improve the accuracy of Chinese geographic re-ranking so that location-related applications such as navigation maps return more relevant results.
  • methods: Proposes Geo-Encoder, a framework that integrates Chinese geographical semantics into the re-ranking pipeline. Off-the-shelf tools first associate text with geographic spans, treated as chunking units; a multi-task learning module then learns an attention matrix that determines each chunk's contribution; and an asynchronous update mechanism guides the model to focus effectively on specific chunks.
  • results: Experiments on two Chinese geographic re-ranking datasets show significant improvements over previous baselines. Notably, Geo-Encoder raises the Hit@1 score of MGEO-BERT from 62.76 to 68.98 on the GeoTES dataset, a 6.22% improvement.
    Abstract Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates, which is crucial for location-related services such as navigation maps. Unlike the general sentences, geographic contexts are closely intertwined with geographical concepts, from general spans (e.g., province) to specific spans (e.g., road). Given this feature, we propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines. Our methodology begins by employing off-the-shelf tools to associate text with geographical spans, treating them as chunking units. Then, we present a multi-task learning module to simultaneously acquire an effective attention matrix that determines chunk contributions to extra semantic representations. Furthermore, we put forth an asynchronous update mechanism for the proposed addition task, aiming to guide the model capable of effectively focusing on specific chunks. Experiments on two distinct Chinese geographic re-ranking datasets, show that the Geo-Encoder achieves significant improvements when compared to state-of-the-art baselines. Notably, it leads to a substantial improvement in the Hit@1 score of MGEO-BERT, increasing it by 6.22% from 62.76 to 68.98 on the GeoTES dataset.

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

  • paper_url: http://arxiv.org/abs/2309.01576
  • repo_url: None
  • paper_authors: Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman
  • for: This study investigates the impact of different pretrained language models (PLMs) on two text-to-speech (TTS) tasks: prosody prediction and pause prediction.
  • methods: The study trains and evaluates models built on 15 different PLMs.
  • results: The findings reveal a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Pause prediction proves less sensitive to small models, and a strong correlation is found between the models' GLUE scores and the experimental results.
    Abstract State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for these language models. To the best of our knowledge, this is the first study of its kind to investigate the impact of different PLMs on TTS.
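
A minimal sketch of the common recipe such comparisons rely on: freeze a PLM, pool its hidden states, and train a small head for the TTS-side target. The model name, pooling choice, and three-dimensional prosody target below are assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed PLM choice
plm = AutoModel.from_pretrained("bert-base-uncased").eval()
for p in plm.parameters():
    p.requires_grad_(False)            # the PLM stays frozen

# Small trainable head; a 3-dim target (e.g., pitch/energy/duration) is assumed.
head = nn.Sequential(nn.Linear(plm.config.hidden_size, 128),
                     nn.ReLU(),
                     nn.Linear(128, 3))

def predict_prosody(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt")
    hidden = plm(**batch).last_hidden_state   # (1, seq_len, hidden)
    return head(hidden.mean(dim=1))           # mean-pool, then regress

print(predict_prosody("This is a test sentence.").shape)   # torch.Size([1, 3])
```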

What are Public Concerns about ChatGPT? A Novel Self-Supervised Neural Topic Model Tells You

  • paper_url: http://arxiv.org/abs/2309.01522
  • repo_url: None
  • paper_authors: Rui Wang, Xing Liu, Yanan Wang, Haiping Huang
  • for: The goal of this study is to mine public concerns about ChatGPT.
  • methods: The study proposes a Self-Supervised neural Topic Model (SSTM), which formalizes topic modeling as a representation learning procedure.
  • results: Experiments show that the proposed method extracts higher-quality public concerns with better interpretability and diversity, surpassing the performance of existing approaches.
    Abstract The recently released artificial intelligence conversational agent, ChatGPT, has gained significant attention in academia and real life. A multitude of early ChatGPT users eagerly explore its capabilities and share their opinions on it via social media. Both user queries and social media posts express public concerns regarding this advanced dialogue system. To mine public concerns about ChatGPT, a novel Self-Supervised neural Topic Model (SSTM), which formalizes topic modeling as a representation learning procedure, is proposed in this paper. Extensive experiments have been conducted on Twitter posts about ChatGPT and queries asked by ChatGPT users. And experimental results demonstrate that the proposed approach could extract higher quality public concerns with improved interpretability and diversity, surpassing the performance of state-of-the-art approaches.

LLM and Infrastructure as a Code use case

  • paper_url: http://arxiv.org/abs/2309.01456
  • repo_url: None
  • paper_authors: Thibault Chanus, Michael Aubertin
  • for: This paper aims to explore the use of Generative LLMs (Language Models) to generate and manage Ansible YAML roles and playbooks, with a focus on identifying potential directions and industrial applications.
  • methods: The paper employs the use of Ansible and YAML, alongside Generative LLMs, to automate systems administration tasks and translate human descriptions into code.
  • results: The paper outlines the potential of Generative LLMs in this context, pointing to improved efficiency and accuracy in generating and managing Ansible YAML roles and playbooks.
    Abstract Cloud computing and the evolution of management methodologies such as Lean Management or Agile entail a profound transformation in both system construction and maintenance approaches. These practices are encompassed within the term "DevOps." This descriptive approach to an information system or application, alongside the configuration of its constituent components, has necessitated the development of descriptive languages paired with specialized engines for automating systems administration tasks. Among these, the tandem of Ansible (engine) and YAML (descriptive language) stands out as the two most prevalent tools in the market, facing notable competition mainly from Terraform. The current document presents an inquiry into a solution for generating and managing Ansible YAML roles and playbooks, utilizing Generative LLMs (Language Models) to translate human descriptions into code. Our efforts are focused on identifying plausible directions and outlining the potential industrial applications. Note: For the purpose of this experiment, we have opted against the use of Ansible Lightspeed. This is due to its reliance on an IBM Watson model, for which we have not found any publicly available references. Comprehensive information regarding this remarkable technology can be found directly on our partner RedHat's website, https://www.redhat.com/en/about/press-releases/red-hat-introduces-ansible-lightspeed-ai-driven-it-automation

NumHG: A Dataset for Number-Focused Headline Generation

  • paper_url: http://arxiv.org/abs/2309.01455
  • repo_url: None
  • paper_authors: Jian-Tao Huang, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen
  • for: This study aims to improve numeral accuracy in headline generation by introducing a new dataset, NumHG, and human-evaluating five previously top-performing models.
  • methods: The study uses encoder-decoder models and provides fine-grained annotations on the dataset to support learning and evaluating numeral generation.
  • results: Existing models fall short in numeral generation, particularly in numerical accuracy. The NumHG dataset can help address this gap and drive further research and discussion.
    Abstract Headline generation, a key task in abstractive summarization, strives to condense a full-length article into a succinct, single line of text. Notably, while contemporary encoder-decoder models excel based on the ROUGE metric, they often falter when it comes to the precise generation of numerals in headlines. We identify the lack of datasets providing fine-grained annotations for accurate numeral generation as a major roadblock. To address this, we introduce a new dataset, the NumHG, and provide over 27,000 annotated numeral-rich news articles for detailed investigation. Further, we evaluate five well-performing models from previous headline generation tasks using human evaluation in terms of numerical accuracy, reasonableness, and readability. Our study reveals a need for improvement in numerical accuracy, demonstrating the potential of the NumHG dataset to drive progress in number-focused headline generation and stimulate further discussions in numeral-focused text generation.
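
A numeral-focused metric of the kind this task calls for can be sketched in a few lines: extract the numbers from the generated and reference headlines and compare them as multisets. This is an illustrative metric, not NumHG's official scoring script.

```python
import re
from collections import Counter

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def numeral_accuracy(generated: str, reference: str) -> float:
    """Fraction of reference numerals reproduced in the generated headline."""
    ref = Counter(NUM.findall(reference))
    if not ref:
        return 1.0                        # nothing numeric to get right
    gen = Counter(NUM.findall(generated))
    return sum((ref & gen).values()) / sum(ref.values())

print(numeral_accuracy("Profits rise 12% to $3.4B", "Profits up 12% to $3.4B"))  # 1.0
print(numeral_accuracy("Profits rise 15% to $3.4B", "Profits up 12% to $3.4B"))  # 0.5
```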

Open Sesame! Universal Black Box Jailbreaking of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.01446
  • repo_url: None
  • paper_authors: Raz Lapid, Ron Langberg, Moshe Sipper
  • for: This paper presents a method for manipulating large language models (LLMs) toward unintended purposes when the model architecture and parameters are inaccessible.
  • methods: The approach uses a genetic algorithm (GA) to optimize a universal adversarial prompt that disrupts the attacked model's alignment, causing it to produce unintended and potentially harmful output.
  • results: Extensive experiments demonstrate the effectiveness of the method, providing a diagnostic tool for evaluating and enhancing the alignment of LLMs with human intent; to the authors' knowledge, this is the first automated universal black-box jailbreak attack.
    Abstract Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that -- when combined with a user's query -- disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our novel approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing alignment of LLMs with human intent. To our knowledge this is the first automated universal black box jailbreak attack.
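
The black-box GA loop can be sketched as follows: maintain a population of candidate suffixes, select by fitness, and apply crossover and mutation. The toy fitness function below is a stand-in; a real attack would score the target model's responses to prompt-plus-suffix inputs.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz !?")

def fitness(suffix: str) -> float:
    # Toy objective; a real attack scores the target model's responses.
    return suffix.count("ok") - 0.01 * len(suffix)

def mutate(s: str) -> str:
    i = random.randrange(len(s))
    return s[:i] + random.choice(VOCAB) + s[i + 1:]

def crossover(a: str, b: str) -> str:
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=32, length=12, generations=200):
    pop = ["".join(random.choices(VOCAB, k=length)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 4]          # keep the fittest quarter
        pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=fitness)

random.seed(0)
print(evolve())   # a suffix packed with the rewarded pattern
```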

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

  • paper_url: http://arxiv.org/abs/2309.02459
  • repo_url: None
  • paper_authors: Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng
  • for: Improve automatic speech recognition (ASR) performance in new domains by using text-only data for domain adaptation.
  • methods: Uses text data for domain adaptation by down-sampling the acoustic representation to match the length of the text representation.
  • results: Experiments show that the proposed method learns better unified representations of the two modalities, improving ASR performance.
    Abstract Mapping two modalities, speech and text, into a shared representation space is a research direction that uses text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the lengths of speech and text representations are inconsistent. Although previous methods up-sample the text representation to align with the acoustic modality, the result may not match the expected actual duration. In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data of the target domain. Experimental results on new domain data demonstrate the effectiveness of the proposed method.
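
The continuous integrate-and-fire (CIF) mechanism admits a compact sketch: per-frame weights are accumulated, and a token-level vector is emitted each time the accumulator crosses a threshold, shrinking the acoustic sequence to roughly token length. The weight source and remainder handling below are simplified assumptions.

```python
import numpy as np

def cif_downsample(frames, alphas, beta=1.0):
    """frames: (T, d) acoustic states; alphas: (T,) nonnegative weights."""
    out, acc, buf = [], 0.0, np.zeros(frames.shape[1])
    for h, a in zip(frames, alphas):
        if acc + a < beta:                 # keep integrating
            acc, buf = acc + a, buf + a * h
        else:                              # fire: split this frame's weight
            spill = acc + a - beta
            out.append(buf + (a - spill) * h)
            acc, buf = spill, spill * h    # leftover seeds the next token
    # Note: any final partial accumulation is discarded in this sketch.
    return np.stack(out) if out else np.empty((0, frames.shape[1]))

rng = np.random.default_rng(0)
frames, alphas = rng.normal(size=(8, 4)), np.full(8, 0.4)
print(cif_downsample(frames, alphas).shape)   # (3, 4): 8 frames -> 3 "tokens"
```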

SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge

  • paper_url: http://arxiv.org/abs/2309.01437
  • repo_url: None
  • paper_authors: Jiaxu Zhu, Changhe Song, Zhiyong Wu, Helen Meng
  • for: Improve speech recognition, especially on out-of-domain and long-tailed data.
  • methods: Introduces sememe-based semantic knowledge into the speech recognition model.
  • results: Experiments show that sememe knowledge improves recognition, including on long-tailed data, and strengthens the model's domain generalization ability.
    Abstract Recently, excellent progress has been made in speech recognition. However, pure data-driven approaches have struggled to solve the problem in domain-mismatch and long-tailed data. Considering that knowledge-driven approaches can help data-driven approaches alleviate their flaws, we introduce sememe-based semantic knowledge information to speech recognition (SememeASR). Sememe, according to the linguistic definition, is the minimum semantic unit in a language and is able to represent the implicit semantic information behind each word very well. Our experiments show that the introduction of sememe information can improve the effectiveness of speech recognition. In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data and enhance the model's domain generalization ability.

Benchmarking Large Language Models in Retrieval-Augmented Generation

  • paper_url: http://arxiv.org/abs/2309.01431
  • repo_url: https://github.com/chen700564/RGB
  • paper_authors: Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun
  • for: This paper aims to evaluate the impact of Retrieval-Augmented Generation (RAG) on large language models (LLMs) and identify potential bottlenecks in their capabilities.
  • methods: The paper uses a systematic approach to investigate the impact of RAG on LLMs, including the establishment of a new corpus (RGB) and the evaluation of 6 representative LLMs on RGB.
  • results: The evaluation reveals that while LLMs exhibit some degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information, indicating that there is still a considerable journey ahead to effectively apply RAG to LLMs.
    Abstract Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which make it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
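
A noise-robustness testbed in the spirit of RGB can be sketched by mixing irrelevant documents with the relevant one at a chosen ratio and checking whether the answer survives. The `ask_llm` reader below is a toy stand-in for a real LLM call.

```python
import random

def ask_llm(question, docs):
    # Toy reader: answers if any provided document contains "2019".
    return "2019" if any("2019" in d for d in docs) else "unknown"

def noise_robustness(question, relevant_doc, noise_docs, noise_ratio, answer):
    """Mix noisy documents with the relevant one and test the answer."""
    k = round(noise_ratio * len(noise_docs))
    docs = [relevant_doc] + random.sample(noise_docs, k)
    random.shuffle(docs)
    return ask_llm(question, docs) == answer

random.seed(0)
noise = [f"Unrelated fact number {i}." for i in range(10)]
print(noise_robustness("When was the report published?",
                       "The report was published in 2019.",
                       noise, noise_ratio=0.8, answer="2019"))
```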

Hateful Messages: A Conversational Data Set of Hate Speech produced by Adolescents on Discord

  • paper_url: http://arxiv.org/abs/2309.01413
  • repo_url: None
  • paper_authors: Jan Fillies, Silvio Peikert, Adrian Paschke
  • For: The paper aims to address the bias of youth language within hate speech classification and provide a modern and anonymized hate speech youth language data set.* Methods: The research uses a self-developed annotation schema to classify publicly available online messages from the chat platform Discord, with 6.42% of the messages classified as hate speech. The data set includes age annotations for 35,553 messages, with an average author age of under 20 years old.* Results: The paper provides a modern and anonymized hate speech youth language data set consisting of 88,395 annotated chat messages, which can be used to improve the generalizability and performance of automated hate speech classification systems.
    Abstract With the rise of social media, a rise of hateful content can be observed. Even though the understanding and definitions of hate speech vary, platforms, communities, and legislatures all acknowledge the problem. Meanwhile, adolescents are a new and active group of social media users, and the majority of adolescents experience or witness online hate speech. Research in the field of automated hate speech classification has been on the rise and focuses on aspects such as bias, generalizability, and performance. To increase generalizability and performance, it is important to understand biases within the data. This research addresses the bias of youth language within hate speech classification and contributes by providing a modern and anonymized hate speech youth language data set consisting of 88,395 annotated chat messages. The data set consists of publicly available online messages from the chat platform Discord. Approximately 6.42% of the messages were classified by a self-developed annotation schema as hate speech. For 35,553 messages, the user profiles provided age annotations, setting the average author age to under 20 years old.

Zero-shot information extraction from radiological reports using ChatGPT

  • paper_url: http://arxiv.org/abs/2309.01398
  • repo_url: None
  • paper_authors: Danqing Hu, Bing Liu, Xiaofeng Zhu, Xudong Lu, Nan Wu
  • for: Extract useful information from radiological reports so that it can be used for secondary analysis.
  • methods: Uses the ChatGPT large language model for zero-shot information extraction, requiring no annotated data for model parameter tuning.
  • results: On 847 CT reports, ChatGPT extracts useful information with competitive performance, though further improvements are still needed.
    Abstract Electronic health records contain an enormous amount of valuable information, but many are recorded in free text. Information extraction is the strategy to transform the sequence of characters into structured data, which can be employed for secondary analysis. However, the traditional information extraction components, such as named entity recognition and relation extraction, require annotated data to optimize the model parameters, which has become one of the major bottlenecks in building information extraction systems. With the large language models achieving good performances on various downstream NLP tasks without parameter tuning, it becomes possible to use large language models for zero-shot information extraction. In this study, we aim to explore whether the most popular large language model, ChatGPT, can extract useful information from the radiological reports. We first design the prompt template for the interested information in the CT reports. Then, we generate the prompts by combining the prompt template with the CT reports as the inputs of ChatGPT to obtain the responses. A post-processing module is developed to transform the responses into structured extraction results. We conducted the experiments with 847 CT reports collected from Peking University Cancer Hospital. The experimental results indicate that ChatGPT can achieve competitive performances for some extraction tasks compared with the baseline information extraction system, but some limitations need to be further improved.
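
The zero-shot pipeline the abstract describes (prompt template, LLM call, post-processing into structured output) can be sketched as below. The `chat` function, field names, and JSON format are hypothetical stand-ins, not the paper's actual prompts.

```python
import json
import re

TEMPLATE = ("Extract the following fields from the CT report below and answer "
            "strictly in JSON with keys tumor_location, tumor_size_mm, "
            "lymph_nodes.\nReport: {report}")

def chat(prompt: str) -> str:
    # Toy response standing in for a ChatGPT API call.
    return ('{"tumor_location": "left upper lobe", '
            '"tumor_size_mm": 23, "lymph_nodes": "enlarged"}')

def extract(report: str) -> dict:
    raw = chat(TEMPLATE.format(report=report))
    match = re.search(r"\{.*\}", raw, re.DOTALL)   # tolerate chatter around JSON
    try:
        return json.loads(match.group()) if match else {}
    except json.JSONDecodeError:
        return {}

print(extract("CT: 23 mm nodule in the left upper lobe; enlarged hilar nodes."))
```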

cs.LG - 2023-09-04

Delegating Data Collection in Decentralized Machine Learning

  • paper_url: http://arxiv.org/abs/2309.01837
  • repo_url: None
  • paper_authors: Nivasini Ananthakrishnan, Stephen Bates, Michael I. Jordan, Nika Haghtalab
  • for: Studies the problem of optimally delegating data collection, particularly in the face of two fundamental machine learning challenges: uncertainty in assessing model quality and lack of knowledge regarding the optimal performance of any model.
  • methods: Starting from contract theory, designs optimal and near-optimal contracts that address both challenges. Simple linear contracts achieve a 1-1/e fraction of the first-best utility even when the principal has only a small test set, and conditions on the test-set size yield a vanishing additive approximation to the optimal utility.
  • results: Shows that, under uncertainty about model quality and without prior knowledge of optimal model performance, both challenges can be handled via simple linear contracts and an adaptive, efficient convex program, enabling effective delegation of data collection.
    Abstract Motivated by the emergence of decentralized machine learning ecosystems, we study the delegation of data collection. Taking the field of contract theory as our starting point, we design optimal and near-optimal contracts that deal with two fundamental machine learning challenges: lack of certainty in the assessment of model quality and lack of knowledge regarding the optimal performance of any model. We show that lack of certainty can be dealt with via simple linear contracts that achieve a 1-1/e fraction of the first-best utility, even if the principal has a small test set. Furthermore, we give sufficient conditions on the size of the principal's test set that achieve a vanishing additive approximation to the optimal utility. To address the lack of a priori knowledge regarding the optimal performance, we give a convex program that can adaptively and efficiently compute the optimal contract.

Soft-Dropout: A Practical Approach for Mitigating Overfitting in Quantum Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2309.01829
  • repo_url: None
  • paper_authors: Aakash Ravindra Shinde, Charu Jain, Amir Kalev
  • for: This paper studies overfitting in quantum convolutional neural networks (QCNNs).
  • methods: The paper adapts a classical overfitting mitigation technique, the (post-training) dropout method, to the quantum setting.
  • results: A straightforward implementation of dropout in the quantum setting substantially decreases the QCNN's success probability. The paper also proposes a softer version of dropout that successfully handles overfitting in QCNNs.
    Abstract Quantum convolutional neural network (QCNN), an early application for quantum computers in the NISQ era, has been consistently proven successful as a machine learning (ML) algorithm for several tasks with significant accuracy. Derived from its classical counterpart, QCNN is prone to overfitting. Overfitting is a typical shortcoming of ML models that are trained too closely to the available training dataset and perform relatively poorly on unseen datasets for a similar problem. In this work we study the adaptation of one of the most successful overfitting mitigation methods, known as the (post-training) dropout method, to the quantum setting. We find that a straightforward implementation of this method in the quantum setting leads to a significant and undesirable consequence: a substantial decrease in the success probability of the QCNN. We argue that this effect exposes the crucial role of entanglement in QCNNs and the vulnerability of QCNNs to entanglement loss. To handle overfitting, we propose a softer version of the dropout method. We find that the proposed method allows us to handle overfitting successfully in the test cases.

Secure and Efficient Federated Learning in LEO Constellations using Decentralized Key Generation and On-Orbit Model Aggregation

  • paper_url: http://arxiv.org/abs/2309.01828
  • repo_url: None
  • paper_authors: Mohamed Elmahallawy, Tie Luo, Mohamed I. Ibrahem
  • for: This paper addresses data downloading and federated learning for small satellites operating in low Earth orbit (LEO).
  • methods: The paper proposes FedSecure, a secure federated learning approach with two novel components: (1) decentralized key generation, which protects satellite data privacy using a functional encryption scheme, and (2) on-orbit model forwarding and aggregation, which produces a partial global model per orbit to minimize the idle waiting time for satellites outside the ground station's visible zone.
  • results: Analysis and results show that FedSecure protects each satellite's data from disclosure to eavesdroppers, a curious server, or curious satellites, with low communication and computation overhead, and achieves high accuracy (up to 85.35%) in federated learning.
    Abstract Satellite technologies have advanced drastically in recent years, leading to a heated interest in launching small satellites into low Earth orbit (LEOs) to collect massive data such as satellite imagery. Downloading these data to a ground station (GS) to perform centralized learning to build an AI model is not practical due to the limited and expensive bandwidth. Federated learning (FL) offers a potential solution but will incur a very large convergence delay due to the highly sporadic and irregular connectivity between LEO satellites and GS. In addition, there are significant security and privacy risks where eavesdroppers or curious servers/satellites may infer raw data from satellites' model parameters transmitted over insecure communication channels. To address these issues, this paper proposes FedSecure, a secure FL approach designed for LEO constellations, which consists of two novel components: (1) decentralized key generation that protects satellite data privacy using a functional encryption scheme, and (2) on-orbit model forwarding and aggregation that generates a partial global model per orbit to minimize the idle waiting time for invisible satellites to enter the visible zone of the GS. Our analysis and results show that FedSecure preserves the privacy of each satellite's data against eavesdroppers, a curious server, or curious satellites. It is lightweight with significantly lower communication and computation overheads than other privacy-preserving FL aggregation approaches. It also reduces convergence delay drastically from days to only a few hours, yet achieving high accuracy of up to 85.35% using realistic satellite images.

LoopTune: Optimizing Tensor Computations with Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.01825
  • repo_url: None
  • paper_authors: Dejan Grubisic, Bram Wasti, Chris Cummins, John Mellor-Crummey, Aleksandar Zlateski
  • for: This paper addresses running high-performance machine learning applications on new hardware, where traditional compilers fail to deliver performance.
  • methods: The paper develops LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order and uses the ultra-fast lightweight code generator LoopNest for hardware-specific optimizations.
  • results: LoopTune speeds up LoopNest by 3.2x, generating code an order of magnitude faster than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library Numpy. Moreover, LoopTune tunes code in seconds.
    Abstract Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library Numpy. Moreover, LoopTune tunes code in order of seconds.

Computation and Communication Efficient Federated Learning over Wireless Networks

  • paper_url: http://arxiv.org/abs/2309.01816
  • repo_url: None
  • paper_authors: Xiaonan Liu, Tharmalingam Ratnarajah
  • for: Improve the accuracy and efficiency of federated learning (FL) model training while preserving data privacy.
  • methods: Proposes an FL framework with partial model pruning and personalization that splits the learning model into a global part and a personalized part, adapting to each device's non-independent and identically distributed (non-IID) data.
  • results: Mathematical analysis and an optimization formulation reduce the computation and communication overhead of the FL framework while improving learning accuracy and speed. Experimental results show that the proposed framework reduces computation and communication latency by approximately 50%.
    Abstract Federated learning (FL) allows model training from local data by edge devices while preserving data privacy. However, the learning accuracy decreases due to the heterogeneity of devices' data, and the computation and communication latency increase when updating large-scale learning models on devices with limited computational capability and wireless resources. To overcome these challenges, we consider a novel FL framework with partial model pruning and personalization. This framework splits the learning model into a global part with model pruning, shared with all devices to learn data representations, and a personalized part to be fine-tuned for a specific device. It adapts the model size during FL to reduce both computation and communication overhead, minimize the overall training time, and increase the learning accuracy for devices with non-independent and identically distributed (non-IID) data. The computation and communication latency and the convergence of the proposed FL framework are then mathematically analyzed. Based on the convergence analysis, an optimization problem is formulated to maximize the convergence rate under a latency threshold by jointly optimizing the pruning ratio and wireless resource allocation. By decoupling the optimization problem and deploying Karush-Kuhn-Tucker (KKT) conditions, we derive closed-form solutions for the pruning ratio and wireless resource allocation. Finally, experimental results demonstrate that the proposed FL framework achieves a reduction of approximately 50 percent in computation and communication latency compared with the scheme with model personalization only.
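
The split-model idea can be sketched with toy parameter vectors: a pruning mask restricts the shared global part, each client keeps a private personalized part, and the server averages only the masked global updates. All shapes, the mask rule, and the local objectives below are illustrative assumptions, not the paper's system model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_global, d_personal, n_clients = 16, 4, 3
mask = (rng.random(d_global) < 0.5).astype(float)   # pruning keeps ~half

w_global = rng.normal(size=d_global) * mask
w_personal = [rng.normal(size=d_personal) for _ in range(n_clients)]
targets = [rng.normal(size=d_personal) for _ in range(n_clients)]  # non-IID proxy

def local_update(wg, wp, target_p, steps=5, lr=0.1):
    """Toy local training on quadratic losses."""
    for _ in range(steps):
        wg = wg - lr * (wg - 1.0) * mask    # pruned coordinates stay zero
        wp = wp - lr * (wp - target_p)      # client-specific personalization
    return wg, wp

for _ in range(3):                          # federated rounds
    uploads = []
    for c in range(n_clients):
        wg_c, w_personal[c] = local_update(w_global, w_personal[c], targets[c])
        uploads.append(wg_c)                # only the pruned global part is sent
    w_global = np.mean(uploads, axis=0)     # FedAvg over the global part
print("nonzero global weights:", int((w_global != 0).sum()), "of", d_global)
```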

Asymmetric matrix sensing by gradient descent with small random initialization

  • paper_url: http://arxiv.org/abs/2309.01796
  • repo_url: None
  • paper_authors: Johan S. Wind
  • for: matrix sensing problem, reconstructing low-rank matrix from linear measurements
  • methods: factorized gradient descent, continuous differential equation (perturbed gradient flow)
  • results: quick convergence to true target matrix with bounded perturbation, novel proof of asymmetric matrix sensing
    Abstract We study matrix sensing, which is the problem of reconstructing a low-rank matrix from a few linear measurements. It can be formulated as an overparameterized regression problem, which can be solved by factorized gradient descent when starting from a small random initialization. Linear neural networks, and in particular matrix sensing by factorized gradient descent, serve as prototypical models of non-convex problems in modern machine learning, where complex phenomena can be disentangled and studied in detail. Much research has been devoted to studying special cases of asymmetric matrix sensing, such as asymmetric matrix factorization and symmetric positive semi-definite matrix sensing. Our key contribution is introducing a continuous differential equation that we call the $\textit{perturbed gradient flow}$. We prove that the perturbed gradient flow converges quickly to the true target matrix whenever the perturbation is sufficiently bounded. The dynamics of gradient descent for matrix sensing can be reduced to this formulation, yielding a novel proof of asymmetric matrix sensing with factorized gradient descent. Compared to directly analyzing the dynamics of gradient descent, the continuous formulation allows bounding key quantities by considering their derivatives, often simplifying the proofs. We believe the general proof technique may prove useful in other settings as well.
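
The setup analyzed here is easy to reproduce numerically: factorize the estimate as $UV^\top$, start both factors from a small random initialization, and run gradient descent on the least-squares measurement loss. The Gaussian sensing matrices, step size, and normalization below are illustrative choices, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r, m = 12, 10, 2, 400

M_star = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2))   # rank-r target
M_star /= np.linalg.norm(M_star, ord=2)                        # unit spectral norm
A = rng.normal(size=(m, n1, n2))                               # sensing matrices
y = np.einsum("mij,ij->m", A, M_star)                          # measurements

alpha, lr, steps = 1e-3, 0.4, 800
U = alpha * rng.normal(size=(n1, r))      # small random initialization
V = alpha * rng.normal(size=(n2, r))

for _ in range(steps):
    resid = np.einsum("mij,ij->m", A, U @ V.T) - y     # residuals, shape (m,)
    G = np.einsum("m,mij->ij", resid, A) / m           # gradient w.r.t. U V^T
    U, V = U - lr * G @ V, V - lr * G.T @ U

err = np.linalg.norm(U @ V.T - M_star) / np.linalg.norm(M_star)
print(f"relative reconstruction error: {err:.2e}")     # typically well below 1
```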

Composite federated learning with heterogeneous data

  • paper_url: http://arxiv.org/abs/2309.01795
  • repo_url: None
  • paper_authors: Jiaojiao Zhang, Jiang Hu, Mikael Johansson
  • for: Solve the composite Federated Learning (FL) problem.
  • methods: Proposes a novel algorithm that manages non-smooth regularization by decoupling the proximal operator from communication, and addresses client drift without any assumptions about data similarity. Each worker uses local updates to reduce the communication frequency with the server and transmits only a $d$-dimensional vector per communication round.
  • results: The algorithm provably converges linearly to a neighborhood of the optimal solution and outperforms state-of-the-art methods in numerical experiments.
    Abstract We propose a novel algorithm for solving the composite Federated Learning (FL) problem. This algorithm manages non-smooth regularization by strategically decoupling the proximal operator and communication, and addresses client drift without any assumptions about data similarity. Moreover, each worker uses local updates to reduce the communication frequency with the server and transmits only a $d$-dimensional vector per communication round. We prove that our algorithm converges linearly to a neighborhood of the optimal solution and demonstrate the superiority of our algorithm over state-of-the-art methods in numerical experiments.
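
The decoupling idea can be illustrated on an l1-regularized toy problem: clients take plain gradient steps on their smooth local losses, and the proximal operator (soft-thresholding) is applied once at the server after averaging. This is a generic composite-FL pattern for illustration, not the paper's exact algorithm.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

rng = np.random.default_rng(0)
dim, n_clients, lam, lr = 8, 4, 0.05, 0.2
b = [rng.normal(size=dim) for _ in range(n_clients)]   # heterogeneous client data

x = np.zeros(dim)
for _ in range(50):                    # communication rounds
    locals_ = []
    for bc in b:                       # each client: a few smooth-only steps
        z = x.copy()
        for _ in range(3):
            z -= lr * (z - bc)         # gradient of 0.5 * ||z - bc||^2
        locals_.append(z)
    # The non-smooth part is handled once, at the server, after averaging.
    x = soft_threshold(np.mean(locals_, axis=0), lr * lam)
print(np.round(x, 3))
```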

Hierarchical Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction

  • paper_url: http://arxiv.org/abs/2309.01788
  • repo_url: https://github.com/gmh14/geo-deg
  • paper_authors: Minghao Guo, Veronika Thost, Samuel W Song, Adithya Balachandran, Payel Das, Jie Chen, Wojciech Matusik
  • for: This work proposes a data-efficient molecular property prediction method for material and drug discovery.
  • methods: The method uses a learnable hierarchical molecular grammar that generates molecules from grammar production rules. The grammar induces an explicit geometry over the space of molecular structures, providing an informative prior on molecular structural similarity; property prediction is performed using graph neural diffusion over the grammar-induced geometry.
  • results: Evaluations on both small and large datasets show that the approach outperforms a wide spectrum of baselines, including supervised and pretrained graph neural networks, and performs well even with extremely limited data. A detailed ablation study and further analysis demonstrate the effectiveness of the solution.
    Abstract The prediction of molecular properties is a crucial task in the field of material and drug discovery. The potential benefits of using deep learning techniques are reflected in the wealth of recent literature. Still, these techniques are faced with a common challenge in practice: Labeled data are limited by the cost of manual extraction from literature and laborious experimentation. In this work, we propose a data-efficient property predictor by utilizing a learnable hierarchical molecular grammar that can generate molecules from grammar production rules. Such a grammar induces an explicit geometry of the space of molecular graphs, which provides an informative prior on molecular structural similarity. The property prediction is performed using graph neural diffusion over the grammar-induced geometry. On both small and large datasets, our evaluation shows that this approach outperforms a wide spectrum of baselines, including supervised and pre-trained graph neural networks. We include a detailed ablation study and further analysis of our solution, showing its effectiveness in cases with extremely limited data. Code is available at https://github.com/gmh14/Geo-DEG.

ATMS: Algorithmic Trading-Guided Market Simulation

  • paper_url: http://arxiv.org/abs/2309.01784
  • repo_url: None
  • paper_authors: Song Wei, Andrea Coletta, Svitlana Vyetrenko, Tucker Balch
  • for: The paper aims to propose a metric to quantify market discrepancy and develop an Algorithmic Trading-guided Market Simulation (ATMS) to improve the realism of market simulations.
  • methods: The proposed metric measures the discrepancy as a causal effect arising from underlying market-unique characteristics and is evaluated through the interaction between the AT agent and the market. ATMS formulates the simulator as a stochastic policy in reinforcement learning (RL) to account for the sequential nature of trading, and utilizes the policy gradient update to bypass differentiating the proposed metric.
  • results: The proposed metric and ATMS are demonstrated to be effective through extensive experiments on semi-real market data, showing improved similarity to reality compared to the state-of-the-art conditional Wasserstein Generative Adversarial Network (cWGAN) approach, and producing market data with more balanced BUY and SELL volumes.
    Abstract The effective construction of an Algorithmic Trading (AT) strategy often relies on market simulators, yet building realistic simulators remains challenging due to existing methods' inability to adapt to the sequential and dynamic nature of trading activities. This work fills this gap by proposing a metric to quantify market discrepancy. This metric measures the discrepancy as a causal effect arising from underlying market-unique characteristics and is evaluated through the interaction between the AT agent and the market. Most importantly, we introduce Algorithmic Trading-guided Market Simulation (ATMS) by optimizing our proposed metric. Inspired by SeqGAN, ATMS formulates the simulator as a stochastic policy in reinforcement learning (RL) to account for the sequential nature of trading. Moreover, ATMS utilizes the policy gradient update to bypass differentiating the proposed metric, which involves non-differentiable operations such as order deletion from the market. Through extensive experiments on semi-real market data, we demonstrate the effectiveness of our metric and show that ATMS generates market data with improved similarity to reality compared to the state-of-the-art conditional Wasserstein Generative Adversarial Network (cWGAN) approach. Furthermore, ATMS produces market data with more balanced BUY and SELL volumes, mitigating the bias of the cWGAN baseline approach, where a simple strategy can exploit the BUY/SELL imbalance for profit.
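
    The key trick in the abstract — treating the simulator as a stochastic policy so that policy gradients sidestep non-differentiable operations such as order deletion — can be sketched with a plain REINFORCE loop. Everything below (the tiny policy network, the placeholder discrepancy metric) is a hypothetical stand-in, not the ATMS architecture.

```python
import torch
import torch.nn as nn

class SimulatorPolicy(nn.Module):
    """Toy order-generating policy: maps market context to a distribution
    over discrete order types."""
    def __init__(self, state_dim=8, n_order_types=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                                 nn.Linear(32, n_order_types))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def market_discrepancy(orders):
    # Placeholder metric: lower means "more realistic"; crucially it may be
    # non-differentiable, since REINFORCE never backpropagates through it.
    return (orders.float().mean() - 1.5).abs()

policy = SimulatorPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(200):
    state = torch.randn(16, 8)             # market context features
    dist = policy(state)
    orders = dist.sample()                 # simulator emits discrete orders
    reward = -market_discrepancy(orders)
    loss = -(dist.log_prob(orders).mean() * reward.detach())  # REINFORCE
    opt.zero_grad(); loss.backward(); opt.step()
print("final discrepancy:", market_discrepancy(orders).item())
```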

Survival Prediction from Imbalance colorectal cancer dataset using hybrid sampling methods and tree-based classifiers

  • paper_url: http://arxiv.org/abs/2309.01783
  • repo_url: None
  • paper_authors: Sadegh Soleimani, Mahsa Bahrami, Mansour Vali
  • for: predict the 1-, 3-, and 5-year survival of colorectal cancer patients
  • methods: optimized pre-processing and standard balancing techniques, including Synthetic Minority Over-sampling Technique (SMOTE) and pipelines of SMOTE and Repeated Edited Nearest Neighbor (RENN), combined with tree-based classifiers such as Decision Trees, Random Forest, Extra Tree, eXtreme Gradient Boosting, and Light Gradient Boosting (LGBM)
  • results: under 5-fold cross-validation, the proposed method with LGBM reaches a sensitivity of 72.30% on the highly imbalanced 1-year survival task, and the combination of RENN and LGBM reaches 80.81% sensitivity on the 3-year task, indicating the proposed method works best on highly imbalanced datasets
    Abstract Background and Objective: Colorectal cancer is a high mortality cancer. Clinical data analysis plays a crucial role in predicting the survival of colorectal cancer patients, enabling clinicians to make informed treatment decisions. However, utilizing clinical data can be challenging, especially when dealing with imbalanced outcomes. This paper focuses on developing algorithms to predict 1-, 3-, and 5-year survival of colorectal cancer patients using clinical datasets, with particular emphasis on the highly imbalanced 1-year survival prediction task. To address this issue, we propose a method that creates a pipeline of standard balancing techniques to increase the true positive rate. Evaluation is conducted on a colorectal cancer dataset from the SEER database. Methods: The pre-processing step consists of removing records with missing values and merging categories. The minority class of the 1-year and 3-year survival tasks consists of 10% and 20% of the data, respectively. Edited Nearest Neighbor, Repeated Edited Nearest Neighbor (RENN), Synthetic Minority Over-sampling Technique (SMOTE), and pipelines of SMOTE and RENN approaches were used and compared for balancing the data with tree-based classifiers. Decision Trees, Random Forest, Extra Tree, eXtreme Gradient Boosting, and Light Gradient Boosting (LGBM) are used in this article. Results: The performance evaluation utilizes a 5-fold cross-validation approach. In the case of highly imbalanced datasets (1-year), our proposed method with LGBM outperforms other sampling methods with a sensitivity of 72.30%. For the imbalanced 3-year survival task, the combination of RENN and LGBM achieves a sensitivity of 80.81%, indicating that our proposed method works best for highly imbalanced datasets. Conclusions: Our proposed method significantly improves mortality prediction for the minority class of colorectal cancer patients.
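
    A minimal sketch of the kind of pipeline the paper describes — SMOTE followed by RENN feeding a LightGBM classifier, scored by sensitivity under 5-fold cross-validation — on synthetic data with a roughly 10% minority class; the real study uses SEER clinical features and tuned preprocessing.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the SEER data: ~10% minority, as in the 1-year task.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

# SMOTE -> RENN -> LGBM; imblearn's Pipeline resamples only the training folds.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("renn", RepeatedEditedNearestNeighbours()),
    ("clf", LGBMClassifier(random_state=0)),
])

# Sensitivity = recall on the minority (positive) class, 5-fold CV.
scores = cross_val_score(pipe, X, y, cv=5, scoring="recall")
print("mean sensitivity:", scores.mean())
```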

Self-concordant Smoothing for Convex Composite Optimization

  • paper_url: http://arxiv.org/abs/2309.01781
  • repo_url: https://github.com/adeyemiadeoye/SelfConcordantSmoothOptimization.jl
  • paper_authors: Adeyemi D. Adeoye, Alberto Bemporad
  • for: propose a self-concordant smoothing method for minimizing the sum of two convex functions, where the first is smooth and the second may be nonsmooth
  • methods: partial smoothing, in which only a part of the nonsmooth function is smoothed; a natural property of the resulting problem structure yields a variable-metric selection method and a step-length selection rule particularly suitable for proximal Newton-type algorithms, with efficient handling of structures such as $\ell_1$-regularization and group-lasso penalties
  • results: local quadratic convergence rates are proven for two algorithms, Prox-N-SCORE (a proximal Newton algorithm) and Prox-GGN-SCORE (a proximal generalized Gauss-Newton algorithm); Prox-GGN-SCORE includes an approximation procedure that significantly reduces the computational overhead associated with the inverse Hessian, which is especially useful for overparameterized machine learning models and in mini-batch settings; numerical examples show the efficiency of the approach and its superiority over existing methods
    Abstract We introduce the notion of self-concordant smoothing for minimizing the sum of two convex functions: the first is smooth and the second may be nonsmooth. Our framework results naturally from the smoothing approximation technique referred to as partial smoothing in which only a part of the nonsmooth function is smoothed. The key highlight of our approach lies in a natural property of the resulting problem's structure which provides us with a variable-metric selection method and a step-length selection rule particularly suitable for proximal Newton-type algorithms. In addition, we efficiently handle specific structures promoted by the nonsmooth function, such as $\ell_1$-regularization and group-lasso penalties. We prove local quadratic convergence rates for two resulting algorithms: Prox-N-SCORE, a proximal Newton algorithm and Prox-GGN-SCORE, a proximal generalized Gauss-Newton (GGN) algorithm. The Prox-GGN-SCORE algorithm highlights an important approximation procedure which helps to significantly reduce most of the computational overhead associated with the inverse Hessian. This approximation is essentially useful for overparameterized machine learning models and in the mini-batch settings. Numerical examples on both synthetic and real datasets demonstrate the efficiency of our approach and its superiority over existing approaches.
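
    For intuition about the proximal machinery, the sketch below runs plain proximal gradient descent on an $\ell_1$-regularized least-squares problem, using the closed-form soft-thresholding prox. Prox-N-SCORE would replace the fixed gradient step with a Newton-type direction and the paper's self-concordance-based step-length rule; this is only the generic baseline scheme.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (closed form for the l1 penalty)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_lasso(A, b, lam, step, iters=500):
    """Proximal gradient on 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                      # gradient of smooth part
        x = soft_threshold(x - step * grad, step * lam)  # prox of nonsmooth part
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(100)
step = 1.0 / np.linalg.norm(A, 2) ** 2                # 1/L for this quadratic
print(prox_gradient_lasso(A, b, lam=0.1, step=step)[:5])
```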

Measuring, Interpreting, and Improving Fairness of Algorithms using Causal Inference and Randomized Experiments

  • paper_url: http://arxiv.org/abs/2309.01780
  • repo_url: None
  • paper_authors: James Enouen, Tianshu Sun, Yan Liu
  • for: address the problem of algorithm fairness in real-world AI production systems, with a focus on a practical, easy-to-implement measurement framework and a systematic approach to correcting detected sources of bias
  • methods: recent advances in causal inference and interpretable machine learning, combined in an algorithm-agnostic framework, MIIF (Measure, Interpret, and Improve the Fairness of an algorithmic decision); randomized experiments measure algorithm bias, enabling simultaneous measurement of disparate treatment, disparate impact, and economic value, while an explainable machine learning model interprets and distills the beliefs of a blackbox algorithm
  • results: MIIF is shown to be effective in measuring algorithm bias and improving fairness in practical applications like e-commerce and targeted advertising, where industry A/B testing is already abundant; the framework is simple and powerful, suggesting it can be used to study algorithm fairness in a wide range of applications
    Abstract Algorithm fairness has become a central problem for the broad adoption of artificial intelligence. Although the past decade has witnessed an explosion of excellent work studying algorithm biases, achieving fairness in real-world AI production systems has remained a challenging task. Most existing works fail to excel in practical applications since either they have conflicting measurement techniques and/or heavy assumptions, or require code access to the production models, whereas real systems demand an easy-to-implement measurement framework and a systematic way to correct the detected sources of bias. In this paper, we leverage recent advances in causal inference and interpretable machine learning to present an algorithm-agnostic framework (MIIF) to Measure, Interpret, and Improve the Fairness of an algorithmic decision. We measure the algorithm bias using randomized experiments, which enables the simultaneous measurement of disparate treatment, disparate impact, and economic value. Furthermore, using modern interpretability techniques, we develop an explainable machine learning model which accurately interprets and distills the beliefs of a blackbox algorithm. Altogether, these techniques create a simple and powerful toolset for studying algorithm fairness, especially for understanding the cost of fairness in practical applications like e-commerce and targeted advertising, where industry A/B testing is already abundant.
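
    A toy illustration of measuring per-group treatment effects from a randomized experiment, the kind of measurement MIIF builds on; the column names and the synthetic effect sizes below are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy A/B log: 'treated' = exposed to the algorithmic decision,
# 'group' = protected attribute, 'outcome' = realized benefit.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], n),
    "treated": rng.integers(0, 2, n),
})
base = np.where(df["group"] == "A", 0.10, 0.08)
lift = np.where(df["group"] == "A", 0.05, 0.01)   # unequal benefit by group
df["outcome"] = rng.random(n) < base + lift * df["treated"]

# Per-group average treatment effect, identified by the randomization:
ate = (df.groupby(["group", "treated"])["outcome"].mean()
         .unstack("treated"))
ate["effect"] = ate[1] - ate[0]
print(ate)   # a gap in 'effect' across groups signals disparate impact
```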

DRAG: Divergence-based Adaptive Aggregation in Federated learning on Non-IID Data

  • paper_url: http://arxiv.org/abs/2309.01779
  • repo_url: None
  • paper_authors: Feng Zhu, Jingjing Zhang, Shengyun Liu, Xin Wang
  • for: improve communication efficiency in Federated Learning (FL) and address the "client-drift" phenomenon caused by heterogeneous training data distributions
  • methods: a new metric, the "degree of divergence", quantifying the angle between each client's local update and the global reference direction; leveraging this metric, the received local updates are dynamically "dragged" toward the reference direction in each round, without extra communication overhead
  • results: a rigorous convergence analysis proves a sublinear convergence rate, and experiments show DRAG outperforms state-of-the-art algorithms in managing the client-drift phenomenon while exhibiting remarkable resilience against certain Byzantine attacks
    Abstract Local stochastic gradient descent (SGD) is a fundamental approach in achieving communication efficiency in Federated Learning (FL) by allowing individual workers to perform local updates. However, the presence of heterogeneous data distributions across working nodes causes each worker to update its local model towards a local optimum, leading to the phenomenon known as ``client-drift" and resulting in slowed convergence. To address this issue, previous works have explored methods that either introduce communication overhead or suffer from unsteady performance. In this work, we introduce a novel metric called ``degree of divergence," quantifying the angle between the local gradient and the global reference direction. Leveraging this metric, we propose the divergence-based adaptive aggregation (DRAG) algorithm, which dynamically ``drags" the received local updates toward the reference direction in each round without requiring extra communication overhead. Furthermore, we establish a rigorous convergence analysis for DRAG, proving its ability to achieve a sublinear convergence rate. Compelling experimental results are presented to illustrate DRAG's superior performance compared to state-of-the-art algorithms in effectively managing the client-drift phenomenon. Additionally, DRAG exhibits remarkable resilience against certain Byzantine attacks. By securely sharing a small sample of the client's data with the FL server, DRAG effectively counters these attacks, as demonstrated through comprehensive experiments.
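
    One plausible reading of the divergence-and-drag idea in miniature: compute the angle between each local update and a reference direction, and shrink the component orthogonal to the reference when the angle is too large, before averaging. The threshold and the specific shrinkage rule below are illustrative guesses, not the paper's exact per-round update.

```python
import numpy as np

def drag_aggregate(local_updates, reference, max_angle_deg=45.0, shrink=0.5):
    """Drag local updates whose 'degree of divergence' (angle to the global
    reference direction) is too large, then average them."""
    ref_unit = reference / (np.linalg.norm(reference) + 1e-12)
    dragged = []
    for g in local_updates:
        cos = g @ ref_unit / (np.linalg.norm(g) + 1e-12)
        angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
        if angle > max_angle_deg:
            parallel = (g @ ref_unit) * ref_unit
            g = parallel + shrink * (g - parallel)  # damp the divergent part
        dragged.append(g)
    return np.mean(dragged, axis=0)

updates = [np.array([1.0, 0.2]), np.array([0.1, 1.0])]  # heterogeneous clients
reference = np.array([1.0, 0.0])  # e.g., momentum of past global updates
print(drag_aggregate(updates, reference))
```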

CONFIDERAI: a novel CONFormal Interpretable-by-Design score function for Explainable and Reliable Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2309.01778
  • repo_url: None
  • paper_authors: Alberto Carlevaro, Sara Narteni, Fabrizio Dabbene, Marco Muselli, Maurizio Mongelli
  • for: link conformal prediction with explainable machine learning, with the goal of creating more reliable and trustworthy artificial intelligence systems
  • methods: a new score function, CONFIDERAI, that leverages both the predictive ability of rules and their geometric position within rule boundaries; regions of the feature space where conformal guarantees are satisfied are defined using techniques that control the number of non-conformal samples in conformal regions based on support vector data description (SVDD)
  • results: promising results on benchmark and real datasets, such as DNS tunneling detection and cardiovascular disease prediction
    Abstract Everyday life is increasingly influenced by artificial intelligence, and there is no question that machine learning algorithms must be designed to be reliable and trustworthy for everyone. Specifically, computer scientists consider an artificial intelligence system safe and trustworthy if it fulfills five pillars: explainability, robustness, transparency, fairness, and privacy. In addition to these five, we propose a sixth fundamental aspect: conformity, that is, the probabilistic assurance that the system will behave as the machine learner expects. In this paper, we propose a methodology to link conformal prediction with explainable machine learning by defining CONFIDERAI, a new score function for rule-based models that leverages both rules predictive ability and points geometrical position within rules boundaries. We also address the problem of defining regions in the feature space where conformal guarantees are satisfied by exploiting techniques to control the number of non-conformal samples in conformal regions based on support vector data description (SVDD). The overall methodology is tested with promising results on benchmark and real datasets, such as DNS tunneling detection or cardiovascular disease prediction.
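
    The conformal side of the method rests on the standard split-conformal calibration step, sketched below with a generic nonconformity score standing in for the CONFIDERAI score (which combines rule predictive ability with a point's geometric position inside rule boundaries).

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: with calibration scores s_1..s_n, the
    ceil((n+1)(1-alpha))/n empirical quantile yields marginal coverage."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=500)        # calibration-set scores
tau = conformal_threshold(cal_scores, alpha=0.1)

test_scores = rng.exponential(size=1000)      # same distribution at test time
print("coverage:", np.mean(test_scores <= tau))   # ~>= 90% by construction
```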

Gated recurrent neural networks discover attention

  • paper_url: http://arxiv.org/abs/2309.01775
  • repo_url: None
  • paper_authors: Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento
  • for: investigate how design elements of modern RNNs allow them to implement self-attention through linear recurrent layers and feedforward paths
  • methods: modern RNN design elements, namely linear recurrent layers interconnected by feedforward paths with multiplicative gating, together with reverse-engineering of trained RNNs
  • results: RNNs trained by gradient descent on simple in-context learning tasks, on which Transformers are known to excel, match Transformer performance and implement the same attention-based in-context learning algorithm
    Abstract Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
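
    The core identity behind the paper — that a linear recurrence with a multiplicative read-out computes causal (unnormalized) linear self-attention — is easy to check numerically; a minimal sketch:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear self-attention: out_t = (sum_{s<=t} v_s k_s^T) q_t."""
    T, d = Q.shape
    S = np.zeros((V.shape[1], d))          # running key-value statistic
    out = np.zeros((T, V.shape[1]))
    for t in range(T):
        S = S + np.outer(V[t], K[t])       # additive linear-recurrent update
        out[t] = S @ Q[t]                  # multiplicative read-out
    return out

rng = np.random.default_rng(0)
T, d = 5, 3
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Reference: explicit (unnormalized) attention with a causal mask.
ref = np.tril(Q @ K.T) @ V
assert np.allclose(linear_attention(Q, K, V), ref)
print("recurrence matches masked attention")
```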

ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation

  • paper_url: http://arxiv.org/abs/2309.01771
  • repo_url: None
  • paper_authors: Nastaran Darabi, Maeesha Binte Hashem, Hongyi Pan, Ahmet Cetin, Wilfred Gomes, Amit Ranjan Trivedi
  • for: propose an energy-efficient approach to accelerating deep neural networks (DNNs) with frequency-domain computation, reducing power consumption and latency
  • methods: frequency-domain transforms such as the Walsh-Hadamard transform (WHT), accelerated through analog-domain frequency-based tensor transformations
  • results: on a 16×16 crossbar array processing 8-bit inputs, the proposed approach achieves an energy efficiency of 1602 tera operations per second per Watt (TOPS/W) at VDD = 0.8 V, rising to 5311 TOPS/W with an early-termination strategy
    Abstract The edge processing of deep neural networks (DNNs) is becoming increasingly important due to its ability to extract valuable information directly at the data source to minimize latency and energy consumption. Frequency-domain model compression, such as with the Walsh-Hadamard transform (WHT), has been identified as an efficient alternative. However, the benefits of frequency-domain processing are often offset by the increased multiply-accumulate (MAC) operations required. This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations. Our approach offers unique opportunities to enhance computational efficiency, resulting in several high-level advantages, including array micro-architecture with parallelism, ADC/DAC-free analog computations, and increased output sparsity. Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix. Moreover, our novel array micro-architecture enables adaptive stitching of cells column-wise and row-wise, thereby facilitating perfect parallelism in computations. Additionally, our scheme enables ADC/DAC-free computations by training against highly quantized matrix-vector products, leveraging the parameter-free nature of matrix multiplications. Another crucial aspect of our design is its ability to handle signed-bit processing for frequency-based transformations. This leads to increased output sparsity and reduced digitization workload. On a 16$\times$16 crossbars, for 8-bit input processing, the proposed approach achieves the energy efficiency of 1602 tera operations per second per Watt (TOPS/W) without early termination strategy and 5311 TOPS/W with early termination strategy at VDD = 0.8 V.
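
    The Walsh-Hadamard transform at the heart of the design uses only additions and subtractions, which is what makes multiplier-free analog implementations attractive. A minimal fast-WHT sketch (the hardware mapping, quantization, and early termination are of course not captured here):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of 2.
    Only additions and subtractions are required."""
    x = np.asarray(x, dtype=float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sum branch
            x[i + h:i + 2 * h] = a - b  # butterfly: difference branch
        h *= 2
    return x

v = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(fwht(v))
# Self-inverse up to scaling: applying it twice and dividing by n recovers v.
assert np.allclose(fwht(fwht(v)) / len(v), v)
```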

On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation

  • paper_url: http://arxiv.org/abs/2309.01753
  • repo_url: None
  • paper_authors: Jeongyeol Kwon, Dohyun Kwon, Steve Wright, Robert Nowak
  • for: study first-order penalty-based algorithms for Bilevel Optimization (BO) where the objective functions at both levels are smooth but possibly nonconvex, and the variables are restricted to closed convex sets
  • methods: penalty methods that combine the upper- and lower-level objectives in a weighted sum with penalty parameter $\sigma > 0$; the analysis explicitly characterizes the conditions under which the values and derivatives of the penalty function and the hyper-objective are $O(\sigma)$-close, yielding an explicit formula for the gradient of the hyper-objective even when the lower-level problem has multiple solutions
  • results: a first-order algorithm finds an $\epsilon$-stationary point using $O(\epsilon^{-3})$ and $O(\epsilon^{-7})$ accesses to deterministic and noisy first-order gradient oracles, respectively; under an additional assumption on the stochastic oracles, the algorithm runs in a fully single-loop manner with $O(1)$ samples per iteration and achieves improved oracle complexities of $O(\epsilon^{-3})$ and $O(\epsilon^{-5})$
    Abstract In this work, we study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter $\sigma > 0$. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be $O(\sigma)$-close. A by-product of our analysis is the explicit formula for the gradient of hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as $O(\sigma)$-approximation of the original BO, we propose first-order algorithms that find an $\epsilon$-stationary solution by optimizing the penalty formulation with $\sigma = O(\epsilon)$. When the perturbed lower-level problem uniformly satisfies the small-error proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an $\epsilon$-stationary point of the penalty function, using in total $O(\epsilon^{-3})$ and $O(\epsilon^{-7})$ accesses to first-order (stochastic) gradient oracles when the oracle is deterministic and oracles are noisy, respectively. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully {\it single-loop} manner, i.e., with $O(1)$ samples per iteration, and achieves the improved oracle-complexity of $O(\epsilon^{-3})$ and $O(\epsilon^{-5})$, respectively.
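
    The penalty idea can be seen on a toy quadratic bilevel problem: combine the two levels in a weighted sum and run gradient descent, with larger $\sigma$ enforcing the lower level more tightly. The weighting, step sizes, and problem below are illustrative only; the paper's algorithms and step rules are more refined.

```python
import numpy as np

# Toy bilevel problem:
#   upper:  f(x, y) = (y - 1)^2 + 0.1 * x^2
#   lower:  y*(x) = argmin_y g(x, y),  g(x, y) = 0.5 * (y - x)^2
# Penalty reformulation: minimize f + sigma * (g - min_y g) jointly over
# (x, y); here min_y g(x, y) = 0, so the penalty term is just sigma * g.
def grad_f(x, y):
    return np.array([0.2 * x, 2.0 * (y - 1.0)])

def grad_g(x, y):
    return np.array([-(y - x), (y - x)])

def penalty_bilevel(sigma, iters=5000):
    lr = 1.0 / (2.0 * sigma + 3.0)        # crude 1/L step for this quadratic
    z = np.array([0.0, 0.0])              # z = (x, y)
    for _ in range(iters):
        z = z - lr * (grad_f(*z) + sigma * grad_g(*z))
    return z

for sigma in (1.0, 10.0, 100.0):          # larger sigma -> tighter lower level
    x, y = penalty_bilevel(sigma)
    print(f"sigma={sigma:6.1f}  x={x:.3f}  y={y:.3f}  |y - x|={abs(y - x):.4f}")
# The true bilevel solution is x = y = 1/1.1 ~ 0.909; the gap shrinks as
# sigma grows, illustrating the O(sigma)-approximation in the abstract.
```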

Turbulent Flow Simulation using Autoregressive Conditional Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.01745
  • repo_url: None
  • paper_authors: Georg Kohl, Li-Wei Chen, Nils Thuerey
  • for: address the rollout-stability problem of machine-learning-based PDE solvers
  • methods: a fully data-driven fluid solver using an autoregressive rollout based on conditional diffusion models
  • results: stable rollouts across a range of challenging fluid-flow scenarios, generalization to flow parameters beyond the training regime, and, thanks to the probabilistic nature of the diffusion approach, predictions that align with the statistics of the underlying physics
    Abstract Simulating turbulent flows is crucial for a wide range of applications, and machine learning-based solvers are gaining increasing relevance. However, achieving stability when generalizing to longer rollout horizons remains a persistent challenge for learned PDE solvers. We address this challenge by introducing a fully data-driven fluid solver that utilizes an autoregressive rollout based on conditional diffusion models. We show that this approach offers clear advantages in terms of rollout stability compared to other learned baselines. Remarkably, these improvements in stability are achieved without compromising the quality of generated samples, and our model successfully generalizes to flow parameters beyond the training regime. Additionally, the probabilistic nature of the diffusion approach allows for inferring predictions that align with the statistics of the underlying physics. We quantitatively and qualitatively evaluate the performance of our method on a range of challenging scenarios, including incompressible and transonic flows, as well as isotropic turbulence.
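
    Schematically, the method wraps a conditional diffusion model in an autoregressive loop: each new frame is sampled by ancestral DDPM denoising conditioned on the previous frame. The toy denoiser below is untrained and only demonstrates shapes and control flow, not the paper's architecture or noise schedule.

```python
import torch
import torch.nn as nn

D, T_DIFF = 64, 50                               # state dim, diffusion steps
betas = torch.linspace(1e-4, 0.02, T_DIFF)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Tiny epsilon-predictor; input = noisy next state + previous state + t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * D + 1, 128), nn.SiLU(),
                                 nn.Linear(128, D))
    def forward(self, x_t, cond, t):
        t_feat = t.float().view(-1, 1) / T_DIFF
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample_next_state(model, prev_state):
    """Ancestral DDPM sampling of the next frame, conditioned on prev_state."""
    x = torch.randn_like(prev_state)
    for t in reversed(range(T_DIFF)):
        eps = model(x, prev_state, torch.tensor([t]))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

model, state = Denoiser(), torch.randn(1, D)     # untrained: shapes only
rollout = [state]
for _ in range(3):                               # autoregressive rollout
    rollout.append(sample_next_state(model, rollout[-1]))
print(torch.stack(rollout).shape)                # (4, 1, D)
```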

Adaptive Resource Allocation for Virtualized Base Stations in O-RAN with Online Learning

  • paper_url: http://arxiv.org/abs/2309.01730
  • repo_url: None
  • paper_authors: Michail Kalntis, George Iosifidis, Fernando A. Kuipers
  • for: optimize the allocation of resources in virtualized base stations (vBSs) to balance effective throughput and energy consumption, even in challenging environments.
  • methods: online learning algorithm and meta-learning scheme to adapt to non-stationary or adversarial traffic demands and choose the best performing algorithm for different environments.
  • results: sub-linear regret and up to 64.5% power consumption savings compared to state-of-the-art benchmarks, evaluated with real-world data and trace-driven evaluations.
    Abstract Open Radio Access Network systems, with their virtualized base stations (vBSs), offer operators the benefits of increased flexibility, reduced costs, vendor diversity, and interoperability. Optimizing the allocation of resources in a vBS is challenging since it requires knowledge of the environment, (i.e., "external'' information), such as traffic demands and channel quality, which is difficult to acquire precisely over short intervals of a few seconds. To tackle this problem, we propose an online learning algorithm that balances the effective throughput and vBS energy consumption, even under unforeseeable and "challenging'' environments; for instance, non-stationary or adversarial traffic demands. We also develop a meta-learning scheme, which leverages the power of other algorithmic approaches, tailored for more "easy'' environments, and dynamically chooses the best performing one, thus enhancing the overall system's versatility and effectiveness. We prove the proposed solutions achieve sub-linear regret, providing zero average optimality gap even in challenging environments. The performance of the algorithms is evaluated with real-world data and various trace-driven evaluations, indicating savings of up to 64.5% in the power consumption of a vBS compared with state-of-the-art benchmarks.
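
    The meta-learning layer that "dynamically chooses the best performing" algorithm is in the spirit of classical prediction-with-experts schemes. As a stand-in, the sketch below runs Hedge (multiplicative weights) over two candidate controllers with synthetic losses; the paper's actual scheme and regret analysis differ.

```python
import numpy as np

def hedge_meta_learner(expert_losses, eta=0.5, seed=0):
    """Multiplicative-weights meta-learner over candidate vBS controllers:
    sample an expert by weight, then reweight all experts by observed loss.
    Guarantees sub-linear regret against the best fixed expert."""
    n_rounds, n_experts = expert_losses.shape
    w = np.ones(n_experts)
    total_loss, rng = 0.0, np.random.default_rng(seed)
    for t in range(n_rounds):
        p = w / w.sum()
        choice = rng.choice(n_experts, p=p)
        total_loss += expert_losses[t, choice]
        w *= np.exp(-eta * expert_losses[t])   # full-information update
    best_fixed = expert_losses.sum(axis=0).min()
    return total_loss, best_fixed

# Toy: expert 0 suits a stationary regime, expert 1 suffers under it.
rng = np.random.default_rng(1)
losses = np.stack([rng.uniform(0.0, 0.4, 1000),
                   rng.uniform(0.2, 1.0, 1000)], axis=1)
print(hedge_meta_learner(losses))   # meta-learner tracks the better expert
```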

Robust Online Classification: From Estimation to Denoising

  • paper_url: http://arxiv.org/abs/2309.01698
  • repo_url: None
  • paper_authors: Changlong Wu, Ananth Grama, Wojciech Szpankowski
  • for: study online classification in the presence of noisy labels, where a general kernel specifies, for each feature-label pair, a known set of distributions over noisy labels, and an adversary selects an unknown distribution from that set at each time step
  • methods: a novel reduction to online conditional distribution estimation, extended to infinite classes and stochastically generated features via the concept of stochastic sequential covering
  • results: for a wide range of natural noise kernels, adversarially selected features, and finite classes of labeling functions, the minimax risk can be bounded independent of the time horizon and logarithmic in the size of the labeling function class; the results extend to infinite classes and stochastically generated features
    Abstract We study online classification in the presence of noisy labels. The noise mechanism is modeled by a general kernel that specifies, for any feature-label pair, a (known) set of distributions over noisy labels. At each time step, an adversary selects an unknown distribution from the distribution set specified by the kernel based on the actual feature-label pair, and generates the noisy label from the selected distribution. The learner then makes a prediction based on the actual features and noisy labels observed thus far, and incurs loss $1$ if the prediction differs from the underlying truth (and $0$ otherwise). The prediction quality is quantified through minimax risk, which computes the cumulative loss over a finite horizon $T$. We show that for a wide range of natural noise kernels, adversarially selected features, and finite class of labeling functions, minimax risk can be upper bounded independent of the time horizon and logarithmic in the size of labeling function class. We then extend these results to inifinite classes and stochastically generated features via the concept of stochastic sequential covering. Our results extend and encompass findings of Ben-David et al. (2009) through substantial generality, and provide intuitive understanding through a novel reduction to online conditional distribution estimation.

Physics-Informed Polynomial Chaos Expansions

  • paper_url: http://arxiv.org/abs/2309.01697
  • repo_url: None
  • paper_authors: Lukáš Novák, Himanshu Sharma, Michael D. Shields
  • for: construct physics-informed polynomial chaos expansions (PCE) that combine the conventional experimental design with constraints from the known physics of the model
  • methods: physical constraints represented by a set of differential equations and specified boundary conditions, with a computationally efficient construction compared against standard sparse PCE
  • results: improved approximation accuracy without significant additional computational burden; physically constrained PCEs can even be constructed from differential equations and boundary conditions alone, without evaluating the original model, and support uncertainty quantification through analytical post-processing of a reduced PCE that filters out the influence of all deterministic space-time variables, as demonstrated on deterministic examples of increasing complexity
    Abstract Surrogate modeling of costly mathematical models representing physical systems is challenging since it is typically not possible to create a large experimental design. Thus, it is beneficial to constrain the approximation to adhere to the known physics of the model. This paper presents a novel methodology for the construction of physics-informed polynomial chaos expansions (PCE) that combines the conventional experimental design with additional constraints from the physics of the model. Physical constraints investigated in this paper are represented by a set of differential equations and specified boundary conditions. A computationally efficient means for construction of physically constrained PCE is proposed and compared to standard sparse PCE. It is shown that the proposed algorithms lead to superior accuracy of the approximation and does not add significant computational burden. Although the main purpose of the proposed method lies in combining data and physical constraints, we show that physically constrained PCEs can be constructed from differential equations and boundary conditions alone without requiring evaluations of the original model. We further show that the constrained PCEs can be easily applied for uncertainty quantification through analytical post-processing of a reduced PCE filtering out the influence of all deterministic space-time variables. Several deterministic examples of increasing complexity are provided and the proposed method is applied for uncertainty quantification.
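
    A toy version of building an expansion from the physics alone: fit polynomial coefficients by least squares on an ODE residual plus a boundary condition, with no evaluations of the underlying model. Monomials stand in for the orthonormal polynomial-chaos basis, and the boundary-condition weighting is an illustrative choice.

```python
import numpy as np

deg = 8
xs = np.linspace(0.0, 1.0, 50)                        # collocation points
V = np.vander(xs, deg + 1, increasing=True)           # basis values x^k
dV = np.zeros_like(V)                                 # basis derivatives
dV[:, 1:] = V[:, :-1] * np.arange(1, deg + 1)         # d/dx x^k = k x^{k-1}

# Physics constraint u'(x) + u(x) = 0  ->  rows (dV + V) c = 0
A_phys, b_phys = dV + V, np.zeros(len(xs))

# Boundary condition u(0) = 1, weighted heavily so it is (nearly) enforced.
A_bc = np.vander([0.0], deg + 1, increasing=True)
w = 100.0

A = np.vstack([A_phys, w * A_bc])
b = np.concatenate([b_phys, w * np.array([1.0])])
c, *_ = np.linalg.lstsq(A, b, rcond=None)

u_hat = V @ c
print(np.max(np.abs(u_hat - np.exp(-xs))))   # expansion recovers exp(-x)
```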

Blind Biological Sequence Denoising with Self-Supervised Set Learning

  • paper_url: http://arxiv.org/abs/2309.01670
  • repo_url: None
  • paper_authors: Nathan Ng, Ji Won Park, Jae Hyeon Lee, Ryan Lewis Kelly, Stephen Ra, Kyunghyun Cho
  • for: denoise the imprecise subreads produced by high-throughput sequencing platforms, to better realize the potential of sequencing data for downstream scientific applications
  • methods: Self-Supervised Set Learning (SSSL), which gathers subreads in an embedding space, estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces, and decodes this "average" embedding into a prediction of the clean sequence
  • results: on simulated long-read DNA data, SSSL denoises small reads of ≤6 subreads with 17% fewer errors and large reads of >6 subreads with 8% fewer errors than the best baseline; on a real antibody-sequence dataset it improves over baselines on two self-supervised metrics, with a significant gain on difficult small reads that comprise over 60% of the test set
    Abstract Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of $\leq 6$ subreads with 17% fewer errors and large reads of $>6$ subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.

Robust penalized least squares of depth trimmed residuals regression for high-dimensional data

  • paper_url: http://arxiv.org/abs/2309.01666
  • repo_url: None
  • paper_authors: Yijun Zuo
  • for: address two challenges of high-dimensional data analysis: (i) the dimension p is often larger than the sample size n, and (ii) outliers or contaminated points are frequently hidden and more difficult to detect
  • methods: a systematic robustness examination of leading penalized regression methods, and a new robust penalized regression based on the least sum of squares of depth-trimmed residuals
  • results: most existing penalized regression methods can break down under a single outlier (or a single adversarially contaminated point), whereas the newly proposed method outperforms leading competitors in estimation and prediction accuracy in the cases considered
    Abstract Challenges with data in the big-data era include (i) the dimension $p$ is often larger than the sample size $n$, and (ii) outliers or contaminated points are frequently hidden and more difficult to detect. Challenge (i) renders most conventional methods inapplicable and has therefore attracted tremendous attention from the statistics, computer science, and bio-medical communities; numerous penalized regression methods have been introduced as modern tools for analyzing high-dimensional data. Challenge (ii), by contrast, has received disproportionately little attention, even though penalized regression methods are often expected to handle it at the same time. As this article reveals, however, most of them can break down under a single outlier (or a single adversarially contaminated point). The article systematically examines leading penalized regression methods in the literature in terms of their robustness and provides a quantitative assessment of this breakdown behavior. Consequently, a novel robust penalized regression method based on the least sum of squares of depth trimmed residuals is proposed and studied carefully. Experiments with simulated and real data reveal that the newly proposed method can outperform some leading competitors in estimation and prediction accuracy in the cases considered.
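
    For intuition about trimmed-residual criteria, the sketch below implements classical least trimmed squares via concentration steps; it is a well-known stand-in for the paper's depth-trimmed-residuals criterion and omits the penalty term needed when $p > n$.

```python
import numpy as np

def lts_regression(X, y, keep_frac=0.75, iters=50, seed=0):
    """Least trimmed squares via 'C-steps': repeatedly refit OLS on the
    keep_frac fraction of points with the smallest squared residuals."""
    rng = np.random.default_rng(seed)
    n, h = len(y), int(keep_frac * len(y))
    idx = rng.choice(n, size=h, replace=False)       # random start
    for _ in range(iters):
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        resid2 = (y - X @ beta) ** 2
        new_idx = np.argsort(resid2)[:h]             # concentration step
        if set(new_idx) == set(idx):
            break
        idx = new_idx
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(200)
y[:10] += 50.0                                       # gross outliers
print("OLS:", np.linalg.lstsq(X, y, rcond=None)[0])  # dragged off by outliers
print("LTS:", lts_regression(X, y))                  # close to (1, 2)
```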

Locally Stationary Graph Processes

  • paper_url: http://arxiv.org/abs/2309.01657
  • repo_url: None
  • paper_authors: Abdullah Canbolat, Elif Vural
  • for: propose a locally stationary graph process (LSGP) model for data on irregular network topologies, where the characteristics of the process may vary locally across different regions of the graph
  • methods: the overall process is expressed as a combination of component processes, such that the extent to which the process adheres to each component varies smoothly over the graph; an algorithm computes LSGP models from realizations of the process, and the local approximation of LSGPs with wide-sense stationary (WSS) processes is also studied
  • results: on signal interpolation problems, the proposed process model provides accurate signal representations competitive with the state of the art
    Abstract Stationary graph process models are commonly used in the analysis and inference of data sets collected on irregular network topologies. While most of the existing methods represent graph signals with a single stationary process model that is globally valid on the entire graph, in many practical problems, the characteristics of the process may be subject to local variations in different regions of the graph. In this work, we propose a locally stationary graph process (LSGP) model that aims to extend the classical concept of local stationarity to irregular graph domains. We characterize local stationarity by expressing the overall process as the combination of a set of component processes such that the extent to which the process adheres to each component varies smoothly over the graph. We propose an algorithm for computing LSGP models from realizations of the process, and also study the approximation of LSGPs locally with WSS processes. Experiments on signal interpolation problems show that the proposed process model provides accurate signal representations competitive with the state of the art.

Representing Edge Flows on Graphs via Sparse Cell Complexes

  • paper_url: http://arxiv.org/abs/2309.01632
  • repo_url: None
  • paper_authors: Josef Hoppe, Michael T. Schaub
  • for: obtain sparse, interpretable representations of edge-flow data on graphs, a key need in many machine learning and signal processing tasks
  • methods: lifting the graph structure to a cell complex, whose Hodge Laplacian and incidence matrices induce a Hodge decomposition of the observed data into gradient, curl, and harmonic flows; the cell inference optimization problem (augmenting the observed graph with a set of cells so that the eigenvectors of the associated Hodge Laplacian yield a sparse, interpretable representation) is shown to be NP-hard, and an efficient approximation algorithm is proposed
  • results: on real-world and synthetic data, the algorithm outperforms current state-of-the-art methods while being computationally efficient
    Abstract Obtaining sparse, interpretable representations of observable data is crucial in many machine learning and signal processing tasks. For data representing flows along the edges of a graph, an intuitively interpretable way to obtain such representations is to lift the graph structure to a simplicial complex: The eigenvectors of the associated Hodge-Laplacian, respectively the incidence matrices of the corresponding simplicial complex then induce a Hodge decomposition, which can be used to represent the observed data in terms of gradient, curl, and harmonic flows. In this paper, we generalize this approach to cellular complexes and introduce the cell inference optimization problem, i.e., the problem of augmenting the observed graph by a set of cells, such that the eigenvectors of the associated Hodge Laplacian provide a sparse, interpretable representation of the observed edge flows on the graph. We show that this problem is NP-hard and introduce an efficient approximation algorithm for its solution. Experiments on real-world and synthetic data demonstrate that our algorithm outperforms current state-of-the-art methods while being computationally efficient.
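
    The Hodge decomposition the method builds on is plain linear algebra once the incidence matrices are fixed. A minimal example on a triangle with one 2-cell shows how adding a cell enlarges the curl space and shrinks the unexplained harmonic residual:

```python
import numpy as np

# Triangle on 3 nodes with oriented edges (0,1), (1,2), (0,2) and one 2-cell.
B1 = np.array([[-1,  0, -1],      # node-to-edge incidence (nodes x edges)
               [ 1, -1,  0],
               [ 0,  1,  1]], dtype=float)
B2 = np.array([[1], [1], [-1]], dtype=float)   # edge-to-cell incidence

f = np.array([2.0, 1.0, 2.5])    # observed edge flow

# Hodge decomposition: f = B1^T p (gradient) + B2 w (curl) + harmonic rest.
p, *_ = np.linalg.lstsq(B1.T, f, rcond=None)          # project onto im(B1^T)
w, *_ = np.linalg.lstsq(B2, f - B1.T @ p, rcond=None) # project onto im(B2)
grad, curl = B1.T @ p, B2 @ w
harmonic = f - grad - curl
print("gradient:", grad)
print("curl:    ", curl)
print("harmonic:", harmonic)   # ~0 here: the single cell explains the rest
```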

Dropout Attacks

  • paper_url: http://arxiv.org/abs/2309.01614
  • repo_url: https://github.com/ngunnar/Robustness_tutorial
  • paper_authors: Andrew Yuan, Alina Oprea, Cheng Tan
  • for: introduce poisoning attacks against the dropout operator, which is normally used during training to prevent overfitting by randomly dropping neurons
  • methods: a new family of attacks, DROPOUTATTACK, that manipulates which neurons are dropped instead of selecting them uniformly at random; four variants covering a broad range of scenarios are designed, implemented, and evaluated
  • results: when training a VGG-16 model on CIFAR-100, the attack reduces the precision of the victim class by 34.6% (from 81.7% to 47.1%) without incurring any degradation in overall model accuracy
    Abstract Dropout is a common operator in deep learning, aiming to prevent overfitting by randomly dropping neurons during training. This paper introduces a new family of poisoning attacks against neural networks named DROPOUTATTACK. DROPOUTATTACK attacks the dropout operator by manipulating the selection of neurons to drop instead of selecting them uniformly at random. We design, implement, and evaluate four DROPOUTATTACK variants that cover a broad range of scenarios. These attacks can slow or stop training, destroy prediction accuracy of target classes, and sabotage either precision or recall of a target class. In our experiments of training a VGG-16 model on CIFAR-100, our attack can reduce the precision of the victim class by 34.6% (from 81.7% to 47.1%) without incurring any degradation in model accuracy.
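
    A sketch of one manipulation such an adversary could perform: a drop-in replacement for nn.Dropout that deterministically silences the highest-magnitude activations instead of sampling the mask uniformly at random. This is an illustrative attack primitive, not one of the paper's four specific variants.

```python
import torch
import torch.nn as nn

class AdversarialDropout(nn.Module):
    """Like nn.Dropout, but the mask is chosen adversarially: the k = p*d
    largest-magnitude activations per sample are always dropped."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        k = int(self.p * x.shape[-1])
        _, top = x.abs().topk(k, dim=-1)       # strongest features per sample
        mask = torch.ones_like(x)
        mask.scatter_(-1, top, 0.0)            # silence them deterministically
        return x * mask / (1.0 - self.p)       # same rescaling as nn.Dropout

layer = AdversarialDropout(p=0.5).train()
x = torch.randn(2, 8)
print(layer(x))   # the most informative activations are zeroed every step
```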

Fair Ranking under Disparate Uncertainty

  • paper_url: http://arxiv.org/abs/2309.01610
  • repo_url: None
  • paper_authors: Richa Rastogi, Thorsten Joachims
  • for: improve the fairness of ranking systems when the uncertainty of the underlying relevance model differs between groups of options, so that all groups are ranked fairly even with unequal amounts of data
  • methods: Equal-Opportunity Ranking (EOR) is proposed as a fairness criterion that provably corrects for the disparity in uncertainty between groups, together with a practical O(n log(n)) algorithm for computing EOR rankings with a close approximation guarantee to the globally optimal solution
  • results: experiments on synthetic data, a US Census dataset, and a real-world case study of Amazon search queries show that the algorithm reliably guarantees EOR fairness while providing effective rankings
    Abstract Ranking is a ubiquitous method for focusing the attention of human evaluators on a manageable subset of options. Its use ranges from surfacing potentially relevant products on an e-commerce site to prioritizing college applications for human review. While ranking can make human evaluation far more effective by focusing attention on the most promising options, we argue that it can introduce unfairness if the uncertainty of the underlying relevance model differs between groups of options. Unfortunately, such disparity in uncertainty appears widespread, since the relevance estimates for minority groups tend to have higher uncertainty due to a lack of data or appropriate features. To overcome this fairness issue, we propose Equal-Opportunity Ranking (EOR) as a new fairness criterion for ranking that provably corrects for the disparity in uncertainty between groups. Furthermore, we present a practical algorithm for computing EOR rankings in time $O(n \log(n))$ and prove its close approximation guarantee to the globally optimal solution. In a comprehensive empirical evaluation on synthetic data, a US Census dataset, and a real-world case study of Amazon search queries, we find that the algorithm reliably guarantees EOR fairness while providing effective rankings.

Drifter: Efficient Online Feature Monitoring for Improved Data Integrity in Large-Scale Recommendation Systems

  • paper_url: http://arxiv.org/abs/2309.08617
  • repo_url: None
  • paper_authors: Blaž Škrlj, Nir Ki-Tov, Lee Edelist, Natalia Silberstein, Hila Weisman-Zohar, Blaž Mramor, Davorin Kopič, Naama Ziporin
  • for: maintain data quality in the large-scale, dynamic streams that feed recommendation systems
  • methods: online feature monitoring and verification combining state-of-the-art online feature ranking for sparse data with anomaly detection ideas, enabling agile, responsive, and adaptable data quality monitoring, real-time root cause analysis, and drift detection, while requiring only two threads and less than a gigabyte of RAM per production deployment handling millions of instances per minute
  • results: evaluations on real-world datasets demonstrate Drifter's effectiveness in alerting on and mitigating data quality issues, substantially improving the reliability and performance of real-time live recommender systems
    Abstract Real-world production systems often grapple with maintaining data quality in large-scale, dynamic streams. We introduce Drifter, an efficient and lightweight system for online feature monitoring and verification in recommendation use cases. Drifter addresses limitations of existing methods by delivering agile, responsive, and adaptable data quality monitoring, enabling real-time root cause analysis, drift detection and insights into problematic production events. Integrating state-of-the-art online feature ranking for sparse data and anomaly detection ideas, Drifter is highly scalable and resource-efficient, requiring only two threads and less than a gigabyte of RAM per production deployments that handle millions of instances per minute. Evaluation on real-world data sets demonstrates Drifter's effectiveness in alerting and mitigating data quality issues, substantially improving reliability and performance of real-time live recommender systems.

Active flow control for three-dimensional cylinders through deep reinforcement learning

  • paper_url: http://arxiv.org/abs/2309.02462
  • repo_url: None
  • paper_authors: Pol Suárez, Francisco Alcántara-Ávila, Arnau Miró, Jean Rabault, Bernat Font, Oriol Lehmkuhl, R. Vinuesa
  • for: reduce the drag coefficient of a three-dimensional cylinder through active flow control
  • methods: multiple independently controlled zero-net-mass-flux synthetic jets placed along the cylinder span, driven by a deep-reinforcement-learning framework that couples a computational-fluid-dynamics solver with an agent using the proximal-policy-optimization algorithm, implemented as a multi-agent reinforcement-learning framework
  • results: significant drag reduction in three different configurations of the problem
    Abstract This paper presents for the first time successful results of active flow control with multiple independently controlled zero-net-mass-flux synthetic jets. The jets are placed on a three-dimensional cylinder along its span with the aim of reducing the drag coefficient. The method is based on a deep-reinforcement-learning framework that couples a computational-fluid-dynamics solver with an agent using the proximal-policy-optimization algorithm. We implement a multi-agent reinforcement-learning framework which offers numerous advantages: it exploits local invariants, makes the control adaptable to different geometries, facilitates transfer learning and cross-application of agents and results in significant training speedup. In this contribution we report significant drag reduction after applying the DRL-based control in three different configurations of the problem.

Passing Heatmap Prediction Based on Transformer Model and Tracking Data

  • paper_url: http://arxiv.org/abs/2309.01526
  • repo_url: None
  • paper_authors: Yisheng Pei, Varuna De Silva, Mike Caine
  • for: a deep-learning network architecture that predicts the potential end location of passes and how players' movement before the pass affects the final outcome, enabling fairer evaluation of player performance
  • methods: a deep-learning network model analyzing more than 28,000 pass events, achieving more than 0.7 Top-1 accuracy
  • results: the predictions support a better understanding of pitch control and pass options, measuring players' off-ball movement contribution to defensive performance, and give football analysts a better tool and metric for how players' movement over time contributes to game strategy and the final result
    Abstract Although the data-driven analysis of football players' performance has been developed for years, most research focuses only on on-ball events such as shots and passes, while off-ball movement remains a little-explored area in this domain. Players' contributions to the whole match are therefore evaluated unfairly: those who have more chances to score goals earn more credit than others, while the indirect and less noticeable impact of continuous movement is ignored. This research presents a novel deep-learning network architecture that is capable of predicting the potential end location of passes and how players' movement before the pass affects the final outcome. After analysing more than 28,000 pass events, a robust prediction can be achieved with more than 0.7 Top-1 accuracy. Based on the prediction, a better understanding of pitch control and pass options can be reached to measure players' off-ball movement contribution to defensive performance. Moreover, this model provides football analysts a better tool and metric to understand how players' movement over time contributes to the game strategy and the final victory.

A Blackbox Model Is All You Need to Breach Privacy: Smart Grid Forecasting Models as a Use Case

  • paper_url: http://arxiv.org/abs/2309.01523
  • repo_url: None
  • paper_authors: Hussein Aly, Abdulaziz Al-Ali, Abdullah Al-Ali, Qutaibah Malluhi
  • for: investigate the privacy risks of forecasting models in smart grids, with particular emphasis on deep-learning-based forecasting
  • methods: analysis of deep-learning forecasting models, including Long Short Term Memory (LSTM) networks, to assess how much sensitive information they can leak
  • results: black-box access to an LSTM model can reveal an amount of information nearly equivalent to access to the data itself (with a difference as low as 1% in area under the ROC curve), highlighting the need to protect forecasting models at the same level as the data
    Abstract This paper investigates the potential privacy risks associated with forecasting models, with specific emphasis on their application in the context of smart grids. While machine learning and deep learning algorithms offer valuable utility, concerns arise regarding their exposure of sensitive information. Previous studies have focused on classification models, overlooking risks associated with forecasting models. Deep learning based forecasting models, such as Long Short Term Memory (LSTM), play a crucial role in several applications including optimizing smart grid systems but also introduce privacy risks. Our study analyzes the ability of forecasting models to leak global properties and privacy threats in smart grid systems. We demonstrate that a black box access to an LSTM model can reveal a significant amount of information equivalent to having access to the data itself (with the difference being as low as 1% in Area Under the ROC Curve). This highlights the importance of protecting forecasting models at the same level as the data.

Hawkeye: Change-targeted Testing for Android Apps based on Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.01519
  • repo_url: None
  • paper_authors: Chao Peng, Zhengwei Lv, Jiarong Fu, Jiayuan Liang, Zhao Zhang, Ajitha Rajan, Ping Yang
  • for: ensure the correctness of Android app updates through testing, so that potential bugs do not reach end users
  • methods: directed testing of app updates with Hawkeye, which prioritizes executing GUI actions associated with code changes based on deep reinforcement learning from historical exploration data
  • results: on 10 popular open-source apps and 1 commercial app, Hawkeye generates GUI event sequences targeting changed functions more reliably than FastBot2 and ARES for the open-source apps and the large commercial app, achieves comparable performance on smaller open-source apps with a more tractable exploration space, and proves well suited to smoke testing of merge requests in an industrial development pipeline
    Abstract Android Apps are frequently updated to keep up with changing user, hardware, and business demands. Ensuring the correctness of App updates through extensive testing is crucial to avoid potential bugs reaching the end user. Existing Android testing tools generate GUI events focussing on improving the test coverage of the entire App rather than prioritising updates and their impacted elements. Recent research has proposed change-focused testing but relies on random exploration to exercise the updates and impacted GUI elements, which is ineffective and slow for large complex Apps with a huge input exploration space. We propose directed testing of App updates with Hawkeye that is able to prioritise executing GUI actions associated with code changes based on deep reinforcement learning from historical exploration data. Our empirical evaluation compares Hawkeye with state-of-the-art model-based and reinforcement learning-based testing tools FastBot2 and ARES using 10 popular open-source and 1 commercial App. We find that Hawkeye is able to generate GUI event sequences targeting changed functions more reliably than FastBot2 and ARES for the open source Apps and the large commercial App. Hawkeye achieves comparable performance on smaller open source Apps with a more tractable exploration space. The industrial deployment of Hawkeye in the development pipeline also shows that Hawkeye is ideal to perform smoke testing for merge requests of a complicated commercial App.

Federated cINN Clustering for Accurate Clustered Federated Learning

  • paper_url: http://arxiv.org/abs/2309.01515
  • repo_url: None
  • paper_authors: Yuhao Zhou, Minjia Shi, Yuxin Tian, Yuanxi Li, Qing Ye, Jiancheng Lv
  • for: address the challenge of coordinating Federated Learning (FL) with crowd intelligence when client groups have disparate objectives due to data heterogeneity or distinct tasks
  • methods: the Federated cINN Clustering Algorithm (FCCA), which uses a global encoder to transform each client's private data into multivariate Gaussian distributions, a generative model that learns the encoded latent features through maximum likelihood estimation (easing optimization and avoiding mode collapse), and a central server that collects converged local models to approximate similarities between clients and partition them into distinct clusters
  • results: extensive experiments on various models and datasets demonstrate FCCA's superiority over other state-of-the-art clustered federated learning algorithms
    Abstract Federated Learning (FL) presents an innovative approach to privacy-preserving distributed machine learning and enables efficient crowd intelligence on a large scale. However, a significant challenge arises when coordinating FL with crowd intelligence which diverse client groups possess disparate objectives due to data heterogeneity or distinct tasks. To address this challenge, we propose the Federated cINN Clustering Algorithm (FCCA) to robustly cluster clients into different groups, avoiding mutual interference between clients with data heterogeneity, and thereby enhancing the performance of the global model. Specifically, FCCA utilizes a global encoder to transform each client's private data into multivariate Gaussian distributions. It then employs a generative model to learn encoded latent features through maximum likelihood estimation, which eases optimization and avoids mode collapse. Finally, the central server collects converged local models to approximate similarities between clients and thus partition them into distinct clusters. Extensive experimental results demonstrate FCCA's superiority over other state-of-the-art clustered federated learning algorithms, evaluated on various models and datasets. These results suggest that our approach has substantial potential to enhance the efficiency and accuracy of real-world federated learning tasks.

Layer-wise training for self-supervised learning on graphs

  • paper_url: http://arxiv.org/abs/2309.01503
  • repo_url: None
  • paper_authors: Oscar Pina, Verónica Vilaplana
  • for: end-to-end training of graph neural networks (GNNs) on large graphs presents memory and computational challenges and limits the application to shallow architectures, as depth exponentially increases the memory and space complexities
  • methods: Layer-wise Regularized Graph Infomax, an algorithm that trains GNNs layer by layer in a self-supervised manner, decoupling the feature propagation and feature transformation carried out by GNNs and deriving a loss function based on the prediction of future inputs
  • results: on inductive large graphs, performance similar to other end-to-end methods with substantially increased efficiency, enabling the training of more sophisticated models on a single device while also avoiding oversmoothing of the representations
    Abstract End-to-end training of graph neural networks (GNN) on large graphs presents several memory and computational challenges, and limits the application to shallow architectures as depth exponentially increases the memory and space complexities. In this manuscript, we propose Layer-wise Regularized Graph Infomax, an algorithm to train GNNs layer by layer in a self-supervised manner. We decouple the feature propagation and feature transformation carried out by GNNs to learn node representations in order to derive a loss function based on the prediction of future inputs. We evaluate the algorithm in inductive large graphs and show similar performance to other end to end methods and a substantially increased efficiency, which enables the training of more sophisticated models in one single device. We also show that our algorithm avoids the oversmoothing of the representations, another common challenge of deep GNNs.
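The layer-by-layer training loop the abstract describes can be sketched as below, assuming PyTorch-style `layers` and a placeholder per-layer self-supervised criterion `ssl_loss`; the paper's Regularized Graph Infomax objective is more involved than this skeleton.

```python
# Sketch: train one GNN layer at a time with a self-supervised loss.
import torch

def train_layerwise(layers, ssl_loss, features, adj, epochs=50, lr=1e-3):
    h = features
    for layer in layers:                      # one layer at a time
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            z = layer(h, adj)                 # only this layer is trained
            loss = ssl_loss(z, h, adj)        # self-supervised objective
            loss.backward()
            opt.step()
        with torch.no_grad():                 # freeze and propagate forward
            h = layer(h, adj)
    return h                                  # final node representations
```

Because only one layer holds gradients at a time, peak memory stays roughly constant with depth, which is the efficiency gain the abstract highlights.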

FinDiff: Diffusion Models for Financial Tabular Data Generation

  • paper_url: http://arxiv.org/abs/2309.01472
  • repo_url: None
  • paper_authors: Timur Sattarov, Marco Schreyer, Damian Borth
  • for: This paper aims to generate realistic real-world financial tabular data for downstream regulatory tasks such as economic scenario modeling, stress testing, and fraud detection.
  • methods: The paper uses diffusion models, specifically denoising diffusion models, to synthesize data that mimics real data. It uses embedding encodings to handle the mixed-type attributes of financial data, comprising both categorical and numeric features.
  • results: Experimental results show that FinDiff generates synthetic tabular financial data with high fidelity, privacy, and utility, outperforming state-of-the-art baselines on three real-world financial datasets.
    Abstract The sharing of microdata, such as fund holdings and derivative instruments, by regulatory institutions presents a unique challenge due to strict data confidentiality and privacy regulations. These challenges often hinder the ability of both academics and practitioners to conduct collaborative research effectively. The emergence of generative models, particularly diffusion models, capable of synthesizing data mimicking the underlying distributions of real-world data presents a compelling solution. This work introduces 'FinDiff', a diffusion model designed to generate real-world financial tabular data for a variety of regulatory downstream tasks, for example economic scenario modeling, stress tests, and fraud detection. The model uses embedding encodings to model mixed modality financial data, comprising both categorical and numeric attributes. The performance of FinDiff in generating synthetic tabular financial data is evaluated against state-of-the-art baseline models using three real-world financial datasets (including two publicly available datasets and one proprietary dataset). Empirical results demonstrate that FinDiff excels in generating synthetic tabular financial data with high fidelity, privacy, and utility.
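The mixed-modality encoding step can be sketched as learned embeddings for categorical columns concatenated with numeric columns; the dimensions, layer sizes, and names below are illustrative assumptions rather than FinDiff's actual configuration.

```python
# Sketch: encode mixed categorical/numeric tabular rows for a denoising network.
import torch
import torch.nn as nn

class MixedTabularEncoder(nn.Module):
    def __init__(self, cat_cardinalities, n_numeric, emb_dim=8):
        super().__init__()
        # one embedding table per categorical column
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in cat_cardinalities]
        )
        self.out_dim = emb_dim * len(cat_cardinalities) + n_numeric

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_cat) integer codes; x_num: (batch, n_numeric) floats
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return torch.cat(embs + [x_num], dim=-1)

# Usage: a batch of 4 rows with two categorical and three numeric columns.
enc = MixedTabularEncoder(cat_cardinalities=[10, 5], n_numeric=3)
x = enc(torch.randint(0, 5, (4, 2)), torch.randn(4, 3))
print(x.shape)  # torch.Size([4, 19])
```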

Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.01458
  • repo_url: None
  • paper_authors: Qisen Yang, Huanqian Wang, Mukun Tong, Wenjie Shi, Gao Huang, Shiji Song
  • for: Interpreting and explaining the behavior of deep reinforcement learning (RL) agents.
  • methods: Proposes a novel framework (RL-in-RL) that solves the gradient disconnection from actions to rewards, ensuring reward consistency during interpretable feature discovery.
  • results: Tested and evaluated on Atari 2600 games and the Duckietown self-driving car simulator; the method maintains reward consistency and achieves high-quality feature attribution, and further analytical experiments validate the limitations of the action matching principle.
    Abstract The black-box nature of deep reinforcement learning (RL) hinders its adoption in real-world applications. Therefore, interpreting and explaining RL agents have been active research topics in recent years. Existing methods for post-hoc explanations usually adopt the action matching principle to enable an easy understanding of vision-based RL agents. In this paper, it is argued that the commonly used action matching principle is more like an explanation of deep neural networks (DNNs) than an interpretation of RL agents. It may lead to irrelevant or misplaced feature attribution when different DNN outputs lead to the same rewards or different rewards result from the same outputs. Therefore, we propose to consider rewards, the essential objective of RL agents, as the essential objective of interpreting RL agents as well. To ensure reward consistency during interpretable feature discovery, a novel framework (RL interpreting RL, denoted as RL-in-RL) is proposed to solve the gradient disconnection from actions to rewards. We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment. The results show that our method manages to keep reward (or return) consistency and achieves high-quality feature attribution. Further, a series of analytical experiments validate our assumption of the action matching principle's limitations.
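As a loose illustration of attributing to rewards rather than actions (the paper's central argument), one can compute saliency with respect to a predicted return. Here `value_net` is a hypothetical critic, and this plain gradient is not the RL-in-RL framework itself, which additionally repairs the action-to-reward gradient disconnection.

```python
# Sketch: input saliency w.r.t. predicted return instead of action logits.
import torch

def reward_saliency(value_net, obs):
    """Gradient of the predicted return w.r.t. the observation pixels."""
    obs = obs.detach().clone().requires_grad_(True)
    value = value_net(obs).sum()   # scalar predicted return for the batch
    value.backward()
    return obs.grad.abs()          # per-pixel attribution magnitudes
```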

On the Consistency and Robustness of Saliency Explanations for Time Series Classification

  • paper_url: http://arxiv.org/abs/2309.01457
  • repo_url: None
  • paper_authors: Chiara Balestra, Bin Li, Emmanuel Müller
  • for: This paper analyzes the consistency and robustness of saliency maps for time series features and temporal attribution in a time series classification task.
  • methods: The paper uses perturbation-based and gradient-based explanation models to generate saliency explanations and examines their consistency and robustness on five real-world datasets.
  • results: The experimental results show that the saliency explanations from both model families lack consistency and robustness to some extent, highlighting the need for more reliable explanation methods for time series classification.
    Abstract Interpretable machine learning and explainable artificial intelligence have become essential in many applications. The trade-off between interpretability and model performance remains a key obstacle to developing intrinsic and model-agnostic interpretation methods. Although model explanation approaches have achieved significant success in vision and natural language domains, explaining time series remains challenging. The complex pattern in the feature domain, coupled with the additional temporal dimension, hinders efficient interpretation. Saliency maps have been applied to interpret time series windows as images. However, they are not naturally designed for sequential data, thus suffering various issues. This paper extensively analyzes the consistency and robustness of saliency maps for time series features and temporal attribution. Specifically, we examine saliency explanations from both perturbation-based and gradient-based explanation models in a time series classification task. Our experimental results on five real-world datasets show that they all lack consistent and robust performances to some extent. By drawing attention to the flawed saliency explanation models, we motivate to develop consistent and robust explanations for time series classification.
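For concreteness, a minimal occlusion-style perturbation explanation for time series (one family the paper evaluates) might look like the following; `model` is a placeholder classifier returning class probabilities, and mean-replacement is one common perturbation choice among several.

```python
# Sketch: occlusion saliency for a univariate time series classifier.
import numpy as np

def occlusion_saliency(model, series, target, window=16):
    """Per-time-step importance for a 1-D series of shape (length,)."""
    base = model(series[None, :])[0, target]   # unperturbed class score
    saliency = np.zeros(len(series))
    for start in range(0, len(series) - window + 1):
        perturbed = series.copy()
        perturbed[start:start + window] = series.mean()  # occlude one window
        drop = base - model(perturbed[None, :])[0, target]
        saliency[start:start + window] += drop / window
    return saliency  # larger values = more important time steps
```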

Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance

  • paper_url: http://arxiv.org/abs/2309.01448
  • repo_url: None
  • paper_authors: Qisen Yang, Shenzhi Wang, Qihang Zhang, Gao Huang, Shiji Song
  • for: Improve policy optimization in offline reinforcement learning (RL), where a policy is optimized on a previously collected dataset without any interaction with the environment.
  • methods: Proposes a guiding-network-based plug-in approach named Guided Offline RL (GORL), which adaptively determines the relative importance of policy improvement and policy constraint for every sample using only a few expert demonstrations.
  • results: Extensive experiments show that GORL can be easily combined with most offline RL algorithms and delivers statistically significant performance improvements across various environments.
    Abstract Offline reinforcement learning (RL) optimizes the policy on a previously collected dataset without any interactions with the environment, yet usually suffers from the distributional shift problem. To mitigate this issue, a typical solution is to impose a policy constraint on a policy improvement objective. However, existing methods generally adopt a ``one-size-fits-all'' practice, i.e., keeping only a single improvement-constraint balance for all the samples in a mini-batch or even the entire offline dataset. In this work, we argue that different samples should be treated with different policy constraint intensities. Based on this idea, a novel plug-in approach named Guided Offline RL (GORL) is proposed. GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample. We theoretically prove that the guidance provided by our method is rational and near-optimal. Extensive experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
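The per-sample balancing idea can be sketched as below, with a placeholder `guide_net` and per-sample `improvement_term` and `constraint_term` loss tensors; GORL's actual guiding network is trained with expert demonstrations, which this skeleton omits.

```python
# Sketch: per-sample weighting of policy improvement vs. policy constraint.
import torch

def gorl_loss(guide_net, improvement_term, constraint_term, batch):
    # improvement_term, constraint_term: (batch_size,) per-sample losses
    w = torch.sigmoid(guide_net(batch)).reshape(-1)   # weight in (0, 1)
    return (w * improvement_term + (1 - w) * constraint_term).mean()
```

This replaces the "one-size-fits-all" single balance the abstract criticizes with a weight chosen per transition.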

Expanding Mars Climate Modeling: Interpretable Machine Learning for Modeling MSL Relative Humidity

  • paper_url: http://arxiv.org/abs/2309.01424
  • repo_url: None
  • paper_authors: Nour Abdelmoneim, Dattaraj B. Dhuri, Dimitra Atri, Germán Martínez
  • for: The goal of this study is to model the Martian climate, specifically the relative humidity measured in Gale Crater.
  • methods: The study uses machine learning, training a deep neural network on data generated by the Mars Planetary Climate Model.
  • results: The neural network predicts relative humidity in Gale Crater with a mean error of 3% and an $R^2$ score of 0.92. The study also presents an approach to predict quantile ranges of relative humidity, useful for applications that require a range of values.
    Abstract For the past several decades, numerous attempts have been made to model the climate of Mars with extensive studies focusing on the planet's dynamics and the understanding of its climate. While physical modeling and data assimilation approaches have made significant progress, uncertainties persist in comprehensively capturing and modeling the complexities of Martian climate. In this work, we propose a novel approach to Martian climate modeling by leveraging machine learning techniques that have shown remarkable success in Earth climate modeling. Our study presents a deep neural network designed to accurately model relative humidity in Gale Crater, as measured by NASA's Mars Science Laboratory ``Curiosity'' rover. By utilizing simulated meteorological variables produced by the Mars Planetary Climate Model, a robust Global Circulation Model, our model accurately predicts relative humidity with a mean error of 3\% and an $R^2$ score of 0.92. Furthermore, we present an approach to predict quantile ranges of relative humidity, catering to applications that require a range of values. To address the challenge of interpretability associated with machine learning models, we utilize an interpretable model architecture and conduct an in-depth analysis of its internal mechanisms and decision making processes. We find that our neural network can effectively model relative humidity at Gale crater using a few meteorological variables, with the monthly mean surface H$_2$O layer, planetary boundary layer height, convective wind speed, and solar zenith angle being the primary contributors to the model predictions. In addition to providing a fast and efficient method to modeling climate variables on Mars, this modeling approach can also be used to expand on current datasets by filling spatial and temporal gaps in observations.
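The regression setup can be sketched as a small fully connected network mapping the four inputs named in the abstract to relative humidity; the layer sizes and sigmoid output below are illustrative assumptions, not the paper's architecture.

```python
# Sketch: MLP from four meteorological variables (monthly mean surface H2O
# layer, boundary-layer height, convective wind speed, solar zenith angle)
# to relative humidity.
import torch
import torch.nn as nn

humidity_net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),   # relative humidity bounded in [0, 1]
)

x = torch.randn(8, 4)                 # a batch of 8 simulated input states
print(humidity_net(x).shape)          # torch.Size([8, 1])
```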

Differentiable Bayesian Structure Learning with Acyclicity Assurance

  • paper_url: http://arxiv.org/abs/2309.01392
  • repo_url: None
  • paper_authors: Quang-Duy Tran, Phuoc Nguyen, Bao Duong, Thin Nguyen
  • for: This study proposes a constrained approach that guarantees the generated graph structures are directed and acyclic.
  • methods: The method integrates knowledge from topological orderings to ensure the acyclicity of the generated graphs.
  • results: Experiments show that the approach reduces inference complexity compared with related Bayesian score-based methods while guaranteeing acyclic structures.
    Abstract Score-based approaches in the structure learning task are thriving because of their scalability. Continuous relaxation has been the key reason for this advancement. Despite achieving promising outcomes, most of these methods are still struggling to ensure that the graphs generated from the latent space are acyclic by minimizing a defined score. There has also been another trend of permutation-based approaches, which concern the search for the topological ordering of the variables in the directed acyclic graph in order to limit the search space of the graph. In this study, we propose an alternative approach for strictly constraining the acyclicity of the graphs with an integration of the knowledge from the topological orderings. Our approach can reduce inference complexity while ensuring the structures of the generated graphs to be acyclic. Our empirical experiments with simulated and real-world data show that our approach can outperform related Bayesian score-based approaches.
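The key trick of guaranteeing acyclicity through a topological ordering is easy to illustrate: once an ordering is fixed, keeping only forward edges yields a DAG by construction. The sketch below is a generic illustration of that constraint, not the paper's inference procedure.

```python
# Sketch: enforce acyclicity by masking edges against a topological ordering.
import numpy as np

def mask_to_dag(weighted_adj, ordering):
    """Zero out all edges that violate the given topological ordering."""
    d = len(ordering)
    pos = np.empty(d, dtype=int)
    pos[ordering] = np.arange(d)        # pos[i] = rank of node i in the order
    mask = pos[:, None] < pos[None, :]  # edge i->j allowed iff i precedes j
    return weighted_adj * mask

adj = np.random.randn(4, 4)
dag = mask_to_dag(adj, ordering=np.array([2, 0, 3, 1]))  # acyclic by design
```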

Classic algorithms are fair learners: Classification Analysis of natural weather and wildfire occurrences

  • paper_url: http://arxiv.org/abs/2309.01381
  • repo_url: https://github.com/sengopal/classic-ml-review-paper
  • paper_authors: Senthilkumar Gopal
  • for: This paper reviews the empirical behavior of commonly used supervised learning algorithms to understand their performance and properties under different conditions.
  • methods: The paper evaluates several classic supervised learning algorithms, including decision trees, boosting, support vector machines, k-nearest neighbors, and a shallow artificial neural network.
  • results: Through classification experiments on sparse tabular data, the paper finds that these classic algorithms generalize well even on noisy and sparse data, and that tuning specific hyperparameters improves classification accuracy.
    Abstract Classic machine learning algorithms have been reviewed and studied mathematically in detail with respect to their performance and properties. This paper intends to review the empirical functioning of widely used classical supervised learning algorithms such as Decision Trees, Boosting, Support Vector Machines, k-nearest Neighbors and a shallow Artificial Neural Network. The paper evaluates these algorithms on sparse tabular data for a classification task and observes the effect of specific hyperparameters on these algorithms when the data is synthetically modified for higher noise. These perturbations were introduced to observe the efficiency of these algorithms in generalizing on sparse data and the utility of different parameters for improving classification accuracy. The paper shows that, owing to their inherent properties, these classic algorithms remain fair learners even on such limited, noisy, and sparse datasets.
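The experimental setup can be sketched with scikit-learn: fit the classic learners on a sparse synthetic classification task with injected label noise and compare cross-validated accuracy. The dataset parameters and the 10% noise rate are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: compare classic supervised learners under label noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=40, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(len(y)) < 0.10           # flip 10% of labels as noise
y_noisy = np.where(flip, 1 - y, y)

models = {
    "tree": DecisionTreeClassifier(),
    "boosting": GradientBoostingClassifier(),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y_noisy, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```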

Mutual Information Maximizing Quantum Generative Adversarial Network and Its Applications in Finance

  • paper_url: http://arxiv.org/abs/2309.01363
  • repo_url: None
  • paper_authors: Mingyu Lee, Myeongjin Shin, Junseo Lee, Kabgyun Jeong
  • for: This work targets quantum machine learning in the NISQ computing era, applying it to problems across various domains.
  • methods: The study uses quantum generative adversarial networks (QGANs) and incorporates the Mutual Information Neural Estimator (MINE) to address the mode collapse problem.
  • results: The results show that InfoQGAN successfully mitigates mode collapse and applies to a financial scenario, generating portfolio return distributions through dynamic asset allocation.
    Abstract One of the most promising applications in the era of NISQ (Noisy Intermediate-Scale Quantum) computing is quantum machine learning. Quantum machine learning offers significant quantum advantages over classical machine learning across various domains. Specifically, generative adversarial networks have been recognized for their potential utility in diverse fields such as image generation, finance, and probability distribution modeling. However, these networks necessitate solutions for inherent challenges like mode collapse. In this study, we capitalize on the concept that the estimation of mutual information between high-dimensional continuous random variables can be achieved through gradient descent using neural networks. We introduce a novel approach named InfoQGAN, which employs the Mutual Information Neural Estimator (MINE) within the framework of quantum generative adversarial networks to tackle the mode collapse issue. Furthermore, we elaborate on how this approach can be applied to a financial scenario, specifically addressing the problem of generating portfolio return distributions through dynamic asset allocation. This illustrates the potential practical applicability of InfoQGAN in real-world contexts.
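The MINE component the paper plugs into its QGAN can be sketched as a statistics network trained to maximize the Donsker-Varadhan lower bound on mutual information; the network size here is an arbitrary choice.

```python
# Sketch: Mutual Information Neural Estimator (MINE) lower bound.
import torch
import torch.nn as nn

class Mine(nn.Module):
    def __init__(self, dim_x, dim_z, hidden=64):
        super().__init__()
        self.t = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def lower_bound(self, x, z):
        joint = self.t(torch.cat([x, z], dim=-1)).mean()
        z_shuffled = z[torch.randperm(len(z))]   # break the (x, z) pairing
        marginal = self.t(torch.cat([x, z_shuffled], dim=-1))
        # Donsker-Varadhan: I(X;Z) >= E_joint[T] - log E_marginal[exp(T)]
        log_mean_exp = (torch.logsumexp(marginal, dim=0).squeeze()
                        - torch.log(torch.tensor(float(len(z)))))
        return joint - log_mean_exp
```

Training the generator to increase this bound encourages diverse outputs, which is how MINE-style terms counteract mode collapse.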

Random Projections of Sparse Adjacency Matrices

  • paper_url: http://arxiv.org/abs/2309.01360
  • repo_url: None
  • paper_authors: Frank Qiu
  • for: This paper studies a random projection method for representing sparse graphs.
  • methods: The paper analyzes random projections of adjacency matrices that retain the functionality of the underlying graphs.
  • results: The projections can represent graphs of different sizes and vertex sets in the same space, and graph operations remain accurate when the projection size scales linearly with the number of vertices.
    Abstract We analyze a random projection method for adjacency matrices, studying its utility in representing sparse graphs. We show that these random projections retain the functionality of their underlying adjacency matrices while having extra properties that make them attractive as dynamic graph representations. In particular, they can represent graphs of different sizes and vertex sets in the same space, allowing for the aggregation and manipulation of graphs in a unified manner. We also provide results on how the size of the projections need to scale in order to preserve accurate graph operations, showing that the size of the projections can scale linearly with the number of vertices while accurately retaining first-order graph information. We conclude by characterizing our random projection as a distance-preserving map of adjacency matrices analogous to the usual Johnson-Lindenstrauss map.
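One simple way to realize such a projection, as a hedged sketch: assign each vertex a random code and store the graph as the projected matrix C^T A C, which supports approximate edge queries. The Gaussian codes and the dimension d=256 are illustrative assumptions, not the paper's construction.

```python
# Sketch: random projection of a sparse adjacency matrix via vertex codes.
import numpy as np

def project_graph(edges, n_vertices, d=256, seed=0):
    rng = np.random.default_rng(seed)
    codes = rng.normal(scale=1.0 / np.sqrt(d), size=(n_vertices, d))
    g = np.zeros((d, d))
    for i, j in edges:            # sum of outer products of endpoint codes
        g += np.outer(codes[i], codes[j])
    return g, codes

# Edge query: codes[i] @ g @ codes[j] is close to 1 if edge (i, j) is present
# and close to 0 otherwise, regardless of the original graph's size.
g, codes = project_graph([(0, 1), (1, 2)], n_vertices=3)
print(codes[0] @ g @ codes[1], codes[0] @ g @ codes[2])
```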

MalwareDNA: Simultaneous Classification of Malware, Malware Families, and Novel Malware

  • paper_url: http://arxiv.org/abs/2309.01350
  • repo_url: None
  • paper_authors: Maksim E. Eren, Manish Bhattarai, Kim Rasmussen, Boian S. Alexandrov, Charles Nicholas
  • for: This work proposes a new machine learning method to accurately identify novel malware families while unifying malware/benign-ware classification and malware family classification in a single framework.
  • methods: The method applies machine learning over a variety of features and data sources to perform the classification.
  • results: Preliminary results show the method accurately identifies novel malware families within this unified framework.
    Abstract Malware is one of the most dangerous and costly cyber threats to national security and a crucial factor in modern cyber-space. However, the adoption of machine learning (ML) based solutions against malware threats has been relatively slow. Shortcomings in the existing ML approaches are likely contributing to this problem. The majority of current ML approaches ignore real-world challenges such as the detection of novel malware. In addition, proposed ML approaches are often designed either for malware/benign-ware classification or malware family classification. Here we introduce and showcase preliminary capabilities of a new method that can perform precise identification of novel malware families, while also unifying the capability for malware/benign-ware classification and malware family classification into a single framework.

In-processing User Constrained Dominant Sets for User-Oriented Fairness in Recommender Systems

  • paper_url: http://arxiv.org/abs/2309.01335
  • repo_url: None
  • paper_authors: Zhongxuan Han, Chaochao Chen, Xiaolin Zheng, Weiming Liu, Jun Wang, Wenjie Cheng, Yuyuan Li
  • for: This work addresses the User-Oriented Fairness (UOF) issue in recommender systems, where recommendation performance is unfair to certain user groups.
  • methods: It proposes the In-processing User Constrained Dominant Sets (In-UCDS) framework, applicable to any backbone recommendation model, with two stages: a UCDS modeling stage, which extracts for each disadvantaged user a constrained dominant set (a user cluster) of similar advantaged users, and an in-processing training stage, which moves disadvantaged users' representations closer to their clusters via a fairness loss. Combining the fairness loss with the backbone model's loss addresses UOF while preserving overall recommendation performance.
  • results: Experiments on three real-world datasets show that In-UCDS outperforms state-of-the-art methods, yielding a fairer model with better overall recommendation performance.
    Abstract Recommender systems are typically biased toward a small group of users, leading to severe unfairness in recommendation performance, i.e., User-Oriented Fairness (UOF) issue. The existing research on UOF is limited and fails to deal with the root cause of the UOF issue: the learning process between advantaged and disadvantaged users is unfair. To tackle this issue, we propose an In-processing User Constrained Dominant Sets (In-UCDS) framework, which is a general framework that can be applied to any backbone recommendation model to achieve user-oriented fairness. We split In-UCDS into two stages, i.e., the UCDS modeling stage and the in-processing training stage. In the UCDS modeling stage, for each disadvantaged user, we extract a constrained dominant set (a user cluster) containing some advantaged users that are similar to it. In the in-processing training stage, we move the representations of disadvantaged users closer to their corresponding cluster by calculating a fairness loss. By combining the fairness loss with the original backbone model loss, we address the UOF issue and maintain the overall recommendation performance simultaneously. Comprehensive experiments on three real-world datasets demonstrate that In-UCDS outperforms the state-of-the-art methods, leading to a fairer model with better overall recommendation performance.
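The in-processing stage can be sketched as a fairness loss that pulls each disadvantaged user's embedding toward its cluster; the squared-distance form and the `clusters` data layout below are illustrative assumptions.

```python
# Sketch: fairness loss pulling disadvantaged users toward their clusters.
import torch

def fairness_loss(user_emb, clusters):
    """clusters: list of (disadvantaged_user_idx, advantaged_idx_tensor)."""
    losses = []
    for u, cluster_idx in clusters:
        target = user_emb[cluster_idx].mean(dim=0).detach()  # cluster centroid
        losses.append(((user_emb[u] - target) ** 2).sum())
    return torch.stack(losses).mean()

# Total objective combines this with the backbone model's own loss, e.g.:
#   loss = backbone_loss + lam * fairness_loss(user_emb, clusters)
```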

An ML-assisted OTFS vs. OFDM adaptable modem

  • paper_url: http://arxiv.org/abs/2309.01319
  • repo_url: None
  • paper_authors: I. Zakir Ahmed, Hamid R. Sadjadpour
  • for: Improve communication performance in high-mobility scenarios.
  • methods: Use a deep neural network (DNN) to adaptively switch between OTFS and OFDM signal processing chains at the transmitter and receiver.
  • results: Compared with OTFS-only and OFDM-only operation, the proposed switched-waveform scheme achieves significantly improved mean-squared-error (MSE) performance.
    Abstract The Orthogonal-Time-Frequency-Space (OTFS) signaling is known to be resilient to doubly-dispersive channels, which impacts high mobility scenarios. On the other hand, the Orthogonal-Frequency-Division-Multiplexing (OFDM) waveforms enjoy the benefits of the reuse of legacy architectures, simplicity of receiver design, and low-complexity detection. Several studies that compare the performance of OFDM and OTFS have indicated mixed outcomes due to the plethora of system parameters at play beyond high-mobility conditions. In this work, we exemplify this observation using simulations and propose a deep neural network (DNN)-based adaptation scheme to switch between using either an OTFS or OFDM signal processing chain at the transmitter and receiver for optimal mean-squared-error (MSE) performance. The DNN classifier is trained to switch between the two schemes by observing the channel condition, received SNR, and modulation format. We compare the performance of the OTFS, OFDM, and the proposed switched-waveform scheme. The simulations indicate superior performance with the proposed scheme with a well-trained DNN, thus improving the MSE performance of the communication significantly.
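The switching logic can be sketched as a small classifier over channel features; the three-feature encoding `[doppler_spread, snr_db, mod_order]` and the network size are illustrative assumptions, and the paper's DNN inputs may differ.

```python
# Sketch: DNN classifier that selects the OTFS or OFDM processing chain.
import torch
import torch.nn as nn

switch_net = nn.Sequential(
    nn.Linear(3, 32), nn.ReLU(),   # [doppler_spread, snr_db, mod_order]
    nn.Linear(32, 2),              # logits for {OTFS, OFDM}
)

def choose_waveform(features):
    logits = switch_net(features)
    return "OTFS" if logits.argmax(dim=-1).item() == 0 else "OFDM"

print(choose_waveform(torch.tensor([[0.8, 12.0, 4.0]])))
```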

Communication-Efficient Design of Learning System for Energy Demand Forecasting of Electrical Vehicles

  • paper_url: http://arxiv.org/abs/2309.01297
  • repo_url: None
  • paper_authors: Jiacong Xu, Riley Kilfoyle, Zixiang Xiong, Ligang Lu
  • for: This paper proposes a machine learning model for time series energy demand forecasting that can be trained in a distributed manner across geographically dispersed EV charging stations.
  • methods: The paper combines a recent transformer architecture with an efficient variant of federated learning (FL) to improve time series forecasting performance.
  • results: Compared against counterpart models, the proposed model matches their forecasting performance while consuming significantly less communication bandwidth during training.
    Abstract Machine learning (ML) applications to time series energy utilization forecasting problems are a challenging assignment due to a variety of factors. Chief among these is the non-homogeneity of the energy utilization datasets and the geographical dispersion of energy consumers. Furthermore, these ML models require vast amounts of training data and communications overhead in order to develop an effective model. In this paper, we propose a communication-efficient time series forecasting model combining the most recent advancements in transformer architectures implemented across a geographically dispersed series of EV charging stations and an efficient variant of federated learning (FL) to enable distributed training. The time series prediction performance and communication overhead cost of our FL are compared against their counterpart models and shown to have parity in performance while consuming significantly lower data rates during training. Additionally, the comparison is made across EV charging as well as other time series datasets to demonstrate the flexibility of our proposed model in generalized time series prediction beyond energy demand. The source code for this work is available at https://github.com/XuJiacong/LoGTST_PSGF
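The distributed-training skeleton underlying the approach can be sketched as a FedAvg-style round, assuming a PyTorch model and a placeholder `local_train` routine; the paper's FL variant is more communication-efficient than this plain parameter averaging.

```python
# Sketch: one federated averaging round across EV charging stations.
import copy
import torch

def federated_round(global_model, stations, local_train):
    local_states = []
    for data in stations:                 # one EV charging site each
        local = copy.deepcopy(global_model)
        local_train(local, data)          # a few local epochs on-site
        local_states.append(local.state_dict())
    # element-wise average of all local parameter tensors
    avg = {k: torch.stack([s[k].float() for s in local_states]).mean(dim=0)
           for k in local_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```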

eess.IV - 2023-09-04

Effects of Material Mapping Agnostic Partial Volume Correction for Subject Specific Finite Elements Simulations

  • paper_url: http://arxiv.org/abs/2309.01769
  • repo_url: https://github.com/adbeagley/pvcpy
  • paper_authors: Aren Beagley, Hannah Richards, Joshua W. Giles
  • for: correction of partial volume effects in CT images
  • methods: new algorithm based on previous work, no pre-processing or user input required, applied directly to CT images
  • results: improved accuracy of surface strain predictions in experimental three point bending tests compared to original, uncorrected CT images
    Abstract Partial volume effects are present at the boundary between any two types of material in a CT image due to the scanner's point spread function, finite voxel resolution, and, importantly, the discrepancy in radiodensity between the two materials. In this study, a new algorithm is developed and validated that builds on previously published work to enable the correction of partial volume effects at cortical bone boundaries. Unlike past methods, this algorithm does not require pre-processing or user input to achieve the correction, and the correction is applied directly onto a set of CT images, which enables it to be used in existing computational modelling workflows. The algorithm was validated by performing experimental three point bending tests on porcine fibulae specimens and comparing the experimental results to finite element results for models created using either the original, uncorrected CT images or the partial volume corrected images. Results demonstrated that the models created using the partial volume corrected images improved the accuracy of the surface strain predictions. Given this initial validation, this algorithm is a viable method for overcoming the challenge of partial volume effects in CT images. Thus, future work should further validate the algorithm with human tissues and couple it with a range of different finite element creation workflows to verify that it is robust and agnostic to the chosen material mapping strategy.

Deep Learning Approach for Large-Scale, Real-Time Quantification of Green Fluorescent Protein-Labeled Biological Samples in Microreactors

  • paper_url: http://arxiv.org/abs/2309.01384
  • repo_url: None
  • paper_authors: Yuanyuan Wei, Sai Mu Dalike Abaxi, Nawaz Mehmood, Luoquan Li, Fuyang Qu, Guangyao Cheng, Dehua Hu, Yi-Ping Ho, Scott Wu Yuan, Ho-Pui Ho
  • for: This study develops a deep-learning-enabled pipeline that automatically segments, classifies, and quantifies GFP (green fluorescent protein)-labeled microreactors.
  • methods: The pipeline uses deep learning to automatically segment and classify GFP-labeled microreactors, enabling real-time absolute quantification with standard laboratory fluorescence microscopes.
  • results: The approach accurately predicts the sizes and occupancy status of microreactors, quantifying over 2,000 microreactors (across 10 images) within 2.5 seconds over a dynamic range spanning from 56.52 to 1569.43 copies per micron-liter.
    Abstract Absolute quantification of biological samples entails determining expression levels in precise numerical copies, offering enhanced accuracy and superior performance for rare templates. However, existing methodologies suffer from significant limitations: flow cytometers are both costly and intricate, while fluorescence imaging relying on software tools or manual counting is time-consuming and prone to inaccuracies. In this study, we have devised a comprehensive deep-learning-enabled pipeline that enables the automated segmentation and classification of GFP (green fluorescent protein)-labeled microreactors, facilitating real-time absolute quantification. Our findings demonstrate the efficacy of this technique in accurately predicting the sizes and occupancy status of microreactors using standard laboratory fluorescence microscopes, thereby providing precise measurements of template concentrations. Notably, our approach exhibits an analysis speed of quantifying over 2,000 microreactors (across 10 images) within a remarkable 2.5 seconds, and a dynamic range spanning from 56.52 to 1569.43 copies per micron-liter. Furthermore, our Deep-dGFP algorithm showcases remarkable generalization capabilities, as it can be directly applied to various GFP-labeling scenarios, including droplet-based, microwell-based, and agarose-based biological applications. To the best of our knowledge, this represents the first successful implementation of an all-in-one image analysis algorithm in droplet digital PCR (polymerase chain reaction), microwell digital PCR, droplet single-cell sequencing, agarose digital PCR, and bacterial quantification, without necessitating any transfer learning steps, modifications, or retraining procedures. We firmly believe that our Deep-dGFP technique will be readily embraced by biomedical laboratories and holds potential for further development in related clinical applications.

eess.SP - 2023-09-04

FlexRDZ: Autonomous Mobility Management for Radio Dynamic Zones

  • paper_url: http://arxiv.org/abs/2309.01861
  • repo_url: None
  • paper_authors: Aashish Gottipati, Jacobus Van der Merwe
  • for: This paper presents an autonomous manager for radio dynamic zones (RDZs) that enables their safe operation through real-time control of deployed test transmitters.
  • methods: The paper uses Hierarchical Task Networks and digital twin modeling to plan and resolve RDZ violations in near real-time.
  • results: Simulations show that FlexRDZ enables up to a 20 dBm reduction in mobile interference while preserving the communication capabilities and uptime of the test transmitters.
    Abstract FlexRDZ is an online, autonomous manager for radio dynamic zones (RDZ) that seeks to enable the safe operation of RDZs through real-time control of deployed test transmitters. FlexRDZ leverages Hierarchical Task Networks and digital twin modeling to plan and resolve RDZ violations in near real-time. We prototype FlexRDZ with GTPyhop and the Terrain Integrated Rough Earth Model (TIREM). We deploy and evaluate FlexRDZ within a simulated version of the Salt Lake City POWDER testbed, a potential urban RDZ environment. Our simulations show that FlexRDZ enables up to a 20 dBm reduction in mobile interference and a significant reduction in the total power of leaked transmissions while preserving the overall communication capabilities and uptime of test transmitters. To our knowledge, FlexRDZ is the first autonomous system for RDZ management.

Variational Tracking and Redetection for Closely-spaced Objects in Heavy Clutter

  • paper_url: http://arxiv.org/abs/2309.01774
  • repo_url: None
  • paper_authors: Runze Gan, Qing Li, Simon Godsill
  • for: This paper proposes a variational Bayes association-based non-homogeneous Poisson process tracker (VB-AbNHPP) that efficiently tracks targets among a high density of closely-spaced objects and heavy clutter.
  • methods: Built on the general coordinate ascent variational filtering framework, the tracker performs tracking, data association, and learning of target and clutter rates with a parallelisable implementation. A variational localisation strategy is also proposed for rapid rediscovery of missed targets over a large surveillance area under extremely heavy clutter.
  • results: The tracker demonstrates improved accuracy and efficiency compared with existing trackers in challenging scenarios, and can automatically detect and recover from track loss.
    Abstract The non-homogeneous Poisson process (NHPP) is a widely used measurement model that allows for an object to generate multiple measurements over time. However, it can be difficult to efficiently and reliably track multiple objects under this NHPP model in scenarios with a high density of closely-spaced objects and heavy clutter. Therefore, based on the general coordinate ascent variational filtering framework, this paper presents a variational Bayes association-based NHPP tracker (VB-AbNHPP) that can efficiently perform tracking, data association, and learning of target and clutter rates with a parallelisable implementation. In addition, a variational localisation strategy is proposed, which enables rapid rediscovery of missed targets from a large surveillance area under extremely heavy clutter. This strategy is integrated into the VB-AbNHPP tracker, resulting in a robust methodology that can automatically detect and recover from track loss. This tracker demonstrates improved tracking performance compared with existing trackers in challenging scenarios, in terms of both accuracy and efficiency.

Pupillary activity in areas of interest from visual stimuli for neonatal pain assessment

  • paper_url: http://arxiv.org/abs/2309.01738
  • repo_url: None
  • paper_authors: Roberto Magalhaes Jr, Rafael Orsi, Marina Barros, Ruth Guinsburg, Carlos E Thomaz
  • for: This study compares the pupillary activity index with traditional eye-tracking metrics, such as fixation count and duration, for assessing neonatal pain.
  • methods: The study uses the Tobii TX300 eye-tracking system to record gaze over newborn faces, and introduces the Low/High Index of Pupillary Activity to evaluate healthcare experts and non-experts analyzing face images with and without pain.
  • results: The results suggest that the visual attention reflected by traditional metrics may not correspond directly to the cognitive load of either group of participants.
    Abstract This paper compares the pupillary activity index to traditional eye-tracking metrics like the fixation count and duration in assessing neonatal pain. It explores the benefits of incorporating pupillary activity measures to improve methods that lead to an understanding of cognitive processing and performance evaluation. The estimation of cognitive load using pupil diameter typically involves measures relative to a baseline. Instead, we conducted an eye-tracking study using the Low/High Index of Pupillary Activity to evaluate healthcare experts and non-experts analyzing the faces with and without pain from a dataset of newborn faces. This data was recorded by the Tobii TX300 eye-tracking system in a closed room with controlled lighting. Our contribution is to introduce the LHIPA calculation considering the areas of interest segments of the pupil diameter signal. The results suggest that the visual attention reflected by the traditional metrics may not correspond directly to the respective cognitive load for both sample groups of participants.

Output-only Modal Identification of beams with different boundary condition

  • paper_url: http://arxiv.org/abs/2309.01719
  • repo_url: None
  • paper_authors: M. R. Davoodi, S. A. Mostafavian, S. R. Nabavian, GH. R. Jahangiri
  • for: This study aims to evaluate the integrity of civil structures by observing their dynamic responses over time, focusing on identifying the modal parameters of structures using output-only identification techniques.
  • methods: Four beams with different boundary conditions and arbitrary loading were modeled with the finite element software ANSYS, and the modal parameters were identified in MATLAB using the Frequency Domain Decomposition (FDD) and Peak Picking (PP) methods, then compared against the input-output method based on the frequency response function (FRF).
  • results: The results show good agreement among the three methods for determining the dynamic characteristics of the beams.
    Abstract Structural Health Monitoring (SHM) evaluates the integrity of a structure by observing its dynamic responses with an array of sensors over time to determine the current health state of the structure. The most important step of SHM is system identification, which in civil structures means identifying the modal parameters of structures. Due to numerous limitations of input-output methods, system identification of ambient vibration structures using output-only identification techniques has become a key issue in structural health monitoring and assessment of engineering structures. In this paper, four beams with different boundary conditions and arbitrary loading were modeled in the finite element software ANSYS, and the responses (node accelerations) were obtained. Using these data and codes written in MATLAB, the modal parameters (natural frequencies, mode shapes) of the beams were identified with the Frequency Domain Decomposition (FDD) and Peak Picking (PP) methods and then validated against the results of the input-output method, determined via the frequency response function (FRF). The results indicate good agreement between the three methods for determining the dynamic characteristics of the beams.
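For reference, the core FDD computation the paper applies can be sketched as follows, assuming multi-channel acceleration records and SciPy available; the `nperseg` value and the use of Welch cross-spectra are standard but illustrative choices.

```python
# Sketch: Frequency Domain Decomposition (FDD) for output-only modal analysis.
import numpy as np
from scipy.signal import csd

def fdd(acc, fs, nperseg=1024):
    """acc: (n_sensors, n_samples) acceleration records sampled at fs Hz."""
    n = acc.shape[0]
    f, _ = csd(acc[0], acc[0], fs=fs, nperseg=nperseg)
    G = np.zeros((len(f), n, n), dtype=complex)   # CPSD matrix per frequency
    for i in range(n):
        for j in range(n):
            _, G[:, i, j] = csd(acc[i], acc[j], fs=fs, nperseg=nperseg)
    u, s, _ = np.linalg.svd(G)        # batched SVD, one per frequency line
    return f, s[:, 0], u[:, :, 0]     # freqs, 1st singular values, mode shapes
```

Peaks in the first-singular-value curve indicate natural frequencies, and the corresponding singular vectors approximate the mode shapes.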

Direction-of-arrival estimation with conventional co-prime arrays using deep learning-based probabilistic Bayesian neural networks

  • paper_url: http://arxiv.org/abs/2309.01690
  • repo_url: None
  • paper_authors: Wael Elshennawy
  • for: investigate the direction-of-arrival (DOA) estimation of narrow band signals with conventional co-prime arrays
  • methods: probabilistic Bayesian neural networks (PBNN) and a super resolution DOA estimation method based on Bayesian neural networks
  • results: enhances the generalization to untrained scenarios and provides robustness to non-ideal conditions, such as small angle separation, data scarcity, and imperfect arrays.
    Abstract The paper investigates the direction-of-arrival (DOA) estimation of narrow band signals with conventional co-prime arrays by using probabilistic Bayesian neural networks (PBNN). A super resolution DOA estimation method based on Bayesian neural networks and a spatially overcomplete array output formulation overcomes the pre-assumption dependencies of the model-driven DOA estimation methods. The proposed DOA estimation method utilizes a PBNN model to capture both data and model uncertainty. The developed PBNN model is trained to do the mapping from the pseudo-spectrum to the super resolution spectrum. This learning-based method enhances the generalization of untrained scenarios, and it provides robustness to non-ideal conditions, e.g., small angle separation, data scarcity, and imperfect arrays, etc. Simulation results demonstrate the loss curves of the PBNN model and deterministic model. Simulations are carried out to validate the performance of PBNN model compared to a deterministic model of conventional neural networks (CNN).

Towards Robust Velocity and Position Estimation of Opponents for Autonomous Racing Using Low-Power Radar

  • paper_url: http://arxiv.org/abs/2309.01647
  • repo_url: None
  • paper_authors: Andrea Ronco, Nicolas Baumann, Marco Giordano, Michele Magno
  • for: Design and development of an intelligent subsystem that integrates a novel low-power radar sensor into an autonomous racing perception pipeline to robustly estimate the position and velocity of dynamic obstacles.
  • methods: The system is based on the Infineon BGT60TR13D radar sensor and is evaluated in a real-world scenario with scaled race cars; the paper discusses the benefits and limitations of the sensor subsystem based on field-collected data.
  • results: The system achieves a tracking error of 0.21 +- 0.29 m in distance estimation and 0.39 +- 0.19 m/s in velocity estimation while consuming only tens of milliwatts. It provides information complementary to sensors such as LiDAR and cameras and applies to a wide range of uses beyond autonomous racing.
    Abstract This paper presents the design and development of an intelligent subsystem that includes a novel low-power radar sensor integrated into an autonomous racing perception pipeline to robustly estimate the position and velocity of dynamic obstacles. The proposed system, based on the Infineon BGT60TR13D radar, is evaluated in a real-world scenario with scaled race cars. The paper explores the benefits and limitations of using such a sensor subsystem and draws conclusions based on field-collected data. The results demonstrate a tracking error up to 0.21 +- 0.29 m in distance estimation and 0.39 +- 0.19 m/s in velocity estimation, despite the power consumption in the range of 10s of milliwatts. The presented system provides complementary information to other sensors such as LiDAR and camera, and can be used in a wide range of applications beyond autonomous racing.

A balanced Memristor-CMOS ternary logic family and its application

  • paper_url: http://arxiv.org/abs/2309.01615
  • repo_url: None
  • paper_authors: Xiao-Yuan Wang, Jia-Wei Zhou, Chuan-Tao Dong, Xin-Hui Chen, Sanjoy Kumar Nandi, Robert G. Elliman, Sung-Mo Kang, Herbert Ho-Ching Iu
  • for: This work proposes balanced ternary logic circuit designs based on memristors and CMOS devices.
  • methods: Balanced ternary minimum gates (TMIN), maximum gates (TMAX), and ternary inverters are systematically designed and verified by simulation; logic circuits such as ternary encoders, decoders, and multiplexers are then designed on this basis, and two different schemes are used to realize functional combinational circuits such as a balanced ternary half adder, multiplier, and numerical comparator.
  • results: A series of comparisons and analyses of the two design schemes provides a reference for subsequent research and development of three-valued logic circuits.
    Abstract The design of balanced ternary digital logic circuits based on memristors and conventional CMOS devices is proposed. First, balanced ternary minimum gate TMIN, maximum gate TMAX and ternary inverters are systematically designed and verified by simulation, and then logic circuits such as ternary encoders, decoders and multiplexers are designed on this basis. Two different schemes are then used to realize the design of functional combinational logic circuits such as a balanced ternary half adder, multiplier, and numerical comparator. Finally, we report a series of comparisons and analyses of the two design schemes, which provide a reference for subsequent research and development of three-valued logic circuits.
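Since balanced ternary gates reduce to simple arithmetic on the value set {-1, 0, +1}, their logical behavior (not the memristor-CMOS circuits themselves) can be modeled in a few lines; the function names mirror the paper's TMIN/TMAX terminology.

```python
# Sketch: behavioral model of balanced ternary gates over {-1, 0, +1}.
TERNARY = (-1, 0, 1)

def tmin(a, b): return min(a, b)   # balanced ternary AND analogue
def tmax(a, b): return max(a, b)   # balanced ternary OR analogue
def tinv(a):    return -a          # balanced ternary inverter

# Truth table for TMIN over all nine input pairs.
for a in TERNARY:
    for b in TERNARY:
        print(f"TMIN({a:+d}, {b:+d}) = {tmin(a, b):+d}")
```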

Half-Duplex APs with Dynamic TDD vs. Full-Duplex APs in Cell-Free Systems

  • paper_url: http://arxiv.org/abs/2309.01481
  • repo_url: None
  • paper_authors: Anubhab Chowdhury, Chandra R. Murthy
  • for: This paper presents a comparative study of half-duplex (HD) access points (APs) with dynamic time-division duplex (DTDD) versus full-duplex (FD) APs in cell-free (CF) systems.
  • methods: The paper proposes a novel pilot allocation scheme that minimizes the pilot length required to avoid pilot contamination among user equipments (UEs) served by at least one common AP, derives the sum spectral efficiency (SE) in closed form under zero-forcing combining and precoding, and presents a provably convergent algorithm for joint uplink-downlink power allocation and uplink/downlink mode scheduling of the APs.
  • results: Numerical results show the proposed algorithms outperform several benchmarks and that DTDD can match or exceed an FD CF system with similar antenna density, without the burden of intra-AP interference cancellation.
    Abstract In this paper, we present a comparative study of half-duplex (HD) access points (APs) with dynamic time-division duplex (DTDD) and full-duplex (FD) APs in cell-free (CF) systems. Although both DTDD and FD CF systems support concurrent downlink transmission and uplink reception capability, the sum spectral efficiency (SE) is limited by various cross-link interferences. We first present a novel pilot allocation scheme that minimizes the pilot length required to ensure no pilot contamination among the user equipments (UEs) served by at least one common AP. Then, we derive the sum SE in closed form, considering zero-forcing combining and precoding along with the signal-to-interference plus noise ratio optimal weighting at the central processing unit. We also present a provably convergent algorithm for joint uplink-downlink power allocation and uplink/downlink mode scheduling of the APs (for DTDD) to maximize the sum SE. Our numerical results illustrate the superiority of the proposed algorithms over several benchmarks and show that the sum SE with DTDD can outperform an FD CF system with similar antenna density. Thus, DTDD combined with CF is a promising alternative to FD that attains the same performance using HD APs, while obviating the burden of intra-AP interference cancellation.

A Unified Framework for Guiding Generative AI with Wireless Perception in Resource Constrained Mobile Edge Networks

  • paper_url: http://arxiv.org/abs/2309.01426
  • repo_url: None
  • paper_authors: Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Deepu Rajan, Shiwen Mao, Xuemin Shen
  • for: This paper provides a wireless-perception-guided generative AI (WiPe-GAI) framework for offering digital content generation services, i.e., AI-generated content (AIGC), in resource-constrained mobile edge networks.
  • methods: It proposes a new sequential multi-scale perception (SMSP) algorithm that predicts a user's skeleton from the channel state information (CSI) extracted from wireless signals, and a diffusion-model-based pricing mechanism that generates an optimal pricing strategy for service provisioning under limited network resources.
  • results: Experimental results show the proposed framework outperforms existing solutions in both skeleton prediction and optimal pricing strategy generation.
    Abstract With the significant advancements in artificial intelligence (AI) technologies and powerful computational capabilities, generative AI (GAI) has become a pivotal digital content generation technique for offering superior digital services. However, directing GAI towards desired outputs still suffers from the inherent instability of the AI model. In this paper, we design a novel framework that utilizes wireless perception to guide GAI (WiPe-GAI) for providing digital content generation service, i.e., AI-generated content (AIGC), in resource-constrained mobile edge networks. Specifically, we first propose a new sequential multi-scale perception (SMSP) algorithm to predict user skeleton based on the channel state information (CSI) extracted from wireless signals. This prediction then guides GAI to provide users with AIGC, such as virtual character generation. To ensure the efficient operation of the proposed framework in resource constrained networks, we further design a pricing-based incentive mechanism and introduce a diffusion model based approach to generate an optimal pricing strategy for the service provisioning. The strategy maximizes the user's utility while enhancing the participation of the virtual service provider (VSP) in AIGC provision. The experimental results demonstrate the effectiveness of the designed framework in terms of skeleton prediction and optimal pricing strategy generation comparing with other existing solutions.

Detection of Pedestrian Turning Motions to Enhance Indoor Map Matching Performance

  • paper_url: http://arxiv.org/abs/2309.01405
  • repo_url: None
  • paper_authors: Seunghyeon Park, Taewon Kang, Seungjae Lee, Joon Hyo Rhee
  • for: This research aims to improve indoor pedestrian dead reckoning (PDR) and map matching performance by detecting pedestrian turning motions from smartphone inertial measurement unit (IMU) data.
  • methods: Three turn detection approaches are studied: a threshold-based method, a hidden Markov model (HMM)-based method, and a pruned exact linear time (PELT) algorithm-based method.
  • results: In field tests, the threshold-based method had a missed detection rate of 20.35% and a false alarm rate of 7.65%; the PELT-based method achieved 8.93% and 6.97%; and the HMM-based method performed best with 5.14% and 2.00%, contributing to a more accurate and reliable pedestrian navigation system.
    Abstract A pedestrian navigation system (PNS) in indoor environments, where global navigation satellite system (GNSS) signal access is difficult, is necessary, particularly for search and rescue (SAR) operations in large buildings. This paper focuses on studying pedestrian walking behaviors to enhance the performance of indoor pedestrian dead reckoning (PDR) and map matching techniques. Specifically, our research aims to detect pedestrian turning motions using smartphone inertial measurement unit (IMU) information in a given PDR trajectory. To improve existing methods, including the threshold-based turn detection method, hidden Markov model (HMM)-based turn detection method, and pruned exact linear time (PELT) algorithm-based turn detection method, we propose enhanced algorithms that better detect pedestrian turning motions. During field tests, using the threshold-based method, we observed a missed detection rate of 20.35% and a false alarm rate of 7.65%. The PELT-based method achieved a significant improvement with a missed detection rate of 8.93% and a false alarm rate of 6.97%. However, the best results were obtained using the HMM-based method, which demonstrated a missed detection rate of 5.14% and a false alarm rate of 2.00%. In summary, our research contributes to the development of a more accurate and reliable pedestrian navigation system by leveraging smartphone IMU data and advanced algorithms for turn detection in indoor environments.
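A minimal sketch of the threshold-based baseline, assuming gyroscope yaw-rate input; the one-second window and 60-degree threshold are illustrative assumptions, not the paper's tuned values.

```python
# Sketch: threshold-based turn detection from smartphone gyroscope data.
import numpy as np

def detect_turns(yaw_rate, fs, window_s=1.0, thresh_deg=60.0):
    """yaw_rate: gyroscope z-axis in deg/s sampled at fs Hz."""
    win = int(window_s * fs)
    dt = 1.0 / fs
    # accumulated heading change over a sliding window
    heading_change = np.convolve(yaw_rate * dt, np.ones(win), mode="same")
    return np.abs(heading_change) > thresh_deg   # boolean per-sample flags

fs = 100.0
t = np.arange(0, 5, 1 / fs)
yaw = np.where((t > 2) & (t < 3), 90.0, 0.0)     # a 90-degree turn at t=2..3 s
print(detect_turns(yaw, fs).any())               # True
```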

Unlabelled Sensing with Priors: Algorithm and Bounds

  • paper_url: http://arxiv.org/abs/2309.01397
  • repo_url: None
  • paper_authors: Garweet Sresth, Ajit Rajwade, Satish Mulleti
  • for: This study considers a variant of unlabelled sensing in which the measurements are sparsely permuted and a few correspondences are known.
  • methods: An estimator is presented to recover the unknown vector, together with a theoretical upper bound on its $\ell_2$ reconstruction error.
  • results: Numerical experiments show the known correspondences significantly reduce reconstruction error, outperforming a classical robust regression estimator by up to 20% on the normalized reconstruction error metric in high-permutation regimes (>30%). The framework is also applied to a non-rigid motion estimation problem, where a few manually annotated point pairs improve motion estimation.
    Abstract In this study, we consider a variant of unlabelled sensing where the measurements are sparsely permuted, and additionally, a few correspondences are known. We present an estimator to solve for the unknown vector. We derive a theoretical upper bound on the $\ell_2$ reconstruction error of the unknown vector. Through numerical experiments, we demonstrate that the additional known correspondences result in a significant improvement in the reconstruction error. Additionally, we compare our estimator with the classical robust regression estimator and we find that our method outperforms it on the normalized reconstruction error metric by up to $20\%$ in the high permutation regimes $(>30\%)$. Lastly, we showcase the practical utility of our framework on a non-rigid motion estimation problem. We show that using a few manually annotated point pairs, together with key-point (SIFT-based) descriptor pairs whose correspondences are unknown or incorrectly known, can improve motion estimation.
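The paper's estimator and error bound are not reproduced here; as a hedged illustration of the problem setup, the sketch below alternates least-squares fitting with a linear-assignment step, seeding the fit from the few known correspondences. All problem sizes and the noise level are arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, d, known = 60, 5, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
# permutation that fixes the first `known` correspondences
perm = np.r_[np.arange(known), known + rng.permutation(n - known)]
y = (A @ x_true)[perm] + 0.01 * rng.normal(size=n)

# warm start from the rows whose correspondences are known
x = np.linalg.lstsq(A[:known], y[:known], rcond=None)[0]
for _ in range(10):
    # assign each remaining measurement to a design row by squared residual cost
    cost = (y[known:, None] - (A[known:] @ x)[None, :]) ** 2
    _, match = linear_sum_assignment(cost)
    A_hat = np.vstack([A[:known], A[known:][match]])
    x = np.linalg.lstsq(A_hat, y, rcond=None)[0]

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))  # relative error
```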

White paper on LiDAR performance against selected Automotive Paints

  • paper_url: http://arxiv.org/abs/2309.01346
  • repo_url: None
  • paper_authors: James Lee Wei Shung, Paul Hibbard, Roshan Vijay, Lincoln Ang Hon Kin, Niels de Boer
  • for: understand the impact of automotive paint on LiDAR sensor performance.
  • methods: evaluate the average reflected intensity output by different LiDAR sensor models when tested against different types of automotive paint (13 paint panels and 3 sensors, chosen to represent paints commonly seen on vehicles in Singapore).
  • results: darker-coloured paints generally returned lower reflected intensity, whereas lighter-coloured paints exhibited higher intensity values.
    Abstract LiDAR (Light Detection and Ranging) is a useful sensing technique and an important source of data for autonomous vehicles (AVs). In this publication we present the results of a study undertaken to understand the impact of automotive paint on LiDAR performance along with a methodology used to conduct this study. Our approach consists of evaluating the average reflected intensity output by different LiDAR sensor models when tested with different types of automotive paints. The paints were chosen to represent common paints found on vehicles in Singapore. The experiments were conducted with LiDAR sensors commonly used by autonomous vehicle (AV) developers and OEMs. The paints used were also selected based on those observed in real-world conditions. This stems from a desire to model real-world performance of actual sensing systems when exposed to the physical world. The goal is then to inform regulators of AVs in Singapore of the impact of automotive paint on LiDAR performance, so that they can determine testing standards and specifications which will better reflect real-world performance and also better assess the adequacy of LiDAR systems installed for local AV operations. The tests were conducted for a combination of 13 different paint panels and 3 LiDAR sensors. In general, it was observed that darker coloured paints have lower reflection intensity whereas lighter coloured paints exhibited higher intensity values.
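The evaluation reduces to aggregating per-return intensities by paint panel. A minimal sketch on synthetic data; real panel IDs and the intensity scale would come from the sensor logs.

```python
import numpy as np

# hypothetical per-return records from one sweep: panel id and raw intensity
rng = np.random.default_rng(1)
panel_ids = rng.integers(0, 13, size=5000)       # 13 paint panels, as in the study
intensity = rng.uniform(0.0, 255.0, size=5000)   # placeholder intensity scale

# average reflected intensity per paint panel, the study's core metric
means = np.array([intensity[panel_ids == p].mean() for p in range(13)])
for p in np.argsort(means):                      # darkest-returning panels first
    print(f"panel {p:2d}: mean intensity {means[p]:6.1f}")
```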

Fault Point Detection for Recovery Planning of Resilient Grid

  • paper_url: http://arxiv.org/abs/2309.01345
  • repo_url: None
  • paper_authors: Hideya Yoshiuchi, Haruna Takaaki, Swapnil Bembde
  • for: reduce the economic losses of disaster-induced power outages, which are increasing in Japan with the rise of large-scale meteorological disasters, by shortening outage duration.
  • methods: PACEM (Poles-Aware moving Cost Estimation Method), which determines travel costs between failure points from the tilt angle and direction of electric poles obtained from pole-mounted sensors together with road condition data.
  • results: evaluation shows the total recovery time in the target area can be reduced by 28%.
    Abstract Large-scale meteorological disasters are increasing around the world, and power outage damage from natural disasters such as typhoons and earthquakes is increasing in Japan as well. To reduce the economic losses caused by power outages, we are promoting research on resilient grids that minimize power outage duration. In this report, we propose PACEM (Poles-Aware moving Cost Estimation Method) for determining travel costs between failure points based on the tilt angle and direction of electric poles obtained from pole-mounted sensors and road condition data. Evaluation results show that the total recovery time can be reduced by 28% in the target area.
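PACEM's exact cost model is not given in the abstract; the sketch below assumes a simple penalty that inflates a road segment's travel time with the maximum pole tilt observed along it, then runs Dijkstra over the resulting graph. Node names, the 5-degree tilt threshold, and the 0.1 penalty slope are placeholders.

```python
import heapq

def dijkstra(adj, src):
    """Shortest travel times over a road graph; adj[u] = [(v, cost), ...]."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def edge_cost(base_minutes, max_pole_tilt_deg):
    # assumption: a strongly leaning pole implies debris/blockage on the road,
    # inflating travel time; the actual PACEM cost model is not public here
    penalty = 1.0 + 0.1 * max(0.0, max_pole_tilt_deg - 5.0)
    return base_minutes * penalty

adj = {
    "depot":   [("fault_A", edge_cost(10, 2)), ("fault_B", edge_cost(12, 25))],
    "fault_A": [("fault_B", edge_cost(6, 0))],
}
print(dijkstra(adj, "depot"))   # travel-time estimates to each failure point
```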

cs.SD - 2023-09-03

NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.01212
  • repo_url: None
  • paper_authors: Wen Wang, Dongchao Yang, Qichen Ye, Bowen Cao, Yuexian Zou
  • for: improve speech enhancement (SE), i.e., removing background interference from noisy speech signals.
  • methods: a diffusion model (DM) under the generator-plus-conditioner (GPC) structure with a task-adapted diffusion process; a noise representation extracted from the noisy speech serves as global conditional information for estimating non-Gaussian components, anchor-based inference balances speech distortion against residual noise, and three multi-stage variants with Mel-spectrogram preprocessing networks address conditional domain bias.
  • results: experiments show the noise-aware diffusion-based model (NADiffuSE) outperforms other DM-based SE models under the GPC infrastructure while preserving speech quality across noise conditions.
    Abstract The goal of speech enhancement (SE) is to eliminate the background interference from the noisy speech signal. Generative models such as diffusion models (DM) have been applied to the task of SE because of better generalization in unseen noisy scenes. Technical routes for the DM-based SE methods can be summarized into three types: task-adapted diffusion process formulation, generator-plus-conditioner (GPC) structures and the multi-stage frameworks. We focus on the first two approaches, which are constructed under the GPC architecture and use the task-adapted diffusion process to better deal with the real noise. However, the performance of these SE models is limited by the following issues: (a) Non-Gaussian noise estimation in the task-adapted diffusion process. (b) Conditional domain bias caused by the weak conditioner design in the GPC structure. (c) Large amount of residual noise caused by unreasonable interpolation operations during inference. To solve the above problems, we propose a noise-aware diffusion-based SE model (NADiffuSE) to boost the SE performance, where the noise representation is extracted from the noisy speech signal and introduced as a global conditional information for estimating the non-Gaussian components. Furthermore, the anchor-based inference algorithm is employed to achieve a compromise between the speech distortion and noise residual. In order to mitigate the performance degradation caused by the conditional domain bias in the GPC framework, we investigate three model variants, all of which can be viewed as multi-stage SE based on the preprocessing networks for Mel spectrograms. Experimental results show that NADiffuSE outperforms other DM-based SE models under the GPC infrastructure. Audio samples are available at: https://square-of-w.github.io/NADiffuSE-demo/.

MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling

  • paper_url: http://arxiv.org/abs/2309.01142
  • repo_url: None
  • paper_authors: Zhichao Wang, Xinsheng Wang, Qicong Xie, Tao Li, Lei Xie, Qiao Tian, Yuping Wang
  • for: voice conversion (VC) that preserves the speaking style of the source speech while achieving high-quality conversion, which is essential for highly expressive scenarios such as dubbing and data augmentation.
  • methods: a multi-scale style modeling method (MSM-VC) that models source style at three levels: prosodic features for frame-level style, bottleneck features from a pre-trained ASR model for local style, and features from a self-supervised model for global style; an explicit constraint module (a pre-trained speech emotion recognition model plus a speaker classifier) balances source style modeling against target-speaker timbre preservation.
  • results: experiments on a highly expressive speech corpus show MSM-VC models source speech style better than state-of-the-art VC methods while maintaining good speech quality and speaker similarity.
    Abstract In addition to conveying the linguistic content from source speech to converted speech, maintaining the speaking style of source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or fixed-length style embedding extracted from source speech to model the speaking style of source speech, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by the style's multi-scale nature of human speech, a multi-scale style modeling method for the VC task, referred to as MSM-VC, is proposed in this paper. MSM-VC models the speaking style of source speech from different levels. To effectively convey the speaking style and meanwhile prevent timbre leakage from source speech to converted speech, each level's style is modeled by specific representation. Specifically, prosodic features, pre-trained ASR model's bottleneck features, and features extracted by a model trained with a self-supervised strategy are adopted to model the frame, local, and global-level styles, respectively. Besides, to balance the performance of source style modeling and target speaker timbre preservation, an explicit constraint module consisting of a pre-trained speech emotion recognition model and a speaker classifier is introduced to MSM-VC. This explicit constraint module also makes it possible to simulate the style transfer inference process during the training to improve the disentanglement ability and alleviate the mismatch between training and inference. Experiments performed on the highly expressive speech corpus demonstrate that MSM-VC is superior to the state-of-the-art VC methods for modeling source speech style while maintaining good speech quality and speaker similarity.

cs.CV - 2023-09-03

MAP: Domain Generalization via Meta-Learning on Anatomy-Consistent Pseudo-Modalities

  • paper_url: http://arxiv.org/abs/2309.01286
  • repo_url: None
  • paper_authors: Dewei Hu, Hao Li, Han Liu, Xing Yao, Jiacheng Wang, Ipek Oguz
  • for: improve the clinical applicability of deep models by strengthening generalization to unseen domains, here for retinal vessel segmentation.
  • methods: Meta learning on Anatomy-consistent Pseudo-modalities (MAP): a feature extraction network generates three pseudo-modalities that share the vessel structure of the original image; episodic learning selects one pseudo-modality as the meta-train set and meta-tests on a continuously augmented image space produced by Dirichlet mixup of the remaining pseudo-modalities; two loss functions cluster the latent vectors of images with identical vasculature so the model focuses on shape information.
  • results: on seven public datasets spanning several retinal imaging modalities, MAP shows substantially better generalizability.
    Abstract Deep models suffer from limited generalization capability to unseen domains, which has severely hindered their clinical applicability. Specifically for the retinal vessel segmentation task, although the model is supposed to learn the anatomy of the target, it can be distracted by confounding factors like intensity and contrast. We propose Meta learning on Anatomy-consistent Pseudo-modalities (MAP), a method that improves model generalizability by learning structural features. We first leverage a feature extraction network to generate three distinct pseudo-modalities that share the vessel structure of the original image. Next, we use the episodic learning paradigm by selecting one of the pseudo-modalities as the meta-train dataset, and perform meta-testing on a continuous augmented image space generated through Dirichlet mixup of the remaining pseudo-modalities. Further, we introduce two loss functions that facilitate the model's focus on shape information by clustering the latent vectors obtained from images featuring identical vasculature. We evaluate our model on seven public datasets of various retinal imaging modalities and we conclude that MAP has substantially better generalizability. Our code is publically available at https://github.com/DeweiHu/MAP.
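As an illustration of the Dirichlet mixup used for meta-testing, the sketch below blends K anatomy-consistent pseudo-modalities with random convex weights; the array shapes and alpha value are assumptions.

```python
import numpy as np

def dirichlet_mixup(pseudo_modalities, alpha=1.0, rng=None):
    """Blend K anatomy-consistent pseudo-modalities with Dirichlet weights.

    pseudo_modalities: array of shape (K, H, W) sharing the same vessel
    structure; the convex combination preserves anatomy while continuously
    varying appearance (illustrative of MAP's meta-test augmentation).
    """
    rng = rng or np.random.default_rng()
    k = pseudo_modalities.shape[0]
    w = rng.dirichlet(alpha * np.ones(k))       # non-negative weights summing to 1
    return np.tensordot(w, pseudo_modalities, axes=1)

rng = np.random.default_rng(0)
fake = rng.uniform(size=(3, 64, 64))            # three pseudo-modalities
mixed = dirichlet_mixup(fake, alpha=0.5, rng=rng)
print(mixed.shape, mixed.min() >= 0.0)
```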

FOR-instance: a UAV laser scanning benchmark dataset for semantic and instance segmentation of individual trees

  • paper_url: http://arxiv.org/abs/2309.01279
  • repo_url: None
  • paper_authors: Stefano Puliti, Grant Pearse, Peter Surový, Luke Wallace, Markus Hollaus, Maciej Wielgosz, Rasmus Astrup
  • for: advance instance and semantic segmentation of individual trees in dense airborne (UAV) laser scanning data.
  • methods: five curated, ML-ready UAV laser scanning collections from diverse global locations and forest types, manually annotated into individual trees (instances) and semantic classes (e.g., stem, woody branches, live branches, terrain, low vegetation), split into development and test subsets with specific usage guidelines.
  • results: a standardized benchmark that supports instance and semantic segmentation, adapts to different deep learning frameworks and segmentation strategies, and, through the included diameter-at-breast-height data, extends to measuring a classic tree variable.
    Abstract The FOR-instance dataset (available at https://doi.org/10.5281/zenodo.8287792) addresses the challenge of accurate individual tree segmentation from laser scanning data, crucial for understanding forest ecosystems and sustainable management. Despite the growing need for detailed tree data, automating segmentation and tracking scientific progress remains difficult. Existing methodologies often overfit small datasets and lack comparability, limiting their applicability. Amid the progress triggered by the emergence of deep learning methodologies, standardized benchmarking assumes paramount importance in these research domains. This data paper introduces a benchmarking dataset for dense airborne laser scanning data, aimed at advancing instance and semantic segmentation techniques and promoting progress in 3D forest scene segmentation. The FOR-instance dataset comprises five curated and ML-ready UAV-based laser scanning data collections from diverse global locations, representing various forest types. The laser scanning data were manually annotated into individual trees (instances) and different semantic classes (e.g. stem, woody branches, live branches, terrain, low vegetation). The dataset is divided into development and test subsets, enabling method advancement and evaluation, with specific guidelines for utilization. It supports instance and semantic segmentation, offering adaptability to deep learning frameworks and diverse segmentation strategies, while the inclusion of diameter at breast height data expands its utility to the measurement of a classic tree variable. In conclusion, the FOR-instance dataset contributes to filling a gap in the 3D forest research, enhancing the development and benchmarking of segmentation algorithms for dense airborne laser scanning data.

Diffusion Models with Deterministic Normalizing Flow Priors

  • paper_url: http://arxiv.org/abs/2309.01274
  • repo_url: https://github.com/mohsenzand/dinof
  • paper_authors: Mohsen Zand, Ali Etemad, Michael Greenspan
  • for: faster sampling and higher sample quality in generative image modeling.
  • methods: combine normalizing flows with diffusion models: the flow deterministically parameterizes the noisy data at an intermediate diffusion step and serves as the prior for the reverse diffusion process.
  • results: strong performance on standard image generation benchmarks, e.g., an FID of 2.01 and an Inception score of 9.96 on unconditional CIFAR10, and an FID of 7.11 on CelebA-HQ-256.
    Abstract For faster sampling and higher sample quality, we propose DiNof ($\textbf{Di}$ffusion with $\textbf{No}$rmalizing $\textbf{f}$low priors), a technique that makes use of normalizing flows and diffusion models. We use normalizing flows to parameterize the noisy data at any arbitrary step of the diffusion process and utilize it as the prior in the reverse diffusion process. More specifically, the forward noising process turns a data distribution into partially noisy data, which are subsequently transformed into a Gaussian distribution by a nonlinear process. The backward denoising procedure begins with a prior created by sampling from the Gaussian distribution and applying the invertible normalizing flow transformations deterministically. To generate the data distribution, the prior then undergoes the remaining diffusion stochastic denoising procedure. Through the reduction of the number of total diffusion steps, we are able to speed up both the forward and backward processes. More importantly, we improve the expressive power of diffusion models by employing both deterministic and stochastic mappings. Experiments on standard image generation datasets demonstrate the advantage of the proposed method over existing approaches. On the unconditional CIFAR10 dataset, for example, we achieve an FID of 2.01 and an Inception score of 9.96. Our method also demonstrates competitive performance on CelebA-HQ-256 dataset as it obtains an FID score of 7.11. Code is available at https://github.com/MohsenZand/DiNof.

SOAR: Scene-debiasing Open-set Action Recognition

  • paper_url: http://arxiv.org/abs/2309.01265
  • repo_url: https://github.com/yhZhai/SOAR
  • paper_authors: Yuanhao Zhai, Ziyi Liu, Zhenyu Wu, Yi Wu, Chunluan Zhou, David Doermann, Junsong Yuan, Gang Hua
  • for: mitigating the risk of utilizing spurious clues in open-set action recognition
  • methods: adversarial scene reconstruction module, adaptive adversarial scene classification module
  • results: better mitigation of scene bias; outperforms state-of-the-art methods.
    Abstract Deep learning models have a risk of utilizing spurious clues to make predictions, such as recognizing actions based on the background scene. This issue can severely degrade the open-set action recognition performance when the testing samples have different scene distributions from the training samples. To mitigate this problem, we propose a novel method, called Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module. The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning. The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information. In addition, we design an experiment to quantify the scene bias. The results indicate that the current open-set action recognizers are biased toward the scene, and our proposed SOAR method better mitigates such bias. Furthermore, our extensive experiments demonstrate that our method outperforms state-of-the-art methods, and the ablation studies confirm the effectiveness of our proposed modules.
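SOAR's adversarial modules are more elaborate than this, but the core adversarial coupling can be illustrated with a standard gradient reversal layer: the scene classifier trains normally while the reversed gradient pushes the backbone toward scene-invariant features.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient (scaled by lam) backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # reverse gradient into the backbone

feat = torch.randn(4, 128, requires_grad=True)       # stand-in video features
scene_head = torch.nn.Linear(128, 10)                 # scene-type classifier
logits = scene_head(GradReverse.apply(feat, 1.0))
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,)))
loss.backward()
print(feat.grad.shape)   # gradients reach the features, sign-flipped
```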

Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2309.01262
  • repo_url: None
  • paper_authors: Hyeongju Choi, Apoorva Beedu, Irfan Essa
  • for: improve human activity recognition (HAR) with self-supervised learning, reducing the cost and difficulty of obtaining annotated data for everyday applications.
  • methods: a hard negative sampling method for multimodal HAR with a hard negative sampling loss for skeleton and IMU data pairs, exploiting negatives that have different labels from the anchor but are projected nearby in the latent space, controlled by an adjustable concentration parameter.
  • results: on the UTD-MHAD and MMAct benchmarks the approach learns strong feature representations, including in limited-data settings, outperforming all state-of-the-art methods on UTD-MHAD and self-supervised methods on MMAct cross-session, even when uni-modal data are used downstream.
    Abstract Human Activity Recognition (HAR) systems have been extensively studied by the vision and ubiquitous computing communities due to their practical applications in daily life, such as smart homes, surveillance, and health monitoring. Typically, this process is supervised in nature and the development of such systems requires access to large quantities of annotated data. However, the higher costs and challenges associated with obtaining good quality annotations have rendered the application of self-supervised methods an attractive option and contrastive learning comprises one such method. However, a major component of successful contrastive learning is the selection of good positive and negative samples. Although positive samples are directly obtainable, sampling good negative samples remain a challenge. As human activities can be recorded by several modalities like camera and IMU sensors, we propose a hard negative sampling method for multimodal HAR with a hard negative sampling loss for skeleton and IMU data pairs. We exploit hard negatives that have different labels from the anchor but are projected nearby in the latent space using an adjustable concentration parameter. Through extensive experiments on two benchmark datasets: UTD-MHAD and MMAct, we demonstrate the robustness of our approach forlearning strong feature representation for HAR tasks, and on the limited data setting. We further show that our model outperforms all other state-of-the-art methods for UTD-MHAD dataset, and self-supervised methods for MMAct: Cross session, even when uni-modal data are used during downstream activity recognition.
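The paper's exact loss is not reproduced here; the sketch below shows a generic hard-negative-weighted InfoNCE in which a concentration parameter beta exponentially up-weights negatives close to the anchor (beta = 0 recovers plain InfoNCE). Batch size and embedding dimension are arbitrary.

```python
import torch
import torch.nn.functional as F

def hard_negative_nce(z1, z2, beta=1.0, tau=0.1):
    """Contrastive loss that up-weights hard negatives.

    z1, z2: L2-normalized embeddings of two modalities (e.g. skeleton and
    IMU) for the same batch of clips; row i of each is a positive pair.
    """
    sim = z1 @ z2.t() / tau                                   # (B, B) logits
    b = sim.size(0)
    pos = sim.diag()
    neg = sim.masked_fill(torch.eye(b, dtype=torch.bool), float("-inf"))
    w = F.softmax(beta * neg, dim=1)                          # hardness weights
    neg_term = (w * neg.exp()).sum(dim=1) * (b - 1)           # weighted negatives
    return (-pos + torch.log(pos.exp() + neg_term)).mean()

z1 = F.normalize(torch.randn(8, 64), dim=1)
z2 = F.normalize(torch.randn(8, 64), dim=1)
print(hard_negative_nce(z1, z2, beta=1.0).item())
```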

S2RF: Semantically Stylized Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.01252
  • repo_url: None
  • paper_authors: Dishani Lahiri, Neeraj Panse, Moneish Kumar
  • for: transfer the style of arbitrary image(s) onto objects within a 3D scene.
  • methods: a novel nearest-neighborhood-based loss that enables flexible 3D scene reconstruction while capturing intricate style details and ensuring multi-view consistency.
  • results: customizable, stylized scene images from arbitrary viewpoints with multi-view consistency.
    Abstract We present our method for transferring style from any arbitrary image(s) to object(s) within a 3D scene. Our primary objective is to offer more control in 3D scene stylization, facilitating the creation of customizable and stylized scene images from arbitrary viewpoints. To achieve this, we propose a novel approach that incorporates nearest neighborhood-based loss, allowing for flexible 3D scene reconstruction while effectively capturing intricate style details and ensuring multi-view consistency.
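A minimal sketch of a nearest-neighbor feature loss in the spirit described above: each rendered feature is pulled toward its closest style feature, transferring local style statistics without spatial alignment. S2RF's actual loss and feature extractor may differ in detail.

```python
import torch

def nn_feature_loss(feat_render, feat_style):
    """Nearest-neighbor feature matching loss (illustrative).

    feat_render: (N, C) features of the rendered scene; feat_style: (M, C)
    features of the style image.
    """
    d = torch.cdist(feat_render, feat_style)   # (N, M) pairwise distances
    return d.min(dim=1).values.mean()          # pull each render feature to its NN

render = torch.randn(256, 64, requires_grad=True)
style = torch.randn(512, 64)
loss = nn_feature_loss(render, style)
loss.backward()
print(loss.item(), render.grad.norm().item())
```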

Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning

  • paper_url: http://arxiv.org/abs/2309.01246
  • repo_url: https://github.com/yhZhai/WSCL
  • paper_authors: Yuanhao Zhai, Tianyu Luan, David Doermann, Junsong Yuan
  • for: image manipulation detection with weak supervision, so that only binary image-level labels (authentic or tampered) are needed for training, which allows using more training images and adapting quickly to new manipulation techniques.
  • methods: weakly-supervised self-consistency learning (WSCL) with two learned consistency properties: multi-source consistency (MSC), which exploits content-agnostic information and enables cross-source learning via online pseudo-label generation and refinement, and inter-patch consistency (IPC), which reasons over global pair-wise patch relationships to discover the complete manipulated region.
  • results: although weakly supervised, the method is competitive with its fully-supervised counterpart under both in-distribution and out-of-distribution evaluation, and localizes manipulations reasonably well.
    Abstract As advanced image manipulation techniques emerge, detecting the manipulation becomes increasingly important. Despite the success of recent learning-based approaches for image manipulation detection, they typically require expensive pixel-level annotations to train, while exhibiting degraded performance when testing on images that are differently manipulated compared with training images. To address these limitations, we propose weakly-supervised image manipulation detection, such that only binary image-level labels (authentic or tampered with) are required for training purposes. Such a weakly-supervised setting can leverage more training images and has the potential to adapt quickly to new manipulation techniques. To improve the generalization ability, we propose weakly-supervised self-consistency learning (WSCL) to leverage the weakly annotated images. Specifically, two consistency properties are learned: multi-source consistency (MSC) and inter-patch consistency (IPC). MSC exploits different content-agnostic information and enables cross-source learning via an online pseudo label generation and refinement process. IPC performs global pair-wise patch-patch relationship reasoning to discover a complete region of manipulation. Extensive experiments validate that our WSCL, even though it is weakly supervised, exhibits competitive performance compared with its fully-supervised counterpart under both in-distribution and out-of-distribution evaluations, as well as reasonable manipulation localization ability.

BodySLAM++: Fast and Tightly-Coupled Visual-Inertial Camera and Human Motion Tracking

  • paper_url: http://arxiv.org/abs/2309.01236
  • repo_url: None
  • paper_authors: Dorian F. Henning, Christopher Choi, Simon Schaefer, Stefan Leutenegger
  • for: robust, fast, and accurate human state estimation (6D pose and posture), in particular the real-time estimation demanded by real-world applications.
  • methods: a visual-inertial framework that extends the OKVIS2 state estimator to solve the dual task of estimating camera and human states simultaneously.
  • results: accuracy improvements of 26% for human and 12% for camera state estimation over baseline methods, with real-time performance at 15+ frames per second on an Intel i7-class CPU, evaluated on a custom dataset with ground-truth human and camera poses from an indoor motion tracking system.
    Abstract Robust, fast, and accurate human state - 6D pose and posture - estimation remains a challenging problem. For real-world applications, the ability to estimate the human state in real-time is highly desirable. In this paper, we present BodySLAM++, a fast, efficient, and accurate human and camera state estimation framework relying on visual-inertial data. BodySLAM++ extends an existing visual-inertial state estimation framework, OKVIS2, to solve the dual task of estimating camera and human states simultaneously. Our system improves the accuracy of both human and camera state estimation with respect to baseline methods by 26% and 12%, respectively, and achieves real-time performance at 15+ frames per second on an Intel i7-model CPU. Experiments were conducted on a custom dataset containing both ground truth human and camera poses collected with an indoor motion tracking system.

Generalizability and Application of the Skin Reflectance Estimate Based on Dichromatic Separation (SREDS)

  • paper_url: http://arxiv.org/abs/2309.01235
  • repo_url: https://github.com/josephdrahos/sreds
  • paper_authors: Joseph Drahos, Richard Plesh, Keivan Bahmani, Mahesh Banavar, Stephanie Schuckers
  • for: provide a reliable skin tone metric so that performance differences in face recognition systems related to skin tone can be measured without relying on self-reported race labels.
  • methods: further analysis of the generalizability of the Skin Reflectance Estimate based on Dichromatic Separation (SREDS) against other skin tone metrics, and a use case substituting SREDS scores for race labels in a privacy-preserving learning solution.
  • results: SREDS consistently yields a skin tone metric with lower within-subject variability and can replace self-reported race labels with minimal drop in performance; an open-source implementation is released to support the research community.
    Abstract Face recognition (FR) systems have become widely used and readily available in recent history. However, differential performance between certain demographics has been identified within popular FR models. Skin tone differences between demographics can be one of the factors contributing to the differential performance observed in face recognition models. Skin tone metrics provide an alternative to self-reported race labels when such labels are lacking or completely not available e.g. large-scale face recognition datasets. In this work, we provide a further analysis of the generalizability of the Skin Reflectance Estimate based on Dichromatic Separation (SREDS) against other skin tone metrics and provide a use case for substituting race labels for SREDS scores in a privacy-preserving learning solution. Our findings suggest that SREDS consistently creates a skin tone metric with lower variability within each subject and SREDS values can be utilized as an alternative to the self-reported race labels at minimal drop in performance. Finally, we provide a publicly available and open-source implementation of SREDS to help the research community. Available at https://github.com/JosephDrahos/SREDS

Spectral Adversarial MixUp for Few-Shot Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.01207
  • repo_url: https://github.com/RPIDIAL/SAMix
  • paper_authors: Jiajin Zhang, Hanqing Chao, Amit Dhurandhar, Pin-Yu Chen, Ali Tajer, Yangyang Xu, Pingkun Yan
  • for: address few-shot unsupervised domain adaptation (FSUDA) in clinical applications, where only a limited number of unlabeled target-domain samples are available for training.
  • methods: a spectral sensitivity map that characterizes a model's generalization weaknesses in the frequency domain, and sensitivity-guided spectral adversarial mixup (SAMix), which generates target-style images that suppress model sensitivity.
  • results: rigorous evaluation on multiple tasks and public datasets shows improved model generalizability in the target domain.
    Abstract Domain shift is a common problem in clinical applications, where the training images (source domain) and the test images (target domain) are under different distributions. Unsupervised Domain Adaptation (UDA) techniques have been proposed to adapt models trained in the source domain to the target domain. However, those methods require a large number of images from the target domain for model training. In this paper, we propose a novel method for Few-Shot Unsupervised Domain Adaptation (FSUDA), where only a limited number of unlabeled target domain samples are available for training. To accomplish this challenging task, first, a spectral sensitivity map is introduced to characterize the generalization weaknesses of models in the frequency domain. We then developed a Sensitivity-guided Spectral Adversarial MixUp (SAMix) method to generate target-style images to effectively suppresses the model sensitivity, which leads to improved model generalizability in the target domain. We demonstrated the proposed method and rigorously evaluated its performance on multiple tasks using several public datasets.
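SAMix's sensitivity-guided component is omitted here; the sketch below shows the underlying spectral mixup idea: keep the source image's Fourier phase (mostly structure) and linearly mix its amplitude (mostly appearance) with a target-domain image.

```python
import numpy as np

def spectral_mixup(src, tgt, lam=0.5):
    """Mix the Fourier amplitude of a source image with a target-domain image.

    Phase is kept from the source; amplitude is linearly mixed, producing a
    target-styled image. SAMix additionally guides the mixing with a learned
    sensitivity map, which this plain sketch omits.
    """
    fs, ft = np.fft.fft2(src), np.fft.fft2(tgt)
    amp = (1 - lam) * np.abs(fs) + lam * np.abs(ft)   # mixed amplitude
    mixed = amp * np.exp(1j * np.angle(fs))           # source phase retained
    return np.real(np.fft.ifft2(mixed))

rng = np.random.default_rng(0)
src, tgt = rng.uniform(size=(64, 64)), rng.uniform(size=(64, 64))
print(spectral_mixup(src, tgt).shape)
```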

MAGMA: Music Aligned Generative Motion Autodecoder

  • paper_url: http://arxiv.org/abs/2309.01202
  • repo_url: None
  • paper_authors: Sohan Anisetty, Amit Raj, James Hays
  • for: map music to dance, a task demanding spatial and temporal coherence along with continual synchronization with the music's progression.
  • methods: a two-step approach: a Vector Quantized-Variational Autoencoder (VQ-VAE) distills motion into primitives, and a Transformer decoder learns the correct sequencing of these primitives; the work also compares naive Librosa music features against deep audio representations, and studies relative versus absolute positional encodings for arbitrarily long sequences.
  • results: state-of-the-art results on music-to-motion generation benchmarks, real-time generation of considerably longer motion sequences, seamless chaining of multiple sequences, and easy customization to style requirements.
    Abstract Mapping music to dance is a challenging problem that requires spatial and temporal coherence along with a continual synchronization with the music's progression. Taking inspiration from large language models, we introduce a 2-step approach for generating dance using a Vector Quantized-Variational Autoencoder (VQ-VAE) to distill motion into primitives and train a Transformer decoder to learn the correct sequencing of these primitives. We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms. Additionally, we train variations of the motion generator using relative and absolute positional encodings to determine the effect on generated motion quality when generating arbitrarily long sequence lengths. Our proposed approach achieve state-of-the-art results in music-to-motion generation benchmarks and enables the real-time generation of considerably longer motion sequences, the ability to chain multiple motion sequences seamlessly, and easy customization of motion sequences to meet style requirements.
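The first stage can be illustrated with the standard vector-quantization step of a VQ-VAE: each frame embedding is snapped to its nearest codebook entry (a motion primitive), with a straight-through gradient for training. Shapes and codebook size below are placeholders.

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-neighbour codebook lookup with a straight-through gradient.

    z: (B, T, D) encoder outputs; codebook: (K, D) motion primitives.
    Continuous motion features become a discrete primitive sequence that a
    Transformer decoder can model autoregressively.
    """
    d = torch.cdist(z.reshape(-1, z.size(-1)), codebook)  # (B*T, K)
    idx = d.argmin(dim=1)
    q = codebook[idx].view_as(z)
    q = z + (q - z).detach()            # straight-through estimator
    return q, idx.view(z.shape[:2])

z = torch.randn(2, 16, 32, requires_grad=True)
codebook = torch.randn(128, 32)
q, idx = vector_quantize(z, codebook)
print(q.shape, idx.shape)               # (2, 16, 32) (2, 16)
```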

Holistic Dynamic Frequency Transformer for Image Fusion and Exposure Correction

  • paper_url: http://arxiv.org/abs/2309.01183
  • repo_url: None
  • paper_authors: Xiaoke Shang, Gehui Li, Zhiying Jiang, Shaomin Zhang, Nai Ding, Jinyuan Liu
  • for: improve image quality by handling exposure-related problems in a unified way.
  • methods: frequency-domain processing replaces conventional spatial correlation computation: Holistic Frequency Attention and a Dynamic Frequency Feed-Forward Network form a U-shaped Holistic Dynamic Frequency Transformer that extracts global information and dynamically selects important frequency bands; a Laplacian pyramid decomposes images into frequency bands, each recovered by a dedicated restorer before pyramid fusion.
  • results: state-of-the-art results on mainstream datasets, unifying low-light enhancement, exposure correction, and multi-exposure fusion in one framework.
    Abstract The correction of exposure-related issues is a pivotal component in enhancing the quality of images, offering substantial implications for various computer vision tasks. Historically, most methodologies have predominantly utilized spatial domain recovery, offering limited consideration to the potentialities of the frequency domain. Additionally, there has been a lack of a unified perspective towards low-light enhancement, exposure correction, and multi-exposure fusion, complicating and impeding the optimization of image processing. In response to these challenges, this paper proposes a novel methodology that leverages the frequency domain to improve and unify the handling of exposure correction tasks. Our method introduces Holistic Frequency Attention and Dynamic Frequency Feed-Forward Network, which replace conventional correlation computation in the spatial-domain. They form a foundational building block that facilitates a U-shaped Holistic Dynamic Frequency Transformer as a filter to extract global information and dynamically select important frequency bands for image restoration. Complementing this, we employ a Laplacian pyramid to decompose images into distinct frequency bands, followed by multiple restorers, each tuned to recover specific frequency-band information. The pyramid fusion allows a more detailed and nuanced image restoration process. Ultimately, our structure unifies the three tasks of low-light enhancement, exposure correction, and multi-exposure fusion, enabling comprehensive treatment of all classical exposure errors. Benchmarking on mainstream datasets for these tasks, our proposed method achieves state-of-the-art results, paving the way for more sophisticated and unified solutions in exposure correction.
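The Laplacian pyramid step is standard; here is a minimal sketch of decomposing an image into frequency bands (one per restorer) and verifying that the decomposition is invertible. The level count is arbitrary.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Decompose an image into Laplacian frequency bands plus a residual."""
    bands, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        bands.append(cur - up)          # high-frequency band at this scale
        cur = down
    bands.append(cur)                   # low-frequency residual
    return bands

def reconstruct(bands):
    cur = bands[-1]
    for band in reversed(bands[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(band.shape[1], band.shape[0])) + band
    return cur

img = np.random.rand(128, 128).astype(np.float32)
bands = laplacian_pyramid(img)
print(np.abs(reconstruct(bands) - img).max())   # ~0: decomposition is invertible
```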

Deep Unfolding Convolutional Dictionary Model for Multi-Contrast MRI Super-resolution and Reconstruction

  • paper_url: http://arxiv.org/abs/2309.01171
  • repo_url: https://github.com/lpcccc-cv/mc-cdic
  • paper_authors: Pengcheng Lei, Faming Fang, Guixu Zhang, Ming Xu
  • for: multi-contrast MRI super-resolution and reconstruction that explicitly exploits the complementary information across contrasts.
  • methods: a multi-contrast convolutional dictionary (MC-CDic) model: an observation model separates the multi-contrast images into common and unique features so that only useful information in the reference image transfers to the target; the proximal gradient algorithm optimizes the model, its iterations are unrolled into a deep network whose proximal operators are replaced by learnable ResNets, and multi-scale dictionaries bring further gains.
  • results: experiments show MC-CDic outperforms state-of-the-art methods on multi-contrast MRI super-resolution and reconstruction tasks.
    Abstract Magnetic resonance imaging (MRI) tasks often involve multiple contrasts. Recently, numerous deep learning-based multi-contrast MRI super-resolution (SR) and reconstruction methods have been proposed to explore the complementary information from the multi-contrast images. However, these methods either construct parameter-sharing networks or manually design fusion rules, failing to accurately model the correlations between multi-contrast images and lacking certain interpretations. In this paper, we propose a multi-contrast convolutional dictionary (MC-CDic) model under the guidance of the optimization algorithm with a well-designed data fidelity term. Specifically, we build an observation model for the multi-contrast MR images to explicitly model the multi-contrast images as common features and unique features. In this way, only the useful information in the reference image can be transferred to the target image, while the inconsistent information will be ignored. We employ the proximal gradient algorithm to optimize the model and unroll the iterative steps into a deep CDic model. In particular, the proximal operators are replaced by learnable ResNets. In addition, multi-scale dictionaries are introduced to further improve the model performance. We test our MC-CDic model on multi-contrast MRI SR and reconstruction tasks. Experimental results demonstrate the superior performance of the proposed MC-CDic model against existing SOTA methods. Code is available at https://github.com/lpcccc-cv/MC-CDic.
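The classical iteration being unrolled is proximal gradient descent (ISTA) for sparse coding; MC-CDic replaces the soft-threshold proximal operator with a learnable ResNet. A minimal sketch on a synthetic dictionary:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, y, lam=0.1, n_iter=100):
    """Proximal gradient for sparse coding: min_z 0.5*||y - D z||^2 + lam*||z||_1.

    MC-CDic unrolls iterations like this into a network; here the proximal
    step is the classical soft threshold rather than a learned operator.
    """
    L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - y)
        z = soft_threshold(z - grad / L, lam / L)
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(40, 100))
z_true = np.zeros(100)
z_true[rng.choice(100, 5, replace=False)] = 1.0
y = D @ z_true
z = ista(D, y, lam=0.05, n_iter=500)
print(np.flatnonzero(np.abs(z) > 0.1))      # recovered support
```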

An Asynchronous Linear Filter Architecture for Hybrid Event-Frame Cameras

  • paper_url: http://arxiv.org/abs/2309.01159
  • repo_url: https://github.com/ziweiwwang/event-asynchronous-filter
  • paper_authors: Ziwei Wang, Yonhon Ng, Cedric Scheerlinck, Robert Mahony
  • for: reconstruct HDR video and perform spatial convolution by fusing event camera and frame camera data.
  • methods: an asynchronous linear filter architecture maintains a state that directly encodes the integrated or convolved image and is updated asynchronously as each event or frame arrives; the state can be read off whenever required by downstream vision modules in real-time robotic systems.
  • results: on public datasets with challenging lighting and fast motion, plus a new dataset with HDR reference, the method reduces absolute intensity error by 69.4% and improves image similarity indexes by 35.5% on average over state-of-the-art methods; convolution with Gaussian, Sobel, and Laplacian kernels is also demonstrated.
    Abstract Event cameras are ideally suited to capture High Dynamic Range (HDR) visual information without blur but provide poor imaging capability for static or slowly varying scenes. Conversely, conventional image sensors measure absolute intensity of slowly changing scenes effectively but do poorly on HDR or quickly changing scenes. In this paper, we present an asynchronous linear filter architecture, fusing event and frame camera data, for HDR video reconstruction and spatial convolution that exploits the advantages of both sensor modalities. The key idea is the introduction of a state that directly encodes the integrated or convolved image information and that is updated asynchronously as each event or each frame arrives from the camera. The state can be read-off as-often-as and whenever required to feed into subsequent vision modules for real-time robotic systems. Our experimental results are evaluated on both publicly available datasets with challenging lighting conditions and fast motions, along with a new dataset with HDR reference that we provide. The proposed AKF pipeline outperforms other state-of-the-art methods in both absolute intensity error (69.4% reduction) and image similarity indexes (average 35.5% improvement). We also demonstrate the integration of image convolution with linear spatial kernels Gaussian, Sobel, and Laplacian as an application of our architecture.
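The paper's filter is more general (it also supports spatial kernels and a principled gain); the sketch below only illustrates the asynchronous per-pixel idea, with the state decaying toward the latest frame between events and each event adding a fixed log-intensity increment. The gain and contrast values are assumptions.

```python
import numpy as np

class AsyncComplementaryFilter:
    """Per-pixel fusion of event and frame data (simplified sketch)."""
    def __init__(self, shape, gain=2.0, contrast=0.1):
        self.state = np.zeros(shape)        # log-intensity estimate
        self.frame = np.zeros(shape)        # latest frame (low-frequency reference)
        self.t_last = np.zeros(shape)       # last update time per pixel
        self.gain, self.c = gain, contrast

    def on_frame(self, log_frame):
        self.frame = log_frame

    def on_event(self, x, y, polarity, t):
        dt = t - self.t_last[y, x]
        alpha = 1.0 - np.exp(-self.gain * dt)           # decay toward the frame
        self.state[y, x] += alpha * (self.frame[y, x] - self.state[y, x])
        self.state[y, x] += self.c * (1 if polarity else -1)  # event increment
        self.t_last[y, x] = t

f = AsyncComplementaryFilter((4, 4))
f.on_frame(np.log1p(np.full((4, 4), 10.0)))
f.on_event(1, 2, True, 0.01)
print(f.state[2, 1])
```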

LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2309.01155
  • repo_url: None
  • paper_authors: Cheng Shi, Sibei Yang
  • for: improve the performance of pre-trained vision-language models on downstream tasks, particularly image recognition.
  • methods: synthetic text images rendered from class names serve as visual prompts; the classification objective is reformulated as visual prompt selection, resolving the chicken-and-egg problem of adding class-wise visual prompts versus predicting the class first, with no trainable visual prompt parameters.
  • results: on 16 datasets, the method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.
    Abstract Prompt engineering is a powerful tool used to enhance the performance of pre-trained models on downstream tasks. For example, providing the prompt ``Let's think step by step" improved GPT-3's reasoning accuracy to 63% on MutiArith while prompting ``a photo of" filled with a class name enables CLIP to achieve $80$\% zero-shot accuracy on ImageNet. While previous research has explored prompt learning for the visual modality, analyzing what constitutes a good visual prompt specifically for image recognition is limited. In addition, existing visual prompt tuning methods' generalization ability is worse than text-only prompting tuning. This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To achieve that, we propose our LoGoPrompt, which reformulates the classification objective to the visual prompt selection and addresses the chicken-and-egg challenge of first adding synthetic text images as class-wise visual prompts or predicting the class first. Without any trainable visual prompt parameters, experimental results on 16 datasets demonstrate that our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.
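Rendering a class name as a synthetic text image takes only a few lines; the font, placement, and colors below are arbitrary placeholders, and the CLIP-based scoring step is left out.

```python
from PIL import Image, ImageDraw

def synth_text_prompt(class_name, size=(224, 224)):
    """Render a class name as a synthetic text image (illustrative).

    LoGoPrompt uses such images as class-wise visual prompts; the rendering
    details here are placeholders, not the paper's exact recipe.
    """
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    draw.text((10, size[1] // 2), class_name, fill="black")  # default font
    return img

prompt = synth_text_prompt("golden retriever")
prompt.save("prompt_golden_retriever.png")
print(prompt.size)
```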

EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment

  • paper_url: http://arxiv.org/abs/2309.01151
  • repo_url: None
  • paper_authors: Cheng Shi, Sibei Yang
  • for: improve open-vocabulary object detection, where the detector is trained on base categories but must also detect novel categories.
  • methods: the paper identifies that object-level alignment with CLIP overfits to base categories because critical fine-grained local image semantics are lost (novel categories most similar to base ones perform worst, being recognized as those base categories), and proposes Early Dense Alignment (EDA), which uses object-level supervision to learn dense-level rather than object-level alignment, preserving local fine-grained semantics.
  • results: under the same strict setting and without external training resources, EDA improves novel box AP50 by +8.4% on COCO and rare mask AP by +3.9% on LVIS.
    Abstract Vision-language models such as CLIP have boosted the performance of open-vocabulary object detection, where the detector is trained on base categories but required to detect novel categories. Existing methods leverage CLIP's strong zero-shot recognition ability to align object-level embeddings with textual embeddings of categories. However, we observe that using CLIP for object-level alignment results in overfitting to base categories, i.e., novel categories most similar to base categories have particularly poor performance as they are recognized as similar base categories. In this paper, we first identify that the loss of critical fine-grained local image semantics hinders existing methods from attaining strong base-to-novel generalization. Then, we propose Early Dense Alignment (EDA) to bridge the gap between generalizable local semantics and object-level prediction. In EDA, we use object-level supervision to learn the dense-level rather than object-level alignment to maintain the local fine-grained semantics. Extensive experiments demonstrate our superior performance to competing approaches under the same strict setting and without using external training resources, i.e., improving the +8.4% novel box AP50 on COCO and +3.9% rare mask AP on LVIS.

VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

  • paper_url: http://arxiv.org/abs/2309.01141
  • repo_url: None
  • paper_authors: Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang
  • for: zero-shot visual grounding without any fine-tuning or additional training data.
  • methods: a simple yet effective zero-shot visual grounding framework built on pre-trained text-to-image diffusion models, with a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal.
  • results: experiments on RefCOCO, RefCOCO+, and RefCOCOg show strong zero-shot visual grounding performance.
    Abstract Large-scale text-to-image diffusion models have shown impressive capabilities across various generative tasks, enabled by strong vision-language alignment obtained through pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding.

RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model

  • paper_url: http://arxiv.org/abs/2309.02455
  • repo_url: None
  • paper_authors: Ahmad Sebaq, Mohamed ElHelw
  • for: remote sensing tasks that require high-quality, detailed satellite images for accurate analysis and decision-making.
  • methods: The paper proposes an innovative and lightweight approach that employs two-stage diffusion models to generate high-resolution satellite images purely based on text prompts. The approach consists of two interconnected diffusion models: a Low-Resolution Generation Diffusion Model (LR-GDM) and a Super-Resolution Diffusion Model (SRDM).
  • results: The approach outperforms existing state-of-the-art (SoTA) models in generating satellite images with realistic geographical features, weather conditions, and land structures while achieving remarkable super-resolution results for increased spatial precision.Here’s the Chinese version of the information points:
    Abstract Satellite imagery generation and super-resolution are pivotal tasks in remote sensing, demanding high-quality, detailed images for accurate analysis and decision-making. In this paper, we propose an innovative and lightweight approach that employs two-stage diffusion models to gradually generate high-resolution satellite images purely from text prompts. Our pipeline comprises two interconnected diffusion models: a Low-Resolution Generation Diffusion Model (LR-GDM) that generates low-resolution images from text, and a Super-Resolution Diffusion Model (SRDM) conditioned on its output. The LR-GDM effectively synthesizes low-resolution images by computing the correlations of the text embedding and the image embedding in a shared latent space, capturing the essential content and layout of the desired scenes. Subsequently, the SRDM takes the generated low-resolution image and the corresponding text prompts and efficiently produces the high-resolution counterpart, infusing fine-grained spatial details and enhancing visual fidelity. Experiments are conducted on the commonly used Remote Sensing Image Captioning Dataset (RSICD). Our results demonstrate that our approach outperforms existing state-of-the-art (SoTA) models in generating satellite images with realistic geographical features, weather conditions, and land structures, while achieving remarkable super-resolution results for increased spatial precision.
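The two-stage cascade can be pictured as follows; this is an interface-level illustration with stand-in samplers (the real LR-GDM and SRDM are iterative diffusion samplers conditioned on the text embedding):

```python
# An illustrative LR-GDM -> SRDM cascade with assumed interfaces.
import torch

class ToyStage(torch.nn.Module):
    def __init__(self, out_hw):
        super().__init__()
        self.out_hw = out_hw
    def forward(self, text_emb, cond_img=None):
        # Stand-in for an iterative diffusion sampling loop.
        img = torch.randn(1, 3, self.out_hw, self.out_hw)
        if cond_img is not None:  # the SR stage conditions on the LR output
            img = img + torch.nn.functional.interpolate(
                cond_img, size=(self.out_hw, self.out_hw), mode="bilinear")
        return img

lr_gdm = ToyStage(out_hw=64)   # text -> 64x64 low-resolution image
srdm = ToyStage(out_hw=256)    # (text, LR image) -> 256x256 image

text_emb = torch.randn(1, 512)          # encoded text prompt
lr = lr_gdm(text_emb)                   # stage 1: content and layout
hr = srdm(text_emb, cond_img=lr)        # stage 2: fine spatial detail
print(lr.shape, hr.shape)
```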

Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-free Multi-Exposure Image Fusion

  • paper_url: http://arxiv.org/abs/2309.01113
  • repo_url: None
  • paper_authors: Guanyao Wu, Hongming Fu, Jinyuan Liu, Long Ma, Xin Fan, Risheng Liu
  • for: To overcome the limitations of multi-exposure image fusion, namely hand-designed network structures and loss functions, and to produce more authentic image representations.
  • methods: Proposes a Hybrid-Supervised Dual-Search approach (HSDS-MEF) that uses a bi-level optimization search to automatically design both the network structure and the loss function.
  • results: Achieves state-of-the-art performance against various competitive schemes, improving Visual Information Fidelity (VIF) by 10.61% and 4.38% in the general and no-reference scenarios respectively, with high-contrast, detail-rich, and colorful results.
    Abstract Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels. Despite its advancements, the field grapples with challenges, notably the reliance on manual designs for network structures and loss functions, and the constraints of utilizing simulated reference images as ground truths. Consequently, current methodologies often suffer from color distortions and exposure artifacts, further complicating the quest for authentic image representation. To address these challenges, this paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for the automatic design of both network structures and loss functions. More specifically, we harness a dual-search mechanism rooted in a novel weighted structure refinement architecture search. In addition, a hybrid supervised contrast constraint seamlessly guides and integrates with the searching process, facilitating a more adaptive and comprehensive search for optimal loss functions. We achieve state-of-the-art performance in comparison to various competitive schemes, yielding a 10.61% and 4.38% improvement in Visual Information Fidelity (VIF) for general and no-reference scenarios, respectively, while providing results with high contrast, rich details and colors.

ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.01111
  • repo_url: https://github.com/duyooho/arsdm
  • paper_authors: Yuhao Du, Yuncheng Jiang, Shuangyi Tan, Xusheng Wu, Qi Dou, Zhen Li, Guanbin Li, Xiang Wan
  • for: To assist clinical diagnosis and treatment by improving the accuracy of polyp segmentation and detection.
  • methods: Leverages diffusion models, which excel at fitting data distributions, for data generation; the ground-truth segmentation mask serves as a prior condition during training, and a pre-trained segmentation model refines the training process.
  • results: Extensive experiments on segmentation and detection tasks show that the generated data significantly boosts the performance of baseline methods.
    Abstract Colonoscopy analysis, particularly automatic polyp segmentation and detection, is essential for assisting clinical diagnosis and treatment. However, as medical image annotation is labour- and resource-intensive, the scarcity of annotated data limits the effectiveness and generalization of existing methods. Although recent research has focused on data generation and augmentation to address this issue, the quality of the generated data remains a challenge, which limits the contribution to the performance of subsequent tasks. Inspired by the superiority of diffusion models in fitting data distributions and generating high-quality data, in this paper, we propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks. Specifically, ArSDM utilizes the ground-truth segmentation mask as a prior condition during training and adjusts the diffusion loss for each input according to the polyp/background size ratio. Furthermore, ArSDM incorporates a pre-trained segmentation model to refine the training process by reducing the difference between the ground-truth mask and the prediction mask. Extensive experiments on segmentation and detection tasks demonstrate the generated data by ArSDM could significantly boost the performance of baseline methods.
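One way to read the size-adaptive loss adjustment is sketched below; the exact weighting formula is an assumption, but it illustrates reweighting the per-pixel diffusion loss by the polyp/background size ratio so small lesions are not drowned out:

```python
# A hedged sketch of a size-adaptive diffusion loss (not the paper's exact formula).
import torch

def adaptive_diffusion_loss(noise_pred, noise, mask):
    """noise_pred, noise: (B, C, H, W); mask: (B, 1, H, W) binary polyp mask."""
    err = (noise_pred - noise) ** 2
    ratio = mask.flatten(1).mean(dim=1).clamp(min=1e-4)   # polyp/background size ratio
    # Upweight polyp pixels inversely to their area so each sample's lesion
    # contributes comparably to the loss regardless of its size.
    w = mask / ratio.view(-1, 1, 1, 1) + (1 - mask)
    return (w * err).mean()

pred, target = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.9).float()
print(adaptive_diffusion_loss(pred, target, mask))
```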

AdvMono3D: Advanced Monocular 3D Object Detection with Depth-Aware Robust Adversarial Training

  • paper_url: http://arxiv.org/abs/2309.01106
  • repo_url: None
  • paper_authors: Xingyuan Li, Jinyuan Liu, Long Ma, Xin Fan, Risheng Liu
  • for: To strengthen the robustness of monocular 3D object detection models against adversarial attacks.
  • methods: Proposes a depth-aware adversarial training method comprising an attack that iteratively degrades 2D and 3D perception (IDP) and an uncertainty-based residual learning scheme for adversarial training.
  • results: Extensive experiments on the KITTI 3D dataset show that under attack, DART3D improves car-category $AP_{R40}$ over direct adversarial training (the most popular approach) by 4.415%, 4.112%, and 3.195% on the Easy, Moderate, and Hard settings.
    Abstract Monocular 3D object detection plays a pivotal role in the field of autonomous driving, and numerous deep learning-based methods have made significant breakthroughs in this area. Despite the advancements in detection accuracy and efficiency, these models tend to fail when faced with adversarial attacks, rendering them ineffective. Therefore, bolstering the adversarial robustness of 3D detection models has become a crucial issue that demands immediate attention and innovative solutions. To mitigate this issue, we propose a depth-aware robust adversarial training method for monocular 3D object detection, dubbed DART3D. Specifically, we first design an adversarial attack that iteratively degrades the 2D and 3D perception capabilities of 3D object detection models (IDP), which serves as the foundation for our subsequent defense mechanism. In response to this attack, we propose an uncertainty-based residual learning method for adversarial training. Our adversarial training approach capitalizes on the inherent uncertainty, enabling the model to significantly improve its robustness against adversarial attacks. We conducted extensive experiments on the KITTI 3D datasets, demonstrating that DART3D surpasses direct adversarial training (the most popular approach) under attacks in 3D object detection $AP_{R40}$ of the car category for the Easy, Moderate, and Hard settings, with improvements of 4.415%, 4.112%, and 3.195%, respectively.

Turn Fake into Real: Adversarial Head Turn Attacks Against Deepfake Detection

  • paper_url: http://arxiv.org/abs/2309.01104
  • repo_url: None
  • paper_authors: Weijie Wang, Zhengyu Zhao, Nicu Sebe, Bruno Lepri
  • for: To evaluate the robustness of deepfake detectors against realistic, 3D facial changes in deepfake videos.
  • methods: Synthesizes face views from a single-view fake image to mount 3D adversarial attacks.
  • results: Extensive experiments validate the vulnerability of various detectors to AdvHeat in realistic black-box scenarios, with a 96.8% attack success rate using simple random search over 360 steps; with additional query access the step budget drops to 50.
    Abstract Malicious use of deepfakes leads to serious public concerns and reduces people's trust in digital media. Although effective deepfake detectors have been proposed, they are substantially vulnerable to adversarial attacks. To evaluate the detector's robustness, recent studies have explored various attacks. However, all existing attacks are limited to 2D image perturbations, which are hard to translate into real-world facial changes. In this paper, we propose adversarial head turn (AdvHeat), the first attempt at 3D adversarial face views against deepfake detectors, based on face view synthesis from a single-view fake image. Extensive experiments validate the vulnerability of various detectors to AdvHeat in realistic, black-box scenarios. For example, AdvHeat based on a simple random search yields a high attack success rate of 96.8% with 360 searching steps. When additional query access is allowed, we can further reduce the step budget to 50. Additional analyses demonstrate that AdvHeat is better than conventional attacks on both the cross-detector transferability and robustness to defenses. The adversarial images generated by AdvHeat are also shown to have natural looks. Our code, including that for generating a multi-view dataset consisting of 360 synthetic views for each of 1000 IDs from FaceForensics++, is available at https://github.com/twowwj/AdvHeaT.
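The simple random-search variant of the attack can be sketched as follows; `render_view` and `detector` are stand-ins for the single-image face view synthesizer and a deepfake detector, and the angle range is an assumption:

```python
# A minimal random-search loop in the spirit of AdvHeat (assumed interfaces).
import random

def advheat_random_search(image, render_view, detector, steps=360):
    best_angle, best_fake_prob = 0.0, 1.0
    for _ in range(steps):
        angle = random.uniform(-90.0, 90.0)   # candidate head-turn angle
        view = render_view(image, angle)      # synthesize the rotated face view
        p_fake = detector(view)
        if p_fake < best_fake_prob:           # lower fake-probability = stronger attack
            best_angle, best_fake_prob = angle, p_fake
        if best_fake_prob < 0.5:              # detector now says "real": success
            break
    return best_angle, best_fake_prob

# Toy stand-ins for a dry run
render_view = lambda img, a: (img, a)
detector = lambda v: abs(v[1]) / 90.0         # pretends near-frontal views fool it more
print(advheat_random_search("fake.png", render_view, detector))
```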

Dual Adversarial Resilience for Collaborating Robust Underwater Image Enhancement and Perception

  • paper_url: http://arxiv.org/abs/2309.01102
  • repo_url: None
  • paper_authors: Zengxi Zhang, Zhiying Jiang, Zeru Shi, Jinyuan Liu, Risheng Liu
  • for: To improve the visibility and color fidelity of underwater images while also improving the accuracy of subsequent perception tasks.
  • methods: Proposes a collaborative adversarial resilience network (CARNet) comprising an invertible network, a synchronized training strategy with both visual-driven and perception-driven attacks, and an attack pattern discriminator, making both enhancement and perception robust.
  • results: Experiments show that the method outputs high-quality enhanced images and improves detection mAP by 6.71% on average over state-of-the-art methods.
    Abstract Due to the uneven scattering and absorption of different light wavelengths in aquatic environments, underwater images suffer from low visibility and noticeable color deviations. With the advancement of autonomous underwater vehicles, extensive research has been conducted on learning-based underwater enhancement algorithms. These works can generate visually pleasing enhanced images and mitigate the adverse effects of degraded images on subsequent perception tasks. However, learning-based methods are susceptible to the inherent fragility of adversarial attacks, which can significantly disrupt their results. In this work, we introduce a collaborative adversarial resilience network, dubbed CARNet, for underwater image enhancement and subsequent detection tasks. Concretely, we first introduce an invertible network with strong perturbation-perceptual abilities to isolate attacks from underwater images, preventing interference with image enhancement and perceptual tasks. Furthermore, we propose a synchronized attack training strategy with both visual-driven and perception-driven attacks, enabling the network to discern and remove various types of attacks. Additionally, we incorporate an attack pattern discriminator to heighten the robustness of the network against different attacks. Extensive experiments demonstrate that the proposed method outputs visually appealing enhanced images and achieves detection mAP on average 6.71% higher than state-of-the-art methods.

Enhancing Infrared Small Target Detection Robustness with Bi-Level Adversarial Framework

  • paper_url: http://arxiv.org/abs/2309.01099
  • repo_url: None
  • paper_authors: Zhu Liu, Zihang Chen, Jinyuan Liu, Long Ma, Xin Fan, Risheng Liu
  • for: To improve the robustness of small infrared target detection against blurred and cluttered backgrounds.
  • methods: Proposes a bi-level adversarial framework: learnable corruption generation that maximizes the losses as the lower-level objective and robustness promotion of the detector as the upper-level one, together with a hierarchical reinforced learning strategy that discovers the most detrimental corruptions and balances robustness against accuracy, and a spatial-frequency interaction network for target detection.
  • results: Improves IoU by 21.96% across a wide array of corruptions and by 4.97% on the general benchmark.
    Abstract The detection of small infrared targets against blurred and cluttered backgrounds has remained an enduring challenge. In recent years, learning-based schemes have become the mainstream methodology for establishing the mapping directly. However, these methods are susceptible to the inherent complexities of changing backgrounds and real-world disturbances, leading to unreliable and compromised target estimations. In this work, we propose a bi-level adversarial framework to promote the robustness of detection in the presence of distinct corruptions. We first propose a bi-level optimization formulation to introduce dynamic adversarial learning. Specifically, it is composed of the learnable generation of corruptions to maximize the losses as the lower-level objective and the robustness promotion of detectors as the upper-level one. We also provide a hierarchical reinforced learning strategy to discover the most detrimental corruptions and balance performance between robustness and accuracy. To better disentangle the corruptions from salient features, we also propose a spatial-frequency interaction network for target detection. Extensive experiments demonstrate that our scheme remarkably improves IoU by 21.96% across a wide array of corruptions and by 4.97% on the general benchmark. The source codes are available at https://github.com/LiuZhu-CV/BALISTD.

CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection

  • paper_url: http://arxiv.org/abs/2309.01093
  • repo_url: None
  • paper_authors: Jiajin Tang, Ge Zheng, Jingyi Yu, Sibei Yang
  • for: Task-driven object detection, i.e., detecting object instances in an image that are suitable for affording a task; the challenge is that the object categories involved are too diverse to be enumerated as in conventional object detection.
  • methods: Proposes reasoning over fundamental affordances, i.e., common attributes that enable different objects to accomplish the same task, rather than over object categories; affordance knowledge is extracted from large language models via multi-level chain-of-thought prompting (MLCoT).
  • results: CoTDet outperforms state-of-the-art methods consistently and significantly (+15.6 box AP and +14.8 mask AP) and can generate rationales for why detected objects afford the task.
    Abstract Task driven object detection aims to detect object instances suitable for affording a task in an image. Its challenge lies in object categories available for the task being too diverse to be limited to a closed set of object vocabulary for traditional object detection. Simply mapping categories and visual features of common objects to the task cannot address the challenge. In this paper, we propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task. Moreover, we propose a novel multi-level chain-of-thought prompting (MLCoT) to extract the affordance knowledge from large language models, which contains multi-level reasoning steps from task to object examples to essential visual attributes with rationales. Furthermore, to fully exploit knowledge to benefit object recognition and localization, we propose a knowledge-conditional detection framework, namely CoTDet. It conditions the detector from the knowledge to generate object queries and regress boxes. Experimental results demonstrate that our CoTDet outperforms state-of-the-art methods consistently and significantly (+15.6 box AP and +14.8 mask AP) and can generate rationales for why objects are detected to afford the task.
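A minimal illustration of multi-level chain-of-thought prompting, following the paper's task-to-objects-to-visual-attributes progression; `ask_llm` is a stand-in for any chat-completion API, and the prompt wording is my own:

```python
# An illustrative MLCoT-style prompt chain (assumed prompts and LLM interface).
def mlcot_affordance_knowledge(task, ask_llm):
    objects = ask_llm(
        f"Task: '{task}'. List common objects a person could use to accomplish it.")
    attributes = ask_llm(
        f"Task: '{task}'. Candidate objects: {objects}. "
        "What essential visual attributes (shape, material, parts) let these "
        "objects afford the task? Explain the rationale for each attribute.")
    return {"task": task, "objects": objects, "attributes": attributes}

# Toy stand-in LLM for a dry run
canned = [
    "mug, bowl, ladle, glass",
    "concave rigid container; watertight material; graspable handle or rim",
]
calls = iter(canned)
print(mlcot_affordance_knowledge("scoop up water", lambda prompt: next(calls)))
```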

Face Clustering for Connection Discovery from Event Images

  • paper_url: http://arxiv.org/abs/2309.01092
  • repo_url: None
  • paper_authors: Ming Cheung
  • for: Proposes a system that discovers social connections from event images, making social-graph-based applications possible without access to online social graphs.
  • methods: Proposes and implements a face clustering method that discovers connections from the co-occurrence of faces in event images, without knowing the identities of the participants.
  • results: On over 40,000 faces from over 3,000 participants, the faces are clustered with an 80% F1 score and social graphs can be constructed.
    Abstract Social graphs are very useful for many applications, such as recommendations and community detection. However, they are only accessible to big social network operators due to both data availability and privacy concerns. Event images also capture the interactions among the participants, from which social connections can be discovered to form a social graph. Unlike online social graphs, the social connections carried by event images can be extracted without user input, and hence many social graph-based applications become possible, even without access to online social graphs. This paper proposes a system to discover social connections from event images. By utilizing the social information from event images, such as co-occurrence, a face clustering method is proposed and implemented, and connections can be discovered without the identities of the event participants. Over 40,000 faces from over 3,000 participants were collected, showing that the faces can be clustered well with an 80% F1 score and that social graphs can be constructed. Utilizing offline event images may create a long-term impact on social network analytics.
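The pipeline reduces to two steps — cluster face embeddings, then connect clusters that co-occur in the same event image — as in this compact sketch (the DBSCAN choice and thresholds are my assumptions, not the paper's):

```python
# Cluster face embeddings, then build a co-occurrence social graph.
import numpy as np
from collections import defaultdict
from sklearn.cluster import DBSCAN

def build_social_graph(embeddings, image_ids, eps=0.4):
    """embeddings: (N, D) L2-normalized face embeddings; image_ids: list of N
    event-image identifiers. Returns cluster labels and co-occurrence edges."""
    labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(embeddings)
    faces_per_image = defaultdict(set)
    for lbl, img in zip(labels, image_ids):
        if lbl != -1:                        # skip unclustered faces
            faces_per_image[img].add(lbl)
    edges = defaultdict(int)
    for people in faces_per_image.values():
        people = sorted(people)
        for i in range(len(people)):
            for j in range(i + 1, len(people)):
                edges[(people[i], people[j])] += 1   # one shared photo per edge count
    return labels, dict(edges)

rng = np.random.default_rng(0)
ident = rng.normal(size=(4, 128))                     # 4 latent identities
emb = np.repeat(ident, 3, axis=0) + 0.05 * rng.normal(size=(12, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
imgs = [f"img{i % 4}" for i in range(12)]             # faces spread over 4 photos
print(build_social_graph(emb, imgs)[1])
```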

MILA: Memory-Based Instance-Level Adaptation for Cross-Domain Object Detection

  • paper_url: http://arxiv.org/abs/2309.01086
  • repo_url: None
  • paper_authors: Onkar Krishna, Hiroki Ohashi, Saptarshi Sinha
  • for: To address cross-domain object detection, in particular the difficulty of establishing suitable instance-level correspondences between domains.
  • methods: Uses adversarial learning to align features at both image level and instance level; introduces a memory module that dynamically stores the pooled features of all labeled source instances, categorized by label, and a simple yet effective retrieval module that fetches the most similar source instance for each target instance.
  • results: Significantly outperforms non-memory-based methods across various domain shift scenarios.
    Abstract Cross-domain object detection is challenging, and it involves aligning labeled source and unlabeled target domains. Previous approaches have used adversarial training to align features at both image-level and instance-level. At the instance level, finding a suitable source sample that aligns with a target sample is crucial. A source sample is considered suitable if it differs from the target sample only in domain, without differences in unimportant characteristics such as orientation and color, which can hinder the model's focus on aligning the domain difference. However, existing instance-level feature alignment methods struggle to find suitable source instances because their search scope is limited to mini-batches. Mini-batches are often so small in size that they do not always contain suitable source instances. The insufficient diversity of mini-batches becomes problematic particularly when the target instances have high intra-class variance. To address this issue, we propose a memory-based instance-level domain adaptation framework. Our method aligns a target instance with the most similar source instance of the same category retrieved from a memory storage. Specifically, we introduce a memory module that dynamically stores the pooled features of all labeled source instances, categorized by their labels. Additionally, we introduce a simple yet effective memory retrieval module that retrieves a set of matching memory slots for target instances. Our experiments on various domain shift scenarios demonstrate that our approach outperforms existing non-memory-based methods significantly.
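A hedged sketch of the memory module follows: pooled source-instance features are stored per category, and each target instance retrieves its most similar source feature by cosine similarity (details simplified from the paper):

```python
# A simplified per-category instance memory with cosine-similarity retrieval.
import torch
import torch.nn.functional as F

class InstanceMemory:
    def __init__(self):
        self.feats, self.labels = [], []

    def store(self, pooled_feat, label):
        self.feats.append(F.normalize(pooled_feat, dim=-1))
        self.labels.append(label)

    def retrieve(self, target_feat, label):
        """Return the stored source feature of class `label` most similar to
        the target instance feature (None if the class is unseen so far)."""
        idx = [i for i, l in enumerate(self.labels) if l == label]
        if not idx:
            return None
        bank = torch.stack([self.feats[i] for i in idx])   # (M, D)
        sims = bank @ F.normalize(target_feat, dim=-1)     # (M,) cosine similarities
        return bank[sims.argmax()]

mem = InstanceMemory()
for _ in range(5):
    mem.store(torch.randn(256), label=int(torch.randint(0, 3, (1,))))
match = mem.retrieve(torch.randn(256), label=0)
print(None if match is None else match.shape)
```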

Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning

  • paper_url: http://arxiv.org/abs/2309.01083
  • repo_url: None
  • paper_authors: Haiyang Yu, Xiaocong Wang, Bin Li, Xiangyang Xue
  • for: To improve the accuracy and generalization of Chinese text recognition.
  • methods: Pre-trains a CLIP-like model by aligning printed character images with Ideographic Description Sequences (IDS), then uses the learned representations to supervise the text recognition model.
  • results: Performs strongly on both Chinese character recognition and Chinese text recognition, surpassing previous methods in most scenarios.
    Abstract Scene text recognition has been studied for decades due to its broad applications. However, despite Chinese characters possessing different characteristics from Latin characters, such as complex inner structures and large categories, few methods have been proposed for Chinese Text Recognition (CTR). Particularly, the characteristic of large categories poses challenges in dealing with zero-shot and few-shot Chinese characters. In this paper, inspired by the way humans recognize Chinese texts, we propose a two-stage framework for CTR. Firstly, we pre-train a CLIP-like model through aligning printed character images and Ideographic Description Sequences (IDS). This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character. Subsequently, the learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition through image-IDS matching. To evaluate the effectiveness of the proposed method, we conduct extensive experiments on both Chinese character recognition (CCR) and CTR. The experimental results demonstrate that the proposed method performs best in CCR and outperforms previous methods in most scenarios of the CTR benchmark. It is worth noting that the proposed method can recognize zero-shot Chinese characters in text images without fine-tuning, whereas previous methods require fine-tuning when new classes appear. The code is available at https://github.com/FudanVI/FudanOCR/tree/main/image-ids-CTR.

Orientation-Independent Chinese Text Recognition in Scene Images

  • paper_url: http://arxiv.org/abs/2309.01081
  • repo_url: None
  • paper_authors: Haiyang Yu, Xiaocong Wang, Bin Li, Xiangyang Xue
  • for: To improve the accuracy of Chinese text recognition in natural scenes.
  • methods: Introduces a Character Image Reconstruction Network (CIRN) that disentangles the content and orientation information of text images to extract orientation-independent visual features, enabling robust recognition of both horizontal and vertical text.
  • results: Disentangling content and orientation improves recognition on a scene benchmark for Chinese text, and on the newly collected Vertical Chinese Text Recognition (VCTR) dataset the method yields a 45.63% improvement when CIRN is added to the baseline model.
    Abstract Scene text recognition (STR) has attracted much attention due to its broad applications. The previous works pay more attention to dealing with the recognition of Latin text images with complex backgrounds by introducing language models or other auxiliary networks. Different from Latin texts, many vertical Chinese texts exist in natural scenes, which brings difficulties to current state-of-the-art STR methods. In this paper, we take the first attempt to extract orientation-independent visual features by disentangling content and orientation information of text images, thus recognizing both horizontal and vertical texts robustly in natural scenes. Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information. We conduct experiments on a scene dataset for benchmarking Chinese text recognition, and the results demonstrate that the proposed method can indeed improve performance through disentangling content and orientation information. To further validate the effectiveness of our method, we additionally collect a Vertical Chinese Text Recognition (VCTR) dataset. The experimental results show that the proposed method achieves 45.63% improvement on VCTR when introducing CIRN to the baseline model.

Robust Adversarial Defense by Tensor Factorization

  • paper_url: http://arxiv.org/abs/2309.01077
  • repo_url: None
  • paper_authors: Manish Bhattarai, Mehmet Cagri Kaymak, Ryan Barron, Ben Nebgen, Kim Rasmussen, Boian Alexandrov
  • for: Defending machine learning models against adversarial attacks, which can degrade their performance or cause outright failure.
  • methods: Integrates tensorization of the input data with low-rank decomposition and tensorization of the neural network parameters, combining them into a single robust defense strategy.
  • results: Maintains robust accuracy even under the strongest known auto-attacks; the results not only match but exceed all current defense strategies that rely on tensor factorizations.
    Abstract As machine learning techniques become increasingly prevalent in data analysis, the threat of adversarial attacks has surged, necessitating robust defense mechanisms. Among these defenses, methods exploiting low-rank approximations for input data preprocessing and neural network (NN) parameter factorization have shown potential. Our work advances this field further by integrating the tensorization of input data with low-rank decomposition and tensorization of NN parameters to enhance adversarial defense. The proposed approach demonstrates significant defense capabilities, maintaining robust accuracy even when subjected to the strongest known auto-attacks. Evaluations against leading-edge robust performance benchmarks reveal that our results not only hold their ground against the best defensive methods available but also exceed all current defense strategies that rely on tensor factorizations. This study underscores the potential of integrating tensorization and low-rank decomposition as a robust defense against adversarial attacks in machine learning.
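The input-side half of the idea can be illustrated with a plain low-rank projection that discards the high-rank components adversarial perturbations tend to occupy; the paper additionally factorizes the network parameters, which this sketch omits:

```python
# Low-rank input preprocessing via truncated SVD (illustrative, per-channel).
import numpy as np

def low_rank_project(img, rank=8):
    """img: (H, W) or (C, H, W) array; returns its rank-`rank` approximation."""
    if img.ndim == 3:
        return np.stack([low_rank_project(c, rank) for c in img])
    u, s, vt = np.linalg.svd(img, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

x = np.random.rand(3, 32, 32).astype(np.float32)
x_adv = x + 0.03 * np.sign(np.random.randn(*x.shape))   # stand-in perturbation
x_clean = low_rank_project(x_adv, rank=8)               # feed this to the classifier
print(x_clean.shape, float(np.abs(x_clean - x).mean()))
```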

Muti-Stage Hierarchical Food Classification

  • paper_url: http://arxiv.org/abs/2309.01075
  • repo_url: None
  • paper_authors: Xinyue Pan, Jiangpeng He, Fengqing Zhu
  • for: To improve food image classification so that nutritional composition information can be extracted from captured food images.
  • methods: Proposes a multi-stage hierarchical framework that iteratively clusters and merges food items during the training process, allowing the deep model to extract image features that are discriminative across labels.
  • results: Evaluated on the VFN-nutrient dataset, achieving promising results compared with existing work on both food type and food item classification.
    Abstract Food image classification serves as a fundamental and critical step in image-based dietary assessment, facilitating nutrient intake analysis from captured food images. However, existing works in food classification predominantly focus on predicting 'food types', which do not contain direct nutritional composition information. This limitation arises from the inherent discrepancies in nutrition databases, which are tasked with associating each 'food item' with its respective information. Therefore, in this work we aim to classify food items to align with nutrition databases. To this end, we first introduce the VFN-nutrient dataset by annotating each food image in VFN with a food item that includes nutritional composition information. Such annotation of food items, being more discriminative than food types, creates a hierarchical structure within the dataset. However, since the food item annotations are based solely on nutritional composition information, they do not always show visual relations with each other, which poses significant challenges when applying deep learning-based techniques for classification. To address this issue, we propose a multi-stage hierarchical framework for food item classification that iteratively clusters and merges food items during the training process, which allows the deep model to extract image features that are discriminative across labels. Our method is evaluated on the VFN-nutrient dataset and achieves promising results compared with existing work in terms of both food type and food item classification.

Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding

  • paper_url: http://arxiv.org/abs/2309.01073
  • repo_url: https://github.com/ChengShiest/REP-ERU
  • paper_authors: Cheng Shi, Sibei Yang
  • for: Studies embodied reference understanding, where a receiver must locate the target object referred to by both the language and the gesture of the sender in a shared physical environment.
  • methods: Proposes REasoning from your Perspective (REP), which tackles the main challenge by modeling the relations between the receiver and the sender, and between the sender and the objects, via the proposed view rotation and relation reasoning.
  • results: REP consistently surpasses all existing state-of-the-art algorithms by a large margin, i.e., +5.22% absolute accuracy in terms of Prec0.5 on YouRefIt.
    Abstract Embodied Reference Understanding studies the reference understanding in an embodied fashion, where a receiver is required to locate a target object referred to by both language and gesture of the sender in a shared physical environment. Its main challenge lies in how to make the receiver with the egocentric view access spatial and visual information relative to the sender to judge how objects are oriented around and seen from the sender, i.e., spatial and visual perspective-taking. In this paper, we propose a REasoning from your Perspective (REP) method to tackle the challenge by modeling relations between the receiver and the sender and the sender and the objects via the proposed novel view rotation and relation reasoning. Specifically, view rotation first rotates the receiver to the position of the sender by constructing an embodied 3D coordinate system with the position of the sender as the origin. Then, it changes the orientation of the receiver to the orientation of the sender by encoding the body orientation and gesture of the sender. Relation reasoning models the nonverbal and verbal relations between the sender and the objects by multi-modal cooperative reasoning in gesture, language, visual content, and spatial position. Experiment results demonstrate the effectiveness of REP, which consistently surpasses all existing state-of-the-art algorithms by a large margin, i.e., +5.22% absolute accuracy in terms of Prec0.5 on YouRefIt.
    摘要 “人体参照理解”研究者强调在与另一个人共享的物理环境中,语言和手势都指向某个目标物件,并且需要接受者对 sender 的 egocentric 视角进行诠释。这个挑战在于如何让接受者获取 sender 的位置和方向信息,以便对于 sender 所看到的物品进行诠释。在这篇论文中,我们提出了一种基于你的视角(REP)方法,以解决这个挑战。REP 方法包括两个主要步骤:一、使用视角转换来让接受者视角与 sender 的视角进行对接,并且将接受者的视角转换为 sender 的视角。二、使用多modal 协同理解来模型非语言和语言之间的关系,以及物品和接受者之间的关系。实验结果显示,REP 方法能够优于所有现有的州际算法,具体而言,在 YouRefIt 上的 Prec0.5 上提高了 +5.22% 的绝对精度。

Channel Attention Separable Convolution Network for Skin Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2309.01072
  • repo_url: None
  • paper_authors: Changlu Guo, Jiangyan Dai, Marton Szemenyei, Yugen Yi
  • for: Early diagnosis of malignant skin lesions, improving the accuracy and efficiency of lesion segmentation.
  • methods: A novel network, the Channel Attention Separable Convolution Network (CASCN), inspired by U-Net, DenseNet, separable convolution, channel attention, and Atrous Spatial Pyramid Pooling (ASPP).
  • results: Evaluated on the PH2 dataset with limited images and without excessive pre-/post-processing, CASCN achieves state-of-the-art performance with a Dice similarity coefficient of 0.9461 and an accuracy of 0.9645.
    Abstract Skin cancer is a frequently occurring cancer in the human population, and it is very important to be able to diagnose malignant tumors early. Lesion segmentation is crucial for monitoring the morphological changes of skin lesions and extracting features to localize and identify diseases, assisting doctors in early diagnosis. Manual segmentation of dermoscopic images is error-prone and time-consuming, so there is a pressing demand for precise and automated segmentation algorithms. Inspired by advanced mechanisms such as U-Net, DenseNet, Separable Convolution, Channel Attention, and Atrous Spatial Pyramid Pooling (ASPP), we propose a novel network called Channel Attention Separable Convolution Network (CASCN) for skin lesion segmentation. The proposed CASCN is evaluated on the PH2 dataset with limited images. Without excessive pre-/post-processing of images, CASCN achieves state-of-the-art performance on the PH2 dataset with a Dice similarity coefficient of 0.9461 and an accuracy of 0.9645.
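A building-block sketch consistent with the components the paper names — depthwise-separable convolution followed by squeeze-and-excitation style channel attention — is shown below; the actual CASCN topology is not spelled out in the abstract, so treat this as an illustration of the ingredients, not the network itself:

```python
# Depthwise-separable convolution + channel attention, as a reusable block.
import torch
import torch.nn as nn

class CASepConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, reduction=8):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.attn = nn.Sequential(                 # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1), nn.Sigmoid())

    def forward(self, x):
        x = torch.relu(self.bn(self.pointwise(self.depthwise(x))))
        return x * self.attn(x)                    # reweight channels

block = CASepConvBlock(16, 32)
print(block(torch.randn(1, 16, 64, 64)).shape)     # -> (1, 32, 64, 64)
```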

Semi-supervised 3D Video Information Retrieval with Deep Neural Network and Bi-directional Dynamic-time Warping Algorithm

  • paper_url: http://arxiv.org/abs/2309.01063
  • repo_url: None
  • paper_authors: Yintai Ma, Diego Klabjan
  • for: Proposes a semi-supervised deep learning algorithm for retrieving similar 2D and 3D videos based on visual content.
  • methods: Combines deep convolutional and recurrent neural networks with bi-directional dynamic time warping as the similarity measure between sequences of clip embeddings.
  • results: Tested on multiple public datasets and compared against state-of-the-art deep learning models, the method performs well on video retrieval tasks.
    Abstract This paper presents a novel semi-supervised deep learning algorithm for retrieving similar 2D and 3D videos based on visual content. The proposed approach combines the power of deep convolutional and recurrent neural networks with dynamic time warping as a similarity measure. The proposed algorithm is designed to handle large video datasets and retrieve the most related videos to a given inquiry video clip based on its graphical frames and contents. We split both the candidate and the inquiry videos into sequences of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network. We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method. This approach is tested on multiple public datasets, including CC\_WEB\_VIDEO, Youtube-8m, S3DIS, and Synthia, and shows good results compared to the state of the art. The algorithm effectively solves video retrieval tasks and outperforms the benchmarked state-of-the-art deep learning model.
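The similarity step can be illustrated with classic dynamic time warping over clip embeddings; the paper's bi-directional variant refines this, and the random vectors below are stand-ins for autoencoder-produced clip embeddings:

```python
# Classic DTW over clip-embedding sequences (the bi-directional variant is omitted).
import numpy as np

def dtw_cost(a, b):
    """a: (n, d), b: (m, d) sequences of clip embeddings; returns the cumulative
    cost of the optimal monotonic alignment (lower = more similar)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # clip-pair distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

query = np.random.randn(6, 64)   # 6 clip embeddings from the inquiry video
candidates = [np.random.randn(np.random.randint(5, 12), 64) for _ in range(3)]
ranked = sorted(range(3), key=lambda k: dtw_cost(query, candidates[k]))
print("best candidate:", ranked[0])
```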

Efficient Curriculum based Continual Learning with Informative Subset Selection for Remote Sensing Scene Classification

  • paper_url: http://arxiv.org/abs/2309.01050
  • repo_url: None
  • paper_authors: S Divakar Bhat, Biplab Banerjee, Subhasis Chaudhuri, Avik Bhattacharya
  • for: Addresses class-incremental learning (CIL) for land-cover classification from optical remote sensing images, proposing a CIL framework built on replay-memory approaches.
  • methods: Proposes a distinctive CIL method that learns a curriculum over the new classes based on their similarity with the old classes, and a sample selection strategy over the old streams that picks highly confident samples to reduce the effect of noise.
  • results: Experiments show improved CIL performance and a better stability-plasticity trade-off than prior methods.
    Abstract We tackle the problem of class incremental learning (CIL) in the realm of land-cover classification from optical remote sensing (RS) images in this paper. The paradigm of CIL has recently gained much prominence given the fact that data are generally obtained in a sequential manner for real-world phenomena. However, CIL has not yet been extensively considered in the domain of RS, despite the fact that satellites tend to discover new classes at different geographical locations over time. With this motivation, we propose a novel CIL framework inspired by the recent success of replay-memory based approaches that tackles two of their shortcomings. In order to reduce the effect of catastrophic forgetting of the old classes when a new stream arrives, we learn a curriculum of the new classes based on their similarity with the old classes. This is found to limit the degree of forgetting substantially. Next, while constructing the replay memory, instead of randomly selecting samples from the old streams, we propose a sample selection strategy which ensures the selection of highly confident samples so as to reduce the effects of noise. We observe a sharp improvement in CIL performance with the proposed components. Experimental results on the benchmark NWPU-RESISC45, PatternNet, and EuroSAT datasets confirm that our method offers a better stability-plasticity trade-off than the literature.

cs.AI - 2023-09-03

Generative Social Choice

  • paper_url: http://arxiv.org/abs/2309.01291
  • repo_url: https://github.com/babatundeibukun/simple-social-learning-environment
  • paper_authors: Sara Fish, Paul Gölz, David C. Parkes, Ariel D. Procaccia, Gili Rusak, Itai Shapira, Manuel Wüthrich
  • for: Explores how AI, and in particular large language models, can support democratic processes such as collectively selecting a textual statement.
  • methods: Proposes generative social choice, a framework that combines the mathematical rigor of social choice theory with large language models' capability to generate text and extrapolate preferences.
  • results: The framework is applied to generating a slate of statements that is representative of opinions expressed as free-form text, for instance in an online deliberative process.
    Abstract Traditionally, social choice theory has only been applicable to choices among a few predetermined alternatives but not to more complex decisions such as collectively selecting a textual statement. We introduce generative social choice, a framework that combines the mathematical rigor of social choice theory with large language models' capability to generate text and extrapolate preferences. This framework divides the design of AI-augmented democratic processes into two components: first, proving that the process satisfies rigorous representation guarantees when given access to oracle queries; second, empirically validating that these queries can be approximately implemented using a large language model. We illustrate this framework by applying it to the problem of generating a slate of statements that is representative of opinions expressed as free-form text, for instance in an online deliberative process.
    摘要 (以下是简化中文版本)传统上,社会选择理论只适用于一些已经预先确定的选项之间的选择,而不适用于更复杂的决策,如通过人工智能支持的民主过程中的 коллектив选择文本声明。我们介绍了生成社会选择框架,这个框架将社会选择理论的数学严谨性与大自然语言模型的文本生成能力相结合,以便更好地满足民主过程中的多样化需求。这个框架将民主过程的设计分为两个组成部分:首先,证明过程满足严谨的表达保证,当给定询问 oracle 时;其次,通过实际验证来证明这些询问可以使用大自然语言模型来近似实现。我们通过应用这个框架来解决在 он线协商过程中收集和代表表达出的意见的问题,例如生成一份代表多种意见的文本声明。

Traveling Waves Encode the Recent Past and Enhance Sequence Learning

  • paper_url: http://arxiv.org/abs/2309.08045
  • repo_url: https://github.com/anon-neurips-2023/wave-rnn
  • paper_authors: T. Anderson Keller, Lyle Muller, Terrence Sejnowski, Max Welling
  • for: To test the hypothesis that traveling waves of neural activity across the cortical sheet can implement a short-term memory of sequential stimuli.
  • methods: Introduces the Wave-RNN (wRNN), a simple recurrent neural network whose connectivity constraints and initialization give rise to wave-like dynamics.
  • results: wRNNs learn faster and perform significantly better than wave-free counterparts on a suite of synthetic memory tasks, and on more complex sequence modeling tasks such as sequential image classification they perform comparably to gated architectures such as LSTMs and GRUs while using significantly fewer parameters.
    Abstract Traveling waves of neural activity have been observed throughout the brain at a diversity of regions and scales; however, their precise computational role is still debated. One physically grounded hypothesis suggests that the cortical sheet may act like a wave-field capable of storing a short-term memory of sequential stimuli through induced waves traveling across the cortical surface. To date, however, the computational implications of this idea have remained hypothetical due to the lack of a simple recurrent neural network architecture capable of exhibiting such waves. In this work, we introduce a model to fill this gap, which we denote the Wave-RNN (wRNN), and demonstrate how both connectivity constraints and initialization play a crucial role in the emergence of wave-like dynamics. We then empirically show how such an architecture indeed efficiently encodes the recent past through a suite of synthetic memory tasks where wRNNs learn faster and perform significantly better than wave-free counterparts. Finally, we explore the implications of this memory storage system on more complex sequence modeling tasks such as sequential image classification and find that wave-based models not only again outperform comparable wave-free RNNs while using significantly fewer parameters, but additionally perform comparably to more complex gated architectures such as LSTMs and GRUs. We conclude with a discussion of the implications of these results for both neuroscience and machine learning.
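A toy recurrent cell with shift-like local connectivity conveys the flavor of wave dynamics in the hidden state; this follows the spirit of the wRNN but is not the authors' implementation, and the propagation parameter is an assumption:

```python
# A toy wave-propagating recurrent cell: activity shifts across hidden units.
import torch
import torch.nn as nn

class ToyWaveRNNCell(nn.Module):
    def __init__(self, n_hidden, n_in):
        super().__init__()
        self.inp = nn.Linear(n_in, n_hidden)
        self.nu = nn.Parameter(torch.tensor(0.9))   # wave propagation strength

    def forward(self, x, h):
        # torch.roll shifts hidden activity one unit along the "sheet" each
        # step, so an input pulse travels across hidden units over time.
        return torch.tanh(self.nu * torch.roll(h, shifts=1, dims=-1) + self.inp(x))

cell = ToyWaveRNNCell(n_hidden=32, n_in=4)
h = torch.zeros(1, 32)
for t in range(10):
    h = cell(torch.randn(1, 4), h)
print(h.shape)
```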

Bayesian inference of composition-dependent phase diagrams

  • paper_url: http://arxiv.org/abs/2309.01271
  • repo_url: None
  • paper_authors: Timofei Miryashkin, Olga Klimanova, Vladimir Ladygin, Alexander Shapeev
  • for: This paper was written to develop a method for constructing temperature-concentration phase diagrams for materials using Bayesian inference and molecular dynamics simulations.
  • methods: The paper uses Bayesian inference to combine thermodynamic data from molecular dynamics simulations, melting point simulations, and phonon calculations, and to extrapolate the results to the infinite-atom limit.
  • results: The paper reports the development of an algorithm that can be used to construct temperature-concentration phase diagrams for materials with a high degree of accuracy and precision, and demonstrates the effectiveness of the algorithm on two binary systems, Ge-Si and K-Na.
    Abstract Phase diagrams serve as a highly informative tool for materials design, encapsulating information about the phases that a material can manifest under specific conditions. In this work, we develop a method in which Bayesian inference is employed to combine thermodynamic data from molecular dynamics (MD), melting point simulations, and phonon calculations, process these data, and yield a temperature-concentration phase diagram. The employed Bayesian framework yields not only the free energies of different phases as functions of temperature and concentration but also the uncertainties of these free energies originating from statistical errors inherent to finite-length MD trajectories. Furthermore, it extrapolates the results of the finite-atom calculations to the infinite-atom limit and facilitates the choice of the temperature, chemical potentials, and number of atoms with which to conduct the next simulation so that it is most efficient in reducing the uncertainty of the phase diagram. The developed algorithm was successfully tested on two binary systems, Ge-Si and K-Na, in the full range of concentrations and temperatures.

COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers

  • paper_url: http://arxiv.org/abs/2309.01270
  • repo_url: https://github.com/juliendenize/eztorch
  • paper_authors: Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi, Romain Hérault
  • for: Proposes COMEDIAN, a pipeline for initializing spatio-temporal transformers for action spotting that combines self-supervised learning and knowledge distillation.
  • methods: Two initialization stages: self-supervised initialization of a spatial transformer using short videos as input, followed by knowledge distillation from a pre-computed feature bank into a temporal transformer that enriches the spatial transformer's outputs with global context; the transformers are then fine-tuned on the action spotting task.
  • results: Experiments on the SoccerNet-v2 dataset demonstrate state-of-the-art performance and faster convergence than non-pretrained models, validating the effectiveness of COMEDIAN's pretraining paradigm.
    Abstract We present COMEDIAN, a novel pipeline to initialize spatio-temporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.

Learning-Aware Safety for Interactive Autonomy

  • paper_url: http://arxiv.org/abs/2309.01267
  • repo_url: None
  • paper_authors: Haimin Hu, Zixu Zhang, Kensuke Nakamura, Andrea Bajcsy, Jaime F. Fisac
  • for: To provide a new closed-loop paradigm for synthesizing control policies that keep robotic systems safe while they learn and adapt at runtime.
  • methods: Reasons jointly about the physical dynamics and the robot's learning algorithm, whose internal belief evolves under possible future scenarios, and leverages adversarial deep reinforcement learning to scale to high dimensions.
  • results: Enables tractable safety analysis even for implicit learning dynamics, and is demonstrated with both Bayesian belief propagation and a large pre-trained neural trajectory predictor.
    Abstract One of the outstanding challenges for the widespread deployment of robotic systems like autonomous vehicles is ensuring safe interaction with humans without sacrificing efficiency. Existing safety analysis methods often neglect the robot's ability to learn and adapt at runtime, leading to overly conservative behavior. This paper proposes a new closed-loop paradigm for synthesizing safe control policies that explicitly account for the system's evolving uncertainty under possible future scenarios. The formulation reasons jointly about the physical dynamics and the robot's learning algorithm, which updates its internal belief over time. We leverage adversarial deep reinforcement learning (RL) for scaling to high dimensions, enabling tractable safety analysis even for implicit learning dynamics induced by state-of-the-art prediction models. We demonstrate our framework's ability to work with both Bayesian belief propagation and the implicit learning induced by a large pre-trained neural trajectory predictor.

Large AI Model Empowered Multimodal Semantic Communications

  • paper_url: http://arxiv.org/abs/2309.01249
  • repo_url: None
  • paper_authors: Feibo Jiang, Yubo Peng, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan, Xiaohu You
  • for: To provide an immersive Semantic Communication (SC) experience with low latency and high quality over multimodal signals (text, audio, image, and video).
  • methods: Leverages large AI models, specifically a Multimodal Language Model (MLM) and a Large Language Model (LLM), to address data heterogeneity, semantic ambiguity, and signal fading.
  • results: Proposes the LAM-MSC framework, comprising MLM-based Multimodal Alignment (MMA), a personalized LLM-based Knowledge Base (LKB), and Conditional Generative Adversarial Networks-based Channel Estimation (CGE); simulations demonstrate its superior performance.
    Abstract Multimodal signals, including text, audio, image and video, can be integrated into Semantic Communication (SC) for providing an immersive experience with low latency and high quality at the semantic level. However, the multimodal SC has several challenges, including data heterogeneity, semantic ambiguity, and signal fading. Recent advancements in large AI models, particularly in Multimodal Language Model (MLM) and Large Language Model (LLM), offer potential solutions for these issues. To this end, we propose a Large AI Model-based Multimodal SC (LAM-MSC) framework, in which we first present the MLM-based Multimodal Alignment (MMA) that utilizes the MLM to enable the transformation between multimodal and unimodal data while preserving semantic consistency. Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows users to perform personalized semantic extraction or recovery through the LLM. This effectively addresses the semantic ambiguity. Finally, we apply the Conditional Generative adversarial networks-based channel Estimation (CGE) to obtain Channel State Information (CSI). This approach effectively mitigates the impact of fading channels in SC. Finally, we conduct simulations that demonstrate the superior performance of the LAM-MSC framework.

Representations Matter: Embedding Modes of Large Language Models using Dynamic Mode Decomposition

  • paper_url: http://arxiv.org/abs/2309.01245
  • repo_url: None
  • paper_authors: Mohamed Akrout
  • for: This work aims to detect "hallucinated" content generated by large language models (LLMs), i.e., plausible-looking but unfounded text.
  • methods: Uses the dynamic mode decomposition (DMD) tool to analyze the pattern evolution of sentence embeddings of generated text in the embedding space.
  • results: The spectrum of sentence embeddings over paragraphs is consistently low-rank for generated text, unlike that of ground-truth text. Moreover, evaluation cases with LLM hallucinations correspond to ground-truth embedding patterns with a higher number of modes that are poorly approximated by the few modes of the LLM embedding patterns, suggesting that hallucinations stem from both the generation technique and the underlying representation.
    Abstract Existing large language models (LLMs) are known for generating "hallucinated" content, namely a fabricated text of plausibly looking, yet unfounded, facts. To identify when these hallucination scenarios occur, we examine the properties of the generated text in the embedding space. Specifically, we draw inspiration from the dynamic mode decomposition (DMD) tool in analyzing the pattern evolution of text embeddings across sentences. We empirically demonstrate how the spectrum of sentence embeddings over paragraphs is constantly low-rank for the generated text, unlike that of the ground-truth text. Importantly, we find that evaluation cases having LLM hallucinations correspond to ground-truth embedding patterns with a higher number of modes being poorly approximated by the few modes associated with LLM embedding patterns. In analogy to near-field electromagnetic evanescent waves, the embedding DMD eigenmodes of the generated text with hallucinations vanishes quickly across sentences as opposed to those of the ground-truth text. This suggests that the hallucinations result from both the generation techniques and the underlying representation.
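To make the embedding-spectrum analysis concrete, below is a minimal exact-DMD sketch over a matrix of sentence embeddings, under the assumption that columns of `X` are per-sentence embedding vectors in document order; the function name and the random demo data are illustrative stand-ins, not the paper's code.

```python
import numpy as np

def dmd_modes(X: np.ndarray, rank: int):
    """Fit a rank-truncated exact DMD to snapshots X (d x n sentences)."""
    X1, X2 = X[:, :-1], X[:, 1:]                 # snapshot pairs x_k -> x_{k+1}
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U_r, s_r, V_r = U[:, :rank], s[:rank], Vh[:rank].conj().T
    A_tilde = U_r.conj().T @ X2 @ V_r / s_r      # reduced linear operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X2 @ V_r / s_r @ W                   # exact DMD modes
    return eigvals, modes

# Demo: 384-dim embeddings of 40 consecutive sentences (random stand-ins).
X = np.random.default_rng(0).standard_normal((384, 40))
eigvals, _ = dmd_modes(X, rank=8)
print(np.abs(eigvals))  # |lambda| < 1 flags modes that decay across sentences
```

A low-rank spectrum whose modes vanish quickly across sentences is the kind of signature the paper associates with hallucinated text.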

Saturn: An Optimized Data System for Large Model Deep Learning Workloads

  • paper_url: http://arxiv.org/abs/2309.01226
  • repo_url: https://github.com/knagrecha/saturn
  • paper_authors: Kabir Nagrecha, Arun Kumar
  • for: This work aims to help deep learning (DL) users select and run large models by tackling three burdens DL users face: parallelism selection, resource apportioning, and scheduling.
  • methods: Proposes a new information system architecture that addresses the three burdens jointly, combining an extensible template for existing parallelism schemes, an automated empirical profiler for runtime estimation, an MILP (mixed-integer linear programming) formulation of the joint problem, and an introspective scheduling strategy.
  • results: Experiments show that direct use of an MILP solver significantly reduces model selection runtimes (by 39-49%), and that the introspective scheduling approach further optimizes system runtime. These techniques are implemented in a new data system called Saturn.
    Abstract Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques & tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists, domain scientists, etc. who may lack the necessary systems knowhow. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as an MILP. We find that direct use of an MILP-solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.
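As a flavor of what an MILP over parallelism selection and resource allocation can look like, here is a deliberately tiny sketch using PuLP; the jobs, configurations, profiled runtimes, and the objective are invented for illustration and are much simpler than Saturn's actual SPASE formulation.

```python
import pulp

# (job, parallelism config) -> (gpus_needed, profiled_runtime_seconds)
configs = {
    ("bert", "ddp-2"): (2, 900), ("bert", "pipe-4"): (4, 520),
    ("gpt2", "ddp-4"): (4, 1400), ("gpt2", "fsdp-8"): (8, 760),
}
GPU_BUDGET = 8

prob = pulp.LpProblem("spase_toy", pulp.LpMinimize)
x = {k: pulp.LpVariable(f"x_{k[0]}_{k[1]}", cat="Binary") for k in configs}

# Objective: total runtime of the chosen configurations.
prob += pulp.lpSum(x[k] * configs[k][1] for k in configs)
# Each job picks exactly one parallelism configuration.
for job in {j for j, _ in configs}:
    prob += pulp.lpSum(x[k] for k in configs if k[0] == job) == 1
# Concurrent GPU demand must fit the cluster (crude: all jobs co-scheduled).
prob += pulp.lpSum(x[k] * configs[k][0] for k in configs) <= GPU_BUDGET

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for k, var in x.items():
    if var.value() == 1:
        print(f"{k[0]} -> {k[1]}")
```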

Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.01219
  • repo_url: https://github.com/hillzhang1999/llm-hallucination-survey
  • paper_authors: Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi
  • for: This work surveys the reliability problem large language models (LLMs) face in real-world applications, namely their tendency to occasionally hallucinate.
  • methods: Reviews and analyzes existing methods for detecting, explaining, and mitigating LLM hallucination, and discusses possible directions for future research.
  • results: Presents taxonomies of LLM hallucination phenomena and evaluation benchmarks, analyzes the effectiveness of existing mitigation approaches, and identifies promising directions for future work.
    Abstract While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.

Physics-inspired Neural Networks for Parameter Learning of Adaptive Cruise Control Systems

  • paper_url: http://arxiv.org/abs/2309.01211
  • repo_url: None
  • paper_authors: Theocharis Apostolakis, Konstantinos Ampountolas
  • for: This work proposes a physics-inspired neural network (PiNN) for learning the parameters of commercially implemented adaptive cruise control (ACC) systems.
  • methods: Uses multi-layer artificial neural networks as universal function approximators and adopts the constant time-headway policy (CTHP) to model the longitudinal dynamics of ACC-engaged vehicles.
  • results: The proposed PiNN efficiently learns the unknown parameters of stock ACC systems and is rigorously evaluated on ACC systems from different car manufacturers. The results further show that the design parameters of the considered ACC systems are neither $L_2$ nor $L_\infty$ string stable.
    Abstract This paper proposes and develops a physics-inspired neural network (PiNN) for learning the parameters of commercially implemented adaptive cruise control (ACC) systems in automotive industry. To emulate the core functionality of stock ACC systems, which have proprietary control logic and undisclosed parameters, the constant time-headway policy (CTHP) is adopted. Leveraging the multi-layer artificial neural networks as universal approximators, the developed PiNN serves as a surrogate model for the longitudinal dynamics of ACC-engaged vehicles, efficiently learning the unknown parameters of the CTHP. The ability of the PiNN to infer the unknown ACC parameters is meticulous evaluated using both synthetic and high-fidelity empirical data of space-gap and relative velocity involving ACC-engaged vehicles in platoon formation. The results have demonstrated the superior predictive ability of the proposed PiNN in learning the unknown design parameters of stock ACC systems from different car manufacturers. The set of ACC model parameters obtained from the PiNN revealed that the stock ACC systems of the considered vehicles in three experimental campaigns are neither $L_2$ nor $L_\infty$ string stable.
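A sketch of the physics-informed idea is shown below: fit learnable CTHP parameters by minimizing the residual between measured accelerations and a constant time-headway control law. The specific law, parameter names, and toy data are assumptions for illustration, not the paper's exact formulation.

```python
# Assumed CTHP-style law: dv/dt = alpha * (s - s0 - h * v) + beta * (v_lead - v)
import torch

# Learnable parameters: headway h, gains alpha/beta, standstill gap s0.
h = torch.nn.Parameter(torch.tensor(1.5))
alpha = torch.nn.Parameter(torch.tensor(0.1))
beta = torch.nn.Parameter(torch.tensor(0.5))
s0 = torch.nn.Parameter(torch.tensor(2.0))

def physics_residual(s, v, v_lead, a_meas):
    """Mismatch between measured acceleration and the CTHP prediction."""
    a_cthp = alpha * (s - s0 - h * v) + beta * (v_lead - v)
    return ((a_meas - a_cthp) ** 2).mean()

# Toy trajectory: space gap, ego speed, lead speed, measured acceleration.
s = torch.tensor([25.0, 24.0, 23.5]); v = torch.tensor([14.0, 14.2, 14.3])
v_lead = torch.tensor([14.0, 13.8, 13.9]); a_meas = torch.tensor([0.2, 0.1, 0.0])

opt = torch.optim.Adam([h, alpha, beta, s0], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = physics_residual(s, v, v_lead, a_meas)
    loss.backward()
    opt.step()
print(float(h), float(alpha), float(beta), float(s0))
```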

A Visual Interpretation-Based Self-Improved Classification System Using Virtual Adversarial Training

  • paper_url: http://arxiv.org/abs/2309.01196
  • repo_url: None
  • paper_authors: Shuai Jiang, Sayaka Kamei, Chen Li, Shengzhe Hou, Yasuhiko Morimoto
  • for: This paper proposes a visual interpretation-based self-improving classification model to address the black-box nature of BERT-based classifiers in natural language processing.
  • methods: A fine-tuned BERT model serves as a text sentiment classifier, and its predicted sentiment labels are then used as part of the input to another BERT model for spam classification, trained in a semi-supervised manner with virtual adversarial training (VAT).
  • results: Experiments on a Twitter tweet dataset demonstrate effective classification performance, and an ablation study shows how the different model components affect the classification results.
    Abstract The successful application of large pre-trained models such as BERT in natural language processing has attracted more attention from researchers. Since the BERT typically acts as an end-to-end black box, classification systems based on it usually have difficulty in interpretation and low robustness. This paper proposes a visual interpretation-based self-improving classification model with a combination of virtual adversarial training (VAT) and BERT models to address the above problems. Specifically, a fine-tuned BERT model is used as a classifier to classify the sentiment of the text. Then, the predicted sentiment classification labels are used as part of the input of another BERT for spam classification via a semi-supervised training manner using VAT. Additionally, visualization techniques, including visualizing the importance of words and normalizing the attention head matrix, are employed to analyze the relevance of each component to classification accuracy. Moreover, brand-new features will be found in the visual analysis, and classification performance will be improved. Experimental results on Twitter's tweet dataset demonstrate the effectiveness of the proposed model on the classification task. Furthermore, the ablation study results illustrate the effect of different components of the proposed model on the classification results.
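For readers unfamiliar with VAT, the sketch below shows the standard virtual adversarial loss on continuous inputs (for BERT this would act on token embeddings rather than raw text); the hyperparameters and the toy classifier are illustrative, and the paper's exact setup is not reproduced.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=1.0, n_power=1):
    """Smoothness penalty: KL between predictions at x and at x + r_adv."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=-1)          # fixed target distribution
    d = torch.randn_like(x)                      # random initial direction
    for _ in range(n_power):                     # power iteration for r_adv
        d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
        d.requires_grad_()
        kl = F.kl_div(F.log_softmax(model(x + d), dim=-1), p,
                      reduction="batchmean")
        d = torch.autograd.grad(kl, d)[0].detach()
    r_adv = eps * F.normalize(d.flatten(1), dim=1).view_as(x)
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=-1), p,
                    reduction="batchmean")

# Toy usage on embedding-like inputs with a linear classifier.
model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)
loss = vat_loss(model, x)
loss.backward()  # gradients flow only through the classifier parameters
```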

A Survey on Service Route and Time Prediction in Instant Delivery: Taxonomy, Progress, and Prospects

  • paper_url: http://arxiv.org/abs/2309.01194
  • repo_url: None
  • paper_authors: Haomin Wen, Youfang Lin, Lixia Wu, Xiaowei Mao, Tianyue Cai, Yunfeng Hou, Shengnan Guo, Yuxuan Liang, Guangyin Jin, Yiji Zhao, Roger Zimmermann, Jieping Ye, Huaiyu Wan
  • for: This paper provides a systematic survey of service Route & Time Prediction (RTP) for instant delivery platforms to help researchers better understand the field.
  • methods: Introduces a novel taxonomy that categorizes RTP methods along three criteria: type of task (only-route prediction, only-time prediction, and joint route & time prediction), model architecture (sequence-based and graph-based models), and learning paradigm (supervised learning and deep reinforcement learning).
  • results: Provides a comprehensive overview that categorizes and summarizes existing RTP methods, highlights the limitations of current research, and suggests prospective future directions.
    Abstract Instant delivery services, such as food delivery and package delivery, have achieved explosive growth in recent years by providing customers with daily-life convenience. An emerging research area within these services is service Route\&Time Prediction (RTP), which aims to estimate the future service route as well as the arrival time of a given worker. As one of the most crucial tasks in those service platforms, RTP stands central to enhancing user satisfaction and trimming operational expenditures on these platforms. Despite a plethora of algorithms developed to date, there is no systematic, comprehensive survey to guide researchers in this domain. To fill this gap, our work presents the first comprehensive survey that methodically categorizes recent advances in service route and time prediction. We start by defining the RTP challenge and then delve into the metrics that are often employed. Following that, we scrutinize the existing RTP methodologies, presenting a novel taxonomy of them. We categorize these methods based on three criteria: (i) type of task, subdivided into only-route prediction, only-time prediction, and joint route\&time prediction; (ii) model architecture, which encompasses sequence-based and graph-based models; and (iii) learning paradigm, including Supervised Learning (SL) and Deep Reinforcement Learning (DRL). Conclusively, we highlight the limitations of current research and suggest prospective avenues. We believe that the taxonomy, progress, and prospects introduced in this paper can significantly promote the development of this field.

LogGPT: Exploring ChatGPT for Log-Based Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.01189
  • repo_url: None
  • paper_authors: Jiaxing Qi, Shaohan Huang, Zhongzhi Luan, Carol Fung, Hailong Yang, Depei Qian
  • for: This work proposes a ChatGPT-based method for log-based anomaly detection in software systems, addressing the difficulty of analyzing high-dimensional and noisy log data.
  • methods: Leverages ChatGPT's language interpretation capabilities to explore the transferability of knowledge from large-scale corpora to log-based anomaly detection.
  • results: Experiments on the BGL and Spirit datasets show that LogGPT achieves promising results with good interpretability, providing preliminary insights into prompt-based models for the log-based anomaly detection task.
    Abstract The increasing volume of log data produced by software-intensive systems makes it impractical to analyze them manually. Many deep learning-based methods have been proposed for log-based anomaly detection. These methods face several challenges such as high-dimensional and noisy log data, class imbalance, generalization, and model interpretability. Recently, ChatGPT has shown promising results in various domains. However, there is still a lack of study on the application of ChatGPT for log-based anomaly detection. In this work, we proposed LogGPT, a log-based anomaly detection framework based on ChatGPT. By leveraging the ChatGPT's language interpretation capabilities, LogGPT aims to explore the transferability of knowledge from large-scale corpora to log-based anomaly detection. We conduct experiments to evaluate the performance of LogGPT and compare it with three deep learning-based methods on BGL and Spirit datasets. LogGPT shows promising results and has good interpretability. This study provides preliminary insights into prompt-based models, such as ChatGPT, for the log-based anomaly detection task.
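Since the paper's prompts are not given here, the following is a hypothetical sketch of how a prompt-based log anomaly check could be structured; the prompt wording and the `call_llm` helper are stand-ins for whatever model access is used.

```python
def build_prompt(log_window: list[str]) -> str:
    logs = "\n".join(f"{i + 1}. {line}" for i, line in enumerate(log_window))
    return (
        "You are a system reliability expert. Given the following sequence of "
        "log messages, answer ANOMALY or NORMAL and briefly explain why.\n\n"
        f"{logs}\n\nAnswer:"
    )

def classify(log_window: list[str], call_llm) -> bool:
    """Returns True if the LLM flags the window as anomalous."""
    reply = call_llm(build_prompt(log_window))
    return reply.strip().upper().startswith("ANOMALY")

# Example with a stubbed LLM:
stub = lambda prompt: "ANOMALY: kernel panic followed by repeated retries."
print(classify(["kernel: panic on CPU 3", "retrying block write (attempt 5)"], stub))
```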

Pre-trained Neural Recommenders: A Transferable Zero-Shot Framework for Recommendation Systems

  • paper_url: http://arxiv.org/abs/2309.01188
  • repo_url: None
  • paper_authors: Junting Wang, Adit Krishnan, Hari Sundaram, Yunzhe Li
  • for: This work aims to develop pre-trained neural collaborative filtering models, which are critical to the success of e-commerce, social media, and content-sharing platforms.
  • methods: Inspired by pre-trained vision and language models, it explores pre-trained recommender models that support building recommender systems in new domains with minimal or no retraining and without any auxiliary user or item information.
  • results: The study shows that zero-shot recommendation models requiring no user or item auxiliary information can be learned from the statistical characteristics of the user-item interaction matrix, and that these models adapt across domains and datasets.
    Abstract Modern neural collaborative filtering techniques are critical to the success of e-commerce, social media, and content-sharing platforms. However, despite technical advances -- for every new application domain, we need to train an NCF model from scratch. In contrast, pre-trained vision and language models are routinely applied to diverse applications directly (zero-shot) or with limited fine-tuning. Inspired by the impact of pre-trained models, we explore the possibility of pre-trained recommender models that support building recommender systems in new domains, with minimal or no retraining, without the use of any auxiliary user or item information. Zero-shot recommendation without auxiliary information is challenging because we cannot form associations between users and items across datasets when there are no overlapping users or items. Our fundamental insight is that the statistical characteristics of the user-item interaction matrix are universally available across different domains and datasets. Thus, we use the statistical characteristics of the user-item interaction matrix to identify dataset-independent representations for users and items. We show how to learn universal (i.e., supporting zero-shot adaptation without user or item auxiliary information) representations for nodes and edges from the bipartite user-item interaction graph. We learn representations by exploiting the statistical properties of the interaction data, including user and item marginals, and the size and density distributions of their clusters.
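The sketch below illustrates the kind of dataset-independent statistics the paper builds on: user and item marginals and their degree distributions, computed from a bipartite interaction matrix. The specific histogram featurization is an illustrative assumption, not the paper's representation.

```python
import numpy as np
from scipy import sparse

# Toy implicit-feedback matrix: rows = users, cols = items.
R = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
R.data[:] = 1.0

user_deg = np.asarray(R.sum(axis=1)).ravel()   # items per user (user marginal)
item_deg = np.asarray(R.sum(axis=0)).ravel()   # users per item (item marginal)

def degree_histogram(deg, bins=10):
    """Normalized log-degree histogram: transfers across datasets because it
    does not depend on any particular user or item identity."""
    hist, _ = np.histogram(np.log1p(deg), bins=bins, density=True)
    return hist / (hist.sum() + 1e-12)

print(degree_histogram(user_deg))
print(degree_histogram(item_deg))
```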

Cognition-Mode Aware Variational Representation Learning Framework for Knowledge Tracing

  • paper_url: http://arxiv.org/abs/2309.01179
  • repo_url: https://github.com/zmy-9/CMVF
  • paper_authors: Moyu Zhang, Xinning Zhu, Chunhong Zhang, Feng Pan, Wenchen Qian, Hui Zhao
  • for: This paper targets the knowledge tracing (KT) task in personalized learning, addressing its data sparsity problem so that robust representations can be learned even for students with few practice records.
  • methods: Proposes a Cognition-Mode Aware Variational Representation Learning Framework (CMVF) that can be applied directly to existing KT methods. CMVF uses a probabilistic model to generate a distribution for each student, accounting for the uncertainty of limited practice records, and estimates each student's distribution via variational inference (VI). A cognition-mode aware multinomial distribution is further introduced as prior knowledge to avoid over-personalizing students with few practice records.
  • results: Experiments show that CMVF effectively helps existing KT methods learn more robust student representations.
    Abstract The Knowledge Tracing (KT) task plays a crucial role in personalized learning, and its purpose is to predict student responses based on their historical practice behavior sequence. However, the KT task suffers from data sparsity, which makes it challenging to learn robust representations for students with few practice records and increases the risk of model overfitting. Therefore, in this paper, we propose a Cognition-Mode Aware Variational Representation Learning Framework (CMVF) that can be directly applied to existing KT methods. Our framework uses a probabilistic model to generate a distribution for each student, accounting for uncertainty in those with limited practice records, and estimate the student's distribution via variational inference (VI). In addition, we also introduce a cognition-mode aware multinomial distribution as prior knowledge that constrains the posterior student distributions learning, so as to ensure that students with similar cognition modes have similar distributions, avoiding overwhelming personalization for students with few practice records. At last, extensive experimental results confirm that CMVF can effectively aid existing KT methods in learning more robust student representations. Our code is available at https://github.com/zmy-9/CMVF.
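The following is a minimal sketch of the variational ingredient: each student gets a Gaussian posterior sampled with the reparameterization trick and regularized by a KL term. The standard-normal prior here stands in for CMVF's cognition-mode aware prior, so the architecture is illustrative rather than the paper's exact design.

```python
import torch

class VariationalStudentEmbedding(torch.nn.Module):
    def __init__(self, n_students: int, dim: int):
        super().__init__()
        self.mu = torch.nn.Embedding(n_students, dim)
        self.log_var = torch.nn.Embedding(n_students, dim)

    def forward(self, student_ids):
        mu = self.mu(student_ids)
        log_var = self.log_var(student_ids)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterize
        # KL(q(z) || N(0, I)) per student, averaged over the batch.
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
        return z, kl

emb = VariationalStudentEmbedding(n_students=100, dim=16)
z, kl = emb(torch.tensor([3, 7, 42]))
print(z.shape, float(kl))  # z feeds the downstream KT model; kl joins the loss
```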

Logic of subjective probability

  • paper_url: http://arxiv.org/abs/2309.01173
  • repo_url: None
  • paper_authors: Vladimir Vovk
  • for: This paper studies both the syntax and semantics of subjective probability.
  • methods: Discusses ways of testing probability statements, covering important varieties of subjective probability including intersubjective and impersonal probabilities.
  • results: Argues that well-tested impersonal probabilities acquire features of objective probabilities, and invokes Jeffreys's law, which states that two successful probability forecasters must issue forecasts close to each other, in support of this idea.
    Abstract In this paper I discuss both syntax and semantics of subjective probability. The semantics determines ways of testing probability statements. Among important varieties of subjective probabilities are intersubjective probabilities and impersonal probabilities, and I will argue that well-tested impersonal probabilities acquire features of objective probabilities. Jeffreys's law, my next topic, states that two successful probability forecasters must issue forecasts that are close to each other, thus supporting the idea of objective probabilities. Finally, I will discuss connections between subjective and frequentist probability.

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

  • paper_url: http://arxiv.org/abs/2309.01172
  • repo_url: None
  • paper_authors: Zhenheng Tang, Yuxin Wang, Xin He, Longteng Zhang, Xinglin Pan, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Bingsheng He, Xiaowen Chu
  • for: This paper addresses the rapidly growing memory and computation requirements of large language models (LLMs), which prevent those without large-scale high-end GPUs from training or deploying them. Consumer-level GPUs, despite their larger market share, are typically overlooked for LLMs due to weaker compute, smaller storage, and lower communication bandwidth, and users may also have privacy concerns when interacting with remote LLMs.
  • methods: Proposes a decentralized system that unlocks the potential of consumer-level GPUs for pre-training, inference, and fine-tuning of LLMs with privacy protection, tackling critical challenges including limited CPU and GPU memory, low network bandwidth, and the heterogeneity of peers and devices.
  • results: The system design incorporates: 1) a broker with a backup pool to support dynamic join and quit of computing providers; 2) hardware-aware task scheduling to improve system efficiency; 3) abstracting ML procedures into directed acyclic graphs (DAGs) to achieve model and task universality; and 4) abstracting intermediate representation and execution planes to ensure compatibility across devices and deep learning (DL) frameworks. Performance analysis shows that 50 RTX 3080 GPUs can achieve throughputs comparable to 4 H100 GPUs, which are significantly more expensive.
    Abstract The rapid growth of memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs. However, consumer-level GPUs, which constitute a larger market share, are typically overlooked in LLM due to their weaker computing performance, smaller storage capacity, and lower communication bandwidth. Additionally, users may have privacy concerns when interacting with remote LLMs. In this paper, we envision a decentralized system unlocking the potential vast untapped consumer-level GPUs in pre-training, inference and fine-tuning of LLMs with privacy protection. However, this system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity. To address these challenges, our system design incorporates: 1) a broker with backup pool to implement dynamic join and quit of computing providers; 2) task scheduling with hardware performance to improve system efficiency; 3) abstracting ML procedures into directed acyclic graphs (DAGs) to achieve model and task universality; 4) abstracting intermediate represention and execution planes to ensure compatibility of various devices and deep learning (DL) frameworks. Our performance analysis demonstrates that 50 RTX 3080 GPUs can achieve throughputs comparable to those of 4 H100 GPUs, which are significantly more expensive.

End-to-End Learning on Multimodal Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2309.01169
  • repo_url: https://gitlab.com/wxwilcke/mrgcn
  • paper_authors: W. X. Wilcke, P. Bloem, V. de Boer, R. H. van t Veer
  • for: This paper aims to enable data scientists to learn end-to-end on heterogeneous knowledge by proposing a multimodal message passing network that can learn from the structure of graphs and multimodal node features.
  • methods: The proposed model uses dedicated neural encoders to learn embeddings for node features belonging to five different types of modalities, which are then projected into a joint representation space together with their relational information.
  • results: The authors implement and demonstrate their model on node classification and link prediction for artificial and real-world datasets, and conduct an inverse ablation study to evaluate the effect that each modality has on the overall performance. The results show that end-to-end multimodal learning from any arbitrary knowledge graph is possible, and that including multimodal information can significantly affect performance, but much depends on the characteristics of the data.
    Abstract Knowledge graphs enable data scientists to learn end-to-end on heterogeneous knowledge. However, most end-to-end models solely learn from the relational information encoded in graphs' structure: raw values, encoded as literal nodes, are either omitted completely or treated as regular nodes without consideration for their values. In either case we lose potentially relevant information which could have otherwise been exploited by our learning methods. We propose a multimodal message passing network which not only learns end-to-end from the structure of graphs, but also from their possibly divers set of multimodal node features. Our model uses dedicated (neural) encoders to naturally learn embeddings for node features belonging to five different types of modalities, including numbers, texts, dates, images and geometries, which are projected into a joint representation space together with their relational information. We implement and demonstrate our model on node classification and link prediction for artificial and real-worlds datasets, and evaluate the effect that each modality has on the overall performance in an inverse ablation study. Our results indicate that end-to-end multimodal learning from any arbitrary knowledge graph is indeed possible, and that including multimodal information can significantly affect performance, but that much depends on the characteristics of the data.

Spatial-temporal Vehicle Re-identification

  • paper_url: http://arxiv.org/abs/2309.01166
  • repo_url: https://github.com/Zhongdao/VehicleReIDKeyPointData
  • paper_authors: Hye-Geun Kim, YouKyoung Na, Hae-Won Joe, Yong-Hyuk Moon, Yeong-Jun Cho
  • for: To solve vehicle re-identification in large-scale camera networks, which matters for public safety, traffic control, and security.
  • methods: Estimates a reliable camera network topology with an adaptive Parzen window method and optimally combines appearance and spatial-temporal similarities through a fusion network.
  • results: Achieves 99.64% rank-1 accuracy on the public VeRi776 dataset, showing that exploiting spatial and temporal information raises the accuracy of appearance-based methods and effectively handles vehicle appearance ambiguities.
    Abstract Vehicle re-identification (ReID) in a large-scale camera network is important in public safety, traffic control, and security. However, due to the appearance ambiguities of vehicle, the previous appearance-based ReID methods often fail to track vehicle across multiple cameras. To overcome the challenge, we propose a spatial-temporal vehicle ReID framework that estimates reliable camera network topology based on the adaptive Parzen window method and optimally combines the appearance and spatial-temporal similarities through the fusion network. Based on the proposed methods, we performed superior performance on the public dataset (VeRi776) by 99.64% of rank-1 accuracy. The experimental results support that utilizing spatial and temporal information for ReID can leverage the accuracy of appearance-based methods and effectively deal with appearance ambiguities.
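A plausible building block of the spatial-temporal side is sketched below: a Parzen (kernel density) estimate over observed camera-to-camera transition times, used to score candidate matches. The fixed bandwidth and toy data are simplifications; the paper's method adapts the window.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Observed travel times (seconds) of matched vehicles from camera A to camera B.
transit_times = np.array([32.0, 35.5, 31.2, 40.1, 33.8, 36.4, 90.0])

kde = gaussian_kde(transit_times, bw_method="scott")

def transition_likelihood(dt: float) -> float:
    """Spatial-temporal similarity of a candidate match with time gap dt."""
    return float(kde(dt))

print(transition_likelihood(34.0))   # plausible gap -> high density
print(transition_likelihood(300.0))  # implausible gap -> near-zero density
```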

Large Language Models for Generative Recommendation: A Survey and Visionary Discussions

  • paper_url: http://arxiv.org/abs/2309.01157
  • repo_url: None
  • paper_authors: Lei Li, Yongfeng Zhang, Dugang Liu, Li Chen
  • for: This survey examines the application and development of large language models (LLMs) in recommender systems (RS).
  • methods: Reviews the literature on LLM-based generative recommendation, in which recommendations are generated directly from the complete pool of items rather than through a multi-stage pipeline of score computation and re-ranking.
  • results: Organizes the field around three questions: what generative recommendation is, why RS should advance to generative recommendation, and how to implement LLM-based generative recommendation for various RS tasks, providing context and guidance for this emerging topic.
    Abstract Recent years have witnessed the wide adoption of large language models (LLM) in different fields, especially natural language processing and computer vision. Such a trend can also be observed in recommender systems (RS). However, most of related work treat LLM as a component of the conventional recommendation pipeline (e.g., as a feature extractor) which may not be able to fully leverage the generative power of LLM. Instead of separating the recommendation process into multiple stages such as score computation and re-ranking, this process can be simplified to one stage with LLM: directly generating recommendations from the complete pool of items. This survey reviews the progress, methods and future directions of LLM-based generative recommendation by examining three questions: 1) What generative recommendation is, 2) Why RS should advance to generative recommendation, and 3) How to implement LLM-based generative recommendation for various RS tasks. We hope that the survey can provide the context and guidance needed to explore this interesting and emerging topic.

FedFwd: Federated Learning without Backpropagation

  • paper_url: http://arxiv.org/abs/2309.01150
  • repo_url: None
  • paper_authors: Seonghwan Park, Dahun Shin, Jinseok Chung, Namhoon Lee
  • for: To lessen the impact of resource-constrained clients in federated learning (FL) and improve training efficiency.
  • methods: Adopts the recent backpropagation-free Forward-Forward algorithm by Hinton (2022) to perform layer-wise local parameter updates during local training.
  • results: Experiments on standard datasets such as MNIST and CIFAR-10 show that FedFwd is competitive with other BP-dependent FL methods.
    Abstract In federated learning (FL), clients with limited resources can disrupt the training efficiency. A potential solution to this problem is to leverage a new learning procedure that does not rely on backpropagation (BP). We present a novel approach to FL called FedFwd that employs a recent BP-free method by Hinton (2022), namely the Forward Forward algorithm, in the local training process. FedFwd can reduce a significant amount of computations for updating parameters by performing layer-wise local updates, and therefore, there is no need to store all intermediate activation values during training. We conduct various experiments to evaluate FedFwd on standard datasets including MNIST and CIFAR-10, and show that it works competitively to other BP-dependent FL methods.
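Below is a compact sketch of one Forward-Forward layer update as described by Hinton (2022): each layer is trained locally to push its "goodness" (sum of squared activations) above a threshold for positive samples and below it for negative ones, with no gradient crossing layer boundaries. The random tensors stand in for label-embedded positive/negative inputs, and the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

class FFLayer(torch.nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.linear.parameters(), lr=lr)

    def forward(self, x):
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)  # keep direction only
        return F.relu(self.linear(x))

    def local_update(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # per-sample goodness
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        # Detach: no gradient ever crosses a layer boundary (no backprop).
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()

layers = [FFLayer(784, 256), FFLayer(256, 256)]
x_pos, x_neg = torch.rand(32, 784), torch.rand(32, 784)
for layer in layers:  # one layer-wise local training pass
    x_pos, x_neg = layer.local_update(x_pos, x_neg)
print(x_pos.shape)
```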

Interpretable Sequence Clustering

  • paper_url: http://arxiv.org/abs/2309.01140
  • repo_url: https://github.com/jd445/Interpretable-Sequence-Clustering-Tree
  • paper_authors: Junjie Dong, Xinyi Yang, Mudi Jiang, Lianyu Hu, Zengyou He
  • for: To address the interpretability problem in categorical sequence clustering by providing a concise, interpretable tree structure.
  • methods: Combines sequential patterns with a boosting-based construction strategy: sequences are first projected into random subspaces, the k-means algorithm yields initial cluster assignments, and a pattern-based decision tree is then built by re-projecting and re-clustering the sequences at each node before mining the top-1 discriminative splitting pattern.
  • results: Experiments on 14 real-world datasets show that the proposed method provides an interpretable tree structure while delivering fast and accurate cluster assignments.
    Abstract Categorical sequence clustering plays a crucial role in various fields, but the lack of interpretability in cluster assignments poses significant challenges. Sequences inherently lack explicit features, and existing sequence clustering algorithms heavily rely on complex representations, making it difficult to explain their results. To address this issue, we propose a method called Interpretable Sequence Clustering Tree (ISCT), which combines sequential patterns with a concise and interpretable tree structure. ISCT leverages k-1 patterns to generate k leaf nodes, corresponding to k clusters, which provides an intuitive explanation on how each cluster is formed. More precisely, ISCT first projects sequences into random subspaces and then utilizes the k-means algorithm to obtain high-quality initial cluster assignments. Subsequently, it constructs a pattern-based decision tree using a boosting-based construction strategy in which sequences are re-projected and re-clustered at each node before mining the top-1 discriminative splitting pattern. Experimental results on 14 real-world data sets demonstrate that our proposed method provides an interpretable tree structure while delivering fast and accurate cluster assignments.
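The first stage of the pipeline can be illustrated as follows: project categorical sequences into random pattern subspaces and obtain initial assignments with k-means. The presence-of-random-subsequence featurization below is a simplification of the paper's random-subspace projection, and the toy data are invented.

```python
import random
import numpy as np
from sklearn.cluster import KMeans

sequences = [list("abcabc"), list("abccba"), list("xyzxyz"), list("xzyxzy")]

def random_patterns(seqs, n_patterns=16, length=2, seed=0):
    rng = random.Random(seed)
    alphabet = sorted({s for seq in seqs for s in seq})
    return [tuple(rng.choices(alphabet, k=length)) for _ in range(n_patterns)]

def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as a (non-contiguous) subsequence."""
    it = iter(seq)
    return all(any(s == p for s in it) for p in pattern)

patterns = random_patterns(sequences)
X = np.array([[contains(s, p) for p in patterns] for s in sequences], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # initial assignments, later refined by the pattern-based tree
```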

Financial Fraud Detection using Quantum Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2309.01127
  • repo_url: None
  • paper_authors: Nouhaila Innan, Abhishek Sawaika, Ashim Dhor, Siddhant Dutta, Sairupa Thota, Husayn Gokal, Nandan Patel, Muhammad Al-Zafar Khan, Ioannis Theodonis, Mohamed Bennai
  • for: To prevent financial fraud and protect the reputation of financial institutions.
  • methods: Uses Quantum Graph Neural Networks (QGNNs) whose performance is enhanced with Variational Quantum Circuits (VQCs).
  • results: On a real-world financial fraud detection dataset, the QGNN achieves an AUC of 0.85, outperforming classical GNNs and suggesting QGNNs as a promising new approach for improving fraud detection.
    Abstract Financial fraud detection is essential for preventing significant financial losses and maintaining the reputation of financial institutions. However, conventional methods of detecting financial fraud have limited effectiveness, necessitating the need for new approaches to improve detection rates. In this paper, we propose a novel approach for detecting financial fraud using Quantum Graph Neural Networks (QGNNs). QGNNs are a type of neural network that can process graph-structured data and leverage the power of Quantum Computing (QC) to perform computations more efficiently than classical neural networks. Our approach uses Variational Quantum Circuits (VQC) to enhance the performance of the QGNN. In order to evaluate the efficiency of our proposed method, we compared the performance of QGNNs to Classical Graph Neural Networks using a real-world financial fraud detection dataset. The results of our experiments showed that QGNNs achieved an AUC of $0.85$, which outperformed classical GNNs. Our research highlights the potential of QGNNs and suggests that QGNNs are a promising new approach for improving financial fraud detection.
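For a sense of the quantum ingredient, here is a hedged PennyLane sketch of a small variational quantum circuit of the kind a QGNN layer might use; the angle-encoding ansatz, wiring, and readout are illustrative assumptions, not the paper's circuit.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(features, weights):
    # Angle-encode node features, then apply an entangling variational layer.
    for i in range(n_qubits):
        qml.RY(features[i], wires=i)
    for i in range(n_qubits):
        qml.RZ(weights[i], wires=i)
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])
    return qml.expval(qml.PauliZ(0))

features = np.array([0.1, 0.5, 0.9, 0.3])
weights = np.array([0.2, 0.4, 0.6, 0.8], requires_grad=True)
print(vqc(features, weights))  # scalar read out as a fraud-score component
```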

MedChatZH: a Better Medical Adviser Learns from Better Instructions

  • paper_url: http://arxiv.org/abs/2309.01114
  • repo_url: https://github.com/tyang816/medchatzh
  • paper_authors: Yang Tan, Mingchen Li, Zijie Huang, Huiqun Yu, Guisheng Fan
  • for: To improve domain-specific question answering (QA) for traditional Chinese medicine with generative large language models (LLMs).
  • methods: Pre-trains on Chinese traditional medical books and fine-tunes on a carefully curated medical instruction dataset.
  • results: The model outperforms several solid baselines on a real-world medical dialogue dataset; the model, code, and dataset are released to encourage further research in the field.
    Abstract Generative large language models (LLMs) have shown great success in various applications, including question-answering (QA) and dialogue systems. However, in specialized domains like traditional Chinese medical QA, these models may perform unsatisfactorily without fine-tuning on domain-specific datasets. To address this, we introduce MedChatZH, a dialogue model designed specifically for traditional Chinese medical QA. Our model is pre-trained on Chinese traditional medical books and fine-tuned with a carefully curated medical instruction dataset. It outperforms several solid baselines on a real-world medical dialogue dataset. We release our model, code, and dataset on https://github.com/tyang816/MedChatZH to facilitate further research in the domain of traditional Chinese medicine and LLMs.

A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture

  • paper_url: http://arxiv.org/abs/2309.01105
  • repo_url: None
  • paper_authors: Cheonsu Jeong
  • for: This study presents a method for implementing generative AI services using a Large Language Model (LLM) application architecture.
  • methods: Mitigates data scarcity through fine-tuning techniques and direct document integration, and develops a Retrieval-Augmented Generation (RAG) model that enhances information storage and retrieval processes to improve content generation.
  • results: The study shows that the RAG model effectively mitigates data scarcity and improves the practical usability of LLM-based services within enterprises.
    Abstract This study presents a method for implementing generative AI services by utilizing the Large Language Models (LLM) application architecture. With recent advancements in generative AI technology, LLMs have gained prominence across various domains. In this context, the research addresses the challenge of information scarcity and proposes specific remedies by harnessing LLM capabilities. The investigation delves into strategies for mitigating the issue of inadequate data, offering tailored solutions. The study delves into the efficacy of employing fine-tuning techniques and direct document integration to alleviate data insufficiency. A significant contribution of this work is the development of a Retrieval-Augmented Generation (RAG) model, which tackles the aforementioned challenges. The RAG model is carefully designed to enhance information storage and retrieval processes, ensuring improved content generation. The research elucidates the key phases of the information storage and retrieval methodology underpinned by the RAG model. A comprehensive analysis of these steps is undertaken, emphasizing their significance in addressing the scarcity of data. The study highlights the efficacy of the proposed method, showcasing its applicability through illustrative instances. By implementing the RAG model for information storage and retrieval, the research not only contributes to a deeper comprehension of generative AI technology but also facilitates its practical usability within enterprises utilizing LLMs. This work holds substantial value in advancing the field of generative AI, offering insights into enhancing data-driven content generation and fostering active utilization of LLM-based services within corporate settings.
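A minimal retrieve-then-generate sketch in the spirit of the RAG model is shown below, using TF-IDF retrieval over enterprise documents and a stubbed LLM call; both the retriever choice and the `call_llm` helper are illustrative assumptions, not the study's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Expense reports must be filed within 30 days of travel.",
    "VPN access requires manager approval and a hardware token.",
    "The data retention policy keeps customer records for 7 years.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vecs = vectorizer.transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    sims = cosine_similarity(vectorizer.transform([query]), doc_vecs)[0]
    return [documents[i] for i in sims.argsort()[::-1][:k]]

def answer(query: str, call_llm) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

stub = lambda p: "Customer records are retained for 7 years."  # stand-in LLM
print(answer("How long do we keep customer data?", stub))
```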

M2HGCL: Multi-Scale Meta-Path Integrated Heterogeneous Graph Contrastive Learning

  • paper_url: http://arxiv.org/abs/2309.01101
  • repo_url: None
  • paper_authors: Yuanyuan Guo, Yu Xia, Rui Wang, Rongcheng Duan, Lu Li, Jiangmeng Li
  • for: This paper focuses on improving the performance of heterogeneous graph contrastive learning models by proposing a new multi-scale meta-path integrated model (M2HGCL) that captures discriminative information from various types of meta-paths.
  • methods: The proposed M2HGCL model discards the conventional heterogeneity-homogeneity transformation and performs graph contrastive learning in a joint manner, aggregating direct neighbor information, initial meta-path neighbor information, and expanded meta-path neighbor information to capture sufficient discriminative information.
  • results: The proposed M2HGCL model outperforms current state-of-the-art baseline models on three real-world datasets through extensive experiments, demonstrating its effectiveness in improving the performance of heterogeneous graph contrastive learning models.
    Abstract Inspired by the successful application of contrastive learning on graphs, researchers attempt to impose graph contrastive learning approaches on heterogeneous information networks. Orthogonal to homogeneous graphs, the types of nodes and edges in heterogeneous graphs are diverse so that specialized graph contrastive learning methods are required. Most existing methods for heterogeneous graph contrastive learning are implemented by transforming heterogeneous graphs into homogeneous graphs, which may lead to ramifications that the valuable information carried by non-target nodes is undermined thereby exacerbating the performance of contrastive learning models. Additionally, current heterogeneous graph contrastive learning methods are mainly based on initial meta-paths given by the dataset, yet according to our deep-going exploration, we derive empirical conclusions: only initial meta-paths cannot contain sufficiently discriminative information; and various types of meta-paths can effectively promote the performance of heterogeneous graph contrastive learning methods. To this end, we propose a new multi-scale meta-path integrated heterogeneous graph contrastive learning (M2HGCL) model, which discards the conventional heterogeneity-homogeneity transformation and performs the graph contrastive learning in a joint manner. Specifically, we expand the meta-paths and jointly aggregate the direct neighbor information, the initial meta-path neighbor information and the expanded meta-path neighbor information to sufficiently capture discriminative information. A specific positive sampling strategy is further imposed to remedy the intrinsic deficiency of contrastive learning, i.e., the hard negative sample sampling issue. Through extensive experiments on three real-world datasets, we demonstrate that M2HGCL outperforms the current state-of-the-art baseline models.
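As background, the sketch below shows the InfoNCE-style objective that multi-view graph contrastive methods typically optimize over two embedding views of the same nodes; M2HGCL's exact loss, meta-path expansion, and positive sampling strategy are not reproduced here.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5):
    """z1, z2: (n, d) embeddings of the same nodes under two views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau            # (n, n) cross-view similarity matrix
    labels = torch.arange(z1.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

z_direct = torch.randn(64, 32)      # e.g., direct-neighbor aggregation view
z_metapath = torch.randn(64, 32)    # e.g., expanded meta-path view
print(info_nce(z_direct, z_metapath))
```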

Stabilize to Act: Learning to Coordinate for Bimanual Manipulation

  • paper_url: http://arxiv.org/abs/2309.01087
  • repo_url: None
  • paper_authors: Jennifer Grannen, Yilin Wu, Brandon Vu, Dorsa Sadigh
  • for: This paper tackles the high dimensionality of bimanual control, aiming to enable dexterous manipulation with two-armed robotic systems.
  • methods: Drawing inspiration from humans, it proposes a novel role assignment framework in which a stabilizing arm holds an object in place to simplify the environment while an acting arm executes the task. The framework is instantiated as BimanUal Dexterity from Stabilization (BUDS), which uses a learned restabilizing classifier to alternate between updating a learned stabilization position and accomplishing the task with an acting policy learned from demonstrations.
  • results: On four bimanual tasks of varying complexity on real-world robots, such as zipping jackets and cutting vegetables, BUDS achieves 76.9% task success from only 20 demonstrations, generalizes to out-of-distribution objects within a class at a 52.7% success rate, and is 56.0% more successful than an unstructured baseline that instead learns a BC stabilizing policy.
    Abstract Key to rich, dexterous manipulation in the real world is the ability to coordinate control across two hands. However, while the promise afforded by bimanual robotic systems is immense, constructing control policies for dual arm autonomous systems brings inherent difficulties. One such difficulty is the high-dimensionality of the bimanual action space, which adds complexity to both model-based and data-driven methods. We counteract this challenge by drawing inspiration from humans to propose a novel role assignment framework: a stabilizing arm holds an object in place to simplify the environment while an acting arm executes the task. We instantiate this framework with BimanUal Dexterity from Stabilization (BUDS), which uses a learned restabilizing classifier to alternate between updating a learned stabilization position to keep the environment unchanged, and accomplishing the task with an acting policy learned from demonstrations. We evaluate BUDS on four bimanual tasks of varying complexities on real-world robots, such as zipping jackets and cutting vegetables. Given only 20 demonstrations, BUDS achieves 76.9% task success across our task suite, and generalizes to out-of-distribution objects within a class with a 52.7% success rate. BUDS is 56.0% more successful than an unstructured baseline that instead learns a BC stabilizing policy due to the precision required of these complex tasks. Supplementary material and videos can be found at https://sites.google.com/view/stabilizetoact .

UnsMOT: Unified Framework for Unsupervised Multi-Object Tracking with Geometric Topology Guidance

  • paper_url: http://arxiv.org/abs/2309.01078
  • repo_url: None
  • paper_authors: Son Tran, Cong Tran, Anh Tran, Cuong Pham
  • for: To improve the performance of unsupervised multi-object tracking (MOT) methods and avoid the high cost of data annotation.
  • methods: Proposes UnsMOT, a novel framework that explicitly combines the appearance and motion features of objects with geometric information to provide more accurate tracking.
  • results: Experimental results show remarkable performance on the HOTA, IDF1, and MOTA metrics compared with state-of-the-art methods.
    Abstract Object detection has long been a topic of high interest in computer vision literature. Motivated by the fact that annotating data for the multi-object tracking (MOT) problem is immensely expensive, recent studies have turned their attention to the unsupervised learning setting. In this paper, we push forward the state-of-the-art performance of unsupervised MOT methods by proposing UnsMOT, a novel framework that explicitly combines the appearance and motion features of objects with geometric information to provide more accurate tracking. Specifically, we first extract the appearance and motion features using CNN and RNN models, respectively. Then, we construct a graph of objects based on their relative distances in a frame, which is fed into a GNN model together with CNN features to output geometric embedding of objects optimized using an unsupervised loss function. Finally, associations between objects are found by matching not only similar extracted features but also geometric embedding of detections and tracklets. Experimental results show remarkable performance in terms of HOTA, IDF1, and MOTA metrics in comparison with state-of-the-art methods.
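One ingredient, the distance-based object graph fed into the GNN, can be sketched with a simple k-nearest-neighbor rule over detection centers; the choice of k and the kNN rule are assumptions for illustration rather than the paper's exact construction.

```python
import numpy as np

def build_object_graph(centers: np.ndarray, k: int = 3) -> np.ndarray:
    """centers: (n, 2) box centers; returns a symmetric (n, n) adjacency."""
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)
    adj = np.zeros_like(dist)
    nn = np.argsort(dist, axis=1)[:, :k]           # k nearest neighbors per node
    rows = np.repeat(np.arange(len(centers)), k)
    adj[rows, nn.ravel()] = 1.0
    return np.maximum(adj, adj.T)                  # symmetrize

centers = np.array([[10, 20], [12, 22], [300, 40], [305, 38], [150, 200]])
print(build_object_graph(centers, k=2))  # feeds the GNN alongside CNN features
```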

Multidomain transformer-based deep learning for early detection of network intrusion

  • paper_url: http://arxiv.org/abs/2309.01070
  • repo_url: None
  • paper_authors: Jinxin Liu, Murat Simsek, Michele Nogueira, Burak Kantarci
  • for: This paper aims to improve the timeliness of Network Intrusion Detection Systems (NIDS) by using Multivariate Time Series (MTS) early detection to identify malicious flows before they reach their target systems.
  • methods: The paper proposes a novel feature extractor called Time Series Network Flow Meter (TS-NFM) to represent network flows as MTS with explainable features. It also introduces a new deep learning-based early detection model called Multi-Domain Transformer (MDT) that incorporates the frequency domain into Transformer, and a Multi-Domain Multi-Head Attention (MD-MHA) mechanism to improve feature extraction.
  • results: The proposed methodology improves the earliness of conventional NIDS (i.e., the percentage of packets used for classification) by 5x10^4 times and duration-based earliness (i.e., the percentage of the duration of the classified packets of a flow) by a factor of 60, resulting in an 84.1% macro F1 score (31% higher than Transformer) on SCVIC-TS-2022. The proposed MDT also outperforms state-of-the-art early detection methods by 5% and 6% on the ECG and Wafer datasets, respectively.
    Abstract Timely response of Network Intrusion Detection Systems (NIDS) is constrained by the flow generation process which requires accumulation of network packets. This paper introduces Multivariate Time Series (MTS) early detection into NIDS to identify malicious flows prior to their arrival at target systems. With this in mind, we first propose a novel feature extractor, Time Series Network Flow Meter (TS-NFM), that represents network flow as MTS with explainable features, and a new benchmark dataset is created using TS-NFM and the meta-data of CICIDS2017, called SCVIC-TS-2022. Additionally, a new deep learning-based early detection model called Multi-Domain Transformer (MDT) is proposed, which incorporates the frequency domain into Transformer. This work further proposes a Multi-Domain Multi-Head Attention (MD-MHA) mechanism to improve the ability of MDT to extract better features. Based on the experimental results, the proposed methodology improves the earliness of the conventional NIDS (i.e., percentage of packets that are used for classification) by 5x10^4 times and duration-based earliness (i.e., percentage of duration of the classified packets of a flow) by a factor of 60, resulting in a 84.1% macro F1 score (31% higher than Transformer) on SCVIC-TS-2022. Additionally, the proposed MDT outperforms the state-of-the-art early detection methods by 5% and 6% on ECG and Wafer datasets, respectively.
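To make the multi-domain idea concrete, here is an illustrative sketch that concatenates a frequency-domain view (FFT magnitudes) of a packet-feature time series with the raw series before standard multi-head attention; this is a simplification under assumed shapes, not MDT's actual MD-MHA architecture.

```python
import torch

def multi_domain_features(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, d) multivariate time series of per-packet features."""
    freq = torch.fft.rfft(x, dim=1).abs()            # frequency-domain view
    freq = torch.nn.functional.pad(freq, (0, 0, 0, x.size(1) - freq.size(1)))
    return torch.cat([x, freq], dim=-1)              # (batch, seq_len, 2d)

x = torch.randn(8, 32, 16)                           # 32 packets, 16 features
feats = multi_domain_features(x)
attn = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
out, _ = attn(feats, feats, feats)
print(out.shape)  # (8, 32, 32), fed to the early-detection head
```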

Separable Hamiltonian Neural Networks

  • paper_url: http://arxiv.org/abs/2309.01069
  • repo_url: https://github.com/zykhoo/separablenns
  • paper_authors: Zi-Yu Khoo, Jonathan Sze Choong Low, Stéphane Bressan
  • for: To regress the Hamiltonian of a dynamical system from discrete observations of its vector field, particularly when the state space is large relative to the number of samples.
  • methods: Embeds the additive separability of Hamiltonian systems into Hamiltonian neural networks in three ways: using separability to quadratically scale the training data, embedding separability within the loss function, and embedding it in the architecture via conjoined multilayer perceptrons.
  • results: Empirical comparisons show that the separable Hamiltonian neural networks, which alleviate the complexity between state variables, regress the Hamiltonian and its vector field more effectively than state-of-the-art Hamiltonian neural networks.
    Abstract The modelling of dynamical systems from discrete observations is a challenge faced by modern scientific and engineering data systems. Hamiltonian systems are one such fundamental and ubiquitous class of dynamical systems. Hamiltonian neural networks are state-of-the-art models that unsupervised-ly regress the Hamiltonian of a dynamical system from discrete observations of its vector field under the learning bias of Hamilton's equations. Yet Hamiltonian dynamics are often complicated, especially in higher dimensions where the state space of the Hamiltonian system is large relative to the number of samples. A recently discovered remedy to alleviate the complexity between state variables in the state space is to leverage the additive separability of the Hamiltonian system and embed that additive separability into the Hamiltonian neural network. Following the nomenclature of physics-informed machine learning, we propose three separable Hamiltonian neural networks. These models embed additive separability within Hamiltonian neural networks. The first model uses additive separability to quadratically scale the amount of data for training Hamiltonian neural networks. The second model embeds additive separability within the loss function of the Hamiltonian neural network. The third model embeds additive separability through the architecture of the Hamiltonian neural network using conjoined multilayer perceptions. We empirically compare the three models against state-of-the-art Hamiltonian neural networks, and demonstrate that the separable Hamiltonian neural networks, which alleviate complexity between the state variables, are more effective at regressing the Hamiltonian and its vector field.
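
The third model's architectural route can be sketched directly: two conjoined networks so that H(q, p) = T(p) + V(q), with the vector field recovered from Hamilton's equations by automatic differentiation. The widths and activations below are assumptions; only the additive split follows the paper.

```python
import torch
import torch.nn as nn

def mlp(din, dout, width=64):
    return nn.Sequential(nn.Linear(din, width), nn.Tanh(),
                         nn.Linear(width, width), nn.Tanh(),
                         nn.Linear(width, dout))

class SeparableHNN(nn.Module):
    """Architecturally separable Hamiltonian: H(q, p) = T(p) + V(q),
    modelled by two conjoined MLPs."""
    def __init__(self, dim):
        super().__init__()
        self.T = mlp(dim, 1)   # kinetic term, depends only on momenta p
        self.V = mlp(dim, 1)   # potential term, depends only on positions q

    def vector_field(self, q, p):
        q = q.requires_grad_(True)
        p = p.requires_grad_(True)
        H = self.T(p).sum() + self.V(q).sum()
        dHdq, dHdp = torch.autograd.grad(H, (q, p), create_graph=True)
        return dHdp, -dHdq     # Hamilton's equations: dq/dt, dp/dt

model = SeparableHNN(dim=2)
q, p = torch.randn(16, 2), torch.randn(16, 2)
dq_dt, dp_dt = model.vector_field(q, p)   # regress these against observations
```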

AB2CD: AI for Building Climate Damage Classification and Detection

  • paper_url: http://arxiv.org/abs/2309.01066
  • repo_url: None
  • paper_authors: Maximilian Nitsche, S. Karthik Mukkavilli, Niklas Kühl, Thomas Brunschwiler
  • for: precise building damage assessment for natural hazards from remote sensing data using deep learning, evaluated on the xBD dataset of diverse global disaster events.
  • methods: deep learning models with residual, squeeze-and-excitation, and dual path network backbones, together with ensemble techniques, assessed under symmetric and asymmetric resolution perturbation analyses.
  • results: effective building damage detection requires satellite imagery resolution of 3 meters or better (below 1 meter for classification); a U-Net Siamese network ensemble performed best with an F1 score of 0.812 on the xView2 challenge benchmark.
    Abstract We explore the implementation of deep learning techniques for precise building damage assessment in the context of natural hazards, utilizing remote sensing data. The xBD dataset, comprising diverse disaster events from across the globe, serves as the primary focus, facilitating the evaluation of deep learning models. We tackle the challenges of generalization to novel disasters and regions while accounting for the influence of the low-quality and noisy labels inherent in natural hazard data. Furthermore, using symmetric and asymmetric resolution perturbation analyses, our investigation quantitatively establishes that the minimum satellite imagery resolution essential for effective building damage detection is 3 meters, and below 1 meter for damage classification. To achieve robust and accurate evaluations of building damage detection and classification, we evaluated different deep learning models with residual, squeeze-and-excitation, and dual path network backbones, as well as ensemble techniques. Overall, the U-Net Siamese network ensemble, with an F1 score of 0.812, performed the best against the xView2 challenge benchmark. Additionally, we evaluate a universal model trained on all hazards against a flood expert model, and investigate generalization gaps across events as well as on out-of-distribution field data from the Ahr Valley. Our research findings showcase the potential and limitations of advanced AI solutions in enhancing the impact assessment of climate change-induced extreme weather events, such as floods and hurricanes. These insights have implications for disaster impact assessment in the face of escalating climate challenges.
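
As a rough illustration of the two ingredients above, the sketch below pairs a minimal Siamese classifier over pre- and post-disaster patches with a resolution perturbation helper that simulates a coarser ground-sample distance. The channel sizes, the four damage classes, and the 0.5 m native resolution are assumptions, not the paper's U-Net Siamese ensemble.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseDamageNet(nn.Module):
    """Minimal Siamese sketch of pre-/post-disaster comparison for building
    damage classification: one shared encoder, features from both dates."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(2 * 64, n_classes)

    def forward(self, pre, post):
        return self.head(torch.cat([self.encoder(pre),
                                    self.encoder(post)], dim=1))

def degrade(img, native_m=0.5, target_m=3.0):
    """Symmetric resolution perturbation: simulate a coarser ground-sample
    distance by down- then up-sampling (the 0.5 m native GSD is assumed)."""
    small = F.interpolate(img, scale_factor=native_m / target_m, mode='bilinear')
    return F.interpolate(small, size=img.shape[-2:], mode='bilinear')

pre, post = torch.randn(4, 3, 96, 96), torch.randn(4, 3, 96, 96)
logits = SiameseDamageNet()(degrade(pre), degrade(post))  # test at 3 m GSD
```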

Generative Data Augmentation using LLMs improves Distributional Robustness in Question Answering

  • paper_url: http://arxiv.org/abs/2309.06358
  • repo_url: None
  • paper_authors: Arijit Ghosh Chowdhury, Aman Chadha
  • for: investigate the influence of generated datasets on the performance of QA models under natural distribution shifts
  • methods: two-step generation approach, generating both contexts and QA pairs to augment existing datasets
  • results: augmenting reading comprehension datasets with generated data leads to better robustness towards natural distribution shifts
    Abstract Robustness in Natural Language Processing continues to be a pertinent issue, as state-of-the-art models underperform under naturally shifted distributions. In the context of Question Answering, work on domain adaptation methods continues to be a growing body of research. However, very little attention has been given to the notion of domain generalization under natural distribution shifts, where the target domain is unknown. With drastic improvements in the quality of and access to generative models, we answer the question: How do generated datasets influence the performance of QA models under natural distribution shifts? We perform experiments on four different datasets under varying amounts of distribution shift, and analyze how "in-the-wild" generation can help achieve domain generalization. We take a two-step generation approach, generating both contexts and QA pairs to augment existing datasets. Through our experiments, we demonstrate how augmenting reading comprehension datasets with generated data leads to better robustness towards natural distribution shifts.
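
A minimal sketch of the two-step generation recipe described above: first generate a context, then generate QA pairs grounded in it. The `generate` function is a hypothetical placeholder for any LLM completion call, and the prompts are illustrative assumptions rather than the paper's.

```python
def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (hypothetical; the paper
    does not prescribe a specific API)."""
    raise NotImplementedError

def augment_qa_dataset(seed_passages, n_pairs_per_context=3):
    """Two-step generation: (1) produce a new context in the style of a
    seed passage, (2) produce QA pairs grounded in that generated context."""
    augmented = []
    for passage in seed_passages:
        context = generate(
            f"Write a new paragraph on a similar topic to:\n{passage}")
        for _ in range(n_pairs_per_context):
            qa = generate(
                "From the passage below, write one question and its answer "
                f"as 'Q: ... A: ...'.\n{context}")
            q, _, a = qa.partition("A:")
            augmented.append({"context": context,
                              "question": q.removeprefix("Q:").strip(),
                              "answer": a.strip()})
    return augmented  # mix with the original training set before fine-tuning
```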

Integration of Vision-based Object Detection and Grasping for Articulated Manipulator in Lunar Conditions

  • paper_url: http://arxiv.org/abs/2309.01055
  • repo_url: None
  • paper_authors: Camille Boucher, Gustavo H. Diaz, Shreya Santra, Kentaro Uno, Kazuya Yoshida
  • for: enabling lunar robot applications that rely on vision-based manipulation.
  • methods: a generic task pipeline integrating object detection, instance segmentation, and grasp detection, whose outputs can be reused across different applications.
  • results: a 92% success rate on a rock stacking task on a non-flat surface under difficult lighting conditions, plus an experiment assembling 3D printed robot components toward more complex tasks.
    Abstract The integration of vision-based frameworks for lunar robot applications faces numerous challenges, such as terrain configuration and extreme lighting conditions. This paper presents a generic task pipeline built on object detection, instance segmentation, and grasp detection that can serve various applications by consuming the outputs of these vision-based systems in different ways. We achieve a rock stacking task on a non-flat surface in difficult lighting conditions with a success rate of 92%. Finally, we present an experiment assembling 3D printed robot components to enable more complex tasks in the future.
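
The task pipeline can be sketched as a thin orchestration layer over the three vision stages. All four stubs below are hypothetical stand-ins, since the paper does not publish an API; only the detect, segment, grasp ordering follows the text.

```python
from typing import Dict, List

# Hypothetical stand-ins for the perception/control stacks (assumptions).
def detect_objects(rgb) -> List[Dict]: raise NotImplementedError
def segment_instance(rgb, box): raise NotImplementedError
def detect_grasps(rgb, depth, mask) -> List[Dict]: raise NotImplementedError
def execute_grasp(pose) -> bool: raise NotImplementedError

def stack_rock(rgb, depth) -> bool:
    """One iteration of the detect -> segment -> grasp task pipeline."""
    rocks = [d for d in detect_objects(rgb) if d["label"] == "rock"]
    if not rocks:
        return False
    target = max(rocks, key=lambda d: d["score"])   # most confident detection
    mask = segment_instance(rgb, target["box"])     # instance mask of the rock
    grasps = detect_grasps(rgb, depth, mask)        # candidate grasp poses
    best = max(grasps, key=lambda g: g["quality"])  # highest-quality grasp
    return execute_grasp(best["pose"])              # command the manipulator
```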

cs.CL - 2023-09-03

BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning

  • paper_url: http://arxiv.org/abs/2309.01256
  • repo_url: https://github.com/rambo-coder/BDC-Adapter
  • paper_authors: Yi Zhang, Ce Zhang, Zihan Liao, Yushun Tang, Zhihai He
  • for: developing lightweight fine-tuning techniques to adapt large pre-trained vision-language models to downstream visual tasks.
  • methods: the Brownian Distance Covariance (BDC) metric, which models all possible relations between features and thus robustly measures feature dependence; on top of it, the BDC-Adapter combines BDC prototype similarity reasoning with a multi-modal reasoning network for classification.
  • results: BDC-Adapter freely handles non-linear relations, fully characterizes independence, and outperforms state-of-the-art methods by large margins.
    Abstract Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there has been a surge of interest among researchers in developing lightweight fine-tuning techniques to adapt these models to downstream visual tasks. We recognize that current state-of-the-art fine-tuning methods, such as Tip-Adapter, simply consider the covariance between the query image feature and features of support few-shot training samples, which only captures linear relations and potentially instigates a deceptive perception of independence. To address this issue, in this work, we innovatively introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning. The BDC metric can model all possible relations, providing a robust metric for measuring feature dependence. Based on this, we present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction to perform classification tasks. Our extensive experimental results show that the proposed BDC-Adapter can freely handle non-linear relations and fully characterize independence, outperforming the current state-of-the-art methods by large margins.
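
The empirical Brownian distance covariance itself is easy to state: double-center the pairwise distance matrices of the two feature sets and average their product. The sketch below follows the standard Székely et al. estimator; using it as the dependence measure is the paper's idea, but this is not the BDC-Adapter's own code.

```python
import torch

def distance_covariance(x, y):
    """Empirical (Brownian) distance covariance between paired samples
    x (n, d1) and y (n, d2); zero iff x and y are independent (in the limit)."""
    def centered_dist(z):
        d = torch.cdist(z, z)                            # pairwise distances
        return (d - d.mean(0, keepdim=True)              # double centering
                  - d.mean(1, keepdim=True) + d.mean())
    A, B = centered_dist(x), centered_dist(y)
    return (A * B).mean().clamp_min(0).sqrt()

x = torch.randn(128, 512)                  # e.g. query image features
y = torch.tanh(x @ torch.randn(512, 256))  # non-linear dependence on x
print(distance_covariance(x, y))           # > 0: dependence is detected
```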

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

  • paper_url: http://arxiv.org/abs/2309.01131
  • repo_url: None
  • paper_authors: Haoyu Cao, Changcun Bao, Chaohu Liu, Huang Chen, Kun Yin, Hao Liu, Yinsong Liu, Deqiang Jiang, Xing Sun
  • for: improving the efficiency and accuracy of end-to-end visual document understanding.
  • methods: the SElective Region Understanding Model (SeRum), which converts document image understanding and recognition into a local decoding process over the visual tokens of interest, using a content-aware token merge module so the model concentrates on regions of interest generated by the query decoder; several pre-training tasks enhance understanding and local awareness.
  • results: state-of-the-art performance on document understanding tasks and competitive results on text spotting.
    Abstract We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, including document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process of the visual tokens of interest, using a content-aware token merge module. This mechanism enables the model to pay more attention to regions of interest generated by the query decoder, improving the model's effectiveness and speeding up the decoding speed of the generative scheme. We also designed several pre-training tasks to enhance the understanding and local awareness of the model. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards enabling efficient and effective end-to-end document understanding.
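
One plausible reading of a content-aware token merge is sketched below: keep the visual tokens most relevant to the query and collapse the remainder into a single summary token, shrinking the sequence the decoder must attend to. The keep budget and the score-weighted merge are assumptions for illustration, not SeRum's exact module.

```python
import torch

def content_aware_merge(tokens, scores, keep=64):
    """Keep the `keep` highest-scoring visual tokens and merge the rest
    into one score-weighted summary token.
    tokens: (n, d) visual tokens; scores: (n,) relevance to the query."""
    idx = scores.argsort(descending=True)
    kept, rest = idx[:keep], idx[keep:]
    w = torch.softmax(scores[rest], dim=0).unsqueeze(1)  # weights for the tail
    merged = (w * tokens[rest]).sum(0, keepdim=True)     # one summary token
    return torch.cat([tokens[kept], merged], dim=0)      # (keep + 1, d)

tokens = torch.randn(1024, 256)            # visual tokens from the image encoder
scores = torch.randn(1024)                 # e.g. query-decoder attention mass
out = content_aware_merge(tokens, scores)  # decoder now attends to 65 tokens
```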

Business Process Text Sketch Automation Generation Using Large Language Model

  • paper_url: http://arxiv.org/abs/2309.01071
  • repo_url: None
  • paper_authors: Rui Zhu, Quanzhou Hu, Wenxin Li, Honghao Xiao, Chaogang Wang, Zixin Zhou
  • for: addressing the challenge of business process document generation in the absence of datasets, and improving the correctness of data-driven deep learning techniques in this domain.
  • methods: an approach that transforms Conditional Process Trees (CPTs) into Business Process Text Sketches (BPTSs) using Large Language Models (LLMs), with a divide-and-conquer strategy that breaks difficult CPTs down into smaller, more manageable parts.
  • results: a correct rate of 93.42%, which is 45.17% better than traditional prompting methods, with the potential to provide a large number of datasets for the process model extraction (PME) domain.
    Abstract Business Process Management (BPM) is gaining increasing attention as it has the potential to cut costs while boosting output and quality. Business process document generation is a crucial stage in BPM. However, due to a shortage of datasets, data-driven deep learning techniques struggle to deliver the expected results. We propose an approach to transform Conditional Process Trees (CPTs) into Business Process Text Sketches (BPTSs) using Large Language Models (LLMs). The traditional prompting approach (few-shot in-context learning) tries to get the correct answer in one go, and it can find the pattern for transforming simple CPTs into BPTSs, but for closed-domain settings and CPTs with complex hierarchies, traditional prompts perform weakly and with low correctness. Drawing inspiration from the divide-and-conquer strategy, we instead break a difficult CPT down into a number of basic CPTs and solve each one in turn. We chose 100 process trees at random, with depths ranging from 2 to 5, including CPTs with many nodes, many degrees of selection, and cyclic nesting. Experiments show that our method achieves a correct rate of 93.42%, which is 45.17% better than traditional prompting methods. Our proposed method provides a solution for business process document generation in the absence of datasets and, moreover, makes it potentially possible to provide a large number of datasets for the process model extraction (PME) domain.
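
A minimal sketch of the divide-and-conquer prompting loop: small subtrees are described with one prompt each, and larger trees are assembled by asking the model to combine the child descriptions. The CPT schema, the size threshold, the prompts, and the `generate` placeholder are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CPT:
    """Toy conditional process tree: a node is either an activity (leaf)
    or an operator over children (the schema is an assumption)."""
    label: str
    children: List["CPT"] = field(default_factory=list)

def generate(prompt: str) -> str:
    raise NotImplementedError   # placeholder for an LLM call (hypothetical)

def cpt_to_sketch(node: CPT, max_leaves=4) -> str:
    """Divide-and-conquer: one prompt per basic CPT, then a combining
    prompt per operator node, instead of one prompt for the whole tree."""
    def n_leaves(n):
        return 1 if not n.children else sum(n_leaves(c) for c in n.children)
    if n_leaves(node) <= max_leaves:                 # basic CPT: one shot
        return generate(f"Describe this process fragment: {node}")
    parts = [cpt_to_sketch(c, max_leaves) for c in node.children]
    return generate(f"Combine these descriptions under the '{node.label}' "
                    "operator into one coherent text sketch:\n" + "\n".join(parts))
```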