cs.AI - 2023-09-14

Retrieval-Augmented Text-to-Audio Generation

  • paper_url: http://arxiv.org/abs/2309.08051
  • repo_url: None
  • paper_authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
  • for: Improving the quality and accuracy of text-to-audio (TTA) generation, particularly for audio classes that are rare in the training dataset.
  • methods: Proposes a simple retrieval-augmented approach: a Contrastive Language Audio Pretraining (CLAP) model retrieves relevant text-audio pairs, whose features are then used as additional conditions to guide the learning of the TTA model (a retrieval sketch follows the abstract).
  • results: On the AudioCaps dataset, the proposed Re-AudioLDM system achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming existing approaches by a large margin. Re-AudioLDM also generates realistic audio for complex scenes, rare audio classes, and even unseen audio types, demonstrating its potential in TTA tasks.
    Abstract Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.
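The retrieval step is easy to prototype. Below is a minimal sketch of CLAP-style nearest-neighbour retrieval over a datastore of paired text-audio embeddings; the function, the datastore, and the 512-dimensional embeddings are our illustration, not the paper's implementation.

```python
import numpy as np

def retrieve_pairs(query_text_emb, db_text_embs, k=3):
    """Return indices of the k most similar text-audio pairs by cosine similarity.

    query_text_emb: (d,) CLAP-style text embedding of the input prompt.
    db_text_embs:   (N, d) text embeddings of the paired datastore.
    """
    q = query_text_emb / np.linalg.norm(query_text_emb)
    db = db_text_embs / np.linalg.norm(db_text_embs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarities, shape (N,)
    return np.argsort(-sims)[:k]       # indices of the top-k pairs

# Toy usage: 512-d embeddings for a 1000-pair datastore.
rng = np.random.default_rng(0)
db_text = rng.normal(size=(1000, 512))
db_audio = rng.normal(size=(1000, 512))   # paired audio embeddings
query = rng.normal(size=512)

idx = retrieve_pairs(query, db_text, k=3)
# The retrieved text and audio features would be passed to the TTA model
# as additional conditioning alongside the prompt embedding.
conditioning = np.concatenate([db_text[idx], db_audio[idx]], axis=0)  # (2k, 512)
print(conditioning.shape)
```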

Padding Aware Neurons

  • paper_url: http://arxiv.org/abs/2309.08048
  • repo_url: https://gitlab.com/paper14/padding-aware-neurons
  • paper_authors: Dario Garcia-Gasulla, Victor Gimenez-Abalos, Pablo Martin-Torres
  • for: This paper studies Padding Aware Neurons (PANs) in convolutional layers, a spatial inductive bias introduced by static padding, and their impact on model performance.
  • methods: The authors propose identifying PANs through their activations under static padding policies, and analyze several popular pre-trained models to explore the presence of PANs (a heuristic detection sketch follows the abstract).
  • results: PANs were found in all convolutional models examined, ranging in number from dozens to hundreds. The authors characterize different types of PANs, their kernels and behaviour, and show that padding and PANs induce strong, characteristic biases in the data. Finally, they discuss whether PANs are desirable, along with their potential side effects on model performance, generalisation, efficiency, and safety.
    Abstract Convolutional layers are a fundamental component of most image-related models. These layers often implement by default a static padding policy (\eg zero padding), to control the scale of the internal representations, and to allow kernel activations centered on the border regions. In this work we identify Padding Aware Neurons (PANs), a type of filter that is found in most (if not all) convolutional models trained with static padding. PANs focus on the characterization and recognition of input border location, introducing a spatial inductive bias into the model (e.g., how close to the input's border a pattern typically is). We propose a method to identify PANs through their activations, and explore their presence in several popular pre-trained models, finding PANs on all models explored, from dozens to hundreds. We discuss and illustrate different types of PANs, their kernels and behaviour. To understand their relevance, we test their impact on model performance, and find padding and PANs to induce strong and characteristic biases in the data. Finally, we discuss whether or not PANs are desirable, as well as the potential side effects of their presence in the context of model performance, generalisation, efficiency and safety.
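One simple way to probe for PANs is to compare, per filter, how strongly activations concentrate on the padded border versus the interior of the feature map. The heuristic below is our sketch of that idea, not the paper's exact identification procedure; the ring width and ratio threshold are illustrative.

```python
import torch
import torch.nn as nn

def candidate_pans(conv: nn.Conv2d, images: torch.Tensor, ring: int = 1, ratio: float = 2.0):
    """Flag filters whose activations concentrate on the padded border.

    Heuristic: compare mean |activation| in a border ring of the feature map
    against the interior, per output channel.
    """
    with torch.no_grad():
        fmap = conv(images).abs()              # (B, C, H, W)
    B, C, H, W = fmap.shape
    interior = fmap[:, :, ring:H - ring, ring:W - ring].mean(dim=(0, 2, 3))
    total = fmap.mean(dim=(0, 2, 3))
    # Recover the border-ring mean from totals to avoid fancy indexing.
    n_total = H * W
    n_int = (H - 2 * ring) * (W - 2 * ring)
    border = (total * n_total - interior * n_int) / (n_total - n_int)
    return torch.nonzero(border > ratio * interior).flatten()

# Toy usage with random images and a randomly initialized padded conv.
torch.manual_seed(0)
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # static zero padding
imgs = torch.randn(8, 3, 32, 32)
print(candidate_pans(conv, imgs))
```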

Towards Large-scale Building Attribute Mapping using Crowdsourced Images: Scene Text Recognition on Flickr and Problems to be Solved

  • paper_url: http://arxiv.org/abs/2309.08042
  • repo_url: https://github.com/ya0-sun/str-berlin
  • paper_authors: Yao Sun, Anna Kruspe, Liqiu Meng, Yifan Tian, Eike J Hoffmann, Stefan Auer, Xiao Xiang Zhu
  • for: This paper applies Scene Text Recognition (STR) to crowdsourced street-view images for building attribute mapping.
  • methods: Uses Flickr images, particularly texts on building facades; a Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition.
  • results: Manual checking of a subset of STR-recognized images shows high accuracy. The study finds a correlation between STR results and building functions, and analyses cases where texts were recognized on residential but not on commercial buildings. Remaining challenges include small text regions in street-view images, the absence of ground-truth labels, and mismatches between buildings in Flickr images and building footprints in OpenStreetMap (OSM). For city-wide mapping beyond urban hotspots, the authors suggest differentiating the scenarios where STR is effective while developing appropriate algorithms or bringing in additional data for the remaining cases, along with interdisciplinary collaboration to understand the motivation behind building photography and labeling. The STR-on-Flickr results are available at https://github.com/ya0-sun/STR-Berlin.
    Abstract Crowdsourced platforms provide huge amounts of street-view images that contain valuable building information. This work addresses the challenges in applying Scene Text Recognition (STR) in crowdsourced street-view images for building attribute mapping. We use Flickr images, particularly examining texts on building facades. A Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition. Manual checking on a subset of STR-recognized images demonstrates high accuracy. We examined the correlation between STR results and building functions, and analysed instances where texts were recognized on residential buildings but not on commercial ones. Further investigation revealed significant challenges associated with this task, including small text regions in street-view images, the absence of ground truth labels, and mismatches in buildings in Flickr images and building footprints in OpenStreetMap (OSM). To develop city-wide mapping beyond urban hotspot locations, we suggest differentiating the scenarios where STR proves effective while developing appropriate algorithms or bringing in additional data for handling other cases. Furthermore, interdisciplinary collaboration should be undertaken to understand the motivation behind building photography and labeling. The STR-on-Flickr results are publicly available at https://github.com/ya0-sun/STR-Berlin.

BEA: Revisiting anchor-based object detection DNN using Budding Ensemble Architecture

  • paper_url: http://arxiv.org/abs/2309.08036
  • repo_url: None
  • paper_authors: Syed Sha Qutub, Neslihan Kose, Rafael Rosales, Michael Paulitsch, Korbinian Hagn, Florian Geissler, Yang Peng, Gereon Hinz, Alois Knoll
  • for: Improving the accuracy and uncertainty-estimation quality of anchor-based object detection models.
  • methods: Uses the Budding Ensemble Architecture (BEA) and its proposed loss functions to improve confidence-score calibration and lower the uncertainty error (a retention-curve sketch follows the abstract).
  • results: BEA improves the detection accuracy and uncertainty-estimation quality of Base-YOLOv3 and SSD models, and achieves superior out-of-distribution detection performance across several datasets.
    Abstract This paper introduces the Budding Ensemble Architecture (BEA), a novel reduced ensemble architecture for anchor-based object detection models. Object detection models are crucial in vision-based tasks, particularly in autonomous systems. They should provide precise bounding box detections while also calibrating their predicted confidence scores, leading to higher-quality uncertainty estimates. However, current models may make erroneous decisions due to false positives receiving high scores or true positives being discarded due to low scores. BEA aims to address these issues. The proposed loss functions in BEA improve the confidence score calibration and lower the uncertainty error, which results in a better distinction of true and false positives and, eventually, higher accuracy of the object detection models. Both Base-YOLOv3 and SSD models were enhanced using the BEA method and its proposed loss functions. The BEA on Base-YOLOv3 trained on the KITTI dataset results in a 6% and 3.7% increase in mAP and AP50, respectively. Utilizing a well-balanced uncertainty estimation threshold to discard samples in real-time even leads to a 9.6% higher AP50 than its base model. This is attributed to a 40% increase in the area under the AP50-based retention curve used to measure the quality of calibration of confidence scores. Furthermore, BEA-YOLOV3 trained on KITTI provides superior out-of-distribution detection on Citypersons, BDD100K, and COCO datasets compared to the ensembles and vanilla models of YOLOv3 and Gaussian-YOLOv3.
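The abstract measures calibration quality via the area under an AP50-based retention curve. The sketch below illustrates the retention-curve idea with plain accuracy standing in for AP50; the step count and toy data are our assumptions.

```python
import numpy as np

def retention_auc(correct: np.ndarray, uncertainty: np.ndarray, steps: int = 20):
    """Area under an accuracy-vs-retention curve.

    Detections are sorted by increasing uncertainty; at each retention
    fraction we keep the most certain ones and measure accuracy (a stand-in
    for AP50 here). A larger area means uncertainty ranks errors well,
    i.e. the confidence scores are better calibrated.
    """
    order = np.argsort(uncertainty)
    ranked = correct[order]
    fracs = np.linspace(0.05, 1.0, steps)
    accs = np.array([ranked[: max(1, int(f * len(ranked)))].mean() for f in fracs])
    return float(np.sum((accs[1:] + accs[:-1]) / 2 * np.diff(fracs)))  # trapezoid rule

# Toy usage: in a calibrated model, errors tend to carry higher uncertainty.
rng = np.random.default_rng(0)
correct = (rng.random(1000) > 0.3).astype(float)
uncertainty = rng.normal(loc=1.0 - correct, scale=0.5)
print(f"retention AUC: {retention_auc(correct, uncertainty):.3f}")
```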

Vision-based Analysis of Driver Activity and Driving Performance Under the Influence of Alcohol

  • paper_url: http://arxiv.org/abs/2309.08021
  • repo_url: None
  • paper_authors: Ross Greer, Akshay Gopalkrishnan, Sumega Mandadi, Pujitha Gunaratne, Mohan M. Trivedi, Thomas D. Marcotte
  • for: Research on preventing drunk driving to improve vehicle safety.
  • methods: Uses a multi-modal ensemble of visual, thermal, audio, and chemical sensors to examine the impact of acute alcohol administration on driving performance in a driving simulator, and to explore data-driven methods for detecting driving under the influence of alcohol.
  • results: Presents computer vision and machine learning models for analyzing the driver's face in thermal imagery, and introduces a pipeline for training models on data collected from drivers with a range of breath-alcohol content levels.
    Abstract About 30% of all traffic crash fatalities in the United States involve drunk drivers, making the prevention of drunk driving paramount to vehicle safety in the US and other locations which have a high prevalence of driving while under the influence of alcohol. Driving impairment can be monitored through active use of sensors (when drivers are asked to engage in providing breath samples to a vehicle instrument or when pulled over by a police officer), but a more passive and robust mechanism of sensing may allow for wider adoption and benefit of intelligent systems that reduce drunk driving accidents. This could assist in identifying impaired drivers before they drive, or early in the driving process (before a crash or detection by law enforcement). In this research, we introduce a study which adopts a multi-modal ensemble of visual, thermal, audio, and chemical sensors to (1) examine the impact of acute alcohol administration on driving performance in a driving simulator, and (2) identify data-driven methods for detecting driving under the influence of alcohol. We describe computer vision and machine learning models for analyzing the driver's face in thermal imagery, and introduce a pipeline for training models on data collected from drivers with a range of breath-alcohol content levels, including discussion of relevant machine learning phenomena which can help in future experiment design for related studies.

An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing

  • paper_url: http://arxiv.org/abs/2309.08008
  • repo_url: None
  • paper_authors: Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang
  • for: To design effective prompting strategies that guide large language models (LLMs) to perform specific clinical natural language processing (NLP) tasks without any task-specific training data.
  • methods: Evaluates prompting approaches from recent literature, including simple prefix, simple cloze, chain-of-thought, and anticipatory prompts, and introduces two new types: heuristic prompting and ensemble prompting (an illustrative ensemble sketch follows the abstract).
  • results: Performance varies considerably across prompting methods and models (GPT-3.5, BARD, and LLAMA2). Ensemble prompting performs best across all three models, while anticipatory prompting performs best on GPT-3.5. The comparison of zero-shot and few-shot prompting yields further insights and guidelines for prompt engineering for LLMs in clinical NLP.
    Abstract Large language models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), especially in domains where labeled data is scarce or expensive, such as clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. In this paper, we present a comprehensive and systematic experimental study on prompt engineering for five clinical NLP tasks: Clinical Sense Disambiguation, Biomedical Evidence Extraction, Coreference Resolution, Medication Status Extraction, and Medication Attribute Extraction. We assessed the prompts proposed in recent literature, including simple prefix, simple cloze, chain of thought, and anticipatory prompts, and introduced two new types of prompts, namely heuristic prompting and ensemble prompting. We evaluated the performance of these prompts on three state-of-the-art LLMs: GPT-3.5, BARD, and LLAMA2. We also contrasted zero-shot prompting with few-shot prompting, and provide novel insights and guidelines for prompt engineering for LLMs in clinical NLP. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative AI, and we hope that it will inspire and inform future research in this area.
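Ensemble prompting is not defined in the abstract; one plausible reading is to aggregate answers across several prompt styles, as in the sketch below. The `query_llm` mock and the clinical templates are placeholders of ours, not the paper's prompts.

```python
import random
from collections import Counter

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (GPT-3.5, BARD, LLaMA 2); mocked here."""
    return random.choice(["active", "active", "discontinued"])

# Three of the prompt styles the paper compares, as hypothetical templates.
PROMPT_TEMPLATES = [
    "Extract the medication status from this note: {note}\nStatus:",      # simple prefix
    "Note: {note}\nThe medication status is ___.",                        # simple cloze
    "Note: {note}\nLet's reason step by step, then state the status.",    # chain of thought
]

def ensemble_prompt(note: str) -> str:
    """Majority vote over the answers produced by several prompt styles."""
    answers = [query_llm(t.format(note=note)).strip().lower() for t in PROMPT_TEMPLATES]
    return Counter(answers).most_common(1)[0][0]

random.seed(0)
print(ensemble_prompt("Patient continues lisinopril 10mg daily."))
```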

An Automated Machine Learning Approach for Detecting Anomalous Peak Patterns in Time Series Data from a Research Watershed in the Northeastern United States Critical Zone

  • paper_url: http://arxiv.org/abs/2309.07992
  • repo_url: None
  • paper_authors: Ijaz Ul Haq, Byung Suk Lee, Donna M. Rizzo, Julia N Perdrial
  • for: To help hydrologists detect anomalies in time series data from sensors in a research watershed in the northeastern United States critical zone.
  • methods: An automated machine learning framework that combines a generative model (TimeGAN) for synthesizing labeled training data with an automated hyperparameter-optimization mechanism that selects among five candidate architectures (a sketch of the label-generation step follows the abstract).
  • results: The framework consistently selects the model instance that best matches the user's preferences, balancing anomaly-detection accuracy against computational cost.
    Abstract This paper presents an automated machine learning framework designed to assist hydrologists in detecting anomalies in time series data generated by sensors in a research watershed in the northeastern United States critical zone. The framework specifically focuses on identifying peak-pattern anomalies, which may arise from sensor malfunctions or natural phenomena. However, the use of classification methods for anomaly detection poses challenges, such as the requirement for labeled data as ground truth and the selection of the most suitable deep learning model for the given task and dataset. To address these challenges, our framework generates labeled datasets by injecting synthetic peak patterns into synthetically generated time series data and incorporates an automated hyperparameter optimization mechanism. This mechanism generates an optimized model instance with the best architectural and training parameters from a pool of five selected models, namely Temporal Convolutional Network (TCN), InceptionTime, MiniRocket, Residual Networks (ResNet), and Long Short-Term Memory (LSTM). The selection is based on the user's preferences regarding anomaly detection accuracy and computational cost. The framework employs Time-series Generative Adversarial Networks (TimeGAN) as the synthetic dataset generator. The generated model instances are evaluated using a combination of accuracy and computational cost metrics, including training time and memory, during the anomaly detection process. Performance evaluation of the framework was conducted using a dataset from a watershed, demonstrating consistent selection of the most fitting model instance that satisfies the user's preferences.
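The labeled-data generation step can be illustrated in a few lines: inject synthetic peak patterns into a synthetic series and record the anomaly labels. The Gaussian peak shape and parameters below are our choices; the paper generates the base series with TimeGAN.

```python
import numpy as np

def inject_peaks(series: np.ndarray, n_peaks: int, width: int = 5, scale: float = 4.0, seed: int = 0):
    """Inject Gaussian-shaped peak anomalies into a time series and return labels."""
    rng = np.random.default_rng(seed)
    out, labels = series.copy(), np.zeros_like(series, dtype=int)
    for center in rng.integers(width, len(series) - width, size=n_peaks):
        t = np.arange(-width, width + 1)
        out[center - width: center + width + 1] += scale * np.exp(-0.5 * (t / (width / 2)) ** 2)
        labels[center - width: center + width + 1] = 1
    return out, labels

# Toy usage: a noisy seasonal baseline (standing in for TimeGAN output)
# with three injected anomalous peaks.
t = np.linspace(0, 10, 1000)
baseline = np.sin(2 * np.pi * t) + 0.1 * np.random.default_rng(1).normal(size=t.size)
series, labels = inject_peaks(baseline, n_peaks=3)
print(labels.sum(), "labeled anomalous points")
```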

Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.07986
  • repo_url: None
  • paper_authors: James Burgess, Kuan-Chieh Wang, Serena Yeung
  • for: This paper examines whether text-to-image diffusion models learn true 3D structure from only 2D supervision, and whether that structure can be exploited for 3D vision tasks.
  • methods: Uses the frozen Stable Diffusion model and trains a small neural mapper that takes camera viewpoint parameters and predicts text-encoder latents, which condition the diffusion process to control the 3D viewpoint of objects in generated images (a sketch of the view mapper follows the abstract).
  • results: The method addresses novel view synthesis from very few input views, including single-view novel view synthesis with good semantic detail and photorealism compared to prior methods, and can efficiently generate diverse samples to model the uncertainty inherent in sparse 3D vision problems.
    Abstract Text-to-image diffusion models understand spatial relationship between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in generated images from frozen diffusion models. We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint. ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions have good semantic details and photorealism compared to prior methods. Our approach is well suited for modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated by user-defined prompts.
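The core trainable component is a small neural mapper from camera viewpoint parameters to a text-encoder latent. The sketch below shows the shape of that idea; the MLP architecture, the 6-dimensional view parameterization, and the 768-dimensional latent are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ViewMapper(nn.Module):
    """Small MLP from camera viewpoint parameters to a text-encoder latent.

    The predicted latent would be inserted as a pseudo-token conditioning a
    frozen text-to-image diffusion model; dimensions are illustrative.
    """
    def __init__(self, view_dim: int = 6, latent_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(view_dim, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, view_params: torch.Tensor) -> torch.Tensor:
        return self.net(view_params)

# Toy usage: a batch of camera parameterizations (e.g., azimuth, elevation,
# radius plus a 3-vector offset) mapped to CLIP-sized latents.
mapper = ViewMapper()
views = torch.randn(4, 6)
pseudo_tokens = mapper(views)             # (4, 768)
print(pseudo_tokens.shape)
# Training would backpropagate the diffusion reconstruction loss into the
# mapper while the diffusion model itself stays frozen.
```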

A Data Source for Reasoning Embodied Agents

  • paper_url: http://arxiv.org/abs/2309.07974
  • repo_url: https://github.com/facebookresearch/neuralmemory
  • paper_authors: Jack Lanchantin, Sainbayar Sukhbaatar, Gabriel Synnaeve, Yuxuan Sun, Kavya Srinet, Arthur Szlam
  • for: To further progress in machine reasoning by introducing a new data generator that integrates with an embodied agent.
  • methods: Generates templated text queries and answers matched with world-states encoded into a database, where the world-states result from both world dynamics and the agent's actions, building on recent advances in model architectures, large-scale pre-training protocols, and dedicated reasoning datasets (a toy sketch of the generator follows the abstract).
  • results: Baseline models, including pre-trained language models fine-tuned on a text-formatted representation of the database and graph-structured Transformers operating on a knowledge-graph representation, can answer some questions about the world-state but struggle with others, pointing to new research directions for neural reasoning models and database representations.
    Abstract Recent progress in using machine learning models for reasoning tasks has been driven by novel model architectures, large-scale pre-training protocols, and dedicated reasoning datasets for fine-tuning. In this work, to further pursue these advances, we introduce a new data generator for machine reasoning that integrates with an embodied agent. The generated data consists of templated text queries and answers, matched with world-states encoded into a database. The world-states are a result of both world dynamics and the actions of the agent. We show the results of several baseline models on instantiations of train sets. These include pre-trained language models fine-tuned on a text-formatted representation of the database, and graph-structured Transformers operating on a knowledge-graph representation of the database. We find that these models can answer some questions about the world-state, but struggle with others. These results hint at new research directions in designing neural reasoning models and database representations. Code to generate the data will be released at github.com/facebookresearch/neuralmemory
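The generator pairs templated questions with answers computed from a world-state database. The toy sketch below mirrors that structure with a hypothetical two-object world; all names and templates are ours.

```python
import random

# A toy world-state "database": object -> properties, updated by world
# dynamics and the agent's actions.
world = {
    "red_cube":  {"location": (3, 1), "carried": False},
    "blue_ball": {"location": (0, 4), "carried": True},
}

TEMPLATES = [
    ("Where is the {name}?",         lambda o: str(o["location"])),
    ("Is the {name} being carried?", lambda o: "yes" if o["carried"] else "no"),
]

def generate_qa(world: dict, n: int = 4, seed: int = 0):
    """Sample templated question-answer pairs grounded in the world-state."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        name = rng.choice(sorted(world))
        question, answer_fn = rng.choice(TEMPLATES)
        pairs.append((question.format(name=name.replace("_", " ")), answer_fn(world[name])))
    return pairs

for q, a in generate_qa(world):
    print(q, "->", a)
```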

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

  • paper_url: http://arxiv.org/abs/2309.07915
  • repo_url: https://github.com/haozhezhao/mic
  • paper_authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
  • for: This paper addresses the difficulty modern vision-language models (VLMs) have in understanding complex multi-modal prompts with multiple interleaved images.
  • methods: Proposes MMICL, comprising a purpose-built architecture that seamlessly integrates visual and textual context in an interleaved manner, and the MIC dataset, which reduces the gap between the training data and the complex multi-modal prompts of real-world applications.
  • results: MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially on complex reasoning benchmarks such as MME and MMBench, and the experiments show it successfully handles complex multi-modal prompt understanding.
    Abstract Starting from the resurgence of deep learning, vision-language models (VLMs) benefiting from large language models (LLMs) have never been so popular. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images. The issue can be traced back to the architectural design of VLMs or pre-training data. Specifically, the current VLMs primarily emphasize utilizing multi-modal data with a single image, rather than multi-modal prompts with interleaved multiple images and text. Even though some newly proposed VLMs could handle user prompts with multiple images, pre-training data does not provide more sophisticated multi-modal prompts than interleaved image and text crawled from the web. We propose MMICL to address the issue by considering both the model and data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner and MIC dataset to reduce the gap between the training data and the complex user prompts in real-world applications, including: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially for complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the advanced performance of MMICL.

Beta Diffusion

  • paper_url: http://arxiv.org/abs/2309.07867
  • repo_url: https://github.com/ThisisBillhe/tiny-stable-diffusion
  • paper_authors: Mingyuan Zhou, Tianqi Chen, Zhendong Wang, Huangjie Zheng
  • for: Beta diffusion is a novel generative modeling method for data within bounded ranges.
  • methods: Uses scaled and shifted beta distributions with multiplicative transitions over time to build forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals. Unlike diffusion models based on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is optimized with KL-divergence upper bounds (KLUBs); a sketch of the beta-KL building block follows the abstract.
  • results: Experiments on synthetic data and natural images demonstrate beta diffusion's distinctive ability to model range-bounded data and validate the effectiveness of KLUBs for optimizing diffusion models.
    Abstract We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion utilizes multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals, given the data at any point in time. Unlike traditional diffusion-based generative models relying on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence. We demonstrate that the proposed KLUBs are more effective for optimizing beta diffusion compared to negative ELBOs, which can also be derived as the KLUBs of the same KL divergence with its two arguments swapped. The loss function of beta diffusion, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on both synthetic data and natural images demonstrate the unique capabilities of beta diffusion in generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, thereby making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.
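The KLUB objective is built from KL divergences between beta distributions, which have a closed form. The sketch below implements only that building block, not the full KLUB; as the abstract notes, swapping the two arguments of the same KL divergence recovers the negative-ELBO counterpart.

```python
from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1,b1) || Beta(a2,b2)), a building block of KLUB-style losses."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

print(kl_beta(2.0, 5.0, 2.0, 5.0))   # 0.0 for identical distributions
print(kl_beta(2.0, 5.0, 4.0, 3.0))   # positive otherwise
print(kl_beta(4.0, 3.0, 2.0, 5.0))   # swapping arguments changes the value
```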

The Rise and Potential of Large Language Model Based Agents: A Survey

  • paper_url: http://arxiv.org/abs/2309.07864
  • repo_url: https://github.com/woooodyy/llm-agent-paper-list
  • paper_authors: Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui
  • for: This paper provides a comprehensive survey of large language model (LLM)-based agents, exploring their potential as a path toward artificial general intelligence (AGI).
  • methods: Presents a general framework for LLM-based agents consisting of three main components (brain, perception, and action) that can be tailored for different applications.
  • results: Discusses the extensive applications of LLM-based agents in single-agent scenarios, multi-agent scenarios, and human-agent cooperation, and explores agent societies, emergent social phenomena, and the insights they offer for human society.
    Abstract For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are artificial entities that sense their environment, make decisions, and take actions. Many efforts have been made to develop intelligent agents, but they mainly focus on advancement in algorithms or training strategies to enhance specific capabilities or performance on particular tasks. Actually, what the community lacks is a general and powerful model to serve as a starting point for designing AI agents that can adapt to diverse scenarios. Due to the versatile capabilities they demonstrate, large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI), offering hope for building general AI agents. Many researchers have leveraged LLMs as the foundation to build AI agents and have achieved significant progress. In this paper, we perform a comprehensive survey on LLM-based agents. We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. Building upon this, we present a general framework for LLM-based agents, comprising three main components: brain, perception, and action, and the framework can be tailored for different applications. Subsequently, we explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation. Following this, we delve into agent societies, exploring the behavior and personality of LLM-based agents, the social phenomena that emerge from an agent society, and the insights they offer for human society. Finally, we discuss several key topics and open problems within the field. A repository for the related papers at https://github.com/WooooDyy/LLM-Agent-Paper-List.

CiwaGAN: Articulatory information exchange

  • paper_url: http://arxiv.org/abs/2309.07861
  • repo_url: https://github.com/gbegus/articulationgan
  • paper_authors: Gašper Beguš, Thomas Lu, Alan Zhou, Peter Wu, Gopala K. Anumanchipalli
  • for: To provide a model of human spoken language acquisition for simulating human speech cognition and spoken communication.
  • methods: Combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality, and proposes an improved articulatory model with more interpretable internal representations.
  • results: The proposed CiwaGAN model is the most realistic deep-learning approximation of human spoken language acquisition to date, making it useful for cognitively plausible simulations of the human speech act.
    Abstract Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus. This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality. While prior research includes unsupervised articulatory modeling and information exchange separately, our model is the first to combine the two components. The paper also proposes an improved articulatory model with more interpretable internal representations. The proposed CiwaGAN model is the most realistic approximation of human spoken language acquisition using deep learning. As such, it is useful for cognitively plausible simulations of the human speech act.

ExpertQA: Expert-Curated Questions and Attributed Answers

  • paper_url: http://arxiv.org/abs/2309.07852
  • repo_url: https://github.com/chaitanyamalaviya/expertqa
  • paper_authors: Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, Dan Roth
  • for: To study the factuality and attribution of language-model outputs across a range of fields of study and professions.
  • methods: Brings domain experts into the loop to evaluate whether model outputs are factual and well attributed, producing ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
  • results: Language models' factuality and attribution vary across fields, with gaps relative to expert knowledge in some domains; these findings can help improve the training and application of language models.
    Abstract As language models are adapted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study & professions. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying factuality and attribution has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we present an evaluation study analyzing various axes of factuality and attribution provided in responses from a few systems, by bringing domain experts in the loop. Specifically, we first collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. We also ask experts to revise answers produced by language models, which leads to ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

Applying Deep Learning to Calibrate Stochastic Volatility Models

  • paper_url: http://arxiv.org/abs/2309.07843
  • repo_url: None
  • paper_authors: Abir Sridi, Paul Bilokon
  • for: This paper aims to improve the calibration of stochastic volatility models by using deep learning techniques to speed up the calibration process and achieve more accurate results.
  • methods: The authors use a differential deep learning (DDL) approach, training machine learning models on samples of not only features and labels but also differentials of labels with respect to features (see the sketch after the abstract). They also compare different regularization techniques and show that the DDL approach outperforms classical deep learning methods.
  • results: The trained neural network dramatically reduces the computation time required for Heston calibration, and the DDL approach outperforms classical deep learning in reducing overfitting and improving generalization error.
    Abstract Stochastic volatility models, where the volatility is a stochastic process, can capture most of the essential stylized facts of implied volatility surfaces and give more realistic dynamics of the volatility smile or skew. However, they come with the significant issue that they take too long to calibrate. Alternative calibration methods based on Deep Learning (DL) techniques have been recently used to build fast and accurate solutions to the calibration problem. Huge and Savine developed a Differential Deep Learning (DDL) approach, where Machine Learning models are trained on samples of not only features and labels but also differentials of labels to features. The present work aims to apply the DDL technique to price vanilla European options (i.e. the calibration instruments), more specifically, puts when the underlying asset follows a Heston model and then calibrate the model on the trained network. DDL allows for fast training and accurate pricing. The trained neural network dramatically reduces Heston calibration's computation time. In this work, we also introduce different regularisation techniques, and we apply them notably in the case of the DDL. We compare their performance in reducing overfitting and improving the generalisation error. The DDL performance is also compared to the classical DL (without differentiation) one in the case of Feed-Forward Neural Networks. We show that the DDL outperforms the DL.
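The DDL idea of Huge and Savine can be shown in a few lines: fit a network to values and to the differentials of labels with respect to features at the same time. The toy target y = x^2 and the unit weighting of the derivative term below are our assumptions, not the paper's Heston setup.

```python
import torch
import torch.nn as nn

# Minimal differential-training loop: fit values *and* first-order sensitivities.
net = nn.Sequential(nn.Linear(1, 64), nn.Softplus(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Toy pathwise samples standing in for simulated payoffs: y = x^2, dy/dx = 2x.
x = torch.linspace(-1, 1, 256).unsqueeze(1)
y, dydx = x ** 2, 2 * x

for _ in range(500):
    opt.zero_grad()
    x_in = x.clone().requires_grad_(True)
    pred = net(x_in)
    # Differentials of predictions w.r.t. inputs, kept in the graph so the
    # derivative-matching term itself receives gradients.
    dpred, = torch.autograd.grad(pred.sum(), x_in, create_graph=True)
    loss = nn.functional.mse_loss(pred, y) + 1.0 * nn.functional.mse_loss(dpred, dydx)
    loss.backward()
    opt.step()

print(float(loss))  # the value and derivative errors shrink together
```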

Two Timin’: Repairing Smart Contracts With A Two-Layered Approach

  • paper_url: http://arxiv.org/abs/2309.07841
  • repo_url: None
  • paper_authors: Abhinav Jain, Ehan Masud, Michelle Han, Rohan Dhillon, Sumukh Rao, Arya Joshi, Salar Cheema, Saurav Kumar
  • for: This work proposes a two-layered framework for automatically detecting and repairing vulnerabilities in smart contracts.
  • methods: The framework has two layers: the first combines Slither's vulnerability report with the source code and passes them through a pre-trained RandomForestClassifier (RFC) and Large Language Models (LLMs) to classify and repair each suggested vulnerability; the second builds smart-contract repair models from a pre-trained GPT-3.5-Turbo and a fine-tuned Llama-2-7B (a schematic pipeline sketch follows the abstract).
  • results: The repair models built from pre-trained GPT-3.5-Turbo and fine-tuned Llama-2-7B reduced the overall vulnerability count by 97.5% and 96.7%, respectively. Manual inspection shows that all repaired contracts retain functionality, indicating that the proposed method is suitable for automatic batch classification and repair of smart-contract vulnerabilities.
    Abstract Due to the modern relevance of blockchain technology, smart contracts present both substantial risks and benefits. Vulnerabilities within them can trigger a cascade of consequences, resulting in significant losses. Many current papers primarily focus on classifying smart contracts for malicious intent, often relying on limited contract characteristics, such as bytecode or opcode. This paper proposes a novel, two-layered framework: 1) classifying and 2) directly repairing malicious contracts. Slither's vulnerability report is combined with source code and passed through a pre-trained RandomForestClassifier (RFC) and Large Language Models (LLMs), classifying and repairing each suggested vulnerability. Experiments demonstrate the effectiveness of fine-tuned and prompt-engineered LLMs. The smart contract repair models, built from pre-trained GPT-3.5-Turbo and fine-tuned Llama-2-7B models, reduced the overall vulnerability count by 97.5% and 96.7% respectively. A manual inspection of repaired contracts shows that all retain functionality, indicating that the proposed method is appropriate for automatic batch classification and repair of vulnerabilities in smart contracts.
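The two-layer pipeline can be sketched as: a RandomForestClassifier screens Slither findings, and flagged ones are handed to an LLM for repair. Everything below except the scikit-learn classifier usage (`featurize`, `llm_repair`, the toy feature encoding) is a hypothetical placeholder of ours.

```python
from sklearn.ensemble import RandomForestClassifier

def featurize(finding: dict) -> list:
    """Hypothetical numeric encoding of a Slither finding."""
    return [finding["severity"], finding["line_count"], finding["detector_id"]]

def llm_repair(source: str, finding: dict) -> str:
    """Placeholder for prompting GPT-3.5-Turbo / Llama-2-7B with the flagged code."""
    raise NotImplementedError

def repair_contract(source: str, findings: list, clf: RandomForestClassifier) -> str:
    for finding in findings:
        if clf.predict([featurize(finding)])[0] == 1:   # 1 = repairable vulnerability
            source = llm_repair(source, finding)        # layer 2: LLM patch
    return source

# The classifier would be pre-trained on labeled Slither reports; toy data here.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
X_train = [[2, 10, 3], [0, 2, 7], [3, 25, 3]]
y_train = [1, 0, 1]
clf.fit(X_train, y_train)
print(clf.predict([[1, 5, 3]]))
```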

VAPOR: Legged Robot Navigation in Outdoor Vegetation Using Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.07832
  • repo_url: https://github.com/kasunweerkoon/VAPOR
  • paper_authors: Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Mohamed Elnoor, Dinesh Manocha
  • for: This work proposes a novel method using offline reinforcement learning (RL) for autonomous legged-robot navigation in unstructured, densely vegetated outdoor environments.
  • methods: Trains an actor-critic network on data collected in real outdoor vegetation to learn the physical and geometric properties of surrounding obstacles, such as height, density, and solidity/stiffness; the trained critic then evaluates velocities generated by a context-aware planner.
  • results: On a Spot robot in complex real-world outdoor scenes, success rates improve by up to 40% over prior methods, average current consumption drops by up to 2.9%, and normalized trajectory length decreases by up to 11.2%.
    Abstract We present VAPOR, a novel method for autonomous legged robot navigation in unstructured, densely vegetated outdoor environments using offline Reinforcement Learning (RL). Our method trains a novel RL policy using an actor-critic network and arbitrary data collected in real outdoor vegetation. Our policy uses height and intensity-based cost maps derived from 3D LiDAR point clouds, a goal cost map, and processed proprioception data as state inputs, and learns the physical and geometric properties of the surrounding obstacles such as height, density, and solidity/stiffness. The fully-trained policy's critic network is then used to evaluate the quality of dynamically feasible velocities generated from a novel context-aware planner. Our planner adapts the robot's velocity space based on the presence of entrapment inducing vegetation, and narrow passages in dense environments. We demonstrate our method's capabilities on a Spot robot in complex real-world outdoor scenes, including dense vegetation. We observe that VAPOR's actions improve success rates by up to 40%, decrease the average current consumption by up to 2.9%, and decrease the normalized trajectory length by up to 11.2% compared to existing end-to-end offline RL and other outdoor navigation methods.

Large-scale Weakly Supervised Learning for Road Extraction from Satellite Imagery

  • paper_url: http://arxiv.org/abs/2309.07823
  • repo_url: None
  • paper_authors: Shiqiao Meng, Zonglin Di, Siwei Yang, Yin Wang
  • for: This paper proposes a deep-learning-based automatic road-extraction method as an alternative to traditional manual mapping.
  • methods: Uses large-scale satellite imagery with OpenStreetMap road data as weak labels to pre-train semantic segmentation models built on a D-LinkNet architecture with a ResNet-50 backbone (a sketch of the label rasterization follows the abstract).
  • results: Prediction accuracy increases with the amount of weakly labeled data and with the road density of the areas chosen for training. Using up to 100 times more data than the widely used DeepGlobe road dataset, the model exceeds the top performer on the current DeepGlobe leaderboard, and thanks to large-scale pre-training it generalizes much better across shooting conditions than models trained only on curated datasets.
    Abstract Automatic road extraction from satellite imagery using deep learning is a viable alternative to traditional manual mapping. Therefore it has received considerable attention recently. However, most of the existing methods are supervised and require pixel-level labeling, which is tedious and error-prone. To make matters worse, the earth has a diverse range of terrain, vegetation, and man-made objects. It is well known that models trained in one area generalize poorly to other areas. Various shooting conditions such as light and angle, as well as different image processing techniques further complicate the issue. It is impractical to develop training data to cover all image styles. This paper proposes to leverage OpenStreetMap road data as weak labels and large scale satellite imagery to pre-train semantic segmentation models. Our extensive experimental results show that the prediction accuracy increases with the amount of the weakly labeled data, as well as the road density in the areas chosen for training. Using as much as 100 times more data than the widely used DeepGlobe road dataset, our model with the D-LinkNet architecture and the ResNet-50 backbone exceeds the top performer of the current DeepGlobe leaderboard. Furthermore, due to large-scale pre-training, our model generalizes much better than those trained with only the curated datasets, implying great application potential.
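The weak-label step amounts to burning OSM road vectors into raster masks aligned with the imagery. The sketch below assumes polylines already projected to pixel coordinates and uses a crude one-sided dilation for road width; both are our simplifications.

```python
import numpy as np
from skimage.draw import line

def rasterize_roads(polylines, shape, width=2):
    """Burn OSM road polylines (already in pixel coordinates) into a binary mask.

    This is the weak-label step: no manual pixel annotation, just vector data.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    for pts in polylines:
        for (r0, c0), (r1, c1) in zip(pts[:-1], pts[1:]):
            rr, cc = line(r0, c0, r1, c1)
            mask[rr, cc] = 1
    # Crude one-sided dilation so labels roughly match road width in pixels.
    for _ in range(width):
        mask[1:, :] |= mask[:-1, :]
        mask[:, 1:] |= mask[:, :-1]
    return mask

# Toy usage: one two-segment road on a 64x64 tile.
mask = rasterize_roads([[(5, 5), (40, 30), (60, 60)]], (64, 64))
print(mask.sum(), "road pixels")
```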

What Matters to Enhance Traffic Rule Compliance of Imitation Learning for Automated Driving

  • paper_url: http://arxiv.org/abs/2309.07808
  • repo_url: None
  • paper_authors: Hongkuan Zhou, Aifen Sui, Wei Cao, Letian Shi
  • for: To improve the overall performance of end-to-end autonomous driving, where the entire driving pipeline is replaced with a single neural network for a simpler structure and faster inference.
  • methods: Proposes P-CSG, a penalty-based imitation learning approach with cross-semantics-generation sensor fusion technologies.
  • results: On the Town 05 Long benchmark, the model achieves a driving-score improvement of over 15% relative to baseline models, and robustness evaluations against adversarial attacks such as FGSM and Dot attacks show a substantial increase in robustness.
    Abstract More research attention has recently been given to end-to-end autonomous driving technologies where the entire driving pipeline is replaced with a single neural network because of its simpler structure and faster inference time. Despite this appealing approach largely reducing the components in driving pipeline, its simplicity also leads to interpretability problems and safety issues arXiv:2003.06404. The trained policy is not always compliant with the traffic rules and it is also hard to discover the reason for the misbehavior because of the lack of intermediate outputs. Meanwhile, Sensors are also critical to autonomous driving's security and feasibility to perceive the surrounding environment under complex driving scenarios. In this paper, we proposed P-CSG, a novel penalty-based imitation learning approach with cross semantics generation sensor fusion technologies to increase the overall performance of End-to-End Autonomous Driving. We conducted an assessment of our model's performance using the Town 05 Long benchmark, achieving an impressive driving score improvement of over 15%. Furthermore, we conducted robustness evaluations against adversarial attacks like FGSM and Dot attacks, revealing a substantial increase in robustness compared to baseline models.More detailed information, such as code-based resources, ablation studies and videos can be found at https://hk-zh.github.io/p-csg-plus.

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild

  • paper_url: http://arxiv.org/abs/2309.08637
  • repo_url: None
  • paper_authors: Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi
  • for: To advance the application of large language models in multi-turn multimodal instruction-following tasks.
  • methods: Uses TextBind, an almost annotation-free framework that empowers larger language models with multi-turn interleaved multimodal instruction-following capabilities, requiring only image-caption pairs.
  • results: TextBind generates multi-turn multimodal instruction-response conversations from a language model, and the accompanying MIM architecture seamlessly handles interleaved image and text inputs and outputs.
    Abstract Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.

TiBGL: Template-induced Brain Graph Learning for Functional Neuroimaging Analysis

  • paper_url: http://arxiv.org/abs/2309.07947
  • repo_url: None
  • paper_authors: Xiangzhu Meng, Wei Wei, Qiang Liu, Shu Wu, Liang Wang
  • for: To improve the efficiency of diagnosing neurological disorders from functional neuroimaging by extracting template brain graphs that suppress noisy information and improve diagnostic performance.
  • methods: Proposes Template-induced Brain Graph Learning (TiBGL), a brain graph learning framework with both discriminative and interpretable abilities; it extracts a template brain graph for each group to remove noisy information and highlight important connectivity patterns, supported by a template-induced convolutional neural network and template-induced brain interpretation analysis.
  • results: On three real-world datasets, TiBGL achieves superior performance compared with nine state-of-the-art methods while remaining consistent with findings in the recent neuroscience literature.
    Abstract In recent years, functional magnetic resonance imaging has emerged as a powerful tool for investigating the human brain's functional connectivity networks. Related studies demonstrate that functional connectivity networks in the human brain can help to improve the efficiency of diagnosing neurological disorders. However, there still exist two challenges that limit the progress of functional neuroimaging. Firstly, there exists an abundance of noise and redundant information in functional connectivity data, resulting in poor performance. Secondly, existing brain network models have tended to prioritize either classification performance or the interpretation of neuroscience findings behind the learned models. To deal with these challenges, this paper proposes a novel brain graph learning framework called Template-induced Brain Graph Learning (TiBGL), which has both discriminative and interpretable abilities. Motivated by the related medical findings on functional connectivites, TiBGL proposes template-induced brain graph learning to extract template brain graphs for all groups. The template graph can be regarded as an augmentation process on brain networks that removes noise information and highlights important connectivity patterns. To simultaneously support the tasks of discrimination and interpretation, TiBGL further develops template-induced convolutional neural network and template-induced brain interpretation analysis. Especially, the former fuses rich information from brain graphs and template brain graphs for brain disorder tasks, and the latter can provide insightful connectivity patterns related to brain disorders based on template brain graphs. Experimental results on three real-world datasets show that the proposed TiBGL can achieve superior performance compared with nine state-of-the-art methods and keep coherent with neuroscience findings in recent literatures.

Variational Quantum Linear Solver enhanced Quantum Support Vector Machine

  • paper_url: http://arxiv.org/abs/2309.07770
  • repo_url: None
  • paper_authors: Jianming Yi, Kalyani Suresh, Ali Moghiseh, Norbert Wehn
  • for: Using quantum resources for supervised machine learning tasks such as classification.
  • methods: Proposes the Variational Quantum Linear Solver (VQLS) enhanced Quantum Support Vector Machine (QSVM), which uses a variational quantum linear solver to solve the system of linear equations of a least-squares SVM on NISQ devices (the underlying linear system is sketched after the abstract).
  • results: Extensive numerical experiments on the Iris dataset show that the approach can construct classifiers for feature spaces of one to seven dimensions, identify a separating hyperplane in an 8-dimensional feature space, and perform consistently well across instances of the same dataset.
    Abstract Quantum Support Vector Machines (QSVM) play a vital role in using quantum resources for supervised machine learning tasks, such as classification. However, current methods are strongly limited in terms of scalability on Noisy Intermediate Scale Quantum (NISQ) devices. In this work, we propose a novel approach called the Variational Quantum Linear Solver (VQLS) enhanced QSVM. This is built upon our idea of utilizing the variational quantum linear solver to solve system of linear equations of a least squares-SVM on a NISQ device. The implementation of our approach is evaluated by an extensive series of numerical experiments with the Iris dataset, which consists of three distinct iris plant species. Based on this, we explore the practicality and effectiveness of our algorithm by constructing a classifier capable of classification in a feature space ranging from one to seven dimensions. Furthermore, by strategically exploiting both classical and quantum computing for various subroutines of our algorithm, we effectively mitigate practical challenges associated with the implementation. These include significant improvement in the trainability of the variational ansatz and notable reductions in run-time for cost calculations. Based on the numerical experiments, our approach exhibits the capability of identifying a separating hyperplane in an 8-dimensional feature space. Moreover, it consistently demonstrated strong performance across various instances with the same dataset.
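The target of the VQLS is the standard least-squares SVM linear system. The sketch below builds that system classically and solves it with `np.linalg.solve` standing in for the quantum routine; the RBF kernel and hyperparameters are illustrative.

```python
import numpy as np

def lssvm_system(X, y, gamma=10.0, sigma=1.0):
    """Build and solve the least-squares SVM linear system classically.

    This is the system of linear equations a VQLS would target on a quantum
    device; here np.linalg.solve stands in for the quantum routine.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))            # RBF kernel
    Omega = np.outer(y, y) * K
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:], A[1:, 0] = y, y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate([[0.0], np.ones(n)])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                        # bias b, dual coefficients alpha

# Toy usage: two separable 2-D clusters with +/-1 labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.3, (10, 2)), rng.normal(1, 0.3, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
b, alpha = lssvm_system(X, y)
print(b, alpha[:3])
```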

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

  • paper_url: http://arxiv.org/abs/2309.07760
  • repo_url: https://github.com/minhanh151/respro
  • paper_authors: Anh Pham Thi Minh
  • for: To improve the zero-shot transferability of the pre-trained vision-language model CLIP to downstream tasks while avoiding manual prompt engineering, which hinders practical deployment.
  • methods: Follows Context Optimization (CoOp) in introducing learnable textual tokens into the vision domain, and adds a prompt encoder that reparameterizes the input prompt embeddings to improve generalization to unseen classes (a sketch of the reparameterization follows the abstract).
  • results: Experiments and extensive ablation studies on 8 benchmarks show the method is an efficient approach to prompt learning; in the 16-shot setting, PRE achieves a 5.60% average accuracy gain on new classes and a 3% gain in harmonic mean compared with CoOp, within a reasonable training time.
    Abstract Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.
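To make the reparameterization idea concrete, here is a minimal PyTorch sketch of CoOp-style learnable context tokens passed through a prompt encoder before being prepended to class-name embeddings. The residual MLP encoder, token count, and dimensions are assumptions for illustration, not the paper's architecture.

```python
# Sketch of a reparameterized prompt module: instead of optimizing the
# context embeddings directly (CoOp), an encoder transforms them first.
import torch
import torch.nn as nn

class ReparamPrompt(nn.Module):
    def __init__(self, n_ctx=16, dim=512, hidden=256):
        super().__init__()
        # learnable context embeddings, as in CoOp
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # prompt encoder that reparameterizes the raw embeddings
        self.encoder = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, class_embeds):  # (n_cls, n_name_tokens, dim)
        ctx = self.ctx + self.encoder(self.ctx)        # residual reparameterization
        ctx = ctx.unsqueeze(0).expand(class_embeds.size(0), -1, -1)
        # prepend the shared context to each class-name token sequence
        return torch.cat([ctx, class_embeds], dim=1)   # (n_cls, n_ctx+n_name, dim)

prompts = ReparamPrompt()(torch.randn(10, 4, 512))
print(prompts.shape)  # torch.Size([10, 20, 512])
```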

Generative AI Text Classification using Ensemble LLM Approaches

  • paper_url: http://arxiv.org/abs/2309.07755
  • repo_url: None
  • paper_authors: Harika Abburi, Michael Suesserman, Nirmala Pudota, Balaji Veeramani, Edward Bowen, Sanmitra Bhattacharya
  • for: Determining whether a given text was written by a human or generated by AI, and attributing a text to a specific language model.
  • methods: An ensemble neural model that feeds probabilities from several pre-trained LLMs as features into a Traditional Machine Learning (TML) classifier.
  • results: On the first task, the model ranked fifth and thirteenth for English and Spanish texts, respectively (macro $F1$ scores of 0.733 and 0.649). On the second task, it ranked first for both English and Spanish texts (macro $F1$ scores of 0.625 and 0.653).
    Abstract Large Language Models (LLMs) have shown impressive performance across a variety of Artificial Intelligence (AI) and natural language processing tasks, such as content creation, report generation, etc. However, unregulated malign application of these models can create undesirable consequences such as generation of fake news, plagiarism, etc. As a result, accurate detection of AI-generated language can be crucial in responsible usage of LLMs. In this work, we explore 1) whether a certain body of text is AI generated or written by human, and 2) attribution of a specific language model in generating a body of text. Texts in both English and Spanish are considered. The datasets used in this study are provided as part of the Automated Text Identification (AuTexTification) shared task. For each of the research objectives stated above, we propose an ensemble neural model that generates probabilities from different pre-trained LLMs which are used as features to a Traditional Machine Learning (TML) classifier following it. For the first task of distinguishing between AI and human generated text, our model ranked in fifth and thirteenth place (with macro $F1$ scores of 0.733 and 0.649) for English and Spanish texts, respectively. For the second task on model attribution, our model ranked in first place with macro $F1$ scores of 0.625 and 0.653 for English and Spanish texts, respectively.
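The ensemble recipe is straightforward to prototype: treat each LLM detector's probability as one feature and train a traditional classifier on top. In the sketch below, `get_llm_probs` is a hypothetical stand-in for the paper's pre-trained LLM scorers, and logistic regression is one illustrative TML choice.

```python
# Minimal sketch of the stacked ensemble: LLM probabilities -> TML classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def get_llm_probs(texts):
    # placeholder: one P(AI-generated) score per detector per text
    rng = np.random.default_rng(0)
    return rng.random((len(texts), 3))   # 3 hypothetical LLM detectors

texts = [f"example document {i}" for i in range(200)]
labels = np.random.default_rng(1).integers(0, 2, 200)   # 0 = human, 1 = AI

X = get_llm_probs(texts)                 # LLM probabilities as features
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```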

AIDPS: Adaptive Intrusion Detection and Prevention System for Underwater Acoustic Sensor Networks

  • paper_url: http://arxiv.org/abs/2309.07730
  • repo_url: None
  • paper_authors: Soumadeep Das, Aryan Mohammadi Pasikhani, Prosanta Gope, John A. Clark, Chintan Patel, Biplab Sikdar
  • for: The paper proposes a secure intrusion detection and prevention system for Underwater Acoustic Sensor Networks (UW-ASNs) to address the lack of security considerations and the resource-constrained nature of sensor nodes.
  • methods: The proposed Adaptive decentralized Intrusion Detection and Prevention System (AIDPS) uses machine learning algorithms (e.g., Adaptive Random Forest, light gradient-boosting machine, and K-nearest neighbors) and concept drift detection algorithms (e.g., ADWIN, kdqTree, and Page-Hinkley) to detect underwater-related attacks.
  • results: The proposed scheme outperforms state-of-the-art benchmarking methods while providing a wider range of desirable features such as scalability and low complexity, as demonstrated through extensive experimental results.
    Abstract Underwater Acoustic Sensor Networks (UW-ASNs) are predominantly used for underwater environments and find applications in many areas. However, a lack of security considerations, the unstable and challenging nature of the underwater environment, and the resource-constrained nature of the sensor nodes used for UW-ASNs (which makes them incapable of adopting security primitives) make the UW-ASN prone to vulnerabilities. This paper proposes an Adaptive decentralised Intrusion Detection and Prevention System called AIDPS for UW-ASNs. The proposed AIDPS can improve the security of the UW-ASNs so that they can efficiently detect underwater-related attacks (e.g., blackhole, grayhole and flooding attacks). To determine the most effective configuration of the proposed construction, we conduct a number of experiments using several state-of-the-art machine learning algorithms (e.g., Adaptive Random Forest (ARF), light gradient-boosting machine, and K-nearest neighbours) and concept drift detection algorithms (e.g., ADWIN, kdqTree, and Page-Hinkley). Our experimental results show that incremental ARF using ADWIN provides optimal performance when implemented with One-class support vector machine (SVM) anomaly-based detectors. Furthermore, our extensive evaluation results also show that the proposed scheme outperforms state-of-the-art bench-marking methods while providing a wider range of desirable features such as scalability and complexity.
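A rough picture of the detection loop: an anomaly detector scores incoming traffic, a concept-drift test watches the score stream, and a drift alarm triggers retraining on recent data. The sketch below pairs sklearn's One-class SVM with a hand-rolled Page-Hinkley test; the features, thresholds, and retraining window are illustrative, not the paper's tuned configuration.

```python
# Drift-aware anomaly detection sketch: One-class SVM + Page-Hinkley test.
import numpy as np
from sklearn.svm import OneClassSVM

class PageHinkley:
    def __init__(self, delta=0.005, lam=50.0):
        self.delta, self.lam = delta, lam
        self.mean, self.n, self.cum, self.cum_min = 0.0, 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta       # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lam    # True => drift alarm

rng = np.random.default_rng(0)
benign = rng.normal(0, 1, (500, 8))                  # pre-drift traffic features
detector = OneClassSVM(nu=0.05).fit(benign)
ph, window = PageHinkley(), []

for t in range(1000):
    x = rng.normal(0 if t < 500 else 2.5, 1, 8)      # mean shift at t = 500
    score = -detector.decision_function([x])[0]      # higher = more anomalous
    window.append(x)
    if ph.update(score):
        print(f"drift detected at t={t}; refitting detector")
        detector = OneClassSVM(nu=0.05).fit(np.array(window[-300:]))
        ph = PageHinkley()                           # reset the test
```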

ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing? (ver. 23Q3)

  • paper_url: http://arxiv.org/abs/2309.08636
  • repo_url: None
  • paper_authors: Edisa Lozić, Benjamin Štular
  • for: This paper analyzes the capabilities and limitations of six AI chatbots in scholarly writing in the humanities and archaeology.
  • methods: The methodology was based on human experts tagging AI-generated content for quantitative accuracy and qualitative precision.
  • results: The AI chatbots demonstrated proficiency in recombining existing knowledge but failed in generating original scientific content. Additionally, the paper highlights the challenges AI chatbots face in emulating human originality in scientific writing.
    Abstract Historically, proficient writing was deemed essential for human advancement, with creative expression viewed as one of the hallmarks of human achievement. However, recent advances in generative AI have marked an inflection point in this narrative, including for scientific writing. This article provides a comprehensive analysis of the capabilities and limitations of six AI chatbots in scholarly writing in the humanities and archaeology. The methodology was based on tagging AI generated content for quantitative accuracy and qualitative precision by human experts. Quantitative accuracy assessed the factual correctness, while qualitative precision gauged the scientific contribution. While the AI chatbots, especially ChatGPT-4, demonstrated proficiency in recombining existing knowledge, they failed in generating original scientific content. As a side note, our results also suggest that with ChatGPT-4 the size of the LLMs has plateaued. Furthermore, the paper underscores the intricate and recursive nature of human research. This process of transforming raw data into refined knowledge is computationally irreducible, which highlights the challenges AI chatbots face in emulating human originality in scientific writing. In conclusion, while large language models have revolutionised content generation, their ability to produce original scientific contributions in the humanities remains limited. We expect that this will change in the near future with the evolution of current LLM-based AI chatbots towards LLM-powered software.

NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches

  • paper_url: http://arxiv.org/abs/2309.07704
  • repo_url: None
  • paper_authors: Chi-en Amy Tai, Matthew Keller, Saeejith Nair, Yuhao Chen, Yifan Wu, Olivia Markham, Krish Parmar, Pengcheng Xi, Heather Keller, Sharon Kirkpatrick, Alexander Wong
  • for: Improving the accuracy of automated dietary intake estimation from food images.
  • methods: Uses computer vision and machine learning to automatically estimate dietary intake, introducing NutritionVerse-Synth, a large-scale synthetic food image dataset, and NutritionVerse-Real, a real image dataset.
  • results: Analyzes and benchmarks various dietary intake estimation approaches on these datasets and releases both publicly.
    Abstract Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) on https://www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.

Tree of Uncertain Thoughts Reasoning for Large Language Models

  • paper_url: http://arxiv.org/abs/2309.07694
  • repo_url: None
  • paper_authors: Shentong Mo, Miao Xin
  • for: Improving the decision precision of Large Language Models (LLMs), especially at intermediate decision points during multi-step reasoning.
  • methods: Uses Monte Carlo Dropout to quantify the local uncertainty of LLM responses at intermediate steps, then couples these uncertainty scores with global tree-search algorithms.
  • results: On two demanding planning tasks (Game of 24 and Mini Crosswords), experiments show TouT's superiority over both ToT and chain-of-thought prompting.
    Abstract While the recently introduced Tree of Thoughts (ToT) has heralded advancements in allowing Large Language Models (LLMs) to reason through foresight and backtracking for global decision-making, it has overlooked the inherent local uncertainties in intermediate decision points or "thoughts". These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significant concern in the reasoning process. Addressing this pivotal gap, we introduce the Tree of Uncertain Thoughts (TouT) - a reasoning framework tailored for LLMs. Our TouT effectively leverages Monte Carlo Dropout to quantify uncertainty scores associated with LLMs' diverse local responses at these intermediate steps. By marrying this local uncertainty quantification with global search algorithms, TouT enhances the model's precision in response generation. We substantiate our approach with rigorous experiments on two demanding planning tasks: Game of 24 and Mini Crosswords. The empirical evidence underscores TouT's superiority over both ToT and chain-of-thought prompting methods.
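The local-uncertainty step is standard Monte Carlo Dropout: keep dropout active at inference and read the spread of repeated stochastic passes as an uncertainty score. A toy PyTorch sketch follows; the scorer network is a hypothetical stand-in for whatever model scores a candidate thought.

```python
# MC Dropout sketch: the variance over T stochastic forward passes serves
# as the local uncertainty of an intermediate "thought" score.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                       nn.Dropout(p=0.2), nn.Linear(64, 1))

def mc_dropout_score(x, T=30):
    scorer.train()            # keep dropout stochastic at inference time
    with torch.no_grad():
        samples = torch.stack([scorer(x) for _ in range(T)])
    return samples.mean(0), samples.var(0)   # value estimate, uncertainty

thought = torch.randn(1, 32)                 # embedding of a candidate thought
mean, var = mc_dropout_score(thought)
print(f"score={mean.item():.3f}, uncertainty={var.item():.4f}")
```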

Detecting ChatGPT: A Survey of the State of Detecting ChatGPT-Generated Text

  • paper_url: http://arxiv.org/abs/2309.07689
  • repo_url: None
  • paper_authors: Mahdi Dhaini, Wessel Poelman, Ege Erdogan
  • for: Surveying how to distinguish human-written text from large language model (LLM) generated text, a problem critical to preserving text integrity in domains such as law, education, and science.
  • methods: Reviews current approaches, including the datasets constructed for the task, the detection methods employed, and qualitative comparisons of human- versus ChatGPT-generated text.
  • results: Summarizes the state of the art in detecting ChatGPT-generated text and distills the surveyed findings into general insights.
    Abstract While recent advancements in the capabilities and widespread accessibility of generative language models, such as ChatGPT (OpenAI, 2022), have brought about various benefits by generating fluent human-like text, the task of distinguishing between human- and large language model (LLM) generated text has emerged as a crucial problem. These models can potentially deceive by generating artificial text that appears to be human-generated. This issue is particularly significant in domains such as law, education, and science, where ensuring the integrity of text is of the utmost importance. This survey provides an overview of the current approaches employed to differentiate between texts generated by humans and ChatGPT. We present an account of the different datasets constructed for detecting ChatGPT-generated text, the various methods utilized, what qualitative analyses into the characteristics of human versus ChatGPT-generated text have been performed, and finally, summarize our findings into general insights

deepFDEnet: A Novel Neural Network Architecture for Solving Fractional Differential Equations

  • paper_url: http://arxiv.org/abs/2309.07684
  • repo_url: None
  • paper_authors: Ali Nosrati Firoozsalari, Hassan Dana Mazraeh, Alireza Afzal Aghaei, Kourosh Parand
  • for: Proposing a novel deep neural network architecture that accurately solves various types of fractional differential equations.
  • methods: Combines a Gaussian integration rule with an $L_1$ discretization technique, using a deep neural network to approximate the unknown function in each equation.
  • results: Experiments show the architecture solves a fractional ordinary differential equation, a fractional integro-differential equation, and a fractional partial differential equation with excellent precision.
    Abstract The primary goal of this research is to propose a novel architecture for a deep neural network that can solve fractional differential equations accurately. A Gaussian integration rule and an $L_1$ discretization technique are used in the proposed design. In each equation, a deep neural network is used to approximate the unknown function. Three forms of fractional differential equations have been examined to highlight the method's versatility: a fractional ordinary differential equation, a fractional order integrodifferential equation, and a fractional order partial differential equation. The results show that the proposed architecture solves different forms of fractional differential equations with excellent precision.
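The $L_1$ discretization named in the abstract is the standard scheme for the Caputo derivative of order $0<\alpha<1$: $D^\alpha u(t_n) \approx \frac{1}{\Gamma(2-\alpha)\,\tau^\alpha}\sum_{j=0}^{n-1} b_j\,(u_{n-j}-u_{n-j-1})$ with $b_j=(j+1)^{1-\alpha}-j^{1-\alpha}$. Below is a small numpy sketch, checked against the exact Caputo derivative of $u(t)=t$; in deepFDEnet the network would supply $u$ instead of a closed form.

```python
# L1 scheme for the Caputo fractional derivative on a uniform grid.
import numpy as np
from math import gamma

def l1_caputo(u, tau, alpha):
    """L1 approximation of D^alpha u at each grid point t_n = n * tau."""
    n_pts = len(u)
    b = (np.arange(n_pts) + 1) ** (1 - alpha) - np.arange(n_pts) ** (1 - alpha)
    du = np.diff(u)                       # u_{k+1} - u_k
    out = np.zeros(n_pts)
    for n in range(1, n_pts):
        # sum_{j=0}^{n-1} b_j * (u_{n-j} - u_{n-j-1})
        out[n] = (b[:n] * du[:n][::-1]).sum() / (gamma(2 - alpha) * tau ** alpha)
    return out

alpha, tau = 0.5, 1e-3
t = np.arange(0, 1 + tau, tau)
approx = l1_caputo(t, tau, alpha)         # u(t) = t
exact = t ** (1 - alpha) / gamma(2 - alpha)
print("max error:", np.abs(approx - exact)[1:].max())
```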

Assessing the nature of large language models: A caution against anthropocentrism

  • paper_url: http://arxiv.org/abs/2309.07683
  • repo_url: None
  • paper_authors: Ann Speed
  • for: Assessing OpenAI's chatbot GPT-3.5 to characterize its cognitive abilities and personality traits.
  • methods: Administers a battery of standard, normed, and validated cognitive and personality measures to estimate GPT-3.5's capabilities and their stability over a short period of time, compared to humans.
  • results: GPT-3.5 is unlikely to have developed sentience; it displayed large variability in cognitive and personality measures over repeated observations, and what in a human would be considered poor mental health, including low self-esteem and dissociation from reality, despite upbeat and helpful responses.
    Abstract Generative AI models garnered a large amount of public attention and speculation with the release of OpenAIs chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed GPT3.5 using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that GPT 3.5 is unlikely to have developed sentience, although its ability to respond to personality inventories is interesting. It did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, GPT3.5 displays what in a human would be considered poor mental health, including low self-esteem and marked dissociation from reality despite upbeat and helpful responses.

Federated Dataset Dictionary Learning for Multi-Source Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.07670
  • repo_url: None
  • paper_authors: Fabiola Espinosa Castellon, Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Aurélien Mayoue, Antoine Souloumiac, Cédric Gouy-Pallier
  • for: Federated domain adaptation under distributional shift among clients, where some clients hold only unlabeled data.
  • methods: Proposes FedDaDiL, which learns clients' distributions, each representing a domain, and collectively trains a federated dictionary of empirical distributions, with collaborative communication protocols and aggregation operations that keep client data private.
  • results: Extensive experiments on the (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks demonstrate feasibility and effectiveness, with comparisons against the centralized counterpart and other federated domain adaptation methods.
    Abstract In this article, we propose an approach for federated domain adaptation, a setting where distributional shift exists among clients and some have unlabeled data. The proposed framework, FedDaDiL, tackles the resulting challenge through dictionary learning of empirical distributions. In our setting, clients' distributions represent particular domains, and FedDaDiL collectively trains a federated dictionary of empirical distributions. In particular, we build upon the Dataset Dictionary Learning framework by designing collaborative communication protocols and aggregation operations. The chosen protocols keep clients' data private, thus enhancing overall privacy compared to its centralized counterpart. We empirically demonstrate that our approach successfully generates labeled data on the target domain with extensive experiments on (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks. Furthermore, we compare our method to its centralized counterpart and other benchmarks in federated domain adaptation.
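The dictionary atoms in this line of work are empirical distributions combined through Wasserstein barycenters. Below is a hedged sketch of that primitive with the POT library, treating three toy Gaussian clouds as client distributions; it assumes your POT version exposes `ot.lp.free_support_barycenter`, and the data are synthetic.

```python
# Free-support Wasserstein barycenter of clients' empirical distributions,
# the core operation behind dataset dictionary learning.
import numpy as np
import ot

rng = np.random.default_rng(0)
# three "clients", each an empirical distribution over 2-D features
clients = [rng.normal(mu, 0.5, (100, 2)) for mu in ([0, 0], [3, 0], [0, 3])]

measures_locations = clients
measures_weights = [np.full(100, 1 / 100) for _ in clients]
X_init = rng.normal(1, 1, (100, 2))        # initial support of the barycenter

atom = ot.lp.free_support_barycenter(measures_locations, measures_weights, X_init)
print("barycenter support:", atom.shape, "mean:", atom.mean(0))
```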

Multi-Source Domain Adaptation meets Dataset Distillation through Dataset Dictionary Learning

  • paper_url: http://arxiv.org/abs/2309.07666
  • repo_url: None
  • paper_authors: Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Antoine Souloumiac
  • for: Addressing the intersection of Multi-Source Domain Adaptation (MSDA) and Dataset Distillation (DD), a new problem the authors call MSDA-DD.
  • methods: Adapts prior MSDA approaches such as Wasserstein Barycenter Transport and Dataset Dictionary Learning, together with the DD method Distribution Matching.
  • results: Extensive experiments on four benchmarks (Caltech-Office 10, Tennessee-Eastman Process, Continuous Stirred Tank Reactor, and Case Western Reserve University) show that state-of-the-art adaptation performance is achievable with as little as one sample per class.
    Abstract In this paper, we consider the intersection of two problems in machine learning: Multi-Source Domain Adaptation (MSDA) and Dataset Distillation (DD). On the one hand, the first considers adapting multiple heterogeneous labeled source domains to an unlabeled target domain. On the other hand, the second attacks the problem of synthesizing a small summary containing all the information about the datasets. We thus consider a new problem called MSDA-DD. To solve it, we adapt previous works in the MSDA literature, such as Wasserstein Barycenter Transport and Dataset Dictionary Learning, as well as DD method Distribution Matching. We thoroughly experiment with this novel problem on four benchmarks (Caltech-Office 10, Tennessee-Eastman Process, Continuous Stirred Tank Reactor, and Case Western Reserve University), where we show that, even with as little as 1 sample per class, one achieves state-of-the-art adaptation performance.

Feature Engineering in Learning-to-Rank for Community Question Answering Task

  • paper_url: http://arxiv.org/abs/2309.07610
  • repo_url: None
  • paper_authors: Nafis Sajid, Md Rashidul Hasan, Muhammad Ibrahim
  • for: This paper aims to improve the ranking of answers in community question answering (CQA) forums by introducing a BERT-based feature and combining both question and answer features.
  • methods: The proposed framework uses traditional features like TF-IDF and BM25, as well as a BERT-based feature to capture the semantic similarity between questions and answers. The framework also employs rank-learning algorithms that have not been widely used in the CQA domain.
  • results: The proposed framework achieves state-of-the-art performance on three standard CQA datasets, and the analysis of feature importance provides guidance for practitioners to select a better set of features for the CQA retrieval task.
    Abstract Community question answering (CQA) forums are Internet-based platforms where users ask questions about a topic and other expert users try to provide solutions. Many CQA forums such as Quora, Stackoverflow, Yahoo!Answer, StackExchange exist with a lot of user-generated data. These data are leveraged in automated CQA ranking systems where similar questions (and answers) are presented in response to the query of the user. In this work, we empirically investigate a few aspects of this domain. Firstly, in addition to traditional features like TF-IDF, BM25 etc., we introduce a BERT-based feature that captures the semantic similarity between the question and answer. Secondly, most of the existing research works have focused on features extracted only from the question part; features extracted from answers have not been explored extensively. We combine both types of features in a linear fashion. Thirdly, using our proposed concepts, we conduct an empirical investigation with different rank-learning algorithms, some of which have not been used so far in CQA domain. On three standard CQA datasets, our proposed framework achieves state-of-the-art performance. We also analyze importance of the features we use in our investigation. This work is expected to guide the practitioners to select a better set of features for the CQA retrieval task.
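A minimal sketch of the feature design: a lexical score (BM25 via the `rank_bm25` package) combined linearly with a BERT-based semantic-similarity feature between question and answer. The embedding model and mixing weights below are illustrative assumptions; in the paper, the combination would be learned by a rank-learning algorithm.

```python
# Combining a lexical feature (BM25) with a BERT-based semantic feature.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

question = "how do I reverse a list in python"
answers = ["use the reversed() builtin or list slicing with [::-1]",
           "java arrays can be reversed with Collections.reverse",
           "list.reverse() reverses the list in place"]

bm25 = BM25Okapi([a.split() for a in answers])
lexical = bm25.get_scores(question.split())          # BM25 feature per answer

model = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative model choice
q_emb = model.encode([question], normalize_embeddings=True)
a_emb = model.encode(answers, normalize_embeddings=True)
semantic = (a_emb @ q_emb.T).ravel()                 # BERT cosine feature

# linear combination; weights would come from a learning-to-rank algorithm
score = 0.4 * lexical / (lexical.max() + 1e-9) + 0.6 * semantic
print("ranking:", np.argsort(-score))                # best answer first
```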

Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?

  • paper_url: http://arxiv.org/abs/2309.07602
  • repo_url: https://github.com/antklen/sasrec-bert4rec-recsys23
  • paper_authors: Anton Klenitskiy, Alexey Vasilev
  • for: Compares the performance of SASRec and BERT4Rec in recommendation tasks, and explores the effectiveness of training SASRec with negative sampling.
  • methods: Uses Transformer-based models SASRec and BERT4Rec as baselines, and compares their performance with different loss functions and negative sampling strategies.
  • results: Finds that SASRec outperforms BERT4Rec in terms of quality and training speed when trained with the same loss function as BERT4Rec, and that SASRec can be effectively trained with negative sampling but requires a larger number of negative examples than one.
    Abstract Recently, sequential recommendation and the next-item prediction task have become increasingly popular in the field of recommender systems. Currently, two state-of-the-art baselines are Transformer-based models SASRec and BERT4Rec. Over the past few years, there have been quite a few publications comparing these two algorithms and proposing new state-of-the-art models. In most of the publications, BERT4Rec achieves better performance than SASRec. But BERT4Rec uses cross-entropy over softmax for all items, while SASRec uses negative sampling and calculates binary cross-entropy loss for one positive and one negative item. In our work, we show that if both models are trained with the same loss, which is used by BERT4Rec, then SASRec will significantly outperform BERT4Rec both in terms of quality and training speed. In addition, we show that SASRec could be effectively trained with negative sampling and still outperform BERT4Rec, but the number of negative examples should be much larger than one.
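The paper's central comparison boils down to two losses over the same transformer output. Below is a PyTorch sketch of both, using $k$ sampled negatives rather than the single negative of the original SASRec; shapes and $k$ are illustrative.

```python
# BERT4Rec-style full softmax cross-entropy vs. SASRec-style sampled BCE.
import torch
import torch.nn.functional as F

n_items, d = 10_000, 64
hidden = torch.randn(32, d)                    # transformer output per position
item_emb = torch.randn(n_items, d)
target = torch.randint(0, n_items, (32,))      # ground-truth next items

# BERT4Rec-style: cross-entropy over the whole item catalog
loss_full = F.cross_entropy(hidden @ item_emb.T, target)

# SASRec-style: BCE on the positive plus k sampled negatives
k = 256
neg = torch.randint(0, n_items, (32, k))
pos_logit = (hidden * item_emb[target]).sum(-1, keepdim=True)
neg_logit = torch.einsum("bd,bkd->bk", hidden, item_emb[neg])
logits = torch.cat([pos_logit, neg_logit], dim=1)
labels = torch.zeros_like(logits)
labels[:, 0] = 1.0                             # only the first column is positive
loss_sampled = F.binary_cross_entropy_with_logits(logits, labels)
print(float(loss_full), float(loss_sampled))
```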

Detecting Misinformation with LLM-Predicted Credibility Signals and Weak Supervision

  • paper_url: http://arxiv.org/abs/2309.07601
  • repo_url: None
  • paper_authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
  • for: Investigating whether large language models can be prompted with 18 credibility signals to produce weak labels for content veracity prediction.
  • methods: Prompts LLMs zero-shot for each of the 18 credibility signals, then aggregates the potentially noisy signal labels with weak supervision to predict content veracity.
  • results: The approach outperforms state-of-the-art classifiers on two misinformation datasets without any ground-truth training labels, and an analysis of the individual signals' contributions yields new insights into their role in misinformation detection.
    Abstract Credibility signals represent a wide range of heuristics that are typically used by journalists and fact-checkers to assess the veracity of online content. Automating the task of credibility signal extraction, however, is very challenging as it requires high-accuracy signal-specific extractors to be trained, while there are currently no sufficiently large datasets annotated with all credibility signals. This paper investigates whether large language models (LLMs) can be prompted effectively with a set of 18 credibility signals to produce weak labels for each signal. We then aggregate these potentially noisy labels using weak supervision in order to predict content veracity. We demonstrate that our approach, which combines zero-shot LLM credibility signal labeling and weak supervision, outperforms state-of-the-art classifiers on two misinformation datasets without using any ground-truth labels for training. We also analyse the contribution of the individual credibility signals towards predicting content veracity, which provides new valuable insights into their role in misinformation detection.
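The aggregation step can be prototyped with an off-the-shelf weak-supervision label model: each LLM-elicited credibility signal acts as a labeling function, and the label model reconciles their noisy votes. A hedged sketch using Snorkel's `LabelModel` (assumed API) on a synthetic vote matrix standing in for the 18 signals:

```python
# Weak-supervision aggregation of 18 noisy "credibility signal" votes.
import numpy as np
from snorkel.labeling.model import LabelModel

rng = np.random.default_rng(0)
n_docs, n_signals = 500, 18
truth = rng.integers(0, 2, n_docs)              # 1 = misinformation

# simulate 18 noisy LLM signal labels: mostly correct, sometimes abstaining
L = np.tile(truth[:, None], (1, n_signals))
flip = rng.random((n_docs, n_signals)) < 0.25
L[flip] = 1 - L[flip]
L[rng.random((n_docs, n_signals)) < 0.1] = -1   # -1 = abstain, Snorkel convention

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=200, seed=0)
pred = label_model.predict(L)
print("agreement with truth:", (pred == truth).mean())
```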

C-Pack: Packaged Resources To Advance General Chinese Embedding

  • paper_url: http://arxiv.org/abs/2309.07597
  • repo_url: https://github.com/flagopen/flagembedding
  • paper_authors: Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighof
  • for: Providing a package of resources, C-Pack, to significantly advance the field of general Chinese text embeddings.
  • methods: Releases three key resources: 1) C-MTEB, a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets; 2) C-MTP, a massive text-embedding dataset curated from labeled and unlabeled Chinese corpora; and 3) C-TEM, a family of embedding models in multiple sizes.
  • results: The models outperform all prior Chinese text embeddings on C-MTEB by up to +10%, with the entire suite of training methods integrated and optimized; accompanying English data and models achieve state-of-the-art performance on the MTEB benchmark, and all resources are available at https://github.com/FlagOpen/FlagEmbedding.
    Abstract We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
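For orientation, here is a minimal usage sketch of a released embedding model through sentence-transformers, assuming the BGE checkpoints (e.g. `BAAI/bge-large-zh`) load this way; the FlagEmbedding repository documents the canonical interface.

```python
# Encoding Chinese documents and a query with a C-TEM/BGE-style model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh")   # assumed checkpoint name
docs = ["机器学习是人工智能的一个分支。", "今天天气很好。"]
query = ["什么是机器学习?"]

d_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode(query, normalize_embeddings=True)
print(q_emb @ d_emb.T)   # cosine similarities; the first doc should score higher
```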

Neuro-Symbolic Recommendation Model based on Logic Query

  • paper_url: http://arxiv.org/abs/2309.07594
  • repo_url: None
  • paper_authors: Maonian Wu, Bang Chen, Shaojun Zhu, Bo Zheng, Wei Peng, Mingyi Zhang
  • for: A neuro-symbolic recommendation model that handles the inconsistent and incomplete knowledge pure logic systems struggle with in real-world recommendation tasks.
  • methods: Transforms a user's interaction history into a logic expression and casts recommendation as a query over that expression, computing the logic operations with modular neural networks; an implicit logic encoder reduces the complexity of the logic computation.
  • results: Experiments on three well-known datasets show the method outperforms state-of-the-art shallow, deep, session-based, and reasoning models.
    Abstract A recommendation system assists users in finding items that are relevant to them. Existing recommendation models are primarily based on predicting relationships between users and items and use complex matching models or incorporate extensive external information to capture association patterns in data. However, recommendation is not only a problem of inductive statistics using data; it is also a cognitive task of reasoning decisions based on knowledge extracted from information. Hence, a logic system could naturally be incorporated for the reasoning in a recommendation task. However, although hard-rule approaches based on logic systems can provide powerful reasoning ability, they struggle to cope with inconsistent and incomplete knowledge in real-world tasks, especially for complex tasks such as recommendation. Therefore, in this paper, we propose a neuro-symbolic recommendation model, which transforms the user history interactions into a logic expression and then transforms the recommendation prediction into a query task based on this logic expression. The logic expressions are then computed based on the modular logic operations of the neural network. We also construct an implicit logic encoder to reasonably reduce the complexity of the logic computation. Finally, a user's interest items can be queried in the vector space based on the computation results. Experiments on three well-known datasets verified that our method performs better compared to state of the art shallow, deep, session, and reasoning models.

Statistically Valid Variable Importance Assessment through Conditional Permutations

  • paper_url: http://arxiv.org/abs/2309.07593
  • repo_url: None
  • paper_authors: Ahmad Chamma, Denis A. Engemann, Bertrand Thirion
  • for: This paper aims to provide a systematic approach for studying Conditional Permutation Importance (CPI) and to develop reusable benchmarks of state-of-the-art variable importance estimators.
  • methods: The paper uses a model-agnostic and computationally lean approach to study CPI, which overcomes the limitations of standard permutation importance by providing accurate type-I error control.
  • results: The paper shows that CPI consistently showed top accuracy across benchmarks when used with a deep neural network, and provides a more parsimonious selection of statistically significant variables in real-world data analysis.
    Abstract Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of correlations among covariates. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that $\textit{CPI}$ overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, $\textit{CPI}$ consistently showed top accuracy across benchmarks. An empirical benchmark on real-world data analysis in a large-scale medical dataset showed that $\textit{CPI}$ provides a more parsimonious selection of statistically significant variables. Our results suggest that $\textit{CPI}$ can be readily used as drop-in replacement for permutation-based methods.
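The idea behind CPI is to permute only what is not explained by the other covariates: model $x_j$ from $x_{-j}$, shuffle the residuals, and measure the loss increase. Below is a hedged numpy/sklearn sketch; the linear imputation model and the toy data are illustrative. Note how the correlated-but-useless feature correctly receives near-zero importance, where naive permutation would flag it.

```python
# Conditional permutation importance: permute residuals, not raw columns.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=1000)   # correlated pair
y = X[:, 0] + rng.normal(scale=0.5, size=1000)          # only x0 matters

model = RandomForestRegressor(random_state=0).fit(X, y)
base = mean_squared_error(y, model.predict(X))

def cpi(j, n_perm=20):
    others = np.delete(X, j, axis=1)
    # residual of x_j after regressing it on the remaining covariates
    resid = X[:, j] - LinearRegression().fit(others, X[:, j]).predict(others)
    deltas = []
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, j] = X[:, j] - resid + rng.permutation(resid)
        deltas.append(mean_squared_error(y, model.predict(Xp)) - base)
    return np.mean(deltas)

print([round(cpi(j), 4) for j in range(5)])  # x1 stays near zero despite correlation
```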

Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.07578
  • repo_url: None
  • paper_authors: Cristina Pinneri, Sarah Bechtle, Markus Wulfmeier, Arunkumar Byravan, Jingwei Zhang, William F. Whitney, Martin Riedmiller
  • for: Improving the generalization of offline RL agents to out-of-distribution goals, learning from a fixed dataset without further environment interaction.
  • methods: Learns a dynamics model, checks its equivariance with respect to translations in the state space, enlarges the equivariant set with an entropy regularizer, and augments the dataset with the resulting transformed samples.
  • results: Training a new policy offline on the augmented dataset with an off-the-shelf offline RL algorithm greatly improves test performance on the considered environments.
    Abstract We present a novel approach to address the challenge of generalization in offline reinforcement learning (RL), where the agent learns from a fixed dataset without any additional interaction with the environment. Specifically, we aim to improve the agent's ability to generalize to out-of-distribution goals. To achieve this, we propose to learn a dynamics model and check if it is equivariant with respect to a fixed type of transformation, namely translations in the state space. We then use an entropy regularizer to increase the equivariant set and augment the dataset with the resulting transformed samples. Finally, we learn a new policy offline based on the augmented dataset, with an off-the-shelf offline RL algorithm. Our experimental results demonstrate that our approach can greatly improve the test performance of the policy on the considered environments.
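The augmentation itself is simple once (approximate) translation equivariance of the dynamics model has been verified: replicate each logged transition under random state-space shifts applied consistently to state, next state, and goal. A minimal sketch with an assumed dataset layout and shift scale:

```python
# Translation-based data augmentation for an offline RL dataset.
import numpy as np

rng = np.random.default_rng(0)
dataset = {
    "state": rng.normal(size=(1000, 4)),
    "action": rng.normal(size=(1000, 2)),
    "next_state": rng.normal(size=(1000, 4)),
    "goal": rng.normal(size=(1000, 4)),
}

def translate_augment(data, n_copies=4, scale=1.0):
    out = {k: [v] for k, v in data.items()}
    for _ in range(n_copies):
        shift = rng.normal(scale=scale, size=(len(data["state"]), 4))
        out["state"].append(data["state"] + shift)
        out["next_state"].append(data["next_state"] + shift)  # same shift
        out["goal"].append(data["goal"] + shift)
        out["action"].append(data["action"])                  # actions unchanged
    return {k: np.concatenate(v) for k, v in out.items()}

aug = translate_augment(dataset)
print({k: v.shape for k, v in aug.items()})
```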

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

  • paper_url: http://arxiv.org/abs/2309.07566
  • repo_url: None
  • paper_authors: Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao
  • for: Direct speech-to-speech translation (S2ST) that preserves speaker timbre despite the scarcity of speaker-parallel data.
  • methods: An acoustic language model built on discrete units from a self-supervised model, paired with a neural codec for style transfer; self-supervised in-context learning gives the model style-transfer ability without relying on any speaker-parallel data.
  • results: With extensive training data, the model achieves zero-shot cross-lingual style transfer on previously unseen source languages, generating translated speech with high fidelity and style similarity; audio samples are available at http://stylelm.github.io/ .
    Abstract Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model leverages self-supervised in-context learning, acquiring the ability for style transfer without relying on any speaker-parallel data, thereby overcoming the issue of data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/ .

Masked Generative Modeling with Enhanced Sampling Scheme

  • paper_url: http://arxiv.org/abs/2309.07945
  • repo_url: None
  • paper_authors: Daesoo Lee, Erlend Aune, Sara Malacarne
  • for: This paper proposes a novel sampling scheme for masked non-autoregressive generative modeling to overcome the limitations of existing sampling methods.
  • methods: The proposed Enhanced Sampling Scheme (ESS) consists of three stages: Naive Iterative Decoding, Critical Reverse Sampling, and Critical Resampling. ESS uses confidence scores from a self-Token-Critic and the structure of the quantized latent vector space to ensure both sample diversity and fidelity.
  • results: The proposed ESS shows significant performance gains in both unconditional sampling and class-conditional sampling using all 128 datasets in the UCR Time Series archive.
    Abstract This paper presents a novel sampling scheme for masked non-autoregressive generative modeling. We identify the limitations of TimeVQVAE, MaskGIT, and Token-Critic in their sampling processes, and propose Enhanced Sampling Scheme (ESS) to overcome these limitations. ESS explicitly ensures both sample diversity and fidelity, and consists of three stages: Naive Iterative Decoding, Critical Reverse Sampling, and Critical Resampling. ESS starts by sampling a token set using the naive iterative decoding as proposed in MaskGIT, ensuring sample diversity. Then, the token set undergoes the critical reverse sampling, masking tokens leading to unrealistic samples. After that, critical resampling reconstructs masked tokens until the final sampling step is reached to ensure high fidelity. Critical resampling uses confidence scores obtained from a self-Token-Critic to better measure the realism of sampled tokens, while critical reverse sampling uses the structure of the quantized latent vector space to discover unrealistic sample paths. We demonstrate significant performance gains of ESS in both unconditional sampling and class-conditional sampling using all the 128 datasets in the UCR Time Series archive.
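For context, the first of ESS's three stages is the MaskGIT-style naive iterative decoding loop: at each step the model predicts every masked token and only the most confident predictions are committed. A toy PyTorch sketch of that loop follows; the random "model" is a stand-in for a trained token predictor, and Critical Reverse Sampling and Critical Resampling would follow it.

```python
# MaskGIT-style iterative decoding: commit the most confident masked tokens.
import torch

V, L, MASK, steps = 512, 64, 512, 8        # vocab size, seq length, mask id

def model_logits(tokens):                  # toy stand-in for a trained predictor
    return torch.randn(tokens.shape[0], tokens.shape[1], V)

tokens = torch.full((1, L), MASK)
for t in range(steps):
    conf, pred = model_logits(tokens).softmax(-1).max(-1)
    # only still-masked positions compete for being committed this round
    conf = torch.where(tokens == MASK, conf, torch.full_like(conf, float("-inf")))
    n_new = int(L * (t + 1) / steps) - int(L * t / steps)   # linear unmask schedule
    keep = conf.topk(n_new, dim=-1).indices                 # most confident masked
    tokens = tokens.scatter(1, keep, pred.gather(1, keep))

print("masked positions left:", int((tokens == MASK).sum()))  # 0 after the loop
```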

SingFake: Singing Voice Deepfake Detection

  • paper_url: http://arxiv.org/abs/2309.07525
  • repo_url: None
  • paper_authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan
  • for: 这篇研究是针对伪声音Synthesis技术的应用和挑战,尤其是在音乐领域中。
  • methods: 这篇研究使用了四种现有的语音伪装系统,并在这些系统上进行了训练和评估。
  • results: 研究发现这些语音伪装系统在对话�uterances上表现良好,但在歌曲中却表现不佳,尤其是在不熟悉的歌手、语言和音乐背景下。
    Abstract The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection. The SingFake dataset and related resources are available online.

Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions

  • paper_url: http://arxiv.org/abs/2309.07510
  • repo_url: None
  • paper_authors: Kai Cheng, Ruihai Wu, Yan Shen, Chuanruo Ning, Guanqi Zhan, Hao Dong
  • for: An environment-aware affordance framework for manipulating 3D articulated objects under the occlusions and physical constraints of realistic environments.
  • methods: A novel contrastive affordance learning framework that trains on scenes containing a single occluder and generalizes to scenes with complex occluder combinations, combining object-level actionable priors with environment constraints.
  • results: Experiments show the proposed framework effectively learns affordance while accounting for environment constraints, generalizing across diverse occlusion scenarios.
    Abstract Perceiving and manipulating 3D articulated objects in diverse environments is essential for home-assistant robots. Recent studies have shown that point-level affordance provides actionable priors for downstream manipulation tasks. However, existing works primarily focus on single-object scenarios with homogeneous agents, overlooking the realistic constraints imposed by the environment and the agent's morphology, e.g., occlusions and physical limitations. In this paper, we propose an environment-aware affordance framework that incorporates both object-level actionable priors and environment constraints. Unlike object-centric affordance approaches, learning environment-aware affordance faces the challenge of combinatorial explosion due to the complexity of various occlusions, characterized by their quantities, geometries, positions and poses. To address this and enhance data efficiency, we introduce a novel contrastive affordance learning framework capable of training on scenes containing a single occluder and generalizing to scenes with complex occluder combinations. Experiments demonstrate the effectiveness of our proposed approach in learning affordance considering environment constraints.

Connected Autonomous Vehicle Motion Planning with Video Predictions from Smart, Self-Supervised Infrastructure

  • paper_url: http://arxiv.org/abs/2309.07504
  • repo_url: https://github.com/Jiankai-Sun/SSTA2-ITSC-2023
  • paper_authors: Jiankai Sun, Shreyas Kousik, David Fridovich-Keil, Mac Schwager
  • for: Connected autonomous vehicles (CAVs) must accurately predict surrounding agents and plan their own motion safely, which is challenging in complex urban environments with frequent occlusions and many interacting agents.
  • methods: Leverages the "Self-Supervised Traffic Advisor" (SSTA) framework of smart infrastructure sensors that teach themselves to generate and broadcast useful predictions of road users; SSTA is modified to predict future occupancy rather than raw video, reducing the data footprint of broadcast predictions.
  • results: The occupancy predictions are used within a planning framework and shown to effectively aid CAV motion planning; a variety of numerical experiments studies the key factors that make SSTA outputs useful in crowded urban environments.
    Abstract Connected autonomous vehicles (CAVs) promise to enhance safety, efficiency, and sustainability in urban transportation. However, this is contingent upon a CAV correctly predicting the motion of surrounding agents and planning its own motion safely. Doing so is challenging in complex urban environments due to frequent occlusions and interactions among many agents. One solution is to leverage smart infrastructure to augment a CAV's situational awareness; the present work leverages a recently proposed "Self-Supervised Traffic Advisor" (SSTA) framework of smart sensors that teach themselves to generate and broadcast useful video predictions of road users. In this work, SSTA predictions are modified to predict future occupancy instead of raw video, which reduces the data footprint of broadcast predictions. The resulting predictions are used within a planning framework, demonstrating that this design can effectively aid CAV motion planning. A variety of numerical experiments study the key factors that make SSTA outputs useful for practical CAV planning in crowded urban environments.

HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods

  • paper_url: http://arxiv.org/abs/2309.07495
  • repo_url: https://github.com/yylgoodlucky/hdtr
  • paper_authors: Yongyuan Li, Xiuyuan Qin, Chao Liang, Mingqiang Wei
  • for: Talking Face Generation (TFG) reconstructs facial movements from audio and facial features to achieve natural, realistic lip motion; most methods neglect visual quality, especially around the teeth.
  • methods: Proposes HDTR-Net, a universal High-Definition Teeth Restoration Network that sharpens teeth regions extremely fast while preserving synchronization and temporal consistency; a Fine-Grained Feature Fusion (FGFF) module captures fine texture information around the teeth and adjacent regions and refines the feature maps to enhance teeth clarity.
  • results: The method adapts to arbitrary TFG methods without degrading lip synchronization or frame coherence, generates in real time, and runs $300\%$ faster than current super-resolution-based face restoration for high-definition talking-face synthesis.
    Abstract Talking Face Generation (TFG) aims to reconstruct facial movements to achieve high natural lip movements from audio and facial features that are under potential connections. Existing TFG methods have made significant advancements to produce natural and realistic images. However, most work rarely takes visual quality into consideration. It is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation methods. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at an extremely fast speed while maintaining synchronization, and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture feature information around teeth and surrounding regions, and use these features to fine-grain the feature map to enhance the clarity of teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without suffering from lip synchronization and frame coherence. Another advantage of HDTR-Net is its real-time generation ability. Also under the condition of high-definition restoration of talking face video synthesis, its inference speed is $300\%$ faster than the current state-of-the-art face restoration based on super-resolution.

Where2Explore: Few-shot Affordance Learning for Unseen Novel Categories of Articulated Objects

  • paper_url: http://arxiv.org/abs/2309.07473
  • repo_url: None
  • paper_authors: Chuanruo Ning, Ruihai Wu, Haoran Lu, Kaichun Mo, Hao Dong
  • for: Tackling a fundamental problem in articulated object manipulation: generalizing to novel object categories despite significant geometric and semantic variation.
  • methods: Proposes 'Where2Explore', a few-shot affordance learning framework that exploits geometric similarity: it explicitly estimates similarity across categories, identifies local areas that differ from the training shapes for efficient exploration, and transfers affordance knowledge to similar parts of the objects.
  • results: Experiments in simulated and real-world environments demonstrate efficient few-shot exploration of novel categories and strong generalization.
    Abstract Articulated object manipulation is a fundamental yet challenging task in robotics. Due to significant geometric and semantic variations across object categories, previous manipulation models struggle to generalize to novel categories. Few-shot learning is a promising solution for alleviating this issue by allowing robots to perform a few interactions with unseen objects. However, extant approaches often necessitate costly and inefficient test-time interactions with each unseen instance. Recognizing this limitation, we observe that despite their distinct shapes, different categories often share similar local geometries essential for manipulation, such as pullable handles and graspable edges - a factor typically underutilized in previous few-shot learning works. To harness this commonality, we introduce 'Where2Explore', an affordance learning framework that effectively explores novel categories with minimal interactions on a limited number of instances. Our framework explicitly estimates the geometric similarity across different categories, identifying local areas that differ from shapes in the training categories for efficient exploration while concurrently transferring affordance knowledge to similar parts of the objects. Extensive experiments in simulated and real-world environments demonstrate our framework's capacity for efficient few-shot exploration and generalization.
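
As a rough illustration of the similarity-driven exploration idea, one can rank local regions of a novel object by how unfamiliar their geometry looks relative to the training categories, then spend interactions on the least familiar regions while transferring affordances to the rest. The descriptor pipeline below is a hypothetical stand-in, not the paper's learned similarity module:

```python
import numpy as np

def exploration_priority(novel_feats, train_feats):
    """Rank local regions of a novel object for interaction: regions whose
    geometry descriptors are least similar to anything seen in training
    get explored first; familiar regions reuse transferred affordances."""
    a = novel_feats / np.linalg.norm(novel_feats, axis=1, keepdims=True)
    b = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sim = a @ b.T                  # cosine similarity to every training descriptor
    closest = sim.max(axis=1)      # how familiar each novel region looks
    return np.argsort(closest)     # least familiar regions first

novel = np.random.randn(100, 32)   # 100 local-patch descriptors on the novel object
train = np.random.randn(500, 32)   # descriptors from training categories
print(exploration_priority(novel, train)[:5])  # indices worth interacting with
```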

Detecting Unknown Attacks in IoT Environments: An Open Set Classifier for Enhanced Network Intrusion Detection

  • paper_url: http://arxiv.org/abs/2309.07461
  • repo_url: None
  • paper_authors: Yasir Ali Farrukh, Syed Wali, Irfan Khan, Nathaniel D. Bastian
  • for: This paper aims to provide a network intrusion detection system (NIDS) tailored to IoT environments, capable of handling attacks encountered in Internet of Things (IoT) settings.
  • methods: The paper uses image-based representations of packet-level data to capture spatial and temporal patterns in network traffic, and integrates stacking and sub-clustering techniques to better identify unknown attacks.
  • results: Experiments show a detection rate of 88% for previously unseen attacks, outperforming existing approaches and recent advancements.
    Abstract The widespread integration of Internet of Things (IoT) devices across all facets of life has ushered in an era of interconnectedness, creating new avenues for cybersecurity challenges and underscoring the need for robust intrusion detection systems. However, traditional security systems are designed with a closed-world perspective and often face challenges in dealing with the ever-evolving threat landscape, where new and unfamiliar attacks are constantly emerging. In this paper, we introduce a framework aimed at mitigating the open set recognition (OSR) problem in the realm of Network Intrusion Detection Systems (NIDS) tailored for IoT environments. Our framework capitalizes on image-based representations of packet-level data, extracting spatial and temporal patterns from network traffic. Additionally, we integrate stacking and sub-clustering techniques, enabling the identification of unknown attacks by effectively modeling the complex and diverse nature of benign behavior. The empirical results prominently underscore the framework's efficacy, boasting an impressive 88\% detection rate for previously unseen attacks when compared against existing approaches and recent advancements. Future work will perform extensive experimentation across various openness levels and attack scenarios, further strengthening the adaptability and performance of our proposed solution in safeguarding IoT environments.
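
The open-set rejection idea, modeling benign behavior as several sub-clusters and flagging anything far from all of them, can be sketched as follows. The features, cluster count, and threshold rule are illustrative assumptions rather than the paper's pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# Model benign traffic as several sub-clusters; anything far from all
# benign centroids is flagged as a (possibly unknown) attack.
rng = np.random.default_rng(0)
benign = rng.normal(0, 1, size=(1000, 16))   # stand-in for image-based packet features
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(benign)

def is_unknown(x, threshold):
    d = np.linalg.norm(km.cluster_centers_ - x, axis=1).min()
    return d > threshold                      # open-set rejection rule

threshold = np.quantile(                      # calibrate on benign data, e.g. 99th pct
    [np.linalg.norm(km.cluster_centers_ - b, axis=1).min() for b in benign], 0.99)
print(is_unknown(rng.normal(5, 1, size=16), threshold))  # distant sample -> True
```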

Towards Artificial General Intelligence (AGI) in the Internet of Things (IoT): Opportunities and Challenges

  • paper_url: http://arxiv.org/abs/2309.07438
  • repo_url: None
  • paper_authors: Fei Dou, Jin Ye, Geng Yuan, Qin Lu, Wei Niu, Haijian Sun, Le Guan, Guoyu Lu, Gengchen Mai, Ninghao Liu, Jin Lu, Zhengliang Liu, Zihao Wu, Chenjiao Tan, Shaochen Xu, Xianqiao Wang, Guoming Li, Lilong Chai, Sheng Li, Jin Sun, Hongyue Sun, Yunli Shao, Changying Li, Tianming Liu, Wenzhan Song
  • for: This survey explores the applications and challenges of intelligent networks, especially in domains such as smart homes, manufacturing, transportation, and education.
  • methods: The study integrates AGI into IoT systems and proposes a conceptual framework for achieving this goal.
  • results: The study finds that AGI has a broad range of applications in IoT systems, but adapting AGI to the constraints of IoT devices requires further research; it also examines the complexity of IoT communication and the associated security issues.
    Abstract Artificial General Intelligence (AGI), possessing the capacity to comprehend, learn, and execute tasks with human cognitive abilities, engenders significant anticipation and intrigue across scientific, commercial, and societal arenas. This fascination extends particularly to the Internet of Things (IoT), a landscape characterized by the interconnection of countless devices, sensors, and systems, collectively gathering and sharing data to enable intelligent decision-making and automation. This research embarks on an exploration of the opportunities and challenges towards achieving AGI in the context of the IoT. Specifically, it starts by outlining the fundamental principles of IoT and the critical role of Artificial Intelligence (AI) in IoT systems. Subsequently, it delves into AGI fundamentals, culminating in the formulation of a conceptual framework for AGI's seamless integration within IoT. The application spectrum for AGI-infused IoT is broad, encompassing domains ranging from smart grids, residential environments, manufacturing, and transportation to environmental monitoring, agriculture, healthcare, and education. However, adapting AGI to resource-constrained IoT settings necessitates dedicated research efforts. Furthermore, the paper addresses constraints imposed by limited computing resources, intricacies associated with large-scale IoT communication, as well as the critical concerns pertaining to security and privacy.

Semantic Parsing in Limited Resource Conditions

  • paper_url: http://arxiv.org/abs/2309.07429
  • repo_url: None
  • paper_authors: Zhuang Li
  • for: This thesis focuses on the challenges facing semantic parsing, specifically under limited data and computational resources.
  • methods: The thesis proposes several solutions, including automatic data curation, knowledge transfer, active learning, and continual learning. With no parallel training data available, it generates synthetic training examples from structured database schemas. When a source domain has abundant data but the target domain has limited parallel data, knowledge from the source domain is leveraged to improve parsing performance. In multilingual settings, parsers are adapted under a limited human translation budget, using active learning to achieve better parsing performance in the target language.
  • results: Experimental results show that these methods effectively improve semantic parsing performance, especially under limited data and computational resources.
    Abstract This thesis explores challenges in semantic parsing, specifically focusing on scenarios with limited data and computational resources. It offers solutions using techniques like automatic data curation, knowledge transfer, active learning, and continual learning. For tasks with no parallel training data, the thesis proposes generating synthetic training examples from structured database schemas. When there is abundant data in a source domain but limited parallel data in a target domain, knowledge from the source is leveraged to improve parsing in the target domain. For multilingual situations with limited data in the target languages, the thesis introduces a method to adapt parsers using a limited human translation budget. Active learning is applied to select source-language samples for manual translation, maximizing parser performance in the target language. In addition, an alternative method is also proposed to utilize machine translation services, supplemented by human-translated data, to train a more effective parser. When computational resources are limited, a continual learning approach is introduced to minimize training time and computational memory. This maintains the parser's efficiency in previously learned tasks while adapting it to new tasks, mitigating the problem of catastrophic forgetting. Overall, the thesis provides a comprehensive set of methods to improve semantic parsing in resource-constrained conditions.
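
One plausible way to operationalize the active-learning step, choosing which source-language sentences receive the human translation budget, is a margin-based acquisition rule over parser beam scores. The scoring scheme below is an assumption for illustration, not the thesis's exact criterion:

```python
import numpy as np

def select_for_translation(beam_scores, budget):
    """Acquisition rule sketch: rank source-language sentences by the margin
    between the parser's top two beam hypotheses; small margins signal
    uncertainty, so those sentences get the human translation budget first."""
    margins = beam_scores[:, 0] - beam_scores[:, 1]  # top-1 minus top-2 score
    return np.argsort(margins)[:budget]              # most ambiguous first

rng = np.random.default_rng(0)
scores = np.sort(rng.random((1000, 5)))[:, ::-1]     # fake descending beam scores
print(select_for_translation(scores, budget=50)[:10])
```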

JSMNet Improving Indoor Point Cloud Semantic and Instance Segmentation through Self-Attention and Multiscale

  • paper_url: http://arxiv.org/abs/2309.07425
  • repo_url: None
  • paper_authors: Shuochen Xu, Zhenxin Zhang
  • for: This work aims to improve the semantic understanding of indoor 3D point cloud data for applications such as indoor service robots, navigation systems, and digital twin engineering.
  • methods: The proposed JSMNet combines a multi-layer network with a global feature self-attention module to achieve high-quality joint semantic and instance segmentation of indoor point clouds. A multi-resolution feature adaptive fusion module is also designed to better represent the characteristics of indoor targets.
  • results: Experiments on the S3DIS dataset show that the proposed method outperforms methods such as PointNet and ASIS by 16.0% and 26.3% in semantic and instance segmentation respectively, and exceeds methods such as JSPNet by 3.3% in target local-area segmentation.
    Abstract The semantic understanding of indoor 3D point cloud data is crucial for a range of subsequent applications, including indoor service robots, navigation systems, and digital twin engineering. Global features are crucial for achieving high-quality semantic and instance segmentation of indoor point clouds, as they provide essential long-range context information. To this end, we propose JSMNet, which combines a multi-layer network with a global feature self-attention module to jointly segment three-dimensional point cloud semantics and instances. To better express the characteristics of indoor targets, we have designed a multi-resolution feature adaptive fusion module that takes into account the differences in point cloud density caused by varying scanner distances from the target. Additionally, we propose a framework for joint semantic and instance segmentation by integrating semantic and instance features to achieve superior results. We conduct experiments on S3DIS, which is a large three-dimensional indoor point cloud dataset. Our proposed method is compared against other methods, and the results show that it outperforms existing methods in semantic and instance segmentation and provides better results in target local area segmentation. Specifically, our proposed method outperforms PointNet (Qi et al., 2017a) by 16.0% and 26.3% in terms of semantic segmentation mIoU in S3DIS (Area 5) and instance segmentation mPre, respectively. Additionally, it surpasses ASIS (Wang et al., 2019) by 6.0% and 4.6%, respectively, as well as JSPNet (Chen et al., 2022) by a margin of 3.3% for semantic segmentation mIoU and a slight improvement of 0.3% for instance segmentation mPre.
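
A global-feature self-attention module of the kind described can be sketched with a standard multi-head attention layer over per-point features, so every point can draw on long-range context. The exact JSMNet block is not given here, so the names and dimensions below are assumptions:

```python
import torch
import torch.nn as nn

class GlobalPointAttention(nn.Module):
    """Self-attention over per-point features so every point can draw on
    long-range context; a sketch of a global-feature module."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, N, C)
        ctx, _ = self.attn(feats, feats, feats)
        return self.norm(feats + ctx)    # residual + norm, transformer-style

pts = torch.randn(2, 4096, 64)           # B=2 rooms, N=4096 points, C=64 features
print(GlobalPointAttention(64)(pts).shape)  # torch.Size([2, 4096, 64])
```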

An Assessment of ChatGPT on Log Data

  • paper_url: http://arxiv.org/abs/2309.07938
  • repo_url: None
  • paper_authors: Priyanka Mudgal, Rita Wouhaybi
  • for: This paper investigates whether ChatGPT can process log data in a useful way, along with the model's shortcomings in this domain and possible next steps for improvement.
  • methods: The paper applies ChatGPT to log-processing tasks and analyzes and evaluates its performance.
  • results: The study finds that the current version of ChatGPT has limited log-processing capability, inconsistent responses, and scalability issues.
    Abstract Recent development of large language models (LLMs), such as ChatGPT has been widely applied to a wide range of software engineering tasks. Many papers have reported their analysis on the potential advantages and limitations of ChatGPT for writing code, summarization, text generation, etc. However, the analysis of the current state of ChatGPT for log processing has received little attention. Logs generated by large-scale software systems are complex and hard to understand. Despite their complexity, they provide crucial information for subject matter experts to understand the system status and diagnose problems of the systems. In this paper, we investigate the current capabilities of ChatGPT to perform several interesting tasks on log data, while also trying to identify its main shortcomings. Our findings show that the performance of the current version of ChatGPT for log processing is limited, with a lack of consistency in responses and scalability issues. We also outline our views on how we perceive the role of LLMs in the log processing discipline and possible next steps to improve the current capabilities of ChatGPT and the future LLMs in this area. We believe our work can contribute to future academic research to address the identified issues.
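
A minimal probe in the spirit of this assessment, asking a chat model to turn a raw log line into a template plus variable fields, might look like the following. This assumes the openai Python SDK's chat-completions interface and a valid API key; the prompt and log line are illustrative, not taken from the paper:

```python
from openai import OpenAI

# Hypothetical probe: ask a chat model to extract a structured template
# from a raw log line, one of the tasks the paper evaluates.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
log_line = "2023-09-14 10:32:01 ERROR db.pool Connection timeout after 30s (retry 3/5)"
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You convert raw log lines into templates."},
        {"role": "user", "content": f"Extract the constant template and the variable "
                                    f"fields from this log line:\n{log_line}"},
    ],
)
print(resp.choices[0].message.content)
```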

Client-side Gradient Inversion Against Federated Learning from Poisoning

  • paper_url: http://arxiv.org/abs/2309.07415
  • repo_url: https://github.com/clientsidegia/cgi
  • paper_authors: Jiaheng Wei, Yanjun Zhang, Leo Yu Zhang, Chao Chen, Shirui Pan, Kok-Leong Ong, Jun Zhang, Yang Xiang
  • for: The paper addresses the vulnerability of federated learning (FL) to gradient inversion attacks (GIA), and proposes a novel attack method called client-side poisoning gradient inversion (CGI) that can be launched from clients.
  • methods: The paper proposes a distinct approach in which an adversary utilizes a malicious model that amplifies the loss of a specific targeted class of interest, optimizes malicious updates, and blends benign updates with a malicious replacement vector to remain undetected by Byzantine-robust aggregation rules (AGRs).
  • results: The paper demonstrates the feasibility of a client-side adversary with limited knowledge recovering training samples from the aggregated global model, and shows that the proposed CGI method consistently and successfully extracts training input in all tested scenarios, including against Byzantine-robust AGRs.
    Abstract Federated Learning (FL) enables distributed participants (e.g., mobile devices) to train a global model without sharing data directly to a central server. Recent studies have revealed that FL is vulnerable to gradient inversion attack (GIA), which aims to reconstruct the original training samples and poses high risk against the privacy of clients in FL. However, most existing GIAs necessitate control over the server and rely on strong prior knowledge including batch normalization and data distribution information. In this work, we propose Client-side poisoning Gradient Inversion (CGI), which is a novel attack method that can be launched from clients. For the first time, we show the feasibility of a client-side adversary with limited knowledge being able to recover the training samples from the aggregated global model. We take a distinct approach in which the adversary utilizes a malicious model that amplifies the loss of a specific targeted class of interest. When honest clients employ the poisoned global model, the gradients of samples belonging to the targeted class are magnified, making them the dominant factor in the aggregated update. This enables the adversary to effectively reconstruct the private input belonging to other clients using the aggregated update. In addition, our CGI also features its ability to remain stealthy against Byzantine-robust aggregation rules (AGRs). By optimizing malicious updates and blending benign updates with a malicious replacement vector, our method remains undetected by these defense mechanisms. To evaluate the performance of CGI, we conduct experiments on various benchmark datasets, considering representative Byzantine-robust AGRs, and exploring diverse FL settings with different levels of adversary knowledge about the data. Our results demonstrate that CGI consistently and successfully extracts training input in all tested scenarios.
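
The core amplification trick, inflating the per-sample loss of one targeted class so its gradients dominate the aggregated update, can be sketched as below. This is a simplified illustration only; the paper's full attack additionally optimizes the malicious model and blends updates to evade Byzantine-robust AGRs:

```python
import torch
import torch.nn.functional as F

def poisoned_update(model, batch_x, batch_y, target_class, amplify=10.0):
    """Craft a client update that inflates the loss on one target class so
    that, after aggregation, gradients from that class dominate the global
    update (illustrative core of the amplification idea)."""
    logits = model(batch_x)
    per_sample = F.cross_entropy(logits, batch_y, reduction="none")
    weights = 1.0 + (batch_y == target_class).float() * (amplify - 1.0)
    loss = (weights * per_sample).mean()
    grads = torch.autograd.grad(loss, model.parameters())
    return [g.detach() for g in grads]

model = torch.nn.Linear(20, 5)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
update = poisoned_update(model, x, y, target_class=3)
print(update[0].shape)  # gradient w.r.t. the weight matrix
```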

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

  • paper_url: http://arxiv.org/abs/2309.07405
  • repo_url: None
  • paper_authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
  • for: This work develops FunCodec, a fundamental neural speech codec toolkit that provides reproducible training recipes and inference scripts, with an extensible design that integrates with other speech processing tools.
  • methods: The toolkit supports recent neural speech codec models such as SoundStream and Encodec, and provides pre-trained models for academic or general-purpose use.
  • results: Experiments show that, at the same compression ratio, FunCodec achieves better reconstruction quality than other toolkits and released models; the pre-trained models are also suitable for downstream tasks such as automatic speech recognition and personalized text-to-speech synthesis.
    Abstract This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.
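
FunCodec's exact API is best taken from the repository, but the residual vector quantization at the heart of SoundStream/Encodec-style codecs is easy to sketch: each quantizer stage encodes what the previous stages missed. The codebooks below are random stand-ins for trained ones:

```python
import numpy as np

def residual_vq(frame, codebooks):
    """Residual vector quantization, the core of SoundStream/Encodec-style
    codecs: each stage quantizes the residual of the previous stages."""
    residual, codes = frame.copy(), []
    for cb in codebooks:                       # one codebook per quantizer stage
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        codes.append(idx)
        residual = residual - cb[idx]          # pass the leftover to the next stage
    return codes, frame - residual             # discrete codes + reconstruction

rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 8)) for _ in range(4)]   # 4 stages, 256 entries, dim 8
codes, recon = residual_vq(rng.normal(size=8), books)
print(codes)   # the discrete "speech tokens" for this frame
```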

Multi-Grade Deep Learning for Partial Differential Equations with Applications to the Burgers Equation

  • paper_url: http://arxiv.org/abs/2309.07401
  • repo_url: None
  • paper_authors: Yuesheng Xu, Taishan Zeng
  • for: This paper proposes a multi-grade deep learning method for solving nonlinear partial differential equations (PDEs), which can efficiently learn solutions of the equations and outperform existing single-grade deep learning methods in predictive accuracy.
  • methods: The proposed method breaks down the task of learning a deep neural network (DNN) into several neural networks stacked on top of each other in a staircase-like manner, mitigating the complexity of solving a non-convex optimization problem with a large number of parameters and efficiently learning the residual components left over from previous grades.
  • results: The proposed two-stage multi-grade deep learning (TS-MGDL) method enables efficient learning of solutions of the 1D, 2D, and 3D viscous Burgers equations and outperforms existing single-grade deep learning methods in predictive accuracy; the predictive errors of single-grade deep learning are larger than those of TS-MGDL by factors of 26-60, 4-31, and 3-12 for the 1D, 2D, and 3D equations, respectively.
    Abstract We develop in this paper a multi-grade deep learning method for solving nonlinear partial differential equations (PDEs). Deep neural networks (DNNs) have achieved superb performance in solving PDEs, in addition to their outstanding success in areas such as natural language processing, computer vision, and robotics. However, training a very deep network is often a challenging task. As the number of layers of a DNN increases, solving the large-scale non-convex optimization problem that yields the DNN solution of a PDE becomes more and more difficult, which may lead to a decrease rather than an increase in predictive accuracy. To overcome this challenge, we propose a two-stage multi-grade deep learning (TS-MGDL) method that breaks down the task of learning a DNN into several neural networks stacked on top of each other in a staircase-like manner. This approach allows us to mitigate the complexity of solving the non-convex optimization problem with a large number of parameters and to learn residual components left over from previous grades efficiently. We prove that each grade/stage of the proposed TS-MGDL method can reduce the value of the loss function, and we further validate this fact through numerical experiments. Although the proposed method is applicable to general PDEs, the implementation in this paper focuses only on the 1D, 2D, and 3D viscous Burgers equations. Experimental results show that the proposed two-stage multi-grade deep learning method enables efficient learning of solutions of the equations and outperforms existing single-grade deep learning methods in predictive accuracy. Specifically, the predictive errors of single-grade deep learning are larger than those of the TS-MGDL method by factors of 26-60, 4-31, and 3-12 for the 1D, 2D, and 3D equations, respectively.
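
The staircase idea is simple to demonstrate on a toy regression target: each grade is a small network trained on the residual left by the sum of the previous grades. The sketch below uses scikit-learn for brevity and a synthetic 1D target rather than the Burgers solution:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Multi-grade idea on a toy 1D function: each small network (grade) fits the
# residual left by the sum of the previous grades, instead of training one
# very deep network end to end.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 400).reshape(-1, 1)
y = np.sin(4 * np.pi * x).ravel()          # stand-in target (not the Burgers solution)

prediction = np.zeros_like(y)
for grade in range(3):
    residual = y - prediction              # what earlier grades failed to capture
    net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=grade)
    net.fit(x, residual)
    prediction += net.predict(x)
    print(f"grade {grade}: mse = {np.mean((y - prediction) ** 2):.2e}")
```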

Semantic Adversarial Attacks via Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.07398
  • repo_url: https://github.com/steven202/semantic_adv_via_dm
  • paper_authors: Chenan Wang, Jinhao Duan, Chaowei Xiao, Edward Kim, Matthew Stamm, Kaidi Xu
  • for: This paper proposes a framework for quickly generating semantic adversarial attacks, with two variants: a Semantic Transformation (ST) approach and a Latent Masking (LM) approach.
  • methods: The framework leverages recent diffusion models to generate semantic adversarial attacks quickly, modifying and fine-tuning the latent space of these models.
  • results: Experiments show that the method achieves a high attack success rate (approximately 100%) and strong fidelity (best FID of 36.61) on the CelebA-HQ and AFHQ datasets, with good generalizability and transferability across settings.
    Abstract Traditional adversarial attacks concentrate on manipulating clean examples in the pixel space by adding adversarial perturbations. By contrast, semantic adversarial attacks focus on changing semantic attributes of clean examples, such as color, context, and features, which are more feasible in the real world. In this paper, we propose a framework to quickly generate a semantic adversarial attack by leveraging recent diffusion models since semantic information is included in the latent space of well-trained diffusion models. Then there are two variants of this framework: 1) the Semantic Transformation (ST) approach fine-tunes the latent space of the generated image and/or the diffusion model itself; 2) the Latent Masking (LM) approach masks the latent space with another target image and local backpropagation-based interpretation methods. Additionally, the ST approach can be applied in either white-box or black-box settings. Extensive experiments are conducted on CelebA-HQ and AFHQ datasets, and our framework demonstrates great fidelity, generalizability, and transferability compared to other baselines. Our approaches achieve approximately 100% attack success rate in multiple settings with the best FID as 36.61. Code is available at https://github.com/steven202/semantic_adv_via_dm.
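
The ST variant's core loop, gradient descent on a latent code so the decoded image fools a victim classifier, can be sketched with toy stand-ins. In the paper the decoder role is played by a pretrained diffusion model's latent space, not the linear layers used here:

```python
import torch
import torch.nn.functional as F

# Stand-ins: a frozen decoder (latent -> image pixels) and a victim
# classifier; in the paper the latent space comes from a pretrained
# diffusion model rather than this toy linear decoder.
decoder = torch.nn.Linear(64, 3 * 32 * 32)
victim = torch.nn.Linear(3 * 32 * 32, 10)
for p in list(decoder.parameters()) + list(victim.parameters()):
    p.requires_grad_(False)

z = torch.randn(1, 64, requires_grad=True)   # latent code of the clean image
target = torch.tensor([7])                   # class the attacker wants predicted
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(200):                          # ST-style latent optimization loop
    loss = F.cross_entropy(victim(decoder(z)), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(victim(decoder(z)).argmax(dim=1).item())  # 7 once the attack succeeds
```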

DebCSE: Rethinking Unsupervised Contrastive Sentence Embedding Learning in the Debiasing Perspective

  • paper_url: http://arxiv.org/abs/2309.07396
  • repo_url: None
  • paper_authors: Pu Miao, Zeyao Du, Junlin Zhang
  • for: This work aims to improve the quality of sentence embedding models, since prior studies show that BERT-style models can learn degraded sentence embeddings due to word frequency biases.
  • methods: The paper proposes a new contrastive learning framework, DebCSE, that eliminates the influence of various biases, including sentence length bias and false negative sample bias. DebCSE uses an inverse propensity weighted sampling method to select high-quality positive and negative pairs, improving embedding quality.
  • results: Extensive experiments on semantic textual similarity (STS) benchmarks show that DebCSE significantly outperforms baselines on BERTbase, with an average Spearman correlation coefficient of 80.33%.
    Abstract Several prior studies have suggested that word frequency biases can cause the BERT model to learn indistinguishable sentence embeddings. Contrastive learning schemes such as SimCSE and ConSERT have already been adopted successfully in unsupervised sentence embedding to improve the quality of embeddings by reducing this bias. However, these methods still introduce new biases, such as sentence length bias and false negative sample bias, which hinder the model's ability to learn more fine-grained semantics. In this paper, we reexamine the challenges of contrastive sentence embedding learning from a debiasing perspective and argue that effectively eliminating the influence of various biases is crucial for learning high-quality sentence embeddings. We think all those biases are introduced by simple rules for constructing training data in contrastive learning, and that the key for contrastive sentence embedding learning is to mimic, in an unsupervised way, the distribution of training data seen in supervised machine learning. We propose a novel contrastive framework for sentence embedding, termed DebCSE, which can eliminate the impact of these biases by an inverse propensity weighted sampling method to select high-quality positive and negative pairs according to both the surface and semantic similarity between sentences. Extensive experiments on semantic textual similarity (STS) benchmarks reveal that DebCSE significantly outperforms the latest state-of-the-art models with an average Spearman's correlation coefficient of 80.33% on BERTbase.
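
One way to picture the inverse propensity weighted sampling is below: candidate pairs that are semantically close but lexically diverse are upweighted so that biased, easy pairings do not dominate the contrastive batch. The similarity scores and weighting formula are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def debiased_pair_sampling(surface_sim, semantic_sim, n_pairs, rng):
    """Sample positive pairs with probability inversely related to how
    over-represented (e.g. length- or overlap-driven) a pairing is, so
    that easy, biased pairs do not dominate the contrastive batch."""
    quality = semantic_sim * (1.0 - surface_sim)   # semantically close, lexically diverse
    propensity = quality / quality.sum()
    return rng.choice(len(quality), size=n_pairs, replace=False, p=propensity)

rng = np.random.default_rng(0)
surface = rng.uniform(0, 1, 1000)    # e.g. lexical-overlap similarity of candidate pairs
semantic = rng.uniform(0, 1, 1000)   # e.g. model-scored semantic similarity
print(debiased_pair_sampling(surface, semantic, n_pairs=32, rng=rng))
```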

Unleashing the Power of Depth and Pose Estimation Neural Networks by Designing Compatible Endoscopic Images

  • paper_url: http://arxiv.org/abs/2309.07390
  • repo_url: None
  • paper_authors: Junyang Wu, Yun Gu
  • for: This work aims to improve depth and pose estimation frameworks for endoscopic navigation by making images more compatible with neural networks.
  • methods: Two techniques are proposed to improve the compatibility between endoscopic images and neural networks. First, a Mask Image Modelling (MIM) module feeds partial rather than complete image information, so the network learns to recover global information from partial pixel information. Second, a lightweight neural network enhances the endoscopic images to explicitly improve the compatibility between images and networks.
  • results: Extensive experiments on three public datasets and one in-house dataset show that the method improves baselines by a large margin. The proposed enhanced images can also serve as a data augmentation method; they yield more stable feature points and perform strongly in traditional feature point matching tasks.
    Abstract Depth and pose estimation frameworks trained on unannotated datasets have proven an effective pathway for deep learning models to succeed in endoscopic navigation. Most current techniques are dedicated to developing more advanced neural networks to improve accuracy. However, existing methods ignore the special properties of endoscopic images, resulting in an inability to fully unleash the power of neural networks. In this study, we conduct a detailed analysis of the properties of endoscopic images and improve the compatibility of images and neural networks, to unleash the power of current neural networks. First, we introduce the Mask Image Modelling (MIM) module, which inputs partial image information instead of complete image information, allowing the network to recover global information from partial pixel information. This enhances the network's ability to perceive global information and alleviates the local overfitting that local artifacts induce in convolutional neural networks. Second, we propose a lightweight neural network to enhance the endoscopic images, explicitly improving the compatibility between images and neural networks. Extensive experiments are conducted on three public datasets and one in-house dataset, and the proposed modules improve baselines by a large margin. Furthermore, the enhanced images we propose, which have higher network compatibility, can serve as an effective data augmentation method; they yield more stable feature points in traditional feature point matching tasks and achieve outstanding performance.
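
The MIM input strategy, feeding the network partial pixel information by zeroing random patches, can be sketched as a masking transform. Patch size and mask ratio below are arbitrary choices, not the paper's settings:

```python
import torch

def mask_patches(images, patch=16, mask_ratio=0.5, generator=None):
    """Zero out a random subset of patches so the network must recover
    global structure from partial pixel information (MIM-style input)."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, gh, gw, generator=generator) > mask_ratio
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return images * mask.unsqueeze(1)   # broadcast the (b, 1, h, w) mask over channels

frames = torch.randn(4, 3, 224, 224)    # a batch of endoscopic frames
print(mask_patches(frames).shape)       # torch.Size([4, 3, 224, 224])
```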

Landscape-Sketch-Step: An AI/ML-Based Metaheuristic for Surrogate Optimization Problems

  • paper_url: http://arxiv.org/abs/2309.07936
  • repo_url: https://github.com/rafael-a-monteiro-math/landscape_sketch_and_step
  • paper_authors: Rafael Monteiro, Kartik Sau
  • for: This paper proposes a new heuristic for global optimization in scenarios where evaluations of the cost function are expensive, inaccessible, or even prohibitive.
  • methods: The method combines machine learning, stochastic optimization, and reinforcement learning techniques, relying on historical information from previously sampled points to make judicious choices of parameter values at which the cost function should be evaluated.
  • results: Unlike Replica Exchange Monte Carlo methods, LSS requires a number of cost-function evaluations comparable to Simulated Annealing, which matters in high-throughput or high-performance computing tasks. Unlike standard surrogate optimization techniques, it does not construct a surrogate model of the objective function. Applied to low-dimensional optimization problems (dimensions 1, 2, 4, and 8) that mimic rugged energy landscapes, LSS shows an effective acceleration of the optimization process compared with classical Simulated Annealing.
    Abstract In this paper, we introduce a new heuristic for global optimization in scenarios where extensive evaluations of the cost function are expensive, inaccessible, or even prohibitive. The method, which we call Landscape-Sketch-and-Step (LSS), combines Machine Learning, Stochastic Optimization, and Reinforcement Learning techniques, relying on historical information from previously sampled points to make judicious choices of parameter values at which the cost function should be evaluated. Unlike optimization by Replica Exchange Monte Carlo methods, the number of evaluations of the cost function required in this approach is comparable to that used by Simulated Annealing, a quality that is especially important in contexts like high-throughput computing or high-performance computing tasks, where evaluations are either computationally expensive or take a long time to perform. The method also differs from standard Surrogate Optimization techniques, for it does not construct a surrogate model that aims at approximating or reconstructing the objective function. We illustrate our method by applying it to low dimensional optimization problems (dimensions 1, 2, 4, and 8) that mimic known difficulties of minimization on rugged energy landscapes often seen in Condensed Matter Physics, where cost functions are rugged and plagued with local minima. When compared to classical Simulated Annealing, the LSS shows an effective acceleration of the optimization process.
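
A loose sketch of the loop follows: a cheap model fitted on the evaluation history "sketches" the landscape and screens candidates, so the expensive cost function is called only once per step. The actual LSS additionally uses reinforcement-learning-driven parameter choices and annealing-style acceptance, which this toy omits:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def landscape_sketch_step(cost, bounds, n_steps=50, n_candidates=200, seed=0):
    """Loose sketch of the LSS loop: a cheap regressor fitted on the history
    of evaluated points screens many candidates, and only the most promising
    one is passed to the expensive cost function at each step."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    xs = rng.uniform(lo, hi, size=(5, 1))           # small initial sample
    ys = np.array([cost(x) for x in xs])
    for _ in range(n_steps):
        sketch = KNeighborsRegressor(n_neighbors=min(5, len(xs))).fit(xs, ys)
        cand = rng.uniform(lo, hi, size=(n_candidates, 1))
        x_next = cand[np.argmin(sketch.predict(cand))]  # cheap screening step
        xs = np.vstack([xs, x_next])
        ys = np.append(ys, cost(x_next))                # one expensive evaluation
    return xs[np.argmin(ys)], ys.min()

rugged = lambda x: float(np.sin(5 * x) + 0.1 * x**2)    # toy rugged landscape
print(landscape_sketch_step(rugged, bounds=(-10, 10)))
```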

The kernel-balanced equation for deep neural networks

  • paper_url: http://arxiv.org/abs/2309.07367
  • repo_url: None
  • paper_authors: Kenichi Nakazato
  • for: This paper studies the application of deep neural networks to distribution estimation and the associated stability problem.
  • methods: A deep neural network is used to estimate the distribution of a dataset, learning a generalized function through training.
  • results: The study finds that, with sufficiently long training and sufficiently high data density, the network's estimate becomes unstable, and this instability depends on both the data density and the training duration.
    Abstract Deep neural networks have shown many fruitful applications in this decade. A network can acquire a generalized function through training with a finite dataset. The degree of generalization is a realization of the proximity scale in the data space; specifically, the scale is unclear when the dataset is complicated. Here we consider a network for estimating the distribution of a dataset. We show that the estimation is unstable and that the instability depends on the data density and training duration. We derive the kernel-balanced equation, which gives a short phenomenological description of the solution. The equation tells us the reason for the instability and the mechanism of the scale. The network outputs a local average of the dataset as a prediction, and the scale of averaging is determined along the equation. The scale gradually decreases over the course of training and finally results in instability in our case.
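
The paper's picture, a trained network behaving like a kernel-weighted local average whose scale shrinks over training, can be mimicked numerically: as the scale decreases, the estimate at a query point is dominated by fewer, noisier samples. The Gaussian kernel below is an illustrative choice, not the paper's derived kernel:

```python
import numpy as np

def local_average(x_query, xs, ys, scale):
    """Kernel-weighted local average of the training targets; the paper's
    picture is that a trained network's prediction behaves like this, with
    the averaging scale shrinking as training proceeds."""
    w = np.exp(-((xs - x_query) ** 2) / (2 * scale**2))
    return (w * ys).sum() / w.sum()

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 200)
ys = np.sin(2 * np.pi * xs) + rng.normal(0, 0.3, 200)   # noisy targets
for scale in (0.3, 0.1, 0.02):   # early -> late training
    print(scale, local_average(0.5, xs, ys, scale))     # estimate grows noisier
```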

Doubly High-Dimensional Contextual Bandits: An Interpretable Model for Joint Assortment-Pricing

  • paper_url: http://arxiv.org/abs/2309.08634
  • repo_url: None
  • paper_authors: Junhui Cai, Ran Chen, Martin J. Wainwright, Linda Zhao
  • for: The paper addresses key challenges in running a retail business: selecting which products to present to consumers (the assortment problem) and how to price them (the pricing problem) to maximize revenue or profit.
  • methods: The authors propose a joint assortment-pricing approach based on contextual bandits. The model is doubly high-dimensional, in that both context vectors and actions may take values in high-dimensional spaces. To mitigate the curse of dimensionality, a simple yet flexible model captures the interactions between covariates and actions via a (near) low-rank representation matrix. This model class is reasonably expressive while remaining interpretable through latent factors, and includes various structured linear bandit and pricing models as special cases.
  • results: The authors propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator, and prove bounds on its regret. Simulations show lower regret than state-of-the-art methods on standard bandit and pricing models, and real-world case studies (a leading instant noodles company and an emerging beauty start-up) show at least three-fold gains in revenue or profit, along with interpretable latent factor models.
    Abstract Key challenges in running a retail business include how to select products to present to consumers (the assortment problem), and how to price products (the pricing problem) to maximize revenue or profit. Instead of considering these problems in isolation, we propose a joint approach to assortment-pricing based on contextual bandits. Our model is doubly high-dimensional, in that both context vectors and actions are allowed to take values in high-dimensional spaces. In order to circumvent the curse of dimensionality, we propose a simple yet flexible model that captures the interactions between covariates and actions via a (near) low-rank representation matrix. The resulting class of models is reasonably expressive while remaining interpretable through latent factors, and includes various structured linear bandit and pricing models as particular cases. We propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator, and we prove bounds on its regret. Simulation results show that this method has lower regret than state-of-the-art methods applied to various standard bandit and pricing models. Real-world case studies on the assortment-pricing problem, from an industry-leading instant noodles company to an emerging beauty start-up, underscore the gains achievable using our method. In each case, we show at least three-fold gains in revenue or profit by our bandit method, as well as the interpretability of the latent factor models that are learned.
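
The bilinear, low-rank reward model at the heart of the approach can be illustrated as follows: rewards follow r ≈ x^T Θ a with Θ near low-rank, and a moment estimate followed by rank truncation recovers the latent factors. The estimator below is a simplified stand-in for the paper's regularized procedure:

```python
import numpy as np

# Bilinear reward model: r ~ x^T Theta a with Theta (near) low-rank, so the
# covariate-action interaction is driven by a few latent factors.
rng = np.random.default_rng(0)
d_x, d_a, rank = 30, 20, 3
theta = rng.normal(size=(d_x, rank)) @ rng.normal(size=(rank, d_a))

X = rng.normal(size=(5000, d_x))                    # contexts
A = rng.normal(size=(5000, d_a))                    # actions (assortment/price features)
r = np.einsum("ij,jk,ik->i", X, theta, A) + rng.normal(0, 0.1, 5000)

# Moment estimate (valid here since X, A are isotropic and independent),
# followed by hard rank truncation as a stand-in for regularization.
theta_hat = X.T @ (r[:, None] * A) / len(r)
U, s, Vt = np.linalg.svd(theta_hat)
theta_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank]
print(np.linalg.matrix_rank(theta_lr))             # 3: latent factors recovered
```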

Hodge-Aware Contrastive Learning

  • paper_url: http://arxiv.org/abs/2309.07364
  • repo_url: None
  • paper_authors: Alexander Möllers, Alexander Immer, Vincent Fortuin, Elvin Isufi
  • for: Modeling data with multiway dependencies, such as data defined along the edges of networks or within other higher-order structures.
  • methods: The data spectrum is decomposed via the Hodge decomposition; data invariances are encoded through simplicial neural networks, and favorable augmentation strategies together with a reweighting of negative contrastive examples are devised.
  • results: The approach yields embeddings that capture targeted spectral information and achieves superior performance on two standard edge-flow classification tasks.
    Abstract Simplicial complexes prove effective in modeling data with multiway dependencies, such as data defined along the edges of networks or within other higher-order structures. Their spectrum can be decomposed into three interpretable subspaces via the Hodge decomposition, proving foundational in numerous applications. We leverage this decomposition to develop a contrastive self-supervised learning approach for processing simplicial data and generating embeddings that encapsulate specific spectral information. Specifically, we encode the pertinent data invariances through simplicial neural networks and devise augmentations that yield positive contrastive examples with suitable spectral properties for downstream tasks. Additionally, we reweight the significance of negative examples in the contrastive loss, considering the similarity of their Hodge components to the anchor. By encouraging a stronger separation among less similar instances, we obtain an embedding space that reflects the spectral properties of the data. The numerical results on two standard edge flow classification tasks show superior performance even when compared to supervised learning techniques. Our findings underscore the importance of adopting a spectral perspective for contrastive learning with higher-order data.
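
For an edge flow on a small simplicial complex, the Hodge decomposition splits the signal into gradient, curl, and harmonic parts using the incidence matrices B1 and B2. In the toy complex below the single cycle is filled by a triangle, so the harmonic part vanishes; the flow values are arbitrary:

```python
import numpy as np

# Hodge decomposition of an edge flow on a triangle plus a dangling edge.
# Nodes: 0,1,2,3; oriented edges: (0,1),(1,2),(0,2),(2,3); one triangle (0,1,2).
B1 = np.array([[-1,  0, -1,  0],     # node-to-edge incidence
               [ 1, -1,  0,  0],
               [ 0,  1,  1, -1],
               [ 0,  0,  0,  1]], dtype=float)
B2 = np.array([[1], [1], [-1], [0]], dtype=float)   # edge-to-triangle incidence

flow = np.array([2.0, -1.0, 0.5, 3.0])              # an edge flow (1-cochain)

# Project onto gradient (im B1^T) and curl (im B2) subspaces via least squares.
grad = B1.T @ np.linalg.lstsq(B1 @ B1.T, B1 @ flow, rcond=None)[0]
curl = B2 @ np.linalg.lstsq(B2.T @ B2, B2.T @ flow, rcond=None)[0]
harm = flow - grad - curl                           # harmonic remainder (~0 here)
print(np.round(grad, 3), np.round(curl, 3), np.round(harm, 3))
```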