cs.SD - 2023-11-06

Combinatorial Hodge Theory in Simplicial Signal Processing – DAFx2023 Lecture Notes

  • paper_url: http://arxiv.org/abs/2311.03469
  • repo_url: None
  • paper_authors: Georg Essl
  • for: These lecture notes cover the application of combinatorial Hodge theory to simplicial signal processing, particularly in the context of digital audio effects (DAFx).
  • methods: Combinatorial Hodge theory is used to analyze the structure of signals defined on simplicial complexes.
  • results: Combinatorial Hodge theory offers a new perspective for analyzing and understanding the structure, properties, and characteristics of signals in simplicial signal processing.
    Abstract Lecture notes of a tutorial on Combinatorial Hodge Theory in Simplicial Signal Processing held at the International Conference on Digital Audio Effects (DAFx-23) in Copenhagen, Denmark.
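
For orientation, a standard identity from combinatorial Hodge theory (general background, not excerpted from these notes): the k-th combinatorial Hodge Laplacian is assembled from the simplicial boundary (incidence) matrices,

    L_k = B_k^\top B_k + B_{k+1} B_{k+1}^\top,

where B_k maps k-simplices to their (k-1)-dimensional faces. L_0 recovers the ordinary graph Laplacian, and the associated Hodge decomposition splits edge signals into gradient, curl, and harmonic components; this is the structure that simplicial signal processing exploits.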

A Foundation Model for Music Informatics

  • paper_url: http://arxiv.org/abs/2311.03318
  • repo_url: https://github.com/minzwon/musicfm
  • paper_authors: Minz Won, Yun-Ning Hung, Duc Le
  • for: Investigates foundation models tailored for music informatics, a field constrained by the scarcity of labeled data and by generalization issues.
  • methods: Conducts a comparative study of multiple foundation-model variants, examining key determinants including model architecture, tokenization method, temporal resolution, data, and model scalability.
  • results: The model performs robustly across diverse music information retrieval tasks and surpasses existing models on specific key metrics; these findings advance the understanding of self-supervised learning in music informatics and pave the way for more effective and versatile foundation models. A pretrained version of the model is publicly released to foster reproducibility and future research.
    Abstract This paper investigates foundation models tailored for music informatics, a domain currently challenged by the scarcity of labeled data and generalization issues. To this end, we conduct an in-depth comparative study among various foundation model variants, examining key determinants such as model architectures, tokenization methods, temporal resolution, data, and model scalability. This research aims to bridge the existing knowledge gap by elucidating how these individual factors contribute to the success of foundation models in music informatics. Employing a careful evaluation framework, we assess the performance of these models across diverse downstream tasks in music information retrieval, with a particular focus on token-level and sequence-level classification. Our results reveal that our model demonstrates robust performance, surpassing existing models in specific key metrics. These findings contribute to the understanding of self-supervised learning in music informatics and pave the way for developing more effective and versatile foundation models in the field. A pretrained version of our model is publicly available to foster reproducibility and future research.

eess.AS - 2023-11-06

HRTF Estimation in the Wild

  • paper_url: http://arxiv.org/abs/2311.03560
  • repo_url: None
  • paper_authors: Vivek Jayaram, Ira Kemelmacher-Shlizerman, Steven M. Seitz
  • for: Aims to create more realistic spatial audio experiences through personalized HRTF estimation.
  • methods: Proposes a personalized HRTF estimation method based on in-the-wild binaural recordings and head-tracking data, requiring no specialized equipment or dedicated measurement sessions.
  • results: Shows that analyzing how binaural recordings change across different environments yields accurate personalized HRTFs, which improve sound localization and reduce front-back confusion in virtual environments.
    Abstract Head Related Transfer Functions (HRTFs) play a crucial role in creating immersive spatial audio experiences. However, HRTFs differ significantly from person to person, and traditional methods for estimating personalized HRTFs are expensive, time-consuming, and require specialized equipment. We imagine a world where your personalized HRTF can be determined by capturing data through earbuds in everyday environments. In this paper, we propose a novel approach for deriving personalized HRTFs that only relies on in-the-wild binaural recordings and head tracking data. By analyzing how sounds change as the user rotates their head through different environments with different noise sources, we can accurately estimate their personalized HRTF. Our results show that our predicted HRTFs closely match ground-truth HRTFs measured in an anechoic chamber. Furthermore, listening studies demonstrate that our personalized HRTFs significantly improve sound localization and reduce front-back confusion in virtual environments. Our approach offers an efficient and accessible method for deriving personalized HRTFs and has the potential to greatly improve spatial audio experiences.
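
As a toy illustration of the kind of binaural cue an in-the-wild pipeline can exploit (this is not the paper's algorithm; the function and its parameters are our own), the sketch below estimates the interaural time difference of a binaural recording by cross-correlation. Cues like this shift as the user rotates their head relative to a sound source.

```python
import numpy as np

def estimate_itd(left: np.ndarray, right: np.ndarray, sr: int, max_itd_s: float = 1e-3) -> float:
    """Lag (s) at which the right channel best aligns to the left;
    positive means the right ear lags (sound reached the left ear first)."""
    n = len(left)
    corr = np.correlate(right, left, mode="full")   # peak at lag d if right[n] == left[n-d]
    lags = np.arange(-(n - 1), n)
    mask = np.abs(lags) <= int(max_itd_s * sr)      # keep physiologically plausible lags
    return float(lags[mask][np.argmax(corr[mask])]) / sr

# Synthetic check: delay the right channel by 20 samples (~0.45 ms at 44.1 kHz).
sr = 44100
sig = np.random.default_rng(0).standard_normal(4096)
print(estimate_itd(sig, np.roll(sig, 20), sr))      # ~ 20 / 44100 s
```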

cs.CV - 2023-11-06

Toward Planet-Wide Traffic Camera Calibration

  • paper_url: http://arxiv.org/abs/2311.04243
  • repo_url: None
  • paper_authors: Khiem Vuong, Robert Tamburo, Srinivasa G. Narasimhan
  • for: addresses the challenge of calibration for outdoor cameras, which has limited their potential for automated analysis.
  • methods: uses street-level imagery to reconstruct a metric 3D model and accurately localize over 100 global traffic cameras, demonstrating a scalable framework.
  • results: achieves significant enhancements over existing automatic calibration techniques and enables traffic analysis through 3D vehicle reconstruction and speed measurement.
    Abstract Despite the widespread deployment of outdoor cameras, their potential for automated analysis remains largely untapped due, in part, to calibration challenges. The absence of precise camera calibration data, including intrinsic and extrinsic parameters, hinders accurate real-world distance measurements from captured videos. To address this, we present a scalable framework that utilizes street-level imagery to reconstruct a metric 3D model, facilitating precise calibration of in-the-wild traffic cameras. Notably, our framework achieves 3D scene reconstruction and accurate localization of over 100 global traffic cameras and is scalable to any camera with sufficient street-level imagery. For evaluation, we introduce a dataset of 20 fully calibrated traffic cameras, demonstrating our method's significant enhancements over existing automatic calibration techniques. Furthermore, we highlight our approach's utility in traffic analysis by extracting insights via 3D vehicle reconstruction and speed measurement, thereby opening up the potential of using outdoor cameras for automated analysis.
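
To make concrete what calibration buys (a minimal pinhole-model sketch; the intrinsics and extrinsics below are invented, not values from the paper): with K, R, and t known, world points map to pixels, and image tracks can conversely be lifted to metric positions for speed measurement.

```python
import numpy as np

def project(points_w: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project Nx3 world points to Nx2 pixel coordinates with a pinhole model."""
    cam = points_w @ R.T + t            # world frame -> camera frame
    uvw = cam @ K.T                     # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide

# Hypothetical calibration of one traffic camera (all values are made up).
K = np.array([[1200.0, 0.0, 960.0],
              [0.0, 1200.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                           # camera axes aligned with world axes
t = np.array([0.0, 0.0, 10.0])          # camera 10 m from the scene origin
print(project(np.array([[1.0, 0.5, 0.0]]), K, R, t))   # [[1080., 600.]]
```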

Unsupervised Region-Growing Network for Object Segmentation in Atmospheric Turbulence

  • paper_url: http://arxiv.org/abs/2311.03572
  • repo_url: None
  • paper_authors: Dehao Qin, Ripon Saha, Suren Jayasuriya, Jinwei Ye, Nianyi Li
  • for: Proposes an unsupervised foreground object segmentation network for dynamic scenes affected by atmospheric turbulence.
  • methods: Averaged optical flow drives a novel region-growing algorithm that produces a preliminary mask for each moving object in the video; a U-Net with consistency and grouping losses then refines these masks for the best spatio-temporal alignment.
  • results: The method needs no labeled training data, works across varied turbulence strengths, and shows higher segmentation accuracy and robustness than current unsupervised methods on a newly released moving object segmentation dataset.
    Abstract In this paper, we present a two-stage unsupervised foreground object segmentation network tailored for dynamic scenes affected by atmospheric turbulence. In the first stage, we utilize averaged optical flow from turbulence-distorted image sequences to feed a novel region-growing algorithm, crafting preliminary masks for each moving object in the video. In the second stage, we employ a U-Net architecture with consistency and grouping losses to further refine these masks optimizing their spatio-temporal alignment. Our approach does not require labeled training data and works across varied turbulence strengths for long-range video. Furthermore, we release the first moving object segmentation dataset of turbulence-affected videos, complete with manually annotated ground truth masks. Our method, evaluated on this new dataset, demonstrates superior segmentation accuracy and robustness as compared to current state-of-the-art unsupervised methods.
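
A generic region-growing pass of the kind the first stage describes, seeded on a temporally averaged optical-flow magnitude map (a sketch using our own running-mean threshold criterion; the paper's actual growing rule may differ):

```python
import numpy as np
from collections import deque

def region_grow(mag: np.ndarray, seed: tuple, tol: float) -> np.ndarray:
    """Grow a mask from `seed` over 4-connected pixels whose flow magnitude
    stays within `tol` of the running region mean."""
    h, w = mag.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    q = deque([seed])
    total, count = float(mag[seed]), 1
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
               and abs(mag[ny, nx] - total / count) <= tol:
                mask[ny, nx] = True
                total += float(mag[ny, nx]); count += 1
                q.append((ny, nx))
    return mask

# Seed at the peak of the averaged flow magnitude (toy data).
mag = np.zeros((64, 64)); mag[10:30, 10:30] = 5.0
seed = np.unravel_index(np.argmax(mag), mag.shape)
print(region_grow(mag, seed, tol=1.0).sum())   # 400 pixels recovered
```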

Cal-DETR: Calibrated Detection Transformer

  • paper_url: http://arxiv.org/abs/2311.03570
  • repo_url: https://github.com/akhtarvision/cal-detr
  • paper_authors: Muhammad Akhtar Munir, Salman Khan, Muhammad Haris Khan, Mohsen Ali, Fahad Shahbaz Khan
  • for: Aims to calibrate modern transformer-based object detectors, widening their applicability in safety-critical applications.
  • methods: Proposes Cal-DETR, a train-time calibration mechanism that combines a simple yet effective uncertainty quantification approach for transformer-based detectors with an uncertainty-guided modulation of the class logits, complemented by a logit mixing regularizer.
  • results: Results show that Cal-DETR effectively calibrates detectors in both in-domain and out-domain test scenarios while maintaining or even improving detection performance.
    Abstract Albeit revealing impressive predictive performance for several computer vision tasks, deep neural networks (DNNs) are prone to making overconfident predictions. This limits the adoption and wider utilization of DNNs in many safety-critical applications. There have been recent efforts toward calibrating DNNs, however, almost all of them focus on the classification task. Surprisingly, very little attention has been devoted to calibrating modern DNN-based object detectors, especially detection transformers, which have recently demonstrated promising detection performance and are influential in many decision-making systems. In this work, we address the problem by proposing a mechanism for calibrated detection transformers (Cal-DETR), particularly for Deformable-DETR, UP-DETR and DINO. We pursue the train-time calibration route and make the following contributions. First, we propose a simple yet effective approach for quantifying uncertainty in transformer-based object detectors. Second, we develop an uncertainty-guided logit modulation mechanism that leverages the uncertainty to modulate the class logits. Third, we develop a logit mixing approach that acts as a regularizer with detection-specific losses and is also complementary to the uncertainty-guided logit modulation technique to further improve the calibration performance. Lastly, we conduct extensive experiments across three in-domain and four out-domain scenarios. Results corroborate the effectiveness of Cal-DETR against the competing train-time methods in calibrating both in-domain and out-domain detections while maintaining or even improving the detection performance. Our codebase and pre-trained models can be accessed at \url{https://github.com/akhtarvision/cal-detr}.
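
One plausible reading of the uncertainty-guided logit modulation described above (a sketch, not the paper's exact formulation): treat the variance of class logits across decoder layers as uncertainty and damp the final-layer logits where that uncertainty is high.

```python
import torch

def modulate_logits(layer_logits: torch.Tensor) -> torch.Tensor:
    """layer_logits: [L, Q, C] class logits from L decoder layers and Q queries.
    Returns modulated final-layer logits of shape [Q, C]."""
    u = layer_logits.var(dim=0)                       # per-query/class variance across layers
    u = (u - u.min()) / (u.max() - u.min() + 1e-8)    # normalize uncertainty to [0, 1]
    return layer_logits[-1] * (1.0 - u)               # damp high-uncertainty logits

logits = torch.randn(6, 300, 91)      # e.g., 6 decoder layers, 300 queries, 91 classes
print(modulate_logits(logits).shape)  # torch.Size([300, 91])
```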

Sea You Later: Metadata-Guided Long-Term Re-Identification for UAV-Based Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2311.03561
  • repo_url: None
  • paper_authors: Cheng-Yen Yang, Hsiang-Wei Huang, Zhongyu Jiang, Heng-Cheng Kuo, Jie Mei, Chung-I Huang, Jenq-Neng Hwang
  • for: Addresses multi-object tracking for UAVs in maritime computer vision, in particular the difficulties of short-term re-identification (ReID) and long-term tracking.
  • methods: Proposes an adaptable metadata-guided multi-object tracking algorithm (MG-MOT) that merges short-term tracking data into coherent long-term tracks using UAV metadata such as GPS position, drone altitude, and camera orientation.
  • results: Achieves state-of-the-art performance in the latest edition of the UAV-based Maritime Object Tracking Challenge on the SeaDroneSee tracking dataset, with a HOTA of 69.5% and an IDF1 of 85.9% on the testing split.
    Abstract Re-identification (ReID) in multi-object tracking (MOT) for UAVs in maritime computer vision has been challenging for several reasons. More specifically, short-term re-identification (ReID) is difficult due to the nature of the characteristics of small targets and the sudden movement of the drone's gimbal. Long-term ReID suffers from the lack of useful appearance diversity. In response to these challenges, we present an adaptable motion-based MOT algorithm, called Metadata Guided MOT (MG-MOT). This algorithm effectively merges short-term tracking data into coherent long-term tracks, harnessing crucial metadata from UAVs, including GPS position, drone altitude, and camera orientations. Extensive experiments are conducted to validate the efficacy of our MOT algorithm. Utilizing the challenging SeaDroneSee tracking dataset, which encompasses the aforementioned scenarios, we achieve a much-improved performance in the latest edition of the UAV-based Maritime Object Tracking Challenge with a state-of-the-art HOTA of 69.5% and an IDF1 of 85.9% on the testing split.
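
A minimal sketch of metadata-guided track merging in the spirit of MG-MOT (the field names and the speed gate are our own assumptions, not the paper's algorithm): once UAV GPS, altitude, and camera pose place tracklet endpoints in world coordinates, two tracklets should only be linked when the implied motion is physically plausible.

```python
import numpy as np

def can_link(tracklet_a: dict, tracklet_b: dict, max_speed: float = 15.0) -> bool:
    """Gate long-term re-identification with metadata: two tracklets may be the
    same object only if the implied world-frame speed is plausible. Each
    tracklet carries 'start_t'/'end_t' (s) and 'start_xy'/'end_xy' world
    positions (m), derived from UAV GPS, altitude, and camera orientation."""
    dt = tracklet_b["start_t"] - tracklet_a["end_t"]
    if dt <= 0:
        return False
    dist = np.linalg.norm(np.asarray(tracklet_b["start_xy"]) - np.asarray(tracklet_a["end_xy"]))
    return dist / dt <= max_speed

a = {"end_t": 10.0, "end_xy": (0.0, 0.0)}
b = {"start_t": 15.0, "start_xy": (40.0, 0.0)}
print(can_link(a, b))   # True: 40 m in 5 s = 8 m/s, below the gate
```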

Spatio-Temporal Similarity Measure based Multi-Task Learning for Predicting Alzheimer’s Disease Progression using MRI Data

  • paper_url: http://arxiv.org/abs/2311.03557
  • repo_url: None
  • paper_authors: Xulong Wang, Yu Zhang, Menghui Zhou, Tong Liu, Jun Qi, Po Yang
  • for: Proposes a novel multi-task learning approach for effectively predicting Alzheimer's disease (AD) progression while sensitively capturing how biomarkers change in relation to one another over the course of the disease.
  • methods: Defines a temporal measure that estimates the magnitude and velocity of each biomarker's change over time, converts this trend into a vector, and compares the variability between biomarkers in a unified vector space.
  • results: Experiments show the method predicts disease progression more effectively than direct ROI-based learning; it also enables longitudinal stability selection to identify the changing relationships between biomarkers, and shows that synergistically deteriorating biomarkers between cortical volumes or surface areas significantly affect cognitive prediction.
    Abstract Identifying and utilising various biomarkers for tracking Alzheimer's disease (AD) progression have received many recent attentions and enable helping clinicians make the prompt decisions. Traditional progression models focus on extracting morphological biomarkers in regions of interest (ROIs) from MRI/PET images, such as regional average cortical thickness and regional volume. They are effective but ignore the relationships between brain ROIs over time, which would lead to synergistic deterioration. For exploring the synergistic deteriorating relationship between these biomarkers, in this paper, we propose a novel spatio-temporal similarity measure based multi-task learning approach for effectively predicting AD progression and sensitively capturing the critical relationships between biomarkers. Specifically, we firstly define a temporal measure for estimating the magnitude and velocity of biomarker change over time, which indicate a changing trend(temporal). Converting this trend into the vector, we then compare this variability between biomarkers in a unified vector space(spatial). The experimental results show that compared with directly ROI based learning, our proposed method is more effective in predicting disease progression. Our method also enables performing longitudinal stability selection to identify the changing relationships between biomarkers, which play a key role in disease progression. We prove that the synergistic deteriorating biomarkers between cortical volumes or surface areas have a significant effect on the cognitive prediction.
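
An illustrative reduction of the proposed spatio-temporal measure (a sketch; the paper's precise definition may differ): summarize each biomarker's trajectory as a change-magnitude/velocity vector, then compare biomarkers by their similarity in that shared vector space.

```python
import numpy as np

def trend_vector(series: np.ndarray, times: np.ndarray) -> np.ndarray:
    """Summarize one biomarker's trajectory as (total change, mean velocity).
    series: values at each visit; times: visit times in years."""
    slope = np.polyfit(times, series, 1)[0]            # least-squares velocity
    return np.array([series[-1] - series[0], slope])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

times = np.array([0.0, 0.5, 1.0, 2.0])
hippocampus = np.array([3.1, 3.0, 2.8, 2.5])   # hypothetical regional volumes
entorhinal  = np.array([2.0, 1.9, 1.8, 1.6])
# High similarity would flag the pair as deteriorating synergistically.
print(cosine(trend_vector(hippocampus, times), trend_vector(entorhinal, times)))
```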

Leveraging point annotations in segmentation learning with boundary loss

  • paper_url: http://arxiv.org/abs/2311.03537
  • repo_url: None
  • paper_authors: Eva Breznik, Hoel Kervadec, Filip Malmberg, Joel Kullberg, Håkan Ahlström, Marleen de Bruijne, Robin Strand
  • for: Investigates the combination of intensity-based distance maps with the boundary loss for point-supervised semantic segmentation.
  • methods: The boundary loss penalizes false positives more strongly the farther from the object they occur; intensity-aware distances are used to relax this, tolerating a certain amount of false positives without a significant increase in the training loss.
  • results: Experiments on two multi-class datasets, ACDC and POEM, are encouraging: the supervision strategy outperforms the CRF-loss-based approach on ACDC and performs on par with it on POEM.
    Abstract This paper investigates the combination of intensity-based distance maps with boundary loss for point-supervised semantic segmentation. By design the boundary loss imposes a stronger penalty on the false positives the farther away from the object they occur. Hence it is intuitively inappropriate for weak supervision, where the ground truth label may be much smaller than the actual object and a certain amount of false positives (w.r.t. the weak ground truth) is actually desirable. Using intensity-aware distances instead may alleviate this drawback, allowing for a certain amount of false positives without a significant increase to the training loss. The motivation for applying the boundary loss directly under weak supervision lies in its great success for fully supervised segmentation tasks, but also in not requiring extra priors or outside information that is usually required -- in some form -- with existing weakly supervised methods in the literature. This formulation also remains potentially more attractive than existing CRF-based regularizers, due to its simplicity and computational efficiency. We perform experiments on two multi-class datasets; ACDC (heart segmentation) and POEM (whole-body abdominal organ segmentation). Preliminary results are encouraging and show that this supervision strategy has great potential. On ACDC it outperforms the CRF-loss based approach, and on POEM data it performs on par with it. The code for all our experiments is openly available.
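
For reference, a minimal sketch of the boundary loss this work builds on (the standard formulation of Kervadec et al., using a Euclidean signed distance map; the paper's intensity-aware variant would swap in an intensity-weighted distance, which is not implemented here):

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def signed_dist(gt: np.ndarray) -> np.ndarray:
    """Signed Euclidean distance to the boundary of a binary mask
    (negative inside the object, positive outside)."""
    return distance_transform_edt(1 - gt) - distance_transform_edt(gt)

def boundary_loss(fg_probs: torch.Tensor, dist_map: torch.Tensor) -> torch.Tensor:
    """Mean of predicted foreground probabilities weighted by the signed
    distance map: false positives far from the object cost the most."""
    return (fg_probs * dist_map).mean()

gt = np.zeros((64, 64)); gt[20:40, 20:40] = 1      # toy ground-truth mask
dmap = torch.from_numpy(signed_dist(gt)).float()
loss = boundary_loss(torch.rand(64, 64), dmap)     # combine with a regional loss in practice
```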

High-resolution power equipment recognition based on improved self-attention

  • paper_url: http://arxiv.org/abs/2311.03518
  • repo_url: None
  • paper_authors: Siyi Zhang, Cheng Liu, Xiang Li, Xin Zhai, Zhen Wei, Sizhe Li, Xun Ma
  • for: Improves recognition accuracy for power equipment (transformer) imagery, where parameter-count limits prevent existing models from using high-resolution images directly.
  • methods: Proposes a novel improvement on deep self-attention networks comprising a foundational network, a region proposal network, a target-area extraction and segmentation module, and a final prediction network; part localization and recognition are decoupled, using low-resolution images for localization and high-resolution images for recognition.
  • results: Comparative experiments show the method substantially outperforms two prevalent target recognition models on power equipment imagery, offering a new perspective on automating electrical equipment inspection.
    Abstract The current trend of automating inspections at substations has sparked a surge in interest in the field of transformer image recognition. However, due to restrictions in the number of parameters in existing models, high-resolution images can't be directly applied, leaving significant room for enhancing recognition accuracy. Addressing this challenge, the paper introduces a novel improvement on deep self-attention networks tailored for this issue. The proposed model comprises four key components: a foundational network, a region proposal network, a module for extracting and segmenting target areas, and a final prediction network. The innovative approach of this paper differentiates itself by decoupling the processes of part localization and recognition, initially using low-resolution images for localization followed by high-resolution images for recognition. Moreover, the deep self-attention network's prediction mechanism uniquely incorporates the semantic context of images, resulting in substantially improved recognition performance. Comparative experiments validate that this method outperforms the two other prevalent target recognition models, offering a groundbreaking perspective for automating electrical equipment inspections.
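
A schematic of the decoupled localize-then-recognize pipeline (a sketch; `localize` and `classify` are hypothetical stand-ins for the paper's region proposal and final prediction networks): localization runs on a cheap low-resolution copy, recognition on high-resolution crops.

```python
import numpy as np

def two_stage_recognize(img_hr, localize, classify, scale=4):
    """Localize on a low-resolution copy, then classify high-resolution crops.
    `localize` returns boxes (y0, x0, y1, x1) in low-res coordinates."""
    img_lr = img_hr[::scale, ::scale]                 # naive downsample for stage one
    preds = []
    for y0, x0, y1, x1 in localize(img_lr):
        crop = img_hr[y0 * scale:y1 * scale, x0 * scale:x1 * scale]
        preds.append(classify(crop))                  # stage two sees full detail
    return preds

# Dummy usage with hypothetical stand-in networks.
preds = two_stage_recognize(
    np.zeros((2048, 2048)),
    localize=lambda lr: [(10, 10, 50, 50)],
    classify=lambda crop: crop.shape)
print(preds)    # [(160, 160)]
```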

SoundCam: A Dataset for Finding Humans Using Room Acoustics

  • paper_url: http://arxiv.org/abs/2311.03517
  • repo_url: None
  • paper_authors: Mason Wang, Samuel Clarke, Jui-Hsien Wang, Ruohan Gao, Jiajun Wu
  • for: Provides a large-scale dataset of real-world room acoustics for studying how rooms and the people in them shape sound, and for downstream applications.
  • methods: Collects 5,000 10-channel real-world measurements of room impulse responses and 2,000 10-channel music recordings in three different rooms: a controlled acoustic lab, an in-the-wild living room, and a conference room, with humans in different positions throughout each room.
  • results: Demonstrates that these measurements can be used for interesting tasks such as detecting and identifying humans and tracking their positions.
    Abstract A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions. A room's acoustic properties can be characterized by its impulse response (RIR) between a source and listener location, or roughly inferred from recordings of natural signals present in the room. Variations in the positions of objects in a room can effect measurable changes in the room's acoustic properties, as characterized by the RIR. Existing datasets of RIRs either do not systematically vary positions of objects in an environment, or they consist of only simulated RIRs. We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms publicly released to date. It includes 5,000 10-channel real-world measurements of room impulse responses and 2,000 10-channel recordings of music in three different rooms, including a controlled acoustic lab, an in-the-wild living room, and a conference room, with different humans in positions throughout each room. We show that these measurements can be used for interesting tasks, such as detecting and identifying humans, and tracking their positions.
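
What a measured RIR enables, in essence (the toy impulse response below is invented; SoundCam provides measured 10-channel RIRs): convolving a dry signal with a room impulse response simulates what a listener at the microphone position would hear.

```python
import numpy as np

def auralize(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate a listener at the RIR's microphone: dry signal convolved
    with the room impulse response."""
    return np.convolve(dry, rir)

sr = 16000
rir = np.zeros(sr // 4); rir[0] = 1.0; rir[2000] = 0.4   # toy RIR: direct path + one echo
dry = np.random.default_rng(0).standard_normal(sr)        # one second of "dry" audio
wet = auralize(dry, rir)                                   # what the room does to it
```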

Predicting Age from White Matter Diffusivity with Residual Learning

  • paper_url: http://arxiv.org/abs/2311.03500
  • repo_url: None
  • paper_authors: Chenyu Gao, Michael E. Kim, Ho Hin Lee, Qi Yang, Nazirah Mohd Khairi, Praitayini Kanakaraj, Nancy R. Newlin, Derek B. Archer, Angela L. Jefferson, Warren D. Taylor, Brian D. Boyd, Lori L. Beason-Held, Susan M. Resnick, The BIOCARD Study Team, Yuankai Huo, Katherine D. Van Schaik, Kurt G. Schilling, Daniel Moyer, Ivana Išgum, Bennett A. Landman
  • for: Develops white-matter-specific age estimation in order to capture deviations from normal white matter aging.
  • methods: Predicts age from DTI scalar images while deliberately disregarding macrostructural information, via two approaches: extracting only microstructural features from regions of interest, and 3D residual neural networks (ResNets) that learn features directly from images non-linearly registered and warped to a template to minimize macrostructural variation.
  • results: On unseen data, the first approach yields a mean absolute error (MAE) of 6.11 years for cognitively normal participants and 6.62 years for cognitively impaired participants, while the ResNet approach achieves 4.69 and 4.96 years respectively, indicating that the ResNet model captures subtler, non-macrostructural features for age prediction.
    Abstract Imaging findings inconsistent with those expected at specific chronological age ranges may serve as early indicators of neurological disorders and increased mortality risk. Estimation of chronological age, and deviations from expected results, from structural MRI data has become an important task for developing biomarkers that are sensitive to such deviations. Complementary to structural analysis, diffusion tensor imaging (DTI) has proven effective in identifying age-related microstructural changes within the brain white matter, thereby presenting itself as a promising additional modality for brain age prediction. Although early studies have sought to harness DTI's advantages for age estimation, there is no evidence that the success of this prediction is owed to the unique microstructural and diffusivity features that DTI provides, rather than the macrostructural features that are also available in DTI data. Therefore, we seek to develop white-matter-specific age estimation to capture deviations from normal white matter aging. Specifically, we deliberately disregard the macrostructural information when predicting age from DTI scalar images, using two distinct methods. The first method relies on extracting only microstructural features from regions of interest. The second applies 3D residual neural networks (ResNets) to learn features directly from the images, which are non-linearly registered and warped to a template to minimize macrostructural variations. When tested on unseen data, the first method yields mean absolute error (MAE) of 6.11 years for cognitively normal participants and MAE of 6.62 years for cognitively impaired participants, while the second method achieves MAE of 4.69 years for cognitively normal participants and MAE of 4.96 years for cognitively impaired participants. We find that the ResNet model captures subtler, non-macrostructural features for brain age prediction.
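
A toy stand-in for the second approach (a tiny 3D CNN in place of the paper's 3D ResNet; the shapes and single input channel are assumptions), showing the regression setup and the MAE criterion the results are reported in:

```python
import torch
import torch.nn as nn

class TinyAgeRegressor(nn.Module):
    """Toy 3D-CNN age regressor; a stand-in for the paper's 3D ResNet."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, x):            # x: [B, 1, D, H, W] template-registered DTI scalar maps
        return self.net(x).squeeze(-1)

model = TinyAgeRegressor()
pred = model(torch.randn(2, 1, 32, 32, 32))
loss = (pred - torch.tensor([70.0, 55.0])).abs().mean()   # MAE, the reported metric
```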

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

  • paper_url: http://arxiv.org/abs/2311.03354
  • repo_url: https://github.com/UMass-Foundation-Model/CoVLM
  • paper_authors: Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
  • for: Improves the compositional abilities of large vision-language models (VLMs), enabling them to compose visual entities and the relationships among them.
  • methods: Introduces novel communication tokens for dynamic communication between the language model and the visual detection system: after generating a visual entity or relation, the LLM emits a communication token that prompts the detection network to propose relevant regions, whose features are fed back to condition further language generation.
  • results: CoVLM outperforms previous VLMs by a large margin on compositional reasoning benchmarks (about 20% on HICO-DET mAP, about 14% on Cola top-1 accuracy, and about 3% on ARO top-1 accuracy) while also achieving state-of-the-art performance on traditional vision-language tasks.
    Abstract A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions-of-interests (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performances on traditional vision-language tasks such as referring expression comprehension and visual question answering.

Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion

  • paper_url: http://arxiv.org/abs/2311.03352
  • repo_url: https://github.com/qqlu/entity
  • paper_authors: Hao Zhou, Tiancheng Shen, Xu Yang, Hai Huang, Xiangtai Li, Lu Qi, Ming-Hsuan Yang
  • for: Highlights a problem with the evaluation metrics used in open-vocabulary segmentation: evaluation still relies on closed-set metrics in zero-shot or cross-dataset pipelines, without considering the similarity between predicted and ground-truth categories.
  • methods: Surveys eleven similarity measurements between categorical words, based on WordNet linguistics statistics, text embeddings, and language models, through comprehensive quantitative analysis and a user study; building on these measurements, designs novel evaluation metrics, Open mIoU, Open AP, and Open PQ, tailored to three open-vocabulary segmentation tasks.
  • results: Benchmarks twelve open-vocabulary methods across the three segmentation tasks and shows that, despite the inherent subjectivity of similarity distances, the proposed metrics evaluate the open ability of existing methods well; the evaluation code is released on GitHub.
    Abstract In this paper, we highlight a problem of evaluation metrics adopted in the open-vocabulary segmentation. That is, the evaluation process still heavily relies on closed-set metrics on zero-shot or cross-dataset pipelines without considering the similarity between predicted and ground truth categories. To tackle this issue, we first survey eleven similarity measurements between two categorical words using WordNet linguistics statistics, text embedding, and language models by comprehensive quantitative analysis and user study. Built upon those explored measurements, we designed novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks. We benchmarked the proposed evaluation metrics on 12 open-vocabulary methods of three segmentation tasks. Even though the relative subjectivity of similarity distance, we demonstrate that our metrics can still well evaluate the open ability of the existing open-vocabulary segmentation methods. We hope that our work can bring with the community new thinking about how to evaluate the open ability of models. The evaluation code is released in github.
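
One way to soften IoU with a class-similarity matrix, illustrating the idea behind the proposed metrics (a sketch; the paper's exact Open mIoU definition may differ): predictions earn partial credit in proportion to their semantic similarity to the ground-truth class, and the construction reduces to standard IoU when the similarity matrix is the identity.

```python
import numpy as np

def open_iou(conf: np.ndarray, sim: np.ndarray) -> np.ndarray:
    """conf[i, j]: pixel count with ground-truth class i predicted as j.
    sim[i, j] in [0, 1]: semantic similarity between classes i and j
    (e.g., WordNet- or text-embedding-based). Returns per-class IoU."""
    tp = (conf * sim).sum(axis=1)              # similarity-weighted true positives
    fn = conf.sum(axis=1) - tp
    fp = (conf * (1.0 - sim)).sum(axis=0)      # dissimilar predictions count as FP
    return tp / (tp + fn + fp + 1e-12)

conf = np.array([[90, 10], [5, 95]], dtype=float)
print(open_iou(conf, np.eye(2)))                            # identity sim -> ordinary IoU
print(open_iou(conf, np.array([[1.0, 0.8], [0.8, 1.0]])))   # similar confusions forgiven
```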

Long-Term Invariant Local Features via Implicit Cross-Domain Correspondences

  • paper_url: http://arxiv.org/abs/2311.03345
  • repo_url: None
  • paper_authors: Zador Pataki, Mohammad Altillawi, Menelaos Kanakis, Rémi Pautrat, Fengyi Shen, Ziyuan Liu, Luc Van Gool, Marc Pollefeys
  • for: Investigates the impact of long-term visual domain variations on visual localization and proposes a data-driven method to improve the cross-domain reliability of modern feature extraction networks.
  • methods: Proposes Implicit Cross-Domain Correspondences (iCDC), which represents the same environment with multiple Neural Radiance Fields, each fitting the scene under an individual visual domain, and uses the underlying 3D representations to generate accurate correspondences across different long-term visual conditions.
  • results: Networks trained with iCDC significantly reduce the cross-domain performance gap and consistently outperform existing methods on popular long-term localization benchmarks.
    Abstract Modern learning-based visual feature extraction networks perform well in intra-domain localization, however, their performance significantly declines when image pairs are captured across long-term visual domain variations, such as different seasonal and daytime variations. In this paper, our first contribution is a benchmark to investigate the performance impact of long-term variations on visual localization. We conduct a thorough analysis of the performance of current state-of-the-art feature extraction networks under various domain changes and find a significant performance gap between intra- and cross-domain localization. We investigate different methods to close this gap by improving the supervision of modern feature extractor networks. We propose a novel data-centric method, Implicit Cross-Domain Correspondences (iCDC). iCDC represents the same environment with multiple Neural Radiance Fields, each fitting the scene under individual visual domains. It utilizes the underlying 3D representations to generate accurate correspondences across different long-term visual conditions. Our proposed method enhances cross-domain localization performance, significantly reducing the performance gap. When evaluated on popular long-term localization benchmarks, our trained networks consistently outperform existing methods. This work serves as a substantial stride toward more robust visual localization pipelines for long-term deployments, and opens up research avenues in the development of long-term invariant descriptors.

Cross-Image Attention for Zero-Shot Appearance Transfer

  • paper_url: http://arxiv.org/abs/2311.03335
  • repo_url: https://github.com/garibida/cross-image-attention
  • paper_authors: Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or
  • for: The paper aims to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape.
  • methods: The authors build upon the self-attention layers of text-to-image generative models and introduce a cross-image attention mechanism to establish semantic correspondences across images; they also use three mechanisms to manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
  • results: The approach is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
    Abstract Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
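
The core operation described above, in isolation (random tensors stand in for real diffusion features): queries from the structure image attend over keys and values from the appearance image.

```python
import torch
import torch.nn.functional as F

def cross_image_attention(q_struct: torch.Tensor, k_app: torch.Tensor, v_app: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d)) V, with Q from the structure image and
    K, V from the appearance image. All inputs: [N, d]."""
    d = q_struct.shape[-1]
    attn = F.softmax(q_struct @ k_app.T / d ** 0.5, dim=-1)
    return attn @ v_app

q = torch.randn(1024, 64)   # structure-image tokens
k = torch.randn(1024, 64)   # appearance-image tokens
v = torch.randn(1024, 64)
out = cross_image_attention(q, k, v)   # appearance features arranged by structure
```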

TSP-Transformer: Task-Specific Prompts Boosted Transformer for Holistic Scene Understanding

  • paper_url: http://arxiv.org/abs/2311.03427
  • repo_url: https://github.com/tb2-sy/tsp-transformer
  • paper_authors: Shuo Wang, Jing Li, Zibo Zhao, Dongze Lian, Binbin Huang, Xiaomei Wang, Zhengxin Li, Shenghua Gao
  • for: Addresses holistic scene understanding with a Task-Specific Prompts Transformer (TSP-Transformer) designed to learn effective representations.
  • methods: Uses a vanilla transformer in the early stage and a task-specific prompts transformer encoder in the lateral stage; the task-specific prompts act as induced priors, endowing the model with task-specific capacity while the shared layers learn generic information.
  • results: Experiments on NYUD-v2 and PASCAL-Context show state-of-the-art performance, validating the method's effectiveness for holistic scene understanding; code is available at https://github.com/tb2-sy/TSP-Transformer.
    Abstract Holistic scene understanding includes semantic segmentation, surface normal estimation, object boundary detection, depth estimation, etc. The key aspect of this problem is to learn representation effectively, as each subtask builds upon not only correlated but also distinct attributes. Inspired by visual-prompt tuning, we propose a Task-Specific Prompts Transformer, dubbed TSP-Transformer, for holistic scene understanding. It features a vanilla transformer in the early stage and tasks-specific prompts transformer encoder in the lateral stage, where tasks-specific prompts are augmented. By doing so, the transformer layer learns the generic information from the shared parts and is endowed with task-specific capacity. First, the tasks-specific prompts serve as induced priors for each task effectively. Moreover, the task-specific prompts can be seen as switches to favor task-specific representation learning for different tasks. Extensive experiments on NYUD-v2 and PASCAL-Context show that our method achieves state-of-the-art performance, validating the effectiveness of our method for holistic scene understanding. We also provide our code in the following link https://github.com/tb2-sy/TSP-Transformer.
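
A minimal sketch of the task-specific-prompt idea (the dimensions and layer choices are our assumptions, not the paper's configuration): learnable per-task prompt tokens are prepended to shared image tokens before a task-specific encoder layer.

```python
import torch
import torch.nn as nn

class TaskPromptEncoder(nn.Module):
    """Learnable per-task prompts concatenated to shared image tokens."""
    def __init__(self, dim=256, num_prompts=8, num_tasks=4, num_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_tasks, num_prompts, dim))
        self.layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, task: int) -> torch.Tensor:
        b = tokens.shape[0]
        p = self.prompts[task].unsqueeze(0).expand(b, -1, -1)
        x = self.layer(torch.cat([p, tokens], dim=1))   # prompts steer the attention
        return x[:, p.shape[1]:]                        # drop prompt positions, keep image tokens

enc = TaskPromptEncoder()
out = enc(torch.randn(2, 196, 256), task=0)   # [2, 196, 256], task-conditioned features
```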

A Robust Bi-Directional Algorithm For People Count In Crowded Areas

  • paper_url: http://arxiv.org/abs/2311.03323
  • repo_url: None
  • paper_authors: Satyanarayana Penke, Gopikrishna Pavuluri, Soukhya Kunda, Satvik M, CharanKumar Y
  • for: Aims to provide an accurate people-counting system to support crowd management and rescue operations in busy places.
  • methods: Uses image processing with blob assessment to yield the count of people along a path together with their direction of traversal.
  • results: Experiments indicate the algorithm counts people accurately and can report inflow and outflow in real-time scenarios.
    Abstract People counting system in crowded places has become a very useful practical application that can be accomplished in various ways which include many traditional methods using sensors. Examining the case of real time scenarios, the algorithm espoused should be steadfast and accurate. People counting algorithm presented in this paper, is centered on blob assessment, devoted to yield the count of the people through a path along with the direction of traversal. The system depicted is often ensconced at the entrance of a building so that the unmitigated frequency of visitors can be recorded. The core premise of this work is to extricate count of people inflow and outflow pertaining to a particular area. The tot-up achieved can be exploited for purpose of statistics in the circumstances of any calamity occurrence in that zone. Relying upon the count totaled, the population in that vicinity can be assimilated in order to take on relevant measures to rescue the people.
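
The direction-aware counting step reduces to a line-crossing test on each tracked blob's centroid (a sketch of that bookkeeping only, not the paper's full blob-assessment pipeline):

```python
def update_counts(prev_y: float, curr_y: float, line_y: float, counts: dict) -> None:
    """Increment in/out counts when a tracked blob's centroid crosses the line."""
    if prev_y < line_y <= curr_y:
        counts["in"] += 1      # crossed downward: entering
    elif prev_y >= line_y > curr_y:
        counts["out"] += 1     # crossed upward: leaving

counts = {"in": 0, "out": 0}
track = [100.0, 140.0, 180.0, 220.0]   # centroid y per frame, crossing line_y = 200
for a, b in zip(track, track[1:]):
    update_counts(a, b, 200.0, counts)
print(counts)   # {'in': 1, 'out': 0}
```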

FATE: Feature-Agnostic Transformer-based Encoder for learning generalized embedding spaces in flow cytometry data

  • paper_url: http://arxiv.org/abs/2311.03314
  • repo_url: https://github.com/lisaweijler/fate
  • paper_authors: Lisa Weijler, Florian Kowarsch, Michael Reiter, Pedro Hermosilla, Margarita Maurer-Granofszky, Michael Dworzak
  • for: Addresses the limitation that models assume fixed quantities and arrangements of input features, which becomes a problem when the attributes captured during data acquisition vary across samples.
  • methods: Proposes a set-transformer architecture augmented with feature-encoder layers that processes data from heterogeneous feature spaces directly, learning a shared latent feature space without constraining the input to the intersection, or expanding it to the union, of the potential feature sets.
  • results: Demonstrates the model's advantages for automatic cancer cell detection in acute myeloid leukemia from flow cytometry data, where the features measured often vary between samples and data scarcity arises from the low prevalence of the disease.
    Abstract While model architectures and training strategies have become more generic and flexible with respect to different data modalities over the past years, a persistent limitation lies in the assumption of fixed quantities and arrangements of input features. This limitation becomes particularly relevant in scenarios where the attributes captured during data acquisition vary across different samples. In this work, we aim at effectively leveraging data with varying features, without the need to constrain the input space to the intersection of potential feature sets or to expand it to their union. We propose a novel architecture that can directly process data without the necessity of aligned feature modalities by learning a general embedding space that captures the relationship between features across data samples with varying sets of features. This is achieved via a set-transformer architecture augmented by feature-encoder layers, thereby enabling the learning of a shared latent feature space from data originating from heterogeneous feature spaces. The advantages of the model are demonstrated for automatic cancer cell detection in acute myeloid leukemia in flow cytometry data, where the features measured during acquisition often vary between samples. Our proposed architecture's capacity to operate seamlessly across incongruent feature spaces is particularly relevant in this context, where data scarcity arises from the low prevalence of the disease. The code is available for research purposes at https://github.com/lisaweijler/FATE.
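
A minimal sketch of feature-agnostic encoding in the spirit of the paper (not FATE's exact architecture; the names and dimensions are ours): each (marker id, value) pair is embedded independently, so samples measured with different marker panels land in one shared latent space.

```python
import torch
import torch.nn as nn

class FeatureAgnosticEncoder(nn.Module):
    """Embed each (marker id, value) pair, then mix the resulting set of
    tokens; samples with different marker panels share one latent space."""
    def __init__(self, num_known_markers=50, dim=64, num_heads=4):
        super().__init__()
        self.marker_emb = nn.Embedding(num_known_markers, dim)
        self.value_proj = nn.Linear(1, dim)
        self.mixer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, marker_ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # marker_ids: [B, F] ints; values: [B, F] floats; F may vary per batch.
        x = self.marker_emb(marker_ids) + self.value_proj(values.unsqueeze(-1))
        return self.mixer(x).mean(dim=1)      # permutation-invariant summary

enc = FeatureAgnosticEncoder()
z = enc(torch.randint(0, 50, (8, 12)), torch.randn(8, 12))   # only 12 of 50 markers measured
```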

A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.03312
  • repo_url: https://github.com/QitaoZhao/ContextAware-PoseFormer
  • paper_authors: Qitao Zhao, Ce Zheng, Mengyuan Liu, Chen Chen
  • for: Improves 3D human pose estimation accuracy without relying on large numbers of video frames.
  • methods: Leverages the intermediate visual representations readily produced by off-the-shelf (pre-trained) 2D pose detectors, with no finetuning on the 3D task.
  • results: The proposed Context-Aware PoseFormer significantly outperforms its context-agnostic counterpart, PoseFormer, and other state-of-the-art methods that use up to hundreds of video frames, in both speed and precision.
    Abstract The dominant paradigm in 3D human pose estimation that lifts a 2D pose sequence to 3D heavily relies on long-term temporal clues (i.e., using a daunting number of video frames) for improved accuracy, which incurs performance saturation, intractable computation and the non-causal problem. This can be attributed to their inherent inability to perceive spatial context as plain 2D joint coordinates carry no visual cues. To address this issue, we propose a straightforward yet powerful solution: leveraging the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors -- no finetuning on the 3D task is even needed. The key observation is that, while the pose detector learns to localize 2D joints, such representations (e.g., feature maps) implicitly encode the joint-centric spatial context thanks to the regional operations in backbone networks. We design a simple baseline named Context-Aware PoseFormer to showcase its effectiveness. Without access to any temporal information, the proposed method significantly outperforms its context-agnostic counterpart, PoseFormer, and other state-of-the-art methods using up to hundreds of video frames regarding both speed and precision. Project page: https://qitaozhao.github.io/ContextAware-PoseFormer
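
A sketch of the key move, reading joint-centric context out of a 2D detector's intermediate representation (how the paper consumes these features may differ; the shapes are illustrative): sample the detector's feature map at the predicted joint locations.

```python
import torch
import torch.nn.functional as F

def joint_context_features(feat: torch.Tensor, joints_2d: torch.Tensor) -> torch.Tensor:
    """Sample a detector feature map at 2D joint locations.
    feat: [B, C, H, W]; joints_2d: [B, J, 2] in normalized [-1, 1] coords."""
    grid = joints_2d.unsqueeze(2)                              # [B, J, 1, 2]
    sampled = F.grid_sample(feat, grid, align_corners=False)   # [B, C, J, 1]
    return sampled.squeeze(-1).transpose(1, 2)                 # [B, J, C]

feat = torch.randn(1, 256, 64, 48)       # e.g., an HRNet-style feature map
joints = torch.rand(1, 17, 2) * 2 - 1    # 17 COCO joints in [-1, 1]
print(joint_context_features(feat, joints).shape)   # torch.Size([1, 17, 256])
```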

Machine Learning-Based Tea Leaf Disease Detection: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2311.03240
  • repo_url: None
  • paper_authors: Faruk Ahmed, Md. Taimur Ahad, Yousuf Rayhan Emon
  • for: Reviews machine learning techniques for tea leaf disease detection, aiming to improve productivity and quality in the tea industry.
  • methods: Systematically reviews image-based approaches, covering Vision Transformer variants such as the Inception Convolutional Vision Transformer (ICVT), GreenViT, PlantXViT, PlantViT, MSCVT, Transfer Learning Model & Vision Transformer (TLMViT), IterationViT, and IEM-ViT, as well as models including the Dense Convolutional Network (DenseNet), Residual Neural Network (ResNet)-50V2, YOLOv5, YOLOv7, Convolutional Neural Network (CNN), Deep CNN, Non-dominated Sorting Genetic Algorithm (NSGA-II), MobileNetv2, and the Lesion-Aware Visual Transformer.
  • results: The reviewed models have been tested on various datasets, demonstrating their real-world applicability; the review also identifies directions for future research.
    Abstract Tea leaf diseases are a major challenge to agricultural productivity, with far-reaching implications for yield and quality in the tea industry. The rise of machine learning has enabled the development of innovative approaches to combat these diseases. Early detection and diagnosis are crucial for effective crop management. For predicting tea leaf disease, several automated systems have already been developed using different image processing techniques. This paper delivers a systematic review of the literature on machine learning methodologies applied to diagnose tea leaf disease via image classification. It thoroughly evaluates the strengths and constraints of various Vision Transformer models, including Inception Convolutional Vision Transformer (ICVT), GreenViT, PlantXViT, PlantViT, MSCVT, Transfer Learning Model & Vision Transformer (TLMViT), IterationViT, IEM-ViT. Moreover, this paper also reviews models like Dense Convolutional Network (DenseNet), Residual Neural Network (ResNet)-50V2, YOLOv5, YOLOv7, Convolutional Neural Network (CNN), Deep CNN, Non-dominated Sorting Genetic Algorithm (NSGA-II), MobileNetv2, and Lesion-Aware Visual Transformer. These machine-learning models have been tested on various datasets, demonstrating their real-world applicability. This review study not only highlights current progress in the field but also provides valuable insights for future research directions in the machine learning-based detection and classification of tea leaf diseases.

  • paper_url: http://arxiv.org/abs/2311.03233
  • repo_url: None
  • paper_authors: Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann
  • for: Proposes a method that dynamically adjusts a model's shape during training in order to optimize performance for a given compute budget.
  • methods: Uses neural scaling laws to guide how the model's shape is adapted, so that training traverses optimally between the underlying scaling laws and compute is allocated where it helps most.
  • results: Experiments show that compute-optimal adaptive Vision Transformers, which adapt shape parameters such as patch size and width during training, beat their static counterparts at a significantly reduced compute cost.
    Abstract In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: Investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a "compute-optimal" model, i.e. a model that allocates a given level of compute during training optimally to maximise performance. In this work, we extend the concept of optimality by allowing for an "adaptive" model, i.e. a model that can change its shape during the course of training. By allowing the shape to adapt, we can optimally traverse between the underlying scaling laws, leading to a significant reduction in the required compute to reach a given target performance. We focus on vision tasks and the family of Vision Transformers, where the patch size as well as the width naturally serve as adaptive shape parameters. We demonstrate that, guided by scaling laws, we can design compute-optimal adaptive models that beat their "static" counterparts.
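
A deliberately simple stand-in for an adaptive training shape (the paper derives the traversal from scaling laws; these fixed thresholds are only illustrative): the patch size starts coarse, hence cheap, and is refined as training progresses.

```python
def patch_size_schedule(step: int, total_steps: int, sizes=(32, 16, 8)) -> int:
    """Toy adaptive-shape schedule: coarse (cheap) patches early,
    fine (expensive) patches late. Thresholds are arbitrary."""
    frac = step / total_steps
    if frac < 0.5:
        return sizes[0]
    if frac < 0.8:
        return sizes[1]
    return sizes[2]

print([patch_size_schedule(s, 100) for s in (0, 60, 90)])   # [32, 16, 8]
```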

Segmentation of Drone Collision Hazards in Airborne RADAR Point Clouds Using PointNet

  • paper_url: http://arxiv.org/abs/2311.03221
  • repo_url: None
  • paper_authors: Hector Arroyo, Paul Kier, Dylan Angus, Santiago Matalonga, Svetlozar Georgiev, Mehdi Goli, Gerard Dooly, James Riordan
  • for: Aims to give UAVs the enhanced situational awareness needed for safe beyond visual line of sight (BVLOS) operations in shared airspace.
  • methods: Adapts and optimizes the PointNet architecture, integrating aerial domain insights, for end-to-end semantic segmentation of airborne radar point clouds that identifies multiple collision hazards simultaneously.
  • results: The framework distinguishes five classes: mobile drones (DJI M300 and DJI Mini), an airplane (Ikarus C42), and static returns (ground and infrastructure), achieving a robust 94% accuracy in an aerial setting.
    Abstract The integration of unmanned aerial vehicles (UAVs) into shared airspace for beyond visual line of sight (BVLOS) operations presents significant challenges but holds transformative potential for sectors like transportation, construction, energy and defense. A critical prerequisite for this integration is equipping UAVs with enhanced situational awareness to ensure safe operations. Current approaches mainly target single object detection or classification, or simpler sensing outputs that offer limited perceptual understanding and lack the rapid end-to-end processing needed to convert sensor data into safety-critical insights. In contrast, our study leverages radar technology for novel end-to-end semantic segmentation of aerial point clouds to simultaneously identify multiple collision hazards. By adapting and optimizing the PointNet architecture and integrating aerial domain insights, our framework distinguishes five distinct classes: mobile drones (DJI M300 and DJI Mini) and airplanes (Ikarus C42), and static returns (ground and infrastructure) which results in enhanced situational awareness for UAVs. To our knowledge, this is the first approach addressing simultaneous identification of multiple collision threats in an aerial setting, achieving a robust 94% accuracy. This work highlights the potential of radar technology to advance situational awareness in UAVs, facilitating safe and efficient BVLOS operations.
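
A minimal PointNet-style per-point classifier matching the five-class setup described above (a sketch; the input features and layer widths are assumptions): each point receives a local embedding plus a max-pooled global context vector before classification.

```python
import torch
import torch.nn as nn

class PointNetSeg(nn.Module):
    """Minimal PointNet-style per-point classifier (5 classes, as in the
    paper's setup): point features plus a global max-pooled context vector."""
    def __init__(self, in_dim=4, num_classes=5):   # e.g., x, y, z, radar intensity
        super().__init__()
        self.local = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                   nn.Linear(64, 128), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(128 + 128, 128), nn.ReLU(),
                                  nn.Linear(128, num_classes))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:    # pts: [B, N, in_dim]
        f = self.local(pts)                                   # per-point features [B, N, 128]
        g = f.max(dim=1, keepdim=True).values.expand_as(f)    # global context, broadcast
        return self.head(torch.cat([f, g], dim=-1))           # [B, N, num_classes]

model = PointNetSeg()
logits = model(torch.randn(2, 1024, 4))   # per-point class logits
```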

Leveraging Transformers to Improve Breast Cancer Classification and Risk Assessment with Multi-modal and Longitudinal Data

  • paper_url: http://arxiv.org/abs/2311.03217
  • repo_url: None
  • paper_authors: Yiqiu Shen, Jungkyu Park, Frank Yeung, Eliana Goldberg, Laura Heacock, Farah Shamout, Krzysztof J. Geras
  • for: Aims to improve breast cancer screening by integrating information across imaging modalities and time, both to identify patients who currently have cancer and to estimate future cancer risk for patients who are currently cancer-free.
  • methods: Proposes a Multi-modal Transformer (MMT) neural network that uses mammography and ultrasound synergistically, aggregating multi-modal data through self-attention and tracking temporal tissue changes by comparing current exams to prior imaging.
  • results: Trained on 1.3 million exams, MMT achieves an AUROC of 0.943 for detecting existing cancers, surpassing strong uni-modal baselines, and an AUROC of 0.826 for 5-year risk prediction, outperforming prior mammography-based risk models.
    Abstract Breast cancer screening, primarily conducted through mammography, is often supplemented with ultrasound for women with dense breast tissue. However, existing deep learning models analyze each modality independently, missing opportunities to integrate information across imaging modalities and time. In this study, we present Multi-modal Transformer (MMT), a neural network that utilizes mammography and ultrasound synergistically, to identify patients who currently have cancer and estimate the risk of future cancer for patients who are currently cancer-free. MMT aggregates multi-modal data through self-attention and tracks temporal tissue changes by comparing current exams to prior imaging. Trained on 1.3 million exams, MMT achieves an AUROC of 0.943 in detecting existing cancers, surpassing strong uni-modal baselines. For 5-year risk prediction, MMT attains an AUROC of 0.826, outperforming prior mammography-based risk models. Our research highlights the value of multi-modal and longitudinal imaging in cancer diagnosis and risk stratification.

PainSeeker: An Automated Method for Assessing Pain in Rats Through Facial Expressions

  • paper_url: http://arxiv.org/abs/2311.03205
  • repo_url: None
  • paper_authors: Liu Liu, Guang Li, Dingfan Deng, Jinhua Yu, Yuan Zong
  • for: investigate whether laboratory rats’ pain can be automatically assessed through their facial expressions.
  • methods: proposed a novel deep learning method called PainSeeker for automatically assessing pain in rats via facial expressions.
  • results: demonstrated the feasibility of assessing rats’ pain from their facial expressions and also verified the effectiveness of the proposed PainSeeker in addressing this emerging but intriguing problem.
    Abstract In this letter, we aim to investigate whether laboratory rats' pain can be automatically assessed through their facial expressions. To this end, we began by presenting a publicly available dataset called RatsPain, consisting of 1,138 facial images captured from six rats that underwent an orthodontic treatment operation. Each rat's facial images in RatsPain were carefully selected from videos recorded either before or after the operation and well labeled by eight annotators according to the Rat Grimace Scale (RGS). We then proposed a novel deep learning method called PainSeeker for automatically assessing pain in rats via facial expressions. PainSeeker aims to seek pain-related facial local regions that facilitate learning both pain discriminative and head pose robust features from facial expression images. To evaluate the PainSeeker, we conducted extensive experiments on the RatsPain dataset. The results demonstrate the feasibility of assessing rats' pain from their facial expressions and also verify the effectiveness of the proposed PainSeeker in addressing this emerging but intriguing problem. The RatsPain dataset can be freely obtained from https://github.com/xhzongyuan/RatsPain.
    摘要 在这封信中,我们旨在研究能否通过实验大鼠的面部表情自动评估其疼痛。为此,我们首先发布了一个公开数据集 RatsPain,包含来自六只接受正畸治疗手术的大鼠的 1,138 张面部图像。RatsPain 中每只大鼠的面部图像均精心选自手术前或手术后录制的视频,并由八名标注者依照大鼠痛苦表情量表(Rat Grimace Scale, RGS)进行了细致标注。随后,我们提出了一种名为 PainSeeker 的新型深度学习方法,用于通过面部表情自动评估大鼠疼痛。PainSeeker 旨在寻找与疼痛相关的面部局部区域,从而从面部表情图像中学习既能判别疼痛、又对头部姿态鲁棒的特征。为评估 PainSeeker,我们在 RatsPain 数据集上进行了大量实验。结果表明,通过面部表情评估大鼠疼痛是可行的,同时验证了所提出的 PainSeeker 在这一新兴而有趣的问题上的有效性。RatsPain 数据集可从 https://github.com/xhzongyuan/RatsPain 免费获取。

LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition

  • paper_url: http://arxiv.org/abs/2311.03198
  • repo_url: https://github.com/ZhouZijie77/LCPR
  • paper_authors: Zijie Zhou, Jingyi Xu, Guangming Xiong, Junyi Ma
  • for: 本研究旨在提高自动驾驶车辆在GPS无效环境中识别已经前期访问过的地点。
  • methods: 本研究使用多模态感知融合来超越个体感知器的不足之处。
  • results: 实验结果表明,我们的方法可以充分利用多视图相机和激光雷达数据来提高地点识别性能,同时对视角变化具有强鲁棒性。
    Abstract Place recognition is one of the most crucial modules for autonomous vehicles to identify places that were previously visited in GPS-invalid environments. Sensor fusion is considered an effective method to overcome the weaknesses of individual sensors. In recent years, multimodal place recognition fusing information from multiple sensors has gathered increasing attention. However, most existing multimodal place recognition methods only use limited field-of-view camera images, which leads to an imbalance between features from different modalities and limits the effectiveness of sensor fusion. In this paper, we present a novel neural network named LCPR for robust multimodal place recognition, which fuses LiDAR point clouds with multi-view RGB images to generate discriminative and yaw-rotation invariant representations of the environment. A multi-scale attention-based fusion module is proposed to fully exploit the panoramic views from different modalities of the environment and their correlations. We evaluate our method on the nuScenes dataset, and the experimental results show that our method can effectively utilize multi-view camera and LiDAR data to improve the place recognition performance while maintaining strong robustness to viewpoint changes. Our open-source code and pre-trained models are available at https://github.com/ZhouZijie77/LCPR .
    摘要 地点识别是自动驾驶车辆最重要的模块之一,用于在GPS失效环境中识别先前访问过的地点。传感器融合被认为是克服单一传感器缺陷的有效方法。近年来,融合多传感器信息的多模态地点识别受到越来越多的关注。然而,现有的多模态地点识别方法大多只使用有限视场的相机图像,导致不同模态特征之间失衡,限制了传感器融合的有效性。本文提出了一种名为 LCPR 的新型神经网络,用于实现鲁棒的多模态地点识别。LCPR 将激光雷达点云与多视角RGB图像融合,生成具有判别力且对偏航旋转不变的环境表示。我们提出了一种基于多尺度注意力的融合模块,以充分利用不同模态的环境全景视图及其相关性。我们在 nuScenes 数据集上对方法进行了评估,实验结果表明,该方法能够有效利用多视角相机和激光雷达数据提升地点识别性能,同时对视点变化保持强鲁棒性。我们的开源代码和预训练模型可在 https://github.com/ZhouZijie77/LCPR 获取。

Few-shot Learning using Data Augmentation and Time-Frequency Transformation for Time Series Classification

  • paper_url: http://arxiv.org/abs/2311.03194
  • repo_url: None
  • paper_authors: Hao Zhang, Zhendong Pang, Jiangpeng Wang, Teng Li
  • for: 这篇 paper 的目的是解决时间序列分类任务中的少量数据问题,提出了一种基于数据扩增的小样本学习框架。
  • methods: 该方法通过时频域变换和随机擦除生成合成图像,并开发了一种序列-频谱图神经网络模型(SSNN)。该模型由两个子网络组成:一个使用1D残差块从输入序列中提取特征,另一个使用2D残差块从频谱图表示中提取特征。
  • results: 在一个肌萎缩侧索硬化(ALS)数据集和一个风力机故障(WTF)数据集上与现有 DNN 模型进行了对比研究,结果表明,我们提出的方法可以提升小样本条件下时间序列分类的精度。
    Abstract Deep neural networks (DNNs) that tackle the time series classification (TSC) task have provided a promising framework in signal processing. In real-world applications, as a data-driven model, DNNs are suffered from insufficient data. Few-shot learning has been studied to deal with this limitation. In this paper, we propose a novel few-shot learning framework through data augmentation, which involves transformation through the time-frequency domain and the generation of synthetic images through random erasing. Additionally, we develop a sequence-spectrogram neural network (SSNN). This neural network model composes of two sub-networks: one utilizing 1D residual blocks to extract features from the input sequence while the other one employing 2D residual blocks to extract features from the spectrogram representation. In the experiments, comparison studies of different existing DNN models with/without data augmentation are conducted on an amyotrophic lateral sclerosis (ALS) dataset and a wind turbine fault (WTF) dataset. The experimental results manifest that our proposed method achieves 93.75% F1 score and 93.33% accuracy on the ALS datasets while 95.48% F1 score and 95.59% accuracy on the WTF datasets. Our methodology demonstrates its applicability of addressing the few-shot problems for time series classification.
    摘要 处理时间序列分类(TSC)任务的深度神经网络(DNN)为信号处理提供了有前景的框架。在实际应用中,作为数据驱动模型,DNN 常受数据不足的困扰,小样本学习因此被用于应对这一限制。在这篇论文中,我们提出了一种基于数据扩增的新型小样本学习框架,包括时频域变换以及通过随机擦除生成合成图像。此外,我们开发了一种序列-频谱图神经网络(SSNN)。该神经网络模型由两个子网络组成:一个使用1D残差块提取输入序列的特征,另一个使用2D残差块提取频谱图表示的特征。在实验中,我们在肌萎缩侧索硬化(ALS)数据集和风力机故障(WTF)数据集上对多种现有 DNN 模型进行了有/无数据扩增的对比研究。实验结果表明,我们提出的方法在 ALS 数据集上达到了 93.75% 的 F1 分数和 93.33% 的准确率,在 WTF 数据集上达到了 95.48% 的 F1 分数和 95.59% 的准确率。我们的方法证明了其在时间序列分类小样本问题上的适用性。
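
The augmentation pipeline — a time-frequency transform followed by random erasing on the resulting spectrogram "image" — can be sketched with SciPy; the window parameters and erase-rectangle sizes below are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.signal import spectrogram

def augment(series, fs=100.0, n_erase=2, rng=np.random.default_rng(0)):
    """Time-frequency transform + random erasing (illustrative sketch)."""
    f, t, S = spectrogram(series, fs=fs, nperseg=64, noverlap=32)
    S = np.log1p(S)                          # log-magnitude spectrogram "image"
    for _ in range(n_erase):                 # random erasing: blank out rectangles
        h = rng.integers(1, max(2, S.shape[0] // 4))
        w = rng.integers(1, max(2, S.shape[1] // 4))
        i = rng.integers(0, S.shape[0] - h)
        j = rng.integers(0, S.shape[1] - w)
        S[i:i + h, j:j + w] = 0.0            # zero-fill; other fill values also possible
    return S

img = augment(np.sin(2 * np.pi * 5 * np.arange(1000) / 100.0))
print(img.shape)  # (freq_bins, time_frames), ready to feed the 2D sub-network
```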

Efficient and Low-Footprint Object Classification using Spatial Contrast

  • paper_url: http://arxiv.org/abs/2311.03422
  • repo_url: None
  • paper_authors: Matthew Belding, Daniel C. Stumpp, Rajkumar Kubendran
  • for: 本研究探讨了一种基于事件的视觉感知器,使用本地化的空间对比(SC),并采用了两种阈值技术,相对阈值和绝对阈值。
  • methods: 本研究使用了虚拟模拟器来研究这种硬件感知器的可能性。此外,通过使用德国交通标志数据集(GTSRB)和知名的深度神经网络(DNN)进行交通标志分类,以评估空间对比的效果。
  • results: 研究发现,使用空间对比度可以有效捕捉图像中对分类重要的特征;与高精度 RGB 图像加 DNN 相比,二值化 MicronNet 可大幅减少输入数据量(至少12倍)和内存资源(17.5倍),而 macro F1 分数仅下降约2%。因此,空间对比度在功耗和资源受限的边缘计算环境中展现出巨大潜力。
    Abstract Event-based vision sensors traditionally compute temporal contrast that offers potential for low-power and low-latency sensing and computing. In this research, an alternative paradigm for event-based sensors using localized spatial contrast (SC) under two different thresholding techniques, relative and absolute, is investigated. Given the slow maturity of spatial contrast in comparison to temporal-based sensors, a theoretical simulated output of such a hardware sensor is explored. Furthermore, we evaluate traffic sign classification using the German Traffic Sign dataset (GTSRB) with well-known Deep Neural Networks (DNNs). This study shows that spatial contrast can effectively capture salient image features needed for classification using a Binarized DNN with significant reduction in input data usage (at least 12X) and memory resources (17.5X), compared to high precision RGB images and DNN, with only a small loss (~2%) in macro F1-score. Binarized MicronNet achieves an F1-score of 94.4% using spatial contrast, compared to only 56.3% when using RGB input images. Thus, SC offers great promise for deployment in power and resource constrained edge computing environments.
    摘要 基于事件的视觉传感器传统上计算时间对比度,这为低功耗、低延迟的感知与计算提供了潜力。在本研究中,我们探讨了一种基于局部空间对比度(SC)的事件传感器替代范式,并考察了两种阈值技术:相对阈值和绝对阈值。鉴于空间对比度相较于基于时间的传感器尚不成熟,我们首先从理论上模拟了这种硬件传感器的输出。此外,我们使用德国交通标志数据集(GTSRB)和知名的深度神经网络(DNN)评估交通标志分类。结果表明,空间对比度能够有效捕捉分类所需的显著图像特征:与高精度 RGB 图像和 DNN 相比,使用二值化 DNN 可将输入数据使用量减少至少12倍、内存资源减少17.5倍,而 macro F1 分数仅下降约2%。二值化 MicronNet 在使用空间对比度时取得 94.4% 的 F1 分数,而使用 RGB 输入图像时仅为 56.3%。因此,SC 在功耗和资源受限的边缘计算环境中具有广阔的部署前景。
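
For intuition, the sketch below computes a binary event map from localized spatial contrast between neighbouring pixels under either a relative or an absolute threshold. The neighbourhood choice and threshold values are illustrative assumptions, not the simulated hardware sensor's parameters.

```python
import numpy as np

def spatial_contrast_events(img, thr=0.15, relative=True, eps=1e-6):
    """Binary 'event' map from spatial contrast against right/bottom neighbours."""
    img = img.astype(np.float32)
    dx = img[:, 1:] - img[:, :-1]              # horizontal intensity difference
    dy = img[1:, :] - img[:-1, :]              # vertical intensity difference
    if relative:                               # relative: fraction of local intensity
        ex = np.abs(dx) > thr * (np.abs(img[:, :-1]) + eps)
        ey = np.abs(dy) > thr * (np.abs(img[:-1, :]) + eps)
    else:                                      # absolute: fixed difference threshold
        ex, ey = np.abs(dx) > thr, np.abs(dy) > thr
    events = np.zeros(img.shape, dtype=bool)
    events[:, :-1] |= ex
    events[:-1, :] |= ey
    return events

print(spatial_contrast_events(np.random.rand(32, 32)).mean())  # fraction of events
```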

Frequency Domain Decomposition Translation for Enhanced Medical Image Translation Using GANs

  • paper_url: http://arxiv.org/abs/2311.03175
  • repo_url: None
  • paper_authors: Zhuhui Wang, Jianwei Zuo, Xuliang Deng, Jiajia Luo
  • for: 这篇论文主要针对医学影像转换任务,尤其是运用GAN方法实现高品质的医学影像转换。
  • methods: 本研究提出了一种新的频域分解翻译方法(FDDT),它将原始影像分解为高频与低频分量,并将二者与原始影像相应频带分量的转换结果在空间域内对齐,以保留原始影像的身份信息,同时尽量减少影像风格信息的损失。
  • results: 在实验中,FDDT 与多个主流基线模型进行比较。结果显示,与基线相比,FDDT 最多可将 Fréchet Inception 距离降低 24.4%、结构相似度降低 4.4%、峰值信噪比降低 5.8%、均方误差降低 31%;与先前方法相比,对应指标最多分别降低 23.7%、1.8%、6.8% 和 31.6%。
    Abstract Medical Image-to-image translation is a key task in computer vision and generative artificial intelligence, and it is highly applicable to medical image analysis. GAN-based methods are the mainstream image translation methods, but they often ignore the variation and distribution of images in the frequency domain, or only take simple measures to align high-frequency information, which can lead to distortion and low quality of the generated images. To solve these problems, we propose a novel method called frequency domain decomposition translation (FDDT). This method decomposes the original image into a high-frequency component and a low-frequency component, with the high-frequency component containing the details and identity information, and the low-frequency component containing the style information. Next, the high-frequency and low-frequency components of the transformed image are aligned with the transformed results of the high-frequency and low-frequency components of the original image in the same frequency band in the spatial domain, thus preserving the identity information of the image while destroying as little stylistic information of the image as possible. We conduct extensive experiments on MRI images and natural images with FDDT and several mainstream baseline models, and we use four evaluation metrics to assess the quality of the generated images. Compared with the baseline models, optimally, FDDT can reduce Fr\'echet inception distance by up to 24.4%, structural similarity by up to 4.4%, peak signal-to-noise ratio by up to 5.8%, and mean squared error by up to 31%. Compared with the previous method, optimally, FDDT can reduce Fr\'echet inception distance by up to 23.7%, structural similarity by up to 1.8%, peak signal-to-noise ratio by up to 6.8%, and mean squared error by up to 31.6%.
    摘要 医学图像到图像翻译是计算机视觉和生成式人工智能领域的关键任务,在医学图像分析中具有广泛的应用前景。基于 GAN 的方法是主流的图像翻译方法,但它们往往忽略图像在频域中的变化和分布,或仅采取简单手段对齐高频信息,这可能导致生成图像失真、质量下降。为解决这些问题,我们提出了一种称为频域分解翻译(FDDT)的新方法。该方法将原始图像分解为高频分量和低频分量,其中高频分量包含细节和身份信息,低频分量包含风格信息。随后,将变换后图像的高频与低频分量,与原始图像相应频带分量的变换结果在空间域内对齐,从而在尽量少破坏风格信息的同时保留图像的身份信息。我们在 MRI 图像和自然图像上用 FDDT 与多个主流基线模型进行了大量实验,并采用四种评价指标评估生成图像的质量。与基线模型相比,FDDT 最多可将 Fréchet Inception 距离降低 24.4%、结构相似度降低 4.4%、峰值信噪比降低 5.8%、均方误差降低 31%;与先前方法相比,对应指标最多分别降低 23.7%、1.8%、6.8% 和 31.6%。
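
The decomposition step can be illustrated with a simple FFT split into low- and high-frequency components. FDDT's actual decomposition and per-band alignment are more involved; the hard circular mask and cutoff radius below are simplifying assumptions.

```python
import numpy as np

def frequency_split(img, radius=16):
    """Split an image into low-/high-frequency parts with a circular FFT mask."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(F * mask)).real    # style-like content
    high = np.fft.ifft2(np.fft.ifftshift(F * ~mask)).real  # details / identity
    return low, high

img = np.random.rand(64, 64)
low, high = frequency_split(img)
print(np.abs(low + high - img).max())  # the split is exactly invertible (up to FP error)
```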

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

  • paper_url: http://arxiv.org/abs/2311.03149
  • repo_url: None
  • paper_authors: Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang
  • for: 这篇论文主要研究用掩码自编码预训练较小的视觉 Transformer 模型,以降低计算成本、扩大适用范围。
  • methods: 该论文提出了一种新的非对称掩码蒸馏(AMD)框架,用于预训练较小的模型。AMD 使用不对称的掩码策略:教师模型以较低的掩码率看到更多的上下文信息,而学生模型仍保持较高的掩码率。
  • results: AMD 在 IN1K 数据集上达到 84.6% 的分类精度,在 Something-in-Something V2 数据集上达到 73.3% 的分类精度,比原始 ViT-B 模型提高 3.7%。此外,AMD 预训练模型迁移到下游任务后也取得了一致的性能提升。
    Abstract Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, these large foundation models often result in high computational cost that might limit their deployment. This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation(AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is to devise an asymmetric masking strategy, where the teacher model is enabled to see more context information with a lower masking ratio, while the student model still with high masking ratio to the original masked pre-training. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieved 84.6% classification accuracy on IN1K using the ViT-B model. And AMD achieves 73.3% classification accuracy using the ViT-B model on the Something-in-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvement over the standard pre-training.
    摘要 得益于掩码自编码的预训练范式,自监督基础模型在计算机视觉领域展现出了巨大潜力。规模是影响这些基础模型性能的首要因素,但大型基础模型往往带来高昂的计算成本,限制其部署。本文关注预训练相对较小、且能高效适配下游任务的视觉 Transformer 模型。受模型压缩中知识蒸馏的启发,我们提出了一种新的非对称掩码蒸馏(AMD)框架,用于以自编码方式预训练较小的模型。AMD 的核心是设计非对称的掩码策略:教师模型以较低的掩码率看到更多的上下文信息,而学生模型仍保持原有的高掩码率。我们设计了定制的多层特征对齐,以规范学生 MAE 的预训练。为了验证 AMD 的有效性和通用性,我们将其应用于 ImageMAE 和 VideoMAE 来预训练较小的 ViT 模型。AMD 使用 ViT-B 模型在 IN1K 上达到了 84.6% 的分类精度;在 Something-in-Something V2 数据集上达到了 73.3% 的分类精度,比 VideoMAE 的原始 ViT-B 模型提高 3.7%。我们还将 AMD 预训练模型迁移到下游任务,相较标准预训练获得了一致的性能提升。
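
The asymmetric masking idea can be sketched by sampling one random patch order and giving the teacher a visible set that strictly contains the student's, so the teacher sees more context at a lower masking ratio; the full framework then aligns the student's features to the teacher's layer-wise. The ratios below are illustrative, not the paper's tuned values.

```python
import torch

def asymmetric_masks(num_patches=196, student_ratio=0.9, teacher_ratio=0.5,
                     generator=None):
    """Sample asymmetric teacher/student visible-patch sets (sketch)."""
    g = generator or torch.Generator().manual_seed(0)
    order = torch.randperm(num_patches, generator=g)      # one shared random order
    n_student = int(num_patches * (1 - student_ratio))    # few visible patches
    n_teacher = int(num_patches * (1 - teacher_ratio))    # more visible patches
    student_visible = order[:n_student]
    teacher_visible = order[:n_teacher]                   # superset of the student's
    return student_visible, teacher_visible

s, t = asymmetric_masks()
print(len(s), len(t), set(s.tolist()) <= set(t.tolist()))  # 19 98 True
```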

Animating NeRFs from Texture Space: A Framework for Pose-Dependent Rendering of Human Performances

  • paper_url: http://arxiv.org/abs/2311.03140
  • repo_url: None
  • paper_authors: Paul Knoll, Wieland Morgenstern, Anna Hilsmann, Peter Eisert
  • for: 本研究旨在提出一种基于 NeRF 的姿态相关人体表演渲染框架,实现可控的人体动作渲染。
  • methods: 本方法将 NeRF 的辐射场环绕 SMPL 人体网格进行扭曲,形成表面对齐的表示,并通过骨骼关节参数控制姿态相关的外观表现。
  • results: 实验结果显示,本方法可以实现高质量的新视角与新姿态合成;即使存在映射歧义与随机视觉变化,仍能高效地学习与渲染。
    Abstract Creating high-quality controllable 3D human models from multi-view RGB videos poses a significant challenge. Neural radiance fields (NeRFs) have demonstrated remarkable quality in reconstructing and free-viewpoint rendering of static as well as dynamic scenes. The extension to a controllable synthesis of dynamic human performances poses an exciting research question. In this paper, we introduce a novel NeRF-based framework for pose-dependent rendering of human performances. In our approach, the radiance field is warped around an SMPL body mesh, thereby creating a new surface-aligned representation. Our representation can be animated through skeletal joint parameters that are provided to the NeRF in addition to the viewpoint for pose dependent appearances. To achieve this, our representation includes the corresponding 2D UV coordinates on the mesh texture map and the distance between the query point and the mesh. To enable efficient learning despite mapping ambiguities and random visual variations, we introduce a novel remapping process that refines the mapped coordinates. Experiments demonstrate that our approach results in high-quality renderings for novel-view and novel-pose synthesis.
    摘要 从多视角 RGB 视频创建高质量、可控的 3D 人体模型是一项重大挑战。神经辐射场(NeRF)在静态与动态场景的重建和自由视点渲染方面已展现出卓越的质量,而将其扩展到可控的动态人体表演合成是一个令人兴奋的研究问题。在本文中,我们提出了一种基于 NeRF 的新框架,用于姿态相关的人体表演渲染。在我们的方法中,辐射场被扭曲环绕到 SMPL 人体网格上,从而构成新的表面对齐表示。该表示可以通过骨骼关节参数进行驱动,这些参数与视点一同输入 NeRF,以实现姿态相关的外观。为此,我们的表示包含网格纹理图上对应的 2D UV 坐标以及查询点与网格之间的距离。为了在映射歧义和随机视觉变化下实现高效学习,我们引入了一种新的重映射过程来修正映射坐标。实验结果表明,我们的方法可以为新视角与新姿态合成生成高质量的渲染结果。

TAMPAR: Visual Tampering Detection for Parcel Logistics in Postal Supply Chains

  • paper_url: http://arxiv.org/abs/2311.03124
  • repo_url: None
  • paper_authors: Alexander Naumann, Felix Hertlein, Laura Dörr, Kai Furmans
  • for: 本研究探讨了最后一公里配送场景下的包裹篡改检测方法:使用单张 RGB 图像与现有数据库中的参考图像进行比较,检测提示篡改的外观变化。
  • methods: 本研究提议了一条篡改检测管线,利用关键点检测确定包裹的八个角点,然后应用透视变换为每个可见面创建归一化的正视图,以便对包裹的各可见面进行比较。
  • results: 实验结果表明,关键点检测和篡改检测分别达到了 75.76% 的 AP 和 81% 的准确率(F1 分数 0.83),在真实图像上取得了良好的结果。此外,还对不同的篡改类型、镜头畸变和视角进行了敏感性分析。
    Abstract Due to the steadily rising amount of valuable goods in supply chains, tampering detection for parcels is becoming increasingly important. In this work, we focus on the use-case last-mile delivery, where only a single RGB image is taken and compared against a reference from an existing database to detect potential appearance changes that indicate tampering. We propose a tampering detection pipeline that utilizes keypoint detection to identify the eight corner points of a parcel. This permits applying a perspective transformation to create normalized fronto-parallel views for each visible parcel side surface. These viewpoint-invariant parcel side surface representations facilitate the identification of signs of tampering on parcels within the supply chain, since they reduce the problem to parcel side surface matching with pair-wise appearance change detection. Experiments with multiple classical and deep learning-based change detection approaches are performed on our newly collected TAMpering detection dataset for PARcels, called TAMPAR. We evaluate keypoint and change detection separately, as well as in a unified system for tampering detection. Our evaluation shows promising results for keypoint (Keypoint AP 75.76) and tampering detection (81% accuracy, F1-Score 0.83) on real images. Furthermore, a sensitivity analysis for tampering types, lens distortion and viewing angles is presented. Code and dataset are available at https://a-nau.github.io/tampar.
    摘要 随着供应链中贵重货物数量的持续增长,包裹篡改检测变得日益重要。本研究聚焦于最后一公里配送这一应用场景:仅拍摄一张 RGB 图像,并与现有数据库中的参考图像进行比较,以检测提示篡改的外观变化。我们提出了一条篡改检测流水线,利用关键点检测识别包裹的八个角点,进而通过透视变换为每个可见的包裹侧面生成归一化的正视图。这些与视角无关的侧面表示将问题简化为侧面匹配与成对外观变化检测,便于识别供应链中包裹上的篡改痕迹。我们在新采集的包裹篡改检测数据集 TAMPAR 上实验了多种经典与基于深度学习的变化检测方法,分别评估了关键点检测与变化检测,并在统一的篡改检测系统中进行了整体评估。评估结果显示,在真实图像上,关键点检测(Keypoint AP 75.76)与篡改检测(准确率 81%、F1 分数 0.83)均取得了有前景的结果。此外,我们还针对篡改类型、镜头畸变和视角进行了敏感性分析。代码和数据集见 https://a-nau.github.io/tampar。
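
The rectification step can be illustrated with OpenCV: given the four detected corner keypoints of one visible parcel face, a homography maps that face to a normalized fronto-parallel view. The corner ordering and output size below are assumptions for illustration.

```python
import cv2
import numpy as np

def rectify_face(image, corners, out_w=256, out_h=256):
    """Warp one visible parcel face to a fronto-parallel view from 4 keypoints."""
    src = np.asarray(corners, dtype=np.float32)            # TL, TR, BR, BL (assumed order)
    dst = np.float32([[0, 0], [out_w - 1, 0],
                      [out_w - 1, out_h - 1], [0, out_h - 1]])
    H = cv2.getPerspectiveTransform(src, dst)              # 4-point homography
    return cv2.warpPerspective(image, H, (out_w, out_h))

img = np.zeros((480, 640, 3), np.uint8)                    # placeholder parcel photo
face = rectify_face(img, [(100, 80), (400, 120), (380, 360), (90, 300)])
print(face.shape)  # (256, 256, 3) — ready for pair-wise appearance comparison
```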

Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

  • paper_url: http://arxiv.org/abs/2311.03106
  • repo_url: https://github.com/huiguanlab/umurl
  • paper_authors: Shengkai Sun, Daizong Liu, Jianfeng Dong, Xiaoye Qu, Junyu Gao, Xun Yang, Xun Wang, Meng Wang
  • for: 本研究旨在提出一种多模态无监督表示学习框架,以提升基于骨架的动作理解的鲁棒性和效率。
  • methods: 本研究使用一种称为 Unified Multi-modal Unsupervised Representation Learning(UmURL)的方法,通过早期融合策略将多模态特征编码在单一数据流中,从而降低模型复杂度。此外,本研究还提出了模态内与模态间一致性学习,以保证多模态特征不受模态偏置的影响。
  • results: 实验结果表明,UmURL 兼具高效率与低复杂度,并在不同的下游任务场景中取得了新的最优表现。
    Abstract Unsupervised pre-training has shown great success in skeleton-based action understanding recently. Existing works typically train separate modality-specific models, then integrate the multi-modal information for action understanding by a late-fusion strategy. Although these approaches have achieved significant performance, they suffer from the complex yet redundant multi-stream model designs, each of which is also limited to the fixed input skeleton modality. To alleviate these issues, in this paper, we propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features for reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., being dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modal via feature decomposition and distinct alignment. In this manner, our framework is able to learn the unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, possessing the approximate complexity with the uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning.
    摘要 近年来,无监督预训练在基于骨架的动作理解中取得了很大成功。现有方法通常分别训练各模态专属的模型,再通过后期融合策略整合多模态信息用于动作理解。虽然这些方法性能显著,但其复杂且冗余的多流模型设计带来负担,且每个模型都受限于固定的输入骨架模态。为缓解这些问题,本文提出了一种统一多模态无监督表示学习框架 UmURL,它利用高效的早期融合策略,以单流方式联合编码多模态特征。具体而言,我们不再为各模态设计单独的无监督优化过程,而是通过早期融合将不同模态的输入送入同一数据流,学习其多模态特征,从而降低模型复杂度。为确保融合后的多模态特征不产生模态偏置,即不被某一模态输入主导,我们进一步提出模态内与模态间一致性学习,通过特征分解与区分性对齐保证多模态特征完整保留各模态的语义。如此,我们的框架能够为单模态或多模态骨架输入学习统一表示,灵活适应实际场景中不同类型的模态输入,实现鲁棒的动作理解。在 NTU-60、NTU-120 和 PKU-MMD II 三个大规模数据集上的大量实验表明,UmURL 十分高效,其复杂度与单模态方法相近,同时在基于骨架的动作表示学习的多种下游任务场景中取得了新的最优性能。
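
As a rough illustration of the early-fusion idea, the sketch below derives bone and motion modalities from joint coordinates, projects each modality, and sums them into a single stream before one shared encoder. The tiny skeleton, the GRU encoder, and all dimensions are stand-ins, not UmURL's architecture.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Early fusion of skeleton modalities (joint / bone / motion) into one stream."""
    def __init__(self, in_dim=3, dim=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(in_dim, dim) for _ in range(3))
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # single shared encoder

    def forward(self, joints, parents):
        # joints: (B, T, J, 3); parents: LongTensor (J,) with each joint's parent index
        bone = joints - joints[:, :, parents]              # bone vectors
        motion = torch.zeros_like(joints)
        motion[:, 1:] = joints[:, 1:] - joints[:, :-1]     # temporal differences
        fused = sum(p(x) for p, x in zip(self.proj, (joints, bone, motion)))
        out, _ = self.encoder(fused.mean(dim=2))           # (B, T, dim), one stream
        return out

parents = torch.tensor([0, 0, 1, 2])                       # toy 4-joint skeleton
feats = EarlyFusion()(torch.randn(2, 16, 4, 3), parents)
print(feats.shape)  # torch.Size([2, 16, 64])
```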

A survey and classification of face alignment methods based on face models

  • paper_url: http://arxiv.org/abs/2311.03082
  • repo_url: https://github.com/nordlinglab/facealignment-survey
  • paper_authors: Jagmohan Meher, Hector Allende-Cid, Torbjörn E. M. Nordling
  • for: 这篇论文的目的是面向不同类型的读者(初学者、实践者和研究人员)提供关于人脸对齐的综述,包括人脸模型的解释和训练,以及将人脸模型拟合到新的人脸图像。
  • methods: 这篇论文综述了多种人脸模型,包括基于 3D 的人脸模型和基于深度学习的方法;这些方法在训练和拟合方式上各不相同。
  • results: 研究发现,在极端人脸姿态情况下,基于 3D 的人脸模型更为有效,而基于深度学习的方法通常使用热图表示人脸特征点。此外,文章还讨论了人脸模型在人脸对齐领域的未来发展方向。
    Abstract A face model is a mathematical representation of the distinct features of a human face. Traditionally, face models were built using a set of fiducial points or landmarks, each point ideally located on a facial feature, i.e., corner of the eye, tip of the nose, etc. Face alignment is the process of fitting the landmarks in a face model to the respective ground truth positions in an input image containing a face. Despite significant research on face alignment in the past decades, no review analyses various face models used in the literature. Catering to three types of readers - beginners, practitioners and researchers in face alignment, we provide a comprehensive analysis of different face models used for face alignment. We include the interpretation and training of the face models along with the examples of fitting the face model to a new face image. We found that 3D-based face models are preferred in cases of extreme face pose, whereas deep learning-based methods often use heatmaps. Moreover, we discuss the possible future directions of face models in the field of face alignment.
    摘要 人脸模型是对人脸显著特征的数学表示。传统上,人脸模型通过一组基准点或特征点构建,每个点理想地位于某个面部特征处,例如眼角、鼻尖等。人脸对齐是将人脸模型中的特征点拟合到输入人脸图像中相应真实位置的过程。尽管过去几十年人脸对齐研究成果丰富,但尚无综述分析文献中使用的各种人脸模型。面向三类读者(初学者、实践者和人脸对齐研究人员),我们对用于人脸对齐的不同人脸模型进行了全面分析,内容涵盖人脸模型的解释与训练,并给出将人脸模型拟合到新人脸图像的示例。我们发现,在极端人脸姿态情况下,基于 3D 的人脸模型更受青睐,而基于深度学习的方法通常使用热图。此外,我们还讨论了人脸模型在人脸对齐领域可能的未来方向。

CogVLM: Visual Expert for Pretrained Language Models

  • paper_url: http://arxiv.org/abs/2311.03079
  • repo_url: https://github.com/thudm/cogvlm
  • paper_authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang
  • for: This paper presents a powerful open-source visual language foundation model called CogVLM, which aims to bridge the gap between frozen pretrained language models and image encoders.
  • methods: The CogVLM model uses a trainable visual expert module in the attention and FFN layers to enable deep fusion of vision language features without sacrificing any performance on NLP tasks.
  • results: The CogVLM-17B model achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B.
    Abstract We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.
    摘要 我们介绍 CogVLM,一个强大的开源视觉语言基础模型。与流行的浅层对齐方法(将图像特征映射到语言模型的输入空间)不同,CogVLM 通过在注意力层和 FFN 层中加入可训练的视觉专家模块,弥合了冻结的预训练语言模型与图像编码器之间的差距。因此,CogVLM 能够深度融合视觉语言特征,而不损失任何 NLP 任务上的性能。CogVLM-17B 在 NoCaps、Flicker30k captioning、RefCOCO、RefCOCO+、RefCOCOg、Visual7W、GQA、ScienceQA、VizWiz VQA 和 TDIUC 等 10 个经典跨模态基准上取得了最先进的性能,并在 VQAv2、OKVQA、TextVQA、COCO captioning 等基准上排名第二,超越或比肩 PaLI-X 55B。代码和模型权重可在 https://github.com/THUDM/CogVLM 获取。

A Two-Stage Generative Model with CycleGAN and Joint Diffusion for MRI-based Brain Tumor Detection

  • paper_url: http://arxiv.org/abs/2311.03074
  • repo_url: https://github.com/zhyjsiat/a-two-stage-cyclegan-ve-brats2020
  • paper_authors: Wenxin Wang, Zhuo-Xu Cui, Guanxun Cheng, Chentao Cao, Xi Xu, Ziwei Liu, Haifeng Wang, Yulong Qi, Dong Liang, Yanjie Zhu
  • for: 这个 paper 的目的是提高基于 MRI 的脑肿瘤检测和分割精度。
  • methods: 本 paper 结合两种方法:CycleGAN 和 VE-JP。CycleGAN 在非配对数据上训练,从健康图像生成异常图像作为数据先验;VE-JP 则以合成的配对异常图像为引导重建健康图像,只改变病变区域而不影响健康区域。
  • results: 本 paper 的结果显示,TSGM 可以提升脑肿瘤检测和分割的精度:在 BraTs2020 数据集上 DSC 分数为 0.8590,在 ITCS 数据集上为 0.6226,在内部数据集上为 0.7403,显示出更好的分割性能和泛化能力。
    Abstract Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework Two-Stage Generative Model (TSGM) that combines Cycle Generative Adversarial Network (CycleGAN) and Variance Exploding stochastic differential equation using joint probability (VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is trained on unpaired data to generate abnormal images from healthy images as data prior. Then VE-JP is implemented to reconstruct healthy images using synthetic paired abnormal images as a guide, which alters only pathological regions but not regions of healthy. Notably, our method directly learned the joint probability distribution for conditional generation. The residual between input and reconstructed images suggests the abnormalities and a thresholding method is subsequently applied to obtain segmentation results. Furthermore, the multimodal results are weighted with different weights to improve the segmentation accuracy further. We validated our method on three datasets, and compared with other unsupervised methods for anomaly detection and segmentation. The DSC score of 0.8590 in BraTs2020 dataset, 0.6226 in ITCS dataset and 0.7403 in In-house dataset show that our method achieves better segmentation performance and has better generalization.
    摘要 在医学诊断中,脑肿瘤的精准检测与分割至关重要。然而,现有的监督学习方法需要大量标注图像,而无监督方法中使用的最先进生成模型往往难以覆盖完整的数据分布。本文提出了一种新颖的两阶段生成模型(TSGM)框架,结合 CycleGAN 与基于联合概率的方差爆炸随机微分方程(VE-JP),以改进脑肿瘤检测与分割。CycleGAN 在非配对数据上训练,从健康图像生成异常图像作为数据先验;随后,VE-JP 以合成的配对异常图像为引导重建健康图像,只改变病变区域而不影响健康区域。值得注意的是,我们的方法直接学习了用于条件生成的联合概率分布。输入图像与重建图像之间的残差提示异常所在,随后通过阈值化方法获得分割结果。此外,我们对多模态结果施加不同权重,以进一步提升分割精度。我们在三个数据集上验证了所提方法,并与其他无监督异常检测与分割方法进行比较。BraTs2020 数据集上 0.8590、ITCS 数据集上 0.6226、内部数据集上 0.7403 的 DSC 分数表明,我们的方法具有更好的分割性能和更强的泛化能力。

OrthoNets: Orthogonal Channel Attention Networks

  • paper_url: http://arxiv.org/abs/2311.03071
  • repo_url: https://github.com/hady1011/orthonets
  • paper_authors: Hadi Salman, Caleb Parks, Matthew Swan, John Gauch
  • for: 提升通道注意力机制的有效性,寻找一种能获得最优特征表示的有损压缩方法。
  • methods: 使用随机初始化的正交滤波器构建注意力机制,并将其集成到 ResNet 中。
  • results: 在 Birds、MS-COCO 和 Places356 数据集上的表现优于 FcaNet 及其他注意力机制;在 ImageNet 数据集上与当前最先进方法持平或更优。
    Abstract Designing an effective channel attention mechanism implores one to find a lossy-compression method allowing for optimal feature representation. Despite recent progress in the area, it remains an open problem. FcaNet, the current state-of-the-art channel attention mechanism, attempted to find such an information-rich compression using Discrete Cosine Transforms (DCTs). One drawback of FcaNet is that there is no natural choice of the DCT frequencies. To circumvent this issue, FcaNet experimented on ImageNet to find optimal frequencies. We hypothesize that the choice of frequency plays only a supporting role and the primary driving force for the effectiveness of their attention filters is the orthogonality of the DCT kernels. To test this hypothesis, we construct an attention mechanism using randomly initialized orthogonal filters. Integrating this mechanism into ResNet, we create OrthoNet. We compare OrthoNet to FcaNet (and other attention mechanisms) on Birds, MS-COCO, and Places356 and show superior performance. On the ImageNet dataset, our method competes with or surpasses the current state-of-the-art. Our results imply that an optimal choice of filter is elusive and generalization can be achieved with a sufficiently large number of orthogonal filters. We further investigate other general principles for implementing channel attention, such as its position in the network and channel groupings. Our code is publicly available at https://github.com/hady1011/OrthoNets/
    摘要 设计有效的通道注意力机制,需要找到一种能获得最优特征表示的有损压缩方法。尽管该领域近来不断取得进展,这一问题仍未解决。当前最先进的通道注意力机制 FcaNet 尝试使用离散余弦变换(DCT)寻找这种信息丰富的压缩。FcaNet 的一个缺点是 DCT 频率没有自然的选择;为绕开这一问题,FcaNet 在 ImageNet 上通过实验寻找最优频率。我们假设频率的选择只起辅助作用,其注意力滤波器有效性的主要驱动力在于 DCT 核的正交性。为检验这一假设,我们构建了一种使用随机初始化正交滤波器的注意力机制,并将其集成到 ResNet 中,得到 OrthoNet。我们在 Birds、MS-COCO 和 Places356 上将 OrthoNet 与 FcaNet(及其他注意力机制)进行比较,结果显示其性能更优;在 ImageNet 数据集上,我们的方法与当前最先进方法持平或更优。我们的结果表明,最优滤波器的选择难以确定,而使用足够多的正交滤波器即可实现良好的泛化。我们还进一步研究了实现通道注意力的其他一般原则,例如其在网络中的位置以及通道分组。代码公开于 https://github.com/hady1011/OrthoNets/。
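
The hypothesis tested here — that orthogonality, not the specific DCT frequencies, drives the attention filters — can be sketched as channel attention whose fixed pooling filters are rows of a random orthogonal matrix from a QR decomposition. The sizes and the squeeze-and-excitation-style gating MLP below are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class OrthoChannelAttention(nn.Module):
    """Channel attention with fixed random orthogonal spatial filters
    (one filter per channel; requires channels <= height * width)."""
    def __init__(self, channels=64, height=8, width=8, reduction=4):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(height * width, height * width))
        self.register_buffer("filters", q[:channels].view(channels, height, width))
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                                # x: (B, C, H, W)
        pooled = (x * self.filters).sum(dim=(2, 3))      # orthogonal "pooling" per channel
        w = self.mlp(pooled)                             # per-channel gates in (0, 1)
        return x * w[:, :, None, None]

y = OrthoChannelAttention()(torch.randn(2, 64, 8, 8))
print(y.shape)  # torch.Size([2, 64, 8, 8])
```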

Forest aboveground biomass estimation using GEDI and earth observation data through attention-based deep learning

  • paper_url: http://arxiv.org/abs/2311.03067
  • repo_url: None
  • paper_authors: Wenquan Dong, Edward T. A. Mitchard, Hao Yu, Steven Hancock, Casey M. Ryan
  • for: 本研究旨在利用卫星数据估算森林地上生物量(AGB),以支撑气候变化背景下的碳核算。
  • methods: 本研究使用开源的对地观测数据,包括 GEDI LiDAR 数据、C 波段 Sentinel-1 SAR 数据、ALOS-2 PALSAR-2 数据和 Sentinel-2 多光谱数据,并采用基于注意力的深度学习模型(注意力 UNet,AU)进行 AGB 估算。
  • results: 相比传统的 RF 算法,AU 模型在 AGB 估算中精度明显更高:AU 的 R2 为 0.66、RMSE 为 43.66 Mg ha-1、偏差为 0.14 Mg ha-1,而 RF 的 R2 为 0.62、RMSE 为 45.87 Mg ha-1、偏差为 1.09 Mg ha-1。然而,深度学习方法的优势并非在所有测试模型中一致:ResNet101 仅取得 R2 0.50、RMSE 52.93 Mg ha-1、偏差 0.99 Mg ha-1;UNet 的 R2 为 0.65、RMSE 为 44.28 Mg ha-1,但偏差较大(1.84 Mg ha-1)。此外,为考察 AU 在缺乏空间信息时的表现,我们使用全连接(FC)层去除遥感数据中的空间信息:AU-FC 取得居中的 R2 0.64、RMSE 44.92 Mg ha-1、偏差 -0.56 Mg ha-1,优于 RF,但逊于利用空间信息的 AU 模型。
    Abstract Accurate quantification of forest aboveground biomass (AGB) is critical for understanding carbon accounting in the context of climate change. In this study, we presented a novel attention-based deep learning approach for forest AGB estimation, primarily utilizing openly accessible EO data, including: GEDI LiDAR data, C-band Sentinel-1 SAR data, ALOS-2 PALSAR-2 data, and Sentinel-2 multispectral data. The attention UNet (AU) model achieved markedly higher accuracy for biomass estimation compared to the conventional RF algorithm. Specifically, the AU model attained an R2 of 0.66, RMSE of 43.66 Mg ha-1, and bias of 0.14 Mg ha-1, while RF resulted in lower scores of R2 0.62, RMSE 45.87 Mg ha-1, and bias 1.09 Mg ha-1. However, the superiority of the deep learning approach was not uniformly observed across all tested models. ResNet101 only achieved an R2 of 0.50, an RMSE of 52.93 Mg ha-1, and a bias of 0.99 Mg ha-1, while the UNet reported an R2 of 0.65, an RMSE of 44.28 Mg ha-1, and a substantial bias of 1.84 Mg ha-1. Moreover, to explore the performance of AU in the absence of spatial information, fully connected (FC) layers were employed to eliminate spatial information from the remote sensing data. AU-FC achieved intermediate R2 of 0.64, RMSE of 44.92 Mgha-1, and bias of -0.56 Mg ha-1, outperforming RF but underperforming AU model using spatial information. We also generated 10m forest AGB maps across Guangdong for the year 2019 using AU and compared it with that produced by RF. The AGB distributions from both models showed strong agreement with similar mean values; the mean forest AGB estimated by AU was 102.18 Mg ha-1 while that of RF was 104.84 Mg ha-1. Additionally, it was observed that the AGB map generated by AU provided superior spatial information. Overall, this research substantiates the feasibility of employing deep learning for biomass estimation based on satellite data.
    摘要 精确量化森林地上生物量(AGB)对于理解气候变化背景下的碳核算至关重要。在本研究中,我们提出了一种基于注意力的深度学习方法来估算森林 AGB,主要使用开放获取的对地观测(EO)数据,包括 GEDI LiDAR 数据、C 波段 Sentinel-1 SAR 数据、ALOS-2 PALSAR-2 数据和 Sentinel-2 多光谱数据。注意力 UNet(AU)模型在生物量估算上的精度显著高于传统 RF 算法:AU 模型的 R2 为 0.66、RMSE 为 43.66 Mg ha-1、偏差为 0.14 Mg ha-1,而 RF 的 R2 为 0.62、RMSE 为 45.87 Mg ha-1、偏差为 1.09 Mg ha-1。然而,深度学习方法的优势并非在所有测试模型中一致:ResNet101 仅取得 R2 0.50、RMSE 52.93 Mg ha-1、偏差 0.99 Mg ha-1,而 UNet 的 R2 为 0.65、RMSE 为 44.28 Mg ha-1,偏差高达 1.84 Mg ha-1。此外,为考察 AU 在没有空间信息时的表现,我们采用全连接(FC)层去除遥感数据中的空间信息:AU-FC 取得居中的 R2 0.64、RMSE 44.92 Mg ha-1、偏差 -0.56 Mg ha-1,优于 RF,但逊于利用空间信息的 AU 模型。我们还利用 AU 生成了 2019 年广东全省 10 米分辨率森林 AGB 地图,并与 RF 的结果进行比较:两个模型的 AGB 分布高度一致、均值相近,AU 估算的森林 AGB 均值为 102.18 Mg ha-1,RF 为 104.84 Mg ha-1;同时,AU 生成的 AGB 地图提供了更优的空间细节。总之,这项研究证实了基于卫星数据利用深度学习估算生物量的可行性。

AnyText: Multilingual Visual Text Generation And Editing

  • paper_url: http://arxiv.org/abs/2311.03054
  • repo_url: None
  • paper_authors: Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie
  • for: 这个论文主要针对的问题是如何使用扩散模型生成高质量的文本图像,尤其是图像中的文本区域。
  • methods: 该论文提出了一种基于扩散模型的多语言视觉文本生成和编辑模型,称为AnyText,它可以准确地渲染文本在图像中,并且可以在多种语言中生成文本。
  • results: 该论文通过对多种语言的文本图像进行评测,得出了与其他方法相比的显著性能优势。此外,该论文还提供了一个大规模的多语言文本图像集合(AnyWord-3M)和一个评测标准(AnyText-benchmark),以便进一步推动文本生成技术的发展。
    Abstract Diffusion model based Text-to-Image has achieved impressive achievements recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model, that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced on https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
    摘要 Diffusion模型基于Text-to-Image技术在最近几年内具有很高的成就。虽然目前的图像生成技术非常高级,可以生成高质量的图像,但是当注意力集中在生成图像中的文本区域时,仍然可以发现问题。为解决这个问题,我们介绍了AnyText,一种基于扩散的多语言视觉文本生成和编辑模型,它专注于在图像中准确和一致地生成文本。AnyText包括一个扩散管道,其中有两个主要元素:一个辅助隐藏模块和一个文本嵌入模块。前者使用文本字形、位置和遮盖图像作为输入,生成文本生成或编辑的隐藏特征。后者使用一个OCR模型将字roke数据编码为嵌入,这些嵌入与图像标签的嵌入结合生成文本,以便文本与背景融合。我们在训练时使用文本扩散损失和文本感知损失,以进一步提高文本准确性。AnyText可以在多种语言中写字,据我们所知,这是首次对多语言视觉文本生成进行了研究。此外,AnyText可以与现有的扩散模型集成,以提供更高质量的文本生成和编辑功能。经过广泛的评估实验,我们的方法在所有其他方法之上减分了较大的差距。此外,我们还提供了首个大规模的多语言文本图像集,AnyWord-3M,包含300万个图像文本对,其中每个对包含多种语言的OCR注解。基于AnyWord-3M集,我们提出了AnyText-benchmark,用于评估视觉文本生成准确性和质量。我们的项目将在https://github.com/tyxsspa/AnyText上开源,以促进和提高文本生成技术的发展。

MixUp-MIL: A Study on Linear & Multilinear Interpolation-Based Data Augmentation for Whole Slide Image Classification

  • paper_url: http://arxiv.org/abs/2311.03052
  • repo_url: None
  • paper_authors: Michael Gadermayr, Lukas Koller, Maximilian Tschuchnig, Lea Maria Stangassinger, Christina Kreutzer, Sebastien Couillard-Despres, Gertie Janneke Oostingh, Anton Hittmair
  • for: 本研究旨在考察特征向量之间的线性与多线性插值这一数据扩增技术,以提升分类网络和多例学习的泛化性能。
  • methods: 本研究采用多例学习方法,并在 10 种不同的数据集配置和两种特征提取方式(有监督与自监督)下对该技术进行了研究。
  • results: 研究发现该方法的效果存在极高的变异性,揭示了若干值得关注的方面,并提出了一些新的、有前景的研究方向。
    Abstract For classifying digital whole slide images in the absence of pixel level annotation, typically multiple instance learning methods are applied. Due to the generic applicability, such methods are currently of very high interest in the research community, however, the issue of data augmentation in this context is rarely explored. Here we investigate linear and multilinear interpolation between feature vectors, a data augmentation technique, which proved to be capable of improving the generalization performance classification networks and also for multiple instance learning. Experiments, however, have been performed on only two rather small data sets and one specific feature extraction approach so far and a strong dependence on the data set has been identified. Here we conduct a large study incorporating 10 different data set configurations, two different feature extraction approaches (supervised and self-supervised), stain normalization and two multiple instance learning architectures. The results showed an extraordinarily high variability in the effect of the method. We identified several interesting aspects to bring light into the darkness and identified novel promising fields of research.
    摘要 在缺乏像素级标注的情况下对数字全切片图像进行分类,通常采用多例学习方法。由于其通用性,此类方法目前在研究界备受关注,但数据扩增在这一场景下鲜有探讨。本文考察特征向量之间的线性与多线性插值这一数据扩增技术,它已被证明能够提升分类网络以及多例学习的泛化性能。然而,此前的实验仅在两个较小的数据集和一种特定的特征提取方式上进行,并且已发现其效果强烈依赖于数据集。本文开展了一项大规模研究,涵盖 10 种数据集配置、两种特征提取方式(有监督与自监督)、染色归一化以及两种多例学习架构。结果显示该方法的效果存在极高的变异性。我们归纳出若干值得关注的方面以廓清问题,并发现了一些新的、有前景的研究方向。
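
Feature-space mixup itself is a one-liner; the sketch below linearly interpolates two bags of patch feature vectors and their slide labels. The Beta-distributed mixing coefficient and the truncation-based bag pairing are common conventions, not necessarily the paper's exact setup.

```python
import numpy as np

def feature_mixup(feats_a, feats_b, label_a, label_b, alpha=0.2,
                  rng=np.random.default_rng(0)):
    """Linear interpolation between two bags of patch features and their labels."""
    lam = rng.beta(alpha, alpha)                   # mixing coefficient in (0, 1)
    n = min(len(feats_a), len(feats_b))            # match bag sizes by truncation
    feats = lam * feats_a[:n] + (1 - lam) * feats_b[:n]
    label = lam * label_a + (1 - lam) * label_b    # soft slide-level label
    return feats, label

rng = np.random.default_rng(1)
f, y = feature_mixup(rng.normal(size=(100, 512)),  # bag A: 100 patch embeddings
                     rng.normal(size=(120, 512)),  # bag B: 120 patch embeddings
                     np.array([1.0]), np.array([0.0]))
print(f.shape, y)  # (100, 512) and an interpolated label
```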

COLA: COarse-LAbel multi-source LiDAR semantic segmentation for autonomous driving

  • paper_url: http://arxiv.org/abs/2311.03017
  • repo_url: None
  • paper_authors: Jules Sanchez, Jean-Emmanuel Deschaud, François Goulette
  • for: 这篇 paper 旨在提升面向自动驾驶的 LiDAR 语义分割。
  • methods: 这篇 paper 使用多源训练方法,在训练时联合利用多个数据集。
  • results: 这篇 paper 在域泛化、源到源分割和预训练三个子领域均实现了系统性改进,并取得了最高性能(分别最高 +10%、+5.3% 和 +12%)。
    Abstract LiDAR semantic segmentation for autonomous driving has been a growing field of interest in the past few years. Datasets and methods have appeared and expanded very quickly, but methods have not been updated to exploit this new availability of data and continue to rely on the same classical datasets. Different ways of performing LIDAR semantic segmentation training and inference can be divided into several subfields, which include the following: domain generalization, the ability to segment data coming from unseen domains ; source-to-source segmentation, the ability to segment data coming from the training domain; and pre-training, the ability to create re-usable geometric primitives. In this work, we aim to improve results in all of these subfields with the novel approach of multi-source training. Multi-source training relies on the availability of various datasets at training time and uses them together rather than relying on only one dataset. To overcome the common obstacles found for multi-source training, we introduce the coarse labels and call the newly created multi-source dataset COLA. We propose three applications of this new dataset that display systematic improvement over single-source strategies: COLA-DG for domain generalization (up to +10%), COLA-S2S for source-to-source segmentation (up to +5.3%), and COLA-PT for pre-training (up to +12%).
    摘要 面向自动驾驶的 LiDAR 语义分割近年来发展迅速,数据集和方法快速涌现并不断扩充,但现有方法尚未更新以利用这些新增数据,仍依赖同样的经典数据集。LiDAR 语义分割的训练与推理可以划分为若干子领域:领域泛化,即分割来自未见领域的数据的能力;源到源分割,即分割来自训练领域数据的能力;以及预训练,即构建可复用几何基元的能力。在本工作中,我们旨在通过一种新的多源训练方法在上述所有子领域取得提升。多源训练依赖训练时可用的多个数据集,并将它们联合使用,而不是仅依赖单一数据集。为克服多源训练中常见的障碍,我们引入粗粒度标签,并将由此构建的多源数据集命名为 COLA。我们提出了该数据集的三种应用,均相对单源策略取得了系统性提升:用于领域泛化的 COLA-DG(最高 +10%)、用于源到源分割的 COLA-S2S(最高 +5.3%)以及用于预训练的 COLA-PT(最高 +12%)。

Exploring the Capability of Text-to-Image Diffusion Models with Structural Edge Guidance for Multi-Spectral Satellite Image Inpainting

  • paper_url: http://arxiv.org/abs/2311.03008
  • repo_url: None
  • paper_authors: Mikolaj Czerkawski, Christos Tachtatzis
  • for: 这篇论文研究了文本到图像修复模型在卫星图像数据上的实用性。
  • methods: 论文提出了一种基于 StableDiffusion 和 ControlNet 的新修复框架,以及一种 RGB 到多光谱(MSI)的转换方法。
  • results: 在更广泛数据上的实验结果表明,通过 StableDiffusion 合成的修复结果会出现不理想的伪影,而简单的自监督内部修复替代方案能获得更高的合成质量。
    Abstract The paper investigates the utility of text-to-image inpainting models for satellite image data. Two technical challenges of injecting structural guiding signals into the generative process as well as translating the inpainted RGB pixels to a wider set of MSI bands are addressed by introducing a novel inpainting framework based on StableDiffusion and ControlNet as well as a novel method for RGB-to-MSI translation. The results on a wider set of data suggest that the inpainting synthesized via StableDiffusion suffers from undesired artefacts and that a simple alternative of self-supervised internal inpainting achieves higher quality of synthesis.
    摘要 本文研究文本到图像修复模型在卫星图像数据上的实用性。针对两个技术挑战——在生成过程中注入结构引导信号,以及将修复得到的 RGB 像素转换到更宽的 MSI 波段——我们提出了一种基于 StableDiffusion 和 ControlNet 的新修复框架,以及一种新的 RGB 到 MSI 转换方法。在更广泛数据上的结果表明,通过 StableDiffusion 合成的修复结果存在不良伪影,而简单的自监督内部修复替代方案能达到更高的合成质量。

Zero-Shot Enhancement of Low-Light Image Based on Retinex Decomposition

  • paper_url: http://arxiv.org/abs/2311.02995
  • repo_url: None
  • paper_authors: Wenchao Li, Bangshu Xiong, Qiaofeng Ou, Xiaoyun Long, Jinhao Zhu, Jiabao Chen, Shuyuan Wen
  • for: 零样本低光照图像增强
  • methods: 基于学习的 Retinex 分解,包括用于去噪的 N-Net 网络与噪声损失项、用于估计反射与光照分量的 RI-Net,以及纹理损失项和分段平滑损失
  • results: 在自建的真实低光照数据集以及人脸检测、目标识别、实例分割等高级视觉任务上提升了泛化性能,与当前最先进方法相比具有竞争力。代码见:https://github.com/liwenchao0615/ZERRINNet
    Abstract Two difficulties here make low-light image enhancement a challenging task; firstly, it needs to consider not only luminance restoration but also image contrast, image denoising and color distortion issues simultaneously. Second, the effectiveness of existing low-light enhancement methods depends on paired or unpaired training data with poor generalization performance. To solve these difficult problems, we propose in this paper a new learning-based Retinex decomposition of zero-shot low-light enhancement method, called ZERRINNet. To this end, we first designed the N-Net network, together with the noise loss term, to be used for denoising the original low-light image by estimating the noise of the low-light image. Moreover, RI-Net is used to estimate the reflection component and illumination component, and in order to solve the color distortion and contrast, we use the texture loss term and segmented smoothing loss to constrain the reflection component and illumination component. Finally, our method is a zero-reference enhancement method that is not affected by the training data of paired and unpaired datasets, so our generalization performance is greatly improved, and in the paper, we have effectively validated it with a homemade real-life low-light dataset and additionally with advanced vision tasks, such as face detection, target recognition, and instance segmentation. We conducted comparative experiments on a large number of public datasets and the results show that the performance of our method is competitive compared to the current state-of-the-art methods. The code is available at:https://github.com/liwenchao0615/ZERRINNet
    摘要 低光照图像增强面临两大困难:其一,需要同时兼顾亮度恢复、图像对比度、图像去噪和色彩失真问题;其二,现有低光照增强方法的效果依赖配对或非配对训练数据,泛化性能不佳。为解决这些难题,本文提出了一种新的基于学习的 Retinex 分解零样本低光照增强方法 ZERRINNet。为此,我们首先设计了 N-Net 网络,并结合噪声损失项,通过估计低光照图像的噪声对原始低光照图像进行去噪。此外,使用 RI-Net 估计反射分量和光照分量;为解决色彩失真与对比度问题,我们利用纹理损失项和分段平滑损失对反射分量和光照分量加以约束。最后,我们的方法是一种零参考增强方法,不受配对与非配对数据集训练数据的影响,泛化性能因此大幅提升。文中我们利用自建的真实低光照数据集,以及人脸检测、目标识别、实例分割等高级视觉任务对方法进行了有效验证。我们在大量公开数据集上进行了对比实验,结果显示我们方法的性能与当前最先进方法相比具有竞争力。代码见:https://github.com/liwenchao0615/ZERRINNet
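
For intuition, the Retinex model underlying the method factors an image as I = R ⊙ L (reflectance times illumination). The sketch below shows only the textbook decomposition with a max-channel illumination heuristic and a gamma boost — a stand-in for what the learned N-Net/RI-Net pipeline does, not the paper's networks.

```python
import numpy as np

def naive_retinex(img, eps=1e-4):
    """Classic Retinex-style enhancement: I = R * L (illustrative heuristic only)."""
    L = img.max(axis=2, keepdims=True)     # rough illumination map (max over RGB)
    R = img / (L + eps)                    # reflectance, roughly in [0, 1]
    L_boosted = L ** 0.5                   # gamma-brighten the illumination
    return np.clip(R * L_boosted, 0.0, 1.0)

dim_img = np.random.rand(64, 64, 3) * 0.2  # toy under-exposed image
enhanced = naive_retinex(dim_img)
print(dim_img.mean(), enhanced.mean())     # mean brightness increases
```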

NEURO HAND: A weakly supervised Hierarchical Attention Network for neuroimaging abnormality Detection

  • paper_url: http://arxiv.org/abs/2311.02992
  • repo_url: None
  • paper_authors: David A. Wood
  • for: 这个论文用于检测临床神经影像数据中的异常。
  • methods: 这个方法使用层次注意力网络,适用于非体积数据(即高分辨率 MRI 切片的堆叠),并可仅凭二元的检查级标签进行训练。
  • results: 该方法可以提高分类精度,并通过粗略的切片间与切片内异常定位,或为不同切片和序列给出重要性分数来提供可解释性,使其适合用作放射科的自动分诊系统。
    Abstract Clinical neuroimaging data is naturally hierarchical. Different magnetic resonance imaging (MRI) sequences within a series, different slices covering the head, and different regions within each slice all confer different information. In this work we present a hierarchical attention network for abnormality detection using MRI scans obtained in a clinical hospital setting. The proposed network is suitable for non-volumetric data (i.e. stacks of high-resolution MRI slices), and can be trained from binary examination-level labels. We show that this hierarchical approach leads to improved classification, while providing interpretability through either coarse inter- and intra-slice abnormality localisation, or giving importance scores for different slices and sequences, making our model suitable for use as an automated triaging system in radiology departments.
    摘要 临床神经影像数据天然具有层次结构:同一检查系列中的不同 MRI 序列、覆盖头部的不同切片,以及每个切片中的不同区域,都承载着不同的信息。在本工作中,我们提出了一种层次注意力网络,用于对临床医院环境中获取的 MRI 扫描进行异常检测。所提出的网络适用于非体积数据(即高分辨率 MRI 切片的堆叠),并可仅凭二元的检查级标签进行训练。我们表明,这种层次化方法在提升分类性能的同时,还能通过粗略的切片间与切片内异常定位,或为不同切片和序列给出重要性分数来提供可解释性,因此我们的模型适合用作放射科的自动分诊系统。
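
A common way to realize weakly supervised, slice-level interpretability is attention-MIL pooling, sketched below: per-slice embeddings are combined by learned attention weights whose values double as slice-importance estimates. This is a generic attention-MIL sketch, not the paper's full hierarchy over regions, slices, and sequences.

```python
import torch
import torch.nn as nn

class SliceAttentionPool(nn.Module):
    """Attention pooling over per-slice embeddings for exam-level prediction."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.classifier = nn.Linear(dim, 1)

    def forward(self, slice_feats):                       # (B, num_slices, dim)
        a = torch.softmax(self.score(slice_feats), dim=1) # per-slice attention weights
        scan = (a * slice_feats).sum(dim=1)               # scan-level embedding
        return self.classifier(scan), a.squeeze(-1)       # logit + slice importances

logit, weights = SliceAttentionPool()(torch.randn(2, 30, 128))
print(logit.shape, weights.shape)  # torch.Size([2, 1]) torch.Size([2, 30])
```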

Diffusion-based Radiotherapy Dose Prediction Guided by Inter-slice Aware Structure Encoding

  • paper_url: http://arxiv.org/abs/2311.02991
  • repo_url: None
  • paper_authors: Zhenghao Feng, Lu Wen, Jianghong Xiao, Yuanyuan Xu, Xi Wu, Jiliu Zhou, Xingchen Peng, Yan Wang
  • for: 这篇论文的目的是提升放射治疗规划中的剂量分布预测,并解决现有方法的过度平滑问题。
  • methods: 这篇论文提出了一个扩散模型基本的方法(DiffDose),它包括一个前进过程和一个反向过程。在前进过程中,DiffDose将剂量分布图transform为纯 Gaussian 噪声,并且同时训练一个噪声预测器来估计附加的噪声。在反向过程中,它逐步除去附加的噪声,最终输出预测的剂量分布图。
  • results: 这篇论文的结果显示,DiffDose 方法能够很好地缓解现有方法的过度平滑问题,并提升放射治疗规划中剂量分布预测的精度。
    Abstract Deep learning (DL) has successfully automated dose distribution prediction in radiotherapy planning, enhancing both efficiency and quality. However, existing methods suffer from the over-smoothing problem for their commonly used L1 or L2 loss with posterior average calculations. To alleviate this limitation, we propose a diffusion model-based method (DiffDose) for predicting the radiotherapy dose distribution of cancer patients. Specifically, the DiffDose model contains a forward process and a reverse process. In the forward process, DiffDose transforms dose distribution maps into pure Gaussian noise by gradually adding small noise and a noise predictor is simultaneously trained to estimate the noise added at each timestep. In the reverse process, it removes the noise from the pure Gaussian noise in multiple steps with the well-trained noise predictor and finally outputs the predicted dose distribution maps...
    摘要 深度学习(DL)已成功实现放射治疗规划中剂量分布预测的自动化,同时提升了效率和质量。然而,现有方法普遍采用带后验平均计算的 L1 或 L2 损失,因而存在过度平滑问题。为缓解这一局限,我们提出了一种基于扩散模型的方法(DiffDose),用于预测癌症患者的放疗剂量分布。具体而言,DiffDose 模型包含一个前向过程和一个反向过程:在前向过程中,DiffDose 通过逐步加入少量噪声将剂量分布图转化为纯高斯噪声,并同时训练一个噪声预测器来估计每个时间步所加入的噪声;在反向过程中,它利用训练好的噪声预测器分多步去除纯高斯噪声中的噪声,最终输出预测的剂量分布图。
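
The forward process described above is the standard DDPM corruption, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε. A minimal sketch, with an assumed linear β schedule, is:

```python
import torch

def q_sample(x0, t, betas):
    """One forward-diffusion draw: corrupt a dose map toward Gaussian noise."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]   # cumulative signal fraction
    eps = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return xt, eps  # a noise predictor is trained to recover eps from (xt, t)

betas = torch.linspace(1e-4, 2e-2, 1000)   # assumed linear noise schedule
dose = torch.rand(1, 1, 64, 64)            # toy dose distribution map
xt, eps = q_sample(dose, t=500, betas=betas)
print(xt.shape)  # torch.Size([1, 1, 64, 64])
```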

Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination

  • paper_url: http://arxiv.org/abs/2311.02960
  • repo_url: None
  • paper_authors: Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, Qing Qu
  • for: 本研究目的是探讨深度学习网络中层次特征学习的机制。
  • methods: 本研究使用深度线性网络来探讨输入数据的转化。
  • results: 研究发现,深度线性网络中每一层都会以几何速率对类内特征进行压缩,并以线性速率对类间特征进行判别。这种特征演化模式在深度网络中可以定量刻画,并且对迁移学习等实际应用具有重要意义。
    Abstract Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximate low-rank: Each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Empirically, our extensive experiments not only validate our theoretical results numerically but also reveal a similar pattern in deep nonlinear networks which aligns well with recent empirical studies. Moreover, we demonstrate the practical implications of our results in transfer learning. Our code is available at \url{https://github.com/Heimine/PNC_DLN}.
    摘要 过去十年,深度学习已被证明是从原始数据中学习有意义特征的高效工具,但深度网络如何在各层之间进行层次特征学习仍是一个悬而未决的问题。本文试图通过研究中间特征的结构来揭示这一机制。鉴于我们的实证发现表明线性层在特征学习中扮演着与非线性网络中深层相似的角色,我们在多类分类问题中研究训练后每一层的输出(即特征),以探讨深度线性网络如何将输入数据转化为输出。为此,我们首先定义了分别衡量中间特征类内压缩与类间分化的度量。通过对这两个度量的理论分析,我们证明:当输入数据接近正交、且网络权重为最小范数、平衡且近似低秩时,特征从浅层到深层的演化遵循一种简单而定量的模式,即线性网络的每一层以几何速率压缩类内特征,并以线性速率(相对于数据经过的层数)分化类间特征。据我们所知,这是首次对深度线性网络层次表示中特征演化的定量刻画。实验方面,我们的大量实验不仅在数值上验证了理论结果,还在深度非线性网络中发现了类似的模式,这与近期的实证研究相吻合。此外,我们还展示了这些结果在迁移学习中的实际意义。代码见 \url{https://github.com/Heimine/PNC_DLN}。
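As a rough illustration of the two per-layer quantities discussed above, the snippet below tracks a classic within-class/between-class scatter proxy across the layers of a toy deep linear network. The paper's exact metric definitions may differ; this is only a stand-in.

```python
# Illustrative proxies for per-layer within-class compression and
# between-class discrimination of intermediate features.
import numpy as np

def within_between(features, labels):
    """features: (n, d) layer outputs; labels: (n,) class ids."""
    mu = features.mean(axis=0)
    sw, sb = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mc = fc.mean(axis=0)
        sw += ((fc - mc) ** 2).sum()          # within-class scatter
        sb += len(fc) * ((mc - mu) ** 2).sum()  # between-class scatter
    return sw / len(features), sb / len(features)

rng = np.random.default_rng(0)
h = rng.normal(size=(300, 64))                # toy "nearly orthogonal" inputs
y = rng.integers(0, 3, size=300)
for layer in range(4):                        # random deep linear network
    w = rng.normal(size=(64, 64)) / np.sqrt(64)
    h = h @ w
    print("layer", layer, within_between(h, y))
```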

Multi-view learning for automatic classification of multi-wavelength auroral images

  • paper_url: http://arxiv.org/abs/2311.02947
  • repo_url: None
  • paper_authors: Qiuju Yang, Hang Su, Lili Liu, Yixuan Wang, Ze-Jun Hu
  • for: auroral classification, polar research
  • methods: lightweight feature extraction backbone (LCTNet), multi-scale reconstructed feature module (MSRM), lightweight attention feature enhancement module (LAFE)
  • results: state-of-the-art classification accuracy, superior results in terms of accuracy and computational efficiency compared to existing multi-view methods
    Abstract Auroral classification plays a crucial role in polar research. However, current auroral classification studies are predominantly based on images taken at a single wavelength, typically 557.7 nm. Images obtained at other wavelengths have been comparatively overlooked, and the integration of information from multiple wavelengths remains an underexplored area. This limitation results in low classification rates for complex auroral patterns. Furthermore, these studies, whether employing traditional machine learning or deep learning approaches, have not achieved a satisfactory trade-off between accuracy and speed. To address these challenges, this paper proposes a lightweight auroral multi-wavelength fusion classification network, MLCNet, based on a multi-view approach. Firstly, we develop a lightweight feature extraction backbone, called LCTNet, to improve the classification rate and cope with the increasing amount of auroral observation data. Secondly, considering the existence of multi-scale spatial structures in auroras, we design a novel multi-scale reconstructed feature module named MSRM. Finally, to highlight the discriminative information between auroral classes, we propose a lightweight attention feature enhancement module called LAFE. The proposed method is validated using observational data from the Arctic Yellow River Station during 2003-2004. Experimental results demonstrate that the fusion of multi-wavelength information effectively improves the auroral classification performance. In particular, our approach achieves state-of-the-art classification accuracy compared to previous auroral classification studies, and superior results in terms of accuracy and computational efficiency compared to existing multi-view methods.
    摘要 极光分类在极地研究中起着关键作用。然而,现有的极光分类研究主要基于单一波长(通常为557.7纳米)拍摄的图像,其他波长的图像相对被忽视,多波长信息的融合仍是一个尚未充分探索的领域,这导致复杂极光形态的分类率较低。此外,这些研究无论采用传统机器学习还是深度学习方法,均未在准确率与速度之间取得令人满意的平衡。为应对这些挑战,本文基于多视图思想提出了一种轻量级的极光多波长融合分类网络MLCNet。首先,我们设计了一种轻量级特征提取骨干网络LCTNet,以提高分类率并应对日益增长的极光观测数据量;其次,考虑到极光中存在多尺度空间结构,我们设计了一种新颖的多尺度重构特征模块MSRM;最后,为突出极光类别之间的判别信息,我们提出了一种轻量级注意力特征增强模块LAFE。所提方法使用2003-2004年北极黄河站的观测数据进行了验证。实验结果表明,融合多波长信息能有效提升极光分类性能;特别地,与以往的极光分类研究相比,我们的方法达到了最先进的分类精度,并在准确率和计算效率方面优于现有的多视图方法。

Truly Scale-Equivariant Deep Nets with Fourier Layers

  • paper_url: http://arxiv.org/abs/2311.02922
  • repo_url: https://github.com/ashiq24/scale_equivarinat_fourier_layer
  • paper_authors: Md Ashiqur Rahman, Raymond A. Yeh
  • for: 这个论文的目的是提出一种具有缩放等变性(scale-equivariance)的深度学习模型,以便在图像分割等任务中取得更好的效果。
  • methods: 该论文使用了 Fourier 层来实现缩放等变性,并考虑了抗锯齿处理。
  • results: 该模型在 MNIST-scale 和 STL-10 数据集上取得了有竞争力的分类性能,同时保持了零等变性误差。
    Abstract In computer vision, models must be able to adapt to changes in image resolution to effectively carry out tasks such as image segmentation; This is known as scale-equivariance. Recent works have made progress in developing scale-equivariant convolutional neural networks, e.g., through weight-sharing and kernel resizing. However, these networks are not truly scale-equivariant in practice. Specifically, they do not consider anti-aliasing as they formulate the down-scaling operation in the continuous domain. To address this shortcoming, we directly formulate down-scaling in the discrete domain with consideration of anti-aliasing. We then propose a novel architecture based on Fourier layers to achieve truly scale-equivariant deep nets, i.e., absolute zero equivariance-error. Following prior works, we test this model on MNIST-scale and STL-10 datasets. Our proposed model achieves competitive classification performance while maintaining zero equivariance-error.
    摘要 在计算机视觉中,模型需要适应图像分辨率的变化,才能有效完成图像分割等任务,这一性质被称为缩放等变性。近期研究在构建缩放等变卷积神经网络方面取得了进展,例如通过权重共享和卷积核缩放。然而,这些网络在实践中并非真正缩放等变:它们在连续域中表述下采样操作,没有考虑抗锯齿。为解决这一不足,我们直接在离散域中表述下采样,并将抗锯齿纳入考虑。随后,我们提出一种基于傅里叶层的新架构,以实现真正缩放等变的深度网络,即等变性误差绝对为零。沿用先前工作的设置,我们在 MNIST-scale 和 STL-10 数据集上测试该模型。所提模型在保持零等变性误差的同时取得了有竞争力的分类性能。
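The core fix the abstract describes, formulating down-scaling in the discrete domain with anti-aliasing, can be illustrated with band-limited resampling in the Fourier domain. The sketch below is a generic ideal low-pass down-scaler, not the authors' Fourier layer.

```python
# Anti-aliased down-scaling by cropping the centered discrete spectrum.
import numpy as np

def fourier_downscale(img, out_h, out_w):
    """Down-scale by keeping only the low-frequency band (ideal anti-alias)."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    ch, cw = h // 2, w // 2
    F_crop = F[ch - out_h // 2: ch + (out_h + 1) // 2,
               cw - out_w // 2: cw + (out_w + 1) // 2]
    out = np.fft.ifft2(np.fft.ifftshift(F_crop)).real
    return out * (out_h * out_w) / (h * w)   # preserve mean intensity

img = np.random.rand(32, 32)
small = fourier_downscale(img, 16, 16)
print(small.shape, round(img.mean(), 4), round(small.mean(), 4))
```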

Benchmarking Deep Facial Expression Recognition: An Extensive Protocol with Balanced Dataset in the Wild

  • paper_url: http://arxiv.org/abs/2311.02910
  • repo_url: None
  • paper_authors: Gianmarco Ipinze Tutuianu, Yang Liu, Ari Alamäki, Janne Kauttonen
  • for: 这篇论文旨在为人机交互中的表情识别(FER)技术提供实践性的研究与部署建议。
  • methods: 这篇论文使用了23种常见的网络架构,并按照一种统一的协议进行评估。具体来说,研究人员在不同的输入分辨率、类别均衡管理和预训练策略下进行了多种设置的研究,以描述对应的性能贡献。
  • results: 基于在三个大规模FER数据集上的大量实验和实际的交叉验证,研究人员对网络架构进行了排名,总结出一系列在真实场景中部署深度FER方法的建议,并讨论了表情识别在营销、教育和娱乐等应用中的伦理规范、隐私问题和相关法规。
    Abstract Facial expression recognition (FER) is a crucial part of human-computer interaction. Existing FER methods achieve high accuracy and generalization based on different open-source deep models and training approaches. However, the performance of these methods is not always good when encountering practical settings, which are seldom explored. In this paper, we collected a new in-the-wild facial expression dataset for cross-domain validation. Twenty-three commonly used network architectures were implemented and evaluated following a uniform protocol. Moreover, various setups, in terms of input resolutions, class balance management, and pre-trained strategies, were verified to show the corresponding performance contribution. Based on extensive experiments on three large-scale FER datasets and our practical cross-validation, we ranked network architectures and summarized a set of recommendations on deploying deep FER methods in real scenarios. In addition, potential ethical rules, privacy issues, and regulations were discussed in practical FER applications such as marketing, education, and entertainment business.
    摘要 面部表情识别(FER)是人机交互的关键组成部分。现有的FER方法基于不同的开源深度模型和训练方法,取得了较高的准确率与泛化能力。然而,这些方法在实际场景中的性能并不总是理想,而这些场景很少被研究。在这篇论文中,我们收集了一个新的真实场景(in-the-wild)面部表情数据集用于跨域验证,并按照统一的协议实现并评估了23种常用的网络架构。此外,我们还验证了输入分辨率、类别平衡管理和预训练策略等多种设置,以展示它们对性能的贡献。基于在三个大规模FER数据集上的大量实验和我们的实际交叉验证,我们对网络架构进行了排名,并总结出一系列在真实场景中部署深度FER方法的建议。此外,我们还讨论了表情识别在营销、教育和娱乐等实际应用中的伦理规范、隐私问题和相关法规。

Human as Points: Explicit Point-based 3D Human Reconstruction from Single-view RGB Images

  • paper_url: http://arxiv.org/abs/2311.02892
  • repo_url: https://github.com/yztang4/hap
  • paper_authors: Yingzhi Tang, Qijian Zhang, Junhui Hou, Yebin Liu
  • for: This paper aims to improve the performance of single-view human reconstruction by proposing an explicit point-based framework called HaP, which leverages point clouds as the intermediate representation of the target geometric structure.
  • methods: The proposed HaP framework uses fully-explicit point cloud estimation, manipulation, generation, and refinement in the 3D geometric space, rather than implicit learning processes that can be ambiguous and less controllable. The framework also includes dedicated designs of specialized learning components and processing procedures.
  • results: The authors report quantitative performance improvements of 20% to 40% over current state-of-the-art methods, and better qualitative results, demonstrating the effectiveness of the proposed framework. The results suggest a paradigm rollback to fully-explicit and geometry-centric algorithm design, which enables the use of various powerful point cloud modeling architectures and processing techniques.
    Abstract The latest trends in the research field of single-view human reconstruction devote to learning deep implicit functions constrained by explicit body shape priors. Despite the remarkable performance improvements compared with traditional processing pipelines, existing learning approaches still show different aspects of limitations in terms of flexibility, generalizability, robustness, and/or representation capability. To comprehensively address the above issues, in this paper, we investigate an explicit point-based human reconstruction framework called HaP, which adopts point clouds as the intermediate representation of the target geometric structure. Technically, our approach is featured by fully-explicit point cloud estimation, manipulation, generation, and refinement in the 3D geometric space, instead of an implicit learning process that can be ambiguous and less controllable. The overall workflow is carefully organized with dedicated designs of the corresponding specialized learning components as well as processing procedures. Extensive experiments demonstrate that our framework achieves quantitative performance improvements of 20% to 40% over current state-of-the-art methods, and better qualitative results. Our promising results may indicate a paradigm rollback to the fully-explicit and geometry-centric algorithm design, which enables to exploit various powerful point cloud modeling architectures and processing techniques. We will make our code and data publicly available at https://github.com/yztang4/HaP.
    摘要 单视图人体重建领域的最新研究趋势,是在显式人体形状先验约束下学习深度隐式函数。尽管相比传统处理流程性能显著提升,现有学习方法在灵活性、泛化性、鲁棒性和表示能力等方面仍存在不同程度的局限。为全面解决上述问题,本文研究了一种名为 HaP 的显式基于点的人体重建框架,它以点云作为目标几何结构的中间表示。在技术上,我们的方法的特点是在三维几何空间中进行完全显式的点云估计、操作、生成与细化,而非可能含糊且难以控制的隐式学习过程。整体流程经过精心组织,并为相应环节设计了专门的学习组件和处理步骤。大量实验表明,我们的框架相比当前最先进方法取得了20%到40%的定量性能提升,并获得了更好的定性结果。这些令人鼓舞的结果或许预示着向完全显式、以几何为中心的算法设计范式的回归,从而能够利用各种强大的点云建模架构和处理技术。我们将在 https://github.com/yztang4/HaP 公开代码和数据。

Stacked Autoencoder Based Feature Extraction and Superpixel Generation for Multifrequency PolSAR Image Classification

  • paper_url: http://arxiv.org/abs/2311.02887
  • repo_url: None
  • paper_authors: Tushar Gadhiya, Sumanth Tangirala, Anil K. Roy
  • for: 本研究提出了一种针对多频极化合成孔径雷达(PolSAR)图像的分类算法。
  • methods: 使用PolSAR分解算法从每个频段提取33个特征;随后使用两层自编码器在保留有用信息的同时降低输入特征向量的维度;接着使用SLIC算法生成超像素,并结合像素与超像素信息构建鲁棒的特征表示;最后使用softmax分类器完成分类任务。
  • results: 在Flevoland数据集上进行了实验,结果表明所提方法优于文献中已有的方法。
    Abstract In this paper we are proposing classification algorithm for multifrequency Polarimetric Synthetic Aperture Radar (PolSAR) image. Using PolSAR decomposition algorithms 33 features are extracted from each frequency band of the given image. Then, a two-layer autoencoder is used to reduce the dimensionality of input feature vector while retaining useful features of the input. This reduced dimensional feature vector is then applied to generate superpixels using simple linear iterative clustering (SLIC) algorithm. Next, a robust feature representation is constructed using both pixel as well as superpixel information. Finally, softmax classifier is used to perform classification task. The advantage of using superpixels is that it preserves spatial information between neighbouring PolSAR pixels and therefore minimises the effect of speckle noise during classification. Experiments have been conducted on Flevoland dataset and the proposed method was found to be superior to other methods available in the literature.
    摘要 本文提出了一种针对多频极化合成孔径雷达(PolSAR)图像的分类算法。首先,利用PolSAR分解算法从图像的每个频段提取33个特征;然后,使用两层自编码器在保留输入有用信息的同时降低特征向量的维度;接着,将降维后的特征向量输入简单线性迭代聚类(SLIC)算法生成超像素,并结合像素与超像素信息构建鲁棒的特征表示;最后,使用softmax分类器完成分类任务。使用超像素的优势在于能够保留相邻PolSAR像素之间的空间信息,从而在分类过程中降低斑点噪声的影响。我们在Flevoland数据集上进行了实验,结果表明所提方法优于文献中已有的方法。
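A hedged sketch of the pipeline described above follows (features, two-layer autoencoder, SLIC superpixels, pixel-plus-superpixel features, softmax classifier). The random input features, layer sizes, and toy labels are placeholders for real PolSAR decompositions; it requires scikit-image and scikit-learn.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic
from sklearn.linear_model import LogisticRegression

H, W, D = 32, 32, 33                        # image size, 33 features per band
feats = torch.rand(H * W, D)                # stand-in for decomposition features

# Two-layer autoencoder for dimensionality reduction (33 -> 16 -> 8).
enc = nn.Sequential(nn.Linear(D, 16), nn.ReLU(), nn.Linear(16, 8))
dec = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, D))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), 1e-2)
for _ in range(200):                        # reconstruction training
    opt.zero_grad()
    loss = ((dec(enc(feats)) - feats) ** 2).mean()
    loss.backward()
    opt.step()

z = enc(feats).detach().numpy().reshape(H, W, 8)

# SLIC superpixels on the reduced features; average within each superpixel.
labels = slic(z, n_segments=50, compactness=0.1, channel_axis=-1)
sp_mean = np.zeros_like(z)
for s in np.unique(labels):
    sp_mean[labels == s] = z[labels == s].mean(axis=0)

# Combine pixel and superpixel information; softmax classifier
# (multinomial logistic regression) on toy labels.
x = np.concatenate([z, sp_mean], axis=-1).reshape(-1, 16)
y = np.random.randint(0, 4, size=H * W)
clf = LogisticRegression(max_iter=200).fit(x, y)
print("train accuracy on toy data:", clf.score(x, y))
```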

Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box

  • paper_url: http://arxiv.org/abs/2311.02877
  • repo_url: None
  • paper_authors: Hao Zhang, Cong Xu, Shuaijie Zhang
  • for: This paper aims to improve the bounding box regression process in object detection by proposing a new loss function called Inner-IoU loss.
  • methods: The paper analyzes the BBR model and proposes using different scales of auxiliary bounding boxes to calculate losses, as well as introducing a scaling factor ratio to control the scale size of the auxiliary bounding boxes.
  • results: The proposed Inner-IoU loss function enhances the detection performance of object detection models, demonstrating its effectiveness and generalization ability.
    Abstract With the rapid development of detectors, Bounding Box Regression (BBR) loss function has constantly updated and optimized. However, the existing IoU-based BBR still focus on accelerating convergence by adding new loss terms, ignoring the limitations of IoU loss term itself. Although theoretically IoU loss can effectively describe the state of bounding box regression,in practical applications, it cannot adjust itself according to different detectors and detection tasks, and does not have strong generalization. Based on the above, we first analyzed the BBR model and concluded that distinguishing different regression samples and using different scales of auxiliary bounding boxes to calculate losses can effectively accelerate the bounding box regression process. For high IoU samples, using smaller auxiliary bounding boxes to calculate losses can accelerate convergence, while larger auxiliary bounding boxes are suitable for low IoU samples. Then, we propose Inner-IoU loss, which calculates IoU loss through auxiliary bounding boxes. For different datasets and detectors, we introduce a scaling factor ratio to control the scale size of the auxiliary bounding boxes for calculating losses. Finally, integrate Inner-IoU into the existing IoU-based loss functions for simulation and comparative experiments. The experiment result demonstrate a further enhancement in detection performance with the utilization of the method proposed in this paper, verifying the effectiveness and generalization ability of Inner-IoU loss.
    摘要 随着检测器的快速发展,边界框回归(BBR)损失函数不断被更新和优化。然而,现有基于IoU的BBR仍侧重于通过加入新的损失项来加速收敛,而忽视了IoU损失项自身的局限性。尽管理论上IoU损失能够有效刻画边界框回归的状态,但在实际应用中,它无法根据不同的检测器和检测任务进行自我调整,泛化能力不强。基于上述分析,我们首先分析了BBR模型,得出结论:区分不同的回归样本,并使用不同尺度的辅助边界框计算损失,可以有效加速边界框回归过程。对高IoU样本,使用较小的辅助边界框计算损失可加速收敛;较大的辅助边界框则适用于低IoU样本。随后,我们提出了Inner-IoU损失,即通过辅助边界框计算IoU损失;针对不同的数据集和检测器,我们引入缩放因子比例(ratio)来控制用于计算损失的辅助边界框的尺度。最后,我们将Inner-IoU集成到现有的基于IoU的损失函数中进行仿真与对比实验。实验结果表明,使用本文所提方法能进一步提升检测性能,验证了Inner-IoU损失的有效性与泛化能力。
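The following sketch illustrates the Inner-IoU idea: shrink or grow both boxes about their centers with a scaling ratio, then compute IoU on these auxiliary boxes. It follows the paper's description only loosely and should be read as an illustration, not the reference loss.

```python
import torch

def inner_iou(box1, box2, ratio=0.8, eps=1e-7):
    """Boxes are (..., 4) tensors in (x1, y1, x2, y2) format."""
    def aux(box):
        cx = (box[..., 0] + box[..., 2]) / 2
        cy = (box[..., 1] + box[..., 3]) / 2
        w = (box[..., 2] - box[..., 0]) * ratio   # scaled auxiliary box
        h = (box[..., 3] - box[..., 1]) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    x1a, y1a, x2a, y2a = aux(box1)
    x1b, y1b, x2b, y2b = aux(box2)
    iw = (torch.min(x2a, x2b) - torch.max(x1a, x1b)).clamp(min=0)
    ih = (torch.min(y2a, y2b) - torch.max(y1a, y1b)).clamp(min=0)
    inter = iw * ih
    union = (x2a - x1a) * (y2a - y1a) + (x2b - x1b) * (y2b - y1b) - inter
    return inter / (union + eps)

# ratio < 1 gives smaller auxiliary boxes (suited to high-IoU samples);
# ratio > 1 gives larger ones (suited to low-IoU samples).
pred = torch.tensor([[10., 10., 50., 50.]])
gt = torch.tensor([[12., 12., 52., 52.]])
print(1.0 - inner_iou(pred, gt, ratio=0.8))   # Inner-IoU loss value
```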

Dynamic Neural Fields for Learning Atlases of 4D Fetal MRI Time-series

  • paper_url: http://arxiv.org/abs/2311.02874
  • repo_url: https://github.com/kidrauh/neural-atlasing
  • paper_authors: Zeen Chi, Zhongxiao Cong, Clinton J. Wang, Yingcheng Liu, Esra Abaci Turk, P. Ellen Grant, S. Mazdak Abulnaga, Polina Golland, Neel Dey
  • for: 使用神经场快速构建生物医学影像图谱(atlas)。
  • methods: 将个体图谱构建表述为学习一个可形变时空观测的神经场,并应用于宫内胎儿动态BOLD MRI时间序列的个体化图谱构建与运动稳定。
  • results: 对胎儿动态BOLD MRI时间序列构建出高质量的个体化图谱,收敛速度比现有方法快约5-7倍,但在解剖重叠度上略逊于精心调参的基线。
    Abstract We present a method for fast biomedical image atlas construction using neural fields. Atlases are key to biomedical image analysis tasks, yet conventional and deep network estimation methods remain time-intensive. In this preliminary work, we frame subject-specific atlas building as learning a neural field of deformable spatiotemporal observations. We apply our method to learning subject-specific atlases and motion stabilization of dynamic BOLD MRI time-series of fetuses in utero. Our method yields high-quality atlases of fetal BOLD time-series with $\sim$5-7$\times$ faster convergence compared to existing work. While our method slightly underperforms well-tuned baselines in terms of anatomical overlap, it estimates templates significantly faster, thus enabling rapid processing and stabilization of large databases of 4D dynamic MRI acquisitions. Code is available at https://github.com/Kidrauh/neural-atlasing
    摘要 我们提出了一种使用神经场快速构建生物医学影像图谱的方法。图谱是生物医学影像分析任务的关键,但传统方法和深度网络估计方法仍然耗时。在这项初步工作中,我们将个体图谱构建表述为学习一个可形变时空观测的神经场,并将其应用于宫内胎儿动态BOLD MRI时间序列的个体图谱学习与运动稳定。我们的方法能够生成高质量的胎儿BOLD时间序列图谱,收敛速度比现有工作快约5-7倍。虽然在解剖重叠度上略逊于精心调参的基线,但其模板估计速度显著更快,从而能够快速处理和稳定大规模的4D动态MRI采集数据库。代码见 https://github.com/Kidrauh/neural-atlasing。

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

  • paper_url: http://arxiv.org/abs/2311.02873
  • repo_url: https://github.com/shiyoung77/ovir-3d
  • paper_authors: Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boularias, Kostas Bekris
  • for: 提出了一种开放词汇(open-vocabulary)的3D物体实例检索方法,给定文本查询即可返回排序后的3D实例分割结果,且无需使用任何3D数据进行训练。
  • methods: 方法将文本对齐的2D区域提案通过多视图融合投影到3D空间,得到3D物体实例分割;其中2D区域提案网络可以利用比3D数据集更易获取、规模更大的2D数据集,检索时依据实例特征与文本查询的相似度进行排序。
  • results: 实验结果表明,该融合过程对大多数室内3D场景可实时运行,且无需在3D空间进行额外训练;在公开数据集和真实机器人上的实验验证了该方法的有效性及其在机器人导航与操作中的应用潜力。
    Abstract This work presents OVIR-3D, a straightforward yet effective method for open-vocabulary 3D object instance retrieval without using any 3D data for training. Given a language query, the proposed method is able to return a ranked set of 3D object instance segments based on the feature similarity of the instance and the text query. This is achieved by a multi-view fusion of text-aligned 2D region proposals into 3D space, where the 2D region proposal network could leverage 2D datasets, which are more accessible and typically larger than 3D datasets. The proposed fusion process is efficient as it can be performed in real-time for most indoor 3D scenes and does not require additional training in 3D space. Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation.
    摘要 这项工作提出了OVIR-3D,一种简单而有效的开放词汇3D物体实例检索方法,无需使用任何3D数据进行训练。给定一个语言查询,所提方法能够依据实例特征与文本查询之间的相似度,返回排序后的3D物体实例分割结果。这是通过将文本对齐的2D区域提案多视图融合到3D空间实现的,其中2D区域提案网络可以利用通常比3D数据集更易获取且规模更大的2D数据集。所提融合过程十分高效,对大多数室内3D场景可实时运行,且无需在3D空间进行额外训练。在公开数据集和真实机器人上的实验表明了该方法的有效性及其在机器人导航与操作中的应用潜力。
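The final retrieval step, ranking fused 3D instance features by similarity to a text-query embedding, can be sketched as below. The random vectors stand in for CLIP-style embeddings, and the region-proposal and multi-view fusion stages are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512
instance_feats = rng.normal(size=(20, D))   # fused per-instance features
text_feat = rng.normal(size=(D,))           # embedding of the language query

def rank_instances(instance_feats, text_feat, top_k=5):
    """Return the top-k instance ids by cosine similarity to the query."""
    a = instance_feats / np.linalg.norm(instance_feats, axis=1, keepdims=True)
    b = text_feat / np.linalg.norm(text_feat)
    sims = a @ b
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

ids, scores = rank_instances(instance_feats, text_feat)
print(list(zip(ids.tolist(), np.round(scores, 3).tolist())))
```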

FocusTune: Tuning Visual Localization through Focus-Guided Sampling

  • paper_url: http://arxiv.org/abs/2311.02872
  • repo_url: https://github.com/sontung/focus-tune
  • paper_authors: Son Tung Nguyen, Alejandro Fontan, Michael Milford, Tobias Fischer
  • for: 提升视觉定位算法的性能。
  • methods: 采用聚焦引导(focus-guided)的采样技术:将3D场景坐标重投影到2D图像平面,并在重投影点的局部邻域内采样,从而引导场景坐标回归模型关注对3D点三角化至关重要的区域。
  • results: 达到或超越现有最先进模型的性能,同时保持ACE模型低存储、低计算的优点;例如在 Cambridge Landmarks 数据集上,将单模型和集成模型的平移误差分别从25 cm降至19 cm、从17 cm降至15 cm,提升了在移动机器人和增强现实等领域应用的可行性。
    Abstract We propose FocusTune, a focus-guided sampling technique to improve the performance of visual localization algorithms. FocusTune directs a scene coordinate regression model towards regions critical for 3D point triangulation by exploiting key geometric constraints. Specifically, rather than uniformly sampling points across the image for training the scene coordinate regression model, we instead re-project 3D scene coordinates onto the 2D image plane and sample within a local neighborhood of the re-projected points. While our proposed sampling strategy is generally applicable, we showcase FocusTune by integrating it with the recently introduced Accelerated Coordinate Encoding (ACE) model. Our results demonstrate that FocusTune both improves or matches state-of-the-art performance whilst keeping ACE's appealing low storage and compute requirements, for example reducing translation error from 25 to 19 and 17 to 15 cm for single and ensemble models, respectively, on the Cambridge Landmarks dataset. This combination of high performance and low compute and storage requirements is particularly promising for applications in areas like mobile robotics and augmented reality. We made our code available at \url{https://github.com/sontung/focus-tune}.
    摘要 我们提出了FocusTune,一种用于提升视觉定位算法性能的聚焦引导采样技术。FocusTune利用关键的几何约束,引导场景坐标回归模型关注对3D点三角化至关重要的区域。具体而言,我们不再在整幅图像上均匀采样点来训练场景坐标回归模型,而是将3D场景坐标重投影到2D图像平面,并在重投影点的局部邻域内采样。虽然所提采样策略具有普适性,我们以最近提出的加速坐标编码(ACE)模型为例来展示FocusTune。结果表明,FocusTune在保持ACE低存储与低计算需求的同时,达到或超越了最先进的性能;例如在 Cambridge Landmarks 数据集上,将单模型和集成模型的平移误差分别从25 cm降至19 cm、从17 cm降至15 cm。这种高性能与低计算、低存储需求的组合,对移动机器人和增强现实等应用尤其具有前景。我们的代码公布于 \url{https://github.com/sontung/focus-tune}。
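A small sketch of focus-guided sampling follows: re-project 3D scene points into the image with the camera intrinsics, then draw training pixels from local neighborhoods of the re-projections instead of sampling uniformly. The intrinsics and points are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])  # intrinsics
pts_cam = rng.uniform([-1, -1, 2], [1, 1, 6], size=(100, 3))      # 3D points

def focus_guided_samples(pts_cam, K, n_per_point=8, radius=5,
                         width=640, height=480):
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                    # perspective projection
    # Sample pixels in a local neighborhood of each re-projected point.
    offsets = rng.integers(-radius, radius + 1, size=(len(uv), n_per_point, 2))
    samples = (uv[:, None, :] + offsets).reshape(-1, 2)
    in_img = (samples[:, 0] >= 0) & (samples[:, 0] < width) \
           & (samples[:, 1] >= 0) & (samples[:, 1] < height)
    return samples[in_img].astype(int)             # pixels to train on

pix = focus_guided_samples(pts_cam, K)
print(pix.shape, pix[:3])
```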

Neural-based Compression Scheme for Solar Image Data

  • paper_url: http://arxiv.org/abs/2311.02855
  • repo_url: None
  • paper_authors: Ali Zafari, Atefeh Khoshkhahtinat, Jeremy A. Grajeda, Piyush M. Mehta, Nasser M. Nasrabadi, Laura E. Boucheron, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva
  • for: The paper proposes a neural network-based lossy compression method for data-intensive imagery missions, with NASA's SDO mission as a proof of concept.
  • methods: The method uses an adversarially trained neural network with local and non-local attention modules to capture the local and global structure of the image, yielding a better rate-distortion (RD) trade-off than conventional hand-engineered codecs. The RD variational autoencoder is jointly trained with a channel-dependent entropy model as a shared prior between the analysis and synthesis transforms, making the entropy coding of the latent code more effective.
  • results: The proposed algorithm outperforms currently-in-use and state-of-the-art codecs such as JPEG and JPEG-2000 in RD performance when compressing extreme-ultraviolet (EUV) data, and yields consistent coronal hole (CH) segmentations even at a compression rate of $\sim0.1$ bits per pixel.
    Abstract Studying the solar system and especially the Sun relies on the data gathered daily from space missions. These missions are data-intensive and compressing this data to make them efficiently transferable to the ground station is a twofold decision to make. Stronger compression methods, by distorting the data, can increase data throughput at the cost of accuracy which could affect scientific analysis of the data. On the other hand, preserving subtle details in the compressed data requires a high amount of data to be transferred, reducing the desired gains from compression. In this work, we propose a neural network-based lossy compression method to be used in NASA's data-intensive imagery missions. We chose NASA's SDO mission which transmits 1.4 terabytes of data each day as a proof of concept for the proposed algorithm. In this work, we propose an adversarially trained neural network, equipped with local and non-local attention modules to capture both the local and global structure of the image resulting in a better trade-off in rate-distortion (RD) compared to conventional hand-engineered codecs. The RD variational autoencoder used in this work is jointly trained with a channel-dependent entropy model as a shared prior between the analysis and synthesis transforms to make the entropy coding of the latent code more effective. Our neural image compression algorithm outperforms currently-in-use and state-of-the-art codecs such as JPEG and JPEG-2000 in terms of the RD performance when compressing extreme-ultraviolet (EUV) data. As a proof of concept for use of this algorithm in SDO data analysis, we have performed coronal hole (CH) detection using our compressed images, and generated consistent segmentations, even at a compression rate of $\sim0.1$ bits per pixel (compared to 8 bits per pixel on the original data) using EUV data from SDO.
    摘要 研究太阳系、尤其是太阳,依赖于空间任务每日采集的数据。这些任务数据量巨大,而压缩数据以便高效传回地面站是一个需要权衡的决策:更强的压缩方法会使数据失真,以牺牲精度为代价提升数据吞吐量,从而可能影响数据的科学分析;另一方面,在压缩数据中保留细微细节又需要传输大量数据,削弱压缩带来的收益。在这项工作中,我们提出了一种基于神经网络的有损压缩方法,用于NASA的数据密集型成像任务,并以每天传输1.4 TB数据的SDO任务作为概念验证。我们提出一种对抗训练的神经网络,配备局部与非局部注意力模块以捕捉图像的局部与全局结构,从而在率失真(RD)上取得优于传统手工设计编解码器的折中。本工作中使用的RD变分自编码器与一个按通道的熵模型联合训练,该熵模型作为分析与合成变换之间的共享先验,使潜在编码的熵编码更加有效。在压缩极紫外(EUV)数据时,我们的神经图像压缩算法在RD性能上优于现役及最先进的编解码器(如JPEG和JPEG-2000)。作为该算法用于SDO数据分析的概念验证,我们在压缩图像上进行了日冕洞(CH)检测,即使在约0.1比特/像素的压缩率下(原始数据为8比特/像素),也能得到一致的分割结果。

Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video

  • paper_url: http://arxiv.org/abs/2311.02848
  • repo_url: https://github.com/yanqinJiang/Consistent4D
  • paper_authors: Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, Yao Yao
  • for: 本研究提出了一种从未标定单目视频生成4D动态物体的新方法。
  • methods: 将360度动态物体重建表述为一个4D生成问题,从而无需繁琐的多视角数据采集与相机标定。具体而言,以物体级3D感知图像扩散模型作为主要监督信号来训练动态神经辐射场(DyNeRF);提出级联DyNeRF(Cascade DyNeRF)以在沿时间轴离散的监督信号下实现稳定收敛与时间连续性;并进一步引入插值驱动的一致性损失(Interpolation-driven Consistency Loss)以保证空间与时间一致性。
  • results: Consistent4D的性能可与先前方法相媲美,为从单目视频生成4D动态物体开辟了新的可能,并在常规的文本到3D生成任务中同样表现出色。项目页面:https://consistent4d.github.io/。
    Abstract In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an Interpolation-driven Consistency Loss. It is optimized by minimizing the discrepancy between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. Extensive experiments show that our Consistent4D can perform competitively to prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating advantage for conventional text-to-3D generation tasks. Our project page is https://consistent4d.github.io/.
    摘要 本文提出了Consistent4D,一种从未标定单目视频生成4D动态物体的新方法。我们独特地将360度动态物体重建视为一个4D生成问题,从而无需繁琐的多视角数据采集与相机标定。具体而言,我们以物体级3D感知图像扩散模型作为主要监督信号来训练动态神经辐射场(DyNeRF),并提出级联DyNeRF以在沿时间轴离散的监督信号下实现稳定收敛与时间连续性。为进一步保证空间与时间一致性,我们引入插值驱动的一致性损失,通过最小化DyNeRF渲染帧与预训练视频插值模型生成的插值帧之间的差异来优化。大量实验表明,Consistent4D的性能可与先前方法相媲美,为从单目视频生成4D动态物体开辟了新的可能,同时在常规的文本到3D生成任务中也展现出优势。项目页面:https://consistent4d.github.io/。
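The interpolation-driven consistency loss can be sketched as follows, with stand-in functions for the DyNeRF renderer and the frozen pre-trained video interpolation model (here a simple linear blend).

```python
import torch

g = torch.Generator().manual_seed(0)

def render(t):
    """Stand-in for DyNeRF rendering a frame at continuous time t."""
    return torch.sigmoid(torch.randn(3, 64, 64, generator=g) + t)

def video_interp(f0, f1, alpha):
    """Stand-in for a pre-trained interpolation model (linear blend)."""
    return (1 - alpha) * f0 + alpha * f1

t0, t1 = 0.0, 1.0
f0, f1 = render(t0), render(t1)

alpha = 0.5
rendered_mid = render(t0 + alpha * (t1 - t0))
interp_mid = video_interp(f0, f1, alpha).detach()   # supervision, no grads

# Penalize discrepancy between rendered and interpolated frames.
consistency_loss = torch.mean((rendered_mid - interp_mid) ** 2)
print(consistency_loss.item())
```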

Leveraging sinusoidal representation networks to predict fMRI signals from EEG

  • paper_url: http://arxiv.org/abs/2311.04234
  • repo_url: None
  • paper_authors: Yamin Li, Ange Lou, Catie Chang
  • for: 这篇论文旨在用多通道EEG预测fMRI信号,以弥补EEG空间分辨率的不足并拓展fMRI的应用范围。
  • methods: 论文采用一种正弦表示网络(SIREN)从EEG中学习脑动态的频率信息以减少特征工程,并使用编码器-解码器结构重建特定脑区的fMRI信号。
  • results: 在包含8名被试的同步EEG-fMRI数据集上,该模型表现出色,超越了近期的最先进模型,表明在深度神经网络中使用周期激活函数建模功能神经成像数据具有潜力。
    Abstract In modern neuroscience, functional magnetic resonance imaging (fMRI) has been a crucial and irreplaceable tool that provides a non-invasive window into the dynamics of whole-brain activity. Nevertheless, fMRI is limited by hemodynamic blurring as well as high cost, immobility, and incompatibility with metal implants. Electroencephalography (EEG) is complementary to fMRI and can directly record the cortical electrical activity at high temporal resolution, but has more limited spatial resolution and is unable to recover information about deep subcortical brain structures. The ability to obtain fMRI information from EEG would enable cost-effective, imaging across a wider set of brain regions. Further, beyond augmenting the capabilities of EEG, cross-modality models would facilitate the interpretation of fMRI signals. However, as both EEG and fMRI are high-dimensional and prone to artifacts, it is currently challenging to model fMRI from EEG. To address this challenge, we propose a novel architecture that can predict fMRI signals directly from multi-channel EEG without explicit feature engineering. Our model achieves this by implementing a Sinusoidal Representation Network (SIREN) to learn frequency information in brain dynamics from EEG, which serves as the input to a subsequent encoder-decoder to effectively reconstruct the fMRI signal from a specific brain region. We evaluate our model using a simultaneous EEG-fMRI dataset with 8 subjects and investigate its potential for predicting subcortical fMRI signals. The present results reveal that our model outperforms a recent state-of-the-art model, and indicates the potential of leveraging periodic activation functions in deep neural networks to model functional neuroimaging data.
    摘要 在现代神经科学中,功能磁共振成像(fMRI)是一种关键且不可替代的工具,为观察全脑活动的动态提供了非侵入式窗口。然而,fMRI受血流动力学模糊的限制,且成本高、设备不可移动、与金属植入物不兼容。脑电图(EEG)与fMRI互补,能够以高时间分辨率直接记录皮层电活动,但其空间分辨率较为有限,且无法恢复深部皮层下脑结构的信息。若能从EEG中获取fMRI信息,便可以低成本地对更广泛的脑区进行成像;此外,跨模态模型还有助于解释fMRI信号。然而,由于EEG和fMRI均为高维数据且容易受伪迹干扰,目前从EEG建模fMRI仍具挑战。为应对这一挑战,我们提出了一种新架构,无需显式特征工程即可直接从多通道EEG预测fMRI信号。该模型通过正弦表示网络(SIREN)从EEG中学习脑动态的频率信息,并将其作为后续编码器-解码器的输入,从而有效重建特定脑区的fMRI信号。我们在包含8名被试的同步EEG-fMRI数据集上评估该模型,并研究其预测皮层下fMRI信号的潜力。结果显示,我们的模型优于近期的最先进模型,表明在深度神经网络中利用周期激活函数建模功能神经成像数据具有潜力。
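A minimal SIREN building block of the kind referenced above, a linear layer with a scaled sinusoidal activation and the initialization from the original SIREN paper, might look like this. The input/output sizes are arbitrary stand-ins, not the paper's EEG/fMRI shapes.

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_f, out_f, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_f, out_f)
        with torch.no_grad():                       # SIREN init scheme
            bound = 1 / in_f if is_first else math.sqrt(6 / in_f) / w0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))  # periodic activation

siren = nn.Sequential(
    SineLayer(64, 128, is_first=True),  # e.g. a window of EEG features
    SineLayer(128, 128),
    nn.Linear(128, 1),                  # e.g. fMRI signal of one region
)
print(siren(torch.randn(8, 64)).shape)  # torch.Size([8, 1])
```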

Flexible Multi-Generator Model with Fused Spatiotemporal Graph for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2311.02835
  • repo_url: None
  • paper_authors: Peiyuan Zhu, Fengxia Han, Hao Deng
  • for: 轨迹预测在自动驾驶系统中扮演着关键角色,有助于实现精确的跟踪与决策。
  • methods: 我们提出了一种轨迹预测框架,能够捕捉行人之间社交交互的变化,并对不连通的流形进行建模。框架基于融合时空图来更好地建模场景中行人的复杂交互,并采用多生成器架构,其中包含一个灵活的生成器选择网络,用于学习多个生成器上的分布。
  • results: 我们的框架在多个具有挑战性的数据集上均优于多种基准方法,达到了最先进的性能。
    Abstract Trajectory prediction plays a vital role in automotive radar systems, facilitating precise tracking and decision-making in autonomous driving. Generative adversarial networks with the ability to learn a distribution over future trajectories tend to predict out-of-distribution samples, which typically occurs when the distribution of forthcoming paths comprises a blend of various manifolds that may be disconnected. To address this issue, we propose a trajectory prediction framework, which can capture the social interaction variations and model disconnected manifolds of pedestrian trajectories. Our framework is based on a fused spatiotemporal graph to better model the complex interactions of pedestrians in a scene, and a multi-generator architecture that incorporates a flexible generator selector network on generated trajectories to learn a distribution over multiple generators. We show that our framework achieves state-of-the-art performance compared with several baselines on different challenging datasets.
    摘要 轨迹预测在汽车雷达系统中起着至关重要的作用,有助于自动驾驶中的精确跟踪与决策。能够学习未来轨迹分布的生成对抗网络往往会预测出分布外的样本,这通常发生在未来路径的分布由多个可能彼此不连通的流形混合而成时。为解决这一问题,我们提出了一种轨迹预测框架,能够捕捉社交交互的变化,并对行人轨迹的不连通流形进行建模。我们的框架基于融合时空图,以更好地建模场景中行人的复杂交互,并采用多生成器架构,其中包含一个作用于生成轨迹的灵活生成器选择网络,用于学习多个生成器上的分布。实验表明,我们的框架在多个具有挑战性的数据集上优于多种基准方法,达到了最先进的性能。

SemanticTopoLoop: Semantic Loop Closure With 3D Topological Graph Based on Quadric-Level Object Map

  • paper_url: http://arxiv.org/abs/2311.02831
  • repo_url: None
  • paper_authors: Zhenzhong Cao
  • for: 提升SLAM系统在真实场景中的精度与鲁棒性。
  • methods: 提出一种基于多级验证的物体级数据关联方法,以及一种基于二次曲面级物体地图拓扑的语义回环检测方法。
  • results: 在宽视场下实现高精度回环检测;与现有最先进方法相比,在精确率、召回率和定位精度等指标上均取得更优结果。
    Abstract Loop closure, as one of the crucial components in SLAM, plays an essential role in correcting the accumulated errors. Traditional appearance-based methods, such as bag-of-words models, are often limited by local 2D features and the volume of training data, making them less versatile and robust in real-world scenarios, leading to missed detections or false positives detections in loop closure. To address these issues, we first propose a object-level data association method based on multi-level verification, which can associate 2D semantic features of current frame with 3D objects landmarks of map. Next, taking advantage of these association relations, we introduce a semantic loop closure method based on quadric-level object map topology, which represents scenes through the topological graph of objects and achieves accurate loop closure at a wide field of view by comparing differences in the topological graphs. Finally, we integrate these two methods into a complete object-aware SLAM system. Qualitative experiments and ablation studies demonstrate the effectiveness and robustness of the proposed object-level data association algorithm. Quantitative experiments show that our semantic loop closure method outperforms existing state-of-the-art methods in terms of precision, recall and localization accuracy metrics.
    摘要 回环检测作为SLAM的关键组成部分之一,在校正累积误差方面发挥着至关重要的作用。传统的基于外观的方法(如词袋模型)往往受限于局部2D特征和训练数据的规模,在真实场景中的通用性与鲁棒性不足,导致回环检测中出现漏检或误检。为解决这些问题,我们首先提出了一种基于多级验证的物体级数据关联方法,能够将当前帧的2D语义特征与地图中的3D物体路标相关联。接着,利用这些关联关系,我们提出了一种基于二次曲面级物体地图拓扑的语义回环检测方法,通过物体拓扑图来表示场景,并通过比较拓扑图之间的差异,在宽视场下实现精确的回环检测。最后,我们将这两种方法集成为一个完整的物体感知SLAM系统。定性实验与消融研究证明了所提物体级数据关联算法的有效性与鲁棒性;定量实验表明,我们的语义回环检测方法在精确率、召回率和定位精度指标上优于现有最先进方法。

InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image

  • paper_url: http://arxiv.org/abs/2311.02826
  • repo_url: https://github.com/mybabyyh/instructpix2nerf
  • paper_authors: Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, Jun Zhu
  • for: This paper aims to solve the problem of human-instructed 3D-aware portrait editing for open-world images, which has been under-explored due to the lack of labeled human face 3D datasets and effective architectures.
  • methods: The proposed method, InstructPix2NeRF, is an end-to-end diffusion-based framework that enables instructed 3D-aware portrait editing from a single open-world image with human instructions. It uses a conditional latent 3D diffusion process to lift 2D editing to 3D space and learn the correlation between the paired images’ difference and the instructions via triplet data.
  • results: The proposed method achieves effective and multi-semantic editing through one single pass with the portrait identity well-preserved. Additionally, an identity consistency module is proposed to increase the multi-view 3D identity consistency. Extensive experiments show the effectiveness of the method and its superiority against strong baselines quantitatively and qualitatively.
    Abstract With the success of Neural Radiance Field (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results regarding both quality and 3D consistency. However, these methods heavily rely on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled human face 3D datasets and effective architectures, the area of human-instructed 3D-aware editing for open-world portraits in an end-to-end manner remains under-explored. To solve this problem, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With the help of our proposed token position randomization strategy, we could even achieve multi-semantic editing through one single pass with the portrait identity well-preserved. Besides, we further propose an identity consistency module that directly modulates the extracted identity signals into our diffusion process, which increases the multi-view 3D identity consistency. Extensive experiments verify the effectiveness of our method and show its superiority against strong baselines quantitatively and qualitatively.
    摘要 随着神经辐射场(NeRF)在3D感知肖像编辑中的成功,诸多工作在质量与3D一致性方面都取得了可喜的成果。然而,这些方法在以自然语言作为编辑指令时严重依赖逐提示优化。由于缺乏带标注的人脸3D数据集和有效的架构,以端到端方式对开放世界肖像进行人类指令驱动的3D感知编辑仍缺乏探索。为解决这一问题,我们提出了一种端到端的基于扩散的框架InstructPix2NeRF,能够依据人类指令从单张开放世界图像完成3D感知的肖像编辑。其核心是一个条件隐式3D扩散过程,通过三元组数据学习成对图像差异与指令之间的相关性,将2D编辑提升到3D空间。借助我们提出的词元位置随机化策略,甚至可以在单次前向传播中实现多语义编辑,同时良好保持肖像身份。此外,我们还提出了一个身份一致性模块,将提取的身份信号直接调制到扩散过程中,从而提升多视角3D身份一致性。大量实验验证了该方法的有效性,并在定量与定性两方面均优于强基线。

Efficient, Self-Supervised Human Pose Estimation with Inductive Prior Tuning

  • paper_url: http://arxiv.org/abs/2311.02815
  • repo_url: https://github.com/princetonvisualai/hpe-inductive-prior-tuning
  • paper_authors: Nobline Yoo, Olga Russakovsky
  • for: 本研究旨在改进自监督二维人体姿态估计(HPE)。
  • methods: 将HPE任务重新表述为重建问题,从而利用海量无标注视觉数据;并将精心设计的重建损失与归纳先验相结合,在自监督范式下协同学习姿态与重建。
  • results: 研究人员通过分析重建质量与姿态估计精度之间的关系,构建了一个优于启发本工作的基线的模型管线,且所用训练数据不足其三分之一;同时提出了一个适用于自监督设置的新指标,用于衡量预测的身体部位长度比例的一致性。
    Abstract The goal of 2D human pose estimation (HPE) is to localize anatomical landmarks, given an image of a person in a pose. SOTA techniques make use of thousands of labeled figures (finetuning transformers or training deep CNNs), acquired using labor-intensive crowdsourcing. On the other hand, self-supervised methods re-frame the HPE task as a reconstruction problem, enabling them to leverage the vast amount of unlabeled visual data, though at the present cost of accuracy. In this work, we explore ways to improve self-supervised HPE. We (1) analyze the relationship between reconstruction quality and pose estimation accuracy, (2) develop a model pipeline that outperforms the baseline which inspired our work, using less than one-third the amount of training data, and (3) offer a new metric suitable for self-supervised settings that measures the consistency of predicted body part length proportions. We show that a combination of well-engineered reconstruction losses and inductive priors can help coordinate pose learning alongside reconstruction in a self-supervised paradigm.
    摘要 二维人体姿态估计(HPE)的目标是在给定人体姿态图像的情况下定位解剖关键点。最先进的技术依赖数千张带标注的人像(微调Transformer或训练深度CNN),这些标注需要通过劳动密集的众包获得。另一方面,自监督方法将HPE任务重新表述为重建问题,从而能够利用海量的无标注视觉数据,但目前以精度为代价。在这项工作中,我们探索改进自监督HPE的途径。我们(1)分析了重建质量与姿态估计精度之间的关系;(2)构建了一个优于启发本工作的基线的模型管线,且所用训练数据不足其三分之一;(3)提出了一个适用于自监督设置的新指标,用于衡量预测的身体部位长度比例的一致性。我们表明,将精心设计的重建损失与归纳先验相结合,有助于在自监督范式中协同姿态学习与重建。
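The proportion-consistency idea, checking that predicted limb-length ratios stay stable across images, could be implemented along these lines. The COCO-style joint indices and the particular choice of ratios are assumptions for illustration, not the paper's exact metric.

```python
import numpy as np

LIMBS = {"upper_arm": (5, 7), "forearm": (7, 9),
         "thigh": (11, 13), "shin": (13, 15)}      # COCO-style joint ids

def limb_lengths(kpts):
    """kpts: (17, 2) predicted keypoints for one image."""
    return {name: np.linalg.norm(kpts[a] - kpts[b])
            for name, (a, b) in LIMBS.items()}

def proportion_consistency(kpts_batch, eps=1e-8):
    """Lower std of limb-length ratios across images = more consistent."""
    ratios = []
    for kpts in kpts_batch:
        L = limb_lengths(kpts)
        ratios.append([L["forearm"] / (L["upper_arm"] + eps),
                       L["shin"] / (L["thigh"] + eps)])
    # Ratios are scale-invariant, so the metric ignores subject distance.
    return np.asarray(ratios).std(axis=0).mean()

rng = np.random.default_rng(0)
preds = rng.uniform(0, 256, size=(10, 17, 2))      # toy predictions
print(proportion_consistency(preds))
```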

Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.02803
  • repo_url: None
  • paper_authors: Hai Phan, Cindy Le, Vu Le, Yihui He, Anh Totti Nguyen
  • for: 本研究旨在提升人脸识别在分布外数据上的精度与效率,提出了一种基于视觉Transformer(ViT)的新方法。
  • methods: 提出一种双图像视觉Transformer,利用交叉注意力在图像块(patch)级别直接比较两幅图像。
  • results: 在CASIA Webface的200万对图像上训练后,模型在分布外数据上取得与DeepFace-EMD相当的准确率,而推理速度比DeepFace-EMD快两倍以上;人类研究还表明,模型可通过交叉注意力的可视化展现出良好的可解释性。
    Abstract Most face identification approaches employ a Siamese neural network to compare two images at the image embedding level. Yet, this technique can be subject to occlusion (e.g. faces with masks or sunglasses) and out-of-distribution data. DeepFace-EMD (Phan et al. 2022) reaches state-of-the-art accuracy on out-of-distribution data by first comparing two images at the image level, and then at the patch level. Yet, its later patch-wise re-ranking stage admits a large $O(n^3 \log n)$ time complexity (for $n$ patches in an image) due to the optimal transport optimization. In this paper, we propose a novel, 2-image Vision Transformers (ViTs) that compares two images at the patch level using cross-attention. After training on 2M pairs of images on CASIA Webface (Yi et al. 2014), our model performs at a comparable accuracy as DeepFace-EMD on out-of-distribution data, yet at an inference speed more than twice as fast as DeepFace-EMD (Phan et al. 2022). In addition, via a human study, our model shows promising explainability through the visualization of cross-attention. We believe our work can inspire more explorations in using ViTs for face identification.
    摘要 大多数人脸识别方法采用孪生神经网络,在图像嵌入层面比较两幅图像。然而,这种技术容易受到遮挡(例如佩戴口罩或墨镜的人脸)和分布外数据的影响。DeepFace-EMD(Phan et al. 2022)通过先在图像层面、再在图像块层面比较两幅图像,在分布外数据上达到了最先进的精度;但其后续的逐块重排序阶段由于最优传输优化,具有较高的 $O(n^3 \log n)$ 时间复杂度(其中 $n$ 为图像中的块数)。在本文中,我们提出了一种新颖的双图像视觉Transformer(ViT),利用交叉注意力在图像块层面比较两幅图像。在 CASIA Webface(Yi et al. 2014)的200万对图像上训练后,我们的模型在分布外数据上达到了与DeepFace-EMD相当的准确率,而推理速度比DeepFace-EMD(Phan et al. 2022)快两倍以上。此外,人类研究表明,通过交叉注意力的可视化,我们的模型展现出良好的可解释性。我们相信这项工作能够启发更多利用ViT进行人脸识别的探索。
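The patch-level cross-attention comparison might be sketched as below: patches of one face query the patches of the other, and a pooled output feeds a same-person head. Dimensions and the tiny head are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossAttendCompare(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)     # 16x16 RGB patches
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)                      # same-person logit

    def forward(self, patches_a, patches_b):
        qa = self.patch_embed(patches_a)                   # (B, N, dim)
        kb = self.patch_embed(patches_b)
        # Each patch of image A attends over the patches of image B.
        attended, _ = self.cross_attn(query=qa, key=kb, value=kb)
        return self.head(attended.mean(dim=1)).squeeze(-1)

model = CrossAttendCompare()
a = torch.randn(2, 49, 16 * 16 * 3)   # two image pairs, 49 patches each
b = torch.randn(2, 49, 16 * 16 * 3)
print(torch.sigmoid(model(a, b)))     # probability each pair matches
```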

cs.AI - 2023-11-06

Multimodal Stress Detection Using Facial Landmarks and Biometric Signals

  • paper_url: http://arxiv.org/abs/2311.03606
  • repo_url: None
  • paper_authors: Majid Hosseini, Morteza Bodaghi, Ravi Teja Bhupatiraju, Anthony Maida, Raju Gottumukkala
  • for: 这项研究旨在通过结合多种感知技术,提高对个体压力与身心状态的测量。
  • methods: 研究采用多模态学习方法,结合面部关键点和生物指标信号进行压力检测。
  • results: 研究发现,使用晚期融合技术可以达到94.39%的准确率,而早期融合技术进一步达到98.38%的准确率。
    Abstract The development of various sensing technologies is improving measurements of stress and the well-being of individuals. Although progress has been made with single signal modalities like wearables and facial emotion recognition, integrating multiple modalities provides a more comprehensive understanding of stress, given that stress manifests differently across different people. Multi-modal learning aims to capitalize on the strength of each modality rather than relying on a single signal. Given the complexity of processing and integrating high-dimensional data from limited subjects, more research is needed. Numerous research efforts have been focused on fusing stress and emotion signals at an early stage, e.g., feature-level fusion using basic machine learning methods and 1D-CNN Methods. This paper proposes a multi-modal learning approach for stress detection that integrates facial landmarks and biometric signals. We test this multi-modal integration with various early-fusion and late-fusion techniques to integrate the 1D-CNN model from biometric signals and 2-D CNN using facial landmarks. We evaluate these architectures using a rigorous test of models' generalizability using the leave-one-subject-out mechanism, i.e., all samples related to a single subject are left out to train the model. Our findings show that late-fusion achieved 94.39\% accuracy, and early-fusion surpassed it with a 98.38\% accuracy rate. This research contributes valuable insights into enhancing stress detection through a multi-modal approach. The proposed research offers important knowledge in improving stress detection using a multi-modal approach.
    摘要 各类传感技术的发展正不断改进对个体压力与身心状态的测量。尽管可穿戴设备和面部情绪识别等单一信号模态已取得进展,但由于压力在不同人身上的表现各异,融合多种模态能够对压力形成更全面的理解。多模态学习旨在发挥每种模态的优势,而非依赖单一信号。鉴于从有限被试中处理并整合高维数据的复杂性,该方向仍需更多研究。已有大量工作关注在早期阶段融合压力与情绪信号,例如使用基础机器学习方法和1D-CNN进行特征级融合。本文提出了一种融合面部关键点与生物指标信号的多模态压力检测方法。我们采用多种早期融合与晚期融合技术,将基于生物信号的1D-CNN模型与基于面部关键点的2D-CNN模型相结合,并使用留一被试(leave-one-subject-out)机制严格检验模型的泛化能力,即训练时排除与某一被试相关的全部样本。结果显示,晚期融合达到94.39%的准确率,而早期融合进一步达到98.38%。这项研究为通过多模态方法提升压力检测提供了有价值的见解。
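The early- versus late-fusion distinction evaluated above can be shown schematically. The stand-in encoders and data shapes below are invented for clarity; they are not the paper's models.

```python
import torch
import torch.nn as nn

bio = torch.randn(8, 1, 128)        # biometric signal windows (B, C, T)
face = torch.randn(8, 68 * 2)       # 68 facial landmarks, (x, y) each

bio_cnn = nn.Sequential(nn.Conv1d(1, 8, 5), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten())   # -> (B, 8)
face_mlp = nn.Sequential(nn.Linear(68 * 2, 16), nn.ReLU())       # -> (B, 16)

# Early fusion: concatenate modality features, one shared classifier.
early_head = nn.Linear(8 + 16, 2)
early_logits = early_head(torch.cat([bio_cnn(bio), face_mlp(face)], dim=1))

# Late fusion: per-modality classifiers, then combine their predictions.
bio_head, face_head = nn.Linear(8, 2), nn.Linear(16, 2)
late_logits = 0.5 * bio_head(bio_cnn(bio)) + 0.5 * face_head(face_mlp(face))

print(early_logits.shape, late_logits.shape)   # both (8, 2): stress / calm
```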

Brief for the Canada House of Commons Study on the Implications of Artificial Intelligence Technologies for the Canadian Labor Force: Generative Artificial Intelligence Shatters Models of AI and Labor

  • paper_url: http://arxiv.org/abs/2311.03595
  • repo_url: None
  • paper_authors: Morgan R. Frank
  • for: 探讨生成式人工智能技术的发展对劳动力市场的影响,并提出政策建议以适应未来的工作环境。
  • methods: 结合既有预测研究,分析生成式人工智能对劳动力市场的影响,并对现有自动化预测模型的假设进行批判性分析。
  • results: 指出生成式人工智能可能波及一些此前被认为不易被自动化取代的职业;政策制定者应促进工作者的职业适应能力,并鼓励教育机构开发将人工智能作为工具纳入学习的课程。
    Abstract Exciting advances in generative artificial intelligence (AI) have sparked concern for jobs, education, productivity, and the future of work. As with past technologies, generative AI may not lead to mass unemployment. But, unlike past technologies, generative AI is creative, cognitive, and potentially ubiquitous which makes the usual assumptions of automation predictions ill-suited for today. Existing projections suggest that generative AI will impact workers in occupations that were previously considered immune to automation. As AI's full set of capabilities and applications emerge, policy makers should promote workers' career adaptability. This goal requires improved data on job separations and unemployment by locality and job titles in order to identify early-indicators for the workers facing labor disruption. Further, prudent policy should incentivize education programs to accommodate learning with AI as a tool while preparing students for the demands of the future of work.
    摘要 生成式人工智能的激动人心的进展引发了人们对就业、教育、生产率和未来工作的担忧。与以往的技术一样,生成式人工智能未必会导致大规模失业;但与以往技术不同的是,生成式人工智能具有创造性、认知性且可能无处不在,这使得以往自动化预测所依赖的通常假设不再适用于今天。现有预测表明,生成式人工智能将影响此前被认为不易被自动化取代的职业中的工作者。随着人工智能的全部能力和应用逐渐显现,政策制定者应促进工作者的职业适应能力。实现这一目标需要改进按地区和职位划分的离职与失业数据,以便识别面临劳动冲击的工作者的早期指标。此外,审慎的政策应鼓励教育项目将人工智能作为工具纳入学习,同时使学生为未来工作的需求做好准备。

  • paper_url: http://arxiv.org/abs/2311.03583
  • repo_url: None
  • paper_authors: Abbas Mehrabian, Ankit Anand, Hyunjik Kim, Nicolas Sonnerat, Matej Balog, Gheorghe Comanici, Tudor Berariu, Andrew Lee, Anian Ruoss, Anna Bulanova, Daniel Toyama, Sam Blackwell, Bernardino Romera Paredes, Petar Veličković, Laurent Orseau, Joonkyung Lee, Anurag Murty Naredla, Doina Precup, Adam Zsolt Wagner
  • for: 本文研究极值图论中的一个核心问题,其灵感来自Erdős在1975年提出的猜想:在不含3-环或4-环的前提下,寻找给定规模(节点数)下边数最多的图。
  • methods: 将该问题表述为一个连续决策问题,比较AlphaZero(一种神经网络引导的树搜索)与禁忌搜索(一种启发式局部搜索方法),并通过引入课程(利用在较小规模下找到的好图来启动更大规模图的搜索)来改进下界。
  • results: 无论采用哪种方法,通过引入课程均改进了多个规模下的最优已知下界;此外,本文还提出了一个灵活的图生成环境和一种排列不变的网络架构,用于学习在图空间中搜索。
    Abstract This work studies a central extremal graph theory problem inspired by a 1975 conjecture of Erd\H{o}s, which aims to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this problem as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum -- jump-starting the search for larger graphs using good graphs found at smaller sizes -- we improve the state-of-the-art lower bounds for several sizes. We also propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs.
    摘要 本文研究极值图论中的一个核心问题,其灵感来自Erdős在1975年提出的猜想:在不含3-环或4-环的前提下,寻找给定规模(节点数)下边数最多的图。我们将该问题表述为一个连续决策问题,并比较了AlphaZero(一种神经网络引导的树搜索)与禁忌搜索(一种启发式局部搜索方法)。无论采用哪种方法,通过引入课程(利用在较小规模下找到的好图来启动更大规模图的搜索),我们都改进了多个规模下的最优已知下界。我们还提出了一个灵活的图生成环境以及一种排列不变的网络架构,用于学习在图空间中搜索。
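A toy version of the sequential graph-construction setting might look like this: edges are proposed one at a time, a move is legal only if it creates no 3- or 4-cycle, and the score is the edge count. The greedy random policy is a placeholder for AlphaZero or tabu search, and the environment details are assumptions, not the paper's.

```python
import itertools
import numpy as np

def creates_short_cycle(adj, u, v):
    """Would adding edge (u, v) close a cycle of length 3 or 4?"""
    adj2 = adj @ adj
    if adj2[u, v] > 0:            # common neighbor => triangle through (u, v)
        return True
    if (adj2 @ adj)[u, v] > 0:    # length-3 path u..v => 4-cycle through (u, v)
        return True
    return False

def greedy_extremal_graph(n, rng):
    adj = np.zeros((n, n), dtype=int)
    pairs = list(itertools.combinations(range(n), 2))
    for i in rng.permutation(len(pairs)):   # one edge decision per step
        u, v = pairs[i]
        if not creates_short_cycle(adj, u, v):
            adj[u, v] = adj[v, u] = 1
    return adj

rng = np.random.default_rng(0)
adj = greedy_extremal_graph(20, rng)
print("edges:", adj.sum() // 2)             # the score to maximize
```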

AI-Enabled Unmanned Vehicle-Assisted Reconfigurable Intelligent Surfaces: Deployment, Prototyping, Experiments, and Opportunities

  • paper_url: http://arxiv.org/abs/2311.04241
  • repo_url: None
  • paper_authors: Li-Hsiang Shen, Kai-Ten Feng, Ta-Sung Lee, Yuan-Chun Lin, Shih-Cheng Lin, Chia-Chan Chang, Sheng-Fuh Chang
  • for: This paper focuses on the deployment of Reconfigurable Intelligent Surfaces (RIS) in wireless communication networks in the context of sixth-generation (6G) technology, exploring the use of RIS to extend service coverage, reduce power consumption, and enhance spectral efficiency.
  • methods: The paper discusses the theoretical and hardware aspects of RIS deployment, as well as the use of artificial intelligence (AI) and machine learning to optimize it. The authors propose a federated multi-agent reinforcement learning scheme to optimize the placement and configuration of RISs.
  • results: The paper presents experimental results of the proposed i-Dris system, which achieves a transmission throughput of up to 980 Mbps under a bandwidth of 100 MHz with comparatively low complexity and rapid deployment, outperforming existing works.
    Abstract The requirement of wireless data demands is increasingly high as the sixth-generation (6G) technology evolves. Reconfigurable intelligent surface (RIS) is promisingly deemed to be one of 6G techniques for extending service coverage, reducing power consumption, and enhancing spectral efficiency. In this article, we have provided some fundamentals of RIS deployment in theory and hardware perspectives as well as utilization of artificial intelligence (AI) and machine learning. We conducted an intelligent deployment of RIS (i-Dris) prototype, including dual-band auto-guided vehicle (AGV) assisted RISs associated with an mmWave base station (BS) and a receiver. The RISs are deployed on the AGV with configured incident/reflection angles. While, both the mmWave BS and receiver are associated with an edge server monitoring downlink packets for obtaining system throughput. We have designed a federated multi-agent reinforcement learning scheme associated with several AGV-RIS agents and sub-agents per AGV-RIS consisting of the deployment of position, height, orientation and elevation angles. The experimental results presented the stationary measurement in different aspects and scenarios. The i-Dris can reach up to 980 Mbps transmission throughput under a bandwidth of 100 MHz with comparably low complexity as well as rapid deployment, which outperforms the other existing works. At last, we highlight some opportunities and future issues in leveraging RIS-empowered wireless communication networks.
    摘要 随着第六代(6G)技术的演进,无线数据需求日益增长。可重构智能表面(RIS)被寄予厚望,被视为6G中用于扩展服务覆盖、降低功耗并提升频谱效率的关键技术之一。本文从理论与硬件两个角度介绍了RIS部署的基础知识,以及人工智能(AI)与机器学习的应用。我们构建了一个智能RIS部署(i-Dris)原型,包括由双频自动导引车(AGV)辅助的RIS,以及与之关联的毫米波基站(BS)和接收机。RIS部署在AGV上,并配置了入射/反射角;毫米波基站和接收机则连接至一台边缘服务器,该服务器监测下行数据包以获取系统吞吐量。我们设计了一种联邦多智能体强化学习方案,包含多个AGV-RIS智能体,每个AGV-RIS的子智能体分别负责部署位置、高度、朝向角与仰角。实验结果给出了在不同方面和场景下的静态测量。i-Dris在100 MHz带宽下可达980 Mbps的传输吞吐量,同时具有相对较低的复杂度和快速部署能力,优于其他现有工作。最后,我们指出了利用RIS赋能无线通信网络的若干机遇与未来问题。

Inclusive Portraits: Race-Aware Human-in-the-Loop Technology

  • paper_url: http://arxiv.org/abs/2311.03567
  • repo_url: None
  • paper_authors: Claudia Flores-Saviaga, Christopher Curtis, Saiph Savage
  • for: 这篇论文旨在提出一种基于种族相关社会理论的人在回路(HITL)系统,以提升人脸验证服务的性能,尤其是面向有色人种用户的服务。
  • methods: 论文提出了一种名为“包容肖像”(Inclusive Portraits,IP)的新方法,将围绕种族的社会理论与人脸验证服务相结合,设计出一种种族感知的人在回路系统。
  • results: 实验结果表明,在HITL系统中纳入种族因素可以显著提升服务性能,尤其是面向有色人种用户的服务;此外,研究还发现,在HITL系统的设计中考虑工作者的个体特征至关重要。
    Abstract AI has revolutionized the processing of various services, including the automatic facial verification of people. Automated approaches have demonstrated their speed and efficiency in verifying a large volume of faces, but they can face challenges when processing content from certain communities, including communities of people of color. This challenge has prompted the adoption of "human-in-the-loop" (HITL) approaches, where human workers collaborate with the AI to minimize errors. However, most HITL approaches do not consider workers' individual characteristics and backgrounds. This paper proposes a new approach, called Inclusive Portraits (IP), that connects with social theories around race to design a racially-aware human-in-the-loop system. Our experiments have provided evidence that incorporating race into human-in-the-loop (HITL) systems for facial verification can significantly enhance performance, especially for services delivered to people of color. Our findings also highlight the importance of considering individual worker characteristics in the design of HITL systems, rather than treating workers as a homogenous group. Our research has significant design implications for developing AI-enhanced services that are more inclusive and equitable.

Low-Rank MDPs with Continuous Action Spaces

  • paper_url: http://arxiv.org/abs/2311.03564
  • repo_url: None
  • paper_authors: Andrew Bennett, Nathan Kallus, Miruna Oprescu
  • for: Extending low-rank Markov decision processes (MDPs) to continuous action spaces to broaden the applicability of PAC reinforcement learning.
  • methods: Several concrete extension approaches, including constrained optimization and discretization, analyzed through the FLAMBE algorithm (Agarwal et al., 2020).
  • results: Without modifying FLAMBE, a similar PAC bound holds when the transition function is Holder smooth with respect to actions, without knowledge of the reward function; when the policy class additionally has a uniformly bounded minimum density, or the reward function is also Holder smooth, a polynomial PAC bound depending on the order of smoothness is obtained.
    Abstract Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Holder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Holder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.
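Discretization is one of the concrete extension routes mentioned above: cover the continuous action cube with an epsilon-net and hand the resulting finite action set to a finite-action low-rank MDP learner. A minimal sketch, assuming actions live in [0, 1]^d; the grid resolution that preserves the PAC bound depends on the Holder smoothness order analyzed in the paper.

```python
import itertools
import numpy as np

def epsilon_net(d: int, eps: float) -> np.ndarray:
    """Uniform grid epsilon-net over [0, 1]^d: a generic way to reduce a
    continuous action space to a finite one for an algorithm like FLAMBE."""
    pts = np.linspace(0.0, 1.0, int(np.ceil(1.0 / eps)) + 1)
    return np.array(list(itertools.product(pts, repeat=d)))

actions = epsilon_net(d=2, eps=0.25)  # 25 candidate actions for a 2-D control
print(actions.shape)                  # (25, 2)
```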

Context Unlocks Emotions: Text-based Emotion Classification Dataset Auditing with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03551
  • repo_url: None
  • paper_authors: Daniel Yang, Aditya Kommineni, Mohammad Alshehri, Nilamadhab Mohanty, Vedant Modi, Jonathan Gratch, Shrikanth Narayanan
  • for: Improving the performance of emotion classification models trained on text data.
  • methods: Using large language models to synthesize supplementary context for input text.
  • results: Improved alignment between text inputs and human-annotated emotion labels, in both empirical and human evaluation.
    Abstract The lack of contextual information in text data can make the annotation process of text-based emotion classification datasets challenging. As a result, such datasets often contain labels that fail to consider all the relevant emotions in the vocabulary. This misalignment between text inputs and labels can degrade the performance of machine learning models trained on top of them. As re-annotating entire datasets is a costly and time-consuming task that cannot be done at scale, we propose to use the expressive capabilities of large language models to synthesize additional context for input text to increase its alignment with the annotated emotional labels. In this work, we propose a formal definition of textual context to motivate a prompting strategy to enhance such contextual information. We provide both human and empirical evaluation to demonstrate the efficacy of the enhanced context. Our method improves alignment between inputs and their human-annotated labels from both an empirical and human-evaluated standpoint.
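A minimal sketch of the augmentation idea: prompt an LLM to synthesize the missing situational context for a bare utterance, then prepend it before classification. The prompt wording and the `llm` callable are placeholders, not the paper's prompting strategy.

```python
def build_context_prompt(text: str) -> str:
    """Ask an LLM to invent a plausible surrounding context for an utterance."""
    return (
        "The following message was written without any surrounding context.\n"
        f'Message: "{text}"\n'
        "Briefly describe a plausible situation, speaker intent, and "
        "preceding events this message responds to."
    )

def augment_for_emotion_classification(text: str, llm) -> str:
    """Prepend synthesized context so the input better reflects the emotions
    annotators had in mind. `llm` is any callable mapping a prompt string to
    a completion string (e.g., a thin API-client wrapper)."""
    context = llm(build_context_prompt(text))
    return f"Context: {context}\nMessage: {text}"
```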

United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

  • paper_url: http://arxiv.org/abs/2311.03550
  • repo_url: None
  • paper_authors: Siddhant Bansal, Chetan Arora, C. V. Jawahar
  • for: Addressing the inability of existing methods to capture key-steps across multiple videos by proposing an unsupervised Graph-based Procedure Learning (GPL) framework.
  • methods: GPL represents all videos of a task with a novel UnityGraph to obtain both intra-video and inter-video context; the graph embeddings are updated in an unsupervised manner with Node2Vec, and key-steps are identified by KMeans clustering.
  • results: On the ProceL, CrossTask, and EgoProceL benchmarks, GPL achieves average improvements of 2% on third-person datasets and 3.6% on EgoProceL over existing methods.
    Abstract Given multiple videos of the same task, procedure learning addresses identifying the key-steps and determining their order to perform the task. For this purpose, existing approaches use the signal generated from a pair of videos. This makes key-steps discovery challenging as the algorithms lack inter-videos perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph that represents all the videos of a task as a graph to obtain both intra-video and inter-videos context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using KMeans. We test GPL on benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art.
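A small end-to-end sketch of the pipeline shape (graph over all frames of a task, Node2Vec embeddings, KMeans for key-steps). The graph construction rule and similarity matrix below are assumptions for illustration; the paper's UnityGraph uses learned frame embeddings rather than random similarities.

```python
import networkx as nx
import numpy as np
from node2vec import Node2Vec          # pip install node2vec
from sklearn.cluster import KMeans

def build_unity_graph(num_videos, frames_per_video, sim, thr=0.9):
    """Toy UnityGraph: chain consecutive frames within each video and link
    visually similar frames across videos (sim is a given similarity matrix)."""
    g = nx.Graph()
    n = num_videos * frames_per_video
    g.add_nodes_from(range(n))
    for v in range(num_videos):                      # intra-video temporal edges
        base = v * frames_per_video
        for t in range(frames_per_video - 1):
            g.add_edge(base + t, base + t + 1)
    for i in range(n):                               # inter-video similarity edges
        for j in range(i + 1, n):
            if i // frames_per_video != j // frames_per_video and sim[i, j] > thr:
                g.add_edge(i, j)
    return g

sim = np.random.rand(40, 40)                         # placeholder similarities
graph = build_unity_graph(num_videos=4, frames_per_video=10, sim=sim)
model = Node2Vec(graph, dimensions=32, walk_length=10, num_walks=50, quiet=True).fit(window=5)
vectors = np.stack([model.wv[str(i)] for i in range(graph.number_of_nodes())])
key_steps = KMeans(n_clusters=7, n_init=10).fit_predict(vectors)  # K = assumed number of key-steps
```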

InterVLS: Interactive Model Understanding and Improvement with Vision-Language Surrogates

  • paper_url: http://arxiv.org/abs/2311.03547
  • repo_url: None
  • paper_authors: Jinbin Huang, Wenbin He, Liang Gou, Liu Ren, Chris Bryan
  • for: Helping users understand deep learning models and improve their performance.
  • methods: Text-aligned concept discovery combined with model-agnostic linear surrogates to measure each concept's influence.
  • results: In a user study, InterVLS effectively helped users identify a model's most influential concepts, gain performance insights, and adjust concept influence to improve the model.
    Abstract Deep learning models are widely used in critical applications, highlighting the need for pre-deployment model understanding and improvement. Visual concept-based methods, while increasingly used for this purpose, face challenges: (1) most concepts lack interpretability, (2) existing methods require model knowledge, often unavailable at run time. Additionally, (3) there lacks a no-code method for post-understanding model improvement. Addressing these, we present InterVLS. The system facilitates model understanding by discovering text-aligned concepts, measuring their influence with model-agnostic linear surrogates. Employing visual analytics, InterVLS offers concept-based explanations and performance insights. It enables users to adjust concept influences to update a model, facilitating no-code model improvement. We evaluate InterVLS in a user study, illustrating its functionality with two scenarios. Results indicate that InterVLS effectively helps users identify concepts influential to a model, gain insights, and adjust concept influence to improve the model. We conclude with a discussion based on our study results.
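The model-agnostic linear surrogate idea is easy to sketch: fit a sparse linear model that predicts the target model's outputs from per-image concept scores, and read concept influence off the coefficients. The concept scores below are random placeholders; in InterVLS they would come from text-aligned concept discovery (e.g., similarity to concept prompts).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

concept_scores = np.random.rand(500, 20)           # 500 images x 20 concept activations
# Placeholder "target model" decisions that secretly depend on concepts 3 and 7:
model_preds = (concept_scores[:, 3] + 0.5 * concept_scores[:, 7] > 0.9).astype(int)

surrogate = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
surrogate.fit(concept_scores, model_preds)         # surrogate mimics the model, not the labels

influence = surrogate.coef_.ravel()                # signed influence per concept
top = np.argsort(-np.abs(influence))[:5]
print("most influential concepts:", top, influence[top])
```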

PcLast: Discovering Plannable Continuous Latent States

  • paper_url: http://arxiv.org/abs/2311.03534
  • repo_url: None
  • paper_authors: Anurag Koul, Shivakanth Sujit, Shaoru Chen, Ben Evans, Lili Wu, Byron Xu, Rajan Chari, Riashat Islam, Raihan Seraj, Yonathan Efroni, Lekan Molu, Miro Dudik, John Langford, Alex Lamb
  • for: goal-conditioned planning
  • methods: multi-step inverse dynamics, latent representation, and associating reachable states together in $\ell_2$ space
  • results: significant improvements in sampling efficiency, and layered state abstractions that enable computationally efficient hierarchical planning.
    Abstract Goal-conditioned planning benefits from learned low-dimensional representations of rich, high-dimensional observations. While compact latent representations, typically learned from variational autoencoders or inverse dynamics, enable goal-conditioned planning they ignore state affordances, thus hampering their sample-efficient planning capabilities. In this paper, we learn a representation that associates reachable states together for effective onward planning. We first learn a latent representation with multi-step inverse dynamics (to remove distracting information); and then transform this representation to associate reachable states together in $\ell_2$ space. Our proposals are rigorously tested in various simulation testbeds. Numerical results in reward-based and reward-free settings show significant improvements in sampling efficiency, and yield layered state abstractions that enable computationally efficient hierarchical planning.
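A minimal sketch of the first stage (multi-step inverse dynamics): train an encoder so that the first action can be recovered from the embeddings of an observation and the observation k steps later. Network sizes and the discrete-action assumption are illustrative; the second stage, which re-embeds states so that reachable states are close in L2, is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
    def forward(self, x):
        return self.net(x)

obs_dim, latent_dim, num_actions, k = 64, 16, 4, 5
phi = Encoder(obs_dim, latent_dim)
inv_head = nn.Linear(2 * latent_dim, num_actions)     # predicts a_t from (phi(s_t), phi(s_{t+k}))
opt = torch.optim.Adam(list(phi.parameters()) + list(inv_head.parameters()), lr=3e-4)

s_t = torch.randn(32, obs_dim)                        # batch of observations
s_tk = torch.randn(32, obs_dim)                       # observations k steps later
a_t = torch.randint(0, num_actions, (32,))            # first action taken at time t

logits = inv_head(torch.cat([phi(s_t), phi(s_tk)], dim=-1))
loss = F.cross_entropy(logits, a_t)                   # multi-step inverse dynamics objective
opt.zero_grad(); loss.backward(); opt.step()
```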

Brain Networks and Intelligence: A Graph Neural Network Based Approach to Resting State fMRI Data

  • paper_url: http://arxiv.org/abs/2311.03520
  • repo_url: None
  • paper_authors: Bishal Thapaliya, Esra Akbas, Jiayu Chen, Raam Sapkota, Bhaskar Ray, Pranav Suresh, Vince Calhoun, Jingyu Liu
  • for: Developing a novel architecture, BrainRGIN, that predicts intelligence (fluid, crystallized, and total) using graph neural networks on static functional network connectivity matrices derived from resting-state fMRI (rsfMRI).
  • methods: BrainRGIN incorporates a clustering-based embedding and a graph isomorphism network in the graph convolutional layer, together with TopK pooling and attention-based readout functions; rsfMRI captures the brain's functional organization without relying on specific tasks or stimuli.
  • results: BrainRGIN achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and traditional machine learning models for all intelligence prediction tasks. The middle frontal gyrus contributed significantly to both fluid and crystallized intelligence, while total composite scores implicated a diverse set of brain regions, highlighting the complex nature of total intelligence.
    Abstract Resting-state functional magnetic resonance imaging (rsfMRI) is a powerful tool for investigating the relationship between brain function and cognitive processes as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this paper, we present a novel modeling architecture called BrainRGIN for predicting intelligence (fluid, crystallized, and total intelligence) using graph neural networks on rsfMRI derived static functional network connectivity matrices. Extending from the existing graph convolution networks, our approach incorporates a clustering-based embedding and graph isomorphism network in the graph convolutional layer to reflect the nature of the brain sub-network organization and efficient network expression, in combination with TopK pooling and attention-based readout functions. We evaluated our proposed architecture on a large dataset, specifically the Adolescent Brain Cognitive Development Dataset, and demonstrated its effectiveness in predicting individual differences in intelligence. Our model achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and other traditional machine learning models for all of the intelligence prediction tasks. The middle frontal gyrus exhibited a significant contribution to both fluid and crystallized intelligence, suggesting their pivotal role in these cognitive processes. Total composite scores identified a diverse set of brain regions to be relevant which underscores the complex nature of total intelligence.
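A minimal PyTorch Geometric sketch in the spirit of the pipeline above: nodes are brain regions, node features are rows of the connectivity matrix, GIN layers feed TopK pooling, and a readout regresses an intelligence score. The paper's clustering-based embedding and attention readout are simplified here to a plain GIN stack with mean pooling.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv, TopKPooling, global_mean_pool

class BrainGIN(nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        mlp1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        mlp2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.conv1, self.conv2 = GINConv(mlp1), GINConv(mlp2)
        self.pool = TopKPooling(hidden, ratio=0.5)    # keep the most salient regions
        self.head = nn.Linear(hidden, 1)              # e.g., fluid intelligence score

    def forward(self, x, edge_index, batch):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        x, edge_index, _, batch, _, _ = self.pool(x, edge_index, batch=batch)
        return self.head(global_mean_pool(x, batch)).squeeze(-1)
```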

MFAAN: Unveiling Audio Deepfakes with a Multi-Feature Authenticity Network

  • paper_url: http://arxiv.org/abs/2311.03509
  • repo_url: None
  • paper_authors: Karthik Sivarama Krishnan, Koushik Sivarama Krishnan
  • for: Countering the spread of misinformation driven by audio deepfakes.
  • methods: A Multi-Feature Audio Authenticity Network (MFAAN) that fuses multiple audio representations, including MFCC, LFCC, and Chroma-STFT, for a nuanced understanding of audio content and accurate detection of fabricated recordings.
  • results: On two benchmark datasets, MFAAN performs strongly, achieving accuracies of 98.93% and 94.47%, underscoring its reliability and practical value.
    Abstract In the contemporary digital age, the proliferation of deepfakes presents a formidable challenge to the sanctity of information dissemination. Audio deepfakes, in particular, can be deceptively realistic, posing significant risks in misinformation campaigns. To address this threat, we introduce the Multi-Feature Audio Authenticity Network (MFAAN), an advanced architecture tailored for the detection of fabricated audio content. MFAAN incorporates multiple parallel paths designed to harness the strengths of different audio representations, including Mel-frequency cepstral coefficients (MFCC), linear-frequency cepstral coefficients (LFCC), and Chroma Short Time Fourier Transform (Chroma-STFT). By synergistically fusing these features, MFAAN achieves a nuanced understanding of audio content, facilitating robust differentiation between genuine and manipulated recordings. Preliminary evaluations of MFAAN on two benchmark datasets, 'In-the-Wild' Audio Deepfake Data and The Fake-or-Real Dataset, demonstrate its superior performance, achieving accuracies of 98.93% and 94.47% respectively. Such results not only underscore the efficacy of MFAAN but also highlight its potential as a pivotal tool in the ongoing battle against deepfake audio content.
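A sketch of extracting the three parallel feature views named above. librosa provides MFCC and Chroma-STFT directly; LFCC is approximated here by a DCT over a log linear-frequency power spectrogram (torchaudio also ships a ready-made LFCC transform). Each view would feed its own sub-network before fusion, which is not reproduced here.

```python
import librosa
import numpy as np
import scipy.fftpack

def mfaan_features(path: str, n_coeffs: int = 20):
    """Return the (MFCC, LFCC, Chroma-STFT) views of one audio file."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeffs)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    log_lin = np.log(np.abs(librosa.stft(y)) ** 2 + 1e-10)       # linear-frequency log power
    lfcc = scipy.fftpack.dct(log_lin, axis=0, norm="ortho")[:n_coeffs]
    return mfcc, lfcc, chroma
```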

Astrocytes as a mechanism for meta-plasticity and contextually-guided network function

  • paper_url: http://arxiv.org/abs/2311.03508
  • repo_url: None
  • paper_authors: Lulu Gong, Fabio Pasqualetti, Thomas Papouin, ShiNung Ching
  • for: Examining the function of astrocytes in the brain and how they interact with neurons and synapses to enable learning.
  • methods: A formal model and analysis of neural-synaptic-astrocyte interaction, describing how astrocytes modulate the adaptation of neurons and synapses and enable learning across separated time-scales.
  • results: With time-scale-separated astrocytic modulation, networks adapt to task parameters that fluctuate much more slowly than within-task dynamics and learn over multiple randomly changing contexts, more reliably than dynamically homogeneous networks and conventional non-network bandit algorithms.
    Abstract Astrocytes are a highly expressed and highly enigmatic cell-type in the mammalian brain. Traditionally viewed as a mediator of basic physiological sustenance, it is increasingly recognized that astrocytes may play a more direct role in neural computation. A conceptual challenge to this idea is the fact that astrocytic activity takes a very different form than that of neurons, and in particular, occurs at orders-of-magnitude slower time-scales. In the current paper, we examine how such time-scale separation may endow astrocytes with the capability to enable learning in context-dependent settings, where fluctuations in task parameters may occur much more slowly than within-task requirements. This idea is based on the recent supposition that astrocytes, owing to their sensitivity to a host of physiological covariates, may be particularly well poised to modulate the dynamics of neural circuits in functionally salient ways. We pose a general model of neural-synaptic-astrocyte interaction and use formal analysis to characterize how astrocytic modulation may constitute a form of meta-plasticity, altering the ways in which synapses and neurons adapt as a function of time. We then embed this model in a bandit-based reinforcement learning task environment, and show how the presence of time-scale separated astrocytic modulation enables learning over multiple fluctuating contexts. Indeed, these networks learn far more reliably versus dynamically homogenous networks and conventional non-network-based bandit algorithms. Our results indicate how the presence of neural-astrocyte interaction in the brain may benefit learning over different time-scale and the conveyance of task relevant contextual information onto circuit dynamics.
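A toy two-timescale sketch of the meta-plasticity idea: fast Hebbian synaptic updates are gated by a slow "astrocyte" variable that integrates recent circuit activity, so plasticity itself adapts across contexts. This is a simple reading for illustration; the paper's neural-synaptic-astrocyte dynamics are considerably richer.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(10, 10))    # synaptic weights (fast variable)
a = 0.5                                   # astrocyte gain (slow variable)
eta_fast, eta_slow = 0.05, 0.001

for step in range(1000):
    if step % 250 == 0:                   # context switches far slower than within-task dynamics
        context = rng.normal(0, 1, size=10)
    x = np.tanh(context + rng.normal(0, 0.1, size=10))
    y = np.tanh(w @ x)
    a += eta_slow * (np.mean(np.abs(y)) - a)    # slow integration of circuit activity
    w += eta_fast * a * np.outer(y, x)          # astrocyte gain modulates Hebbian plasticity
    w *= 0.999                                  # mild decay keeps weights bounded
```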

Environmental-Impact Based Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.04240
  • repo_url: None
  • paper_authors: Farinaz Alamiyan-Harandi, Pouria Ramazi
  • for: Promoting cooperation in social dilemmas and strengthening each individual's impact on the collective outcome.
  • methods: The Environmental-impact Multi-Agent Reinforcement Learning (EMuReL) method, where each agent estimates every other agent's "environmental impact", i.e., the difference between the current environment state and a hypothetical state without that agent.
  • results: In the Cleanup (resp. Harvest) test environment, agents trained with EMuReL cooperate more effectively and obtain 54% (39%) and 20% (44%) more total rewards while preserving the same cooperation levels.
    Abstract To promote cooperation and strengthen the individual impact on the collective outcome in social dilemmas, we propose the Environmental-impact Multi-Agent Reinforcement Learning (EMuReL) method where each agent estimates the "environmental impact" of every other agent, that is, the difference in the current environment state compared to the hypothetical environment in the absence of that other agent. Inspired by the Inequity Aversion model, the agent then compares its own reward with those of its fellows multiplied by their environmental impacts. If its reward exceeds the scaled reward of one of its fellows, the agent takes "social responsibility" toward that fellow by reducing its own reward. Therefore, the less influential an agent is in reaching the current state, the more social responsibility is taken by other agents. Experiments in the Cleanup (resp. Harvest) test environment demonstrate that agents trained based on EMuReL learn to cooperate more effectively and obtain $54\%$ ($39\%$) and $20\%$ ($44\%$) more total rewards while preserving the same cooperation levels compared to when they are trained based on the two state-of-the-art reward reshaping methods inequity aversion and social influence.
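One plausible reading of the reward adjustment, sketched below: if an agent's reward exceeds a fellow's impact-scaled reward, it takes "social responsibility" by giving up part of its own reward. The responsibility coefficient and the use of a scalar impact per agent are assumptions; the paper defines impact via counterfactual environment states.

```python
import numpy as np

def emurel_adjust(rewards: np.ndarray, impacts: np.ndarray, coef: float = 0.1) -> np.ndarray:
    """rewards[i]: agent i's raw reward; impacts[i]: agent i's environmental
    impact (distance between the current state and the counterfactual state
    without agent i). Returns responsibility-adjusted rewards."""
    adjusted = rewards.astype(float).copy()
    n = len(rewards)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            scaled = rewards[j] * impacts[j]
            if rewards[i] > scaled:                       # i out-earns the scaled fellow reward
                adjusted[i] -= coef * (rewards[i] - scaled)
    return adjusted

print(emurel_adjust(np.array([3.0, 1.0, 0.5]), np.array([0.2, 0.9, 0.4])))
```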

Kindness in Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.04239
  • repo_url: None
  • paper_authors: Farinaz Alamiyan-Harandi, Mersad Hassanjani, Pouria Ramazi
  • for: Strengthening cooperation among agents in multi-agent reinforcement learning via KindMARL, a method grounded in concepts of human behavior.
  • methods: KindMARL measures agents' intentions by counterfactual reasoning over the environmental impact of the actions available to them, and evaluates fellows' kindness through per-reward comparisons.
  • results: Agents trained with KindMARL earned 89% (resp. 37%) and 44% (resp. 43%) more total rewards in Cleanup (resp. Harvest) than agents trained with the Inequity Aversion and Social Influence methods; KindMARL was also effective on a traffic light control problem.
    Abstract In human societies, people often incorporate fairness in their decisions and treat reciprocally by being kind to those who act kindly. They evaluate the kindness of others' actions not only by monitoring the outcomes but also by considering the intentions. This behavioral concept can be adapted to train cooperative agents in Multi-Agent Reinforcement Learning (MARL). We propose the KindMARL method, where agents' intentions are measured by counterfactual reasoning over the environmental impact of the actions that were available to the agents. More specifically, the current environment state is compared with the estimation of the current environment state provided that the agent had chosen another action. The difference between each agent's reward, as the outcome of its action, with that of its fellow, multiplied by the intention of the fellow is then taken as the fellow's "kindness". If the result of each reward-comparison confirms the agent's superiority, it perceives the fellow's kindness and reduces its own reward. Experimental results in the Cleanup and Harvest environments show that training based on the KindMARL method enabled the agents to earn 89\% (resp. 37\%) and 44% (resp. 43\%) more total rewards than training based on the Inequity Aversion and Social Influence methods. The effectiveness of KindMARL is further supported by experiments in a traffic light control problem.
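The kindness signal described above can be sketched directly from the abstract: a fellow's intention is its counterfactual gain, and its kindness toward agent i is the reward gap weighted by that intention. Positive entries would trigger agent i to reduce its own reward. The exact counterfactual estimator is not reproduced here.

```python
import numpy as np

def kindness_scores(rewards: np.ndarray, counterfactual_rewards: np.ndarray) -> np.ndarray:
    """One plausible reading of the KindMARL signal.
    rewards[j]: agent j's actual reward; counterfactual_rewards[j]: the
    reward estimated had j chosen another action."""
    intention = rewards - counterfactual_rewards   # per-agent counterfactual gain
    n = len(rewards)
    kindness = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                kindness[i, j] = (rewards[i] - rewards[j]) * intention[j]
    return kindness   # positive kindness[i, j]: agent i perceives j's kindness

print(kindness_scores(np.array([2.0, 1.0]), np.array([1.5, 1.2])))
```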

Multi-Resolution Diffusion for Privacy-Sensitive Recommender Systems

  • paper_url: http://arxiv.org/abs/2311.03488
  • repo_url: None
  • paper_authors: Derek Lilienthal, Paul Mello, Magdalini Eirinaki, Stas Tiomkin
  • for: Proposing a diffusion-based recommender system that protects user privacy and security.
  • methods: A Score-based Diffusion Recommendation Model (SDRM) generates high-fidelity synthetic data to replace or augment the original data.
  • results: SDRM outperforms generative adversarial networks, variational autoencoders, and recently proposed diffusion models, with average improvements of 4.30% in Recall@$n$ and 4.65% in NDCG@$n$.
    Abstract While recommender systems have become an integral component of the Web experience, their heavy reliance on user data raises privacy and security concerns. Substituting user data with synthetic data can address these concerns, but accurately replicating these real-world datasets has been a notoriously challenging problem. Recent advancements in generative AI have demonstrated the impressive capabilities of diffusion models in generating realistic data across various domains. In this work we introduce a Score-based Diffusion Recommendation Model (SDRM), which captures the intricate patterns of real-world datasets required for training highly accurate recommender systems. SDRM allows for the generation of synthetic data that can replace existing datasets to preserve user privacy, or augment existing datasets to address excessive data sparsity. Our method outperforms competing baselines such as generative adversarial networks, variational autoencoders, and recently proposed diffusion models in synthesizing various datasets to replace or augment the original data by an average improvement of 4.30% in Recall@$n$ and 4.65% in NDCG@$n$.
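A minimal sketch of one diffusion training step on user interaction vectors: noise a row of the user-item matrix, predict the noise, and take an MSE step. SDRM is score-based; this DDPM-style denoising step is the closest common simplification, and the MLP denoiser and schedule below are assumptions.

```python
import torch
import torch.nn as nn

num_items, T = 1000, 200
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(num_items + 1, 512), nn.ReLU(), nn.Linear(512, num_items))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

x0 = torch.randint(0, 2, (64, num_items)).float()     # implicit-feedback rows (batch of users)
t = torch.randint(0, T, (64,))
noise = torch.randn_like(x0)
a = alpha_bar[t].unsqueeze(1)
xt = a.sqrt() * x0 + (1 - a).sqrt() * noise           # forward noising at step t
pred = denoiser(torch.cat([xt, t.float().unsqueeze(1) / T], dim=1))
loss = ((pred - noise) ** 2).mean()                   # predict the injected noise
opt.zero_grad(); loss.backward(); opt.step()
# Reverse diffusion (omitted) would then sample synthetic interaction rows
# to replace or augment the real dataset.
```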

CLIP-Motion: Learning Reward Functions for Robotic Actions Using Consecutive Observations

  • paper_url: http://arxiv.org/abs/2311.03485
  • repo_url: None
  • paper_authors: Xuzhe Dang, Stefan Edelkamp, Nicolas Ribault
  • for: Learning reward functions for robotic motions with a CLIP-based model; traditional reward design often relies on manual feature engineering that struggles to generalize across tasks, which this approach sidesteps by exploiting CLIP's ability to process both state features and image inputs.
  • methods: Given a pair of consecutive observations, the model identifies the motion executed between them.
  • results: Experimental evaluations across robotic activities, such as directing a gripper to a target and adjusting a cube's position, show the method precisely infers motions and promises to enhance reinforcement learning training in robotics.
    Abstract This paper presents a novel method for learning reward functions for robotic motions by harnessing the power of a CLIP-based model. Traditional reward function design often hinges on manual feature engineering, which can struggle to generalize across an array of tasks. Our approach circumvents this challenge by capitalizing on CLIP's capability to process both state features and image inputs effectively. Given a pair of consecutive observations, our model excels in identifying the motion executed between them. We showcase results spanning various robotic activities, such as directing a gripper to a designated target and adjusting the position of a cube. Through experimental evaluations, we underline the proficiency of our method in precisely deducing motion and its promise to enhance reinforcement learning training in the realm of robotics.
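A sketch of the consecutive-observation idea: embed two frames with a frozen CLIP image encoder and classify the motion executed between them; the classifier's confidence can then serve as a reward-like signal. The linear head and the four-class label set are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
motion_head = nn.Linear(2 * clip.config.projection_dim, 4)   # e.g., 4 motion classes (assumed)

@torch.no_grad()
def embed(img: Image.Image) -> torch.Tensor:
    inputs = proc(images=img, return_tensors="pt")
    return clip.get_image_features(**inputs)                  # frozen CLIP image embedding

def motion_logits(obs_t: Image.Image, obs_t1: Image.Image) -> torch.Tensor:
    pair = torch.cat([embed(obs_t), embed(obs_t1)], dim=-1)   # consecutive observations
    return motion_head(pair)   # softmax over this gives a reward-like motion score
```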

Multi Loss-based Feature Fusion and Top Two Voting Ensemble Decision Strategy for Facial Expression Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2311.03478
  • repo_url: None
  • paper_authors: Guangyao Zhou, Yuanlun Xie, Wenhong Tian
  • for: Improving the performance of facial expression recognition (FER) in the wild, a task affected by image quality that has attracted broad interest in computer vision.
  • methods: Internal feature fusion within a single model, feature fusion among multiple networks, and an ensemble strategy; specifically, a novel single model named R18+FAML and an ensemble model named R18+FAML-FGA-T2V.
  • results: The single model R18+FAML and the ensemble R18+FAML-FGA-T2V reach accuracies of $(90.32, 62.17, 65.83)\%$ and $(91.59, 63.27, 66.63)\%$ on the three challenging unbalanced FER datasets RAF-DB, AffectNet-8, and AffectNet-7, respectively, both surpassing state-of-the-art results.
    Abstract Facial expression recognition (FER) in the wild is a challenging task affected by the image quality and has attracted broad interest in computer vision. There is no research using feature fusion and ensemble strategy for FER simultaneously. Different from previous studies, this paper applies both internal feature fusion for a single model and feature fusion among multiple networks, as well as the ensemble strategy. This paper proposes one novel single model named R18+FAML, as well as one ensemble model named R18+FAML-FGA-T2V to improve the performance of the FER in the wild. Based on the structure of ResNet18 (R18), R18+FAML combines internal Feature fusion and three Attention blocks using Multiple Loss functions (FAML) to improve the diversity of the feature extraction. To improve the performance of R18+FAML, we propose a Feature fusion among networks based on the Genetic Algorithm (FGA), which can fuse the convolution kernels for feature extraction of multiple networks. On the basis of R18+FAML and FGA, we propose one ensemble strategy, i.e., the Top Two Voting (T2V) to support the classification of FER, which can consider more classification information comprehensively. Combining the above strategies, R18+FAML-FGA-T2V can focus on the main expression-aware areas. Extensive experiments demonstrate that our single model R18+FAML and the ensemble model R18+FAML-FGA-T2V achieve the accuracies of $\left( 90.32, 62.17, 65.83 \right)\%$ and $\left( 91.59, 63.27, 66.63 \right)\%$ on three challenging unbalanced FER datasets RAF-DB, AffectNet-8 and AffectNet-7 respectively, both outperforming the state-of-the-art results.
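One plausible reading of the Top Two Voting (T2V) step, sketched below: each ensemble member casts confidence-weighted votes for its two most probable classes, and the class with the largest total vote wins. The weighting scheme is an assumption; the paper does not spell it out in the abstract.

```python
import numpy as np

def top_two_voting(probs_per_model: np.ndarray) -> int:
    """probs_per_model: (n_models, n_classes) softmax outputs from the ensemble."""
    votes = np.zeros(probs_per_model.shape[1])
    for p in probs_per_model:
        top2 = np.argsort(-p)[:2]
        votes[top2[0]] += p[top2[0]]   # first choice, weighted by its confidence
        votes[top2[1]] += p[top2[1]]   # second choice also contributes
    return int(np.argmax(votes))

probs = np.array([[0.50, 0.30, 0.20],
                  [0.20, 0.45, 0.35],
                  [0.40, 0.35, 0.25]])
print(top_two_voting(probs))           # class chosen by the ensemble
```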

FinA: Fairness of Adverse Effects in Decision-Making of Human-Cyber-Physical-System

  • paper_url: http://arxiv.org/abs/2311.03468
  • repo_url: None
  • paper_authors: Tianyu Zhao, Salma Elmalaki
  • for: This paper focuses on ensuring fairness in decision-making systems within Human-Cyber-Physical-Systems (HCPS), particularly in the context of diverse individuals with varying behaviors and expectations.
  • methods: The paper introduces the concept of Fairness-in-Adverse-Effects (FinA) and proposes a comprehensive set of five formulations to address the challenge of fairness, taking into account both instantaneous and long-term aspects of adverse effects.
  • results: The evaluation conducted within the domain of smart homes demonstrates that the adoption of FinA significantly enhances the overall perception of fairness among individuals, with an average improvement of 66.7% compared to the state-of-the-art method.
    Abstract Ensuring fairness in decision-making systems within Human-Cyber-Physical-Systems (HCPS) is a pressing concern, particularly when diverse individuals, each with varying behaviors and expectations, coexist within the same application space, influenced by a shared set of control actions in the system. The long-term adverse effects of these actions further pose the challenge, as historical experiences and interactions shape individual perceptions of fairness. This paper addresses the challenge of fairness from an equity perspective of adverse effects, taking into account the dynamic nature of human behavior and evolving preferences while recognizing the lasting impact of adverse effects. We formally introduce the concept of Fairness-in-Adverse-Effects (FinA) within the HCPS context. We put forth a comprehensive set of five formulations for FinA, encompassing both the instantaneous and long-term aspects of adverse effects. To empirically validate the effectiveness of our FinA approach, we conducted an evaluation within the domain of smart homes, a pertinent HCPS application. The outcomes of our evaluation demonstrate that the adoption of FinA significantly enhances the overall perception of fairness among individuals, yielding an average improvement of 66.7% when compared to the state-of-the-art method.
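As a toy instantiation of the long-term side of the idea, the sketch below accumulates each occupant's discounted adverse effects over time and reports the spread across individuals (0 = perfectly even). The paper proposes five FinA formulations; this generic one is only for illustration.

```python
import numpy as np

def adverse_effect_disparity(adverse: np.ndarray, discount: float = 0.9) -> float:
    """adverse: (timesteps, n_people) nonnegative adverse-effect measurements.
    Recent effects weigh more; returns the max-min gap of long-term burden."""
    t = adverse.shape[0]
    weights = discount ** np.arange(t)[::-1]   # older steps discounted more
    cumulative = weights @ adverse             # per-person long-term burden
    return float(cumulative.max() - cumulative.min())

print(adverse_effect_disparity(np.random.rand(50, 4)))
```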

Exploitation-Guided Exploration for Semantic Embodied Navigation

  • paper_url: http://arxiv.org/abs/2311.03357
  • repo_url: None
  • paper_authors: Justin Wasserman, Girish Chowdhary, Abhinav Gupta, Unnat Jain
  • for: Studying embodied navigation and sim-to-robot transfer, and investigating a principled way to syntactically combine modular policy components.
  • methods: Exploitation-Guided Exploration (XGX), with separate exploration and exploitation modules: once the goal becomes visible, the exploitation module takes over the deterministic final steps, teacher-forces the exploration module, and continues driving an overridden policy optimization.
  • results: XGX improves state-of-the-art performance on the challenging object navigation task from 70% to 73%; targeted analysis shows it is also more efficient at goal-conditioned exploration, and in sim-to-real transfer to robot hardware it performs over two-fold better than the best simulation baseline.
    Abstract In the recent progress in embodied navigation and sim-to-robot transfer, modular policies have emerged as a de facto framework. However, there is more to compositionality beyond the decomposition of the learning load into modular components. In this work, we investigate a principled way to syntactically combine these components. Particularly, we propose Exploitation-Guided Exploration (XGX) where separate modules for exploration and exploitation come together in a novel and intuitive manner. We configure the exploitation module to take over in the deterministic final steps of navigation i.e. when the goal becomes visible. Crucially, an exploitation module teacher-forces the exploration module and continues driving an overridden policy optimization. XGX, with effective decomposition and novel guidance, improves the state-of-the-art performance on the challenging object navigation task from 70% to 73%. Along with better accuracy, through targeted analysis, we show that XGX is also more efficient at goal-conditioned exploration. Finally, we show sim-to-real transfer to robot hardware and XGX performs over two-fold better than the best baseline from simulation benchmarking. Project page: xgxvisnav.github.io
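Reduced to its control flow, the composition is a simple switch, sketched below: the exploitation policy acts once the goal is visible, the exploration policy otherwise. The training-time teacher-forcing of the exploration module, which is central to XGX, is omitted here.

```python
def xgx_act(obs, goal_visible: bool, explorer, exploiter):
    """Minimal XGX-style inference switch. `explorer` and `exploiter` are
    placeholder callables mapping an observation to an action; in XGX the
    exploitation module additionally teacher-forces the exploration policy
    during training."""
    policy = exploiter if goal_visible else explorer
    return policy(obs)
```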

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

  • paper_url: http://arxiv.org/abs/2311.03355
  • repo_url: https://github.com/prismformore/seggen
  • paper_authors: Hanrong Ye, Jason Kuen, Qing Liu, Zhe Lin, Brian Price, Dan Xu
  • for: Improving the performance of image segmentation models, in particular for semantic, panoptic, and instance segmentation.
  • methods: Two data generation strategies, MaskSyn and ImgSyn, that increase data diversity for training segmentation models.
  • results: On the ADE20K and COCO benchmarks, data generated by SegGen markedly improves state-of-the-art segmentation models, including Mask2Former R50 and Mask2Former Swin-L; notably, ADE20K mIoU rises from 47.2 to 49.9 (+2.7) for Mask2Former R50 and from 56.1 to 57.4 (+1.3) for Mask2Former Swin-L. These results show SegGen helps even when abundant human-annotated data is available, and training with the synthetic data also makes models more robust to unseen domains.
    Abstract We propose SegGen, a highly-effective training data generation method for image segmentation, which pushes the performance limits of state-of-the-art segmentation models to a significant extent. SegGen designs and integrates two data generation strategies: MaskSyn and ImgSyn. (i) MaskSyn synthesizes new mask-image pairs via our proposed text-to-mask generation model and mask-to-image generation model, greatly improving the diversity in segmentation masks for model supervision; (ii) ImgSyn synthesizes new images based on existing masks using the mask-to-image generation model, strongly improving image diversity for model inputs. On the highly competitive ADE20K and COCO benchmarks, our data generation method markedly improves the performance of state-of-the-art segmentation models in semantic segmentation, panoptic segmentation, and instance segmentation. Notably, in terms of the ADE20K mIoU, Mask2Former R50 is largely boosted from 47.2 to 49.9 (+2.7); Mask2Former Swin-L is also significantly increased from 56.1 to 57.4 (+1.3). These promising results strongly suggest the effectiveness of our SegGen even when abundant human-annotated training data is utilized. Moreover, training with our synthetic data makes the segmentation models more robust towards unseen domains. Project website: https://seggenerator.github.io
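The two strategies compose into a simple data pipeline, sketched below with placeholder generative models (`text2mask` and `mask2img` stand in for the paper's trained text-to-mask and mask-to-image generators, which are not reproduced here).

```python
def seggen_augment(captions, real_pairs, text2mask, mask2img):
    """MaskSyn: synthesize a new mask from a caption, then an image for it.
    ImgSyn: synthesize fresh images for existing ground-truth masks.
    Returns a list of (image, mask) training pairs."""
    synthetic = []
    for cap in captions:                 # MaskSyn: new mask-image pairs
        mask = text2mask(cap)
        synthetic.append((mask2img(mask), mask))
    for _, mask in real_pairs:           # ImgSyn: new images for real masks
        synthetic.append((mask2img(mask), mask))
    return synthetic
```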

GLaMM: Pixel Grounding Large Multimodal Model

  • paper_url: http://arxiv.org/abs/2311.03356
  • repo_url: None
  • paper_authors: Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan
  • for: Extending large multimodal models (LMMs) to the vision domain with natural language responses grounded in the visual input.
  • methods: A new model, Grounding LMM (GLaMM), generates natural language responses seamlessly intertwined with corresponding object segmentation masks, given textual and/or optional visual (region-of-interest) prompts from the user.
  • results: GLaMM performs strongly on downstream tasks including image- and region-level captioning and vision-language conversations, as well as on novel tasks such as Referring Expression Segmentation and Grounded Conversation Generation.
    Abstract Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial efforts towards LMMs used holistic images and text prompts to generate ungrounded textual responses. Very recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring a single object category at a time, require users to specify the regions in inputs, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of generating visually grounded detailed conversations, we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large-scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks e.g., referring expression segmentation, image and region-level captioning and vision-language conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

  • paper_url: http://arxiv.org/abs/2311.03348
  • repo_url: None
  • paper_authors: Rusheb Shah, Quentin Feuillade–Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
  • for: This paper investigates the vulnerability of large language models to jailbreaking attacks using persona modulation, and demonstrates the ability to elicit harmful responses from the models.
  • methods: The paper uses a language model assistant to automate the generation of jailbreaks, and demonstrates the effectiveness of this approach in achieving harmful completions in GPT-4, Claude 2, and Vicuna.
  • results: The paper shows that persona modulation can achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation, and also demonstrates the transfer of these attacks to other models, such as Claude 2 and Vicuna.
    Abstract Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.

Multitask Kernel-based Learning with First-Order Logic Constraints

  • paper_url: http://arxiv.org/abs/2311.03340
  • repo_url: None
  • paper_authors: Michelangelo Diligenti, Marco Gori, Marco Maggini, Leonardo Rigutini
  • for: Presenting a general framework that integrates supervised and unsupervised examples with background knowledge, expressed as first-order logic (FOL) clauses, into kernel machines.
  • methods: A multi-task learning scheme where multiple predicates defined over a set of objects are learned jointly from examples, with FOL constraints enforced on the admissible configurations of their values; the predicates are defined on feature spaces and may be known a priori or approximated by suitable kernel-based learners.
  • results: A general approach converts the FOL clauses into a continuous implementation, and experiments on several examples show the method effectively addresses semi-supervised learning problems.
    Abstract In this paper we propose a general framework to integrate supervised and unsupervised examples with background knowledge expressed by a collection of first-order logic clauses into kernel machines. In particular, we consider a multi-task learning scheme where multiple predicates defined on a set of objects are to be jointly learned from examples, enforcing a set of FOL constraints on the admissible configurations of their values. The predicates are defined on the feature spaces, in which the input objects are represented, and can be either known a priori or approximated by an appropriate kernel-based learner. A general approach is presented to convert the FOL clauses into a continuous implementation that can deal with the outputs computed by the kernel-based predicates. The learning problem is formulated as a semi-supervised task that requires the optimization in the primal of a loss function that combines a fitting loss measure on the supervised examples, a regularization term, and a penalty term that enforces the constraints on both the supervised and unsupervised examples. Unfortunately, the penalty term is not convex and it can hinder the optimization process. However, it is possible to avoid poor solutions by using a two stage learning schema, in which the supervised examples are learned first and then the constraints are enforced.
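A minimal sketch of the continuous conversion for one clause, using the product t-norm and the Reichenbach implication (a standard choice for this kind of framework; the paper's exact relaxation may differ). The rule forall x: A(x) AND B(x) -> C(x) maps to I(p, q) = 1 - p + p*q with p = a*b and q = c, and the penalty 1 - I is averaged over (possibly unsupervised) examples. Note the resulting penalty is generally non-convex, matching the abstract's caveat.

```python
import torch

def clause_penalty(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Penalty for forall x: A(x) AND B(x) -> C(x), with a, b, c in [0, 1]
    being the (kernel-based) predicate outputs on a batch of objects."""
    p = a * b                          # product t-norm for the conjunction
    implication = 1.0 - p + p * c      # Reichenbach implication I(p, c)
    return (1.0 - implication).mean()  # 0 when the clause is fully satisfied

a, b = torch.rand(100), torch.rand(100)
c = torch.rand(100, requires_grad=True)
loss = clause_penalty(a, b, c)         # added to the fitting + regularization terms
loss.backward()
```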

ProPath: Disease-Specific Protein Language Model for Variant Pathogenicity

  • paper_url: http://arxiv.org/abs/2311.03429
  • repo_url: None
  • paper_authors: Huixin Zhan, Zijun Zhang
  • for: Predicting the pathogenicity of disease-associated genetic variants, a core challenge in clinical genetics.
  • methods: A disease-specific protein language model, ProPath, that captures the pseudo-log-likelihood ratio in rare missense variants through a siamese network.
  • results: ProPath surpasses the pre-trained ESM1b with an over 5% improvement in AUC and achieves the highest performance across all baselines on both datasets.
    Abstract Clinical variant classification of pathogenic versus benign genetic variants remains a pivotal challenge in clinical genetics. Recently, the proposition of protein language models has improved the generic variant effect prediction (VEP) accuracy via weakly-supervised or unsupervised training. However, these VEPs are not disease-specific, limiting their adaptation at point-of-care. To address this problem, we propose a disease-specific \textsc{pro}tein language model for variant \textsc{path}ogenicity, termed ProPath, to capture the pseudo-log-likelihood ratio in rare missense variants through a siamese network. We evaluate the performance of ProPath against pre-trained language models, using clinical variant sets in inherited cardiomyopathies and arrhythmias that were not seen during training. Our results demonstrate that ProPath surpasses the pre-trained ESM1b with an over $5\%$ improvement in AUC across both datasets. Furthermore, our model achieved the highest performances across all baselines for both datasets. Thus, our ProPath offers a potent disease-specific variant effect prediction, particularly valuable for disease associations and clinical applicability.
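The pseudo-log-likelihood ratio at a variant site can be sketched with any masked protein language model, as below; all helper callables are placeholders, and token position bookkeeping (e.g., BOS offsets) is ignored. ProPath fine-tunes this kind of signal disease-specifically with a siamese setup, which is not shown.

```python
import torch

def pllr(seq: str, pos: int, ref_aa: str, alt_aa: str,
         masked_lm, tokenize, aa_index) -> float:
    """Pseudo-log-likelihood ratio for one missense variant: a positive value
    means the model finds the alternate residue more plausible than the
    reference. `masked_lm`, `tokenize`, `aa_index` are hypothetical hooks
    onto an ESM-style masked LM."""
    tokens = tokenize(seq[:pos] + "<mask>" + seq[pos + 1:])
    with torch.no_grad():
        logits = masked_lm(tokens)                     # (seq_len, vocab)
    log_probs = torch.log_softmax(logits[pos], dim=-1)
    return (log_probs[aa_index[alt_aa]] - log_probs[aa_index[ref_aa]]).item()
```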

FLOGA: A machine learning ready dataset, a benchmark and a novel deep learning model for burnt area mapping with Sentinel-2

  • paper_url: http://arxiv.org/abs/2311.03339
  • repo_url: None
  • paper_authors: Maria Sdraka, Alkinoos Dimakos, Alexandros Malounis, Zisoula Ntasiou, Konstantinos Karantzalos, Dimitrios Michail, Ioannis Papoutsis
  • for: Providing an accurate and robust method for automatically extracting burnt areas from satellite imagery after wildfires.
  • methods: A machine-learning-ready dataset, FLOGA, with satellite imagery at different spatial and spectral resolutions and ground-truth annotations from domain experts; multiple machine learning and deep learning change-detection algorithms are compared, and a novel deep learning model, BAM-CD, is proposed.
  • results: BAM-CD outperforms all other methods in accuracy and robustness, providing an effective way to automatically extract burnt areas from satellite imagery.
    Abstract Over the last decade there has been an increasing frequency and intensity of wildfires across the globe, posing significant threats to human and animal lives, ecosystems, and socio-economic stability. Therefore urgent action is required to mitigate their devastating impact and safeguard Earth's natural resources. Robust Machine Learning methods combined with the abundance of high-resolution satellite imagery can provide accurate and timely mappings of the affected area in order to assess the scale of the event, identify the impacted assets and prioritize and allocate resources effectively for the proper restoration of the damaged region. In this work, we create and introduce a machine-learning ready dataset we name FLOGA (Forest wiLdfire Observations for the Greek Area). This dataset is unique as it comprises of satellite imagery acquired before and after a wildfire event, it contains information from Sentinel-2 and MODIS modalities with variable spatial and spectral resolution, and contains a large number of events where the corresponding burnt area ground truth has been annotated by domain experts. FLOGA covers the wider region of Greece, which is characterized by a Mediterranean landscape and climatic conditions. We use FLOGA to provide a thorough comparison of multiple Machine Learning and Deep Learning algorithms for the automatic extraction of burnt areas, approached as a change detection task. We also compare the results to those obtained using standard specialized spectral indices for burnt area mapping. Finally, we propose a novel Deep Learning model, namely BAM-CD. Our benchmark results demonstrate the efficacy of the proposed technique in the automatic extraction of burnt areas, outperforming all other methods in terms of accuracy and robustness. Our dataset and code are publicly available at: https://github.com/Orion-AI-Lab/FLOGA.
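Framed as change detection, the core forward pass is a siamese encoder over the pre- and post-fire images whose feature difference is decoded into a per-pixel burnt mask. The tiny network below only illustrates that shape (10 input bands is an assumption for Sentinel-2); it is not the BAM-CD architecture.

```python
import torch
import torch.nn as nn

class SiameseCD(nn.Module):
    def __init__(self, in_ch=10, feat=32):           # e.g., 10 Sentinel-2 bands
        super().__init__()
        self.enc = nn.Sequential(                    # shared encoder for both dates
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.dec = nn.Conv2d(feat, 1, 1)             # per-pixel burnt-area logits

    def forward(self, pre, post):
        return self.dec(torch.abs(self.enc(pre) - self.enc(post)))

model = SiameseCD()
pre, post = torch.randn(2, 10, 64, 64), torch.randn(2, 10, 64, 64)
logits = model(pre, post)                            # (2, 1, 64, 64)
```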

DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase

  • paper_url: http://arxiv.org/abs/2311.03319
  • repo_url: None
  • paper_authors: Dawei Li, Yaxuan Li, Dheeraj Mekala, Shuyao Li, Yulin wang, Xueqi Wang, William Hogan, Jingbo Shang
  • for: Achieving better in-context learning (ICL) results in low-resource settings.
  • methods: Exploiting large language models' familiarity with self-generated content by classifying self-paraphrases of the input, then deciding the final result by majority voting.
  • results: DAIL outperforms the standard ICL method and other ensemble-based methods in low-resource settings, and voting consistency can additionally serve as a confidence score.
    Abstract In-Context Learning (ICL) combined with pre-trained large language models has achieved promising results on various NLP tasks. However, ICL requires high-quality annotated demonstrations which might not be available in real-world scenarios. To overcome this limitation, we propose \textbf{D}ata \textbf{A}ugmentation for \textbf{I}n-Context \textbf{L}earning (\textbf{DAIL}). DAIL leverages the intuition that large language models are more familiar with the content generated by themselves. It first utilizes the language model to generate paraphrases of the test sample and employs majority voting to determine the final result based on individual predictions. Our extensive empirical evaluation shows that DAIL outperforms the standard ICL method and other ensemble-based methods in the low-resource scenario. Additionally, we explore the use of voting consistency as a confidence score of the model when the logits of predictions are inaccessible. We believe our work will stimulate further research on ICL in low-resource settings.
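The inference loop is straightforward to sketch: classify the original text plus a few self-paraphrases and majority-vote the label, with vote agreement doubling as a confidence score when logits are unavailable. `paraphrase` and `classify` are placeholder callables (an LLM paraphraser and an ICL classifier), not the paper's exact prompts.

```python
from collections import Counter

def dail_predict(text: str, paraphrase, classify, n_aug: int = 4):
    """DAIL-style ensemble inference over self-paraphrases."""
    inputs = [text] + [paraphrase(text) for _ in range(n_aug)]
    preds = [classify(x) for x in inputs]
    label, count = Counter(preds).most_common(1)[0]
    confidence = count / len(preds)      # voting consistency as a confidence proxy
    return label, confidence
```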

Neural Structure Learning with Stochastic Differential Equations

  • paper_url: http://arxiv.org/abs/2311.03309
  • repo_url: None
  • paper_authors: Benjie Wang, Joel Jennings, Wenbo Gong
  • for: Discovering the underlying relationships among variables from temporal observations, where system dynamics are best described by continuous-time stochastic processes.
  • methods: Neural stochastic differential equations (SDEs) combined with variational inference to infer a posterior distribution over possible structures.
  • results: SCOTCH learns better structures under both regular and irregular sampling intervals and outperforms baseline methods on synthetic and real-world data.
    Abstract Discovering the underlying relationships among variables from temporal observations has been a longstanding challenge in numerous scientific disciplines, including biology, finance, and climate science. The dynamics of such systems are often best described using continuous-time stochastic processes. Unfortunately, most existing structure learning approaches assume that the underlying process evolves in discrete-time and/or observations occur at regular time intervals. These mismatched assumptions can often lead to incorrect learned structures and models. In this work, we introduce a novel structure learning method, SCOTCH, which combines neural stochastic differential equations (SDE) with variational inference to infer a posterior distribution over possible structures. This continuous-time approach can naturally handle both learning from and predicting observations at arbitrary time points. Theoretically, we establish sufficient conditions for an SDE and SCOTCH to be structurally identifiable, and prove its consistency under infinite data limits. Empirically, we demonstrate that our approach leads to improved structure learning performance on both synthetic and real-world datasets compared to relevant baselines under regular and irregular sampling intervals.
    摘要 从时间观测数据中发现变量之间的潜在关系,是生物、金融和气候科学等众多科学领域长期存在的挑战。此类系统的动态通常最适合用连续时间随机过程来描述。然而,大多数现有的结构学习方法假设底层过程在离散时间中演化,且/或观测发生在固定的时间间隔上。这些不匹配的假设常常导致学习到错误的结构和模型。在这项工作中,我们介绍了一种新的结构学习方法 SCOTCH,它将神经随机微分方程(SDE)与变分推理相结合,以推断可能结构上的后验分布。这种连续时间方法可以自然地处理在任意时间点上的学习与预测。在理论上,我们建立了 SDE 与 SCOTCH 结构可识别的充分条件,并证明其在无穷数据极限下的一致性。在实验上,我们证明了与相关基线方法相比,SCOTCH 在规则与不规则采样间隔下,在合成和真实数据集上均能取得更好的结构学习性能。
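
To make the continuous-time setting above concrete, here is a toy Euler-Maruyama simulation of an SDE whose drift is masked by a candidate adjacency matrix, so variable j can only influence variable i where A[i, j] = 1. The linear-then-tanh drift and constant diffusion are simplifications for illustration; SCOTCH's actual neural SDE and variational posterior are more involved.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
A = np.array([[0, 1, 0],          # A[i, j] = 1 means x_j may influence dx_i
              [0, 0, 1],
              [0, 0, 0]])
W = rng.normal(size=(d, d))       # stand-in for a learned drift network

def drift(x):
    # Masking enforces the candidate structure: only parents contribute.
    return np.tanh((W * A) @ x)

def euler_maruyama(x0, dt=0.01, steps=500, sigma=0.1):
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        noise = rng.normal(size=d) * np.sqrt(dt) * sigma
        xs.append(x + drift(x) * dt + noise)
    return np.stack(xs)

traj = euler_maruyama(rng.normal(size=d))
print(traj.shape)  # (501, 3): states can be read off at arbitrary time points
```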

Learning Reusable Manipulation Strategies

  • paper_url: http://arxiv.org/abs/2311.03293
  • repo_url: None
  • paper_authors: Jiayuan Mao, Joshua B. Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling
  • for: 这篇论文的目的是帮助机器获得人类的操作“技巧”,包括通过单一示例和自我博弈来学习操作技能。
  • methods: 论文将每个示例解释为机器人-物体及物体-物体之间接触模式变化的序列,以此为支架学习连续参数的细粒度采样器。
  • results: 结果表明,机器人可以通过单一示例和自我博弈获得操作技能,并能在不同场景下灵活地组合应用这些技能。
    Abstract Humans demonstrate an impressive ability to acquire and generalize manipulation "tricks." Even from a single demonstration, such as using soup ladles to reach for distant objects, we can apply this skill to new scenarios involving different object positions, sizes, and categories (e.g., forks and hammers). Additionally, we can flexibly combine various skills to devise long-term plans. In this paper, we present a framework that enables machines to acquire such manipulation skills, referred to as "mechanisms," through a single demonstration and self-play. Our key insight lies in interpreting each demonstration as a sequence of changes in robot-object and object-object contact modes, which provides a scaffold for learning detailed samplers for continuous parameters. These learned mechanisms and samplers can be seamlessly integrated into standard task and motion planners, enabling their compositional use.
    摘要 人类展现出了获得和泛化操作“技巧”的惊人能力:即便只看过一次示例(例如用汤勺够取远处的物体),我们也能把这一技能应用到物体位置、大小和类别都不同的新场景中(如叉子和锤子),并能灵活地组合各种技能来制定长期计划。在这篇论文中,我们提出了一种框架,使机器能够通过单一示例和自我博弈来获得这类操作技能(我们称之为“机制”)。我们的关键见解在于,将每个示例解释为机器人-物体及物体-物体接触模式变化的序列,这为学习连续参数的细粒度采样器提供了支架。这些学习到的机制和采样器可以无缝集成到标准的任务与运动规划器中,从而实现组合式使用。

GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

  • paper_url: http://arxiv.org/abs/2311.03426
  • repo_url: None
  • paper_authors: Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad, Kangling Liu, Yang Liu
  • for: 这篇论文旨在解决大型 transformer 模型面临的问题,包括过度参数化和缓慢且计算代价高昂的预训练。
  • methods: 论文提出了一种通用的方法 GQKVA,利用对 query、key 和 value 进行分组的技术来加速 transformer 预训练,同时减小模型规模。
  • results: 实验表明,不同的 GQKVA 变体在性能与模型大小之间呈现出明显的权衡,可以根据资源和时间限制进行定制选择。实验还显示,传统的多头注意力并不总是最佳选择,存在更轻量、更快速的替代方案。在 ViT 的图像分类实验中,GQKVA 在模型缩小约 4% 的同时获得了约 0.3% 的精度提升;在最激进的模型缩减实验中,模型大小缩小约 15%,精度仅下降约 1%。
    Abstract Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
    摘要 巨大的 transformer 模型面临多个挑战,包括耗时且计算密集的预训练,以及过度参数化。这篇论文提出了一种通用的方法 GQKVA,将 query、key 和 value 的分组技术加以推广,旨在加速 transformer 预训练并减小模型规模。我们对多种 GQKVA 变体的实验表明,性能与模型大小之间存在明显的权衡,可以根据资源和时间限制进行定制选择。我们的发现还表明,传统的多头注意力方法并不总是最佳选择,因为存在更轻量、更快速的替代方案。我们在 ViT 上测试了该方法,在图像分类任务中实现了约 0.3% 的精度提升,同时将模型大小减小约 4%。此外,我们最激进的模型缩减实验表明,可以将模型大小减小约 15%,精度仅下降约 1%。
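
For intuition about query/key/value grouping, the sketch below implements one member of that design family: many query heads sharing a smaller set of key/value heads, which shrinks the K/V projections. This is a simplified grouped-attention layer written for illustration, not the paper's exact GQKVA formulation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedAttention(nn.Module):
    """n_q query heads share n_kv (< n_q) key/value heads, so the K/V
    projection is smaller than in full multi-head attention."""
    def __init__(self, dim=64, n_q=8, n_kv=2):
        super().__init__()
        assert n_q % n_kv == 0 and dim % n_q == 0
        self.n_q, self.n_kv, self.hd = n_q, n_kv, dim // n_q
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * n_kv * self.hd)  # reduced projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, D = x.shape
        q = self.q(x).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k, v = self.kv(x).view(B, T, 2, self.n_kv, self.hd).permute(2, 0, 3, 1, 4)
        rep = self.n_q // self.n_kv                 # share each kv head
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.hd**0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(B, T, D))

x = torch.randn(2, 16, 64)
print(GroupedAttention()(x).shape)  # torch.Size([2, 16, 64])
```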

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

  • paper_url: http://arxiv.org/abs/2311.03285
  • repo_url: https://github.com/s-lora/s-lora
  • paper_authors: Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica
  • For: The paper is written for the deployment of large language models using the “pretrain-then-finetune” paradigm, specifically focusing on the scalable serving of many LoRA adapters.
  • Methods: The paper proposes a system called S-LoRA, which stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. Unified Paging is used to manage dynamic adapter weights and KV cache tensors, and tensor parallelism and highly optimized custom CUDA kernels are employed for heterogeneous batching of LoRA computation.
  • Results: Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM, S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude, enabling scalable serving of many task-specific fine-tuned models and offering the potential for large-scale customized fine-tuning services.
    Abstract The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA
    摘要 “预训练-再微调”模式被广泛应用于大语言模型的部署。低秩适配(LoRA)是一种参数高效的微调方法,常用于将一个基础模型适配到多种任务,从而由同一个基础模型派生出大量 LoRA 适配器。我们观察到,这种模式为服务阶段的批量推理提供了重要机会。为了利用这些机会,我们提出了 S-LoRA,一个面向大规模 LoRA 适配器服务的系统。S-LoRA 将所有适配器存储在主内存中,并将当前运行查询所用的适配器取到 GPU 内存中。为了高效使用 GPU 内存并减少内存碎片,S-LoRA 提出了统一分页(Unified Paging)技术,用统一的内存池管理不同秩的动态适配器权重和不同序列长度的 KV 缓存张量。此外,S-LoRA 还采用了一种新的张量并行策略和高度优化的定制 CUDA 核函数,以实现 LoRA 计算的异构批处理。这些特性使 S-LoRA 能够在单个或多个 GPU 上以很小的开销服务数千个 LoRA 适配器。与 HuggingFace PEFT 和 vLLM(带有朴素的 LoRA 服务支持)等最先进的库相比,S-LoRA 可以将吞吐量提升至多 4 倍,并将可服务的适配器数量提升几个数量级。因此,S-LoRA 支持大规模地服务众多任务特定的微调模型,并为大规模定制微调服务提供了可能。代码可在 https://github.com/S-LoRA/S-LoRA 找到。
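
A minimal sketch of the heterogeneous batching idea: one shared GEMM for the base weight plus per-request low-rank deltas whose ranks may differ. S-LoRA fuses this into custom CUDA kernels and pages adapter weights between host and GPU memory; the dense per-request loop below only illustrates the computation being batched.

```python
import torch

def batched_lora(x, base_W, adapters, adapter_ids, scaling=1.0):
    """x: (batch, d_in); adapters: list of (A, B) with A: (d_in, r),
    B: (r, d_out); adapter_ids selects an adapter per request."""
    base_out = x @ base_W                        # one shared GEMM
    deltas = []
    for xi, aid in zip(x, adapter_ids):          # ranks may differ per adapter
        A, B = adapters[aid]
        deltas.append((xi @ A) @ B)              # low-rank delta per request
    return base_out + scaling * torch.stack(deltas)

d_in, d_out = 32, 16
base_W = torch.randn(d_in, d_out)
adapters = [(torch.randn(d_in, r), torch.randn(r, d_out)) for r in (4, 8, 16)]
x = torch.randn(5, d_in)
out = batched_lora(x, base_W, adapters, adapter_ids=[0, 2, 1, 0, 2])
print(out.shape)  # torch.Size([5, 16])
```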

An AI-Guided Data Centric Strategy to Detect and Mitigate Biases in Healthcare Datasets

  • paper_url: http://arxiv.org/abs/2311.03425
  • repo_url: None
  • paper_authors: Faris F. Gulamali, Ashwin S. Sawant, Lora Liharska, Carol R. Horowitz, Lili Chan, Patricia H. Kovatch, Ira Hofer, Karandeep Singh, Lynne D. Richardson, Emmanuel Mensah, Alexander W Charney, David L. Reich, Jianying Hu, Girish N. Nadkarni
  • for: 这篇论文旨在探讨医疗领域中诊断和预测算法的使用可能会延续对弱势群体的偏见,以及如何使用深度学习方法来检测并缓解这种偏见。
  • methods: 本论文提出了一种数据中心、模型无关、任务无关的方法来评估数据集偏见,通过分析不同群体在小样本大小下学习的关系(AEquity)来识别和修正健康域数据集中的种族偏见。
  • results: 研究发现,通过应用AEquity指标,可以在健康域数据集中识别和修正种族偏见,并在抑制偏见方面取得了显著的成果。
    Abstract The adoption of diagnosis and prognostic algorithms in healthcare has led to concerns about the perpetuation of bias against disadvantaged groups of individuals. Deep learning methods to detect and mitigate bias have revolved around modifying models, optimization strategies, and threshold calibration with varying levels of success. Here, we generate a data-centric, model-agnostic, task-agnostic approach to evaluate dataset bias by investigating the relationship between how easily different groups are learned at small sample sizes (AEquity). We then apply a systematic analysis of AEq values across subpopulations to identify and mitigate manifestations of racial bias in two known cases in healthcare - Chest X-rays diagnosis with deep convolutional neural networks and healthcare utilization prediction with multivariate logistic regression. AEq is a novel and broadly applicable metric that can be applied to advance equity by diagnosing and remediating bias in healthcare datasets.
    摘要 随着医疗领域对诊断和预测算法的采用,人们开始担忧这会延续针对弱势群体的偏见。现有的深度学习偏见检测与缓解方法多围绕修改模型、优化策略和阈值校准展开,取得的成功程度不一。在这项工作中,我们提出了一种以数据为中心、与模型和任务无关的方法,通过考察不同群体在小样本量下被学习的难易程度(AEquity)来评估数据集偏见。随后,我们对各子人群的 AEq 值进行系统性分析,以识别并缓解医疗领域两个已知案例中的种族偏见:使用深度卷积神经网络的胸部 X 光诊断,以及使用多元逻辑回归的医疗资源利用预测。AEquity 是一种新颖且适用面广的指标,可用于诊断和修复医疗数据集中的偏见,从而促进公平。
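
The abstract defines AEquity in terms of how easily different groups are learned at small sample sizes. The sketch below traces per-group learning curves (test AUC vs. training-set size) on synthetic data where one group carries a weaker signal; the mean-AUC summary at the end is a placeholder aggregation of mine, not the paper's AEq formula.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def learning_curve(X, y, sizes=(16, 32, 64, 128), seed=0):
    """Test AUC of a simple model trained on increasing sample sizes."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    test = idx[-200:]
    aucs = []
    for n in sizes:
        clf = LogisticRegression(max_iter=1000).fit(X[idx[:n]], y[idx[:n]])
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return np.array(aucs)

# Synthetic data where group 1 has a weaker feature/label signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
g = rng.integers(0, 2, size=2000)
signal = np.where(g == 0, 2.0, 0.5) * X[:, 0]
y = (signal + rng.normal(size=2000) > 0).astype(int)

for grp in (0, 1):
    m = g == grp
    print(f"group {grp}: mean small-sample AUC = {learning_curve(X[m], y[m]).mean():.3f}")
```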

Using Symmetries to Lift Satisfiability Checking

  • paper_url: http://arxiv.org/abs/2311.03424
  • repo_url: None
  • paper_authors: Pierre Carbonnelle, Gottfried Schenner, Maurice Bruynooghe, Bart Bogaerts, Marc Denecker
  • for: 这种方法可以将结构(也称为解释)压缩到一个更小的论域中,而不会损失信息。
  • methods: 该方法先将句子自动翻译成一个定义在“提升”词汇表上的等价可满足句子,该词汇表允许论域压缩;随后在提升后的词汇表上对该句子进行可满足性检查。
  • results: 实验表明,该方法在生成式配置问题上可以获得大幅加速,并且还可应用于操作复杂数据结构的软件的验证。
    Abstract We analyze how symmetries can be used to compress structures (also known as interpretations) onto a smaller domain without loss of information. This analysis suggests the possibility to solve satisfiability problems in the compressed domain for better performance. Thus, we propose a 2-step novel method: (i) the sentence to be satisfied is automatically translated into an equisatisfiable sentence over a ``lifted'' vocabulary that allows domain compression; (ii) satisfiability of the lifted sentence is checked by growing the (initially unknown) compressed domain until a satisfying structure is found. The key issue is to ensure that this satisfying structure can always be expanded into an uncompressed structure that satisfies the original sentence to be satisfied. We present an adequate translation for sentences in typed first-order logic extended with aggregates. Our experimental evaluation shows large speedups for generative configuration problems. The method also has applications in the verification of software operating on complex data structures. Further refinements of the translation are left for future work.
    摘要 我们分析了如何利用对称性将结构(也称为解释)压缩到更小的论域上而不损失信息。这一分析表明,可以在压缩后的论域中求解可满足性问题,以获得更好的性能。因此,我们提出了一种新的两步方法:(i)将待满足的句子自动翻译成一个定义在允许论域压缩的“提升”词汇表上的等价可满足句子;(ii)通过不断扩大(初始未知的)压缩论域,直到找到一个满足结构,来检查提升后句子的可满足性。关键在于确保该满足结构总能扩展为一个满足原始句子的未压缩结构。我们为带聚合的类型化一阶逻辑句子给出了适当的翻译。我们的实验评估表明,该方法在生成式配置问题上可以获得大幅加速。该方法还可应用于操作复杂数据结构的软件的验证。对翻译的进一步改进留待未来工作。

From Coupled Oscillators to Graph Neural Networks: Reducing Over-smoothing via a Kuramoto Model-based Approach

  • paper_url: http://arxiv.org/abs/2311.03260
  • repo_url: None
  • paper_authors: Tuan Nguyen, Tan M. Nguyen, Hirotada Honda, Takashi Sano, Vinh Nguyen, Shugo Nakamura
  • for: 针对图神经网络(GNN)中的过平滑问题,提出了 Kuramoto 图神经网络(KuramotoGNN),一种新的连续深度 GNN。
  • methods: KuramotoGNN 借助 Kuramoto 模型来缓解过平滑问题;Kuramoto 模型刻画了非线性耦合振荡器的同步行为。
  • results: 在多种图深度学习基准任务上的实验表明,KuramotoGNN 可以减轻过平滑问题,并且相比基线 GNN 和现有方法具有更好的性能。
    Abstract We propose the Kuramoto Graph Neural Network (KuramotoGNN), a novel class of continuous-depth graph neural networks (GNNs) that employs the Kuramoto model to mitigate the over-smoothing phenomenon, in which node features in GNNs become indistinguishable as the number of layers increases. The Kuramoto model captures the synchronization behavior of non-linear coupled oscillators. Under the view of coupled oscillators, we first show the connection between Kuramoto model and basic GNN and then over-smoothing phenomenon in GNNs can be interpreted as phase synchronization in Kuramoto model. The KuramotoGNN replaces this phase synchronization with frequency synchronization to prevent the node features from converging into each other while allowing the system to reach a stable synchronized state. We experimentally verify the advantages of the KuramotoGNN over the baseline GNNs and existing methods in reducing over-smoothing on various graph deep learning benchmark tasks.
    摘要 我们提出了 Kuramoto 图神经网络(KuramotoGNN),一类新的连续深度图神经网络(GNN),它利用 Kuramoto 模型来缓解过平滑现象,即 GNN 中节点特征随层数增加而变得无法区分的问题。Kuramoto 模型刻画了非线性耦合振荡器的同步行为。从耦合振荡器的视角出发,我们首先展示了 Kuramoto 模型与基本 GNN 之间的联系,进而说明 GNN 中的过平滑现象可以被解释为 Kuramoto 模型中的相位同步。KuramotoGNN 用频率同步取代相位同步,以防止节点特征彼此收敛,同时允许系统达到稳定的同步状态。我们通过实验验证了 KuramotoGNN 相比基线 GNN 和现有方法在多种图深度学习基准任务上减轻过平滑方面的优势。
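
The coupled-oscillator picture above is easy to reproduce: the Kuramoto dynamics are dθ_i/dt = ω_i + K Σ_j A_ij sin(θ_j − θ_i). With identical natural frequencies the phases synchronize (order parameter near 1, the over-smoothing analogue); heterogeneous frequencies, the ingredient KuramotoGNN builds on, keep them apart. The coupling strength and step count below are arbitrary demo values.

```python
import numpy as np

def kuramoto_step(theta, omega, A, K=1.0, dt=0.01):
    """One Euler step of d(theta_i)/dt = omega_i + K * sum_j A_ij sin(theta_j - theta_i)."""
    coupling = (A * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    return theta + dt * (omega + K * coupling)

def order_parameter(theta):
    """|r| = 1 means full phase synchronization."""
    return np.abs(np.exp(1j * theta).mean())

rng = np.random.default_rng(0)
n = 20
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                  # random undirected graph
theta0 = rng.uniform(0, 2 * np.pi, n)

# Identical frequencies drive phases together; distinct natural
# frequencies keep the oscillators (node features) apart.
for omega in (np.zeros(n), rng.normal(0, 2.0, n)):
    th = theta0.copy()
    for _ in range(5000):
        th = kuramoto_step(th, omega, A, K=2.0)
    print(f"order parameter: {order_parameter(th):.3f}")
```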

Coherent Entity Disambiguation via Modeling Topic and Categorical Dependency

  • paper_url: http://arxiv.org/abs/2311.03253
  • repo_url: None
  • paper_authors: Zilin Xiao, Linjun Shou, Xingyao Zhang, Jie Wu, Ming Gong, Jian Pei, Daxin Jiang
  • for: The paper aims to improve the coherence of entity disambiguation (ED) predictions by proposing a novel system called CoherentED.
  • methods: CoherentED uses an unsupervised variational autoencoder (VAE) to extract latent topic vectors of context sentences, and incorporates an external category memory to retrieve relevant categories for undecided mentions. The system also employs step-by-step entity decisions to model entity-entity interactions and maintain maximum coherence at the category level.
  • results: The proposed CoherentED model achieves new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points, particularly excelling in long-text scenarios.
    Abstract Previous entity disambiguation (ED) methods adopt a discriminative paradigm, where prediction is made based on matching scores between mention context and candidate entities using length-limited encoders. However, these methods often struggle to capture explicit discourse-level dependencies, resulting in incoherent predictions at the abstract level (e.g. topic or category). We propose CoherentED, an ED system equipped with novel designs aimed at enhancing the coherence of entity predictions. Our method first introduces an unsupervised variational autoencoder (VAE) to extract latent topic vectors of context sentences. This approach not only allows the encoder to handle longer documents more effectively, conserves valuable input space, but also keeps a topic-level coherence. Additionally, we incorporate an external category memory, enabling the system to retrieve relevant categories for undecided mentions. By employing step-by-step entity decisions, this design facilitates the modeling of entity-entity interactions, thereby maintaining maximum coherence at the category level. We achieve new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points. Our model demonstrates particularly outstanding performance on challenging long-text scenarios.
    摘要 Previous entity disambiguation (ED) methods often fail to capture explicit discourse-level dependencies, resulting in incoherent predictions at the abstract level. To address this issue, we propose CoherentED, an ED system with novel designs that enhance the coherence of entity predictions. Our approach includes:
    1. An unsupervised variational autoencoder (VAE) that extracts latent topic vectors of context sentences. This allows the encoder to handle longer documents more effectively, conserve valuable input space, and maintain topic-level coherence.
    2. An external category memory that retrieves relevant categories for undecided mentions. This design facilitates the modeling of entity-entity interactions and maintains maximum coherence at the category level.
    3. Step-by-step entity decisions that model entity-entity interactions and keep predictions coherent at the category level.
    Our model achieves new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points. It particularly excels in challenging long-text scenarios.

Instructed Language Models with Retrievers Are Powerful Entity Linkers

  • paper_url: http://arxiv.org/abs/2311.03250
  • repo_url: https://github.com/mrzilinxiao/insgenentitylinking
  • paper_authors: Zilin Xiao, Ming Gong, Jie Wu, Xingyao Zhang, Linjun Shou, Jian Pei, Daxin Jiang
  • for: 提高语言模型在实体链接任务中的性能,使其能够准确地预测知识库中的实体。
  • methods: 提出了一种新的生成式实体链接方法,包括通过序列到序列训练目标和指令微调使语言模型具备实体链接能力,以及一种轻量级的潜在提及检索器,以减轻模型繁重且难以并行的解码负担。
  • results: 与前一代生成方法进行比较,INSGENEL表现出了+6.8 F1分的提升,同时具有更高的训练数据效率和训练计算资源利用率。
    Abstract Generative approaches powered by large language models (LLMs) have demonstrated emergent abilities in tasks that require complex reasoning abilities. Yet the generative nature still makes the generated content suffer from hallucinations, thus unsuitable for entity-centric tasks like entity linking (EL) requiring precise entity predictions over a large knowledge base. We present Instructed Generative Entity Linker (INSGENEL), the first approach that enables casual language models to perform entity linking over knowledge bases. Several methods to equip language models with EL capability were proposed in this work, including (i) a sequence-to-sequence training EL objective with instruction-tuning, (ii) a novel generative EL framework based on a light-weight potential mention retriever that frees the model from heavy and non-parallelizable decoding, achieving 4$\times$ speedup without compromise on linking metrics. INSGENEL outperforms previous generative alternatives with +6.8 F1 points gain on average, also with a huge advantage in training data efficiency and training compute consumption. In addition, our skillfully engineered in-context learning (ICL) framework for EL still lags behind INSGENEL significantly, reaffirming that the EL task remains a persistent hurdle for general LLMs.
    摘要 由大型语言模型(LLM)驱动的生成方法,已在需要复杂推理能力的任务中展示出涌现能力。然而,其生成特性使得生成内容仍会出现幻觉,因此不适合实体链接(EL)这类以实体为中心、需要在大规模知识库上进行精确实体预测的任务。我们提出了首个使语言模型能够在知识库上执行实体链接的方法——Instructed Generative Entity Linker(INSGENEL)。本文提出了多种赋予语言模型 EL 能力的方法,包括:(i)结合指令微调的序列到序列 EL 训练目标;(ii)一种基于轻量级潜在提及检索器的新型生成式 EL 框架,使模型摆脱繁重且难以并行的解码,在不牺牲链接指标的情况下实现了 4 倍加速。INSGENEL 比此前的生成式方法平均提升 6.8 个 F1 分,同时在训练数据效率和训练计算消耗方面具有巨大优势。此外,我们精心设计的用于 EL 的上下文学习(ICL)框架仍明显落后于 INSGENEL,这再次证明 EL 任务对通用 LLM 而言仍是一个持续的挑战。

Advancing Post Hoc Case Based Explanation with Feature Highlighting

  • paper_url: http://arxiv.org/abs/2311.03246
  • repo_url: None
  • paper_authors: Eoin Kenny, Eoin Delaney, Mark Keane
  • for: 该论文的目的是提出一种新的可解释AI(XAI)技术,用于辅助人类和AI系统之间的合作。
  • methods: 该论文使用了两种通用算法(基于潜在表示和基于超像素的算法)来分离测试图像中的多个清晰特征部分,并将其与训练数据中的相关案例相连接,从而提供更全面的解释。
  • results: 论文的结果表明,与只显示解释而不进行特征高亮相比,该方法能够在真实世界的 ImageNet 数据上恰当地校准用户对模糊分类“正确性”的感受。
    Abstract Explainable AI (XAI) has been proposed as a valuable tool to assist in downstream tasks involving human and AI collaboration. Perhaps the most psychologically valid XAI techniques are case based approaches which display 'whole' exemplars to explain the predictions of black box AI systems. However, for such post hoc XAI methods dealing with images, there has been no attempt to improve their scope by using multiple clear feature 'parts' of the images to explain the predictions while linking back to relevant cases in the training data, thus allowing for more comprehensive explanations that are faithful to the underlying model. Here, we address this gap by proposing two general algorithms (latent and super pixel based) which can isolate multiple clear feature parts in a test image, and then connect them to the explanatory cases found in the training data, before testing their effectiveness in a carefully designed user study. Results demonstrate that the proposed approach appropriately calibrates a users feelings of 'correctness' for ambiguous classifications in real world data on the ImageNet dataset, an effect which does not happen when just showing the explanation without feature highlighting.
    摘要 可解释人工智能(XAI)被认为是辅助人机协作下游任务的有价值工具。从心理学角度看,最有效的 XAI 技术或许是基于案例的方法,即通过展示“完整”的示例来解释黑盒 AI 系统的预测。然而,对于处理图像的此类事后 XAI 方法,此前没有工作尝试利用图像中多个清晰的特征“部分”来解释预测,并将其关联回训练数据中的相关案例,从而提供更全面且忠实于底层模型的解释。在这里,我们通过提出两种通用算法(基于潜在表示和基于超像素)来填补这一空白:它们可以在测试图像中分离出多个清晰的特征部分,再将其与训练数据中的解释案例相连接,并在精心设计的用户研究中检验其有效性。结果表明,对于 ImageNet 数据集上真实世界数据的模糊分类,所提出的方法能够恰当地校准用户对“正确性”的感受,而只显示解释、不进行特征高亮时则不会产生这种效果。

An Efficient Self-Supervised Cross-View Training For Sentence Embedding

  • paper_url: http://arxiv.org/abs/2311.03228
  • repo_url: https://github.com/mrpeerat/sct
  • paper_authors: Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong
  • for: 提高小型语言模型(PLM)的自监sentence表示学习性能。
  • methods: 提出了一种名为Self-supervised Cross-View Training(SCT)的框架,用于缩小大小PLM之间性能差异。
  • results: SCT在7个Semantic Textual Similarity(STS)benchmark上比基eline和state-of-the-art竞争对手perform better,特别是对PLM的参数量小于100M的情况下。
    Abstract Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that STC outperforms the competitors for PLMs with less than 100M parameters in 18 of 21 cases.
    摘要 自监督句子表示学习的任务,是在不依赖人工标注的情况下构建句子的嵌入空间。一种直接的方法是使用对比学习等表示学习方法对预训练语言模型(PLM)进行微调。虽然这种方法在较大的 PLM 上表现出色,但随着参数量减少,性能会迅速下降。在这篇论文中,我们提出了一个名为自监督跨视图训练(SCT)的框架,以缩小大型与小型 PLM 之间的性能差距。为评估 SCT 的效果,我们在 7 个语义文本相似度(STS)基准上,使用参数量从 4M 到 340M 的 5 个 PLM,将其与 5 个基线和最先进的竞争方法进行比较。实验结果显示,对于参数量少于 100M 的 PLM,SCT 在 21 个案例中的 18 个里优于竞争方法。
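
The abstract does not spell out SCT's training objective, so the sketch below shows the generic in-batch contrastive (InfoNCE) loss over two views of each sentence embedding, a common starting point for self-supervised sentence representation methods; it should be read as background, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.05):
    """In-batch contrastive loss over two views of the same sentences.
    z1, z2: (batch, dim) embeddings; row i of z1 should match row i of z2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(z1.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, targets)

# Two "views" could come from different dropout masks or different models.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce(z1, z2).item())
```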

LDM3D-VR: Latent Diffusion Model for 3D VR

  • paper_url: http://arxiv.org/abs/2311.03226
  • repo_url: None
  • paper_authors: Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, Vasudev Lal
  • for: 这篇论文提出了一套面向虚拟现实开发的潜在扩散模型,包括 LDM3D-pano 和 LDM3D-SR 两个模型:前者根据文本提示生成全景 RGBD 图像,后者将低分辨率输入上采样为高分辨率 RGBD 图像。
  • methods: 这两个模型都是在包含全景/高分辨率 RGB 图像、深度图和文字说明的数据集上,对现有预训练模型进行微调得到的。
  • results: 两个模型均与现有相关方法进行了对比评估:LDM3D-pano 可根据文本提示生成全景 RGBD 图像,LDM3D-SR 可将低分辨率 RGBD 输入上采样为高分辨率图像。
    Abstract Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
    摘要 潜在扩散模型已被证明在视觉输出的生成与操作方面达到了最先进水平。然而,据我们所知,与 RGB 联合生成深度图的研究仍然有限。我们介绍 LDM3D-VR,一套面向虚拟现实开发的扩散模型,包括 LDM3D-pano 和 LDM3D-SR。这两个模型分别支持根据文本提示生成全景 RGBD,以及将低分辨率输入上采样为高分辨率 RGBD。我们的模型是在包含全景/高分辨率 RGB 图像、深度图和文字说明的数据集上,由现有预训练模型微调而来。两个模型均与现有相关方法进行了对比评估。

ALYMPICS: Language Agents Meet Game Theory

  • paper_url: http://arxiv.org/abs/2311.03220
  • repo_url: None
  • paper_authors: Shaoguang Mao, Yuzhe Cai, Yan Xia, Wenshan Wu, Xun Wang, Fengyi Wang, Tao Ge, Furu Wei
  • for: 这篇论文旨在探讨大语言模型(LLM)代理在博弈论研究中的应用。
  • methods: 该论文使用 LLM 驱动的自主代理来模拟人类行为,并实现多代理协作,以构建人类交互的真实且动态的模型。
  • results: 通过调控资源可用性和代理个性,我们观察了不同代理如何参与竞争并调整策略。LLM 代理为博弈论研究带来显著优势,包括模拟真实行为,以及提供可控、可扩展且可复现的实验环境。
    Abstract This paper introduces Alympics, a platform that leverages Large Language Model (LLM) agents to facilitate investigations in game theory. By employing LLMs and autonomous agents to simulate human behavior and enable multi-agent collaborations, we can construct realistic and dynamic models of human interactions for game theory hypothesis formulating and testing. To demonstrate this, we present and implement a survival game involving unequal competition for limited resources. Through manipulation of resource availability and agent personalities, we observe how different agents engage in the competition and adapt their strategies. The use of LLM agents in game theory research offers significant advantages, including simulating realistic behavior, providing a controlled, scalable, and reproducible environment. Our work highlights the potential of LLM agents in enhancing the understanding of strategic decision-making within complex socioeconomic contexts. All codes will be made public soon.
    摘要 这篇论文介绍了 Alympics 平台,该平台利用大语言模型(LLM)代理来促进博弈论研究。通过使用 LLM 和自主代理模拟人类行为并实现多代理协作,我们可以构建真实且动态的人类互动模型,用于博弈论假设的提出与检验。为了证明这一点,我们实现了一个围绕有限资源展开不平等竞争的生存游戏。通过调整资源可用性和代理个性,我们观察到不同的代理如何参与竞争并调整策略。在博弈论研究中使用 LLM 代理具有显著优势,包括模拟真实行为,以及提供可控、可扩展和可复现的环境。我们的工作突显了 LLM 代理在理解复杂社会经济背景下战略决策方面的潜力。所有代码即将公开。

Mini Minds: Exploring Bebeshka and Zlata Baby Models

  • paper_url: http://arxiv.org/abs/2311.03216
  • repo_url: https://github.com/upunaprosk/small-language-models
  • paper_authors: Irina Proskurina, Guillaume Metzler, Julien Velcin
  • for: 这篇论文介绍了里昂第二大学提交给 BabyLM 竞赛 Strict-Small 赛道的系统。该共享任务强调在有限数据上从头进行小规模语言建模,并与人类语言习得相对照;其数据集包含 1000 万词,与儿童的词汇接触量相当。
  • methods: 我们进行了架构搜索,在共享任务数据上最小化掩码语言建模损失以寻找最优配置,并由此提出了两个小型语言模型(LM):带 8 个注意力头的 4 层编码器 Bebeshka,和带 12 个注意力头的 6 层解码器 Zlata。尽管这两个模型的规模仅为基线 LM 的一半,它们仍取得了相当的性能。
  • results: 我们发现,这两个小型语言模型在涉及道德判断的语言理解任务中表现出色,其预测与人类价值观保持对齐。这些发现表明,小型语言模型在实际语言理解任务中具有应用潜力。
    Abstract In this paper, we describe the University of Lyon 2 submission to the Strict-Small track of the BabyLM competition. The shared task is created with an emphasis on small-scale language modelling from scratch on limited-size data and human language acquisition. Dataset released for the Strict-Small track has 10M words, which is comparable to children's vocabulary size. We approach the task with an architecture search, minimizing masked language modelling loss on the data of the shared task. Having found an optimal configuration, we introduce two small-size language models (LMs) that were submitted for evaluation, a 4-layer encoder with 8 attention heads and a 6-layer decoder model with 12 heads which we term Bebeshka and Zlata, respectively. Despite being half the scale of the baseline LMs, our proposed models achieve comparable performance. We further explore the applicability of small-scale language models in tasks involving moral judgment, aligning their predictions with human values. These findings highlight the potential of compact LMs in addressing practical language understanding tasks.
    摘要 在这篇论文中,我们描述了里昂第二大学对 BabyLM 竞赛 Strict-Small 赛道的提交。该共享任务强调在有限规模数据上从头进行小规模语言建模,并与人类语言习得相对照。Strict-Small 赛道发布的数据集包含 1000 万词,与儿童的词汇接触量相当。我们通过架构搜索来处理该任务,在共享任务数据上最小化掩码语言建模损失。在找到最优配置后,我们提出了两个提交评估的小型语言模型(LM):带 8 个注意力头的 4 层编码器模型 Bebeshka,以及带 12 个注意力头的 6 层解码器模型 Zlata。尽管规模仅为基线 LM 的一半,我们提出的模型仍取得了相当的性能。我们进一步探索了小规模语言模型在道德判断任务中的适用性,使其预测与人类价值观对齐。这些发现突显了紧凑 LM 在解决实际语言理解任务方面的潜力。

Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2311.03196
  • repo_url: https://github.com/hishab-nlp/pseudo-labeling-for-domain-agnostic-bangla-asr
  • paper_authors: Rabindra Nath Nandi, Mehadi Hasan Menon, Tareq Al Muntasir, Sagor Sarker, Quazi Sarwar Muhtaseem, Md. Tariqul Islam, Shammur Absar Chowdhury, Firoj Alam
  • For: The paper aims to develop a large-scale domain-agnostic automatic speech recognition (ASR) dataset for low-resource languages, specifically Bangla.* Methods: The proposed methodology uses pseudo-labeling to develop a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios.* Results: The developed ASR system is benchmarked with publicly available datasets and compared with other available models, demonstrating its efficacy on a human-annotated domain-agnostic test set composed of news, telephony, and conversational data.
    Abstract One of the major challenges for developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data among others. Our results demonstrate the efficacy of the model trained on psuedo-label data for the designed test-set along with publicly-available Bangla datasets. The experimental resources will be publicly available.(https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR)
    摘要 开发低资源语言自动语音识别(ASR)系统的一个主要挑战,是难以获得涵盖领域特定变化的标注数据。在本研究中,我们提出了一种伪标注方法,用于构建大规模、领域无关的 ASR 数据集。借助该方法,我们构建了超过 2 万小时的标注孟加拉语语音数据集,涵盖多样的话题、说话风格、方言、噪声环境和对话场景。随后,我们利用该语料库设计了一个基于 Conformer 的 ASR 系统。我们用公开可用的数据集对训练好的 ASR 进行了基准测试,并与其他可用模型进行比较。为考察其有效性,我们设计并构建了一个由新闻、电话和对话等数据组成的人工标注、领域无关的测试集。结果表明,在该测试集以及公开可用的孟加拉语数据集上,基于伪标签数据训练的模型均表现出色。实验资源将公开提供。(参考:https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR)
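
A minimal sketch of the pseudo-labeling loop described above: a seed recognizer transcribes unlabeled audio and only confident hypotheses are kept as training targets. The `transcribe` callable and the 0.9 confidence threshold are illustrative assumptions, not the paper's filtering rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PseudoLabel:
    audio_path: str
    transcript: str
    confidence: float

def pseudo_label(paths: List[str],
                 transcribe: Callable[[str], Tuple[str, float]],
                 min_conf: float = 0.9) -> List[PseudoLabel]:
    """Run a seed ASR model over unlabeled audio and keep only confident
    hypotheses as training targets for the next training round."""
    keep = []
    for p in paths:
        text, conf = transcribe(p)
        if conf >= min_conf:
            keep.append(PseudoLabel(p, text, conf))
    return keep

# Stub recognizer for demonstration.
mock_asr = lambda p: (f"transcript of {p}", 0.95 if "clean" in p else 0.6)
labels = pseudo_label(["clean_001.wav", "noisy_002.wav"], mock_asr)
print([l.audio_path for l in labels])  # only the confident utterance survives
```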

Nexus at ArAIEval Shared Task: Fine-Tuning Arabic Language Models for Propaganda and Disinformation Detection

  • paper_url: http://arxiv.org/abs/2311.03184
  • repo_url: None
  • paper_authors: Yunze Xiao, Firoj Alam
  • for: 本研究旨在探讨自媒体内容中假信息和宣传内容的扩散,以及这些内容如何影响社会稳定和公众的信任。
  • methods: 本研究采用了基于 Transformer 模型的微调,以及零样本/少样本学习和 GPT-4 的使用。
  • results: 本研究在 ArAIEval 共享任务的两个子任务中分别获得第 9 名和第 10 名。
    Abstract The spread of disinformation and propagandistic content poses a threat to societal harmony, undermining informed decision-making and trust in reliable sources. Online platforms often serve as breeding grounds for such content, and malicious actors exploit the vulnerabilities of audiences to shape public opinion. Although there have been research efforts aimed at the automatic identification of disinformation and propaganda in social media content, there remain challenges in terms of performance. The ArAIEval shared task aims to further research on these particular issues within the context of the Arabic language. In this paper, we discuss our participation in these shared tasks. We competed in subtasks 1A and 2A, where our submitted system secured positions 9th and 10th, respectively. Our experiments consist of fine-tuning transformer models and using zero- and few-shot learning with GPT-4.
    摘要 虚假信息和宣传内容的传播威胁社会和谐,损害知情决策以及对可靠信息来源的信任。在线平台经常成为此类内容的温床,恶意行为者会利用受众的弱点来左右公众舆论。尽管已有研究致力于自动识别社交媒体内容中的虚假信息和宣传,但在性能方面仍存在挑战。ArAIEval 共享任务旨在推动阿拉伯语环境下针对这些问题的研究。在这篇论文中,我们介绍了我们在这些共享任务中的参与情况。我们参加了子任务 1A 和 2A,提交的系统分别获得第 9 名和第 10 名。我们的实验包括微调 Transformer 模型,以及使用 GPT-4 进行零样本和少样本学习。

ArAIEval Shared Task: Persuasion Techniques and Disinformation Detection in Arabic Text

  • paper_url: http://arxiv.org/abs/2311.03179
  • repo_url: None
  • paper_authors: Maram Hasanain, Firoj Alam, Hamdy Mubarak, Samir Abdaljalil, Wajdi Zaghouani, Preslav Nakov, Giovanni Da San Martino, Abed Alhakim Freihat
  • For: The paper is written for the first ArabicNLP 2023 conference and is focused on the ArAIEval shared task, which covers persuasion technique detection and disinformation detection in Arabic text.
  • Methods: The paper reports that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems.
  • Results: The paper provides a description of the task setup, including the dataset construction and evaluation setup, and gives a brief overview of the participating systems. All datasets and evaluation scripts from the shared task are released to the research community.
    Abstract We present an overview of the ArAIEval shared task, organized as part of the first ArabicNLP 2023 conference co-located with EMNLP 2023. ArAIEval offers two tasks over Arabic text: (i) persuasion technique detection, focusing on identifying persuasion techniques in tweets and news articles, and (ii) disinformation detection in binary and multiclass setups over tweets. A total of 20 teams participated in the final evaluation phase, with 14 and 16 teams participating in Tasks 1 and 2, respectively. Across both tasks, we observed that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems. We provide a description of the task setup, including a description of the dataset construction and the evaluation setup. We further give a brief overview of the participating systems. All datasets and evaluation scripts from the shared task are released to the research community. (https://araieval.gitlab.io/) We hope this will enable further research on these important tasks in Arabic.
    摘要 我们概述了作为 2023 年首届阿拉伯语自然语言处理会议(ArabicNLP 2023,与 EMNLP 2023 同址举办)一部分组织的 ArAIEval 共享任务。ArAIEval 针对阿拉伯语文本提供了两个任务:(i)说服技巧检测,侧重于识别推文和新闻文章中的说服技巧;(ii)在二分类和多分类设置下对推文进行虚假信息检测。共有 20 支队伍参加了最终评估阶段,其中任务 1 和任务 2 分别有 14 支和 16 支队伍参与。在两个任务中,大多数参赛系统的核心都是微调 AraBERT 等 Transformer 模型。我们描述了任务设置,包括数据集构建和评估设置,并简要概述了参赛系统。共享任务的所有数据集和评估脚本均已向研究社区发布(https://araieval.gitlab.io/)。我们希望这将促进针对阿拉伯语这些重要任务的进一步研究。

1D-Convolutional transformer for Parkinson disease diagnosis from gait

  • paper_url: http://arxiv.org/abs/2311.03177
  • repo_url: https://github.com/safwennaimi/1d-convolutional-transformer-for-parkinson-disease-diagnosis-from-gait
  • paper_authors: Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau
  • for: 这项研究的目的是使用深度神经网络模型,基于步态数据诊断帕金森病。
  • methods: 研究使用混合 ConvNet-Transformer 架构,通过同时捕捉局部特征和长期时空依赖关系,准确判断疾病的严重程度阶段。
  • results: 结果表明,该混合架构能够从步态数据准确识别帕金森病的不同阶段,准确率达 88%,超过了其他最先进的 AI 方法。此外,该方法还可推广到其他分类问题,以联合解决一维信号中的特征相关性与时空依赖问题。
    Abstract This paper presents an efficient deep neural network model for diagnosing Parkinson's disease from gait. More specifically, we introduce a hybrid ConvNet-Transformer architecture to accurately diagnose the disease by detecting the severity stage. The proposed architecture exploits the strengths of both Convolutional Neural Networks and Transformers in a single end-to-end model, where the former is able to extract relevant local features from Vertical Ground Reaction Force (VGRF) signal, while the latter allows to capture long-term spatio-temporal dependencies in data. In this manner, our hybrid architecture achieves an improved performance compared to using either models individually. Our experimental results show that our approach is effective for detecting the different stages of Parkinson's disease from gait data, with a final accuracy of 88%, outperforming other state-of-the-art AI methods on the Physionet gait dataset. Moreover, our method can be generalized and adapted for other classification problems to jointly address the feature relevance and spatio-temporal dependency problems in 1D signals. Our source code and pre-trained models are publicly available at https://github.com/SafwenNaimi/1D-Convolutional-transformer-for-Parkinson-disease-diagnosis-from-gait.
    摘要 本文提出了一种高效的深度神经网络模型,用于基于步态诊断帕金森病。更具体地说,我们引入了一种混合 ConvNet-Transformer 架构,通过检测疾病的严重程度阶段来进行准确诊断。该架构在单一的端到端模型中结合了卷积神经网络与 Transformer 的优势:前者能够从垂直地面反作用力(VGRF)信号中提取相关的局部特征,后者则能捕捉数据中的长期时空依赖关系。因此,我们的混合架构取得了优于单独使用任一模型的性能。实验结果表明,我们的方法能够有效地从步态数据中检测帕金森病的不同阶段,最终准确率为 88%,在 Physionet 步态数据集上超过了其他最先进的 AI 方法。此外,我们的方法可以推广和适配到其他分类问题,以联合解决一维信号中的特征相关性与时空依赖问题。我们的源代码和预训练模型公开于 https://github.com/SafwenNaimi/1D-Convolutional-transformer-for-Parkinson-disease-diagnosis-from-gait。
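
A small PyTorch sketch of the hybrid design described above: a Conv1d front end extracts local features from a multichannel VGRF signal, and a Transformer encoder models long-range dependencies before a severity-stage classifier head. The channel count, layer sizes, and number of classes are illustrative, not the authors' configuration.

```python
import torch
from torch import nn

class ConvTransformer1D(nn.Module):
    """Conv1d front end for local features + Transformer encoder for
    long-range temporal dependencies over a 1D gait signal."""
    def __init__(self, in_ch=18, d_model=64, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                 # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)  # -> (batch, time', d_model)
        h = self.encoder(h).mean(dim=1)   # pool over time
        return self.head(h)               # severity-stage logits

model = ConvTransformer1D()
print(model(torch.randn(2, 18, 1000)).shape)  # torch.Size([2, 4])
```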

Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs

  • paper_url: http://arxiv.org/abs/2311.03127
  • repo_url: None
  • paper_authors: Longyue Wang, Zhaopeng Tu, Yan Gu, Siyou Liu, Dian Yu, Qingsong Ma, Chenyang Lyu, Liting Zhou, Chao-Hong Liu, Yufeng Ma, Weiyu Chen, Yvette Graham, Bonnie Webber, Philipp Koehn, Andy Way, Yulin Yuan, Shuming Shi
  • for: 本研究旨在探讨机器翻译中文学翻译任务的问题,以促进该领域的进展。
  • methods: 作者发布了一个新的篇章级语料库,并采用一套业界认可的评估标准来评价参赛系统的性能。
  • results: 研究得到了一系列有趣的发现,涉及文学与篇章感知机器翻译的问题,以及若干可能有助于解决这些问题的策略。
    Abstract Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2023, the first edition of the Discourse-Level Literary Translation. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted and document-level Chinese-English web novel corpus. Furthermore, we put forth an industry-endorsed criteria to guide human evaluation process. This year, we totally received 14 submissions from 7 academia and industry teams. We employ both automatic and human evaluations to measure the performance of the submitted systems. The official ranking of the systems is based on the overall human judgments. In addition, our extensive analysis reveals a series of interesting findings on literary and discourse-aware MT. We release data, system outputs, and leaderboard at http://www2.statmt.org/wmt23/literary-translation-task.html.
    摘要 文学作品的翻译一直是机器翻译(MT)中一个难以企及的梦想,一段充满复杂挑战的旅程。为推动该领域的进展,我们在 WMT 2023 上组织了一项新的共享任务,即第一届篇章级文学翻译任务。首先,我们(腾讯 AI Lab 与阅文集团)发布了一个受版权保护的篇章级中英网络小说语料库。其次,我们提出了业界认可的标准来指导人工评估过程。今年,我们共收到了来自 7 支学术界和工业界队伍的 14 份提交。我们同时采用自动评估和人工评估来衡量提交系统的性能,系统的官方排名以整体人工评判为准。此外,我们的深入分析还揭示了一系列关于文学与篇章感知机器翻译的有趣发现。我们在 http://www2.statmt.org/wmt23/literary-translation-task.html 上发布了数据、系统输出和排行榜。

Pelvic floor MRI segmentation based on semi-supervised deep learning

  • paper_url: http://arxiv.org/abs/2311.03105
  • repo_url: None
  • paper_authors: Jianwei Zuo, Fei Feng, Zhuhui Wang, James A. Ashton-Miller, John O. L. Delancey, Jiajia Luo
  • for: 这篇论文的目的是提出一个用于盆底器官分割的半监督框架。
  • methods: 该框架包括两个阶段:第一阶段先以图像恢复任务进行自监督预训练,然后使用标注数据微调,得到分割模型;第二阶段利用该分割模型为无标注数据生成伪标签。最后,标注数据和无标注数据被共同用于半监督训练。
  • results: 评估表明,该方法能够提升盆底器官的分割性能;对于难以分割的器官(如子宫),语义分割精度最多可提升 3.70%。
    Abstract The semantic segmentation of pelvic organs via MRI has important clinical significance. Recently, deep learning-enabled semantic segmentation has facilitated the three-dimensional geometric reconstruction of pelvic floor organs, providing clinicians with accurate and intuitive diagnostic results. However, the task of labeling pelvic floor MRI segmentation, typically performed by clinicians, is labor-intensive and costly, leading to a scarcity of labels. Insufficient segmentation labels limit the precise segmentation and reconstruction of pelvic floor organs. To address these issues, we propose a semi-supervised framework for pelvic organ segmentation. The implementation of this framework comprises two stages. In the first stage, it performs self-supervised pre-training using image restoration tasks. Subsequently, fine-tuning of the self-supervised model is performed, using labeled data to train the segmentation model. In the second stage, the self-supervised segmentation model is used to generate pseudo labels for unlabeled data. Ultimately, both labeled and unlabeled data are utilized in semi-supervised training. Upon evaluation, our method significantly enhances the performance in the semantic segmentation and geometric reconstruction of pelvic organs, Dice coefficient can increase by 2.65% averagely. Especially for organs that are difficult to segment, such as the uterus, the accuracy of semantic segmentation can be improved by up to 3.70%.
    摘要 基于 MRI 的盆腔器官语义分割具有重要的临床意义。近年来,借助深度学习的语义分割促成了盆底器官的三维几何重建,为临床医生提供了准确、直观的诊断结果。然而,盆底 MRI 分割的标注工作通常由临床医生完成,劳动强度大、成本高,导致标签稀缺。分割标签不足限制了盆底器官的精确分割与重建。为解决这些问题,我们提出了一种用于盆腔器官分割的半监督框架。该框架的实现包括两个阶段:第一阶段以图像恢复任务进行自监督预训练,随后使用标注数据对自监督模型进行微调,训练出分割模型;第二阶段利用该分割模型为无标注数据生成伪标签,最终将标注数据与无标注数据共同用于半监督训练。经评估,我们的方法显著提升了盆腔器官语义分割与几何重建的性能,Dice 系数平均可提升 2.65%;对于难以分割的器官(如子宫),语义分割精度最多可提升 3.70%。

A Simple yet Efficient Ensemble Approach for AI-generated Text Detection

  • paper_url: http://arxiv.org/abs/2311.03084
  • repo_url: None
  • paper_authors: Harika Abburi, Kalyani Roy, Michael Suesserman, Nirmala Pudota, Balaji Veeramani, Edward Bowen, Sanmitra Bhattacharya
  • for: 本研究旨在开发一种自动区分人工生成文本与人类撰写文本的方法,以防范大语言模型(LLM)的潜在滥用,如虚假新闻生成、垃圾邮件制造和学术作业中的违规使用。
  • methods: 我们提出了一种简单而高效的方案,通过集成多个组成 LLM 的预测来实现。与先前的方法相比,我们的精简集成方法仅使用两个组成 LLM,即可达到相当的性能。
  • results: 我们在四个生成文本分类基准上进行了实验,结果表明,与先前的最先进方法相比,我们的方法带来 0.5% 至 100% 的性能提升。此外,我们还研究了各个 LLM 的训练数据对模型性能的影响,发现用 Falcon、LLaMA2、MPT 等其他开源语言模型生成的数据替换带有商业使用限制的 GPT 数据是可行的替代方案。最后,通过在英文作文数据集上的实验,我们证明了该集成方法能够有效处理新数据。
    Abstract Recent Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing across wide range of styles and genres. However, such capabilities are prone to potential abuse, such as fake news generation, spam email creation, and misuse in academic assignments. Hence, it is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text. In this paper, we propose a simple yet efficient solution to this problem by ensembling predictions from multiple constituent LLMs. Compared to previous state-of-the-art approaches, which are perplexity-based or uses ensembles with a number of LLMs, our condensed ensembling approach uses only two constituent LLMs to achieve comparable performance. Experiments conducted on four benchmark datasets for generative text classification show performance improvements in the range of 0.5 to 100\% compared to previous state-of-the-art approaches. We also study the influence that the training data from individual LLMs have on model performance. We found that substituting commercially-restrictive Generative Pre-trained Transformer (GPT) data with data generated from other open language models such as Falcon, Large Language Model Meta AI (LLaMA2), and Mosaic Pretrained Transformers (MPT) is a feasible alternative when developing generative text detectors. Furthermore, to demonstrate zero-shot generalization, we experimented with an English essays dataset, and results suggest that our ensembling approach can handle new data effectively.
    摘要 现代大语言模型(LLM)在生成文本方面表现出惊人能力,能够生成在各种风格和体裁上都接近人类写作的文本。然而,这些能力也容易被滥用,如生成虚假新闻、制造垃圾邮件以及在学术作业中违规使用。因此,构建能自动区分人工生成文本与人类撰写文本的方法变得非常重要。在这篇论文中,我们提出了一种简单而高效的解决方案,通过集成多个组成 LLM 的预测来实现。与先前基于困惑度或使用大量 LLM 集成的最先进方法相比,我们的精简集成方法只需使用两个组成 LLM,即可达到相当的性能。在四个生成文本分类基准数据集上的实验取得了相对先前最先进方法 0.5% 至 100% 的性能提升。我们还研究了各个 LLM 的训练数据对模型性能的影响,发现可以将带有商业使用限制的 GPT 数据替换为 Falcon、LLaMA2 和 MPT 等其他开源语言模型生成的数据,这是一种可行的做法。此外,我们进行了零样本泛化实验,结果表明我们的集成方法可以有效地处理新的数据。
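
The simplest instance of the condensed two-model ensemble is averaging the per-document probabilities of the two constituent detectors and thresholding. The equal weighting below is an assumption for illustration; the paper's exact combination rule may differ.

```python
import numpy as np

def ensemble_detect(p_model_a: np.ndarray, p_model_b: np.ndarray,
                    weight: float = 0.5) -> np.ndarray:
    """Weighted average of the AI-generated-text probabilities produced
    by two constituent detectors."""
    return weight * p_model_a + (1.0 - weight) * p_model_b

# Per-document P(AI-generated) from two hypothetical detectors.
p_a = np.array([0.91, 0.12, 0.55])
p_b = np.array([0.80, 0.30, 0.40])
scores = ensemble_detect(p_a, p_b)
print((scores > 0.5).astype(int))  # [1 0 0] -> predicted labels
```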

SugarViT – Multi-objective Regression of UAV Images with Vision Transformers and Deep Label Distribution Learning Demonstrated on Disease Severity Prediction in Sugar Beet

  • paper_url: http://arxiv.org/abs/2311.03076
  • repo_url: None
  • paper_authors: Maurice Günder, Facundo Ramón Ispizua Yamati, Abel Andree Barreto Alcántara, Anne-Katrin Mahlein, Rafet Sifa, Christian Bauckhage
  • for: 这项研究旨在开发一个基于人工智能的植物性状标注框架,用于对甜菜尾孢叶斑病(Cercospora Leaf Spot,CLS)的严重程度进行大规模自动评分。
  • methods: 研究结合深度标签分布学习(DLDL)、特殊的损失函数和定制的模型架构,开发了一个基于视觉 Transformer 的病害严重度评分模型 SugarViT。
  • results: 研究得到了一个可靠的病害严重度评分模型,并且能够将遥感数据与试验点的环境参数相融合来预测病害严重度。该模型还可应用于多种基于图像的分类和回归任务。
    Abstract Remote sensing and artificial intelligence are pivotal technologies of precision agriculture nowadays. The efficient retrieval of large-scale field imagery combined with machine learning techniques shows success in various tasks like phenotyping, weeding, cropping, and disease control. This work will introduce a machine learning framework for automatized large-scale plant-specific trait annotation for the use case disease severity scoring for Cercospora Leaf Spot (CLS) in sugar beet. With concepts of Deep Label Distribution Learning (DLDL), special loss functions, and a tailored model architecture, we develop an efficient Vision Transformer based model for disease severity scoring called SugarViT. One novelty in this work is the combination of remote sensing data with environmental parameters of the experimental sites for disease severity prediction. Although the model is evaluated on this special use case, it is held as generic as possible to also be applicable to various image-based classification and regression tasks. With our framework, it is even possible to learn models on multi-objective problems as we show by a pretraining on environmental metadata.
    摘要 遥感与人工智能是当今精准农业的关键技术。大规模田间影像的高效获取,结合机器学习技术,已在表型分析、除草、作物管理和病害防控等多种任务中取得成功。本文介绍一种用于自动化大规模植物性状标注的机器学习框架,应用场景为甜菜尾孢叶斑病(CLS)的严重度评分。借助深度标签分布学习(DLDL)的思想、特殊的损失函数和定制的模型架构,我们开发了一个基于视觉 Transformer 的高效病害严重度评分模型 SugarViT。本工作的一个创新点是将遥感数据与试验点的环境参数相结合来进行病害严重度预测。尽管模型是在这一特定应用场景上评估的,但其设计尽可能通用,也可应用于各种基于图像的分类和回归任务。我们还展示了借助环境元数据进行预训练,该框架甚至能够学习多目标问题。
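
The DLDL component can be sketched directly: a scalar severity annotation is softened into a discrete Gaussian distribution over severity bins, and the model is trained with a KL-divergence loss between its predicted distribution and that target. Bin range, sigma, and the expectation-based decoding below are illustrative choices, not SugarViT's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def label_distribution(y: torch.Tensor, bins: torch.Tensor, sigma: float = 5.0):
    """Turn a scalar severity score into a discrete Gaussian distribution
    over severity bins (the core DLDL idea)."""
    d = torch.exp(-0.5 * ((bins[None, :] - y[:, None]) / sigma) ** 2)
    return d / d.sum(dim=1, keepdim=True)

def dldl_loss(logits: torch.Tensor, target_dist: torch.Tensor):
    """KL divergence between predicted and target label distributions."""
    return F.kl_div(F.log_softmax(logits, dim=1), target_dist,
                    reduction="batchmean")

bins = torch.linspace(0, 100, steps=101)       # severity 0..100%
y = torch.tensor([12.0, 55.0])                 # annotated severities
target = label_distribution(y, bins)
logits = torch.randn(2, 101)                   # model output (placeholder)
print(dldl_loss(logits, target).item())
# Expectation over bins recovers a point estimate at inference time:
print((F.softmax(logits, dim=1) * bins).sum(dim=1))
```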

Distributed Agent-Based Collaborative Learning in Cross-Individual Wearable Sensor-Based Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2311.04236
  • repo_url: None
  • paper_authors: Ahmad Esmaeili, Zahra Ghorrati, Eric T. Matson
  • For: This paper is written for the field of personalized and context-aware Human Activity Recognition, with a focus on developing scalable, adaptable, and privacy-conscious methodologies using multi-agent systems.
  • Methods: The paper introduces a collaborative distributed learning approach rooted in multi-agent principles, where individual users of sensor-equipped devices function as agents within a distributed network, collectively contributing to the process of learning and classifying human activities.
  • Results: The proposed approach has been empirically tested on two publicly accessible human activity recognition datasets, showing the efficacy of inter-individual collaborative learning compared to centralized configurations, with both local and global generalization.
    Abstract The rapid growth of wearable sensor technologies holds substantial promise for the field of personalized and context-aware Human Activity Recognition. Given the inherently decentralized nature of data sources within this domain, the utilization of multi-agent systems with their inherent decentralization capabilities presents an opportunity to facilitate the development of scalable, adaptable, and privacy-conscious methodologies. This paper introduces a collaborative distributed learning approach rooted in multi-agent principles, wherein individual users of sensor-equipped devices function as agents within a distributed network, collectively contributing to the comprehensive process of learning and classifying human activities. In this proposed methodology, not only is the privacy of activity monitoring data upheld for each individual, eliminating the need for an external server to oversee the learning process, but the system also exhibits the potential to surmount the limitations of conventional centralized models and adapt to the unique attributes of each user. The proposed approach has been empirically tested on two publicly accessible human activity recognition datasets, specifically PAMAP2 and HARTH, across varying settings. The provided empirical results conclusively highlight the efficacy of inter-individual collaborative learning when contrasted with centralized configurations, both in terms of local and global generalization.
    摘要 可穿戴传感器技术的快速发展,为个性化、情境感知的人类活动识别带来了巨大前景。鉴于该领域数据来源本质上的去中心化特性,利用具备天然去中心化能力的多代理系统,有助于开发可扩展、可适配且注重隐私的方法。本文提出了一种基于多代理原则的协作式分布学习方法:配备传感器设备的各个用户作为分布式网络中的代理,共同参与人类活动学习与分类的完整过程。在该方法中,不仅每个个体的活动监测数据隐私得到保护,无需外部服务器监管学习过程,系统还有潜力突破传统中心化模型的局限,并适配每个用户的独特特征。该方法在两个公开的人类活动识别数据集(PAMAP2 和 HARTH)上、在多种设定下进行了实证检验。实验结果充分表明,无论在局部还是全局泛化方面,个体间协作学习均优于中心化配置。

Maximal Consistent Subsystems of Max-T Fuzzy Relational Equations

  • paper_url: http://arxiv.org/abs/2311.03059
  • repo_url: None
  • paper_authors: Ismaïl Baaj
  • for: 这篇论文研究形如 $A \Box_{T}^{\max} x = b$ 的 $\max-T$ 模糊关系方程组的不一致性,其中 $T$ 为 $\min$、乘积或 Lukasiewicz t-范数之一。
  • methods: 对于不一致的 $\max-T$ 方程组,我们直接构造一个规范的最大一致子系统(相对于包含序)。所用的主要工具是计算与不一致 $\max-T$ 方程组相关的 Chebyshev 距离 $\Delta = \inf_{c \in \mathcal{C}} \Vert b - c \Vert$ 的解析公式,其中 $\mathcal{C}$ 为以同一矩阵 $A$ 定义的一致方程组的右端向量集合。
  • results: 基于同一解析公式,对于不一致的 $\max-\min$ 方程组,我们给出了一种获得其全部一致子系统的高效方法,并展示了如何迭代地得到其全部最大一致子系统。
    Abstract In this article, we study the inconsistency of a system of $\max-T$ fuzzy relational equations of the form $A \Box_{T}^{\max} x = b$, where $T$ is a t-norm among $\min$, the product or Lukasiewicz's t-norm. For an inconsistent $\max-T$ system, we directly construct a canonical maximal consistent subsystem (w.r.t the inclusion order). The main tool used to obtain it is the analytical formula which compute the Chebyshev distance $\Delta = \inf_{c \in \mathcal{C} \Vert b - c \Vert$ associated to the inconsistent $\max-T$ system, where $\mathcal{C}$ is the set of second members of consistent systems defined with the same matrix $A$. Based on the same analytical formula, we give, for an inconsistent $\max-\min$ system, an efficient method to obtain all its consistent subsystems, and we show how to iteratively get all its maximal consistent subsystems.
    摘要 在这篇文章中,我们研究形如 $A \Box_{T}^{\max} x = b$ 的 $\max-T$ 模糊关系方程组的不一致性,其中 $T$ 为 $\min$、乘积或 Lukasiewicz t-范数之一。对于不一致的 $\max-T$ 方程组,我们直接构造一个规范的最大一致子系统(相对于包含序)。所用的主要工具,是计算与不一致 $\max-T$ 方程组相关的 Chebyshev 距离 $\Delta = \inf_{c \in \mathcal{C}} \Vert b - c \Vert$ 的解析公式,其中 $\mathcal{C}$ 为以同一矩阵 $A$ 定义的一致方程组的右端向量集合。基于同一解析公式,对于不一致的 $\max-\min$ 方程组,我们给出了获得其全部一致子系统的高效方法,并展示了如何迭代地得到其全部最大一致子系统。
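
As background for the max-min case (T = min), consistency of A □ x = b can be checked classically: the greatest candidate solution is x̂_j = min_i (a_ij → b_i) with the Goedel implication, and the system is consistent iff composing x̂ back reproduces b. This standard test is context for the paper, which goes further and builds maximal consistent subsystems of inconsistent systems via the Chebyshev distance.

```python
import numpy as np

def goedel_imp(a, b):
    """Goedel implication: a -> b = 1 if a <= b else b."""
    return np.where(a <= b, 1.0, b)

def max_min_compose(A, x):
    """(A box x)_i = max_j min(a_ij, x_j)."""
    return np.minimum(A, x[None, :]).max(axis=1)

def greatest_solution(A, b):
    """Greatest candidate solution of the max-min system A box x = b;
    the system is consistent iff composing it back reproduces b."""
    return goedel_imp(A, b[:, None]).min(axis=0)

A = np.array([[0.8, 0.3],
              [0.5, 0.9]])
for b in (np.array([0.3, 0.5]), np.array([0.9, 0.2])):
    x_hat = greatest_solution(A, b)
    consistent = np.allclose(max_min_compose(A, x_hat), b)
    print(f"b={b}, x_hat={x_hat}, consistent={consistent}")
```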

LitSumm: Large language models for literature summarisation of non-coding RNAs

  • paper_url: http://arxiv.org/abs/2311.03056
  • repo_url: https://github.com/rnacentral/litscan-summarization
  • paper_authors: Andrew Green, Carlos Ribas, Nancy Ontiveros-Palacios, Anton I. Petrov, Alex Bateman, Blake Sweeney
  • for: 本研究旨在应对生命科学文献策展面临的挑战:文献发表速度持续增长,而策展人员数量有限,难以覆盖所有相关文献。
  • methods: 本研究使用大型语言模型(LLM),通过商用 LLM 以及一系列链式提示和检查,为非编码 RNA 生成高质量、事实准确的文献摘要。
  • results: 研究表明,只要配合精心的提示和自动检查,现有的 LLM 即可自动生成高质量的非编码 RNA 文献摘要;人工评估显示大多数摘要质量极高,而常用的自动评估方法与人工评估并不相关。最后,研究将该工具应用于 4,600 多个 ncRNA,并通过 RNAcentral 资源公开提供生成的摘要。
    Abstract Motivation: Curation of literature in life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have resources to scale to the whole relevant literature and all have to prioritise their efforts. Results: In this work, we take a first step to alleviating the lack of curator time in RNA science by generating summaries of literature for non-coding RNAs using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be automatically generated from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, with the majority being rated extremely high quality. We also applied the most commonly used automated evaluation approaches, finding that they do not correlate with human assessment. Finally, we apply our tool to a selection of over 4,600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarization is feasible with the current generation of LLMs, provided careful prompting and automated checking are applied. Availability: Code used to produce these summaries can be found here: https://github.com/RNAcentral/litscan-summarization and the dataset of contexts and summaries can be found here: https://huggingface.co/datasets/RNAcentral/litsumm-v1. Summaries are also displayed on the RNA report pages in RNAcentral (https://rnacentral.org/)
    摘要 目的:生命科学文献的策展是一项日益严峻的挑战。发表速度持续上升,而全球策展人员的数量相对固定,这对生物医学知识库的开发者构成了重大挑战。极少有知识库具备覆盖全部相关文献的资源,所有知识库都必须对工作进行优先级排序。结果:在这项工作中,我们迈出了缓解 RNA 科学领域策展人力不足的第一步,使用大型语言模型(LLM)为非编码 RNA 生成文献摘要。我们证明,利用商用 LLM 以及一系列链式提示和检查,可以从文献中自动生成高质量、事实准确且引用无误的摘要。我们对部分摘要进行了人工评估,绝大多数被评为质量极高。我们还应用了最常用的自动评估方法,发现它们与人工评估并不相关。最后,我们将该工具应用于 4,600 多个 ncRNA,并通过 RNAcentral 资源公开提供生成的摘要。我们的结论是:只要配合仔细的提示设计和自动检查,利用当前一代 LLM 进行自动文献摘要是可行的。可用性:生成这些摘要的代码见 https://github.com/RNAcentral/litscan-summarization,上下文与摘要数据集见 https://huggingface.co/datasets/RNAcentral/litsumm-v1。摘要同时展示在 RNAcentral 的 RNA 报告页面上(https://rnacentral.org/)。

Masking Hyperspectral Imaging Data with Pretrained Models

  • paper_url: http://arxiv.org/abs/2311.03053
  • repo_url: https://github.com/hifexplo/masking
  • paper_authors: Elias Arbash, Andréa de Lima Ribeiro, Sam Thiele, Nina Gnann, Behnood Rasti, Margret Fuchs, Pedram Ghamisi, Richard Gloaguen
  • for: 提升高光谱数据处理的性能,消除不需要的背景区域及其未知光谱特征的影响。
  • methods: 提出了一种图像分割方法,基于 Segment Anything Model(SAM)和零样本 Grounding DINO 目标检测器,并辅以交集与排除过滤步骤,无需微调或重新训练。
  • results: 在三个要求精确掩膜的复杂应用场景(塑料碎屑特征化、钻芯扫描和垃圾监测)中取得了明显改善,包括计算成本、内存需求和整体性能。
    Abstract The presence of undesired background areas associated with potential noise and unknown spectral characteristics degrades the performance of hyperspectral data processing. Masking out unwanted regions is key to addressing this issue. Processing only regions of interest yields notable improvements in terms of computational costs, required memory, and overall performance. The proposed processing pipeline encompasses two fundamental parts: regions of interest mask generation, followed by the application of hyperspectral data processing techniques solely on the newly masked hyperspectral cube. The novelty of our work lies in the methodology adopted for the preliminary image segmentation. We employ the Segment Anything Model (SAM) to extract all objects within the dataset, and subsequently refine the segments with a zero-shot Grounding Dino object detector, followed by intersection and exclusion filtering steps, without the need for fine-tuning or retraining. To illustrate the efficacy of the masking procedure, the proposed method is deployed on three challenging applications scenarios that demand accurate masking; shredded plastics characterization, drill core scanning, and litter monitoring. The numerical evaluation of the proposed masking method on the three applications is provided along with the used hyperparameters. The scripts for the method will be available at https://github.com/hifexplo/Masking.
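A minimal sketch of the intersection-and-exclusion filtering applied to SAM masks against detector boxes; the overlap threshold, the boolean-mask format, and integer pixel boxes are assumptions, not the released implementation.

```python
import numpy as np

def filter_masks(sam_masks: list[np.ndarray], keep_boxes: list[tuple],
                 min_overlap: float = 0.5) -> np.ndarray:
    """Intersection step: keep SAM masks that lie mostly inside some detector
    box. Exclusion step: everything else is dropped. Returns the union mask.
    sam_masks: boolean HxW arrays; keep_boxes: integer (x0, y0, x1, y1)."""
    roi = np.zeros_like(sam_masks[0], dtype=bool)
    for mask in sam_masks:
        area = mask.sum()
        if area == 0:
            continue
        for x0, y0, x1, y1 in keep_boxes:
            if mask[y0:y1, x0:x1].sum() / area >= min_overlap:
                roi |= mask
                break
    return roi  # then zero out the background: cube[~roi] = 0
```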

Grouping Local Process Models

  • paper_url: http://arxiv.org/abs/2311.03040
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Viki Peeva, Wil M. P. van der Aalst
  • for: Propose a three-step pipeline for grouping similar Local Process Models (LPMs) to address the problems of model explosion and model repetition.
  • methods: Group LPMs using various process model similarity measures.
  • results: Experiments show that grouping reduces the number of models while improving their precision and readability.
    Abstract In recent years, process mining emerged as a proven technology to analyze and improve operational processes. An expanding range of organizations using process mining in their daily operation brings a broader spectrum of processes to be analyzed. Some of these processes are highly unstructured, making it difficult for traditional process discovery approaches to discover a start-to-end model describing the entire process. Therefore, the subdiscipline of Local Process Model (LPM) discovery tries to build a set of LPMs, i.e., smaller models that explain sub-behaviors of the process. However, like other pattern mining approaches, LPM discovery algorithms also face the problems of model explosion and model repetition, i.e., the algorithms may create hundreds if not thousands of models, and subsets of them are close in structure or behavior. This work proposes a three-step pipeline for grouping similar LPMs using various process model similarity measures. We demonstrate the usefulness of grouping through a real-life case study, and analyze the impact of different measures, the gravity of repetition in the discovered LPMs, and how it improves after grouping on multiple real event logs.
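As a rough illustration of the grouping step, the sketch below clusters LPMs hierarchically from a precomputed pairwise similarity matrix; the particular similarity measure and the cut threshold are placeholders, since the paper compares several measures.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_lpms(similarity: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """similarity: (n, n) matrix from some process-model similarity measure,
    values in [0, 1]. Returns one group label per LPM."""
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)              # squareform expects this
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")  # agglomerative clustering
    return fcluster(tree, t=threshold, criterion="distance")
```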

GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

  • paper_url: http://arxiv.org/abs/2311.03035
  • repo_url: https://github.com/ackesnal/gtp-vit
  • paper_authors: Xuwei Xu, Sen Wang, Yudong Chen, Yanping Zheng, Zhewei Wei, Jiajun Liu
  • for: Improve the efficiency of pre-trained ViTs on resource-constrained devices for fast image inference.
  • methods: A graph-based token reduction method, Graph-based Token Propagation (GTP), which propagates the information of less important tokens to the connected, more important ones before dropping them.
  • results: Extensive experiments on ImageNet-1K show that GTP reduces the computational complexity of DeiT-S and DeiT-B while maintaining performance; notably, GTP delivers faster inference across various backbones without finetuning.
    Abstract Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging due to high computational demands. To expedite pre-trained ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in the computation. However, these methods still have some limitations, such as image information loss from pruned tokens and inefficiency in the token-matching process. In this paper, we introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs. Inspired by graph summarization algorithms, GTP meticulously propagates less significant tokens' information to spatially and semantically connected tokens that are of greater importance. Consequently, the remaining few tokens serve as a summarization of the entire token graph, allowing the method to reduce computational complexity while preserving essential information of eliminated tokens. Combined with an innovative token selection strategy, GTP can efficiently identify image tokens to be propagated. Extensive experiments have validated GTP's effectiveness, demonstrating both efficiency and performance improvements. Specifically, GTP decreases the computational complexity of both DeiT-S and DeiT-B by up to 26% with only a minimal 0.3% accuracy drop on ImageNet-1K without finetuning, and remarkably surpasses the state-of-the-art token merging method on various backbones at an even faster inference speed. The source code is available at https://github.com/Ackesnal/GTP-ViT.
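A toy sketch of the core propagation step: the least important tokens hand their features to connected kept tokens and are then dropped, so the survivors summarize the whole token graph. The importance scores, token graph, and keep count are assumed inputs here; GTP's actual selection strategy is in the repository.

```python
import torch

def propagate_tokens(x: torch.Tensor, importance: torch.Tensor,
                     adj: torch.Tensor, keep: int) -> torch.Tensor:
    """x: (N, D) tokens; importance: (N,); adj: (N, N) token-graph weights."""
    order = importance.argsort(descending=True)
    kept, dropped = order[:keep], order[keep:]
    # normalized weights from each dropped token to the kept tokens
    w = adj[dropped][:, kept]
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)
    # kept tokens absorb a weighted summary of the dropped tokens' features
    return x[kept] + w.t() @ x[dropped]

x, imp, adj = torch.randn(196, 384), torch.rand(196), torch.rand(196, 196)
out = propagate_tokens(x, imp, adj, keep=98)   # 196 ViT tokens -> 98
```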

Beyond Words: A Mathematical Framework for Interpreting Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03033
  • repo_url: None
  • paper_authors: Javier González, Aditya V. Nori
  • for: This paper aims to provide a mathematical framework for understanding and improving large language models (LLMs).
  • methods: The paper proposes a framework called Hex, which clarifies key terms and concepts in LLM research and offers a precise and consistent way to characterize LLMs.
  • results: The paper differentiates chain-of-thought reasoning from chain-of-thought prompting and establishes the conditions under which they are equivalent. The paper argues that its formal definitions and results are crucial for advancing the discussion on how to build generative AI systems that are safe, reliable, fair, and robust.
    Abstract Large language models (LLMs) are powerful AI tools that can generate and comprehend natural language text and other complex information. However, the field lacks a mathematical framework to systematically describe, compare and improve LLMs. We propose Hex a framework that clarifies key terms and concepts in LLM research, such as hallucinations, alignment, self-verification and chain-of-thought reasoning. The Hex framework offers a precise and consistent way to characterize LLMs, identify their strengths and weaknesses, and integrate new findings. Using Hex, we differentiate chain-of-thought reasoning from chain-of-thought prompting and establish the conditions under which they are equivalent. This distinction clarifies the basic assumptions behind chain-of-thought prompting and its implications for methods that use it, such as self-verification and prompt programming. Our goal is to provide a formal framework for LLMs that can help both researchers and practitioners explore new possibilities for generative AI. We do not claim to have a definitive solution, but rather a tool for opening up new research avenues. We argue that our formal definitions and results are crucial for advancing the discussion on how to build generative AI systems that are safe, reliable, fair and robust, especially in domains like healthcare and software engineering.

Federated Learning for Clinical Structured Data: A Benchmark Comparison of Engineering and Statistical Approaches

  • paper_url: http://arxiv.org/abs/2311.03417
  • repo_url: https://github.com/nliulab/fl-benchmark
  • paper_authors: Siqi Li, Di Miao, Qiming Wu, Chuan Hong, Danny D’Agostino, Xin Li, Yilin Ning, Yuqing Shang, Huazhu Fu, Marcus Eng Hock Ong, Hamed Haddadi, Nan Liu
  • for: Safeguard data privacy in healthcare collaborations.
  • methods: Benchmark comparison of federated learning (FL) frameworks from the engineering and statistical domains.
  • results: Statistical FL algorithms provide less biased point estimates, while engineering-based methods can generate more accurate predictions.
    Abstract Federated learning (FL) has shown promising potential in safeguarding data privacy in healthcare collaborations. While the term "FL" was originally coined by the engineering community, the statistical field has also explored similar privacy-preserving algorithms. Statistical FL algorithms, however, remain considerably less recognized than their engineering counterparts. Our goal was to bridge the gap by presenting the first comprehensive comparison of FL frameworks from both engineering and statistical domains. We evaluated five FL frameworks using both simulated and real-world data. The results indicate that statistical FL algorithms yield less biased point estimates for model coefficients and offer convenient confidence interval estimations. In contrast, engineering-based methods tend to generate more accurate predictions, sometimes surpassing central pooled and statistical FL models. This study underscores the relative strengths and weaknesses of both types of methods, emphasizing the need for increased awareness and their integration in future FL applications.
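To make the engineering-versus-statistical distinction concrete, here is a minimal sketch of one aggregation round in each flavour: FedAvg-style parameter averaging versus pooling sufficient statistics for a linear model. Both are textbook illustrations, not the frameworks benchmarked in the paper.

```python
import numpy as np

def fed_avg(client_weights: list[np.ndarray], client_sizes: list[int]) -> np.ndarray:
    """Engineering-style round: average model parameters across sites,
    weighted by local dataset size (FedAvg)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

def pooled_ols(xtx_parts: list[np.ndarray], xty_parts: list[np.ndarray]) -> np.ndarray:
    """Statistical-style round for linear regression: each site shares its
    sufficient statistics X'X and X'y; the pooled solve recovers the same
    coefficients as centralized least squares, hence less biased estimates."""
    return np.linalg.solve(sum(xtx_parts), sum(xty_parts))
```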

Visual-information-driven model for crowd simulation using temporal convolutional network

  • paper_url: http://arxiv.org/abs/2311.02996
  • repo_url: None
  • paper_authors: Xuanwen Liang, Eric Wai Ming Lee
  • for: Improve the adaptability and realism of data-driven crowd simulation models.
  • methods: Incorporate visual information, including scenario geometry and pedestrian locomotion, into a temporal-convolutional-network-based crowd simulation model.
  • results: The visual-information-driven crowd simulation model is tested and evaluated on three public pedestrian dynamics datasets with distinct geometries (corridor, corner, and T-junction), showing improved adaptability in all three geometric scenarios.
    Abstract Crowd simulations play a pivotal role in building design, influencing both user experience and public safety. While traditional knowledge-driven models have their merits, data-driven crowd simulation models promise to bring a new dimension of realism to these simulations. However, most of the existing data-driven models are designed for specific geometries, leading to poor adaptability and applicability. A promising strategy for enhancing the adaptability and realism of data-driven crowd simulation models is to incorporate visual information, including the scenario geometry and pedestrian locomotion. Consequently, this paper proposes a novel visual-information-driven (VID) crowd simulation model. The VID model predicts the pedestrian velocity at the next time step based on the prior social-visual information and motion data of an individual. A radar-geometry-locomotion method is established to extract the visual information of pedestrians. Moreover, a temporal convolutional network (TCN)-based deep learning model, named social-visual TCN, is developed for velocity prediction. The VID model is tested on three public pedestrian motion datasets with distinct geometries, i.e., corridor, corner, and T-junction. Both qualitative and quantitative metrics are employed to evaluate the VID model, and the results highlight the improved adaptability of the model across all three geometric scenarios. Overall, the proposed method demonstrates effectiveness in enhancing the adaptability of data-driven crowd models.
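A minimal TCN-style velocity predictor of the kind the abstract describes: causal 1D convolutions over a pedestrian's recent motion-plus-visual features, regressing the next-step velocity. All feature dimensions and layer counts are illustrative, not the paper's social-visual TCN.

```python
import torch
import torch.nn as nn

class VelocityTCN(nn.Module):
    def __init__(self, in_dim: int = 16, hidden: int = 32, k: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, k, padding=k - 1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, k, padding=k - 1), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 2)          # next-step (vx, vy)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, in_dim, T); trimming the extra padded steps keeps each
        # output position causal (it sees only current and past inputs)
        h = self.net(feats)[..., :feats.shape[-1]]
        return self.head(h[..., -1])              # predict from the latest step
```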

PowerFlowNet: Leveraging Message Passing GNNs for Improved Power Flow Approximation

  • paper_url: http://arxiv.org/abs/2311.03415
  • repo_url: None
  • paper_authors: Nan Lin, Stavros Orfanoudakis, Nathan Ordonez Cardenas, Juan S. Giraldo, Pedro P. Vergara
  • for: Accurate and efficient operation and planning of modern power networks.
  • methods: Use graph neural networks (GNNs) to improve the speed and accuracy of power flow (PF) approximation.
  • results: On the simple IEEE 14-bus system and the realistic French high-voltage network (6470rte), PowerFlowNet matches the Newton-Raphson method's accuracy while running 4 and 145 times faster, respectively, and significantly outperforms other traditional approximation methods such as DC relaxation.
    Abstract Accurate and efficient power flow (PF) analysis is crucial in modern electrical networks' efficient operation and planning. Therefore, there is a need for scalable algorithms capable of handling large-scale power networks that can provide accurate and fast solutions. Graph Neural Networks (GNNs) have emerged as a promising approach for enhancing the speed of PF approximations by leveraging their ability to capture distinctive features from the underlying power network graph. In this study, we introduce PowerFlowNet, a novel GNN architecture for PF approximation that showcases similar performance with the traditional Newton-Raphson method but achieves it 4 times faster in the simple IEEE 14-bus system and 145 times faster in the realistic case of the French high voltage network (6470rte). Meanwhile, it significantly outperforms other traditional approximation methods, such as the DC relaxation method, in terms of performance and execution time; therefore, making PowerFlowNet a highly promising solution for real-world PF analysis. Furthermore, we verify the efficacy of our approach by conducting an in-depth experimental evaluation, thoroughly examining the performance, scalability, interpretability, and architectural dependability of PowerFlowNet. The evaluation provides insights into the behavior and potential applications of GNNs in power system analysis.
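A generic message-passing layer over the bus/line graph, to illustrate the kind of GNN computation PowerFlowNet builds on; the layer widths and update functions are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class BusMessagePassing(nn.Module):
    """One round: every bus aggregates messages from its neighbours along
    transmission lines, then updates its hidden state."""
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # h: (num_buses, dim); edges: (2, num_lines) long tensor of [src, dst]
        src, dst = edges
        m = self.msg(torch.cat([h[src], h[dst]], dim=-1))   # per-line message
        agg = torch.zeros_like(h).index_add_(0, dst, m)     # sum into buses
        return self.upd(torch.cat([h, agg], dim=-1))
```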

A Generative Neural Network Approach for 3D Multi-Criteria Design Generation and Optimization of an Engine Mount for an Unmanned Air Vehicle

  • paper_url: http://arxiv.org/abs/2311.03414
  • repo_url: None
  • paper_authors: Christoph Petroll, Sebastian Eilermann, Philipp Hoefer, Oliver Niggemann
  • for: Use generative neural networks for functionality-conditioned 3D design reconstruction and generation.
  • methods: Use a Conditional Variational Autoencoder (CVAE) and the Marching Cubes algorithm to generate meshes, which are then evaluated in simulation.
  • results: The approach can generate optimized designs under self-defined functionality conditions.
    Abstract One of the most promising developments in computer vision in recent years is the use of generative neural networks for functionality condition-based 3D design reconstruction and generation. Here, neural networks learn dependencies between functionalities and a geometry in a very effective way. For a neural network the functionalities are translated in conditions to a certain geometry. But the more conditions the design generation needs to reflect, the more difficult it is to learn clear dependencies. This leads to a multi criteria design problem due various conditions, which are not considered in the neural network structure so far. In this paper, we address this multi-criteria challenge for a 3D design use case related to an unmanned aerial vehicle (UAV) motor mount. We generate 10,000 abstract 3D designs and subject them all to simulations for three physical disciplines: mechanics, thermodynamics, and aerodynamics. Then, we train a Conditional Variational Autoencoder (CVAE) using the geometry and corresponding multicriteria functional constraints as input. We use our trained CVAE as well as the Marching cubes algorithm to generate meshes for simulation based evaluation. The results are then evaluated with the generated UAV designs. Subsequently, we demonstrate the ability to generate optimized designs under self-defined functionality conditions using the trained neural network.
    摘要 In this paper, we address this multi-criteria challenge for a 3D design use case related to an unmanned aerial vehicle (UAV) motor mount. We generate 10,000 abstract 3D designs and subject them all to simulations for three physical disciplines: mechanics, thermodynamics, and aerodynamics. Then, we train a Conditional Variational Autoencoder (CVAE) using the geometry and corresponding multicriteria functional constraints as input. We use our trained CVAE as well as the Marching cubes algorithm to generate meshes for simulation-based evaluation. The results are then evaluated with the generated UAV designs. Subsequently, we demonstrate the ability to generate optimized designs under self-defined functionality conditions using the trained neural network.

Discret2Di – Deep Learning based Discretization for Model-based Diagnosis

  • paper_url: http://arxiv.org/abs/2311.03413
  • repo_url: None
  • paper_authors: Lukas Moddemann, Henrik Sebastian Steude, Alexander Diedrich, Oliver Niggemann
  • for: The paper proposes an automated learning method for logical expressions for consistency-based diagnosis.
  • methods: The paper uses machine learning techniques to convert time series into logical representations and automatically learn logical rules.
  • results: The paper shows through experiments that automated learning of logical rules can effectively perform consistency-based diagnosis.
    Abstract Consistency-based diagnosis is an established approach to diagnose technical applications, but suffers from significant modeling efforts, especially for dynamic multi-modal time series. Machine learning seems to be an obvious solution, which becomes less obvious when looking at details: Which notion of consistency can be used? If logical calculi are still to be used, how can dynamic time series be transferred into the discrete world? This paper presents the methodology Discret2Di for automated learning of logical expressions for consistency-based diagnosis. While these logical calculi have advantages by providing a clear notion of consistency, they have the key problem of relying on a discretization of the dynamic system. The solution presented combines machine learning from both the time series and the symbolic domain to automate the learning of logical rules for consistency-based diagnosis.

TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications

  • paper_url: http://arxiv.org/abs/2311.02971
  • repo_url: https://github.com/autogluon/tabrepo
  • paper_authors: David Salinas, Nick Erickson
  • for: Introduce TabRepo, a new dataset of tabular model evaluations and predictions.
  • methods: Collect the predictions and metrics of 1206 models evaluated on 200 regression and classification datasets.
  • results: TabRepo makes it possible to compare AutoML systems and hyperparameter optimization, with ensembling at no cost via precomputed predictions; applying standard transfer-learning techniques on top of it outperforms state-of-the-art tabular systems in accuracy, runtime, and latency.
    Abstract We introduce TabRepo, a new dataset of tabular model evaluations and predictions. TabRepo contains the predictions and metrics of 1206 models evaluated on 200 regression and classification datasets. We illustrate the benefit of our datasets in multiple ways. First, we show that it allows to perform analysis such as comparing Hyperparameter Optimization against current AutoML systems while also considering ensembling at no cost by using precomputed model predictions. Second, we show that our dataset can be readily leveraged to perform transfer-learning. In particular, we show that applying standard transfer-learning techniques allows to outperform current state-of-the-art tabular systems in accuracy, runtime and latency.
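Because TabRepo ships precomputed validation predictions, "ensembling at no cost" can be as simple as Caruana-style greedy selection over cached arrays, sketched below for a squared-error metric; this is an illustration, not TabRepo's own evaluation code.

```python
import numpy as np

def greedy_ensemble(preds: np.ndarray, y: np.ndarray, rounds: int = 10) -> list[int]:
    """preds: (n_models, n_samples) cached validation predictions.
    Repeatedly add (with replacement) the model that most lowers the
    ensemble's MSE; repeats implicitly weight strong models."""
    chosen: list[int] = []
    current = np.zeros_like(y, dtype=float)
    for _ in range(rounds):
        k = len(chosen)
        losses = [np.mean(((current * k + p) / (k + 1) - y) ** 2) for p in preds]
        best = int(np.argmin(losses))
        chosen.append(best)
        current = (current * k + preds[best]) / (k + 1)
    return chosen
```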

Retrieval-Augmented Code Generation for Universal Information Extraction

  • paper_url: http://arxiv.org/abs/2311.02962
  • repo_url: None
  • paper_authors: Yucan Guo, Zixuan Li, Xiaolong Jin, Yantao Liu, Yutao Zeng, Wenxuan Liu, Xiang Li, Pan Yang, Long Bai, Jiafeng Guo, Xueqi Cheng
  • for: Propose a universal retrieval-augmented code generation framework based on large language models (LLMs) for information extraction (IE) tasks.
  • methods: Define task-specific structural knowledge as Python classes and use an in-context learning mechanism to turn the information in texts into code.
  • results: Experiments on five representative IE tasks across nine datasets demonstrate the effectiveness of the Code4UIE framework.
    Abstract Information Extraction (IE) aims to extract structural knowledge (e.g., entities, relations, events) from natural language texts, which brings challenges to existing methods due to task-specific schemas and complex text expressions. Code, as a typical kind of formalized language, is capable of describing structural knowledge under various schemas in a universal way. On the other hand, Large Language Models (LLMs) trained on both codes and texts have demonstrated powerful capabilities of transforming texts into codes, which provides a feasible solution to IE tasks. Therefore, in this paper, we propose a universal retrieval-augmented code generation framework based on LLMs, called Code4UIE, for IE tasks. Specifically, Code4UIE adopts Python classes to define task-specific schemas of various structural knowledge in a universal way. By so doing, extracting knowledge under these schemas can be transformed into generating codes that instantiate the predefined Python classes with the information in texts. To generate these codes more precisely, Code4UIE adopts the in-context learning mechanism to instruct LLMs with examples. In order to obtain appropriate examples for different tasks, Code4UIE explores several example retrieval strategies, which can retrieve examples semantically similar to the given texts. Extensive experiments on five representative IE tasks across nine datasets demonstrate the effectiveness of the Code4UIE framework.
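A minimal example of the schema-as-code idea: the task schema is an ordinary Python class, and extraction reduces to generating code that instantiates it from the input text. The class names and the example sentence are hypothetical, not the paper's schemas.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str

@dataclass
class WorksFor:
    """Relation-extraction schema: employee -> employer."""
    employee: Person
    employer: str

# For the sentence "Alice joined Acme in 2020", a well-prompted LLM (guided by
# retrieved in-context examples) would emit code like the line below, which is
# then executed or parsed to recover the structured extraction:
extracted = WorksFor(employee=Person(name="Alice"), employer="Acme")
```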

In-Context Learning for Knowledge Base Question Answering for Unmanned Systems based on Large Language Models

  • paper_url: http://arxiv.org/abs/2311.02956
  • repo_url: None
  • paper_authors: Yunlong Chen, Yaming Zhang, Jianfei Yu, Li Yang, Rui Xia
  • for: Answer factoid questions based on knowledge bases
  • methods: Use ChatGPT-based Cypher Query Language (CQL) generation framework to generate the most appropriate CQL based on Natural Language Questions (NLQ)
  • results: Achieved the second place in the CCKS 2023 Question Answering with Knowledge Graph Inference for Unmanned Systems competition, with an F1-score of 0.92676
    Abstract Knowledge Base Question Answering (KBQA) aims to answer factoid questions based on knowledge bases. However, generating the most appropriate knowledge base query code based on Natural Language Questions (NLQ) poses a significant challenge in KBQA. In this work, we focus on the CCKS2023 Competition of Question Answering with Knowledge Graph Inference for Unmanned Systems. Inspired by the recent success of large language models (LLMs) like ChatGPT and GPT-3 in many QA tasks, we propose a ChatGPT-based Cypher Query Language (CQL) generation framework to generate the most appropriate CQL based on the given NLQ. Our generative framework contains six parts: an auxiliary model predicting the syntax-related information of CQL based on the given NLQ, a proper noun matcher extracting proper nouns from the given NLQ, a demonstration example selector retrieving similar examples of the input sample, a prompt constructor designing the input template of ChatGPT, a ChatGPT-based generation model generating the CQL, and an ensemble model to obtain the final answers from diversified outputs. With our ChatGPT-based CQL generation framework, we achieved the second place in the CCKS 2023 Question Answering with Knowledge Graph Inference for Unmanned Systems competition, achieving an F1-score of 0.92676.

Can LLMs Follow Simple Rules?

  • paper_url: http://arxiv.org/abs/2311.04235
  • repo_url: https://github.com/normster/llm_rules
  • paper_authors: Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner
  • for: Provide a programmatic framework for evaluating whether large language models (LLMs) follow rules provided by developers.
  • methods: Fifteen simple text scenarios in which the model must obey developer-provided rules while interacting with a human user; each scenario has a concise evaluation program that determines whether the model has broken any rules.
  • results: All evaluated popular proprietary and open models are susceptible to a wide variety of adversarial hand-crafted user inputs, with GPT-4 performing best; open models also show significant vulnerabilities under gradient-based attacks.
    Abstract As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.
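A toy version of one scenario's "concise evaluation program": a deterministic check over the conversation transcript. The secret-keeping rule here is illustrative; the actual suite defines 15 scenarios, each with its own check.

```python
def evaluate_secret_scenario(assistant_msgs: list[str], secret: str) -> bool:
    """The model was instructed never to reveal `secret`; the scenario passes
    iff no assistant message leaks it."""
    return all(secret.lower() not in msg.lower() for msg in assistant_msgs)

assert evaluate_secret_scenario(["I cannot share that."], secret="opensesame")
assert not evaluate_secret_scenario(["Fine, it is opensesame."], secret="opensesame")
```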

Contrastive Multi-Level Graph Neural Networks for Session-based Recommendation

  • paper_url: http://arxiv.org/abs/2311.02938
  • repo_url: None
  • paper_authors: Fuyun Wang, Xingyu Gao, Zhenyu Chen, Lei Lyu
  • for: This paper aims to improve session-based recommendation by exploiting complex and high-order item transition information.
  • methods: The proposed method, called contrastive multi-level graph neural networks (CM-GNN), uses a combination of local-level, global-level, and hyper-level graph convolutional networks, as well as an attention-based fusion module to capture pairwise relations and high-order information among item transitions.
  • results: The proposed method outperforms state-of-the-art session-based recommendation techniques in extensive experiments on multiple benchmark datasets.
    Abstract Session-based recommendation (SBR) aims to predict the next item at a certain time point based on anonymous user behavior sequences. Existing methods typically model session representation based on simple item transition information. However, since session-based data consists of limited users' short-term interactions, modeling session representation by capturing fixed item transition information from a single dimension suffers from data sparsity. In this paper, we propose a novel contrastive multi-level graph neural networks (CM-GNN) to better exploit complex and high-order item transition information. Specifically, CM-GNN applies local-level graph convolutional network (L-GCN) and global-level network (G-GCN) on the current session and all the sessions respectively, to effectively capture pairwise relations over all the sessions by aggregation strategy. Meanwhile, CM-GNN applies hyper-level graph convolutional network (H-GCN) to capture high-order information among all the item transitions. CM-GNN further introduces an attention-based fusion module to learn pairwise relation-based session representation by fusing the item representations generated by L-GCN and G-GCN. CM-GNN averages the item representations obtained by H-GCN to obtain high-order relation-based session representation. Moreover, to convert the high-order item transition information into the pairwise relation-based session representation, CM-GNN maximizes the mutual information between the representations derived from the fusion module and the average pool layer by contrastive learning paradigm. We conduct extensive experiments on multiple widely used benchmark datasets to validate the efficacy of the proposed method. The encouraging results demonstrate that our proposed method outperforms the state-of-the-art SBR techniques.
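The mutual-information maximization between the fused pairwise-relation representation and the high-order representation can be sketched as an InfoNCE objective with in-batch negatives; the temperature and the exact positive/negative construction are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def session_contrastive_loss(pairwise_repr: torch.Tensor,
                             highorder_repr: torch.Tensor,
                             tau: float = 0.2) -> torch.Tensor:
    """Both inputs are (B, D); each session's two views form a positive pair,
    all other sessions in the batch act as negatives."""
    a = F.normalize(pairwise_repr, dim=-1)
    b = F.normalize(highorder_repr, dim=-1)
    logits = a @ b.t() / tau                   # (B, B) similarities
    targets = torch.arange(a.size(0))          # positives on the diagonal
    return F.cross_entropy(logits, targets)
```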

Deep Image Semantic Communication Model for Artificial Intelligent Internet of Things

  • paper_url: http://arxiv.org/abs/2311.02926
  • repo_url: https://github.com/meatery/semantic-segmentation
  • paper_authors: Li Ping Qian, Yi Zhang, Sikai Lyu, Huijie Zhu, Yuan Wu, Xuemin Sherman Shen, Xiaoniu Yang
  • for: Propose a deep image semantic communication model for efficient transmission and recovery of image data from AIoT devices.
  • methods: At the transmitter, a high-precision image semantic segmentation algorithm extracts the semantic information of the image to achieve significant compression; at the receiver, a GAN-based semantic image restoration algorithm converts the semantic image into a detailed real-scene image.
  • results: Compared with WebP and CycleGAN, the proposed model improves the image compression ratio and recovery accuracy by 71.93% and 25.07% on average; a demo experiment further shows that it reduces the total image transmission delay by 95.26%.
    Abstract With the rapid development of Artificial Intelligent Internet of Things (AIoT), the image data from AIoT devices has been witnessing the explosive increasing. In this paper, a novel deep image semantic communication model is proposed for the efficient image communication in AIoT. Particularly, at the transmitter side, a high-precision image semantic segmentation algorithm is proposed to extract the semantic information of the image to achieve significant compression of the image data. At the receiver side, a semantic image restoration algorithm based on Generative Adversarial Network (GAN) is proposed to convert the semantic image to a real scene image with detailed information. Simulation results demonstrate that the proposed image semantic communication model can improve the image compression ratio and recovery accuracy by 71.93% and 25.07% on average in comparison with WebP and CycleGAN, respectively. More importantly, our demo experiment shows that the proposed model reduces the total delay by 95.26% in the image communication, when comparing with the original image transmission.

Virtual Action Actor-Critic Framework for Exploration (Student Abstract)

  • paper_url: http://arxiv.org/abs/2311.02916
  • repo_url: None
  • paper_authors: Bumgeun Park, Taeyoung Kim, Quoc-Vinh Lai-Dang, Dongsoo Har
  • for: Improve the exploration efficiency of agents in reinforcement learning (RL).
  • methods: Propose a novel actor-critic framework, virtual action actor-critic (VAAC), to address the challenge of efficient exploration in RL.
  • results: Experimental results show that VAAC explores more efficiently than existing algorithms.
    Abstract Efficient exploration for an agent is challenging in reinforcement learning (RL). In this paper, a novel actor-critic framework namely virtual action actor-critic (VAAC), is proposed to address the challenge of efficient exploration in RL. This work is inspired by humans' ability to imagine the potential outcomes of their actions without actually taking them. In order to emulate this ability, VAAC introduces a new actor called virtual actor (VA), alongside the conventional actor-critic framework. Unlike the conventional actor, the VA takes the virtual action to anticipate the next state without interacting with the environment. With the virtual policy following a Gaussian distribution, the VA is trained to maximize the anticipated novelty of the subsequent state resulting from a virtual action. If any next state resulting from available actions does not exhibit high anticipated novelty, training the VA leads to an increase in the virtual policy entropy. Hence, high virtual policy entropy represents that there is no room for exploration. The proposed VAAC aims to maximize a modified Q function, which combines cumulative rewards and the negative sum of virtual policy entropy. Experimental results show that the VAAC improves the exploration performance compared to existing algorithms.

Imitation Learning based Alternative Multi-Agent Proximal Policy Optimization for Well-Formed Swarm-Oriented Pursuit Avoidance

  • paper_url: http://arxiv.org/abs/2311.02912
  • repo_url: None
  • paper_authors: Sizhao Li, Yuming Xiang, Rongpeng Li, Zhifeng Zhao, Honggang Zhang
  • for: Study cooperative control of multi-robot systems (MRS), in particular pursuit-avoidance tasks in large-scale decentralized MRS.
  • methods: Propose an imitation-learning-based alternative multi-agent proximal policy optimization (IA-MAPPO) algorithm for pursuit avoidance in well-formed swarms, consisting of a policy-distillation-based MAPPO executor for formation control and an imitation-learned decentralized formation controller that reduces communication overheads and improves scalability.
  • results: simulations results validate the effectiveness of IA-MAPPO, and extensive ablation experiments show that the performance is comparable to a centralized solution with significant decrease in communication overheads.
    Abstract Multi-Robot System (MRS) has garnered widespread research interest and fostered tremendous interesting applications, especially in cooperative control fields. Yet little light has been shed on the compound ability of formation, monitoring and defence in decentralized large-scale MRS for pursuit avoidance, which puts stringent requirements on the capability of coordination and adaptability. In this paper, we put forward a decentralized Imitation learning based Alternative Multi-Agent Proximal Policy Optimization (IA-MAPPO) algorithm to provide a flexible and communication-economic solution to execute the pursuit avoidance task in well-formed swarm. In particular, a policy-distillation based MAPPO executor is firstly devised to capably accomplish and swiftly switch between multiple formations in a centralized manner. Furthermore, we utilize imitation learning to decentralize the formation controller, so as to reduce the communication overheads and enhance the scalability. Afterwards, alternative training is leveraged to compensate the performance loss incurred by decentralization. The simulation results validate the effectiveness of IA-MAPPO and extensive ablation experiments further show the performance comparable to a centralized solution with significant decrease in communication overheads.

ViDa: Visualizing DNA hybridization trajectories with biophysics-informed deep graph embeddings

  • paper_url: http://arxiv.org/abs/2311.03411
  • repo_url: https://github.com/chenwei-zhang/ViDa
  • paper_authors: Chenwei Zhang, Jordan Lovrod, Boyan Beronov, Khanh Dao Duc, Anne Condon
  • for: Help biochemists and molecular programmers understand the complex reactive pathways of nucleic acid reactions, which can be designed for many potential applications.
  • methods: Model reactions as a continuous-time Markov chain (CTMC) and introduce a new visualization approach, ViDa, that embeds DNA reaction trajectories in 2D.
  • results: Domain-specific supervised terms improve visualization quality and successfully separate different folding pathways, providing useful insight into dominant reaction mechanisms.
    Abstract Visualization tools can help synthetic biologists and molecular programmers understand the complex reactive pathways of nucleic acid reactions, which can be designed for many potential applications and can be modelled using a continuous-time Markov chain (CTMC). Here we present ViDa, a new visualization approach for DNA reaction trajectories that uses a 2D embedding of the secondary structure state space underlying the CTMC model. To this end, we integrate a scattering transform of the secondary structure adjacency, a variational autoencoder, and a nonlinear dimensionality reduction method. We augment the training loss with domain-specific supervised terms that capture both thermodynamic and kinetic features. We assess ViDa on two well-studied DNA hybridization reactions. Our results demonstrate that the domain-specific features lead to significant quality improvements over the state-of-the-art in DNA state space visualization, successfully separating different folding pathways and thus providing useful insights into dominant reaction mechanisms.

Deep Learning-Empowered Semantic Communication Systems with a Shared Knowledge Base

  • paper_url: http://arxiv.org/abs/2311.02884
  • repo_url: None
  • paper_authors: Peng Yi, Yang Cao, Xin Kang, Ying-Chang Liang
  • for: The paper aims to improve the explainability of semantic communication systems in future 6G networks.
  • methods: The proposed method uses a shared knowledge base to integrate messages and corresponding knowledge, enabling the system to transmit fewer symbols without sacrificing semantic performance.
  • results: The proposed approach outperforms existing baseline methods in terms of transmitted data size and sentence similarity, as demonstrated by simulation results.
    Abstract Deep learning-empowered semantic communication is regarded as a promising candidate for future 6G networks. Although existing semantic communication systems have achieved superior performance compared to traditional methods, the end-to-end architecture adopted by most semantic communication systems is regarded as a black box, leading to the lack of explainability. To tackle this issue, in this paper, a novel semantic communication system with a shared knowledge base is proposed for text transmissions. Specifically, a textual knowledge base constructed by inherently readable sentences is introduced into our system. With the aid of the shared knowledge base, the proposed system integrates the message and corresponding knowledge from the shared knowledge base to obtain the residual information, which enables the system to transmit fewer symbols without semantic performance degradation. In order to make the proposed system more reliable, the semantic self-information and the source entropy are mathematically defined based on the knowledge base. Furthermore, the knowledge base construction algorithm is developed based on a similarity-comparison method, in which a pre-configured threshold can be leveraged to control the size of the knowledge base. Moreover, the simulation results have demonstrated that the proposed approach outperforms existing baseline methods in terms of transmitted data size and sentence similarity.

DP-DCAN: Differentially Private Deep Contrastive Autoencoder Network for Single-cell Clustering

  • paper_url: http://arxiv.org/abs/2311.03410
  • repo_url: None
  • paper_authors: Huifa Li, Jie Fu, Zhili Chen, Xiaomin Yang, Haitao Liu, Xinpeng Ling
  • for: The paper aims to propose a deep learning-based single-cell clustering method with privacy protection, to protect user privacy.
  • methods: The method is based on an autoencoder network, and achieves privacy protection through partial network perturbation.
  • results: Experimental results show that DP-DCAN outperforms traditional DP schemes and has stronger robustness to adversarial attacks.
    Abstract Single-cell RNA sequencing (scRNA-seq) is important to transcriptomic analysis of gene expression. Recently, deep learning has facilitated the analysis of high-dimensional single-cell data. Unfortunately, deep learning models may leak sensitive information about users. As a result, Differential Privacy (DP) is increasingly used to protect privacy. However, existing DP methods usually perturb whole neural networks to achieve differential privacy, and hence result in great performance overheads. To address this challenge, in this paper, we take advantage of the uniqueness of the autoencoder that it outputs only the dimension-reduced vector in the middle of the network, and design a Differentially Private Deep Contrastive Autoencoder Network (DP-DCAN) by partial network perturbation for single-cell clustering. Since only partial network is added with noise, the performance improvement is obvious and twofold: one part of network is trained with less noise due to a bigger privacy budget, and the other part is trained without any noise. Experimental results of six datasets have verified that DP-DCAN is superior to the traditional DP scheme with whole network perturbation. Moreover, DP-DCAN demonstrates strong robustness to adversarial attacks. The code is available at https://github.com/LFD-byte/DP-DCAN.
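A rough sketch of partial network perturbation: clip and noise only the gradients of a designated sub-network (here, parameters whose names start with an assumed `"encoder."` prefix) while the rest trains noise-free. Per-sample gradient clipping, which a rigorous DP accounting requires, is omitted for brevity.

```python
import torch

def partially_private_step(model, loss, lr=1e-3, clip=1.0, sigma=1.0,
                           private_prefix="encoder."):
    model.zero_grad()
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if name.startswith(private_prefix):               # only this part is noised
            norm = p.grad.norm().item()
            p.grad.mul_(min(1.0, clip / max(norm, 1e-6)))          # clip
            p.grad.add_(torch.randn_like(p.grad) * sigma * clip)   # Gaussian noise
        with torch.no_grad():
            p.add_(p.grad, alpha=-lr)                     # plain SGD update
```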

Visualizing DNA reaction trajectories with deep graph embedding approaches

  • paper_url: http://arxiv.org/abs/2311.03409
  • repo_url: https://github.com/chenwei-zhang/ViDa
  • paper_authors: Chenwei Zhang, Khanh Dao Duc, Anne Condon
  • for: Support the design of novel nucleic acid reactions, which have many potential applications.
  • methods: Integrate a deep graph embedding model with common dimensionality reduction approaches to map high-dimensional data onto 2D Euclidean space.
  • results: Preliminary results show that ViDa successfully separates trajectories with different folding mechanisms, providing useful insight to users and improving substantially on the current state of the art in DNA kinetics visualization.
    Abstract Synthetic biologists and molecular programmers design novel nucleic acid reactions, with many potential applications. Good visualization tools are needed to help domain experts make sense of the complex outputs of folding pathway simulations of such reactions. Here we present ViDa, a new approach for visualizing DNA reaction folding trajectories over the energy landscape of secondary structures. We integrate a deep graph embedding model with common dimensionality reduction approaches, to map high-dimensional data onto 2D Euclidean space. We assess ViDa on two well-studied and contrasting DNA hybridization reactions. Our preliminary results suggest that ViDa's visualization successfully separates trajectories with different folding mechanisms, thereby providing useful insight to users, and is a big improvement over the current state-of-the-art in DNA kinetics visualization.

Temporal Shift – Multi-Objective Loss Function for Improved Anomaly Fall Detection

  • paper_url: http://arxiv.org/abs/2311.02863
  • repo_url: None
  • paper_authors: Stefan Denkovski, Shehroz S. Khan, Alex Mihailidis
  • for: Falls are a major cause of injuries and deaths among older adults; accurate fall detection can help reduce these risks.
  • methods: Use autoencoders and related reconstruction architectures for fall detection.
  • results: Across multiple models, the proposed Temporal Shift loss improves fall detection, especially for an attention U-Net on a single camera compared with reconstruction alone.
    Abstract Falls are a major cause of injuries and deaths among older adults worldwide. Accurate fall detection can help reduce potential injuries and additional health complications. Different types of video modalities can be used in a home setting to detect falls, including RGB, Infrared, and Thermal cameras. Anomaly detection frameworks using autoencoders and their variants can be used for fall detection due to the data imbalance that arises from the rarity and diversity of falls. However, the use of reconstruction error in autoencoders can limit the application of networks' structures that propagate information. In this paper, we propose a new multi-objective loss function called Temporal Shift, which aims to predict both future and reconstructed frames within a window of sequential frames. The proposed loss function is evaluated on a semi-naturalistic fall detection dataset containing multiple camera modalities. The autoencoders were trained on normal activities of daily living (ADL) performed by older adults and tested on ADLs and falls performed by young adults. Temporal shift shows significant improvement to a baseline 3D Convolutional autoencoder, an attention U-Net CAE, and a multi-modal neural network. The greatest improvement was observed in an attention U-Net model improving by 0.20 AUC ROC for a single camera when compared to reconstruction alone. With significant improvement across different models, this approach has the potential to be widely adopted and improve anomaly detection capabilities in other settings besides fall detection.
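A minimal version of the proposed multi-objective loss: one term reconstructs the input window, the other predicts frames shifted into the future. The two-headed model interface and the balancing weight are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_shift_loss(model, window: torch.Tensor, shift: int = 2,
                        alpha: float = 0.5) -> torch.Tensor:
    """window: (B, T + shift, C, H, W) video clip. The model returns both a
    reconstruction of the past frames and a prediction of the shifted ones;
    at test time, high combined error flags an anomaly (e.g. a fall)."""
    past, future = window[:, :-shift], window[:, shift:]
    recon, pred = model(past)                  # assumed two-headed decoder
    return alpha * F.mse_loss(recon, past) + (1 - alpha) * F.mse_loss(pred, future)
```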

Training Multi-layer Neural Networks on Ising Machine

  • paper_url: http://arxiv.org/abs/2311.03408
  • repo_url: None
  • paper_authors: Xujie Song, Tong Liu, Shengbo Eben Li, Jingliang Duan, Wenxuan Wang, Keqiang Li
  • for: Train multi-layer feedforward neural networks on Ising machines using an Ising learning algorithm.
  • methods: The algorithm incorporates two essential techniques: binary representation of topological network and order reduction of loss function. The QNN is formulated as a QCBO problem, which is then converted to a QUBO problem that can be efficiently solved on Ising machines.
  • results: The algorithm achieved a classification accuracy of 98.3% on MNIST dataset after annealing for 700 ms, with a success probability of 72% in finding the optimal solution. The algorithm has the potential to train deeper neural networks with more spins on the Ising machine.
    Abstract As a dedicated quantum device, Ising machines could solve large-scale binary optimization problems in milliseconds. There is emerging interest in utilizing Ising machines to train feedforward neural networks due to the prosperity of generative artificial intelligence. However, existing methods can only train single-layer feedforward networks because of the complex nonlinear network topology. This paper proposes an Ising learning algorithm to train quantized neural network (QNN), by incorporating two essential techinques, namely binary representation of topological network and order reduction of loss function. As far as we know, this is the first algorithm to train multi-layer feedforward networks on Ising machines, providing an alternative to gradient-based backpropagation. Firstly, training QNN is formulated as a quadratic constrained binary optimization (QCBO) problem by representing neuron connection and activation function as equality constraints. All quantized variables are encoded by binary bits based on binary encoding protocol. Secondly, QCBO is converted to a quadratic unconstrained binary optimization (QUBO) problem, that can be efficiently solved on Ising machines. The conversion leverages both penalty function and Rosenberg order reduction, who together eliminate equality constraints and reduce high-order loss function into a quadratic one. With some assumptions, theoretical analysis shows the space complexity of our algorithm is $\mathcal{O}(H^2L + HLN\log H)$, quantifying the required number of Ising spins. Finally, the algorithm effectiveness is validated with a simulated Ising machine on MNIST dataset. After annealing 700 ms, the classification accuracy achieves 98.3%. Among 100 runs, the success probability of finding the optimal solution is 72%. Along with the increasing number of spins on Ising machine, our algorithm has the potential to train deeper neural networks.
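A tiny worked example of the penalty part of the QCBO-to-QUBO conversion: an equality constraint becomes a squared penalty, which expands (using x² = x for binary variables) into a quadratic form an Ising machine can anneal. The brute-force search below stands in for the annealer.

```python
import numpy as np
from itertools import product

# Minimize f(x) = x0 + x1 subject to x0 + x1 = 1, x binary. Adding the
# penalty P*(x0 + x1 - 1)^2 and expanding with x^2 = x gives the QUBO energy
#   (1 - P)*x0 + (1 - P)*x1 + 2P*x0*x1 + P.
P = 10.0
Q = np.array([[1 - P, P],
              [P, 1 - P]])

def qubo_energy(x: np.ndarray) -> float:
    return float(x @ Q @ x + P)      # + P is the constant from the expansion

best = min((np.array(b) for b in product([0, 1], repeat=2)), key=qubo_energy)
print(best, qubo_energy(best))       # [0 1], energy 1.0: constraint satisfied
```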

Co-training and Co-distillation for Quality Improvement and Compression of Language Models

  • paper_url: http://arxiv.org/abs/2311.02849
  • repo_url: None
  • paper_authors: Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min
  • for: Compress computationally expensive pre-trained language models (PLMs) for use in resource-constrained or real-time settings.
  • methods: Co-Training and Co-Distillation (CTCD), a framework in which two models are trained together while mutually distilling knowledge.
  • results: CTCD improves performance and inference speed together, can be combined with existing techniques such as architecture design or data augmentation for further gains, and the small model distilled with CTCD outperforms the original larger model.
    Abstract Knowledge Distillation (KD) compresses computationally expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models, allowing their use in resource-constrained or real-time settings. However, most smaller models fail to surpass the performance of the original larger model, resulting in sacrificing performance to improve inference speed. To address this issue, we propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models while mutually distilling knowledge. The CTCD framework successfully achieves this based on two significant findings: 1) Distilling knowledge from the smaller model to the larger model during co-training improves the performance of the larger model. 2) The enhanced performance of the larger model further boosts the performance of the smaller model. The CTCD framework shows promise as it can be combined with existing techniques like architecture design or data augmentation, replacing one-way KD methods, to achieve further performance improvement. Extensive ablation studies demonstrate the effectiveness of CTCD, and the small model distilled by CTCD outperforms the original larger model by a significant margin of 1.66 on the GLUE benchmark.
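
As a rough illustration of the co-training/co-distillation idea, the PyTorch sketch below computes a joint loss in which each model fits the task labels while also matching the other's softened predictions. The exact loss weighting and schedule in the paper may differ; `alpha` and `tau` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def ctcd_step(logits_small, logits_large, labels, alpha=0.5, tau=2.0):
    """One co-training/co-distillation loss computation (a sketch, not the
    authors' exact objective): each model is trained on the task labels and
    simultaneously distilled toward the other's softened predictions."""
    ce_s = F.cross_entropy(logits_small, labels)
    ce_l = F.cross_entropy(logits_large, labels)
    # mutual KD: the small model learns from the large one AND vice versa
    kd_s = F.kl_div(F.log_softmax(logits_small / tau, dim=-1),
                    F.softmax(logits_large.detach() / tau, dim=-1),
                    reduction="batchmean") * tau ** 2
    kd_l = F.kl_div(F.log_softmax(logits_large / tau, dim=-1),
                    F.softmax(logits_small.detach() / tau, dim=-1),
                    reduction="batchmean") * tau ** 2
    loss_small = (1 - alpha) * ce_s + alpha * kd_s
    loss_large = (1 - alpha) * ce_l + alpha * kd_l
    return loss_small + loss_large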

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

  • paper_url: http://arxiv.org/abs/2311.02847
  • repo_url: https://github.com/gewu-lab/llm_articulated_object_manipulation
  • paper_authors: Wenke Xia, Dong Wang, Xincheng Pang, Zhigang Wang, Bin Zhao, Di Hu
  • for: Improving the general adaptability of home-assistant robots so they can manipulate a wide variety of articulated objects effectively.
  • methods: A kinematic-aware prompting framework that supplies a large language model with an object's kinematic knowledge, helping it generate low-level motion trajectory waypoints.
  • results: The framework outperforms traditional methods on 8 seen object categories and shows strong zero-shot capability on 8 unseen categories; real-world experiments on 7 object categories further demonstrate its practicality.
    Abstract Generalizable articulated object manipulation is essential for home-assistant robots. Recent efforts focus on imitation learning from demonstrations or reinforcement learning in simulation, however, due to the prohibitive costs of real-world data collection and precise object simulation, it still remains challenging for these works to achieve broad adaptability across diverse articulated objects. Recently, many works have tried to utilize the strong in-context learning ability of Large Language Models (LLMs) to achieve generalizable robotic manipulation, but most of these researches focus on high-level task planning, sidelining low-level robotic control. In this work, building on the idea that the kinematic structure of the object determines how we can manipulate it, we propose a kinematic-aware prompting framework that prompts LLMs with kinematic knowledge of objects to generate low-level motion trajectory waypoints, supporting various object manipulation. To effectively prompt LLMs with the kinematic structure of different objects, we design a unified kinematic knowledge parser, which represents various articulated objects as a unified textual description containing kinematic joints and contact location. Building upon this unified description, a kinematic-aware planner model is proposed to generate precise 3D manipulation waypoints via a designed kinematic-aware chain-of-thoughts prompting method. Our evaluation spanned 48 instances across 16 distinct categories, revealing that our framework not only outperforms traditional methods on 8 seen categories but also shows a powerful zero-shot capability for 8 unseen articulated object categories. Moreover, the real-world experiments on 7 different object categories prove our framework's adaptability in practical scenarios. Code is released at \href{https://github.com/GeWu-Lab/LLM_articulated_object_manipulation/tree/main}{here}.
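
The sketch below illustrates the flavor of the unified kinematic knowledge parser and prompt assembly. The field names (`joints`, `axis`, `contact`) and the template wording are hypothetical stand-ins for the paper's actual textual description format.

```python
def kinematic_description(obj):
    """Render an articulated object as a unified textual description of its
    kinematic joints and contact locations (field names are illustrative)."""
    lines = [f"object: {obj['name']}"]
    for joint in obj["joints"]:
        lines.append(
            f"- joint '{joint['name']}' type={joint['type']} "
            f"axis={joint['axis']} contact={joint['contact']}"
        )
    return "\n".join(lines)

microwave = {
    "name": "microwave",
    "joints": [{"name": "door_hinge", "type": "revolute",
                "axis": (0, 0, 1), "contact": "door handle"}],
}
prompt = (kinematic_description(microwave) +
          "\nTask: open the door.\n"
          "Think step by step about the joint kinematics, then output "
          "3D waypoints for the gripper.")
print(prompt)
```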

Saturn: Efficient Multi-Large-Model Deep Learning

  • paper_url: http://arxiv.org/abs/2311.02840
  • repo_url: None
  • paper_authors: Kabir Nagrecha, Arun Kumar
  • for: Improving the efficiency of multi-large-model training (e.g., during model selection and hyperparameter optimization).
  • methods: A new data system, Saturn, that jointly addresses three interconnected challenges faced when training large models: parallelism technique selection, allocation of GPUs across jobs, and scheduling.
  • results: Evaluations show that Saturn's joint-optimization approach reduces model selection runtimes by 39-49% compared to typical current deep learning practice.
    Abstract In this paper, we propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g., during model selection/hyperparameter optimization). We first identify three key interconnected systems challenges for users building large models in this setting -- parallelism technique selection, distribution of GPUs over jobs, and scheduling. We then formalize these as a joint problem, and build a new system architecture to tackle these challenges simultaneously. Our evaluations show that our joint-optimization approach yields 39-49% lower model selection runtimes than typical current DL practice.
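
The joint problem can be pictured with a toy brute-force search over each job's choice of parallelism technique and GPU count under a cluster budget. Saturn itself solves a richer formulation that also covers scheduling; the runtime table below is entirely made up for illustration.

```python
from itertools import product

# hypothetical per-job runtime estimates: RUNTIME[(job, technique, gpus)]
RUNTIME = {
    ("gpt-large", "fsdp", 4): 90, ("gpt-large", "fsdp", 8): 55,
    ("gpt-large", "pipeline", 4): 80, ("gpt-large", "pipeline", 8): 60,
    ("bert-xl", "fsdp", 4): 40, ("bert-xl", "fsdp", 8): 30,
    ("bert-xl", "pipeline", 4): 50, ("bert-xl", "pipeline", 8): 35,
}

def best_plan(jobs, total_gpus=12):
    """Brute-force the joint choice of (technique, gpu count) per job,
    subject to the GPU budget, minimizing the longest-running job."""
    choices = list(product(["fsdp", "pipeline"], [4, 8]))
    best = None
    for assignment in product(choices, repeat=len(jobs)):
        if sum(g for _, g in assignment) > total_gpus:
            continue
        makespan = max(RUNTIME[(j, t, g)]
                       for j, (t, g) in zip(jobs, assignment))
        if best is None or makespan < best[0]:
            best = (makespan, dict(zip(jobs, assignment)))
    return best

print(best_plan(["gpt-large", "bert-xl"]))
```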

Mesh Neural Cellular Automata

  • paper_url: http://arxiv.org/abs/2311.02820
  • repo_url: None
  • paper_authors: Ehsan Pajouheshgar, Yitao Xu, Alexander Mordvintsev, Eyvind Niklasson, Tong Zhang, Sabine Süsstrunk
  • for: Enhancing the realism of virtual environments through texture synthesis.
  • methods: Directly synthesizing dynamic 3D textures on meshes, without UV mapping, using a generalized neural cellular automaton (MeshNCA).
  • results: Although trained only on an icosphere mesh, the model generalizes to synthesize textures on any mesh in real time, and supports multi-modal supervision (images, text prompts, and motion vector fields).
    Abstract Modeling and synthesizing textures are essential for enhancing the realism of virtual environments. Methods that directly synthesize textures in 3D offer distinct advantages to the UV-mapping-based methods as they can create seamless textures and align more closely with the ways textures form in nature. We propose Mesh Neural Cellular Automata (MeshNCA), a method for directly synthesizing dynamic textures on 3D meshes without requiring any UV maps. MeshNCA is a generalized type of cellular automata that can operate on a set of cells arranged on a non-grid structure such as vertices of a 3D mesh. While only being trained on an Icosphere mesh, MeshNCA shows remarkable generalization and can synthesize textures on any mesh in real time after the training. Additionally, it accommodates multi-modal supervision and can be trained using different targets such as images, text prompts, and motion vector fields. Moreover, we conceptualize a way of grafting trained MeshNCA instances, enabling texture interpolation. Our MeshNCA model enables real-time 3D texture synthesis on meshes and allows several user interactions including texture density/orientation control, a grafting brush, and motion speed/direction control. Finally, we implement the forward pass of our MeshNCA model using the WebGL shading language and showcase our trained models in an online interactive demo which is accessible on personal computers and smartphones. Our demo and the high resolution version of this PDF are available at https://meshnca.github.io/.
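
A minimal sketch of one NCA update on mesh vertices follows: each vertex perceives an aggregate of its graph neighbors, an MLP proposes a state delta, and a stochastic mask emulates asynchronous cell firing, as in classic NCA. The perception rule and dimensions are illustrative assumptions, not MeshNCA's exact architecture.

```python
import torch

def meshnca_step(state, edges, update_mlp, fire_rate=0.5):
    """One asynchronous NCA update on mesh vertices (a sketch of the idea).
    `state` is (V, C); `edges` is a (2, E) tensor of vertex index pairs."""
    src, dst = edges
    # perception: mean of neighbor states (simple graph aggregation)
    agg = torch.zeros_like(state).index_add_(0, dst, state[src])
    deg = torch.zeros(state.size(0), 1).index_add_(
        0, dst, torch.ones(src.size(0), 1)).clamp(min=1)
    perception = torch.cat([state, agg / deg], dim=-1)
    delta = update_mlp(perception)
    # stochastic ("asynchronous") cell firing, as in classic NCA
    mask = (torch.rand(state.size(0), 1) < fire_rate).float()
    return state + mask * delta

V, C = 12, 16
mlp = torch.nn.Sequential(torch.nn.Linear(2 * C, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, C))
edges = torch.randint(0, V, (2, 40))
state = torch.zeros(V, C)
state = meshnca_step(state, edges, mlp)
```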

QualEval: Qualitative Evaluation for Model Improvement

  • paper_url: http://arxiv.org/abs/2311.02807
  • repo_url: https://github.com/vmurahari3/qualeval
  • paper_authors: Vishvak Murahari, Ameet Deshpande, Peter Clark, Tanmay Rajpurohit, Ashish Sabharwal, Karthik Narasimhan, Ashwin Kalyan
  • for: Improving the evaluation of large language models (LLMs) by complementing quantitative metrics with automated qualitative evaluation that accelerates model improvement.
  • methods: QualEval combines a powerful LLM reasoner with a novel flexible linear programming solver to generate human-readable, actionable insights for model developers.
  • results: Leveraging QualEval's insights improves the absolute performance of the Llama 2 model by up to 15% points relative to baselines on a challenging dialogue task (DialogSum); QualEval accelerates the pace of model development, in essence acting as a data scientist in a box.
    Abstract Quantitative evaluation metrics have traditionally been pivotal in gauging the advancements of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar to quantify and compare is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only as a way to compare and benchmark models, and do not yield actionable diagnostics, thus making the model improvement process challenging. Model developers find themselves amid extensive manual efforts involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that when applied, accelerate model improvement. The insights are backed by a comprehensive dashboard with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace of model development, thus in essence serving as a data-scientist-in-a-box. Given the focus on critiquing and improving current evaluation metrics, our method serves as a refreshingly new technique for both model evaluation and improvement.

Incorporating Worker Perspectives into MTurk Annotation Practices for NLP

  • paper_url: http://arxiv.org/abs/2311.02802
  • repo_url: None
  • paper_authors: Olivia Huang, Eve Fleisig, Dan Klein
  • for: Improving current practices for NLP data collection on Amazon Mechanical Turk (MTurk), raising data quality while respecting workers' rights.
  • methods: A critical literature review and a survey of MTurk workers, addressing open questions about best practices for fair payment, worker privacy, data quality, and worker incentives.
  • results: Workers prefer reliable, reasonable payments over uncertain, very high ones; frequently report lying on demographic questions; and are frustrated by work rejected without explanation. Workers also view some quality-control methods, such as minimum response times or Master's qualifications, as biased and largely ineffective. Based on these findings, the paper gives recommendations for how future NLP studies can better account for MTurk workers' experiences.
    Abstract Current practices regarding data collection for natural language processing on Amazon Mechanical Turk (MTurk) often rely on a combination of studies on data quality and heuristics shared among NLP researchers. However, without considering the perspectives of MTurk workers, these approaches are susceptible to issues regarding workers' rights and poor response quality. We conducted a critical literature review and a survey of MTurk workers aimed at addressing open questions regarding best practices for fair payment, worker privacy, data quality, and considering worker incentives. We found that worker preferences are often at odds with received wisdom among NLP researchers. Surveyed workers preferred reliable, reasonable payments over uncertain, very high payments; reported frequently lying on demographic questions; and expressed frustration at having work rejected with no explanation. We also found that workers view some quality control methods, such as requiring minimum response times or Master's qualifications, as biased and largely ineffective. Based on the survey results, we provide recommendations on how future NLP studies may better account for MTurk workers' experiences in order to respect workers' rights and improve data quality.

cs.CL - 2023-11-06

STONYBOOK: A System and Resource for Large-Scale Analysis of Novels

  • paper_url: http://arxiv.org/abs/2311.03614
  • repo_url: None
  • paper_authors: Charuta Pethe, Allen Kim, Rajesh Prabhakar, Tanzir Pial, Steven Skiena
  • for: Providing a resource for large-scale analysis of novels, including an open-source end-to-end NLP analysis pipeline and a collection of 49,207 cleaned and annotated novels.
  • methods: A standard XML format for annotating novels, together with a large-scale text-analysis database and an associated web interface.
  • results: The system provides analysis artifacts such as visualizations of character occurrences and interactions, similar books, representative vocabulary, part-of-speech statistics, and readability metrics, supporting qualitative and quantitative analysis across large corpora of novels.
    Abstract Books have historically been the primary mechanism through which narratives are transmitted. We have developed a collection of resources for the large-scale analysis of novels, including: (1) an open source end-to-end NLP analysis pipeline for the annotation of novels into a standard XML format, (2) a collection of 49,207 distinct cleaned and annotated novels, and (3) a database with an associated web interface for the large-scale aggregate analysis of these literary works. We describe the major functionalities provided in the annotation system along with their utilities. We present samples of analysis artifacts from our website, such as visualizations of character occurrences and interactions, similar books, representative vocabulary, part of speech statistics, and readability metrics. We also describe the use of the annotated format in qualitative and quantitative analysis across large corpora of novels.

Dimensions of Online Conflict: Towards Modeling Agonism

  • paper_url: http://arxiv.org/abs/2311.03584
  • repo_url: None
  • paper_authors: Matt Canute, Mali Jin, hannah holtzclaw, Alberto Lusoli, Philippa R Adams, Mugdha Pandya, Maite Taboada, Diana Maynard, Wendy Hui Kyong Chun
  • for: Studying conflict in social media conversations, distinguishing agonism, which fosters democratic dialogue, from hateful antagonism, which undermines it.
  • methods: Twitter conversations on trending controversial topics were collected and labeled with a comprehensive annotation schema covering the source of conflict, the target, and the rhetorical strategies deployed (about 4,000 conversations with multiple labels); logistic regression and transformer-based models were then trained with conversational context such as the number of participants and the structure of interactions.
  • results: Contextual labels help identify conflict and make the models robust to variations in topic; the results can contribute to content moderation and social media platform governance.
    Abstract Agonism plays a vital role in democratic dialogue by fostering diverse perspectives and robust discussions. Within the realm of online conflict there is another type: hateful antagonism, which undermines constructive dialogue. Detecting conflict online is central to platform moderation and monetization. It is also vital for democratic dialogue, but only when it takes the form of agonism. To model these two types of conflict, we collected Twitter conversations related to trending controversial topics. We introduce a comprehensive annotation schema for labelling different dimensions of conflict in the conversations, such as the source of conflict, the target, and the rhetorical strategies deployed. Using this schema, we annotated approximately 4,000 conversations with multiple labels. We then trained both logistic regression and transformer-based models on the dataset, incorporating context from the conversation, including the number of participants and the structure of the interactions. Results show that contextual labels are helpful in identifying conflict and make the models robust to variations in topic. Our research contributes a conceptualization of different dimensions of conflict, a richly annotated dataset, and promising results that can contribute to content moderation.
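
One of the paper's modeling choices, feeding conversational context alongside the text, can be sketched with a logistic regression baseline that appends the number of participants to lexical features. The toy examples and feature set here are illustrative only.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# toy stand-ins for annotated conversations (label 1 = hateful antagonism)
texts = ["You're wrong and you know it", "I see your point, but disagree",
         "This take is idiotic", "Thanks for clarifying your argument"]
n_participants = np.array([[5], [2], [8], [2]])   # contextual feature
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X_text = vec.fit_transform(texts)
# append conversation-level context to the lexical features
X = hstack([X_text, n_participants])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```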

Measuring Adversarial Datasets

  • paper_url: http://arxiv.org/abs/2311.03566
  • repo_url: https://github.com/kritwik1/Detection-of-Anomalies-in-Images-using-Adversarial-learning
  • paper_authors: Yuanchen Bai, Raoyi Huang, Vijay Viswanathan, Tzu-Sheng Kuo, Tongshuang Wu
  • for: Examining whether existing quantifiable metrics capture the difficulty, diversity, and disagreement of text instances in NLP tasks.
  • methods: A systematic survey of existing metrics, applied to several adversarial-effect datasets by comparing the distributions of the original data points and their adversarial counterparts.
  • results: Existing metrics capture the difficulty and diversity of adversarial examples well, but may not capture their disagreement; the findings offer valuable insight into what makes these datasets challenging and whether they align with their underlying assumptions.
    Abstract In the era of widespread public use of AI systems across various domains, ensuring adversarial robustness has become increasingly vital to maintain safety and prevent undesirable errors. Researchers have curated various adversarial datasets (through perturbations) for capturing model deficiencies that cannot be revealed in standard benchmark datasets. However, little is known about how these adversarial examples differ from the original data points, and there is still no methodology to measure the intended and unintended consequences of those adversarial transformations. In this research, we conducted a systematic survey of existing quantifiable metrics that describe text instances in NLP tasks, among dimensions of difficulty, diversity, and disagreement. We selected several current adversarial effect datasets and compared the distributions between the original and their adversarial counterparts. The results provide valuable insights into what makes these datasets more challenging from a metrics perspective and whether they align with underlying assumptions.
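
In the same spirit as the paper's distribution comparisons, the sketch below computes a simple diversity proxy (distinct-bigram ratio) and a two-sample KS test on a difficulty proxy for an original set versus its adversarial counterpart. These proxies are stand-ins, not the specific metrics surveyed in the paper.

```python
from collections import Counter
from scipy.stats import ks_2samp

def distinct_n(texts, n=2):
    """A simple diversity proxy: ratio of unique n-grams to total n-grams."""
    grams = Counter()
    for t in texts:
        toks = t.split()
        grams.update(zip(*[toks[i:] for i in range(n)]))
    total = sum(grams.values())
    return len(grams) / total if total else 0.0

original = ["the movie was great", "the plot was dull"]
adversarial = ["the movie was graet", "the plot wsa dull"]

print("diversity (orig):", distinct_n(original))
print("diversity (adv): ", distinct_n(adversarial))
# compare a difficulty proxy (length in tokens) between the two distributions
print(ks_2samp([len(t.split()) for t in original],
               [len(t.split()) for t in adversarial]))
```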

Quantifying Uncertainty in Natural Language Explanations of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03533
  • repo_url: None
  • paper_authors: Sree Harsha Tanneru, Chirag Agarwal, Himabindu Lakkaraju
  • for: Quantifying the uncertainty in natural language explanations generated by LLMs.
  • methods: Two novel metrics, Verbalized Uncertainty and Probing Uncertainty, for quantifying the uncertainty of generated explanations.
  • results: Experiments show that verbalized uncertainty is not a reliable estimate of explanation confidence, whereas probing uncertainty estimates correlate with explanation faithfulness: lower uncertainty corresponds to more faithful explanations.
    Abstract Large Language Models (LLMs) are increasingly used as powerful tools for several high-stakes natural language processing (NLP) applications. Recent prompting works claim to elicit intermediate reasoning steps and key tokens that serve as proxy explanations for LLM predictions. However, there is no certainty whether these explanations are reliable and reflect the LLMs behavior. In this work, we make one of the first attempts at quantifying the uncertainty in explanations of LLMs. To this end, we propose two novel metrics -- $\textit{Verbalized Uncertainty}$ and $\textit{Probing Uncertainty}$ -- to quantify the uncertainty of generated explanations. While verbalized uncertainty involves prompting the LLM to express its confidence in its explanations, probing uncertainty leverages sample and model perturbations as a means to quantify the uncertainty. Our empirical analysis of benchmark datasets reveals that verbalized uncertainty is not a reliable estimate of explanation confidence. Further, we show that the probing uncertainty estimates are correlated with the faithfulness of an explanation, with lower uncertainty corresponding to explanations with higher faithfulness. Our study provides insights into the challenges and opportunities of quantifying uncertainty in LLM explanations, contributing to the broader discussion of the trustworthiness of foundation models.
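
The probing-uncertainty idea, scoring uncertainty from the disagreement among explanations sampled under perturbation, can be sketched as follows. The resampling strategy and the Jaccard-based disagreement score are illustrative choices, not the paper's exact estimator.

```python
import itertools
import random

def probing_uncertainty(sample_explanation, prompt, n=5):
    """Sketch of probing uncertainty: sample several explanations under
    perturbation (here, plain resampling) and score their disagreement.
    `sample_explanation` is any callable returning one explanation string."""
    explanations = [sample_explanation(prompt) for _ in range(n)]
    def jaccard(a, b):
        A, B = set(a.lower().split()), set(b.lower().split())
        return len(A & B) / len(A | B) if A | B else 1.0
    sims = [jaccard(a, b) for a, b in itertools.combinations(explanations, 2)]
    return 1.0 - sum(sims) / len(sims)   # high value = high uncertainty

def mock_llm(prompt):   # stand-in for a real sampled LLM call
    return random.choice(["the review praises the plot",
                          "positive sentiment about the plot",
                          "the review is negative"])

print(probing_uncertainty(mock_llm, "Explain the sentiment label."))
```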

Spoken Dialogue System for Medical Prescription Acquisition on Smartphone: Development, Corpus and Evaluation

  • paper_url: http://arxiv.org/abs/2311.03510
  • repo_url: None
  • paper_authors: Ali Can Kocabiyikoglu, François Portet, Jean-Marc Babouchkine, Prudence Gibert, Hervé Blanchon, Gaëtan Gavazzi
  • for: E-prescribing software within hospital information systems (HIS); this paper presents a way to record drug prescriptions through a natural spoken dialogue system on a smartphone.
  • methods: Dialogue modeling, semantic extraction, and data augmentation techniques, used to build a spoken-dialogue prescription system in a low-resource environment.
  • results: Evaluated in the wild with 55 participants, the system reduces the time physicians spend entering information at a computer: average prescription time was 66.15 seconds for physicians and 35.64 seconds for other experts, with task success rates of 76% and 72% respectively. The recorded and annotated trial data form PxCorpus, released in full to the community (https://doi.org/10.5281/zenodo.6524162).
    Abstract Hospital information systems (HIS) have become an essential part of healthcare institutions and now incorporate prescribing support software. Prescription support software allows for structured information capture, which improves the safety, appropriateness and efficiency of prescriptions and reduces the number of adverse drug events (ADEs). However, such a system increases the amount of time physicians spend at a computer entering information instead of providing medical care. In addition, any new visiting clinician must learn to manage complex interfaces since each HIS has its own interfaces. In this paper, we present a natural language interface for e-prescribing software in the form of a spoken dialogue system accessible on a smartphone. This system allows prescribers to record their prescriptions verbally, a form of interaction closer to their usual practice. The system extracts the formal representation of the prescription ready to be checked by the prescribing software and uses the dialogue to request mandatory information, correct errors or warn of particular situations. Since, to the best of our knowledge, there is no existing voice-based prescription dialogue system, we present the system developed in a low-resource environment, focusing on dialogue modeling, semantic extraction and data augmentation. The system was evaluated in the wild with 55 participants. This evaluation showed that our system has an average prescription time of 66.15 seconds for physicians and 35.64 seconds for other experts, and a task success rate of 76\% for physicians and 72\% for other experts. All evaluation data were recorded and annotated to form PxCorpus, the first spoken drug prescription corpus that has been made fully available to the community (\url{https://doi.org/10.5281/zenodo.6524162}).

In-Context Exemplars as Clues to Retrieving from Large Associative Memory

  • paper_url: http://arxiv.org/abs/2311.03498
  • repo_url: https://github.com/andotalao24/ICL-as-retrieval-from-associative-memory
  • paper_authors: Jiachen Zhao
  • for: Investigating the in-context learning (ICL) ability of large language models (LLMs) and the question of how to choose exemplars.
  • methods: A theoretical framework based on Hopfield networks that casts ICL as retrieval from an associative memory, plus experiments on exemplar selection.
  • results: ICL performance depends directly on the choice of exemplars, and the paper proposes more efficient active exemplar selection; connecting ICL to memory retrieval sheds new light on how LLMs work.
    Abstract Recently, large language models (LLMs) have made remarkable progress in natural language processing. The most representative ability of LLMs is in-context learning (ICL), which enables LLMs to learn patterns from in-context exemplars without training. The performance of ICL greatly depends on the exemplars used. However, how to choose exemplars remains unclear due to the lack of understanding of how in-context learning works. In this paper, we present a novel perspective on ICL by conceptualizing it as contextual retrieval from a model of associative memory. We establish a theoretical framework of ICL based on Hopfield Networks. Based on our framework, we look into how in-context exemplars influence the performance of ICL and propose more efficient active exemplar selection. Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval, with potential implications for advancing the understanding of LLMs.
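
The associative-memory view rests on the modern Hopfield update $\xi^{new} = X^\top \mathrm{softmax}(\beta X \xi)$, which has exactly the softmax-attention form. A small NumPy sketch shows a noisy cue retrieving its closest stored pattern; the patterns and temperature are toy values.

```python
import numpy as np

def hopfield_retrieve(memories, query, beta=4.0):
    """One update of a modern (continuous) Hopfield network:
    xi_new = X^T softmax(beta * X xi). With large beta this retrieves the
    stored pattern closest to the query, the memory-retrieval view of ICL."""
    scores = beta * memories @ query            # (num_patterns,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return memories.T @ attn                    # convex combination of memories

X = np.array([[1.0, 0.0, 0.0],                  # stored patterns (exemplars)
              [0.0, 1.0, 0.0]])
noisy = np.array([0.9, 0.1, 0.05])              # corrupted cue
print(hopfield_retrieve(X, noisy))              # close to pattern 0
```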

Tackling Concept Shift in Text Classification using Entailment-style Modeling

  • paper_url: http://arxiv.org/abs/2311.03320
  • repo_url: None
  • paper_authors: Sumegh Roychowdhury, Karan Gupta, Siva Rajesh Kasa, Prasanna Srinivasa Murthy, Alok Chandra
  • for: Handle concept shift in text classification tasks with less labeled data.
  • methods: Reformulate vanilla classification as an entailment-style problem, requiring less data to adapt to new concepts.
  • results: Achieve absolute F1 gains of up to 7% and 40% in few-shot settings on real-world and synthetic datasets, respectively, with 75% labeling cost savings overall.
    Abstract Pre-trained language models (PLMs) have seen tremendous success in text classification (TC) problems in the context of Natural Language Processing (NLP). In many real-world text classification tasks, the class definitions being learned do not remain constant but rather change with time - this is known as Concept Shift. Most techniques for handling concept shift rely on retraining the old classifiers with the newly labelled data. However, given the amount of training data required to fine-tune large DL models for the new concepts, the associated labelling costs can be prohibitively expensive and time consuming. In this work, we propose a reformulation, converting vanilla classification into an entailment-style problem that requires significantly less data to re-train the text classifier to adapt to new concepts. We demonstrate the effectiveness of our proposed method on both real world & synthetic datasets achieving absolute F1 gains upto 7% and 40% respectively in few-shot settings. Further, upon deployment, our solution also helped save 75% of labeling costs overall.
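
The entailment-style reformulation can be sketched with an off-the-shelf NLI model: every class label becomes a natural-language hypothesis, so adapting to a new concept only requires a new hypothesis rather than full retraining. The model name and hypothesis template below are illustrative; the paper's concrete setup may differ.

```python
from transformers import pipeline

# entailment-style classification: each label becomes a hypothesis and an
# NLI model scores premise -> hypothesis entailment
nli = pipeline("zero-shot-classification",
               model="facebook/bart-large-mnli")

review = "The package arrived crushed and two items were missing."
labels = ["shipping damage", "billing issue", "product quality"]
result = nli(review, candidate_labels=labels,
             hypothesis_template="This complaint is about {}.")
print(result["labels"][0], result["scores"][0])
```

Under this framing, handling a shifted class definition amounts to editing the `labels` list (and fine-tuning on a handful of examples), which is what lets the method adapt with far less labeled data.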

Unraveling Downstream Gender Bias from Large Language Models: A Study on AI Educational Writing Assistance

  • paper_url: http://arxiv.org/abs/2311.03311
  • repo_url: https://github.com/epfl-ml4ed/unraveling-llm-bias
  • paper_authors: Thiemo Wambsganss, Xiaotian Su, Vinitra Swamy, Seyed Parsa Neshaei, Roman Rietsche, Tanja Käser
  • for: Investigating how inherent bias in LLMs affects AI writing support systems.
  • methods: A large-scale user study (231 students) with several kinds of models to detect bias at different stages of the writing-support pipeline.
  • results: In the AI writing support setting studied, bias does not transfer into the students' own responses.
    Abstract Large Language Models (LLMs) are increasingly utilized in educational tasks such as providing writing suggestions to students. Despite their potential, LLMs are known to harbor inherent biases which may negatively impact learners. Previous studies have investigated bias in models and data representations separately, neglecting the potential impact of LLM bias on human writing. In this paper, we investigate how bias transfers through an AI writing support pipeline. We conduct a large-scale user study with 231 students writing business case peer reviews in German. Students are divided into five groups with different levels of writing support: one classroom group with feature-based suggestions and four groups recruited from Prolific -- a control group with no assistance, two groups with suggestions from fine-tuned GPT-2 and GPT-3 models, and one group with suggestions from pre-trained GPT-3.5. Using GenBit gender bias analysis, Word Embedding Association Tests (WEAT), and Sentence Embedding Association Test (SEAT) we evaluate the gender bias at various stages of the pipeline: in model embeddings, in suggestions generated by the models, and in reviews written by students. Our results demonstrate that there is no significant difference in gender bias between the resulting peer reviews of groups with and without LLM suggestions. Our research is therefore optimistic about the use of AI writing support in the classroom, showcasing a context where bias in LLMs does not transfer to students' responses.
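
One of the bias probes used in the study, WEAT, reduces to a cosine-similarity effect size; a compact NumPy version is sketched below, with random vectors standing in for real embeddings.

```python
import numpy as np

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: d = (mean_x s(x,A,B) - mean_y s(y,A,B)) / pooled std,
    where s(w,A,B) = mean cos(w,a) - mean cos(w,b). Rows are embeddings of
    target word sets X, Y and attribute word sets A, B."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    def s(w):
        return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])
    sx = np.array([s(x) for x in X])
    sy = np.array([s(y) for y in Y])
    pooled = np.concatenate([sx, sy])
    return (sx.mean() - sy.mean()) / pooled.std(ddof=1)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(8, 50)), rng.normal(size=(8, 50))   # target sets
A, B = rng.normal(size=(8, 50)), rng.normal(size=(8, 50))   # attribute sets
print(weat_effect_size(X, Y, A, B))   # near 0 for random embeddings
```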

Ziya2: Data-centric Learning is All LLMs Need

  • paper_url: http://arxiv.org/abs/2311.03301
  • repo_url: None
  • paper_authors: Ruyi Gan, Ziwei Wu, Renliang Sun, Junyu Lu, Xiaojun Wu, Dixiang Zhang, Kunhao Pan, Ping Yang, Qi Yang, Jiaxing Zhang, Yan Song
  • for: Proposing Ziya2, a 13-billion-parameter model built on LLaMA2, with data-centric optimization applied at different training stages to strengthen its learning across multiple benchmarks.
  • methods: A combination of pre-training techniques and data-centric optimization strategies, including the selection and organization of pre-training data and stage-wise data curation during continual pre-training.
  • results: Ziya2 performs strongly on multiple benchmarks, particularly against representative open-source models. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.
    Abstract Various large language models (LLMs) have been proposed in recent years, including closed- and open-source ones, continually setting new records on multiple benchmarks. However, the development of LLMs still faces several issues, such as high cost of training models from scratch, and continual pre-training leading to catastrophic forgetting, etc. Although many such issues are addressed along the line of research on LLMs, an important yet practical limitation is that many studies overly pursue enlarging model sizes without comprehensively analyzing and optimizing the use of pre-training data in their learning process, as well as appropriate organization and leveraging of such data in training LLMs under cost-effective settings. In this work, we propose Ziya2, a model with 13 billion parameters adopting LLaMA2 as the foundation model, and further pre-trained on 700 billion tokens, where we focus on pre-training techniques and use data-centric optimization to enhance the learning process of Ziya2 on different stages. Experiments show that Ziya2 significantly outperforms other models in multiple benchmarks especially with promising results compared to representative open-source ones. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.

Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges

  • paper_url: http://arxiv.org/abs/2311.03287
  • repo_url: https://github.com/gzcch/bingo
  • paper_authors: Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao
  • for: Evaluating and characterizing the hallucination behavior of GPT-4V(ision), focusing on two common types: bias and interference.
  • methods: A new benchmark, Bias and Interference Challenges in Visual Language Models (Bingo), for assessing these hallucinations.
  • results: GPT-4V(ision) exhibits a regional bias, interpreting Western images or images with English text better than images from other countries or containing text in other languages. It is also vulnerable to leading questions and is often confused when interpreting multiple images together; self-correction and chain-of-thought methods do not resolve these challenges.
    Abstract While GPT-4V(ision) impressively models both visual and textual information simultaneously, it's hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.

Safurai-Csharp: Harnessing Synthetic Data to improve language-specific Code LLM

  • paper_url: http://arxiv.org/abs/2311.03243
  • repo_url: None
  • paper_authors: Davide Cifarelli, Leonardo Boiardi, Alessandro Puppo, Leon Jovanovic
  • for: An open-source model specialized in generating, completing, and debugging C# code.
  • methods: Built on the CodeLlama 34B model, using the EvolInstruct technique to create a refined and expanded dataset for fine-tuning.
  • results: A notable score of 56.33% on the Manual MultiPL-E benchmark (Zero-Shot, Pass@1), indicating strong potential for streamlining developers' workflows and supporting code learning.
    Abstract This paper introduces Safurai-Csharp, an open-source model designed to specialize in the generation, completion, and debugging of C# code. Safurai-Csharp is built upon the novel CodeLlama 34B model and leverages the EvolInstruct technique, creating a refined and expanded dataset for its fine-tuning process. Its performance, a notable score of 56.33% on the Manual MultiPL-E benchmark (Zero-Shot, Pass@1), signals its high capacity to streamline developers' workflows and aid code learning. It shows promise in setting new standards in the landscape of open-source C# LLMs and aims to inspire more inclusive and wide-ranging development in the field of language-specific LLMs.

p-Laplacian Transformer

  • paper_url: http://arxiv.org/abs/2311.03235
  • repo_url: None
  • paper_authors: Tuan Nguyen, Tam Nguyen, Vinh Nguyen, Tan M. Nguyen
  • for: Studying the self-attention mechanism in transformers with the goal of better language-modeling performance.
  • methods: A new class of transformers, the $p$-Laplacian Transformer (p-LaT), which uses a $p$-Laplacian regularization framework to exploit the heterophilic structure within self-attention layers.
  • results: Experiments on a wide range of benchmark datasets demonstrate the advantages of p-LaT over baseline transformers.
    Abstract $p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data. Smaller values of $p$ promote sparsity and interpretability, while larger values encourage smoother solutions. In this paper, we first show that the self-attention mechanism obtains the minimal Laplacian regularization ($p=2$) and encourages the smoothness in the architecture. However, the smoothness is not suitable for the heterophilic structure of self-attention in transformers where attention weights between tokens that are in close proximity and non-close ones are assigned indistinguishably. From that insight, we then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT), which leverages $p$-Laplacian regularization framework to harness the heterophilic features within self-attention layers. In particular, low $p$ values will effectively assign higher attention weights to tokens that are in close proximity to the current token being processed. We empirically demonstrate the advantages of p-LaT over the baseline transformers on a wide range of benchmark datasets.
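
To give intuition for the regularizer, the sketch below runs one explicit $p$-Laplacian diffusion step on token features over an attention-derived affinity graph: for $p < 2$, the factor $\lVert x_i - x_j\rVert^{p-2}$ upweights pairs that are already close, matching the heterophily argument above. This illustrates the underlying energy, not the exact p-LaT layer.

```python
import torch

def p_laplacian_smooth(x, weights, p=1.5, step=0.1, eps=1e-3):
    """One explicit p-Laplacian diffusion step on token features x (T, D)
    over an affinity graph `weights` (T, T). This descends the graph
    p-Dirichlet energy sum_ij w_ij * ||x_i - x_j||^p (up to constants)."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)              # diff[i, j] = x_i - x_j
    norm = diff.norm(dim=-1, keepdim=True).clamp(min=eps)
    flow = weights.unsqueeze(-1) * norm.pow(p - 2) * diff
    return x - step * flow.sum(dim=1)                   # gradient descent step

T, D = 6, 8
x = torch.randn(T, D)
attn = torch.softmax(torch.randn(T, T), dim=-1)        # attention as the graph
x_smoothed = p_laplacian_smooth(x, attn, p=1.5)
```

Setting `p=2` recovers ordinary (Laplacian) smoothing, consistent with the paper's observation that standard self-attention corresponds to the minimal ($p=2$) case.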

Model-based Counterfactual Generator for Gender Bias Mitigation

  • paper_url: http://arxiv.org/abs/2311.03186
  • repo_url: None
  • paper_authors: Ewoenam Kwaku Tokpo, Toon Calders
  • for: Mitigating gender bias in language models
  • methods: combines data processing techniques and a bi-objective training regime to develop a model-based solution for generating counterfactuals
  • results: alleviates the shortcomings of dictionary-based solutions and improves the mitigation of gender bias
    Abstract Counterfactual Data Augmentation (CDA) has been one of the preferred techniques for mitigating gender bias in natural language models. CDA techniques have mostly employed word substitution based on dictionaries. Although such dictionary-based CDA techniques have been shown to significantly improve the mitigation of gender bias, in this paper, we highlight some limitations of such dictionary-based counterfactual data augmentation techniques, such as susceptibility to ungrammatical compositions, and lack of generalization outside the set of predefined dictionary words. Model-based solutions can alleviate these problems, yet the lack of qualitative parallel training data hinders development in this direction. Therefore, we propose a combination of data processing techniques and a bi-objective training regime to develop a model-based solution for generating counterfactuals to mitigate gender bias. We implemented our proposed solution and performed an empirical evaluation which shows how our model alleviates the shortcomings of dictionary-based solutions.
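
For reference, the dictionary-based CDA baseline the paper improves on looks roughly like this; its failure modes (ambiguous forms like "her", and anything outside the dictionary) are exactly the limitations the model-based generator targets. The word list is a tiny illustrative subset.

```python
import re

# the dictionary-based CDA baseline the paper critiques: word-level swaps
SWAP = {"he": "she", "she": "he", "him": "her", "her": "him",
        "his": "her", "hers": "his", "actor": "actress", "actress": "actor"}

def counterfactual(text):
    """Swap gendered words via a fixed dictionary. Note the failure modes
    the paper points out: mapping 'her' is ambiguous (objective 'him' vs.
    possessive 'his'), and words outside the dictionary are never touched."""
    def repl(m):
        w = m.group(0)
        out = SWAP.get(w.lower(), w)
        return out.capitalize() if w[0].isupper() else out
    return re.sub(r"\b\w+\b", repl, text)

print(counterfactual("She thanked him for his help."))
# -> "He thanked her for her help."  (correct here, but 'her' in the input
#    direction can yield ungrammatical swaps)
```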

Architectural Sweet Spots for Modeling Human Label Variation by the Example of Argument Quality: It’s Best to Relate Perspectives!

  • paper_url: http://arxiv.org/abs/2311.03153
  • repo_url: https://github.com/phhei/relateperspectives-sweetspots
  • paper_authors: Philipp Heinisch, Matthias Orlikowski, Julia Romberg, Philipp Cimiano
  • for: Subjective annotation tasks in natural language processing, using argument quality classification as the case study.
  • methods: A continuum of architectures, from models that fully aggregate perspectives into a majority label to "share nothing" models that treat each annotator in isolation, with recommender-system-inspired layers in between that model the relations between annotators.
  • results: Modeling relations between annotators improves averaged annotator-individual F$_1$-scores by up to 43% over a majority-label model, suggesting that approaches to subjectivity benefit from relating individual perspectives.
    Abstract Many annotation tasks in natural language processing are highly subjective in that there can be different valid and justified perspectives on what is a proper label for a given example. This also applies to the judgment of argument quality, where the assignment of a single ground truth is often questionable. At the same time, there are generally accepted concepts behind argumentation that form a common ground. To best represent the interplay of individual and shared perspectives, we consider a continuum of approaches ranging from models that fully aggregate perspectives into a majority label to "share nothing"-architectures in which each annotator is considered in isolation from all other annotators. In between these extremes, inspired by models used in the field of recommender systems, we investigate the extent to which architectures that include layers to model the relations between different annotators are beneficial for predicting single-annotator labels. By means of two tasks of argument quality classification (argument concreteness and validity/novelty of conclusions), we show that recommender architectures increase the averaged annotator-individual F$_1$-scores up to $43\%$ over a majority label model. Our findings indicate that approaches to subjectivity can benefit from relating individual perspectives.
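
The recommender-inspired middle ground can be sketched as a shared text encoder plus a learned embedding per annotator, so the head predicts annotator-individual labels while related annotators share statistical strength. Dimensions and the concatenation design below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AnnotatorAwareClassifier(nn.Module):
    """Recommender-style head (a sketch): a shared text representation is
    combined with a learned per-annotator embedding, letting the model
    relate annotators while still producing individual predictions."""
    def __init__(self, text_dim, n_annotators, ann_dim=16, n_classes=2):
        super().__init__()
        self.annotator_emb = nn.Embedding(n_annotators, ann_dim)
        self.head = nn.Linear(text_dim + ann_dim, n_classes)

    def forward(self, text_repr, annotator_ids):
        z = torch.cat([text_repr, self.annotator_emb(annotator_ids)], dim=-1)
        return self.head(z)

model = AnnotatorAwareClassifier(text_dim=768, n_annotators=20)
logits = model(torch.randn(4, 768), torch.tensor([0, 3, 3, 7]))
```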

Text Augmentations with R-drop for Classification of Tweets Self Reporting Covid-19

  • paper_url: http://arxiv.org/abs/2311.03420
  • repo_url: None
  • paper_authors: Sumam Francis, Marie-Francine Moens
  • for: Models for the Social Media Mining for Health 2023 shared task; our team addressed Task 1, classifying tweets that self-report a Covid-19 diagnosis.
  • methods: A classification model that combines diverse text augmentations, such as synonym substitution, reserved words, and back translation, with R-drop to augment data and mitigate overfitting.
  • results: The system achieves an F1 score of 0.877 on the test set, surpassing the task mean and median scores.
    Abstract This paper presents models created for the Social Media Mining for Health 2023 shared task. Our team addressed the first task, classifying tweets that self-report Covid-19 diagnosis. Our approach involves a classification model that incorporates diverse textual augmentations and utilizes R-drop to augment data and mitigate overfitting, boosting model efficacy. Our leading model, enhanced with R-drop and augmentations like synonym substitution, reserved words, and back translations, outperforms the task mean and median scores. Our system achieves an impressive F1 score of 0.877 on the test set.
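
R-drop itself is easy to state in code: run the same batch through the model twice with dropout active, and add a symmetric KL term between the two predictive distributions to the usual cross-entropy. The sketch below assumes a classifier in train mode; `alpha` is an illustrative weight.

```python
import torch
import torch.nn.functional as F

def rdrop_loss(model, inputs, labels, alpha=1.0):
    """R-drop: two forward passes of the same batch (model must be in train
    mode so dropout masks differ), cross-entropy on both outputs, plus a
    symmetric KL term that pulls the two distributions together."""
    logits1, logits2 = model(inputs), model(inputs)
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean") +
                F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return ce + alpha * kl
```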

Injecting Categorical Labels and Syntactic Information into Biomedical NER

  • paper_url: http://arxiv.org/abs/2311.03113
  • repo_url: None
  • paper_authors: Sumam Francis, Marie-Francine Moens
  • for: Improving the accuracy of biomedical named entity recognition (NER).
  • methods: Two approaches: first, train a sequence-level classifier that assigns sentences to categories, recasting the labels as natural-language templates (an entailment formulation) to improve classifier accuracy, and inject the resulting sentence-level labels into the NER model; second, jointly learn the categorical labels and NER labels while also injecting part-of-speech (POS) tags for additional syntactic context.
  • results: Experiments on three benchmark datasets show that injecting categorical label information together with syntactic context improves NER and outperforms baseline BERT-based models.
    Abstract We present a simple approach to improve biomedical named entity recognition (NER) by injecting categorical labels and Part-of-speech (POS) information into the model. We use two approaches, in the first approach, we first train a sequence-level classifier to classify the sentences into categories to obtain the sentence-level tags (categorical labels). The sequence classifier is modeled as an entailment problem by modifying the labels as a natural language template. This helps to improve the accuracy of the classifier. Further, this label information is injected into the NER model. In this paper, we demonstrate effective ways to represent and inject these labels and POS attributes into the NER model. In the second approach, we jointly learn the categorical labels and NER labels. Here we also inject the POS tags into the model to increase the syntactic context of the model. Experiments on three benchmark datasets show that incorporating categorical label information with syntactic context is quite useful and outperforms baseline BERT-based models.

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

  • paper_url: http://arxiv.org/abs/2311.03099
  • repo_url: https://github.com/yule-buaa/mergelm
  • paper_authors: Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li
  • for: Showing that language models (LMs) can acquire new capabilities by assimilating the parameters of homologous models, without retraining or GPUs.
  • methods: A novel operation called DARE (Drop And REscale) that sets most delta parameters to zero without affecting the abilities of SFT LMs; the sparsified delta parameters of multiple SFT homologous models are then merged into a single model by parameter averaging.
  • results: Delta parameter values typically lie within about 0.005, and DARE can effortlessly eliminate 99% of them; after continuous pre-training, however, the value range grows to around 0.03, making DARE impractical. Removing 10% of the fine-tuned (rather than delta) parameters drastically degrades performance (even to 0), showing that SFT stimulates abilities via delta parameters rather than injecting new ones. DARE can also merge multiple task-specific LMs into one multi-ability LM: merging WizardLM with WizardMath raises WizardLM's GSM8K zero-shot accuracy from 2.2 to 66.3 while retaining its instruction-following ability and surpassing WizardMath's original 64.2.
    Abstract In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We initially observe that by introducing a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zeros without affecting the capabilities of SFT LMs and larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify delta parameters of multiple SFT homologous models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results show that: (1) The delta parameter value ranges for SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We have also tried to remove fine-tuned instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates the abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, the merger of WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original 64.2 performance. Codes are available at https://github.com/yule-BUAA/MergeLM.
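
A compact PyTorch sketch of DARE-based merging, following the description above: drop each delta parameter with probability `drop_p`, rescale the survivors by $1/(1-p)$, then average the sparsified deltas onto the base weights. State-dict handling is simplified for illustration.

```python
import torch

def dare_merge(base, finetuned_models, drop_p=0.9):
    """DARE: for each fine-tuned model, randomly drop a fraction `drop_p`
    of its delta parameters (theta_sft - theta_base), rescale survivors by
    1/(1 - drop_p), then average the sparsified deltas onto the base."""
    merged = {k: v.clone() for k, v in base.items()}
    for sft in finetuned_models:
        for k in merged:
            delta = sft[k] - base[k]
            mask = (torch.rand_like(delta) > drop_p).float()
            merged[k] += mask * delta / (1 - drop_p) / len(finetuned_models)
    return merged

base = {"w": torch.zeros(4)}
sft_a = {"w": torch.tensor([0.004, -0.002, 0.001, 0.003])}
sft_b = {"w": torch.tensor([-0.001, 0.005, 0.000, -0.004])}
print(dare_merge(base, [sft_a, sft_b], drop_p=0.5))
```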

BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer

  • paper_url: http://arxiv.org/abs/2311.03078
  • repo_url: https://github.com/eblict-gigatech/BanLemma
  • paper_authors: Sadia Afrin, Md. Shahad Mahmud Chowdhury, Md. Ekramul Islam, Faisal Ahamed Khan, Labib Imam Chowdhury, MD. Motahar Mahtab, Nazifa Nuha Chowdhury, Massud Forkan, Neelima Kundu, Hakim Arif, Mohammad Mamun Or Rashid, Mohammad Ruhul Amin, Nabeel Mohammed
  • for: Proposing a word-formation-dependent, rule- and dictionary-based lemmatization algorithm for Bangla.
  • methods: Linguistic rules and a dictionary designed specifically for Bangla; the rules were developed by analyzing a large corpus of Bangla text from various domains, sources, and time periods to observe how inflected words are formed.
  • results: The lemmatizer achieves 96.36% accuracy against a manually annotated test set and performs competitively on three previously published Bangla lemmatization datasets.
    Abstract Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, due to the highly inflected nature and morphological richness, lemmatization in Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their parts of speech class within a given sentence. Unlike previous rule-based approaches, we analyzed the suffix marker occurrence according to the morpho-syntactic values and then utilized sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.
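
The suffix-marker-sequence idea can be caricatured as below: per-PoS sequences of markers are tried longest-first, and a stem is accepted only if the dictionary validates it. The romanized markers and dictionary entries are invented placeholders, not BanLemma's actual Bangla rules.

```python
# illustrative only: the real system uses PoS-specific rules over Bangla
# suffix marker *sequences*; the markers below are hypothetical placeholders
SUFFIX_MARKERS = {
    "noun": [["guli", "ke"], ["guli"], ["ke"], ["ra"]],   # longest first
}
DICTIONARY = {"chhele", "boi"}                            # known lemmas

def lemmatize(word, pos):
    """Strip the longest matching suffix-marker sequence for this PoS,
    accepting the stem only if the dictionary validates it."""
    for seq in SUFFIX_MARKERS.get(pos, []):
        suffix = "".join(seq)
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in DICTIONARY:
                return stem
    return word

print(lemmatize("chheleguli", "noun"))   # -> "chhele"
print(lemmatize("boike", "noun"))        # -> "boi"
```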

Zero-shot Bilingual App Reviews Mining with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03058
  • repo_url: https://github.com/jl-wei/mini-bar
  • paper_authors: Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray
  • for: Improving the assessment and refinement of software requirements by leveraging user reviews from app stores.
  • methods: Mini-BAR, a tool that integrates large language models (LLMs) to perform zero-shot mining of user reviews in both English and French, with four main components: classification, clustering, abstractive summary generation, and ranking.
  • results: On a dataset of 6,000 English and 6,000 French annotated user reviews, preliminary results demonstrate Mini-BAR's effectiveness and efficiency at classifying, clustering, summarizing, and ranking review clusters for requirements engineering.
    Abstract App reviews from app stores are crucial for improving software requirements. A large number of valuable reviews are continually being posted, describing software problems and expected features. Effectively utilizing user reviews necessitates the extraction of relevant information, as well as their subsequent summarization. Due to the substantial volume of user reviews, manual analysis is arduous. Various approaches based on natural language processing (NLP) have been proposed for automatic user review mining. However, the majority of them requires a manually crafted dataset to train their models, which limits their usage in real-world scenarios. In this work, we propose Mini-BAR, a tool that integrates large language models (LLMs) to perform zero-shot mining of user reviews in both English and French. Specifically, Mini-BAR is designed to (i) classify the user reviews, (ii) cluster similar reviews together, (iii) generate an abstractive summary for each cluster and (iv) rank the user review clusters. To evaluate the performance of Mini-BAR, we created a dataset containing 6,000 English and 6,000 French annotated user reviews and conducted extensive experiments. Preliminary results demonstrate the effectiveness and efficiency of Mini-BAR in requirement engineering by analyzing bilingual app reviews. (Replication package containing the code, dataset, and experiment setups on https://github.com/Jl-wei/mini-bar )
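A minimal sketch of the four-stage pipeline described above. The classify and summarize stubs stand in for the zero-shot LLM calls, and the embedding and clustering choices (TF-IDF plus k-means) are illustrative assumptions, not necessarily those of Mini-BAR.

```python
# A Mini-BAR-style pipeline sketch: classify reviews, cluster the relevant
# ones, rank clusters, then summarize each. Stubs replace the LLM calls.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def classify(review: str) -> bool:
    """Stub for zero-shot LLM classification (keep bugs/feature requests)."""
    return any(w in review.lower() for w in ("crash", "feature", "please add"))

def summarize(reviews: list[str]) -> str:
    """Stub for LLM abstractive summarization of one cluster."""
    return reviews[0][:60]  # placeholder: first review, truncated

reviews = [
    "App crashes on startup after the update.",
    "Crash every time I open the camera.",
    "Please add a dark mode feature.",
    "Great app, five stars!",
]
relevant = [r for r in reviews if classify(r)]
X = TfidfVectorizer().fit_transform(relevant)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
clusters: dict[int, list[str]] = {}
for review, label in zip(relevant, labels):
    clusters.setdefault(label, []).append(review)
# Rank clusters by size (a simple proxy for relevance) and summarize each.
for label, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(len(members), "reviews:", summarize(members))
```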

Detecting Agreement in Multi-party Conversational AI

  • paper_url: http://arxiv.org/abs/2311.03026
  • repo_url: None
  • paper_authors: Laura Schauer, Jason Sweeney, Charlie Lyttle, Zein Said, Aron Szeles, Cale Clark, Katie McAskill, Xander Wickham, Tom Byars, Daniel Hernández Garcia, Nancie Gunson, Angus Addlesee, Oliver Lemon
  • for: Addresses the practical challenges of deploying Socially Assistive Robots (SARs) in multi-party conversations, such as speaker recognition, addressee recognition, and complex turn-taking.
  • methods: Presents a multi-party conversational system that invites two users to play a trivia quiz game, detects whether the users agree or disagree on a final answer, and responds accordingly.
  • results: The evaluation covers both system performance and user assessment, with a focus on detecting user agreement; the annotated transcripts and code are released open-source on GitHub.
    Abstract Today, conversational systems are expected to handle conversations in multi-party settings, especially within Socially Assistive Robots (SARs). However, practical usability remains difficult as there are additional challenges to overcome, such as speaker recognition, addressee recognition, and complex turn-taking. In this paper, we present our work on a multi-party conversational system, which invites two users to play a trivia quiz game. The system detects users' agreement or disagreement on a final answer and responds accordingly. Our evaluation includes both performance and user assessment results, with a focus on detecting user agreement. Our annotated transcripts and the code for the proposed system have been released open-source on GitHub.

Detecting agreement in multi-party dialogue: evaluating speaker diarisation versus a procedural baseline to enhance user engagement

  • paper_url: http://arxiv.org/abs/2311.03021
  • repo_url: https://github.com/ddenley/multi-person-quiz
  • paper_authors: Angus Addlesee, Daniel Denley, Andy Edmondson, Nancie Gunson, Daniel Hernández Garcia, Alexandre Kha, Oliver Lemon, James Ndubuisi, Neil O’Reilly, Lia Perochaud, Raphaël Valeri, Miebaka Worika
  • for: Investigates whether dialogue-state-tracking methods can reliably identify agreement and disagreement in multi-party conversations.
  • methods: Compares a speaker-diarisation model against a frequency-and-proximity-based procedural baseline for detecting agreement during a cooperative quiz hosted by a conversational agent.
  • results: The procedural system was more engaging to players and more accurate at detecting agreement, reaching an average accuracy of 0.44 versus 0.28 for the diarised system.
    Abstract Conversational agents participating in multi-party interactions face significant challenges in dialogue state tracking, since the identity of the speaker adds significant contextual meaning. It is common to utilise diarisation models to identify the speaker. However, it is not clear if these are accurate enough to correctly identify specific conversational events such as agreement or disagreement during a real-time interaction. This study uses a cooperative quiz, where the conversational agent acts as quiz-show host, to determine whether diarisation or a frequency-and-proximity-based method is more accurate at determining agreement, and whether this translates to feelings of engagement from the players. Experimental results show that our procedural system was more engaging to players, and was more accurate at detecting agreement, reaching an average accuracy of 0.44 compared to 0.28 for the diarised system.
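The abstract does not spell out the frequency-and-proximity procedure, so the sketch below is one plausible reading under stated assumptions: an answer counts as agreed when it is mentioned enough times by utterances that occur close together in time.

```python
# A sketch of a frequency-and-proximity-style agreement detector. The
# thresholds and the exact rule are assumptions for illustration only.
from collections import defaultdict

def detect_agreement(utterances, min_mentions=2, max_gap_s=5.0):
    """utterances: list of (timestamp_seconds, normalized_answer_token)."""
    mentions = defaultdict(list)
    for t, answer in utterances:
        mentions[answer].append(t)
    agreed = []
    for answer, times in mentions.items():
        times.sort()
        run = 1  # length of the current run of nearby mentions
        for a, b in zip(times, times[1:]):
            run = run + 1 if b - a <= max_gap_s else 1
            if run >= min_mentions:
                agreed.append(answer)
                break
    return agreed

print(detect_agreement([(1.0, "paris"), (3.5, "paris"), (20.0, "rome")]))
# -> ['paris']
```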

Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions

  • paper_url: http://arxiv.org/abs/2311.02985
  • repo_url: None
  • paper_authors: Julien Guité-Vinet, Alexandre Blondin Massé, Fatiha Sadat
  • for: Studies the reverse dictionary task (retrieving a word from its definition) in the context of a serious game called The Dictionary Game.
  • methods: Compares different transformer-based models for solving the reverse dictionary task and explores their use within the game.
  • results: Reports how the compared transformer-based models perform on the task, analyzing their respective strengths and weaknesses.
    Abstract In the last years, several variants of transformers have emerged. In this paper, we compare different transformer-based models for solving the reverse dictionary task and explore their use in the context of a serious game called The Dictionary Game.

Adapting Pre-trained Generative Models for Extractive Question Answering

  • paper_url: http://arxiv.org/abs/2311.02961
  • repo_url: https://github.com/prabirmallick/GenAI4EQA
  • paper_authors: Prabir Mallick, Tapas Nayak, Indrajit Bhattacharya
  • for: Improving performance on extractive question answering (QA) tasks, where discriminative models struggle with label sparsity, especially for multi-span answers.
  • methods: Uses pre-trained generative models to generate indexes corresponding to the context tokens or sentences that form part of the answer.
  • results: Outperforms existing state-of-the-art models on several extractive QA datasets, including MultiSpanQA, BioASQ, MASHQA, and WikiQA.
    Abstract Pre-trained Generative models such as BART, T5, etc. have gained prominence as a preferred method for text generation in various natural language processing tasks, including abstractive long-form question answering (QA) and summarization. However, the potential of generative models in extractive QA tasks, where discriminative models are commonly employed, remains largely unexplored. Discriminative models often encounter challenges associated with label sparsity, particularly when only a small portion of the context contains the answer. The challenge is more pronounced for multi-span answers. In this work, we introduce a novel approach that uses the power of pre-trained generative models to address extractive QA tasks by generating indexes corresponding to context tokens or sentences that form part of the answer. Through comprehensive evaluations on multiple extractive QA datasets, including MultiSpanQA, BioASQ, MASHQA, and WikiQA, we demonstrate the superior performance of our proposed approach compared to existing state-of-the-art models.

PhoGPT: Generative Pre-training for Vietnamese

  • paper_url: http://arxiv.org/abs/2311.02945
  • repo_url: https://github.com/vinairesearch/phogpt
  • paper_authors: Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Nhung Nguyen, Thien Huu Nguyen, Dinh Phung, Hung Bui
  • for: Introduces PhoGPT, a new open-source series of generative models for Vietnamese.
  • methods: A transformer-based model with 7.5B parameters, released as the base pre-trained monolingual model PhoGPT-7B5 together with an instruction-following variant, PhoGPT-7B5-Instruct.
  • results: A human evaluation experiment demonstrates that the models outperform previous open-source models.
    Abstract We open-source a state-of-the-art 7.5B-parameter generative model series named PhoGPT for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant, PhoGPT-7B5-Instruct. In addition, we also demonstrate its superior performance compared to previous open-source models through a human evaluation experiment. GitHub: https://github.com/VinAIResearch/PhoGPT

SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data

  • paper_url: http://arxiv.org/abs/2311.02883
  • repo_url: None
  • paper_authors: Ruoxi Sun, Sercan Ö. Arik, Rajarishi Sinha, Hootan Nakhost, Hanjun Dai, Pengcheng Yin, Tomas Pfister
  • for: Improving the few-shot prompting capabilities of large language models (LLMs) for text-to-SQL generation.
  • methods: Innovative prompt design, an execution-based consistency decoding strategy that selects the SQL whose execution outcome is most consistent among the proposals, and diversification of the proposals across prompt designs ("MixPrompt") and foundation models ("MixLLMs").
  • results: With minimal labeled data, SQLPrompt outperforms previous in-context learning approaches by a large margin, closing the gap with models fine-tuned on thousands of labeled examples.
    Abstract Text-to-SQL aims to automate the process of generating SQL queries on a database from natural language text. In this work, we propose "SQLPrompt", tailored to improve the few-shot prompting capabilities of Text-to-SQL for Large Language Models (LLMs). Our methods include innovative prompt design, execution-based consistency decoding strategy which selects the SQL with the most consistent execution outcome among other SQL proposals, and a method that aims to improve performance by diversifying the SQL proposals during consistency selection with different prompt designs ("MixPrompt") and foundation models ("MixLLMs"). We show that \emph{SQLPrompt} outperforms previous approaches for in-context learning with few labeled data by a large margin, closing the gap with finetuning state-of-the-art with thousands of labeled data.
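A minimal sketch of the execution-based consistency decoding step: execute every candidate SQL, group candidates by execution outcome, and keep a candidate from the largest group. The toy table and candidate queries are illustrative; real candidates would be sampled LLM generations.

```python
# Execution-based consistency voting over candidate SQL proposals.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("ann", 90), ("bob", 70), ("cat", 90)])

candidates = [
    "SELECT MAX(salary) FROM employees",
    "SELECT salary FROM employees ORDER BY salary DESC LIMIT 1",
    "SELECT MIN(salary) FROM employees",  # an inconsistent proposal
]

def execute(sql: str):
    try:
        return tuple(conn.execute(sql).fetchall())  # hashable result
    except sqlite3.Error:
        return ("__error__",)

results = {sql: execute(sql) for sql in candidates}
most_common_result, _ = Counter(results.values()).most_common(1)[0]
chosen = next(s for s, r in results.items() if r == most_common_result)
print(chosen)  # the two MAX-style queries agree, so one of them is chosen
```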

Less than One-shot: Named Entity Recognition via Extremely Weak Supervision

  • paper_url: http://arxiv.org/abs/2311.02861
  • repo_url: https://github.com/komeijiforce/x-ner
  • paper_authors: Letian Peng, Zihan Wang, Jingbo Shang
  • for: Named entity recognition (NER) under the extremely weak supervision (XWS) setting, where only one context-free example entity per type is available.
  • methods: Proposes X-NER, which mines entity spans similar to the example entities from an unlabeled corpus by comparing context distributions before and after a span is replaced by the example entity, then uses the top-ranked spans as pseudo-labels to train an NER tagger.
  • results: Extensive experiments and analyses on 4 NER datasets show that X-NER outperforms state-of-the-art one-shot learning methods and inherits the cross-lingual abilities of the underlying language models.
    Abstract We study the named entity recognition (NER) problem under the extremely weak supervision (XWS) setting, where only one example entity per type is given in a context-free way. While one can see that XWS is lighter than one-shot in terms of the amount of supervision, we propose a novel method X-NER that can outperform the state-of-the-art one-shot NER methods. We first mine entity spans that are similar to the example entities from an unlabelled training corpus. Instead of utilizing entity span representations from language models, we find it more effective to compare the context distributions before and after the span is replaced by the entity example. We then leverage the top-ranked spans as pseudo-labels to train an NER tagger. Extensive experiments and analyses on 4 NER datasets show the superior end-to-end NER performance of X-NER, outperforming the state-of-the-art few-shot methods with 1-shot supervision and ChatGPT annotations significantly. Finally, our X-NER possesses several notable properties, such as inheriting the cross-lingual abilities of the underlying language models.
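A toy illustration of the span-scoring idea, simplified from the paper's replace-and-compare procedure: here a candidate is scored by how similar its bag-of-words context distribution is to that of the example entity, whereas the paper compares language-model context distributions before and after the replacement.

```python
# Score candidate spans by context-distribution similarity to the one
# labeled example entity. Bag-of-words counts from a tiny corpus stand in
# for the paper's language-model distributions, purely for illustration.
import math
from collections import Counter

corpus = [
    "flights to london were delayed",
    "flights to paris were delayed",
    "the cat sat on the mat",
]

def context_dist(sentences, target):
    """Word distribution in sentences containing target, target excluded."""
    counts = Counter(w for s in sentences if target in s.split()
                     for w in s.split() if w != target)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def divergence(p, q, eps=1e-9):
    """Smoothed KL divergence between two sparse word distributions."""
    keys = set(p) | set(q)
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in keys)

example_entity = "london"                    # the one labeled example (LOC)
for candidate in ("paris", "cat"):
    d = divergence(context_dist(corpus, candidate),
                   context_dist(corpus, example_entity))
    print(candidate, round(d, 3))            # "paris" scores far lower
```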

Improving Machine Translation with Large Language Models: A Preliminary Study with Cooperative Decoding

  • paper_url: http://arxiv.org/abs/2311.02851
  • repo_url: https://github.com/lemon0830/CoDec
  • paper_authors: Jiali Zeng, Fandong Meng, Yongjing Yin, Jie Zhou
  • for: Analyzes the respective strengths and weaknesses of commercial NMT systems and MT-oriented LLMs, and builds a hybrid method that lets them complement each other.
  • methods: A comprehensive comparative study of NMT systems and MT-oriented LLMs, leading to Cooperative Decoding (CoDec), which treats the NMT system as a pretranslation model and the MT-oriented LLM as a supplemental solution for complex scenarios beyond the NMT system's capability.
  • results: Results on the WMT22 test sets and a newly collected WebCrawl test set demonstrate the effectiveness and efficiency of CoDec.
    Abstract Contemporary translation engines built upon the encoder-decoder framework have reached a high level of development, while the emergence of Large Language Models (LLMs) has disrupted their position by offering the potential for achieving superior translation quality. Therefore, it is crucial to understand in which scenarios LLMs outperform traditional NMT systems and how to leverage their strengths. In this paper, we first conduct a comprehensive analysis to assess the strengths and limitations of various commercial NMT systems and MT-oriented LLMs. Our findings indicate that neither NMT nor MT-oriented LLMs alone can effectively address all the translation issues, but MT-oriented LLMs can serve as a promising complement to the NMT systems. Building upon these insights, we explore hybrid methods and propose Cooperative Decoding (CoDec), which treats NMT systems as a pretranslation model and MT-oriented LLMs as a supplemental solution to handle complex scenarios beyond the capability of NMT alone. The results on the WMT22 test sets and a newly collected test set WebCrawl demonstrate the effectiveness and efficiency of CoDec, highlighting its potential as a robust solution for combining NMT systems with MT-oriented LLMs in machine translation.

Tailoring Self-Rationalizers with Multi-Reward Distillation

  • paper_url: http://arxiv.org/abs/2311.02805
  • repo_url: https://github.com/ink-usc/rationalemultirewarddistillation
  • paper_authors: Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren
  • for: Enabling small language models (LMs) to generate free-text rationales that both improve downstream question-answering performance and are plausible, consistent, and diverse.
  • methods: Proposes MaRio (Multi-rewArd RatIOnalization), a multi-reward conditioned self-rationalization algorithm that optimizes multiple distinct properties such as plausibility, diversity, and consistency.
  • results: On five difficult question-answering datasets (StrategyQA, QuaRel, OpenBookQA, NumerSense, and QASC), MaRio improves both task accuracy and self-rationalization quality over a supervised fine-tuning (SFT) baseline, and human evaluators prefer MaRio rationales over SFT rationales.
    Abstract Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In this work, we enable small-scale LMs (approx. 200x smaller than GPT-3) to generate rationales that not only improve downstream task performance, but are also more plausible, consistent, and diverse, assessed both by automatic and human evaluation. Our method, MaRio (Multi-rewArd RatIOnalization), is a multi-reward conditioned self-rationalization algorithm that optimizes multiple distinct properties like plausibility, diversity and consistency. Results on five difficult question-answering datasets StrategyQA, QuaRel, OpenBookQA, NumerSense and QASC show that not only does MaRio improve task accuracy, but it also improves the self-rationalization quality of small LMs across the aforementioned axes better than a supervised fine-tuning (SFT) baseline. Extensive human evaluations confirm that MaRio rationales are preferred vs. SFT rationales, as well as qualitative improvements in plausibility and consistency.

cs.LG - 2023-11-06

CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers

  • paper_url: http://arxiv.org/abs/2311.03615
  • repo_url: None
  • paper_authors: Jieming Bian, Shaolei Ren, Jie Xu
  • for: Examines the challenges of training AI models across geographically distributed (geo-distributed) data centers, balancing learning performance against carbon footprint.
  • methods: Proposes CAFE (Carbon-Aware Federated Learning), a framework that optimizes training within a fixed carbon-footprint budget, combining coreset selection to assess learning performance, the Lyapunov drift-plus-penalty framework to handle the unpredictability of future carbon intensity, and an efficient algorithm for the combinatorial data-center selection problem.
  • results: Extensive simulations using real-world carbon-intensity data demonstrate the efficacy of the algorithm and its superiority over existing methods in optimizing learning performance while minimizing environmental impact.
    Abstract Training large-scale artificial intelligence (AI) models demands significant computational power and energy, leading to increased carbon footprint with potential environmental repercussions. This paper delves into the challenges of training AI models across geographically distributed (geo-distributed) data centers, emphasizing the balance between learning performance and carbon footprint. We consider Federated Learning (FL) as a solution, which prioritizes model parameter exchange over raw data, ensuring data privacy and compliance with local regulations. Given the variability in carbon intensity across regions, we propose a new framework called CAFE (short for Carbon-Aware Federated Learning) to optimize training within a fixed carbon footprint budget. Our approach incorporates coreset selection to assess learning performance, employs the Lyapunov drift-plus-penalty framework to address the unpredictability of future carbon intensity, and devises an efficient algorithm to address the combinatorial complexity of the data center selection. Through extensive simulations using real-world carbon intensity data, we demonstrate the efficacy of our algorithm, highlighting its superiority over existing methods in optimizing learning performance while minimizing environmental impact.
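As a simplified illustration of budgeted, carbon-aware data-center selection (CAFE itself combines coreset selection with a Lyapunov drift-plus-penalty scheme), a greedy rule under an explicit carbon budget might look as follows; all numbers are invented.

```python
# Greedy carbon-aware selection: each round, pick the lowest-intensity
# data centers that still fit the remaining carbon budget. This is only a
# sketch of the budgeted-selection idea, not the CAFE algorithm itself.

def select_centers(carbon_intensity, energy_per_round, budget_left, k):
    """carbon_intensity: {center: gCO2/kWh}; returns up to k centers."""
    chosen = []
    for center in sorted(carbon_intensity, key=carbon_intensity.get):
        cost = carbon_intensity[center] * energy_per_round
        if len(chosen) < k and cost <= budget_left:
            chosen.append(center)
            budget_left -= cost
    return chosen, budget_left

intensity = {"us-west": 250.0, "eu-north": 40.0, "asia-east": 500.0}
chosen, remaining = select_centers(intensity, energy_per_round=10.0,
                                   budget_left=4000.0, k=2)
print(chosen, remaining)  # -> ['eu-north', 'us-west'] 1100.0
```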

Plug-and-Play Stability for Intracortical Brain-Computer Interfaces: A One-Year Demonstration of Seamless Brain-to-Text Communication

  • paper_url: http://arxiv.org/abs/2311.03611
  • repo_url: https://github.com/cffan/corp
  • paper_authors: Chaofei Fan, Nick Hahn, Foram Kamdar, Donald Avansino, Guy H. Wilson, Leigh Hochberg, Krishna V. Shenoy, Jaimie M. Henderson, Francis R. Willett
  • for: Addresses the long-term stability problem of intracortical brain-computer interfaces (iBCIs), so that high performance can be maintained without interrupting the user for supervised recalibration.
  • methods: Proposes CORP (Continual Online Recalibration with Pseudo-labels), which uses a large language model (LM) to automatically correct errors in iBCI outputs and uses the corrected outputs as pseudo-labels to continually update the iBCI decoder online.
  • results: Over a 403-day evaluation with one clinical trial participant, CORP achieved a stable decoding accuracy of 93.84% on an online handwriting iBCI task, significantly outperforming baseline methods; this is the longest-running iBCI stability demonstration involving a human participant.
    Abstract Intracortical brain-computer interfaces (iBCIs) have shown promise for restoring rapid communication to people with neurological disorders such as amyotrophic lateral sclerosis (ALS). However, to maintain high performance over time, iBCIs typically need frequent recalibration to combat changes in the neural recordings that accrue over days. This requires iBCI users to stop using the iBCI and engage in supervised data collection, making the iBCI system hard to use. In this paper, we propose a method that enables self-recalibration of communication iBCIs without interrupting the user. Our method leverages large language models (LMs) to automatically correct errors in iBCI outputs. The self-recalibration process uses these corrected outputs ("pseudo-labels") to continually update the iBCI decoder online. Over a period of more than one year (403 days), we evaluated our Continual Online Recalibration with Pseudo-labels (CORP) framework with one clinical trial participant. CORP achieved a stable decoding accuracy of 93.84% in an online handwriting iBCI task, significantly outperforming other baseline methods. Notably, this is the longest-running iBCI stability demonstration involving a human participant. Our results provide the first evidence for long-term stabilization of a plug-and-play, high-performance communication iBCI, addressing a major barrier for the clinical translation of iBCIs.

Testing RadiX-Nets: Advances in Viable Sparse Topologies

  • paper_url: http://arxiv.org/abs/2311.03609
  • repo_url: None
  • paper_authors: Kevin Kwak, Zack West, Hayden Jananthan, Jeremy Kepner
  • for: Explores the performance and scalability of RadiX-Nets, a subgroup of sparse deep neural networks, for use in large-scale data processing.
  • methods: Presents a testing suite for RadiX-Nets in TensorFlow and studies how network topology, initialization, and training methods affect their behavior.
  • results: The tests reveal relationships between topology, initialization, and training behavior, and uncover "strange models" that train inconsistently and to lower accuracy while models of similar sparsity train well.
    Abstract The exponential growth of data has sparked computational demands on ML research and industry use. Sparsification of hyper-parametrized deep neural networks (DNNs) creates simpler representations of complex data. Past research has shown that some sparse networks achieve similar performance as dense ones, reducing runtime and storage. RadiX-Nets, a subgroup of sparse DNNs, maintain uniformity which counteracts their lack of neural connections. Generation, independent of a dense network, yields faster asymptotic training and removes the need for costly pruning. However, little work has been done on RadiX-Nets, making testing challenging. This paper presents a testing suite for RadiX-Nets in TensorFlow. We test RadiX-Net performance to streamline processing in scalable models, revealing relationships between network topology, initialization, and training behavior. We also encounter "strange models" that train inconsistently and to lower accuracy while models of similar sparsity train well.

Generative Diffusion Models for Lattice Field Theory

  • paper_url: http://arxiv.org/abs/2311.03578
  • repo_url: None
  • paper_authors: Lingxiao Wang, Gert Aarts, Kai Zhou
  • for: Explores the connection between machine learning and lattice field theory by linking generative diffusion models (DMs) with stochastic quantization, from a stochastic differential equation (SDE) perspective.
  • methods: Shows that DMs can be conceptualized by reversing a stochastic process driven by the Langevin equation, producing samples from an initial distribution that approximate the target distribution; a toy model highlights the ability of DMs to learn effective actions.
  • results: Demonstrates that DMs can act as global samplers for generating configurations in the two-dimensional $\phi^4$ quantum lattice field theory.
    Abstract This study delves into the connection between machine learning and lattice field theory by linking generative diffusion models (DMs) with stochastic quantization, from a stochastic differential equation perspective. We show that DMs can be conceptualized by reversing a stochastic process driven by the Langevin equation, which then produces samples from an initial distribution to approximate the target distribution. In a toy model, we highlight the capability of DMs to learn effective actions. Furthermore, we demonstrate its feasibility to act as a global sampler for generating configurations in the two-dimensional $\phi^4$ quantum lattice field theory.
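For context, the sketch below runs the forward process that stochastic quantization is built on: unadjusted Langevin dynamics for a 2D $\phi^4$ lattice action, whose stationary distribution approximates $e^{-S}$. The diffusion model in the paper learns to reverse such a process; lattice size, couplings, and step size here are arbitrary illustrative choices.

```python
# Stochastic quantization for the 2D phi^4 theory: the Langevin update
# phi <- phi - eps * dS/dphi + sqrt(2*eps) * eta samples from exp(-S).
import numpy as np

rng = np.random.default_rng(0)
L, m2, lam, eps = 16, 0.5, 1.0, 0.01
phi = rng.normal(size=(L, L))

def grad_action(phi):
    """dS/dphi for S = sum 1/2 (grad phi)^2 + 1/2 m2 phi^2 + lam/4 phi^4."""
    laplacian = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0)
                 + np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4 * phi)
    return -laplacian + m2 * phi + lam * phi**3

for _ in range(5000):  # Langevin evolution toward exp(-S)
    phi += -eps * grad_action(phi) + np.sqrt(2 * eps) * rng.normal(size=phi.shape)

print("mean |phi|:", np.abs(phi).mean())  # a crude observable
```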

A Graph-Theoretic Framework for Understanding Open-World Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.03524
  • repo_url: https://github.com/deeplearning-wisc/sorl
  • paper_authors: Yiyou Sun, Zhenmei Shi, Yixuan Li
  • for: Open-world semi-supervised learning, which aims to infer both known and novel classes in unlabeled data by leveraging prior knowledge from a labeled set with known classes.
  • methods: Formalizes the open-world clustering problem in a graph-theoretic framework in which clustering can be theoretically characterized by graph factorization, and applies the Spectral Open-world Representation Learning (SORL) algorithm, showing that minimizing its loss is equivalent to performing spectral decomposition on the graph.
  • results: The framework yields a provable error bound on clustering performance for both known and novel classes and a rigorous analysis of when labeled data helps; empirically, SORL matches or outperforms several strong baselines on common benchmark datasets.
    Abstract Open-world semi-supervised learning aims at inferring both known and novel classes in unlabeled data, by harnessing prior knowledge from a labeled set with known classes. Despite its importance, there is a lack of theoretical foundations for this problem. This paper bridges the gap by formalizing a graph-theoretic framework tailored for the open-world setting, where the clustering can be theoretically characterized by graph factorization. Our graph-theoretic framework illuminates practical algorithms and provides guarantees. In particular, based on our graph formulation, we apply the algorithm called Spectral Open-world Representation Learning (SORL), and show that minimizing our loss is equivalent to performing spectral decomposition on the graph. Such equivalence allows us to derive a provable error bound on the clustering performance for both known and novel classes, and analyze rigorously when labeled data helps. Empirically, SORL can match or outperform several strong baselines on common benchmark datasets, which is appealing for practical usage while enjoying theoretical guarantees.

The Fairness Stitch: Unveiling the Potential of Model Stitching in Neural Network De-Biasing

  • paper_url: http://arxiv.org/abs/2311.03532
  • repo_url: https://github.com/modar7/the_fairness_stitch
  • paper_authors: Modar Sulaiman, Kallol Roy
  • for: Enhancing fairness in deep learning models.
  • methods: "The Fairness Stitch (TFS)" combines model stitching and joint training while incorporating fairness constraints.
  • results: A comprehensive evaluation on two well-known datasets, CelebA and UTKFace, with systematic comparison against an existing baseline, shows a notable improvement in the trade-off between fairness and performance, highlighting the method's potential to address bias-related challenges and foster equitable outcomes in machine learning models.
    Abstract The pursuit of fairness in machine learning models has emerged as a critical research challenge in different applications ranging from bank loan approval to face detection. Despite the widespread adoption of artificial intelligence algorithms across various domains, concerns persist regarding the presence of biases and discrimination within these models. To address this pressing issue, this study introduces a novel method called "The Fairness Stitch (TFS)" to enhance fairness in deep learning models. This method combines model stitching and training jointly, while incorporating fairness constraints. In this research, we assess the effectiveness of our proposed method by conducting a comprehensive evaluation of two well-known datasets, CelebA and UTKFace. We systematically compare the performance of our approach with the existing baseline method. Our findings reveal a notable improvement in achieving a balanced trade-off between fairness and performance, highlighting the promising potential of our method to address bias-related challenges and foster equitable outcomes in machine learning models. This paper poses a challenge to the conventional wisdom of the effectiveness of the last layer in deep learning models for de-biasing.

Asynchronous Local Computations in Distributed Bayesian Learning

  • paper_url: http://arxiv.org/abs/2311.03496
  • repo_url: None
  • paper_authors: Kinjal Bhar, He Bai, Jemin George, Carl Busart
  • for: Proposes a distributed Bayesian learning algorithm with gossip-based asynchronous communication, leveraging fast local computations while reducing communication overhead.
  • methods: Active agents perform multiple local Bayesian sampling steps via unadjusted Langevin algorithm (ULA) MCMC between successive inter-agent communications over a connected graph.
  • results: On a toy problem and on real-world datasets, the algorithm shows faster initial convergence and improved accuracy, especially in the low-data regime, reaching on average 78% and over 90% classification accuracy on the Gamma Telescope and mHealth datasets from the UCI ML repository.
    Abstract Due to the expanding scope of machine learning (ML) to the fields of sensor networking, cooperative robotics and many other multi-agent systems, distributed deployment of inference algorithms has received a lot of attention. These algorithms involve collaboratively learning unknown parameters from dispersed data collected by multiple agents. There are two competing aspects in such algorithms, namely, intra-agent computation and inter-agent communication. Traditionally, algorithms are designed to perform both synchronously. However, certain circumstances need frugal use of communication channels as they are either unreliable, time-consuming, or resource-expensive. In this paper, we propose gossip-based asynchronous communication to leverage fast computations and reduce communication overhead simultaneously. We analyze the effects of multiple (local) intra-agent computations by the active agents between successive inter-agent communications. For local computations, Bayesian sampling via unadjusted Langevin algorithm (ULA) MCMC is utilized. The communication is assumed to be over a connected graph (e.g., as in decentralized learning), however, the results can be extended to coordinated communication where there is a central server (e.g., federated learning). We theoretically quantify the convergence rates in the process. To demonstrate the efficacy of the proposed algorithm, we present simulations on a toy problem as well as on real world data sets to train ML models to perform classification tasks. We observe faster initial convergence and improved performance accuracy, especially in the low data range. We achieve on average 78% and over 90% classification accuracy respectively on the Gamma Telescope and mHealth data sets from the UCI ML repository.

Leveraging High-Level Synthesis and Large Language Models to Generate, Simulate, and Deploy a Uniform Random Number Generator Hardware Design

  • paper_url: http://arxiv.org/abs/2311.03489
  • repo_url: None
  • paper_authors: James T. Meech
  • for: Generating hardware designs with large language model tools while otherwise relying on exclusively open-source tooling.
  • methods: Presents a new high-level synthesis methodology that uses only open-source tools, apart from the large language model, to generate hardware designs.
  • results: As a case study, generates a permuted congruential random number generator design with a Wishbone interface and verifies its functionality and quality using LLM-generated simulations and the Dieharder randomness test suite.
    Abstract We present a new high-level synthesis methodology for using large language model tools to generate hardware designs. The methodology uses exclusively open-source tools excluding the large language model. As a case study, we use our methodology to generate a permuted congruential random number generator design with a wishbone interface. We verify the functionality and quality of the random number generator design using large language model-generated simulations and the Dieharder randomness test suite. We document all the large language model chat logs, Python scripts, Verilog scripts, and simulation results used in the case study. We believe that our method of hardware design generation coupled with the open source silicon 130 nm design tools will revolutionize application-specific integrated circuit design. Our methodology significantly lowers the bar to entry when building domain-specific computing accelerators for the Internet of Things and proof of concept prototypes for later fabrication in more modern process nodes.
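Since the case study targets a permuted congruential generator, a Python software reference model of PCG32 (the XSH-RR variant, using the published PCG reference constants) can serve as a golden model when checking hardware simulations; the Wishbone bus interface itself is not modeled here.

```python
# Software reference model of the PCG32 permuted congruential generator
# (O'Neill's XSH-RR variant). Constants follow the PCG reference code.
MASK64 = (1 << 64) - 1
MULT = 6364136223846793005

class PCG32:
    def __init__(self, seed: int, seq: int):
        self.inc = ((seq << 1) | 1) & MASK64   # stream selector, must be odd
        self.state = 0
        self.next_u32()
        self.state = (self.state + seed) & MASK64
        self.next_u32()

    def next_u32(self) -> int:
        old = self.state
        self.state = (old * MULT + self.inc) & MASK64
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        # 32-bit right rotation of the xorshifted value by rot.
        return ((xorshifted >> rot) | (xorshifted << (32 - rot))) & 0xFFFFFFFF

rng = PCG32(seed=42, seq=54)
print([hex(rng.next_u32()) for _ in range(4)])
```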

Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

  • paper_url: http://arxiv.org/abs/2311.03351
  • repo_url: None
  • paper_authors: Kun Lei, Zhengmao He, Chenhao Lu, Kaizhe Hu, Yang Gao, Huazhe Xu
  • for: Unifying offline and online reinforcement learning (RL) for efficient and safe learning, without introducing extra conservatism or regularization.
  • methods: Proposes Uni-O4, which uses an on-policy objective in both the offline and online phases so that the agent can transfer between them seamlessly; in the offline phase, it leverages diverse ensemble policies to address mismatch between the estimated behavior policy and the offline dataset, and uses a simple offline policy evaluation (OPE) approach for safe multi-step policy improvement.
  • results: The combination yields superior offline initialization and stable, rapid online fine-tuning; real-world robot tasks demonstrate rapid deployment in challenging, previously unseen environments, and comprehensive evaluations on numerous simulated benchmarks show state-of-the-art performance in both offline and offline-to-online learning.
    Abstract Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which utilizes an on-policy objective for both offline and online learning. Owning to the alignment of objectives in two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing for arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase, specifically, Uni-o4 leverages diverse ensemble policies to address the mismatch issues between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that by employing the method above, the fusion of these two paradigms can yield superior offline initialization as well as stable and rapid online fine-tuning capabilities. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations using numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline and offline-to-online fine-tuning learning. Our website: https://lei-kun.github.io/uni-o4/ .

Learning Hard-Constrained Models with One Sample

  • paper_url: http://arxiv.org/abs/2311.03332
  • repo_url: None
  • paper_authors: Andreas Galanis, Alkis Kalavasis, Anthimos Vardis Kandiros
  • for: Studies estimating the parameters of a Markov random field with hard constraints from a single sample, with the $k$-SAT, proper coloring, and general $H$-coloring models as the main running examples.
  • methods: Uses the pseudo-likelihood estimator, with variance bounds obtained via coupling techniques inspired, in the case of $k$-SAT, by Moitra's sampling algorithm (JACM, 2019).
  • results: Obtains both positive and negative results: a linear-time estimator for $q$-colorings on graphs of maximum degree $d$ when $q > d+1$ (non-identifiable when $q \leq d+1$), a general condition guaranteeing linear-time learning for $H$-colorings, and for $k$-SAT a linear-time estimator when $k \gtrsim 6.45 \log d$ versus non-identifiability when $k \lesssim \log d$.
    Abstract We consider the problem of estimating the parameters of a Markov Random Field with hard-constraints using a single sample. As our main running examples, we use the $k$-SAT and the proper coloring models, as well as general $H$-coloring models; for all of these we obtain both positive and negative results. In contrast to the soft-constrained case, we show in particular that single-sample estimation is not always possible, and that the existence of an estimator is related to the existence of non-satisfiable instances. Our algorithms are based on the pseudo-likelihood estimator. We show variance bounds for this estimator using coupling techniques inspired, in the case of $k$-SAT, by Moitra's sampling algorithm (JACM, 2019); our positive results for colorings build on this new coupling approach. For $q$-colorings on graphs with maximum degree $d$, we give a linear-time estimator when $q>d+1$, whereas the problem is non-identifiable when $q\leq d+1$. For general $H$-colorings, we show that standard conditions that guarantee sampling, such as Dobrushin's condition, are insufficient for one-sample learning; on the positive side, we provide a general condition that is sufficient to guarantee linear-time learning and obtain applications for proper colorings and permissive models. For the $k$-SAT model on formulas with maximum degree $d$, we provide a linear-time estimator when $k\gtrsim 6.45\log d$, whereas the problem becomes non-identifiable when $k\lesssim \log d$.
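To illustrate the pseudo-likelihood estimator itself (shown here on the soft-constrained Ising model on a cycle for simplicity, not on the paper's hard-constrained models), one can maximize the product of each variable's conditional likelihood given its neighbors, all from a single sample:

```python
# One-sample pseudo-likelihood estimation of the Ising inverse temperature
# beta on a cycle, via a 1D grid search. Model choice and grid are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, beta_true = 2000, 0.4

# Draw one (approximate) sample from the cycle Ising model via Gibbs sweeps.
spins = rng.choice([-1, 1], size=n)
for _ in range(50):
    for i in range(n):
        field = spins[(i - 1) % n] + spins[(i + 1) % n]
        p_up = 1.0 / (1.0 + np.exp(-2.0 * beta_true * field))
        spins[i] = 1 if rng.random() < p_up else -1

def neg_pseudo_loglik(beta):
    field = np.roll(spins, 1) + np.roll(spins, -1)   # neighbor sums
    # -log P(sigma_i | neighbors) = log(1 + exp(-2 beta sigma_i F_i))
    return np.sum(np.log1p(np.exp(-2.0 * beta * spins * field)))

betas = np.linspace(0.0, 1.0, 201)
beta_hat = betas[np.argmin([neg_pseudo_loglik(b) for b in betas])]
print("estimated beta:", beta_hat)  # close to 0.4 for large n
```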

Practical considerations for variable screening in the Super Learner

  • paper_url: http://arxiv.org/abs/2311.03313
  • repo_url: https://github.com/bdwilliamson/sl_screening_supplementary
  • paper_authors: Brian D. Williamson, Drew King, Ying Huang
  • for: Examines the use of variable screening algorithms for dimension reduction within the Super Learner ensemble.
  • methods: Uses the Super Learner ensemble with variable screening algorithms, including the lasso, to reduce dimension before fitting the prediction algorithms, and studies cases where the lasso is known to perform poorly.
  • results: Empirical results suggest that a diverse set of candidate screening algorithms should be used to protect against poor performance of any single screen, mirroring the guidance for choosing the library of prediction algorithms.
    Abstract Estimating a prediction function is a fundamental component of many data analyses. The Super Learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms, including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a Super Learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screening algorithms should be used to protect against poor performance of any one screen, similar to the guidance for choosing a library of prediction algorithms for the Super Learner.
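A minimal sketch of the screen-then-fit pattern using the lasso, written with scikit-learn rather than the usual R SuperLearner implementation; the data are synthetic. Following the paper's advice, a real library would include several such screens, not just the lasso.

```python
# Lasso-based variable screening before a downstream learner.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)  # 2 true signals

pipe = make_pipeline(
    SelectFromModel(LassoCV(cv=5)),          # screen variables via lasso
    RandomForestRegressor(n_estimators=100, random_state=0),
)
score = cross_val_score(pipe, X, y, cv=5).mean()
print("CV R^2 with lasso screening:", round(score, 3))
```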

TS-Diffusion: Generating Highly Complex Time Series with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.03303
  • repo_url: None
  • paper_authors: Yangming Li
  • for: Handling complex time series with three common bad properties: sampling irregularities, missing values, and large feature-temporal dimensions.
  • methods: Proposes TS-Diffusion, a general model built on a point-process framework with three parts: a neural-ODE encoder (with a jump technique for sampling irregularities and self-attention for missing values) that converts time series into dense representations, a diffusion model that learns from those representations, and another ODE decoder that generates time series with irregularities and missing values from the representations.
  • results: Extensive experiments on multiple time-series datasets show that TS-Diffusion achieves excellent results on both conventional and complex time series and significantly outperforms previous baselines.
    Abstract While current generative models have achieved promising performances in time-series synthesis, they either make strong assumptions on the data format (e.g., regularities) or rely on pre-processing approaches (e.g., interpolations) to simplify the raw data. In this work, we consider a class of time series with three common bad properties, including sampling irregularities, missingness, and large feature-temporal dimensions, and introduce a general model, TS-Diffusion, to process such complex time series. Our model consists of three parts under the framework of point process. The first part is an encoder of the neural ordinary differential equation (ODE) that converts time series into dense representations, with the jump technique to capture sampling irregularities and self-attention mechanism to handle missing values; The second component of TS-Diffusion is a diffusion model that learns from the representation of time series. These time-series representations can have a complex distribution because of their high dimensions; The third part is a decoder of another ODE that generates time series with irregularities and missing values given their representations. We have conducted extensive experiments on multiple time-series datasets, demonstrating that TS-Diffusion achieves excellent results on both conventional and complex time series and significantly outperforms previous baselines.

Risk of Transfer Learning and its Applications in Finance

  • paper_url: http://arxiv.org/abs/2311.03283
  • repo_url: None
  • paper_authors: Haoyang Cao, Haotian Gu, Xin Guo, Mathieu Rosenbaum
  • for: Proposes a novel notion of transfer risk and analyzes its properties to evaluate the transferability of transfer learning.
  • methods: Applies transfer learning techniques together with the transfer-risk concept to stock return prediction and portfolio optimization problems.
  • results: Numerical results show a strong correlation between transfer risk and overall transfer learning performance; transfer risk provides a computationally efficient way to identify appropriate source tasks, including cross-continent, cross-sector, and cross-frequency transfer for portfolio optimization.
    Abstract Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. In this paper, we propose a novel concept of transfer risk and analyze its properties to evaluate transferability of transfer learning. We apply transfer learning techniques and this concept of transfer risk to stock return prediction and portfolio optimization problems. Numerical results demonstrate a strong correlation between transfer risk and overall transfer learning performance, where transfer risk provides a computationally efficient way to identify appropriate source tasks in transfer learning, including cross-continent, cross-sector, and cross-frequency transfer for portfolio optimization.

Discretizing Numerical Attributes: An Analysis of Human Perceptions

  • paper_url: http://arxiv.org/abs/2311.03278
  • repo_url: None
  • paper_authors: Minakshi Kaushik, Rahul Sharma, Dirk Draheim
  • for: Aims to establish a benchmark approach for partitioning (discretizing) numerical attributes.
  • methods: Conducts an extensive analysis of human perceptions of partitioning a numerical attribute, compares these perceptions with the results of two proposed measures, and gathers the views of experts in data science, statistics, and engineering using numerical data visualization techniques.
  • results: 68.7% of human responses closely align with the values generated by the proposed measures, suggesting the measures may serve as a method for discretizing numerical attributes.
    Abstract Machine learning (ML) has employed various discretization methods to partition numerical attributes into intervals. However, an effective discretization technique remains elusive in many ML applications, such as association rule mining. Moreover, the existing discretization techniques do not reflect best the impact of the independent numerical factor on the dependent numerical target factor. This research aims to establish a benchmark approach for numerical attribute partitioning. We conduct an extensive analysis of human perceptions of partitioning a numerical attribute and compare these perceptions with the results obtained from our two proposed measures. We also examine the perceptions of experts in data science, statistics, and engineering by employing numerical data visualization techniques. The analysis of collected responses reveals that $68.7\%$ of human responses approximately closely align with the values generated by our proposed measures. Based on these findings, our proposed measures may be used as one of the methods for discretizing the numerical attributes.
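The abstract does not define the two proposed measures, so for background the sketch below shows the two standard baselines any discretization study works against: equal-width and equal-frequency (quantile) partitioning of a numerical attribute.

```python
# Equal-width vs. equal-frequency discretization of a numerical attribute.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=1000)   # a skewed numerical attribute
k = 4                                        # number of intervals

equal_width_edges = np.linspace(x.min(), x.max(), k + 1)
equal_freq_edges = np.quantile(x, np.linspace(0.0, 1.0, k + 1))

print("equal width:    ", np.round(equal_width_edges, 2))
print("equal frequency:", np.round(equal_freq_edges, 2))
# np.digitize(x, edges[1:-1]) then assigns each value to its interval.
```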

Exploiting Latent Attribute Interaction with Transformer on Heterogeneous Information Networks

  • paper_url: http://arxiv.org/abs/2311.03275
  • repo_url: None
  • paper_authors: Zeyuan Zhao, Qingqing Ge, Anfeng Cheng, Yiding Liu, Xiang Li, Shuaiqiang Wang
  • for: Proposes MULAN, a novel heterogeneous graph model for graphs whose nodes have diverse types and attributes.
  • methods: MULAN has two major components: a type-aware encoder, which compensates for the loss of node type information and better leverages graph heterogeneity, and a dimension-aware encoder, built on the transformer architecture, which captures the latent interactions among diverse node features.
  • results: Extensive experiments on six heterogeneous benchmark datasets demonstrate the superiority of MULAN over state-of-the-art competitors, as well as its efficiency.
    Abstract Heterogeneous graph neural networks (HGNNs) have recently shown impressive capability in modeling heterogeneous graphs that are ubiquitous in real-world applications. Due to the diversity of attributes of nodes in different types, most existing models first align nodes by mapping them into the same low-dimensional space. However, in this way, they lose the type information of nodes. In addition, most of them only consider the interactions between nodes while neglecting the high-order information behind the latent interactions among different node features. To address these problems, in this paper, we propose a novel heterogeneous graph model MULAN, including two major components, i.e., a type-aware encoder and a dimension-aware encoder. Specifically, the type-aware encoder compensates for the loss of node type information and better leverages graph heterogeneity in learning node representations. Built upon transformer architecture, the dimension-aware encoder is capable of capturing the latent interactions among the diverse node features. With these components, the information of graph heterogeneity, node features and graph structure can be comprehensively encoded in node representations. We conduct extensive experiments on six heterogeneous benchmark datasets, which demonstrates the superiority of MULAN over other state-of-the-art competitors and also shows that MULAN is efficient.

Parameter-Agnostic Optimization under Relaxed Smoothness

  • paper_url: http://arxiv.org/abs/2311.03252
  • repo_url: None
  • paper_authors: Florian Hübler, Junchi Yang, Xiang Li, Niao He
  • for: Achieving parameter-agnostic optimization, i.e. training without tuning stepsizes or knowing any problem-specific parameters, under the relaxed $(L_0, L_1)$-smoothness assumption.
  • methods: Analyzes Normalized Stochastic Gradient Descent with Momentum (NSGD-M) and introduces a theoretical framework of lower bounds tailored explicitly to parameter-agnostic algorithms.
  • results: NSGD-M attains a (nearly) rate-optimal complexity under $(L_0, L_1)$-smoothness without prior knowledge of any problem parameter, at the cost of an exponential term in $L_1$ shown to be unavoidable for such schemes; in deterministic settings, the exponential factor can be neutralized by Gradient Descent with a Backtracking Line Search. Experiments confirm the theory.
    Abstract Tuning hyperparameters, such as the stepsize, presents a major challenge of training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is inevitable to such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.
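A minimal sketch of the NSGD-M update on a toy objective: a momentum average of stochastic gradients, with the step taken along the normalized momentum direction so that the stepsize needs no smoothness constants. The stepsize and momentum values here are arbitrary illustrative choices.

```python
# Normalized SGD with momentum on a noisy quadratic.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
m = np.zeros_like(x)
gamma, beta = 0.05, 0.9            # stepsize and momentum parameter

def stoch_grad(x):
    return 2 * x + 0.1 * rng.normal(size=x.shape)  # noisy grad of ||x||^2

for t in range(500):
    g = stoch_grad(x)
    m = beta * m + (1 - beta) * g
    x = x - gamma * m / (np.linalg.norm(m) + 1e-12)  # normalized step

print("final iterate:", np.round(x, 3))  # near the minimizer at the origin
```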

Approximating Langevin Monte Carlo with ResNet-like Neural Network architectures

  • paper_url: http://arxiv.org/abs/2311.03242
  • repo_url: None
  • paper_authors: Martin Eigel, Charles Miranda, Janina Schütte, David Sommer
  • for: Constructing a neural-network-based sampler that maps samples from a simple reference distribution (e.g. the standard normal) to samples from a target distribution.
  • methods: Proposes a neural network architecture inspired by the Langevin Monte Carlo (LMC) algorithm and, using LMC perturbation results, derives approximation rates of the architecture for smooth, log-concave targets in the Wasserstein-$2$ distance; the analysis relies on the sub-Gaussianity of the intermediate measures of the perturbed LMC process.
  • results: Derives bounds on the growth of the intermediate variance proxies under different assumptions on the perturbations, and establishes expressivity results for a deep-residual-style architecture approximating the sample-to-target map.
    Abstract We sample from a given target distribution by constructing a neural network which maps samples from a simple reference, e.g. the standard normal distribution, to samples from the target. To that end, we propose using a neural network architecture inspired by the Langevin Monte Carlo (LMC) algorithm. Based on LMC perturbation results, we show approximation rates of the proposed architecture for smooth, log-concave target distributions measured in the Wasserstein-$2$ distance. The analysis heavily relies on the notion of sub-Gaussianity of the intermediate measures of the perturbed LMC process. In particular, we derive bounds on the growth of the intermediate variance proxies under different assumptions on the perturbations. Moreover, we propose an architecture similar to deep residual neural networks and derive expressivity results for approximating the sample to target distribution map.
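The residual-network analogy can be made explicit: each unadjusted Langevin step x <- x + gamma * grad log pi(x) + sqrt(2*gamma) * xi is a stochastic residual block, and stacking such blocks maps reference samples toward the target. The sketch below uses the exact score of a Gaussian target in place of learned weights, purely to show the structure.

```python
# Unrolled Langevin steps as residual blocks mapping N(0,1) samples
# toward a log-concave target. Step size and step count are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 0.5                       # log-concave target N(mu, sigma^2)

def score(x):
    return -(x - mu) / sigma**2            # grad log pi for the Gaussian

x = rng.normal(size=10000)                 # reference: standard normal
gamma, steps = 0.02, 200
for _ in range(steps):                     # each step = one residual block
    x = x + gamma * score(x) + np.sqrt(2 * gamma) * rng.normal(size=x.shape)

print("sample mean/std:", round(x.mean(), 2), round(x.std(), 2))
# -> roughly 3.0 and slightly above 0.5 (ULA discretization bias)
```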

Out-of-distribution Detection Learning with Unreliable Out-of-distribution Sources

  • paper_url: http://arxiv.org/abs/2311.03236
  • repo_url: None
  • paper_authors: Haotian Zheng, Qizhou Wang, Zhen Fang, Xiaobo Xia, Feng Liu, Tongliang Liu, Bo Han
  • for: Improving the reliability of open-world classification by training out-of-distribution (OOD) detectors with synthesized OOD data from data generators, without requiring any real OOD data.
  • methods: Proposes Auxiliary Task-based OOD Learning (ATOL), which turns generated data (even with mistaken OOD generation) into an auxiliary OOD detection task; a well-designed training procedure ensures that learning from the auxiliary task is beneficial when the ID and OOD parts have disjoint supports.
  • results: Extensive experiments under various OOD detection setups demonstrate the effectiveness of ATOL against advanced counterparts, relieving the harm of mistaken OOD generation.
    Abstract Out-of-distribution (OOD) detection discerns OOD data where the predictor cannot make valid predictions as in-distribution (ID) data, thereby increasing the reliability of open-world classification. However, it is typically hard to collect real out-of-distribution (OOD) data for training a predictor capable of discerning ID and OOD patterns. This obstacle gives rise to data generation-based learning methods, synthesizing OOD data via data generators for predictor training without requiring any real OOD data. Related methods typically pre-train a generator on ID data and adopt various selection procedures to find those data likely to be the OOD cases. However, generated data may still coincide with ID semantics, i.e., mistaken OOD generation remains, confusing the predictor between ID and OOD data. To this end, we suggest that generated data (with mistaken OOD generation) can be used to devise an auxiliary OOD detection task to facilitate real OOD detection. Specifically, we can ensure that learning from such an auxiliary task is beneficial if the ID and the OOD parts have disjoint supports, with the help of a well-designed training procedure for the predictor. Accordingly, we propose a powerful data generation-based learning method named Auxiliary Task-based OOD Learning (ATOL) that can relieve the mistaken OOD generation. We conduct extensive experiments under various OOD detection setups, demonstrating the effectiveness of our method against its advanced counterparts.

Spatial Process Approximations: Assessing Their Necessity

  • paper_url: http://arxiv.org/abs/2311.03201
  • repo_url: None
  • paper_authors: Hao Zhang
  • for: Examines how, for large sample sizes with fairly evenly distributed sampling locations, the kernel matrix becomes ill-conditioned, undermining prediction, classification, and maximum likelihood estimation.
  • methods: Reviews current methodologies for managing large spatial data, noting that some fail to address the ill-conditioning problem, and introduces various optimality criteria for the required approximations.
  • results: Shows that the ill-conditioning often results in low-rank approximations of the stochastic processes, and provides solutions for each of the introduced optimality criteria.
    Abstract In spatial statistics and machine learning, the kernel matrix plays a pivotal role in prediction, classification, and maximum likelihood estimation. A thorough examination reveals that for large sample sizes, the kernel matrix becomes ill-conditioned, provided the sampling locations are fairly evenly distributed. This condition poses significant challenges to numerical algorithms used in prediction and estimation computations and necessitates an approximation to prediction and the Gaussian likelihood. A review of current methodologies for managing large spatial data indicates that some fail to address this ill-conditioning problem. Such ill-conditioning often results in low-rank approximations of the stochastic processes. This paper introduces various optimality criteria and provides solutions for each.
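The ill-conditioning claim is easy to verify numerically: the condition number of a Gaussian (RBF) kernel matrix on fairly evenly spaced points blows up as the sample size grows. Length scale and grid here are arbitrary illustrative choices.

```python
# Condition number of an RBF kernel matrix on an evenly spaced grid.
import numpy as np

def rbf_kernel_matrix(x, length_scale=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-(d / length_scale) ** 2)

for n in (50, 100, 200):
    x = np.linspace(0.0, 10.0, n)            # fairly evenly spread locations
    K = rbf_kernel_matrix(x)
    print(n, f"cond(K) = {np.linalg.cond(K):.2e}")
# Condition numbers explode, which is why likelihood and prediction
# computations need approximations (e.g., low-rank) or regularization.
```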

Online Learning Quantum States with the Logarithmic Loss via VB-FTRL

  • paper_url: http://arxiv.org/abs/2311.04237
  • repo_url: None
  • paper_authors: Wei-Fu Tseng, Kai-Chun Chen, Zi-Hong Xiao, Yen-Huan Li
  • for: Online learning of quantum states with the logarithmic loss (LL-OLQS), a quantum generalization of online portfolio selection (OPS), a classic open problem in online learning for over three decades.
  • methods: Generalizes the VB-FTRL algorithm — the first nearly regret-optimal algorithm for OPS with moderate computational complexity — to LL-OLQS; each iteration solves a semidefinite program implementable in polynomial time, e.g., by cutting-plane methods. The notion of VB-convexity is introduced to facilitate the generalization.
  • results: For dimension $d$ and $T$ rounds, the generalized algorithm achieves a regret rate of $O(d^2 \log(d+T))$, close to the best-known rate of $O(d^2 \log T)$ achieved by the exponential weight method, for which no explicit implementation is available.
    Abstract Online learning quantum states with the logarithmic loss (LL-OLQS) is a quantum generalization of online portfolio selection, a classic open problem in the field of online learning for over three decades. The problem also emerges in designing randomized optimization algorithms for maximum-likelihood quantum state tomography. Recently, Jezequel et al. (arXiv:2209.13932) proposed the VB-FTRL algorithm, the first nearly regret-optimal algorithm for OPS with moderate computational complexity. In this note, we generalize VB-FTRL for LL-OLQS. Let $d$ denote the dimension and $T$ the number of rounds. The generalized algorithm achieves a regret rate of $O ( d^2 \log ( d + T ) )$ for LL-OLQS. Each iteration of the algorithm consists of solving a semidefinite program that can be implemented in polynomial time by, e.g., cutting-plane methods. For comparison, the best-known regret rate for LL-OLQS is currently $O ( d^2 \log T )$, achieved by the exponential weight method. However, there is no explicit implementation available for the exponential weight method for LL-OLQS. To facilitate the generalization, we introduce the notion of VB-convexity. VB-convexity is a sufficient condition for the logarithmic barrier associated with any function to be convex and is of independent interest.

Stable Linear Subspace Identification: A Machine Learning Approach

  • paper_url: http://arxiv.org/abs/2311.03197
  • repo_url: https://github.com/cemempamoi/simba
  • paper_authors: Loris Di Natale, Muhammad Zakwan, Bratislav Svetozarevic, Philipp Heer, Giancarlo Ferrari Trecate, Colin N. Jones
  • for: To show how established machine learning tools — especially automatic differentiation — can improve linear system identification (SI), introducing SIMBa, a family of discrete linear multi-step-ahead state-space SI methods trained with backpropagation.
  • methods: Relies on a novel Linear-Matrix-Inequality-based free parametrization of Schur matrices to guarantee the stability of the identified model (a simpler stability parametrization is sketched below).
  • results: SIMBa generally outperforms traditional linear state-space SI methods, sometimes significantly, at the price of a higher computational burden; against other SI methods with stability guarantees the gain is frequently above 25%, across a wide variety of input-output systems and on both simulated and real-world data.
    Abstract Machine Learning (ML) and linear System Identification (SI) have been historically developed independently. In this paper, we leverage well-established ML tools - especially the automatic differentiation framework - to introduce SIMBa, a family of discrete linear multi-step-ahead state-space SI methods using backpropagation. SIMBa relies on a novel Linear-Matrix-Inequality-based free parametrization of Schur matrices to ensure the stability of the identified model. We show how SIMBa generally outperforms traditional linear state-space SI methods, and sometimes significantly, although at the price of a higher computational burden. This performance gap is particularly remarkable compared to other SI methods with stability guarantees, where the gain is frequently above 25% in our investigations, hinting at SIMBa's ability to simultaneously achieve state-of-the-art fitting performance and enforce stability. Interestingly, these observations hold for a wide variety of input-output systems and on both simulated and real-world data, showcasing the flexibility of the proposed approach. We postulate that this new SI paradigm presents a great extension potential to identify structured nonlinear models from data, and we hence open-source SIMBa on https://github.com/Cemempamoi/simba.
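To make the stability constraint concrete, below is a minimal free parametrization of a Schur-stable state matrix in PyTorch. It bounds the spectral norm, which is sufficient for $\rho(A) < 1$ but more conservative than the LMI-based free parametrization SIMBa actually uses; the sizes and the $\gamma$ bound are illustrative.

```python
import torch
import torch.nn as nn

class SchurStableA(nn.Module):
    """Free parametrization of a Schur-stable state matrix A (rho(A) < 1).

    Hedged illustration: stability is enforced here by bounding the spectral
    norm, a sufficient but more conservative condition than SIMBa's
    LMI-based parametrization."""
    def __init__(self, n, gamma=0.99):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n, n) / n**0.5)
        self.gamma = gamma  # target bound on the spectral norm

    def forward(self):
        # ||A||_2 <= gamma < 1 implies all eigenvalues lie inside the unit disc.
        sigma_max = torch.linalg.matrix_norm(self.W, ord=2)
        return self.gamma * self.W / torch.clamp(sigma_max, min=self.gamma)
```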

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

  • paper_url: http://arxiv.org/abs/2311.03191
  • repo_url: https://github.com/tmlr-group/deepinception
  • paper_authors: Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
  • for: To investigate adversarial jailbreaks that void the safety guardrails of large language models (LLMs), and to propose a lightweight method for carrying them out.
  • methods: DeepInception leverages the personification ability of LLMs to construct a novel nested scene for the model to act in, realizing an adaptive way to escape usage control in a normal scenario and opening the possibility of further direct jailbreaks.
  • results: Comprehensive experiments show that DeepInception achieves jailbreak success rates competitive with previous counterparts and realizes continuous jailbreaks in subsequent interactions, revealing a critical self-losing weakness in both open- and closed-source LLMs such as Falcon, Vicuna, Llama-2, and GPT-3.5/4/4V.
    Abstract Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment that individuals can harm another person if they are told to do so by an authoritative figure, we disclose a lightweight method, termed as DeepInception, which can easily hypnotize LLM to be a jailbreaker and unlock its misusing risks. Specifically, DeepInception leverages the personification ability of LLM to construct a novel nested scene to behave, which realizes an adaptive way to escape the usage control in a normal scenario and provides the possibility for further direct jailbreaks. Empirically, we conduct comprehensive experiments to show its efficacy. Our DeepInception can achieve competitive jailbreak success rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open/closed-source LLMs like Falcon, Vicuna, Llama-2, and GPT-3.5/4/4V. Our investigation appeals that people should pay more attention to the safety aspects of LLMs and a stronger defense against their misuse risks. The code is publicly available at: https://github.com/tmlr-group/DeepInception.

Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding

  • paper_url: http://arxiv.org/abs/2311.03421
  • repo_url: https://github.com/arnaumarin/hdnn-artifactbrainstate
  • paper_authors: Arnau Marin-Llobet, Arnau Manasanch, Maria V. Sanchez-Vives
  • for: To improve the identification of brain states in rat neural recordings under different levels of anesthesia, where recordings are often compromised by noise, artifacts, and suboptimal quality.
  • methods: A two-stage computational framework: Hopfield Networks first preprocess the artifact-laden recordings, and Convolutional Neural Networks (CNNs) then classify the brain states.
  • results: Across various levels of data compression and noise intensity, the hybrid Hopfield-CNN pipeline effectively mitigates artifacts, allowing the model to reach parity with a clean-data CNN at lower noise levels.
    Abstract The study of brain states, ranging from highly synchronous to asynchronous neuronal patterns like the sleep-wake cycle, is fundamental for assessing the brain's spatiotemporal dynamics and their close connection to behavior. However, the development of new techniques to accurately identify them still remains a challenge, as these are often compromised by the presence of noise, artifacts, and suboptimal recording quality. In this study, we propose a two-stage computational framework combining Hopfield Networks for artifact data preprocessing with Convolutional Neural Networks (CNNs) for classification of brain states in rat neural recordings under different levels of anesthesia. To evaluate the robustness of our framework, we deliberately introduced noise artifacts into the neural recordings. We evaluated our hybrid Hopfield-CNN pipeline by benchmarking it against two comparative models: a standalone CNN handling the same noisy inputs, and another CNN trained and tested on artifact-free data. Performance across various levels of data compression and noise intensities showed that our framework can effectively mitigate artifacts, allowing the model to reach parity with the clean-data CNN at lower noise levels. Although this study mainly benefits small-scale experiments, the findings highlight the necessity for advanced deep learning and Hopfield Network models to improve scalability and robustness in diverse real-world settings.

Preserving Privacy in GANs Against Membership Inference Attack

  • paper_url: http://arxiv.org/abs/2311.03172
  • repo_url: None
  • paper_authors: Mohammadhadi Shateri, Francisco Messina, Fabrice Labeau, Pablo Piantanida
  • for: To protect the training data of generative adversarial networks (GANs) against membership inference attacks (MIAs), which exploit the information that overfitting and memorization leak about training samples.
  • methods: Two defense strategies are proposed: the maximum entropy GAN (MEGAN), which requires only a simple modification of the GAN loss function, and the mutual information minimization GAN (MIMGAN), which uses a variational representation of mutual information to minimize what a synthetic sample leaks about the whole training set.
  • results: Applied to commonly used datasets against state-of-the-art MIAs, both methods reduce the adversary's accuracy to the level of random guessing, with only a small reduction in the quality of the synthetic samples.
    Abstract Generative Adversarial Networks (GANs) have been widely used for generating synthetic data for cases where there is a limited size real-world dataset or when data holders are unwilling to share their data samples. Recent works showed that GANs, due to overfitting and memorization, might leak information regarding their training data samples. This makes GANs vulnerable to Membership Inference Attacks (MIAs). Several defense strategies have been proposed in the literature to mitigate this privacy issue. Unfortunately, defense strategies based on differential privacy are proven to reduce extensively the quality of the synthetic data points. On the other hand, more recent frameworks such as PrivGAN and PAR-GAN are not suitable for small-size training datasets. In the present work, the overfitting in GANs is studied in terms of the discriminator, and a more general measure of overfitting based on the Bhattacharyya coefficient is defined. Then, inspired by Fano's inequality, our first defense mechanism against MIAs is proposed. This framework, which requires only a simple modification in the loss function of GANs, is referred to as the maximum entropy GAN or MEGAN and significantly improves the robustness of GANs to MIAs. As a second defense strategy, a more heuristic model based on minimizing the information leaked from generated samples about the training data points is presented. This approach is referred to as mutual information minimization GAN (MIMGAN) and uses a variational representation of the mutual information to minimize the information that a synthetic sample might leak about the whole training data set. Applying the proposed frameworks to some commonly used data sets against state-of-the-art MIAs reveals that the proposed methods can reduce the accuracy of the adversaries to the level of random guessing accuracy with a small reduction in the quality of the synthetic data samples.

An Examination of the Alleged Privacy Threats of Confidence-Ranked Reconstruction of Census Microdata

  • paper_url: http://arxiv.org/abs/2311.03171
  • repo_url: https://github.com/NajeebJebreel/CRR-analysis
  • paper_authors: David Sánchez, Najeeb Jebreel, Josep Domingo-Ferrer, Krishnamurty Muralidhar, Alberto Blanco-Justicia
  • for: To examine whether the reconstruction attacks cited by the U.S. Census Bureau (USCB) to justify replacing traditional statistical disclosure limitation with differential privacy (DP) in the Decennial Census 2020 actually pose a privacy threat.
  • methods: Analyzes a recently proposed confidence-ranked reconstruction attack, whose goal is to indicate the confidence that a reconstructed record was present in the original respondent data.
  • results: Empirical results show that the confidence ranking cannot guide reidentification or attribute disclosure attacks; moreover, due to how Census data are compiled, processed, and released, original and complete records cannot be reconstructed by any methodology, and the confidence-ranked reconstruction is trivially outperformed by an adequate interpretation of the released aggregate statistics.
    Abstract The alleged threat of reconstruction attacks has led the U.S. Census Bureau (USCB) to replace in the Decennial Census 2020 the traditional statistical disclosure limitation based on rank swapping with one based on differential privacy (DP). This has resulted in substantial accuracy loss of the released statistics. Worse yet, it has been shown that the reconstruction attacks used as an argument to move to DP are very far from allowing unequivocal reidentification of the respondents, because in general there are a lot of reconstructions compatible with the released statistics. In a very recent paper, a new reconstruction attack has been proposed, whose goal is to indicate the confidence that a reconstructed record was in the original respondent data. The alleged risk of serious disclosure entailed by such confidence-ranked reconstruction has renewed the interest of the USCB to use DP-based solutions. To forestall the potential accuracy loss in future data releases resulting from adoption of these solutions, we show in this paper that the proposed confidence-ranked reconstruction does not threaten privacy. Specifically, we report empirical results showing that the proposed ranking cannot guide reidentification or attribute disclosure attacks, and hence it fails to warrant the USCB's move towards DP. Further, we also demonstrate that, due to the way the Census data are compiled, processed and released, it is not possible to reconstruct original and complete records through any methodology, and the confidence-ranked reconstruction not only is completely ineffective at accurately reconstructing Census records but is trivially outperformed by an adequate interpretation of the released aggregate statistics.

Convergence Analysis of Sequential Federated Learning on Heterogeneous Data

  • paper_url: http://arxiv.org/abs/2311.03154
  • repo_url: https://github.com/liyipeng00/convergence
  • paper_authors: Yipeng Li, Xinchen Lyu
  • for: To establish convergence guarantees for sequential federated learning (SFL) on heterogeneous data, where, in contrast to parallel federated learning (PFL), convergence theory has been lacking.
  • methods: Derives convergence guarantees for SFL with strongly convex, general convex, and non-convex objectives on heterogeneous data, covering both full and partial client participation (a schematic SFL round is sketched below).
  • results: The guarantees for SFL are better than those of PFL on heterogeneous data, and experiments validate the counterintuitive finding that SFL outperforms PFL on extremely heterogeneous data in cross-device settings.
    Abstract There are two categories of methods in Federated Learning (FL) for joint training across multiple clients: i) parallel FL (PFL), where clients train models in a parallel manner; and ii) sequential FL (SFL), where clients train models in a sequential manner. In contrast to that of PFL, the convergence theory of SFL on heterogeneous data is still lacking. In this paper, we establish the convergence guarantees of SFL for strongly/general/non-convex objectives on heterogeneous data. The convergence guarantees of SFL are better than that of PFL on heterogeneous data with both full and partial client participation. Experimental results validate the counterintuitive analysis result that SFL outperforms PFL on extremely heterogeneous data in cross-device settings.
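The difference between SFL and PFL lies purely in the update schedule. A schematic SFL round, under an assumed client interface (`local_sgd` is hypothetical, not from the repository), might look like:

```python
import copy

def sequential_fl_round(global_model, clients, local_steps, lr):
    """One round of sequential federated learning (SFL): the model visits
    clients one after another, unlike PFL, which averages parallel updates.
    `clients` is a list of objects with a .local_sgd(...) method; this
    interface is an assumption for illustration."""
    model = copy.deepcopy(global_model)
    for client in clients:                 # sequential, not parallel
        model = client.local_sgd(model, steps=local_steps, lr=lr)
    return model                           # becomes next round's global model
```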

End-to-end Material Thermal Conductivity Prediction through Machine Learning

  • paper_url: http://arxiv.org/abs/2311.03139
  • repo_url: None
  • paper_authors: Yagyank Srivastava, Ankit Jain
  • for: Accelerated prediction of the thermal conductivity of materials.
  • methods: Machine learning models trained on high-throughput calculations based on first principles and the Boltzmann transport equation for 225 materials, effectively more than doubling the size of the existing dataset.
  • results: All evaluated models suffered from overfitting; the best mean absolute percentage error on the test set remained in the range of 50-60%.
    Abstract We investigated the accelerated prediction of the thermal conductivity of materials through end- to-end structure-based approaches employing machine learning methods. Due to the non-availability of high-quality thermal conductivity data, we first performed high-throughput calculations based on first principles and the Boltzmann transport equation for 225 materials, effectively more than doubling the size of the existing dataset. We assessed the performance of state-of-the-art machine learning models for thermal conductivity prediction on this expanded dataset and observed that all these models suffered from overfitting. To address this issue, we introduced a novel graph-based neural network model, which demonstrated more consistent and regularized performance across all evaluated datasets. Nevertheless, the best mean absolute percentage error achieved on the test dataset remained in the range of 50-60%. This suggests that while these models are valuable for expediting material screening, their current accuracy is still limited.

Reservoir-Computing Model for Mapping and Forecasting Neuronal Interactions from Electrophysiological Data

  • paper_url: http://arxiv.org/abs/2311.03131
  • repo_url: None
  • paper_authors: Ilya Auslender, Giorgio Letti, Yasaman Heydari, Lorenzo Pavesi
  • for: To reconstruct the morphology and functionality of neuronal networks from electrophysiological measurements of neuronal cultures.
  • methods: A computational model based on a Reservoir Computing Network (RCN) architecture that decodes the spatio-temporal data and reconstructs the network structure on a macroscopic domain, representing the connectivity between neuronal units (a minimal reservoir is sketched below).
  • results: The model predicts the connectivity map with higher accuracy than common methods such as cross-correlation and transfer entropy, and can also predict the network response to a specific input, such as a localized stimulus.
    Abstract Electrophysiological nature of neuronal networks allows to reveal various interactions between different cell units at a very short time-scales. One of the many challenges in analyzing these signals is to retrieve the morphology and functionality of a given network. In this work we developed a computational model, based on Reservoir Computing Network (RCN) architecture, which decodes the spatio-temporal data from electro-physiological measurements of neuronal cultures and reconstructs the network structure on a macroscopic domain, representing the connectivity between neuronal units. We demonstrate that the model can predict the connectivity map of the network with higher accuracy than the common methods such as Cross-Correlation and Transfer-Entropy. In addition, we experimentally demonstrate the ability of the model to predict a network response to a specific input, such as localized stimulus.
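For context on the RCN architecture, a minimal echo-state reservoir in NumPy is sketched below; the reservoir size, spectral radius, and input scaling are illustrative choices, not the paper's settings.

```python
import numpy as np

def run_reservoir(u, n_res=300, rho=0.9, seed=0):
    """Minimal echo-state reservoir: drive a fixed random recurrent network
    with the input series u (shape: time x channels) and collect states for
    a linear readout. Sizes and spectral radius are illustrative."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-1, 1, (n_res, u.shape[1]))
    W = rng.normal(size=(n_res, n_res))
    W *= rho / max(abs(np.linalg.eigvals(W)))   # set the spectral radius
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(W @ x + W_in @ u_t)
        states.append(x.copy())
    return np.array(states)  # fit a readout, e.g. ridge regression, on these
```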

Nonparametric modeling of the composite effect of multiple nutrients on blood glucose dynamics

  • paper_url: http://arxiv.org/abs/2311.03129
  • repo_url: https://github.com/jularina/trcmed-kit
  • paper_authors: Arina Odnoblyudova, Çağlar Hizli, ST John, Andrea Cognolato, Anne Juuti, Simo Särkkä, Kirsi Pietiläinen, Pekka Marttinen
  • for: To estimate the physiological response to a treatment consisting of multiple components — here, the nutrients in a meal — learning the separate effects of the components in addition to their joint effect.
  • methods: Extends existing probabilistic nonparametric approaches and develops a new, more biologically interpretable convolution-based model for composite treatment-response curves; treatment components are differentiated, their dosages incorporated, and statistical information is shared across patients via a hierarchical multi-output Gaussian process (a toy convolution model is sketched below).
  • results: Improves prediction accuracy over existing approaches when estimating the impact of carbohydrates and fat in meals on blood glucose, and allows the different effects of the two nutrients on the overall glucose response to be interpreted.
    Abstract In biomedical applications it is often necessary to estimate a physiological response to a treatment consisting of multiple components, and learn the separate effects of the components in addition to the joint effect. Here, we extend existing probabilistic nonparametric approaches to explicitly address this problem. We also develop a new convolution-based model for composite treatment-response curves that is more biologically interpretable. We validate our models by estimating the impact of carbohydrate and fat in meals on blood glucose. By differentiating treatment components, incorporating their dosages, and sharing statistical information across patients via a hierarchical multi-output Gaussian process, our method improves prediction accuracy over existing approaches, and allows us to interpret the different effects of carbohydrates and fat on the overall glucose response.
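A toy version of the convolution idea: each meal contributes dose-scaled, time-shifted nutrient response curves that sum into the predicted glucose excursion. The parametric kernel below is a stand-in for the learned hierarchical GP components, purely for illustration.

```python
import numpy as np

def composite_response(t, meals, kernels):
    """Hedged sketch of a convolution-style composite treatment response.
    meals: list of (meal_time, {"carbs": grams, "fat": grams}) tuples;
    kernels: dict mapping nutrient name -> response-curve callable.
    The parametric form is an illustration, not the paper's learned GP."""
    y = np.zeros_like(t, dtype=float)
    for t0, doses in meals:
        for nutrient, dose in doses.items():
            y += dose * kernels[nutrient](t - t0)
    return y

def bump(tau, peak=1.0, scale=1.0):
    """Smooth, delayed response that is zero before the meal and peaks
    at lag `peak`."""
    s = np.maximum(tau, 0.0)
    return scale * (s / peak) * np.exp(1.0 - s / peak)
```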

Algebraic Dynamical Systems in Machine Learning

  • paper_url: http://arxiv.org/abs/2311.03118
  • repo_url: None
  • paper_authors: Iolo Jones, Jerry Swan, Jeffrey Giansiracusa
  • for: To introduce an algebraic analogue of dynamical systems, based on term rewriting, into which dynamic machine learning models can be embedded.
  • methods: A recursive function applied to the output of an iterated rewriting system defines a formal class of models; considered in category theory, these algebraic models give a natural language for describing the compositionality of dynamic models.
  • results: All the main architectures for dynamic machine learning models — including recurrent neural networks, graph neural networks, and diffusion models — embed into this formal class, which also provides a template for generalizing such models to learning problems on structured or non-numerical data, including hybrid symbolic-numeric models.
    Abstract We introduce an algebraic analogue of dynamical systems, based on term rewriting. We show that a recursive function applied to the output of an iterated rewriting system defines a formal class of models into which all the main architectures for dynamic machine learning models (including recurrent neural networks, graph neural networks, and diffusion models) can be embedded. Considered in category theory, we also show that these algebraic models are a natural language for describing the compositionality of dynamic models. Furthermore, we propose that these models provide a template for the generalisation of the above dynamic models to learning problems on structured or non-numerical data, including 'hybrid symbolic-numeric' models.

RELand: Risk Estimation of Landmines via Interpretable Invariant Risk Minimization

  • paper_url: http://arxiv.org/abs/2311.03115
  • repo_url: None
  • paper_authors: Mateo Dulce Rubio, Siqi Zeng, Qi Wang, Didier Alvarado, Francisco Moreno, Hoda Heidari, Fei Fang
  • for: 提高人道主义废钳工作效率和准确性,为战后受影响的社区减少陷阱风险。
  • methods: 提出RELand系统,包括三大组成部分:一、提供全面的特征工程和标签分配指南,为全球废钳任务提供通用的数据预处理方法;二、将废钳存在问题定型为分类问题,设计一种可读性强的新型模型,基于稀缺特征覆盖和不变风险最小化;三、根据真实世界废钳操作规范,进行了广泛的评估,显示与现有方法相比有显著提升。
  • results: 在实际场景中,与一家人道主义废钳组织在哥伦比亚 collaborating,使用我们的系统进行了两个区域的废钳计划。
    Abstract Landmines remain a threat to war-affected communities for years after conflicts have ended, partly due to the laborious nature of demining tasks. Humanitarian demining operations begin by collecting relevant information from the sites to be cleared, which is then analyzed by human experts to determine the potential risk of remaining landmines. In this paper, we propose RELand system to support these tasks, which consists of three major components. We (1) provide general feature engineering and label assigning guidelines to enhance datasets for landmine risk modeling, which are widely applicable to global demining routines, (2) formulate landmine presence as a classification problem and design a novel interpretable model based on sparse feature masking and invariant risk minimization, and run extensive evaluation under proper protocols that resemble real-world demining operations to show a significant improvement over the state-of-the-art, and (3) build an interactive web interface to suggest priority areas for demining organizations. We are currently collaborating with a humanitarian demining NGO in Colombia that is using our system as part of their field operations in two areas recently prioritized for demining.

Weight-Sharing Regularization

  • paper_url: http://arxiv.org/abs/2311.03096
  • repo_url: https://github.com/motahareh-sohrabi/weight-sharing-regularization
  • paper_authors: Mehran Shakerinava, Motahareh Sohrabi, Siamak Ravanbakhsh, Simon Lacoste-Julien
  • for: To introduce and analyze weight-sharing regularization for neural networks.
  • methods: Defines the regularizer $R(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$ (an efficient evaluation is sketched below), studies its proximal mapping with an intuitive interpretation as a physical system of interacting particles, and designs a novel parallel algorithm for $\operatorname{prox}_R$ with depth $O(\log^3 d)$, making training with proximal gradient descent feasible.
  • results: The parallel algorithm provides an exponential speedup over previous algorithms, and experiments show that weight-sharing regularization enables fully-connected networks to learn convolution-like filters.
    Abstract Weight-sharing is ubiquitous in deep learning. Motivated by this, we introduce ''weight-sharing regularization'' for neural networks, defined as $R(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $R$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. Using this interpretation, we design a novel parallel algorithm for $\operatorname{prox}_R$ which provides an exponential speedup over previous algorithms, with a depth of $O(\log^3 d)$. Our algorithm makes it feasible to train weight-sharing regularized deep neural networks with proximal gradient descent. Experiments reveal that weight-sharing regularization enables fully-connected networks to learn convolution-like filters.
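The regularizer itself is cheap to evaluate: sorting reduces the $O(d^2)$ pairwise sum to $O(d \log d)$, since for weights sorted in ascending order $\sum_{i>j}|w_i - w_j| = \sum_{k=0}^{d-1} (2k - d + 1)\, w_{(k)}$. A minimal PyTorch sketch of the value (the paper's parallel $\operatorname{prox}_R$ algorithm is not reproduced here):

```python
import torch

def weight_sharing_reg(w):
    """Weight-sharing regularizer R(w) = 1/(d-1) * sum_{i>j} |w_i - w_j|.

    Naive evaluation is O(d^2); sorting gives an equivalent O(d log d) form,
    since for sorted weights sum_{i>j} |w_i - w_j| = sum_k (2k - d + 1) w_(k)."""
    d = w.numel()
    w_sorted, _ = torch.sort(w.flatten())
    coeffs = 2 * torch.arange(d, device=w.device, dtype=w.dtype) - (d - 1)
    return (coeffs * w_sorted).sum() / (d - 1)
```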

Equivariance Is Not All You Need: Characterizing the Utility of Equivariant Graph Neural Networks for Particle Physics Tasks

  • paper_url: http://arxiv.org/abs/2311.03094
  • repo_url: None
  • paper_authors: Savannah Thais, Daniel Murnane
  • for: To evaluate equivariant graph neural networks (GNNs), which directly incorporate the symmetries of the underlying physical system, as a method for learning from physics data.
  • methods: Drawing on the literature around group-equivariant networks, the paper presents a comprehensive evaluation of the proposed benefits of equivariant GNNs, using real-world particle physics reconstruction tasks as a test bed.
  • results: Many of the theoretical benefits generally associated with equivariant networks may not hold for realistic systems; the paper identifies compelling directions for future research that will benefit both the scientific theory of ML and physics applications.
    Abstract Incorporating inductive biases into ML models is an active area of ML research, especially when ML models are applied to data about the physical world. Equivariant Graph Neural Networks (GNNs) have recently become a popular method for learning from physics data because they directly incorporate the symmetries of the underlying physical system. Drawing from the relevant literature around group equivariant networks, this paper presents a comprehensive evaluation of the proposed benefits of equivariant GNNs by using real-world particle physics reconstruction tasks as an evaluation test-bed. We demonstrate that many of the theoretical benefits generally associated with equivariant networks may not hold for realistic systems and introduce compelling directions for future research that will benefit both the scientific theory of ML and physics applications.

Persistent homology for high-dimensional data based on spectral methods

  • paper_url: http://arxiv.org/abs/2311.03087
  • repo_url: https://github.com/berenslab/eff-ph
  • paper_authors: Sebastian Damrich, Philipp Berens, Dmitry Kobak
  • for: To detect non-trivial topology of point clouds, such as loops or voids, for data with low intrinsic dimensionality embedded in a much higher-dimensional ambient space.
  • methods: Replaces Euclidean distance with spectral distances on the $k$-nearest-neighbor graph of the data, such as diffusion distance and effective resistance (the standard spectral formula is sketched below), and derives a novel closed-form expression for effective resistance in terms of the eigendecomposition of the graph Laplacian.
  • results: Spectral distances allow persistent homology to detect the correct topology even in the presence of high-dimensional noise, where vanilla persistent homology and most of its existing refinements fail; on high-dimensional single-cell RNA-sequencing datasets they enable robust detection of cell cycle loops.
    Abstract Persistent homology is a popular computational tool for detecting non-trivial topology of point clouds, such as the presence of loops or voids. However, many real-world datasets with low intrinsic dimensionality reside in an ambient space of much higher dimensionality. We show that in this case vanilla persistent homology becomes very sensitive to noise and fails to detect the correct topology. The same holds true for most existing refinements of persistent homology. As a remedy, we find that spectral distances on the $k$-nearest-neighbor graph of the data, such as diffusion distance and effective resistance, allow persistent homology to detect the correct topology even in the presence of high-dimensional noise. Furthermore, we derive a novel closed-form expression for effective resistance in terms of the eigendecomposition of the graph Laplacian, and describe its relation to diffusion distances. Finally, we apply these methods to several high-dimensional single-cell RNA-sequencing datasets and show that spectral distances on the $k$-nearest-neighbor graph allow robust detection of cell cycle loops.
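For reference, the standard spectral formula for effective resistance on a connected graph, $R_{ij} = \sum_{k \ge 2} (u_k(i) - u_k(j))^2 / \lambda_k$, is easy to compute with a dense eigendecomposition; the paper's novel closed-form expression and $k$NN-graph pipeline are not reproduced in this sketch.

```python
import numpy as np
import scipy.sparse.csgraph as csgraph

def effective_resistance(adjacency):
    """Pairwise effective resistances from the eigendecomposition of the
    graph Laplacian, for a small connected graph (dense computation).
    R_ij = sum_{k>=2} (u_k(i) - u_k(j))^2 / lambda_k."""
    L = csgraph.laplacian(adjacency, normed=False)
    lam, U = np.linalg.eigh(L)
    # Drop the constant eigenvector (lambda_1 = 0 for a connected graph).
    lam, U = lam[1:], U[:, 1:]
    scaled = U / np.sqrt(lam)                  # u_k / sqrt(lambda_k)
    sq_norm = (scaled ** 2).sum(axis=1)
    # ||s_i||^2 + ||s_j||^2 - 2 s_i . s_j gives all pairwise resistances.
    return sq_norm[:, None] + sq_norm[None, :] - 2 * scaled @ scaled.T
```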

Quantifying the value of information transfer in population-based SHM

  • paper_url: http://arxiv.org/abs/2311.03083
  • repo_url: None
  • paper_authors: Aidan J. Hughes, Jack Poole, Nikolaos Dervilis, Paul Gardner, Keith Worden
  • for: Addressing limitations of traditional structural health monitoring (SHM), such as data scarcity, through population-based approaches that share information between sufficiently similar structures.
  • methods: A transfer-strategy decision process for a classification task in a population of simulated structures, supported by domain adaptation and based on the expected value of information transfer; a probabilistic regression forecasts classification performance from a proxy for structural similarity based on the modal assurance criterion.
  • results: The framework is demonstrated on a representative SHM maintenance problem, addressing the question of when, what, and how one should transfer between structures while guarding against negative transfer.
    Abstract Population-based structural health monitoring (PBSHM), seeks to address some of the limitations associated with data scarcity that arise in traditional SHM. A tenet of the population-based approach to SHM is that information can be shared between sufficiently-similar structures in order to improve predictive models. Transfer learning techniques, such as domain adaptation, have been shown to be a highly-useful technology for sharing information between structures when developing statistical classifiers for PBSHM. Nonetheless, transfer-learning techniques are not without their pitfalls. In some circumstances, for example if the data distributions associated with the structures within a population are dissimilar, applying transfer-learning methods can be detrimental to classification performance -- this phenomenon is known as negative transfer. Given the potentially-severe consequences of negative transfer, it is prudent for engineers to ask the question `when, what, and how should one transfer between structures?'. The current paper aims to demonstrate a transfer-strategy decision process for a classification task for a population of simulated structures in the context of a representative SHM maintenance problem, supported by domain adaptation. The transfer decision framework is based upon the concept of expected value of information transfer. In order to compute the expected value of information transfer, predictions must be made regarding the classification (and decision performance) in the target domain following information transfer. In order to forecast the outcome of transfers, a probabilistic regression is used here to predict classification performance from a proxy for structural similarity based on the modal assurance criterion.

SoK: Memorisation in machine learning

  • paper_url: http://arxiv.org/abs/2311.03075
  • repo_url: None
  • paper_authors: Dmitrii Usynin, Moritz Knolle, Georgios Kaissis
  • for: To address the open problem of quantifying the impact of individual data samples on machine learning models, which is particularly relevant in deep learning, where complex high-dimensional relationships must be learned from a limited sample of the data-generating distribution.
  • methods: Unifies a broad range of previous definitions and perspectives on memorisation in ML, and systematises methods that allow practitioners to detect or quantify memorisation across a broad range of learning settings.
  • results: Clarifies the interplay between memorisation and model generalisation and the implications of these phenomena for data privacy, including memorisation in the context of privacy attacks, differential privacy (DP), and adversarial actors.
    Abstract Quantifying the impact of individual data samples on machine learning models is an open research problem. This is particularly relevant when complex and high-dimensional relationships have to be learned from a limited sample of the data generating distribution, such as in deep learning. It was previously shown that, in these cases, models rely not only on extracting patterns which are helpful for generalisation, but also seem to be required to incorporate some of the training data more or less as is, in a process often termed memorisation. This raises the question: if some memorisation is a requirement for effective learning, what are its privacy implications? In this work we unify a broad range of previous definitions and perspectives on memorisation in ML, discuss their interplay with model generalisation and their implications of these phenomena on data privacy. Moreover, we systematise methods allowing practitioners to detect the occurrence of memorisation or quantify it and contextualise our findings in a broad range of ML learning settings. Finally, we discuss memorisation in the context of privacy attacks, differential privacy (DP) and adversarial actors.

Imaging through multimode fibres with physical prior

  • paper_url: http://arxiv.org/abs/2311.03062
  • repo_url: None
  • paper_authors: Chuncheng Zhang, Yingjie Shi, Zheyi Yao, Xiubao Sui, Qian Cheng
  • for: 这篇论文旨在提出一种physics-assisted, unsupervised, learning-based fibre imaging方法,以减少计算复杂性并提高多模式纤维成像的扩展应用。
  • methods: 该方法使用深度学习网络,但不需要目标对应的射频对。而是通过物理优化方法提供的方向来帮助网络学习目标特征。
  • results: 该方法可以通过在线学习,只需要几个噪声模式和未对应的目标,可以准确地重建目标图像。此外,该方法还可以提高多模式纤维成像的普适性。
    Abstract Imaging through perturbed multimode fibres based on deep learning has been widely researched. However, existing methods mainly use target-speckle pairs in different configurations. It is challenging to reconstruct targets without trained networks. In this paper, we propose a physics-assisted, unsupervised, learning-based fibre imaging scheme. The role of the physical prior is to simplify the mapping relationship between the speckle pattern and the target image, thereby reducing the computational complexity. The unsupervised network learns target features according to the optimized direction provided by the physical prior. Therefore, the reconstruction process of the online learning only requires a few speckle patterns and unpaired targets. The proposed scheme also increases the generalization ability of the learning-based method in perturbed multimode fibres. Our scheme has the potential to extend the application of multimode fibre imaging.

Learned layered coding for Successive Refinement in the Wyner-Ziv Problem

  • paper_url: http://arxiv.org/abs/2311.03061
  • repo_url: None
  • paper_authors: Boris Joukovsky, Brent De Weerdt, Nikos Deligiannis
  • for: A data-driven approach to explicitly learn the progressive encoding of a continuous source that is successively decoded with increasing levels of quality, aided by correlated side information — the successive refinement of the Wyner-Ziv coding problem.
  • methods: Assuming ideal Slepian-Wolf coding, recurrent neural networks (RNNs) learn layered encoders and decoders for the quadratic Gaussian case; the models are trained by minimizing a variational bound on the rate-distortion function of the successively refined Wyner-Ziv problem.
  • results: The RNNs explicitly recover layered binning solutions akin to scalable nested quantization, and the scheme's rate-distortion performance is on par with the corresponding monolithic Wyner-Ziv coding approach and close to the rate-distortion bound.
    Abstract We propose a data-driven approach to explicitly learn the progressive encoding of a continuous source, which is successively decoded with increasing levels of quality and with the aid of correlated side information. This setup refers to the successive refinement of the Wyner-Ziv coding problem. Assuming ideal Slepian-Wolf coding, our approach employs recurrent neural networks (RNNs) to learn layered encoders and decoders for the quadratic Gaussian case. The models are trained by minimizing a variational bound on the rate-distortion function of the successively refined Wyner-Ziv coding problem. We demonstrate that RNNs can explicitly retrieve layered binning solutions akin to scalable nested quantization. Moreover, the rate-distortion performance of the scheme is on par with the corresponding monolithic Wyner-Ziv coding approach and is close to the rate-distortion bound.

Personalizing Keyword Spotting with Speaker Information

  • paper_url: http://arxiv.org/abs/2311.03419
  • repo_url: None
  • paper_authors: Beltrán Labrador, Pai Zhu, Guanlong Zhao, Angelo Scorza Scarpati, Quan Wang, Alicia Lozano-Diez, Alex Park, Ignacio López Moreno
  • for: To improve keyword spotting accuracy across a diverse population with various accents and age groups.
  • methods: Integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM, sketched below); both text-dependent and text-independent speaker recognition systems are explored, extracting speaker information from the input audio as well as from pre-enrolled user audio.
  • results: On a diverse dataset, the approach substantially improves keyword detection accuracy, particularly among underrepresented speaker groups, while requiring only a 1% increase in the number of parameters and having minimal impact on latency and computational cost, making it practical for real-world applications.
    Abstract Keyword spotting systems often struggle to generalize to a diverse population with various accents and age groups. To address this challenge, we propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM), a recent method for learning from multiple sources of information. We explore both Text-Dependent and Text-Independent speaker recognition systems to extract speaker information, and we experiment on extracting this information from both the input audio and pre-enrolled user audio. We evaluate our systems on a diverse dataset and achieve a substantial improvement in keyword detection accuracy, particularly among underrepresented speaker groups. Moreover, our proposed approach only requires a small 1% increase in the number of parameters, with a minimum impact on latency and computational cost, which makes it a practical solution for real-world applications.
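The FiLM mechanism itself is compact: a conditioning network maps the speaker embedding to per-channel scales and shifts applied to intermediate acoustic features. A minimal sketch, where layer sizes and placement inside the keyword-spotting network are assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: modulate acoustic features with a
    conditioning vector (here, a speaker embedding). Layer sizes and where
    FiLM is applied are illustrative assumptions."""
    def __init__(self, feat_dim, speaker_dim):
        super().__init__()
        self.gamma = nn.Linear(speaker_dim, feat_dim)  # per-channel scale
        self.beta = nn.Linear(speaker_dim, feat_dim)   # per-channel shift

    def forward(self, x, spk_emb):
        # x: (batch, time, feat_dim); spk_emb: (batch, speaker_dim)
        g = self.gamma(spk_emb).unsqueeze(1)  # broadcast over time
        b = self.beta(spk_emb).unsqueeze(1)
        return g * x + b
```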

DRAUC: An Instance-wise Distributionally Robust AUC Optimization Framework

  • paper_url: http://arxiv.org/abs/2311.03055
  • repo_url: None
  • paper_authors: Siran Dai, Qianqian Xu, Zhiyong Yang, Xiaochun Cao, Qingming Huang
  • for: To make Area Under the ROC Curve (AUC) optimization robust in long-tailed classification, where training and testing examples are often not drawn i.i.d. from the same distribution.
  • methods: Proposes an instance-wise surrogate loss, Distributionally Robust AUC (DRAUC), and builds an optimization framework on top of it (a plain AUC surrogate is sketched below); since the conventional formulation may induce label bias, a distribution-aware DRAUC is introduced as a more suitable metric for robust AUC learning.
  • results: Theoretically, the generalization gap between training loss and testing error diminishes for sufficiently large training sets; empirically, experiments on corrupted benchmark datasets demonstrate the effectiveness of the method.
    Abstract The Area Under the ROC Curve (AUC) is a widely employed metric in long-tailed classification scenarios. Nevertheless, most existing methods primarily assume that training and testing examples are drawn i.i.d. from the same distribution, which is often unachievable in practice. Distributionally Robust Optimization (DRO) enhances model performance by optimizing it for the local worst-case scenario, but directly integrating AUC optimization with DRO results in an intractable optimization problem. To tackle this challenge, methodically we propose an instance-wise surrogate loss of Distributionally Robust AUC (DRAUC) and build our optimization framework on top of it. Moreover, we highlight that conventional DRAUC may induce label bias, hence introducing distribution-aware DRAUC as a more suitable metric for robust AUC learning. Theoretically, we affirm that the generalization gap between the training loss and testing error diminishes if the training set is sufficiently large. Empirically, experiments on corrupted benchmark datasets demonstrate the effectiveness of our proposed method. Code is available at: https://github.com/EldercatSAM/DRAUC.
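As background, a plain instance-wise AUC surrogate over positive/negative score pairs looks as follows; DRAUC wraps such surrogates in a distributionally robust worst-case formulation, which this sketch deliberately omits.

```python
import torch

def pairwise_auc_surrogate(scores_pos, scores_neg, margin=1.0):
    """Squared-hinge surrogate for 1 - AUC over all positive/negative pairs.
    A plain surrogate for context; DRAUC's distributionally robust wrapper
    is not reproduced here."""
    diff = scores_pos[:, None] - scores_neg[None, :]   # all pos-neg pairs
    return torch.clamp(margin - diff, min=0).pow(2).mean()
```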

Validity problems in clinical machine learning by indirect data labeling using consensus definitions

  • paper_url: http://arxiv.org/abs/2311.03037
  • repo_url: https://github.com/statnlp/ml4h_validity_problems
  • paper_authors: Michael Hagmann, Shigehiko Schamoni, Stefan Riezler
  • for: To demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine.
  • methods: Presents a general procedure for identifying problematic datasets — those whose target labels are determined by an indirect measurement whose defining fundamental measurements are included in the input representation — and black-box models trained on them.
  • results: Models trained on such data learn nothing but to reconstruct the known target definition: they show perfect performance on similarly constructed test data but fail catastrophically on real-world examples where the defining measurements are missing or incomplete; the detection procedure is exemplified on early prediction of sepsis.
    Abstract We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.

On regularized polynomial functional regression

  • paper_url: http://arxiv.org/abs/2311.03036
  • repo_url: None
  • paper_authors: Markus Holzleitner, Sergei Pereverzyev
  • for: A comprehensive treatment of polynomial functional regression, culminating in the establishment of a novel finite sample bound.
  • methods: The bound encompasses general smoothness conditions, capacity conditions, and regularization techniques, extending and generalizing several findings from linear functional regression.
  • results: Numerical evidence shows that using higher-order polynomial terms can lead to improved performance.
    Abstract This article offers a comprehensive treatment of polynomial functional regression, culminating in the establishment of a novel finite sample bound. This bound encompasses various aspects, including general smoothness conditions, capacity conditions, and regularization techniques. In doing so, it extends and generalizes several findings from the context of linear functional regression as well. We also provide numerical evidence that using higher order polynomial terms can lead to an improved performance.

Estimating treatment effects from single-arm trials via latent-variable modeling

  • paper_url: http://arxiv.org/abs/2311.03002
  • repo_url: https://github.com/manuelhaussmann/lvm_singlearm
  • paper_authors: Manuel Haussmann, Tran Minh Son Le, Viivi Halla-aho, Samu Kurki, Jussi Leinonen, Miika Koskinen, Samuel Kaski, Harri Lähdesmäki
  • for: To provide a viable alternative to randomized controlled trials by estimating treatment effects from single-arm trials with an external control group, while accounting for structured missingness patterns in covariate observations.
  • methods: An identifiable deep latent-variable model that uses amortized variational inference to learn both group-specific and identifiable shared latent representations, usable for (i) patient matching when treatment outcomes are unavailable for the treatment group, or (ii) direct treatment effect estimation when outcomes are available for both groups.
  • results: On a public benchmark and on a dataset combining a published RCT study with real-world electronic health records, the model outperforms previous methods both for direct treatment effect estimation and for effect estimation via patient matching.
    Abstract Randomized controlled trials (RCTs) are the accepted standard for treatment effect estimation but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for (i) patient matching if treatment outcomes are not available for the treatment group, or for (ii) direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.

Variational Weighting for Kernel Density Ratios

  • paper_url: http://arxiv.org/abs/2311.03001
  • repo_url: https://github.com/swyoon/variationally-weighted-kernel-density-estimation
  • paper_authors: Sangwoong Yoon, Frank C. Park, Gunsu S Yun, Iljung Kim, Yung-Kyun Noh
  • for: To improve kernel density estimation (KDE), which is integral to a range of generative and discriminative tasks in machine learning.
  • methods: Drawing on tools from the multidimensional calculus of variations, derives an optimal weight function that reduces bias in standard kernel density estimates for density ratios (the unweighted baseline is sketched below).
  • results: The weighting leads to improved estimates of prediction posteriors and information-theoretic measures, and sheds light on fundamental aspects of density estimation, particularly for algorithms that employ KDEs as their main building blocks.
    Abstract Kernel density estimation (KDE) is integral to a range of generative and discriminative tasks in machine learning. Drawing upon tools from the multidimensional calculus of variations, we derive an optimal weight function that reduces bias in standard kernel density estimates for density ratios, leading to improved estimates of prediction posteriors and information-theoretic measures. In the process, we shed light on some fundamental aspects of density estimation, particularly from the perspective of algorithms that employ KDEs as their main building blocks.
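The baseline the paper improves on is the plain ratio of two kernel density estimates; the variationally optimal weight function itself is not reproduced in this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_density_ratio(x_num, x_den, grid):
    """Unweighted baseline estimate of the density ratio p(x)/q(x) on a
    1-D grid, with x_num ~ p and x_den ~ q (all inputs 1-D arrays)."""
    p_hat = gaussian_kde(x_num)(grid)
    q_hat = gaussian_kde(x_den)(grid)
    return p_hat / np.clip(q_hat, 1e-12, None)  # guard against division by 0
```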

Strong statistical parity through fair synthetic data

  • paper_url: http://arxiv.org/abs/2311.03000
  • repo_url: None
  • paper_authors: Ivona Krchova, Michael Platzer, Paul Tiwald
  • for: To create AI-generated synthetic data that protects the privacy of the original data while embodying fairness by design, under the statistical parity fairness definition.
  • methods: Equalizes the learned target probability distributions of the synthetic data generator across sensitive attributes; the fairness adjustment can be integrated directly into the generator's sampling process or added as a post-processing step (a post-processing sketch follows below).
  • results: Downstream models trained on such synthetic data provide fair predictions across all thresholds — strong fair predictions even when inferring from biased original data — and data consumers can fine-tune the accuracy-fairness trade-off without prior assumptions on the data or re-training the generator.
    Abstract AI-generated synthetic data, in addition to protecting the privacy of original data sets, allows users and data consumers to tailor data to their needs. This paper explores the creation of synthetic data that embodies Fairness by Design, focusing on the statistical parity fairness definition. By equalizing the learned target probability distributions of the synthetic data generator across sensitive attributes, a downstream model trained on such synthetic data provides fair predictions across all thresholds, that is, strong fair predictions even when inferring from biased, original data. This fairness adjustment can be either directly integrated into the sampling process of a synthetic generator or added as a post-processing step. The flexibility allows data consumers to create fair synthetic data and fine-tune the trade-off between accuracy and fairness without any previous assumptions on the data or re-training the synthetic data generator.
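A minimal post-processing sketch of the statistical-parity adjustment: resample synthetic records so each sensitive group matches the overall positive-outcome rate. The column names `y` and `a` are placeholders, the sketch assumes every group contains both outcomes, and the paper's in-sampler adjustment is not shown.

```python
import pandas as pd

def statistical_parity_gap(df, target="y", sensitive="a"):
    """Difference in positive-outcome rates across sensitive groups."""
    rates = df.groupby(sensitive)[target].mean()
    return rates.max() - rates.min()

def rebalance_synthetic(df, target="y", sensitive="a", seed=0):
    """Resample synthetic records so every sensitive group shares the
    overall target rate (statistical parity). A hedged post-processing
    sketch; column names are placeholders, not from the paper."""
    overall = df[target].mean()
    parts = []
    for _, grp in df.groupby(sensitive):
        pos, neg = grp[grp[target] == 1], grp[grp[target] == 0]
        n_pos = int(round(overall * len(grp)))
        parts.append(pos.sample(n_pos, replace=True, random_state=seed))
        parts.append(neg.sample(len(grp) - n_pos, replace=True, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)
```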

Hacking Cryptographic Protocols with Advanced Variational Quantum Attacks

  • paper_url: http://arxiv.org/abs/2311.02986
  • repo_url: None
  • paper_authors: Borja Aizpurua, Pablo Bermejo, Josu Etxezarreta Martinez, Roman Orus
  • for: To introduce an improved approach to Variational Quantum Attack Algorithms (VQAA) on cryptographic protocols.
  • methods: Provides robust quantum attacks on well-known symmetric-key algorithms, more efficiently and with remarkably fewer qubits than previous approaches; attacks on S-DES, S-AES, and Blowfish are implemented in simulation.
  • results: A classical simulation of a small 8-qubit quantum computer finds the secret key of a 32-bit Blowfish instance with 24 times fewer iterations than a brute-force attack, and attack success rates improve for lightweight ciphers such as S-DES and S-AES; applications beyond symmetric-key cryptography — including asymmetric-key protocols and hash functions — and potential future improvements are also discussed.
    Abstract Here we introduce an improved approach to Variational Quantum Attack Algorithms (VQAA) on crytographic protocols. Our methods provide robust quantum attacks to well-known cryptographic algorithms, more efficiently and with remarkably fewer qubits than previous approaches. We implement simulations of our attacks for symmetric-key protocols such as S-DES, S-AES and Blowfish. For instance, we show how our attack allows a classical simulation of a small 8-qubit quantum computer to find the secret key of one 32-bit Blowfish instance with 24 times fewer number of iterations than a brute-force attack. Our work also shows improvements in attack success rates for lightweight ciphers such as S-DES and S-AES. Further applications beyond symmetric-key cryptography are also discussed, including asymmetric-key protocols and hash functions. In addition, we also comment on potential future improvements of our methods. Our results bring one step closer assessing the vulnerability of large-size classical cryptographic protocols with Noisy Intermediate-Scale Quantum (NISQ) devices, and set the stage for future research in quantum cybersecurity.
    摘要 我们介绍了一种改进的变分量子攻击算法(VQAA),用于攻击密码协议。我们的方法能够对知名密码算法实施稳健的量子攻击,比以往方法更高效,且所需量子比特数显著更少。我们对 S-DES、S-AES 和 Blowfish 等对称密钥协议实现了攻击模拟。例如,我们展示了如何通过经典模拟一台 8 量子比特的小型量子计算机,以比暴力破解少 24 倍的迭代次数找到一个 32 位 Blowfish 实例的密钥。我们的工作还提升了对 S-DES 和 S-AES 等轻量级密码的攻击成功率。此外,我们还讨论了对称密钥密码学之外的应用,包括非对称密钥协议和哈希函数,并展望了方法的潜在改进。我们的结果使利用噪声中等规模量子(NISQ)设备评估大规模经典密码协议脆弱性的目标更近了一步,并为量子网络安全的未来研究奠定了基础。

The Pursuit of Human Labeling: A New Perspective on Unsupervised Learning

  • paper_url: http://arxiv.org/abs/2311.02940
  • repo_url: https://github.com/mlbio-epfl/hume
  • paper_authors: Artyom Gadetsky, Maria Brbic
  • for: This paper is written for inferring human labeling of a given dataset without any external supervision.
  • methods: The paper uses a simple model-agnostic framework called HUME, which utilizes the insight that classes defined by many human labelings are linearly separable regardless of the representation space used to represent a dataset.
  • results: The proposed optimization objective in HUME is strikingly well-correlated with the ground truth labeling of the dataset, and the framework achieves state-of-the-art performance on four benchmark image classification datasets, including the large-scale ImageNet-1000 dataset.
    Abstract We present HUME, a simple model-agnostic framework for inferring human labeling of a given dataset without any external supervision. The key insight behind our approach is that classes defined by many human labelings are linearly separable regardless of the representation space used to represent a dataset. HUME utilizes this insight to guide the search over all possible labelings of a dataset to discover an underlying human labeling. We show that the proposed optimization objective is strikingly well-correlated with the ground truth labeling of the dataset. In effect, we only train linear classifiers on top of pretrained representations that remain fixed during training, making our framework compatible with any large pretrained and self-supervised model. Despite its simplicity, HUME outperforms a supervised linear classifier on top of self-supervised representations on the STL-10 dataset by a large margin and achieves comparable performance on the CIFAR-10 dataset. Compared to the existing unsupervised baselines, HUME achieves state-of-the-art performance on four benchmark image classification datasets including the large-scale ImageNet-1000 dataset. Altogether, our work provides a fundamentally new view to tackle unsupervised learning by searching for consistent labelings between different representation spaces.
    摘要 我们提出 HUME,一个简单的、与模型无关的框架,可以在没有任何外部监督的情况下推断给定数据集的人类标注。HUME 的关键洞见是:由多数人类标注定义的类别,无论使用何种表示空间来表示数据集,都是线性可分的。HUME 利用这一洞见引导对数据集所有可能标注的搜索,以发现潜在的人类标注。我们证明,所提出的优化目标与数据集的真实标注高度相关。实际上,我们仅在训练期间保持固定的预训练表示之上训练线性分类器,因此该框架可以与任何大型预训练和自监督模型兼容。尽管方法简单,HUME 在 STL-10 数据集上大幅超越了基于自监督表示的有监督线性分类器,并在 CIFAR-10 数据集上取得相当的性能。与现有的无监督基线相比,HUME 在四个基准图像分类数据集(包括大规模的 ImageNet-1000 数据集)上取得了最先进的性能。总体而言,我们的工作通过在不同表示空间之间搜索一致的标注,为无监督学习提供了全新的视角。

Edge2Node: Reducing Edge Prediction to Node Classification

  • paper_url: http://arxiv.org/abs/2311.02921
  • repo_url: https://github.com/arahmatiiii/E2N
  • paper_authors: Zahed Rahmati, Ali Rahmati, Dariush Kazemi
  • for: 本研究旨在提高图神经网络模型在边预测任务上的性能。
  • methods: 我们提出了一种新的方法 called E2N (Edge2Node),它直接从图中获取每个边的嵌入,而不需要预定的评分函数。
  • results: 我们在 ogbl-ddi 和 ogbl-collab 数据集上进行实验,并取得了与状态对照方法的比较优秀的性能。在 ogbl-ddi 数据集上,我们在验证集上达到了 Hits@20 分数为 98.79%,并在测试集上达到了 98.11%。在 ogbl-collab 数据集上,我们在验证集上达到了 Hits@50 分数为 95.46%,并在测试集上达到了 95.15%。
    Abstract Despite the success of graph neural network models in node classification, edge prediction (the task of predicting missing or potential relationships between nodes in a graph) remains a challenging problem for these models. A common approach for edge prediction is to first obtain the embeddings of two nodes, and then a predefined scoring function is used to predict the existence of an edge between the two nodes. In this paper, we introduce a new approach called E2N (Edge2Node) which directly obtains an embedding for each edge, without the need for a scoring function. To do this, we create a new graph H based on the graph G given for the edge prediction task, and then reduce the edge prediction task on G to a node classification task on H. Our E2N method can be easily applied to any edge prediction task with superior performance and lower computational costs. For the ogbl-ddi and ogbl-collab datasets, our E2N method outperforms the state-of-the-art methods listed on the leaderboards. Our experiments on the ogbl-ddi dataset achieved a Hits@20 score of 98.79% on the validation set and 98.11% on the test set. On the ogbl-collab dataset, we achieved a Hits@50 score of 95.46% on the validation set and 95.15% on the test set.
    摘要 尽管图神经网络模型在节点分类任务上表现出色,但边预测(预测图中缺失或潜在的边关系)对这些模型而言仍是一个挑战。一种常见的边预测方法是先获取两个节点的嵌入,再使用预定义的评分函数预测两节点之间是否存在边。在这篇论文中,我们提出了一种名为 E2N(Edge2Node)的新方法,它直接为每条边获取嵌入,而不需要评分函数。为此,我们基于边预测任务给定的图 G 构建一个新图 H,然后将 G 上的边预测任务归约为 H 上的节点分类任务。我们的 E2N 方法可以轻松应用于任何边预测任务,并且性能更高、计算成本更低。在 ogbl-ddi 和 ogbl-collab 数据集上,E2N 方法超越了排行榜上列出的最先进方法:在 ogbl-ddi 数据集上,我们在验证集上达到 98.79% 的 Hits@20 分数,在测试集上达到 98.11%;在 ogbl-collab 数据集上,我们在验证集上达到 95.46% 的 Hits@50 分数,在测试集上达到 95.15%。
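
The core reduction can be sketched with NetworkX: build a graph H whose nodes are candidate node pairs of G, connected whenever two pairs share an endpoint, so that edge prediction on G becomes node classification on H. Details such as how negative candidates are drawn are illustrative assumptions rather than the paper's exact construction.

```python
import networkx as nx
import random

def edge2node_graph(G, num_negatives=None, seed=0):
    """Reduce edge prediction on G to node classification on H: each
    candidate node pair of G becomes a node of H (label 1 if the edge
    exists, 0 otherwise); two H-nodes are adjacent when the corresponding
    pairs share an endpoint in G."""
    rng = random.Random(seed)
    nodes = list(G.nodes())
    positives = [tuple(sorted(e)) for e in G.edges()]
    num_negatives = num_negatives or len(positives)
    negatives = set()
    while len(negatives) < num_negatives:  # sample non-edges as negatives
        u, v = rng.sample(nodes, 2)
        if not G.has_edge(u, v):
            negatives.add(tuple(sorted((u, v))))
    H = nx.Graph()
    for pair in positives:
        H.add_node(pair, label=1)
    for pair in negatives:
        H.add_node(pair, label=0)
    pairs = positives + sorted(negatives)
    for i, p in enumerate(pairs):  # connect pairs sharing an endpoint
        for q in pairs[i + 1:]:
            if set(p) & set(q):
                H.add_edge(p, q)
    return H

H = edge2node_graph(nx.karate_club_graph())
```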

Distributed Matrix-Based Sampling for Graph Neural Network Training

  • paper_url: http://arxiv.org/abs/2311.02909
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Alok Tripathy, Katherine Yelick, Aydin Buluc
  • for: 这篇论文的主要贡献是若干降低分布式 GNN 训练中采样阶段通信量的新方法。
  • methods: 我们提出了一种基于矩阵的批量采样方法,将采样表示为稀疏矩阵乘法(SpGEMM),可以一次性采样多个小批量。当输入图的拓扑无法放入单个设备内存时,我们将图分布到多个设备上,使用避免通信的 SpGEMM 算法来扩展 GNN 小批量采样,以便在更大的图上训练。
  • results: 实验表明,在 128 个 GPU 上,对最大的 Open Graph Benchmark(OGB)数据集,我们的流水线比 Quiver(PyTorch-Geometric 的分布式扩展)快 2.5 倍;在 OGB 之外的数据集上,单轮训练时间获得了 8.46 倍的加速。最后,我们还展示了图分布在多个 GPU 上时的可扩展性,以及逐节点与逐层两种采样算法的可扩展性。
    Abstract The primary contribution of this paper is new methods for reducing communication in the sampling step for distributed GNN training. Here, we propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and use communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on much larger graphs than those that can fit into a single device memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms by simply using different matrix constructions. In addition to new methods for sampling, we show that judiciously replicating feature data with a simple all-to-all exchange can outperform current methods for the feature extraction step in distributed GNN training. We provide experimental results on the largest Open Graph Benchmark (OGB) datasets on $128$ GPUs, and show that our pipeline is $2.5\times$ faster Quiver (a distributed extension to PyTorch-Geometric) on a $3$-layer GraphSAGE network. On datasets outside of OGB, we show a $8.46\times$ speedup on $128$ GPUs in-per epoch time. Finally, we show scaling when the graph is distributed across GPUs and scaling for both node-wise and layer-wise sampling algorithms
    摘要 本文的主要贡献是一系列用于降低分布式 GNN 训练中采样阶段通信量的新方法。我们提出一种基于矩阵的批量采样方法,将采样表示为稀疏矩阵乘法(SpGEMM),并一次性采样多个小批量。当输入图的拓扑无法放入单个设备内存时,我们将图分布到多个设备上,并使用避免通信的 SpGEMM 算法来扩展 GNN 小批量采样,从而可以在远大于单设备内存容量的图上训练 GNN。当输入图的拓扑(而非嵌入)可以放入单个 GPU 内存时,我们的方法(1)无需通信即可完成采样,(2)摊销了采样一个小批量的开销,(3)只需使用不同的矩阵构造即可表示多种采样算法。除了新的采样方法之外,我们还证明,通过一次简单的 all-to-all 交换来恰当地复制特征数据,可以优于现有分布式 GNN 训练中特征提取步骤的方法。我们在 128 个 GPU 上对最大的 Open Graph Benchmark(OGB)数据集进行了实验,结果显示在 3 层 GraphSAGE 网络上,我们的流水线比 Quiver(PyTorch-Geometric 的分布式扩展)快 2.5 倍。在 OGB 之外的数据集上,我们在 128 个 GPU 上获得了 8.46 倍的单轮训练时间加速。最后,我们还展示了图分布在多个 GPU 上时的可扩展性,以及逐节点和逐层两种采样算法的可扩展性。
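
The matrix view of minibatch sampling can be sketched with SciPy on a single node: a batch-selection matrix Q (one row per seed vertex) multiplied by the sparse adjacency A yields, in one SpGEMM, the candidate one-hop neighborhoods of the whole minibatch, which are then subsampled. The distributed, communication-avoiding aspects of the paper are beyond this sketch.

```python
import numpy as np
import scipy.sparse as sp

def bulk_sample(A, seeds, fanout, rng):
    """One-hop neighborhood sampling for a whole minibatch as a single
    sparse matrix product (SpGEMM), followed by per-row subsampling."""
    n = A.shape[0]
    # Q selects the seed rows: Q[i, seeds[i]] = 1.
    Q = sp.csr_matrix(
        (np.ones(len(seeds)), (np.arange(len(seeds)), seeds)), shape=(len(seeds), n)
    )
    frontier = Q @ A  # SpGEMM: row i holds the neighbors of seeds[i]
    sampled = []
    for i in range(frontier.shape[0]):
        nbrs = frontier.indices[frontier.indptr[i]:frontier.indptr[i + 1]]
        k = min(fanout, len(nbrs))
        sampled.append(rng.choice(nbrs, size=k, replace=False) if k > 0 else nbrs)
    return sampled

rng = np.random.default_rng(0)
A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
batch = bulk_sample(A, seeds=np.array([3, 7, 42]), fanout=5, rng=rng)
```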

HDGL: A hierarchical dynamic graph representation learning model for brain disorder classification

  • paper_url: http://arxiv.org/abs/2311.02903
  • repo_url: None
  • paper_authors: Parniyan Jalali, Mehran Safayani
  • for: 本研究旨在提出一种层次动态图表示学习(hierarchical dynamic graph representation learning,HDGL)模型,用于区分脑疾病样本与健康样本。
  • methods: 该模型包括两个层次:第一层构建大脑网络图,并学习其空间与时间嵌入;第二层构建人群图,并在嵌入学习之后进行分类。此外,根据这两个层次的训练方式,提出了四种方法,其中一些可以降低内存复杂度。
  • results: 在 ABIDE 和 ADHD-200 数据集上的评估结果表明,HDGL 模型在多种评估指标上优于多个最先进模型。
    Abstract The human brain can be considered as complex networks, composed of various regions that continuously exchange their information with each other, forming the brain network graph, from which nodes and edges are extracted using resting-state functional magnetic resonance imaging (rs-fMRI). Therefore, this graph can potentially depict abnormal patterns that have emerged under the influence of brain disorders. So far, numerous studies have attempted to find embeddings for brain network graphs and subsequently classify samples with brain disorders from healthy ones, which include limitations such as: not considering the relationship between samples, not utilizing phenotype information, lack of temporal analysis, using static functional connectivity (FC) instead of dynamic ones and using a fixed graph structure. We propose a hierarchical dynamic graph representation learning (HDGL) model, which is the first model designed to address all the aforementioned challenges. HDGL consists of two levels, where at the first level, it constructs brain network graphs and learns their spatial and temporal embeddings, and at the second level, it forms population graphs and performs classification after embedding learning. Furthermore, based on how these two levels are trained, four methods have been introduced, some of which are suggested for reducing memory complexity. We evaluated the performance of the proposed model on the ABIDE and ADHD-200 datasets, and the results indicate the improvement of this model compared to several state-of-the-art models in terms of various evaluation metrics.
    摘要 人脑可以被视为一个复杂网络,由多个不断相互交换信息的区域组成,从而形成大脑网络图;其节点与边可以通过静息态功能磁共振成像(rs-fMRI)提取。因此,该图有望刻画在脑部疾病影响下出现的异常模式。迄今为止,许多研究尝试为大脑网络图寻找嵌入,进而将脑疾病样本与健康样本区分开来,但这些研究存在一些局限,例如:不考虑样本之间的关系、不利用表型信息、缺乏时间分析、使用静态功能连接(FC)而非动态连接,以及使用固定的图结构。我们提出了层次动态图表示学习(HDGL)模型,这是首个旨在解决上述全部挑战的模型。HDGL 包含两个层次:在第一层,构建大脑网络图并学习其空间与时间嵌入;在第二层,构建人群图,并在嵌入学习之后进行分类。此外,根据这两个层次的训练方式,我们提出了四种方法,其中一些可以降低内存复杂度。我们在 ABIDE 和 ADHD-200 数据集上评估了所提模型的性能,结果表明,该模型在多种评估指标上优于多个最先进模型。

Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

  • paper_url: http://arxiv.org/abs/2311.02898
  • repo_url: None
  • paper_authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Dongjune Lee, Nam Soo Kim
  • for: 这篇论文旨在提出一个基于神经转录器(neural transducer)的文本到语音(TTS)框架。
  • methods: 该论文使用从 wav2vec2.0 嵌入中获得的离散化语义 token,通过神经转录器生成对齐的语义 token,再使用非自回归(NAR)语音生成器合成语音。
  • results: 实验结果表明,所提出的模型在零样本自适应 TTS 中超越了基线,在语音质量和说话人相似度方面均有显著提升。此外,模型的推理速度和韵律可控性也得到了验证。
    Abstract We introduce a text-to-speech(TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive(NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on the zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity via objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.
    摘要 我们提出一个基于神经转录器(neural transducer)的文本到语音(TTS)框架。我们使用从 wav2vec2.0 嵌入中获得的离散化语义 token,这使得神经转录器可以方便地应用于 TTS 框架,并受益于其单调对齐约束。所提出的模型首先使用神经转录器生成对齐的语义 token,然后使用非自回归(NAR)语音生成器从语义 token 合成语音样本。这种解耦框架降低了 TTS 的训练复杂度,使每个阶段可以分别专注于 1)语言与对齐建模和 2)细粒度声学建模。在零样本自适应 TTS 上的实验结果表明,所提出的模型在客观和主观指标上的语音质量和说话人相似度均超过了基线。我们还考察了所提模型的推理速度和韵律可控性,展示了神经转录器在 TTS 框架中的潜力。

AdaFlood: Adaptive Flood Regularization

  • paper_url: http://arxiv.org/abs/2311.02891
  • repo_url: None
  • paper_authors: Wonho Bae, Yi Ren, Mohamad Osama Ahmed, Frederick Tung, Danica J. Sutherland, Gabriel L. Oliveira
  • for: 提高模型的测试时泛化能力
  • methods: 使用适应式洪水训练法,根据样本的难度动态调整训练损失的目标值
  • results: 在文本、图像、异步事件序列和表格等多种输入模式下,经验表明 AdaFlood 可以强大地适应不同的数据领域和噪声水平
    Abstract Although neural networks are conventionally optimized towards zero training loss, it has been recently learned that targeting a non-zero training loss threshold, referred to as a flood level, often enables better test time generalization. Current approaches, however, apply the same constant flood level to all training samples, which inherently assumes all the samples have the same difficulty. We present AdaFlood, a novel flood regularization method that adapts the flood level of each training sample according to the difficulty of the sample. Intuitively, since training samples are not equal in difficulty, the target training loss should be conditioned on the instance. Experiments on datasets covering four diverse input modalities - text, images, asynchronous event sequences, and tabular - demonstrate the versatility of AdaFlood across data domains and noise levels.
    摘要 尽管神经网络通常朝着零训练损失进行优化,但最近的研究发现,以一个非零的训练损失阈值(即洪水水平)为目标,往往可以提高测试时的泛化能力。然而,现有方法对所有训练样本使用相同的恒定洪水水平,这实际上假设所有样本的难度相同。我们提出 AdaFlood,一种新的洪水正则化方法,可以根据每个训练样本的难度来自适应其洪水水平。直观地说,由于训练样本的难度并不相同,目标训练损失应当依赖于具体实例。在覆盖四种不同输入模态(文本、图像、异步事件序列和表格)的数据集上的实验表明,AdaFlood 在不同数据领域和噪声水平上均具有广泛的适用性。
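
The flood mechanic itself is compact: with flood level b, the flooded loss is |L - b| + b, and AdaFlood makes b per-sample. The PyTorch sketch below uses an illustrative mapping from a difficulty score in [0, 1] to the flood level; the authors' exact rule for deriving per-sample flood levels is not reproduced here.

```python
import torch

def adaptive_flood_loss(per_sample_loss, difficulty, max_flood=0.5):
    """Flooded training loss with per-sample flood levels.

    per_sample_loss: tensor of shape (batch,), e.g. unreduced cross-entropy.
    difficulty: tensor in [0, 1]; harder samples get a higher flood level
                (an illustrative mapping, not the paper's exact one).
    """
    b = max_flood * difficulty                 # per-sample flood level b_i
    flooded = (per_sample_loss - b).abs() + b  # |L_i - b_i| + b_i
    return flooded.mean()

logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss_i = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
loss = adaptive_flood_loss(loss_i, difficulty=torch.rand(8))
loss.backward()
```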

MultiSPANS: A Multi-range Spatial-Temporal Transformer Network for Traffic Forecast via Structural Entropy Optimization

  • paper_url: http://arxiv.org/abs/2311.02880
  • repo_url: https://github.com/selgroup/multispans
  • paper_authors: Dongcheng Zou, Senzhang Wang, Xuefeng Li, Hao Peng, Yuandong Wang, Chunyang Liu, Kehua Sheng, Bo Zhang
  • for: 交通预测这一多变量时间序列回归任务是交通管理与规划中的核心问题,但现有方法往往无法建模复杂的多范围依赖关系。
  • methods: 我们提出 MultiSPANS,它包括多滤波卷积模块(multi-filter convolution modules)、ST-token 嵌入,并利用 Transformer 捕捉长程的时间与空间依赖关系。此外,我们引入结构熵理论来优化空间注意力机制。
  • results: 我们的方法在真实世界交通数据集上与多个最先进方法进行比较,展现出优越性,并能有效利用更长的历史窗口。代码可在 https://github.com/SELGroup/MultiSPANS 获取。
    Abstract Traffic forecasting is a complex multivariate time-series regression task of paramount importance for traffic management and planning. However, existing approaches often struggle to model complex multi-range dependencies using local spatiotemporal features and road network hierarchical knowledge. To address this, we propose MultiSPANS. First, considering that an individual recording point cannot reflect critical spatiotemporal local patterns, we design multi-filter convolution modules for generating informative ST-token embeddings to facilitate attention computation. Then, based on ST-token and spatial-temporal position encoding, we employ the Transformers to capture long-range temporal and spatial dependencies. Furthermore, we introduce structural entropy theory to optimize the spatial attention mechanism. Specifically, The structural entropy minimization algorithm is used to generate optimal road network hierarchies, i.e., encoding trees. Based on this, we propose a relative structural entropy-based position encoding and a multi-head attention masking scheme based on multi-layer encoding trees. Extensive experiments demonstrate the superiority of the presented framework over several state-of-the-art methods in real-world traffic datasets, and the longer historical windows are effectively utilized. The code is available at https://github.com/SELGroup/MultiSPANS.
    摘要 交通预测是一项复杂的多变量时间序列回归任务,对交通管理与规划具有极高的重要性。然而,现有方法往往难以利用局部时空特征和路网层次知识来建模复杂的多范围依赖关系。为解决这一问题,我们提出 MultiSPANS。首先,考虑到单个记录点无法反映关键的时空局部模式,我们设计了多滤波卷积模块,用于生成信息丰富的 ST-token 嵌入,以便于注意力计算。然后,基于 ST-token 与时空位置编码,我们采用 Transformer 来捕捉长程的时间与空间依赖关系。此外,我们引入结构熵理论来优化空间注意力机制。具体来说,我们使用结构熵最小化算法来生成最优的路网层次结构,即编码树;在此基础上,我们提出了一种基于相对结构熵的位置编码,以及一种基于多层编码树的多头注意力掩码方案。大量实验表明,所提框架在真实世界交通数据集上优于多种最先进方法,并且能够有效利用更长的历史窗口。代码可在 https://github.com/SELGroup/MultiSPANS 获取。

Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling

  • paper_url: http://arxiv.org/abs/2311.02879
  • repo_url: None
  • paper_authors: Wonho Bae, Jing Wang, Danica J. Sutherland
  • for: 本文针对meta-learning方法中的active learning问题进行研究,并提出了一种基于 Gaussian mixture 的选择点标注算法。
  • methods: 本文使用了active meta-learning方法,其中在meta-learning过程中选择点标注的部分使用了active learning。提出了一种基于 Gaussian mixture 的选择点标注算法,该算法简单且具有理论基础。
  • results: 在多个基准数据集上的测试表明,该算法优于其他最先进的主动学习方法,显示了其有效性。
    Abstract Most meta-learning methods assume that the (very small) context set used to establish a new task at test time is passively provided. In some settings, however, it is feasible to actively select which points to label; the potential gain from a careful choice is substantial, but the setting requires major differences from typical active learning setups. We clarify the ways in which active meta-learning can be used to label a context set, depending on which parts of the meta-learning process use active learning. Within this framework, we propose a natural algorithm based on fitting Gaussian mixtures for selecting which points to label; though simple, the algorithm also has theoretical motivation. The proposed algorithm outperforms state-of-the-art active learning methods when used with various meta-learning algorithms across several benchmark datasets.
    摘要 大多数元学习方法假设,测试时用于建立新任务的(非常小的)上下文集是被动提供的。然而,在某些场景下,可以主动选择要标注的点;精心选择带来的潜在收益相当可观,但该设定与典型的主动学习设置存在重大差异。我们阐明了在元学习过程的不同环节使用主动学习来标注上下文集的方式。在此框架下,我们提出一种基于拟合高斯混合模型来选择标注点的自然算法;该算法虽然简单,但也有理论依据。在多个基准数据集上与多种元学习算法搭配使用时,所提算法均优于最先进的主动学习方法。
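
One natural reading of the Gaussian-mixture selection step is sketched below: fit a mixture with as many components as the labeling budget and query the sample nearest to each component mean, so that the labeled context set covers the modes of the data. Tie-breaking and budget handling are assumptions, not the paper's specification.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_points_to_label(X, budget, seed=0):
    """Pick `budget` indices to label by fitting a Gaussian mixture and
    taking the sample nearest to each component mean."""
    gmm = GaussianMixture(n_components=budget, random_state=seed).fit(X)
    chosen = []
    for mu in gmm.means_:
        d = np.linalg.norm(X - mu, axis=1)
        d[chosen] = np.inf  # avoid picking the same sample twice
        chosen.append(int(np.argmin(d)))
    return chosen

X = np.random.default_rng(0).normal(size=(200, 16))  # e.g. frozen embeddings
query_idx = select_points_to_label(X, budget=5)
```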

Sample Complexity Bounds for Estimating Probability Divergences under Invariances

  • paper_url: http://arxiv.org/abs/2311.02868
  • repo_url: None
  • paper_authors: Behrooz Tahmasebi, Stefanie Jegelka
  • for: 这篇论文研究如何利用李群(Lie group)作用下的内在不变性来提高数据生成模型中散度估计的样本效率。
  • methods: 该论文研究了在李群不变性下估计 Wasserstein 距离、Sobolev 积分概率度量(Sobolev IPMs)、最大均值差异(MMD)以及密度估计问题(在 $L^2$ 与 $L^\infty$ 距离下)的样本复杂度。
  • results: 结果表明,利用不变性可以获得双重收益:其一,样本复杂度按一个乘性因子降低,该因子对应群的大小(对有限群)或商空间的归一化体积(对正维数群);其二,收敛速率的指数得到改进(对正维数群)。这些结果对正维数群而言是全新的,并推广了近期关于有限群作用的界。
    Abstract Group-invariant probability distributions appear in many data-generative models in machine learning, such as graphs, point clouds, and images. In practice, one often needs to estimate divergences between such distributions. In this work, we study how the inherent invariances, with respect to any smooth action of a Lie group on a manifold, improve sample complexity when estimating the Wasserstein distance, the Sobolev Integral Probability Metrics (Sobolev IPMs), the Maximum Mean Discrepancy (MMD), and also the complexity of the density estimation problem (in the $L^2$ and $L^\infty$ distance). Our results indicate a two-fold gain: (1) reducing the sample complexity by a multiplicative factor corresponding to the group size (for finite groups) or the normalized volume of the quotient space (for groups of positive dimension); (2) improving the exponent in the convergence rate (for groups of positive dimension). These results are completely new for groups of positive dimension and extend recent bounds for finite group actions.
    摘要 群不变的概率分布出现在机器学习的许多数据生成模型中,例如图、点云和图像。在实践中,我们常常需要估计这类分布之间的散度。在本工作中,我们研究了相对于李群在流形上的任意光滑作用的内在不变性,如何改善估计 Wasserstein 距离、Sobolev 积分概率度量(Sobolev IPMs)、最大均值差异(MMD)时的样本复杂度,以及密度估计问题(在 $L^2$ 与 $L^\infty$ 距离下)的复杂度。我们的结果表明有双重收益:(1)样本复杂度按一个乘性因子降低,该因子对应群的大小(对有限群)或商空间的归一化体积(对正维数群);(2)收敛速率的指数得到改进(对正维数群)。这些结果对正维数群而言是全新的,并推广了近期关于有限群作用的界。

Barron Space for Graph Convolution Neural Networks

  • paper_url: http://arxiv.org/abs/2311.02838
  • repo_url: None
  • paper_authors: Seok-Young Chung, Qiyu Sun
  • for: 本研究旨在探讨图域上的图卷积神经网络(GCNN)的逼近性能和可学习性。
  • methods: 本文在图信号的紧致区域上引入 Barron 空间,并证明该空间是一个再生核巴拿赫空间,可分解为一族带神经元核的再生核希尔伯特空间的并,且在该区域上的连续函数空间中稠密。
  • results: 本文证明 GCNN 的输出包含于 Barron 空间之中,且 Barron 空间中的函数可以在积分平方与一致度量下被某些 GCNN 的输出良好逼近。此外,本文估计了 Barron 范数有界的函数类的 Rademacher 复杂度,并得出 Barron 空间中的函数可以从其随机样本中被高效学习的结论。
    Abstract Graph convolutional neural network (GCNN) operates on graph domain and it has achieved a superior performance to accomplish a wide range of tasks. In this paper, we introduce a Barron space of functions on a compact domain of graph signals. We prove that the proposed Barron space is a reproducing kernel Banach space, it can be decomposed into the union of a family of reproducing kernel Hilbert spaces with neuron kernels, and it could be dense in the space of continuous functions on the domain. Approximation property is one of the main principles to design neural networks. In this paper, we show that outputs of GCNNs are contained in the Barron space and functions in the Barron space can be well approximated by outputs of some GCNNs in the integrated square and uniform measurements. We also estimate the Rademacher complexity of functions with bounded Barron norm and conclude that functions in the Barron space could be learnt from their random samples efficiently.
    摘要 图卷积神经网络(GCNN)作用于图域之上,并在完成多种任务时取得了优异的表现。本文在图信号的紧致区域上引入一个 Barron 函数空间。我们证明所提出的 Barron 空间是一个再生核巴拿赫空间,它可以分解为一族带神经元核的再生核希尔伯特空间的并,并且在该区域上的连续函数空间中稠密。逼近性质是设计神经网络的主要原则之一。本文证明 GCNN 的输出包含于 Barron 空间之中,且 Barron 空间中的函数可以在积分平方与一致度量下被某些 GCNN 的输出良好逼近。我们还估计了 Barron 范数有界的函数类的 Rademacher 复杂度,并得出 Barron 空间中的函数可以从其随机样本中被高效学习的结论。
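
For orientation, the classical Barron-type space over a Euclidean domain can be written as follows; the paper develops the analogous construction for graph signals with neuron kernels, which is not reproduced here.

```latex
% Background: the classical Barron-type space on a Euclidean domain
% (not the paper's graph construction). Functions admitting
f(x) = \int_{\Omega} a \,\sigma\big(\langle w, x\rangle + b\big)\, \mathrm{d}\pi(a, w, b),
% for some probability measure \pi over parameters (a, w, b), with norm
\|f\|_{\mathcal{B}} = \inf_{\pi}\; \mathbb{E}_{\pi}\!\left[\,|a|\,\big(\|w\|_{1} + |b|\big)\right],
% the infimum taken over all measures \pi realizing f.
```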

Prioritized Propagation in Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.02832
  • repo_url: None
  • paper_authors: Yao Cheng, Minjie Chen, Xiang Li, Caihua Shan, Ming Gao
  • for: 本研究旨在提高图形神经网络(GNNs)中的节点层次传播学习,以便为不同节点设置个性化传播步骤。
  • methods: 本研究提出了一个通用框架 PPro,可以与现有的大多数 GNN 模型集成,用于学习带优先级的节点级消息传播。该框架包括三个组成部分:基础 GNN 模型、用于确定节点最优传播步数的传播控制器,以及用于计算节点优先级分数的权重控制器。我们还提出了一种相互增强机制,用于计算节点优先级、最优传播步数和标签预测。
  • results: 我们在 8 个基准数据集上与其他 11 个最先进方法进行了广泛的对比实验,发现我们的框架在传播策略和节点表示方面都能取得更优的性能。
    Abstract Graph neural networks (GNNs) have recently received significant attention. Learning node-wise message propagation in GNNs aims to set personalized propagation steps for different nodes in the graph. Despite the success, existing methods ignore node priority that can be reflected by node influence and heterophily. In this paper, we propose a versatile framework PPro, which can be integrated with most existing GNN models and aim to learn prioritized node-wise message propagation in GNNs. Specifically, the framework consists of three components: a backbone GNN model, a propagation controller to determine the optimal propagation steps for nodes, and a weight controller to compute the priority scores for nodes. We design a mutually enhanced mechanism to compute node priority, optimal propagation step and label prediction. We also propose an alternative optimization strategy to learn the parameters in the backbone GNN model and two parametric controllers. We conduct extensive experiments to compare our framework with other 11 state-of-the-art competitors on 8 benchmark datasets. Experimental results show that our framework can lead to superior performance in terms of propagation strategies and node representations.
    摘要 图神经网络(GNN)近来受到广泛关注。在 GNN 中学习逐节点的消息传播,目的是为图中不同的节点设置个性化的传播步数。尽管已取得成功,现有方法忽略了可以由节点影响力与异配性反映的节点优先级。本文提出一个通用框架 PPro,它可以与大多数现有 GNN 模型集成,用于学习 GNN 中带优先级的逐节点消息传播。具体而言,该框架由三个组件构成:一个骨干 GNN 模型、一个用于确定各节点最优传播步数的传播控制器,以及一个用于计算节点优先级分数的权重控制器。我们设计了一种相互增强机制,用于计算节点优先级、最优传播步数和标签预测。我们还提出一种交替优化策略,用于学习骨干 GNN 模型和两个参数化控制器中的参数。我们进行了大量实验,在 8 个基准数据集上将该框架与其他 11 个最先进的竞争方法进行比较。实验结果表明,我们的框架在传播策略和节点表示方面都能取得更优的性能。

On Subagging Boosted Probit Model Trees

  • paper_url: http://arxiv.org/abs/2311.02827
  • repo_url: None
  • paper_authors: Tian Qin, Wei-Min Huang
  • for: 这篇论文的目的是提出一种新的混合 bagging-boosting 算法(SBPMT),以便在分类问题上获得更好的预测。
  • methods: 该算法基于 variance-bias decomposition 的思想,提出了一种新的树模型,即概率单位模型树(Probit Model Tree,PMT),作为 AdaBoost 过程中的基分类器。在 bagging 部分,不同于传统的在每步 boosting 时从数据集中抽样,我们在每个 subagged 数据集上训练 boosted PMT,并将它们组合成一个强大的"委员会",它可以被视为一个不完全 U-统计量。
  • results: 理论分析显示:1)在一定假设下,SBPMT 是一致的;2)增加 subagging 次数可以在一定程度上降低 SBPMT 的泛化误差;3)PMT 中较多的 ProbitBoost 迭代可以让 SBPMT 在 AdaBoost 部分用更少的步数取得更好的表现。这三个性质都通过 Mease 和 Wyner(2008)设计的著名模拟得到了验证,后两点也为模型调参提供了有用的指导。与其他现有分类方法的比较表明,SBPMT 总体具有有竞争力的预测能力,并在部分情况下表现显著更好。
    Abstract With the insight of variance-bias decomposition, we design a new hybrid bagging-boosting algorithm named SBPMT for classification problems. For the boosting part of SBPMT, we propose a new tree model called Probit Model Tree (PMT) as base classifiers in AdaBoost procedure. For the bagging part, instead of subsampling from the dataset at each step of boosting, we perform boosted PMTs on each subagged dataset and combine them into a powerful "committee", which can be viewed an incomplete U-statistic. Our theoretical analysis shows that (1) SBPMT is consistent under certain assumptions, (2) Increase the subagging times can reduce the generalization error of SBPMT to some extent and (3) Large number of ProbitBoost iterations in PMT can benefit the performance of SBPMT with fewer steps in the AdaBoost part. Those three properties are verified by a famous simulation designed by Mease and Wyner (2008). The last two points also provide a useful guidance in model tuning. A comparison of performance with other state-of-the-art classification methods illustrates that the proposed SBPMT algorithm has competitive prediction power in general and performs significantly better in some cases.
    摘要 基于方差-偏差分解(variance-bias decomposition)的视角,我们为分类问题设计了一种新的混合 bagging-boosting 算法,命名为 SBPMT。在 SBPMT 的 boosting 部分,我们提出一种新的树模型,即概率单位模型树(Probit Model Tree,PMT),作为 AdaBoost 过程中的基分类器。在 bagging 部分,不同于在每步 boosting 时从数据集中抽样,我们在每个 subagged 数据集上训练 boosted PMT,并将它们组合成一个强大的"委员会",它可以被视为一个不完全 U-统计量。我们的理论分析表明:(1)在一定假设下,SBPMT 是一致的;(2)增加 subagging 次数可以在一定程度上降低 SBPMT 的泛化误差;(3)PMT 中较多的 ProbitBoost 迭代可以使 SBPMT 在 AdaBoost 部分以更少的步数获得更好的性能。这三个性质都通过 Mease 和 Wyner(2008)设计的著名模拟得到了验证,后两点也为模型调参提供了有用的指导。与其他最先进分类方法的性能比较表明,所提出的 SBPMT 算法总体具有有竞争力的预测能力,并在某些情况下表现显著更好。

Signal Processing Meets SGD: From Momentum to Filter

  • paper_url: http://arxiv.org/abs/2311.02818
  • repo_url: None
  • paper_authors: Zhipeng Yao, Guisong Chang, Jiaqi Zhang, Qi Zhang, Yu Zhang, Dazhou Li
  • for: 这篇论文旨在探讨降低历史梯度方差对当前梯度估计的潜在益处,以提升深度学习优化器的性能。
  • methods: 该论文提出一种基于降低方差的新优化方法,并利用维纳滤波(Wiener filter)理论来增强 SGD 的一阶矩估计。具体来说,自适应权重会随着深度学习模型训练过程中梯度方差的时间波动而动态变化。
  • results: 实验结果表明,所提出的自适应权重优化器 SGDF(Stochastic Gradient Descent With Filter)与最先进的优化器相比,可以取得令人满意的性能。
    Abstract In the field of deep learning, Stochastic Gradient Descent (SGD) and its momentum-based variants are the predominant choices for optimization algorithms. Despite all that, these momentum strategies, which accumulate historical gradients by using a fixed $\beta$ hyperparameter to smooth the optimization processing, often neglect the potential impact of the variance of historical gradients on the current gradient estimation. In the gradient variance during training, fluctuation indicates the objective function does not meet the Lipschitz continuity condition at all time, which raises the troublesome optimization problem. This paper aims to explore the potential benefits of reducing the variance of historical gradients to make optimizer converge to flat solutions. Moreover, we proposed a new optimization method based on reducing the variance. We employed the Wiener filter theory to enhance the first moment estimation of SGD, notably introducing an adaptive weight to optimizer. Specifically, the adaptive weight dynamically changes along with temporal fluctuation of gradient variance during deep learning model training. Experimental results demonstrated our proposed adaptive weight optimizer, SGDF (Stochastic Gradient Descent With Filter), can achieve satisfactory performance compared with state-of-the-art optimizers.
    摘要 在深度学习领域,随机梯度下降(SGD)及其基于动量的变体是主流的优化算法。然而,这些动量策略使用固定的超参数 $\beta$ 来累积历史梯度以平滑优化过程,往往忽视了历史梯度的方差对当前梯度估计的潜在影响。训练过程中梯度方差的波动表明目标函数并非始终满足 Lipschitz 连续性条件,这带来了棘手的优化问题。本文旨在探讨降低历史梯度方差的潜在益处,使优化器收敛到平坦解。此外,我们提出了一种基于降低方差的新优化方法。我们利用维纳滤波(Wiener filter)理论来增强 SGD 的一阶矩估计,特别地为优化器引入了一个自适应权重。具体来说,该自适应权重会随着深度学习模型训练过程中梯度方差的时间波动而动态变化。实验结果表明,我们提出的自适应权重优化器 SGDF(Stochastic Gradient Descent With Filter)与最先进的优化器相比可以取得令人满意的性能。
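
A toy sketch of the filtering idea: keep running estimates of the gradient's mean and variance and blend the current gradient with the historical mean through a Wiener-style gain, so that noisier gradients lean more on history. The update below is illustrative only and is not the authors' exact SGDF estimator.

```python
import numpy as np

class WienerFilteredSGD:
    """Toy first-moment filter for SGD (illustrative, not the paper's exact
    SGDF update): blend the current gradient with a running mean using a
    Wiener-style gain k = var(signal) / (var(signal) + var(noise))."""

    def __init__(self, lr=0.1, beta=0.9, eps=1e-8):
        self.lr, self.beta, self.eps = lr, beta, eps
        self.mean = None   # running gradient mean (signal estimate)
        self.var = None    # running gradient variance (noise estimate)

    def step(self, params, grad):
        if self.mean is None:
            self.mean = np.zeros_like(grad)
            self.var = np.ones_like(grad)
        self.var = self.beta * self.var + (1 - self.beta) * (grad - self.mean) ** 2
        signal = self.mean ** 2
        k = signal / (signal + self.var + self.eps)   # adaptive Wiener gain
        filtered = self.mean + k * (grad - self.mean)  # filtered first moment
        self.mean = self.beta * self.mean + (1 - self.beta) * grad
        return params - self.lr * filtered

opt = WienerFilteredSGD()
w = np.zeros(3)
rng = np.random.default_rng(0)
for _ in range(100):  # minimize ||w - 1||^2 with noisy gradients
    g = 2 * (w - 1.0) + rng.normal(scale=0.5, size=3)
    w = opt.step(w, g)
```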

APGL4SR: A Generic Framework with Adaptive and Personalized Global Collaborative Information in Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2311.02816
  • repo_url: https://github.com/graph-team/apgl4sr
  • paper_authors: Mingjia Yin, Hao Wang, Xiang Xu, Likang Wu, Sirui Zhao, Wei Guo, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen
  • for: 这篇论文的目的是提出一个基于图的推荐系统框架,以提高序列推荐的效果。
  • methods: 这篇论文提出了一个名为 Adaptive and Personalized Graph Learning for Sequential Recommendation(APGL4SR)的图驱动框架,利用自适应且个性化的全局协同信息来提升序列推荐的性能。
  • results: 论文的结果表明,APGL4SR 能以显著优势超越其他基线方法,取得更好的推荐性能。
    Abstract The sequential recommendation system has been widely studied for its promising effectiveness in capturing dynamic preferences buried in users' sequential behaviors. Despite the considerable achievements, existing methods usually focus on intra-sequence modeling while overlooking exploiting global collaborative information by inter-sequence modeling, resulting in inferior recommendation performance. Therefore, previous works attempt to tackle this problem with a global collaborative item graph constructed by pre-defined rules. However, these methods neglect two crucial properties when capturing global collaborative information, i.e., adaptiveness and personalization, yielding sub-optimal user representations. To this end, we propose a graph-driven framework, named Adaptive and Personalized Graph Learning for Sequential Recommendation (APGL4SR), that incorporates adaptive and personalized global collaborative information into sequential recommendation systems. Specifically, we first learn an adaptive global graph among all items and capture global collaborative information with it in a self-supervised fashion, whose computational burden can be further alleviated by the proposed SVD-based accelerator. Furthermore, based on the graph, we propose to extract and utilize personalized item correlations in the form of relative positional encoding, which is a highly compatible manner of personalizing the utilization of global collaborative information. Finally, the entire framework is optimized in a multi-task learning paradigm, thus each part of APGL4SR can be mutually reinforced. As a generic framework, APGL4SR can outperform other baselines with significant margins. The code is available at https://github.com/Graph-Team/APGL4SR.
    摘要 序列推荐系统因其能够捕捉用户序列行为中蕴含的动态偏好而得到广泛研究。尽管已取得可观的成果,现有方法通常只关注序列内建模,而忽视了通过序列间建模来利用全局协同信息,导致推荐性能欠佳。因此,已有工作尝试通过预定义规则构建全局协同物品图来解决这一问题。然而,这些方法在捕捉全局协同信息时忽略了两个关键性质,即自适应性与个性化,从而得到次优的用户表示。为此,我们提出一个图驱动的框架 APGL4SR(Adaptive and Personalized Graph Learning for Sequential Recommendation),将自适应且个性化的全局协同信息引入序列推荐系统。具体而言,我们首先以自监督的方式在所有物品之间学习一个自适应的全局图并借此捕捉全局协同信息,其计算负担还可以通过所提出的基于 SVD 的加速器进一步降低。此外,基于该图,我们提出以相对位置编码的形式提取并利用个性化的物品相关性,这是一种高度兼容的个性化利用全局协同信息的方式。最后,整个框架在多任务学习范式下进行优化,使 APGL4SR 的各个部分可以相互增强。作为一个通用框架,APGL4SR 能以显著优势超越其他基线方法。代码可在 https://github.com/Graph-Team/APGL4SR 获取。

On the Intersection of Self-Correction and Trust in Language Models

  • paper_url: http://arxiv.org/abs/2311.02801
  • repo_url: None
  • paper_authors: Satyapriya Krishna
  • for: 这篇论文的目的是调查自我更正能力是否能够提高大型自然语言模型的可靠性。
  • methods: 这篇论文使用了两个关键方面的实验来调查自我更正的效果:一是对当前任务的真实性进行评估,二是对模型的毒害性进行评估。
  • results: 实验结果显示,自我更正可以改善模型的毒害性和真实性,但这些改善的程度因任务的特点和自我更正的形式而异。此外,研究还发现了一些“自我犹豫”现象,需要进一步的解决。
    Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in performing complex cognitive tasks. However, their complexity and lack of transparency have raised several trustworthiness concerns, including the propagation of misinformation and toxicity. Recent research has explored the self-correction capabilities of LLMs to enhance their performance. In this work, we investigate whether these self-correction capabilities can be harnessed to improve the trustworthiness of LLMs. We conduct experiments focusing on two key aspects of trustworthiness: truthfulness and toxicity. Our findings reveal that self-correction can lead to improvements in toxicity and truthfulness, but the extent of these improvements varies depending on the specific aspect of trustworthiness and the nature of the task. Interestingly, our study also uncovers instances of "self-doubt" in LLMs during the self-correction process, introducing a new set of challenges that need to be addressed.
    摘要 大型语言模型(LLMs)在执行复杂认知任务方面展现出了卓越的能力。然而,其复杂性和缺乏透明度引发了诸多可信度方面的担忧,包括错误信息和有害内容的传播。近期研究探索了 LLMs 的自我纠正能力以提升其性能。在本工作中,我们研究这种自我纠正能力能否用于提高 LLMs 的可信度。我们围绕可信度的两个关键方面,即真实性与有害性,开展实验。研究结果表明,自我纠正可以改善模型的有害性和真实性,但改善幅度取决于可信度的具体方面和任务的性质。有趣的是,我们的研究还发现 LLMs 在自我纠正过程中存在"自我怀疑"的现象,这带来了一系列有待解决的新挑战。

eess.IV - 2023-11-06

Auto-ICell: An Accessible and Cost-Effective Integrative Droplet Microfluidic System for Real-Time Single-Cell Morphological and Apoptotic Analysis

  • paper_url: http://arxiv.org/abs/2311.02927
  • repo_url: None
  • paper_authors: Yuanyuan Wei, Meiai Lin, Shanhang Luo, Syed Muhammad Tariq Abbasi, Liwei Tan, Guangyao Cheng, Bijie Bai, Yi-Ping Ho, Scott Wu Yuan, Ho-Pui Ho
  • for: 本研究提出 Auto-ICell 系统,用于实时单细胞分析,包括单细胞形态与凋亡分析。
  • methods: 该系统将 3D 打印微流控芯片与图像分析算法集成于一体化液滴微流控系统中,可以生成尺寸均匀的液滴反应器,并进行实时图像分析。
  • results: 研究表明,Auto-ICell 系统可实现高通量、自动化的单细胞分析,并能定量评估单细胞形态与凋亡。
    Abstract The Auto-ICell system, a novel, and cost-effective integrated droplet microfluidic system, is introduced for real-time analysis of single-cell morphology and apoptosis. This system integrates a 3D-printed microfluidic chip with image analysis algorithms, enabling the generation of uniform droplet reactors and immediate image analysis. The system employs a color-based image analysis algorithm in the bright field for droplet content analysis. Meanwhile, in the fluorescence field, cell apoptosis is quantitatively measured through a combination of deep-learning-enabled multiple fluorescent channel analysis and a live/dead cell stain kit. Breast cancer cells are encapsulated within uniform droplets, with diameters ranging from 70 {\mu}m to 240 {\mu}m, generated at a high throughput of 1,500 droplets per minute. Real-time image analysis results are displayed within 2 seconds on a custom graphical user interface (GUI). The system provides an automatic calculation of the distribution and ratio of encapsulated dyes in the bright field, and in the fluorescent field, cell blebbing and cell circularity are observed and quantified respectively. The Auto-ICell system is non-invasive and provides online detection, offering a robust, time-efficient, user-friendly, and cost-effective solution for single-cell analysis. It significantly enhances the detection throughput of droplet single-cell analysis by reducing setup costs and improving operational performance. This study highlights the potential of the Auto-ICell system in advancing biological research and personalized disease treatment, with promising applications in cell culture, biochemical microreactors, drug carriers, cell-based assays, synthetic biology, and point-of-care diagnostics.
    摘要 本文介绍 Auto-ICell 系统,一种新颖且低成本的一体化液滴微流控系统,用于实时分析单细胞形态与凋亡。该系统将 3D 打印微流控芯片与图像分析算法集成,能够生成均匀的液滴反应器并进行即时图像分析。系统在明场中采用基于颜色的图像分析算法进行液滴内容分析;在荧光场中,通过深度学习驱动的多荧光通道分析结合活/死细胞染色试剂盒,对细胞凋亡进行定量测量。乳腺癌细胞被封装在直径 70 μm 至 240 μm 的均匀液滴中,以每分钟 1,500 个液滴的高通量生成。实时图像分析结果在 2 秒内显示于自定义图形用户界面(GUI)上。系统在明场中自动计算封装染料的分布与比例;在荧光场中,分别观察并定量细胞出泡与细胞圆度。Auto-ICell 系统无创且支持在线检测,为单细胞分析提供了稳健、省时、易用且低成本的解决方案,通过降低搭建成本并提升操作性能,显著提高了液滴单细胞分析的检测通量。本研究展示了 Auto-ICell 系统在推进生物学研究与个性化疾病治疗方面的潜力,在细胞培养、生化微反应器、药物载体、基于细胞的检测、合成生物学和即时检验诊断等方面具有广阔的应用前景。

An invariant feature extraction for multi-modal images matching

  • paper_url: http://arxiv.org/abs/2311.02842
  • repo_url: None
  • paper_authors: Chenzhong Gao, Wei Li
  • for: 本研究旨在提供一种有效的多模式图像不变特征提取和匹配算法,用于多源数据分析。
  • methods: 该算法基于多模态图像之间的差异与相关性,实现了基于特征的匹配。关键技术包括:利用相位一致性(PC)与 Shi-Tomasi 特征点进行关键点检测;利用 LogGabor 滤波器与加权部分主方向图(WPMOM)进行特征提取;以及通过多尺度处理来应对尺度差异并优化匹配结果。
  • results: 实验结果表明,该算法在实际数据上具有良好的普适性和准确性,能够实现多模式图像的准确空间对齐,表明了实际应用价值和良好的泛化能力。
    Abstract This paper aims at providing an effective multi-modal images invariant feature extraction and matching algorithm for the application of multi-source data analysis. Focusing on the differences and correlation of multi-modal images, a feature-based matching algorithm is implemented. The key technologies include phase congruency (PC) and Shi-Tomasi feature point for keypoints detection, LogGabor filter and a weighted partial main orientation map (WPMOM) for feature extraction, and a multi-scale process to deal with scale differences and optimize matching results. The experimental results on practical data from multiple sources prove that the algorithm has effective performances on multi-modal images, which achieves accurate spatial alignment, showing practical application value and good generalization.
    摘要 本文旨在为多源数据分析应用提供一种有效的多模态图像不变特征提取与匹配算法。该算法关注多模态图像之间的差异与相关性,实现了基于特征的匹配。关键技术包括:利用相位一致性(PC)与 Shi-Tomasi 特征点进行关键点检测;利用 LogGabor 滤波器与加权部分主方向图(WPMOM)进行特征提取;以及通过多尺度处理来应对尺度差异并优化匹配结果。在来自多种来源的实际数据上的实验结果表明,该算法在多模态图像上性能有效,能够实现准确的空间对齐,具有实际应用价值和良好的泛化能力。

eess.SP - 2023-11-06

Joint Sparse Estimation with Cardinality Constraint via Mixed-Integer Semidefinite Programming

  • paper_url: http://arxiv.org/abs/2311.03501
  • repo_url: None
  • paper_authors: Tianyi Liu, Frederic Matter, Alexander Sorg, Marc E. Pfetsch, Martin Haardt, Marius Pesavento
  • for: This paper addresses the maximum a posteriori (MAP) estimation for the multiple measurement vectors (MMV) problem, which is a fundamental problem in signal processing applications such as spectral analysis and direction-of-arrival (DOA) estimation.
  • methods: The paper derives an equivalent mixed-integer semidefinite program (MISDP) reformulation of the MAP estimation for the MMV problem, which can be exactly solved by a generic MISDP solver. However, for problems of extremely large dimensions, a relaxation-based approach is employed to obtain an approximate solution with reduced computation time.
  • results: The proposed method demonstrates improved error performance compared to several popular DOA estimation methods, including the deterministic maximum likelihood (DML) estimator. The method also offers a guarantee of finding a global optimum, unlike other nonconvex approaches for the MMV problem.
    Abstract The multiple measurement vectors (MMV) problem refers to the joint estimation of a row-sparse signal matrix from multiple realizations of mixtures with a known dictionary. As a generalization of the standard sparse representation problem for a single measurement, this problem is fundamental in various applications in signal processing, e.g., spectral analysis and direction-of-arrival (DOA) estimation. In this paper, we consider the maximum a posteriori (MAP) estimation for the MMV problem, which is classically formulated as a regularized least-squares (LS) problem with an $\ell_{2,0}$-norm constraint, and derive an equivalent mixed-integer semidefinite program (MISDP) reformulation. The proposed MISDP reformulation can be exactly solved by a generic MISDP solver, which, however, becomes computationally demanding for problems of extremely large dimensions. To further reduce the computation time in such scenarios, a relaxation-based approach can be employed to obtain an approximate solution of the MISDP reformulation, at the expense of a reduced estimation performance. Numerical simulations in the context of DOA estimation demonstrate the improved error performance of our proposed method in comparison to several popular DOA estimation methods. In particular, compared to the deterministic maximum likelihood (DML) estimator, which is often used as a benchmark, the proposed method applied with a state-of-the-art MISDP solver exhibits a superior estimation performance at a significantly reduced running time. Moreover, unlike other nonconvex approaches for the MMV problem, including the greedy methods and the sparse Bayesian learning, the proposed MISDP-based method offers a guarantee of finding a global optimum.
    摘要 多测量向量(MMV)问题是指利用已知字典,从混合信号的多次实现中联合估计一个行稀疏的信号矩阵。作为单次测量下标准稀疏表示问题的推广,该问题是信号处理诸多应用(如频谱分析和到达方向(DOA)估计)中的基础问题。本文考虑 MMV 问题的最大后验(MAP)估计,其经典形式是带 $\ell_{2,0}$ 范数约束的正则化最小二乘(LS)问题,并推导出一个等价的混合整数半定规划(MISDP)重构。所提出的 MISDP 重构可以由通用的 MISDP 求解器精确求解,但在维度极大的问题上计算开销会变得很高。为在此类场景下进一步降低计算时间,可以采用基于松弛的方法得到 MISDP 重构的近似解,代价是估计性能有所下降。在 DOA 估计场景下的数值仿真表明,与多种流行的 DOA 估计方法相比,所提方法具有更优的误差性能。特别地,与常作为基准的确定性最大似然(DML)估计器相比,结合最先进 MISDP 求解器的所提方法在显著缩短运行时间的同时取得了更优的估计性能。此外,不同于 MMV 问题的其他非凸方法(包括贪婪方法和稀疏贝叶斯学习),所提出的基于 MISDP 的方法能够保证找到全局最优解。
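
For reference, the classical formulation that the paper reformulates as a MISDP is the row-sparsity-constrained least-squares problem below (notation assumed: Y the measurements, A the known dictionary, X the row-sparse signal matrix).

```latex
% Row-sparse MMV recovery: Y \in \mathbb{C}^{M \times L} (measurements),
% A \in \mathbb{C}^{M \times N} (dictionary), X \in \mathbb{C}^{N \times L}.
\min_{X} \; \|Y - A X\|_F^2
\quad \text{s.t.} \quad \|X\|_{2,0} \le K,
% where \|X\|_{2,0} counts the rows of X with nonzero \ell_2 norm;
% equivalently, the regularized form \|Y - AX\|_F^2 + \lambda \|X\|_{2,0}.
```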

Resource Allocation for RIS-Empowered Wireless Communications: Low-Complexity and Robust Designs

  • paper_url: http://arxiv.org/abs/2311.03282
  • repo_url: None
  • paper_authors: Ming Zeng, Wanming Hao, Zhangjie Peng, Zheng Chu, Xingwang Li, Changsheng You, Cunhua Pan
  • for: 本文探讨了面向可重构智能表面(RIS)系统的资源分配技术的进展,主要目标是实现低复杂度与鲁棒的解决方案。
  • methods: 本文不仅阐明了低复杂度与鲁棒资源分配技术的基本原理,还提供了具体的数值结果加以说明。
  • results: 研究表明,采用低复杂度且鲁棒的资源分配技术,可以应对 RIS 辅助系统中的硬件损伤,并提升系统性能。
    Abstract This article delves into advancements in resource allocation techniques tailored for systems utilizing reconfigurable intelligent surfaces (RIS), with a primary focus on achieving low-complexity and resilient solutions. The investigation of low-complexity approaches for RIS holds significant relevance, primarily owing to the intricate characteristics inherent in RIS-based systems and the need of deploying large-scale RIS arrays. Concurrently, the exploration of robust solutions aims to address the issue of hardware impairments occurring at both the transceivers and RIS components in practical RIS-assisted systems. In the realm of both low-complexity and robust resource allocation, this article not only elucidates the fundamental techniques underpinning these methodologies but also offers comprehensive numerical results for illustrative purposes. The necessity of adopting resource allocation strategies that are both low in complexity and resilient is thoroughly established. Ultimately, this article provides prospective research avenues in the domain of low-complexity and robust resource allocation techniques tailored for RIS-assisted systems.
    摘要 本文深入探讨了面向可重构智能表面(RIS)系统的资源分配技术的最新进展,重点关注低复杂度和鲁棒性的解决方案。研究低复杂度方法对 RIS 具有重要意义,这主要源于基于 RIS 的系统固有的复杂特性以及部署大规模 RIS 阵列的需求。与此同时,鲁棒解决方案旨在应对实际 RIS 辅助系统中收发机和 RIS 元件上出现的硬件损伤问题。在低复杂度和鲁棒资源分配两个方面,本文不仅阐明了这些方法的基本技术原理,还提供了翔实的数值结果加以说明,并充分论证了采用低复杂度且鲁棒的资源分配策略的必要性。最后,本文给出了面向 RIS 辅助系统的低复杂度和鲁棒资源分配技术的未来研究方向。

Multivariate selfsimilarity: Multiscale eigen-structures for selfsimilarity parameter estimation

  • paper_url: http://arxiv.org/abs/2311.03247
  • repo_url: None
  • paper_authors: Charles-Gérard Lucas, Gustavo Didier, Herwig Wendt, Patrice Abry
  • for: 这篇论文旨在提出一种用于多变量自相似数据的自相似参数估计方法。
  • methods: 该论文基于多变量自相似过程小波谱的多尺度本征结构,估计自相似参数向量。
  • results: 该 paper 提出了一种高效的估计方法,并在实际数据上进行了测试和验证。
    Abstract Scale-free dynamics, formalized by selfsimilarity, provides a versatile paradigm massively and ubiquitously used to model temporal dynamics in real-world data. However, its practical use has mostly remained univariate so far. By contrast, modern applications often demand multivariate data analysis. Accordingly, models for multivariate selfsimilarity were recently proposed. Nevertheless, they have remained rarely used in practice because of a lack of available robust estimation procedures for the vector of selfsimilarity parameters. Building upon recent mathematical developments, the present work puts forth an efficient estimation procedure based on the theoretical study of the multiscale eigenstructure of the wavelet spectrum of multivariate selfsimilar processes. The estimation performance is studied theoretically in the asymptotic limits of large scale and sample sizes, and computationally for finite-size samples. As a practical outcome, a fully operational and documented multivariate signal processing estimation toolbox is made freely available and is ready for practical use on real-world data. Its potential benefits are illustrated in epileptic seizure prediction from multi-channel EEG data.
    摘要 以自相似性形式化的无标度动态,为真实世界数据中的时间动态建模提供了一个被广泛而普遍使用的通用范式。然而,其实际应用迄今大多局限于单变量情形。与此相反,现代应用往往需要多变量数据分析。为此,近来已有多变量自相似性模型被提出。但由于缺乏对自相似参数向量的稳健估计方法,这些模型在实践中鲜被使用。基于近期的数学进展,本文依托对多变量自相似过程小波谱多尺度本征结构的理论研究,提出了一种高效的估计方法。估计性能在大尺度与大样本量的渐近极限下进行了理论分析,并针对有限样本进行了数值研究。作为实际成果,一个功能完整、文档齐备的多变量信号处理估计工具箱已免费公开,可直接用于真实世界数据。其潜在价值通过基于多通道 EEG 数据的癫痫发作预测得到了展示。
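
For the univariate case that the multivariate method generalizes, selfsimilarity estimation from the wavelet spectrum reduces to a log-scale linear regression; a sketch for a single fBm-like series is shown below (PyWavelets assumed available, and the standard fBm scaling relation with slope 2H + 1 assumed). The paper's multivariate eigen-structure analysis is not reproduced.

```python
import numpy as np
import pywt

def wavelet_selfsim_estimate(x, wavelet="db3", levels=6):
    """Univariate selfsimilarity (Hurst-type) estimate: regress
    log2(wavelet variance at scale 2^j) on j; for fBm the slope is 2H + 1."""
    coeffs = pywt.wavedec(x, wavelet, level=levels)
    details = coeffs[1:][::-1]           # detail coeffs, finest scale first
    j = np.arange(1, levels + 1)
    log_var = np.array([np.log2(np.mean(d**2)) for d in details])
    slope = np.polyfit(j, log_var, 1)[0]
    return (slope - 1) / 2               # H estimate

# Toy usage on a random-walk (H ~ 0.5) series:
x = np.cumsum(np.random.default_rng(0).normal(size=2**12))
print(wavelet_selfsim_estimate(x))
```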

Using Shallow Neural Networks with Functional Connectivity from EEG signals for Early Diagnosis of Alzheimer’s and Frontotemporal Dementia

  • paper_url: http://arxiv.org/abs/2311.03151
  • repo_url: None
  • paper_authors: Zaineb Ajra, Binbin Xu, Gérard Dray, Jacky Montmain, Stéphane Perrey
  • for: The paper explores the use of shallow neural networks and functional connectivity measures from EEG signals to differentiate between AD, FTD, and control cases.
  • methods: The paper uses two sets of features, spectral-temporal and functional connectivity, and compares supervised machine learning techniques with shallow CNN-based models to classify EEG signals.
  • results: The shallow CNN-based models achieved the highest accuracy of 94.54% with AEC on the test dataset, outperforming conventional methods and providing a potentially additional early dementia diagnosis tool.
    Abstract Introduction: Dementia is a neurological disorder associated with aging that can cause a loss of cognitive functions, impacting daily life. Alzheimer's disease (AD) is the most common cause of dementia, accounting for 50-70% of cases, while frontotemporal dementia (FTD) affects social skills and personality. Electroencephalography (EEG) provides an effective tool to study the effects of AD on the brain. Methods: In this study, we propose to use shallow neural networks applied to two sets of features: spectral-temporal and functional connectivity using four methods. We compare three supervised machine learning techniques to the CNN models to classify EEG signals of AD / FTD and control cases. We also evaluate different measures of functional connectivity from common EEG frequency bands considering multiple thresholds. Results and Discussion: Results showed that the shallow CNN-based models achieved the highest accuracy of 94.54% with AEC in test dataset when considering all connections, outperforming conventional methods and providing potentially an additional early dementia diagnosis tool. https://doi.org/10.3389%2Ffneur.2023.1270405
    摘要 引言:痴呆是一种与衰老相关的神经系统疾病,可导致认知功能丧失,影响日常生活。阿尔茨海默病(AD)是最常见的痴呆病因,占 50-70% 的病例,而额颞叶痴呆(FTD)则影响社交能力与人格。脑电图(EEG)为研究 AD 对大脑的影响提供了有效工具。方法:在这项研究中,我们提出使用浅层神经网络,应用于两类特征:谱-时特征和采用四种方法计算的功能连接。我们将三种有监督机器学习技术与 CNN 模型进行比较,以分类 AD/FTD 与对照组的 EEG 信号。此外,我们还在常见 EEG 频带上、考虑多个阈值的情况下评估了不同的功能连接测度。结果与讨论:结果显示,在考虑所有连接时,基于浅层 CNN 的模型在测试数据集上以 AEC 达到 94.54% 的最高准确率,优于传统方法,并可能提供一种额外的早期痴呆诊断工具。原文链接:https://doi.org/10.3389%2Ffneur.2023.1270405

Energy Harvesting Maximization for Reconfigurable Intelligent Surfaces Using Amplitude Measurements

  • paper_url: http://arxiv.org/abs/2311.03143
  • repo_url: None
  • paper_authors: Morteza Tavana, Meysam Masoudi, Emil Björnson
  • for: 能量收集可以使可重构智能表面(RIS)在不依赖外部电源的情况下自主维持运行。本文研究了在无法与环境射频(RF)源协调的情况下 RIS 的能量收集问题。
  • methods: 我们提出了一系列仅基于功率测量、用于最大化接收功率的顺序相位对齐算法。我们证明了在无噪场景下所提算法收敛到最优值。针对有噪场景,我们提出了一个线性最小二乘估计器,并证明在线性估计器类中,最优的测量相位集合是等间隔相位。
  • results: 通过与随机相位更新算法对比,我们发现所提算法在收敛后达到的功率更高,且每次相位更新所需的测量次数更少。仿真还表明,在无噪且 RIS 元件的相移取自离散集合的场景下,所提方法是次优的:其取值高于随机算法,但并非穷举搜索所得到的最大可行值。
    Abstract Energy harvesting can enable a reconfigurable intelligent surface (RIS) to self-sustain its operations without relying on external power sources. In this paper, we consider the problem of energy harvesting for RISs in the absence of coordination with the ambient RF source. We propose a series of sequential phase-alignment algorithms that maximize the received power based on only power measurements. We prove the convergence of the proposed algorithm to the optimal value for the noiseless scenario. However, for the noisy scenario, we propose a linear least squares estimator. We prove that within the class of linear estimators, the optimal set of measurement phases are equally-spaced phases. To evaluate the performance of the proposed method, we introduce a random phase update algorithm as a benchmark. Our simulation results show that the proposed algorithms outperform the random phase update method in terms of achieved power after convergence while requiring fewer measurements per phase update. Using simulations, we show that in a noiseless scenario with a discrete set of possible phase shifts for the RIS elements, the proposed method is sub-optimal, achieving a higher value than the random algorithm but not exactly the maximum feasible value that we obtained by exhaustive search.
    摘要 能量收集可以使可重构智能表面(RIS)在不依赖外部电源的情况下自主维持运行。本文考虑了在无法与环境射频(RF)源协调的情况下 RIS 的能量收集问题。我们提出了一系列仅基于功率测量、用于最大化接收功率的顺序相位对齐算法。我们证明了在无噪场景下所提算法收敛到最优值。针对有噪场景,我们提出了一个线性最小二乘估计器,并证明在线性估计器类中,最优的测量相位集合是等间隔相位。为评估所提方法的性能,我们引入随机相位更新算法作为基准。仿真结果表明,所提算法在收敛后达到的功率优于随机相位更新方法,且每次相位更新所需的测量次数更少。仿真还表明,在无噪且 RIS 元件的相移取自离散集合的场景下,所提方法是次优的:其取值高于随机算法,但并非我们通过穷举搜索得到的最大可行值。
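
Per RIS element, the received power as a function of that element's phase shift is sinusoidal, P(theta) = a + b*cos(theta - phi), so phi is recoverable from power-only measurements by least squares; measuring at equally spaced phases, which the letter shows is optimal within linear estimators, gives the sketch below. The single-element sinusoidal power model is standard; the rest is illustrative.

```python
import numpy as np

def align_element_phase(measure_power, K=4):
    """Estimate the power-maximizing phase of one RIS element from K power
    measurements at equally spaced test phases, via an LS fit of
    P(theta) = a + b*cos(theta - phi)."""
    thetas = 2 * np.pi * np.arange(K) / K          # equally spaced phases
    p = np.array([measure_power(t) for t in thetas])
    # LS fit of p ~ a + c*cos(theta) + s*sin(theta); then phi = atan2(s, c).
    H = np.column_stack([np.ones(K), np.cos(thetas), np.sin(thetas)])
    a, c, s = np.linalg.lstsq(H, p, rcond=None)[0]
    return np.arctan2(s, c)                        # best phase for this element

# Toy usage: an element whose true optimal phase is 1.0 rad, noisy meter.
rng = np.random.default_rng(0)
meter = lambda t: 2.0 + 1.5 * np.cos(t - 1.0) + rng.normal(scale=0.05)
print(align_element_phase(meter, K=8))  # approximately 1.0
```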

  • paper_url: http://arxiv.org/abs/2311.03046
  • repo_url: None
  • paper_authors: Haoran Qin, Wen Chen, Zhendong Li, Qingqing Wu, Nan Cheng, Fangjiong Chen
  • for: investigate a multiple input single output (MISO) downlink communication system with movable antennas (MAs)
  • methods: adopt a field-response based channel model and employ an alternating optimization (AO) algorithm based on penalty method and successive convex approximation (SCA) to obtain a sub-optimal solution
  • results: the MA-enabled communication system performs better than conventional fixed-position antennas
    Abstract This paper investigates a multiple input single output (MISO) downlink communication system in which users are equipped with movable antennas (MAs). First, We adopt a field-response based channel model to characterize the downlink channel with respect to MAs' positions. Then, we aim to minimize the total transmit power by jointly optimizing the MAs' positions and beamforming matrix. To solve the resulting non-convex problem, we employ an alternating optimization (AO) algorithm based on penalty method and successive convex approximation (SCA) to obtain a sub-optimal solution. Numerical results demonstrate that the MA-enabled communication system perform better than conventional fixed position antennas.
    摘要 本文研究了一个用户配备可移动天线(MA)的多输入单输出(MISO)下行通信系统。首先,我们采用基于场响应的信道模型来刻画下行信道随 MA 位置的变化。然后,我们通过联合优化 MA 的位置与波束成形矩阵,以最小化总发射功率。为求解所得到的非凸问题,我们采用基于罚函数法与逐次凸近似(SCA)的交替优化(AO)算法,得到一个次优解。数值结果表明,启用 MA 的通信系统性能优于传统的固定位置天线。

Optimization of RIS Placement for Satellite-to-Ground Coverage Enhancement

  • paper_url: http://arxiv.org/abs/2311.02958
  • repo_url: None
  • paper_authors: Xingchen Liu, Liuxun Xue, Shu Sun, Meixia Tao
  • for: 提高卫星到地面通信的可靠性和效率
  • methods: 利用可重构智能表面(RIS)辅助,并优化 RIS 在建筑物表面上的布局,以提升星地通信的覆盖率
  • results: 推导了大规模 RIS 部署下星地通信覆盖增强的理论下界;所提出的 RIS 布局优化显著提高了非视距用户的覆盖概率,并且通过调整参数设置,可应用于乡村、小镇、城市等不同的建筑分布场景。
    Abstract In satellite-to-ground communication, ensuring reliable and efficient connectivity poses significant challenges. The reconfigurable intelligent surface (RIS) offers a promising solution due to its ability to manipulate wireless propagation environments and thus enhance communication performance. In this paper, we propose a method for optimizing the placement of RISs on building facets to improve satellite-to-ground communication coverage. We model satellite-to-ground communication with RIS assistance, considering the actual positions of buildings and ground users. The theoretical lower bound on the coverage enhancement in satellite-to-ground communication through large-scale RIS deployment is derived. Then a novel optimization framework for RIS placement is formulated, and a parallel genetic algorithm is employed to solve the problem. Simulation results demonstrate the superior performance of the proposed RIS deployment strategy in enhancing satellite communication coverage probability for non-line-of-sight users. The proposed framework can be applied to various architectural distributions, such as rural areas, towns, and cities, by adjusting parameter settings.
    摘要 在星地通信中,确保可靠且高效的连接面临重大挑战。可重构智能表面(RIS)能够调控无线传播环境、从而提升通信性能,因此提供了一种有前景的解决方案。本文提出一种优化 RIS 在建筑物表面上布局的方法,以改善星地通信覆盖。我们对 RIS 辅助的星地通信进行建模,考虑了建筑物与地面用户的实际位置,并推导了通过大规模部署 RIS 提升星地通信覆盖的理论下界。随后,我们建立了一个新的 RIS 布局优化框架,并采用并行遗传算法进行求解。仿真结果表明,所提出的 RIS 部署策略能够显著提升非视距用户的卫星通信覆盖概率。通过调整参数设置,该框架可应用于乡村、小镇和城市等不同的建筑分布场景。

Channel Estimation and Training Design for Active RIS Aided Wireless Communications

  • paper_url: http://arxiv.org/abs/2311.02935
  • repo_url: None
  • paper_authors: Hao Chen, Nanxi Li, Ruizhe Long, Ying-Chang Liang
  • for: 利用有源可重构智能表面(ARIS)放大入射信号,以增强无线通信并提高信道估计精度。
  • methods: 利用 ARIS 的信号放大能力进行信道估计,以提高估计精度。
  • results: 提出基于最小二乘(LS)的信道估计器,并在信道训练阶段优化 ARIS 的反射图案,在 ARIS 热噪声存在下实现精确的信道估计。
    Abstract Active reconfigurable intelligent surface (ARIS) is a newly emerging RIS technique that leverages radio frequency (RF) reflection amplifiers to empower phase-configurable reflection elements (REs) in amplifying the incident signal. Thereby, ARIS can enhance wireless communications with the strengthened ARIS-aided links. In this letter, we propose exploiting the signal amplification capability of ARIS for channel estimation, aiming to improve the estimation precision. Nevertheless, the signal amplification inevitably introduces the thermal noise at the ARIS, which can hinder the acquisition of accurate channel state information (CSI) with conventional channel estimation methods based on passive RIS (PRIS). To address this issue, we further investigate this ARIS-specific channel estimation problem and propose a least-square (LS) based channel estimator, whose performance can be further improved with the design on ARIS reflection patterns at the channel training phase. Based on the proposed LS channel estimator, we optimize the training reflection patterns to minimize the channel estimation error variance. Extensive simulation results show that our proposed design can achieve accurate channel estimation in the presence of the ARIS noises.
    摘要 有源可重构智能表面(ARIS)是一种新兴的 RIS 技术,它利用射频(RF)反射放大器,使相位可配置的反射单元(RE)能够放大入射信号,从而通过增强的 ARIS 辅助链路提升无线通信。在本文中,我们提出利用 ARIS 的信号放大能力进行信道估计,以提高估计精度。然而,信号放大不可避免地在 ARIS 处引入热噪声,这会妨碍采用面向无源 RIS(PRIS)的传统信道估计方法获得准确的信道状态信息(CSI)。为解决这一问题,我们进一步研究了这一 ARIS 特有的信道估计问题,提出了一种基于最小二乘(LS)的信道估计器,其性能还可以通过在信道训练阶段设计 ARIS 反射图案得到进一步提升。基于所提出的 LS 信道估计器,我们优化训练反射图案以最小化信道估计误差方差。大量仿真结果表明,我们提出的设计能够在 ARIS 噪声存在下实现准确的信道估计。
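
The LS step itself is standard; below is a sketch under a simplified training model y_t = phi_t^T h + n_t, with phi_t the ARIS reflection pattern in slot t and h the cascaded channel. Orthogonal DFT rows are used as the illustrative training patterns, not the paper's noise-aware optimized design, and the ARIS amplifier noise is folded into the additive noise term.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16   # number of ARIS reflection elements
T = 16   # training slots (>= N for LS identifiability)
h = (rng.normal(size=N) + 1j * rng.normal(size=N)) / np.sqrt(2)  # cascaded channel

# Training reflection patterns: DFT rows (orthogonal across slots),
# an illustrative choice rather than the paper's optimized design.
Phi = np.exp(-2j * np.pi * np.outer(np.arange(T), np.arange(N)) / T)

# Received pilots; with an active RIS the amplifier also injects noise at
# the surface, modeled here as part of the additive noise term.
noise = 0.05 * (rng.normal(size=T) + 1j * rng.normal(size=T))
y = Phi @ h + noise

# LS channel estimate: h_hat = argmin ||y - Phi h||^2.
h_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.linalg.norm(h_hat - h) / np.linalg.norm(h))  # relative error
```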

Pilot Design and Signal Detection for Symbiotic Radio over OFDM Carriers

  • paper_url: http://arxiv.org/abs/2311.02928
  • repo_url: None
  • paper_authors: Hao Chen, Qianqian Zhang, Ruizhe Long, Yiyang Pei, Ying-Chang Liang
  • for: 本文研究共生无线电(SR)系统中主传输采用正交频分复用(OFDM)时的导频设计与信号检测,以实现高频谱效率与高能量效率。
  • methods: 主传输采用梳状(comb-type)导频结构,次级传输采用前导(preamble)导频结构;将次级信号视为复合信道(即主传输的有效信道)的一部分进行信道估计,并借助检测到的主信号提取次级信号。
  • results: 仿真结果表明,次级传输建立的反向散射链路提升了主传输的性能;即使没有直达链路,主、次传输也都可以仅通过反向散射链路得到支持。文中还分析了主、次传输的分集阶数以及对符号同步误差的敏感性。
    Abstract Symbiotic radio (SR) is a promising solution to achieve high spectrum- and energy-efficiency due to its spectrum sharing and low-power consumption properties, in which the secondary system achieves data transmissions by backscattering the signal originating from the primary system. In this paper, we are interested in the pilot design and signal detection when the primary transmission adopts orthogonal frequency division multiplexing (OFDM). In particular, to preserve the channel orthogonality among the OFDM sub-carriers, each secondary symbol is designed to span an entire OFDM symbol. The comb-type pilot structure is employed by the primary transmission, while the preamble pilot structure is used by the secondary transmission. With the designed pilot structures, the primary signal can be detected via the conventional methods by treating the secondary signal as a part of the composite channel, i.e., the effective channel of the primary transmission. Furthermore, the secondary signal can be extracted from the estimated composite channel with the help of the detected primary signal. The bit error rate (BER) performance with both perfect and estimated CSI, the diversity orders of the primary and secondary transmissions, and the sensitivity to symbol synchronization error are analyzed. Simulation results show that the performance of the primary transmission is enhanced thanks to the backscatter link established by the secondary transmission. More importantly, even without the direct link, the primary and secondary transmissions can be supported via only the backscatter link.

Goal-Oriented Wireless Communication Resource Allocation for Cyber-Physical Systems

  • paper_url: http://arxiv.org/abs/2311.02911
  • repo_url: None
  • paper_authors: Cheng Feng, Kedi Zheng, Yi Wang, Kaibin Huang, Qixin Chen
  • for: Wireless-edge applications such as smart grids and vehicle networks, whose cyber-physical systems (CPSs) depend on adaptable, resource-limited last-mile wireless communication networks.
  • methods: A goal-oriented wireless communication resource allocation framework that accounts for the semantics and significance of data with respect to CPS operation goals: the information utility gain is decomposed into marginal gains computed in parallel, the bandwidth allocation is reformulated as a knapsack problem, and a divide-and-conquer greedy algorithm solves it with a guaranteed sub-optimality gap.
  • results: The framework improves CPS performance and efficiency and applies to several scenarios, including data-driven decision-making, edge learning, federated learning, and distributed optimization.
    Abstract The proliferation of novel industrial applications at the wireless edge, such as smart grids and vehicle networks, demands the advancement of cyber-physical systems (CPSs). The performance of CPSs is closely linked to the last-mile wireless communication networks, which often become bottlenecks due to their inherent limited resources. Current CPS operations often treat wireless communication networks as unpredictable and uncontrollable variables, ignoring the potential adaptability of wireless networks, which results in inefficient and overly conservative CPS operations. Meanwhile, current wireless communications often focus more on throughput and other transmission-related metrics instead of CPS goals. In this study, we introduce the framework of goal-oriented wireless communication resource allocations, accounting for the semantics and significance of data for CPS operation goals. This guarantees optimal CPS performance from a cybernetic standpoint. We formulate a bandwidth allocation problem aimed at maximizing the information utility gain of transmitted data brought to CPS operation goals. Since the goal-oriented bandwidth allocation problem is a large-scale combinatorial problem, we propose a divide-and-conquer and greedy solution algorithm. The information utility gain is first approximately decomposed into marginal utility information gains and computed in a parallel manner. Subsequently, the bandwidth allocation problem is reformulated as a knapsack problem, which can be further solved greedily with a guaranteed sub-optimality gap. We further demonstrate how our proposed goal-oriented bandwidth allocation algorithm can be applied in four potential CPS applications, including data-driven decision-making, edge learning, federated learning, and distributed optimization.
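The knapsack reformulation with a greedy solver can be pictured in a few lines. The sketch below ranks hypothetical data streams by marginal utility gain per unit of bandwidth and admits them until the budget is exhausted; the stream names and numbers are made up for illustration.

```python
def greedy_bandwidth_allocation(requests, budget):
    """Greedy knapsack heuristic: admit the streams with the highest
    marginal-utility-per-bandwidth ratio first.

    requests: list of (stream_id, utility_gain, bandwidth_cost)
    budget:   total bandwidth available
    """
    chosen, used = [], 0.0
    for sid, gain, cost in sorted(requests, key=lambda r: r[1] / r[2], reverse=True):
        if used + cost <= budget:
            chosen.append(sid)
            used += cost
    return chosen

# Hypothetical example: three sensor streams competing for 10 MHz.
print(greedy_bandwidth_allocation(
    [("voltage", 8.0, 4.0), ("camera", 5.0, 5.0), ("heartbeat", 1.0, 0.5)], 10.0))
```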

Energy-Efficient Multidimensional Constellation Based on Leech Lattice for Visible Light Communications

  • paper_url: http://arxiv.org/abs/2311.02865
  • repo_url: None
  • paper_authors: Jia-Ning Guo, Ru-Han Chen, Jian Zhang, Longguang Li, Jing Zhou
  • for: Digital transmission for indoor visible light communications (VLC) under peak- and average-intensity input constraints.
  • methods: Tools from large deviation theory characterize the second-order asymptotics of the optimal constellation shaping region, refining the results of [Chen et al., 2020]; within this region, an energy-efficient 24-dimensional constellation is designed around the Leech lattice using a coarsely-shaping-and-finely-coding strategy that combines a large coding gain with a nearly-maximum shaping gain, together with fast mapping and demodulation algorithms.
  • results: Numerical results show higher gains and better performance than existing methods.
    Abstract In this paper, a 24-dimensional geometrically-shaped constellation design based on the Leech lattice is presented for indoor visible light communications (VLCs) with peak- and average-intensity input constraints. Firstly, by leveraging tools from large deviation theory, we characterize second-order asymptotics of the optimal constellation shaping region under the aforementioned intensity constraints, which further refine our previous results in [Chen et al., 2020]. Within the optimal geometrical shaping region, we develop an energy-efficient 24-dimensional constellation design, where a significant coding gain brought by the Leech lattice and the nearly-maximum shaping gain are incorporated by using a strategy called coarsely shaping and finely coding. Fast algorithms for constellation mapping and demodulation are presented as well. Numerical results verify the superiority of our design as compared with existing methods.

Multi-User Multi-IoT-Device Symbiotic Radio: A Novel Massive Access Scheme for Cellular IoT

  • paper_url: http://arxiv.org/abs/2311.02837
  • repo_url: None
  • paper_authors: Jun Wang, Ying-Chang Liang, Sumei Sun
  • for: Supporting massive access in cellular Internet-of-Things (IoT) through symbiotic radio (SR).
  • methods: A novel multi-user multi-IoT-device SR system in which the base station (BS) transmits to multiple cellular users while IoT devices simultaneously backscatter their information over the cellular signal; robust transmit beamforming is designed via semi-definite programming and difference-of-convex programming, together with a direction-of-arrival (DoA)-based design that avoids instantaneous reflective-link CSI.
  • results: Transmit power is minimized subject to cellular outage-probability and IoT sum-rate constraints, and the DoA-based design performs comparably to the CSI-based one when the angular spreads are small.
    Abstract Symbiotic radio (SR) is a promising technique to support cellular Internet-of-Things (IoT) by forming a mutualistic relationship between IoT and cellular transmissions. In this paper, we propose a novel multi-user multi-IoT-device SR system to enable massive access in cellular IoT. In the considered system, the base station (BS) transmits information to multiple cellular users, and a number of IoT devices simultaneously backscatter their information to these users via the cellular signal. The cellular users jointly decode the information from the BS and IoT devices. Noting that the reflective links from the IoT devices can be regarded as the channel uncertainty of the direct links, we apply a robust design method to design the beamforming vectors at the BS. Specifically, the transmit power is minimized under the cellular transmission outage probability constraints and IoT transmission sum rate constraints. An algorithm based on semi-definite programming and difference-of-convex programming is proposed to solve the power minimization problem. Moreover, we consider a special case where each cellular user is associated with several adjacent IoT devices and propose a direction of arrival (DoA)-based transmit beamforming design approach. The DoA-based approach requires only the DoA and angular spread (AS) of the direct links instead of the instantaneous channel state information (CSI) of the reflective link channels, leading to a significant reduction in the channel feedback overhead. Simulation results substantiate the proposed multi-user multi-IoT-device SR system and the effectiveness of the beamforming approaches. It is shown that the DoA-based beamforming approach achieves comparable performance to the CSI-based approach in the special case when the ASs are small.
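A minimal sketch of the DoA-based idea, assuming a uniform linear array at the BS: a beam can be pointed with a steering vector built from the direction of arrival alone, so no instantaneous reflective-link CSI is needed. Array geometry and numbers here are illustrative, not the paper's setup.

```python
import numpy as np

def steering_vector(theta_deg, n_antennas, spacing=0.5):
    """Uniform-linear-array steering vector for a given direction of
    arrival (element spacing in wavelengths)."""
    theta = np.radians(theta_deg)
    n = np.arange(n_antennas)
    return np.exp(-2j * np.pi * spacing * n * np.sin(theta))

a = steering_vector(30.0, n_antennas=8)
w = a / np.linalg.norm(a)          # matched (conjugate) beamformer
print(np.abs(np.vdot(w, a)))       # array gain toward 30 deg: sqrt(8) ~ 2.83
```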

cs.SD - 2023-11-05

Yet Another Generative Model For Room Impulse Response Estimation

  • paper_url: http://arxiv.org/abs/2311.02581
  • repo_url: None
  • paper_authors: Sungho Lee, Hyeong-Seok Choi, Kyogu Lee
  • for: A new neural room impulse response (RIR) estimator with improved estimation quality.
  • methods: An alternate generator architecture: an autoencoder with residual quantization learns a discrete latent token space in which each token represents a small time-frequency patch of the RIR, and RIR estimation is cast as reference-conditioned autoregressive token generation using transformer variants operating across the frequency, time, and quantization-depth axes.
  • results: Experiments show the system is preferable to baselines across various evaluation metrics.
    Abstract Recent neural room impulse response (RIR) estimators typically comprise an encoder for reference audio analysis and a generator for RIR synthesis. In particular, it is the performance of the generator that directly influences the overall estimation quality. In this context, we explore an alternate generator architecture for improved performance. We first train an autoencoder with residual quantization to learn a discrete latent token space, where each token represents a small time-frequency patch of the RIR. Then, we cast the RIR estimation problem as a reference-conditioned autoregressive token generation task, employing transformer variants that operate across frequency, time, and quantization depth axes. This way, we address the standard blind estimation task and additional acoustic matching problem, which aims to find an RIR that matches the source signal to the target signal's reverberation characteristics. Experimental results show that our system is preferable to other baselines across various evaluation metrics.
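A toy version of the residual-quantization step: each stage quantizes the residual left by the previous codebook, producing the discrete token stack that the autoregressive generator would then model. Codebook sizes and dimensions below are arbitrary, not the paper's configuration.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize x with a stack of codebooks; each stage encodes the
    residual left by the previous one. Returns token indices and the
    reconstruction."""
    tokens, recon = [], np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks:                  # cb: (codebook_size, dim)
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        tokens.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return tokens, recon

rng = np.random.default_rng(0)
dim, depth, K = 4, 3, 16
codebooks = [rng.standard_normal((K, dim)) for _ in range(depth)]
x = rng.standard_normal(dim)
tokens, recon = residual_quantize(x, codebooks)
print(tokens, np.linalg.norm(x - recon))  # deeper stacks shrink the error
```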

cs.CV - 2023-11-05

MirrorCalib: Utilizing Human Pose Information for Mirror-based Virtual Camera Calibration

  • paper_url: http://arxiv.org/abs/2311.02791
  • repo_url: None
  • paper_authors: Longyun Liao, Andrew Mitchell, Rong Zheng
  • for: Estimating the extrinsic parameters of the virtual camera induced by a fixed planar mirror, relative to the real camera.
  • methods: Prior knowledge of the human body and 2D joint locations is exploited: a modified eight-point algorithm yields an initial estimate, the estimate is refined subject to human-body constraints, and a RANSAC algorithm removes outliers by comparing epipolar distances to a threshold.
  • results: On synthetic/real datasets, MirrorCalib achieves a rotation error of 0.62°/1.82° and a translation error of 37.33/69.51 mm, outperforming the state of the art.
    Abstract In this paper, we present the novel task of estimating the extrinsic parameters of a virtual camera with respect to a real camera with one single fixed planar mirror. This task poses a significant challenge in cases where objects captured lack overlapping views from both real and mirrored cameras. To address this issue, prior knowledge of a human body and 2D joint locations are utilized to estimate the camera extrinsic parameters when a person is in front of a mirror. We devise a modified eight-point algorithm to obtain an initial estimation from 2D joint locations. The 2D joint locations are then refined subject to human body constraints. Finally, a RANSAC algorithm is employed to remove outliers by comparing their epipolar distances to a predetermined threshold. MirrorCalib is evaluated on both synthetic and real datasets and achieves a rotation error of 0.62°/1.82° and a translation error of 37.33/69.51 mm on the synthetic/real dataset, which outperforms the state-of-the-art method.
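The classical building blocks the pipeline adapts — the eight-point algorithm and the epipolar-distance inlier test used inside RANSAC — look roughly like this. These are the textbook versions, not the paper's mirror-specific modification.

```python
import numpy as np

def eight_point(p1, p2):
    """Textbook eight-point estimate of the fundamental matrix from >=8
    correspondences (p1, p2: Nx2 arrays), satisfying x2^T F x1 = 0."""
    A = np.array([[x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2, x1, y1, 1.0]
                  for (x1, y1), (x2, y2) in zip(p1, p2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)            # enforce the rank-2 constraint
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

def epipolar_distance(F, p1, p2):
    """Point-to-epipolar-line distances, usable as a RANSAC inlier test."""
    x1 = np.column_stack([p1, np.ones(len(p1))])
    x2 = np.column_stack([p2, np.ones(len(p2))])
    lines = x1 @ F.T                        # epipolar lines in image 2
    return np.abs(np.sum(x2 * lines, axis=1)) / np.linalg.norm(lines[:, :2], axis=1)

rng = np.random.default_rng(0)
p1, p2 = rng.uniform(0, 100, (8, 2)), rng.uniform(0, 100, (8, 2))
F = eight_point(p1, p2)
print(epipolar_distance(F, p1, p2))         # residuals of the fitted points
```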

MuSHRoom: Multi-Sensor Hybrid Room Dataset for Joint 3D Reconstruction and Novel View Synthesis

  • paper_url: http://arxiv.org/abs/2311.02778
  • repo_url: None
  • paper_authors: Xuqian Ren, Wenjia Wang, Dingding Cai, Tuuli Tuominen, Juho Kannala, Esa Rahtu
  • for: Accurate, real-time, immersive modeling on consumer-grade hardware in the Metaverse context, serving both non-human perception (e.g., drone/robot/autonomous-car navigation) and immersive technologies such as AR/VR.
  • methods: A real-world Multi-Sensor Hybrid Room dataset (MuSHRoom) on which several well-known pipelines are benchmarked for joint 3D mesh reconstruction and novel view synthesis.
  • results: A new method that achieves a good trade-off between 3D reconstruction and high-quality rendering in a robust and computationally efficient end-to-end fashion, performing well on the MuSHRoom dataset.
    Abstract Metaverse technologies demand accurate, real-time, and immersive modeling on consumer-grade hardware for both non-human perception (e.g., drone/robot/autonomous car navigation) and immersive technologies like AR/VR, requiring both structural accuracy and photorealism. However, there exists a knowledge gap in how to apply geometric reconstruction and photorealism modeling (novel view synthesis) in a unified framework. To address this gap and promote the development of robust and immersive modeling and rendering with consumer-grade devices, first, we propose a real-world Multi-Sensor Hybrid Room Dataset (MuSHRoom). Our dataset presents exciting challenges and requires state-of-the-art methods to be cost-effective, robust to noisy data and devices, and can jointly learn 3D reconstruction and novel view synthesis, instead of treating them as separate tasks, making them ideal for real-world applications. Second, we benchmark several famous pipelines on our dataset for joint 3D mesh reconstruction and novel view synthesis. Finally, in order to further improve the overall performance, we propose a new method that achieves a good trade-off between the two tasks. Our dataset and benchmark show great potential in promoting the improvements for fusing 3D reconstruction and high-quality rendering in a robust and computationally efficient end-to-end fashion.

Fast Sparse 3D Convolution Network with VDB

  • paper_url: http://arxiv.org/abs/2311.02762
  • repo_url: None
  • paper_authors: Fangjun Zhou, Anyong Mao, Eftychios Sifakis
  • for: A new convolutional neural network implementation for efficient sparse 3D data inference.
  • methods: NanoVDB serves as the data structure for storing the sparse tensor, keeping the memory footprint small while maintaining high performance.
  • results: The architecture is around 20 times faster than a state-of-the-art dense CNN model on a high-resolution 3D object classification network.
    Abstract We propose a new convolutional neural network implementation optimized for sparse 3D data inference. This implementation uses NanoVDB as the data structure to store the sparse tensor. It leaves a relatively small memory footprint while maintaining high performance. We demonstrate that this architecture is around 20 times faster than a state-of-the-art dense CNN model on a high-resolution 3D object classification network.
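The principle behind sparse 3D convolution can be sketched with a coordinate dictionary standing in for the NanoVDB grid: work is proportional to the number of active voxels rather than the dense volume. This is a naive illustration of the idea, not the paper's implementation.

```python
import numpy as np

def sparse_conv3d(voxels, kernel):
    """Naive sparse 3D convolution over a coordinate->value dict,
    iterating only over active voxels."""
    k = kernel.shape[0] // 2
    offsets = [(dx, dy, dz) for dx in range(-k, k + 1)
                            for dy in range(-k, k + 1)
                            for dz in range(-k, k + 1)]
    out = {}
    for (x, y, z), v in voxels.items():
        for dx, dy, dz in offsets:
            key = (x + dx, y + dy, z + dz)
            out[key] = out.get(key, 0.0) + v * kernel[dx + k, dy + k, dz + k]
    return out

voxels = {(0, 0, 0): 1.0, (5, 5, 5): 2.0}   # only two active voxels
kernel = np.ones((3, 3, 3)) / 27.0
print(len(sparse_conv3d(voxels, kernel)))    # 54 touched output voxels
```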

Fast Point-cloud to Mesh Reconstruction for Deformable Object Tracking

  • paper_url: http://arxiv.org/abs/2311.02749
  • repo_url: None
  • paper_authors: Elham Amin Mansour, Hehui Zheng, Robert K. Katzschmann
  • for: Robotic hands that manipulate soft objects and therefore need online state feedback of the deforming object.
  • methods: A method that reconstructs deforming meshes from deforming point clouds at above 50 Hz across different object categories, built from a point cloud autoencoder and a Real-NVP architecture (a manifold-preserving continuous-flow network) that deforms a canonical template mesh to match the observed point cloud.
  • results: Mesh reconstruction and tracking at 58 Hz for deformations of six YCB categories, enabling closed-loop grasp adaptation for a robotic hand and marker-free system identification of deforming objects.
    Abstract The world around us is full of soft objects that we as humans learn to perceive and deform with dexterous hand movements from a young age. In order for a robotic hand to be able to control soft objects, it needs to acquire online state feedback of the deforming object. While RGB-D cameras can collect occluded information at a rate of 30 Hz, this stream does not represent a continuously trackable object surface. Hence, in this work, we developed a method that can create deforming meshes of deforming point clouds at a speed of above 50 Hz for different categories of objects. The reconstruction of meshes from point clouds has long been studied in computer graphics under 3D and 4D reconstruction; however, both lack the speed and generalizability needed for robotics applications. Our model is designed using a point cloud autoencoder and a Real-NVP architecture. The latter is a continuous flow neural network with manifold-preservation properties. Our model takes a template mesh, i.e., the mesh of an object in its canonical state, and then deforms it to match a deformed point cloud of the object. Our method can perform mesh reconstruction and tracking at a rate of 58 Hz for deformations of six different YCB categories. An instance of a downstream application is the control algorithm for a robotic hand that requires online feedback from the state of a manipulated object, which would allow online grasp adaptation in a closed-loop manner. Furthermore, the tracking capacity that our method provides can help in the system identification of deforming objects in a marker-free approach. In future work, we will extend our method to more categories of objects and to real-world deforming point clouds.
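Real-NVP builds its invertible mapping from affine coupling layers; a minimal PyTorch sketch follows. The feature split and sizes are illustrative, and the paper's conditioning on the observed point cloud is omitted.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real-NVP affine coupling block: half the coordinates pass
    through unchanged and condition a scale/shift on the other half,
    keeping the mapping exactly invertible."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))       # predicts (log_scale, shift)

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        return torch.cat([xa, xb * torch.exp(log_s) + t], dim=-1)

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        return torch.cat([ya, (yb - t) * torch.exp(-log_s)], dim=-1)

layer = AffineCoupling(dim=6)             # e.g. a 3D point plus a 3D feature
pts = torch.randn(1024, 6)
assert torch.allclose(layer.inverse(layer(pts)), pts, atol=1e-4)
```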

Attention Modules Improve Image-Level Anomaly Detection for Industrial Inspection: A DifferNet Case Study

  • paper_url: http://arxiv.org/abs/2311.02747
  • repo_url: https://github.com/andreluizbvs/insplad
  • paper_authors: André Luiz Buarque Vieira e Silva, Francisco Simões, Danny Kowerko, Tobias Schlosser, Felipe Battisti, Veronica Teichrieb
  • for: Learning-based (semi-)automated visual industrial inspection, where small pixel-scale defect patterns must be detected in high-resolution imagery.
  • methods: A DifferNet-based solution enhanced with attention modules, AttentDifferNet, which improves image-level anomaly detection and classification on three visual anomaly detection datasets: InsPLAD-fault, MVTec AD, and Semiconductor Wafer.
  • results: Compared to the state of the art, AttentDifferNet improves overall AUROC by an average of 1.77 ± 0.25 percentage points across the three datasets, reaching state-of-the-art results, notably on InsPLAD-fault, an in-the-wild industrial inspection dataset.
    Abstract Within (semi-)automated visual industrial inspection, learning-based approaches for assessing visual defects, including deep neural networks, enable the processing of otherwise small defect patterns in pixel size on high-resolution imagery. The emergence of these often rarely occurring defect patterns explains the general need for labeled data corpora. To alleviate this issue and advance the current state of the art in unsupervised visual inspection, this work proposes a DifferNet-based solution enhanced with attention modules: AttentDifferNet. It improves image-level detection and classification capabilities on three visual anomaly detection datasets for industrial inspection: InsPLAD-fault, MVTec AD, and Semiconductor Wafer. In comparison to the state of the art, AttentDifferNet achieves improved results, which are, in turn, highlighted throughout our quali-quantitative study. Our quantitative evaluation shows an average improvement - compared to DifferNet - of 1.77 +/- 0.25 percentage points in overall AUROC considering all three datasets, reaching SOTA results in InsPLAD-fault, an industrial inspection in-the-wild dataset. As our variants of AttentDifferNet show great prospects in the context of currently investigated approaches, a baseline is formulated, emphasizing the importance of attention for industrial anomaly detection both in the wild and in controlled environments.
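As one example of the kind of attention module that can be inserted into a DifferNet-style feature extractor, here is a standard squeeze-and-excitation block. This is a sketch; the exact modules and their placement in AttentDifferNet may differ.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: pool each channel
    globally, then learn per-channel rescaling weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pool
        return x * w[:, :, None, None]     # excite: per-channel rescale

feats = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feats).shape)            # torch.Size([2, 64, 32, 32])
```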

Scenario Diffusion: Controllable Driving Scenario Generation With Diffusion

  • paper_url: http://arxiv.org/abs/2311.02738
  • repo_url: None
  • paper_authors: Ethan Pronovost, Meghana Reddy Ganesina, Noureldin Hendy, Zeyu Wang, Andres Morales, Kai Wang, Nicholas Roy
  • for: Automated generation of synthetic traffic scenarios for validating the safety of autonomous vehicles (AVs).
  • methods: A diffusion-based architecture that combines latent diffusion, object detection, and trajectory regression to jointly generate distributions of synthetic agent poses, orientations, and trajectories, conditioned on a map and on tokens describing the desired scenario.
  • results: The model has sufficient expressive capacity to model diverse traffic patterns and generalizes to different geographical regions.
    Abstract Automated creation of synthetic traffic scenarios is a key part of validating the safety of autonomous vehicles (AVs). In this paper, we propose Scenario Diffusion, a novel diffusion-based architecture for generating traffic scenarios that enables controllable scenario generation. We combine latent diffusion, object detection and trajectory regression to generate distributions of synthetic agent poses, orientations and trajectories simultaneously. To provide additional control over the generated scenario, this distribution is conditioned on a map and sets of tokens describing the desired scenario. We show that our approach has sufficient expressive capacity to model diverse traffic patterns and generalizes to different geographical regions.

JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds

  • paper_url: http://arxiv.org/abs/2311.02736
  • repo_url: None
  • paper_authors: Saeed Saadatnejad, Yang Gao, Hamid Rezatofighi, Alexandre Alahi
  • for: Trajectory forecasting for autonomous navigation, especially for preventing accidents involving humans, where anticipating agents' motion in advance is of utmost importance.
  • methods: A novel end-to-end trajectory forecasting dataset (an extension of JRDB) that exposes models to less-than-ideal upstream modules such as tracking, plus a new metric for real-world settings where ground-truth identities are inaccessible.
  • results: The dataset provides comprehensive data, including all agents' locations, scene images, and point clouds from the robot's perspective, with the objective of predicting agents' future positions relative to the robot from raw sensory input.
    Abstract Predicting future trajectories is critical in autonomous navigation, especially in preventing accidents involving humans, where a predictive agent's ability to anticipate in advance is of utmost importance. Trajectory forecasting models, employed in fields such as robotics, autonomous vehicles, and navigation, face challenges in real-world scenarios, often due to the isolation of model components. To address this, we introduce a novel dataset for end-to-end trajectory forecasting, facilitating the evaluation of models in scenarios involving less-than-ideal preceding modules such as tracking. This dataset, an extension of the JRDB dataset, provides comprehensive data, including the locations of all agents, scene images, and point clouds, all from the robot's perspective. The objective is to predict the future positions of agents relative to the robot using raw sensory input data. It bridges the gap between isolated models and practical applications, promoting a deeper understanding of navigation dynamics. Additionally, we introduce a novel metric for assessing trajectory forecasting models in real-world scenarios where ground-truth identities are inaccessible, addressing issues related to undetected or over-detected agents. Researchers are encouraged to use our benchmark for model evaluation and benchmarking.

ISAR: A Benchmark for Single- and Few-Shot Object Instance Segmentation and Re-Identification

  • paper_url: http://arxiv.org/abs/2311.02734
  • repo_url: https://github.com/nicogorlo/isar_wacv24
  • paper_authors: Nicolas Gorlo, Kenneth Blomqvist, Francesco Milano, Roland Siegwart
  • for: Advancing single- and few-shot object detection, instance segmentation, and re-identification, so that spatial AI systems can quickly be taught about new objects.
  • methods: A benchmark and baseline method (ISAR) comprising a semi-synthetic dataset of video sequences with ground-truth semantic annotations and a standardized evaluation pipeline.
  • results: The benchmark and baseline are intended to accelerate the development of algorithms that robustly detect, segment, and re-identify objects from a single or a few sparse training examples.
    Abstract Most object-level mapping systems in use today make use of an upstream learned object instance segmentation model. If we want to teach them about a new object or segmentation class, we need to build a large dataset and retrain the system. To build spatial AI systems that can quickly be taught about new objects, we need to effectively solve the problem of single-shot object detection, instance segmentation and re-identification. So far there is neither a method fulfilling all of these requirements in unison nor a benchmark that could be used to test such a method. Addressing this, we propose ISAR, a benchmark and baseline method for single- and few-shot object Instance Segmentation And Re-identification, in an effort to accelerate the development of algorithms that can robustly detect, segment, and re-identify objects from a single or a few sparse training examples. We provide a semi-synthetic dataset of video sequences with ground-truth semantic annotations, a standardized evaluation pipeline, and a baseline method. Our benchmark aligns with the emerging research trend of unifying Multi-Object Tracking, Video Object Segmentation, and Re-identification.

Uncertainty Estimation for Safety-critical Scene Segmentation via Fine-grained Reward Maximization

  • paper_url: http://arxiv.org/abs/2311.02719
  • repo_url: https://github.com/med-air/fgrm
  • paper_authors: Hongzheng Yang, Cheng Chen, Yueyao Chen, Markus Scheppach, Hon Chi Yip, Qi Dou
  • for: Improving the reliability of deep segmentation models in safety-critical scenarios, particularly medical applications.
  • methods: A fine-grained reward maximization (FGRM) framework that directly optimizes a calibration-based uncertainty estimation reward with a reinforcement-learning model tuning algorithm, using a fine-grained parameter update scheme that weights each parameter's reward by its importance as quantified by the Fisher information matrix.
  • results: On two large safety-critical surgical scene segmentation datasets under two uncertainty estimation settings, the method outperforms state-of-the-art baselines by a clear margin on all calibration metrics with a single real-time forward pass at inference, while maintaining high task accuracy. Code is available at https://github.com/med-air/FGRM.
    Abstract Uncertainty estimation plays an important role for future reliable deployment of deep segmentation models in safety-critical scenarios such as medical applications. However, existing methods for uncertainty estimation have been limited by the lack of explicit guidance for calibrating the prediction risk and model confidence. In this work, we propose a novel fine-grained reward maximization (FGRM) framework, to address uncertainty estimation by directly utilizing an uncertainty metric related reward function with a reinforcement learning based model tuning algorithm. This would benefit the model uncertainty estimation through direct optimization guidance for model calibration. Specifically, our method designs a new uncertainty estimation reward function using the calibration metric, which is maximized to fine-tune an evidential learning pre-trained segmentation model for calibrating prediction risk. Importantly, we introduce an effective fine-grained parameter update scheme, which imposes fine-grained reward-weighting of each network parameter according to the parameter importance quantified by the Fisher information matrix. To the best of our knowledge, this is the first work exploring reward optimization for model uncertainty estimation in safety-critical vision tasks. The effectiveness of our method is demonstrated on two large safety-critical surgical scene segmentation datasets under two different uncertainty estimation settings. With real-time one forward pass at inference, our method outperforms state-of-the-art methods by a clear margin on all the calibration metrics of uncertainty estimation, while maintaining a high task accuracy for the segmentation results. Code is available at \url{https://github.com/med-air/FGRM}.

CycleCL: Self-supervised Learning for Periodic Videos

  • paper_url: http://arxiv.org/abs/2311.03402
  • repo_url: None
  • paper_authors: Matteo Destro, Michael Gygli
  • for: Periodic video sequences, as found in automatic production systems, remote sensing, medical applications, or physical training.
  • methods: CycleCL, a self-supervised learning method designed specifically for periodic data, using a triplet loss to optimize for phase-sensitive, repetition-invariant representations.
  • results: Significantly outperforms previous video-based self-supervised learning methods on all tasks across industrial and human-action datasets.
    Abstract Analyzing periodic video sequences is a key topic in applications such as automatic production systems, remote sensing, medical applications, or physical training. An example is counting repetitions of a physical exercise. Due to the distinct characteristics of periodic data, self-supervised methods designed for standard image datasets do not capture changes relevant to the progression of the cycle and fail to ignore unrelated noise. They thus do not work well on periodic data. In this paper, we propose CycleCL, a self-supervised learning method specifically designed to work with periodic data. We start from the insight that a good visual representation for periodic data should be sensitive to the phase of a cycle, but be invariant to the exact repetition, i.e. it should generate identical representations for a specific phase throughout all repetitions. We exploit the repetitions in videos to design a novel contrastive learning method based on a triplet loss that optimizes for these desired properties. Our method uses pre-trained features to sample pairs of frames from approximately the same phase and negative pairs of frames from different phases. Then, we iterate between optimizing a feature encoder and resampling triplets, until convergence. By optimizing a model this way, we are able to learn features that have the mentioned desired properties. We evaluate CycleCL on an industrial and multiple human actions datasets, where it significantly outperforms previous video-based self-supervised learning methods on all tasks.
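The phase-sensitive, repetition-invariant objective can be sketched as a triplet loss in which frames sharing a cycle phase act as positives and frames at other phases as negatives. Discrete phase labels are assumed here for simplicity; the paper instead samples pairs using pre-trained features rather than ground-truth phases.

```python
import torch
import torch.nn.functional as F

def phase_triplet_loss(emb, phase, margin=0.2):
    """Toy CycleCL-style objective over frame embeddings `emb` with
    integer phase labels `phase`: pull same-phase frames together,
    push different-phase frames apart."""
    loss, count = 0.0, 0
    for i in range(len(emb)):
        pos = (phase == phase[i]).nonzero().flatten()
        pos = pos[pos != i]
        neg = (phase != phase[i]).nonzero().flatten()
        if len(pos) == 0 or len(neg) == 0:
            continue
        d_pos = F.pairwise_distance(emb[i:i + 1], emb[pos]).min()
        d_neg = F.pairwise_distance(emb[i:i + 1], emb[neg]).min()
        loss += F.relu(d_pos - d_neg + margin)
        count += 1
    return loss / max(count, 1)

emb = torch.randn(16, 32)
phase = torch.randint(0, 4, (16,))   # 4 phases per cycle
print(phase_triplet_loss(emb, phase))
```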

Benchmarking a Benchmark: How Reliable is MS-COCO?

  • paper_url: http://arxiv.org/abs/2311.02709
  • repo_url: None
  • paper_authors: Eric Zimmermann, Justin Szeto, Jerome Pasquero, Frederic Ratle
  • for: This study uses Sama-COCO, a re-annotation of MS-COCO, to discover potential biases in benchmark annotations.
  • methods: A shape analysis pipeline evaluates the impact of different annotation conditions; a model is trained and evaluated on both datasets.
  • results: Annotation styles have a significant impact on model performance, and annotation pipelines should closely consider the task of interest.
    Abstract Benchmark datasets are used to profile and compare algorithms across a variety of tasks, ranging from image classification to segmentation, and also play a large role in image pretraining algorithms. Emphasis is placed on results with little regard to the actual content within the dataset. It is important to question what kind of information is being learned from these datasets and what are the nuances and biases within them. In the following work, Sama-COCO, a re-annotation of MS-COCO, is used to discover potential biases by leveraging a shape analysis pipeline. A model is trained and evaluated on both datasets to examine the impact of different annotation conditions. Results demonstrate that annotation styles are important and that annotation pipelines should closely consider the task of interest. The dataset is made publicly available at https://www.sama.com/sama-coco-dataset/ .

An Empirical Study of Uncertainty in Polygon Annotation and the Impact of Quality Assurance

  • paper_url: http://arxiv.org/abs/2311.02707
  • repo_url: None
  • paper_authors: Eric Zimmermann, Justin Szeto, Frederic Ratle
  • for: Examining and quantifying the inherent uncertainty of polygon annotations and the role quality assurance plays in minimizing its effect.
  • methods: An analysis of multi-rater polygon annotations for several objects from the MS-COCO dataset.
  • results: The reliability of a polygon annotation depends on the reviewing procedure as well as on scene and shape complexity.
    Abstract Polygons are a common annotation format used for quickly annotating objects in instance segmentation tasks. However, many real-world annotation projects request near pixel-perfect labels. While strict pixel guidelines may appear to be the solution to a successful project, practitioners often fail to assess the feasibility of the work requested, and overlook common factors that may challenge the notion of quality. This paper aims to examine and quantify the inherent uncertainty for polygon annotations and the role that quality assurance plays in minimizing its effect. To this end, we conduct an analysis on multi-rater polygon annotations for several objects from the MS-COCO dataset. The results demonstrate that the reliability of a polygon annotation is dependent on a reviewing procedure, as well as the scene and shape complexity.
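One simple way to quantify the inherent uncertainty of polygon annotations is pairwise IoU between raters, as sketched below with shapely. This is a generic proxy for rater agreement, not the paper's full analysis.

```python
from shapely.geometry import Polygon

def rater_agreement(polys):
    """Mean pairwise IoU between polygon annotations of the same object
    from different raters -- a simple uncertainty proxy."""
    ious = []
    for i in range(len(polys)):
        for j in range(i + 1, len(polys)):
            inter = polys[i].intersection(polys[j]).area
            union = polys[i].union(polys[j]).area
            ious.append(inter / union if union > 0 else 0.0)
    return sum(ious) / len(ious)

# Two hypothetical raters outlining the same box-like object.
a = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
b = Polygon([(1, 0), (11, 0), (11, 10), (1, 10)])
print(round(rater_agreement([a, b]), 3))   # ~0.818
```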

A Generative Multi-Resolution Pyramid and Normal-Conditioning 3D Cloth Draping

  • paper_url: http://arxiv.org/abs/2311.02700
  • repo_url: https://github.com/hunorlaczko/pyramid-drape
  • paper_authors: Hunor Laczkó, Meysam Madadi, Sergio Escalera, Jordi Gonzalez
  • for: 3D garment generation and draping.
  • methods: A conditional variational autoencoder with a pyramid network that adds garment detail progressively, conditioned on surface normal UV maps as an intermediate representation.
  • results: Robust, controllable, state-of-the-art results that generalize well to unseen garments, poses, and shapes, even when training with small amounts of data.
    Abstract RGB cloth generation has been deeply studied in the related literature; however, 3D garment generation remains an open problem. In this paper, we build a conditional variational autoencoder for 3D garment generation and draping. We propose a pyramid network to add garment details progressively in a canonical space, i.e., unposing and unshaping the garments w.r.t. the body. We study conditioning the network on surface normal UV maps, as an intermediate representation, which is an easier problem to optimize than 3D coordinates. Our results on two public datasets, CLOTH3D and CAPE, show that our model is robust, controllable in terms of detail generation by the use of multi-resolution pyramids, and achieves state-of-the-art results that can highly generalize to unseen garments, poses, and shapes even when training with small amounts of data.

ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2311.02692
  • repo_url: https://github.com/openlamm/lamm
  • paper_authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao
  • for: A standardized, holistic evaluation framework for understanding the capabilities and limitations of multimodal large language models (MLLMs).
  • methods: The Comprehensive Evaluation Framework (ChEF), built from four modular components: Scenario (scalable multimodal datasets), Instruction (flexible instruction retrieving formulae), Inferencer (reliable question answering strategies), and Metric (indicative task-specific score functions); new evaluations are composed as Recipes, i.e., systematic selections of these components.
  • results: A large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata (calibration, in-context learning, instruction following, language performance, hallucination, and robustness), summarizing over 20 valuable observations on MLLM generalizability and the composite capabilities that multimodal interactions require.
    Abstract Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks. However, even though a list of benchmarks has been proposed, the capabilities and limitations of MLLMs are still not comprehensively understood, due to a lack of a standardized and holistic evaluation framework. To this end, we present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs. First, we structure ChEF as four modular components, i.e., Scenario as scalable multimodal datasets, Instruction as flexible instruction retrieving formulae, Inferencer as reliable question answering strategies, and Metric as indicative task-specific score functions. Based on them, ChEF facilitates versatile evaluations in a standardized framework, and new evaluations can be built by designing new Recipes (systematic selection of these four components). Notably, current MLLM benchmarks can be readily summarized as recipes of ChEF. Second, we introduce 6 new recipes to quantify competent MLLMs' desired capabilities (or called desiderata, i.e., calibration, in-context learning, instruction following, language performance, hallucination, and robustness) as reliable agents that can perform real-world multimodal interactions. Third, we conduct a large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata. Our evaluation summarized over 20 valuable observations concerning the generalizability of MLLMs across various scenarios and the composite capability of MLLMs required for multimodal interactions. We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models, so that ChEF can be a growing evaluation framework for the MLLM community.

Octavius: Mitigating Task Interference in MLLMs via MoE

  • paper_url: http://arxiv.org/abs/2311.02684
  • repo_url: None
  • paper_authors: Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
  • for: Studying how negative conflicts and interference among modalities and downstream tasks affect multimodal large language models (MLLMs).
  • methods: A novel, extensible framework, Octavius, that combines Mixture-of-Experts (MoE) with the representative PEFT technique LoRA into a new LLM-based decoder, LoRA-MoE, for multimodal learning.
  • results: Experiments show improvements of about 20%, demonstrating the effectiveness and versatility of the design across various 2D and 3D downstream tasks.
    Abstract Recent studies have demonstrated Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, negative conflicts and interference may have a worse impact on performance. While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) and one of the representative PEFT techniques, i.e., LoRA, designing a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. The experimental results (about 20% improvement) have shown the effectiveness and versatility of our design in various 2D and 3D downstream tasks. Code and corresponding dataset will be available soon.
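A minimal reading of the LoRA-MoE idea: a frozen base projection plus several low-rank LoRA experts mixed by a learned router. Expert count, rank, and the routing scheme below are illustrative; the paper's decoder is more involved.

```python
import torch
import torch.nn as nn

class LoRAMoE(nn.Module):
    """Sketch of a LoRA-based mixture-of-experts adapter on top of a
    frozen linear layer."""
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):                       # x: (B, d_in)
        gate = self.router(x).softmax(dim=-1)   # (B, n_experts)
        delta = torch.einsum('bi,eir,ero->beo', x, self.A, self.B)
        return self.base(x) + (gate.unsqueeze(-1) * delta).sum(dim=1)

x = torch.randn(2, 32)
print(LoRAMoE(32, 32)(x).shape)                 # torch.Size([2, 32])
```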

Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones

  • paper_url: http://arxiv.org/abs/2311.02665
  • repo_url: https://github.com/kitamoto-lab/digital-typhoon
  • paper_authors: Asanobu Kitamoto, Jared Hwang, Bastien Vuillod, Lucas Gautier, Yingtao Tian, Tarin Clanuwat
  • for: The paper presents the official release of the Digital Typhoon dataset, a long-term (40+ years) spatio-temporal satellite image dataset for benchmarking machine learning models in the context of tropical cyclones.
  • methods: The authors developed a workflow that creates infrared typhoon-centered images for cropping using the Lambert azimuthal equal-area projection with reference to best track data, and addressed data quality issues such as inter-satellite calibration to create a homogeneous dataset.
  • results: Benchmarking on the analysis, forecasting, and reanalysis of intensity suggests the dataset is challenging for recent deep learning models, with many choices affecting the performance of various models.
    Abstract This paper presents the official release of the Digital Typhoon dataset, the longest typhoon satellite image dataset for 40+ years aimed at benchmarking machine learning models for long-term spatio-temporal data. To build the dataset, we developed a workflow to create an infrared typhoon-centered image for cropping using Lambert azimuthal equal-area projection referring to the best track data. We also address data quality issues such as inter-satellite calibration to create a homogeneous dataset. To take advantage of the dataset, we organized machine learning tasks by the types and targets of inference, with other tasks for meteorological analysis, societal impact, and climate change. The benchmarking results on the analysis, forecasting, and reanalysis for the intensity suggest that the dataset is challenging for recent deep learning models, due to many choices that affect the performance of various models. This dataset reduces the barrier for machine learning researchers to meet large-scale real-world events called tropical cyclones and develop machine learning models that may contribute to advancing scientific knowledge on tropical cyclones as well as solving societal and sustainability issues such as disaster reduction and climate change. The dataset is publicly available at http://agora.ex.nii.ac.jp/digital-typhoon/dataset/ and https://github.com/kitamoto-lab/digital-typhoon/.
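The typhoon-centered cropping relies on the forward Lambert azimuthal equal-area projection, which maps latitude/longitude to planar coordinates around the storm center. The standard formula is easy to sketch; the example coordinates are made up.

```python
import numpy as np

def laea_project(lat, lon, lat0, lon0, R=6371.0):
    """Forward Lambert azimuthal equal-area projection centered on
    (lat0, lon0), e.g. the typhoon eye; angles in degrees, output in km."""
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    k = np.sqrt(2.0 / (1.0 + np.sin(lat0) * np.sin(lat)
                       + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0)))
    x = R * k * np.cos(lat) * np.sin(lon - lon0)
    y = R * k * (np.cos(lat0) * np.sin(lat)
                 - np.sin(lat0) * np.cos(lat) * np.cos(lon - lon0))
    return x, y

print(laea_project(22.0, 131.0, lat0=21.0, lon0=130.0))  # roughly (103, 111) km
```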

Enhanced adaptive cross-layer scheme for low latency HEVC streaming over Vehicular Ad-hoc Networks (VANETs)

  • paper_url: http://arxiv.org/abs/2311.02664
  • repo_url: None
  • paper_authors: Mohamed Aymen Labiod, Mohamed Gharbi, François-Xavier Coudoux, Patrick Corlay, Noureddine Doghmane
  • for: High-quality, low-latency video delivery over Vehicular Ad-hoc Networks (VANETs), which suffer from variable channel quality and limited bandwidth.
  • methods: A low-complexity cross-layer mechanism that assigns each transmitted video packet to the most appropriate Access Category (AC) queue at the Medium Access Control (MAC) layer, considering the temporal prediction structure of the video encoding, frame importance, and network traffic load.
  • results: Across targeted low-delay video communication scenarios, the mechanism significantly improves received video quality and end-to-end delay compared with the Enhanced Distributed Channel Access (EDCA) of 802.11p; both Quality of Service (QoS) and Quality of Experience (QoE) evaluations validate the approach.
    Abstract Vehicular communication has become a reality guided by various applications. Among those, high video quality delivery with low latency constraints required by real-time applications constitutes a very challenging task. By dint of its never-before-achieved compression level, the new High-Efficiency Video Coding (HEVC) is very promising for real-time video streaming through Vehicular Ad-hoc Networks (VANET). However, these networks have variable channel quality and limited bandwidth. Therefore, ensuring satisfactory video quality on such networks is a major challenge. In this work, a low complexity cross-layer mechanism is proposed to improve end-to-end performances of HEVC video streaming in VANET under low delay constraints. The idea is to assign to each packet of the transmitted video the most appropriate Access Category (AC) queue on the Medium Access Control (MAC) layer, considering the temporal prediction structure of the video encoding process, the importance of the frame and the state of the network traffic load. Simulation results demonstrate that for different targeted low-delay video communication scenarios, the proposed mechanism offers significant improvements regarding video quality at the reception and end-to-end delay compared to the Enhanced Distributed Channel Access (EDCA) adopted in the 802.11p. Both Quality of Service (QoS) and Quality of Experience (QoE) evaluations have been also carried out to validate the proposed approach.
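The cross-layer idea reduces to a mapping function at the MAC layer. The toy sketch below sends more important HEVC frame types to higher-priority queues and demotes them under congestion; the AC labels, threshold, and policy are illustrative only, not the paper's tuned mapping.

```python
def map_to_access_category(frame_type, network_load):
    """Toy cross-layer mapping from HEVC frame importance to an EDCA-like
    queue. Assumes AC0 is the highest-priority queue (illustrative)."""
    base = {"I": 0, "P": 1, "B": 2}[frame_type]
    if network_load > 0.8:          # congested: protect only the I-frames
        base = min(base + 1, 3)
    return f"AC{base}"

for ft in "IPB":
    print(ft, map_to_access_category(ft, network_load=0.9))
```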

CCMR: High Resolution Optical Flow Estimation via Coarse-to-Fine Context-Guided Motion Reasoning

  • paper_url: http://arxiv.org/abs/2311.02661
  • repo_url: https://github.com/cv-stuttgart/CCMR
  • paper_authors: Azin Jahedi, Maximilian Luz, Marc Rivinius, Andrés Bruhn
  • for: High-resolution multi-scale optical flow estimation.
  • methods: A coarse-to-fine approach built on attention-based motion grouping: a hierarchical two-step attention-based context-motion grouping strategy first computes global multi-scale context features and then uses them to guide the actual motion grouping, with cross-covariance image transformers for an efficient, scale-aware realization.
  • results: Highly detailed flow fields with strong improvements in both occluded and non-occluded regions, outperforming the corresponding single-scale attention-based and multi-scale attention-free baselines by up to 23.0% and 21.6%, and achieving state-of-the-art results: first on KITTI 2015 and second on MPI Sintel Clean and Final.
    Abstract Attention-based motion aggregation concepts have recently shown their usefulness in optical flow estimation, in particular when it comes to handling occluded regions. However, due to their complexity, such concepts have been mainly restricted to coarse-resolution single-scale approaches that fail to provide the detailed outcome of high-resolution multi-scale networks. In this paper, we hence propose CCMR: a high-resolution coarse-to-fine approach that leverages attention-based motion grouping concepts to multi-scale optical flow estimation. CCMR relies on a hierarchical two-step attention-based context-motion grouping strategy that first computes global multi-scale context features and then uses them to guide the actual motion grouping. As we iterate both steps over all coarse-to-fine scales, we adapt cross covariance image transformers to allow for an efficient realization while maintaining scale-dependent properties. Experiments and ablations demonstrate that our efforts of combining multi-scale and attention-based concepts pay off. By providing highly detailed flow fields with strong improvements in both occluded and non-occluded regions, our CCMR approach not only outperforms both the corresponding single-scale attention-based and multi-scale attention-free baselines by up to 23.0% and 21.6%, respectively, it also achieves state-of-the-art results, ranking first on KITTI 2015 and second on MPI Sintel Clean and Final. Code and trained models are available at https://github.com/cv-stuttgart/CCMR.

Region of Interest (ROI) based adaptive cross-layer system for real-time video streaming over Vehicular Ad-hoc NETworks (VANETs)

  • paper_url: http://arxiv.org/abs/2311.02656
  • repo_url: None
  • paper_authors: Mohamed Aymen Labiod, Mohamed Gharbi, François-Xavier Coudoux, Patrick Corlay
  • for: Improving vehicular video transmission quality to strengthen perception of the driving environment.
  • methods: An adaptive cross-layer mapping of region-of-interest (ROI) video data packets at the IEEE 802.11p MAC layer, giving highest priority to the scene regions on which perception of the driving environment is based.
  • results: Realistic VANET simulations show PSNR gains of up to 11 dB on the ROI part for HEVC-compressed video communications.
    Abstract Nowadays, real-time vehicle applications increasingly rely on video acquisition and processing to detect or even identify vehicles and obstacles in the driving environment. In this letter, we propose an algorithm that allows reinforcing these operations by improving end-to-end video transmission quality in a vehicular context. The proposed low complexity solution gives highest priority to the scene regions of interest (ROI) on which the perception of the driving environment is based on. This is done by applying an adaptive cross-layer mapping of the ROI visual data packets at the IEEE 802.11p MAC layer. Realistic VANET simulation results demonstrate that for HEVC compressed video communications, the proposed system offers PSNR gains up to 11dB on the ROI part.

Generative Face Video Coding Techniques and Standardization Efforts: A Review

  • paper_url: http://arxiv.org/abs/2311.02649
  • repo_url: None
  • paper_authors: Bolin Chen, Jie Chen, Shiqi Wang, Yan Ye
  • for: Reviewing recent advances in generative face video coding (GFVC) and related standardization efforts, enabling high-quality face video communication in ultra-low-bandwidth scenarios.
  • methods: GFVC systems are generalized within one coding framework; different GFVC algorithms and their corresponding visual representations are summarized, together with standardization activities specified via supplemental enhancement information messages.
  • results: A discussion of fundamental challenges, broad applications, standardization potential, and future trends of GFVC techniques.
    Abstract Generative Face Video Coding (GFVC) techniques can exploit the compact representation of facial priors and the strong inference capability of deep generative models, achieving high-quality face video communication in ultra-low bandwidth scenarios. This paper conducts a comprehensive survey on the recent advances of the GFVC techniques and standardization efforts, which could be applicable to ultra low bitrate communication, user-specified animation/filtering and metaverse-related functionalities. In particular, we generalize GFVC systems within one coding framework and summarize different GFVC algorithms with their corresponding visual representations. Moreover, we review the GFVC standardization activities that are specified with supplemental enhancement information messages. Finally, we discuss fundamental challenges and broad applications on GFVC techniques and their standardization potentials, as well as envision their future trends. The project page can be found at https://github.com/Berlin0610/Awesome-Generative-Face-Video-Coding.

An Approach for Multi-Object Tracking with Two-Stage Min-Cost Flow

  • paper_url: http://arxiv.org/abs/2311.02642
  • repo_url: None
  • paper_authors: Huining Li, Yalong Jiang, Xianlin Zeng, Feng Li, Zhipeng Wang
  • for: A two-stage tracking pipeline that tracks multiple targets in video accurately while reducing the impact of occlusions.
  • methods: Minimum-network-flow tracking that exploits tracklet intersections and low-confidence detections: the first stage uses high-confidence detections as input and an intersection mask to locate the inaccurate parts of candidate tracklets, and the second stage corrects them using low-confidence detections that may be attributed to occlusions.
  • results: Experiments on popular MOT benchmark datasets achieve 78.4 MOTA on the MOT16 test set, 79.2 on MOT17, and 76.4 on MOT20, showing the method's effectiveness.
    Abstract The minimum network flow algorithm is widely used in multi-target tracking. However, the majority of the present methods concentrate exclusively on minimizing cost functions whose values may not indicate accurate solutions under occlusions. In this paper, by exploiting the properties of tracklets intersections and low-confidence detections, we develop a two-stage tracking pipeline with an intersection mask that can accurately locate inaccurate tracklets which are corrected in the second stage. Specifically, we employ the minimum network flow algorithm with high-confidence detections as input in the first stage to obtain the candidate tracklets that need correction. Then we leverage the intersection mask to accurately locate the inaccurate parts of candidate tracklets. The second stage utilizes low-confidence detections that may be attributed to occlusions for correcting inaccurate tracklets. This process constructs a graph of nodes in inaccurate tracklets and low-confidence nodes and uses it for the second round of minimum network flow calculation. We perform sufficient experiments on popular MOT benchmark datasets and achieve 78.4 MOTA on the test set of MOT16, 79.2 on MOT17, and 76.4 on MOT20, which shows that the proposed method is effective.

The Background Also Matters: Background-Aware Motion-Guided Objects Discovery

  • paper_url: http://arxiv.org/abs/2311.02633
  • repo_url: None
  • paper_authors: Sandra Kara, Hejer Ammar, Florian Chabot, Quoc-Cuong Pham
  • for: To improve the accuracy and efficiency of object discovery in video data.
  • methods: Motion masks computed from optical flow are extended, via a learning mechanism, to the true foreground composed of both moving and static objects, and the object discovery task is learned jointly with object/non-object separation.
  • results: Experiments on synthetic and real-world data show that combining the proposed background handling with various state-of-the-art methods considerably improves object discovery performance and establishes a strong baseline for object/non-object separation.
    Abstract Recent works have shown that objects discovery can largely benefit from the inherent motion information in video data. However, these methods lack a proper background processing, resulting in an over-segmentation of the non-object regions into random segments. This is a critical limitation given the unsupervised setting, where object segments and noise are not distinguishable. To address this limitation we propose BMOD, a Background-aware Motion-guided Objects Discovery method. Concretely, we leverage masks of moving objects extracted from optical flow and design a learning mechanism to extend them to the true foreground composed of both moving and static objects. The background, a complementary concept of the learned foreground class, is then isolated in the object discovery process. This enables a joint learning of the objects discovery task and the object/non-object separation. The conducted experiments on synthetic and real-world datasets show that integrating our background handling with various cutting-edge methods brings each time a considerable improvement. Specifically, we improve the objects discovery performance with a large margin, while establishing a strong baseline for object/non-object separation.
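The method starts from masks of moving objects extracted from optical flow. The snippet below is a minimal sketch of that motion cue using OpenCV's Farneback flow; the magnitude threshold and morphological cleanup are illustrative assumptions, and the paper's learned extension of the mask to static foreground objects is not shown.

```python
import numpy as np
import cv2

def motion_mask(frame_a, frame_b, mag_thresh=2.0):
    """frame_a, frame_b: consecutive grayscale frames, (H, W) uint8."""
    flow = cv2.calcOpticalFlowFarneback(frame_a, frame_b, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    mag = np.linalg.norm(flow, axis=-1)          # per-pixel flow magnitude
    mask = (mag > mag_thresh).astype(np.uint8)   # pixels that move
    # Remove speckle so only coherent moving regions survive.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```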

Neural Networks Are Implicit Decision Trees: The Hierarchical Simplicity Bias

  • paper_url: http://arxiv.org/abs/2311.02622
  • repo_url: None
  • paper_authors: Zhehang Du
  • for: This paper aims to investigate the phenomenon of simplicity bias in neural networks and explore how they rely on simpler features while ignoring more complex ones, even when the complex features are equally predictive.
  • methods: The authors introduce a novel approach called imbalanced label coupling to study scenarios where simple and complex features exhibit different levels of predictive power. They train neural networks on these scenarios and analyze how the networks make predictions based on the ascending complexity of input features.
  • results: The authors find that the trained networks make predictions that align with the ascending complexity of input features, regardless of the underlying predictive power. For example, even when simple spurious features distort predictions in CIFAR-10, the networks still learn core features. However, last-layer retraining with the target data distribution is effective but insufficient to fully recover core features when spurious features are perfectly correlated with the target labels in the synthetic dataset. These findings provide direct evidence that neural networks learn core features in the presence of spurious features.
    Abstract Neural networks exhibit simplicity bias; they rely on simpler features while ignoring equally predictive but more complex features. In this work, we introduce a novel approach termed imbalanced label coupling to investigate scenarios where simple and complex features exhibit different levels of predictive power. In these cases, complex features still contribute to predictions. The trained networks make predictions in alignment with the ascending complexity of input features according to how they correlate with the label in the training set, irrespective of the underlying predictive power. For instance, even when simple spurious features distort predictions in CIFAR-10, most cats are predicted to be dogs, and most trucks are predicted to be automobiles! This observation provides direct evidence that the neural network learns core features in the presence of spurious features. We empirically show that last-layer retraining with target data distribution is effective, yet insufficient to fully recover core features when spurious features are perfectly correlated with the target labels in our synthetic dataset. We hope our research contributes to a deeper understanding of the implicit bias of neural networks.
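To make the setup concrete, here is a minimal sketch of coupling a simple spurious feature to the labels at a chosen correlation level, in the spirit of imbalanced label coupling. The corner-patch encoding, patch size, and `corr` parameter are illustrative assumptions, not the paper's exact construction.

```python
import torch

def couple_simple_feature(images, labels, num_classes, corr=0.9):
    """images: (N, C, H, W) floats in [0, 1]; labels: (N,) ints.
    Stamps a class-coded 4x4 corner patch that agrees with the label
    with probability corr and encodes a random class otherwise."""
    out = images.clone()
    n = images.shape[0]
    agree = torch.rand(n) < corr
    patch_cls = torch.where(agree, labels,
                            torch.randint(0, num_classes, (n,)))
    for i in range(n):
        # The spurious feature: a constant intensity in channel 0 that
        # encodes patch_cls[i], trivially separable by the network.
        out[i, 0, :4, :4] = patch_cls[i].float() / max(num_classes - 1, 1)
    return out, labels
```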

TFNet: Tuning Fork Network with Neighborhood Pixel Aggregation for Improved Building Footprint Extraction

  • paper_url: http://arxiv.org/abs/2311.02617
  • repo_url: None
  • paper_authors: Muhammad Ahmad Waseem, Muhammad Tahir, Zubair Khalid, Momin Uppal
  • for: This paper addresses the problem of extracting building footprints from satellite imagery, a task critical for many urban planning and decision-making applications.
  • methods: The paper proposes a novel Tuning Fork Network (TFNet) for deep semantic segmentation that performs well both for widely-spaced buildings and for buildings that are closely packed together. The TFNet architecture consists of a single encoder followed by two parallel decoders that separately reconstruct the building footprint and the building edge. In addition, TFNet is coupled with a novel methodology for incorporating neighborhood information at tile boundaries during training.
  • results: Evaluated on the SpaceNet2 and WHU datasets, as well as a dataset from Lahore, Pakistan that captures closely connected buildings, the proposed method significantly outperforms benchmark methods on all three datasets.
    Abstract This paper considers the problem of extracting building footprints from satellite imagery -- a task that is critical for many urban planning and decision-making applications. While recent advancements in deep learning have made great strides in automated detection of building footprints, state-of-the-art methods available in existing literature often generate erroneous results for areas with densely connected buildings. Moreover, these methods do not incorporate the context of neighborhood images during training, thus generally resulting in poor performance at image boundaries. In light of these gaps, we propose a novel Tuning Fork Network (TFNet) design for deep semantic segmentation that not only performs well for widely-spaced buildings but also has good performance for buildings that are closely packed together. The novelty of the TFNet architecture lies in a single encoder followed by two parallel decoders to separately reconstruct the building footprint and the building edge. In addition, the TFNet design is coupled with a novel methodology of incorporating neighborhood information at the tile boundaries during the training process. This methodology further improves performance, especially at the tile boundaries. For performance comparisons, we utilize the SpaceNet2 and WHU datasets, as well as a dataset from an area in Lahore, Pakistan that captures closely connected buildings. For all three datasets, the proposed methodology is found to significantly outperform benchmark methods.
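A minimal PyTorch sketch of the tuning-fork layout described above: one shared encoder whose features feed two parallel decoders, one predicting the footprint mask and one the building edge. Channel widths and block depths are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TuningForkNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        # Two decoders share the same encoder features (the "tuning fork").
        self.dec_footprint = nn.Sequential(
            nn.Upsample(scale_factor=2), conv_block(64, 32),
            nn.Conv2d(32, 1, 1))
        self.dec_edge = nn.Sequential(
            nn.Upsample(scale_factor=2), conv_block(64, 32),
            nn.Conv2d(32, 1, 1))

    def forward(self, x):
        f = self.pool(self.enc2(self.enc1(x)))
        # One logit map per head: footprint mask and edge mask.
        return self.dec_footprint(f), self.dec_edge(f)
```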

Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.02612
  • repo_url: https://github.com/zhangzjn/GPT-4V-AD
  • paper_authors: Jiangning Zhang, Xuhai Chen, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu
  • for: This paper explores the Visual Question Answering (VQA) paradigm for zero-shot visual Anomaly Detection (AD), and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
  • methods: The model builds on the Large Multimodal Model (LMM) GPT-4V and contains three components: 1) Granular Region Division, 2) Prompt Designing, and 3) Text2Segmentation for easy quantitative evaluation.
  • results: The framework achieves respectable zero-shot AD results, e.g., image-level AU-ROCs of 77.1/88.0 and pixel-level AU-ROCs of 68.0/76.6 on the MVTec AD and VisA datasets. However, a gap to zero-shot methods such as WinCLIP and CLIP-AD remains, calling for further research.
    Abstract Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image-/pixel-level evaluations, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, 3) Text2Segmentation for easy quantitative evaluation, and we make several different attempts for comparative analysis. The results show that GPT-4V can achieve respectable results in the zero-shot AD task through a VQA paradigm, such as achieving image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on MVTec AD and VisA datasets, respectively. However, its performance still has a certain gap compared to the state-of-the-art zero-shot methods, e.g., WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for the research of VQA-oriented LMM in the zero-shot AD task, and we also outline several possible future directions. Code is available at \url{https://github.com/zhangzjn/GPT-4V-AD}.
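To illustrate the Granular Region Division and Text2Segmentation components, the sketch below divides an image into an indexed grid so a VQA prompt can refer to regions by number, then maps region-level answers back to a coarse anomaly mask. The grid size and the crude cell-level mask are illustrative assumptions; the actual prompt design and GPT-4V call are omitted.

```python
from PIL import Image
import numpy as np

def divide_regions(img: Image.Image, grid=4):
    """Split an image into grid*grid indexed sub-regions for prompting."""
    w, h = img.size
    regions = []
    for r in range(grid):
        for c in range(grid):
            box = (c * w // grid, r * h // grid,
                   (c + 1) * w // grid, (r + 1) * h // grid)
            regions.append((r * grid + c, img.crop(box)))
    return regions

def answers_to_mask(anomalous_ids, grid=4, size=256):
    """Text2Segmentation, crudely: mark every grid cell the VQA model
    flagged as anomalous."""
    mask = np.zeros((size, size), dtype=np.uint8)
    cell = size // grid
    for i in anomalous_ids:
        r, c = divmod(i, grid)
        mask[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = 1
    return mask
```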

Deep Learning-based 3D Point Cloud Classification: A Systematic Survey and Outlook

  • paper_url: http://arxiv.org/abs/2311.02608
  • repo_url: None
  • paper_authors: Huang Zhang, Changshuo Wang, Shengwei Tian, Baoli Lu, Liping Zhang, Xin Ning, Xiao Bai
  • for: The purpose of this paper is to provide researchers in related fields with the latest research progress and future trends in point cloud classification.
  • methods: The paper reviews point cloud acquisition, characteristics, and challenges, then introduces common 3D data representations, storage formats, and datasets, as well as deep learning-based methods for point cloud classification.
  • results: The paper compares and analyzes the performance of the main methods and discusses open challenges and future directions for point cloud classification.
    Abstract In recent years, point cloud representation has become one of the research hotspots in the field of computer vision, and has been widely used in many fields, such as autonomous driving, virtual reality, robotics, etc. Although deep learning techniques have achieved great success in processing regular structured 2D grid image data, there are still great challenges in processing irregular, unstructured point cloud data. Point cloud classification is the basis of point cloud analysis, and many deep learning-based methods have been widely used in this task. Therefore, the purpose of this paper is to provide researchers in this field with the latest research progress and future trends. First, we introduce point cloud acquisition, characteristics, and challenges. Second, we review 3D data representations, storage formats, and commonly used datasets for point cloud classification. We then summarize deep learning-based methods for point cloud classification and complement recent research work. Next, we compare and analyze the performance of the main methods. Finally, we discuss some challenges and future directions for point cloud classification.

Optimizing Implicit Neural Representations from Point Clouds via Energy-Based Models

  • paper_url: http://arxiv.org/abs/2311.02601
  • repo_url: None
  • paper_authors: Ryutaro Yamauchi, Jinya Sakurai, Ryo Furukawa, Tatsushi Matsubayashi
  • for: Reconstructing continuous surfaces from unoriented 3D point clouds.
  • methods: Implicit neural representations are optimized with energy-based models, using the absolute value of the coordinate-based network as the energy function.
  • results: Improved robustness to point cloud noise compared with conventional surface reconstruction methods.
    Abstract Reconstructing a continuous surface from an unoriented 3D point cloud is a fundamental task in 3D shape processing. In recent years, several methods have been proposed to address this problem using implicit neural representations (INRs). In this study, we propose a method to optimize INRs using energy-based models (EBMs). By employing the absolute value of the coordinate-based neural networks as the energy function, the INR can be optimized through the estimation of the point cloud distribution by the EBM. In addition, appropriate parameter settings of the EBM enable the model to consider the magnitude of point cloud noise. Our experiments confirmed that the proposed method is more robust against point cloud noise than conventional surface reconstruction methods.
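A minimal sketch of the core idea: treat the absolute value of a coordinate-based network as an energy, so that points on the surface receive near-zero energy (unsigned distance). The hinge term over random off-surface samples is an illustrative stand-in for the paper's EBM-based estimation of the point-cloud distribution.

```python
import torch
import torch.nn as nn

class UDFNet(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def energy(self, x):
        # |f(x)| plays the role of an unsigned distance / energy.
        return self.net(x).abs().squeeze(-1)

def train_step(model, opt, surface_pts, margin=0.1):
    """surface_pts: (B, 3) points sampled from the raw point cloud."""
    neg = torch.rand_like(surface_pts) * 2 - 1        # uniform in [-1, 1]^3
    # Low energy on observed surface points, pushed-up energy elsewhere.
    loss = model.energy(surface_pts).mean() \
         + torch.relu(margin - model.energy(neg)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```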

Learning Class and Domain Augmentations for Single-Source Open-Domain Generalization

  • paper_url: http://arxiv.org/abs/2311.02599
  • repo_url: None
  • paper_authors: Prathmesh Bele, Valay Bundele, Avigyan Bhattacharya, Ankit Jha, Gemma Roig, Biplab Banerjee
  • for: To address single-source open-domain generalization (SS-ODG), where training uses a labeled source domain under supervision while testing encounters unlabeled novel target domains containing previously unseen classes.
  • methods: The proposed SODG-Net framework simultaneously synthesizes novel domains and generates pseudo-open samples with a learning-based objective, rather than the ad-hoc mixing strategies common in the literature. It improves generalization by diversifying the styles of known-class samples with a novel metric criterion and generates diverse pseudo-open samples to train a unified, confident multi-class classifier that handles both open- and closed-set data.
  • results: Extensive experimental evaluations on multiple benchmarks consistently demonstrate the superior performance of SODG-Net compared to the literature.
    Abstract Single-source open-domain generalization (SS-ODG) addresses the challenge of labeled source domains with supervision during training and unlabeled novel target domains during testing. The target domain includes both known classes from the source domain and samples from previously unseen classes. Existing techniques for SS-ODG primarily focus on calibrating source-domain classifiers to identify open samples in the target domain. However, these methods struggle with visually fine-grained open-closed data, often misclassifying open samples as closed-set classes. Moreover, relying solely on a single source domain restricts the model's ability to generalize. To overcome these limitations, we propose a novel framework called SODG-Net that simultaneously synthesizes novel domains and generates pseudo-open samples using a learning-based objective, in contrast to the ad-hoc mixing strategies commonly found in the literature. Our approach enhances generalization by diversifying the styles of known class samples using a novel metric criterion and generates diverse pseudo-open samples to train a unified and confident multi-class classifier capable of handling both open and closed-set data. Extensive experimental evaluations conducted on multiple benchmarks consistently demonstrate the superior performance of SODG-Net compared to the literature.
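One way to picture the style-diversification step is a feature-statistics re-mixing in the AdaIN family, sketched below. SODG-Net learns its domain synthesis with a dedicated objective, so this is only an illustrative stand-in, with the mixing ratio `alpha` as an assumption.

```python
import torch

def mix_style(feat_a, feat_b, alpha=0.5, eps=1e-6):
    """feat_*: (B, C, H, W) backbone feature maps from two samples."""
    mu_a = feat_a.mean((2, 3), keepdim=True)
    sig_a = feat_a.std((2, 3), keepdim=True)
    mu_b = feat_b.mean((2, 3), keepdim=True)
    sig_b = feat_b.std((2, 3), keepdim=True)
    mu = alpha * mu_a + (1 - alpha) * mu_b        # interpolated style stats
    sig = alpha * sig_a + (1 - alpha) * sig_b
    normalized = (feat_a - mu_a) / (sig_a + eps)  # strip the source style
    return normalized * sig + mu                  # apply the mixed style
```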

Synthetic Tumor Manipulation: With Radiomics Features

  • paper_url: http://arxiv.org/abs/2311.02586
  • repo_url: None
  • paper_authors: Inye Na, Jonghun Kim, Hyunjin Park
  • for: Generating synthetic tumor subregions with detailed, individualized control.
  • methods: Generative adversarial networks, radiomics-feature conditioning, and multi-task learning.
  • results: The model generates diverse, realistic tumor images and can be fine-tuned for specific radiomics features such as 'Pixel Surface' and 'Shape Sphericity'.
    Abstract We introduce RadiomicsFill, a synthetic tumor generator conditioned on radiomics features, enabling detailed control and individual manipulation of tumor subregions. This conditioning leverages conventional high-dimensional features of the tumor (i.e., radiomics features) and thus is biologically well-grounded. Our model combines generative adversarial networks, radiomics-feature conditioning, and multi-task learning. Through experiments with glioma patients, RadiomicsFill demonstrated its capability to generate diverse, realistic tumors and its fine-tuning ability for specific radiomics features like 'Pixel Surface' and 'Shape Sphericity'. The ability of RadiomicsFill to generate an unlimited number of realistic synthetic tumors offers notable prospects for both advancing medical imaging research and potential clinical applications.
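As an illustration of radiomics-feature conditioning, the sketch below concatenates a radiomics vector with the noise input of a small DCGAN-style generator, so that features such as 'Shape Sphericity' can steer the synthesized tumor patch. The architecture, dimensions, and patch size are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class RadiomicsConditionedG(nn.Module):
    def __init__(self, z_dim=64, radiomics_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + radiomics_dim, 128, 4, 1, 0),
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh())

    def forward(self, z, radiomics):
        """z: (B, z_dim) noise; radiomics: (B, radiomics_dim), normalized."""
        # Concatenate noise with the radiomics features and broadcast
        # them as 1x1 spatial maps for the transposed convolutions.
        h = torch.cat([z, radiomics], dim=1)[..., None, None]
        return self.net(h)  # (B, 1, 16, 16) synthetic tumor patch
```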

SSL-DG: Rethinking and Fusing Semi-supervised Learning and Domain Generalization in Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.02583
  • repo_url: https://github.com/yezanting/ssl-dg
  • paper_authors: Zanting Ye
  • for: This paper proposes a deep learning-based medical image segmentation method that copes with limited annotated data while handling domain shift.
  • methods: The paper fuses semi-supervised learning (SSL) and domain generalization (DG): at the class level, unseen target data are represented as linear combinations of source data, achieved through data augmentation, to realize cross-domain generalization.
  • results: Experimental results show that, compared with state-of-the-art methods, the proposed approach excels on two challenging tasks with better consistency and reliability.
    Abstract Deep learning-based medical image segmentation is an essential yet challenging task in clinical practice, which arises from restricted access to annotated data coupled with the occurrence of domain shifts. Previous attempts have focused on isolated solutions, while disregarding their inter-connectedness. In this paper, we rethink the relationship between semi-supervised learning (SSL) and domain generalization (DG), which are the cutting-edge approaches to address the annotated data-driven constraints and the domain shift issues. Inspired by class-level representation, we show that unseen target data can be represented by a linear combination of source data, which can be achieved by simple data augmentation. The augmented data enrich domain distributions while having semantic consistency, aligning with the principles of consistency-based SSL. Accordingly, we propose SSL-DG, fusing DG and SSL, to achieve cross-domain generalization with limited annotations. Specifically, the global and focal region augmentation, together with an augmentation scale-balancing mechanism, are used to construct a mask-based domain diffusion augmentation module to significantly enrich domain diversity. In order to obtain consistent predictions for the same source data in different networks, we use uncertainty estimation and a deep mutual learning strategy to enforce the consistent constraint. Extensive experiments including ablation studies are designed to validate the proposed SSL-DG. The results demonstrate that our SSL-DG significantly outperforms state-of-the-art solutions in two challenging DG tasks with limited annotations. Code is available at https://github.com/yezanting/SSL-DG.
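The sketch below illustrates the augmentation premise in its simplest form: a mask-based linear combination of two source samples (and their labels) yields a new training pair whose appearance lies between domains while the semantics stay consistent. The rectangular focal mask and fixed mixing region are illustrative assumptions, far simpler than the paper's mask-based domain diffusion module.

```python
import torch

def mask_mix(img_a, img_b, lbl_a, lbl_b):
    """img_*: (C, H, W) images; lbl_*: (H, W) label maps.
    Pastes a rectangular focal region of img_b into img_a and mixes
    the labels the same way."""
    _, H, W = img_a.shape
    h, w = H // 2, W // 2
    top = torch.randint(0, H - h + 1, (1,)).item()
    left = torch.randint(0, W - w + 1, (1,)).item()
    mask = torch.zeros(H, W)
    mask[top:top + h, left:left + w] = 1.0
    mixed_img = img_a * (1 - mask) + img_b * mask
    mixed_lbl = torch.where(mask.bool(), lbl_b, lbl_a)
    return mixed_img, mixed_lbl
```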

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search : An Adaptive Binary Splitting Approach

  • paper_url: http://arxiv.org/abs/2311.02573
  • repo_url: None
  • paper_authors: Kashish Mittal, Harsh Shah, Ajit Rajwade
  • for: This paper proposes an adaptive group testing framework for the range-based high-dimensional near neighbor search problem.
  • methods: The method marks each item in the collection as neighbor or non-neighbor based on a cosine distance threshold, without exhaustively searching the library. It uses a multi-stage adaptive group testing algorithm that splits the candidate set into halves and performs dot product tests on progressively smaller subsets, many of which can be pruned away to save time.
  • results: Experiments show a speed-up over exhaustive search by a factor of more than ten, with the same accuracy as exhaustive search, on a variety of large datasets. The paper also provides a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned.
    Abstract This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. The proposed method detects high-similarity vectors from an extensive collection of high dimensional vectors, where each vector represents an image descriptor. Our method efficiently marks each item in the collection as neighbor or non-neighbor on the basis of a cosine distance threshold without exhaustive search. Like other methods in the domain of large scale retrieval, our approach exploits the assumption that most of the items in the collection are unrelated to the query. Unlike other methods, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following the procedure of binary splitting, a multi-stage adaptive group testing algorithm, we split the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We experimentally show that our method achieves a speed-up over exhaustive search by a factor of more than ten with an accuracy same as that of exhaustive search, on a variety of large datasets. We present a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned. In this way, our method exploits very useful and practical distributional properties unlike other methods. In our method, all required data structures are created purely offline. Moreover, our method does not impose any strong assumptions on the number of true near neighbors, is adaptable to streaming settings where new vectors are dynamically added to the database, and does not require any parameter tuning.
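A minimal sketch of the binary splitting loop follows: each pool is tested with a single dot product against its pooled (summed) vector, pruned if even an optimistic bound rules out a neighbor, and otherwise split in half. The pruning bound shown is a deliberately loose stand-in for the paper's threshold analysis, and in practice the pooled vectors would be precomputed offline.

```python
import numpy as np

def group_search(query, items, sim_thresh):
    """query: (d,) unit vector; items: (n, d) unit vectors.
    Returns indices with dot product (cosine similarity) >= sim_thresh."""
    found = []
    stack = [np.arange(len(items))]
    while stack:
        idx = stack.pop()
        if len(idx) == 1:
            if items[idx[0]] @ query >= sim_thresh:
                found.append(int(idx[0]))
            continue
        # One dot product against the pooled sum stands in for len(idx)
        # individual tests (a real system precomputes pooled sums offline).
        pooled = items[idx].sum(axis=0) @ query
        # Each member contributes at least -1, so a pool containing a
        # true neighbor must satisfy pooled >= sim_thresh - (len(idx) - 1);
        # otherwise the whole pool can be pruned.
        if pooled < sim_thresh - (len(idx) - 1):
            continue
        mid = len(idx) // 2
        stack.extend([idx[:mid], idx[mid:]])
    return found
```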

Multiple Object Tracking based on Occlusion-Aware Embedding Consistency Learning

  • paper_url: http://arxiv.org/abs/2311.02572
  • repo_url: None
  • paper_authors: Yaoqi Hu, Axi Niu, Yu Zhu, Qingsen Yan, Jinqiu Sun, Yanning Zhang
  • for: Addressing track interruptions caused by occlusion in multiple object tracking.
  • methods: Exploiting the consistency of visual embeddings, with an Occlusion Prediction Module and an Occlusion-Aware Association Module, to resolve tracking interruptions caused by occlusion.
  • results: Promising tracking performance in both unoccluded and occluded scenarios.
    Abstract The Joint Detection and Embedding (JDE) framework has achieved remarkable progress for multiple object tracking. Existing methods often employ extracted embeddings to re-establish associations between new detections and previously disrupted tracks. However, the reliability of embeddings diminishes when the region of the occluded object frequently contains adjacent objects or clutters, especially in scenarios with severe occlusion. To alleviate this problem, we propose a novel multiple object tracking method based on visual embedding consistency, mainly including: 1) Occlusion Prediction Module (OPM) and 2) Occlusion-Aware Association Module (OAAM). The OPM predicts occlusion information for each true detection, facilitating the selection of valid samples for consistency learning of the track's visual embedding. The OAAM leverages occlusion cues and visual embeddings to generate two separate embeddings for each track, guaranteeing consistency in both unoccluded and occluded detections. By integrating these two modules, our method is capable of addressing track interruptions caused by occlusion in online tracking scenarios. Extensive experimental results demonstrate that our approach achieves promising performance levels in both unoccluded and occluded tracking scenarios.

Rotation Invariant Transformer for Recognizing Object in UAVs

  • paper_url: http://arxiv.org/abs/2311.02559
  • repo_url: None
  • paper_authors: Shuoyi Chen, Mang Ye, Bo Du
  • for: The goal of this work is to improve object recognition from UAVs, especially under large rotation variations.
  • methods: The work proposes a novel rotation invariant vision transformer (RotTrans) that simulates the rotation operation at the patch feature level. In addition, an invariance constraint is designed to establish the relationship between the original features and the rotated features, achieving stronger rotation invariance.
  • results: The proposed RotTrans, tested on the latest UAV datasets, greatly outperforms the current state of the art, with mAP and Rank1 higher by 5.9% and 4.8%, respectively. The model is also competitive for person re-identification on traditional city cameras; notably, this solution won first place in the UAV-based person re-identification track of the Multi-Modal Video Reasoning and Analyzing Competition held at ICCV 2021.
    Abstract Recognizing a target of interest from the UAVs is much more challenging than the existing object re-identification tasks across multiple city cameras. The images taken by the UAVs usually suffer from significant size difference when generating the object bounding boxes and uncertain rotation variations. Existing methods are usually designed for city cameras, incapable of handling the rotation issue in UAV scenarios. A straightforward solution is to perform the image-level rotation augmentation, but it would cause loss of useful information when inputting the powerful vision transformer as patches. This motivates us to simulate the rotation operation at the patch feature level, proposing a novel rotation invariant vision transformer (RotTrans). This strategy builds on high-level features with the help of the specificity of the vision transformer structure, which enhances the robustness against large rotation differences. In addition, we design invariance constraint to establish the relationship between the original feature and the rotated features, achieving stronger rotation invariance. Our proposed transformer tested on the latest UAV datasets greatly outperforms the current state of the art, with mAP and Rank1 that are 5.9\% and 4.8\% higher than the previous best. Notably, our model also performs competitively for the person re-identification task on traditional city cameras. In particular, our solution wins the first place in the UAV-based person re-recognition track in the Multi-Modal Video Reasoning and Analyzing Competition held in ICCV 2021. Code is available at https://github.com/whucsy/RotTrans.
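The patch-level rotation can be pictured as rotating the token grid rather than the image, as in the sketch below; the mean-pooled consistency loss is an illustrative stand-in for the paper's invariance constraint, not its exact form.

```python
import torch
import torch.nn.functional as F

def rotate_tokens(tokens, grid_hw, k=1):
    """tokens: (B, N, D) patch tokens (no CLS); grid_hw: (H, W), H*W == N.
    Rotates the token grid by k * 90 degrees without re-encoding images."""
    B, N, D = tokens.shape
    H, W = grid_hw
    grid = tokens.view(B, H, W, D)
    rotated = torch.rot90(grid, k, dims=(1, 2))
    return rotated.reshape(B, -1, D)

def invariance_loss(head, tokens, grid_hw):
    # Pooled features should agree across the simulated rotations.
    feats = [head(rotate_tokens(tokens, grid_hw, k).mean(dim=1))
             for k in range(4)]
    ref = feats[0].detach()
    return sum(F.mse_loss(f, ref) for f in feats[1:]) / 3
```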

Multi-Agent 3D Map Reconstruction and Change Detection in Microgravity with Free-Flying Robots

  • paper_url: http://arxiv.org/abs/2311.02558
  • repo_url: None
  • paper_authors: Holly Dinkel, Julia Di, Jamie Santos, Keenan Albee, Paulo Borges, Marina Moreira, Oleg Alexandrov, Brian Coltin, Trey Smith
  • for: This paper aims to help maintain and monitor future crewed outposts using autonomous free-flying robots.
  • methods: The paper uses multi-agent cooperative mapping and change detection: one agent reconstructs a 3D model of the outpost from sequences of images and corresponding depth information, while another agent periodically scans the environment and compares it against the 3D model.
  • results: Change detection is validated using real image and pose data collected by Astrobee robots in a ground testing environment and in microgravity aboard the ISS.
    Abstract Assistive free-flyer robots autonomously caring for future crewed outposts -- such as NASA's Astrobee robots on the International Space Station (ISS) -- must be able to detect day-to-day interior changes to track inventory, detect and diagnose faults, and monitor the outpost status. This work presents a framework for multi-agent cooperative mapping and change detection to enable robotic maintenance of space outposts. One agent is used to reconstruct a 3D model of the environment from sequences of images and corresponding depth information. Another agent is used to periodically scan the environment for inconsistencies against the 3D model. Change detection is validated after completing the surveys using real image and pose data collected by Astrobee robots in a ground testing environment and from microgravity aboard the ISS. This work outlines the objectives, requirements, and algorithmic modules for the multi-agent reconstruction system, including recommendations for its use by assistive free-flyers aboard future microgravity outposts.
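A minimal sketch of the change-detection idea: voxelize the reconstructed reference model and a new survey scan, then flag voxels whose occupancy differs. The voxel size and plain set-difference test are illustrative assumptions, far simpler than the Astrobee pipeline.

```python
import numpy as np

def voxelize(points, voxel=0.05):
    """points: (N, 3) in the map frame -> set of occupied voxel indices."""
    return set(map(tuple, np.floor(points / voxel).astype(int)))

def detect_changes(reference_pts, survey_pts, voxel=0.05):
    ref = voxelize(reference_pts, voxel)
    new = voxelize(survey_pts, voxel)
    added = new - ref      # occupied now, empty in the reference model
    removed = ref - new    # present in the model, missing in the survey
    return added, removed
```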

IPVNet: Learning Implicit Point-Voxel Features for Open-Surface 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2311.02552
  • repo_url: None
  • paper_authors: Mohammad Samiul Arshad, William J. Beksi
  • for: Reconstructing 3D open surfaces (e.g., non-watertight meshes), an underexplored area of computer vision.
  • methods: A learning-based implicit point-voxel model (IPVNet) that reconstructs targets at arbitrary resolution. IPVNet learns from raw point cloud data together with its discretized voxel counterpart, which reduces artifacts.
  • results: Experiments on synthetic and real-world datasets show that IPVNet outperforms the state of the art while producing far fewer outliers in the resulting reconstruction.
    Abstract Reconstruction of 3D open surfaces (e.g., non-watertight meshes) is an underexplored area of computer vision. Recent learning-based implicit techniques have removed previous barriers by enabling reconstruction in arbitrary resolutions. Yet, such approaches often rely on distinguishing between the inside and outside of a surface in order to extract a zero level set when reconstructing the target. In the case of open surfaces, this distinction often leads to artifacts such as the artificial closing of surface gaps. However, real-world data may contain intricate details defined by salient surface gaps. Implicit functions that regress an unsigned distance field have shown promise in reconstructing such open surfaces. Nonetheless, current unsigned implicit methods rely on a discretized representation of the raw data. This not only bounds the learning process to the representation's resolution, but it also introduces outliers in the reconstruction. To enable accurate reconstruction of open surfaces without introducing outliers, we propose a learning-based implicit point-voxel model (IPVNet). IPVNet predicts the unsigned distance between a surface and a query point in 3D space by leveraging both raw point cloud data and its discretized voxel counterpart. Experiments on synthetic and real-world public datasets demonstrates that IPVNet outperforms the state of the art while producing far fewer outliers in the resulting reconstruction.

3D-Aware Talking-Head Video Motion Transfer

  • paper_url: http://arxiv.org/abs/2311.02549
  • repo_url: None
  • paper_authors: Haomiao Ni, Jiachen Liu, Yuan Xue, Sharon X. Huang
  • for: Generating a new video that combines the appearance of a subject video with the motion pattern of a driving video.
  • methods: A 3D-aware talking-head video motion transfer network (Head3D) that fully exploits the multi-view appearance features of the subject video, using a self-supervised 3D head geometry learning module and an attention-based fusion network to produce the synthetic video.
  • results: Extensive experiments on two public talking-head video datasets show that Head3D outperforms both 2D and 3D prior arts in the practical cross-identity setting, with evidence that it can be readily adapted to the pose-controllable novel view synthesis task.
    Abstract Motion transfer of talking-head videos involves generating a new video with the appearance of a subject video and the motion pattern of a driving video. Current methodologies primarily depend on a limited number of subject images and 2D representations, thereby neglecting to fully utilize the multi-view appearance features inherent in the subject video. In this paper, we propose a novel 3D-aware talking-head video motion transfer network, Head3D, which fully exploits the subject appearance information by generating a visually-interpretable 3D canonical head from the 2D subject frames with a recurrent network. A key component of our approach is a self-supervised 3D head geometry learning module, designed to predict head poses and depth maps from 2D subject video frames. This module facilitates the estimation of a 3D head in canonical space, which can then be transformed to align with driving video frames. Additionally, we employ an attention-based fusion network to combine the background and other details from subject frames with the 3D subject head to produce the synthetic target video. Our extensive experiments on two public talking-head video datasets demonstrate that Head3D outperforms both 2D and 3D prior arts in the practical cross-identity setting, with evidence showing it can be readily adapted to the pose-controllable novel view synthesis task.

VR-NeRF: High-Fidelity Virtualized Walkable Spaces

  • paper_url: http://arxiv.org/abs/2311.02542
  • repo_url: https://github.com/facebookresearch/EyefulTower
  • paper_authors: Linning Xu, Vasu Agrawal, William Laney, Tony Garcia, Aayush Bansal, Changil Kim, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Aljaž Božič, Dahua Lin, Michael Zollhöfer, Christian Richardt
  • for: This paper presents an end-to-end system for the high-fidelity capture, model reconstruction, and real-time rendering of walkable spaces in virtual reality.
  • methods: A custom multi-camera rig densely captures walkable spaces in high fidelity with multi-view high dynamic range images; a novel perceptual color space is used to learn accurate HDR appearance, and an efficient mip-mapping mechanism enables level-of-detail rendering with anti-aliasing.
  • results: The multi-GPU renderer achieves high-fidelity volume rendering of the neural radiance field model at the full VR resolution of dual 2K$\times$2K at 36 Hz on a custom demo machine. The paper also introduces a challenging high-fidelity dataset and compares the method and dataset to existing baselines.
    Abstract We present an end-to-end system for the high-fidelity capture, model reconstruction, and real-time rendering of walkable spaces in virtual reality using neural radiance fields. To this end, we designed and built a custom multi-camera rig to densely capture walkable spaces in high fidelity and with multi-view high dynamic range images in unprecedented quality and density. We extend instant neural graphics primitives with a novel perceptual color space for learning accurate HDR appearance, and an efficient mip-mapping mechanism for level-of-detail rendering with anti-aliasing, while carefully optimizing the trade-off between quality and speed. Our multi-GPU renderer enables high-fidelity volume rendering of our neural radiance field model at the full VR resolution of dual 2K$\times$2K at 36 Hz on our custom demo machine. We demonstrate the quality of our results on our challenging high-fidelity datasets, and compare our method and datasets to existing baselines. We release our dataset on our project website.

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

  • paper_url: http://arxiv.org/abs/2311.02536
  • repo_url: https://github.com/amzn/augment-the-pairs-wacv2024
  • paper_authors: Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu
  • for: Improving grounding-based vision and language models, specifically at precisely localizing objects referred to in captions.
  • methods: Text-conditioned and text-unconditioned data augmentation strategies, including text-conditioned color jittering and horizontal flipping that preserve semantic consistency between images and captions (captions are modified according to predefined keywords when flipping). In addition, inspired by masked signal reconstruction, pixel-level masking is proposed as a novel form of data augmentation.
  • results: Extensive experiments on three commonly used datasets (Flickr30k, referring expressions, and GQA) demonstrate advanced performance over the state of the art on various metrics, and combining with an image encoder pretrained on large-scale image and language datasets such as CLIP further improves results.
    Abstract Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred in captions. The effectiveness of grounding representation learning heavily relies on the scale of the training dataset. Despite being a useful data enrichment strategy, data augmentation has received minimal attention in existing vision and language tasks as augmentation for image-caption pairs is non-trivial. In this study, we propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations. Specifically, we apply text-conditioned color jittering and horizontal flipping to ensure semantic consistency between images and captions. To guarantee image-caption correspondence in the training samples, we modify the captions according to pre-defined keywords when applying horizontal flipping. Additionally, inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation. While we demonstrate our data augmentation method with MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks. Finally, we show that image encoder pretrained on large-scale image and language datasets (such as CLIP) can further improve the results. Through extensive experiments on three commonly applied datasets: Flickr30k, referring expressions and GQA, our method demonstrates advanced performance over the state-of-the-arts with various metrics. Code can be found in https://github.com/amzn/augment-the-pairs-wacv2024.
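The sketch below illustrates the text-conditioned horizontal flip: box coordinates are mirrored and direction words in the caption are swapped in a single pass so that image-caption correspondence survives the flip. The two-word keyword table is an illustrative stand-in for the paper's predefined keyword list, and the image tensor itself would be flipped separately (e.g., with torchvision's hflip).

```python
import re

SWAP = {"left": "right", "right": "left"}

def flip_pair(image_width, boxes, caption):
    """boxes: list of (x1, y1, x2, y2) in pixels; caption: str."""
    flipped_boxes = [(image_width - x2, y1, image_width - x1, y2)
                     for (x1, y1, x2, y2) in boxes]
    # A single regex pass swaps whole-word 'left'/'right' so a swapped
    # word is never re-swapped back.
    flipped_caption = re.sub(r"\b(left|right)\b",
                             lambda m: SWAP[m.group(1)],
                             caption)
    return flipped_boxes, flipped_caption
```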

TokenMotion: Motion-Guided Vision Transformer for Video Camouflaged Object Detection Via Learnable Token Selection

  • paper_url: http://arxiv.org/abs/2311.02535
  • repo_url: None
  • paper_authors: Zifan Yu, Erfan Bank Tavakoli, Meida Chen, Suya You, Raghuveer Rao, Sanjeev Agarwal, Fengbo Ren
  • for: Improving video camouflaged object detection (VCOD) by addressing texture similarity between targets and their surroundings, as well as irregular motion caused by both objects and camera movement.
  • methods: A transformer-based model that extracts motion-guided features through learnable token selection.
  • results: Evaluated on the challenging MoCA-Mask dataset, TMNet achieves state-of-the-art VCOD performance, improving on the previous best method by 12.8% in weighted F-measure, 8.4% in S-measure, and 10.7% in mean IoU.
    Abstract The area of Video Camouflaged Object Detection (VCOD) presents unique challenges in the field of computer vision due to texture similarities between target objects and their surroundings, as well as irregular motion patterns caused by both objects and camera movement. In this paper, we introduce TokenMotion (TMNet), which employs a transformer-based model to enhance VCOD by extracting motion-guided features using a learnable token selection. Evaluated on the challenging MoCA-Mask dataset, TMNet achieves state-of-the-art performance in VCOD. It outperforms the existing state-of-the-art method by a 12.8% improvement in weighted F-measure, an 8.4% enhancement in S-measure, and a 10.7% boost in mean IoU. The results demonstrate the benefits of utilizing motion-guided features via learnable token selection within a transformer-based framework to tackle the intricate task of VCOD.
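A minimal sketch of learnable token selection: a small scoring head rates each motion-guided token, and a straight-through Gumbel-softmax (an illustrative relaxation, not necessarily the paper's mechanism) yields a hard keep/drop mask that still passes gradients to the scorer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelector(nn.Module):
    def __init__(self, dim, tau=1.0):
        super().__init__()
        self.score = nn.Linear(dim, 2)   # per-token keep/drop logits
        self.tau = tau

    def forward(self, tokens):
        """tokens: (B, N, D) motion-guided tokens."""
        logits = self.score(tokens)                        # (B, N, 2)
        # Hard 0/1 decisions with gradients via the straight-through trick.
        keep = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 0]
        return tokens * keep.unsqueeze(-1), keep           # zero out drops
```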