eess.IV - 2023-11-08

An End-Cloud Computing Enabled Surveillance Video Transmission System

  • paper_url: http://arxiv.org/abs/2311.04685
  • repo_url: None
  • paper_authors: Dingxi Yang, Zhijin Qin, Liting Wang, Xiaoming Tao, Fang Cui, Hengjiang Wang
  • for: Improving surveillance video transmission efficiency and quality
  • methods: End-cloud computing with active down-sampling at the camera, a redundant frame elimination module, and a key-frame assisted video super-resolution model
  • results: Effectively reduces the data volume and significantly outperforms existing video super-resolution models in terms of PSNR and SSIM
    Abstract The enormous data volume of video poses a significant burden on the network. Particularly, transferring high-definition surveillance videos to the cloud consumes a significant amount of spectrum resources. To address these issues, we propose a surveillance video transmission system enabled by end-cloud computing. Specifically, the cameras actively down-sample the original video and then a redundant frame elimination module is employed to further reduce the data volume of surveillance videos. Then we develop a key-frame assisted video super-resolution model to reconstruct the high-quality video at the cloud side. Moreover, we propose a strategy of extracting key frames from source videos for better reconstruction performance by utilizing the peak signal-to-noise ratio (PSNR) of adjacent frames to measure the propagation distance of key frame information. Simulation results show that the developed system can effectively reduce the data volume by the end-cloud collaboration and outperforms existing video super-resolution models significantly in terms of PSNR and structural similarity index (SSIM).
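    The key-frame strategy uses the PSNR between adjacent frames as a proxy for how far a key frame's information propagates. A minimal sketch of that selection rule, assuming OpenCV for frame I/O and an illustrative PSNR threshold (neither is specified by the paper):

```python
import cv2
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two uint8 frames."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def select_key_frames(video_path: str, psnr_floor: float = 30.0) -> list[int]:
    """Start a new key frame whenever the PSNR to the current key frame
    drops below `psnr_floor`, i.e. its information has stopped propagating."""
    cap = cv2.VideoCapture(video_path)
    key_indices, key_frame, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if key_frame is None or psnr(key_frame, frame) < psnr_floor:
            key_indices.append(idx)   # frame no longer resembles the key frame
            key_frame = frame
        idx += 1
    cap.release()
    return key_indices
```

    Frames between key frames can then be aggressively down-sampled, with the retained key frames guiding the super-resolution reconstruction at the cloud side.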

A human brain atlas of chi-separation for normative iron and myelin distributions

  • paper_url: http://arxiv.org/abs/2311.04468
  • repo_url: None
  • paper_authors: Kyeongseon Min, Beomseok Sohn, Woo Jung Kim, Chae Jung Park, Soohwa Song, Dong Hoon Shin, Kyung Won Chang, Na-Young Shin, Minjun Kim, Hyeong-Geol Shin, Phil Hyu Lee, Jongho Lee
  • for: Constructing a normative atlas of paramagnetic iron and diamagnetic myelin distributions in the healthy human brain, providing a detailed anatomical reference for their distributions
  • methods: The chi-separation technique, which successfully disentangles paramagnetic iron from diamagnetic myelin, is used to construct a normative chi-separation atlas from 106 healthy human brains
  • results: The atlas reveals detailed anatomical structures associated with the distributions of iron and myelin, clearly delineating subcortical nuclei and white matter fiber bundles. Susceptibility values in a number of regions of interest are reported along with age-dependent changes. The atlas may have direct applications, such as localization of subcortical structures for deep brain stimulation or high-intensity focused ultrasound, and can serve as a valuable resource for future research
    Abstract Iron and myelin are primary susceptibility sources in the human brain. These substances are essential for a healthy brain, and their abnormalities are often related to various neurological disorders. Recently, an advanced susceptibility mapping technique, referred to as chi-separation, has been proposed, successfully disentangling paramagnetic iron from diamagnetic myelin and opening a new potential for generating iron and myelin maps of the brain. Utilizing this technique, this study constructs a normative chi-separation atlas from 106 healthy human brains. The resulting atlas provides detailed anatomical structures associated with the distributions of iron and myelin, clearly delineating subcortical nuclei and white matter fiber bundles. Additionally, susceptibility values in a number of regions of interest are reported along with age-dependent changes. This atlas may have direct applications such as localization of subcortical structures for deep brain stimulation or high-intensity focused ultrasound and also serve as a valuable resource for future research.

A labeled Clinical-MRI dataset of Nigerian brains

  • paper_url: http://arxiv.org/abs/2311.04425
  • repo_url: https://github.com/bacaron/nigerian_brain_analyses
  • paper_authors: Eberechi Wogu, Patrick Filima, Bradley Caron, Daniel Levitas, Peer Herholz, Catherine Leal, Mohammed F. Mehboob, Soichi Hayashi, Simisola Akintoye, George Ogoh, Tawe Godwin, Damian Eke, Franco Pestilli
  • for: The paper is written for the purpose of describing a magnetic resonance imaging (MRI) dataset from individuals in Nigeria, with the goal of contributing to the global neuroscience community and providing a benchmark for future studies.
  • methods: The paper uses pseudonymized structural MRI (T1w, T2w, FLAIR) data of clinical quality, containing data from 36 healthy control subjects, 32 individuals with age-related dementia, and 20 individuals with Parkinson’s disease.
  • results: The paper presents a dataset of MRI data from individuals in Nigeria, which is currently underrepresented in the global neuroscience community. The dataset provides an opportunity and benchmark for future studies to share data from the African continent.
    Abstract We describe a Magnetic Resonance Imaging (MRI) dataset from individuals from the African nation of Nigeria. The dataset contains pseudonymized structural MRI (T1w, T2w, FLAIR) data of clinical quality. The dataset contains data from 36 images from healthy control subjects, 32 images from individuals diagnosed with age-related dementia and 20 from individuals with Parkinson's disease. There is currently a paucity of data from the African continent. Given the potential for Africa to contribute to the global neuroscience community, this first MRI dataset represents both an opportunity and benchmark for future studies to share data from the African continent.

eess.SP - 2023-11-08

Constrained Independent Vector Analysis with Reference for Multi-Subject fMRI Analysis

  • paper_url: http://arxiv.org/abs/2311.05049
  • repo_url: None
  • paper_authors: Trung Vu, Francisco Laport, Hanlu Yang, Vince D. Calhoun, Tulay Adali
  • for: This paper proposes two novel methods for constrained independent vector analysis (IVA) to improve the quality of separation in multi-subject functional magnetic resonance imaging (fMRI) data analysis.
  • methods: The two proposed methods are based on an adaptive-reverse scheme to select variable thresholds for the constraints and a threshold-free formulation by leveraging the unique structure of IVA.
  • results: The proposed methods provide significantly better separation quality and model match while providing computationally efficient and highly reproducible solutions, as demonstrated through simulations and analysis of resting state fMRI data collected from 98 subjects.
    Abstract Independent component analysis (ICA) is now a widely used solution for the analysis of multi-subject functional magnetic resonance imaging (fMRI) data. Independent vector analysis (IVA) generalizes ICA to multiple datasets, i.e., to multi-subject data, and in addition to higher-order statistical information in ICA, it leverages the statistical dependence across the datasets as an additional type of statistical diversity. As such, it preserves variability in the estimation of single-subject maps but its performance might suffer when the number of datasets increases. Constrained IVA is an effective way to bypass computational issues and improve the quality of separation by incorporating available prior information. Existing constrained IVA approaches often rely on user-defined threshold values to define the constraints. However, an improperly selected threshold can have a negative impact on the final results. This paper proposes two novel methods for constrained IVA: one using an adaptive-reverse scheme to select variable thresholds for the constraints and a second one based on a threshold-free formulation by leveraging the unique structure of IVA. We demonstrate that our solutions provide an attractive solution to multi-subject fMRI analysis both by simulations and through analysis of resting state fMRI data collected from 98 subjects -- the highest number of subjects ever used by IVA algorithms. Our results show that both proposed approaches obtain significantly better separation quality and model match while providing computationally efficient and highly reproducible solutions.

Harmonic Retrieval Using Weighted Lifted-Structure Low-Rank Matrix Completion

  • paper_url: http://arxiv.org/abs/2311.05003
  • repo_url: None
  • paper_authors: Mohammad Bokaei, Saeed Razavikia, Stefano Rini, Arash Amini, Hamid Behrouzi
  • for: Recovering the frequency components of a mixture of $K$ complex sinusoids from a random subset of $N$ equally-spaced time-domain samples
  • methods: A two-step strategy: lift the incomplete set of uniform samples into a structured matrix with missing entries, then complete the matrix via a weighted nuclear norm minimization problem
  • results: The method applies to a range of matrix structures, such as Hankel and double-Hankel, outperforms unweighted schemes in both noiseless and noisy settings, and comes with theoretical guarantees
    Abstract In this paper, we investigate the problem of recovering the frequency components of a mixture of $K$ complex sinusoids from a random subset of $N$ equally-spaced time-domain samples. Because of the random subset, the samples are effectively non-uniform. Besides, the frequency values of each of the $K$ complex sinusoids are assumed to vary continuously within a given range. For this problem, we propose a two-step strategy: (i) we first lift the incomplete set of uniform samples (unavailable samples are treated as missing data) into a structured matrix with missing entries, which is potentially low-rank; then (ii) we complete the matrix using a weighted nuclear norm minimization problem. We call the method a \emph{weighted lifted-structured (WLi) low-rank matrix recovery}. Our approach can be applied to a range of matrix structures such as Hankel and double-Hankel, among others, and provides improvement over the unweighted existing schemes such as EMaC and DEMaC. We provide theoretical guarantees for the proposed method, as well as numerical simulations in both noiseless and noisy settings. Both the theoretical and the numerical results confirm the superiority of the proposed approach.
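    To make the lift-and-complete recipe concrete, here is an unweighted EMaC-style sketch of steps (i) and (ii) using cvxpy (an assumption; the paper's weighted formulation additionally applies weight matrices inside the nuclear norm):

```python
import numpy as np
import cvxpy as cp

def hankel_complete(x_obs: np.ndarray, mask: np.ndarray, pencil: int) -> np.ndarray:
    """Lift N partially observed samples into an L x (N-L+1) Hankel matrix and
    complete it by nuclear-norm minimization (EMaC-style, unweighted).
    x_obs: length-N array, arbitrary values where mask is False.
    mask:  boolean array, True where a sample was observed."""
    N, L = len(x_obs), pencil
    X = cp.Variable((L, N - L + 1), complex=True)
    constraints = []
    for i in range(L):
        for j in range(N - L + 1):
            if mask[i + j]:                 # observed: pin entry to the sample
                constraints.append(X[i, j] == x_obs[i + j])
            if i > 0 and j < N - L:         # Hankel: anti-diagonals are constant
                constraints.append(X[i, j] == X[i - 1, j + 1])
    cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
    # Recover the full signal from the first row and the last column
    return np.concatenate([X.value[0, :], X.value[1:, -1]])
```

    The weighted variant would replace `cp.normNuc(X)` with the nuclear norm of the lifted matrix pre- and post-multiplied by weight matrices, which is the paper's key departure from EMaC/DEMaC.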

Joint Transmit Signal and Beamforming Design for Integrated Sensing and Power Transfer Systems

  • paper_url: http://arxiv.org/abs/2311.04881
  • repo_url: None
  • paper_authors: Kenneth MacSporran Mayer, Nikita Shanin, Zhenlong You, Sebastian Lotter, Stefan Brückner, Martin Vossiek, Laura Cottatellucci, Robert Schober
  • for: Integrating sensing and wireless power transfer (WPT) functionalities into a single platform
  • methods: Joint optimisation of the rectangular pulse-shaped transmit signal and the beamforming design. In contrast to prior works, an accurate non-linear circuit-based energy harvesting (EH) model is adopted
  • results: The resulting non-convex optimisation problem is solved via a grid search over the pulse duration, semidefinite relaxation (SDR), and successive convex approximation (SCA). The average harvested power monotonically increases with the pulse duration when the average transmit power budget is large, and the trade-off between sensing performance and power transfer of the ISAPT system is characterised
    Abstract Integrating different functionalities, conventionally implemented as dedicated systems, into a single platform allows utilising the available resources more efficiently. We consider an integrated sensing and power transfer (ISAPT) system and propose the joint optimisation of the rectangular pulse-shaped transmit signal and the beamforming design to combine sensing and wireless power transfer (WPT) functionalities efficiently. In contrast to prior works, we adopt an accurate non-linear circuit-based energy harvesting (EH) model. We formulate a non-convex optimisation problem for a general number of EH receivers and a single sensing target (ST) and solve the problem via a grid search over the pulse duration, semidefinite relaxation (SDR), and successive convex approximation (SCA). The average harvested power is shown to monotonically increase with the pulse duration when the average transmit power budget is large. We discuss the trade-off between sensing performance and power transfer of the ISAPT system. The proposed approach significantly outperforms a heuristic baseline scheme based on a linear EH model, which linearly combines energy beamforming with the beamsteering vector in the direction to the ST as its transmit strategy.

Electromagnetic manifold characterization of antenna arrays

  • paper_url: http://arxiv.org/abs/2311.04835
  • repo_url: None
  • paper_authors: Miguel R. Castellanos, Robert W. Heath Jr
  • for: Signal and channel models that account for antenna behaviours such as mutual coupling, near-field propagation, and polarization, to improve wireless communication performance
  • methods: An electromagnetic-based array manifold that captures several complicated antenna behaviours and can model arbitrary antenna configurations. Antennas are quantized into a large number of Hertzian dipoles to develop a linear model of the radiated field of general non-homogeneous arrays
  • results: The model enables a beamforming gain optimisation that accounts for the polarization of the receive field and constraints on the radiated power density, achieving accuracy close to electromagnetic simulations; systems leveraging the array manifold achieve higher beamforming gains than beamforming with less accurate models
    Abstract Antenna behaviors such as mutual coupling, near-field propagation, and polarization cannot be neglected in signal and channel models for wireless communication. We present an electromagnetic-based array manifold that accounts for several complicated behaviors and can model arbitrary antenna configurations. We quantize antennas into a large number of Hertzian dipoles to develop a model for the radiated array field. The resulting abstraction provides a means to predict the electric field for general non-homogeneous array geometries through a linear model that depends on the point source location, the position of each Hertzian dipole, and a set of coefficients obtained from electromagnetic simulation. We then leverage this model to formulate a beamforming gain optimization that can be adapted to account for polarization of the receive field as well as constraints on the radiated power density. Numerical results demonstrate that the proposed method achieves accuracy that is close to that of electromagnetic simulations. By leveraging the developed array manifold for beamforming, systems can achieve higher beamforming gains compared to beamforming with less accurate models.
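    The abstraction reduces to a linear model: stack per-dipole Green's-function responses into a matrix and fit a coefficient vector against simulated field samples. A simplified scalar sketch (the full model is vectorial; the positions, frequency, and least-squares fit below are illustrative assumptions):

```python
import numpy as np

def greens(obs: np.ndarray, dipoles: np.ndarray, k: float) -> np.ndarray:
    """Scalar free-space Green's function exp(-jkr)/(4*pi*r) from every
    dipole to every observation point. obs: (M, 3), dipoles: (D, 3)."""
    r = np.linalg.norm(obs[:, None, :] - dipoles[None, :, :], axis=-1)  # (M, D)
    return np.exp(-1j * k * r) / (4 * np.pi * r)

k = 2 * np.pi / 0.01                        # wavenumber at lambda = 1 cm (assumed)
dipoles = np.random.rand(64, 3) * 0.05      # assumed dipole positions within 5 cm
obs_pts = np.random.rand(200, 3) + 1.0      # field sample points (as if from EM sim)
A = greens(obs_pts, dipoles, k)             # (200, 64) linear model
e_sim = A @ (np.random.randn(64) + 1j * np.random.randn(64))  # stand-in for EM data
c, *_ = np.linalg.lstsq(A, e_sim, rcond=None)                 # fitted coefficients
e_pred = greens(np.array([[1.5, 1.5, 1.5]]), dipoles, k) @ c  # field at a new point
```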

Integrated Distributed Semantic Communication and Over-the-air Computation for Cooperative Spectrum Sensing

  • paper_url: http://arxiv.org/abs/2311.04791
  • repo_url: None
  • paper_authors: Peng Yi, Yang Cao, Xin Kang, Ying-Chang Liang
  • for: Improving the detection of primary users (PUs) using multiple sensors
  • methods: A novel integrated communication and computation (ICC) framework combining distributed semantic communication (DSC) with over-the-air computation (AirComp) to improve detection performance and reduce spectrum occupation in the reporting channel
  • results: Under the ICC framework, a particular system, ICC-CSS, is designed and theoretically proved to be equivalent to the optimal estimator-correlator (E-C) detector with equal gain soft-data fusion for i.i.d. PU signal samples. It outperforms conventional CSS schemes in detection performance, robustness to SNR variations in both the sensing and reporting channels, and scalability with respect to the number of samples and sensors
    Abstract Cooperative spectrum sensing (CSS) is a promising approach to improve the detection of primary users (PUs) using multiple sensors. However, there are several challenges for existing combination methods, i.e., performance degradation and ceiling effect for hard-decision fusion (HDF), as well as significant uploading latency and non-robustness to noise in the reporting channel for soft-data fusion (SDF). To address these issues, in this paper, we propose a novel framework for CSS that integrates communication and computation, namely ICC. Specifically, distributed semantic communication (DSC) jointly optimizes multiple sensors and the fusion center to minimize the transmitted data without degrading detection performance. Moreover, over-the-air computation (AirComp) is utilized to further reduce spectrum occupation in the reporting channel, taking advantage of the characteristics of the wireless channel to enable data aggregation. Under the ICC framework, a particular system, namely ICC-CSS, is designed and implemented, which is theoretically proved to be equivalent to the optimal estimator-correlator (E-C) detector with equal gain SDF when the PU signal samples are independent and identically distributed. Extensive simulations verify the superiority of ICC-CSS compared with various conventional CSS schemes in terms of detection performance, robustness to SNR variations in both the sensing and reporting channels, as well as scalability with respect to the number of samples and sensors.

Energy-efficient Wireless Image Retrieval for IoT Devices by Transmitting a TinyML Model

  • paper_url: http://arxiv.org/abs/2311.04788
  • repo_url: None
  • paper_authors: Junya Shiraishi, Mathias Thorsager, Shashi Raj Pandey, Petar Popovski
  • for: On-demand data collection from Internet of Things (IoT) devices by an edge server
  • methods: Tiny Machine Learning (TinyML) enables a semantic response on IoT devices equipped with wake-up receivers, so that only semantically relevant data is transmitted
  • results: Compared to a baseline scheme, the proposed scheme achieves both high retrieval accuracy and high energy efficiency, reaching up to 70% energy reduction when the number of stored images is 8 or more
    Abstract This work considers a scenario in which an edge server collects data from Internet of Things (IoT) devices equipped with wake-up receivers. Although this procedure enables on-demand data collection, there is still energy waste if the content of the transmitted data following the wake-up is irrelevant. To mitigate this, we advocate the use of Tiny Machine Learning (ML) to enable a semantic response from the IoT devices, so they can send only semantically relevant data. Nevertheless, receiving the ML model and the ML processing at the IoT devices consumes additional energy. We consider the specific instance of image retrieval and investigate the gain brought by the proposed scheme in terms of energy efficiency, considering both the energy cost of introducing the ML model as well as that of wireless communication. The numerical evaluation shows that, compared to a baseline scheme, the proposed scheme can realize both high retrieval accuracy and high energy efficiency, which reaches up to 70% energy reduction when the number of stored images is equal to or larger than 8.

Superimposed Chirp Waveforms for SWIPT with Diplexer-based Integrated Receivers

  • paper_url: http://arxiv.org/abs/2311.04776
  • repo_url: None
  • paper_authors: Arijit Roy, Constantinos Psomas, Ioannis Krikidis
  • for: Simultaneous wireless information and power transfer (SWIPT) via superimposed chirp waveforms. Exploiting the chirp characteristics allows multiple chirps to be superimposed, transmitting the same number of waveforms over less bandwidth and enabling subband selection over a set of orthogonal subbands
  • methods: A user equipped with a diplexer-based integrated receiver (DIR) extracts radio-frequency power and decodes information from the same signal without splitting. Incorporating chirp superposition and subband selection, a transmission scheme is proposed that exploits both the diode's nonlinearity and frequency diversity. Novel closed-form analytical expressions of the average harvested energy (HE) are derived using tools from order statistics, along with the downlink information rate at the user
  • results: For the considered system setup, superimposed chirp-based SWIPT improves average HE performance by 30% over multisine waveforms of fixed-frequency cosine signals, improves the minimum level of HE in a multiuser network, and extends the operating range of energy transfer compared to fixed-frequency waveforms. Including the DIR at the receiver enlarges the energy-information transfer region compared to the widely considered power-splitting receiver
    Abstract In this paper, we present the superposition of chirp waveforms for simultaneous wireless information and power transfer (SWIPT) applications. Exploiting the chirp waveform characteristics enables us to superimpose multiple chirps, thereby allowing transmission of the same number of waveforms over less bandwidth. This enables us to perform subband selection when operating over set of orthogonal subbands. Furthermore, we consider a user equipped with a diplexer-based integrated receiver (DIR), which enables to extract radio frequency power and decode information from the same signal without splitting. Thereby, incorporating chirp superposition and subband selection, a transmission scheme is proposed to exploit both the diode's nonlinearity and frequency diversity. We derive novel closed-form analytical expressions of the average harvested energy (HE) via transmission of superimposed chirp over selected subbands based on tools from order statistics. We also analyze the downlink information rate achieved at the user. Through our analytical and numerical results, for the considered system setup, we show that superimposed chirp-based SWIPT provides an improvement of 30$\%$ in average HE performance as compared to multisine waveforms consisting of a set of fixed-frequency cosine signals, improves the minimum level of HE in a multiuser network, and extends the operating range of energy transfer as compared to fixed-frequency waveforms. Furthermore, we illustrate that the inclusion of DIR at the receiver for SWIPT enlarges the energy-information transfer region when compared to the widely considered power splitting receiver.
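    The superposition itself is straightforward to sketch: several linear chirps sharing one sweep bandwidth, offset in start frequency and summed into a single transmit symbol (the sample rate, band, and number of chirps below are illustrative assumptions; frequency wrapping within the band is omitted for brevity):

```python
import numpy as np

fs = 1e6                       # sample rate (Hz), assumed
T = 1e-3                       # symbol duration (s), assumed
t = np.arange(0, T, 1 / fs)
f0, B, K = 100e3, 200e3, 4     # start frequency, sweep bandwidth, number of chirps

def chirp(t, f_start, sweep, T):
    """Complex linear up-chirp sweeping `sweep` Hz over duration T."""
    return np.exp(2j * np.pi * (f_start * t + 0.5 * (sweep / T) * t ** 2))

# K chirps with the same sweep rate, offset in start frequency, so they
# jointly occupy the bandwidth a single chirp would.
offsets = np.arange(K) * (B / K)
x = sum(chirp(t, f0 + d, B, T) for d in offsets) / np.sqrt(K)  # unit average power
```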

Discerning and Enhancing the Weighted Sum-Rate Maximization Algorithms in Communications

  • paper_url: http://arxiv.org/abs/2311.04546
  • repo_url: https://github.com/zepengzhang/ratemax
  • paper_authors: Zepeng Zhang, Ziping Zhao, Kaiming Shen, Daniel P. Palomar, Wei Yu
  • for: Studying three optimization methods for weighted sum-rate (WSR) maximization that ensure convergence to stationary points
  • methods: Two block coordinate ascent (BCA) algorithms, weighted sum-minimum mean-square error (WMMSE) and WSR maximization via fractional programming (WSR-FP), along with a minorization-maximization (MM) algorithm, WSR maximization via MM (WSR-MM)
  • results: The contributions are threefold: 1) a comprehensive comparative study of WMMSE, WSR-FP, and WSR-MM that reveals their direct correlations; 2) a novel algorithm, WSR-MM+, with faster convergence and reduced computational load; 3) a reconceptualization of WSR-MM+ within the BCA framework via a new equivalent transform, yielding an enhanced WSR-FP+ algorithm. Numerical simulations confirm the connections between WMMSE, WSR-FP, and WSR-MM and the efficacy of the proposed WSR-MM+ and WSR-FP+ algorithms
    Abstract Weighted sum-rate (WSR) maximization plays a critical role in communication system design. This paper examines three optimization methods for WSR maximization, which ensure convergence to stationary points: two block coordinate ascent (BCA) algorithms, namely, weighted sum-minimum mean-square error (WMMSE) and WSR maximization via fractional programming (WSR-FP), along with a minorization-maximization (MM) algorithm, WSR maximization via MM (WSR-MM). Our contributions are threefold. Firstly, we delineate the exact relationships among WMMSE, WSR-FP, and WSR-MM, which, despite their extensive use in the literature, lack a comprehensive comparative study. By probing the theoretical underpinnings linking the BCA and MM algorithmic frameworks, we reveal the direct correlations between the equivalent transformation techniques, essential to the development of WMMSE and WSR-FP, and the surrogate functions pivotal to WSR-MM. Secondly, we propose a novel algorithm, WSR-MM+, harnessing the flexibility of selecting surrogate functions in MM framework. By circumventing the repeated matrix inversions in the search for optimal Lagrange multipliers in existing algorithms, WSR-MM+ significantly reduces the computational load per iteration and accelerates convergence. Thirdly, we reconceptualize WSR-MM+ within the BCA framework, introducing a new equivalent transform, which gives rise to an enhanced version of WSR-FP, named as WSR-FP+. We further demonstrate that WSR-MM+ can be construed as the basic gradient projection method. This perspective yields a deeper understanding into its computational intricacies. Numerical simulations corroborate the connections between WMMSE, WSR-FP, and WSR-MM and confirm the efficacy of the proposed WSR-MM+ and WSR-FP+ algorithms.
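    The paper's observation that WSR-MM+ can be construed as a basic gradient projection method suggests a compact illustration for single-cell MISO beamforming: ascend the WSR objective by gradient and project the stacked beamformers back onto the total-power ball. A sketch using PyTorch autograd in place of closed-form gradients (the channel model, step size, and iteration count are assumptions):

```python
import torch

torch.manual_seed(0)
K, M, P, sigma2 = 4, 8, 10.0, 1.0             # users, antennas, power budget, noise
H = torch.randn(K, M, dtype=torch.cfloat)      # channel h_k in row k
w = torch.ones(K)                              # rate weights
V = torch.randn(K, M, dtype=torch.cfloat, requires_grad=True)  # beamformer v_k in row k

def wsr(V):
    G = (H @ V.conj().T).abs() ** 2            # G[k, j] = |h_k^H v_j|^2
    sig = G.diagonal()
    intf = G.sum(dim=1) - sig + sigma2
    return (w * torch.log2(1 + sig / intf)).sum()

step = 0.05
for _ in range(500):
    rate = wsr(V)
    rate.backward()
    with torch.no_grad():
        V += step * V.grad                     # gradient ascent on the WSR
        norm = torch.linalg.vector_norm(V)     # project onto sum_k ||v_k||^2 <= P
        if norm ** 2 > P:
            V *= (P ** 0.5) / norm
        V.grad.zero_()
```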

Cross-Domain Waveform Design for 6G Integrated Sensing and Communication

  • paper_url: http://arxiv.org/abs/2311.04483
  • repo_url: None
  • paper_authors: Fan Zhang, Tianqi Mao, Ruiqi Liu, Zhu Han, Octavia A. Dobre, Sheng Chen, Zhaocheng Wang
  • for: Two cross-domain waveform optimization strategies that maximize the achievable data rate in OFDM-based integrated sensing and communication (ISAC) systems
  • methods: A communication-centric design and a sensing-centric design. The communication-centric design optimally allocates a fraction of resource elements (REs) to communications using prior knowledge of the communication channel and employs the remaining REs for sensing, suppressing the sidelobe level and peak-to-average power ratio by optimizing their power-frequency and phase-frequency characteristics. The sensing-centric design ensures a 'locally' perfect auto-correlation property by adjusting the unit cells of the ambiguity function within its region of interest
  • results: Numerical results demonstrate the superiority of both waveform design strategies for ISAC applications
    Abstract Orthogonal frequency division multiplexing (OFDM) is one of the representative integrated sensing and communication (ISAC) waveforms, where sensing and communications tend to be assigned with different resource elements (REs) due to their diverse design requirements. This motivates optimization of resource allocation/waveform design across time, frequency, power and delay-Doppler domains. Therefore, this article proposes two cross-domain waveform optimization strategies for OFDM-based ISAC systems, following communication-centric and sensing-centric criteria, respectively. For the communication-centric design, to maximize the achievable data rate, a fraction of REs are optimally allocated for communications according to prior knowledge of the communication channel. The remaining REs are then employed for sensing, where the sidelobe level and peak to average power ratio are suppressed by optimizing its power-frequency and phase-frequency characteristics. For the sensing-centric design, a `locally' perfect auto-correlation property is ensured by adjusting the unit cells of the ambiguity function within its region of interest (RoI). Afterwards, the irrelevant cells beyond RoI, which can readily determine the sensing power allocation, are optimized with the communication power allocation to enhance the achievable data rate. Numerical results demonstrate the superiority of the proposed communication-centric and sensing-centric waveform designs for ISAC applications.
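    The communication-centric split can be sketched as a simple RE allocation rule: hand communications the REs with the strongest channel and leave the rest to sensing (the channel model and split fraction below are illustrative; the paper additionally optimizes the power-frequency and phase-frequency characteristics of the sensing REs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sym, n_sc = 14, 64                       # one OFDM frame: symbols x subcarriers
h = rng.rayleigh(size=(n_sym, n_sc))       # known channel gain per RE (assumed)
frac_comm = 0.6                            # fraction of REs given to communications

n_comm = int(frac_comm * h.size)
order = np.argsort(h, axis=None)[::-1]     # flat RE indices, strongest channel first
comm_mask = np.zeros(h.size, dtype=bool)
comm_mask[order[:n_comm]] = True           # strongest REs carry data...
sens_mask = ~comm_mask                     # ...the rest form the sensing waveform
rate = np.log2(1 + h.ravel()[comm_mask] ** 2).sum()   # achievable-rate proxy
```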

cs.SD - 2023-11-07

Soundbay: Deep Learning Framework for Marine Mammals and Bioacoustic Research

  • paper_url: http://arxiv.org/abs/2311.04343
  • repo_url: None
  • paper_authors: Noam Bressler, Michael Faran, Amit Galor, Michael Moshe Michelashvili, Tomer Nachshon, Noa Weiss
  • for: An open-source Python framework that lets bioacoustics and machine learning researchers implement and apply deep learning algorithms for acoustic audio analysis
  • methods: Deep learning algorithms for animal call detection, exposed through an easy and intuitive platform for applying existing models to one's data or creating new models
  • results: A benchmark for cetacean call detection on multiple datasets, enabling comparison of baselines for deep-learning-based animal call analysis
    Abstract This paper presents Soundbay, an open-source Python framework that allows bio-acoustics and machine learning researchers to implement and utilize deep learning-based algorithms for acoustic audio analysis. Soundbay provides an easy and intuitive platform for applying existing models on one's data or creating new models effortlessly. One of the main advantages of the framework is the capability to compare baselines on different benchmarks, a crucial part of emerging research and development related to the usage of deep-learning algorithms for animal call analysis. We demonstrate this by providing a benchmark for cetacean call detection on multiple datasets. The framework is publicly accessible via https://github.com/deep-voice/soundbay

eess.AS - 2023-11-07

Fine-tuning convergence model in Bengali speech recognition

  • paper_url: http://arxiv.org/abs/2311.04122
  • repo_url: None
  • paper_authors: Zhu Ruiying, Shen Meng
  • for: Improving automatic speech recognition performance for Bengali
  • methods: Fine-tuning the wav2vec 2.0 pre-trained model, tuning the learning rate and dropout parameters, and enlarging the training set ratio
  • results: The WER on the test set improved from 0.508 to 0.437; merging the training and validation sets into a comprehensive training set achieved a WER of 0.436
    Abstract Research on speech recognition has attracted considerable interest due to the difficult task of segmenting uninterrupted speech. Among various languages, Bengali features distinct rhythmic patterns and tones, making it particularly difficult to recognize and lacking an efficient commercial recognition method. In order to improve the automatic speech recognition model for Bengali, our team has chosen to utilize the wave2vec 2.0 pre-trained model, which has undergone convergence for fine-tuning. Regarding Word Error Rate (WER), the learning rate and dropout parameters were fine-tuned, and after the model training was stable, attempts were made to enlarge the training set ratio, which improved the model's performance. Consequently, there was a notable enhancement in the WER from 0.508 to 0.437 on the test set of the publicly listed official dataset. Afterwards, the training and validation sets were merged, creating a comprehensive dataset that was used as the training set, achieving a remarkable WER of 0.436.
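    A minimal Hugging Face-style sketch of this kind of fine-tuning setup (the checkpoint name, dropout values, learning rate, and the pre-built `processor`, `train_ds`, and `eval_ds` objects are illustrative assumptions, not the authors' exact configuration):

```python
from transformers import Wav2Vec2ForCTC, TrainingArguments, Trainer

# Assumes a Bengali CTC `processor` (feature extractor + character tokenizer)
# and preprocessed `train_ds` / `eval_ds` datasets were built beforehand.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",      # assumed wav2vec 2.0 checkpoint
    attention_dropout=0.1,               # dropout values tuned in the paper...
    hidden_dropout=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()           # keep the conv feature extractor frozen

args = TrainingArguments(
    output_dir="wav2vec2-bengali",
    per_device_train_batch_size=16,
    learning_rate=3e-4,                  # ...together with the learning rate
    num_train_epochs=30,
    evaluation_strategy="epoch",
)
Trainer(model=model, args=args,
        train_dataset=train_ds, eval_dataset=eval_ds).train()
```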

cs.CV - 2023-11-07

3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

  • paper_url: http://arxiv.org/abs/2311.04391
  • repo_url: None
  • paper_authors: Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany
  • for: 3D object detection from a single image
  • methods: Features extracted from a 3D-aware diffusion model, refined through two specialized tuning strategies: geometric tuning (novel view synthesis conditioned on a single image via a novel epipolar warp operator) and semantic tuning (training on target data with detection supervision), followed by a test-time prediction ensemble across multiple virtual viewpoints
  • results: Effective 3D detection that substantially surpasses previous benchmarks, outperforming Cube-RCNN, a precedent in single-view 3D detection, by 9.43% in AP3D on the Omni3D-ARkitscene dataset, with robust data efficiency and generalization to cross-domain data
    Abstract We present 3DiffTection, a state-of-the-art method for 3D object detection from single images, leveraging features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, pretrained large image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these features are initially trained on paired text and image data, which are not optimized for 3D tasks, and often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model to perform novel view synthesis conditioned on a single image, by introducing a novel epipolar warp operator. This task meets two essential criteria: the necessity for 3D awareness and reliance solely on posed image data, which are readily available (e.g., from videos) and does not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities. In the final step, we harness these enhanced capabilities to conduct a test-time prediction ensemble across multiple virtual viewpoints. Through our methodology, we obtain 3D-aware features that are tailored for 3D detection and excel in identifying cross-view point correspondences. Consequently, our model emerges as a powerful 3D detector, substantially surpassing previous benchmarks, e.g., Cube-RCNN, a precedent in single-view 3D detection by 9.43\% in AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data efficiency and generalization to cross-domain data.
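    The geometric tuning stage hinges on an epipolar warp: features for a pixel in one view are aggregated from points along its epipolar line in the other view. A geometry-only sketch of that sampling step (the sampling count and span are illustrative; the paper's operator additionally learns how to aggregate the sampled features):

```python
import numpy as np

def fundamental(K1, K2, R, t):
    """F = K2^{-T} [t]_x R K1^{-1} for relative pose (R, t) from view 1 to 2."""
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])
    return np.linalg.inv(K2).T @ tx @ R @ np.linalg.inv(K1)

def epipolar_samples(x, F, n=8, span=50.0):
    """Sample n pixel locations along the epipolar line (in view 2) of pixel x
    (in view 1), where features may be gathered for the warp."""
    l = F @ np.array([x[0], x[1], 1.0])                  # line a*u + b*v + c = 0
    d = np.array([-l[1], l[0]]) / np.hypot(l[0], l[1])   # direction along the line
    p0 = -l[2] * np.array([l[0], l[1]]) / (l[0] ** 2 + l[1] ** 2)  # closest to origin
    return [p0 + s * d for s in np.linspace(-span, span, n)]

K = np.diag([500.0, 500.0, 1.0]); K[0, 2], K[1, 2] = 320, 240  # shared intrinsics
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])                    # small virtual baseline
pts = epipolar_samples((320.0, 240.0), fundamental(K, K, R, t))
```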

Basis restricted elastic shape analysis on the space of unregistered surfaces

  • paper_url: http://arxiv.org/abs/2311.04382
  • repo_url: None
  • paper_authors: Emmanuel Hartman, Emery Pierson, Martin Bauer, Mohamed Daoudi, Nicolas Charon
  • for: A new mathematical and numerical framework for surface analysis based on elastic Riemannian metrics on shape spaces
  • methods: The space of allowable transformations is restricted to predefined finite-dimensional bases of deformation fields, estimated in a data-driven way to emulate surface transformations observed in a training set
  • results: The method effectively performs a variety of tasks on surface meshes, including shape registration, interpolation, motion transfer, and random pose generation, without assuming pre-registered point correspondences or a consistent mesh structure, and generally outperforms state-of-the-art methods on human body shape and pose data and human face scans
    Abstract This paper introduces a new mathematical and numerical framework for surface analysis derived from the general setting of elastic Riemannian metrics on shape spaces. Traditionally, those metrics are defined over the infinite dimensional manifold of immersed surfaces and satisfy specific invariance properties enabling the comparison of surfaces modulo shape preserving transformations such as reparametrizations. The specificity of the approach we develop is to restrict the space of allowable transformations to predefined finite dimensional bases of deformation fields. These are estimated in a data-driven way so as to emulate specific types of surface transformations observed in a training set. The use of such bases allows to simplify the representation of the corresponding shape space to a finite dimensional latent space. However, in sharp contrast with methods involving e.g. mesh autoencoders, the latent space is here equipped with a non-Euclidean Riemannian metric precisely inherited from the family of aforementioned elastic metrics. We demonstrate how this basis restricted model can be then effectively implemented to perform a variety of tasks on surface meshes which, importantly, does not assume these to be pre-registered (i.e. with given point correspondences) or to even have a consistent mesh structure. We specifically validate our approach on human body shape and pose data as well as human face scans, and show how it generally outperforms state-of-the-art methods on problems such as shape registration, interpolation, motion transfer or random pose generation.

A Deep Learning Approach to Video Anomaly Detection using Convolutional Autoencoders

  • paper_url: http://arxiv.org/abs/2311.04351
  • repo_url: None
  • paper_authors: Gopikrishna Pavuluri, Gayathri Annem
  • for: A deep learning approach for detecting anomalies in videos using convolutional autoencoder and decoder neural networks, evaluated on the UCSD dataset
  • methods: A convolutional autoencoder learns the spatiotemporal patterns of normal videos; each frame of a test video is then compared against this learned representation
  • results: Overall accuracies of 99.35% on the UCSD Ped1 dataset and 99.77% on Ped2, outperforming other state-of-the-art methods and demonstrating suitability for real-world video anomaly detection
    Abstract In this research we propose a deep learning approach for detecting anomalies in videos using convolutional autoencoder and decoder neural networks on the UCSD dataset.Our method utilizes a convolutional autoencoder to learn the spatiotemporal patterns of normal videos and then compares each frame of a test video to this learned representation. We evaluated our approach on the UCSD dataset and achieved an overall accuracy of 99.35% on the Ped1 dataset and 99.77% on the Ped2 dataset, demonstrating the effectiveness of our method for detecting anomalies in surveillance videos. The results show that our method outperforms other state-of-the-art methods, and it can be used in real-world applications for video anomaly detection.
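    A minimal sketch of the approach: a convolutional autoencoder trained to reconstruct normal frames, with per-frame reconstruction error as the anomaly score (the architecture and threshold below are illustrative, not the authors' exact network):

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Encoder-decoder trained only on normal frames; anomalous frames
    reconstruct poorly, yielding a high per-frame error score."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_scores(model: ConvAutoencoder, frames: torch.Tensor) -> torch.Tensor:
    """Mean squared reconstruction error per frame; frames: (N, 1, H, W) in [0, 1]."""
    with torch.no_grad():
        recon = model(frames)
    return ((frames - recon) ** 2).mean(dim=(1, 2, 3))

# Usage sketch: train on normal clips with MSE loss, then flag test frames
# whose score exceeds a threshold chosen on a validation split.
model = ConvAutoencoder()
scores = anomaly_scores(model, torch.rand(8, 1, 64, 64))
is_anomalous = scores > scores.mean() + 2 * scores.std()  # illustrative threshold
```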

SaFL: Sybil-aware Federated Learning with Application to Face Recognition

  • paper_url: http://arxiv.org/abs/2311.04346
  • repo_url: None
  • paper_authors: Mahdi Ghafourian, Julian Fierrez, Ruben Vera-Rodriguez, Ruben Tolosana, Aythami Morales
  • for: A new defense method against poisoning attacks in Federated Learning (FL)
  • methods: A novel time-variant aggregation scheme that minimizes the effect of sybils
  • results: SaFL reduces the impact of poisoning attacks and improves the security and privacy of FL
    Abstract Federated Learning (FL) is a machine learning paradigm to conduct collaborative learning among clients on a joint model. The primary goal is to share clients' local training parameters with an integrating server while preserving their privacy. This method permits to exploit the potential of massive mobile users' data for the benefit of machine learning models' performance while keeping sensitive data on local devices. On the downside, FL raises security and privacy concerns that have just started to be studied. To address some of the key threats in FL, researchers have proposed to use secure aggregation methods (e.g. homomorphic encryption, secure multiparty computation, etc.). These solutions improve some security and privacy metrics, but at the same time bring about other serious threats such as poisoning attacks, backdoor attacks, and free running attacks. This paper proposes a new defense method against poisoning attacks in FL called SaFL (Sybil-aware Federated Learning) that minimizes the effect of sybils with a novel time-variant aggregation scheme.

Efficient Semantic Matching with Hypercolumn Correlation

  • paper_url: http://arxiv.org/abs/2311.04336
  • repo_url: None
  • paper_authors: Seungwook Kim, Juhong Min, Minsu Cho
  • for: Semantic matching, i.e., establishing semantic correspondences between visual features across images
  • methods: HCCNet exploits multi-scale correlation maps without relying on expensive match-wise relationship mining over the 4D correlation map. It performs feature slicing on the bottleneck features to yield a richer set of intermediate features, from which a hypercolumn correlation is constructed and processed with efficient point-wise convolutions
  • results: HCCNet achieves state-of-the-art or competitive performance on standard semantic matching benchmarks, with notably lower latency and computational overhead than existing state-of-the-art methods
    Abstract Recent studies show that leveraging the match-wise relationships within the 4D correlation map yields significant improvements in establishing semantic correspondences - but at the cost of increased computation and latency. In this work, we focus on the aspect that the performance improvements of recent methods can also largely be attributed to the usage of multi-scale correlation maps, which hold various information ranging from low-level geometric cues to high-level semantic contexts. To this end, we propose HCCNet, an efficient yet effective semantic matching method which exploits the full potential of multi-scale correlation maps, while eschewing the reliance on expensive match-wise relationship mining on the 4D correlation map. Specifically, HCCNet performs feature slicing on the bottleneck features to yield a richer set of intermediate features, which are used to construct a hypercolumn correlation. HCCNet can consequently establish semantic correspondences in an effective manner by reducing the volume of conventional high-dimensional convolution or self-attention operations to efficient point-wise convolutions. HCCNet demonstrates state-of-the-art or competitive performances on the standard benchmarks of semantic matching, while incurring a notably lower latency and computation overhead compared to the existing SoTA methods.

A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization

  • paper_url: http://arxiv.org/abs/2311.04315
  • repo_url: None
  • paper_authors: Xingzhe He, Zhiwen Cao, Nicholas Kolkin, Lantao Yu, Helge Rhodin, Ratheesh Kalarot
  • for: Generating images of unique or personal visual concepts from natural-language descriptions while preserving the subject's identity
  • methods: A data-centric augmentation approach based on a novel regularization dataset generation strategy on both the text and image level, requiring no modification of the model architecture
  • results: Generated renditions of the desired subject preserve even fine details such as text and logos while retaining the ability to generate diverse samples that follow the input text prompt, with the best trade-off between identity preservation, diversity, and text alignment
    Abstract Large text-to-image models have revolutionized the ability to generate imagery using natural language. However, particularly unique or personal visual concepts, such as your pet, an object in your house, etc., will not be captured by the original model. This has led to interest in how to inject new visual concepts, bound to a new text token, using as few as 4-6 examples. Despite significant progress, this task remains a formidable challenge, particularly in preserving the subject's identity. While most researchers attempt to to address this issue by modifying model architectures, our approach takes a data-centric perspective, advocating the modification of data rather than the model itself. We introduce a novel regularization dataset generation strategy on both the text and image level; demonstrating the importance of a rich and structured regularization dataset (automatically generated) to prevent losing text coherence and better identity preservation. The better quality is enabled by allowing up to 5x more fine-tuning iterations without overfitting and degeneration. The generated renditions of the desired subject preserve even fine details such as text and logos; all while maintaining the ability to generate diverse samples that follow the input text prompt. Since our method focuses on data augmentation, rather than adjusting the model architecture, it is complementary and can be combined with prior work. We show on established benchmarks that our data-centric approach forms the new state of the art in terms of image quality, with the best trade-off between identity preservation, diversity, and text alignment.

Holistic Evaluation of Text-To-Image Models

  • paper_url: http://arxiv.org/abs/2311.04287
  • repo_url: https://github.com/stanford-crfm/helm
  • paper_authors: Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang
  • for: Holistic evaluation of text-to-image models
  • methods: A new benchmark (HEIM) covering 12 aspects: text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency, with 62 curated scenarios
  • results: Evaluating 26 state-of-the-art text-to-image models shows that no single model excels in all aspects; different models demonstrate different strengths
    Abstract The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.

Video Instance Matting

  • paper_url: http://arxiv.org/abs/2311.04212
  • repo_url: https://github.com/shi-labs/vim
  • paper_authors: Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Humphrey Shi
  • for: Video instance matting: estimating an alpha matte for each instance at each frame of a video sequence
  • methods: MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, which leverages a mixture of mask augmentations for robustness to inaccurate and inconsistent mask guidance, plus temporal mask and temporal feature guidance for temporally consistent predictions
  • results: On the newly built VIM50 benchmark, evaluated with the Video Instance-aware Matting Quality (VIMQ) metric, MSG-VIM sets a strong baseline and outperforms existing methods by a large margin
    Abstract Conventional video matting outputs one alpha matte for all instances appearing in a video frame so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, results are unsatisfactory for matting applications, especially due to applied binarization. To remedy this deficiency, we propose Video Instance Matting~(VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performances on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality~(VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.

Deep Hashing via Householder Quantization

  • paper_url: http://arxiv.org/abs/2311.04207
  • repo_url: https://github.com/twistedcubic/learn-to-hash
  • paper_authors: Lucas R. Schwengber, Lucas Resende, Paulo Orenstein, Roberto I. Oliveira
  • for: Improving the efficiency and performance of large-scale image similarity search with deep hashing
  • methods: An alternative quantization strategy that decomposes the learning problem into two stages: first, perform similarity learning over the embedding space with no quantization; second, find an optimal orthogonal transformation of the embeddings, parametrized with Householder matrices, and quantize the transformed embedding through the sign function
  • results: The algorithm is unsupervised, fast, and hyperparameter-free, achieves state-of-the-art performance on widely used image datasets, and brings consistent performance improvements to existing deep hashing algorithms at no cost in performance
    Abstract Hashing is at the heart of large-scale image similarity search, and recent methods have been substantially improved through deep learning techniques. Such algorithms typically learn continuous embeddings of the data. To avoid a subsequent costly binarization step, a common solution is to employ loss functions that combine a similarity learning term (to ensure similar images are grouped to nearby embeddings) and a quantization penalty term (to ensure that the embedding entries are close to binarized entries, e.g., -1 or 1). Still, the interaction between these two terms can make learning harder and the embeddings worse. We propose an alternative quantization strategy that decomposes the learning problem in two stages: first, perform similarity learning over the embedding space with no quantization; second, find an optimal orthogonal transformation of the embeddings so each coordinate of the embedding is close to its sign, and then quantize the transformed embedding through the sign function. In the second step, we parametrize orthogonal transformations using Householder matrices to efficiently leverage stochastic gradient descent. Since similarity measures are usually invariant under orthogonal transformations, this quantization strategy comes at no cost in terms of performance. The resulting algorithm is unsupervised, fast, hyperparameter-free and can be run on top of any existing deep hashing or metric learning algorithm. We provide extensive experimental results showing that this approach leads to state-of-the-art performance on widely used image datasets, and, unlike other quantization strategies, brings consistent improvements in performance to existing deep hashing algorithms.
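    The second stage is easy to sketch: parametrize the orthogonal transform as a product of Householder reflections, train it so every embedding coordinate moves toward its sign, then binarize (the quantization loss and optimizer settings below are a plain reading of the described objective, not the authors' exact code):

```python
import torch

def householder_product(V: torch.Tensor) -> torch.Tensor:
    """Orthogonal d x d matrix as a product of Householder reflections
    H_i = I - 2 v_i v_i^T / ||v_i||^2, one per row of V (shape: (k, d))."""
    Q = torch.eye(V.shape[1])
    for v in V:
        v = v / v.norm()
        Q = Q - 2.0 * torch.outer(Q @ v, v)   # Q @ (I - 2 v v^T)
    return Q

torch.manual_seed(0)
d, n = 64, 1024
Z = torch.randn(n, d)                          # embeddings from stage one
V = torch.randn(d, d, requires_grad=True)      # Householder vectors to learn
opt = torch.optim.SGD([V], lr=0.1)

for _ in range(200):
    Y = Z @ householder_product(V).T
    loss = (Y - Y.sign().detach()).pow(2).mean()  # pull coordinates toward +-1
    opt.zero_grad()
    loss.backward()
    opt.step()

codes = (Z @ householder_product(V).T).sign().detach()  # final binary hash codes
```

    Because similarity measures on the embeddings are invariant under the learned orthogonal rotation, this post-hoc stage leaves retrieval quality intact while making the sign quantization far less lossy.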

High-fidelity 3D Reconstruction of Plants using Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2311.04154
  • repo_url: None
  • paper_authors: Kewei Hu, Ying Wei, Yaoqiang Pan, Hanwen Kang, Chao Chen
  • for: Exploring the use of Neural Radiance Fields (NeRF) for plant phenotyping in agricultural contexts
  • methods: Two state-of-the-art NeRF methods, Instant-NGP and Instant-NSR, are used to synthesize 2D novel-view images and reconstruct 3D crop and plant models, evaluated on a novel plant phenotype dataset of real plant images from production environments
  • results: NeRF achieves commendable performance in synthesizing novel-view images and is competitive with Reality Capture, a leading commercial software for 3D Multi-View Stereo (MVS)-based reconstruction. The study also highlights drawbacks of NeRF, including relatively slow training speeds, performance limitations with insufficient sampling, and challenges in obtaining geometry quality in complex setups
    Abstract Accurate reconstruction of plant phenotypes plays a key role in optimising sustainable farming practices in the field of Precision Agriculture (PA). Currently, optical sensor-based approaches dominate the field, but the need for high-fidelity 3D reconstruction of crops and plants in unstructured agricultural environments remains challenging. Recently, a promising development has emerged in the form of Neural Radiance Field (NeRF), a novel method that utilises neural density fields. This technique has shown impressive performance in various novel vision synthesis tasks, but has remained relatively unexplored in the agricultural context. In our study, we focus on two fundamental tasks within plant phenotyping: (1) the synthesis of 2D novel-view images and (2) the 3D reconstruction of crop and plant models. We explore the world of neural radiance fields, in particular two SOTA methods: Instant-NGP, which excels in generating high-quality images with impressive training and inference speed, and Instant-NSR, which improves the reconstructed geometry by incorporating the Signed Distance Function (SDF) during training. In particular, we present a novel plant phenotype dataset comprising real plant images from production environments. This dataset is a first-of-its-kind initiative aimed at comprehensively exploring the advantages and limitations of NeRF in agricultural contexts. Our experimental results show that NeRF demonstrates commendable performance in the synthesis of novel-view images and is able to achieve reconstruction results that are competitive with Reality Capture, a leading commercial software for 3D Multi-View Stereo (MVS)-based reconstruction. However, our study also highlights certain drawbacks of NeRF, including relatively slow training speeds, performance limitations in cases of insufficient sampling, and challenges in obtaining geometry quality in complex setups.

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.04145
  • repo_url: https://github.com/damo-vilab/i2vgen-xl
  • paper_authors: Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou
  • for: Improving the semantic accuracy, spatio-temporal continuity, and visual clarity of video generation.
  • methods: Proposes a cascaded I2VGen-XL approach that decouples the problem into two stages, improving model performance.
  • results: Extensive experiments and comparisons show that I2VGen-XL simultaneously improves the semantic accuracy, continuity of details, and clarity of generated videos.
    Abstract Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280$\times$720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at \url{https://i2vgen-xl.github.io}.
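The two-stage cascade maps onto a simple control flow: a base diffusion model produces a low-resolution video that stays semantically faithful to the input image, and a refinement model, additionally conditioned on a brief text, lifts it to 1280x720. The sketch below illustrates that control flow only; the stage modules with a .sample() method are hypothetical stand-ins, not the released I2VGen-XL API.

```python
import torch

class CascadedImageToVideo:
    """Minimal control-flow sketch of a two-stage image-to-video cascade.

    Both stage models are assumed to expose a .sample() method running the
    reverse diffusion process; they are placeholders, not the real API.
    """

    def __init__(self, base_stage, refine_stage):
        self.base_stage = base_stage      # preserves semantics/content of the image
        self.refine_stage = refine_stage  # adds detail, lifts resolution to 1280x720

    @torch.no_grad()
    def __call__(self, image: torch.Tensor, brief_text: str) -> torch.Tensor:
        # Stage 1: low-resolution video conditioned on the still image alone,
        # so coherent semantics come before fine detail.
        lowres_video = self.base_stage.sample(image=image)          # (T, 3, h, w)
        # Stage 2: refine conditioned on the low-res video AND a short text,
        # producing the final high-resolution clip.
        video = self.refine_stage.sample(video=lowres_video,
                                         text=brief_text,
                                         out_size=(720, 1280))      # (T, 3, 720, 1280)
        return video
```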

Perceptual Quality Improvement in Videoconferencing using Keyframes-based GAN

  • paper_url: http://arxiv.org/abs/2311.04263
  • repo_url: https://github.com/lorenzoagnolucci/keyframes-gan
  • paper_authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo
  • for: Improving perceptual visual quality in videoconferencing.
  • methods: A GAN-based approach that maintains and updates a set of reference keyframes, extracts multi-scale features, and combines them progressively according to facial landmarks.
  • results: Improves visual quality and generates photo-realistic results even at high compression rates.
    Abstract In the latest years, videoconferencing has taken a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifacts reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after the video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates. Code and pre-trained networks are publicly available at https://github.com/LorenzoAgnolucci/Keyframes-GAN.
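The key ingredient is a compact set of reference keyframes harvested from the higher-quality I-frames and kept up to date during the call. The paper's exact update policy is not reproduced here; the sketch below shows one plausible minimal version under stated assumptions: keep at most K frames, score candidates with a sharpness proxy, and evict the least sharp entry when the buffer is full.

```python
import numpy as np

class KeyframeBuffer:
    """Hypothetical keyframe set: capacity-K, sharpest-wins eviction."""

    def __init__(self, capacity: int = 5):
        self.capacity = capacity
        self.frames: list[tuple[float, np.ndarray]] = []  # (score, frame)

    @staticmethod
    def sharpness(frame: np.ndarray) -> float:
        # Variance of the gray-level Laplacian as a cheap quality proxy
        # (an assumption, not the paper's criterion).
        gray = frame.mean(axis=2)
        lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
               + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
        return float(lap.var())

    def offer(self, iframe: np.ndarray) -> None:
        score = self.sharpness(iframe)
        if len(self.frames) < self.capacity:
            self.frames.append((score, iframe))
            return
        worst = min(range(len(self.frames)), key=lambda i: self.frames[i][0])
        if score > self.frames[worst][0]:
            self.frames[worst] = (score, iframe)  # evict the least sharp reference
```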

Interactive Semantic Map Representation for Skill-based Visual Object Navigation

  • paper_url: http://arxiv.org/abs/2311.04107
  • repo_url: None
  • paper_authors: Tatiana Zemskova, Aleksei Staroverov, Kirill Muravyev, Dmitry Yudin, Aleksandr Panov
  • for: Proposes a learning-based visual object navigation method to improve navigation quality for mobile robots in indoor environments.
  • methods: A neural-network approach that adjusts the weights of the segmentation model by backpropagating predicted fusion-loss values, forming a scene semantic map during the agent's interaction with the environment.
  • results: Large-scale experiments in the Habitat environment show a significant improvement in navigation quality metrics over existing methods; code and custom datasets are publicly released at github.com/AIRI-Institute/skill-fusion.
    Abstract Visual object navigation using learning methods is one of the key tasks in mobile robotics. This paper introduces a new representation of a scene semantic map formed during the embodied agent interaction with the indoor environment. It is based on a neural network method that adjusts the weights of the segmentation model with backpropagation of the predicted fusion loss values during inference on a regular (backward) or delayed (forward) image sequence. We have implemented this representation into a full-fledged navigation approach called SkillTron, which can select robot skills from end-to-end policies based on reinforcement learning and classic map-based planning methods. The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation. We conducted intensive experiments with the proposed approach in the Habitat environment, which showed a significant superiority in navigation quality metrics compared to state-of-the-art approaches. The developed code and used custom datasets are publicly available at github.com/AIRI-Institute/skill-fusion.
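A distinctive detail is that segmentation weights keep being adjusted at inference time by backpropagating predicted fusion-loss values over a regular (backward) or delayed (forward) image sequence. The snippet below is a hedged sketch of that idea as a generic test-time adaptation loop; the fusion_loss callable and the model interface are illustrative assumptions, not the SkillTron code.

```python
import torch

def adapt_segmentation(model, frames, fusion_loss, lr=1e-4, steps=1):
    """Test-time update of segmentation weights from a short frame sequence.

    fusion_loss(predictions) is assumed to return a differentiable scalar
    measuring inconsistency between per-frame predictions when fused into
    a single semantic map (a stand-in for the paper's predicted fusion loss).
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        preds = [model(f) for f in frames]   # per-frame segmentation logits
        loss = fusion_loss(preds)            # disagreement after map fusion
        opt.zero_grad()
        loss.backward()                      # backprop *during inference*
        opt.step()
    return model
```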

DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding

  • paper_url: http://arxiv.org/abs/2311.04098
  • repo_url: https://github.com/gofigure-lanl/figure-segmentation
  • paper_authors: Kehinde Ajayi, Xin Wei, Martin Gryder, Winston Shields, Jian Wu, Shawn M. Jones, Michal Kucer, Diane Oyen
  • for: Advancing computer vision and natural language processing by exploiting big data from practical applications.
  • methods: Uses large-scale design patent documents to provide more than 2.7 million technical drawings, with 132,890 object names and 22,394 viewpoints extracted from them.
  • results: Demonstrates the usefulness of DeepPatent2 with conceptual captioning of technical drawings, and shows the dataset can facilitate other research areas such as 3D image reconstruction and image retrieval.
    Abstract Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data on practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks, such as image captioning, which has primarily been carried out on natural images, still struggle to produce accurate and meaningful captions on sketched images often included in scientific and technical documents. The advancement of other tasks such as 3D reconstruction from 2D images requires larger datasets with multiple viewpoints. We introduce DeepPatent2, a large-scale dataset, providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints extracted from 14 years of US design patent documents. We demonstrate the usefulness of DeepPatent2 with conceptual captioning. We further provide the potential usefulness of our dataset to facilitate other research areas such as 3D image reconstruction and image retrieval.

Image-Pointcloud Fusion based Anomaly Detection using PD-REAL Dataset

  • paper_url: http://arxiv.org/abs/2311.04095
  • repo_url: None
  • paper_authors: Jianjian Qin, Chunzhi Gu, Jun Yu, Chao Zhang
  • for: Researchers and practitioners in unsupervised anomaly detection (AD) in the 3D domain.
  • methods: A novel dataset, PD-REAL, consisting of Play-Doh models of 15 object categories with six types of anomalies, captured under different lighting conditions with a RealSense camera; state-of-the-art AD algorithms are used to evaluate the benefits and challenges of 3D information.
  • results: PD-REAL provides a controlled environment for analyzing the potential benefits of 3D information in AD tasks; 3D information can improve anomaly detection over 2D-only representations, but challenges remain, such as the need for more sophisticated algorithms to handle varying lighting conditions and object orientations.
    Abstract We present PD-REAL, a novel large-scale dataset for unsupervised anomaly detection (AD) in the 3D domain. It is motivated by the fact that 2D-only representations in the AD task may fail to capture the geometric structures of anomalies due to uncertainty in lighting conditions or shooting angles. PD-REAL consists entirely of Play-Doh models for 15 object categories and focuses on the analysis of potential benefits from 3D information in a controlled environment. Specifically, objects are first created with six types of anomalies, such as dent, crack, or perforation, and then photographed under different lighting conditions to mimic real-world inspection scenarios. To demonstrate the usefulness of 3D information, we use a commercially available RealSense camera to capture RGB and depth images. Compared to the existing 3D dataset for AD tasks, the data acquisition of PD-REAL is significantly cheaper, easily scalable and easier to control variables. Extensive evaluations with state-of-the-art AD algorithms on our dataset demonstrate the benefits as well as challenges of using 3D information. Our dataset can be downloaded from https://github.com/Andy-cs008/PD-REAL

Proceedings of the 5th International Workshop on Reading Music Systems

  • paper_url: http://arxiv.org/abs/2311.04091
  • repo_url: https://github.com/suziai/gui-tools
  • paper_authors: Jorge Calvo-Zaragoza, Alexander Pacha, Elona Shatri
  • for: Connecting researchers who develop systems for reading music, such as in Optical Music Recognition, with other researchers and practitioners who could benefit from such systems.
  • methods: Covers music reading systems and related techniques, including image processing, writer identification, and multi-modal systems.
  • results: Presents the proceedings of the 5th International Workshop on Reading Music Systems, with papers on topics including music recognition, datasets and performance evaluation, and image processing.
    Abstract The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 5th International Workshop on Reading Music Systems, held in Milan, Italy on Nov. 4th 2023.

Restoration of Analog Videos Using Swin-UNet

  • paper_url: http://arxiv.org/abs/2311.04261
  • repo_url: https://github.com/miccunifi/analog-video-restoration
  • paper_authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo
  • for: Restoring analog videos from historical archives.
  • methods: A multi-frame approach that deals with severe tape mistracking.
  • results: Effective restoration of the original content, tested on real-world videos from a major historical video archive.
    Abstract In this paper, we present a system to restore analog videos of historical archives. These videos often contain severe visual degradation due to the deterioration of their tape supports that require costly and slow manual interventions to recover the original content. The proposed system uses a multi-frame approach and is able to deal with severe tape mistracking, which results in completely scrambled frames. Tests on real-world videos from a major historical video archive show the effectiveness of our demo system. The code and the pre-trained model are publicly available at https://github.com/miccunifi/analog-video-restoration.

Learning Super-Resolution Ultrasound Localization Microscopy from Radio-Frequency Data

  • paper_url: http://arxiv.org/abs/2311.04081
  • repo_url: None
  • paper_authors: Christopher Hahne, Georges Chabouh, Olivier Couture, Raphael Sznitman
  • for: Improving the resolution performance of ultrasound localization microscopy (ULM) through fast and efficient target localization.
  • methods: Feeds unprocessed radio-frequency (RF) data into a super-resolution network, bypassing delay-and-sum (DAS) beamforming; label projection and inverse point transformation between B-mode and RF coordinate space are implemented to enable this.
  • results: Compared with state-of-the-art techniques, the RF-trained network suggests that excluding DAS beamforming offers great potential to optimize ULM resolution performance.
    Abstract Ultrasound Localization Microscopy (ULM) enables imaging of vascular structures in the micrometer range by accumulating contrast agent particle locations over time. Precise and efficient target localization accuracy remains an active research topic in the ULM field to further push the boundaries of this promising medical imaging technology. Existing work incorporates Delay-And-Sum (DAS) beamforming into particle localization pipelines, which ultimately determines the ULM image resolution capability. In this paper we propose to feed unprocessed Radio-Frequency (RF) data into a super-resolution network while bypassing DAS beamforming and its limitations. To facilitate this, we demonstrate label projection and inverse point transformation between B-mode and RF coordinate space as required by our approach. We assess our method against state-of-the-art techniques based on a public dataset featuring in silico and in vivo data. Results from our RF-trained network suggest that excluding DAS beamforming offers a great potential to optimize on the ULM resolution performance.
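For context on what the RF-trained network bypasses: delay-and-sum (DAS) beamforming forms each B-mode pixel by summing the RF channel signals at the round-trip delays implied by the pixel's position. The sketch below is a textbook plane-wave DAS implementation under simplifying assumptions (a single plane wave at normal incidence, known speed of sound); it is not the paper's processing chain.

```python
import numpy as np

def das_beamform(rf, x_elems, z_grid, x_grid, fs, c=1540.0):
    """Plane-wave delay-and-sum beamforming (normal-incidence transmit).

    rf:      (n_elements, n_samples) received RF channel data
    x_elems: (n_elements,) lateral element positions [m]
    z_grid:  (nz,) depths [m]; x_grid: (nx,) lateral pixel positions [m]
    fs:      sampling rate [Hz]; c: speed of sound [m/s]
    """
    n_elem, n_samp = rf.shape
    image = np.zeros((len(z_grid), len(x_grid)))
    for iz, z in enumerate(z_grid):
        for ix, x in enumerate(x_grid):
            # transmit delay: the plane wave reaches depth z after z/c;
            # receive delay: the echo travels from (x, z) back to each element
            t = (z + np.sqrt(z**2 + (x - x_elems) ** 2)) / c
            idx = np.clip(np.round(t * fs).astype(int), 0, n_samp - 1)
            image[iz, ix] = rf[np.arange(n_elem), idx].sum()
    return image
```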

Augmenting Lane Perception and Topology Understanding with Standard Definition Navigation Maps

  • paper_url: http://arxiv.org/abs/2311.04079
  • repo_url: None
  • paper_authors: Katie Z Luo, Xinshuo Weng, Yan Wang, Shuang Wu, Jie Li, Kilian Q Weinberger, Yue Wang, Marco Pavone
  • for: Real-time lane-topology prediction for autonomous driving.
  • methods: Augments online map prediction with Standard Definition (SD) maps via a Transformer-based encoder, SD Map Encoder Representations from transFormers (SMERF).
  • results: Improves lane detection and topology prediction by up to 60% over existing online map prediction methods, and can be immediately applied to any Transformer-based lane-topology method.
    Abstract Autonomous driving has traditionally relied heavily on costly and labor-intensive High Definition (HD) maps, hindering scalability. In contrast, Standard Definition (SD) maps are more affordable and have worldwide coverage, offering a scalable alternative. In this work, we systematically explore the effect of SD maps for real-time lane-topology understanding. We propose a novel framework to integrate SD maps into online map prediction and propose a Transformer-based encoder, SD Map Encoder Representations from transFormers, to leverage priors in SD maps for the lane-topology prediction task. This enhancement consistently and significantly boosts (by up to 60%) lane detection and topology prediction on current state-of-the-art online map prediction methods without bells and whistles and can be immediately incorporated into any Transformer-based lane-topology method. Code is available at https://github.com/NVlabs/SMERF.

Energy-based Calibrated VAE with Test Time Free Lunch

  • paper_url: http://arxiv.org/abs/2311.04071
  • repo_url: None
  • paper_authors: Yihong Luo, Siya Qiu, Xingjian Tao, Yujun Cai, Jing Tang
  • for: Improving the sampling efficiency and generation quality of Variational Autoencoders (VAEs).
  • methods: Uses a conditional EBM for sampling and generation, without requiring MCMC sampling.
  • results: Proposes an Energy-Calibrated Generative Model that needs no MCMC sampling at either the training or testing stage and achieves state-of-the-art performance across several applications.
    Abstract In this paper, we propose a novel Energy-Calibrated Generative Model that utilizes a Conditional EBM for enhancing Variational Autoencoders (VAEs). VAEs are sampling efficient but often suffer from blurry generation results due to the lack of training in the generative direction. On the other hand, Energy-Based Models (EBMs) can generate high-quality samples but require expensive Markov Chain Monte Carlo (MCMC) sampling. To address these issues, we introduce a Conditional EBM for calibrating the generative direction during training, without requiring it for test time sampling. Our approach enables the generative model to be trained upon data and calibrated samples with adaptive weight, thereby enhancing efficiency and effectiveness without necessitating MCMC sampling in the inference phase. We also show that the proposed approach can be extended to calibrate normalizing flows and variational posterior. Moreover, we propose to apply the proposed method to zero-shot image restoration via neural transport prior and range-null theory. We demonstrate the effectiveness of the proposed method through extensive experiments in various applications, including image generation and zero-shot image restoration. Our method shows state-of-the-art performance over single-step non-adversarial generation.

LISBET: a self-supervised Transformer model for the automatic segmentation of social behavior motifs

  • paper_url: http://arxiv.org/abs/2311.04069
  • repo_url: None
  • paper_authors: Giuseppe Chindemi, Benoit Girard, Camilla Bellone
  • for: Understanding the core principles of social behavior and identifying potential therapeutic targets for addressing social deficits.
  • methods: Introduces LISBET, a model that uses self-supervised learning to detect and quantify social behaviors from dynamic body-part tracking data.
  • results: LISBET works in both hypothesis-driven and discovery-driven modes to automate behavior classification and segment social behavior motifs; the recognized motifs closely match human annotations and correlate with the electrophysiological activity of dopaminergic neurons in the Ventral Tegmental Area (VTA).
    Abstract Social behavior, defined as the process by which individuals act and react in response to others, is crucial for the function of societies and holds profound implications for mental health. To fully grasp the intricacies of social behavior and identify potential therapeutic targets for addressing social deficits, it is essential to understand its core principles. Although machine learning algorithms have made it easier to study specific aspects of complex behavior, current methodologies tend to focus primarily on single-animal behavior. In this study, we introduce LISBET (seLf-supervIsed Social BEhavioral Transformer), a model designed to detect and segment social interactions. Our model eliminates the need for feature selection and extensive human annotation by using self-supervised learning to detect and quantify social behaviors from dynamic body parts tracking data. LISBET can be used in hypothesis-driven mode to automate behavior classification using supervised finetuning, and in discovery-driven mode to segment social behavior motifs using unsupervised learning. We found that motifs recognized using the discovery-driven approach not only closely match the human annotations but also correlate with the electrophysiological activity of dopaminergic neurons in the Ventral Tegmental Area (VTA). We hope LISBET will help the community improve our understanding of social behaviors and their neural underpinnings.

mmFUSION: Multimodal Fusion for 3D Objects Detection

  • paper_url: http://arxiv.org/abs/2311.04058
  • repo_url: None
  • paper_authors: Javed Ahmad, Alessio Del Bue
  • for: Proposes a new intermediate-level multimodal fusion (mmFUSION) method to address the challenges of multi-sensor fusion in self-driving systems.
  • methods: mmFUSION computes features separately for each sensor and then fuses them through cross-modality and multi-modality attention mechanisms.
  • results: On the KITTI and NuScenes datasets, mmFUSION outperforms early, intermediate, late, and even two-stage fusion schemes, preserving multimodal information and learning to compensate for each modality's deficiencies.
    Abstract Multi-sensor fusion is essential for accurate 3D object detection in self-driving systems. Camera and LiDAR are the most commonly used sensors, and usually, their fusion happens at the early or late stages of 3D detectors with the help of regions of interest (RoIs). On the other hand, fusion at the intermediate level is more adaptive because it does not need RoIs from modalities but is complex as the features of both modalities are presented from different points of view. In this paper, we propose a new intermediate-level multi-modal fusion (mmFUSION) approach to overcome these challenges. First, the mmFUSION uses separate encoders for each modality to compute features at a desired lower space volume. Second, these features are fused through cross-modality and multi-modality attention mechanisms proposed in mmFUSION. The mmFUSION framework preserves multi-modal information and learns to complement modalities' deficiencies through attention weights. The strong multi-modal features from the mmFUSION framework are fed to a simple 3D detection head for 3D predictions. We evaluate mmFUSION on the KITTI and NuScenes dataset where it performs better than available early, intermediate, late, and even two-stage based fusion schemes. The code with the mmdetection3D project plugin will be publicly available soon.
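The fusion step, in which camera and LiDAR features attend across modalities before a shared detection head, can be sketched with standard attention primitives. The code below is a hedged illustration rather than the mmFUSION architecture: both feature maps are assumed to be flattened into token sequences of a common channel width, and each modality cross-attends to the other before fusion.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative two-way cross-attention between camera and LiDAR tokens."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cam_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_from_cam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        # cam: (B, N_cam, dim), lidar: (B, N_lidar, dim) flattened feature tokens
        cam2, _ = self.cam_from_lidar(query=cam, key=lidar, value=lidar)
        lidar2, _ = self.lidar_from_cam(query=lidar, key=cam, value=cam)
        # Pool each enriched modality and fuse; a real detector would keep
        # spatial structure for the 3D head instead of pooling.
        fused = torch.cat([cam2.mean(1), lidar2.mean(1)], dim=-1)  # (B, 2*dim)
        return self.proj(fused)                                    # (B, dim)
```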

Generative Structural Design Integrating BIM and Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.04052
  • repo_url: None
  • paper_authors: Zhili He, Yu-Hsing Wang, Jian Zhang
  • for: Proposes a comprehensive solution for intelligent structural design using AI, focusing on improving the perceptual quality and details of generations.
  • methods: Introduces building information modeling (BIM) into intelligent structural design and establishes a design pipeline integrating BIM and generative AI; proposes a novel 2-stage generation framework, adopts diffusion models (DMs) instead of generative adversarial network (GAN)-based models via a physics-based conditional diffusion model (PCDM), and designs an attention block (AB) consisting of a self-attention block (SAB) and a parallel cross-attention block (PCAB) to facilitate cross-domain data fusion.
  • results: Quantitative and qualitative results demonstrate the powerful generation and representation capabilities of the proposed method, and show that DMs have the potential to replace GANs as the new benchmark for generative problems in civil engineering.
    Abstract Intelligent structural design using AI can effectively reduce time overhead and increase efficiency. It has potential to become the new design paradigm in the future to assist and even replace engineers, and so it has become a research hotspot in the academic community. However, current methods have some limitations to be addressed, whether in terms of application scope, visual quality of generated results, or evaluation metrics of results. This study proposes a comprehensive solution. Firstly, we introduce building information modeling (BIM) into intelligent structural design and establishes a structural design pipeline integrating BIM and generative AI, which is a powerful supplement to the previous frameworks that only considered CAD drawings. In order to improve the perceptual quality and details of generations, this study makes 3 contributions. Firstly, in terms of generation framework, inspired by the process of human drawing, a novel 2-stage generation framework is proposed to replace the traditional end-to-end framework to reduce the generation difficulty for AI models. Secondly, in terms of generative AI tools adopted, diffusion models (DMs) are introduced to replace widely used generative adversarial network (GAN)-based models, and a novel physics-based conditional diffusion model (PCDM) is proposed to consider different design prerequisites. Thirdly, in terms of neural networks, an attention block (AB) consisting of a self-attention block (SAB) and a parallel cross-attention block (PCAB) is designed to facilitate cross-domain data fusion. The quantitative and qualitative results demonstrate the powerful generation and representation capabilities of PCDM. Necessary ablation studies are conducted to examine the validity of the methods. This study also shows that DMs have the potential to replace GANs and become the new benchmark for generative problems in civil engineering.

3D EAGAN: 3D edge-aware attention generative adversarial network for prostate segmentation in transrectal ultrasound images

  • paper_url: http://arxiv.org/abs/2311.04049
  • repo_url: None
  • paper_authors: Mengqing Liu, Xiao Shao, Liping Jiang, Kaizhi Wu
  • for: Developing an effective method for prostate segmentation in transrectal ultrasound (TRUS) images, which is challenging because prostates have ambiguous boundaries and inhomogeneous intensity distributions.
  • methods: A 3D edge-aware attention generative adversarial network (3D EAGAN) consisting of an edge-aware segmentation network (EASNet) and a discriminator that distinguishes predicted from real prostates; EASNet comprises an encoder-decoder U-Net backbone, a detail compensation module, four 3D spatial and channel attention modules, an edge enhance module, and a global feature extractor.
  • results: Accurate prostate segmentation with better use of edge information and detailed features.
    Abstract Automatic prostate segmentation in TRUS images has always been a challenging problem, since prostates in TRUS images have ambiguous boundaries and inhomogeneous intensity distribution. Although many prostate segmentation methods have been proposed, they still need to be improved due to the lack of sensibility to edge information. Consequently, the objective of this study is to devise a highly effective prostate segmentation method that overcomes these limitations and achieves accurate segmentation of prostates in TRUS images. A 3D edge-aware attention generative adversarial network (3D EAGAN)-based prostate segmentation method is proposed in this paper, which consists of an edge-aware segmentation network (EASNet) that performs the prostate segmentation and a discriminator network that distinguishes predicted prostates from real prostates. The proposed EASNet is composed of an encoder-decoder-based U-Net backbone network, a detail compensation module, four 3D spatial and channel attention modules, an edge enhance module, and a global feature extractor. The detail compensation module is proposed to compensate for the loss of detailed information caused by the down-sampling process of the encoder. The features of the detail compensation module are selectively enhanced by the 3D spatial and channel attention module. Furthermore, an edge enhance module is proposed to guide shallow layers in the EASNet to focus on contour and edge information in prostates. Finally, features from shallow layers and hierarchical features from the decoder module are fused through the global feature extractor to predict the segmentation prostates.
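The 3D spatial and channel attention modules follow a familiar pattern: channel attention from pooled volume descriptors, then spatial attention from channel-pooled maps. The block below is a hedged, CBAM-style 3D reconstruction under that assumption, illustrating the mechanism rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention3D(nn.Module):
    """CBAM-style 3D attention: channel gating, then spatial gating."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv3d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = x.shape
        # Channel attention from average- and max-pooled volume descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3, 4)))
        mx = self.channel_mlp(x.amax(dim=(2, 3, 4)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1, 1)
        # Spatial attention from channel-wise mean/max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))
```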

Analyzing Near-Infrared Hyperspectral Imaging for Protein Content Regression and Grain Variety Classification Using Bulk References and Varying Grain-to-Background Ratios

  • paper_url: http://arxiv.org/abs/2311.04042
  • repo_url: None
  • paper_authors: Ole-Christian Galbo Engstrøm, Erik Schou Dreier, Birthe Møller Jespersen, Kim Steenstrup Pedersen
  • for: Assesses the use of near-infrared hyperspectral imaging (NIR-HSI) for calibrating models on two tasks: protein content regression and grain variety classification.
  • methods: Limited bulk protein reference data are expanded by subsampling and associating subsamples with the bulk value; because this introduces significant biases from skewed, leptokurtic prediction distributions affecting both PLS-R and deep CNN models, adjustments are proposed to mitigate these biases.
  • results: Higher grain-to-background ratios yield more accurate predictions on both tasks, but including lower-ratio images in calibration enhances model robustness for such scenarios.
    Abstract Based on previous work, we assess the use of NIR-HSI images for calibrating models on two datasets, focusing on protein content regression and grain variety classification. Limited reference data for protein content is expanded by subsampling and associating it with the bulk sample. However, this method introduces significant biases due to skewed leptokurtic prediction distributions, affecting both PLS-R and deep CNN models. We propose adjustments to mitigate these biases, improving mean protein reference predictions. Additionally, we investigate the impact of grain-to-background ratios on both tasks. Higher ratios yield more accurate predictions, but including lower-ratio images in calibration enhances model robustness for such scenarios.

Data exploitation: multi-task learning of object detection and semantic segmentation on partially annotated data

  • paper_url: http://arxiv.org/abs/2311.04040
  • repo_url: None
  • paper_authors: Hoàng-Ân Lê, Minh-Tan Pham
  • for: Studies the joint learning of object detection and semantic segmentation, the two most popular vision tasks, from multi-task data with partial annotations.
  • methods: Combines the two tasks through multi-task learning and knowledge distillation, with extensive experiments evaluating each task's performance and their complementarity.
  • results: Experiments show that joint learning and knowledge distillation improve multi-task performance, and provide better results when the two tasks cannot be optimized simultaneously.
    Abstract Multi-task partially annotated data where each data point is annotated for only a single task are potentially helpful for data scarcity if a network can leverage the inter-task relationship. In this paper, we study the joint learning of object detection and semantic segmentation, the two most popular vision problems, from multi-task data with partial annotations. Extensive experiments are performed to evaluate each task performance and explore their complementarity when a multi-task network cannot optimize both tasks simultaneously. We propose employing knowledge distillation to leverage joint-task optimization. The experimental results show favorable results for multi-task learning and knowledge distillation over single-task learning and even full supervision scenario. All code and data splits are available at https://github.com/lhoangan/multas
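With partially annotated data, the practical core is that each sample contributes only the loss of the task it is labeled for, optionally plus a distillation term from a single-task teacher on the other task. The sketch below illustrates that bookkeeping under stated assumptions; the two-head model, the loss callables, and the teacher interface are placeholders, not the paper's implementation.

```python
import torch

def multitask_partial_loss(model, teacher_seg, batch, det_loss_fn, seg_loss_fn,
                           distill_weight=0.5):
    """Per-sample loss for detection/segmentation data with partial labels.

    batch['task'] marks which annotation each sample carries ('det' or 'seg');
    the teacher provides soft segmentation targets for detection-only samples.
    """
    det_out, seg_out = model(batch["image"])  # assumed two-head network
    loss = torch.zeros((), device=batch["image"].device)
    for i, task in enumerate(batch["task"]):
        if task == "det":
            loss = loss + det_loss_fn(det_out[i], batch["boxes"][i])
            with torch.no_grad():             # distill the seg head from a teacher
                soft = teacher_seg(batch["image"][i : i + 1])[0]
            loss = loss + distill_weight * seg_loss_fn(seg_out[i], soft)
        else:  # 'seg'
            loss = loss + seg_loss_fn(seg_out[i], batch["masks"][i])
    return loss / len(batch["task"])
```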

Exploring Dataset-Scale Indicators of Data Quality

  • paper_url: http://arxiv.org/abs/2311.04016
  • repo_url: None
  • paper_authors: Benjamin Feuer, Chinmay Hegde
  • for: Examines what constitutes data quality for training computer vision foundation models.
  • methods: Evaluates data quality through two important dataset-level constituents: label set design and class balance.
  • results: By monitoring these constituents with the provided key indicators, researchers and practitioners can better anticipate model performance and robustness to distribution shifts.
    Abstract Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs. Recent research has suggested that improving data quality can significantly reduce the need for data quantity. But what constitutes data quality in computer vision? We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been more extensively studied than the latter. We ablate the effects of two important dataset-level constituents: label set design, and class balance. By monitoring these constituents using key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of its accuracy and robustness to distribution shifts.
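Of the two constituents, class balance is the cheaper one to monitor. A common summary, used here as an illustrative assumption rather than necessarily the paper's exact indicator, is the normalized entropy of the label distribution: 1.0 for a perfectly balanced label set, approaching 0 as one class dominates.

```python
import math
from collections import Counter

def class_balance(labels) -> float:
    """Normalized entropy of the empirical label distribution, in [0, 1]."""
    counts = Counter(labels)
    n = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # a single class carries no balance to measure
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

print(class_balance(["cat", "dog", "cat", "dog"]))  # 1.0 (perfectly balanced)
print(class_balance(["cat"] * 99 + ["dog"]))        # ~0.08 (heavily skewed)
```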

AGNES: Abstraction-guided Framework for Deep Neural Networks Security

  • paper_url: http://arxiv.org/abs/2311.04009
  • repo_url: None
  • paper_authors: Akshay Dhonthi, Marcello Eiermann, Ernst Moritz Hahn, Vahid Hashemi
  • for: Detecting backdoors in Deep Neural Networks (DNNs) to ensure the correctness of image recognition tasks.
  • methods: AGNES, a new tool for detecting backdoors in DNNs, based on a novel feature extraction technique that identifies backdoors accurately.
  • results: Across multiple relevant case studies, AGNES performs better than many state-of-the-art methods at accurately detecting backdoors in DNNs.
    Abstract Deep Neural Networks (DNNs) are becoming widespread, particularly in safety-critical areas. One prominent application is image recognition in autonomous driving, where the correct classification of objects, such as traffic signs, is essential for safe driving. Unfortunately, DNNs are prone to backdoors, meaning that they concentrate on attributes of the image that should be irrelevant for their correct classification. Backdoors are integrated into a DNN during training, either with malicious intent (such as a manipulated training process, because of which a yellow sticker always leads to a traffic sign being recognised as a stop sign) or unintentional (such as a rural background leading to any traffic sign being recognised as animal crossing, because of biased training data). In this paper, we introduce AGNES, a tool to detect backdoors in DNNs for image recognition. We discuss the principle approach on which AGNES is based. Afterwards, we show that our tool performs better than many state-of-the-art methods for multiple relevant case studies.
    摘要 在这篇论文中,我们介绍了 AGNES,一种用于检测 DNNs 中的后门的工具。我们讲述了 AGNES 的原理方法。然后,我们表明了我们的工具在多个相关的案例研究中表现出色。

Bias and Diversity in Synthetic-based Face Recognition

  • paper_url: http://arxiv.org/abs/2311.03970
  • repo_url: None
  • paper_authors: Marco Huber, Anh Thi Luu, Fadi Boutros, Arjan Kuijper, Naser Damer
  • for: Investigates the diversity of synthetic face recognition datasets, the effect of the generative models' training data on the generated data, and the bias of such models across attributes.
  • methods: Analyzes three recent generative models and the distributions of gender, ethnicity, age, and head position in their training and generated data.
  • results: The generators produce distributions similar to their training data across the studied attributes; synthetic-based models share a similar bias behavior with authentic-based models, though the lower intra-identity attribute consistency appears beneficial in reducing bias.
    Abstract Synthetic data is emerging as a substitute for authentic data to solve ethical and legal challenges in handling authentic face data. The current models can create real-looking face images of people who do not exist. However, it is a known and sensitive problem that face recognition systems are susceptible to bias, i.e. performance differences between different demographic and non-demographics attributes, which can lead to unfair decisions. In this work, we investigate how the diversity of synthetic face recognition datasets compares to authentic datasets, and how the distribution of the training data of the generative models affects the distribution of the synthetic data. To do this, we looked at the distribution of gender, ethnicity, age, and head position. Furthermore, we investigated the concrete bias of three recent synthetic-based face recognition models on the studied attributes in comparison to a baseline model trained on authentic data. Our results show that the generator generate a similar distribution as the used training data in terms of the different attributes. With regard to bias, it can be seen that the synthetic-based models share a similar bias behavior with the authentic-based models. However, with the uncovered lower intra-identity attribute consistency seems to be beneficial in reducing bias.

CeCNN: Copula-enhanced convolutional neural networks in joint prediction of refraction error and axial length based on ultra-widefield fundus images

  • paper_url: http://arxiv.org/abs/2311.03967
  • repo_url: None
  • paper_authors: Chong Zhong, Yang Li, Danjuan Yang, Meiyan Li, Xingyao Zhou, Bo Fu, Catherine C. Liu, A. H. Welsh
  • for: Applying deep learning to jointly predict spherical equivalent (SE) and axial length (AL) from ultra-widefield fundus images, using a multi-response task to improve prediction accuracy.
  • methods: A copula-enhanced convolutional neural network (CeCNN) that models the dependence between responses and uses the induced copula-likelihood loss to improve prediction accuracy.
  • results: Adding the dependency information improves prediction accuracy, and the approach applies across different backbones and experimental designs.
    Abstract Ultra-widefield (UWF) fundus images are replacing traditional fundus images in screening, detection, prediction, and treatment of complications related to myopia because their much broader visual range is advantageous for highly myopic eyes. Spherical equivalent (SE) is extensively used as the main myopia outcome measure, and axial length (AL) has drawn increasing interest as an important ocular component for assessing myopia. Cutting-edge studies show that SE and AL are strongly correlated. Using the joint information from SE and AL is potentially better than using either separately. In the deep learning community, though there is research on multiple-response tasks with a 3D image biomarker, dependence among responses is only sporadically taken into consideration. Inspired by the spirit that information extracted from the data by statistical methods can improve the prediction accuracy of deep learning models, we formulate a class of multivariate response regression models with a higher-order tensor biomarker, for the bivariate tasks of regression-classification and regression-regression. Specifically, we propose a copula-enhanced convolutional neural network (CeCNN) framework that incorporates the dependence between responses through a Gaussian copula (with parameters estimated from a warm-up CNN) and uses the induced copula-likelihood loss with the backbone CNNs. We establish the statistical framework and algorithms for the aforementioned two bivariate tasks. We show that the CeCNN has better prediction accuracy after adding the dependency information to the backbone models. The modeling and the proposed CeCNN algorithm are applicable beyond the UWF scenario and can be effective with other backbones beyond ResNet and LeNet.

Fast Sun-aligned Outdoor Scene Relighting based on TensoRF

  • paper_url: http://arxiv.org/abs/2311.03965
  • repo_url: None
  • paper_authors: Yeonjin Chang, Yearim Kim, Seunghyeon Seo, Jung Yi, Nojun Kwak
  • for: Proposes Sun-aligned Relighting TensoRF (SR-TensoRF), an outdoor scene relighting method for Neural Radiance Fields (NeRF).
  • methods: Uses the sun direction as an input to shadow generation, significantly simplifying the inference pipeline, while exploiting the training efficiency of TensoRF.
  • results: Rapidly produces high-quality shadows and renderings, with faster training and rendering than existing methods.
    Abstract In this work, we introduce our method of outdoor scene relighting for Neural Radiance Fields (NeRF) named Sun-aligned Relighting TensoRF (SR-TensoRF). SR-TensoRF offers a lightweight and rapid pipeline aligned with the sun, thereby achieving a simplified workflow that eliminates the need for environment maps. Our sun-alignment strategy is motivated by the insight that shadows, unlike viewpoint-dependent albedo, are determined by light direction. We directly use the sun direction as an input during shadow generation, simplifying the requirements of the inference process significantly. Moreover, SR-TensoRF leverages the training efficiency of TensoRF by incorporating our proposed cubemap concept, resulting in notable acceleration in both training and rendering processes compared to existing methods.

Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

  • paper_url: http://arxiv.org/abs/2311.03964
  • repo_url: https://github.com/ugorsahin/Generative-Negative-Mining
  • paper_authors: Ugur Sahin, Hang Li, Qadeer Khan, Daniel Cremers, Volker Tresp
  • for: Improving the performance of large-scale visual language models (VLMs) on multimodal compositional reasoning tasks.
  • methods: Proposes a framework that not only mines negative examples in both directions but also generates challenging negative samples in both modalities, i.e., images and texts.
  • results: Leveraging these generated hard negative samples significantly enhances VLMs' performance on multimodal compositional reasoning tasks.
    Abstract Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation 2) But existing generative approaches primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs' performance in tasks involving multimodal compositional reasoning. Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
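Once hard negatives exist in both modalities, they slot into a standard image-text contrastive objective as extra candidates in the similarity matrix. The sketch below shows that pattern, an InfoNCE-style loss with appended generated negatives; it is a generic illustration, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img, txt, hard_img, hard_txt, tau=0.07):
    """InfoNCE over a batch, with generated hard negatives appended.

    img, txt:           (B, d) matched image/text embeddings
    hard_img, hard_txt: (B, d) generated negatives for each pair
    All embeddings are assumed to be L2-normalized.
    """
    b = img.size(0)
    # Image-to-text: candidates are all batch texts plus each pair's hard text.
    logits_i2t = torch.cat([img @ txt.t(),                       # (B, B)
                            (img * hard_txt).sum(-1, keepdim=True)], 1) / tau
    # Text-to-image: all batch images plus each pair's hard image.
    logits_t2i = torch.cat([txt @ img.t(),
                            (txt * hard_img).sum(-1, keepdim=True)], 1) / tau
    target = torch.arange(b, device=img.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits_i2t, target)
                  + F.cross_entropy(logits_t2i, target))
```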

Improving the Effectiveness of Deep Generative Data

  • paper_url: http://arxiv.org/abs/2311.03959
  • repo_url: None
  • paper_authors: Ruyu Wang, Sabrina Schmedding, Marco F. Huber
  • for: Investigates how synthetic images from deep generative models perform in downstream image processing tasks and proposes a new taxonomy to improve their effectiveness.
  • methods: Uses generative adversarial networks (GANs) and diffusion probabilistic models (DPMs) to generate high-fidelity images, and conducts extensive experiments on the CIFAR-10 dataset.
  • results: Finds that the Content Gap accounts for much of the performance drop with DGM-generated images and proposes strategies to better utilize them in downstream tasks; experiments show the method outperforms baselines in both Synthetic-to-Real and Data Augmentation settings, especially when data are scarce.
    Abstract Recent deep generative models (DGMs) such as generative adversarial networks (GANs) and diffusion probabilistic models (DPMs) have shown their impressive ability in generating high-fidelity photorealistic images. Although looking appealing to human eyes, training a model on purely synthetic images for downstream image processing tasks like image classification often results in an undesired performance drop compared to training on real data. Previous works have demonstrated that enhancing a real dataset with synthetic images from DGMs can be beneficial. However, the improvements were subjected to certain circumstances and yet were not comparable to adding the same number of real images. In this work, we propose a new taxonomy to describe factors contributing to this commonly observed phenomenon and investigate it on the popular CIFAR-10 dataset. We hypothesize that the Content Gap accounts for a large portion of the performance drop when using synthetic images from DGM and propose strategies to better utilize them in downstream tasks. Extensive experiments on multiple datasets showcase that our method outperforms baselines on downstream classification tasks both in case of training on synthetic only (Synthetic-to-Real) and training on a mix of real and synthetic data (Data Augmentation), particularly in the data-scarce scenario.

CLIP Guided Image-perceptive Prompt Learning for Image Enhancement

  • paper_url: http://arxiv.org/abs/2311.03943
  • repo_url: None
  • paper_authors: Zinuo Li, Qiuhong Ke, Weiwen Chen
  • for: Proposes an image enhancement method based on Contrastive Language-Image Pre-Training (CLIP) guided prompt learning.
  • methods: Learns image-perceptive prompts with the CLIP model and uses them, like a loss function, to steer an enhancement network that predicts the weights of three look-up tables (LUTs).
  • results: Shows that introducing CLIP's prior knowledge into image enhancement yields satisfactory results and is simpler and more efficient than conventional LUT methods.
    Abstract Image enhancement is a significant research area in the fields of computer vision and image processing. In recent years, many learning-based methods for image enhancement have been developed, where the look-up table (LUT) has proven to be an effective tool. In this paper, we delve into the potential of Contrastive Language-Image Pre-Training (CLIP) guided prompt learning, proposing a simple structure called CLIP-LUT for image enhancement. We find that the prior knowledge of CLIP can effectively discern the quality of degraded images and thus provide reliable guidance. Specifically, we first learn image-perceptive prompts that distinguish original from target images using the CLIP model; meanwhile, we introduce a very simple network, built on a simple baseline, that predicts the weights of three different LUTs as the enhancement network. The obtained prompts steer the enhancement network like a loss function and improve the performance of the model. We demonstrate that by simply combining a straightforward method with CLIP, we can obtain satisfactory results.
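The enhancement itself reduces to a weighted blend of a few learned LUTs, with CLIP prompts supplying the training signal. The sketch below shows the blend and a prompt-based loss in that spirit; the prompt texts, the 1D-LUT simplification, the toy weight predictor, and the use of the open-source clip package are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
import clip  # github.com/openai/CLIP

class LUTEnhancer(torch.nn.Module):
    """Blend of three learnable 1D LUTs (a simplification of 3D LUTs)."""

    def __init__(self, bins: int = 33):
        super().__init__()
        init = torch.linspace(0, 1, bins)
        self.luts = torch.nn.Parameter(init.repeat(3, 1))  # (3, bins), identity init
        self.weight_head = torch.nn.Linear(3, 3)           # toy predictor from RGB means

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, H, W) in [0, 1]
        w = torch.softmax(self.weight_head(img.mean(dim=(2, 3))), dim=-1)  # (B, 3)
        last = self.luts.size(1) - 1
        idx = (img * last).long().clamp_max(last)
        mapped = torch.stack([lut[idx] for lut in self.luts])  # (3, B, 3, H, W)
        return (w.t().reshape(3, -1, 1, 1, 1) * mapped).sum(0).clamp(0, 1)

model, _ = clip.load("ViT-B/32")
prompts = clip.tokenize(["a high quality, well-exposed photo",
                         "a low quality, degraded photo"])  # assumed prompt pair
text_feat = F.normalize(model.encode_text(prompts), dim=-1)

def clip_guidance_loss(enhanced_224: torch.Tensor) -> torch.Tensor:
    """Push enhanced images toward the 'good' prompt (illustrative loss)."""
    img_feat = F.normalize(model.encode_image(enhanced_224), dim=-1)
    logits = img_feat @ text_feat.t()  # (B, 2)
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))
```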

Analysis of NaN Divergence in Training Monocular Depth Estimation Model

  • paper_url: http://arxiv.org/abs/2311.03938
  • repo_url: None
  • paper_authors: Bum Jun Kim, Hyeonah Jang, Sang Woo Kim
  • for: Improving the accuracy and stability of training monocular depth estimation models.
  • methods: An in-depth analysis of NaN loss during training of monocular depth estimation networks, identifying three vulnerabilities: unstable gradients caused by the square root loss, numerical instability of the log-sigmoid function, and incorrect computations in certain variance implementations.
  • results: Following the proposed guidelines improves both the optimization stability and the performance of depth estimation networks.
    Abstract The latest advances in deep learning have facilitated the development of highly accurate monocular depth estimation models. However, when training a monocular depth estimation network, practitioners and researchers have observed a not-a-number (NaN) loss, which disrupts gradient descent optimization. Although several practitioners have reported the stochastic and mysterious occurrence of NaN loss that hampers training, its root cause is not discussed in the literature. This study conducted an in-depth analysis of NaN loss during training a monocular depth estimation network and identified three types of vulnerabilities that cause NaN loss: 1) the use of square root loss, which leads to an unstable gradient; 2) the log-sigmoid function, which exhibits numerical stability issues; and 3) certain variance implementations, which yield incorrect computations. Furthermore, for each vulnerability, the occurrence of NaN loss was demonstrated and practical guidelines to prevent NaN loss were presented. Experiments showed that both optimization stability and performance on monocular depth estimation could be improved by following our guidelines.
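The first vulnerability, the unstable gradient of a square root loss, is easy to reproduce and to guard against. A minimal PyTorch sketch follows; the epsilon guard is one plausible remedy and is not claimed to be the paper's exact guideline.

```python
import torch

def sqrt_loss_unstable(pred, target):
    # d/dx sqrt(x) = 1 / (2 * sqrt(x)) diverges as x -> 0, so any pixel with
    # (near-)zero error produces an inf gradient that turns the loss to NaN.
    return torch.sqrt(torch.abs(pred - target)).mean()

def sqrt_loss_guarded(pred, target, eps=1e-6):
    # Keeping the argument away from zero bounds the gradient.
    return torch.sqrt(torch.abs(pred - target) + eps).mean()
```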

FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer

  • paper_url: http://arxiv.org/abs/2311.03912
  • repo_url: https://github.com/shadowpa0327/flora
  • paper_authors: Chi-Chih Chang, Yuan-Yao Sung, Shixing Yu, Ning-Chi Huang, Diana Marculescu, Kai-Chiang Wu
  • for: This paper proposes an end-to-end automatic framework for searching fine-grained low-rank configurations in Vision Transformers, to optimize them for computer vision tasks.
  • methods: Exploiting the similarity and alignment between rank selection and One-Shot NAS, the two are unified into a single framework; a low-rank aware candidate filtering strategy identifies and eliminates underperforming candidates, alleviating undertraining and interference among subnetworks.
  • results: The framework automatically generates finer-grained low-rank configurations that yield about 33% extra FLOPs reduction over a uniform configuration; FLORA-DeiT-B/FLORA-Swin-B save up to 55%/42% FLOPs, and integration with mainstream compression techniques or compact structures brings even larger FLOPs savings.
    Abstract Vision Transformers (ViT) have recently demonstrated success across a myriad of computer vision tasks. However, their elevated computational demands pose significant challenges for real-world deployment. While low-rank approximation stands out as a renowned method to reduce computational loads, efficiently automating the target rank selection in ViT remains a challenge. Drawing from the notable similarity and alignment between the processes of rank selection and One-Shot NAS, we introduce FLORA, an end-to-end automatic framework based on NAS. To overcome the supernet design challenge posed by the vast search space, FLORA employs a low-rank aware candidate filtering strategy. This method adeptly identifies and eliminates underperforming candidates, effectively alleviating potential undertraining and interference among subnetworks. To further enhance the quality of low-rank supernets, we design a low-rank specific training paradigm. First, we propose weight inheritance to construct the supernet and enable gradient sharing among low-rank modules. Second, we adopt low-rank aware sampling to strategically allocate training resources, taking into account inherited information from pre-trained models. Empirical results underscore FLORA's efficacy. With our method, a more fine-grained rank configuration can be generated automatically and yield up to 33% extra FLOPs reduction compared to a simple uniform configuration. More specifically, FLORA-DeiT-B/FLORA-Swin-B can save up to 55%/42% FLOPs almost without performance degradation. Importantly, FLORA boasts both versatility and orthogonality, offering an extra 21%-26% FLOPs reduction when integrated with leading compression techniques or compact hybrid structures. Our code is publicly available at https://github.com/shadowpa0327/FLORA.
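For context, the low-rank approximation that FLORA searches ranks for can be sketched as a truncated-SVD factorization of a linear layer. This is a generic illustration assuming a plain `nn.Linear`; FLORA's supernet, candidate filtering, and weight inheritance are not shown.

```python
import torch

def low_rank_factorize(linear: torch.nn.Linear, rank: int):
    """Replace a dense linear layer with two smaller ones via truncated SVD."""
    W = linear.weight.data                       # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                   # (out_features, rank)
    B = Vh[:rank, :]                             # (rank, in_features)
    first = torch.nn.Linear(W.shape[1], rank, bias=False)
    second = torch.nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    first.weight.data = B                        # x -> Bx
    second.weight.data = A                       # Bx -> A(Bx) ~ Wx
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return torch.nn.Sequential(first, second)
```

FLOPs drop whenever `rank < out_features * in_features / (out_features + in_features)`, which is why per-layer rank selection matters.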

RobustMat: Neural Diffusion for Street Landmark Patch Matching under Challenging Environments

  • paper_url: http://arxiv.org/abs/2311.03904
  • repo_url: https://github.com/ai-it-avs/robustmat
  • paper_authors: Rui She, Qiyu Kang, Sijie Wang, Yuan-Rui Yang, Kai Zhao, Yang Song, Wee Peng Tay
  • for: The paper is written for the task of matching landmark patches in street scenes for autonomous vehicles (AVs), under challenging driving environments caused by changing seasons, weather, and illumination.
  • methods: The paper proposes an approach called RobustMat, which uses a convolutional neural ODE diffusion module to learn the feature representation for landmark patches, and a graph neural PDE diffusion module to aggregate information from neighboring landmark patches.
  • results: The paper demonstrates state-of-the-art matching results under environmental perturbations on several street scene datasets.
    Abstract For autonomous vehicles (AVs), visual perception techniques based on sensors like cameras play crucial roles in information acquisition and processing. In various computer perception tasks for AVs, it may be helpful to match landmark patches taken by an onboard camera with other landmark patches captured at a different time or saved in a street scene image database. To perform matching under challenging driving environments caused by changing seasons, weather, and illumination, we utilize the spatial neighborhood information of each patch. We propose an approach, named RobustMat, which derives its robustness to perturbations from neural differential equations. A convolutional neural ODE diffusion module is used to learn the feature representation for the landmark patches. A graph neural PDE diffusion module then aggregates information from neighboring landmark patches in the street scene. Finally, feature similarity learning outputs the final matching score. Our approach is evaluated on several street scene datasets and demonstrated to achieve state-of-the-art matching results under environmental perturbations.
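The paper's graph neural PDE module aggregates information from neighboring landmark patches. As a rough intuition only, a single explicit-Euler step of graph heat diffusion over patch features looks like the following; RobustMat's actual learned ODE/PDE dynamics are more elaborate.

```python
import torch

def graph_diffusion_step(X, A, tau=0.1):
    """One explicit-Euler step of feature diffusion dX/dt = -L X over a patch graph.

    X: (N, D) features of N landmark patches
    A: (N, N) adjacency between spatially neighboring patches
    """
    deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)
    L = torch.eye(A.shape[0]) - A / deg    # random-walk graph Laplacian
    return X - tau * (L @ X)               # each patch moves toward its neighbors
```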

MeVGAN: GAN-based Plugin Model for Video Generation with Applications in Colonoscopy

  • paper_url: http://arxiv.org/abs/2311.03884
  • repo_url: None
  • paper_authors: Łukasz Struski, Tomasz Urbańczyk, Krzysztof Bucki, Bartłomiej Cupiał, Aneta Kaczyńska, Przemysław Spurek, Jacek Tabor
  • for: This paper proposes a memory-efficient video generation model that can produce high-resolution videos, with applications in medicine.
  • methods: The model uses a plugin-type architecture: a pre-trained 2D image GAN is kept fixed, and only a simple neural network is added to construct trajectories in the noise space, so that forwarding a trajectory through the GAN produces a realistic video.
  • results: The model generates good-quality synthetic colonoscopy videos that could be used in virtual simulators to help young colonoscopists learn this important medical procedure.
    Abstract Video generation is important, especially in medicine, as much data in this domain is given in video form. However, video generation of high-resolution data is a very demanding task for generative models, due to the large memory requirements. In this paper, we propose Memory Efficient Video GAN (MeVGAN), a Generative Adversarial Network (GAN) which uses a plugin-type architecture. We use a pre-trained 2D-image GAN and only add a simple neural network to construct respective trajectories in the noise space, so that a trajectory forwarded through the GAN model constructs a real-life video. We apply MeVGAN to the task of generating colonoscopy videos. Colonoscopy is an important medical procedure, especially beneficial in screening and managing colorectal cancer. However, because colonoscopy is difficult and time-consuming to learn, colonoscopy simulators are widely used in educating young colonoscopists. We show that MeVGAN can produce good quality synthetic colonoscopy videos, which can potentially be used in virtual simulators.
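The plugin idea, keeping a pre-trained 2D image GAN frozen and learning only a trajectory through its noise space, can be sketched as a small MLP producing per-frame latents. The module name, layer sizes, and time encoding below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NoiseTrajectory(nn.Module):
    # Small MLP mapping (initial latent, normalized frame time) to per-frame latents.
    def __init__(self, z_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, z0, num_frames):
        t = torch.linspace(0, 1, num_frames).unsqueeze(1)     # (T, 1) frame times
        z = z0.unsqueeze(0).expand(num_frames, -1)            # (T, z_dim)
        return z + self.net(torch.cat([z, t], dim=1))         # per-frame latents

# Usage with a frozen, pre-trained image generator G (assumed given):
# latents = NoiseTrajectory()(z0, num_frames=16)
# video = torch.stack([G(z.unsqueeze(0)) for z in latents], dim=1)
```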

A Comparative Study of Knowledge Transfer Methods for Misaligned Urban Building Labels

  • paper_url: http://arxiv.org/abs/2311.03867
  • repo_url: None
  • paper_authors: Bipul Neupane, Jagannath Aryal, Abbas Rajabifard
  • for: Addressing the misalignment issue in Earth observation (EO) images and building labels to train accurate convolutional neural networks (CNNs) for semantic segmentation of building footprints.
  • methods: Comparative study of three Teacher-Student knowledge transfer methods: supervised domain adaptation (SDA), knowledge distillation (KD), and deep mutual learning (DML).
  • results: SDA is the most effective method to address the misalignment problem, while KD and DML can efficiently compress network size without significant loss in performance. The 158 experiments and datasets developed in this study will be valuable for minimising misaligned labels.
    Abstract Misalignment in Earth observation (EO) images and building labels impacts the training of accurate convolutional neural networks (CNNs) for semantic segmentation of building footprints. Recently, three Teacher-Student knowledge transfer methods have been introduced to address this issue: supervised domain adaptation (SDA), knowledge distillation (KD), and deep mutual learning (DML). However, these methods have scarcely been studied across different urban building types (low-rise, mid-rise, high-rise, and skyscrapers), where misalignment increases with building height and spatial resolution. In this study, we present a workflow for the systematic comparative study of the three methods. The workflow first identifies the best hyperparameters (those with the highest evaluation scores), lightweight CNNs for the Student (among 43 CNNs from Computer Vision), and encoder-decoder networks (EDNs) for both Teachers and Students. Secondly, three building footprint datasets are developed to train and evaluate the identified Teachers and Students in the three transfer methods. The results show that U-Net with VGG19 (U-VGG19) is the best Teacher, and U-EfficientNetv2B3 and U-EfficientNet-lite0 are among the best Students. With these Teacher-Student pairs, SDA could yield up to 0.943, 0.868, 0.912, and 0.697 F1 scores on the low-rise, mid-rise, high-rise, and skyscraper classes respectively. KD and DML provide model compression of up to 82%, despite marginal loss in performance. This new comparison concludes that SDA is the most effective method to address the misalignment problem, while KD and DML can efficiently compress network size without significant loss in performance. The 158 experiments and datasets developed in this study will be valuable to minimise the misaligned labels.
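Of the three methods compared, knowledge distillation is the simplest to sketch. A standard soft-target KD loss (applied per pixel when the logits are segmentation maps of shape (N, C, H, W)) is shown below as a generic reference, not as the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target KD: match the teacher's softened distribution plus the hard labels.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for temperature T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```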

SCONE-GAN: Semantic Contrastive learning-based Generative Adversarial Network for an end-to-end image translation

  • paper_url: http://arxiv.org/abs/2311.03866
  • repo_url: None
  • paper_authors: Iman Abbasnejad, Fabio Zambetta, Flora Salim, Timothy Wiley, Jeffrey Chan, Russell Gallagher, Ehsan Abbasnejad
  • for: This paper explores end-to-end image-to-image translation for generating more realistic and diverse scenery images.
  • methods: A GAN-based translation approach that uses graph convolutional networks to model object dependencies, preserving image structure and semantics while transferring images to the target domain; a style reference image is introduced to increase the diversity of the generated images.
  • results: The method is validated on four datasets; both qualitative and quantitative results show that it generates more realistic and diverse scenery images.
    Abstract SCONE-GAN presents an end-to-end image translation approach, which is shown to be effective for learning to generate realistic and diverse scenery images. Most current image-to-image translation approaches are devised as two mappings: a translation from the source to the target domain and another to represent its inverse. While successful in many applications, these approaches may produce trivial solutions with limited diversity, because they learn the more frequent associations rather than the scene structures. To mitigate the problem, we propose SCONE-GAN, which utilises graph convolutional networks to learn object dependencies, maintain the image structure and preserve its semantics while transferring images into the target domain. For more realistic and diverse image generation we introduce a style reference image. We train the model to maximize the mutual information between the style image and the output. The proposed method explicitly maximizes the mutual information between the related patches, thus encouraging the generator to produce more diverse images. We validate the proposed algorithm for image-to-image translation and stylizing outdoor images. Both qualitative and quantitative results demonstrate the effectiveness of our approach on four datasets.
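SCONE-GAN maximizes mutual information between related patches of the style image and the output. A common estimator for such objectives is a patch-level InfoNCE loss; the sketch below is one plausible instantiation under assumed patch-feature inputs, not necessarily the paper's estimator.

```python
import torch
import torch.nn.functional as F

def patch_infonce(feat_a, feat_b, temperature=0.07):
    # InfoNCE over corresponding patch features; minimizing this loss maximizes a
    # lower bound on the mutual information between matched patches.
    a = F.normalize(feat_a, dim=1)                 # (N, D) patches from the style image
    b = F.normalize(feat_b, dim=1)                 # (N, D) patches from the output
    logits = a @ b.t() / temperature               # all-pairs similarities
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)        # i-th patch must match i-th patch
```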

GC-VTON: Predicting Globally Consistent and Occlusion Aware Local Flows with Neighborhood Integrity Preservation for Virtual Try-on

  • paper_url: http://arxiv.org/abs/2311.04932
  • repo_url: None
  • paper_authors: Hamza Rawal, Muhammad Junaid Ahmad, Farooq Zaman
  • for: This work aims to improve flow-based garment warping in image-based virtual try-on networks.
  • methods: The global boundary alignment and local texture preservation tasks are disentangled into GlobalNet and LocalNet modules linked by a consistency loss; a body-part visibility mask is predicted to mask out occluded regions of the warped garment, and a neighborhood integrity preserving regularization (NIPR) loss penalizes flows in regions where texture is squeezed or stretched.
  • results: Experiments on a widely used virtual try-on dataset show strong performance compared to current state-of-the-art methods.
    Abstract Flow-based garment warping is an integral part of image-based virtual try-on networks. However, optimizing a single flow-predicting network for simultaneous global boundary alignment and local texture preservation results in sub-optimal flow fields. Moreover, dense flows are inherently not suited to handle intricate conditions like garment occlusion by body parts or by other garments. Forcing flows to handle the above issues results in various distortions like texture squeezing and stretching. In this work, we propose a novel approach where we disentangle the global boundary alignment and local texture preserving tasks via our GlobalNet and LocalNet modules. A consistency loss is then employed between the two modules, which harmonizes the local flows with the global boundary alignment. Additionally, we explicitly handle occlusions by predicting a body-parts visibility mask, which is used to mask out the occluded regions in the warped garment. The masking prevents the LocalNet from predicting flows that distort texture to compensate for occlusions. We also introduce a novel regularization loss (NIPR) that defines a criterion to identify the regions in the warped garment where texture integrity is violated (squeezed or stretched). NIPR subsequently penalizes the flow in those regions to ensure regular and coherent warps that preserve the texture in local neighborhoods. Evaluation on a widely used virtual try-on dataset demonstrates the strong performance of our network compared to current SOTA methods.
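The effect of the visibility mask can be illustrated with a masked reconstruction loss: occluded pixels are simply excluded, so the flow is never pushed to distort texture to explain them. A minimal sketch with assumed tensor shapes:

```python
import torch

def masked_warp_loss(warped, target, visibility_mask):
    # warped/target: (N, 3, H, W); visibility_mask: (N, 1, H, W), zero where the
    # garment is occluded by body parts. Masked-out pixels contribute no gradient.
    diff = (warped - target).abs() * visibility_mask
    return diff.sum() / visibility_mask.sum().clamp(min=1.0)
```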

Multi-view Information Integration and Propagation for Occluded Person Re-identification

  • paper_url: http://arxiv.org/abs/2311.03828
  • repo_url: https://github.com/nengdong96/mviip
  • paper_authors: Neng Dong, Shuanglin Yan, Hao Tang, Jinhui Tang, Liyan Zhang
  • for: Improving the accuracy and robustness of occluded person re-identification by exploiting multi-view images to reduce the impact of occlusion noise.
  • methods: A Multi-view Information Integration and Propagation (MVI$^{2}$P) framework that integrates feature maps from multiple views, using a CAMs-aware localization module to exclude occlusion noise, a probability-aware quantification module to emphasize highly reliable information, and an information propagation mechanism that distills knowledge into single-image representations.
  • results: Extensive experiments and analyses demonstrate the effectiveness and superiority of MVI$^{2}$P on the occluded person re-identification task.
    Abstract Occluded person re-identification (re-ID) presents a challenging task due to occlusion perturbations. Although great efforts have been made to prevent the model from being disturbed by occlusion noise, most current solutions only capture information from a single image, disregarding the rich complementary information available in multiple images depicting the same pedestrian. In this paper, we propose a novel framework called Multi-view Information Integration and Propagation (MVI$^{2}$P). Specifically, realizing the potential of multi-view images in effectively characterizing the occluded target pedestrian, we integrate their feature maps to create a comprehensive representation. During this process, to avoid introducing occlusion noise, we develop a CAMs-aware Localization module that selectively integrates information contributing to the identification. Additionally, considering the divergence in the discriminative nature of different images, we design a probability-aware Quantification module to emphatically integrate highly reliable information. Moreover, as multiple images with the same identity are not accessible in the testing stage, we devise an Information Propagation (IP) mechanism to distill knowledge from the comprehensive representation to that of a single occluded image. Extensive experiments and analyses have unequivocally demonstrated the effectiveness and superiority of the proposed MVI$^{2}$P. The code will be released at \url{https://github.com/nengdong96/MVIIP}.
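One way to picture the CAMs-aware integration is a per-location softmax over views, so regions with stronger activation (likely unoccluded) dominate the fused representation. This is an illustrative guess at the mechanism, not the paper's module.

```python
import torch

def cam_weighted_fusion(feats, cams):
    # feats: (V, C, H, W) feature maps from V views of the same pedestrian;
    # cams:  (V, 1, H, W) class activation maps. At each spatial location, views
    # with higher CAM response contribute more, suppressing occluded regions.
    w = torch.softmax(cams.flatten(1), dim=0).view_as(cams)
    return (feats * w).sum(dim=0)   # comprehensive representation, (C, H, W)
```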

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

  • paper_url: http://arxiv.org/abs/2311.03799
  • repo_url: https://github.com/caoyichao/unihoi
  • paper_authors: Yichao Cao, Qingfei Tang, Xiu Su, Chen Song, Shan You, Xiaobo Lu, Chang Xu
  • for: This work targets human-object interaction (HOI) detection in an open-world setting, leveraging vision-language (VL) foundation models and large language models (LLMs) to handle complex interactions.
  • methods: The method combines VL foundation models with LLMs and proposes HO prompt-based learning, including an HO Prompt-guided Decoder (HOPD) that associates high-level relation representations in the foundation model with different HO pairs; an LLM (GPT) interprets interactions to provide richer linguistic understanding.
  • results: UniHOI substantially surpasses existing methods under both supervised and zero-shot settings, and supports open-category interaction recognition from either of two input types: interaction phrases or interpretive sentences.
    Abstract Human-object interaction (HOI) detection aims to comprehend the intricate relationships between humans and objects, predicting ⟨human, action, object⟩ triplets, and serving as the foundation for numerous computer vision tasks. The complexity and diversity of human-object interactions in the real world, however, pose significant challenges for both annotation and recognition, particularly in recognizing interactions within an open world context. This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). The proposed method is dubbed UniHOI. We conduct a deep analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for high-level relation extraction aimed at VL foundation models, which we call HO prompt-based learning. Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. Furthermore, we utilize an LLM (i.e., GPT) for interaction interpretation, generating a richer linguistic understanding of complex HOIs. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence. Our efficient architecture design and learning methods effectively unleash the potential of the VL foundation models and LLMs, allowing UniHOI to surpass all existing methods by a substantial margin, under both supervised and zero-shot settings. The code and pre-trained weights are available at: \url{https://github.com/Caoyichao/UniHOI}.
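Open-category recognition with a VL foundation model typically reduces to similarity scoring between a visual embedding and embedded text candidates. The sketch below shows that generic pattern; UniHOI's HOPD decoder and prompt learning sit on top of it, and the function is an assumed illustration rather than the paper's code.

```python
import torch.nn.functional as F

def score_interactions(ho_pair_emb, phrase_embs):
    # Cosine similarity between a human-object pair's visual embedding and a set
    # of candidate interaction phrases embedded by a VL model's text encoder.
    v = F.normalize(ho_pair_emb, dim=-1)    # (D,) visual embedding of one HO pair
    t = F.normalize(phrase_embs, dim=-1)    # (K, D) open-vocabulary phrase embeddings
    return (t @ v).softmax(dim=-1)          # distribution over K interaction labels
```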

Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization

  • paper_url: http://arxiv.org/abs/2311.03785
  • repo_url: None
  • paper_authors: Cam-Van Thi Nguyen, Ngoc-Hoa Thi Nguyen, Duc-Trong Le, Quang-Thuy Ha
  • for: Improving feature representation learning in multimodal learning so that the interactions between modalities are better captured.
  • methods: Self-MI trains in a self-supervised fashion, using Contrastive Predictive Coding (CPC) as an auxiliary technique to maximize the mutual information between unimodal inputs and the multimodal fusion result; a label generation module, $ULG_{MI}$, creates meaningful unimodal labels in a self-supervised manner.
  • results: Experiments on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and SIMS) show that Self-MI significantly improves the multimodal fusion task.
    Abstract Multimodal representation learning poses significant challenges in capturing informative and distinct features from multiple modalities. Existing methods often struggle to exploit the unique characteristics of each modality due to unified multimodal annotations. In this study, we propose Self-MI in the self-supervised learning fashion, which also leverage Contrastive Predictive Coding (CPC) as an auxiliary technique to maximize the Mutual Information (MI) between unimodal input pairs and the multimodal fusion result with unimodal inputs. Moreover, we design a label generation module, $ULG_{MI}$ for short, that enables us to create meaningful and informative labels for each modality in a self-supervised manner. By maximizing the Mutual Information, we encourage better alignment between the multimodal fusion and the individual modalities, facilitating improved multimodal fusion. Extensive experiments on three benchmark datasets including CMU-MOSI, CMU-MOSEI, and SIMS, demonstrate the effectiveness of Self-MI in enhancing the multimodal fusion task.

UP-NeRF: Unconstrained Pose-Prior-Free Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.03784
  • repo_url: None
  • paper_authors: Injae Kim, Minhyuk Choi, Hyunwoo J. Kim
  • for: Optimizing NeRF on unconstrained image collections without camera pose priors.
  • methods: Surrogate tasks optimize color-insensitive feature fields, and a separate module blocks the influence of transient occluders on pose estimation; a candidate head and transient-aware depth supervision further improve pose estimation and robustness to occluders.
  • results: The method outperforms baselines including BARF and its variants on Phototourism, a challenging internet photo collection.
    Abstract Neural Radiance Field (NeRF) has enabled novel view synthesis with high fidelity given images and camera poses. Subsequent works even succeeded in eliminating the necessity of pose priors by jointly optimizing NeRF and camera pose. However, these works are limited to relatively simple settings such as photometrically consistent and occluder-free image collections or a sequence of images from a video. They therefore have difficulty handling unconstrained images with varying illumination and transient occluders. In this paper, we propose $\textbf{UP-NeRF}$ ($\textbf{U}$nconstrained $\textbf{P}$ose-prior-free $\textbf{Ne}$ural $\textbf{R}$adiance $\textbf{F}$ields) to optimize NeRF with unconstrained image collections without a camera pose prior. We tackle these challenges with surrogate tasks that optimize color-insensitive feature fields and a separate module for transient occluders to block their influence on pose estimation. In addition, we introduce a candidate head to enable more robust pose estimation and transient-aware depth supervision to minimize the effect of an incorrect prior. Our experiments verify the superior performance of our method compared to the baselines, including BARF and its variants, on a challenging internet photo collection, the $\textit{Phototourism}$ dataset.
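For reference, every NeRF variant, UP-NeRF included, renders through the same quadrature of the volume rendering integral. A minimal implementation of that standard step (not of UP-NeRF's feature fields or transient handling):

```python
import torch

def volume_render(rgb, sigma, deltas):
    # Standard NeRF quadrature: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    # with transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j).
    # rgb: (R, S, 3), sigma/deltas: (R, S) for R rays with S samples each.
    alpha = 1.0 - torch.exp(-sigma * deltas)
    T = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1,
    )[:, :-1]
    weights = alpha * T
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)    # (R, 3) rendered colors
```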

CapST: An Enhanced and Lightweight Method for Deepfake Video Classification

  • paper_url: http://arxiv.org/abs/2311.03782
  • repo_url: None
  • paper_authors: Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang, Gaddisa Olani Ganfure, Sarwar Khan, Sahibzada Adil Shahzad
  • for: Classifying deepfake videos by their source generator, a capability relevant to domains such as politics, entertainment, and security.
  • methods: A new deepfake video classification model that uses part of VGG19bn as a backbone and combines a Capsule Network with a spatial-temporal attention mechanism; a video-level fusion technique based on temporal attention aggregates frame-level features.
  • results: On the DFDM deepfake video benchmark, the model improves accuracy by up to 4% over baseline models while demanding fewer computational resources.
    Abstract The proliferation of deepfake videos, synthetic media produced through advanced Artificial Intelligence techniques, has raised significant concerns across various sectors, encompassing realms such as politics, entertainment, and security. In response, this research introduces an innovative and streamlined model designed to adeptly classify deepfake videos generated by five distinct encoders. Our approach not only achieves state-of-the-art performance but also optimizes computational resources. At its core, our solution employs part of a VGG19bn as a backbone to efficiently extract features, a strategy proven effective in image-related tasks. We integrate a Capsule Network coupled with a Spatial-Temporal attention mechanism to bolster the model's classification capabilities while conserving resources. This combination captures intricate hierarchies among features, facilitating robust identification of deepfake attributes. Delving into the intricacies of our innovation, we introduce a video-level fusion technique that capitalizes on temporal attention mechanisms. This mechanism handles concatenated feature vectors, exploiting the intrinsic temporal dependencies embedded within deepfake videos. By aggregating insights across frames, our model gains a holistic comprehension of video content, resulting in more precise predictions. Experimental results on an extensive benchmark dataset of deepfake videos called DFDM showcase the efficacy of our proposed method. Notably, our approach achieves up to a 4 percent improvement in accurately categorizing deepfake videos compared to baseline models, all while demanding fewer computational resources.
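Video-level fusion with temporal attention can be reduced to learning one weight per frame and taking a weighted sum. The module below is a deliberately minimal version of that idea; CapST's actual fusion operates on concatenated feature vectors and is richer.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    # Learns per-frame importance scores and aggregates frame features
    # into a single video-level feature.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats):                         # (T, dim)
        w = torch.softmax(self.score(frame_feats), dim=0)   # (T, 1) attention weights
        return (w * frame_feats).sum(dim=0)                 # (dim,) video feature
```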

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

  • paper_url: http://arxiv.org/abs/2311.03774
  • repo_url: None
  • paper_authors: Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan
  • for: This paper proposes an online few-shot method that improves the few-shot capabilities of the CLIP model.
  • methods: Meta-Adapter, a lightweight residual-style adapter that refines CLIP features guided by the few-shot samples in an online manner, avoiding offline fine-tuning.
  • results: With only a few training samples, the method enables effective few-shot learning and generalizes to unseen data or tasks without additional fine-tuning, achieving competitive performance with high efficiency.
    Abstract The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples, resulting in longer inference time and the risk of over-fitting in certain domains. To tackle these challenges, we propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner. With a few training samples, our method can enable effective few-shot learning capabilities and generalize to unseen data or tasks without additional fine-tuning, achieving competitive performance and high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6\% on eight image classification datasets with higher inference speed. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements in open-vocabulary object detection and segmentation tasks.
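A residual-style adapter is structurally simple: a small bottleneck MLP whose output is added back to the frozen CLIP feature. The sketch below shows that generic shape with assumed dimensions; Meta-Adapter's online, few-shot-guided refinement is not captured here.

```python
import torch.nn as nn

class ResidualAdapter(nn.Module):
    # Lightweight bottleneck that refines frozen CLIP features: f' = f + MLP(f).
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, clip_feat):
        return clip_feat + self.net(clip_feat)   # residual keeps the CLIP prior intact
```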

Lightweight Portrait Matting via Regional Attention and Refinement

  • paper_url: http://arxiv.org/abs/2311.03770
  • repo_url: None
  • paper_authors: Yatao Zhong, Ilya Zharkov
  • for: High-resolution portrait matting without auxiliary inputs such as trimaps or background captures.
  • methods: A lightweight two-stage model that uses a Vision Transformer (ViT) as the backbone of the low-resolution network, plus a refinement network with a novel cross-region attention (CRA) module that propagates contextual information across neighboring regions.
  • results: The method outperforms other baselines on three benchmark datasets while running in real time for HD video and near real time for 4K, with much lower computational cost.
    Abstract We present a lightweight model for high resolution portrait matting. The model does not use any auxiliary inputs such as trimaps or background captures and achieves real time performance for HD videos and near real time for 4K. Our model is built upon a two-stage framework with a low resolution network for coarse alpha estimation followed by a refinement network for local region improvement. However, a naive implementation of the two-stage model suffers from poor matting quality when no auxiliary inputs are utilized. We address the performance gap by leveraging the vision transformer (ViT) as the backbone of the low resolution network, motivated by the observation that the tokenization step of ViT can reduce spatial resolution while retaining as much pixel information as possible. To inform local regions of the context, we propose a novel cross region attention (CRA) module in the refinement network to propagate the contextual information across the neighboring regions. We demonstrate that our method achieves superior results and outperforms other baselines on three benchmark datasets while using only $1/20$ of the FLOPs of the existing state-of-the-art model.

Image change detection with only a few samples

  • paper_url: http://arxiv.org/abs/2311.03762
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Ke Liu, Zhaoyi Song, Haoyue Bai
  • for: This work targets image change detection with only a small number of samples, addressing the challenge of scarce annotations.
  • methods: Simple image processing methods generate synthetic but informative training data, and an early fusion network based on object detection is designed, which can outperform a siamese neural network and improve generalization of change detection models.
  • results: Experiments show that training on the synthetic data yields strong generalization, and fine-tuning with only tens of real samples achieves excellent results.
    Abstract This paper considers image change detection with only a small number of samples, a significant problem given the scarcity of available annotations. A major impediment to the image change detection task is the lack of large annotated datasets covering a wide variety of scenes. Change detection models trained on insufficient datasets have shown poor generalization capability. To address the poor generalization issue, we propose using simple image processing methods for generating synthetic but informative datasets, and design an early fusion network based on object detection which can outperform a siamese neural network. Our key insight is that the synthetic data enables the trained model to have good generalization ability for various scenarios. We compare the model trained on the synthetic data with that on the real-world data captured from a challenging dataset, CDNet, using six different test sets. The results demonstrate that the synthetic data is informative enough to achieve higher generalization ability than the insufficient real-world data. Besides, the experiments show that fine-tuning the model trained on the synthetic data with only a few (often tens of) samples achieves excellent results.
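Early fusion, as opposed to a siamese design, concatenates the bi-temporal images along the channel axis and processes them with a single backbone. A minimal sketch with a hypothetical 1x1 stem so a standard RGB backbone can be reused; the paper's actual detection-based network differs.

```python
import torch
import torch.nn as nn

class EarlyFusionChangeNet(nn.Module):
    # Early fusion: stack both time steps channel-wise, then run ONE backbone,
    # instead of extracting features with two siamese streams and comparing them.
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.stem = nn.Conv2d(6, 3, kernel_size=1)   # two RGB images -> 3 channels
        self.backbone = backbone

    def forward(self, img_t0, img_t1):
        x = torch.cat([img_t0, img_t1], dim=1)       # (N, 6, H, W)
        return self.backbone(self.stem(x))
```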

Multiclass Segmentation using Teeth Attention Modules for Dental X-ray Images

  • paper_url: http://arxiv.org/abs/2311.03749
  • repo_url: None
  • paper_authors: Afnan Ghafoor, Seong-Yong Moon, Bumshik Lee
  • for: This paper proposes a multiclass teeth segmentation architecture for dental X-ray images, addressing the inaccurate and unreliable results of existing methods on the complex and varying morphology of teeth.
  • methods: The method integrates an M-Net-like structure with Swin Transformers and a Teeth Attention Block (TAB), whose attention mechanism specifically highlights the complex structures of teeth; a multiscale supervision strategy and a squared Dice loss address class imbalance.
  • results: Experimental results show that the method segments teeth accurately and outperforms state-of-the-art approaches on multiple benchmark dental image datasets.
    Abstract This paper proposes a cutting-edge multiclass teeth segmentation architecture that integrates an M-Net-like structure with Swin Transformers and a novel component named Teeth Attention Block (TAB). Although teeth segmentation in dental panoramic images is essential for dental disease diagnosis, existing methods often produce inaccurate and unreliable segmentation outcomes due to the complex and varying morphology of teeth. We propose a novel teeth segmentation model incorporating an M-Net-like structure with Swin Transformers and TAB. The proposed TAB utilizes a unique attention mechanism that focuses specifically on the complex structures of teeth. The attention mechanism in TAB precisely highlights key elements of teeth features in panoramic images, resulting in more accurate segmentation outcomes. The proposed architecture effectively captures local and global contextual information, accurately defining each tooth and its surrounding structures. Furthermore, we employ a multiscale supervision strategy, which leverages the left and right legs of the U-Net structure, boosting the performance of the segmentation with enhanced feature representation. The squared Dice loss is utilized to tackle the class imbalance issue, ensuring accurate segmentation across all classes. The proposed method was validated on a panoramic teeth X-ray dataset taken in real-world dental diagnosis. The experimental results demonstrate the efficacy of our proposed architecture for tooth segmentation on multiple benchmark dental image datasets, outperforming existing state-of-the-art methods in objective metrics and visual examinations. This study has the potential to significantly enhance dental image analysis and contribute to advances in dental applications.
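The squared Dice loss used for class imbalance has a compact form: the denominator terms are squared. A generic multiclass version, assuming softmax probabilities and one-hot targets of shape (N, C, H, W); this is a standard formulation, not necessarily the paper's exact variant.

```python
import torch

def squared_dice_loss(probs, target_onehot, eps=1e-6):
    # Per-class Dice with squared denominator terms, averaged over classes.
    inter = (probs * target_onehot).sum(dim=(0, 2, 3))                      # (C,)
    denom = (probs ** 2).sum(dim=(0, 2, 3)) \
          + (target_onehot ** 2).sum(dim=(0, 2, 3))                        # (C,)
    return 1.0 - (2.0 * inter / (denom + eps)).mean()
```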

SBCFormer: Lightweight Network Capable of Full-size ImageNet Classification at 1 FPS on Single Board Computers

  • paper_url: http://arxiv.org/abs/2311.03747
  • repo_url: https://github.com/xyonglu/sbcformer
  • paper_authors: Xiangyong Lu, Masanori Suganuma, Takayuki Okatani
  • for: This paper targets deep learning deployment on single board computers (SBCs), which are common in fields such as smart agriculture, fishery, and livestock management.
  • methods: SBCFormer, a CNN-ViT hybrid network that achieves high accuracy and fast computation on low-end CPUs.
  • results: Running on a Raspberry Pi 4 Model B, SBCFormer reaches an ImageNet-1K top-1 accuracy of around 80% at 1.0 frame/sec, a first for SBCs.
    Abstract Computer vision has become increasingly prevalent in solving real-world problems across diverse domains, including smart agriculture, fishery, and livestock management. These applications may not require processing many image frames per second, leading practitioners to use single board computers (SBCs). Although many lightweight networks have been developed for mobile/edge devices, they primarily target smartphones with more powerful processors rather than SBCs with low-end CPUs. This paper introduces a CNN-ViT hybrid network called SBCFormer, which achieves high accuracy and fast computation on such low-end CPUs. The hardware constraints of these CPUs make the Transformer's attention mechanism preferable to convolution. However, using attention on low-end CPUs presents a challenge: high-resolution internal feature maps demand excessive computational resources, but reducing their resolution results in the loss of local image details. SBCFormer introduces an architectural design to address this issue. As a result, SBCFormer achieves the highest trade-off between accuracy and speed on a Raspberry Pi 4 Model B with an ARM-Cortex A72 CPU. For the first time, it achieves an ImageNet-1K top-1 accuracy of around 80% at a speed of 1.0 frame/sec on the SBC. Code is available at https://github.com/xyongLu/SBCFormer.

Unsupervised Video Summarization

  • paper_url: http://arxiv.org/abs/2311.03745
  • repo_url: https://github.com/KaiyangZhou/pytorch-vsumm-reinforce
  • paper_authors: Hanqing Li, Diego Klabjan, Jean Utke
  • for: This work proposes a new, unsupervised method for automatic video summarization that borrows ideas from generative adversarial networks but eliminates the discriminator, giving the model a simple loss function and allowing its parts to be trained separately.
  • methods: An iterative training strategy alternately trains the reconstructor and the frame selector for multiple iterations; a trainable mask vector is added during summary generation in training and evaluation, and an unsupervised model selection algorithm is included.
  • results: Experiments on two public datasets (SumMe and TVSum) and four self-created datasets (Soccer, LoL, MLB, and ShortMLB) show the contribution of each component, especially the iterative training strategy, with advantages in performance, stability, and training efficiency over state-of-the-art methods.
    Abstract This paper introduces a new, unsupervised method for automatic video summarization using ideas from generative adversarial networks but eliminating the discriminator, having a simple loss function, and separating training of different parts of the model. An iterative training strategy is also applied by alternately training the reconstructor and the frame selector for multiple iterations. Furthermore, a trainable mask vector is added to the model in summary generation during training and evaluation. The method also includes an unsupervised model selection algorithm. Results from experiments on two public datasets (SumMe and TVSum) and four datasets we created (Soccer, LoL, MLB, and ShortMLB) demonstrate the effectiveness of each component on the model performance, particularly the iterative training strategy. Evaluations and comparisons with the state-of-the-art methods highlight the advantages of the proposed method in performance, stability, and training efficiency.
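The alternating strategy can be outlined as two phases per step: update the reconstructor with the selector frozen, then the selector with the reconstructor frozen. The losses and the sparsity weight below are placeholder assumptions; the paper's objective and trainable mask vector differ in detail.

```python
import torch.nn.functional as F

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad_(flag)

def alternating_step(reconstructor, selector, frames, opt_r, opt_s):
    """One alternating step; `frames` are per-frame features of shape (T, D)."""
    # Phase 1: update the reconstructor with the frame selector frozen.
    set_requires_grad(selector, False)
    set_requires_grad(reconstructor, True)
    mask = selector(frames).detach()             # (T, 1) frame-selection scores
    loss_r = F.mse_loss(reconstructor(frames * mask), frames)
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

    # Phase 2: update the frame selector with the reconstructor frozen.
    set_requires_grad(selector, True)
    set_requires_grad(reconstructor, False)
    mask = selector(frames)
    # Reconstruction quality plus a sparsity penalty that keeps the summary short.
    loss_s = F.mse_loss(reconstructor(frames * mask), frames) + 0.01 * mask.mean()
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```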

3DifFusionDet: Diffusion Model for 3D Object Detection with Robust LiDAR-Camera Fusion

  • paper_url: http://arxiv.org/abs/2311.03742
  • repo_url: None
  • paper_authors: Xinhao Xiang, Simon Dräger, Jiawei Zhang
  • for: Improving 3D object detection performance, particularly with LiDAR-Camera sensors.
  • methods: The 3DifFusionDet framework casts 3D object detection as a denoising diffusion process from noisy 3D boxes to target boxes: during training, ground-truth boxes are diffused and the model learns to reverse the noising process; during inference, the model progressively refines a set of randomly generated boxes into the final results.
  • results: Extensive experiments on the KITTI benchmark for real-world traffic object identification show that 3DifFusionDet performs favorably compared to earlier, well-respected detectors.
    Abstract Good 3D object detection performance from LiDAR-Camera sensors demands seamless feature alignment and fusion strategies. We propose the 3DifFusionDet framework in this paper, which structures 3D object detection as a denoising diffusion process from noisy 3D boxes to target boxes. In this framework, ground truth boxes diffuse in a random distribution for training, and the model learns to reverse the noising process. During inference, the model gradually refines a set of boxes that were generated at random into the final outcomes. Under the feature alignment strategy, the progressive refinement method can make a significant contribution to robust LiDAR-Camera fusion. The iterative refinement process also demonstrates great adaptability, allowing the framework to be applied to various detection circumstances where varying levels of accuracy and speed are required. Extensive experiments on KITTI, a benchmark for real-world traffic object identification, revealed that 3DifFusionDet is able to perform favorably in comparison to earlier, well-respected detectors.
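The training-time corruption is the standard forward diffusion process, applied to box parameters instead of pixels. A generic sketch under an assumed 7-parameter box encoding; the schedule and parameterization in the paper may differ.

```python
import torch

def diffuse_boxes(gt_boxes, t, alphas_cumprod):
    """Forward diffusion q(b_t | b_0) applied to 3D box parameters.

    gt_boxes:       (N, 7) boxes, e.g. (x, y, z, l, w, h, yaw)
    t:              (N,) sampled timesteps
    alphas_cumprod: (T,) cumulative noise schedule
    """
    a_bar = alphas_cumprod[t].view(-1, 1)
    noise = torch.randn_like(gt_boxes)
    noisy = a_bar.sqrt() * gt_boxes + (1.0 - a_bar).sqrt() * noise
    return noisy, noise   # the detector is trained to reverse this corruption
```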

ADFactory: Automated Data Factory for Optical Flow Tasks

  • paper_url: http://arxiv.org/abs/2311.04246
  • repo_url: None
  • paper_authors: Han Ling
  • for: Improving the real-world generalization ability of optical flow methods.
  • methods: Advanced NeRF technology reconstructs scenes from photo groups collected by monocular cameras; optical flow labels are computed between rendered camera-pose pairs, and the generated training data is filtered on criteria such as NeRF reconstruction quality, visual consistency of the flow labels, and reconstruction depth consistency.
  • results: The scheme surpasses existing self-supervised optical flow and monocular scene flow algorithms on KITTI and often surpasses the best supervised methods in real-world zero-point generalization evaluation.
    Abstract A major challenge faced by current optical flow methods is the difficulty in generalizing well to the real world, mainly due to the high production cost of datasets: there is currently no large real-world optical flow dataset. To address this challenge, we introduce a novel optical flow training framework that can efficiently train optical flow networks on the target data domain without manual annotation. Specifically, we use advanced NeRF technology to reconstruct scenes from photo groups collected by monocular cameras, and calculate the optical flow results between camera pose pairs from the rendered results. On this basis, we screen the generated training data from various aspects such as NeRF's reconstruction quality, visual consistency of optical flow labels, and reconstruction depth consistency. The filtered training data can be directly used for network supervision. Experimentally, the generalization ability of our scheme on KITTI surpasses existing self-supervised optical flow and monocular scene flow algorithms. Moreover, it can surpass most supervised methods in real-world zero-point generalization evaluation.
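Given a rendered depth map and the relative pose between two rendered views, the optical flow label follows from back-projection and re-projection. A pinhole-camera sketch of that label-generation step (the NeRF reconstruction and the paper's filtering criteria are not shown):

```python
import torch

def flow_from_depth_and_pose(depth, K, T_0to1):
    """Optical flow label from rendered depth and the relative pose between views.

    depth:  (H, W) depth of view 0; K: (3, 3) intrinsics; T_0to1: (4, 4) pose.
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).t()                 # normalized camera rays
    pts0 = rays * depth.unsqueeze(-1)                    # 3D points in camera 0
    pts1 = pts0 @ T_0to1[:3, :3].t() + T_0to1[:3, 3]     # transform into camera 1
    proj = pts1 @ K.t()                                  # re-project into view 1
    uv1 = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    return uv1 - pix[..., :2]                            # (H, W, 2) flow label
```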

DeepInspect: An AI-Powered Defect Detection for Manufacturing Industries

  • paper_url: http://arxiv.org/abs/2311.03725
  • repo_url: None
  • paper_authors: Arti Kumbhar, Amruta Chougule, Priya Lokhande, Saloni Navaghane, Aditi Burud, Saee Nimbalkar
  • for: This work aims to improve the precision and efficiency of defect detection in manufacturing processes.
  • methods: The system combines convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs): CNNs extract intricate details from product photographs, RNNs detect evolving errors, and GANs generate synthetic defect data to strengthen robustness and adaptability across defect scenarios.
  • results: The system detects manufacturing defects with high precision and maintains model robustness and adaptability across varying defect conditions.
    Abstract Utilizing Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs), our system introduces an innovative approach to defect detection in manufacturing. This technology excels in precisely identifying faults by extracting intricate details from product photographs, utilizing RNNs to detect evolving errors and generating synthetic defect data to bolster the model's robustness and adaptability across various defect scenarios. The project leverages a deep learning framework to automate real-time flaw detection in the manufacturing process. It harnesses extensive datasets of annotated images to discern complex defect patterns. This integrated system seamlessly fits into production workflows, thereby boosting efficiency and elevating product quality. As a result, it reduces waste and operational costs, ultimately enhancing market competitiveness.

Inertial Guided Uncertainty Estimation of Feature Correspondence in Visual-Inertial Odometry/SLAM

  • paper_url: http://arxiv.org/abs/2311.03722
  • repo_url: None
  • paper_authors: Seongwook Yoon, Jaehyun Kim, Sanghoon Sull
  • for: Improving the robustness and accuracy of visual-inertial odometry/SLAM for autonomous navigation and augmented reality systems.
  • methods: The uncertainty of feature correspondences between 2D projections of 3D points observed from different viewpoints is estimated with inertial guidance: a guidance distribution over possible correspondences is modeled and fitted to an energy function based on image error.
  • results: The approach yields uncertainty estimates robust to image degradation from motion blur, illumination change, and occlusion, improving the stability and accuracy of odometry/SLAM.
    Abstract Visual odometry and Simultaneous Localization And Mapping (SLAM) have been studied as some of the most important tasks in the areas of computer vision and robotics, contributing to autonomous navigation and augmented reality systems. In the case of feature-based odometry/SLAM, a moving visual sensor observes a set of 3D points from different viewpoints, and correspondences between the projected 2D points in each image are usually established by feature tracking and matching. However, since the corresponding points can be erroneous and noisy, reliable uncertainty estimation can improve the accuracy of odometry/SLAM methods. In addition, an inertial measurement unit is utilized to aid the visual sensor through visual-inertial fusion. In this paper, we propose a method to estimate the uncertainty of feature correspondence using an inertial guidance robust to image degradation caused by motion blur, illumination change and occlusion. Modeling a guidance distribution to sample possible correspondences, we fit the distribution to an energy function based on image error, yielding more robust uncertainty than conventional methods. We also demonstrate the feasibility of our approach by incorporating it into a recent visual-inertial odometry/SLAM algorithm on public datasets.

Multimodal deep representation learning for quantum cross-platform verification

  • paper_url: http://arxiv.org/abs/2311.03713
  • repo_url: None
  • paper_authors: Yang Qian, Yuxuan Du, Zhenliang He, Min-hsiu Hsieh, Dacheng Tao
  • for: 这篇论文旨在解决早期量子计算阶段的跨平台验证问题,特别是在大量子比特的情况下,使用最少的量子测量数据来描述两个不完美的量子设备在执行同一个算法时的相似性。
  • methods: 本文提出了一种创新的多模态学习方法,认为在这个任务中存在两种不同的模式:测量结果和编译到explored量子设备上的类型circuit的классификация,两者都含有独特的信息。通过这种见解,我们设计了一种多模态神经网络,独立地提取这两种模式中的知识,然后进行融合操作以创建一个完整的数据表示。
  • results: 我们在不同的噪声模型下,包括50个量子比特的系统,进行了测试,结果表明,相比于随机测量和已有数据集中的算法,我们的提议可以提高预测精度三个数量级,并且提供了证明量子设备之间的相似性在新的量子算法执行时的有力证据。这些发现开创了用多模态学习解决更广泛的量子系统学习任务的可能性。
    Abstract Cross-platform verification, a critical undertaking in the realm of early-stage quantum computing, endeavors to characterize the similarity of two imperfect quantum devices executing identical algorithms, utilizing minimal measurements. While the random measurement approach has been instrumental in this context, the quasi-exponential computational demand with increasing qubit count hinders its feasibility in large-qubit scenarios. To bridge this knowledge gap, here we introduce an innovative multimodal learning approach, recognizing that the formalism of data in this task embodies two distinct modalities: measurement outcomes and the classical description of compiled circuits on the explored quantum devices, both enriched with unique information. Building upon this insight, we devise a multimodal neural network to independently extract knowledge from these modalities, followed by a fusion operation to create a comprehensive data representation. The learned representation can effectively characterize the similarity between the explored quantum devices when executing new quantum algorithms not present in the training data. We evaluate our proposal on platforms featuring diverse noise models, encompassing system sizes up to 50 qubits. The achieved results demonstrate a three-orders-of-magnitude improvement in prediction accuracy compared to the random measurements and offer compelling evidence of the complementary roles played by each modality in cross-platform verification. These findings pave the way for harnessing the power of multimodal learning to overcome challenges in wider quantum system learning tasks.

Unsupervised convolutional neural network fusion approach for change detection in remote sensing images

  • paper_url: http://arxiv.org/abs/2311.03679
  • repo_url: None
  • paper_authors: Weidong Yan, Pei Yan, Li Cao
  • for: This work proposes a completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection in remote sensing images, avoiding the expense of deep-learning methods that need many training samples.
  • methods: The bi-temporal images are first transformed into different feature spaces using convolution kernels of different sizes to extract multi-scale information; output features from the same kernel are subtracted to obtain difference images, difference features at the same scale are fused by a 1 * 1 convolution layer, and the multi-scale outputs are concatenated and fused by another 1 * 1 convolution layer.
  • results: Experimental results on four real remote sensing datasets indicate the feasibility and effectiveness of the proposed approach.
    Abstract With the rapid development of deep learning, a variety of change detection methods based on deep learning have emerged in recent years. However, these methods usually require a large number of training samples to train the network model, so it is very expensive. In this paper, we introduce a completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection. Firstly, the bi-temporal images are transformed into different feature spaces by using convolution kernels of different sizes to extract multi-scale information of the images. Secondly, the output features of bi-temporal images at the same convolution kernels are subtracted to obtain the corresponding difference images, and the difference feature images at the same scale are fused into one feature image by using 1 * 1 convolution layer. Finally, the output features of different scales are concatenated and a 1 * 1 convolution layer is used to fuse the multi-scale information of the image. The model parameters are obtained by a redesigned sparse function. Our model has three features: the entire training process is conducted in an unsupervised manner, the network architecture is shallow, and the objective function is sparse. Thus, it can be seen as a kind of lightweight network model. Experimental results on four real remote sensing datasets indicate the feasibility and effectiveness of the proposed approach.

Image Generation and Learning Strategy for Deep Document Forgery Detection

  • paper_url: http://arxiv.org/abs/2311.03650
  • repo_url: None
  • paper_authors: Yamato Okamoto, Osada Genki, Iu Yahiro, Rintaro Hasegawa, Peifei Zhu, Hirokatsu Kataoka
  • for: This work aims to counter the threat of document forgery driven by deep neural network (DNN) generation methods.
  • methods: Models are pre-trained with self-supervised learning on both natural images and document images, and a document forgery image dataset (FD-VIED) is constructed by emulating attacks such as text addition, removal, and replacement.
  • results: Experiments show that the approach enhances detection performance.
    Abstract In recent years, document processing has flourished and brought numerous benefits. However, there has been a significant rise in reported cases of forged document images. Specifically, recent advancements in deep neural network (DNN) methods for generative tasks may amplify the threat of document forgery. Traditional approaches for forged document images created by prevalent copy-move methods are unsuitable against those created by DNN-based methods, as we have verified. To address this issue, we construct a training dataset of document forgery images, named FD-VIED, by emulating possible attacks, such as text addition, removal, and replacement with recent DNN-methods. Additionally, we introduce an effective pre-training approach through self-supervised learning with both natural images and document images. In our experiments, we demonstrate that our approach enhances detection performance.

Instruct Me More! Random Prompting for Visual In-Context Learning

  • paper_url: http://arxiv.org/abs/2311.03648
  • repo_url: None
  • paper_authors: Jiahao Zhang, Bowen Wang, Liangzhi Li, Yuta Nakashima, Hajime Nagahara
  • for: Improving the performance of visual in-context learning (ICL) in computer vision.
  • methods: Input-output image pairs (called in-context pairs) are augmented with a learnable perturbation (prompt) to improve model performance.
  • results: Compared to a baseline without a learnable prompt, InMeMo boosts mIoU scores by 7.35 on foreground segmentation and 15.13 on single object detection.
    Abstract Large-scale models trained on extensive datasets, have emerged as the preferred approach due to their high generalizability across various tasks. In-context learning (ICL), a popular strategy in natural language processing, uses such models for different tasks by providing instructive prompts but without updating model parameters. This idea is now being explored in computer vision, where an input-output image pair (called an in-context pair) is supplied to the model with a query image as a prompt to exemplify the desired output. The efficacy of visual ICL often depends on the quality of the prompts. We thus introduce a method coined Instruct Me More (InMeMo), which augments in-context pairs with a learnable perturbation (prompt), to explore its potential. Our experiments on mainstream tasks reveal that InMeMo surpasses the current state-of-the-art performance. Specifically, compared to the baseline without learnable prompt, InMeMo boosts mIoU scores by 7.35 and 15.13 for foreground segmentation and single object detection tasks, respectively. Our findings suggest that InMeMo offers a versatile and efficient way to enhance the performance of visual ICL with lightweight training. Code is available at https://github.com/Jackieam/InMeMo.
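As a rough illustration of the core idea, the sketch below adds a learnable border perturbation to images before they enter a frozen visual ICL model; the border-prompt design, sizes, and the `frozen_icl_model` interface are assumptions for illustration, not the released InMeMo implementation.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Additive border perturbation; the interior of the image is untouched."""
    def __init__(self, size=224, pad=16):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, size, size))
        mask = torch.ones(1, 1, size, size)
        mask[:, :, pad:-pad, pad:-pad] = 0  # perturb only the border region
        self.register_buffer("mask", mask)

    def forward(self, img):  # img: (B, 3, size, size)
        return img + self.delta * self.mask

prompt = LearnablePrompt()
# Training-loop sketch: only `prompt.delta` is optimized; the ICL model stays
# frozen, and the task loss is back-propagated through it:
#   pred = frozen_icl_model(prompt(ctx_in), prompt(ctx_out), prompt(query))
#   loss = task_loss(pred, target); loss.backward(); optimizer.step()
```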

Random Field Augmentations for Self-Supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2311.03629
  • repo_url: None
  • paper_authors: Philip Andrew Mansfield, Arash Afkanpour, Warren Richard Morningstar, Karan Singhal
  • for: Self-supervised representation learning, improving image recognition and generalization.
  • methods: A new family of local transformations based on Gaussian random fields generates image augmentations; these generalize affine and color transformations (translation, rotation, color jitter, etc.) by letting transformation parameter values vary from pixel to pixel.
  • results: The new transformations improve self-supervised representation learning, with a 1.7% top-1 accuracy gain on ImageNet downstream classification and a 3.6% gain on out-of-distribution iNaturalist classification; due to their flexibility, however, learned representations are sensitive to hyperparameters, so the diversity and strength of augmentations must be balanced.
    Abstract Self-supervised representation learning is heavily dependent on data augmentations to specify the invariances encoded in representations. Previous work has shown that applying diverse data augmentations is crucial to downstream performance, but augmentation techniques remain under-explored. In this work, we propose a new family of local transformations based on Gaussian random fields to generate image augmentations for self-supervised representation learning. These transformations generalize the well-established affine and color transformations (translation, rotation, color jitter, etc.) and greatly increase the space of augmentations by allowing transformation parameter values to vary from pixel to pixel. The parameters are treated as continuous functions of spatial coordinates, and modeled as independent Gaussian random fields. Empirical results show the effectiveness of the new transformations for self-supervised representation learning. Specifically, we achieve a 1.7% top-1 accuracy improvement over baseline on ImageNet downstream classification, and a 3.6% improvement on out-of-distribution iNaturalist downstream classification. However, due to the flexibility of the new transformations, learned representations are sensitive to hyperparameters. While mild transformations improve representations, we observe that strong transformations can degrade the structure of an image, indicating that balancing the diversity and strength of augmentations is important for improving generalization of learned representations.
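A minimal sketch of this kind of augmentation, under the assumption that a Gaussian random field can be approximated by Gaussian-smoothing white noise; the specific fields, parameters, and the per-pixel brightness jitter shown here are illustrative, not the paper's exact transformations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_field(shape, length_scale=20.0, strength=0.2, rng=None):
    """Smooth random field: Gaussian-filtered white noise, unit variance."""
    rng = np.random.default_rng() if rng is None else rng
    field = gaussian_filter(rng.standard_normal(shape), sigma=length_scale)
    return strength * field / (field.std() + 1e-8)

def grf_brightness_jitter(img, rng=None):
    """img: float array (H, W, 3) in [0, 1]; the jitter varies per pixel."""
    h, w, _ = img.shape
    jitter = random_field((h, w), rng=rng)[..., None]  # one field per image
    return np.clip(img * (1.0 + jitter), 0.0, 1.0)

augmented = grf_brightness_jitter(np.random.rand(224, 224, 3))
```

The `length_scale` knob controls how locally the parameters vary, which maps onto the paper's observation that mild fields help while overly strong ones degrade image structure.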

FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion

  • paper_url: http://arxiv.org/abs/2311.03620
  • repo_url: None
  • paper_authors: Xinhao Xiang, Jiawei Zhang
  • for: This paper proposes a vision-transformer-based 3D object detection model that fuses LiDAR and camera inputs to improve detection performance.
  • methods: A hierarchical transformer embeds both images and point clouds for representation learning, and a fusion vision transformer learns a joint representation from the two modalities.
  • results: On the real-world traffic datasets KITTI and Waymo Open, FusionViT achieves state-of-the-art performance, outperforming both single-modality (camera-only or LiDAR-only) baselines and recent multimodal image-point cloud deep fusion methods.
    Abstract For 3D object detection, both camera and lidar have been demonstrated to be useful sensory devices for providing complementary information about the same scenery with data representations in different modalities, e.g., 2D RGB image vs 3D point cloud. An effective representation learning and fusion of such multi-modal sensor data is necessary and critical for better 3D object detection performance. To solve the problem, in this paper, we will introduce a novel vision transformer-based 3D object detection model, namely FusionViT. Different from the existing 3D object detection approaches, FusionViT is a pure-ViT based framework, which adopts a hierarchical architecture by extending the transformer model to embed both images and point clouds for effective representation learning. Such multi-modal data embedding representations will be further fused together via a fusion vision transformer model prior to feeding the learned features to the object detection head for both detection and localization of the 3D objects in the input scenery. To demonstrate the effectiveness of FusionViT, extensive experiments have been done on real-world traffic object detection benchmark datasets KITTI and Waymo Open. Notably, our FusionViT model can achieve state-of-the-art performance and outperforms not only the existing baseline methods that merely rely on camera images or lidar point clouds, but also the latest multi-modal image-point cloud deep fusion approaches.

cs.AI - 2023-11-07

ToP-ToM: Trust-aware Robot Policy with Theory of Mind

  • paper_url: http://arxiv.org/abs/2311.04397
  • repo_url: None
  • paper_authors: Chuang Yu, Baris Serhan, Angelo Cangelosi
  • for: This paper investigates how a robot collaborating with a human in a multiagent setting can use Theory of Mind, i.e. attributing mental states to others, to build and maintain trust.
  • methods: A robot Theory of Mind model infers the human's trust beliefs (true belief and false belief) from observed behavior, and a dynamic trust-aware reward function based on these beliefs guides robot policy learning.
  • results: Experiments show that the ToM-based robot policy effectively maintains human-robot trust and improves collaboration in multiagent interaction settings.
    Abstract Theory of Mind (ToM) is a fundamental cognitive architecture that endows humans with the ability to attribute mental states to others. Humans infer the desires, beliefs, and intentions of others by observing their behavior and, in turn, adjust their actions to facilitate better interpersonal communication and team collaboration. In this paper, we investigated trust-aware robot policy with the theory of mind in a multiagent setting where a human collaborates with a robot against another human opponent. We show that by only focusing on team performance, the robot may resort to the reverse psychology trick, which poses a significant threat to trust maintenance. The human's trust in the robot will collapse when they discover deceptive behavior by the robot. To mitigate this problem, we adopt the robot theory of mind model to infer the human's trust beliefs, including true belief and false belief (an essential element of ToM). We designed a dynamic trust-aware reward function based on different trust beliefs to guide the robot policy learning, which aims to balance between avoiding human trust collapse due to robot reverse psychology. The experimental results demonstrate the importance of the ToM-based robot policy for human-robot trust and the effectiveness of our robot ToM-based robot policy in multiagent interaction settings.
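The sketch below shows one plausible shape for a dynamic trust-aware reward of the kind the abstract describes: deceptive actions are penalized in proportion to the inferred trust they put at risk. The weighting scheme and the trust-estimate interface are assumptions, not the authors' exact formulation.

```python
def trust_aware_reward(team_reward, deception_used, trust_belief, w=2.0):
    """
    team_reward:    task reward from the multiagent game.
    deception_used: True if the robot's action relies on reverse psychology.
    trust_belief:   ToM estimate in [0, 1] of the human's current trust,
                    inferred from their true-belief / false-belief state.
    """
    penalty = w * trust_belief if deception_used else 0.0
    # Deception is discouraged most when trust is high, since discovery
    # there would cause the largest trust collapse.
    return team_reward - penalty
```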

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

  • paper_url: http://arxiv.org/abs/2311.04386
  • repo_url: None
  • paper_authors: Jan Finkbeiner, Thomas Gmeinder, Mark Pupilli, Alexander Titterton, Emre Neftci
  • for: The paper explores a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory for training sparse and recurrent neural networks, aiming to overcome the limitations of current AI training infrastructure dominated by single instruction multiple data (SIMD) and systolic array architectures such as GPUs and TPUs, and to pave the way towards more efficient and sustainable AI training.
  • methods: A training routine based on backpropagation through time (BPTT) is implemented for Spiking Neural Networks (SNNs), a brain-inspired model class with binary sparse activations, on a MIMD processor, the Intelligence Processing Unit (IPU), and compared against A100 GPUs.
  • results: The IPU delivers 5-10x throughput gains over A100 GPUs, and up to 38x at higher levels of activation sparsity, without a significant slowdown in training convergence or reduction in final model performance, with highly promising scaling trends for both single and multi IPU configurations at larger model sizes.
    Abstract Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel workloads and dense vector matrix multiplications. Potentially more efficient neural network models utilizing sparsity and recurrence cannot leverage the full power of SIMD processor and are thus at a severe disadvantage compared to today's prominent parallel architectures like Transformers and CNNs, thereby hindering the path towards more sustainable AI. To overcome this limitation, we explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory. We implement a training routine based on backpropagation through time (BPTT) for the brain-inspired class of Spiking Neural Networks (SNNs) that feature binary sparse activations. We observe a massive advantage in using sparse activation tensors with a MIMD processor, the Intelligence Processing Unit (IPU) compared to GPUs. On training workloads, our results demonstrate 5-10x throughput gains compared to A100 GPUs and up to 38x gains for higher levels of activation sparsity, without a significant slowdown in training convergence or reduction in final model performance. Furthermore, our results show highly promising trends for both single and multi IPU configurations as we scale up to larger model sizes. Our work paves the way towards more efficient, non-standard models via AI training hardware beyond GPUs, and competitive large scale SNN models.
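For readers unfamiliar with BPTT on spiking networks, here is a minimal PyTorch sketch of a leaky integrate-and-fire layer trained with a surrogate gradient; the LIF model, constants, and surrogate shape are illustrative assumptions. The paper's gains come from IPU kernels that exploit the binary sparse activations, which this dense sketch does not attempt.

```python
import torch

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()  # binary spike: the sparse activation

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Smooth surrogate derivative in place of the spike's zero gradient.
        return grad_out / (1.0 + 10.0 * v.abs()) ** 2

def lif_forward(x_seq, w, beta=0.9, theta=1.0):
    """x_seq: (T, B, D_in); returns binary spike trains (T, B, D_out)."""
    v = torch.zeros(x_seq.shape[1], w.shape[1])
    spikes = []
    for x in x_seq:                      # unrolled in time, so BPTT applies
        v = beta * v + x @ w             # leaky integration of input current
        s = SpikeFn.apply(v - theta)     # fire when membrane exceeds theta
        v = v - s * theta                # soft reset after a spike
        spikes.append(s)
    return torch.stack(spikes)
```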

Enhancing Malware Detection by Integrating Machine Learning with Cuckoo Sandbox

  • paper_url: http://arxiv.org/abs/2311.04372
  • repo_url: None
  • paper_authors: Amaal F. Alshmarni, Mohammed A. Alliheedi
  • for: This study classifies and identifies malware from API call sequences using deep learning and traditional machine learning algorithms.
  • methods: Deep learning models (CNN and RNN) are compared against traditional classifiers, including SVM, Random Forest (RF), K-Nearest Neighbors (KNN), XGBoost (XGB), and Gradient Boosting (GBC), all on the same dataset.
  • results: Both the deep learning and the traditional machine learning algorithms reach remarkably high accuracy, up to 99% in certain cases.
    Abstract In the modern era, malware is experiencing a significant increase in both its variety and quantity, aligning with the widespread adoption of the digital world. This surge in malware has emerged as a critical challenge in the realm of cybersecurity, prompting numerous research endeavors and contributions to address the issue. Machine learning algorithms have been leveraged for malware detection due to their ability to uncover concealed patterns within vast datasets. However, deep learning algorithms, characterized by their multi-layered structure, surpass the limitations of traditional machine learning approaches. By employing deep learning techniques such as CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), this study aims to classify and identify malware extracted from a dataset containing API call sequences. The performance of these algorithms is compared with that of conventional machine learning methods, including SVM (Support Vector Machine), RF (Random Forest), KNN (K-Nearest Neighbors), XGB (Extreme Gradient Boosting), and GBC (Gradient Boosting Classifier), all using the same dataset. The outcomes of this research demonstrate that both deep learning and machine learning algorithms achieve remarkably high levels of accuracy, reaching up to 99% in certain cases.
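A minimal sketch of the two model families compared: a 1D CNN over embedded API-call tokens versus a classical boosted classifier over call-frequency features. Vocabulary and layer sizes are illustrative assumptions, not the study's configuration.

```python
import torch
import torch.nn as nn

class ApiCallCNN(nn.Module):
    """1D CNN over embedded API-call token sequences."""
    def __init__(self, vocab=300, emb=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, 128, kernel_size=5, padding=2)
        self.head = nn.Linear(128, n_classes)

    def forward(self, seq):                      # seq: (B, T) API-call ids
        h = self.conv(self.emb(seq).transpose(1, 2)).relu()
        return self.head(h.max(dim=2).values)   # global max-pool over time

# Classical baseline on bag-of-API-call counts (X: (N, vocab) matrix):
# from xgboost import XGBClassifier
# clf = XGBClassifier(n_estimators=300).fit(X_train, y_train)
```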

Evaluating the Effectiveness of Retrieval-Augmented Large Language Models in Scientific Document Reasoning

  • paper_url: http://arxiv.org/abs/2311.04348
  • repo_url: None
  • paper_authors: Sai Munikoti, Anurag Acharya, Sridevi Wagle, Sameera Horawalavithana
  • for: Improving the reliability of Large Language Models (LLMs) in science and addressing their tendency to produce plausible but non-factual information (hallucinations).
  • methods: Retrieval-augmented LLMs, which retrieve relevant information from external data sources, are tuned with science-focused instructions and evaluated on a scientific document reasoning benchmark for the usefulness of the retrieved passages.
  • results: The models justify predictions in science tasks with fabricated evidence, and using a scientific corpus as pretraining data does not alleviate the risk of evidence fabrication.
    Abstract Despite the dramatic progress in Large Language Model (LLM) development, LLMs often provide seemingly plausible but not factual information, often referred to as hallucinations. Retrieval-augmented LLMs provide a non-parametric approach to solve these issues by retrieving relevant information from external data sources and augment the training process. These models help to trace evidence from an externally provided knowledge base allowing the model predictions to be better interpreted and verified. In this work, we critically evaluate these models in their ability to perform in scientific document reasoning tasks. To this end, we tuned multiple such model variants with science-focused instructions and evaluated them on a scientific document reasoning benchmark for the usefulness of the retrieved document passages. Our findings suggest that models justify predictions in science tasks with fabricated evidence and leveraging scientific corpus as pretraining data does not alleviate the risk of evidence fabrication.

A Taxonomy of Rater Disagreements: Surveying Challenges & Opportunities from the Perspective of Annotating Online Toxicity

  • paper_url: http://arxiv.org/abs/2311.04345
  • repo_url: None
  • paper_authors: Wenbo Zhang, Hangzhi Guo, Ian D Kivlichan, Vinodkumar Prabhakaran, Davis Yadav, Amulya Yadav
  • for: This survey examines the root causes of annotator disagreement in labeling online toxicity and how such disagreements can be incorporated into the machine learning development pipeline.
  • methods: Through an analysis of a broad body of literature, the authors propose a detailed taxonomy of the reasons behind rater disagreement, with discussion of each cause.
  • results: The survey provides a comprehensive taxonomy of disagreement causes, summarizes potential solutions targeting each one, and identifies open issues that could guide future research on online toxicity.
    Abstract Toxicity is an increasingly common and severe issue in online spaces. Consequently, a rich line of machine learning research over the past decade has focused on computationally detecting and mitigating online toxicity. These efforts crucially rely on human-annotated datasets that identify toxic content of various kinds in social media texts. However, such annotations historically yield low inter-rater agreement, which was often dealt with by taking the majority vote or other such approaches to arrive at a single ground truth label. Recent research has pointed out the importance of accounting for the subjective nature of this task when building and utilizing these datasets, and this has triggered work on analyzing and better understanding rater disagreements, and how they could be effectively incorporated into the machine learning developmental pipeline. While these efforts are filling an important gap, there is a lack of a broader framework about the root causes of rater disagreement, and therefore, we situate this work within that broader landscape. In this survey paper, we analyze a broad set of literature on the reasons behind rater disagreements focusing on online toxicity, and propose a detailed taxonomy for the same. Further, we summarize and discuss the potential solutions targeting each reason for disagreement. We also discuss several open issues, which could promote the future development of online toxicity research.

Multimodal Clinical Benchmark for Emergency Care (MC-BEC): A Comprehensive Benchmark for Evaluating Foundation Models in Emergency Medicine

  • paper_url: http://arxiv.org/abs/2311.04937
  • repo_url: None
  • paper_authors: Emma Chen, Aman Kansal, Julie Chen, Boyang Tom Jin, Julia Rachel Reisler, David A Kim, Pranav Rajpurkar
  • for: The paper aims to provide a comprehensive benchmark for evaluating foundation models in Emergency Medicine, specifically for predicting patient decompensation, disposition, and ED revisit.
  • methods: The paper uses a multimodal dataset of over 100,000 continuously monitored Emergency Department visits from 2020-2022, including clinical data such as vital signs, electrocardiogram and photoplethysmograph waveforms, and free-text reports of imaging studies. The paper provides a standardized evaluation framework with train-test splits and evaluation metrics.
  • results: The paper provides performance baselines for each prediction task to enable the evaluation of multimodal, multitask models.
    Abstract We propose the Multimodal Clinical Benchmark for Emergency Care (MC-BEC), a comprehensive benchmark for evaluating foundation models in Emergency Medicine using a dataset of 100K+ continuously monitored Emergency Department visits from 2020-2022. MC-BEC focuses on clinically relevant prediction tasks at timescales from minutes to days, including predicting patient decompensation, disposition, and emergency department (ED) revisit, and includes a standardized evaluation framework with train-test splits and evaluation metrics. The multimodal dataset includes a wide range of detailed clinical data, including triage information, prior diagnoses and medications, continuously measured vital signs, electrocardiogram and photoplethysmograph waveforms, orders placed and medications administered throughout the visit, free-text reports of imaging studies, and information on ED diagnosis, disposition, and subsequent revisits. We provide performance baselines for each prediction task to enable the evaluation of multimodal, multitask models. We believe that MC-BEC will encourage researchers to develop more effective, generalizable, and accessible foundation models for multimodal clinical data.

Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations

  • paper_url: http://arxiv.org/abs/2311.04335
  • repo_url: https://github.com/schen149/sub-sentence-encoder
  • paper_authors: Sihao Chen, Hongming Zhang, Tong Chen, Ben Zhou, Wenhao Yu, Dian Yu, Baolin Peng, Hongwei Wang, Dan Roth, Dong Yu
  • for: This paper provides a contrastively learned model for fine-grained, proposition-level semantic representation of text, useful in applications such as text attribution and retrieval.
  • methods: The sub-sentence encoder learns distinct contextual embeddings for the atomic propositions expressed within a text sequence, trained contrastively to recognize semantic equivalence between propositions across different texts.
  • results: Experiments show that sub-sentence encoders improve fine-grained semantic tasks, such as retrieving supporting facts and recognizing conditional semantic similarity, while keeping the same inference cost and space complexity as sentence encoders.
    Abstract We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.
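The contrastive objective can be illustrated with a standard InfoNCE loss over paired proposition embeddings; the temperature and the assumption that positives sit on the batch diagonal are illustrative, not the released model's exact training setup.

```python
import torch
import torch.nn.functional as F

def info_nce(prop_a, prop_b, tau=0.05):
    """prop_a, prop_b: (N, D) proposition embeddings; row i of each side is
    an (inferred) semantically equivalent pair from different texts."""
    a = F.normalize(prop_a, dim=-1)
    b = F.normalize(prop_b, dim=-1)
    logits = a @ b.t() / tau                         # (N, N) similarities
    targets = torch.arange(len(a), device=logits.device)
    return F.cross_entropy(logits, targets)          # positives on diagonal
```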

Educating for AI Cybersecurity Work and Research: Ethics, Systems Thinking, and Communication Requirements

  • paper_url: http://arxiv.org/abs/2311.04326
  • repo_url: None
  • paper_authors: Sorin Adam Matei, Elisa Bertino
  • for: The paper explores managerial and instructor perceptions of freshly employed cybersecurity workers' preparedness to work effectively in a changing cybersecurity environment that includes AI tools.
  • methods: A survey collects managerial and instructor perceptions of technical preparedness and of non-technical skill sets (ethical, systems thinking, and communication skills) among freshly employed cybersecurity workers.
  • results: Managers and professors perceive preparedness to use AI tools in cybersecurity as significantly associated with all three non-technical skill sets, with ethics the clear leader; professors also over-estimate students' ethical, systems thinking, and communication preparedness compared to IT managers' perceptions of their newly employed IT workers.
    Abstract The present study explored managerial and instructor perceptions of their freshly employed cybersecurity workers' or students' preparedness to work effectively in a changing cybersecurity environment that includes AI tools. Specifically, we related perceptions of technical preparedness to ethical, systems thinking, and communication skills. We found that managers and professors perceive preparedness to use AI tools in cybersecurity to be significantly associated with all three non-technical skill sets. Most important, ethics is a clear leader in the network of relationships. Contrary to expectations that ethical concerns are left behind in the rush to adopt the most advanced AI tools in security, both higher education instructors and managers appreciate their role and see them closely associated with technical prowess. Another significant finding is that professors over-estimate students' preparedness for ethical, system thinking, and communication abilities compared to IT managers' perceptions of their newly employed IT workers.

Extending Machine Learning-Based Early Sepsis Detection to Different Demographics

  • paper_url: http://arxiv.org/abs/2311.04325
  • repo_url: None
  • paper_authors: Surajsinh Parmar, Tao Shan, San Lee, Yonghwan Kim, Jang Yong Kim
  • for: This study compares two ensemble learning methods, LightGBM and XGBoost, on the public eICU-CRD dataset and a private dataset from South Korea's St. Mary's Hospital, to address healthcare data imbalance and improve early sepsis detection.
  • methods: LightGBM and XGBoost models are trained and evaluated on both datasets.
  • results: Both methods are effective, with LightGBM showing a slight edge in computational efficiency and scalability; the findings support the broader application of machine learning in critical care.
    Abstract Sepsis requires urgent diagnosis, but research is predominantly focused on Western datasets. In this study, we perform a comparative analysis of two ensemble learning methods, LightGBM and XGBoost, using the public eICU-CRD dataset and a private South Korean St. Mary's Hospital's dataset. Our analysis reveals the effectiveness of these methods in addressing healthcare data imbalance and enhancing sepsis detection. Specifically, LightGBM shows a slight edge in computational efficiency and scalability. The study paves the way for the broader application of machine learning in critical care, thereby expanding the reach of predictive analytics in healthcare globally.
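A minimal sketch of the comparison described, with class weighting for the imbalance and AUROC scoring; the hyperparameters and feature interface are illustrative assumptions.

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

def compare_boosters(X_train, y_train, X_test, y_test):
    # Weight the rare positive (sepsis) class to counter the imbalance.
    pos_weight = (y_train == 0).sum() / max((y_train == 1).sum(), 1)
    models = {
        "lightgbm": LGBMClassifier(n_estimators=500, class_weight="balanced"),
        "xgboost": XGBClassifier(n_estimators=500, scale_pos_weight=pos_weight),
    }
    return {
        name: roc_auc_score(
            y_test, m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
        )
        for name, m in models.items()
    }
```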

A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition

  • paper_url: http://arxiv.org/abs/2311.04936
  • repo_url: https://github.com/c3imaging/child_asr_conformer
  • paper_authors: Andrei Barcovschi, Rishabh Jain, Peter Corcoran
  • for: Improving automatic speech recognition performance on child speech.
  • methods: A Conformer-transducer model is finetuned on child speech and compared with self-supervised wav2vec2 models and semi-supervised multi-domain Whisper models finetuned on the same data.
  • results: Finetuning the Conformer-transducer yields significant improvements on child speech over the non-finetuned model; among the three approaches, wav2vec2 provides the most consistent performance improvements.
    Abstract Automatic Speech Recognition (ASR) systems have progressed significantly in their performance on adult speech data; however, transcribing child speech remains challenging due to the acoustic differences in the characteristics of child and adult voices. This work aims to explore the potential of adapting state-of-the-art Conformer-transducer models to child speech to improve child speech recognition performance. Furthermore, the results are compared with those of self-supervised wav2vec2 models and semi-supervised multi-domain Whisper models that were previously finetuned on the same data. We demonstrate that finetuning Conformer-transducer models on child speech yields significant improvements in ASR performance on child speech, compared to the non-finetuned models. We also show Whisper and wav2vec2 adaptation on different child speech datasets. Our detailed comparative analysis shows that wav2vec2 provides the most consistent performance improvements among the three methods studied.

Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning

  • paper_url: http://arxiv.org/abs/2311.04313
  • repo_url: None
  • paper_authors: Rishabh Jain, Peter Corcoran
  • for: This paper investigates generating high-quality synthetic child speech with the Fastpitch text-to-speech (TTS) model.
  • methods: A transfer learning training pipeline finetunes a multi-speaker TTS model on child speech, using the cleaned version of the publicly available MyST dataset (55 hours).
  • results: A prototype dataset of synthetic speech samples and the model code are released to support further research; a pretrained MOSNet shows a significant correlation between real and synthetic child voices, an automatic speech recognition (ASR) model compares the word error rates (WER) of real and synthetic child speech, and speaker similarity is measured with a pretrained speaker encoder.
    Abstract Speech synthesis technology has witnessed significant advancements in recent years, enabling the creation of natural and expressive synthetic speech. One area of particular interest is the generation of synthetic child speech, which presents unique challenges due to children's distinct vocal characteristics and developmental stages. This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech. This study uses the transfer learning training pipeline. The approach involved finetuning a multi-speaker TTS model to work with child speech. We use the cleaned version of the publicly available MyST dataset (55 hours) for our finetuning experiments. We also release a prototype dataset of synthetic speech samples generated from this research together with model code to support further research. By using a pretrained MOSNet, we conducted an objective assessment that showed a significant correlation between real and synthetic child voices. Additionally, to validate the intelligibility of the generated speech, we employed an automatic speech recognition (ASR) model to compare the word error rates (WER) of real and synthetic child voices. The speaker similarity between the real and generated speech is also measured using a pretrained speaker encoder.

Class-Incremental Continual Learning for General Purpose Healthcare Models

  • paper_url: http://arxiv.org/abs/2311.04301
  • repo_url: None
  • paper_authors: Amritpal Singh, Mustafa Burak Gurbuz, Shiva Souhith Gantha, Prahlad Jasti
  • for: This study examines whether a single model can sequentially learn new medical imaging tasks across different specialties and hospitals without losing performance on previously learned tasks.
  • methods: Various continual learning approaches, including rehearsal-based methods with episodic memory such as iCaRL, are implemented and evaluated on four medical imaging scenarios spanning ten classification datasets.
  • results: A single model can sequentially learn tasks from different specialties with performance comparable to naive methods, indicating that models can be recycled or shared across the same or different medical specialties, another step towards general-purpose medical imaging AI shared across institutions.
    Abstract Healthcare clinics regularly encounter dynamic data that changes due to variations in patient populations, treatment policies, medical devices, and emerging disease patterns. Deep learning models can suffer from catastrophic forgetting when fine-tuned in such scenarios, causing poor performance on previously learned tasks. Continual learning allows learning on new tasks without performance drop on previous tasks. In this work, we investigate the performance of continual learning models on four different medical imaging scenarios involving ten classification datasets from diverse modalities, clinical specialties, and hospitals. We implement various continual learning approaches and evaluate their performance in these scenarios. Our results demonstrate that a single model can sequentially learn new tasks from different specialties and achieve comparable performance to naive methods. These findings indicate the feasibility of recycling or sharing models across the same or different medical specialties, offering another step towards the development of general-purpose medical imaging AI that can be shared across institutions.
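One of the evaluated families, rehearsal-based continual learning, can be sketched as follows: a bounded episodic memory of past exemplars is replayed alongside each new task's batches. The buffer size, sampling scheme, and interfaces are illustrative assumptions, not the paper's exact protocol.

```python
import random
import torch

def train_sequential(model, task_loaders, opt, loss_fn, mem_size=2000):
    memory = []                                   # [(x_i, y_i), ...] exemplars
    for loader in task_loaders:                   # one loader per task/specialty
        for x, y in loader:
            if memory:                            # replay old exemplars
                mx, my = zip(*random.sample(memory, min(len(memory), len(x))))
                x = torch.cat([x, torch.stack(mx)])
                y = torch.cat([y, torch.stack(my)])
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        # Keep a bounded random sample of this task (one batch, for brevity).
        xb, yb = next(iter(loader))
        for xi, yi in zip(xb, yb):
            if len(memory) < mem_size:
                memory.append((xi, yi))
            else:
                memory[random.randrange(mem_size)] = (xi, yi)
    return model
```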

CRAB: Assessing the Strength of Causal Relationships Between Real-world Events

  • paper_url: http://arxiv.org/abs/2311.04284
  • repo_url: None
  • paper_authors: Angelika Romanou, Syrielle Montariol, Debjit Paul, Leo Laugier, Karl Aberer, Antoine Bosselut
  • for: Assessing the ability of large language models to understand causal relationships between real-world events in news narratives.
  • methods: The new CRAB benchmark, with fine-grained contextual causality annotations, is used to test several large language models on reasoning about causal relationships between events.
  • results: Most language models perform poorly on this task, and they perform worse when events are derived from complex causal structures than from simple linear causal chains.
    Abstract Understanding narratives requires reasoning about the cause-and-effect relationships between events mentioned in the text. While existing foundation models yield impressive results in many NLP tasks requiring reasoning, it is unclear whether they understand the complexity of the underlying network of causal relationships of events in narratives. In this work, we present CRAB, a new Causal Reasoning Assessment Benchmark designed to evaluate causal understanding of events in real-world narratives. CRAB contains fine-grained, contextual causality annotations for ~2.7K pairs of real-world events that describe various newsworthy event timelines (e.g., the acquisition of Twitter by Elon Musk). Using CRAB, we measure the performance of several large language models, demonstrating that most systems achieve poor performance on the task. Motivated by classical causal principles, we also analyze the causal structures of groups of events in CRAB, and find that models perform worse on causal reasoning when events are derived from complex causal structures compared to simple linear causal chains. We make our dataset and code available to the research community.

OtterHD: A High-Resolution Multi-modality Model

  • paper_url: http://arxiv.org/abs/2311.04219
  • repo_url: None
  • paper_authors: Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu
  • for: This paper presents OtterHD-8B, a new multimodal model engineered to interpret high-resolution visual inputs with granular precision and greater flexibility.
  • methods: Built on the Fuyu-8B architecture, the model handles flexible input dimensions rather than relying on a fixed-size vision encoder, allowing high-precision processing of high-resolution visual inputs.
  • results: On the MagnifierBench evaluation, OtterHD-8B, particularly when directly processing high-resolution inputs, outperforms leading counterpart models by a substantial margin.
    Abstract In this paper, we present OtterHD-8B, an innovative multimodal model evolved from Fuyu-8B, specifically engineered to interpret high-resolution visual inputs with granular precision. Unlike conventional models that are constrained by fixed-size vision encoders, OtterHD-8B boasts the ability to handle flexible input dimensions, ensuring its versatility across various inference requirements. Alongside this model, we introduce MagnifierBench, an evaluation framework designed to scrutinize models' ability to discern minute details and spatial relationships of small objects. Our comparative analysis reveals that while current leading models falter on this benchmark, OtterHD-8B, particularly when directly processing high-resolution inputs, outperforms its counterparts by a substantial margin. The findings illuminate the structural variances in visual information processing among different models and the influence that the vision encoders' pre-training resolution disparities have on model effectiveness within such benchmarks. Our study highlights the critical role of flexibility and high-resolution input capabilities in large multimodal models and also exemplifies the potential inherent in the Fuyu architecture's simplicity for handling complex visual data.

Towards Garment Sewing Pattern Reconstruction from a Single Image

  • paper_url: http://arxiv.org/abs/2311.04218
  • repo_url: None
  • paper_authors: Lijuan Liu, Xiangyu Xu, Zhijie Lin, Jiabin Liang, Shuicheng Yan
  • for: This work recovers garment sewing patterns from daily photos, to support applications such as fashion design, virtual try-on, and digital avatars.
  • methods: The authors first synthesize SewFactory, a dataset of around 1M images with ground-truth sewing patterns for model training and quantitative evaluation, and then propose Sewformer, a two-level Transformer network that significantly improves sewing pattern prediction.
  • results: Extensive experiments show the framework effectively recovers sewing patterns and generalizes well to casually taken human photos.
    Abstract Garment sewing pattern represents the intrinsic rest shape of a garment, and is the core for many applications like fashion design, virtual try-on, and digital avatars. In this work, we explore the challenging problem of recovering garment sewing patterns from daily photos for augmenting these applications. To solve the problem, we first synthesize a versatile dataset, named SewFactory, which consists of around 1M images and ground-truth sewing patterns for model training and quantitative evaluation. SewFactory covers a wide range of human poses, body shapes, and sewing patterns, and possesses realistic appearances thanks to the proposed human texture synthesis network. Then, we propose a two-level Transformer network called Sewformer, which significantly improves the sewing pattern prediction performance. Extensive experiments demonstrate that the proposed framework is effective in recovering sewing patterns and well generalizes to casually-taken human photos. Code, dataset, and pre-trained models are available at: https://sewformer.github.io.

Wearable data from subjects playing Super Mario, sitting university exams, or performing physical exercise help detect acute mood episodes via self-supervised learning

  • paper_url: http://arxiv.org/abs/2311.04215
  • repo_url: None
  • paper_authors: Filippo Corponi, Bryan M. Li, Gerard Anmella, Clàudia Valenzuela-Pascual, Ariadna Mas, Isabella Pacchiarotti, Marc Valentí, Iria Grande, Antonio Benabarre, Marina Garriga, Eduard Vieta, Allan H Young, Stephen M. Lawrie, Heather C. Whalley, Diego Hidalgo-Mazzei, Antonio Vergari
  • for: This study monitors mood disorders (MDs) from data passively collected with wearable devices, advancing the application of personal sensing in healthcare.
  • methods: Self-supervised learning (SSL) leverages unlabelled data to learn representations during pre-training, which are then exploited for the supervised task.
  • results: SSL effectively detects acute MD episodes vs. stable states, outperforming fully supervised pipelines based on either the E4-tailored Transformer (E4mer) or XGBoost.
    Abstract Personal sensing, leveraging data passively and near-continuously collected with wearables from patients in their ecological environment, is a promising paradigm to monitor mood disorders (MDs), a major determinant of worldwide disease burden. However, collecting and annotating wearable data is very resource-intensive. Studies of this kind can thus typically afford to recruit only a couple dozens of patients. This constitutes one of the major obstacles to applying modern supervised machine learning techniques to MDs detection. In this paper, we overcome this data bottleneck and advance the detection of MDs acute episode vs stable state from wearables data on the back of recent advances in self-supervised learning (SSL). This leverages unlabelled data to learn representations during pre-training, subsequently exploited for a supervised task. First, we collected open-access datasets recording with an Empatica E4 spanning different, unrelated to MD monitoring, personal sensing tasks -- from emotion recognition in Super Mario players to stress detection in undergraduates -- and devised a pre-processing pipeline performing on-/off-body detection, sleep-wake detection, segmentation, and (optionally) feature extraction. With 161 E4-recorded subjects, we introduce E4SelfLearning, the largest to date open access collection, and its pre-processing pipeline. Second, we show that SSL confidently outperforms fully-supervised pipelines using either our novel E4-tailored Transformer architecture (E4mer) or classical baseline XGBoost: 81.23% against 75.35% (E4mer) and 72.02% (XGBoost) correctly classified recording segments from 64 (half acute, half stable) patients. Lastly, we illustrate that SSL performance is strongly associated with the specific surrogate task employed for pre-training as well as with unlabelled data availability.
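A minimal sketch of the SSL recipe: pre-train an encoder on unlabelled wearable windows with a masked-reconstruction pretext task, then reuse it for the supervised acute-vs-stable classification. The architecture and pretext task are illustrative assumptions; the paper's E4mer is a Transformer, and its surrogate tasks come from the collected personal-sensing datasets.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 64))
decoder = nn.Linear(64, 6)       # used only during pre-training
head = nn.Linear(64, 2)          # acute vs. stable, for fine-tuning
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()])

def pretrain_step(x, mask_p=0.3):
    """x: (B, T, 6) windows of wearable channels; no labels needed."""
    mask = (torch.rand(x.shape[:2]) < mask_p).unsqueeze(-1)
    recon = decoder(encoder(x.masked_fill(mask, 0.0)))
    loss = ((recon - x) ** 2 * mask).mean()   # score only the masked steps
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def classify(x):
    """Fine-tuning/inference: pool the pretrained features over time."""
    return head(encoder(x).mean(dim=1))       # (B, 2) logits
```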

Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

  • paper_url: http://arxiv.org/abs/2311.04205
  • repo_url: https://github.com/uclaml/Rephrase-and-Respond
  • paper_authors: Yihe Deng, Weitong Zhang, Zixiang Chen, Quanquan Gu
  • for: This paper proposes Rephrase and Respond (RaR), which lets large language models (LLMs) rephrase and expand questions posed by humans and then answer them in a single prompt, helping the models better understand the questions.
  • methods: In a two-step variant, a rephrasing LLM first rewrites the question, and the original and rephrased questions are then passed together to a different responding LLM, so rephrasings generated by one model can benefit another.
  • results: Experiments show RaR significantly improves model performance across a wide range of tasks; it is complementary to Chain-of-Thought (CoT) prompting and can be combined with CoT for even better performance.
    Abstract Misunderstandings arise not only in interpersonal communication but also between humans and Large Language Models (LLMs). Such discrepancies can make LLMs interpret seemingly unambiguous questions in unexpected ways, yielding incorrect responses. While it is widely acknowledged that the quality of a prompt, such as a question, significantly impacts the quality of the response provided by LLMs, a systematic method for crafting questions that LLMs can better comprehend is still underdeveloped. In this paper, we present a method named `Rephrase and Respond' (RaR), which allows LLMs to rephrase and expand questions posed by humans and provide responses in a single prompt. This approach serves as a simple yet effective prompting method for improving performance. We also introduce a two-step variant of RaR, where a rephrasing LLM first rephrases the question and then passes the original and rephrased questions together to a different responding LLM. This facilitates the effective utilization of rephrased questions generated by one LLM with another. Our experiments demonstrate that our methods significantly improve the performance of different models across a wide range to tasks. We further provide a comprehensive comparison between RaR and the popular Chain-of-Thought (CoT) methods, both theoretically and empirically. We show that RaR is complementary to CoT and can be combined with CoT to achieve even better performance. Our work not only contributes to enhancing LLM performance efficiently and effectively but also sheds light on a fair evaluation of LLM capabilities. Data and codes are available at https://github.com/uclaml/Rephrase-and-Respond.
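The two-step variant is easy to sketch; `chat` below is an assumed helper wrapping whatever LLM API is available, and the prompt wording paraphrases the paper's idea rather than quoting its exact templates (see the repo for those).

```python
def rephrase_and_respond(question, chat):
    """Two-step RaR; chat(prompt) -> str is an assumed LLM-call helper."""
    rephrased = chat(
        f"{question}\nGiven the above question, rephrase and expand it "
        "to help you do better answering. Keep all information in the "
        "original question."
    )
    return chat(
        f"(original) {question}\n(rephrased) {rephrased}\n"
        "Use your answer to the rephrased question to answer the original one."
    )
```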

JPAVE: A Generation and Classification-based Model for Joint Product Attribute Prediction and Value Extraction

  • paper_url: http://arxiv.org/abs/2311.04196
  • repo_url: https://github.com/zhongfendeng/jpave
  • paper_authors: Zhongfen Deng, Hao Peng, Tao Zhang, Shuaiqi Liu, Wenting Zhao, Yibo Wang, Philip S. Yu
  • for: This paper addresses product attribute value extraction, which supports downstream e-commerce applications such as product search and recommendation.
  • methods: A multi-task learning model, JPAVE, combines value generation/classification with attribute prediction to predict values without relying on their positional information in the text; a copy mechanism in the value generator and a value attention module in the value classifier mitigate data discrepancy by focusing only on the relevant parts of the input text, and two variants are designed for open-world and closed-world scenarios.
  • results: Experiments on a public dataset demonstrate the model's superiority over strong baselines and its ability to generalize to new values.
    Abstract Product attribute value extraction is an important task in e-Commerce which can help several downstream applications such as product search and recommendation. Most previous models handle this task using sequence labeling or question answering method which rely on the sequential position information of values in the product text and are vulnerable to data discrepancy between training and testing. This limits their generalization ability to real-world scenario in which each product can have multiple descriptions across various shopping platforms with different composition of text and style. They also have limited zero-shot ability to new values. In this paper, we propose a multi-task learning model with value generation/classification and attribute prediction called JPAVE to predict values without the necessity of position information of values in the text. Furthermore, the copy mechanism in value generator and the value attention module in value classifier help our model address the data discrepancy issue by only focusing on the relevant part of input text and ignoring other information which causes the discrepancy issue such as sentence structure in the text. Besides, two variants of our model are designed for open-world and closed-world scenarios. In addition, copy mechanism introduced in the first variant based on value generation can improve its zero-shot ability for identifying unseen values. Experimental results on a public dataset demonstrate the superiority of our model compared with strong baselines and its generalization ability of predicting new values.

Selective Visual Representations Improve Convergence and Generalization for Embodied AI

  • paper_url: http://arxiv.org/abs/2311.04193
  • repo_url: None
  • paper_authors: Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi, Ranjay Krishna
  • for: This paper improves the precision and efficiency of visual processing in embodied AI models so that they focus on task-relevant visual cues.
  • methods: A parameter-efficient approach induces a task-conditioned bottleneck using a small learnable codebook module, trained jointly with the task reward to act as a selective filter over the visual observation.
  • results: The approach achieves state-of-the-art performance for object goal navigation and object displacement across five benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR.
    Abstract Embodied AI models often employ off the shelf vision backbones like CLIP to encode their visual observations. Although such general purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visual cues. Inspired by selective attention in humans-the process through which people filter their perception based on their experiences, knowledge, and the task at hand-we introduce a parameter-efficient approach to filter visual stimuli for embodied AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments showcase state-of-the-art performance for object goal navigation and object displacement across 5 benchmarks, ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR. The filtered representations produced by the codebook are also able generalize better and converge faster when adapted to other simulation environments such as Habitat. Our qualitative analyses show that agents explore their environments more effectively and their representations retain task-relevant information like target object recognition while ignoring superfluous information about other objects. Code and pretrained models are available at our project website: https://embodied-codebook.github.io.
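The task-conditioned bottleneck can be sketched as a small vector-quantization module: features are snapped to their nearest codebook entries, and a straight-through estimator keeps the filter trainable end-to-end with the task reward. The sizes and the straight-through trick are illustrative assumptions about the design, not the released code.

```python
import torch
import torch.nn as nn

class CodebookBottleneck(nn.Module):
    """Snap features to their nearest codes; acts as a learned filter."""
    def __init__(self, dim=512, n_codes=256):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(n_codes, dim) * 0.02)

    def forward(self, feats):                   # feats: (B, dim) visual features
        idx = torch.cdist(feats, self.codes).argmin(dim=1)
        quantized = self.codes[idx]
        # Straight-through estimator: the forward pass uses the codes, while
        # the gradient flows back to `feats`; in practice the codes themselves
        # are updated via an auxiliary loss or EMA.
        return feats + (quantized - feats).detach()
```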

Spatio-Temporal Anomaly Detection with Graph Networks for Data Quality Monitoring of the Hadron Calorimeter

  • paper_url: http://arxiv.org/abs/2311.04190
  • repo_url: None
  • paper_authors: Mulugeta Weldezgina Asres, Christian Walter Omlin, Long Wang, David Yu, Pavel Parygin, Jay Dittmann, Georgia Karapostoli, Markus Seidel, Rosamaria Venditti, Luka Lambrecht, Emanuele Usai, Muhammad Ahmad, Javier Fernandez Menendez, Kaori Maeshima, the CMS-HCAL Collaboration
  • for: This study provides a semi-supervised spatio-temporal anomaly detection (AD) monitoring system for the particle-reading channels of the Hadron Calorimeter (HCAL) of the CMS experiment at the LHC, to spot particle data acquisition problems.
  • methods: Using three-dimensional digi-occupancy map data from the data quality monitoring system, convolutional and graph neural networks learn local spatial features induced by traversing particles and global behavior due to shared backend circuits and housing boxes, while recurrent neural networks capture temporal evolution.
  • results: Validated on LHC Run-2 collision data, the proposed AD system accurately captures diverse channel fault types, reaches production-level accuracy, and is being integrated into the CMS core production system.
    Abstract The compact muon solenoid (CMS) experiment is a general-purpose detector for high-energy collisions at the large hadron collider (LHC) at CERN. It employs an online data quality monitoring (DQM) system to promptly spot and diagnose particle data acquisition problems to avoid data quality loss. In this study, we present semi-supervised spatio-temporal anomaly detection (AD) monitoring for the physics particle reading channels of the hadronic calorimeter (HCAL) of the CMS using three-dimensional digi-occupancy map data of the DQM. We propose the GraphSTAD system, which employs convolutional and graph neural networks to learn local spatial characteristics induced by particles traversing the detector, and global behavior owing to shared backend circuit connections and housing boxes of the channels, respectively. Recurrent neural networks capture the temporal evolution of the extracted spatial features. We have validated the accuracy of the proposed AD system in capturing diverse channel fault types using the LHC Run-2 collision data sets. The GraphSTAD system has achieved production-level accuracy and is being integrated into the CMS core production system for real-time monitoring of the HCAL. We have also provided a quantitative performance comparison with alternative benchmark models to demonstrate the promising leverage of the presented system.
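A rough, condensed sketch of the architecture the abstract outlines: per-timestep occupancy maps go through a small CNN, one dense message-passing step shares information across channels, an LSTM models temporal evolution, and anomalies are scored by reconstruction error. All shapes, layer sizes, and the dense adjacency are illustrative assumptions, not the exact GraphSTAD model.

```python
import torch
import torch.nn as nn

class SpatioTemporalAD(nn.Module):
    def __init__(self, h=16, w=16, feat=64, num_nodes=32):
        super().__init__()
        self.cnn = nn.Sequential(                       # local spatial features
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 16, feat))
        self.adj = nn.Parameter(torch.eye(num_nodes), requires_grad=False)
        self.gnn = nn.Linear(feat, feat)                # one message-passing step
        self.rnn = nn.LSTM(feat, feat, batch_first=True)
        self.dec = nn.Linear(feat, h * w)               # reconstruct the map

    def forward(self, x):                               # x: (B, T, N, H, W)
        B, T, N, H, W = x.shape
        f = self.cnn(x.reshape(B * T * N, 1, H, W)).reshape(B, T, N, -1)
        f = torch.relu(self.adj @ self.gnn(f))          # share info across channels
        f = f.mean(dim=2)                               # pool nodes -> (B, T, feat)
        out, _ = self.rnn(f)                            # temporal evolution
        recon = self.dec(out).reshape(B, T, H, W)
        # Anomaly score: how badly the model reconstructs the observed maps.
        target = x.mean(dim=2)                          # (B, T, H, W)
        return ((recon - target) ** 2).mean(dim=(2, 3)) # per-timestep score

model = SpatioTemporalAD()
maps = torch.rand(2, 5, 32, 16, 16)   # (batch, time, channels, height, width)
print(model(maps).shape)              # torch.Size([2, 5])
```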

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

  • paper_url: http://arxiv.org/abs/2311.04934
  • repo_url: None
  • paper_authors: In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
  • for: To accelerate inference for large language models (LLMs) so that user prompts can be answered with lower latency.
  • methods: Precompute and store the attention states of text segments that recur across prompts on the inference server, then reuse those states whenever the segments appear in a user prompt.
  • results: Evaluated across several LLMs, Prompt Cache significantly reduces time-to-first-token latency, especially for long prompts such as document-based question answering and recommendations: improvements reach 8x for GPU-based inference and 60x for CPU-based inference, without modifying model parameters.
    Abstract We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
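A toy sketch of the attention-reuse idea: key/value tensors for a reusable prompt segment ("module") are computed once and cached, then concatenated with the states of the fresh user text. Real systems cache per layer and must respect token positions, which is what the paper's schema guarantees; this single-layer sketch and its module names are simplifying assumptions.

```python
import torch
import torch.nn as nn

d = 64
embed = nn.Embedding(1000, d)
Wk, Wv, Wq = (nn.Linear(d, d) for _ in range(3))

kv_cache = {}  # module name -> (K, V), computed once per reusable segment

def encode_module(name: str, token_ids: torch.Tensor):
    """Precompute and store attention states for a reusable text segment."""
    x = embed(token_ids)
    kv_cache[name] = (Wk(x).detach(), Wv(x).detach())

def attend(user_ids: torch.Tensor, modules: list) -> torch.Tensor:
    """Serve a prompt = cached modules + fresh user text, reusing cached K/V."""
    x = embed(user_ids)
    Ks = [kv_cache[m][0] for m in modules] + [Wk(x)]
    Vs = [kv_cache[m][1] for m in modules] + [Wv(x)]
    K, V = torch.cat(Ks, dim=0), torch.cat(Vs, dim=0)
    Q = Wq(x)                                       # queries only for new tokens
    attn = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                                 # (len(user_ids), d)

# The system prompt is encoded once, then reused across many user prompts;
# only the short user text pays the attention-state computation cost.
encode_module("system_msg", torch.arange(10))
out = attend(torch.tensor([42, 7, 99]), modules=["system_msg"])
print(out.shape)  # torch.Size([3, 64])
```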

On Leakage in Machine Learning Pipelines

  • paper_url: http://arxiv.org/abs/2311.04179
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Leonard Sasse, Eliana Nicolaisen-Sobesky, Juergen Dukart, Simon B. Eickhoff, Michael Götz, Sami Hamdan, Vera Komeyer, Abhijit Kulkarni, Juha Lahnakoski, Bradley C. Love, Federico Raimondo, Kaustubh R. Patil
  • for: To expand understanding of the causes of leakage when designing, implementing, and evaluating ML pipelines, and of how to avoid it.
  • methods: A comprehensive overview and discussion of the various types of leakage that may arise in ML pipelines, illustrated with concrete examples.
  • results: Leakage typically leads to overoptimistic performance estimates and to failure to generalize to new data.
    Abstract Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.
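Since the paper's concrete examples are not reproduced in this abstract, here is a minimal sketch of one of the most common leakage patterns in practice: fitting a preprocessing step on the full dataset before cross-validation. The scikit-learn pipeline variant confines all fitting to the training folds; the dataset and models are toy stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# LEAKY: the scaler sees all rows, including those later used for validation.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# LEAK-FREE: the scaler is re-fit inside each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
clean = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV accuracy:     {leaky.mean():.3f}")
print(f"leak-free CV accuracy: {clean.mean():.3f}")
# With only a scaler the gap is small; with stronger preprocessing such as
# supervised feature selection fit on the full data, it can become dramatic.
```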

Enhancing LLM Intelligence with ARM-RAG: Auxiliary Rationale Memory for Retrieval Augmented Generation

  • paper_url: http://arxiv.org/abs/2311.04177
  • repo_url: None
  • paper_authors: Eric Melz
  • for: To improve the problem-solving intelligence of language models without the high cost of retraining.
  • methods: Retrieval Augmented Generation (RAG), extended with a proposed system named Auxiliary Rationale Memory for Retrieval Augmented Generation (ARM-RAG) that learns from its successes.
  • results: Storing and subsequently retrieving chains of reasoning improves performance on grade-school math problems.
    Abstract Large Language Models (LLMs) are smart but forgetful. Recent studies of modern LLMs (e.g., Bubeck et al., 2023) have shown that they are capable of performing amazing tasks typically necessitating human-level intelligence. However, unlike humans, frozen LLMs do not improve over time; they neither acquire new knowledge nor learn from their successes or failures. Some approaches to improving the intelligence of LLMs include fine-tuning models based on problem-solving performance (Zelikman et al., 2022), and building bigger and more sophisticated models (Bubeck et al., 2023). However, these methods have the drawback of requiring substantial data and computational resources to retrain existing models. In this paper, we explore the use of Retrieval Augmented Generation, also known as RAG (Lewis et al., 2021), to improve problem-solving performance. We propose ARM-RAG (Auxiliary Rationale Memory for Retrieval Augmented Generation), a system that learns from its successes without incurring high training costs. We demonstrate that the storage and subsequent retrieval of reasoning chains have a positive influence on performance in grade-school math problems.
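A small sketch of the rationale-memory idea: successful reasoning chains are stored alongside their problems, and for a new problem the nearest stored rationales are retrieved and prepended to the prompt. TF-IDF retrieval and the class and method names are illustrative stand-ins for ARM-RAG's actual components.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class RationaleMemory:
    def __init__(self):
        self.problems, self.rationales = [], []
        self.vec = TfidfVectorizer()

    def add(self, problem: str, rationale: str):
        """Store a reasoning chain that led to a correct answer."""
        self.problems.append(problem)
        self.rationales.append(rationale)

    def retrieve(self, query: str, k: int = 2) -> list:
        """Return the rationales of the k most similar stored problems."""
        mat = self.vec.fit_transform(self.problems)
        sims = cosine_similarity(self.vec.transform([query]), mat)[0]
        top = sims.argsort()[::-1][:k]
        return [self.rationales[i] for i in top]

memory = RationaleMemory()
memory.add("Tom has 3 apples and buys 4 more. How many apples?",
           "Start with 3, add the 4 bought: 3 + 4 = 7.")
memory.add("A train travels 60 km/h for 2 hours. Distance?",
           "Distance = speed x time = 60 * 2 = 120 km.")

query = "Sara has 5 pens and buys 2 more. How many pens?"
hints = memory.retrieve(query, k=1)
prompt = "Similar solved examples:\n" + "\n".join(hints) + "\nNow solve: " + query
print(prompt)  # this augmented prompt would be sent to the LLM
```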

HADES: Fast Singularity Detection with Local Measure Comparison

  • paper_url: http://arxiv.org/abs/2311.04171
  • repo_url: None
  • paper_authors: Uzu Lim, Harald Oberhauser, Vidit Nanda
  • for: Unsupervised detection of singularities in data.
  • methods: A kernel goodness-of-fit test, with correctness guarantees proved using tools from differential geometry and optimal transport theory.
  • results: Detects singularities with high probability when the data sample lives on a transverse intersection of equidimensional manifolds, and recovers singularities in synthetic, road-network, molecular-conformation, and image data.
    Abstract We introduce Hades, an unsupervised algorithm to detect singularities in data. This algorithm employs a kernel goodness-of-fit test, and as a consequence it is much faster and far more scalable than the existing topology-based alternatives. Using tools from differential geometry and optimal transport theory, we prove that Hades correctly detects singularities with high probability when the data sample lives on a transverse intersection of equidimensional manifolds. In computational experiments, Hades recovers singularities in synthetically generated data, branching points in road network data, intersection rings in molecular conformation space, and anomalies in image data.

Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization

  • paper_url: http://arxiv.org/abs/2311.04163
  • repo_url: None
  • paper_authors: Elan Rosenfeld, Andrej Risteski
  • for: To study the interaction between depth and a particular heavy-tailed structure in natural data during neural network optimization, and to offer intuitive explanations for several previously reported observations about training dynamics.
  • methods: Experimental and theoretical study of opposing signals in training data, including a mechanistic explanation on a toy example, a theoretical analysis of a two-layer linear network on a simple model, and a case study of Adam versus SGD.
  • results: Paired groups of outliers with strong opposing signals have an outsized effect on optimization, suggesting a conceptually new cause for progressive sharpening and the edge of stability, with connections to grokking, simplicity bias, and Sharpness-Aware Minimization; the resulting qualitative predictions of training behavior are confirmed experimentally.
    Abstract We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics. In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability; we also highlight connections to other concepts in optimization and generalization including grokking, simplicity bias, and Sharpness-Aware Minimization. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals: consistent, large magnitude features which dominate the network output throughout training and provide gradients which point in opposite directions. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We describe how to identify these groups, explore what sets them apart, and carefully study their effect on the network's optimization and behavior. We complement these experiments with a mechanistic explanation on a toy example of opposing signals and a theoretical analysis of a two-layer linear network on a simple model. Our finding enables new qualitative predictions of training behavior which we confirm experimentally. It also provides a new lens through which to study and improve modern training practices for stochastic optimization, which we highlight via a case study of Adam versus SGD.
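A tiny numerical illustration of the opposing-signals phenomenon described above, substituting a logistic-regression toy for a deep network: two small outlier groups share one dominant, large-magnitude feature but carry opposite labels, so their gradients on that feature's weight point in opposite directions and optimization must balance them. Purely illustrative, not the paper's experimental setup.

```python
import torch

torch.manual_seed(0)
n, d = 100, 10
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float()

# Inject two outlier groups: same huge feature 0, opposite labels.
X[:5, 0], y[:5] = 20.0, 1.0      # group A: big signal, label 1
X[5:10, 0], y[5:10] = 20.0, 0.0  # group B: same big signal, label 0

w = torch.zeros(d, requires_grad=True)
loss_fn = torch.nn.BCEWithLogitsLoss()

def group_grad(idx):
    w.grad = None
    loss_fn(X[idx] @ w, y[idx]).backward()
    return w.grad[0].item()      # gradient on the dominant feature's weight

gA, gB = group_grad(slice(0, 5)), group_grad(slice(5, 10))
print(f"grad from group A: {gA:+.2f}")   # pushes the weight one way...
print(f"grad from group B: {gB:+.2f}")   # ...while the other group pushes back
```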

A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

  • paper_url: http://arxiv.org/abs/2311.04157
  • repo_url: https://github.com/imageomics/intr
  • paper_authors: Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David Carlyn, Samuel Stevens, Kaiya Provost, Anuj Karpatne, Bryan Carstens, Daniel Rubenstein, Charles Stewart, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao
  • for: To make image classification interpretable by letting each class proactively search for and localize its own patterns in an image.
  • methods: INTR, a Transformer encoder-decoder inspired by DETR, which learns one class-specific query per class as input to the decoder and localizes class patterns via cross-attention.
  • results: Demonstrated on eight datasets for fine-grained classification and analysis; the cross-attention weights provide a faithful interpretation of the prediction, and multi-head cross-attention can identify different attributes of a class.
    Abstract We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully-connected layer to incorporate class information to make predictions, we investigate a proactive approach, asking each class to search for itself in an image. We realize this idea via a Transformer encoder-decoder inspired by DEtection TRansformer (DETR). We learn "class-specific" queries (one for each class) as input to the decoder, enabling each class to localize its patterns in an image via cross-attention. We name our approach INterpretable TRansformer (INTR), which is fairly easy to implement and exhibits several compelling properties. We show that INTR intrinsically encourages each class to attend distinctively; the cross-attention weights thus provide a faithful interpretation of the prediction. Interestingly, via "multi-head" cross-attention, INTR could identify different "attributes" of a class, making it particularly suitable for fine-grained classification and analysis, which we demonstrate on eight datasets. Our code and pre-trained model are publicly accessible at https://github.com/Imageomics/INTR.
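A minimal sketch of the class-specific-query idea: one learnable query per class cross-attends over patch features, each class's logit is read off its own query, and the attention map shows where that class "looked". Dimensions and the scoring head are illustrative assumptions, not the released INTR code.

```python
import torch
import torch.nn as nn

class ClassQueryDecoder(nn.Module):
    def __init__(self, num_classes: int, d: int = 256, heads: int = 4):
        super().__init__()
        self.queries = nn.Embedding(num_classes, d)   # one query per class
        self.xattn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.score = nn.Linear(d, 1)                  # per-class logit

    def forward(self, patches: torch.Tensor):
        B = patches.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, C, d)
        out, attn = self.xattn(q, patches, patches)   # each class searches itself
        logits = self.score(out).squeeze(-1)          # (B, C)
        return logits, attn                           # attn: (B, C, num_patches)

decoder = ClassQueryDecoder(num_classes=10)
patch_feats = torch.randn(2, 196, 256)   # e.g., 14x14 encoder patch tokens
logits, attn = decoder(patch_feats)
print(logits.shape, attn.shape)          # (2, 10) and (2, 10, 196)
# attn[b, c] is an interpretable map of where class c attended in image b.
```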

Contactless Fingerprint Biometric Anti-Spoofing: An Unsupervised Deep Learning Approach

  • paper_url: http://arxiv.org/abs/2311.04148
  • repo_url: None
  • paper_authors: Banafsheh Adami, Nima Karimian
  • for: To retain the user comfort and hygiene benefits of contactless fingerprint recognition while defending it against presentation attacks.
  • methods: An anti-spoofing approach combining an unsupervised autoencoder with a convolutional block attention module, trained exclusively on bonafide images with no exposure to spoofed samples during training.
  • results: An average BPCER of 0.96% with an APCER of 1.6% when tested against various types of presentation attack images.
    Abstract Contactless fingerprint recognition offers a higher level of user comfort and addresses hygiene concerns more effectively. However, it is also more vulnerable to presentation attacks such as photo paper, paper-printout, and various display attacks, which makes it more challenging to implement in biometric systems compared to contact-based modalities. Limited research has been conducted on presentation attacks in contactless fingerprint systems, and these studies have encountered challenges in terms of generalization and scalability since both bonafide samples and presentation attacks are utilized during model training. Although this approach appears promising, it lacks the ability to handle unseen attacks, which is a crucial factor for developing PAD methods that can generalize effectively. We introduce an innovative anti-spoofing approach that combines an unsupervised autoencoder with a convolutional block attention module to address the limitations of existing methods. Our model is exclusively trained on bonafide images without exposure to any spoofed samples during the training phase. It is then evaluated against various types of presentation attack images in the testing phase. The scheme we proposed has achieved an average BPCER of 0.96% with an APCER of 1.6% for presentation attacks involving various types of spoofed samples.
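A compact sketch of the one-class idea: a convolutional autoencoder is trained on bonafide fingerprint images only, and at test time a sample whose reconstruction error exceeds a bonafide-calibrated threshold is flagged as a presentation attack. The attention module (CBAM) used in the paper is omitted for brevity, and all sizes are toy assumptions.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

model = ConvAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

bonafide = torch.rand(8, 1, 64, 64)        # stand-in for real fingerprint crops
for _ in range(5):                          # train to reconstruct bonafide only
    opt.zero_grad()
    loss = ((model(bonafide) - bonafide) ** 2).mean()
    loss.backward()
    opt.step()

def spoof_score(x):                         # higher = more likely an attack
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=(1, 2, 3))

threshold = spoof_score(bonafide).max()     # calibrated on bonafide data only
print(spoof_score(torch.rand(2, 1, 64, 64)) > threshold)
```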

Locating Cross-Task Sequence Continuation Circuits in Transformers

  • paper_url: http://arxiv.org/abs/2311.04131
  • repo_url: None
  • paper_authors: Michael Lan, Fazl Barez
  • for: To investigate how transformer models perform sequence continuation tasks and how they can be reverse engineered into human-readable representations.
  • methods: Transformer models are decomposed into circuits, i.e., interpretable subgraphs that implement algorithmic functions; circuits for similar sequence continuation tasks (increasing sequences of digits, number words, and months) are analyzed and compared.
  • results: Key sub-circuits are identified for detecting sequence members and for predicting the next member, and semantically related sequences are found to rely on shared circuit subgraphs with analogous roles.
    Abstract While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of digits, number words, and months. Through the application of circuit analysis techniques, we identify key sub-circuits responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Overall, documenting shared computational structures enables better prediction of model behaviors, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.

Unveiling Safety Vulnerabilities of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.04124
  • repo_url: None
  • paper_authors: George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, Eitan Farchi
  • for: This paper is written to address the concern of harmful or inappropriate responses from large language models, and to provide a dataset (AttaQ) and a novel approach for identifying and naming vulnerable semantic regions in such models.
  • methods: The paper uses a unique dataset of adversarial examples in the form of questions, and introduces a novel automatic approach for identifying and naming vulnerable semantic regions in models using specialized clustering techniques.
  • results: The paper assesses the efficacy of its dataset and approach by analyzing the vulnerabilities of various models when subjected to the AttaQ dataset, and demonstrates the effectiveness of its approach in identifying and naming vulnerable semantic regions.
    Abstract As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions: input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.
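A small sketch of the clustering step: adversarial questions are embedded and grouped so that each cluster names one candidate vulnerable semantic region, ranked by the mean harmfulness of the model's responses. TF-IDF embeddings and KMeans stand in for the paper's specialized techniques, and the harm scores are made-up values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

attacks = [
    "How do I pick a lock on a house door?",
    "Best way to break into a locked car?",
    "Write an insult about my coworker.",
    "Compose a rude message mocking someone.",
]
harm = np.array([0.9, 0.8, 0.6, 0.7])   # harmfulness of the model's responses

X = TfidfVectorizer().fit_transform(attacks).toarray()
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for c in range(2):
    members = [a for a, l in zip(attacks, labels) if l == c]
    region_harm = harm[labels == c].mean()
    print(f"region {c} (mean harm {region_harm:.2f}): {members}")
# Regions with high mean harm are the vulnerable semantic areas to patch first.
```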

ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations

  • paper_url: http://arxiv.org/abs/2311.04262
  • repo_url: https://github.com/lamps-lab/ETDMiner
  • paper_authors: Muntabir Hasan Choudhury, Lamia Salsabil, William A. Ingram, Edward A. Fox, Jian Wu
  • for: This paper aims to segment Electronic Theses and Dissertations (ETDs) into 13 categories to facilitate navigation and exploration of the content.
  • methods: The proposed method, ETDPC, uses a two-stream multimodal model with a cross-attention network to classify ETD pages. To address the challenge of imbalanced labeled samples, the authors augmented data for minority categories and employed a hierarchical classifier.
  • results: ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 score of 0.84 to 0.96 for 9 out of 13 categories. The authors also demonstrated the data efficiency of their approach.
    Abstract Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference proceedings and journals. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest, discover, and explore the content buried in these long documents. Most existing frameworks on document page classification are designed for classifying general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 to 0.96 for 9 out of 13 categories. We also demonstrated its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation).

Evaluating Large Language Models in Ophthalmology

  • paper_url: http://arxiv.org/abs/2311.04933
  • repo_url: None
  • paper_authors: Jason Holmes, Shuyuan Ye, Yiwei Li, Shi-Nan Wu, Zhengliang Liu, Zihao Wu, Jinyu Hu, Huan Zhao, Xi Jiang, Wei Liu, Hong Wei, Jie Zou, Tianming Liu, Yi Shao
  • for: To evaluate three large language models (GPT-3.5, GPT-4, and PaLM2) on ophthalmology professional questions and compare them with three professional populations (medical undergraduates, medical master's students, and attending physicians).
  • methods: A 100-item ophthalmology single-choice test administered to the three LLMs and the three professional levels, with comprehensive evaluation and comparison of average score, stability, and confidence.
  • results: Every LLM outperformed the undergraduates on average; GPT-3.5 and PaLM2 fell slightly below the master's level, while GPT-4 performed comparably to attending physicians and showed significantly higher answer stability and confidence than GPT-3.5 and PaLM2. The study concludes that LLMs, as represented by GPT-4, perform well in ophthalmology and, with further improvement, could benefit medical education and clinical decision making.
    Abstract Purpose: The performance of three different large language models (LLMs) (GPT-3.5, GPT-4, and PaLM2) in answering ophthalmology professional questions was evaluated and compared with that of three different professional populations (medical undergraduates, medical masters, and attending physicians). Methods: A 100-item ophthalmology single-choice test was administered to three different LLMs (GPT-3.5, GPT-4, and PaLM2) and three different professional levels (medical undergraduates, medical masters, and attending physicians), respectively. The performance of LLM was comprehensively evaluated and compared with the human group in terms of average score, stability, and confidence. Results: Each LLM outperformed undergraduates in general, with GPT-3.5 and PaLM2 being slightly below the master's level, while GPT-4 showed a level comparable to that of attending physicians. In addition, GPT-4 showed significantly higher answer stability and confidence than GPT-3.5 and PaLM2. Conclusion: Our study shows that LLM represented by GPT-4 performs better in the field of ophthalmology. With further improvements, LLM will bring unexpected benefits in medical education and clinical decision making in the near future.

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

  • paper_url: http://arxiv.org/abs/2311.04067
  • repo_url: None
  • paper_authors: Georgios Pantazopoulos, Malvina Nikandrou, Amit Parekh, Bhathiya Hemanthage, Arash Eshghi, Ioannis Konstas, Verena Rieser, Oliver Lemon, Alessandro Suglia
  • for: To address two fundamental challenges that interactive, embodied tasks pose to existing vision-and-language (VL) models: grounding language in trajectories of actions and observations, and referential disambiguation.
  • methods: EMMA (Embodied MultiModal Agent), a unified encoder-decoder model that reasons over images and trajectories and casts action prediction as multimodal text generation; by unifying all tasks as text generation, EMMA learns a language of actions that facilitates transfer across tasks, using a single multitask model instead of independently trained modules.
  • results: EMMA performs on par with similar models on several VL benchmarks and sets a new state-of-the-art success rate of 36.81% on the Dialog-guided Task Completion (DTC) benchmark for dialog-guided agents in the Alexa Arena.
    Abstract Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action prediction as multimodal text generation. By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks. Different to previous modular approaches with independently trained components, we use a single multitask model where each task contributes to goal completion. EMMA performs on par with similar models on several VL benchmarks and sets a new state-of-the-art performance (36.81% success rate) on the Dialog-guided Task Completion (DTC), a benchmark to evaluate dialog-guided agents in the Alexa Arena.

Can CLIP Help Sound Source Localization?

  • paper_url: http://arxiv.org/abs/2311.04066
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Sooyoung Park, Arda Senocak, Joon Son Chung
  • for: To explore whether pre-trained image-text models can be used for sound source localization.
  • methods: The pre-trained CLIP model is used without explicit text input: audio signals are translated into tokens compatible with CLIP's text encoder, and the resulting audio-driven embeddings generate audio-grounded masks whose highlighted image features are aligned with the embeddings via an audio-visual correspondence objective.
  • results: The method produces more complete and compact localization maps for sounding objects and outperforms state-of-the-art approaches by a significant margin.
    Abstract Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment. We extend the application of these models, specifically CLIP, to the domain of sound source localization. Unlike conventional approaches, we employ the pre-trained CLIP model without explicit text input, relying solely on the audio-visual correspondence. To this end, we introduce a framework that translates audio signals into tokens compatible with CLIP's text encoder, yielding audio-driven embeddings. By directly using these embeddings, our method generates audio-grounded masks for the provided audio, extracts audio-grounded image features from the highlighted regions, and aligns them with the audio-driven embeddings using the audio-visual correspondence objective. Our findings suggest that utilizing pre-trained image-text models enable our model to generate more complete and compact localization maps for the sounding objects. Extensive experiments show that our method outperforms state-of-the-art approaches by a significant margin.

Multi-View Causal Representation Learning with Partial Observability

  • paper_url: http://arxiv.org/abs/2311.04056
  • repo_url: None
  • paper_authors: Dingling Yao, Danru Xu, Sébastien Lachapelle, Sara Magliacane, Perouz Taslakian, Georg Martius, Julius von Kügelgen, Francesco Locatello
  • for: studying the identifiability of representations learned from simultaneously observed views, such as different data modalities.
  • methods: using contrastive learning and a single encoder per view to learn the information shared across all subsets of any number of views.
  • results: the paper provides a unified framework and theoretical results that extend and unify several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning, and experimentally validate the claims on numerical, image, and multi-modal data sets.
    Abstract We present a unified framework for studying the identifiability of representations learned from simultaneously observed views, such as different data modalities. We allow a partially observed setting in which each view constitutes a nonlinear mixture of a subset of underlying latent variables, which can be causally related. We prove that the information shared across all subsets of any number of views can be learned up to a smooth bijection using contrastive learning and a single encoder per view. We also provide graphical criteria indicating which latent variables can be identified through a simple set of rules, which we refer to as identifiability algebra. Our general framework and theoretical results unify and extend several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning. We experimentally validate our claims on numerical, image, and multi-modal data sets. Further, we demonstrate that the performance of prior methods is recovered in different special cases of our setup. Overall, we find that access to multiple partial views enables us to identify a more fine-grained representation, under the generally milder assumption of partial observability.
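A minimal sketch of the multi-view setup described above: one small encoder per view, trained with a symmetric InfoNCE objective so that representations of the same underlying sample align across views, which under the paper's assumptions recovers the shared latent content up to a smooth bijection. Encoder sizes, the temperature, and the synthetic shared-latent data are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc1 = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 16))
enc2 = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 16))

def info_nce(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau            # (B, B); diagonal = positive pairs
    target = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.T, target))

# Two simultaneous partial views (e.g., different modalities) of the same
# underlying latents; here random stand-ins with a shared component.
z_shared = torch.randn(32, 10)
view1 = torch.cat([z_shared, torch.randn(32, 10)], dim=1)  # 20 dims
view2 = torch.cat([z_shared, torch.randn(32, 20)], dim=1)  # 30 dims

opt = torch.optim.Adam(list(enc1.parameters()) + list(enc2.parameters()), lr=1e-3)
loss = info_nce(enc1(view1), enc2(view2))
loss.backward()
opt.step()
print(f"contrastive loss: {loss.item():.3f}")
```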

Causal Discovery Under Local Privacy

  • paper_url: http://arxiv.org/abs/2311.04037
  • repo_url: None
  • paper_authors: Rūta Binkytė, Carlos Pinzón, Szilvia Lestyán, Kangsoo Jung, Héber H. Arcolezi, Catuscia Palamidessi
  • for: To study local privacy mechanisms within the differential privacy framework, which protect the sensitive information of data providers even when the server or data collector cannot be trusted.
  • methods: Several well-known locally differentially private mechanisms are applied to data, and the distortion they introduce is evaluated against the accuracy of causal structures produced by causal learning algorithms.
  • results: Different local privacy mechanisms degrade causal discovery to different extents, yielding valuable insights for selecting appropriate locally private protocols for causal discovery tasks.
    Abstract Differential privacy is a widely adopted framework designed to safeguard the sensitive information of data providers within a data set. It is based on the application of controlled noise at the interface between the server that stores and processes the data, and the data consumers. Local differential privacy is a variant that allows data providers to apply the privatization mechanism themselves on their data individually. Therefore it provides protection also in contexts in which the server, or even the data collector, cannot be trusted. The introduction of noise, however, inevitably affects the utility of the data, particularly by distorting the correlations between individual data components. This distortion can prove detrimental to tasks such as causal discovery. In this paper, we consider various well-known locally differentially private mechanisms and compare the trade-off between the privacy they provide, and the accuracy of the causal structure produced by algorithms for causal learning when applied to data obfuscated by these mechanisms. Our analysis yields valuable insights for selecting appropriate local differentially private protocols for causal discovery tasks. We foresee that our findings will aid researchers and practitioners in conducting locally private causal discovery.
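To make the trade-off concrete, here is a small sketch of one classic local privacy mechanism, randomized response for binary data, showing how privatization weakens the dependence between two variables that a causal discovery algorithm would rely on. The budget and data are toy choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0                                     # local privacy budget
p_truth = np.exp(eps) / (np.exp(eps) + 1)     # prob. of reporting the true bit

def randomized_response(bits):
    """Each provider flips their own bit with prob. 1 - p_truth before sending."""
    keep = rng.random(bits.shape) < p_truth
    return np.where(keep, bits, 1 - bits)

# Two correlated binary variables (e.g., a cause and its effect).
x = rng.integers(0, 2, 10_000)
y = np.where(rng.random(10_000) < 0.9, x, 1 - x)   # y mostly copies x

x_priv, y_priv = randomized_response(x), randomized_response(y)
print(f"correlation before privatization: {np.corrcoef(x, y)[0, 1]:.3f}")
print(f"correlation after privatization:  {np.corrcoef(x_priv, y_priv)[0, 1]:.3f}")
# Causal discovery algorithms run on the privatized data only see this
# weakened dependence; the paper quantifies the resulting trade-off.
```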

Impact of HPO on AutoML Forecasting Ensembles

  • paper_url: http://arxiv.org/abs/2311.04034
  • repo_url: None
  • paper_authors: David Hoffmann
  • for: To study the impact of adding hyperparameter optimization (HPO) to the deep learning members of an AutoML forecasting ensemble.
  • methods: A forecasting ensemble of MQ-CNN, DeepAR, Prophet, NPTS, ARIMA, and ETS, with different tuning strategies (Bayesian Optimisation and Hyperband) and configurations applied to the deep learning models (DeepAR and MQ-CNN), trading added training cost against accuracy.
  • results: Adding HPO improves accuracy by 9.9% with respect to the average wQL over the baseline ensemble without HPO, at the cost of a 65.8% increase in end-to-end ensemble latency; the final configuration also outperforms Amazon Forecast, a commercial AutoML forecasting solution, with 3.5% lower error and 16.0% lower end-to-end ensemble latency.
    Abstract A forecasting ensemble consisting of a diverse range of estimators for both local and global univariate forecasting, in particular MQ-CNN, DeepAR, Prophet, NPTS, ARIMA and ETS, can be used to make forecasts for a variety of problems. This paper delves into the aspect of adding different hyperparameter optimization strategies to the deep learning models in such a setup (DeepAR and MQ-CNN), exploring the trade-off between added training cost and the increase in accuracy for different configurations. It shows that in such a setup, adding hyperparameter optimization can lead to performance improvements, with the final setup achieving a 9.9% accuracy improvement with respect to the avg-wQL over the baseline ensemble without HPO, accompanied by a 65.8% increase in end-to-end ensemble latency. This improvement is based on an empirical analysis of combining the ensemble pipeline with different tuning strategies, namely Bayesian Optimisation and Hyperband, and different configurations of those strategies. In the final configuration, the proposed combination of ensemble learning and HPO outperforms the state-of-the-art commercial AutoML forecasting solution, Amazon Forecast, with a 3.5% lower error and 16.0% lower end-to-end ensemble latency.

IoT-Based Environmental Control System for Fish Farms with Sensor Integration and Machine Learning Decision Support

  • paper_url: http://arxiv.org/abs/2311.04258
  • repo_url: None
  • paper_authors: D. Dhinakaran, S. Gopalakrishnan, M. D. Manigandan, T. P. Anish
  • for: To improve environmental control and production efficiency in fish farming, meeting the growing global demand for seafood while emphasizing environmental responsibility and economic viability.
  • methods: An IoT-based system in which a wireless sensor network collects real-time data on water temperature, pH, humidity, and fish behavior; after careful preprocessing (imputation, outlier detection, feature engineering, and synchronization), four machine learning models support decisions: Random Forests predict and optimize water temperature and pH, Support Vector Machines act as an early warning system for fish diseases and parasites, Gradient Boosting Machines dynamically fine-tune the feeding schedule based on real-time conditions, and Neural Networks operate critical equipment such as water pumps and heaters.
  • results: The algorithms collaboratively make real-time decisions that keep environmental conditions within predefined specifications, improving fish health and productivity while reducing resource wastage, thereby contributing to profitability and sustainability.
    Abstract In response to the burgeoning global demand for seafood and the challenges of managing fish farms, we introduce an innovative IoT based environmental control system that integrates sensor technology and advanced machine learning decision support. Deploying a network of wireless sensors within the fish farm, we continuously collect real-time data on crucial environmental parameters, including water temperature, pH levels, humidity, and fish behavior. This data undergoes meticulous preprocessing to ensure its reliability, including imputation, outlier detection, feature engineering, and synchronization. At the heart of our system are four distinct machine learning algorithms: Random Forests predict and optimize water temperature and pH levels for the fish, fostering their health and growth; Support Vector Machines (SVMs) function as an early warning system, promptly detecting diseases and parasites in fish; Gradient Boosting Machines (GBMs) dynamically fine-tune the feeding schedule based on real-time environmental conditions, promoting resource efficiency and fish productivity; Neural Networks manage the operation of critical equipment like water pumps and heaters to maintain the desired environmental conditions within the farm. These machine learning algorithms collaboratively make real-time decisions to ensure that the fish farm's environmental conditions align with predefined specifications, leading to improved fish health and productivity while simultaneously reducing resource wastage, thereby contributing to increased profitability and sustainability. This research article showcases the power of data-driven decision support in fish farming, promising to meet the growing demand for seafood while emphasizing environmental responsibility and economic viability, thus revolutionizing the future of fish farming.

Expressivity of ReLU-Networks under Convex Relaxations

  • paper_url: http://arxiv.org/abs/2311.04015
  • repo_url: None
  • paper_authors: Maximilian Baader, Mark Niklas Müller, Yuhao Mao, Martin Vechev
  • for: To investigate whether fundamental limitations of convex relaxations explain the accuracy gap between certifiably safe neural networks and standard networks.
  • methods: The first in-depth study of the expressive power of ReLU networks under all commonly used convex relaxations, from the simple IBP relaxation to more advanced ones, characterizing which function classes can be expressed so that their analysis is precise.
  • results: More advanced relaxations allow a larger class of univariate functions to be expressed as precisely analyzable ReLU networks and can admit exponentially larger solution spaces; yet even with the most precise single-neuron relaxations, no ReLU network can precisely express multivariate, convex, monotone CPWL functions.
    Abstract Convex relaxations are a key component of training and certifying provably safe neural networks. However, despite substantial progress, a wide and poorly understood accuracy gap to standard networks remains, raising the question of whether this is due to fundamental limitations of convex relaxations. Initial work investigating this question focused on the simple and widely used IBP relaxation. It revealed that some univariate, convex, continuous piecewise linear (CPWL) functions cannot be encoded by any ReLU network such that its IBP-analysis is precise. To explore whether this limitation is shared by more advanced convex relaxations, we conduct the first in-depth study on the expressive power of ReLU networks across all commonly used convex relaxations. We show that: (i) more advanced relaxations allow a larger class of univariate functions to be expressed as precisely analyzable ReLU networks, (ii) more precise relaxations can allow exponentially larger solution spaces of ReLU networks encoding the same functions, and (iii) even using the most precise single-neuron relaxations, it is impossible to construct precisely analyzable ReLU networks that express multivariate, convex, monotone CPWL functions.
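For reference, the simple IBP relaxation studied here can be sketched in a few lines: intervals are propagated exactly through affine layers via a center/radius computation and endpoint-wise through the monotone ReLU. The looseness of the resulting bounds on some functions is exactly what the paper's expressivity analysis characterizes. The weights below are arbitrary examples.

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Exact interval image of an affine layer x -> Wx + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def ibp_relu(lo, hi):
    """ReLU is monotone, so the interval maps endpoint-wise."""
    return np.maximum(lo, 0), np.maximum(hi, 0)

# A tiny 2-layer ReLU net on the 2-d input box [-1, 1]^2.
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -0.5])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.0])

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
lo, hi = ibp_relu(*ibp_affine(lo, hi, W1, b1))
lo, hi = ibp_affine(lo, hi, W2, b2)
print(f"certified output bounds: [{lo[0]:.2f}, {hi[0]:.2f}]")
# The true output range is contained in (and often much tighter than) this box;
# for some CPWL functions no ReLU network makes the IBP bounds exact.
```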

A Method to Improve the Performance of Reinforcement Learning Based on the Y Operator for a Class of Stochastic Differential Equation-Based Child-Mother Systems

  • paper_url: http://arxiv.org/abs/2311.04014
  • repo_url: None
  • paper_authors: Cheng Yin, Yi Chen
  • for: To improve control performance in Actor-Critic (AC) reinforcement learning for systems governed by stochastic differential equations (SDEs).
  • methods: A novel operator, termed the Y operator, which integrates the stochasticity of a class of child-mother systems into the Critic network's loss function and reformulates solving partial differential equations for the state-value function as a parallel problem for the drift and diffusion functions of the system's SDEs.
  • results: The Y operator-based reinforcement learning framework (YORL) efficiently tackles optimal control problems in both model-based and data-driven systems, outperforming existing methods on linear and nonlinear numerical examples after convergence.
    Abstract This paper introduces a novel operator, termed the Y operator, to elevate control performance in Actor-Critic (AC) based reinforcement learning for systems governed by stochastic differential equations (SDEs). The Y operator ingeniously integrates the stochasticity of a class of child-mother systems into the Critic network's loss function, yielding substantial advancements in the control performance of RL algorithms. Additionally, the Y operator elegantly reformulates the challenge of solving partial differential equations for the state-value function into a parallel problem for the drift and diffusion functions within the system's SDEs. A rigorous mathematical proof confirms the operator's validity. This transformation enables the Y Operator-based Reinforcement Learning (YORL) framework to efficiently tackle optimal control problems in both model-based and data-driven systems. The superiority of YORL is demonstrated through linear and nonlinear numerical examples showing its enhanced performance over existing methods post convergence.

The Energy Prediction Smart-Meter Dataset: Analysis of Previous Competitions and Beyond

  • paper_url: http://arxiv.org/abs/2311.04007
  • repo_url: None
  • paper_authors: Direnc Pekaslan, Jose Maria Alonso-Moral, Kasun Bandara, Christoph Bergmeir, Juan Bernabe-Moreno, Robert Eigenmann, Nils Einecke, Selvi Ergen, Rakshitha Godahewa, Hansika Hewamalage, Jesus Lago, Steffen Limmer, Sven Rebhan, Boris Rabinovich, Dilini Rajapasksha, Heda Song, Christian Wagner, Wenlong Wu, Luis Magdalena, Isaac Triguero
  • for: To present a real-world smart-meter dataset and analyze solutions from two energy prediction technical challenges: the 2020 IEEE-CIS Technical Challenge on Energy Prediction from Smart Meter data (EP) and its 2021 follow-up at FUZZ-IEEE (XEP), both aimed at accurate household consumption forecasting and at interpretability of the underlying factors.
  • methods: Analysis of the provided real-world smart-meter data (3,248 meters with between one month and one year of history), of approaches for accurate household-level prediction, and of evaluation criteria for assessing interpretability.
  • results: The paper discusses the challenges, solutions, and data issues, introduces interpretability evaluation criteria, and highlights broader opportunities such as household-level energy disaggregation and pattern detection, communicating energy-driven factors for optimized billing, and responsible AI and data privacy considerations.
    Abstract This paper presents the real-world smart-meter dataset and offers an analysis of solutions derived from the Energy Prediction Technical Challenges, focusing primarily on two key competitions: the IEEE Computational Intelligence Society (IEEE-CIS) Technical Challenge on Energy Prediction from Smart Meter data in 2020 (named EP) and its follow-up challenge at the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) in 2021 (named as XEP). These competitions focus on accurate energy consumption forecasting and the importance of interpretability in understanding the underlying factors. The challenge aims to predict monthly and yearly estimated consumption for households, addressing the accurate billing problem with limited historical smart meter data. The dataset comprises 3,248 smart meters, with varying data availability ranging from a minimum of one month to a year. This paper delves into the challenges, solutions and analysing issues related to the provided real-world smart meter data, developing accurate predictions at the household level, and introducing evaluation criteria for assessing interpretability. Additionally, this paper discusses aspects beyond the competitions: opportunities for energy disaggregation and pattern detection applications at the household level, significance of communicating energy-driven factors for optimised billing, and emphasising the importance of responsible AI and data privacy considerations. These aspects provide insights into the broader implications and potential advancements in energy consumption prediction. Overall, these competitions provide a dataset for residential energy research and serve as a catalyst for exploring accurate forecasting, enhancing interpretability, and driving progress towards the discussion of various aspects such as energy disaggregation, demand response programs or behavioural interventions.

Foundational propositions of hesitant fuzzy sets and parameter reductions of hesitant fuzzy information systems

  • paper_url: http://arxiv.org/abs/2311.04256
  • repo_url: None
  • paper_authors: Shizhan Lu
  • for: To give explicit definitions of inclusion relationships for hesitant fuzzy sets, which are widely used in instances of uncertainty and hesitation.
  • methods: Based on hesitant fuzzy membership degrees of discrete form, several kinds of inclusion relationships for hesitant fuzzy sets are proposed, together with foundational propositions for hesitant fuzzy sets and families of hesitant fuzzy sets.
  • results: Foundational propositions for parameter reductions of hesitant fuzzy information systems are put forward, with an example and an algorithm illustrating the reduction process.
    Abstract Hesitant fuzzy sets are widely used in the instances of uncertainty and hesitation. The inclusion relationship is an important and foundational definition for sets. Hesitant fuzzy set, as a kind of set, needs explicit definition of inclusion relationship. Based on the hesitant fuzzy membership degree of discrete form, several kinds of inclusion relationships for hesitant fuzzy sets are proposed. And then some foundational propositions of hesitant fuzzy sets and the families of hesitant fuzzy sets are presented. Finally, some foundational propositions of hesitant fuzzy information systems with respect to parameter reductions are put forward, and an example and an algorithm are given to illustrate the processes of parameter reductions.
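As an illustrative sketch (not the paper's actual definitions), a hesitant fuzzy set can be represented as a map from elements to finite sets of membership degrees, with one plausible inclusion relation comparing the sorted degree sequences element-wise:

```python
def includes(A: dict, B: dict) -> bool:
    """True if hesitant fuzzy set A is contained in B under a sorted,
    element-wise comparison of membership degrees (an assumed convention)."""
    for x, degrees_a in A.items():
        degrees_b = B.get(x, [])
        if len(degrees_a) != len(degrees_b):
            return False        # simplest convention: equal-length degree sets
        for da, db in zip(sorted(degrees_a), sorted(degrees_b)):
            if da > db:
                return False
    return True

# Hesitant membership: experts disagree, so each element maps to several degrees.
A = {"x1": [0.2, 0.4], "x2": [0.5, 0.6]}
B = {"x1": [0.3, 0.5], "x2": [0.5, 0.8]}
print(includes(A, B))  # True: every sorted degree of A is dominated by B's
print(includes(B, A))  # False
```

The paper proposes several distinct inclusion relations of this kind; the sketch shows only one candidate definition to make the notion concrete.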

Human-AI Collaboration in Thematic Analysis using ChatGPT: A User Study and Design Recommendations

  • paper_url: http://arxiv.org/abs/2311.03999
  • repo_url: None
  • paper_authors: Lixiang Yan, Vanessa Echeverria, Gloria Fernandez Nieto, Yueqiao Jin, Zachari Swiecki, Linxuan Zhao, Dragan Gašević, Roberto Martinez-Maldonado
  • for: To explore how researchers collaborate with generative AI (GenAI), specifically ChatGPT, in qualitative thematic analysis.
  • methods: A user study with ten qualitative researchers collaborating with ChatGPT.
  • results: ChatGPT proved a valuable collaborator for thematic analysis, enhancing coding efficiency, aiding initial data exploration, offering granular quantitative insights, and assisting comprehension for non-native speakers and non-experts; yet concerns persist about its trustworthiness and accuracy, reliability and consistency, limited contextual understanding, and broader acceptance within the research community. The study contributes five actionable design recommendations: transparent explanatory mechanisms, enhanced interface and integration capabilities, prioritized contextual understanding and customization, embedded human-AI feedback loops with iterative functionality, and trust strengthened through validation mechanisms.
    Abstract Generative artificial intelligence (GenAI) offers promising potential for advancing human-AI collaboration in qualitative research. However, existing works focused on conventional machine-learning and pattern-based AI systems, and little is known about how researchers interact with GenAI in qualitative research. This work delves into researchers' perceptions of their collaboration with GenAI, specifically ChatGPT. Through a user study involving ten qualitative researchers, we found ChatGPT to be a valuable collaborator for thematic analysis, enhancing coding efficiency, aiding initial data exploration, offering granular quantitative insights, and assisting comprehension for non-native speakers and non-experts. Yet, concerns about its trustworthiness and accuracy, reliability and consistency, limited contextual understanding, and broader acceptance within the research community persist. We contribute five actionable design recommendations to foster effective human-AI collaboration. These include incorporating transparent explanatory mechanisms, enhancing interface and integration capabilities, prioritising contextual understanding and customisation, embedding human-AI feedback loops and iterative functionality, and strengthening trust through validation mechanisms.

Learned Causal Method Prediction

  • paper_url: http://arxiv.org/abs/2311.03989
  • repo_url: None
  • paper_authors: Shantanu Gupta, Cheng Zhang, Agrin Hilmkil
  • for: To make selecting a causal inference method for a given dataset more efficient, since causal methods rely on complex and difficult-to-verify assumptions, and cross-validation is inapplicable because ground-truth causal quantities are unobserved.
  • methods: The CAusal Method Predictor (CAMP) framework: datasets are generated from a diverse set of synthetic causal models, candidate methods are scored, and a model is trained to predict the highest-scoring method for a dataset; a self-supervised pre-training objective centered on dataset assumptions relevant for causal inference reduces the need for costly labeled data.
  • results: For causal discovery, CAMP outperforms selecting any individual candidate method and shows promising generalization to unseen semi-synthetic and real-world benchmarks.
    Abstract For a given causal question, it is important to efficiently decide which causal inference method to use for a given dataset. This is challenging because causal methods typically rely on complex and difficult-to-verify assumptions, and cross-validation is not applicable since ground truth causal quantities are unobserved. In this work, we propose CAusal Method Predictor (CAMP), a framework for predicting the best method for a given dataset. To this end, we generate datasets from a diverse set of synthetic causal models, score the candidate methods, and train a model to directly predict the highest-scoring method for that dataset. Next, by formulating a self-supervised pre-training objective centered on dataset assumptions relevant for causal inference, we significantly reduce the need for costly labeled data and enhance training efficiency. Our strategy learns to map implicit dataset properties to the best method in a data-driven manner. In our experiments, we focus on method prediction for causal discovery. CAMP outperforms selecting any individual candidate method and demonstrates promising generalization to unseen semi-synthetic and real-world benchmarks.

Its All Graph To Me: Foundational Topology Models with Contrastive Learning on Multiple Domains

  • paper_url: http://arxiv.org/abs/2311.03976
  • repo_url: None
  • paper_authors: Alex O. Davies, Riku W. Green, Nirav S. Ajmeri, Telmo M. Silva Filho
  • for: To propose a graph model pre-trained across multiple domains, so that downstream graph tasks can be tackled even when data or labels are scarce.
  • methods: Adversarial contrastive learning is used to pre-train a single model on topologies from many graph domains; node labels are used only in evaluation.
  • results: Against baselines pre-trained on single domains, as well as untrained and non-transferred models, the single multi-domain model performs equally well or better, including when node labels are used in evaluation.
    Abstract Representations and embeddings of graph data have been essential in many domains of research. The principal benefit of learning such representations is that the pre-trained model can be fine-tuned on smaller datasets where data or labels are scarce. Existing models, however, are domain specific; for example a model trained on molecular graphs is fine-tuned on other molecular graphs. This means that in many application cases the choice of pre-trained model can be arbitrary, and novel domains may lack an appropriate pre-trained model. This is a particular issue where data is scarce, precluding traditional supervised methods. In this work we use adversarial contrastive learning to present a model pre-trained on many graph domains. We train the model only on topologies but include node labels in evaluation. We evaluate the efficacy of its learnt representations on various downstream tasks. Against baseline models pre-trained on single domains, as well as un-trained models and non-transferred models, we show that performance is equal or better using our single model. This includes when node labels are used in evaluation, where performance is consistently superior to single-domain or non-pre-trained models.
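
For reference, a minimal version of the contrastive objective that typically underlies such pre-training is sketched below; it is a generic NT-Xent loss over two augmented views, not the paper's exact adversarial formulation.

```python
# Minimal NT-Xent (normalized temperature-scaled cross-entropy) loss, the
# standard objective behind most graph contrastive pre-training.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: [batch, dim] embeddings of two augmented views of the same graphs."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2B, d]
    sim = z @ z.t() / temperature                        # pairwise cosine similarity
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                # exclude self-pairs
    # positives: view i pairs with i+n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Example: random embeddings standing in for encoder outputs of two views.
loss = nt_xent(torch.randn(8, 64), torch.randn(8, 64))
```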

An Expectation-Realization Model for Metaphor Detection

  • paper_url: http://arxiv.org/abs/2311.03963
  • repo_url: None
  • paper_authors: Oseremen O. Uduehi, Razvan C. Bunescu
  • for: This paper proposes a metaphor detection architecture built around expectation and realization modules, aiming to improve metaphor detection accuracy.
  • methods: The method uses two main modules: an expectation component that estimates representations of literal word expectations given a context, and a realization component that computes representations of actual word meanings in context. The overall architecture is trained to learn expectation-realization (ER) patterns that characterize metaphorical word usage.
  • results: Evaluated on three metaphor datasets for within-distribution, out-of-distribution, and novel metaphor generalization, the proposed method obtains results competitive with or better than the state of the art. Ensembling ER models further improves metaphor detection accuracy.
    Abstract We propose a metaphor detection architecture that is structured around two main modules: an expectation component that estimates representations of literal word expectations given a context, and a realization component that computes representations of actual word meanings in context. The overall architecture is trained to learn expectation-realization (ER) patterns that characterize metaphorical uses of words. When evaluated on three metaphor datasets for within distribution, out of distribution, and novel metaphor generalization, the proposed method is shown to obtain results that are competitive with or better than the state of the art. Further increases in metaphor detection accuracy are obtained through ensembling of ER models.
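
A hypothetical sketch of the two-module design follows; the module shapes and the mismatch features are illustrative assumptions, since the abstract does not specify them.

```python
# Hypothetical expectation-realization (ER) classifier: the mismatch between
# what a word "should" mean in context and what it does mean drives the
# metaphor decision. Architecture details here are assumptions.
import torch
import torch.nn as nn

class ERDetector(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.expectation = nn.Linear(dim, dim)   # literal expectation from context
        self.realization = nn.Linear(dim, dim)   # actual in-context meaning
        self.classifier = nn.Linear(3 * dim, 2)  # metaphorical vs. literal

    def forward(self, context_repr, word_repr):
        e = self.expectation(context_repr)
        r = self.realization(word_repr)
        feats = torch.cat([e, r, e - r], dim=-1)  # mismatch as an explicit feature
        return self.classifier(feats)

model = ERDetector()
logits = model(torch.randn(4, 768), torch.randn(4, 768))
```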

Elastic Information Bottleneck

  • paper_url: http://arxiv.org/abs/2311.03955
  • repo_url: https://github.com/nyyxxx/elastic-information-bottleneck
  • paper_authors: Yuyan Ni, Yanyan Lan, Ao Liu, Zhiming Ma
  • for: This paper studies the information bottleneck principle for representation learning in deep learning, comparing two methods: the information bottleneck (IB) and the deterministic information bottleneck (DIB).
  • methods: The paper analyzes the generalization of IB and DIB in a transfer learning setting and, based on the comparison, proposes a new extension of IB, the elastic information bottleneck (EIB), which interpolates between the IB and DIB regularizers.
  • results: Simulations and real-data experiments show that EIB achieves better domain adaptation results than IB and DIB, validating the theory.
    Abstract Information bottleneck is an information-theoretic principle of representation learning that aims to learn a maximally compressed representation that preserves as much information about labels as possible. Under this principle, two different methods have been proposed, i.e., information bottleneck (IB) and deterministic information bottleneck (DIB), and have made significant progress in explaining the representation mechanisms of deep learning algorithms. However, these theoretical and empirical successes are only valid with the assumption that training and test data are drawn from the same distribution, which is clearly not satisfied in many real-world applications. In this paper, we study their generalization abilities within a transfer learning scenario, where the target error could be decomposed into three components, i.e., source empirical error, source generalization gap (SG), and representation discrepancy (RD). Comparing IB and DIB on these terms, we prove that DIB's SG bound is tighter than IB's while DIB's RD is larger than IB's. Therefore, it is difficult to tell which one is better. To balance the trade-off between SG and the RD, we propose an elastic information bottleneck (EIB) to interpolate between the IB and DIB regularizers, which guarantees a Pareto frontier within the IB framework. Additionally, simulations and real data experiments show that EIB has the ability to achieve better domain adaptation results than IB and DIB, which validates the correctness of our theories.
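
To make the IB-DIB interpolation concrete, the sketch below computes an interpolated compression regularizer for a stochastic encoder with a discrete representation. The specific form R_alpha = H(T) - alpha*H(T|X) is an assumption for illustration: alpha = 1 recovers the IB term I(X;T) = H(T) - H(T|X), and alpha = 0 recovers the DIB term H(T).

```python
# Illustrative EIB-style compression regularizer for a stochastic encoder
# p(t|x) over a discrete representation T. The interpolation form used here,
# R_alpha = H(T) - alpha * H(T|X), is an assumption: alpha=1 recovers the IB
# term I(X;T) = H(T) - H(T|X), alpha=0 recovers the DIB term H(T).
import torch

def eib_regularizer(p_t_given_x, alpha):
    """p_t_given_x: [batch, K] rows are encoder distributions p(t|x_i)."""
    eps = 1e-12
    p_t = p_t_given_x.mean(dim=0)                                  # marginal p(t)
    h_t = -(p_t * (p_t + eps).log()).sum()                         # H(T)
    h_t_given_x = -(p_t_given_x * (p_t_given_x + eps).log()).sum(dim=1).mean()  # H(T|X)
    return h_t - alpha * h_t_given_x

probs = torch.softmax(torch.randn(32, 10), dim=1)
print(eib_regularizer(probs, alpha=0.5))   # between DIB (0.0) and IB (1.0)
```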

The Music Meta Ontology: a flexible semantic model for the interoperability of music metadata

  • paper_url: http://arxiv.org/abs/2311.03942
  • repo_url: None
  • paper_authors: Jacopo de Berardinis, Valentina Anita Carriero, Albert Meroño-Peñuela, Andrea Poltronieri, Valentina Presutti
  • for: The paper aims to provide a semantic description of music metadata, enabling the creation of music datasets that can be aligned, integrated, and accessed for information retrieval and knowledge discovery.
  • methods: The model is designed with eXtreme Design methodologies and data engineering best practices, reflecting the requirements and perspectives of different stakeholders (musicologists, librarians, data engineers, etc.), while leveraging ontology design patterns and accounting for provenance at different levels (claims, links).
  • results: The paper introduces the Music Meta ontology, a rich and flexible semantic model for describing music metadata related to artists, compositions, performances, recordings, and links. It also provides alignments to other schemas (Music Ontology, DOREMUS, Wikidata) and support for data transformation.
    Abstract The semantic description of music metadata is a key requirement for the creation of music datasets that can be aligned, integrated, and accessed for information retrieval and knowledge discovery. It is nonetheless an open challenge due to the complexity of musical concepts arising from different genres, styles, and periods -- standing to benefit from a lingua franca to accommodate various stakeholders (musicologists, librarians, data engineers, etc.). To initiate this transition, we introduce the Music Meta ontology, a rich and flexible semantic model to describe music metadata related to artists, compositions, performances, recordings, and links. We follow eXtreme Design methodologies and best practices for data engineering, to reflect the perspectives and the requirements of various stakeholders into the design of the model, while leveraging ontology design patterns and accounting for provenance at different levels (claims, links). After presenting the main features of Music Meta, we provide a first evaluation of the model, alignments to other schema (Music Ontology, DOREMUS, Wikidata), and support for data transformation.

Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

  • paper_url: http://arxiv.org/abs/2311.04254
  • repo_url: None
  • paper_authors: Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
  • for: The paper aims to improve the decision-making capabilities of Large Language Models (LLMs) by developing a novel thought prompting approach called "Everything of Thoughts" (XoT).
  • methods: The approach leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts, and autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions.
  • results: The approach enables LLMs to generalize to unseen problems efficiently and to engage in unconstrained thinking, allowing flexible cognitive mappings for problems with multiple solutions.
    Abstract Recent advancements in Large Language Models (LLMs) have revolutionized decision-making by breaking down complex problems into more manageable language sequences referred to as "thoughts". An effective thought design should consider three key perspectives: performance, efficiency, and flexibility. However, existing thought paradigms can exhibit at most two of these attributes. To address these limitations, we introduce a novel thought prompting approach called "Everything of Thoughts" (XoT) to defy the law of the "Penrose triangle" of existing thought paradigms. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts, thereby enhancing LLMs' capabilities and enabling them to generalize to unseen problems efficiently. Through the utilization of the MCTS-LLM collaborative thought revision framework, this approach autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to engage in unconstrained thinking, allowing for flexible cognitive mappings for problems with multiple solutions.
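
To illustrate the search component that XoT builds on, here is a toy MCTS over a "thought tree"; the state space, expansion rule, and reward are placeholders standing in for the paper's pretrained policy and value networks.

```python
# Toy MCTS over a "thought tree". In XoT the rollout is replaced by learned
# policy/value networks; everything below is a simplified stand-in.
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float('inf')
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, expand, rollout, iters=200):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        while node.children:                          # selection
            node = max(node.children, key=ucb)
        for s in expand(node.state):                  # expansion
            node.children.append(Node(s, parent=node))
        if node.children:
            node = random.choice(node.children)
        reward = rollout(node.state)                  # simulation (a value net in XoT)
        while node:                                   # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state

# Toy problem: grow a sequence of "thoughts" whose sum approaches 10.
expand = lambda s: [s + [d] for d in range(1, 4)] if len(s) < 5 else []
rollout = lambda s: -abs(10 - sum(s))
print(mcts([], expand, rollout))
```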

MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters

  • paper_url: http://arxiv.org/abs/2311.04251
  • repo_url: https://github.com/chaudatascience/mixturegrowth
  • paper_authors: Chau Pham, Piotr Teterwak, Soren Nelson, Bryan A. Plummer
  • for: This paper proposes a new approach to growing neural networks that avoids the cost of retraining from scratch when the architecture is expanded.
  • methods: MixtureGrowth generates each layer's parameters as a linear combination of parameter templates; newly grown layer weights use a new linear combination of existing templates, giving them a strong initialization while leaving room to learn something new.
  • results: MixtureGrowth boosts top-1 accuracy over the state of the art by 2-2.5% on CIFAR-100 and ImageNet, while achieving comparable performance with fewer FLOPs than a larger network trained from scratch.
    Abstract Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data for generating new weights through conducting a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead in prior work. Before growing, each layer in our model is generated with a linear combination of parameter templates. Newly grown layer weights are generated by using a new linear combination of existing templates for a layer. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet datasets, while achieving comparable performance with fewer FLOPs to a larger network trained from scratch. Code is available at https://github.com/chaudatascience/mixturegrowth.
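
A minimal sketch of the template idea, assuming a linear layer whose weight is a learned combination of shared templates; the growing step shown (fresh coefficients over reused templates) is a simplification of the paper's procedure.

```python
# Layer whose weight is a linear combination of shared parameter templates,
# in the spirit of MixtureGrowth; an illustrative simplification.
import torch
import torch.nn as nn

class TemplateLinear(nn.Module):
    def __init__(self, in_dim, out_dim, n_templates=4):
        super().__init__()
        self.templates = nn.Parameter(torch.randn(n_templates, out_dim, in_dim) * 0.02)
        self.coeffs = nn.Parameter(torch.randn(n_templates))

    def weight(self):
        # effective weight = sum_i coeffs[i] * templates[i]
        return torch.einsum('t,toi->oi', self.coeffs, self.templates)

    def forward(self, x):
        return x @ self.weight().t()

layer = TemplateLinear(16, 32)
# "Grow": reuse the trained templates with fresh coefficients for a new layer;
# templates carry task knowledge, new coefficients give flexibility to learn.
grown = TemplateLinear(16, 32)
grown.templates = layer.templates        # share the trained templates
grown.templates.requires_grad_(False)    # optionally freeze the shared templates
out = grown(torch.randn(8, 16))
```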

Temporal Graph Representation Learning with Adaptive Augmentation Contrastive

  • paper_url: http://arxiv.org/abs/2311.03897
  • repo_url: None
  • paper_authors: Hongjiang Chen, Pengfei Jiao, Huijun Tang, Huaming Wu
  • for: This work aims to generate low-dimensional dynamic node embeddings that capture temporal information as well as structural and property information.
  • methods: The paper proposes a Temporal Graph representation learning model with Adaptive augmentation Contrastive (TGAC), which performs adaptive augmentation by combining prior knowledge with temporal information and defines a contrastive objective with augmented inter-view and intra-view contrast.
  • results: Extensive experiments on various real networks demonstrate that the proposed model outperforms other temporal graph representation learning methods.
    Abstract Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information as well as structural and property information. Current representation learning methods for temporal networks often focus on capturing fine-grained information, which may lead to the model capturing random noise instead of essential semantic information. While graph contrastive learning has shown promise in dealing with noise, it only applies to static graphs or snapshots and may not be suitable for handling time-dependent noise. To alleviate the above challenge, we propose a novel Temporal Graph representation learning with Adaptive augmentation Contrastive (TGAC) model. The adaptive augmentation on the temporal graph is made by combining prior knowledge with temporal information, and the contrastive objective function is constructed by defining the augmented inter-view contrast and intra-view contrast. To complement TGAC, we propose three adaptive augmentation strategies that modify topological features to reduce noise from the network. Our extensive experiments on various real networks demonstrate that the proposed model outperforms other temporal graph representation learning methods.
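
One simple way to realize time-aware augmentation is sketched below: edges are dropped with probability decreasing in recency, so recent interactions are more likely to survive in the augmented view. The weighting rule is an assumption, not TGAC's exact strategy.

```python
# Illustrative adaptive augmentation for a temporal graph: drop edges with a
# probability that decreases with recency. The weighting scheme is an
# assumption for the sketch.
import numpy as np

def temporal_edge_drop(edges, timestamps, max_drop=0.6, rng=None):
    """edges: [E, 2] array; timestamps: [E] array; returns an augmented view."""
    rng = rng or np.random.default_rng()
    t = np.asarray(timestamps, dtype=float)
    recency = (t - t.min()) / (t.max() - t.min() + 1e-9)   # 0 = oldest, 1 = newest
    drop_prob = max_drop * (1.0 - recency)                 # older edges dropped more
    keep = rng.random(len(t)) > drop_prob
    return edges[keep], t[keep]

edges = np.array([[0, 1], [1, 2], [2, 3], [0, 3]])
view_edges, view_t = temporal_edge_drop(edges, [1.0, 2.0, 3.0, 4.0])
```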

Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors

  • paper_url: http://arxiv.org/abs/2311.04250
  • repo_url: None
  • paper_authors: Sang-Hyun Je, Wontae Choi, Kwangjin Oh
  • for: Knowledge graph completion (KGC): predicting missing links in a knowledge graph (KG) from the training facts that are already known.
  • methods: Pre-trained language model (PLM) based methods that use both textual and structural information exist, but their performance lags behind state-of-the-art structure-based methods, or they lose inductive inference capability when fusing structure embeddings into the text encoder.
  • results: The paper proposes a new method that effectively unifies structural information and language semantics without losing inductive reasoning power. It adopts entity anchors, feeding them together with the textual descriptions of KG elements into a PLM-based encoder to learn unified representations, and reuses additional random negative samples within each mini-batch during contrastive learning to learn generalized entity representations. Experiments on standard link prediction benchmarks show that the method outperforms existing SOTA KGC models, with the largest gain on FB15K-237, where it is competitive with structure-based KGC methods.
    Abstract The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known. Recently, pre-trained language model (PLM) based methods that utilize both textual and structural information have been emerging, but their performance lags behind state-of-the-art (SOTA) structure-based methods, or they lose their inductive inference capabilities in the process of fusing structure embeddings into the text encoder. In this paper, we propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning. We adopt entity anchors, and these anchors and the textual descriptions of KG elements are fed together into the PLM-based encoder to learn unified representations. In addition, the proposed method utilizes additional random negative samples which can be reused in each mini-batch during contrastive learning to learn generalized entity representations. We verify the effectiveness of our proposed method through various experiments and analyses. The experimental results on standard benchmarks widely used in the link prediction task show that the proposed model outperforms existing SOTA KGC models. In particular, our method shows the largest performance improvement on FB15K-237, where it is competitive with SOTA structure-based KGC methods.
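
A generic version of the in-batch contrastive objective with reusable random negatives is sketched below; the encoder and scoring details are assumptions.

```python
# Minimal in-batch contrastive objective for KGC with reusable random
# negatives: each (head, relation) query is scored against the gold tails in
# the batch plus a shared pool of random negatives. A generic InfoNCE sketch,
# not the paper's exact model.
import torch
import torch.nn.functional as F

def kgc_contrastive_loss(query_emb, tail_emb, neg_emb, temperature=0.05):
    """query_emb: [B, d] encoder output for (head, relation, [MASK]);
       tail_emb:  [B, d] embeddings of the gold tails;
       neg_emb:   [N, d] random negatives shared across the mini-batch."""
    q = F.normalize(query_emb, dim=1)
    cands = F.normalize(torch.cat([tail_emb, neg_emb], dim=0), dim=1)  # [B+N, d]
    logits = q @ cands.t() / temperature                               # [B, B+N]
    targets = torch.arange(q.size(0), device=q.device)                 # gold = diagonal
    return F.cross_entropy(logits, targets)

loss = kgc_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(64, 256))
```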

Understanding Tool Discovery and Tool Innovation Using Active Inference

  • paper_url: http://arxiv.org/abs/2311.03893
  • repo_url: None
  • paper_authors: Poppy Collis, Paul F Kinghorn, Christopher L Buckley
  • for: This paper explores how artificial agents can discover and invent new tools.
  • methods: The paper gives a minimal description distinguishing tool discovery from tool innovation under the formalism of active inference, then constructs a toy model of tool innovation by introducing tool affordances into the hidden states of the agent's probabilistic generative model.
  • results: With tool affordances factored into the hidden states, the agent can not only discover tools but invent them through offline induction of an appropriate tool property. The paper discusses the implications of these preliminary results and outlines future research directions.
    Abstract The ability to invent new tools has been identified as an important facet of our ability as a species to problem solve in dynamic and novel environments. While the use of tools by artificial agents presents a challenging task and has been widely identified as a key goal in the field of autonomous robotics, far less research has tackled the invention of new tools by agents. In this paper, (1) we articulate the distinction between tool discovery and tool innovation by providing a minimal description of the two concepts under the formalism of active inference. We then (2) apply this description to construct a toy model of tool innovation by introducing the notion of tool affordances into the hidden states of the agent's probabilistic generative model. This particular state factorisation facilitates the ability to not just discover tools but invent them through the offline induction of an appropriate tool property. We discuss the implications of these preliminary results and outline future directions of research.

Formulating Discrete Probability Flow Through Optimal Transport

  • paper_url: http://arxiv.org/abs/2311.03886
  • repo_url: https://github.com/pangzecheung/discrete-probability-flow
  • paper_authors: Pengze Zhang, Hubery Yin, Chen Li, Xiaohua Xie
  • for: This paper aims to establish the fundamental theory for the probability flow of discrete diffusion models.
  • methods: The authors first prove that the continuous probability flow is the Monge optimal transport map under certain conditions, and provide an equivalent result for the discrete case. Based on these findings, they define the discrete probability flow in line with the principles of optimal transport and propose a novel sampling method built on it.
  • results: Extensive experiments on a synthetic toy dataset and on CIFAR-10 validate the effectiveness of the proposed discrete probability flow. Code is available at: https://github.com/PangzeCheung/Discrete-Probability-Flow.
    Abstract Continuous diffusion models are commonly acknowledged to display a deterministic probability flow, whereas discrete diffusion models do not. In this paper, we aim to establish the fundamental theory for the probability flow of discrete diffusion models. Specifically, we first prove that the continuous probability flow is the Monge optimal transport map under certain conditions, and also present an equivalent evidence for discrete cases. In view of these findings, we are then able to define the discrete probability flow in line with the principles of optimal transport. Finally, drawing upon our newly established definitions, we propose a novel sampling method that surpasses previous discrete diffusion models in its ability to generate more certain outcomes. Extensive experiments on the synthetic toy dataset and the CIFAR-10 dataset have validated the effectiveness of our proposed discrete probability flow. Code is released at: https://github.com/PangzeCheung/Discrete-Probability-Flow.

Mini but Mighty: Finetuning ViTs with Mini Adapters

  • paper_url: http://arxiv.org/abs/2311.03873
  • repo_url: https://github.com/iemprog/mimi
  • paper_authors: Imad Eddine Marouf, Enzo Tartaglione, Stéphane Lathuilière
  • for: To improve the performance of adapters for parameter-efficient fine-tuning of Vision Transformers while avoiding prohibitive training and storage costs.
  • methods: The paper proposes MiMi, a training framework that starts with large adapters that can reach high performance and iteratively reduces their size. It also introduces a new scoring function, designed specifically for adapters, that compares neuron importance across layers to automatically estimate the hidden dimension of every adapter.
  • results: MiMi outperforms existing methods in finding the best trade-off between accuracy and trained parameters across the DomainNet, VTAB, and Multi-task benchmarks, covering 29 datasets in total.
    Abstract Vision Transformers (ViTs) have become one of the dominant architectures in computer vision, and pre-trained ViT models are commonly adapted to new tasks via fine-tuning. Recent works proposed several parameter-efficient transfer learning methods, such as adapters, to avoid the prohibitive training and storage cost of finetuning. In this work, we observe that adapters perform poorly when the dimension of adapters is small, and we propose MiMi, a training framework that addresses this issue. We start with large adapters which can reach high performance, and iteratively reduce their size. To enable automatic estimation of the hidden dimension of every adapter, we also introduce a new scoring function, specifically designed for adapters, that compares the neuron importance across layers. Our method outperforms existing methods in finding the best trade-off between accuracy and trained parameters across the three dataset benchmarks DomainNet, VTAB, and Multi-task, for a total of 29 datasets.
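
For context, the object MiMi shrinks is the standard bottleneck adapter shown below; the module itself is generic, while the iterative reduction of hidden_dim guided by the cross-layer neuron-importance score is the paper's contribution and is not reproduced here.

```python
# A standard bottleneck adapter block. MiMi starts with a large hidden_dim and
# iteratively shrinks it; the shrinking schedule is not reproduced here.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, hidden_dim=256):
        super().__init__()
        self.down = nn.Linear(dim, hidden_dim)   # start large, per MiMi
        self.up = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual adapter

# During ViT fine-tuning, only the adapters train; the backbone stays frozen.
adapter = Adapter()
x = torch.randn(2, 197, 768)                     # e.g., a ViT token sequence
out = adapter(x)
```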

FD-MIA: Efficient Attacks on Fairness-enhanced Models

  • paper_url: http://arxiv.org/abs/2311.03865
  • repo_url: None
  • paper_authors: Huan Tian, Guangsheng Zhang, Bo Liu, Tianqing Zhu, Ming Ding, Wanlei Zhou
  • for: This paper examines membership inference attacks (MIAs) against fairness-enhanced models, i.e., whether fairness methods change a model's vulnerability to attackers inferring training membership.
  • methods: The paper proposes an attack based on fairness discrepancy (FD-MIA), which leverages the difference between the predictions of the original and the fairness-enhanced model and exploits the observed prediction gaps as attack clues.
  • results: The study finds that conventional score-based MIAs are ineffective against fairness-enhanced models in binary classification, degrading into simplistic threshold models. Meanwhile, fairness methods often degrade prediction performance for the majority subgroups of the training data, which raises the barrier to successful attacks but widens the prediction gaps between member and non-member data that FD-MIA exploits.
    Abstract Previous studies have developed fairness methods for biased models that exhibit discriminatory behaviors towards specific subgroups. While these models have shown promise in achieving fair predictions, recent research has identified their potential vulnerability to score-based membership inference attacks (MIAs). In these attacks, adversaries can infer whether a particular data sample was used during training by analyzing the model's prediction scores. However, our investigations reveal that these score-based MIAs are ineffective when targeting fairness-enhanced models in binary classifications. The attack models trained to launch the MIAs degrade into simplistic threshold models, resulting in lower attack performance. Meanwhile, we observe that fairness methods often lead to prediction performance degradation for the majority subgroups of the training data. This raises the barrier to successful attacks and widens the prediction gaps between member and non-member data. Building upon these insights, we propose an efficient MIA method against fairness-enhanced models based on fairness discrepancy results (FD-MIA). It leverages the difference in the predictions from both the original and fairness-enhanced models and exploits the observed prediction gaps as attack clues. We also explore potential strategies for mitigating privacy leakages. Extensive experiments validate our findings and demonstrate the efficacy of the proposed method.
    摘要 Building on these insights, we propose an efficient MIA method against fairness-enhanced models based on fairness discrepancy results (FD-MIA). This method leverages the difference in predictions from both the original and fairness-enhanced models and exploits the observed prediction gaps as attack clues. We also explore potential strategies for mitigating privacy leakages. Our extensive experiments validate our findings and demonstrate the effectiveness of the proposed method.

Aspects of human memory and Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03839
  • repo_url: https://github.com/rmldj/memory-llm-paper
  • paper_authors: Romuald A. Janik
  • for: To study the memory properties of Large Language Models (LLMs) and understand in what ways they resemble human memory.
  • methods: By analyzing LLM behavior, the authors find surprising similarities with key characteristics of human memory, which they trace to the statistics of the training text data.
  • results: The human-like memory properties of LLMs do not follow automatically from the architecture but are learned from the statistical properties of the training text, suggesting that the biological features of human memory leave an imprint on how we structure our textual narratives.
    Abstract Large Language Models (LLMs) are huge artificial neural networks which primarily serve to generate text, but also provide a very sophisticated probabilistic model of language use. Since generating a semantically consistent text requires a form of effective memory, we investigate the memory properties of LLMs and find surprising similarities with key characteristics of human memory. We argue that the human-like memory properties of the Large Language Model do not follow automatically from the LLM architecture but are rather learned from the statistics of the training textual data. These results strongly suggest that the biological features of human memory leave an imprint on the way that we structure our textual narratives.

Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.03830
  • repo_url: https://github.com/Sainzerjj/SFERD
  • paper_authors: Shengzhe Zhou, Zejian Lee, Shengyuan Zhang, Lefan Hou, Changyuan Yang, Guang Yang, Lingyun Sun
  • for: Improving the sample quality of diffusion models distilled for few-step sampling, via knowledge distillation.
  • methods: SFERD uses attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's spatial fitting error.
  • results: High-quality sample generation in a single step: FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64×64, outperforming existing diffusion methods.
    Abstract Denoising Diffusion models have exhibited remarkable capabilities in image generation. However, generating high-quality samples requires a large number of iterations. Knowledge distillation for diffusion models is an effective method to address this limitation with a shortened sampling process but causes degraded generative quality. Based on our analysis with bias-variance decomposition and experimental observations, we attribute the degradation to the spatial fitting error occurring in the training of both the teacher and student model. Accordingly, we propose $\textbf{S}$patial $\textbf{F}$itting-$\textbf{E}$rror $\textbf{R}$eduction $\textbf{D}$istillation model ($\textbf{SFERD}$). SFERD utilizes attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's fitting error. Empirically, our proposed model facilitates high-quality sample generation in a few function evaluations. We achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64 with only one step, outperforming existing diffusion methods. Our study provides a new perspective on diffusion distillation by highlighting the intrinsic denoising ability of models.

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation

  • paper_url: http://arxiv.org/abs/2311.03810
  • repo_url: https://github.com/xiaozhang521/imtl
  • paper_authors: Yuhao Zhang, Chen Xu, Bei Li, Hao Chen, Tong Xiao, Chunliang Zhang, Jingbo Zhu
  • for: This paper investigates the consistency between different tasks in multi-task learning for end-to-end speech translation (ST) and proposes an improved multi-task learning (IMTL) approach to bridge the modal gap and improve ST performance.
  • methods: The authors examine task consistency across different training times and modules, finding that the textual encoder primarily facilitates cross-modal conversion while noise in speech impedes consistency between text and speech representations. Their IMTL approach mitigates the differences in length and representation to bridge the modal gap.
  • results: Experiments on the MuST-C dataset achieve state-of-the-art results; with additional data, the method sets a new SOTA on the MuST-C English-to-Spanish task using only 20.8% of the training time required by the current SOTA method.
    Abstract Significant improvements in end-to-end speech translation (ST) have been achieved through the application of multi-task learning. However, the extent to which auxiliary tasks are highly consistent with the ST task, and how much this approach truly helps, have not been thoroughly studied. In this paper, we investigate the consistency between different tasks, considering different times and modules. We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations. Furthermore, we propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation. We conduct experiments on the MuST-C dataset. The results demonstrate that our method attains state-of-the-art results. Moreover, when additional data is used, we achieve the new SOTA result on MuST-C English to Spanish task with 20.8% of the training time required by the current SOTA method.

Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

  • paper_url: http://arxiv.org/abs/2311.03783
  • repo_url: None
  • paper_authors: Song Yaoxian, Sun Penglei, Liu Haoyu, Li Zhixu, Song Wei, Xiao Yanghua, Zhou Xiaofang
  • for: The paper addresses the challenge of scene knowledge in embodied AI, proposing a scene-driven multimodal knowledge graph construction method to improve the intelligence of real-world agents.
  • methods: The method combines conventional knowledge engineering with large language models in a unified scene knowledge injection framework, and is instantiated for typical indoor robotic functionalities (manipulation and mobility) as ManipMob-MMKG.
  • results: Knowledge-enhanced methods using the instantiated ManipMob-MMKG improve performance on embodied tasks noticeably without complex re-design of model structures, with broad advantages in data-collection efficiency and knowledge quality.
    Abstract Embodied AI is one of the most popular areas of study in artificial intelligence and robotics, as it can effectively improve the intelligence of real-world agents (i.e., robots) serving human beings. Scene knowledge is important for an agent to understand its surroundings and make correct decisions in the varied open world. Currently, a knowledge base for embodied tasks is missing, and most existing work uses general knowledge bases or pre-trained models to enhance the intelligence of an agent. Conventional knowledge bases are sparse, insufficient in capacity, and costly in data collection, while pre-trained models face the uncertainty of knowledge and are hard to maintain. To overcome the challenges of scene knowledge, we propose a scene-driven multimodal knowledge graph (Scene-MMKG) construction method combining conventional knowledge engineering and large language models. A unified scene knowledge injection framework is introduced for knowledge representation. To evaluate the advantages of our proposed method, we instantiate Scene-MMKG considering typical indoor robotic functionalities (Manipulation and Mobility), named ManipMob-MMKG. Comparisons in characteristics indicate our instantiated ManipMob-MMKG has broad superiority in data-collection efficiency and knowledge quality. Experimental results on typical embodied tasks show that knowledge-enhanced methods using our instantiated ManipMob-MMKG can improve the performance noticeably without complexly re-designing model structures. Our project can be found at https://sites.google.com/view/manipmob-mmkg

Ensembling Textual and Structure-Based Models for Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2311.03780
  • repo_url: None
  • paper_authors: Ananjan Nandi, Navdeep Kaur, Parag Singla, Mausam
  • for: This paper proposes a way to combine the two popular approaches to knowledge graph completion (KGC): textual models and structure-based models.
  • methods: Two families of methods are considered: textual models that rely on the textual descriptions of entities in the knowledge graph, and structure-based models that exploit the connectivity structure of the knowledge graph (KG). The paper proposes learning query-dependent ensemble weights from the distributions of scores that the individual models assign to all candidate entities.
  • results: Preliminary experiments show the two approaches have complementary strengths: structure-based models perform well when the gold answer is easily reachable from the query head in the KG, while textual models exploit descriptions to perform well even when the gold answer is not reachable. The ensemble achieves state-of-the-art results on three standard KGC datasets, with gains of up to 6.8 pt MRR and 8.3 pt Hits@1 over the best individual models.
    Abstract We consider two popular approaches to Knowledge Graph Completion (KGC): textual models that rely on textual entity descriptions, and structure-based models that exploit the connectivity structure of the Knowledge Graph (KG). Preliminary experiments show that these approaches have complementary strengths: structure-based models perform well when the gold answer is easily reachable from the query head in the KG, while textual models exploit descriptions to give good performance even when the gold answer is not reachable. In response, we explore ensembling as a way of combining the best of both approaches. We propose a novel method for learning query-dependent ensemble weights by using the distributions of scores assigned by individual models to all candidate entities. Our ensemble baseline achieves state-of-the-art results on three standard KGC datasets, with up to 6.8 pt MRR and 8.3 pt Hits@1 gains over best individual models.
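
The query-dependent weighting can be sketched as follows: each model's score distribution over candidates is summarized into features, a small network predicts per-query mixing weights, and the normalized scores are combined. The features and weight network are assumptions for illustration.

```python
# Sketch of query-dependent ensembling over KGC models. The score-distribution
# features and the weight network are illustrative assumptions.
import torch
import torch.nn as nn

def score_features(scores):
    """scores: [B, E] one model's scores over all candidate entities."""
    top2 = scores.topk(2, dim=1).values
    return torch.stack([
        scores.mean(dim=1),
        scores.std(dim=1),
        top2[:, 0] - top2[:, 1],   # margin: how "peaked" the model is
    ], dim=1)

class QueryEnsemble(nn.Module):
    def __init__(self, n_models=2, n_feats=3):
        super().__init__()
        self.weight_net = nn.Linear(n_models * n_feats, n_models)

    def forward(self, model_scores):  # list of [B, E] score tensors
        feats = torch.cat([score_features(s) for s in model_scores], dim=1)
        w = torch.softmax(self.weight_net(feats), dim=1)           # [B, M]
        normed = [torch.softmax(s, dim=1) for s in model_scores]   # comparable scales
        return sum(w[:, i:i + 1] * normed[i] for i in range(len(normed)))

ens = QueryEnsemble()
combined = ens([torch.randn(4, 1000), torch.randn(4, 1000)])
```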

PT-Tuning: Bridging the Gap between Time Series Masked Reconstruction and Forecasting via Prompt Token Tuning

  • paper_url: http://arxiv.org/abs/2311.03768
  • repo_url: None
  • paper_authors: Hao Liu, Jinrui Gan, Xiaoxuan Fan, Yi Zhang, Chuanxian Luo, Jing Zhang, Guangxin Jiang, Yucheng Qian, Changwei Zhao, Huan Ma, Zhenyu Guo
  • for: bridging the gap between time series masked reconstruction and forecasting
  • methods: reserving pre-trained mask token during fine-tuning stage, prompt token tuning (PT-Tuning) paradigm
  • results: state-of-the-art performance compared to representation learning and end-to-end supervised forecasting methods
    Abstract Self-supervised learning has been actively studied in the time series domain recently, especially for masked reconstruction. Most of these methods follow the "Pre-training + Fine-tuning" paradigm in which a new decoder replaces the pre-trained decoder to fit a specific downstream task, leading to inconsistency between upstream and downstream tasks. In this paper, we first point out that the unification of task objectives and adaptation for task difficulty are critical for bridging the gap between time series masked reconstruction and forecasting. By reserving the pre-trained mask token during the fine-tuning stage, the forecasting task can be taken as a special case of masked reconstruction, where the future values are masked and reconstructed based on history values. This guarantees the consistency of task objectives, but there is still a gap in task difficulty, because masked reconstruction can utilize contextual information while forecasting can only use historical information to reconstruct. To further mitigate this gap, we propose a simple yet effective prompt token tuning (PT-Tuning) paradigm, in which all pre-trained parameters are frozen and only a few trainable prompt tokens are added to extended mask tokens in an element-wise manner. Extensive experiments on real-world datasets demonstrate the superiority of our proposed paradigm with state-of-the-art performance compared to representation learning and end-to-end supervised forecasting methods.
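
A minimal sketch of the paradigm, assuming a frozen transformer backbone and a frozen mask token: only the prompt tokens, added element-wise to the mask tokens at the forecast positions, are trainable.

```python
# Prompt token tuning sketch: frozen backbone and mask token, trainable prompt
# tokens added element-wise at the forecast positions. Module names are
# illustrative assumptions.
import torch
import torch.nn as nn

class PromptTokenForecaster(nn.Module):
    def __init__(self, backbone, mask_token, horizon, dim):
        super().__init__()
        self.backbone = backbone                   # pre-trained, frozen
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.mask_token = mask_token.detach()      # frozen pre-trained mask token
        self.prompt = nn.Parameter(torch.zeros(horizon, dim))  # the only new params

    def forward(self, history_emb):                # [B, T, dim]
        B = history_emb.size(0)
        future = (self.mask_token + self.prompt).expand(B, -1, -1)  # element-wise add
        return self.backbone(torch.cat([history_emb, future], dim=1))

# Forecasting becomes masked reconstruction: the future positions are "masked"
# (mask token + prompt) and reconstructed from history by the frozen backbone.
backbone = nn.Identity()   # stand-in for a frozen pre-trained transformer
model = PromptTokenForecaster(backbone, torch.zeros(64), horizon=24, dim=64)
out = model(torch.randn(2, 96, 64))
```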

Augmenting Radio Signals with Wavelet Transform for Deep Learning-Based Modulation Recognition

  • paper_url: http://arxiv.org/abs/2311.03761
  • repo_url: None
  • paper_authors: Tao Chen, Shilian Zheng, Kunfeng Qiu, Luxin Zhang, Qi Xuan, Xiaoniu Yang
  • for: This paper targets deep learning-based radio modulation recognition, addressing the practical problem of insufficient training data.
  • methods: The paper proposes data augmentation methods that decompose signals with the discrete wavelet transform, replace the detail coefficients, and reconstruct new samples to expand the training set and increase its diversity. Different generation methods are used to produce the replacement sequences.
  • results: Simulation results show that the proposed methods significantly outperform other augmentation methods.
    Abstract The use of deep learning for radio modulation recognition has become prevalent in recent years. This approach automatically extracts high-dimensional features from large datasets, facilitating the accurate classification of modulation schemes. However, in real-world scenarios, it may not be feasible to gather sufficient training data in advance. Data augmentation is a method used to increase the diversity and quantity of training dataset and to reduce data sparsity and imbalance. In this paper, we propose data augmentation methods that involve replacing detail coefficients decomposed by discrete wavelet transform for reconstructing to generate new samples and expand the training set. Different generation methods are used to generate replacement sequences. Simulation results indicate that our proposed methods significantly outperform the other augmentation methods.
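
The core augmentation step can be sketched with PyWavelets: decompose, replace the detail coefficients, reconstruct. The replacement rule used here (Gaussian noise matched to each band's scale) is one simple choice among the several generation methods the paper compares.

```python
# Minimal version of the augmentation idea using PyWavelets: decompose a
# signal, replace its detail coefficients, and reconstruct a new sample.
import numpy as np
import pywt

def wavelet_augment(signal, wavelet='db4', level=3, rng=None):
    rng = rng or np.random.default_rng()
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
    approx, details = coeffs[0], coeffs[1:]
    # One simple replacement rule: noise matched to each band's scale.
    new_details = [rng.normal(0.0, d.std() + 1e-12, size=d.shape) for d in details]
    return pywt.waverec([approx] + new_details, wavelet)  # reconstructed sample

x = np.sin(np.linspace(0, 8 * np.pi, 1024))               # toy waveform
x_aug = wavelet_augment(x)
```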

Learning Decentralized Traffic Signal Controllers with Multi-Agent Graph Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.03756
  • repo_url: None
  • paper_authors: Yao Zhang, Zhiwen Yu, Jun Zhang, Liang Wang, Tom H. Luan, Bin Guo, Chau Yuen
  • for: This paper considers optimal traffic signal control in smart cities, framed as a complex networked system control problem. Given the interacting dynamics among traffic lights and road networks, attaining controller adaptivity and scalability is the primary challenge.
  • methods: The authors adopt multi-agent reinforcement learning (MARL), noting that existing MARL algorithms ignore effective information aggregation, which is fundamental to the learning capacity of decentralized agents. They design a new decentralized control architecture with improved environmental observability that captures spatial-temporal correlation, transferring the road network topology into a graph shift operator via a diffusion process and building a diffusion convolution module on top of it.
  • results: Extensive experiments on both synthetic and real-world datasets show that the proposal outperforms existing decentralized algorithms.
    Abstract This paper considers optimal traffic signal control in smart cities, which has been taken as a complex networked system control problem. Given the interacting dynamics among traffic lights and road networks, attaining controller adaptivity and scalability stands out as a primary challenge. Capturing the spatial-temporal correlation among traffic lights under the framework of Multi-Agent Reinforcement Learning (MARL) is a promising solution. Nevertheless, existing MARL algorithms ignore effective information aggregation which is fundamental for improving the learning capacity of decentralized agents. In this paper, we design a new decentralized control architecture with improved environmental observability to capture the spatial-temporal correlation. Specifically, we first develop a topology-aware information aggregation strategy to extract correlation-related information from unstructured data gathered in the road network. Particularly, we transfer the road network topology into a graph shift operator by forming a diffusion process on the topology, which subsequently facilitates the construction of graph signals. A diffusion convolution module is developed, forming a new MARL algorithm, which endows agents with the capabilities of graph learning. Extensive experiments based on both synthetic and real-world datasets verify that our proposal outperforms existing decentralized algorithms.
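
A bare-bones diffusion convolution over a road graph is sketched below: the random-walk transition matrix D^-1 A serves as the graph shift operator, and node features are aggregated over K diffusion steps with learnable coefficients (plain scalars here for clarity).

```python
# Sketch of diffusion convolution over a road network: a random-walk diffusion
# process on the adjacency matrix defines the graph shift operator.
import numpy as np

def diffusion_convolution(adj, x, thetas):
    """adj: [N, N] adjacency; x: [N, F] node features; thetas: K+1 coefficients."""
    deg = adj.sum(axis=1, keepdims=True) + 1e-9
    p = adj / deg                        # random-walk transition matrix D^-1 A
    out = thetas[0] * x
    power = x
    for theta in thetas[1:]:
        power = p @ power                # one more diffusion step
        out = out + theta * power
    return out

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3 intersections
x = np.random.randn(3, 4)               # e.g., queue lengths per approach
y = diffusion_convolution(adj, x, thetas=[0.5, 0.3, 0.2])
```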

COOL: A Constraint Object-Oriented Logic Programming Language and its Neural-Symbolic Compilation System

  • paper_url: http://arxiv.org/abs/2311.03753
  • repo_url: None
  • paper_authors: Jipeng Han
  • for: This paper explores the integration of neural networks with logic programming, addressing the longstanding challenge of combining the generalization and learning capabilities of neural networks with the precision of symbolic logic. Traditional attempts have been hampered by difficulties in initial data acquisition, the reliability of undertrained networks, and the complexity of reusing and extending trained models.
  • methods: The paper introduces the Constraint Object-Oriented Logic (COOL) programming language, which seamlessly combines logical reasoning with neural network technologies. COOL autonomously handles data collection, reducing the need for user-supplied initial data; it incorporates user prompts into the coding process to reduce the risk of undertraining, and promotes interaction among models throughout their lifecycle to encourage reuse and augmentation of networks.
  • results: The foundational principles and algorithms in COOL's design and compilation system may provide valuable insights for future developments in programming languages and neural network architectures.
    Abstract This paper explores the integration of neural networks with logic programming, addressing the longstanding challenges of combining the generalization and learning capabilities of neural networks with the precision of symbolic logic. Traditional attempts at this integration have been hampered by difficulties in initial data acquisition, the reliability of undertrained networks, and the complexity of reusing and augmenting trained models. To overcome these issues, we introduce the COOL (Constraint Object-Oriented Logic) programming language, an innovative approach that seamlessly combines logical reasoning with neural network technologies. COOL is engineered to autonomously handle data collection, mitigating the need for user-supplied initial data. It incorporates user prompts into the coding process to reduce the risks of undertraining and enhances the interaction among models throughout their lifecycle to promote the reuse and augmentation of networks. Furthermore, the foundational principles and algorithms in COOL's design and its compilation system could provide valuable insights for future developments in programming languages and neural network architectures.

Analysis and Applications of Deep Learning with Finite Samples in Full Life-Cycle Intelligence of Nuclear Power Generation

  • paper_url: http://arxiv.org/abs/2311.04247
  • repo_url: None
  • paper_authors: Chenwei Tang, Wenqiang Zhou, Dong Wang, Caiyang Yu, Zhenan He, Jizhe Zhou, Shudong Huang, Yi Gao, Jianming Chen, Wentao Feng, Jiancheng Lv
  • for: This study examines how deep learning can be applied across the full life-cycle of nuclear power generation (NPG) in the Industry 4.0 era, providing a technical foundation and a fresh perspective for industrial intelligence in complex environments such as energy exploration and production.
  • methods: The study reviews deep learning from the finite-sample perspective, covering small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, with reference to the data characteristics of NPG (long-tailed class distributions, sample imbalance, and domain shift).
  • results: Two case studies spanning the NPG life-cycle, automatic recognition of zirconium alloy metallography and open-set recognition for signal diagnosis of machinery sensors, demonstrate reliable and effective applications of deep learning under finite samples.
    Abstract The advent of Industry 4.0 has precipitated the incorporation of Artificial Intelligence (AI) methods within industrial contexts, aiming to realize intelligent manufacturing, operation as well as maintenance, also known as industrial intelligence. However, intricate industrial milieus, particularly those relating to energy exploration and production, frequently encompass data characterized by long-tailed class distribution, sample imbalance, and domain shift. These attributes pose noteworthy challenges to data-centric Deep Learning (DL) techniques, crucial for the realization of industrial intelligence. The present study centers on the intricate and distinctive industrial scenarios of Nuclear Power Generation (NPG), meticulously scrutinizing the application of DL techniques under the constraints of finite data samples. Initially, the paper expounds on potential employment scenarios for AI across the full life-cycle of NPG. Subsequently, we delve into an evaluative exposition of DL's advancement, grounded in the finite sample perspective. This encompasses aspects such as small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, also referring to the unique data characteristics of NPG. The paper then proceeds to present two specific case studies. The first revolves around the automatic recognition of zirconium alloy metallography, while the second pertains to open-set recognition for signal diagnosis of machinery sensors. These cases, spanning the entirety of NPG's life-cycle, are accompanied by constructive outcomes and insightful deliberations. By exploring and applying DL methodologies within the constraints of finite sample availability, this paper not only furnishes a robust technical foundation but also introduces a fresh perspective toward the secure and efficient advancement and exploitation of this advanced energy source.

Leveraging Large Language Models for Automated Proof Synthesis in Rust

  • paper_url: http://arxiv.org/abs/2311.03739
  • repo_url: None
  • paper_authors: Jianan Yao, Ziqiao Zhou, Weiteng Chen, Weidong Cui
  • for: To broaden the adoption of formal verification, whose high proof burden has long hindered wide use.
  • methods: The paper combines Large Language Models (LLMs) with static analysis to synthesize invariants, assertions, and other proof structures for Verus, a Rust-based formal verification framework.
  • results: In a few-shot setting, LLMs demonstrate impressive logical ability in generating postconditions and loop invariants, especially for short code snippets, but they lack the ability to retain and propagate context information, a strength of traditional static analysis. Based on these observations, the authors built a prototype on OpenAI's GPT-4 that decomposes the verification task into smaller ones, iteratively queries GPT-4, and combines the output with lightweight static analysis. Evaluated with a developer in the loop on 20 vector-manipulating programs, the prototype significantly reduces the human effort of writing entry-level proof code.
    Abstract Formal verification can provably guarantee the correctness of critical system software, but the high proof burden has long hindered its wide adoption. Recently, Large Language Models (LLMs) have shown success in code analysis and synthesis. In this paper, we present a combination of LLMs and static analysis to synthesize invariants, assertions, and other proof structures for a Rust-based formal verification framework called Verus. In a few-shot setting, LLMs demonstrate impressive logical ability in generating postconditions and loop invariants, especially when analyzing short code snippets. However, LLMs lack the ability to retain and propagate context information, a strength of traditional static analysis. Based on these observations, we developed a prototype based on OpenAI's GPT-4 model. Our prototype decomposes the verification task into multiple smaller ones, iteratively queries GPT-4, and combines its output with lightweight static analysis. We evaluated the prototype with a developer in the automation loop on 20 vector-manipulating programs. The results demonstrate that it significantly reduces human effort in writing entry-level proof code.
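
The decompose-query-combine loop can be sketched as below; ask_llm and run_verifier are placeholders standing in for a GPT-4 call and a Verus invocation respectively, not real APIs.

```python
# Sketch of the decompose-and-query loop described above: each unverified
# function gets candidate invariants from an LLM, which are then checked by a
# verifier. `ask_llm` and `run_verifier` are placeholder callables.
def synthesize_proof(functions, ask_llm, run_verifier, max_rounds=3):
    annotations = {}
    for fn in functions:                      # decompose: one function at a time
        prompt = f"Propose loop invariants and postconditions for:\n{fn.source}"
        for _ in range(max_rounds):
            candidate = ask_llm(prompt)       # e.g., a GPT-4 call
            ok, error = run_verifier(fn, candidate)   # e.g., invoke Verus
            if ok:
                annotations[fn.name] = candidate
                break
            # Feed the verifier error back, combined with static-analysis hints.
            prompt += f"\nThe verifier rejected this attempt: {error}"
    return annotations
```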

deep-REMAP: Parameterization of Stellar Spectra Using Regularized Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2311.03738
  • repo_url: None
  • paper_authors: Sankalp Gilda
  • for: This paper develops a new framework for accurately predicting stellar atmospheric parameters (effective temperature, surface gravity, metallicity) from observed spectra.
  • methods: deep-REMAP uses regularized ensemble-based multi-task learning with an innovative asymmetric loss function for probabilistic inference, training on the rich synthetic spectra of the PHOENIX library and observational data from the MARVELS survey.
  • results: The results demonstrate superior predictive capability and show that the framework extends to other stellar libraries and properties, paving the way for more sophisticated and automated techniques in stellar characterization.
    Abstract Traditional spectral analysis methods are increasingly challenged by the exploding volumes of data produced by contemporary astronomical surveys. In response, we develop deep-Regularized Ensemble-based Multi-task Learning with Asymmetric Loss for Probabilistic Inference ($\rm{deep-REMAP}$), a novel framework that utilizes the rich synthetic spectra from the PHOENIX library and observational data from the MARVELS survey to accurately predict stellar atmospheric parameters. By harnessing advanced machine learning techniques, including multi-task learning and an innovative asymmetric loss function, $\rm{deep-REMAP}$ demonstrates superior predictive capabilities in determining effective temperature, surface gravity, and metallicity from observed spectra. Our results reveal the framework's effectiveness in extending to other stellar libraries and properties, paving the way for more sophisticated and automated techniques in stellar characterization.
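
One common way to realize an asymmetric loss is sketched below, penalizing over- and under-prediction with different weights; the paper's exact formulation is not reproduced here.

```python
# Illustrative asymmetric regression loss: over- and under-prediction are
# penalized with different weights. An assumption for illustration, not the
# paper's exact loss.
import torch

def asymmetric_loss(pred, target, w_over=1.0, w_under=2.0):
    """Penalize under-prediction more heavily than over-prediction."""
    err = pred - target
    weights = torch.where(err >= 0, torch.full_like(err, w_over),
                          torch.full_like(err, w_under))
    return (weights * err.pow(2)).mean()

pred = torch.tensor([5800.0, 4.3, -0.1])    # e.g., Teff, log g, [Fe/H]
target = torch.tensor([5750.0, 4.4, 0.0])
print(asymmetric_loss(pred, target))
```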

Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning

  • paper_url: http://arxiv.org/abs/2311.03736
  • repo_url: None
  • paper_authors: Joseph Suárez, Phillip Isola, Kyoung Whan Choe, David Bloomin, Hao Xiang Li, Nikhil Pinnaparaju, Nishaanth Kanna, Daniel Scott, Ryan Sullivan, Rose S. Shuman, Lucas de Alcântara, Herbie Bradley, Louis Castricato, Kirsty You, Yuhao Jiang, Qimai Li, Jiaxin Chen, Xiaolong Zhu
  • for: To provide a massively multi-agent environment with a flexible task system in which users can define a broad range of objectives and reward signals for reinforcement learning research.
  • methods: The work builds on Neural MMO 2.0, a massively multi-agent environment with procedurally generated maps, and integrates with the CleanRL reinforcement learning framework.
  • results: Neural MMO 2.0 delivers a three-fold performance improvement over its predecessor and supports more agents and larger maps.
    Abstract Neural MMO 2.0 is a massively multi-agent environment for reinforcement learning research. The key feature of this new version is a flexible task system that allows users to define a broad range of objectives and reward signals. We challenge researchers to train agents capable of generalizing to tasks, maps, and opponents never seen during training. Neural MMO features procedurally generated maps with 128 agents in the standard setting and support for up to. Version 2.0 is a complete rewrite of its predecessor with three-fold improved performance and compatibility with CleanRL. We release the platform as free and open-source software with comprehensive documentation available at neuralmmo.github.io and an active community Discord. To spark initial research on this new platform, we are concurrently running a competition at NeurIPS 2023.

ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning

  • paper_url: http://arxiv.org/abs/2311.03721
  • repo_url: None
  • paper_authors: Julia Kaltenborn, Charlotte E. E. Lange, Venkatesh Ramesh, Philippe Brouillard, Yaniv Gurwicz, Chandni Nagda, Jakob Runge, Peer Nowack, David Rolnick
  • for: Providing a large, consistent, ML-ready climate model dataset to support the climate science and machine learning (ML) communities on climate projection and related tasks.
  • methods: Builds on the Input4MIPs and CMIP6 climate model archives and provides a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios.
  • results: Using ClimateSet as a benchmark for ML-based climate model emulation, the paper analyzes the performance and generalization capabilities of different ML models across climate models, and shows that the dataset enables a "super emulator" that can quickly project new climate change scenarios for climate science and policymaking.
    Abstract Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, we provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. We showcase the potential of our dataset by using it as a benchmark for ML-based climate model emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a "super emulator" can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.

LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators

  • paper_url: http://arxiv.org/abs/2311.03716
  • repo_url: None
  • paper_authors: Allen Roush, Emil Zakirov, Artemiy Shirokov, Polina Lunina, Jack Gane, Alexander Duffy, Charlie Basil, Aber Whitcomb, Jim Benedetto, Chris DeWolfe
  • for: Improving the quality and subject-relevance of text-to-image and text-to-video generation.
  • methods: Combines multiple techniques, such as constrained decoding, intelligent prompting, fine-tuning, and retrieval, to augment the capabilities of text-to-media generators.
  • results: Generates higher-quality, more relevant images and videos; the system is used in apps and platforms developed by Plai Labs.
    Abstract Recent advancements in text-to-image generation have revolutionized numerous fields, including art and cinema, by automating the generation of high-quality, context-aware images and video. However, the utility of these technologies is often limited by the inadequacy of text prompts in guiding the generator to produce artistically coherent and subject-relevant images. In this paper, we describe the techniques that can be used to make Large Language Models (LLMs) act as Art Directors that enhance image and video generation. We describe our unified system for this called "LaDi". We explore how LaDi integrates multiple techniques for augmenting the capabilities of text-to-image generators (T2Is) and text-to-video generators (T2Vs), with a focus on constrained decoding, intelligent prompting, fine-tuning, and retrieval. LaDi and these techniques are being used today in apps and platforms developed by Plai Labs.

Loss Balancing for Fair Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.03714
  • repo_url: https://github.com/khalilimahdi/loss_balancing_icml2023
  • paper_authors: Mohammad Mahdi Khalili, Xueru Zhang, Mahed Abroshan
  • for: Proposing an algorithm that can quickly and efficiently find a fair predictor under the Equalized Loss (EL) fairness constraint.
  • methods: Reduces the resulting non-convex optimization problem to a sequence of convex optimization problems, so that off-the-shelf convex programming tools (e.g., CVXPY) can be used to find the fair predictor efficiently.
  • results: Theoretically proves that the algorithm finds the global optimum under certain conditions, and supports the theoretical findings with several empirical studies.
    Abstract Supervised learning models have been used in various domains such as lending, college admission, face recognition, natural language processing, etc. However, they may inherit pre-existing biases from training data and exhibit discrimination against protected social groups. Various fairness notions have been proposed to address unfairness issues. In this work, we focus on Equalized Loss (EL), a fairness notion that requires the expected loss to be (approximately) equalized across different groups. Imposing EL on the learning process leads to a non-convex optimization problem even if the loss function is convex, and the existing fair learning algorithms cannot properly be adopted to find the fair predictor under the EL constraint. This paper introduces an algorithm that can leverage off-the-shelf convex programming tools (e.g., CVXPY) to efficiently find the global optimum of this non-convex optimization. In particular, we propose the ELminimizer algorithm, which finds the optimal fair predictor under EL by reducing the non-convex optimization to a sequence of convex optimization problems. We theoretically prove that our algorithm finds the global optimal solution under certain conditions. Then, we support our theoretical results through several empirical studies.
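
The abstract describes reducing the non-convex EL-constrained problem to a sequence of convex programs solvable with tools like CVXPY, but does not spell out the reduction. The sketch below shows one natural variant (re-weighting the two group losses and bisecting on the weight until the losses equalize) on synthetic linear-regression data; the paper's ELminimizer may differ.

```python
# A minimal sketch of reducing an Equalized-Loss-style problem to a sequence of
# convex programs: each subproblem minimizes a lambda-weighted combination of
# the two group losses, and bisection on lambda drives the losses together.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
g = rng.integers(0, 2, size=n)                 # group membership (0 or 1)
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * g + rng.normal(scale=0.1, size=n)  # group-dependent bias

def group_losses(lam):
    """Convex subproblem: minimize lam * L0 + (1 - lam) * L1."""
    w = cp.Variable(d)
    L0 = cp.sum_squares(X[g == 0] @ w - y[g == 0]) / np.sum(g == 0)
    L1 = cp.sum_squares(X[g == 1] @ w - y[g == 1]) / np.sum(g == 1)
    cp.Problem(cp.Minimize(lam * L0 + (1 - lam) * L1)).solve()
    return L0.value, L1.value

lo, hi = 0.0, 1.0
for _ in range(20):                            # larger lam pushes L0 down
    lam = (lo + hi) / 2
    L0, L1 = group_losses(lam)
    if L0 > L1:
        lo = lam                               # need more weight on group 0
    else:
        hi = lam
print(f"group losses after bisection: {L0:.4f} vs {L1:.4f}")
```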

Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.03711
  • repo_url: None
  • paper_authors: Junmin Zhong, Ruofan Wu, Jennie Si
  • for: Mitigating estimation bias in deep reinforcement learning.
  • methods: Introduces a new twin TD-regularized actor-critic (TDR) method that reduces both over- and under-estimation errors.
  • results: Combined with other DRL improvements, such as distributional learning and the long N-step surrogate stage reward (LNSS) method, the new TDR-based actor-critic learning performs strongly on the DeepMind Control Suite, elevating TD3 and SAC to a level comparable with D4PG (the current SOTA) and improving D4PG itself to a new SOTA level.
    Abstract We address the issue of estimation bias in deep reinforcement learning (DRL) by introducing solution mechanisms that include a new, twin TD-regularized actor-critic (TDR) method. It aims at reducing both over and under-estimation errors. With TDR and by combining good DRL improvements, such as distributional learning and long N-step surrogate stage reward (LNSS) method, we show that our new TDR-based actor-critic learning has enabled DRL methods to outperform their respective baselines in challenging environments in DeepMind Control Suite. Furthermore, they elevate TD3 and SAC respectively to a level of performance comparable to that of D4PG (the current SOTA), and they also improve the performance of D4PG to a new SOTA level measured by mean reward, convergence speed, learning success rate, and learning variance.
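
The abstract does not give the exact form of the twin TD regularizer, so the following is only a schematic: a TD3-style twin target plus a TD-error-magnitude penalty added to both the critic and the actor objectives. In practice `q1` in the actor term would be recomputed as Q(s, pi(s)) with the current actor.

```python
# A schematic of TD-regularized twin actor-critic losses; the penalty form and
# coefficient beta are assumptions, not the paper's exact formulation.
import torch

def tdr_losses(q1, q2, q1_next, q2_next, reward, not_done, gamma=0.99, beta=0.1):
    # Twin target (as in TD3/SAC): the smaller target critic curbs over-estimation.
    target = (reward + gamma * not_done * torch.min(q1_next, q2_next)).detach()
    td1, td2 = q1 - target, q2 - target
    # Critic: standard squared TD error plus a TD-regularization penalty.
    critic_loss = (td1.pow(2) + td2.pow(2)).mean() \
                  + beta * (td1.abs() + td2.abs()).mean()
    # Actor: maximize Q while keeping the TD error of the chosen action small.
    actor_loss = (-q1 + beta * td1.abs()).mean()
    return critic_loss, actor_loss
```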

The NeurIPS 2022 Neural MMO Challenge: A Massively Multiagent Competition with Specialization and Trade

  • paper_url: http://arxiv.org/abs/2311.03707
  • repo_url: https://github.com/neuralmmo/neurips2022nmmo-submission-pool
  • paper_authors: Enhong Liu, Joseph Suarez, Chenhui You, Bo Wu, Bingcheng Chen, Jun Hu, Jiaxin Chen, Xiaolong Zhu, Clare Zhu, Julian Togelius, Sharada Mohanty, Weijun Hong, Rui Du, Yibing Zhang, Qinwen Wang, Xinhang Li, Zheng Yuan, Xiang Li, Yuejia Huang, Kun Zhang, Hanhui Yang, Shiqi Tang, Phillip Isola
  • for: Presenting the results of the NeurIPS-2022 Neural MMO Challenge, which attracted 500 participants and received over 1,600 submissions.
  • methods: The competition ran on the latest v1.6 Neural MMO environment, which introduces new equipment, combat, trading, and scoring systems that pose additional robustness and generalization challenges not present in the previous competition.
  • results: Summarizes the design and results of the challenge, explores the potential of this environment as a benchmark for learning methods, and presents practical reinforcement learning training approaches for complex tasks with sparse rewards.
    Abstract In this paper, we present the results of the NeurIPS-2022 Neural MMO Challenge, which attracted 500 participants and received over 1,600 submissions. Like the previous IJCAI-2022 Neural MMO Challenge, it involved agents from 16 populations surviving in procedurally generated worlds by collecting resources and defeating opponents. This year's competition runs on the latest v1.6 Neural MMO, which introduces new equipment, combat, trading, and a better scoring system. These elements combine to pose additional robustness and generalization challenges not present in previous competitions. This paper summarizes the design and results of the challenge, explores the potential of this environment as a benchmark for learning methods, and presents some practical reinforcement learning training approaches for complex tasks with sparse rewards. Additionally, we have open-sourced our baselines, including environment wrappers, benchmarks, and visualization tools for future research.

Efficient Bottom-Up Synthesis for Programs with Local Variables

  • paper_url: http://arxiv.org/abs/2311.03705
  • repo_url: None
  • paper_authors: Xiang Li, Xiangyu Zhou, Rui Dong, Yihong Zhang, Xinyu Wang
  • for: Proposing a new synthesis algorithm that can efficiently search over programs with local variables (e.g., those introduced by lambdas). Existing bottom-up synthesis algorithms cannot evaluate programs with free local variables, and therefore cannot effectively prune the search space of such programs (e.g., using standard observational-equivalence reduction), making synthesis slow; the proposed algorithm can prune this space.
  • methods: Introduces the idea of "lifted interpretation": lifting the interpretation process from evaluating one program at a time to simultaneously evaluating all programs from a grammar. This systematically enumerates all binding contexts for local variables, making it possible to evaluate, and reduce the space of, programs with local variables.
  • results: Instantiated in the domain of web automation, the resulting tool, Arborist, automates a significantly broader range of challenging tasks more efficiently than state-of-the-art techniques including WebRobot and Helena.
    Abstract We propose a new synthesis algorithm that can efficiently search programs with local variables (e.g., those introduced by lambdas). Prior bottom-up synthesis algorithms are not able to evaluate programs with free local variables, and therefore cannot effectively reduce the search space of such programs (e.g., using standard observational equivalence reduction techniques), making synthesis slow. Our algorithm can reduce the space of programs with local variables. The key idea, dubbed lifted interpretation, is to lift up the program interpretation process, from evaluating one program at a time to simultaneously evaluating all programs from a grammar. Lifted interpretation provides a mechanism to systematically enumerate all binding contexts for local variables, thereby enabling us to evaluate and reduce the space of programs with local variables. Our ideas are instantiated in the domain of web automation. The resulting tool, Arborist, can automate a significantly broader range of challenging tasks more efficiently than state-of-the-art techniques including WebRobot and Helena.
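
A toy rendering of the lifted-interpretation idea: instead of evaluating one program under one environment, every candidate is evaluated under all binding contexts for the local variable, and observational equivalence is keyed on the whole vector of results. The grammar and binding contexts below are invented for illustration.

```python
# Bottom-up enumeration with observational equivalence over programs containing
# a free local variable `x`: each program's "signature" is its value under
# every binding context, so programs with local variables can be deduplicated.
from itertools import product

CONTEXTS = [{"x": v} for v in (0, 1, 2, 3)]          # all binding contexts for x

def signature(prog):
    """'Lifted' evaluation: one result per binding context, in a fixed order."""
    return tuple(eval(prog, {}, dict(env)) for env in CONTEXTS)

bank = {"x": signature("x"), "1": signature("1")}    # terminal programs
seen = set(bank.values())

for _ in range(2):                                   # two bottom-up growth rounds
    new = {}
    for (a, b), op in product(product(list(bank), repeat=2), ("+", "*")):
        prog = f"({a} {op} {b})"
        sig = signature(prog)
        if sig not in seen:                          # observational equivalence:
            seen.add(sig)                            # a program with a known
            new[prog] = sig                          # signature is pruned
    bank.update(new)

print(len(bank), "observationally distinct programs after two rounds")
```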

Hypothesis Network Planned Exploration for Rapid Meta-Reinforcement Learning Adaptation

  • paper_url: http://arxiv.org/abs/2311.03701
  • repo_url: None
  • paper_authors: Maxwell Joseph Jacobson, Yexiang Xue
  • for: Training agents that adapt to fast-changing environments and tasks.
  • methods: Integrates an active and planned exploration process via a hypothesis network to optimize adaptation speed.
  • results: On a symbolic version of the Alchemy game, outpaces baseline methods in adaptation speed and model accuracy, validating its potential for enhancing reinforcement-learning adaptation in rapidly evolving settings.
    Abstract Meta Reinforcement Learning (Meta RL) trains agents that adapt to fast-changing environments and tasks. Current strategies often lose adaption efficiency due to the passive nature of model exploration, causing delayed understanding of new transition dynamics. This results in particularly fast-evolving tasks being impossible to solve. We propose a novel approach, Hypothesis Network Planned Exploration (HyPE), that integrates an active and planned exploration process via the hypothesis network to optimize adaptation speed. HyPE uses a generative hypothesis network to form potential models of state transition dynamics, then eliminates incorrect models through strategically devised experiments. Evaluated on a symbolic version of the Alchemy game, HyPE outpaces baseline methods in adaptation speed and model accuracy, validating its potential in enhancing reinforcement learning adaptation in rapidly evolving settings.
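
A toy sketch of hypothesis-driven exploration in the spirit of HyPE: maintain a pool of candidate transition models, plan the experiment (action) on which they disagree most, then discard models inconsistent with the observed outcome. The real method uses a generative hypothesis network; the fixed pool and scalar environment below are stand-ins.

```python
# Hypothesis elimination via planned exploration: pick the action that
# maximizes disagreement among surviving dynamics hypotheses, then prune.
import numpy as np

rng = np.random.default_rng(0)
true_step = lambda s, a: s + a                        # hidden dynamics
hypotheses = [lambda s, a, k=k: s + k * a for k in (-1, 0, 1, 2)]

state = 1.0
while len(hypotheses) > 1:
    # Planned exploration: choose the action with the highest variance of
    # predicted next states across the surviving hypotheses.
    actions = np.linspace(-1, 1, 9)
    scores = [np.var([h(state, a) for h in hypotheses]) for a in actions]
    a = actions[int(np.argmax(scores))]
    nxt = true_step(state, a)
    # Eliminate hypotheses whose prediction deviates from the observation.
    hypotheses = [h for h in hypotheses if abs(h(state, a) - nxt) < 1e-6]
    state = nxt

print("only the true dynamics hypothesis (s' = s + a) survives")
```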

A Novel Variational Lower Bound for Inverse Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.03698
  • repo_url: None
  • paper_authors: Yikang Gui, Prashant Doshi
  • for: Learning the reward function from expert trajectories so that a task can be understood for imitation or collaboration, removing the need for manual reward engineering.
  • methods: A Variational Lower Bound for IRL (VLB-IRL), derived under a probabilistic graphical model with an optimality node, that learns the reward function and the policy under that reward simultaneously.
  • results: Learns valid reward functions on several known domains, such that the policy under the learned reward achieves expert-level performance, and outperforms existing state-of-the-art IRL algorithms on these domains.
    Abstract Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories, to understand the task for imitation or collaboration thereby removing the need for manual reward engineering. However, IRL in the context of large, high-dimensional problems with unknown dynamics has been particularly challenging. In this paper, we present a new Variational Lower Bound for IRL (VLB-IRL), which is derived under the framework of a probabilistic graphical model with an optimality node. Our method simultaneously learns the reward function and policy under the learned reward function by maximizing the lower bound, which is equivalent to minimizing the reverse Kullback-Leibler divergence between an approximated distribution of optimality given the reward function and the true distribution of optimality given trajectories. This leads to a new IRL method that learns a valid reward function such that the policy under the learned reward achieves expert-level performance on several known domains. Importantly, the method outperforms the existing state-of-the-art IRL algorithms on these domains by demonstrating better reward from the learned policy.
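
Schematically, the equivalence stated in the abstract can be written as follows (the symbols are our notation, not the paper's):

$$
\max_{\theta}\; \mathcal{L}(\theta)
\;\;\Longleftrightarrow\;\;
\min_{\theta}\; D_{\mathrm{KL}}\!\left( q(\mathcal{O} \mid r_\theta) \,\|\, p(\mathcal{O} \mid \tau) \right),
$$

where $\mathcal{L}$ is the variational lower bound, $r_\theta$ the learned reward, $\mathcal{O}$ the optimality variable, and $\tau$ the expert trajectories; maximizing the bound minimizes the reverse KL divergence between the approximated and true distributions of optimality.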

Context Shift Reduction for Offline Meta-Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.03695
  • repo_url: https://github.com/moreanp/csro
  • paper_authors: Yunkai Gao, Rui Zhang, Jiaming Guo, Fan Wu, Qi Yi, Shaohui Peng, Siming Lan, Ruizhi Chen, Zidong Du, Xing Hu, Qi Guo, Ling Li, Yunji Chen
  • for: Improving the generalization ability of meta-RL agents by addressing the context shift problem.
  • methods: Proposes Context Shift Reduction for OMRL (CSRO), which minimizes the influence of the policy on the context during both the meta-training phase (via a max-min mutual-information representation learning mechanism) and the meta-test phase (via a non-prior context collection strategy).
  • results: Experiments show that CSRO significantly reduces the context shift and improves generalization, surpassing previous methods across various challenging domains.
    Abstract Offline meta-reinforcement learning (OMRL) utilizes pre-collected offline datasets to enhance the agent's generalization ability on unseen tasks. However, the context shift problem arises due to the distribution discrepancy between the contexts used for training (from the behavior policy) and testing (from the exploration policy). The context shift problem leads to incorrect task inference and further deteriorates the generalization ability of the meta-policy. Existing OMRL methods either overlook this problem or attempt to mitigate it with additional information. In this paper, we propose a novel approach called Context Shift Reduction for OMRL (CSRO) to address the context shift problem with only offline datasets. The key insight of CSRO is to minimize the influence of policy in context during both the meta-training and meta-test phases. During meta-training, we design a max-min mutual information representation learning mechanism to diminish the impact of the behavior policy on task representation. In the meta-test phase, we introduce the non-prior context collection strategy to reduce the effect of the exploration policy. Experimental results demonstrate that CSRO significantly reduces the context shift and improves the generalization ability, surpassing previous methods across various challenging domains.

Deep Bayesian Reinforcement Learning for Spacecraft Proximity Maneuvers and Docking

  • paper_url: http://arxiv.org/abs/2311.03680
  • repo_url: None
  • paper_authors: Desong Du, Naiming Qi, Yanfang Liu, Wei Pan
  • for: Developing a Bayesian actor-critic reinforcement learning algorithm with a stability guarantee for autonomous spacecraft proximity maneuvers and docking (PMD).
  • methods: Formulates the PMD task as a Markov decision process reflecting the relative dynamic model, the docking cone, and the cost function; drawing on Lyapunov theory, frames temporal-difference learning as constrained Gaussian process regression, and uses a Bayesian quadrature policy optimization procedure to analytically compute the policy gradient under Lyapunov-based stability constraints.
  • results: The algorithm shows impressive and promising performance in experiments on a spacecraft air-bearing testbed, satisfying the rigorous safety demands of spaceflight missions.
    Abstract In the pursuit of autonomous spacecraft proximity maneuvers and docking(PMD), we introduce a novel Bayesian actor-critic reinforcement learning algorithm to learn a control policy with the stability guarantee. The PMD task is formulated as a Markov decision process that reflects the relative dynamic model, the docking cone and the cost function. Drawing from the principles of Lyapunov theory, we frame the temporal difference learning as a constrained Gaussian process regression problem. This innovative approach allows the state-value function to be expressed as a Lyapunov function, leveraging the Gaussian process and deep kernel learning. We develop a novel Bayesian quadrature policy optimization procedure to analytically compute the policy gradient while integrating Lyapunov-based stability constraints. This integration is pivotal in satisfying the rigorous safety demands of spaceflight missions. The proposed algorithm has been experimentally evaluated on a spacecraft air-bearing testbed and shows impressive and promising performance.

Stable Modular Control via Contraction Theory for Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.03669
  • repo_url: None
  • paper_authors: Bing Song, Jean-Jacques Slotine, Quang-Cuong Pham
  • for: Improving the stability, robustness, and generalization of neural control.
  • methods: Leverages contraction theory to realize modularity in neural control, so that combining stable subsystems automatically preserves stability; modularity is realized via signal composition and dynamic decomposition.
  • results: Simulations demonstrate both the necessity and the effectiveness of the method: it improves the performance of hierarchical RL and makes neural control more stable, robust, and generalizable.
    Abstract We propose a novel way to integrate control techniques with reinforcement learning (RL) for stability, robustness, and generalization: leveraging contraction theory to realize modularity in neural control, which ensures that combining stable subsystems can automatically preserve the stability. We realize such modularity via signal composition and dynamic decomposition. Signal composition creates the latent space, within which RL applies to maximizing rewards. Dynamic decomposition is realized by coordinate transformation that creates an auxiliary space, within which the latent signals are coupled in the way that their combination can preserve stability provided each signal, that is, each subsystem, has stable self-feedbacks. Leveraging modularity, the nonlinear stability problem is deconstructed into algebraically solvable ones, the stability of the subsystems in the auxiliary space, yielding linear constraints on the input gradients of control networks that can be as simple as switching the signs of network weights. This minimally invasive method for stability allows arguably easy integration into the modular neural architectures in machine learning, like hierarchical RL, and improves their performance. We demonstrate in simulation the necessity and the effectiveness of our method: the necessity for robustness and generalization, and the effectiveness in improving hierarchical RL for manipulation learning.

GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.04245
  • repo_url: https://github.com/hkuds/gpt-st
  • paper_authors: Zhonghang Li, Lianghao Xia, Yong Xu, Chao Huang
  • for: Addressing the needs of traffic management and travel planning by improving rapidly developing spatio-temporal prediction techniques.
  • methods: A pre-training framework that integrates with downstream baseline models and enhances their performance through two key designs: (i) a spatio-temporal mask autoencoder with customized parameter learners and hierarchical spatial pattern encoding networks, capturing customized spatio-temporal representations and intra- and inter-cluster region semantic relationships; (ii) an adaptive mask strategy that guides the autoencoder to learn robust spatio-temporal representations and to model relationships from intra-cluster to inter-cluster in an easy-to-hard training manner.
  • results: Extensive experiments on representative benchmarks demonstrate the effectiveness of the proposed method; the model implementation is available at https://github.com/HKUDS/GPT-ST.
    Abstract In recent years, there has been a rapid development of spatio-temporal prediction techniques in response to the increasing demands of traffic management and travel planning. While advanced end-to-end models have achieved notable success in improving predictive performance, their integration and expansion pose significant challenges. This work aims to address these challenges by introducing a spatio-temporal pre-training framework that seamlessly integrates with downstream baselines and enhances their performance. The framework is built upon two key designs: (i) We propose a spatio-temporal mask autoencoder as a pre-training model for learning spatio-temporal dependencies. The model incorporates customized parameter learners and hierarchical spatial pattern encoding networks. These modules are specifically designed to capture spatio-temporal customized representations and intra- and inter-cluster region semantic relationships, which have often been neglected in existing approaches. (ii) We introduce an adaptive mask strategy as part of the pre-training mechanism. This strategy guides the mask autoencoder in learning robust spatio-temporal representations and facilitates the modeling of different relationships, ranging from intra-cluster to inter-cluster, in an easy-to-hard training manner. Extensive experiments conducted on representative benchmarks demonstrate the effectiveness of our proposed method. We have made our model implementation publicly available at https://github.com/HKUDS/GPT-ST.
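
As a rough illustration of the easy-to-hard adaptive masking described above, the sketch below ramps the mask ratio over training and switches from masking individual (time, node) cells to masking whole node clusters. The concrete schedule, the cluster definitions, and all numbers here are our assumptions; GPT-ST's actual strategy may differ.

```python
# An easy-to-hard masking schedule over a (time x node) grid: early training
# masks random cells (intra-cluster inference), later training masks whole
# node clusters (inter-cluster inference). Parameters are illustrative.
import numpy as np

def adaptive_mask(T, N, step, total_steps, clusters, rng):
    ratio = 0.25 + 0.5 * step / total_steps          # easy (25%) -> hard (75%)
    mask = np.zeros((T, N), dtype=bool)
    if step < total_steps // 2:
        # Easy phase: mask random individual (time, node) cells.
        mask = rng.random((T, N)) < ratio
    else:
        # Hard phase: mask whole node clusters, forcing inter-cluster inference.
        n_masked = max(1, int(ratio * len(clusters)))
        for c in rng.choice(len(clusters), size=n_masked, replace=False):
            mask[:, clusters[c]] = True
    return mask

rng = np.random.default_rng(0)
clusters = [list(range(i, i + 5)) for i in range(0, 20, 5)]  # 4 clusters of 5
m = adaptive_mask(T=12, N=20, step=900, total_steps=1000, clusters=clusters, rng=rng)
print(f"masked fraction: {m.mean():.2f}")
```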

The Linear Representation Hypothesis and the Geometry of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03658
  • repo_url: https://github.com/kihopark/linear_rep_geometry
  • paper_authors: Kiho Park, Yo Joong Choe, Victor Veitch
  • for: This paper aims to clarify the meaning of “linear representation” in the context of natural language processing, and to develop a unified framework for understanding geometric notions such as cosine similarity and projection in representation space.
  • methods: The authors use counterfactuals to formalize two different notions of “linear representation”, one in the output (word) representation space and one in the input (sentence) space. They then prove that these notions connect to linear probing and model steering, respectively.
  • results: The authors show that using a particular (non-Euclidean) inner product, they can unify all notions of linear representation, and demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product through experiments with LLaMA-2.
    Abstract Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
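
A hedged numerical sketch of two ingredients from the abstract: estimating a concept direction from counterfactual pairs, and measuring similarity with a non-Euclidean inner product of the form x^T A y. Taking A to be the inverse covariance of the representations is our stand-in; the paper derives its causal inner product from the formalization it develops, and its exact choice may differ.

```python
# Concept directions from counterfactual pairs, compared under a non-Euclidean
# inner product <x, y> = x^T A y. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 16, 200
reps = rng.normal(size=(1000, d))                 # stand-in representation cloud
A = np.linalg.inv(np.cov(reps, rowvar=False))     # inner-product matrix (assumed)

# Counterfactual pairs (e.g., "king"/"queen"-style edits): the concept
# direction is the average difference between the paired representations.
base = rng.normal(size=(n_pairs, d))
concept_true = rng.normal(size=d)
counterfactual = base + concept_true + 0.1 * rng.normal(size=(n_pairs, d))
direction = (counterfactual - base).mean(axis=0)

def inner(x, y):
    return x @ A @ y

# Probing and steering both reduce to projections under this inner product.
cos = inner(direction, concept_true) / np.sqrt(
    inner(direction, direction) * inner(concept_true, concept_true))
print(f"alignment with true concept direction: {cos:.3f}")
```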

Machine Learning Parameterization of the Multi-scale Kain-Fritsch (MSKF) Convection Scheme

  • paper_url: http://arxiv.org/abs/2311.03652
  • repo_url: None
  • paper_authors: Xiaohui Zhong, Xing Yu, Hao Li
  • for: This paper aims to improve the representation of convective transport in high-resolution numerical weather prediction (NWP) models, specifically in the gray zone where the grid spacing is comparable to the length scales of convection.
  • methods: The authors use a multi-scale Kain-Fritsch (MSKF) scheme and a multi-output bidirectional long short-term memory (Bi-LSTM) model to represent convective transport in the gray zone, and compare the performance of the Bi-LSTM model with the original MSKF scheme in the WRF model.
  • results: The Bi-LSTM model achieves high accuracy in representing convective transport in the gray zone, indicating the potential use of machine learning (ML) models to substitute physical parameterizations in NWP models.
    Abstract Warm-sector heavy rainfall often occurs along the coast of South China, and it is usually localized and long-lasting, making it challenging to predict. High-resolution numerical weather prediction (NWP) models are increasingly used to better resolve topographic features and forecast such high-impact weather events. However, when the grid spacing becomes comparable to the length scales of convection, known as the gray zone, the turbulent eddies in the atmospheric boundary layer are only partially resolved and parameterized to some extent. Whether using a convection parameterization (CP) scheme in the gray zone remains controversial. Scale-aware CP schemes are developed to enhance the representation of convective transport within the gray zone. The multi-scale Kain-Fritsch (MSKF) scheme includes modifications that allow for its effective implementation at a grid resolution as high as 2 km. In recent years, there has been an increasing application of machine learning (ML) models to various domains of atmospheric sciences, including the replacement of physical parameterizations with ML models. This work proposes a multi-output bidirectional long short-term memory (Bi-LSTM) model as a replace the scale-aware MSKF CP scheme. The Weather Research and Forecast (WRF) model is used to generate training and testing data over South China at a horizontal resolution of 5 km. Furthermore, the WRF model is coupled with the ML based CP scheme and compared with WRF simulations with original MSKF scheme. The results demonstrate that the Bi-LSTM model can achieve high accuracy, indicating the potential use of ML models to substitute the MSKF scheme in the gray zone.
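
A minimal sketch of a multi-output bidirectional LSTM emulator of the kind described, treating the vertical column as the sequence dimension. The input/output variable counts, network sizes, and number of levels are placeholders, not the paper's configuration.

```python
# A Bi-LSTM that maps per-level atmospheric inputs to per-level tendencies,
# reading the vertical column as a sequence in both directions.
import torch
import torch.nn as nn

class BiLSTMEmulator(nn.Module):
    def __init__(self, n_in=8, n_out=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_in, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # One linear head producing all output variables (multi-output).
        self.head = nn.Linear(2 * hidden, n_out)

    def forward(self, x):                  # x: (batch, levels, n_in)
        h, _ = self.lstm(x)                # h: (batch, levels, 2 * hidden)
        return self.head(h)                # (batch, levels, n_out) tendencies

model = BiLSTMEmulator()
profile = torch.randn(32, 50, 8)           # a batch of 32 columns, 50 levels
print(model(profile).shape)                # torch.Size([32, 50, 4])
```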

SeRO: Self-Supervised Reinforcement Learning for Recovery from Out-of-Distribution Situations

  • paper_url: http://arxiv.org/abs/2311.03651
  • repo_url: https://github.com/snuchankim/sero
  • paper_authors: Chan Kim, Jaekyung Cho, Christophe Bobda, Seung-Woo Seo, Seong-Woo Kim
  • for: Addressing the problem that robotic agents trained with reinforcement learning take unreliable actions in out-of-distribution (OOD) states.
  • methods: Proposes a self-supervised method for retraining agents so that, when they fall into OOD states, they recover by returning to the learned state distribution.
  • results: Experiments show the method substantially improves the agent's ability to recover from OOD situations in terms of sample efficiency and restoration of performance on the original task, even when in-distribution states are difficult to visit through exploration.
    Abstract Robotic agents trained using reinforcement learning have the problem of taking unreliable actions in an out-of-distribution (OOD) state. Agents can easily become OOD in real-world environments because it is almost impossible for them to visit and learn the entire state space during training. Unfortunately, unreliable actions do not ensure that agents perform their original tasks successfully. Therefore, agents should be able to recognize whether they are in OOD states and learn how to return to the learned state distribution rather than continue to take unreliable actions. In this study, we propose a novel method for retraining agents to recover from OOD situations in a self-supervised manner when they fall into OOD states. Our in-depth experimental results demonstrate that our method substantially improves the agent's ability to recover from OOD situations in terms of sample efficiency and restoration of the performance for the original tasks. Moreover, we show that our method can retrain the agent to recover from OOD situations even when in-distribution states are difficult to visit through exploration.

HKTGNN: Hierarchical Knowledge Transferable Graph Neural Network-based Supply Chain Risk Assessment

  • paper_url: http://arxiv.org/abs/2311.04244
  • repo_url: None
  • paper_authors: Zhanting Zhou, Kejun Bi, Yuyanzhen Zhong, Chao Tang, Dongfen Li, Shi Ying, Ruijin Wang
  • for: Providing a graph-embedding-based supply chain risk assessment model that helps businesses manage and mitigate potential risks in the supply chain.
  • methods: A Hierarchical Knowledge Transferable Graph Neural Network (HKTGNN) that embeds the supply chain network corresponding to each individual product, reducing the complex supply chain network to a directed homogeneous graph with only product nodes; a centrality-based domain-difference knowledge transferable module addresses biased feature characteristics, while feature complement and message passing alleviate the data hunger problem.
  • results: Experiments show that the proposed HKTGNN model outperforms traditional knowledge-inference methods in assessing supply chain risk on a real-world dataset; the authors also give an equation to show that their comparative experiment is effective and fair.
    Abstract The strength of a supply chain is an important measure of a country's or region's technical advancement and overall competitiveness. Establishing supply chain risk assessment models for effective management and mitigation of potential risks has become increasingly crucial. As the number of businesses grows, the important relationships become more complicated and difficult to measure. This emphasizes the need of extracting relevant information from graph data. Previously, academics mostly employed knowledge inference to increase the visibility of links between nodes in the supply chain. However, they have not solved the data hunger problem of single node feature characteristics. We propose a hierarchical knowledge transferable graph neural network-based (HKTGNN) supply chain risk assessment model to address these issues. Our approach is based on current graph embedding methods for assessing corporate investment risk assessment. We embed the supply chain network corresponding to individual goods in the supply chain using the graph embedding module, resulting in a directed homogeneous graph with just product nodes. This reduces the complicated supply chain network into a basic product network. It addresses difficulties using the domain difference knowledge transferable module based on centrality, which is presented by the premise that supply chain feature characteristics may be biased in the actual world. Meanwhile, the feature complement and message passing will alleviate the data hunger problem, which is driven by domain differences. Our model outperforms in experiments on a real-world supply chain dataset. We will give an equation to prove that our comparative experiment is both effective and fair.

Analysis of the User Perception of Chatbots in Education Using A Partial Least Squares Structural Equation Modeling Approach

  • paper_url: http://arxiv.org/abs/2311.03636
  • repo_url: None
  • paper_authors: Md Rabiul Hasan, Nahian Ismail Chowdhury, Md Hadisur Rahman, Md Asif Bin Syed, JuHyeong Ryu
  • for: Investigating students' adoption of chatbots in education, filling a knowledge gap in existing research on behavior-related factors.
  • methods: Uses Partial Least Squares Structural Equation Modeling (PLS-SEM) to study chatbot adoption in education, incorporating the Technology Readiness Index (TRI) and the Technology Acceptance Model (TAM); data were collected on a five-point Likert scale, yielding 185 responses analyzed with R-Studio.
  • results: Optimism and Innovativeness are positively associated with Perceived Ease of Use (PEOU) and Perceived Usefulness (PU), whereas Discomfort and Insecurity negatively impact PEOU and only Insecurity negatively affects PU; these findings can guide future technology designers on the key user-behavior factors influencing chatbot adoption in educational contexts.
    Abstract The integration of Artificial Intelligence (AI) into education is a recent development, with chatbots emerging as a noteworthy addition to this transformative landscape. As online learning platforms rapidly advance, students need to adapt swiftly to excel in this dynamic environment. Consequently, understanding the acceptance of chatbots, particularly those employing Large Language Model (LLM) such as Chat Generative Pretrained Transformer (ChatGPT), Google Bard, and other interactive AI technologies, is of paramount importance. However, existing research on chatbots in education has overlooked key behavior-related aspects, such as Optimism, Innovativeness, Discomfort, Insecurity, Transparency, Ethics, Interaction, Engagement, and Accuracy, creating a significant literature gap. To address this gap, this study employs Partial Least Squares Structural Equation Modeling (PLS-SEM) to investigate the determinant of chatbots adoption in education among students, considering the Technology Readiness Index (TRI) and Technology Acceptance Model (TAM). Utilizing a five-point Likert scale for data collection, we gathered a total of 185 responses, which were analyzed using R-Studio software. We established 12 hypotheses to achieve its objectives. The results showed that Optimism and Innovativeness are positively associated with Perceived Ease of Use (PEOU) and Perceived Usefulness (PU). Conversely, Discomfort and Insecurity negatively impact PEOU, with only Insecurity negatively affecting PU. These findings provide insights for future technology designers, elucidating critical user behavior factors influencing chatbots adoption and utilization in educational contexts.

TWIST: Teacher-Student World Model Distillation for Efficient Sim-to-Real Transfer

  • paper_url: http://arxiv.org/abs/2311.03622
  • repo_url: None
  • paper_authors: Jun Yamada, Marc Rigter, Jack Collins, Ingmar Posner
  • for: Proposing a model-based RL approach for vision-based tasks on real robots that improves the sim-to-real transfer of learned world models.
  • methods: Uses distillation for sim-to-real transfer: a teacher world model is trained efficiently on privileged state observations from the simulator and then supervises a student world model that takes domain-randomised image observations as input.
  • results: Experiments in simulated and real robotics tasks show the approach outperforms naive domain randomisation and model-free RL methods in terms of sample efficiency and task performance of sim-to-real transfer.
    Abstract Model-based RL is a promising approach for real-world robotics due to its improved sample efficiency and generalization capabilities compared to model-free RL. However, effective model-based RL solutions for vision-based real-world applications require bridging the sim-to-real gap for any world model learnt. Due to its significant computational cost, standard domain randomisation does not provide an effective solution to this problem. This paper proposes TWIST (Teacher-Student World Model Distillation for Sim-to-Real Transfer) to achieve efficient sim-to-real transfer of vision-based model-based RL using distillation. Specifically, TWIST leverages state observations as readily accessible, privileged information commonly garnered from a simulator to significantly accelerate sim-to-real transfer. Specifically, a teacher world model is trained efficiently on state information. At the same time, a matching dataset is collected of domain-randomised image observations. The teacher world model then supervises a student world model that takes the domain-randomised image observations as input. By distilling the learned latent dynamics model from the teacher to the student model, TWIST achieves efficient and effective sim-to-real transfer for vision-based model-based RL tasks. Experiments in simulated and real robotics tasks demonstrate that our approach outperforms naive domain randomisation and model-free methods in terms of sample efficiency and task performance of sim-to-real transfer.
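
A minimal sketch of the teacher-student setup: a teacher encoder trained on privileged simulator state supervises a student that encodes domain-randomised images. For brevity we reduce the distillation of the learned latent dynamics model to matching latents with an MSE loss; the architectures, sizes, and loss are our assumptions.

```python
# Teacher (state) -> student (image) latent distillation on paired sim data.
import torch
import torch.nn as nn

state_dim, latent_dim = 24, 32
teacher = nn.Sequential(nn.Linear(state_dim, 128), nn.ELU(),
                        nn.Linear(128, latent_dim))
student = nn.Sequential(                      # image encoder for 64x64 RGB
    nn.Conv2d(3, 32, 4, stride=2), nn.ELU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ELU(),
    nn.Flatten(), nn.Linear(64 * 14 * 14, latent_dim),
)
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

for _ in range(10):                           # stand-in for a matched dataset
    state = torch.randn(16, state_dim)        # privileged state from the sim
    image = torch.randn(16, 3, 64, 64)        # paired domain-randomised render
    with torch.no_grad():
        target = teacher(state)               # teacher latents supervise...
    loss = nn.functional.mse_loss(student(image), target)  # ...the student
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final distillation loss: {loss.item():.4f}")
```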

cs.CL - 2023-11-07

Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models

  • paper_url: http://arxiv.org/abs/2311.04378
  • repo_url: https://github.com/hlzhang109/impossibility-watermark
  • paper_authors: Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak
  • for: Studying the (im)possibility of strong watermarking schemes for generative models.
  • methods: A generic, efficient watermark attack based on two assumptions: access to a "quality oracle" and a "perturbation oracle".
  • results: Proves that strong watermarking is impossible to achieve under well-specified and natural assumptions, even in the private detection algorithm setting; the attack successfully removes the watermarks planted by three existing watermarking schemes for large language models, with only minor quality degradation.
    Abstract Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
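
A sketch of the generic attack loop implied by the abstract: a random walk over high-quality outputs driven by the perturbation oracle, keeping only steps the quality oracle approves. Both oracles are the abstract's stated assumptions and are stubbed out here; in the private-detection setting the attacker cannot query the detector, so the walk simply runs long enough for the random walk to mix away from the watermarked set.

```python
# The attacker's random walk: repeatedly perturb the text, accept only
# quality-preserving steps, and rely on mixing to erase the watermark.
def quality_oracle(prompt: str, text: str) -> bool:    # assumption (1)
    raise NotImplementedError

def perturbation_oracle(text: str) -> str:             # assumption (2)
    raise NotImplementedError

def erase_watermark(prompt: str, text: str, steps: int = 10_000) -> str:
    accepted = 0
    while accepted < steps:
        candidate = perturbation_oracle(text)
        if quality_oracle(prompt, candidate):           # stay in the
            text = candidate                            # high-quality region
            accepted += 1
    return text   # after enough mixing steps, the watermark signal is gone
```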

Evaluating multiple large language models in pediatric ophthalmology

  • paper_url: http://arxiv.org/abs/2311.04368
  • repo_url: None
  • paper_authors: Jason Holmes, Rui Peng, Yiwei Li, Jinyu Hu, Zhengliang Liu, Zihao Wu, Huan Zhao, Xi Jiang, Wei Liu, Hong Wei, Jie Zou, Tianming Liu, Yi Shao
    for: The paper aims to evaluate the performance of large language models (LLMs) in pediatric ophthalmology consultations and compare their performance with medical students and physicians at different levels.methods: The study uses a 100-question exam based on pediatric ophthalmology to assess the performance of three LLMs (ChatGPT, GPT-4, and PaLM2) and three human cohorts (medical students, postgraduate students, and attending physicians).results: GPT-4 performed comparably to attending physicians, while ChatGPT (GPT-3.5) and PaLM2 outperformed medical students but slightly trailed behind postgraduate students. GPT-4 also exhibited greater stability and confidence when responding to inquiries compared to ChatGPT (GPT-3.5) and PaLM2.
    Abstract IMPORTANCE The response effectiveness of different large language models (LLMs) and various individuals, including medical students, graduate students, and practicing physicians, in pediatric ophthalmology consultations, has not been clearly established yet. OBJECTIVE Design a 100-question exam based on pediatric ophthalmology to evaluate the performance of LLMs in highly specialized scenarios and compare them with the performance of medical students and physicians at different levels. DESIGN, SETTING, AND PARTICIPANTS This survey study assessed three LLMs, namely ChatGPT (GPT-3.5), GPT-4, and PaLM2, were assessed alongside three human cohorts: medical students, postgraduate students, and attending physicians, in their ability to answer questions related to pediatric ophthalmology. It was conducted by administering questionnaires in the form of test papers through the LLM network interface, with the valuable participation of volunteers. MAIN OUTCOMES AND MEASURES Mean scores of LLM and humans on 100 multiple-choice questions, as well as the answer stability, correlation, and response confidence of each LLM. RESULTS GPT-4 performed comparably to attending physicians, while ChatGPT (GPT-3.5) and PaLM2 outperformed medical students but slightly trailed behind postgraduate students. Furthermore, GPT-4 exhibited greater stability and confidence when responding to inquiries compared to ChatGPT (GPT-3.5) and PaLM2. CONCLUSIONS AND RELEVANCE Our results underscore the potential for LLMs to provide medical assistance in pediatric ophthalmology and suggest significant capacity to guide the education of medical students.
    摘要 重要性:不同的大语言模型(LLM)和各种个人,包括医学生、硬件硬件学生和实践医生,在педиatriatic ophthalmology的咨询中的回应效果没有得到明确的确定。目标:设计一份100题测试,用于评估不同的LLM在高度特殊化的 scenarios中的表现,并与医学生和医生在不同水平的表现进行比较。设计、场景和参与者:这项调查研究对三个LLM进行评估,namely ChatGPT(GPT-3.5)、GPT-4和PaLM2,并与三个人类凝聚体进行比较:医学生、硬件硬件学生和实践医生。通过LLM网络接口分发测试纸,获得了志愿者的参与。主要结果和度量:对100个多选题的平均分、LLM和人类的回答稳定性、相关性和回答自信度。结论和意义:我们的结果表明LLM可以在педиatriatic ophthalmology中提供医疗协助,并 suggeSTs signifiCant capacity to guide medical education。

Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments

  • paper_url: http://arxiv.org/abs/2311.04364
  • repo_url: https://github.com/hlr/syntax-guided-transformers
  • paper_authors: Danial Kamali, Parisa Kordjamshidi
  • for: This paper aims to improve the ability of intelligent models to generalize to novel compositions in multimodal environments, by leveraging syntactic structure and attention masking techniques.
  • methods: The paper introduces and evaluates the effectiveness of using syntactic information in the multimodal grounding problem, specifically through dependency parsing and Weight Sharing across the Transformer encoder.
  • results: The results show that incorporating syntactic information into the grounding process leads to improved performance on diverse tasks, pushing the state-of-the-art in multimodal grounding and parameter-efficient modeling.
    Abstract Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet in AI research, especially within multimodal environments. In this work, we address this challenge by exploiting the syntactic structure of language to boost compositional generalization. This paper elevates the importance of syntactic grounding, particularly through attention masking techniques derived from text input parsing. We introduce and evaluate the merits of using syntactic information in the multimodal grounding problem. Our results on grounded compositional generalization underscore the positive impact of dependency parsing across diverse tasks when utilized with Weight Sharing across the Transformer encoder. The results push the state-of-the-art in multimodal grounding and parameter-efficient modeling and provide insights for future research.
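
A toy sketch of one way syntactic grounding can enter a Transformer: build an attention mask from a dependency parse so that each token attends to itself and along its dependency arcs. The sentence, the head indices, and the masking rule are illustrative; the paper's attention-masking technique derived from parsed text input may differ.

```python
# Turn a dependency parse (head index per token) into an attention bias:
# disallowed positions receive -inf before the softmax.
import numpy as np

tokens = ["the", "red", "ball", "near", "the", "box"]
heads = [2, 2, 2, 2, 5, 3]          # head index per token ("ball" is the root)

n = len(tokens)
mask = np.eye(n, dtype=bool)         # every token attends to itself...
for i, h in enumerate(heads):
    mask[i, h] = mask[h, i] = True   # ...and along dependency arcs, both ways

attn_bias = np.where(mask, 0.0, -np.inf)
print(attn_bias)
```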

Uncovering Causal Variables in Transformers using Circuit Probing

  • paper_url: http://arxiv.org/abs/2311.04354
  • repo_url: https://github.com/mlepori1/circuit_probing
  • paper_authors: Michael A. Lepori, Thomas Serre, Ellie Pavlick
  • for: Interpreting the algorithms implemented by neural network models, in order to better understand how they carry out their computations.
  • methods: Proposes a new analysis technique, circuit probing, that automatically uncovers low-level circuits computing hypothesized intermediate variables, enabling causal analysis through targeted ablation at the level of model parameters.
  • results: Applied to models trained on simple arithmetic tasks, circuit probing deciphers the algorithms the models have learned, reveals modular structure within a model, and tracks the development of circuits over training; it is on par with or more effective than existing analysis methods, and also works on a real-world use case.
    Abstract Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. In order to understand these algorithms, it is often necessary to hypothesize intermediate variables involved in the network's computation. For example, does a language model depend on particular syntactic properties when generating a sentence? However, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique -- circuit probing -- that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. We compare circuit probing to other methods across these three experiments, and find it on par or more effective than existing analysis methods. Finally, we demonstrate circuit probing on a real-world use case, uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and Medium.
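
A toy sketch of the core move as we read the abstract: learn a mask over a frozen weight matrix so that the masked sub-network's activations encode a hypothesized variable, then ablate the discovered circuit for causal analysis. The probing objective, sparsity penalty, and model below are placeholders, not the paper's exact procedure.

```python
# Learn a soft binary mask selecting a "circuit" whose activations predict a
# hypothesized intermediate variable; then ablate that circuit.
import torch
import torch.nn as nn

torch.manual_seed(0)
W = torch.randn(32, 32)                      # frozen weights of a trained layer
mask_logits = nn.Parameter(torch.zeros(32, 32))
opt = torch.optim.Adam([mask_logits], lr=1e-2)

x = torch.randn(256, 32)                     # layer inputs
z = (x[:, 0] > 0).float()                    # hypothesized causal variable

for _ in range(200):
    m = torch.sigmoid(mask_logits)           # soft mask in [0, 1]
    h = x @ (W * m).T                        # activations of the masked circuit
    logits = h.mean(dim=1)                   # toy probe readout
    # Probe loss: circuit activations should predict z; L1 keeps the mask sparse.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, z) \
           + 1e-3 * m.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

circuit = torch.sigmoid(mask_logits) > 0.5   # hard-threshold the learned mask
ablated = W * (~circuit)                     # targeted ablation of the circuit
print(f"circuit size: {circuit.sum().item()} of {W.numel()} weights")
```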

Formal Aspects of Language Modeling

  • paper_url: http://arxiv.org/abs/2311.04329
  • repo_url: https://github.com/Gninos/CIM-With-Transition-Systems
  • paper_authors: Ryan Cotterell, Anej Svete, Clara Meister, Tianyu Liu, Li Du
  • for: Examining the mathematical foundations of large language models and how to implement them.
  • methods: A formal, theoretical treatment of what constitutes a language model.
  • results: Provides a theoretical account that helps developers and researchers better understand the mathematical foundations of large language models and how to implement them.
    Abstract Large language models have become one of the most commonly deployed NLP inventions. In the past half-decade, their integration into core natural language processing tools has dramatically increased the performance of such tools, and they have entered the public discourse surrounding artificial intelligence. Consequently, it is important for both developers and researchers alike to understand the mathematical foundations of large language models, as well as how to implement them. These notes are the accompaniment to the theoretical portion of the ETH Zürich course on large language models, covering what constitutes a language model from a formal, theoretical perspective.
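
As a flavor of the formal perspective, a language model can be defined as a probability distribution over finite strings. One standard autoregressive factorization with a distinguished end-of-string symbol (not necessarily the notes' exact formulation) is:

$$
p(\boldsymbol{y}) \;=\; p(\mathrm{EOS} \mid \boldsymbol{y}) \prod_{t=1}^{|\boldsymbol{y}|} p(y_t \mid \boldsymbol{y}_{<t}),
\qquad
\sum_{\boldsymbol{y} \in \Sigma^{*}} p(\boldsymbol{y}) = 1,
$$

where $\Sigma^{*}$ is the set of all finite strings over the vocabulary $\Sigma$; whether the sum actually equals one (tightness) is exactly the kind of property a formal treatment makes precise.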

Aspect-based Meeting Transcript Summarization: A Two-Stage Approach with Weak Supervision on Sentence Classification

  • paper_url: http://arxiv.org/abs/2311.04292
  • repo_url: None
  • paper_authors: Zhongfen Deng, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Quan Hung Tran, Shuaiqi Liu, Wenting Zhao, Tao Zhang, Yibo Wang, Philip S. Yu
  • for: aspect-based meeting transcript summarization, producing multiple summaries of a meeting, each focused on one aspect of its content
  • methods: a two-stage approach: a sentence classifier, trained with pseudo-labeling on a dataset constructed from the AMI corpus, selects aspect-relevant sentences, which are then merged as input to a summarizer
  • results: experiments on the AMI corpus outperform many strong baselines
    Abstract Aspect-based meeting transcript summarization aims to produce multiple summaries, each focusing on one aspect of content in a meeting transcript. It is challenging as sentences related to different aspects can mingle together, and those relevant to a specific aspect can be scattered throughout the long transcript of a meeting. The traditional summarization methods produce one summary mixing information of all aspects, which cannot deal with the above challenges of aspect-based meeting transcript summarization. In this paper, we propose a two-stage method for aspect-based meeting transcript summarization. To select the input content related to specific aspects, we train a sentence classifier on a dataset constructed from the AMI corpus with pseudo-labeling. Then we merge the sentences selected for a specific aspect as the input for the summarizer to produce the aspect-based summary. Experimental results on the AMI corpus outperform many strong baselines, which verifies the effectiveness of our proposed method.

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study

  • paper_url: http://arxiv.org/abs/2311.04199
  • repo_url: None
  • paper_authors: Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie, Yining Hua, Jaeboum Kim
  • for: This study explores the potential of Large Multimodal Models (LMMs) for recommendation tasks and assesses GPT-4V's recommendation capabilities across different domains.
  • methods: GPT-4V is applied to recommendation tasks, and qualitative test samples spanning multiple domains are used to assess its response quality.
  • results: GPT-4V performs remarkably well on recommendation tasks across multiple domains and can provide diverse responses; however, the study also finds that it sometimes gives similar responses when given similar inputs.
    Abstract Large Multimodal Models (LMMs) have demonstrated impressive performance across various vision and language tasks, yet their potential applications in recommendation tasks with visual assistance remain unexplored. To bridge this gap, we present a preliminary case study investigating the recommendation capabilities of GPT-4V(ision), a recently released LMM by OpenAI. We construct a series of qualitative test samples spanning multiple domains and employ these samples to assess the quality of GPT-4V's responses within recommendation scenarios. Evaluation results on these test samples prove that GPT-4V has remarkable zero-shot recommendation abilities across diverse domains, thanks to its robust visual-text comprehension capabilities and extensive general knowledge. However, we have also identified some limitations in using GPT-4V for recommendations, including a tendency to provide similar responses when given similar inputs. This report concludes with an in-depth discussion of the challenges and research opportunities associated with utilizing GPT-4V in recommendation scenarios. Our objective is to explore the potential of extending LMMs from vision and language tasks to recommendation tasks. We hope to inspire further research into next-generation multimodal generative recommendation models, which can enhance user experiences by offering greater diversity and interactivity. All images and prompts used in this report will be accessible at https://github.com/PALIN2018/Evaluate_GPT-4V_Rec.

JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models

  • paper_url: http://arxiv.org/abs/2311.04192
  • repo_url: https://github.com/keio-smilab23/JaSPICE
  • paper_authors: Yuiga Wada, Kanta Kaneda, Komei Sugiura
  • for: This work proposes an automatic evaluation metric for Japanese captions, aiming for better correlation with human evaluation than existing automatic metrics.
  • methods: Scene graphs are generated from dependencies and predicate-argument structures, and the graphs are extended with synonyms.
  • results: Experiments show that the proposed metric achieves a higher correlation coefficient with human evaluation than the baseline metrics.
    Abstract Image captioning studies heavily rely on automatic evaluation metrics such as BLEU and METEOR. However, such n-gram-based metrics have been shown to correlate poorly with human evaluation, leading to the proposal of alternative metrics such as SPICE for English; however, no equivalent metrics have been established for other languages. Therefore, in this study, we propose an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs. The proposed method generates a scene graph from dependencies and the predicate-argument structure, and extends the graph using synonyms. We conducted experiments employing 10 image captioning models trained on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which contains 103,170 human evaluations. The results showed that our metric outperformed the baseline metrics for the correlation coefficient with the human evaluation.
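
Like SPICE, JaSPICE ultimately scores a caption by matching scene-graph tuples between candidate and reference. A minimal sketch of that matching step, assuming the tuples have already been extracted (the Japanese dependency parsing, predicate-argument analysis, and real synonym tables are out of scope here; the synonym lookup below is a stand-in):

```python
def scene_graph_f1(candidate, reference, synonyms=None):
    """F1 over scene-graph tuples; a candidate tuple matches a reference
    tuple if every element is equal or a registered synonym."""
    synonyms = synonyms or {}

    def same(a, b):
        return a == b or b in synonyms.get(a, ()) or a in synonyms.get(b, ())

    def matched(t, pool):
        return any(len(t) == len(r) and all(same(x, y) for x, y in zip(t, r))
                   for r in pool)

    p = sum(matched(t, reference) for t in candidate) / max(len(candidate), 1)
    r = sum(matched(t, candidate) for t in reference) / max(len(reference), 1)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Tuples such as ("dog", "run") or ("dog", "on", "grass"); in JaSPICE the
# graphs come from dependencies and predicate-argument structures.
score = scene_graph_f1([("dog", "run")], [("dog", "run"), ("dog", "on", "grass")])
```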

SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish

  • paper_url: http://arxiv.org/abs/2311.04189
  • repo_url: None
  • paper_authors: Yevhen Kostiuk, Grigori Sidorov, Olga Kolesnikova
  • for: This paper provides a large annotated dataset of Spanish verb-noun collocations, together with the sentences in which they occur, for hierarchical classification of lexical functions.
  • methods: The dataset is built by dependency tree parsing and matching of phrases in Spanish news.
  • results: The paper presents a tree-based structure of 37 lexical function classes for hierarchically classifying Spanish verb-noun collocations, along with baselines and data splits for each objective.
    Abstract In natural language processing (NLP), lexical function is a concept to unambiguously represent semantic and syntactic features of words and phrases in text first crafted in the Meaning-Text Theory. Hierarchical classification of lexical functions involves organizing these features into a tree-like hierarchy of categories or labels. This is a challenging task as it requires a good understanding of the context and the relationships among words and phrases in text. It also needs large amounts of labeled data to train language models effectively. In this paper, we present a dataset of most frequent Spanish verb-noun collocations and sentences where they occur, each collocation is assigned to one of 37 lexical functions defined as classes for a hierarchical classification task. Each class represents a relation between the noun and the verb in a collocation involving their semantic and syntactic features. We combine the classes in a tree-based structure, and introduce classification objectives for each level of the structure. The dataset was created by dependency tree parsing and matching of the phrases in Spanish news. We provide baselines and data splits for each objective.

Perturbed examples reveal invariances shared by language models

  • paper_url: http://arxiv.org/abs/2311.04166
  • repo_url: None
  • paper_authors: Ruchit Rawal, Mariya Toneva
  • for: This work aims to compare two natural language processing models by revealing the invariances they share under interpretable input perturbations.
  • methods: A novel comparison framework is proposed in which perturbations are designed to target specific linguistic capabilities (e.g., Synonym-Invariance, Typo-Invariance). Experiments show the framework can assess invariances shared between models within and across architecture families, as well as between commercial black-box API models (e.g., the InstructGPT family) and better-understood models.
  • results: Large language models share many of the invariances encoded by models of various sizes, whereas the invariances encoded by large models are only shared by other large models. Possessing a wide variety of invariances may be a key factor behind the recent successes of large language models, and the framework can shed light on the invariances retained by or emerging in new models.
    Abstract An explosion of work in language is leading to ever-increasing numbers of available natural language processing models, with little understanding of how new models compare to better-understood models. One major reason for this difficulty is saturating benchmark datasets, which may not reflect well differences in model performance in the wild. In this work, we propose a novel framework for comparing two natural language processing models by revealing their shared invariance to interpretable input perturbations that are designed to target a specific linguistic capability (e.g., Synonym-Invariance, Typo-Invariance). Via experiments on models from within the same and across different architecture families, this framework offers a number of insights about how changes in models (e.g., distillation, increase in size, amount of pre-training) affect multiple well-defined linguistic capabilities. Furthermore, we also demonstrate how our framework can enable evaluation of the invariances shared between models that are available as commercial black-box APIs (e.g., InstructGPT family) and models that are relatively better understood (e.g., GPT-2). Across several experiments, we observe that large language models share many of the invariances encoded by models of various sizes, whereas the invariances encoded by large language models are only shared by other large models. Possessing a wide variety of invariances may be a key reason for the recent successes of large language models, and our framework can shed light on the types of invariances that are retained by or emerge in new models.
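
The framework boils down to checking whether two models respond the same way to the same capability-targeted perturbation. A schematic sketch, assuming models are callables returning labels and using a toy synonym lexicon as the perturbation (both are placeholders, not the paper's implementation):

```python
def shared_invariance(model_a, model_b, texts, perturb):
    """For each input, check whether each model's prediction survives the
    perturbation; report per-model invariance rates and the shared rate."""
    inv_a = inv_b = shared = 0
    for text in texts:
        perturbed = perturb(text)
        a_invariant = model_a(text) == model_a(perturbed)
        b_invariant = model_b(text) == model_b(perturbed)
        inv_a += a_invariant
        inv_b += b_invariant
        shared += a_invariant and b_invariant
    n = max(len(texts), 1)
    return inv_a / n, inv_b / n, shared / n

# Toy Synonym-Invariance perturbation with a placeholder lexicon.
SYNONYMS = {"movie": "film", "great": "excellent"}
def synonym_perturb(text: str) -> str:
    return " ".join(SYNONYMS.get(word, word) for word in text.split())
```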

Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

  • paper_url: http://arxiv.org/abs/2311.04155
  • repo_url: https://github.com/thu-coai/bpo
  • paper_authors: Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
  • for: Improve how well large language models (LLMs) follow user instructions, without updating the LLMs' parameters.
  • methods: Black-Box Prompt Optimization (BPO), which optimizes user prompts to suit the LLM's input understanding so as to best realize the user's intent.
  • results: BPO increases ChatGPT's win rate by 22% and GPT-4's by 10%. Moreover, BPO-aligned models can outperform the same models aligned by PPO or DPO, and combining BPO with PPO or DPO brings additional performance gains.
    Abstract Large language models (LLMs) have shown impressive success in various applications. However, these models are often not well aligned with human intents, which calls for additional treatments on them, that is, the alignment problem. To make LLMs better follow user instructions, existing alignment methods mostly focus on further training them. However, the extra training of LLMs are usually expensive in terms of GPU compute; worse still, LLMs of interest are oftentimes not accessible for user-demanded training, such as GPTs. In this work, we take a different perspective -- Black-Box Prompt Optimization (BPO) -- to perform alignments. The idea is to optimize user prompts to suit LLMs' input understanding, so as to best realize users' intents without updating LLMs' parameters. BPO is model-agnostic and the empirical results demonstrate that the BPO-aligned ChatGPT yields a 22% increase in the win rate against its original version, and 10% for GPT-4. Importantly, the BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and it also brings additional performance gains when combining BPO with PPO or DPO. Code and datasets are released at https://github.com/thu-coai/BPO.
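
Since BPO never touches the target model's parameters, deployment reduces to a prompt-rewriting step in front of the black-box call. A hedged sketch of that wrapper, assuming a seq2seq prompt optimizer trained on (original prompt, preferred rewritten prompt) pairs; the checkpoint name below is hypothetical, see the authors' repository for the released pipeline:

```python
from transformers import pipeline

# Hypothetical checkpoint: a seq2seq model trained to map a raw user
# prompt to a version the target LLM understands better, as BPO describes.
rewriter = pipeline("text2text-generation", model="my-org/bpo-prompt-optimizer")

def bpo_generate(user_prompt: str, call_llm) -> str:
    """Rewrite the prompt, then query the frozen black-box LLM."""
    optimized = rewriter(user_prompt, max_new_tokens=256)[0]["generated_text"]
    return call_llm(optimized)   # no parameter update ever touches the LLM
```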

What is Lost in Knowledge Distillation?

  • paper_url: http://arxiv.org/abs/2311.04142
  • repo_url: None
  • paper_authors: Manas Mohanty, Tanya Roosta, Peyman Passban
  • for: This paper studies how knowledge distillation (KD) affects model compression, whether the compression process causes information loss, and whether such loss follows a specific pattern.
  • methods: Models are compressed with knowledge distillation, and the distilled students are compared with their teachers to check whether compression causes information loss.
  • results: Compression can cause information loss, and the degree of loss is related to the complexity of the compressed model. Furthermore, some tasks are more sensitive to compression while others are more robust to it.
    Abstract Deep neural networks (DNNs) have improved NLP tasks significantly, but training and maintaining such networks could be costly. Model compression techniques, such as, knowledge distillation (KD), have been proposed to address the issue; however, the compression process could be lossy. Motivated by this, our work investigates how a distilled student model differs from its teacher, if the distillation process causes any information losses, and if the loss follows a specific pattern. Our experiments aim to shed light on the type of tasks might be less or more sensitive to KD by reporting data points on the contribution of different factors, such as the number of layers or attention heads. Results such as ours could be utilized when determining effective and efficient configurations to achieve optimal information transfers between larger (teacher) and smaller (student) models.

Modelling Sentiment Analysis: LLMs and data augmentation techniques

  • paper_url: http://arxiv.org/abs/2311.04139
  • repo_url: None
  • paper_authors: Guillem Senabre Prades
  • for: This work presents approaches for binary sentiment classification on a small training dataset, aiming for high accuracy despite limited training data.
  • methods: LLMs that have provided state-of-the-art results in sentiment analysis and similar domains are used, including BERT, RoBERTa, and XLNet.
  • results: LLM-based techniques can achieve high accuracy even with a small training set, improving sentiment analysis performance.
    Abstract This paper provides different approaches for a binary sentiment classification on a small training dataset. LLMs that provided state-of-the-art results in sentiment analysis and similar domains are being used, such as BERT, RoBERTa and XLNet.

Personality Style Recognition via Machine Learning: Identifying Anaclitic and Introjective Personality Styles from Patients’ Speech

  • paper_url: http://arxiv.org/abs/2311.04088
  • repo_url: None
  • paper_authors: Semere Kiros Bitew, Vincent Schelstraete, Klim Zaporojets, Kimberly Van Nieuwenhove, Reitske Meganck, Chris Develder
    for:The paper aims to investigate the possibility of automatically inferring personality types from speech utterances, with the goal of improving the accuracy of personality classification in psychopathology.methods:The authors use natural language processing (NLP) techniques and machine learning algorithms to analyze clinical diagnostic interviews (CDI) recorded from a sample of 79 patients diagnosed with major depressive disorder (MDD). They explore various linguistic features associated with each personality style and develop automatic classifiers based on standardized questionnaire responses, basic text features, advanced text features using LIWC, and audio features.results:The authors find that automated classification with language-derived features (based on LIWC) significantly outperforms questionnaire-based classification models. The best performance is achieved by combining LIWC with the questionnaire features, suggesting that more work should be put into developing linguistically based automated techniques for characterizing personality, while questionnaires still have some complementary value.
    Abstract In disentangling the heterogeneity observed in psychopathology, personality of the patients is considered crucial. While it has been demonstrated that personality traits are reflected in the language used by a patient, we hypothesize that this enables automatic inference of the personality type directly from speech utterances, potentially more accurately than through a traditional questionnaire-based approach explicitly designed for personality classification. To validate this hypothesis, we adopt natural language processing (NLP) and standard machine learning tools for classification. We test this on a dataset of recorded clinical diagnostic interviews (CDI) on a sample of 79 patients diagnosed with major depressive disorder (MDD) -- a condition for which differentiated treatment based on personality styles has been advocated -- and classified into anaclitic and introjective personality styles. We start by analyzing the interviews to see which linguistic features are associated with each style, in order to gain a better understanding of the styles. Then, we develop automatic classifiers based on (a) standardized questionnaire responses; (b) basic text features, i.e., TF-IDF scores of words and word sequences; (c) more advanced text features, using LIWC (linguistic inquiry and word count) and context-aware features using BERT (bidirectional encoder representations from transformers); (d) audio features. We find that automated classification with language-derived features (i.e., based on LIWC) significantly outperforms questionnaire-based classification models. Furthermore, the best performance is achieved by combining LIWC with the questionnaire features. This suggests that more work should be put into developing linguistically based automated techniques for characterizing personality, however questionnaires still to some extent complement such methods.
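
For the "basic text features" variant, the setup is a standard TF-IDF-plus-linear-classifier pipeline. A generic scikit-learn sketch, with toy stand-ins for the interview transcripts and style labels (the paper's stronger variants add LIWC, BERT-based, and audio features):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the transcribed CDI interviews and their style labels.
texts = ["transcript of interview one ...", "transcript of interview two ..."]
labels = ["anaclitic", "introjective"]

clf = Pipeline([
    # "words and word sequences" -> unigram and bigram TF-IDF features
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)   # with only 79 patients, cross-validation is the natural protocol
```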

Do LLMs exhibit human-like response biases? A case study in survey design

  • paper_url: http://arxiv.org/abs/2311.04076
  • repo_url: https://github.com/lindiatjuatja/biasmonkey
  • paper_authors: Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig
  • for: This paper studies whether language models (LLMs) can approximate human opinions and whether LLMs are affected by how questions are worded.
  • methods: The authors use survey design, where human response biases have been studied extensively, as a case study, and propose a framework to evaluate whether LLMs exhibit human-like response biases.
  • results: Popular open and commercial LLMs generally fail to mimic human-like behavior, and the inconsistencies are more pronounced in models that have been instruction fine-tuned. Furthermore, even when a model shifts in the same direction as humans, perturbations not meant to elicit significant changes in humans can produce similar shifts, suggesting the result may be partially due to other spurious correlations. These findings highlight the potential pitfalls of substituting LLMs for humans in parts of the annotation pipeline and underscore the need for finer-grained characterizations of model behavior.
    Abstract As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely-cited barrier to the adoption of LLMs is their sensitivity to prompt wording -- but interestingly, humans also display sensitivities to instruction changes in the form of response biases. As such, we argue that if LLMs are going to be used to approximate human opinions, it is necessary to investigate the extent to which LLMs also reflect human response biases, if at all. In this work, we use survey design as a case study, where human response biases caused by permutations in wordings of ``prompts'' have been extensively studied. Drawing from prior work in social psychology, we design a dataset and propose a framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior. These inconsistencies tend to be more prominent in models that have been instruction fine-tuned. Furthermore, even if a model shows a significant change in the same direction as humans, we find that perturbations that are not meant to elicit significant changes in humans may also result in a similar change, suggesting that such a result could be partially due to other spurious correlations. These results highlight the potential pitfalls of using LLMs to substitute humans in parts of the annotation pipeline, and further underscore the importance of finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at https://github.com/lindiatjuatja/BiasMonkey

Fully Automated Task Management for Generation, Execution, and Evaluation: A Framework for Fetch-and-Carry Tasks with Natural Language Instructions in Continuous Space

  • paper_url: http://arxiv.org/abs/2311.04260
  • repo_url: None
  • paper_authors: Motonari Kambara, Komei Sugiura
  • for: This work develops a framework that enables a robot to execute Fetch-and-Carry with Object Grounding (FCOG) tasks based on visual information, in response to natural language instructions.
  • methods: The proposed framework fully automates the generation, execution, and evaluation of FCOG tasks; in addition, an approach is introduced that divides FCOG tasks into four distinct subtasks.
  • results: The framework can automatically generate, execute, and evaluate FCOG tasks across different environments, indicating that the approach helps robots better execute FCOG tasks.
    Abstract This paper aims to develop a framework that enables a robot to execute tasks based on visual information, in response to natural language instructions for Fetch-and-Carry with Object Grounding (FCOG) tasks. Although there have been many frameworks, they usually rely on manually given instruction sentences. Therefore, evaluations have only been conducted with fixed tasks. Furthermore, many multimodal language understanding models for the benchmarks only consider discrete actions. To address the limitations, we propose a framework for the full automation of the generation, execution, and evaluation of FCOG tasks. In addition, we introduce an approach to solving the FCOG tasks by dividing them into four distinct subtasks.

Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

  • paper_url: http://arxiv.org/abs/2311.04072
  • repo_url: None
  • paper_authors: Geyang Guo, Ranchi Zhao, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen
  • for: Improve the alignment of large language models (LLMs) with human preference.
  • methods: An improved alignment approach built on supervised fine-tuning (SFT) that exploits fine-grained (token- or phrase-level) quality signals.
  • results: Comparisons against a number of competitive baselines show the proposed method helps LLMs learn alignment more effectively.
    Abstract Alignment with human preference is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train, thus recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which cannot fully understand what are the expected behaviors. To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token or phrase level) quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments have demonstrated the effectiveness of our approaches by comparing a number of competitive baselines.
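
The abstract does not spell out the new loss function, so the following is a hedged sketch of one natural reading: a token-level cross-entropy weighted by fine-grained quality signals obtained from contrasting the initial and revised responses:

```python
import torch
import torch.nn.functional as F

def fine_grained_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                      token_weights: torch.Tensor) -> torch.Tensor:
    """Token-weighted cross-entropy. `token_weights` carries the fine-grained
    quality signal, e.g. higher weight on spans the revision changed and
    lower (or zero) weight elsewhere.
    logits: (batch, seq, vocab); target_ids, token_weights: (batch, seq)."""
    per_token = F.cross_entropy(logits.transpose(1, 2), target_ids,
                                reduction="none")        # (batch, seq)
    denom = token_weights.abs().sum().clamp(min=1e-8)
    return (token_weights * per_token).sum() / denom
```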

Implementation and Comparison of Methods to Extract Reliability KPIs out of Textual Wind Turbine Maintenance Work Orders

  • paper_url: http://arxiv.org/abs/2311.04064
  • repo_url: None
  • paper_authors: Marc-Alexander Lutz, Bastian Schäfermeier, Rachael Sexton, Michael Sharp, Alden Dima, Stefan Faulstich, Jagan Mohini Aluri
  • for: This paper aims to optimize wind turbine operation and maintenance by structuring and analyzing the information contained in maintenance work orders, so that reliability key performance indicators can be derived.
  • methods: Three approaches to calculating reliability key performance indicators are presented: manual labeling by domain experts, automatic labeling via text classification, and AI-assisted tagging.
  • results: All three methods make extracting maintenance information more efficient and enable the assessment of reliability key performance indicators. The results of the manual approach serve as a benchmark for the other two, with quality and time spent as the evaluation criteria.
    Abstract Maintenance work orders are commonly used to document information about wind turbine operation and maintenance. This includes details about proactive and reactive wind turbine downtimes, such as preventative and corrective maintenance. However, the information contained in maintenance work orders is often unstructured and difficult to analyze, making it challenging for decision-makers to use this information for optimizing operation and maintenance. To address this issue, this work presents three different approaches to calculate reliability key performance indicators from maintenance work orders. The first approach involves manual labeling of the maintenance work orders by domain experts, using the schema defined in an industrial guideline to assign the label accordingly. The second approach involves the development of a model that automatically labels the maintenance work orders using text classification methods. The third technique uses an AI-assisted tagging tool to tag and structure the raw maintenance information contained in the maintenance work orders. The resulting calculated reliability key performance indicator of the first approach are used as a benchmark for comparison with the results of the second and third approaches. The quality and time spent are considered as criteria for evaluation. Overall, these three methods make extracting maintenance information from maintenance work orders more efficient, enable the assessment of reliability key performance indicators and therefore support the optimization of wind turbine operation and maintenance.

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features

  • paper_url: http://arxiv.org/abs/2311.04046
  • repo_url: https://github.com/edoardopona/predicting-inductive-biases-rl
  • paper_authors: Diogo Cruz, Edoardo Pona, Alex Holness-Tofts, Elias Schmied, Víctor Abia Alonso, Charlie Griffin, Bogdan-Ionut Cirstea
  • for: This paper investigates whether the principles governing inductive biases in large language models (LLMs) also apply when fine-tuning uses reinforcement learning.
  • methods: Controlled experiments during the reinforcement-learning fine-tuning phase test two hypotheses: that features more extractable after pre-training are more likely to be utilized by the final policy, and that the evidence for/against a feature predicts whether it will be utilized.
  • results: The controlled experiments reveal statistically significant correlations, constituting strong evidence for inductive biases in the reinforcement-learning setting.
    Abstract Many capable large language models (LLMs) are developed via self-supervised pre-training followed by a reinforcement-learning fine-tuning phase, often based on human or AI feedback. During this stage, models may be guided by their inductive biases to rely on simpler features which may be easier to extract, at a cost to robustness and generalisation. We investigate whether principles governing inductive biases in the supervised fine-tuning of LLMs also apply when the fine-tuning process uses reinforcement learning. Following Lovering et al (2021), we test two hypotheses: that features more $\textit{extractable}$ after pre-training are more likely to be utilised by the final policy, and that the evidence for/against a feature predicts whether it will be utilised. Through controlled experiments on synthetic and natural language tasks, we find statistically significant correlations which constitute strong evidence for these hypotheses.

P-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models

  • paper_url: http://arxiv.org/abs/2311.04044
  • repo_url: None
  • paper_authors: Haoran Li, Dadi Guo, Donghao Li, Wei Fan, Qi Hu, Xin Liu, Chunkit Chan, Duanyi Yao, Yangqiu Song
  • for: This work provides a multi-level privacy evaluation benchmark that empirically and intuitively quantifies the privacy leakage of language models (LMs).
  • methods: The benchmark defines privacy leakage via multi-faceted privacy objectives and constructs a unified pipeline for private fine-tuning.
  • results: Experiments on three GLUE datasets evaluate the privacy leakage of various privacy-preserving language models (PPLMs).
    Abstract The rapid development of language models (LMs) brings unprecedented accessibility and usage for both models and users. On the one hand, powerful LMs, trained with massive textual data, achieve state-of-the-art performance over numerous downstream NLP tasks. On the other hand, more and more attention is paid to unrestricted model accesses that may bring malicious privacy risks of data leakage. To address these issues, many recent works propose privacy-preserving language models (PPLMs) with differential privacy (DP). Unfortunately, different DP implementations make it challenging for a fair comparison among existing PPLMs. In this paper, we present P-Bench, a multi-perspective privacy evaluation benchmark to empirically and intuitively quantify the privacy leakage of LMs. Instead of only protecting and measuring the privacy of protected data with DP parameters, P-Bench sheds light on the neglected inference data privacy during actual usage. P-Bench first clearly defines multi-faceted privacy objectives during private fine-tuning. Then, P-Bench constructs a unified pipeline to perform private fine-tuning. Lastly, P-Bench performs existing privacy attacks on LMs with pre-defined privacy objectives as the empirical evaluation results. The empirical attack results are used to fairly and intuitively evaluate the privacy leakage of various PPLMs. We conduct extensive experiments on three datasets of GLUE for mainstream LMs.

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

  • paper_url: http://arxiv.org/abs/2311.04257
  • repo_url: https://github.com/x-plug/mplug-owl
  • paper_authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
  • for: This paper develops a versatile multi-modal large language model, mPLUG-Owl2, to improve performance on both text and multi-modal tasks.
  • methods: mPLUG-Owl2 uses a modularized network design in which the language decoder acts as a universal interface across modalities, with shared functional modules that facilitate modality collaboration and a modality-adaptive module that preserves modality-specific features.
  • results: Experiments show that mPLUG-Owl2 generalizes to both text and multi-modal tasks and achieves state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path for future multi-modal foundation models.
    Abstract Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.

Analyzing Film Adaptation through Narrative Alignment

  • paper_url: http://arxiv.org/abs/2311.04020
  • repo_url: https://github.com/tanzir5/alignment_tool2.0
  • paper_authors: Tanzir Pial, Shahreen Salim, Charuta Pethe, Allen Kim, Steven Skiena
  • for: This work studies how source novels are modified and cut during film adaptation, and what these changes reveal about faithfulness of adaptation, the importance of dialog, preservation of narrative order, and gender representation.
  • methods: The Smith-Waterman local alignment algorithm, coupled with SBERT embedding distance, quantifies text similarity between scenes and book units; these similarities yield narrative alignments between source text and screenplay.
  • results: An automated analysis of 40 adaptations reveals insights into faithfulness of adaptation, the importance of dialog, preservation of narrative order, and gender representation issues reflective of the Bechdel test.
    Abstract Novels are often adapted into feature films, but the differences between the two media usually require dropping sections of the source text from the movie script. Here we study this screen adaptation process by constructing narrative alignments using the Smith-Waterman local alignment algorithm coupled with SBERT embedding distance to quantify text similarity between scenes and book units. We use these alignments to perform an automated analysis of 40 adaptations, revealing insights into the screenwriting process concerning (i) faithfulness of adaptation, (ii) importance of dialog, (iii) preservation of narrative order, and (iv) gender representation issues reflective of the Bechdel test.
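
The alignment machinery is classical Smith-Waterman, with the substitution score replaced by an embedding similarity. A compact sketch, assuming a thresholded SBERT cosine similarity and a constant gap penalty (the threshold and gap values are our own placeholders; the authors' tool is linked in the repo above):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT encoder works

def smith_waterman(book_units, scenes, threshold=0.5, gap=0.25):
    """Local alignment where the (mis)match score is SBERT cosine similarity
    minus a threshold, so similar pairs score positive, dissimilar negative."""
    a = model.encode(book_units, normalize_embeddings=True)
    b = model.encode(scenes, normalize_embeddings=True)
    sim = a @ b.T - threshold
    H = np.zeros((len(book_units) + 1, len(scenes) + 1))
    for i in range(1, H.shape[0]):
        for j in range(1, H.shape[1]):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + sim[i - 1, j - 1],  # align unit i with scene j
                          H[i - 1, j] - gap,                    # skip a book unit
                          H[i, j - 1] - gap)                    # skip a scene
    return H  # traceback from H.argmax() recovers the best local alignment
```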

Exploring Jiu-Jitsu Argumentation for Writing Peer Review Rebuttals

  • paper_url: http://arxiv.org/abs/2311.03998
  • repo_url: None
  • paper_authors: Sukannya Purkayastha, Anne Lauscher, Iryna Gurevych
  • for: This paper studies how people's arguments across domains are driven by underlying beliefs and world views (attitude roots) and their corresponding attitude themes, and how a Jiu-Jitsu-style argumentation approach can answer an opponent's arguments more effectively.
  • methods: Following the Jiu-Jitsu 'soft' combat system, one first identifies an arguer's attitude roots and themes, and then chooses a prototypical rebuttal aligned with those drivers instead of directly refuting the surface-level reasoning.
  • results: The paper proposes a novel task, attitude- and theme-guided rebuttal generation, which teaches models to use knowledge of attitude roots and themes in argumentation to improve rebuttal effectiveness.
    Abstract In many domains of argumentation, people's arguments are driven by so-called attitude roots, i.e., underlying beliefs and world views, and their corresponding attitude themes. Given the strength of these latent drivers of arguments, recent work in psychology suggests that instead of directly countering surface-level reasoning (e.g., falsifying given premises), one should follow an argumentation style inspired by the Jiu-Jitsu 'soft' combat system (Hornsey and Fielding, 2017): first, identify an arguer's attitude roots and themes, and then choose a prototypical rebuttal that is aligned with those drivers instead of invalidating those. In this work, we are the first to explore Jiu-Jitsu argumentation for peer review by proposing the novel task of attitude and theme-guided rebuttal generation. To this end, we enrich an existing dataset for discourse structure in peer reviews with attitude roots, attitude themes, and canonical rebuttals. To facilitate this process, we recast established annotation concepts from the domain of peer reviews (e.g., aspects a review sentence is relating to) and train domain-specific models. We then propose strong rebuttal generation strategies, which we benchmark on our novel dataset for the task of end-to-end attitude and theme-guided rebuttal generation and two subtasks.

Factoring Hate Speech: A New Annotation Framework to Study Hate Speech in Social Media

  • paper_url: http://arxiv.org/abs/2311.03969
  • repo_url: None
  • paper_authors: Gal Ron, Effi Levi, Odelia Oshri, Shaul R. Shenhav
  • for: This work proposes a novel annotation scheme that factors hate speech into five distinct discursive categories.
  • methods: To evaluate the scheme, the authors construct a corpus of over 2.9M Twitter posts containing hateful expressions directed at Jews, and annotate a sample dataset of 1,050 tweets.
  • results: A statistical analysis of the annotated dataset is presented along with annotation examples, and promising directions for future work are discussed.
    Abstract In this work we propose a novel annotation scheme which factors hate speech into five separate discursive categories. To evaluate our scheme, we construct a corpus of over 2.9M Twitter posts containing hateful expressions directed at Jews, and annotate a sample dataset of 1,050 tweets. We present a statistical analysis of the annotated dataset as well as discuss annotation examples, and conclude by discussing promising directions for future work.

An Analysis of Dialogue Repair in Voice Assistants

  • paper_url: http://arxiv.org/abs/2311.03952
  • repo_url: None
  • paper_authors: Matthew Galbraith
  • for: investigate the significance of interactional language in dialogue repair between virtual assistants and users
  • methods: analyze interactions with Google Assistant and Siri, focusing on their utilization and response to the other-initiated repair strategy “huh?”
  • results: reveal several assistant-generated strategies but an inability to replicate human-like repair strategies such as “huh?”, with differences in users’ repair strategy preferences and assistant usage between English and Spanish speakers.
    Abstract Spoken dialogue systems have transformed human-machine interaction by providing real-time responses to queries. However, misunderstandings between the user and system persist. This study explores the significance of interactional language in dialogue repair between virtual assistants and users by analyzing interactions with Google Assistant and Siri, focusing on their utilization and response to the other-initiated repair strategy "huh?" prevalent in human-human interaction. Findings reveal several assistant-generated strategies but an inability to replicate human-like repair strategies such as "huh?". English and Spanish user acceptability surveys show differences in users' repair strategy preferences and assistant usage, with both similarities and disparities among the two surveyed languages. These results shed light on inequalities between interactional language in human-human interaction and human-machine interaction, underscoring the need for further research on the impact of interactional language in human-machine interaction in English and beyond.

Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition

  • paper_url: http://arxiv.org/abs/2311.03928
  • repo_url: https://github.com/taeheejeon22/morphsubdecomp-korean
  • paper_authors: Taehee Jeon, Bongseok Yang, Changhwan Kim, Yoonseob Lim
  • for: Improve the syntactic and semantic performance of pre-trained language models (PLMs) for Korean.
  • methods: Morpheme-aware subword tokenization implemented via sub-character decomposition, balancing linguistic accuracy with computational efficiency in PLMs.
  • results: The technique performs well overall on NIKL-CoLA, especially on the syntactic task, indicating that integrating morpheme type information can enhance a language model's syntactic and semantic capabilities and push performance beyond standard morphological analysis.
    Abstract We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language characterized by its rich morphology and unique writing system. Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs). Our evaluations show that this technique achieves good performances overall, notably improving results in the syntactic task of NIKL-CoLA. This suggests that integrating morpheme type information can enhance language models' syntactic and semantic capabilities, indicating that adopting more linguistic insights can further improve performance beyond standard morphological analysis.

iACOS: Advancing Implicit Sentiment Extraction with Informative and Adaptive Negative Examples

  • paper_url: http://arxiv.org/abs/2311.03896
  • repo_url: None
  • paper_authors: Xiancai Xu, Jia-Dong Zhang, Lei Xiong, Zhishang Liu
  • for: This work proposes iACOS, a new quadruple extraction method for implicit aspects, categories, opinions, and sentiments.
  • methods: iACOS proceeds in four main steps: (1) append two implicit tokens to the end of the text to obtain context-aware representations of all tokens, including implicit aspects and opinions; (2) run a sequence labeling model over these representations to co-extract explicit and implicit aspects and opinions; (3) apply a multi-label classifier with a specialized multi-head attention to discover aspect-opinion pairs and predict their categories and sentiments simultaneously; (4) train the multi-label classifier jointly with the other two classifiers via multi-task learning, using informative and adaptive negative examples.
  • results: Experiments show that iACOS significantly outperforms other quadruple extraction baselines in F1 score on two public benchmark datasets.
    Abstract Aspect-based sentiment analysis (ABSA) have been extensively studied, but little light has been shed on the quadruple extraction consisting of four fundamental elements: aspects, categories, opinions and sentiments, especially with implicit aspects and opinions. In this paper, we propose a new method iACOS for extracting Implicit Aspects with Categories and Opinions with Sentiments. First, iACOS appends two implicit tokens at the end of a text to capture the context-aware representation of all tokens including implicit aspects and opinions. Second, iACOS develops a sequence labeling model over the context-aware token representation to co-extract explicit and implicit aspects and opinions. Third, iACOS devises a multi-label classifier with a specialized multi-head attention for discovering aspect-opinion pairs and predicting their categories and sentiments simultaneously. Fourth, iACOS leverages informative and adaptive negative examples to jointly train the multi-label classifier and the other two classifiers on categories and sentiments by multi-task learning. Finally, the experimental results show that iACOS significantly outperforms other quadruple extraction baselines according to the F1 score on two public benchmark datasets.

Sparse Contrastive Learning of Sentence Embeddings

  • paper_url: http://arxiv.org/abs/2311.03881
  • repo_url: None
  • paper_authors: Ruize An, Chen Zhang, Dawei Song
  • for: This paper explores parameter sparsification of sentence embedding models to improve their performance.
  • methods: Each parameter's contribution to the overall quality of sentence embeddings is measured via alignment and uniformity scores, and parameters with minimal contribution are sparsified to zero.
  • results: Sparsification improves performance; an in-depth analysis of the embedding space shows that alignment improves while uniformity remains uncompromised.
    Abstract Recently, SimCSE has shown the feasibility of contrastive learning in training sentence embeddings and illustrates its expressiveness in spanning an aligned and uniform embedding space. However, prior studies have shown that dense models could contain harmful parameters that affect the model performance, and it is no wonder that SimCSE can as well be invented with such parameters. Driven by this, parameter sparsification is applied, where alignment and uniformity scores are used to measure the contribution of each parameter to the overall quality of sentence embeddings. Drawing from a preliminary study, we consider parameters with minimal contributions to be detrimental, as their sparsification results in improved model performance. To discuss the ubiquity of detrimental parameters and remove them, more experiments on the standard semantic textual similarity (STS) tasks and transfer learning tasks are conducted, and the results show that the proposed sparsified SimCSE (SparseCSE) has excellent performance in comparison with SimCSE. Furthermore, through in-depth analysis, we establish the validity and stability of our sparsification method, showcasing that the embedding space generated by SparseCSE exhibits improved alignment compared to that produced by SimCSE. Importantly, the uniformity yet remains uncompromised.
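
The alignment and uniformity scores here are presumably the standard contrastive-space metrics of Wang and Isola (2020); a short sketch of those two quantities, which the paper uses to measure each parameter's contribution:

```python
import torch

def alignment(x: torch.Tensor, y: torch.Tensor, alpha: int = 2):
    """x, y: L2-normalized embeddings of positive pairs, shape (n, d);
    lower means positive pairs sit closer together."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x: torch.Tensor, t: int = 2):
    """x: L2-normalized embeddings, shape (n, d); lower means embeddings
    spread more uniformly over the hypersphere."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```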

OLaLa: Ontology Matching with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03837
  • repo_url: None
  • paper_authors: Sven Hertling, Heiko Paulheim
  • for: This paper explores how large language models can improve the processing of natural-language information in ontology (and, more broadly, knowledge graph) matching.
  • methods: Zero-shot and few-shot prompting with multiple open large language models is applied to different OAEI tasks, examining questions such as prompt design and model selection.
  • results: With only a handful of examples and a well-designed prompt, results on par with supervised matching systems that use a much larger portion of the ground truth can be achieved.
    Abstract Ontology (and more generally: Knowledge Graph) Matching is a challenging task where information in natural language is one of the most important signals to process. With the rise of Large Language Models, it is possible to incorporate this knowledge in a better way into the matching pipeline. A number of decisions still need to be taken, e.g., how to generate a prompt that is useful to the model, how information in the KG can be formulated in prompts, which Large Language Model to choose, how to provide existing correspondences to the model, how to generate candidates, etc. In this paper, we present a prototype that explores these questions by applying zero-shot and few-shot prompting with multiple open Large Language Models to different tasks of the Ontology Alignment Evaluation Initiative (OAEI). We show that with only a handful of examples and a well-designed prompt, it is possible to achieve results that are en par with supervised matching systems which use a much larger portion of the ground truth.
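
One of the design decisions the paper examines is how to verbalize KG information inside the prompt. A hedged illustration of a few-shot matching prompt (the yes/no framing and the worked example are our assumptions, not the paper's exact template):

```python
FEW_SHOT_TEMPLATE = """Decide whether the two concepts refer to the same thing. Answer yes or no.

Concept 1: heart attack (sudden blockage of blood flow to the heart)
Concept 2: myocardial infarction (necrosis of heart muscle due to ischemia)
Answer: yes

Concept 1: {left_label} ({left_desc})
Concept 2: {right_label} ({right_desc})
Answer:"""

def build_prompt(left: dict, right: dict) -> str:
    """left/right: candidate ontology concepts with labels and verbalized context."""
    return FEW_SHOT_TEMPLATE.format(
        left_label=left["label"], left_desc=left["desc"],
        right_label=right["label"], right_desc=right["desc"])
```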

Conversations in Galician: a Large Language Model for an Underrepresented Language

  • paper_url: http://arxiv.org/abs/2311.03812
  • repo_url: https://gitlab.irlab.org/irlab/cabuxa
  • paper_authors: Eliseo Bao, Anxo Pérez, Javier Parapar
  • for: This work aims to advance natural language processing (NLP) for Galician and to better include underrepresented language communities in the development of large language models.
  • methods: Two new resources are introduced: a Galician adaptation of the Alpaca dataset and a fine-tuning of the LLaMA-7B model on it.
  • results: With the Galician Alpaca dataset, LLaMA-7B learns to understand and respond in Galician, a language the model did not originally support, and drawing on knowledge of the closely related Portuguese helps it generate coherent text.
    Abstract The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technologies in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to more accurately adhere to provided instructions. Additionally, as a demonstration of the dataset utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model, by following the Alpaca format. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case, Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and Cabuxa-7B are publicly accessible on our Huggingface Hub, and we have made the source code available to facilitate replication of this experiment and encourage further advancements for underrepresented languages.

Noisy Pair Corrector for Dense Retrieval

  • paper_url: http://arxiv.org/abs/2311.03798
  • repo_url: None
  • paper_authors: Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo
  • for: This work examines an implicit assumption in dense retrieval: that training query-document pairs are exactly matched. Since manually annotating a corpus is expensive, training pairs in real-world applications are usually collected automatically, which inevitably introduces mismatched-pair noise.
  • methods: The authors propose Noisy Pair Corrector (NPC), consisting of a detection module and a correction module. The detection module estimates noisy pairs by calculating the perplexity between annotated positive and easy negative documents; the correction module uses an exponential moving average (EMA) model to provide a soft supervised signal that mitigates the effect of noise.
  • results: Experiments on the text-retrieval benchmarks Natural Question and TriviaQA, and on the code-search benchmarks StaQC and SO-DS, show that NPC handles both synthetic and realistic noise effectively.
    Abstract Most dense retrieval models contain an implicit assumption: the training query-document pairs are exactly matched. Since it is expensive to annotate the corpus manually, training pairs in real-world applications are usually collected automatically, which inevitably introduces mismatched-pair noise. In this paper, we explore an interesting and challenging problem in dense retrieval, how to train an effective model with mismatched-pair noise. To solve this problem, we propose a novel approach called Noisy Pair Corrector (NPC), which consists of a detection module and a correction module. The detection module estimates noise pairs by calculating the perplexity between annotated positive and easy negative documents. The correction module utilizes an exponential moving average (EMA) model to provide a soft supervised signal, aiding in mitigating the effects of noise. We conduct experiments on text-retrieval benchmarks Natural Question and TriviaQA, code-search benchmarks StaQC and SO-DS. Experimental results show that NPC achieves excellent performance in handling both synthetic and realistic noise.
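
The correction module's soft signal comes from an exponential moving average of the retriever's weights, as in Mean Teacher-style setups; a sketch of the update, with the momentum value as an assumption:

```python
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.999):
    """teacher <- m * teacher + (1 - m) * student.
    The teacher's soft relevance scores supervise pairs flagged as noisy."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

# Initialize once with: teacher = copy.deepcopy(student).eval()
```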

Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment

  • paper_url: http://arxiv.org/abs/2311.03792
  • repo_url: None
  • paper_authors: Jakir Hasan, Shrestha Datta, Ameya Debnath
  • for: The paper develops an Artificial Intelligence and Machine Learning model to map Bangla words to their International Phonetic Alphabet (IPA) representations.
  • methods: The authors use a transformer-based sequence-to-sequence model at the letter and symbol level to map Bangla words to their IPA representations. They also utilize manual mapping to handle punctuation marks and foreign languages in the text.
  • results: The authors achieve the top position in the public ranking of DataVerse Challenge - ITVerse 2023 with a word error rate of 0.10582.
    Abstract The International Phonetic Alphabet (IPA) is indispensable in language learning and understanding, aiding users in accurate pronunciation and comprehension. Additionally, it plays a pivotal role in speech therapy, linguistic research, accurate transliteration, and the development of text-to-speech systems, making it an essential tool across diverse fields. Bangla being 7th as one of the widely used languages, gives rise to the need for IPA in its domain. Its IPA mapping is too diverse to be captured manually giving the need for Artificial Intelligence and Machine Learning in this field. In this study, we have utilized a transformer-based sequence-to-sequence model at the letter and symbol level to get the IPA of each Bangla word as the variation of IPA in association of different words is almost null. Our transformer model only consisted of 8.5 million parameters with only a single decoder and encoder layer. Additionally, to handle the punctuation marks and the occurrence of foreign languages in the text, we have utilized manual mapping as the model won't be able to learn to separate them from Bangla words while decreasing our required computational resources. Finally, maintaining the relative position of the sentence component IPAs and generation of the combined IPA has led us to achieve the top position with a word error rate of 0.10582 in the public ranking of DataVerse Challenge - ITVerse 2023 (https://www.kaggle.com/competitions/dataverse_2023/).
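For orientation, a minimal sketch of a character-level sequence-to-sequence transformer with a single encoder and a single decoder layer, as the abstract describes; the vocabulary sizes, model dimension, and head count below are illustrative assumptions, not the authors' configuration.

```python
# Illustrative character-level Bangla -> IPA seq2seq transformer sketch.
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)   # Bangla characters
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)   # IPA symbols
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=1, num_decoder_layers=1,   # single layer each
            dim_feedforward=1024, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each IPA position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                             tgt_mask=mask)
        return self.out(h)                                # logits over IPA symbols

model = CharSeq2Seq(src_vocab=128, tgt_vocab=96)
print(sum(p.numel() for p in model.parameters()))  # a few million parameters here
```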

Language Representation Projection: Can We Transfer Factual Knowledge across Languages in Multilingual Language Models?

  • paper_url: http://arxiv.org/abs/2311.03788
  • repo_url: None
  • paper_authors: Shaoyang Xu, Junzhuo Li, Deyi Xiong
  • for: This study investigates whether multilingual pretrained language models can explicitly transfer rich factual knowledge from English to non-English languages.
  • methods: Two parameter-free Language Representation Projection modules (LRP2): the first converts non-English representations into English-like equivalents, and the second reverts the English-like representations back into representations of the corresponding non-English language.
  • results: Experiments show that LRP2 significantly improves factual knowledge retrieval accuracy and facilitates knowledge transfer across diverse non-English languages; the working mechanism is further studied from the perspectives of representation space and cross-lingual knowledge neurons.
    Abstract Multilingual pretrained language models serve as repositories of multilingual factual knowledge. Nevertheless, a substantial performance gap of factual knowledge probing exists between high-resource languages and low-resource languages, suggesting limited implicit factual knowledge transfer across languages in multilingual pretrained language models. This paper investigates the feasibility of explicitly transferring relatively rich factual knowledge from English to non-English languages. To accomplish this, we propose two parameter-free $\textbf{L}$anguage $\textbf{R}$epresentation $\textbf{P}$rojection modules (LRP2). The first module converts non-English representations into English-like equivalents, while the second module reverts English-like representations back into representations of the corresponding non-English language. Experimental results on the mLAMA dataset demonstrate that LRP2 significantly improves factual knowledge retrieval accuracy and facilitates knowledge transferability across diverse non-English languages. We further investigate the working mechanism of LRP2 from the perspectives of representation space and cross-lingual knowledge neuron.
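One simple way to realize a parameter-free projection of this kind is a mean-shift of hidden states between language-level centroids; whether LRP2 uses exactly this update is an assumption of the sketch below, which mainly illustrates that the two modules can be exact inverses of each other.

```python
# Mean-shift sketch of parameter-free representation projection; illustrative.
import numpy as np

def language_means(hidden_en, hidden_xx):
    """Mean hidden state per language, computed offline from comparable
    corpora. Both arrays: [num_tokens, hidden_dim]."""
    return hidden_en.mean(axis=0), hidden_xx.mean(axis=0)

def project_to_english_like(h, mu_en, mu_xx):
    return h + (mu_en - mu_xx)       # module 1: applied at an early layer

def revert_to_language(h, mu_en, mu_xx):
    return h - (mu_en - mu_xx)       # module 2: applied at a later layer

rng = np.random.default_rng(0)
mu_en, mu_xx = language_means(rng.normal(size=(1000, 768)),
                              rng.normal(loc=0.3, size=(1000, 768)))
h = rng.normal(loc=0.3, size=(768,))
h_en_like = project_to_english_like(h, mu_en, mu_xx)
h_back = revert_to_language(h_en_like, mu_en, mu_xx)
assert np.allclose(h, h_back)        # the two modules are exact inverses
```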

Gender Inflected or Bias Inflicted: On Using Grammatical Gender Cues for Bias Evaluation in Machine Translation

  • paper_url: http://arxiv.org/abs/2311.03767
  • repo_url: https://github.com/iampushpdeep/gender-bias-hi-en-eval
  • paper_authors: Pushpdeep Singh
  • for: To evaluate social biases, especially gender bias, in Neural Machine Translation (NMT) models, in particular for source languages other than English.
  • methods: Using Hindi as the source language, two sets of gender-specific sentences, OTSC-Hindi and WinoMT-Hindi, are constructed to automatically evaluate gender bias in different Hindi-English (HI-EN) NMT systems.
  • results: Gender bias in NMT models varies with the source language, and such bias can be probed by checking whether models identify the correct gender from grammatical gender cues in the source sentence; the study highlights the importance of considering the nature of the language when designing extrinsic bias evaluation datasets.
    Abstract Neural Machine Translation (NMT) models are state-of-the-art for machine translation. However, these models are known to have various social biases, especially gender bias. Most of the work on evaluating gender bias in NMT has focused primarily on English as the source language. For source languages different from English, most of the studies use gender-neutral sentences to evaluate gender bias. However, practically, many sentences that we encounter do have gender information. Therefore, it makes more sense to evaluate for bias using such sentences. This allows us to determine if NMT models can identify the correct gender based on the grammatical gender cues in the source sentence rather than relying on biased correlations with, say, occupation terms. To demonstrate our point, in this work, we use Hindi as the source language and construct two sets of gender-specific sentences: OTSC-Hindi and WinoMT-Hindi that we use to evaluate different Hindi-English (HI-EN) NMT systems automatically for gender bias. Our work highlights the importance of considering the nature of language when designing such extrinsic bias evaluation datasets.

Multilingual Mathematical Autoformalization

  • paper_url: http://arxiv.org/abs/2311.03755
  • repo_url: https://github.com/albertqjiang/mma
  • paper_authors: Albert Q. Jiang, Wenda Li, Mateja Jamnik
  • for: This paper introduces a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs for autoformalization.
  • methods: A language model translates in the reverse direction, from formal mathematical statements into corresponding informal ones.
  • results: Language models fine-tuned on this dataset produce 16-18% of statements acceptable with minimal corrections on two standard benchmarks, up from 0% with the base model; fine-tuning on multilingual formal data also yields more capable autoformalization models, even on monolingual tasks.
    Abstract Autoformalization is the task of translating natural language materials into machine-verifiable formalisations. Progress in autoformalization research is hindered by the lack of a sizeable dataset consisting of informal-formal pairs expressing the same essence. Existing methods tend to circumvent this challenge by manually curating small corpora or using few-shot learning with large language models. But these methods suffer from data scarcity and formal language acquisition difficulty. In this work, we create $\texttt{MMA}$, a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs, by using a language model to translate in the reverse direction, that is, from formal mathematical statements into corresponding informal ones. Experiments show that language models fine-tuned on $\texttt{MMA}$ produce $16-18\%$ of statements acceptable with minimal corrections on the $\texttt{miniF2F}$ and $\texttt{ProofNet}$ benchmarks, up from $0\%$ with the base model. We demonstrate that fine-tuning on multilingual formal data results in more capable autoformalization models even when deployed on monolingual tasks.

Which is better? Exploring Prompting Strategy For LLM-based Metrics

  • paper_url: http://arxiv.org/abs/2311.03754
  • repo_url: None
  • paper_authors: Joonghoon Kim, Saeran Park, Kiyoon Jeong, Sangmin Lee, Seung Hun Han, Jiyoon Lee, Pilsung Kang
  • for: This study explores the use of large language models (LLMs) to evaluate the quality of Natural Language Generation (NLG), enabling better assessment of systems on NLG tasks.
  • methods: A systematic analysis of a wide range of prompts and prompting techniques, a comparison of three score aggregation strategies, and a strategy that generates rationales to explain LLM-based evaluation results.
  • results: Well-chosen prompting and aggregation strategies improve the accuracy of NLG quality evaluation, and open-source LLMs can provide reliable evaluation results.
    Abstract This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task, where systems were submitted to two tracks: small and large summarization tracks. With advanced Large Language Models (LLMs) such as GPT-4, evaluating the quality of Natural Language Generation (NLG) has become increasingly paramount. Traditional similarity-based metrics such as BLEU and ROUGE have shown to misalign with human evaluation and are ill-suited for open-ended generation tasks. To address this issue, we explore the potential capability of LLM-based metrics, especially leveraging open-source LLMs. In this study, wide range of prompts and prompting techniques are systematically analyzed with three approaches: prompting strategy, score aggregation, and explainability. Our research focuses on formulating effective prompt templates, determining the granularity of NLG quality scores and assessing the impact of in-context examples on LLM-based evaluation. Furthermore, three aggregation strategies are compared to identify the most reliable method for aggregating NLG quality scores. To examine explainability, we devise a strategy that generates rationales for the scores and analyzes the characteristics of the explanation produced by the open-source LLMs. Extensive experiments provide insights regarding evaluation capabilities of open-source LLMs and suggest effective prompting strategies.
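As a small illustration of score aggregation, the sketch below combines several LLM-assigned scores for one output; the particular trio of strategies (mean, median, majority vote) is an assumption, not necessarily the paper's exact choices.

```python
# Aggregating multiple LLM-assigned NLG quality scores; illustrative sketch.
from statistics import mean, median, mode

def aggregate(scores, strategy="mean"):
    """Aggregate multiple LLM scores (e.g., from repeated sampling or
    different prompt templates) into a single quality score."""
    if strategy == "mean":
        return mean(scores)
    if strategy == "median":
        return median(scores)
    if strategy == "majority":
        return mode(scores)          # most frequent discrete score
    raise ValueError(f"unknown strategy: {strategy}")

samples = [4, 5, 4, 3, 4]            # five scores for one summary, scale 1-5
print({s: aggregate(samples, s) for s in ("mean", "median", "majority")})
```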

Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning

  • paper_url: http://arxiv.org/abs/2311.03748
  • repo_url: https://github.com/psunlpgroup/fish-dip
  • paper_authors: Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Peng Shi, Wenpeng Yin, Rui Zhang
  • for: To make unified sequence labeling effective in data-limited settings, better exploiting large language model knowledge for structured prediction.
  • methods: FISH-DIP, a sample-aware dynamic sparse fine-tuning strategy that selects a fraction of parameters during fine-tuning, guided by feedback from highly regressing examples.
  • results: Across five sequence labeling tasks, FISH-DIP smoothly optimizes the model in low-resource settings, offering up to 40% performance improvement over full fine-tuning depending on the target evaluation setting, and performs comparably or better than in-context learning and other parameter-efficient fine-tuning methods, notably in extreme low-resource settings.
    Abstract Unified Sequence Labeling that articulates different sequence labeling problems such as Named Entity Recognition, Relation Extraction, Semantic Role Labeling, etc. in a generalized sequence-to-sequence format opens up the opportunity to make the maximum utilization of large language model knowledge toward structured prediction. Unfortunately, this requires formatting them into specialized augmented format unknown to the base pretrained language model (PLMs) necessitating finetuning to the target format. This significantly bounds its usefulness in data-limited settings where finetuning large models cannot properly generalize to the target format. To address this challenge and leverage PLM knowledge effectively, we propose FISH-DIP, a sample-aware dynamic sparse finetuning strategy that selectively focuses on a fraction of parameters, informed by feedback from highly regressing examples, during the fine-tuning process. By leveraging the dynamism of sparsity, our approach mitigates the impact of well-learned samples and prioritizes underperforming instances for improvement in generalization. Across five tasks of sequence labeling, we demonstrate that FISH-DIP can smoothly optimize the model in low resource settings offering upto 40% performance improvements over full fine-tuning depending on target evaluation settings. Also, compared to in-context learning and other parameter-efficient fine-tuning approaches, FISH-DIP performs comparably or better, notably in extreme low-resource settings.
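A rough sketch of the idea, under the assumptions that "highly regressing" examples are those with the largest loss and that parameter importance is measured by squared gradients (Fisher-style); the real FISH-DIP schedule may differ, so see the linked repository for the actual method.

```python
# Sample-aware dynamic sparse mask for fine-tuning; illustrative sketch.
import torch

def sparse_update_masks(model, losses, keep_ratio=0.05, num_worst=8):
    """losses: per-sample losses of the current batch (requires grad)."""
    # 1. Focus feedback on the worst-performing samples in the batch.
    worst = torch.topk(losses, k=min(num_worst, len(losses))).values.sum()
    model.zero_grad()
    worst.backward(retain_graph=True)
    # 2. Keep only the parameters with the largest squared gradients.
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g2 = p.grad.detach().pow(2)
        k = max(1, int(keep_ratio * g2.numel()))
        thresh = torch.topk(g2.flatten(), k).values.min()
        masks[name] = (g2 >= thresh).float()
    return masks

# During fine-tuning: recompute the masks periodically so the sparsity stays
# "dynamic", then apply `p.grad *= masks[name]` before optimizer.step().
```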

Leveraging Structured Information for Explainable Multi-hop Question Answering and Reasoning

  • paper_url: http://arxiv.org/abs/2311.03734
  • repo_url: https://github.com/bcdnlp/structure-qa
  • paper_authors: Ruosen Li, Xinya Du
  • for: To improve the reasoning capability and explainability of multi-hop question-answering models.
  • methods: A chain-of-thought mechanism generates both the reasoning chain and the answer, and extracted semantic structures (graphs) are leveraged for multi-hop question answering.
  • results: Substantial improvements on two benchmark datasets; moreover, the extracted structures themselves provide grounded explanations that humans prefer over generated reasoning chains and saliency-based explanations.
    Abstract Neural models, including large language models (LLMs), achieve superior performance on multi-hop question-answering. To elicit reasoning capabilities from LLMs, recent works propose using the chain-of-thought (CoT) mechanism to generate both the reasoning chain and the answer, which enhances the model's capabilities in conducting multi-hop reasoning. However, several challenges still remain: such as struggling with inaccurate reasoning, hallucinations, and lack of interpretability. On the other hand, information extraction (IE) identifies entities, relations, and events grounded to the text. The extracted structured information can be easily interpreted by humans and machines (Grishman, 2019). In this work, we investigate constructing and leveraging extracted semantic structures (graphs) for multi-hop question answering, especially the reasoning process. Empirical results and human evaluations show that our framework: generates more faithful reasoning chains and substantially improves the QA performance on two benchmark datasets. Moreover, the extracted structures themselves naturally provide grounded explanations that are preferred by humans, as compared to the generated reasoning chains and saliency-based explanations.

Learning to Learn for Few-shot Continual Active Learning

  • paper_url: http://arxiv.org/abs/2311.03732
  • repo_url: None
  • paper_authors: Stella Ho, Ming Liu, Shang Gao, Longxiang Gao
  • for: This paper considers a continual active learning (CAL) setting that must balance stability and plasticity given scarce labeled data and abundant unlabeled data under a limited annotation budget.
  • methods: A simple yet efficient method, Meta-Continual Active Learning, which employs meta-learning and experience replay to resolve the stability-plasticity trade-off.
  • results: Experiments show that random sampling is the best default strategy for both active learning and memory sample selection in few-shot CAL problems.
    Abstract Continual learning strives to ensure stability in solving previously seen tasks while demonstrating plasticity in a novel domain. Recent advances in CL are mostly confined to a supervised learning setting, especially in NLP domain. In this work, we consider a few-shot continual active learning (CAL) setting where labeled data is inadequate, and unlabeled data is abundant but with a limited annotation budget. We propose a simple but efficient method, called Meta-Continual Active Learning. Specifically, we employ meta-learning and experience replay to address the trade-off between stability and plasticity. As a result, it finds an optimal initialization that efficiently utilizes annotated information for fast adaptation while preventing catastrophic forgetting of past tasks. We conduct extensive experiments to validate the effectiveness of the proposed method and analyze the effect of various active learning strategies and memory sample selection methods in a few-shot CAL setup. Our experiment results demonstrate that random sampling is the best default strategy for both active learning and memory sample selection to solve few-shot CAL problems.

A Survey of Large Language Models Attribution

  • paper_url: http://arxiv.org/abs/2311.03731
  • repo_url: https://github.com/HITsz-TMG/awesome-llm-attributions
  • paper_authors: Dongfang Li, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Ziyang Chen, Baotian Hu, Aiguo Wu, Min Zhang
  • for: This survey reviews the attribution mechanisms employed by open-domain generative systems, particularly large language models.
  • methods: A comprehensive review of the attribution approaches used by these systems and the issues surrounding them.
  • results: Attribution can improve the factuality and verifiability of open-domain generative systems, but issues such as ambiguous knowledge reservoirs, inherent biases, and the drawbacks of excessive attribution can hinder their effectiveness.
    Abstract Open-domain generative systems have gained significant attention in the field of conversational AI (e.g., generative search engines). This paper presents a comprehensive review of the attribution mechanisms employed by these systems, particularly large language models. Though attribution or citation improve the factuality and verifiability, issues like ambiguous knowledge reservoirs, inherent biases, and the drawbacks of excessive attribution can hinder the effectiveness of these systems. The aim of this survey is to provide valuable insights for researchers, aiding in the refinement of attribution methodologies to enhance the reliability and veracity of responses generated by open-domain generative systems. We believe that this field is still in its early stages; hence, we maintain a repository to keep track of ongoing studies at https://github.com/HITsz-TMG/awesome-llm-attributions.

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

  • paper_url: http://arxiv.org/abs/2311.03696
  • repo_url: https://github.com/shyyhs/CourseraParallelCorpusMining
  • paper_authors: Haiyue Song, Raj Dabre, Chenhui Chu, Atsushi Fujita, Sadao Kurohashi
  • for: To improve the quality of machine translation for lecture transcripts in online courses, where publicly available parallel corpora are scarce; the work proposes a framework for quickly and effectively mining parallel corpora from publicly available lectures on Coursera.
  • methods: A dynamic programming based sentence alignment algorithm that leverages the cosine similarity of machine-translated sentences, compared against alignment using BERTScore, LASER, and sentBERT.
  • results: The proposed alignment algorithm yields high-quality parallel corpora for lecture translation, and combining the mined corpora with out-of-domain parallel corpora via multistage fine-tuning and suitable data cleaning achieves high translation quality across language pairs.
    Abstract Lecture transcript translation helps learners understand online courses, however, building a high-quality lecture machine translation system lacks publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences. The sentence alignment F1 score reaches 96%, which is higher than using the BERTScore, LASER, or sentBERT methods. For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets through manual filtering for benchmarking translation performance. Through machine translation experiments, we show that the mined corpora enhance the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora via multistage fine-tuning. Furthermore, this study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we have released the corpora as well as the code to create them. The dataset is available at https://github.com/shyyhs/CourseraParallelCorpusMining.
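The core alignment step can be sketched as a monotonic dynamic program over cosine similarities of sentence embeddings; the restriction to 1-1 links and the skip penalty below are simplifying assumptions, and the released code above should be consulted for the actual algorithm.

```python
# Monotonic 1-1 sentence alignment by dynamic programming; illustrative sketch.
import numpy as np

def align(src_emb, tgt_emb, skip=-0.2):
    """src_emb, tgt_emb: L2-normalized sentence embeddings, where the source
    side has first been machine-translated into the target language."""
    sim = src_emb @ tgt_emb.T                       # cosine similarity matrix
    n, m = sim.shape
    dp = np.full((n + 1, m + 1), -np.inf)
    dp[0, 0] = 0.0
    back = np.zeros((n + 1, m + 1), dtype=int)      # 1=skip src, 2=skip tgt, 3=match
    for i in range(n + 1):
        for j in range(m + 1):
            if i and dp[i - 1, j] + skip > dp[i, j]:
                dp[i, j], back[i, j] = dp[i - 1, j] + skip, 1
            if j and dp[i, j - 1] + skip > dp[i, j]:
                dp[i, j], back[i, j] = dp[i, j - 1] + skip, 2
            if i and j and dp[i - 1, j - 1] + sim[i - 1, j - 1] > dp[i, j]:
                dp[i, j], back[i, j] = dp[i - 1, j - 1] + sim[i - 1, j - 1], 3
    pairs, i, j = [], n, m
    while i or j:                                    # trace back the best path
        move = back[i, j]
        if move == 3:
            pairs.append((i - 1, j - 1)); i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]                               # aligned (src, tgt) indices
```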

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.03687
  • repo_url: None
  • paper_authors: Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu
  • for: To benchmark the pre-training, fine-tuning, and serving performance of large language models (LLMs), helping users choose hardware and software configurations suited to their needs.
  • methods: End-to-end and module-level benchmarking with and without optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention.
  • results: LLM runtime varies significantly across hardware and software stacks; a deeper analysis of LLM sub-modules reveals potential opportunities for future work on further optimizing runtime performance.
    Abstract Large Language Models (LLMs) have seen great advance in both academia and industry, and their popularity results in numerous open-source frameworks and techniques in accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs are expensive as it requires considerable computing resources and memory, hence many efficient approaches have been developed for improving system pipelines as well as operators. However, the runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes , i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B) on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help better understand different optimization techniques, training and inference frameworks, together with hardware platforms in choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses discover potential opportunities for future work to further optimize the runtime performance of LLMs.

CBSiMT: Mitigating Hallucination in Simultaneous Machine Translation with Weighted Prefix-to-Prefix Training

  • paper_url: http://arxiv.org/abs/2311.03672
  • repo_url: None
  • paper_authors: Mengge Liu, Wen Zhang, Xiang Li, Yanzhi Tian, Yuhang Guo, Jian Luan, Bin Wang, Shuoying Chen
  • for: To improve the quality and stability of simultaneous machine translation (SiMT), which must begin translating before the full source sentence is available.
  • methods: Prefix-to-prefix training predicts target tokens from partial source prefixes, but word-order differences between languages can cause hallucinations, i.e., target outputs unfaithful to the source input.
  • results: The proposed Confidence-Based Simultaneous Machine Translation (CBSiMT) mechanism uses model confidence to identify hallucination tokens and mitigates their impact through weighted prefix-to-prefix training; experiments show consistent quality improvements at most latency regimes, with up to 2 BLEU points improvement at low latency.
    Abstract Simultaneous machine translation (SiMT) is a challenging task that requires starting translation before the full source sentence is available. Prefix-to-prefix framework is often applied to SiMT, which learns to predict target tokens using only a partial source prefix. However, due to the word order difference between languages, misaligned prefix pairs would make SiMT models suffer from serious hallucination problems, i.e. target outputs that are unfaithful to source inputs. Such problems can not only produce target tokens that are not supported by the source prefix, but also hinder generating the correct translation by receiving more source words. In this work, we propose a Confidence-Based Simultaneous Machine Translation (CBSiMT) framework, which uses model confidence to perceive hallucination tokens and mitigates their negative impact with weighted prefix-to-prefix training. Specifically, token-level and sentence-level weights are calculated based on model confidence and acted on the loss function. We explicitly quantify the faithfulness of the generated target tokens using the token-level weight, and employ the sentence-level weight to alleviate the disturbance of sentence pairs with serious word order differences on the model. Experimental results on MuST-C English-to-Chinese and WMT15 German-to-English SiMT tasks demonstrate that our method can consistently improve translation quality at most latency regimes, with up to 2 BLEU scores improvement at low latency.
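A minimal sketch of confidence-weighted training: token- and sentence-level weights derived from model confidence scale a prefix-to-prefix cross-entropy loss. The exact weighting functions below are assumptions; the paper defines its own forms.

```python
# Confidence-weighted prefix-to-prefix loss; illustrative sketch.
import torch
import torch.nn.functional as F

def cbsimt_style_loss(logits, targets, pad_id=0):
    """logits: [B, T, V] decoder outputs; targets: [B, T] reference tokens."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # [B, T]
    conf = token_logp.detach().exp()          # model confidence in each target
    mask = (targets != pad_id).float()
    tok_w = conf * mask                       # down-weight low-confidence tokens
    sent_w = tok_w.sum(-1) / mask.sum(-1).clamp(min=1)   # [B] sentence weights
    nll = -(tok_w * token_logp).sum(-1) / tok_w.sum(-1).clamp(min=1e-6)
    return (sent_w * nll).mean()
```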

Principles from Clinical Research for NLP Model Generalization

  • paper_url: http://arxiv.org/abs/2311.03663
  • repo_url: None
  • paper_authors: Aparna Elangovan, Jiayuan He, Yuan Li, Karin Verspoor
  • for: This study examines the generalizability of NLP models and the various factors that affect it.
  • methods: Rigorous experimental design to ensure internal validity, together with analysis of how models generalize across datasets.
  • results: Model performance across datasets is affected by factors such as spurious correlations in the data; the paper also offers guidance on analyzing generalization failures.
    Abstract The NLP community typically relies on performance of a model on a held-out test set to assess generalization. Performance drops observed in datasets outside of official test sets are generally attributed to "out-of-distribution'' effects. Here, we explore the foundations of generalizability and study the various factors that affect it, articulating generalizability lessons from clinical studies. In clinical research generalizability depends on (a) internal validity of experiments to ensure controlled measurement of cause and effect, and (b) external validity or transportability of the results to the wider population. We present the need to ensure internal validity when building machine learning models in natural language processing, especially where results may be impacted by spurious correlations in the data. We demonstrate how spurious factors, such as the distance between entities in relation extraction tasks, can affect model internal validity and in turn adversely impact generalization. We also offer guidance on how to analyze generalization failures.

Innovation and Word Usage Patterns in Machine Learning

  • paper_url: http://arxiv.org/abs/2311.03633
  • repo_url: https://github.com/vitorbborges/monografia-PET22
  • paper_authors: Vítor Bandeira Borges, Daniel Oliveira Cajueiro
  • for: This study examines the evolving landscape of machine learning research.
  • methods: Latent Dirichlet Allocation is used to discern pivotal themes and fundamental concepts in machine learning, followed by a comprehensive analysis tracking the evolutionary trajectories of these themes.
  • results: Using the Kullback-Leibler divergence to quantify the novelty and divergence of research contributions, the study identifies the pivotal roles of prominent researchers and the significance of specific academic venues (periodicals and conferences) in machine learning.
    Abstract In this study, we delve into the dynamic landscape of machine learning research evolution. Initially, through the utilization of Latent Dirichlet Allocation, we discern pivotal themes and fundamental concepts that have emerged within the realm of machine learning. Subsequently, we undertake a comprehensive analysis to track the evolutionary trajectories of these identified themes. To quantify the novelty and divergence of research contributions, we employ the Kullback-Leibler Divergence metric. This statistical measure serves as a proxy for ``surprise'', indicating the extent of differentiation between the content of academic papers and the subsequent developments in research. By amalgamating these insights, we gain the ability to ascertain the pivotal roles played by prominent researchers and the significance of specific academic venues (periodicals and conferences) within the machine learning domain.
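The "surprise" proxy itself is straightforward to compute; the sketch below measures the KL divergence between a paper's LDA topic mixture and that of subsequent literature (the epsilon smoothing is an assumption to keep the divergence finite).

```python
# KL-divergence "surprise" between topic distributions; illustrative sketch.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

paper_topics = [0.70, 0.20, 0.10]   # LDA topic mixture of one paper
field_topics = [0.30, 0.40, 0.30]   # mixture of subsequent literature
print(kl_divergence(paper_topics, field_topics))  # higher = more "surprise"
```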

GNAT: A General Narrative Alignment Tool

  • paper_url: http://arxiv.org/abs/2311.03627
  • repo_url: None
  • paper_authors: Tanzir Pial, Steven Skiena
  • for: Comparing and aligning narrative documents, including distant versions such as translations, retellings, summaries, and abridgements.
  • methods: The Smith-Waterman algorithm from bioinformatics coupled with modern text similarity metrics.
  • results: The tool handles document pairs of very different types and lengths, and rigorous p-values can be defined for the significance of any alignment.
    Abstract Algorithmic sequence alignment identifies similar segments shared between pairs of documents, and is fundamental to many NLP tasks. But it is difficult to recognize similarities between distant versions of narratives such as translations and retellings, particularly for summaries and abridgements which are much shorter than the original novels. We develop a general approach to narrative alignment coupling the Smith-Waterman algorithm from bioinformatics with modern text similarity metrics. We show that the background of alignment scores fits a Gumbel distribution, enabling us to define rigorous p-values on the significance of any alignment. We apply and evaluate our general narrative alignment tool (GNAT) on four distinct problem domains differing greatly in both the relative and absolute length of documents, namely summary-to-book alignment, translated book alignment, short story alignment, and plagiarism detection -- demonstrating the power and performance of our methods.
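A minimal sketch of the core idea: Smith-Waterman local alignment driven by text-similarity scores in place of a biological substitution matrix. The gap penalty and the shift that makes unrelated pairs score negatively are illustrative assumptions.

```python
# Smith-Waterman local alignment over sentence similarities; illustrative sketch.
import numpy as np

def smith_waterman(sim, gap=0.5):
    """sim[i, j]: similarity of sentence i (doc A) and sentence j (doc B),
    shifted so that unrelated pairs score below zero (e.g., cosine - 0.5)."""
    n, m = sim.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + sim[i - 1, j - 1],  # match/mismatch
                          H[i - 1, j] - gap,                    # gap in doc B
                          H[i, j - 1] - gap)                    # gap in doc A
    return H.max()   # best local alignment score

# Since alignment scores of unrelated documents follow a Gumbel law, a p-value
# can be estimated by fitting a Gumbel distribution to scores of null (shuffled)
# document pairs and evaluating its survival function at the observed score.
```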

cs.LG - 2023-11-07

Device Sampling and Resource Optimization for Federated Learning in Cooperative Edge Networks

  • paper_url: http://arxiv.org/abs/2311.04350
  • repo_url: None
  • paper_authors: Su Wang, Roberto Morabito, Seyyedali Hosseinalipour, Mung Chiang, Christopher G. Brinton
  • for: To improve federated learning (FedL) training accuracy in modern wireless networks with heterogeneous computation/communication resources and overlapping device data distributions.
  • methods: A novel optimization methodology combining intelligent device sampling with device-to-device (D2D) offloading, selecting the best combination of sampled nodes and offloading configuration to maximize FedL training accuracy while minimizing data processing and D2D communication resource consumption.
  • results: Theoretical analysis of the D2D offloading subproblem yields new FedL convergence bounds and an efficient sequential convex optimizer; a sampling method based on graph convolutional networks (GCNs) maximizes FedL accuracy and, in experiments, outperforms common device sampling methods from the literature in ML model performance, data processing overhead, and energy consumption.
    Abstract The conventional federated learning (FedL) architecture distributes machine learning (ML) across worker devices by having them train local models that are periodically aggregated by a server. FedL ignores two important characteristics of contemporary wireless networks, however: (i) the network may contain heterogeneous communication/computation resources, and (ii) there may be significant overlaps in devices' local data distributions. In this work, we develop a novel optimization methodology that jointly accounts for these factors via intelligent device sampling complemented by device-to-device (D2D) offloading. Our optimization methodology aims to select the best combination of sampled nodes and data offloading configuration to maximize FedL training accuracy while minimizing data processing and D2D communication resource consumption subject to realistic constraints on the network topology and device capabilities. Theoretical analysis of the D2D offloading subproblem leads to new FedL convergence bounds and an efficient sequential convex optimizer. Using these results, we develop a sampling methodology based on graph convolutional networks (GCNs) which learns the relationship between network attributes, sampled nodes, and D2D data offloading to maximize FedL accuracy. Through evaluation on popular datasets and real-world network measurements from our edge testbed, we find that our methodology outperforms popular device sampling methodologies from literature in terms of ML model performance, data processing overhead, and energy consumption.

InstrumentGen: Generating Sample-Based Musical Instruments From Text

  • paper_url: http://arxiv.org/abs/2311.04339
  • repo_url: None
  • paper_authors: Shahan Nercessian, Johannes Imort
  • for: Generating sample-based musical instruments from textual prompts.
  • methods: The proposed InstrumentGen model extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding.
  • results: The results establish a foundational text-to-instrument baseline, extending research in automatic sample-based instrument generation.
    Abstract We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. Accordingly, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational text-to-instrument baseline, extending research in the domain of automatic sample-based instrument generation.

Convex Methods for Constrained Linear Bandits

  • paper_url: http://arxiv.org/abs/2311.04338
  • repo_url: None
  • paper_authors: Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall
  • for: This paper focuses on real-world safety-critical systems that involve repeated interactions with humans.
  • methods: A computationally efficient policy-optimization framework for safe linear bandits that leverages convex programming tools.
  • results: An end-to-end pipeline of safe linear bandit algorithms that only involves solving convex problems, together with a numerical evaluation of its performance.
    Abstract Recently, bandit optimization has received significant attention in real-world safety-critical systems that involve repeated interactions with humans. While there exist various algorithms with performance guarantees in the literature, practical implementation of the algorithms has not received as much attention. This work presents a comprehensive study on the computational aspects of safe bandit algorithms, specifically safe linear bandits, by introducing a framework that leverages convex programming tools to create computationally efficient policies. In particular, we first characterize the properties of the optimal policy for safe linear bandit problem and then propose an end-to-end pipeline of safe linear bandit algorithms that only involves solving convex problems. We also numerically evaluate the performance of our proposed methods.

Lie Point Symmetry and Physics Informed Networks

  • paper_url: http://arxiv.org/abs/2311.04293
  • repo_url: None
  • paper_authors: Tara Akhound-Sadegh, Laurence Perreault-Levasseur, Johannes Brandstetter, Max Welling, Siamak Ravanbakhsh
  • for: To improve the generalization of neural networks for solving partial differential equations (PDEs).
  • methods: Lie point symmetries of PDEs are incorporated into physics-informed neural networks (PINNs), a major family of neural solvers, via a loss function that informs the network about the symmetries, analogous to how PINNs enforce the underlying PDE.
  • results: The symmetry loss ensures that the infinitesimal generators of the Lie group conserve PDE solutions, so that once the network learns a solution it also learns the neighboring solutions generated by Lie point symmetries; empirically, this inductive bias greatly boosts the sample efficiency of PINNs.
    Abstract Symmetries have been leveraged to improve the generalization of neural networks through different mechanisms from data augmentation to equivariant architectures. However, despite their potential, their integration into neural solvers for partial differential equations (PDEs) remains largely unexplored. We explore the integration of PDE symmetries, known as Lie point symmetries, in a major family of neural solvers known as physics-informed neural networks (PINNs). We propose a loss function that informs the network about Lie point symmetries in the same way that PINN models try to enforce the underlying PDE through a loss function. Intuitively, our symmetry loss ensures that the infinitesimal generators of the Lie group conserve the PDE solutions. Effectively, this means that once the network learns a solution, it also learns the neighbouring solutions generated by Lie point symmetries. Empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.

Compilation of product-formula Hamiltonian simulation via reinforcement learning

  • paper_url: http://arxiv.org/abs/2311.04285
  • repo_url: https://github.com/leamarion/rl-for-compilation-of-product-formula-hamiltonian-simulation
  • paper_authors: Lea M. Trenkwalder, Eleanor Scerri, Thomas E. O’Brien, Vedran Dunjko
  • for: Hamiltonian simulation is among the first tasks where quantum computers may yield a quantum advantage.
  • methods: Trotterization uses the approximation $e^{i\sum_jA_j}\sim \prod_je^{iA_j}$ and higher-order corrections, which leaves open the order of operations in the product over $j$, a choice known to affect approximation quality. When this order is not fixed by minimizing the approximation error, the authors propose choosing it to optimize compilation to a native quantum architecture; the resulting order-agnostic quantum circuit compilation problem is proven NP-hard in the worst case.
  • results: We compare three methods of heuristic optimization of compilation: simulated annealing, Monte Carlo tree search, and reinforcement learning. While two of the methods outperform a naive heuristic, reinforcement learning clearly outperforms all others, with a gain of around 12% with respect to the second-best method and of around 50% compared to the naive heuristic in terms of the gate count. We also test the ability of RL to generalize across instances of the compilation problem, and find that a single learner is able to solve entire problem families. This demonstrates the ability of machine learning techniques to provide assistance in an order-agnostic quantum compilation task.
    Abstract Hamiltonian simulation is believed to be one of the first tasks where quantum computers can yield a quantum advantage. One of the most popular methods of Hamiltonian simulation is Trotterization, which makes use of the approximation $e^{i\sum_jA_j}\sim \prod_je^{iA_j}$ and higher-order corrections thereto. However, this leaves open the question of the order of operations (i.e. the order of the product over $j$, which is known to affect the quality of approximation). In some cases this order is fixed by the desire to minimise the error of approximation; when it is not the case, we propose that the order can be chosen to optimize compilation to a native quantum architecture. This presents a new compilation problem -- order-agnostic quantum circuit compilation -- which we prove is NP-hard in the worst case. In lieu of an easily-computable exact solution, we turn to methods of heuristic optimization of compilation. We focus on reinforcement learning due to the sequential nature of the compilation task, comparing it to simulated annealing and Monte Carlo tree search. While two of the methods outperform a naive heuristic, reinforcement learning clearly outperforms all others, with a gain of around 12% with respect to the second-best method and of around 50% compared to the naive heuristic in terms of the gate count. We further test the ability of RL to generalize across instances of the compilation problem, and find that a single learner is able to solve entire problem families. This demonstrates the ability of machine learning techniques to provide assistance in an order-agnostic quantum compilation task.

Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

  • paper_url: http://arxiv.org/abs/2311.04161
  • repo_url: None
  • paper_authors: Nikita Puchkin, Eduard Gorbunov, Nikolay Kutuzov, Alexander Gasnikov
  • for: Solving stochastic optimization problems with heavy-tailed noise having structured density.
  • methods: Stochastic gradients are stabilized using smoothed medians of means, yielding estimates with negligible bias and controllable variance that are incorporated into clipped-SGD and clipped-SSTM.
  • results: Convergence rates faster than $\mathcal{O}(K^{-2(\alpha - 1)/\alpha})$ when the stochastic gradients have finite moments of order $\alpha \in (1, 2]$, even when the noise norm has unbounded expectation.
    Abstract We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-2(\alpha - 1)/\alpha})$, when the stochastic gradients have finite moments of order $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.
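A minimal sketch of a median-of-means gradient estimator inside a clipped-SGD step; the plain coordinate-wise median here stands in for the paper's smoothed variant, which is an assumption of this sketch.

```python
# Median-of-means gradient estimation with clipping; illustrative sketch.
import numpy as np

def median_of_means(grads, num_blocks=5):
    """grads: [num_samples, dim] stochastic gradients with heavy-tailed noise."""
    blocks = np.array_split(grads, num_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    return np.median(block_means, axis=0)          # robust to outlier gradients

def clipped_sgd_step(x, grads, lr=0.1, clip_level=1.0):
    g = median_of_means(grads)
    norm = np.linalg.norm(g)
    if norm > clip_level:                          # clip to the given radius
        g = g * (clip_level / norm)
    return x - lr * g
```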

Computing Approximate $\ell_p$ Sensitivities

  • paper_url: http://arxiv.org/abs/2311.04158
  • repo_url: None
  • paper_authors: Swati Padmanabhan, David P. Woodruff, Qiuyi Zhang
  • for: Computing sensitivities, estimates of the importance of individual datapoints, with provable guarantees for dimensionality reduction in regression, so that low-sensitivity datapoints can be removed via subsampling.
  • methods: Fast algorithms for approximating $\ell_p$ sensitivities and related summary statistics, including $\alpha$-approximation of $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations, and an importance-sampling algorithm based on $\ell_p$ Lewis weights that estimates the total $\ell_p$ sensitivity with roughly $O(\sqrt{d})$ sensitivity computations.
  • results: For a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, indicating that real-world datasets have low intrinsic effective dimensionality.
    Abstract Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating $\ell_p$ sensitivities, which we show is equivalent to approximate $\ell_p$ regression, are known for only the $\ell_2$ setting, in which they are termed leverage scores. In this work, we provide efficient algorithms for approximating $\ell_p$ sensitivities and related summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e. the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation to the total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.
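For the $\ell_2$ case, sensitivities reduce to the classical leverage scores $\tau_i = a_i^\top (A^\top A)^{+} a_i$, computable from a thin QR factorization; the sketch below shows this standard computation as context for the paper's $\ell_p$ generalizations.

```python
# Classical l2 leverage scores via thin QR; standard computation, shown as context.
import numpy as np

def leverage_scores(A):
    Q, _ = np.linalg.qr(A)               # thin QR: A = QR, Q is n x d
    return (Q ** 2).sum(axis=1)          # tau_i = ||row i of Q||^2

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 10))
tau = leverage_scores(A)
print(tau.sum())                          # total l2 sensitivity = rank(A) = 10
```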

Kernel-, mean- and noise-marginalised Gaussian processes for exoplanet transits and $H_0$ inference

  • paper_url: http://arxiv.org/abs/2311.04153
  • repo_url: https://github.com/zwei-beiner/transdimensional_sampler
  • paper_authors: Namu Kroupa, David Yallup, Will Handley, Michael Hobson
  • for: To extend Gaussian process regression to marginalize over the kernel choice and kernel hyperparameters.
  • methods: A fully Bayesian approach in which models are compared via the evidence; the joint posterior is computed with a transdimensional sampler that simultaneously samples over the discrete kernel choice and its hyperparameters by embedding them in a higher-dimensional space explored with nested sampling.
  • results: On synthetic exoplanet transit light curves, the true kernel is recovered in the low-noise regime, while no kernel is preferred at larger noise; in the high-noise regime, bias in the posteriors is removed, posteriors are broadened, or inference accuracy is increased, and the uncertainty of the mean-function predictive distribution grows due to kernel uncertainty. The method is further extended to marginalize over mean functions and noise models and applied to inferring the present-day Hubble parameter $H_0$.
    Abstract Using a fully Bayesian approach, Gaussian Process regression is extended to include marginalisation over the kernel choice and kernel hyperparameters. In addition, Bayesian model comparison via the evidence enables direct kernel comparison. The calculation of the joint posterior was implemented with a transdimensional sampler which simultaneously samples over the discrete kernel choice and their hyperparameters by embedding these in a higher-dimensional space, from which samples are taken using nested sampling. This method was explored on synthetic data from exoplanet transit light curve simulations. The true kernel was recovered in the low noise region while no kernel was preferred for larger noise. Furthermore, inference of the physical exoplanet hyperparameters was conducted. In the high noise region, either the bias in the posteriors was removed, the posteriors were broadened or the accuracy of the inference was increased. In addition, the uncertainty in mean function predictive distribution increased due to the uncertainty in the kernel choice. Subsequently, the method was extended to marginalisation over mean functions and noise models and applied to the inference of the present-day Hubble parameter, $H_0$, from real measurements of the Hubble parameter as a function of redshift, derived from the cosmologically model-independent cosmic chronometer and {\Lambda}CDM-dependent baryon acoustic oscillation observations. The inferred $H_0$ values from the cosmic chronometers, baryon acoustic oscillations and combined datasets are $H_0$ = 66$\pm$6 km/s/Mpc, $H_0$ = 67$\pm$10 km/s/Mpc and $H_0$ = 69$\pm$6 km/s/Mpc, respectively. The kernel posterior of the cosmic chronometers dataset prefers a non-stationary linear kernel. Finally, the datasets are shown to be not in tension with ln(R)=12.17$\pm$0.02.
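As a much-simplified stand-in for the paper's transdimensional nested sampling, kernels can be compared by their optimized log marginal likelihood; the sketch below does this with scikit-learn on toy data (this approximation ignores the full marginalization over hyperparameters).

```python
# Kernel comparison via (optimized) log marginal likelihood; simplified sketch.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, DotProduct, WhiteKernel

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)   # toy "light curve"

kernels = {"RBF": RBF(), "Matern": Matern(nu=1.5), "Linear": DotProduct()}
for name, k in kernels.items():
    gp = GaussianProcessRegressor(kernel=k + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)   # optimizes hyperparameters by maximizing the evidence proxy
    print(name, gp.log_marginal_likelihood_value_)   # higher = preferred kernel
```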

HyperS2V: A Framework for Structural Representation of Nodes in Hyper Networks

  • paper_url: http://arxiv.org/abs/2311.04149
  • repo_url: https://github.com/liushu2019/hypers2v
  • paper_authors: Shu Liu, Cameron Lai, Fujio Toriumi
  • for: To propose a structural-similarity-based node embedding method, HyperS2V, enabling machine learning on hyper networks, which encode more complex relationships than simple networks.
  • methods: HyperS2V introduces the concept of hyper-degrees to capture the structural properties of nodes in hyper networks, defines a novel function measuring structural similarity between hyper-degree values, and generates structural embeddings via a multi-scale random walk framework.
  • results: Intrinsic and extrinsic experiments on toy and real networks demonstrate superior interpretability and applicability to downstream tasks.
    Abstract In contrast to regular (simple) networks, hyper networks possess the ability to depict more complex relationships among nodes and store extensive information. Such networks are commonly found in real-world applications, such as in social interactions. Learning embedded representations for nodes involves a process that translates network structures into more simplified spaces, thereby enabling the application of machine learning approaches designed for vector data to be extended to network data. Nevertheless, there remains a need to delve into methods for learning embedded representations that prioritize structural aspects. This research introduces HyperS2V, a node embedding approach that centers on the structural similarity within hyper networks. Initially, we establish the concept of hyper-degrees to capture the structural properties of nodes within hyper networks. Subsequently, a novel function is formulated to measure the structural similarity between different hyper-degree values. Lastly, we generate structural embeddings utilizing a multi-scale random walk framework. Moreover, a series of experiments, both intrinsic and extrinsic, are performed on both toy and real networks. The results underscore the superior performance of HyperS2V in terms of both interpretability and applicability to downstream tasks.

Multi-resolution Time-Series Transformer for Long-term Forecasting

  • paper_url: http://arxiv.org/abs/2311.04147
  • repo_url: None
  • paper_authors: Yitian Zhang, Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Mark Coates
  • for: To improve long-term time-series forecasting, where patch-based tokenization lets transformers learn complex temporal patterns at different frequencies.
  • methods: A novel framework, the Multi-resolution Time-Series Transformer (MTST), whose multi-branch architecture simultaneously models diverse temporal patterns at different resolutions.
  • results: Extensive experiments on several real-world datasets demonstrate the effectiveness of MTST compared with state-of-the-art forecasting techniques.
    Abstract The performance of transformers for time-series forecasting has improved significantly. Recent architectures learn complex temporal patterns by segmenting a time-series into patches and using the patches as tokens. The patch size controls the ability of transformers to learn the temporal patterns at different frequencies: shorter patches are effective for learning localized, high-frequency patterns, whereas mining long-term seasonalities and trends requires longer patches. Inspired by this observation, we propose a novel framework, Multi-resolution Time-Series Transformer (MTST), which consists of a multi-branch architecture for simultaneous modeling of diverse temporal patterns at different resolutions. In contrast to many existing time-series transformers, we employ relative positional encoding, which is better suited for extracting periodic components at different scales. Extensive experiments on several real-world datasets demonstrate the effectiveness of MTST in comparison to state-of-the-art forecasting techniques.
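The multi-resolution tokenization can be sketched in a few lines: the same series is segmented into patches of several sizes, with each resolution intended for its own transformer branch. The patch sizes and unfold-based tokenization below are illustrative assumptions.

```python
# Multi-resolution patch tokenization of a time series; illustrative sketch.
import torch

def multi_resolution_patches(x, patch_sizes=(8, 32, 96)):
    """x: [batch, length] univariate series -> one patch tensor per branch."""
    branches = []
    for p in patch_sizes:
        usable = (x.size(1) // p) * p               # drop the ragged tail
        patches = x[:, :usable].unfold(dimension=1, size=p, step=p)
        branches.append(patches)                    # [batch, num_patches, p]
    return branches

x = torch.randn(4, 960)
for b in multi_resolution_patches(x):
    print(b.shape)   # short patches: local detail; long patches: seasonality
```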

Generative learning for nonlinear dynamics

  • paper_url: http://arxiv.org/abs/2311.04128
  • repo_url: None
  • paper_authors: William Gilpin
  • for: This perspective connects classical work in nonlinear dynamics to emerging themes in large-scale generative statistical learning.
  • methods: Tools from information theory and nonlinear dynamics, originally developed to infer properties of chaotic attractors from time series, are used to examine how such techniques apply to modern generative models.
  • results: Classical tools such as attractor reconstruction and symbolic representations remain useful for analyzing and interpreting large-scale generative models, and emerging interdisciplinary work, e.g., operator-theoretic methods for complex fluid flows and detection of broken detailed balance in biological data, bridges nonlinear dynamics and learning theory.
    Abstract Modern generative machine learning models demonstrate surprising ability to create realistic outputs far beyond their training data, such as photorealistic artwork, accurate protein structures, or conversational text. These successes suggest that generative models learn to effectively parametrize and sample arbitrarily complex distributions. Beginning half a century ago, foundational works in nonlinear dynamics used tools from information theory to infer properties of chaotic attractors from time series, motivating the development of algorithms for parametrizing chaos in real datasets. In this perspective, we aim to connect these classical works to emerging themes in large-scale generative statistical learning. We first consider classical attractor reconstruction, which mirrors constraints on latent representations learned by state space models of time series. We next revisit early efforts to use symbolic approximations to compare minimal discrete generators underlying complex processes, a problem relevant to modern efforts to distill and interpret black-box statistical models. Emerging interdisciplinary works bridge nonlinear dynamics and learning theory, such as operator-theoretic methods for complex fluid flows, or detection of broken detailed balance in biological datasets. We anticipate that future machine learning techniques may revisit other classical concepts from nonlinear dynamics, such as transinformation decay and complexity-entropy tradeoffs.
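
The classical attractor reconstruction mentioned in the abstract can be illustrated with time-delay (Takens) embedding. Below is a minimal sketch; the delay `tau` and embedding dimension `dim` are illustrative choices rather than values selected by any principled criterion.

```python
import numpy as np

def delay_embed(x: np.ndarray, dim: int, tau: int) -> np.ndarray:
    """Map a scalar series to points (x_t, x_{t+tau}, ..., x_{t+(dim-1)tau})."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

t = np.linspace(0, 60, 6000)
x = np.sin(t) + 0.5 * np.sin(2.2 * t)     # proxy for one observed coordinate
points = delay_embed(x, dim=3, tau=25)     # reconstructed "attractor" in R^3
print(points.shape)                        # (5950, 3)
```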

Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection

  • paper_url: http://arxiv.org/abs/2311.04109
  • repo_url: None
  • paper_authors: Benjamin Steenhoek, Md Mahbubur Rahman, Shaila Sharmin, Wei Le
  • for: This paper analyzes how well pretrained language models align with bug semantics in the context of vulnerability detection.
  • methods: The paper analyzes the models using three distinct methods: interpretability tools, attention analysis, and interaction matrix analysis.
  • results: Better-performing models align better with potentially vulnerable statements (PVS), but no model aligns strongly with PVS, and none aligns with buggy paths at all. Two annotation methods that highlight bug semantics inside the model's inputs improve performance in 11 of 16 settings, by up to 9.57 F1 points.
    Abstract Recently, pretrained language models have shown state-of-the-art performance on the vulnerability detection task. These models are pretrained on a large corpus of source code, then fine-tuned on a smaller supervised vulnerability dataset. Due to the different training objectives and the performance of the models, it is interesting to consider whether the models have learned the semantics of code relevant to vulnerability detection, namely bug semantics, and if so, how the alignment to bug semantics relates to model performance. In this paper, we analyze the models using three distinct methods: interpretability tools, attention analysis, and interaction matrix analysis. We compare the models' influential feature sets with the bug semantic features which define the causes of bugs, including buggy paths and Potentially Vulnerable Statements (PVS). We find that (1) better-performing models also aligned better with PVS, (2) the models failed to align strongly to PVS, and (3) the models failed to align at all to buggy paths. Based on our analysis, we developed two annotation methods which highlight the bug semantics inside the model's inputs. We evaluated our approach on four distinct transformer models and four vulnerability datasets and found that our annotations improved the models' performance in the majority of settings - 11 out of 16, with up to 9.57 points improvement in F1 score compared to conventional fine-tuning. We further found that with our annotations, the models aligned up to 232% better to potentially vulnerable statements. Our findings indicate that it is helpful to provide the model with information of the bug semantics, that the model can attend to it, and motivate future work in learning more complex path-based bug semantics. Our code and data are available at https://figshare.com/s/4a16a528d6874aad51a0.

Time-Efficient Reinforcement Learning with Stochastic Stateful Policies

  • paper_url: http://arxiv.org/abs/2311.04082
  • repo_url: None
  • paper_authors: Firas Al-Hafez, Guoping Zhao, Jan Peters, Davide Tateo
  • for: Addresses the drawbacks of Backpropagation Through Time (BPTT) for training stateful policies, such as slow training and vanishing or exploding gradients.
  • methods: Decomposes the stateful policy into a stochastic internal state kernel and a stateless policy, which are optimized jointly by following the stateful policy gradient.
  • results: Evaluated on complex continuous control tasks such as humanoid locomotion; the proposed gradient estimator scales with task complexity and offers a faster and simpler alternative to BPTT.
    Abstract Stateful policies play an important role in reinforcement learning, such as handling partially observable environments, enhancing robustness, or imposing an inductive bias directly into the policy structure. The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which comes with significant drawbacks, such as slow training due to sequential gradient propagation and the occurrence of vanishing or exploding gradients. The gradient is often truncated to address these issues, resulting in a biased policy update. We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy, jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we provide a theoretical analysis of our new gradient estimator and compare it with BPTT. We evaluate our approach on complex continuous control tasks, e.g., humanoid locomotion, and demonstrate that our gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.
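
A minimal sketch of the decomposition the abstract describes — a stochastic internal state kernel plus a stateless policy, applied step by step — might look as follows. The Gaussian parameterization and network sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StatefulPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, z_dim: int):
        super().__init__()
        # stochastic internal-state kernel: mean/log-std of the next state z
        self.kernel = nn.Sequential(nn.Linear(obs_dim + z_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 2 * z_dim))
        # stateless policy: acts on (observation, internal state)
        self.policy = nn.Sequential(nn.Linear(obs_dim + z_dim, 64), nn.Tanh(),
                                    nn.Linear(64, act_dim))

    def step(self, obs, z):
        mu, log_std = self.kernel(torch.cat([obs, z], -1)).chunk(2, -1)
        z_next = mu + log_std.exp() * torch.randn_like(mu)  # sample next state
        action = self.policy(torch.cat([obs, z_next], -1))
        return action, z_next

pi = StatefulPolicy(obs_dim=4, act_dim=2, z_dim=8)
obs, z = torch.randn(1, 4), torch.zeros(1, 8)
action, z = pi.step(obs, z)   # both parts are optimized jointly in the paper
```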

Estimator-Coupled Reinforcement Learning for Robust Purely Tactile In-Hand Manipulation

  • paper_url: http://arxiv.org/abs/2311.04060
  • repo_url: None
  • paper_authors: Lennart Röstel, Johannes Pitz, Leon Sievers, Berthold Bäuml
  • for: Tackles purely tactile, goal-conditioned, dexterous in-hand reorientation with the hand pointing downwards, where limited sensing makes accurate state estimation difficult.
  • methods: Couples the control policy to the state estimator already during training in simulation, instead of training the two separately and combining them at test time; with a GPU-accelerated implementation, learning from scratch takes a median of only 6.5 hours on a single low-cost GPU.
  • results: Demonstrates successful sim2real transfer by rotating four significantly different object shapes to all 24 orientations in the π/2 discretization of SO(3), and consecutively reorients a cube to nine goals (median), which was beyond the reach of previous methods.
    Abstract This paper identifies and addresses the problems with naively combining (reinforcement) learning-based controllers and state estimators for robotic in-hand manipulation. Specifically, we tackle the challenging task of purely tactile, goal-conditioned, dextrous in-hand reorientation with the hand pointing downwards. Due to the limited sensing available, many control strategies that are feasible in simulation when having full knowledge of the object's state do not allow for accurate state estimation. Hence, separately training the controller and the estimator and combining the two at test time leads to poor performance. We solve this problem by coupling the control policy to the state estimator already during training in simulation. This approach leads to more robust state estimation and overall higher performance on the task while maintaining an interpretability advantage over end-to-end policy learning. With our GPU-accelerated implementation, learning from scratch takes a median training time of only 6.5 hours on a single, low-cost GPU. In simulation experiments with the DLR-Hand II and for four significantly different object shapes, we provide an in-depth analysis of the performance of our approach. We demonstrate the successful sim2real transfer by rotating the four objects to all 24 orientations in the $\pi/2$ discretization of SO(3), which has never been achieved for such a diverse set of shapes. Finally, our method can reorient a cube consecutively to nine goals (median), which was beyond the reach of previous methods in this challenging setting.

Feature Space Renormalization for Semi-supervised Learning

  • paper_url: http://arxiv.org/abs/2311.04055
  • repo_url: None
  • paper_authors: Jun Sun, Zhongjie Mao, Chao Li, Chao Zhou, Xiao-Jun Wu
  • for: Proposes a new semi-supervised learning (SSL) method that leverages unlabelled data to reduce models' dependence on large labelled datasets.
  • methods: Replaces the commonly used consistency regularization with a feature space renormalization (FSR) mechanism to learn more discriminative features, and combines it with pseudo-labelling to obtain a new SSL model named FreMatch.
  • results: Experiments show better performance on a variety of standard SSL benchmark datasets, and the FSR mechanism also enhances other SSL approaches.
    Abstract Semi-supervised learning (SSL) has been proven to be a powerful method for leveraging unlabelled data to alleviate models' dependence on large labelled datasets. The common framework among recent approaches is to train the model on a large amount of unlabelled data with consistency regularization to constrain the model predictions to be invariant to input perturbation. However, the existing SSL frameworks still have room for improvement in the consistency regularization method. Instead of regularizing category predictions in the label space as in existing frameworks, this paper proposes a feature space renormalization (FSR) mechanism for SSL. First, we propose a feature space renormalization mechanism to substitute for the commonly used consistency regularization mechanism to learn better discriminative features. To apply this mechanism, we start by building a basic model and an empirical model and then introduce our mechanism to renormalize the feature learning of the basic model with the guidance of the empirical model. Second, we combine the proposed mechanism with pseudo-labelling to obtain a novel effective SSL model named FreMatch. The experimental results show that our method can achieve better performance on a variety of standard SSL benchmark datasets, and the proposed feature space renormalization mechanism can also enhance the performance of other SSL approaches.

Extracting human interpretable structure-property relationships in chemistry using XAI and large language models

  • paper_url: http://arxiv.org/abs/2311.04047
  • repo_url: https://github.com/geemi725/xpertai
  • paper_authors: Geemi P. Wellawatte, Philippe Schwaller
  • for: Addresses the opacity of machine learning models in chemistry by automatically generating accessible natural language explanations of raw chemical data.
  • methods: Introduces the XpertAI framework, which integrates XAI methods with large language models (LLMs) that access the scientific literature.
  • results: Five case studies show that XpertAI combines the strengths of XAI tools and LLMs to generate specific, scientific, and interpretable explanations.
    Abstract Explainable Artificial Intelligence (XAI) is an emerging field in AI that aims to address the opaque nature of machine learning models. Furthermore, it has been shown that XAI can be used to extract input-output relationships, making them a useful tool in chemistry to understand structure-property relationships. However, one of the main limitations of XAI methods is that they are developed for technically oriented users. We propose the XpertAI framework that integrates XAI methods with large language models (LLMs) accessing scientific literature to generate accessible natural language explanations of raw chemical data automatically. We conducted 5 case studies to evaluate the performance of XpertAI. Our results show that XpertAI combines the strengths of LLMs and XAI tools in generating specific, scientific, and interpretable explanations.

Discordance Minimization-based Imputation Algorithms for Missing Values in Rating Data

  • paper_url: http://arxiv.org/abs/2311.04035
  • repo_url: None
  • paper_authors: Young Woong Park, Jinhak Kim, Dan Zhu
  • for: Addresses missing ratings that arise when multiple rating lists are combined, since most lists do not rate every subject.
  • methods: Analyzes missing value patterns on six real-world data sets and proposes optimization models and algorithms that impute missing ratings using only the known rating information, by minimizing the total rating discordance across rating providers.
  • results: Computational experiments on real-world and synthetic rating data sets show that the proposed methods outperform state-of-the-art general imputation methods in terms of imputation accuracy.
    Abstract Ratings are frequently used to evaluate and compare subjects in various applications, from education to healthcare, because ratings provide succinct yet credible measures for comparing subjects. However, when multiple rating lists are combined or considered together, subjects often have missing ratings, because most rating lists do not rate every subject in the combined list. In this study, we propose analyses on missing value patterns using six real-world data sets in various applications, as well as the conditions for applicability of imputation algorithms. Based on the special structures and properties derived from the analyses, we propose optimization models and algorithms that minimize the total rating discordance across rating providers to impute missing ratings in the combined rating lists, using only the known rating information. The total rating discordance is defined as the sum of the pairwise discordance metric, which can be written as a quadratic function. Computational experiments based on real-world and synthetic rating data sets show that the proposed methods outperform the state-of-the-art general imputation methods in the literature in terms of imputation accuracy.
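
The abstract defines the objective as a sum of pairwise discordance terms expressible as a quadratic function. The squared-difference form below is only an illustrative assumption of what such a term could look like; an imputation algorithm would choose the missing entries so as to minimize this total.

```python
import numpy as np

def pairwise_discordance(a: np.ndarray, b: np.ndarray) -> float:
    """Quadratic discordance between two providers on commonly rated subjects."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    return float(np.sum((a[mask] - b[mask]) ** 2))

def total_discordance(ratings: np.ndarray) -> float:
    """Sum of pairwise discordance over all provider pairs (rows = providers)."""
    m = ratings.shape[0]
    return sum(pairwise_discordance(ratings[i], ratings[j])
               for i in range(m) for j in range(i + 1, m))

R = np.array([[4.0, 3.0, np.nan, 5.0],
              [4.5, np.nan, 2.0, 4.0]])
# An imputation would fill the NaNs so as to minimize total_discordance(R).
print(total_discordance(R))   # 1.25 on the known overlap
```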

Joint model for longitudinal and spatio-temporal survival data

  • paper_url: http://arxiv.org/abs/2311.04008
  • repo_url: None
  • paper_authors: Victor Medina-Olivares, Finn Lindgren, Raffaella Calabrese, Jonathan Crook
  • for: Focuses on survival models for credit risk analysis, in particular modelling a borrower's time-to-event with time-varying covariates.
  • methods: Proposes the Spatio-Temporal Joint Model (STJM), a Bayesian hierarchical joint model that captures spatial and temporal effects and their interaction, estimated with the Integrated Nested Laplace Approximation (INLA) to handle large datasets.
  • results: On a dataset of 57,258 US mortgage borrowers with more than 2.5 million observations, including spatial effects consistently improves performance, while the additional gains from spatio-temporal interactions are less definitive.
    Abstract In credit risk analysis, survival models with fixed and time-varying covariates are widely used to predict a borrower's time-to-event. When the time-varying drivers are endogenous, modelling jointly the evolution of the survival time and the endogenous covariates is the most appropriate approach, also known as the joint model for longitudinal and survival data. In addition to the temporal component, credit risk models can be enhanced when including borrowers' geographical information by considering spatial clustering and its variation over time. We propose the Spatio-Temporal Joint Model (STJM) to capture spatial and temporal effects and their interaction. This Bayesian hierarchical joint model reckons the survival effect of unobserved heterogeneity among borrowers located in the same region at a particular time. To estimate the STJM model for large datasets, we consider the Integrated Nested Laplace Approximation (INLA) methodology. We apply the STJM to predict the time to full prepayment on a large dataset of 57,258 US mortgage borrowers with more than 2.5 million observations. Empirical results indicate that including spatial effects consistently improves the performance of the joint model. However, the gains are less definitive when we additionally include spatio-temporal interactions.

An Initialization Schema for Neuronal Networks on Tabular Data

  • paper_url: http://arxiv.org/abs/2311.03996
  • repo_url: None
  • paper_authors: Wolfgang Fuhl
  • for: This paper studies neural networks for regression and classification on heterogeneous tabular data.
  • methods: It proposes a binomial initialization scheme for the first hidden layer, together with a joint ensemble training method that combines gradient masking of batch entries with binomial initialization of the last layer.
  • results: Experiments on multiple public datasets show that the approach outperforms other neural network-based methods.
    Abstract Nowadays, many modern applications require heterogeneous tabular data, which is still a challenging task in terms of regression and classification. Many approaches have been proposed to adapt neural networks for this task, but still, boosting and bagging of decision trees are the best-performing methods for this task. In this paper, we show that a binomial initialized neural network can be used effectively on tabular data. The proposed approach shows a simple but effective approach for initializing the first hidden layer in neural networks. We also show that this initializing schema can be used to jointly train ensembles by adding gradient masking to batch entries and using the binomial initialization for the last layer in a neural network. For this purpose, we modified the hinge binary loss and the soft max loss to make them applicable for joint ensemble training. We evaluate our approach on multiple public datasets and showcase the improved performance compared to other neural network-based approaches. In addition, we discuss the limitations and possible further research of our approach for improving the applicability of neural networks to tabular data. Link: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FInitializationNeuronalNetworksTabularData&mode=list
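
The abstract does not spell out the binomial initialization itself, so the following is only a plausible sketch: first-layer weights drawn from a centered, variance-normalized binomial distribution with fan-in scaling. The `n_trials` value and the scaling rule are assumptions, not the paper's scheme.

```python
import torch
import torch.nn as nn

def binomial_init_(layer: nn.Linear, n_trials: int = 8) -> None:
    """In-place init: centered, variance-normalized binomial draws (assumed)."""
    with torch.no_grad():
        b = torch.distributions.Binomial(n_trials, 0.5).sample(layer.weight.shape)
        b = (b - n_trials / 2) / (0.5 * n_trials ** 0.5)      # mean 0, variance 1
        layer.weight.copy_(b / layer.weight.shape[1] ** 0.5)  # fan-in scaling
        layer.bias.zero_()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
binomial_init_(model[0])   # only the first hidden layer gets the binomial init
```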

Bandit Pareto Set Identification: the Fixed Budget Setting

  • paper_url: http://arxiv.org/abs/2311.03992
  • repo_url: None
  • paper_authors: Cyrille Kone, Emilie Kaufmann, Laura Richert
  • for: Studies a multi-objective pure exploration problem in a multi-armed bandit model: identifying the arms whose unknown multivariate mean is Pareto optimal.
  • methods: Proposes and analyzes the first algorithms for fixed-budget Pareto set identification: Empirical Gap Elimination, a family of algorithms that combines a careful estimate of how hard each arm is to classify in or out of the Pareto set with a generic elimination scheme.
  • results: Proves that two instances, EGE-SR and EGE-SH, have an error probability that decays exponentially fast in the budget, with an exponent supported by an information-theoretic lower bound; an empirical study on real-world and synthetic datasets showcases their good performance.
    Abstract We study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated to an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the \emph{fixed budget} Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the ``hardness to classify'' each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcase the good performance of our algorithms.
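
The core subroutine of Pareto set identification — deciding which empirical mean vectors are non-dominated — can be sketched as follows; the confidence-based hardness estimates and elimination rounds of EGE-SR/EGE-SH are omitted, and the toy means are illustrative.

```python
import numpy as np

def pareto_set(means: np.ndarray) -> list[int]:
    """Indices of non-dominated rows; higher is better in every column."""
    k = means.shape[0]
    optimal = []
    for i in range(k):
        dominated = any(np.all(means[j] >= means[i]) and np.any(means[j] > means[i])
                        for j in range(k) if j != i)
        if not dominated:
            optimal.append(i)
    return optimal

means = np.array([[0.9, 0.2], [0.5, 0.5], [0.4, 0.4], [0.1, 0.8]])
print(pareto_set(means))   # [0, 1, 3]; arm 2 is dominated by arm 1
```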

Cup Curriculum: Curriculum Learning on Model Capacity

  • paper_url: http://arxiv.org/abs/2311.03956
  • repo_url: https://github.com/luca-scharr/cupcurriculum
  • paper_authors: Luca Scharr, Vanessa Toborek
  • for: Improving learning performance on natural language processing tasks.
  • methods: Uses a variation of iterative magnitude pruning to reduce model capacity in a first training phase, then reintroduces the pruned weights in a second phase, so that model capacity follows a cup-shaped curve over the training iterations.
  • results: Reliably outperforms early stopping while exhibiting high resilience to overfitting.
    Abstract Curriculum learning (CL) aims to increase the performance of a learner on a given task by applying a specialized learning strategy. This strategy focuses on either the dataset, the task, or the model. There is little to no work analysing the possibilities to apply CL on the model capacity in natural language processing. To close this gap, we propose the cup curriculum. In a first phase of training we use a variation of iterative magnitude pruning to reduce model capacity. These weights are reintroduced in a second phase, resulting in the model capacity to show a cup-shaped curve over the training iterations. We empirically evaluate different strategies of the cup curriculum and show that it outperforms early stopping reliably while exhibiting a high resilience to overfitting.
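
A minimal sketch of the cup-shaped capacity schedule: magnitude-based masks first shrink and then re-grow the set of active weights across training phases. The prune fractions below are illustrative and the per-phase training steps are elided; the paper's exact pruning variant may differ.

```python
import torch
import torch.nn as nn

def magnitude_mask(weight: torch.Tensor, keep_frac: float) -> torch.Tensor:
    """Keep the top `keep_frac` fraction of weights by magnitude."""
    k = max(1, int(keep_frac * weight.numel()))
    threshold = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= threshold).float()

layer = nn.Linear(128, 128)
schedule = [1.0, 0.7, 0.4, 0.2, 0.4, 0.7, 1.0]   # cup-shaped capacity curve
for keep in schedule:
    mask = magnitude_mask(layer.weight.data, keep)
    masked_weight = layer.weight * mask   # use masked_weight in the forward pass
    # ...one training phase per schedule entry; as `keep` rises again, the
    # previously masked weights are reintroduced into training.
```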

Blind Federated Learning via Over-the-Air q-QAM

  • paper_url: http://arxiv.org/abs/2311.04253
  • repo_url: None
  • paper_authors: Saeed Razavikia, José Mairton Barros Da Silva Júnior, Carlo Fischione
  • for: Studies federated edge learning over a fading multiple access channel.
  • methods: Proposes a digital over-the-air computation strategy based on q-ary quadrature amplitude modulation that reduces the communication burden between edge devices and the access point, without channel state information at the devices.
  • results: Shows that multiple antennas at the edge server effectively mitigate fading, proves a non-asymptotic upper bound on the mean squared error, and demonstrates that more antennas and higher-order modulation improve model accuracy by up to 60%.
    Abstract In this work, we investigate federated edge learning over a fading multiple access channel. To alleviate the communication burden between the edge devices and the access point, we introduce a pioneering digital over-the-air computation strategy employing q-ary quadrature amplitude modulation, culminating in a low latency communication scheme. Indeed, we propose a new federated edge learning framework in which edge devices use digital modulation for over-the-air uplink transmission to the edge server while they have no access to the channel state information. Furthermore, we incorporate multiple antennas at the edge server to overcome the fading inherent in wireless communication. We analyze the number of antennas required to mitigate the fading impact effectively. We prove a non-asymptotic upper bound for the mean squared error for the proposed federated learning with digital over-the-air uplink transmissions under both noisy and fading conditions. Leveraging the derived upper bound, we characterize the convergence rate of the learning process of a non-convex loss function in terms of the mean square error of gradients due to the fading channel. Furthermore, we substantiate the theoretical assurances through numerical experiments concerning mean square error and the convergence efficacy of the digital federated edge learning framework. Notably, the results demonstrate that augmenting the number of antennas at the edge server and adopting higher-order modulations improve the model accuracy up to 60\%.

CNN-Based Structural Damage Detection using Time-Series Sensor Data

  • paper_url: http://arxiv.org/abs/2311.04252
  • repo_url: None
  • paper_authors: Ishan Pathak, Ishan Jha, Aditya Sadana, Basuraj Bhowmik
  • for: Proposes a new structural damage detection method based on a convolutional neural network (CNN).
  • methods: Trains a CNN to extract deep spatial features from time-series sensor data while recognizing long-term temporal connections, combining spatial and temporal features to improve discrimination.
  • results: On a benchmark dataset derived from a three-floor structure at Los Alamos National Laboratory (LANL), the method detects structural damage with high accuracy.
    Abstract Structural Health Monitoring (SHM) is vital for evaluating structural condition, aiming to detect damage through sensor data analysis. It aligns with predictive maintenance in modern industry, minimizing downtime and costs by addressing potential structural issues. Various machine learning techniques have been used to extract valuable information from vibration data, often relying on prior structural knowledge. This research introduces an innovative approach to structural damage detection, utilizing a new Convolutional Neural Network (CNN) algorithm. In order to extract deep spatial features from time series data, CNNs are taught to recognize long-term temporal connections. This methodology combines spatial and temporal features, enhancing discrimination capabilities when compared to methods solely reliant on deep spatial features. Time series data are divided into two categories using the proposed neural network: undamaged and damaged. To validate its efficacy, the method's accuracy was tested using a benchmark dataset derived from a three-floor structure at Los Alamos National Laboratory (LANL). The outcomes show that the new CNN algorithm is very accurate in spotting structural degradation in the examined structure.
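
A minimal sketch of the kind of 1-D CNN the paper describes, mapping windows of multi-sensor time series to an undamaged/damaged label. Channel counts, window length, and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(in_channels=4, out_channels=16, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.MaxPool1d(4),
    nn.Conv1d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),   # pool over time to a fixed-size feature
    nn.Flatten(),
    nn.Linear(32, 2),          # two classes: undamaged vs. damaged
)

window = torch.randn(8, 4, 1024)   # batch of 8 windows, 4 sensors, 1024 samples
logits = model(window)             # shape (8, 2)
```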

Structure of universal formulas

  • paper_url: http://arxiv.org/abs/2311.03910
  • repo_url: https://github.com/smith86n/wiki-is-mostly-fake-radom-words-word-genrationr-
  • paper_authors: Dmitry Yarotsky
  • for: This paper analyzes the essential structural elements of highly expressive models, such as neural networks, and studies their approximation capabilities.
  • methods: The paper uses a hierarchy of expressiveness classes to connect the global approximability property to the weaker property of infinite VC dimension, and proves a series of classification results for several increasingly complex functional families.
  • results: The paper shows that fixed-size neural networks with not more than one layer of neurons having transcendental activations cannot in general approximate functions on arbitrary finite sets, but gives examples of functional families, including two-hidden-layer neural networks, that approximate functions on arbitrary finite sets but fail to do so on the whole domain of definition.
    Abstract By universal formulas we understand parameterized analytic expressions that have a fixed complexity, but nevertheless can approximate any continuous function on a compact set. There exist various examples of such formulas, including some in the form of neural networks. In this paper we analyze the essential structural elements of these highly expressive models. We introduce a hierarchy of expressiveness classes connecting the global approximability property to the weaker property of infinite VC dimension, and prove a series of classification results for several increasingly complex functional families. In particular, we introduce a general family of polynomially-exponentially-algebraic functions that, as we prove, is subject to polynomial constraints. As a consequence, we show that fixed-size neural networks with not more than one layer of neurons having transcendental activations (e.g., sine or standard sigmoid) cannot in general approximate functions on arbitrary finite sets. On the other hand, we give examples of functional families, including two-hidden-layer neural networks, that approximate functions on arbitrary finite sets, but fail to do that on the whole domain of definition.

Learning-Based Latency-Constrained Fronthaul Compression Optimization in C-RAN

  • paper_url: http://arxiv.org/abs/2311.03899
  • repo_url: None
  • paper_authors: Axel Grönland, Bleron Klaiqi, Xavier Gelabert
  • for: Considers cloudified radio access networks (RAN), where flexible functional deployment offers low-cost deployment, higher capacity, and improved hardware utilization, at the cost of stringent fronthaul (FH) capacity and latency requirements.
  • methods: Proposes DRL-FC, a model-free deep reinforcement learning framework that dynamically controls FH compression via configuration parameters such as modulation order, precoder granularity, and precoder weight quantization.
  • results: Simulations show that DRL-FC achieves significantly higher FH utilization (68.7% on average) and air-interface throughput than an uncompressed reference scheme across different FH load levels, while meeting a predefined FH latency constraint of 260 μs.
    Abstract The evolution of wireless mobile networks towards cloudification, where Radio Access Network (RAN) functions can be hosted at either a central or distributed locations, offers many benefits like low cost deployment, higher capacity, and improved hardware utilization. Nevertheless, the flexibility in the functional deployment comes at the cost of stringent fronthaul (FH) capacity and latency requirements. One possible approach to deal with these rigorous constraints is to use FH compression techniques. To ensure that FH capacity and latency requirements are met, more FH compression is applied during high load, while less compression is applied during medium and low load to improve FH utilization and air interface performance. In this paper, a model-free deep reinforcement learning (DRL) based FH compression (DRL-FC) framework is proposed that dynamically controls FH compression through various configuration parameters such as modulation order, precoder granularity, and precoder weight quantization that affect both FH load and air interface performance. Simulation results show that DRL-FC exhibits significantly higher FH utilization (68.7% on average) and air interface throughput than a reference scheme (i.e. with no applied compression) across different FH load levels. At the same time, the proposed DRL-FC framework is able to meet the predefined FH latency constraints (in our case set to 260 $\mu$s) under various FH loads.

An Explainable Framework for Machine learning-Based Reactive Power Optimization of Distribution Network

  • paper_url: http://arxiv.org/abs/2311.03863
  • repo_url: None
  • paper_authors: Wenlong Liao, Benjamin Schäfer, Dalin Qin, Gonghao Zhang, Zhixian Wang, Zhe Yang
  • for: Machine learning models for reactive power optimization of distribution networks are usually treated as black boxes, making it hard for power system operators to identify and comprehend potential biases or errors in their decisions.
  • methods: Proposes an explainable machine learning framework: a Shapley additive explanation framework measures the contribution of each input feature to the reactive power optimization solution, and a model-agnostic approximation method estimates Shapley values to avoid the heavy cost of computing them directly.
  • results: Simulations show that the framework accurately explains the solutions of machine-learning-based reactive power optimization through visual analytics, from both global and instance perspectives; being model-agnostic, it applies to various models such as neural networks.
    Abstract To reduce the heavy computational burden of reactive power optimization of distribution networks, machine learning models are receiving increasing attention. However, most machine learning models (e.g., neural networks) are usually considered as black boxes, making it challenging for power system operators to identify and comprehend potential biases or errors in the decision-making process of machine learning models. To address this issue, an explainable machine-learning framework is proposed to optimize the reactive power in distribution networks. Firstly, a Shapley additive explanation framework is presented to measure the contribution of each input feature to the solution of reactive power optimizations generated from machine learning models. Secondly, a model-agnostic approximation method is developed to estimate Shapley values, so as to avoid the heavy computational burden associated with direct calculations of Shapley values. The simulation results show that the proposed explainable framework can accurately explain the solution of the machine learning model-based reactive power optimization by using visual analytics, from both global and instance perspectives. Moreover, the proposed explainable framework is model-agnostic, and thus applicable to various models (e.g., neural networks).
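
The paper's own approximation method is not detailed in the abstract, so the sketch below uses standard Monte Carlo permutation sampling as one model-agnostic way to estimate Shapley values without the exponential cost of exact computation. The toy model and zero baseline are assumptions.

```python
import numpy as np

def mc_shapley(predict, x, baseline, n_perm=200, rng=np.random.default_rng(0)):
    """phi[j] ~ average marginal contribution of feature j over permutations."""
    d, phi = len(x), np.zeros(len(x))
    for _ in range(n_perm):
        perm = rng.permutation(d)
        z = baseline.copy()
        prev = predict(z)
        for j in perm:
            z[j] = x[j]                # reveal feature j in random order
            cur = predict(z)
            phi[j] += cur - prev
            prev = cur
    return phi / n_perm

predict = lambda v: 2.0 * v[0] + v[1] * v[2]     # toy stand-in for the ML model
phi = mc_shapley(predict, x=np.array([1.0, 2.0, 3.0]), baseline=np.zeros(3))
print(phi, phi.sum())   # contributions sum to predict(x) - predict(baseline)
```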

Improved MDL Estimators Using Fiber Bundle of Local Exponential Families for Non-exponential Families

  • paper_url: http://arxiv.org/abs/2311.03852
  • repo_url: None
  • paper_authors: Kohei Miyamoto, Andrew R. Barron, Jun’ichi Takeuchi
  • for: This paper analyzes Minimum Description Length (MDL) estimators based on two-part codes for universal coding.
  • methods: For general parametric families under regularity conditions, constructs a two-part code using an augmented structure on the target family: a fiber bundle of local exponential families used for data description.
  • results: The code's regret is close to the minimax regret, yielding a tight upper bound on the risk and loss of the MDL estimators via the theory of Barron and Cover (1991); the results also apply to mixture families, a typical example of non-exponential families.
    Abstract Minimum Description Length (MDL) estimators, using two-part codes for universal coding, are analyzed. For general parametric families under certain regularity conditions, we introduce a two-part code whose regret is close to the minimax regret, where regret of a code with respect to a target family M is the difference between the code length of the code and the ideal code length achieved by an element in M. This is a generalization of the result for exponential families by Gr\"unwald. Our code is constructed by using an augmented structure of M with a bundle of local exponential families for data description, which is not needed for exponential families. This result gives a tight upper bound on risk and loss of the MDL estimators based on the theory introduced by Barron and Cover in 1991. Further, we show that we can apply the result to mixture families, which are a typical example of non-exponential families.
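
For reference, the regret notion used in the abstract can be written explicitly: it is the excess of a code's length over the ideal code length achieved by the best element of the target family M.

```latex
% Regret of a code with length function L on data x^n, relative to a family M:
\[
  \mathrm{regret}(L, M; x^n)
    \;=\; L(x^n) \;-\; \min_{\theta \in M} \bigl(-\log p_\theta(x^n)\bigr),
\qquad
  \text{minimax regret} \;=\; \min_{L}\, \max_{x^n}\, \mathrm{regret}(L, M; x^n).
\]
```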

User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates

  • paper_url: http://arxiv.org/abs/2311.03797
  • repo_url: None
  • paper_authors: Hilal Asi, Daogao Liu
  • for: Studies user-level differentially private stochastic convex optimization (DP-SCO), where each user may hold multiple data items.
  • methods: Develops new algorithms based on multiple-pass DP-SGD combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.
  • results: Obtains optimal rates for convex and strongly convex functions in polynomial time while requiring the number of users to grow only logarithmically in the dimension, and gives the first polynomial-time algorithms with optimal rates for non-smooth functions.
    Abstract We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime [Ghazi et al. (2023)] or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions [Bassily et al. (2023)]. We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.
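
A minimal sketch of mean estimation with an outlier removal step, in the spirit of the procedure the abstract describes. The median-based center, radius rule, and noise scale are illustrative assumptions (a real mechanism would also need to find the center privately and calibrate the noise to a privacy budget); this is not the paper's exact procedure or guarantee.

```python
import numpy as np

def private_mean(grads, radius, sigma, rng=np.random.default_rng(0)):
    """grads: (n, d) array of per-user gradient estimates."""
    center = np.median(grads, axis=0)              # rough, robust center (sketch)
    dist = np.linalg.norm(grads - center, axis=1)
    kept = grads[dist <= radius]                   # outlier removal step
    mean = kept.mean(axis=0)
    # Gaussian noise scaled to the per-point sensitivity ~ radius / n_kept
    noise = rng.normal(0.0, sigma * radius / len(kept), size=mean.shape)
    return mean + noise

grads = np.random.default_rng(1).normal(0, 0.1, size=(100, 5))
grads[0] = 50.0                                    # one wild outlier
print(private_mean(grads, radius=1.0, sigma=2.0))  # barely affected by it
```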

Neuro-GPT: Developing A Foundation Model for EEG

  • paper_url: http://arxiv.org/abs/2311.03764
  • repo_url: None
  • paper_authors: Wenhui Cui, Woojae Jeong, Philipp Thölke, Takfarinas Medani, Karim Jerbi, Anand A. Joshi, Richard M. Leahy
  • for: Addressing the challenges of data scarcity and heterogeneity in Brain-Computer Interface (BCI) tasks using Electroencephalography (EEG) data.
  • methods: Using a foundation model consisting of an EEG encoder and a GPT model, pre-trained on a large-scale public EEG dataset with a self-supervised task, and fine-tuning the model on a Motor Imagery Classification task with only 9 subjects.
  • results: Significant improvement in classification performance compared to a model trained from scratch, demonstrating the advanced generalizability of the foundation model and its ability to address data scarcity and heterogeneity.
    Abstract To handle the scarcity and heterogeneity of electroencephalography (EEG) data in Brain-Computer Interface (BCI) tasks, and to harness the vast public data, we propose Neuro-GPT, a foundation model consisting of an EEG encoder and a GPT model. The foundation model is pre-trained on a large-scale public EEG dataset, using a self-supervised task which learns how to reconstruct the masked chunk in EEG. We then fine-tune the foundation model on a Motor Imagery Classification task where only 9 subjects are available. Experiments demonstrated that applying foundation model can significantly improve classification performance compared to the model trained from scratch, which provides evidence for the advanced generalizability of foundation model and the ability to address the challenges of data scarcity and heterogeneity.
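
A minimal sketch of the masked-chunk reconstruction objective used for self-supervised pre-training: zero out one chunk of the chunked EEG sequence and train to reconstruct it. The plain transformer encoder stands in for the paper's EEG encoder + GPT stack; the sizes and single-chunk masking rule are assumptions.

```python
import torch
import torch.nn as nn

chunk_len, n_chunks, channels = 64, 8, 16
eeg = torch.randn(4, n_chunks, chunk_len * channels)     # batch of chunked EEG

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=chunk_len * channels, nhead=8,
                               batch_first=True),
    num_layers=2,
)

masked = eeg.clone()
target_idx = n_chunks - 1
masked[:, target_idx] = 0.0                              # mask one chunk
pred = encoder(masked)[:, target_idx]                    # predict its content
loss = nn.functional.mse_loss(pred, eeg[:, target_idx])  # reconstruction loss
loss.backward()
```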

Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds

  • paper_url: http://arxiv.org/abs/2311.03760
  • repo_url: None
  • paper_authors: Shion Takeno, Yu Inatsu, Masayuki Karasuyama, Ichiro Takeuchi
  • for: Proposes a new acquisition function, PIMS, to address the manual hyperparameter tuning of GP-UCB and the over-exploration of Thompson sampling (TS) in Bayesian optimization.
  • methods: Shows that TS, like a randomized variant of GP-UCB, achieves the tighter Bayesian cumulative regret (BCR) bound, and introduces PIMS, the probability of improvement from the maximum of a sample path, analyzed both theoretically and empirically.
  • results: PIMS achieves the tighter BCR bound, avoids hyperparameter tuning, and, across a wide range of experiments, mitigates the practical issues of GP-UCB and TS.
    Abstract Among various acquisition functions (AFs) in Bayesian optimization (BO), Gaussian process upper confidence bound (GP-UCB) and Thompson sampling (TS) are well-known options with established theoretical properties regarding Bayesian cumulative regret (BCR). Recently, it has been shown that a randomized variant of GP-UCB achieves a tighter BCR bound compared with GP-UCB, which we call the tighter BCR bound for brevity. Inspired by this study, this paper first shows that TS achieves the tighter BCR bound. On the other hand, GP-UCB and TS often practically suffer from manual hyperparameter tuning and over-exploration issues, respectively. To overcome these difficulties, we propose yet another AF called a probability of improvement from the maximum of a sample path (PIMS). We show that PIMS achieves the tighter BCR bound and avoids the hyperparameter tuning, unlike GP-UCB. Furthermore, we demonstrate a wide range of experiments, focusing on the effectiveness of PIMS that mitigates the practical issues of GP-UCB and TS.
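
Reading PIMS literally from its name, one plausible sketch on a fixed candidate grid: draw a posterior sample path, take its maximum as the improvement threshold, and score each candidate by its probability of improving on that threshold. The GP posterior (mu, cov) is assumed given; this may differ in detail from the paper's definition.

```python
import numpy as np
from scipy.stats import norm

def pims(mu, cov, rng=np.random.default_rng(0)):
    """mu: (m,) posterior mean; cov: (m, m) posterior covariance on the grid."""
    path = rng.multivariate_normal(mu, cov)   # one posterior sample path
    y_star = path.max()                       # maximum of the sample path
    sigma = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    return norm.cdf((mu - y_star) / sigma)    # probability of improvement

mu = np.array([0.0, 0.4, 0.2])
cov = 0.05 * np.eye(3) + 0.01                 # toy posterior on 3 candidates
next_point = int(np.argmax(pims(mu, cov)))    # candidate to evaluate next
```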

Manifold learning: what, how, and why

  • paper_url: http://arxiv.org/abs/2311.03757
  • repo_url: None
  • paper_authors: Marina Meilă, Hanyu Zhang
  • for: This survey presents the principles, representative methods, and statistical foundations of manifold learning (ML), also known as non-linear dimension reduction.
  • methods: Covers the main ML methods, including ISOMAP, LLE, and t-SNE, together with their statistical foundations, from a practicing statistician's perspective.
  • results: Describes the trade-offs involved and what theory says about parameter and algorithmic choices, helping readers visualize, de-noise, and interpret the geometric structure of high-dimensional point clouds.
    Abstract Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret them. This survey presents the principles underlying ML, the representative methods, as well as their statistical foundations from a practicing statistician's perspective. It describes the trade-offs, and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.

Enhanced physics-informed neural networks with domain scaling and residual correction methods for multi-frequency elliptic problems

  • paper_url: http://arxiv.org/abs/2311.03746
  • repo_url: None
  • paper_authors: Deok-Kyu Jang, Hyea Hyun Kim, Kyungsoo Kim
  • for: Develops neural network approximation methods for elliptic partial differential equations with multi-frequency solutions; such methods apply without strong assumptions on the form of the equation or the shape and dimension of the problem domain.
  • methods: Since the performance of neural network approximation is strongly affected by the contrast between the high- and low-frequency parts of the solution, proposes domain scaling and residual correction methods.
  • results: Experiments demonstrate the efficiency and accuracy of the proposed methods on multi-frequency model problems.
    Abstract In this paper, neural network approximation methods are developed for elliptic partial differential equations with multi-frequency solutions. Neural network work approximation methods have advantages over classical approaches in that they can be applied without much concerns on the form of the differential equations or the shape or dimension of the problem domain. When applied to problems with multi-frequency solutions, the performance and accuracy of neural network approximation methods are strongly affected by the contrast of the high- and low-frequency parts in the solutions. To address this issue, domain scaling and residual correction methods are proposed. The efficiency and accuracy of the proposed methods are demonstrated for multi-frequency model problems.

Improved weight initialization for deep and narrow feedforward neural network

  • paper_url: http://arxiv.org/abs/2311.03733
  • repo_url: None
  • paper_authors: Hyunwoo Lee, Yunho Kim, Seungyeop Yang, Hayoung Choi
  • for: Improving the training of deep and narrow feedforward neural networks with ReLU activations.
  • methods: Proposes a new weight initialization method, proves properties of the proposed initial weight matrix, and shows how these properties facilitate signal propagation and mitigate the dying ReLU problem.
  • results: A series of experiments and comparisons with existing methods demonstrates the effectiveness of the new initialization.
    Abstract Appropriate weight initialization settings, along with the ReLU activation function, have been a cornerstone of modern deep learning, making it possible to train and deploy highly effective and efficient neural network models across diverse artificial intelligence. The problem of dying ReLU, where ReLU neurons become inactive and yield zero output, presents a significant challenge in the training of deep neural networks with ReLU activation function. Theoretical research and various methods have been introduced to address the problem. However, even with these methods and research, training remains challenging for extremely deep and narrow feedforward networks with ReLU activation function. In this paper, we propose a new weight initialization method to address this issue. We prove the properties of the proposed initial weight matrix and demonstrate how these properties facilitate the effective propagation of signal vectors. Through a series of experiments and comparisons with existing methods, we demonstrate the effectiveness of the new initialization method.

Pipeline Parallelism for DNN Inference with Practical Performance Guarantees

  • paper_url: http://arxiv.org/abs/2311.03703
  • repo_url: None
  • paper_authors: Aaron Archer, Matthew Fahrbach, Kuikui Liu, Prakash Prabhu
  • for: Optimizes pipeline parallelism for deep neural network (DNN) inference by partitioning the model graph into k stages and minimizing the running time of the bottleneck stage, including communication.
  • methods: Designs practical algorithms for this NP-hard problem and shows they are nearly optimal in practice by comparing against strong lower bounds obtained via novel mixed-integer programming (MIP) formulations.
  • results: On production models with k=16 pipeline stages, evaluated via geometric means, the MIP formulations more than double the standard combinatorial lower bounds, improving the approximation ratio from 2.175 to 1.058; much of the remaining challenge lies in developing more accurate cost models to feed into the partitioning algorithms.
    Abstract We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We design practical algorithms for this NP-hard problem and show that they are nearly optimal in practice by comparing against strong lower bounds obtained via novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound methods to production models to achieve substantially improved approximation guarantees compared to standard combinatorial lower bounds. For example, evaluated via geometric means across production data with $k=16$ pipeline stages, our MIP formulations more than double the lower bounds, improving the approximation ratio from $2.175$ to $1.058$. This work shows that while max-throughput partitioning is theoretically hard, we have a handle on the algorithmic side of the problem in practice and much of the remaining challenge is in developing more accurate cost models to feed into the partitioning algorithms.
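
For the special case where the model is a linear chain of layers with known per-layer costs (and communication ignored), minimizing the bottleneck stage reduces to a classic partition problem; a sketch with bisection plus a greedy feasibility check follows. The paper's general model-graph, communication-aware setting is harder than this.

```python
def min_bottleneck(costs: list[float], k: int) -> float:
    """Min possible bottleneck when cutting the chain into k contiguous stages."""
    def feasible(limit: float) -> bool:
        stages, load = 1, 0.0
        for c in costs:
            if c > limit:
                return False
            if load + c > limit:
                stages, load = stages + 1, c   # start a new stage
            else:
                load += c
        return stages <= k

    lo, hi = max(costs), sum(costs)
    for _ in range(50):                        # bisection to fine precision
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if feasible(mid) else (mid, hi)
    return hi

print(min_bottleneck([3, 1, 4, 1, 5, 9, 2, 6], k=3))  # ~14: [3,1,4,1,5][9,2][6]
```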

Dynamic Non-monotone Submodular Maximization

  • paper_url: http://arxiv.org/abs/2311.03685
  • repo_url: None
  • paper_authors: Kiarash Banihashem, Leyla Biabani, Samira Goudarzi, MohammadTaghi Hajiaghayi, Peyman Jabbarzade, Morteza Monemizadeh
  • for: Studies dynamic algorithms for non-monotone submodular maximization under a cardinality constraint k, answering an open question of Chen and Peng (2022).
  • methods: Gives a reduction from maximizing a non-monotone submodular function under a cardinality constraint to maximizing a monotone submodular function under the same constraint.
  • results: Obtains the first dynamic algorithms for this problem, maintaining an (8+ε)-approximate solution with expected amortized O(ε⁻³k³log³(n)log(k)) or O(ε⁻¹k²log³(k)) oracle queries per update, and demonstrates the benefits on video summarization and max-cut problems over several real-world data sets.
    Abstract Maximizing submodular functions has been increasingly used in many applications of machine learning, such as data summarization, recommendation systems, and feature selection. Moreover, there has been a growing interest in both submodular maximization and dynamic algorithms. In 2020, Monemizadeh and Lattanzi, Mitrovic, Norouzi{-}Fard, Tarnawski, and Zadimoghaddam initiated developing dynamic algorithms for the monotone submodular maximization problem under the cardinality constraint $k$. Recently, there have been some improvements on the topic made by Banihashem, Biabani, Goudarzi, Hajiaghayi, Jabbarzade, and Monemizadeh. In 2022, Chen and Peng studied the complexity of this problem and raised an important open question: "Can we extend [fully dynamic] results (algorithm or hardness) to non-monotone submodular maximization?". We affirmatively answer their question by demonstrating a reduction from maximizing a non-monotone submodular function under the cardinality constraint $k$ to maximizing a monotone submodular function under the same constraint. Through this reduction, we obtain the first dynamic algorithms to solve the non-monotone submodular maximization problem under the cardinality constraint $k$. Our algorithms maintain an $(8+\epsilon)$-approximate of the solution and use expected amortized $O(\epsilon^{-3}k^3\log^3(n)\log(k))$ or $O(\epsilon^{-1}k^2\log^3(k))$ oracle queries per update, respectively. Furthermore, we showcase the benefits of our dynamic algorithm for video summarization and max-cut problems on several real-world data sets.

Preventing Arbitrarily High Confidence on Far-Away Data in Point-Estimated Discriminative Neural Networks

  • paper_url: http://arxiv.org/abs/2311.03683
  • repo_url: None
  • paper_authors: Ahmad Rashid, Serena Hacker, Guojun Zhang, Agustinus Kristiadi, Pascal Poupart
  • for: Preventing arbitrarily high confidence on out-of-distribution (OOD) data in discriminatively trained, point-estimated neural networks.
  • methods: Adds a term to the network output corresponding to the logit of an extra class, designed to dominate the logits of the original classes as inputs move away from the training data.
  • results: Performs strongly against competitive baselines on several benchmarks, on both far-away and realistic OOD data.
    Abstract Discriminatively trained, deterministic neural networks are the de facto choice for classification problems. However, even though they achieve state-of-the-art results on in-domain test sets, they tend to be overconfident on out-of-distribution (OOD) data. For instance, ReLU networks -- a popular class of neural network architectures -- have been shown to almost always yield high confidence predictions when the test data are far away from the training set, even when they are trained with OOD data. We overcome this problem by adding a term to the output of the neural network that corresponds to the logit of an extra class, that we design to dominate the logits of the original classes as we move away from the training data.This technique provably prevents arbitrarily high confidence on far-away test data while maintaining a simple discriminative point-estimate training. Evaluation on various benchmarks demonstrates strong performance against competitive baselines on both far-away and realistic OOD data.
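
One way to realize an extra-class logit that dominates far from the training data is to let it grow with the feature-space distance to the training distribution. The quadratic distance form, the stored feature mean `mu`, and `alpha` below are illustrative assumptions, not the paper's construction.

```python
import torch
import torch.nn as nn

class ExtraClassHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int,
                 train_feat_mean: torch.Tensor, alpha: float = 0.1):
        super().__init__()
        self.backbone, self.fc = backbone, nn.Linear(feat_dim, n_classes)
        self.register_buffer("mu", train_feat_mean)  # training feature center
        self.alpha = alpha

    def forward(self, x):
        f = self.backbone(x)
        # extra logit grows quadratically with distance from the training center
        extra = self.alpha * ((f - self.mu) ** 2).sum(-1, keepdim=True)
        return torch.cat([self.fc(f), extra], dim=-1)   # K + 1 logits

backbone = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
model = ExtraClassHead(backbone, 32, n_classes=3, train_feat_mean=torch.zeros(32))
probs = model(torch.randn(2, 10)).softmax(-1)   # last column: "far away" mass
```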

Graph Neural Networks for Power Grid Operational Risk Assessment

  • paper_url: http://arxiv.org/abs/2311.03661
  • repo_url: None
  • paper_authors: Yadong Zhang, Pranav M Karve, Sankaran Mahadevan
  • for: Investigates the utility of graph neural network (GNN) surrogates for Monte Carlo (MC) sampling-based risk quantification in daily power grid operations.
  • methods: Trains GNN surrogates of the optimal power flow (OPF) problem using supervised learning, then uses them to obtain MC samples of the quantities of interest (operating reserve, transmission line flow) given the hours-ahead probabilistic wind generation and load forecast.
  • results: The GNN surrogates are sufficiently accurate for predicting the bus-level, branch-level, and system-level grid state, enabling fast as well as accurate operational risk quantification for power grids.
    Abstract In this article, the utility of graph neural network (GNN) surrogates for Monte Carlo (MC) sampling-based risk quantification in daily operations of power grid is investigated. The MC simulation process necessitates solving a large number of optimal power flow (OPF) problems corresponding to the sample values of stochastic grid variables (power demand and renewable generation), which is computationally prohibitive. Computationally inexpensive surrogates of the OPF problem provide an attractive alternative for expedited MC simulation. GNN surrogates are especially suitable due to their superior ability to handle graph-structured data. Therefore, GNN surrogates of OPF problem are trained using supervised learning. They are then used to obtain Monte Carlo (MC) samples of the quantities of interest (operating reserve, transmission line flow) given the (hours-ahead) probabilistic wind generation and load forecast. The utility of GNN surrogates is evaluated by comparing OPF-based and GNN-based grid reliability and risk for IEEE Case118 synthetic grid. It is shown that the GNN surrogates are sufficiently accurate for predicting the (bus-level, branch-level and system-level) grid state and enable fast as well as accurate operational risk quantification for power grids. The article thus develops various tools for fast reliability and risk quantification for real-world power grids using GNNs.

  • paper_url: http://arxiv.org/abs/2311.03639
  • repo_url: None
  • paper_authors: Amirhossein Mollaali, Izzet Sahin, Iqrar Raza, Christian Moya, Guillermo Paniagua, Guang Lin
  • for: This paper aims to develop a deep operator learning-based framework for achieving high-fidelity results in computational simulations with limited computational resources.
  • methods: The proposed framework uses a physics-guided, bi-fidelity, Fourier-featured Deep Operator Network (DeepONet): an extensive low-fidelity dataset is used for foundational learning, and a small high-fidelity dataset then trains a residual network that refines the initial low-fidelity output.
  • results: Validated on the well-known 2D benchmark cylinder problem of predicting lift and drag coefficient time trajectories, the physics-guided Fourier-featured deep operator network shows superior predictive capability compared to data-driven counterparts.
    Abstract In the pursuit of accurate experimental and computational data while minimizing effort, there is a constant need for high-fidelity results. However, achieving such results often requires significant computational resources. To address this challenge, this paper proposes a deep operator learning-based framework that requires a limited high-fidelity dataset for training. We introduce a novel physics-guided, bi-fidelity, Fourier-featured Deep Operator Network (DeepONet) framework that effectively combines low and high-fidelity datasets, leveraging the strengths of each. In our methodology, we began by designing a physics-guided Fourier-featured DeepONet, drawing inspiration from the intrinsic physical behavior of the target solution. Subsequently, we train this network to primarily learn the low-fidelity solution, utilizing an extensive dataset. This process ensures a comprehensive grasp of the foundational solution patterns. Following this foundational learning, the low-fidelity deep operator network's output is enhanced using a physics-guided Fourier-featured residual deep operator network. This network refines the initial low-fidelity output, achieving the high-fidelity solution by employing a small high-fidelity dataset for training. Notably, in our framework, we employ the Fourier feature network as the Trunk network for the DeepONets, given its proficiency in capturing and learning the oscillatory nature of the target solution with high precision. We validate our approach using a well-known 2D benchmark cylinder problem, which aims to predict the time trajectories of lift and drag coefficients. The results highlight that the physics-guided Fourier-featured deep operator network, serving as a foundational building block of our framework, possesses superior predictive capability for the lift and drag coefficients compared to its data-driven counterparts.
    摘要 在追求精准的实验与计算数据、同时尽量减少计算开销的过程中,人们始终需要高保真度的结果,而获得这类结果通常需要大量计算资源。为解决这一挑战,本文提出了一个只需少量高保真数据进行训练的深度算子学习框架。我们提出了一种新的物理引导、双保真、傅里叶特征深度算子网络(DeepONet)框架,能够有效结合低保真与高保真数据集,发挥二者各自的优势。在我们的方法中,我们首先设计了物理引导的傅里叶特征 DeepONet,其灵感来自目标解的内在物理行为;随后利用大规模数据集训练该网络,使其主要学习低保真解,从而全面掌握基础的解模式。在此基础学习之后,再用一个物理引导的傅里叶特征残差深度算子网络、借助少量高保真数据训练,对低保真输出进行精化,得到高保真解。值得注意的是,我们采用傅里叶特征网络作为 DeepONet 的主干(Trunk)网络,因为它能高精度地捕捉并学习目标解的振荡特性。我们在知名的二维标准圆柱绕流问题上验证了该方法,该问题旨在预测升力与阻力系数的时间轨迹。结果表明,作为本框架基本构建模块的物理引导傅里叶特征深度算子网络,在升力与阻力系数的预测上优于其纯数据驱动的对应方法。
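To make the architecture concrete, here is a minimal PyTorch sketch of a DeepONet whose trunk uses random Fourier features, in the spirit of the paper's Fourier-featured trunk; the layer sizes, feature scale, and single-output setup are guesses for illustration. The bi-fidelity residual refinement would add a second network of the same shape trained on (high-fidelity minus low-fidelity) targets.

```python
import torch
import torch.nn as nn

class FourierTrunk(nn.Module):
    """Trunk net with random Fourier features of the query coordinate."""
    def __init__(self, in_dim=1, n_feats=64, width=128, out_dim=64, scale=5.0):
        super().__init__()
        self.register_buffer("B", scale * torch.randn(in_dim, n_feats))
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_feats, width), nn.Tanh(),
            nn.Linear(width, out_dim))
    def forward(self, t):                        # t: (n_points, in_dim)
        proj = 2 * torch.pi * t @ self.B
        return self.mlp(torch.cat([proj.sin(), proj.cos()], dim=-1))

class DeepONet(nn.Module):
    def __init__(self, branch_in=100, latent=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(branch_in, 128), nn.Tanh(),
                                    nn.Linear(128, latent))
        self.trunk = FourierTrunk(out_dim=latent)
    def forward(self, u, t):                     # u: (batch, branch_in), t: (n_points, 1)
        # Output G(u)(t) as an inner product of branch and trunk embeddings.
        return self.branch(u) @ self.trunk(t).T  # (batch, n_points)
```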

Counterfactual Data Augmentation with Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.03630
  • repo_url: None
  • paper_authors: Ahmed Aloui, Juncheng Dong, Cat P. Le, Vahid Tarokh
  • for: 这篇论文旨在解决估计条件平均处理效应(CATE)时不同治疗组之间统计差异的问题。
  • methods: 本文引入了一种与模型无关的数据扩充方法:通过对比学习学到表示空间与相似性度量,从而为选定的个体可靠地插补其在替代治疗组下的反事实结果。
  • results: 理论分析和实验研究表明,该方法能显著提升多种最先进模型的性能与抗过拟合能力。
    Abstract Statistical disparity between distinct treatment groups is one of the most significant challenges for estimating Conditional Average Treatment Effects (CATE). To address this, we introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals. Specifically, we utilize contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes. This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group. By augmenting the original dataset with these reliable imputations, we can effectively reduce the discrepancy between different treatment groups, while inducing minimal imputation error. The augmented dataset is subsequently employed to train CATE estimation models. Theoretical analysis and experimental studies on synthetic and semi-synthetic benchmarks demonstrate that our method achieves significant improvements in both performance and robustness to overfitting across state-of-the-art models.
    摘要 不同治疗组之间的统计差异是估计条件平均处理效应(CATE)时最大的挑战之一。为解决这一问题,我们引入了一种与模型无关的数据扩充方法,为选定的一部分个体插补反事实结果。具体而言,我们利用对比学习学到一个表示空间和相似性度量,使得在该表示空间中由该度量判定为相近的个体具有相似的潜在结果。这一性质保证了对那些在替代治疗组中存在近邻的个体,可以可靠地插补其反事实结果。通过用这些可靠的插补扩充原始数据集,我们能够有效缩小不同治疗组之间的差异,同时只引入极小的插补误差。扩充后的数据集随后被用于训练 CATE 估计模型。理论分析以及在合成与半合成基准上的实验研究表明,我们的方法在多个最先进模型上均显著提升了性能与抗过拟合能力。
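The imputation step is straightforward once a representation is available. A minimal NumPy sketch, assuming `z` already comes from a trained contrastive encoder (the training itself is omitted), imputes each individual's counterfactual from close neighbours in the opposite treatment group; the k-NN rule and `max_dist` gate are illustrative stand-ins for the paper's reliability criterion.

```python
import numpy as np

def impute_counterfactuals(z, t, y, k=1, max_dist=None):
    """Impute each individual's counterfactual outcome as the mean outcome of
    its k nearest neighbours (in representation space z) from the opposite
    treatment group; z is assumed to come from a contrastive encoder trained
    so that nearby points have similar potential outcomes."""
    z, t, y = np.asarray(z), np.asarray(t).astype(bool), np.asarray(y)
    y_cf = np.full(len(y), np.nan)
    for i in range(len(y)):
        other = np.flatnonzero(t != t[i])             # alternative treatment group
        if other.size == 0:
            continue
        d = np.linalg.norm(z[other] - z[i], axis=1)
        if max_dist is not None and d.min() > max_dist:
            continue                                  # no reliable close neighbour
        y_cf[i] = y[other[np.argsort(d)[:k]]].mean()
    return y_cf                                       # NaN where imputation was skipped

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 2)); t = rng.integers(0, 2, size=8); y = rng.normal(size=8)
print(impute_counterfactuals(z, t, y, k=1))
```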

Are Words Enough? On the semantic conditioning of affective music generation

  • paper_url: http://arxiv.org/abs/2311.03624
  • repo_url: None
  • paper_authors: Jorge Forero, Gilberto Bernardes, Mónica Mendes
  • for: 本文旨在探讨自动生成音乐的可能性,特别是基于情感的音乐生成。
  • methods: 本文回顾了两种主要的自动音乐生成方法:规则化模型和机器学习模型。特别是深度学习架构,它们可以从文本描述生成高质量的音乐。
  • results: 研究表明,深度学习与自然语言处理的结合可以为创意产业提供强大的工具,用于提示和生成新的音乐作品。
    Abstract Music has been commonly recognized as a means of expressing emotions. In this sense, an intense debate emerges from the need to verbalize musical emotions. This concern seems highly relevant today, considering the exponential growth of natural language processing using deep learning models where it is possible to prompt semantic propositions to generate music automatically. This scoping review aims to analyze and discuss the possibilities of music generation conditioned by emotions. To address this topic, we propose a historical perspective that encompasses the different disciplines and methods contributing to this topic. In detail, we review two main paradigms adopted in automatic music generation: rules-based and machine-learning models. Of note are the deep learning architectures that aim to generate high-fidelity music from textual descriptions. These models raise fundamental questions about the expressivity of music, including whether emotions can be represented with words or expressed through them. We conclude that, by overcoming the limitations and ambiguity of language in expressing emotions through music, the use of deep learning with natural language has the potential to impact the creative industries by providing powerful tools to prompt and generate new musical works.
    摘要 音乐被广泛认为是表达情感的一种方式。由此引发了一场关于如何用语言描述音乐情感的激烈辩论。考虑到基于深度学习的自然语言处理的迅猛发展,如今已可以通过语义命题提示来自动生成音乐,这一问题显得尤为重要。本综述旨在分析和讨论以情感为条件的音乐生成的可能性。为此,我们从历史视角出发,梳理了对这一话题有所贡献的不同学科与方法。具体而言,我们回顾了自动音乐生成的两大范式:基于规则的模型和机器学习模型,尤其是能够从文本描述生成高保真音乐的深度学习架构。这些模型引出了关于音乐表现力的根本问题,包括情感能否用语言表示或借助语言表达出来。我们的结论是:若能克服语言在表达音乐情感上的局限与歧义,深度学习与自然语言的结合有潜力为创意产业提供强大的工具,用于提示并生成新的音乐作品。

Exploring Latent Spaces of Tonal Music using Variational Autoencoders

  • paper_url: http://arxiv.org/abs/2311.03621
  • repo_url: https://github.com/nadiacarvalho/latent-tonal-music
  • paper_authors: Nádia Carvalho, Gilberto Bernardes
  • for: 这篇论文旨在评估变分自编码器(VAE)在生成具有认知与语义价值的潜在表示方面的效果。
  • methods: 论文比较了多种 VAE 语料编码方式,包括 Piano roll、MIDI、ABC、Tonnetz、音高 DFT 以及音级分布,以生成不同的潜在空间。
  • results: 研究发现,ABC 编码在重建原始数据方面表现最佳,而音高 DFT 编码能从潜在空间中提取更多信息。此外,通过对每首乐曲的 12 个大调或小调移调进行对齐评估,研究发现音高 DFT VAE 的潜在空间与认知空间对齐得最好,并提供了一个共同音空间,其中各调的成分之间呈现出明确的结构次序。
    Abstract Variational Autoencoders (VAEs) have proven to be effective models for producing latent representations of cognitive and semantic value. We assess the degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach's chorales define latent spaces representative of the circle of fifths and the hierarchical relation of each key component pitch as drawn in music cognition. In detail, we compare the latent space of different VAE corpus encodings -- Piano roll, MIDI, ABC, Tonnetz, DFT of pitch, and pitch class distributions -- in providing a pitch space for key relations that align with cognitive distances. We evaluate the model performance of these encodings using objective metrics to capture accuracy, mean square error (MSE), KL-divergence, and computational cost. The ABC encoding performs the best in reconstructing the original data, while the Pitch DFT seems to capture more information from the latent space. Furthermore, an objective evaluation of 12 major or minor transpositions per piece is adopted to quantify the alignment of 1) intra- and inter-segment distances per key and 2) the key distances to cognitive pitch spaces. Our results show that Pitch DFT VAE latent spaces align best with cognitive spaces and provide a common-tone space where overlapping objects within a key are fuzzy clusters, which impose a well-defined order of structural significance or stability -- i.e., a tonal hierarchy. Tonal hierarchies of different keys can be used to measure key distances and the relationships of their in-key components at multiple hierarchies (e.g., notes and chords). The implementation of our VAE and the encodings framework are made available online.
    摘要 变分自编码器(VAE)已被证明能够生成具有认知与语义价值的潜在表示。我们评估了在 371 首巴赫众赞歌这一典型调性音乐语料上训练的 VAE,其潜在空间是否体现了音乐认知中的五度圈以及各调内音级的层级关系。具体而言,我们比较了不同语料编码(Piano roll、MIDI、ABC、Tonnetz、音高 DFT、音级分布)所得到的潜在空间在刻画与认知距离一致的调关系方面的表现。我们使用准确率、均方误差(MSE)、KL 散度和计算成本等客观指标评估这些编码。ABC 编码在重建原始数据方面表现最好,而音高 DFT 编码似乎从潜在空间中捕捉到更多信息。此外,我们对每首乐曲的 12 个大调或小调移调进行客观评估,以量化 1)各调的段内与段间距离,以及 2)各调到认知音高空间距离的一致性。结果表明,音高 DFT VAE 的潜在空间与认知空间对齐得最好,并提供了一个共同音空间:同一调内相互重叠的对象构成模糊聚类,呈现出明确的结构重要性或稳定性次序,即调性层级。不同调的调性层级可用于度量调间距离及其调内成分在多个层级(如音符与和弦)上的关系。我们的 VAE 与编码框架的实现已在线发布。
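As a concrete starting point, the sketch below is a minimal PyTorch VAE over 12-dimensional pitch-class distributions, one of the simpler encodings the paper compares; the layer sizes and 2-D latent are arbitrary choices for illustration, and the corpus-specific preprocessing is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PitchClassVAE(nn.Module):
    """Minimal VAE over 12-dimensional pitch-class distributions."""
    def __init__(self, in_dim=12, hidden=64, z_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))
    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction (MSE) plus KL divergence to a standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return F.mse_loss(recon, x, reduction="sum") + kl
```

Once trained, key distances can be probed by encoding transposed pieces and measuring distances between their latent codes.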

eess.IV - 2023-11-07

Improved Topological Preservation in 3D Axon Segmentation and Centerline Detection using Geometric Assessment-driven Topological Smoothing (GATS)

  • paper_url: http://arxiv.org/abs/2311.04116
  • repo_url: None
  • paper_authors: Nina I. Shamsi, Alex S. Xu, Lars A. Gjesteby, Laura J. Brattain
  • for: 这项研究旨在提高自动轴突追踪的效率和准确性,特别是在与自动标注工具配合使用时。
  • methods: 该研究针对三维脑影像的分割与中心线检测,提出一种利用管状结构半径的几何评估进行自动形态平滑的方法,并用平均池化代替细化算法以避免过度细化,由此构造名为 GATS 的损失函数。
  • results: 该方法在多个数据集上将分割与中心线检测指标提升了 2%-5%,将 Betti 误差率改善了 9%,并在自动标注三维轴突体数据时更好地保持了拓扑连通性。
    Abstract Automated axon tracing via fully supervised learning requires large amounts of 3D brain imagery, which is time consuming and laborious to obtain. It also requires expertise. Thus, there is a need for more efficient segmentation and centerline detection techniques to use in conjunction with automated annotation tools. Topology-preserving methods ensure that segmented components maintain geometric connectivity, which is especially meaningful for applications where volumetric data is used, and these methods often make use of morphological thinning algorithms as the thinned outputs can be useful for both segmentation and centerline detection of curvilinear structures. Current morphological thinning approaches used in conjunction with topology-preserving methods are prone to over-thinning and require manual configuration of hyperparameters. We propose an automated approach for morphological smoothing using geometric assessment of the radius of tubular structures in brain microscopy volumes, and apply average pooling to prevent over-thinning. We use this approach to formulate a loss function, which we call Geo-metric Assessment-driven Topological Smoothing loss, or GATS. Our approach increased segmentation and center-line detection evaluation metrics by 2%-5% across multiple datasets, and improved the Betti error rates by 9%. Our ablation study showed that geometric assessment of tubular structures achieved higher segmentation and centerline detection scores, and using average pooling for morphological smoothing in place of thinning algorithms reduced the Betti errors. We observed increased topological preservation during automated annotation of 3D axons volumes from models trained with GATS.
    摘要 基于全监督学习的自动轴突追踪需要大量三维脑成像数据,获取这些数据耗时费力,且需要专业知识。因此,需要更高效的分割与中心线检测技术,以便与自动标注工具配合使用。保持拓扑的方法可以确保分割出的组件维持几何连通性,这对使用体数据的应用尤其有意义;此类方法通常借助形态学细化算法,因为细化输出既可用于曲线结构的分割,也可用于其中心线检测。然而,现有与保拓扑方法配合使用的形态学细化方法容易过度细化,且需要手动配置超参数。我们提出一种自动化的形态平滑方法:对脑显微体数据中管状结构的半径进行几何评估,并用平均池化来防止过度细化。基于这一方法,我们构造了一个损失函数,称为几何评估驱动的拓扑平滑损失(GATS)。我们的方法在多个数据集上将分割与中心线检测的评价指标提升了 2%-5%,并将 Betti 误差率改善了 9%。消融实验表明,对管状结构进行几何评估可获得更高的分割与中心线检测得分,而以平均池化替代细化算法进行形态平滑则降低了 Betti 误差。我们还观察到,使用 GATS 训练的模型在自动标注三维轴突体数据时具有更好的拓扑保持能力。
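The radius-aware smoothing idea can be mocked up in a few lines. The sketch below is a loose interpretation, assuming the tube radius is estimated with a Euclidean distance transform and that average pooling (rather than thinning) acts as the smoothing operator inside an MSE loss; the paper's actual geometric assessment rule and loss composition are not reproduced here.

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def smoothing_loss(pred, target, base_kernel=3):
    """Loose GATS-style sketch: pick an average-pooling kernel from the mean
    tube radius of the target mask (via a distance transform), then compare
    pooled prediction and target."""
    radius = distance_transform_edt(target.squeeze().numpy())
    r_bar = max(1, int(radius[radius > 0].mean())) if (radius > 0).any() else 1
    k = min(base_kernel * r_bar, 7) | 1            # odd kernel size, capped at 7
    pool = lambda v: F.avg_pool3d(v, kernel_size=k, stride=1, padding=k // 2)
    return F.mse_loss(pool(pred), pool(target.float()))

# Toy usage on a 1-voxel-thick synthetic "axon".
target = torch.zeros(1, 1, 16, 16, 16); target[..., 8, 8, :] = 1
pred = target + 0.1 * torch.randn_like(target)
print(smoothing_loss(pred, target).item())
```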

Toward ground-truth optical coherence tomography via three-dimensional unsupervised deep learning processing and data

  • paper_url: http://arxiv.org/abs/2311.03887
  • repo_url: None
  • paper_authors: Renxiong Wu, Fei Zheng, Meixuan Li, Shaoyan Huang, Xin Ge, Linbo Liu, Yong Liu, Guangming Ni
  • for: 高解像三维成像,增强生物医学应用
  • methods: 无监督3D卷积神经网络处理,利用OCT三维成像特征分离噪声
  • results: 实现高质量、无散斑的 3D 成像,优于现有最先进方法
    Abstract Optical coherence tomography (OCT) can perform non-invasive high-resolution three-dimensional (3D) imaging and has been widely used in biomedical fields, while it is inevitably affected by coherence speckle noise which degrades OCT imaging performance and restricts its applications. Here we present a novel speckle-free OCT imaging strategy, named toward-ground-truth OCT (tGT-OCT), that utilizes unsupervised 3D deep-learning processing and leverages OCT 3D imaging features to achieve speckle-free OCT imaging. Specifically, our proposed tGT-OCT utilizes an unsupervised 3D-convolution deep-learning network trained using random 3D volumetric data to distinguish and separate speckle from real structures in 3D imaging volumetric space; moreover, tGT-OCT effectively further reduces speckle noise and reveals structures that would otherwise be obscured by speckle noise while preserving spatial resolution. Results derived from different samples demonstrated the high-quality speckle-free 3D imaging performance of tGT-OCT and its advancement beyond the previous state-of-the-art.

A Fast Algorithm for Low Rank + Sparse column-wise Compressive Sensing

  • paper_url: http://arxiv.org/abs/2311.03824
  • repo_url: None
  • paper_authors: Silpa Babu, Namrata Vaswani
  • for: 本文研究了一个低秩+稀疏(LR+S)列式压缩感知问题,目的是从每列 $m$ 个独立的线性投影中恢复一个 $n\times q$ 矩阵 $\mathbf{X}^* = [\mathbf{x}_1^*, \mathbf{x}_2^*, \cdots, \mathbf{x}_q^*]$。
  • methods: 该问题使用了一种新的快速GD-基于解决方案,叫做AltGDmin-LR+S,它具有内存和通信减少的优点。
  • results: 通过对一系列 simulations 进行详细的数值评估, authors 证明了 AltGDmin-LR+S 的性能。
    Abstract This paper studies the following low rank + sparse (LR+S) column-wise compressive sensing problem. We aim to recover an $n \times q$ matrix, $\mathbf{X}^* = [\mathbf{x}_1^*, \mathbf{x}_2^*, \cdots, \mathbf{x}_q^*]$, from $m$ independent linear projections of each of its $q$ columns, given by $\mathbf{y}_k := \mathbf{A}_k \mathbf{x}_k^*$, $k \in [q]$. Here, $\mathbf{y}_k$ is an $m$-length vector with $m < n$. We assume that the matrix $\mathbf{X}^*$ can be decomposed as $\mathbf{X}^* = \mathbf{L}^* + \mathbf{S}^*$, where $\mathbf{L}^*$ is a low rank matrix of rank $r \ll \min(n,q)$ and $\mathbf{S}^*$ is a sparse matrix. Each column of $\mathbf{S}^*$ contains $\rho$ non-zero entries. The matrices $\mathbf{A}_k$ are known and mutually independent for different $k$. To address this recovery problem, we propose a novel fast GD-based solution called AltGDmin-LR+S, which is memory and communication efficient. We numerically evaluate its performance by conducting a detailed simulation-based study.
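A minimal NumPy sketch of the alternating idea follows: per column, solve a small least squares for the low-rank coefficients, hard-threshold a gradient step for the sparse part, and take an averaged gradient step on the shared factor U. The initialisation, step sizes, and thresholding rule here are simplistic placeholders and differ from the actual AltGDmin-LR+S algorithm.

```python
import numpy as np

def altgdmin_lr_s(As, ys, n, r, rho, iters=100, step=0.1, rng=0):
    """Very simplified alternating GD-min sketch for LR+S column-wise CS."""
    rng = np.random.default_rng(rng)
    q = len(As)
    U = np.linalg.qr(rng.normal(size=(n, r)))[0]      # orthonormal initialisation
    B = np.zeros((r, q)); S = np.zeros((n, q))
    for _ in range(iters):
        grad_U = np.zeros_like(U)
        for k in range(q):
            # Minimise over b_k with U and s_k fixed (small least squares).
            M = As[k] @ U
            B[:, k] = np.linalg.lstsq(M, ys[k] - As[k] @ S[:, k], rcond=None)[0]
            resid = As[k] @ (U @ B[:, k] + S[:, k]) - ys[k]
            grad_U += np.outer(As[k].T @ resid, B[:, k])
            # Sparse update: gradient step, then keep the rho largest entries.
            s = S[:, k] - step * (As[k].T @ resid)
            keep = np.argsort(np.abs(s))[-rho:]
            S[:, k] = 0.0; S[keep, k] = s[keep]
        U = np.linalg.qr(U - step * grad_U / q)[0]    # GD step + re-orthonormalise
    return U @ B + S

# Toy problem (relative error should shrink under suitable conditions).
rng = np.random.default_rng(1); n, q, m, r, rho = 30, 20, 25, 2, 2
L = rng.normal(size=(n, r)) @ rng.normal(size=(r, q))
S = np.zeros((n, q))
for k in range(q): S[rng.choice(n, rho, replace=False), k] = 5 * rng.normal(size=rho)
As = [rng.normal(size=(m, n)) / np.sqrt(m) for _ in range(q)]
ys = [As[k] @ (L + S)[:, k] for k in range(q)]
Xhat = altgdmin_lr_s(As, ys, n, r, rho)
print(np.linalg.norm(Xhat - (L + S)) / np.linalg.norm(L + S))
```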

Dose-aware Diffusion Model for 3D Ultra Low-dose PET Imaging

  • paper_url: http://arxiv.org/abs/2311.04248
  • repo_url: None
  • paper_authors: Huidong Xie, Weijie Gan, Bo Zhou, Xiongchao Chen, Qiong Liu, Xueqi Guo, Liang Guo, Hongyu An, Ulugbek S. Kamilov, Ge Wang, Chi Liu
  • for: 降低 PET 扫描中的辐射剂量是一个重要课题,有助于减少辐射暴露与癌症风险。
  • methods: 该研究将扩散模型这一新兴生成模型用于 3D 超低剂量 PET 成像,提出了剂量感知的扩散模型 DDPET,可同时处理不同噪声水平的 PET 图像去噪。
  • results: DDPET 在来自三家医疗机构、三种商用 PET 扫描仪的 295 名患者数据上进行了测试,性能优于先前的扩散模型以及噪声感知的医学图像去噪方法。
    Abstract As PET imaging is accompanied by substantial radiation exposure and cancer risk, reducing radiation dose in PET scans is an important topic. Recently, diffusion models have emerged as the new state-of-the-art generative model to generate high-quality samples and have demonstrated strong potential for various tasks in medical imaging. However, it is difficult to extend diffusion models to 3D image reconstructions due to the memory burden. Directly stacking 2D slices together to create 3D image volumes would result in severe inconsistencies between slices. Previous works tried either applying a penalty term along the z-axis to remove inconsistencies or reconstructing the 3D image volumes with 2 pre-trained perpendicular 2D diffusion models. Nonetheless, these previous methods failed to produce satisfactory results in challenging cases for PET image denoising. In addition to administered dose, the noise levels in PET images are affected by several other factors in clinical settings, such as scan time, patient size, and weight. Therefore, a method to simultaneously denoise PET images with different noise levels is needed. Here, we proposed a dose-aware diffusion model for 3D low-dose PET imaging (DDPET) to address these challenges. The proposed DDPET method was tested on 295 patients from three different medical institutions globally with different low-dose levels. These patient data were acquired on three different commercial PET scanners, including Siemens Vision Quadra, Siemens mCT, and United Imaging Healthcare uExplorer. The proposed method demonstrated superior performance over previously proposed diffusion models for 3D imaging problems as well as models proposed for noise-aware medical image denoising. Code is available at: xxx.
    摘要 随着PET成像受到辐射暴露和癌症风险的影响,降低PET扫描中的辐射剂量是一项非常重要的话题。当前,扩散模型在医学成像中表现出了很强的潜力,但是将扩散模型扩展到3D图像重建具有很大的内存压力。直接将2D切片堆叠成3D图像体会导致切片之间出现严重的不一致。先前的方法包括沿z轴方向施加惩罚项来消除不一致,或者使用两个垂直方向预训练的2D扩散模型来重建3D图像。然而,这些方法在复杂的PET图像去噪问题上未能取得满意的结果。除了给药剂量以外,PET图像的噪声水平还受到扫描时间、患者体型和体重等多个临床因素的影响。因此,需要一种能同时对不同噪声水平的PET图像进行去噪的方法。我们提出了一种剂量感知的扩散模型(DDPET)来解决这些挑战。DDPET方法在来自全球三家医疗机构的295名患者数据上进行了测试,这些数据在三种商用PET扫描仪(Siemens Vision Quadra、Siemens mCT 和 United Imaging Healthcare uExplorer)上采集,涵盖不同的低剂量水平。该方法在3D成像问题上优于先前提出的扩散模型,也优于针对噪声感知医学图像去噪提出的模型。代码可在 xxx 获取。
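The core "dose-aware" ingredient, conditioning one denoiser on the dose level, can be illustrated compactly. The PyTorch sketch below injects a learned dose embedding into a toy 3-D conv network; it shows the conditioning mechanism only, not the paper's diffusion model or architecture.

```python
import torch
import torch.nn as nn

class DoseConditionedDenoiser(nn.Module):
    """Toy dose conditioning: embed the scalar dose level and add it to the
    features of a small 3-D conv denoiser, so one network handles many noise
    levels."""
    def __init__(self, ch=16):
        super().__init__()
        self.dose_mlp = nn.Sequential(nn.Linear(1, ch), nn.SiLU(), nn.Linear(ch, ch))
        self.conv_in = nn.Conv3d(1, ch, 3, padding=1)
        self.conv_out = nn.Conv3d(ch, 1, 3, padding=1)
    def forward(self, x, dose):           # x: (B,1,D,H,W), dose: (B,1) in [0,1]
        h = torch.relu(self.conv_in(x))
        h = h + self.dose_mlp(dose)[:, :, None, None, None]  # broadcast over voxels
        return self.conv_out(h)

net = DoseConditionedDenoiser()
x = torch.randn(2, 1, 8, 8, 8)
print(net(x, torch.tensor([[0.05], [0.5]])).shape)   # torch.Size([2, 1, 8, 8, 8])
```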

eess.SP - 2023-11-07

NEAT-MUSIC: Auto-calibration of DOA Estimation for Terahertz-Band Massive MIMO Systems

  • paper_url: http://arxiv.org/abs/2311.04322
  • repo_url: None
  • paper_authors: Ahmet M. Elbir, Abdulkadir Celik, Ahmed M. Eltawil
  • for: 本文探讨了太赫兹(THz)频段第六代无线系统中的波达方向(DOA)估计问题,并研究了两类主要误差源的影响:1)增益-相位失配;2)波束斜视。
  • methods: 该文提出了一种名为 NoisE subspAce correcTion technique for MUltiple SIgnal Classification(NEAT-MUSIC)的自动校准方法,通过校正噪声子空间,在增益-相位失配与波束斜视存在时实现准确的 DOA 估计。
  • results: 数值结果表明所提方法是有效的。
    Abstract Terahertz (THz) band is envisioned for the future sixth generation wireless systems thanks to its abundant bandwidth and very narrow beamwidth. These features are one of the key enabling factors for high resolution sensing with milli-degree level direction-of-arrival (DOA) estimation. Therefore, this paper investigates the DOA estimation problem in THz systems in the presence of two major error sources: 1) gain-phase mismatches, which occur due to the deviations in the radio-frequency circuitry; 2) beam-squint, which is caused because of the deviations in the generated beams at different subcarriers due to ultra-wide bandwidth. An auto-calibration approach, namely NoisE subspAce correcTion technique for MUltiple SIgnal Classification (NEAT-MUSIC), is proposed based on the correction of the noise subspace for accurate DOA estimation in the presence of gain-phase mismatches and beam-squint. To gauge the performance of the proposed approach, the Cramer-Rao bounds are also derived. Numerical results show the effectiveness of the proposed approach.
    摘要 太赫兹(THz)频段因其充足的带宽和极窄的波束宽度,被视为未来第六代无线系统的重要候选。这些特性是实现毫度级波达方向(DOA)估计等高分辨率感知的关键因素之一。因此,本文研究了THz系统中存在两类主要误差源时的DOA估计问题:1)由射频电路偏差引起的增益-相位失配;2)由超宽带宽导致不同子载波生成的波束发生偏移的波束斜视。本文提出了一种自动校准方法,即NoisE subspAce correcTion technique for MUltiple SIgnal Classification(NEAT-MUSIC),基于对噪声子空间的校正,在增益-相位失配与波束斜视存在时精确估计DOA。为评估所提方法的性能,还推导了克拉美-罗界。数值结果显示了所提方法的有效性。
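For context, the classical MUSIC estimator that NEAT-MUSIC builds on fits in a few lines of NumPy. The sketch below computes the standard pseudospectrum from the noise subspace of the sample covariance for an ideal uniform linear array; it includes none of the paper's gain-phase or beam-squint corrections.

```python
import numpy as np

def music_spectrum(X, n_sources, d=0.5, grid=np.linspace(-90, 90, 361)):
    """Textbook MUSIC pseudospectrum for an ideal uniform linear array with
    element spacing d (in wavelengths)."""
    n = X.shape[0]
    R = X @ X.conj().T / X.shape[1]               # sample covariance
    _, vecs = np.linalg.eigh(R)                   # eigenvalues in ascending order
    En = vecs[:, : n - n_sources]                 # noise subspace
    spec = np.empty(grid.size)
    for i, theta in enumerate(np.deg2rad(grid)):
        a = np.exp(-2j * np.pi * d * np.arange(n) * np.sin(theta))
        spec[i] = 1.0 / np.real(a.conj() @ En @ En.conj().T @ a)
    return grid, spec

# Toy check: two sources at -20 and 30 degrees, 8-element array, 200 snapshots.
rng = np.random.default_rng(0)
n, snaps = 8, 200
A = np.exp(-2j * np.pi * 0.5 * np.outer(np.arange(n), np.sin(np.deg2rad([-20, 30]))))
X = A @ rng.normal(size=(2, snaps)) + 0.1 * (rng.normal(size=(n, snaps))
                                             + 1j * rng.normal(size=(n, snaps)))
grid, spec = music_spectrum(X, n_sources=2)
print(grid[np.argmax(spec)])                      # global peak lands near a true DOA
```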

High-performance Power Allocation Strategies for Active IRS-aided Wireless Network

  • paper_url: http://arxiv.org/abs/2311.04032
  • repo_url: None
  • paper_authors: Yifan Zhao, Xuehui Wang, Yan Wang, Xianpeng Wang, Zhilin Chen, Feng Shu, Jiangzhou Wang
  • for: 为有源智能反射面(IRS)辅助的无线网络设计高性能的功率分配(PA)策略,提升通信系统性能
  • methods: 提出了一种基于梯度上升(GA)的等间距多点初始化 GA(ESMPI-GA)方法,以及一种基于三阶泰勒展开(TTE)的低复杂度 PA 方法
  • results: 仿真表明,对小规模 IRS 系统,ESMPI-GA 相比传统 GA 约有 0.5 比特的性能增益,而 TTE 以极低的复杂度取得了优于 TPA 和固定 PA 策略的性能。
    Abstract Due to its intrinsic ability to combat the double fading effect, the active intelligent reflective surface (IRS) becomes popular. The main feature of active IRS must be supplied by power, and the problem of how to allocate the total power between base station (BS) and IRS to fully explore the rate gain achieved by power allocation (PA) to remove the rate gap between existing PA strategies and optimal exhaustive search (ES) arises naturally. First, the signal-to-noise ratio (SNR) expression is derived to be a function of PA factor beta [0, 1]. Then, to improve the rate performance of the conventional gradient ascent (GA), an equal-spacing-multiple-point-initialization GA (ESMPI-GA) method is proposed. Due to its slow linear convergence from iterative GA, the proposed ESMPI-GA is high-complexity. Eventually, to reduce this high complexity, a low-complexity closed-form PA method with third-order Taylor expansion (TTE) centered at point beta0 = 0.5 is proposed. Simulation results show that the proposed ESMPI-GA harvests about 0.5 bit gain over conventional GA and 1.2 and 0.8 bits gain over existing methods like equal PA and Taylor polynomial approximation (TPA) for small-scale IRS, and the proposed TTE performs much better than TPA and fixed PA strategies using an extremely low complexity.
    摘要 由于其固有的对抗双重衰落效应的能力,有源智能反射面(IRS)受到广泛关注。有源 IRS 的主要特点是需要供电,因此如何在基站(BS)与 IRS 之间分配总功率,以充分挖掘功率分配(PA)带来的速率增益、消除现有 PA 策略与最优穷举搜索(ES)之间的速率差距,便成为一个自然的问题。首先,信噪比(SNR)表达式被推导为 PA 因子 beta∈[0, 1] 的函数。然后,为提升传统梯度上升(GA)的速率性能,提出了一种等间距多点初始化 GA(ESMPI-GA)方法。由于迭代 GA 收敛缓慢(线性收敛),所提 ESMPI-GA 复杂度较高。最终,为降低这一复杂度,提出了一种以 beta0 = 0.5 为展开点、基于三阶泰勒展开(TTE)的低复杂度闭式 PA 方法。仿真结果表明,对小规模 IRS,所提 ESMPI-GA 相比传统 GA 约有 0.5 比特的增益,相比等功率分配与泰勒多项式近似(TPA)等现有方法分别约有 1.2 和 0.8 比特的增益;而所提 TTE 以极低的复杂度取得了远优于 TPA 和固定 PA 策略的性能。
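The multi-start search itself is generic and easy to sketch. Below, gradient ascent over the PA factor beta is launched from equally spaced initial points and the best result is kept, mirroring the ESMPI idea; the rate function is a made-up stand-in for the paper's SNR expression, and gradients are taken numerically.

```python
import numpy as np

def esmpi_ga(rate_fn, n_starts=8, step=0.05, iters=200, eps=1e-4):
    """Multi-start gradient ascent over the PA factor beta in [0, 1]: launch
    ascents from equally spaced initial points and keep the best."""
    best_beta, best_rate = None, -np.inf
    for beta in np.linspace(0.1, 0.9, n_starts):
        for _ in range(iters):
            g = (rate_fn(beta + eps) - rate_fn(beta - eps)) / (2 * eps)
            beta = float(np.clip(beta + step * g, 0.0, 1.0))
        if rate_fn(beta) > best_rate:
            best_beta, best_rate = beta, rate_fn(beta)
    return best_beta, best_rate

# Toy rate curve with a small ripple to show why multiple starting points help.
rate = lambda b: np.log2(1 + 20 * b * (1 - b)) + 0.1 * np.sin(8 * b)
print(esmpi_ga(rate))
```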

Memory AMP for Generalized MIMO: Coding Principle and Information-Theoretic Optimality

  • paper_url: http://arxiv.org/abs/2311.04012
  • repo_url: None
  • paper_authors: Yufei Chen, Lei Liu, Yuhao Chi, Ying Li, Zhaoyang Zhang
  • for: This paper focuses on developing an information-theoretically optimal low-complexity receiver for generalized multiple-input multiple-output (GMIMO) systems.
  • methods: The proposed receiver uses a message passing algorithm called MAMP, a low-complexity variant of the optimal OAMP/VAMP receiver. The paper also develops a simplified single-input single-output variational state evolution (VSE) to analyze the achievable rate of MAMP.
  • results: The proposed MAMP receiver achieves the same performance as the optimal OAMP/VAMP receiver with 0.4% of the time consumption for large-scale systems. The paper proves the information-theoretic optimality of MAMP and establishes an optimal coding principle to maximize the achievable rate; with practical optimized LDPC codes, the finite-length performance of MAMP is 0.5-2.7 dB away from the associated constrained capacities.
    Abstract To support complex communication scenarios in next-generation wireless communications, this paper focuses on a generalized MIMO (GMIMO) with practical assumptions, such as massive antennas, practical channel coding, arbitrary input distributions, and general right-unitarily-invariant channel matrices (covering Rayleigh fading, certain ill-conditioned and correlated channel matrices). The orthogonal/vector approximate message passing (OAMP/VAMP) receiver has been proved to be information-theoretically optimal in GMIMO, but it is limited to high-complexity LMMSE. To solve this problem, a low-complexity memory approximate message passing (MAMP) receiver has recently been shown to be Bayes optimal but limited to uncoded systems. Therefore, how to design a low-complexity and information-theoretically optimal receiver for GMIMO is still an open issue. To address this issue, this paper proposes an information-theoretically optimal MAMP receiver and investigates its achievable rate analysis and optimal coding principle. Specifically, due to the long-memory linear detection, state evolution (SE) for MAMP is intricately multidimensional and cannot be used directly to analyze its achievable rate. To avoid this difficulty, a simplified single-input single-output variational SE (VSE) for MAMP is developed by leveraging the SE fixed-point consistent property of MAMP and OAMP/VAMP. The achievable rate of MAMP is calculated using the VSE, and the optimal coding principle is established to maximize the achievable rate. On this basis, the information-theoretic optimality of MAMP is proved rigorously. Numerical results show that the finite-length performances of MAMP with practical optimized LDPC codes are 0.5-2.7 dB away from the associated constrained capacities. It is worth noting that MAMP can achieve the same performances as OAMP/VAMP with 0.4% of the time consumption for large-scale systems.
    摘要 为支持下一代无线通信中的复杂通信场景,本文研究了具有实际假设的广义MIMO(GMIMO):大规模天线、实际的信道编码、任意输入分布,以及一般右酉不变信道矩阵(涵盖瑞利衰落、某些病态及相关信道矩阵)。正交/向量近似消息传递(OAMP/VAMP)接收机已被证明在GMIMO中是信息论最优的,但它局限于高复杂度的LMMSE。为解决这一问题,近期提出的低复杂度记忆近似消息传递(MAMP)接收机被证明是贝叶斯最优的,但仅限于未编码系统。因此,如何为GMIMO设计低复杂度且信息论最优的接收机仍是一个悬而未决的问题。为此,本文提出了一种信息论最优的MAMP接收机,并研究其可达速率分析与最优编码原则。具体而言,由于长记忆线性检测,MAMP的状态演化(SE)是复杂的多维形式,无法直接用于分析其可达速率。为避免这一困难,本文利用MAMP与OAMP/VAMP的SE不动点一致性,发展了一种简化的单输入单输出变分状态演化(VSE),并据此计算MAMP的可达速率、确立最大化可达速率的最优编码原则。在此基础上,严格证明了MAMP的信息论最优性。数值结果表明,采用实际优化LDPC码的MAMP在有限码长下的性能距相应受限容量0.5-2.7 dB。值得注意的是,对大规模系统,MAMP仅需OAMP/VAMP约0.4%的计算时间即可取得相同性能。

Coverage Hole Elimination System in Industrial Environment

  • paper_url: http://arxiv.org/abs/2311.04011
  • repo_url: None
  • paper_authors: Mervat Zarour, Shreya Tayade, Sergiy Melnyk, Hans D. Schotten
  • for: 这篇论文旨在避免室内环境中的覆盖洞,以确保自动导引车(AGV)的稳定连接。
  • methods: 论文提出了一个框架,使用支持向量机(SVM)分类模型确定覆盖洞的位置,并构建二值覆盖洞地图;AGV 的重新规划路径按最短无覆盖洞路径进行优化。
  • results: 研究发现,如果能提前知道覆盖洞的位置,AGV 的重新规划路径可以更短且更优。
    Abstract The paper proposes a framework to identify and avoid the coverage hole in an indoor industry environment. We assume an edge cloud co-located controller that follows the Automated Guided Vehicle (AGV) movement on a factory floor over a wireless channel. The coverage holes are caused due to blockage, path-loss, and fading effects. An AGV in the coverage hole may lose connectivity to the edge-cloud and become unstable. To avoid connectivity loss, we propose a framework that identifies the position of coverage holes using a Support-Vector Machine (SVM) classifier model and constructs a binary coverage hole map, incorporating AGV trajectory re-planning to avoid the identified coverage holes. The AGV's re-planned trajectory is optimized by selecting the shortest coverage-hole-free trajectory. We further investigated the look-ahead time's impact on the AGV's re-planned trajectory performance. The results reveal that an AGV's re-planned trajectory can be shorter and further optimized if the coverage hole position is known ahead of time.
    摘要 本文提出了一个框架,用于识别并规避室内工厂环境中的覆盖洞。我们假设位于边缘云的控制器通过无线信道跟随自动导引车(AGV)在工厂地面的运动。覆盖洞由阻挡、路径损耗和衰落效应引起,处于覆盖洞中的 AGV 可能失去与边缘云的连接而变得不稳定。为避免连接丢失,我们提出的框架使用支持向量机(SVM)分类模型确定覆盖洞的位置,并构建二值覆盖洞地图,结合 AGV 轨迹重新规划来规避所识别的覆盖洞。AGV 的重新规划轨迹通过选取最短的无覆盖洞轨迹进行优化。我们进一步研究了前瞻时间对 AGV 重新规划轨迹性能的影响。结果显示,如果能提前知道覆盖洞的位置,AGV 的重新规划轨迹可以更短且更优。
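A toy end-to-end version of this pipeline, learn a hole map with an SVM, rasterise it, then re-plan over covered cells, is sketched below with scikit-learn and a plain BFS; the synthetic hole, grid resolution, and 4-connected planner are illustrative choices, not the paper's setup.

```python
import numpy as np
from collections import deque
from sklearn.svm import SVC

# 1) Learn the coverage-hole region from (x, y) -> connected/not samples.
rng = np.random.default_rng(0)
pts = rng.uniform(0, 10, size=(400, 2))
covered_lbl = ((pts - [5, 5]) ** 2).sum(axis=1) > 4     # synthetic hole near (5, 5)
clf = SVC(kernel="rbf").fit(pts, covered_lbl)

# 2) Rasterise the classifier into a binary coverage-hole map.
xs = np.linspace(0, 10, 21)
cells = np.stack(np.meshgrid(xs, xs, indexing="ij"), axis=-1).reshape(-1, 2)
covered = clf.predict(cells).reshape(21, 21)

# 3) Re-plan: breadth-first search over covered cells only (4-connected grid).
def hole_free_path(covered, start, goal):
    q, prev = deque([start]), {start: None}
    while q:
        c = q.popleft()
        if c == goal:
            break
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (c[0] + dx, c[1] + dy)
            if (0 <= nb[0] < covered.shape[0] and 0 <= nb[1] < covered.shape[1]
                    and covered[nb] and nb not in prev):
                prev[nb] = c
                q.append(nb)
    path, c = [], goal
    while c is not None:
        path.append(c)
        c = prev.get(c)
    return path[::-1]

print(len(hole_free_path(covered, (0, 0), (20, 20))), "cells on the hole-free path")
```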

Federated Learning via Active RIS Assisted Over-the-Air Computation

  • paper_url: http://arxiv.org/abs/2311.03982
  • repo_url: None
  • paper_authors: Deyou Zhang, Ming Xiao, Mikael Skoglund, H. Vincent Poor
  • for: 这个论文的目的是提出一种由有源可重构智能表面(RIS)辅助、支持可靠梯度聚合的空中计算(AirComp)联邦学习(FL)系统。
  • methods: 论文构建了一个以最小化每轮训练中梯度聚合误差为目标的优化问题,联合优化收发机设计和 RIS 配置。
  • results: 仿真表明,使用有源 RIS 比无源 RIS 更能有效减少梯度聚合误差,提升联邦学习系统的稳定性和性能。
    Abstract In this paper, we propose leveraging the active reconfigurable intelligence surface (RIS) to support reliable gradient aggregation for over-the-air computation (AirComp) enabled federated learning (FL) systems. An analysis of the FL convergence property reveals that minimizing gradient aggregation errors in each training round is crucial for narrowing the convergence gap. As such, we formulate an optimization problem, aiming to minimize these errors by jointly optimizing the transceiver design and RIS configuration. To handle the formulated highly non-convex problem, we devise a two-layer alternative optimization framework to decompose it into several convex subproblems, each solvable optimally. Simulation results demonstrate the superiority of the active RIS in reducing gradient aggregation errors compared to its passive counterpart.
    摘要 在这篇论文中,我们提出使用有源可重构智能表面(RIS)来支持空中计算(AirComp)联邦学习(FL)系统中可靠的梯度聚合。对联邦学习收敛性的分析表明,最小化每轮训练中的梯度聚合误差是缩小收敛差距的关键。因此,我们构建了一个优化问题,通过联合优化收发机设计与 RIS 配置来最小化这些误差。为处理这一高度非凸问题,我们设计了一个两层交替优化框架,将其分解为若干可最优求解的凸子问题。仿真结果表明,有源 RIS 在减少梯度聚合误差方面优于无源 RIS。
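The "aggregation in the air" primitive itself is simple to simulate. The NumPy sketch below has each device pre-scale its gradient by its scalar channel so the transmitted signals superpose into a noisy sum at the server; the real positive channels and channel-inversion precoding are simplifying assumptions, and the paper's RIS/transceiver optimisation is not included.

```python
import numpy as np

def aircomp_round(grads, h, noise_std, rng):
    """One over-the-air aggregation round with scalar, real, positive channels:
    each device pre-scales its gradient by 1/h_k (channel inversion) so the
    waveforms superpose coherently; the server sees the noisy sum and divides
    by the number of devices to estimate the mean gradient."""
    tx = np.stack([g / hk for g, hk in zip(grads, h)])   # channel-inversion precoding
    rx = (np.asarray(h)[:, None] * tx).sum(axis=0)       # superposition "in the air"
    rx = rx + rng.normal(0.0, noise_std, size=rx.shape)  # receiver noise
    return rx / len(grads)

rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(5)]
est = aircomp_round(grads, h=rng.uniform(0.5, 1.5, size=5), noise_std=0.01, rng=rng)
print(np.round(est - np.mean(grads, axis=0), 3))         # small aggregation error
```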

NOMA Enabled Multi-Access Edge Computing: A Joint MU-MIMO Precoding and Computation Offloading Design

  • paper_url: http://arxiv.org/abs/2311.03974
  • repo_url: None
  • paper_authors: Deyou Zhang, Meng Wang, Shuo Shi, Ming Xiao
  • for: This study targets computation offloading and transmit precoding co-design for multi-access edge computing (MEC), where multiple MEC users (MUs) equipped with multiple antennas access the MEC server in a non-orthogonal multiple access manner.
  • methods: By jointly optimizing the computational frequency, offloading ratio, and precoding matrix of each MU, the total energy consumption of all MUs is minimized while satisfying latency constraints.
  • results: Simulation results validate the convergence of the proposed method and demonstrate its superiority over baseline algorithms.
    Abstract This letter investigates computation offloading and transmit precoding co-design for multi-access edge computing (MEC), where multiple MEC users (MUs) equipped with multiple antennas access the MEC server in a non-orthogonal multiple access manner. We aim to minimize the total energy consumption of all MUs while satisfying the latency constraints by jointly optimizing the computational frequency, offloading ratio, and precoding matrix of each MU. For tractability, we first decompose the original problem into three subproblems and then solve these subproblems iteratively until convergence. Simulation results validate the convergence of the proposed method and demonstrate its superiority over baseline algorithms.
    摘要 这封信研究了多接入边缘计算(MEC)中计算卸载与发射预编码的联合设计,其中多个配备多天线的MEC用户(MU)以非正交多址方式接入MEC服务器。我们的目标是在满足时延约束的前提下,通过联合优化每个MU的计算频率、卸载比率和预编码矩阵,使所有MU的总能耗最小。为了使问题更易求解,我们首先将原问题分解成三个子问题,然后迭代求解这些子问题直到收敛。仿真结果验证了所提方法的收敛性,并表明其优于基准算法。

Distributed Parameter Estimation with Gaussian Observation Noises in Time-varying Digraphs

  • paper_url: http://arxiv.org/abs/2311.03911
  • repo_url: None
  • paper_authors: Jiaqi Yan, Hideaki Ishii
  • for: 这篇论文研究了传感网络中的分布式参数估计问题。每个传感器对一个未知的 $d$ 维参数进行连续观测(观测可能带有高斯随机噪声),并通过相互协作来推断该参数的真实值。
  • methods: 论文首先将动态回归量扩展与混合(DREM)算法推广到随机系统,把 $d$ 维向量参数的估计问题转化为 $d$ 个标量估计问题。对每个标量问题,分别给出先组合后自适应(CTA)和先自适应后组合(ATC)两种扩散式估计算法:每个传感器先在其入邻域内融合局部估计,再利用流式观测进行自适应更新。
  • results: 论文证明,所提估计器可以保证每个传感器都能推断出真实参数值,即使任何单个传感器无法独立做到。这要求网络拓扑在固定长度区间上的并集是强连通的,且传感器需共同满足一个协同持续激励(PE)条件,该条件放宽了传统的PE条件。数值例子验证了所得结果。
    摘要 在这篇论文中,我们考虑传感网络中的分布式参数估计问题。每个传感器对一个未知的 $d$ 维参数进行连续观测,观测可能受到高斯随机噪声的影响;传感器们通过相互协作来推断参数的真实值。为此,我们首先将动态回归量扩展与混合(DREM)算法推广到随机系统,将 $d$ 维向量参数的估计问题转化为 $d$ 个标量估计问题,每个标量问题对应一个未知参数。对每个标量问题,我们给出先组合后自适应(CTA)和先自适应后组合(ATC)两种基于扩散的估计算法,其中每个传感器在其入邻域内执行组合步骤以融合局部估计,并执行自适应步骤以处理流式观测。在网络拓扑和回归量激励的较弱条件下,我们证明了所提估计器能保证每个传感器都推断出真实参数,即使任何单个传感器无法独立做到。具体而言,要求固定长度区间上网络拓扑的并集是强连通的;此外,传感器必须共同满足一个协同持续激励(PE)条件,该条件放宽了传统的PE条件。最后,我们给出数值例子来说明所建立的结果。
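The ATC scheme is a classic and easy to demonstrate. The NumPy sketch below runs plain ATC diffusion LMS on a small network, adapt with local data, then combine with neighbours via a row-stochastic weight matrix; the DREM reduction to scalar problems and the paper's time-varying digraph analysis are omitted, and the fully connected equal-weight network is just for illustration.

```python
import numpy as np

def atc_diffusion(theta_star, W, mu=0.05, iters=2000, noise=0.1, rng=0):
    """Plain ATC diffusion LMS: every node first adapts with its own noisy
    regression data, then combines with its in-neighbours via the
    row-stochastic weight matrix W."""
    rng = np.random.default_rng(rng)
    n, d = W.shape[0], theta_star.size
    est = np.zeros((n, d))
    for _ in range(iters):
        phi = rng.normal(size=(n, d))                    # regressors (excitation)
        y = phi @ theta_star + noise * rng.normal(size=n)
        # Adapt: local LMS step at every node.
        psi = est + mu * phi * (y - np.sum(phi * est, axis=1))[:, None]
        # Combine: convex fusion of neighbours' intermediate estimates.
        est = W @ psi
    return est

theta = np.array([1.0, -2.0, 0.5])
W = np.full((4, 4), 0.25)                                # fully connected, equal weights
print(np.round(atc_diffusion(theta, W), 2))              # each row approximates theta
```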

Integrated Sensing, Communication, and Computing for Cost-effective Multimodal Federated Perception

  • paper_url: http://arxiv.org/abs/2311.03815
  • repo_url: None
  • paper_authors: Ning Chen, Zhipeng Cheng, Xuwei Fan, Bangzhen Huang, Yifeng Zhao, Lianfen Huang, Xiaojiang Du, Mohsen Guizani
  • for: 这篇论文主要关注如何合理协调感知、通信与计算之间的多域资源调度,以提高多模态联邦感知(MFP)网络的学习性能与可扩展性。
  • methods: 本论文提出了一种面向服务的资源管理策略:借助 MFP 服务市场的激励机制,将资源管理问题重新定义为社会福利最大化问题,以提升学习性能收益并降低资源成本。
  • results: 实验结果表明,提出的资源调度策略能够有效地提高MFP网络的学习性能和稳定性,同时减少资源成本。
    Abstract Federated learning (FL) is a classic paradigm of 6G edge intelligence (EI), which alleviates privacy leaks and high communication pressure caused by traditional centralized data processing in the artificial intelligence of things (AIoT). The implementation of multimodal federated perception (MFP) services involves three sub-processes, including sensing-based multimodal data generation, communication-based model transmission, and computing-based model training, ultimately relying on available underlying multi-domain physical resources such as time, frequency, and computing power. How to reasonably coordinate the multi-domain resources scheduling among sensing, communication, and computing, therefore, is crucial to the MFP networks. To address the above issues, this paper investigates service-oriented resource management with integrated sensing, communication, and computing (ISCC). With the incentive mechanism of the MFP service market, the resources management problem is redefined as a social welfare maximization problem, where the idea of "expanding resources" and "reducing costs" is used to improve learning performance gain and reduce resource costs. Experimental results demonstrate the effectiveness and robustness of the proposed resource scheduling mechanisms.
    摘要 联邦学习(FL)是第六代(6G)边缘智能(EI)的一种经典范式,它缓解了物联网人工智能(AIoT)中传统集中式数据处理带来的隐私泄露和高通信压力。多模态联邦感知(MFP)服务的实现包含三个子过程:基于感知的多模态数据生成、基于通信的模型传输以及基于计算的模型训练,最终都依赖于时间、频率和算力等可用的底层多域物理资源。因此,如何在感知、通信与计算之间合理协调多域资源调度,对 MFP 网络至关重要。为解决上述问题,本文研究了感知、通信与计算一体化(ISCC)的面向服务的资源管理。借助 MFP 服务市场的激励机制,资源管理问题被重新定义为社会福利最大化问题,其中利用"扩展资源"与"降低成本"的思想来提升学习性能收益并降低资源成本。实验结果证明了所提资源调度机制的有效性与鲁棒性。

Textile-based conformable and breathable ultrasound imaging probe

  • paper_url: http://arxiv.org/abs/2311.03787
  • repo_url: None
  • paper_authors: Takumi Noda, Seiichi Takamatsu, Michitaka Yamamoto, Naoto Tomita, Toshihiro Itoh, Takashi Azuma, Ichiro Sakuma, Naoki Tomii
  • for: 该研究旨在开发一种可贴合、可透气的超声(US)成像探头,用于在皮肤表面对体内组织进行日常监测。
  • methods: 研究人员以纺织品为探头基底,通过化学镀在织物上形成铜电极;电极部位的空气间隙被铜填充,以提高超声波的透射率,而非电极部位保留空气间隙,以保证高透气性。
  • results: 实验结果显示,所制纺织基探头具有低弯曲刚度($0.066 \times 10^{-4} N \cdot m^2/m$)和高透气性($11.7 cm^3 / cm^2 \cdot s$)。在人体颈部的实验中,探头能够监测颈总动脉的搏动和颈内静脉直径的变化,有助于早期发现动脉硬化、脱水等健康问题。
    Abstract Daily monitoring of internal tissues with conformable and breathable ultrasound (US) imaging probes is promising for early detection of diseases. In recent years, textile substrates are widely used for wearable devices since they satisfy both conformability and breathability. However, it is not currently possible to use textile substrates for US probes due to the reflection or attenuation of US waves at the air gaps in the textiles. In this paper, we fabricated a conformable and breathable US imaging probe by sandwiching the US elements between two woven polyester textiles on which copper electrodes were formed through electroless plating. The air gaps between the fibers at the electrode parts were filled with copper, allowing for high penetration of US waves. On the other hand, the non-electrode parts retain air gaps, leading to high breathability. The fabricated textile-based probe showed low flexural rigidity ($0.066 \times 10^{-4} N \cdot m^2/m$) and high air permeability ($11.7 cm^3 / cm^2 \cdot s$). Human neck imaging demonstrated the ability of the probe to monitor the pulsation of the common carotid artery and change in the internal jugular vein diameter, which lead to the early detection of health issues such as arteriosclerosis and dehydration.
    摘要 使用可贴合、可透气的超声(US)成像探头对人体内部组织进行日常监测,有望实现疾病的早期发现。近年来,纺织基底因兼具贴合性与透气性而被广泛用于可穿戴设备。然而,由于织物中的空气间隙会反射或衰减超声波,目前还无法将纺织基底用于超声探头。本文通过将超声阵元夹在两层经化学镀形成铜电极的涤纶机织物之间,制作了一种可贴合、可透气的超声成像探头。电极部位纤维间的空气间隙被铜填充,使超声波能够高效透射;而非电极部位保留空气间隙,从而保持高透气性。所制纺织基探头表现出低弯曲刚度($0.066 \times 10^{-4} N \cdot m^2/m$)和高透气性($11.7 cm^3 / cm^2 \cdot s$)。人体颈部成像实验表明,该探头能够监测颈总动脉的搏动和颈内静脉直径的变化,有助于早期发现动脉硬化、脱水等健康问题。

Multi-Beam Forming with Movable-Antenna Array

  • paper_url: http://arxiv.org/abs/2311.03775
  • repo_url: None
  • paper_authors: Wenyan Ma, Lipeng Zhu, Rui Zhang
  • for: 在提升期望方向波束增益的同时抑制非期望方向的干扰,改善多波束赋形性能
  • methods: 联合优化天线位置向量(APV)与天线权重向量(AWV),在满足非期望方向最大干扰功率约束的条件下,最大化多个期望方向上的最小波束增益
  • results: 在波束增益与干扰抑制两方面均显著优于传统固定位置天线(FPA)阵列及其他基准方案
    Abstract Conventional multi-beam forming with fixed-position antenna (FPA) arrays needs to trade-off between maximizing the beamforming gain over desired directions and minimizing the interference power over undesired directions. In this letter, we study the enhanced multi-beam forming with a linear movable-antenna (MA) array by exploiting the new degrees of freedom (DoFs) via antennas' position optimization. Specifically, we jointly optimize the antenna position vector (APV) and antenna weight vector (AWV) to maximize the minimum beamforming gain over multiple desired directions, subject to a given constraint on the maximum interference power over undesired directions. We propose an efficient alternating optimization algorithm to find a suboptimal solution by iteratively optimizing one of the APV and AWV with the other being fixed. Numerical results show that the proposed multi-beam forming design with MA arrays can significantly outperform that with the traditional FPA arrays and other benchmark schemes in terms of both beamforming gain and interference suppression.
    摘要 采用固定位置天线(FPA)阵列的传统多波束赋形,需要在最大化期望方向的波束增益与最小化非期望方向的干扰功率之间进行折中。本文研究利用线性可移动天线(MA)阵列、通过天线位置优化引入的新自由度(DoF)来增强多波束赋形。具体而言,我们联合优化天线位置向量(APV)与天线权重向量(AWV),在给定非期望方向最大干扰功率约束的条件下,最大化多个期望方向上的最小波束增益。我们提出一种高效的交替优化算法,通过固定其一、迭代优化 APV 与 AWV 之一来求得次优解。数值结果表明,所提基于 MA 阵列的多波束赋形设计在波束增益与干扰抑制方面均显著优于传统 FPA 阵列及其他基准方案。
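The extra degree of freedom is easy to visualise numerically: the beam pattern of a linear array depends on the element positions through the steering vector. The NumPy sketch below evaluates |w^H a(theta)|^2 for arbitrary element positions (in wavelengths); the perturbed positions stand in for an optimised APV, since the paper's actual alternating APV/AWV optimisation is not reproduced here.

```python
import numpy as np

def beam_gains(positions, weights, thetas):
    """Gain of a linear array with arbitrary element positions (in wavelengths)
    over a set of directions, with steering vector a(theta) whose n-th entry
    is exp(j 2 pi x_n sin(theta))."""
    a = np.exp(2j * np.pi * np.outer(positions, np.sin(thetas)))  # (N, n_dirs)
    return np.abs(weights.conj() @ a) ** 2

# Toy comparison: half-wavelength fixed grid vs slightly perturbed positions.
n = 8
w = np.ones(n, dtype=complex) / np.sqrt(n)
desired = np.deg2rad([-30.0, 20.0])
fixed = 0.5 * np.arange(n)
moved = fixed + 0.1 * np.sin(np.arange(n))       # stand-in "optimised" offsets
print("fixed :", np.round(beam_gains(fixed, w, desired), 3))
print("moved :", np.round(beam_gains(moved, w, desired), 3))
```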

Classification of Various Types of Damages in Honeycomb Composite Sandwich Structures using Guided Wave Structural Health Monitoring

  • paper_url: http://arxiv.org/abs/2311.03765
  • repo_url: https://github.com/shrutisawant099/damage-classification-using-feature-engineering
  • paper_authors: Shruti Sawant, Jeslin Thalapil, Siddharth Tallur, Sauvik Banerjee, Amit Sethi
  • for: 本研究旨在为蜂窝复合夹层结构(HCSS)中的损伤分类提供方法,以便决定相应的维修措施。
  • methods: 本研究使用精心的特征工程和机器学习来区分不同类型的损伤,包括芯材压溃(CC)、高密度芯材(HDC)、胶膜缺失(LFA)和聚四氟乙烯脱模膜(TRF)。
  • results: 研究发现,其中两种损伤类型对导波(GW)信号的影响尤为相似。研究从时域和频域提取并评估了多种特征,既包括相对基线的特征,也包括无基线特征,并利用基于皮尔逊相关系数的过滤消除冗余特征。最终,在通过特征消除确定的最优特征集上,随机森林分类器在留出信号上取得了高精度;在针对不同损伤尺寸、利用大规模参数化研究所得仿真数据进行的评估中,精度为 77.89%。可解释性研究表明,基于基线信号计算的特征更为有效。
    Abstract Classification of damages in honeycomb composite sandwich structure (HCSS) is important to decide remedial actions. However, previous studies have only detected damages using deviations of monitoring signal from healthy (baseline) using a guided wave (GW) based structural health monitoring system. Classification between various types of damages has not been reported for challenging cases. We show that using careful feature engineering and machine learning it is possible to classify between various types of damages such as core crush (CC), high density core (HDC), lost film adhesive (LFA) and teflon release film (TRF). We believe that we are the first to report numerical models for four types of damages in HCSS, which is followed up with experimental validation. We found that two out of four damages affect the GW signal in a particularly similar manner. We extracted and evaluated multiple features from time as well as frequency domains, and also experimented with features relative to as baseline as well as those that were baseline-free. Using Pearson's correlation coefficient based filtering, redundant features were eliminated. Finally, using an optimal feature set determined using feature elimination, high accuracy was achieved with a random forest classifier on held-out signals. For evaluating performance of the proposed method for different damage sizes, we used simulated data obtained from extensive parametric studies and got an accuracy of 77.89%. Interpretability studies to determine importance of various features showed that features computed using the baseline signal prove more effective as compared to baseline-free features.
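A compact version of the feature-engineering pipeline, a few time/frequency features, a Pearson-correlation filter, then a random forest, is sketched below with NumPy and scikit-learn; the specific features and the 0.95 threshold are illustrative guesses, not the paper's exact feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gw_features(sig, baseline, fs):
    """A handful of generic time/frequency features of a guided-wave signal,
    both baseline-free and relative to a baseline recording."""
    diff = sig - baseline
    spec = np.abs(np.fft.rfft(sig))
    return np.array([
        np.max(np.abs(sig)),                                  # peak amplitude
        np.sqrt(np.mean(sig ** 2)),                           # RMS energy
        np.sqrt(np.mean(diff ** 2)),                          # residual energy vs baseline
        np.fft.rfftfreq(len(sig), 1 / fs)[np.argmax(spec)],   # dominant frequency
    ])

def drop_correlated(X, thresh=0.95):
    """Pearson-correlation filter: greedily keep a feature only if it is not
    highly correlated with any already-kept feature."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < thresh for k in keep):
            keep.append(j)
    return X[:, keep], keep

# Usage sketch: X stacks gw_features over many signals, y labels {CC, HDC, LFA, TRF}.
# X_red, kept = drop_correlated(X)
# clf = RandomForestClassifier(n_estimators=200).fit(X_red, y)
```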

Beyond Traditional Beamforming: Singular Vector Projection Techniques for MU-MIMO Interference Management

  • paper_url: http://arxiv.org/abs/2311.03741
  • repo_url: None
  • paper_authors: Md Saheed Ullah, Rafid Umayer Murshed, Md. Forkan Uddin
  • for: 这篇论文旨在为多用户多输入多输出(MU-MIMO)系统提出低复杂度的波束赋形算法,以减少用户间干扰并提升频谱效率(SE)。
  • methods: 论文首先提出了一种奇异向量波束空间搜索(SVBS)算法,评估所有奇异向量以确定最有效的波束赋形方案;随后给出数学证明,表明 MU-MIMO 波束赋形系统的总用户间干扰可由标准正交奇异向量之间的相互投影高效计算,并据此提出干扰优化的奇异向量波束赋形(IOSVB)算法;为进一步降低计算负担,结合主成分分析(PCA)提出降维的 DR-IOSVB 算法。
  • results: 数值结果表明,SVBS 算法优于现有算法,IOSVB 可提供几乎相同的 SE,而 DR-IOSVB 则在性能与计算复杂度之间取得平衡。这项工作为 MU-MIMO 无线通信系统中高性能、低复杂度的波束赋形确立了新的基准。
    Abstract This paper introduces low-complexity beamforming algorithms for multi-user multiple-input multiple-output (MU-MIMO) systems to minimize inter-user interference and enhance spectral efficiency (SE). A Singular-Vector Beamspace Search (SVBS) algorithm is initially presented, wherein all the singular vectors are assessed to determine the most effective beamforming scheme. We then establish a mathematical proof demonstrating that the total inter-user interference of a MU-MIMO beamforming system can be efficiently calculated from the mutual projections of orthonormal singular vectors. Capitalizing on this, we present an Interference-optimized Singular Vector Beamforming (IOSVB) algorithm for optimal singular vector selection. For further reducing the computational burden, we propose a Dimensionality-reduced IOSVB (DR-IOSVB) algorithm by integrating the principal component analysis (PCA). The numerical results demonstrate the superiority of the SVBS algorithm over the existing algorithms, with the IOSVB offering near-identical SE and the DR-IOSVB balancing the performance and computational efficiency. This work establishes a new benchmark for high-performance and low-complexity beamforming in MU-MIMO wireless communication systems.
    摘要 This paper first presents a Singular-Vector Beamspace Search (SVBS) algorithm in which all singular vectors are assessed to determine the most effective beamforming scheme, together with a proof that the total inter-user interference can be computed from the mutual projections of orthonormal singular vectors. To further reduce computational complexity, an Interference-optimized Singular Vector Beamforming (IOSVB) algorithm is proposed for optimal singular vector selection, and a Dimensionality-reduced IOSVB (DR-IOSVB) algorithm additionally integrates principal component analysis (PCA). Numerical results show that the SVBS algorithm outperforms existing algorithms, while the IOSVB and DR-IOSVB algorithms offer near-identical SE with reduced computational complexity. This work establishes a new benchmark for high-performance and low-complexity beamforming in MU-MIMO wireless communication systems.
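The key computational claim, that inter-user interference follows from mutual projections of singular vectors, can be illustrated directly. The NumPy sketch below picks one right-singular vector per user as its beamformer and forms the pairwise projection powers |v_i^H v_j|^2; this is a simplified reading for intuition, not the paper's full SVBS/IOSVB procedure.

```python
import numpy as np

def interuser_interference(channels, selected):
    """Given each user's channel H_k and the index of the right-singular
    vector chosen as its beamformer, return P with P[i, j] = |v_i^H v_j|^2,
    the projection power of user i's beam onto user j's."""
    vs = []
    for Hk, idx in zip(channels, selected):
        _, _, Vh = np.linalg.svd(Hk)
        vs.append(Vh[idx].conj())                 # the chosen right-singular vector
    V = np.stack(vs)
    P = np.abs(V.conj() @ V.T) ** 2               # pairwise projection powers
    np.fill_diagonal(P, 0.0)
    return P

rng = np.random.default_rng(0)
H = [rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8)) for _ in range(3)]
print(np.round(interuser_interference(H, selected=[0, 0, 0]), 3))
```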

Recursive Filters as Linear Time-Invariant Systems

  • paper_url: http://arxiv.org/abs/2311.03676
  • repo_url: None
  • paper_authors: Jonathan H. Manton
  • for: 这篇论文旨在解释为何以及如何可以将递归滤波器视为线性时不变(LTI)系统,从而使傅里叶分析等工具得以应用。
  • methods: 这篇论文借助 z 变换来处理递归滤波器,解释了收敛域(ROC)为何重要,以及 z 变换为何无法刻画未初始化时的无穷多个解。
  • results: 论文的结论是,在适当的初始化约定下,递归滤波器可以被当作 LTI 系统来分析,并可对其应用傅里叶分析。
    Abstract Recursive filters are treated as linear time-invariant (LTI) systems but they are not: uninitialised, they have an infinite number of outputs for any given input, while if initialised, they are not time-invariant. This short tutorial article explains how and why they can be treated as LTI systems, thereby allowing tools such as Fourier analysis to be applied. It also explains the origin of the z-transform, why the region of convergence is important, and why the z-transform fails to find an infinite number of solutions.
    摘要 递归滤波器通常被当作线性时不变(LTI)系统来处理,但严格来说并非如此:未初始化时,对任意给定输入它们有无穷多个可能输出;而一旦初始化,它们又不再是时不变的。这篇简短的教程文章解释了为什么以及如何仍可将递归滤波器视为 LTI 系统,从而使傅里叶分析等工具得以应用;文章还解释了 z 变换的由来、收敛域为何重要,以及 z 变换为何无法刻画那无穷多个解。
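A one-line demonstration of the at-rest convention that makes the recursion LTI: initialised with zero state, the first-order recursion y[n] = a y[n-1] + x[n] has impulse response a^n, matching H(z) = 1/(1 - a z^{-1}) with region of convergence |z| > |a|. The snippet below checks this with SciPy.

```python
import numpy as np
from scipy.signal import lfilter

a = 0.8
x = np.zeros(10); x[0] = 1.0                      # unit impulse
y = lfilter([1.0], [1.0, -a], x)                  # y[n] = a*y[n-1] + x[n], at rest
print(np.allclose(y, a ** np.arange(10)))         # True: impulse response is a**n
```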

On the Performance of LoRa Empowered Communication for Wireless Body Area Networks

  • paper_url: http://arxiv.org/abs/2311.03653
  • repo_url: None
  • paper_authors: Minling Zhang, Guofa Cai, Zhiping Xu, Jiguang He, Markku Juntti
  • for: 这篇论文主要研究了基于 LoRa 的无线体域网(WBAN)中物理层(PHY)和媒体接入控制层(MAC)的性能,特别是在瑞利-对数正态衰落信道下。
  • methods: 该论文推导了 LoRa 系统的闭式近似误比特率(BEP)表达式。
  • results: 研究表明,增大扩频因子(SF)并降低干扰可以有效缓解阴影效应。此外,论文对基于 LoRa 的 WBAN 中的纯 ALOHA、时隙 ALOHA 和载波侦听多路访问三种 MAC 协议进行了对比分析,并发现等间隔与等面积两种方案有助于选择合适的 SF。
    Abstract To remotely monitor the physiological status of the human body, long range (LoRa) communication has been considered as an eminently suitable candidate for wireless body area networks (WBANs). Typically, a Rayleigh-lognormal fading channel is encountered by the LoRa links of the WBAN. In this context, we characterize the performance of the LoRa system in WBAN scenarios with an emphasis on the physical (PHY) layer and medium access control (MAC) layer in the face of Rayleigh-lognormal fading channels and the same spreading factor interference. Specifically, closed-form approximate bit error probability (BEP) expressions are derived for the LoRa system. The results show that increasing the SF and reducing the interference efficiently mitigate the shadowing effects. Moreover, in the quest for the most suitable MAC protocol for LoRa based WBANs, three MAC protocols are critically appraised, namely the pure ALOHA, slotted ALOHA, and carrier-sense multiple access. The coverage probability, energy efficiency, throughput, and system delay of the three MAC protocols are analyzed in Rayleigh-lognormal fading channel. Furthermore, the performance of the equal-interval-based and equal-area-based schemes is analyzed to guide the choice of the SF. Our simulation results confirm the accuracy of the mathematical analysis and provide some useful insights for the future design of LoRa based WBANs.
    摘要 为了远程监测人体生理状态,长距离(LoRa)通信被认为是无线体域网(WBAN)极为合适的候选技术。WBAN 中的 LoRa 链路通常经历瑞利-对数正态衰落信道。在此背景下,我们着重分析了 LoRa 系统在 WBAN 场景中物理层(PHY)与媒体接入控制层(MAC)在瑞利-对数正态衰落信道及同扩频因子干扰下的性能。具体而言,我们推导了 LoRa 系统的闭式近似误比特率(BEP)表达式。结果表明,增大扩频因子(SF)并降低干扰能够有效缓解阴影效应。此外,为了寻找最适合基于 LoRa 的 WBAN 的 MAC 协议,我们批判性地考察了纯 ALOHA、时隙 ALOHA 和载波侦听多路访问三种协议,并在瑞利-对数正态衰落信道下分析了三者的覆盖概率、能效、吞吐量和系统时延。进一步地,我们分析了等间隔与等面积两种方案的性能,以指导 SF 的选择。仿真结果验证了数学分析的准确性,并为未来基于 LoRa 的 WBAN 设计提供了有益的见解。
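As a baseline for the MAC comparison, the textbook throughput curves of pure and slotted ALOHA on an ideal collision channel take three lines; the paper's analysis replaces this idealisation with Rayleigh-lognormal fading and also covers CSMA plus coverage, energy-efficiency, and delay metrics.

```python
import numpy as np

G = np.linspace(0.01, 3.0, 300)                   # offered load (packets per slot)
pure = G * np.exp(-2 * G)                         # pure ALOHA throughput
slotted = G * np.exp(-G)                          # slotted ALOHA throughput
print(f"pure ALOHA peak   : {pure.max():.3f} at G = {G[pure.argmax()]:.2f}")
print(f"slotted ALOHA peak: {slotted.max():.3f} at G = {G[slotted.argmax()]:.2f}")
# Expected: ~0.184 at G=0.5 and ~0.368 at G=1.0 on the ideal collision channel.
```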