eess.SP - 2023-11-04

On Learning the Distribution of a Random Spatial Field in a Location-Unaware Mobile Sensing Setup

  • paper_url: http://arxiv.org/abs/2311.02464
  • repo_url: None
  • paper_authors: Meera Pai
  • for: The aim of this work is to learn the statistical distribution of a spatio-temporal field along a fixed one-dimensional path, as a function of spatial location, in the absence of location information.
  • methods: Samples of the spatio-temporal field are collected by a mobile sensing device, and a set of simple assumptions on the field is proposed under which its statistical distribution can be learned.
  • results: The study shows that the statistical distribution of the spatio-temporal field can be learned from the samples of a location-unaware mobile sensor, and provides a series of analytical and experimental results to support this conclusion.
    Abstract In applications like environment monitoring and pollution control, physical quantities are modeled by spatio-temporal fields. It is of interest to learn the statistical distribution of such fields as a function of space, time or both. In this work, our aim is to learn the statistical distribution of a spatio-temporal field along a fixed one-dimensional path, as a function of spatial location, in the absence of location information. Spatial field analysis, commonly done using static sensor networks, is a well-studied problem in the literature. Recently, owing to the flexibility in setting the spatial sampling density and the low hardware cost for a given spatial coverage, mobile sensors have been used for this purpose. The main challenge in using mobile sensors is their location uncertainty. Obtaining location information of samples requires additional hardware and cost. So, we consider the case when the spatio-temporal field along the fixed-length path is sampled using a simple mobile sensing device that records field values while traversing the path without any location information. We ask whether it is possible to learn the statistical distribution of the field, as a function of spatial location, using samples from the location-unaware mobile sensor under some simple assumptions on the field. We answer this question in the affirmative and provide a series of analytical and experimental results to support our claim.

Utilizing Imperfect Resolution of Near-Field Beamforming: A Hybrid-NOMA Perspective

  • paper_url: http://arxiv.org/abs/2311.02451
  • repo_url: None
  • paper_authors: Zhiguo Ding, H. Vincent Poor
  • for: This work aims to exploit the imperfect resolution of near-field beamforming to improve the throughput and connectivity of wireless networks.
  • methods: A hybrid non-orthogonal multiple access (NOMA) transmission strategy is developed that uses preconfigured near-field beams to serve additional users; an energy consumption minimization problem is then solved using different successive interference cancellation strategies.
  • results: Analytical and simulation results illustrate how the resolution of near-field beamforming affects the design of hybrid NOMA transmission and how the strategy improves network throughput and connectivity.
    Abstract This letter studies how the imperfect resolution of near-field beamforming, the key feature of near-field communications, can be used to improve the throughput and connectivity of wireless networks. In particular, a hybrid non-orthogonal multiple access (NOMA) transmission strategy is developed to use preconfigured near-field beams for serving additional users. An energy consumption minimization problem is first formulated and then solved by using different successive interference cancellation strategies. Both analytical and simulation results are presented to illustrate the impact of the resolution of near-field beamforming on the design of hybrid NOMA transmission.

Quantized-but-uncoded Distributed Detection (QDD) with Unreliable Reporting Channels

  • paper_url: http://arxiv.org/abs/2311.02447
  • repo_url: None
  • paper_authors: Lei Cao, Ramanarayanan Viswanathan
  • for: This work proposes a new distributed detection approach, quantized-but-uncoded distributed detection (QDD), to better trade off transmission power and complexity.
  • methods: Each sensor quantizes its complete observation and transmits the summarized value, instead of a codeword, to the fusion center (FC).
  • results: Comparing CDD and QDD, the study finds that QDD adapts better to transmission power constraints, albeit with more involved parameter selection; moreover, with independent observations, QDD upholds a necessary condition inherent in CDD: the optimal sensor decision rules are likelihood ratio quantizers (LRQ), irrespective of channel conditions.
    Abstract Distributed detection primarily centers around two approaches: Unquantized Distributed Detection (UDD), where each sensor reports its complete observation to the fusion center (FC), and Quantized-and-Coded DD (CDD), where each sensor first partitions the observation space and then reports to the FC a codeword. In this paper, we introduce Quantized-but-uncoded DD (QDD), where each sensor, after quantization, transmits a summarized value, instead of a codeword, to the FC. We show that QDD well adapts to the constraint of transmission power when compared to CDD, albeit with increased complexity in parameter selection. Moreover, we establish that, in the presence of independent observations, QDD upholds a necessary condition inherent in CDD. Specifically, the optimal sensor decision rules are the likelihood ratio quantizers (LRQ), irrelevant to the channel conditions. In the context of a single-sensor scenario involving binary decision at the sensor, we find that the optimal sensor rule in QDD is in general no longer "channel blind", a feature presented in CDD. In addition, we compare these systems numerically under the same transmission power and bandwidth, while assuming additive white Gaussian noise (AWGN) in both sensing and reporting stages. Finally, we present some potential directions for future research.
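The likelihood ratio quantizer at the heart of the optimality result is easy to sketch. The minimal Python example below is a toy illustration, not the paper's optimized design: the Gaussian hypotheses and the threshold values are assumptions. It quantizes an observation by comparing its log-likelihood ratio against a sorted threshold set, producing the summarized value a QDD sensor would send to the FC.

```python
import numpy as np

def log_likelihood_ratio(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """log p(x|H1) - log p(x|H0) for a Gaussian shift-in-mean problem."""
    return ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)

def lrq(x, thresholds):
    """Map an observation to a quantization level by comparing its LLR
    against sorted thresholds (K thresholds -> K+1 levels)."""
    return int(np.searchsorted(thresholds, log_likelihood_ratio(x)))

rng = np.random.default_rng(0)
obs = rng.normal(1.0, 1.0, size=5)                  # samples drawn under H1
levels = [lrq(x, thresholds=[-0.5, 0.5, 1.5]) for x in obs]
print(levels)                                        # quantized values sent to the FC
```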

PIPO-Net: A Penalty-based Independent Parameters Optimization Deep Unfolding Network

  • paper_url: http://arxiv.org/abs/2311.02443
  • repo_url: None
  • paper_authors: Xiumei Li, Zhijie Zhang, Huang Bai, Ljubiša Stanković, Junpeng Hao, Junmei Sun
  • for: reconstructing compressively sensed images
  • methods: a penalty-function optimization strategy and a high-frequency complementary block
  • results: high-accuracy reconstruction of compressively sensed images
    Abstract Compressive sensing (CS) has been widely applied in signal and image processing fields. Traditional CS reconstruction algorithms have a complete theoretical foundation but suffer from the high computational complexity, while fashionable deep network-based methods can achieve high-accuracy reconstruction of CS but are short of interpretability. These facts motivate us to develop a deep unfolding network named the penalty-based independent parameters optimization network (PIPO-Net) to combine the merits of the above mentioned two kinds of CS methods. Each module of PIPO-Net can be viewed separately as an optimization problem with respective penalty function. The main characteristic of PIPO-Net is that, in each round of training, the learnable parameters in one module are updated independently from those of other modules. This makes the network more flexible to find the optimal solutions of the corresponding problems. Moreover, the mean-subtraction sampling and the high-frequency complementary blocks are developed to improve the performance of PIPO-Net. Experiments on reconstructing CS images demonstrate the effectiveness of the proposed PIPO-Net.
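To make the deep-unfolding idea concrete, here is a minimal PyTorch sketch of one unfolded module: a gradient step on the data-fidelity term followed by a learnable soft-threshold acting as the penalty's proximal step. The per-module step size and threshold stand in for the independently trained parameters PIPO-Net optimizes; the paper's actual CNN blocks, mean-subtraction sampling, and high-frequency complementary block are omitted.

```python
import torch
import torch.nn as nn

class UnfoldingModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.1))    # learnable gradient step size
        self.theta = nn.Parameter(torch.tensor(0.01))  # learnable soft threshold

    def forward(self, x, y, A):
        # gradient descent step on ||y - A x||^2
        x = x - self.step * A.T @ (A @ x - y)
        # proximal step for an l1-style penalty (soft thresholding)
        return torch.sign(x) * torch.clamp(x.abs() - self.theta, min=0.0)

A = torch.randn(32, 128) / 128 ** 0.5    # toy sensing matrix
y = torch.randn(32, 1)                   # toy CS measurements
x = torch.zeros(128, 1)
modules = nn.ModuleList(UnfoldingModule() for _ in range(5))
for m in modules:                        # unfolded iterations, one module per round
    x = m(x, y, A)
```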

SplitMAC: Wireless Split Learning over Multiple Access Channels

  • paper_url: http://arxiv.org/abs/2311.02405
  • repo_url: None
  • paper_authors: Seonjung Kim, Yongjeong Oh, Yo-Seb Jeon
  • for: This paper presents a new split learning (SL) framework, SplitMAC, which reduces SL latency by having multiple devices simultaneously transmit their smashed data and device-side models over multiple access channels.
  • methods: Devices are divided into groups, and devices within the same group transmit simultaneously; the device-grouping problem is formulated as an optimization that minimizes SL latency.
  • results: Simulations show that the SL framework with the proposed device grouping algorithm reduces SL latency across a wide range of signal-to-noise ratio (SNR) scenarios.
    Abstract This paper presents a novel split learning (SL) framework, referred to as SplitMAC, which reduces the latency of SL by leveraging simultaneous uplink transmission over multiple access channels. The key strategy is to divide devices into multiple groups and allow the devices within the same group to simultaneously transmit their smashed data and device-side models over the multiple access channels. The optimization problem of device grouping to minimize SL latency is formulated, and the benefit of device grouping in reducing the uplink latency of SL is theoretically derived. By examining a two-device grouping case, two asymptotically-optimal algorithms are devised for device grouping in low and high signal-to-noise ratio (SNR) scenarios, respectively, while providing proofs of their optimality. By merging these algorithms, a near-optimal device grouping algorithm is proposed to cover a wide range of SNR. Simulation results demonstrate that our SL framework with the proposed device grouping algorithm is superior to existing SL frameworks in reducing SL latency.
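As a toy illustration of why grouping matters (this is not the paper's algorithm; the rates and payload below are arbitrary assumptions), consider two-device groups sharing a multiple access channel: a group finishes only when its slower device finishes, so pairing a fast device with a slow one balances the group completion times.

```python
import numpy as np

rates = np.array([12.0, 3.0, 9.0, 5.0, 20.0, 7.0])   # toy uplink rates, Mbit/s
payload = 40.0                                        # Mbit of smashed data per device

order = np.argsort(rates)                             # slowest ... fastest
pairs = list(zip(order[: len(order) // 2],            # pair slowest with fastest
                 order[::-1][: len(order) // 2]))

# groups transmit one after another; each group's time is set by its slower device
latency = sum(payload / rates[list(p)].min() for p in pairs)
print(pairs, round(latency, 2))
```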

Intelligent Reflecting Surface-Aided Wireless Communication with Movable Elements

  • paper_url: http://arxiv.org/abs/2311.02376
  • repo_url: None
  • paper_authors: Guojie Hu, Qingqing Wu, Dognhui Xu, Kui Xu, Jiangbo Si, Yunlong Cai, Naofal Al-Dhahir
  • for: This study considers intelligent reflecting surfaces (IRS) with discrete phase shifts (DPSs), adopted to reduce manufacturing and control costs; the default uniform DPS setting is problematic over general Rician fading, where the channel phase concentrates in a narrow range.
  • methods: Optimal non-uniform DPSs are designed to attain a desirable performance level. The main challenge is that, with fixed IRS element positions, phase-distribution offsets across the cascaded source-element-destination channels lead to a different optimal DPS pattern for each element and thus high manufacturing costs, especially when the number of IRS elements is large; allowing element positions to be flexibly adjusted eliminates this offset.
  • results: Simulations show that the proposal achieves a clear system performance improvement over competitive benchmarks.
    Abstract Intelligent reflecting surface (IRS) has been recognized as a powerful technology for boosting communication performance. To reduce manufacturing and control costs, it is preferable to consider discrete phase shifts (DPSs) for IRS, which are set by default as uniformly distributed in the range of $[ - \pi,\pi )$ in the literature. Such setting, however, cannot achieve a desirable performance over the general Rician fading where the channel phase concentrates in a narrow range with a higher probability. Motivated by this drawback, we in this paper design optimal non-uniform DPSs for IRS to achieve a desirable performance level. The fundamental challenge is the possible offset in phase distribution across different cascaded source-element-destination channels, if adopting conventional IRS where the position of each element is fixed. Such phenomenon leads to different patterns of optimal non-uniform DPSs for each IRS element and thus causes huge manufacturing costs especially when the number of IRS elements is large. Driven by the recently emerging fluid antenna system (or movable antenna technology), we demonstrate that if the position of each IRS element can be flexibly adjusted, the above phase distribution offset can be surprisingly eliminated, leading to the same pattern of DPSs for each IRS element. Armed with this, we then determine the form of unified non-uniform DPSs based on a low-complexity iterative algorithm. Simulations show that our proposed design significantly improves the system performance compared to competitive benchmarks.
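The core quantization step, projecting a continuous optimal phase onto a DPS codebook, can be sketched in a few lines. The uniform codebook below is the conventional default the paper criticizes; the non-uniform codebook values are illustrative placeholders standing in for the paper's optimized design.

```python
import numpy as np

def quantize_phase(phi, codebook):
    """Return the codebook entry with minimum wrapped angular distance to phi."""
    d = np.angle(np.exp(1j * (codebook - phi)))   # phase differences wrapped to (-pi, pi]
    return codebook[np.argmin(np.abs(d))]

uniform = np.linspace(-np.pi, np.pi, 8, endpoint=False)
# assumed non-uniform codebook, concentrated where the channel phase mass lies
nonuniform = np.array([-0.6, -0.3, -0.1, 0.0, 0.1, 0.3, 0.6, 1.0])

phi_opt = 0.22                                    # continuous optimal phase shift
print(quantize_phase(phi_opt, uniform), quantize_phase(phi_opt, nonuniform))
```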

A Physics based Machine Learning Model to characterize Room Temperature Semiconductor Detectors in 3D

  • paper_url: http://arxiv.org/abs/2311.02290
  • repo_url: None
  • paper_authors: Srutarshi Banerjee, Miesher Rodrigues, Manuel Ballester, Alexander H. Vija, Aggelos K. Katsaggelos
  • for: The paper aims to develop a novel physics-based machine learning (PBML) model for characterizing room temperature semiconductor radiation detectors (RTSDs) in 3D space.
  • methods: The PBML model operates on a discretized, sub-pixelated 3D volume and treats the physics-based charge transport properties (drift, trapping, detrapping, and recombination of charges) as trainable model weights, determined by backpropagating the loss function.
  • results: The proposed PBML model is the first to characterize a full 3D charge transport model of RTSDs; the trained weights stand in one-to-one correspondence with the actual physical charge transport properties in a voxelized detector.
    Abstract Room temperature semiconductor radiation detectors (RTSD) for X-ray and gamma-ray detection are vital tools for medical imaging, astrophysics and other applications. CdZnTe (CZT) has been the main RTSD for more than three decades with desired detection properties. In a typical pixelated configuration, CZT has electrodes on opposite ends. For advanced event reconstruction algorithms at sub-pixel level, detailed characterization of the RTSD is required in three dimensional (3D) space. However, 3D characterization of the material defects and charge transport properties in the sub-pixel regime is a labor-intensive process with skilled manpower and novel experimental setups. Presently, state-of-the-art characterization is done over the bulk of the RTSD considering homogeneous properties. In this paper, we propose a novel physics based machine learning (PBML) model to characterize the RTSD over an assumed discretized sub-pixelated 3D volume. Our novel approach is the first to characterize a full 3D charge transport model of the RTSD. In this work, we first discretize the RTSD between the pixelated electrodes spatially in 3D - x, y, and z. The resulting discretizations are termed voxels in 3D space. In each voxel, the different physics based charge transport properties such as drift, trapping, detrapping and recombination of charges are modeled as trainable model weights. The drift of the charges considers second order non-linear motion which is observed in practice with the RTSDs. Based on the electron-hole pair injections as input to the PBML model, and signals at the electrodes, free and trapped charges (electrons and holes) as outputs of the model, the PBML model determines the trainable weights by backpropagating the loss function. The trained weights of the model represent a one-to-one relation to the actual physical charge transport properties in a voxelized detector.

cs.SD - 2023-11-03

FiloBass: A Dataset and Corpus Based Study of Jazz Basslines

  • paper_url: http://arxiv.org/abs/2311.02023
  • repo_url: None
  • paper_authors: Xavier Riley, Simon Dixon
  • for: This paper examines the important but often overlooked role of the double bass in jazz accompaniment.
  • methods: The authors provide 48 manually verified transcriptions of professional jazz bassists together with associated metadata: audio stems, scores, performance-aligned MIDI, and markers for beats, downbeats, chord symbols, and musical form.
  • results: A corpus-based analysis of FiloBass, with a contrastive study of existing instructional methods, sheds light on key techniques in jazz bass lines.
    Abstract We present FiloBass: a novel corpus of music scores and annotations which focuses on the important but often overlooked role of the double bass in jazz accompaniment. Inspired by recent work that sheds light on the role of the soloist, we offer a collection of 48 manually verified transcriptions of professional jazz bassists, comprising over 50,000 note events, which are based on the backing tracks used in the FiloSax dataset. For each recording we provide audio stems, scores, performance-aligned MIDI and associated metadata for beats, downbeats, chord symbols and markers for musical form. We then use FiloBass to enrich our understanding of jazz bass lines, by conducting a corpus-based musical analysis with a contrastive study of existing instructional methods. Together with the original FiloSax dataset, our work represents a significant step toward a fully annotated performance dataset for a jazz quartet setting. By illuminating the critical role of the bass in jazz, this work contributes to a more nuanced and comprehensive understanding of the genre.

Acousto-optic reconstruction of exterior sound field based on concentric circle sampling with circular harmonic expansion

  • paper_url: http://arxiv.org/abs/2311.01715
  • repo_url: None
  • paper_authors: Phuc Duc Nguyen, Kenji Ishikawa, Noboru Harada, Takehiro Moriya
  • for: To provide a new exterior sound-field reconstruction method, addressing the suboptimal performance of existing reconstruction algorithms in exterior scenarios.
  • methods: The method combines concentric circle sampling with a two-dimensional exterior sound-field reconstruction approach based on circular harmonic expansions.
  • results: In both numerical simulations and practical experiments, the proposed method is more accurate than conventional reconstruction methods while using a minimal amount of measured projection data.
    Abstract Acousto-optic sensing provides an alternative approach to traditional microphone arrays by shedding light on the interaction of light with an acoustic field. Sound field reconstruction is a fascinating and advanced technique used in acousto-optics sensing. Current challenges in sound-field reconstruction methods pertain to scenarios in which the sound source is located within the reconstruction area, known as the exterior problem. Existing reconstruction algorithms, primarily designed for interior scenarios, often exhibit suboptimal performance when applied to exterior cases. This paper introduces a novel technique for exterior sound-field reconstruction. The proposed method leverages concentric circle sampling and a two-dimensional exterior sound-field reconstruction approach based on circular harmonic extensions. To evaluate the efficacy of this approach, both numerical simulations and practical experiments are conducted. The results highlight the superior accuracy of the proposed method when compared to conventional reconstruction methods, all while utilizing a minimal amount of measured projection data.
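The circular-harmonic building block is easy to illustrate: on a single sampling circle, the expansion coefficients of p(theta) = sum_n c_n exp(i n theta) follow from a DFT of equiangular samples. The numpy sketch below shows only that step; the paper's exterior reconstruction additionally models the radial dependence across the concentric circles, which is omitted here.

```python
import numpy as np

M = 64
theta = 2 * np.pi * np.arange(M) / M
p = np.cos(3 * theta) + 0.5 * np.sin(theta)        # toy pressure field on the circle

c = np.fft.fft(p) / M                              # c[k] ~ coefficient of e^{i n_k theta}
n = np.fft.fftfreq(M, d=1.0 / M).astype(int)       # harmonic orders 0..31, -32..-1

# reconstruct the field from its circular harmonics (exact on the sample grid)
p_rec = np.real(sum(c[k] * np.exp(1j * n[k] * theta) for k in range(M)))
print(np.allclose(p, p_rec))                       # True
```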

eess.AS - 2023-11-03

SE Territory: Monaural Speech Enhancement Meets the Fixed Virtual Perceptual Space Mapping

  • paper_url: http://arxiv.org/abs/2311.01679
  • repo_url: None
  • paper_authors: Xinmeng Xu, Jibin Wu, Xiaoyong Wei, Yan Liu, Richard So, Yuhong Yang, Weiping Tu, Kay Chen Tan
  • for: Improving monaural speech enhancement performance.
  • methods: Monaural speech is mapped into a fixed virtual perceptual space to better differentiate target speech from noise. A two-stage multi-task learning framework first maps the monaural input into a virtual binaural space using supervised speech mapping blocks, then uses cross-attention to capture the injected virtual spatial information and improve extraction of the target speech.
  • results: The proposed SE-TerrNet significantly surpasses recent monaural speech enhancement methods in both speech quality and intelligibility.
    Abstract Monaural speech enhancement has achieved remarkable progress recently. However, its performance has been constrained by the limited spatial cues available at a single microphone. To overcome this limitation, we introduce a strategy to map monaural speech into a fixed simulation space for better differentiation between target speech and noise. Concretely, we propose SE-TerrNet, a novel monaural speech enhancement model featuring a virtual binaural speech mapping network via a two-stage multi-task learning framework. In the first stage, monaural noisy input is projected into a virtual space using supervised speech mapping blocks, creating binaural representations. These blocks synthesize binaural noisy speech from monaural input via an ideal binaural room impulse response. The synthesized output assigns speech and noise sources to fixed directions within the perceptual space. In the second stage, the obtained binaural features from the first stage are aggregated. This aggregation aims to decrease pattern discrepancies between the mapped binaural and original monaural features, achieved by implementing an intermediate fusion module. Furthermore, this stage incorporates the utilization of cross-attention to capture the injected virtual spatial information to improve the extraction of the target speech. Empirical studies highlight the effectiveness of virtual spatial cues in enhancing monaural speech enhancement. As a result, the proposed SE-TerrNet significantly surpasses the recent monaural speech enhancement methods in terms of both speech quality and intelligibility.
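The first-stage idea, rendering a monaural signal into a fixed virtual binaural scene, reduces to convolving the signal with a binaural room impulse response (BRIR) per ear. A minimal scipy sketch follows, with decaying random noise standing in for measured or ideal BRIRs (an assumption made purely for illustration):

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
mono = np.random.randn(fs)                              # 1 s of toy monaural input
decay = np.exp(-np.arange(2048) / 300)
brir_left = np.random.randn(2048) * decay               # stand-in BRIR, left ear
brir_right = np.random.randn(2048) * decay              # stand-in BRIR, right ear

left = fftconvolve(mono, brir_left, mode="full")[:fs]
right = fftconvolve(mono, brir_right, mode="full")[:fs]
binaural = np.stack([left, right])                      # (2, T) virtual binaural signal
```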

cs.CV - 2023-11-03

EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision

  • paper_url: http://arxiv.org/abs/2311.02077
  • repo_url: https://github.com/NVlabs/EmerNeRF
  • paper_authors: Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, Yue Wang
  • for: Learning spatial-temporal representations of dynamic driving scenes.
  • methods: Grounded in neural fields, the method simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping, stratifying scenes into static and dynamic fields and parameterizing an induced flow field to aggregate multi-frame features.
  • results: The method achieves state-of-the-art performance in sensor simulation, improving over previous methods on both static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes; lifting 2D visual foundation model features into 4D space-time further boosts 3D perception performance.
    Abstract We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.
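A minimal sketch of the static/dynamic decomposition at the rendering level (the standard composition used by decomposed radiance fields, given here under that assumption; EmerNeRF's field networks, flow-based feature aggregation, and self-supervised training are omitted): densities add, and the blended color is density-weighted before volume rendering.

```python
import torch

def compose(sigma_s, rgb_s, sigma_d, rgb_d, eps=1e-6):
    """Combine static and dynamic field outputs at shared sample points."""
    sigma = sigma_s + sigma_d                           # densities add
    w_s = sigma_s / (sigma + eps)                       # static fraction per sample
    rgb = w_s[..., None] * rgb_s + (1 - w_s)[..., None] * rgb_d
    return sigma, rgb

sigma_s = torch.rand(1024); rgb_s = torch.rand(1024, 3)  # static field outputs
sigma_d = torch.rand(1024); rgb_d = torch.rand(1024, 3)  # dynamic field outputs
sigma, rgb = compose(sigma_s, rgb_s, sigma_d, rgb_d)     # fed to volume rendering
```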

Learning Historical Status Prompt for Accurate and Robust Visual Tracking

  • paper_url: http://arxiv.org/abs/2311.02072
  • repo_url: None
  • paper_authors: Wenrui Cai, Qingjie Liu, Yunhong Wang
  • for: improve tracking performance
  • methods: enhance the provision of historical information: search region features are used to introduce historical appearance information, and historical position information is used to construct refined masks of the target
  • results: experiments on LaSOT, LaSOT ext, GOT10k, and NfS show that the method outperforms all state-of-the-art approaches; the HIP module exhibits strong generality and can be seamlessly integrated into trackers to improve tracking performance
    Abstract Most trackers perform template and search region similarity matching to find the most similar object to the template during tracking. However, they struggle to make predictions when the target appearance changes due to the limited historical information introduced by roughly cropping the current search region based on the predicted result of the previous frame. In this paper, we identify that the central impediment to improving the performance of existing trackers is the incapacity to integrate abundant and effective historical information. To address this issue, we propose a Historical Information Prompter (HIP) to enhance the provision of historical information. We also build HIPTrack upon the HIP module. HIP is a plug-and-play module that makes full use of search region features to introduce historical appearance information. It also incorporates historical position information by constructing a refined mask of the target. HIP is a lightweight module to generate historical information prompts. By integrating historical information prompts, HIPTrack significantly enhances the tracking performance without the need to retrain the backbone. Experimental results demonstrate that our method outperforms all state-of-the-art approaches on LaSOT, LaSOT ext, GOT10k and NfS. Furthermore, the HIP module exhibits strong generality and can be seamlessly integrated into trackers to improve tracking performance. The source code and models will be released for further research.

LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery

  • paper_url: http://arxiv.org/abs/2311.02058
  • repo_url: None
  • paper_authors: Weikang Wan, Yifeng Zhu, Rutav Shah, Yuke Zhu
  • for: This paper presents LOTUS, a continual imitation learning algorithm that enables a physical robot to continuously and efficiently learn to solve new manipulation tasks throughout its lifespan.
  • methods: LOTUS performs continual skill discovery with an open-vocabulary vision model, updating existing skills to avoid catastrophic forgetting of previous tasks and adding new skills to solve novel vision-based manipulation tasks over the lifelong learning process.
  • results: LOTUS outperforms prior state-of-the-art baselines by over 11% in success rate, showing superior knowledge transfer ability; more results and videos are available on the project website: https://ut-austin-rpl.github.io/Lotus/.
    Abstract We introduce LOTUS, a continual imitation learning algorithm that empowers a physical robot to continuously and efficiently learn to solve new manipulation tasks throughout its lifespan. The core idea behind LOTUS is constructing an ever-growing skill library from a sequence of new tasks with a small number of human demonstrations. LOTUS starts with a continual skill discovery process using an open-vocabulary vision model, which extracts skills as recurring patterns presented in unsegmented demonstrations. Continual skill discovery updates existing skills to avoid catastrophic forgetting of previous tasks and adds new skills to solve novel tasks. LOTUS trains a meta-controller that flexibly composes various skills to tackle vision-based manipulation tasks in the lifelong learning process. Our comprehensive experiments show that LOTUS outperforms state-of-the-art baselines by over 11% in success rate, showing its superior knowledge transfer ability compared to prior methods. More results and videos can be found on the project website: https://ut-austin-rpl.github.io/Lotus/.

Occlusion-Aware 2D and 3D Centerline Detection for Urban Driving via Automatic Label Generation

  • paper_url: http://arxiv.org/abs/2311.02044
  • repo_url: None
  • paper_authors: David Paz, Narayanan E. Ranganatha, Srinidhi K. Srinivas, Yunchao Yao, Henrik I. Christensen
  • for: This work explores strategies for determining road topology information, in both 2D and 3D, under highly dynamic urban driving scenarios.
  • methods: An automatic label-generation process and an occlusion-handling strategy are proposed to model a wide range of occlusion scenarios, from mild disruptions to severe blockages; a comprehensive ablation study develops and evaluates multiple centerline detection methods, benchmarking their performance and interpretability.
  • results: The methods adapt across different sensor configurations and demonstrate strong performance and practicality in real-world scenarios; the dataset and experimental models are publicly available.
    Abstract This research work seeks to explore and identify strategies that can determine road topology information in 2D and 3D under highly dynamic urban driving scenarios. To facilitate this exploration, we introduce a substantial dataset comprising nearly one million automatically labeled data frames. A key contribution of our research lies in developing an automatic label-generation process and an occlusion handling strategy. This strategy is designed to model a wide range of occlusion scenarios, from mild disruptions to severe blockages. Furthermore, we present a comprehensive ablation study wherein multiple centerline detection methods are developed and evaluated. This analysis not only benchmarks the performance of various approaches but also provides valuable insights into the interpretability of these methods. Finally, we demonstrate the practicality of our methods and assess their adaptability across different sensor configurations, highlighting their versatility and relevance in real-world scenarios. Our dataset and experimental models are publicly available.

Towards Unsupervised Object Detection From LiDAR Point Clouds

  • paper_url: http://arxiv.org/abs/2311.02007
  • repo_url: None
  • paper_authors: Lunjun Zhang, Anqi Joyce Yang, Yuwen Xiong, Sergio Casas, Bin Yang, Mengye Ren, Raquel Urtasun
  • for: This paper studies unsupervised object detection from LiDAR point clouds in self-driving scenes.
  • methods: The method exploits (i) point clustering in near-range areas where the point clouds are dense, (ii) temporal consistency to filter out noisy unsupervised detections, (iii) the translation equivariance of CNNs to extend auto-labels to long range, and (iv) self-supervision for iterative self-improvement.
  • results: The detector works zero-shot without supervised fine-tuning, even in sparse, distant regions, and continues to self-improve; using a newly proposed planning-centric perception metric based on distance-to-collision, the unsupervised detector significantly outperforms unsupervised baselines on PandaSet and the Argoverse 2 Sensor dataset.
    Abstract In this paper, we study the problem of unsupervised object detection from 3D point clouds in self-driving scenes. We present a simple yet effective method that exploits (i) point clustering in near-range areas where the point clouds are dense, (ii) temporal consistency to filter out noisy unsupervised detections, (iii) translation equivariance of CNNs to extend the auto-labels to long range, and (iv) self-supervision for improving on its own. Our approach, OYSTER (Object Discovery via Spatio-Temporal Refinement), does not impose constraints on data collection (such as repeated traversals of the same location), is able to detect objects in a zero-shot manner without supervised finetuning (even in sparse, distant regions), and continues to self-improve given more rounds of iterative self-training. To better measure model performance in self-driving scenarios, we propose a new planning-centric perception metric based on distance-to-collision. We demonstrate that our unsupervised object detector significantly outperforms unsupervised baselines on PandaSet and Argoverse 2 Sensor dataset, showing promise that self-supervision combined with object priors can enable object discovery in the wild. For more information, visit the project website: https://waabi.ai/research/oyster
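Step (i) is straightforward to sketch. The snippet below clusters near-range points in bird's-eye view with DBSCAN as a generic density-based stand-in; the paper's exact clustering method, range cutoff, and hyperparameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(1000, 3) * 40 - 20                    # toy (x, y, z) LiDAR points
near = points[np.linalg.norm(points[:, :2], axis=1) < 15.0]   # keep the dense near range

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(near[:, :2])  # cluster in BEV

boxes = []                                    # crude box seeds, one per cluster
for k in set(labels) - {-1}:                  # label -1 marks noise points
    cluster = near[labels == k]
    boxes.append((cluster.mean(axis=0), np.ptp(cluster, axis=0)))
print(f"{len(boxes)} object candidates")
```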

A Structured Pruning Algorithm for Model-based Deep Learning

  • paper_url: http://arxiv.org/abs/2311.02003
  • repo_url: None
  • paper_authors: Chicago Park, Weijie Gan, Zihao Zou, Yuyang Hu, Zhixin Sun, Ulugbek S. Kamilov
  • for: solving imaging inverse problems while improving the computational efficiency of model-based deep learning (MBDL) networks
  • methods: a structured pruning algorithm (SPADE) removes non-essential weights from the CNNs used within MBDL networks, followed by fine-tuning of the pruned network to minimize the performance loss
  • results: SPADE substantially reduces the test-time computational complexity of MBDL networks while maintaining competitive performance
    Abstract There is a growing interest in model-based deep learning (MBDL) for solving imaging inverse problems. MBDL networks can be seen as iterative algorithms that estimate the desired image using a physical measurement model and a learned image prior specified using a convolutional neural net (CNNs). The iterative nature of MBDL networks increases the test-time computational complexity, which limits their applicability in certain large-scale applications. We address this issue by presenting structured pruning algorithm for model-based deep learning (SPADE) as the first structured pruning algorithm for MBDL networks. SPADE reduces the computational complexity of CNNs used within MBDL networks by pruning its non-essential weights. We propose three distinct strategies to fine-tune the pruned MBDL networks to minimize the performance loss. Each fine-tuning strategy has a unique benefit that depends on the presence of a pre-trained model and a high-quality ground truth. We validate SPADE on two distinct inverse problems, namely compressed sensing MRI and image super-resolution. Our results highlight that MBDL models pruned by SPADE can achieve substantial speed up in testing time while maintaining competitive performance.
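Structured pruning at the filter level can be sketched in a few lines of PyTorch: score each output channel, keep the top fraction, and rebuild a smaller layer. The L1-magnitude criterion below is a common stand-in and not necessarily SPADE's saliency measure; the three fine-tuning strategies the paper proposes are omitted.

```python
import torch
import torch.nn as nn

def prune_conv(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Keep the top-k output channels of a conv layer by filter L1 norm."""
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per filter
    k = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(norms, k).indices.sort().values       # channels to retain

    pruned = nn.Conv2d(conv.in_channels, k, conv.kernel_size,
                       conv.stride, conv.padding, bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

conv = nn.Conv2d(16, 32, 3, padding=1)
small = prune_conv(conv, keep_ratio=0.25)       # 32 -> 8 output channels
print(small.weight.shape)                       # torch.Size([8, 16, 3, 3])
```

In a full network the next layer's input channels must be sliced to match, and the pruned model is then fine-tuned, as the paper describes.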

Detection of keratoconus Diseases using deep Learning

  • paper_url: http://arxiv.org/abs/2311.01996
  • repo_url: None
  • paper_authors: AKM Enzam-Ul Haque, Golam Rabbany, Md. Siam
  • for: The goal of this study is to evaluate how well different deep learning models identify keratoconus-related disease.
  • methods: Five CNN-based deep learning architectures (DenseNet201, InceptionV3, MobileNetV2, VGG19, Xception) are compared.
  • results: The DenseNet201-based model performs best, with 89.14% accuracy, 89.51% precision, 88.75% recall, and an F1 score of 89.08% across the three classes Keratoconus, Normal, and Suspect, demonstrating the model's stability and reliability for real-world use.
    Abstract One of the most serious corneal disorders, keratoconus is difficult to diagnose in its early stages and can result in blindness. This illness, which often appears in the second decade of life, affects people of all sexes and races. Convolutional neural networks (CNNs), one of the deep learning approaches, have recently come to light as particularly promising tools for the accurate and timely diagnosis of keratoconus. The purpose of this study was to evaluate how well different D-CNN models identified keratoconus-related diseases. To be more precise, we compared five different CNN-based deep learning architectures (DenseNet201, InceptionV3, MobileNetV2, VGG19, Xception). In our comprehensive experimental analysis, the DenseNet201-based model performed very well in keratoconus disease identification. This model outperformed its D-CNN equivalents, with an astounding accuracy rate of 89.14% in three crucial classes: Keratoconus, Normal, and Suspect. The results demonstrate not only the stability and robustness of the model but also its practical usefulness in real-world applications for accurate and dependable keratoconus identification. In addition, D-CNN DenseNet201 performs extraordinarily well in terms of precision, recall rates, and F1 scores in addition to accuracy. These measures validate the model's usefulness as an effective diagnostic tool by highlighting its capacity to reliably detect instances of keratoconus and to reduce false positives and negatives.
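A minimal Keras sketch of the transfer-learning setup this kind of comparison uses: a pretrained DenseNet201 backbone with a new 3-way head for the Keratoconus / Normal / Suspect classes. The input size, optimizer, and head design below are illustrative assumptions, not the paper's exact recipe.

```python
import tensorflow as tf

base = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False                       # freeze backbone for initial training

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax"),  # Keratoconus / Normal / Suspect
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20)  # with a prepared dataset
```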

Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2311.01989
  • repo_url: None
  • paper_authors: Shichao Dong, Fayao Liu, Guosheng Lin
  • for: The goal is to adapt large-scale pretrained vision foundation models (such as the Segment-Anything Model and Contrastive Language-Image Pre-training) to the task of 3D point cloud segmentation.
  • methods: Initial 2D semantic mask predictions are made with different large-scale vision foundation models, then projected from frames of RGB-D video sequences into 3D space; a semantic label fusion strategy combines all results into unified 3D semantic pseudo labels via voting.
  • results: Experiments on the ScanNet dataset demonstrate the effectiveness of adopting general-purpose 2D foundation models for 3D point cloud segmentation.
    Abstract Recently, large-scale pre-trained models such as Segment-Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP) have demonstrated remarkable success and revolutionized the field of computer vision. These foundation vision models effectively capture knowledge from a large-scale broad data with their vast model parameters, enabling them to perform zero-shot segmentation on previously unseen data without additional training. While they showcase competence in 2D tasks, their potential for enhancing 3D scene understanding remains relatively unexplored. To this end, we present a novel framework that adapts various foundational models for the 3D point cloud segmentation task. Our approach involves making initial predictions of 2D semantic masks using different large vision models. We then project these mask predictions from various frames of RGB-D video sequences into 3D space. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting. We examine diverse scenarios, like zero-shot learning and limited guidance from sparse 2D point labels, to assess the pros and cons of different vision foundation models. Our approach is experimented on ScanNet dataset for 3D indoor scenes, and the results demonstrate the effectiveness of adopting general 2D foundation models on solving 3D point cloud segmentation tasks.
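The voting-based label fusion is simple to sketch: every projected 2D prediction casts one vote for a 3D point, and the per-point argmax yields the pseudo label. The numpy sketch below assumes the 2D-to-3D projection (camera poses, depth) has been done upstream.

```python
import numpy as np

num_points, num_classes = 10000, 20
votes = np.zeros((num_points, num_classes), dtype=np.int64)

def accumulate(point_ids: np.ndarray, labels: np.ndarray):
    """point_ids: 3D point index hit by each projected pixel; labels: its class."""
    np.add.at(votes, (point_ids, labels), 1)   # unbuffered accumulation

# toy predictions from two frames / two foundation models
accumulate(np.array([0, 1, 2]), np.array([5, 5, 7]))
accumulate(np.array([0, 2, 3]), np.array([5, 9, 7]))

pseudo_label = votes.argmax(axis=1)            # fused 3D semantic pseudo labels
unlabeled = votes.sum(axis=1) == 0             # points never observed by any frame
```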

Optimal Image Transport on Sparse Dictionaries

  • paper_url: http://arxiv.org/abs/2311.01984
  • repo_url: None
  • paper_authors: Junqing Huang, Haihui Wang, Andreas Weiermann, Michael Ruzhansky
  • for: This paper proposes an optimal image transport algorithm over sparse dictionaries for simultaneous image representation and transformation.
  • methods: Image features are encoded compactly via sparse representation over two learned dictionaries, and an optimal transport plan is then inferred between the dictionaries in accordance with the encoding process.
  • results: Experiments show efficient, high-quality image transport and transformation on tasks such as image color transfer and artistic style transfer.
    Abstract In this paper, we derive a novel optimal image transport algorithm over sparse dictionaries by taking advantage of Sparse Representation (SR) and Optimal Transport (OT). Concisely, we design a unified optimization framework in which the individual image features (color, textures, styles, etc.) are encoded using sparse representation compactly, and an optimal transport plan is then inferred between two learned dictionaries in accordance with the encoding process. This paradigm gives rise to a simple but effective way for simultaneous image representation and transformation, which is also empirically solvable because of the moderate size of sparse coding and optimal transport sub-problems. We demonstrate its versatility and many benefits to different image-to-image translation tasks, in particular image color transform and artistic style transfer, and show the plausible results for photo-realistic transferred effects.
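A minimal sketch of the transport step between two learned dictionaries, using the POT library (`pip install pot`): atoms are treated as points, marginal weights stand in for sparse-code usage frequencies (assumed uniform here), and an exact OT plan gives a barycentric mapping of each atom. Dictionary learning itself is assumed done upstream, and this is an assumption-laden stand-in for the paper's unified optimization.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

d, k = 64, 32
D_src = np.random.rand(k, d)                 # source dictionary atoms (rows)
D_dst = np.random.rand(k, d)                 # target dictionary atoms

a = np.ones(k) / k                           # atom usage weights (assumed uniform)
b = np.ones(k) / k
M = ot.dist(D_src, D_dst)                    # squared Euclidean cost matrix

plan = ot.emd(a, b, M)                       # exact optimal transport plan (k x k)
D_mapped = plan @ D_dst * k                  # barycentric mapping of each source atom
```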

Depth-guided Free-space Segmentation for a Mobile Robot

  • paper_url: http://arxiv.org/abs/2311.01966
  • repo_url: None
  • paper_authors: Christos Sevastopoulos, Joey Hussain, Stasinos Konstantopoulos, Vangelis Karkaletsis, Fillia Makedon
  • for: Accurate indoor free-space segmentation, to support autonomous navigation in complex indoor environments.
  • methods: An unsupervised masking technique generates segmentation labels from positive instances based on textural homogeneity and depth uniformity; superpixels corresponding to areas of higher depth are aligned with features extracted from a Dense Prediction Transformer (DPT), and a SegFormer model is fine-tuned on a custom-collected indoor dataset using the estimated free-space masks.
  • results: Experiments demonstrate sufficient performance in intricate scenes characterized by cluttered obstacles and challenging identification of free space.
    Abstract Accurate indoor free-space segmentation is a challenging task due to the complexity and the dynamic nature that indoor environments exhibit. We propose an indoors free-space segmentation method that associates large depth values with navigable regions. Our method leverages an unsupervised masking technique that, using positive instances, generates segmentation labels based on textural homogeneity and depth uniformity. Moreover, we generate superpixels corresponding to areas of higher depth and align them with features extracted from a Dense Prediction Transformer (DPT). Using the estimated free-space masks and the DPT feature representation, a SegFormer model is fine-tuned on our custom-collected indoor dataset. Our experiments demonstrate sufficient performance in intricate scenarios characterized by cluttered obstacles and challenging identification of free space.

ProS: Facial Omni-Representation Learning via Prototype-based Self-Distillation

  • paper_url: http://arxiv.org/abs/2311.01929
  • repo_url: None
  • paper_authors: Xing Di, Yiyu Zheng, Xiaoming Liu, Yu Cheng
  • for: This work proposes Prototype-based Self-Distillation (ProS), a novel unsupervised face representation learning method that avoids the data collection and privacy concerns of supervised methods, which rely heavily on large amounts of annotated facial training data.
  • methods: Two vision transformers (teacher and student models) are trained on differently augmented images (cropping, blurring, coloring, etc.); a face-aware retrieval system with augmentations curates images comprising predominantly facial areas, and a prototype-based matching loss aligns the similarity distributions between features and a set of learnable prototypes.
  • results: The method achieves state-of-the-art performance on attribute estimation, expression recognition, and landmark alignment, in both full and few-shot settings, and also performs well when pre-trained on synthetic face images.
    Abstract This paper presents a novel approach, called Prototype-based Self-Distillation (ProS), for unsupervised face representation learning. The existing supervised methods heavily rely on a large amount of annotated training facial data, which poses challenges in terms of data collection and privacy concerns. To address these issues, we propose ProS, which leverages a vast collection of unlabeled face images to learn a comprehensive facial omni-representation. In particular, ProS consists of two vision-transformers (teacher and student models) that are trained with different augmented images (cropping, blurring, coloring, etc.). Besides, we build a face-aware retrieval system along with augmentations to obtain the curated images comprising predominantly facial areas. To enhance the discrimination of learned features, we introduce a prototype-based matching loss that aligns the similarity distributions between features (teacher or student) and a set of learnable prototypes. After pre-training, the teacher vision transformer serves as a backbone for downstream tasks, including attribute estimation, expression recognition, and landmark alignment, achieved through simple fine-tuning with additional layers. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various tasks, both in full and few-shot settings. Furthermore, we investigate pre-training with synthetic face images, and ProS exhibits promising performance in this scenario as well.
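A minimal PyTorch sketch of a prototype-based matching loss: teacher and student features are compared to a shared set of learnable prototypes, and the student's prototype-similarity distribution is trained to match the teacher's. The temperatures and the stop-gradient on the teacher follow common self-distillation practice and are assumptions here, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prototype_matching_loss(f_student, f_teacher, prototypes, t_s=0.1, t_t=0.05):
    f_s = F.normalize(f_student, dim=-1)
    f_t = F.normalize(f_teacher, dim=-1)
    p = F.normalize(prototypes, dim=-1)

    logits_s = f_s @ p.T / t_s                          # student-prototype similarities
    with torch.no_grad():                               # teacher is not backpropagated
        target = F.softmax(f_t @ p.T / t_t, dim=-1)     # teacher distribution
    return -(target * F.log_softmax(logits_s, dim=-1)).sum(-1).mean()

feats_s = torch.randn(8, 256)                           # student features (batch of 8)
feats_t = torch.randn(8, 256)                           # teacher features
protos = torch.nn.Parameter(torch.randn(1024, 256))     # learnable prototypes
loss = prototype_matching_loss(feats_s, feats_t, protos)
```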

Contrast-Agnostic Groupwise Registration by Robust PCA for Quantitative Cardiac MRI

  • paper_url: http://arxiv.org/abs/2311.01916
  • repo_url: None
  • paper_authors: Xinqi Li, Yi Zhang, Yidong Zhao, Jan van Gemert, Qian Tao
  • for: This paper proposes a novel motion-correction framework based on robust principal component analysis (rPCA) to address the co-registration of baseline images in quantitative cardiac MRI.
  • methods: rPCA decomposes quantitative cardiac MRI into low-rank and sparse components, and a groupwise CNN-based registration backbone is integrated within the rPCA framework.
  • results: Experiments on cardiac T1 mapping with the MOLLI sequence show improved registration performance and reduced quantitative mapping error in both in-domain (pre-contrast MOLLI) and out-of-domain (post-contrast MOLLI) inference.
    Abstract Quantitative cardiac magnetic resonance imaging (MRI) is an increasingly important diagnostic tool for cardiovascular diseases. Yet, co-registration of all baseline images within the quantitative MRI sequence is essential for the accuracy and precision of quantitative maps. However, co-registering all baseline images from a quantitative cardiac MRI sequence remains a nontrivial task because of the simultaneous changes in intensity and contrast, in combination with cardiac and respiratory motion. To address the challenge, we propose a novel motion correction framework based on robust principal component analysis (rPCA) that decomposes quantitative cardiac MRI into low-rank and sparse components, and we integrate the groupwise CNN-based registration backbone within the rPCA framework. The low-rank component of rPCA corresponds to the quantitative mapping (i.e. limited degree of freedom in variation), while the sparse component corresponds to the residual motion, making it easier to formulate and solve the groupwise registration problem. We evaluated our proposed method on cardiac T1 mapping by the modified Look-Locker inversion recovery (MOLLI) sequence, both before and after the Gadolinium contrast agent administration. Our experiments showed that our method effectively improved registration performance over baseline methods without introducing rPCA, and reduced quantitative mapping error in both in-domain (pre-contrast MOLLI) and out-of-domain (post-contrast MOLLI) inference. The proposed rPCA framework is generic and can be integrated with other registration backbones.
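The rPCA decomposition itself can be sketched with the classic principal component pursuit iteration (inexact augmented Lagrangian), using the textbook default hyperparameters as assumptions. The low-rank part L captures the slowly varying contrast along the sequence and the sparse part S the residual motion; the groupwise CNN registration backbone the paper wraps around this decomposition is omitted.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def rpca(M, n_iter=200):
    lam = 1.0 / np.sqrt(max(M.shape))            # standard sparsity weight
    mu = 0.25 * M.size / np.abs(M).sum()         # standard penalty initialization
    S = np.zeros_like(M); Y = np.zeros_like(M)   # sparse part, dual variable
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)        # low-rank update
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)  # soft threshold
        Y = Y + mu * (M - L - S)                 # dual ascent
    return L, S

frames = np.random.rand(4096, 11)   # 11 baseline images, one vectorized column each
L, S = rpca(frames)
```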

End-to-End assessment of AR-assisted neurosurgery systems

  • paper_url: http://arxiv.org/abs/2311.01912
  • repo_url: None
  • paper_authors: Mahdi Bagheri, Farhad Piri, Hadi Digale, Saem Sattarzadeh, Mohammad Reza Mohammadi
  • for: This work concerns AR-assisted neurosurgery systems and the end-to-end assessment of their reliability and precision in surgical procedures.
  • methods: Different techniques for assessing an AR-assisted neurosurgery system, including registration and tracking, are explored and classified; a new technique is proposed to systematize the assessment procedure, and surgeon error in the pre- and intra-operative phases is evaluated based on the respective feedback given.
  • results: Although the system can undergo registration and tracking errors, physical feedback significantly reduces the error caused by hologram displacement; the lack of visual feedback on the hologram, however, has no significant effect on the user's 3D perception.
    Abstract Augmented Reality (AR) has emerged as a significant advancement in surgical procedures, offering a solution to the challenges posed by traditional neuronavigation methods. These conventional techniques often necessitate surgeons to split their focus between the surgical site and a separate monitor that displays guiding images. Over the years, many systems have been developed to register and track the hologram at the targeted locations, each employing its own evaluation technique. On the other hand, hologram displacement measurement is not a straightforward task because of various factors such as occlusion, Vergence-Accommodation Conflict, and unstable holograms in space. In this study, we explore and classify different techniques for assessing an AR-assisted neurosurgery system and propose a new technique to systematize the assessment procedure. Moreover, we conduct a deeper investigation to assess surgeon error in the pre- and intra-operative phases of the surgery based on the respective feedback given. We found that although the system can undergo registration and tracking errors, physical feedback can significantly reduce the error caused by hologram displacement. However, the lack of visual feedback on the hologram does not have a significant effect on the user's 3D perception.

LLM-driven Multimodal Target Volume Contouring in Radiation Oncology

  • paper_url: http://arxiv.org/abs/2311.01908
  • repo_url: None
  • paper_authors: Yujin Oh, Sangjoon Park, Hwa Kyung Byun, Jin Sung Kim, Jong Chul Ye
  • for: This study aims to apply large language models (LLMs) to integrate image and text information for the challenging task of target volume contouring in radiation therapy.
  • methods: An LLM-driven multimodal AI that utilizes clinical text information is developed and validated in the context of breast cancer radiation therapy target volume contouring.
  • results: Under external validation and data-insufficient settings, the model exhibits markedly improved performance over conventional vision-only AI models, with robust generalization and data efficiency.
    Abstract Target volume contouring for radiation therapy is considered significantly more challenging than the normal organ segmentation tasks as it necessitates the utilization of both image and text-based clinical information. Inspired by the recent advancement of large language models (LLMs) that can facilitate the integration of the textual information and images, here we present a novel LLM-driven multi-modal AI that utilizes the clinical text information and is applicable to the challenging task of target volume contouring for radiation therapy, and validate it within the context of breast cancer radiation therapy target volume contouring. Using external validation and data-insufficient environments, attributes that are highly conducive to real-world applications, we demonstrate that the proposed model exhibits markedly improved performance compared to conventional vision-only AI models, particularly exhibiting robust generalization performance and data-efficiency. To our best knowledge, this is the first LLM-driven multimodal AI model that integrates the clinical text information into target volume delineation for radiation oncology.

From Chaos to Calibration: A Geometric Mutual Information Approach to Target-Free Camera LiDAR Extrinsic Calibration

  • paper_url: http://arxiv.org/abs/2311.01905
  • repo_url: None
  • paper_authors: Jack Borer, Jeremy Tschirner, Florian Ölsner, Stefan Milz
  • for: This paper proposes a target-free extrinsic calibration algorithm, requiring no ground-truth training data, for camera-LiDAR sensor fusion in autonomous vehicles.
  • methods: The method builds on the analytical mutual information approach first proposed in 2012, demonstrating that geometric features provide a robust information metric for camera-LiDAR extrinsic calibration.
  • results: The proposed improvement is demonstrated on the KITTI and KITTI-360 fisheye datasets, accurately calibrating the camera-LiDAR extrinsic parameters.
    Abstract Sensor fusion is vital for the safe and robust operation of autonomous vehicles. Accurate extrinsic sensor to sensor calibration is necessary to accurately fuse multiple sensor's data in a common spatial reference frame. In this paper, we propose a target free extrinsic calibration algorithm that requires no ground truth training data, artificially constrained motion trajectories, hand engineered features or offline optimization and that is accurate, precise and extremely robust to initialization error. Most current research on online camera-LiDAR extrinsic calibration requires ground truth training data which is impossible to capture at scale. We revisit analytical mutual information based methods first proposed in 2012 and demonstrate that geometric features provide a robust information metric for camera-LiDAR extrinsic calibration. We demonstrate our proposed improvement using the KITTI and KITTI-360 fisheye data set.
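The mutual information objective underlying this family of methods can be sketched with a joint histogram: project LiDAR points into the image under a candidate extrinsic, then score the MI between a per-point attribute and the image intensity sampled at the projected pixels. The projection itself (camera model, valid-point masking) and the choice of attribute are assumed handled upstream; the toy data below simply mimics the correlation that appears at the true extrinsic.

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """Histogram estimate of MI between two scalar sample sets."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)           # marginal of x
    py = pxy.sum(axis=0, keepdims=True)           # marginal of y
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# toy stand-ins: at the correct extrinsic, the modalities correlate strongly
reflectivity = np.random.rand(5000)               # per-point LiDAR attribute
intensity = 0.8 * reflectivity + 0.2 * np.random.rand(5000)  # sampled image values
print(mutual_information(reflectivity, intensity))  # higher when well aligned
```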

Simulation of acquisition shifts in T2 Flair MR images to stress test AI segmentation networks

  • paper_url: http://arxiv.org/abs/2311.01894
  • repo_url: None
  • paper_authors: Christiane Posselt, Mehmet Yigit Avci, Mehmet Yigitsoy, Patrick Schünke, Christoph Kolbitsch, Tobias Schäffter, Stefanie Remmele
  • for: This study provides a simulation framework for routine neuroimaging test data that enables "stress testing" of deep segmentation networks against acquisition shifts commonly occurring in clinical practice.
  • methods: MR signal equations are used to generate "acquisition shift derivatives" of MR images that mimic clinically occurring acquisition shifts. Experiments validate the generated images against real MR scans and run example stress tests on state-of-the-art MS lesion segmentation networks to explore a generic model function describing the F1 score in dependence of the contrast-affecting parameters echo time (TE) and inversion time (TI).
  • results: Differences between real and simulated images reach up to 19% in gray and white matter for extreme parameter settings; the F1 dependency on TE and TI is well described by quadratic model functions (R^2 > 0.9), whose coefficients indicate that changes of TE influence performance more than TI.
    Abstract Purpose: To provide a simulation framework for routine neuroimaging test data, which allows for "stress testing" of deep segmentation networks against acquisition shifts that commonly occur in clinical practice for T2 weighted (T2w) fluid attenuated inversion recovery (FLAIR) Magnetic Resonance Imaging (MRI) protocols. Approach: The approach simulates "acquisition shift derivatives" of MR images based on MR signal equations. Experiments comprise the validation of the simulated images by real MR scans and example stress tests on state-of-the-art MS lesion segmentation networks to explore a generic model function to describe the F1 score in dependence of the contrast-affecting sequence parameters echo time (TE) and inversion time (TI). Results: The differences between real and simulated images range up to 19 % in gray and white matter for extreme parameter settings. For the segmentation networks under test the F1 score dependency on TE and TI can be well described by quadratic model functions (R^2 > 0.9). The coefficients of the model functions indicate that changes of TE have more influence on the model performance than TI. Conclusions: We show that these deviations are in the range of values as may be caused by erroneous or individual differences of relaxation times as described by literature. The coefficients of the F1 model function allow for quantitative comparison of the influences of TE and TI. Limitations arise mainly from tissues with the low baseline signal (like CSF) and when the protocol contains contrast-affecting measures that cannot be modelled due to missing information in the DICOM header.
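
As a rough illustration of the two ingredients above (hedged: the paper's exact derivation is not reproduced here), the sketch below rescales a FLAIR-like image from an acquired (TE0, TI0) to a simulated (TE, TI) using a textbook inversion-recovery spin-echo signal equation, and fits the quadratic F1(TE, TI) surface reported in the results. Tissue relaxation times T1, T2 and the repetition time TR are assumed known; all names are illustrative.

```python
# Sketch of (1) simulating an acquisition shift via a per-voxel signal ratio
# and (2) fitting a quadratic surface F1 ~ poly2(TE, TI), as in the results.
import numpy as np

def ir_signal(te, ti, t1, t2, tr):
    """Magnitude inversion-recovery spin-echo signal (up to proton density)."""
    return np.abs(1 - 2 * np.exp(-ti / t1) + np.exp(-tr / t1)) * np.exp(-te / t2)

def shift_image(img, te0, ti0, te, ti, t1, t2, tr=9000.0):
    """Rescale an acquired image to a simulated protocol (per-voxel ratio).
    t1, t2 may be per-voxel maps broadcastable against img."""
    return img * ir_signal(te, ti, t1, t2, tr) / (ir_signal(te0, ti0, t1, t2, tr) + 1e-8)

def fit_f1_surface(te, ti, f1):
    """Least-squares quadratic model F1(TE, TI); returns the 6 coefficients."""
    A = np.stack([np.ones_like(te), te, ti, te * ti, te**2, ti**2], axis=1)
    coef, *_ = np.linalg.lstsq(A, f1, rcond=None)
    return coef
```

Comparing the magnitudes of the TE-related and TI-related coefficients from `fit_f1_surface` is one way to quantify the finding that TE changes influence performance more than TI.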

Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image Fusion

  • paper_url: http://arxiv.org/abs/2311.01886
  • repo_url: https://github.com/ixilai/MFIF-MMIF
  • paper_authors: Xilai Li, Xiaosong Li, Tao Ye, Xiaoqi Cheng, Wuyang Liu, Haishu Tan
  • for: This work addresses a key challenge in multi-modal image fusion (MMIF): fusing multiple visible images with different focal regions together with infrared images in real-world MMIF applications.
  • methods: A semi-sparsity-based smoothing filter decomposes the images into structure and texture components, and a novel multi-scale operator fuses the texture components by detecting significant information from pixel focus attributes and relevant data across the modal images.
  • results: Extensive experiments on existing MMIF datasets, as well as object detection and depth estimation tasks, consistently show that the proposed algorithm surpasses state-of-the-art methods in visual perception and quantitative evaluation.
    Abstract Multi-modal image fusion (MMIF) integrates valuable information from different modality images into a fused one. However, the fusion of multiple visible images with different focal regions and infrared images is an unprecedented challenge in real MMIF applications. This is because of the limited depth of the focus of visible optical lenses, which impedes the simultaneous capture of the focal information within the same scene. To address this issue, in this paper, we propose an MMIF framework for joint focused integration and modalities information extraction. Specifically, a semi-sparsity-based smoothing filter is introduced to decompose the images into structure and texture components. Subsequently, a novel multi-scale operator is proposed to fuse the texture components, capable of detecting significant information by considering the pixel focus attributes and relevant data from various modal images. Additionally, to achieve an effective capture of scene luminance and reasonable contrast maintenance, we consider the distribution of energy information in the structural components in terms of multi-directional frequency variance and information entropy. Extensive experiments on existing MMIF datasets, as well as the object detection and depth estimation tasks, consistently demonstrate that the proposed algorithm can surpass the state-of-the-art methods in visual perception and quantitative evaluation. The code is available at https://github.com/ixilai/MFIF-MMIF.

An Ensemble Machine Learning Approach for Screening Covid-19 based on Urine Parameters

  • paper_url: http://arxiv.org/abs/2311.01854
  • repo_url: None
  • paper_authors: Behzad Moayedi, Abdalsamad Keramatfar, Mohammad Hadi Goldani, Mohammad Javad Fallahi, Alborz Jahangirisisakht, Mohammad Saboori, Leyla badiei
  • for: This study proposes a urine-test-strip-based COVID-19 screening method to improve screening accuracy and efficiency.
  • methods: RGB color-space parameters of urine test strips are used to assess health status; the RGB space is converted into 10 additional color spaces to improve accuracy, and a new ensemble model based on multi-layer perceptron neural networks is proposed.
  • results: Removing uncertain regions of the model space improves screening performance, ultimately reaching 80% accuracy. The results suggest urine test strips could be a useful COVID-19 screening tool, especially in resource-constrained settings where PCR testing is infeasible; further research is needed to validate these findings and explore the role of urine test strips in COVID-19 diagnosis and management.
    Abstract The rapid spread of COVID-19 and the emergence of new variants underscore the importance of effective screening measures. Rapid diagnosis and subsequent quarantine of infected individuals can prevent further spread of the virus in society. While PCR tests are the gold standard for COVID-19 diagnosis, they are costly and time-consuming. In contrast, urine test strips are an inexpensive, non-invasive, and rapidly obtainable screening method that can provide important information about a patient's health status. In this study, we collected a new dataset and used the RGB (Red Green Blue) color space of urine test strips parameters to detect the health status of individuals. To improve the accuracy of our model, we converted the RGB space to 10 additional color spaces. After evaluating four different machine learning models, we proposed a new ensemble model based on a multi-layer perceptron neural network. Although the initial results were not strong, we were able to improve the model's screening performance for COVID-19 by removing uncertain regions of the model space. Ultimately, our model achieved a screening accuracy of 80% based on urine parameters. Our results suggest that urine test strips can be a useful tool for COVID-19 screening, particularly in resource-constrained settings where PCR testing may not be feasible. Further research is needed to validate our findings and explore the potential role of urine test strips in COVID-19 diagnosis and management.
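
A minimal sketch of the described pipeline, with assumed color-space choices, model sizes, and thresholds (none of which come from the paper): each strip pad's RGB reading is expanded into extra color spaces, a small MLP ensemble is trained, and low-agreement predictions are abstained on, mirroring the removal of "uncertain regions of the model space".

```python
# Illustrative feature expansion + MLP ensemble for strip-color screening.
# Binary 0/1 labels are assumed; color spaces and sizes are assumptions.
import cv2
import numpy as np
from sklearn.neural_network import MLPClassifier

SPACES = [cv2.COLOR_RGB2HSV, cv2.COLOR_RGB2LAB, cv2.COLOR_RGB2YCrCb]

def expand_colors(rgb):                      # rgb: (n_samples, 3) uint8
    cols = [rgb.astype(np.float32)]
    px = rgb.reshape(-1, 1, 3).astype(np.uint8)
    for code in SPACES:
        cols.append(cv2.cvtColor(px, code).reshape(-1, 3).astype(np.float32))
    return np.hstack(cols)                   # (n_samples, 3 * (1 + len(SPACES)))

def train_ensemble(X, y, n_models=5):
    return [MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000,
                          random_state=seed).fit(X, y)
            for seed in range(n_models)]

def predict_vote(models, X, min_agreement=0.8):
    """Majority vote; abstain (label -1) on low-agreement 'uncertain regions'."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    frac = votes.mean(axis=0)                         # fraction voting class 1
    maj = (frac > 0.5).astype(int)
    agree = np.maximum(frac, 1 - frac)
    maj[agree < min_agreement] = -1
    return maj
```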

Holistic Representation Learning for Multitask Trajectory Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.01851
  • repo_url: None
  • paper_authors: Alexandros Stergiou, Brent De Weerdt, Nikos Deligiannis
  • for: Video anomaly detection, i.e., recognizing abnormal events in videos from skeleton sequences.
  • methods: A holistic representation of skeleton trajectories learned with multitask learning: an end-to-end attention-based encoder-decoder reconstructs any unobserved temporal segment, learning expected motions across segments at different times.
  • results: Comparisons with the state of the art on three trajectory-based datasets show the advantages and effectiveness of the approach for anomaly detection in skeleton trajectories.
    Abstract Video anomaly detection deals with the recognition of abnormal events in videos. Apart from the visual signal, video anomaly detection has also been addressed with the use of skeleton sequences. We propose a holistic representation of skeleton trajectories to learn expected motions across segments at different times. Our approach uses multitask learning to reconstruct any continuous unobserved temporal segment of the trajectory allowing the extrapolation of past or future segments and the interpolation of in-between segments. We use an end-to-end attention-based encoder-decoder. We encode temporally occluded trajectories, jointly learn latent representations of the occluded segments, and reconstruct trajectories based on expected motions across different temporal segments. Extensive experiments on three trajectory-based video anomaly detection datasets show the advantages and effectiveness of our approach with state-of-the-art results on anomaly detection in skeleton trajectories.

Multi-LiDAR Localization and Mapping Pipeline for Urban Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.01823
  • repo_url: None
  • paper_authors: Florian Sauerbeck, Dominik Kulmer, Markus Pielmeier, Maximilian Leitenstern, Christoph Weiß, Johannes Betz
  • for: This work provides an accurate and robust mapping and localization pipeline so autonomous vehicles can navigate safely and reliably in urban environments.
  • methods: The pipeline fuses four LiDAR sensors; mapping and localization build on KISS-ICP, semantic maps are generated for driving tasks such as path planning, and everything is integrated into the ROS 2 based Autoware stack.
  • results: Tests on a research vehicle demonstrate high accuracy and real-time performance, outperforming state-of-the-art approaches for the given vehicle and real-world application.
    Abstract Autonomous vehicles require accurate and robust localization and mapping algorithms to navigate safely and reliably in urban environments. We present a novel sensor fusion-based pipeline for offline mapping and online localization based on LiDAR sensors. The proposed approach leverages four LiDAR sensors. Mapping and localization algorithms are based on the KISS-ICP, enabling real-time performance and high accuracy. We introduce an approach to generate semantic maps for driving tasks such as path planning. The presented pipeline is integrated into the ROS 2 based Autoware software stack, providing a robust and flexible environment for autonomous driving applications. We show that our pipeline outperforms state-of-the-art approaches for a given research vehicle and real-world autonomous driving application.

Estimating 3D Uncertainty Field: Quantifying Uncertainty for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.01815
  • repo_url: None
  • paper_authors: Jianxiong Shen, Ruijie Ren, Adria Ruiz, Francesc Moreno-Noguer
  • for: Addressing the limitation of Neural Radiance Fields (NeRF) in quantifying uncertainty, particularly in unseen space including occluded and outside scene content, for applications in robotics.
  • methods: Propose a novel approach to estimate a 3D Uncertainty Field based on learned incomplete scene geometry, considering accumulated transmittance along each camera ray to infer 2D pixel-wise uncertainty.
  • results: Our approach is the only one that can explicitly reason about high uncertainty both on 3D unseen regions and its involved 2D rendered pixels, compared with recent methods. Our designed uncertainty field is ideally suited for real-world robotics tasks, such as next-best-view selection.
    Abstract Current methods based on Neural Radiance Fields (NeRF) significantly lack the capacity to quantify uncertainty in their predictions, particularly on the unseen space including the occluded and outside scene content. This limitation hinders their extensive applications in robotics, where the reliability of model predictions has to be considered for tasks such as robotic exploration and planning in unknown environments. To address this, we propose a novel approach to estimate a 3D Uncertainty Field based on the learned incomplete scene geometry, which explicitly identifies these unseen regions. By considering the accumulated transmittance along each camera ray, our Uncertainty Field infers 2D pixel-wise uncertainty, exhibiting high values for rays directly casting towards occluded or outside the scene content. To quantify the uncertainty on the learned surface, we model a stochastic radiance field. Our experiments demonstrate that our approach is the only one that can explicitly reason about high uncertainty both on 3D unseen regions and its involved 2D rendered pixels, compared with recent methods. Furthermore, we illustrate that our designed uncertainty field is ideally suited for real-world robotics tasks, such as next-best-view selection.
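
The transmittance argument above can be illustrated with a few lines of standard NeRF compositing. This is a sketch, not the paper's exact uncertainty field: densities are composited along each ray, and the transmittance left over after the last sample is read off as the pixel-wise uncertainty, since a ray pointing at occluded or outside-scene content is never fully absorbed by the learned geometry.

```python
# Residual transmittance along rays as a proxy for pixel-wise uncertainty.
import torch

def pixel_uncertainty(sigma, deltas):
    """sigma, deltas: (n_rays, n_samples) densities and inter-sample distances.
    Returns (n_rays,) residual transmittance in [0, 1]."""
    alpha = 1.0 - torch.exp(-sigma * deltas)             # per-sample opacity
    # Transmittance before each sample: cumulative product of (1 - alpha).
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha], dim=1), dim=1
    )
    return trans[:, -1]          # near 1 when the ray exits into unseen space
```

For a robotics task like next-best-view selection, views whose rays keep high residual transmittance would be the ones observing the most unknown 3D content.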

FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2311.01813
  • repo_url: https://github.com/llyx97/fetv
  • paper_authors: Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou
  • for: This work provides a benchmark for fine-grained evaluation of open-domain text-to-video (T2V) generation models, to improve how T2V models are evaluated and studied.
  • methods: The work (1) builds FETV, a multi-aspect benchmark that categorizes prompts by major content, attributes to control, and prompt complexity, with temporal categories tailored to video generation; (2) manually evaluates four representative T2V models to analyze their strengths and weaknesses across prompt categories; and (3) uses FETV as a testbed to assess the reliability of existing automatic T2V evaluation metrics.
  • results: Existing automatic metrics correlate poorly with human evaluation, and performance varies across prompt categories; the authors therefore explore several improvements and develop two new automatic metrics that correlate significantly better with humans.
    Abstract Recently, open-domain text-to-video (T2V) generation models have made remarkable progress. However, the promising results are mainly shown by the qualitative cases of generated videos, while the quantitative evaluation of T2V models still faces two critical problems. Firstly, existing studies lack fine-grained evaluation of T2V models on different categories of text prompts. Although some benchmarks have categorized the prompts, their categorization either only focuses on a single aspect or fails to consider the temporal information in video generation. Secondly, it is unclear whether the automatic evaluation metrics are consistent with human standards. To address these problems, we propose FETV, a benchmark for Fine-grained Evaluation of Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control and the prompt complexity. FETV is also temporal-aware, which introduces several temporal categories tailored for video generation. Based on FETV, we conduct comprehensive manual evaluations of four representative T2V models, revealing their pros and cons on different categories of prompts from different aspects. We also extend FETV as a testbed to evaluate the reliability of automatic T2V metrics. The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios. We find that existing automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human evaluation. To address this problem, we explore several solutions to improve CLIPScore and FVD, and develop two automatic metrics that exhibit significant higher correlation with humans than existing metrics. Benchmark page: https://github.com/llyx97/FETV.
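
A small sketch of the reliability check described above: correlate an automatic metric's scores with human ratings over the same set of generated videos, overall and per prompt category (FETV's multi-aspect labels). The data arrays and category labels are placeholders, not FETV's released evaluation code.

```python
# Spearman correlation between an automatic T2V metric and human ratings.
import numpy as np
from scipy.stats import spearmanr

def metric_reliability(metric_scores, human_scores, categories):
    """Correlation overall and within each prompt category."""
    overall, _ = spearmanr(metric_scores, human_scores)
    per_cat = {}
    for c in set(categories):
        idx = [i for i, cat in enumerate(categories) if cat == c]
        per_cat[c], _ = spearmanr(np.asarray(metric_scores)[idx],
                                  np.asarray(human_scores)[idx])
    return overall, per_cat
```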

inkn’hue: Enhancing Manga Colorization from Multiple Priors with Alignment Multi-Encoder VAE

  • paper_url: http://arxiv.org/abs/2311.01804
  • repo_url: https://github.com/rossiyareich/inknhue
  • paper_authors: Tawin Jiramahapokee
  • for: Mangaka (manga artists) who want to colorize their black-and-white manga artwork.
  • methods: A specialized manga colorization framework that uses a multi-encoder VAE to align shading and vibrant-coloring models, allowing clear and colorful results with the option to incorporate reference images and manual hints.
  • results: The approach achieves clear and colorful manga colorization, addressing the shortfalls of existing methods that often fail to achieve the desired results.
    Abstract Manga, a form of Japanese comics and distinct visual storytelling, has captivated readers worldwide. Traditionally presented in black and white, manga's appeal lies in its ability to convey complex narratives and emotions through intricate line art and shading. Yet, the desire to experience manga in vibrant colors has sparked the pursuit of manga colorization, a task of paramount significance for artists. However, existing methods, originally designed for line art and sketches, face challenges when applied to manga. These methods often fall short in achieving the desired results, leading to the need for specialized manga-specific solutions. Existing approaches frequently rely on a single training step or extensive manual artist intervention, which can yield less satisfactory outcomes. To address these challenges, we propose a specialized framework for manga colorization. Leveraging established models for shading and vibrant coloring, our approach aligns both using a multi-encoder VAE. This structured workflow ensures clear and colorful results, with the option to incorporate reference images and manual hints.

Generating Unbiased Pseudo-labels via a Theoretically Guaranteed Chebyshev Constraint to Unify Semi-supervised Classification and Regression

  • paper_url: http://arxiv.org/abs/2311.01782
  • repo_url: None
  • paper_authors: Jiaqi Wu, Junbiao Pang, Qingming Huang
  • for: Semi-supervised classification and regression; semi-supervised classification methods are rarely applied to regression because the threshold-to-pseudo-label (T2L) process relies on confidence, which is efficient for classification but not for regression.
  • methods: A theoretically guaranteed constraint based on Chebyshev's inequality combines multiple predictions into high-quality, unbiased pseudo-labels, realized as an Unbiased Pseudo-labels network (UBPL network) with a Feature Decorrelation loss (FD loss).
  • results: The method reaches state-of-the-art performance on the pose estimation datasets Mouse, FLIC, and LSP, and improves results on the classification datasets CIFAR10/100 and SVHN.
    Abstract Both semi-supervised classification and regression are practically challenging tasks for computer vision. However, semi-supervised classification methods are barely applied to regression tasks. Because the threshold-to-pseudo label process (T2L) in classification uses confidence to determine the quality of label. It is successful for classification tasks but inefficient for regression tasks. In nature, regression also requires unbiased methods to generate high-quality labels. On the other hand, T2L for classification often fails if the confidence is generated by a biased method. To address this issue, in this paper, we propose a theoretically guaranteed constraint for generating unbiased labels based on Chebyshev's inequality, combining multiple predictions to generate superior quality labels from several inferior ones. In terms of high-quality labels, the unbiased method naturally avoids the drawback of T2L. Specially, we propose an Unbiased Pseudo-labels network (UBPL network) with multiple branches to combine multiple predictions as pseudo-labels, where a Feature Decorrelation loss (FD loss) is proposed based on Chebyshev constraint. In principle, our method can be used for both classification and regression and can be easily extended to any semi-supervised framework, e.g. Mean Teacher, FixMatch, DualPose. Our approach achieves superior performance over SOTAs on the pose estimation datasets Mouse, FLIC and LSP, as well as the classification datasets CIFAR10/100 and SVHN.
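
As a hedged illustration of the Chebyshev constraint (the UBPL network learns the combination end-to-end; the hard filter below with its thresholds is only a stand-in): average the predictions of several branches to form a pseudo-label, and keep it only when Chebyshev's inequality P(|X − μ| ≥ kσ) ≤ 1/k² bounds the chance of a large error below a tolerance.

```python
# Chebyshev-style acceptance test for ensemble regression pseudo-labels.
import numpy as np

def chebyshev_pseudo_labels(preds, max_dev=0.05, risk=0.1):
    """preds: (n_branches, n_samples) regression outputs.
    Returns (pseudo_labels, keep_mask)."""
    mu = preds.mean(axis=0)                  # combined pseudo-label
    sigma = preds.std(axis=0, ddof=1)        # branch disagreement
    # With k = max_dev / sigma, Chebyshev gives
    # P(|X - mu| >= max_dev) <= (sigma / max_dev)^2; require this <= risk.
    bound = (sigma / max_dev) ** 2
    keep = bound <= risk
    return mu, keep
```

The same mean-of-branches pseudo-label works for classification logits, which is how one objective can serve both settings.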

CheX-Nomaly: Segmenting Lung Abnormalities from Chest Radiographs using Machine Learning

  • paper_url: http://arxiv.org/abs/2311.01777
  • repo_url: None
  • paper_authors: Sanskriti Singh
  • for: Improving the localization of abnormalities in chest radiographs (CXRs), where misdiagnosis stems largely from perceptual errors in locating abnormalities rather than from misclassification.
  • methods: A binary localization U-Net that leverages transfer learning together with an innovative contrastive learning approach, trained on the VinDr-CXR dataset covering 14 distinct diseases plus "no finding" cases, with bounding boxes dissociated from their disease class.
  • results: Dissociating the bounding boxes from disease classes significantly improves the generalizability of the abnormality localization model, which also generalizes to diseases it has not seen during training.
    Abstract The global challenge in chest radiograph X-ray (CXR) abnormalities often being misdiagnosed is primarily associated with perceptual errors, where healthcare providers struggle to accurately identify the location of abnormalities, rather than misclassification errors. We currently address this problem through disease-specific segmentation models. Unfortunately, these models cannot be released in the field due to their lack of generalizability across all thoracic diseases. A binary model tends to perform poorly when it encounters a disease that isn't represented in the dataset. We present CheX-nomaly: a binary localization U-net model that leverages transfer learning techniques with the incorporation of an innovative contrastive learning approach. Trained on the VinDr-CXR dataset, which encompasses 14 distinct diseases in addition to 'no finding' cases, my model achieves generalizability across these 14 diseases and others it has not seen before. We show that we can significantly improve the generalizability of an abnormality localization model by incorporating a contrastive learning method and dissociating the bounding boxes with its disease class. We also introduce a new loss technique to apply to enhance the U-nets performance on bounding box segmentation. By introducing CheX-nomaly, we offer a promising solution to enhance the precision of chest disease diagnosis, with a specific focus on reducing the significant number of perceptual errors in healthcare.

PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation

  • paper_url: http://arxiv.org/abs/2311.01773
  • repo_url: None
  • paper_authors: Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang YU, Tao Chen
  • for: The paper is focused on the task of novel view synthesis for large-scale outdoor scenes.
  • methods: A Point Diffusion implicit Function (PDF) learns the surface distribution of the scene to provide structural priors and reduce the samplable space; a large-scale point cloud super-resolution diffusion module enhances the sparse point cloud reconstructed from training images, and region sampling based on Mip-NeRF 360 models the background representation.
  • results: The method is effective for large-scale scene novel view synthesis, outperforming relevant state-of-the-art baselines.
    Abstract Recent advances in implicit neural representations have achieved impressive results by sampling and fusing individual points along sampling rays in the sampling space. However, due to the explosively growing sampling space, finely representing and synthesizing detailed textures remains a challenge for unbounded large-scale outdoor scenes. To alleviate the dilemma of using individual points to perceive the entire colossal space, we explore learning the surface distribution of the scene to provide structural priors and reduce the samplable space and propose a Point Diffusion implicit Function, PDF, for large-scale scene neural representation. The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud as an explicit prior. Then in the rendering stage, only sampling points with prior points within the sampling radius are retained. That is, the sampling space is reduced from the unbounded space to the scene surface. Meanwhile, to fill in the background of the scene that cannot be provided by point clouds, the region sampling based on Mip-NeRF 360 is employed to model the background representation. Expensive experiments have demonstrated the effectiveness of our method for large-scale scene novel view synthesis, which outperforms relevant state-of-the-art baselines.
    摘要 The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud as an explicit prior. Then in the rendering stage, only sampling points with prior points within the sampling radius are retained. That is, the sampling space is reduced from the unbounded space to the scene surface.Meanwhile, to fill in the background of the scene that cannot be provided by point clouds, the region sampling based on Mip-NeRF 360 is employed to model the background representation. Expensive experiments have demonstrated the effectiveness of our method for large-scale scene novel view synthesis, which outperforms relevant state-of-the-art baselines.(Note: The text has been translated into Simplified Chinese, but some grammar and wording may be adjusted to better fit the language and conventions of the target audience.)

Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection

  • paper_url: http://arxiv.org/abs/2311.01755
  • repo_url: None
  • paper_authors: Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li
  • for: This paper explores the intrinsic relationship between scene graph generation (SGG) and human-object interaction (HOI) detection, proposing SG2HOI+, a unified one-step model based on the Transformer architecture.
  • methods: Two interactive hierarchical Transformers unify SGG and HOI detection: a relation Transformer first generates relation triples from visual features, and a second Transformer-based decoder predicts human-object interactions from the generated triples.
  • results: On the Visual Genome, V-COCO, and HICO-DET benchmarks, SG2HOI+ outperforms prevalent one-step SGG models, is competitive with state-of-the-art HOI methods, and end-to-end joint training on both tasks yields substantial improvements over individualized training.
    Abstract Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks aiming at localising and recognising relationships between objects, and interactions between humans and objects, respectively. Prevailing works treat these tasks as distinct tasks, leading to the development of task-specific models tailored to individual datasets. However, we posit that the presence of visual relationships can furnish crucial contextual and intricate relational cues that significantly augment the inference of human-object interactions. This motivates us to think if there is a natural intrinsic relationship between the two tasks, where scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Concretely, we initiate a relation Transformer tasked with generating relation triples from a suite of visual features. Subsequently, we employ another transformer-based decoder to predict human-object interactions based on the generated relation triples. A comprehensive series of experiments conducted across established benchmark datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the compelling performance of our SG2HOI+ model in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that our SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements for both tasks compared to individualized training paradigms.

Data-Centric Long-Tailed Image Recognition

  • paper_url: http://arxiv.org/abs/2311.01744
  • repo_url: None
  • paper_authors: Yanbiao Ma, Licheng Jiao, Fang Liu, Shuyuan Yang, Xu Liu, Puhua Chen
  • for: Improving model performance in long-tailed recognition by enhancing both the quantity and quality of data.
  • methods: Information augmentation, analyzed through the proposed Feature Diversity Gain (FDG), which explains its effectiveness by balancing the richness and quantity of samples in the tail classes.
  • results: FDG explains why information augmentation works, and selecting augmented data by FDG further improves performance without modifying the model architecture; the paper also introduces, for the first time, the core components and fundamental tasks of a data-centric long-tail learning framework.
    Abstract In the context of the long-tail scenario, models exhibit a strong demand for high-quality data. Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance. Among these approaches, information augmentation has been progressively introduced as a crucial category. It achieves a balance in model performance by augmenting the richness and quantity of samples in the tail classes. However, there is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation methods. Consequently, the utilization of information augmentation in long-tail recognition tasks relies heavily on empirical and intricate fine-tuning. This work makes two primary contributions. Firstly, we approach the problem from the perspectives of feature diversity and distribution shift, introducing the concept of Feature Diversity Gain (FDG) to elucidate why information augmentation is effective. We find that the performance of information augmentation can be explained by FDG, and its performance peaks when FDG achieves an appropriate balance. Experimental results demonstrate that by using FDG to select augmented data, we can further enhance model performance without the need for any modifications to the model's architecture. Thus, data-centric approaches hold significant potential in the field of long-tail recognition, beyond the development of new model structures. Furthermore, we systematically introduce the core components and fundamental tasks of a data-centric long-tail learning framework for the first time. These core components guide the implementation and deployment of the system, while the corresponding fundamental tasks refine and expand the research area.

MixCon3D: Synergizing Multi-View and Cross-Modal Contrastive Learning for Enhancing 3D Representation

  • paper_url: http://arxiv.org/abs/2311.01734
  • repo_url: https://github.com/ucsc-vlaa/mixcon3d
  • paper_authors: Yipeng Gao, Zeyu Wang, Wei-Shi Zheng, Cihang Xie, Yuyin Zhou
  • for: Enhancing 3D open-world understanding across text, image, and point cloud modalities.
  • methods: Contrastive learning that combines the complementary information of 2D images and 3D point clouds, integrating multi-view 2D images to enrich the traditional tri-modal representation and strengthen text alignment, together with the first thorough study of training recipes for 3D contrastive learning.
  • results: Significant improvements over the baseline on three representative benchmarks, surpassing the previous state of the art on the challenging 1,156-category Objaverse-LVIS dataset by 5.7%, with further gains demonstrated on text-to-3D retrieval and point cloud captioning.
    Abstract Contrastive learning has emerged as a promising paradigm for 3D open-world understanding, jointly with text, image, and point cloud. In this paper, we introduce MixCon3D, which combines the complementary information between 2D images and 3D point clouds to enhance contrastive learning. With the further integration of multi-view 2D images, MixCon3D enhances the traditional tri-modal representation by offering a more accurate and comprehensive depiction of real-world 3D objects and bolstering text alignment. Additionally, we pioneer the first thorough investigation of various training recipes for the 3D contrastive learning paradigm, building a solid baseline with improved performance. Extensive experiments conducted on three representative benchmarks reveal that our method renders significant improvement over the baseline, surpassing the previous state-of-the-art performance on the challenging 1,156-category Objaverse-LVIS dataset by 5.7%. We further showcase the effectiveness of our approach in more applications, including text-to-3D retrieval and point cloud captioning. The code is available at https://github.com/UCSC-VLAA/MixCon3D.
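
A minimal sketch of the tri-modal contrastive objective described above, assuming per-view image embeddings are fused by mean pooling and each modality pair is trained with a symmetric InfoNCE loss. The encoders, fusion rule, and temperature are assumptions; this is not the released MixCon3D code.

```python
# Symmetric InfoNCE across point-cloud, fused multi-view image, and text embeddings.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a, b: (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_loss(img_views, pts_emb, txt_emb):
    """img_views: (B, V, D) per-view embeddings; pts_emb, txt_emb: (B, D)."""
    img_emb = img_views.mean(dim=1)          # fuse the multi-view images
    return (info_nce(pts_emb, txt_emb) +
            info_nce(img_emb, txt_emb) +
            info_nce(pts_emb, img_emb))
```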

Capturing Local and Global Features in Medical Images by Using Ensemble CNN-Transformer

  • paper_url: http://arxiv.org/abs/2311.01731
  • repo_url: None
  • paper_authors: Javad Mirzapour Kaleybar, Hooman Saadat, Hooman Khaloo
  • for: The paper is written for the analysis of medical images, specifically for the diagnosis of COVID-19.
  • methods: The paper proposes a new classification model called the Controllable Ensemble Transformer and CNN (CETC), which combines the strengths of CNNs and transformers to capture both local and global features in medical images.
  • results: The CETC model outperforms existing state-of-the-art models across various evaluation metrics, demonstrating its superiority in accurately and efficiently analyzing medical images for the diagnosis of COVID-19.
    Abstract This paper introduces a groundbreaking classification model called the Controllable Ensemble Transformer and CNN (CETC) for the analysis of medical images. The CETC model combines the powerful capabilities of convolutional neural networks (CNNs) and transformers to effectively capture both local and global features present in medical images. The model architecture comprises three main components: a convolutional encoder block (CEB), a transposed-convolutional decoder block (TDB), and a transformer classification block (TCB). The CEB is responsible for capturing multi-local features at different scales and draws upon components from VGGNet, ResNet, and MobileNet as backbones. By leveraging this combination, the CEB is able to effectively detect and encode local features. The TDB, on the other hand, consists of sub-decoders that decode and sum the captured features using ensemble coefficients. This enables the model to efficiently integrate the information from multiple scales. Finally, the TCB utilizes the SwT backbone and a specially designed prediction head to capture global features, ensuring a comprehensive understanding of the entire image. The paper provides detailed information on the experimental setup and implementation, including the use of transfer learning, data preprocessing techniques, and training settings. The CETC model is trained and evaluated using two publicly available COVID-19 datasets. Remarkably, the model outperforms existing state-of-the-art models across various evaluation metrics. The experimental results clearly demonstrate the superiority of the CETC model, emphasizing its potential for accurately and efficiently analyzing medical images.

EXIM: A Hybrid Explicit-Implicit Representation for Text-Guided 3D Shape Generation

  • paper_url: http://arxiv.org/abs/2311.01714
  • repo_url: https://github.com/liuzhengzhe/exim
  • paper_authors: Zhengzhe Liu, Jingyu Hu, Ka-Hei Hui, Xiaojuan Qi, Daniel Cohen-Or, Chi-Wing Fu
  • for: This paper presents a text-guided technique for generating high-quality 3D shapes.
  • methods: A hybrid explicit-implicit 3D shape representation, EXIM, combines the strengths of both: the explicit stage controls the topology and enables local modifications, while the implicit stage refines the shape and paints it with plausible colors, generating color conditioned on shape to ensure shape-color consistency.
  • results: Extensive experiments show high-fidelity 3D shapes generated from natural-language descriptions with high coherency to the input texts, without per-shape optimization or human-annotated texts; the approach also generates indoor scenes with consistent styles. Code and models are released on GitHub.
    Abstract This paper presents a new text-guided technique for generating 3D shapes. The technique leverages a hybrid 3D shape representation, namely EXIM, combining the strengths of explicit and implicit representations. Specifically, the explicit stage controls the topology of the generated 3D shapes and enables local modifications, whereas the implicit stage refines the shape and paints it with plausible colors. Also, the hybrid approach separates the shape and color and generates color conditioned on shape to ensure shape-color consistency. Unlike the existing state-of-the-art methods, we achieve high-fidelity shape generation from natural-language descriptions without the need for time-consuming per-shape optimization or reliance on human-annotated texts during training or test-time optimization. Further, we demonstrate the applicability of our approach to generate indoor scenes with consistent styles using text-induced 3D shapes. Through extensive experiments, we demonstrate the compelling quality of our results and the high coherency of our generated shapes with the input texts, surpassing the performance of existing methods by a significant margin. Codes and models are released at https://github.com/liuzhengzhe/EXIM.

Taking a PEEK into YOLOv5 for Satellite Component Recognition via Entropy-based Visual Explanations

  • paper_url: http://arxiv.org/abs/2311.01703
  • repo_url: None
  • paper_authors: Mackenzie J. Meni, Trupti Mahendrakar, Olivia D. M. Raney, Ryan T. White, Michael L. Mayo, Kevin Pilkiewicz
  • for: This work addresses the escalating risk of collisions and the accumulation of space debris in Low Earth Orbit (LEO), particularly the handling of non-cooperative and unidentified debris.
  • methods: Autonomous swarms of small chaser satellites use the You Only Look Once v5 (YOLOv5) object detection model to detect satellite components for target geometry determination and safe flight trajectory planning; because the model lacks interpretability, an information-theoretic analysis of its latent representations is introduced.
  • results: Probabilistic Explanations for Entropic Knowledge extraction (PEEK) analyzes the latent representations in the model's hidden layers; through synthetic and hardware-in-the-loop experiments, PEEK illuminates the model's decision-making, helping identify its strengths, limitations, and biases.
    Abstract The escalating risk of collisions and the accumulation of space debris in Low Earth Orbit (LEO) has reached critical concern due to the ever increasing number of spacecraft. Addressing this crisis, especially in dealing with non-cooperative and unidentified space debris, is of paramount importance. This paper contributes to efforts in enabling autonomous swarms of small chaser satellites for target geometry determination and safe flight trajectory planning for proximity operations in LEO. Our research explores on-orbit use of the You Only Look Once v5 (YOLOv5) object detection model trained to detect satellite components. While this model has shown promise, its inherent lack of interpretability hinders human understanding, a critical aspect of validating algorithms for use in safety-critical missions. To analyze the decision processes, we introduce Probabilistic Explanations for Entropic Knowledge extraction (PEEK), a method that utilizes information theoretic analysis of the latent representations within the hidden layers of the model. Through both synthetic in hardware-in-the-loop experiments, PEEK illuminates the decision-making processes of the model, helping identify its strengths, limitations and biases.
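
PEEK's exact formulation is not spelled out above, but its spirit, information-theoretic analysis of hidden-layer representations, can be sketched as a per-location Shannon entropy map over channel activations captured with a forward hook. The layer choice, binning scheme, and all names are assumption-laden stand-ins.

```python
# Entropy map over a hidden feature map of a detection network.
import numpy as np
import torch

def entropy_map(model, layer, image, bins=16):
    """image: (1, 3, H, W) tensor; layer: an nn.Module inside model."""
    feats = {}
    handle = layer.register_forward_hook(
        lambda m, i, o: feats.update(out=o.detach()))
    with torch.no_grad():
        model(image)
    handle.remove()
    fmap = feats["out"][0].cpu().numpy()     # (C, h, w) activations
    C, h, w = fmap.shape
    ent = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            p, _ = np.histogram(fmap[:, y, x], bins=bins)
            p = p / p.sum()                  # empirical channel distribution
            nz = p > 0
            ent[y, x] = -(p[nz] * np.log(p[nz])).sum()
    return ent                               # high entropy = "busy" representation
```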

Medical Image Segmentation with Domain Adaptation: A Survey

  • paper_url: http://arxiv.org/abs/2311.01702
  • repo_url: https://github.com/cchen-cc/SIFA
  • paper_authors: Yuemeng Li, Yong Fan
  • for: This survey addresses how deep learning (DL) models for medical image segmentation can generalize across datasets, mitigating the domain shift caused by differences in data distributions between sites and scanners.
  • methods: A comprehensive review of domain adaptation approaches for DL-based medical image segmentation, covering their motivation, background, method categories, and application scenarios.
  • results: The survey summarizes domain adaptation applications in medical image segmentation and discusses the challenges, limitations, and future research trends of the field.
    Abstract Deep learning (DL) has shown remarkable success in various medical imaging data analysis applications. However, it remains challenging for DL models to achieve good generalization, especially when the training and testing datasets are collected at sites with different scanners, due to domain shift caused by differences in data distributions. Domain adaptation has emerged as an effective means to address this challenge by mitigating domain gaps in medical imaging applications. In this review, we specifically focus on domain adaptation approaches for DL-based medical image segmentation. We first present the motivation and background knowledge underlying domain adaptations, then provide a comprehensive review of domain adaptation applications in medical image segmentations, and finally discuss the challenges, limitations, and future research trends in the field to promote the methodology development of domain adaptation in the context of medical image segmentation. Our goal was to provide researchers with up-to-date references on the applications of domain adaptation in medical image segmentation studies.

Universal Perturbation-based Secret Key-Controlled Data Hiding

  • paper_url: http://arxiv.org/abs/2311.01696
  • repo_url: None
  • paper_authors: Donghua Wang, Wen Yao, Tingsong Jiang, Xiaoqian Chen
  • for: This work proposes a universal-perturbation-based data-hiding method that hides multiple secret images within a single universal perturbation and extracts different secret images with a secret key-controlled decoder.
  • methods: A single optimized universal perturbation serves as the data carrier that can be added to most cover images; a secret key-controlled decoder extracts different secret images with different keys, a suppress loss prevents secret-image leakage, a robust module strengthens the decoder against corruption, and a co-joint optimization strategy finds the optimal perturbation and decoder.
  • results: Extensive experiments on different datasets demonstrate effective data hiding, and physical tests on platforms such as WeChat and Twitter verify the method's practicality.
    Abstract Deep neural networks (DNNs) are demonstrated to be vulnerable to universal perturbation, a single quasi-perceptible perturbation that can deceive the DNN on most images. However, the previous works are focused on using universal perturbation to perform adversarial attacks, while the potential usability of universal perturbation as data carriers in data hiding is less explored, especially for the key-controlled data hiding method. In this paper, we propose a novel universal perturbation-based secret key-controlled data-hiding method, realizing data hiding with a single universal perturbation and data decoding with the secret key-controlled decoder. Specifically, we optimize a single universal perturbation, which serves as a data carrier that can hide multiple secret images and be added to most cover images. Then, we devise a secret key-controlled decoder to extract different secret images from the single container image constructed by the universal perturbation by using different secret keys. Moreover, a suppress loss function is proposed to prevent the secret image from leakage. Furthermore, we adopt a robust module to boost the decoder's capability against corruption. Finally, A co-joint optimization strategy is proposed to find the optimal universal perturbation and decoder. Extensive experiments are conducted on different datasets to demonstrate the effectiveness of the proposed method. Additionally, the physical test performed on platforms (e.g., WeChat and Twitter) verifies the usability of the proposed method in practice.
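
A compact, assumption-laden sketch of the co-joint optimization: one learnable universal perturbation `delta` (a leaf tensor with `requires_grad=True`, registered in the optimizer together with the decoder parameters) hides several secrets, and a key-conditioned decoder extracts the one matching the supplied key. The architecture, the key encoding, and the omission of the suppress and robustness terms are all simplifications of what is described above.

```python
# Toy joint training step for universal-perturbation data hiding.
import torch
import torch.nn as nn

class KeyedDecoder(nn.Module):
    def __init__(self, key_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + key_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, container, key):              # key: (B, key_dim)
        k = key[:, :, None, None].expand(-1, -1, *container.shape[-2:])
        return self.net(torch.cat([container, k], dim=1))

def train_step(delta, decoder, covers, secrets, keys, opt, eps=8 / 255):
    """covers: (B,3,H,W); each secrets[i] is (1,3,H,W) paired with keys[i] (1,key_dim);
    delta: (1,3,H,W) leaf tensor included in opt's parameter groups."""
    opt.zero_grad()
    container = (covers + delta).clamp(0, 1)        # one perturbation, many covers
    loss = 0.0
    for s, k in zip(secrets, keys):                 # each key reveals its own secret
        out = decoder(container, k.expand(covers.size(0), -1))
        loss = loss + nn.functional.mse_loss(out, s.expand_as(covers))
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                     # keep delta quasi-perceptible
    return loss.item()
```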

Disentangled Representation Learning with Transmitted Information Bottleneck

  • paper_url: http://arxiv.org/abs/2311.01686
  • repo_url: None
  • paper_authors: Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Jihong Wang, Xiaojun Chang, Jingdong Wang, Qinghua Zheng
  • for: Improving model robustness and generalizability by encoding only the task-related information from raw data, i.e., disentangled representation learning.
  • methods: A new objective, DisTIB (Transmitted Information Bottleneck for Disentangled representation learning), formulates the interaction between inputs and representations via Bayesian networks with transmitted information; a tractable estimation is derived with variational inference and optimized by standard gradient descent with a reparameterization trick.
  • results: DisTIB is theoretically shown to achieve optimal disentanglement, and extensive experiments on various downstream tasks validate its appealing efficacy and the theoretical analyses.
    Abstract Encoding only the task-related information from the raw data, \ie, disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to performance drop; 2) the disentanglement constraints on representations are in complicated optimization. To these issues, we introduce Bayesian networks with transmitted information to formulate the interaction among input and representations during disentanglement. Building upon this framework, we propose \textbf{DisTIB} (\textbf{T}ransmitted \textbf{I}nformation \textbf{B}ottleneck for \textbf{Dis}entangled representation learning), a novel objective that navigates the balance between information compression and preservation. We employ variational inference to derive a tractable estimation for DisTIB. This estimation can be simply optimized via standard gradient descent with a reparameterization trick. Moreover, we theoretically prove that DisTIB can achieve optimal disentanglement, underscoring its superior efficacy. To solidify our claims, we conduct extensive experiments on various downstream tasks to demonstrate the appealing efficacy of DisTIB and validate our theoretical analyses.

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

  • paper_url: http://arxiv.org/abs/2311.01682
  • repo_url: https://github.com/haibao-yu/ffnet-vic3d
  • paper_authors: Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Ping Luo, Zaiqing Nie
  • for: Enhancing autonomous driving perception by cooperatively using ego-vehicle and infrastructure sensor data.
  • methods: Feature Flow Net (FFNet), a novel cooperative detection framework whose feature flow prediction module predicts future features to compensate for temporal asynchrony, trained with a self-supervised approach on raw infrastructure sequences.
  • results: On the DAIR-V2X dataset, FFNet outperforms existing cooperative detection methods while requiring only about 1/100 of the transmission cost of raw data and covering all latencies with one model.
    Abstract Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities. However, the uncertain temporal asynchrony and limited communication conditions can lead to fusion misalignment and constrain the exploitation of infrastructure data. To address these issues in vehicle-infrastructure cooperative 3D (VIC3D) object detection, we propose the Feature Flow Net (FFNet), a novel cooperative detection framework. FFNet is a flow-based feature fusion framework that uses a feature flow prediction module to predict future features and compensate for asynchrony. Instead of transmitting feature maps extracted from still-images, FFNet transmits feature flow, leveraging the temporal coherence of sequential infrastructure frames. Furthermore, we introduce a self-supervised training approach that enables FFNet to generate feature flow with feature prediction ability from raw infrastructure sequences. Experimental results demonstrate that our proposed method outperforms existing cooperative detection methods while only requiring about 1/100 of the transmission cost of raw data and covers all latency in one model on the DAIR-V2X dataset. The code is available at https://github.com/haibao-yu/FFNet-VIC3D.
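
The asynchrony-compensation interface can be sketched with a first-order extrapolation. FFNet learns this prediction; the linear rule below only illustrates the idea of transmitting a feature map plus its temporal rate of change and extrapolating to the ego-vehicle's timestamp before fusion. All names and the fusion rule are assumptions.

```python
# Toy feature-flow extrapolation for asynchronous vehicle-infrastructure fusion.
import torch

def predict_feature(f_prev, f_curr, t_prev, t_curr, t_query):
    """Extrapolate an infrastructure feature map (C, H, W) to timestamp t_query."""
    flow = (f_curr - f_prev) / max(t_curr - t_prev, 1e-6)  # feature change rate
    return f_curr + flow * (t_query - t_curr)

def fuse(ego_feat, infra_feat):
    """Placeholder fusion: concatenate along channels for a downstream detector."""
    return torch.cat([ego_feat, infra_feat], dim=0)
```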

Content Significance Distribution of Sub-Text Blocks in Articles and Its Application to Article-Organization Assessment

  • paper_url: http://arxiv.org/abs/2311.01673
  • repo_url: None
  • paper_authors: You Zhou, Jie Wang
  • for: Studying how to capture the content significance of sub-text blocks in an article and how it can be used for text mining tasks such as assessing article organization.
  • methods: Hugging Face's SentenceTransformer generates contextual sentence embeddings, and MoverScore over text embeddings measures how similar each sub-text block is to the entire text; an approximation algorithm avoids the exponential blow-up in the number of sub-text blocks.
  • results: The approximated CSD-1 is almost identical to the exact CSD-1; average and median CSD-1 curves share the same pattern across news, scholarly research, argument, and narrative articles, and under a certain linear transformation resemble the complement of the beta distribution's CDF for certain values of α and β. CSD-1-derived linguistic features train an SVC classifier that assesses article organization with high accuracy on student essays, and average CSD-2 curves (over sentence locations) show distinctive patterns per article type.
    Abstract We explore how to capture the significance of a sub-text block in an article and how it may be used for text mining tasks. A sub-text block is a sub-sequence of sentences in the article. We formulate the notion of content significance distribution (CSD) of sub-text blocks, referred to as CSD of the first kind and denoted by CSD-1. In particular, we leverage Hugging Face's SentenceTransformer to generate contextual sentence embeddings, and use MoverScore over text embeddings to measure how similar a sub-text block is to the entire text. To overcome the exponential blowup on the number of sub-text blocks, we present an approximation algorithm and show that the approximated CSD-1 is almost identical to the exact CSD-1. Under this approximation, we show that the average and median CSD-1's for news, scholarly research, argument, and narrative articles share the same pattern. We also show that under a certain linear transformation, the complement of the cumulative distribution function of the beta distribution with certain values of $\alpha$ and $\beta$ resembles a CSD-1 curve. We then use CSD-1's to extract linguistic features to train an SVC classifier for assessing how well an article is organized. Through experiments, we show that this method achieves high accuracy for assessing student essays. Moreover, we study CSD of sentence locations, referred to as CSD of the second kind and denoted by CSD-2, and show that average CSD-2's for different types of articles possess distinctive patterns, which either conform common perceptions of article structures or provide rectification with minor deviation.
      摘要 我们探讨如何刻画文章中子文本块的内容重要性,以及如何将其用于文本挖掘任务。子文本块是文章中句子的子序列。我们提出了子文本块内容重要性分布的概念,称为第一类 CSD,记作 CSD-1。具体而言,我们利用 Hugging Face 的 SentenceTransformer 生成上下文句子嵌入,并在文本嵌入上使用 MoverScore 衡量子文本块与整篇文章的相似程度。为克服子文本块数量的指数爆炸,我们提出一种近似算法,并证明近似得到的 CSD-1 与精确的 CSD-1 几乎一致。在该近似下,我们发现新闻、学术研究、论辩和叙事类文章的平均与中位 CSD-1 具有相同的模式;并且在某种线性变换下,取特定 $\alpha$、$\beta$ 值的 beta 分布累积分布函数的补函数与 CSD-1 曲线十分相似。随后,我们用 CSD-1 提取语言特征来训练 SVC 分类器,以评估文章的组织质量;实验表明,该方法在评估学生作文时达到了很高的准确率。此外,我们还研究了句子位置的 CSD,称为第二类 CSD,记作 CSD-2,并表明不同类型文章的平均 CSD-2 具有鲜明的模式,这些模式或与人们对文章结构的普遍认知一致,或只需少量修正。
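      To make the CSD-1 construction concrete, here is a minimal Monte-Carlo sketch. It substitutes cosine similarity of mean sentence embeddings for the paper's MoverScore, and random sampling of sub-text blocks for the paper's deterministic approximation algorithm; both substitutions are ours, made for brevity.

```python
import random
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def approx_csd1(sentences, block_len, n_samples=200):
    """Monte-Carlo estimate of the content-significance scores of
    sub-text blocks of a fixed length (in sentences)."""
    emb = model.encode(sentences)                  # (n, d) sentence embeddings
    doc_vec = emb.mean(axis=0)
    doc_vec /= np.linalg.norm(doc_vec)
    scores = []
    for _ in range(n_samples):
        start = random.randrange(len(sentences) - block_len + 1)
        block_vec = emb[start:start + block_len].mean(axis=0)
        block_vec /= np.linalg.norm(block_vec)
        # Similarity of the block to the whole text (cosine stand-in
        # for MoverScore).
        scores.append(float(doc_vec @ block_vec))
    return np.array(scores)

# Usage: approx_csd1(article_sentences, block_len=5)
```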

    Efficient Cloud Pipelines for Neural Radiance Fields

    • paper_url: http://arxiv.org/abs/2311.01659
    • repo_url: None
    • paper_authors: Derek Jacoby, Donglin Xu, Weder Ribas, Minyi Xu, Ting Liu, Vishwanath Jayaraman, Mengdi Wei, Emma De Blois, Yvonne Coady
    • for: 这篇论文是关于Neural Radiance Fields(NeRFs)的应用和实现方法的研究。
    • methods: 论文在高性能学术计算集群以及 Microsoft Azure 云管道上构建 NeRF,并对二者进行比较。
    • results: 论文阐述了 NeRF 在各种应用中的潜力,包括虚拟制作、虚拟现实以及地理空间分析中的变化检测。
      Abstract Since their introduction in 2020, Neural Radiance Fields (NeRFs) have taken the computer vision community by storm. They provide a multi-view representation of a scene or object that is ideal for eXtended Reality (XR) applications and for creative endeavors such as virtual production, as well as change detection operations in geospatial analytics. The computational cost of these generative AI models is quite high, however, and the construction of cloud pipelines to generate NeRFs is necessary to realize their potential in client applications. In this paper, we present pipelines on a high performance academic computing cluster and compare them with a pipeline implemented on Microsoft Azure. Along the way, we describe some uses of NeRFs in enabling novel user interaction scenarios.
      摘要 自 2020 年问世以来,神经辐射场(NeRF)在计算机视觉领域掀起了一股热潮。它为场景或物体提供了一种多视角表示,非常适合扩展现实(XR)应用、虚拟制作等创意工作,以及地理空间分析中的变化检测。然而,这类生成式 AI 模型的计算成本相当高,因此必须构建用于生成 NeRF 的云管道,才能在客户端应用中发挥其潜力。在本文中,我们展示了部署在高性能学术计算集群上的管道,并将其与在 Microsoft Azure 上实现的管道进行比较。在此过程中,我们还介绍了 NeRF 在实现新型用户交互场景方面的若干用途。

    Detecting Spurious Correlations via Robust Visual Concepts in Real and AI-Generated Image Classification

    • paper_url: http://arxiv.org/abs/2311.01655
    • repo_url: None
    • paper_authors: Preetam Prabhu Srikar Dammu, Chirag Shah
    • for: 检测模型中的虚假相关,提高模型的可靠性和对分布偏移的鲁棒性。
    • methods: 提出一种通用方法,只需少量人工干预即可高效检测潜在的虚假相关,并提供直观的解释。
    • results: 在 AI 生成图像上表现出色,能够检测模型中的虚假相关,且无需像素级标注。
      Abstract Often machine learning models tend to automatically learn associations present in the training data without questioning their validity or appropriateness. This undesirable property is the root cause of the manifestation of spurious correlations, which render models unreliable and prone to failure in the presence of distribution shifts. Research shows that most methods attempting to remedy spurious correlations are only effective for a model's known spurious associations. Current spurious correlation detection algorithms either rely on extensive human annotations or are too restrictive in their formulation. Moreover, they rely on strict definitions of visual artifacts that may not apply to data produced by generative models, as they are known to hallucinate contents that do not conform to standard specifications. In this work, we introduce a general-purpose method that efficiently detects potential spurious correlations, and requires significantly less human interference in comparison to the prior art. Additionally, the proposed method provides intuitive explanations while eliminating the need for pixel-level annotations. We demonstrate the proposed method's tolerance to the peculiarity of AI-generated images, which is a considerably challenging task, one where most of the existing methods fall short. Consequently, our method is also suitable for detecting spurious correlations that may propagate to downstream applications originating from generative models.
      摘要 机器学习模型常常会不加鉴别地学习训练数据中存在的关联,而不质疑其有效性或适当性。这种不良性质正是虚假相关现象的根源,它使模型变得不可靠,并在分布偏移时容易失效。研究表明,大多数试图纠正虚假相关的方法只对模型已知的虚假关联有效。现有的虚假相关检测算法要么依赖大量人工标注,要么在形式上过于受限;此外,它们依赖于对视觉伪影的严格定义,而这些定义未必适用于生成模型产生的数据,因为生成模型会"幻觉"出不符合标准规范的内容。在这项工作中,我们提出一种通用方法,能够高效检测潜在的虚假相关,且与现有方法相比所需人工干预显著更少。此外,该方法能提供直观的解释,并无需像素级标注。我们还展示了该方法对 AI 生成图像特殊性的容忍度,这是一项相当具有挑战性的任务,大多数现有方法在此都表现不佳。因此,我们的方法也适用于检测可能传播到以生成模型为源头的下游应用中的虚假相关。

    INeAT: Iterative Neural Adaptive Tomography

    • paper_url: http://arxiv.org/abs/2311.01653
    • repo_url: None
    • paper_authors: Bo Xiong, Changqing Su, Zihan Lin, You Zhou, Zhaofei Yu
    • for: 这项研究旨在提升计算机断层扫描(CT)的三维重建效果,特别是在面对 CT 扫描过程中的扰动和姿态偏移时。
    • methods: 研究基于神经辐射场的神经自适应断层成像(NeAT)方法,提出了迭代神经自适应断层成像(INeAT),利用迭代姿态优化来有效抵消扫描过程中扰动和姿态偏移的影响。
    • results: INeAT 能在姿态扰动显著的情况下实现伪影抑制、分辨率提升的 CT 重建;即便使用非稳定状态采集的数据,重建效果也与稳定状态采集相当,从而有望实现短时、低成本的 CT 技术。
      Abstract Computed Tomography (CT) with its remarkable capability for three-dimensional imaging from multiple projections, enjoys a broad range of applications in clinical diagnosis, scientific observation, and industrial detection. Neural Adaptive Tomography (NeAT) is a recently proposed 3D rendering method based on neural radiance field for CT, and it demonstrates superior performance compared to traditional methods. However, it still faces challenges when dealing with the substantial perturbations and pose shifts encountered in CT scanning processes. Here, we propose a neural rendering method for CT reconstruction, named Iterative Neural Adaptive Tomography (INeAT), which incorporates iterative posture optimization to effectively counteract the influence of posture perturbations in data, particularly in cases involving significant posture variations. Through the implementation of a posture feedback optimization strategy, INeAT iteratively refines the posture corresponding to the input images based on the reconstructed 3D volume. We demonstrate that INeAT achieves artifact-suppressed and resolution-enhanced reconstruction in scenarios with significant pose disturbances. Furthermore, we show that our INeAT maintains comparable reconstruction performance to stable-state acquisitions even using data from unstable-state acquisitions, which significantly reduces the time required for CT scanning and relaxes the stringent requirements on imaging hardware systems, underscoring its immense potential for applications in short-time and low-cost CT technology.
      摘要 计算机断层扫描(CT)凭借其从多个投影进行三维成像的卓越能力,广泛应用于临床诊断、科学观测和工业检测。神经自适应断层成像(NeAT)是近期提出的一种基于神经辐射场的 CT 三维重建方法,其性能优于传统方法,但在面对 CT 扫描过程中的较大扰动和姿态偏移时仍存在困难。为此,我们提出了一种用于 CT 重建的神经渲染方法,即迭代神经自适应断层成像(INeAT)。该方法引入迭代姿态优化,通过姿态反馈优化策略,基于重建的三维体积迭代修正输入图像对应的姿态,从而有效抵消数据中姿态扰动的影响,特别是在姿态变化显著的情况下。实验表明,INeAT 在姿态扰动显著的场景中能够实现伪影抑制和分辨率提升的重建;即使使用非稳定状态采集的数据,其重建性能也与稳定状态采集相当。这大幅缩短了 CT 扫描所需时间,并放宽了对成像硬件系统的严格要求,展示了其在短时、低成本 CT 技术中的巨大应用潜力。
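      The posture feedback strategy can be sketched as making the per-projection poses themselves trainable parameters that receive gradients from the reconstruction loss. In the toy sketch below, `render` is a placeholder standing in for a differentiable CT forward projector, and all shapes and hyperparameters are illustrative assumptions, not INeAT's implementation.

```python
import torch

def render(scene, pose):
    # Placeholder forward model standing in for a differentiable CT
    # projector (e.g. a neural-field renderer); it only demonstrates
    # that gradients reach both the scene and the pose parameters.
    return (scene.mean() * (1.0 + pose.sum())).expand(128, 128)

scene = torch.randn(32, 32, 32, requires_grad=True)       # scene parameters
pose_offsets = torch.zeros(360, 6, requires_grad=True)    # per-projection pose corrections
measured = [torch.randn(128, 128) for _ in range(360)]    # acquired projections

opt = torch.optim.Adam([scene, pose_offsets], lr=1e-3)
for step in range(1000):
    i = step % 360
    pred = render(scene, pose_offsets[i])
    loss = torch.nn.functional.mse_loss(pred, measured[i])
    opt.zero_grad()
    loss.backward()          # posture feedback: the reconstruction loss
    opt.step()               # updates the poses as well as the scene
```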

    Keypoint Description by Symmetry Assessment – Applications in Biometrics

    • paper_url: http://arxiv.org/abs/2311.01651
    • repo_url: None
    • paper_authors: Anna Mikaelyan, Fernando Alonso-Fernandez, Josef Bigun
    • for: 提出一种基于模型的特征提取器,以有限展开描述关键点周围的邻域,并用调和函数估计空间变化的方向。
    • methods: 用调和函数描述邻域形状;这些函数的等值曲线关于原点(关键点)高度对称,且估计出的参数具有明确的几何解释。
    • results: 在公开数据集(NIST SD27)上的实验表明,基于这些特征的关键点验证与识别方法可达 19% EER 的验证性能,以及 24-78%(排名 1-20)的识别性能。此外,基于近红外图像的眼周生物特征验证达到 13% EER,与现有技术相当;将该系统与纹理特征(Gabor)系统融合后,EER 进一步降至 9%,取得了可观测的性能提升。
      Abstract We present a model-based feature extractor to describe neighborhoods around keypoints by finite expansion, estimating the spatially varying orientation by harmonic functions. The iso-curves of such functions are highly symmetric w.r.t. the origin (a keypoint) and the estimated parameters have well defined geometric interpretations. The origin is also a unique singularity of all harmonic functions, helping to determine the location of a keypoint precisely, whereas the functions describe the object shape of the neighborhood. This is novel and complementary to traditional texture features which describe texture-shape properties i.e. they are purposively invariant to translation (within a texture). We report on experiments of verification and identification of keypoints in forensic fingerprints by using publicly available data (NIST SD27) and discuss the results in comparison to other studies. These support our conclusions that the novel features can equip single cores or single minutia with a significant verification power at 19% EER, and an identification power of 24-78% for ranks of 1-20. Additionally, we report verification results of periocular biometrics using near-infrared images, reaching an EER performance of 13%, which is comparable to the state of the art. More importantly, fusion of two systems, our and texture features (Gabor), result in a measurable performance improvement. We report reduction of the EER to 9%, supporting the view that the novel features capture relevant visual information, which traditional texture features do not.
      摘要 我们提出一种基于模型的特征提取器,以有限展开描述关键点周围的邻域,并用调和函数估计空间变化的方向。此类函数的等值曲线关于原点(关键点)高度对称,且估计出的参数具有明确的几何解释。原点同时是所有调和函数的唯一奇点,有助于精确确定关键点的位置,而这些函数则刻画了邻域内物体的形状。这一方法新颖,且与传统纹理特征互补:后者描述的是纹理-形状属性,即有意对(纹理内的)平移保持不变。我们在公开数据(NIST SD27)上对法医指纹关键点的验证与识别进行了实验,并与其他研究进行了对比讨论。结果支持我们的结论:新特征能赋予单个核心点或单个细节点显著的验证能力(19% EER),以及排名 1-20 下 24-78% 的识别能力。此外,我们还报告了基于近红外图像的眼周生物特征验证结果,达到 13% EER,与现有技术水平相当。更重要的是,将我们的系统与纹理特征(Gabor)系统融合,带来了可观测的性能提升:EER 降至 9%,这支持了新特征捕捉到了传统纹理特征未能捕捉的相关视觉信息这一观点。
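      In the spirit of the described extractor, one can project the double-angle orientation field of a patch onto Gaussian-weighted circular harmonic filters centred on the keypoint; the coefficient magnitudes then indicate symmetry strength and their phases an orientation. The sketch below is our simplified reading of this idea, not the authors' exact formulation.

```python
import numpy as np

def symmetry_coefficients(patch, orders=(0, 1, 2), sigma=4.0):
    """Project the squared complex gradient (double-angle orientation
    field) of a patch onto Gaussian-weighted harmonic filters z**n
    centred on the keypoint (the patch centre)."""
    gy, gx = np.gradient(patch.astype(float))
    g2 = (gx + 1j * gy) ** 2                      # orientation doubling
    h, w = patch.shape
    y, x = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    z = x + 1j * y
    weight = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    coeffs = {}
    for n in orders:
        basis = z ** n
        c = np.sum(g2 * np.conj(basis) * weight)
        norm = np.sqrt(np.sum(np.abs(basis) ** 2 * weight) *
                       np.sum(np.abs(g2) ** 2 * weight)) + 1e-12
        # Magnitude ~ strength of the n-fold symmetry pattern,
        # angle ~ its orientation.
        coeffs[n] = c / norm
    return coeffs

print(symmetry_coefficients(np.random.rand(33, 33)))
```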

    SemiGPC: Distribution-Aware Label Refinement for Imbalanced Semi-Supervised Learning Using Gaussian Processes

    • paper_url: http://arxiv.org/abs/2311.01646
    • repo_url: None
    • paper_authors: Abdelhak Lemkhenter, Manchen Wang, Luca Zancato, Gurumurthy Swaminathan, Paolo Favaro, Davide Modolo
    • for: 这篇论文研究半监督学习,提出一种基于高斯过程的分布感知标签精炼策略,以提升模型在类别不均衡数据下的性能。
    • methods: 该策略由高斯过程的标签后验分布导出预测,并包含一个归一化项来处理全局数据分布的不均衡,同时保持局部敏感性。
    • results: 与 FixMatch、ReMixMatch、SimMatch 等半监督方法及 MSN、Dino 等预训练策略结合时,SemiGPC 均能提升性能,在低数据量情况下尤为明显。此外,SemiGPC 在标准 CIFAR10-LT/CIFAR100-LT 上不同程度的类别不均衡设置下均取得了最优结果。
      Abstract In this paper we introduce SemiGPC, a distribution-aware label refinement strategy based on Gaussian Processes where the predictions of the model are derived from the labels posterior distribution. Differently from other buffer-based semi-supervised methods such as CoMatch and SimMatch, our SemiGPC includes a normalization term that addresses imbalances in the global data distribution while maintaining local sensitivity. This explicit control allows SemiGPC to be more robust to confirmation bias especially under class imbalance. We show that SemiGPC improves performance when paired with different Semi-Supervised methods such as FixMatch, ReMixMatch, SimMatch and FreeMatch and different pre-training strategies including MSN and Dino. We also show that SemiGPC achieves state of the art results under different degrees of class imbalance on standard CIFAR10-LT/CIFAR100-LT especially in the low data-regime. Using SemiGPC also results in about 2% avg.accuracy increase compared to a new competitive baseline on the more challenging benchmarks SemiAves, SemiCUB, SemiFungi and Semi-iNat.
      摘要 本文介绍 SemiGPC,一种基于高斯过程的分布感知标签精炼策略,模型的预测由标签的后验分布导出。与 CoMatch、SimMatch 等其他基于缓冲区的半监督方法不同,SemiGPC 包含一个归一化项,用于应对全局数据分布的不均衡,同时保持局部敏感性。这种显式控制使 SemiGPC 对确认偏差更加鲁棒,尤其是在类别不均衡的情况下。我们展示了 SemiGPC 与 FixMatch、ReMixMatch、SimMatch、FreeMatch 等不同半监督方法以及 MSN、Dino 等不同预训练策略结合时均能提升性能。我们还表明,SemiGPC 在标准 CIFAR10-LT/CIFAR100-LT 上不同程度的类别不均衡下取得了最优结果,在低数据量情形下尤为突出。在更具挑战性的 SemiAves、SemiCUB、SemiFungi 和 Semi-iNat 基准上,使用 SemiGPC 相比一个新的有竞争力的基线,平均准确率提升约 2%。
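      A rough intuition for the distribution-aware refinement is a kernel-weighted label posterior over a labeled buffer, divided by the buffer's class marginals so that globally over-represented classes do not dominate. The sketch below uses a Nadaraya-Watson-style RBF surrogate in place of the paper's Gaussian-process posterior; the surrogate and all names are our assumptions.

```python
import numpy as np

def refine_labels(z_u, z_l, y_l, n_classes, gamma=10.0):
    """Distribution-aware refinement of pseudo-labels (simplified sketch).

    z_u: (m, d) embeddings of unlabeled samples
    z_l: (n, d) embeddings of a labeled buffer
    y_l: (n,)   buffer labels
    """
    # RBF kernel between unlabeled points and the buffer.
    d2 = ((z_u[:, None, :] - z_l[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                       # (m, n)
    onehot = np.eye(n_classes)[y_l]               # (n, k)
    post = K @ onehot                             # unnormalized class posterior
    # Normalization term: divide by the buffer's class marginals so that
    # an imbalanced global label distribution does not dominate.
    marginals = onehot.mean(axis=0) + 1e-8
    post = post / marginals
    return post / post.sum(axis=1, keepdims=True)

z_u, z_l = np.random.randn(5, 16), np.random.randn(100, 16)
print(refine_labels(z_u, z_l, np.random.randint(0, 3, 100), n_classes=3))
```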

cs.AI - 2023-11-03

Post Turing: Mapping the landscape of LLM Evaluation

  • paper_url: http://arxiv.org/abs/2311.02049
  • repo_url: None
  • paper_authors: Alexey Tikhonov, Ivan P. Yamshchikov
  • for: 这篇论文旨在梳理大语言模型(LLM)评估方法的发展历程,从阿兰·图灵的奠基性问题到现代人工智能研究。
  • methods: 论文将 LLM 的发展划分为不同时期,每个时期都有其独特的基准和评估标准。随着 LLM 日益模仿人类行为,图灵测试等传统评估代理指标变得不再可靠。
  • results: 论文强调,鉴于 LLM 更广泛的社会影响,需要一个统一的评估体系。通过分析常见的评估方法,论文强调了标准化和客观标准的重要性,以确保 LLM 的可靠性、公平性和社会效益。
    Abstract In the rapidly evolving landscape of Large Language Models (LLMs), introduction of well-defined and standardized evaluation methodologies remains a crucial challenge. This paper traces the historical trajectory of LLM evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research. We categorize the evolution of LLMs into distinct periods, each characterized by its unique benchmarks and evaluation criteria. As LLMs increasingly mimic human-like behaviors, traditional evaluation proxies, such as the Turing test, have become less reliable. We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models. Through an analysis of common evaluation methodologies, we advocate for a qualitative shift in assessment approaches, underscoring the importance of standardization and objective criteria. This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
    摘要 在大语言模型(LLM)的快速演进中,建立定义明确且标准化的评估方法仍是一项核心挑战。本文回顾了 LLM 评估的历史轨迹,从阿兰·图灵提出的奠基性问题直到现代人工智能研究。我们将 LLM 的发展划分为不同时期,每个时期都有其独特的基准和评估标准。随着 LLM 日益模仿人类行为,图灵测试等传统评估代理指标变得不再可靠。鉴于这些模型更广泛的社会影响,我们强调建立统一评估体系的迫切需要。通过分析常见的评估方法,我们主张评估方式的质变,并强调标准化和客观标准的重要性。本文呼吁人工智能社区协力应对 LLM 评估的挑战,以确保其可靠性、公平性和社会效益。

Quantum circuit synthesis with diffusion models

  • paper_url: http://arxiv.org/abs/2311.02041
  • repo_url: https://github.com/florianfuerrutter/genqc
  • paper_authors: Florian Fürrutter, Gorka Muñoz-Gil, Hans J. Briegel
  • for: 本研究使用生成式机器学习模型,具体来说是去噪扩散模型(DM),将量子操作转化为可行的物理实现。
  • methods: 研究通过文本条件控制 DM,使其在基于门的量子电路中生成所需的量子操作。这种方法使训练过程得以绕过经典模拟量子动力学的指数级开销,从而提高效率。
  • results: 研究表明,DM 在纠缠态生成和酉算符编译两项任务中表现出色。模型能生成新的电路,并支持掩码与编辑等典型的 DM 扩展,以适配目标量子设备的约束。凭借其灵活性和泛化能力,我们认为 DM 将在量子电路合成中扮演重要角色,推动实际应用并加深对理论量子计算的理解。
    Abstract Quantum computing has recently emerged as a transformative technology. Yet, its promised advantages rely on efficiently translating quantum operations into viable physical realizations. In this work, we use generative machine learning models, specifically denoising diffusion models (DMs), to facilitate this transformation. Leveraging text-conditioning, we steer the model to produce desired quantum operations within gate-based quantum circuits. Notably, DMs allow to sidestep during training the exponential overhead inherent in the classical simulation of quantum dynamics -- a consistent bottleneck in preceding ML techniques. We demonstrate the model's capabilities across two tasks: entanglement generation and unitary compilation. The model excels at generating new circuits and supports typical DM extensions such as masking and editing to, for instance, align the circuit generation to the constraints of the targeted quantum device. Given their flexibility and generalization abilities, we envision DMs as pivotal in quantum circuit synthesis, enhancing both practical applications but also insights into theoretical quantum computation.
    摘要 量子计算近来已成为一项变革性技术,但其承诺的优势依赖于将量子操作高效地转化为可行的物理实现。在这项工作中,我们使用生成式机器学习模型,具体而言是去噪扩散模型(DM),来完成这一转化。借助文本条件控制,我们引导模型在基于门的量子电路中生成所需的量子操作。值得注意的是,DM 使训练过程得以绕过经典模拟量子动力学所固有的指数级开销,而这正是此前机器学习方法的一贯瓶颈。我们在两项任务上展示了模型的能力:纠缠态生成与酉算符编译。该模型擅长生成新电路,并支持掩码与编辑等典型的 DM 扩展,例如使电路生成符合目标量子设备的约束。鉴于其灵活性和泛化能力,我们预期 DM 将在量子电路合成中发挥关键作用,既推动实际应用,也加深对理论量子计算的理解。

VQPy: An Object-Oriented Approach to Modern Video Analytics

  • paper_url: http://arxiv.org/abs/2311.01623
  • repo_url: https://github.com/vqpy/vqpy
  • paper_authors: Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, Harry Xu
  • for: 视频查询是开发视频分析系统与服务的核心手段之一,用户可以通过这类查询在视频中找到感兴趣的对象。
  • methods: 该方法基于一个观察:视频对象(如人、动物、车辆等)在本质上与传统面向对象语言所建模的对象相似,由此提出一种面向对象的视频分析方法,名为 VQPy。它包括一个便于用户表达视频对象及其交互的前端(一种 Python 变体),以及一个可自动构建并优化查询管道的可扩展后端。
  • results: 我们实现并开源了 VQPy,它已在 Cisco 被产品化,成为其 DeepVision 框架的一部分。
    Abstract Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend$\unicode{x2015}$a Python variant with constructs that make it easy for users to express video objects and their interactions$\unicode{x2015}$as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
    摘要 视频分析广泛应用于现代系统与服务之中,而用户为寻找感兴趣对象而编写的视频查询则处于视频分析的最前沿。视频对象(如人、动物、车辆等)是视频分析的核心,它们在本质上与传统面向对象语言所建模的对象相似;基于这一洞察,我们提出一种面向对象的视频分析方法,名为 VQPy。它由两部分组成:一个前端,即带有便于用户表达视频对象及其交互的语言构造的 Python 变体;以及一个可扩展后端,能够基于视频对象自动构建并优化管道。我们已实现并开源了 VQPy,它在 Cisco 被产品化,作为其 DeepVision 框架的一部分。
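    To convey the flavour of the object-oriented approach, here is a deliberately simplified, hypothetical API sketch. It is not the actual VQPy interface (see the linked repository for that); the class and method names below are invented for illustration.

```python
# Hypothetical, simplified API inspired by the paper's description --
# not the real VQPy interface.

class VObj:
    """Base class for a video object tracked across frames."""
    def __init__(self):
        self.history = []          # per-frame (x, y) position snapshots

class Person(VObj):
    def speed(self):
        # Derived property computed from the object's track history.
        if len(self.history) < 2:
            return 0.0
        (x0, y0), (x1, y1) = self.history[-2], self.history[-1]
        return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5

def loitering_query(people, speed_thresh=0.5, min_frames=100):
    """Declarative-style query: people who stay nearly still for a long time."""
    return [p for p in people
            if len(p.history) >= min_frames and p.speed() < speed_thresh]
```

A backend in this style could inspect which properties a query touches (here, only positions over time) and build a detection-plus-tracking pipeline accordingly.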

APRICOT: Acuity Prediction in Intensive Care Unit (ICU): Predicting Stability, Transitions, and Life-Sustaining Therapies

  • paper_url: http://arxiv.org/abs/2311.02026
  • repo_url: None
  • paper_authors: Miguel Contreras, Brandon Silva, Benjamin Shickel, Tezcan Ozrazgat Baslanti, Yuanfang Ren, Ziyuan Guan, Sabyasachi Bandyopadhyay, Kia Khezeli, Azra Bihorac, Parisa Rashidi
  • for: 本研究旨在开发一个基于 Transformer 神经网络的模型,用于实时评估 ICU 病人的病情严重程度,以便实时监测病情并帮助临床医生及时干预。
  • methods: 研究使用三个大型数据集进行开发与验证:University of Florida Health (UFH)、eICU Collaborative Research Database (eICU) 和 Medical Information Mart for Intensive Care (MIMIC)-IV;模型基于 Transformer 神经网络,并进行了外部、时间和前瞻性验证。
  • results: 模型的表现与当前最优方法相当,并能预测病人对呼吸机、升压药等生命维持治疗的需求。这些结果表明 APRICOT 可帮助医生实时监测病情,并为及时干预提供有用信息。
    Abstract The acuity state of patients in the intensive care unit (ICU) can quickly change from stable to unstable, sometimes leading to life-threatening conditions. Early detection of deteriorating conditions can result in providing more timely interventions and improved survival rates. Current approaches rely on manual daily assessments. Some data-driven approaches have been developed, that use mortality as a proxy of acuity in the ICU. However, these methods do not integrate acuity states to determine the stability of a patient or the need for life-sustaining therapies. In this study, we propose APRICOT (Acuity Prediction in Intensive Care Unit), a Transformer-based neural network to predict acuity state in real-time in ICU patients. We develop and extensively validate externally, temporally, and prospectively the APRICOT model on three large datasets: University of Florida Health (UFH), eICU Collaborative Research Database (eICU), and Medical Information Mart for Intensive Care (MIMIC)-IV. The performance of APRICOT shows comparable results to state-of-the-art mortality prediction models (external AUROC 0.93-0.93, temporal AUROC 0.96-0.98, and prospective AUROC 0.98) as well as acuity prediction models (external AUROC 0.80-0.81, temporal AUROC 0.77-0.78, and prospective AUROC 0.87). Furthermore, APRICOT can make predictions for the need for life-sustaining therapies, showing comparable results to state-of-the-art ventilation prediction models (external AUROC 0.80-0.81, temporal AUROC 0.87-0.88, and prospective AUROC 0.85), and vasopressor prediction models (external AUROC 0.82-0.83, temporal AUROC 0.73-0.75, prospective AUROC 0.87). This tool allows for real-time acuity monitoring of a patient and can provide helpful information to clinicians to make timely interventions. Furthermore, the model can suggest life-sustaining therapies that the patient might need in the next hours in the ICU.
    摘要 ICU 病人的病情可能迅速由稳定转为不稳定,有时甚至危及生命。及早发现病情恶化可以带来更及时的干预并提高生存率。现有方法依赖人工的每日评估;虽然已有一些以死亡率作为 ICU 病情严重程度代理指标的数据驱动方法,但它们并未整合病情状态来判断病人的稳定性或对生命维持治疗的需求。本研究提出 APRICOT(ICU 病情严重程度预测),一个基于 Transformer 的神经网络,用于实时预测 ICU 病人的病情状态。我们在 UFH、eICU 和 MIMIC-IV 三个大型数据集上开发了 APRICOT,并进行了外部、时间和前瞻性的充分验证。APRICOT 的表现与最新的死亡预测模型(外部 AUROC 0.93-0.93,时间 AUROC 0.96-0.98,前瞻 AUROC 0.98)以及病情严重程度预测模型(外部 AUROC 0.80-0.81,时间 AUROC 0.77-0.78,前瞻 AUROC 0.87)相当。此外,APRICOT 还能预测对生命维持治疗的需求,其结果与最新的呼吸机预测模型(外部 AUROC 0.80-0.81,时间 AUROC 0.87-0.88,前瞻 AUROC 0.85)和升压药预测模型(外部 AUROC 0.82-0.83,时间 AUROC 0.73-0.75,前瞻 AUROC 0.87)相当。该工具可对病人病情进行实时监测,为临床医生及时干预提供有用信息,并能提示病人在 ICU 中未来数小时内可能需要的生命维持治疗。
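    Architecturally, the described model can be pictured as a transformer encoder over a patient's time series of measurements with heads for acuity and therapy needs. The sketch below is a generic stand-in with illustrative dimensions; APRICOT's actual feature set, depth, and output heads are not specified here.

```python
import torch
import torch.nn as nn

class AcuityTransformer(nn.Module):
    """Sketch of a transformer encoder over hourly ICU measurements.
    Feature count, depth, and heads are illustrative, not APRICOT's."""
    def __init__(self, n_features=32, d_model=64, n_outputs=3):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Example heads: acuity state, ventilation need, vasopressor need.
        self.head = nn.Linear(d_model, n_outputs)

    def forward(self, x):                 # x: (batch, hours, n_features)
        h = self.encoder(self.proj(x))
        return self.head(h[:, -1])        # predict from the latest time step

logits = AcuityTransformer()(torch.randn(8, 24, 32))   # 8 patients, 24 hours
```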

Active Reasoning in an Open-World Environment

  • paper_url: http://arxiv.org/abs/2311.02018
  • repo_url: None
  • paper_authors: Manjie Xu, Guangyuan Jiang, Wei Liang, Chi Zhang, Yixin Zhu
  • for: 这篇论文旨在提出一个交互式开放世界环境,用于评估主动推理能力。
  • methods: 论文构建了名为 $Conan$ 的交互式开放世界环境,促使智能体进行主动探索和多轮溯因推理。
  • results: 对 $Conan$ 的分析揭示了许多最先进模型在主动探索和理解复杂情境方面的不足。论文还探讨了"由演绎实现溯因",即利用贝叶斯规则将溯因问题重新表述为演绎过程,并在 $Conan$ 中加以实现。
    Abstract Recent advances in vision-language learning have achieved notable success on complete-information question-answering datasets through the integration of extensive world knowledge. Yet, most models operate passively, responding to questions based on pre-stored knowledge. In stark contrast, humans possess the ability to actively explore, accumulate, and reason using both newfound and existing information to tackle incomplete-information questions. In response to this gap, we introduce $Conan$, an interactive open-world environment devised for the assessment of active reasoning. $Conan$ facilitates active exploration and promotes multi-round abductive inference, reminiscent of rich, open-world settings like Minecraft. Diverging from previous works that lean primarily on single-round deduction via instruction following, $Conan$ compels agents to actively interact with their surroundings, amalgamating new evidence with prior knowledge to elucidate events from incomplete observations. Our analysis on $Conan$ underscores the shortcomings of contemporary state-of-the-art models in active exploration and understanding complex scenarios. Additionally, we explore Abduction from Deduction, where agents harness Bayesian rules to recast the challenge of abduction as a deductive process. Through $Conan$, we aim to galvanize advancements in active reasoning and set the stage for the next generation of artificial intelligence agents adept at dynamically engaging in environments.
    摘要 视觉-语言学习的最新进展通过整合大量世界知识,在完整信息问答数据集上取得了显著成功。然而,大多数模型只是被动运作,基于预先存储的知识回答问题。与之形成鲜明对比的是,人类能够主动探索、积累并综合新信息与既有信息来推理,以解决不完整信息问题。针对这一差距,我们提出了 $Conan$,一个用于评估主动推理能力的交互式开放世界环境。$Conan$ 促进主动探索并支持多轮溯因推理,类似于 Minecraft 等丰富的开放世界场景。不同于以往主要依赖指令跟随进行单轮演绎的工作,$Conan$ 要求智能体主动与环境交互,将新证据与先验知识相结合,从不完整的观察中解释事件。我们对 $Conan$ 的分析揭示了当前最先进模型在主动探索和理解复杂情境方面的不足。此外,我们还探讨了"由演绎实现溯因"(Abduction from Deduction),即智能体利用贝叶斯规则将溯因问题重新表述为演绎过程。我们希望通过 $Conan$ 推动主动推理研究的进展,为下一代能够动态参与环境的人工智能体奠定基础。

DeliverAI: Reinforcement Learning Based Distributed Path-Sharing Network for Food Deliveries

  • paper_url: http://arxiv.org/abs/2311.02017
  • repo_url: None
  • paper_authors: Ashman Mehra, Snehanshu Saha, Vaskar Raychoudhury, Archana Mathur
  • for: 这篇论文旨在提出一种基于强化学习的食品配送方案,以降低现有模式的配送成本并提高配送效率。
  • methods: 论文将问题建模为多目标优化,同时优化消费者满意度与配送成本,并通过基于强化学习的智能体系统实现实时决策。
  • results: 仿真结果显示,与基线相比,DeliverAI 可将配送车队规模缩小 12%,行驶距离减少 13%,并将车队利用率提升 50%。
    Abstract Delivery of items from the producer to the consumer has experienced significant growth over the past decade and has been greatly fueled by the recent pandemic. Amazon Fresh, Shopify, UberEats, InstaCart, and DoorDash are rapidly growing and are sharing the same business model of consumer items or food delivery. Existing food delivery methods are sub-optimal because each delivery is individually optimized to go directly from the producer to the consumer via the shortest time path. We observe a significant scope for reducing the costs associated with completing deliveries under the current model. We model our food delivery problem as a multi-objective optimization, where consumer satisfaction and delivery costs, both, need to be optimized. Taking inspiration from the success of ride-sharing in the taxi industry, we propose DeliverAI - a reinforcement learning-based path-sharing algorithm. Unlike previous attempts for path-sharing, DeliverAI can provide real-time, time-efficient decision-making using a Reinforcement learning-enabled agent system. Our novel agent interaction scheme leverages path-sharing among deliveries to reduce the total distance traveled while keeping the delivery completion time under check. We generate and test our methodology vigorously on a simulation setup using real data from the city of Chicago. Our results show that DeliverAI can reduce the delivery fleet size by 12\%, the distance traveled by 13%, and achieve 50% higher fleet utilization compared to the baselines.
    摘要 过去十年间,从生产者到消费者的商品配送经历了显著增长,并因近期的疫情而大幅加速。Amazon Fresh、Shopify、UberEats、InstaCart 和 DoorDash 正在快速扩张,它们共享同一种消费品或食品配送的商业模式。现有的食品配送方式并非最优:每一单配送都被单独优化,沿最短时间路径直接从生产者送达消费者。我们观察到,在当前模式下完成配送的成本存在显著的压缩空间。我们将食品配送问题建模为多目标优化,需要同时优化消费者满意度和配送成本。受出租车行业拼车成功经验的启发,我们提出 DeliverAI——一种基于强化学习的路径共享算法。与以往的路径共享尝试不同,DeliverAI 借助强化学习智能体系统实现实时、高时效的决策。我们新颖的智能体交互机制利用配送间的路径共享来缩短总行驶距离,同时控制配送完成时间。我们基于芝加哥市的真实数据,在仿真环境中对方法进行了充分的测试。结果显示,与基线相比,DeliverAI 可将配送车队规模缩小 12%,行驶距离减少 13%,并将车队利用率提升 50%。

Score Models for Offline Goal-Conditioned Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.02013
  • repo_url: None
  • paper_authors: Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, Scott Niekum
  • for: 本研究旨在开发一种能够仅凭离线数据集学习多个目标的强化学习(RL)方法,从而在无需手工设计奖励函数的情况下,构建能够学习多样、可复用技能的通用智能体。
  • methods: 方法从混合分布匹配的新视角出发,将 GCRL 的占用匹配视角与凸对偶形式的学习目标相结合,以更好地利用次优的离线数据。
  • results: 实验表明,在包含高维观测的机器人操作与运动任务上,SMORe 显著超越了最新的基线方法。
    Abstract Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.
    摘要 离线目标条件强化学习(GCRL)的任务是仅凭离线数据集和稀疏奖励函数,学习在环境中达成多个目标。离线 GCRL 对于构建能够利用既有数据集、无需手工设计奖励函数即可学习多样且可复用技能的通用智能体至关重要。然而,当前基于监督学习和对比学习的 GCRL 方法在离线设置下往往并非最优。另一种 GCRL 视角以占用匹配为优化目标,但需要学习一个判别器,并将其作为下游强化学习的伪奖励;判别器学习中的误差会层层传递,对最终策略产生负面影响。我们从混合分布匹配这一新视角出发,提出了一种无需判别器的方法:SMORe。其核心思想是将 GCRL 的占用匹配视角与凸对偶形式相结合,推导出能够更好利用次优离线数据的学习目标。SMORe 学习的是分数(未归一化密度),表示在某状态下采取某动作对达成特定目标的重要性。SMORe 具有坚实的理论基础;我们在由机器人操作与运动任务(含高维观测)组成的完全离线 GCRL 基准上开展了大量实验,结果表明 SMORe 能以显著优势超越最新的基线方法。

Obtaining Explainable Classification Models using Distributionally Robust Optimization

  • paper_url: http://arxiv.org/abs/2311.01994
  • repo_url: None
  • paper_authors: Sanjeeb Dash, Soumyadip Ghosh, Joao Goncalves, Mark S. Squillante
  • for: 这篇论文旨在提出一种能同时保证泛化质量与低计算成本的可解释分类器构建方法。
  • methods: 论文使用分布鲁棒优化构建特征值规则集,以捕捉非线性依赖与交互,并利用列生成高效地搜索规则集空间。
  • results: 实验结果显示,在大量公开的二分类问题实例上,所提方法在泛化质量、计算成本和可解释性中的一个或多个维度上优于随机森林、提升方法(boosting)等竞争方法。
    Abstract Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which can capture nonlinear dependencies and interactions. An inherent trade-off exists between rule set sparsity and its prediction accuracy. It is computationally expensive to find the right choice of sparsity -- e.g., via cross-validation -- with existing methods. We propose a new formulation to learn an ensemble of rule sets that simultaneously addresses these competing factors. Good generalization is ensured while keeping computational costs low by utilizing distributionally robust optimization. The formulation utilizes column generation to efficiently search the space of rule sets and constructs a sparse ensemble of rule sets, in contrast with techniques like random forests or boosting and their variants. We present theoretical results that motivate and justify the use of our distributionally robust formulation. Extensive numerical experiments establish that our method improves over competing methods -- on a large set of publicly available binary classification problem instances -- with respect to one or more of the following metrics: generalization quality, computational cost, and explainability.
    摘要 模型可解释性至关重要,它使人类用户能够理解分类器如何根据特征值为数据分配标签。我们研究由特征值规则集构建的广义线性模型,它们能够捕捉非线性依赖与交互。规则集的稀疏性与预测精度之间存在固有的权衡;用现有方法(例如通过交叉验证)寻找合适的稀疏程度在计算上代价高昂。我们提出一种新的形式化方法来学习规则集的集成,同时兼顾这两个相互竞争的因素:借助分布鲁棒优化,在保证良好泛化能力的同时控制计算成本。该形式化利用列生成高效地搜索规则集空间,并构建一个稀疏的规则集集成,这与随机森林或提升方法及其变体形成对比。我们给出了理论结果,以支撑和论证分布鲁棒形式化的使用。大量数值实验表明,在大量公开的二分类问题实例上,我们的方法在泛化质量、计算成本和可解释性中的一个或多个指标上优于竞争方法。

RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches

  • paper_url: http://arxiv.org/abs/2311.01977
  • repo_url: None
  • paper_authors: Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, Ted Xiao
  • for: 本研究旨在提高机器人学习系统的普适性,使其能够更好地适应新任务和新情况。
  • methods: 本研究提出了 RT-Trajectory,一种以粗略轨迹草图为条件的策略方法:通过粗略的轨迹草图图像来表达任务,使策略能够执行原本难以完成的新任务。
  • results: 实验结果表明,RT-Trajectory 在各种真实世界机器人任务中表现出良好的泛化性;在相同训练数据下,它能比语言条件和目标条件策略完成更广泛的任务。
    Abstract Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface to communicate with robotic policies: they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating methods. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data.
    摘要 泛化能力仍然是鲁棒机器人学习系统最重要的目标之一。尽管近期提出的方法在泛化到新物体、语义概念或视觉分布偏移方面展现出潜力,但泛化到新任务仍然充满挑战。例如,一个在抓取-放置任务上训练的语言条件策略无法泛化到折叠任务,即使折叠的手臂轨迹与抓取-放置相似。我们的关键洞察是:如果用粗略的轨迹草图来表示任务,这类泛化就成为可能。我们提出了一种以粗略轨迹草图为条件的策略方法,称为 RT-Trajectory。它实用、易于指定,并能让策略有效执行原本难以完成的新任务。我们发现,轨迹草图在两方面取得了平衡:既足够详细以表达低层的运动引导,又足够粗略以便学习到的策略能结合场景视觉观察来解释草图。此外,轨迹草图还提供了一个与机器人策略沟通的有用接口:它们可以通过绘图或视频等简单的人类输入来指定,也可以通过现代图像生成或路径点生成等自动化方法来生成。我们在多种真实世界机器人任务上对 RT-Trajectory 进行了大规模评估,发现在相同训练数据下,RT-Trajectory 能比语言条件和目标条件策略完成更广泛的任务。

The language of prompting: What linguistic properties make a prompt successful?

  • paper_url: http://arxiv.org/abs/2311.01967
  • repo_url: None
  • paper_authors: Alina Leidinger, Robert van Rooij, Ekaterina Shutova
  • for: 该研究旨在考察不同规模、经过预训练与指令微调的 LLM,在语法结构和词汇语义各异的提示下的表现。
  • methods: 研究对语义等价的提示进行语法结构(语气、时态、体、情态)和词汇语义(同义词替换)上的系统变换,并考察这些变换与任务表现的关联。
  • results: 研究发现,提示在数据集或模型之间迁移性差,且表现总体上无法用困惑度、词频、歧义度或提示长度来解释。这表明现有评估方式可能不够全面,需要更稳健、更全面的评估标准。
    Abstract The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. However, since performance is highly sensitive to the choice of prompts, considerable effort has been devoted to crowd-sourcing prompts or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, pre-trained and instruction-tuned, perform on prompts that are semantically equivalent, but vary in linguistic structure. We investigate both grammatical properties such as mood, tense, aspect and modality, as well as lexico-semantic variation through the use of synonyms. Our findings contradict the common assumption that LLMs achieve optimal performance on lower perplexity prompts that reflect language use in pretraining or instruction-tuning data. Prompts transfer poorly between datasets or models, and performance cannot generally be explained by perplexity, word frequency, ambiguity or prompt length. Based on our results, we put forward a proposal for a more robust and comprehensive evaluation standard for prompting research.
    摘要 最新一代 LLM 经提示即可在许多 NLP 任务上取得出色的零样本或少样本表现。然而,由于表现对提示的选择高度敏感,人们投入了大量精力进行提示众包或设计提示优化方法。但我们仍然缺乏对提示的语言学属性如何与任务表现相关联的系统性认识。在这项工作中,我们考察了不同规模、经过预训练与指令微调的 LLM,在语义等价但语言结构不同的提示上的表现。我们既考察语气、时态、体和情态等语法属性,也考察通过同义词替换产生的词汇语义变化。我们的发现与一个普遍假设相矛盾,即 LLM 在反映预训练或指令微调数据语言使用习惯的低困惑度提示上表现最佳。提示在数据集或模型之间迁移性差,且表现总体上无法用困惑度、词频、歧义度或提示长度来解释。基于这些结果,我们提出了一套更稳健、更全面的提示研究评估标准建议。

Don’t Make Your LLM an Evaluation Benchmark Cheater

  • paper_url: http://arxiv.org/abs/2311.01964
  • repo_url: None
  • paper_authors: Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
  • for: 探讨大语言模型(LLM)评估基准的恰当使用,以推动人工智能的可靠进步。
  • methods: 分析使用评估基准衡量 LLM 能力水平的做法,指出评估数据被用于训练(基准泄漏)可能导致的不当使用与误导性结果。
  • results: 大量实验表明,使用评估集相关数据训练模型会显著抬高评估结果,导致对模型性能的误估;论文据此为 LLM 开发者和基准维护者提出了若干指南。
    Abstract Large language models~(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs in different aspects. Despite that a number of high-quality benchmarks have been released, the concerns about the appropriate use of these benchmarks and the fair comparison of different models are increasingly growing. Considering these concerns, in this paper, we discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results. Specially, we focus on a special issue that would lead to inappropriate evaluation, \ie \emph{benchmark leakage}, referring that the data related to evaluation sets is occasionally used for model training. This phenomenon now becomes more common since pre-training data is often prepared ahead of model test. We conduct extensive experiments to study the effect of benchmark leverage, and find that it can dramatically boost the evaluation results, which would finally lead to an unreliable assessment of model performance. To improve the use of existing evaluation benchmarks, we finally present several guidelines for both LLM developers and benchmark maintainers. We hope this work can draw attention to appropriate training and evaluation of LLMs.
    摘要 大语言模型(LLM)极大地推进了人工智能的前沿,在模型能力上取得了显著进步。评估模型性能的典型做法是构建评估基准,以衡量 LLM 在不同方面的能力水平。尽管已经发布了许多高质量基准,人们对这些基准的恰当使用以及不同模型间公平比较的担忧却与日俱增。鉴于此,本文讨论了不当使用评估基准和误读评估结果可能带来的风险与影响。我们特别关注一个会导致评估失当的问题,即"基准泄漏"(benchmark leakage),指与评估集相关的数据偶尔被用于模型训练。由于预训练数据往往在模型测试之前就已准备好,这一现象如今愈发常见。我们通过大量实验研究了基准泄漏的影响,发现它会显著抬高评估结果,最终导致对模型性能的不可靠评估。为改进现有评估基准的使用,我们最后为 LLM 开发者和基准维护者提出了若干指南。我们希望这项工作能够引起人们对 LLM 恰当训练与评估的关注。
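    A common way to operationalize benchmark-leakage detection, which the abstract does not prescribe but which is standard in contamination studies, is long n-gram overlap between benchmark examples and the training corpus. A minimal sketch (n=13 follows common practice in LLM contamination analyses):

```python
def ngrams(text, n=13):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def leakage_rate(benchmark_examples, training_corpus, n=13):
    """Fraction of benchmark examples sharing at least one long n-gram
    with the training corpus -- a simple contamination heuristic."""
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in benchmark_examples
                  if ngrams(ex, n) & train_grams)
    return flagged / max(len(benchmark_examples), 1)

# Usage: leakage_rate(test_set_texts, pretraining_documents)
```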

Assessing Fidelity in XAI post-hoc techniques: A Comparative Study with Ground Truth Explanations Datasets

  • paper_url: http://arxiv.org/abs/2311.01961
  • repo_url: None
  • paper_authors: M. Miró-Nicolau, A. Jaume-i-Capó, G. Moyà-Alcover
  • for: 本研究旨在评估当前最先进 XAI 方法的保真度,剔除低保真度的方法,以促进更可信、更有效的 XAI 技术的发展。
  • methods: 本研究比较了三类常见的 XAI 方法:将输出信息反向传播至输入的方法、敏感性分析方法,以及类激活图(CAM)方法。
  • results: 结果表明,基于输出反向传播的 XAI 方法在准确性和可靠性上优于敏感性分析和 CAM 方法;不过,反向传播方法生成的显著性图噪声更高。这些发现有助于剔除错误的解释,推动 XAI 技术的发展。
    Abstract The evaluation of the fidelity of eXplainable Artificial Intelligence (XAI) methods to their underlying models is a challenging task, primarily due to the absence of a ground truth for explanations. However, assessing fidelity is a necessary step for ensuring a correct XAI methodology. In this study, we conduct a fair and objective comparison of the current state-of-the-art XAI methods by introducing three novel image datasets with reliable ground truth for explanations. The primary objective of this comparison is to identify methods with low fidelity and eliminate them from further research, thereby promoting the development of more trustworthy and effective XAI techniques. Our results demonstrate that XAI methods based on the backpropagation of output information to input yield higher accuracy and reliability compared to methods relying on sensitivity analysis or Class Activation Maps (CAM). However, the backpropagation method tends to generate more noisy saliency maps. These findings have significant implications for the advancement of XAI methods, enabling the elimination of erroneous explanations and fostering the development of more robust and reliable XAI.
    摘要 评估可解释人工智能(XAI)方法对其底层模型的保真度是一项困难任务,主要原因在于解释缺乏真实参照(ground truth)。然而,评估保真度是确保 XAI 方法论正确性的必要步骤。在本研究中,我们引入三个带有可靠解释参照的新图像数据集,对当前最先进的 XAI 方法进行公平、客观的比较。比较的主要目的是识别并剔除低保真度的方法,从而推动更可信、更有效的 XAI 技术的发展。结果表明,基于将输出信息反向传播至输入的 XAI 方法,其准确性和可靠性高于依赖敏感性分析或类激活图(CAM)的方法;不过,反向传播方法往往生成噪声更多的显著性图。这些发现对 XAI 方法的进步具有重要意义,有助于剔除错误的解释,促进更稳健、更可靠的 XAI 的发展。
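    As a concrete instance of the "backpropagation of output information to input" family that the study finds most faithful, here is a vanilla input-gradient saliency sketch; it is one representative member of that family, not the specific methods compared in the paper.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla input-gradient saliency: backpropagate the target logit
    to the input and take per-pixel absolute gradients."""
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image.unsqueeze(0))      # image: (3, H, W)
    logits[0, target_class].backward()
    return image.grad.abs().max(dim=0).values   # (H, W) saliency map

# Usage with any torchvision classifier, e.g.:
# from torchvision.models import resnet18
# saliency = gradient_saliency(resnet18(weights=None),
#                              torch.randn(3, 224, 224), target_class=0)
```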

Architecture of Smart Certificates for Web3 Applications Against Cyberthreats in Financial Industry

  • paper_url: http://arxiv.org/abs/2311.01956
  • repo_url: None
  • paper_authors: Stefan Kambiz Behfar, Jon Crowcroft
  • for: 本研究探讨了当今互联网的安全挑战,尤其是区块链和分布式存储等新技术的应用。它还研究了未来互联网的形态,并提出了一种新的“智能证书”设计方案,以帮助企业更好地保护自己免受网络攻击,并确保数据和系统的安全。
  • methods: 本研究考察了 Certik、Forta、Slither 和 Securify 等 Web3 安全解决方案,并提出基于多层架构(钱包/客户端、应用、智能合约)的漏洞分析与攻击关联方法,以提升企业数字基础设施的韧性。
  • results: 所提出的"智能证书"设计方案可帮助企业更好地抵御网络攻击,提升数据与系统的安全性。此外,借助证书透明度(Certificate Transparency)可增强证书的安全性、可信度与去中心化管理,并检测滥用、泄露与不当行为。
    Abstract This study addresses the security challenges associated with the current internet transformations, specifically focusing on emerging technologies such as blockchain and decentralized storage. It also investigates the role of Web3 applications in shaping the future of the internet. The primary objective is to propose a novel design for 'smart certificates,' which are digital certificates that can be programmatically enforced. Utilizing such certificates, an enterprise can better protect itself from cyberattacks and ensure the security of its data and systems. Web3 recent security solutions by companies and projects like Certik, Forta, Slither, and Securify are the equivalent of code scanning tool that were originally developed for Web1 and Web2 applications, and definitely not like certificates to help enterprises feel safe against cyberthreats. We aim to improve the resilience of enterprises' digital infrastructure by building on top of Web3 application and put methodologies in place for vulnerability analysis and attack correlation, focusing on architecture of different layers, Wallet/Client, Application and Smart Contract, where specific components are provided to identify and predict threats and risks. Furthermore, Certificate Transparency is used for enhancing the security, trustworthiness and decentralized management of the certificates, and detecting misuses, compromises, and malfeasances.
    摘要 本研究探讨当前互联网变革所带来的安全挑战,重点关注区块链和去中心化存储等新兴技术,并考察 Web3 应用在塑造未来互联网中的作用。其主要目标是提出一种新颖的"智能证书"设计,即可以被程序化执行的数字证书。借助此类证书,企业可以更好地抵御网络攻击,保障其数据和系统的安全。Certik、Forta、Slither 和 Securify 等公司与项目近期推出的 Web3 安全解决方案,本质上相当于最初为 Web1 和 Web2 应用开发的代码扫描工具,而并非能让企业安心应对网络威胁的证书。我们的目标是在 Web3 应用之上构建方法,针对钱包/客户端、应用和智能合约等不同层次的架构进行漏洞分析与攻击关联,提供专门组件来识别和预测威胁与风险,从而提升企业数字基础设施的韧性。此外,我们利用证书透明度(Certificate Transparency)来增强证书的安全性、可信度和去中心化管理,并检测滥用、泄露与不当行为。

A Quantitative Autonomy Quantification Framework for Fully Autonomous Robotic Systems

  • paper_url: http://arxiv.org/abs/2311.01939
  • repo_url: None
  • paper_authors: Nasser Gyagenda, Hubert Roth
  • for: 本研究旨在提出一个基于任务需求的自主性评估框架,以支持在有限人工监督下部署自主运行的机器人系统。
  • methods: 研究由机器人任务特征导出三个自主性度量,即必要能力、可靠性和响应性,并据此将自主性量化为自主性水平与自主性程度两部分。
  • results: 研究表明,该框架有助于确定机器人系统的自主性水平,并为自主系统的开发者与用户提供共同语言和监管接口。道路动态驾驶任务中的自动驾驶汽车与 DARPA SubT 挑战赛规则分析两个案例均显示了其可行性与实用性。
    Abstract Although autonomous functioning facilitates deployment of robotic systems in domains that admit limited human oversight on our planet and beyond, finding correspondence between task requirements and autonomous capability is still an open challenge. Consequently, a number of methods for quantifying autonomy have been proposed over the last three decades, but to our knowledge all these have no discernment of sub-mode features of variation of autonomy and some are based on metrics that violate Goodhart's law. This paper focuses on the full autonomous mode and proposes a task-requirements based autonomy assessment framework. The framework starts by establishing robot task characteristics from which three autonomy metrics, namely requisite capability, reliability and responsiveness, and functions for determining autonomy as a two-part measure, namely of level of autonomy and degree of autonomy are derived. These characteristics are founded on the realization that robots ultimately replace human skilled workers, to find a mapping between human job and robot task characteristics. The distinction between level and degree of autonomy stemmed from the acknowledgment that autonomy is not just a question of existence, but also one of performance of requisite capability. When continuously monitored, the proposed metrics provide a means of monitoring the integrity of a system. The framework has been demonstrated on two case studies, namely autonomous vehicle at an on-road dynamic driving task and the DARPA subT challenge rules analysis. The framework provides not only a tool for quantifying autonomy, but also a regulatory interface and common language for autonomous systems developers and users.
    摘要 尽管自主运行有助于在地球内外那些只允许有限人工监督的领域部署机器人系统,但在任务需求与自主能力之间建立对应关系仍是一个悬而未决的挑战。过去三十年间已有不少量化自主性的方法被提出,但据我们所知,它们都无法辨别自主性变化的子模式特征,且其中一些所依据的度量违反了古德哈特定律。本文聚焦完全自主模式,提出一个基于任务需求的自主性评估框架。该框架首先确立机器人任务特征,并由此导出三个自主性度量,即必要能力、可靠性和响应性,以及将自主性确定为两部分度量(自主性水平与自主性程度)的函数。这些特征源于一个认识:机器人最终取代的是人类熟练工人,因此需要在人类岗位特征与机器人任务特征之间建立映射。水平与程度之分则源于这样一个认识:自主性不仅是能力是否存在的问题,也是必要能力发挥水平的问题。在持续监测下,所提出的度量可用于监控系统的完整性。该框架已在两个案例中得到演示:道路动态驾驶任务中的自动驾驶汽车,以及 DARPA SubT 挑战赛规则分析。该框架不仅提供了量化自主性的工具,也为自主系统的开发者与用户提供了监管接口和共同语言。

Supermind Ideator: Exploring generative AI to support creative problem-solving

  • paper_url: http://arxiv.org/abs/2311.01937
  • repo_url: None
  • paper_authors: Steven R. Rick, Gianni Giacomelli, Haoran Wen, Robert J. Laubacher, Nancy Taubenslag, Jennifer L. Heyman, Max Sina Knicker, Younes Jeddi, Hendrik Maier, Stephen Dwyer, Pranav Ragupathy, Thomas W. Malone
  • for: 这篇论文旨在帮助人们运用创造性问题解决技巧,产生关于如何设计由人和/或计算机组成的群体("超级心智")的创新想法。
  • methods: 论文使用大语言模型(GPT 3.5),加入提示、微调以及专门设计的用户界面,以帮助人们运用创造性问题解决技巧。
  • results: 论文描述了使用这种系统的初期经验,并建议将其扩展到支持其他特定问题解决领域的技巧。
    Abstract Previous efforts to support creative problem-solving have included (a) techniques (such as brainstorming and design thinking) to stimulate creative ideas, and (b) software tools to record and share these ideas. Now, generative AI technologies can suggest new ideas that might never have occurred to the users, and users can then select from these ideas or use them to stimulate even more ideas. Here, we describe such a system, Supermind Ideator. The system uses a large language model (GPT 3.5) and adds prompting, fine tuning, and a user interface specifically designed to help people use creative problem-solving techniques. Some of these techniques can be applied to any problem; others are specifically intended to help generate innovative ideas about how to design groups of people and/or computers ("superminds"). We also describe our early experiences with using this system and suggest ways it could be extended to support additional techniques for other specific problem-solving domains.
    摘要 以往支持创造性问题解决的努力包括:(a) 激发创意的技巧(如头脑风暴和设计思维),以及 (b) 用于记录和分享这些创意的软件工具。如今,生成式人工智能技术能够提出用户可能从未想到的新想法,用户可以从中挑选,或借助它们激发更多灵感。本文介绍了这样一个系统:Supermind Ideator。该系统基于大语言模型(GPT 3.5),并加入了提示、微调以及专门为帮助人们运用创造性问题解决技巧而设计的用户界面。其中一些技巧适用于任何问题;另一些则专门用于帮助产生关于如何设计由人和/或计算机组成的群体("超级心智")的创新想法。我们还介绍了使用该系统的早期经验,并提出了将其扩展以支持其他特定问题领域技巧的可能方式。

GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling

  • paper_url: http://arxiv.org/abs/2311.01927
  • repo_url: None
  • paper_authors: Tobias Katsch
  • for: This paper aims to improve the efficiency and effectiveness of sequence models, particularly for auto-regressive language modeling.
  • methods: The authors develop a foundational sequence model called GateLoop, which generalizes linear recurrent models by employing data-controlled state transitions. The model comes with two efficient modes: an $O(l)$ recurrent mode and an $O(l \log_{2} l)$ parallel mode.
  • results: The authors show that GateLoop outperforms existing models for auto-regressive language modeling, and that the approach can be interpreted as providing data-controlled relative-positional information to Attention. The findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.
    Abstract Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost $O(l)$ recurrent mode and an efficient $O(l \log_{2} l)$ parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive an $O(l^2)$ surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.
    摘要 线性递归已被证明是高效建模长序列的有力工具。在这项工作中,我们发现现有模型未能充分发挥其潜力。受此启发,我们开发了 GateLoop——一种基础序列模型,它通过数据控制的状态转移,对 S4、S5、LRU 和 RetNet 等线性递归模型进行了推广。凭借这一理论进展,GateLoop 在自回归语言建模上的实证表现优于现有模型。我们的方法具有低成本的 $O(l)$ 递归模式,以及利用高度优化的结合扫描(associative scan)实现的高效 $O(l \log_{2} l)$ 并行模式。此外,我们推导出一种 $O(l^2)$ 的替代注意力模式,这对 Transformer 及近期提出的架构具有重要启示。具体而言,我们证明该方法可以被解释为向注意力机制提供数据控制的相对位置信息。许多现有模型仅依赖数据控制的累积和来聚合上下文,而我们的发现表明,引入数据控制的复数累积乘积可能是迈向更强大序列模型的关键一步。
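    Reading the abstract, the $O(l)$ recurrent mode can be written as a matrix-valued state updated by a data-controlled (possibly complex) gate, $S_t = a_t S_{t-1} + k_t^\top v_t$, with output $y_t = q_t S_t$. The sketch below implements that recurrence under our reading; per-channel gating and the parallel associative-scan mode are omitted, and details may differ from the published formulation.

```python
import numpy as np

def gateloop_recurrent(q, k, v, a):
    """O(l) recurrent mode of a data-controlled linear recurrence:
        S_t = a_t * S_{t-1} + outer(k_t, v_t),   y_t = q_t @ S_t
    With a_t fixed, this reduces to familiar linear-recurrence layers;
    making a_t data-controlled is the generalization described above."""
    l, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v), dtype=complex)
    out = np.zeros((l, d_v), dtype=complex)
    for t in range(l):
        S = a[t] * S + np.outer(k[t], v[t])   # data-controlled transition
        out[t] = q[t] @ S
    return out.real      # real part, for a real-valued toy output

l, d = 16, 8
q, k, v = (np.random.randn(l, d) for _ in range(3))
a = np.exp(1j * 0.1 * np.random.randn(l))    # unit-modulus complex gates
y = gateloop_recurrent(q, k, v, a)
```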

Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review

  • paper_url: http://arxiv.org/abs/2311.01918
  • repo_url: https://github.com/mingze-yuan/awesome-llm-healthcare
  • paper_authors: Mingze Yuan, Peng Bao, Jiajia Yuan, Yunhao Shen, Zifan Chen, Yi Xie, Jie Zhao, Yang Chen, Li Zhang, Lin Shen, Bin Dong
  • for: This paper provides a comprehensive review of the applications and implications of large language models (LLMs) in medicine, with a focus on their potential to enhance various aspects of healthcare.
  • methods: The paper examines the fundamental applications of general-purpose and specialized LLMs, as well as the emerging development of LLM-powered autonomous agents for healthcare. It also explores the ability of multimodal LLMs to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy.
  • results: The paper highlights the transformative potential of LLMs in modern medicine, while acknowledging the need for continuous optimization and ethical oversight before these models can be effectively integrated into clinical practice.
    Abstract With the rapid development of artificial intelligence, large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning. This has sparked significant interest in applying LLMs to enhance various aspects of healthcare, ranging from medical education to clinical decision support. However, medicine involves multifaceted data modalities and nuanced reasoning skills, presenting challenges for integrating LLMs. This paper provides a comprehensive review on the applications and implications of LLMs in medicine. It begins by examining the fundamental applications of general-purpose and specialized LLMs, demonstrating their utilities in knowledge retrieval, research support, clinical workflow automation, and diagnostic assistance. Recognizing the inherent multimodality of medicine, the review then focuses on multimodal LLMs, investigating their ability to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy. To address LLMs' limitations regarding personalization and complex clinical reasoning, the paper explores the emerging development of LLM-powered autonomous agents for healthcare. Furthermore, it summarizes the evaluation methodologies for assessing LLMs' reliability and safety in medical contexts. Overall, this review offers an extensive analysis on the transformative potential of LLMs in modern medicine. It also highlights the pivotal need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice. Visit https://github.com/mingze-yuan/Awesome-LLM-Healthcare for an accompanying GitHub repository containing latest papers.
    摘要 随着人工智能的快速发展,大语言模型(LLM)在模仿人类水平的语言理解和推理方面展现出可观的能力。这激发了人们将 LLM 应用于增强医疗各个方面的浓厚兴趣,从医学教育到临床决策支持。然而,医学涉及多模态数据和细致入微的推理技能,这为 LLM 的融入带来挑战。本文对 LLM 在医学中的应用及其影响进行了全面综述。文章首先考察通用与专用 LLM 的基础应用,展示其在知识检索、科研支持、临床工作流自动化和诊断辅助方面的用途。鉴于医学固有的多模态特性,综述随后聚焦多模态 LLM,研究其处理医学影像和电子病历(EHR)等多种数据类型以提升诊断准确性的能力。针对 LLM 在个性化和复杂临床推理方面的局限,文章探讨了新兴的 LLM 驱动医疗自主智能体的发展,并总结了评估 LLM 在医疗场景中可靠性与安全性的评价方法。总体而言,本综述对 LLM 在现代医学中的变革潜力进行了深入分析,同时强调在这些模型有效融入临床实践之前,持续优化与伦理监督不可或缺。配套的 GitHub 仓库收录了最新论文,见 https://github.com/mingze-yuan/Awesome-LLM-Healthcare。

Enhancing Functional Data Analysis with Sequential Neural Networks: Advantages and Comparative Study

  • paper_url: http://arxiv.org/abs/2311.01875
  • repo_url: None
  • paper_authors: J. Zhao, J. Li, M. Chen, S. Jadhav
  • for: 这篇论文旨在将序列神经网络(Sequential Neural Networks, SNNs)应用于功能数据分析(Functional Data Analysis, FDA)领域的问题。
  • methods: 论文使用 SNN 对功能数据进行分析,并与传统 FDA 方法进行比较。
  • results: 研究发现,SNN 在功能数据分析中能提供更好的性能,且易于实现;此外,SNN 还能处理高维的功能数据,应对现实世界中的数据分析问题。
    Abstract Functional Data Analysis (FDA) is a statistical domain developed to handle functional data characterized by high dimensionality and complex data structures. Sequential Neural Networks (SNNs) are specialized neural networks capable of processing sequence data, a fundamental aspect of functional data. Despite their great flexibility in modeling functional data, SNNs have been inadequately employed in the FDA community. One notable advantage of SNNs is the ease of implementation, making them accessible to a broad audience beyond academia. Conversely, FDA-based methodologies present challenges, particularly for practitioners outside the field, due to their intricate complexity. In light of this, we propose utilizing SNNs in FDA applications and demonstrate their effectiveness through comparative analyses against popular FDA regression models based on numerical experiments and real-world data analysis. SNN architectures allow us to surpass the limitations of traditional FDA methods, offering scalability, flexibility, and improved analytical performance. Our findings highlight the potential of SNN-based methodologies as powerful tools for data applications involving functional data.
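
To make the comparison concrete, here is a minimal sketch of an SNN-style model for functional data: an LSTM reads each curve sampled on a grid and regresses a scalar response. The architecture, synthetic data, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FunctionalLSTM(nn.Module):
    """Reads a curve sampled on a grid and regresses a scalar response."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, curves):              # curves: (batch, grid_len, 1)
        _, (h, _) = self.lstm(curves)
        return self.head(h[-1]).squeeze(-1)

# Synthetic functional data: the target is the mean of each curve,
# a stand-in for a classic scalar-on-function regression task.
grid = torch.linspace(0, 1, 100)
curves = torch.sin(2 * torch.pi * torch.rand(256, 1) * grid).unsqueeze(-1)
targets = curves.squeeze(-1).mean(dim=1)

model = FunctionalLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(curves), targets)
    loss.backward()
    opt.step()
```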

Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval

  • paper_url: http://arxiv.org/abs/2311.01870
  • repo_url: None
  • paper_authors: Jinrui Yang, Timothy Baldwin, Trevor Cohn
  • for: This paper presents a multilingual benchmark dataset designed to study fairness in multilingual information retrieval (IR).
  • methods: The dataset comprises 22K multilingual documents collected from the European Parliament, spanning 24 languages. It features an authentic multilingual corpus with topics translated into all 24 languages and cross-lingual relevance judgments; it also includes demographic information about the documents, facilitating the study of language and demographic bias.
  • results: The authors show that Multi-EuP can be used to benchmark the effectiveness of both monolingual and multilingual IR, and they conduct a preliminary experiment on the language bias caused by the choice of tokenization strategy.
    Abstract We present Multi-EuP, a new multilingual benchmark dataset, comprising 22K multi-lingual documents collected from the European Parliament, spanning 24 languages. This dataset is designed to investigate fairness in a multilingual information retrieval (IR) context to analyze both language and demographic bias in a ranking context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.

Towards Concept-Aware Large Language Models

  • paper_url: http://arxiv.org/abs/2311.01866
  • repo_url: None
  • paper_authors: Chen Shani, Jilles Vreeken, Dafna Shahaf
  • for: This work aims to endow machines with human-like concepts, improving their ability to form and reason with concepts.
  • methods: The study analyzes how well contemporary large language models (LLMs) capture the structure of human concepts, and discusses ways to build concept-aware LLMs at different stages of the pipeline, including concept-based pretraining and a simpler approach that reuses the output of existing LLMs.
  • results: The proof-of-concept better matches human conceptual understanding and improves the robustness of predictions; these preliminary results highlight the promise of concept-aware language models.
    Abstract Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication. However, there is very little work on endowing machines with the ability to form and reason with concepts. In particular, state-of-the-art large language models (LLMs) work at the level of tokens, not concepts. In this work, we analyze how well contemporary LLMs capture human concepts and their structure. We then discuss ways to develop concept-aware LLMs, taking place at different stages of the pipeline. We sketch a method for pretraining LLMs using concepts, and also explore the simpler approach that uses the output of existing LLMs. Despite its simplicity, our proof-of-concept is shown to better match human intuition, as well as improve the robustness of predictions. These preliminary results underscore the promise of concept-aware LLMs.

SortNet: Learning To Rank By a Neural-Based Sorting Algorithm

  • paper_url: http://arxiv.org/abs/2311.01864
  • repo_url: None
  • paper_authors: Leonardo Rigutini, Tiziano Papini, Marco Maggini, Franco Scarselli
  • for: This paper proposes an adaptive ranking algorithm that can be tailored to user-specific relevance criteria.
  • methods: A neural-network-based comparator is trained with an iterative procedure that, at each iteration, adds the most informative training examples, allowing the model to learn the desired ordering between pairs of items.
  • results: Experiments show that SortNet performs well on the LETOR dataset, achieving promising ranking accuracy compared with other state-of-the-art algorithms.
    Abstract The problem of relevance ranking consists of sorting a set of objects with respect to a given criterion. Since users may prefer different relevance criteria, the ranking algorithms should be adaptable to the user needs. Two main approaches exist in literature for the task of learning to rank: 1) a score function, learned by examples, which evaluates the properties of each object yielding an absolute relevance value that can be used to order the objects or 2) a pairwise approach, where a "preference function" is learned using pairs of objects to define which one has to be ranked first. In this paper, we present SortNet, an adaptive ranking algorithm which orders objects using a neural network as a comparator. The neural network training set provides examples of the desired ordering between pairs of items and it is constructed by an iterative procedure which, at each iteration, adds the most informative training examples. Moreover, the comparator adopts a connectionist architecture that is particularly suited for implementing a preference function. We also prove that such an architecture has the universal approximation property and can implement a wide class of functions. Finally, the proposed algorithm is evaluated on the LETOR dataset showing promising performances in comparison with other state of the art algorithms.
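
As an illustration of the pairwise approach, the sketch below shows the moving parts: a small feed-forward network stands in for the paper's connectionist preference function, and a comparator-based sort consumes its outputs. The architecture and dimensions are assumptions, not SortNet's exact design.

```python
import functools
import torch
import torch.nn as nn

class Comparator(nn.Module):
    """Outputs P(item x should be ranked above item y)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 16), nn.Tanh(), nn.Linear(16, 1))

    def forward(self, x, y):
        return torch.sigmoid(self.net(torch.cat([x, y], dim=-1)))

dim = 4
comp = Comparator(dim)
# Training would minimize binary cross-entropy on labeled preference pairs;
# here the (untrained) comparator is just used to order a list of items.
items = [torch.randn(dim) for _ in range(8)]

def prefer(a, b):
    with torch.no_grad():
        return -1 if comp(a, b).item() > 0.5 else 1   # a before b if preferred

ranked = sorted(items, key=functools.cmp_to_key(prefer))
```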

FAME: Flexible, Scalable Analogy Mappings Engine

  • paper_url: http://arxiv.org/abs/2311.01860
  • repo_url: None
  • paper_authors: Shahar Jacob, Chen Shani, Dafna Shahaf
  • for: This paper aims to advance computational analogy by developing a new framework that can handle partial analogies and suggest new entities to be added.
  • methods: The paper uses automatic extraction of commonsense representations to identify mappings between entities, and the input requirements are relaxed to only require names of entities.
  • results: The model achieves 81.2% accuracy on classical 2x2 analogy problems and 77.8% accuracy on larger problems, outperforming human performance and providing interpretable results.
    Abstract Analogy is one of the core capacities of human cognition; when faced with new situations, we often transfer prior experience from other domains. Most work on computational analogy relies heavily on complex, manually crafted input. In this work, we relax the input requirements, requiring only names of entities to be mapped. We automatically extract commonsense representations and use them to identify a mapping between the entities. Unlike previous works, our framework can handle partial analogies and suggest new entities to be added. Moreover, our method's output is easily interpretable, allowing for users to understand why a specific mapping was chosen. Experiments show that our model correctly maps 81.2% of classical 2x2 analogy problems (guess level=50%). On larger problems, it achieves 77.8% accuracy (mean guess level=13.1%). In another experiment, we show our algorithm outperforms human performance, and the automatic suggestions of new entities resemble those suggested by humans. We hope this work will advance computational analogy by paving the way to more flexible, realistic input requirements, with broader applicability.
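
A minimal sketch of the mapping step, under the assumption that each entity has a relational embedding: finding the analogy then reduces to an assignment problem over pairwise similarities. The toy vectors below are placeholders for the commonsense representations that FAME extracts automatically.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

source = ["sun", "planet"]
target = ["nucleus", "electron"]
# Placeholder "relational" embeddings; in FAME these would come from
# automatically extracted commonsense representations.
emb = {
    "sun": np.array([1.0, 0.0]), "planet": np.array([0.0, 1.0]),
    "nucleus": np.array([0.9, 0.1]), "electron": np.array([0.1, 0.9]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Negate similarities because linear_sum_assignment minimizes cost.
cost = np.array([[-cos(emb[s], emb[t]) for t in target] for s in source])
rows, cols = linear_sum_assignment(cost)
mapping = {source[r]: target[c] for r, c in zip(rows, cols)}
print(mapping)  # {'sun': 'nucleus', 'planet': 'electron'}
```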

A Neural Radiance Field-Based Architecture for Intelligent Multilayered View Synthesis

  • paper_url: http://arxiv.org/abs/2311.01842
  • repo_url: None
  • paper_authors: D. Dhinakaran, S. M. Udhaya Sankar, G. Elumalai, N. Jagadish kumar
  • for: The paper aims to improve on-demand source routing systems in mobile ad hoc networks by proposing a new routing strategy called the Optimized Route Selection via Red Imported Fire Ants (RIFA) strategy.
  • methods: The proposed method uses predicted route failure and energy utilization to select the path during the routing phase. The authors evaluate the performance of the proposed strategy based on parameters such as energy usage, packet delivery rate (PDR), and end-to-end (E2E) delay.
  • results: The results show that the proposed strategy is superior to traditional routing methods in terms of network lifetime, node energy consumption, and typical E2E delay under most network performance measures and factors.
    Abstract A mobile ad hoc network (MANET) is made up of a number of wireless portable nodes that spontaneously come together to establish a transitory network with no need for any central management. Such a network consists of a sizable and reasonably dense community of mobile nodes that travel across any terrain and rely solely on wireless interfaces for communication, without any pre-established centralized management. Routing should therefore offer a method for instantly delivering data between any two nodes in the network. The major issue, however, is finding the best packet route across the infrastructure. The proposed protocol's main goal is to identify the least-expensive route with sufficient nominal capacity that assures realistic transport and remains durable in the event of any node failure. This study suggests the Optimized Route Selection via Red Imported Fire Ants (RIFA) strategy as a way to improve on-demand source routing systems. Predicted route failure and energy utilization are used to pick the path during the routing phase. The proposed work assesses the results of comparisons based on performance parameters such as energy usage, packet delivery rate (PDR), and end-to-end (E2E) delay. The outcome demonstrates that the proposed strategy is preferable, increasing network lifetime while lowering node energy consumption and typical E2E delay under the majority of network performance measures and factors.
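
The paper's RIFA metaheuristic is not reproduced here, but the underlying cost model is easy to illustrate: each hop is penalized by its energy use and predicted failure probability, and the cheapest route wins. The sketch below uses plain Dijkstra with hypothetical weights as a stand-in.

```python
import heapq

def hop_cost(energy, p_fail, alpha=1.0, beta=5.0):
    # Hypothetical weighting of energy use against predicted failure risk.
    return alpha * energy + beta * p_fail

# graph[u] = list of (v, energy_cost, predicted_failure_probability)
graph = {
    "A": [("B", 1.0, 0.10), ("C", 2.0, 0.01)],
    "B": [("D", 1.0, 0.40)],
    "C": [("D", 1.5, 0.05)],
    "D": [],
}

def best_route(graph, src, dst):
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, energy, p_fail in graph[node]:
            heapq.heappush(pq, (cost + hop_cost(energy, p_fail), nxt, path + [nxt]))
    return float("inf"), []

print(best_route(graph, "A", "D"))  # picks A -> C -> D: costlier hops, lower risk
```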

DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder

  • paper_url: http://arxiv.org/abs/2311.01811
  • repo_url: None
  • paper_authors: Tao Liu, Chenpeng Du, Shuai Fan, Feilong Chen, Kai Yu
  • for: This paper tackles the problem of generating high-quality, person-generic visual dubbing.
  • methods: It proposes a diffusion-based dubbing technique built on an inpainting renderer whose mask delineates editable and unaltered regions, combined with versatile strategies such as data augmentation and supplementary eye guidance.
  • results: Rigorous experiments show that the technique delivers high-quality, natural visual dubbing in person-generic and multilingual scenarios, while reducing reliance on paired audio-visual data.
    Abstract Generating high-quality and person-generic visual dubbing remains a challenge. Recent innovation has seen the advent of a two-stage paradigm, decoupling the rendering and lip synchronization process facilitated by intermediate representation as a conduit. Still, previous methodologies rely on rough landmarks or are confined to a single speaker, thus limiting their performance. In this paper, we propose DiffDub: Diffusion-based dubbing. We first craft the Diffusion auto-encoder by an inpainting renderer incorporating a mask to delineate editable zones and unaltered regions. This allows for seamless filling of the lower-face region while preserving the remaining parts. Throughout our experiments, we encountered several challenges. Primarily, the semantic encoder lacks robustness, constricting its ability to capture high-level features. Besides, the modeling ignored facial positioning, causing mouth or nose jitters across frames. To tackle these issues, we employ versatile strategies, including data augmentation and supplementary eye guidance. Moreover, we encapsulated a conformer-based reference encoder and motion generator fortified by a cross-attention mechanism. This enables our model to learn person-specific textures with varying references and reduces reliance on paired audio-visual data. Our rigorous experiments comprehensively highlight that our ground-breaking approach outpaces existing methods with considerable margins and delivers seamless, intelligible videos in person-generic and multilingual scenarios.
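
The inpainting renderer's core compositing step is simple to illustrate: the mask selects the editable lower-face region to be filled by the generator while the rest of the frame is copied through. Shapes, the mask region, and the stand-in generator output below are assumptions.

```python
import numpy as np

frame = np.random.rand(256, 256, 3)       # source video frame (placeholder)
generated = np.random.rand(256, 256, 3)   # stand-in for the diffusion output

mask = np.zeros((256, 256, 1))            # 1 = editable region
mask[128:, :, :] = 1.0                    # hypothetical lower-face region

# Only the masked region is filled; everything else is preserved verbatim.
dubbed = mask * generated + (1.0 - mask) * frame
```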

AFPQ: Asymmetric Floating Point Quantization for LLMs

  • paper_url: http://arxiv.org/abs/2311.01792
  • repo_url: https://github.com/zhangsichengsjtu/afpq
  • paper_authors: Yijia Zhang, Sicheng Zhang, Shijie Cao, Dayou Du, Jianyu Wei, Ting Cao, Ningyi Xu
  • for: Improving the deployment efficiency and scalability of large language models (LLMs).
  • methods: Proposes asymmetric floating-point quantization (AFPQ), which sets separate scales for positive and negative values.
  • results: Compared with standard floating-point quantization, AFPQ yields large accuracy improvements and can easily be combined with other quantization methods without any additional storage.
    Abstract Large language models (LLMs) show great performance in various tasks, but face deployment challenges from limited memory capacity and bandwidth. Low-bit weight quantization can save memory and accelerate inference. Although floating-point (FP) formats show good performance in LLM quantization, they tend to perform poorly with small group sizes or sub-4 bits. We find the reason is that the absence of asymmetry in previous FP quantization makes it unsuitable for handling asymmetric value distribution of LLM weight tensors. In this work, we propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values. Our method leads to large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ, for better performance. Besides, no additional storage is needed compared with asymmetric integer (INT) quantization. The code is available at https://github.com/zhangsichengsjtu/AFPQ.
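
A minimal sketch of the asymmetric idea, assuming an E2M1-style FP4 grid: positive and negative weights each get their own scale, so a skewed weight distribution is covered well on both sides. The grid, group size, and error check are illustrative.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # magnitudes

def snap(x):
    # Round each value to the nearest representable grid point.
    return FP4_GRID[np.abs(x[:, None] - FP4_GRID[None, :]).argmin(axis=1)]

def afpq_quantize(w):
    pos, neg = w[w > 0], w[w < 0]
    s_pos = pos.max() / FP4_GRID.max() if pos.size else 1.0
    s_neg = np.abs(neg).max() / FP4_GRID.max() if neg.size else 1.0
    q = np.where(w >= 0, snap(w / s_pos) * s_pos, -snap(-w / s_neg) * s_neg)
    return q, (s_pos, s_neg)

# A deliberately asymmetric weight group: large positives, small negatives.
w = np.concatenate([np.random.rand(48) * 2.0, -np.random.rand(16) * 0.3])
w_q, scales = afpq_quantize(w)
print("max abs error:", np.abs(w - w_q).max())
```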

TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine

  • paper_url: http://arxiv.org/abs/2311.01786
  • repo_url: None
  • paper_authors: Guoxing Yang, Jianyu Shi, Zan Wang, Xiaohong Liu, Guangyu Wang
  • for: This paper aims to improve the performance of large language models in the field of Traditional Chinese Medicine (TCM) by proposing a novel domain-specific adaptation approach.
  • methods: The proposed TCM Domain Adaptation (TCMDA) approach uses a large TCM-specific corpus, TCM-Corpus-1B, to pre-train and fine-tune a language model, TCM-GPT-7B, improving its performance on TCM-related tasks. TCMDA leverages the LoRA technique to efficiently train specific dense layers for pre-training and fine-tuning.
  • results: The proposed approach achieves the best performance on two TCM tasks, TCM examination and TCM diagnosis, outperforming other models by relative increments of 17% and 12% in accuracy, respectively. This study represents the pioneering validation of domain adaptation of a large language model with 7 billion parameters in the TCM domain.
    Abstract Pre-training and fine-tuning have emerged as a promising paradigm across various natural language processing (NLP) tasks. The effectiveness of pretrained large language models (LLMs) has witnessed further enhancement, holding potential for applications in the field of medicine, particularly in the context of Traditional Chinese Medicine (TCM). However, the application of these general models to specific domains often yields suboptimal results, primarily due to challenges like lack of domain knowledge, unique objectives, and computational efficiency. Furthermore, their effectiveness in specialized domains, such as Traditional Chinese Medicine, requires comprehensive evaluation. To address the above issues, we propose a novel domain-specific TCMDA (TCM Domain Adaptation) approach: efficient pre-training with a domain-specific corpus. Specifically, we first construct a large TCM-specific corpus, TCM-Corpus-1B, by identifying domain keywords and retrieving from a general corpus. Then, our TCMDA leverages LoRA, which freezes the pretrained model's weights and uses rank decomposition matrices to efficiently train specific dense layers for pre-training and fine-tuning, efficiently aligning the model with TCM-related tasks, yielding TCM-GPT-7B. We further conducted extensive experiments on two TCM tasks, including TCM examination and TCM diagnosis. TCM-GPT-7B achieved the best performance across both datasets, outperforming other models by relative increments of 17% and 12% in accuracy, respectively. To the best of our knowledge, our study represents the pioneering validation of domain adaptation of a large language model with 7 billion parameters in the TCM domain. We will release both TCM-Corpus-1B and the TCM-GPT-7B model once accepted to facilitate interdisciplinary development in TCM and NLP, serving as the foundation for further study.
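
For readers unfamiliar with LoRA, the sketch below shows the rank-decomposition idea the paper builds on: the pretrained weight is frozen and only a low-rank update B @ A is trained. Dimensions, rank, and scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
        super().__init__()
        # Pretrained weight, frozen during adaptation.
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # trainable
        self.scale = alpha / rank

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 512 * 8 = 8192 parameters instead of 512 * 512
```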

Modeling the Uncertainty with Maximum Discrepant Students for Semi-supervised 2D Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.01770
  • repo_url: None
  • paper_authors: Jiaqi Wu, Junbiao Pang, Qingming Huang
  • for: Improving computer-vision performance on semi-supervised 2D pose estimation.
  • methods: Uses a dual mean-teacher framework that constructs two maximum discrepant students (MDSs), and creates multiple uncertainties to assess the quality of pseudo-labels.
  • results: Experimental results show that the method improves semi-supervised pose estimation on three datasets.
    Abstract Semi-supervised pose estimation is a practically challenging task for computer vision. Although numerous excellent semi-supervised classification methods have emerged, these methods typically use confidence to evaluate the quality of pseudo-labels, which is difficult to achieve in pose estimation tasks. For example, in pose estimation, confidence represents only the possibility that a position of the heatmap is a keypoint, not the quality of that prediction. In this paper, we propose a simple yet efficient framework to estimate the quality of pseudo-labels in semi-supervised pose estimation tasks from the perspective of modeling the uncertainty of the pseudo-labels. Concretely, under the dual mean-teacher framework, we construct the two maximum discrepant students (MDSs) to effectively push two teachers to generate different decision boundaries for the same sample. Moreover, we create multiple uncertainties to assess the quality of the pseudo-labels. Experimental results demonstrate that our method improves the performance of semi-supervised pose estimation on three datasets.
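
The mean-teacher machinery behind the two students is standard and easy to sketch: each teacher's weights track an exponential moving average of its student's, so differently initialized or augmented students yield discrepant teachers. The model below is a placeholder with a hypothetical 17 keypoint-heatmap channels.

```python
import copy
import torch
import torch.nn as nn

# Placeholder pose network: 17 output channels as keypoint heatmaps.
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 17, 1))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

# Called after each optimizer step on the student; two student/teacher
# pairs trained this way can be pushed toward discrepant decision boundaries.
ema_update(teacher, student)
```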

Indo LEGO-ABSA: A Multitask Generative Aspect Based Sentiment Analysis for Indonesian Language

  • paper_url: http://arxiv.org/abs/2311.01757
  • repo_url: None
  • paper_authors: Randy Zakya Suchrady, Ayu Purwarianti
  • for: This paper aims to implement a multitask learning and prompting approach for aspect-based sentiment analysis in Bahasa Indonesia using generative pre-trained language models.
  • methods: The Indo LEGO-ABSA model is developed using the LEGO-ABSA framework, which employs the T5 model (specifically mT5) and trains all tasks within aspect-based sentiment analysis using multitask learning.
  • results: The model achieved high accuracy on several tasks within aspect-based sentiment analysis, including Aspect Sentiment Triplet Extraction (f1-score of 79.55%), Unified Aspect-based Sentiment Analysis (86.09%), Aspect Opinion Pair Extraction (79.85%), Aspect Term Extraction (87.45%), and Opinion Term Extraction (88.09%).
    Abstract Aspect-based sentiment analysis is a method in natural language processing aimed at identifying and understanding sentiments related to specific aspects of an entity. Aspects are words or phrases that represent an aspect or attribute of a particular entity. Previous research has utilized generative pre-trained language models to perform aspect-based sentiment analysis. LEGO-ABSA is one framework that has successfully employed generative pre-trained language models in aspect-based sentiment analysis, particularly in English. LEGO-ABSA uses a multitask learning and prompting approach to enhance model performance. However, the application of this approach has not been done in the context of Bahasa Indonesia. Therefore, this research aims to implement the multitask learning and prompting approach in aspect-based sentiment analysis for Bahasa Indonesia using generative pre-trained language models. In this study, the Indo LEGO-ABSA model is developed, which is an aspect-based sentiment analysis model utilizing generative pre-trained language models and trained with multitask learning and prompting. Indo LEGO-ABSA is trained with a hotel domain dataset in the Indonesian language. The obtained results include an f1-score of 79.55% for the Aspect Sentiment Triplet Extraction task, 86.09% for Unified Aspect-based Sentiment Analysis, 79.85% for Aspect Opinion Pair Extraction, 87.45% for Aspect Term Extraction, and 88.09% for Opinion Term Extraction. Indo LEGO-ABSA adopts the LEGO-ABSA framework that employs the T5 model, specifically mT5, by applying multitask learning to train all tasks within aspect-based sentiment analysis.

RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization

  • paper_url: http://arxiv.org/abs/2311.01753
  • repo_url: https://github.com/xmu-rl-3dv/riskq
  • paper_authors: Siqi Shen, Chennan Ma, Chao Li, Weiquan Liu, Yongquan Fu, Songzhu Mei, Xinwang Liu, Cheng Wang
  • for: This paper addresses risk-sensitive multi-agent reinforcement learning, i.e., learning coordinated policies under environmental uncertainty, varying agent policies, and partial observability.
  • methods: The authors introduce the Risk-sensitive Individual-Global-Max (RIGM) principle, which requires that the collection of risk-sensitive action selections of each agent be equivalent to the risk-sensitive action selection of the central policy. They also propose RiskQ, a method that models the joint return distribution of multiple agents while satisfying the RIGM principle.
  • results: Extensive experiments show that RiskQ achieves promising performance across multiple environments. The code is available at https://github.com/xmu-rl-3dv/RiskQ.
    Abstract Multi-agent systems are characterized by environmental uncertainty, varying policies of agents, and partial observability, which result in significant risks. In the context of Multi-Agent Reinforcement Learning (MARL), learning coordinated and decentralized policies that are sensitive to risk is challenging. To formulate the coordination requirements in risk-sensitive MARL, we introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires that the collection of risk-sensitive action selections of each agent should be equivalent to the risk-sensitive action selection of the central policy. Current MARL value factorization methods do not satisfy the RIGM principle for common risk metrics such as the Value at Risk (VaR) metric or distorted risk measurements. Therefore, we propose RiskQ to address this limitation, which models the joint return distribution by modeling quantiles of it as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for the VaR and distorted risk metrics. We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available in https://github.com/xmu-rl-3dv/RiskQ.
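
A minimal numerical sketch of the quantile-mixture idea, with made-up numbers: joint return quantiles are formed as a weighted mixture of per-agent quantile estimates, and a risk metric such as VaR scores each joint action. The mixture weights and shapes are assumptions, not the paper's learned factorization.

```python
import numpy as np

n_agents, n_quantiles, n_actions = 2, 8, 3
taus = (np.arange(n_quantiles) + 0.5) / n_quantiles        # quantile levels

# Sorted per-agent quantile estimates for each (agent, action, quantile).
agent_q = np.sort(np.random.randn(n_agents, n_actions, n_quantiles), axis=-1)
weights = np.array([0.6, 0.4])                             # mixture weights

joint_q = np.einsum("a,anq->nq", weights, agent_q)         # weighted mixture

def var_metric(quantiles, alpha=0.25):
    """Value at Risk: the alpha-level quantile of the return distribution."""
    idx = np.searchsorted(taus, alpha)
    return quantiles[..., idx]

risk_scores = var_metric(joint_q)        # one VaR score per joint action
best_action = int(risk_scores.argmax())  # risk-sensitive greedy selection
```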

Energy Efficiency Optimization for Subterranean LoRaWAN Using A Reinforcement Learning Approach: A Direct-to-Satellite Scenario

  • paper_url: http://arxiv.org/abs/2311.01743
  • repo_url: None
  • paper_authors: Kaiqiang Lin, Muhammad Asad Ullah, Hirley Alves, Konstantin Mikhaylov, Tong Hao
  • for: This paper explores how to exploit subterranean LoRaWAN in non-terrestrial networks (NTN) to deliver economic and societal benefits in remote agriculture and disaster rescue operations.
  • methods: LoRa modulation uses quasi-orthogonal spreading factors (SFs) to balance data rate, airtime, coverage, and energy consumption, but efficiently assigning SFs to end devices in massive subterranean LoRaWAN NTN remains challenging. The paper therefore proposes a reinforcement learning (RL)-based SF allocation scheme, built on multi-agent dueling double deep Q-network (MAD3QN) and multi-agent advantage actor-critic (MAA2C) algorithms, to optimize the system's energy efficiency (EE).
  • results: Compared with four benchmarks, the RL-based SF allocation performs best in the extreme underground direct-to-satellite scenario; in particular, MAD3QN surpasses MAA2C in convergence rate and EE.
    Abstract The integration of subterranean LoRaWAN and non-terrestrial networks (NTN) delivers substantial economic and societal benefits in remote agriculture and disaster rescue operations. The LoRa modulation leverages quasi-orthogonal spreading factors (SFs) to optimize data rates, airtime, coverage and energy consumption. However, it is still challenging to effectively assign SFs to end devices for minimizing co-SF interference in massive subterranean LoRaWAN NTN. To address this, we investigate a reinforcement learning (RL)-based SFs allocation scheme to optimize the system's energy efficiency (EE). To efficiently capture the device-to-environment interactions in dense networks, we proposed an SFs allocation technique using the multi-agent dueling double deep Q-network (MAD3QN) and the multi-agent advantage actor-critic (MAA2C) algorithms based on an analytical reward mechanism. Our proposed RL-based SFs allocation approach evinces better performance compared to four benchmarks in the extreme underground direct-to-satellite scenario. Remarkably, MAD3QN shows promising potential to surpass MAA2C in terms of convergence rate and EE.
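
The full MAD3QN/MAA2C setup is beyond a digest, but the reward structure can be sketched with a stateless bandit: an agent picks a spreading factor and is rewarded by delivered data per unit of airtime energy. The reward model below is a hypothetical stand-in for the paper's analytical mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
SFS = [7, 8, 9, 10, 11, 12]
q = np.zeros(len(SFS))

def ee_reward(sf, link_budget=0.5):
    # Hypothetical trade-off: a higher SF improves delivery odds but
    # roughly doubles airtime (and hence energy) per step.
    p_success = min(1.0, link_budget + 0.1 * (sf - 7))
    airtime_energy = 2.0 ** (sf - 7)
    delivered = rng.random() < p_success
    return (1.0 if delivered else 0.0) / airtime_energy

eps, lr = 0.1, 0.05
for _ in range(5000):
    a = rng.integers(len(SFS)) if rng.random() < eps else int(q.argmax())
    q[a] += lr * (ee_reward(SFS[a]) - q[a])   # stateless bandit update

print("chosen SF:", SFS[int(q.argmax())])
```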

Flexible Error Mitigation of Quantum Processes with Data Augmentation Empowered Neural Model

  • paper_url: http://arxiv.org/abs/2311.01727
  • repo_url: https://github.com/EXPmaster/DAEM
  • paper_authors: Manwen Liao, Yan Zhu, Giulio Chiribella, Yuxiang Yang
  • for: This paper aims to develop an error mitigation method for quantum computing that is practical for real-world applications.
  • methods: It uses a data-augmentation-empowered neural model that requires no prior knowledge of the specific noise type or measurement settings, and estimates noise-free statistics directly from the noisy measurement results of the target quantum process.
  • results: In numerical experiments, the model mitigates various types of noise, including Markovian and non-Markovian noise, more effectively than previous error mitigation methods. It also applies to diverse quantum processes, including large-scale quantum systems and continuous-variable quantum states, providing a solid foundation for more reliable and robust quantum technologies.
    Abstract Neural networks have shown their effectiveness in various tasks in the realm of quantum computing. However, their application in quantum error mitigation, a crucial step towards realizing practical quantum advancements, has been restricted by reliance on noise-free statistics. To tackle this critical challenge, we propose a data augmentation empowered neural model for error mitigation (DAEM). Our model does not require any prior knowledge about the specific noise type and measurement settings and can estimate noise-free statistics solely from the noisy measurement results of the target quantum process, rendering it highly suitable for practical implementation. In numerical experiments, we show the model's superior performance in mitigating various types of noise, including Markovian noise and Non-Markovian noise, compared with previous error mitigation methods. We further demonstrate its versatility by employing the model to mitigate errors in diverse types of quantum processes, including those involving large-scale quantum systems and continuous-variable quantum states. This powerful data augmentation-empowered neural model for error mitigation establishes a solid foundation for realizing more reliable and robust quantum technologies in practical applications.

Towards Calibrated Robust Fine-Tuning of Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.01723
  • repo_url: None
  • paper_authors: Changdae Oh, Mijoo Kim, Hyesu Lim, Junhyeok Park, Euiseog Jeong, Zhi-Qi Cheng, Kyungwoo Song
  • for: This paper focuses on the problem of calibration and robustness in fine-tuning pre-trained vision-language models (VLMs) under distribution shift.
  • methods: The authors propose a simple approach called calibrated robust fine-tuning (CaRot) that incentivizes the calibration and robustness of pre-trained VLMs on both in-distribution (ID) and out-of-distribution (OOD) datasets.
  • results: The authors show that their proposed method, CaRot, effectively improves the calibration and robustness of pre-trained VLMs on OOD datasets, as verified by empirical results on ImageNet-1K distribution shift evaluation.
    Abstract While fine-tuning unleashes the potential of a pre-trained model to a specific task, it trades off the model's generalization capability on out-of-distribution (OOD) datasets. To mitigate this, robust fine-tuning aims to ensure performance on OOD datasets as well as an in-distribution (ID) dataset for which the model is being tuned. However, another criterion for reliable machine learning (ML), confidence calibration, has been overlooked despite its increasing demand for real-world high-stakes ML applications (e.g., autonomous driving and medical diagnosis). For the first time, we raise concerns about the calibration of fine-tuned vision-language models (VLMs) under distribution shift by showing that naive fine-tuning and even state-of-the-art robust fine-tuning methods hurt the calibration of pre-trained VLMs, especially on OOD datasets. To address this, we provide a simple approach, called a calibrated robust fine-tuning (CaRot) that incentivizes the calibration and robustness on both ID and OOD datasets. Empirical results on ImageNet-1K distribution shift evaluation verify the effectiveness of our method.
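
Calibration here is typically quantified with the expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence to its accuracy; a minimal sketch with toy data follows. The binning scheme and data are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = np.abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the bin's share of samples
    return ece

conf = np.random.rand(1000)
correct = (np.random.rand(1000) < conf).astype(float)  # well-calibrated toy data
print(expected_calibration_error(conf, correct))        # should be close to 0
```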

An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction

  • paper_url: http://arxiv.org/abs/2311.01713
  • repo_url: None
  • paper_authors: Junxian Zhou, Haiqin Yang, Ye Junpeng, Yuxuan He, Hao Mou
  • for: expanding the capacity of aspect-level sentiment analysis
  • methods: constructing two large Chinese ASQP datasets and evaluating the performance of GPT series models
  • results: highlighting the need for additional techniques to address ASQP and the potential issues with using GPTs
    Abstract Aspect sentiment quad prediction (ASQP) is a critical subtask of aspect-level sentiment analysis. Current ASQP datasets are characterized by their small size and low quadruple density, which hinders technical development. To expand capacity, we construct two large Chinese ASQP datasets crawled from multiple online platforms. The datasets hold several significant characteristics: larger size (each with 10,000+ samples) and rich aspect categories, more words per sentence, and higher density than existing ASQP datasets. Moreover, we are the first to evaluate the performance of Generative Pre-trained Transformer (GPT) series models on ASQP and exhibit potential issues. The experiments with state-of-the-art ASQP baselines underscore the need to explore additional techniques to address ASQP, as well as the importance of further investigation into methods to improve the performance of GPTs.

Data-Free Distillation of Language Model by Text-to-Text Transfer

  • paper_url: http://arxiv.org/abs/2311.01689
  • repo_url: None
  • paper_authors: Zheyuan Bai, Xinduo Liu, Hailin Hu, Tianyu Guo, Qinghua Zhang, Yunhe Wang
  • for: This work proposes a data-free knowledge distillation (DFKD) framework built on generative language models to improve the specificity and diversity of distillation data for model compression.
  • methods: The proposed framework, DFKD-T$^{3}$, uses a pretrained generative language model as a controllable data generator in an end-to-end learnable text-to-text pipeline that transforms a general-domain corpus into compression-friendly task data.
  • results: The method boosts distillation performance on downstream tasks such as sentiment analysis, linguistic acceptability, and information extraction, and the generated texts can be directly used to distill other language models, outperforming SOTA methods.
    Abstract Data-Free Knowledge Distillation (DFKD) plays a vital role in compressing the model when original training data is unavailable. Previous works for DFKD in NLP mainly focus on distilling encoder-only structures like BERT on classification tasks, which overlook the notable progress of generative language modeling. In this work, we propose a novel DFKD framework, namely DFKD-T$^{3}$, where the pretrained generative language model can also serve as a controllable data generator for model compression. This novel framework DFKD-T$^{3}$ leads to an end-to-end learnable text-to-text framework to transform the general domain corpus to compression-friendly task data, targeting to improve both the \textit{specificity} and \textit{diversity}. Extensive experiments show that our method can boost the distillation performance in various downstream tasks such as sentiment analysis, linguistic acceptability, and information extraction. Furthermore, we show that the generated texts can be directly used for distilling other language models and outperform the SOTA methods, making our method more appealing in a general DFKD setting. Our code is available at https://gitee.com/mindspore/models/tree/master/research/nlp/DFKD\_T3.

The R.O.A.D. to precision medicine

  • paper_url: http://arxiv.org/abs/2311.01681
  • repo_url: None
  • paper_authors: Dimitris Bertsimas, Angelos G. Koulouras, Georgios Antonios Margonis
  • for: This study addresses the deficiencies of subgroup analysis of randomized trial data and transforms observational data so it can be used as if it were randomized, paving the road for precision medicine.
  • methods: The approach includes a novel two-step process that corrects the estimated probabilities of the outcome under a treatment, and then uses these probabilities to train Optimal Policy Trees (OPTs) that assign treatments to subgroups of patients based on their characteristics.
  • results: The resulting recommendations outperformed those of experts in gastrointestinal stromal tumors (GIST) and extremity sarcomas. In addition, the framework identified a subset of patients with unique characteristics who may not require treatment, a finding validated in an external cohort.
    Abstract We propose a prognostic stratum matching framework that addresses the deficiencies of Randomized trial data subgroup analysis and transforms ObservAtional Data to be used as if they were randomized, thus paving the road for precision medicine. Our approach counters the effects of unobserved confounding in observational data by correcting the estimated probabilities of the outcome under a treatment through a novel two-step process. These probabilities are then used to train Optimal Policy Trees (OPTs), which are decision trees that optimally assign treatments to subgroups of patients based on their characteristics. This facilitates the creation of clinically intuitive treatment recommendations. We applied our framework to observational data of patients with gastrointestinal stromal tumors (GIST) and validated the OPTs in an external cohort using the sensitivity and specificity metrics. We show that these recommendations outperformed those of experts in GIST. We further applied the same framework to randomized clinical trial (RCT) data of patients with extremity sarcomas. Remarkably, despite the initial trial results suggesting that all patients should receive treatment, our framework, after addressing imbalances in patient distribution due to the trial's small sample size, identified through the OPTs a subset of patients with unique characteristics who may not require treatment. Again, we successfully validated our recommendations in an external cohort.

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

  • paper_url: http://arxiv.org/abs/2311.01677
  • repo_url: None
  • paper_authors: Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Zhongyuan Wang, Kun Gai
  • for: Evaluating the human-likeness of large language models (LLMs) as dialogue systems.
  • methods: Proposes DialogBench, a dialogue evaluation benchmark for assessing the capabilities that LLMs should have as human-like dialogue systems.
  • results: Experiments show that instruction fine-tuning improves the human-likeness of LLMs to a certain extent, but most LLMs still have considerable room for improvement as human-like dialogue systems.
    Abstract Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities, refreshing human's impressions on dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users by satisfying the need for communication, affection and social belonging. Therefore, there has been an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that currently contains $12$ dialogue tasks to assess the capabilities that LLMs, as human-like dialogue systems, should have. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design the basic prompt based on widely-used design principles and further mitigate the existing biases to generate higher-quality evaluation instances. Our extensive test over $28$ LLMs (including pre-trained and supervised instruction-tuned models) shows that instruction fine-tuning improves the human likeness of LLMs to a certain extent, but there is still much room to improve those capabilities for most LLMs as human-like dialogue systems. In addition, experimental results also indicate that LLMs perform differently in various abilities that human-like dialogue systems should have. We will publicly release DialogBench, along with the associated evaluation code for the broader research community.

MineSegSAT: An automated system to evaluate mining disturbed area extents from Sentinel-2 imagery

  • paper_url: http://arxiv.org/abs/2311.01676
  • repo_url: None
  • paper_authors: Ezra MacDonald, Derek Jacoby, Yvonne Coady
  • for: This paper aims to assess the environmental impact of the mineral extraction industry, to better understand and mitigate the ecological consequences of mining activities.
  • methods: It uses the SegFormer deep learning segmentation framework, trained on Sentinel-2 data, to classify environmentally impacted areas of mining sites into land-cover classes.
  • results: The model is trained with several loss functions (Dice, Tversky, and Lovasz) and used for inference over the test region in the following year to identify potential areas of expansion or contraction of mining-disturbed land.
    Abstract Assessing the environmental impact of the mineral extraction industry plays a critical role in understanding and mitigating the ecological consequences of extractive activities. This paper presents MineSegSAT, a model that presents a novel approach to predicting environmentally impacted areas of mineral extraction sites using the SegFormer deep learning segmentation architecture trained on Sentinel-2 data. The data was collected from non-overlapping regions over Western Canada in 2021 containing areas of land that have been environmentally impacted by mining activities that were identified from high-resolution satellite imagery in 2021. The SegFormer architecture, a state-of-the-art semantic segmentation framework, is employed to leverage its advanced spatial understanding capabilities for accurate land cover classification. We investigate the efficacy of the Dice, Tversky, and Lovasz loss functions. The trained model was utilized for inference over the test region in the ensuing year to identify potential areas of expansion or contraction over these same periods. The Sentinel-2 data is made available on Amazon Web Services through a collaboration with Earth Daily Analytics which provides corrected and tiled analytics-ready data on the AWS platform. The model and ongoing API to access the data on AWS allow the creation of an automated tool to monitor the extent of disturbed areas surrounding known mining sites to ensure compliance with their environmental impact goals.
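
Two of the compared losses are compact enough to sketch. Tversky generalizes Dice by reweighting false positives and false negatives (alpha = beta = 0.5 recovers Dice); the shapes below are illustrative and assume a single class.

```python
import torch

def tversky_loss(probs, target, alpha=0.5, beta=0.5, eps=1e-6):
    # probs, target: (batch, H, W) with values in [0, 1] for a single class.
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def dice_loss(probs, target):
    return tversky_loss(probs, target, alpha=0.5, beta=0.5)

probs = torch.rand(4, 128, 128)
target = (torch.rand(4, 128, 128) > 0.7).float()
print(dice_loss(probs, target), tversky_loss(probs, target, alpha=0.3, beta=0.7))
```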

Deep Learning-driven Community Resilience Rating based on Intertwined Socio-Technical Systems Features

  • paper_url: http://arxiv.org/abs/2311.01661
  • repo_url: None
  • paper_authors: Kai Yin, Ali Mostafavi
  • for: This paper aims to improve the assessment and rating of community resilience.
  • methods: It uses a three-layer deep learning model, Resili-Net, to rate community resilience levels from measurable features of intertwined socio-technical systems.
  • results: Using publicly accessible data from multiple US metropolitan statistical areas, Resili-Net characterizes the resilience of spatial areas into five distinct levels. The interpretability of the model outcomes also supports analysis of changes in community resilience profiles, guiding specific resilience enhancement strategies.
    Abstract Community resilience is a complex and multi-faceted phenomenon that emerges from complex and nonlinear interactions among different socio-technical systems and their resilience properties. However, present studies on community resilience focus primarily on vulnerability assessment and utilize index-based approaches, with limited ability to capture heterogeneous features within community socio-technical systems and their nonlinear interactions in shaping robustness, redundancy, and resourcefulness components of resilience. To address this gap, this paper presents an integrated three-layer deep learning model for community resilience rating (called Resili-Net). Twelve measurable resilience features are specified and computed within community socio-technical systems (i.e., facilities, infrastructures, and society) related to three resilience components of robustness, redundancy, and resourcefulness. Using publicly accessible data from multiple metropolitan statistical areas in the United States, Resili-Net characterizes the resilience levels of spatial areas into five distinct levels. The interpretability of the model outcomes enables feature analysis for specifying the determinants of resilience in areas within each resilience level, allowing for the identification of specific resilience enhancement strategies. Changes in community resilience profiles under urban development patterns are further examined by changing the value of related socio-technical systems features. Accordingly, the outcomes provide novel perspectives for community resilience assessment by harnessing machine intelligence and heterogeneous urban big data.

MARRS: Multimodal Reference Resolution System

  • paper_url: http://arxiv.org/abs/2311.01650
  • repo_url: None
  • paper_authors: Halim Cagri Ates, Shruti Bhargava, Site Li, Jiarui Lu, Siddhardha Maddula, Joel Ruben Antony Moniz, Anil Kumar Nalamalapu, Roman Hoang Nguyen, Melis Ozyildirim, Alkesh Patel, Dhivya Piraviperumal, Vincent Renkens, Ankit Samal, Thy Tran, Bo-Hsiang Tseng, Hong Yu, Yuan Zhang, Rong Zou
  • for: This paper describes the Multimodal Reference Resolution System (MARRS), an on-device framework for handling context within a natural language understanding system.
  • methods: It uses different machine learning models to handle context, including a reference resolution model and a model that handles context via query rewriting.
  • results: The paper describes how MARRS handles conversational, visual, and background context while preserving user privacy.
    Abstract Successfully handling context is essential for any dialog understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual and background context. In particular, we present different machine learning models to enable handling contextual queries; specifically, one to enable reference resolution, and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.

cs.CL - 2023-11-03

Grounded Intuition of GPT-Vision’s Abilities with Scientific Images

  • paper_url: http://arxiv.org/abs/2311.02069
  • repo_url: https://github.com/ahwang16/grounded-intuition-gpt-vision
  • paper_authors: Alyssa Hwang, Andrew Head, Chris Callison-Burch
  • for: This study aims to help researchers develop a grounded intuition of the abilities and limitations of the new GPT-Vision model.
  • methods: It draws on grounded theory and thematic analysis from social science and human-computer interaction to establish a rigorous framework for qualitative evaluation in natural language processing.
  • results: Examining alt-text generation for scientific figures, the study finds that GPT-Vision is particularly sensitive to prompting, counterfactual text in images, and relative spatial relationships. The method and analysis help researchers build grounded intuitions of new models while exploring how GPT-Vision can make information more accessible.
    Abstract GPT-Vision has impressed us on a range of vision-language tasks, but it comes with the familiar new challenge: we have little idea of its capabilities and limitations. In our study, we formalize a process that many have instinctively been trying already to develop "grounded intuition" of this new model. Inspired by the recent movement away from benchmarking in favor of example-driven qualitative evaluation, we draw upon grounded theory and thematic analysis in social science and human-computer interaction to establish a rigorous framework for qualitative evaluation in natural language processing. We use our technique to examine alt text generation for scientific figures, finding that GPT-Vision is particularly sensitive to prompting, counterfactual text in images, and relative spatial relationships. Our method and analysis aim to help researchers ramp up their own grounded intuitions of new models while exposing how GPT-Vision can be applied to make information more accessible.

Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection

  • paper_url: http://arxiv.org/abs/2311.02025
  • repo_url: None
  • paper_authors: Gretel Liz De la Peña Sarracén, Paolo Rosso, Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto
  • for: This work aims to improve few-shot cross-lingual abusive language detection through data augmentation and continual pre-training for domain adaptation.
  • methods: It analyzes two existing data augmentation techniques based on vicinal risk minimization and proposes MIXAG, a novel data augmentation method that interpolates pairs of instances based on the angle of their representations.
  • results: Experiments across seven languages and three domains show that the data augmentation strategies enhance few-shot cross-lingual abusive language detection, especially in multidomain and multilingual environments.
    Abstract Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method which interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that consistently in all target languages, MIXAG improves significantly in multidomain and multilingual environments. Finally, we show through an error analysis how the domain adaptation can favour the class of abusive texts (reducing false negatives), but at the same time, declines the precision of the abusive language detection model.
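
The exact MIXAG formula is in the paper; as a loose sketch of the idea, the snippet below derives a mixing coefficient from the angle between two instance representations and interpolates both features and labels mixup-style. The coefficient choice is an assumption for illustration only.

```python
import numpy as np

def mixag_pair(x1, x2, y1, y2):
    cosine = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2) + 1e-12)
    theta = np.arccos(np.clip(cosine, -1.0, 1.0))  # angle between representations
    lam = 1.0 - theta / np.pi                      # hypothetical coefficient choice
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2                # soft label
    return x, y

x_new, y_new = mixag_pair(np.random.randn(768), np.random.randn(768), 1.0, 0.0)
```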

ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models

  • paper_url: http://arxiv.org/abs/2311.01981
  • repo_url: None
  • paper_authors: Haotian Luo, Kunming Wu, Cheng Dai, Sixian Ding, Xinhao Chen
  • for: 解决RNN类语言模型在生成过程中忘记提示的问题
  • methods: 使用合成梯度(synthetic gradient)教模型在生成过程中记忆提示(低秩参数修改的示意见下文)
  • results: 实验结果表明,该方法能有效缓解语言模型在生成过程中忘记提示的问题
    Abstract RNN-like language models are getting renewed attention from NLP researchers in recent years and several models have made significant progress, which demonstrates performance comparable to traditional transformers. However, due to the recurrent nature of RNNs, this kind of language model can only store information in a set of fixed-length state vectors. As a consequence, they still suffer from forgetfulness though after a lot of improvements and optimizations, when given complex instructions or prompts. As the prompted generation is the main and most concerned function of LMs, solving the problem of forgetting in the process of generation is no wonder of vital importance. In this paper, focusing on easing the prompt forgetting during generation, we proposed an architecture to teach the model memorizing prompt during generation by synthetic gradient. To force the model to memorize the prompt, we derive the states that encode the prompt, then transform it into model parameter modification using low-rank gradient approximation, which hard-codes the prompt into model parameters temporarily. We construct a dataset for experiments, and the results have demonstrated the effectiveness of our method in solving the problem of forgetfulness in the process of prompted generation. We will release all the code upon acceptance.
    摘要 在这篇论文中,我们关注在生成过程中缓解提示忘记的问题,提出了一种通过合成梯度教模型在生成过程中记忆提示的架构。我们首先提取编码提示的状态,然后使用低秩梯度近似将其转换为模型参数的修改,从而将提示暂时硬编码到模型参数中。我们构建了一个数据集并进行了实验,结果表明我们的方法能有效解决提示生成过程中的忘记问题。代码将在论文录用后发布。
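
A minimal sketch of the core idea, turning a prompt-encoding state into a temporary low-rank parameter modification, is given below. The alignment loss, SVD-based low-rank approximation, and scaling factor are assumptions for illustration; the paper derives its modification from synthetic gradients.

```python
import torch

def lowrank_prompt_patch(W, h_prompt, target, rank=1, scale=1e-2):
    # Hypothetical sketch (not the paper's exact construction): derive a
    # temporary low-rank modification of a weight matrix from the state
    # that encodes the prompt, "hard-coding" the prompt into the
    # parameters for the current generation.
    W = W.detach().requires_grad_(True)
    loss = torch.nn.functional.mse_loss(h_prompt @ W.T, target)
    (g,) = torch.autograd.grad(loss, W)
    U, S, Vh = torch.linalg.svd(g, full_matrices=False)  # low-rank view of the gradient
    delta = (U[:, :rank] * S[:rank]) @ Vh[:rank]          # rank-`rank` approximation
    return W.detach() - scale * delta                     # patched weights
```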

Too Much Information: Keeping Training Simple for BabyLMs

  • paper_url: http://arxiv.org/abs/2311.01955
  • repo_url: None
  • paper_authors: Lukas Edman, Lisa Bylinina
  • for: 这篇论文描述了格罗宁根大学对 BabyLM 挑战的工作。
  • methods: 我们遵循如同婴儿一样、先让语言模型接触较简单的概念、再在此基础上理解更复杂概念的思路。我们从上下文长度、词汇量和数据的总体语言复杂度等角度检验这种先简后繁的策略。
  • results: 我们发现只有上下文长度对语言模型训练真正有益,但仅这一简单改动就使我们在(Super)GLUE任务上平均提高2分,在MSGS任务上平均提高1分,在BLiMP任务上平均提高12%。我们的受限上下文模型优于使用10倍数据训练的基线模型。
    Abstract This paper details the work of the University of Groningen for the BabyLM Challenge. We follow the idea that, like babies, language models should be introduced to simpler concepts first and build off of that knowledge to understand more complex concepts. We examine this strategy of simple-then-complex through a variety of lenses, namely context size, vocabulary, and overall linguistic complexity of the data. We find that only one, context size, is truly beneficial to training a language model. However this simple change to context size gives us improvements of 2 points on average on (Super)GLUE tasks, 1 point on MSGS tasks, and 12\% on average on BLiMP tasks. Our context-limited model outperforms the baseline that was trained on 10$\times$ the amount of data.
    摘要 这份论文介绍了格罗宁根大学参加BabyLM挑战的工作。我们采用婴儿式学习策略,即先让语言模型学习简单概念,再逐步在此基础上理解更复杂的概念。我们从上下文长度、词汇量和总体语言复杂度等角度检验这种先简后繁的策略。我们发现只有上下文长度真正有利于语言模型训练,但这一简单改动使我们在(Super)GLUE任务上平均提高2分,MSGS任务上平均提高1分,BLiMP任务上平均提高12%。我们的受限上下文模型优于使用10倍数据训练的基线模型。

Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks

  • paper_url: http://arxiv.org/abs/2311.01949
  • repo_url: None
  • paper_authors: Yifan Wang, Qingyan Guo, Xinzhe Ni, Chufan Shi, Lemao Liu, Haiyun Jiang, Yujiu Yang
  • for: 提高大语言模型(LLM)在知识密集任务、尤其是开放领域问答任务中的表现。
  • methods: 提出Hint-enhanced In-Context Learning(HICL)新范式,利用LLM的推理能力从示例中提取与问题相关的知识,再将这些知识作为更显式的提示进行拼接;同时追踪知识来源以定位具体示例,并引入Hint-related Example Retriever(HER)来选择信息量大的示例(提示构造的示意见下文)。
  • results: 在3个开放领域问答基准上评估,与标准设置相比,HICL加HER在gpt-3.5-turbo上平均提升2.89 EM分和2.52 F1分,在LLaMA-2-Chat-7B上提升7.62 EM分和7.27 F1分。
    Abstract In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs), enabling them to learn input-label mappings from demonstrations and perform well on downstream tasks. However, under the standard ICL setting, LLMs may sometimes neglect query-related information in demonstrations, leading to incorrect predictions. To address this limitation, we propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering, an important form in knowledge-intensive tasks. HICL leverages LLMs' reasoning ability to extract query-related knowledge from demonstrations, then concatenates the knowledge to prompt LLMs in a more explicit way. Furthermore, we track the source of this knowledge to identify specific examples, and introduce a Hint-related Example Retriever (HER) to select informative examples for enhanced demonstrations. We evaluate HICL with HER on 3 open-domain QA benchmarks, and observe average performance gains of 2.89 EM score and 2.52 F1 score on gpt-3.5-turbo, 7.62 EM score and 7.27 F1 score on LLaMA-2-Chat-7B compared with standard setting.
    摘要 随着大语言模型(LLM)规模的增长,上下文学习(ICL)能力随之出现,使LLM能够从示例中学习输入-标签映射,并在下游任务中表现良好。然而,在标准ICL设置下,LLM有时会忽略示例中与查询相关的信息,导致错误预测。为了解决这一局限,我们提出了一种名为Hint-enhanced In-Context Learning(HICL)的新范式,以探索ICL在开放领域问答这一知识密集任务中的能力。HICL利用LLM的推理能力从示例中提取与查询相关的知识,再将这些知识拼接起来,以更显式的方式提示LLM。此外,我们追踪这些知识的来源以定位具体示例,并引入Hint-related Example Retriever(HER)来选择信息量大的示例以增强演示。我们在3个开放领域问答基准上评估HICL与HER,相比标准设置,在gpt-3.5-turbo上平均提升2.89 EM分和2.52 F1分,在LLaMA-2-Chat-7B上提升7.62 EM分和7.27 F1分。
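
As a rough sketch of the prompt-assembly step, assuming a helper `extract_hint` that stands in for the LLM call which pulls query-related knowledge out of a demonstration (the template and interfaces are assumptions):

```python
def build_hicl_prompt(question, demos, extract_hint):
    # Sketch of Hint-enhanced ICL prompt assembly; `extract_hint(question, demo)`
    # is an assumed interface for the knowledge-extraction LLM call.
    hints = [extract_hint(question, d) for d in demos]
    demo_block = "\n\n".join(
        f"Q: {d['question']}\nHint: {h}\nA: {d['answer']}"
        for d, h in zip(demos, hints))
    return f"{demo_block}\n\nQ: {question}\nHint: {'; '.join(hints)}\nA:"
```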

Constructing Temporal Dynamic Knowledge Graphs from Interactive Text-based Games

  • paper_url: http://arxiv.org/abs/2311.01928
  • repo_url: https://github.com/yukw777/temporal-discrete-graph-updater
  • paper_authors: Keunwoo Peter Yu
  • for: 这篇论文提出一种新的图更新器模型,以改进对文本游戏中动态知识图的表示与学习。
  • methods: 该模型将动态知识图表示为一系列带时间戳的图事件,并用基于时间点过程的图神经网络对其建模,以提高知识图的准确性和可解释性(事件表示的示意代码见下文)。
  • results: 在TextWorld数据集上的实验表明TDGU优于基线DGU模型;消融研究验证了时间信息的重要性,并展示了TDGU在含同名物体的更复杂环境中的泛化能力。
    Abstract In natural language processing, interactive text-based games serve as a test bed for interactive AI systems. Prior work has proposed to play text-based games by acting based on discrete knowledge graphs constructed by the Discrete Graph Updater (DGU) to represent the game state from the natural language description. While DGU has shown promising results with high interpretability, it suffers from lower knowledge graph accuracy due to its lack of temporality and limited generalizability to complex environments with objects with the same label. In order to address DGU's weaknesses while preserving its high interpretability, we propose the Temporal Discrete Graph Updater (TDGU), a novel neural network model that represents dynamic knowledge graphs as a sequence of timestamped graph events and models them using a temporal point based graph neural network. Through experiments on the dataset collected from a text-based game TextWorld, we show that TDGU outperforms the baseline DGU. We further show the importance of temporal information for TDGU's performance through an ablation study and demonstrate that TDGU has the ability to generalize to more complex environments with objects with the same label. All the relevant code can be found at \url{https://github.com/yukw777/temporal-discrete-graph-updater}.
    摘要 在自然语言处理领域,基于文本的交互式游戏是交互式AI系统的测试平台。先前的工作提出使用Discrete Graph Updater(DGU)从自然语言描述构建离散知识图来表示游戏状态。尽管DGU具有较高的可解释性并取得了不错的效果,但由于缺乏时间信息,其知识图准确率较低,且难以泛化到含同名物体的复杂环境。为了在保持高可解释性的同时克服DGU的缺陷,我们提出了Temporal Discrete Graph Updater(TDGU),一种将动态知识图表示为带时间戳的图事件序列、并用基于时间点过程的图神经网络建模的新模型。在TextWorld数据集上的实验表明TDGU优于基线DGU。我们还通过消融研究证明了时间信息对TDGU性能的重要性,并展示了TDGU在含同名物体的更复杂环境中的泛化能力。所有相关代码见 \url{https://github.com/yukw777/temporal-discrete-graph-updater}。
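
The event-sequence representation itself is easy to make concrete. The sketch below shows a minimal timestamped graph-event encoding and its replay into a current knowledge graph; the event schema is an assumption, and the temporal point-based GNN that TDGU trains on top of such sequences is omitted.

```python
from dataclasses import dataclass

@dataclass
class GraphEvent:
    # One timestamped event in a dynamic knowledge graph; the edge-add /
    # edge-delete schema is an assumption for illustration.
    t: float
    op: str            # "add" or "delete"
    src: str
    rel: str
    dst: str

def replay(events):
    # Reconstruct the current KG from a timestamped event sequence --
    # the representation TDGU models (the GNN itself is omitted here).
    edges = set()
    for e in sorted(events, key=lambda e: e.t):
        triple = (e.src, e.rel, e.dst)
        if e.op == "add":
            edges.add(triple)
        else:
            edges.discard(triple)
    return edges
```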

BoschAI @ PLABA 2023: Leveraging Edit Operations in End-to-End Neural Sentence Simplification

  • paper_url: http://arxiv.org/abs/2311.01907
  • repo_url: None
  • paper_authors: Valentin Knappich, Simon Razniewski, Annemarie Friedrich
  • for: 这篇论文提出一种基于Llama 2的自动文本简化系统,帮助非专业人员理解复杂的生物医学文献。
  • methods: 该系统用语言模型将复杂语言改写为简单语言。论文提出句子级和词元级损失权重,为被修改的词元赋予更高权重,以缓解输入与输出大量词元重叠造成的弱训练信号和过于保守的编辑行为(示意代码见下文)。
  • results: 实验表明,该方法生成的简化文本更接近人工标注(+1.8% / +3.5% SARI)、语言更简单(-1 / -1.1 FKGL)、编辑更多(1.6x / 1.8x 编辑距离),优于用标准交叉熵微调的同一模型。论文还表明词元级损失权重中的超参数 $\lambda$ 可用于控制编辑距离和简化程度(FKGL)。
    Abstract Automatic simplification can help laypeople to comprehend complex scientific text. Language models are frequently applied to this task by translating from complex to simple language. In this paper, we describe our system based on Llama 2, which ranked first in the PLABA shared task addressing the simplification of biomedical text. We find that the large portion of shared tokens between input and output leads to weak training signals and conservatively editing models. To mitigate these issues, we propose sentence-level and token-level loss weights. They give higher weight to modified tokens, indicated by edit distance and edit operations, respectively. We conduct an empirical evaluation on the PLABA dataset and find that both approaches lead to simplifications closer to those created by human annotators (+1.8% / +3.5% SARI), simpler language (-1 / -1.1 FKGL) and more edits (1.6x / 1.8x edit distance) compared to the same model fine-tuned with standard cross entropy. We furthermore show that the hyperparameter $\lambda$ in token-level loss weights can be used to control the edit distance and the simplicity level (FKGL).
    摘要 自动简化可以帮助非专业人员理解复杂的科学文本。语言模型常用于这一任务,将复杂语言改写为简单语言。在这篇论文中,我们描述了基于Llama 2的系统,该系统在面向生物医学文本简化的PLABA共享任务中排名第一。我们发现输入和输出之间大量共享的词元会导致弱训练信号和过于保守的编辑模型。为了解决这些问题,我们提出句子级和词元级损失权重,分别依据编辑距离和编辑操作,为被修改的词元赋予更高权重。我们在PLABA数据集上进行了实验评估,发现两种方法都能生成更接近人工标注的简化文本(+1.8% / +3.5% SARI)、更简单的语言(-1 / -1.1 FKGL)和更多的编辑(1.6x / 1.8x 编辑距离),优于用标准交叉熵微调的同一模型。我们还发现词元级损失权重中的超参数 $\lambda$ 可用于控制编辑距离和简化程度(FKGL)。
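
A minimal sketch of token-level loss weighting in this spirit follows; the exact weighting function and the parameter `lam` (playing the role of the paper's $\lambda$) are assumptions.

```python
import torch.nn.functional as F

def weighted_token_ce(logits, labels, edited_mask, lam=2.0):
    # Sketch of token-level loss weighting (exact weighting function assumed):
    # tokens flagged as modified -- e.g. by aligning source and target via
    # edit operations -- get weight `lam`, unchanged tokens weight 1, so the
    # training signal concentrates on actual edits instead of copied tokens.
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none")
    weights = 1.0 + (lam - 1.0) * edited_mask.view(-1).float()
    return (weights * per_token).sum() / weights.sum()
```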

Indicative Summarization of Long Discussions

  • paper_url: http://arxiv.org/abs/2311.01882
  • repo_url: https://github.com/webis-de/emnlp-23
  • paper_authors: Shahbaz Syed, Dominik Schwabe, Khalid Al-Khatib, Martin Potthast
  • for: 提出一种新颖的无监督方法,使用大型语言模型(LLM)为长讨论生成起目录作用的指示性摘要,帮助用户快速浏览和理解长讨论。
  • methods: 方法首先对讨论中的论点句子进行聚类,然后生成聚类标签作为抽象摘要,最后将聚类标签分类到论辩框架,形成两级摘要(流程示意见下文)。
  • results: 基于充分优化的提示工程方法,我们评测了19个LLM的聚类标签生成与框架分类能力,并通过用户研究表明,所提出的指示性摘要可以帮助用户快速浏览和理解长讨论。
    Abstract Online forums encourage the exchange and discussion of different stances on many topics. Not only do they provide an opportunity to present one's own arguments, but may also gather a broad cross-section of others' arguments. However, the resulting long discussions are difficult to overview. This paper presents a novel unsupervised approach using large language models (LLMs) to generating indicative summaries for long discussions that basically serve as tables of contents. Our approach first clusters argument sentences, generates cluster labels as abstractive summaries, and classifies the generated cluster labels into argumentation frames resulting in a two-level summary. Based on an extensively optimized prompt engineering approach, we evaluate 19~LLMs for generative cluster labeling and frame classification. To evaluate the usefulness of our indicative summaries, we conduct a purpose-driven user study via a new visual interface called Discussion Explorer: It shows that our proposed indicative summaries serve as a convenient navigation tool to explore long discussions.
    摘要 在线论坛鼓励就各类话题交流和讨论不同立场。它们不仅为用户提供展示自己论点的机会,还能汇集各方的论点。然而,由此产生的长篇讨论难以概览。本文提出一种新颖的无监督方法,使用大型语言模型(LLM)为长讨论生成起目录作用的指示性摘要。我们的方法首先对论点句子进行聚类,生成聚类标签作为抽象摘要,再将聚类标签分类到论辩框架,形成两级摘要。基于充分优化的提示工程方法,我们评测了19个LLM的聚类标签生成与框架分类能力。为评估指示性摘要的有用性,我们通过名为Discussion Explorer的新可视化界面开展了目的驱动的用户研究:结果表明,所提出的指示性摘要是浏览长讨论的便捷导航工具。
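
The two-level pipeline can be sketched as below, with `embed`, `label_cluster`, and `classify_frame` as assumed interfaces (an encoder plus two LLM calls) rather than the paper's actual prompts:

```python
from sklearn.cluster import KMeans

def indicative_summary(sentences, embed, label_cluster, classify_frame, k=8):
    # Sketch of the two-level unsupervised pipeline: cluster argument
    # sentences, label each cluster abstractively, then map labels to
    # argumentation frames. All three callables are assumed interfaces.
    X = embed(sentences)                                   # (n, d) embeddings
    clusters = KMeans(n_clusters=k).fit_predict(X)
    summary = {}
    for c in range(k):
        members = [s for s, cid in zip(sentences, clusters) if cid == c]
        label = label_cluster(members)                     # abstractive cluster label
        summary.setdefault(classify_frame(label), []).append(label)
    return summary                                         # frame -> cluster labels
```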

Sentiment Analysis through LLM Negotiations

  • paper_url: http://arxiv.org/abs/2311.01876
  • repo_url: None
  • paper_authors: Xiaofei Sun, Xiaoya Li, Shengyu Zhang, Shuhe Wang, Fei Wu, Jiwei Li, Tianwei Zhang, Guoyin Wang
  • for: This paper aims to improve the accuracy of sentiment analysis by introducing a multi-LLM negotiation framework that leverages the complementary abilities of multiple language models to generate more accurate and well-reasoned decisions.
  • methods: The proposed framework consists of a reasoning-infused generator and an explanation-deriving discriminator, which iterate until a consensus is reached. The generator provides decisions along with rationale, while the discriminator evaluates the credibility of the generator’s decisions.
  • results: The proposed approach consistently outperforms the in-context learning (ICL) baseline across all benchmarks, and even achieves superior performances compared to supervised baselines on the Twitter and movie review datasets.
    Abstract A standard paradigm for sentiment analysis is to rely on a singular LLM and makes the decision in a single round under the framework of in-context learning. This framework suffers the key disadvantage that the single-turn output generated by a single LLM might not deliver the perfect decision, just as humans sometimes need multiple attempts to get things right. This is especially true for the task of sentiment analysis where deep reasoning is required to address the complex linguistic phenomenon (e.g., clause composition, irony, etc) in the input. To address this issue, this paper introduces a multi-LLM negotiation framework for sentiment analysis. The framework consists of a reasoning-infused generator to provide decision along with rationale, a explanation-deriving discriminator to evaluate the credibility of the generator. The generator and the discriminator iterate until a consensus is reached. The proposed framework naturally addressed the aforementioned challenge, as we are able to take the complementary abilities of two LLMs, have them use rationale to persuade each other for correction. Experiments on a wide range of sentiment analysis benchmarks (SST-2, Movie Review, Twitter, yelp, amazon, IMDB) demonstrate the effectiveness of proposed approach: it consistently yields better performances than the ICL baseline across all benchmarks, and even superior performances to supervised baselines on the Twitter and movie review datasets.
    摘要 情感分析的标准范式是依赖单一的大语言模型(LLM),在上下文学习框架下单轮给出决策。这种框架的主要缺点是,单一LLM的单轮输出未必能给出完美的决策,正如人类有时需要多次尝试才能把事情做对。对于情感分析任务尤其如此,因为处理输入中复杂的语言现象(如从句结构、反讽等)需要深入的推理。为解决这一问题,本文提出一种用于情感分析的多LLM协商框架。该框架由一个融入推理的生成器和一个推导解释的判别器组成:生成器给出决策及其理由,判别器评估生成器决策的可信度,两者迭代直至达成一致。该框架自然地解决了上述挑战,因为我们能够利用两个LLM的互补能力,让它们用理由相互说服、彼此纠正。在多个情感分析基准(SST-2、Movie Review、Twitter、Yelp、Amazon、IMDB)上的实验表明了所提方法的有效性:它在所有基准上都稳定优于ICL基线,并且在Twitter和电影评论数据集上甚至超过了有监督基线。
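
A minimal sketch of the generator-discriminator negotiation loop, with the role prompts, stopping rule, and interfaces all assumed:

```python
def negotiate(text, generator, discriminator, max_rounds=4):
    # Sketch of the multi-LLM negotiation loop: a generator proposes a
    # sentiment label with a rationale; a discriminator either accepts it
    # or pushes back with a critique; iterate until consensus. The
    # interfaces and round limit are assumptions.
    feedback = None
    for _ in range(max_rounds):
        label, rationale = generator(text, feedback)          # decision + rationale
        ok, critique = discriminator(text, label, rationale)  # credibility check
        if ok:
            return label                                      # consensus reached
        feedback = critique                                   # persuade the generator
    return label                                              # fall back to last proposal
```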

Efficient Black-Box Adversarial Attacks on Neural Text Detectors

  • paper_url: http://arxiv.org/abs/2311.01873
  • repo_url: None
  • paper_authors: Vitalii Fishchuk, Daniel Braun
  • for: investigate the effectiveness of three simple and resource-efficient strategies to alter texts generated by GPT-3.5 to misclassify neural text detectors.
  • methods: parameter tweaking, prompt engineering, and character-level mutations.
  • results: especially parameter tweaking and character-level mutations are effective strategies (a character-mutation sketch follows below).
    Abstract Neural text detectors are models trained to detect whether a given text was generated by a language model or written by a human. In this paper, we investigate three simple and resource-efficient strategies (parameter tweaking, prompt engineering, and character-level mutations) to alter texts generated by GPT-3.5 that are unsuspicious or unnoticeable for humans but cause misclassification by neural text detectors. The results show that especially parameter tweaking and character-level mutations are effective strategies.
    摘要 神经文本检测器是用于判断给定文本由语言模型生成还是由人类撰写的模型。在这篇论文中,我们研究三种简单且资源高效的策略(参数调整、提示工程和字符级变异),用以修改GPT-3.5生成的文本,使其对人类而言不可疑、不易察觉,却能使神经文本检测器误分类。结果表明,参数调整和字符级变异尤其有效。
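
As one concrete example of a character-level mutation, the sketch below swaps a small fraction of Latin characters for visually similar Cyrillic homoglyphs; the homoglyph table is an illustrative assumption, since the paper only describes character-level mutations in general.

```python
import random

HOMOGLYPHS = {"a": "а", "e": "е", "o": "о"}  # Latin -> visually similar Cyrillic

def char_mutate(text, rate=0.03, seed=0):
    # Replace a small fraction of characters with look-alikes: unnoticeable
    # to human readers, but enough to flip a neural text detector.
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text)
```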

$R^3$-NL2GQL: A Hybrid Models Approach for Accuracy Enhancing and Hallucinations Mitigation

  • paper_url: http://arxiv.org/abs/2311.01862
  • repo_url: https://github.com/zhiqix/nl2gql
  • paper_authors: Yuhang Zhou, He Yu, Siyu Tian, Dan Chen, Liuzhi Zhou, Xinlin Yu, Chuanjun Ji, Sen Liu, Guangnan Ye, Hongfeng Chai
  • for: 这篇论文面向自然语言到图查询语言(NL2GQL)任务,应对基础模型在该任务中面临的挑战。
  • methods: 论文将不同规模的基础模型组合使用:较小模型负责重排序与改写,较大模型负责生成与精炼(流水线示意见下文)。
  • results: 实验结果显示,较大的基础模型在NL2GQL任务中展现出优秀的跨模式(cross-schema)泛化能力,而较小的基础模型经微调后在意图理解和语法正确性上有所提升。
    Abstract While current NL2SQL tasks constructed using Foundation Models have achieved commendable results, their direct application to Natural Language to Graph Query Language (NL2GQL) tasks poses challenges due to the significant differences between GQL and SQL expressions, as well as the numerous types of GQL. Our extensive experiments reveal that in NL2GQL tasks, larger Foundation Models demonstrate superior cross-schema generalization abilities, while smaller Foundation Models struggle to improve their GQL generation capabilities through fine-tuning. However, after fine-tuning, smaller models exhibit better intent comprehension and higher grammatical accuracy. Diverging from rule-based and slot-filling techniques, we introduce R3-NL2GQL, which employs both smaller and larger Foundation Models as reranker, rewriter and refiner. The approach harnesses the comprehension ability of smaller models for information reranker and rewriter, and the exceptional generalization and generation capabilities of larger models to transform input natural language queries and code structure schema into any form of GQLs. Recognizing the lack of established datasets in this nascent domain, we have created a bilingual dataset derived from graph database documentation and some open-source Knowledge Graphs (KGs). We tested our approach on this dataset and the experimental results showed that delivers promising performance and robustness.Our code and dataset is available at https://github.com/zhiqix/NL2GQL
    摘要 当前基于基础模型构建的NL2SQL任务已取得可观的成果,但由于GQL与SQL表达差异显著、且GQL种类繁多,将其直接应用于自然语言到图查询语言(NL2GQL)任务仍面临挑战。我们的大量实验表明,在NL2GQL任务中,较大的基础模型展现出更强的跨模式泛化能力,而较小的基础模型难以通过微调提升其GQL生成能力;但经微调后,较小模型具有更好的意图理解和更高的语法正确率。不同于基于规则和槽填充的技术,我们提出了R3-NL2GQL,将较小和较大的基础模型分别用作重排序器、改写器和精炼器。该方法利用较小模型的理解能力进行信息重排序与改写,并利用较大模型出色的泛化与生成能力,将输入的自然语言查询和代码结构模式转换为各种形式的GQL。鉴于这一新兴领域缺乏成熟数据集,我们基于图数据库文档和一些开源知识图谱(KG)构建了一个双语数据集。我们在该数据集上测试了我们的方法,实验结果表明其具有良好的性能和鲁棒性。代码和数据集见 https://github.com/zhiqix/NL2GQL。
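
A rough sketch of the rerank-rewrite-refine split between a smaller and a larger foundation model; all prompts and interfaces here are assumptions:

```python
def r3_nl2gql(query, schema, small_lm, large_lm, retrieve):
    # Sketch of the hybrid pipeline: the smaller model reranks retrieved
    # schema information and rewrites the question (comprehension); the
    # larger model generates and refines the GQL (generalization).
    candidates = retrieve(query, schema)                      # candidate schema fragments
    ranked = small_lm(f"Rank by relevance to: {query}\n{candidates}")
    rewritten = small_lm(f"Rewrite unambiguously: {query}\nContext: {ranked}")
    draft = large_lm(f"Write a graph query (GQL) for: {rewritten}\nSchema: {ranked}")
    return large_lm(f"Refine this GQL for syntax and schema fit:\n{draft}")
```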

Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT

  • paper_url: http://arxiv.org/abs/2311.01825
  • repo_url: None
  • paper_authors: Mario Sänger, Ninon De Mecquenem, Katarzyna Ewa Lewińska, Vasilis Bountris, Fabian Lehmann, Ulf Leser, Thomas Kosch
  • for: 这篇研究旨在检验大型语言模型(LLM)对科学工作流的支持效果,以帮助用户应对实现工作流时遇到的挑战。
  • methods: 研究以ChatGPT作为LLM,在两个科学领域开展了三项用户研究,评估ChatGPT在理解、适配和扩展工作流方面的效果。
  • results: 研究结果显示,LLM能高效地解释工作流,但在替换组件或有目的地扩展工作流方面表现较差。研究还刻画了LLM在这些困难场景下的局限,并提出了未来研究方向。
    Abstract Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets, as they offer reproducibility, dependability, and scalability of analyses by automatic parallelization on large compute clusters. However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution. Simultaneously, user-supporting tools are rare, and the number of available examples is much lower than in classical programming languages. To address these challenges, we investigate the efficiency of Large Language Models (LLMs), specifically ChatGPT, to support users when dealing with scientific workflows. We performed three user studies in two scientific domains to evaluate ChatGPT for comprehending, adapting, and extending workflows. Our results indicate that LLMs efficiently interpret workflows but achieve lower performance for exchanging components or purposeful workflow extensions. We characterize their limitations in these challenging scenarios and suggest future research directions.
    摘要 科学工作流系统因其可复现性、可靠性以及在大型计算集群上自动并行化带来的可扩展性,越来越多地被用于表达和执行面向大规模数据集的复杂数据分析流水线。然而,由于涉及大量黑盒工具及其执行所需的深层基础设施栈,实现工作流十分困难;同时,面向用户的辅助工具稀缺,可用示例数量也远少于经典编程语言。为了应对这些挑战,我们研究了大语言模型(LLM),具体为ChatGPT,在科学工作流方面对用户的支持效果。我们在两个科学领域开展了三项用户研究,评估ChatGPT在理解、适配和扩展工作流方面的能力。结果表明,LLM能高效地解释工作流,但在替换组件或有目的地扩展工作流时表现较差。我们刻画了其在这些困难场景下的局限,并提出了未来研究方向。

Minimalist Grammar: Construction without Overgeneration

  • paper_url: http://arxiv.org/abs/2311.01820
  • repo_url: None
  • paper_authors: Isidor Konrad Maier, Johannes Kuhn, Jesse Beisegel, Markus Huber-Liebl, Matthias Wolff
  • for: 这篇论文给出了编写极简语法(minimalist grammar,MG)的方法指南。
  • methods: 以一种上下文无关文法(CFG)的变体作为输入格式,并用licensor/licensee机制专门处理例外情况。
  • results: 只要输入CFG不含递归,构造出的MG即可避免过度生成;例外处理中多余licensee/licensor需要被触发的问题则通过称为adapter的 $\epsilon$-项解决(递归检测的示意代码见下文)。
    Abstract In this paper we give instructions on how to write a minimalist grammar (MG). In order to present the instructions as an algorithm, we use a variant of context free grammars (CFG) as an input format. We can exclude overgeneration, if the CFG has no recursion, i.e. no non-terminal can (indirectly) derive to a right-hand side containing itself. The constructed MGs utilize licensors/-ees as a special way of exception handling. A CFG format for a derivation $A\_eats\_B\mapsto^* peter\_eats\_apples$, where $A$ and $B$ generate noun phrases, normally leads to overgeneration, e.\,g., $i\_eats\_apples$. In order to avoid overgeneration, a CFG would need many non-terminal symbols and rules, that mainly produce the same word, just to handle exceptions. In our MGs however, we can summarize CFG rules that produce the same word in one item and handle exceptions by a proper distribution of licensees/-ors. The difficulty with this technique is that in most generations the majority of licensees/-ors is not needed, but still has to be triggered somehow. We solve this problem with $\epsilon$-items called \emph{adapters}.
    摘要 在这篇论文中,我们给出编写极简语法(MG)的方法。为了将这些方法表述为算法,我们使用上下文无关文法(CFG)的一种变体作为输入格式。只要CFG不含递归,即任何非终结符都不能(间接)推导出包含自身的右部,就可以排除过度生成。构造出的MG用licensor/licensee作为处理例外的特殊方式。对于推导 $A\_eats\_B\mapsto^* peter\_eats\_apples$(其中 $A$ 和 $B$ 生成名词短语)的CFG格式,通常会导致过度生成,例如 $i\_eats\_apples$。为避免过度生成,CFG需要大量主要生成相同词、仅为处理例外而存在的非终结符和规则。而在我们的MG中,可以把生成相同词的CFG规则汇总为一个条目,并通过恰当分配licensee/licensor来处理例外。这一技术的难点在于,多数推导中大部分licensee/licensor并不需要,却仍须以某种方式被触发。我们用称为adapter的 $\epsilon$-项解决这一问题。
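
The no-recursion precondition on the input CFG is straightforward to check. Below is a minimal sketch, where `rules` maps each non-terminal to a list of right-hand sides; the data layout is an assumption.

```python
def has_recursion(rules):
    # Detect whether any non-terminal of a CFG can (indirectly) derive a
    # right-hand side containing itself -- the case the construction excludes.
    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            nt = stack.pop()
            for rhs in rules.get(nt, []):
                for sym in rhs:
                    if sym in rules and sym not in seen:  # only non-terminals
                        seen.add(sym)
                        stack.append(sym)
        return seen
    return any(nt in reachable(nt) for nt in rules)

# e.g. has_recursion({"S": [("A", "eats", "B")],
#                     "A": [("peter",)], "B": [("apples",)]})  -> False
```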

Mitigating Framing Bias with Polarity Minimization Loss

  • paper_url: http://arxiv.org/abs/2311.01817
  • repo_url: None
  • paper_authors: Yejin Bang, Nayeon Lee, Pascale Fung
  • for: 缓解新闻报道中的框架偏见,以减轻政治极化
  • methods: 提出一种新的损失函数,用于最小化对同一事件的两极化报道之间的极性差异(示意代码见下文)
  • results: 实验结果表明,在模型中加入该损失函数可显著减少框架偏见,且在针对信息性框架偏见(即报道信息选择上的偏向)训练时效果最为明显。
    Abstract Framing bias plays a significant role in exacerbating political polarization by distorting the perception of actual events. Media outlets with divergent political stances often use polarized language in their reporting of the same event. We propose a new loss function that encourages the model to minimize the polarity difference between the polarized input articles to reduce framing bias. Specifically, our loss is designed to jointly optimize the model to map polarity ends bidirectionally. Our experimental results demonstrate that incorporating the proposed polarity minimization loss leads to a substantial reduction in framing bias when compared to a BART-based multi-document summarization model. Notably, we find that the effectiveness of this approach is most pronounced when the model is trained to minimize the polarity loss associated with informational framing bias (i.e., skewed selection of information to report).
    摘要 框架偏见通过扭曲人们对真实事件的认知,在加剧政治极化方面扮演重要角色。持不同政治立场的媒体在报道同一事件时常使用带有极性的语言。我们提出一种新的损失函数,促使模型最小化两极化输入文章之间的极性差异,以减少框架偏见。具体而言,该损失被设计为联合优化模型,使其双向映射两个极性端。实验结果表明,与基于BART的多文档摘要模型相比,加入所提出的极性最小化损失能显著减少框架偏见。值得注意的是,当模型针对信息性框架偏见(即报道信息选择上的偏向)对应的极性损失进行训练时,该方法的效果最为显著。
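
One plausible way to instantiate a polarity minimization term is to penalize the divergence between the output distributions induced by the two polarized inputs. The symmetric-KL form below is an assumption for illustration, not the paper's exact loss:

```python
import torch.nn.functional as F

def polarity_min_loss(logits_left, logits_right):
    # Assumed instantiation: push the model's output distributions for
    # summaries conditioned on left- and right-polarized source articles
    # toward each other via a symmetric KL penalty.
    p = F.log_softmax(logits_left, dim=-1)
    q = F.log_softmax(logits_right, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(P || Q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)
```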

UP4LS: User Profile Constructed by Multiple Attributes for Enhancing Linguistic Steganalysis

  • paper_url: http://arxiv.org/abs/2311.01775
  • repo_url: None
  • paper_authors: Yihao Wang, Ruiqi Song, Ru Zhang, Jianyi Liu
  • for: 提高语言隐写分析(LS)任务的性能,特别是在社交媒体场景下。
  • methods: 通过挖掘帖子内容构建包含写作习惯、心理状态和关注点等属性的用户画像,并与现有方法中语言模型提取的内容特征融合(融合头示意见下文)。
  • results: 该框架显著提升了现有方法的性能,总体准确率提高近25%,且在隐写样本较少时提升尤为明显。
    Abstract Linguistic steganalysis (LS) tasks aim to effectively detect stegos generated by linguistic steganography. Existing LS methods overlook the distinctive user characteristics, leading to weak performance in social networks. The limited occurrence of stegos further complicates detection. In this paper, we propose the UP4LS, a novel framework with the User Profile for enhancing LS performance. Specifically, by delving into post content, we explore user attributes like writing habits, psychological states, and focal areas, thereby building the user profile for LS. For each attribute, we design the identified feature extraction module. The extracted features are mapped to high-dimensional user features via deep-learning networks from existing methods. Then the language model is employed to extract content features. The user and content features are integrated to optimize feature representation. During the training phase, we prioritize the distribution of stegos. Experiments demonstrate that UP4LS can significantly enhance the performance of existing methods, and an overall accuracy improvement of nearly 25%. In particular, the improvement is especially pronounced with fewer stego samples. Additionally, UP4LS also sets the stage for studies on related tasks, encouraging extensive applications on LS tasks.
    摘要 语言隐写分析(LS)任务旨在有效检测由语言隐写术生成的隐写文本(stego)。现有的LS方法忽略了用户的独特特征,导致在社交网络场景下检测效果较弱,而隐写文本出现频率有限更增加了检测难度。本文提出UP4LS,一种借助用户画像增强LS性能的新框架:通过挖掘帖子内容,捕捉用户的写作习惯、心理状态和关注领域等属性,构建用于LS的用户画像,并为每种属性设计相应的特征提取模块。这些特征经现有方法中的深度学习网络映射为高维用户特征,再由语言模型提取内容特征,二者融合以优化特征表示。训练阶段我们优先考虑隐写文本的分布。实验表明,UP4LS能显著提升现有方法的性能,总体准确率提高近25%,且在隐写样本较少时提升尤为明显。此外,UP4LS也为相关任务的研究奠定了基础,有望在LS任务中得到广泛应用。
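
A minimal sketch of the fusion step, where per-attribute user-profile features are projected and concatenated with language-model content features before classification; the dimensions and architecture are assumptions:

```python
import torch
import torch.nn as nn

class UP4LSHead(nn.Module):
    # Sketch of a user/content feature fusion head: user-profile features
    # (writing habits, psychological state, focal areas) are projected and
    # concatenated with encoder content features, then classified.
    def __init__(self, d_user=3 * 64, d_content=768, n_classes=2):
        super().__init__()
        self.user_proj = nn.Linear(d_user, 128)
        self.classifier = nn.Linear(128 + d_content, n_classes)

    def forward(self, user_feats, content_feats):
        u = torch.relu(self.user_proj(user_feats))
        return self.classifier(torch.cat([u, content_feats], dim=-1))
```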

PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

  • paper_url: http://arxiv.org/abs/2311.01767
  • repo_url: https://github.com/gydpku/pptc
  • paper_authors: Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Duan Nan
  • for: 这项研究旨在评估大语言模型(LLM)在复杂多模态环境中完成多轮、多模态指令的能力。
  • methods: 研究构建了PowerPoint Task Completion(PPTC)基准,评估LLM依据用户指令创建和编辑PPT文件的能力,并提出基于预测文件而非标签API序列的PPTX-Match评测系统(示意见下文)。
  • results: 研究发现GPT-4在单轮对话测试中达到75.1%的准确率,但难以完成整个会话,会话准确率仅6%。错误主要来自三方面:多轮会话中的错误累积、长PPT模板处理和多模态感知,这些问题对未来的LLM和智能体系统构成重大挑战。
    Abstract Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1\% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6\% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at \url{https://github.com/gydpku/PPTC}.
    摘要 近期对大语言模型(LLM)的评估主要集中在测试其在基础自然语言任务上的零样本/少样本能力,以及将指令转化为工具API的能力;而LLM在复杂多模态环境中使用复杂工具完成多轮、多模态指令的能力尚未被研究。为填补这一空白,我们提出PowerPoint Task Completion(PPTC)基准,用于评估LLM依据用户指令创建和编辑PPT文件的能力。该基准包含279个多轮会话,覆盖多样主题,以及数百条涉及多模态操作的指令。我们还提出PPTX-Match评测系统,它依据预测文件而非标签API序列来判断LLM是否完成指令,因此可支持各种LLM生成的API序列。我们测试了3个闭源LLM和6个开源LLM。结果显示,GPT-4在单轮对话测试中以75.1%的准确率领先,但在完成整个会话时表现不佳,会话准确率仅6%。我们在基准中发现三类主要错误原因:多轮会话中的错误累积、长PPT模板处理和多模态感知。这些问题对未来的LLM和智能体系统构成重大挑战。我们在 \url{https://github.com/gydpku/PPTC} 发布了数据、代码和评测系统。
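
Evaluation by prediction file rather than label API sequence can be sketched as follows, using python-pptx to load the produced file; the `checks` interface is an assumption in the spirit of PPTX-Match, not its actual implementation:

```python
from pptx import Presentation  # pip install python-pptx

def pptx_match(pred_path, ref_path, checks):
    # Grade the produced .pptx against a reference by attribute checks, so
    # any LLM-generated API sequence that yields the right file counts as
    # correct. `checks` is a list of (pred, ref) -> bool predicates.
    pred, ref = Presentation(pred_path), Presentation(ref_path)
    return all(check(pred, ref) for check in checks)

# e.g. a check comparing slide counts:
# checks = [lambda p, r: len(p.slides) == len(r.slides)]
```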

Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation

  • paper_url: http://arxiv.org/abs/2311.01766
  • repo_url: None
  • paper_authors: Xin Yuan, Jie Guo, Weidong Qiu, Zheng Huang, Shujun Li
  • for: 遏制脱离上下文的错误信息和虚假信息在网络上的传播
  • methods: 提出一种立场抽取网络(SEN),能在统一框架中抽取不同多模态证据的立场,以提高检测结果的准确性;并在文本SEN中引入基于命名实体共现关系的支持-反驳分数(示意见下文)
  • results: 在大规模公共数据集上的大量实验表明,所提方法优于此前的基线,最佳模型准确率提升3.2%。
    Abstract Mis- and disinformation online have become a major societal problem as major sources of online harms of different kinds. One common form of mis- and disinformation is out-of-context (OOC) information, where different pieces of information are falsely associated, e.g., a real image combined with a false textual caption or a misleading textual description. Although some past studies have attempted to defend against OOC mis- and disinformation through external evidence, they tend to disregard the role of different pieces of evidence with different stances. Motivated by the intuition that the stance of evidence represents a bias towards different detection results, we propose a stance extraction network (SEN) that can extract the stances of different pieces of multi-modal evidence in a unified framework. Moreover, we introduce a support-refutation score calculated based on the co-occurrence relations of named entities into the textual SEN. Extensive experiments on a public large-scale dataset demonstrated that our proposed method outperformed the state-of-the-art baselines, with the best model achieving a performance gain of 3.2% in accuracy.
    摘要 网络上的错误信息和虚假信息已成为重大社会问题,是多种网络危害的主要来源。一种常见形式是脱离上下文(OOC)信息,即不同信息片段被错误地关联起来,例如真实图片配上虚假文字说明或误导性文字描述。尽管已有研究尝试借助外部证据防御OOC错误信息和虚假信息,但它们往往忽视了持不同立场的各条证据所起的作用。基于证据的立场代表了对不同检测结果的偏向这一直觉,我们提出一种立场抽取网络(SEN),能在统一框架中抽取多模态证据的立场。此外,我们在文本SEN中引入了基于命名实体共现关系计算的支持-反驳分数。在一个大规模公共数据集上的大量实验表明,所提方法优于此前最先进的基线,最佳模型准确率提升3.2%。
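
The abstract mentions a support-refutation score computed from named-entity co-occurrence. One simple instantiation (the exact formula is an assumption) is entity overlap between claim and evidence:

```python
def support_refutation_score(claim_entities, evidence_entities):
    # Simple instantiation (exact formula assumed): fraction of the claim's
    # named entities that co-occur in the evidence; higher = more supporting.
    claim, ev = set(claim_entities), set(evidence_entities)
    return len(claim & ev) / len(claim) if claim else 0.0
```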

EmojiLM: Modeling the New Emoji Language

  • paper_url: http://arxiv.org/abs/2311.01751
  • repo_url: https://github.com/komeijiforce/emojilm
  • paper_authors: Letian Peng, Zilong Wang, Hang Liu, Zihan Wang, Jingbo Shang
  • for: 研究在线社交媒体上表情符号(emoji)的使用趋势与应用。
  • methods: 利用大型语言模型合成大规模文本-表情符号平行语料库(Text2Emoji),并在其上蒸馏出专门用于文本-表情符号双向翻译的序列到序列模型EmojiLM。
  • results: 在公开基准和人工评估中,所提模型均优于强基线,且该平行语料库有助于表情符号相关的下游任务。
    Abstract With the rapid development of the internet, online social media welcomes people with different backgrounds through its diverse content. The increasing usage of emoji becomes a noticeable trend thanks to emoji's rich information beyond cultural or linguistic borders. However, the current study on emojis is limited to single emoji prediction and there are limited data resources available for further study of the interesting linguistic phenomenon. To this end, we synthesize a large text-emoji parallel corpus, Text2Emoji, from a large language model. Based on the parallel corpus, we distill a sequence-to-sequence model, EmojiLM, which is specialized in the text-emoji bidirectional translation. Extensive experiments on public benchmarks and human evaluation demonstrate that our proposed model outperforms strong baselines and the parallel corpus benefits emoji-related downstream tasks.
    摘要 随着互联网的快速发展,在线社交媒体以其多元内容欢迎来自不同背景的人们。得益于表情符号所承载的跨越文化与语言边界的丰富信息,其使用量的增长已成为显著趋势。然而,现有研究仅限于单一表情符号预测,可用的数据资源也不足以支撑对这一有趣语言现象的进一步研究。为此,我们基于大型语言模型合成了大规模文本-表情符号平行语料库Text2Emoji,并在其上蒸馏出专门用于文本-表情符号双向翻译的序列到序列模型EmojiLM。在公开基准上的大量实验和人工评估表明,所提模型优于强基线,且该平行语料库有助于表情符号相关的下游任务。

SAC$^3$: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency

  • paper_url: http://arxiv.org/abs/2311.01740
  • repo_url: None
  • paper_authors: Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, Sricharan Kumar
  • for: 幻觉检测是理解现代语言模型可信度的关键一步。
  • methods: 我们重新审视基于语言模型自一致性的现有检测方法,发现了仅靠自一致性检查无法有效识别的两类幻觉:问题层面和模型层面。在此基础上,我们提出一种新的基于采样的方法,即语义感知交叉检查一致性(SAC$^3$),在自一致性检查之上引入语义等价问题扰动和跨模型回答一致性检查等机制(打分示意见下文)。
  • results: 大量系统性实验分析表明,SAC$^3$ 在多个问答和开放域生成基准上检测非事实与事实陈述的表现优于当前最先进方法。
    Abstract Hallucination detection is a critical step toward understanding the trustworthiness of modern language models (LMs). To achieve this goal, we re-examine existing detection approaches based on the self-consistency of LMs and uncover two types of hallucinations resulting from 1) question-level and 2) model-level, which cannot be effectively identified through self-consistency check alone. Building upon this discovery, we propose a novel sampling-based method, i.e., semantic-aware cross-check consistency (SAC$^3$) that expands on the principle of self-consistency checking. Our SAC$^3$ approach incorporates additional mechanisms to detect both question-level and model-level hallucinations by leveraging advances including semantically equivalent question perturbation and cross-model response consistency checking. Through extensive and systematic empirical analysis, we demonstrate that SAC$^3$ outperforms the state of the art in detecting both non-factual and factual statements across multiple question-answering and open-domain generation benchmarks.
    摘要 幻觉检测是理解现代语言模型(LM)可信度的关键一步。为此,我们重新审视基于LM自一致性的现有检测方法,发现了分别源于问题层面和模型层面的两类幻觉,它们无法仅通过自一致性检查被有效识别。基于这一发现,我们提出一种新的基于采样的方法,即语义感知交叉检查一致性(SAC$^3$),它在自一致性检查原理的基础上,进一步引入语义等价问题扰动和跨模型回答一致性检查等机制,以检测问题层面和模型层面的幻觉。大量系统性的实验分析表明,SAC$^3$ 在多个问答和开放域生成基准上检测非事实与事实陈述的性能均优于当前最先进方法。
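
A simplified sketch of the cross-check idea, scoring an answer by its consistency under semantically equivalent question perturbations across models; `perturb`, `models`, and `agree` are assumed interfaces, and the real SAC$^3$ score is more elaborate:

```python
def sac3_score(question, answer, perturb, models, agree):
    # Re-ask semantically equivalent rephrasings of the question across
    # several models and measure how often their answers agree with the
    # original answer. A low score flags a likely hallucination.
    variants = perturb(question)              # semantically equivalent rewrites
    votes = [agree(answer, m(v)) for m in models for v in variants]
    return sum(votes) / len(votes)            # consistency score in [0, 1]
```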

Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.01732
  • repo_url: None
  • paper_authors: Sean Xie, Soroush Vosoughi, Saeed Hassanpour
  • for: This paper aims to improve the interpretability of Large Language Models (LLMs) by developing a prototypical network-based white-box framework that allows LLMs to learn immediately interpretable embeddings during the fine-tuning stage while maintaining competitive performance.
  • methods: The proposed method, called proto-lm, uses a prototypical network to learn interpretable embeddings that can be used to understand how the LLM is making predictions. The method is based on a white-box framework, which allows for transparency and interpretability of the model’s inner workings.
  • results: The authors demonstrate the applicability and interpretability of their method through experiments on a wide range of NLP tasks, showing that proto-lm learns immediately interpretable embeddings while maintaining competitive performance and paving the way for more interpretable models without sacrificing accuracy (a sketch of a prototypical head follows below).
    Abstract Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), but their lack of interpretability has been a major concern. Current methods for interpreting LLMs are post hoc, applied after inference time, and have limitations such as their focus on low-level features and lack of explainability at higher level text units. In this work, we introduce proto-lm, a prototypical network-based white-box framework that allows LLMs to learn immediately interpretable embeddings during the fine-tuning stage while maintaining competitive performance. Our method's applicability and interpretability are demonstrated through experiments on a wide range of NLP tasks, and our results indicate a new possibility of creating interpretable models without sacrificing performance. This novel approach to interpretability in LLMs can pave the way for more interpretable models without the need to sacrifice performance.
    摘要 大语言模型(LLM)显著推动了自然语言处理(NLP)领域的发展,但其缺乏可解释性一直是主要顾虑。现有的LLM解释方法多为事后方法,在推理之后才应用,且存在局限:它们聚焦于低层特征,缺乏对更高层文本单元的解释能力。在这项工作中,我们提出proto-lm,一种基于原型网络的白盒框架,使LLM在微调阶段直接学习可解释的嵌入,同时保持有竞争力的性能。我们在多种NLP任务上的实验证明了该方法的适用性和可解释性,结果表明在不牺牲性能的前提下构建可解释模型是可能的。这种新的LLM可解释性思路有望为更多可解释模型铺平道路。
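
A minimal sketch of a prototypical classification head of the kind such frameworks build on: inputs are scored by similarity to learned per-class prototypes, so each prediction can be traced back to its nearest prototypes. The dimensions and the distance-based similarity are assumptions:

```python
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    # Sketch of a prototypical head: each class owns several learnable
    # prototype vectors; an input's class logit is its best (negated-L2)
    # similarity to that class's prototypes.
    def __init__(self, d=768, n_classes=2, protos_per_class=4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes * protos_per_class, d))
        self.n_classes, self.k = n_classes, protos_per_class

    def forward(self, h):                          # h: (batch, d) encoder embedding
        sim = torch.cdist(h, self.prototypes).neg()               # -(L2 distance)
        return sim.view(-1, self.n_classes, self.k).max(-1).values  # class logits
```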

A New Korean Text Classification Benchmark for Recognizing the Political Intents in Online Newspapers

  • paper_url: http://arxiv.org/abs/2311.01712
  • repo_url: https://github.com/kdavid2355/kopolitic-benchmark-dataset
  • paper_authors: Beomjune Kim, Eunsun Lee, Dongbin Na
  • for: 本文主要针对韩国在线报纸中带有政治意图的文章进行自动识别。
  • methods: 本文在大规模韩语新闻数据集上训练基于Transformer架构的深度语言模型。
  • results: 训练得到的模型表现出良好的文本分类性能,并可同时完成多任务分类;本文还公开了该大规模韩语新闻数据集以供后续研究使用(多任务模型示意见下文)。
    Abstract Many users reading online articles in various magazines may suffer considerable difficulty in distinguishing the implicit intents in texts. In this work, we focus on automatically recognizing the political intents of a given online newspaper by understanding the context of the text. To solve this task, we present a novel Korean text classification dataset that contains various articles. We also provide deep-learning-based text classification baseline models trained on the proposed dataset. Our dataset contains 12,000 news articles that may contain political intentions, from the politics section of six of the most representative newspaper organizations in South Korea. All the text samples are labeled simultaneously in two aspects (1) the level of political orientation and (2) the level of pro-government. To the best of our knowledge, our paper is the most large-scale Korean news dataset that contains long text and addresses multi-task classification problems. We also train recent state-of-the-art (SOTA) language models that are based on transformer architectures and demonstrate that the trained models show decent text classification performance. All the codes, datasets, and trained models are available at https://github.com/Kdavid2355/KoPolitic-Benchmark-Dataset.
    摘要 许多在线阅读各类报刊文章的用户可能难以分辨文本中的隐含意图。在这项工作中,我们关注通过理解文本上下文自动识别在线报纸中的政治意图。为解决这一任务,我们提出了一个包含多样文章的新韩语文本分类数据集,并提供在该数据集上训练的基于深度学习的文本分类基线模型。数据集包含12,000篇可能带有政治意图的新闻文章,来自韩国六家最具代表性的报纸机构的政治版面。所有文本样本同时在两个维度上标注:(1)政治倾向程度和(2)亲政府程度。据我们所知,本文给出了规模最大、包含长文本并面向多任务分类问题的韩语新闻数据集。我们还训练了基于Transformer架构的最新(SOTA)语言模型,结果表明训练后的模型具有不错的文本分类性能。所有代码、数据集和训练模型见 https://github.com/Kdavid2355/KoPolitic-Benchmark-Dataset。
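
The two-aspect labeling naturally maps to a shared encoder with two classification heads. A minimal sketch (head sizes and the Hugging-Face-style encoder interface are assumptions):

```python
import torch.nn as nn

class TwoAspectClassifier(nn.Module):
    # Sketch of the multi-task setup: one shared transformer encoder with
    # two heads, one for political-orientation level and one for
    # pro-government level. The encoder is assumed to return an object
    # with `last_hidden_state`, as Hugging Face encoders do.
    def __init__(self, encoder, d=768, n_orient=3, n_gov=3):
        super().__init__()
        self.encoder = encoder
        self.orient_head = nn.Linear(d, n_orient)
        self.gov_head = nn.Linear(d, n_gov)

    def forward(self, **inputs):
        h = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding
        return self.orient_head(h), self.gov_head(h)
```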

CASE: Commonsense-Augmented Score with an Expanded Answer Space

  • paper_url: http://arxiv.org/abs/2311.01684
  • repo_url: https://github.com/wk-chen/commonsense-augmented-score-with-an-expanded-answer-space
  • paper_authors: Wenkai Chen, Sahithya Ravi, Vered Shwartz
  • for: 这篇论文旨在提升语言模型(LM)在多项选择问答任务中的表现,尤其是解决基础打分将所有词一视同仁的局限。
  • methods: 论文提出Commonsense-Augmented Score with Expanded Answer Space(CASE),依据词与输入中其他词的语义关系为单个词赋予重要性权重(打分示意见下文),并生成与选项概念相近、词面不同的答案以扩展答案空间。
  • results: 在五个常识基准上的实验表明,该方法在较小的LM上优于强基线,且与答案空间扩展相结合时效果更佳。
    Abstract LLMs have demonstrated impressive zero-shot performance on NLP tasks thanks to the knowledge they acquired in their training. In multiple-choice QA tasks, the LM probabilities are used as an imperfect measure of the plausibility of each answer choice. One of the major limitations of the basic score is that it treats all words as equally important. We propose CASE, a Commonsense-Augmented Score with an Expanded Answer Space. CASE addresses this limitation by assigning importance weights for individual words based on their semantic relations to other words in the input. The dynamic weighting approach outperforms basic LM scores, not only because it reduces noise from unimportant words, but also because it informs the model of implicit commonsense knowledge that may be useful for answering the question. We then also follow prior work in expanding the answer space by generating lexically-divergent answers that are conceptually-similar to the choices. When combined with answer space expansion, our method outperforms strong baselines on 5 commonsense benchmarks. We further show these two approaches are complementary and may be especially beneficial when using smaller LMs.
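
A minimal sketch of the weighted scoring idea: instead of summing token log-probabilities uniformly, weight each answer token by an importance score, e.g. derived from its semantic relations to the question. The averaging form is an assumption:

```python
def case_score(token_logprobs, token_weights):
    # Weighted plausibility score (weighting scheme assumed): each answer
    # token's log-probability is scaled by its importance weight instead of
    # treating all words as equally important.
    total_w = sum(token_weights)
    return sum(w * lp for w, lp in zip(token_weights, token_logprobs)) / total_w

# e.g. case_score([-0.3, -2.1, -0.7], [0.2, 1.0, 0.6])
```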

Plot Retrieval as an Assessment of Abstract Semantic Association

  • paper_url: http://arxiv.org/abs/2311.01666
  • repo_url: None
  • paper_authors: Shicheng Xu, Liang Pang, Jiangnan Li, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou
  • for: 根据查询从书中检索相关情节,以提升读者的阅读体验和效率
  • methods: 提出标注数据集Plot Retrieval,用于训练和评估信息检索模型估计抽象语义关联的能力
  • results: 现有信息检索模型在捕捉文本间抽象语义关联方面仍表现不足,抽象语义关联建模能力有待进一步研究
    Abstract Retrieving relevant plots from the book for a query is a critical task, which can improve the reading experience and efficiency of readers. Readers usually only give an abstract and vague description as the query based on their own understanding, summaries, or speculations of the plot, which requires the retrieval model to have a strong ability to estimate the abstract semantic associations between the query and candidate plots. However, existing information retrieval (IR) datasets cannot reflect this ability well. In this paper, we propose Plot Retrieval, a labeled dataset to train and evaluate the performance of IR models on the novel task Plot Retrieval. Text pairs in Plot Retrieval have less word overlap and more abstract semantic association, which can reflect the ability of the IR models to estimate the abstract semantic association, rather than just traditional lexical or semantic matching. Extensive experiments across various lexical retrieval, sparse retrieval, dense retrieval, and cross-encoder methods compared with human studies on Plot Retrieval show current IR models still struggle in capturing abstract semantic association between texts. Plot Retrieval can be the benchmark for further research on the semantic association modeling ability of IR models.

cs.LG - 2023-11-03

Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos

  • paper_url: http://arxiv.org/abs/2311.02076
  • repo_url: None
  • paper_authors: Dayal Singh Kalra, Tianyu He, Maissam Barkeshli
  • For: 研究神经网络梯度下降训练动态中损失Hessian最大特征值(锐度)的演化及其与学习率的关系。
  • Methods: 在单个训练样本上训练一个简单的两层线性网络(UV模型),并通过分析函数空间中动力学不动点的结构和函数更新的向量场,揭示训练中锐度变化趋势的机制。
  • Results: 该模型重现了真实场景中观察到的全部关键锐度现象,包括训练早期的锐度下降、随后的渐进锐化,以及学习率增大时的稳定性边缘和在稳定性边缘流形上经由倍周期分岔走向混沌的路径(锐度估计的示意代码见下文)。
    Abstract In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios. By analyzing the structure of dynamical fixed points in function space and the vector field of function updates, we uncover the underlying mechanisms behind these sharpness trends. Our analysis reveals (i) the mechanism behind early sharpness reduction and progressive sharpening, (ii) the required conditions for edge of stability, and (iii) a period-doubling route to chaos on the edge of stability manifold as learning rate is increased. Finally, we demonstrate that various predictions from this simplified model generalize to real-world scenarios and discuss its limitations.
    摘要 在神经网络的梯度下降动力学中,损失Hessian的最大特征值(锐度)在整个训练过程中呈现多种稳健现象,包括训练早期锐度可能下降(锐度下降),以及后期的渐进锐化和稳定性边缘等行为。我们证明,在单个训练样本上训练的简单两层线性网络(UV模型)呈现出真实场景中观察到的全部关键锐度现象。通过分析函数空间中动力学不动点的结构和函数更新的向量场,我们揭示了这些锐度趋势背后的机制。我们的分析给出了(i)早期锐度下降与渐进锐化的机制,(ii)出现稳定性边缘所需的条件,以及(iii)学习率增大时在稳定性边缘流形上经由倍周期分岔走向混沌的路径。最后,我们证明该简化模型的多项预测可以推广到真实场景,并讨论了其局限性。
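
Sharpness here is the top eigenvalue of the loss Hessian, which can be estimated without forming the Hessian via power iteration on Hessian-vector products; a standard sketch in PyTorch:

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    # Estimate the sharpness (largest Hessian eigenvalue of the loss) by
    # power iteration on Hessian-vector products; `loss` must be computed
    # from `params` with the autograd graph intact.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient v^T H v
```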

Active Learning-Based Species Range Estimation

  • paper_url: http://arxiv.org/abs/2311.02061
  • repo_url: https://github.com/chris-lange/sdm_active_sampling
  • paper_authors: Christian Lange, Elijah Cole, Grant Van Horn, Oisin Mac Aodha
  • for: 该论文提出一种新的主动学习方法,用于从有限的实地观测中高效估计物种的地理分布范围。
  • methods: 该方法将目标未测绘物种的分布范围建模为一组由其他物种得到的估计范围的加权组合,这些候选范围由在大规模弱监督社区观测数据上训练的模型生成;随后用一种新的主动查询策略,依次选择最能降低该物种范围不确定性的地理位置进行走访(示意代码见下文)。
  • results: 作者在包含一千个物种专家标注范围的评估数据集上进行了详细评估:该方法优于现有主动学习方法,即便只使用一小部分数据,其准确率也接近端到端训练的模型。这表明了基于迁移学习空间表示的主动学习在物种范围估计中的价值,以及新兴大规模众包数据在主动发现物种方面的潜力。
    Abstract We propose a new active learning approach for efficiently estimating the geographic range of a species from a limited number of on the ground observations. We model the range of an unmapped species of interest as the weighted combination of estimated ranges obtained from a set of different species. We show that it is possible to generate this candidate set of ranges by using models that have been trained on large weakly supervised community collected observation data. From this, we develop a new active querying approach that sequentially selects geographic locations to visit that best reduce our uncertainty over an unmapped species' range. We conduct a detailed evaluation of our approach and compare it to existing active learning methods using an evaluation dataset containing expert-derived ranges for one thousand species. Our results demonstrate that our method outperforms alternative active learning methods and approaches the performance of end-to-end trained models, even when only using a fraction of the data. This highlights the utility of active learning via transfer learned spatial representations for species range estimation. It also emphasizes the value of leveraging emerging large-scale crowdsourced datasets, not only for modeling a species' range, but also for actively discovering them.
    摘要 我们提出一种新的主动学习方法,用于从有限的实地观测中高效估计物种的地理分布范围。我们将目标未测绘物种的范围建模为一组由不同物种得到的估计范围的加权组合,并证明可以利用在大规模弱监督社区观测数据上训练的模型生成这一候选范围集合。在此基础上,我们开发了一种新的主动查询策略,依次选择最能降低未测绘物种范围不确定性的地理位置进行走访。我们对该方法进行了详细评估,并在包含一千个物种专家标注范围的评估数据集上与现有主动学习方法比较。结果表明,我们的方法优于其他主动学习方法,即便只使用一小部分数据,也能接近端到端训练模型的性能。这凸显了基于迁移学习空间表示的主动学习在物种范围估计中的价值,也强调了新兴大规模众包数据不仅可用于建模物种范围,还可用于主动发现物种。
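
A minimal sketch of the weighted-combination range model with an uncertainty-driven query rule; the Bayesian-style reweighting and the 0.5-uncertainty criterion are assumed readings of the approach:

```python
import numpy as np

def update_weights(weights, candidate_maps, loc, observed):
    # Each candidate map holds a presence probability per location; reweight
    # candidates by how well they predict the new ground observation at `loc`.
    lik = np.array([m[loc] if observed else 1.0 - m[loc] for m in candidate_maps])
    w = np.asarray(weights) * lik
    return w / w.sum()

def next_query(weights, candidate_maps):
    # Visit the location whose mixture presence probability is most
    # uncertain (closest to 0.5) under the current weighted combination.
    p = np.tensordot(weights, np.stack(candidate_maps), axes=1)
    return int(np.argmin(np.abs(p - 0.5)))
```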

Reproducible Parameter Inference Using Bagged Posteriors

  • paper_url: http://arxiv.org/abs/2311.02019
  • repo_url: None
  • paper_authors: Jonathan H. Huggins, Jeffrey W. Miller
  • for: 本研究旨在解决模型设定错误下贝叶斯后验不确定性量化不准确的问题,并提出一种可复现性的判据。
  • methods: 本研究使用bagging技术,即对自助重采样(bootstrap)数据集条件下的后验分布取平均("BayesBag"),以提升可复现性(示意代码见下文)。
  • results: 研究表明,BayesBag通常满足重叠概率下界,并给出了其Bernstein–von Mises定理,确立其渐近正态分布;通过模拟实验和犯罪率预测应用展示了BayesBag的优点。
    Abstract Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, particularly in high-dimensional settings (i.e., with dimension increasing with sample size), indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"'); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein--Von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.
    摘要 在模型设定错误下,贝叶斯后验通常不能恰当量化关于真实或伪真实参数的不确定性。更根本的是,设定错误会导致不可复现:同一模型在来自真实分布的独立数据集上会得到相互矛盾的后验。为定义设定错误下可复现不确定性量化的判据,我们考虑由独立数据集构造的两个置信集有非空交集的概率,并对任何有效的置信集建立了该重叠概率的下界。我们证明,标准后验的可信集可能严重违反该下界,尤其是在高维情形(即维度随样本量增长),表明其在设定错误下缺乏内在一致性。为以一种易用且广泛适用的方式提升可复现性,我们提议对贝叶斯后验应用bagging("BayesBag"),即对自助重采样数据集条件下的后验分布取平均。我们从Jeffrey条件化的基本原理出发论证BayesBag,并证明bagged后验通常满足重叠概率下界。进一步地,我们证明了bagged后验的Bernstein–von Mises定理,确立其渐近正态分布。我们通过模拟实验和犯罪率预测应用展示了BayesBag的优点。
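
BayesBag itself is simple to sketch: pool posterior draws conditioned on bootstrap resamples of the data. `posterior_sampler` is an assumed interface (e.g. an MCMC routine):

```python
import numpy as np

def bayesbag(data, posterior_sampler, n_boot=50, n_draws=200, seed=0):
    # Approximate the bagged posterior by pooling posterior draws conditioned
    # on bootstrap resamples of the data. `posterior_sampler(data, n)` is an
    # assumed interface returning n posterior draws.
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        boot = rng.choice(data, size=len(data), replace=True)
        draws.append(posterior_sampler(boot, n_draws))
    return np.concatenate(draws)    # samples from the bagged posterior
```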

A Variational Perspective on High-Resolution ODEs

  • paper_url: http://arxiv.org/abs/2311.02002
  • repo_url: None
  • paper_authors: Hoomaan Maskan, Konstantinos C. Zygalakis, Alp Yurtsever
  • for: 这篇论文主要针对光滑凸函数的无约束最小化问题。
  • methods: 论文提出一种新的变分视角,借助带强迫项的Euler-Lagrange方程研究高分辨率ODE;由此为Nesterov加速梯度法的梯度范数最小化得到更快的收敛速度,并说明Nesterov方法可解释为对适当选取的高分辨率ODE的速率匹配离散化(该方法的示意见下文)。
  • results: 基于新变分视角的结果,论文提出一种面向带噪梯度的随机方法;多个数值实验将该随机算法与现有最先进方法进行了比较和展示。
    Abstract We consider unconstrained minimization of smooth convex functions. We propose a novel variational perspective using forced Euler-Lagrange equation that allows for studying high-resolution ODEs. Through this, we obtain a faster convergence rate for gradient norm minimization using Nesterov's accelerated gradient method. Additionally, we show that Nesterov's method can be interpreted as a rate-matching discretization of an appropriately chosen high-resolution ODE. Finally, using the results from the new variational perspective, we propose a stochastic method for noisy gradients. Several numerical experiments compare and illustrate our stochastic algorithm with state of the art methods.
    摘要 我们考虑不受限制的极小化的几何函数。我们提出了一种新的量子视角,使用强制的欧拉-拉格朗日方程,以研究高分辨率ODE。透过这个新的视角,我们获得了更快的梯度距离减少率,从尼斯特洛夫的加速器梯度方法中。此外,我们显示了尼斯特洛夫的方法可以被解释为一种调整对应的高分辨率ODE的率调整策略。最后,我们使用新的量子视角提出了一种随机方法 для杂质梯度。一些数学实验比较和IlлюSTRATE了我们的随机算法与现有的方法。

High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise

  • paper_url: http://arxiv.org/abs/2311.02000
  • repo_url: None
  • paper_authors: Yusu Hong, Junhong Lin
  • for: 本文研究Adam算法在无约束非凸光滑随机优化中的收敛性。尽管Adam在机器学习领域应用广泛,其理论性质仍然有限:先前的研究主要从期望角度分析其收敛,往往需要梯度一致有界等强假设或依赖问题的先验知识,限制了结论在实际场景中的适用性。
  • methods: 作者给出深入分析,在按坐标的"仿射"方差噪声下研究Adam的高概率收敛,并进一步考察去掉一个修正项的简化版Adam(标准Adam更新的参考实现见下文)。
  • results: 结果表明,Adam能以 $\mathcal{O}\left(\mathrm{poly}(\log T)/\sqrt{T}\right)$ 的速率以高概率收敛到驻点,无需任何有界梯度假设,也无需依赖问题的先验知识来调节超参数;同时Adam将梯度的大小限制在 $\mathcal{O}\left(\mathrm{poly}(\log T)\right)$ 量级内;简化版Adam则获得了能自适应噪声水平的收敛速率。
    Abstract In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm under unconstrained non-convex smooth stochastic optimizations. Despite the widespread usage in machine learning areas, its theoretical properties remain limited. Prior researches primarily investigated Adam's convergence from an expectation view, often necessitating strong assumptions like uniformly stochastic bounded gradients or problem-dependent knowledge in prior. As a result, the applicability of these findings in practical real-world scenarios has been constrained. To overcome these limitations, we provide a deep analysis and show that Adam could converge to the stationary point in high probability with a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, not requiring any bounded gradient assumption and any problem-dependent knowledge in prior to tune hyper-parameters. Additionally, it is revealed that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.
    摘要 本文研究Adam算法在无约束非凸光滑随机优化中的收敛性。尽管其在机器学习领域被广泛使用,其理论性质仍然有限。先前的研究主要从期望角度研究Adam的收敛,往往需要梯度一致有界等强假设或依赖问题的先验知识,这限制了这些结论在实际场景中的适用性。为突破这些限制,我们给出深入分析,证明在按坐标的"仿射"方差噪声下,Adam能以 $\mathcal{O}\left(\mathrm{poly}(\log T)/\sqrt{T}\right)$ 的速率以高概率收敛到驻点,既不需要任何有界梯度假设,也不需要依赖问题的先验知识来调节超参数。此外,我们还发现Adam将梯度的大小限制在 $\mathcal{O}\left(\mathrm{poly}(\log T)\right)$ 量级内。最后,我们研究了去掉一个修正项的简化版Adam,得到了能自适应噪声水平的收敛速率。
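
For reference, a sketch of the standard Adam update with bias correction that the paper analyzes:

```python
import numpy as np

def adam(grad, x0, steps=1000, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update with bias correction, the algorithm whose
    # high-probability convergence the paper studies.
    x = np.asarray(x0, dtype=float)
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g          # first-moment (momentum) estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
```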

Conditions on Preference Relations that Guarantee the Existence of Optimal Policies

  • paper_url: http://arxiv.org/abs/2311.01990
  • repo_url: None
  • paper_authors: Jonathan Colaco Carr, Prakash Panangaden, Doina Precup
  • For: The paper addresses the gap between the theory and application of Learning from Preferential Feedback (LfPF) algorithms, specifically in partially-observable, non-Markovian environments.
  • Methods: The paper introduces the Direct Preference Process, a new framework for analyzing LfPF problems, and uses the von Neumann-Morgenstern Expected Utility Theorem to establish conditions for the existence of optimal policies.
  • Results: The paper shows that the Direct Preference Process generalizes the standard reinforcement learning problem and provides future practitioners with the tools necessary for a more principled design of LfPF agents, narrowing the gap between empirical success and theoretical understanding.
  • for: 本文旨在填补从偏好反馈中学习(LfPF)算法的理论与应用之间的差距,特别是在部分可观测、非马尔可夫环境中。
  • methods: 本文引入直接偏好过程(Direct Preference Process),一种分析 LfPF 问题的新框架,并利用 von Neumann-Morgenstern 期望效用定理建立最优策略存在的条件。
  • results: 本文表明,直接偏好过程推广了标准强化学习问题,为未来的实践者提供了更有原则地设计 LfPF 代理的工具。
    Abstract Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. Using the von Neumann-Morgenstern Expected Utility Theorem, we show that the Direct Preference Process generalizes the standard reinforcement learning problem. Our findings narrow the gap between the empirical success and theoretical understanding of LfPF algorithms and provide future practitioners with the tools necessary for a more principled design of LfPF agents.
    摘要 从偏好反馈中学习(LfPF)在训练大语言模型和某些交互式学习代理中扮演着关键角色。然而,LfPF 算法的理论与应用之间存在较大差距。现有保证 LfPF 问题中最优策略存在的结果,都假设偏好和转移动态由马尔可夫决策过程决定。我们引入直接偏好过程这一新框架,用于在部分可观测、非马尔可夫环境中分析 LfPF 问题。在该框架下,我们通过考虑偏好的序结构,给出了保证最优策略存在的条件。利用 von Neumann-Morgenstern 期望效用定理,我们证明直接偏好过程推广了标准强化学习问题。我们的发现缩小了 LfPF 算法的实证成功与理论理解之间的差距,并为未来的实践者提供了更有原则地设计 LfPF 代理所需的工具。

Latent Diffusion Model for Conditional Reservoir Facies Generation

  • paper_url: http://arxiv.org/abs/2311.01968
  • repo_url: None
  • paper_authors: Daesoo Lee, Oscar Ovanger, Jo Eidsvik, Erlend Aune, Jacob Skauvold, Ragnar Hauge
  • for: used to generate high-fidelity reservoir facies realizations that preserve conditioning data
  • methods: uses a novel Latent Diffusion Model that leverages the superiority of diffusion models over GANs
  • results: significantly outperforms a GAN-based alternative in generating realistic reservoir facies
    Abstract Creating accurate and geologically realistic reservoir facies based on limited measurements is crucial for field development and reservoir management, especially in the oil and gas sector. Traditional two-point geostatistics, while foundational, often struggle to capture complex geological patterns. Multi-point statistics offers more flexibility, but comes with its own challenges. With the rise of Generative Adversarial Networks (GANs) and their success in various fields, there has been a shift towards using them for facies generation. However, recent advances in the computer vision domain have shown the superiority of diffusion models over GANs. Motivated by this, a novel Latent Diffusion Model is proposed, which is specifically designed for conditional generation of reservoir facies. The proposed model produces high-fidelity facies realizations that rigorously preserve conditioning data. It significantly outperforms a GAN-based alternative.
    摘要 基于有限测量构建准确且地质上逼真的储层相,对油气领域的油田开发和储层管理至关重要。传统的两点地质统计虽是基础方法,却往往难以刻画复杂的地质模式。多点统计提供了更大的灵活性,但也带来了自身的挑战。随着生成对抗网络(GAN)在各领域的成功,人们开始将其用于储层相生成。然而,计算机视觉领域的最新进展表明扩散模型优于 GAN。受此启发,我们提出了一种专为储层相条件生成设计的新型潜在扩散模型。该模型能生成高保真的储层相实现,并严格保持条件数据,其表现显著优于基于 GAN 的替代方法。

Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products

  • paper_url: http://arxiv.org/abs/2311.01960
  • repo_url: None
  • paper_authors: Tamas Sarlos, Xingyou Song, David Woodruff, Qiuyi, Zhang
  • for: 本文研究了 entrywise 变换设置下的低秩逼近问题,即在给定 $U, V^\top \in \mathbb{R}^{n \times r}$($r = O(\log(n))$)且 $f(x)$ 为一般标量函数时,寻找 $f(U\cdot V)$ 的一个好的秩 $k$ 逼近。
  • methods: 我们在前人关于次线性低秩逼近工作的基础上,给出了该问题的首个条件时间困难性结果,证明条件 (1) 与 (2) 都是必要的:否则无法在优于 $n^{2-o(1)}$ 的时间内得到相对误差的低秩逼近。
  • results: 我们给出了一个从强指数时间假设(SETH)出发的新颖归约,其关键在于对平坦稀疏向量的杠杆分数(leverage scores)给出下界;即使变换后矩阵 $f(UV)$ 的秩与目标秩均为 $n^{o(1)}$,且 $U = V^\top$,该归约依然成立。在 $U \neq V^\top$ 的情形下,我们还给出了形如 $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$ 的运行时间下界。最后,我们证明了这些下界是紧的:给出了一个 $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ 时间的相对误差逼近算法,以及一个快速的 $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ 加性误差逼近算法。
    Abstract Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{\omega-1})$ time constant relative error approximation algorithm, where $\omega \approx 2.376$ is the exponent of matrix multiplication. We give the first conditional time hardness results for this problem, demonstrating that both conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation for a wide class of functions. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$. Furthermore, even when $f(x) = x^p$ is a simple polynomial, we give runtime lower bounds in the case when $U \neq V^\top$ of the form $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$. Lastly, we demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive error approximation using fast tensor-based sketching. Additionally, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, for even a small matrix $W$, requires $\Omega(n^{2-o(1)})$ time.
    摘要 受自然语言处理中快速算法的启发,我们研究 entrywise 变换设置下的低秩逼近问题:在给定 $U, V^\top \in \mathbb{R}^{n \times r}$($r = O(\log(n))$)且 $f(x)$ 为一般标量函数时,寻找 $f(U \cdot V)$ 的一个好的秩 $k$ 逼近。先前关于次线性低秩逼近的工作表明,若同时满足 (1) $U = V^\top$ 和 (2) $f(x)$ 是 PSD 核函数,则存在一个 $O(nk^{\omega-1})$ 时间的常数相对误差逼近算法,其中 $\omega \approx 2.376$ 为矩阵乘法指数。我们给出了该问题的首个条件时间困难性结果,证明这两个条件都是必要的,否则对一大类函数无法在优于 $n^{2-o(1)}$ 的时间内得到相对误差低秩逼近。我们还给出了从强指数时间假设(SETH)出发的新颖归约,其依赖于对平坦稀疏向量杠杆分数的下界估计;即使 $f(UV)$ 的秩与目标秩均为 $n^{o(1)}$ 且 $U = V^\top$,该归约依然成立。此外,即便 $f(x) = x^p$ 是简单多项式,在 $U \neq V^\top$ 时我们也给出了形如 $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$ 的运行时间下界。最后,我们证明这些下界是紧的:给出了一个 $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ 时间的相对误差逼近算法,以及一个基于快速张量草图的 $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ 加性误差逼近算法。此外,由于我们的低秩算法依赖矩阵-向量乘子程序,这些下界还表明:即使 $W$ 是一个小矩阵,计算 $f(UV)W$ 也需要 $\Omega(n^{2-o(1)})$ 时间。
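
The polynomial case $f(x) = x^p$ admits an exact low-rank factorization that the upper bounds exploit: $(u_i \cdot v_j)^p = \langle u_i^{\otimes p}, v_j^{\otimes p}\rangle$, so $f(UV)$ has rank at most $r^p$. The sketch below verifies this identity numerically with row-wise tensor powers; it illustrates the structure, not the paper's fast sketching algorithm.

```python
import numpy as np

def rowwise_tensor_power(M, p):
    """Row i of the result is the flattened p-fold tensor power of row i."""
    K = M
    for _ in range(p - 1):
        K = np.einsum('ir,is->irs', K, M).reshape(M.shape[0], -1)
    return K

n, r, p = 200, 4, 3
rng = np.random.default_rng(0)
U, V = rng.normal(size=(n, r)), rng.normal(size=(r, n))

left = rowwise_tensor_power(U, p)         # n x r^p
right = rowwise_tensor_power(V.T, p).T    # r^p x n
# Entrywise power of a rank-r product is itself a rank <= r^p product.
assert np.allclose((U @ V) ** p, left @ right)
print(left.shape)   # (200, 64): the rank blow-up is r^p = 4^3
```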

Optimistic Multi-Agent Policy Gradient for Cooperative Tasks

  • paper_url: http://arxiv.org/abs/2311.01953
  • repo_url: None
  • paper_authors: Wenshuai Zhao, Yi Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen
  • for: 解决协同多智能体学习任务中的相对过度泛化(relative overgeneralization)问题,该问题在使用函数逼近时尤为突出。
  • methods: 我们提出了一个基于 Leaky ReLU 函数的简单通用框架,在更新策略时对优势函数进行重塑,使 MAPG 方法能够进行乐观更新,从而缓解相对过度泛化问题。
  • results: 我们在多种多智能体任务上进行了广泛评估,结果表明所提方法在 19 个测试任务中的 13 个上超过强基线,并在其余 6 个任务上与基线性能相当。
    Abstract \textit{Relative overgeneralization} (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. In early work, optimism has been shown to mitigate the \textit{RO} problem when using tabular Q-learning. However, with function approximation optimism can amplify overestimation and thus fail on complex tasks. On the other hand, recent deep multi-agent policy gradient (MAPG) methods have succeeded in many complex tasks but may fail with severe \textit{RO}. We propose a general, yet simple, framework to enable optimistic updates in MAPG methods and alleviate the RO problem. Specifically, we employ a \textit{Leaky ReLU} function where a single hyperparameter selects the degree of optimism to reshape the advantages when updating the policy. Intuitively, our method remains optimistic toward individual actions with lower returns which are potentially caused by other agents' sub-optimal behavior during learning. The optimism prevents the individual agents from quickly converging to a local optimum. We also provide a formal analysis from an operator view to understand the proposed advantage transformation. In extensive evaluations on diverse sets of tasks, including illustrative matrix games, complex \textit{Multi-agent MuJoCo} and \textit{Overcooked} benchmarks, the proposed method\footnote{Code can be found at \url{https://github.com/wenshuaizhao/optimappo}.} outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.
    摘要 相对过度泛化(RO)发生在协同多智能体学习任务中:由于对其他智能体次优行为的过度拟合,各智能体收敛到一个次优的联合策略。早期工作表明,在使用表格式 Q 学习时,乐观性可以缓解 RO 问题。然而,在函数逼近下,乐观性可能放大过高估计,从而在复杂任务上失效。另一方面,近期的深度多智能体策略梯度(MAPG)方法虽然在许多复杂任务上取得成功,但在严重的 RO 下仍可能失败。我们提出了一个通用而简单的框架,使 MAPG 方法能够进行乐观更新并缓解 RO 问题。具体而言,我们采用 Leaky ReLU 函数,由单个超参数选择乐观程度,在更新策略时对优势进行重塑。直观上,我们的方法对回报较低的个体动作保持乐观,这些低回报可能源于学习过程中其他智能体的次优行为;这种乐观性防止个体智能体过快收敛到局部最优。我们还从算子视角给出了对所提优势变换的形式化分析。在多组任务(包括示例性矩阵博弈、复杂的 Multi-agent MuJoCo 和 Overcooked 基准)上的大量评估中,所提方法(代码见 https://github.com/wenshuaizhao/optimappo)在 19 个测试任务中的 13 个上超过强基线,并在其余任务上与基线性能相当。
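
The advantage reshaping at the heart of the method is easy to state in code: a Leaky-ReLU transform whose negative slope is the optimism hyperparameter. The function below is a hedged sketch of that transform in isolation; where exactly it enters the MAPG loss follows the paper and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def optimistic_advantage(adv: torch.Tensor, optimism: float = 0.1) -> torch.Tensor:
    """Down-weight negative advantages by `optimism` so that actions with
    low returns -- possibly caused by teammates' sub-optimal exploratory
    behavior -- are punished less, keeping the policy update optimistic."""
    return F.leaky_relu(adv, negative_slope=optimism)

adv = torch.tensor([2.0, -1.0, 0.5, -3.0])
print(optimistic_advantage(adv))  # tensor([ 2.0000, -0.1000, 0.5000, -0.3000])
```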

ForecastPFN: Synthetically-Trained Zero-Shot Forecasting

  • paper_url: http://arxiv.org/abs/2311.01933
  • repo_url: https://github.com/abacusai/forecastpfn
  • paper_authors: Samuel Dooley, Gurnoor Singh Khurana, Chirag Mohapatra, Siddartha Naidu, Colin White
  • for: 这篇论文旨在解决初始观测数据极少(有时仅 40 个或更少)情形下的时间序列预测问题。
  • methods: 本文提出了名为 ForecastPFN 的预测模型,它是一个先验数据拟合网络(prior-data fitted network),训练目标是逼近贝叶斯推断,完全在新颖的合成数据分布上训练,只需一次前向传播即可对新的时间序列做出预测。
  • results: 实验结果表明,即使其他方法被允许在数百个额外的同分布数据点上训练,ForecastPFN 的零样本预测仍比最先进方法更准确、更快速。
    Abstract The vast majority of time-series forecasting approaches require a substantial training dataset. However, many real-life forecasting applications have very little initial observations, sometimes just 40 or fewer. Thus, the applicability of most forecasting methods is restricted in data-sparse commercial applications. While there is recent work in the setting of very limited initial data (so-called `zero-shot' forecasting), its performance is inconsistent depending on the data used for pretraining. In this work, we take a different approach and devise ForecastPFN, the first zero-shot forecasting model trained purely on a novel synthetic data distribution. ForecastPFN is a prior-data fitted network, trained to approximate Bayesian inference, which can make predictions on a new time series dataset in a single forward pass. Through extensive experiments, we show that zero-shot predictions made by ForecastPFN are more accurate and faster compared to state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.
    摘要 绝大多数时间序列预测方法都需要大量训练数据。然而,许多实际预测应用的初始观测非常少,有时仅有 40 个或更少。因此,大多数预测方法在数据稀缺的商业应用中的适用性受限。虽然最近有在极少初始数据下进行预测(即"零样本"预测)的工作,但其性能会因预训练所用数据而表现不稳定。在这项工作中,我们采用了不同的思路,提出了 ForecastPFN——首个完全在新颖的合成数据分布上训练的零样本预测模型。ForecastPFN 是一个先验数据拟合网络,训练目标是逼近贝叶斯推断,只需一次前向传播即可对新的时间序列数据做出预测。大量实验表明,即使其他方法被允许在数百个额外的同分布数据点上训练,ForecastPFN 的零样本预测仍然更准确、更快速。
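
Since the model is trained purely on synthetic series, the key ingredient is a synthetic prior one can sample from. The generator below (trend + seasonality + noise, with all distributional choices ours) only illustrates the idea; ForecastPFN's actual prior is richer and is described in the paper and repository.

```python
import numpy as np

def sample_synthetic_series(length=100, rng=None):
    """Draw one time series from a toy synthetic prior."""
    rng = rng if rng is not None else np.random.default_rng()
    t = np.arange(length)
    trend = rng.normal(0, 0.05) * t + rng.normal(0, 0.5)
    period = rng.integers(5, 30)
    season = rng.normal(0, 1.0) * np.sin(2 * np.pi * t / period)
    noise = rng.normal(0, 0.2, size=length)
    return trend + season + noise

# Pretraining pairs: the network maps a context window to future values,
# so at test time a single forward pass yields a zero-shot forecast.
series = sample_synthetic_series(rng=np.random.default_rng(0))
context, target = series[:80], series[80:]
print(context.shape, target.shape)
```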

Simplifying Transformer Blocks

  • paper_url: http://arxiv.org/abs/2311.01906
  • repo_url: https://github.com/bobby-he/simplified_transformers
  • paper_authors: Bobby He, Thomas Hofmann
  • For: The paper aims to simplify the standard transformer block to improve training speed and reduce the number of parameters.
  • Methods: The authors use signal propagation theory and empirical observations to motivate modifications to the standard transformer block, including removing skip connections, projection or value parameters, sequential sub-blocks, and normalization layers.
  • Results: The simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput and using 15% fewer parameters.
    Abstract A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters.
    摘要 深度 Transformer 的一个简单设计配方是堆叠相同的构建块。但标准 Transformer 块远非简单:它以精确的方式将注意力与 MLP 子块、跳跃连接和归一化层交织在一起。这种复杂性导致架构十分脆弱,看似微小的改动就可能显著降低训练速度,甚至使模型无法训练。在这项工作中,我们探讨标准 Transformer 块能被简化到什么程度。结合信号传播理论和实验观察,我们提出了若干修改,使许多块组件(包括跳跃连接、投影或值参数、顺序子块和归一化层)可以被移除而不损失训练速度。在自回归解码器和 BERT 编码器两类模型上的实验中,我们的简化 Transformer 在每次更新的训练速度和性能上与标准 Transformer 相当,同时训练吞吐量提高 15%,参数量减少 15%。
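
A structural sketch of such a simplified block is shown below: no normalization, no skip connection, identity value/output projections, and parallel attention and MLP sub-blocks. The paper pairs these removals with initializations derived from signal propagation theory, which we omit, so treat this as an illustration of the architecture only.

```python
import torch
import torch.nn as nn

class SimplifiedBlock(nn.Module):
    """Structural sketch: no normalization, no skip connection, identity
    value/output projections, parallel attention and MLP sub-blocks."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model, bias=False)   # query projection
        self.k = nn.Linear(d_model, d_model, bias=False)   # key projection
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                                  # x: (batch, seq, d)
        B, T, D = x.shape
        h, d = self.n_heads, D // self.n_heads
        q = self.q(x).view(B, T, h, d).transpose(1, 2)
        k = self.k(x).view(B, T, h, d).transpose(1, 2)
        v = x.view(B, T, h, d).transpose(1, 2)             # identity values
        att = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, D)   # no output projection
        return out + self.mlp(x)                           # parallel, no norm/skip

x = torch.randn(2, 16, 64)
print(SimplifiedBlock(64, 4, 256)(x).shape)   # torch.Size([2, 16, 64])
```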

High Precision Causal Model Evaluation with Conditional Randomization

  • paper_url: http://arxiv.org/abs/2311.01902
  • repo_url: None
  • paper_authors: Chao Ma, Cheng Zhang
  • for: 评估因果模型的金标准是将模型预测与随机对照试验(RCT)估计的真实效应进行比较,但 RCT 并不总是可行或符合伦理;此时可采用基于逆概率加权(IPW)的条件随机化实验,但其估计方差可能很高。
  • methods: 我们引入一种新的低方差因果误差估计器,称为 pairs 估计器:将同一个 IPW 估计器同时应用于模型效应和真实实验效应,使 IPW 引起的方差相互抵消,从而获得更小的渐近方差。
  • results: 我们的方法能在条件随机化设置下改进因果推断模型的评估,实验表明其可达到接近 RCT 的性能。这一简单而强大的方法无需对 IPW 估计器本身做复杂修改,使模型评估更加稳健可靠。
    Abstract The gold standard for causal model evaluation involves comparing model predictions with true effects estimated from randomized controlled trials (RCT). However, RCTs are not always feasible or ethical to perform. In contrast, conditionally randomized experiments based on inverse probability weighting (IPW) offer a more realistic approach but may suffer from high estimation variance. To tackle this challenge and enhance causal model evaluation in real-world conditional randomization settings, we introduce a novel low-variance estimator for causal error, dubbed as the pairs estimator. By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller asymptotic variance. Empirical studies demonstrate the improved performance of our estimator, highlighting its potential for achieving near-RCT performance. Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, paving the way for more robust and reliable model assessments.
    摘要 评估因果模型的金标准是将模型预测与随机对照试验(RCT)估计的真实效应进行比较。然而,RCT 并不总是可行或符合伦理。相比之下,基于逆概率加权(IPW)的条件随机化实验提供了更现实的途径,但可能存在较高的估计方差。为应对这一挑战并改进真实世界条件随机化设置下的因果模型评估,我们引入了一种新的低方差因果误差估计器,称为 pairs 估计器。通过将同一个 IPW 估计器同时应用于模型效应和真实实验效应,我们的估计器有效抵消了 IPW 带来的方差,获得更小的渐近方差。实证研究展示了该估计器的性能改进,凸显其达到接近 RCT 性能的潜力。我们的方法无需对 IPW 估计器本身做复杂修改,即可在条件随机化设置下评估因果推断模型,为更稳健可靠的模型评估铺平了道路。
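
The variance-cancellation idea can be sketched in a few lines: apply one and the same IPW estimator to the model's predicted outcomes and to the observed outcomes, then take the difference, so the shared weights largely cancel. Function names and the toy data-generating process below are ours; the paper's estimator and analysis are more general.

```python
import numpy as np

def ipw_ate(y, t, p):
    """Inverse-probability-weighted average treatment effect estimate."""
    return np.mean(t * y / p - (1 - t) * y / (1 - p))

def pairs_causal_error(y_obs, y_model, t, p):
    """Difference of two IPW estimates sharing the same weights, so the
    weight-induced variance largely cancels (illustrative sketch)."""
    return ipw_ate(y_model, t, p) - ipw_ate(y_obs, t, p)

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                    # known randomization propensity
t = rng.binomial(1, p)
y = 1.5 * t + x + rng.normal(size=n)        # true ATE = 1.5
y_hat = 1.2 * t + x                         # model implying ATE = 1.2
print(pairs_causal_error(y, y_hat, t, p))   # close to the true error -0.3
```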

Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization

  • paper_url: http://arxiv.org/abs/2311.01900
  • repo_url: None
  • paper_authors: Alejandro de la Concha, Nicolas Vayatis, Argyris Kalogeratos
  • for: 本研究旨在提出一种在线非参数似然比估计(OLRE)方法,用于在随时间观测到 iid 样本对 $(x_t \sim p, x'_t \sim q)$ 的设置下,比较两个概率密度函数 $p$ 和 $q$ 的差异。
  • methods: 我们的方法基于核方法和函数最小化的最新进展,能够高效地在线更新估计器;方法是非参数的,即不要求知道 $p$ 和 $q$ 的形式。
  • results: 我们为 OLRE 方法的性能提供了理论保证,并通过合成实验进行了实证验证。
    Abstract Quantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -- to our best knowledge -- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated online. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
    摘要 利用已有数据量化两个概率密度函数 $p$ 和 $q$ 之间的差异,是统计学和机器学习中的一个基本问题。解决该问题的常用途径是对 $p$ 和 $q$ 进行似然比估计(LRE),而据我们所知,已有研究主要针对离线情形。本文提出了一个新的在线非参数 LRE(OLRE)框架,适用于随时间观测到独立同分布样本对 $(x_t \sim p, x'_t \sim q)$ 的设置。我们方法的非参数特性使其不依赖于 $p$ 和 $q$ 的具体形式。此外,我们利用核方法和函数最小化的最新进展,构造了一个可以高效在线更新的估计器。我们为 OLRE 方法的性能提供了理论保证,并在合成实验中进行了实证验证。
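
A minimal streaming instance of the idea: model the ratio $r(x) \approx p(x)/q(x)$ as a kernel expansion and run stochastic gradient descent on the Pearson-divergence objective $J(r) = \tfrac12\mathbb{E}_q[r(x)^2] - \mathbb{E}_p[r(x)]$, whose minimizer is the true ratio. Centers, bandwidth, and step size below are arbitrary choices of ours; the paper's OLRE estimator and its guarantees are functional-space counterparts of this sketch.

```python
import numpy as np

class OnlineLRE:
    """Online kernel likelihood-ratio estimation via Pearson-divergence
    minimization (illustrative sketch)."""

    def __init__(self, centers, bandwidth=1.0, lr=0.02):
        self.c, self.h, self.lr = centers, bandwidth, lr
        self.theta = np.zeros(len(centers))

    def _phi(self, x):
        return np.exp(-np.sum((self.c - x) ** 2, axis=1) / (2 * self.h ** 2))

    def ratio(self, x):
        return float(self.theta @ self._phi(x))

    def update(self, x_p, x_q):
        phi_p, phi_q = self._phi(x_p), self._phi(x_q)
        # Stochastic gradient of J at the pair (x_p ~ p, x_q ~ q).
        self.theta -= self.lr * ((self.theta @ phi_q) * phi_q - phi_p)

rng = np.random.default_rng(0)
est = OnlineLRE(centers=rng.normal(size=(50, 1)))
for _ in range(20_000):
    est.update(rng.normal(0.5, 1, size=1), rng.normal(0.0, 1, size=1))
print(est.ratio(np.array([0.5])))   # roughly p(0.5)/q(0.5) = e^0.125 ≈ 1.13
```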

Learning Sparse Codes with Entropy-Based ELBOs

  • paper_url: http://arxiv.org/abs/2311.01888
  • repo_url: None
  • paper_authors: Dmytro Velychko, Simon Damm, Asja Fischer, Jörg Lücke
  • for: 本文旨在为标准概率稀疏编码的参数推导一个纯粹基于熵的学习目标,以支持非平凡的后验近似。
  • methods: 本文证明标准 ELBO 目标收敛为若干熵之和,并由此得到完全解析的、基于熵的变分学习目标,同时支持一种新的有原则的退火形式。
  • results: 实验结果表明,利用这种基于熵的 ELBO 可以有效学习概率稀疏编码模型,并适配不同的后验近似(包括具有相关潜变量的高斯近似和深度摊销近似)。
    Abstract Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike for previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning.
    摘要 标准的概率稀疏编码假设拉普拉斯先验、从潜变量到观测量的线性映射,以及高斯观测分布。我们在此为标准稀疏编码的参数推导出一个纯粹基于熵的学习目标。这个新的变分目标具有以下特点:(A)与 MAP 近似不同,它使用非平凡的后验近似进行概率推断;(B)与以往的非平凡近似不同,该新目标是完全解析的;(C)该目标允许一种新的、有原则的退火形式。推导的出发点是证明标准 ELBO 目标收敛为若干熵之和,这与近期关于高斯先验生成模型的类似结果相吻合。随后我们证明,使 ELBO 等于熵之和的条件具有解析解,从而得到完全解析的目标。数值实验证明了使用这种基于熵的 ELBO 进行学习的可行性。我们考察了不同的后验近似,包括具有相关潜变量的高斯近似和深度摊销近似;此外,我们还在数值上研究了基于熵的退火,它能够改进学习效果。不过,我们的主要贡献是理论性的,具体有两点:(1)对于非平凡的后验近似,我们给出了(据作者所知)标准概率稀疏编码的首个解析 ELBO 目标;(2)我们首次展示了如何利用近期证明的 ELBO 向熵之和的收敛来进行学习。

Domain Randomization via Entropy Maximization

  • paper_url: http://arxiv.org/abs/2311.01885
  • repo_url: https://github.com/gabrieletiboni/doraemon
  • paper_authors: Gabriele Tiboni, Pascal Klink, Jan Peters, Tatiana Tommasi, Carlo D’Eramo, Georgia Chalvatzaki
  • for: 本研究旨在解决域随机化(DR)中的仿真到现实差距问题,即如何在不依赖真实数据的情况下获得对动力学参数变化鲁棒的 RL 策略。
  • methods: 本研究提出了基于熵最大化的域随机化(DORAEMON),这是一个约束优化问题:只要当前策略的成功概率足够高,就直接最大化训练分布的熵,从而自动调整环境动力学参数的分布。
  • results: 实验结果表明,DORAEMON 能获得高度适应且可泛化的策略,即能在尽可能宽的动力学参数范围内解决任务;此外,它还能在真实世界参数未知的机器人操作设置中实现零样本迁移。
    Abstract Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.
    摘要 在仿真中变化动力学参数是一种流行的域随机化(DR)方法,用于克服强化学习(RL)中的现实差距。然而,DR 强烈依赖动力学参数采样分布的选择:高变异性对正则化智能体的行为至关重要,但过度随机化又众所周知地会导致过于保守的策略。本文提出了一种解决仿真到现实迁移的新方法,它在仿真训练过程中自动塑造动力学分布,而无需真实世界数据。我们提出了基于熵最大化的域随机化(DORAEMON),这是一个约束优化问题,在保持泛化能力的同时直接最大化训练分布的熵。为此,只要当前策略的成功概率足够高,DORAEMON 就会逐渐增加所采样动力学参数的多样性。我们通过实证验证了 DORAEMON 在获得高度适应且可泛化策略方面的一致优势,即相比 DR 文献中的代表性基线,它能在最宽的动力学参数范围内解决任务。值得注意的是,我们还通过在真实世界参数未知的机器人操作设置中成功实现零样本迁移,展示了 DORAEMON 的 Sim2Real 适用性。
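
The mechanism is easy to convey with a one-dimensional uniform randomization range, whose entropy is log(high − low): widen the range while the policy's success rate stays high, back off otherwise. The step function and thresholds below are our illustrative stand-ins; DORAEMON itself solves a constrained entropy-maximization problem over parametric distributions.

```python
import numpy as np

def doraemon_step(low, high, success_rate, success_threshold=0.5,
                  widen=1.05, shrink=0.95):
    """Widen a uniform randomization range (entropy = log(high - low))
    while the policy succeeds often enough; otherwise back off.
    Illustrative stand-in for DORAEMON's constrained entropy maximization."""
    mid, half = (low + high) / 2, (high - low) / 2
    half *= widen if success_rate >= success_threshold else shrink
    return mid - half, mid + half

rng = np.random.default_rng(0)
low, high = 0.9, 1.1            # e.g. a mass-scaling dynamics parameter
for epoch in range(10):
    success_rate = rng.uniform(0.4, 0.9)   # stand-in for evaluation rollouts
    low, high = doraemon_step(low, high, success_rate)
    print(f"epoch {epoch}: range = ({low:.3f}, {high:.3f})")
```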

Spectral Clustering of Attributed Multi-relational Graphs

  • paper_url: http://arxiv.org/abs/2311.01840
  • repo_url: None
  • paper_authors: Ylli Sadikaj, Yllka Velaj, Sahar Behzadi, Claudia Plant
  • for: 本研究旨在提出一种同时利用多种关系和节点属性的图聚类方法,以便更好地理解图结构与属性之间的相互关系。
  • methods: 本研究提出了 SpectralMix 方法,一种联合降维技术,将属性、不同类型的关系以及图结构的全部信息整合起来,从而支持对聚类结果的合理解释。
  • results: 实验结果表明,SpectralMix 能够发现图结构与类别属性之间的依赖关系,并在多个真实数据集上优于现有方法。
    Abstract Graph clustering aims at discovering a natural grouping of the nodes such that similar nodes are assigned to a common cluster. Many different algorithms have been proposed in the literature: for simple graphs, for graphs with attributes associated to nodes, and for graphs where edges represent different types of relations among nodes. However, complex data in many domains can be represented as both attributed and multi-relational networks. In this paper, we propose SpectralMix, a joint dimensionality reduction technique for multi-relational graphs with categorical node attributes. SpectralMix integrates all information available from the attributes, the different types of relations, and the graph structure to enable a sound interpretation of the clustering results. Moreover, it generalizes existing techniques: it reduces to spectral embedding and clustering when only applied to a single graph and to homogeneity analysis when applied to categorical data. Experiments conducted on several real-world datasets enable us to detect dependencies between graph structure and categorical attributes, moreover, they exhibit the superiority of SpectralMix over existing methods.
    摘要 图聚类旨在发现节点的自然分组,使相似的节点被分配到同一个簇中。文献中已提出许多不同的算法:针对简单图的、针对节点带属性的图的,以及针对边表示节点间不同类型关系的图的。然而,许多领域中的复杂数据可以同时表示为带属性的多关系网络。本文提出了 SpectralMix,一种针对带类别型节点属性的多关系图的联合降维技术。SpectralMix 整合了来自属性、不同类型关系以及图结构的所有可用信息,从而支持对聚类结果的合理解释。此外,它推广了现有技术:仅应用于单个图时退化为谱嵌入与聚类,应用于类别型数据时退化为同质性分析。在多个真实数据集上的实验使我们能够发现图结构与类别属性之间的依赖关系,并展示了 SpectralMix 相对现有方法的优越性。
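
A minimal joint spectral pipeline in the same spirit: fuse several relation types and a one-hot categorical attribute into one affinity matrix, embed with Laplacian eigenvectors, and cluster. The equal fusion weights are arbitrary; SpectralMix optimizes a dedicated joint objective rather than this simple average.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 60
A1 = np.triu(rng.random((n, n)) < 0.08, 1); A1 = (A1 | A1.T).astype(float)
A2 = np.triu(rng.random((n, n)) < 0.05, 1); A2 = (A2 | A2.T).astype(float)
attr = rng.integers(0, 3, size=n)          # one categorical node attribute
X = np.eye(3)[attr]                        # one-hot encoding

A = 0.5 * (A1 + A2) + 0.5 * (X @ X.T)      # fuse relations + attribute agreement
L = np.diag(A.sum(1)) - A                  # unnormalized graph Laplacian
_, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
embedding = vecs[:, 1:4]                   # drop the trivial constant eigenvector
labels = KMeans(n_clusters=3, n_init=10).fit_predict(embedding)
print(np.bincount(labels))                 # cluster sizes
```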

Mix-ME: Quality-Diversity for Multi-Agent Learning

  • paper_url: http://arxiv.org/abs/2311.01829
  • repo_url: None
  • paper_authors: Garðar Ingvarsson, Mikayel Samvelyan, Bryan Lim, Manon Flageat, Antoine Cully, Tim Rocktäschel
  • for: 本研究旨在探讨多智能体系统中的质量多样性(Quality-Diversity,QD)方法,以便在不同的情境和需求下发现多样化的高性能解。
  • methods: 本研究提出了 Mix-ME,一种流行的 MAP-Elites 算法的多智能体变体,它通过类似交叉的算子,将来自不同团队的智能体混合以生成新解。
  • results: 在多种部分可观测的连续控制任务上的评估表明,Mix-ME 得到的多智能体变体不仅能与单智能体基线竞争,而且在部分可观测的多智能体设置下常常优于它们。
    Abstract In many real-world systems, such as adaptive robotics, achieving a single, optimised solution may be insufficient. Instead, a diverse set of high-performing solutions is often required to adapt to varying contexts and requirements. This is the realm of Quality-Diversity (QD), which aims to discover a collection of high-performing solutions, each with their own unique characteristics. QD methods have recently seen success in many domains, including robotics, where they have been used to discover damage-adaptive locomotion controllers. However, most existing work has focused on single-agent settings, despite many tasks of interest being multi-agent. To this end, we introduce Mix-ME, a novel multi-agent variant of the popular MAP-Elites algorithm that forms new solutions using a crossover-like operator by mixing together agents from different teams. We evaluate the proposed methods on a variety of partially observable continuous control tasks. Our evaluation shows that these multi-agent variants obtained by Mix-ME not only compete with single-agent baselines but also often outperform them in multi-agent settings under partial observability.
    摘要 在许多现实系统(如自适应机器人)中,单个最优解可能并不足够;往往需要一组多样化的高性能解,以适应不同的情境与需求。这正是质量多样性(QD)的研究范畴,其目标是发现一组各具特色的高性能解。QD 方法近来在多个领域取得成功,包括机器人领域,被用于发现损伤自适应的运动控制器。然而,尽管许多受关注的任务是多智能体的,现有工作大多集中于单智能体设置。为此,我们提出 Mix-ME,一种流行的 MAP-Elites 算法的新型多智能体变体,它通过类似交叉的算子,将来自不同团队的智能体混合以生成新解。我们在多种部分可观测的连续控制任务上评估了所提方法。评估表明,Mix-ME 得到的多智能体变体不仅能与单智能体基线竞争,而且在部分可观测的多智能体设置下常常优于它们。
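
The crossover-like variation operator can be sketched independently of the rest of MAP-Elites: build a new joint team by drawing each agent slot from a possibly different elite team in the archive. The function below is a hypothetical minimal version; the full algorithm also evaluates the offspring and inserts it into the behavior-descriptor grid.

```python
import random

def mix_me_variation(archive, n_agents):
    """Compose a new team: agent slot i is copied from a randomly chosen
    parent team's slot i, mixing agents across teams (illustrative sketch)."""
    parents = [random.choice(list(archive.values())) for _ in range(n_agents)]
    return [parents[i][i] for i in range(n_agents)]

# Toy archive: behavior-descriptor cells -> teams of per-agent parameters.
archive = {(0, 1): ["a0", "a1", "a2"], (2, 0): ["b0", "b1", "b2"]}
random.seed(0)
print(mix_me_variation(archive, n_agents=3))   # e.g. ['b0', 'a1', 'a2']
```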

Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees

  • paper_url: http://arxiv.org/abs/2311.01806
  • repo_url: None
  • paper_authors: Yingzhen Yang, Ping Li
  • for: solving large-scale optimization problems with regularization functions, such as least square problems with convex or nonconvex regularization.
  • methods: proposes a fast sketching algorithm called Sketching for Regularized Optimization (SRO), which generates a sketch of the original data matrix and solves the sketched problem to obtain the optimization results.
  • results: the proposed algorithm handles general Fréchet subdifferentiable regularization functions in a unified framework, and provides general theoretical results for the approximation error between the original problem and the sketched problem for regularized least squares problems. Additionally, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are obtained under mild conditions.
    Abstract Randomized algorithms are important for solving large-scale optimization problems. In this paper, we propose a fast sketching algorithm for least square problems regularized by convex or nonconvex regularization functions, Sketching for Regularized Optimization (SRO). Our SRO algorithm first generates a sketch of the original data matrix, then solves the sketched problem. Different from existing randomized algorithms, our algorithm handles general Frechet subdifferentiable regularization functions in an unified framework. We present general theoretical result for the approximation error between the optimization results of the original problem and the sketched problem for regularized least square problems which can be convex or nonconvex. For arbitrary convex regularizer, relative-error bound is proved for the approximation error. Importantly, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are also obtained using our general theoretical result under mild conditions. To the best of our knowledge, our results are among the first to demonstrate minimax rates for convex or nonconvex sparse learning problem by sketching under a unified theoretical framework. We further propose an iterative sketching algorithm which reduces the approximation error exponentially by iteratively invoking the sketching algorithm. Experimental results demonstrate the effectiveness of the proposed SRO and Iterative SRO algorithms.
    摘要 随机算法在解决大规模优化问题上具有重要的意义。在这篇论文中,我们提出了一种快速的笔记算法,即Sketching for Regularized Optimization(SRO)。我们的SRO算法首先生成了原始数据矩阵的笔记,然后解决笔记中的问题。与现有的随机算法不同,我们的算法可以处理通用的Fréchet次导函数。我们提供了对于各种正则化函数的通用理论结论,包括对于几何函数的 bounds。我们还证明了对于各种正则化函数的最小最大rate,并通过实验证明了我们的提案的有效性。

On the Generalization Properties of Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.01797
  • repo_url: https://github.com/lphleo/diffusion_generalization
  • paper_authors: Puheng Li, Zhong Li, Huishuai Zhang, Jiang Bian
  • for: 这篇论文旨在从理论上理解扩散模型的泛化能力。
  • methods: 论文对基于得分的扩散模型随训练动态演化的泛化差距给出理论估计,并将分析推广到数据依赖的场景。
  • results: 研究表明,在提前停止时,扩散模型的泛化误差随样本量和模型容量呈多项式减小,并定量刻画了目标分布中"模式偏移"对泛化的不利影响;这些估计也得到了数值模拟的验证。
    Abstract Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.
    摘要 扩散模型是一类生成模型,旨在在经验观测到但未知的目标分布与已知先验分布之间建立随机传输映射。尽管它们在实际应用中表现出色,对其泛化能力的理论理解仍不充分。本工作对扩散模型的泛化特性展开了系统的理论探索。我们建立了随基于得分的扩散模型训练动态演化的泛化差距的理论估计,表明在提前停止时,泛化误差关于样本量 $n$ 和模型容量 $m$ 均为多项式小($O(n^{-2/5}+m^{-4/5})$),从而避开了维数灾难(即不随数据维度呈指数增长)。此外,我们将定量分析推广到数据依赖的场景:目标分布被刻画为一系列模式间距离逐渐增大的密度函数。这准确地阐明了真实分布中"模式偏移"对模型泛化的不利影响。这些估计不仅是理论构造,还经过了数值模拟的验证。我们的发现有助于严格理解扩散模型的泛化性质,并为实际应用提供指导。

Learning to Augment Distributions for Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2311.01796
  • repo_url: None
  • paper_authors: Qizhou Wang, Zhen Fang, Yonggang Zhang, Feng Liu, Yixuan Li, Bo Han
  • for: 本文旨在解决开放世界下利用辅助 OOD 数据进行 OOD 检测时,辅助数据与未见真实 OOD 数据之间存在分布差异的问题。
  • methods: 本文提出了分布增强的 OOD 学习(DAL)方法,通过构造一个以辅助 OOD 分布为中心的 Wasserstein 球内所有分布组成的 OOD 分布集,来缓解 OOD 分布差异。
  • results: 在多种具有代表性的 OOD 检测设置上的广泛评估表明,仅用辅助 OOD 数据训练的 DAL 预测器即可提升开放世界下的 OOD 检测性能,优于先进的同类方法。
    Abstract Open-world classification systems should discern out-of-distribution (OOD) data whose labels deviate from those of in-distribution (ID) cases, motivating recent studies in OOD detection. Advanced works, despite their promising progress, may still fail in the open world, owing to the lack of knowledge about unseen OOD data in advance. Although one can access auxiliary OOD data (distinct from unseen ones) for model training, it remains to analyze how such auxiliary data will work in the open world. To this end, we delve into such a problem from a learning theory perspective, finding that the distribution discrepancy between the auxiliary and the unseen real OOD data is the key to affecting the open-world detection performance. Accordingly, we propose Distributional-Augmented OOD Learning (DAL), alleviating the OOD distribution discrepancy by crafting an OOD distribution set that contains all distributions in a Wasserstein ball centered on the auxiliary OOD distribution. We justify that the predictor trained over the worst OOD data in the ball can shrink the OOD distribution discrepancy, thus improving the open-world detection performance given only the auxiliary OOD data. We conduct extensive evaluations across representative OOD detection setups, demonstrating the superiority of our DAL over its advanced counterparts.
    摘要 To address this discrepancy, we propose Distributional-Augmented OOD Learning (DAL), which involves crafting an OOD distribution set that contains all distributions within a Wasserstein ball centered on the auxiliary OOD distribution. We show that training a predictor over the worst OOD data in the ball can help shrink the OOD distribution discrepancy, thereby improving open-world detection performance given only the auxiliary OOD data.We conduct extensive evaluations across representative OOD detection setups and demonstrate the superiority of our DAL over its advanced counterparts.

Efficient Generalized Low-Rank Tensor Contextual Bandits

  • paper_url: http://arxiv.org/abs/2311.01771
  • repo_url: None
  • paper_authors: Qianxin Yi, Yiyang Yang, Yao Wang, Shaojie Tang
  • for: This paper aims to provide high-usable and accountable decision-making services by building a novel bandits algorithm that fully harnesses the power of multi-dimensional data and non-linear reward functions.
  • methods: The paper introduces a generalized low-rank tensor contextual bandits model, which represents an action as a tensor and determines the reward through a generalized linear function. The algorithm “Generalized Low-Rank Tensor Exploration Subspace then Refine” (G-LowTESTR) is introduced to effectively trade off exploration and exploitation.
  • results: The paper shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases through theoretical analysis and simulations/real data experiments. The algorithm is able to capitalize on the low-rank tensor structure for enhanced learning.
    Abstract In this paper, we aim to build a novel bandits algorithm that is capable of fully harnessing the power of multi-dimensional data and the inherent non-linearity of reward functions to provide high-usable and accountable decision-making services. To this end, we introduce a generalized low-rank tensor contextual bandits model in which an action is formed from three feature vectors, and thus can be represented by a tensor. In this formulation, the reward is determined through a generalized linear function applied to the inner product of the action's feature tensor and a fixed but unknown parameter tensor with a low tubal rank. To effectively achieve the trade-off between exploration and exploitation, we introduce a novel algorithm called "Generalized Low-Rank Tensor Exploration Subspace then Refine" (G-LowTESTR). This algorithm first collects raw data to explore the intrinsic low-rank tensor subspace information embedded in the decision-making scenario, and then converts the original problem into an almost lower-dimensional generalized linear contextual bandits problem. Rigorous theoretical analysis shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases. We conduct a series of simulations and real data experiments to further highlight the effectiveness of G-LowTESTR, leveraging its ability to capitalize on the low-rank tensor structure for enhanced learning.
    摘要 本文旨在构建一种新型的多臂赌博机算法,充分利用多维数据的力量与奖励函数固有的非线性,以提供高可用且可问责的决策服务。为此,我们引入一种广义低秩张量上下文赌博机模型:动作由三个特征向量构成,因而可表示为一个张量;奖励由一个广义线性函数决定,其输入为动作特征张量与一个固定但未知、具有低 tubal 秩的参数张量的内积。为了有效权衡探索与利用,我们提出了一种新算法"广义低秩张量探索子空间再精炼"(G-LowTESTR)。该算法首先收集原始数据,以探索决策场景中蕴含的低秩张量子空间信息,然后将原问题转化为一个维度几乎更低的广义线性上下文赌博机问题。严格的理论分析表明,G-LowTESTR 的遗憾界优于向量化与矩阵化情形。我们通过一系列仿真与真实数据实验进一步凸显了 G-LowTESTR 的有效性,展示其利用低秩张量结构增强学习的能力。

Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel

  • paper_url: http://arxiv.org/abs/2311.01762
  • repo_url: None
  • paper_authors: Oskar Allerbo
  • For: 该研究探讨了核岭回归(KRR)中核在训练过程中发生变化时对模型复杂度和泛化性的影响,并提出了一种在训练过程中让带宽逐渐减小到零的更新方案。
  • Methods: 该研究使用 KRR 的梯度下降迭代解法,并考察了在训练过程中改变核(带宽)的影响。
  • Results: 研究发现,训练中递减带宽可以省去通过交叉验证或边际似然最大化选择固定带宽的超参数调节,并优于使用恒定带宽;它还能同时实现零训练误差与良好的泛化,并呈现出双下降现象。
    Abstract Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes a matrix inversion, or iteratively through gradient descent. Using the iterative approach opens up for changing the kernel during training, something that is investigated in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translational-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that using a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization, and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks.
    摘要 核岭回归(KRR)是线性岭回归的推广:它对数据是非线性的,但对参数是线性的。其解既可以通过包含矩阵求逆的闭式解获得,也可以通过梯度下降迭代求得。采用迭代方式使得在训练过程中改变核成为可能,这正是本文所研究的问题。我们从理论上分析了这样做对模型复杂度和泛化性的影响。基于这些发现,我们为平移不变核提出了一种带宽更新方案:在训练过程中让带宽逐渐减小到零,从而省去超参数选择。我们在真实与合成数据上展示了训练中递减带宽优于使用由交叉验证和边际似然最大化选出的恒定带宽。我们还从理论与实验上证明,使用递减带宽能够同时实现零训练误差与良好的泛化,并出现双下降行为;这些现象在恒定带宽的 KRR 中不会发生,但已知会出现在神经网络中。
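
A compact sketch of the training loop: kernel regression fitted by functional gradient descent while the Gaussian bandwidth shrinks each iteration. The geometric decay and step size are our arbitrary choices; the paper derives the schedule and shows it also preserves generalization, which this toy run does not attempt to demonstrate.

```python
import numpy as np

def gaussian_kernel(X1, X2, h):
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * h ** 2))

def krr_gd_decreasing_bandwidth(X, y, n_iters=1500, lr=0.05, h0=1.0, h_min=0.05):
    """Fit kernel regression by gradient descent while the bandwidth
    decays during training (illustrative schedule, not the paper's)."""
    alpha = np.zeros(len(X))
    h = h0
    for _ in range(n_iters):
        K = gaussian_kernel(X, X, h)
        alpha -= lr * (K @ alpha - y)   # functional gradient step on squared loss
        h = max(h * 0.995, h_min)       # bandwidth decays toward (near) zero
    return alpha, h

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(X) + 0.1 * rng.normal(size=40)
alpha, h = krr_gd_decreasing_bandwidth(X, y)
train_fit = gaussian_kernel(X, X, h) @ alpha
print(np.abs(train_fit - y).max())      # near-zero training error at small bandwidth
```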

TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices

  • paper_url: http://arxiv.org/abs/2311.01759
  • repo_url: None
  • paper_authors: Jianlei Yang, Jiacheng Liao, Fanding Lei, Meichen Liu, Junyi Chen, Lingkun Long, Han Wan, Bei Yu, Weisheng Zhao
  • for: 本研究旨在面向各类嵌入式物联网应用,在微控制器单元(MCU)上开发和部署深度学习模型。
  • methods: 该研究提出了 TinyFormer 框架,用于在 MCU 上开发和部署资源高效的 Transformer 模型。TinyFormer 由 SuperNAS、SparseNAS 和 SparseEngine 三部分组成:SuperNAS 在庞大的搜索空间中搜索适合 MCU 的超网络;SparseNAS 在所得超网络中评估包含 Transformer 架构的最佳稀疏单路径模型;SparseEngine 则高效地将搜索到的稀疏模型部署到 MCU 上进行推理。
  • results: 在 CIFAR-10 数据集上的评估表明,TinyFormer 能开发出准确率达 $96.1\%$ 的高效 Transformer,同时满足 MCU 的硬件限制($1$MB 存储和 $320$KB 内存)。此外,与 CMSIS-NN 库相比,TinyFormer 在稀疏推理中实现了最高 $12.2\times$ 的显著加速。TinyFormer 有望将强大的 Transformer 引入 TinyML 场景,大幅拓展深度学习应用的范围。
    Abstract Developing deep learning models on tiny devices (e.g. Microcontroller units, MCUs) has attracted much attention in various embedded IoT applications. However, it is challenging to efficiently design and deploy recent advanced models (e.g. transformers) on tiny devices due to their severe hardware resource constraints. In this work, we propose TinyFormer, a framework specifically designed to develop and deploy resource-efficient transformers on MCUs. TinyFormer mainly consists of SuperNAS, SparseNAS and SparseEngine. Separately, SuperNAS aims to search for an appropriate supernet from a vast search space. SparseNAS evaluates the best sparse single-path model including transformer architecture from the identified supernet. Finally, SparseEngine efficiently deploys the searched sparse models onto MCUs. To the best of our knowledge, SparseEngine is the first deployment framework capable of performing inference of sparse models with transformer on MCUs. Evaluation results on the CIFAR-10 dataset demonstrate that TinyFormer can develop efficient transformers with an accuracy of $96.1\%$ while adhering to hardware constraints of $1$MB storage and $320$KB memory. Additionally, TinyFormer achieves significant speedups in sparse inference, up to $12.2\times$, when compared to the CMSIS-NN library. TinyFormer is believed to bring powerful transformers into TinyML scenarios and greatly expand the scope of deep learning applications.
    摘要 在微控制器单元(MCU)等微型设备上开发深度学习模型,已在各类嵌入式物联网应用中引起广泛关注。然而,由于 MCU 的硬件资源受到严格限制,在其上高效设计并部署近期的先进模型(如 Transformer)颇具挑战。在这项工作中,我们提出了 TinyFormer,一个专为在 MCU 上开发和部署资源高效 Transformer 而设计的框架。TinyFormer 主要由 SuperNAS、SparseNAS 和 SparseEngine 组成:SuperNAS 在庞大的搜索空间中搜索合适的超网络;SparseNAS 在所得超网络中评估包含 Transformer 架构的最佳稀疏单路径模型;最后,SparseEngine 高效地将搜索到的稀疏模型部署到 MCU 上进行推理。据我们所知,SparseEngine 是首个能够在 MCU 上对带 Transformer 的稀疏模型进行推理的部署框架。在 CIFAR-10 数据集上的评估结果表明,TinyFormer 能开发出准确率达 $96.1\%$ 的高效 Transformer,同时满足 $1$MB 存储和 $320$KB 内存的硬件限制。此外,与 CMSIS-NN 库相比,TinyFormer 在稀疏推理中实现了最高 $12.2\times$ 的显著加速。我们相信 TinyFormer 能将强大的 Transformer 带入 TinyML 场景,并大幅拓展深度学习应用的范围。

Epidemic Decision-making System Based Federated Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.01749
  • repo_url: None
  • paper_authors: Yangxi Zhou, Junping Du, Zhe Xue, Zhenhui Pan, Weikang Chen
  • for: 该论文旨在帮助政府统筹公共安全与经济发展,以应对公共卫生和安全突发事件。
  • methods: 该论文提出了一种基于联邦强化学习的方法,联合各省的疫情数据进行协作训练,在保护数据隐私的同时提高疫情决策的质量。
  • results: 实验结果显示,联邦强化学习在疫情决策中能获得比普通强化学习更优的性能和回报,并能加快客户端训练模型的收敛速度。此外,对比实验表明,A2C 是最适合疫情决策场景的强化学习模型,其次是 PPO,而 DDPG 的表现欠佳。
    Abstract Epidemic decision-making can effectively help the government to comprehensively consider public security and economic development when responding to public health and safety emergencies. Some studies have shown that reinforcement learning can effectively support epidemic decision-making, achieving a balance between health security and economic development. However, epidemic data often has the characteristics of limited samples and high privacy. The proposed model can combine the epidemic situation data of various provinces for cooperative training, serving as a federated reinforcement learning model for epidemic decision-making while protecting the privacy of the data. The experiments show that federated reinforcement learning obtains more optimized performance and returns than plain reinforcement learning, and also accelerates the training convergence speed of each client's model. At the same time, the experimental comparison shows that A2C is the most suitable reinforcement learning model for the epidemic decision-making scenario, followed by the PPO model, while the performance of DDPG is unsatisfactory.
    摘要 疫情决策可以有效帮助政府统筹公共安全与经济发展,以应对公共卫生和安全突发事件。已有研究表明,强化学习能够帮助政府进行疫情决策,从而在健康安全与经济发展之间取得平衡。然而,疫情数据往往具有样本有限、隐私性高的特点。为此,本文提出了一种基于联邦学习的疫情决策模型,可以联合各省的疫情数据进行协作训练,同时保护数据隐私。实验表明,联邦强化学习在模型性能与回报上优于普通强化学习,并能加快客户端训练模型的收敛速度。同时,对比实验表明,A2C 模型在疫情决策场景中表现最佳,其次是 PPO 模型,而 DDPG 模型的表现欠佳。

Global Optimization: A Machine Learning Approach

  • paper_url: http://arxiv.org/abs/2311.01742
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Dimitris Bertsimas, Georgios Margaritis
  • for: solves black-box global optimization problems with nonlinear constraints.
  • methods: uses hyperplane-based Decision-Trees and mixed integer optimization (MIO) approximation, with extensions to other ML models and adaptive sampling procedures.
  • results: shows improvements in solution feasibility and optimality in the majority of instances compared to BARON, with improved optimality gaps or solution times in 11 instances.
    Abstract Many approaches for addressing Global Optimization problems typically rely on relaxations of nonlinear constraints over specific mathematical primitives. This is restricting in applications with constraints that are black-box, implicit or consist of more general primitives. Trying to address such limitations, Bertsimas and Ozturk (2023) proposed OCTHaGOn as a way of solving black-box global optimization problems by approximating the nonlinear constraints using hyperplane-based Decision-Trees and then using those trees to construct a unified mixed integer optimization (MIO) approximation of the original problem. We provide extensions to this approach, by (i) approximating the original problem using other MIO-representable ML models besides Decision Trees, such as Gradient Boosted Trees, Multi Layer Perceptrons and Suport Vector Machines, (ii) proposing adaptive sampling procedures for more accurate machine learning-based constraint approximations, (iii) utilizing robust optimization to account for the uncertainty of the sample-dependent training of the ML models, and (iv) leveraging a family of relaxations to address the infeasibilities of the final MIO approximation. We then test the enhanced framework in 81 Global Optimization instances. We show improvements in solution feasibility and optimality in the majority of instances. We also compare against BARON, showing improved optimality gaps or solution times in 11 instances.
    摘要 许多求解全局优化问题的方法通常依赖于在特定数学原语上对非线性约束进行松弛。当约束是黑箱的、隐式的或由更一般的原语构成时,这种做法会受到限制。为应对这些局限,Bertsimas 和 Ozturk(2023)提出了 OCTHaGOn:先用基于超平面的决策树逼近非线性约束,再利用这些树构造原问题的统一混合整数优化(MIO)近似,从而求解黑箱全局优化问题。我们对该方法进行了扩展:(i)除决策树外,还使用其他可表示为 MIO 的机器学习模型(如梯度提升树、多层感知机和支持向量机)来逼近原问题;(ii)提出自适应采样程序,以获得更精确的基于机器学习的约束逼近;(iii)利用鲁棒优化来处理机器学习模型因依赖样本训练而带来的不确定性;(iv)借助一族松弛来处理最终 MIO 近似的不可行性。我们在 81 个全局优化实例上测试了增强后的框架,结果显示在大多数实例中解的可行性与最优性均有改进。与 BARON 的对比还表明,在 11 个实例中最优性间隙或求解时间得到了改善。
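
The first stage of such a pipeline, approximating a black-box constraint with a tree whose leaves become MIO-representable regions, can be sketched with an off-the-shelf (axis-aligned) decision tree; OCTHaGOn uses hyperplane-based trees and goes on to assemble the full MIO model, which we do not reproduce.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Sample the domain, query the black-box constraint, and fit a tree whose
# leaves partition the space into (approximately) feasible regions.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 2))
feasible = (X[:, 0] ** 2 + np.sin(3 * X[:, 1]) <= 1.0).astype(int)  # black-box g(x) <= 0

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, feasible)
print(export_text(tree, feature_names=["x1", "x2"]))  # leaf regions as rules
```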

CDGraph: Dual Conditional Social Graph Synthesizing via Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.01729
  • repo_url: None
  • paper_authors: Jui-Yi Tsai, Ya-Wen Teng, Ho Chiok Yew, De-Nian Yang, Lydia Y. Chen
  • for: 本研究旨在提出一种基于两个给定条件的社交网络条件扩散模型,用于生成满足条件的社交图。
  • methods: 该模型在去噪过程中引入双条件的共演化依赖,以捕捉两个条件之间的相互依赖;同时结合社交同质性与社交传染,在满足给定条件的前提下保持节点间的连通性;并通过一种新的分类器损失,利用双条件的相互依赖引导扩散过程的训练。
  • results: 在四个数据集上与四种现有图生成方法的比较评估显示,CDGraph 生成的社交图具有更高的双条件有效性,并在多种社交网络指标上表现出更低的差异。
    Abstract The social graphs synthesized by the generative models are increasingly in demand due to data scarcity and concerns over user privacy. One of the key performance criteria for generating social networks is the fidelity to specified conditionals, such as users with certain membership and financial status. While recent diffusion models have shown remarkable performance in generating images, their effectiveness in synthesizing graphs has not yet been explored in the context of conditional social graphs. In this paper, we propose the first kind of conditional diffusion model for social networks, CDGraph, which trains and synthesizes graphs based on two specified conditions. We propose the co-evolution dependency in the denoising process of CDGraph to capture the mutual dependencies between the dual conditions and further incorporate social homophily and social contagion to preserve the connectivity between nodes while satisfying the specified conditions. Moreover, we introduce a novel classifier loss, which guides the training of the diffusion process through the mutual dependency of dual conditions. We evaluate CDGraph against four existing graph generative methods, i.e., SPECTRE, GSM, EDGE, and DiGress, on four datasets. Our results show that the generated graphs from CDGraph achieve much higher dual-conditional validity and lower discrepancy in various social network metrics than the baselines, thus demonstrating its proficiency in generating dual-conditional social graphs.
    摘要 由于数据稀缺和用户隐私方面的顾虑,由生成模型合成的社交图的需求日益增长。生成社交网络的一项关键性能标准是对给定条件(例如具有特定会员身份和财务状况的用户)的忠实性。尽管扩散模型近来在图像生成方面表现出色,但其在条件社交图合成方面的有效性尚未被探索。本文提出了首个面向社交网络的条件扩散模型 CDGraph,它基于两个给定条件训练并合成图。我们在 CDGraph 的去噪过程中提出共演化依赖,以捕捉双条件之间的相互依赖,并进一步结合社交同质性与社交传染,在满足给定条件的同时保持节点间的连通性。此外,我们引入了一种新的分类器损失,通过双条件的相互依赖来引导扩散过程的训练。我们在四个数据集上将 CDGraph 与四种现有图生成方法(SPECTRE、GSM、EDGE 和 DiGress)进行了比较。结果表明,CDGraph 生成的图在多种社交网络指标上比基线具有更高的双条件有效性和更低的差异,从而证明了其生成双条件社交图的能力。

Heterogeneous federated collaborative filtering using FAIR: Federated Averaging in Random Subspaces

  • paper_url: http://arxiv.org/abs/2311.01722
  • repo_url: https://github.com/apd10/flcf
  • paper_authors: Aditya Desai, Benjamin Meisburger, Zichang Liu, Anshumali Shrivastava
  • for: 本文旨在数据隐私顾虑和 GDPR 等法规约束下,通过联邦学习在不把数据上传到中央服务器的情况下训练推荐系统(RS)模型。
  • methods: 本文采用联邦学习方法,特别针对嵌入表的训练与聚合:利用基于哈希的随机投影,实现设备容量感知的联邦平均,使内存容量各异的设备都能参与训练。
  • results: 论文在多个数据集上的实验表明,FAIR 能够汇聚并共享来自容量各异设备的信息,实现无缝协作;并证明了 FAIR 在同质设定、非独立同分布数据下的收敛性。
    Abstract Recommendation systems (RS) for items (e.g., movies, books) and ads are widely used to tailor content to users on various internet platforms. Traditionally, recommendation models are trained on a central server. However, due to rising concerns for data privacy and regulations like the GDPR, federated learning is an increasingly popular paradigm in which data never leaves the client device. Applying federated learning to recommendation models is non-trivial due to large embedding tables, which often exceed the memory constraints of most user devices. To include data from all devices in federated learning, we must enable collective training of embedding tables on devices with heterogeneous memory capacities. Current solutions to heterogeneous federated learning can only accommodate a small range of capacities and thus limit the number of devices that can participate in training. We present Federated Averaging in Random subspaces (FAIR), which allows arbitrary compression of embedding tables based on device capacity and ensures the participation of all devices in training. FAIR uses what we call consistent and collapsible subspaces defined by hashing-based random projections to jointly train large embedding tables while using varying amounts of compression on user devices. We evaluate FAIR on Neural Collaborative Filtering tasks with multiple datasets and verify that FAIR can gather and share information from a wide range of devices with varying capacities, allowing for seamless collaboration. We prove the convergence of FAIR in the homogeneous setting with non-i.i.d data distribution. Our code is open source at {https://github.com/apd10/FLCF}
    摘要 面向物品(如电影、书籍)和广告的推荐系统(RS)被广泛用于在各类互联网平台上为用户定制内容。传统上,推荐模型在中央服务器上训练。但出于对数据隐私的顾虑以及 GDPR 等法规的要求,联邦学习这种数据从不离开客户端设备的范式正日益流行。将联邦学习应用于推荐模型并非易事,因为庞大的嵌入表往往超出大多数用户设备的内存限制。要让所有设备的数据都参与联邦学习,就必须支持在内存容量各异的设备上对嵌入表进行集体训练。现有的异构联邦学习方案只能适配较小的容量范围,因而限制了能参与训练的设备数量。我们提出随机子空间中的联邦平均(FAIR),它可以依据设备容量对嵌入表进行任意压缩,并确保所有设备都能参与训练。FAIR 使用我们称为"一致且可折叠子空间"的、由基于哈希的随机投影定义的子空间,在各用户设备采用不同压缩程度的同时联合训练大型嵌入表。我们在多个数据集上的神经协同过滤任务中评估了 FAIR,验证其能够汇聚并共享来自容量各异设备的信息,实现无缝协作。我们还证明了 FAIR 在同质设定、非独立同分布数据下的收敛性。代码开源于 https://github.com/apd10/FLCF。
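
The core primitive is a seed-shared, hashing-based random projection that lets each device train a compressed view of an embedding row sized to its memory, while the server can lift updates back for averaging. The sign-hash map below is our illustrative stand-in for the paper's consistent, collapsible subspaces.

```python
import numpy as np

def hashed_projection(dim_full, dim_device, seed):
    """Sign-hash (count-sketch style) projection shared via the seed, so the
    server and a device agree on the subspace without exchanging it."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, dim_device, size=dim_full)   # bucket per coordinate
    sign = rng.choice([-1.0, 1.0], size=dim_full)
    P = np.zeros((dim_device, dim_full))
    P[idx, np.arange(dim_full)] = sign
    return P

D = 1024                            # full embedding-row width on the server
row = np.random.default_rng(1).normal(size=D)
for capacity in (256, 512):         # heterogeneous device memory budgets
    P = hashed_projection(D, capacity, seed=42)
    compressed = P @ row            # what the device stores and trains
    lifted = P.T @ compressed       # server-side lift before averaging
    print(capacity, round(np.corrcoef(row, lifted)[0, 1], 2))  # grows with capacity
```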

Physics-Informed Generator-Encoder Adversarial Networks with Latent Space Matching for Stochastic Differential Equations

  • paper_url: http://arxiv.org/abs/2311.01708
  • repo_url: None
  • paper_authors: Ruisong Gao, Min Yang, Jin Zhang
  • for: Solving forward, inverse, and mixed problems in stochastic differential equations, where the governing equations are known but the available data consist of only a limited set of snapshots of the system parameters.
  • methods: A new class of Physics-Informed Generator-Encoder Adversarial Networks (PIG-EA) whose generator and encoder are updated alternately by gradient descent, matching approximated solutions and real snapshots indirectly in a lower-dimensional latent feature space.
  • results: Numerical experiments show that the method solves different types of stochastic differential equations more accurately than existing neural network solvers and efficiently mitigates the training instability of previous adversarial frameworks.
    Abstract We propose a new class of physics-informed neural networks, called Physics-Informed Generator-Encoder Adversarial Networks, to effectively address the challenges posed by forward, inverse, and mixed problems in stochastic differential equations. In these scenarios, while the governing equations are known, the available data consist of only a limited set of snapshots for system parameters. Our model consists of two key components: the generator and the encoder, both updated alternately by gradient descent. In contrast to previous approaches of directly matching the approximated solutions with real snapshots, we employ an indirect matching that operates within the lower-dimensional latent feature space. This method circumvents challenges associated with high-dimensional inputs and complex data distributions, while yielding more accurate solutions compared to existing neural network solvers. In addition, the approach also mitigates the training instability issues encountered in previous adversarial frameworks in an efficient manner. Numerical results provide compelling evidence of the effectiveness of the proposed method in solving different types of stochastic differential equations.
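
A minimal sketch of the alternating generator/encoder updates with indirect matching in a lower-dimensional latent space, as the abstract describes. The network sizes, the feature-mean matching loss, and the `pde_residual` placeholder are all assumptions for illustration; the paper's actual losses and physics terms may differ.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: network sizes, the latent feature-matching loss,
# and the pde_residual placeholder are assumptions, not the paper's losses.

G = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 32))  # noise -> snapshot
E = nn.Sequential(nn.Linear(32, 64), nn.Tanh(), nn.Linear(64, 8))  # snapshot -> latent
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(E.parameters(), lr=1e-3)

def pde_residual(u):
    # placeholder physics term; a real solver would evaluate the governing
    # stochastic differential operator on the generated snapshot u
    return (u[:, 1:] - u[:, :-1]).pow(2).mean()

real = torch.randn(128, 32)  # stand-in for measured snapshots
for step in range(200):
    # generator step: match latent features of generated vs. real snapshots
    # (indirect matching) instead of the high-dimensional snapshots themselves
    fake = G(torch.randn(128, 8))
    loss_g = (E(fake).mean(0) - E(real).mean(0)).pow(2).mean() + pde_residual(fake)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # encoder step: adversarially sharpen the latent discrepancy
    fake = G(torch.randn(128, 8)).detach()
    loss_e = -(E(fake).mean(0) - E(real).mean(0)).pow(2).mean()
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()
```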

Adversarial Attacks on Cooperative Multi-agent Bandits

  • paper_url: http://arxiv.org/abs/2311.01698
  • repo_url: None
  • paper_authors: Jinhang Zuo, Zhiyao Zhang, Xuchuang Wang, Cheng Chen, Shuai Li, John C. S. Lui, Mohammad Hajiesmaili, Adam Wierman
  • for: Studies the latent vulnerabilities that collaboration exposes in cooperative multi-agent multi-armed bandits, via adversarial attacks on a few agents aimed at influencing the decisions of the rest.
  • methods: Attack strategies for both settings. In the homogeneous setting (all agents share one arm set), targeting just one agent convinces all agents to select a designated target arm $T-o(T)$ times over $T$ rounds at $o(T)$ attack cost. In the heterogeneous setting (agents have distinct arm sets), target-arm attacks are shown to require linear attack cost, and strategies are proposed that force a maximum number of agents to suffer linear regret at sublinear cost while manipulating only a few target agents' observations.
  • results: Numerical experiments validate the effectiveness of the proposed attack strategies.
    Abstract Cooperative multi-agent multi-armed bandits (CMA2B) consider the collaborative efforts of multiple agents in a shared multi-armed bandit game. We study latent vulnerabilities exposed by this collaboration and consider adversarial attacks on a few agents with the goal of influencing the decisions of the rest. More specifically, we study adversarial attacks on CMA2B in both homogeneous settings, where agents operate with the same arm set, and heterogeneous settings, where agents have distinct arm sets. In the homogeneous setting, we propose attack strategies that, by targeting just one agent, convince all agents to select a particular target arm $T-o(T)$ times while incurring $o(T)$ attack costs in $T$ rounds. In the heterogeneous setting, we prove that a target arm attack requires linear attack costs and propose attack strategies that can force a maximum number of agents to suffer linear regrets while incurring sublinear costs and only manipulating the observations of a few target agents. Numerical experiments validate the effectiveness of our proposed attack strategies.
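
To make the attack idea concrete, here is a single-agent sketch of the reward-poisoning primitive that target-arm attacks build on: whenever the learner pulls a non-target arm, the attacker depresses the observed reward so the target arm looks best. The multi-agent propagation that the paper analyzes is not reproduced here, and the attacker's knowledge of the target mean is a simplifying assumption.

```python
import numpy as np

# Illustrative single-agent sketch: the attacker's knowledge of the target
# mean and the 0.2 margin are simplifying assumptions.

rng = np.random.default_rng(0)
K, T, target = 5, 20_000, 3
mu = np.array([0.9, 0.8, 0.7, 0.3, 0.5])      # the target arm is suboptimal
counts, sums, cost = np.zeros(K), np.zeros(K), 0.0

for t in range(1, T + 1):
    if t <= K:                                 # UCB1: pull each arm once
        a = t - 1
    else:
        a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
    r = rng.normal(mu[a], 0.1)
    if a != target:
        # depress the non-target reward so its empirical mean stays below
        # the target's; the total perturbation (attack cost) grows slowly
        r_poisoned = min(r, mu[target] - 0.2)
        cost += r - r_poisoned
        r = r_poisoned
    counts[a] += 1
    sums[a] += r

print(f"target pulled {counts[target]:.0f}/{T} times, attack cost {cost:.1f}")
```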

Communication-Efficient Federated Non-Linear Bandit Optimization

  • paper_url: http://arxiv.org/abs/2311.01695
  • repo_url: None
  • paper_authors: Chuanhao Li, Chong Liu, Yu-Xiang Wang
  • for: Collaborative function optimization among multiple clients (e.g., mobile devices or organizations) under the coordination of a central server, preserving data privacy and enabling large-scale computation.
  • methods: A new algorithm, Fed-GO-UCB, for federated bandit optimization with generic non-linear objective functions.
  • results: Under some mild conditions, Fed-GO-UCB is rigorously proven to achieve sub-linear rates for both cumulative regret and communication cost; empirical evaluations also demonstrate its effectiveness.
    Abstract Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are distributed regression oracle and individual confidence set construction, which can be of independent interests. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm.

Amide Proton Transfer (APT) imaging in tumor with a machine learning approach using partially synthetic data

  • paper_url: http://arxiv.org/abs/2311.01683
  • repo_url: None
  • paper_authors: Malvika Viswanathan, Leqi Yin, Yashwant Kurmi, Zhongliang Zu
  • for: Quantifying the chemical exchange saturation transfer (CEST) effect with machine learning (ML) models.
  • methods: A new platform that combines simulated and measured components to generate partially synthetic CEST data, used to evaluate the feasibility of training ML models to predict the amide proton transfer (APT) effect.
  • results: Models trained on partially synthetic CEST data predict the APT effect accurately, and in experiments are more robust and accurate than models trained on in vivo data or fully simulated data.
    Abstract Machine learning (ML) has been increasingly used to quantify chemical exchange saturation transfer (CEST) effect. ML models are typically trained using either measured data or fully simulated data. However, training with measured data often lacks sufficient training data, while training with fully simulated data may introduce bias due to limited simulations pools. This study introduces a new platform that combines simulated and measured components to generate partially synthetic CEST data, and to evaluate its feasibility for training ML models to predict amide proton transfer (APT) effect. Partially synthetic CEST signals were created using an inverse summation of APT effects from simulations and the other components from measurements. Training data were generated by varying APT simulation parameters and applying scaling factors to adjust the measured components, achieving a balance between simulation flexibility and fidelity. First, tissue-mimicking CEST signals along with ground truth information were created using multiple-pool model simulations to validate this method. Second, an ML model was trained individually on partially synthetic data, in vivo data, and fully simulated data, to predict APT effect in rat brains bearing 9L tumors. Experiments on tissue-mimicking data suggest that the ML method using the partially synthetic data is accurate in predicting APT. In vivo experiments suggest that our method provides more accurate and robust prediction than the training using in vivo data and fully synthetic data. Partially synthetic CEST data can address the challenges in conventional ML methods.

Maximum Likelihood Estimation of Flexible Survival Densities with Importance Sampling

  • paper_url: http://arxiv.org/abs/2311.01660
  • repo_url: None
  • paper_authors: Mert Ketenci, Shreyas Bhave, Noémie Elhadad, Adler Perotte
  • for: A new survival-analysis method that reduces the number of hyperparameters practitioners must tune, including the number of mixture assignments in mixture-based models and the number and size of bins in discrete models.
  • methods: A flexible survival-density model fitted by maximum likelihood with importance sampling, which eliminates the mixture-assignment and bin-size hyperparameters and avoids the mode collapse and numerical instability observed in mixture-based models.
  • results: Experiments show the method matches or outperforms baseline methods while reducing the tuning time and effort required of practitioners.
    Abstract Survival analysis is a widely-used technique for analyzing time-to-event data in the presence of censoring. In recent years, numerous survival analysis methods have emerged which scale to large datasets and relax traditional assumptions such as proportional hazards. These models, while being performant, are very sensitive to model hyperparameters including: (1) number of bins and bin size for discrete models and (2) number of cluster assignments for mixture-based models. Each of these choices requires extensive tuning by practitioners to achieve optimal performance. In addition, we demonstrate in empirical studies that: (1) optimal bin size may drastically differ based on the metric of interest (e.g., concordance vs brier score), and (2) mixture models may suffer from mode collapse and numerical instability. We propose a survival analysis approach which eliminates the need to tune hyperparameters such as mixture assignments and bin sizes, reducing the burden on practitioners. We show that the proposed approach matches or outperforms baselines on several real-world datasets.
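
A minimal sketch of how importance sampling can enter a censored maximum-likelihood objective: event times contribute $\log f(t)$, censored times contribute $\log S(t)$, and $S(t)$ is estimated by importance sampling from a proposal. The lognormal density and exponential proposal are placeholders, not the paper's flexible density family.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Illustrative sketch: the lognormal density and exponential proposal are
# placeholders, not the paper's flexible density family.

rng = np.random.default_rng(1)
t_true = rng.lognormal(1.0, 0.5, 500)
censor = rng.exponential(6.0, 500)
event = t_true <= censor                  # True -> event observed
t = np.minimum(t_true, censor)

def neg_log_lik(params, n_mc=4096):
    mu, log_sigma = params
    f = stats.lognorm(s=np.exp(log_sigma), scale=np.exp(mu))
    q = stats.expon(scale=5.0)            # importance-sampling proposal
    u = q.rvs(size=n_mc, random_state=2)  # fixed seed -> common random numbers
    w = f.pdf(u) / q.pdf(u)
    # S(t_i) = E_q[ 1{u > t_i} f(u)/q(u) ], estimated by the weighted mean
    S = np.array([np.mean(w * (u > ti)) for ti in t[~event]])
    ll = f.logpdf(t[event]).sum() + np.log(np.clip(S, 1e-12, None)).sum()
    return -ll

res = minimize(neg_log_lik, x0=np.zeros(2), method="Nelder-Mead")
print("estimated mu, sigma:", res.x[0], np.exp(res.x[1]))
```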

Calibrate and Boost Logical Expressiveness of GNN Over Multi-Relational and Temporal Graphs

  • paper_url: http://arxiv.org/abs/2311.01647
  • repo_url: https://github.com/hdmmblz/multi-graph
  • paper_authors: Yeyuan Chen, Dingmin Wang
  • for: The logical expressiveness of Graph Neural Networks (GNNs) as Boolean node classifiers over multi-relational graphs, where each edge carries a specific relation type.
  • methods: Analysis of the R$^2$-GNN architecture, which extends local message-passing GNNs with a global readout, against the logic fragment $\mathcal{FOC}_2$.
  • results: R$^2$-GNNs fail to capture $\mathcal{FOC}_2$ classifiers in the general case but are equivalent to them under certain restricted yet reasonable scenarios; a simple linear-time graph transformation enables R$^2$-GNNs to capture any $\mathcal{FOC}_2$ classifier on the transformed graph, and R$^2$-GNN with the transformation outperforms baselines on both synthetic and real-world node classification datasets.
    Abstract As a powerful framework for graph representation learning, Graph Neural Networks (GNNs) have garnered significant attention in recent years. However, to the best of our knowledge, there has been no formal analysis of the logical expressiveness of GNNs as Boolean node classifiers over multi-relational graphs, where each edge carries a specific relation type. In this paper, we investigate $\mathcal{FOC}_2$, a fragment of first-order logic with two variables and counting quantifiers. On the negative side, we demonstrate that the R$^2$-GNN architecture, which extends the local message passing GNN by incorporating global readout, fails to capture $\mathcal{FOC}_2$ classifiers in the general case. Nevertheless, on the positive side, we establish that R$^2$-GNNs models are equivalent to $\mathcal{FOC}_2$ classifiers under certain restricted yet reasonable scenarios. To address the limitations of R$^2$-GNNs regarding expressiveness, we propose a simple graph transformation technique, akin to a preprocessing step, which can be executed in linear time. This transformation enables R$^2$-GNNs to effectively capture any $\mathcal{FOC}_2$ classifiers when applied to the "transformed" input graph. Moreover, we extend our analysis of expressiveness and graph transformation to temporal graphs, exploring several temporal GNN architectures and providing an expressiveness hierarchy for them. To validate our findings, we implement R$^2$-GNNs and the graph transformation technique and conduct empirical tests in node classification tasks against various well-known GNN architectures that support multi-relational or temporal graphs. Our experimental results consistently demonstrate that R$^2$-GNN with the graph transformation outperforms the baseline methods on both synthetic and real-world datasets
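
A minimal sketch of one R$^2$-GNN-style layer as described in the abstract: per-relation local message passing plus a global readout added to every node. The exact combination rule, dimensions, and nonlinearity are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sketch: the combination rule, dimensions, and nonlinearity
# are assumptions; only the "per-relation messages + global readout"
# structure follows the abstract.

class R2Layer(nn.Module):
    def __init__(self, dim, num_relations):
        super().__init__()
        self.self_lin = nn.Linear(dim, dim)
        self.rel_lins = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_relations)])
        self.readout_lin = nn.Linear(dim, dim)

    def forward(self, x, adj):  # x: [N, d]; adj: [R, N, N], one 0/1 matrix per relation
        out = self.self_lin(x)
        for r, lin in enumerate(self.rel_lins):
            out = out + lin(adj[r] @ x)               # relation-r neighbourhood messages
        out = out + self.readout_lin(x.sum(0, keepdim=True))  # global readout, broadcast
        return torch.relu(out)

x = torch.randn(6, 16)                     # 6 nodes, 16-dim features
adj = (torch.rand(3, 6, 6) > 0.7).float()  # 3 relation types
h = R2Layer(16, 3)(x, adj)                 # [6, 16]
```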

Should Under-parameterized Student Networks Copy or Average Teacher Weights?

  • paper_url: http://arxiv.org/abs/2311.01644
  • repo_url: None
  • paper_authors: Berfin Şimşek, Amire Bendjeddou, Wulfram Gerstner, Johanni Brea
  • for: Studying how an under-parameterized "student" network with $n$ neurons should fit a one-hidden-layer "teacher" network with $k > n$ neurons: should each student neuron copy a teacher neuron or average a group of them?
  • methods: Analysis of the critical points of the fitting problem for shallow networks with erf activation and standard Gaussian inputs, proving that "copy-average" configurations are critical points when the teacher's incoming vectors are orthonormal and its outgoing weights are unitary; for the $n=1$ student, closed-form solutions of the non-trivial critical points are derived for commonly used activations.
  • results: The optimum among copy-average configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th averages the remaining $k-n+1$ teacher neurons; gradient flow empirically converges either to this optimum or to a near-copy configuration, and similar results for ReLU suggest the optimal solution of under-parameterized networks has a universal structure.
    Abstract Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear, whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we provide additionally a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.
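
A small numerical sketch of the copy-average configuration under the stated assumptions (erf activation, standard Gaussian inputs, orthonormal incoming vectors, unit outgoing weights): $n-1$ student neurons copy teacher neurons, and the $n$-th carries the average of the remaining incoming vectors with outgoing weight $k-n+1$. The Monte Carlo loss estimate is for intuition only.

```python
import numpy as np
from scipy.special import erf

# Illustrative sketch: weights and sizes follow the stated assumptions
# (orthonormal incoming vectors, unit outgoing weights); the Monte Carlo
# loss estimate is for intuition only.

rng = np.random.default_rng(0)
d, k, n = 8, 6, 4
W_t = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal teacher vectors
a_t = np.ones(k)                                    # unit outgoing weights

def net(x, W, a):                                   # one hidden layer, erf activation
    return erf(x @ W) @ a

# copy-average student: n-1 neurons copy teacher neurons; the n-th neuron
# takes the average of the remaining k-n+1 incoming vectors and carries
# their combined outgoing weight
W_s = np.concatenate([W_t[:, : n - 1], W_t[:, n - 1 :].mean(1, keepdims=True)], axis=1)
a_s = np.concatenate([np.ones(n - 1), [k - n + 1]])

x = rng.standard_normal((200_000, d))               # standard Gaussian inputs
err = np.mean((net(x, W_t, a_t) - net(x, W_s, a_s)) ** 2)
print(f"copy-average L2 loss estimate: {err:.4f}")
```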

Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula

  • paper_url: http://arxiv.org/abs/2311.01642
  • repo_url: None
  • paper_authors: Aryaman Reddi, Maximilian Tölle, Jan Peters, Georgia Chalvatzaki, Carlo D’Eramo
  • for: Improving the robustness of Reinforcement Learning (RL) against adversarial attacks and distribution shifts.
  • methods: A novel entropy-regularized approach to adversarial RL that eases the underlying saddle-point optimization; its solution is a Quantal Response Equilibrium, whose bounded rationality can be freely modulated through a temperature coefficient, enabling a curriculum that gradually makes the adversary fully rational (QARL).
  • results: Extensive experiments on several MuJoCo locomotion and navigation problems show that QARL outperforms RARL and recent baselines in both overall performance and robustness.
    Abstract Robustness against adversarial attacks and distribution shifts is a long-standing goal of Reinforcement Learning (RL). To this end, Robust Adversarial Reinforcement Learning (RARL) trains a protagonist against destabilizing forces exercised by an adversary in a competitive zero-sum Markov game, whose optimal solution, i.e., rational strategy, corresponds to a Nash equilibrium. However, finding Nash equilibria requires facing complex saddle point optimization problems, which can be prohibitive to solve, especially for high-dimensional control. In this paper, we propose a novel approach for adversarial RL based on entropy regularization to ease the complexity of the saddle point optimization problem. We show that the solution of this entropy-regularized problem corresponds to a Quantal Response Equilibrium (QRE), a generalization of Nash equilibria that accounts for bounded rationality, i.e., agents sometimes play random actions instead of optimal ones. Crucially, the connection between the entropy-regularized objective and QRE enables free modulation of the rationality of the agents by simply tuning the temperature coefficient. We leverage this insight to propose our novel algorithm, Quantal Adversarial RL (QARL), which gradually increases the rationality of the adversary in a curriculum fashion until it is fully rational, easing the complexity of the optimization problem while retaining robustness. We provide extensive evidence of QARL outperforming RARL and recent baselines across several MuJoCo locomotion and navigation problems in overall performance and robustness.
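
The quantal-response idea is easy to state in code: the adversary plays a Boltzmann (softmax) policy over its action values, and a temperature curriculum anneals it from near-uniform (bounded rationality) to near-greedy (full rationality). The action values and schedule below are placeholders.

```python
import numpy as np

# Illustrative sketch: the adversary's action values and the temperature
# schedule are placeholders.

rng = np.random.default_rng(0)

def quantal_response(q_values, temperature):
    """Boltzmann policy: high temperature -> near-uniform play (bounded
    rationality); temperature -> 0 recovers the rational argmax."""
    logits = q_values / max(temperature, 1e-8)
    p = np.exp(logits - logits.max())
    return p / p.sum()

q_adv = np.array([1.0, 0.2, -0.5])  # adversary's action values (placeholder)
for step, tau in enumerate(np.geomspace(10.0, 0.01, 5)):  # curriculum: anneal tau
    p = quantal_response(q_adv, tau)
    a = rng.choice(len(q_adv), p=p)
    print(f"step {step}: tau={tau:6.2f}  policy={np.round(p, 2)}  action={a}")
```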

eess.IV - 2023-11-03

Quantitative Evaluation of a Multi-Modal Camera Setup for Fusing Event Data with RGB Images

  • paper_url: http://arxiv.org/abs/2311.01881
  • repo_url: None
  • paper_authors: Julian Moosmann, Jakub Mandula, Philipp Mayer, Luca Benini, Michele Magno
  • for: A multi-modal camera setup that fuses high-resolution DVS event data with RGB image data so both technologies can be exploited simultaneously.
  • methods: Several time-based synchronization methods for aligning DVS data with RGB images are analyzed, together with calibration accuracy, camera alignment, and lens impact.
  • results: The proposed system achieves an image calibration error below 0.90 px and a pixel cross-correlation deviation of 1.6 px; with an 8 mm focal-length lens, it can detect 30 cm objects at a distance of 350 m against a homogeneous background.
    Abstract Event-based cameras, also called silicon retinas, potentially revolutionize computer vision by detecting and reporting significant changes in intensity asynchronous events, offering extended dynamic range, low latency, and low power consumption, enabling a wide range of applications from autonomous driving to longtime surveillance. As an emerging technology, there is a notable scarcity of publicly available datasets for event-based systems that also feature frame-based cameras, in order to exploit the benefits of both technologies. This work quantitatively evaluates a multi-modal camera setup for fusing high-resolution DVS data with RGB image data by static camera alignment. The proposed setup, which is intended for semi-automatic DVS data labeling, combines two recently released Prophesee EVK4 DVS cameras and one global shutter XIMEA MQ022CG-CM RGB camera. After alignment, state-of-the-art object detection or segmentation networks label the image data by mapping boundary boxes or labeled pixels directly to the aligned events. To facilitate this process, various time-based synchronization methods for DVS data are analyzed, and calibration accuracy, camera alignment, and lens impact are evaluated. Experimental results demonstrate the benefits of the proposed system: the best synchronization method yields an image calibration error of less than 0.90px and a pixel cross-correlation deviation of1.6px, while a lens with 8mm focal length enables detection of objects with size 30cm at a distance of 350m against homogeneous background.

3-Dimensional residual neural architecture search for ultrasonic defect detection

  • paper_url: http://arxiv.org/abs/2311.01867
  • repo_url: None
  • paper_authors: Shaun McKnight, Christopher MacKinnon, S. Gareth Pierce, Ehsan Mohseni, Vedran Tunukovic, Charles N. MacLeod, Randika K. W. Vithanage, Tom O'Hare
  • for: Detecting defects in carbon fiber reinforced polymer composites with deep learning, applying 3-dimensional convolutional neural networks to volumetric ultrasonic testing data.
  • methods: A synthetic data generation method extended to volumetric data; preserving the complete volume reduces complex preprocessing and lets the network exploit the spatial and temporal information that is lost during imaging.
  • results: Three architectures were compared: a hand-designed network with cubed kernels, one with cuboidal kernels to account for the large aspect ratios, and a 3D residual network discovered by neural architecture search. Fully convolutional dimensionality reduction consistently outperformed max pooling, and domain-specific augmentation during training improved all architectures significantly (mean accuracy gains of 8.2% to 22.4%).
    Abstract This study presents a deep learning methodology using 3-dimensional (3D) convolutional neural networks to detect defects in carbon fiber reinforced polymer composites through volumetric ultrasonic testing data. Acquiring large amounts of ultrasonic training data experimentally is expensive and time-consuming. To address this issue, a synthetic data generation method was extended to incorporate volumetric data. By preserving the complete volumetric data, complex preprocessing is reduced, and the model can utilize spatial and temporal information that is lost during imaging. This enables the model to utilise important features that might be overlooked otherwise. The performance of three architectures were compared. The first two architectures were hand-designed to address the high aspect ratios between the spatial and temporal dimensions. The first architecture reduced dimensionality in the time domain and used cubed kernels for feature extraction. The second architecture used cuboidal kernels to account for the large aspect ratios. The evaluation included comparing the use of max pooling and convolutional layers for dimensionality reduction, with the fully convolutional layers consistently outperforming the models using max pooling. The third architecture was generated through neural architecture search from a modified 3D Residual Neural Network (ResNet) search space. Additionally, domain-specific augmentation methods were incorporated during training, resulting in significant improvements in model performance for all architectures. The mean accuracy improvements ranged from 8.2% to 22.4%. The best performing models achieved mean accuracies of 91.8%, 92.2%, and 100% for the reduction, constant, and discovered architectures, respectively. Whilst maintaining a model size smaller than most 2-dimensional (2D) ResNets.

Neural SPDE solver for uncertainty quantification in high-dimensional space-time dynamics

  • paper_url: http://arxiv.org/abs/2311.01783
  • repo_url: None
  • paper_authors: Maxime Beauchamp, Ronan Fablet, Hugo Georgenthum
  • for: Interpolation and data assimilation of large geophysical datasets.
  • methods: SPDE-based Gaussian processes estimate complex priors that handle non-stationary covariances in space and time, exploiting sparse precision matrices in the spirit of Optimal Interpolation (OI); a neural SPDE solver learns both the state and the SPDE parameters.
  • results: On realistic Sea Surface Height data, the solution improves on the OI baseline, aligns with a neural prior, and enables uncertainty quantification and online parameter estimation.
    Abstract Historically, the interpolation of large geophysical datasets has been tackled using methods like Optimal Interpolation (OI) or model-based data assimilation schemes. However, the recent connection between Stochastic Partial Differential Equations (SPDE) and Gaussian Markov Random Fields (GMRF) introduced a novel approach to handle large datasets making use of sparse precision matrices in OI. Recent advancements in deep learning also addressed this issue by incorporating data assimilation into neural architectures: it treats the reconstruction task as a joint learning problem involving both prior model and solver as neural networks. Though, it requires further developments to quantify the associated uncertainties. In our work, we leverage SPDEbased Gaussian Processes to estimate complex prior models capable of handling nonstationary covariances in space and time. We develop a specific architecture able to learn both state and SPDE parameters as a neural SPDE solver, while providing the precisionbased analytical form of the SPDE sampling. The latter is used as a surrogate model along the data assimilation window. Because the prior is stochastic, we can easily draw samples from it and condition the members by our neural solver, allowing flexible estimation of the posterior distribution based on large ensemble. We demonstrate this framework on realistic Sea Surface Height datasets. Our solution improves the OI baseline, aligns with neural prior while enabling uncertainty quantification and online parameter estimation.
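
The computational backbone of the SPDE/GMRF connection is that the prior's precision matrix is sparse, so drawing a sample reduces to a Cholesky solve. A 1-D toy sketch under assumed discretization and parameters (a dense Cholesky stands in for the sparse factorization used at scale):

```python
import numpy as np

# Illustrative 1-D sketch: grid size, kappa, and the discretization are
# assumptions; at scale Q is sparse and a sparse Cholesky factor is used.

n, kappa = 200, 0.3
A = kappa**2 * np.eye(n) + 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Q = A.T @ A                       # precision matrix of the SPDE-induced GMRF
L = np.linalg.cholesky(Q)         # Q = L L^T
z = np.random.default_rng(0).standard_normal(n)
x = np.linalg.solve(L.T, z)       # x ~ N(0, Q^{-1}): cov = L^{-T} L^{-1} = Q^{-1}
print(x[:5])
```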

eess.SP - 2023-11-03

HPC-based Solvers of Minimisation Problems for Signal Processing

  • paper_url: http://arxiv.org/abs/2311.02039
  • repo_url: None
  • paper_authors: Simone Cammarasana, Giuseppe Patanè
  • for: solves two minimization problems (approximation and denoising) with different constraints using high-performance computing.
  • methods: compares and analyzes different minimization methods in terms of functional computation, convergence, execution time, and scalability properties.
  • results: PRAXIS is the best optimizer in terms of minima computation, with an efficiency of 38% for approximation and 46% for denoising.
    Abstract Several physics and engineering applications involve the solution of a minimisation problem to compute an approximation of the input signal. Modern computing hardware and software apply high-performance computing to solve and considerably reduce the execution time. We compare and analyse different minimisation methods in terms of functional computation, convergence, execution time, and scalability properties, for the solution of two minimisation problems (i.e., approximation and denoising) with different constraints that involve computationally expensive operations. These problems are attractive due to their numerical and analytical properties, and our general analysis can be extended to most signal-processing problems. We perform our tests on the Cineca Marconi100 cluster, at the 26th position in the top500 list. Our experimental results show that PRAXIS is the best optimiser in terms of minima computation: the efficiency of the approximation is 38% with 256 processes, while the denoising has 46% with 32 processes.

Terahertz Communication Testbeds: Challenges and Opportunities

  • paper_url: http://arxiv.org/abs/2311.01972
  • repo_url: None
  • paper_authors: Eray Guven, Gunes Karabulut Kurt
  • for: An experimental software-defined radio (SDR) implementation at 180 GHz.
  • methods: Rate scarcity and frequency sparsity are discussed as hardware bottlenecks, the experimental challenges are explained, and a system model of the cascaded structure is derived.
  • results: The SDR-THz testbed reaches 3.2 Mbps with less than 1 degree of skew error; a reflector plate can fine-tune the frequency error and gain imbalance at the expense of at least 14.91 dB of signal-to-noise ratio. The results demonstrate the feasibility of SDR-based baseband signal generation for THz communication and reveal opportunities to overcome hardware limitations in experimental research.
    Abstract This study investigates an experimental software defined radio (SDR) implementation on 180 GHz. Rate scarcity and frequency sparsity are discussed as hardware bottlenecks. Experimental challenges are explained along with the derived system model of such a cascaded structure. Multiple error metrics for the terahertz (THz) signal are acquired, and various case scenarios are subsequently compared. The SDR-THz testbed reaches 3.2 Mbps with < 1 degree skew error. The use of a reflector plate can fine-tune the frequency error and gain imbalance in the expense of at least 14.91 dB signal-to-noise ratio. The results demonstrate the complete feasibility of SDR-based baseband signal generation in THz communication, revealing abundant opportunities to overcome hardware limitations in experimental research.

Reconfigurable Intelligent Surface & Edge – An Introduction of an EM manipulation structure on obstacles’ edge

  • paper_url: http://arxiv.org/abs/2311.01919
  • repo_url: None
  • paper_authors: Tianqi Xiang, Zhiwei Jiang, Weijun Hong, Xin Zhang, Yuehong Gao
  • for: Enhancing signal coverage in regions obstructed by obstacles.
  • methods: Reconfigurable Intelligent Surface & Edge (RISE) extends RIS reflection and refraction over surfaces to diffraction around obstacles' edges; several deployment locations and electromagnetic (EM) manipulation structure designs are analyzed for different coverage scenarios.
  • results: A novel EM manipulation structure deployed at the obstacles' edge achieves static EM environment modification; simulations validate the scenario-dependent preferences of the schemes, and the new structure achieves better coverage performance than other typical structures in the static scheme.
    Abstract Reconfigurable Intelligent Surface (RIS) or metasurface is one of the important enabling technologies in mobile cellular networks that can effectively enhance the signal coverage performance in obstructed regions, and it is generally deployed on surfaces different from obstacles to redirect electromagnetic (EM) waves by reflection, or covered on objects' surfaces to manipulate EM waves by refraction. In this paper, Reconfigurable Intelligent Surface & Edge (RISE) is proposed to extend RIS' abilities of reflection and refraction over surfaces to diffraction around obstacles' edge for better adaptation to specific coverage scenarios. Based on that, this paper analyzes the performance of several different deployment locations and EM manipulation structure designs for different coverage scenarios. Then a novel EM manipulation structure deployed at the obstacles' edge is proposed to achieve static EM environment modification. Simulations validate the preference of the schemes for different scenarios and the new structure achieves better coverage performance than other typical structures in the static scheme.

Random ISAC Signals Deserve Dedicated Precoding

  • paper_url: http://arxiv.org/abs/2311.01822
  • repo_url: None
  • paper_authors: Shihang Lu, Fan Liu, Fuwang Dong, Yifeng Xiong, Jie Xu, Ya-Feng Liu, Shi Jin
  • for: Analyzing the sensing and communication performance of multi-antenna systems that rely on random ISAC signaling for target sensing.
  • methods: A new sensing performance metric, the ergodic linear minimum mean square error (ELMMSE), characterizing the estimation error averaged over random ISAC signals; a data-dependent precoding (DDP) scheme minimizes the ELMMSE in sensing-only scenarios, a lower-overhead data-independent precoding (DIP) scheme is derived via stochastic gradient projection (SGP), and both are extended to ISAC via a tailored penalty-based alternating optimization algorithm.
  • results: Numerical results show that DDP and DIP achieve substantial performance gains over conventional ISAC signaling that treats the signal sample covariance matrix as deterministic, demonstrating that random ISAC signals deserve dedicated precoding designs.
    Abstract Radar systems typically employ well-designed deterministic signals for target sensing, while integrated sensing and communications (ISAC) systems have to adopt random signals to convey useful information. This paper analyzes the sensing and ISAC performance relying on random signaling in a multiantenna system. Towards this end, we define a new sensing performance metric, namely, ergodic linear minimum mean square error (ELMMSE), which characterizes the estimation error averaged over random ISAC signals. Then, we investigate a data-dependent precoding (DDP) scheme to minimize the ELMMSE in sensing-only scenarios, which attains the optimized performance at the cost of high implementation overhead. To reduce the cost, we present an alternative data-independent precoding (DIP) scheme by stochastic gradient projection (SGP). Moreover, we shed light on the optimal structures of both sensing-only DDP and DIP precoders. As a further step, we extend the proposed DDP and DIP approaches to ISAC scenarios, which are solved via a tailored penalty-based alternating optimization algorithm. Our numerical results demonstrate that the proposed DDP and DIP methods achieve substantial performance gains over conventional ISAC signaling schemes that treat the signal sample covariance matrix as deterministic, which proves that random ISAC signals deserve dedicated precoding designs.
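
A Monte Carlo sketch of an ELMMSE-style computation under an assumed toy model (Gaussian target response with identity prior, i.i.d. Gaussian ISAC symbols, and an equal-total-power deterministic benchmark); the paper's exact sensing model and precoders are not reproduced here. The gap between the averaged error and the deterministic pilot illustrates why random signals call for dedicated precoding.

```python
import numpy as np

# Illustrative toy model: h ~ CN(0, I) observed through random symbols X in
# AWGN of power sigma2; the identity prior, Gaussian symbols, and the
# equal-power deterministic benchmark are assumptions.

rng = np.random.default_rng(0)
Nt, L, sigma2 = 4, 32, 0.1        # antennas, symbols per frame, noise power

def lmmse_error(X):
    # LMMSE error-covariance trace: tr( (I + X X^H / sigma2)^(-1) )
    M = np.eye(Nt) + (X @ X.conj().T) / sigma2
    return np.trace(np.linalg.inv(M)).real

draws = [lmmse_error((rng.standard_normal((Nt, L)) +
                      1j * rng.standard_normal((Nt, L))) / np.sqrt(2))
         for _ in range(2000)]
elmmse = np.mean(draws)                            # average over random signals
det = lmmse_error(np.sqrt(L) * np.eye(Nt, L))      # orthogonal pilot, same total power
print(f"ELMMSE (random symbols): {elmmse:.4f}   deterministic benchmark: {det:.4f}")
```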

Carrier Frequency Offset Estimation for OCDM with Null Subchirps

  • paper_url: http://arxiv.org/abs/2311.01812
  • repo_url: None
  • paper_authors: Sidong Guo, Yiyin Wang, Xiaoli Ma
  • for: investigate the carrier frequency offset (CFO) identifiability problem in orthogonal chirp division multiplexing (OCDM) systems.
  • methods: propose a transmission scheme by inserting consecutive null subchirps, and develop a CFO estimator to achieve a full acquisition range.
  • results: demonstrate that the proposed transmission scheme not only helps to resolve CFO identifiability issues but also enables multipath diversity for OCDM systems; simulation results corroborate the theoretical findings.
    Abstract In this paper, we investigate the carrier frequency offset (CFO) identifiability problem in orthogonal chirp division multiplexing (OCDM) systems. We propose a transmission scheme by inserting consecutive null subchirps. A CFO estimator is accordingly developed to achieve a full acquisition range. We further demonstrate that the proposed transmission scheme not only help to resolve CFO identifiability issues but also enable multipath diversity for OCDM systems. Simulation results corroborate our theoretical findings.

Moving Target Sensing for ISAC Systems in Clutter Environment

  • paper_url: http://arxiv.org/abs/2311.01700
  • repo_url: None
  • paper_authors: Dongqi Luo, Huihui Wu, Hongliang Luo, Bo Lin, Feifei Gao
  • for: Moving-target sensing for integrated sensing and communication (ISAC) systems in clutter environments.
  • methods: A scanning beam searches for moving-target candidates; high-pass filtering in the Doppler domain suppresses the clutter within each echo so candidates can be identified from the filtered signal power, and root-MUSIC-based algorithms estimate the candidates' angles, ranges, and radial velocities, followed by a target detection algorithm that rejects false targets.
  • results: Simulation results validate the effectiveness of the proposed methods.
    Abstract In this paper, we consider the moving target sensing problem for integrated sensing and communication (ISAC) systems in clutter environment. Scatterers produce strong clutter, deteriorating the performance of ISAC systems in practice. Given that scatterers are typically stationary and the targets of interest are usually moving, we here focus on sensing the moving targets. Specifically, we adopt a scanning beam to search for moving target candidates. For the received signal in each scan, we employ high-pass filtering in the Doppler domain to suppress the clutter within the echo, thereby identifying candidate moving targets according to the power of filtered signal. Then, we adopt root-MUSIC-based algorithms to estimate the angle, range, and radial velocity of these candidate moving targets. Subsequently, we propose a target detection algorithm to reject false targets. Simulation results validate the effectiveness of these proposed methods.

Integrated Sensing and Communications in Clutter Environment

  • paper_url: http://arxiv.org/abs/2311.01674
  • repo_url: None
  • paper_authors: Hongliang Luo, Yucong Wang, Jianwei Zhao, Huihui Wu, Shaodan Ma, Feifei Gao
  • for: A practical integrated sensing and communications (ISAC) framework that senses dynamic targets in a clutter environment while guaranteeing users' communication quality.
  • methods: Multiple communication beams serve the users while one sensing beam rotates and scans the entire space; to limit interference with existing communication systems, the service area is divided into a sensing-beam-for-sensing (S4S) sector and a communications-beam-for-sensing (C4S) sector, with beamforming design and power-allocation optimization strategies for each.
  • results: Mean phasor cancellation (MPC) filters the static environmental clutter out of the received echoes; dynamic target detection and angle estimation are realized through angle-Doppler spectrum estimation (ADSE) and joint detection over multiple subcarriers (MSJD), while range and velocity estimation use an extended subspace algorithm. Simulations demonstrate the scheme's effectiveness and its superiority over existing methods that ignore environmental clutter.
    Abstract In this paper, we propose a practical integrated sensing and communications (ISAC) framework to sense dynamic targets from clutter environment while ensuring users communications quality. To implement communications function and sensing function simultaneously, we design multiple communications beams that can communicate with the users as well as one sensing beam that can rotate and scan the entire space. To minimize the interference of sensing beam on existing communications systems, we divide the service area into sensing beam for sensing (S4S) sector and communications beam for sensing (C4S) sector, and provide beamforming design and power allocation optimization strategies for each type sector. Unlike most existing ISAC studies that ignore the interference of static environmental clutter on target sensing, we construct a mixed sensing channel model that includes both static environment and dynamic targets. When base station receives the echo signals, the mean phasor cancellation (MPC) method is employed to filter out the interference from static environmental clutter and to extract the effective dynamic target echoes. Then a complete and practical dynamic target sensing scheme is designed to detect the presence of dynamic targets and to estimate their angles, distances, and velocities. In particular, dynamic target detection and angle estimation are realized through angle-Doppler spectrum estimation (ADSE) and joint detection over multiple subcarriers (MSJD), while distance and velocity estimation are realized through the extended subspace algorithm. Simulation results demonstrate the effectiveness of the proposed scheme and its superiority over the existing methods that ignore environmental clutter.
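
The mean phasor cancellation step is simple to illustrate: static clutter contributes a constant phasor across slow time while a moving target contributes a rotating (Doppler-shifted) one, so subtracting the slow-time mean suppresses the clutter and the target's Doppler line survives. All signal parameters below are illustrative.

```python
import numpy as np

# Illustrative parameters throughout.

rng = np.random.default_rng(0)
M = 128                                   # slow-time samples (pulses)
m = np.arange(M)
clutter = 5.0 * np.exp(1j * 0.7)          # static scatterers: constant phasor
fd = 0.12                                 # normalized Doppler of the moving target
target = 0.8 * np.exp(1j * 2 * np.pi * fd * m)
noise = 0.1 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
y = clutter + target + noise

y_mpc = y - y.mean()                      # mean phasor cancellation
peak = int(np.abs(np.fft.fft(y_mpc)).argmax())
print("residual mean power:", np.abs(y_mpc.mean())**2)   # ~0: clutter removed
print("dominant Doppler bin:", peak, "expected ~", round(fd * M))
```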

cs.CV - 2023-11-02

Idempotent Generative Network

  • paper_url: http://arxiv.org/abs/2311.01462
  • repo_url: None
  • paper_authors: Assaf Shocher, Amil Dravid, Yossi Gandelsman, Inbar Mosseri, Michael Rubinstein, Alexei A. Efros
  • for: A generative model based on training a neural network to be idempotent, which generates outputs in one step, maintains a consistent latent space, and allows sequential applications for refinement.
  • methods: The model $f$ is trained to map a source distribution (e.g., Gaussian noise) to a target distribution (e.g., realistic images) under two objectives: target instances map to themselves, $f(x)=x$, and the idempotence term $f(f(z))=f(z)$ pushes the range of $f$ onto the target manifold.
  • results: Under ideal assumptions the process provably converges to the target distribution; moreover, given inputs from either the target or the source distribution, the model adeptly projects corrupted or modified data back onto the target manifold, a first step toward a "global projector".
    Abstract We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely $f(f(z))=f(z)$. The proposed model $f$ is trained to map a source distribution (e.g, Gaussian noise) to a target distribution (e.g. realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely $f(x)=x$. We define the target manifold as the set of all instances that $f$ maps to themselves. (2) Instances that form the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, $f(f(z))=f(z)$ which encourages the range of $f(z)$ to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a ``global projector'' that enables projecting any input into a target data distribution.
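
A minimal sketch (not the authors' code) of the two objectives named in the abstract: real instances should be fixed points, $f(x)=x$, and generated outputs should satisfy idempotence, $f(f(z))=f(z)$. The toy data, network, detach placement, and unit loss weights are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch: toy 16-d "images", network size, detach placement,
# and unit loss weights are assumptions.

f = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

for step in range(500):
    x = torch.randn(64, 16) * 0.5 + 1.0   # stand-in for target-distribution data
    z = torch.randn(64, 16)               # source distribution (Gaussian noise)
    fz = f(z).detach()
    loss_rec = (f(x) - x).pow(2).mean()   # target instances are fixed points
    loss_idem = (f(fz) - fz).pow(2).mean()  # push the range of f onto the manifold
    loss = loss_rec + loss_idem
    opt.zero_grad(); loss.backward(); opt.step()

sample = f(torch.randn(1, 16))            # one-step generation
```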

Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization

  • paper_url: http://arxiv.org/abs/2311.01459
  • repo_url: https://github.com/jameelhassan/PromptAlign
  • paper_authors: Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
  • for: Addressing the key cause of prompt learning's performance degradation on unseen domains: distribution shift.
  • methods: A test-time prompt-tuning method that aligns the out-of-distribution test sample's feature statistics with those of the source data, using a single test sample to adapt multi-modal prompts at test time and bridge the gap in the test domain.
  • results: On the domain generalization benchmark, the method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% gain over the MaPLe baseline, and it improves consistently across all 10 datasets in cross-dataset generalization with unseen categories.
    Abstract The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption using prompt learning for numerous downstream tasks. Previous works have shown test-time prompt tuning using entropy minimization to adapt text prompts for unseen domains. While effective, this overlooks the key cause for performance degradation to unseen domains -- distribution shift. In this work, we explicitly handle this problem by aligning the out-of-distribution (OOD) test sample statistics to those of the source data using prompt tuning. We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain. Evaluating against the domain generalization benchmark, our method improves zero-shot top- 1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe. In cross-dataset generalization with unseen categories across 10 datasets, our method improves consistently across all datasets compared to the existing state-of-the-art. Our source code and models are available at https://jameelhassan.github.io/promptalign.
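
A minimal sketch of the alignment idea: token mean/variance statistics of a single test sample's features are pulled toward precomputed source statistics, alongside the usual entropy term, and only the prompt is updated at test time. The placeholder encoder, classifier, statistics, and single update step are assumptions, not the released implementation.

```python
import torch

# Illustrative sketch: the placeholder encoder/classifier, the 512-d stats,
# and the single update step are assumptions, not the released code.

src_mean, src_var = torch.zeros(512), torch.ones(512)  # precomputed source stats
prompt = torch.randn(4, 512, requires_grad=True)       # learnable prompt tokens
opt = torch.optim.SGD([prompt], lr=5e-3)

def encode(tokens, prompt):
    # placeholder: a real model would run CLIP conditioned on the prompt
    return tokens + prompt.mean(0)

test_tokens = torch.randn(197, 512)        # features of ONE test sample
classifier = torch.randn(512, 10)          # placeholder zero-shot classifier

feats = encode(test_tokens, prompt)
loss_align = ((feats.mean(0) - src_mean).abs().mean()
              + (feats.var(0) - src_var).abs().mean())   # distribution alignment
probs = (feats.mean(0, keepdim=True) @ classifier).softmax(-1)
loss_ent = -(probs * probs.log()).sum()                  # entropy minimization
(loss_align + loss_ent).backward()
opt.step()                                 # only the prompt is updated
```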

Detecting Deepfakes Without Seeing Any

  • paper_url: http://arxiv.org/abs/2311.01458
  • repo_url: https://github.com/talreiss/factor
  • paper_authors: Tal Reiss, Bar Cavia, Yedid Hoshen
  • for: Defending against deepfake attacks, the malicious manipulation of media containing people, which pose a serious threat to society.
  • methods: Introduces "fact checking", adapted from fake news detection, for detecting zero-day deepfakes: effective attacks must claim false facts about the identity, speech, motion, or appearance of the person, and since current generative techniques cannot perfectly synthesize those facts, the claims can be verified against the observed media.
  • results: FACTOR, a practical recipe for deepfake fact checking, demonstrates its power in critical attack settings such as face swapping and audio-visual synthesis. Although it is training-free, relies exclusively on off-the-shelf features, is easy to implement, and never sees any deepfakes, it achieves better-than-state-of-the-art accuracy.
    Abstract Deepfake attacks, malicious manipulation of media containing people, are a serious concern for society. Conventional deepfake detection methods train supervised classifiers to distinguish real media from previously encountered deepfakes. Such techniques can only detect deepfakes similar to those previously seen, but not zero-day (previously unseen) attack types. As current deepfake generation techniques are changing at a breathtaking pace, new attack types are proposed frequently, making this a major issue. Our main observations are that: i) in many effective deepfake attacks, the fake media must be accompanied by false facts i.e. claims about the identity, speech, motion, or appearance of the person. For instance, when impersonating Obama, the attacker explicitly or implicitly claims that the fake media show Obama; ii) current generative techniques cannot perfectly synthesize the false facts claimed by the attacker. We therefore introduce the concept of "fact checking", adapted from fake news detection, for detecting zero-day deepfake attacks. Fact checking verifies that the claimed facts (e.g. identity is Obama), agree with the observed media (e.g. is the face really Obama's?), and thus can differentiate between real and fake media. Consequently, we introduce FACTOR, a practical recipe for deepfake fact checking and demonstrate its power in critical attack settings: face swapping and audio-visual synthesis. Although it is training-free, relies exclusively on off-the-shelf features, is very easy to implement, and does not see any deepfakes, it achieves better than state-of-the-art accuracy.
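
A minimal sketch of the fact-checking recipe: embed the observed media and reference media of the claimed identity with any off-the-shelf encoder, score the claim by similarity, and flag anomalously low scores. The toy encoder, max-cosine scoring rule, and percentile threshold below are illustrative placeholders; note that no deepfakes are needed anywhere in the pipeline.

```python
import numpy as np

# Illustrative sketch: the tanh "encoder", the max-cosine score, and the
# percentile threshold are placeholders for off-the-shelf components.

rng = np.random.default_rng(0)

def embed(x):                      # stand-in for an off-the-shelf face encoder
    v = np.tanh(x)
    return v / np.linalg.norm(v)

claimed_refs = [rng.standard_normal(128) for _ in range(8)]  # real media of the claimed person
ref_vecs = np.stack([embed(r) for r in claimed_refs])

def truth_score(media):
    return float(np.max(ref_vecs @ embed(media)))  # how well the media supports the claim

# calibrate a threshold on real media only: training-free and deepfake-free
real_scores = [truth_score(r + 0.05 * rng.standard_normal(128)) for r in claimed_refs]
thresh = np.percentile(real_scores, 5)

suspect = rng.standard_normal(128)         # media claiming to show the same person
print("fake" if truth_score(suspect) < thresh else "consistent with claim")
```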

UltraLiDAR: Learning Compact Representations for LiDAR Completion and Generation

  • paper_url: http://arxiv.org/abs/2311.01448
  • repo_url: None
  • paper_authors: Yuwen Xiong, Wei-Chiu Ma, Jingkang Wang, Raquel Urtasun
  • for: Improving the density and coverage of LiDAR point clouds, since dense LiDARs are expensive and low-beam LiDAR produces sparse clouds, to boost downstream perception for self-driving.
  • methods: UltraLiDAR, a data-driven framework for scene-level LiDAR completion, generation, and manipulation, built on a compact, discrete representation that encodes the point cloud's geometric structure, is robust to noise, and is easy to manipulate.
  • results: Aligning a sparse point cloud's representation with that of a dense one densifies it as if it were captured by a real high-density LiDAR, significantly improving downstream perception, and a prior learned over the discrete codebook generates diverse, realistic LiDAR point clouds that human participants prefer over previous methods more than 98.5% of the time in A/B tests.
    Abstract LiDAR provides accurate geometric measurements of the 3D world. Unfortunately, dense LiDARs are very expensive and the point clouds captured by low-beam LiDAR are often sparse. To address these issues, we present UltraLiDAR, a data-driven framework for scene-level LiDAR completion, LiDAR generation, and LiDAR manipulation. The crux of UltraLiDAR is a compact, discrete representation that encodes the point cloud's geometric structure, is robust to noise, and is easy to manipulate. We show that by aligning the representation of a sparse point cloud to that of a dense point cloud, we can densify the sparse point clouds as if they were captured by a real high-density LiDAR, drastically reducing the cost. Furthermore, by learning a prior over the discrete codebook, we can generate diverse, realistic LiDAR point clouds for self-driving. We evaluate the effectiveness of UltraLiDAR on sparse-to-dense LiDAR completion and LiDAR generation. Experiments show that densifying real-world point clouds with our approach can significantly improve the performance of downstream perception systems. Compared to prior art on LiDAR generation, our approach generates much more realistic point clouds. According to A/B test, over 98.5\% of the time human participants prefer our results over those of previous methods.

CADSim: Robust and Scalable in-the-wild 3D Reconstruction for Controllable Sensor Simulation

  • paper_url: http://arxiv.org/abs/2311.01447
  • repo_url: None
  • paper_authors: Jingkang Wang, Sivabalan Manivasagam, Yun Chen, Ze Yang, Ioan Andrei Bârsan, Anqi Joyce Yang, Wei-Chiu Ma, Raquel Urtasun
  • for: Realistic simulation for self-driving vehicles, specifically sensor simulation and the reconstruction of vehicle geometry from in-the-wild sensor data.
  • methods: CADSim combines part-aware object-class priors, via a small set of CAD models, with differentiable rendering to automatically reconstruct vehicle geometry, including articulated wheels, with high-quality appearance.
  • results: CADSim recovers more accurate shapes from sparse data than existing approaches while training and rendering efficiently; the reconstructed vehicles are demonstrated in several applications, including accurate testing of autonomy perception systems.
    Abstract Realistic simulation is key to enabling safe and scalable development of self-driving vehicles. A core component is simulating the sensors so that the entire autonomy system can be tested in simulation. Sensor simulation involves modeling traffic participants, such as vehicles, with high quality appearance and articulated geometry, and rendering them in real time. The self-driving industry has typically employed artists to build these assets. However, this is expensive, slow, and may not reflect reality. Instead, reconstructing assets automatically from sensor data collected in the wild would provide a better path to generating a diverse and large set with good real-world coverage. Nevertheless, current reconstruction approaches struggle on in-the-wild sensor data, due to its sparsity and noise. To tackle these issues, we present CADSim, which combines part-aware object-class priors via a small set of CAD models with differentiable rendering to automatically reconstruct vehicle geometry, including articulated wheels, with high-quality appearance. Our experiments show our method recovers more accurate shapes from sparse data compared to existing approaches. Importantly, it also trains and renders efficiently. We demonstrate our reconstructed vehicles in several applications, including accurate testing of autonomy perception systems.

Adv3D: Generating Safety-Critical 3D Objects through Closed-Loop Simulation

  • paper_url: http://arxiv.org/abs/2311.01446
  • repo_url: None
  • paper_authors: Jay Sarva, Jingkang Wang, James Tu, Yuwen Xiong, Sivabalan Manivasagam, Raquel Urtasun
  • for: Rigorously testing self-driving vehicles (SDVs) across a wide range of scenarios to ensure safe deployment.
  • methods: A framework named Adv3D that performs closed-loop sensor simulation on real-world scenarios to evaluate full autonomy performance, optimizing a low-dimensional shape representation to modify vehicle shapes realistically.
  • results: The framework finds scene appearance variations that degrade autonomy performance, and shows that shape variations found in the closed-loop interactive setting are much more effective than those found in open-loop.
    Abstract Self-driving vehicles (SDVs) must be rigorously tested on a wide range of scenarios to ensure safe deployment. The industry typically relies on closed-loop simulation to evaluate how the SDV interacts on a corpus of synthetic and real scenarios and verify it performs properly. However, they primarily only test the system's motion planning module, and only consider behavior variations. It is key to evaluate the full autonomy system in closed-loop, and to understand how variations in sensor data based on scene appearance, such as the shape of actors, affect system performance. In this paper, we propose a framework, Adv3D, that takes real world scenarios and performs closed-loop sensor simulation to evaluate autonomy performance, and finds vehicle shapes that make the scenario more challenging, resulting in autonomy failures and uncomfortable SDV maneuvers. Unlike prior works that add contrived adversarial shapes to vehicle roof-tops or roadside to harm perception only, we optimize a low-dimensional shape representation to modify the vehicle shape itself in a realistic manner to degrade autonomy performance (e.g., perception, prediction, and motion planning). Moreover, we find that the shape variations found with Adv3D optimized in closed-loop are much more effective than those in open-loop, demonstrating the importance of finding scene appearance variations that affect autonomy in the interactive setting.
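Adv3D searches a low-dimensional shape space for geometries that degrade autonomy metrics measured in closed loop. The sketch below shows only the outer search loop, using a simple random-perturbation hill climber; `closed_loop_cost` is a hypothetical stand-in for running the simulator plus autonomy stack, and the paper's actual optimizer and shape parameterization may differ.

```python
import numpy as np

def closed_loop_cost(z: np.ndarray) -> float:
    """Hypothetical stand-in for a closed-loop rollout with shape latent z;
    returns an autonomy-performance score (lower = worse for the SDV)."""
    return float(np.cos(z).sum())  # dummy objective for illustration only

rng = np.random.default_rng(0)
z = np.zeros(8)                       # low-dimensional shape representation
best = closed_loop_cost(z)

for _ in range(100):                  # gradient-free search over shapes
    cand = np.clip(z + 0.1 * rng.standard_normal(z.shape), -1.0, 1.0)
    cost = closed_loop_cost(cand)
    if cost < best:                   # adversarial: keep shapes that hurt the SDV
        z, best = cand, cost
```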

LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds

  • paper_url: http://arxiv.org/abs/2311.01444
  • repo_url: None
  • paper_authors: Anqi Joyce Yang, Sergio Casas, Nikita Dvornik, Sean Segal, Yuwen Xiong, Jordan Sir Kwang Hu, Carter Fang, Raquel Urtasun
  • for: A simple, efficient, and effective trajectory-level refinement approach for auto-labelling, to improve the training of self-driving perception systems.
  • methods: Auto-labels are typically generated in two stages: objects are detected and tracked, then each object trajectory is passed to a learned refinement model. LabelFormer first encodes each frame's observations separately, then exploits self-attention to reason about the trajectory with full temporal context, and finally decodes the refined object size and per-frame poses.
  • results: LabelFormer outperforms existing methods by a large margin on both urban and highway datasets, and training on a dataset augmented with its auto-labels improves downstream detection performance. Details at https://waabi.ai/labelformer
    Abstract A major bottleneck to scaling-up training of self-driving perception systems are the human annotations required for supervision. A promising alternative is to leverage "auto-labelling" offboard perception models that are trained to automatically generate annotations from raw LiDAR point clouds at a fraction of the cost. Auto-labels are most commonly generated via a two-stage approach -- first objects are detected and tracked over time, and then each object trajectory is passed to a learned refinement model to improve accuracy. Since existing refinement models are overly complex and lack advanced temporal reasoning capabilities, in this work we propose LabelFormer, a simple, efficient, and effective trajectory-level refinement approach. Our approach first encodes each frame's observations separately, then exploits self-attention to reason about the trajectory with full temporal context, and finally decodes the refined object size and per-frame poses. Evaluation on both urban and highway datasets demonstrates that LabelFormer outperforms existing works by a large margin. Finally, we show that training on a dataset augmented with auto-labels generated by our method leads to improved downstream detection performance compared to existing methods. Please visit the project website for details https://waabi.ai/labelformer
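The encode-per-frame / attend-over-trajectory / decode structure described above maps naturally onto a standard Transformer encoder. A minimal PyTorch sketch with hypothetical dimensions and head counts (the paper's exact architecture is not reproduced here):

```python
import torch
import torch.nn as nn

class TrajectoryRefiner(nn.Module):
    """Sketch of the encode-per-frame / attend-over-trajectory / decode pattern."""
    def __init__(self, obs_dim=64, d_model=128, nhead=4, nlayers=2):
        super().__init__()
        self.frame_encoder = nn.Linear(obs_dim, d_model)       # per-frame encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, nlayers)  # full temporal context
        self.size_head = nn.Linear(d_model, 3)                 # one refined l, w, h per track
        self.pose_head = nn.Linear(d_model, 3)                 # refined x, y, yaw per frame

    def forward(self, obs):                      # obs: (batch, frames, obs_dim)
        tokens = self.frame_encoder(obs)
        ctx = self.temporal(tokens)              # self-attention across the trajectory
        size = self.size_head(ctx.mean(dim=1))   # pooled: a single size per object
        poses = self.pose_head(ctx)              # per-frame pose refinements
        return size, poses

model = TrajectoryRefiner()
size, poses = model(torch.randn(2, 30, 64))      # 2 tracks, 30 frames each
```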

Transformation Decoupling Strategy based on Screw Theory for Deterministic Point Cloud Registration with Gravity Prior

  • paper_url: http://arxiv.org/abs/2311.01432
  • repo_url: None
  • paper_authors: Xinyi Li, Zijian Ma, Yinlong Liu, Walter Zimmer, Hu Cao, Feihu Zhang, Alois Knoll
  • for: Robust correspondence-based point cloud registration under heavy outliers, in the practically common setting where a gravity prior is available from IMUs, reducing the rotation DOF from 3 to 1.
  • methods: A transformation decoupling strategy based on screw theory that splits the original 4-DOF problem into three sub-problems with 1, 2, and 1 DOF, improving computational efficiency: an interval stabbing-based method for the translation along the rotation axis, a branch-and-bound method for the pole (an auxiliary variable in screw theory), and a global voting method for the rotation angle.
  • results: Efficient and deterministic registration, robust even to outlier rates exceeding 99%; extensive experiments show the method is more efficient and robust than state-of-the-art approaches.
    Abstract Point cloud registration is challenging in the presence of heavy outlier correspondences. This paper focuses on addressing the robust correspondence-based registration problem with gravity prior that often arises in practice. The gravity directions are typically obtained by inertial measurement units (IMUs) and can reduce the degree of freedom (DOF) of rotation from 3 to 1. We propose a novel transformation decoupling strategy by leveraging screw theory. This strategy decomposes the original 4-DOF problem into three sub-problems with 1-DOF, 2-DOF, and 1-DOF, respectively, thereby enhancing the computation efficiency. Specifically, the first 1-DOF represents the translation along the rotation axis and we propose an interval stabbing-based method to solve it. The second 2-DOF represents the pole which is an auxiliary variable in screw theory and we utilize a branch-and-bound method to solve it. The last 1-DOF represents the rotation angle and we propose a global voting method for its estimation. The proposed method sequentially solves three consensus maximization sub-problems, leading to efficient and deterministic registration. In particular, it can even handle the correspondence-free registration problem due to its significant robustness. Extensive experiments on both synthetic and real-world datasets demonstrate that our method is more efficient and robust than state-of-the-art methods, even when dealing with outlier rates exceeding 99%.
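The first 1-DOF sub-problem is solved by interval stabbing: each correspondence yields an interval of axis translations consistent with it, and the consensus-maximizing translation is the point that stabs the most intervals. A minimal sketch of that generic primitive (how the intervals are constructed from correspondences is paper-specific and omitted):

```python
def interval_stabbing(intervals):
    """Return (t, count): a point t that stabs the maximum number of intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, 1))    # interval opens
        events.append((hi, -1))   # interval closes
    # Sort by coordinate; opens before closes at ties so touching endpoints count.
    events.sort(key=lambda e: (e[0], -e[1]))
    best_t, best_count, count = None, 0, 0
    for t, delta in events:
        count += delta
        if count > best_count:
            best_count, best_t = count, t
    return best_t, best_count

t, inliers = interval_stabbing([(0.0, 2.0), (1.0, 3.0), (1.5, 2.5), (5.0, 6.0)])
# t == 1.5: three of the four intervals agree on this translation
```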

Efficient Vision Transformer for Accurate Traffic Sign Detection

  • paper_url: http://arxiv.org/abs/2311.01429
  • repo_url: None
  • paper_authors: Javad Mirzapour Kaleybar, Hooman Khaloo, Avaz Naghipour
  • for: Traffic sign detection in self-driving vehicles and driver assistance systems, where reliable and highly accurate algorithms are crucial for widespread adoption in diverse real-life scenarios.
  • methods: Vision Transformer variants applied to the traffic sign detection task; the Transformer's attention mechanism, originally designed for natural language processing, offers improved parallel efficiency. A novel strategy integrates a locality inductive bias with a transformer module via an Efficient Convolution Block and a Local Transformer Block.
  • results: Experimental evaluations show significant advances in both speed and accuracy on the GTSDB dataset.
    Abstract This research paper addresses the challenges associated with traffic sign detection in self-driving vehicles and driver assistance systems. The development of reliable and highly accurate algorithms is crucial for the widespread adoption of traffic sign recognition and detection (TSRD) in diverse real-life scenarios. However, this task is complicated by suboptimal traffic images affected by factors such as camera movement, adverse weather conditions, and inadequate lighting. This study specifically focuses on traffic sign detection methods and introduces the application of the Transformer model, particularly the Vision Transformer variants, to tackle this task. The Transformer's attention mechanism, originally designed for natural language processing, offers improved parallel efficiency. Vision Transformers have demonstrated success in various domains, including autonomous driving, object detection, healthcare, and defense-related applications. To enhance the efficiency of the Transformer model, the research proposes a novel strategy that integrates a locality inductive bias and a transformer module. This includes the introduction of the Efficient Convolution Block and the Local Transformer Block, which effectively capture short-term and long-term dependency information, thereby improving both detection speed and accuracy. Experimental evaluations demonstrate the significant advancements achieved by this approach, particularly when applied to the GTSDB dataset.
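The exact designs of the Efficient Convolution Block and Local Transformer Block are not specified here; the sketch below shows the generic conv-plus-attention hybrid pattern such designs build on: a depthwise convolution captures short-range (local) dependencies, then self-attention captures long-range context.

```python
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    """Sketch: depthwise conv for locality + self-attention for global context."""
    def __init__(self, dim=64, nhead=4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local bias
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.conv(x)                   # short-range dependencies
        b, c, h, w = x.shape
        t = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C) tokens
        a, _ = self.attn(t, t, t)                     # long-range dependencies
        t = t + a
        return t.transpose(1, 2).reshape(b, c, h, w)

y = ConvAttnBlock()(torch.randn(2, 64, 16, 16))
```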

Exploring Deep Learning Techniques for Glaucoma Detection: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2311.01425
  • repo_url: None
  • paper_authors: Aized Amin Soofi, Fazal-e-Amin
  • for: A comprehensive review of deep learning methods for the detection and diagnosis of glaucoma.
  • methods: Analysis of deep learning-based segmentation, classification, and detection techniques used to improve the accuracy and efficiency of glaucoma detection.
  • results: The literature shows that deep learning methods perform well for glaucoma detection, improving accuracy, efficiency, reproducibility, and reliability; however, limitations and challenges remain that call for further research and improvement.
    Abstract Glaucoma is one of the primary causes of vision loss around the world, necessitating accurate and efficient detection methods. Traditional manual detection approaches have limitations in terms of cost, time, and subjectivity. Recent developments in deep learning approaches demonstrate potential in automating glaucoma detection by detecting relevant features from retinal fundus images. This article provides a comprehensive overview of cutting-edge deep learning methods used for the segmentation, classification, and detection of glaucoma. By analyzing recent studies, the effectiveness and limitations of these techniques are evaluated, key findings are highlighted, and potential areas for further research are identified. The use of deep learning algorithms may significantly improve the efficacy, usefulness, and accuracy of glaucoma detection. The findings from this research contribute to the ongoing advancements in automated glaucoma detection and have implications for improving patient outcomes and reducing the global burden of glaucoma.

CenterRadarNet: Joint 3D Object Detection and Tracking Framework using 4D FMCW Radar

  • paper_url: http://arxiv.org/abs/2311.01423
  • repo_url: None
  • paper_authors: Jen-Hao Cheng, Sheng-Yao Kuan, Hugo Latapie, Gaowen Liu, Jenq-Neng Hwang
  • for: Improving the safety of autonomous and assisted driving by strengthening the robustness and reliability of radar-based perception.
  • methods: CenterRadarNet, an efficient joint architecture for high-resolution representation learning from 4D (Doppler-range-azimuth-elevation) FMCW radar data for 3D object detection and re-identification (re-ID); as a single-stage detector it directly infers BEV object confidence maps, 3D bounding box attributes, and per-pixel appearance embeddings, and an online tracker uses the learned embeddings for re-ID.
  • results: State-of-the-art results on the K-Radar 3D object detection benchmark and the first radar-based 3D object-tracking result on the K-Radar dataset V2, with consistent, robust performance across diverse driving scenarios.
    Abstract Robust perception is a vital component for ensuring safe autonomous and assisted driving. Automotive radar (77 to 81 GHz), which offers weather-resilient sensing, provides a complementary capability to the vision- or LiDAR-based autonomous driving systems. Raw radio-frequency (RF) radar tensors contain rich spatiotemporal semantics besides 3D location information. The majority of previous methods take in 3D (Doppler-range-azimuth) RF radar tensors, allowing prediction of an object's location, heading angle, and size in bird's-eye-view (BEV). However, they lack the ability to at the same time infer objects' size, orientation, and identity in the 3D space. To overcome this limitation, we propose an efficient joint architecture called CenterRadarNet, designed to facilitate high-resolution representation learning from 4D (Doppler-range-azimuth-elevation) radar data for 3D object detection and re-identification (re-ID) tasks. As a single-stage 3D object detector, CenterRadarNet directly infers the BEV object distribution confidence maps, corresponding 3D bounding box attributes, and appearance embedding for each pixel. Moreover, we build an online tracker utilizing the learned appearance embedding for re-ID. CenterRadarNet achieves the state-of-the-art result on the K-Radar 3D object detection benchmark. In addition, we present the first 3D object-tracking result using radar on the K-Radar dataset V2. In diverse driving scenarios, CenterRadarNet shows consistent, robust performance, emphasizing its wide applicability.
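Center-based single-stage detectors of this kind typically decode detections by finding local maxima in the predicted BEV confidence maps. A common max-pool-based peak extraction, sketched in PyTorch (the paper's exact decoding is not specified here):

```python
import torch
import torch.nn.functional as F

def decode_peaks(heatmap, k=50):
    """heatmap: (B, C, H, W) class confidence maps. Returns top-k peaks."""
    # A cell is a peak if it equals the max of its 3x3 neighbourhood.
    pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap)            # suppress non-maxima
    b, c, h, w = peaks.shape
    scores, idx = peaks.view(b, -1).topk(k)          # flatten classes and cells
    cls = idx // (h * w)
    ys, xs = (idx % (h * w)) // w, (idx % (h * w)) % w
    return scores, cls, ys, xs

scores, cls, ys, xs = decode_peaks(torch.rand(1, 3, 128, 128))
```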

The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing

  • paper_url: http://arxiv.org/abs/2311.01410
  • repo_url: None
  • paper_authors: Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, Chongxuan Li
  • for: A unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the marginal distribution induced by the original SDE or ODE.
  • methods: Corresponding editing SDEs and ODEs, including SDE counterparts for widely used ODE baselines, plus SDE-Drag, a simple yet effective SDE-based method for point-based content dragging.
  • results: Across tasks including inpainting and image-to-image translation, SDE shows a consistent and substantial advantage over ODE baselines; a user study on the new DragBench benchmark shows SDE-Drag significantly outperforms the ODE baseline, existing diffusion-based methods, and DragGAN.
    Abstract We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.
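The SDE/ODE distinction at the heart of the paper can be seen in a single Euler step of reverse-time sampling. The sketch below uses the standard VP formulation from Song et al., not the paper's task-specific editing processes:

```python
import torch

def reverse_step(x, t, dt, score_fn, beta_fn, use_sde=True):
    """One Euler step of reverse-time VP diffusion (dt < 0).
    ODE: dx = [f - (1/2) g^2 * score] dt                    (deterministic)
    SDE: dx = [f -       g^2 * score] dt + g sqrt(|dt|) z   (stochastic)"""
    beta = beta_fn(t)
    f = -0.5 * beta * x                 # VP drift
    g2 = beta                           # squared diffusion coefficient
    score = score_fn(x, t)
    if use_sde:
        noise = torch.randn_like(x)
        return x + (f - g2 * score) * dt + (g2 ** 0.5) * abs(dt) ** 0.5 * noise
    return x + (f - 0.5 * g2 * score) * dt

# Toy usage: the score of a standard normal is -x.
x = torch.randn(4, 3)
x = reverse_step(x, t=0.5, dt=-0.01, score_fn=lambda x, t: -x,
                 beta_fn=lambda t: 0.1 + t, use_sde=True)
```

The injected noise in the SDE branch is what the paper argues gradually contracts the Kullback-Leibler divergence to the target marginal, while the ODE branch preserves it.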

Learning to See Physical Properties with Active Sensing Motor Policies

  • paper_url: http://arxiv.org/abs/2311.01405
  • repo_url: None
  • paper_authors: Gabriel B. Margolis, Xiang Fu, Yandong Ji, Pulkit Agrawal
  • for: Helping robots plan locomotion more effectively by inferring terrain physical properties from color images.
  • methods: Self-supervised labeling of images captured during real-world traversal, using physical-parameter estimators trained in simulation; Active Sensing Motor Policies (ASMP) are additionally trained to explore locomotion behaviors that increase the accuracy of physical parameter estimation.
  • results: The visual system accurately predicts physical parameters, and the trained system is robust: it works even on overhead drone imagery despite being trained on data from cameras on a quadruped walking on the ground.
    Abstract Knowledge of terrain's physical properties inferred from color images can aid in making efficient robotic locomotion plans. However, unlike image classification, it is unintuitive for humans to label image patches with physical properties. Without labeled data, building a vision system that takes as input the observed terrain and predicts physical properties remains challenging. We present a method that overcomes this challenge by self-supervised labeling of images captured by robots during real-world traversal with physical property estimators trained in simulation. To ensure accurate labeling, we introduce Active Sensing Motor Policies (ASMP), which are trained to explore locomotion behaviors that increase the accuracy of estimating physical parameters. For instance, the quadruped robot learns to swipe its foot against the ground to estimate the friction coefficient accurately. We show that the visual system trained with a small amount of real-world traversal data accurately predicts physical parameters. The trained system is robust and works even with overhead images captured by a drone despite being trained on data collected by cameras attached to a quadruped robot walking on the ground.

Learning Realistic Traffic Agents in Closed-loop

  • paper_url: http://arxiv.org/abs/2311.01394
  • repo_url: None
  • paper_authors: Chris Zhang, James Tu, Lunjun Zhang, Kelvin Wong, Simon Suo, Raquel Urtasun
  • for: Developing safe and scalable self-driving software by learning realistic traffic agents for closed-loop simulation prior to real-world deployment.
  • methods: Reinforcing Traffic Rules (RTR), a holistic closed-loop learning objective that matches expert demonstrations under a traffic-compliance constraint, naturally yielding a joint imitation learning (IL) + reinforcement learning (RL) approach; training covers both nominal real-world scenarios and procedurally generated long-tail scenarios.
  • results: RTR learns more realistic and generalizable traffic simulation policies, with significantly better trade-offs between human-like driving and traffic compliance in both nominal and long-tail scenarios; used as a data generation tool, it considerably improves downstream prediction metrics over baseline traffic agents.
    Abstract Realistic traffic simulation is crucial for developing self-driving software in a safe and scalable manner prior to real-world deployment. Typically, imitation learning (IL) is used to learn human-like traffic agents directly from real-world observations collected offline, but without explicit specification of traffic rules, agents trained from IL alone frequently display unrealistic infractions like collisions and driving off the road. This problem is exacerbated in out-of-distribution and long-tail scenarios. On the other hand, reinforcement learning (RL) can train traffic agents to avoid infractions, but using RL alone results in unhuman-like driving behaviors. We propose Reinforcing Traffic Rules (RTR), a holistic closed-loop learning objective to match expert demonstrations under a traffic compliance constraint, which naturally gives rise to a joint IL + RL approach, obtaining the best of both worlds. Our method learns in closed-loop simulations of both nominal scenarios from real-world datasets as well as procedurally generated long-tail scenarios. Our experiments show that RTR learns more realistic and generalizable traffic simulation policies, achieving significantly better tradeoffs between human-like driving and traffic compliance in both nominal and long-tail scenarios. Moreover, when used as a data generation tool for training prediction models, our learned traffic policy leads to considerably improved downstream prediction metrics compared to baseline traffic agents. For more information, visit the project website: https://waabi.ai/rtr
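One common way to realize a "match demonstrations under a compliance constraint" objective is a weighted sum of an imitation term and an infraction penalty. The sketch below is illustrative of that pattern only; the weighting and penalty form are not the paper's exact formulation:

```python
import torch

def rtr_style_loss(policy_actions, expert_actions, infraction_cost, lam=1.0):
    """Imitation + traffic-compliance penalty.
    policy_actions, expert_actions: (B, T, A) rollout vs demonstration actions.
    infraction_cost: (B,) per-rollout infraction measure from closed-loop sim."""
    imitation = (policy_actions - expert_actions).pow(2).mean()   # IL term
    compliance = infraction_cost.mean()                           # RL-style penalty
    return imitation + lam * compliance

loss = rtr_style_loss(torch.randn(8, 20, 2), torch.randn(8, 20, 2), torch.rand(8))
```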

Sim2Real Bilevel Adaptation for Object Surface Classification using Vision-Based Tactile Sensors

  • paper_url: http://arxiv.org/abs/2311.01380
  • repo_url: https://github.com/hsp-iit/sim2real-surface-classification
  • paper_authors: Gabriele M. Caddeo, Andrea Maracani, Paolo D. Alfano, Nicola A. Piga, Lorenzo Rosasco, Lorenzo Natale
  • for: bridging the Sim2Real gap in vision-based tactile sensors for classifying object surfaces
  • methods: training a Diffusion Model using a small dataset of real-world images and aligning features of the two domains using an adversarial procedure
  • results: a total accuracy of 81.9%, a significant improvement compared to the 34.7% achieved by the classifier trained solely on simulated images
    Abstract In this paper, we address the Sim2Real gap in the field of vision-based tactile sensors for classifying object surfaces. We train a Diffusion Model to bridge this gap using a relatively small dataset of real-world images randomly collected from unlabeled everyday objects via the DIGIT sensor. Subsequently, we employ a simulator to generate images by uniformly sampling the surface of objects from the YCB Model Set. These simulated images are then translated into the real domain using the Diffusion Model and automatically labeled to train a classifier. During this training, we further align features of the two domains using an adversarial procedure. Our evaluation is conducted on a dataset of tactile images obtained from a set of ten 3D printed YCB objects. The results reveal a total accuracy of 81.9%, a significant improvement compared to the 34.7% achieved by the classifier trained solely on simulated images. This demonstrates the effectiveness of our approach. We further validate our approach using the classifier on a 6D object pose estimation task from tactile data.
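The paper aligns features of the simulated and real domains with an adversarial procedure. A standard gradient-reversal formulation (in the style of Ganin & Lempitsky) is sketched below; it may differ from the paper's exact alignment scheme:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

encoder = nn.Sequential(nn.Linear(1024, 128), nn.ReLU())   # shared feature extractor
domain_clf = nn.Linear(128, 2)                              # sim vs real discriminator
loss_fn = nn.CrossEntropyLoss()

sim_imgs = torch.randn(16, 1024)     # stand-ins for flattened tactile images
real_imgs = torch.randn(16, 1024)
feats = encoder(torch.cat([sim_imgs, real_imgs]))
# The discriminator learns to separate domains; the reversed gradient trains
# the encoder to make simulated and real features indistinguishable.
logits = domain_clf(GradReverse.apply(feats))
domains = torch.cat([torch.zeros(16), torch.ones(16)]).long()
loss_fn(logits, domains).backward()
```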

Robust Identity Perceptual Watermark Against Deepfake Face Swapping

  • paper_url: http://arxiv.org/abs/2311.01357
  • repo_url: None
  • paper_authors: Tianyi Wang, Mengxiao Huang, Harry Cheng, Bin Ma, Yinglong Wang
  • for: Addressing the privacy issues caused by Deepfake face swapping.
  • methods: Proactively embedding invisible identity-perceptual watermarks for detection and source tracing, with an unpredictable and unreversible chaotic encryption system for watermark confidentiality, and joint encoder-decoder training under adversarial image manipulations.
  • results: State-of-the-art detection and source-tracing performance against Deepfake face swapping under both cross-dataset and cross-manipulation settings.
    Abstract Notwithstanding offering convenience and entertainment to society, Deepfake face swapping has caused critical privacy issues with the rapid development of deep generative models. Due to imperceptible artifacts in high-quality synthetic images, passive detection models against face swapping in recent years usually suffer performance damping regarding the generalizability issue. Therefore, several studies have been attempted to proactively protect the original images against malicious manipulations by inserting invisible signals in advance. However, the existing proactive defense approaches demonstrate unsatisfactory results with respect to visual quality, detection accuracy, and source tracing ability. In this study, we propose the first robust identity perceptual watermarking framework that concurrently performs detection and source tracing against Deepfake face swapping proactively. We assign identity semantics regarding the image contents to the watermarks and devise an unpredictable and unreversible chaotic encryption system to ensure watermark confidentiality. The watermarks are encoded and recovered by jointly training an encoder-decoder framework along with adversarial image manipulations. Extensive experiments demonstrate state-of-the-art performance against Deepfake face swapping under both cross-dataset and cross-manipulation settings.

Deep learning based Image Compression for Microscopy Images: An Empirical Study

  • paper_url: http://arxiv.org/abs/2311.01352
  • repo_url: None
  • paper_authors: Yu Zhou, Jan Sollman, Jianxu Chen
  • for: An empirical study of classic and deep learning-based image compression methods for microscopy images, and of their impact on downstream deep learning-based image processing models.
  • methods: Multiple classical lossy compression techniques are compared with several AI-based compression models provided by and trained with the CompressAI toolbox, using compression ratio, multiple image similarity measures, and downstream prediction accuracy as criteria.
  • results: AI-based compression largely outperforms the classic techniques and minimally affects the downstream label-free prediction task in 2D cases.
    Abstract With the fast development of modern microscopes and bioimaging techniques, an unprecedentedly large amount of imaging data are being generated, stored, analyzed, and even shared through networks. The size of the data poses great challenges for current data infrastructure. One common way to reduce the data size is by image compression. This present study analyzes classic and deep learning based image compression methods, and their impact on deep learning based image processing models. Deep learning based label-free prediction models (i.e., predicting fluorescent images from bright field images) are used as an example application for comparison and analysis. Effective image compression methods could help reduce the data size significantly without losing necessary information, and therefore reduce the burden on data management infrastructure and permit fast transmission through the network for data sharing or cloud computing. To compress images in such a wanted way, multiple classical lossy image compression techniques are compared to several AI-based compression models provided by and trained with the CompressAI toolbox using python. These different compression techniques are compared in compression ratio, multiple image similarity measures and, most importantly, the prediction accuracy from label-free models on compressed images. We found that AI-based compression techniques largely outperform the classic ones and will minimally affect the downstream label-free task in 2D cases. In the end, we hope the present study could shed light on the potential of deep learning based image compression and the impact of image compression on downstream deep learning based image analysis models.
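The study's comparison loop reduces to: encode at several settings, decode, and measure compression ratio plus a similarity metric. A minimal sketch using JPEG as the classical baseline (CompressAI-based models would slot into the same loop, and the paper's actual metrics and data differ):

```python
import io
import numpy as np
from PIL import Image

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")

img = Image.fromarray(np.random.randint(0, 256, (256, 256), dtype=np.uint8))
raw_bytes = 256 * 256  # 8-bit grayscale, uncompressed

for quality in (90, 50, 10):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    decoded = np.asarray(Image.open(io.BytesIO(buf.getvalue())))
    ratio = raw_bytes / len(buf.getvalue())
    print(f"q={quality}: ratio={ratio:.1f}x  PSNR={psnr(np.asarray(img), decoded):.1f} dB")
```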

Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly

  • paper_url: http://arxiv.org/abs/2311.01323
  • repo_url: None
  • paper_authors: Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen
  • for: A standardized benchmark for evaluating transfer-based attacks on black-box DNN models systematically, fairly, and practically.
  • methods: TA-Bench, a transfer-based attack benchmark implementing 30+ methods, covering a variety of attack modes and evaluation protocols.
  • results: A comprehensive evaluation on 25 popular substitute/victim models on ImageNet, yielding new insights into the effectiveness of these methods and guidelines for future evaluations.
    Abstract The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, there still lacks a standardized benchmark that could be taken advantage of to compare these methods systematically, fairly, and practically. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, to avoid, for example, unfair comparison and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 25 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided. Code at: https://github.com/qizhangli/TA-Bench.
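Most transfer-based attacks extend iterative FGSM run on a white-box substitute model; the crafted examples are then evaluated on held-out black-box victims. A minimal I-FGSM sketch of that shared pattern (not any specific benchmarked method):

```python
import torch
import torch.nn as nn

def ifgsm(substitute: nn.Module, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Craft L_inf-bounded adversarial examples on a white-box substitute."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(substitute(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()          # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project to the eps-ball
            x_adv = x_adv.clamp(0, 1)                    # stay a valid image
    return x_adv.detach()

net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x_adv = ifgsm(net, torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,)))
```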

Hybrid-Fusion Transformer for Multisequence MRI

  • paper_url: http://arxiv.org/abs/2311.01308
  • repo_url: None
  • paper_authors: Jihoon Cho, Jinah Park
  • for: Improving the accuracy of multisequence MRI image segmentation.
  • methods: A hybrid fusion transformer (HFTrans) that exploits the different characteristics of multimodal MRI sequences, using Transformer layers to integrate the features extracted from each modality as well as the features of the early-fused modalities.
  • results: Experiments on the public BraTS2020 and MRBrainS18 datasets show the hybrid-fusion method outperforms previous state-of-the-art methods on 3D brain tumor segmentation and brain structure segmentation.
    Abstract Medical segmentation has grown exponentially through the advent of a fully convolutional network (FCN), and we have now reached a turning point through the success of Transformer. However, the different characteristics of the modality have not been fully integrated into Transformer for medical segmentation. In this work, we propose the novel hybrid fusion Transformer (HFTrans) for multisequence MRI image segmentation. We take advantage of the differences among multimodal MRI sequences and utilize the Transformer layers to integrate the features extracted from each modality as well as the features of the early fused modalities. We validate the effectiveness of our hybrid-fusion method in three-dimensional (3D) medical segmentation. Experiments on two public datasets, BraTS2020 and MRBrainS18, show that the proposed method outperforms previous state-of-the-art methods on the task of brain tumor segmentation and brain structure segmentation.

DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning

  • paper_url: http://arxiv.org/abs/2311.01295
  • repo_url: https://github.com/wenxuan-bao/dp-mix
  • paper_authors: Wenxuan Bao, Francesco Pittaluga, Vijay Kumar B G, Vincent Bindschaedler
  • for: Improving the generalization of computer vision models trained with differential privacy, especially when training data is limited.
  • methods: Two data augmentation techniques designed for the constraints of differentially private learning: DP-Mix_Self, which performs mixup on self-augmented data, and DP-Mix_Diff, which additionally incorporates synthetic data from a pre-trained diffusion model into the mixup process.
  • results: SoTA classification performance across a range of datasets and settings.
    Abstract Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter's built-in assumption that each training image's contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at https://github.com/wenxuan-Bao/DP-Mix.
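Standard mixup combines different training examples, which breaks the per-example sensitivity bound DP-SGD relies on; DP-Mix_Self instead mixes augmentations of the same example. A hedged sketch of that idea (the augmentations and mixing distribution here are illustrative, not the paper's exact choices):

```python
import torch

def dp_mix_self(x, num_augs=4, alpha=0.2):
    """Mix several augmentations of the SAME image, so the contribution of
    each training example stays bounded (unlike cross-example mixup)."""
    augs = []
    for _ in range(num_augs):
        a = torch.flip(x, dims=[-1]) if torch.rand(()) < 0.5 else x  # simple aug
        a = a + 0.01 * torch.randn_like(a)                           # light noise
        augs.append(a)
    w = torch.distributions.Dirichlet(alpha * torch.ones(num_augs)).sample()
    return sum(wi * ai for wi, ai in zip(w, augs))

mixed = dp_mix_self(torch.rand(3, 32, 32))
```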

Joint 3D Shape and Motion Estimation from Rolling Shutter Light-Field Images

  • paper_url: http://arxiv.org/abs/2311.01292
  • repo_url: None
  • paper_authors: Hermes McGriff, Renato Martins, Nicolas Andreff, Cédric Demonceaux
  • for: Addresses the problem of 3D reconstruction of scenes from a single image captured by a light-field camera equipped with a rolling shutter sensor.
  • methods: Leverages the 3D information cues present in the light-field and the motion information provided by the rolling shutter effect, with a generic model for the imaging process and a two-stage algorithm that minimizes the re-projection error.
  • results: Provides an instantaneous 3D shape-and-pose-and-velocity sensing paradigm, with a new benchmark dataset and several experiments conducted for different scenes and types of motions to demonstrate the effectiveness and advantages of the approach.Here is the same information in Traditional Chinese:
  • for: Addresses the problem of 3D reconstruction of scenes from a single image captured by a light-field camera equipped with a rolling shutter sensor.
  • methods: Leverages the 3D information cues present in the light-field and the motion information provided by the rolling shutter effect, with a generic model for the imaging process and a two-stage algorithm that minimizes the re-projection error.
  • results: Provides an instantaneous 3D shape-and-pose-and-velocity sensing paradigm, with a new benchmark dataset and several experiments conducted for different scenes and types of motions to demonstrate the effectiveness and advantages of the approach.
    Abstract In this paper, we propose an approach to address the problem of 3D reconstruction of scenes from a single image captured by a light-field camera equipped with a rolling shutter sensor. Our method leverages the 3D information cues present in the light-field and the motion information provided by the rolling shutter effect. We present a generic model for the imaging process of this sensor and a two-stage algorithm that minimizes the re-projection error while considering the position and motion of the camera in a motion-shape bundle adjustment estimation strategy. Thereby, we provide an instantaneous 3D shape-and-pose-and-velocity sensing paradigm. To the best of our knowledge, this is the first study to leverage this type of sensor for this purpose. We also present a new benchmark dataset composed of different light-fields showing rolling shutter effects, which can be used as a common base to improve the evaluation and tracking the progress in the field. We demonstrate the effectiveness and advantages of our approach through several experiments conducted for different scenes and types of motions. The source code and dataset are publicly available at: https://github.com/ICB-Vision-AI/RSLF

Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition

  • paper_url: http://arxiv.org/abs/2311.01283
  • repo_url: None
  • paper_authors: Hamid Ahmadabadi, Omid Nejati Manzari, Ahmad Ayatollahi
  • for: Improving the performance and efficiency of human action recognition through knowledge distillation and the combination of CNN and ViT models.
  • methods: A Transformer vision network serves as the student model and a convolutional network (ConvNeXt) as the teacher; the teacher extracts local image features while the student focuses on global features via an attention mechanism. Advanced ViT variants, including PVT, ConViT, MViT, Swin Transformer, and Twins, are also discussed.
  • results: On the Stanford 40 dataset, student models trained with knowledge distillation achieve significantly better accuracy and mAP than those trained conventionally, showing the benefit of combining local and global features for action recognition.
    Abstract This paper presents a study on improving human action recognition through the utilization of knowledge distillation, and the combination of CNN and ViT models. The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models. The proposed method employs a Transformer vision network as the student model, while a convolutional network serves as the teacher model. The teacher model extracts local image features, whereas the student model focuses on global features using an attention mechanism. The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images. Additionally, advanced variants of ViT, namely PVT, Convit, MVIT, Swin Transformer, and Twins, are discussed, highlighting their contributions to computer vision tasks. The ConvNeXt model is introduced as a teacher model, known for its efficiency and effectiveness in computer vision. The paper presents performance results for human action recognition on the Stanford 40 dataset, comparing the accuracy and mAP of student models trained with and without knowledge distillation. The findings illustrate that the suggested approach significantly improves the accuracy and mAP when compared to training networks under regular settings. These findings emphasize the potential of combining local and global features in action recognition tasks.
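The paper does not spell out its distillation loss here; the standard soft-target formulation (Hinton et al.) is the usual starting point for transferring teacher knowledge to a student, sketched below:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KD: KL divergence between temperature-scaled teacher and
    student distributions, blended with the usual hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # scale restores gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 40), torch.randn(8, 40),
                         torch.randint(0, 40, (8,)))
```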

Exploring Deep Learning Image Super-Resolution for Iris Recognition

  • paper_url: http://arxiv.org/abs/2311.01241
  • repo_url: None
  • paper_authors: Eduardo Ribeiro, Andreas Uhl, Fernando Alonso-Fernandez, Reuben A. Farrugia
  • for: Testing the ability of deep learning methods to provide an end-to-end mapping between low- and high-resolution iris images.
  • methods: Two deep learning single-image super-resolution approaches, Stacked Auto-Encoders (SAE) and Convolutional Neural Networks (CNN), with the lightest possible structure to achieve fast speed, preserve local information, and reduce artifacts at the same time.
  • results: Quality assessment and recognition experiments on a database of 1,872 near-infrared iris images show the superiority of the deep learning approaches over the compared algorithms.
    Abstract In this work we test the ability of deep learning methods to provide an end-to-end mapping between low and high resolution images, applying it to the iris recognition problem. Here, we propose the use of two deep learning single-image super-resolution approaches: Stacked Auto-Encoders (SAE) and Convolutional Neural Networks (CNN), with the most lightweight structure possible to achieve fast speed, preserve local information, and reduce artifacts at the same time. We validate the methods with a database of 1,872 near-infrared iris images, with quality assessment and recognition experiments showing the superiority of deep learning approaches over the compared algorithms.

Log-Likelihood Score Level Fusion for Improved Cross-Sensor Smartphone Periocular Recognition

  • paper_url: http://arxiv.org/abs/2311.01237
  • repo_url: None
  • paper_authors: Fernando Alonso-Fernandez, Kiran B. Raja, Christoph Busch, Josef Bigun
  • for: Improving the comparability of data from different smartphone cameras and the recognition rates when images from heterogeneous devices are compared.
  • methods: Fusion of several comparators based on linear logistic regression, mapping same-sensor and cross-sensor score distributions to a common probabilistic domain in which fused scores tend to be log-likelihood ratios.
  • results: Cross-sensor periocular performance improves, with a reduction in cross-sensor EER of up to 40%, and Bayes thresholds can be used for optimal decision-making without sensor-specific thresholds.
    Abstract The proliferation of cameras and personal devices results in a wide variability of imaging conditions, producing large intra-class variations and a significant performance drop when images from heterogeneous environments are compared. However, many applications require to deal with data from different sources regularly, thus needing to overcome these interoperability problems. Here, we employ fusion of several comparators to improve periocular performance when images from different smartphones are compared. We use a probabilistic fusion framework based on linear logistic regression, in which fused scores tend to be log-likelihood ratios, obtaining a reduction in cross-sensor EER of up to 40% due to the fusion. Our framework also provides an elegant and simple solution to handle signals from different devices, since same-sensor and cross-sensor score distributions are aligned and mapped to a common probabilistic domain. This allows the use of Bayes thresholds for optimal decision-making, eliminating the need of sensor-specific thresholds, which is essential in operational conditions because the threshold setting critically determines the accuracy of the authentication process in many applications.
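The fusion framework is linear logistic regression calibration; a minimal sketch with synthetic comparator scores (the data, priors, and comparator count are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy comparator scores: rows = comparison trials, columns = comparators.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, size=(500, 3))
impostor = rng.normal(0.0, 1.0, size=(500, 3))
X = np.vstack([genuine, impostor])
y = np.r_[np.ones(500), np.zeros(500)]

# Linear logistic regression fusion: the fused score s = w.x + b behaves
# like a calibrated log-likelihood ratio in a common probabilistic domain.
fuser = LogisticRegression().fit(X, y)
fused_llr = X @ fuser.coef_.ravel() + fuser.intercept_[0]

# With calibrated LLRs, a Bayes threshold derived from priors and costs
# applies directly, e.g. threshold 0 for equal priors and costs.
decisions = fused_llr > 0.0
```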

Robust Feature Learning and Global Variance-Driven Classifier Alignment for Long-Tail Class Incremental Learning

  • paper_url: http://arxiv.org/abs/2311.01227
  • repo_url: https://github.com/JAYATEJAK/GVAlign
  • paper_authors: Jayateja Kalla, Soma Biswas
  • for: Enhancing long-tail class incremental learning, enabling the model to progressively learn new classes while mitigating catastrophic forgetting under long-tailed data distributions.
  • methods: A two-stage framework: the first stage combines traditional class-incremental learning losses with mixup classes to learn robust feature representations; the second stage leverages global variance as an informative measure and class prototypes to achieve classifier alignment, capturing class properties without data balancing or additional layer tuning.
  • results: Extensive experiments on the CIFAR-100 and ImageNet-Subset datasets demonstrate superiority over state-of-the-art techniques across various long-tail class incremental learning settings.
    Abstract This paper introduces a two-stage framework designed to enhance long-tail class incremental learning, enabling the model to progressively learn new classes, while mitigating catastrophic forgetting in the context of long-tailed data distributions. Addressing the challenge posed by the under-representation of tail classes in long-tail class incremental learning, our approach achieves classifier alignment by leveraging global variance as an informative measure and class prototypes in the second stage. This process effectively captures class properties and eliminates the need for data balancing or additional layer tuning. Alongside traditional class incremental learning losses in the first stage, the proposed approach incorporates mixup classes to learn robust feature representations, ensuring smoother boundaries. The proposed framework can seamlessly integrate as a module with any class incremental learning method to effectively handle long-tail class incremental learning scenarios. Extensive experimentation on the CIFAR-100 and ImageNet-Subset datasets validates the approach's efficacy, showcasing its superiority over state-of-the-art techniques across various long-tail CIL settings.

Optimal Transport-Guided Conditional Score-Based Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.01226
  • repo_url: https://github.com/xjtu-xgu/otcs
  • paper_authors: Xiang Gu, Liwei Yang, Jian Sun, Zongben Xu
  • for: conditional generation of target data with paired data as condition
  • methods: optimal transport-guided conditional score-based diffusion model (OTCS)
  • results: Effective training of the conditional score-based model in unpaired or partially paired settings, with a theoretical proof that OTCS realizes data transport in the optimal transport sense.
    Abstract Conditional score-based diffusion models (SBDM) perform conditional generation of target data with paired data as the condition, and have achieved great success in image translation. However, they require paired data as the condition, and sufficient paired data are often unavailable in real-world applications. To tackle applications with partially paired or even unpaired datasets, we propose a novel Optimal Transport-guided Conditional Score-based diffusion model (OTCS) in this paper. We build the coupling relationship for the unpaired or partially paired dataset based on $L_2$-regularized unsupervised or semi-supervised optimal transport, respectively. Based on the coupling relationship, we develop the objective for training the conditional score-based model for unpaired or partially paired settings, which is based on a reformulation and generalization of the conditional SBDM for the paired setting. With the estimated coupling relationship, we effectively train the conditional score-based model by designing a "resampling-by-compatibility" strategy to choose the sampled data with high compatibility as guidance. Extensive experiments on unpaired super-resolution and semi-paired image-to-image translation demonstrate the effectiveness of the proposed OTCS model. From the viewpoint of optimal transport, OTCS provides an approach to transport data across distributions, which is a challenge for OT on large-scale datasets. We theoretically prove that OTCS realizes the data transport in OT with a theoretical bound. Code is available at \url{https://github.com/XJTU-XGU/OTCS}.
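To illustrate the coupling idea, here is a minimal entropic Sinkhorn solver on toy data. Note the hedge: this uses entropic rather than the paper's $L_2$ regularization, and the data are synthetic:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """Entropic OT: coupling P with marginals a, b that minimizes <P, C>."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
source = rng.normal(0, 1, (50, 2))           # unpaired source samples
target = rng.normal(3, 1, (60, 2))           # unpaired target samples
C = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)  # squared distances
P = sinkhorn(np.full(50, 1 / 50), np.full(60, 1 / 60), C)
# P[i, j] estimates how strongly source i is coupled to target j, which can
# then guide conditional training and resampling-by-compatibility.
```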

Convergent plug-and-play with proximal denoiser and unconstrained regularization parameter

  • paper_url: http://arxiv.org/abs/2311.01216
  • repo_url: None
  • paper_authors: Samuel Hurault, Antonin Chambolle, Arthur Leclaire, Nicolas Papadakis
  • for: New convergence proofs for Plug-and-Play (PnP) algorithms for solving image inverse problems, without restrictive conditions on the regularization parameter.
  • methods: PnP algorithms that plug a pre-trained proximal denoiser into a proximal algorithm such as Proximal Gradient Descent (PGD) or Douglas-Rachford Splitting (DRS); a novel convergence proof for PnP-DRS that imposes no restriction on the regularization parameter, and a relaxed PGD variant that converges across a broader range of regularization parameters.
  • results: Experiments on deblurring and super-resolution show that both solutions enhance the accuracy of image restoration.
    Abstract In this work, we present new proofs of convergence for Plug-and-Play (PnP) algorithms. PnP methods are efficient iterative algorithms for solving image inverse problems where regularization is performed by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD) or Douglas-Rachford Splitting (DRS). Recent research has explored convergence by incorporating a denoiser that writes exactly as a proximal operator. However, the corresponding PnP algorithm has then to be run with stepsize equal to $1$. The stepsize condition for nonconvex convergence of the proximal algorithm in use then translates to restrictive conditions on the regularization parameter of the inverse problem. This can severely degrade the restoration capacity of the algorithm. In this paper, we present two remedies for this limitation. First, we provide a novel convergence proof for PnP-DRS that does not impose any restrictions on the regularization parameter. Second, we examine a relaxed version of the PGD algorithm that converges across a broader range of regularization parameters. Our experimental study, conducted on deblurring and super-resolution experiments, demonstrate that both of these solutions enhance the accuracy of image restoration.
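The basic PnP-PGD iteration replaces the proximal operator of the regularizer with a pretrained denoiser. A minimal sketch with a toy forward operator and a box-filter stand-in for the denoiser (the paper's proximal denoiser and relaxed variant are not reproduced here):

```python
import numpy as np

def pnp_pgd(y, A, At, denoiser, tau=0.5, iters=50):
    """Plug-and-Play PGD for min_x 0.5*||Ax - y||^2 + R(x), where the
    prox of R is replaced by a pretrained denoiser."""
    x = At(y)
    for _ in range(iters):
        grad = At(A(x) - y)           # gradient of the data-fidelity term
        x = denoiser(x - tau * grad)  # denoiser stands in for prox_{tau R}
    return x

# Toy usage: identity forward operator, smoothing filter as the "denoiser".
y = np.random.rand(64)
x_hat = pnp_pgd(y, A=lambda x: x, At=lambda r: r,
                denoiser=lambda z: np.convolve(z, np.ones(3) / 3, mode="same"))
```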

High-Quality Animatable Dynamic Garment Reconstruction from Monocular Videos

  • paper_url: http://arxiv.org/abs/2311.01214
  • repo_url: None
  • paper_authors: Xiongzheng Li, Jinsong Zhang, Yu-Kun Lai, Jingyu Yang, Kun Li
  • for: reconstruction of high-quality animatable dynamic garments from monocular videos
  • methods: learnable garment deformation network, multi-hypothesis deformation module
  • results: high-quality dynamic garments with coherent surface details, can be easily animated under unseen poses
    Abstract Much progress has been made in reconstructing garments from an image or a video. However, none of existing works meet the expectations of digitizing high-quality animatable dynamic garments that can be adjusted to various unseen poses. In this paper, we propose the first method to recover high-quality animatable dynamic garments from monocular videos without depending on scanned data. To generate reasonable deformations for various unseen poses, we propose a learnable garment deformation network that formulates the garment reconstruction task as a pose-driven deformation problem. To alleviate the ambiguity estimating 3D garments from monocular videos, we design a multi-hypothesis deformation module that learns spatial representations of multiple plausible deformations. Experimental results on several public datasets demonstrate that our method can reconstruct high-quality dynamic garments with coherent surface details, which can be easily animated under unseen poses. The code will be provided for research purposes.

Semantic Scene Graph Generation Based on an Edge Dual Scene Graph and Message Passing Neural Network

  • paper_url: http://arxiv.org/abs/2311.01192
  • repo_url: None
  • paper_authors: Hyeongjin Kim, Sangwon Kim, Jong Taek Lee, Byoung Chul Ko
  • for: Improving the accuracy and reliability of scene graph generation (SGG) so it better captures the complex relationships and interactions between objects in an image.
  • methods: Edge dual scene graph generation (EdgeSGG), based on an edge dual scene graph and a Dual Message Passing Neural Network (DualMPNN) that learns both object- and relation-centric features, capturing rich contextual interactions between unconstrained objects and enabling fine-grained relational updates.
  • results: Compared with state-of-the-art methods on two public datasets, six metrics, and three subtasks, the model shows substantial performance improvements across all SGG subtasks; experiments on long-tail distributions show that integrating inter-object relationships effectively mitigates existing long-tail problems.
    Abstract Along with generative AI, interest in scene graph generation (SGG), which comprehensively captures the relationships and interactions between objects in an image and creates a structured graph-based representation, has significantly increased in recent years. However, relying on object-centric and dichotomous relationships, existing SGG methods have a limited ability to accurately predict detailed relationships. To solve these problems, a new approach to the modeling multiobject relationships, called edge dual scene graph generation (EdgeSGG), is proposed herein. EdgeSGG is based on a edge dual scene graph and Dual Message Passing Neural Network (DualMPNN), which can capture rich contextual interactions between unconstrained objects. To facilitate the learning of edge dual scene graphs with a symmetric graph structure, the proposed DualMPNN learns both object- and relation-centric features for more accurately predicting relation-aware contexts and allows fine-grained relational updates between objects. A comparative experiment with state-of-the-art (SoTA) methods was conducted using two public datasets for SGG operations and six metrics for three subtasks. Compared with SoTA approaches, the proposed model exhibited substantial performance improvements across all SGG subtasks. Furthermore, experiment on long-tail distributions revealed that incorporating the relationships between objects effectively mitigates existing long-tail problems.
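An edge dual scene graph turns each relation (edge) of the scene graph into a node, with dual edges connecting relations that share an object; structurally this is the classical line-graph construction, sketched below with networkx. The paper's DualMPNN adds learned object- and relation-centric features on top of such a structure:

```python
import networkx as nx

# A tiny scene graph: objects as nodes, relations as labeled edges.
sg = nx.Graph()
sg.add_edge("person", "horse", predicate="riding")
sg.add_edge("person", "hat", predicate="wearing")
sg.add_edge("horse", "grass", predicate="standing_on")

# Edge dual graph: each relation becomes a node; two relations are linked
# when they share an object, letting relation-to-relation context propagate.
dual = nx.line_graph(sg)
for u, v in dual.edges():
    shared = set(u) & set(v)
    print(f"{u} <-> {v} via shared object {shared.pop()}")
```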

Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations

  • paper_url: http://arxiv.org/abs/2311.01188
  • repo_url: None
  • paper_authors: Anuja Vats, David Völgyes, Martijn Vermeer, Marius Pedersen, Kiran Raja, Daniele S. M. Fantin, Jacob Alexander Hay
  • for: Proposes a deep-learning method for extracting precise building footprint maps from remote sensing (LiDAR) data with very few annotations (a sketch of the pretext task follows the abstract below).
  • methods: Uses digital elevation models derived from LiDAR data in a terrain-aware self-supervised scheme: the network learns to differentiate bare earth from superimposed structures, implicitly acquiring domain-relevant features for building segmentation without extensive pixel-level labels.
  • results: With only 1% of the labels (about 25 labeled examples), the method outperforms ImageNet pre-training; the gains are most pronounced in few-shot settings, and the approach generalizes to a dataset with substantial distribution shifts and labeling errors, consistently beating other baselines.
    Abstract Estimating building footprint maps from geospatial data is of paramount importance in urban planning, development, disaster management, and various other applications. Deep learning methodologies have gained prominence in building segmentation maps, offering the promise of precise footprint extraction without extensive post-processing. However, these methods face challenges in generalization and label efficiency, particularly in remote sensing, where obtaining accurate labels can be both expensive and time-consuming. To address these challenges, we propose terrain-aware self-supervised learning, tailored to remote sensing, using digital elevation models from LiDAR data. We propose to learn a model to differentiate between bare Earth and superimposed structures enabling the network to implicitly learn domain-relevant features without the need for extensive pixel-level annotations. We test the effectiveness of our approach by evaluating building segmentation performance on test datasets with varying label fractions. Remarkably, with only 1% of the labels (equivalent to 25 labeled examples), our method improves over ImageNet pre-training, showing the advantage of leveraging unlabeled data for feature extraction in the domain of remote sensing. The performance improvement is more pronounced in few-shot scenarios and gradually closes the gap with ImageNet pre-training as the label fraction increases. We test on a dataset characterized by substantial distribution shifts and labeling errors to demonstrate the generalizability of our approach. When compared to other baselines, including ImageNet pretraining and more complex architectures, our approach consistently performs better, demonstrating the efficiency and effectiveness of self-supervised terrain-aware feature learning.
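The abstract does not spell out the exact pretext objective, so the following is only a plausible sketch, assuming co-registered digital surface and terrain models (DSM/DTM) are available: regressing height-above-ground gives the encoder a label-free signal for separating bare earth from superimposed structures. The architecture and loss are stand-ins, not the authors' choices.

```python
import torch
import torch.nn as nn

# Hypothetical pretext task: from the digital surface model (DSM), regress the
# height above bare earth, i.e. DSM - DTM. Buildings and vegetation have
# positive residuals while bare earth is ~0, so the encoder must implicitly
# learn to separate terrain from superimposed structures -- no labels needed.
encoder = nn.Sequential(  # stand-in for any segmentation backbone
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def pretext_step(dsm, dtm):
    target = (dsm - dtm).clamp(min=0)   # height above ground
    loss = nn.functional.l1_loss(encoder(dsm), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

dsm = torch.rand(4, 1, 64, 64) * 50        # fake elevation tiles (meters)
dtm = dsm - torch.rand(4, 1, 64, 64) * 10  # bare-earth model below the surface
print(pretext_step(dsm, dtm))
```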

Learning Intra and Inter-Camera Invariance for Isolated Camera Supervised Person Re-identification

  • paper_url: http://arxiv.org/abs/2311.01155
  • repo_url: None
  • paper_authors: Menglin Wang, Xiaojin Gong
  • for: Studies person re-identification under the isolated camera supervised (ISCS) setting, where each identity is observed by only a single camera.
  • methods: Rather than generating fake cross-camera features, the method makes efficient use of the variation in the training data: it builds style-consistent environments via clustering and performs prototypical contrastive learning within each environment (see the sketch after the abstract), while contrasting strongly augmented images with prototypes to enforce intra-camera augmentation invariance; an improved multi-camera negative loss handles inter-camera invariance.
  • results: Extensive experiments on multiple benchmarks validate the effectiveness and superiority of the method.
    Abstract Supervised person re-identification assumes that a person has images captured under multiple cameras. However, when cameras are placed far apart, a person rarely appears in more than one camera. This paper thus studies person re-ID under such an isolated camera supervised (ISCS) setting. Instead of trying to generate fake cross-camera features like previous methods, we explore a novel perspective by making efficient use of the variation in training data. Under the ISCS setting, a person only has limited images from a single camera, so the camera bias becomes a critical issue confounding ID discrimination. Cross-camera images are prone to being recognized as different IDs simply by camera style. To eliminate the confounding effect of camera bias, we propose to learn both intra- and inter-camera invariance under a unified framework. First, we construct style-consistent environments via clustering, and perform prototypical contrastive learning within each environment. Meanwhile, strongly augmented images are contrasted with original prototypes to enforce intra-camera augmentation invariance. For inter-camera invariance, we further design a much improved variant of multi-camera negative loss that optimizes the distance of multi-level negatives. The resulting model learns to be invariant to both subtle and severe style variation within and across cameras. On multiple benchmarks, we conduct extensive experiments and validate the effectiveness and superiority of the proposed method. Code will be available at https://github.com/Terminator8758/IICI.
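A minimal sketch of the prototypical contrastive step described in the abstract, in one plausible formulation: each sample is pulled toward its own cluster prototype and pushed away from the others. The temperature, shapes, and random inputs are assumptions; the paper's full intra-/inter-camera losses are not reproduced.

```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(features, cluster_ids, prototypes, tau=0.1):
    """features: (N, D) L2-normalized embeddings; cluster_ids: (N,) index of
    each sample's prototype; prototypes: (K, D) L2-normalized centers."""
    logits = features @ prototypes.t() / tau     # (N, K) scaled similarities
    # Cross-entropy pulls each sample to its own prototype, pushes from others.
    return F.cross_entropy(logits, cluster_ids)

feats = F.normalize(torch.randn(8, 128), dim=1)
protos = F.normalize(torch.randn(5, 128), dim=1)
ids = torch.randint(0, 5, (8,))
print(prototypical_contrastive_loss(feats, ids, protos))
```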

AeroPath: An airway segmentation benchmark dataset with challenging pathology

  • paper_url: http://arxiv.org/abs/2311.01138
  • repo_url: https://github.com/raidionics/aeropath
  • paper_authors: Karen-Helene Støverud, David Bouget, Andre Pedersen, Håkon Olav Leira, Thomas Langø, Erlend Fagertun Hofstad
  • for: Early diagnosis and treatment improve the prognosis of patients with pulmonary diseases such as lung cancer; CT image analysis is key to diagnosis, and high-quality airway tree segmentation is required for intervention planning and live guidance during bronchoscopy.
  • methods: Introduces AeroPath, a new public benchmark dataset of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with trachea and bronchi annotations; also proposes a multiscale fusion design for automatic airway segmentation.
  • results: The proposed model predicts topologically correct segmentations for all AeroPath patients and is robust to various anomalies, down to at least the fifth airway generation; an open web application is provided for testing the model on new data.
    Abstract To improve the prognosis of patients suffering from pulmonary diseases, such as lung cancer, early diagnosis and treatment are crucial. The analysis of CT images is invaluable for diagnosis, whereas high-quality segmentation of the airway tree is required for intervention planning and live guidance during bronchoscopy. Recently, the Multi-domain Airway Tree Modeling (ATM'22) challenge released a large dataset, both enabling training of deep-learning based models and bringing substantial improvement of the state of the art for the airway segmentation task. However, the ATM'22 dataset includes few patients with severe pathologies affecting the airway tree anatomy. In this study, we introduce a new public benchmark dataset (AeroPath), consisting of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with corresponding trachea and bronchi annotations. Second, we present a multiscale fusion design for automatic airway segmentation. Models were trained on the ATM'22 dataset, tested on the AeroPath dataset, and further evaluated against competitive open-source methods. The same performance metrics as used in the ATM'22 challenge were used to benchmark the different considered approaches. Lastly, an open web application is developed to easily test the proposed model on new data. The results demonstrated that our proposed architecture predicted topologically correct segmentations for all the patients included in the AeroPath dataset. The proposed method is robust and able to handle various anomalies, down to at least the fifth airway generation. In addition, the AeroPath dataset, featuring patients with challenging pathologies, will contribute to the development of new state-of-the-art methods. The AeroPath dataset and the web application are made openly available.

A deep learning experiment for semantic segmentation of overlapping characters in palimpsests

  • paper_url: http://arxiv.org/abs/2311.01130
  • repo_url: None
  • paper_authors: Michela Perino, Michele Ginolfi, Anna Candida Felici, Michela Rosellini
  • for: Proposes a deep-learning-based semantic segmentation method for identifying and segmenting individual letters within overlapping characters in palimpsests.
  • methods: Combines multispectral imaging, which reveals faded and erased inks imperceptible to the naked eye, with deep-learning semantic segmentation to disentangle complex nodes of overlapping letters.
  • results: The proof-of-concept experiments on the Ars Grammatica palimpsests by Prisciano show that overlapping characters can be segmented, supporting the reading and analysis of palimpsests; caveats and prospects of the approach are also discussed.
    Abstract Palimpsests refer to historical manuscripts where erased writings have been partially covered by the superimposition of a second writing. By employing imaging techniques, e.g., multispectral imaging, it becomes possible to identify features that are imperceptible to the naked eye, including faded and erased inks. When dealing with overlapping inks, Artificial Intelligence techniques can be utilized to disentangle complex nodes of overlapping letters. In this work, we propose deep learning-based semantic segmentation as a method for identifying and segmenting individual letters in overlapping characters. The experiment was conceived as a proof of concept, focusing on the palimpsests of the Ars Grammatica by Prisciano as a case study. Furthermore, caveats and prospects of our approach combined with multispectral imaging are also discussed.

Cheating Depth: Enhancing 3D Surface Anomaly Detection via Depth Simulation

  • paper_url: http://arxiv.org/abs/2311.01117
  • repo_url: https://github.com/vitjanz/3dsr
  • paper_authors: Vitjan Zavrtanik, Matej Kristan, Danijel Skočaj
  • for: Improve the accuracy and processing speed of surface anomaly detection by complementing RGB data with 3D (depth) information.
  • methods: Proposes a Depth-Aware Discrete Autoencoder (DADA) architecture that learns a general discrete latent space jointly modeling RGB and 3D data, together with a depth-simulation process for learning informative depth features (an illustrative depth simulation follows the abstract below).
  • results: The resulting 3DSR method outperforms all existing state-of-the-art methods on the MVTec3D anomaly detection benchmark, in both accuracy and processing speed.
    Abstract RGB-based surface anomaly detection methods have advanced significantly. However, certain surface anomalies remain practically invisible in RGB alone, necessitating the incorporation of 3D information. Existing approaches that employ point-cloud backbones suffer from suboptimal representations and reduced applicability due to slow processing. Re-training RGB backbones, designed for faster dense input processing, on industrial depth datasets is hindered by the limited availability of sufficiently large datasets. We make several contributions to address these challenges. (i) We propose a novel Depth-Aware Discrete Autoencoder (DADA) architecture, that enables learning a general discrete latent space that jointly models RGB and 3D data for 3D surface anomaly detection. (ii) We tackle the lack of diverse industrial depth datasets by introducing a simulation process for learning informative depth features in the depth encoder. (iii) We propose a new surface anomaly detection method 3DSR, which outperforms all existing state-of-the-art on the challenging MVTec3D anomaly detection benchmark, both in terms of accuracy and processing speed. The experimental results validate the effectiveness and efficiency of our approach, highlighting the potential of utilizing depth information for improved surface anomaly detection.
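The abstract does not detail the depth-simulation process, so the sketch below merely illustrates the generic idea under stated assumptions: synthesize smooth depth maps with localized 3D anomalies so that a depth encoder can learn informative features without large industrial depth datasets. The surface model and anomaly shape are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_depth(h=128, w=128):
    """Synthesize a smooth depth map plus a localized protrusion anomaly,
    returning the depth map and a ground-truth anomaly mask."""
    yy, xx = np.mgrid[0:h, 0:w] / max(h, w)
    base = 1.0 + 0.2 * np.sin(3 * xx) + 0.1 * yy        # smooth base surface
    cy, cx, r = rng.integers(20, h - 20), rng.integers(20, w - 20), 10
    gy, gx = np.mgrid[0:h, 0:w]
    bump = 0.05 * np.exp(-((gy - cy) ** 2 + (gx - cx) ** 2) / (2 * r ** 2))
    mask = bump > 0.01                                  # anomaly ground truth
    return base + bump, mask

depth, mask = simulate_depth()
print(depth.shape, int(mask.sum()), "anomalous pixels")
```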

H-NeXt: The next step towards roto-translation invariant networks

  • paper_url: http://arxiv.org/abs/2311.01111
  • repo_url: https://github.com/karellat/h-next
  • paper_authors: Tomas Karella, Filip Sroubek, Jan Flusser, Jan Blazek, Vasek Kosik
  • for: Proposes a parameter-efficient network that is invariant to rotations and translations while being trained without a single augmented image.
  • methods: H-NeXt consists of an equivariant backbone that learns roto-translation independent features, an invariant pooling layer that discards roto-translation information (the general principle is sketched after the abstract), and a classification layer.
  • results: Trained on unaugmented training sets, H-NeXt outperforms the state of the art on augmented test sets of MNIST and CIFAR-10.
    Abstract The widespread popularity of equivariant networks underscores the significance of parameter efficient models and effective use of training data. At a time when robustness to unseen deformations is becoming increasingly important, we present H-NeXt, which bridges the gap between equivariance and invariance. H-NeXt is a parameter-efficient roto-translation invariant network that is trained without a single augmented image in the training set. Our network comprises three components: an equivariant backbone for learning roto-translation independent features, an invariant pooling layer for discarding roto-translation information, and a classification layer. H-NeXt outperforms the state of the art in classification on unaugmented training sets and augmented test sets of MNIST and CIFAR-10.
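H-NeXt's specific invariant pooling layer is not described in the abstract; as a loose illustration of the general equivariant-backbone-plus-invariant-pooling principle, pooling features over the rotation orbit of the input yields exactly rotation-invariant outputs. The tiny backbone here is a placeholder.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

def orbit_pooled_features(x):
    """Max-pool features over all 90-degree rotations of the input: the
    result is identical for x and any 90-degree rotation of x."""
    feats = [backbone(torch.rot90(x, k, dims=(2, 3))) for k in range(4)]
    return torch.stack(feats, dim=0).max(dim=0).values

x = torch.randn(2, 1, 28, 28)
# Rotating the input permutes the orbit, so the pooled features are unchanged.
assert torch.allclose(orbit_pooled_features(x),
                      orbit_pooled_features(torch.rot90(x, 1, dims=(2, 3))))
```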

Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

  • paper_url: http://arxiv.org/abs/2311.01092
  • repo_url: https://github.com/medhk23/omnifm-dr
  • paper_authors: Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang
  • for: Unifies multiple chest radiograph interpretation tasks in a single multi-modal transformer, since disease diagnosis is in practice a multi-task procedure, and aims to improve clinical interpretability.
  • methods: Trains a unified transformer with customized instruction tuning on a multi-task dataset of 13.4 million instruction/ground-truth pairs (about one million radiographs), covering both image- and pixel-level tasks with homogeneous model inputs and outputs.
  • results: The model outperforms prior art on various chest X-ray benchmarks across multiple tasks in both direct inference and finetuning settings; evaluations by three radiologists confirm the enhanced explainability of the generated reports.
    Abstract The emergence of multi-modal deep learning models has made significant impacts on clinical applications in the last decade. However, the majority of models are limited to single-tasking, without considering that disease diagnosis is indeed a multi-task procedure. Here, we demonstrate a unified transformer model specifically designed for multi-modal clinical tasks by incorporating customized instruction tuning. We first compose a multi-task training dataset comprising 13.4 million instruction and ground-truth pairs (with approximately one million radiographs) for the customized tuning, involving both image- and pixel-level tasks. Thus, we can unify the various vision-intensive tasks in a single training framework with homogeneous model inputs and outputs to increase clinical interpretability in one reading. Finally, we demonstrate the overall superior performance of our model compared to prior art on various chest X-ray benchmarks across multiple tasks in both direct inference and finetuning settings. Three radiologists further evaluate the generated reports against the recorded ones, which also exhibit the enhanced explainability of our multi-task model.

Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding

  • paper_url: http://arxiv.org/abs/2311.01091
  • repo_url: None
  • paper_authors: Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, Si Liu
  • for: Improves panoptic narrative grounding, i.e., grounding the noun phrases of a narrative caption to the things and stuff they describe in an image.
  • methods: Proposes a Phrase-Pixel-Object Transformer Decoder (PPO-TD) that aggregates both fine-grained pixel details and coarse-grained object clues into phrase features, plus a Phrase-Object Contrastive Loss (POCL) that pulls matched phrase-object pairs closer and pushes unmatched ones apart for more precise object contexts.
  • results: Experiments on the PNG benchmark show new state-of-the-art performance with large margins over previous methods.
    Abstract Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to lacking object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. In addition, we also propose a PhraseObject Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push away unmatched ones for aggregating more precise object contexts from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins.

Infusion: Internal Diffusion for Video Inpainting

  • paper_url: http://arxiv.org/abs/2311.01090
  • repo_url: None
  • paper_authors: Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson
  • for: Video inpainting: filling a desired region of a video in a visually convincing manner.
  • methods: Adapts diffusion models, which can model complex data distributions including images and videos, via an internal learning approach: the model is trained only on the video to be inpainted, which also allows a greatly reduced network size (a simplified internal training step is sketched after the abstract).
  • results: Reaches state-of-the-art performance, particularly for dynamic backgrounds and textures, without supporting elements such as optical flow estimation.
    Abstract Video inpainting is the task of filling a desired region in a video in a visually convincing manner. It is a very challenging task due to the high dimensionality of the signal and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Diffusion models remain nonetheless very expensive to train and perform inference with, which strongly restricts their application to video. We show that in the case of video inpainting, thanks to the highly auto-similar nature of videos, the training of a diffusion model can be restricted to the video to inpaint and still produce very satisfying results. This leads us to adopt an internal learning approach, which also allows for a greatly reduced network size. We call our approach "Infusion": an internal learning algorithm for video inpainting through diffusion. Due to our frugal network, we are able to propose the first video inpainting approach based purely on diffusion. Other methods require supporting elements such as optical flow estimation, which limits their performance in the case of dynamic textures, for example. We introduce a new method for efficient training and inference of diffusion models in the context of internal learning. We split the diffusion process into different learning intervals, which greatly simplifies the learning steps. We show qualitative and quantitative results, demonstrating that our method reaches state-of-the-art performance, in particular in the case of dynamic backgrounds and textures.
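A highly simplified picture of internal diffusion training, i.e., fitting a small denoiser only on frames of the clip to be inpainted. The paper's masking, learning-interval split, and timestep conditioning are omitted; everything below is a generic DDPM-style sketch, not the authors' code.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Tiny stand-in for a U-Net; a real denoiser would also condition on t.
denoiser = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(denoiser.parameters(), lr=2e-4)

def internal_training_step(video):          # video: (F, 3, H, W) -- the clip itself
    x0 = video[torch.randint(0, video.shape[0], (8,))]  # sample frames internally
    t = torch.randint(0, T, (x0.shape[0],))
    a = alphas_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise         # forward diffusion
    loss = (denoiser(xt) - noise).pow(2).mean()         # predict the noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(internal_training_step(torch.rand(30, 3, 64, 64)))
```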

Dynamic Multimodal Information Bottleneck for Multimodality Classification

  • paper_url: http://arxiv.org/abs/2311.01066
  • repo_url: https://github.com/bii-wushuang/dmib
  • paper_authors: Yingying Fang, Shuang Wu, Sheng Zhang, Chaoyan Huang, Tieyong Zeng, Xiaodan Xing, Simon Walsh, Guang Yang
  • for: Make better use of multimodal data (various images, laboratory tests, and clinical information) to improve the accuracy of AI-based medical diagnosis and prognosis.
  • methods: A dynamic multimodal information bottleneck framework filters task-irrelevant information and noise out of the fused feature, while a sufficiency loss prevents task-relevant prediction information from being dropped (a variational sketch of this combination follows the abstract).
  • results: On an in-house and a public COVID-19 dataset for mortality prediction and two public biomedical diagnostic datasets, the method surpasses the state of the art and is significantly more robust, being the only method to maintain performance when large-scale noisy channels exist.
    Abstract Effectively leveraging multimodal data such as various images, laboratory tests and clinical information is gaining traction in a variety of AI-based medical diagnosis and prognosis tasks. Most existing multi-modal techniques only focus on enhancing their performance by leveraging the differences or shared features from various modalities and fusing feature across different modalities. These approaches are generally not optimal for clinical settings, which pose the additional challenges of limited training data, as well as being rife with redundant data or noisy modality channels, leading to subpar performance. To address this gap, we study the robustness of existing methods to data redundancy and noise and propose a generalized dynamic multimodal information bottleneck framework for attaining a robust fused feature representation. Specifically, our information bottleneck module serves to filter out the task-irrelevant information and noises in the fused feature, and we further introduce a sufficiency loss to prevent dropping of task-relevant information, thus explicitly preserving the sufficiency of prediction information in the distilled feature. We validate our model on an in-house and a public COVID19 dataset for mortality prediction as well as two public biomedical datasets for diagnostic tasks. Extensive experiments show that our method surpasses the state-of-the-art and is significantly more robust, being the only method to remain performance when large-scale noisy channels exist. Our code is publicly available at https://github.com/BII-wushuang/DMIB.
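One plausible variational reading of the two terms described above, a compression (bottleneck) term plus a sufficiency term, is sketched below; the paper's exact objective may differ, and all dimensions and weights are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckHead(nn.Module):
    def __init__(self, d_fused=256, d_z=64, n_classes=2):
        super().__init__()
        self.to_stats = nn.Linear(d_fused, 2 * d_z)  # mean and log-variance of z
        self.classifier = nn.Linear(d_z, n_classes)

    def forward(self, fused):
        mu, logvar = self.to_stats(fused).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.classifier(z), mu, logvar

def loss_fn(logits, y, mu, logvar, beta=1e-3):
    sufficiency = F.cross_entropy(logits, y)    # keep task-relevant information
    # KL(q(z|x) || N(0, I)): compress away everything else.
    compression = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return sufficiency + beta * compression

head = BottleneckHead()
logits, mu, logvar = head(torch.randn(16, 256))  # stand-in fused multimodal feature
labels = torch.randint(0, 2, (16,))
print(loss_fn(logits, labels, mu, logvar))
```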

Novel View Synthesis from a Single RGBD Image for Indoor Scenes

  • paper_url: http://arxiv.org/abs/2311.01065
  • repo_url: None
  • paper_authors: Congrui Hetang, Yuping Wang
  • for: Proposes a novel view synthesis (NVS) method that requires only a single RGBD input.
  • methods: Converts the RGBD image into a point cloud, renders it from a different viewpoint (the geometric step is sketched after the abstract), and casts NVS as an image-translation problem solved with generative adversarial networks, exploring both unsupervised CycleGAN and supervised Pix2Pix.
  • results: Produces results similar to photographs taken from the new perspective while circumventing the limitations of multi-image techniques such as NeRF and MVS, holding promise for practical, real-time NVS.
    Abstract In this paper, we propose an approach for synthesizing novel view images from a single RGBD (Red Green Blue-Depth) input. Novel view synthesis (NVS) is an interesting computer vision task with extensive applications. Methods using multiple images have been well-studied; exemplary ones include training scene-specific Neural Radiance Fields (NeRF), or leveraging multi-view stereo (MVS) and 3D rendering pipelines. However, both are either computationally intensive or non-generalizable across different scenes, limiting their practical value. Conversely, the depth information embedded in RGBD images unlocks 3D potential from a singular view, simplifying NVS. The widespread availability of compact, affordable stereo cameras, and even LiDARs in contemporary devices like smartphones, makes capturing RGBD images more accessible than ever. In our method, we convert an RGBD image into a point cloud and render it from a different viewpoint, then formulate the NVS task into an image translation problem. We leveraged generative adversarial networks to style-transfer the rendered image, achieving a result similar to a photograph taken from the new perspective. We explore both unsupervised learning using CycleGAN and supervised learning with Pix2Pix, and demonstrate the qualitative results. Our method circumvents the limitations of traditional multi-image techniques, holding significant promise for practical, real-time applications in NVS.
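The geometric core of the pipeline, unprojecting the RGBD image to a point cloud and splatting it into a new view, can be written in a few lines of numpy (pinhole intrinsics assumed; occlusion handling and the GAN refinement stage are omitted).

```python
import numpy as np

def reproject_rgbd(rgb, depth, K, R, t):
    """Unproject an RGBD image to 3D, move the camera by (R, t), and splat
    points back into a new image. Holes are left black; z-order is ignored."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.ravel()
    pts = np.linalg.inv(K) @ np.stack([u.ravel() * z, v.ravel() * z, z])  # 3 x N
    pts = R @ pts + t[:, None]                    # points in the new camera frame
    uvw = K @ pts
    u2 = (uvw[0] / uvw[2]).round().astype(int)
    v2 = (uvw[1] / uvw[2]).round().astype(int)
    ok = (uvw[2] > 0) & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    out = np.zeros_like(rgb)
    out[v2[ok], u2[ok]] = rgb.reshape(-1, 3)[ok]
    return out

K = np.array([[300., 0, 32], [0, 300., 32], [0, 0, 1]])   # toy intrinsics
rgb = np.random.rand(64, 64, 3)
depth = np.full((64, 64), 2.0)
novel = reproject_rgbd(rgb, depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(novel.shape)
```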

Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images

  • paper_url: http://arxiv.org/abs/2311.01064
  • repo_url: None
  • paper_authors: Zalan Fabian, Zhongqi Miao, Chunyuan Li, Yuanhan Zhang, Ziwei Liu, Andrés Hernández, Andrés Montes-Rojas, Rafael Escucha, Laura Siabatto, Andrés Link, Pablo Arbeláez, Rahul Dodhia, Juan Lavista Ferres
  • for: Develop a large-scale wildlife monitoring solution for camera trap imagery that reduces reliance on costly expert annotations.
  • methods: Instruction-tunes vision-language models to generate detailed, expert-style visual descriptions of camera trap images, then matches the generated captions against an external knowledge base of species descriptions for zero-shot classification (the matching step is sketched after the abstract); a novel knowledge augmentation technique enhances caption quality.
  • results: Demonstrated on a new camera trap dataset collected in the Magdalena Medio region of Colombia.
    Abstract Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife are crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery; however, training such techniques requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential in developing large-scale wildlife tracking solutions with markedly less human labor. In this work we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction tune vision-language models to generate detailed visual descriptions of camera trap images using similar terminology to experts. Then, we match the generated caption to an external knowledge base of descriptions in order to determine the species in a zero-shot manner. We investigate techniques to build instruction tuning datasets for detailed animal description generation and propose a novel knowledge augmentation technique to enhance caption quality. We demonstrate the performance of WildMatch on a new camera trap dataset collected in the Magdalena Medio region of Colombia.
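The zero-shot matching step reduces to nearest-neighbor search between a generated caption and knowledge-base descriptions. The sketch below uses TF-IDF cosine similarity purely as a stand-in for whatever matcher the paper actually uses; the knowledge-base entries are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical external knowledge base of species descriptions.
knowledge_base = {
    "ocelot": "medium-sized spotted wild cat with dark rosette markings and a long tail",
    "collared peccary": "pig-like mammal with coarse grey fur and a pale neck collar",
    "agouti": "large rodent with short ears and coarse brown fur",
}

def zero_shot_match(caption):
    names = list(knowledge_base)
    vec = TfidfVectorizer().fit(list(knowledge_base.values()) + [caption])
    kb = vec.transform(knowledge_base.values())
    sims = cosine_similarity(vec.transform([caption]), kb)[0]
    return names[sims.argmax()], sims.max()

caption = "a spotted cat-like animal with rosette markings walks past the camera"
print(zero_shot_match(caption))  # likely matches 'ocelot'
```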

Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2311.01034
  • repo_url: None
  • paper_authors: Xueting Hu, Ce Zhang, Yi Zhang, Bowen Hai, Ke Yu, Zhihai He
  • for: Proposes a few-shot method that adapts CLIP-style vision-language models (VLMs) for monocular depth estimation, balancing training cost against generalization.
  • methods: Instead of the fixed depth bins used in zero-shot CLIP depth estimation (the underlying mechanism is sketched after the abstract), the model assigns different depth bins to different scenes, selected at inference time, and uses learnable prompts to convert human-readable text into model-friendly vectors.
  • results: With only one training image per scene, extensive experiments on the NYU V2 and KITTI datasets show the method outperforms the previous state of the art by up to 10.6% in terms of MARE.
    Abstract Pre-trained Vision-Language Models (VLMs), such as CLIP, have shown enhanced performance across a range of tasks that involve the integration of visual and linguistic modalities. When CLIP is used for depth estimation tasks, the patches, divided from the input images, can be combined with a series of semantic descriptions of the depth information to obtain similarity results. The coarse estimation of depth is then achieved by weighting and summing the depth values, called depth bins, corresponding to the predefined semantic descriptions. The zero-shot approach circumvents the computational and time-intensive nature of traditional fully-supervised depth estimation methods. However, this method, utilizing fixed depth bins, may not effectively generalize as images from different scenes may exhibit distinct depth distributions. To address this challenge, we propose a few-shot-based method which learns to adapt the VLMs for monocular depth estimation to balance training costs and generalization capabilities. Specifically, it assigns different depth bins for different scenes, which can be selected by the model during inference. Additionally, we incorporate learnable prompts to preprocess the input text to convert the easily human-understood text into easily model-understood vectors and further enhance the performance. With only one image per scene for training, our extensive experimental results on the NYU V2 and KITTI datasets demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
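The fixed-bin zero-shot mechanism this paper improves upon weights a set of depth-bin values by image-text similarity. Below is a schematic sketch with random stand-in features (in the real pipeline they would come from CLIP's image and text encoders); the paper's scene-adaptive bins and learnable prompts are not reproduced.

```python
import torch
import torch.nn.functional as F

depth_bins = torch.tensor([1.0, 1.75, 2.5, 4.0, 7.0, 10.0, 15.0])  # meters, illustrative
# One text feature per bin, e.g. from prompts like "this object is very close".
text_feats = F.normalize(torch.randn(7, 512), dim=-1)
patch_feats = F.normalize(torch.randn(24 * 24, 512), dim=-1)  # CLIP-style patch features

logits = patch_feats @ text_feats.t() / 0.01        # temperature-scaled similarity
weights = logits.softmax(dim=-1)                    # (num_patches, num_bins)
depth = (weights * depth_bins).sum(dim=-1).reshape(24, 24)  # weighted sum of bin values
print(depth.shape, float(depth.min()), float(depth.max()))
```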

Nonnegative/Binary Matrix Factorization for Image Classification using Quantum Annealing

  • paper_url: http://arxiv.org/abs/2311.01028
  • repo_url: None
  • paper_authors: Hinako Asaoka, Kazue Kudo
  • for: Image classification via matrix factorization accelerated with quantum annealing.
  • methods: Uses nonnegative/binary matrix factorization (NBMF), originally introduced as a generative model, as a multiclass classification model: features of handwritten digit images are extracted with NBMF, and the binary factor is computed with a quantum annealing solver (an alternating-update sketch follows the abstract).
  • results: (1) When the amount of data, features, and epochs is small, NBMF-trained models are more accurate than classical machine-learning methods such as neural networks; (2) the quantum annealing solver significantly reduces computation time.
    Abstract Classical computing has borne witness to the development of machine learning. The integration of quantum technology into this mix will lead to unimaginable benefits and be regarded as a giant leap forward in mankind's ability to compute. Demonstrating the benefits of this integration now becomes essential. With the advance of quantum computing, several machine-learning techniques have been proposed that use quantum annealing. In this study, we implement a matrix factorization method using quantum annealing for image classification and compare the performance with traditional machine-learning methods. Nonnegative/binary matrix factorization (NBMF) was originally introduced as a generative model, and we propose a multiclass classification model as an application. We extract the features of handwritten digit images using NBMF and apply them to solve the classification problem. Our findings show that when the amount of data, features, and epochs is small, the accuracy of models trained by NBMF is superior to classical machine-learning methods, such as neural networks. Moreover, we found that training models using a quantum annealing solver significantly reduces computation time. Under certain conditions, there is a benefit to using quantum annealing technology with machine learning.
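NBMF factorizes nonnegative data V ≈ WH with W real-valued nonnegative and H binary; updating H is the combinatorial subproblem that maps to a QUBO for the annealer. A classical sketch with the binary step brute-forced per column (standing in for the annealer); the update rules are simplified assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
V = rng.random((6, 8))    # data: columns are (flattened) images
k = 3                     # number of basis vectors

W = rng.random((6, k))
H = rng.integers(0, 2, (k, 8)).astype(float)

for _ in range(20):
    # W-step: crude projected least squares, keeping W nonnegative.
    W = np.clip(V @ H.T @ np.linalg.pinv(H @ H.T), 0, None)
    # H-step: per column, pick the binary vector minimizing the residual.
    # This is the QUBO a quantum annealer would solve; brute force works for small k.
    candidates = np.array(list(product([0.0, 1.0], repeat=k)))  # 2^k options
    for j in range(V.shape[1]):
        errs = ((W @ candidates.T - V[:, [j]]) ** 2).sum(axis=0)
        H[:, j] = candidates[errs.argmin()]

print("reconstruction error:", np.linalg.norm(V - W @ H))
```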

Incorporating Language-Driven Appearance Knowledge Units with Visual Cues in Pedestrian Detection

  • paper_url: http://arxiv.org/abs/2311.01025
  • repo_url: None
  • paper_authors: Sungjune Park, Hyunjun Kim, Yong Man Ro
  • for: Leverages the contextual and semantic appearance knowledge of large language models (LLMs) to improve pedestrian detection, a safety-critical task made challenging by varying appearances and poses across diverse scenes.
  • methods: Builds a description corpus of pedestrian (and other) appearances, feeds it through an LLM to extract appearance knowledge sets, applies task prompting to obtain appearance knowledge units relevant to the downstream detection task, and integrates these language-driven units with visual cues.
  • results: Extensive experiments with various pedestrian detectors show noticeable performance gains and state-of-the-art detection performance.
    Abstract Large language models (LLMs) have shown their capability in understanding contextual and semantic information regarding appearance knowledge of instances. In this paper, we introduce a novel approach to utilize the strength of an LLM in understanding contextual appearance variations and to leverage its knowledge into a vision model (here, pedestrian detection). While pedestrian detection is considered one of crucial tasks directly related with our safety (e.g., intelligent driving system), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-driven appearance knowledge units and incorporate them with visual cues in pedestrian detection. To this end, we establish description corpus which includes numerous narratives describing various appearances of pedestrians and others. By feeding them through an LLM, we extract appearance knowledge sets that contain the representations of appearance variations. After that, we perform a task-prompting process to obtain appearance knowledge units which are representative appearance knowledge guided to be relevant to a downstream pedestrian detection task. Finally, we provide plentiful appearance information by integrating the language-driven knowledge units with visual cues. Through comprehensive experiments with various pedestrian detectors, we verify the effectiveness of our method showing noticeable performance gains and achieving state-of-the-art detection performance.

Expanding Expressiveness of Diffusion Models with Limited Data via Self-Distillation based Fine-Tuning

  • paper_url: http://arxiv.org/abs/2311.01018
  • repo_url: None
  • paper_authors: Jiwan Hur, Jaehyun Choi, Gyojin Han, Dong-Jae Lee, Junmo Kim
  • for: Improve the expressiveness and generation capacity of diffusion models fine-tuned on limited datasets, which otherwise yield unsatisfactory results in downstream tasks such as domain translation and text-guided image manipulation.
  • methods: Self-Distillation for Fine-Tuning diffusion models (SDFT) distills more general features (shape, colors, etc.) and fewer domain-specific features (texture, fine details, etc.) from a model pretrained on a large source dataset, enabling knowledge transfer without disturbing training on the target dataset.
  • results: Experiments show SDFT enhances the expressiveness of diffusion models trained on limited data, improving generation across various downstream tasks; the method is not tied to a specific architecture.
    Abstract Training diffusion models on limited datasets poses challenges in terms of limited generation capacity and expressiveness, leading to unsatisfactory results in various downstream tasks utilizing pretrained diffusion models, such as domain translation and text-guided image manipulation. In this paper, we propose Self-Distillation for Fine-Tuning diffusion models (SDFT), a methodology to address these challenges by leveraging diverse features from diffusion models pretrained on large source datasets. SDFT distills more general features (shape, colors, etc.) and less domain-specific features (texture, fine details, etc) from the source model, allowing successful knowledge transfer without disturbing the training process on target datasets. The proposed method is not constrained by the specific architecture of the model and thus can be generally adopted to existing frameworks. Experimental results demonstrate that SDFT enhances the expressiveness of the diffusion model with limited datasets, resulting in improved generation capabilities across various downstream tasks.

Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning

  • paper_url: http://arxiv.org/abs/2311.01016
  • repo_url: None
  • paper_authors: Yiran Li, Junpeng Wang, Prince Aboagye, Michael Yeh, Yan Zheng, Liang Wang, Wei Zhang, Kwan-Liu Ma
  • for: Help analysts understand the semantic structure of large-scale image datasets, surface potential data biases, and evaluate and steer the captioning of language-image models.
  • methods: A coordinated visual analytics system built on pretrained large-scale language-image models supports efficient exploration of large image datasets, visual examination of automatically generated captions, and an interactive interface for steering the caption generation process.
  • results: Case studies with domain practitioners on large-scale image datasets show the system can reveal entrenched data biases and expose weaknesses in the captioning capability of pretrained language-image models.
    Abstract Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension, offering a significant leap forward. These breakthroughs have proven particularly instrumental in addressing long-standing challenges that were previously daunting. Leveraging these innovative techniques, this paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process. On the one hand, by visually examining the captions automatically generated from language-image models for an image dataset, we gain deeper insights into the semantic underpinnings of the visual contents, unearthing data biases that may be entrenched within the dataset. On the other hand, by depicting the association between visual contents and textual captions, we expose the weaknesses of pre-trained language-image models in their captioning capability and propose an interactive interface to steer caption generation. The two parts have been coalesced into a coordinated visual analytics system, fostering mutual enrichment of visual and textual elements. We validate the effectiveness of the system with domain practitioners through concrete case studies with large-scale image datasets.

Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs

  • paper_url: http://arxiv.org/abs/2311.01015
  • repo_url: https://github.com/jpthu17/graphmotion
  • paper_authors: Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Yang Wei, Li Yuan
  • for: fine-grained control over human motion generation
  • methods: hierarchical semantic graphs, text-to-motion diffusion process
  • results: superior performance on two benchmark datasets, ability to continuously refine generated motion
    Abstract Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-training weights are available at https://github.com/jpthu17/GraphMotion.

Exploring Unified Perspective For Fast Shapley Value Estimation

  • paper_url: http://arxiv.org/abs/2311.01010
  • repo_url: https://github.com/user-tian/simshap
  • paper_authors: Borui Zhang, Baotong Tian, Wenzhao Zheng, Jie Zhou, Jiwen Lu
  • for: Addresses the black-box nature of deep neural networks with Shapley values, whose exact computation is exponential in the number of features.
  • methods: Analyzes the consistency of existing accelerated estimators (ApproSemivalue, KernelSHAP, FastSHAP), shows that stochastic estimators can be unified as linear transformations of importance sampling over feature subsets, and proposes SimSHAP, a simple and efficient amortized estimator that eliminates redundant techniques (the Monte Carlo baseline these estimators accelerate is sketched after the abstract).
  • results: Extensive experiments on tabular and image datasets validate SimSHAP, which significantly accelerates the computation of accurate Shapley values.
    Abstract Shapley values have emerged as a widely accepted and trustworthy tool, grounded in theoretical axioms, for addressing challenges posed by black-box models like deep neural networks. However, computing Shapley values encounters exponential complexity in the number of features. Various approaches, including ApproSemivalue, KernelSHAP, and FastSHAP, have been explored to expedite the computation. We analyze the consistency of existing works and conclude that stochastic estimators can be unified as the linear transformation of importance sampling of feature subsets. Based on this, we investigate the possibility of designing simple amortized estimators and propose a straightforward and efficient one, SimSHAP, by eliminating redundant techniques. Extensive experiments conducted on tabular and image datasets validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
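For context, permutation-sampling Monte Carlo estimation is the slow baseline that KernelSHAP-style and amortized methods such as SimSHAP accelerate; a self-contained sketch on a toy model follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def shapley_mc(f, x, baseline, n_perms=2000):
    """Monte Carlo Shapley values: average the marginal contribution of each
    feature over random feature orderings, with absent features set to the
    baseline value, avoiding the exponential exact computation."""
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perms):
        perm = rng.permutation(d)
        z = baseline.copy()
        prev = f(z)
        for i in perm:
            z[i] = x[i]               # add feature i to the coalition
            cur = f(z)
            phi[i] += cur - prev      # its marginal contribution
            prev = cur
    return phi / n_perms

f = lambda z: 3 * z[0] + 2 * z[1] * z[2]   # toy model; exact values are (3, 1, 1)
x = np.array([1.0, 1.0, 1.0])
base = np.zeros(3)
phi = shapley_mc(f, x, base)
print(phi, "sum =", phi.sum(), "vs f(x) - f(base) =", f(x) - f(base))
```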

VCISR: Blind Single Image Super-Resolution with Video Compression Synthetic Data

  • paper_url: http://arxiv.org/abs/2311.00996
  • repo_url: https://github.com/kiteretsu77/vcisr-official
  • paper_authors: Boyang Wang, Bowen Liu, Shiyu Liu, Fengyu Yang
  • for: Blind single image super-resolution (SISR) when the input is a single video frame, whose degradations (mosquito noise, ringing, blockiness, staircase noise) come from video compression.
  • methods: Proposes a video-compression-based degradation model to synthesize low-resolution training data, so that a single degraded image contains distortions caused by lossy video codecs; the synthesis is widely applicable to existing image datasets, preserving training efficiency (a crude codec round-trip is sketched after the abstract).
  • results: Achieves superior performance on state-of-the-art no-reference image quality assessment and better visual quality across datasets; the SISR network trained with this degradation model matches or beats VSR-specific architectures on video super-resolution datasets, even without temporal cues.
    Abstract In the blind single image super-resolution (SISR) task, existing works have been successful in restoring image-level unknown degradations. However, when a single video frame becomes the input, these works usually fail to address degradations caused by video compression, such as mosquito noise, ringing, blockiness, and staircase noise. In this work, we present, for the first time, a video compression-based degradation model to synthesize low-resolution image data in the blind SISR task. Our proposed image synthesizing method is widely applicable to existing image datasets, so that a single degraded image can contain distortions caused by the lossy video compression algorithms. This overcomes the lack of feature diversity in video data and thus retains the training efficiency. By introducing video coding artifacts to SISR degradation models, neural networks can super-resolve images with the ability to restore video compression degradations, and achieve better results on restoring generic distortions caused by image compression as well. Our proposed approach achieves superior performance in SOTA no-reference Image Quality Assessment, and shows better visual quality on various datasets. In addition, we evaluate the SISR neural network trained with our degradation model on video super-resolution (VSR) datasets. Compared to architectures specifically designed for the VSR purpose, our method exhibits similar or better performance, evidencing that the presented strategy on infusing video-based degradation is generalizable to address more complicated compression artifacts even without temporal cues.
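Conceptually, the degradation model boils down to round-tripping a clean frame through a lossy video codec. One crude offline approximation, assuming ffmpeg is installed (the synthesis pipeline in the paper is more elaborate than this):

```python
import subprocess
import tempfile
import os

def video_compress_degrade(png_in, png_out, crf=40):
    """Round-trip a single frame through H.264 at a high CRF so the decoded
    output carries real codec artifacts (blockiness, ringing, mosquito noise)."""
    with tempfile.TemporaryDirectory() as tmp:
        mp4 = os.path.join(tmp, "clip.mp4")
        subprocess.run(["ffmpeg", "-y", "-i", png_in, "-c:v", "libx264",
                        "-crf", str(crf), "-pix_fmt", "yuv420p", mp4], check=True)
        subprocess.run(["ffmpeg", "-y", "-i", mp4, png_out], check=True)

# video_compress_degrade("clean.png", "degraded.png")  # then downsample for LR/HR pairs
```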

A Chronological Survey of Theoretical Advancements in Generative Adversarial Networks for Computer Vision

  • paper_url: http://arxiv.org/abs/2311.00995
  • repo_url: None
  • paper_authors: Hrishikesh Sharma
  • for: This paper aims to provide a chronological overview of the development of Generative Adversarial Networks (GANs) in the research field of computer vision, highlighting the key challenges and solutions in the evolution of GAN models.
  • methods: The paper uses a chronological approach to present the landmark research works on GANs, focusing on the theoretical advancements and applications of GANs in computer vision.
  • results: The paper highlights the significant improvements in the training of GAN models over time, and the various applications of GANs in computer vision tasks such as image generation, image-to-image translation, and image synthesis.
    Abstract Generative Adversarial Networks (GANs) have been workhorse generative models for many years, especially in the research field of computer vision. Accordingly, there have been many significant advancements in the theory and application of GAN models, which are notoriously hard to train, but produce good results if trained well. There have been many surveys on GANs, organizing the vast GAN literature from various focuses and perspectives. However, none of the surveys brings out the important chronological aspect: how the multiple challenges of employing GAN models were solved one-by-one over time, across multiple landmark research works. This survey intends to bridge that gap and present some of the landmark research works on the theory and application of GANs, in chronological order.

LaughTalk: Expressive 3D Talking Head Generation with Laughter

  • paper_url: http://arxiv.org/abs/2311.00994
  • repo_url: None
  • paper_authors: Kim Sung-Bin, Lee Hyun, Da Hye Hong, Suekyeong Nam, Janghoon Ju, Tae-Hyun Oh
  • for: Proposes generating 3D talking heads capable of both articulate speech and authentic laughter, an expression existing methods fail to capture despite its importance in social context.
  • methods: Introduces a newly curated dataset of 2D laughing videos paired with pseudo-annotated, human-validated 3D FLAME parameters and vertices, and a strong baseline with a two-stage training scheme: the model first learns to talk, then acquires the ability to express laughter.
  • results: Extensive experiments show the method performs favorably against existing approaches in both talking head generation and laughter expression; potential applications such as rigging realistic avatars are also explored.
    Abstract Laughter is a unique expression, essential to affirmative social interactions of humans. Although current 3D talking head generation methods produce convincing verbal articulations, they often fail to capture the vitality and subtleties of laughter and smiles despite their importance in social context. In this paper, we introduce a novel task to generate 3D talking heads capable of both articulate speech and authentic laughter. Our newly curated dataset comprises 2D laughing videos paired with pseudo-annotated and human-validated 3D FLAME parameters and vertices. Given our proposed dataset, we present a strong baseline with a two-stage training scheme: the model first learns to talk and then acquires the ability to express laughter. Extensive experiments demonstrate that our method performs favorably compared to existing approaches in both talking head generation and expressing laughter signals. We further explore potential applications on top of our proposed method for rigging realistic avatars.

IR-UWB Radar-based Situational Awareness System for Smartphone-Distracted Pedestrians

  • paper_url: http://arxiv.org/abs/2311.00991
  • repo_url: None
  • paper_authors: Jamsheed Manja Ppallan, Ruchi Pandey, Yellappa Damam, Vijay Narayan Tiwari, Karthikeyan Arunachalam, Antariksha Ray
  • for: Improve the situational awareness and road safety of smartphone-distracted pedestrians.
  • methods: UWB-assisted Safe Walk (UASW) uses the Impulse Radio Ultra-Wideband (IR-UWB) radar embedded in the smartphone and combines rule-based obstacle detection over complex Channel Impulse Response (CIR) data with artificial-neural-network-based obstacle classification (a toy CIR peak detector is sketched after the abstract).
  • results: Achieves up to 97% obstacle detection accuracy and up to 95% obstacle classification accuracy with an inference delay of 26.8 ms.
    Abstract With the widespread adoption of smartphones, ensuring pedestrian safety on roads has become a critical concern due to smartphone distraction. This paper proposes a novel and real-time assistance system called UWB-assisted Safe Walk (UASW) for obstacle detection and warns users about real-time situations. The proposed method leverages Impulse Radio Ultra-Wideband (IR-UWB) radar embedded in the smartphone, which provides excellent range resolution and high noise resilience using short pulses. We implemented UASW specifically for Android smartphones with IR-UWB connectivity. The framework uses complex Channel Impulse Response (CIR) data to integrate rule-based obstacle detection with artificial neural network (ANN) based obstacle classification. The performance of the proposed UASW system is analyzed using real-time collected data. The results show that the proposed system achieves an obstacle detection accuracy of up to 97% and obstacle classification accuracy of up to 95% with an inference delay of 26.8 ms. The results highlight the effectiveness of UASW in assisting smartphone-distracted pedestrians and improving their situational awareness.
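The rule-based half of such a detector amounts to thresholding peaks in the CIR magnitude and mapping tap indices to range. An illustrative numpy sketch with made-up radar parameters (the sampling rate and threshold are assumptions, and the ANN classification stage is omitted):

```python
import numpy as np

C = 3e8           # speed of light, m/s
FS = 499.2e6 * 2  # assumed CIR sampling rate for an IR-UWB chip

def detect_obstacles(cir, threshold_db=12.0):
    """Return estimated ranges (m) of CIR taps rising `threshold_db` above
    the noise floor; a downstream ANN would classify each detection."""
    mag = np.abs(cir)
    noise_floor = np.median(mag)
    peaks = np.flatnonzero(mag > noise_floor * 10 ** (threshold_db / 20))
    return peaks * C / (2 * FS)   # two-way travel time -> distance

# Fake CIR: complex noise plus one reflector at tap 40 (~6 m here).
rng = np.random.default_rng(1)
cir = rng.normal(0, 0.05, 128) + 1j * rng.normal(0, 0.05, 128)
cir[40] += 1.0
print(detect_obstacles(cir))
```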

VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning

  • paper_url: http://arxiv.org/abs/2311.00990
  • repo_url: None
  • paper_authors: Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Wenwu Zhu
  • for: Customized multi-subject text-to-video generation: producing temporally consistent, text-guided videos that faithfully preserve the visual features of multiple user-given subjects.
  • methods: The VideoDreamer framework builds on pretrained Stable Diffusion with latent-code motion dynamics and temporal cross-frame attention, and customizes the generator for the given subjects via Disen-Mix Finetuning and a Human-in-the-Loop Re-finetuning strategy that tackle the attribute-binding problem of multi-subject generation.
  • results: Generates videos with new content such as new events and backgrounds tailored to the customized subjects; MultiStudioBench, a benchmark for customized multi-subject text-to-video generation, is also introduced.
    Abstract Customized text-to-video generation aims to generate text-guided videos with customized user-given subjects, which has gained increasing attention recently. However, existing works are primarily limited to generating videos for a single subject, leaving the more challenging problem of customized multi-subject text-to-video generation largely unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework. VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer leverages the pretrained Stable Diffusion with latent-code motion dynamics and temporal cross-frame attention as the base video generator. The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy, which can tackle the attribute binding problem of multi-subject generation. We also introduce MultiStudioBench, a benchmark for evaluating customized multi-subject text-to-video generation models. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content such as new events and backgrounds, tailored to the customized multiple subjects. Our project page is available at https://videodreamer23.github.io/.

CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation

  • paper_url: http://arxiv.org/abs/2311.00987
  • repo_url: None
  • paper_authors: Yiming Cui, Cheng Han, Dongfang Liu
  • for: An effective framework that performs object detection, instance segmentation, and multi-object tracking on video frames simultaneously, enabling instance-level visual analysis.
  • methods: Collaborative multi-task learning realized through a novel structure of associative connections among the detection, segmentation, and tracking task heads of an end-to-end learnable CNN, allowing information to propagate across related tasks so they benefit simultaneously.
  • results: Extensive evaluation on the KITTI MOTS and MOTS Challenge datasets yields quite encouraging results.
    Abstract The advancement of computer vision has pushed visual analysis tasks from still images to the video domain. In recent years, video instance segmentation, which aims to track and segment multiple objects in video frames, has drawn much attention for its potential applications in various emerging areas such as autonomous driving, intelligent transportation, and smart retail. In this paper, we propose an effective framework for instance-level visual analysis on video frames, which can simultaneously conduct object detection, instance segmentation, and multi-object tracking. The core idea of our method is collaborative multi-task learning which is achieved by a novel structure, named associative connections among detection, segmentation, and tracking task heads in an end-to-end learnable CNN. These additional connections allow information propagation across multiple related tasks, so as to benefit these tasks simultaneously. We evaluate the proposed method extensively on KITTI MOTS and MOTS Challenge datasets and obtain quite encouraging results.
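
The "associative connections" idea, with task heads feeding intermediate features to one another, can be pictured with the toy module below; the linear layers, fusion order, and dimensions are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class AssociativeHeads(nn.Module):
    """Toy multi-task heads with cross-task feature connections: segmentation
    also consumes detection features, tracking also consumes segmentation
    features, so information propagates across the related tasks."""
    def __init__(self, dim=256):
        super().__init__()
        self.det = nn.Linear(dim, dim)
        self.seg = nn.Linear(dim, dim)
        self.trk = nn.Linear(dim, dim)
        self.seg_fuse = nn.Linear(2 * dim, dim)   # associative connection det -> seg
        self.trk_fuse = nn.Linear(2 * dim, dim)   # associative connection seg -> trk

    def forward(self, feat):                      # feat: (B, dim) backbone feature
        det = torch.relu(self.det(feat))
        seg = torch.relu(self.seg_fuse(torch.cat([self.seg(feat), det], dim=-1)))
        trk = torch.relu(self.trk_fuse(torch.cat([self.trk(feat), seg], dim=-1)))
        return det, seg, trk

det, seg, trk = AssociativeHeads()(torch.randn(4, 256))
```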

M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2311.00986
  • repo_url: None
  • paper_authors: Hang Zhang
  • for: This work proposes a multi-view 3D object detection network using camera-only data and a Bird's-Eye-View map, targeting the key open problems of domain adaptation and visual data transfer.
  • methods: To address these challenges, the study develops a Transformer-based transfer-learning method and a detection head built on 3D anchor queries and 3D positional information, enabling data migration and efficient detection (a rough sketch follows this entry).
  • results: Through multi-dataset training, using only a small amount of source data and existing large-model pretrained weights, the network achieves competitive results on the new target domain; it also blends 3D information, used as available semantic information, with 2D multi-view image features in a visual-language transfer design.
    Abstract In this research, I proposed a network structure for multi-view 3D object detection using camera-only data and a Bird's-Eye-View map. My work is based on a current key challenge domain adaptation and visual data transfer. Although many excellent camera-only 3D object detection has been continuously proposed, many research work risk dramatic performance drop when the networks are trained on the source domain but tested on a different target domain. Then I found it is very surprising that predictions on bounding boxes and classes are still replied to on 2D networks. Based on the domain gap assumption on various 3D datasets, I found they still shared a similar data extraction on the same BEV map size and camera data transfer. Therefore, to analyze the domain gap influence on the current method and to make good use of 3D space information among the dataset and the real world, I proposed a transfer learning method and Transformer construction to study the 3D object detection on NuScenes-mini and Lyft. Through multi-dataset training and a detection head from the Transformer, the network demonstrated good data migration performance and efficient detection performance by using 3D anchor query and 3D positional information. Relying on only a small amount of source data and the existing large model pre-training weights, the efficient network manages to achieve competitive results on the new target domain. Moreover, my study utilizes 3D information as available semantic information and 2D multi-view image features blending into the visual-language transfer design. In the final 3D anchor box prediction and object classification, my network achieved good results on standard metrics of 3D object detection, which differs from dataset-specific models on each training domain without any fine-tuning.
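
As a rough picture of how learnable 3D anchor queries can drive a Transformer detection head over multi-view image features, consider the sketch below; the decoder layer, positional MLP, query count, and 7-parameter box output are illustrative assumptions, not the author's exact design.

```python
import torch
import torch.nn as nn

class Anchor3DQueryHead(nn.Module):
    """Learnable 3D anchor points are lifted into query embeddings via an MLP
    (injecting 3D positional information); a Transformer decoder refines the
    queries against flattened multi-view image features; a linear head
    regresses 3D boxes (center, size, yaw)."""
    def __init__(self, num_queries=900, dim=256):
        super().__init__()
        self.anchors = nn.Parameter(torch.rand(num_queries, 3))   # normalized xyz
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.box_head = nn.Linear(dim, 7)

    def forward(self, img_feats):                 # img_feats: (B, tokens, dim)
        q = self.pos_mlp(self.anchors).expand(img_feats.size(0), -1, -1)
        return self.box_head(self.decoder(q, img_feats))

boxes = Anchor3DQueryHead()(torch.randn(2, 1500, 256))   # -> (2, 900, 7)
```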

MAAIG: Motion Analysis And Instruction Generation

  • paper_url: http://arxiv.org/abs/2311.00980
  • repo_url: None
  • paper_authors: Wei-Hsin Yeh, Pei Hsin Lin, Yu-An Su, Wen Hsiang Cheng, Lun-Wei Ku
  • for: Provides automated guidance for self-directed sports training at home, helping users improve their skills and avoid injuries.
  • methods: The MAAIG framework analyzes user-provided sports action videos, generates an embedding vector for each frame (associated with the frame's 3D skeleton), and feeds these embeddings into a pretrained T5 model to produce coach-like sports instructions (see the interface sketch after this entry).
  • results: The system identifies potential issues and provides real-time guidance in a manner akin to professional coaches, helping users refine their technique and avoid injuries.
    Abstract Many people engage in self-directed sports training at home but lack the real-time guidance of professional coaches, making them susceptible to injuries or the development of incorrect habits. In this paper, we propose a novel application framework called MAAIG(Motion Analysis And Instruction Generation). It can generate embedding vectors for each frame based on user-provided sports action videos. These embedding vectors are associated with the 3D skeleton of each frame and are further input into a pretrained T5 model. Ultimately, our model utilizes this information to generate specific sports instructions. It has the capability to identify potential issues and provide real-time guidance in a manner akin to professional coaches, helping users improve their sports skills and avoid injuries.
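
Since per-frame motion embeddings are fed directly into a pretrained T5 to decode instructions, the interface presumably resembles the following; the `t5-small` checkpoint, the random stand-in embeddings, and matching the encoder's hidden size are assumptions for illustration, not the paper's setup.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Stand-in for MAAIG's per-frame embeddings (one vector per video frame,
# derived from the 3D skeleton); they must match T5's hidden size (512 here).
frame_embeddings = torch.randn(1, 120, model.config.d_model)

ids = model.generate(inputs_embeds=frame_embeddings, max_length=64)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```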

Overhead Line Defect Recognition Based on Unsupervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.00979
  • repo_url: None
  • paper_authors: Weixi Wang, Xichen Zhong, Xin Li, Sizhe Li, Xun Ma
  • for: automatic defect recognition in overhead lines
  • methods: Faster RCNN detection combined with unsupervised semantic segmentation to separate the device from its backdrop, followed by similarity measures and logical rules for defect categorization (sketched below after this entry)
  • results: improved accuracy and adaptability in identifying equipment issues
    Abstract Overhead line inspection greatly benefits from defect recognition using visible light imagery. Addressing the limitations of existing feature extraction techniques and the heavy data dependency of deep learning approaches, this paper introduces a novel defect recognition framework. This is built on the Faster RCNN network and complemented by unsupervised semantic segmentation. The approach involves identifying the type and location of the target equipment, utilizing semantic segmentation to differentiate between the device and its backdrop, and finally employing similarity measures and logical rules to categorize the type of defect. Experimental results indicate that this methodology focuses more on the equipment rather than the defects when identifying issues in overhead lines. This leads to a notable enhancement in accuracy and exhibits impressive adaptability. Thus, offering a fresh perspective for automating the inspection of distribution network equipment.
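
The final step, similarity measures plus logical rules over the segmented device region, might look roughly like the toy sketch below; the defect names, feature vectors, and threshold are all hypothetical.

```python
import numpy as np

def classify_defect(device_feat, templates, threshold=0.8):
    """Compare the segmented device's feature vector against per-defect-type
    templates by cosine similarity; a simple logical rule maps low similarity
    to 'normal'. All names and values here are illustrative."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scores = {name: cos(device_feat, t) for name, t in templates.items()}
    best, score = max(scores.items(), key=lambda kv: kv[1])
    return best if score >= threshold else "normal"

templates = {"broken_strand": np.random.rand(128), "missing_pin": np.random.rand(128)}
print(classify_defect(np.random.rand(128), templates))
```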

Lightweight super resolution network for point cloud geometry compression

  • paper_url: http://arxiv.org/abs/2311.00970
  • repo_url: https://github.com/lidq92/lsrn-pcgc
  • paper_authors: Wei Zhang, Dingquan Li, Ge Li, Wen Gao
  • for: This paper presents a point cloud geometry compression method built on a lightweight super-resolution network.
  • methods: The point cloud is first decomposed into a base point cloud and the interpolation patterns needed to reconstruct the original. Instead of compressing the interpolation patterns directly, a lightweight super-resolution network learns them through overfitting, and its parameters are transmitted to assist reconstruction at the decoder (see the sketch after this entry).
  • results: Experiments show that, compared with lookup-table-based methods, the approach obtains more accurate interpolation patterns by accessing a broader range of neighboring voxels at acceptable computational cost, achieving remarkable compression performance on the MPEG Cat1 (Solid) and Cat2 datasets.
    Abstract This paper presents an approach for compressing point cloud geometry by leveraging a lightweight super-resolution network. The proposed method involves decomposing a point cloud into a base point cloud and the interpolation patterns for reconstructing the original point cloud. While the base point cloud can be efficiently compressed using any lossless codec, such as Geometry-based Point Cloud Compression, a distinct strategy is employed for handling the interpolation patterns. Rather than directly compressing the interpolation patterns, a lightweight super-resolution network is utilized to learn this information through overfitting. Subsequently, the network parameter is transmitted to assist in point cloud reconstruction at the decoder side. Notably, our approach differentiates itself from lookup table-based methods, allowing us to obtain more accurate interpolation patterns by accessing a broader range of neighboring voxels at an acceptable computational cost. Experiments on MPEG Cat1 (Solid) and Cat2 datasets demonstrate the remarkable compression performance achieved by our method.
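
The key twist, overfitting a tiny network to each point cloud and transmitting its weights instead of the interpolation patterns, can be sketched as below; the MLP refinement, the MSE stand-in for a Chamfer-style set loss, and the toy data are assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class TinySRNet(nn.Module):
    """Upsamples a base point cloud by predicting k offset points per input
    point; small enough that its weights are cheap to transmit."""
    def __init__(self, k=4):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3 * k))

    def forward(self, base):                      # base: (N, 3)
        offsets = self.mlp(base).view(-1, self.k, 3)
        return (base.unsqueeze(1) + offsets).reshape(-1, 3)

base, target = torch.rand(256, 3), torch.rand(1024, 3)   # toy base/original clouds
net = TinySRNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):                              # deliberate per-cloud overfitting
    loss = ((net(base) - target) ** 2).mean()     # stand-in for a Chamfer loss
    opt.zero_grad(); loss.backward(); opt.step()
print(sum(p.numel() for p in net.parameters()), "weights to transmit")
```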

Detecting Generated Images by Real Images Only

  • paper_url: http://arxiv.org/abs/2311.00962
  • repo_url: https://github.com/molyswu/hand_detection
  • paper_authors: Xiuli Bi, Bo Liu, Fan Yang, Bin Xiao, Weisheng Li, Gao Huang, Pamela C. Cosman
  • for: This paper questions the authenticity of images produced by generative models and proposes a new detection perspective: start from real images, find their commonality, and map them into a dense subspace of feature space so that generated images, regardless of their generative model, are projected outside it.
  • methods: The detector is trained on real images only, mapping them into a dense feature subspace; it requires far less training data than existing deep-learning detectors while enabling efficient detection (a one-class baseline in this spirit is sketched after this entry).
  • results: Experiments show that, despite training on 99.9% less data than other deep-learning methods, the detector competes with state-of-the-art approaches, generalizes to emerging generative models, is robust against various post-processing, and offers high inference efficiency, making it practical for real-world use.
    Abstract As deep learning technology continues to evolve, the images yielded by generative models are becoming more and more realistic, triggering people to question the authenticity of images. Existing generated image detection methods detect visual artifacts in generated images or learn discriminative features from both real and generated images by massive training. This learning paradigm will result in efficiency and generalization issues, making detection methods always lag behind generation methods. This paper approaches the generated image detection problem from a new perspective: Start from real images. By finding the commonality of real images and mapping them to a dense subspace in feature space, the goal is that generated images, regardless of their generative model, are then projected outside the subspace. As a result, images from different generative models can be detected, solving some long-existing problems in the field. Experimental results show that although our method was trained only by real images and uses 99.9\% less training data than other deep learning-based methods, it can compete with state-of-the-art methods and shows excellent performance in detecting emerging generative models with high inference efficiency. Moreover, the proposed method shows robustness against various post-processing. These advantages allow the method to be used in real-world scenarios.
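
The "start from real images" idea, fitting a compact region of feature space to real images only and flagging anything that lands outside, admits a simple one-class baseline like the Mahalanobis sketch below; the paper learns the mapping, so the fixed Gaussian, feature dimension, and threshold here are assumptions.

```python
import numpy as np

def fit_real_region(real_feats):                  # (n, d) features of real images
    mu = real_feats.mean(axis=0)
    cov = np.cov(real_feats, rowvar=False) + 1e-6 * np.eye(real_feats.shape[1])
    return mu, np.linalg.inv(cov)

def looks_generated(feat, mu, cov_inv, tau=100.0):
    d = feat - mu
    return float(d @ cov_inv @ d) > tau           # outside the dense real region

mu, cov_inv = fit_real_region(np.random.randn(1000, 64))
print(looks_generated(np.random.randn(64) * 3.0, mu, cov_inv))
```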

Concatenated Masked Autoencoders as Spatial-Temporal Learner

  • paper_url: http://arxiv.org/abs/2311.00961
  • repo_url: https://github.com/minhoooo1/catmae
  • paper_authors: Zhouqiang Jiang, Bowen Wang, Tong Xiang, Zhaofeng Niu, Hong Tang, Guangshun Li, Liangzhi Li
  • for: The goal is to learn video representations that capture continuous motion and visual correspondences between frames.
  • methods: The paper proposes Concatenated Masked Autoencoders (CatMAE), a self-supervised method that keeps the initial frame intact, masks 95% of each subsequent frame, and uses an encoder-decoder pair in which the decoder reconstructs each masked frame from visible patches of both the previous and current frames (the masking scheme is sketched after this entry). A Video-Reverse (ViRe) augmentation, which uses reversed frames as reconstruction targets, further encourages the model to exploit motion details and correspondences.
  • results: Compared with the most advanced pre-training methods, CatMAE achieves leading results on video segmentation and action recognition tasks.
    Abstract Learning representations from videos requires understanding continuous motion and visual correspondences between frames. In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning. For the input sequence of video frames, CatMAE keeps the initial frame unchanged while applying substantial masking (95%) to subsequent frames. The encoder in CatMAE is responsible for encoding visible patches for each frame individually; subsequently, for each masked frame, the decoder leverages visible patches from both previous and current frames to reconstruct the original image. Our proposed method enables the model to estimate the motion information between visible patches, match the correspondences between preceding and succeeding frames, and ultimately learn the evolution of scenes. Furthermore, we propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets. This further encourages the model to utilize continuous motion details and correspondences to complete the reconstruction, thereby enhancing the model's capabilities. Compared to the most advanced pre-training methods, CatMAE achieves a leading level in video segmentation tasks and action recognition tasks.
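
The asymmetric masking scheme, first frame fully visible and 95% of each later frame hidden, is easy to picture with a small mask generator; random per-frame patch selection is an assumption about the sampling.

```python
import torch

def catmae_masks(num_frames, num_patches, mask_ratio=0.95):
    """Boolean masks per frame (True = masked). Frame 0 stays fully visible;
    every subsequent frame keeps only (1 - mask_ratio) of its patches, which
    the decoder must reconstruct from previous- and current-frame patches."""
    keep = max(1, int(num_patches * (1 - mask_ratio)))
    masks = torch.ones(num_frames, num_patches, dtype=torch.bool)
    masks[0] = False                              # initial frame unmasked
    for t in range(1, num_frames):
        masks[t, torch.randperm(num_patches)[:keep]] = False
    return masks

m = catmae_masks(num_frames=4, num_patches=196)
print(m.float().mean(dim=1))                      # ~[0.00, 0.95, 0.95, 0.95]
```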

Optimal Noise pursuit for Augmenting Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2311.00949
  • repo_url: None
  • paper_authors: Shijie Ma, Huayi Xu, Mengjian Li, Weidong Geng, Meng Wang, Yaxiong Wang
  • for: Improves the stability and quality of text-to-video generators, whose outputs otherwise vary markedly with the noise fed at inference.
  • methods: Approximates the optimal noise for a text prompt through a search-and-inversion pipeline: a video closely matching the prompt is retrieved from a predefined candidate pool and inverted into the noise space via the diffusion model's noise-video mapping (an inversion sketch follows this entry). A semantic-preserving rewriter additionally enriches the text prompt with reasonable details.
  • results: Extensive experiments on the WebVid-10M benchmark show clear gains for text-to-video models without introducing any optimization burden.
    Abstract Despite the remarkable progress in text-to-video generation, existing diffusion-based models often exhibit instability in terms of noise during inference. Specifically, when different noises are fed for the given text, these models produce videos that differ significantly in terms of both frame quality and temporal consistency. With this observation, we posit that there exists an optimal noise matched to each textual input; however, the widely adopted strategies of random noise sampling often fail to capture it. In this paper, we argue that the optimal noise can be approached through inverting the groundtruth video using the established noise-video mapping derived from the diffusion model. Nevertheless, the groundtruth video for the text prompt is not available during inference. To address this challenge, we propose to approximate the optimal noise via a search and inversion pipeline. Given a text prompt, we initially search for a video from a predefined candidate pool that closely relates to the text prompt. Subsequently, we invert the searched video into the noise space, which serves as an improved noise prompt for the textual input. In addition to addressing noise, we also observe that the text prompt with richer details often leads to higher-quality videos. Motivated by this, we further design a semantic-preserving rewriter to enrich the text prompt, where a reference-guided rewriting is devised for reasonable details compensation, and a denoising with a hybrid semantics strategy is proposed to preserve the semantic consistency. Extensive experiments on the WebVid-10M benchmark show that our proposed method can improve the text-to-video models with a clear margin, while introducing no optimization burden.
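
The inversion half of the search-and-inversion pipeline typically relies on deterministic DDIM inversion, sketched below with a dummy noise predictor; the schedule, step grid, and `eps_model` placeholder are assumptions standing in for the pretrained text-to-video diffusion model.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, steps):
    """Walk a clean latent x0 back toward noise: at each step, estimate the
    clean signal from the model's eps prediction, then re-noise it at the
    next (higher) noise level. The result serves as an improved noise prompt."""
    x = x0
    for t_prev, t in zip(steps[:-1], steps[1:]):  # increasing noise levels
        a_prev, a = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev)
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a.sqrt() * x0_pred + (1 - a).sqrt() * eps
    return x

alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
dummy_eps = lambda x, t: torch.zeros_like(x)      # placeholder predictor
noise = ddim_invert(torch.randn(1, 4, 8, 8), dummy_eps,
                    alphas_cumprod, list(range(0, 1000, 100)))
```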

SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data

  • paper_url: http://arxiv.org/abs/2311.00936
  • repo_url: https://github.com/rolnicklab/satbird
  • paper_authors: Mélisande Teng, Amna Elmustafa, Benjamin Akera, Yoshua Bengio, Hager Radi Abdelwahed, Hugo Larochelle, David Rolnick
  • for: Aims to improve biodiversity monitoring and ecosystem modeling, in support of conservation policy, by predicting species encounter rates from satellite imagery.
  • methods: Pairs satellite imagery with presence-absence species observations collected through the citizen-science database eBird, and supplies environmental data and species range maps for each location (a toy task setup is sketched after this entry).
  • results: Introduces the new task of predicting species encounter rates from satellite images, together with datasets for the USA (summer/breeding and winter seasons) and a low-data-regime dataset for Kenya, plus baseline benchmarks, enabling scalable biodiversity monitoring and ecosystem modeling.
    Abstract Biodiversity is declining at an unprecedented rate, impacting ecosystem services necessary to ensure food, water, and human health and well-being. Understanding the distribution of species and their habitats is crucial for conservation policy planning. However, traditional methods in ecology for species distribution models (SDMs) generally focus either on narrow sets of species or narrow geographical areas and there remain significant knowledge gaps about the distribution of species. A major reason for this is the limited availability of data traditionally used, due to the prohibitive amount of effort and expertise required for traditional field monitoring. The wide availability of remote sensing data and the growing adoption of citizen science tools to collect species observations data at low cost offer an opportunity for improving biodiversity monitoring and enabling the modelling of complex ecosystems. We introduce a novel task for mapping bird species to their habitats by predicting species encounter rates from satellite images, and present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird, considering summer (breeding) and winter seasons. We also provide a dataset in Kenya representing low-data regimes. We additionally provide environmental data and species range maps for each location. We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks. SatBird opens up possibilities for scalably modelling properties of ecosystems worldwide.
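
The task itself, regressing per-species encounter rates in [0, 1] from a satellite patch against eBird-derived targets, fits a standard multi-label setup like the sketch below; the band count, species count, backbone, and loss are illustrative assumptions, not the benchmark's baselines.

```python
import torch
import torch.nn as nn

NUM_SPECIES = 600                                 # illustrative species count
model = nn.Sequential(                            # tiny stand-in backbone
    nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),   # e.g. RGB + NIR bands
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, NUM_SPECIES),
)

patch = torch.randn(8, 4, 64, 64)                 # batch of satellite patches
rates = torch.sigmoid(model(patch))               # predicted encounter rates
target = torch.rand(8, NUM_SPECIES)               # rates from eBird checklists
loss = nn.functional.binary_cross_entropy(rates, target)
loss.backward()
```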

Towards High-quality HDR Deghosting with Conditional Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.00932
  • repo_url: None
  • paper_authors: Qingsen Yan, Tao Hu, Yuan Sun, Hao Tang, Yu Zhu, Wei Dong, Luc Van Gool, Yanning Zhang
  • for: Reconstructs High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images while suppressing the ghosting artifacts, caused by saturation and large motion, that hinder real-world application.
  • methods: A conditional diffusion model generates the HDR image from two components: a feature condition generator, which uses attention and a Domain Feature Alignment (DFA) layer to transform intermediate LDR features and avoid ghosting, and a noise predictor, which runs a stochastic iterative denoising process to steer sampling. A sliding-window noise estimator samples smooth noise patch-wise to mitigate the semantic confusion caused by LDR saturation (one plausible realization is sketched after this entry), and an image-space loss curbs color distortion.
  • results: Experiments on HDR imaging benchmark datasets show state-of-the-art performance and good generalization to real-world images.
    Abstract High Dynamic Range (HDR) images can be recovered from several Low Dynamic Range (LDR) images by existing Deep Neural Networks (DNNs) techniques. Despite the remarkable progress, DNN-based methods still generate ghosting artifacts when LDR images have saturation and large motion, which hinders potential applications in real-world scenarios. To address this challenge, we formulate the HDR deghosting problem as an image generation that leverages LDR features as the diffusion model's condition, consisting of the feature condition generator and the noise predictor. Feature condition generator employs attention and Domain Feature Alignment (DFA) layer to transform the intermediate features to avoid ghosting artifacts. With the learned features as conditions, the noise predictor leverages a stochastic iterative denoising process for diffusion models to generate an HDR image by steering the sampling process. Furthermore, to mitigate semantic confusion caused by the saturation problem of LDR images, we design a sliding window noise estimator to sample smooth noise in a patch-based manner. In addition, an image space loss is proposed to avoid the color distortion of the estimated HDR results. We empirically evaluate our model on benchmark datasets for HDR imaging. The results demonstrate that our approach achieves state-of-the-art performances and well generalization to real-world images.
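
A plausible reading of the sliding-window estimator is patch-wise application of the noise predictor with overlap averaging so that patch borders blend smoothly; the mechanism, window/stride values, and placeholder predictor below are assumptions about the paper's design, not its stated implementation.

```python
import torch

def sliding_window_predict(noisy, predictor, window=64, stride=32):
    """Run the noise predictor on overlapping windows and average the
    predictions where windows overlap, avoiding visible seams between
    patches. Window and stride values are illustrative."""
    c, h, w = noisy.shape
    acc = torch.zeros(c, h, w)
    cnt = torch.zeros(1, h, w)
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            patch = noisy[:, y:y + window, x:x + window]
            acc[:, y:y + window, x:x + window] += predictor(patch)
            cnt[:, y:y + window, x:x + window] += 1
    return acc / cnt.clamp(min=1)

dummy_predictor = lambda p: p * 0.5               # placeholder for the UNet
eps = sliding_window_predict(torch.randn(3, 128, 128), dummy_predictor)
```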

RPCANet: Deep Unfolding RPCA Based Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2311.00917
  • repo_url: None
  • paper_authors: Fengyi Wu, Tianfang Zhang, Lei Li, Yian Huang, Zhenming Peng
  • for: Improves both the accuracy and the interpretability of infrared small target detection (ISTD).
  • methods: Proposes RPCANet, an interpretable deep network that casts the detection task as sparse target extraction, low-rank background estimation, and image reconstruction under a relaxed Robust Principal Component Analysis (RPCA) model, then unfolds the iterative optimization steps into a deep-learning framework that combines learning with domain knowledge (the underlying iteration is sketched after this entry).
  • results: In experiments, RPCANet outperforms baseline methods in both qualitative and quantitative evaluations, detecting small targets with clear interpretability while preserving intrinsic image features.
    Abstract Deep learning (DL) networks have achieved remarkable performance in infrared small target detection (ISTD). However, these structures exhibit a deficiency in interpretability and are widely regarded as black boxes, as they disregard domain knowledge in ISTD. To alleviate this issue, this work proposes an interpretable deep network for detecting infrared dim targets, dubbed RPCANet. Specifically, our approach formulates the ISTD task as sparse target extraction, low-rank background estimation, and image reconstruction in a relaxed Robust Principle Component Analysis (RPCA) model. By unfolding the iterative optimization updating steps into a deep-learning framework, time-consuming and complex matrix calculations are replaced by theory-guided neural networks. RPCANet detects targets with clear interpretability and preserves the intrinsic image feature, instead of directly transforming the detection task into a matrix decomposition problem. Extensive experiments substantiate the effectiveness of our deep unfolding framework and demonstrate its trustworthy results, surpassing baseline methods in both qualitative and quantitative evaluations.
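
RPCANet unfolds a classical RPCA-style alternation into network stages; the plain (non-learned) iteration it starts from looks like the sketch below, with soft-thresholding for the sparse targets and singular-value thresholding for the low-rank background. In the paper these proximal steps become theory-guided sub-networks; the stage count and thresholds here are illustrative.

```python
import torch

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

def rpca_unrolled(D, stages=6, lam=0.1):
    """Alternate on D = B + T: T (sparse infrared targets) via soft-
    thresholding, B (low-rank background) via singular-value thresholding.
    Unfolding these `stages` iterations, with learned modules replacing the
    prox operators, yields RPCANet's interpretable stage-wise structure."""
    B, T = torch.zeros_like(D), torch.zeros_like(D)
    for _ in range(stages):
        T = soft_threshold(D - B, lam)                   # sparse target update
        U, S, Vh = torch.linalg.svd(D - T, full_matrices=False)
        B = U @ torch.diag(soft_threshold(S, lam)) @ Vh  # low-rank background
    return B, T

frame = torch.rand(128, 128)              # an infrared image as a matrix
background, targets = rpca_unrolled(frame)
```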