eess.SP - 2023-12-02

BER Analysis of SCMA-OFDM Systems in the Presence of Carrier Frequency Offset

  • paper_url: http://arxiv.org/abs/2312.01126
  • repo_url: None
  • paper_authors: Haibo Liu, Qu Luo, Zilong Liu, Shan Luo, Pei Xiao, Rongping Lin
  • for: This paper investigates the use of sparse code multiple access (SCMA) and orthogonal frequency division multiplexing (OFDM) to support massive connectivity in future machine-type communication networks.
  • methods: The bit error rate (BER) performance of SCMA-OFDM systems in the presence of carrier frequency offset (CFO) is studied over both Gaussian and multipath Rayleigh fading channel models.
  • results: Simulations validate the analysis of the CFO's impact on SCMA-OFDM systems and show that BER performance degrades significantly once the normalized CFO exceeds 0.02.
    Abstract Sparse code multiple access (SCMA) building upon orthogonal frequency division multiplexing (OFDM) is a promising wireless technology for supporting massive connectivity in future machine-type communication networks. However, the sensitivity of OFDM to carrier frequency offset (CFO) poses a major challenge because it leads to orthogonality loss and incurs intercarrier interference (ICI). In this paper, we investigate the bit error rate (BER) performance of SCMA-OFDM systems in the presence of CFO over both Gaussian and multipath Rayleigh fading channels. We first model the ICI in SCMA-OFDM as Gaussian variables conditioned on a single channel realization for fading channels. The BER is then evaluated by averaging over all codeword pairs considering the fading statistics. Through simulations, we validate the accuracy of our BER analysis and reveal that there is a significant BER degradation for SCMA-OFDM systems when the normalized CFO exceeds 0.02.
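
The effect the abstract analyzes is easy to reproduce in a toy simulation. The sketch below (a minimal numpy illustration of CFO-induced ICI in plain OFDM, not the paper's SCMA codebooks or BER derivation; the subcarrier count and QPSK loading are assumptions) applies a normalized CFO to one OFDM symbol and measures how much energy leaks off the diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256      # number of subcarriers (toy value)
eps = 0.02   # normalized CFO, the threshold the paper flags

# One OFDM symbol: unit-power QPSK on every subcarrier, IFFT to time domain.
bits = rng.integers(0, 2, (N, 2))
X = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)
x = np.fft.ifft(X) * np.sqrt(N)

# CFO rotates each time-domain sample by exp(j*2*pi*eps*n/N).
n = np.arange(N)
x_cfo = x * np.exp(1j * 2 * np.pi * eps * n / N)

# Back in the frequency domain the useful term shrinks; the rest is ICI.
Y = np.fft.fft(x_cfo) / np.sqrt(N)
signal_power = np.abs(np.vdot(X, Y) / N) ** 2   # matched (diagonal) component
ici_power = np.mean(np.abs(Y) ** 2) - signal_power
print(f"signal-to-ICI ratio: {10 * np.log10(signal_power / ici_power):.1f} dB")
```

Conditioning this leakage on a channel realization and treating it as Gaussian, as the abstract describes, is what makes the pairwise BER averaging tractable.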

Design and Performance Analysis of Index Modulation Empowered AFDM System

  • paper_url: http://arxiv.org/abs/2312.01125
  • repo_url: None
  • paper_authors: Jing Zhu, Qu Luo, Gaojie Chen, Pei Xiao, Lixia Xiao
  • for: This work aims to improve the bit error rate (BER) and energy efficiency (EE) performance of affine frequency division multiplexing (AFDM) by incorporating index modulation (IM).
  • methods: Information is conveyed both by $M$-ary constellation symbols and by the activation of chirp subcarrier (SC) indices, and two power allocation strategies, a power reallocation (PR) strategy and a power saving (PS) strategy, are proposed to improve BER and EE performance, respectively.
  • results: Simulation results show that the proposed AFDM-IM scheme achieves better BER performance than the conventional AFDM scheme.
    Abstract In this letter, we incorporate index modulation (IM) into affine frequency division multiplexing (AFDM), called AFDM-IM, to enhance the bit error rate (BER) and energy efficiency (EE) performance. In this scheme, the information bits are conveyed not only by $M$-ary constellation symbols, but also by the activation of the chirp subcarriers (SCs) indices, which are determined based on the incoming bit streams. Then, two power allocation strategies, namely power reallocation (PR) strategy and power saving (PS) strategy, are proposed to enhance BER and EE performance, respectively. Furthermore, the average bit error probability (ABEP) is theoretically analyzed. Simulation results demonstrate that the proposed AFDM-IM scheme achieves better BER performance than the conventional AFDM scheme.
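
The index-modulation bit mapping is worth making concrete. The toy below (an illustrative IM mapper over a generic group of subcarriers; the group size, activation count, and QPSK loading are assumptions, and the AFDM chirp domain and PR/PS power strategies are not modeled) splits incoming bits into index bits, which select the active subcarriers, and symbol bits, which load them:

```python
from itertools import combinations
from math import comb, floor, log2

import numpy as np

n, k = 4, 2                              # subcarriers per group, active ones
index_bits = floor(log2(comb(n, k)))     # bits carried by the activation pattern
patterns = list(combinations(range(n), k))[: 2 ** index_bits]
QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def modulate_group(bits):
    """Map index_bits + 2*k symbol bits to one sparse group of n subcarriers."""
    idx = int("".join(map(str, bits[:index_bits])), 2)
    group = np.zeros(n, dtype=complex)
    for j, sc in enumerate(patterns[idx]):
        b0, b1 = bits[index_bits + 2 * j : index_bits + 2 * j + 2]
        group[sc] = QPSK[2 * b0 + b1]
    return group

print(modulate_group([1, 0, 1, 1, 0, 1]))  # positions carry 2 extra bits
```

Leaving some subcarriers silent is also where the energy-efficiency argument enters; per the abstract, the PS strategy targets EE while PR reallocates power to improve BER.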

Joint Multiple FMCW Chirp Sequence Processing for Velocity Estimation and Ambiguity Resolving

  • paper_url: http://arxiv.org/abs/2312.01123
  • repo_url: None
  • paper_authors: Tarik Kazaz, Karan Jayachandra, Arie Koppellar, Yiting Lu
  • for: This paper aims to improve the accuracy and resolution of velocity estimation in FMCW automotive radar systems by jointly processing multiple chirp sequences and resolving possible ambiguities.
  • methods: The proposed algorithm uses a novel approach that combines the strengths of multiple chirp sequences with co-prime delay shifts, while avoiding the limitations of classical spectral estimation techniques based on FFT.
  • results: The proposed algorithm achieves statistically efficient and gridless velocity estimation with super-resolution properties, outperforming traditional methods in terms of resolution and accuracy. These results are validated through numerical simulations and experiments with an automotive radar IC.
    Abstract In FMCW automotive radar applications, it is often a challenge to design a chirp sequence that satisfies the requirements set by practical driving scenarios and simultaneously enables high range resolution, large maximum range, and unambiguous velocity estimation. To support long-range scenarios the chirps should have a sufficiently long duration compared to their bandwidth. At the same time, the long chirps result in ambiguous velocity estimation for targets with high velocity. The problem of velocity ambiguity is often solved by using multiple chirp sequences with co-prime delay shifts between them. However, coherent processing of multiple chirp sequences is not possible using classical spectral estimation techniques based on Fast Fourier Transform (FFT). This results in statistically not efficient velocity estimation and loss of processing gain. In this work, we propose an algorithm that can jointly process multiple chirp sequences and resolve possible ambiguities present in the velocities estimates. The resulting algorithm is statistically efficient and gridless. Furthermore, it increases the resolution of velocity estimation beyond the natural resolution due to its super-resolution properties. These results are confirmed by both numerical simulations and experiments with automotive radar IC.
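
The ambiguity-resolution idea is easiest to see with two aliased measurements. The sketch below (a toy grid search with invented intervals and noise, standing in for the paper's joint, gridless estimator) unfolds each measurement by integer multiples of its unambiguous interval and keeps the hypothesis on which both sequences agree:

```python
import numpy as np

rng = np.random.default_rng(1)
v_max = np.array([17.0, 19.0])   # co-prime unambiguous intervals (m/s), assumed
v_true = 73.0                    # true radial velocity, outside both intervals

# Each chirp sequence observes the velocity folded into [0, v_max_i), plus noise.
v_alias = (v_true % v_max) + rng.normal(0.0, 0.05, 2)

# Classic hypothesis search: unfold with integer wraps, keep the best agreement.
best = None
for m0 in range(8):
    for m1 in range(8):
        v0 = v_alias[0] + m0 * v_max[0]
        v1 = v_alias[1] + m1 * v_max[1]
        if best is None or abs(v0 - v1) < best[0]:
            best = (abs(v0 - v1), 0.5 * (v0 + v1))

print(f"resolved velocity ~ {best[1]:.2f} m/s (true {v_true} m/s)")
```

FFT processing of each sequence separately would stop at the aliased estimates; the paper's contribution is to process the sequences coherently, so that unfolding and estimation happen jointly and without a grid.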

Prior-Aware Robust Beam Alignment for Low-SNR Millimeter-Wave Communications

  • paper_url: http://arxiv.org/abs/2312.01100
  • repo_url: None
  • paper_authors: Jihun Park, Yongjeong Oh, Jaewon Yun, Seonjung Kim, Yo-Seb Jeon
  • for: This technique aims to make millimeter-wave beam alignment robust in low signal-to-noise ratio (SNR) environments.
  • methods: The most probable beam candidates are transmitted repeatedly to reduce the probability of noise-induced beam misalignment. For a given beam training overhead, both the selection of beam candidates and the number of repetitions per candidate are optimized based on channel prior information, with a deep neural network learning the prior probability of the optimal beam at each location.
  • results: Simulations on the DeepMIMO dataset show greater robustness in dynamic low-SNR communication environments than existing beam alignment techniques.
    Abstract This paper presents a robust beam alignment technique for millimeter-wave communications in low signal-to-noise ratio (SNR) environments. The core strategy of our technique is to repeatedly transmit the most probable beam candidates to reduce beam misalignment probability induced by noise. Specifically, for a given beam training overhead, both the selection of candidates and the number of repetitions for each beam candidate are optimized based on channel prior information. To achieve this, a deep neural network is employed to learn the prior probability of the optimal beam at each location. The beam misalignment probability is then analyzed based on the channel prior, forming the basis for an optimization problem aimed at minimizing the analyzed beam misalignment probability. A closed-form solution is derived for a special case with two beam candidates, and an efficient algorithm is developed for general cases with multiple beam candidates. Simulation results using the DeepMIMO dataset demonstrate the superior performance of our technique in dynamic low-SNR communication environments when compared to existing beam alignment techniques.
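
A toy Monte-Carlo conveys why spending the training budget on repetitions of likely beams helps at low SNR. The sketch below is illustrative only: the prior, SNR, and beam counts are invented, a simple energy detector replaces the paper's analysis, and the paper's optimized candidate/repetition split (learned per location) is replaced by a fixed top-4 rule:

```python
import numpy as np

rng = np.random.default_rng(2)
n_beams, snr_lin, trials, budget = 16, 0.5, 5000, 16
prior = np.exp(-0.5 * np.arange(n_beams))   # assumed peaked prior over beams
prior /= prior.sum()

def misalignment(candidates, repeats):
    """Measure each candidate `repeats` times, pick the strongest."""
    errors = 0
    for _ in range(trials):
        opt = rng.choice(n_beams, p=prior)
        if opt not in candidates:            # optimal beam never measured
            errors += 1
            continue
        mean = np.where(candidates == opt, np.sqrt(snr_lin), 0.0)
        y = mean + rng.normal(0, 1, len(candidates)) / np.sqrt(repeats)
        errors += int(candidates[np.argmax(y)] != opt)
    return errors / trials

sweep = np.arange(n_beams)                   # exhaustive sweep, one shot each
top4 = np.argsort(prior)[-4:]                # four likeliest beams, four shots each
print("sweep all, r=1:", misalignment(sweep, budget // n_beams))
print("top-4,     r=4:", misalignment(top4, budget // 4))
```

With the same 16-measurement budget, concentrating repetitions on high-prior candidates averages down the noise exactly where it matters, which is the trade-off the paper optimizes analytically.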

Hybrid Hierarchical DRL Enabled Resource Allocation for Secure Transmission in Multi-IRS-Assisted Sensing-Enhanced Spectrum Sharing Networks

  • paper_url: http://arxiv.org/abs/2312.01071
  • repo_url: None
  • paper_authors: Lingyi Wang, Wei Wu, Fuhui Zhou, Qihui Wu, Octavia A. Dobre, Tony Q. S. Quek
  • for: The paper explores the potential of intelligent reflective surfaces (IRSs) in enhancing spectrum sharing and secure transmission performance in wireless networks.
  • methods: An intelligent resource allocation scheme based on the double deep Q networks (D3QN) algorithm and the soft Actor-Critic (SAC) algorithm is proposed to maximize the secure transmission rate of the secondary network by jointly optimizing IRS pairings, subchannel assignment, transmit beamforming of the secondary base station, reflection coefficients of the IRSs, and the sensing time.
  • results: The proposed intelligent scheme efficiently enhances the secrecy rate and spectrum utilization of the secondary network, and the hierarchical reinforcement learning method effectively tackles the sparse reward problem caused by the large number of reflection elements across multiple IRSs. Multi-IRS designs improve security performance and spectrum utilization, but inappropriate deployment of IRSs can reduce security performance in the presence of multiple eavesdroppers (Eves).
    Abstract Secure communications are of paramount importance in spectrum sharing networks due to the allocation and sharing characteristics of spectrum resources. To further explore the potential of intelligent reflective surfaces (IRSs) in enhancing spectrum sharing and secure transmission performance, a multiple intelligent reflection surface (multi-IRS)-assisted sensing-enhanced wideband spectrum sharing network is investigated by considering physical layer security techniques. An intelligent resource allocation scheme based on double deep Q networks (D3QN) algorithm and soft Actor-Critic (SAC) algorithm is proposed to maximize the secure transmission rate of the secondary network by jointly optimizing IRS pairings, subchannel assignment, transmit beamforming of the secondary base station, reflection coefficients of IRSs and the sensing time. To tackle the sparse reward problem caused by a significant amount of reflection elements of multiple IRSs, the method of hierarchical reinforcement learning is exploited. An alternative optimization (AO)-based conventional mathematical scheme is introduced to verify the computational complexity advantage of our proposed intelligent scheme. Simulation results demonstrate the efficiency of our proposed intelligent scheme as well as the superiority of multi-IRS design in enhancing secrecy rate and spectrum utilization. It is shown that inappropriate deployment of IRSs can reduce the security performance with the presence of multiple eavesdroppers (Eves), and the arrangement of IRSs deserves further consideration.

Covert Communications in STAR-RIS-Aided Rate-Splitting Multiple Access Systems

  • paper_url: http://arxiv.org/abs/2312.01042
  • repo_url: None
  • paper_authors: Heng Chang, Hai Yang, Shuobo Xu, Xiyu Pang, Hongwu Liu
  • for: investigates covert communications in a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-aided rate-splitting multiple access (RSMA) system
  • methods: derives an analytical expression for the minimum average detection error probability of Willie, based on which a covert rate maximization problem is formulated
  • results: jointly optimizes the transmit power allocation, common rate allocation, and STAR-RIS reflection/transmission beamforming to maximize Bob's covert rate while confusing Willie's monitoring, subject to Grace's quality of service (QoS) requirements
    Abstract In this paper, we investigate covert communications in a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-aided rate-splitting multiple access (RSMA) system. Under the RSMA principles, the messages for the covert user (Bob) and public user (Grace) are converted to the common and private streams at the legitimate transmitter (Alice) to realize downlink transmissions, while the STAR-RIS is deployed not only to aid the public transmissions from Alice to Grace, but also to shield the covert transmissions from Alice to Bob against the warden (Willie). To characterize the covert performance of the considered STAR-RIS-aided RSMA (STAR-RIS-RSMA) system, we derive analytical expression for the minimum average detection error probability of Willie, based on which a covert rate maximization problem is formulated. To maximize Bob's covert rate while confusing Willie's monitoring, the transmit power allocation, common rate allocation, and STAR-RIS reflection/transmission beamforming are jointly optimized subject to Grace's quality of service (QoS) requirements. The non-convex covert rate maximization problem, consisting of highly coupled system parameters are decoupled into three sub-problems of transmit power allocation, common rate allocation, and STAR-RIS reflection/transmission beamforming, respectively. To obtain the rank-one constrained optimal solution for the sub-problem of optimizing the STAR-RIS reflection/transmission beamforming, a penalty-based successive convex approximation scheme is developed. Moreover, an alternative optimization (AO) algorithm is designed to determine the optimal solution for the sub-problem of optimizing the transmit power allocation, while the original problem is overall solved by a new AO algorithm.
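
For context, Willie's minimum average detection error probability, the quantity the abstract derives in closed form for this system, is conventionally tied to the distance between his two observation distributions. A standard relation from the covert-communication literature (general background, not the paper's specific expression) is

$$ \xi^{*} \;=\; 1 - \mathcal{V}_T\left(\mathbb{P}_0, \mathbb{P}_1\right) \;\geq\; 1 - \sqrt{\tfrac{1}{2}\, D\left(\mathbb{P}_0 \,\|\, \mathbb{P}_1\right)}, $$

where $\mathbb{P}_0$ and $\mathbb{P}_1$ are Willie's received-signal distributions when Alice is silent and when she transmits, $\mathcal{V}_T$ is the total variation distance, and the bound follows from Pinsker's inequality. Covert rate maximization then amounts to keeping these two distributions close while still serving Bob and meeting Grace's QoS.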

Perceptive, Resilient, and Efficient Networks assisted by Reconfigurable Intelligent Surfaces

  • paper_url: http://arxiv.org/abs/2312.01009
  • repo_url: None
  • paper_authors: Giorgos Stratidakis, Sotiris Droulias, Angeliki Alexiou
  • for: This work explores how advanced wavefront engineering with reconfigurable intelligent surfaces (RISs) can make high-frequency wireless networks more efficient and resilient.
  • methods: By controlling the multitude of scatterers on the RIS's large surface, the incident wavefront is transformed into a non-trivial reflected beam that addresses the challenges of high frequencies more efficiently than conventional beam-forming.
  • results: Advanced wavefront engineering with RISs enables beam profiles that can focus, bend, and self-heal, going beyond the current state of the art; a localization technique based on a hybrid beam-forming/beam-focusing scheme is also demonstrated.
    Abstract Wireless communications are nowadays shifting to higher operation frequencies with the aim to meet the ever-increasing demand for bandwidth. While reconfigurable intelligent surfaces (RISs) are usually envisioned to restore the line-of-sight of blocked links and to efficiently counteract the increased pathloss, their functionalities can extend far beyond these basic operations. Owing to their large surface and the multitude of scatterers, RISs can be exploited to perform advanced wavefront engineering, essentially transforming the incident beam into a non-trivial reflected beam that is able to address the challenges of high frequencies more efficiently than conventional beam-forming. In this paper it is demonstrated how advanced wavefront engineering with RISs enables beam profiles that are able to focus, bend and self-heal, thus offering functionalities beyond the current state-of-the-art. Their potential as enablers of perceptive, resilient, and efficient networks is discussed, and a localization technique based on a hybrid beam-forming/beam-focusing scheme is demonstrated.

Securing the Sensing Functionality in ISAC Networks: An Artificial Noise Design

  • paper_url: http://arxiv.org/abs/2312.00981
  • repo_url: None
  • paper_authors: Jiaqi Zou, Christos Masouros, Fan Liu, Songlin Sun
  • for: This work aims to enhance the sensing security of ISAC systems by protecting sensing information against malicious illegitimate eavesdroppers (Eves).
  • methods: A beamforming design maximizes the mutual information (MI) for the legitimate radar sensing receiver while constraining the MI available to the Eve and the quality of service to the communication users (CUs); artificial noise (AN)-aided beamforming is further considered to strengthen sensing security.
  • results: Simulations show that the proposed methods improve the MI of the legitimate receiver while limiting the sensing MI of the Eve compared with the baseline scheme, and that the use of AN further improves sensing security.
    Abstract Integrated sensing and communications (ISAC) systems employ dual-functional signals to simultaneously accomplish radar sensing and wireless communication tasks. However, ISAC systems open up new sensing security vulnerabilities to malicious illegitimate eavesdroppers (Eves) that can also exploit the transmitted waveform to extract sensing information from the environment. In this paper, we investigate the beamforming design to enhance the sensing security of an ISAC system, where the communication user (CU) serves as a sensing Eve. Our objective is to maximize the mutual information (MI) for the legitimate radar sensing receiver while considering the constraint of the MI for the Eve and the quality of service to the CUs. Then, we consider the artificial noise (AN)-aided beamforming to further enhance the sensing security. Simulation results demonstrate that our proposed methods achieve MI improvement of the legitimate receiver while limiting the sensing MI of the Eve, compared with the baseline scheme, and that the utilization of AN further contributes to sensing security.

Optimal Placement of Transmissive RIS in the Near Field for Capacity Maximization in THz Communications

  • paper_url: http://arxiv.org/abs/2312.00977
  • repo_url: None
  • paper_authors: Nithish Sharvirala, Amine Mezghani, Ekram Hossain
  • for: This study investigates Line-of-Sight (LoS) MIMO communication in the Terahertz (THz) bands enabled by a transmissive reconfigurable intelligent surface (RIS).
  • methods: The RIS renders the curvature of the wavefront apparent over the transmit and receive arrays even when they are positioned in the far field from each other, which enhances spatial multiplexing.
  • results: Simulation results show that the optimal near-field placement of the RIS depends not only on its proximity to the transmitter (Tx) or receiver (Rx) but also on the inter-antenna spacing of the Tx and Rx.
    Abstract This study centers on Line-of-Sight (LoS) MIMO communication enabled by a Transmissive Reconfigurable Intelligent Surface (RIS) operating in the Terahertz (THz) frequency bands. The study demonstrates that the introduction of RIS can render the curvature of the wavefront apparent over the transmit and receive arrays, even when they are positioned in the far field from each other. This phenomenon contributes to an enhancement in spatial multiplexing. Notably, simulation results underline that the optimal placement of the RIS in the near-field is not solely contingent on proximity to the transmitter (Tx) or receiver (Rx) but relies on the inter-antenna spacing of the Tx and Rx.
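
The near/far-field boundary underlying the placement question is conventionally the Rayleigh distance (standard array theory, not a result of this paper):

$$ d_R = \frac{2D^2}{\lambda}, $$

where $D$ is the largest aperture dimension, which grows with the array size and hence with the inter-antenna spacing of the Tx and Rx, and $\lambda$ is the carrier wavelength. At THz frequencies $\lambda$ is sub-millimeter, so $d_R$ can reach tens of meters; this is why wavefront curvature over the RIS becomes exploitable and why the optimal RIS placement depends on the antenna spacing rather than only on the distance to the Tx or Rx.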

cs.SD - 2023-12-01

Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

  • paper_url: http://arxiv.org/abs/2312.00476
  • repo_url: https://github.com/bingyang-20/sar-ssl
  • paper_authors: Bing Yang, Xiaofei Li
  • for: This paper proposes a self-supervised learning method that exploits unlabeled multi-channel microphone signals for spatial acoustic parameter estimation.
  • methods: A new pretext task, cross-channel signal reconstruction (CCSR), learns a universal spatial acoustic representation from unlabeled signals: part of one channel's signal is masked and the model reconstructs it using the other channel, learning spatial acoustic information from the unmasked signals and source information from the other microphone channel. An encoder-decoder structure disentangles the two kinds of information, and after fine-tuning, the pre-trained spatial encoder can estimate spatial acoustic parameters.
  • results: Experiments show that the method estimates multiple spatial acoustic parameters well, including time difference of arrival, direct-to-reverberant ratio, and reverberation time, and demonstrate its feasibility and effectiveness on real-world data.
    Abstract Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
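
One CCSR training step can be sketched compactly. The snippet below is a schematic PyTorch rendering under assumed shapes: a plain transformer encoder stands in for the paper's MC-Conformer, the masking ratio and feature layout are invented, and real STFT features would replace the random tensors. It masks frames of one microphone channel and regresses them from the unmasked content plus the second channel:

```python
import torch
import torch.nn as nn

B, T, F = 8, 64, 32                  # batch, STFT frames, frequency bins (toy)
dim = 128
encoder = nn.TransformerEncoder(     # stand-in for the MC-Conformer encoder
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
embed = nn.Linear(2 * F, dim)        # both channels stacked per frame
decode = nn.Linear(dim, F)           # reconstruct channel-0 bins per frame

x = torch.randn(B, T, 2, F)          # magnitude STFTs of a 2-mic recording
mask = torch.rand(B, T) < 0.5        # hide half of channel 0's frames

x_in = x.clone()
x_in[:, :, 0, :][mask] = 0.0         # channel 1 stays fully visible
recon = decode(encoder(embed(x_in.flatten(2))))

# Loss only at masked positions: the model must infer them from unmasked
# frames (spatial cues) and from the other channel (source content).
loss = ((recon - x[:, :, 0, :]) ** 2)[mask].mean()
loss.backward()
```

After this pretraining, the spatial encoder is kept and fine-tuned on a small annotated dataset to estimate the acoustic parameters.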

eess.AS - 2023-12-01

SPIRE-SIES: A Spontaneous Indian English Speech Corpus

  • paper_url: http://arxiv.org/abs/2312.00698
  • repo_url: None
  • paper_authors: Abhayjeet Singh, Charu Shah, Rajashri Varadaraj, Sonakshi Chauhan, Prasanta Kumar Ghosh
  • for: This work develops a 170.83-hour spontaneous Indian English speech corpus to support building speech systems adapted to the Indian speech style.
  • methods: Images are used as stimuli to induce spontaneity in speech, and corpus quality is validated through voice activity detection-based segmentation, gender verification, and image-semantic correlation.
  • results: Transcripts for 23 hours are generated and validated, creating a spontaneous speech ASR benchmark and demonstrating the relationship between image stimuli and the recorded speech.
    Abstract In this paper, we present a 170.83 hour Indian English spontaneous speech dataset. Lack of Indian English speech data is one of the major hindrances in developing robust speech systems which are adapted to the Indian speech style. Moreover this scarcity is even more for spontaneous speech. This corpus is crowd sourced over varied Indian nativities, genders and age groups. Traditional spontaneous speech collection strategies involve capturing of speech during interviewing or conversations. In this study, we use images as stimuli to induce spontaneity in speech. Transcripts for 23 hours is generated and validated which can serve as a spontaneous speech ASR benchmark. Quality of the corpus is validated with voice activity detection based segmentation, gender verification and image semantic correlation. Which determines a relationship between image stimulus and recorded speech using caption keywords derived from Image2Text model and high occurring words derived from whisper ASR generated transcripts.

cs.CV - 2023-12-01

Consistent Mesh Diffusion

  • paper_url: http://arxiv.org/abs/2312.00971
  • repo_url: None
  • paper_authors: Julian Knodt, Xifeng Gao
  • for: Generating textures for 3D meshes from text prompts.
  • methods: A single depth-to-image diffusion network unifies the diffusion paths of multiple 2D images and hoists them to 3D with MultiDiffusion, producing a single texture that is consistent when rendered on the 3D surface.
  • results: Takes about 5 minutes per mesh across 30 models; evaluation with CLIP-score and Frechet Inception Distance shows an improvement over prior methods.
    Abstract Given a 3D mesh with a UV parameterization, we introduce a novel approach to generating textures from text prompts. While prior work uses optimization from Text-to-Image Diffusion models to generate textures and geometry, this is slow and requires significant compute resources. Alternatively, there are projection based approaches that use the same Text-to-Image models that paint images onto a mesh, but lack consistency at different viewing angles, we propose a method that uses a single Depth-to-Image diffusion network, and generates a single consistent texture when rendered on the 3D surface by first unifying multiple 2D image's diffusion paths, and hoisting that to 3D with MultiDiffusion~\cite{multidiffusion}. We demonstrate our approach on a dataset containing 30 meshes, taking approximately 5 minutes per mesh. To evaluate the quality of our approach, we use CLIP-score~\cite{clipscore} and Frechet Inception Distance (FID)~\cite{frechet} to evaluate the quality of the rendering, and show our improvement over prior work.

Improve Supervised Representation Learning with Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2312.00950
  • repo_url: None
  • paper_authors: Kaifeng Chen, Daniel Salz, Huiwen Chang, Kihyuk Sohn, Dilip Krishnan, Mojtaba Seyedhosseini
  • for: Improving the quality of representation learning in computer vision.
  • methods: Masked image modeling (MIM), a self-supervised learning approach, is integrated into existing supervised training paradigms.
  • results: Improves downstream tasks such as classification, image retrieval, and semantic segmentation, with strong results on ImageNet-1k classification and k-nearest-neighbor image retrieval.
    Abstract Training visual embeddings with labeled data supervision has been the de facto setup for representation learning in computer vision. Inspired by recent success of adopting masked image modeling (MIM) in self-supervised representation learning, we propose a simple yet effective setup that can easily integrate MIM into existing supervised training paradigms. In our design, in addition to the original classification task applied to a vision transformer image encoder, we add a shallow transformer-based decoder on top of the encoder and introduce an MIM task which tries to reconstruct image tokens based on masked image inputs. We show with minimal change in architecture and no overhead in inference that this setup is able to improve the quality of the learned representations for downstream tasks such as classification, image retrieval, and semantic segmentation. We conduct a comprehensive study and evaluation of our setup on public benchmarks. On ImageNet-1k, our ViT-B/14 model achieves 81.72% validation accuracy, 2.01% higher than the baseline model. On K-Nearest-Neighbor image retrieval evaluation with ImageNet-1k, the same model outperforms the baseline by 1.32%. We also show that this setup can be easily scaled to larger models and datasets. Code and checkpoints will be released.
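
The setup reduces to one extra head and one extra loss term. The sketch below is a schematic PyTorch rendering under assumed shapes: small stand-in transformers replace the ViT encoder and the paper's shallow decoder, the masking ratio and loss weighting are invented, and random tensors replace real patch embeddings and reconstruction targets:

```python
import torch
import torch.nn as nn

B, N, D, C, P = 16, 196, 384, 1000, 768   # batch, patches, width, classes, pixels
encoder = nn.TransformerEncoder(          # stand-in for the ViT image encoder
    nn.TransformerEncoderLayer(D, nhead=6, batch_first=True), num_layers=4)
decoder = nn.TransformerEncoder(          # the shallow decoder on top
    nn.TransformerEncoderLayer(D, nhead=6, batch_first=True), num_layers=1)
cls_head, pix_head = nn.Linear(D, C), nn.Linear(D, P)
mask_tok = nn.Parameter(torch.zeros(D))

tokens = torch.randn(B, N, D)             # patch embeddings of an image batch
target = torch.randn(B, N, P)             # per-patch reconstruction targets
labels = torch.randint(0, C, (B,))
mask = torch.rand(B, N) < 0.4             # assumed mask ratio

x = torch.where(mask[..., None], mask_tok.expand(B, N, D), tokens)
feats = encoder(x)
logits = cls_head(feats.mean(dim=1))      # supervised classification branch
recon = pix_head(decoder(feats))          # MIM reconstruction branch

loss = nn.functional.cross_entropy(logits, labels) \
     + ((recon - target) ** 2)[mask].mean()
loss.backward()                           # one joint objective, shared encoder
```

Because the decoder is shallow and dropped at inference time, the classification path is unchanged, which is how the paper adds MIM with "no overhead in inference".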

Object 6D pose estimation meets zero-shot learning

  • paper_url: http://arxiv.org/abs/2312.00947
  • repo_url: None
  • paper_authors: Andrea Caraffa, Davide Boscaini, Amir Hamza, Fabio Poiesi
  • for: Improving the accuracy of zero-shot object 6D pose estimation.
  • methods: combines geometric descriptors learned from point cloud data with visual features learned from large-scale web images to produce distinctive 3D point-level descriptors
  • results: outperforms all state-of-the-art zero-shot object 6D pose estimation approaches and ranks first in the BOP Benchmark under the category Task 4: 6D localization of unseen objects
    Abstract Object 6D pose estimation methods can achieve high accuracy when trained and tested on the same objects. However, estimating the pose of objects that are absent at training time is still a challenge. In this work, we advance the state-of-the-art in zero-shot object 6D pose estimation by proposing the first method that fuses the contribution of pre-trained geometric and vision foundation models. Unlike state-of-the-art approaches that train their pipeline on data specifically crafted for the 6D pose estimation task, our method does not require task-specific finetuning. Instead, our method, which we name PoMZ, combines geometric descriptors learned from point cloud data with visual features learned from large-scale web images to produce distinctive 3D point-level descriptors. By applying an off-the-shelf registration algorithm, like RANSAC, PoMZ outperforms all state-of-the-art zero-shot object 6D pose estimation approaches. We extensively evaluate PoMZ across the seven core datasets of the BOP Benchmark, encompassing over a hundred objects and 20 thousand images captured in diverse scenarios. PoMZ ranks first in the BOP Benchmark under the category Task 4: 6D localization of unseen objects. We will release the source code publicly.
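
The fuse-match-register pipeline can be sketched end to end. The toy below (numpy only; random features replace the learned geometric and visual descriptors, and a closed-form Kabsch fit on descriptor matches stands in for a full RANSAC loop) recovers a known rigid motion from fused point-level descriptors:

```python
import numpy as np

rng = np.random.default_rng(3)

def fuse(geo, vis):
    """Concatenate L2-normalized geometric and visual descriptors per point."""
    g = geo / np.linalg.norm(geo, axis=1, keepdims=True)
    v = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    return np.concatenate([g, v], axis=1)

def kabsch(src, dst):
    """Closed-form rigid (R, t) minimizing ||R @ src_i + t - dst_i||."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

P = rng.normal(size=(200, 3))                        # model points
th = 0.4
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([0.5, -0.2, 0.1])        # observed scene points

desc_P = fuse(rng.normal(size=(200, 32)), rng.normal(size=(200, 64)))
desc_Q = desc_P.copy()              # in reality: predicted from the observation
matches = np.argmax(desc_P @ desc_Q.T, axis=1)       # nearest-descriptor match
R, t = kabsch(P, Q[matches])        # RANSAC would wrap this to reject outliers
print(np.allclose(R, R_true), np.round(t, 3))
```

The zero-shot property comes entirely from the descriptors: nothing in the matching or registration stage is trained on the test objects.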

Enhancing Diffusion Models with 3D Perspective Geometry Constraints

  • paper_url: http://arxiv.org/abs/2312.00944
  • repo_url: None
  • paper_authors: Rishi Upadhyay, Howard Zhang, Yunhao Ba, Ethan Yang, Blake Gella, Sicheng Jiang, Alex Wong, Achuta Kadambi
  • for: Improving perspective accuracy in image synthesis methods.
  • methods: A geometric constraint is introduced into the training of generative models to enforce perspective accuracy.
  • results: Models trained with this constraint generate more realistic images, and downstream models trained on the generated images perform better.
    Abstract While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Since these methods are capable of outputting a wide gamut of possible images, it is difficult for these synthesized images to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer.

Zero-Shot Video Question Answering with Procedural Programs

  • paper_url: http://arxiv.org/abs/2312.00937
  • repo_url: None
  • paper_authors: Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, László A. Jeni
  • for: answers video questions without requiring pre-trained models or task-specific fine-tuning.
  • methods: uses a large language model to generate short procedural programs that solve a sequence of visual subtasks, and executes them to obtain the output.
  • results: achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets.
    Abstract We propose to answer zero-shot questions about videos by generating short procedural programs that derive a final answer from solving a sequence of visual subtasks. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but videos remain challenging: we provide ProViQ with modules intended for video understanding, allowing it to generalize to a wide variety of videos. This code generation framework additionally enables ProViQ to perform other video tasks in addition to question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023.
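
The "short procedural programs" are easiest to grasp from an example. The snippet below is a hypothetical program of the kind the language model might emit for "How many people enter the room after the door opens?"; every module name (detect_events, detect_objects, track, detect_region) is an invented stand-in for the visual API provided in the prompt, not ProViQ's actual interface:

```python
def answer_question(video):
    # Hypothetical visual modules, assumed to be listed in the LLM prompt.
    door_events = detect_events(video, query="door opens")   # -> [frame_idx]
    if not door_events:
        return "0"
    start = door_events[0]

    # Track people only after the door first opens.
    people = detect_objects(video, category="person", after_frame=start)
    tracks = track(video, people)

    # Count tracks that cross from outside the room to inside it.
    room = detect_region(video, query="room interior")
    entering = [t for t in tracks
                if t.starts_outside(room) and t.ends_inside(room)]
    return str(len(entering))
```

Because the answer comes from executing such a program rather than from a monolithic model, the same machinery extends to the paper's other use cases, such as multi-object tracking and basic video editing.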

Label Delay in Continual Learning

  • paper_url: http://arxiv.org/abs/2312.00923
  • repo_url: None
  • paper_authors: Botos Csaba, Wenxuan Zhang, Matthias Müller, Ser-Nam Lim, Mohamed Elhoseiny, Philip Torr, Adel Bibi
  • for: This work addresses the label delay problem in online continual learning, where new data may not receive labels immediately because annotation is slow and costly.
  • methods: A new continual learning framework explicitly models the delay: at each step it reveals unlabeled data from the current time step $t$ together with labels from time step $t-d$. A simple, efficient baseline rehearses the labeled memory samples most similar to the new unlabeled data.
  • results: Experiments totaling 1060 GPU days show that increasing computational resources alone cannot solve the label delay problem: when the delay is long, training only on the delayed supervised stream degrades performance markedly. The proposed method mitigates the impact of label delay and in some cases recovers the accuracy of the non-delayed counterpart.
    Abstract Online continual learning, the process of training models on streaming data, has gained increasing attention in recent years. However, a critical aspect often overlooked is the label delay, where new data may not be labeled due to slow and costly annotation processes. We introduce a new continual learning framework with explicit modeling of the label delay between data and label streams over time steps. In each step, the framework reveals both unlabeled data from the current time step $t$ and labels delayed with $d$ steps, from the time step $t-d$. In our extensive experiments amounting to 1060 GPU days, we show that merely augmenting the computational resources is insufficient to tackle this challenge. Our findings underline a notable performance decline when solely relying on labeled data when the label delay becomes significant. More surprisingly, when using state-of-the-art SSL and TTA techniques to utilize the newer, unlabeled data, they fail to surpass the performance of a na\"ive method that simply trains on the delayed supervised stream. To this end, we introduce a simple, efficient baseline that rehearses from the labeled memory samples that are most similar to the new unlabeled samples. This method bridges the accuracy gap caused by label delay without significantly increasing computational complexity. We show experimentally that our method is the least affected by the label delay factor and in some cases successfully recovers the accuracy of the non-delayed counterpart. We conduct various ablations and sensitivity experiments, demonstrating the effectiveness of our approach.
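
The proposed baseline reduces to a similarity lookup into the labeled memory. The sketch below is a schematic PyTorch rendering under assumed shapes; the feature extractor, buffer management, and how the retrieved samples are mixed into the update are all simplified away:

```python
import torch
import torch.nn.functional as F

def rehearsal_batch(mem_feats, mem_labels, new_feats, k):
    """Pick the k labeled memory samples most similar to the unlabeled batch.

    mem_feats: (M, D) features of labeled memory (labels arrived d steps late)
    new_feats: (B, D) features of the current, still-unlabeled stream batch
    """
    sim = F.normalize(new_feats, dim=1) @ F.normalize(mem_feats, dim=1).T
    top = sim.max(dim=0).values.topk(k).indices  # memory nearest to any new sample
    return mem_feats[top], mem_labels[top]

mem_feats, mem_labels = torch.randn(1000, 128), torch.randint(0, 10, (1000,))
new_feats = torch.randn(32, 128)
feats, labels = rehearsal_batch(mem_feats, mem_labels, new_feats, k=32)
# feats/labels then drive a supervised update, standing in for the labels
# that have not yet arrived for the new_feats batch.
print(feats.shape, labels.shape)
```

The appeal of the baseline is exactly what the abstract stresses: it closes much of the delay-induced gap without the cost of SSL or test-time-adaptation machinery.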

Segment and Caption Anything

  • paper_url: http://arxiv.org/abs/2312.00869
  • repo_url: https://github.com/IDEA-Research/Grounded-Segment-Anything
  • paper_authors: Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu
  • for: This work aims to efficiently and scalably equip the Segment Anything Model (SAM) with the ability to generate regional captions.
  • methods: A lightweight query-based feature mixer aligns region-specific features with the embedding space of language models for subsequent caption generation.
  • results: The model is first pre-trained with weak supervision on object detection and segmentation tasks, whose data contain only category names rather than full-sentence descriptions; extensive experiments validate each design choice and show that the method outperforms existing approaches and can help scale up regional captioning data.
    Abstract We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM presents strong generalizability to segment anything while is short for semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically in the order of tens of millions), it costs less computation, less memory usage, and less communication bandwidth, resulting in both fast and scalable training. To address the scarcity problem of regional caption data, we propose to first pre-train our model on objection detection and segmentation tasks. We call this step weak supervision pretraining since the pre-training data only contains category names instead of full-sentence descriptions. The weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via the following https://xk-huang.github.io/segment-caption-anything/.
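
A "lightweight query-based feature mixer" is essentially a small cross-attention block with learnable queries. The sketch below is a minimal PyTorch stand-in: the query count, widths, language-model embedding size, and the random tensors standing in for SAM's region features are all assumptions:

```python
import torch
import torch.nn as nn

class QueryFeatureMixer(nn.Module):
    """Learnable queries cross-attend to region features, then map to LM space."""
    def __init__(self, region_dim=256, lm_dim=768, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, region_dim) * 0.02)
        self.attn = nn.MultiheadAttention(region_dim, num_heads=8,
                                          batch_first=True)
        self.proj = nn.Linear(region_dim, lm_dim)

    def forward(self, region_feats):       # (B, N, region_dim) region tokens
        q = self.queries.unsqueeze(0).expand(region_feats.shape[0], -1, -1)
        mixed, _ = self.attn(q, region_feats, region_feats)
        return self.proj(mixed)            # (B, n_queries, lm_dim) soft prefix

mixer = QueryFeatureMixer()
region_feats = torch.randn(4, 64, 256)     # stand-in for SAM region features
print(mixer(region_feats).shape)           # torch.Size([4, 8, 768]) -> caption LM
```

Keeping only such a mixer (plus small heads) trainable is consistent with the abstract's tens-of-millions trainable-parameter count and the resulting fast, scalable training.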

Dense Optical Tracking: Connecting the Dots

  • paper_url: http://arxiv.org/abs/2312.00786
  • repo_url: None
  • paper_authors: Guillaume Le Moing, Jean Ponce, Cordelia Schmid
  • for: This work proposes a simple, efficient point tracking method that recovers the trajectory of every point visible in a frame throughout a video, even in the presence of occlusions.
  • methods: A small set of tracks is first extracted from key regions at motion boundaries with an off-the-shelf point tracker. Given source and target frames, rough initial estimates of a dense flow field and visibility mask are computed by nearest-neighbor interpolation, then refined with a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences.
  • results: The method is more accurate than current optical flow techniques and "universal" trackers such as OmniMotion, matches or exceeds the best point trackers such as CoTracker, and is at least two orders of magnitude faster. Experiments on synthetic and real videos confirm this; code, data, and videos are available at the project page: https://16lemoing.github.io/dot.
    Abstract Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available in the project webpage: https://16lemoing.github.io/dot .
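
DOT's initialization, turning sparse tracks into a dense flow and visibility guess, is nearly a one-liner with SciPy. The sketch below uses assumed shapes and random stand-in tracks, and omits the learned optical-flow refiner that does the actual heavy lifting:

```python
import numpy as np
from scipy.interpolate import NearestNDInterpolator

H, W = 240, 320
rng = np.random.default_rng(4)

# Sparse tracks between source and target frame: positions, motion, visibility.
pts = rng.uniform([0, 0], [W, H], size=(300, 2))     # (x, y) in source frame
flow_sparse = rng.normal(0.0, 3.0, size=(300, 2))    # per-track displacement
vis_sparse = rng.random(300) > 0.2                   # visible in target frame?

# Rough dense initialization: every pixel copies its nearest track.
gy, gx = np.mgrid[0:H, 0:W]
query = np.stack([gx.ravel(), gy.ravel()], axis=1)
flow_init = NearestNDInterpolator(pts, flow_sparse)(query).reshape(H, W, 2)
vis_init = NearestNDInterpolator(pts, vis_sparse)(query).reshape(H, W)

print(flow_init.shape, vis_init.mean())   # refiner would correct these estimates
```

Refining this cheap initialization, rather than tracking every pixel from scratch, helps explain how DOT stays orders of magnitude faster than running a point tracker densely.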

Sequential Modeling Enables Scalable Learning for Large Vision Models

  • paper_url: http://arxiv.org/abs/2312.00785
  • repo_url: https://github.com/ytongbai/LVM
  • paper_authors: Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros
  • for: This work proposes a novel sequential modeling approach for learning a Large Vision Model (LVM) without using any linguistic data.
  • methods: A common format, "visual sentences", represents raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without requiring any meta-knowledge beyond the pixels; the model is trained to minimize a cross-entropy loss for next-token prediction over these sequences.
  • results: Training across various scales of model architecture and data diversity (420 billion tokens in total) provides empirical evidence that the models scale effectively, and many different vision tasks can be solved by designing suitable visual prompts at test time.
    Abstract We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.

MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video

  • paper_url: http://arxiv.org/abs/2312.00778
  • repo_url: None
  • paper_authors: Hengyi Wang, Jingwen Wang, Lourdes Agapito
  • for: This paper targets dynamic scene reconstruction from a casually captured monocular RGB-D video.
  • methods: A neural representation learns the scene's geometry and appearance from the captured video, and a view-dependent diffusion prior is distilled to achieve realistic completion of unobserved regions.
  • results: Experiments show high-fidelity 360° surface reconstruction of deformable objects from monocular RGB-D video, including the large unobserved regions common in real-world captures.
    Abstract Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360{\deg} surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360{\deg} surface reconstruction of a deformable object from a monocular RGB-D video.

VideoBooth: Diffusion-based Video Generation with Image Prompts

  • paper_url: http://arxiv.org/abs/2312.00777
  • repo_url: None
  • paper_authors: Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu
  • for: This paper studies video generation with image prompts, which provide more accurate and direct control over the generated content than text prompts alone, especially for customized content creation.
  • methods: A feed-forward framework, VideoBooth, with two dedicated designs: 1) image prompts are embedded in a coarse-to-fine manner, and 2) at the fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values.
  • results: Experiments show that VideoBooth generates high-quality customized videos whose subjects are specified by image prompts, with a single model generalizing to a wide range of image prompts.
    Abstract Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass.

Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

  • paper_url: http://arxiv.org/abs/2312.00775
  • repo_url: None
  • paper_authors: Homanga Bharadhwaj, Abhinav Gupta, Vikash Kumar, Shubham Tulsiani
  • for: The goal is to develop robots that can interact zero-shot with generic unseen objects through a diverse repertoire of manipulation skills.
  • methods: A factorized approach leverages large-scale passive human videos to learn how a human would accomplish a desired task (a human plan), and then translates this plan to the robot's embodiment.
  • results: The learned system performs over 16 manipulation skills zero-shot, such as picking, pushing, and stirring, generalizing to 40 objects across 100 real-world tasks for table-top and in-the-wild manipulation.
    Abstract We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robots embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy, and allows following humans plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module only requires a small amount of in-domain data, and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation. https://homangab.github.io/hopman/

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

  • paper_url: http://arxiv.org/abs/2312.00863
  • repo_url: https://github.com/yformer/EfficientSAM
  • paper_authors: Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, Vikas Chandra
  • for: This work aims to improve the efficiency of the Segment Anything Model (SAM) so that it can serve a wider range of real-world applications.
  • methods: Masked image pretraining (SAMI) learns effective visual representations by reconstructing features from the SAM image encoder; SAMI-pretrained lightweight image encoders and a mask decoder are then combined to build EfficientSAM models, which are fine-tuned on SA-1B for the segment anything task.
  • results: SAMI consistently outperforms other masked image pretraining methods, and EfficientSAMs with SAMI-pretrained lightweight encoders perform favorably against other fast SAM models on zero-shot instance segmentation, with a significant gain (e.g., ~4 AP on COCO/LVIS).
    Abstract Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, the huge computation cost of SAM model has limited its applications to wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibits decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs, and finetune the models on SA-1B for segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything task such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.
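
SAMI's pretraining objective is feature-space reconstruction: from a masked view of the image, a lightweight encoder plus a small decoder must reproduce the frozen SAM encoder's features. The sketch below is a schematic PyTorch rendering under assumed shapes; tiny stand-in transformers replace the real SAM ViT and the EfficientSAM encoder, and the mask ratio is an assumption:

```python
import torch
import torch.nn as nn

B, N, D = 8, 196, 256                    # batch, patch tokens, feature width
def tiny_transformer(layers):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, nhead=8, batch_first=True),
        num_layers=layers)

sam_encoder = tiny_transformer(4).eval() # stand-in for the frozen SAM encoder
for p in sam_encoder.parameters():
    p.requires_grad_(False)
light_encoder = tiny_transformer(2)      # lightweight encoder being pretrained
decoder = tiny_transformer(1)            # reconstructs masked-token features
mask_token = nn.Parameter(torch.zeros(D))

patches = torch.randn(B, N, D)           # patch embeddings of an image batch
mask = torch.rand(B, N) < 0.75           # high mask ratio, MAE-style assumption

with torch.no_grad():
    target = sam_encoder(patches)        # SAM features of the full image

masked_in = torch.where(mask[..., None], mask_token.expand(B, N, D), patches)
recon = decoder(light_encoder(masked_in))

loss = ((recon - target) ** 2)[mask].mean()  # match SAM features where masked
loss.backward()
```

Once pretrained this way, the lightweight encoder replaces SAM's heavy ViT, and the resulting models are fine-tuned on SA-1B for the segment anything task.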
    摘要 Segment Anything Model (SAM) 已经成为许多视觉应用中的一个强大工具。其出色性能(包括零样本迁移和高灵活性)的关键在于一个在大规模高质量 SA-1B 数据集上训练的超大 transformer 模型。然而,巨大的计算成本限制了 SAM 模型在更广泛的实际场景中的应用。为解决这个问题,我们提出了 EfficientSAM,一种轻量级的 SAM 模型,在大幅降低复杂度的同时仍具有不俗的性能。我们的思路是利用掩码图像预训练(SAMI),通过重建 SAM 图像编码器的特征来学习有效的视觉表示。然后,我们使用 SAMI 预训练的轻量级图像编码器和掩码解码器构建 EfficientSAM,并在 SA-1B 上针对 segment anything 任务进行微调。我们在多种视觉任务上进行评估,包括图像分类、物体检测、实例分割和语义物体检测,发现我们提出的预训练方法 SAMI 一致优于其他掩码图像预训练方法。在零样本实例分割等 segment anything 任务中,采用 SAMI 预训练轻量级图像编码器的 EfficientSAM 相比其他快速 SAM 模型取得了显著增益(例如,在 COCO/LVIS 上约 4 AP)。
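As a rough illustration of the SAMI idea above (reconstructing frozen SAM encoder features from masked inputs rather than raw pixels), the following PyTorch-style sketch shows one way such a pretraining loss could be wired up. The module interfaces, dimensions, and the 75% mask ratio are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of SAMI-style masked feature reconstruction (assumptions:
# module interfaces, dimensions, and the 75% mask ratio are illustrative).
import torch
import torch.nn as nn

class SAMIPretrainer(nn.Module):
    def __init__(self, light_encoder: nn.Module, sam_encoder: nn.Module,
                 dim: int = 256, mask_ratio: float = 0.75):
        super().__init__()
        self.light_encoder = light_encoder      # lightweight ViT to be pretrained
        self.sam_encoder = sam_encoder.eval()   # frozen SAM image encoder (teacher)
        for p in self.sam_encoder.parameters():
            p.requires_grad_(False)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_ratio = mask_ratio

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) patch embeddings of the input image
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep, drop = idx[:, :n_keep], idx[:, n_keep:]

        visible = torch.gather(patches, 1,
                               keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.light_encoder(visible)    # encode visible patches only

        # pad with mask tokens and decode to predict the teacher's features
        masked = self.mask_token.expand(B, N - n_keep, D)
        pred = self.decoder(torch.cat([latent, masked], dim=1))

        with torch.no_grad():
            target = self.sam_encoder(patches)  # (B, N, D) teacher features
        order = torch.cat([keep, drop], dim=1).unsqueeze(-1).expand(-1, -1, D)
        target = torch.gather(target, 1, order)  # align with prediction order
        return nn.functional.mse_loss(pred, target)
```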

Adversarial Score Distillation: When score distillation meets GAN

  • paper_url: http://arxiv.org/abs/2312.00739
  • repo_url: https://github.com/2y7c3/asd
  • paper_authors: Min Wei, Jingkai Zhou, Junyao Sun, Xuesong Zhang
  • for: 这篇论文旨在解释和分析现有分数蒸馏方法存在的缺陷,并提出一种新的分数蒸馏方法来解决这些缺陷。
  • methods: 该论文采用 Wasserstein 生成对抗网络(WGAN)范式,发现现有的分数蒸馏方法要么使用固定的次优判别器,要么进行不完整的判别器优化,从而导致缺陷。论文据此提出了对抗分数蒸馏(Adversarial Score Distillation, ASD),它维护一个可优化的判别器,并通过完整的优化目标来更新判别器。
  • results: 实验表明,相比现有方法,ASD 在 2D 蒸馏和文本到 3D 任务中表现出色。此外,为了探索 WGAN 范式的普适性,该论文还将 ASD 扩展到图像编辑任务,取得了具有竞争力的结果。
    Abstract Existing score distillation methods are sensitive to classifier-free guidance (CFG) scale: manifested as over-smoothness or instability at small CFG scales, while over-saturation at large ones. To explain and analyze these issues, we revisit the derivation of Score Distillation Sampling (SDS) and decipher existing score distillation with the Wasserstein Generative Adversarial Network (WGAN) paradigm. With the WGAN paradigm, we find that existing score distillation either employs a fixed sub-optimal discriminator or conducts incomplete discriminator optimization, resulting in the scale-sensitive issue. We propose the Adversarial Score Distillation (ASD), which maintains an optimizable discriminator and updates it using the complete optimization objective. Experiments show that the proposed ASD performs favorably in 2D distillation and text-to-3D tasks against existing methods. Furthermore, to explore the generalization ability of our WGAN paradigm, we extend ASD to the image editing task, which achieves competitive results. The project page and code are at https://github.com/2y7c3/ASD.
    摘要 现有的分数蒸馏方法对无分类器引导(CFG)尺度敏感:在小 CFG 尺度下表现为过度平滑或不稳定,而在大 CFG 尺度下则过度饱和。为了解释和分析这些问题,我们重新审视分数蒸馏采样(SDS)的推导,并用 Wasserstein 生成对抗网络(WGAN)范式来解读现有的分数蒸馏方法。在 WGAN 范式下,我们发现现有的分数蒸馏方法要么采用固定的次优判别器,要么只进行不完整的判别器优化,从而导致尺度敏感问题。我们提出了对抗分数蒸馏(ASD)方法,该方法维护一个可优化的判别器,并通过完整的优化目标来更新它。实验显示,我们提出的 ASD 在 2D 蒸馏和文本到 3D 任务中优于现有方法。此外,为探索 WGAN 范式的泛化能力,我们将 ASD 扩展到图像编辑任务,取得了具有竞争力的结果。项目页面和代码位于 https://github.com/2y7c3/ASD 。
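To make the WGAN reading concrete, here is a schematic alternating update in which the discriminator stays optimizable and receives a complete update, the property the paper argues existing score distillation lacks. The plain discriminator network below is a stand-in assumption; in ASD the discriminator is realized through the diffusion model itself.

```python
# Schematic WGAN-style alternating update reflecting the paper's framing
# (assumption: a plain discriminator network stands in for the
# diffusion-model-based discriminator used by ASD).
import torch

def asd_step(generator, discriminator, g_opt, d_opt, z, real):
    # 1) complete discriminator update: the step the paper argues prior
    #    SDS-style methods skip or truncate
    d_opt.zero_grad()
    fake = generator(z).detach()
    d_loss = discriminator(fake).mean() - discriminator(real).mean()
    d_loss.backward()
    d_opt.step()

    # 2) generator update against the current (optimizable) discriminator
    g_opt.zero_grad()
    g_loss = -discriminator(generator(z)).mean()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```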

Segment Any 3D Gaussians

  • paper_url: http://arxiv.org/abs/2312.00860
  • repo_url: https://github.com/Jumpat/SegAnyGAussians
  • paper_authors: Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian
  • for: 本研究旨在提出一种可实现高精度、快速的3D交互分割方法,以便在3D场景理解和修改中具有更高的效率和灵活性。
  • methods: 该方法基于2D分割基础模型,并结合3D Gaussian Splatting(3DGS)技术,通过特制的对比训练将2D分割结果嵌入3D Gaussian点特征中。
  • results: 评估结果显示,SAGA 可以取得与当前最先进方法相当的性能,并支持多级粒度分割和多种输入提示(包括点、涂抹和 2D 掩码)。此外,SAGA 可以在毫秒级时间内完成 3D 分割,相比之前的 SOTA 实现了近 1000 倍的加速。更多细节可以参考 https://jumpat.github.io/SAGA 。
    Abstract Interactive 3D segmentation in radiance fields is an appealing task since its importance in 3D scene understanding and manipulation. However, existing methods face challenges in either achieving fine-grained, multi-granularity segmentation or contending with substantial computational overhead, inhibiting real-time interaction. In this paper, we introduce Segment Any 3D GAussians (SAGA), a novel 3D interactive segmentation approach that seamlessly blends a 2D segmentation foundation model with 3D Gaussian Splatting (3DGS), a recent breakthrough of radiance fields. SAGA efficiently embeds multi-granularity 2D segmentation results generated by the segmentation foundation model into 3D Gaussian point features through well-designed contrastive training. Evaluation on existing benchmarks demonstrates that SAGA can achieve competitive performance with state-of-the-art methods. Moreover, SAGA achieves multi-granularity segmentation and accommodates various prompts, including points, scribbles, and 2D masks. Notably, SAGA can finish the 3D segmentation within milliseconds, achieving nearly 1000x acceleration compared to previous SOTA. The project page is at https://jumpat.github.io/SAGA.
    摘要 辐射场中的交互式三维分割是一项吸引人的任务,因为它在三维场景理解与操作中十分重要。然而,现有方法要么难以实现细粒度、多级粒度的分割,要么面临巨大的计算开销,无法进行实时交互。在这篇论文中,我们介绍了 Segment Any 3D GAussians(SAGA),一种新颖的三维交互式分割方法,它将二维分割基础模型与辐射场领域的最新突破三维高斯泼溅(3DGS)无缝结合。SAGA 通过精心设计的对比训练,将分割基础模型生成的多级粒度二维分割结果高效地嵌入三维高斯点特征中。在现有基准上的评估表明,SAGA 可以取得与最先进方法相当的性能。此外,SAGA 可以实现多级粒度分割,并支持多种提示,包括点、涂抹和二维掩码。尤其是,SAGA 可以在毫秒级别内完成三维分割,相比之前的 SOTA 实现近 1000 倍加速。项目页面在 https://jumpat.github.io/SAGA 。
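The contrastive distillation step can be pictured as pulling together the features of Gaussians that the 2D foundation model groups into the same mask. The sketch below is a generic InfoNCE-style formulation under that reading; the feature dimension, temperature, and the Gaussian-to-mask association are simplified assumptions, not SAGA's exact design.

```python
# Sketch of a contrastive objective that distills 2D masks into per-Gaussian
# features (assumptions: temperature and the way Gaussians are associated
# with 2D masks are simplified stand-ins for the paper's design).
import torch
import torch.nn.functional as F

def mask_contrastive_loss(gauss_feat: torch.Tensor,
                          mask_ids: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """gauss_feat: (N, D) features of Gaussians visible in one view.
    mask_ids:   (N,) id of the 2D mask each Gaussian projects into."""
    f = F.normalize(gauss_feat, dim=-1)
    sim = f @ f.t() / temperature                        # (N, N) similarities
    eye = torch.eye(len(f), device=f.device)
    log_prob = (sim - eye * 1e9).log_softmax(dim=-1)     # drop self-pairs
    # pull together Gaussians grouped into the same 2D mask
    pos = (mask_ids[:, None] == mask_ids[None, :]).float() - eye
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```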

PointBeV: A Sparse Approach to BeV Predictions

  • paper_url: http://arxiv.org/abs/2312.00703
  • repo_url: https://github.com/valeoai/pointbev
  • paper_authors: Loick Chambon, Eloi Zablocki, Mickael Chen, Florent Bartoccioni, Patrick Perez, Matthieu Cord
  • for: 本研究旨在提出一种新的 Bird’s-eye View(BeV)分割模型,以提高在驾驶应用中的感知数据融合和下游任务的性能。
  • methods: 本模型使用稀疏的 BeV 单元而非密集网格,从而可以精细地控制内存使用,支持长时间上下文并适配内存受限的平台。模型采用高效的两阶段训练策略,在训练时将计算集中于感兴趣区域。在推理时,模型可以按不同的内存/性能权衡灵活调整,并适应新的具体应用场景。
  • results: 模型在 nuScenes 数据集上的车辆、行人和车道分割任务中取得了最先进的结果,并且在静态和时序设置下都表现出色,而训练时仅使用了稀疏信号。
    Abstract Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at https://github.com/valeoai/PointBeV.
    摘要 鸟瞰视图(BeV)表示已成为驾驶应用中事实上的共享空间,为传感器数据融合提供统一空间并支持多种下游任务。然而,传统模型使用固定分辨率和范围的网格,由于对所有单元格均匀分配资源,计算效率低下。为解决这个问题,我们提出了 PointBeV,一种新的稀疏 BeV 分割模型,它在稀疏的 BeV 单元上进行分割,而非密集网格。这种方法可以精确控制内存使用,支持长时间上下文并适配内存受限的平台。PointBeV 采用高效的两阶段训练策略,可以将计算集中于感兴趣区域。在推理时,它可以按不同的内存/性能权衡使用,并灵活适应新的具体用例。PointBeV 在 nuScenes 数据集上的车辆、行人和车道分割任务中取得了最先进的结果,在静态和时序设置下均表现出色,而训练时仅使用了稀疏信号。我们将在 https://github.com/valeoai/PointBeV 上发布代码,其中包括架构中使用的两个新的高效模块:用于从图像向 BeV 高效提取特征的稀疏特征拉取(Sparse Feature Pulling),以及支持高效时序建模的子流形注意力(Submanifold Attention)。
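A minimal picture of feature pulling is to project each sparse BeV point into the image and sample the feature map only at those locations instead of densely. The helper below assumes a single pinhole camera and omits the paper's optimizations; it is a sketch, not the released Sparse Feature Pulling module.

```python
# Illustrative sparse "feature pulling": gather image features only at the
# sparse BeV cells of interest (assumptions: points are in camera coordinates
# of a single pinhole camera; the paper's optimized module is more involved).
import torch
import torch.nn.functional as F

def pull_features(img_feat: torch.Tensor, bev_pts: torch.Tensor,
                  intrinsics: torch.Tensor) -> torch.Tensor:
    """img_feat: (B, C, H, W), bev_pts: (B, M, 3) sparse 3D points,
    intrinsics: (B, 3, 3). Returns (B, M, C) sampled features."""
    cam = torch.einsum('bij,bmj->bmi', intrinsics, bev_pts)   # project
    uv = cam[..., :2] / cam[..., 2:].clamp(min=1e-5)          # pixel coords
    B, _, H, W = img_feat.shape
    # normalize pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                        uv[..., 1] / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(img_feat, grid.unsqueeze(2),      # (B, C, M, 1)
                            align_corners=True)
    return sampled.squeeze(-1).permute(0, 2, 1)               # (B, M, C)
```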

GIFT: Generative Interpretable Fine-Tuning Transformers

  • paper_url: http://arxiv.org/abs/2312.00700
  • repo_url: https://github.com/savadikarc/gift
  • paper_authors: Chinmay Savadikar, Xi Song, Tianfu Wu
  • for: 这篇论文提出了内置可解释性的参数高效微调方法 GIFT(Generative Interpretable Fine-tuning Transformers),用于在下游任务上以参数高效的方式微调预训练的(通常较大的)transformer 模型。
  • methods: 该方法采用深度参数残差学习,选择 transformer 模型多头自注意力中的最终投影(线性)层进行参数高效微调,并通过学习 PaCa(Parameter-to-Cluster Attention)来生成微调参数。
  • results: 在 VTAB 基准和细粒度视觉分类(FGVC)基准上,提出的 GIFT 方法取得了比现有方法显著更好的性能。
    Abstract We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models at downstream tasks in a parameter-efficient way with built-in interpretability. Our GIFT is a deep parameter-residual learning method, which addresses two problems in fine-tuning a pretrained Transformer model: Where to apply the parameter-efficient fine-tuning (PEFT) to be extremely lightweight yet sufficiently expressive, and How to learn the PEFT to better exploit the knowledge of the pretrained model in a direct way? For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model, and verify its effectiveness. For the latter, in contrast to the prior art that directly introduces new model parameters (often in low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. Our GIFT is a hyper-Transformer which takes as input the pretrained parameters of the projection layer to generate its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). The PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, our proposed GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art. Our code is available at https://github.com/savadikarc/gift
    摘要 我们提出 GIFT(生成式可解释微调 transformer),用于以参数高效且内置可解释性的方式在下游任务上微调预训练的(通常较大的)transformer 模型。GIFT 是一种深度参数残差学习方法,解决了微调预训练 transformer 模型中的两个问题:其一,在哪里应用参数高效微调(PEFT)才能既极其轻量又有足够的表达能力;其二,如何学习 PEFT 以更直接地利用预训练模型的知识。针对前者,我们选择了 transformer 模型多头自注意力中的最终投影(线性)层,并验证了其有效性。针对后者,不同于现有方法直接引入新的模型参数(通常为低秩近似形式)并在下游数据上学习,我们提出了一种学习生成微调参数的方法。GIFT 是一个 hyper-Transformer,它以投影层的预训练参数为输入,通过我们提出的参数到聚类注意力(PaCa)生成其微调参数。PaCa 产生了一个简单的基于聚类的前向解释器,在测试时起到语义分割的作用。在实验中,GIFT 在 VTAB 基准和细粒度视觉分类(FGVC)基准上进行了测试,取得了比现有方法显著更好的性能。我们的代码可以在 https://github.com/savadikarc/gift 上获取。
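One way to picture the parameter-generation step is a small attention module that reads the pretrained projection weight as a set of tokens, attends over learnable cluster tokens, and emits a residual update. The sketch below follows that reading; treating each output row of the weight as a token and the layer sizes are assumptions, and the actual PaCa design differs in detail.

```python
# Rough sketch of generating a fine-tuning residual from pretrained weights
# via attention over learnable cluster tokens. Assumptions: each output row
# of the frozen projection weight is treated as a token, and layer sizes are
# placeholders; the actual PaCa design differs in detail.
import torch
import torch.nn as nn

class PaCaGenerator(nn.Module):
    def __init__(self, d_in: int, n_clusters: int = 16, n_heads: int = 4):
        super().__init__()
        self.clusters = nn.Parameter(torch.randn(n_clusters, d_in) * 0.02)
        self.attn = nn.MultiheadAttention(d_in, n_heads, batch_first=True)
        self.out = nn.Linear(d_in, d_in)

    def forward(self, w_pretrained: torch.Tensor) -> torch.Tensor:
        # w_pretrained: (d_out, d_in) frozen projection weight, read as tokens
        q = w_pretrained.unsqueeze(0)              # (1, d_out, d_in)
        kv = self.clusters.unsqueeze(0)            # (1, K, d_in)
        ctx, attn_weights = self.attn(q, kv, kv)   # attn_weights: (1, d_out, K),
        delta = self.out(ctx).squeeze(0)           # a cluster-assignment map
        return w_pretrained + delta                # weight used for fine-tuning
```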

Rethinking Detection Based Table Structure Recognition for Visually Rich Documents

  • paper_url: http://arxiv.org/abs/2312.00699
  • repo_url: None
  • paper_authors: Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
  • for: 该论文旨在改进基于检测模型的表格结构识别(TSR)方法,将非结构化表格图像转换为结构化格式(如 HTML 序列)。
  • methods: 该论文使用检测模型检测表格中的元素(如行和列),再通过基于规则的后处理方法将检测结果转换为 HTML 序列;论文重新审视了现有基于检测方案的局限,并改进了两阶段检测模型的关键设计(多类别问题定义、锚框宽高比、骨干网络的特征生成)。
  • results: 该论文通过对 Cascade R-CNN 模型进行简单修改,实现了最先进的性能,在 SciTSR、FinTabNet 和 PubTables1M 数据集上将基线 Cascade R-CNN 模型的仅结构(structure-only)TEDS 分别提升了 19.32%、11.56% 和 14.77%。
    Abstract Table Structure Recognition (TSR) aims at transforming unstructured table images into structured formats, such as HTML sequences. One type of popular solution is using detection models to detect components of a table, such as columns and rows, then applying a rule-based post-processing method to convert detection results into HTML sequences. However, existing detection-based studies often have the following limitations. First, these studies usually pay more attention to improving the detection performance, which does not necessarily lead to better performance regarding cell-level metrics, such as TEDS. Second, some solutions over-simplify the problem and can miss some critical information. Lastly, even though some studies defined the problem to detect more components to provide as much information as other types of solutions, these studies ignore the fact that this problem definition is a multi-label detection problem, because row, projected row header and column header can share identical bounding boxes. Besides, there is often a performance gap between two-stage and transformer-based detection models regarding the structure-only TEDS, even though they have similar performance regarding the COCO metrics. Therefore, we revisit the limitations of existing detection-based solutions, compare two-stage and transformer-based detection models, and identify the key design aspects for the success of a two-stage detection model for the TSR task, including the multi-class problem definition, the aspect ratio for anchor box generation, and the feature generation of the backbone network. We applied simple methods to improve these aspects of the Cascade R-CNN model, achieved state-of-the-art performance, and improved the baseline Cascade R-CNN model by 19.32%, 11.56% and 14.77% regarding the structure-only TEDS on SciTSR, FinTabNet, and PubTables1M datasets.
    摘要 表格结构识别(TSR)旨在将非结构化的表格图像转换为结构化格式(如 HTML 序列)。一类流行的方案先使用检测模型检测表格组件(如行和列),再通过基于规则的后处理将检测结果转换为 HTML 序列。然而,现有基于检测的研究通常存在以下局限:其一,它们更关注提升检测性能,而这未必能改善单元格级指标(如 TEDS);其二,部分方案过度简化问题,可能遗漏关键信息;其三,即使有研究将问题定义为检测更多组件以提供尽可能多的信息,也忽略了这实际上是一个多标签检测问题,因为行、投影行表头和列表头可能共享完全相同的边界框。此外,两阶段检测模型与基于 transformer 的检测模型在仅结构 TEDS 指标上常存在性能差距,尽管二者在 COCO 指标上表现相近。为此,我们重新审视现有基于检测方案的局限,比较两阶段与基于 transformer 的检测模型,并指出两阶段检测模型在 TSR 任务上成功的关键设计要素,包括多类别问题定义、锚框生成的宽高比以及骨干网络的特征生成。我们对 Cascade R-CNN 模型的这些方面做了简单改进,取得了最先进的性能,在 SciTSR、FinTabNet 和 PubTables1M 数据集上将基线 Cascade R-CNN 模型的仅结构 TEDS 分别提升了 19.32%、11.56% 和 14.77%。

Object Detector Differences when using Synthetic and Real Training Data

  • paper_url: http://arxiv.org/abs/2312.00694
  • repo_url: https://github.com/ljungqvistmartin/datasplits
  • paper_authors: Martin Georg Ljungqvist, Otto Nordander, Markus Skans, Arvid Mildner, Tony Liu, Pierre Nugues
  • for: 这篇论文旨在研究使用合成数据训练神经网络会如何影响网络各层,以及其对检测器的影响。
  • methods: 作者在真实与合成的城市环境图像上训练 YOLOv3 检测器,并使用中心核对齐(Centered Kernel Alignment, CKA)进行逐层相似性分析,以探索合成数据训练对各层的影响。
  • results: 研究发现,基于真实数据与基于合成数据训练的检测器在早期层最为相似,而在 head 部分差异最大。此外,冻结或不冻结 backbone 对性能和相似性均无显著影响。
    Abstract To train well-performing generalizing neural networks, sufficiently large and diverse datasets are needed. Collecting data while adhering to privacy legislation becomes increasingly difficult and annotating these large datasets is both a resource-heavy and time-consuming task. An approach to overcome these difficulties is to use synthetic data since it is inherently scalable and can be automatically annotated. However, how training on synthetic data affects the layers of a neural network is still unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images from city environments. We perform a similarity analysis using Centered Kernel Alignment (CKA) to explore the effects of training on synthetic data on a layer-wise basis. The analysis captures the architecture of the detector while showing both different and similar patterns between different models. With this similarity analysis we want to give insights on how training synthetic data affects each layer and to give a better understanding of the inner workings of complex neural networks. The results show that the largest similarity between a detector trained on real data and a detector trained on synthetic data was in the early layers, and the largest difference was in the head part. The results also show that no major difference in performance or similarity could be seen between frozen and unfrozen backbone.
    摘要 训练泛化性能良好的神经网络需要足够大且多样化的数据集。在遵守隐私法规的前提下收集数据变得越来越困难,而为这些大规模数据集做标注既耗费资源又耗时。使用合成数据是克服这些困难的一种途径,因为它天然可扩展且可以自动标注。然而,在合成数据上训练会如何影响神经网络的各层仍不清楚。在这篇论文中,我们在真实和合成的城市环境图像上训练 YOLOv3 目标检测器,并使用中心核对齐(CKA)进行逐层相似性分析,以探索合成数据训练对各层的影响。这种分析既刻画了检测器的结构,也展示了不同模型之间相似与相异的模式。通过这种相似性分析,我们希望揭示合成数据训练对每一层的影响,并加深对复杂神经网络内部机制的理解。结果显示,基于真实数据训练的检测器与基于合成数据训练的检测器在早期层最为相似,在 head 部分差异最大。结果还表明,冻结与不冻结 backbone 在性能和相似性上均无明显差异。
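The CKA similarity used for the layer-wise comparison has a compact linear form. The snippet below implements standard linear CKA on centered activations; it is the textbook formula rather than code from the paper.

```python
# Linear CKA as commonly used for layer-wise similarity analysis; this is the
# standard formula, not code from the paper.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) activations of two layers on the same n inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, 'fro') *
                         np.linalg.norm(Y.T @ Y, 'fro')))
```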

VisionaryVR: An Optical Simulation Tool for Evaluating and Optimizing Vision Correction Solutions in Virtual Reality

  • paper_url: http://arxiv.org/abs/2312.00692
  • repo_url: None
  • paper_authors: Benedikt W. Hosp, Martin Dechant, Yannick Sauer, Rajat Agarwala, Siegfried Wahl
  • for: 开发和评估视觉科学方法需要强大而高效的工具,能够在提供高度实验控制的同时模拟真实世界的光学方法。
  • methods: 本研究提出了一种新的虚拟现实(VR)模拟工具,包括一个实验控制器(便于平滑地处理多种条件)、一个兼容大多数常见 VR 眼动仪的通用眼动跟踪控制器、一个可配置的散焦模拟器,以及一个通用的 VR 问卷加载器,用于评估参与者在虚拟现实中的行为。
  • results: 这一基于 VR 的模拟工具弥合了新光学方法、矫正与治疗的理论研究与应用研究之间的差距,为视觉科学家提供了一个稳健、真实且快速的研究环境。
    Abstract Developing and evaluating vision science methods require robust and efficient tools for assessing their performance in various real-world scenarios. This study presents a novel virtual reality (VR) simulation tool that simulates real-world optical methods while giving high experimental control to the experiment. The tool incorporates an experiment controller, to smoothly and easily handle multiple conditions, a generic eye-tracking controller, that works with most common VR eye-trackers, a configurable defocus simulator, and a generic VR questionnaire loader to assess participants' behavior in virtual reality. This VR-based simulation tool bridges the gap between theoretical and applied research on new optical methods, corrections, and therapies. It enables vision scientists to increase their research tools with a robust, realistic, and fast research environment.
    摘要 开发和评估视科学方法需要强大和高效的工具来评估它们在实际世界场景中的性能。本研究提出了一种新的虚拟现实(VR)模拟工具,可以模拟实际世界的光学方法,同时具有高度的实验控制。工具包括一个实验控制器,可以平滑地处理多种条件,一个通用的眼动跟踪控制器,可以与大多数VR眼动跟踪器兼容,一个可配置的虚拟抖噪模拟器,以及一个可配置的VR问卷加载器,用于评估参与者在虚拟世界中的行为。这个VR基于的模拟工具将理论和应用研究之间的差距 bridged,帮助视科学家增强他们的研究工具,提供了一个强大、现实istic和快速的研究环境。

Open-vocabulary object 6D pose estimation

  • paper_url: http://arxiv.org/abs/2312.00690
  • repo_url: None
  • paper_authors: Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cavallaro, Fabio Poiesi
  • for: 本文提出了开放词汇物体 6D 姿态估计的新设定:目标物体仅通过文本提示来指定,推理时不需要 CAD 模型或视频序列。
  • methods: 我们提出了一种新方法,利用视觉语言模型从两个不同场景中分割出目标物体,并估计其相对 6D 姿态。方法的关键是一种精心设计的融合策略,将文本提示提供的物体级信息与局部图像特征融合,得到能够泛化到新概念的特征空间。
  • results: 我们基于 REAL275 和 Toyota-Light 两个流行的数据集构建了新基准,共包含出现在四千个图像对中的 39 个物体实例,并与两种方法进行比较:一种成熟的手工方法和一种近期的深度学习基线方法。结果表明,我们的方法在跨场景估计物体相对 6D 姿态方面优于二者。项目页面:https://jcorsetti.github.io/oryon/ 。
    Abstract We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g. CAD or video sequence) is required at inference, (iii) the object is imaged from two different viewpoints of two different scenes, and (iv) the object was not observed during the training phase. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from two distinct scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 39 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Project page: https://jcorsetti.github.io/oryon/.
    摘要 我们提出了开放词汇物体 6D 姿态估计的新设定,其中通过文本提示来指定目标物体。与现有方法不同,在我们的设定中:(1)目标物体仅通过文本提示定义;(2)推理时不需要任何物体模型(如 CAD 或视频序列);(3)物体是在两个不同场景、两个不同视角下成像的;(4)物体在训练阶段从未被观察过。为了在这一设定下工作,我们提出了一种新方法,利用视觉语言模型从两个不同场景中分割出目标物体,并估计其相对 6D 姿态。我们方法的关键是一种精心设计的策略,将提示所提供的物体级信息与局部图像特征融合,从而得到能够泛化到新概念的特征空间。我们在基于 REAL275 和 Toyota-Light 两个流行数据集构建的新基准上验证了该方法,基准共包含出现在四千个图像对中的 39 个物体实例。结果表明,在估计不同场景中物体的相对 6D 姿态方面,我们的方法优于一种成熟的手工方法和一种近期的深度学习基线。项目页面:https://jcorsetti.github.io/oryon/ 。

Infrared Image Super-Resolution via GAN

  • paper_url: http://arxiv.org/abs/2312.00689
  • repo_url: None
  • paper_authors: Yongsong Huang, Shinichiro Omachi
  • for: IR image super-resolution
  • methods: generative models, adversarial training
  • results: potential areas for further investigation and advancement
    Abstract The ability of generative models to accurately fit data distributions has resulted in their widespread adoption and success in fields such as computer vision and natural language processing. In this chapter, we provide a brief overview of the application of generative models in the domain of infrared (IR) image super-resolution, including a discussion of the various challenges and adversarial training methods employed. We propose potential areas for further investigation and advancement in the application of generative models for IR image super-resolution.
    摘要 生成模型准确拟合数据分布的能力,使其在计算机视觉和自然语言处理等领域得到广泛的应用和成功。在这一章中,我们简要概述了生成模型在红外(IR)图像超分辨领域的应用,包括对各类挑战和对抗训练方法的讨论,并指出了生成模型用于红外图像超分辨的进一步研究与发展方向。

Unsupervised Adaptive Implicit Neural Representation Learning for Scan-Specific MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2312.00677
  • repo_url: None
  • paper_authors: Junwei Yang, Pietro Liò
  • for: 加速 MRI 采集,尤其适用于特定扫描设置下难以获得大量全采样数据的情况
  • methods: 提出了一种无监督、自适应的由粗到细 MRI 重建框架,利用隐式神经表示学习从多维坐标到对应信号强度的映射
  • results: 在公共数据集(包括 2D 和 3D 数据)上的全面评估表明,该方法在最高 8 倍欠采样下优于当前最先进的扫描特定 MRI 重建技术
    Abstract In recent studies on MRI reconstruction, advances have shown significant promise for further accelerating the MRI acquisition. Most state-of-the-art methods require a large amount of fully-sampled data to optimise reconstruction models, which is impractical and expensive under certain clinical settings. On the other hand, for unsupervised scan-specific reconstruction methods, overfitting is likely to happen due to insufficient supervision, while restrictions on acceleration rates and under-sampling patterns further limit their applicability. To this end, we propose an unsupervised, adaptive coarse-to-fine framework that enhances reconstruction quality without being constrained by the sparsity levels or patterns in under-sampling. The framework employs an implicit neural representation for scan-specific MRI reconstruction, learning a mapping from multi-dimensional coordinates to their corresponding signal intensities. Moreover, we integrate a novel learning strategy that progressively refines the use of acquired k-space signals for self-supervision. This approach effectively adjusts the proportion of supervising signals from unevenly distributed information across different frequency bands, thus mitigating the issue of overfitting while improving the overall reconstruction. Comprehensive evaluation on a public dataset, including both 2D and 3D data, has shown that our method outperforms current state-of-the-art scan-specific MRI reconstruction techniques, for up to 8-fold under-sampling.
    摘要 近期关于 MRI 重建的研究取得了显著进展,有望进一步加速 MRI 采集。大多数最先进的方法需要大量全采样数据来优化重建模型,而这在某些临床环境下既不现实又昂贵。另一方面,无监督的扫描特定重建方法由于监督不足容易发生过拟合,而对加速倍数和欠采样模式的限制进一步约束了其适用性。为此,我们提出了一种无监督、自适应的由粗到细框架,在不受欠采样稀疏程度或模式约束的情况下提升重建质量。该框架采用隐式神经表示进行扫描特定的 MRI 重建,学习从多维坐标到对应信号强度的映射。此外,我们引入了一种新的学习策略,逐步细化已采集 k 空间信号在自监督中的使用方式:该方法有效调整来自不同频带、分布不均的监督信号比例,从而在缓解过拟合的同时提升整体重建质量。在公共数据集(包括 2D 和 3D 数据)上的全面评估表明,我们的方法在最高 8 倍欠采样下优于当前最先进的扫描特定 MRI 重建技术。
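At its core, the scan-specific representation is a network mapping normalized coordinates to signal intensities. A minimal coordinate MLP with Fourier feature encoding, of the kind such methods commonly use, might look as follows; the encoding, layer sizes, and the two-channel (real and imaginary) output are generic assumptions, not the paper's architecture.

```python
# Minimal coordinate-to-intensity network of the kind the abstract describes
# (assumptions: Fourier feature encoding and layer sizes are generic choices).
import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    def __init__(self, in_dim: int = 3, n_freqs: int = 8, hidden: int = 256):
        super().__init__()
        self.register_buffer('freqs', 2.0 ** torch.arange(n_freqs) * torch.pi)
        enc_dim = in_dim * n_freqs * 2
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))  # complex MRI signal: real + imaginary part

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, in_dim) normalized spatial (or spatio-temporal) coords
        x = coords[..., None] * self.freqs              # (N, in_dim, n_freqs)
        enc = torch.cat([x.sin(), x.cos()], dim=-1).flatten(-2)
        return self.net(enc)
```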

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

  • paper_url: http://arxiv.org/abs/2312.00674
  • repo_url: None
  • paper_authors: Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang
  • for: 提高轻量级 CLIP 模型的表现,解决一些图像文本对不对应问题。
  • methods: 提出多级交互方法,包括:改进的全局实例级对齐(Global Instance-level Alignment)、宽松二分匹配(Relaxed Bipartite Matching)目标,以及掩码语言建模(Masked Language Modeling)辅助目标。
  • results: 在多个下游任务上实现更高的表现,而不增加执行时间成本。
    Abstract Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image encoders like ResNet50 and ViT, while the lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not strictly one-to-one correspondence, we improve the conventional global instance-level alignment objective by softening the label of negative samples progressively. Secondly, a relaxed bipartite matching based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of CLIP model does not increase correspondingly as the parameters of text encoder increase, an extra objective of masked language modeling (MLM) is leveraged for maximizing the potential of the shortened text encoder. In practice, an auxiliary fusion module injecting unmasked image embedding into masked text embedding at different network stages is proposed for enhancing the MLM. Extensive experiments show that without introducing additional computational cost during inference, the proposed method achieves a higher performance on multiple downstream tasks.
    摘要 类似 CLIP 的视觉语言预训练技术已经在多种下游任务上展现出扎实的表现,如零样本图像分类和图文检索。现有的大多数 CLIP 类工作通常采用相对较大的图像编码器(如 ResNet50 和 ViT),而轻量级的对应方案则少有讨论。在这篇论文中,我们提出了一种用于训练轻量级 CLIP 模型的多级交互范式。首先,为了缓解部分图文对并非严格一一对应的问题,我们改进了传统的全局实例级对齐目标,逐步软化负样本的标签。其次,我们引入了一种基于宽松二分匹配的词元级对齐目标,实现图像块与文本词之间更细粒度的对齐。此外,基于 CLIP 模型的精度并不随文本编码器参数的增加而相应提升这一观察,我们引入了掩码语言建模(MLM)这一额外目标,以最大限度发挥缩短后文本编码器的潜力。在实践中,我们提出了一种辅助融合模块,在网络的不同阶段将未掩码的图像嵌入注入到掩码文本嵌入中,以增强 MLM。大量实验表明,在推理时不引入额外计算成本的情况下,该方法在多个下游任务上取得了更高的性能。
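The progressive softening of negative labels can be expressed as a soft target distribution inside the usual CLIP contrastive loss. The sketch below uses a linear softening schedule with a cap of 0.1 as illustrative assumptions; the paper's exact schedule may differ.

```python
# Sketch of a softened instance-level alignment target (assumptions: the
# linear schedule and the 0.1 cap are illustrative, not the paper's rule).
import torch
import torch.nn.functional as F

def soft_clip_loss(img_emb, txt_emb, step, total_steps, tau=0.07):
    logits = (F.normalize(img_emb, dim=-1) @
              F.normalize(txt_emb, dim=-1).t()) / tau
    n = logits.size(0)
    # progressively soften: negatives get epsilon mass that grows with training
    eps = 0.1 * step / total_steps
    target = torch.full_like(logits, eps / (n - 1))
    target.fill_diagonal_(1.0 - eps)
    loss_i = -(target * logits.log_softmax(dim=-1)).sum(-1).mean()
    loss_t = -(target * logits.t().log_softmax(dim=-1)).sum(-1).mean()
    return 0.5 * (loss_i + loss_t)
```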

CellMixer: Annotation-free Semantic Cell Segmentation of Heterogeneous Cell Populations

  • paper_url: http://arxiv.org/abs/2312.00671
  • repo_url: None
  • paper_authors: Mehdi Naouar, Gabriel Kalweit, Anusha Klett, Yannick Vogt, Paula Silvestrini, Diana Laura Infante Ramirez, Roland Mertelsmann, Joschka Boedecker, Maria Kalweit
  • for: 本研究旨在开发一种无监督的细胞分类方法,以便在医学影像中自动分类不同类型的细胞。
  • methods: 我们提出了一种基于数据增强的免标注方法,可以从同质细胞群体的图像级标签中训练语义分割模型。
  • results: 我们的实验结果显示,CellMixer 可以在多种细胞类型和成像方式下实现有竞争力的分割性能,这表明该方法具有可扩展性和更广泛应用的潜力。
    Abstract In recent years, several unsupervised cell segmentation methods have been presented, trying to omit the requirement of laborious pixel-level annotations for the training of a cell segmentation model. Most if not all of these methods handle the instance segmentation task by focusing on the detection of different cell instances ignoring their type. While such models prove adequate for certain tasks, like cell counting, other applications require the identification of each cell's type. In this paper, we present CellMixer, an innovative annotation-free approach for the semantic segmentation of heterogeneous cell populations. Our augmentation-based method enables the training of a segmentation model from image-level labels of homogeneous cell populations. Our results show that CellMixer can achieve competitive segmentation performance across multiple cell types and imaging modalities, demonstrating the method's scalability and potential for broader applications in medical imaging, cellular biology, and diagnostics.
    摘要 近年来,出现了一些无监督细胞分割方法,试图省去训练细胞分割模型所需的繁重像素级标注。这些方法几乎都将其作为实例分割任务处理,专注于检测不同的细胞实例而忽略其类型。此类模型对某些任务(如细胞计数)已经足够,但其他应用需要识别每个细胞的类型。在这篇论文中,我们提出了 CellMixer,一种创新的免标注方法,用于异质细胞群体的语义分割。我们基于数据增强的方法可以从同质细胞群体的图像级标签中训练分割模型。我们的结果显示,CellMixer 可以在多种细胞类型和成像方式下实现有竞争力的分割性能,表明该方法具有可扩展性,并在医学成像、细胞生物学和诊断等领域具有更广泛的应用潜力。

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

  • paper_url: http://arxiv.org/abs/2312.00663
  • repo_url: None
  • paper_authors: Kangcheng Liu, Yong-Jin Liu, Kai Tang, Ming Liu, Baoquan Chen
  • for: 提高3D场景理解的数据效率学习和开放世界几个shot学习
  • methods: 提出一种通用且简洁的框架,通过层次化特征对齐的预训练与知识蒸馏,从大规模视觉语言模型中提取并蒸馏有用信息;并提出一种具有边界感知的基于能量的损失,以利用区域级边界预测。
  • results: 实验表明,该方法在室内和室外场景中均提升了 3D 场景理解的数据高效学习和开放世界小样本学习性能,并降低了对训练数据数量与质量的要求。
    Abstract Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which merely perform well in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To leverage the boundary information, we propose a novel energy-based loss with boundary awareness benefiting from the region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing.
    摘要 深度神经网络模型在封闭集设定和完整标签下的 3D 场景理解中已取得显著进展。然而,当前 3D 识别方法的主要瓶颈在于,它们无法在各类真实应用中识别训练类别之外的未见新类别。同时,当前最先进的 3D 场景理解方法大多需要高质量标签来训练神经网络,只有在全监督模式下才能表现良好。本工作提出了一种通用且简洁的框架,用于在标注场景十分有限时进行 3D 场景理解。为了从预训练的视觉语言模型中提取新类别的知识,我们提出了层次化特征对齐的预训练与知识蒸馏策略,从大规模视觉语言模型中提取并蒸馏有意义的信息,从而有益于开放词汇场景理解任务。为了利用边界信息,我们提出了一种新的具有边界感知的基于能量的损失,得益于区域级边界预测。为了促进潜在的实例判别并保证效率,我们提出了面向点云的无监督区域级语义对比学习方案,利用神经网络的高置信度预测在多个阶段对中间特征嵌入进行判别。在室内和室外场景上的大量实验证明了我们的方法在数据高效学习和开放世界小样本学习中的有效性。所有代码、模型和数据均公开于:https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing 。

Dual-Domain Multi-Contrast MRI Reconstruction with Synthesis-based Fusion Network

  • paper_url: http://arxiv.org/abs/2312.00661
  • repo_url: None
  • paper_authors: Junwei Yang, Pietro Liò
  • for: 提高多重对焦MRI重建的效率和质量
  • methods: 基于深度学习的双Domain重建框架,利用快速取得的参考对焦测量来估算受测对焦测量
  • results: 与现有方法比较,提高重建效率达8倍,并进行了完整的分析和剔除研究以评估方法的有效性
    Abstract Purpose: To develop an efficient dual-domain reconstruction framework for multi-contrast MRI, with the focus on minimising cross-contrast misalignment in both the image and the frequency domains to enhance optimisation. Theory and Methods: Our proposed framework, based on deep learning, facilitates the optimisation for under-sampled target contrast using fully-sampled reference contrast that is quicker to acquire. The method consists of three key steps: 1) Learning to synthesise data resembling the target contrast from the reference contrast; 2) Registering the multi-contrast data to reduce inter-scan motion; and 3) Utilising the registered data for reconstructing the target contrast. These steps involve learning in both domains with regularisation applied to ensure their consistency. We also compare the reconstruction performance with existing deep learning-based methods using a dataset of brain MRI scans. Results: Extensive experiments demonstrate the superiority of our proposed framework, for up to an 8-fold acceleration rate, compared to state-of-the-art algorithms. Comprehensive analysis and ablation studies further present the effectiveness of the proposed components. Conclusion:Our dual-domain framework offers a promising approach to multi-contrast MRI reconstruction. It can also be integrated with existing methods to further enhance the reconstruction.
    摘要 目的:开发一个高效的双域重建框架,用于多对比度 MRI 重建,重点在图像域和频率域中同时最小化跨对比度错位,以增强优化。理论与方法:我们提出的框架基于深度学习,利用采集更快的全采样参考对比度,来优化欠采样目标对比度的重建。方法包括三个关键步骤:1)学习从参考对比度合成与目标对比度相似的数据;2)对多对比度数据进行配准以减少扫描间运动;3)利用配准后的数据重建目标对比度。这些步骤都在两个域中进行学习,并施加正则化以确保二者的一致性。我们还在一个脑部 MRI 扫描数据集上,与现有的基于深度学习的方法比较了重建性能。结果:大量实验表明,在最高 8 倍加速下,我们提出的框架优于最先进的算法。全面的分析和消融研究进一步展示了各组件的有效性。结论:我们的双域框架为多对比度 MRI 重建提供了一种有前景的途径,并且可以与现有方法集成以进一步增强重建。

SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers

  • paper_url: http://arxiv.org/abs/2312.00648
  • repo_url: https://github.com/gkakogeorgiou/spot
  • paper_authors: Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, Nikos Komodakis
  • for: 本研究旨在提高无监督对象中心学习中场景的可解释性,通过将场景分解成可解释的对象实体(槽)。
  • methods: 本研究使用了两种新技术:首先,一种基于注意力的自训练方法,将解码器中更优的基于槽的注意力掩码蒸馏到编码器中,从而提升对象分割;其次,一种创新的块顺序(patch-order)排列策略,用于自回归 transformer,使其在重建过程中更加依赖槽向量。
  • results: 实验表明,这两种技术的组合可以大幅提高无监督对象中心学习中的对象分割精度,特别是使用了复杂的实际图像。 codes 的实现可以在 https://github.com/gkakogeorgiou/spot 中找到。
    Abstract Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at https://github.com/gkakogeorgiou/spot .
    摘要 无监督对象中心学习的目标是将场景分解为可解释的对象实体,称为槽(slots)。基于槽的自编码器是这一任务中最突出的方法之一。其中的关键在于:引导编码器生成对象特定的槽,并确保解码器在重建过程中使用这些槽。本研究提出了两种新技术:(i)一种基于注意力的自训练方法,将解码器中更优的基于槽的注意力掩码蒸馏到编码器,增强对象分割;(ii)一种创新的块顺序排列策略,用于自回归 transformer,强化槽向量在重建中的作用。我们通过实验证明了这些策略的有效性。这种组合方法在无监督对象分割方面显著超越了之前的基于槽的自编码器方法,特别是在复杂的真实世界图像上。我们在 https://github.com/gkakogeorgiou/spot 提供了实现代码。
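The patch-order permutation can be pictured as shuffling the token sequence the autoregressive decoder must reconstruct, so it cannot rely on a fixed raster order and has to lean on the slot vectors instead. A minimal sketch under that reading, with a generic teacher-forcing interface assumed:

```python
# Sketch of feeding an autoregressive decoder patch sequences in a permuted
# order (assumption: a generic teacher-forcing interface; SPOT's exact
# permutation choices follow the paper).
import torch

def permuted_autoregressive_targets(patch_tokens: torch.Tensor,
                                    generator: torch.Generator | None = None):
    """patch_tokens: (B, N, D). Returns inputs/targets in a random patch
    order, so the decoder cannot exploit a fixed raster scan."""
    B, N, _ = patch_tokens.shape
    perm = torch.randperm(N, generator=generator,
                          device=patch_tokens.device)
    shuffled = patch_tokens[:, perm]          # same permutation for the batch
    inputs = shuffled[:, :-1]                 # teacher forcing: predict token
    targets = shuffled[:, 1:]                 # t+1 from tokens up to t
    return inputs, targets, perm
```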

QAFE-Net: Quality Assessment of Facial Expressions with Landmark Heatmaps

  • paper_url: http://arxiv.org/abs/2312.00856
  • repo_url: https://github.com/shuchaoduan/qafe-net
  • paper_authors: Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
  • for: 这篇论文旨在评估帕金森病患者的表情质量。
  • methods: 该方法将时序关键点热图与 RGB 数据相结合,以捕捉细微的面部肌肉运动,并将其编码后映射为严重度分数。
  • results: 对于PFED5数据集和UNBC-McMaster Shoulder Pain Expression Archive Database的对比实验,该方法表现出色,与State-of-the-Art动作质量评估方法在PFED5数据集上具有更高的准确率,并在UNBC-McMaster数据集上实现了较低的平均绝对误差。
    Abstract Facial expression recognition (FER) methods have made great inroads in categorising moods and feelings in humans. Beyond FER, pain estimation methods assess levels of intensity in pain expressions, however assessing the quality of all facial expressions is of critical value in health-related applications. In this work, we address the quality of five different facial expressions in patients affected by Parkinson's disease. We propose a novel landmark-guided approach, QAFE-Net, that combines temporal landmark heatmaps with RGB data to capture small facial muscle movements that are encoded and mapped to severity scores. The proposed approach is evaluated on a new Parkinson's Disease Facial Expression dataset (PFED5), as well as on the pain estimation benchmark, the UNBC-McMaster Shoulder Pain Expression Archive Database. Our comparative experiments demonstrate that the proposed method outperforms SOTA action quality assessment works on PFED5 and achieves lower mean absolute error than the SOTA pain estimation methods on UNBC-McMaster. Our code and the new PFED5 dataset are available at https://github.com/shuchaoduan/QAFE-Net.
    摘要 面部表情识别(FER)方法在人类情绪与感受的分类方面取得了长足进展。在 FER 之外,疼痛估计方法评估疼痛表情的强度水平,然而评估各类面部表情的质量在健康相关应用中具有关键价值。在本工作中,我们研究帕金森病患者五种不同面部表情的质量。我们提出了一种新颖的关键点引导方法 QAFE-Net,它将时序关键点热图与 RGB 数据相结合,以捕捉细微的面部肌肉运动,并将其编码后映射为严重度分数。该方法在新的帕金森病面部表情数据集(PFED5)以及疼痛估计基准 UNBC-McMaster Shoulder Pain Expression Archive Database 上进行了评估。对比实验表明,我们的方法在 PFED5 上超越了最先进的动作质量评估方法,并在 UNBC-McMaster 上取得了低于最先进疼痛估计方法的平均绝对误差。我们的代码和新的 PFED5 数据集可在 https://github.com/shuchaoduan/QAFE-Net 获取。
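The landmark stream amounts to rendering each detected facial landmark as a small Gaussian bump. A typical heatmap generator is shown below; the heatmap size and sigma are common defaults, not necessarily the paper's settings.

```python
# Generating per-landmark Gaussian heatmaps of the kind used as the temporal
# landmark stream (assumptions: heatmap size and sigma are typical defaults).
import torch

def landmark_heatmaps(landmarks: torch.Tensor, size: int = 64,
                      sigma: float = 2.0) -> torch.Tensor:
    """landmarks: (L, 2) pixel coordinates in [0, size). Returns (L, size, size)."""
    ys = torch.arange(size).view(1, size, 1).float()
    xs = torch.arange(size).view(1, 1, size).float()
    cx = landmarks[:, 0].view(-1, 1, 1)
    cy = landmarks[:, 1].view(-1, 1, 1)
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2       # squared distance to landmark
    return torch.exp(-d2 / (2 * sigma ** 2))   # Gaussian bump per landmark
```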

EvE: Exploiting Generative Priors for Radiance Field Enrichment

  • paper_url: http://arxiv.org/abs/2312.00639
  • repo_url: None
  • paper_authors: Karim Kassab, Antoine Schnepf, Jean-Yves Franceschi, Laurent Caraffa, Jeremie Mary, Valérie Gouet-Brunet
  • for: 从无约束的野外图像集合中建模大规模场景是计算机视觉中的一项重要挑战。现有处理野外神经渲染的方法均在封闭世界设定下运行,即知识仅限于训练集中捕获的场景图像。我们提出 EvE,据我们所知,这是第一个利用生成先验来改进野外场景建模的方法。
  • methods: 我们使用预训练的生成网络为 K-Planes 表示注入外部知识。为此,我们定义了一种交替训练过程,对在训练集上训练的 K-Planes 进行优化引导。
  • results: EvE 以更丰富的细节增强渲染场景,在真实旅游照片集合上超越了现有最先进的野外新视角合成方法。
    Abstract Modeling large-scale scenes from unconstrained image collections in-the-wild has proven to be a major challenge in computer vision. Existing methods tackling in-the-wild neural rendering operate in a closed-world setting, where knowledge is limited to a scene's captured images within a training set. We propose EvE, which is, to the best of our knowledge, the first method leveraging generative priors to improve in-the-wild scene modeling. We employ pre-trained generative networks to enrich K-Planes representations with extrinsic knowledge. To this end, we define an alternating training procedure to conduct optimization guidance of K-Planes trained on the training set. We carry out extensive experiments and verify the merit of our method on synthetic data as well as real tourism photo collections. EvE enhances rendered scenes with richer details and outperforms the state of the art on the task of novel view synthesis in-the-wild. Our project page can be found at https://eve-nvs.github.io .
    摘要 从无约束的野外图像集合中建模大规模场景一直是计算机视觉中的重大挑战。现有处理野外神经渲染的方法都在封闭世界设定下运行,即知识仅限于训练集中捕获的场景图像。我们提出 EvE,据我们所知,这是第一个利用生成先验来改进野外场景建模的方法。我们使用预训练的生成网络,为 K-Planes 表示注入外部知识;为此,我们定义了一种交替训练过程,对在训练集上训练的 K-Planes 进行优化引导。我们在合成数据以及真实旅游照片集合上进行了大量实验,验证了方法的优势。EvE 以更丰富的细节增强渲染场景,并在野外新视角合成任务上超越了现有最先进方法。项目主页:https://eve-nvs.github.io 。

A Recent Survey of Vision Transformers for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2312.00634
  • repo_url: None
  • paper_authors: Asifullah Khan, Zunaira Rauf, Abdul Rehman Khan, Saima Rathore, Saddam Hussain Khan, Sahar Shah, Umair Farooq, Hifsa Asif, Aqsa Asif, Umme Zahoora, Rafi Ullah Khalil, Suleman Qamar, Umme Hani Asif, Faiza Babar Khan, Abdul Majid, Jeonghwan Gwak
  • for: 这个论文主要为了探讨最新的医学图像分割技术,帮助研究人员更好地理解和应用这些技术。
  • methods: 这个论文使用的方法包括视Transformers(ViT)和卷积神经网络(CNN)的组合,以及不同的模型和数据集。
  • results: 这个论文提出了一些最新的医学图像分割方法,包括使用ViT和CNN的组合来提高图像分割精度,以及在各种医学图像模式下实现实时应用。
    Abstract Medical image segmentation plays a crucial role in various healthcare applications, enabling accurate diagnosis, treatment planning, and disease monitoring. In recent years, Vision Transformers (ViTs) have emerged as a promising technique for addressing the challenges in medical image segmentation. In medical images, structures are usually highly interconnected and globally distributed. ViTs utilize their multi-scale attention mechanism to model the long-range relationships in the images. However, they do lack image-related inductive bias and translational invariance, potentially impacting their performance. Recently, researchers have come up with various ViT-based approaches that incorporate CNNs in their architectures, known as Hybrid Vision Transformers (HVTs) to capture local correlation in addition to the global information in the images. This survey paper provides a detailed review of the recent advancements in ViTs and HVTs for medical image segmentation. Along with the categorization of ViT and HVT-based medical image segmentation approaches we also present a detailed overview of their real-time applications in several medical image modalities. This survey may serve as a valuable resource for researchers, healthcare practitioners, and students in understanding the state-of-the-art approaches for ViT-based medical image segmentation.
    摘要 医学图像分割在各种医疗应用中扮演着关键角色,支撑精确诊断、治疗规划和疾病监测。近年来,视觉 transformer(ViT)已成为应对医学图像分割挑战的一种有前景的技术。医学图像中的结构通常高度互联且呈全局分布,ViT 利用其多尺度注意力机制来建模图像中的长距离关系。然而,ViT 缺乏与图像相关的归纳偏置和平移不变性,可能影响其性能。最近,研究人员提出了多种在架构中结合 CNN 的 ViT 方法,称为混合视觉 transformer(HVT),以在捕捉图像全局信息之外兼顾局部相关性。本综述详细回顾了 ViT 和 HVT 在医学图像分割方面的最新进展;除了对基于 ViT 和 HVT 的医学图像分割方法进行分类之外,我们还详细概述了它们在多种医学影像模态中的实时应用。本综述可作为研究人员、医疗从业者和学生了解基于 ViT 的医学图像分割最先进方法的宝贵资源。

Rethinking the Domain Gap in Near-infrared Face Recognition

  • paper_url: http://arxiv.org/abs/2312.00627
  • repo_url: None
  • paper_authors: Michail Tarasiou, Jiankang Deng, Stefanos Zafeiriou
  • for: 这篇论文主要是为了解决异类面部识别(HFR)问题,即将视觉频谱(VIS)和近红外频谱(NIR)视觉域的图像匹配。
  • methods: 与现有大多数文献不同,该论文不直接着力于弥合域间差距,而是采用在大规模同质 VIS 数据上预训练的大型神经网络,并辅以正则化微调,来处理 HFR 问题。
  • results: 该论文在四个公共数据集上进行验证,取得了匹配或超越当前最先进方法的结果。
    Abstract Heterogeneous face recognition (HFR) involves the intricate task of matching face images across the visual domains of visible (VIS) and near-infrared (NIR). While much of the existing literature on HFR identifies the domain gap as a primary challenge and directs efforts towards bridging it at either the input or feature level, our work deviates from this trend. We observe that large neural networks, unlike their smaller counterparts, when pre-trained on large scale homogeneous VIS data, demonstrate exceptional zero-shot performance in HFR, suggesting that the domain gap might be less pronounced than previously believed. By approaching the HFR problem as one of low-data fine-tuning, we introduce a straightforward framework: comprehensive pre-training, succeeded by a regularized fine-tuning strategy, that matches or surpasses the current state-of-the-art on four publicly available benchmarks. Corresponding codes can be found at https://github.com/michaeltrs/RethinkNIRVIS.
    摘要 异质人脸识别(HFR)涉及在可见光(VIS)与近红外(NIR)视觉域之间匹配人脸图像这一复杂任务。现有 HFR 文献大多将域间差距视为主要挑战,并致力于在输入层或特征层弥合它,而我们的工作偏离了这一趋势。我们观察到,与较小的网络不同,大型神经网络在大规模同质 VIS 数据上预训练后,在 HFR 中表现出卓越的零样本性能,这表明域间差距可能没有以前认为的那么显著。我们将 HFR 问题视为一个小数据微调问题,提出了一个简洁的框架:全面的预训练,随后进行正则化微调,在四个公开基准上匹配或超越了当前最先进水平。相关代码可以在 https://github.com/michaeltrs/RethinkNIRVIS 找到。

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

  • paper_url: http://arxiv.org/abs/2312.00853
  • repo_url: https://github.com/ianyeung/mgld-vsr
  • paper_authors: Xi Yang, Chenhang He, Jianqi Ma, Lei Zhang
  • for: 本研究旨在提出一种高质量的真实世界视频超分辨(VSR)算法,高质量地恢复低分辨率视频中的细节与纹理。
  • methods: 我们提出了一种基于潜在扩散模型的 VSR 算法,利用其生成真实细节的强大能力,并通过运动引导损失优化潜在采样路径来控制生成内容,保证相邻帧之间的时间一致性。
  • results: 我们的方法在真实世界 VSR 基准数据集上取得了显著的感知质量提升,验证了模型设计与训练策略的有效性。
    Abstract Real-world low-resolution (LR) videos have diverse and complex degradations, imposing great challenges on video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, the diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process has randomness, making it hard to control the contents of restored images. This issue becomes more serious when applying diffusion models to VSR tasks because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. To ensure the content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss, ensuring that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert temporal module to the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-arts on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
    摘要 真实世界低分辨率(LR)视频具有多样而复杂的退化,对视频超分辨(VSR)算法高质量重建其高分辨率(HR)版本提出了巨大挑战。近来,扩散模型在图像修复任务中展现了生成真实细节的出色能力,但扩散过程带有随机性,难以控制修复图像的内容;在 VSR 任务中该问题更为严重,因为时间一致性对视频的感知质量至关重要。本文提出了一种利用预训练潜在扩散模型的高效真实世界 VSR 算法。为保证相邻帧间的内容一致性,我们利用 LR 视频中的时间动态来引导扩散过程,通过运动引导损失优化潜在采样路径,使生成的 HR 视频保持连贯且连续的视觉流。为进一步缓解生成细节的不连续性,我们在解码器中插入时间模块,并用一种新颖的面向序列的损失对其进行微调。所提出的基于运动引导潜在扩散(MGLD)的 VSR 算法在真实世界 VSR 基准数据集上取得了显著优于现有最先进方法的感知质量,验证了模型设计与训练策略的有效性。
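The motion-guided idea can be pictured as warping the previous frame's latent with optical flow and penalizing disagreement with the current latent. The sketch below assumes precomputed flow at latent resolution; in the paper, such guidance is applied along the diffusion sampling path rather than as a plain training loss.

```python
# Sketch of a motion-guided consistency term between adjacent latents
# (assumptions: flow is precomputed and expressed in latent-resolution
# pixels; the paper applies this along the diffusion sampling path).
import torch
import torch.nn.functional as F

def motion_guided_loss(z_prev, z_curr, flow):
    """z_prev, z_curr: (B, C, H, W) latents; flow: (B, 2, H, W) prev->curr."""
    B, _, H, W = z_prev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack([xs, ys]).float().to(z_prev.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                        # warp targets
    grid = torch.stack([coords[:, 0] / (W - 1) * 2 - 1,      # normalize to
                        coords[:, 1] / (H - 1) * 2 - 1],     # [-1, 1]
                       dim=-1)
    warped = F.grid_sample(z_prev, grid, align_corners=True)
    return F.l1_loss(warped, z_curr)
```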

Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion

  • paper_url: http://arxiv.org/abs/2312.00852
  • repo_url: None
  • paper_authors: Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu
  • for: 解决 inverse problems 中的 sampling 问题,使用 latent diffusion 模型。
  • methods: 提出 Second-order Tweedie sampler from Surrogate Loss (STSL),使用 second-order approximation 实现高效的 posterior sampling。
  • results: 与现有求解器 PSLD 和 P2L 相比,STSL 在 FFHQ、ImageNet 和 COCO 基准上分别减少了 4 倍和 8 倍的神经函数评估次数,同时提升了采样质量。此外,STSL 还可应用于文本引导的图像编辑,并能处理领先的文本引导图像编辑方法在受损图像中遗留的残余失真。
    Abstract Sampling from the posterior distribution poses a major computational challenge in solving inverse problems using latent diffusion models. Common methods rely on Tweedie's first-order moments, which are known to induce a quality-limiting bias. Existing second-order approximations are impractical due to prohibitive computational costs, making standard reverse diffusion processes intractable for posterior sampling. This paper introduces Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency comparable to first-order Tweedie with a tractable reverse process using second-order approximation. Our theoretical results reveal that the second-order approximation is lower bounded by our surrogate loss that only requires $O(1)$ compute using the trace of the Hessian, and by the lower bound we derive a new drift term to make the reverse process tractable. Our method surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural function evaluations, respectively, while notably enhancing sampling quality on FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to text-guided image editing and addresses residual distortions present from corrupted images in leading text-guided image editing methods. To our best knowledge, this is the first work to offer an efficient second-order approximation in solving inverse problems using latent diffusion and editing real-world images with corruptions.
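The abstract notes that the surrogate loss needs only O(1) compute via the trace of the Hessian. A standard way to obtain such a trace with automatic differentiation is Hutchinson's estimator, sketched below; how STSL combines this term with the first-order Tweedie estimate is not reproduced here, so treat this only as an illustration of the O(1) trace probe.

```python
# Hutchinson's trace estimator: E[v^T H v] = tr(H) for Rademacher probes v.
import torch

def hessian_trace(f, x, n_probes=1):
    """Estimate tr(d^2 f / dx^2) for a scalar-valued f using autograd."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(f(x), x, create_graph=True)[0]
    est = 0.0
    for _ in range(n_probes):
        v = torch.randint_like(x, 2) * 2.0 - 1.0      # Rademacher +/-1 vector
        hvp = torch.autograd.grad(grad, x, grad_outputs=v, retain_graph=True)[0]
        est = est + (v * hvp).sum()
    return est / n_probes

# toy check: f(x) = ||x||^2 has Hessian 2I, so the trace is 2 * dim
x = torch.randn(8)
print(hessian_trace(lambda t: (t ** 2).sum(), x, n_probes=64))  # ~16
```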

Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers

  • paper_url: http://arxiv.org/abs/2312.00593
  • repo_url: None
  • paper_authors: Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein, Klaus Schoeffmann
  • for: developing an accurate event recognition model for laparoscopic surgery videos, with applications in surgical training, intra-operative complication prediction, and post-operative assessment.
  • methods: introduces a comprehensive, carefully annotated laparoscopic gynecology video dataset, and proposes a hybrid transformer architecture with a customized training-inference framework that exploits inter-frame dependencies to improve event recognition accuracy.
  • results: extensive experiments demonstrate the superiority of the proposed method over conventional CNN-RNN architectures, with reliable recognition across varying surgical scenes and surgeon skill levels.
    Abstract Analyzing laparoscopic surgery videos presents a complex and multifaceted challenge, with applications including surgical training, intra-operative surgical complication prediction, and post-operative surgical assessment. Identifying crucial events within these videos is a significant prerequisite in a majority of these applications. In this paper, we introduce a comprehensive dataset tailored for relevant event recognition in laparoscopic gynecology videos. Our dataset includes annotations for critical events associated with major intra-operative challenges and post-operative complications. To validate the precision of our annotations, we assess event recognition performance using several CNN-RNN architectures. Furthermore, we introduce and evaluate a hybrid transformer architecture coupled with a customized training-inference framework to recognize four specific events in laparoscopic surgery videos. Leveraging the Transformer networks, our proposed architecture harnesses inter-frame dependencies to counteract the adverse effects of relevant content occlusion, motion blur, and surgical scene variation, thus significantly enhancing event recognition accuracy. Moreover, we present a frame sampling strategy designed to manage variations in surgical scenes and the surgeons' skill level, resulting in event recognition with high temporal resolution. We empirically demonstrate the superiority of our proposed methodology in event recognition compared to conventional CNN-RNN architectures through a series of extensive experiments.

Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)

  • paper_url: http://arxiv.org/abs/2312.00592
  • repo_url: None
  • paper_authors: Emma Cramer, Jonas Reiher, Sebastian Trimpe
  • for: evaluating spatial autoencoders (SAEs), which provide low-dimensional representations for robot control tasks in reinforcement learning (RL).
  • methods: proposes a computationally lightweight metric that measures how well SAE keypoints track ground-truth objects in images, and uses it to evaluate common baseline SAE architectures.
  • results: common SAEs differ substantially in their object-tracking capability, performance on the metric correlates directly with downstream RL performance, and three architectural modifications are identified that improve tracking.
    Abstract Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance. We make our code available at anonymous.4open.science/r/sae-rl.
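A plausible form of such a tracking metric (the paper's exact definition may differ): match each ground-truth object position to its nearest predicted keypoint and average that distance over objects and frames.

```python
# Minimal keypoint-tracking metric sketch (assumed form): lower is better.
import numpy as np

def tracking_error(keypoints, objects):
    """keypoints: (T, K, 2) SAE keypoints; objects: (T, M, 2) ground-truth
    object positions, both in image coordinates."""
    errors = []
    for kp_t, obj_t in zip(keypoints, objects):
        # pairwise distances (M, K) between objects and keypoints
        d = np.linalg.norm(obj_t[:, None, :] - kp_t[None, :, :], axis=-1)
        errors.append(d.min(axis=1).mean())   # nearest keypoint per object
    return float(np.mean(errors))

kp = np.random.rand(100, 16, 2) * 64   # 100 frames, 16 keypoints
ob = np.random.rand(100, 3, 2) * 64    # 3 tracked objects
print(tracking_error(kp, ob))
```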

Generative models for visualising abstract social processes: Guiding streetview image synthesis of StyleGAN2 with indices of deprivation

  • paper_url: http://arxiv.org/abs/2312.00570
  • repo_url: None
  • paper_authors: Aleksi Knuutila
  • for: studying the visual aspects of social processes with generative adversarial networks (GANs).
  • methods: trains a StyleGAN2 model on a custom dataset of 14,564 Google Streetview images of London and compares three techniques for inverting the training images into the model's latent space.
  • results: by conditioning synthesis on Indices of Multiple Deprivation metadata describing income, health, and environmental quality, the model generates images that reveal recurring visual differences between deprived and privileged areas of London, visual features of social processes that were previously unknown and difficult to study.
    Abstract This paper presents a novel application of Generative Adversarial Networks (GANs) to study visual aspects of social processes. I train a StyleGAN2 model on a custom dataset of 14,564 images of London, sourced from Google Streetview. After training, I invert the images in the training set, finding points in the model's latent space that correspond to them, and compare results from three inversion techniques. I connect each data point with metadata from the Indices of Multiple Deprivation, describing income, health and environmental quality in the area where the photographs were taken. It is then possible to map which parts of the model's latent space encode visual features that are distinctive for health, income and environmental quality, and condition the synthesis of new images based on these factors. The synthetic images created reflect visual features of social processes that were previously unknown and difficult to study, describing recurring visual differences between deprived and privileged areas in London. GANs are known for their capability to produce a continuous range of images that exhibit visual differences. The paper tests how to exploit this ability through visual comparisons in still images as well as through an interactive website where users can guide image synthesis with sliders. Though conditioned synthesis has its limitations and the results are difficult to validate, the paper points to the potential for generative models to be repurposed as parts of social scientific methods.

Physics Inspired Criterion for Pruning-Quantization Joint Learning

  • paper_url: http://arxiv.org/abs/2312.00851
  • repo_url: https://github.com/fanxxxxyi/pic-pq
  • paper_authors: Weiying Xie, Xiaoyi Fan, Xin Zhang, Yunsong Li, Jie Lei, Leyuan Fang
  • for: proposing a novel physics-inspired criterion for pruning-quantization joint learning (PIC-PQ) to ease the deployment of deep neural networks (DNNs) on resource-constrained edge devices.
  • methods: derived from Hooke's law in elasticity dynamics, the physics-inspired criterion (PIC) establishes a linear relationship between a filter's importance distribution and its filter property (FP) via a learnable deformation scale, extended with a relative shift variable for a global view; an available maximum bitwidth and a penalty factor are introduced in quantization bitwidth assignment to ensure feasibility and flexibility.
  • results: experiments on image classification benchmarks show a good trade-off between accuracy and bit-operations (BOPs) compression ratio, e.g., a 54.96X BOPs compression ratio for ResNet56 on CIFAR10 with only a 0.10% accuracy drop, and 53.24X for ResNet18 on ImageNet with a 0.61% drop.
    Abstract Pruning-quantization joint learning always facilitates the deployment of deep neural networks (DNNs) on resource-constrained edge devices. However, most existing methods do not jointly learn a global criterion for pruning and quantization in an interpretable way. In this paper, we propose a novel physics inspired criterion for pruning-quantization joint learning (PIC-PQ), which is explored from an analogy we first draw between elasticity dynamics (ED) and model compression (MC). Specifically, derived from Hooke's law in ED, we establish a linear relationship between the filters' importance distribution and the filter property (FP) by a learnable deformation scale in the physics inspired criterion (PIC). Furthermore, we extend PIC with a relative shift variable for a global view. To ensure feasibility and flexibility, available maximum bitwidth and penalty factor are introduced in quantization bitwidth assignment. Experiments on benchmarks of image classification demonstrate that PIC-PQ yields a good trade-off between accuracy and bit-operations (BOPs) compression ratio (e.g., a 54.96X BOPs compression ratio in ResNet56 on CIFAR10 with a 0.10% accuracy drop and 53.24X in ResNet18 on ImageNet with a 0.61% accuracy drop). The code will be available at https://github.com/fanxxxxyi/PIC-PQ.
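A hedged sketch of the criterion's Hooke's-law analogy (F = kx): importance is modeled as a learnable linear function of a filter property, plus a relative shift. The choice of the L1 norm as the filter property below is an assumption for illustration only.

```python
# Hedged sketch of the physics-inspired criterion: importance = scale * FP + shift.
import torch
import torch.nn as nn

class PICriterion(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))   # learnable deformation scale
        self.shift = nn.Parameter(torch.zeros(1))  # relative shift for a global view

    def forward(self, conv_weight):
        # conv_weight: (out_channels, in_channels, kH, kW)
        fp = conv_weight.abs().sum(dim=(1, 2, 3))  # assumed filter property (L1)
        return self.scale * fp + self.shift        # per-filter importance

crit = PICriterion()
w = torch.randn(64, 32, 3, 3)
importance = crit(w)                # rank filters; prune / assign low bitwidths
print(importance.topk(8).indices)   # e.g. the 8 most important filters
```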

Domain Adaptive Imitation Learning with Visual Observation

  • paper_url: http://arxiv.org/abs/2312.00548
  • repo_url: https://github.com/sunghochoi122/D3IL
  • paper_authors: Sungho Choi, Seungyul Han, Woojun Kim, Jongseong Chae, Whiyoung Jung, Youngchul Sung
  • for: domain-adaptive imitation learning, where an agent in a target domain learns a task by observing expert demonstrations in a source domain.
  • methods: proposes a framework based on dual feature extraction and image reconstruction that extracts domain-independent behavioral features from visual observations for training the learner under domain shift.
  • results: empirical results show the approach outperforms previous algorithms for cross-domain imitation learning with visual observation.
    Abstract In this paper, we consider domain-adaptive imitation learning with visual observation, where an agent in a target domain learns to perform a task by observing expert demonstrations in a source domain. Domain adaptive imitation learning arises in practical scenarios where a robot, receiving visual sensory data, needs to mimic movements by visually observing other robots from different angles or observing robots of different shapes. To overcome the domain shift in cross-domain imitation learning with visual observation, we propose a novel framework for extracting domain-independent behavioral features from input observations that can be used to train the learner, based on dual feature extraction and image reconstruction. Empirical results demonstrate that our approach outperforms previous algorithms for imitation learning from visual observation with domain shift.

LiDAR-based curb detection for ground truth annotation in automated driving validation

  • paper_url: http://arxiv.org/abs/2312.00534
  • repo_url: None
  • paper_authors: Jose Luis Apellániz, Mikel García, Nerea Aranjuelo, Javier Barandiarán, Marcos Nieto
  • for: detecting 3D curbs in sequences of LiDAR point clouds to produce pre-annotations that let labelling pipelines generate curb-related ground truth data efficiently.
  • methods: two main steps: (1) detecting curbs in each scan with a segmentation deep neural network, and (2) estimating the 3D curbs in the reconstructed point cloud using the vehicle's odometry.
  • results: the automatically generated pre-annotations reduce manual annotation time by 50.99% while maintaining the data quality level.
    Abstract Curb detection is essential for environmental awareness in Automated Driving (AD), as it typically limits drivable and non-drivable areas. Annotated data are necessary for developing and validating an AD function. However, the number of public datasets with annotated point cloud curbs is scarce. This paper presents a method for detecting 3D curbs in a sequence of point clouds captured from a LiDAR sensor, which consists of two main steps. First, our approach detects the curbs at each scan using a segmentation deep neural network. Then, a sequence-level processing step estimates the 3D curbs in the reconstructed point cloud using the odometry of the vehicle. From these 3D points of the curb, we obtain polylines structured following ASAM OpenLABEL standard. These detections can be used as pre-annotations in labelling pipelines to efficiently generate curb-related ground truth data. We validate our approach through an experiment in which different human annotators were required to annotate curbs in a group of LiDAR-based sequences with and without our automatically generated pre-annotations. The results show that the manual annotation time is reduced by 50.99% thanks to our detections, keeping the data quality level.

DeepDR: Deep Structure-Aware RGB-D Inpainting for Diminished Reality

  • paper_url: http://arxiv.org/abs/2312.00532
  • repo_url: None
  • paper_authors: Christina Gsaxner, Shohei Mori, Dieter Schmalstieg, Jan Egger, Gerhard Paar, Werner Bailer, Denis Kalkofen
  • for: proposing an RGB-D inpainting framework for diminished reality that removes real objects from the environment and plausibly completes both color and 3D geometry.
  • methods: a structure-aware generative network that explicitly conditions color and depth outputs on scene semantics, running at real-time frame rates with minimal temporal artifacts.
  • results: experiments show the proposed framework outperforms related work both qualitatively and quantitatively.
    Abstract Diminished reality (DR) refers to the removal of real objects from the environment by virtually replacing them with their background. Modern DR frameworks use inpainting to hallucinate unobserved regions. While recent deep learning-based inpainting is promising, the DR use case is complicated by the need to generate coherent structure and 3D geometry (i.e., depth), in particular for advanced applications, such as 3D scene editing. In this paper, we propose DeepDR, a first RGB-D inpainting framework fulfilling all requirements of DR: Plausible image and geometry inpainting with coherent structure, running at real-time frame rates, with minimal temporal artifacts. Our structure-aware generative network allows us to explicitly condition color and depth outputs on the scene semantics, overcoming the difficulty of reconstructing sharp and consistent boundaries in regions with complex backgrounds. Experimental results show that the proposed framework can outperform related work qualitatively and quantitatively.

Algorithm-based diagnostic application for diabetic retinopathy detection

  • paper_url: http://arxiv.org/abs/2312.00529
  • repo_url: None
  • paper_authors: Agnieszka Cisek, Karolina Korycinska, Leszek Pyziak, Marzena Malicka, Tomasz Wiecek, Grzegorz Gruzel, Kamil Szmuc, Jozef Cebulski, Mariusz Spyra
  • for: developing an automatic method for recognizing diabetic retinopathy (DR) to improve its early detection and help prevent diabetes-related visual impairment.
  • methods: analyzes ophthalmoscopic images with neural networks, deep learning, and image analysis algorithms; morphological algorithms identify the optic disc and DR lesions such as microaneurysms, hemorrhages, and exudates.
  • results: automated DR diagnosis can improve the efficiency of early detection and contribute to reducing diabetes-related visual impairment; a GUI application lets cooperating ophthalmology offices upload retinal images to a server for analysis and diagnosis.
    Abstract Diabetic retinopathy (DR) is a growing health problem worldwide and is a leading cause of visual impairment and blindness, especially among working people aged 20-65. Its incidence is increasing along with the number of diabetes cases, and it is more common in developed countries than in developing countries. Recent research in the field of diabetic retinopathy diagnosis is using advanced technologies, such as analysis of images obtained by ophthalmoscopy. Automatic methods for analyzing eye images based on neural networks, deep learning and image analysis algorithms can improve the efficiency of diagnosis. This paper describes an automatic DR diagnosis method that includes processing and analysis of ophthalmoscopic images of the eye. It uses morphological algorithms to identify the optic disc and lesions characteristic of DR, such as microaneurysms, hemorrhages and exudates. Automated DR diagnosis has the potential to improve the efficiency of early detection of this disease and contribute to reducing the number of cases of diabetes-related visual impairment. The final step was to create an application with a graphical user interface that allowed retinal images taken at cooperating ophthalmology offices to be uploaded to the server. These images were then analyzed using a developed algorithm to make a diagnosis.

Global Localization: Utilizing Relative Spatio-Temporal Geometric Constraints from Adjacent and Distant Cameras

  • paper_url: http://arxiv.org/abs/2312.00500
  • repo_url: None
  • paper_authors: Mohammad Altillawi, Zador Pataki, Shile Li, Ziyuan Liu
  • for: re-localizing a camera from a single image in a previously mapped area, estimating the 6 DoF camera pose relative to a global frame for robotics and augmented/virtual reality applications.
  • methods: guides the training of a deep localization network with a novel set of relative spatio-temporal geometric constraints, using relative pose constraints obtained not only from adjacent camera frames but also from frames that are distant in the spatio-temporal space of the scene.
  • results: the method learns to localize with little or very sparse ground-truth 3D coordinates (less than 1% of the available ground-truth data) and outperforms other direct pose estimation methods on 3 common visual localization datasets.
    Abstract Re-localizing a camera from a single image in a previously mapped area is vital for many computer vision applications in robotics and augmented/virtual reality. In this work, we address the problem of estimating the 6 DoF camera pose relative to a global frame from a single image. We propose to leverage a novel network of relative spatial and temporal geometric constraints to guide the training of a Deep Network for localization. We employ simultaneously spatial and temporal relative pose constraints that are obtained not only from adjacent camera frames but also from camera frames that are distant in the spatio-temporal space of the scene. We show that our method, through these constraints, is capable of learning to localize when little or very sparse ground-truth 3D coordinates are available. In our experiments, this is less than 1% of available ground-truth data. We evaluate our method on 3 common visual localization datasets and show that it outperforms other direct pose estimation methods.
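The relative-pose supervision can be sketched as follows: the relative transform between two predicted absolute poses, whether from adjacent or distant frames, is compared against the ground-truth relative transform. The pose parameterization and equal weighting below are illustrative assumptions.

```python
# Sketch of a relative-pose training constraint (assumed form).
import torch

def relative_pose_loss(T_i, T_j, T_ij_gt):
    """T_i, T_j: predicted 4x4 camera poses; T_ij_gt: ground-truth
    relative pose from frame i to frame j."""
    T_ij = torch.linalg.inv(T_i) @ T_j
    t_err = torch.norm(T_ij[:3, 3] - T_ij_gt[:3, 3])
    # rotation geodesic distance from the trace of the relative rotation
    R_delta = T_ij[:3, :3].T @ T_ij_gt[:3, :3]
    cos = ((torch.diagonal(R_delta).sum() - 1.0) / 2.0).clamp(-1.0, 1.0)
    r_err = torch.acos(cos)
    return t_err + r_err
```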

Explainable AI in Diagnosing and Anticipating Leukemia Using Transfer Learning Method

  • paper_url: http://arxiv.org/abs/2312.00487
  • repo_url: None
  • paper_authors: Wahidul Hasan Abir, Md. Fahim Uddin, Faria Rahman Khanam, Mohammad Monirujjaman Khan
  • for: detecting Acute Lymphoblastic Leukemia (ALL), a blood cancer prevalent in children and teenagers characterized by the rapid proliferation of immature white blood cells (WBCs); early and accurate detection is vital for effective treatment and improved survival rates.
  • methods: an automated computer-aided diagnostic (CAD) approach using transfer learning models (ResNet101V2, VGG19, InceptionV3, and InceptionResNetV2) to classify ALL, with Local Interpretable Model-Agnostic Explanations (LIME) used to verify the validity and reliability of the predictions.
  • results: the InceptionV3 model achieved 98.38% accuracy, outperforming the other tested models; the LIME-verified results demonstrate the method's value for medical practitioners and underscore the role of explainable AI (XAI) in medical diagnostics.
    Abstract This research paper focuses on Acute Lymphoblastic Leukemia (ALL), a form of blood cancer prevalent in children and teenagers, characterized by the rapid proliferation of immature white blood cells (WBCs). These atypical cells can overwhelm healthy cells, leading to severe health consequences. Early and accurate detection of ALL is vital for effective treatment and improving survival rates. Traditional diagnostic methods are time-consuming, costly, and prone to errors. The paper proposes an automated detection approach using computer-aided diagnostic (CAD) models, leveraging deep learning techniques to enhance the accuracy and efficiency of leukemia diagnosis. The study utilizes various transfer learning models like ResNet101V2, VGG19, InceptionV3, and InceptionResNetV2 for classifying ALL. The methodology includes using the Local Interpretable Model-Agnostic Explanations (LIME) for ensuring the validity and reliability of the AI system's predictions. This approach is critical for overcoming the "black box" nature of AI, where decisions made by models are often opaque and unaccountable. The paper highlights that the proposed method using the InceptionV3 model achieved an impressive 98.38% accuracy, outperforming other tested models. The results, verified by the LIME algorithm, showcase the potential of this method in accurately identifying ALL, providing a valuable tool for medical practitioners. The research underscores the impact of explainable artificial intelligence (XAI) in medical diagnostics, paving the way for more transparent and trustworthy AI applications in healthcare.
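The LIME validation step can be reproduced roughly as below; the untrained InceptionV3 is a stand-in for the paper's fine-tuned model, and the random image is a placeholder for a real blood-smear image. The lime-image API calls themselves are real.

```python
# Sketch of validating a classifier's predictions with LIME.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries
from tensorflow.keras.applications import InceptionV3

model = InceptionV3(weights=None, classes=2)        # ALL vs. healthy (stand-in)
image = np.random.rand(299, 299, 3)                 # placeholder smear image

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image,
    classifier_fn=lambda batch: model.predict(np.asarray(batch)),
    top_labels=2,
    hide_color=0,
    num_samples=1000,    # perturbed samples used to fit the local surrogate
)
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5
)
overlay = mark_boundaries(img, mask)  # highlights regions driving the call
```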

Unfolder: Fast localization and image rectification of a document with a crease from folding in half

  • paper_url: http://arxiv.org/abs/2312.00467
  • repo_url: None
  • paper_authors: A. M. Ershov, D. V. Tropin, E. E. Limonova, D. P. Nikolaev, V. V. Arlazarov
  • for: digitizing folded documents with a smartphone camera
  • methods: proposes a novel approach called Unfolder, which is robust to projective distortions and does not fragment the image in the vicinity of a crease
  • results: achieved a recognition error rate of 0.33, outperforming advanced neural network methods DocTr and DewarpNet, with an average runtime of 0.25 s/image on an iPhone XR.
    Abstract Presentation of folded documents is not an uncommon case in modern society. Digitizing such documents by capturing them with a smartphone camera can be tricky since a crease can divide the document contents into separate planes. To unfold the document, one could hold the edges potentially obscuring it in a captured image. While there are many geometrical rectification methods, they were usually developed for arbitrary bends and folds. We consider such algorithms and propose a novel approach Unfolder developed specifically for images of documents with a crease from folding in half. Unfolder is robust to projective distortions of the document image and does not fragment the image in the vicinity of a crease after rectification. A new Folded Document Images dataset was created to investigate the rectification accuracy of folded (2, 3, 4, and 8 folds) documents. The dataset includes 1600 images captured when document placed on a table and when held in hand. The Unfolder algorithm allowed for a recognition error rate of 0.33, which is better than the advanced neural network methods DocTr (0.44) and DewarpNet (0.57). The average runtime for Unfolder was only 0.25 s/image on an iPhone XR.

Learning Unorthogonalized Matrices for Rotation Estimation

  • paper_url: http://arxiv.org/abs/2312.00462
  • repo_url: None
  • paper_authors: Kerui Gu, Zhihao Li, Shiyong Liu, Jianzhuang Liu, Songcen Xu, Youliang Yan, Michael Bi Mi, Kenji Kawaguchi, Angela Yao
  • for: pose estimation tasks
  • methods: learning unorthogonalized `Pseudo’ Rotation Matrices (PRoM)
  • results: state-of-the-art results on large-scale benchmarks for human pose estimation
    Abstract Estimating 3D rotations is a common procedure for 3D computer vision. The accuracy depends heavily on the rotation representation. One form of representation -- rotation matrices -- is popular due to its continuity, especially for pose estimation tasks. The learning process usually incorporates orthogonalization to ensure orthonormal matrices. Our work reveals, through gradient analysis, that common orthogonalization procedures based on the Gram-Schmidt process and singular value decomposition will slow down training efficiency. To this end, we advocate removing orthogonalization from the learning process and learning unorthogonalized `Pseudo' Rotation Matrices (PRoM). An optimization analysis shows that PRoM converges faster and to a better solution. By replacing the orthogonalization incorporated representation with our proposed PRoM in various rotation-related tasks, we achieve state-of-the-art results on large-scale benchmarks for human pose estimation.
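The core idea can be sketched in a few lines: the network's raw 9-dimensional output is supervised directly as a "pseudo" rotation matrix, and orthogonalization (here an SVD projection onto SO(3)) is applied only at inference. The MSE supervision below is an illustrative choice.

```python
# Sketch of learning unorthogonalized pseudo rotation matrices (PRoM).
import torch

def prom_loss(pred9, R_gt):
    """pred9: (B, 9) raw network output; R_gt: (B, 3, 3) ground truth."""
    M = pred9.view(-1, 3, 3)
    return ((M - R_gt) ** 2).mean()      # no Gram-Schmidt / SVD in the loop

@torch.no_grad()
def to_rotation(pred9):
    """Inference-time projection of the pseudo matrix onto SO(3)."""
    M = pred9.view(-1, 3, 3)
    U, _, Vh = torch.linalg.svd(M)
    det = torch.det(U @ Vh)
    S = torch.diag_embed(torch.stack(
        [torch.ones_like(det), torch.ones_like(det), det], dim=-1))
    return U @ S @ Vh                    # closest rotation, det = +1
```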

An Encoding Framework for Binarized Images using HyperDimensional Computing

  • paper_url: http://arxiv.org/abs/2312.00454
  • repo_url: None
  • paper_authors: Laura Smets, Werner Van Leekwijck, Ing Jyh Tsang, Steven Latré
  • for: providing a lightweight, brain-inspired machine-learning method suited to the wearable internet of things, near-sensor AI applications, and on-device processing.
  • methods: an encoding framework for binarized images that relies only on native hyperdimensional (HD) arithmetic vector operations, preserving the similarity of patterns at nearby locations through point-of-interest selection and local linear mapping.
  • results: reaches 97.35% test accuracy on MNIST and 84.12% on Fashion-MNIST, outperforming other baseline HDC encoding approaches, on par with more complex hybrid HDC models, and with higher robustness to noise and blur than the baseline encoding.
    Abstract Hyperdimensional Computing (HDC) is a brain-inspired and light-weight machine learning method. It has received significant attention in the literature as a candidate to be applied in the wearable internet of things, near-sensor artificial intelligence applications and on-device processing. HDC is computationally less complex than traditional deep learning algorithms and typically achieves moderate to good classification performance. A key aspect that determines the performance of HDC is the encoding of the input data to the hyperdimensional (HD) space. This article proposes a novel light-weight approach relying only on native HD arithmetic vector operations to encode binarized images that preserves similarity of patterns at nearby locations by using point of interest selection and local linear mapping. The method reaches an accuracy of 97.35% on the test set for the MNIST data set and 84.12% for the Fashion-MNIST data set. These results outperform other studies using baseline HDC with different encoding approaches and are on par with more complex hybrid HDC models. The proposed encoding approach also demonstrates a higher robustness to noise and blur compared to the baseline encoding.
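For readers unfamiliar with HDC, the native operations the encoder builds on (binding by elementwise multiplication, bundling by majority sign) look roughly like the generic position-value image encoder below; the paper's point-of-interest selection and local linear mapping refine this basic scheme and are not reproduced here.

```python
# Generic HDC image encoder sketch using only native HD operations.
import numpy as np

D = 10_000
rng = np.random.default_rng(0)

def rand_hv():
    return rng.choice([-1, 1], size=D)            # bipolar hypervector

def encode(binary_img, pos_hvs, val_hvs):
    """Bundle position-value bindings of all pixels into one hypervector."""
    bound = [pos_hvs[i] * val_hvs[v]              # bind position with pixel value
             for i, v in enumerate(binary_img.flatten())]
    return np.sign(np.sum(bound, axis=0))         # majority-rule bundling

H, W = 28, 28
pos_hvs = [rand_hv() for _ in range(H * W)]       # one hv per pixel location
val_hvs = {0: rand_hv(), 1: rand_hv()}            # binarized pixel values

img = (rng.random((H, W)) > 0.5).astype(int)
hv = encode(img, pos_hvs, val_hvs)
# similarity between encodings: normalized dot product (cosine)
print(hv @ encode(img, pos_hvs, val_hvs) / D)     # identical image -> ~1.0
```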

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

  • paper_url: http://arxiv.org/abs/2312.00452
  • repo_url: None
  • paper_authors: Yajie Liu, Pu Ge, Haoxiang Ma, Shichao Fan, Qingjie Liu, Di Huang, Yunhong Wang
  • for: improving the generalization of referring image segmentation (RIS) to varied text expressions and unseen visual entities.
  • methods: two components: (1) boosting a given expression with an explicit and crucial target prompt that complements it in a unified context, making targets easier to capture under linguistic style changes; and (2) a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model, leveraging spatial relations and pixel coherences to handle incomplete target masks and false-positive irregular clumps on unseen visual entities.
  • results: in zero-shot cross-dataset settings the approach achieves consistent gains over the state of the art, e.g., mIoU increases of 4.15%, 5.45%, and 4.64% on RefCOCO, RefCOCO+, and ReferIt, respectively; results on GraspNet-RIS show it also generalizes well to new scenarios with large domain shifts.
    Abstract Referring image segmentation (RIS) aims to segment objects in an image conditioning on free-from text descriptions. Despite the overwhelming progress, it still remains challenging for current approaches to perform well on cases with various text expressions or with unseen visual entities, limiting its further application. In this paper, we present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above. Specially, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context, facilitating target capturing in the presence of linguistic style changes. Furthermore, we introduce a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model to leverage spatial relations and pixel coherences to handle the incomplete target masks and false positive irregular clumps which often appear on unseen visual entities. Extensive experiments are conducted in the zero-shot cross-dataset settings and the proposed approach achieves consistent gains compared to the state-of-the-art, e.g., 4.15\%, 5.45\%, and 4.64\% mIoU increase on RefCOCO, RefCOCO+ and ReferIt respectively, demonstrating its effectiveness. Additionally, the results on GraspNet-RIS show that our approach also generalizes well to new scenarios with large domain shifts.

FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2312.00451
  • repo_url: https://github.com/VITA-Group/FSGS
  • paper_authors: Zehao Zhu, Zhiwen Fan, Yifan Jiang, Zhangyang Wang
  • for: novel view synthesis from limited observations, using 3D Gaussian Splatting for real-time, photo-realistic results with as few as three training views.
  • methods: a few-shot view synthesis framework, FSGS, that handles extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process and integrates a large-scale pre-trained monocular depth estimator to guide the geometric optimization.
  • results: FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender.
    Abstract Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website: https://zehaozhu.github.io/FSGS/.
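A loose sketch of what a Gaussian Unpooling growth step could look like (assumed form, since the paper's exact rule is not spelled out here): pick the most representative existing Gaussians, scored here by accumulated gradient magnitude, and spawn jittered children around them to infill vacant areas.

```python
# Loose sketch of a Gaussian-growing step in the spirit of Gaussian Unpooling.
import numpy as np

def unpool_gaussians(means, scales, grads, top_k=512, children=2, rng=None):
    """means: (N, 3) centers; scales: (N,) radii; grads: (N,) accumulated
    gradient magnitudes used as a representativeness score."""
    rng = rng or np.random.default_rng()
    top_k = min(top_k, len(grads))
    idx = np.argsort(grads)[-top_k:]                 # most representative
    parents, radii = means[idx], scales[idx]
    offsets = rng.normal(size=(top_k, children, 3)) * radii[:, None, None]
    new_means = (parents[:, None, :] + offsets).reshape(-1, 3)
    new_scales = np.repeat(radii / 2.0, children)    # children start smaller
    return np.vstack([means, new_means]), np.concatenate([scales, new_scales])
```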

Dolphins: Multimodal Language Model for Driving

  • paper_url: http://arxiv.org/abs/2312.00438
  • repo_url: None
  • paper_authors: Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, Chaowei Xiao
  • for: developing a conversational driving assistant with human-like understanding and responsiveness for autonomous vehicles navigating complex real-world scenarios.
  • methods: Dolphins, a vision-language model built upon the open-source pretrained OpenFlamingo, processes multimodal inputs (video or image data, text instructions, and historical control signals); its reasoning is enhanced through a Grounded Chain of Thought (GCoT) process, followed by driving-specific instruction data construction and instruction tuning.
  • results: with the BDD-X dataset, four distinct AV tasks are consolidated into Dolphins; the model provides a holistic understanding of complex, long-tailed open-world driving scenarios and exhibits human-like capabilities, including gradient-free instant adaptation via in-context learning and error recovery via reflection.
    Abstract The quest for fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness remains open. In this paper, we introduce Dolphins, a novel vision-language model architected to imbibe human-like abilities as a conversational driving assistant. Dolphins is adept at processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instructions. Building upon the open-sourced pretrained Vision-Language Model, OpenFlamingo, we first enhance Dolphins's reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process. Then we tailored Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Through the utilization of the BDD-X dataset, we designed and consolidated four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins are characterized into two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities including gradient-free instant adaptation via in-context learning and error recovery via reflection.

Enhancing Image Captioning with Neural Models

  • paper_url: http://arxiv.org/abs/2312.00435
  • repo_url: None
  • paper_authors: Pooja Bhatnagar, Sai Mrunaal, Sachin Kamnure
  • for: exploring neural image captioning with deep learning models.
  • methods: investigates different neural architecture configurations, focusing on the inject architecture, and proposes a novel quality metric for evaluating caption generation.
  • results: merge models exhibit a larger vocabulary and higher ROUGE scores, while the inject architecture generates relevant and concise captions; refining training data and optimizing hyperparameters further improves model performance.
    Abstract This research explores the realm of neural image captioning using deep learning models. The study investigates the performance of different neural architecture configurations, focusing on the inject architecture, and proposes a novel quality metric for evaluating caption generation. Through extensive experimentation and analysis, this work sheds light on the challenges and opportunities in image captioning, providing insights into model behavior and overfitting. The results reveal that while the merge models exhibit a larger vocabulary and higher ROUGE scores, the inject architecture generates relevant and concise image captions. The study also highlights the importance of refining training data and optimizing hyperparameters for improved model performance. This research contributes to the growing body of knowledge in neural image captioning and encourages further exploration in the field, emphasizing the democratization of artificial intelligence.
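The inject architecture the study favors can be sketched as follows: the image feature is projected to a pseudo-token and fed to the RNN at the first time step, so the decoder conditions on the image through its hidden state rather than merging image and RNN outputs afterwards. Dimensions and names below are illustrative.

```python
# Sketch of an "inject" image-captioning decoder.
import torch
import torch.nn as nn

class InjectCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, emb_dim)   # image -> pseudo-token
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feat, captions):
        img_tok = self.img_proj(img_feat).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions)                    # (B, L, E)
        seq = torch.cat([img_tok, words], dim=1)        # inject image at t=0
        h, _ = self.lstm(seq)
        return self.out(h[:, :-1])   # output at step t predicts word t

model = InjectCaptioner(vocab_size=8000)
logits = model(torch.randn(4, 2048), torch.randint(0, 8000, (4, 20)))
print(logits.shape)  # (4, 20, 8000), aligned with the target captions
```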

A Low-Power Neuromorphic Approach for Efficient Eye-Tracking

  • paper_url: http://arxiv.org/abs/2312.00425
  • repo_url: https://github.com/pbonazzi/retina
  • paper_authors: Pietro Bonazzi, Sizhen Bian, Giovanni Lippolis, Yawei Li, Sadique Sheik, Michele Magno
  • for: proposing a neuromorphic eye-tracking method that uses pure event data from a Dynamic Vision Sensor (DVS) camera to improve the accuracy and efficiency of eye-tracking systems.
  • methods: a directly trained Spiking Neural Network (SNN) regression model, "Retina", based on Integrate-and-Fire neurons with only 64k parameters, deployed on the cutting-edge low-power edge neuromorphic processor Speck; a continuous regression output is obtained by sliding a non-spiking temporal 1D filter across the output spiking layer.
  • results: a pupil centroid error of 3.24 pixels on 64x64 DVS input, 1.24 pixels better than the latest event-based eye-tracking method ("3ET") with 35 times fewer MAC operations, at an end-to-end power of 2.89-4.8 mW and a latency of 5.57-8.01 ms depending on the time window.
    Abstract This paper introduces a neuromorphic methodology for eye tracking, harnessing pure event data captured by a Dynamic Vision Sensor (DVS) camera. The framework integrates a directly trained Spiking Neural Network (SNN) regression model and leverages a state-of-the-art low-power edge neuromorphic processor - Speck, collectively aiming to advance the precision and efficiency of eye-tracking systems. First, we introduce a representative event-based eye-tracking dataset, "Ini-30", which was collected with two glass-mounted DVS cameras from thirty volunteers. Then, an SNN model based on Integrate And Fire (IAF) neurons, named "Retina", is described, featuring only 64k parameters (6.63x fewer than the latest) and achieving a pupil tracking error of only 3.24 pixels on a 64x64 DVS input. The continuous regression output is obtained by means of convolution using a non-spiking temporal 1D filter slided across the output spiking layer. Finally, we evaluate Retina on the neuromorphic processor, showing an end-to-end power between 2.89-4.8 mW and a latency of 5.57-8.01 ms dependent on the time window. We also benchmark our model against the latest event-based eye-tracking method, "3ET", which was built upon event frames. Results show that Retina achieves superior precision with 1.24px less pupil centroid error and reduced computational complexity with 35 times fewer MAC operations. We hope this work will open avenues for further investigation of close-loop neuromorphic solutions and true event-based training pursuing edge performance.

Towards Explaining Satellite Based Poverty Predictions with Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2312.00416
  • repo_url: None
  • paper_authors: Hamid Sarmadi, Thorsteinn Rögnvaldsson, Nils Roger Carlsson, Mattias Ohlsson, Ibrahim Wahab, Ola Hall
  • for: explaining the basis of deep convolutional neural network (CNN) predictions of poverty and development indicators from satellite images.
  • methods: analyzes the CNN's responses in detail through multiple explainability experiments.
  • results: a CNN trained on relatively low-resolution day- and night-time satellite images outperforms human subjects viewing high-resolution images in ranking Wealth Index categories; object sizes, pixel colors, and particular input structures prove important, and images that maximize the Wealth Index prediction are visualized to reveal what the prediction is based on.
    Abstract Deep convolutional neural networks (CNNs) have been shown to predict poverty and development indicators from satellite images with surprising accuracy. This paper presents a first attempt at analyzing the CNNs' responses in detail and explaining the basis for the predictions. The CNN model, while trained on relatively low resolution day- and night-time satellite images, is able to outperform human subjects who look at high-resolution images in ranking the Wealth Index categories. Multiple explainability experiments performed on the model indicate the importance of the sizes of the objects and the pixel colors in the image, and provide a visualization of the importance of different structures in input images. A visualization is also provided of the types of images that maximize the network's Wealth Index prediction, which provides clues on what the CNN prediction is based on.

Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

  • paper_url: http://arxiv.org/abs/2312.00414
  • repo_url: None
  • paper_authors: Taichi Nishimura, Shota Nakada, Masayoshi Kondo
  • for: partially relevant video retrieval (PRVR)
  • methods: encodes super images (video frames rearranged in an N x N grid) instead of dense frames, cutting the number of visual encodings to 1/N^2 so that a large-scale vision-and-language model (VLM) can serve as a powerful encoder; a simple query-image attention trick yields strong zero-shot results, and fine-tuning a few trainable modules in the VLM backbone boosts performance further
  • results: best performance on ActivityNet Captions and TVR, with promising zero-shot performance against state-of-the-art (SOTA) methods at high efficiency
    Abstract In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one relevant moment to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This guides researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities of learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in a $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images effectively and demonstrate promising zero-shot performance against SOTA methods efficiently. In addition, we propose a fine-tuning approach by incorporating a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
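Building a super image is a simple tensor rearrangement; a sketch with an assumed frame-sampling rule (take the first N^2 frames):

```python
# Sketch of tiling T sampled frames into an N x N "super image" so that one
# VLM encoding covers N^2 frames at once.
import numpy as np

def to_super_image(frames, n):
    """frames: (T, H, W, 3) with T >= n*n; returns an (n*H, n*W, 3) grid
    built from the first n*n frames."""
    t, h, w, c = frames.shape
    grid = frames[: n * n].reshape(n, n, h, w, c)      # (rows, cols, H, W, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(n * h, n * w, c)

video = np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8)
super_img = to_super_image(video, n=4)                 # one 896x896 image
print(super_img.shape)
```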

SCHEME: Scalable Channel Mixer for Vision Transformers

  • paper_url: http://arxiv.org/abs/2312.00412
  • repo_url: None
  • paper_authors: Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos
  • for: improving Vision Transformers (ViT) by studying the channel mixer or feature mixing block, using sparse feature mixing and a lightweight channel covariance attention mechanism.
  • methods: a block diagonal MLP structure that improves accuracy by supporting larger expansion ratios, and a new family of SCHEMEformer models that can be plugged into any ViT architecture to obtain different trade-offs between complexity and performance.
  • results: substantial accuracy gains over existing designs on image classification, object detection, and semantic segmentation with different ViT backbones, especially under lower FLOPs regimes; the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
    Abstract Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth albeit it accounts for a bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training whose contribution decays to zero as the training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially under lower FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
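A block diagonal channel MLP can be realized with grouped 1x1 convolutions, which make the weight matrices block diagonal; a minimal sketch (the training-only CCA branch, which is discarded at inference, is omitted):

```python
# Sketch of a block diagonal channel MLP: with groups=g, an expansion ratio
# g times larger costs roughly the same FLOPs as a dense MLP (groups=1).
import torch
import torch.nn as nn

class BlockDiagonalMLP(nn.Module):
    def __init__(self, dim, expansion=4, groups=4):
        super().__init__()
        self.fc1 = nn.Conv2d(dim, dim * expansion, 1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(dim * expansion, dim, 1, groups=groups)

    def forward(self, x):          # x: (B, C, H, W) token grid
        return self.fc2(self.act(self.fc1(x)))

x = torch.randn(2, 192, 14, 14)
y = BlockDiagonalMLP(192, expansion=8, groups=4)(x)   # wider at dense-MLP cost
print(y.shape)
```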

VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things

  • paper_url: http://arxiv.org/abs/2312.00401
  • repo_url: None
  • paper_authors: Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
  • for: addressing the fine-grained and interrelated vision tool usage of the Video Internet of Things (VIoT) by building an LLM-based framework that correctly interacts with humans, queries knowledge videos, and invokes vision models to accomplish complicated tasks.
  • methods: a training dataset crafted with semi-automatic annotations and benchmarks covering 11 representative vision models across three categories; ReAct instruction tuning on the collected VIoT dataset teaches the LLM tool capability.
  • results: quantitative and qualitative experiments and analyses demonstrate the effectiveness of VIoTGPT.
    Abstract Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. Learning to schedule perceiving models and analyzing the collected videos intelligently will be potential sparks for VIoT. In this paper, to address the challenges posed by the fine-grained and interrelated vision tool usage of VIoT, we build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to accomplish complicated tasks. To support VIoTGPT and related future works, we meticulously crafted the training dataset and established benchmarks involving 11 representative vision models across three categories based on semi-automatic annotations. To guide LLM to act as the intelligent agent towards intelligent VIoT, we resort to ReAct instruction tuning based on the collected VIoT dataset to learn the tool capability. Quantitative and qualitative experimental results and analyses demonstrate the effectiveness of VIoTGPT.

Learning to Estimate Critical Gait Parameters from Single-View RGB Videos with Transformer-Based Attention Network

  • paper_url: http://arxiv.org/abs/2312.00398
  • repo_url: None
  • paper_authors: Quoc Hung T. Le, Hieu H. Pham
  • for: early diagnosis and treatment of movement difficulties in patients with musculoskeletal diseases and cognitive impairments.
  • methods: uses computer vision and deep learning, proposing a novel spatio-temporal Transformer network that estimates critical gait parameters from single-view RGB videos instead of expensive optical motion capture.
  • results: on a public dataset of cerebral palsy patients, the framework surpasses current state-of-the-art approaches in predicting general gait parameters (Walking Speed, Gait Deviation Index - GDI, and Knee Flexion Angle at Maximum Extension) while using fewer parameters and avoiding manual feature extraction.
    Abstract Musculoskeletal diseases and cognitive impairments in patients lead to difficulties in movement as well as negative effects on their psychological health. Clinical gait analysis, a vital tool for early diagnosis and treatment, traditionally relies on expensive optical motion capture systems. Recent advances in computer vision and deep learning have opened the door to more accessible and cost-effective alternatives. This paper introduces a novel spatio-temporal Transformer network to estimate critical gait parameters from RGB videos captured by a single-view camera. Empirical evaluations on a public dataset of cerebral palsy patients indicate that the proposed framework surpasses current state-of-the-art approaches and show significant improvements in predicting general gait parameters (including Walking Speed, Gait Deviation Index - GDI, and Knee Flexion Angle at Maximum Extension), while utilizing fewer parameters and alleviating the need for manual feature extraction.

Study and Survey on Gesture Recognition Systems

  • paper_url: http://arxiv.org/abs/2312.00392
  • repo_url: None
  • paper_authors: Kshitij Deshpande, Varad Mashalkar, Kaustubh Mhaisekar, Amaan Naikwadi, Archana Ghotkar
  • for: surveying gesture recognition systems and their applications across domains such as gaming, healthcare, home appliances, industrial robots, and virtual reality.
  • methods: compares and contrasts different methodologies for capturing gestures, drawing on computer vision and deep learning, and discusses data sources and data acquisition techniques.
  • results: the survey highlights the broad applicability of gesture recognition across sectors, reviews existing approaches to gestures in sign language, and identifies common challenges faced when building gesture recognition systems.
    Abstract In recent years, there has been a considerable amount of research in the Gesture Recognition domain, mainly owing to the technological advancements in Computer Vision. Various new applications have been conceptualised and developed in this field. This paper discusses the implementation of gesture recognition systems in multiple sectors such as gaming, healthcare, home appliances, industrial robots, and virtual reality. Different methodologies for capturing gestures are compared and contrasted throughout this survey. Various data sources and data acquisition techniques have been discussed. The role of gestures in sign language has been studied and existing approaches have been reviewed. Common challenges faced while building gesture recognition systems have also been explored.

Partition-based K-space Synthesis for Multi-contrast Parallel Imaging

  • paper_url: http://arxiv.org/abs/2312.00387
  • repo_url: None
  • paper_authors: Yuxia Huang, Zhonghui Wu, Xiaoling Xu, Minghui Zhang, Shanshan Wang, Qiegen Liu
  • for: 这篇论文旨在提高多对比度磁共振成像技术的效率和图像质量。
  • methods: 论文提出了一种新的多对比度成像方法,即基于分区的 k 空间合成(Partition-based k-space synthesis,PKS),通过特征融合来提升 T2 加权图像的重建质量。
  • results: 实验结果表明,联合利用 T1 和 T2 数据可以取得与对每种对比度独立处理的传统 k 空间并行成像方法(SAKE)相当或更好的图像质量,并可缩短总成像时间。
    Abstract Multi-contrast magnetic resonance imaging is a significant and essential medical imaging technique. However, multi-contrast imaging has a longer acquisition time and is prone to motion artifacts. In particular, the acquisition time for a T2-weighted image is prolonged due to its longer repetition time (TR); on the contrary, a T1-weighted image has a shorter TR. Therefore, utilizing complementary information across T1- and T2-weighted images is a way to decrease the overall imaging time. Previous T1-assisted T2 reconstruction methods have mostly focused on the image domain, using whole-image fusion approaches, which suffer from high computational complexity and limited flexibility. To address this issue, we propose a novel multi-contrast imaging method called partition-based k-space synthesis (PKS), which achieves superior reconstruction quality of T2-weighted images via feature fusion. Concretely, we first decompose the fully-sampled T1 k-space data and the under-sampled T2 k-space data into two sub-datasets each. Two new objects are then constructed by combining the sub-T1/T2 data, and each new object is reconstructed as whole data to realize the reconstruction of the T2-weighted image. Finally, the target T2 is synthesized by extracting the sub-T2 data of each part. Experimental results showed that our combined technique can achieve comparable or better results than traditional k-space parallel imaging (SAKE) that processes each contrast independently.
    摘要 多对比度磁共振成像是一种重要且必不可少的医学成像技术。然而,多对比度成像采集时间较长,容易产生运动伪影。特别是 T2 加权图像由于重复时间(TR)较长,采集时间被进一步拉长;相反,T1 加权图像的 TR 较短。因此,利用 T1 与 T2 加权图像之间的互补信息是缩短总成像时间的一种途径。以往的 T1 辅助 T2 重建方法大多集中在图像域,采用基于整幅图像的融合策略,存在计算复杂度高、灵活性有限的缺陷。为此,我们提出一种新的多对比度成像方法,即基于分区的 k 空间合成(PKS),通过特征融合实现高质量的 T2 加权图像重建。具体而言,我们先将全采样的 T1 k 空间数据与欠采样的 T2 k 空间数据各自分解为两个子数据,再将两组子 T1/T2 数据组合成两个新对象,并将其作为整体数据进行重建;最后提取各部分的子 T2 数据以合成目标 T2 图像。实验结果表明,该联合技术可取得与对每种对比度独立处理的传统 k 空间并行成像方法(SAKE)相当或更好的结果。
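The abstract leaves the partition scheme unspecified; purely to make the data flow concrete, here is a toy numpy sketch that assumes an even/odd phase-encode split (an assumption, not the paper's actual partition) and an identity "reconstruction":

```python
# Toy sketch of the PKS data flow: partition T1 and T2 k-space into
# complementary sub-datasets, mix them into two new objects, then recover
# the sub-T2 parts to synthesize the target T2 k-space.
import numpy as np

def partition(kspace):
    """Split k-space into two complementary sub-datasets along phase encodes."""
    even, odd = np.zeros_like(kspace), np.zeros_like(kspace)
    even[0::2], odd[1::2] = kspace[0::2], kspace[1::2]
    return even, odd

t1 = np.fft.fft2(np.random.rand(128, 128))   # fully sampled T1 k-space
t2 = np.fft.fft2(np.random.rand(128, 128))
t2[1::2] = 0                                 # under-sampled T2 k-space

t1_a, t1_b = partition(t1)
t2_a, t2_b = partition(t2)

# Two new "objects" mixing contrasts; in PKS each is reconstructed as whole data.
obj1 = t1_a + t2_b
obj2 = t2_a + t1_b

# After (here: identity) reconstruction, extract the sub-T2 data of each part.
t2_synth = np.zeros_like(t2)
t2_synth[0::2], t2_synth[1::2] = obj2[0::2], obj1[1::2]
```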

Local monotone operator learning using non-monotone operators: MnM-MOL

  • paper_url: http://arxiv.org/abs/2312.00386
  • repo_url: None
  • paper_authors: Maneesh John, Jyothi Rikhab Chand, Mathews Jacob
  • for: 这篇论文旨在提高欠采样 MR 图像重建的性能,改进现有的迭代重建算法。
  • methods: 该论文借鉴深度平衡模型与最近的单调算子学习(MOL)方法,显著降低了训练过程中的内存需求,但相应的 Lipschitz 约束通常会导致性能下降。本文提出两种放松该约束的方法:受 convex-non-convex 正则化策略启发,将单调性约束施加在数据项梯度与 CNN 块之和上;并且仅要求该算子在图像流形附近的局部邻域内保持单调。
  • results: 理论分析和实验结果都表明,提出的方法可以提升性能,并且具有鲁棒性,即对输入扰动的抗干扰能力。
    Abstract The recovery of magnetic resonance (MR) images from undersampled measurements is a key problem that has seen extensive research in recent years. Unrolled approaches, which rely on end-to-end training of convolutional neural network (CNN) blocks within iterative reconstruction algorithms, offer state-of-the-art performance. These algorithms require a large amount of memory during training, making them difficult to employ in high-dimensional applications. Deep equilibrium (DEQ) models and the recent monotone operator learning (MOL) approach were introduced to eliminate the need for unrolling, thus reducing the memory demand during training. Both approaches require a Lipschitz constraint on the network to ensure that the forward and backpropagation iterations converge. Unfortunately, the constraint often results in reduced performance compared to unrolled methods. The main focus of this work is to relax the constraint on the CNN block in two different ways. Inspired by convex-non-convex regularization strategies, we now impose the monotone constraint on the sum of the gradient of the data term and the CNN block, rather than constrain the CNN itself to be a monotone operator. This approach enables the CNN to learn possibly non-monotone score functions, which can translate to improved performance. In addition, we only restrict the operator to be monotone in a local neighborhood around the image manifold. Our theoretical results show that the proposed algorithm is guaranteed to converge to the fixed point and that the solution is robust to input perturbations, provided that it is initialized close to the true solution. Our empirical results show that the relaxed constraints translate to improved performance and that the approach enjoys robustness to input perturbations similar to MOL.
    摘要 从欠采样测量中恢复磁共振(MR)图像是近年来被广泛研究的关键问题。依靠在迭代重建算法中端到端训练卷积神经网络(CNN)块的展开式(unrolled)方法可以取得最先进的性能,但其训练阶段内存开销巨大,难以用于高维应用。深度平衡(DEQ)模型与最近的单调算子学习(MOL)方法无需展开,从而降低了训练时的内存需求;但两者都要求网络满足 Lipschitz 约束,以保证前向与反向传播迭代收敛,而该约束往往导致性能低于展开式方法。本文的核心工作是从两个方面放松对 CNN 块的约束:受 convex-non-convex 正则化策略启发,我们将单调性约束施加在数据项梯度与 CNN 块之和上,而非要求 CNN 本身是单调算子,从而允许 CNN 学习可能非单调的得分函数,带来性能提升;此外,我们仅要求该算子在图像流形附近的局部邻域内保持单调。理论结果表明,只要初始化足够接近真解,所提算法保证收敛到不动点,且解对输入扰动具有鲁棒性。实验结果表明,放松约束带来了性能提升,且该方法具有与 MOL 相近的输入扰动鲁棒性。
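To make the relaxed constraint concrete: the monotonicity of an operator G is the condition <G(x1) - G(x2), x1 - x2> >= 0. A hedged PyTorch sketch of penalizing its violation for G(x) = A^H(Ax - b) + D(x), only between pairs near the image manifold, could look like this (the hinge form, the toy identity forward operator, and the layer sizes are illustrative assumptions):

```python
# Sketch: penalize non-monotonicity of the sum of the data-term gradient and
# the CNN block, evaluated between a point and a small perturbation of it.
import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))

def G(x, b, A=lambda z: z, AH=lambda z: z):
    """Data-term gradient plus CNN block (A = identity toy operator)."""
    return AH(A(x) - b) + cnn(x)

def local_monotone_penalty(x, b, sigma=0.01):
    """Zero when <G(x1)-G(x2), x1-x2> >= 0 for x2 near x (local monotonicity)."""
    x2 = x + sigma * torch.randn_like(x)       # perturbation near the manifold
    inner = ((G(x, b) - G(x2, b)).flatten(1) *
             (x - x2).flatten(1)).sum(dim=1)
    return torch.relu(-inner).mean()           # hinge on negative inner products

x = torch.randn(4, 1, 32, 32, requires_grad=True)
b = torch.randn(4, 1, 32, 32)
local_monotone_penalty(x, b).backward()
```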

NeuSG: Neural Implicit Surface Reconstruction with 3D Gaussian Splatting Guidance

  • paper_url: http://arxiv.org/abs/2312.00846
  • repo_url: None
  • paper_authors: Hanlin Chen, Chen Li, Gim Hee Lee
  • for: 提高多视图3D重建的精度,使用神经隐式表面重建策略。
  • methods: 使用 3D Gaussian Splatting 作为引导,并通过尺度正则化约束 3D Gaussian Splatting 生成的点云紧贴表面;同时,利用神经隐式模型预测的表面法向先验来优化该点云。
  • results: 实现高精度的多视图 3D 重建,包括具有精细细节的表面重建。实验结果表明,我们的方法在 Tanks and Temples 等场景中提升了重建精度。
    Abstract Existing neural implicit surface reconstruction methods have achieved impressive performance in multi-view 3D reconstruction by leveraging explicit geometry priors such as depth maps or point clouds as regularization. However, the reconstruction results still lack fine details because of the over-smoothed depth map or sparse point cloud. In this work, we propose a neural implicit surface reconstruction pipeline with guidance from 3D Gaussian Splatting to recover highly detailed surfaces. The advantage of 3D Gaussian Splatting is that it can generate dense point clouds with detailed structure. Nonetheless, a naive adoption of 3D Gaussian Splatting can fail since the generated points are the centers of 3D Gaussians that do not necessarily lie on the surface. We thus introduce a scale regularizer to pull the centers close to the surface by enforcing the 3D Gaussians to be extremely thin. Moreover, we propose to refine the point cloud from 3D Gaussians Splatting with the normal priors from the surface predicted by neural implicit models instead of using a fixed set of points as guidance. Consequently, the quality of surface reconstruction improves from the guidance of the more accurate 3D Gaussian splatting. By jointly optimizing the 3D Gaussian Splatting and the neural implicit model, our approach benefits from both representations and generates complete surfaces with intricate details. Experiments on Tanks and Temples verify the effectiveness of our proposed method.
    摘要 现有的神经隐式表面重建方法借助深度图或点云等显式几何先验作为正则化,已在多视图三维重建中取得令人印象深刻的表现。然而,由于深度图过于平滑或点云过于稀疏,重建结果仍缺乏精细细节。本文提出一种以 3D Gaussian Splatting 为引导的神经隐式表面重建流程,以恢复高度细致的表面。3D Gaussian Splatting 的优势在于能生成结构细致的稠密点云;但直接采用它可能失败,因为生成的点是 3D 高斯的中心,不一定位于表面上。为此,我们引入尺度正则化,强制 3D 高斯变得极薄,从而将其中心拉向表面。此外,我们提出利用神经隐式模型预测的表面法向先验来精化 3D Gaussian Splatting 得到的点云,而非使用固定点集作为引导,使表面重建质量在更精确的高斯引导下不断提升。通过联合优化 3D Gaussian Splatting 与神经隐式模型,我们的方法兼得两种表示之长,生成细节丰富的完整表面。在 Tanks and Temples 上的实验验证了所提方法的有效性。
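The scale regularizer admits a very small sketch. Assuming the splatting implementation exposes per-Gaussian log-scales as a learnable (N, 3) tensor (an assumption about the codebase, not the authors' exact formulation), penalizing the thinnest axis pulls each Gaussian toward a flat disk whose center hugs the surface:

```python
# Minimal PyTorch sketch of a thinness regularizer for 3D Gaussians.
import torch

log_scales = torch.randn(10000, 3, requires_grad=True)  # per-Gaussian log-scales

def thinness_regularizer(log_scales):
    """Penalize the smallest axis so each Gaussian collapses toward a disk."""
    scales = torch.exp(log_scales)        # positive scales (sx, sy, sz)
    return scales.min(dim=1).values.mean()

loss = thinness_regularizer(log_scales)   # added to the usual splatting loss
loss.backward()
```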

Text-Guided 3D Face Synthesis – From Generation to Editing

  • paper_url: http://arxiv.org/abs/2312.00375
  • repo_url: None
  • paper_authors: Yunjie Wu, Yapeng Meng, Zhipeng Hu, Lincheng Li, Haoqian Wu, Kun Zhou, Weiwei Xu, Xin Yu
  • for: 本文提出了一种文本引导的 3D 人脸合成方法,覆盖从生成到编辑的全流程,可通过迭代调整实现定制化的 3D 人脸合成。
  • methods: 在生成阶段,本文提出几何-纹理解耦的生成方式,以缓解二者耦合导致的几何细节丢失;在编辑阶段,采用微调的纹理扩散模型提升纹理质量,并引入 UV 域一致性保持正则化与自引导一致性权重策略来提升编辑效果。
  • results: 全面的实验表明该方法在人脸合成方面具有优越性。
    Abstract Text-guided 3D face synthesis has achieved remarkable results by leveraging text-to-image (T2I) diffusion models. However, most existing works focus solely on the direct generation, ignoring the editing, restricting them from synthesizing customized 3D faces through iterative adjustments. In this paper, we propose a unified text-guided framework from face generation to editing. In the generation stage, we propose a geometry-texture decoupled generation to mitigate the loss of geometric details caused by coupling. Besides, decoupling enables us to utilize the generated geometry as a condition for texture generation, yielding highly geometry-texture aligned results. We further employ a fine-tuned texture diffusion model to enhance texture quality in both RGB and YUV space. In the editing stage, we first employ a pre-trained diffusion model to update facial geometry or texture based on the texts. To enable sequential editing, we introduce a UV domain consistency preservation regularization, preventing unintentional changes to irrelevant facial attributes. Besides, we propose a self-guided consistency weight strategy to improve editing efficacy while preserving consistency. Through comprehensive experiments, we showcase our method's superiority in face synthesis. Project page: https://faceg2e.github.io/.
    摘要 文本引导的 3D 人脸合成借助文本到图像(T2I)扩散模型已取得显著成果。然而,现有工作大多只关注直接生成而忽略编辑,因而无法通过迭代调整合成定制化的 3D 人脸。本文提出一个从人脸生成到编辑的统一文本引导框架。在生成阶段,我们提出几何-纹理解耦的生成方式,以缓解耦合带来的几何细节丢失;解耦还使我们能够以生成的几何作为纹理生成的条件,从而得到几何与纹理高度对齐的结果。我们进一步采用微调的纹理扩散模型,在 RGB 与 YUV 空间同时提升纹理质量。在编辑阶段,我们先利用预训练扩散模型根据文本更新面部几何或纹理;为支持连续编辑,我们引入 UV 域一致性保持正则化,防止无关面部属性被无意改变;此外,我们提出自引导一致性权重策略,在保持一致性的同时提升编辑效果。通过全面的实验,我们展示了该方法在人脸合成中的优越性。项目页面:https://faceg2e.github.io/.

Benchmarking Multi-Domain Active Learning on Image Classification

  • paper_url: http://arxiv.org/abs/2312.00364
  • repo_url: None
  • paper_authors: Jiayi Li, Rohan Taori, Tatsunori B. Hashimoto
  • for: 提高模型性能的活动学习方法
  • methods: 多个领域数据点策略性标注
  • results: 传统单个领域活动学习策略在多个领域场景下通常落后于随机选择,需要未来研究。
    Abstract Active learning aims to enhance model performance by strategically labeling informative data points. While extensively studied, its effectiveness on large-scale, real-world datasets remains underexplored. Existing research primarily focuses on single-source data, ignoring the multi-domain nature of real-world data. We introduce a multi-domain active learning benchmark to bridge this gap. Our benchmark demonstrates that traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios. We also introduce CLIP-GeoYFCC, a novel large-scale image dataset built around geographical domains, in contrast to existing genre-based domain datasets. Analysis on our benchmark shows that all multi-domain strategies exhibit significant tradeoffs, with no strategy outperforming across all datasets or all metrics, emphasizing the need for future research.
    摘要 主动学习旨在通过有策略地标注信息量大的数据点来提升模型性能。尽管研究广泛,其在大规模真实数据集上的有效性仍未被充分探索;现有研究主要集中于单一来源数据,忽略了真实数据的多领域特性。我们引入一个多领域主动学习基准来填补这一空白。该基准表明,传统的单领域主动学习策略在多领域场景中往往不如随机选择有效。我们还构建了 CLIP-GeoYFCC,一个围绕地理领域构建的新型大规模图像数据集,有别于现有基于类型划分领域的数据集。对基准的分析显示,所有多领域策略都存在明显的权衡,没有任何策略能在所有数据集或所有指标上占优,这凸显了未来研究的必要性。

Dancing with Images: Video Distillation via Static-Dynamic Disentanglement

  • paper_url: http://arxiv.org/abs/2312.00362
  • repo_url: None
  • paper_authors: Ziyu Wang, Yue Xu, Cewu Lu, Yong-Lu Li
  • for: 这项研究旨在将数据集蒸馏从图像扩展到视频,为视频数据集的高效机器学习做准备。
  • methods: 这项研究首次系统地研究了视频蒸馏,并提出了一个对时间压缩进行归类的分类法。
  • results: 研究发现,蒸馏过程中时间信息通常学习得不好,合成数据的时间维度贡献很小。这一发现促使我们提出一个将动态与静态信息解耦的统一框架:先将视频蒸馏为静态图像作为静态记忆,再用一个可学习的动态记忆块补偿动态与运动信息。我们的方法在不同规模的视频数据集上取得最先进性能,且存储开销显著更小。
    Abstract Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, the distillation for videos, characterized by an exclusive temporal dimension, remains an underexplored domain. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation , and the temporal dimension of synthetic data contributes little. The observations motivate our unified framework of disentangling the dynamic and static information in the videos. It first distills the videos into still images as static memory and then compensates the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art on video datasets at different scales, with notably smaller storage expenditure. Our code will be publicly available.
    摘要 近来,数据集蒸馏为高效机器学习铺平了道路,尤其是在图像数据集上;然而,以独有的时间维度为特征的视频蒸馏仍是一个尚未充分探索的领域。在这项工作中,我们首次系统地研究了视频蒸馏,并提出了一个对时间压缩进行归类的分类法。我们的研究发现,蒸馏过程中时间信息通常学习得不好,且合成数据的时间维度贡献很小。这一观察促使我们提出一个将视频中动态与静态信息解耦的统一框架:先将视频蒸馏为静态图像作为静态记忆,再用一个可学习的动态记忆块补偿动态与运动信息。我们的方法在不同规模的视频数据集上取得最先进性能,且存储开销显著更小。代码将公开发布。
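As a purely conceptual sketch of the disentangled parameterization (shapes, the additive composition, and the decoder are assumptions, not the paper's architecture), a distilled video can be represented as one learnable static image plus a small dynamic memory that adds per-frame motion:

```python
# Conceptual PyTorch sketch: distilled video = static memory + dynamic memory.
import torch
import torch.nn as nn

class DistilledVideo(nn.Module):
    def __init__(self, T=8, C=3, H=32, W=32, d_dyn=16):
        super().__init__()
        self.static = nn.Parameter(torch.randn(1, C, H, W))    # static memory
        self.dyn_codes = nn.Parameter(torch.randn(T, d_dyn))   # one code per frame
        self.decoder = nn.Sequential(                           # dynamic memory block
            nn.Linear(d_dyn, C * H * W), nn.Tanh())
        self.shape = (T, C, H, W)

    def forward(self):
        T, C, H, W = self.shape
        motion = self.decoder(self.dyn_codes).view(T, C, H, W)
        return self.static + 0.1 * motion    # all frames share static content

synth = DistilledVideo()
video = synth()
print(video.shape)   # torch.Size([8, 3, 32, 32])
```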

Efficient Multimodal Semantic Segmentation via Dual-Prompt Learning

  • paper_url: http://arxiv.org/abs/2312.00360
  • repo_url: https://github.com/shaohuadong2021/dplnet
  • paper_authors: Shaohua Dong, Yunhe Feng, Qing Yang, Yan Huang, Dongfang Liu, Heng Fan
  • for: 提高复杂场景中的semantic segmentation(例如indoor/low-light环境)的精度
  • methods: 使用 dual-prompt learning network(DPLNet),它使用预训练的RGB模型进行适应,并通过两个提问学习模块(MPG和MFA)来实现多模态的特征融合和学习
  • results: 在四个 RGB-D/T 语义分割数据集上取得新的最先进性能或与其他复杂方法相当,同时保持参数高效;并在显著目标检测和视频语义分割等其他多模态任务中表现出色。
    Abstract Multimodal (e.g., RGB-Depth/RGB-Thermal) fusion has shown great potential for improving semantic segmentation in complex scenes (e.g., indoor/low-light conditions). Existing approaches often fully fine-tune a dual-branch encoder-decoder framework with a complicated feature fusion strategy for achieving multimodal semantic segmentation, which is training-costly due to the massive parameter updates in feature extraction and fusion. To address this issue, we propose a surprisingly simple yet effective dual-prompt learning network (dubbed DPLNet) for training-efficient multimodal (e.g., RGB-D/T) semantic segmentation. The core of DPLNet is to directly adapt a frozen pre-trained RGB model to multimodal semantic segmentation, reducing parameter updates. For this purpose, we present two prompt learning modules, comprising multimodal prompt generator (MPG) and multimodal feature adapter (MFA). MPG works to fuse the features from different modalities in a compact manner and is inserted from shadow to deep stages to generate the multi-level multimodal prompts that are injected into the frozen backbone, while MPG adapts prompted multimodal features in the frozen backbone for better multimodal semantic segmentation. Since both the MPG and MFA are lightweight, only a few trainable parameters (3.88M, 4.4% of the pre-trained backbone parameters) are introduced for multimodal feature fusion and learning. Using a simple decoder (3.27M parameters), DPLNet achieves new state-of-the-art performance or is on a par with other complex approaches on four RGB-D/T semantic segmentation datasets while satisfying parameter efficiency. Moreover, we show that DPLNet is general and applicable to other multimodal tasks such as salient object detection and video semantic segmentation. Without special design, DPLNet outperforms many complicated models. Our code will be available at github.com/ShaohuaDong2021/DPLNet.
    摘要 多模态(如 RGB-深度/RGB-热红外)融合在复杂场景(如室内/低光照条件)中展现出提升语义分割的巨大潜力。现有方法通常对带有复杂特征融合策略的双分支编码器-解码器框架进行全量微调以实现多模态语义分割,特征提取与融合中的大规模参数更新使训练成本高昂。为解决这一问题,我们提出一个出乎意料地简单却有效的双提示学习网络(DPLNet),用于高训练效率的多模态(如 RGB-D/T)语义分割。DPLNet 的核心是直接将冻结的预训练 RGB 模型适配到多模态语义分割,从而减少参数更新。为此,我们提出两个提示学习模块:多模态提示生成器(MPG)与多模态特征适配器(MFA)。MPG 以紧凑的方式融合不同模态的特征,并被插入从浅层到深层的多个阶段,生成注入冻结主干的多级多模态提示;MFA 则在冻结主干中适配被提示的多模态特征,以获得更好的多模态语义分割。由于 MPG 与 MFA 都很轻量,仅引入了少量可训练参数(3.88M,占预训练主干参数的 4.4%)用于多模态特征融合与学习。配合一个简单的解码器(3.27M 参数),DPLNet 在四个 RGB-D/T 语义分割数据集上取得新的最先进性能或与其他复杂方法相当,同时满足参数高效性。此外,我们证明 DPLNet 具有通用性,可应用于显著目标检测与视频语义分割等其他多模态任务;即便没有特殊设计,DPLNet 也优于许多复杂模型。代码将发布于 github.com/ShaohuaDong2021/DPLNet。
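A hedged sketch of the prompt-injection pattern the abstract describes: a tiny trainable module fuses auxiliary-modality features into a prompt added at a frozen backbone stage, so only the prompt modules update. The module sizes and the single-stage setup below are illustrative assumptions:

```python
# Rough sketch: a multimodal prompt generator feeding a frozen backbone stage.
import torch
import torch.nn as nn

class MultimodalPromptGenerator(nn.Module):
    """Compactly fuse RGB and auxiliary-modality features into a prompt."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, feat_rgb, feat_depth):
        return self.fuse(torch.cat([feat_rgb, feat_depth], dim=1))

backbone_stage = nn.Conv2d(64, 64, 3, padding=1)   # stands in for a frozen stage
for p in backbone_stage.parameters():
    p.requires_grad = False                        # pre-trained RGB weights frozen

mpg = MultimodalPromptGenerator(64)                # the few trainable parameters
f_rgb, f_d = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56)
prompted = backbone_stage(f_rgb + mpg(f_rgb, f_d))  # prompt injected into stage
```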

Impact of Data Augmentation on QCNNs

  • paper_url: http://arxiv.org/abs/2312.00358
  • repo_url: None
  • paper_authors: Leting Zhouli, Peiyong Wang, Udaya Parampalli
  • for: 这篇论文的目的是用量子机制改进传统的图像识别模型,以提高图像识别的性能。
  • methods: 这篇论文使用的方法包括传统的图像识别模型(CNNs)和量子图像识别模型(QCNNs),以及数据增强(DA)技术。
  • results: 测试结果显示,QCNNs在三个常用的数据集上的损失和预测精度都高于CNNs,而且DA技术不会提高QCNNs的性能。
    Abstract In recent years, Classical Convolutional Neural Networks (CNNs) have been applied for image recognition successfully. Quantum Convolutional Neural Networks (QCNNs) are proposed as a novel generalization to CNNs by using quantum mechanisms. The quantum mechanisms lead to an efficient training process in QCNNs by reducing the size of input from $N$ to $log_2N$. This paper implements and compares both CNNs and QCNNs by testing losses and prediction accuracy on three commonly used datasets. The datasets include the MNIST hand-written digits, Fashion MNIST and cat/dog face images. Additionally, data augmentation (DA), a technique commonly used in CNNs to improve the performance of classification by generating similar images based on original inputs, is also implemented in QCNNs. Surprisingly, the results showed that data augmentation didn't improve QCNNs performance. The reasons and logic behind this result are discussed, hoping to expand our understanding of Quantum machine learning theory.
    摘要 近年来,经典的卷积神经网络(CNN)已成功应用于图像识别领域。量子卷积神经网络(QCNN)是一种新的推广,通过利用量子机制来提高训练效率。量子机制使输入大小从$N$减少到$log_2N$,从而提高了QCNN的训练效率。本文对CNN和QCNN进行实现和比较,并在三个常用的数据集上测试损失和预测精度。这三个数据集分别是MNIST手写数字、时尚MNIST和猫狗脸部图像。此外,数据扩展(DA)技术,通常用于CNN来提高分类性能,也在QCNN中实现。然而,结果显示,数据扩展并没有提高QCNN的性能。这些结果的原因和逻辑被讨论,以扩展我们对量子机器学习理论的理解。

A Generalizable Deep Learning System for Cardiac MRI

  • paper_url: http://arxiv.org/abs/2312.00357
  • repo_url: None
  • paper_authors: Rohan Shad, Cyril Zakka, Dhamanpreet Kaur, Robyn Fong, Ross Warren Filice, John Mongan, Kimberly Kalianos, Nishith Khandwala, David Eng, Matthew Leipzig, Walter Witschey, Alejandro de Feria, Victor Ferrari, Euan Ashley, Michael A. Acker, Curtis Langlotz, William Hiesinger
  • for: cardiac MRI assessment of myocardial structure, function, and tissue characteristics
  • methods: deep learning model trained via self-supervised contrastive learning using raw text of radiology reports
  • results: remarkable performance across a range of tasks, including left ventricular ejection fraction regression and diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy, with clinical grade diagnostic accuracy using a fraction of the typical training data.
    Abstract Cardiac MRI allows for a comprehensive assessment of myocardial structure, function, and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep learning model is trained via self-supervised contrastive learning, by which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank, and two additional publicly available external datasets. We explore emergent zero-shot capabilities of our system, and demonstrate remarkable performance across a range of tasks; including the problem of left ventricular ejection fraction regression, and the diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep learning system is capable of not only understanding the staggering complexity of human cardiovascular disease, but can be directed towards clinical problems of interest yielding impressive, clinical grade diagnostic accuracy with a fraction of the training data typically required for such tasks.
    摘要 心脏磁共振成像(MRI)可以对心肌结构、功能与组织特性进行全面评估。我们在此描述一个面向心脏 MRI 的基础视觉系统,能够表征人类心血管疾病与健康状态的广阔范围。我们的深度学习模型通过自监督对比学习训练,从配套放射学报告的原始文本中学习电影序列心脏 MRI 扫描中的视觉概念。我们在美国四家大型学术临床机构的数据上训练并评估了模型,并在 UK BioBank 以及另外两个公开外部数据集上展示了其表现。我们探索了系统涌现的零样本能力,并在一系列任务上展示出卓越性能,包括左心室射血分数回归,以及心脏淀粉样变性、肥厚型心肌病等 35 种不同疾病的诊断。结果表明,我们的深度学习系统不仅能够理解人类心血管疾病的惊人复杂性,还能被引导至感兴趣的临床问题,仅用此类任务通常所需训练数据的一小部分即可达到临床级诊断精度。
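Report-supervised contrastive pretraining is typically implemented as a CLIP-style InfoNCE objective between scan and report embeddings; the following generic sketch (stand-in linear encoders, not the authors' architecture) illustrates the loss:

```python
# Generic CLIP-style contrastive loss between imaging and report embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(2048, 512)   # placeholder for a cine-MRI encoder
text_encoder = nn.Linear(768, 512)     # placeholder for a report encoder

def clip_loss(img_feats, txt_feats, temperature=0.07):
    img = F.normalize(image_encoder(img_feats), dim=-1)
    txt = F.normalize(text_encoder(txt_feats), dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(img))            # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 2048), torch.randn(8, 768))
```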

Manipulating the Label Space for In-Context Classification

  • paper_url: http://arxiv.org/abs/2312.00351
  • repo_url: None
  • paper_authors: Haokun Chen, Xu Yang, Yuhang Huang, Zihan Wu, Jing Wang, Xin Geng
  • for: 提升上下文学习(In-Context Learning,ICL)能力,使语言模型(LM)与视觉语言模型(VLM)在不同数据集上取得更好的分类效果。
  • methods: 通过操纵每个上下文示例(ICE)的标签空间来提高其知识密度,使更少的 ICE 即可传递与更大示例集相当的信息量,从而提升上下文分类性能。
  • results: 在 ImageNet 与 CUB-200 等多个数据集上,该方法提升了上下文分类精度:在 ImageNet 上,2-shot 精度达到 76.21%,高于 4-shot 设置下的 74.70%,比 CLIP 高 0.67%;在 CUB-200 上,1-shot 精度从 48.86% 提升至 69.05%,比 CLIP 高 12.15%。
    Abstract After pre-training by generating the next word conditional on previous words, the Language Model (LM) acquires the ability of In-Context Learning (ICL) that can learn a new task conditional on the context of the given in-context examples (ICEs). Similarly, visually-conditioned Language Modelling is also used to train Vision-Language Models (VLMs) with ICL ability. However, such VLMs typically exhibit weaker classification abilities compared to contrastive learning-based models like CLIP, since the Language Modelling objective does not directly contrast whether an object is paired with a text. To improve the ICL of classification, using more ICEs to provide more knowledge is a straightforward way. However, this may largely increase the selection time, and more importantly, the inclusion of additional in-context images tends to extend the length of the in-context sequence beyond the processing capacity of a VLM. To alleviate these limitations, we propose to manipulate the label space of each ICE to increase its knowledge density, allowing for fewer ICEs to convey as much information as a larger set would. Specifically, we propose two strategies, Label Distribution Enhancement and Visual Descriptions Enhancement, to improve in-context classification performance on diverse datasets, including the classic ImageNet and more fine-grained datasets like CUB-200. Using our approach on ImageNet, we increase accuracy from 74.70% in a 4-shot setting to 76.21% with just 2 shots, surpassing CLIP by 0.67%. On CUB-200, our method raises 1-shot accuracy from 48.86% to 69.05%, 12.15% higher than CLIP. The code is given in https://anonymous.4open.science/r/MLS_ICC.
    摘要 语言模型(LM)在以前文预测下一个词的预训练后获得了上下文学习(ICL)能力,即能够依据给定上下文示例(ICE)学习新任务;视觉条件的语言建模同样被用于训练具备 ICL 能力的视觉语言模型(VLM)。然而,由于语言建模目标并不直接对比物体与文本是否配对,此类 VLM 的分类能力通常弱于 CLIP 等基于对比学习的模型。直接增加 ICE 数量虽可提供更多知识,但会显著增加挑选时间,且更长的上下文序列容易超出 VLM 的处理能力。为缓解这些限制,我们提出操纵每个 ICE 的标签空间以提高其知识密度,使更少的 ICE 即可传递与更大示例集相当的信息。具体而言,我们提出标签分布增强与视觉描述增强两种策略,在 ImageNet 与更细粒度的 CUB-200 等多样数据集上提升上下文分类性能:在 ImageNet 上,我们以 2-shot 取得 76.21% 的精度,高于 4-shot 设置下的 74.70%,超过 CLIP 0.67%;在 CUB-200 上,1-shot 精度从 48.86% 提升至 69.05%,比 CLIP 高 12.15%。代码见 https://anonymous.4open.science/r/MLS_ICC。

Student Activity Recognition in Classroom Environments using Transfer Learning

  • paper_url: http://arxiv.org/abs/2312.00348
  • repo_url: None
  • paper_authors: Anagha Deshpande, Vedant Deshpande
  • for: 通过人工智能和深度学习技术,提高教室环境的安全性、效率和整体教学质量
  • methods: 使用深度学习技术,包括迁移学习与特征提取,并构建了一个新的教室数据集
  • results: Xception 模型在新的教室数据集上达到 93% 的准确率,超过其他三个对比模型
    Abstract The recent advances in artificial intelligence and deep learning facilitate automation in various applications including home automation, smart surveillance systems, and healthcare among others. Human Activity Recognition is one of its emerging applications, which can be implemented in a classroom environment to enhance safety, efficiency, and overall educational quality. This paper proposes a system for detecting and recognizing the activities of students in a classroom environment. The dataset has been structured and recorded by the authors since a standard dataset for this task was not available at the time of this study. Transfer learning, a widely adopted method within the field of deep learning, has proven to be helpful in complex tasks like image and video processing. Pretrained models including VGG-16, ResNet-50, InceptionV3, and Xception are used for feature extraction and classification tasks. Xception achieved an accuracy of 93%, on the novel classroom dataset, outperforming the other three models in consideration. The system proposed in this study aims to introduce a safer and more productive learning environment for students and educators.
    摘要 人工智能与深度学习的最新进展推动了家居自动化、智能监控系统和医疗等多种应用的自动化。人体活动识别是其中一个新兴应用,可部署于教室环境以提升安全性、效率与整体教学质量。本文提出一个用于检测和识别教室环境中学生活动的系统。由于研究开展时尚无针对该任务的标准数据集,作者自行采集并构建了数据集。迁移学习作为深度学习领域广泛采用的方法,已被证明对图像和视频处理等复杂任务十分有效。我们使用 VGG-16、ResNet-50、InceptionV3 与 Xception 等预训练模型进行特征提取与分类。其中 Xception 在新构建的教室数据集上取得 93% 的准确率,优于其余三个模型。本研究提出的系统旨在为学生和教师营造更安全、更高效的学习环境。
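The transfer-learning recipe described here is the standard frozen-feature-extractor pattern; a minimal Keras sketch follows. The head size and the number of activity classes are illustrative assumptions, not values from the paper:

```python
# Standard transfer learning with a frozen Xception feature extractor.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False                       # keep pre-trained features fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(5, activation="softmax"),   # e.g. 5 classroom activities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```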

OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline

  • paper_url: http://arxiv.org/abs/2312.00343
  • repo_url: https://github.com/xiandaguo/openstereo
  • paper_authors: Xianda Guo, Juntao Lu, Chenming Zhang, Yiqi Wang, Yiqun Duan, Tian Yang, Zheng Zhu, Long Chen
  • for: 本研究旨在提供一个实用的立体匹配代码库,并通过实验评估不同网络模型的表现。
  • methods: 研究实现了超过 12 种立体匹配网络的训练与推理代码,并提出了一个名为 StereoBase 的简单而强大的基线模型。
  • results: 基于 OpenStereo 代码库与 StereoBase 基线模型,在 SceneFlow 数据集上的实验结果达到或超过了原论文中报告的性能指标。
    Abstract Stereo matching, a pivotal technique in computer vision, plays a crucial role in robotics, autonomous navigation, and augmented reality. Despite the development of numerous impressive methods in recent years, replicating their results and determining the most suitable architecture for practical application remains challenging. Addressing this gap, our paper introduces a comprehensive benchmark focusing on practical applicability rather than solely on performance enhancement. Specifically, we develop a flexible and efficient stereo matching codebase, called OpenStereo. OpenStereo includes training and inference codes of more than 12 network models, making it, to our knowledge, the most complete stereo matching toolbox available. Based on OpenStereo, we conducted experiments on the SceneFlow dataset and have achieved or surpassed the performance metrics reported in the original paper. Additionally, we conduct an in-depth revisitation of recent developments in stereo matching through ablative experiments. These investigations inspired the creation of StereoBase, a simple yet strong baseline model. Our extensive comparative analyses of StereoBase against numerous contemporary stereo matching methods on the SceneFlow dataset demonstrate its remarkably strong performance. The source code is available at https://github.com/XiandaGuo/OpenStereo.
    摘要 立体匹配是计算机视觉中的关键技术,在机器人、自主导航与增强现实中扮演重要角色。尽管近年来出现了许多出色的方法,但复现其结果并确定最适合实际应用的架构仍然困难。为填补这一空白,本文提出一个关注实际可用性而非单纯性能提升的全面基准。具体而言,我们开发了一个灵活高效的立体匹配代码库 OpenStereo,其中包含超过 12 种网络模型的训练与推理代码,是我们所知最完整的立体匹配工具箱。基于 OpenStereo,我们在 SceneFlow 数据集上进行实验,达到或超过了原论文报告的性能指标。此外,我们通过消融实验深入回顾了立体匹配的最新进展,并据此构建了 StereoBase——一个简单却强大的基线模型。我们在 SceneFlow 数据集上将 StereoBase 与众多当代立体匹配方法进行了广泛对比,结果显示其性能十分突出。源代码见 https://github.com/XiandaGuo/OpenStereo。

Learning Anatomically Consistent Embedding for Chest Radiography

  • paper_url: http://arxiv.org/abs/2312.00335
  • repo_url: https://github.com/jlianglab/peac
  • paper_authors: Ziyu Zhou, Haozhe Luo, Jiaxuan Pang, Xiaowei Ding, Michael Gotway, Jianming Liang
  • for: 本文旨在利用自监督学习(SSL)方法学习医学图像的视觉表征。
  • methods: 本文提出了一种新的 SSL 方法,即解剖一致性块嵌入(patch embedding of anatomical consistency,PEAC)。PEAC 通过稳定的基于网格的匹配来学习全局与局部一致性,并将预训练的 PEAC 模型迁移到多种下游任务。
  • results: 与现有的全监督/自监督方法相比,PEAC 取得了显著更好的性能;PEAC 还能捕捉同一患者不同视图之间以及不同性别、体重和健康状况患者之间的解剖结构一致性,从而提升了该方法在医学图像分析中的可解释性。
    Abstract Self-supervised learning (SSL) approaches have recently shown substantial success in learning visual representations from unannotated images. Compared with photographic images, medical images acquired with the same imaging protocol exhibit high consistency in anatomy. To exploit this anatomical consistency, this paper introduces a novel SSL approach, called PEAC (patch embedding of anatomical consistency), for medical image analysis. Specifically, in this paper, we propose to learn global and local consistencies via stable grid-based matching, transfer pre-trained PEAC models to diverse downstream tasks, and extensively demonstrate that (1) PEAC achieves significantly better performance than the existing state-of-the-art fully/self-supervised methods, and (2) PEAC captures the anatomical structure consistency across views of the same patient and across patients of different genders, weights, and healthy statuses, which enhances the interpretability of our method for medical image analysis.
    摘要 自监督学习(SSL)方法近来在从无标注图像学习视觉表征方面取得了巨大成功。与普通照片不同,按同一成像协议采集的医学图像在解剖结构上高度一致。为利用这种解剖一致性,本文提出一种新的 SSL 方法——解剖一致性块嵌入(PEAC),用于医学图像分析。具体而言,我们提出通过稳定的基于网格的匹配学习全局与局部一致性,并将预训练的 PEAC 模型迁移到多种下游任务。大量实验证明:(1)PEAC 的性能显著优于现有的全监督/自监督方法;(2)PEAC 能捕捉同一患者不同视图之间以及不同性别、体重与健康状况患者之间的解剖结构一致性,提升了方法在医学图像分析中的可解释性。

Improving Efficiency of DNN-based Relocalization Module for Autonomous Driving with Server-side Computing

  • paper_url: http://arxiv.org/abs/2312.00316
  • repo_url: None
  • paper_authors: Dengbo Li, Jieren Cheng, Boyi Liu
  • for: 本研究提出了一种基于深度神经网络(DNN)的自动驾驶相机重定位方法,以解决现有文献中推理阶段计算需求过高的问题。
  • methods: 我们的方法利用边缘-云协同,将神经网络的部分模块卸载到服务器端执行,以提高计算效率;并通过评估不同网络切分方案下数据帧的推理时间来指导卸载决策。
  • results: 研究显示,服务器端卸载对基于 DNN 的自动驾驶相机重定位方法至关重要;我们还讨论了数据融合的结果,并通过实验评估验证了所提框架的有效性。
    Abstract In this work, we present a novel framework for camera relocation in autonomous vehicles, leveraging deep neural networks (DNN). While existing literature offers various DNN-based camera relocation methods, their deployment is hindered by their high computational demands during inference. In contrast, our approach addresses this challenge through edge cloud collaboration. Specifically, we strategically offload certain modules of the neural network to the server and evaluate the inference time of data frames under different network segmentation schemes to guide our offloading decisions. Our findings highlight the vital role of server-side offloading in DNN-based camera relocation for autonomous vehicles, and we also discuss the results of data fusion. Finally, we validate the effectiveness of our proposed framework through experimental evaluation.
    摘要 在这个工作中,我们提出了一种新的摄像头重定位框架 для自动驾驶车辆,利用深度神经网络(DNN)。现有的文献提供了多种基于DNN的摄像头重定位方法,但它们在推理时的计算占用率很高。与此相反,我们的方法通过Edge云合作解决这个挑战。具体来说,我们在服务器和Edge云之间分别分配了certain module of the neural network,并评估了数据帧在不同网络分割方案下的推理时间,以便做出合适的卸载决策。我们的发现表明Edge云卸载在DNN基于摄像头重定位中发挥了关键作用,并且我们还讨论了数据融合的结果。最后,我们验证了我们的提议的效果通过实验评估。

Improving Normalization with the James-Stein Estimator

  • paper_url: http://arxiv.org/abs/2312.00313
  • repo_url: None
  • paper_authors: Seyedalireza Khoshsirat, Chandra Kambhamettu
  • for: 解决 Stein 悖论所揭示的样本均值在高维统计中并非最优估计的问题。
  • methods: 使用 James-Stein 估计器改进均值与方差的估计,并将其应用于深度学习中的归一化层。
  • results: 在图像分类、语义分割和 3D 物体分类等多种计算机视觉任务中,改进后的归一化层在不增加额外计算开销的情况下持续取得更高的精度。
    Abstract Stein's paradox holds considerable sway in high-dimensional statistics, highlighting that the sample mean, traditionally considered the de facto estimator, might not be the most efficacious in higher dimensions. To address this, the James-Stein estimator proposes an enhancement by steering the sample means toward a more centralized mean vector. In this paper, first, we establish that normalization layers in deep learning use inadmissible estimators for mean and variance. Next, we introduce a novel method to employ the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on different computer vision tasks: image classification, semantic segmentation, and 3D object classification. Through these evaluations, it is evident that our improved normalization layers consistently yield superior accuracy across all tasks without extra computational burden. Moreover, recognizing that a plethora of shrinkage estimators surpass the traditional estimator in performance, we study two other prominent shrinkage estimators: Ridge and LASSO. Additionally, we provide visual representations to intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization. The studies show that our method is less sensitive to batch size and regularization, improving accuracy under various setups.
    摘要 Stein 悖论在高维统计中影响深远,它指出传统上被视为默认估计量的样本均值,在高维情形下可能并非最有效。为此,James-Stein 估计器提出一种改进,将样本均值向一个更集中的均值向量收缩。本文首先指出,深度学习中的归一化层使用的是不可容许(inadmissible)的均值与方差估计量;随后提出一种在归一化层中利用 James-Stein 估计器改进均值与方差估计的新方法。我们在图像分类、语义分割与 3D 物体分类等不同计算机视觉任务上评估了该方法,结果表明改进后的归一化层在所有任务上都稳定地取得更高精度,且不增加额外计算负担。此外,鉴于许多收缩估计量在性能上优于传统估计量,我们还研究了 Ridge 与 LASSO 这两种著名的收缩估计量,并提供可视化结果以直观展示收缩对层统计量估计的影响。最后,我们研究了正则化与批大小对改进后批归一化的影响:结果表明,我们的方法对批大小与正则化更不敏感,在多种设置下均能提升精度。
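For a concrete view of the estimator itself: a common positive-part James-Stein variant shrinks the per-channel sample means toward their grand mean. The plug-in choices below (pooled variance, grand-mean target, the p-3 factor for an estimated target) are one standard textbook formulation and may differ from the paper's exact construction:

```python
# Numpy sketch: positive-part James-Stein shrinkage of per-channel means.
import numpy as np

def james_stein_means(x):
    """x: (batch, channels). Returns JS-shrunk per-channel means."""
    n, p = x.shape
    sample_means = x.mean(axis=0)                      # one mean per channel
    sigma2 = x.var(axis=0).mean() / n                  # variance of each mean
    target = sample_means.mean()                       # shrink toward grand mean
    dist2 = np.sum((sample_means - target) ** 2)
    shrink = max(0.0, 1.0 - (p - 3) * sigma2 / dist2)  # positive-part factor
    return target + shrink * (sample_means - target)

x = np.random.randn(32, 64) * 0.5 + 1.0
print(james_stein_means(x)[:4])
```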

Segment Anything Model-guided Collaborative Learning Network for Scribble-supervised Polyp Segmentation

  • paper_url: http://arxiv.org/abs/2312.00312
  • repo_url: None
  • paper_authors: Yiming Zhao, Tao Zhou, Yunqi Gu, Yi Zhou, Yizhe Zhang, Ye Wu, Huazhu Fu
  • for: This paper proposes a novel method for scribble-supervised polyp segmentation in colonoscopy images, which is crucial for early detection and prevention of colorectal cancer.
  • methods: The proposed method, called SAM-CLNet, combines a Cross-level Enhancement and Aggregation Network (CEA-Net) with a Segment Anything Model (SAM) to leverage the strengths of both models. CEA-Net uses a Cross-level Enhancement Module (CEM) and a Feature Aggregation Module (FAM) to enhance and aggregate features from different resolutions, while SAM provides segmentation guidance.
  • results: Extensive experiments show that SAM-CLNet outperforms state-of-the-art weakly-supervised segmentation methods, demonstrating the effectiveness of the proposed method for polyp segmentation in colonoscopy images.
    Abstract Polyp segmentation plays a vital role in accurately locating polyps at an early stage, which holds significant clinical importance for the prevention of colorectal cancer. Various polyp segmentation methods have been developed using fully-supervised deep learning techniques. However, pixel-wise annotation for polyp images by physicians during the diagnosis is both time-consuming and expensive. Moreover, visual foundation models such as the Segment Anything Model (SAM) have shown remarkable performance. Nevertheless, directly applying SAM to medical segmentation may not produce satisfactory results due to the inherent absence of medical knowledge. In this paper, we propose a novel SAM-guided Collaborative Learning Network (SAM-CLNet) for scribble-supervised polyp segmentation, enabling a collaborative learning process between our segmentation network and SAM to boost the model performance. Specifically, we first propose a Cross-level Enhancement and Aggregation Network (CEA-Net) for weakly-supervised polyp segmentation. Within CEA-Net, we propose a Cross-level Enhancement Module (CEM) that integrates the adjacent features to enhance the representation capabilities of different resolution features. Additionally, a Feature Aggregation Module (FAM) is employed to capture richer features across multiple levels. Moreover, we present a box-augmentation strategy that combines the segmentation maps generated by CEA-Net with scribble annotations to create more precise prompts. These prompts are then fed into SAM, generating segmentation SAM-guided masks, which can provide additional supervision to train CEA-Net effectively. Furthermore, we present an Image-level Filtering Mechanism to filter out unreliable SAM-guided masks. Extensive experimental results show that our SAM-CLNet outperforms state-of-the-art weakly-supervised segmentation methods.
    摘要 息肉分割对早期准确定位息肉具有重要作用,对结直肠癌的预防具有重大临床意义。目前已有多种基于全监督深度学习的息肉分割方法,但医生在诊断过程中对息肉图像进行像素级标注既耗时又昂贵。另一方面,Segment Anything Model(SAM)等视觉基础模型展现出卓越性能,但由于天然缺乏医学知识,直接将 SAM 应用于医学分割未必能取得满意结果。本文提出一种新颖的 SAM 引导协同学习网络(SAM-CLNet),用于涂鸦监督的息肉分割,使分割网络与 SAM 之间形成协同学习以提升模型性能。具体而言,我们首先提出跨层增强与聚合网络(CEA-Net)用于弱监督息肉分割:其中跨层增强模块(CEM)整合相邻特征以增强不同分辨率特征的表征能力,特征聚合模块(FAM)则跨多个层级捕获更丰富的特征。此外,我们提出框增强策略,将 CEA-Net 生成的分割图与涂鸦标注结合,形成更精确的提示并输入 SAM;由此生成的 SAM 引导分割掩码可为 CEA-Net 的训练提供额外监督。我们还设计了图像级过滤机制以剔除不可靠的 SAM 引导掩码。大量实验结果表明,SAM-CLNet 优于当前最先进的弱监督分割方法。

3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation

  • paper_url: http://arxiv.org/abs/2312.00311
  • repo_url: None
  • paper_authors: Zidu Wang, Xiangyu Zhu, Tianshuo Zhang, Baiqin Wang, Zhen Lei
  • for: 提供了3D面部重建的多种应用场景,但现有方法困难重建表达强的面部。
  • methods: 利用面部部件分割提供的几何信息,引入部件重投影距离损失(Part Re-projection Distance Loss,PRDL)来优化点集分布,从而提升人脸重建性能。
  • results: 与基于可微渲染器的方法相比,PRDL 具有更明确的梯度,并在广泛的定量与定性实验中达到了最先进的重建性能。
    Abstract 3D Morphable Models (3DMMs) provide promising 3D face reconstructions in various applications. However, existing methods struggle to reconstruct faces with extreme expressions due to deficiencies in supervisory signals, such as sparse or inaccurate landmarks. Segmentation information contains effective geometric contexts for face reconstruction. Certain attempts intuitively depend on differentiable renderers to compare the rendered silhouettes of reconstruction with segmentation, which is prone to issues like local optima and gradient instability. In this paper, we fully utilize the facial part segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). Specifically, PRDL transforms facial part segmentation into 2D points and re-projects the reconstruction onto the image plane. Subsequently, by introducing grid anchors and computing different statistical distances from these anchors to the point sets, PRDL establishes geometry descriptors to optimize the distribution of the point sets for face reconstruction. PRDL exhibits a clear gradient compared to the renderer-based methods and presents state-of-the-art reconstruction performance in extensive quantitative and qualitative experiments. The project will be publicly available.
    摘要 3D 形变模型(3DMM)在多种应用中提供了颇具前景的三维人脸重建。然而,由于稀疏或不准确的关键点等监督信号的不足,现有方法难以重建具有极端表情的人脸。分割信息蕴含着对人脸重建有效的几何上下文;一些尝试直观地依赖可微渲染器,将重建结果的渲染轮廓与分割进行比较,但这类做法容易陷入局部最优且梯度不稳定。本文充分利用面部部件分割的几何信息,引入部件重投影距离损失(PRDL)。具体而言,PRDL 将面部部件分割转化为二维点集,并将重建结果重投影到图像平面;随后引入网格锚点,计算锚点到点集的多种统计距离,构建几何描述符以优化点集分布,实现人脸重建。与基于渲染器的方法相比,PRDL 具有更明确的梯度,并在广泛的定量与定性实验中取得最先进的重建性能。该项目将公开发布。
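A simplified sketch of the anchor-descriptor idea: distances from fixed grid anchors to a 2D point set form a descriptor, and the loss matches descriptors of the re-projected reconstruction against those of the segmentation. Using the min-distance statistic below is only one choice among the "different statistical distances" the abstract mentions:

```python
# PyTorch sketch of a PRDL-style anchor descriptor loss.
import torch

def anchor_descriptor(points, anchors):
    """points: (N, 2), anchors: (A, 2) -> (A,) nearest-point distances."""
    return torch.cdist(anchors, points).min(dim=1).values

# Regular grid of anchors over the normalized image plane.
g = torch.linspace(0, 1, 16)
anchors = torch.stack(torch.meshgrid(g, g, indexing="ij"), dim=-1).reshape(-1, 2)

pred_pts = torch.rand(500, 2, requires_grad=True)   # re-projected face part
seg_pts = torch.rand(480, 2)                        # points from segmentation

prdl = torch.nn.functional.mse_loss(
    anchor_descriptor(pred_pts, anchors),
    anchor_descriptor(seg_pts, anchors))
prdl.backward()   # gradients flow to the re-projected points
```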

A knowledge-based data-driven (KBDD) framework for all-day identification of cloud types using satellite remote sensing

  • paper_url: http://arxiv.org/abs/2312.00308
  • repo_url: None
  • paper_authors: Longfeng Nie, Yuntian Chen, Mengge Du, Changqi Sun, Dongxiao Zhang
  • for: The paper targets the evaluation of changes in rainfall, heatwaves, water resources, floods and droughts, food security and vegetation cover, as well as land use, using high-resolution geostationary observations.
  • methods: It proposes a knowledge-based data-driven (KBDD) framework for all-day identification of cloud types based on spectral information from Himawari-8/9 satellite sensors, built around a novel, simple and efficient network named CldNet that is compared with widely used semantic segmentation networks.
  • results: CldNet achieves a state-of-the-art accuracy of 80.89+-2.18% in identifying cloud types, improving on widely used semantic segmentation networks by 32%, 46%, 22%, 2%, and 39%, respectively. With auxiliary information, CldNet-W (using visible and near-infrared bands) and CldNet-O (without them) reach 82.23+-2.14% and 73.21+-2.02% on the test dataset. The trained CldNet can predict cloud types at higher spatial resolution from 0.02°*0.02° satellite spectral data without fine-tuning, indicating strong generalization ability.
    Abstract Cloud types, as a type of meteorological data, are of particular significance for evaluating changes in rainfall, heatwaves, water resources, floods and droughts, food security and vegetation cover, as well as land use. In order to effectively utilize high-resolution geostationary observations, a knowledge-based data-driven (KBDD) framework for all-day identification of cloud types based on spectral information from Himawari-8/9 satellite sensors is designed. And a novel, simple and efficient network, named CldNet, is proposed. Compared with widely used semantic segmentation networks, including SegNet, PSPNet, DeepLabV3+, UNet, and ResUnet, our proposed model CldNet with an accuracy of 80.89+-2.18% is state-of-the-art in identifying cloud types and has increased by 32%, 46%, 22%, 2%, and 39%, respectively. With the assistance of auxiliary information (e.g., satellite zenith/azimuth angle, solar zenith/azimuth angle), the accuracy of CldNet-W using visible and near-infrared bands and CldNet-O not using visible and near-infrared bands on the test dataset is 82.23+-2.14% and 73.21+-2.02%, respectively. Meanwhile, the total parameters of CldNet are only 0.46M, making it easy for edge deployment. More importantly, the trained CldNet without any fine-tuning can predict cloud types with higher spatial resolution using satellite spectral data with spatial resolution 0.02{\deg}*0.02{\deg}, which indicates that CldNet possesses a strong generalization ability. In aggregate, the KBDD framework using CldNet is a highly effective cloud-type identification system capable of providing a high-fidelity, all-day, spatiotemporal cloud-type database for many climate assessment fields.
    摘要 云类型作为一种气象数据,对评估降水、热浪、水资源、洪涝与干旱、粮食安全与植被覆盖以及土地利用的变化具有特殊重要性。为有效利用高分辨率静止轨道卫星观测,我们设计了一个基于知识的数据驱动(KBDD)框架,利用 Himawari-8/9 卫星传感器的光谱信息进行全天候云类型识别,并提出了一种新颖、简单且高效的网络 CldNet。与 SegNet、PSPNet、DeepLabV3+、UNet 和 ResUnet 等广泛使用的语义分割网络相比,我们的模型 CldNet 以 80.89±2.18% 的准确率达到云类型识别的最先进水平,分别提升了 32%、46%、22%、2% 和 39%。借助辅助信息(如卫星天顶/方位角、太阳天顶/方位角),使用可见光与近红外波段的 CldNet-W 与不使用这些波段的 CldNet-O 在测试集上的准确率分别为 82.23±2.14% 和 73.21±2.02%。同时,CldNet 的总参数量仅 0.46M,便于边缘部署。更重要的是,训练好的 CldNet 无需任何微调即可利用空间分辨率 0.02°*0.02° 的卫星光谱数据预测更高空间分辨率的云类型,表明其具有很强的泛化能力。总体而言,采用 CldNet 的 KBDD 框架是一个高效的云类型识别系统,能够为众多气候评估领域提供高保真、全天候、时空连续的云类型数据库。

RadioGalaxyNET: Dataset and Novel Computer Vision Algorithms for the Detection of Extended Radio Galaxies and Infrared Hosts

  • paper_url: http://arxiv.org/abs/2312.00306
  • repo_url: None
  • paper_authors: Nikhel Gupta, Zeeshan Hayder, Ray P. Norris, Minh Huynh, Lars Petersson
  • for: automatic identification of associated components of extended sources and their corresponding infrared hosts
  • methods: multimodal dataset and novel computer vision algorithms
  • results: detection and localization of multi-component extended radio galaxies and their corresponding infrared hosts
    Abstract Creating radio galaxy catalogues from next-generation deep surveys requires automated identification of associated components of extended sources and their corresponding infrared hosts. In this paper, we introduce RadioGalaxyNET, a multimodal dataset, and a suite of novel computer vision algorithms designed to automate the detection and localization of multi-component extended radio galaxies and their corresponding infrared hosts. The dataset comprises 4,155 instances of galaxies in 2,800 images with both radio and infrared channels. Each instance provides information about the extended radio galaxy class, its corresponding bounding box encompassing all components, the pixel-level segmentation mask, and the keypoint position of its corresponding infrared host galaxy. RadioGalaxyNET is the first dataset to include images from the highly sensitive Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope, corresponding infrared images, and instance-level annotations for galaxy detection. We benchmark several object detection algorithms on the dataset and propose a novel multimodal approach to simultaneously detect radio galaxies and the positions of infrared hosts.
    摘要 从下一代深度巡天构建射电星系目录,需要自动识别展源的关联组成部分及其对应的红外宿主星系。本文介绍 RadioGalaxyNET——一个多模态数据集,以及一套新的计算机视觉算法,用于自动检测和定位多组分展源射电星系及其对应的红外宿主。该数据集在 2,800 幅同时包含射电与红外通道的图像中涵盖 4,155 个星系实例;每个实例提供展源射电星系的类别、包含全部组成部分的边界框、像素级分割掩码,以及对应红外宿主星系的关键点位置。RadioGalaxyNET 是首个包含高灵敏度 ASKAP 射电望远镜图像、对应红外图像以及实例级星系检测标注的数据集。我们在该数据集上对多种目标检测算法进行了基准测试,并提出一种新的多模态方法,可同时检测射电星系与红外宿主的位置。

Developmental Pretraining (DPT) for Image Classification Networks

  • paper_url: http://arxiv.org/abs/2312.00304
  • repo_url: None
  • paper_authors: Niranjan Rajesh, Debayan Gupta
  • for: 解决深度神经网络对物体识别的数据需求不断增长,而传统预训练方法需要大量数据,我们提出了发展预训练(DPT)方法。
  • methods: DPT 以人类婴儿视觉发育为灵感,采用分阶段的课程式教学方法,先向网络传授边缘、形状等基础且通用的特征,再逐步过渡到更复杂的特征。
  • results: 对比随机初始化的模型,DPT方法可以提高模型的性能,表明DPT可以成为一种可行的预训练方法。
    Abstract In the backdrop of increasing data requirements of Deep Neural Networks for object recognition that is growing more untenable by the day, we present Developmental PreTraining (DPT) as a possible solution. DPT is designed as a curriculum-based pre-training approach designed to rival traditional pre-training techniques that are data-hungry. These training approaches also introduce unnecessary features that could be misleading when the network is employed in a downstream classification task where the data is sufficiently different from the pre-training data and is scarce. We design the curriculum for DPT by drawing inspiration from human infant visual development. DPT employs a phased approach where carefully-selected primitive and universal features like edges and shapes are taught to the network participating in our pre-training regime. A model that underwent the DPT regime is tested against models with randomised weights to evaluate the viability of DPT.
    摘要 在深度神经网络用于物体识别的数据需求日益难以为继的背景下,我们提出发展式预训练(DPT)作为一种可能的解决方案。DPT 是一种基于课程的预训练方法,旨在与数据需求巨大的传统预训练技术相竞争;传统方法还可能引入不必要的特征,当下游分类任务的数据与预训练数据差异较大且稀缺时,这些特征可能产生误导。我们借鉴人类婴儿视觉发育来设计 DPT 的课程:DPT 采用分阶段方式,向参与预训练的网络传授精心挑选的基础且通用的特征,如边缘和形状。我们将经过 DPT 训练的模型与随机初始化权重的模型进行对比测试,以评估 DPT 的可行性。

QIENet: Quantitative irradiance estimation network using recurrent neural network based on satellite remote sensing data

  • paper_url: http://arxiv.org/abs/2312.00299
  • repo_url: None
  • paper_authors: Longfeng Nie, Yuntian Chen, Dongxiao Zhang, Xinyue Liu, Wentian Yuan
  • for: 该研究旨在提供高空间分辨率的全球水平辐照度(GHI)估计,用于可再生能源开发。
  • methods: 提出了一种定量辐照度估计网络(QIENet),利用循环神经网络(RNN)与卷积操作分别提取卫星遥感数据的时间与空间特征,并将与 GHI 相关的时间信息(时、日、月)和地理信息(海拔、经度、纬度)一同作为模型输入。
  • results: QIENet 能够准确估计逐小时 GHI,且不会高估地面观测值;与 NSRDB/ERA5 相比,RMSE 分别降低 27.51%/18.00%,R2 分别提高 20.17%/9.42%,r 分别提高 8.69%/3.54%。
    Abstract Global horizontal irradiance (GHI) plays a vital role in estimating solar energy resources, which are used to generate sustainable green energy. In order to estimate GHI with high spatial resolution, a quantitative irradiance estimation network, named QIENet, is proposed. Specifically, the temporal and spatial characteristics of remote sensing data of the satellite Himawari-8 are extracted and fused by recurrent neural network (RNN) and convolution operation, respectively. Not only remote sensing data, but also GHI-related time information (hour, day, and month) and geographical information (altitude, longitude, and latitude), are used as the inputs of QIENet. The satellite spectral channels B07 and B11 - B15 and time are recommended as model inputs for QIENet according to the spatial distributions of annual solar energy. Meanwhile, QIENet is able to capture the impact of various clouds on hourly GHI estimates. More importantly, QIENet does not overestimate ground observations and can also reduce RMSE by 27.51%/18.00%, increase R2 by 20.17%/9.42%, and increase r by 8.69%/3.54% compared with ERA5/NSRDB. Furthermore, QIENet is capable of providing a high-fidelity hourly GHI database with spatial resolution 0.02°*0.02° (approximately 2km*2km) for many applied energy fields.
    摘要 全球水平辐照度(GHI)在估算太阳能资源、产生可持续绿色能源方面发挥着重要作用。为实现高空间分辨率的 GHI 估计,本文提出定量辐照度估计网络 QIENet:分别通过循环神经网络(RNN)与卷积操作提取并融合 Himawari-8 卫星遥感数据的时间与空间特征,并将与 GHI 相关的时间信息(时、日、月)及地理信息(海拔、经度、纬度)一并作为模型输入。依据年太阳能的空间分布,推荐以卫星光谱通道 B07 与 B11-B15 以及时间作为 QIENet 的模型输入。QIENet 能够刻画各类云对逐小时 GHI 估计的影响,且不会高估地面观测;与 ERA5/NSRDB 相比,RMSE 降低 27.51%/18.00%,R2 提高 20.17%/9.42%,r 提高 8.69%/3.54%。此外,QIENet 能够为众多应用能源领域提供空间分辨率 0.02°*0.02°(约 2km*2km)的高保真逐小时 GHI 数据库。

Towards Redundancy-Free Sub-networks in Continual Learning

  • paper_url: http://arxiv.org/abs/2312.00840
  • repo_url: None
  • paper_authors: Cheng Chen, Jingkuan Song, LianLi Gao, Heng Tao Shen
  • for: This paper addresses the problem of catastrophic forgetting (CF) in continual learning, specifically by proposing a method called Information Bottleneck Masked sub-network (IBM) to mitigate CF and improve the efficiency of new tasks training.
  • methods: The IBM method uses information bottleneck to eliminate redundancy within sub-networks, accumulates valuable information into essential weights, and decomposes hidden representations to automate the construction process.
  • results: The paper shows that IBM consistently outperforms state-of-the-art methods, with a 70% reduction in the number of parameters within sub-networks and an 80% decrease in training time.
    Abstract Catastrophic Forgetting (CF) is a prominent issue in continual learning. Parameter isolation addresses this challenge by masking a sub-network for each task to mitigate interference with old tasks. However, these sub-networks are constructed relying on weight magnitude, which does not necessarily correspond to the importance of weights, resulting in maintaining unimportant weights and constructing redundant sub-networks. To overcome this limitation, inspired by information bottleneck, which removes redundancy between adjacent network layers, we propose \textbf{\underline{I}nformation \underline{B}ottleneck \underline{M}asked sub-network (IBM)} to eliminate redundancy within sub-networks. Specifically, IBM accumulates valuable information into essential weights to construct redundancy-free sub-networks, not only effectively mitigating CF by freezing the sub-networks but also facilitating new tasks training through the transfer of valuable knowledge. Additionally, IBM decomposes hidden representations to automate the construction process and make it flexible. Extensive experiments demonstrate that IBM consistently outperforms state-of-the-art methods. Notably, IBM surpasses the state-of-the-art parameter isolation method with a 70\% reduction in the number of parameters within sub-networks and an 80\% decrease in training time.
    摘要 灾难性遗忘(CF)是持续学习中的一个突出问题。参数隔离方法通过为每个任务掩码出一个子网络来缓解对旧任务的干扰;然而,这些子网络依据权重幅值构建,而幅值未必对应权重的重要性,导致保留不重要的权重并构建冗余的子网络。为克服这一局限,我们受信息瓶颈(其可消除相邻网络层之间的冗余)启发,提出信息瓶颈掩码子网络(IBM)以消除子网络内部的冗余。具体而言,IBM 将有价值的信息积累到关键权重中,构建无冗余的子网络:冻结这些子网络不仅能有效缓解 CF,还能通过有价值知识的迁移促进新任务的训练。此外,IBM 对隐藏表征进行分解,使构建过程自动化且灵活。大量实验表明,IBM 持续优于最先进方法;尤其是,IBM 在子网络参数量减少 70%、训练时间减少 80% 的情况下仍超越最先进的参数隔离方法。
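The paper's exact masking mechanism is not spelled out in the abstract; as a generic simplification of importance-based sub-network construction (the sigmoid scoring and 0.5 threshold are assumptions, not IBM's actual objective), one might learn per-weight importance with a sparsity penalty and freeze the surviving weights per task:

```python
# Rough sketch: learn per-weight importance, keep the valuable weights,
# freeze them as this task's redundancy-reduced sub-network.
import torch
import torch.nn as nn

layer = nn.Linear(256, 256)
scores = nn.Parameter(torch.randn_like(layer.weight))   # importance logits

def masked_forward(x):
    mask = torch.sigmoid(scores)             # soft mask during training
    return nn.functional.linear(x, layer.weight * mask, layer.bias)

def sparsity_penalty():
    return torch.sigmoid(scores).mean()      # push toward few active weights

# After training: binarize and freeze the surviving weights for this task.
with torch.no_grad():
    task_mask = (torch.sigmoid(scores) > 0.5).float()
print(f"kept {task_mask.mean().item():.1%} of weights in the sub-network")
```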

An Adaptive Correspondence Scoring Framework for Unsupervised Image Registration of Medical Images

  • paper_url: http://arxiv.org/abs/2312.00837
  • repo_url: None
  • paper_authors: Xiaoran Zhang, John C. Stendahl, Lawrence Staib, Albert J. Sinusas, Alex Wong, James S. Duncan
  • for: This work proposes an adaptive training scheme to address unsupervised learning for medical image registration.
  • methods: The method re-weights error residuals with a correspondence scoring map during training, so that spurious residuals caused by noise and covisibility do not mislead the displacement estimator.
  • results: The adaptive scheme consistently improves registration performance and clearly outperforms baseline methods.
    Abstract We propose an adaptive training scheme for unsupervised medical image registration. Existing methods rely on image reconstruction as the primary supervision signal. However, nuisance variables (e.g. noise and covisibility) often cause the loss of correspondence between medical images, violating the Lambertian assumption in physical waves (e.g. ultrasound) and consistent imaging acquisition. As the unsupervised learning scheme relies on intensity constancy to establish correspondence between images for reconstruction, this introduces spurious error residuals that are not modeled by the typical training objective. To mitigate this, we propose an adaptive framework that re-weights the error residuals with a correspondence scoring map during training, preventing the parametric displacement estimator from drifting away due to noisy gradients, which leads to performance degradations. To illustrate the versatility and effectiveness of our method, we tested our framework on three representative registration architectures across three medical image datasets along with other baselines. Our proposed adaptive framework consistently outperforms other methods both quantitatively and qualitatively. Paired t-tests show that our improvements are statistically significant. The code will be publicly available at \url{https://voldemort108x.github.io/AdaCS/}.
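The adaptive re-weighting itself is compact: a correspondence score in [0, 1] scales each pixel's residual so that regions with lost correspondence no longer dominate the gradient. A sketch of that one step follows; how the scoring map is estimated is the paper's machinery and is simply assumed here.

```python
import torch

def adaptive_registration_loss(warped, fixed, score_map):
    """Correspondence-weighted reconstruction loss for registration.

    warped: moving image after the estimated displacement; fixed: target
    image; score_map: per-pixel correspondence scores in [0, 1], assumed
    to come from a separate scoring module (treated as fixed here).
    """
    residual = (warped - fixed) ** 2
    return (score_map.detach() * residual).mean()

warped = torch.rand(1, 1, 64, 64, requires_grad=True)
fixed = torch.rand(1, 1, 64, 64)
score = torch.sigmoid(torch.randn(1, 1, 64, 64))   # placeholder scores
adaptive_registration_loss(warped, fixed, score).backward()
```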

Adaptability of Computer Vision at the Tactical Edge: Addressing Environmental Uncertainty

  • paper_url: http://arxiv.org/abs/2312.00269
  • repo_url: None
  • paper_authors: Hayden Moore
  • for: This paper aims to improve the intelligence-analysis capability of Computer Vision (CV) systems within Command and Control (C2) systems at the tactical edge.
  • methods: It proposes an Uncertainty Quantification (UQ)-driven approach for adapting CV models in the field as environments and targets change.
  • results: The paper argues that UQ-driven data operations and fine-tuning can improve the adaptability and precision of CV systems at the tactical edge.
    Abstract Computer Vision (CV) systems are increasingly being adopted into Command and Control (C2) systems to improve intelligence analysis on the battlefield, the tactical edge. CV systems leverage Artificial Intelligence (AI) algorithms to help visualize and interpret the environment, enhancing situational awareness. However, the adaptability of CV systems at the tactical edge remains challenging due to rapidly changing environments and objects, which can confuse the deployed models. A CV model leveraged in this environment can become uncertain in its predictions as the environment and the objects in it begin to change. Additionally, mission objectives can change rapidly, leading to adjustments in technology, camera angles, and image resolutions. All of these can negatively affect performance and potentially introduce uncertainty into the system. When the training environment and/or technology differs from the deployment environment, CV models can perform unexpectedly. Unfortunately, most scenarios at the tactical edge do not incorporate Uncertainty Quantification (UQ) into their deployed C2 and CV systems. This concept paper explores the idea of synchronizing robust data operations and model fine-tuning driven by UQ at the tactical edge. Specifically, this involves curating datasets and training child models on the residuals of predictions, using these child models to calculate prediction intervals (PI), and then using these PIs to calibrate the deployed models. By incorporating UQ into the core operations surrounding C2 and CV systems at the tactical edge, we can help drive purposeful adaptability on the battlefield.
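As a concept paper, no reference implementation is given; the following is one plausible, heavily simplified realization of the child-model / prediction-interval loop, with every name and constant hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                     # stand-in edge features
y_true = X[:, 0] + 0.1 * rng.normal(size=500)
y_pred = X[:, 0] + 0.3 * np.abs(X[:, 1]) * rng.normal(size=500)  # parent model

# Child model learns the size of the parent's residuals from the inputs.
child = RandomForestRegressor(n_estimators=50, random_state=0)
child.fit(X, np.abs(y_true - y_pred))

# Prediction interval: parent prediction +/- a multiple of predicted residual
# (rough Gaussian assumption; z = 1.64 gives roughly 90% coverage).
half_width = 1.64 * child.predict(X)
lower, upper = y_pred - half_width, y_pred + half_width

# Wide intervals flag inputs where the deployed model should be recalibrated.
flagged = half_width > np.quantile(half_width, 0.9)
```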

Heteroscedastic Uncertainty Estimation for Probabilistic Unsupervised Registration of Noisy Medical Images

  • paper_url: http://arxiv.org/abs/2312.00836
  • repo_url: None
  • paper_authors: Xiaoran Zhang, Daniel H. Pak, Shawn S. Ahn, Xiaoxiao Li, Chenyu You, Lawrence Staib, Albert J. Sinusas, Alex Wong, James S. Duncan
  • for: This paper proposes a heteroscedastic uncertainty estimation framework for unsupervised medical image registration. Existing methods assume a uniform noise level across the image, ignoring the heteroscedastic, input-dependent noise in real medical images, which introduces noisy gradients from over-penalized outliers and degrades performance.
  • methods: An adaptive weighting scheme based on a relative $\gamma$-exponentiated signal-to-noise ratio (SNR) is proposed for the displacement estimator, with a separate variance estimator modeling the heteroscedastic noise so the model is not driven away by spurious gradients from error residuals.
  • results: Tested on two representative registration architectures across three medical image datasets, the framework consistently outperforms baselines both quantitatively and qualitatively while providing accurate and sensible uncertainty measures; paired t-tests show the improvements are statistically significant. The code will be released at \url{https://voldemort108x.github.io/hetero_uncertainty/}.
    Abstract This paper proposes a heteroscedastic uncertainty estimation framework for unsupervised medical image registration. Existing methods rely on objectives (e.g. mean-squared error) that assume a uniform noise level across the image, disregarding the heteroscedastic and input-dependent characteristics of noise distribution in real-world medical images. This further introduces noisy gradients due to undesired penalization on outliers, causing unnatural deformation and performance degradation. To mitigate this, we propose an adaptive weighting scheme with a relative $\gamma$-exponentiated signal-to-noise ratio (SNR) for the displacement estimator after modeling the heteroscedastic noise using a separate variance estimator to prevent the model from being driven away by spurious gradients from error residuals, leading to more accurate displacement estimation. To illustrate the versatility and effectiveness of the proposed method, we tested our framework on two representative registration architectures across three medical image datasets. Our proposed framework consistently outperforms other baselines both quantitatively and qualitatively while also providing accurate and sensible uncertainty measures. Paired t-tests show that our improvements in registration accuracy are statistically significant. The code will be publicly available at \url{https://voldemort108x.github.io/hetero_uncertainty/}.
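The weighting scheme can be sketched directly from the abstract: a separate variance estimator models the heteroscedastic noise, and residuals are scaled by a gamma-exponentiated SNR. The mean normalization making the weight "relative" is our reading, not a detail given above.

```python
import torch

def hetero_weighted_loss(warped, fixed, log_var, gamma=0.5):
    """Residual loss down-weighted in low-SNR regions.

    log_var: output of a separate variance estimator for the same pixels.
    """
    var = log_var.exp()
    snr = fixed.pow(2) / (var + 1e-8)
    weight = snr.pow(gamma)
    weight = weight / weight.mean()            # relative normalization
    return (weight.detach() * (warped - fixed).pow(2)).mean()
```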

cs.AI - 2023-12-01

Video Summarization: Towards Entity-Aware Captions

  • paper_url: http://arxiv.org/abs/2312.02188
  • repo_url: None
  • paper_authors: Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang
  • for: This work proposes the task of directly summarizing news videos into entity-aware captions containing named entities.
  • methods: The proposed method augments visual information from the video with context retrieved from external world knowledge to generate entity-aware captions; a large-scale dataset, VIEWS (VIdeo NEWS), is released to support the task.
  • results: Experiments on three video captioning models demonstrate the effectiveness of the approach, which also generalizes to an existing news image captioning dataset.
    Abstract Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to existing news image captions dataset. With all the extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.

Spectral Temporal Contrastive Learning

  • paper_url: http://arxiv.org/abs/2312.00966
  • repo_url: None
  • paper_authors: Sacha Morin, Somjit Nath, Samira Ebrahimi Kahou, Guy Wolf
  • for: This work studies learning useful data representations without labels, in particular self-supervised contrastive learning, which usually relies on data augmentations to define positive pairs.
  • methods: It considers the temporal contrastive learning (TCL) setting, common in RL and robotics, where the sequential structure of the data defines positive pairs, and adapts recent work on Spectral CL to this setting.
  • results: The resulting Spectral Temporal Contrastive Learning (STCL) connects linear probing performance to the spectral properties of a state graph derived from a time-homogeneous reversible Markov chain, and the STCL loss can be estimated by treating previously observed data sequences as an ensemble of MCMC chains.
    Abstract Learning useful data representations without requiring labels is a cornerstone of modern deep learning. Self-supervised learning methods, particularly contrastive learning (CL), have proven successful by leveraging data augmentations to define positive pairs. This success has prompted a number of theoretical studies to better understand CL and investigate theoretical bounds for downstream linear probing tasks. This work is concerned with the temporal contrastive learning (TCL) setting where the sequential structure of the data is used instead to define positive pairs, which is more commonly used in RL and robotics contexts. In this paper, we adapt recent work on Spectral CL to formulate Spectral Temporal Contrastive Learning (STCL). We discuss a population loss based on a state graph derived from a time-homogeneous reversible Markov chain with uniform stationary distribution. The STCL loss enables to connect the linear probing performance to the spectral properties of the graph, and can be estimated by considering previously observed data sequences as an ensemble of MCMC chains.
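A sketch of the loss in the temporal setting, assuming STCL keeps the quadratic form of the spectral contrastive loss it adapts (positives are consecutive observations in a trajectory rather than augmentations):

```python
import torch

def stcl_loss(f_t, f_tp1):
    """f_t, f_tp1: (B, d) embeddings of consecutive observations."""
    pos = -2.0 * (f_t * f_tp1).sum(dim=1).mean()
    gram = f_t @ f_tp1.t()          # batch pairs approximate the
    neg = (gram ** 2).mean()        # independent-pair expectation
    return pos + neg
```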

The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

  • paper_url: http://arxiv.org/abs/2312.00960
  • repo_url: https://github.com/namburisrinath/llmcompression
  • paper_authors: Satya Sai Srinath Namburi, Makesh Sreedhar, Srinath Srinivasan, Frederic Sala
  • for: This work provides a systematic analysis of how commonly used compression techniques affect model performance across model families (encoder, encoder-decoder, and decoder), using the LAMA and LM-HARNESS benchmarks.
  • methods: Two standard compression techniques are studied: pruning, which eliminates redundant connections in model layers, and quantization, which represents model parameters with fewer bits.
  • results: The impact of compression is nuanced and varies across model families and compression levels; in particular, compression measurably affects the parametric knowledge stored in the models.
    Abstract Compressing large language models (LLMs), often consisting of billions of parameters, provides faster inference, smaller memory footprints, and enables local deployment. Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits. The key tradeoff is between the degree of compression and the impact on the quality of the compressed model. Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy. More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored. To help bridge this gap, we present a comprehensive analysis across multiple model families (ENCODER, ENCODER-DECODER, and DECODER) using the LAMA and LM-HARNESS benchmarks in order to systematically quantify the effect of commonly employed compression techniques on model performance. A particular focus is on tradeoffs involving parametric knowledge, with the goal of providing practitioners with practical insights to help make informed decisions on compression. We release our codebase1 to enable further research.
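For readers unfamiliar with the two primitives, minimal self-contained versions are below. Production pipelines use calibrated quantizers and structured sparsity, so treat these as definitions, not as the paper's experimental setup.

```python
import torch

def magnitude_prune(weight, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(sparsity * weight.numel())
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def uniform_quantize(weight, bits=8):
    """Symmetric uniform quantization to `bits`, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    return torch.round(weight / scale).clamp(-qmax, qmax) * scale

w = torch.randn(256, 256)
w_compressed = uniform_quantize(magnitude_prune(w), bits=8)
```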

Effectiveness of probabilistic contact tracing in epidemic containment: the role of super-spreaders and transmission paths reconstruction

  • paper_url: http://arxiv.org/abs/2312.00910
  • repo_url: None
  • paper_authors: A. P. Muntoni, F. Mazza, A. Braunstein, G. Catania, L. Dall’Asta
  • for: This study proposes probabilistic techniques for digital contact tracing to improve epidemic containment strategies against COVID-19.
  • methods: The authors numerically analyze the diagnostic and social costs of contact-tracing-based containment measures using three state-of-the-art models of SARS-CoV-2 spreading.
  • results: Probabilistic techniques allow for more effective mitigation at a lower diagnostic and social cost, and are remarkably effective at capturing backward propagations and super-spreading events.
    Abstract The recent COVID-19 pandemic underscores the significance of early-stage non-pharmacological intervention strategies. The widespread use of masks and the systematic implementation of contact tracing strategies provide a potentially equally effective and socially less impactful alternative to more conventional approaches, such as large-scale mobility restrictions. However, manual contact tracing faces strong limitations in accessing the network of contacts, and the scalability of currently implemented protocols for smartphone-based digital contact tracing becomes impractical during the rapid expansion phases of the outbreaks, due to the surge in exposure notifications and associated tests. A substantial improvement in digital contact tracing can be obtained through the integration of probabilistic techniques for risk assessment that can more effectively guide the allocation of new diagnostic tests. In this study, we first quantitatively analyze the diagnostic and social costs associated with these containment measures based on contact tracing, employing three state-of-the-art models of SARS-CoV-2 spreading. Our results suggest that probabilistic techniques allow for more effective mitigation at a lower cost. Secondly, our findings reveal a remarkable efficacy of probabilistic contact-tracing techniques in capturing backward propagations and super-spreading events, relevant features of the diffusion of many pathogens, including SARS-CoV-2.

Identifying Spurious Correlations using Counterfactual Alignment

  • paper_url: http://arxiv.org/abs/2312.02186
  • repo_url: https://github.com/ieee8023/latentshift
  • paper_authors: Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
  • for: To detect spurious correlations in black-box classifiers and thereby improve their generalization performance.
  • methods: Counterfactual images generated with respect to one classifier are fed into other classifiers, and the relationship between the resulting output changes is quantified.
  • results: The approach detects spurious correlations in black-box classifiers, identifies specific instances where a spurious correlation exists, and supports aggregate statistics over a dataset.
    Abstract Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual alignment method to detect and explore spurious correlations of black box classifiers. Counterfactual images generated with respect to one classifier can be input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists as well as compute aggregate statistics over a dataset. Our work demonstrates the ability to detect spurious correlations in face attribute classifiers. This is validated by observing intuitive trends in a face attribute classifier as well as fabricating spurious correlations and detecting their presence, both visually and quantitatively. Further, utilizing the CF alignment method, we demonstrate that we can rectify spurious correlations identified in classifiers.
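The alignment measurement reduces to comparing how two classifiers move on the same counterfactual edits. A sketch, with the counterfactual generator (e.g. the latent-shift method in the linked repo) abstracted behind a function and classifier outputs assumed to be scalar logits:

```python
import torch

def cf_alignment(clf_a, clf_b, images, counterfactual_fn):
    """Correlation between two classifiers' responses to A's counterfactuals."""
    cf_images = counterfactual_fn(images)        # generated w.r.t. clf_a
    delta_a = clf_a(cf_images) - clf_a(images)   # change A was driven to make
    delta_b = clf_b(cf_images) - clf_b(images)   # does B move the same way?
    a = (delta_a - delta_a.mean()) / (delta_a.std() + 1e-8)
    b = (delta_b - delta_b.mean()) / (delta_b.std() + 1e-8)
    return (a * b).mean()                        # Pearson correlation
```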

LLM-TAKE: Theme Aware Keyword Extraction Using Large Language Models

  • paper_url: http://arxiv.org/abs/2312.00909
  • repo_url: None
  • paper_authors: Reza Yousefi Maragheh, Chenhao Fang, Charan Chand Irugu, Parth Parikh, Jason Cho, Jianpeng Xu, Saranyan Sukumar, Malay Patel, Evren Korpeoglu, Sushant Kumar, Kannan Achan
  • for: This paper proposes a theme-aware keyword extraction model based on Large Language Models (LLMs) for generating product themes inferred from textual metadata.
  • methods: The framework includes several stages that avoid outputting non-informative or sensitive keywords and reduce the hallucinations common in LLMs, improving the accuracy and diversity of the extracted keywords.
  • results: Extensive experiments on three real-world datasets show that LLM-TAKE outperforms benchmark models on both accuracy-based and diversity-based metrics.
    Abstract Keyword extraction is one of the core tasks in natural language processing. Classic extraction models are notorious for having a short attention span, which makes it hard for them to capture relational connections among words and sentences that are far apart. This, in turn, makes their usage prohibitive for generating keywords that are inferred from the context of the whole text. In this paper, we explore using Large Language Models (LLMs) to generate keywords for items that are inferred from the items' textual metadata. Our modeling framework includes several stages to refine the results by avoiding keywords that are non-informative or sensitive and by reducing the hallucinations common in LLMs. We call our LLM-based framework Theme-Aware Keyword Extraction (LLM TAKE). We propose two variations of the framework for generating extractive and abstractive themes for products in an e-commerce setting. We perform an extensive set of experiments on three real datasets and show that our modeling framework can enhance accuracy-based and diversity-based metrics when compared with benchmark models.

Nash Learning from Human Feedback

  • paper_url: http://arxiv.org/abs/2312.00886
  • repo_url: None
  • paper_authors: Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot
  • for: This paper proposes a preference-learning approach based on pairwise human feedback for fine-tuning large language models (LLMs) so that their policies align better with human preferences.
  • methods: The approach first learns a preference model conditioned on two responses given a prompt, then pursues a policy that consistently generates responses preferred over those of any competing policy, i.e., the Nash equilibrium of the preference model; the authors call this Nash learning from human feedback (NLHF).
  • results: For a tabular policy representation, a novel mirror-descent-based algorithm, Nash-MD, produces a sequence of policies whose last iterate converges to the regularized Nash equilibrium; gradient-descent algorithms for parametric (deep-learning) policies are also explored, and experiments on fine-tuning an LLM for text summarization demonstrate the effectiveness of the approach.
    Abstract Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
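Nash-MD itself operates on parametric LLM policies and is not reproduced here; the tabular toy below only illustrates the object being targeted: a policy preferred at least 50% of the time against any competitor, computed with generic multiplicative-weights self-play on a preference matrix. All numbers and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))
P = A / (A + A.T)                       # P[i, j] = Pr(response i beats j)

pi = np.ones(n) / n
avg, eta, steps = np.zeros(n), 0.1, 5000
for _ in range(steps):
    margin = (P - 0.5) @ pi             # expected win margin vs current pi
    pi = pi * np.exp(eta * margin)      # multiplicative-weights update
    pi /= pi.sum()
    avg += pi / steps                   # averaged iterate approximates Nash

# Small: no pure response beats `avg` by a meaningful margin.
print(((P - 0.5) @ avg).max())
```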

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

  • paper_url: http://arxiv.org/abs/2312.00878
  • repo_url: https://github.com/walbouss/gem
  • paper_authors: Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne
  • for: This paper proposes a training-free open-vocabulary object localization method that performs well across datasets and backbones without any fine-tuning.
  • methods: The proposed Grounding Everything Module (GEM) generalizes the value-value attention introduced by CLIPSurgery to a self-self attention path, enforcing similarity among tokens arising from the same object while preserving alignment with the language space, guided by a set of regularizations that enable generalization across datasets and backbones.
  • results: Experiments show that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed large-scale OpenImagesV7 segmentation benchmark.
    Abstract Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. It shows that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.
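The self-self attention path can be written in a few lines. This sketch omits the paper's regularizations and any ensembling over projections, so it shows only the clustering behavior the abstract describes.

```python
import torch
import torch.nn.functional as F

def self_self_attention(v, n_iter=1, tau=0.07):
    """v: (N, d) projected token features from a frozen VL backbone.

    Using the same tensor as query, key and value acts like a clustering
    step: tokens from the same object pull together while remaining in the
    language-aligned feature space.
    """
    for _ in range(n_iter):
        v = F.normalize(v, dim=-1)
        attn = torch.softmax(v @ v.t() / tau, dim=-1)
        v = attn @ v
    return v
```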

3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing

  • paper_url: http://arxiv.org/abs/2312.00870
  • repo_url: None
  • paper_authors: Balamurugan Thambiraja, Sadegh Aliakbarian, Darren Cosker, Justus Thies
  • for: This paper targets personalized speech-driven 3D facial animation and editing.
  • methods: It proposes a lightweight audio-conditioned diffusion model for 3D facial motion that can be trained on a small 3D motion dataset while maintaining expressive lip motion, and fine-tuned for a specific subject from only a short video.
  • results: Quantitative and qualitative evaluations show the method outperforms existing state-of-the-art techniques, producing speech-driven animations with greater fidelity and diversity.
    Abstract We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing. While existing methods deterministically predict facial animations from speech, they overlook the inherent one-to-many relationship between speech and facial expressions, i.e., there are multiple reasonable facial expression animations matching an audio input. It is especially important in content creation to be able to modify generated motion or to specify keyframes. To enable stochasticity as well as motion editing, we propose a lightweight audio-conditioned diffusion model for 3D facial motion. This diffusion model can be trained on a small 3D motion dataset, maintaining expressive lip motion output. In addition, it can be finetuned for specific subjects, requiring only a short video of the person. Through quantitative and qualitative evaluations, we show that our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.

Making Large Multimodal Models Understand Arbitrary Visual Prompts

  • paper_url: http://arxiv.org/abs/2312.00784
  • repo_url: None
  • paper_authors: Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee
  • for: Improving region-specific image understanding in large multimodal models using natural visual prompts.
  • methods: A novel multimodal model is proposed that decodes arbitrary visual prompts, such as a red bounding box or a pointed arrow, overlaid directly on the image.
  • results: The model achieves state-of-the-art performance on Visual7W, PointQA, and the Visual Commonsense Reasoning benchmark, and the authors release ViP-Bench, a comprehensive benchmark for evaluating visual-prompt understanding.
    Abstract While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.
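The "simple design" is literal pixel overlay. A minimal version for one marker type follows; the paper supports arbitrary markers and may blend them with transparency, so this is only an illustration.

```python
import numpy as np

def overlay_red_box(image, box, thickness=3):
    """Draw a red bounding box directly onto (H, W, 3) uint8 RGB pixels."""
    img = image.copy()
    x0, y0, x1, y1 = box
    red = np.array([255, 0, 0], dtype=np.uint8)
    img[y0:y0 + thickness, x0:x1] = red   # top edge
    img[y1 - thickness:y1, x0:x1] = red   # bottom edge
    img[y0:y1, x0:x0 + thickness] = red   # left edge
    img[y0:y1, x1 - thickness:x1] = red   # right edge
    return img                            # fed to the model as-is

marked = overlay_red_box(np.zeros((224, 224, 3), np.uint8), (50, 60, 150, 180))
```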

Context Retrieval via Normalized Contextual Latent Interaction for Conversational Agent

  • paper_url: http://arxiv.org/abs/2312.00774
  • repo_url: https://github.com/jliu-v/pk-ncli
  • paper_authors: Junfeng Liu, Zhuocheng Mei, Kewen Peng, Ranga Raju Vatsavai
  • for: This work aims to improve the quality of conversational agents, which still face challenges including disrespecting knowledge and facts, lack of personalization to user preferences, and enormous computational demands during training and inference.
  • methods: A novel method, PK-NCLI, accurately and efficiently identifies relevant auxiliary information to improve response quality by learning the relevance among persona, chat history, and knowledge background through low-level normalized contextual latent interaction.
  • results: PK-NCLI outperforms the state-of-the-art method PK-FoCus by 47.80%/30.61%/24.14% in terms of perplexity, knowledge grounding, and training efficiency, respectively, while maintaining the same level of persona grounding; the paper also analyzes how language model choices and training-weight trade-offs affect performance.
    Abstract Conversational agents leveraging AI, particularly deep learning, are emerging in both academic research and real-world applications. However, these applications still face challenges, including disrespecting knowledge and facts, not personalizing to user preferences, and enormous demand for computational resources during training and inference. Recent research efforts have been focused on addressing these challenges from various aspects, including supplementing various types of auxiliary information to the conversational agents. However, existing methods are still not able to effectively and efficiently exploit relevant information from these auxiliary supplements to further unleash the power of the conversational agents and the language models they use. In this paper, we present a novel method, PK-NCLI, that is able to accurately and efficiently identify relevant auxiliary information to improve the quality of conversational responses by learning the relevance among persona, chat history, and knowledge background through low-level normalized contextual latent interaction. Our experimental results indicate that PK-NCLI outperforms the state-of-the-art method, PK-FoCus, by 47.80%/30.61%/24.14% in terms of perplexity, knowledge grounding, and training efficiency, respectively, and maintained the same level of persona grounding performance. We also provide a detailed analysis of how different factors, including language model choices and trade-offs on training weights, would affect the performance of PK-NCLI.

Automated Material Properties Extraction For Enhanced Beauty Product Discovery and Makeup Virtual Try-on

  • paper_url: http://arxiv.org/abs/2312.00766
  • repo_url: None
  • paper_authors: Fatemeh Taheri Dezaki, Himanshu Arora, Rahul Suresh, Amin Banitalebi-Dehkordi
  • for: Improving the beauty shopping experience by making product discovery more convenient and satisfying.
  • methods: An automated pipeline of multiple customized machine learning models extracts essential material attributes, such as color and finish type, from beauty product images.
  • results: The pipeline is demonstrated on eyeshadow (both single- and multi-shade) and extends successfully to other categories such as lipstick and foundation; ablations show it is more reliable than human labeling, and the extracted attributes enable virtual try-on experiences that make makeup shopping more engaging.
    Abstract The multitude of makeup products available can make it challenging to find the ideal match for desired attributes. An intelligent approach for product discovery is required to enhance the makeup shopping experience to make it more convenient and satisfying. However, enabling accurate and efficient product discovery requires extracting detailed attributes like color and finish type. Our work introduces an automated pipeline that utilizes multiple customized machine learning models to extract essential material attributes from makeup product images. Our pipeline is versatile and capable of handling various makeup products. To showcase the efficacy of our pipeline, we conduct extensive experiments on eyeshadow products (both single and multi-shade ones), a challenging makeup product known for its diverse range of shapes, colors, and finish types. Furthermore, we demonstrate the applicability of our approach by successfully extending it to other makeup categories like lipstick and foundation, showcasing its adaptability and effectiveness across different beauty products. Additionally, we conduct ablation experiments to demonstrate the superiority of our machine learning pipeline over human labeling methods in terms of reliability. Our proposed method showcases its effectiveness in cross-category product discovery, specifically in recommending makeup products that perfectly match a specified outfit. Lastly, we also demonstrate the application of these material attributes in enabling virtual-try-on experiences which makes makeup shopping experience significantly more engaging.

Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses

  • paper_url: http://arxiv.org/abs/2312.00763
  • repo_url: None
  • paper_authors: Xiao Ma, Swaroop Mishra, Ariel Liu, Sophie Su, Jilin Chen, Chinmay Kulkarni, Heng-Tze Cheng, Quoc Le, Ed Chi
  • for: Helping users structure their thoughts, explore options, navigate choices and recommendations, and more easily steer models toward personalized responses.
  • methods: ExploreLLM uses a schema-like structure and guided navigation so users can specify high-level preferences and goals.
  • results: In a user study, participants found ExploreLLM helpful for exploratory and planning tasks and were better able to personalize responses with high-level preferences.
    Abstract Large language model (LLM) powered chatbots are primarily text-based today, and impose a large interactional cognitive load, especially for exploratory or sensemaking tasks such as planning a trip or learning about a new city. Because the interaction is textual, users have little scaffolding in the way of structure, informational "scent", or ability to specify high-level preferences or goals. We introduce ExploreLLM that allows users to structure thoughts, help explore different options, navigate through the choices and recommendations, and to more easily steer models to generate more personalized responses. We conduct a user study and show that users find it helpful to use ExploreLLM for exploratory or planning tasks, because it provides a useful schema-like structure to the task, and guides users in planning. The study also suggests that users can more easily personalize responses with high-level preferences with ExploreLLM. Together, ExploreLLM points to a future where users interact with LLMs beyond the form of chatbots, and instead designed to support complex user tasks with a tighter integration between natural language and graphical user interfaces.

Deep Unlearning: Fast and Efficient Training-free Approach to Controlled Forgetting

  • paper_url: http://arxiv.org/abs/2312.00761
  • repo_url: https://github.com/sangamesh-kodge/class_forgetting
  • paper_authors: Sangamesh Kodge, Gobinda Saha, Kaushik Roy
  • for: This paper addresses machine unlearning, motivated by regulatory demands to delete user data upon request; existing methods either retrain from scratch or fine-tune per deletion request, and are constrained by computational resources and restricted access to the original training data.
  • methods: A novel class unlearning algorithm removes an entire class or group of classes from a trained model. It first estimates the Retain Space and the Forget Space, the feature/activation spaces of samples to be retained and unlearned, via a singular value decomposition over layer-wise activations collected from a few forward passes. It then removes the shared information from the forget space to isolate a class-discriminatory space, and finally projects the model weights in the direction orthogonal to that space.
  • results: On ImageNet with a Vision Transformer, the method incurs only a ~1.5% drop in retain accuracy relative to the original model while keeping accuracy on the unlearned class samples under 1%. It is also robust to membership inference attacks, improving on other baselines by 7.8% on average across various image classification datasets and architectures while being ~6x more computationally efficient.
    Abstract Machine unlearning has emerged as a prominent and challenging area of interest, driven in large part by the rising regulatory demands for industries to delete user data upon request and the heightened awareness of privacy. Existing approaches either retrain models from scratch or use several finetuning steps for every deletion request, often constrained by computational resource limitations and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, representing the feature or activation spaces for samples from classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the forget space to isolate class-discriminatory feature space for unlearning. Finally, we project the model weights in the orthogonal direction of the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer with only $\sim$1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subject to Membership Inference Attacks showing 7.8% improvement on average across a variety of image classification datasets and network architectures, as compared to other baselines while being $\sim$6x more computationally efficient.
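A simplified reading of the projection step is sketched below; the paper's exact space estimation and removal of shared information differ in detail, and `k` and `alpha` are hypothetical knobs, not the authors' parameters.

```python
import torch

def unlearning_projection(acts_retain, acts_forget, k=32, alpha=0.5):
    """Build a projector that removes class-discriminatory directions.

    acts_*: (n_samples, d) layer activations from a few forward passes.
    """
    Vr = torch.linalg.svd(acts_retain, full_matrices=False).Vh[:k].t()  # (d, k)
    Vf = torch.linalg.svd(acts_forget, full_matrices=False).Vh[:k].t()
    shared = Vr @ (Vr.t() @ Vf)           # forget directions explained by retain
    discrim = Vf - alpha * shared         # class-discriminatory remainder
    Q, _ = torch.linalg.qr(discrim)
    return torch.eye(acts_retain.shape[1]) - Q @ Q.t()

# Usage (one hooked layer): layer.weight.data = layer.weight.data @ P
P = unlearning_projection(torch.randn(200, 128), torch.randn(200, 128))
```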

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

  • paper_url: http://arxiv.org/abs/2312.00752
  • repo_url: https://github.com/radarFudan/mamba
  • paper_authors: Albert Gu, Tri Dao
  • for: This paper aims to improve the efficiency and performance of deep learning models, particularly in natural language processing tasks.
  • methods: The authors propose several improvements to the standard Transformer architecture, including the use of selective structured state space models (SSMs) and a hardware-aware parallel algorithm in recurrent mode.
  • results: The proposed model, called Mamba, achieves faster inference and linear scaling in sequence length, with improved performance on real data up to million-length sequences. Mamba also achieves state-of-the-art performance across several modalities, including language, audio, and genomics. Specifically, the Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size in language modeling tasks.
    Abstract Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
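A minimal sequential version of the selective recurrence may clarify the idea. Real Mamba discretizes with per-channel step sizes and runs a hardware-aware parallel scan; this loop keeps only the input-dependence of B, C, and the step size, which is what gives the model its ability to selectively propagate or forget information.

```python
import torch

def selective_scan(x, A, W_B, W_C, w_dt):
    """x: (T, d); A: (n,) diagonal state matrix (negative for stability);
    W_B, W_C: (d, n); w_dt: (d,). B, C and the step size depend on x[t]."""
    h = torch.zeros(A.shape[0], x.shape[1])
    ys = []
    for t in range(x.shape[0]):
        dt = torch.nn.functional.softplus(x[t] @ w_dt)   # input-dependent step
        B, C = x[t] @ W_B, x[t] @ W_C                    # input-dependent (n,)
        h = torch.exp(dt * A)[:, None] * h + (dt * B)[:, None] * x[t][None, :]
        ys.append(C @ h)                                 # (d,)
    return torch.stack(ys)

T, d, n = 10, 4, 8
y = selective_scan(torch.randn(T, d), -torch.rand(n),
                   torch.randn(d, n) / d ** 0.5,
                   torch.randn(d, n) / d ** 0.5, torch.randn(d))
```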

Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

  • paper_url: http://arxiv.org/abs/2312.00751
  • repo_url: None
  • paper_authors: Tam Nguyen, Tan M. Nguyen, Richard G. Baraniuk
  • for: Addressing the over-smoothing problem in deep transformer models, where token representations become identical as depth grows, degrading representation capacity.
  • methods: A novel regularizer penalizes the norm of the difference between the smooth output tokens of self-attention and the input tokens, preserving token fidelity; minimizing the resulting regularized energy functional yields the NeuTRENO class of transformers.
  • results: On practical tasks including object classification, image segmentation, and language modeling, NeuTRENO mitigates the over-smoothing of token representations and outperforms baseline transformers and state-of-the-art methods.
    Abstract Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
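Taken literally, the regularizer from the abstract is a single penalty term added to the training objective; the closed-form NeuTRENO update, derived by minimizing the full regularized energy, may differ from this literal form.

```python
import torch

def fidelity_regularizer(attn_out, v, lam=0.1):
    """Penalize the gap between self-attention outputs and input tokens.

    attn_out, v: (B, N, d). Adding lam * ||U - V||^2 to the loss keeps the
    output tokens U from collapsing onto one smooth representation of V.
    """
    return lam * (attn_out - v).pow(2).mean()
```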

Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games

  • paper_url: http://arxiv.org/abs/2312.00746
  • repo_url: None
  • paper_authors: Dekun Wu, Haochen Shi, Zhiyuan Sun, Bang Liu
  • for: This study explores applying large language models (LLMs) to "Jubensha" (Chinese murder mystery role-playing games), a novel area in AI-driven gaming.
  • methods: The authors introduce the first Chinese Jubensha dataset, including character scripts and game rules, and propose a unique multi-agent interaction framework that lets AI agents autonomously participate in the game, enriching the gameplay dynamics.
  • results: Specialized evaluation methods assess the agents' mastery of case information and reasoning skills; incorporating recent in-context learning techniques improves information gathering, murderer detection, and logical reasoning, and experiments validate the proposed methods.
    Abstract In this study, we explore the application of Large Language Models (LLMs) in "Jubensha" (Chinese murder mystery role-playing games), a novel area in AI-driven gaming. We introduce the first Chinese dataset specifically for Jubensha, including character scripts and game rules, to foster AI agent development in this complex narrative environment. Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in the game, enhancing the dynamics of Jubensha gameplay. To evaluate these AI agents, we developed specialized methods targeting their mastery of case information and reasoning skills. Furthermore, we incorporated the latest advancements in in-context learning to improve the agents' performance in critical aspects like information gathering, murderer detection, and logical reasoning. The experimental results validate the effectiveness of our proposed methods. This work aims to offer a fresh perspective on understanding LLM capabilities and establish a new benchmark for evaluating large language model-based agents to researchers in the field.

Scalable Meta-Learning with Gaussian Processes

  • paper_url: http://arxiv.org/abs/2312.00742
  • repo_url: None
  • paper_authors: Petru Tighineanu, Lukas Grossberger, Paul Baireuther, Kathrin Skubch, Stefan Falkner, Julia Vinogradska, Felix Berkenkamp
  • for: This work proposes a scalable Gaussian process (GP) model for meta-learning that quickly solves new tasks drawn from the same distribution.
  • methods: Existing approaches based on the closed-form GP posterior and Bayesian optimization are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models, disrupting the balance between exploration and exploitation. ScaML-GP instead uses a carefully designed multi-task kernel that enables hierarchical training and task scalability; conditioning on the meta-data exposes its modular nature and yields a test-task prior that combines the posteriors of the meta-task GPs.
  • results: In synthetic and real-world meta-learning experiments, ScaML-GP learns efficiently with both few and many meta-tasks.
    Abstract Meta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) together with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models. This may disrupt the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that is scalable in the number of tasks. Our core contribution is a carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature yielding a test-task prior that combines the posteriors of meta-task GPs. In synthetic and real-world meta-learning experiments, we demonstrate that ScaML-GP can learn efficiently both with few and many meta-tasks.
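A standard hierarchical multi-task kernel conveys the modular structure (a shared component plus a per-task component); ScaML-GP's actual kernel is designed specifically for hierarchical training and scalability, so treat this only as the general shape, with all lengthscales hypothetical.

```python
import numpy as np

def meta_kernel(x1, t1, x2, t2, ls_g=1.0, ls_t=0.5, var_t=0.5):
    """k((x,t),(x',t')) = k_global(x,x') + [t == t'] * k_task(x,x')."""
    k_g = np.exp(-0.5 * ((x1 - x2) / ls_g) ** 2)
    k_t = var_t * np.exp(-0.5 * ((x1 - x2) / ls_t) ** 2)
    return k_g + (t1 == t2) * k_t

data = [(0.1, 0), (0.5, 0), (0.2, 1)]   # (input, task id) pairs
K = np.array([[meta_kernel(x1, t1, x2, t2) for x2, t2 in data]
              for x1, t1 in data])
```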

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

  • paper_url: http://arxiv.org/abs/2312.00732
  • repo_url: https://github.com/lkeab/gaussian-grouping
  • paper_authors: Mingqiao Ye, Martin Danelljan, Fisher Yu, Lei Ke
  • for: Gaussian Splatting achieves high-quality, real-time novel-view synthesis of 3D scenes but lacks fine-grained, object-level scene understanding.
  • methods: Gaussian Grouping extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes by augmenting each Gaussian with a compact Identity Encoding, supervised during differentiable rendering by 2D mask predictions from SAM together with a 3D spatial consistency regularization, instead of expensive 3D labels.
  • results: The discrete, grouped 3D Gaussians reconstruct, segment, and edit 3D scenes with high visual quality, fine granularity, and efficiency, and a local Gaussian Editing scheme supports applications such as 3D object removal, inpainting, colorization, and scene recomposition.
    Abstract The recent Gaussian Splatting achieves high-quality and real-time novel-view synthesis of the 3D scenes. However, it is solely concentrated on the appearance and geometry modeling, while lacking in fine-grained object-level scene understanding. To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. We augment each Gaussian with a compact Identity Encoding, allowing the Gaussians to be grouped according to their object instance or stuff membership in the 3D scene. Instead of resorting to expensive 3D labels, we supervise the Identity Encodings during the differentiable rendering by leveraging the 2D mask predictions by SAM, along with introduced 3D spatial consistency regularization. Comparing to the implicit NeRF representation, we show that the discrete and grouped 3D Gaussians can reconstruct, segment and edit anything in 3D with high visual quality, fine granularity and efficiency. Based on Gaussian Grouping, we further propose a local Gaussian Editing scheme, which shows efficacy in versatile scene editing applications, including 3D object removal, inpainting, colorization and scene recomposition. Our code and models will be at https://github.com/lkeab/gaussian-grouping.

Virtual Fusion with Contrastive Learning for Single Sensor-based Activity Recognition

  • paper_url: http://arxiv.org/abs/2312.02185
  • repo_url: None
  • paper_authors: Duc-Anh Nguyen, Cuong Pham, Nhien-An Le-Khac
  • for: Addressing the problem in Human Activity Recognition (HAR) that each sensor type has different strengths and weaknesses, so a single sensor may not fully observe the user's motions, causing wrong predictions.
  • methods: Virtual Fusion exploits unlabeled data from multiple time-synchronized sensors during training, using contrastive learning to capture the correlations among sensors, while requiring only one sensor at inference; a generalization, Actual Fusion within Virtual Fusion (AFVF), uses a subset of the training sensors during inference.
  • results: Virtual Fusion achieves significantly better accuracy than training with the same single sensor, in some cases even surpassing actual fusion with multiple sensors at test time, and reaches state-of-the-art accuracy and F1-score on the UCI-HAR and PAMAP2 benchmarks.
    Abstract Various types of sensors can be used for Human Activity Recognition (HAR), and each of them has different strengths and weaknesses. Sometimes a single sensor cannot fully observe the user's motions from its perspective, which causes wrong predictions. While sensor fusion provides more information for HAR, it comes with many inherent drawbacks like user privacy and acceptance, costly set-up, operation, and maintenance. To deal with this problem, we propose Virtual Fusion - a new method that takes advantage of unlabeled data from multiple time-synchronized sensors during training, but only needs one sensor for inference. Contrastive learning is adopted to exploit the correlation among sensors. Virtual Fusion gives significantly better accuracy than training with the same single sensor, and in some cases, it even surpasses actual fusion using multiple sensors at test time. We also extend this method to a more general version called Actual Fusion within Virtual Fusion (AFVF), which uses a subset of training sensors during inference. Our method achieves state-of-the-art accuracy and F1-score on UCI-HAR and PAMAP2 benchmark datasets. Implementation is available upon request.
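The training-time idea reduces to a contrastive objective between time-synchronized windows from two sensors. A sketch follows (not the authors' exact objective); only `z_main`'s encoder is kept at inference.

```python
import torch
import torch.nn.functional as F

def virtual_fusion_loss(z_main, z_aux, tau=0.1):
    """InfoNCE between embeddings of the same time windows from two sensors.

    Matching windows are positives, other pairs in the batch are negatives,
    so the deployed sensor's encoder absorbs information from the auxiliary
    (unlabeled) sensor during training.
    """
    z1 = F.normalize(z_main, dim=1)
    z2 = F.normalize(z_aux, dim=1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.shape[0])
    return F.cross_entropy(logits, labels)
```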

Safe Reinforcement Learning in Tensor Reproducing Kernel Hilbert Space

  • paper_url: http://arxiv.org/abs/2312.00727
  • repo_url: None
  • paper_authors: Xiaoyuan Cheng, Boli Chen, Liz Varga, Yukun Hu
  • for: This paper addresses safe reinforcement learning (RL) in partially observable environments, targeting safe-reachability objectives.
  • methods: A stochastic model-based approach guarantees RL safety almost surely under unknown system dynamics and partial observation. It uses the Predictive State Representation (PSR) and a Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, with essential operators derived from the kernel Bayes' rule enabling recursive estimation of future observations.
  • results: Under an undercompleteness assumption, the RL algorithm attains polynomial sample complexity for infinite observation and action spaces, ensuring an $\epsilon$-suboptimal safe policy guarantee.
    Abstract This paper delves into the problem of safe reinforcement learning (RL) in a partially observable environment with the aim of achieving safe-reachability objectives. In traditional partially observable Markov decision processes (POMDP), ensuring safety typically involves estimating the belief in latent states. However, accurately estimating an optimal Bayesian filter in POMDP to infer latent states from observations in a continuous state space poses a significant challenge, largely due to the intractable likelihood. To tackle this issue, we propose a stochastic model-based approach that guarantees RL safety almost surely in the face of unknown system dynamics and partial observation environments. We leveraged the Predictive State Representation (PSR) and Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, and the results in this context are provable. Furthermore, we derived essential operators from the kernel Bayes' rule, enabling the recursive estimation of future observations using various operators. Under the assumption of \textit{undercompleness}, a polynomial sample complexity is established for the RL algorithm for the infinite size of observation and action spaces, ensuring an $\epsilon-$suboptimal safe policy guarantee.

DeepCache: Accelerating Diffusion Models for Free

  • paper_url: http://arxiv.org/abs/2312.00858
  • repo_url: https://github.com/horseee/deepcache
  • paper_authors: Xinyin Ma, Gongfan Fang, Xinchao Wang
  • for: This paper aims to reduce the computational cost of diffusion models while preserving their generative quality, without any retraining.
  • methods: It proposes DeepCache, a training-free method that exploits the temporal redundancy across adjacent denoising steps of diffusion models, caching high-level U-Net features and reusing them while cheaply updating the low-level ones.
  • results: Experiments show a 2.3x speedup for Stable Diffusion v1.5 with only a 0.05 drop in CLIP Score, and a 4.1x speedup for LDM-4-G with a 0.22 increase in FID on ImageNet. DeepCache also outperforms existing pruning and distillation methods that require retraining, and is compatible with current sampling techniques.
    Abstract Diffusion models have recently gained unprecedented attention in the field of image synthesis due to their remarkable generative capabilities. Notwithstanding their prowess, these models often incur substantial computational costs, primarily attributed to the sequential denoising process and cumbersome model size. Traditional methods for compressing diffusion models typically involve extensive retraining, presenting cost and feasibility challenges. In this paper, we introduce DeepCache, a novel training-free paradigm that accelerates diffusion models from the perspective of model architecture. DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models, which caches and retrieves features across adjacent denoising stages, thereby curtailing redundant computations. Utilizing the property of the U-Net, we reuse the high-level features while updating the low-level features in a very cheap way. This innovative strategy, in turn, enables a speedup factor of 2.3$\times$ for Stable Diffusion v1.5 with only a 0.05 decline in CLIP Score, and 4.1$\times$ for LDM-4-G with a slight decrease of 0.22 in FID on ImageNet. Our experiments also demonstrate DeepCache's superiority over existing pruning and distillation methods that necessitate retraining and its compatibility with current sampling techniques. Furthermore, we find that under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS. The code is available at https://github.com/horseee/DeepCache
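
The mechanism is easiest to see as a denoising loop that recomputes the expensive high-level U-Net features only every N steps and reuses the cached ones in between. The sketch below is schematic: the `unet_high`/`unet_low` split functions, the cache interval, and the update rule are hypothetical stand-ins; the real implementation (linked above) hooks into the U-Net's skip connections.

```python
import torch

def denoise_with_cache(x, timesteps, unet_high, unet_low, cache_interval=3):
    """Recompute expensive high-level features every `cache_interval` steps
    and reuse the cached ones in between (the temporal-redundancy idea)."""
    cached_high = None
    for i, t in enumerate(timesteps):
        if i % cache_interval == 0 or cached_high is None:
            cached_high = unet_high(x, t)        # deep, costly features
        eps = unet_low(x, t, cached_high)        # shallow path reuses the cache
        x = x - 0.1 * eps                        # placeholder update rule
    return x

# Dummy stand-ins so the sketch runs; real code splits an actual U-Net.
unet_high = lambda x, t: x.mean(dim=1, keepdim=True)
unet_low = lambda x, t, h: x - h
x = denoise_with_cache(torch.randn(1, 4, 64, 64), range(50), unet_high, unet_low)
```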

Removing Biases from Molecular Representations via Information Maximization

  • paper_url: http://arxiv.org/abs/2312.00718
  • repo_url: https://github.com/uhlerlab/infocore
  • paper_authors: Chenyu Wang, Sharut Gupta, Caroline Uhler, Tommi Jaakkola
  • for: This paper is written to address the challenge of dealing with batch effects in high-throughput drug screening data, and to propose a new method called InfoCORE for effectively removing these effects and obtaining refined molecular representations.
  • methods: The paper proposes an Information maximization approach for COnfounder REmoval (InfoCORE), which establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier, and adaptively reweighs samples to equalize their implied batch distribution.
  • results: The paper reports extensive experiments on drug screening data that demonstrate the superior performance of InfoCORE in a multitude of tasks, including molecular property prediction and molecule-phenotype retrieval. Additionally, the paper shows how InfoCORE can be used to resolve general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes.
    Abstract High-throughput drug screening -- using cell imaging or gene expression measurements as readouts of drug effect -- is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweighs samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show results for how InfoCORE offers a versatile framework and resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes. The code is available at https://github.com/uhlerlab/InfoCORE.
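
One ingredient of InfoCORE that is easy to isolate is the adaptive reweighting: given a classifier's posterior over batch identifiers for each latent representation, samples are reweighted so the implied batch distribution is equalized. The sketch below is one plausible reading of that step under simple inverse-probability weighting, not the authors' exact estimator.

```python
import torch

def batch_equalizing_weights(batch_posteriors, batch_ids):
    """Reweigh samples inversely to how confidently their batch can be
    predicted from the latent, equalizing the implied batch distribution.
    batch_posteriors: (n, n_batches) softmax outputs of a batch classifier.
    batch_ids:        (n,) true batch identifier per sample."""
    p_own_batch = batch_posteriors[torch.arange(len(batch_ids)), batch_ids]
    w = 1.0 / p_own_batch.clamp_min(1e-6)
    return w / w.sum()                      # normalized sample weights

posteriors = torch.softmax(torch.randn(8, 3), dim=1)
weights = batch_equalizing_weights(posteriors, torch.randint(0, 3, (8,)))
```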

Towards Transparency in Coreference Resolution: A Quantum-Inspired Approach

  • paper_url: http://arxiv.org/abs/2312.00688
  • repo_url: https://github.com/hwazni/qcoref
  • paper_authors: Hadi Wazni, Mehrnoosh Sadrzadeh
  • for: The paper targets the task of pronoun resolution in natural language processing.
  • methods: It combines quantum natural language processing (QNLP) with variational quantum classifiers (VQC) to perform pronoun resolution.
  • results: The approach achieves an F1 score of 87.20% on a Winograd-style pronoun resolution task, outperforming two out of three classical coreference resolution systems and approaching state-of-the-art SpanBERT; a mixed quantum-classical model further improves these results with an F1 score increase of around 6%.
    Abstract Guided by grammatical structure, words compose to form sentences, and guided by discourse structure, sentences compose to form dialogues and documents. The compositional aspect of sentence and discourse units is often overlooked by machine learning algorithms. A recent initiative called Quantum Natural Language Processing (QNLP) learns word meanings as points in a Hilbert space and acts on them via a translation of grammatical structure into Parametrised Quantum Circuits (PQCs). Previous work extended the QNLP translation to discourse structure using points in a closure of Hilbert spaces. In this paper, we evaluate this translation on a Winograd-style pronoun resolution task. We train a Variational Quantum Classifier (VQC) for binary classification and implement an end-to-end pronoun resolution system. The simulations executed on IBMQ software converged with an F1 score of 87.20%. The model outperformed two out of three classical coreference resolution systems and neared state-of-the-art SpanBERT. A mixed quantum-classical model yet improved these results with an F1 score increase of around 6%.

Resource-constrained knowledge diffusion processes inspired by human peer learning

  • paper_url: http://arxiv.org/abs/2312.00660
  • repo_url: None
  • paper_authors: Ehsan Beikihassan, Amy K. Hoover, Ioannis Koutis, Ali Parviz, Niloofar Aghaieabiane
  • for: This study aims to optimize aggregate performance measures of a population of artificial learners under constraints on training resources, motivated by peer learning in human educational systems.
  • methods: It studies natural knowledge diffusion processes in networks of interacting artificial learners, where the learners' internal states and learning processes are largely opaque and the main degree of freedom lies in the formation of peer learning groups by a coordinator who can evaluate learners before assigning them to groups.
  • results: The study shows empirically that such processes make effective use of training resources and enable the design of modular neural models that generalize without being prone to overfitting noisy labels.
    Abstract We consider a setting where a population of artificial learners is given, and the objective is to optimize aggregate measures of performance, under constraints on training resources. The problem is motivated by the study of peer learning in human educational systems. In this context, we study natural knowledge diffusion processes in networks of interacting artificial learners. By `natural', we mean processes that reflect human peer learning where the students' internal state and learning process is mostly opaque, and the main degree of freedom lies in the formation of peer learning groups by a coordinator who can potentially evaluate the learners before assigning them to peer groups. Among other things, we empirically show that such processes indeed make effective use of the training resources, and enable the design of modular neural models that have the capacity to generalize without being prone to overfitting noisy labels.

Simple Transferability Estimation for Regression Tasks

  • paper_url: http://arxiv.org/abs/2312.00656
  • repo_url: https://github.com/cuongnn218/regression_transferability
  • paper_authors: Cuong N. Nguyen, Phong Tran, Lam Si Tung Ho, Vu Dinh, Anh T. Tran, Tal Hassner, Cuong V. Nguyen
  • for: This paper studies transferability estimation, i.e., estimating how well a deep learning model transfers from a source to a target task. It focuses on regression tasks, which have received little prior attention, and proposes two simple and computationally efficient estimators.
  • methods: Both estimators are based on the negative regularized mean squared error of a linear regression model fitted on source-model features. The paper proves novel theoretical results connecting these estimates to the actual transferability of the optimal target models obtained from the transfer learning process.
  • results: On two large-scale keypoint regression benchmarks, the proposed approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.
    Abstract We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.
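
The estimator itself can be stated in a few lines: fit a regularized linear regression from frozen source-model features to the target labels and report the negative regularized mean squared error. The sketch below uses ridge regression from scikit-learn; the feature extractor, data, and regularization strength are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def transferability_score(source_features, target_labels, alpha=1.0):
    """Negative regularized MSE of a linear head on frozen source features:
    higher (closer to zero) suggests better transfer to the target task."""
    reg = Ridge(alpha=alpha).fit(source_features, target_labels)
    preds = reg.predict(source_features)
    penalty = alpha * np.sum(reg.coef_ ** 2)
    return -(mean_squared_error(target_labels, preds) + penalty / len(target_labels))

feats = np.random.randn(200, 64)   # features from a frozen source model
labels = np.random.randn(200, 2)   # target regression labels (e.g. keypoints)
print(transferability_score(feats, labels))
```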

Latent Space Explorer: Visual Analytics for Multimodal Latent Space Exploration

  • paper_url: http://arxiv.org/abs/2312.00857
  • repo_url: None
  • paper_authors: Bum Chul Kwon, Samuel Friedman, Kai Xu, Steven A Lubitz, Anthony Philippakis, Puneet Batra, Patrick T Ellinor, Kenney Ng
  • for: This work develops a visual analytics system that helps medical experts use multimodal data for prediction and for uncovering novel medical insights.
  • methods: The system, Latent Space Explorer, provides interactive visualizations for exploring multimodal representations of subjects, defining subgroups of interest, interactively decoding data across modalities for selected subjects, and inspecting the accuracy of the embedding in downstream prediction tasks.
  • results: A user study with medical experts indicates that Latent Space Explorer supports their exploration and understanding of multimodal data, and suggests directions for further development in the medical domain.
    Abstract Machine learning models built on training data with multiple modalities can reveal new insights that are not accessible through unimodal datasets. For example, cardiac magnetic resonance images (MRIs) and electrocardiograms (ECGs) are both known to capture useful information about subjects' cardiovascular health status. A multimodal machine learning model trained from large datasets can potentially predict the onset of heart-related diseases and provide novel medical insights about the cardiovascular system. Despite the potential benefits, it is difficult for medical experts to explore multimodal representation models without visual aids and to test the predictive performance of the models on various subpopulations. To address the challenges, we developed a visual analytics system called Latent Space Explorer. Latent Space Explorer provides interactive visualizations that enable users to explore the multimodal representation of subjects, define subgroups of interest, interactively decode data with different modalities with the selected subjects, and inspect the accuracy of the embedding in downstream prediction tasks. A user study was conducted with medical experts and their feedback provided useful insights into how Latent Space Explorer can help their analysis and possible new direction for further development in the medical domain.

TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.00651
  • repo_url: None
  • paper_authors: Pengxiang Li, Zhili Liu, Kai Chen, Lanqing Hong, Yunzhi Zhuge, Dit-Yan Yeung, Huchuan Lu, Xu Jia
  • for: Improving multi-object tracking (MOT) systems by generating high-quality tracking sequences.
  • methods: Building on image diffusion models, TrackDiffusion generates continuous video sequences from tracklets, capturing complex motion trajectories while ensuring instance consistency across frames.
  • results: Experiments show that the model significantly improves instance consistency in the generated sequences and the resulting perceptual metrics, achieving gains of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset.
    Abstract Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, the potential in generating high-quality tracking sequences, a crucial aspect in the field of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from the tracklets. TrackDiffusion represents a significant departure from the traditional layout-to-image (L2I) generation and copy-paste synthesis focusing on static image elements like bounding boxes by empowering image diffusion models to encompass dynamic and continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency among video frames. For the first time, we demonstrate that the generated video sequences can be utilized for training multi-object tracking (MOT) systems, leading to significant improvement in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.

Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction

  • paper_url: http://arxiv.org/abs/2312.00855
  • repo_url: None
  • paper_authors: Shuchi Wu, Chuan Ma, Kang Wei, Xiaogang Xu, Ming Ding, Yuwen Qian, Tao Xiang
  • for: This paper addresses two main deficiencies of previous attacks for stealing pre-trained encoders: suboptimal performance caused by biased optimization objectives, and high query costs caused by the end-to-end paradigm that must query the target encoder every epoch.
  • methods: The proposed RDA method has two main steps: first, the target encoder's representations for each training sample are Refined into a less biased, sample-wise prototype that consolidates the sample's various perspectives, establishing a more reasonable optimization objective; second, a multi-relational extraction loss trains the surrogate encoder to Discriminate mismatched embedding-prototype pairs while Aligning matched ones in both amplitude and angle, enabling query-free training once the prototypes are built.
  • results: Experiments show that RDA achieves state-of-the-art results across various downstream datasets with far fewer queries, and remains robust against multiple widely used defenses.
    Abstract This paper introduces RDA, a pioneering approach designed to address two primary deficiencies prevalent in previous endeavors aiming at stealing pre-trained encoders: (1) suboptimal performances attributed to biased optimization objectives, and (2) elevated query costs stemming from the end-to-end paradigm that necessitates querying the target encoder every epoch. Specifically, we initially Refine the representations of the target encoder for each training sample, thereby establishing a less biased optimization objective before the steal-training phase. This is accomplished via a sample-wise prototype, which consolidates the target encoder's representations for a given sample's various perspectives. Demanding exponentially fewer queries compared to the end-to-end approach, prototypes can be instantiated to guide subsequent query-free training. For more potent efficacy, we develop a multi-relational extraction loss that trains the surrogate encoder to Discriminate mismatched embedding-prototype pairs while Aligning those matched ones in terms of both amplitude and angle. In this way, the trained surrogate encoder achieves state-of-the-art results across the board in various downstream datasets with limited queries. Moreover, RDA is shown to be robust to multiple widely-used defenses.
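
Two pieces of RDA can be sketched compactly: the sample-wise prototype (an average of the target encoder's embeddings over several views of the same sample) and an alignment loss that matches both the amplitude (norm) and angle (direction) of surrogate embeddings to their prototypes. The exact multi-relational discrimination term is omitted, and this is a plausible reading rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def sample_prototype(view_embeddings):
    """Consolidate a sample's embeddings over its augmented views
    (n_views, dim) into a single prototype (dim,)."""
    return view_embeddings.mean(dim=0)

def amplitude_angle_loss(z, proto):
    """Align matched embedding-prototype pairs in both norm and direction."""
    angle = 1.0 - F.cosine_similarity(z, proto, dim=1)          # direction
    amplitude = (z.norm(dim=1) - proto.norm(dim=1)).abs()       # magnitude
    return (angle + amplitude).mean()

views = torch.randn(5, 4, 256)            # 5 samples x 4 views x dim
protos = torch.stack([sample_prototype(v) for v in views])
z_surrogate = torch.randn(5, 256)
loss = amplitude_angle_loss(z_surrogate, protos)
```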

A Probabilistic Neural Twin for Treatment Planning in Peripheral Pulmonary Artery Stenosis

  • paper_url: http://arxiv.org/abs/2312.00854
  • repo_url: None
  • paper_authors: John D. Lee, Jakob Richter, Martin R. Pfaller, Jason M. Szafron, Karthik Menon, Andrea Zanoni, Michael R. Ma, Jeffrey A. Feinstein, Jacqueline Kreutzer, Alison L. Marsden, Daniele E. Schiavazzi
  • for: This study uses data-driven architectures and optimization techniques to reduce the computational cost of high-fidelity hemodynamic models, enabling digital-twin technology for time-critical clinical decisions.
  • methods: It combines offline assimilation of boundary conditions, model reduction, and training-dataset generation with online estimation of marginal probabilities, possibly conditioned on the degree of augmentation observed in already repaired lesions.
  • results: The study proposes a probabilistic offline-online pipeline for real-time treatment planning, together with a new approach for parametrizing arbitrarily shaped vascular repairs through iterative corrections of a zero-dimensional approximant.
    Abstract The substantial computational cost of high-fidelity models in numerical hemodynamics has, so far, relegated their use mainly to offline treatment planning. New breakthroughs in data-driven architectures and optimization techniques for fast surrogate modeling provide an exciting opportunity to overcome these limitations, enabling the use of such technology for time-critical decisions. We discuss an application to the repair of multiple stenosis in peripheral pulmonary artery disease through either transcatheter pulmonary artery rehabilitation or surgery, where it is of interest to achieve desired pressures and flows at specific locations in the pulmonary artery tree, while minimizing the risk for the patient. Since different degrees of success can be achieved in practice during treatment, we formulate the problem in probability, and solve it through a sample-based approach. We propose a new offline-online pipeline for probabilistic real-time treatment planning which combines offline assimilation of boundary conditions, model reduction, and training dataset generation with online estimation of marginal probabilities, possibly conditioned on the degree of augmentation observed in already repaired lesions. Moreover, we propose a new approach for the parametrization of arbitrarily shaped vascular repairs through iterative corrections of a zero-dimensional approximant. We demonstrate this pipeline for a diseased model of the pulmonary artery tree available through the Vascular Model Repository.

Towards Efficient 3D Object Detection in Bird’s-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

  • paper_url: http://arxiv.org/abs/2312.00633
  • repo_url: None
  • paper_authors: Yuxin Li, Qiang Han, Mengying Yu, Yuxin Jiang, Chaikiat Yeo, Yiheng Li, Zihang Huang, Nini Liu, Hsuanhan Chen, Xiaojun Wu
  • for: Improving the accuracy and inference speed of 3D object detection in Bird's-Eye-View (BEV) space for autonomous driving.
  • methods: The paper proposes BEVENet, an efficient BEV-based 3D detection framework with a convolutional-only architecture that circumvents the quadratic complexity of ViT models while retaining the effectiveness of BEV-based methods.
  • results: BEVENet is 3x faster than contemporary state-of-the-art approaches on the NuScenes challenge, achieving an mAP of 0.456 and an NDS of 0.555 on the NuScenes validation set at an inference speed of 47.6 frames per second.
    Abstract 3D object detection in Bird's-Eye-View (BEV) space has recently emerged as a prevalent approach in the field of autonomous driving. Despite the demonstrated improvements in accuracy and velocity estimation compared to perspective view methods, the deployment of BEV-based techniques in real-world autonomous vehicles remains challenging. This is primarily due to their reliance on vision-transformer (ViT) based architectures, which introduce quadratic complexity with respect to the input resolution. To address this issue, we propose an efficient BEV-based 3D detection framework called BEVENet, which leverages a convolutional-only architectural design to circumvent the limitations of ViT models while maintaining the effectiveness of BEV-based methods. Our experiments show that BEVENet is 3$\times$ faster than contemporary state-of-the-art (SOTA) approaches on the NuScenes challenge, achieving a mean average precision (mAP) of 0.456 and a nuScenes detection score (NDS) of 0.555 on the NuScenes validation dataset, with an inference speed of 47.6 frames per second. To the best of our knowledge, this study stands as the first to achieve such significant efficiency improvements for BEV-based methods, highlighting their enhanced feasibility for real-world autonomous driving applications.

Weighted Riesz Particles

  • paper_url: http://arxiv.org/abs/2312.00621
  • repo_url: https://github.com/986876245/weighted-riesz-particles
  • paper_authors: Xiongming Dai, Gerald Baumgartner
  • for: This paper addresses the computational complexity of Markov chain Monte Carlo (MCMC) methods when exploring complex statistical distributions, especially in high-dimensional parameter spaces.
  • methods: It proposes a weighted-Riesz-energy-based method that generates points through pairwise interactions to discretize rectifiable submanifolds, and embeds the resulting "Riesz particles" into sequential MCMC.
  • results: Comparative experiments on a linear Gaussian state-space model with synthetic data and a non-linear stochastic volatility model with real-world data show higher acceptance rates with fewer evaluations.
    Abstract Markov chain Monte Carlo (MCMC) methods are simulated by local exploration of complex statistical distributions, and while bypassing the cumbersome requirement of a specific analytical expression for the target, this stochastic exploration of an uncertain parameter space comes at the expense of a large number of samples, and this computational complexity increases with parameter dimensionality. Although at the exploration level, some methods are proposed to accelerate the convergence of the algorithm, such as tempering, Hamiltonian Monte Carlo, Rao-Blackwellization, and scalable methods for better performance, it cannot avoid the stochastic nature of this exploration. We consider the target distribution as a mapping where the infinite-dimensional Eulerian space of the parameters consists of a number of deterministic submanifolds and propose a generalized energy metric, termed weighted Riesz energy, where a number of points is generated through pairwise interactions, to discretize rectifiable submanifolds. We study the properties of the point, called the Riesz particle, and embed it into sequential MCMC, finding higher acceptance rates with fewer evaluations; we validate this through an experimental comparative analysis on a linear Gaussian state-space model with synthetic data and a non-linear stochastic volatility model with real-world data.
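
The weighted Riesz energy of a point set is a sum of pairwise inverse-power interactions; minimizing it spreads points evenly over a set. A minimal numpy sketch is shown below, assuming unit weights, plain gradient descent, and exponent s = 2; the paper's weighting scheme and its embedding into sequential MCMC are more involved.

```python
import numpy as np

def riesz_energy(points, s=2.0, eps=1e-9):
    """Riesz energy with unit weights: sum over pairs of
    ||x_i - x_j||^{-s}; low energy means well-spread points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + eps)
    np.fill_diagonal(dist, np.inf)          # drop self-interactions
    return (dist ** -s).sum() / 2.0

def spread_points(points, steps=200, lr=1e-3, s=2.0):
    """Gradient descent on the Riesz energy: points repel each other."""
    pts = points.copy()
    for _ in range(steps):
        diff = pts[:, None, :] - pts[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1) + 1e-9)
        np.fill_diagonal(dist, np.inf)
        # d/dx_i ||x_i - x_j||^{-s} = -s ||x_i - x_j||^{-(s+2)} (x_i - x_j)
        grad = (-s * dist ** -(s + 2))[:, :, None] * diff
        pts -= lr * grad.sum(axis=1)
    return pts

pts = spread_points(np.random.rand(50, 2))
print("final energy:", riesz_energy(pts))
```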

Learning from One Continuous Video Stream

  • paper_url: http://arxiv.org/abs/2312.00598
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman
  • for: This paper studies online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling.
  • methods: It introduces a framework comprising streams and tasks composed from two existing video datasets, a performance-evaluation methodology that considers both adaptation and generalization, and pixel-to-pixel modelling as a flexible way to switch between pre-training and single-stream evaluation across arbitrary tasks, without changing the model and always using the same pixel loss.
  • results: Pre-training with a novel family of future prediction tasks yields large single-stream learning gains; the authors find that momentum hurts and that the pace of weight updates matters. Combining these insights matches the performance of IID learning with batch size 1, using the same architecture and without costly replay buffers.
    Abstract We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, found that momentum hurts, and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers.
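
The evaluation protocol boils down to an online loop: one frame at a time, batch size 1, always the same pixel loss, with the pace of weight updates as a key knob. A schematic PyTorch loop is below; the tiny model, synthetic stream, and update interval are illustrative stand-ins for the paper's setup.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in pixel-to-pixel model
opt = torch.optim.SGD(model.parameters(), lr=1e-3,
                      momentum=0.0)                  # the paper finds momentum hurts
pixel_loss = nn.MSELoss()
update_every = 4                                    # pace of weight updates

stream = (torch.randn(1, 3, 64, 64) for _ in range(100))  # stand-in video stream
prev = next(stream)
for step, frame in enumerate(stream):
    pred = model(prev)                  # predict the next frame from the current
    loss = pixel_loss(pred, frame) / update_every
    loss.backward()                     # gradients accumulate between updates
    if (step + 1) % update_every == 0:
        opt.step()
        opt.zero_grad()
    prev = frame
```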

UAVs and Birds: Enhancing Short-Range Navigation through Budgerigar Flight Studies

  • paper_url: http://arxiv.org/abs/2312.00597
  • repo_url: None
  • paper_authors: Md. Mahmudur Rahman, Sajid Islam, Showren Chowdhury, Sadia Jahan Zeba, Debajyoti Karmaker
  • for: This study examines the flight behaviors of Budgerigars (Melopsittacus undulatus) to gain insights into their flight trajectories and movements.
  • methods: Using 3D reconstruction from stereo video camera recordings, the authors closely analyze velocity and acceleration patterns during takeoff, flying and landing.
  • results: The study identifies distinctive characteristics of the birds' motion during takeoff, flight and landing, including their ability to control speed across situations and surfaces, which could inform algorithms for improving the performance and autonomy of Unmanned Aerial Vehicles (UAVs).
    Abstract This study delves into the flight behaviors of Budgerigars (Melopsittacus undulatus) to gain insights into their flight trajectories and movements. Using 3D reconstruction from stereo video camera recordings, we closely examine the velocity and acceleration patterns during three flight motion takeoff, flying and landing. The findings not only contribute to our understanding of bird behaviors but also hold significant implications for the advancement of algorithms in Unmanned Aerial Vehicles (UAVs). The research aims to bridge the gap between biological principles observed in birds and the application of these insights in developing more efficient and autonomous UAVs. In the context of the increasing use of drones, this study focuses on the biologically inspired principles drawn from bird behaviors, particularly during takeoff, flying and landing flight, to enhance UAV capabilities. The dataset created for this research sheds light on Budgerigars' takeoff, flying, and landing techniques, emphasizing their ability to control speed across different situations and surfaces. The study underscores the potential of incorporating these principles into UAV algorithms, addressing challenges related to short-range navigation, takeoff, flying, and landing.

BCN: Batch Channel Normalization for Image Classification

  • paper_url: http://arxiv.org/abs/2312.00596
  • repo_url: https://github.com/AfifaKhaled/Batch-Channel-Normalization
  • paper_authors: Afifa Khaled, Chao Li, Jia Ning, Kun He
  • for: This work proposes a new normalization technique to improve the performance of deep learning models.
  • methods: The proposed Batch Channel Normalization (BCN) separately normalizes the input along the (N, H, W) and (C, H, W) axes, then combines the two normalized outputs with adaptive parameters, tailoring the mix of BN and LN to the dataset or task at hand.
  • results: Experiments show that the technique can be seamlessly applied to various versions of CNN or Vision Transformer architectures and outperforms the standard Batch Normalization (BN) and Layer Normalization (LN) across datasets.
    Abstract Normalization techniques have been widely used in the field of deep learning due to their capability of enabling higher learning rates and reducing the sensitivity to initialization. However, the effectiveness of popular normalization technologies is typically limited to specific areas. Unlike the standard Batch Normalization (BN) and Layer Normalization (LN), where BN computes the mean and variance along the (N,H,W) dimensions and LN computes the mean and variance along the (C,H,W) dimensions (N, C, H and W are the batch, channel, spatial height and width dimension, respectively), this paper presents a novel normalization technique called Batch Channel Normalization (BCN). To exploit both the channel and batch dependence and adaptively combine the advantages of BN and LN based on specific datasets or tasks, BCN separately normalizes inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized outputs based on adaptive parameters. As a basic block, BCN can be easily integrated into existing models for various applications in the field of computer vision. Empirical results show that the proposed technique can be seamlessly applied to various versions of CNN or Vision Transformer architecture. The code is publicly available at https://github.com/AfifaKhaled/BatchChannel-Normalization
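
Since BN normalizes along (N, H, W) and LN along (C, H, W), BCN can be written as the two normalizations combined by an adaptive weight. The module below is a minimal sketch consistent with that description; the single sigmoid-bounded mixing parameter per layer is an assumption, and the official code linked above is authoritative.

```python
import torch
import torch.nn as nn

class BatchChannelNorm(nn.Module):
    """Combine BatchNorm (over N,H,W) and LayerNorm (over C,H,W)
    with a learnable, sigmoid-bounded mixing weight."""
    def __init__(self, num_channels, spatial_shape):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels)
        self.ln = nn.LayerNorm([num_channels, *spatial_shape])
        self.mix = nn.Parameter(torch.zeros(1))   # sigmoid(0) = 0.5 at init

    def forward(self, x):
        w = torch.sigmoid(self.mix)
        return w * self.bn(x) + (1 - w) * self.ln(x)

bcn = BatchChannelNorm(16, (32, 32))
y = bcn(torch.randn(8, 16, 32, 32))
```

Note that normalizing over (C, H, W) with nn.LayerNorm requires a fixed spatial shape, another simplification of this sketch.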

Less is More: Learning Reference Knowledge Using No-Reference Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2312.00591
  • repo_url: https://github.com/LXDxmu/IQA
  • paper_authors: Xudong Li, Jingyuan Zheng, Xiawu Zheng, Runze Hu, Enwei Zhang, Yuting Gao, Yunhang Shen, Ke Li, Yutao Liu, Pingyang Dai, Yan Zhang, Rongrong Ji
  • for: This paper tackles No-Reference Image Quality Assessment (NR-IQA), asking how reference knowledge can be learned and exploited to assess image quality when pristine reference images are unavailable.
  • methods: It proposes a novel feature distillation method that learns comparative knowledge from non-aligned reference images, together with an inductive bias regularization that speeds up convergence, avoids overfitting, and strengthens the feature-extraction framework so it expresses richer quality information.
  • results: Despite using less input than its teacher models, the method achieves state-of-the-art performance on eight standard NR-IQA datasets, with PLCC values of 0.917 (vs. 0.884 on LIVEC) and 0.686 (vs. 0.661 on LIVEFB).
    Abstract Image Quality Assessment (IQA) with reference images have achieved great success by imitating the human vision system, in which the image quality is effectively assessed by comparing the query image with its pristine reference image. However, for the images in the wild, it is quite difficult to access accurate reference images. We argue that it is possible to learn reference knowledge under the No-Reference Image Quality Assessment (NR-IQA) setting, which is effective and efficient empirically. Concretely, by innovatively introducing a novel feature distillation method in IQA, we propose a new framework to learn comparative knowledge from non-aligned reference images. And then, to achieve fast convergence and avoid overfitting, we further propose an inductive bias regularization. Such a framework not only solves the congenital defects of NR-IQA but also improves the feature extraction framework, enabling it to express more abundant quality information. Surprisingly, our method utilizes less input while obtaining a more significant improvement compared to the teacher models. Extensive experiments on eight standard NR-IQA datasets demonstrate the superior performance to the state-of-the-art NR-IQA methods, i.e., achieving the PLCC values of 0.917 (vs. 0.884 in LIVEC) and 0.686 (vs. 0.661 in LIVEFB).

Explainable Fraud Detection with Deep Symbolic Classification

  • paper_url: http://arxiv.org/abs/2312.00586
  • repo_url: https://github.com/samanthav24/dsc_fraud_detection
  • paper_authors: Samantha Visbeek, Erman Acar, Floris den Hengst
  • for: Deep Symbolic Classification (DSC) extends the deep symbolic regression framework to classification, targeting the needs of the fraud-detection domain.
  • methods: DSC casts classification as a search over the space of analytic functions built from a vocabulary of variables, constants and operations, guided by a deep neural network trained with reinforcement learning, and directly optimizes an arbitrary evaluation metric such as the F1 score.
  • results: On the PaySim dataset, DSC matches the predictive performance of state-of-the-art models while surpassing them in explainability, making it a promising model for fraud detection.
    Abstract There is a growing demand for explainable, transparent, and data-driven models within the domain of fraud detection. Decisions made by fraud detection models need to be explainable in the event of a customer dispute. Additionally, the decision-making process in the model must be transparent to win the trust of regulators and business stakeholders. At the same time, fraud detection solutions can benefit from data due to the noisy, dynamic nature of fraud and the availability of large historical data sets. Finally, fraud detection is notorious for its class imbalance: there are typically several orders of magnitude more legitimate transactions than fraudulent ones. In this paper, we present Deep Symbolic Classification (DSC), an extension of the Deep Symbolic Regression framework to classification problems. DSC casts classification as a search problem in the space of all analytic functions composed of a vocabulary of variables, constants, and operations and optimizes for an arbitrary evaluation metric directly. The search is guided by a deep neural network trained with reinforcement learning. Because the functions are mathematical expressions that are in closed-form and concise, the model is inherently explainable both at the level of a single classification decision and the model's decision process. Furthermore, the class imbalance problem is successfully addressed by optimizing for metrics that are robust to class imbalance such as the F1 score. This eliminates the need for oversampling and undersampling techniques that plague traditional approaches. Finally, the model allows to explicitly balance between the prediction accuracy and the explainability. An evaluation on the PaySim data set demonstrates competitive predictive performance with state-of-the-art models, while surpassing them in terms of explainability. This establishes DSC as a promising model for fraud detection systems.
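
The search treats each candidate as a closed-form analytic expression and scores it directly with an imbalance-robust metric such as F1. The snippet below shows only that evaluation step (the reward a candidate expression would receive); the RL-guided expression generator is omitted, and the example expression and data are made up.

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_reward(expr_fn, X, y, threshold=0.0):
    """Score a candidate closed-form expression as a binary classifier.
    F1 is robust to class imbalance, so no over/undersampling is needed."""
    scores = expr_fn(X)                        # analytic expression output
    preds = (scores > threshold).astype(int)
    return f1_score(y, preds)

# A made-up candidate expression over two transaction features.
candidate = lambda X: np.sin(X[:, 0]) + 0.5 * X[:, 1]

X = np.random.randn(1000, 2)
y = (np.random.rand(1000) < 0.02).astype(int)   # ~2% fraud: heavy imbalance
print(f1_reward(candidate, X, y))
```

Because the candidate is a concise mathematical expression, every classification decision can be traced back to it, which is the source of the method's explainability.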

  • paper_url: http://arxiv.org/abs/2312.00584
  • repo_url: None
  • paper_authors: Josef Valvoda, Alec Thompson, Ryan Cotterell, Simone Teufel
  • for: This position paper concerns the automation of the judge's role in legal NLP.
  • methods: It examines how large public legal datasets, largely composed of judgments, lead machine learning systems to become models of judges.
  • results: The authors argue that automating the role of the judge raises difficult ethical challenges, particularly for common law systems, because judges actively shape the law rather than merely applying it.
    Abstract The introduction of large public legal datasets has brought about a renaissance in legal NLP. Many of these datasets are comprised of legal judgements - the product of judges deciding cases. This fact, together with the way machine learning works, means that several legal NLP models are models of judges. While some have argued for the automation of judges, in this position piece, we argue that automating the role of the judge raises difficult ethical challenges, in particular for common law legal systems. Our argument follows from the social role of the judge in actively shaping the law, rather than merely applying it. Since current NLP models come nowhere close to having the facilities necessary for this task, they should not be used to automate judges. Furthermore, even in the case the models could achieve human-level capabilities, there would still be remaining ethical concerns inherent in the automation of the legal process.

  • paper_url: http://arxiv.org/abs/2312.00554
  • repo_url: None
  • paper_authors: Aniket Deroy, Subhankar Maity
  • for: This study scrutinizes the potential biases embedded in case judgment summaries produced with legal datasets and large language models (LLMs), and their impact on legal decision making.
  • methods: It generates case judgment summaries using legal datasets and large language models, then analyzes the summaries for biases.
  • results: The study finds evidence of biases with respect to gender-related keywords, race-related keywords, keywords related to crimes against women, country names, and religious keywords.
    Abstract The evolution of legal datasets and the advent of large language models (LLMs) have significantly transformed the legal field, particularly in the generation of case judgment summaries. However, a critical concern arises regarding the potential biases embedded within these summaries. This study scrutinizes the biases present in case judgment summaries produced by legal datasets and large language models. The research aims to analyze the impact of biases on legal decision making. By interrogating the accuracy, fairness, and implications of biases in these summaries, this study contributes to a better understanding of the role of technology in legal contexts and the implications for justice systems worldwide. In this study, we investigate biases with respect to gender-related keywords, race-related keywords, keywords related to crime against women, country names and religious keywords. The study shows interesting evidence of biases in the outputs generated by the large language models and pre-trained abstractive summarization models. The reasoning behind these biases needs further study.
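
A first-pass version of this kind of audit is a simple keyword-frequency comparison across generated summaries. The sketch below illustrates the idea with a tiny made-up keyword list and made-up summaries; the paper's actual keyword sets and corpora are far larger.

```python
import re
from collections import Counter

def keyword_counts(summaries, keywords):
    """Count occurrences of audit keywords across a set of summaries."""
    counts = Counter()
    for text in summaries:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(tok for tok in tokens if tok in keywords)
    return counts

gender_keywords = {"he", "she", "husband", "wife"}       # illustrative only
summaries = [
    "The court held that she, the wife of the accused, ...",
    "He was sentenced after the husband gave testimony ...",
]
print(keyword_counts(summaries, gender_keywords))
```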

Target-agnostic Source-free Domain Adaptation for Regression Tasks

  • paper_url: http://arxiv.org/abs/2312.00540
  • repo_url: https://github.com/Siriusize/TASFAR_DA
  • paper_authors: Tianlang He, Zhiqiu Xia, Jierun Chen, Haoliang Li, S. -H. Gary Chan
  • for: This paper tackles the domain-gap problem in unsupervised domain adaptation (UDA) without requiring labeled source data at the target, preserving data privacy and storage.
  • methods: It proposes TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution and uses it to calibrate the source model on the target domain.
  • results: Across four regression tasks, TASFAR reduces errors by 22% on average compared with state-of-the-art source-free UDA approaches, and achieves accuracy comparable to source-based UDA without using source data.
    Abstract Unsupervised domain adaptation (UDA) seeks to bridge the domain gap between the target and source using unlabeled target data. Source-free UDA removes the requirement for labeled source data at the target to preserve data privacy and storage. However, work on source-free UDA assumes knowledge of the domain gap distribution, and hence is limited to either target-aware settings or classification tasks. To overcome this, we propose TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution, which is then used to calibrate the source model on the target domain. We have conducted extensive experiments on four regression tasks with various domain gaps, namely, pedestrian dead reckoning for different users, image-based people counting in different scenes, housing-price prediction at different districts, and taxi-trip duration prediction from different departure points. TASFAR is shown to substantially outperform the state-of-the-art source-free UDA approaches, reducing errors by 22% on average across the four tasks, and achieves accuracy notably comparable to source-based UDA without using source data.
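
The central object in TASFAR is a label density map estimated from prediction confidence on unlabeled target data, which then calibrates the source model. A minimal reading of that first step is sketched below as a confidence-weighted histogram over the label range; the bin count, label range, and the calibration step itself are assumptions.

```python
import numpy as np

def label_density_map(preds, confidences, bins=20, label_range=(0.0, 1.0)):
    """Estimate the target label distribution as a confidence-weighted
    histogram of the source model's predictions on unlabeled target data."""
    hist, edges = np.histogram(preds, bins=bins, range=label_range,
                               weights=confidences)
    density = hist / hist.sum()
    return density, edges

preds = np.random.rand(500)               # source-model predictions on target
conf = np.random.rand(500)                # per-prediction confidence
density, edges = label_density_map(preds, conf)
# `density` can then serve as a pseudo-label prior when calibrating the model.
```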

SurreyAI 2023 Submission for the Quality Estimation Shared Task

  • paper_url: http://arxiv.org/abs/2312.00525
  • repo_url: None
  • paper_authors: Archchana Sindhujan, Diptesh Kanojia, Constantin Orasan, Tharindu Ranasinghe
  • for: The paper is written for assessing the quality of translations in situations where there is no reference available.
  • methods: The paper uses the TransQuest framework and explores various autoencoder pre-trained language models within the MonoTransQuest architecture using single and ensemble settings.
  • results: The proposed approach, using the MonoTQ-InfoXLM-large model, significantly improves over the baseline for the majority of the 5 language pairs (English-Gujarati, English-Hindi, English-Marathi, English-Tamil, and English-Telugu) in the Sentence-Level Direct Assessment shared task of WMT23.
    Abstract Quality Estimation (QE) systems are important in situations where it is necessary to assess the quality of translations, but there is no reference available. This paper describes the approach adopted by the SurreyAI team for addressing the Sentence-Level Direct Assessment shared task in WMT23. The proposed approach builds upon the TransQuest framework, exploring various autoencoder pre-trained language models within the MonoTransQuest architecture using single and ensemble settings. The autoencoder pre-trained language models employed in the proposed systems are XLMV, InfoXLM-large, and XLMR-large. The evaluation utilizes Spearman and Pearson correlation coefficients, assessing the relationship between machine-predicted quality scores and human judgments for 5 language pairs (English-Gujarati, English-Hindi, English-Marathi, English-Tamil and English-Telugu). The MonoTQ-InfoXLM-large approach emerges as a robust strategy, surpassing all other individual models proposed in this study by significantly improving over the baseline for the majority of the language pairs.
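
The shared task scores systems by the correlation between predicted quality scores and human judgments; that evaluation step is reproducible in a few lines with scipy, as sketched below with made-up scores.

```python
from scipy.stats import pearsonr, spearmanr

predicted = [0.71, 0.42, 0.88, 0.35, 0.60]   # made-up QE system outputs
human =     [0.75, 0.40, 0.80, 0.30, 0.65]   # made-up direct-assessment scores

print("Pearson: ", pearsonr(predicted, human)[0])
print("Spearman:", spearmanr(predicted, human)[0])
```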

Generative artificial intelligence enhances individual creativity but reduces the collective diversity of novel content

  • paper_url: http://arxiv.org/abs/2312.00506
  • repo_url: None
  • paper_authors: Anil R. Doshi, Oliver P. Hauser
  • for: This study examines the causal impact of ideas from generative artificial intelligence (GenAI) on creative output.
  • methods: In an online experimental study, some writers could obtain ideas for a story from a GenAI platform.
  • results: Access to GenAI ideas causes stories to be evaluated as more creative, better written and more enjoyable, especially for less creative writers. However, within each condition, GenAI-enabled stories are more similar to one another than stories written by humans alone. These results point to a gain in individual creativity paired with a risk of losing collective novelty: individually, writers are better off using GenAI, but collectively a narrower scope of novel content may be produced.
    Abstract Creativity is core to being human. Generative artificial intelligence (GenAI) holds promise for humans to be more creative by offering new ideas, or less creative by anchoring on GenAI ideas. We study the causal impact of GenAI ideas on the production of an unstructured creative output in an online experimental study where some writers could obtain ideas for a story from a GenAI platform. We find that access to GenAI ideas causes stories to be evaluated as more creative, better written and more enjoyable, especially among less creative writers. However, objective measures of story similarity within each condition reveal that GenAI-enabled stories are more similar to each other than stories by humans alone. These results point to an increase in individual creativity, but at the same time there is a risk of losing collective novelty: this dynamic resembles a social dilemma where individual writers are better off using GenAI to improve their own writing, but collectively a narrower scope of novel content may be produced with GenAI. Our results have implications for researchers, policy-makers and practitioners interested in bolstering creativity, but point to potential downstream consequences from over-reliance.

  • paper_url: http://arxiv.org/abs/2312.00480
  • repo_url: None
  • paper_authors: Hiroaki Yamada, Takenobu Tokunaga, Ryutaro Ohara, Akira Tokutsu, Keisuke Takeshita, Mihoko Sumida
  • for: This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), featuring two tasks: tort prediction and rationale extraction. The rationale extraction task, novel in the field, identifies the court's accepted arguments from those alleged by plaintiffs and defendants.
  • methods: The dataset is built from 3,477 Japanese Civil Code judgments annotated by 41 legal experts, yielding 7,978 instances with 59,697 alleged arguments from the involved parties; baseline experiments are run on both tasks.
  • results: The baselines demonstrate the feasibility of the two proposed tasks, and an error analysis by legal experts identifies sources of errors and suggests future directions for LJP research.
    Abstract This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), which features two tasks: tort prediction and its rationale extraction. The rationale extraction task identifies the court's accepting arguments from alleged arguments by plaintiffs and defendants, which is a novel task in the field. JTD is constructed based on annotated 3,477 Japanese Civil Code judgments by 41 legal experts, resulting in 7,978 instances with 59,697 of their alleged arguments from the involved parties. Our baseline experiments show the feasibility of the proposed two tasks, and our error analysis by legal experts identifies sources of errors and suggests future directions of the LJP research.

A Bayesian approach for prompt optimization in pre-trained language models

  • paper_url: http://arxiv.org/abs/2312.00471
  • repo_url: None
  • paper_authors: Antonio Sabbatella, Andrea Ponti, Antonio Candelieri, Ilaria Giordani, Francesco Archetti
  • for: This paper aims to improve hard prompt tuning (HPT) so that good performance can still be achieved on text classification tasks when the large language model (LLM) is unavailable for gradient access or exposed only as a black box.
  • methods: It uses Bayesian optimization over a continuous embedding of the combinatorial token space to optimize discrete prompts, addressing the high dimensionality of the token space compounded by the length of the prompt sequence.
  • results: Experiments with RoBERTa on six benchmarks show that the method searches for discrete prompts efficiently and performs well across tasks, enabling an analysis of the trade-off between search-space size, accuracy and wall-clock time.
    Abstract A prompt is a sequence of symbols or tokens, selected from a vocabulary according to some rule, which is prepended/concatenated to a textual query. A key problem is how to select the sequence of tokens: in this paper we formulate it as a combinatorial optimization problem. The high dimensionality of the token space compounded by the length of the prompt sequence requires a very efficient solution. In this paper we propose a Bayesian optimization method, executed in a continuous embedding of the combinatorial space. In this paper we focus on hard prompt tuning (HPT) which directly searches for discrete tokens to be added to the text input without requiring access to the large language model (LLM) and can be used also when LLM is available only as a black-box. This is critically important if LLMs are made available in the Model as a Service (MaaS) manner as in GPT-4. The current manuscript is focused on the optimization of discrete prompts for classification tasks. The discrete prompts give rise to a difficult combinatorial optimization problem which easily becomes intractable given the dimension of the token space in realistic applications. The optimization method considered in this paper is Bayesian optimization (BO) which has become the dominant approach in black-box optimization for its sample efficiency along with its modular structure and versatility. In this paper we use BoTorch, a library for Bayesian optimization research built on top of pyTorch. Albeit preliminary and obtained using a 'vanilla' version of BO, the experiments on RoBERTa on six benchmarks show a good performance across a variety of tasks and enable an analysis of the tradeoff between size of the search space, accuracy and wall clock time.
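
The optimization pattern is standard Bayesian optimization run in a continuous embedding of the discrete prompt space: fit a surrogate to observed (embedding, score) pairs, maximize an acquisition such as expected improvement, and project the chosen point back to tokens. The sketch below substitutes scikit-learn's Gaussian process for BoTorch and uses a random candidate pool; the black-box scorer and the token projection are hypothetical stand-ins.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best):
    z = (mu - best) / np.maximum(sigma, 1e-9)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def black_box_score(embedding):
    """Stand-in for: project the embedding to discrete tokens, prepend the
    prompt to queries, and measure downstream classification accuracy."""
    return -np.sum((embedding - 0.3) ** 2)

dim, rng = 8, np.random.default_rng(0)
X = rng.uniform(size=(5, dim))                     # initial random prompts
y = np.array([black_box_score(x) for x in X])
for _ in range(20):                                # BO iterations
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cand = rng.uniform(size=(256, dim))            # random candidate pool
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, x_next])
    y = np.append(y, black_box_score(x_next))
print("best score:", y.max())
```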

Meta-Diversity Search in Complex Systems, A Recipe for Artificial Open-Endedness ?

  • paper_url: http://arxiv.org/abs/2312.00455
  • repo_url: None
  • paper_authors: Mayalen Etcheverry, Bert Wang-Chak Chan, Clément Moulin-Frier, Pierre-Yves Oudeyer
  • for: The paper aims to develop an artificial system that can generate endless surprises in Minecraft by leveraging complex systems and meta-diversity search.
  • methods: The proposed framework includes a complex system for recursively growing and complexifying artifacts over time, as well as a discovery algorithm that utilizes meta-diversity search to automate the long-term discovery of novel and increasingly complex artifacts.
  • results: The authors simulate an artificial "chemistry" system based on the Lenia continuous cellular automaton for generating artifacts, and an artificial "discovery assistant" (called Holmes) for the artifact-discovery process. Holmes incrementally learns a hierarchy of modular representations to characterize divergent sources of diversity and uses goal-based intrinsically-motivated exploration as the diversity search strategy.
    Abstract Can we build an artificial system that would be able to generate endless surprises if ran "forever" in Minecraft? While there is not a single path toward solving that grand challenge, this article presents what we believe to be some working ingredients for the endless generation of novel increasingly complex artifacts in Minecraft. Our framework for an open-ended system includes two components: a complex system used to recursively grow and complexify artifacts over time, and a discovery algorithm that leverages the concept of meta-diversity search. Since complex systems have shown to enable the emergence of considerable complexity from set of simple rules, we believe them to be great candidates to generate all sort of artifacts in Minecraft. Yet, the space of possible artifacts that can be generated by these systems is often unknown, challenging to characterize and explore. Therefore automating the long-term discovery of novel and increasingly complex artifacts in these systems is an exciting research field. To approach these challenges, we formulate the problem of meta-diversity search where an artificial "discovery assistant" incrementally learns a diverse set of representations to characterize behaviors and searches to discover diverse patterns within each of them. A successful discovery assistant should continuously seek for novel sources of diversities while being able to quickly specialize the search toward a new unknown type of diversity. To implement those ideas in the Minecraft environment, we simulate an artificial "chemistry" system based on Lenia continuous cellular automaton for generating artifacts, as well as an artificial "discovery assistant" (called Holmes) for the artifact-discovery process. Holmes incrementally learns a hierarchy of modular representations to characterize divergent sources of diversity and uses a goal-based intrinsically-motivated exploration as the diversity search strategy.

PEFTDebias : Capturing debiasing information using PEFTs

  • paper_url: http://arxiv.org/abs/2312.00434
  • repo_url: None
  • paper_authors: Sumit Agarwal, Aditya Srikanth Veerubhotla, Srijan Bansal
  • for: This work addresses the implicit biases present in foundation models by mitigating them with parameter-efficient fine-tuning (PEFT).
  • methods: The approach has two main phases: an upstream phase for acquiring debiasing parameters along a specific bias axis, and a downstream phase where these parameters are incorporated into the model and frozen during fine-tuning.
  • results: Evaluations on four datasets show that PEFTs effectively reduce downstream biases, and that the learned parameters carry axis-specific debiasing characteristics that transfer well to reducing bias in various downstream tasks.
    Abstract The increasing use of foundation models highlights the urgent need to address and eliminate implicit biases present in them that arise during pretraining. In this paper, we introduce PEFTDebias, a novel approach that employs parameter-efficient fine-tuning (PEFT) to mitigate the biases within foundation models. PEFTDebias consists of two main phases: an upstream phase for acquiring debiasing parameters along a specific bias axis, and a downstream phase where these parameters are incorporated into the model and frozen during the fine-tuning process. By evaluating on four datasets across two bias axes namely gender and race, we find that downstream biases can be effectively reduced with PEFTs. In addition, we show that these parameters possess axis-specific debiasing characteristics, enabling their effective transferability in mitigating biases in various downstream tasks. To ensure reproducibility, we release the code to do our experiments.
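To make the two-phase recipe concrete, here is a minimal PyTorch-style sketch of the idea, not the authors' released code: the model, adapter shape, and placeholder data are all hypothetical, and the upstream loss merely stands in for whatever debiasing objective is used along the chosen bias axis.

```python
# Minimal sketch of two-phase PEFT debiasing (hypothetical model, data, and loss).
import torch
import torch.nn as nn

class AdapterModel(nn.Module):
    def __init__(self, d=64, bottleneck=8, n_classes=2):
        super().__init__()
        self.backbone = nn.Linear(d, d)        # stands in for the frozen foundation model
        self.adapter = nn.Sequential(          # parameter-efficient debiasing module
            nn.Linear(d, bottleneck), nn.ReLU(), nn.Linear(bottleneck, d))
        self.head = nn.Linear(d, n_classes)

    def forward(self, x):
        h = self.backbone(x)
        return self.head(h + self.adapter(h))  # residual adapter

model = AdapterModel()
for p in model.backbone.parameters():          # foundation weights stay frozen throughout
    p.requires_grad = False

# Phase 1 (upstream): learn debiasing parameters along one bias axis.
opt = torch.optim.Adam(model.adapter.parameters(), lr=1e-3)
for _ in range(100):
    x, y_bias = torch.randn(32, 64), torch.randint(0, 2, (32,))  # placeholder axis data
    loss = nn.functional.cross_entropy(model(x), y_bias)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2 (downstream): freeze the adapter, fine-tune only the task head.
for p in model.adapter.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
for _ in range(100):
    x, y_task = torch.randn(32, 64), torch.randint(0, 2, (32,))  # placeholder task data
    loss = nn.functional.cross_entropy(model(x), y_task)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point is the parameter flow: only the adapter trains upstream, and only the task head trains downstream while the debiasing adapter stays frozen, which is what makes the debiasing parameters transferable.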

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

  • paper_url: http://arxiv.org/abs/2312.00413
  • repo_url: https://github.com/wssun/ast4plu
  • paper_authors: Weisong Sun, Chunrong Fang, Yun Miao, Yudu You, Mengzhe Yuan, Yuchen Chen, Quanjun Zhang, An Guo, Xiang Chen, Yang Liu, Zhenyu Chen
  • for: This study investigates how well AST-based code representations facilitate code representation learning and subsequent code-related tasks, and how different AST parsing/preprocessing/encoding choices affect them.
  • methods: An empirical study comparing models trained with token-based code representations against models trained with AST-based representations on three popular types of code-related tasks, together with comprehensive experiments on different AST parsing/preprocessing/encoding methods.
  • results: Models trained with token-based representations consistently outperform AST-based ones across all three tasks, although AST-based models do better on certain subsets of samples. The study also provides detailed guidance on how to select AST parsing/preprocessing/encoding methods at each stage.
    Abstract Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can be used for facilitating subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of the AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.
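To illustrate the two input representations the study compares, the following standard-library sketch derives both views of the same snippet; the pre-order serialization is a generic choice of ours, not one of the specific encoding schemes evaluated in the paper.

```python
# Token-based vs. AST-based views of the same source (Python standard library only).
import ast
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"

# Token sequence: the linear view a "Token"-trained model consumes.
tokens = [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
          if t.string.strip()]
print(tokens)
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# AST sequence: a pre-order traversal exposing syntactic structure.
def preorder(node):
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

print(list(preorder(ast.parse(src))))
# ['Module', 'FunctionDef', 'arguments', 'arg', 'arg', 'Return',
#  'BinOp', 'Name', 'Load', 'Add', 'Name', 'Load']
```

The contrast makes the paper's question tangible: the AST view encodes nesting and node types explicitly, but it is also longer and drops identifier surface forms, which is one plausible reason token-based models can win overall.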

Enhancing Explainability in Mobility Data Science through a combination of methods

  • paper_url: http://arxiv.org/abs/2312.00380
  • repo_url: None
  • paper_authors: Georgios Makridis, Vasileios Koukos, Georgios Fatouros, Dimosthenis Kyriazis
  • for: This work aims to improve the explainability of models trained on trajectory data and to give diverse user groups a deeper understanding of model decisions.
  • methods: A unified framework combining several XAI techniques, namely LIME, SHAP, saliency maps, attention mechanisms, direct trajectory visualization, and permutation feature importance (PFI), to explain model decisions on trajectory data.
  • results: The framework yields deeper, more granular insight into model decisions. A user survey shows that technically oriented professionals prefer the combined methods, while end-users favor simpler views such as bar plots of timestep importance or visual highlights of pivotal trajectory segments.
    Abstract In the domain of Mobility Data Science, the intricate task of interpreting models trained on trajectory data, and elucidating the spatio-temporal movement of entities, has persistently posed significant challenges. Conventional XAI techniques, although brimming with potential, frequently overlook the distinct structure and nuances inherent within trajectory data. Observing this deficiency, we introduced a comprehensive framework that harmonizes pivotal XAI techniques: LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), Saliency maps, attention mechanisms, direct trajectory visualization, and Permutation Feature Importance (PFI). Unlike conventional strategies that deploy these methods singularly, our unified approach capitalizes on the collective efficacy of these techniques, yielding deeper and more granular insights for models reliant on trajectory data. In crafting this synthesis, we effectively address the multifaceted essence of trajectories, achieving not only amplified interpretability but also a nuanced, contextually rich comprehension of model decisions. To validate and enhance our framework, we undertook a survey to gauge preferences and reception among various user demographics. Our findings underscored a dichotomy: professionals with academic orientations, particularly those in roles like Data Scientist, IT Expert, and ML Engineer, showcased a profound, technical understanding and often exhibited a predilection for amalgamated methods for interpretability. Conversely, end-users or individuals less acquainted with AI and Data Science showcased simpler inclinations, such as bar plots indicating timestep significance or visual depictions pinpointing pivotal segments of a vessel's trajectory.
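One component of the framework, permutation feature importance, is easy to reproduce in isolation. The sketch below uses synthetic trajectory features; the feature names (mean speed, turning angle, stop count) are our stand-ins, not the paper's features.

```python
# Permutation feature importance over synthetic trajectory features (illustrative).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hand-crafted per-trajectory features a mobility model might use (hypothetical).
X = np.column_stack([
    rng.normal(10, 3, n),     # mean speed
    rng.normal(0.5, 0.2, n),  # mean turning angle
    rng.normal(2, 1, n),      # stop count
])
y = (X[:, 0] + rng.normal(0, 1, n) > 10).astype(int)  # label driven mostly by speed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in held-out accuracy.
result = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
for name, imp in zip(["mean_speed", "mean_turn", "stop_count"],
                     result.importances_mean):
    print(f"{name}: {imp:.3f}")  # speed should dominate by construction
```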

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.00845
  • repo_url: https://github.com/HyeonHo99/Video-Motion-Customization
  • paper_authors: Hyeonho Jeong, Geon Yeong Park, Jong Chul Ye
  • for: The paper aims to customize text-to-video generation so that videos can be produced with tailored, user-specified motions.
  • methods: The method adapts the temporal attention layers of a video diffusion model through a one-shot tuning approach, using a motion distillation objective built on residual vectors between consecutive frames.
  • results: The method faithfully reproduces the motion of a target video while generating diverse visual variations.
    Abstract Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. In specific, they encounter hurdles in (a) accurately reproducing motion from a target video, and (b) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this, here we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our codes, data and the project demo can be found at https://video-motion-customization.github.io
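The motion distillation objective can be pictured as matching frame-to-frame residual vectors. The sketch below is our reading of that idea from the abstract, not the released implementation; the latent shapes and the cosine form of the loss are assumptions.

```python
# Simplified sketch of a residual-vector motion objective (interpretation only).
import torch
import torch.nn.functional as F

def motion_distillation_loss(pred_frames: torch.Tensor,
                             target_frames: torch.Tensor) -> torch.Tensor:
    """pred_frames / target_frames: (batch, time, dim) latent frames."""
    pred_res = pred_frames[:, 1:] - pred_frames[:, :-1]      # residuals between frames
    target_res = target_frames[:, 1:] - target_frames[:, :-1]
    # Align motion direction per time step: 1 - cosine similarity, averaged.
    cos = F.cosine_similarity(pred_res, target_res, dim=-1)
    return (1.0 - cos).mean()

pred = torch.randn(2, 8, 64, requires_grad=True)
target = torch.randn(2, 8, 64)
loss = motion_distillation_loss(pred, target)
loss.backward()
print(float(loss))
```

A residual-based loss of this kind only constrains how frames change, not what they look like, which matches the paper's goal of keeping motion while freeing appearance.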

SynFundus: Generating a synthetic fundus images dataset with millions of samples and multi-disease annotations

  • paper_url: http://arxiv.org/abs/2312.00377
  • repo_url: None
  • paper_authors: Fangxin Shang, Jie Fu, Yehui Yang, Lei Ma
  • for: Addressing the scarcity of large-scale medical imaging datasets due to privacy restrictions, the paper introduces SynFundus-1M, a high-quality synthetic dataset with over 1 million retinal fundus images and extensive disease and pathology annotations.
  • methods: The dataset is generated by a Denoising Diffusion Probabilistic Model, and the paper compares the SynFundus-Generator and SynFundus-1M with existing methods on mainstream public datasets, achieving superior Frechet Inception Distance (FID) scores.
  • results: The ophthalmologists’ evaluation confirms the authenticity of the synthetic images, and the paper demonstrates that both CNN and ViT can benefit from SynFundus-1M by pretraining or training directly, achieving better performance and faster convergence on various downstream tasks compared to datasets like ImageNet or EyePACS.
    Abstract In the field of medical imaging, the scarcity of large-scale datasets due to privacy restrictions stands as a significant barrier to develop large models for medical. To address this issue, we introduce SynFundus-1M, a high-quality synthetic dataset with over 1 million retinal fundus images and extensive disease and pathologies annotations, which is generated by a Denoising Diffusion Probabilistic Model. The SynFundus-Generator and SynFundus-1M achieve superior Frechet Inception Distance (FID) scores compared to existing methods on main-stream public real datasets. Furthermore, the ophthalmologists evaluation validate the difficulty in discerning these synthetic images from real ones, confirming the SynFundus-1M's authenticity. Through extensive experiments, we demonstrate that both CNN and ViT can benifit from SynFundus-1M by pretraining or training directly. Compared to datasets like ImageNet or EyePACS, models train on SynFundus-1M not only achieve better performance but also faster convergence on various downstream tasks.

Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion

  • paper_url: http://arxiv.org/abs/2312.00844
  • repo_url: None
  • paper_authors: Huadong Li, Minhao Jing, Jiajun Liang, Haoqiang Fan, Renhe Ji
  • for: The paper revisits the gap between dense and sparse supervision in radar-camera depth completion and identifies the Projection Transformation Collapse (PTC) problem caused by sparse supervision, proposing a "Disruption-Compensation" framework to resolve it.
  • methods: The framework has a disruption part and a compensation part. The disruption part deliberately discards position correspondences among image/radar/LiDAR spaces, while the compensation part leverages 3D spatial and 2D semantic information to compensate for the discarded beneficial correspondences.
  • results: Experiments show the framework (sparse supervision) outperforms the state of the art (dense supervision) with an 11.6% improvement in mean absolute error and a 1.6x speedup. The code is available at …
    Abstract It is widely believed that the dense supervision is better than the sparse supervision in the field of depth completion, but the underlying reasons for this are rarely discussed. In this paper, we find that the challenge of using sparse supervision for training Radar-Camera depth prediction models is the Projection Transformation Collapse (PTC). The PTC implies that sparse supervision leads the model to learn unexpected collapsed projection transformations between Image/Radar/LiDAR spaces. Building on this insight, we propose a novel ``Disruption-Compensation" framework to handle the PTC, thereby relighting the use of sparse supervision in depth completion tasks. The disruption part deliberately discards position correspondences among Image/Radar/LiDAR, while the compensation part leverages 3D spatial and 2D semantic information to compensate for the discarded beneficial position correspondence. Extensive experimental results demonstrate that our framework (sparse supervision) outperforms the state-of-the-art (dense supervision) with 11.6$\%$ improvement in mean absolute error and $1.6 \times$ speedup. The code is available at ...

On Exploring the Reasoning Capability of Large Language Models with Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2312.00353
  • repo_url: None
  • paper_authors: Pei-Chi Lo, Yi-Hang Tsai, Ee-Peng Lim, San-Yih Hwang
  • for: This study examines whether LLMs can reason over the knowledge graphs they internalized during pre-training.
  • methods: LLMs are asked to perform four distinct knowledge graph reasoning tasks, probing both their accuracy in recalling information from pre-training knowledge graphs and their ability to infer knowledge graph relations from context.
  • results: Experiments show that LLMs can successfully tackle both simple and complex knowledge graph reasoning tasks from their own memory, and can infer knowledge graph relations from the input context.
    Abstract This paper examines the capacity of LLMs to reason with knowledge graphs using their internal knowledge graph, i.e., the knowledge graph they learned during pre-training. Two research questions are formulated to investigate the accuracy of LLMs in recalling information from pre-training knowledge graphs and their ability to infer knowledge graph relations from context. To address these questions, we employ LLMs to perform four distinct knowledge graph reasoning tasks. Furthermore, we identify two types of hallucinations that may occur during knowledge reasoning with LLMs: content and ontology hallucination. Our experimental results demonstrate that LLMs can successfully tackle both simple and complex knowledge graph reasoning tasks from their own memory, as well as infer from input context.

The Case for Scalable, Data-Driven Theory: A Paradigm for Scientific Progress in NLP

  • paper_url: http://arxiv.org/abs/2312.00349
  • repo_url: None
  • paper_authors: Julian Michael
  • for: The paper argues for driving scientific progress in NLP by developing scalable, data-driven theories of linguistic structure.
  • methods: Data is collected in tightly scoped, carefully defined ways that allow exhaustive annotation of the behavioral phenomena of interest, and machine learning is then used to construct explanatory theories. This is illustrated with Question-Answer driven Semantic Role Labeling (QA-SRL), a schema for annotating verbal predicate-argument relations using highly constrained question-answer pairs.
  • results: The QA-SRL investigations yield precise annotation of verb predicate-argument relations and motivate principles for future data collection and theoretical modeling.
    Abstract I propose a paradigm for scientific progress in NLP centered around developing scalable, data-driven theories of linguistic structure. The idea is to collect data in tightly scoped, carefully defined ways which allow for exhaustive annotation of behavioral phenomena of interest, and then use machine learning to construct explanatory theories of these phenomena which can form building blocks for intelligible AI systems. After laying some conceptual groundwork, I describe several investigations into data-driven theories of shallow semantic structure using Question-Answer driven Semantic Role Labeling (QA-SRL), a schema for annotating verbal predicate-argument relations using highly constrained question-answer pairs. While this only scratches the surface of the complex language behaviors of interest in AI, I outline principles for data collection and theoretical modeling which can inform future scientific progress. This note summarizes and draws heavily on my PhD thesis.
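The QA-SRL schema itself is simple to picture: each verbal predicate carries a set of template-constrained question-answer pairs whose answers are spans of the sentence. A minimal sketch of such an annotation follows; the field names are illustrative, not the official QA-SRL release format.

```python
# Minimal QA-SRL-style annotation structure (field names are illustrative).
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str   # drawn from a constrained template, e.g. "Who Xed something?"
    answers: list   # answer spans copied from the sentence

@dataclass
class PredicateAnnotation:
    sentence: str
    predicate: str
    qa_pairs: list = field(default_factory=list)

ann = PredicateAnnotation(
    sentence="The committee approved the proposal yesterday.",
    predicate="approved",
    qa_pairs=[
        QAPair("Who approved something?", ["The committee"]),
        QAPair("What did someone approve?", ["the proposal"]),
        QAPair("When did someone approve something?", ["yesterday"]),
    ],
)
print(ann.qa_pairs[0].question, "->", ann.qa_pairs[0].answers)
```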

Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value at Risk

  • paper_url: http://arxiv.org/abs/2312.00342
  • repo_url: https://github.com/rllab-snu/Off-Policy-TRC
  • paper_authors: Dohyeong Kim, Songhwai Oh
  • for: Solving a safe reinforcement learning (RL) problem with risk measure-based (CVaR) constraints.
  • methods: Building on TRC, an on-policy trust-region method for CVaR-constrained RL, the paper proposes an off-policy variant with novel surrogate functions that reduce the estimation error caused by distributional shift, plus an adaptive trust-region constraint that keeps the policy close to the replay buffer.
  • results: Evaluations in simulated and real-world environments show the method satisfies safety constraints within a few steps while achieving high returns, even on complex robotic tasks.
    Abstract This paper aims to solve a safe reinforcement learning (RL) problem with risk measure-based constraints. As risk measures, such as conditional value at risk (CVaR), focus on the tail distribution of cost signals, constraining risk measures can effectively prevent a failure in the worst case. An on-policy safe RL method, called TRC, deals with a CVaR-constrained RL problem using a trust region method and can generate policies with almost zero constraint violations with high returns. However, to achieve outstanding performance in complex environments and satisfy safety constraints quickly, RL methods are required to be sample efficient. To this end, we propose an off-policy safe RL method with CVaR constraints, called off-policy TRC. If off-policy data from replay buffers is directly used to train TRC, the estimation error caused by the distributional shift results in performance degradation. To resolve this issue, we propose novel surrogate functions, in which the effect of the distributional shift can be reduced, and introduce an adaptive trust-region constraint to ensure a policy not to deviate far from replay buffers. The proposed method has been evaluated in simulation and real-world environments and satisfied safety constraints within a few steps while achieving high returns even in complex robotic tasks.
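Since the safety constraint is stated in terms of conditional value at risk, it is worth recalling what CVaR measures: the mean of the worst alpha-fraction of costs, which is why constraining it guards against worst-case failures. A small numpy sketch of the generic definition (not the paper's on-policy estimator):

```python
# CVaR of a cost distribution: mean of the worst alpha-tail (generic definition).
import numpy as np

def cvar(costs: np.ndarray, alpha: float = 0.1) -> float:
    """Average of the highest alpha-fraction of cost samples."""
    var = np.quantile(costs, 1.0 - alpha)   # value at risk: the tail threshold
    tail = costs[costs >= var]
    return float(tail.mean())

rng = np.random.default_rng(0)
costs = rng.normal(loc=1.0, scale=0.5, size=10_000)
print("mean cost:", costs.mean())
print("CVaR_0.1 :", cvar(costs, alpha=0.1))  # strictly larger than the mean
```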

Green Edge AI: A Contemporary Survey

  • paper_url: http://arxiv.org/abs/2312.00333
  • repo_url: None
  • paper_authors: Yuyi Mao, Xianghao Yu, Kaibin Huang, Ying-Jun Angela Zhang, Jun Zhang
  • for: This survey explores green edge AI techniques for improving the energy efficiency and sustainability of edge devices.
  • methods: The survey analyzes the principal energy consumption components of edge AI systems to identify the fundamental design principles of green edge AI, and then reviews energy-efficient design methodologies guided by those principles.
  • results: It summarizes energy-efficient designs for the three critical tasks in edge AI systems, namely training data acquisition, edge training, and edge inference, and outlines future research directions for further enhancing the energy efficiency of edge AI.
    Abstract Artificial intelligence (AI) technologies have emerged as pivotal enablers across a multitude of industries, including consumer electronics, healthcare, and manufacturing, largely due to their resurgence over the past decade. The transformative power of AI is primarily derived from the utilization of deep neural networks (DNNs), which require extensive data for training and substantial computational resources for processing. Consequently, DNN models are typically trained and deployed on resource-rich cloud servers. However, due to potential latency issues associated with cloud communications, deep learning (DL) workflows are increasingly being transitioned to wireless edge networks near end-user devices (EUDs). This shift is designed to support latency-sensitive applications and has given rise to a new paradigm of edge AI, which will play a critical role in upcoming 6G networks to support ubiquitous AI applications. Despite its potential, edge AI faces substantial challenges, mostly due to the dichotomy between the resource limitations of wireless edge networks and the resource-intensive nature of DL. Specifically, the acquisition of large-scale data, as well as the training and inference processes of DNNs, can rapidly deplete the battery energy of EUDs. This necessitates an energy-conscious approach to edge AI to ensure both optimal and sustainable performance. In this paper, we present a contemporary survey on green edge AI. We commence by analyzing the principal energy consumption components of edge AI systems to identify the fundamental design principles of green edge AI. Guided by these principles, we then explore energy-efficient design methodologies for the three critical tasks in edge AI systems, including training data acquisition, edge training, and edge inference. Finally, we underscore potential future research directions to further enhance the energy efficiency of edge AI.

Exploring the Robustness of Decentralized Training for Large Language Models

  • paper_url: http://arxiv.org/abs/2312.00843
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Lin Lu, Chenxi Dai, Wangcheng Tao, Binhang Yuan, Yanan Sun, Pan Zhou
  • for: The paper initiates a discussion of the security issues in decentralized training of large language models and outlines what a robust solution requires.
  • methods: Robustness is examined from three main perspectives: the hardware, data, and model vulnerabilities inherent in decentralized training frameworks; the fundamental differences from vanilla federated learning, whose security techniques cannot be applied directly; and a concrete threat model describing the essential components of a robust and efficient decentralized training framework.
  • results: The analysis shows that decentralized training of large language models faces numerous security threats, including hardware vulnerabilities, data leakage, and model attacks, and that because it differs from ordinary federated learning, existing defenses do not transfer directly; these issues must be addressed to build a reliable and efficient decentralized training framework.
    Abstract Decentralized training of large language models has emerged as an effective way to democratize this technology. However, the potential threats associated with this approach have not been carefully discussed, which would hinder the development of decentralized training infrastructures. This paper aims to initiate discussion towards this end by exploring the robustness of decentralized training from three main perspectives. First, we demonstrate the vulnerabilities inherent in decentralized training frameworks in terms of hardware, data, and models. Second, we highlight the fundamental difference between decentralized foundation model training and vanilla federated learning, where the security techniques employed in federated learning cannot be applied directly. Third, we discuss the essential components required for a robust and efficient decentralized training framework and present a case study by modeling a concrete threat model. Our objective in this vision paper is to emphasize the importance of addressing security concerns in the context of decentralized training for large language models.

Matching Weak Informative Ontologies

  • paper_url: http://arxiv.org/abs/2312.00332
  • repo_url: https://github.com/npubird/lilywio
  • paper_authors: Peng Wang
  • for: This paper addresses the challenge of matching weakly informative ontologies (WIOs) using the ontology structure information to discover alignments.
  • methods: The proposed method employs a semantic subgraph-based similarity propagation model to match WIOs, with constraints to ensure a balance between efficiency and quality.
  • results: The proposed method significantly outperforms most state-of-the-art works in both WIO matching tasks and general ontology matching tasks, with a large increase in recall and high precision of matching results.
    Abstract Most existing ontology matching methods utilize the literal information to discover alignments. However, some literal information in ontologies may be opaque and some ontologies may not have sufficient literal information. In this paper, these ontologies are named as weak informative ontologies (WIOs) and it is challenging for existing methods to matching WIOs. On one hand, string-based and linguistic-based matching methods cannot work well for WIOs. On the other hand, some matching methods use external resources to improve their performance, but collecting and processing external resources is still time-consuming. To address this issue, this paper proposes a practical method for matching WIOs by employing the ontology structure information to discover alignments. First, the semantic subgraphs are extracted from the ontology graph to capture the precise meanings of ontology elements. Then, a new similarity propagation model is designed for matching WIOs. Meanwhile, in order to avoid meaningless propagation, the similarity propagation is constrained by semantic subgraphs and other conditions. Consequently, the similarity propagation model ensures a balance between efficiency and quality during matching. Finally, the similarity propagation model uses a few credible alignments as seeds to find more alignments, and some useful strategies are adopted to improve the performance. This matching method for WIOs has been implemented in the ontology matching system Lily. Experimental results on public OAEI benchmark datasets demonstrate that Lily significantly outperforms most of the state-of-the-art works in both WIO matching tasks and general ontology matching tasks. In particular, Lily increases the recall by a large margin, while it still obtains high precision of matching results.

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

  • paper_url: http://arxiv.org/abs/2312.00330
  • repo_url: https://github.com/GongyeLiu/StyleCrafter
  • paper_authors: Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao Wang, Yujiu Yang, Ying Shan
  • for: The goal is to improve the fidelity and controllability of text-to-video (T2V) models so they can generate high-quality videos that follow the text content while matching the style of a reference image.
  • methods: A style control adapter is added to a pre-trained T2V model, enabling video generation in any style given a reference image. The adapter is first trained on style-rich image datasets, and the learned stylization is then transferred to video generation through a tailor-made finetuning paradigm. A decoupling learning strategy extracts style information solely from the reference image, and a scale-adaptive fusion module balances text-based content features against image-based style features.
  • results: StyleCrafter generates high-quality stylized videos whose content aligns with the text and whose style closely resembles the reference image, and it is more flexible and efficient than existing competitors.
    Abstract Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image. Considering the scarcity of stylized video datasets, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy. Additionally, we design a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors.
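The scale-adaptive fusion component can be pictured as a small gate that predicts how strongly to inject style features into content features. The module below is only a guess at that mechanism from the abstract; the tensor shapes, gating network, and residual form are all assumptions.

```python
# Sketch of a scale-adaptive fusion of content and style features (hypothetical design).
import torch
import torch.nn as nn

class ScaleAdaptiveFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Predict a scalar gate per sample from the concatenated features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        s = self.gate(torch.cat([content, style], dim=-1))  # (batch, 1) in [0, 1]
        return content + s * style                          # style injected at an adaptive scale

fuse = ScaleAdaptiveFusion(dim=128)
content = torch.randn(4, 128)   # text-based content features
style = torch.randn(4, 128)     # image-based style features
print(fuse(content, style).shape)  # torch.Size([4, 128])
```

Whatever the exact design, the point of an adaptive scale is that the right amount of style varies per text-style combination, which fixed mixing weights cannot accommodate.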

Agent-OM: Leveraging Large Language Models for Ontology Matching

  • paper_url: http://arxiv.org/abs/2312.00326
  • repo_url: None
  • paper_authors: Zhangcheng Qiang, Weiqing Wang, Kerry Taylor
  • for: The study explores the potential of large language models (LLMs) for ontology matching (OM) and proposes an agent-powered, LLM-based design paradigm for OM systems.
  • methods: The Agent-OM framework consists of two Siamese agents for retrieval and matching, equipped with a set of simple prompt-based OM tools, and is implemented in a proof-of-concept system.
  • results: Evaluations on three Ontology Alignment Evaluation Initiative (OAEI) tracks show the system comes very close to the best long-standing performance on simple OM tasks and significantly improves performance on complex and few-shot OM tasks.
    Abstract Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM-based agents have become revolutionary in data engineering and have been applied creatively in various domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With thoughtful consideration of several specific challenges to leverage LLMs for OM, we propose a generic framework, namely Agent-OM, consisting of two Siamese agents for retrieval and matching, with a set of simple prompt-based OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve very close results to the best long-standing performance on simple OM tasks and significantly improve the performance on complex and few-shot OM tasks.

Conceptual Engineering Using Large Language Models

  • paper_url: http://arxiv.org/abs/2312.03749
  • repo_url: https://github.com/bradleypallen/zero-shot-classifiers-for-conceptual-engineering
  • paper_authors: Bradley P. Allen
  • for: The paper proposes a method, based on Jennifer Nado's definition of classification procedures as targets of conceptual engineering, for implementing such procedures.
  • methods: The classification procedures are implemented using a large language model.
  • results: Applying the method to data from the Wikidata knowledge graph, the paper evaluates concept definitions from two paradigmatic conceptual engineering projects: the International Astronomical Union's redefinition of PLANET and Haslanger's ameliorative analysis of WOMAN.
    Abstract We describe a method, based on Jennifer Nado's definition of classification procedures as targets of conceptual engineering, that implements such procedures using a large language model. We then apply this method using data from the Wikidata knowledge graph to evaluate concept definitions from two paradigmatic conceptual engineering projects: the International Astronomical Union's redefinition of PLANET and Haslanger's ameliorative analysis of WOMAN. We discuss implications of this work for the theory and practice of conceptual engineering. The code and data can be found on GitHub.
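Operationally, a classification procedure here reduces to asking a language model whether an entity satisfies a concept definition. A minimal sketch follows; the `llm` callable is a placeholder rather than any specific API, the prompt wording is ours, and `toy_llm` is a stub so the example runs end to end.

```python
# Zero-shot concept classification with an LLM (the `llm` callable is a placeholder).
def classify(llm, definition: str, entity_description: str) -> bool:
    """Ask whether the entity satisfies the concept definition; expect yes/no."""
    prompt = (
        f"Definition: {definition}\n"
        f"Entity: {entity_description}\n"
        "Does the entity satisfy the definition? Answer 'yes' or 'no'."
    )
    return llm(prompt).strip().lower().startswith("yes")

# Toy stand-in for a real model, so the sketch runs end to end.
def toy_llm(prompt: str) -> str:
    return "yes" if "clears the neighbourhood" in prompt and "Pluto" not in prompt else "no"

planet_def = ("A celestial body that orbits the Sun, is nearly round, "
              "and clears the neighbourhood around its orbit.")
print(classify(toy_llm, planet_def, "Jupiter, a gas giant orbiting the Sun"))  # True
print(classify(toy_llm, planet_def, "Pluto, a dwarf planet"))                  # False
```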

PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction

  • paper_url: http://arxiv.org/abs/2312.00839
  • repo_url: https://github.com/guanleics/pipeoptim
  • paper_authors: Lei Guan, Dongsheng Li, Jiye Liang, Wenjian Wang, Xicheng Lu
  • for: This work tackles the weight inconsistency and weight staleness problems of asynchronous pipeline training.
  • methods: An optimizer-dependent weight prediction strategy (PipeOptim) uses predicted weights in the forward pass so that every mini-batch computes with consistent, staleness-free weights.
  • results: Experiments show PipeOptim outperforms competing pipelined approaches (GPipe, PipeDream, PipeDream-2BW, and SpecTrain) and ensures effective parameter learning regardless of the optimizer used.
    Abstract Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead and always provides quite a high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches across GPUs. To simultaneously address these two problems, in this paper, we propose an optimizer-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training. The key insight of our proposal is that we employ a weight prediction strategy in the forward pass to ensure that each mini-batch uses consistent and staleness-free weights to compute the forward pass. To be concrete, we first construct the weight prediction scheme based on the update rule of the used optimizer when training the deep neural network models. Then throughout the "1F1B" pipelined training, each mini-batch is mandated to execute weight prediction ahead of the forward pass, subsequently employing the predicted weights to perform the forward pass. As a result, PipeOptim 1) inherits the advantage of the "1F1B" schedule and generates pretty high throughput, and 2) can ensure effective parameter learning regardless of the type of the used optimizer. To verify the effectiveness of our proposal, we conducted extensive experimental evaluations using eight different deep-learning models spanning three machine-learning tasks including image classification, sentiment analysis, and machine translation. The experiment results demonstrate that PipeOptim outperforms the popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim can be accessible at https://github.com/guanleics/PipeOptim.
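The core of the approach is predicting the weights a mini-batch will effectively see once the in-flight updates land, which follows from the optimizer's update rule. For plain SGD with momentum, a sketch of one reasonable prediction (our derivation of the stated idea, not the paper's exact formula; `s` is the number of pending updates):

```python
# Weight prediction for SGD with momentum (sketch of the idea, not the paper's code).
import torch

def predict_weights(params, velocities, lr: float, s: int):
    """Approximate the weights after s pending updates, assuming the
    momentum buffer stays roughly constant over those steps:
        w_hat = w - lr * s * v
    """
    return [p.detach() - lr * s * v for p, v in zip(params, velocities)]

w = [torch.randn(3, 3)]
v = [torch.randn(3, 3)]            # momentum buffer from the optimizer state
w_hat = predict_weights(w, v, lr=0.1, s=2)

# In a 1F1B pipeline, the forward pass of a mini-batch would use w_hat
# instead of the stale w, so forward and backward see consistent weights.
x = torch.randn(4, 3)
y = x @ w_hat[0]
print(y.shape)  # torch.Size([4, 3])
```

For adaptive optimizers such as Adam, the prediction would instead follow that optimizer's update rule, which is why the strategy is described as optimizer-dependent.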

Mark My Words: Analyzing and Evaluating Language Model Watermarks

  • paper_url: http://arxiv.org/abs/2312.00273
  • repo_url: https://github.com/wagner-group/markmywords
  • paper_authors: Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, David Wagner
  • for: This work benchmarks text watermarking techniques to help practitioners choose a scheme that fits their needs.
  • methods: The MARKMYWORDS benchmark evaluates watermarks under different tasks and practical attacks, using three main metrics: quality, size (e.g., how many tokens are needed to detect the watermark), and tamper-resistance.
  • results: Current watermarking techniques are good enough to deploy: the scheme of Kirchenbauer et al. [1] can watermark Llama2-7B-chat with no perceivable loss in quality, the watermark can be detected with fewer than 100 tokens, and it resists simple attacks. The authors argue that watermark indistinguishability, emphasized in some prior work, is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality.
    Abstract The capabilities of large language models have grown significantly in recent years and so too have concerns about their misuse. In this context, the ability to distinguish machine-generated text from human-authored content becomes important. Prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on text watermarking techniques - as opposed to image watermarks - and proposes MARKMYWORDS, a comprehensive benchmark for them under different tasks as well as practical attacks. We focus on three main metrics: quality, size (e.g. the number of tokens needed to detect a watermark), and tamper-resistance. Current watermarking techniques are good enough to be deployed: Kirchenbauer et al. [1] can watermark Llama2-7B-chat with no perceivable loss in quality, the watermark can be detected with fewer than 100 tokens, and the scheme offers good tamper-resistance to simple attacks. We argue that watermark indistinguishability, a criteria emphasized in some prior works, is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. We publicly release our benchmark (https://github.com/wagner-group/MarkMyWords)
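Detection for green-list watermarks in the style of Kirchenbauer et al. reduces to a one-proportion z-test on how many generated tokens fall in the green list. The sketch below shows the generic statistic, not the benchmark's own code:

```python
# Green-list watermark detection as a one-proportion z-test (generic sketch).
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """gamma: expected fraction of green tokens in unwatermarked text."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

# 90 of 100 tokens green, with gamma = 0.5: far beyond chance.
z = watermark_z_score(green_count=90, total_tokens=100, gamma=0.5)
print(f"z = {z:.1f}")  # z = 8.0, watermark detected at any reasonable threshold
```

A z-score this large rejects the unwatermarked null overwhelmingly, which is why fewer than 100 tokens can suffice for detection.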

Academic competitions

  • paper_url: http://arxiv.org/abs/2312.00268
  • repo_url: https://github.com/vicentinileonardo/DWT-SVD-digital-watermarking
  • paper_authors: Hugo Jair Escalante, Aleksandra Kruchinina
  • for: This chapter surveys the current state and trends of academic challenges, particularly as applied in machine learning and related fields.
  • methods: A survey reviewing the most influential academic competitions of recent years, analyzing the goals, major achievements, and expectations of challenges per area of knowledge.
  • results: Academic challenges are found to have broad application and influence in machine learning and related fields, with goals and achievements that differ by area; the chapter also identifies emerging directions such as data-driven challenges and applications of AI.
    Abstract Academic challenges comprise effective means for (i) advancing the state of the art, (ii) putting in the spotlight of a scientific community specific topics and problems, as well as (iii) closing the gap for under represented communities in terms of accessing and participating in the shaping of research fields. Competitions can be traced back for centuries and their achievements have had great influence in our modern world. Recently, they (re)gained popularity, with the overwhelming amounts of data that is being generated in different domains, as well as the need of pushing the barriers of existing methods, and available tools to handle such data. This chapter provides a survey of academic challenges in the context of machine learning and related fields. We review the most influential competitions in the last few years and analyze challenges per area of knowledge. The aims of scientific challenges, their goals, major achievements and expectations for the next few years are reviewed.

Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration

  • paper_url: http://arxiv.org/abs/2312.00267
  • repo_url: None
  • paper_authors: Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, Willie Neiswanger
  • for: This work aims to improve the sample efficiency of policy identification in reinforcement learning from human feedback (RLHF), particularly for training large language models.
  • methods: The problem of choosing the contexts at which to obtain human feedback is formalized as an offline contextual dueling bandit problem, for which an upper-confidence-bound style algorithm with a polynomial worst-case regret bound is given.
  • results: Experiments show the approach outperforms several baselines in a synthetic setting and reaches better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.
    Abstract Preference-based feedback is important for many applications in reinforcement learning where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback (RLHF) on large language models. For many applications of RLHF, the cost of acquiring the human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and formalize this as an offline contextual dueling bandit problem. We give an upper-confidence-bound style algorithm for this problem and prove a polynomial worst-case regret bound. We then provide empirical confirmation in a synthetic setting that our approach outperforms existing methods. After, we extend the setting and methodology for practical use in RLHF training of large language models. Here, our method is able to reach better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.

Skipper: Improving the Reach and Fidelity of Quantum Annealers by Skipping Long Chains

  • paper_url: http://arxiv.org/abs/2312.00264
  • repo_url: None
  • paper_authors: Ramin Ayanzadeh, Moinuddin Qureshi
  • for: Increasing the capacity and fidelity of quantum annealers (QAs) so that they can solve larger problems.
  • methods: A software technique, Skipper, that skips dominant qubit chains and substitutes each skipped program qubit with its two possible readout results; a greedy variant, Skipper-G, skips sub-problems unlikely to hold the global optimum.
  • results: On a 5,761-qubit QA, Skipper tackles up to 59% (avg. 28%) larger problems when eleven chains are skipped, and improves fidelity by up to 44% (avg. 33%) when cutting five chains.
    Abstract Quantum Annealers (QAs) operate as single-instruction machines, lacking a SWAP operation to overcome limited qubit connectivity. Consequently, multiple physical qubits are chained to form a program qubit with higher connectivity, resulting in a drastically diminished effective QA capacity by up to 33x. We observe that in QAs: (a) chain lengths exhibit a power-law distribution, a few dominant chains holding substantially more qubits than others; and (b) about 25% of physical qubits remain unused, getting isolated between these chains. We propose Skipper, a software technique that enhances the capacity and fidelity of QAs by skipping dominant chains and substituting their program qubit with two readout results. Using a 5761-qubit QA, we demonstrate that Skipper can tackle up to 59% (Avg. 28%) larger problems when eleven chains are skipped. Additionally, Skipper can improve QA fidelity by up to 44% (Avg. 33%) when cutting five chains (32 runs). Users can specify up to eleven chain cuts in Skipper, necessitating about 2,000 distinct quantum executable runs. To mitigate this, we introduce Skipper-G, a greedy scheme that skips sub-problems less likely to hold the global optimum, executing a maximum of 23 quantum executables with eleven chain trims. Skipper-G can boost QA fidelity by up to 41% (Avg. 29%) when cutting five chains (11 runs).
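Skipping a chain means its program qubit has no single annealed value, so the qubit is clamped to each of its two possible readouts and the problem is re-run; k skipped chains therefore fan out into 2^k sub-problems, which is why up to eleven cuts imply about 2,000 quantum executable runs. An illustrative sketch of that enumeration (toy chain lengths, hypothetical `solve_subproblem`):

```python
# Enumerating readout substitutions for skipped chains (illustrative sketch).
from itertools import product

chain_lengths = {"q0": 1, "q1": 9, "q2": 2, "q3": 7, "q4": 1}  # qubits per chain (toy data)

def chains_to_skip(lengths: dict, k: int):
    """Pick the k dominant (longest) chains, reflecting the power-law tail."""
    return sorted(lengths, key=lengths.get, reverse=True)[:k]

skipped = chains_to_skip(chain_lengths, k=2)
print("skipping:", skipped)  # ['q1', 'q3']

# Each skipped program qubit is fixed to -1 or +1, giving 2^k sub-problems.
for assignment in product([-1, +1], repeat=len(skipped)):
    fixed = dict(zip(skipped, assignment))
    # solve_subproblem(fixed) would run the annealer with these qubits clamped.
    print("sub-problem with", fixed)
```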

cs.CL - 2023-12-01

Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

  • paper_url: http://arxiv.org/abs/2312.00968
  • repo_url: None
  • paper_authors: Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut
  • for: This paper focuses on tuning large multimodal models (LMMs) so that generalist performance improves, rather than degrades, across a large collection of tasks.
  • methods: The Omni-SMoLA architecture uses a Soft MoE approach to (softly) mix many multimodal low-rank experts, avoiding the introduction of a significant number of new parameters compared to conventional MoE models.
  • results: Experiments show the SMoLA design improves generalist performance across a broad range of generative vision-and-language tasks, often matching or outperforming single specialized LMM baselines as well as new specialist baselines.
    Abstract Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
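The architecture can be pictured as a gate that softly mixes several rank-r residual experts on top of a frozen backbone projection. The module below is a schematic reading of the abstract; the dimensions, per-example gating, and residual placement are assumptions.

```python
# Soft mixture of low-rank experts over a frozen linear layer (schematic sketch).
import torch
import torch.nn as nn

class SoftMoLA(nn.Module):
    def __init__(self, dim: int, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)                  # frozen backbone projection
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Parameter(torch.randn(n_experts, dim, rank) * 0.02)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, dim))
        self.gate = nn.Linear(dim, n_experts)            # produces soft mixing weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dim)
        w = torch.softmax(self.gate(x), dim=-1)          # soft weights, not hard top-k routing
        # Low-rank expert outputs: (batch, n_experts, dim)
        expert_out = torch.einsum("bd,edr,erD->beD", x, self.down, self.up)
        mix = torch.einsum("be,beD->bD", w, expert_out)  # softly mixed residual
        return self.base(x) + mix

layer = SoftMoLA(dim=64)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```

Because each expert is rank-r, adding many experts costs far fewer parameters than replicating full expert networks, which is the stated motivation for mixing low-rank experts.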

Hyperparameter Optimization for Large Language Model Instruction-Tuning

  • paper_url: http://arxiv.org/abs/2312.00949
  • repo_url: None
  • paper_authors: Christophe Tribes, Sacha Benarroch-Lelong, Peng Lu, Ivan Kobyzev
  • for: This work tunes the hyperparameters of instruction-tuning for large language models (LLMs) to improve performance on natural language processing applications.
  • methods: Fine-tuning uses Low-Rank Adaptation (LoRA), which keeps most of the pre-trained LLM's weights frozen and introduces a low-rank decomposition of the weight matrices so that only a very small proportion of the network is tuned.
  • results: Two blackbox optimization (BBO) techniques efficiently explore the space of hyperparameters, yielding a boost in both performance and human alignment of the tuned model.
    Abstract The fine-tuning of Large Language Models (LLMs) has enabled them to recently achieve milestones in natural language processing applications. The emergence of ever larger LLMs has paved the way for more efficient fine-tuning methods. Among these, the Low-Rank Adaptation (LoRA) method keeps most of the weights of the pre-trained LLM frozen while introducing a low-rank decomposition of the weight matrix, enabling the tuning of only a very small proportion of the network. The performance on downstream tasks of models fine-tuned with LoRA heavily relies on a set of hyperparameters including the rank of the decomposition. In this work, we investigate the choice of these hyperparameters through two main blackbox optimization (BBO) techniques. We examine the whole pipeline of performing fine-tuning and validation on a pre-trained LLM as a blackbox and efficiently explore the space of hyperparameters with the \nomad algorithm, achieving a boost in performance and human alignment of the tuned model.

Quick Back-Translation for Unsupervised Machine Translation

  • paper_url: http://arxiv.org/abs/2312.00912
  • repo_url: https://github.com/bbrimacombe/quick-back-translation
  • paper_authors: Benjamin Brimacombe, Jiawei Zhou
  • for: This paper improves Transformer-based unsupervised machine translation, which relies on the back-translation algorithm for iterative self-improvement.
  • methods: Quick Back-Translation (QBT) re-purposes the encoder as a generative model and uses encoder-generated sequences to train the decoder, in conjunction with the original autoregressive back-translation step, improving data throughput and utilization.
  • results: Across various WMT benchmarks, a relatively small number of QBT refining steps improves current unsupervised machine translation models, and QBT dramatically outperforms standard back-translation in training efficiency at comparable translation quality.
    Abstract The field of unsupervised machine translation has seen significant advancement from the marriage of the Transformer and the back-translation algorithm. The Transformer is a powerful generative model, and back-translation leverages Transformer's high-quality translations for iterative self-improvement. However, the Transformer is encumbered by the run-time of autoregressive inference during back-translation, and back-translation is limited by a lack of synthetic data efficiency. We propose a two-for-one improvement to Transformer back-translation: Quick Back-Translation (QBT). QBT re-purposes the encoder as a generative model, and uses encoder-generated sequences to train the decoder in conjunction with the original autoregressive back-translation step, improving data throughput and utilization. Experiments on various WMT benchmarks demonstrate that a relatively small number of refining steps of QBT improve current unsupervised machine translation models, and that QBT dramatically outperforms standard back-translation only method in terms of training efficiency for comparable translation qualities.

Analyzing the Influence of Fake News in the 2024 Elections: A Comprehensive Dataset

  • paper_url: http://arxiv.org/abs/2312.03750
  • repo_url: None
  • paper_authors: Mizanur Rahman, Shaina Raza
  • for: This study examines fake news in US political speeches, specifically racial slurs and biases.
  • methods: 40,000 news articles were scraped and annotated using advanced NLP tools together with human verification, providing a rich resource for machine learning and bias analysis.
  • results: The resulting dataset, focused on analyzing fake news in the context of the 2024 elections, supports researchers, policymakers, and educators in developing strategies against misinformation and enhancing media literacy, and is publicly accessible for community work on fake news identification.
    Abstract This work introduces a dataset focused on fake news in US political speeches, specifically examining racial slurs and biases. By scraping and annotating 40,000 news articles, using advanced NLP tools and human verification, we provide a nuanced understanding of misinformation in political discourse. The dataset, designed for machine learning and bias analysis, is a critical resource for researchers, policymakers, and educators. It facilitates the development of strategies against misinformation and enhances media literacy, marking a significant contribution to the study of fake news and political communication. Our dataset, focusing on the analysis of fake news in the context of the 2024 elections, is publicly accessible for the community to work on fake news identification.

Hi-ArG: Exploring the Integration of Hierarchical Argumentation Graphs in Language Pretraining

  • paper_url: http://arxiv.org/abs/2312.00874
  • repo_url: https://github.com/ljcleo/hi-arg
  • paper_authors: Jingcong Liang, Rong Ye, Meng Han, Qi Zhang, Ruofei Lai, Xinyu Zhang, Zhao Cao, Xuanjing Huang, Zhongyu Wei
  • for: The work proposes a new knowledge graph structure for organizing arguments that helps language models perform better across argumentation applications.
  • methods: It introduces the Hierarchical Argumentation Graph (Hi-ArG) and two approaches to exploit it: a text-graph multimodal model, GreaseArG, and a new pre-training framework augmented with graph information.
  • results: On two argumentation tasks, after further pre-training and fine-tuning, GreaseArG surpasses same-scale language models, and incorporating graph information during further pre-training also improves the performance of vanilla language models.
    Abstract The knowledge graph is a structure to store and represent knowledge, and recent studies have discussed its capability to assist language models for various applications. Some variations of knowledge graphs aim to record arguments and their relations for computational argumentation tasks. However, many must simplify semantic types to fit specific schemas, thus losing flexibility and expression ability. In this paper, we propose the Hierarchical Argumentation Graph (Hi-ArG), a new structure to organize arguments. We also introduce two approaches to exploit Hi-ArG, including a text-graph multi-modal model GreaseArG and a new pre-training framework augmented with graph information. Experiments on two argumentation tasks have shown that after further pre-training and fine-tuning, GreaseArG supersedes same-scale language models on these tasks, while incorporating graph information during further pre-training can also improve the performance of vanilla language models. Code for this paper is available at https://github.com/ljcleo/Hi-ArG .

SeaLLMs – Large Language Models for Southeast Asia

  • paper_url: http://arxiv.org/abs/2312.00738
  • repo_url: https://github.com/damo-nlp-sg/seallms
  • paper_authors: Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, Lidong Bing
  • for: To advance research and applications of language models for Southeast Asian languages, improving their linguistic capability and cultural fidelity in regional languages.
  • methods: Built on the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, plus specialized instruction and alignment tuning, to better capture the intricacies of regional languages.
  • results: SeaLLM-13b outperforms comparable open-source models across a wide range of linguistic tasks and assistant-style instruction following, and beats ChatGPT-3.5 by large margins on non-Latin-script languages such as Thai, Khmer, Lao, and Burmese, while remaining lightweight and cost-effective.
    Abstract Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.

Contextualized word senses: from attention to compositionality

  • paper_url: http://arxiv.org/abs/2312.00680
  • repo_url: None
  • paper_authors: Pablo Gamallo
  • for: The paper proposes a transparent, interpretable method for encoding the contextual sense of words.
  • methods: It introduces a linguistically motivated model of semantic compositionality, paying particular attention to dependency relations and semantic notions such as selectional preferences and paradigmatic classes.
  • results: On a semantic task, the similarity calculation of word senses in context, the model is competitive with Transformer-based architectures.
    Abstract The neural architectures of language models are becoming increasingly complex, especially that of Transformers, based on the attention mechanism. Although their application to numerous natural language processing tasks has proven to be very fruitful, they continue to be models with little or no interpretability and explainability. One of the tasks for which they are best suited is the encoding of the contextual sense of words using contextualized embeddings. In this paper we propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words by modeling semantic compositionality. Particular attention is given to dependency relations and semantic notions such as selection preferences and paradigmatic classes. A partial implementation of the proposed model is carried out and compared with Transformer-based architectures for a given semantic task, namely the similarity calculation of word senses in context. The results obtained show that it is possible to be competitive with linguistically motivated models instead of using the black boxes underlying complex neural architectures.

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey

  • paper_url: http://arxiv.org/abs/2312.00678
  • repo_url: https://github.com/tding1/efficient-llm-survey
  • paper_authors: Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang
  • for: The paper provides a comprehensive survey of algorithmic methods for improving the efficiency of large language models (LLMs), helping researchers and practitioners understand LLM efficiency end to end.
  • methods: It covers a broad range of algorithmic techniques spanning the full development cycle of LLMs, rather than focusing on a single area such as training or model compression.
  • results: The survey organizes effective approaches across scaling laws, data utilization, architectural innovations, training and tuning strategies, and inference techniques.
    Abstract The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains, reshaping the artificial general intelligence landscape. However, the increasing computational and memory demands of these models present substantial challenges, hindering both academic research and practical applications. To address these issues, a wide array of methods, including both algorithmic and hardware solutions, have been developed to enhance the efficiency of LLMs. This survey delivers a comprehensive review of algorithmic advancements aimed at improving LLM efficiency. Unlike other surveys that typically focus on specific areas such as training or model compression, this paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs. Specifically, it covers various topics related to efficiency, including scaling laws, data utilization, architectural innovations, training and tuning strategies, and inference techniques. This paper aims to serve as a valuable resource for researchers and practitioners, laying the groundwork for future innovations in this critical research area. Our repository of relevant references is maintained at url{https://github.com/tding1/Efficient-LLM-Survey}.
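
Among the topics the survey covers, scaling laws admit a compact statement; one representative example is the Chinchilla-style parametric loss model of Hoffmann et al. (2022), sketched schematically below (the symbols are generic; no constants are taken from this survey):

```latex
% Chinchilla-style compute-optimal scaling law (Hoffmann et al., 2022):
% N = number of parameters, D = number of training tokens,
% E, A, B, \alpha, \beta = fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```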

Nonparametric Variational Regularisation of Pretrained Transformers

  • paper_url: http://arxiv.org/abs/2312.00662
  • repo_url: None
  • paper_authors: Fabio Fehr, James Henderson
  • for: The paper aims to address the overfitting problem in large-scale pre-training and fine-tuning of Transformer language models, and to improve their out-of-domain generalization.
  • methods: The paper proposes using Nonparametric Variational Information Bottleneck (NVIB) as a regulariser for training cross-attention in Transformers, and extends the NVIB framework to replace all types of attention functions in Transformers.
  • results: The paper shows that existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models using a proposed identity initialisation, and that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalization without any training.
    Abstract The current paradigm of large-scale pre-training and fine-tuning Transformer large language models has lead to significant improvements across the board in natural language processing. However, such large models are susceptible to overfitting to their training data, and as a result the models perform poorly when the domain changes. Also, due to the model's scale, the cost of fine-tuning the model to the new domain is large. Nonparametric Variational Information Bottleneck (NVIB) has been proposed as a regulariser for training cross-attention in Transformers, potentially addressing the overfitting problem. We extend the NVIB framework to replace all types of attention functions in Transformers, and show that existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models using a proposed identity initialisation. We then show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalisation without any training. This success supports the hypothesis that pretrained Transformers are implicitly NV Bayesian models.

Instruction-tuning Aligns LLMs to the Human Brain

  • paper_url: http://arxiv.org/abs/2312.00575
  • repo_url: None
  • paper_authors: Khai Loong Aw, Syrielle Montariol, Badr AlKhamissi, Martin Schrimpf, Antoine Bosselut
  • for: The paper investigates the effect of instruction-tuning on the similarity between language models (LLMs) and human language processing.
  • methods: The paper uses brain alignment and behavioral alignment to measure the similarity between LLMs and humans, and assesses the effect of instruction-tuning on these measures.
  • results: The paper finds that instruction-tuning generally enhances brain alignment by an average of 6%, but does not have a similar effect on behavioral alignment. It also identifies a strong positive correlation between brain alignment and model size, as well as performance on tasks requiring world knowledge.
    Abstract Instruction-tuning is a widely adopted method of finetuning that enables large language models (LLMs) to generate output that more closely resembles human responses to natural language queries, in many cases leading to human-level performance on diverse testbeds. However, it remains unclear whether instruction-tuning truly makes LLMs more similar to how humans process language. We investigate the effect of instruction-tuning on LLM-human similarity in two ways: (1) brain alignment, the similarity of LLM internal representations to neural activity in the human language system, and (2) behavioral alignment, the similarity of LLM and human behavior on a reading task. We assess 25 vanilla and instruction-tuned LLMs across three datasets involving humans reading naturalistic stories and sentences. We discover that instruction-tuning generally enhances brain alignment by an average of 6%, but does not have a similar effect on behavioral alignment. To identify the factors underlying LLM-brain alignment, we compute correlations between the brain alignment of LLMs and various model properties, such as model size, various problem-solving abilities, and performance on tasks requiring world knowledge spanning various domains. Notably, we find a strong positive correlation between brain alignment and model size (r = 0.95), as well as performance on tasks requiring world knowledge (r = 0.81). Our results demonstrate that instruction-tuning LLMs improves both world knowledge representations and brain alignment, suggesting that mechanisms that encode world knowledge in LLMs also improve representational alignment to the human brain.
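
The reported correlations (e.g., r = 0.95 between brain alignment and model size) are ordinary Pearson coefficients over per-model scores; a minimal sketch of that computation (the arrays below are illustrative placeholders, not the paper's measurements):

```python
from scipy.stats import pearsonr

# Illustrative placeholders: one entry per evaluated LLM.
log_model_size = [8.5, 9.1, 9.8, 10.3, 11.0]      # e.g., log10(parameter count)
brain_alignment = [0.31, 0.35, 0.42, 0.47, 0.55]  # alignment score per model

r, p_value = pearsonr(log_model_size, brain_alignment)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```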

Explanatory Argument Extraction of Correct Answers in Resident Medical Exams

  • paper_url: http://arxiv.org/abs/2312.00567
  • repo_url: None
  • paper_authors: Iakes Goenaga, Aitziber Atutxa, Koldo Gojenola, Maite Oronoz, Rodrigo Agerri
  • for: The goal is to provide medical professionals with useful AI technology to assist their everyday activities.
  • methods: The paper builds on large language models (LLMs) and automated benchmarks for information extraction in Evidence-Based Medicine (EBM), using natural language as the medium for human-AI interaction; it contributes a new dataset of doctor-written explanations for questions from the Spanish Residency Medical Exams, together with an extractive task over it.
  • results: The doctor-written explanations help identify relevant evidence-based explanations for medical questions in Spanish; multilingual models sometimes outperform monolingual ones, even models adapted to the medical domain, and results across monolingual models are mixed, with supposedly smaller models performing competitively.
    Abstract Developing the required technology to assist medical experts in their everyday activities is currently a hot topic in the Artificial Intelligence research field. Thus, a number of large language models (LLMs) and automated benchmarks have recently been proposed with the aim of facilitating information extraction in Evidence-Based Medicine (EBM) using natural language as a tool for mediating in human-AI interaction. The most representative benchmarks are limited to either multiple-choice or long-form answers and are available only in English. In order to address these shortcomings, in this paper we present a new dataset which, unlike previous work: (i) includes not only explanatory arguments for the correct answer, but also arguments to reason why the incorrect answers are not correct; (ii) the explanations are written originally by medical doctors to answer questions from the Spanish Residency Medical Exams. Furthermore, this new benchmark allows us to setup a novel extractive task which consists of identifying the explanation of the correct answer written by medical doctors. An additional benefit of our setting is that we can leverage the extractive QA paradigm to automatically evaluate performance of LLMs without resorting to costly manual evaluation by medical experts. Comprehensive experimentation with language models for Spanish shows that sometimes multilingual models fare better than monolingual ones, even outperforming models which have been adapted to the medical domain. Furthermore, results across the monolingual models are mixed, with supposedly smaller and inferior models performing competitively. In any case, the obtained results show that our novel dataset and approach can be an effective technique to help medical practitioners in identifying relevant evidence-based explanations for medical questions.
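
Since the benchmark casts explanation identification as extractive QA, model output can be scored automatically with the usual token-overlap F1; a minimal sketch (whitespace tokenisation and lowercasing are simplifying assumptions, not necessarily the paper's exact scorer):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between an extracted span and the gold span."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("la explicación correcta es la opción B",
               "la opción B es la explicación correcta"))  # 1.0
```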

Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs

  • paper_url: http://arxiv.org/abs/2312.00552
  • repo_url: https://github.com/qingwang-isu/augure
  • paper_authors: Qing Wang, Kang Zhou, Qiao Qiao, Yuepei Li, Qi Li
  • for: To improve unsupervised relation extraction (URE) by addressing two overlooked aspects of existing methods: the diversity of positive pairs and the choice of loss function for contrastive learning.
  • methods: The paper proposes AugURE, a contrastive-learning strategy that augments positive pairs both within sentences and through cross-sentence pair extraction to increase diversity and strengthen discriminative power, and replaces noise-contrastive estimation with a margin loss over sentence pairs.
  • results: Experiments on the NYT-FB and TACRED datasets achieve state-of-the-art performance.
    Abstract Unsupervised relation extraction (URE) aims to extract relations between named entities from raw text without requiring manual annotations or pre-existing knowledge bases. In recent studies of URE, researchers put a notable emphasis on contrastive learning strategies for acquiring relation representations. However, these studies often overlook two important aspects: the inclusion of diverse positive pairs for contrastive learning and the exploration of appropriate loss functions. In this paper, we propose AugURE with both within-sentence pairs augmentation and augmentation through cross-sentence pairs extraction to increase the diversity of positive pairs and strengthen the discriminative power of contrastive learning. We also identify the limitation of noise-contrastive estimation (NCE) loss for relation representation learning and propose to apply margin loss for sentence pairs. Experiments on NYT-FB and TACRED datasets demonstrate that the proposed relation representation learning and a simple K-Means clustering achieves state-of-the-art performance.
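
The margin loss over sentence pairs that AugURE substitutes for NCE can be sketched as a standard triplet-style objective (a minimal PyTorch illustration; cosine similarity and the margin value are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def margin_loss(anchor, positive, negative, margin=0.5):
    """Require positive pairs to beat negative pairs by a margin in similarity.

    anchor / positive / negative: (batch, dim) relation representations.
    """
    sim_pos = F.cosine_similarity(anchor, positive)  # (batch,)
    sim_neg = F.cosine_similarity(anchor, negative)  # (batch,)
    return F.relu(margin - sim_pos + sim_neg).mean()

a, p, n = (torch.randn(8, 128) for _ in range(3))
print(margin_loss(a, p, n))
```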

Trained MT Metrics Learn to Cope with Machine-translated References

  • paper_url: http://arxiv.org/abs/2312.00536
  • repo_url: https://github.com/amazon-science/prism-finetuned
  • paper_authors: Jannis Vamvas, Tobias Domhan, Sony Trenous, Rico Sennrich, Eva Hasler
  • for: To study how neural MT metrics trained on human evaluations behave, beyond their overall correlation with human judgments.
  • methods: The authors run a controlled experiment comparing a baseline metric not trained on human evaluations (Prism) with a trained version of the same metric (Prism+FT).
  • results: Surprisingly, Prism+FT becomes more robust to machine-translated references, a notorious problem in MT evaluation, suggesting that metric training does more than improve overall correlation with human judgments.
    Abstract Neural metrics trained on human evaluations of MT tend to correlate well with human judgments, but their behavior is not fully understood. In this paper, we perform a controlled experiment and compare a baseline metric that has not been trained on human evaluations (Prism) to a trained version of the same metric (Prism+FT). Surprisingly, we find that Prism+FT becomes more robust to machine-translated references, which are a notorious problem in MT evaluation. This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

  • paper_url: http://arxiv.org/abs/2312.00849
  • repo_url: https://github.com/rlhf-v/rlhf-v
  • paper_authors: Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua
  • for: Addresses the serious hallucination problems of existing MLLMs, which make them untrustworthy and impractical in real-world applications.
  • methods: Uses behavior alignment from fine-grained correctional human feedback to enhance the trustworthiness of MLLMs. Specifically, it collects human preference in the form of segment-level corrections on hallucinations and performs dense direct preference optimization over the human feedback.
  • results: Achieves substantially more trustworthy MLLM behaviors with promising data and computation efficiency, outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art trustworthiness among open-source MLLMs and shows better robustness than GPT-4V in preventing hallucinations arising from over-generalization.
    Abstract Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. However, existing MLLMs prevalently suffer from serious hallucination problems, generating text that is not factually grounded in associated images. The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications. To address the challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Specifically, RLHF-V collects human preference in the form of segment-level corrections on hallucinations, and performs dense direct preference optimization over the human feedback. Comprehensive experiments on five benchmarks in both automatic and human evaluation show that, RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency. Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8%, outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs, and shows better robustness than GPT-4V in preventing hallucinations aroused from over-generalization. We open-source our code, model, and data at https://github.com/RLHF-V/RLHF-V.
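
The dense direct preference optimization step builds on the standard DPO objective; the sketch below shows generic DPO over (preferred, rejected) log-probabilities, not RLHF-V's exact segment-level variant:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss over preferred (w) vs. rejected (l) responses.

    logp_*: summed log-probabilities under the trained policy;
    ref_logp_*: the same quantities under the frozen reference model.
    """
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

ref = torch.randn(4)
print(dpo_loss(ref + 0.5, ref - 0.5, ref, ref))
```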

Summarization-based Data Augmentation for Document Classification

  • paper_url: http://arxiv.org/abs/2312.00513
  • repo_url: https://github.com/etsurin/summaug
  • paper_authors: Yueguan Wang, Naoki Yoshinaga
  • for: To improve the robustness and accuracy of document classification.
  • methods: Summarization-based data augmentation: the inputs of the original training examples are condensed into easy-to-learn pseudo examples, which are then used for curriculum learning.
  • results: Experiments on two datasets show that the method outperforms existing baselines in both robustness and accuracy.
    Abstract Despite the prevalence of pretrained language models in natural language understanding tasks, understanding lengthy text such as document is still challenging due to the data sparseness problem. Inspired by that humans develop their ability of understanding lengthy text from reading shorter text, we propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification. We first obtain easy-to-learn examples for the target document classification task by summarizing the input of the original training examples, while optionally merging the original labels to conform to the summarized input. We then use the generated pseudo examples to perform curriculum learning. Experimental results on two datasets confirmed the advantage of our method compared to existing baseline methods in terms of robustness and accuracy. We release our code and data at https://github.com/etsurin/summaug.
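
A minimal sketch of the summarize-then-curriculum idea with an off-the-shelf summarizer (the specific pipeline model and the length-based easy-to-hard ordering are illustrative assumptions, not the paper's setup):

```python
from transformers import pipeline

# Assumed off-the-shelf summarizer; any seq2seq summarization checkpoint works.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

train = [
    ("The food was wonderful and the staff were friendly. " * 10, "positive"),
    ("Slow service, cold food, and a noisy dining room. " * 10, "negative"),
]

# Pseudo examples: each summary inherits the original document's label.
pseudo = [
    (summarizer(doc, max_length=60, min_length=10)[0]["summary_text"], label)
    for doc, label in train
]

# Curriculum: train on shorter (easier) pseudo examples first, then the originals.
curriculum = sorted(pseudo, key=lambda ex: len(ex[0].split())) + train
```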

CoLLiE: Collaborative Training of Large Language Models in an Efficient Way

  • paper_url: http://arxiv.org/abs/2312.00407
  • repo_url: https://github.com/openlmlab/collie
  • paper_authors: Kai Lv, Shuo Zhang, Tianle Gu, Shuhao Xing, Jiawei Hong, Keyu Chen, Xiaoran Liu, Yuqing Yang, Honglin Guo, Tengxiao Liu, Yu Sun, Qipeng Guo, Hang Yan, Xipeng Qiu
  • for: The paper presents an efficient library for collaborative training of large language models (LLMs).
  • methods: CoLLiE combines 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and optimizers such as Lion, Adan, Sophia, LOMO, and AdaLomo.
  • results: CoLLiE shows higher training efficiency than prevalent solutions in both pre-training and fine-tuning scenarios, and the paper also compares optimizers and PEFT methods in the instruction-tuning context.
    Abstract Large language models (LLMs) are increasingly pivotal in a wide range of natural language processing tasks. Access to pre-trained models, courtesy of the open-source community, has made it possible to adapt these models to specific applications for enhanced performance. However, the substantial resources required for training these models necessitate efficient solutions. This paper introduces CoLLiE, an efficient library that facilitates collaborative training of large language models using 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and optimizers such as Lion, Adan, Sophia, LOMO and AdaLomo. With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization. CoLLiE has proven superior training efficiency in comparison with prevalent solutions in pre-training and fine-tuning scenarios. Furthermore, we provide an empirical evaluation of the correlation between model size and GPU memory consumption under different optimization methods, as well as an analysis of the throughput. Lastly, we carry out a comprehensive comparison of various optimizers and PEFT methods within the instruction-tuning context. CoLLiE is available at https://github.com/OpenLMLab/collie.

  • paper_url: http://arxiv.org/abs/2312.00372
  • repo_url: None
  • paper_authors: Nan Yang, Shusen Zhang, Yannan Zhang, Xiaoling Bai, Hualong Deng, Tianhua Zhou, Jin Ma
  • for: To improve information retrieval in real-time search, where search intent shifts rapidly with breaking news events.
  • methods: Event information representing real-time search intent is integrated into the query through a cross-attention mechanism, producing a time-context query representation; multi-task training further strengthens the event representation, and an automatic data collection and annotation pipeline supplies time-sensitive training data.
  • results: Offline experiments on a million-scale production dataset and an online A/B test show that the proposed method significantly outperforms existing baselines.
    Abstract Information retrieval in real-time search presents unique challenges distinct from those encountered in classical web search. These challenges are particularly pronounced due to the rapid change of user search intent, which is influenced by the occurrence and evolution of breaking news events, such as earthquakes, elections, and wars. Previous dense retrieval methods, which primarily focused on static semantic representation, lack the capacity to capture immediate search intent, leading to inferior performance in retrieving the most recent event-related documents in time-sensitive scenarios. To address this issue, this paper expands the query with event information that represents real-time search intent. The Event information is then integrated with the query through a cross-attention mechanism, resulting in a time-context query representation. We further enhance the model's capacity for event representation through multi-task training. Since publicly available datasets such as MS-MARCO do not contain any event information on the query side and have few time-sensitive queries, we design an automatic data collection and annotation pipeline to address this issue, which includes ModelZoo-based Coarse Annotation and LLM-driven Fine Annotation processes. In addition, we share the training tricks such as two-stage training and hard negative sampling. Finally, we conduct a set of offline experiments on a million-scale production dataset to evaluate our approach and deploy an A/B testing in a real online system to verify the performance. Extensive experimental results demonstrate that our proposed approach significantly outperforms existing state-of-the-art baseline methods.
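
The core fusion step, letting the query attend to event information via cross-attention, can be sketched in PyTorch as follows (the dimensions and single-layer residual design are illustrative assumptions):

```python
import torch
import torch.nn as nn

class EventQueryFusion(nn.Module):
    """Fuse a query representation with event information via cross-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, event_tokens):
        # query_tokens: (batch, q_len, dim); event_tokens: (batch, e_len, dim)
        attended, _ = self.cross_attn(query_tokens, event_tokens, event_tokens)
        return self.norm(query_tokens + attended)  # time-context query repr.

fusion = EventQueryFusion()
q, e = torch.randn(2, 8, 256), torch.randn(2, 16, 256)
print(fusion(q, e).shape)  # torch.Size([2, 8, 256])
```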

RTQ: Rethinking Video-language Understanding Based on Image-text Model

  • paper_url: http://arxiv.org/abs/2312.00347
  • repo_url: https://github.com/SCZwangxiao/RTQ-MM2023
  • paper_authors: Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, Liqiang Nie
  • for: To improve video-language understanding built on image-text models, tackling the information redundancy, temporal dependency, and scene complexity specific to video.
  • methods: A new framework, RTQ (Refine, Temporal model, and Query), which refines redundant information within frames, models temporal relations among frames, and queries task-specific information from the videos.
  • results: The model performs outstandingly even without video-language pre-training, with results comparable to or better than state-of-the-art pre-training methods.
    Abstract Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods.

PsyAttention: Psychological Attention Model for Personality Detection

  • paper_url: http://arxiv.org/abs/2312.00293
  • repo_url: None
  • paper_authors: Baohua Zhang, Yongyi Huang, Wenyao Cui, Huaping Zhang, Jianyun Shang
  • for: The paper proposes a personality-detection method that combines features from different psychological models while reducing the noise introduced by their differing calculation standards.
  • methods: The proposed PsyAttention mechanism effectively encodes psychological features, reducing their number by 85%.
  • results: PsyAttention achieves average accuracies of 65.66% on the Big Five model and 86.30% on the MBTI model, outperforming state-of-the-art methods and indicating that it encodes psychological features effectively.
    Abstract Work on personality detection has tended to incorporate psychological features from different personality models, such as BigFive and MBTI. There are more than 900 psychological features, each of which is helpful for personality detection. However, when used in combination, the application of different calculation standards among these features may result in interference between features calculated using distinct systems, thereby introducing noise and reducing performance. This paper adapts different psychological models in the proposed PsyAttention for personality detection, which can effectively encode psychological features, reducing their number by 85%. In experiments on the BigFive and MBTI models, PysAttention achieved average accuracy of 65.66% and 86.30%, respectively, outperforming state-of-the-art methods, indicating that it is effective at encoding psychological features.

SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection

  • paper_url: http://arxiv.org/abs/2312.00292
  • repo_url: None
  • paper_authors: Anku Rani, Dwip Dalal, Shreya Gautam, Pankaj Gupta, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
  • for: This paper explores the problem of deception through the lens of psychology, with a focus on lies of omission, and proposes a novel framework for deception detection using NLP techniques.
  • methods: The authors use a multi-task learning pipeline that leverages fine-tuned language models, and curate an annotated dataset of 876,784 samples by combining a popular fake news dataset with news headlines scraped from Twitter.
  • results: The proposed model achieved an F1 score of 0.87, demonstrating strong performance across all layers of deceptive content, including the type, color, intention, and topic aspects. The authors also explore the relationship between lies of omission and propaganda techniques, uncovering significant correlations between loaded language and opinion.
    Abstract Deception is the intentional practice of twisting information. It is a nuanced societal practice deeply intertwined with human societal evolution, characterized by a multitude of facets. This research explores the problem of deception through the lens of psychology, employing a framework that categorizes deception into three forms: lies of omission, lies of commission, and lies of influence. The primary focus of this study is specifically on investigating only lies of omission. We propose a novel framework for deception detection leveraging NLP techniques. We curated an annotated dataset of 876,784 samples by amalgamating a popular large-scale fake news dataset and scraped news headlines from the Twitter handle of Times of India, a well-known Indian news media house. Each sample has been labeled with four layers, namely: (i) the type of omission (speculation, bias, distortion, sounds factual, and opinion), (ii) colors of lies(black, white, etc), and (iii) the intention of such lies (to influence, etc) (iv) topic of lies (political, educational, religious, etc). We present a novel multi-task learning pipeline that leverages the dataless merging of fine-tuned language models to address the deception detection task mentioned earlier. Our proposed model achieved an F1 score of 0.87, demonstrating strong performance across all layers including the type, color, intent, and topic aspects of deceptive content. Finally, our research explores the relationship between lies of omission and propaganda techniques. To accomplish this, we conducted an in-depth analysis, uncovering compelling findings. For instance, our analysis revealed a significant correlation between loaded language and opinion, shedding light on their interconnectedness. To encourage further research in this field, we will be making the models and dataset available with the MIT License, making it favorable for open-source research.
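
The four annotation layers map naturally onto a shared encoder with one classification head per layer; a hedged sketch of such a multi-task head (the label counts and the pooled-input interface are placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDeceptionHeads(nn.Module):
    """One classifier per annotation layer over a shared sentence embedding."""

    def __init__(self, dim=768, n_types=5, n_colors=4, n_intents=6, n_topics=8):
        super().__init__()
        self.heads = nn.ModuleDict({
            "type": nn.Linear(dim, n_types),    # speculation, bias, distortion, ...
            "color": nn.Linear(dim, n_colors),  # black, white, ...
            "intent": nn.Linear(dim, n_intents),
            "topic": nn.Linear(dim, n_topics),  # political, educational, ...
        })

    def forward(self, pooled):  # pooled: (batch, dim)
        return {task: head(pooled) for task, head in self.heads.items()}

model = MultiTaskDeceptionHeads()
logits = model(torch.randn(4, 768))
labels = {t: torch.zeros(4, dtype=torch.long) for t in logits}  # dummy targets
loss = sum(F.cross_entropy(logits[t], labels[t]) for t in logits)
```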

Text Attribute Control via Closed-Loop Disentanglement

  • paper_url: http://arxiv.org/abs/2312.00277
  • repo_url: None
  • paper_authors: Lei Sha, Thomas Lukasiewicz
  • for: The paper aims at robust attribute control for text: changing an attribute while keeping the content intact.
  • methods: It uses a semi-supervised contrastive learning method that enforces attribute disentanglement in the latent space, re-disentangling the reconstructed sentence and comparing the result with the original latent space in a closed loop.
  • results: Experiments on three text datasets show the method effectively changes text attributes while preserving content.
    Abstract Changing an attribute of a text without changing the content usually requires to first disentangle the text into irrelevant attributes and content representations. After that, in the inference phase, the representation of one attribute is tuned to a different value, expecting that the corresponding attribute of the text can also be changed accordingly. The usual way of disentanglement is to add some constraints on the latent space of an encoder-decoder architecture, including adversarial-based constraints and mutual-information-based constraints. However, the previous semi-supervised processes of attribute change are usually not enough to guarantee the success of attribute change and content preservation. In this paper, we propose a novel approach to achieve a robust control of attributes while enhancing content preservation. In this approach, we use a semi-supervised contrastive learning method to encourage the disentanglement of attributes in latent spaces. Differently from previous works, we re-disentangle the reconstructed sentence and compare the re-disentangled latent space with the original latent space, which makes a closed-loop disentanglement process. This also helps content preservation. In addition, the contrastive learning method is also able to replace the role of minimizing mutual information and adversarial training in the disentanglement process, which alleviates the computation cost. We conducted experiments on three text datasets, including the Yelp Service review dataset, the Amazon Product review dataset, and the GoEmotions dataset. The experimental results show the effectiveness of our model.
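
The closed-loop idea, decoding and then re-encoding to compare the re-disentangled latents with the originals, can be written as a cycle-consistency term; a toy sketch (the encoder/decoder below are trivial stand-ins so the snippet runs, not real text models):

```python
import torch
import torch.nn.functional as F

def closed_loop_loss(encode, decode, x):
    """Cycle-consistency over disentangled latents: encode, decode, re-encode,
    then compare the re-disentangled latents with the originals."""
    attr_z, content_z = encode(x)
    x_hat = decode(attr_z, content_z)
    attr_z2, content_z2 = encode(x_hat)  # re-disentangle the reconstruction
    return F.mse_loss(attr_z2, attr_z) + F.mse_loss(content_z2, content_z)

# Trivial stand-ins; real models would be a text encoder and decoder.
encode = lambda x: (x[:, :8], x[:, 8:])
decode = lambda a, c: torch.cat([a, c], dim=1)
print(closed_loop_loss(encode, decode, torch.randn(4, 32)))  # tensor(0.) here
```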

cs.LG - 2023-12-01

Spatiotemporal Transformer for Imputing Sparse Data: A Deep Learning Approach

  • paper_url: http://arxiv.org/abs/2312.00963
  • repo_url: None
  • paper_authors: Kehui Yao, Jingyi Huang, Jun Zhu
  • for: This paper aims to address the challenge of missing values in sparse spatiotemporal datasets, particularly focusing on soil moisture data.
  • methods: The ST-Transformer model employs multiple spatiotemporal attention layers to capture complex spatiotemporal correlations in the data and can integrate additional spatiotemporal covariates during the imputation process, enhancing its accuracy.
  • results: The model demonstrates superior accuracy compared to well-known imputation methods and is applicable to various spatiotemporal imputation tasks.
    Abstract Effective management of environmental resources and agricultural sustainability heavily depends on accurate soil moisture data. However, datasets like the SMAP/Sentinel-1 soil moisture product often contain missing values across their spatiotemporal grid, which poses a significant challenge. This paper introduces a novel Spatiotemporal Transformer model (ST-Transformer) specifically designed to address the issue of missing values in sparse spatiotemporal datasets, particularly focusing on soil moisture data. The ST-Transformer employs multiple spatiotemporal attention layers to capture the complex spatiotemporal correlations in the data and can integrate additional spatiotemporal covariates during the imputation process, thereby enhancing its accuracy. The model is trained using a self-supervised approach, enabling it to autonomously predict missing values from observed data points. Our model's efficacy is demonstrated through its application to the SMAP 1km soil moisture data over a 36 x 36 km grid in Texas. It showcases superior accuracy compared to well-known imputation methods. Additionally, our simulation studies on other datasets highlight the model's broader applicability in various spatiotemporal imputation tasks.
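
A minimal self-supervised masking setup of the kind described, flattening spatiotemporal points into tokens, hiding some observed values, and training a transformer encoder to recover them (dimensions, masking rate, and the plain encoder are illustrative assumptions, not the ST-Transformer itself):

```python
import torch
import torch.nn as nn

class TinySTImputer(nn.Module):
    """Toy masked-imputation transformer over flattened spatiotemporal tokens."""

    def __init__(self, n_features=1, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_features + 1, d_model)  # value + missing flag
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_features)

    def forward(self, values, missing_mask):
        # values: (batch, tokens, n_features); missing_mask: same shape, 1 = missing
        x = torch.cat([values * (1 - missing_mask), missing_mask], dim=-1)
        return self.head(self.encoder(self.embed(x)))

model = TinySTImputer()
vals = torch.rand(2, 100, 1)                   # 100 spatiotemporal grid points
mask = (torch.rand(2, 100, 1) < 0.3).float()   # hide 30% of observed values
pred = model(vals, mask)
loss = ((pred - vals) ** 2 * mask).sum() / mask.sum()  # score only hidden values
```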

A Theory of Unimodal Bias in Multimodal Learning

  • paper_url: http://arxiv.org/abs/2312.00935
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Yedi Zhang, Peter E. Latham, Andrew Saxe
  • for: This paper aims to understand the phenomenon of unimodal bias in deep multimodal neural networks during joint training.
  • methods: The authors use deep multimodal linear networks and analyze the duration of the unimodal phase in learning as a function of layer fusion, dataset statistics, and initialization.
  • results: The authors find that deeper layer fusion leads to longer unimodal phases, which can result in permanent unimodal bias and a generalization deficit in overparametrized networks. Additionally, they show that the modality learned first is not necessarily the most important modality for the output. These results apply to ReLU networks in certain settings.
    Abstract Using multiple input streams simultaneously in training multimodal neural networks is intuitively advantageous, but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. While unimodal bias is well-documented empirically, our theoretical understanding of how architecture and data statistics influence this bias remains incomplete. Here we develop a theory of unimodal bias with deep multimodal linear networks. We calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We find that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparametrized regime. In addition, our theory reveals the modality learned first is not necessarily the modality that contributes more to the output. Our results, derived for multimodal linear networks, extend to ReLU networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias.
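
The analyzed object, two unimodal linear pathways fused by concatenation at a chosen depth with linear layers throughout, can be written down directly; a small constructor sketch (sizes and depths are placeholders, and it assumes depth - fuse_at >= 2):

```python
import torch
import torch.nn as nn

def multimodal_linear_net(d_a, d_b, d_hidden, d_out, depth, fuse_at):
    """Two linear pathways of `fuse_at` layers each, concatenated, followed by
    `depth - fuse_at` shared linear layers. No nonlinearities, as in the theory."""
    path_a = nn.Sequential(nn.Linear(d_a, d_hidden),
                           *[nn.Linear(d_hidden, d_hidden) for _ in range(fuse_at - 1)])
    path_b = nn.Sequential(nn.Linear(d_b, d_hidden),
                           *[nn.Linear(d_hidden, d_hidden) for _ in range(fuse_at - 1)])
    fused = nn.Sequential(nn.Linear(2 * d_hidden, d_hidden),
                          *[nn.Linear(d_hidden, d_hidden)
                            for _ in range(depth - fuse_at - 2)],
                          nn.Linear(d_hidden, d_out))

    def forward(x_a, x_b):
        return fused(torch.cat([path_a(x_a), path_b(x_b)], dim=-1))
    return forward

net = multimodal_linear_net(d_a=10, d_b=10, d_hidden=32, d_out=1,
                            depth=4, fuse_at=2)  # fusion after layer 2
print(net(torch.randn(5, 10), torch.randn(5, 10)).shape)  # torch.Size([5, 1])
```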

PACE: A Program Analysis Framework for Continuous Performance Prediction

  • paper_url: http://arxiv.org/abs/2312.00918
  • repo_url: https://github.com/padlab/pace
  • paper_authors: Chidera Biringa, Gokhan Kul
  • for: The paper presents a program analysis framework that gives developers continuous feedback on the performance impact of pending code changes before they land.
  • methods: The framework builds performance microbenchmarks from the execution times of functional test cases under a code update, maps them to code stylometry features, and feeds these to predictors for performance prediction.
  • results: Experiments show strong predictive power, outperforming the current state of the art by 75% on neural-represented code stylometry features.
    Abstract Software development teams establish elaborate continuous integration pipelines containing automated test cases to accelerate the development process of software. Automated tests help to verify the correctness of code modifications decreasing the response time to changing requirements. However, when the software teams do not track the performance impact of pending modifications, they may need to spend considerable time refactoring existing code. This paper presents PACE, a program analysis framework that provides continuous feedback on the performance impact of pending code updates. We design performance microbenchmarks by mapping the execution time of functional test cases given a code update. We map microbenchmarks to code stylometry features and feed them to predictors for performance predictions. Our experiments achieved significant performance in predicting code performance, outperforming current state-of-the-art by 75% on neural-represented code stylometry features.
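
A hedged sketch of the two ingredients the abstract describes, timing functional tests as microbenchmarks and regressing performance on code stylometry features (the toy feature set and the regressor choice are assumptions, not PACE's implementation):

```python
import timeit
from sklearn.ensemble import GradientBoostingRegressor

def time_test(test_fn, repeats=5):
    """Microbenchmark: best-of-N wall time of one functional test case."""
    return min(timeit.repeat(test_fn, number=1, repeat=repeats))

# Toy stylometry features per code update: (lines changed, max nesting, call count).
X = [[12, 2, 5], [80, 4, 22], [5, 1, 2], [40, 3, 11]]
# Stand-in "test suites" whose cost grows with the update's size.
y = [time_test(lambda: sum(range(10_000 * (i + 1)))) for i in range(4)]

model = GradientBoostingRegressor().fit(X, y)
print(model.predict([[30, 3, 9]]))  # predicted runtime impact of a pending update
```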

Extreme Event Prediction with Multi-agent Reinforcement Learning-based Parametrization of Atmospheric and Oceanic Turbulence

  • paper_url: http://arxiv.org/abs/2312.00907
  • repo_url: None
  • paper_authors: Rambod Mojgani, Daniel Waelchli, Yifei Guan, Petros Koumoutsakos, Pedram Hassanzadeh
  • for: The paper addresses a core problem of global climate models (GCMs): representing unresolved small-scale turbulence through closures, improving on supervised-learned closures via reinforcement learning.
  • methods: It uses Scientific Multi-Agent Reinforcement Learning (SMARL), in which computational elements serve as both discretization points and learning agents, combined with fundamentals of turbulence physics; the policy is trained using only the enstrophy spectrum, which can be estimated from a few high-fidelity samples.
  • results: The learned closures yield stable, highly accurate low-resolution simulations from only a small amount of high-fidelity data, reproducing high-fidelity statistics at a fraction of the cost.
    Abstract Global climate models (GCMs) are the main tools for understanding and predicting climate change. However, due to limited numerical resolutions, these models suffer from major structural uncertainties; e.g., they cannot resolve critical processes such as small-scale eddies in atmospheric and oceanic turbulence. Thus, such small-scale processes have to be represented as a function of the resolved scales via closures (parametrization). The accuracy of these closures is particularly important for capturing climate extremes. Traditionally, such closures are based on heuristics and simplifying assumptions about the unresolved physics. Recently, supervised-learned closures, trained offline on high-fidelity data, have been shown to outperform the classical physics-based closures. However, this approach requires a significant amount of high-fidelity training data and can also lead to instabilities. Reinforcement learning is emerging as a potent alternative for developing such closures as it requires only low-order statistics and leads to stable closures. In Scientific Multi-Agent Reinforcement Learning (SMARL) computational elements serve a dual role of discretization points and learning agents. We leverage SMARL and fundamentals of turbulence physics to learn closures for prototypes of atmospheric and oceanic turbulence. The policy is trained using only the enstrophy spectrum, which is nearly invariant and can be estimated from a few high-fidelity samples (these few samples are far from enough for supervised/offline learning). We show that these closures lead to stable low-resolution simulations that, at a fraction of the cost, can reproduce the high-fidelity simulations' statistics, including the tails of the probability density functions. The results demonstrate the high potential of SMARL for closure modeling for GCMs, especially in the regime of scarce data and indirect observations.

Explaining Knock-on Effects of Bias Mitigation

  • paper_url: http://arxiv.org/abs/2312.00765
  • repo_url: None
  • paper_authors: Svetoslav Nizhnichenkov, Rahul Nair, Elizabeth Daly, Brian Mac Namee
  • for: The paper aims to characterise which cohorts are impacted when bias mitigation interventions are applied.
  • methods: It treats intervention effects as a classification task and learns an explainable meta-classifier to identify cohorts with altered outcomes, examining a range of bias mitigation strategies that operate at various stages of the model life cycle.
  • results: All tested mitigation strategies negatively impact a non-trivial fraction of cases, i.e., people who receive unfavourable outcomes solely because of the mitigation efforts, despite improvements in aggregate fairness metrics; the results support more careful audits of static mitigation interventions.
    Abstract In machine learning systems, bias mitigation approaches aim to make outcomes fairer across privileged and unprivileged groups. Bias mitigation methods work in different ways and have known "waterfall" effects, e.g., mitigating bias at one place may manifest bias elsewhere. In this paper, we aim to characterise impacted cohorts when mitigation interventions are applied. To do so, we treat intervention effects as a classification task and learn an explainable meta-classifier to identify cohorts that have altered outcomes. We examine a range of bias mitigation strategies that work at various stages of the model life cycle. We empirically demonstrate that our meta-classifier is able to uncover impacted cohorts. Further, we show that all tested mitigation strategies negatively impact a non-trivial fraction of cases, i.e., people who receive unfavourable outcomes solely on account of mitigation efforts. This is despite improvement in fairness metrics. We use these results as a basis to argue for more careful audits of static mitigation interventions that go beyond aggregate metrics.
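
The meta-classifier idea can be sketched directly: label each instance by whether mitigation flipped its outcome, then fit an interpretable model on instance features (the toy data and decision-tree choice below are assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # instance features
pred_before = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy pre-mitigation decisions
pred_after = (X[:, 0] > 0).astype(int)             # toy post-mitigation decisions

flipped = (pred_before != pred_after).astype(int)  # knock-on effect label

meta = DecisionTreeClassifier(max_depth=3).fit(X, flipped)
print(export_text(meta, feature_names=["f0", "f1", "f2", "f3"]))
```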

SpaCE: The Spatial Confounding Environment

  • paper_url: http://arxiv.org/abs/2312.00710
  • repo_url: https://github.com/nsaph-projects/space
  • paper_authors: Mauricio Tec, Ana Trisovic, Michelle Audirac, Sophie Woodward, Jie Kate Hu, Naeem Khoshnevis, Francesca Dominici
  • for: Addressing spatial confounding in scientific studies involving spatial data
  • methods: Provides realistic benchmark datasets and tools for evaluating causal inference methods
  • results: Includes diverse datasets of varying sizes and spatial complexity, with realistic semi-synthetic outcomes and counterfactuals
    Abstract Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.

Machine Learning for Health symposium 2023 – Findings track

  • paper_url: http://arxiv.org/abs/2312.00655
  • repo_url: None
  • paper_authors: Stefan Hegselmann, Antonio Parziale, Divya Shanmugam, Shengpu Tang, Mercy Nyamewaa Asiedu, Serina Chang, Thomas Hartvigsen, Harvineet Singh
  • for: The paper collects the accepted Findings papers presented at the 3rd Machine Learning for Health (ML4H) symposium on December 10, 2023, in New Orleans, Louisiana, USA.
  • methods: All submissions underwent a double-blind peer-review process.
  • results: The Findings track gathers new ideas that sparked insightful discussion across the health-related disciplines of healthcare, biomedicine, and public health.
    Abstract A collection of the accepted Findings papers that were presented at the 3rd Machine Learning for Health symposium (ML4H 2023), which was held on December 10, 2023, in New Orleans, Louisiana, USA. ML4H 2023 invited high-quality submissions on relevant problems in a variety of health-related disciplines including healthcare, biomedicine, and public health. Two submission tracks were offered: the archival Proceedings track, and the non-archival Findings track. Proceedings were targeted at mature work with strong technical sophistication and a high impact to health. The Findings track looked for new ideas that could spark insightful discussion, serve as valuable resources for the community, or could enable new collaborations. Submissions to the Proceedings track, if not accepted, were automatically considered for the Findings track. All the manuscripts submitted to ML4H Symposium underwent a double-blind peer-review process.

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

  • paper_url: http://arxiv.org/abs/2312.00645
  • repo_url: None
  • paper_authors: Paul Bricman
  • for: This work targets evaluating language model capabilities on sensitive topics, such as bioterrorism or cyberwarfare, where traditional open-source benchmarks fall short because they publish the correct answers in human-readable form.
  • methods: The authors propose hashmarking, a protocol for evaluating language models in the open without disclosing the correct answers; in its simplest form, a hashmark is a benchmark whose reference solutions are cryptographically hashed prior to publication.
  • results: The protocol's resilience is assessed against traditional attack vectors (e.g., rainbow table attacks) as well as failure modes unique to increasingly capable generative models.
    Abstract There is a growing need to gain insight into language model capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. However, traditional open source benchmarks are not fit for the task, due to the associated practice of publishing the correct answers in human-readable form. At the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. In this context, we propose hashmarking, a protocol for evaluating language models in the open without having to disclose the correct answers. In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. Following an overview of the proposed evaluation protocol, we go on to assess its resilience against traditional attack vectors (e.g. rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.
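
To make the protocol concrete, here is a minimal sketch of a salted hashmark in Python. The function names, the per-item salt (one standard defense against the rainbow-table attacks the paper considers), and the exact-match scoring rule are illustrative assumptions, not the paper's reference design.

```python
import hashlib
import secrets

# Illustrative protocol sketch, not the paper's reference implementation.

def make_hashmark(items):
    """Build a publishable benchmark: prompts in the clear, reference
    answers replaced by salted SHA-256 digests.
    `items` is a list of (prompt, reference_answer) pairs."""
    public = []
    for prompt, answer in items:
        salt = secrets.token_hex(16)  # per-item salt thwarts rainbow tables
        digest = hashlib.sha256((salt + answer).encode("utf-8")).hexdigest()
        public.append({"prompt": prompt, "salt": salt, "digest": digest})
    return public

def score(hashmark, model_answers):
    """Score candidate answers without ever seeing the references:
    an answer counts as correct iff its salted hash matches the digest."""
    hits = 0
    for item, candidate in zip(hashmark, model_answers):
        h = hashlib.sha256((item["salt"] + candidate).encode("utf-8")).hexdigest()
        hits += (h == item["digest"])
    return hits / len(hashmark)

# Publish the hashmark, then evaluate a model's answers against it.
mark = make_hashmark([("Capital of France?", "Paris")])
print(score(mark, ["Paris"]))  # 1.0
print(score(mark, ["Lyon"]))   # 0.0
```

A real deployment would also need to canonicalize answers (casing, whitespace) before hashing, since exact string matching is brittle for free-form model outputs.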

One to beat them all: "RYU" – a unifying framework for the construction of safe balls

  • paper_url: http://arxiv.org/abs/2312.00640
  • repo_url: None
  • paper_authors: Thu-Le Tran, Clément Elvira, Hong-Phuong Dang, Cédric Herzet
  • for: This paper introduces a new framework (named "RYU") for constructing "safe" balls, i.e., regions that provably contain the dual solution of a target optimization problem.
  • methods: The standard composite setup is considered, where the cost function is the sum of a closed, proper, convex Lipschitz-smooth function and a closed, proper, convex function.
  • results: The RYU framework is shown to generalize or improve upon all the results proposed over the last decade for this family of optimization problems.
    Abstract In this paper, we put forth a novel framework (named "RYU") for the construction of "safe" balls, i.e. regions that provably contain the dual solution of a target optimization problem. We concentrate on the standard setup where the cost function is the sum of two terms: a closed, proper, convex Lipschitz-smooth function and a closed, proper, convex function. The RYU framework is shown to generalize or improve upon all the results proposed in the last decade for the considered family of optimization problems.
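
For orientation, one classical instance of the kind of object being unified here is the gap-based safe ball for an $L$-smooth term; the sketch below is that textbook construction under the stated assumptions, not the RYU bound itself.

```latex
% Composite problem: f convex with L-Lipschitz gradient, g closed, proper, convex.
\min_{x \in \mathbb{R}^n} \; f(x) + g(x)
% Fenchel duality gives a dual solution u^\star. Since f is L-smooth, f^* is
% (1/L)-strongly convex, so the dual objective is (1/L)-strongly concave.
% For any primal-dual pair (x, u) with duality gap
%   G(x, u) = f(x) + g(x) + f^*(u) + g^*(-u),
% strong concavity yields the safe ball
\| u - u^\star \| \;\le\; \sqrt{2 L \, G(x, u)},
% i.e. u^\star provably lies in the ball of that radius centered at u.
```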

  • paper_url: http://arxiv.org/abs/2312.00626
  • repo_url: None
  • paper_authors: Joschka Herteux, Christoph Räth, Amine Baha, Giulia Martini, Duccio Piovani
  • for: This study develops a quantitative methodology for forecasting sub-national levels of food consumption 60 days ahead in four countries (Mali, Nigeria, Syria, and Yemen), as a building block for a global, data-driven early warning system.
  • methods: The methodology builds on publicly available data from the World Food Programme's integrated global hunger monitoring system, which provides daily updates on key food security metrics, conflict, weather events, and other drivers of food insecurity across 90 countries. ARIMA, XGBoost, LSTM, CNN, and Reservoir Computing (RC) models are compared on Root Mean Squared Error (RMSE).
  • results: Reservoir Computing emerges as particularly well suited to the food security setting, owing to its notable resistance to over-fitting on limited data and its efficient training, laying the groundwork for a global early warning system for food insecurity.
    Abstract Early warning systems are an essential tool for effective humanitarian action. Advance warnings on impending disasters facilitate timely and targeted response which help save lives, livelihoods, and scarce financial resources. In this work we present a new quantitative methodology to forecast levels of food consumption for 60 consecutive days, at the sub-national level, in four countries: Mali, Nigeria, Syria, and Yemen. The methodology is built on publicly available data from the World Food Programme's integrated global hunger monitoring system which collects, processes, and displays daily updates on key food security metrics, conflict, weather events, and other drivers of food insecurity across 90 countries (https://hungermap.wfp.org/). In this study, we assessed the performance of various models including ARIMA, XGBoost, LSTMs, CNNs, and Reservoir Computing (RC), by comparing their Root Mean Squared Error (RMSE) metrics. This comprehensive analysis spanned classical statistical, machine learning, and deep learning approaches. Our findings highlight Reservoir Computing as a particularly well-suited model in the field of food security given both its notable resistance to over-fitting on limited data samples and its efficient training capabilities. The methodology we introduce establishes the groundwork for a global, data-driven early warning system designed to anticipate and detect food insecurity.
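
Since Reservoir Computing carries the headline result here, the following is a minimal echo state network in NumPy, the simplest RC variant: the reservoir is random and fixed, and only a linear readout is fit by ridge regression, which is what makes RC cheap to train and resistant to over-fitting on short series. The toy series and all hyperparameter values are illustrative assumptions, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_in, n_res=300, spectral_radius=0.9, density=0.1):
    """Random, fixed reservoir weights; only the readout is trained."""
    W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < density)
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # echo-state scaling
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    return W_in, W

def run_reservoir(W_in, W, u):
    """Drive the reservoir with inputs u (T x n_in); collect states (T x n_res)."""
    x = np.zeros(W.shape[0])
    states = []
    for t in range(len(u)):
        x = np.tanh(W_in @ u[t] + W @ x)
        states.append(x.copy())
    return np.array(states)

# Toy one-step-ahead forecasting of a noisy periodic "consumption" series.
T = 500
series = np.sin(np.linspace(0, 30, T)) + 0.1 * rng.standard_normal(T)
u, y = series[:-1, None], series[1:]          # inputs and targets

W_in, W = make_reservoir(n_in=1)
X = run_reservoir(W_in, W, u)
ridge = 1e-6                                  # ridge regression readout
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)

pred = X @ W_out
print("train RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```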

Practical Path-based Bayesian Optimization

  • paper_url: http://arxiv.org/abs/2312.00622
  • repo_url: None
  • paper_authors: Jose Pablo Folch, James Odgers, Shiqiang Zhang, Robert M Lee, Behrang Shafei, David Walz, Calvin Tsay, Mark van der Wilk, Ruth Misener
  • for: This short paper addresses data-driven experimental design, with applications in chemical engineering and drug manufacturing.
  • methods: Bayesian optimization (BO) models the reactions of interest as expensive black-box functions; the SnAKe algorithm is extended to handle both the cost of the experiment itself and the cost of changing the input parameters simultaneously.
  • results: Further extensions are proposed for the case of a maximum allowable input change, as well as for the multi-objective setting.
    Abstract There has been a surge in interest in data-driven experimental design with applications to chemical engineering and drug manufacturing. Bayesian optimization (BO) has proven to be adaptable to such cases, since we can model the reactions of interest as expensive black-box functions. Sometimes, the cost of these black-box functions can be separated into two parts: (a) the cost of the experiment itself, and (b) the cost of changing the input parameters. In this short paper, we extend the SnAKe algorithm to deal with both types of costs simultaneously. We further propose extensions to the case of a maximum allowable input change, as well as to the multi-objective setting.

Investigating a domain adaptation approach for integrating different measurement instruments in a longitudinal clinical registry

  • paper_url: http://arxiv.org/abs/2312.00616
  • repo_url: None
  • paper_authors: Maren Hackenberg, Michelle Pfaffenlehner, Max Behrens, Astrid Pechmann, Janbernd Kirschner, Harald Binder
  • for: This study investigates how deep learning can map the items of different measurement instruments, used at different time points in a longitudinal clinical registry, to a joint latent representation.
  • methods: The approach adapts domain adaptation, a concept established for image data, to measurement-instrument data; trajectories in the latent space are modeled by ordinary differential equations (ODEs) whose person-specific parameters are inferred from baseline characteristics.
  • results: Domain adaptation proves feasible in a longitudinal cohort with rather few time points; incorporating penalty terms improves alignment, and some structure is recovered even in scenarios without a perfect mapping or where instrument availability depends on patient state. A reasonable mapping is also feasible in the more complex real SMA dataset.
    Abstract In a longitudinal clinical registry, different measurement instruments might have been used for assessing individuals at different time points. To combine them, we investigate deep learning techniques for obtaining a joint latent representation, to which the items of different measurement instruments are mapped. This corresponds to domain adaptation, an established concept in computer science for image data. Using the proposed approach as an example, we evaluate the potential of domain adaptation in a longitudinal cohort setting with a rather small number of time points, motivated by an application with different motor function measurement instruments in a registry of spinal muscular atrophy (SMA) patients. There, we model trajectories in the latent representation by ordinary differential equations (ODEs), where person-specific ODE parameters are inferred from baseline characteristics. The goodness of fit and complexity of the ODE solutions then allows to judge the measurement instrument mappings. We subsequently explore how alignment can be improved by incorporating corresponding penalty terms into model fitting. To systematically investigate the effect of differences between measurement instruments, we consider several scenarios based on modified SMA data, including scenarios where a mapping should be feasible in principle and scenarios where no perfect mapping is available. While misalignment increases in more complex scenarios, some structure is still recovered, even if the availability of measurement instruments depends on patient state. A reasonable mapping is feasible also in the more complex real SMA dataset. These results indicate that domain adaptation might be more generally useful in statistical modeling for longitudinal registry data.

Improving Plasticity in Online Continual Learning via Collaborative Learning

  • paper_url: http://arxiv.org/abs/2312.00600
  • repo_url: None
  • paper_authors: Maorong Wang, Nicolas Michel, Ling Xiao, Toshihiko Yamasaki
  • for: Learning ever-emerging new classification tasks from a continuous data stream, where each training sample is seen only once.
  • methods: Collaborative Continual Learning (CCL), a collaborative learning based strategy for improving the model's capability to acquire new concepts, together with Distillation Chain (DC), a novel collaborative learning scheme that boosts the training of the models.
  • results: Even when learners are well trained with state-of-the-art online CL methods, the strategy dramatically improves model plasticity and thereby overall performance by a large margin.
    Abstract Online Continual Learning (CL) solves the problem of learning the ever-emerging new classification tasks from a continuous data stream. Unlike its offline counterpart, in online CL, the training data can only be seen once. Most existing online CL research regards catastrophic forgetting (i.e., model stability) as almost the only challenge. In this paper, we argue that the model's capability to acquire new knowledge (i.e., model plasticity) is another challenge in online CL. While replay-based strategies have been shown to be effective in alleviating catastrophic forgetting, there is a notable gap in research attention toward improving model plasticity. To this end, we propose Collaborative Continual Learning (CCL), a collaborative learning based strategy to improve the model's capability in acquiring new concepts. Additionally, we introduce Distillation Chain (DC), a novel collaborative learning scheme to boost the training of the models. We adapted CCL-DC to existing representative online CL works. Extensive experiments demonstrate that even if the learners are well-trained with state-of-the-art online CL methods, our strategy can still improve model plasticity dramatically, and thereby improve the overall performance by a large margin.

Adaptive Parameter-Free Robust Learning using Latent Bernoulli Variables

  • paper_url: http://arxiv.org/abs/2312.00585
  • repo_url: https://github.com/akarakulev/rlvi
  • paper_authors: Aleksandr Karakulev, Dave Zachariah, Prashant Singh
  • for: Robust statistical learning from corrupted training sets.
  • methods: Latent Bernoulli variables identify corrupted and non-corrupted samples, casting robust learning as likelihood maximization with the latent variables marginalized out; the problem is solved via variational inference with an efficient Expectation-Maximization based method.
  • results: The approach improves over the state of the art by automatically inferring the corruption level and identifying outliers with minimal computational overhead, adapting to different noise levels and attaining high prediction accuracy across a variety of machine learning tasks, including online and deep learning.
    Abstract We present an efficient parameter-free approach for statistical learning from corrupted training sets. We identify corrupted and non-corrupted samples using latent Bernoulli variables, and therefore formulate the robust learning problem as maximization of the likelihood where latent variables are marginalized out. The resulting optimization problem is solved via variational inference using an efficient Expectation-Maximization based method. The proposed approach improves over the state-of-the-art by automatically inferring the corruption level and identifying outliers, while adding minimal computational overhead. We demonstrate our robust learning method on a wide variety of machine learning tasks including online learning and deep learning where it exhibits ability to adapt to different levels of noise and attain high prediction accuracy.
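
The latent-Bernoulli idea can be sketched generically: alternate between fitting a model under per-sample weights and updating each sample's posterior probability of being clean given its likelihood under the current model. The sketch below uses weighted logistic regression and models corrupted labels as uniform noise; these modeling choices are our assumptions for illustration, not the authors' exact variational updates (their implementation is at the repo_url above).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def robust_fit(X, y, n_iters=10, eps0=0.2):
    """Generic EM sketch with latent Bernoulli 'clean' indicators.
    E-step: posterior weight w_i = P(sample i clean | model likelihood).
    M-step: refit the model with samples weighted by w_i."""
    n = len(y)
    w = np.ones(n)            # start by trusting every sample
    eps = eps0                # current corruption-rate estimate
    clf = LogisticRegression()
    for _ in range(n_iters):
        clf.fit(X, y, sample_weight=w)             # M-step
        p = clf.predict_proba(X)[np.arange(n), y]  # per-sample likelihood
        # E-step: clean samples follow the model; corrupted labels are
        # modeled as uniform over the two classes (likelihood 0.5).
        w = (1 - eps) * p / ((1 - eps) * p + eps * 0.5)
        eps = 1 - w.mean()    # re-estimate the corruption level
    return clf, w, eps

# Toy data with 30% flipped labels.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(500) < 0.3
y_noisy = np.where(flip, 1 - y, y)

clf, w, eps = robust_fit(X, y_noisy)
print("estimated corruption:", round(eps, 2))   # close to 0.3
print("clean-set accuracy:", clf.score(X, y))
```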

Pathway to a fully data-driven geotechnics: lessons from materials informatics

  • paper_url: http://arxiv.org/abs/2312.00581
  • repo_url: None
  • paper_authors: Stephen Wu, Yu Otake, Yosuke Higo, Ikumasa Yoshida
  • for: This paper examines the challenges and opportunities of integrating data-driven methods into geotechnics, drawing inspiration from the success of materials informatics.
  • methods: The discussion centers on deep learning, in particular feature extraction from high-dimensional data and the potential of transfer learning, as levers for collaboration and innovation in geotechnics.
  • results: The paper anticipates a paradigm shift toward community-driven databases and open science, with advanced computational tools such as large language models reshaping geotechnics informatics.
    Abstract This paper elucidates the challenges and opportunities inherent in integrating data-driven methodologies into geotechnics, drawing inspiration from the success of materials informatics. Highlighting the intricacies of soil complexity, heterogeneity, and the lack of comprehensive data, the discussion underscores the pressing need for community-driven database initiatives and open science movements. By leveraging the transformative power of deep learning, particularly in feature extraction from high-dimensional data and the potential of transfer learning, we envision a paradigm shift towards a more collaborative and innovative geotechnics field. The paper concludes with a forward-looking stance, emphasizing the revolutionary potential brought about by advanced computational tools like large language models in reshaping geotechnics informatics.

Interior Point Constrained Reinforcement Learning with Global Convergence Guarantees

  • paper_url: http://arxiv.org/abs/2312.00561
  • repo_url: None
  • paper_authors: Tingting Ni, Maryam Kamgarpour
  • for: This paper addresses constrained Markov decision processes (CMDPs): finding an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints.
  • methods: A zeroth-order interior point approach based on the log barrier function of the CMDP keeps policies feasible while learning.
  • results: Under Fisher non-degeneracy and bounded transfer error of the policy parameterization, the algorithm guarantees feasibility of the policies throughout learning and converges to the optimal policy with sample complexity $O(\varepsilon^{-6})$; compared with the state-of-the-art policy gradient algorithm C-NPG-PDA, it requires an additional $O(\varepsilon^{-2})$ samples to ensure feasibility during learning under the same parameterization.
    Abstract We consider discounted infinite horizon constrained Markov decision processes (CMDPs) where the goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Motivated by the application of CMDPs in online learning of safety-critical systems, we focus on developing an algorithm that ensures constraint satisfaction during learning. To this end, we develop a zeroth-order interior point approach based on the log barrier function of the CMDP. Under the commonly assumed conditions of Fisher non-degeneracy and bounded transfer error of the policy parameterization, we establish the theoretical properties of the algorithm. In particular, in contrast to existing CMDP approaches that ensure policy feasibility only upon convergence, our algorithm guarantees feasibility of the policies during the learning process and converges to the optimal policy with a sample complexity of $O(\varepsilon^{-6})$. In comparison to the state-of-the-art policy gradient-based algorithm, C-NPG-PDA, our algorithm requires an additional $O(\varepsilon^{-2})$ samples to ensure policy feasibility during learning with same Fisher-non-degenerate parameterization.
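
The interior point idea admits a one-line surrogate: each constraint is replaced by a log barrier, so any ascent iterate on the surrogate is strictly feasible by construction. The notation below (value functions $V$, barrier weight $\eta$) is standard and meant only for orientation, not copied from the paper.

```latex
% CMDP: maximize the reward value subject to m cost constraints.
\max_{\theta}\; V_{r}(\pi_\theta)
\quad \text{s.t.}\quad V_{c_i}(\pi_\theta) \le b_i,\; i = 1,\dots,m
% Log-barrier surrogate: finite only on the strict interior of the
% feasible set, so every iterate of an ascent method stays feasible.
L_{\eta}(\theta) \;=\; V_{r}(\pi_\theta) \;+\; \eta \sum_{i=1}^{m}
\log\!\bigl(b_i - V_{c_i}(\pi_\theta)\bigr),
\qquad \eta \downarrow 0 \text{ recovers the constrained optimum.}
```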

A Preconditioned Interior Point Method for Support Vector Machines Using an ANOVA-Decomposition and NFFT-Based Matrix-Vector Products

  • paper_url: http://arxiv.org/abs/2312.00538
  • repo_url: https://github.com/wagnertheresa/nfftsvmipm
  • paper_authors: Theresa Wagner, John W. Pearson, Martin Stoll
  • for: The numerical solution of the soft-margin support vector machine optimization problem with large-scale kernel matrices.
  • methods: An NFFT-accelerated matrix-vector product based on an ANOVA decomposition of the feature space is used within an interior point method; the resulting saddle point linear systems are preconditioned via low-rank approximations of the kernel matrix and solved with a Krylov subspace solver.
  • results: The accuracy of the ANOVA-based kernel is compared with the default LIBSVM implementation, and the performance of different preconditioners is investigated on several large-scale datasets.
    Abstract In this paper we consider the numerical solution to the soft-margin support vector machine optimization problem. This problem is typically solved using the SMO algorithm, given the high computational complexity of traditional optimization algorithms when dealing with large-scale kernel matrices. In this work, we propose employing an NFFT-accelerated matrix-vector product using an ANOVA decomposition for the feature space that is used within an interior point method for the overall optimization problem. As this method requires the solution of a linear system of saddle point form we suggest a preconditioning approach that is based on low-rank approximations of the kernel matrix together with a Krylov subspace solver. We compare the accuracy of the ANOVA-based kernel with the default LIBSVM implementation. We investigate the performance of the different preconditioners as well as the accuracy of the ANOVA kernel on several large-scale datasets.
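
The preconditioning pattern (a low-rank kernel approximation inverted cheaply via the Woodbury identity and handed to a Krylov solver) can be illustrated on a simpler regularized kernel system. This is a generic sketch: the paper's actual systems are saddle-point matrices arising in the interior point method, and its kernel matrix-vector products are applied matrix-free via the ANOVA/NFFT transform rather than with a dense matrix as here.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)

# Toy RBF kernel matrix; in the paper the product K @ v would be applied
# matrix-free via the fast ANOVA/NFFT transform instead.
X = rng.standard_normal((1000, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)
lam = 1e-2
A = K + lam * np.eye(len(X))
b = rng.standard_normal(len(X))

# Rank-r Nystrom approximation K ~ C W^{-1} C^T from r random columns.
r = 50
idx = rng.choice(len(X), r, replace=False)
C, Wm = K[:, idx], K[np.ix_(idx, idx)]
L = np.linalg.cholesky(np.linalg.inv(Wm + 1e-8 * np.eye(r)))
U = C @ L                                   # so K ~ U U^T

# Woodbury identity: (U U^T + lam I)^{-1} applied cheaply per CG iteration.
S = np.linalg.inv(lam * np.eye(r) + U.T @ U)
def precond(v):
    return v / lam - (U @ (S @ (U.T @ v))) / lam

M = LinearOperator(A.shape, matvec=precond)
x, info = cg(A, b, M=M)
print("converged:", info == 0, " residual:", np.linalg.norm(A @ x - b))
```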

RIS-Based On-the-Air Semantic Communications – a Diffractional Deep Neural Network Approach

  • paper_url: http://arxiv.org/abs/2312.00535
  • repo_url: None
  • paper_authors: Shuyi Chen, Yingzhe Hui, Yifan Qin, Yueyi Yuan, Weixiao Meng, Xuewen Luo, Hsiao-Hwa Chen
  • for: This paper explores reconfigurable intelligent surface (RIS)-based on-the-air semantic communications as a path to highly efficient wireless communication.
  • methods: The scheme builds on on-the-air diffractional deep neural networks (D$^2$NN), so that the computational process occurs inherently as wireless signals pass through RISs.
  • results: A performance analysis using image transmission as an example shows appealing features compared with hardware-based approaches, including light-speed computation, low computational power requirements, and the ability to handle multiple tasks simultaneously.
    Abstract Semantic communication has gained significant attention recently due to its advantages in achieving higher transmission efficiency by focusing on semantic information instead of bit-level information. However, current AI-based semantic communication methods require digital hardware for implementation. With the rapid advancement on reconfigurable intelligence surfaces (RISs), a new approach called on-the-air diffractional deep neural networks (D$^2$NN) can be utilized to enable semantic communications on the wave domain. This paper proposes a new paradigm of RIS-based on-the-air semantic communications, where the computational process occurs inherently as wireless signals pass through RISs. We present the system model and discuss the data and control flows of this scheme, followed by a performance analysis using image transmission as an example. In comparison to traditional hardware-based approaches, RIS-based semantic communications offer appealing features, such as light-speed computation, low computational power requirements, and the ability to handle multiple tasks simultaneously.

Spatio-Temporal-Decoupled Masked Pre-training for Traffic Forecasting

  • paper_url: http://arxiv.org/abs/2312.00516
  • repo_url: https://github.com/jimmy-7664/std_mae
  • paper_authors: Haotian Gao, Renhe Jiang, Zheng Dong, Jinliang Deng, Xuan Song
  • for: Forecasting multivariate traffic flow time series, which exhibit substantial spatio-temporal heterogeneity and complex long-range correlative patterns.
  • methods: Two decoupled masked autoencoders reconstruct traffic data along the spatial and temporal axes via self-supervised pre-training, capturing long-range correlations in space and time separately; the learned hidden representations then augment a downstream spatio-temporal traffic predictor.
  • results: Quantitative and qualitative evaluations on four widely used traffic benchmarks (PEMS03, PEMS04, PEMS07, and PEMS08) show state-of-the-art performance, with STD-MAE explicitly enhancing the downstream model's ability to capture long-range, intricate spatial and temporal patterns.
    Abstract Accurate forecasting of multivariate traffic flow time series remains challenging due to substantial spatio-temporal heterogeneity and complex long-range correlative patterns. To address this, we propose Spatio-Temporal-Decoupled Masked Pre-training (STD-MAE), a novel framework that employs masked autoencoders to learn and encode complex spatio-temporal dependencies via pre-training. Specifically, we use two decoupled masked autoencoders to reconstruct the traffic data along spatial and temporal axes using a self-supervised pre-training approach. These mask reconstruction mechanisms capture the long-range correlations in space and time separately. The learned hidden representations are then used to augment the downstream spatio-temporal traffic predictor. A series of quantitative and qualitative evaluations on four widely-used traffic benchmarks (PEMS03, PEMS04, PEMS07, and PEMS08) are conducted to verify the state-of-the-art performance, with STD-MAE explicitly enhancing the downstream spatio-temporal models' ability to capture long-range intricate spatial and temporal patterns. Codes are available at https://github.com/Jimmy-7664/STD_MAE.

Bayesian causal discovery from unknown general interventions

  • paper_url: http://arxiv.org/abs/2312.00509
  • repo_url: https://github.com/alesmascaro/bcd-ugi
  • paper_authors: Alessandro Mascaro, Federico Castelletti
  • for: Learning causal Directed Acyclic Graphs (DAGs) from combinations of observational and interventional experimental data.
  • methods: A Bayesian method for causal discovery from general interventions with unknown targets, allowing modifications of the parent sets of the intervened nodes; graphical characterizations of interventional Markov equivalence are provided, together with compatible priors that guarantee score equivalence of indistinguishable structures.
  • results: A Markov Chain Monte Carlo (MCMC) algorithm approximates the posterior distribution over DAGs, intervention targets, and induced parent sets; the methodology is evaluated on both simulated and real protein expression data.
    Abstract We consider the problem of learning causal Directed Acyclic Graphs (DAGs) using combinations of observational and interventional experimental data. Current methods tailored to this setting assume that interventions either destroy parent-child relations of the intervened (target) nodes or only alter such relations without modifying the parent sets, even when the intervention targets are unknown. We relax this assumption by proposing a Bayesian method for causal discovery from general interventions, which allow for modifications of the parent sets of the unknown targets. Even in this framework, DAGs and general interventions may be identifiable only up to some equivalence classes. We provide graphical characterizations of such interventional Markov equivalence and devise compatible priors for Bayesian inference that guarantee score equivalence of indistinguishable structures. We then develop a Markov Chain Monte Carlo (MCMC) scheme to approximate the posterior distribution over DAGs, intervention targets and induced parent sets. Finally, we evaluate the proposed methodology on both simulated and real protein expression data.

VEXIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity

  • paper_url: http://arxiv.org/abs/2312.00507
  • repo_url: None
  • paper_authors: S. VenkataKeerthy, Yashas Andaluri, Sayan Dey, Soumya Banerjee, Ramakrishna Upadrasta
  • for: This work proposes a VEX IR based function embedding framework for finding similar functions in binaries.
  • methods: VEX IR, the intermediate representation used by binary analysis tools such as Valgrind and angr, serves as an architecture-neutral representation encoding both syntactic and semantic information; POV, a custom peephole optimization engine, normalizes the VEX IR (copy/constant propagation, constant folding, common subexpression elimination, load-store elimination) for effective similarity analysis.
  • results: On diffing and searching experiments over binaries spanning different architectures, compilers and versions, optimization sequences, and obfuscations, the framework achieves superior precision and recall over the state of the art, with about $3.2\times$ speedup over the closest competitor; it is built as a scalable, multi-threaded, parallel library using only open-source tools.
    Abstract We propose VEXIR2Vec, a code embedding framework for finding similar functions in binaries. Our representations rely on VEX IR, the intermediate representation used by binary analysis tools like Valgrind and angr. Our proposed embeddings encode both syntactic and semantic information to represent a function, and is both application and architecture independent. We also propose POV, a custom Peephole Optimization engine that normalizes the VEX IR for effective similarity analysis. We design several optimizations like copy/constant propagation, constant folding, common subexpression elimination and load-store elimination in POV. We evaluate our framework on two experiments -- diffing and searching -- involving binaries targeting different architectures, compiled using different compilers and versions, optimization sequences, and obfuscations. We show results on several standard projects and on real-world vulnerabilities. Our results show that VEXIR2Vec achieves superior precision and recall values compared to the state-of-the-art works. Our framework is highly scalable and is built as a multi-threaded, parallel library by only using open-source tools. VEXIR2Vec achieves about $3.2 \times$ speedup on the closest competitor, and orders-of-magnitude speedup on other tools.

On the Out-Of-Distribution Robustness of Self-Supervised Representation Learning for Phonocardiogram Signals

  • paper_url: http://arxiv.org/abs/2312.00502
  • repo_url: https://github.com/aristotelisballas/listen2yourheart
  • paper_authors: Aristotelis Ballas, Vasileios Papapanagiotou, Christos Diou
  • for: This work addresses a key obstacle to the adoption of deep learning in medicine: the shortage of high-quality annotated data, which hinders the development of robust models that generalize to newly collected, out-of-distribution (OOD) data.
  • methods: Contrastive self-supervised learning (SSL) leverages unlabeled data to learn a generalized representation of phonocardiogram (PCG) signals for abnormality detection; a wide range of audio-based augmentations is compared, and trained classifiers are evaluated on multiple datasets across different downstream tasks.
  • results: Depending on its training distribution, a fully supervised model can lose up to 32% effectiveness on unseen data, whereas SSL models lose at most 10% and in some cases even improve; contrastive SSL pretraining thus yields robust classifiers without labor-intensive expert annotation, and the evaluation protocol identifies the most promising augmentations for robust PCG processing.
    Abstract Objective: Despite the recent increase in research activity, deep-learning models have not yet been widely accepted in medicine. The shortage of high-quality annotated data often hinders the development of robust and generalizable models, which do not suffer from degraded effectiveness when presented with newly-collected, out-of-distribution (OOD) datasets. Methods: Contrastive Self-Supervised Learning (SSL) offers a potential solution to the scarcity of labeled data as it takes advantage of unlabeled data to increase model effectiveness and robustness. In this research, we propose applying contrastive SSL for detecting abnormalities in phonocardiogram (PCG) samples by learning a generalized representation of the signal. Specifically, we perform an extensive comparative evaluation of a wide range of audio-based augmentations and evaluate trained classifiers on multiple datasets across different downstream tasks. Results: We experimentally demonstrate that, depending on its training distribution, the effectiveness of a fully-supervised model can degrade up to 32% when evaluated on unseen data, while SSL models only lose up to 10% or even improve in some cases. Conclusions: Contrastive SSL pretraining can assist in providing robust classifiers which can generalize to unseen, OOD data, without relying on time- and labor-intensive annotation processes by medical experts. Furthermore, the proposed extensive evaluation protocol sheds light on the most promising and appropriate augmentations for robust PCG signal processing. Significance: We provide researchers and practitioners with a roadmap towards producing robust models for PCG classification, in addition to an open-source codebase for developing novel approaches.
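
For reference, the contrastive objective at the heart of such SSL pipelines is compact; below is a minimal NT-Xent (normalized temperature-scaled cross-entropy) loss in PyTorch over a batch of paired augmented views. The temperature value and toy shapes are assumptions; the paper's contribution lies in the audio augmentations and OOD evaluation protocol, which this sketch leaves abstract.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """Contrastive loss over a batch of embedding pairs.
    z1[i] and z2[i] are two augmented views of the same PCG segment;
    all other samples in the batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2N, d), unit norm
    sim = z @ z.T / temperature                   # cosine similarities
    n = len(z1)
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage: 8 segments, embedding dim 32 (illustrative shapes only).
z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
print(nt_xent(z1, z2))  # scalar loss; minimized when views of the same
                        # segment embed closer together than all others
```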

REDUCR: Robust Data Downsampling Using Class Priority Reweighting

  • paper_url: http://arxiv.org/abs/2312.00486
  • repo_url: None
  • paper_authors: William Bankes, George Hughes, Ilija Bogunovic, Zi Wang
  • for: Reducing the cost of training models on real-world image and text classification tasks, where massive web-scale data arrives in a streaming fashion.
  • methods: REDUCR, a robust and efficient data downsampling method that assigns priority weights to datapoints in a class-aware manner using an online learning algorithm, reducing the training data while preserving worst-class generalization performance.
  • results: On vision and text classification tasks with imbalanced, web-scraped data, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.
    Abstract Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance. REDUCR assigns priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.
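
The class-priority idea can be sketched with two small routines: score candidate points by a class-weighted loss and keep the top-k, then nudge the class weights multiplicatively toward the currently worst-performing classes. Both the selection score and the multiplicative-weights update below are our simplifications for illustration, not the exact REDUCR rules.

```python
import numpy as np

def select_batch(losses, labels, class_w, k):
    """Keep the k candidate points whose class-weighted loss is largest."""
    scores = class_w[labels] * losses
    return np.argsort(scores)[-k:]

def update_class_weights(class_w, losses, labels, lr=0.1):
    """Multiplicative-weights step: classes currently suffering higher
    average loss get higher priority in future selections."""
    n_classes = len(class_w)
    avg = np.array([losses[labels == c].mean() if (labels == c).any() else 0.0
                    for c in range(n_classes)])
    class_w = class_w * np.exp(lr * avg)
    return class_w / class_w.sum() * n_classes  # keep weights normalized

# Toy stream step: 256 candidate points, 10 classes, keep 64.
rng = np.random.default_rng(0)
losses = rng.exponential(1.0, 256)
labels = rng.integers(0, 10, 256)
class_w = np.ones(10)

kept = select_batch(losses, labels, class_w, k=64)
class_w = update_class_weights(class_w, losses, labels)
# `kept` indexes the retained training subset; over the stream,
# class_w drifts toward the worst-performing classes.
```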

Backbone-based Dynamic Graph Spatio-Temporal Network for Epidemic Forecasting

  • paper_url: http://arxiv.org/abs/2312.00485
  • repo_url: None
  • paper_authors: Junkai Mao, Yuexing Han, Gouhei Tanaka, Bing Wang
  • for: This paper proposes BDGSTN, a model for accurate epidemic forecasting that exploits the continuous, smooth evolution of graph structure by fusing static and dynamic graphs to capture the spatio-temporal characteristics of epidemics.
  • methods: Adaptive methods generate a static backbone graph containing the primary information, temporal models generate dynamic temporal graphs of the epidemic data, and the two are fused into a backbone-based dynamic graph; a linear DLinear model handles temporal dependencies and is combined with dynamic graph convolution for forecasting, avoiding the error accumulation and computational cost of recurrent structures.
  • results: On two datasets, BDGSTN outperforms baseline models, with ablations verifying the effectiveness of the model components; information metrics quantify the significance of the backbone and temporal graphs, and comparisons of parameter volume and training time confirm the model's favorable complexity and efficiency.
    Abstract Accurate epidemic forecasting is a critical task in controlling disease transmission. Many deep learning-based models focus only on static or dynamic graphs when constructing spatial information, ignoring their relationship. Additionally, these models often rely on recurrent structures, which can lead to error accumulation and computational time consumption. To address the aforementioned problems, we propose a novel model called Backbone-based Dynamic Graph Spatio-Temporal Network (BDGSTN). Intuitively, the continuous and smooth changes in graph structure, make adjacent graph structures share a basic pattern. To capture this property, we use adaptive methods to generate static backbone graphs containing the primary information and temporal models to generate dynamic temporal graphs of epidemic data, fusing them to generate a backbone-based dynamic graph. To overcome potential limitations associated with recurrent structures, we introduce a linear model DLinear to handle temporal dependencies and combine it with dynamic graph convolution for epidemic forecasting. Extensive experiments on two datasets demonstrate that BDGSTN outperforms baseline models and ablation comparison further verifies the effectiveness of model components. Furthermore, we analyze and measure the significance of backbone and temporal graphs by using information metrics from different aspects. Finally, we compare model parameter volume and training time to confirm the superior complexity and efficiency of BDGSTN.

MultiView Independent Component Analysis with Delays

  • paper_url: http://arxiv.org/abs/2312.00484
  • repo_url: None
  • paper_authors: Ambroise Heurtebise, Pierre Ablin, Alexandre Gramfort
  • for: Improving the quality of independent source separation when multiple views of the same sources are available.
  • methods: MultiView Independent Component Analysis with Delays (MVICAD) extends the MultiView ICA model by allowing sources to be delayed versions of shared sources, with view- and source-specific latencies.
  • results: Simulations show that MVICAD yields better unmixing of the sources; applied to Cam-CAN, a large-scale magnetoencephalography (MEG) dataset, the latencies turn out to be age-related, demonstrating that MVICAD can reveal rich effects on neural signals without human supervision.
    Abstract Linear Independent Component Analysis (ICA) is a blind source separation technique that has been used in various domains to identify independent latent sources from observed signals. In order to obtain a higher signal-to-noise ratio, the presence of multiple views of the same sources can be used. In this work, we present MultiView Independent Component Analysis with Delays (MVICAD). This algorithm builds on the MultiView ICA model by allowing sources to be delayed versions of some shared sources: sources are shared across views up to some unknown latencies that are view- and source-specific. Using simulations, we demonstrate that MVICAD leads to better unmixing of the sources. Moreover, as ICA is often used in neuroscience, we show that latencies are age-related when applied to Cam-CAN, a large-scale magnetoencephalography (MEG) dataset. These results demonstrate that the MVICAD model can reveal rich effects on neural signals without human supervision.

Interpretable Meta-Learning of Physical Systems

  • paper_url: http://arxiv.org/abs/2312.00477
  • repo_url: None
  • paper_authors: Matthieu Blanke, Marc Lelarge
  • for: This paper studies how machine learning methods can learn from data collected under inhomogeneous experimental conditions.
  • methods: In contrast to meta-learning approaches built on black-box neural networks, which incur high computational costs and limited interpretability, a simpler learning model is proposed whose structure is affine with respect to the learning task, enabling multi-environment generalization.
  • results: The architecture provably identifies the physical parameters of the system, enabling interpretable learning; comparisons with state-of-the-art algorithms on physical systems, from toy models to complex non-analytical systems, show competitive generalization at low computational cost, with original applications to physical-parameter-induced adaptation and adaptive control.
    Abstract Machine learning methods can be a valuable aid in the scientific process, but they need to face challenging settings where data come from inhomogeneous experimental conditions. Recent meta-learning methods have made significant progress in multi-task learning, but they rely on black-box neural networks, resulting in high computational costs and limited interpretability. Leveraging the structure of the learning problem, we argue that multi-environment generalization can be achieved using a simpler learning model, with an affine structure with respect to the learning task. Crucially, we prove that this architecture can identify the physical parameters of the system, enabling interpretable learning. We demonstrate the competitive generalization performance and the low computational cost of our method by comparing it to state-of-the-art algorithms on physical systems, ranging from toy models to complex, non-analytical systems. The interpretability of our method is illustrated with original applications to physical-parameter-induced adaptation and to adaptive control.

Auto-encoding GPS data to reveal individual and collective behaviour

  • paper_url: http://arxiv.org/abs/2312.00456
  • repo_url: None
  • paper_authors: Saint-Clair Chabert-Liddell, Nicolas Bez, Pierre Gloaguen, Sophie Donnet, Stéphanie Mahévas
  • for: This study analyzes the individual and collective behaviour of fishing vessels by building a low-dimensional latent representation of GPS trajectories with convolutional neural networks.
  • methods: A conditional variational autoencoder, taking covariates into account, maps trajectories to a low-dimensional latent space; latent distributions of trajectories are compared with the Bhattacharyya coefficient, and collective behaviour is analyzed by building proximity graphs and applying an extension of the stochastic block model for multiple networks, clustering individuals by their sets of trajectories.
  • results: Applied to French fishing vessels over 2014-2018, the approach yields groups of vessels whose individual and collective behaviours exhibit spatio-temporal patterns.
    Abstract We propose an innovative and generic methodology to analyse individual and collective behaviour through individual trajectory data. The work is motivated by the analysis of GPS trajectories of fishing vessels collected from regulatory tracking data in the context of marine biodiversity conservation and ecosystem-based fisheries management. We build a low-dimensional latent representation of trajectories using convolutional neural networks as non-linear mapping. This is done by training a conditional variational auto-encoder taking into account covariates. The posterior distributions of the latent representations can be linked to the characteristics of the actual trajectories. The latent distributions of the trajectories are compared with the Bhattacharyya coefficient, which is well-suited for comparing distributions. Using this coefficient, we analyse the variation of the individual behaviour of each vessel during time. For collective behaviour analysis, we build proximity graphs and use an extension of the stochastic block model for multiple networks. This model results in a clustering of the individuals based on their set of trajectories. The application to French fishing vessels enables us to obtain groups of vessels whose individual and collective behaviours exhibit spatio-temporal patterns over the period 2014-2018.
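
The Bhattacharyya coefficient used to compare latent trajectory distributions is a simple overlap measure, $BC(p, q) = \sum_i \sqrt{p_i q_i} \in [0, 1]$ in the discrete case. Below is a minimal version for histograms over latent-space bins; whether the paper applies it to histograms or to parametric posteriors, the quantity itself is the same.

```python
import numpy as np

def bhattacharyya(p, q):
    """Overlap between two discrete distributions: 1 = identical, 0 = disjoint."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()   # normalize to probability vectors
    return float(np.sqrt(p * q).sum())

# Two histograms over the same latent-space bins (toy values).
print(bhattacharyya([0.2, 0.5, 0.3], [0.2, 0.5, 0.3]))  # 1.0
print(bhattacharyya([1, 0, 0], [0, 0, 1]))              # 0.0 (no overlap)
print(bhattacharyya([0.6, 0.4, 0.0], [0.3, 0.4, 0.3]))  # strictly between 0 and 1
```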

From Mutual Information to Expected Dynamics: New Generalization Bounds for Heavy-Tailed SGD

  • paper_url: http://arxiv.org/abs/2312.00427
  • repo_url: None
  • paper_authors: Benjamin Dupuis, Paul Viallard
  • for: This work aims to explain the generalization abilities of modern machine learning algorithms, focusing on the relation between the learning dynamics of SGD and heavy-tailed dynamics.
  • methods: Building on the use of heavy-tailed dynamics in generalization theory, a geometric decoupling term is introduced that compares the learning dynamics (driven by the empirical risk) with expected dynamics (driven by the population risk); this term is upper-bounded using techniques from the heavy-tailed and fractal literature, making it fully computable.
  • results: Generalization bounds over the trajectory of a class of heavy-tailed dynamics are proved without mutual information terms; a PAC-Bayesian setting based on perturbed dynamics is further proposed to tighten the bounds, in which the same geometric term plays a crucial role.
    Abstract Understanding the generalization abilities of modern machine learning algorithms has been a major research topic over the past decades. In recent years, the learning dynamics of Stochastic Gradient Descent (SGD) have been related to heavy-tailed dynamics. This has been successfully applied to generalization theory by exploiting the fractal properties of those dynamics. However, the derived bounds depend on mutual information (decoupling) terms that are beyond the reach of computability. In this work, we prove generalization bounds over the trajectory of a class of heavy-tailed dynamics, without those mutual information terms. Instead, we introduce a geometric decoupling term by comparing the learning dynamics (depending on the empirical risk) with an expected one (depending on the population risk). We further upper-bound this geometric term, by using techniques from the heavy-tailed and the fractal literature, making it fully computable. Moreover, as an attempt to tighten the bounds, we propose a PAC-Bayesian setting based on perturbed dynamics, in which the same geometric term plays a crucial role and can still be bounded using the techniques described above.

A framework for mining lifestyle profiles through multi-dimensional and high-order mobility feature clustering

  • paper_url: http://arxiv.org/abs/2312.00411
  • repo_url: None
  • paper_authors: Yeshuo Shu, Gangcheng Zhang, Keyi Liu, Jintong Tang, Liyan Xu
  • for: This study aims to reveal lifestyle profiles through high-order feature extraction and clustering of human mobility records.
  • methods: A progressive feature extraction strategy mines high-order mobility features from users' moving trajectories along the spatial, temporal, and semantic dimensions: travel motifs, rhythms decomposed by the discrete Fourier transform (DFT) of mobility time series, and place semantics vectorized with word2vec; the features are then clustered to reveal lifestyle characteristics.
  • results: An experiment on a trajectory dataset of over 500k users in Shenzhen, China yields seven user clusters with distinct lifestyle profiles that are well interpretable by common sense, suggesting the feasibility of fine-grained user profiling through cross-order trajectory feature engineering and clustering.
    Abstract Human mobility demonstrates a high degree of regularity, which facilitates the discovery of lifestyle profiles. Existing research has yet to fully utilize the regularities embedded in high-order features extracted from human mobility records in such profiling. This study proposes a progressive feature extraction strategy that mines high-order mobility features from users' moving trajectory records from the spatial, temporal, and semantic dimensions. Specific features are extracted such as travel motifs, rhythms decomposed by discrete Fourier transform (DFT) of mobility time series, and vectorized place semantics by word2vec, respectively to the three dimensions, and they are further clustered to reveal the users' lifestyle characteristics. An experiment using a trajectory dataset of over 500k users in Shenzhen, China yields seven user clusters with different lifestyle profiles that can be well interpreted by common sense. The results suggest the possibility of fine-grained user profiling through cross-order trajectory feature engineering and clustering.
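
The "rhythm" features are worth unpacking: taking the DFT of a mobility time series and reading off amplitude and phase at interpretable frequencies (daily, weekly) yields a compact temporal descriptor. The toy series below is our assumption; it illustrates the mechanics, not the paper's exact feature definition.

```python
import numpy as np

# Hourly "away-from-home" indicator over 4 weeks (toy data: a 9-to-18
# weekday routine, giving both a daily and a weekly rhythm).
hours = np.arange(24 * 7 * 4)
daily = ((hours % 24 >= 9) & (hours % 24 < 18)).astype(float)
weekday = ((hours // 24) % 7 < 5).astype(float)
series = daily * weekday

spec = np.fft.rfft(series - series.mean())
freqs = np.fft.rfftfreq(len(series), d=1.0)  # cycles per hour

for name, f in [("daily", 1 / 24), ("weekly", 1 / (24 * 7))]:
    k = np.argmin(abs(freqs - f))
    amp = 2 * abs(spec[k]) / len(series)     # strength of the rhythm
    phase = np.angle(spec[k])                # timing of the rhythm
    print(f"{name}: amplitude={amp:.3f}, phase={phase:.2f} rad")

# The (amplitude, phase) pairs across a few such frequencies form the
# temporal part of a feature vector that can later be clustered.
```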

A Causality-Aware Pattern Mining Scheme for Group Activity Recognition in a Pervasive Sensor Space

  • paper_url: http://arxiv.org/abs/2312.00404
  • repo_url: None
  • paper_authors: Hyunju Kim, Heesuk Son, Dongman Lee
  • for: This work proposes an efficient group activity recognition scheme for smart spaces, supporting human activity recognition without user identification or privacy and accessibility issues.
  • methods: Causality patterns are extracted from pervasive sensor event sequences: a set of rules filters out irrelevant noise events by highlighting causally related ones, a pattern-tree algorithm extracts frequent causal patterns via a growing tree structure, and a weighted sum-based pattern matching algorithm computes the likelihoods of stored group activities for a given test event sequence.
  • results: Evaluations on a testbed and the CASAS datasets, where users perform daily tasks, show higher recognition accuracy and a small runtime overhead compared with existing schemes.
    Abstract Human activity recognition (HAR) is a key challenge in pervasive computing and its solutions have been presented based on various disciplines. Specifically, for HAR in a smart space without privacy and accessibility issues, data streams generated by deployed pervasive sensors are leveraged. In this paper, we focus on a group activity by which a group of users perform a collaborative task without user identification and propose an efficient group activity recognition scheme which extracts causality patterns from pervasive sensor event sequences generated by a group of users to support as good recognition accuracy as the state-of-the-art graphical model. To filter out irrelevant noise events from a given data stream, a set of rules is leveraged to highlight causally related events. Then, a pattern-tree algorithm extracts frequent causal patterns by means of a growing tree structure. Based on the extracted patterns, a weighted sum-based pattern matching algorithm computes the likelihoods of stored group activities to the given test event sequence by means of matched event pattern counts for group activity recognition. We evaluate the proposed scheme using the data collected from our testbed and CASAS datasets where users perform their tasks on a daily basis and validate its effectiveness in a real environment. Experiment results show that the proposed scheme performs higher recognition accuracy and with a small amount of runtime overhead than the existing schemes.

GFN-SR: Symbolic Regression with Generative Flow Networks

  • paper_url: http://arxiv.org/abs/2312.00396
  • repo_url: None
  • paper_authors: Sida Li, Ioana Marinescu, Sebastian Musslick
  • for: Symbolic regression: identifying mathematical expressions, often composed of simple functions, that best fit a given set of covariates and responses, represented as expression trees.
  • methods: Construction of an expression tree is modeled as traversal of a directed acyclic graph (DAG), so that a GFlowNet can learn a stochastic policy to generate such trees sequentially; an adaptive reward baseline further encourages a diverse set of best-fitting expressions.
  • results: GFN-SR outperforms other symbolic regression algorithms in noisy data regimes, owing to its ability to learn a distribution of rewards over a space of candidate solutions.
    Abstract Symbolic regression (SR) is an area of interpretable machine learning that aims to identify mathematical expressions, often composed of simple functions, that best fit in a given set of covariates $X$ and response $y$. In recent years, deep symbolic regression (DSR) has emerged as a popular method in the field by leveraging deep reinforcement learning to solve the complicated combinatorial search problem. In this work, we propose an alternative framework (GFN-SR) to approach SR with deep learning. We model the construction of an expression tree as traversing through a directed acyclic graph (DAG) so that GFlowNet can learn a stochastic policy to generate such trees sequentially. Enhanced with an adaptive reward baseline, our method is capable of generating a diverse set of best-fitting expressions. Notably, we observe that GFN-SR outperforms other SR algorithms in noisy data regimes, owing to its ability to learn a distribution of rewards over a space of candidate solutions.

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

  • paper_url: http://arxiv.org/abs/2312.00388
  • repo_url: None
  • paper_authors: Junchen Zhao, Yurun Song, Simeng Liu, Ian G. Harris, Sangeetha Abdu Jyothi
  • for: This paper addresses deploying large language models (LLMs) locally on mobile devices, whose extensive memory requirements make on-device inference challenging.
  • methods: LinguaLinked rests on three strategies: an optimized model assignment technique segments the LLM and uses linear optimization to align segments with each device's capabilities; an optimized data transmission mechanism keeps data flow efficient and structured while preserving the original model structure; and a runtime load balancer monitors and redistributes tasks among devices to prevent bottlenecks.
  • results: Across high-end to low-end Android devices, LinguaLinked accelerates inference by $1.11\times$ to $1.61\times$ in single-threaded settings and $1.73\times$ to $2.65\times$ with multi-threading, with runtime load balancing contributing a further $1.29\times$ to $1.32\times$, while maintaining consistent throughput and minimal latency.
    Abstract Deploying Large Language Models (LLMs) locally on mobile devices presents a significant challenge due to their extensive memory requirements. In this paper, we introduce LinguaLinked, a system for decentralized, distributed LLM inference on mobile devices. LinguaLinked enables collaborative execution of the inference task across multiple trusted devices. LinguaLinked ensures data privacy by processing information locally. LinguaLinked uses three key strategies. First, an optimized model assignment technique segments LLMs and uses linear optimization to align segments with each device's capabilities. Second, an optimized data transmission mechanism ensures efficient and structured data flow between model segments while also maintaining the integrity of the original model structure. Finally, LinguaLinked incorporates a runtime load balancer that actively monitors and redistributes tasks among mobile devices to prevent bottlenecks, enhancing the system's overall efficiency and responsiveness. We demonstrate that LinguaLinked facilitates efficient LLM inference while maintaining consistent throughput and minimal latency through extensive testing across various mobile devices, from high-end to low-end Android devices. In our evaluations, compared to the baseline, LinguaLinked achieves an inference performance acceleration of $1.11\times$ to $1.61\times$ in single-threaded settings, $1.73\times$ to $2.65\times$ with multi-threading. Additionally, runtime load balancing yields an overall inference acceleration of $1.29\times$ to $1.32\times$.

Optimal Sample Complexity of Contrastive Learning

  • paper_url: http://arxiv.org/abs/2312.00379
  • repo_url: None
  • paper_authors: Noga Alon, Dmitrii Avdiukhin, Dor Elboim, Orr Fischer, Grigory Yaroslavtsev
  • for: This paper studies contrastive learning from labeled tuples that specify distance relations among data points, i.e., learning representations that match those relations.
  • methods: The sample complexity of contrastive learning is analyzed via the Vapnik-Chervonenkis/Natarajan dimension of the associated problems, covering arbitrary distance functions, general $\ell_p$-distances, and tree metrics.
  • results: Tight bounds on the sample complexity are given in a variety of settings; in particular, for any $p \ge 1$, $\tilde \Theta(\min(nd,n^2))$ labeled tuples are necessary and sufficient for learning $d$-dimensional representations of $n$-point datasets, and these theoretical bounds are shown to have strong predictive power for experimental results.
    Abstract Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, both general $\ell_p$-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning $\ell_p$-distances for integer $p$. For any $p \ge 1$ we show that $\tilde \Theta(\min(nd,n^2))$ labeled tuples are necessary and sufficient for learning $d$-dimensional representations of $n$-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between the statistical learning theory and the practice of deep learning.

Streaming Bayesian Modeling for predicting Fat-Tailed Customer Lifetime Value

  • paper_url: http://arxiv.org/abs/2312.00373
  • repo_url: None
  • paper_authors: Alexey V. Calabourdin, Konstantin A. Aksenov
  • for: Applying online-learning MCMC to hierarchical Bayesian models and GLMs.
  • methods: An online-learning MCMC approach applicable to hierarchical Bayesian models and GLMs is developed, together with a fat-tailed LTV (customer lifetime value) model that generalizes over several kinds of fat and thin tails.
  • results: Both developments are demonstrated on commercial LTV data from a large mobile app.
    Abstract We develop an online learning MCMC approach applicable to hierarchical Bayesian models and GLMs. We also develop a fat-tailed LTV model that generalizes over several kinds of fat and thin tails. We demonstrate both developments on commercial LTV data from a large mobile app.

Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training

  • paper_url: http://arxiv.org/abs/2312.00359
  • repo_url: https://github.com/yefanzhou/tempbalance
  • paper_authors: Yefan Zhou, Tianyu Pang, Keqin Liu, Charles H. Martin, Michael W. Mahoney, Yaoqing Yang
  • for: This paper focuses on improving the training of neural networks by using a layer-wise learning rate method called TempBalance, which is based on Heavy-Tailed Self-Regularization (HT-SR) theory.
  • methods: The paper proposes using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing.
  • results: The paper shows that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization on several benchmark datasets, including CIFAR10, CIFAR100, SVHN, and TinyImageNet, using ResNets, VGGs, and WideResNets with various depths and widths. Additionally, TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
    Abstract Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
    摘要 正则化在现代机器学习中至关重要,它可以体现在算法设计的多个方面:训练集、模型族、误差函数、正则化项以及优化器。其中,学习率可以被解释为学习统计力学中的"温度"参数,在神经网络训练中扮演关键角色。实际上,许多广泛采用的训练策略本质上只是定义学习率随时间的衰减方式。这一过程可以被看作降温,既可以使用全局学习率(作用于整个模型),也可以为每个参数使用不同的学习率。本文提出了 TempBalance,一种简单而有效的分层学习率方法。TempBalance 基于重尾自正则化(HT-SR)理论,该理论刻画了已训练模型中不同层的隐式自正则化程度。我们证明,利用 HT-SR 启发的指标在训练过程中调度并平衡各层的"温度",可以提升测试性能。我们在 CIFAR10、CIFAR100、SVHN 和 TinyImageNet 数据集上,使用不同深度和宽度的 ResNet、VGG 和 WideResNet 实现了 TempBalance。结果表明,TempBalance 显著优于普通 SGD 以及精心调参的谱范数正则化,同时也优于多种最先进的优化器和学习率调度器。
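A minimal sketch of the layer-wise idea, assuming a Hill-type estimate of the heavy-tail exponent of each layer's eigenspectrum and a simple linear rescaling of the base learning rate; the exact HT-SR metric and schedule used in the paper may differ.

```python
import numpy as np

def layer_alpha(W, k_frac=0.5):
    """Hill estimator of the power-law exponent of a layer's eigenspectrum (HT-SR metric)."""
    ev = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)[::-1]
    k = max(2, int(k_frac * len(ev)))
    tail = ev[:k]
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

def tempbalance_lrs(weights, base_lr=0.1, s=0.5):
    """Per-layer LRs scaled around base_lr: layers with larger alpha (lighter tails,
    presumed under-trained) get a higher 'temperature'. The linear scaling rule is
    an illustrative assumption, not the paper's exact schedule."""
    alphas = np.array([layer_alpha(W) for W in weights])
    z = (alphas - alphas.mean()) / (alphas.std() + 1e-8)
    return base_lr * (1.0 + s * z)

layers = [np.random.randn(256, 256) / 16 for _ in range(4)]
print(tempbalance_lrs(layers))
```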

Transfer learning for predicting source terms of principal component transport in chemically reactive flow

  • paper_url: http://arxiv.org/abs/2312.00356
  • repo_url: None
  • paper_authors: Ki Sung Jung, Tarek Echekki, Jacqueline H. Chen, Mohammad Khalil
  • for: 本研究旨在评估使用多种迁移学习模型能否减少所需的训练样本数量,用于预测表示氢/空气混合物均匀点火过程的数据驱动降阶模型中的化学源项。
  • methods: 本研究使用主成分分析降低数据维度,并用人工神经网络(ANN)对主成分的反应速率进行制表,随后求解一组常微分方程。
  • results: 当目标任务(即 T0 > 1000 K 和多种 phi)中的训练样本数量减少时,降阶模型无法预测氢/空气混合物的点燃进程。随后将三种迁移学习策略应用于 ANN 模型的训练。通过控制 ANN 模型的初始化和正则化,可以显著提升降阶模型的性能;此外,当源任务与目标任务相似度较低时,改变 ANN 模型的初始化方案还可以带来额外的性能提升。
    Abstract The objective of this study is to evaluate whether the number of requisite training samples can be reduced with the use of various transfer learning models for predicting, for example, the chemical source terms of the data-driven reduced-order model that represents the homogeneous ignition process of a hydrogen/air mixture. Principal component analysis is applied to reduce the dimensionality of the hydrogen/air mixture in composition space. Artificial neural networks (ANNs) are used to tabulate the reaction rates of principal components, and subsequently, a system of ordinary differential equations is solved. As the number of training samples decreases at the target task (i.e.,for T0 > 1000 K and various phi), the reduced-order model fails to predict the ignition evolution of a hydrogen/air mixture. Three transfer learning strategies are then applied to the training of the ANN model with a sparse dataset. The performance of the reduced-order model with a sparse dataset is found to be remarkably enhanced if the training of the ANN model is restricted by a regularization term that controls the degree of knowledge transfer from source to target tasks. To this end, a novel transfer learning method is introduced, parameter control via partial initialization and regularization (PaPIR), whereby the amount of knowledge transferred is systemically adjusted for the initialization and regularization of the ANN model in the target task. It is found that an additional performance gain can be achieved by changing the initialization scheme of the ANN model in the target task when the task similarity between source and target tasks is relatively low.
    摘要 本研究的目标是评估使用多种迁移学习模型能否减少所需的训练样本数量,以预测表示氢/空气混合物均匀点火过程的数据驱动降阶模型中的化学源项。研究采用主成分分析降低氢/空气混合物在成分空间中的维数,使用人工神经网络(ANN)对主成分的反应速率进行制表,随后求解一组常微分方程。随着目标任务(即 T0 > 1000 K 和多种 phi)中训练样本数量的减少,降阶模型无法预测氢/空气混合物的点火演化。随后将三种迁移学习策略应用于稀疏数据集下 ANN 模型的训练。研究发现,若用一个控制源任务向目标任务知识迁移程度的正则化项来约束 ANN 模型的训练,稀疏数据集下降阶模型的性能可得到显著提升。为此,本文提出了一种新的迁移学习方法——基于部分初始化与正则化的参数控制(PaPIR),系统地调节目标任务中 ANN 模型初始化和正则化时的知识迁移量。研究还发现,当源任务与目标任务相似度较低时,改变目标任务中 ANN 模型的初始化方案可以带来额外的性能提升。
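A minimal PyTorch sketch of the PaPIR idea as described: partially initialize the target network from source weights and penalize deviation from them. The 70% transfer fraction, network shapes, and loss form are illustrative assumptions.

```python
import torch, torch.nn as nn

source = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 8))  # pretrained on the source task
target = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 8))

# Partial initialization: copy a random fraction of the source weights into the target net.
with torch.no_grad():
    for p_t, p_s in zip(target.parameters(), source.parameters()):
        mask = (torch.rand_like(p_t) < 0.7).float()   # 70% transfer, an illustrative choice
        p_t.mul_(1 - mask).add_(mask * p_s)

lam = 1e-3   # regularization strength controlling the degree of knowledge transfer
src_params = [p.detach().clone() for p in source.parameters()]

def papir_loss(pred, y):
    """Data term plus a penalty that pulls target weights toward the source weights."""
    mse = nn.functional.mse_loss(pred, y)
    reg = sum(((p - q) ** 2).sum() for p, q in zip(target.parameters(), src_params))
    return mse + lam * reg

loss = papir_loss(target(torch.randn(16, 8)), torch.randn(16, 8))
loss.backward()
```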

Quantum Kernel t-Distributed Stochastic Neighbor Embedding

  • paper_url: http://arxiv.org/abs/2312.00352
  • repo_url: None
  • paper_authors: Yoshiaki Kawase, Kosuke Mitarai, Keisuke Fujii
  • for: 用于可视化量子态和优化轨迹,从而更好地理解量子线路和算法。
  • methods: 使用量子核函数对量子态进行快速且高精度的可视化,并与经典核方法进行比较。
  • results: 成功可视化了手写数字数据集以及求解横场伊辛模型基态的优化轨迹,且未降低输入高维数据的可分性。
    Abstract Data visualization is important in understanding the characteristics of data that are difficult to see directly. It is used to visualize loss landscapes and optimization trajectories to analyze optimization performance. Popular optimization analysis is performed by visualizing a loss landscape around the reached local or global minimum using principal component analysis. However, this visualization depends on the variational parameters of a quantum circuit rather than quantum states, which makes it difficult to understand the mechanism of optimization process through the property of quantum states. Here, we propose a quantum data visualization method using quantum kernels, which enables us to offer fast and highly accurate visualization of quantum states. In our numerical experiments, we visualize hand-written digits dataset and apply $k$-nearest neighbor algorithm to the low-dimensional data to quantitatively evaluate our proposed method compared with a classical kernel method. As a result, our proposed method achieves comparable accuracy to the state-of-the-art classical kernel method, meaning that the proposed visualization method based on quantum machine learning does not degrade the separability of the input higher dimensional data. Furthermore, we visualize the optimization trajectories of finding the ground states of transverse field Ising model and successfully find the trajectory characteristics. Since quantum states are higher dimensional objects that can only be seen via observables, our visualization method, which inherits the similarity of quantum data, would be useful in understanding the behavior of quantum circuits and algorithms.
    摘要 数据可视化对于理解难以直接观察的数据特性十分重要,常用于可视化损失地形和优化轨迹以分析优化性能。常见的优化分析是利用主成分分析,围绕所到达的局部或全局最小值可视化损失地形。然而,这种可视化依赖于量子线路的变分参数而非量子态本身,因此难以通过量子态的性质来理解优化过程的机制。为此,我们提出了一种基于量子核函数的量子数据可视化方法,能够对量子态进行快速且高精度的可视化。在数值实验中,我们可视化了手写数字数据集,并对低维数据应用 $k$-近邻算法,以定量评估所提方法并与经典核方法进行比较。结果表明,所提方法达到了与最先进经典核方法相当的准确率,说明这种基于量子机器学习的可视化方法不会降低输入高维数据的可分性。此外,我们还可视化了求解横场伊辛模型基态的优化轨迹,并成功发现了轨迹特征。由于量子态是只能通过可观测量间接观察的高维对象,我们这种继承量子数据相似性的可视化方法,将有助于理解量子线路与算法的行为。
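A minimal sketch of kernel-based t-SNE on the digits dataset: a classical RBF kernel stands in for a quantum fidelity kernel (an assumption, since no quantum backend is used here), and distances induced from the kernel are fed to t-SNE with a precomputed metric.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import rbf_kernel

X, y = load_digits(return_X_y=True)
K = rbf_kernel(X, gamma=1e-3)          # stand-in for a quantum fidelity kernel
# Kernel-induced squared distance: d_ij^2 = k_ii + k_jj - 2 k_ij
d2 = np.clip(np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K, 0, None)
D = np.sqrt(d2)

emb = TSNE(metric="precomputed", init="random", perplexity=30).fit_transform(D)
print(emb.shape)   # (1797, 2); score with k-NN to quantify class separability, as in the paper
```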

TRC: Trust Region Conditional Value at Risk for Safe Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.00344
  • repo_url: https://github.com/rllab-snu/Trust-Region-CVaR
  • paper_authors: Dohyeong Kim, Songhwai Oh
  • for: 本研究旨在提出一种基于信任域的安全强化学习方法(TRC),在满足 CVaR 约束的同时最大化期望回报。
  • methods: 该方法首先推导 CVaR 的上界,并在信任域内将该上界近似为可微形式,进而构造用于求取策略梯度的子问题,通过迭代求解子问题来训练策略。
  • results: 在多种机器人的仿真导航任务中,TRC 在所有实验中满足约束的同时,性能比其他安全强化学习方法提升至 1.93 倍。
    Abstract As safety is of paramount importance in robotics, reinforcement learning that reflects safety, called safe RL, has been studied extensively. In safe RL, we aim to find a policy which maximizes the desired return while satisfying the defined safety constraints. There are various types of constraints, among which constraints on conditional value at risk (CVaR) effectively lower the probability of failures caused by high costs since CVaR is a conditional expectation obtained above a certain percentile. In this paper, we propose a trust region-based safe RL method with CVaR constraints, called TRC. We first derive the upper bound on CVaR and then approximate the upper bound in a differentiable form in a trust region. Using this approximation, a subproblem to get policy gradients is formulated, and policies are trained by iteratively solving the subproblem. TRC is evaluated through safe navigation tasks in simulations with various robots and a sim-to-real environment with a Jackal robot from Clearpath. Compared to other safe RL methods, the performance is improved by 1.93 times while the constraints are satisfied in all experiments.
    摘要 由于安全在机器人领域至关重要,反映安全性的强化学习(安全强化学习,safe RL)得到了广泛研究。在安全强化学习中,我们的目标是找到一个在满足既定安全约束的同时最大化期望回报的策略。约束有多种类型,其中条件风险价值(CVaR)约束能有效降低高代价导致失败的概率,因为 CVaR 是超过某一分位数之后的条件期望。本文提出了一种带 CVaR 约束、基于信任域的安全强化学习方法 TRC。我们首先推导 CVaR 的上界,然后在信任域内将该上界近似为可微形式。基于此近似,构造用于求取策略梯度的子问题,并通过迭代求解子问题来训练策略。我们在多种机器人的仿真导航任务以及使用 Clearpath Jackal 机器人的仿真到实物环境中对 TRC 进行了评估。与其他安全强化学习方法相比,TRC 在所有实验中均满足约束,同时性能提升至 1.93 倍。
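For reference, the constrained quantity is just a tail expectation; a minimal numpy sketch of the empirical CVaR that a TRC-style method would keep below a safety threshold (the Pareto costs are synthetic):

```python
import numpy as np

def cvar(costs, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of episode costs."""
    var = np.quantile(costs, alpha)
    return costs[costs >= var].mean()

episode_costs = np.random.default_rng(0).pareto(3.0, size=10_000)
print(f"mean={episode_costs.mean():.3f}  CVaR_0.95={cvar(episode_costs):.3f}")
# A TRC-style update would keep the policy inside a trust region while
# enforcing cvar(costs) <= d for a chosen safety threshold d.
```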

Hypergraph Node Representation Learning with One-Stage Message Passing

  • paper_url: http://arxiv.org/abs/2312.00336
  • repo_url: None
  • paper_authors: Shilin Qu, Weiqing Wang, Yuan-Fang Li, Xin Zhou, Fajie Yuan
  • for: 该 paper 的目的是提出一种基于 Transformer 框架的超图节点表示学习方法,以提高半监督超节点分类任务的性能。
  • methods: 该 paper 提出一种一阶段消息传递范式,通过将注意力矩阵与超图拉普拉斯矩阵相结合,把超图结构信息(局部信息)注入到 Transformer(全局信息)中。
  • results: 在五个代表性 benchmark 数据集的半监督超节点分类任务上,该方法优于最新的超图学习方法,准确率提升 2.52% 至 6.70%,达到新的最优水平。
    Abstract Hypergraphs as an expressive and general structure have attracted considerable attention from various research domains. Most existing hypergraph node representation learning techniques are based on graph neural networks, and thus adopt the two-stage message passing paradigm (i.e. node -> hyperedge -> node). This paradigm only focuses on local information propagation and does not effectively take into account global information, resulting in less optimal representations. Our theoretical analysis of representative two-stage message passing methods shows that, mathematically, they model different ways of local message passing through hyperedges, and can be unified into one-stage message passing (i.e. node -> node). However, they still only model local information. Motivated by this theoretical analysis, we propose a novel one-stage message passing paradigm to model both global and local information propagation for hypergraphs. We integrate this paradigm into HGraphormer, a Transformer-based framework for hypergraph node representation learning. HGraphormer injects the hypergraph structure information (local information) into Transformers (global information) by combining the attention matrix and hypergraph Laplacian. Extensive experiments demonstrate that HGraphormer outperforms recent hypergraph learning methods on five representative benchmark datasets on the semi-supervised hypernode classification task, setting new state-of-the-art performance, with accuracy improvements between 2.52% and 6.70%. Our code and datasets are available.
    摘要 超图作为一种表达力强且通用的结构,受到各研究领域的广泛关注。现有的超图节点表示学习技术大多基于图神经网络,采用两阶段消息传递范式(节点 -> 超边 -> 节点)。该范式只关注局部信息传播,未能有效利用全局信息,导致表示欠佳。我们对代表性两阶段消息传递方法的理论分析表明,它们在数学上建模了经由超边的不同局部消息传递方式,可以统一为一阶段消息传递(节点 -> 节点),但仍然只建模局部信息。受此启发,我们提出了一种新的一阶段消息传递范式,同时建模超图的全局与局部信息传播,并将其集成到基于 Transformer 的超图节点表示学习框架 HGraphormer 中。HGraphormer 通过将注意力矩阵与超图拉普拉斯矩阵相结合,把超图结构信息(局部信息)注入 Transformer(全局信息)。在五个代表性 benchmark 数据集的半监督超节点分类任务上,HGraphormer 优于最新的超图学习方法,准确率提升 2.52% 至 6.70%,达到新的最优水平。代码与数据集已公开。
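A minimal numpy sketch of mixing global attention with local hypergraph structure in a single propagation step. The normalized structure operator follows the standard Zhou et al. hypergraph Laplacian; the additive mixing weight `lam` is an illustrative assumption, not necessarily HGraphormer's exact combination rule.

```python
import numpy as np

def hypergraph_theta(H, w=None):
    """Normalized structure operator Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2};
    the hypergraph Laplacian is L = I - Theta (Zhou et al. normalization)."""
    n, m = H.shape
    w = np.ones(m) if w is None else w
    Dv = H @ w                 # node degrees
    De = H.sum(axis=0)         # hyperedge degrees
    Dv_is = np.diag(1.0 / np.sqrt(Dv))
    return Dv_is @ H @ np.diag(w / De) @ H.T @ Dv_is

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, dk, lam = 6, 8, 0.5
X = np.random.randn(n, dk)                           # node features
H = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0],       # incidence matrix: 6 nodes, 3 hyperedges
              [0, 1, 1], [1, 0, 1], [0, 0, 1]], float)
A = softmax(X @ X.T / np.sqrt(dk))                   # attention matrix (global information)
Theta = hypergraph_theta(H)                          # structure term (local information)
out = (A + lam * Theta) @ X                          # one-stage node -> node message passing
```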

ESM-NBR: fast and accurate nucleic acid-binding residue prediction via protein language model feature representation and multi-task learning

  • paper_url: http://arxiv.org/abs/2312.00842
  • repo_url: https://github.com/wwzll123/esm-nbr
  • paper_authors: Wenwu Zeng, Dafeng Lv, Wenjuan Liu, Shaoliang Peng
  • for: 本研究旨在提出一种快速且准确的基于序列的方法,用于预测蛋白质中的核酸结合残基。
  • methods: 本研究使用大型蛋白质语言模型 ESM2 提取特征表示,然后使用堆叠双向长短期记忆(BiLSTM)与多层感知机(MLP)网络进行多任务预测。
  • results: 实验结果表明,ESM2 特征表示的预测性能全面超过基于进化信息的隐马尔可夫模型(HMM)特征。此外,ESM-NBR 在两个独立测试集上的 DNA 结合残基预测 MCC 值分别为 0.427 和 0.391,比第二名方法分别高出 18.61% 和 10.45%。
    Abstract Protein-nucleic acid interactions play a very important role in a variety of biological activities. Accurate identification of nucleic acid-binding residues is a critical step in understanding the interaction mechanisms. Although many computationally based methods have been developed to predict nucleic acid-binding residues, challenges remain. In this study, a fast and accurate sequence-based method, called ESM-NBR, is proposed. In ESM-NBR, we first use the large protein language model ESM2 to extract discriminative biological properties feature representation from protein primary sequences; then, a multi-task deep learning model composed of stacked bidirectional long short-term memory (BiLSTM) and multi-layer perceptron (MLP) networks is employed to explore common and private information of DNA- and RNA-binding residues with ESM2 feature as input. Experimental results on benchmark data sets demonstrate that the prediction performance of ESM2 feature representation comprehensively outperforms evolutionary information-based hidden Markov model (HMM) features. Meanwhile, the ESM-NBR obtains the MCC values for DNA-binding residues prediction of 0.427 and 0.391 on two independent test sets, which are 18.61 and 10.45% higher than those of the second-best methods, respectively. Moreover, by completely discarding the time-cost multiple sequence alignment process, the prediction speed of ESM-NBR far exceeds that of existing methods (5.52s for a protein sequence of length 500, which is about 16 times faster than the second-fastest method). A user-friendly standalone package and the data of ESM-NBR are freely available for academic use at: https://github.com/wwzll123/ESM-NBR.
    摘要 蛋白质-核酸相互作用在多种生物活动中发挥着重要作用,准确识别核酸结合残基是理解相互作用机制的关键一步。尽管已有许多计算方法用于预测核酸结合残基,挑战依然存在。本研究提出了一种快速且准确的基于序列的方法 ESM-NBR。在 ESM-NBR 中,我们首先使用大型蛋白质语言模型 ESM2 从蛋白质一级序列中提取具有判别性的生物特征表示;随后,以 ESM2 特征为输入,采用由堆叠双向长短期记忆(BiLSTM)和多层感知机(MLP)网络组成的多任务深度学习模型,挖掘 DNA 与 RNA 结合残基的共有与特有信息。在基准数据集上的实验结果表明,ESM2 特征表示的预测性能全面超过基于进化信息的隐马尔可夫模型(HMM)特征。同时,ESM-NBR 在两个独立测试集上的 DNA 结合残基预测 MCC 值分别为 0.427 和 0.391,比第二名方法分别高出 18.61% 和 10.45%。此外,由于完全省去了耗时的多序列比对过程,ESM-NBR 的预测速度远超现有方法(长度为 500 的蛋白质序列仅需 5.52 秒,约比第二快的方法快 16 倍)。用户友好的独立程序包与 ESM-NBR 数据可免费用于学术用途:https://github.com/wwzll123/ESM-NBR。
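A minimal PyTorch sketch of the described architecture (per-residue ESM2 embeddings into a stacked BiLSTM trunk with two task heads). The embedding dimension 1280, hidden sizes, and head shapes are assumptions for illustration; the ESM2 feature extraction itself is assumed done upstream.

```python
import torch, torch.nn as nn

class NBRHead(nn.Module):
    """Sketch: ESM2 embeddings -> stacked BiLSTM -> DNA- and RNA-binding heads."""
    def __init__(self, emb_dim=1280, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.dna = nn.Sequential(nn.Linear(2 * hidden, 128), nn.ReLU(), nn.Linear(128, 1))
        self.rna = nn.Sequential(nn.Linear(2 * hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):                    # x: (batch, seq_len, emb_dim) from ESM2
        h, _ = self.lstm(x)                  # shared trunk captures common information
        return self.dna(h).squeeze(-1), self.rna(h).squeeze(-1)   # per-residue logits

model = NBRHead()
logits_dna, logits_rna = model(torch.randn(2, 100, 1280))
```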

Multiple Testing of Linear Forms for Noisy Matrix Completion

  • paper_url: http://arxiv.org/abs/2312.00305
  • repo_url: None
  • paper_authors: Wanteng Ma, Lilun Du, Dong Xia, Ming Yuan
  • for: solves the problem of noisy matrix completion for large-scale recommender systems.
  • methods: introduces new statistics for individual tests with sharp asymptotics and utilizes a data splitting and symmetric aggregation scheme to control the false discovery rate (FDR).
  • results: guarantees power under nearly optimal sample size requirements and shows valid FDR control in extensive numerical simulations and real data examples.
    Abstract Many important tasks of large-scale recommender systems can be naturally cast as testing multiple linear forms for noisy matrix completion. These problems, however, present unique challenges because of the subtle bias-and-variance tradeoff of and an intricate dependence among the estimated entries induced by the low-rank structure. In this paper, we develop a general approach to overcome these difficulties by introducing new statistics for individual tests with sharp asymptotics both marginally and jointly, and utilizing them to control the false discovery rate (FDR) via a data splitting and symmetric aggregation scheme. We show that valid FDR control can be achieved with guaranteed power under nearly optimal sample size requirements using the proposed methodology. Extensive numerical simulations and real data examples are also presented to further illustrate its practical merits.
    摘要 大规模推荐系统中的许多重要任务可以自然地表述为带噪矩阵补全下多个线性泛函的检验。然而,这类问题存在独特的挑战:估计中存在微妙的偏差-方差权衡,且低秩结构会在估计元素之间引入复杂的相关性。本文提出了一种通用方法来克服这些困难:我们为单个检验引入了具有精确边际与联合渐近性质的新统计量,并利用数据分割与对称聚合方案来控制错误发现率(FDR)。我们证明,所提方法在近乎最优的样本量要求下既能保证功效,又能实现有效的 FDR 控制。大量数值模拟和真实数据示例进一步展示了其实际优势。
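A minimal numpy sketch of data splitting with symmetric aggregation for FDR control, in the spirit of the description; this is a generic mirror-statistic construction on synthetic z-statistics, while the paper's statistics for matrix completion are more involved.

```python
import numpy as np

def ds_select(t1, t2, q=0.1):
    """t1, t2: statistics for the same hypotheses from two independent data halves.
    Under the null, W_j = t1_j * t2_j is symmetric around zero, so the count of
    W_j <= -t estimates the number of false positives among W_j >= t."""
    W = t1 * t2
    for t in np.sort(np.abs(W)):            # smallest valid threshold = most power
        fdp = (W <= -t).sum() / max((W >= t).sum(), 1)
        if fdp <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)

rng = np.random.default_rng(1)
signal = np.r_[np.full(50, 3.0), np.zeros(950)]
sel = ds_select(signal + rng.normal(size=1000), signal + rng.normal(size=1000))
print(len(sel), "discoveries;", (sel >= 50).sum(), "false")
```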

Towards Aligned Canonical Correlation Analysis: Preliminary Formulation and Proof-of-Concept Results

  • paper_url: http://arxiv.org/abs/2312.00296
  • repo_url: None
  • paper_authors: Biqian Cheng, Evangelos E. Papalexakis, Jia Chen
  • for: 本研究提出了一种新的对齐典型相关分析(ACCA)框架,以解决传统方法所要求的数据视图间对齐在许多实际场景中并不明确的问题。
  • methods: 该框架通过迭代求解对齐问题与多视图嵌入问题,将多个数据视图联合嵌入到最大相关的隐空间中。
  • results: 初步的概念验证结果表明,ACCA 能够在视图间对齐未知的情况下完成多视图嵌入。
    Abstract Canonical Correlation Analysis (CCA) has been widely applied to jointly embed multiple views of data in a maximally correlated latent space. However, the alignment between various data perspectives, which is required by traditional approaches, is unclear in many practical cases. In this work we propose a new framework Aligned Canonical Correlation Analysis (ACCA), to address this challenge by iteratively solving the alignment and multi-view embedding.
    摘要 典型相关分析(CCA)已被广泛用于将多个数据视图联合嵌入到最大相关的隐空间中。然而,传统方法所要求的各数据视图之间的对齐在许多实际场景中并不明确。为此,我们提出了一种新的框架——对齐典型相关分析(ACCA),通过迭代求解对齐与多视图嵌入问题来应对这一挑战。
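A minimal sketch of the alternating idea: estimate a sample alignment between two views, then re-fit CCA on the aligned pairs. The assignment step (Hungarian matching on embedding distances) is an illustrative choice, not necessarily the paper's formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cross_decomposition import CCA

def acca(X, Y, n_comp=2, iters=10):
    perm = np.arange(len(X))                  # start from the identity alignment
    cca = CCA(n_components=n_comp)
    for _ in range(iters):
        cca.fit(X, Y[perm])                   # embedding step on current alignment
        U, V = cca.transform(X, Y)
        cost = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
        _, perm = linear_sum_assignment(cost) # alignment step across views
    return cca, perm

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 2))                 # shared latent factors
X = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))
Y = (Z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(100, 6)))[rng.permutation(100)]
cca, perm = acca(X, Y)                        # recovers an alignment of the shuffled view
```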

Learning to forecast diagnostic parameters using pre-trained weather embedding

  • paper_url: http://arxiv.org/abs/2312.00290
  • repo_url: None
  • paper_authors: Peetak P. Mitra, Vivek Ramavajjala
  • for: 这篇论文提出了一种方法,使得新的诊断变量可以方便地加入现有的气象预测模型,而无需重新训练整个模型。
  • methods: 该方法分为两个阶段:第一阶段训练一个自编码器,将预报变量嵌入到一个隐空间中;第二阶段冻结自编码器,仅以预报变量的隐表示为输入训练下游模型来预测诊断变量。
  • results: 实验表明,这种两阶段方法可以达到与专门模型相当的准确率,同时显著降低训练与推理的资源消耗;并且可以在不影响现有模型的前提下随时开发新的下游模型。
    Abstract Data-driven weather prediction (DDWP) models are increasingly becoming popular for weather forecasting. However, while operational weather forecasts predict a wide variety of weather variables, DDWPs currently forecast a specific set of key prognostic variables. Non-prognostic ("diagnostic") variables are sometimes modeled separately as dependent variables of the prognostic variables (c.f. FourCastNet), or by including the diagnostic variable as a target in the DDWP. However, the cost of training and deploying bespoke models for each diagnostic variable can increase dramatically with more diagnostic variables, and limit the operational use of such models. Likewise, retraining an entire DDWP each time a new diagnostic variable is added is also cost-prohibitive. We present an two-stage approach that allows new diagnostic variables to be added to an end-to-end DDWP model without the expensive retraining. In the first stage, we train an autoencoder that learns to embed prognostic variables into a latent space. In the second stage, the autoencoder is frozen and "downstream" models are trained to predict diagnostic variables using only the latent representations of prognostic variables as input. Our experiments indicate that models trained using the two-stage approach offer accuracy comparable to training bespoke models, while leading to significant reduction in resource utilization during training and inference. This approach allows for new "downstream" models to be developed as needed, without affecting existing models and thus reducing the friction in operationalizing new models.
    摘要 数据驱动天气预测(DDWP)模型在天气预报中日益流行。然而,业务化天气预报需要预报种类繁多的气象变量,而 DDWP 目前只预报一组关键的预报(prognostic)变量。非预报("诊断")变量有时被单独建模为预报变量的因变量(如 FourCastNet),或作为 DDWP 的预测目标之一。但随着诊断变量数量增加,为每个诊断变量训练和部署专门模型的成本会急剧上升,限制了这类模型的业务化使用;同样,每新增一个诊断变量就重新训练整个 DDWP 的成本也难以承受。我们提出了一种两阶段方法,使新的诊断变量无需昂贵的重新训练即可加入端到端 DDWP 模型。第一阶段,我们训练一个自编码器,学习将预报变量嵌入到隐空间;第二阶段,冻结自编码器,仅以预报变量的隐表示为输入训练"下游"模型来预测诊断变量。实验表明,采用该两阶段方法训练的模型可以达到与专门模型相当的准确率,同时显著降低训练与推理过程中的资源消耗。该方法允许按需开发新的"下游"模型而不影响现有模型,从而减少了新模型业务化的阻力。
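A minimal PyTorch sketch of the two-stage recipe; the layer sizes and the random tensors standing in for prognostic fields and diagnostic targets are illustrative assumptions.

```python
import torch, torch.nn as nn

P, D, LATENT = 32, 4, 8   # prognostic channels, diagnostic channels, latent size

encoder = nn.Sequential(nn.Linear(P, 64), nn.ReLU(), nn.Linear(64, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, P))

# Stage 1: train the autoencoder on prognostic variables only (reconstruction loss).
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
x = torch.randn(256, P)                      # stand-in for prognostic fields
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(x)), x)
    loss.backward(); opt.step()

# Stage 2: freeze the encoder; train a cheap downstream head for diagnostic variables.
for p in encoder.parameters():
    p.requires_grad_(False)
head = nn.Sequential(nn.Linear(LATENT, 32), nn.ReLU(), nn.Linear(32, D))
opt2 = torch.optim.Adam(head.parameters(), lr=1e-3)
y = torch.randn(256, D)                      # stand-in for diagnostic targets
for _ in range(100):
    opt2.zero_grad()
    loss = nn.functional.mse_loss(head(encoder(x)), y)
    loss.backward(); opt2.step()
```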

Age-Based Scheduling for Mobile Edge Computing: A Deep Reinforcement Learning Approach

  • paper_url: http://arxiv.org/abs/2312.00279
  • repo_url: https://github.com/xingqiuhe/dpds
  • paper_authors: Xingqiu He, Chaoqun You, Tony Q. S. Quek
  • For: The paper is written for MEC systems that require real-time performance and freshness of collected environmental information, and it proposes a new definition of Age of Information (AoI) that takes into account the event-driven nature of the desired status information.
  • Methods: The paper proposes an online AoI minimization problem for MEC systems, which can be formulated as a Markov Decision Process (MDP) and solved using Reinforcement Learning (RL) algorithms. To accelerate the learning process, the paper introduces Post-Decision States (PDSs) that exploit the partial knowledge of the system's dynamics.
  • Results: The paper demonstrates through numerical results that the proposed algorithm outperforms benchmarks under various scenarios, indicating its effectiveness in minimizing the Age of Information in MEC systems.
    Abstract With the rapid development of Mobile Edge Computing (MEC), various real-time applications have been deployed to benefit people's daily lives. The performance of these applications relies heavily on the freshness of collected environmental information, which can be quantified by its Age of Information (AoI). In the traditional definition of AoI, it is assumed that the status information can be actively sampled and directly used. However, for many MEC-enabled applications, the desired status information is updated in an event-driven manner and necessitates data processing. To better serve these applications, we propose a new definition of AoI and, based on the redefined AoI, we formulate an online AoI minimization problem for MEC systems. Notably, the problem can be interpreted as a Markov Decision Process (MDP), thus enabling its solution through Reinforcement Learning (RL) algorithms. Nevertheless, the traditional RL algorithms are designed for MDPs with completely unknown system dynamics and hence usually suffer long convergence times. To accelerate the learning process, we introduce Post-Decision States (PDSs) to exploit the partial knowledge of the system's dynamics. We also combine PDSs with deep RL to further improve the algorithm's applicability, scalability, and robustness. Numerical results demonstrate that our algorithm outperforms the benchmarks under various scenarios.
    摘要 随着移动边缘计算(MEC)的快速发展,许多实时应用被部署以改善人们的日常生活。这些应用的性能在很大程度上取决于所采集环境信息的新鲜度,可用信息年龄(Age of Information,AoI)来量化。在传统的 AoI 定义中,假设状态信息可以被主动采样并直接使用;但对许多 MEC 应用而言,所需状态信息以事件驱动方式更新,且需要数据处理。为了更好地服务这些应用,我们提出了新的 AoI 定义,并在此基础上为 MEC 系统构建了在线 AoI 最小化问题。该问题可以表述为马尔可夫决策过程(MDP),因而可以用强化学习(RL)算法求解。然而,传统 RL 算法面向系统动力学完全未知的 MDP 设计,通常收敛时间较长。为加速学习过程,我们引入后决策状态(PDS),以利用系统动力学的部分已知信息,并将 PDS 与深度强化学习相结合,进一步提升算法的适用性、可扩展性与鲁棒性。数值结果表明,我们的算法在多种场景下均优于基准方法。

Automating Continual Learning

  • paper_url: http://arxiv.org/abs/2312.00276
  • repo_url: https://github.com/idsia/automated-cl
  • paper_authors: Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber
  • for: 本研究旨在解决神经网络学习算法在不断变化的环境中学习新任务时,先前习得的技能被遗忘的问题。
  • methods: 我们提出了自动化持续学习(ACL)方法,通过训练自指神经网络来元学习其自身的上下文内持续(元)学习算法。
  • results: 实验表明,ACL 能够有效解决"上下文内灾难性遗忘"问题;在无重放设定下,ACL 学得的算法优于手工设计的算法。
    Abstract General-purpose learning systems should improve themselves in open-ended fashion in ever-changing environments. Conventional learning algorithms for neural networks, however, suffer from catastrophic forgetting (CF) -- previously acquired skills are forgotten when a new task is learned. Instead of hand-crafting new algorithms for avoiding CF, we propose Automated Continual Learning (ACL) to train self-referential neural networks to meta-learn their own in-context continual (meta-)learning algorithms. ACL encodes all desiderata -- good performance on both old and new tasks -- into its meta-learning objectives. Our experiments demonstrate that ACL effectively solves "in-context catastrophic forgetting"; our ACL-learned algorithms outperform hand-crafted ones, e.g., on the Split-MNIST benchmark in the replay-free setting, and enables continual learning of diverse tasks consisting of multiple few-shot and standard image classification datasets.
    摘要 通用学习系统应当能够在不断变化的环境中以开放式方式自我改进。然而,传统的神经网络学习算法存在灾难性遗忘(CF)问题:学习新任务时,先前习得的技能会被遗忘。我们不再手工设计避免 CF 的新算法,而是提出自动化持续学习(ACL),训练自指神经网络来元学习其自身的上下文内持续(元)学习算法。ACL 将所有期望的性质(在新旧任务上都表现良好)编码进其元学习目标。实验表明,ACL 能有效解决"上下文内灾难性遗忘":在无重放设定下,ACL 学得的算法在 Split-MNIST 基准上优于手工设计的算法,并能持续学习由多个少样本和标准图像分类数据集组成的多样化任务。

Towards Clinical Prediction with Transparency: An Explainable AI Approach to Survival Modelling in Residential Aged Care

  • paper_url: http://arxiv.org/abs/2312.00271
  • repo_url: https://github.com/teosusnjak/survival-analysis-stage1
  • paper_authors: Teo Susnjak, Elise Griffin, Mitchell McCutcheon, Kathleen Potter
  • for: 本研究的目的是为长期养老照护中的生存时间提供准确估计,以辅助医疗决策。
  • methods: 本研究使用先进的机器学习技术,在 20 组实验中测试了 CoxPH、EN、RR、Lasso、GB、XGB 和 RF 等模型,以找出最佳的预测模型。
  • results: 研究发现 GB、XGB 和 RF 模型的 C-Index 值最高(0.714、0.712 和 0.712);最优的 XGB 模型在 6 个月生存预测上的 AUROC 为 0.746(95% CI 0.744-0.749)。
    Abstract Background: Accurate survival time estimates aid end-of-life medical decision-making. Objectives: Develop an interpretable survival model for elderly residential aged care residents using advanced machine learning. Setting: A major Australasian residential aged care provider. Participants: Residents aged 65+ admitted for long-term care from July 2017 to August 2023. Sample size: 11,944 residents across 40 facilities. Predictors: Factors include age, gender, health status, co-morbidities, cognitive function, mood, nutrition, mobility, smoking, sleep, skin integrity, and continence. Outcome: Probability of survival post-admission, specifically calibrated for 6-month survival estimates. Statistical Analysis: Tested CoxPH, EN, RR, Lasso, GB, XGB, and RF models in 20 experiments with a 90/10 train/test split. Evaluated accuracy using C-index, Harrell's C-index, dynamic AUROC, IBS, and calibrated ROC. Chose XGB for its performance and calibrated it for 1, 3, 6, and 12-month predictions using Platt scaling. Employed SHAP values to analyze predictor impacts. Results: GB, XGB, and RF models showed the highest C-Index values (0.714, 0.712, 0.712). The optimal XGB model demonstrated a 6-month survival prediction AUROC of 0.746 (95% CI 0.744-0.749). Key mortality predictors include age, male gender, mobility, health status, pressure ulcer risk, and appetite. Conclusions: The study successfully applies machine learning to create a survival model for aged care, aligning with clinical insights on mortality risk factors and enhancing model interpretability and clinical utility through explainable AI.
    摘要 背景:准确的生存时间估计有助于临终医疗决策。目标:使用先进的机器学习方法,为入住养老机构的老年人建立可解释的生存模型。场景:一家大型澳大拉西亚养老机构运营方。对象:2017 年 7 月至 2023 年 8 月期间入住长期照护的 65 岁及以上老人。样本量:40 家机构的 11,944 名住户。预测因素:年龄、性别、健康状况、合并症、认知功能、情绪、营养、活动能力、吸烟、睡眠、皮肤完整性和排泄控制等。结局:入住后的生存概率,并专门针对 6 个月生存估计进行校准。统计分析:在 20 组实验中以 90/10 训练/测试划分测试了 CoxPH、EN、RR、Lasso、GB、XGB 和 RF 模型,使用 C-index、Harrell's C-index、动态 AUROC、IBS 和校准 ROC 评估准确性;基于性能选择 XGB,并用 Platt 缩放对 1、3、6、12 个月预测进行校准,利用 SHAP 值分析各预测因素的影响。结果:GB、XGB 和 RF 模型的 C-Index 值最高(0.714、0.712、0.712);最优 XGB 模型的 6 个月生存预测 AUROC 为 0.746(95% CI 0.744-0.749)。关键死亡预测因素包括年龄、男性、活动能力、健康状况、压疮风险和食欲。结论:本研究成功地将机器学习应用于养老照护的生存建模,与关于死亡风险因素的临床认识相一致,并通过可解释 AI 提升了模型的可解释性与临床实用性。
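A minimal sketch of the classify-then-calibrate step: a gradient-boosted classifier (sklearn's GradientBoostingClassifier standing in for XGBoost) followed by Platt scaling of its scores on a held-out split. The covariates and the 6-month mortality labels here are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))                            # resident covariates (synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 1).astype(int)

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)
gb = GradientBoostingClassifier().fit(X_tr, y_tr)

# Platt scaling: fit a logistic curve to held-out scores to calibrate probabilities,
# as the paper does for its 1/3/6/12-month horizons.
scores = gb.predict_proba(X_cal)[:, [1]]
platt = LogisticRegression().fit(scores, y_cal)
calibrated = platt.predict_proba(scores)[:, 1]             # calibrated 6-month risk
```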

eess.IV - 2023-12-01

Deep Image prior with StruCtUred Sparsity (DISCUS) for dynamic MRI reconstruction

  • paper_url: http://arxiv.org/abs/2312.00953
  • repo_url: None
  • paper_authors: Muhammad Ahmad Sultan, Chong Chen, Yingmin Liu, Rizwan Ahmad
  • for: 在缺乏高质量训练数据的情况下重建动态 MRI 图像。
  • methods: 使用一种自监督深度学习方法(DISCUS),联合优化网络参数与输入编码向量,并对帧特定编码向量施加组稀疏约束,以发现刻画帧间时间变化的低维流形。
  • results: 在三项数值研究中,DISCUS 优于 CS 和 DIP;对编码向量施加组稀疏约束有助于发现真实的流形维数,并带来额外的性能提升。
    Abstract High-quality training data are not always available in dynamic MRI. To address this, we propose a self-supervised deep learning method called deep image prior with structured sparsity (DISCUS) for reconstructing dynamic images. DISCUS is inspired by deep image prior (DIP) and recovers a series of images through joint optimization of network parameters and input code vectors. However, DISCUS additionally encourages group sparsity on frame-specific code vectors to discover the low-dimensional manifold that describes temporal variations across frames. Compared to prior work on manifold learning, DISCUS does not require specifying the manifold dimensionality. We validate DISCUS using three numerical studies. In the first study, we simulate a dynamic Shepp-Logan phantom with frames undergoing random rotations, translations, or both, and demonstrate that DISCUS can discover the dimensionality of the underlying manifold. In the second study, we use data from a realistic late gadolinium enhancement (LGE) phantom to compare DISCUS with compressed sensing (CS) and DIP and to demonstrate the positive impact of group sparsity. In the third study, we use retrospectively undersampled single-shot LGE data from five patients to compare DISCUS with CS reconstructions. The results from these studies demonstrate that DISCUS outperforms CS and DIP and that enforcing group sparsity on the code vectors helps discover true manifold dimensionality and provides additional performance gain.
    摘要 动态 MRI 中并不总能获得高质量的训练数据。为解决这一问题,我们提出了一种用于动态图像重建的自监督深度学习方法——带结构化稀疏的深度图像先验(DISCUS)。DISCUS 受深度图像先验(DIP)启发,通过联合优化网络参数与输入编码向量来恢复图像序列;此外,DISCUS 对帧特定编码向量施加组稀疏约束,以发现刻画帧间时间变化的低维流形。与已有的流形学习工作不同,DISCUS 无需预先指定流形维数。我们通过三项数值研究验证 DISCUS:第一项研究仿真了帧间发生随机旋转、平移或两者兼有的动态 Shepp-Logan 体模,结果表明 DISCUS 能够发现底层流形的维数;第二项研究使用真实感的延迟钆增强(LGE)体模数据,将 DISCUS 与压缩感知(CS)和 DIP 进行比较,展示了组稀疏的正面作用;第三项研究使用五名患者的回顾性欠采样单次激发 LGE 数据,将 DISCUS 与 CS 重建进行比较。结果表明,DISCUS 优于 CS 和 DIP,对编码向量施加组稀疏约束有助于发现真实的流形维数并带来额外的性能提升。
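A minimal PyTorch sketch of the group-sparsity penalty on frame-specific codes; the network and forward operator of the full DISCUS objective are left as placeholders, so this only illustrates the regularizer.

```python
import torch

T, K = 20, 32                                # frames and code dimension
Z = torch.randn(T, K, requires_grad=True)    # frame-specific input code vectors

def group_sparsity(Z):
    """l_{2,1} penalty over code dimensions: the l2 norm of each coordinate's
    trajectory across frames, summed. Zeroing whole columns selects a small set
    of active code directions, i.e. a low-dimensional temporal manifold."""
    return torch.sqrt((Z ** 2).sum(dim=0) + 1e-12).sum()

# In DISCUS-style training the data term would be ||A(net(z_t)) - y_t||^2 per
# frame; here `net` and the forward operator A are omitted placeholders.
lam = 1e-2
penalty = lam * group_sparsity(Z)
penalty.backward()                           # gradients reach the codes jointly with the net
```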

Surface Coil Intensity Correction for MRI

  • paper_url: http://arxiv.org/abs/2312.00936
  • repo_url: https://github.com/osu-mr/scc
  • paper_authors: Xuan Lei, Philip Schniter, Chong Chen, Rizwan Ahmad
  • for: correction of undesired spatial intensity variations in MR images
  • methods: utilizes pre-scan data and proposes an intensity correction method
  • results: demonstrated on a digital phantom and cardiac MRI data collected from a commercial scanner
    Abstract Modern MRI scanners utilize one or more arrays of small receive-only coils to collect k-space data. The sensitivity maps of the coils, when estimated using traditional methods, differ from the true sensitivity maps, which are generally unknown. Consequently, the reconstructed MR images exhibit undesired spatial variation in intensity. These intensity variations can be at least partially corrected using pre-scan data. In this work, we propose an intensity correction method that utilizes pre-scan data. For demonstration, we apply our method to a digital phantom, as well as to cardiac MRI data collected from a commercial scanner by Siemens Healthineers. The code is available at https://github.com/OSU-MR/SCC.
    摘要 现代 MRI 扫描仪使用一个或多个由小型只接收线圈组成的阵列来采集 k 空间数据。用传统方法估计的线圈灵敏度图与通常未知的真实灵敏度图存在差异,因此重建的 MR 图像会出现不期望的空间强度变化。这些强度变化至少可以利用预扫描数据得到部分校正。在这项工作中,我们提出了一种利用预扫描数据的强度校正方法。作为演示,我们将该方法应用于数字体模以及由 Siemens Healthineers 商用扫描仪采集的心脏 MRI 数据。代码可在 https://github.com/OSU-MR/SCC 获取。
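A minimal sketch of one common pre-scan-based correction: estimate a smooth shading map as the ratio of a surface-coil pre-scan to a body-coil pre-scan and divide it out. The Gaussian smoothing scale and the ratio form are illustrative assumptions, not necessarily the SCC method itself.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_correct(image, body_prescan, surface_prescan, sigma=8, eps=1e-3):
    """Estimate a smooth sensitivity/shading map from the two pre-scans,
    then divide it out of the reconstructed image."""
    shading = gaussian_filter(surface_prescan, sigma) / (gaussian_filter(body_prescan, sigma) + eps)
    return image / (shading + eps)

img = np.random.rand(128, 128)                       # stand-in for a reconstructed MR image
corrected = intensity_correct(img, np.ones((128, 128)), 0.5 + np.random.rand(128, 128))
```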

New Filters for Image Interpolation and Resizing

  • paper_url: http://arxiv.org/abs/2312.00926
  • repo_url: https://github.com/mhowerton91/history
  • paper_authors: Amir Said
  • for: 用于图像插值与缩放的滤波器设计
  • methods: 使用一类新的核函数来设计滤波器,其特性由两个参数定义:过渡带宽和唯一旁瓣的高度
  • results: 仅需这两个参数即可高效地探索适用于图像插值与缩放的滤波器空间,为给定应用找出最佳滤波器,并很好地逼近多种常用插值核;由于其傅里叶变换衰减很快,这些滤波器经时间拉伸后用于图像缩小时效果更佳
    Abstract We propose a new class of kernels to simplify the design of filters for image interpolation and resizing. Their properties are defined according to two parameters, specifying the width of the transition band and the height of a unique sidelobe. By varying these parameters it is possible to efficiently explore the space with only the filters that are suitable for image interpolation and resizing, and identify the filter that is best for a given application. These two parameters are also sufficient to obtain very good approximations of many commonly-used interpolation kernels. We also show that, because the Fourier transforms of these kernels have very fast decay, these filters produce better results when time-stretched for image downsizing.
    摘要 我们提出了一类新的核函数,以简化图像插值与缩放滤波器的设计。其特性由两个参数定义,分别刻画过渡带宽和唯一旁瓣的高度。通过调节这两个参数,可以只在适用于图像插值与缩放的滤波器范围内高效地探索,并为给定应用找出最佳滤波器。这两个参数也足以对许多常用插值核给出很好的逼近。我们还表明,由于这些核函数的傅里叶变换衰减很快,将其进行时间拉伸后用于图像缩小能得到更好的效果。

Bitstream Organization for Parallel Entropy Coding on Neural Network-based Video Codecs

  • paper_url: http://arxiv.org/abs/2312.00921
  • repo_url: None
  • paper_authors: Amir Said, Hoang Le, Farzad Farhadzadeh
  • for: 以低成本、低功耗支持不断增长的带宽与数据吞吐量,并尽量减小并行化带来的压缩损失
  • methods: 并行熵编码、双向比特流打包,以及算术编码终止的联合优化
  • results: 显著降低支持多比特流并发解码的开销,可将其降至平均比特流大小的很小一部分,例如小于 1% 和 0.1%
    Abstract Video compression systems must support increasing bandwidth and data throughput at low cost and power, and can be limited by entropy coding bottlenecks. Efficiency can be greatly improved by parallelizing coding, which can be done at much larger scales with new neural-based codecs, but with some compression loss related to data organization. We analyze the bit rate overhead needed to support multiple bitstreams for concurrent decoding, and for its minimization propose a method for compressing parallel-decoding entry points, using bidirectional bitstream packing, and a new form of jointly optimizing arithmetic coding termination. It is shown that those techniques significantly lower the overhead, making it easier to reduce it to a small fraction of the average bitstream size, like, for example, less than 1% and 0.1% when the average number of bitstream bytes is respectively larger than 95 and 1,200 bytes.
    摘要 视频压缩系统必须以低成本和低功耗支持不断增长的带宽与数据吞吐量,而熵编码瓶颈可能成为限制。通过并行化编码可以大幅提升效率,基于神经网络的新型编解码器更可以在大得多的规模上并行,但会带来一些与数据组织相关的压缩损失。我们分析了支持多比特流并发解码所需的码率开销,并为将其最小化提出了一种压缩并行解码入口点的方法,采用双向比特流打包以及一种新的算术编码终止联合优化方式。结果表明,这些技术显著降低了开销,使其易于降至平均比特流大小的很小一部分:例如,当平均比特流字节数分别大于 95 和 1,200 字节时,开销可分别低于 1% 和 0.1%。
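A minimal sketch of the bidirectional packing idea: one stream grows forward from the start of a shared buffer, the other is written reversed from the end, so two decoders can start from opposite ends. In the paper the explicit length signal is further avoided by jointly optimizing arithmetic-coding termination; the `len_a` parameter here is a simplification.

```python
def pack_bidirectional(stream_a: bytes, stream_b: bytes) -> bytes:
    """Pack two bitstreams into one buffer: A forward from the start,
    B reversed from the end, sharing a single contiguous allocation."""
    return stream_a + stream_b[::-1]

def unpack_bidirectional(buf: bytes, len_a: int):
    """Recover both streams; decoder A starts at offset 0, decoder B at the end."""
    return buf[:len_a], buf[len_a:][::-1]

packed = pack_bidirectional(b"\x12\x34", b"\xab\xcd\xef")
assert unpack_bidirectional(packed, 2) == (b"\x12\x34", b"\xab\xcd\xef")
```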

Technical description of the EPFL submission to the JPEG DNA CfP

  • paper_url: http://arxiv.org/abs/2312.00560
  • repo_url: None
  • paper_authors: Davi Lazzarotto, Jorge Encinas Ramos, Michela Testolina, Touradj Ebrahimi
  • for: 这篇论文提出了一种基于 JPEG XL 与 DNA 编码的编解码器,可以对原始图像和已压缩的 JPEG 1 比特流进行编码。
  • methods: 该编解码器由现有的 JPEG XL 编解码器(图像压缩模块)和按 RU10(Raptor Unsystematic)描述修改的 Raptor 码实现(DNA 编码模块)组成。
  • results: 编解码代码、客观指标结果、图表和生化约束分析可在 ISO 文档系统上获取,文档编号 WG1M101013-ICQ-EPFL。
    Abstract This document provides a technical description of the codec proposed by EPFL to the JPEG DNA Call for Proposals. The codec we refer to as V-DNA for its versatility, enables the encoding of raw images and already compressed JPEG 1 bitstreams, but the underlying algorithm could be used to encode and transcode any kind of data. The codec is composed of two main modules: the image compression module, handled by the state-of-the-art JPEG XL codec, and the DNA encoding module, implemented using a modified Raptor Code implementation following the RU10 (Raptor Unsystematic) description. The code for encoding and decoding, as well as the objective metrics results, plots and biochemical constraints analysis are available on ISO Documents system with document number WG1M101013-ICQ-EPFL submission to the JPEG DNA CfP.
    摘要 本文档对 EPFL 提交给 JPEG DNA 提案征集的编解码器进行了技术描述。该编解码器因其通用性被称为 V-DNA,支持对原始图像和已压缩的 JPEG 1 比特流进行编码,其底层算法也可用于任意类型数据的编码与转码。编解码器由两个主要模块组成:图像压缩模块,采用最先进的 JPEG XL 编解码器;以及 DNA 编码模块,使用按 RU10(Raptor Unsystematic)描述修改的 Raptor 码实现。编码与解码代码、客观指标结果、图表和生化约束分析可在 ISO 文档系统上获取,文档编号为提交给 JPEG DNA CfP 的 WG1M101013-ICQ-EPFL。

Suppression of the Talbot effect in Fourier transform acousto-optic imaging

  • paper_url: http://arxiv.org/abs/2312.00432
  • repo_url: None
  • paper_authors: Maïmouna Bocoum, François Figliolia, Jean-Pierre Huignard, François Ramaz, Jean-Michel Tualle
  • for: 这篇论文旨在消除采用结构化声波的声光成像中由 Talbot 效应引起的成像伪影,以提升图像重建质量。
  • methods: 论文对结构化声波施加额外的相位调制,以消除 Talbot 效应并改善图像重建质量。
  • results: 理论与实验研究表明,对周期性声学结构施加额外相位调制可引入对称性约束,从而消除 Talbot 效应,显著提升声光成像重建质量与可达到的空间分辨率。
    Abstract We report on the observation and correction of an imaging artifact attributed to the Talbot effect in the context of acousto-optic imaging using structured acoustic waves. When ultrasound waves are emitted with a periodic structure, the Talbot effect produces $\pi$ -phase shifts of that periodic structure at every half of the Talbot distance in propagation. This unwanted artifact is detrimental to the image reconstruction, which assumes near-field diffraction is negligible. Here, we demonstrate both theoretically and experimentally how imposing an additional phase modulation on the acoustic periodic structure induces a symmetry constraint leading to the annihilation of the Talbot effect. This will significantly improve the acousto-optic image reconstruction quality and allows for an improvement of the reachable spatial resolution of the image.
    摘要 我们报道了在采用结构化声波的声光成像中,一种归因于 Talbot 效应的成像伪影的观察与校正。当以周期性结构发射超声波时,Talbot 效应会使该周期结构在传播方向上每经过 Talbot 距离的一半就产生一次 π 相移。这一有害伪影不利于假设近场衍射可忽略的图像重建。在此,我们从理论与实验两方面展示了如何通过对周期性声学结构施加额外相位调制引入对称性约束,从而消除 Talbot 效应。这将显著提升声光图像重建质量,并改善图像可达到的空间分辨率。
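For reference, under the paraxial approximation a periodic structure of period $\Lambda$ probed at wavelength $\lambda$ self-images at the Talbot distance, with $\pi$-phase-shifted copies at odd multiples of half that distance (the artifact described above):

```latex
z_T = \frac{2\Lambda^2}{\lambda}, \qquad
\text{self-images at } z = m\, z_T, \qquad
\pi\text{-shifted images at } z = \left(m + \tfrac{1}{2}\right) z_T, \quad m \in \mathbb{N}.
```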

eess.SP - 2023-12-01

Rethinking Skip Connections in Spiking Neural Networks with Time-To-First-Spike Coding

  • paper_url: http://arxiv.org/abs/2312.00919
  • repo_url: None
  • paper_authors: Youngeun Kim, Adar Kahana, Ruokai Yin, Yuhang Li, Panos Stinis, George Em Karniadakis, Priyadarshini Panda
  • for: 本文研究在采用首脉冲时间(TTFS)编码的脉冲神经网络(SNN)中 skip 连接的作用。
  • methods: 作者考察了两种 skip 连接架构:加法式与拼接式。他们发现,加法式 skip 连接会引入额外的脉冲时延;拼接式 skip 连接可避免该时延,但会在卷积路径与 skip 路径之间产生时间间隙,限制两条路径信息的有效融合。
  • results: 作者提出了一种解决方案:为拼接式 skip 连接引入可学习的时延,以弥合两条路径间的时间差,从而改善信息融合。该方法在 MNIST 和 Fashion-MNIST 等公开数据集上得到验证;此外,作者还将 TTFS 编码的应用扩展到图像识别之外的科学机器学习任务。
    Abstract Time-To-First-Spike (TTFS) coding in Spiking Neural Networks (SNNs) offers significant advantages in terms of energy efficiency, closely mimicking the behavior of biological neurons. In this work, we delve into the role of skip connections, a widely used concept in Artificial Neural Networks (ANNs), within the domain of SNNs with TTFS coding. Our focus is on two distinct types of skip connection architectures: (1) addition-based skip connections, and (2) concatenation-based skip connections. We find that addition-based skip connections introduce an additional delay in terms of spike timing. On the other hand, concatenation-based skip connections circumvent this delay but produce time gaps between after-convolution and skip connection paths, thereby restricting the effective mixing of information from these two paths. To mitigate these issues, we propose a novel approach involving a learnable delay for skip connections in the concatenation-based skip connection architecture. This approach successfully bridges the time gap between the convolutional and skip branches, facilitating improved information mixing. We conduct experiments on public datasets including MNIST and Fashion-MNIST, illustrating the advantage of the skip connection in TTFS coding architectures. Additionally, we demonstrate the applicability of TTFS coding on beyond image recognition tasks and extend it to scientific machine-learning tasks, broadening the potential uses of SNNs.
    摘要 首脉冲时间(TTFS)编码使脉冲神经网络(SNN)具有显著的能效优势,并与生物神经元的行为高度相似。在这项工作中,我们研究了人工神经网络(ANN)中广泛使用的 skip 连接在 TTFS 编码 SNN 中的作用。我们关注两种 skip 连接架构:(1)加法式 skip 连接和(2)拼接式 skip 连接。我们发现,加法式 skip 连接会在脉冲时序上引入额外时延;而拼接式 skip 连接虽可避免该时延,却会在卷积路径与 skip 连接路径之间产生时间间隙,限制两条路径信息的有效融合。为缓解这些问题,我们提出了一种为拼接式 skip 连接引入可学习时延的新方法,成功弥合了卷积分支与 skip 分支之间的时间差,促进信息融合。我们在 MNIST 和 Fashion-MNIST 等公开数据集上进行了实验,展示了 skip 连接在 TTFS 编码架构中的优势;此外,我们还将 TTFS 编码应用于图像识别之外的科学机器学习任务,拓展了 SNN 的潜在用途。

A WINNER+ Based 3-D Non-Stationary Wideband MIMO Channel Model

  • paper_url: http://arxiv.org/abs/2312.00568
  • repo_url: None
  • paper_authors: Ji Bian, Jian Sun, Cheng-Xiang Wang, Rui Feng, Jie Huang, Yang Yang, Minggao Zhang
  • For: This paper proposes a three-dimensional (3-D) non-stationary wideband multiple-input multiple-output (MIMO) channel model based on the WINNER+ channel model, which considers the angular distributions of clusters in both the horizontal and vertical planes and the movement of the receiver and clusters.
  • Methods: The proposed channel model uses a birth-death process to model the cluster time evolution, and investigates statistical properties such as the spatial cross-correlation function (CCF), temporal autocorrelation function (ACF), Doppler power spectrum density (PSD), level-crossing rate (LCR), average fading duration (AFD), and stationary interval.
  • Results: The proposed channel model is validated against measurement data and shown to reproduce the main properties of real non-stationary channels. The paper also demonstrates the adaptability of the channel model to various communication scenarios by adjusting different parameter values.
    Abstract In this paper, a three-dimensional (3-D) non-stationary wideband multiple-input multiple-output (MIMO) channel model based on the WINNER+ channel model is proposed. The angular distributions of clusters in both the horizontal and vertical planes are jointly considered. The receiver and clusters can be moving, which makes the model more general. Parameters including number of clusters, powers, delays, azimuth angles of departure (AAoDs), azimuth angles of arrival (AAoAs), elevation angles of departure (EAoDs), and elevation angles of arrival (EAoAs) are time-variant. The cluster time evolution is modeled using a birth-death process. Statistical properties, including spatial cross-correlation function (CCF), temporal autocorrelation function (ACF), Doppler power spectrum density (PSD), level-crossing rate (LCR), average fading duration (AFD), and stationary interval are investigated and analyzed. The LCR, AFD, and stationary interval of the proposed channel model are validated against the measurement data. Numerical and simulation results show that the proposed channel model has the ability to reproduce the main properties of real non-stationary channels. Furthermore, the proposed channel model can be adapted to various communication scenarios by adjusting different parameter values.
    摘要 本文提出了一种基于 WINNER+ 信道模型的三维(3-D)非平稳宽带多输入多输出(MIMO)信道模型。模型同时考虑了簇在水平面与垂直面内的角度分布;接收端与簇均可移动,使模型更具通用性。簇数、功率、时延、离开方位角(AAoD)、到达方位角(AAoA)、离开仰角(EAoD)和到达仰角(EAoA)等参数均随时间变化,簇的时间演化用生灭过程建模。文中研究并分析了空间互相关函数(CCF)、时间自相关函数(ACF)、多普勒功率谱密度(PSD)、电平通过率(LCR)、平均衰落持续时间(AFD)和平稳区间等统计特性,并用实测数据验证了所提模型的 LCR、AFD 与平稳区间。数值与仿真结果表明,所提信道模型能够重现真实非平稳信道的主要特性;通过调整不同参数取值,该模型还可适配多种通信场景。
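A minimal sum-of-sinusoids sketch (isotropic 2-D scattering, a deliberate simplification of the paper's 3-D non-stationary model) for generating a fading sample path and its empirical temporal ACF:

```python
import numpy as np

def sos_fading(fd, t, n_sin=32, rng=np.random.default_rng(0)):
    """Sum-of-sinusoids Rayleigh fading sample path with maximum Doppler fd (Hz)."""
    theta = rng.uniform(0, 2 * np.pi, n_sin)   # angles of arrival
    phi = rng.uniform(0, 2 * np.pi, n_sin)     # initial phases
    phase = 2 * np.pi * fd * np.cos(theta)[:, None] * t + phi[:, None]
    return np.exp(1j * phase).sum(axis=0) / np.sqrt(n_sin)

t = np.arange(0, 0.5, 1e-4)
h = sos_fading(fd=100.0, t=t)
# Empirical temporal ACF; for isotropic scattering it approaches J0(2*pi*fd*tau).
acf = np.correlate(h, h, "full")[len(h) - 1:] / len(h)
```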

A Spatio-Temporal Graph Convolutional Network for Gesture Recognition from High-Density Electromyography

  • paper_url: http://arxiv.org/abs/2312.00553
  • repo_url: None
  • paper_authors: Wenjuan Zhong, Yuyang Zhang, Peiwen Fu, Wenxuan Xiong, Mingming Zhang
  • for: 这项研究旨在提高基于高密度表面肌电(HD-sEMG)的人机接口中手势识别的精度。
  • methods: 本研究使用时空图卷积网络来捕捉 HD-sEMG 数据中的空间-时间依赖关系,并将其应用于人机接口中的手势识别。
  • results: 结果显示,所提出的 STGCN-GR 方法在手势预测上达到 91.07% 的准确率,超过了在同一数据集上应用的既有深度学习方法。
    Abstract Accurate hand gesture prediction is crucial for effective upper-limb prosthetic limbs control. As the high flexibility and multiple degrees of freedom exhibited by human hands, there has been a growing interest in integrating deep networks with high-density surface electromyography (HD-sEMG) grids to enhance gesture recognition capabilities. However, many existing methods fall short in fully exploit the specific spatial topology and temporal dependencies present in HD-sEMG data. Additionally, these studies are often limited number of gestures and lack generality. Hence, this study introduces a novel gesture recognition method, named STGCN-GR, which leverages spatio-temporal graph convolution networks for HD-sEMG-based human-machine interfaces. Firstly, we construct muscle networks based on functional connectivity between channels, creating a graph representation of HD-sEMG recordings. Subsequently, a temporal convolution module is applied to capture the temporal dependences in the HD-sEMG series and a spatial graph convolution module is employed to effectively learn the intrinsic spatial topology information among distinct HD-sEMG channels. We evaluate our proposed model on a public HD-sEMG dataset comprising a substantial number of gestures (i.e., 65). Our results demonstrate the remarkable capability of the STGCN-GR method, achieving an impressive accuracy of 91.07% in predicting gestures, which surpasses state-of-the-art deep learning methods applied to the same dataset.
    摘要 准确的手势预测对上肢假肢的有效控制至关重要。由于人手具有高度灵活性和多个自由度,将深度网络与高密度表面肌电(HD-sEMG)阵列相结合以增强手势识别能力正受到越来越多的关注。然而,许多现有方法未能充分利用 HD-sEMG 数据中特有的空间拓扑结构和时间依赖关系,且这些研究往往手势数量有限、缺乏普适性。因此,本研究提出了一种新的手势识别方法 STGCN-GR,利用时空图卷积网络实现基于 HD-sEMG 的人机接口。首先,我们依据通道间的功能连接构建肌肉网络,得到 HD-sEMG 记录的图表示;随后,采用时间卷积模块捕捉 HD-sEMG 序列中的时间依赖关系,并采用空间图卷积模块有效学习不同 HD-sEMG 通道间固有的空间拓扑信息。我们在包含大量手势(65 种)的公开 HD-sEMG 数据集上评估了所提模型。结果表明,STGCN-GR 方法在手势预测上达到 91.07% 的准确率,超过了在同一数据集上应用的最先进深度学习方法。

Novel 3D Geometry-Based Stochastic Models for Non-Isotropic MIMO Vehicle-to-Vehicle Channels

  • paper_url: http://arxiv.org/abs/2312.00550
  • repo_url: None
  • paper_authors: Yi Yuan, Cheng-Xiang Wang, Xiang Cheng, Bo Ai, David I. Laurenson
  • for: 本文为非各向同性多输入多输出(MIMO)车对车(V2V)莱斯衰落信道提出了一种三维(3D)理论规则形状几何随机模型(RS-GBSM)及相应的正弦叠加(SoS)仿真模型。
  • methods: 所提 RS-GBSM 结合了视距(LoS)分量、双球模型和椭圆柱模型,能够研究车流密度(VTD)对信道统计特性的影响,并利用 von Mises-Fisher 分布联合考虑方位角与仰角。
  • results: 基于所提 3D 理论 RS-GBSM 及其 SoS 仿真模型,推导并深入研究了信道统计特性,并通过与对应二维模型比较,考察了仰角对关键统计特性的影响。结果表明 3D 模型能更准确地刻画真实 V2V 信道,尤其是在微微蜂窝(pico cell)场景下;理论模型、SoS 仿真模型与仿真结果高度吻合,证明了所提模型的实用性。
    Abstract This paper proposes a novel three-dimensional (3D) theoretical regular-shaped geometry-based stochastic model (RS-GBSM) and the corresponding sum-of-sinusoids (SoS) simulation model for non-isotropic multiple-input multiple-output (MIMO) vehicle-to-vehicle (V2V) Ricean fading channels. The proposed RS-GBSM, combining line-of-sight (LoS) components, a two-sphere model, and an elliptic-cylinder model, has the ability to study the impact of the vehicular traffic density (VTD) on channel statistics, and jointly considers the azimuth and elevation angles by using the von Mises Fisher distribution. Moreover, a novel parameter computation method is proposed for jointly calculating the azimuth and elevation angles in the SoS channel simulator. Based on the proposed 3D theoretical RS-GBSM and its SoS simulation model, statistical properties are derived and thoroughly investigated. The impact of the elevation angle in the 3D model on key statistical properties is investigated by comparing with those of the corresponding two-dimensional (2D) model. It is demonstrated that the 3D model is more accurate to characterize real V2V channels, in particular for pico cell scenarios. Finally, close agreement is achieved between the theoretical model, SoS simulation model, and simulation results, demonstrating the utility of the proposed models.
    摘要 本文为非各向同性 MIMO 车对车(V2V)莱斯衰落信道提出了一种新的三维理论规则形状几何随机模型(RS-GBSM)及相应的正弦叠加(SoS)仿真模型。所提 RS-GBSM 结合视距分量、双球模型和椭圆柱模型,能够研究车流密度(VTD)对信道统计特性的影响,并利用 von Mises-Fisher 分布联合考虑方位角与仰角。此外,本文还提出了一种在 SoS 信道仿真器中联合计算方位角与仰角的新参数计算方法。基于所提 3D 理论 RS-GBSM 及其 SoS 仿真模型,推导并深入研究了统计特性,并通过与对应二维(2D)模型比较,考察了 3D 模型中仰角对关键统计特性的影响。结果表明,3D 模型能更准确地刻画真实 V2V 信道,尤其是在微微蜂窝场景下。最后,理论模型、SoS 仿真模型与仿真结果高度吻合,证明了所提模型的实用性。

Broad Beam Reflection for RIS-Assisted MIMO Systems with Planar Arrays

  • paper_url: http://arxiv.org/abs/2312.00482
  • repo_url: None
  • paper_authors: Parisa Ramezani, Maksym A. Girnyk, Emil Björnson
  • for: 尽管可重构智能表面(RIS)辅助的用户级波束赋形已被广泛研究,利用 RIS 辅助小区级(cell-specific)传输的研究仍基本空白,本研究旨在填补这一空白。
  • methods: 研究采用双极化 RIS,利用其极化自由度设计两个极化方向上的相位配置,使 RIS 能够辐射宽波束,均匀覆盖用户可能所处的所有方位角与仰角,确保所有用户都能接收到信号。
  • results: 数值仿真验证了数学分析结果,表明 RIS 可在宽角度范围内提供均匀覆盖,同时使增益随其孔径尺寸成比例提升。
    Abstract While reconfigurable intelligent surface (RIS)-aided user-specific beamforming has been vastly investigated, the aspect of utilizing RISs for assisting cell-specific transmission has been largely unattended. Aiming to fill this gap, we study a downlink broadcasting scenario where a base station (BS) sends a cell-specific signal to all the users located in a wide angular area with the assistance of a dual-polarized RIS. We utilize the polarization degree of freedom offered by this type of RIS and design the phase configurations in the two polarizations in such a way that the RIS can radiate a broad beam, thereby uniformly covering all azimuth and elevation angles where the users might reside. Specifically, the per-polarization configuration matrices are designed in such a way that the total power-domain array factor becomes spatially flat over all observation angles implying that the RIS can preserve the broad radiation pattern of a single element while boosting its gain proportionally to its aperture size. We validate the mathematical analyses via numerical simulations.
    摘要 尽管可重构智能表面(RIS)辅助的用户级波束赋形已被广泛研究,利用 RIS 辅助小区级传输这一方面却鲜有关注。为填补这一空白,我们研究了一个下行广播场景:基站(BS)在双极化 RIS 的辅助下,向位于宽角度区域内的所有用户发送小区级信号。我们利用这类 RIS 提供的极化自由度,设计两个极化方向上的相位配置,使 RIS 能够辐射宽波束,从而均匀覆盖用户可能所处的所有方位角与仰角。具体而言,每个极化的配置矩阵被设计为使总功率域阵列因子在所有观测角度上保持平坦,这意味着 RIS 既能保持单个单元的宽辐射方向图,又能使增益随其孔径尺寸成比例提升。我们通过数值仿真验证了数学分析。
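One known way to obtain a spatially flat power-domain array factor across two polarizations is to assign a Golay complementary pair as the per-polarization coefficients; whether this matches the paper's exact construction is an assumption, but the flatness property is easy to verify numerically for a linear array.

```python
import numpy as np

def golay_pair(m):
    """Recursively build a length-2^m binary Golay complementary pair."""
    a, b = np.array([1.0]), np.array([1.0])
    for _ in range(m):
        a, b = np.concatenate([a, b]), np.concatenate([a, -b])
    return a, b

a, b = golay_pair(6)                       # 64 elements, one sequence per polarization
psi = np.linspace(-np.pi, np.pi, 1001)     # spatial frequency over all observation angles
E = np.exp(1j * np.outer(np.arange(len(a)), psi))
power = np.abs(a @ E) ** 2 + np.abs(b @ E) ** 2   # power-domain array factor, summed over polarizations
assert np.allclose(power, 2 * len(a))      # spatially flat: broad beam with full aperture gain
```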

EEG-Based Reaction Time Prediction with Fuzzy Common Spatial Patterns and Phase Cohesion using Deep Autoencoder Based Data Fusion

  • paper_url: http://arxiv.org/abs/2312.00479
  • repo_url: None
  • paper_authors: Vivek Singh, Tharun Kumar Reddy
  • for: 这项研究旨在检测驾驶者的困倦状态,并通过分析 EEG 数据预测其反应时间,以预防交通事故。
  • methods: 该研究提出了一种结合模糊共同空间模式(CSP)优化的相位相干序列(PCS)表示与模糊 CSP 优化的信号幅值表示的新方法,考察清醒与困倦状态之间 EEG 同步性的变化。
  • results: 研究发现,该方法能成功区分清醒与困倦两种心理状态。使用基于深度自编码器的数据融合,将幅值 EEG 功率特征与 PCS 特征融合后,配合支持向量回归(SVR)或 LASSO 回归,在 RMSE、MAPE 和相关系数上均优于仅使用单一特征集与回归模型的方案。
    Abstract Drowsiness state of a driver is a topic of extensive discussion due to its significant role in causing traffic accidents. This research presents a novel approach that combines Fuzzy Common Spatial Patterns (CSP) optimised Phase Cohesive Sequence (PCS) representations and fuzzy CSP-optimized signal amplitude representations. The research aims to examine alterations in Electroencephalogram (EEG) synchronisation between a state of alertness and drowsiness, forecast drivers' reaction times by analysing EEG data, and subsequently identify the presence of drowsiness. The study's findings indicate that this approach successfully distinguishes between alert and drowsy mental states. By employing a Deep Autoencoder-based data fusion technique and a regression model such as Support Vector Regression (SVR) or Least Absolute Shrinkage and Selection Operator (LASSO), the proposed method outperforms using individual feature sets in combination with a regressor model. This superiority is measured by evaluating the Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Correlation Coefficient (CC). In other words, the fusion of autoencoder-based amplitude EEG power features and PCS features, when used in regression, outperforms using either of these features alone in a regressor model. Specifically, the proposed data fusion method achieves a 14.36% reduction in RMSE, a 25.12% reduction in MAPE, and a 10.12% increase in CC compared to the baseline model using only individual amplitude EEG power features and regression.
    摘要 驾驶者的困倦状态因其在交通事故中的重要作用而受到广泛讨论。本研究提出了一种新方法,将模糊共同空间模式(CSP)优化的相位相干序列(PCS)表示与模糊 CSP 优化的信号幅值表示相结合,旨在考察清醒与困倦状态之间脑电(EEG)同步性的变化,通过分析 EEG 数据预测驾驶者的反应时间,进而识别困倦的存在。研究结果表明,该方法能成功区分清醒与困倦两种心理状态。通过采用基于深度自编码器的数据融合技术,并结合支持向量回归(SVR)或最小绝对收缩与选择算子(LASSO)等回归模型,所提方法优于将单一特征集与回归模型组合的方案,其优势通过均方根误差(RMSE)、平均绝对百分比误差(MAPE)和相关系数(CC)来衡量。换言之,在回归中融合基于自编码器的幅值 EEG 功率特征与 PCS 特征,优于单独使用其中任一特征。具体而言,与仅使用幅值 EEG 功率特征加回归的基线模型相比,所提数据融合方法使 RMSE 降低 14.36%,MAPE 降低 25.12%,CC 提升 10.12%。
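A minimal sketch of the CSP step via a generalized eigendecomposition; the fuzzy trial-weighting used in the paper is omitted (all trials weighted equally), so this is only the plain CSP baseline on synthetic EEG-shaped data.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X_alert, X_drowsy, n_pairs=3):
    """CSP via the generalized eigenvalue problem C1 w = lambda (C1 + C2) w."""
    C1 = sum(np.cov(t) for t in X_alert) / len(X_alert)
    C2 = sum(np.cov(t) for t in X_drowsy) / len(X_drowsy)
    vals, vecs = eigh(C1, C1 + C2)
    order = np.argsort(vals)
    picks = np.r_[order[:n_pairs], order[-n_pairs:]]   # most discriminative filters
    return vecs[:, picks].T

rng = np.random.default_rng(0)
trials_alert = [rng.normal(size=(30, 512)) for _ in range(20)]   # channels x samples
trials_drowsy = [rng.normal(size=(30, 512)) for _ in range(20)]
W = csp_filters(trials_alert, trials_drowsy)
features = np.log((W @ trials_alert[0]).var(axis=1))             # log-variance features
```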

Investigation on data fusion of sun-induced chlorophyll fluorescence and reflectance for photosynthetic capacity of rice

  • paper_url: http://arxiv.org/abs/2312.00437
  • repo_url: None
  • paper_authors: Yu-an Zhou, Li Zhai, Weijun Zhou, Ji Zhou, Haiyan Cen
  • for: This research aims to improve the accuracy of crop photosynthesis estimation by combining leaf reflectance and sun-induced chlorophyll fluorescence (SIF) signals.
  • methods: The study uses a combination of noise removal, data fusion at different levels (raw, feature, and decision), competitive adaptive reweighted sampling (CARS), and partial least squares regression (PLSR) to estimate key photosynthetic traits in rice.
  • results: The results show that combining reflectance and SIF data sources through measurement-level data fusion significantly improves the accuracy of photosynthetic trait estimation, with mid-level and decision-level fusion also providing positive outcomes.
    Abstract Studying crop photosynthesis is crucial for improving yield, but current methods are labor-intensive. This research aims to enhance accuracy by combining leaf reflectance and sun-induced chlorophyll fluorescence (SIF) signals to estimate key photosynthetic traits in rice. The study analyzes 149 leaf samples from two rice cultivars, considering reflectance, SIF, chlorophyll, carotenoids, and CO2 response curves. After noise removal, SIF and reflectance spectra are used for data fusion at different levels (raw, feature, and decision). Competitive adaptive reweighted sampling (CARS) extracts features, and partial least squares regression (PLSR) builds regression models. Results indicate that using either reflectance or SIF alone provides modest estimations for photosynthetic traits. However, combining these data sources through measurement-level data fusion significantly improves accuracy, with mid-level and decision-level fusion also showing positive outcomes. In particular, decision-level fusion enhances predictive capabilities, suggesting the potential for efficient crop phenotyping. Overall, sun-induced chlorophyll fluorescence spectra effectively predict rice's photosynthetic capacity, and data fusion methods contribute to increased accuracy, paving the way for high-throughput crop phenotyping.

UAV-Aided Lifelong Learning for AoI and Energy Optimization in Non-Stationary IoT Networks

  • paper_url: http://arxiv.org/abs/2312.00334
  • repo_url: None
  • paper_authors: Zhenzhen Gong, Omar Hashash, Yingze Wang, Qimei Cui, Wei Ni, Walid Saad, Kei Sakaguchi
  • for: To improve the performance and energy efficiency of IoT devices operating in non-stationary environments.
  • methods: A lifelong reinforcement learning (RL) algorithm in which a UAV serves as a mobile learning agent, enabling IoT devices to continuously adapt their policies to newly encountered environments.
  • results: Compared with state-of-the-art benchmarks, the balanced cost of IoT devices improves by 8.3%, and UAV energy consumption is reduced by up to 49.38%.
    Abstract In this paper, a novel joint energy and age of information (AoI) optimization framework for IoT devices in a non-stationary environment is presented. In particular, IoT devices that are distributed in the real world are required to efficiently utilize their computing resources so as to balance the freshness of their data and their energy consumption. To optimize the performance of IoT devices in such a dynamic setting, a novel lifelong reinforcement learning (RL) solution that enables IoT devices to continuously adapt their policies to each newly encountered environment is proposed. Given that IoT devices have limited energy and computing resources, an unmanned aerial vehicle (UAV) is leveraged to visit the IoT devices and update the policy of each device sequentially. As such, the UAV is exploited as a mobile learning agent that can learn a shared knowledge base with a feature base in its training phase, and feature sets of a zero-shot learning method in its testing phase, to generalize between the environments. To optimize the trajectory and flying velocity of the UAV, an actor-critic network is leveraged so as to minimize the UAV energy consumption. Simulation results show that the proposed lifelong RL solution can outperform the state-of-the-art benchmarks by improving the balanced cost of IoT devices by $8.3\%$ when incorporating warm-start policies for unseen environments. In addition, our solution achieves up to $49.38\%$ reduction in terms of energy consumption by the UAV in comparison to the random flying strategy.
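The sketch below illustrates the kind of balanced AoI-energy cost such a framework optimizes; the weights, energy figures, and the random update policy are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def balanced_cost(aoi, energy, w_aoi=0.5, w_energy=0.5):
    """Illustrative weighted cost trading data freshness (AoI) vs. energy."""
    return w_aoi * aoi + w_energy * energy

# Toy episode: AoI grows by 1 each slot and resets on a status update;
# each update costs energy, so the device must balance the two terms.
aoi, total_cost = 0.0, 0.0
rng = np.random.default_rng(2)
for t in range(100):
    update = rng.random() < 0.2          # placeholder policy: update w.p. 0.2
    energy = 1.0 if update else 0.05     # transmit vs. idle energy (assumed)
    aoi = 0.0 if update else aoi + 1.0
    total_cost += balanced_cost(aoi, energy)
print(f"average balanced cost per slot: {total_cost / 100:.3f}")
```

A learned policy would replace the random update rule, choosing when to transmit so as to minimize this running cost.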

cs.SD - 2023-11-30

Subspace Hybrid MVDR Beamforming for Augmented Hearing

  • paper_url: http://arxiv.org/abs/2311.18689
  • repo_url: None
  • paper_authors: Sina Hafezi, Alastair H. Moore, Pierre H. Guiraud, Patrick A. Naylor, Jacob Donley, Vladimir Tourbabin, Thomas Lunner
  • for: To improve speech enhancement performance for augmented reality audio captured with head-worn microphone arrays.
  • methods: A multi-channel speech enhancement algorithm that combines the adaptability of signal-dependent beamformers with the computational efficiency and robustness of signal-independent super-directive beamformers.
  • results: Evaluated on real-world recordings and simulated cocktail-party scenarios; compared with a baseline super-directive beamformer, the proposed algorithm shows significant improvements in noise suppression, speech intelligibility, and speech quality.
    Abstract Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario - be it real-world or simulated - is straightforward in terms of the number of sound sources, the ambient sound field and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforward. The design of robust, high-performance, adaptive beamformers for such scenarios is an on-going challenge. This is due to the violation of the typically required assumptions on the noise field caused by, for example, rapid variations resulting from complex acoustic environments, and/or rotations of the listener's head. This work proposes a multi-channel speech enhancement algorithm which utilises the adaptability of signal-dependent beamformers while still benefiting from the computational efficiency and robust performance of signal-independent super-directive beamformers. The algorithm has two stages. (i) The first stage is a hybrid beamformer based on a dictionary of weights corresponding to a set of noise field models. (ii) The second stage is a wide-band subspace post-filter to remove any artifacts resulting from (i). The algorithm is evaluated using both real-world recordings and simulations of a cocktail-party scenario. Noise suppression, intelligibility and speech quality results show a significant performance improvement by the proposed algorithm compared to the baseline super-directive beamformer. A data-driven implementation of the noise field dictionary is shown to provide more noise suppression, and similar speech intelligibility and quality, compared to a parametric dictionary.
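To make stage (i) concrete, the following sketch computes classic MVDR weights for each entry in a small dictionary of noise-covariance models and selects the one minimizing output noise power; the steering vector, covariance models, and selection rule are simplified assumptions rather than the paper's exact design.

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """Classic MVDR solution: w = R^{-1} d / (d^H R^{-1} d)."""
    Rinv_d = np.linalg.solve(R_noise, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def output_noise_power(w, R):
    """Beamformer output noise power w^H R w (real-valued for Hermitian R)."""
    return np.real(w.conj() @ R @ w)

M = 6                                     # microphones (assumed array size)
rng = np.random.default_rng(3)
d = rng.normal(size=M) + 1j * rng.normal(size=M)   # placeholder steering vector
d /= np.linalg.norm(d)

# Dictionary of precomputed noise-field covariance models, e.g. diffuse
# fields under different head orientations (random placeholders here).
dictionary = []
for _ in range(4):
    A = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
    dictionary.append(A @ A.conj().T + M * np.eye(M))  # Hermitian PSD + loading

# Estimated noise covariance from the current frame (placeholder).
B = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
R_est = B @ B.conj().T + M * np.eye(M)

# Stage (i): pick the dictionary beamformer best matched to the estimate.
weights = [mvdr_weights(R, d) for R in dictionary]
best = min(weights, key=lambda w: output_noise_power(w, R_est))
print("selected beamformer output noise power:",
      round(output_noise_power(best, R_est), 4))
```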

Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm

  • paper_url: http://arxiv.org/abs/2311.18604
  • repo_url: None
  • paper_authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot
  • for: To advance automatic analysis methods in the field of Music Structure Analysis (MSA).
  • methods: An extension of the Correlation Block-Matching (CBM) algorithm, a dynamic programming method that segments self-similarity matrices computed from audio feature representations, with time sampled at the bar scale.
  • results: Under optimal conditions, the proposed algorithm is competitive with supervised state-of-the-art methods while requiring only knowledge of bar positions; the algorithm is also open-source and highly customizable.
    Abstract Music Structure Analysis (MSA) is a Music Information Retrieval task consisting of representing a song in a simplified, organized manner by breaking it down into sections typically corresponding to "chorus", "verse", "solo", etc. In this work, we extend an MSA algorithm called the Correlation Block-Matching (CBM) algorithm introduced by (Marmoret et al., 2020, 2022b). The CBM algorithm is a dynamic programming algorithm that segments self-similarity matrices, which are a standard description used in MSA and in numerous other applications. In this work, self-similarity matrices are computed from the feature representation of an audio signal and time is sampled at the bar-scale. This study examines three different standard similarity functions for the computation of self-similarity matrices. Results show that, in optimal conditions, the proposed algorithm achieves a level of performance which is competitive with supervised state-of-the-art methods while only requiring knowledge of bar positions. In addition, the algorithm is made open-source and is highly customizable.
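The sketch below computes a barwise self-similarity matrix and picks section boundaries from a checkerboard-kernel novelty curve; this novelty heuristic is a simple stand-in for the CBM dynamic program, and the barwise features are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
# Placeholder barwise features (e.g., chroma averaged per bar): bars x dims.
F = rng.normal(size=(64, 12))
F /= np.linalg.norm(F, axis=1, keepdims=True)

ssm = F @ F.T                             # cosine self-similarity matrix

# Novelty via a checkerboard kernel slid along the SSM diagonal (a simple
# stand-in for the CBM dynamic program used in the paper).
L = 4
kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((L, L)))
novelty = np.zeros(len(F))
for i in range(L, len(F) - L):
    novelty[i] = np.sum(kernel * ssm[i - L:i + L, i - L:i + L])

# Pick local maxima above the mean as candidate section boundaries (in bars).
peaks = [i for i in range(1, len(F) - 1)
         if novelty[i] > novelty[i - 1]
         and novelty[i] >= novelty[i + 1]
         and novelty[i] > novelty.mean()]
print("candidate boundaries (bar indices):", peaks)
```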

String Sound Synthesizer on GPU-accelerated Finite Difference Scheme

  • paper_url: http://arxiv.org/abs/2311.18505
  • repo_url: None
  • paper_authors: Jin Woo Lee, Min Jun Choi, Kyogu Lee
  • for: This paper presents a nonlinear string sound synthesizer based on a finite difference simulation of string dynamics under various excitations.
  • methods: A stochastically parameterizable string simulation engine modeling the string's vibratory behavior, including fundamental frequency modulation, stiffness, tension, frequency-dependent loss, and excitation control.
  • results: The open-source physical-model simulator benefits the audio signal processing community and serves as a novel dataset construction tool for neural-network-based audio synthesis; the PyTorch implementation runs on both CPU and GPU, where parallelization across spatial and batch dimensions further enhances its utility as a data generator.
    Abstract This paper introduces a nonlinear string sound synthesizer, based on a finite difference simulation of the dynamic behavior of strings under various excitations. The presented synthesizer features a versatile string simulation engine capable of stochastic parameterization, encompassing fundamental frequency modulation, stiffness, tension, frequency-dependent loss, and excitation control. This open-source physical model simulator not only benefits the audio signal processing community but also contributes to the burgeoning field of neural network-based audio synthesis by serving as a novel dataset construction tool. Implemented in PyTorch, this synthesizer offers flexibility, facilitating both CPU and GPU utilization, thereby enhancing its applicability as a simulator. GPU utilization expedites computation by parallelizing operations across spatial and batch dimensions, further enhancing its utility as a data generator.
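A minimal core of such a simulator is sketched below: an explicit finite-difference scheme for the ideal lossless string, run at the CFL stability limit. The paper's engine additionally models stiffness, frequency-dependent loss, tension, and excitation control, and is implemented in PyTorch rather than the NumPy used here; all parameter values are illustrative.

```python
import numpy as np

# Explicit finite-difference scheme for the ideal 1-D wave equation:
# u_i^{n+1} = 2 u_i^n - u_i^{n-1} + lam^2 (u_{i+1}^n - 2 u_i^n + u_{i-1}^n)
sr = 44100                      # sample rate (Hz)
f0 = 110.0                      # fundamental frequency (Hz)
c = 2 * f0                      # wave speed for a unit-length string: f0 = c/2
k = 1.0 / sr                    # time step
h = c * k                       # grid spacing at the CFL limit
N = int(1.0 / h)                # number of grid intervals
lam2 = (c * k / h) ** 2         # Courant number squared (= 1 here)

u_prev = np.zeros(N + 1)
u = np.zeros(N + 1)
u[1:N] = 0.01 * np.sin(np.pi * np.arange(1, N) / N)   # half-sine initial shape
u_prev[:] = u                                          # zero initial velocity

out = np.zeros(sr)              # one second of audio, read near the "bridge"
read = int(0.9 * N)
for n in range(sr):
    u_next = np.zeros_like(u)
    u_next[1:N] = (2 * u[1:N] - u_prev[1:N]
                   + lam2 * (u[2:N + 1] - 2 * u[1:N] + u[:N - 1]))
    u_prev, u = u, u_next       # fixed (Dirichlet) endpoints stay at zero
    out[n] = u[read]
print("peak amplitude:", float(np.abs(out).max()))
```

On GPU, the spatial update and many such strings (a batch dimension) can be computed in parallel, which is what makes the scheme attractive as a data generator.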

Sound Terminology Describing Production and Perception of Sonification

  • paper_url: http://arxiv.org/abs/2312.00091
  • repo_url: None
  • paper_authors: Tim Ziemer
  • for: To resolve terminology discrepancies in sonification research and thereby facilitate communication and collaboration between researchers from different disciplines.
  • methods: A review of literature on interdisciplinary research and discourse, identification of the terminology problems that occur in sonification, and application of the recommended solutions.
  • results: The paper recommends considering three aspects of sonification individually, namely Sound Design Concept, Objective, and Method, and provides concrete terminology and explanations for each aspect to support interdisciplinary exchange.
    Abstract Sonification research is intrinsically interdisciplinary. Consequently, a proper documentation of, and interdisciplinary discourse about a sonification is often hindered by terminology discrepancies between involved disciplines, i.e., the lack of a common sound terminology in sonification research. Without a common ground, a researcher from one discipline may have troubles understanding the implementation and imagining the resulting sound perception of a sonification, if the sonification is described by a researcher from another discipline. To find a common ground, I consulted literature on interdisciplinary research and discourse, identified problems that occur in sonification, and applied the recommended solutions. As a result, I recommend considering three aspects of sonification individually, namely 1.) Sound Design Concept, 2.) Objective and 3.) Method, clarifying which discipline is involved in which aspect, and sticking to this discipline's terminology. As two requirements of sonifications are that they are a) reproducible and b) interpretable, I recommend documenting and discussing every sonification design once using audio engineering terminology, and once using psychoacoustic terminology. The appendix provides comprehensive lists of sound terms from both disciplines, together with relevant literature and a clarification of often misunderstood and misused terms.

Audio Prompt Tuning for Universal Sound Separation

  • paper_url: http://arxiv.org/abs/2311.18399
  • repo_url: https://github.com/redrabbit94/apt-uss
  • paper_authors: Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tain, Yuxuan Wang
  • for: To improve the accuracy and robustness of existing universal sound separation systems.
  • methods: Audio prompt tuning (APT): a small number of prompt parameters are trained on limited audio examples while the separation model's parameters are kept frozen to preserve generalization.
  • results: On the MUSDB18 and ESC-50 datasets, APT improves the signal-to-distortion ratio over the baseline by 0.67 dB and 2.06 dB, respectively; with only 5 audio examples, APT even outperforms the baseline system trained on the full ESC-50 training data.
    Abstract Universal sound separation (USS) is a task to separate arbitrary sounds from an audio mixture. Existing USS systems are capable of separating arbitrary sources, given a few examples of the target sources as queries. However, separating arbitrary sounds with a single system is challenging, and the robustness is not always guaranteed. In this work, we propose audio prompt tuning (APT), a simple yet effective approach to enhance existing USS systems. Specifically, APT improves the separation performance of specific sources by training a small number of prompt parameters on limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen. We evaluate the proposed method on the MUSDB18 and ESC-50 datasets. Compared with the baseline model, APT improves the signal-to-distortion ratio by 0.67 dB and 2.06 dB using the full training sets of the two datasets. Moreover, APT with only 5 audio samples even outperforms the baseline systems utilizing full training data on the ESC-50 dataset, indicating the great potential of few-shot APT.
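The tuning recipe can be sketched as follows: freeze a pretrained query-conditioned separator and optimize only a small prompt vector added to the query embedding. The toy model, shapes, and loss below are hypothetical; only the freeze-and-tune pattern reflects the method.

```python
import torch
import torch.nn as nn

# Illustrative query-conditioned separator; the real USS model and its
# interfaces are placeholders here, only the tuning recipe is the point.
class TinySeparator(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, mixture_emb, query_emb):
        return self.net(torch.cat([mixture_emb, query_emb], dim=-1))

model = TinySeparator()
for p in model.parameters():
    p.requires_grad_(False)                # freeze the pretrained USS model

prompt = nn.Parameter(torch.zeros(1, 64))  # the only trainable parameters
opt = torch.optim.Adam([prompt], lr=1e-3)  # optimize the prompt alone

# Few-shot tuning loop on a handful of (mixture, query, target) examples.
mix, query, target = torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64)
for step in range(100):
    est = model(mix, query + prompt)       # prompt shifts the query embedding
    loss = nn.functional.mse_loss(est, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```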