eess.SP - 2023-10-12

Cell-free Massive MIMO and SWIPT: Access Point Operation Mode Selection and Power Control

  • paper_url: http://arxiv.org/abs/2310.08752
  • repo_url: None
  • paper_authors: Mohammadali Mohammadi, Le-Nam Tran, Zahra Mobini, Hien Quoc Ngo, Michail Matthaiou
  • for: 这 paper 研究了无机体 massive multiple-input multiple-output (CF-mMIMO) 系统,它们包括同时无线信息和能量传输 (SWIPT) for 分开的信息用户 (IUs) 和能量用户 (EUs) in Internet of Things (IoT) 网络。
  • methods: 作者提出了一种联合Access Point (AP) 操作模式选择和功率控制设计,其中一些 APs 专门用于向 EUs 传输能量,而其他 APs 专门用于向 IUs 传输信息。
  • results: 作者的数字结果表明,提出的 AP 操作模式选择算法可以提供高达 76% 和 130% 的性能提升 compared to random AP 操作模式选择,具体来说是 maximizing the total harvested energy (HE) for EUs, while satisfying constraints on spectral efficiency (SE) for individual IUs and minimum HE for individual EUs.
    Abstract This paper studies cell-free massive multiple-input multiple-output (CF-mMIMO) systems incorporating simultaneous wireless information and power transfer (SWIPT) for separate information users (IUs) and energy users (EUs) in Internet of Things (IoT) networks. To optimize both the spectral efficiency (SE) of IUs and harvested energy (HE) of EUs, we propose a joint access point (AP) operation mode selection and power control design, wherein certain APs are designated for energy transmission to EUs, while others are dedicated to information transmission to IUs. We investigate the problem of maximizing the total HE for EUs, considering constraints on SE for individual IUs and minimum HE for individual EUs. Our numerical results showcase that the proposed AP operation mode selection algorithm can provide up to $76\%$ and $130\%$ performance gains over random AP operation mode selection with and without power control, respectively.
    摘要 We investigate the problem of maximizing the total HE for EUs, while ensuring constraints on SE for individual IUs and minimum HE for individual EUs. Our numerical results show that the proposed AP operation mode selection algorithm can provide up to 76% and 130% performance gains over random AP operation mode selection with and without power control, respectively.

A Framework for Developing and Evaluating Algorithms for Estimating Multipath Propagation Parameters from Channel Sounder Measurements

  • paper_url: http://arxiv.org/abs/2310.08718
  • repo_url: None
  • paper_authors: Akbar Sayeed, Damla Guven, Michael Doebereiner, Sebastian Semper, Camillo Gentile, Anuraag Bodi, Zihang Cheng
  • for: 提出了一个框架,用于开发和评估毫米波频率上的多path干扰分布(MPCs)测量数据中的算法。
  • methods: 使用了一种具有普通平面阵列(UPA)的接收器,并对 channel sounder 进行了精度的数学模型建立,包括非理想的横截波特性。
  • results: 结果表明,使用 CLEAN 算法可以获得比较可靠的估计,SAGE 和 RiMAX 算法可以进一步改进估计结果,但是 RiMAX 还可以捕捉到干扰的散射效应。
    Abstract A framework is proposed for developing and evaluating algorithms for extracting multipath propagation components (MPCs) from measurements collected by channel sounders at millimeter-wave frequencies. Sounders equipped with an omnidirectional transmitter and a receiver with a uniform planar array (UPA) are considered. An accurate mathematical model is developed for the spatial frequency response of the sounder that incorporates the non-ideal cross-polar beampatterns for the UPA elements. Due to the limited Field-of-View (FoV) of each element, the model is extended to accommodate multi-FoV measurements in distinct azimuth directions. A beamspace representation of the spatial frequency response is leveraged to develop three progressively complex algorithms aimed at solving the singlesnapshot maximum likelihood estimation problem: greedy matching pursuit (CLEAN), space-alternative generalized expectationmaximization (SAGE), and RiMAX. The first two are based on purely specular MPCs whereas RiMAX also accommodates diffuse MPCs. Two approaches for performance evaluation are proposed, one with knowledge of ground truth parameters, and one based on reconstruction mean-squared error. The three algorithms are compared through a demanding channel model with hundreds of MPCs and through real measurements. The results demonstrate that CLEAN gives quite reasonable estimates which are improved by SAGE and RiMAX. Lessons learned and directions for future research are discussed.
    摘要 一种框架被提议用于开发和评估抽取多路干扰组件(MPC)的算法,该算法使用频率上的通道测量仪器进行 millimeter-wave 频率收集数据。这些测量仪器包括一个全irectional 发射器和一个具有均匀平面阵列(UPA)的接收器。为了更好地模型测量仪器的空间频率响应,一个精确的数学模型被开发,该模型考虑了 UPA 元素的非理想交叉束波响应。由于每个元素的视场有限,模型进一步扩展以处理多个视场的测量。通过使用 beamspace 表示法,开发了三种不同的算法,以解决单个快照最大可能性估计问题:排序匹配缓解(CLEAN)、空间替换总体预期最大化(SAGE)和 RiMAX。前两个算法基于纯specular MPC,而 RiMAX 还处理 diffuse MPC。两种方法被提议用于性能评估:一种基于地面参数的知情预测,另一种基于重建平均方差。这三种算法在一种复杂的通道模型和实际测量中进行比较,结果表明,CLEAN 提供了相对较好的估计,SAGE 和 RiMAX 则提供了更好的估计。文章结尾还讨论了学习的教训和未来研究的方向。

Free space optics communication system design using iterative optimization

  • paper_url: http://arxiv.org/abs/2310.13633
  • repo_url: None
  • paper_authors: Gebrehiwet Gebrekrstos Lema
  • for: 提高无线光通信系统的链路容量和可用性,并mitigate各种天气和地理因素的影响
  • methods: 使用迭代优化算法,最大化可见距离,保证可靠性,并最小化Bit Error Rate(BER)
  • results: 比对文献,实现10Gbps的数据传输速率,在不同天气条件下,visibility距离、质量因子、BER和眼图都有较好的表现
    Abstract Free Space Optics (FSO) communication provides attractive bandwidth enhancement with unlicensed bands worldwide spectrum. However, the link capacity and availability are the major concern in the different atmospheric conditions. The reliability of the link is highly dependent on weather conditions that attenuate the signal strength. Hence, this study focuses to mitigate the weather and geographic effects using iterative optimization on FSO communication. The optimization maximizes the visibility distance while guaranteeing the reliability by minimizing the Bit Error Rate (BER). The wireless optical communication system is designed for the data rate of 10 Gbps. The performance of the proposed wireless optical communication is compared against the literature in terms of visibility distance, quality factor, BER, and Eye diagram at different atmospheric conditions. The simulation results have shown that the proposed work has achieved better performance.
    摘要 自由空间光学(FSO)通信提供了广阔的带宽提升,但链接容量和可用性在不同的天气条件下是主要的问题。链接可靠性高度依赖于天气条件的吸收强度,因此这个研究旨在通过迭代优化 mitigate 天气和地理效应,以提高FSO通信的可靠性。优化寻味距离,并保证可靠性,最小化Bit Error Rate(BER)。这个无线光学通信系统设计了10Gbps的数据速率。与文献比较,这个提案的性能在不同的天气条件下表现较好,visibility distance、质量因子、BER和eye diagram都有所提高。

How secure is the time-modulated array-enabled ofdm directional modulation?

  • paper_url: http://arxiv.org/abs/2310.08551
  • repo_url: None
  • paper_authors: Zhihao Tao, Zhaoyi Xu, Athina Petropulu
  • for: 研究了时间模ulated arrays(TMA)发送的 ortogonal frequency division multiplexing(OFDM)波形的物理层安全性,并表明了攻击者可以破坏scrambling。
  • methods: 使用独立组分分析(ICA)技术来分离数据符号和TMA参数,并利用扩展和Permutation ambiguity resolved by exploiting the Toeplitz structure of the mixing matrix and knowledge of data constellation, OFDM specifics, and TMA parameter selection rules。
  • results: 表明了在ICA技术的帮助下,攻击者可以从混乱的信号中提取数据符号和TMA参数,并且引入了一种新的TMA实现方式来防止攻击。
    Abstract Time-modulated arrays (TMA) transmitting orthogonal frequency division multiplexing (OFDM) waveforms achieve physical layer security by allowing the signal to reach the legitimate destination undistorted, while making the signal appear scrambled in all other directions. In this paper, we examine how secure the TMA OFDM system is, and show that it is possible for the eavesdropper to defy the scrambling. In particular, we show that, based on the scrambled signal, the eavesdropper can formulate a blind source separation problem and recover data symbols and TMA parameters via independent component analysis (ICA) techniques. We show how the scaling and permutation ambiguities arising in ICA can be resolved by exploiting the Toeplitz structure of the corresponding mixing matrix, and knowledge of data constellation, OFDM specifics, and the rules for choosing TMA parameters. We also introduce a novel TMA implementation to defend the scrambling against the eavesdropper.
    摘要 时间模拟数组(TMA)发送 orthogonal frequency division multiplexing(OFDM)波形可以实现物理层安全性,使信号只能正确地达到合法目标,而在其他方向都显示混乱。在这篇论文中,我们研究了TMA OFDM系统的安全性,并显示了可以由侦测者违规破坏混乱。特别是,我们表明,基于混乱的信号,侦测者可以将问题转化为盲源分离问题,并通过独立元分析(ICA)技术来恢复数据符号和TMA参数。我们解决了涉及到ICA的缩放和排序歧义,通过利用混合矩阵的托勒茨结构和数据集、OFDM特点、TMA参数选择规则。此外,我们还介绍了一种新的TMA实现方式,以防止混乱被侦测者破坏。

  • paper_url: http://arxiv.org/abs/2310.08364
  • repo_url: None
  • paper_authors: Lihao Zhang, Haijian Sun, Jin Sun, Ramviyas Parasuraman, Yinghui Ye, Rose Qingyang Hu
  • For: 本研究的目的是设计一个可以在城市环境中实现高性能的车辆间通讯链路选择方法。* Methods: 本研究使用了机器学习技术,包括卷积神经网络和 гра embedding 模型,从城市地图和车辆位置中估计频率state information,并对最佳链路选择策略进行优化。* Results: 本研究的结果显示,提案的方法可以在城市环境中实现高精度的频率state estimation,并且具有较低的过程复杂度和延迟。
    Abstract Urban vehicle-to-vehicle (V2V) link scheduling with shared spectrum is a challenging problem. Its main goal is to find the scheduling policy that can maximize system performance (usually the sum capacity of each link or their energy efficiency). Given that each link can experience interference from all other active links, the scheduling becomes a combinatorial integer programming problem and generally does not scale well with the number of V2V pairs. Moreover, link scheduling requires accurate channel state information (CSI), which is very difficult to estimate with good accuracy under high vehicle mobility. In this paper, we propose an end-to-end urban V2V link scheduling method called Map2Schedule, which can directly generate V2V scheduling policy from the city map and vehicle locations. Map2Schedule delivers comparable performance to the physical-model-based methods in urban settings while maintaining low computation complexity. This enhanced performance is achieved by machine learning (ML) technologies. Specifically, we first deploy the convolutional neural network (CNN) model to estimate the CSI from street layout and vehicle locations and then apply the graph embedding model for optimal scheduling policy. The results show that the proposed method can achieve high accuracy with much lower overhead and latency.
    摘要 城市往返自动车(V2V)链接调度问题是一个具有挑战性的问题。其主要目标是找到调度策略,以最大化系统性能(通常是每个链接的总容量或能效性)。由于每个链接可以受到所有活跃链接的干扰,调度问题变成了一个 combinatorial 整数编程问题,通常不会随着 V2V 对的数量很好地扩展。此外,链接调度需要准确的通道状态信息(CSI),这是在高速移动的情况下很难以估计准确。在这篇论文中,我们提出了一种名为 Map2Schedule 的综合urban V2V 链接调度方法,可以直接从城市地图和车辆位置生成 V2V 调度策略。Map2Schedule 可以在城市环境下达到与物理模型基于方法相当的性能,同时保持低的计算复杂度。这种提高的性能是基于机器学习(ML)技术。具体来说,我们首先部署了卷积神经网络(CNN)模型,以估计从街道布局和车辆位置中的 CSI,然后应用图像嵌入模型进行优化调度策略。结果显示,我们的方法可以达到高精度,并且带有远低的开销和延迟。

Fusion framework and multimodality for the Laplacian approximation of Bayesian neural networks

  • paper_url: http://arxiv.org/abs/2310.08315
  • repo_url: None
  • paper_authors: Magnus Malmström, Isaac Skog, Daniel Axehill, Fredrik Gustafsson
  • for: 本研究考虑了神经网络(NN)的顺序融合和多个NN的融合策略,以增加鲁棒性,即减少错误分类对结果的影响和检测外部异常。
  • methods: 本研究使用巴YES(BNN)的勒拉cean approximation来衡量必要的不确定性,以便进行融合。此外,提出了一种扩展,即使NN的预测可以表示多Modal的分布。
  • results: 在使用两个经典的图像分类任务(MNIST和CFAR10)和摄像头拍摄的瑞典森林中的狮子摄像头拍摄图像序列进行示例,这种融合策略和提出的扩展都能够提高误差calibration的性能。
    Abstract This paper considers the problem of sequential fusion of predictions from neural networks (NN) and fusion of predictions from multiple NN. This fusion strategy increases the robustness, i.e., reduces the impact of one incorrect classification and detection of outliers the \nn has not seen during training. This paper uses Laplacian approximation of Bayesian NNs (BNNs) to quantify the uncertainty necessary for fusion. Here, an extension is proposed such that the prediction of the NN can be represented by multimodal distributions. Regarding calibration of the estimated uncertainty in the prediction, the performance is significantly improved by having the flexibility to represent a multimodal distribution. Two class classical image classification tasks, i.e., MNIST and CFAR10, and image sequences from camera traps of carnivores in Swedish forests have been used to demonstrate the fusion strategies and proposed extension to the Laplacian approximation.
    摘要 这篇论文考虑了神经网络(NN)的顺序融合和多个NN的融合策略,这种融合策略可以提高鲁棒性,即减少一个错误分类的影响和探测器在训练过程中没有看到的异常值。这篇论文使用朗尼均方(Laplacian approximation)来评估神经网络(BNN)的不确定性。在这种情况下,一种扩展是提出的,即预测神经网络可以表示多模态分布。对于预测 uncertainty 的准确性的调整,表现得非常改善,这是因为可以表示多模态分布的灵活性。在两个分类任务中,即MNIST和CFAR10,以及从瑞典雨林中的摄像头拍摄的车辆猎食行为图像序列中,使用了这种融合策略和提出的扩展来示示。

Low Complexity Algorithms for Mission Completion Time Minimization in UAV-Based ISAC Systems

  • paper_url: http://arxiv.org/abs/2310.08311
  • repo_url: None
  • paper_authors: Mateen Ashraf, Anna Gaydamaka, Bo Tan, Dmitri Moltchanov, Yevgeni Koucheryavy
  • for: 本研究旨在提高智能交通系统(ITS)应用范围,通过 sixth-generation(6G)系统启用 интеGRATED sensing and communications(ISAC) paradigm。
  • methods: 本研究使用了一种大规模地区多个点对 интерес的优化问题,通过优化无人机速度和访问点顺序,以最小化任务时间。
  • results: 研究发现,通过实际仿真参数,第一种算法可以保持至少20%的时间提升,而第二种算法可以将总完成时间减少至少7倍。
    Abstract The inherent support of sixth-generation (6G) systems enabling integrated sensing and communications (ISAC) paradigm greatly enhances the application area of intelligent transportation systems (ITS). One of the mission-critical applications enabled by these systems is disaster management, where ISAC functionality may not only provide localization but also provide users with supplementary information such as escape routes, time to rescue, etc. In this paper, by considering a large area with several locations of interest, we formulate and solve the optimization problem of delivering task parameters of the ISAC system by optimizing the UAV speed and the order of visits to the locations of interest such that the mission time is minimized. The formulated problem is a mixed integer non-linear program which is quite challenging to solve. To reduce the complexity of the solution algorithms, we propose two circular trajectory designs. The first algorithm finds the optimal UAV velocity and radius of the circular trajectories. The second algorithm finds the optimal connecting points for joining the individual circular trajectories. Our numerical results reveal that, with practical simulation parameters, the first algorithm provides a time saving of at least $20\%$, while the second algorithm cuts down the total completion time by at least $7$ times.
    摘要 sixth-generation (6G) 系统内置的集成感知通信(ISAC)模式可以大幅扩展智能交通系统(ITS)的应用范围。 ISAC 功能可以不仅提供地理位置信息,还可以为用户提供补充信息,如避险 Routes、救援时间等。 在这篇论文中,我们通过考虑一个大面积的多个关注点来形式化和解决 ISAC 系统交由任务参数的优化问题,以最小化任务时间。 我们提出了两种圆形轨迹设计来降低解题算法的复杂性。 首先,我们找到了最优的 UAV 速度和圆形轨迹半径。 其次,我们找到了连接各个圆形轨迹的最优连接点。 我们的数据分析表明,使用实际参数进行模拟,首 algorithm 可以保证在至少20%的时间上减少任务时间,而第二 algorithm 可以将总完成时间减少至少7倍。

Maximization of minimum rate in MIMO OFDM RIS-assisted Broadcast Channels

  • paper_url: http://arxiv.org/abs/2310.08289
  • repo_url: None
  • paper_authors: Mohammad Soleymani, Ignacio Santamaria, Aydin Sezgin, Eduard Jorswieck
  • for: 提高无线通信系统的频率效率
  • methods: 优化RIS元素,并joint precoding和RIS优化问题
  • results: RIS可以在多输入多出力OFDM广播频道中提高系统性能,即使每个子带中RIS元素很少。
    Abstract Reconfigurable intelligent surface (RIS) is a promising technology to enhance the spectral efficiency of wireless communication systems. By optimizing the RIS elements, the performance of the overall system can be improved. Yet, in contrast to single-carrier systems, in multi-carrier systems, it is not possible to independently optimize RIS elements at each sub-carrier, which may reduce the benefits of RIS in multi-user orthogonal frequency division multiplexing (OFDM) systems. To this end, we investigate the effectiveness of RIS in multiple-input, multiple-output (MIMO) OFDM broadcast channels (BC). We formulate and solve a joint precoding and RIS optimization problem. We show that RIS can significantly improve the system performance even when the number of RIS elements per sub-band is very low.
    摘要 可重配置智能表面(RIS)是一种扩展无线通信系统的spectral efficiency的技术。通过优化RIS元素,整体系统的性能可以得到改善。然而,在多个卡通系统中,不能独立地优化RIS元素每个子带,这可能减少RIS在多用户orthogonal frequency division multiplexing(OFDM)系统中的 beneficial effects。为此,我们研究了RIS在多输入多出力(MIMO)OFDM广播频道(BC)中的效果。我们建立了一个共同预编和RIS优化问题,并证明RIS可以在每个子带中具有很少的RIS元素时 Still significantly improve the system performance。

Underwater Sound Speed Profile Construction: A Review

  • paper_url: http://arxiv.org/abs/2310.08251
  • repo_url: None
  • paper_authors: Wei Huang, Jixuan Zhou, Fan Gao, Jiajun Lu, Sijia Li, Pengfei Wu, Junting Wang, Hao Zhang, Tianhe Xu
  • for: 这篇论文主要关注于建立水下声速profile(SSP),以提高水下定位、导航和时间(PNT)系统的精度。
  • methods: 主流方法包括直接测量SSP和SSP反推。 direct measurement方法比较精度高,但通常需要很长时间。而反推方法可以提高实时性,但精度不及直接测量方法。
  • results: 当前主流方法仅能在各种水下观测系统覆盖区域内进行SSP构建,无法预测未来时间内声速分布。未来研究将注重多源数据的共同利用,提供不同精度和实时要求的声速分布估计服务,而无需水下观测系统。
    Abstract Real--time and accurate construction of regional sound speed profiles (SSP) is important for building underwater positioning, navigation, and timing (PNT) systems as it greatly affect the signal propagation modes such as trajectory. In this paper, we summarizes and analyzes the current research status in the field of underwater SSP construction, and the mainstream methods include direct SSP measurement and SSP inversion. In the direct measurement method, we compare the performance of popular international commercial temperature, conductivity, and depth profilers (CTD). While for the inversion methods, the framework and basic principles of matched field processing (MFP), compressive sensing (CS), and deep learning (DL) for constructing SSP are introduced, and their advantages and disadvantages are compared. The traditional direct measurement method has good accuracy performance, but it usually takes a long time. The proposal of SSP inversion method greatly improves the convenience and real--time performance, but the accuracy is not as good as the direct measurement method. Currently, the SSP inversion relies on sonar observation data, making it difficult to apply to areas that couldn't be covered by underwater observation systems, and these methods are unable to predict the distribution of sound velocity at future times. How to comprehensively utilize multi-source data and provide elastic sound velocity distribution estimation services with different accuracy and real-time requirements for underwater users without sonar observation data is the mainstream trend in future research on SSP construction.
    摘要 实时准确地建立地区声速 Profil (SSP) 非常重要,因为它会影响信号传播模式,如轨迹。在这篇论文中,我们总结了现有领域的研究状况,主流方法包括直接测量SSP和SSP反推。直接测量方法中,我们比较了流行的国际商业温度、电导和深度测量仪器(CTD)的性能。而反推方法中,我们介绍了匹配场处理(MFP)、压缩感知(CS)和深度学习(DL)的框架和基本原则,并比较了它们的优劣点。直接测量方法具有良好的精度表现,但通常需要很长时间。反推方法可以大幅提高实时性,但精度不如直接测量方法。当前,SSP反推方法仅仅基于声波观测数据,因此在没有声波观测系统覆盖的区域无法应用,并且无法预测未来时间的声速分布。未来研究应该尝试通过多源数据的共同利用,为无声波观测数据的水下用户提供不同精度和实时需求的声速分布估计服务。

Analysing of 3D MIMO Communication Beamforming in Linear and Planar Arrays

  • paper_url: http://arxiv.org/abs/2310.08614
  • repo_url: None
  • paper_authors: Amirsadegh Roshanzamir
  • for: investigate the performance of different beamforming techniques for MIMO communication systems with planar arrays.
  • methods: covariance-based MIMO communication waveform method, MATLAB simulations.
  • results: 3D beam patterns generated by these constellations.Here’s the full text in Simplified Chinese:
  • for: 研究不同干扰方法的MIMO通信系统平面阵列性能.
  • methods: 协方差基于MIMO通信波形方法, MATLAB仿真.
  • results: 平面阵列Generated的3D扩散模式.
    Abstract Massive multiple-input multiple-output (MIMO) systems are expected to play a crucial role in the 5G wireless communication systems. These advanced systems, which are being deployed since 2021, offer significant advantages over conventional communications generations. Unlike previous versions of communication, MIMO systems can transmit various probing signals through their antennas, which may or may not be correlated with each other. This waveform diversity provided by MIMO communication enables enhanced capabilities and improved performance. Numerous research papers have proposed different approaches for beamforming in MIMO communication. We anticipate that our research will provide valuable insights into the performance of different beamforming techniques for MIMO communication systems with planar arrays. We will investigate the 3D beam patterns generated by these constellations using the covariance-based MIMO communication waveform method. MATLAB simulations will be utilized to analyze and evaluate the performance of these methods.
    摘要 巨大多输入多输出(MIMO)系统在5G无线通信系统中将扮演关键角色。这些高级系统自2021年起已经被部署,与过去的通信生成器相比,它们提供了重要的优势。与过去的通信不同,MIMO系统可以通过其天线传输多种探测信号,这些信号可能或可能不是相关的。这种波形多样性提供了MIMO通信的增强功能和性能提高。许多研究论文已经提出了不同的束形策略方法。我们预计,我们的研究将为MIMO通信系统 WITH PLANAR ARRAYS 的不同束形策略的性能提供有价值的发现。我们将使用基于协方差的MIMO通信波形方法来研究这些星座的3D束 Pattern。使用MATLAB仿真来分析和评估这些方法的性能。

Fast Ray-Tracing-Based Precise Underwater Acoustic Localization without Prior Acknowledgment of Target Depth

  • paper_url: http://arxiv.org/abs/2310.08201
  • repo_url: None
  • paper_authors: Wei Huang, Hao Zhang, Kaitao Meng, Fan Gao, Wenzhou Sun, Jianxu Shu, Tianhe Xu, Deshi Li
  • for: 海上localization技术在海洋观测和建筑PNT系统中具有重要意义,特别是在灾害预警、海上搜救和资源探测等领域。
  • methods: 我们提出了一种Iterative Ray Tracing 3D Underwater Localization(IRTUL)方法,用于补偿层次质量的影响。我们首先 derivate了信号路径为гляancing角度函数,然后证明信号传播时间和水平传播距离是初始折射角度的 monotonic 函数,从而实现快速的射线跟踪。此外,我们还提出了一种音速profile(SVP)简化方法,以降低射线跟踪的计算成本。
  • results: 实验结果表明,IRTUL方法可以减少深度方向的距离偏差,并提高了平均精度约3米compared to地理位置模型。此外,简化的SVP方法可以在实时性方面减少精度损失,即使在使用时间平均损失低于0.2米。
    Abstract Underwater localization is of great importance for marine observation and building positioning, navigation, timing (PNT) systems that could be widely applied in disaster warning, underwater rescues and resources exploration. The uneven distribution of underwater sound velocity poses great challenge for precise underwater positioning. The current soundline correction positioning method mainly aims at scenarios with known target depth. However, for nodes that are non-cooperative nodes or lack of depth information, soundline tracking strategies cannot work well due to nonunique positional solutions. To tackle this issue, we propose an iterative ray tracing 3D underwater localization (IRTUL) method for stratification compensation. To demonstrate the feasibility of fast stratification compensation, we first derive the signal path as a function of glancing angle, and then prove that the signal propagation time and horizontal propagation distance are monotonic functions of the initial grazing angle, so that fast ray tracing can be achieved. Then, we propose an sound velocity profile (SVP) simplification method, which reduces the computational cost of ray tracing. Experimental results show that the IRTUL has the most significant distance correction in the depth direction, and the average accuracy of IRTUL has been improved by about 3 meters compared to localization model with constant sound velocity. Also, the simplified SVP can significantly improve real-time performance with average accuracy loss less than 0.2 m when used for positioning.
    摘要 水下Localization对marine observation和建筑位置、导航、时间(PNT)系统的应用非常重要,特别是在紧急警示、水下搜救和资源探索等领域。水下声速的不均分布对精确水下定位 pose 大 Challenge。目前的声线修正定位方法主要针对已知目标深度的场景,但是对于不合作节点或lack of depth information的情况,声线跟踪策略难以实现Unique positional solution。为解决这个问题,我们提出了一种迭代射线跟踪3D水下定位(IRTUL)方法,用于层次补做。首先,我们 derive 声波路径作为 glance angle 函数,并证明声波传播时间和水平传播距离是初始触角 angle 的唯一 monotonic function,从而实现快速射线跟踪。然后,我们提出了一种声速profile(SVP)简化方法,可以减少射线跟踪的计算成本。实验结果表明,IRTUL在深度方向上具有最大的距离 corrections,并且localization模型中的常数声速的精度提高了约3米。此外,简化的SVP可以在实时性方面提供significant improvement, average accuracy loss less than 0.2 m。

Sensing-assisted Accurate and Fast Beam Management for Cellular-connected mmWave UAV Network

  • paper_url: http://arxiv.org/abs/2310.08134
  • repo_url: None
  • paper_authors: Yanpeng Cui, Qixun Zhang, Zhiyong Feng, Qin Wen, Ying Zhou, Zhiqing Wei, Ping Zhang
  • for: 提高 millimeter-wave UAV 网络中的比较高延迟和低精度的矢量管理,包括初始访问(IA)和矢量跟踪。
  • methods: 使用整合感知和通信技术,并采用图像处理技术和扩展卡尔曼筛法进行矢量跟踪和预测。
  • results: 比较 conventional 方法,提出的解决方案在IA延迟、关联精度、跟踪误差和通信性能方面表现出色。
    Abstract Beam management, including initial access (IA) and beam tracking, is essential to the millimeter-wave Unmanned Aerial Vehicle (UAV) network. However, conventional communication-only and feedback-based schemes suffer a high delay and low accuracy of beam alignment since they only enable the receiver to passively hear the information of the transmitter from the radio domain. This paper presents a novel sensing-assisted beam management approach, the first solution that fully utilizes the information from the visual domain to improve communication performance. We employ both integrated sensing and communication and computer vision techniques and design an extended Kalman filtering method for beam tracking and prediction. Besides, we also propose a novel dual identity association solution to distinguish multiple UAVs in dynamic environments. Real-world experiments and numerical results show that the proposed solution outperforms the conventional methods in IA delay, association accuracy, tracking error, and communication performance.
    摘要 beam 管理,包括初始访问(IA)和扫描 beam,对 millimeter 波无人机网络是必需的。然而,传统的通信只和反馈方案受到高延迟和低精度的扫描Alignment的限制,因为它们只允许接收器在电波频谱中接受发送器的信息。本文提出了一种新的感知协助 beam 管理方法,是首个完全利用视觉频谱中的信息提高通信性能的解决方案。我们利用了集成感知和通信技术,并设计了一种扩展 kalman 筛法 для beam 跟踪和预测。此外,我们还提出了一种新的双重标识关系解决方案,以在动态环境中分辨多个无人机。实际实验和数学结果表明,我们的解决方案在IA延迟、关联精度、跟踪错误和通信性能方面都超过了传统方法。

3D terrain mapping and filtering from coarse resolution data cubes extracted from real-aperture 94 GHz radar

  • paper_url: http://arxiv.org/abs/2310.08120
  • repo_url: None
  • paper_authors: William D. Harcourt, David G. Macfarlane, Duncan A. Robertson
  • for: 精确高分辨率环境三维地形映射是许多领域的关键,本研究提出了一种新的技术,即PCFilts-94算法,用于从粗糙度毫米波雷达数据立方体中提取3D点云和相关的不确定性评估。
  • methods: 本研究使用了一种新的点云提取和筛选技术,包括非相关波形均值和Voronoi基于点云异常点除法,以减少点云不确定性。此外,还使用了最优的地面控制点数量(GCP)来地理参考点云,并估算了点云不确定性。
  • results: 研究结果表明,新的处理方法可以生成稳定的点云,即可重复使用不同的点云提取和筛选参数值进行预处理,并且 less sensitive to over-filtering through the point cloud processing workflow。此外,点云不确定性也减少到了约1.5米至3米之间,比其他靠近范围的雷达系统更小。这些结果可以作为未来使用毫米波雷达系统进行地形映射的标准。
    Abstract Accurate, high-resolution 3D mapping of environmental terrain is critical in a range of disciplines. In this study, we develop a new technique, called the PCFilt-94 algorithm, to extract 3D point clouds from coarse resolution millimetre-wave radar data cubes and quantify their associated uncertainties. A technique to non-coherently average neighbouring waveforms surrounding each AVTIS2 range profile was developed in order to reduce speckle and was found to reduce point cloud uncertainty by 13% at long range and 20% at short range. Further, a Voronoi-based point cloud outlier removal algorithm was implemented which iteratively removes outliers in a point cloud until the process converges to the removal of 0 points. Taken together, the new processing methodology produces a stable point cloud, which means that: 1) it is repeatable even when using different point cloud extraction and filtering parameter values during pre-processing, and 2) is less sensitive to over-filtering through the point cloud processing workflow. Using an optimal number of Ground Control Points (GCPs) for georeferencing, which was determined to be 3 at close range (<1.5 km) and 5 at long range (>3 km), point cloud uncertainty was estimated to be approximately 1.5 m at 1.5 km to 3 m at 3 km and followed a Lorentzian distribution. These uncertainties are smaller than those reported for other close-range radar systems used for terrain mapping. The results of this study should be used as a benchmark for future application of millimetre-wave radar systems for 3D terrain mapping.
    摘要 精确高分辨度三维地形映射是多个领域的关键。本研究中,我们开发了一种新的算法,即PCFilts-94算法,以提取毫米波雷达数据立方体点云和相关的不确定性。为了减少雷达棱镜的影响,我们开发了一种非协调平均邻近波形式的方法,并发现可以降低点云不确定性13%在远距离和20%在近距离。此外,我们实现了基于Voronoi区域的点云异常点除法,这个过程可以逐步除除点云中的异常点,直到 converge to the removal of 0 points。综上所述,新的处理方法可以生成稳定的点云,其特点是:1)可重复性,即在不同的点云提取和筛选参数值时,可以重复获得相同的结果,2) menos sensitive to over-filtering,即点云处理流程中的过滤效果更加稳定。使用最佳的地面控制点数(GCPs)进行地理参考,其值为1.5 km处3个和3 km处5个,点云不确定性约为1.5 m在1.5 km处到3 m在3 km处,遵循lorentzian分布。这些不确定性较小,比其他靠近范围的雷达系统用于地形映射报告的不确定性更小。本研究的结果应该成为未来毫米波雷达系统的3D地形映射应用的标准。

Multi-Satellite Cooperative Networks: Joint Hybrid Beamforming and User Scheduling Design

  • paper_url: http://arxiv.org/abs/2310.08095
  • repo_url: None
  • paper_authors: Xuan Zhang, Shu Sun, Meixia Tao, Qin Huang, Xiaohu Tang
  • for: The paper is written for a cooperative communication network where multiple low-Earth-orbit satellites provide services for ground users (GUs) at the same time and on the same frequency.
  • methods: The paper proposes a hybrid beamforming method consisting of analog beamforming for beam alignment and digital beamforming for interference mitigation, as well as a low-complexity heuristic user scheduling algorithm to establish appropriate connections between the satellites and GUs.
  • results: The proposed joint hybrid beamforming and user scheduling (JHU) scheme is expected to dramatically improve the performance of the multi-satellite cooperative network, and simulations are conducted to compare the proposed schemes with representative baselines and to analyze the key factors influencing the performance of the network.Here is the same information in Simplified Chinese text:
  • for: 这篇论文是为了考虑一个多个低地球轨天体卫星协同通信网络,这些卫星同时提供服务给地面用户(GUs)。
  • methods: 论文提出了一种混合 beamforming 方法,包括分析 beamforming для射频轨道的平衡和数字 beamforming для干扰 Mitigation,以及一种低复杂度的决策算法来确定卫星和 GUs 之间的连接。
  • results: 提出的 JHU 方案预计可以帮助提高多卫星协同网络的性能,并通过与代表性基线相比进行 simulations,分析了网络性能的关键因素。
    Abstract In this paper, we consider a cooperative communication network where multiple low-Earth-orbit satellites provide services for ground users (GUs) (at the same time and on the same frequency). The multi-satellite cooperative network has great potential for satellite communications due to its dense configuration, extensive coverage, and large spectral efficiency. However, the communication and computational resources on satellites are usually restricted. Therefore, considering the limitation of the on-board radio-frequency chains of satellites, we first propose a hybrid beamforming method consisting of analog beamforming for beam alignment and digital beamforming for interference mitigation. Then, to establish appropriate connections between the satellites and GUs, we propose a low-complexity heuristic user scheduling algorithm which determines the connections according to the total spectral efficiency increment of the multi-satellite cooperative network. Next, considering the intrinsic connection between beamforming and user scheduling, a joint hybrid beamforming and user scheduling (JHU) scheme is proposed to dramatically improve the performance of the multi-satellite cooperative network. In addition to the single-connection scenario, we also consider the multi-connection case using the JHU scheme. Moreover, simulations are conducted to compare the proposed schemes with representative baselines and to analyze the key factors influencing the performance of the multi-satellite cooperative network.
    摘要 在这篇论文中,我们考虑了多颗低地球轨道卫星协作网络,这些卫星为地面用户(GU)提供服务(同时,在同一频率上)。这种多卫星协作网络具有密集配置、广泛覆盖和大 spectral efficiency,但卫星上的通信和计算资源受限。因此,我们首先提出了一种混合扫描方法,其中analog扫描用于杆位定位,而数字扫描用于干扰降低。然后,为建立多卫星协作网络中的合适连接,我们提出了一种低复杂度的冒险用户调度算法,该算法根据多卫星协作网络的总spectral efficiency增量确定连接。接着,我们考虑了扫描和用户调度之间的内在关系,并提出了一种结合扫描和用户调度的共同方案(JHU),以显著提高多卫星协作网络的性能。此外,我们还考虑了多连接情况下的JHU方案。此外,我们进行了对基eline的比较和分析,以分析多卫星协作网络的性能关键因素。

Channel-robust Automatic Modulation Classification Using Spectral Quotient Cumulants

  • paper_url: http://arxiv.org/abs/2310.08021
  • repo_url: None
  • paper_authors: Sai Huang, Yuting Chen, Jiashuo He, Shuo Chang, Zhiyong Feng
    for:The paper is written for proposing a channel-robust modulation classification framework for orthogonal frequency division multiplexing (OFDM) systems, which can mitigate the adverse effects of multipath channel and improve the classification accuracy.methods:The proposed method uses spectral quotient cumulants (SQCs) extracted from the filtered spectral quotient (SQ) sequence as the inputs to train an artificial neural network (ANN) classifier. The method also employs an outlier detector to filter the outliers in the SQ sequence.results:The simulation results show that the proposed SQCC method exhibits classification robustness and superiority under various unknown Rician multipath fading channels, with nearly 90% classification accuracy at the signal to noise ratio (SNR) of 4dB when testing under multiple channels but training under AWGN channel.
    Abstract Automatic modulation classification (AMC) is to identify the modulation format of the received signal corrupted by the channel effects and noise. Most existing works focus on the impact of noise while relatively little attention has been paid to the impact of channel effects. However, the instability posed by multipath fading channels leads to significant performance degradation. To mitigate the adverse effects of the multipath channel, we propose a channel-robust modulation classification framework named spectral quotient cumulant classification (SQCC) for orthogonal frequency division multiplexing (OFDM) systems. Specifically, we first transform the received signal to the spectral quotient (SQ) sequence by spectral circular shift division operations. Secondly, an outlier detector is proposed to filter the outliers in the SQ sequence. At last, we extract spectral quotient cumulants (SQCs) from the filtered SQ sequence as the inputs to train the artificial neural network (ANN) classifier and use the trained ANN to make the final decisions. Simulation results show that our proposed SQCC method exhibits classification robustness and superiority under various unknown Rician multipath fading channels compared with other existing methods. Specifically, the SQCC method achieves nearly 90% classification accuracy at the signal to noise ratio (SNR) of 4dB when testing under multiple channels but training under AWGN channel.
    摘要 自动模式分类(AMC)是确定接收信号中的模式格式,它受到通道效果和噪声的影响。现有大多数研究强调噪声的影响,而忽略了通道效果的影响。然而,多路干扰通道会导致性能下降。为了 mitigate 多路干扰通道的影响,我们提出了一种鲁棒的通道robust模式分类框架,名为спектральquotientcumulant分类(SQCC),用于orthogonal frequency division multiplexing(OFDM)系统。specifically,我们首先将接收信号转换为spectral quotient(SQ)序列,然后提出了一种outlier检测器来过滤SQ序列中的异常值。最后,我们从过滤后的SQ序列提取spectral quotientcumulants(SQCs)作为人工神经网络(ANN)分类器的输入,并使用已经训练的ANN来做最终的决定。 simulation results show that our proposed SQCC method exhibits robustness and superiority under various unknown Rician multipath fading channels compared with other existing methods. Specifically, the SQCC method achieves nearly 90% classification accuracy at the signal to noise ratio(SNR)of 4dB when testing under multiple channels but training under AWGN channel.

On the Capacity of Reconfigurable Intelligence Surface: the Sparse Channel Case

  • paper_url: http://arxiv.org/abs/2310.07994
  • repo_url: None
  • paper_authors: Chenxi Zhu
  • for: 这篇论文是为了分析利用智能表面协助的MIMO通信在稀畴频谱中提高容量而写的。
  • methods: 这篇论文使用了exploring稀畴性 Property来提高MIMO通信的容量,并开发了SU-MIMO或DL MU-MIMO的高效算法。
  • results: 论文表明,在RIS反射频谱中支持高级传输更加困难,并且提出了一种基于稀畴性的MIMO通信方法来解决这个问题。
    Abstract Reconfigurable intelligent surface (RIS) is an important candidate technology for 6G. We provide an analysis of RIS-assisted MIMO communication in sparse channel typically found in the mmW or THz range. By exploring the sparse property, we maximize the capacity in the singular space of the channel and developed efficient algorithms for SU-MIMO or DL MU-MIMO. We also proved it is more difficult to support high rank transmission in the RIS reflection channel than in the traditional MIMO channel.
    摘要 <>重配置智能表面(RIS)是6G技术的重要候选人。我们对RIS协助MIMO通信在罕见频谱中进行分析,通常发生在mmWave或THz范围内。通过探索罕见性,我们最大化了频谱空间中的容量,并开发了高效的SU-MIMO或DL MU-MIMO算法。此外,我们证明了在RIS反射频谱中支持高级传输比traditional MIMO频谱更加困难。[/INST0] Here's the translation of the text into Traditional Chinese:<>重配置智能表面(RIS)是6G技术的重要候选人。我们对RIS协助MIMO通信在罕见频范围中进行分析,通常发生在mmWave或THz范围内。通过探索罕见性,我们最大化了频范围空间中的容量,并开发了高效的SU-MIMO或DL MU-MIMO算法。此外,我们证明了在RIS反射频范围中支持高级传输比traditional MIMO频范围更加困难。

cs.SD - 2023-10-11

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

  • paper_url: http://arxiv.org/abs/2310.07246
  • repo_url: https://github.com/bakerbunker/vectok
  • paper_authors: Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, Lei Xie
  • for: 这篇论文是为了提出一种可以实现多种演讲任务的扩展框架,包括无需预训练的语言模型(LM)。
  • methods: 论文提出了一种新的演讲编码方法,基于演讲 vectors 和 semantic tokens。演讲 vectors 包含杂音细节,可以保证高质量的演讲重建,而 semantic tokens 则是关注演讲语言内容,以便语言模型化。此外,论文还使用了 Byte-Pair Encoding(BPE)来减少表达长度和比特率,提高LM的性能。
  • results: 根据实验结果,Vec-Tok Speech 在使用 50 万小时演讲数据时表现出色,比其他最佳模型更好。
    Abstract Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok .
    摘要 语言模型(LM)在自然语言处理和计算机视觉领域最近得到了广泛应用,生成了高精度的文本或图像。然而,当前的语音生成模型仍然在语音质量和任务泛化方面存在困难。这篇论文提出了Vec-Tok Speech框架,这是一个可扩展的框架,可以实现多种语音生成任务,生成高质量和高精度的语音。具体来说,我们提出了一种基于语音向量和semantic token的语音编码方法。语音向量包含语音中的音响细节,以便实现高精度的语音重建;semantic token则关注语音中的语言内容,使得语言模型可以更好地模型语音。基于该语音编码方法,Vec-Tok Speech可以通过使用LM进行核心语音生成。此外,我们还引入了字节对编码(BPE),以降低токен长度和比特率,从而提高LM的性能。Vec-Tok Speech可以用于 между语言和cross-lingual零Shift语音转换(VC)、零Shift发音风格转换(TTS)、语音到语音翻译(S2ST)、语音干扰除和speaker隐藏和匿名化。实验结果表明,Vec-Tok Speech,基于50000小时的语音数据,与其他SOTA模型相比,表现更好。代码将在GitHub上提供,地址为https://github.com/BakerBunker/VecTok。

eess.AS - 2023-10-11

Damping Density of an Absorptive Shoebox Room Derived from the Image-Source Method

  • paper_url: http://arxiv.org/abs/2310.07363
  • repo_url: None
  • paper_authors: Sebastian J. Schlecht, Karolina Prawda, Rudolf Rabenstein, Maximilian Schäfer
  • for: 这篇论文主要探讨了如何快速计算带有任意吸收的射镜房(shoebox room)的吸收响应(RIR)。
  • methods: 该论文使用了图像源方法计算射镜房的RIR,并 derive了一个关闭式表达式来描述全部多坡衰减率(damping density)。
  • results: 该论文通过对墙面吸收率的变化来研究射镜房的吸收响应,并提出了一种快速随机生成晚反射的方法。该方法可以准确地预测射镜房的吸收响应,并且在不同的墙面吸收率下都具有高精度。
    Abstract The image-source method is widely applied to compute room impulse responses (RIRs) of shoebox rooms with arbitrary absorption. However, with increasing RIR lengths, the number of image sources grows rapidly, leading to slow computation. In this paper, we derive a closed-form expression for the damping density, which characterizes the overall multi-slope energy decay. The omnidirectional energy decay over time is directly derived from the damping density. The resulting energy decay model accurately matches the late reverberation simulated via the image-source method. The proposed model allows the fast stochastic synthesis of late reverberation by shaping noise with the energy envelope. Simulations of various wall damping coefficients demonstrate the model's accuracy. The proposed model consistently outperforms the energy decay prediction accuracy compared to a state-of-the-art approximation method. The paper elaborates on the proposed damping density's applicability to modeling multi-sloped sound energy decay, predicting reverberation time in non-diffuse sound fields, and fast frequency-dependent RIR synthesis.
    摘要 “图像源方法广泛应用于计算封闭室内响应(RIR)的射频响应。然而,随着 RIR 的增长,图像源的数量增长得非常快,导致计算变得慢。在这篇论文中,我们 derive 一个闭式表达式,用于描述全体多坡衰减率。通过这个表达式,我们直接 deriv 出各个方向的能量衰减。这种能量衰减模型可以准确地与图像源方法 simulate 的晚期响应相匹配。我们的模型允许通过修形噪声的形式来快速生成晚期响应。我们通过不同墙面减噪系数的 simulations 表明了我们的模型的准确性。我们的模型在与现有的approximation方法相比之下表现出了更高的能量衰减预测精度。论文还详细介绍了我们提出的凝固density 的应用性,包括模型多坡衰减、预测射频响应时间以及快速frequency-dependent RIR synthesis。”

Magnitude-and-phase-aware Speech Enhancement with Parallel Sequence Modeling

  • paper_url: http://arxiv.org/abs/2310.07316
  • repo_url: None
  • paper_authors: Yuewei Zhang, Huanbin Zou, Jie Zhu
  • for: 本研究是关于喷水声提高(SE)领域的一篇论文,旨在提高喷水声的语音质量。
  • methods: 本研究使用了一种新的预测方法,即使用实数网络来预测干扰声的大小和正规化cIRM(Complex Ideal Ratio Mask)。此外,研究者还提出了一种平行序列模型(PSM)块,用于改进传统的循环回归网络(CRN)模型。
  • results: 实验结果表明,使用的MPCRN方法可以在喷水声提高中实现更高的性能。
    Abstract In speech enhancement (SE), phase estimation is important for perceptual quality, so many methods take clean speech's complex short-time Fourier transform (STFT) spectrum or the complex ideal ratio mask (cIRM) as the learning target. To predict these complex targets, the common solution is to design a complex neural network, or use a real network to separately predict the real and imaginary parts of the target. But in this paper, we propose to use a real network to estimate the magnitude mask and normalized cIRM, which not only avoids the significant increase of the model complexity caused by complex networks, but also shows better performance than previous phase estimation methods. Meanwhile, we devise a parallel sequence modeling (PSM) block to improve the RNN block in the convolutional recurrent network (CRN)-based SE model. We name our method as magnitude-and-phase-aware and PSM-based CRN (MPCRN). The experimental results illustrate that our MPCRN has superior SE performance.
    摘要 在speech enhancement(SE)中,频谱估计是重要的,因此许多方法使用清晰speech的复杂短时傅立叶变换(STFT)谱或理想的复杂比例面纱(cIRM)作为学习目标。为预测这些复杂目标,常见的解决方案是设计复杂的神经网络,或者使用实际网络分开预测实部和虚部。但在本文中,我们提议使用实网络来估计魔方面和 нормализаzed cIRM,不仅可以避免由复杂网络引起的模型复杂度增加,而且也比前期预测方法表现更好。此外,我们设计了并行序列模型(PSM)块来改进CRN基于的卷积隐藏状态机制(CRN)模型。我们称我们的方法为魔方-和频谱-意识的PSM-CRN(MPCRN)。实验结果表明,我们的MPCRN具有更高的SE性能。

VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention

  • paper_url: http://arxiv.org/abs/2310.07295
  • repo_url: None
  • paper_authors: Yuewei Zhang, Huanbin Zou, Jie Zhu
  • for: 提高speech干扰改进(SE)性能
  • methods: 使用多任务学习框架和 causal spatial attention(CSA)块
  • results: 实验结果表明,VSANet具有出色的SE性能,其中多任务学习框架和CSA块都有益于SE性能的提高。
    Abstract The deep learning-based speech enhancement (SE) methods always take the clean speech's waveform or time-frequency spectrum feature as the learning target, and train the deep neural network (DNN) by reducing the error loss between the DNN's output and the target. This is a conventional single-task learning paradigm, which has been proven to be effective, but we find that the multi-task learning framework can improve SE performance. Specifically, we design a framework containing a SE module and a voice activity detection (VAD) module, both of which share the same encoder, and the whole network is optimized by the weighted loss of the two modules. Moreover, we design a causal spatial attention (CSA) block to promote the representation capability of DNN. Combining the VAD aided multi-task learning framework and CSA block, our SE network is named VSANet. The experimental results prove the benefits of multi-task learning and the CSA block, which give VSANet an excellent SE performance.
    摘要 deep learning 基于 speech enhancement(SE)方法总是使用干净speech的波形或时域频谱特征作为学习目标,并使用深度神经网络(DNN)来减少错误损失之间的差异。这是一种常见的单任务学习模式,已经证明有效,但我们发现多任务学习框架可以提高 SE 性能。specifically,我们设计了一个包含 SE 模块和voice activity detection(VAD)模块的框架,两者都共享同一个编码器,整个网络通过两个模块的权重损失来优化。此外,我们还设计了一个 causal spatial attention(CSA)块,以提高 DNN 的表达能力。将 VAD 帮助多任务学习框架和 CSA 块结合在一起,我们称之为 VSANet。实验结果表明多任务学习和 CSA 块对 VSANet 的 SE 性能产生了积极的影响。

cs.CV - 2023-10-11

Dynamic Appearance Particle Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2310.07916
  • repo_url: None
  • paper_authors: Ancheng Lin, Jun Li
  • for: 模elling 3D 动态场景的 Dynamic NeRFs 扩展了这种模型,但是现有的动态 NeRFs 使用类似的欧拉rian representation,导致外观和运动之间存在紧密的关联,缺乏物理解释。
  • methods: 我们提出了Dynamic Appearance Particle Neural Radiance Field (DAP-NeRF),它通过粒子基示的方式来模拟动态场景中视觉元素的运动。DAP-NeRF 由静态场景和动态场景组成,其中动态场景是由多个{\em appearance particles} 组成,每个粒子都携带了小元素的视觉信息和运动模型。所有组件,包括静态场景、视觉特征和运动模型,都是通过单摄影视频学习而不需要任何先前的场景知识。
  • results: 我们构建了一个新的数据集来评估运动模型,并开发了一种高效的计算框架。实验结果表明,DAP-NeRF 是一种有效的技术,可以捕捉动态场景中不仅外观,还有物理意义的运动。
    Abstract Neural Radiance Fields (NeRFs) have shown great potential in modelling 3D scenes. Dynamic NeRFs extend this model by capturing time-varying elements, typically using deformation fields. The existing dynamic NeRFs employ a similar Eulerian representation for both light radiance and deformation fields. This leads to a close coupling of appearance and motion and lacks a physical interpretation. In this work, we propose Dynamic Appearance Particle Neural Radiance Field (DAP-NeRF), which introduces particle-based representation to model the motions of visual elements in a dynamic 3D scene. DAP-NeRF consists of superposition of a static field and a dynamic field. The dynamic field is quantised as a collection of {\em appearance particles}, which carries the visual information of a small dynamic element in the scene and is equipped with a motion model. All components, including the static field, the visual features and motion models of the particles, are learned from monocular videos without any prior geometric knowledge of the scene. We develop an efficient computational framework for the particle-based model. We also construct a new dataset to evaluate motion modelling. Experimental results show that DAP-NeRF is an effective technique to capture not only the appearance but also the physically meaningful motions in a 3D dynamic scene.
    摘要 neural radiance fields (NeRFs) 有大量的潜力用于模拟3D场景。动态NeRFs进一步发展了这个模型,通常使用形变场来捕捉时间变化的元素。现有的动态NeRFs都使用相似的尤里安表示法来表示光辐射场和形变场,这会导致视觉和运动之间的强相关性和缺乏物理解释。在这项工作中,我们提出了动态外观粒子神经辐射场(DAP-NeRF),它通过使用粒子基于表示来模拟动态3D场景中的视觉元素运动。DAP-NeRF包括一个静止场和一个动态场的积和。动态场被量化为一个集合的“外观粒子”,这些粒子携带了场景中小元素的视觉信息,并拥有运动模型。所有组件,包括静止场、视觉特征和运动模型,都由单投影视频无关 geometric knowledge of the scene 进行学习。我们开发了一个高效的计算框架,并构建了一个新的数据集来评估运动模型。实验结果表明,DAP-NeRF 是一种有效的方法,可以捕捉3D动态场景中的视觉特征和物理意义的运动。

NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration

  • paper_url: http://arxiv.org/abs/2310.07896
  • repo_url: None
  • paper_authors: Ajay Sridhar, Dhruv Shah, Catherine Glossop, Sergey Levine
  • for: 本研究的目的是提供一种能够执行任务导航和无目标探索的单一扩散策略,以提高在未知环境中的机器人导航性能。
  • methods: 本研究使用了一种基于Transformer的大规模策略,与一种扩散模型解码器相结合,以适应两类不同的导航需求。
  • results: 实验结果表明,使用本研究的方法可以在实际的移动机器人平台上实现更好的总性性能,并与五种相关方法进行比较,显示了显著的改进和减少碰撞的情况。
    Abstract Robotic learning for navigation in unfamiliar environments needs to provide policies for both task-oriented navigation (i.e., reaching a goal that the robot has located), and task-agnostic exploration (i.e., searching for a goal in a novel setting). Typically, these roles are handled by separate models, for example by using subgoal proposals, planning, or separate navigation strategies. In this paper, we describe how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration, with the latter providing the ability to search novel environments, and the former providing the ability to reach a user-specified goal once it has been located. We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments, as compared to approaches that use subgoal proposals from generative models, or prior methods based on latent variable models. We instantiate our method by using a large-scale Transformer-based policy trained on data from multiple ground robots, with a diffusion model decoder to flexibly handle both goal-conditioned and goal-agnostic navigation. Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods, and demonstrate significant improvements in performance and lower collision rates, despite utilizing smaller models than state-of-the-art approaches. For more videos, code, and pre-trained model checkpoints, see https://general-navigation-models.github.io/nomad/
    摘要 机器人学习 navigation 在未知环境中需要提供任务域导航(即达到机器人已经发现的目标)和任务无关探索(即在新环境中寻找目标)的策略。通常,这些角色由不同的模型来处理,例如使用互助提案、规划或不同的导航策略。在这篇论文中,我们描述了如何通过培训单一个混合扩散策略来处理任务域导航和任务无关探索,其中前者提供了达到用户指定的目标的能力,而后者提供了在新环境中搜索目标的能力。我们表明这种混合策略在视觉指定目标下在新环境中导航时比使用生成模型的互助提案或先前的射频变量模型方法更好的总性表现。我们在实现方面使用了大规模的Transformer基于策略,并使用扩散模型解码器来灵活地处理任务域导航和任务无关探索。我们的实验在真实世界移动机器人平台上进行,与五种alternative方法进行比较,并显示了在未 seen环境中的有效导航,并且demonstrate了与之前的方法相比更高的性能和更低的碰撞率,即使使用了更小的模型。更多视频、代码和预训练模型检查点,请访问https://general-navigation-models.github.io/nomad/。

Unsupervised Structured Noise Removal with Variational Lossy Autoencoder

  • paper_url: http://arxiv.org/abs/2310.07887
  • repo_url: https://github.com/krulllab/DVLAE
  • paper_authors: Benjamin Salmon, Alexander Krull
  • for: 该研究旨在提出一种无监督的深度学习方法,可以去除微scopic中的图像噪声,而无需任何净图像或噪声模型。
  • methods: 该方法基于变分自动编码器(VAE),并使用特制的 autoregressive 解码器,可以模拟图像噪声的分布,但不能独立地模拟净图像的分布。
  • results: 实验结果表明,该方法可以超越现有的自监和无监督图像去噪方法,并且具有对拟合核心区域大小的 Robustness。 代码可以在 GitHub 上找到:https://github.com/krulllab/DVLAE。
    Abstract Most unsupervised denoising methods are based on the assumption that imaging noise is either pixel-independent, i.e., spatially uncorrelated, or signal-independent, i.e., purely additive. However, in practice many imaging setups, especially in microscopy, suffer from a combination of signal-dependent noise (e.g. Poisson shot noise) and axis-aligned correlated noise (e.g. stripe shaped scanning or readout artifacts). In this paper, we present the first unsupervised deep learning-based denoiser that can remove this type of noise without access to any clean images or a noise model. Unlike self-supervised techniques, our method does not rely on removing pixels by masking or subsampling so can utilize all available information. We implement a Variational Autoencoder (VAE) with a specially designed autoregressive decoder capable of modelling the noise component of an image but incapable of independently modelling the underlying clean signal component. As a consequence, our VAE's encoder learns to encode only underlying clean signal content and to discard imaging noise. We also propose an additional decoder for mapping the encoder's latent variables back into image space, thereby sampling denoised images. Experimental results demonstrate that our approach surpasses existing methods for self- and unsupervised image denoising while being robust with respect to the size of the autoregressive receptive field. Code for this project can be found at https://github.com/krulllab/DVLAE.
    摘要 大多数无监督干涉方法假设图像噪声是像素独立的,即空间不相关,或信号独立的,即纯加性的。然而,在实际应用中,许多图像设置,特别是微scopy,受到信号相关的噪声(例如摄件衰减噪声)和排序相关的噪声(例如扫描条形或读取 artifacts)的混合噪声。在这篇论文中,我们介绍了首个无监督深度学习基于的干涉方法,可以在没有干涉图像或噪声模型的情况下,去除这种类型的噪声。与自我监督技术不同,我们的方法不需要通过屏蔽或抽样来消除像素,因此可以利用所有可用的信息。我们实现了一种变量自动机(VAE),其拥有特制的 autoregressive 解码器,可以模型图像噪声的组成部分,但不能独立模型图像的净信号部分。因此,VAE 的编码器将只编码净信号内容,并且抛弃图像噪声。我们还提出了一种将编码器的缺省变量映射回图像空间的额外解码器,从而采样干涉后的净化图像。实验结果表明,我们的方法超越了现有的自我监督和无监督图像干涉方法,并且对 autoregressive 感知场的大小具有较好的稳定性。代码可以在 找到。

A Survey of Feature Types and Their Contributions for Camera Tampering Detection

  • paper_url: http://arxiv.org/abs/2310.07886
  • repo_url: None
  • paper_authors: Pranav Mantini, Shishir K. Shah
  • for: 这篇论文是关于摄像头妨碍检测的,它旨在检测摄像头中的非法和不可预期的修改。
  • methods: 这篇论文使用了时间序列分析方法来检测摄像头妨碍,并对不同特征类型进行了分析和实验研究。
  • results: 研究发现,使用不同特征类型可以提高摄像头妨碍检测的精度和可靠性。同时,研究还发现了不同特征类型在不同的妨碍情况下的表现不同。
    Abstract Camera tamper detection is the ability to detect unauthorized and unintentional alterations in surveillance cameras by analyzing the video. Camera tampering can occur due to natural events or it can be caused intentionally to disrupt surveillance. We cast tampering detection as a change detection problem, and perform a review of the existing literature with emphasis on feature types. We formulate tampering detection as a time series analysis problem, and design experiments to study the robustness and capability of various feature types. We compute ten features on real-world surveillance video and apply time series analysis to ascertain their predictability, and their capability to detect tampering. Finally, we quantify the performance of various time series models using each feature type to detect tampering.
    摘要 surveillance camera تammer detection 是指通过分析视频来检测非法和无意之变化。 camera tampering 可能由自然事件引起,也可能是故意的干扰Surveillance。 we cast tampering detection as a change detection problem,并对现有文献进行了审查,强调特征类型。 we formulate tampering detection as a time series analysis problem,并设计了许多实验来评估不同特征类型的可靠性和检测能力。 we compute ten features on real-world surveillance video and apply time series analysis to determine their predictability and ability to detect tampering. finally, we quantify the performance of various time series models using each feature type to detect tampering.Here's the text with some additional information about the features used in the study:surveillance camera تammer detection 是指通过分析视频来检测非法和无意之变化。 camera tampering 可能由自然事件引起,也可能是故意的干扰Surveillance。 we cast tampering detection as a change detection problem,并对现有文献进行了审查,强调特征类型。 we formulate tampering detection as a time series analysis problem,并设计了许多实验来评估不同特征类型的可靠性和检测能力。 we compute ten features on real-world surveillance video and apply time series analysis to determine their predictability and ability to detect tampering. these features include:1. color histogram2. color moments3. color co-occurrence matrices4. texture features5. edge features6. corner features7. motion features8. optical flow features9. histogram of oriented gradients (HOG)10. scale-invariant feature transform (SIFT)we quantify the performance of various time series models using each feature type to detect tampering. the time series models include:1. autoregressive (AR) models2. moving average (MA) models3. ARIMA models4. seasonal ARIMA (SARIMA) models5. long short-term memory (LSTM) modelswe evaluate the performance of each model using metrics such as detection accuracy, false alarm rate, and area under the receiver operating characteristic (ROC) curve.

BrainVoxGen: Deep learning framework for synthesis of Ultrasound to MRI

  • paper_url: http://arxiv.org/abs/2310.08608
  • repo_url: None
  • paper_authors: Shubham Singh, Dr. Mrunal Bewoor, Ammar Ranapurwala, Satyam Rai, Sheetal Patil
  • for: 这个研究旨在使用深度学习框架将三维ultrasound图像转换为三维MRI图像。
  • methods: 这个方法使用Pix2Pix GAN模型,将三维ultrasound图像输入到UNET生成器和patch检测器中,生成对应的三维MRI图像。
  • results: 研究发现,这个方法可以成功地生成三维MRI图像,并且与预期结果 exhibit 一定的相似性。
    Abstract The study presents a deep learning framework aimed at synthesizing 3D MRI volumes from three-dimensional ultrasound images of the brain utilizing the Pix2Pix GAN model. The process involves inputting a 3D volume of ultrasounds into a UNET generator and patch discriminator, generating a corresponding 3D volume of MRI. Model performance was evaluated using losses on the discriminator and generator applied to a dataset of 3D ultrasound and MRI images. The results indicate that the synthesized MRI images exhibit some similarity to the expected outcomes. Despite challenges related to dataset size, computational resources, and technical complexities, the method successfully generated MRI volume with a satisfactory similarity score meant to serve as a baseline for further research. It underscores the potential of deep learning-based volume synthesis techniques for ultrasound to MRI conversion, showcasing their viability for medical applications. Further refinement and exploration are warranted for enhanced clinical relevance.
    摘要 这种研究描述了一种深度学习框架,用于从三维超声图像转换为三维MRI图像,使用Pix2Pix GAN模型。该过程中,将三维超声图像输入到UNET生成器和补做分类器中,生成相应的三维MRI图像。模型性能被评估使用生成器和分类器对三维超声和MRI图像集合的损失。结果表明,生成的MRI图像具有一定的相似性。虽然数据集的大小、计算资源和技术复杂度带来了挑战,但方法仍然成功地生成了MRI图像,并且得到了一个可接受的相似性分数。这种方法的成功表明了深度学习基于超声到MRI转换的卷积synthesis技术的可能性,并且为医疗应用提供了一个可靠的基础。进一步的改进和探索是需要的,以提高临床 relevance。

CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

  • paper_url: http://arxiv.org/abs/2310.07855
  • repo_url: None
  • paper_authors: Tim Lebailly, Thomas Stegmüller, Behzad Bozorgtabar, Jean-Philippe Thiran, Tinne Tuytelaars
  • for: 提高 dense visual representation learning 的精度和效能
  • methods: 使用 object-level nearest neighbor bootstrapping 方法,在训练过程中对每个对象进行精细化的 Bootstrapping,以提高模型在具有多个对象的场景下的表现
  • results: 在具有多个对象的场景下,CrIBo 表现出色,在具有 nearest neighbor retrieval 的测试任务上达到了领先的性能水平,并在标准下游任务中保持竞争力In English, this means:
  • for: Improving the accuracy and efficiency of dense visual representation learning
  • methods: Using object-level nearest neighbor bootstrapping method, fine-grained bootstrapping is performed for each object during training to enhance the model’s performance in scenes with multiple objects
  • results: CrIBo achieves state-of-the-art performance on tasks with nearest neighbor retrieval in scenes with multiple objects, and is highly competitive in standard downstream segmentation tasks.
    Abstract Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models will be publicly available upon acceptance.
    摘要 利用最近邻居检索来进行自我超vised表示学习已经证明有利于对象中心图像。然而,这种方法在场景中心数据集上遇到限制,因为图像中多个对象仅在全局表示中被间接捕捉。这可能导致对象表示的不良杂化。此外,即使对象中心数据集也可以从更细化的杂化方法中受益。为回应这些挑战,我们提出了一种新的跨图像对象级Bootstrapping方法,可以提高粗细视觉表示学习。我们在训练中使用对象级最近邻居检索,并在测试时使用最近邻居检索。我们称之为CrIBo。CrIBo在具有状态的下游分 segmentation任务中表现出色,而且在标准下游任务中也具有高竞争力。我们将代码和预训练模型公开发布。

Explorable Mesh Deformation Subspaces from Unstructured Generative Models

  • paper_url: http://arxiv.org/abs/2310.07814
  • repo_url: None
  • paper_authors: Arman Maesumi, Paul Guerrero, Vladimir G. Kim, Matthew Fisher, Siddhartha Chaudhuri, Noam Aigerman, Daniel Ritchie
  • for: 实现3D形状的变化探索,从传入的指标形状中找到可视化的2D探索空间,并将该空间转换为训练过的生成模型中的子空间,以实现高质量的这些形状之间的变化探索。
  • methods: 使用生成模型的高维度 latent space 进行探索,并寻找一个将这些 latent space 转换为可视化的2D探索空间的映射。然后,使用这个映射将变化在2D空间中,并将其转换为高质量这些形状的塑形场。
  • results: 可以实现视觉上可见且易于探索的2D探索空间,并将该空间转换为高质量这些形状之间的变化。对于一些形状 category 进行了评估,结果显示了与先前的学习塑形空间的比较,其可以实现更多的变化和更好的可视化。
    Abstract Exploring variations of 3D shapes is a time-consuming process in traditional 3D modeling tools. Deep generative models of 3D shapes often feature continuous latent spaces that can, in principle, be used to explore potential variations starting from a set of input shapes. In practice, doing so can be problematic: latent spaces are high dimensional and hard to visualize, contain shapes that are not relevant to the input shapes, and linear paths through them often lead to sub-optimal shape transitions. Furthermore, one would ideally be able to explore variations in the original high-quality meshes used to train the generative model, not its lower-quality output geometry. In this paper, we present a method to explore variations among a given set of landmark shapes by constructing a mapping from an easily-navigable 2D exploration space to a subspace of a pre-trained generative model. We first describe how to find a mapping that spans the set of input landmark shapes and exhibits smooth variations between them. We then show how to turn the variations in this subspace into deformation fields, to transfer those variations to high-quality meshes for the landmark shapes. Our results show that our method can produce visually-pleasing and easily-navigable 2D exploration spaces for several different shape categories, especially as compared to prior work on learning deformation spaces for 3D shapes.
    摘要 将文本翻译成简化中文。 traditional 3D 模型化工具中查找3D 形状的变化是一个时间消耗的过程。深度生成模型中的3D 形状通常具有连续的潜在空间,可以从输入形状开始探索潜在的变化。然而,在实践中,这可以是一个问题:潜在空间的维度很高,Difficult to visualize, contains shapes that are not relevant to the input shapes, and linear paths through them often lead to sub-optimal shape transitions. In this paper, we present a method to explore variations among a given set of landmark shapes by constructing a mapping from an easily-navigable 2D exploration space to a subspace of a pre-trained generative model. We first describe how to find a mapping that spans the set of input landmark shapes and exhibits smooth variations between them. We then show how to turn the variations in this subspace into deformation fields, to transfer those variations to high-quality meshes for the landmark shapes. Our results show that our method can produce visually-pleasing and easily-navigable 2D exploration spaces for several different shape categories, especially as compared to prior work on learning deformation spaces for 3D shapes.以下是翻译结果:传统的3D模型化工具中查找3D形状的变化是一个时间消耗的过程。深度生成模型中的3D形状通常具有连续的潜在空间,可以从输入形状开始探索潜在的变化。然而,在实践中,这可以是一个问题:潜在空间的维度很高,Difficult to visualize,contains shapes that are not relevant to the input shapes,和Linear paths through them often lead to sub-optimal shape transitions。在这篇论文中,我们提出了一种方法,通过从可探索的2D探索空间到预训练的生成模型的子空间中构建映射,以探索给定的标记形状中的变化。我们首先描述了如何找到一个涵盖输入标记形状的映射,并且在这个映射中展示了平滑的变化。然后,我们显示了如何将这个映射中的变化转换为塑形场,以传递这些变化到高质量的矩阵 для标记形状。我们的结果表明,我们的方法可以生成可观赏和易探索的2D探索空间,特别是与先前学习3D形状的塑形空间相比。

CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.07794
  • repo_url: None
  • paper_authors: Changhe Chen, Mozhgan Pourkeshavarz, Amir Rasouli
  • for: 本文提出了一个新的评估巡回预测方法的评估方程(CRITERIA),用于评估自动驾驶巡回预测模型的性能。
  • methods: 本文提出了以下方法:1)通过根据道路结构、模型性能和数据特性提取驾驶场景,实现细化的评估模型表现;2)基于实际驾驶约束,开发了不偏度的多元指标来衡量巡回预测模型的多样性和合法性。
  • results: 经过广泛的实验, authors发现提出的评估方程可以更准确地评估巡回预测模型的性能,并且可以作为评估模型行为的方式。 authors还进行了缺省研究,以阐明不同元素在计算提出的指标中的贡献。
    Abstract Benchmarking is a common method for evaluating trajectory prediction models for autonomous driving. Existing benchmarks rely on datasets, which are biased towards more common scenarios, such as cruising, and distance-based metrics that are computed by averaging over all scenarios. Following such a regiment provides a little insight into the properties of the models both in terms of how well they can handle different scenarios and how admissible and diverse their outputs are. There exist a number of complementary metrics designed to measure the admissibility and diversity of trajectories, however, they suffer from biases, such as length of trajectories. In this paper, we propose a new benChmarking paRadIgm for evaluaTing trajEctoRy predIction Approaches (CRITERIA). Particularly, we propose 1) a method for extracting driving scenarios at varying levels of specificity according to the structure of the roads, models' performance, and data properties for fine-grained ranking of prediction models; 2) A set of new bias-free metrics for measuring diversity, by incorporating the characteristics of a given scenario, and admissibility, by considering the structure of roads and kinematic compliancy, motivated by real-world driving constraints. 3) Using the proposed benchmark, we conduct extensive experimentation on a representative set of the prediction models using the large scale Argoverse dataset. We show that the proposed benchmark can produce a more accurate ranking of the models and serve as a means of characterizing their behavior. We further present ablation studies to highlight contributions of different elements that are used to compute the proposed metrics.
    摘要 《用于评估自动驾驶路径预测模型的benchmarking方法》是一个常见的评估方法。现有的benchmark围绕着常见的enario,如维持速度,集成了距离基于的metric,这些metric通过平均所有scenario来计算。然而,这些benchmark存在偏见,如scenario的长度。在这篇论文中,我们提出了一个新的benchmarking方法(CRITERIA),具有以下特点:1. 提取了不同级别的驾驶scenario,根据道路结构、模型性能和数据特性进行细化的排名预测模型。2. 提出了一组新的偏见自由的多元指标,通过考虑场景特点和道路结构来衡量多样性和合法性。3. 使用大规模的Argoverse dataset进行了广泛的实验,并证明了提出的benchmark可以更准确地排名模型,并且可以用来描述模型的行为。我们还进行了减少实验来 highlight提出的元素的贡献。

An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions

  • paper_url: http://arxiv.org/abs/2310.07782
  • repo_url: https://github.com/PurdueCAM2Project/focused-convolutions
  • paper_authors: Caleb Tung, Nicholas Eliopoulos, Purvish Jajal, Gowri Ramshankar, Chen-Yun Yang, Nicholas Synovic, Xuecen Zhang, Vipin Chaudhary, George K. Thiruvathukal, Yung-Hsiang Lu
  • for: 提高 Convolutional Neural Networks (CNNs) 的能效性,降低 computation 和能源成本。
  • methods: 提出一种自动化方法,通过插入一个阈值层来筛选之前层的活动,以实现更加精准地忽略图像中无关部分,从而降低推理延迟和能源成本,保持准确性。
  • results: 对多种流行的预训练 CNNs 进行了实验,发现该方法可以降低推理延迟(最多下降25%)和能源成本(最多下降22%),而且准确性几乎不受影响。
    Abstract Computer vision often uses highly accurate Convolutional Neural Networks (CNNs), but these deep learning models are associated with ever-increasing energy and computation requirements. Producing more energy-efficient CNNs often requires model training which can be cost-prohibitive. We propose a novel, automated method to make a pretrained CNN more energy-efficient without re-training. Given a pretrained CNN, we insert a threshold layer that filters activations from the preceding layers to identify regions of the image that are irrelevant, i.e. can be ignored by the following layers while maintaining accuracy. Our modified focused convolution operation saves inference latency (by up to 25%) and energy costs (by up to 22%) on various popular pretrained CNNs, with little to no loss in accuracy.
    摘要

3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.07781
  • repo_url: https://github.com/Beckschen/3D-TransUNet
  • paper_authors: Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, Matthew Lungren, Lei Xing, Le Lu, Alan Yuille, Yuyin Zhou
  • for: 这篇论文的目的是提出一个基于Transformer的医疗影像分类网络,以提高医疗影像分类的精度和效率。
  • methods: 这篇论文使用了Transformer的自注意力机制来补充U-Net的本地信息,以提高医疗影像分类的能力。具体来说,论文提出了两个关键的 ком成分:1)使用Transformer嵌入器将影像片段转换为Token,以EXTRACT全局背景信息;2)使用Transformer解oder适应地对候选区域进行修饰,通过跨andidate proposal和U-Net特征之间的相互注意力。
  • results: 论文的实验结果显示,不同的医疗任务可以从不同的架构设计中获得更好的效果。Transformer嵌入器在多器官分类任务中表现出色,而Transformer解oder则在较小且具有挑战性的分类目标,如肿瘤分类任务中表现更好。总的来说,结合Transformer-based嵌入器和解oder into U-Net架构可以提高医疗影像分类的精度和效率。
    Abstract Medical image segmentation plays a crucial role in advancing healthcare systems for disease diagnosis and treatment planning. The u-shaped architecture, popularly known as U-Net, has proven highly successful for various medical image segmentation tasks. However, U-Net's convolution-based operations inherently limit its ability to model long-range dependencies effectively. To address these limitations, researchers have turned to Transformers, renowned for their global self-attention mechanisms, as alternative architectures. One popular network is our previous TransUNet, which leverages Transformers' self-attention to complement U-Net's localized information with the global context. In this paper, we extend the 2D TransUNet architecture to a 3D network by building upon the state-of-the-art nnU-Net architecture, and fully exploring Transformers' potential in both the encoder and decoder design. We introduce two key components: 1) A Transformer encoder that tokenizes image patches from a convolution neural network (CNN) feature map, enabling the extraction of global contexts, and 2) A Transformer decoder that adaptively refines candidate regions by utilizing cross-attention between candidate proposals and U-Net features. Our investigations reveal that different medical tasks benefit from distinct architectural designs. The Transformer encoder excels in multi-organ segmentation, where the relationship among organs is crucial. On the other hand, the Transformer decoder proves more beneficial for dealing with small and challenging segmented targets such as tumor segmentation. Extensive experiments showcase the significant potential of integrating a Transformer-based encoder and decoder into the u-shaped medical image segmentation architecture. TransUNet outperforms competitors in various medical applications.
    摘要 医疗影像分割对于提高医疗系统的疾病诊断和治疗规划具有关键作用。U-Net建筑,也称为U-shaped architecture,在各种医疗影像分割任务中表现出了极高的成功率。然而,U-Net的卷积操作自然地限制了它的长距离依赖性模型能力。为了解决这些限制,研究人员转向了Transformers,这种global self-attention机制的知名网络。在这篇论文中,我们扩展了2D TransUNet架构到3D网络,基于state-of-the-art nnU-Net架构,并充分发挥Transformers的潜力在encoder和decoder设计中。我们介绍了两个关键组件:1)使用CNN特征图像块 Tokenizer,从CNN特征图像中提取全局上下文,2)使用交叉关注来适应候选区域的精度调整。我们的调查表明,不同的医疗任务需要不同的架构设计。Transformer encoder在多器官分割任务中表现出色,因为器官之间的关系非常重要。然而,Transformer decoder在处理小型和复杂分割目标,如肿瘤分割任务中表现更出色。我们的实验结果表明,将Transformer-based encoder和decoder与U-shaped医疗影像分割架构结合使用,可以提高TransUNet的性能,并在各种医疗应用中超越竞争对手。

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

  • paper_url: http://arxiv.org/abs/2310.07749
  • repo_url: None
  • paper_authors: Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo
  • for: 该论文探讨了一个开放领域图文生成任务,该任务通过输入查询生成杂合的图文内容。
  • methods: 该论文提出了一个基于大型自然语言模型(LLM)和预训练文本到图像(T2I)模型的新杂合生成框架,名为OpenLEAF。该框架中的LLM生成文本描述,协调T2I模型,创建视觉提示图像生成,并将全局上下文 integrate 到T2I模型中。
  • results: 根据我们构建的评估集,使用大型多modal模型(LMM)评估Entity和Style一致性,我们的提出的杂合生成框架可以在不同领域和应用中生成高质量的图文内容,如问答、故事、图文重新写作、宣传品等。此外,我们验证了我们提出的LMM评估技术的有效性。
    Abstract This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models, creates visual prompts for generating images, and incorporates global contexts into the T2I models. This global context improves the entity and style consistencies of images in the interleaved generation. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences. According to the LMM evaluation on our constructed evaluation set, the proposed interleaved generation framework can generate high-quality image-text content for various domains and applications, such as how-to question answering, storytelling, graphical story rewriting, and webpage/poster generation tasks. Moreover, we validate the effectiveness of the proposed LMM evaluation technique with human assessment. We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.
    摘要 Translation notes:* "open-domain" is translated as "开放领域" (kāifàng lǐngyè)* "interleaved" is translated as "交错" (jiāo chá)* "image-text generation" is translated as "图文生成" (túwén shēngchǎng)* "LLMs" is translated as "大型语言模型" (dàxìng yǔyán módelì)* "T2I models" is translated as "文本到图像模型" (wén tiě dào tú xiàng módelì)* "global context" is translated as "全局上下文" (quán jí shàng xiàng)* "entity and style consistencies" is translated as "实体和风格一致性" (shíwù hé fēng gé yīchāngxìng)* "LMMs" is translated as "大型多模态模型" (dàxìng duō móduō módelì)* "evaluation set" is translated as "评估集" (píngjǐ zhù)* "how-to question answering" is translated as "如何问答" (rúhěn wèn dá)* "storytelling" is translated as "故事告诉" (gùshì gào shuō)* "graphical story rewriting" is translated as "图文故事重写" (túwén gùshì zhòngxī)* "webpage/poster generation" is translated as "网页/海报生成" (wǎng jiāng/hǎi bào shēngchǎng)

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.07702
  • repo_url: https://github.com/yingqinghe/scalecrafter
  • paper_authors: Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan
  • for: 本研究探讨了使用预训练的扩散模型在更高的分辨率上生成图像,并且图像的方向比例可以是任意的。
  • methods: 我们提出了一种简单 yet有效的重尺刻法,可以在推理时动态调整卷积核心的见觉范围,以解决预训练模型中的问题。我们还提出了分散卷积和噪声抑制的自由导向方法,可以实现超高分辨率图像生成(例如4096 x 4096)。
  • results: 我们的方法可以很好地解决重复问题,并且在高分辨率图像生成中达到了状态的表现。我们的方法不需要任何训练或优化。广泛的实验表明,我们的方法可以在高分辨率图像生成中很好地保持细节Texture。
    Abstract In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
    摘要 在这项研究中,我们 investigate了使用预训练的扩散模型生成高分辨率图像。此外,生成的图像应该有任意的图像方向比。当直接在更高的分辨率1024x1024上生成图像,使用预训练的稳定扩散模型,我们观察到了持续的 объек repeating和不合理的对象结构问题。现有的高分辨率生成方法,如注意力基的方法和联合扩散方法,无法好解这些问题。作为一个新的视角,我们分析了扩散模型中的结构组件,并确定了归一化核心的局限性为主要原因。基于这一关键观察,我们提出了一种简单又有效的重定义,可以在推理过程中动态调整归一化核心的见觉范围。我们还提出了散布核心和噪声抑制的类别器-自由导航,可以实现ultra-高分辨率图像生成(例如4096x4096)。需要注意的是,我们的方法不需要任何的训练或优化。广泛的实验表明,我们的方法可以很好地解决重复问题,并在高分辨率图像生成中 achieved state-of-the-art表现,特别是在тексту册细节方面。我们的工作还表明了一个预训练的扩散模型可以直接在高分辨率图像生成中使用,无需进一步调整,这可能提供了未来研究高分辨率图像和视频生成的灵感。

ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2310.07697
  • repo_url: None
  • paper_authors: Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
  • for: 文章主要针对的问题是如何通过提供条件、视频和输入文本,生成高质量的动态视频。
  • methods: 本文提出了一种无需训练的文本到视频生成方法,基于现有的文本到图像生成方法(如稳定扩散),并通过分解动态场景的运动表示来提高生成的 temporal coherence。
  • results: 对比其他方法,本文的方法在Frame consistency、clip score和 conditional accuracy等指标上表现出色,得到了更高的性能。
    Abstract Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D controlnet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
    摘要 近期研究已成功扩展大规模文本到视频领域,但计算成本高并需大量视频数据。在这项工作中,我们介绍ConditionVideo,一种无需训练的文本到视频生成方法,基于给定的条件、视频和输入文本,通过利用市场上已有的文本到图像生成方法(如稳定扩散)。ConditionVideo可生成真实的动态视频从随机噪声或给定的场景视频。我们的方法明确分离动作表示为条件导向的动作组件和场景动作组件。为此,ConditionVideo模型采用了UNet分支和控制分支。为了改进时间准确性,我们引入罕见的双向时空注意力(sBiST-Attn)。三维控制网络将传统的二维控制网络模型扩展到三维空间,以更好地利用时间领域的双向帧。我们的方法在帧一致性、clip分数和条件准确性方面表现出色,超越其他比较方法。

Orbital Polarimetric Tomography of a Flare Near the Sagittarius A* Supermassive Black Hole

  • paper_url: http://arxiv.org/abs/2310.07687
  • repo_url: None
  • paper_authors: Aviad Levis, Andrew A. Chael, Katherine L. Bouman, Maciek Wielgus, Pratul P. Srinivasan
  • for: 这个研究的目的是理解黑洞吸收过程中的高能射线飞亮现象。
  • methods: 这个研究使用了人工神经网络三维表示(一种emergent artificial intelligence方法)和黑洞重力模型来解决高度不定Posing的Tomography问题,以恢复黑洞中央的辐射强度。
  • results: 研究结果表明,在2017年4月11日ALMA数据中存在一个位于黑洞事件 horizonto的紧凑的明亮区域,距离黑洞6倍,并且在低倾斜的orbital平面上进行clockwise旋转。这些结果与以往的EHT和GRAVITY合作的研究结果一致。
    Abstract The interaction between the supermassive black hole at the center of the Milky Way, Sagittarius A$^*$, and its accretion disk, occasionally produces high energy flares seen in X-ray, infrared and radio. One mechanism for observed flares is the formation of compact bright regions that appear within the accretion disk and close to the event horizon. Understanding these flares can provide a window into black hole accretion processes. Although sophisticated simulations predict the formation of these flares, their structure has yet to be recovered by observations. Here we show the first three-dimensional (3D) reconstruction of an emission flare in orbit recovered from ALMA light curves observed on April 11, 2017. Our recovery results show compact bright regions at a distance of roughly 6 times the event horizon. Moreover, our recovery suggests a clockwise rotation in a low-inclination orbital plane, a result consistent with prior studies by EHT and GRAVITY collaborations. To recover this emission structure we solve a highly ill-posed tomography problem by integrating a neural 3D representation (an emergent artificial intelligence approach for 3D reconstruction) with a gravitational model for black holes. Although the recovered 3D structure is subject, and sometimes sensitive, to the model assumptions, under physically motivated choices we find that our results are stable and our approach is successful on simulated data. We anticipate that in the future, this approach could be used to analyze a richer collection of time-series data that could shed light on the mechanisms governing black hole and plasma dynamics.
    摘要 黑洞的中心星系银河系的超大质量黑洞,Sagittarius A*,与其吸积盘 occasionally produce high energy flares visible in X-ray, infrared and radio。一种导致这些闪光的机制是在吸积盘中形成 Compact bright regions,这些区域位于黑洞事件 horizons 附近。理解这些闪光可以提供黑洞吸积过程的窗口。虽然先进的仿真 simulation 预测了这些闪光的形成,但它们的结构还没有由观测所回收。在这里,我们展示了由 ALMA 光谱 curves 于2017年4月11日观测到的首次三维 (3D) 重建的发射闪光。我们的重建结果表明 Compact bright regions 的距离约为黑洞事件 horizon 的6倍。此外,我们的重建结果还表明这些闪光在低倾斜的 орбиталь平面上以 clockwise 方向旋转,这与 prior studies by EHT 和 GRAVITY collaborations 的结果相符。为了重建这些发射结构,我们解决了一个高度不定 проблеma tomography 问题,通过结合人工智能 Representation (一种 emergent artificial intelligence 方法)和 gravitational model for black holes 来做。虽然recovered 3D structure 受到模型假设的影响,但在物理上有理由的选择下,我们发现结果是稳定的,并且在 simulated data 上成功。未来,我们预计这种方法可以用来分析更加丰富的时间序列数据,以了解黑洞和气体动力学的机制。

Prediction of MET Overexpression in Non-Small Cell Lung Adenocarcinomas from Hematoxylin and Eosin Images

  • paper_url: http://arxiv.org/abs/2310.07682
  • repo_url: None
  • paper_authors: Kshitij Ingale, Sun Hae Hong, Josh S. K. Bell, Abbas Rizvi, Amy Welch, Lingdao Sha, Irvin Ho, Kunal Nagpal, Aicha BenTaieb, Rohan P Joshi, Martin C Stumpe
    for: 这个研究旨在开发一种使用 routinely available digitized H&E 染色片预测 MET 蛋白过表达的算法,以便为 NSCLC 患者预测是否有可能获得 ME 蛋白或 ME 基因表达状态的诊断。methods: 这个研究使用了一个大型的匹配 H&E 染色片和 RNA 表达数据库来训练一种弱级超vised 模型,以直接从 H&E 图像中预测 MET RNA 过表达。results: 这个模型在一个独立的占据测试集上进行了评估,并达到了 ROC-AUC 0.70(95% 信息interval:0.66-0.74)的稳定性特征,并且在不同的患者临床变量下表现稳定,并且对于synthetic 噪声进行了Robust性测试。这些结果表明,H&E 基本的预测模型可能是一种有用的工具,以便为 NSCLC 患者预测 ME 蛋白或 ME 基因表达状态的可能性。
    Abstract MET protein overexpression is a targetable event in non-small cell lung cancer (NSCLC) and is the subject of active drug development. Challenges in identifying patients for these therapies include lack of access to validated testing, such as standardized immunohistochemistry (IHC) assessment, and consumption of valuable tissue for a single gene/protein assay. Development of pre-screening algorithms using routinely available digitized hematoxylin and eosin (H&E)-stained slides to predict MET overexpression could promote testing for those who will benefit most. While assessment of MET expression using IHC is currently not routinely performed in NSCLC, next-generation sequencing is common and in some cases includes RNA expression panel testing. In this work, we leveraged a large database of matched H&E slides and RNA expression data to train a weakly supervised model to predict MET RNA overexpression directly from H&E images. This model was evaluated on an independent holdout test set of 300 over-expressed and 289 normal patients, demonstrating an ROC-AUC of 0.70 (95th percentile interval: 0.66 - 0.74) with stable performance characteristics across different patient clinical variables and robust to synthetic noise on the test set. These results suggest that H&E-based predictive models could be useful to prioritize patients for confirmatory testing of MET protein or MET gene expression status.
    摘要 MET蛋白过表达是非小细胞肺癌(NSCLC)中可target的事件,目前有活跃的药物开发。检测patient中MET蛋白过表达的挑战包括lack of access to validated testing, such as standardized immunohistochemistry (IHC) assessment, and consumption of valuable tissue for a single gene/protein assay。Development of pre-screening algorithms using routinely available digitized hematoxylin and eosin (H&E)-stained slides to predict MET overexpression could promote testing for those who will benefit most. Although assessment of MET expression using IHC is currently not routinely performed in NSCLC, next-generation sequencing is common and in some cases includes RNA expression panel testing. In this work, we leveraged a large database of matched H&E slides and RNA expression data to train a weakly supervised model to predict MET RNA overexpression directly from H&E images. This model was evaluated on an independent holdout test set of 300 over-expressed and 289 normal patients, demonstrating an ROC-AUC of 0.70 (95th percentile interval: 0.66 - 0.74) with stable performance characteristics across different patient clinical variables and robust to synthetic noise on the test set. These results suggest that H&E-based predictive models could be useful to prioritize patients for confirmatory testing of MET protein or MET gene expression status.

Accelerating Vision Transformers Based on Heterogeneous Attention Patterns

  • paper_url: http://arxiv.org/abs/2310.07664
  • repo_url: None
  • paper_authors: Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
  • for: 提高 ViT 的运行速度,以便在计算复杂的自注意 Mechanism 上进行减少计算量。
  • methods: 提出了一个集成压缩管道,包括动态指导静止自注意 (DGSSA) 方法和全球聚合 pyramid (GLAD) 方法,以减少 ViT 的计算量。
  • results: 实验表明,该集成压缩管道可以提高 ViT 的运行速度,相比 DeiT 提高了121% 的运行throughput,超过了所有 SOTA 方法。
    Abstract Recently, Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision. Generally, the powerful representative capacity of ViTs mainly benefits from the self-attention mechanism, which has a high computation complexity. To accelerate ViTs, we propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers. On one hand, different images share more similar attention patterns in early layers than later layers, indicating that the dynamic query-by-key self-attention matrix may be replaced with a static self-attention matrix in early layers. Then, we propose a dynamic-guided static self-attention (DGSSA) method where the matrix inherits self-attention information from the replaced dynamic self-attention to effectively improve the feature representation ability of ViTs. On the other hand, the attention maps have more low-rank patterns, which reflect token redundancy, in later layers than early layers. In a view of linear dimension reduction, we further propose a method of global aggregation pyramid (GLAD) to reduce the number of tokens in later layers of ViTs, such as Deit. Experimentally, the integrated compression pipeline of DGSSA and GLAD can accelerate up to 121% run-time throughput compared with DeiT, which surpasses all SOTA approaches.
    摘要 近期,视transformer(ViTs)在计算机视觉领域引起了很多关注。通常,ViTs的强大表现capacity mainly benefits from the self-attention mechanism, which has high computation complexity. To accelerate ViTs, we propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers. On one hand, different images share more similar attention patterns in early layers than later layers, indicating that the dynamic query-by-key self-attention matrix may be replaced with a static self-attention matrix in early layers. Then, we propose a dynamic-guided static self-attention (DGSSA) method where the matrix inherits self-attention information from the replaced dynamic self-attention to effectively improve the feature representation ability of ViTs. On the other hand, the attention maps have more low-rank patterns, which reflect token redundancy, in later layers than early layers. In a view of linear dimension reduction, we further propose a method of global aggregation pyramid (GLAD) to reduce the number of tokens in later layers of ViTs, such as Deit. Experimentally, the integrated compression pipeline of DGSSA and GLAD can accelerate up to 121% run-time throughput compared with DeiT, which surpasses all SOTA approaches.

Deep Video Inpainting Guided by Audio-Visual Self-Supervision

  • paper_url: http://arxiv.org/abs/2310.07663
  • repo_url: https://github.com/kyuyeonpooh/Audio-Visual-Deep-Video-Inpainting
  • paper_authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon
  • for: 提高视频填充质量
  • methods: 使用深度学习模型借鉴人类的听觉视觉对应知识,实现视频填充
  • results: 实验结果表明,提出的方法可以更广泛地恢复视频场景,特别是在听觉对象在场景中部分遮盖时表现出色。
    Abstract Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To implement the prior knowledge, we first train the audio-visual network, which learns the correspondence between auditory and visual information. Then, the audio-visual network is employed as a guider that conveys the prior knowledge of audio-visual correspondence to the video inpainting network. This prior knowledge is transferred through our proposed two novel losses: audio-visual attention loss and audio-visual pseudo-class consistency loss. These two losses further improve the performance of the video inpainting by encouraging the inpainting result to have a high correspondence to its synchronized audio. Experimental results demonstrate that our proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially blinded.
    摘要 人类可以轻松地从听音信息中想象场景,在这篇论文中,我们模仿人类的 innate 能力,在深度学习模型中提高视频填充质量。为了实现先前知识,我们首先训练了 audio-visual 网络,该网络学习听音和视觉信息之间的对应关系。然后,我们使用我们提出的两种新的损失函数:听音视觉注意力损失和听音视觉假类一致损失。这两种损失函数进一步提高了视频填充的性能,使得填充结果具有高度对应于其同步的听音。实验结果表明,我们提出的方法可以恢复更广泛的视频场景,并在听音对象在场景中部分遮盲时特别有效。

Context-Enhanced Detector For Building Detection From Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2310.07638
  • repo_url: None
  • paper_authors: Ziyue Huang, Mingming Zhang, Qingjie Liu, Wei Wang, Zhe Dong, Yunhong Wang
  • for: 提高遥感图像中建筑物检测精度, Addressing the challenges of building detection in remote sensing images.
  • methods: 提出一种新的Context-Enhanced Detector(CEDet)方法,包括三个阶段堆式结构,使用Semantic Guided Contextual Mining(SGCM)模块和Instance Context Mining Module(ICMM)两个模块,并使用Semantic segmentation loss基于pseudo-masks来导引上下文信息抽取。
  • results: 在三个建筑物检测标准测试集上达到了状态 искусственный智能性能,包括CNBuilding-9P、CNBuilding-23P和SpaceNet。
    Abstract The field of building detection from remote sensing images has made significant progress, but faces challenges in achieving high-accuracy detection due to the diversity in building appearances and the complexity of vast scenes. To address these challenges, we propose a novel approach called Context-Enhanced Detector (CEDet). Our approach utilizes a three-stage cascade structure to enhance the extraction of contextual information and improve building detection accuracy. Specifically, we introduce two modules: the Semantic Guided Contextual Mining (SGCM) module, which aggregates multi-scale contexts and incorporates an attention mechanism to capture long-range interactions, and the Instance Context Mining Module (ICMM), which captures instance-level relationship context by constructing a spatial relationship graph and aggregating instance features. Additionally, we introduce a semantic segmentation loss based on pseudo-masks to guide contextual information extraction. Our method achieves state-of-the-art performance on three building detection benchmarks, including CNBuilding-9P, CNBuilding-23P, and SpaceNet.
    摘要 隐身图像建筑检测领域已经做出了重要进展,但是因为建筑物的多样性和场景的复杂性,高精度检测仍然面临挑战。为解决这些挑战,我们提出了一种新的方法 called Context-Enhanced Detector (CEDet)。我们的方法采用三stage阶段结构来增强 Contextual information的提取和改善建筑物检测精度。特别是,我们引入了两个模块:Semantic Guided Contextual Mining (SGCM)模块和 Instance Context Mining Module (ICMM)。SGCM模块通过不同级别的上下文进行聚合和使用注意力机制来捕捉长距离相互作用,而ICMM模块则通过构建空间关系图和聚合实例特征来捕捉实例水平的关系上下文。此外,我们还引入了基于 Pseudo-masks的 semantic segmentation loss来引导上下文信息提取。我们的方法在三个建筑物检测标准 bencmarks,包括 CNBuilding-9P、CNBuilding-23P 和 SpaceNet 上达到了当前领域的状态态��表现。

Attention-Map Augmentation for Hypercomplex Breast Cancer Classification

  • paper_url: http://arxiv.org/abs/2310.07633
  • repo_url: None
  • paper_authors: Eleonora Lopez, Filippo Betello, Federico Carmignani, Eleonora Grassucci, Danilo Comminiello
  • for: 提高乳腺癌早期诊断性能
  • methods: 使用 Parameterized Hypercomplex Attention Maps (PHAM) 框架,包括图像增强步骤和扩展步骤,以及使用Parameterized Hypercomplex Neural Network (PHNN) 进行乳腺癌分类
  • results: 在环境中覆盖率高,超过了关注基于网络的现状和实数值对应的方法的表现,并在医疗图像和 histopathological 图像上进行了证明
    Abstract Breast cancer is the most widespread neoplasm among women and early detection of this disease is critical. Deep learning techniques have become of great interest to improve diagnostic performance. Nonetheless, discriminating between malignant and benign masses from whole mammograms remains challenging due to them being almost identical to an untrained eye and the region of interest (ROI) occupying a minuscule portion of the entire image. In this paper, we propose a framework, parameterized hypercomplex attention maps (PHAM), to overcome these problems. Specifically, we deploy an augmentation step based on computing attention maps. Then, the attention maps are used to condition the classification step by constructing a multi-dimensional input comprised of the original breast cancer image and the corresponding attention map. In this step, a parameterized hypercomplex neural network (PHNN) is employed to perform breast cancer classification. The framework offers two main advantages. First, attention maps provide critical information regarding the ROI and allow the neural model to concentrate on it. Second, the hypercomplex architecture has the ability to model local relations between input dimensions thanks to hypercomplex algebra rules, thus properly exploiting the information provided by the attention map. We demonstrate the efficacy of the proposed framework on both mammography images as well as histopathological ones, surpassing attention-based state-of-the-art networks and the real-valued counterpart of our method. The code of our work is available at https://github.com/elelo22/AttentionBCS.
    摘要 乳癌是女性最常见的肿瘤,早期发现这种疾病非常重要。深度学习技术在提高诊断性能方面表现出了极大的兴趣。然而,从整个照片中分别识别癌变和良性肿瘤仍然是一项极其困难的任务,因为它们在无经验的眼里看起来几乎一样,而识别区域(ROI)占整个照片的一个非常小的部分。在这篇论文中,我们提出了一个框架,即卷积注意地图(PHAM)。我们在这个框架中使用了一个增强步骤,基于计算注意地图。然后,我们使用了这些注意地图来控制分类步骤,通过构建一个多维输入,其中包括原始乳癌照片和相应的注意地图。在这个步骤中,我们使用了一个参数化的高维复杂神经网络(PHNN)来进行乳癌分类。我们的框架具有两个主要优点。首先,注意地图提供了ROI的重要信息,使神经网络能够专注于它。其次,高维复杂架构可以通过高维复杂代数规则,正确利用注意地图提供的信息。我们在照片和 histopathological 图像上进行了实验,超过了注意力基于状态最佳网络和实值对应的我们方法。我们的代码可以在 GitHub 上找到。

Prompt Backdoors in Visual Prompt Learning

  • paper_url: http://arxiv.org/abs/2310.07632
  • repo_url: None
  • paper_authors: Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
  • for: 这篇论文目的是探讨大型预训模型的精细化是否可行,以及Visual Prompt Learning(VPL)是一种可能的替代方案。
  • methods: 这篇论文使用了Visual Prompt as a Service(VPPTaaS),即提供者可以供给用户一个可读的显示图像,并让用户使用这个图像和大型预训模型进行预测。
  • results: 这篇论文发现了VPL中的一个新的安全风险,即BadVisualPrompt,这是一种可以透过攻击提供者提供的显示图像来控制模型的攻击。 Specifically, 这篇论文发现了一个新的技术挑战,即显示图像触发器和显示图像之间的互动,这不同于传统的模型水平的后门攻击。
    Abstract Fine-tuning large pre-trained computer vision models is infeasible for resource-limited users. Visual prompt learning (VPL) has thus emerged to provide an efficient and flexible alternative to model fine-tuning through Visual Prompt as a Service (VPPTaaS). Specifically, the VPPTaaS provider optimizes a visual prompt given downstream data, and downstream users can use this prompt together with the large pre-trained model for prediction. However, this new learning paradigm may also pose security risks when the VPPTaaS provider instead provides a malicious visual prompt. In this paper, we take the first step to explore such risks through the lens of backdoor attacks. Specifically, we propose BadVisualPrompt, a simple yet effective backdoor attack against VPL. For example, poisoning $5\%$ CIFAR10 training data leads to above $99\%$ attack success rates with only negligible model accuracy drop by $1.5\%$. In particular, we identify and then address a new technical challenge related to interactions between the backdoor trigger and visual prompt, which does not exist in conventional, model-level backdoors. Moreover, we provide in-depth analyses of seven backdoor defenses from model, prompt, and input levels. Overall, all these defenses are either ineffective or impractical to mitigate our BadVisualPrompt, implying the critical vulnerability of VPL.
    摘要 大型预训计算机视觉模型的细调是对资源有限的用户来说不可能进行。因此,视觉提示学习(VPL)已经出现以提供一种高效和灵活的替代方案。具体来说,VPL提供者将Visual Prompt as a Service(VPPTaaS)优化一个视觉提示,并且用户可以使用这个提示和大型预训模型进行预测。然而,这种新的学习模式可能也会带来安全风险,当VPPTAaaS提供商而不是提供正确的视觉提示。在这篇论文中,我们通过镜头攻击的视角来探讨这些风险。具体来说,我们提出了BadVisualPrompt,一种简单 yet effective的镜头攻击方法,可以让攻击者在只需要负担5%的CIFAR10训练数据上成功率高于99%,而且只带来模型准确率的1.5%下降。我们还发现了一个新的技术挑战,即镜头触发器和视觉提示之间的交互问题,这个问题不存在于传统的模型级镜头攻击中。此外,我们还提供了七种防御策略的深入分析,包括模型、提示和输入级防御策略。总之,这些防御策略都是无效或不实际的,表明VPL具有极高的漏洞。

Dual Radar: A Multi-modal Dataset with Dual 4D Radar for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.07602
  • repo_url: https://github.com/adept-thu/dual-radar
  • paper_authors: Xinyu Zhang, Li Wang, Jian Chen, Cheng Fang, Lei Yang, Ziying Song, Guangqi Yang, Yichen Wang, Xiaofei Zhang, Qingshan Yang, Jun Li
  • For: The paper is written for the purpose of introducing a novel large-scale multi-modal dataset for studying effective 4D radar perception algorithms in autonomous driving.* Methods: The paper uses two types of 4D radars captured simultaneously to create a novel dataset, which consists of 151 consecutive series, most of which last 20 seconds and contain 10,007 meticulously synchronized and annotated frames.* Results: The paper experimentally validates the dataset and provides valuable results for studying different types of 4D radars.Here is the information in Simplified Chinese text:* For: 本文是为了介绍一个新的大规模多模式数据集,用于研究自动驾驶4D радиar感知算法的有效性。* Methods: 本文使用了两种同时捕获的4D радиar来创建一个新的数据集,该数据集包括151个连续的系列,每个系列持续20秒钟,共计10,007幅 preciselly同步和标注的帧。* Results: 本文对数据集进行了实验 validate,并提供了对不同类型4D радиar的有价值的结果。
    Abstract Radar has stronger adaptability in adverse scenarios for autonomous driving environmental perception compared to widely adopted cameras and LiDARs. Compared with commonly used 3D radars, the latest 4D radars have precise vertical resolution and higher point cloud density, making it a highly promising sensor for autonomous driving in complex environmental perception. However, due to the much higher noise than LiDAR, manufacturers choose different filtering strategies, resulting in an inverse ratio between noise level and point cloud density. There is still a lack of comparative analysis on which method is beneficial for deep learning-based perception algorithms in autonomous driving. One of the main reasons is that current datasets only adopt one type of 4D radar, making it difficult to compare different 4D radars in the same scene. Therefore, in this paper, we introduce a novel large-scale multi-modal dataset featuring, for the first time, two types of 4D radars captured simultaneously. This dataset enables further research into effective 4D radar perception algorithms.Our dataset consists of 151 consecutive series, most of which last 20 seconds and contain 10,007 meticulously synchronized and annotated frames. Moreover, our dataset captures a variety of challenging driving scenarios, including many road conditions, weather conditions, nighttime and daytime with different lighting intensities and periods. Our dataset annotates consecutive frames, which can be applied to 3D object detection and tracking, and also supports the study of multi-modal tasks. We experimentally validate our dataset, providing valuable results for studying different types of 4D radars. This dataset is released on https://github.com/adept-thu/Dual-Radar.
    摘要 雷达在自动驾驶环境感知方面具有更强的适应能力,比较广泛使用的相机和激光雷达更具有优势。与常见的3D雷达相比,最新的4D雷达具有高精度的垂直分辨率和更高的点云密度,使其成为自动驾驶复杂环境感知的非常有前途的传感器。然而,由于雷达的噪声比激光更高,制造商们采用不同的筛选策略,导致点云密度与噪声水平之间存在反比关系。到目前为止,没有对不同筛选策略对深度学习基于感知算法的比较分析。这是因为当前数据集只采用了一种类型的4D雷达,使其不可以在同一场景中比较不同的4D雷达。因此,在本文中,我们引入了一个新的大规模多模态数据集,其中包含了两种类型的4D雷达同时捕获的数据。这个数据集允许进一步研究4D雷达感知算法。我们的数据集包含151个连续时间序列,大多数时间长达20秒,包含10,007幅高精度同步 annotated frames。此外,我们的数据集涵盖了许多挑战性的驾驶场景,包括不同的路面条件、天气条件、夜间和白天不同的照明强度和时间。我们的数据集将 consecutively frames annotated,可以应用于3D объек体检测和跟踪,同时也支持多模态任务的研究。我们在实验 validate了我们的数据集,提供了价值的结果,用于研究不同类型的4D雷达。这个数据集在https://github.com/adept-thu/Dual-Radar上发布。

PeP: a Point enhanced Painting method for unified point cloud tasks

  • paper_url: http://arxiv.org/abs/2310.07591
  • repo_url: None
  • paper_authors: Zichao Dong, Hang Ji, Xufeng Huang, Weikun Zhang, Xin Zhan, Junbo Chen
  • for: 提高点云识别的性能,提供更好的输入参数 для下游模块。
  • methods: 提出了一种新的PeP模块,包括改进点绘制方法和LM基于的点编码器。
  • results: 在nuScenes和KITTI数据集上进行了实验,并证明了我们的PeP模块在semantic segmentation和物体检测方面具有强大表现,包括单点云和多模态设置下的情况。
    Abstract Point encoder is of vital importance for point cloud recognition. As the very beginning step of whole model pipeline, adding features from diverse sources and providing stronger feature encoding mechanism would provide better input for downstream modules. In our work, we proposed a novel PeP module to tackle above issue. PeP contains two main parts, a refined point painting method and a LM-based point encoder. Experiments results on the nuScenes and KITTI datasets validate the superior performance of our PeP. The advantages leads to strong performance on both semantic segmentation and object detection, in both lidar and multi-modal settings. Notably, our PeP module is model agnostic and plug-and-play. Our code will be publicly available soon.
    摘要 <>将文本翻译成简化中文。<>点编码器对点云识别具有重要的重要性。作为整个模型管道的开头步骤,从多种来源添加特征并提供更强的特征编码机制可以为下游模块提供更好的输入。在我们的工作中,我们提出了一种新的PeP模块来解决上述问题。PeP包括两个主要部分:一种精度调整的点绘制方法和一种LM基于的点编码器。在nuScenes和KITTI数据集上进行的实验结果表明了我们的PeP模块在 semantic segmentation和对象检测中具有出色的表现,包括雷达和多模式下的情况。尤其是,我们的PeP模块是模型无关的和可插入的。我们的代码将很快地公开。

A Discrepancy Aware Framework for Robust Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.07585
  • repo_url: https://github.com/caiyuxuan1120/daf
  • paper_authors: Yuxuan Cai, Dingkang Liang, Dongliang Luo, Xinwei He, Xin Yang, Xiang Bai
  • for: 本研究旨在探讨自然语言处理中的异常检测问题,尤其是使用自然语言生成的数据进行自监学习。
  • methods: 我们提出了一种异常检测方法,即异常检测框架(DAF),它可以在不同的异常检测任务中表现出较好的 Robustness。
  • results: 我们在两个复杂的数据集上进行了广泛的实验,并证明了我们的方法可以在使用简单的生成策略时表现出较好的性能,并且在异常检测任务中实现了state-of-the-art的本地化性能。
    Abstract Defect detection is a critical research area in artificial intelligence. Recently, synthetic data-based self-supervised learning has shown great potential on this task. Although many sophisticated synthesizing strategies exist, little research has been done to investigate the robustness of models when faced with different strategies. In this paper, we focus on this issue and find that existing methods are highly sensitive to them. To alleviate this issue, we present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies across different anomaly detection benchmarks. We hypothesize that the high sensitivity to synthetic data of existing self-supervised methods arises from their heavy reliance on the visual appearance of synthetic data during decoding. In contrast, our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance. To this end, inspired by existing knowledge distillation methods, we employ a teacher-student network, which is trained based on synthesized outliers, to compute the discrepancy map as the cue. Extensive experiments on two challenging datasets prove the robustness of our method. Under the simple synthesis strategies, it outperforms existing methods by a large margin. Furthermore, it also achieves the state-of-the-art localization performance. Code is available at: https://github.com/caiyuxuan1120/DAF.
    摘要 <>转换文本到简化中文。<>人工智能中的缺陷检测是一个关键的研究领域。最近,基于合成数据的自主学习已经表现出了很大的潜力。虽然有很多复杂的合成策略,但是对模型对不同策略的Robustness还需要更多的研究。在这篇论文中,我们将关注这个问题,并发现现有的方法对不同策略非常敏感。为了解决这个问题,我们提出了一种不同策略意识框架(DAF),它在不同的缺陷检测 bencmarks 中表现出了稳定性和简单性。我们认为现有方法对合成数据的视觉出现非常依赖,而我们的方法借鉴了现有的知识填充方法,通过计算不同策略下的异常映射来避免依赖于合成数据的视觉。为此,我们采用了一种教师生网络,通过合成异常数据来计算异常映射。我们在两个复杂的数据集上进行了广泛的实验,结果表明我们的方法在简单的合成策略下表现出了明显的优势,并且也达到了当前的最佳本地化性能。代码可以在 GitHub 上找到:https://github.com/caiyuxuan1120/DAF。

Centrality of the Fingerprint Core Location

  • paper_url: http://arxiv.org/abs/2310.07584
  • repo_url: None
  • paper_authors: Laurenz Ruzicka, Bernhard Strobl, Bernhard Kohn, Clemens Heitzinger
  • for: 这个研究的目的是分析和提高指纹识别的方法,特别是研究指纹核心的分布。
  • methods: 该研究使用了rolled fingerprint recordings和plain fingerprint recordings的大量数据集,并使用了empirical distribution的方法来分析指纹核心的分布。
  • results: 研究发现,rolling fingerprint recordings中核心的位置与指纹中心的偏差为5.7% $\pm$ 5.2%到7.6% $\pm$ 6.9%,而plain fingerprint recordings中核心的位置遵循正态分布。此外,研究还发现,NFIQ 2 预测器偏爱rolled fingerprint recordings中核心处于指纹中心下方的位置。
    Abstract Fingerprints have long been recognized as a unique and reliable means of personal identification. Central to the analysis and enhancement of fingerprints is the concept of the fingerprint core. Although the location of the core is used in many applications, to the best of our knowledge, this study is the first to investigate the empirical distribution of the core over a large, combined dataset of rolled, as well as plain fingerprint recordings. We identify and investigate the extent of incomplete rolling during the rolled fingerprint acquisition and investigate the centrality of the core. After correcting for the incomplete rolling, we find that the core deviates from the fingerprint center by 5.7% $\pm$ 5.2% to 7.6% $\pm$ 6.9%, depending on the finger. Additionally, we find that the assumption of normal distribution of the core position of plain fingerprint recordings cannot be rejected, but for rolled ones it can. Therefore, we use a multi-step process to find the distribution of the rolled fingerprint recordings. The process consists of an Anderson-Darling normality test, the Bayesian Information Criterion to reduce the number of possible candidate distributions and finally a Generalized Monte Carlo goodness-of-fit procedure to find the best fitting distribution. We find the non-central Fischer distribution best describes the cores' horizontal positions. Finally, we investigate the correlation between mean core position offset and the NFIQ 2 score and find that the NFIQ 2 prefers rolled fingerprint recordings where the core sits slightly below the fingerprint center.
    摘要 人体指纹长期被认为是一种独特和可靠的个人认可方法。中心于指纹核心的分析和加强是识别人体指纹的关键。尽管核心位置在许多应用中被使用,但 according to our knowledge, this study is the first to investigate the empirical distribution of the core over a large, combined dataset of rolled and plain fingerprint recordings. We identify and investigate the extent of incomplete rolling during the rolled fingerprint acquisition and investigate the centrality of the core. After correcting for the incomplete rolling, we find that the core deviates from the fingerprint center by 5.7% ± 5.2% to 7.6% ± 6.9%, depending on the finger. Additionally, we find that the assumption of normal distribution of the core position of plain fingerprint recordings cannot be rejected, but for rolled ones it can. Therefore, we use a multi-step process to find the distribution of the rolled fingerprint recordings. The process consists of an Anderson-Darling normality test, the Bayesian Information Criterion to reduce the number of possible candidate distributions, and finally a Generalized Monte Carlo goodness-of-fit procedure to find the best fitting distribution. We find the non-central Fischer distribution best describes the cores' horizontal positions. Finally, we investigate the correlation between mean core position offset and the NFIQ 2 score and find that the NFIQ 2 prefers rolled fingerprint recordings where the core sits slightly below the fingerprint center.

Relational Prior Knowledge Graphs for Detection and Instance Segmentation

  • paper_url: http://arxiv.org/abs/2310.07573
  • repo_url: None
  • paper_authors: Osman Ülger, Yu Wang, Ysbrand Galama, Sezer Karaoglu, Theo Gevers, Martin R. Oswald
  • for: 这 paper 的目的是调查使用对象之间关系来进行物体探测和实例分割是否有效。
  • methods: 该 paper 提出了一种基于关系优先的特征增强模型(RP-FEM),该模型在Scene graph上运行,并同时学习对象探测和实例分割中的关系上下文模型。
  • results: 实验表明,在 COCO 数据集上,使用Scene graph和关系优先,可以提高物体探测和实例分割的性能,RP-FEM 可以减少图像中不可能的类别预测,同时避免生成重复预测,与基础模型相比呈现提高。
    Abstract Humans have a remarkable ability to perceive and reason about the world around them by understanding the relationships between objects. In this paper, we investigate the effectiveness of using such relationships for object detection and instance segmentation. To this end, we propose a Relational Prior-based Feature Enhancement Model (RP-FEM), a graph transformer that enhances object proposal features using relational priors. The proposed architecture operates on top of scene graphs obtained from initial proposals and aims to concurrently learn relational context modeling for object detection and instance segmentation. Experimental evaluations on COCO show that the utilization of scene graphs, augmented with relational priors, offer benefits for object detection and instance segmentation. RP-FEM demonstrates its capacity to suppress improbable class predictions within the image while also preventing the model from generating duplicate predictions, leading to improvements over the baseline model on which it is built.
    摘要 人类具有非常出色的能力,可以通过理解对象之间的关系来理解世界。在这篇论文中,我们研究了使用这些关系来进行对象检测和实例分割。为此,我们提出了一种基于关系优先的特征增强模型(RP-FEM),这是一种图 transformer 模型,可以通过在Scene Graph中增强对象提案特征来提高对象检测和实例分割的性能。这种建议的架构在Scene Graph中进行同时学习对象检测和实例分割的关系上下文模型。在COCO数据集上进行实验评估,我们发现使用Scene Graph和关系优先可以提高对象检测和实例分割的性能。RP-FEM可以降低图像中不可能的类别预测,同时避免模型生成重复预测,从而超越基础模型。

Impact of Label Types on Training SWIN Models with Overhead Imagery

  • paper_url: http://arxiv.org/abs/2310.07572
  • repo_url: None
  • paper_authors: Ryan Ford, Kenneth Hutchison, Nicholas Felts, Benjamin Cheng, Jesse Lew, Kyle Jackson
  • for: 这个论文的目的是研究数据集设计对模型训练和性能的影响,以帮助降低生成遥感和过空标注数据的成本。
  • methods: 这篇论文使用了固定窗口变换器的训练,使用了 bounding boxes 和 segmentation 标签,其中后者更加昂贵。作者比较了使用不同类型的标签进行训练,并研究了这些模型在不同任务上的性能。
  • results: 作者发现,使用只有目标像素的训练不会提高分类任务的性能,而且会混淆评估集中的背景像素。对于对象检测模型,使用不同类型的标签没有影响测试集的性能。作者发现,使用 bounding boxes 可以为一些不需要更复杂的标签的任务提供足够的性能。
    Abstract Understanding the impact of data set design on model training and performance can help alleviate the costs associated with generating remote sensing and overhead labeled data. This work examined the impact of training shifted window transformers using bounding boxes and segmentation labels, where the latter are more expensive to produce. We examined classification tasks by comparing models trained with both target and backgrounds against models trained with only target pixels, extracted by segmentation labels. For object detection models, we compared performance using either label type when training. We found that the models trained on only target pixels do not show performance improvement for classification tasks, appearing to conflate background pixels in the evaluation set with target pixels. For object detection, we found that models trained with either label type showed equivalent performance across testing. We found that bounding boxes appeared to be sufficient for tasks that did not require more complex labels, such as object segmentation. Continuing work to determine consistency of this result across data types and model architectures could potentially result in substantial savings in generating remote sensing data sets for deep learning.
    摘要 Understanding the impact of data set design on model training and performance can help reduce the costs associated with generating remote sensing and overhead labeled data. This work examined the impact of training shifted window transformers using bounding boxes and segmentation labels, where the latter are more expensive to produce. We compared models trained with both target and backgrounds against models trained with only target pixels, extracted by segmentation labels. For object detection models, we compared performance using either label type when training. We found that the models trained on only target pixels did not show performance improvement for classification tasks, appearing to conflate background pixels in the evaluation set with target pixels. For object detection, we found that models trained with either label type showed equivalent performance across testing. We found that bounding boxes were sufficient for tasks that did not require more complex labels, such as object segmentation. Continuing work to determine consistency of this result across data types and model architectures could potentially result in substantial savings in generating remote sensing data sets for deep learning.Note: Please note that the translation is in Simplified Chinese, which is one of the two standard versions of Chinese used in mainland China and Singapore. If you need the translation in Traditional Chinese, please let me know.

Does resistance to Style-Transfer equal Shape Bias? Evaluating Shape Bias by Distorted Shape

  • paper_url: http://arxiv.org/abs/2310.07555
  • repo_url: None
  • paper_authors: Ziqi Wen, Tianqin Li, Tai Sing Lee
  • for: 这项研究旨在评估深度学习模型对形状的敏感性,并提供一个新的测试工具箱(Distorted Shape Testbench,DiST)来评估模型的全局形状敏感性。
  • methods: 这项研究使用了样式传递图像来训练深度学习模型,并对模型的性能进行评估。
  • results: 研究发现,已有的Shape bias评估方法不能准确评估模型的全局形状敏感性,而DiST测试工具箱可以准确地评估模型的全局形状敏感性,并且训练使用DiST图像可以bridge人类和现有SOTA模型之间的性能差距,同时保持模型的标准图像分类任务的准确率。
    Abstract Deep learning models are known to exhibit a strong texture bias, while human tends to rely heavily on global shape for object recognition. The current benchmark for evaluating a model's shape bias is a set of style-transferred images with the assumption that resistance to the attack of style transfer is related to the development of shape sensitivity in the model. In this work, we show that networks trained with style-transfer images indeed learn to ignore style, but its shape bias arises primarily from local shapes. We provide a Distorted Shape Testbench (DiST) as an alternative measurement of global shape sensitivity. Our test includes 2400 original images from ImageNet-1K, each of which is accompanied by two images with the global shapes of the original image distorted while preserving its texture via the texture synthesis program. We found that (1) models that performed well on the previous shape bias evaluation do not fare well in the proposed DiST; (2) the widely adopted ViT models do not show significant advantages over Convolutional Neural Networks (CNNs) on this benchmark despite that ViTs rank higher on the previous shape bias tests. (3) training with DiST images bridges the significant gap between human and existing SOTA models' performance while preserving the models' accuracy on standard image classification tasks; training with DiST images and style-transferred images are complementary, and can be combined to train network together to enhance both the global and local shape sensitivity of the network. Our code will be host at: https://github.com/leelabcnbc/DiST
    摘要 深度学习模型通常会表现出强的文本擅长,而人类则更倾向于依靠全局形状来识别对象。现有的测试方法之一是使用样式传输图像,假设图像的鲜明度攻击性能与模型内部形状敏感度之间存在相关性。在这项工作中,我们证明了通过样式传输图像训练的网络实际上会忽略样式,但是其形状偏好主要来自本地形状。我们提供了一个扭曲形状测试台(DiST),作为评估模型全局形状敏感度的代理测试方法。我们的测试包括2400个原始图像,每个图像都有两个基于图像的全局形状扭曲的图像,保持图像的文本via texture synthesis程序。我们发现:1. 在之前的形状偏好评估中高分的模型不如在我们提出的DiST测试中表现出色。2. 广泛采用的ViT模型与传统 convolutional neural network(CNN)在该测试中并没有显著的优势,即使ViT模型在之前的形状偏好测试中得分高。3. 使用DiST图像进行训练可以bridges人类和现有最佳实际模型之间的巨大差距,同时保持模型在标准图像分类任务中的准确率。使用DiST图像和样式传输图像进行训练可以将全局和本地形状敏感度融合到同一个模型中,这种融合训练可以提高模型的总体性和特征性能。我们的代码将会hosts在:https://github.com/leelabcnbc/DiST

Attribute Localization and Revision Network for Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2310.07548
  • repo_url: None
  • paper_authors: Junzhe Xu, Suling Duan, Chenwei Tang, Zhenan He, Jiancheng Lv
  • for: 本文旨在提出一种基于 Attribute Localization and Revision Network 的零shot学习模型,以便在无需训练数据的情况下,模型能够识别未经训练的类别。
  • methods: 本文使用了 Attribute Localization Module (ALM) 和 Attribute Revision Module (ARM) 两种模块来捕捉图像区域的本地和全局特征,并通过修改涂抹图像的每个特征值来补做忽略内类 attribute 变化的缺陷。
  • results: 经过广泛的实验测试,本文的模型在三个常用的 benchmark 上表现出色,证明了本文提出的方法的有效性。
    Abstract Zero-shot learning enables the model to recognize unseen categories with the aid of auxiliary semantic information such as attributes. Current works proposed to detect attributes from local image regions and align extracted features with class-level semantics. In this paper, we find that the choice between local and global features is not a zero-sum game, global features can also contribute to the understanding of attributes. In addition, aligning attribute features with class-level semantics ignores potential intra-class attribute variation. To mitigate these disadvantages, we present Attribute Localization and Revision Network in this paper. First, we design Attribute Localization Module (ALM) to capture both local and global features from image regions, a novel module called Scale Control Unit is incorporated to fuse global and local representations. Second, we propose Attribute Revision Module (ARM), which generates image-level semantics by revising the ground-truth value of each attribute, compensating for performance degradation caused by ignoring intra-class variation. Finally, the output of ALM will be aligned with revised semantics produced by ARM to achieve the training process. Comprehensive experimental results on three widely used benchmarks demonstrate the effectiveness of our model in the zero-shot prediction task.
    摘要 zero-shot 学习可以让模型认识未经见过的类别,通过 auxiliary Semantic information such as 特征。现有工作提出了从本地图像区域检测特征并将抽取的特征与类别 semantics 对齐。在这篇论文中,我们发现选择本地和全局特征并不是零和游戏,全局特征也可以帮助理解特征。此外,对属性特征与类别 semantics 对齐忽略了内部类 attribute 的变化。为了解决这些缺点,我们在这篇论文中提出了 Attribute Localization and Revision Network。首先,我们设计了 Attribute Localization Module (ALM),它可以从图像区域中捕捉 both local 和 global 特征。为了融合全局和本地表示,我们采用了一个新的 Scale Control Unit。其次,我们提出了 Attribute Revision Module (ARM),它可以根据图像的实际情况修改每个特征的真实值,以补偿因为忽略内部类 attribute 的变化而导致的性能下降。最后,ALM 的输出将被与 ARM 生成的修订后的 semantics 进行对齐,以实现训练过程。我们在三个常用的 benchmark 上进行了广泛的实验, demonstarting 我们的模型在 zero-shot 预测任务中的有效性。

S4C: Self-Supervised Semantic Scene Completion with Neural Fields

  • paper_url: http://arxiv.org/abs/2310.07522
  • repo_url: None
  • paper_authors: Adrian Hayler, Felix Wimbauer, Dominik Muhle, Christian Rupprecht, Daniel Cremers
  • for: 本研究旨在解决计算机视觉中的3Dsemantic场景理解挑战,帮助移动代理人自动规划和探索不确定环境。
  • methods: 我们提出了首个不需要3D实际数据的自我超级visedapproach,通过单张图像重建场景,并只使用视频和pseudo分割数据进行训练。
  • results: 我们的方法可以准确地推断场景的填充和semantic类别,并且可以 synthesize precisemap for far away viewpoints。
    Abstract 3D semantic scene understanding is a fundamental challenge in computer vision. It enables mobile agents to autonomously plan and navigate arbitrary environments. SSC formalizes this challenge as jointly estimating dense geometry and semantic information from sparse observations of a scene. Current methods for SSC are generally trained on 3D ground truth based on aggregated LiDAR scans. This process relies on special sensors and annotation by hand which are costly and do not scale well. To overcome this issue, our work presents the first self-supervised approach to SSC called S4C that does not rely on 3D ground truth data. Our proposed method can reconstruct a scene from a single image and only relies on videos and pseudo segmentation ground truth generated from off-the-shelf image segmentation network during training. Unlike existing methods, which use discrete voxel grids, we represent scenes as implicit semantic fields. This formulation allows querying any point within the camera frustum for occupancy and semantic class. Our architecture is trained through rendering-based self-supervised losses. Nonetheless, our method achieves performance close to fully supervised state-of-the-art methods. Additionally, our method demonstrates strong generalization capabilities and can synthesize accurate segmentation maps for far away viewpoints.
    摘要 三维semantic场景理解是计算机视觉领域的基本挑战。它使移动代理能够自主规划和探索不确定环境。SSC将这个挑战形式化为同时估算场景的密度和semantic信息从稀疏观察数据中。现有的SSC方法通常通过3D实际数据进行训练,这个过程需要特殊的感知器和手动标注,这些成本高并不具扩展性。为了解决这个问题,我们的工作提出了第一个不需要3D实际数据的自主supervised SSC方法,称为S4C。我们的提议的方法可以从单张图像中重construct场景,只需要视频和pseudo segmentationground truth生成于市场上的图像分割网络进行训练。与现有方法不同,我们将场景表示为半 implicit semantic场景。这种表示方式允许在相机投影范围内查询任何点的占据和semantic类别。我们的架构通过渲染基于的自我超vised损失来训练。尽管如此,我们的方法可以与完全supervised方法相比,并且显示出强大的泛化能力,可以生成正确的分割图像 для远距离视角。

CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing

  • paper_url: http://arxiv.org/abs/2310.07517
  • repo_url: None
  • paper_authors: Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, Wenwu Wang
  • for: 这篇论文旨在提高视频分割任务的精度,通过利用注意力机制 capture 视频中各个段落之间的语义相关性。
  • methods: 我们提出了一种新的交互增强十Modal 感知方法(CM-PIE),通过应用段基 attention 模块来学习细腻的特征。此外,我们还引入了一个交叉模态聚合块,以协同优化视频和声音信号的 semantic表示。
  • results: 我们的模型在 Look, Listen, and Parse 数据集上的 parsing 性能比其他方法高。
    Abstract Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting them as audible or visible events. Recent methods for this task leverage the attention mechanism to capture the semantic correlations among the whole video across the audio-visual modalities. However, these approaches have overlooked the importance of individual segments within a video and the relationship among them, and tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method~(CM-PIE), which can learn fine-grained features by applying a segment-based attention module. Furthermore, a cross-modal aggregation block is introduced to jointly optimize the semantic representation of audio and visual signals by enhancing inter-modal interactions. The experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse dataset compared to other methods.
    摘要 Audio-visual视频分解是将视频分为每个段的任务,并将它们分类为听ible或可见事件。现代方法在这个任务中利用注意力机制来捕捉全视频的语义相关性。然而,这些方法通常忽略视频中每个段的重要性以及它们之间的关系,而且通常仅仅使用单一的感知modalities来学习特征。在本文中,我们提出了一种新的交互增强的交叉模态感知方法(CM-PIE),可以通过应用段级注意力模块来学习细化的特征。此外,我们还引入了交叉模态聚合块,以便在语音和视频信号之间强化交互。实验结果表明,我们的模型在Look, Listen, and Parse数据集上的分解性能比其他方法更高。

A Unified Remote Sensing Anomaly Detector Across Modalities and Scenes via Deviation Relationship Learning

  • paper_url: http://arxiv.org/abs/2310.07511
  • repo_url: https://github.com/jingtao-li-cver/uniadrs
  • paper_authors: Jingtao Li, Xinyu Wang, Hengwei Zhao, Liangpei Zhang, Yanfei Zhong
    for:这个研究旨在建立一个横跨多 modalities 和 scene 的潜在适应率高的问题检测器,以搜寻 earth 上的多种问题。methods:本研究使用了一个基于偏差关系的 bilayer 图表 reformulation,将问题检测任务转换为一个 conditional probability 模型,并透过条件预测Problem 来训练模型。results:研究发现,在五种不同的模式(包括 Hyperspectral、可见光、Synthetic Aperture Radar (SAR)、infrared 和 low light)中,提出的模型具有横跨多 modalities 和 scene 的问题检测能力。
    Abstract Remote sensing anomaly detector can find the objects deviating from the background as potential targets. Given the diversity in earth anomaly types, a unified anomaly detector across modalities and scenes should be cost-effective and flexible to new earth observation sources and anomaly types. However, the current anomaly detectors are limited to a single modality and single scene, since they aim to learn the varying background distribution. Motivated by the universal anomaly deviation pattern, in that anomalies exhibit deviations from their local context, we exploit this characteristic to build a unified anomaly detector. Firstly, we reformulate the anomaly detection task as an undirected bilayer graph based on the deviation relationship, where the anomaly score is modeled as the conditional probability, given the pattern of the background and normal objects. The learning objective is then expressed as a conditional probability ranking problem. Furthermore, we design an instantiation of the reformulation in the data, architecture, and optimization aspects. Simulated spectral and spatial anomalies drive the instantiated architecture. The model is optimized directly for the conditional probability ranking. The proposed model was validated in five modalities including the hyperspectral, visible light, synthetic aperture radar (SAR), infrared and low light to show its unified detection ability.
    摘要 <>这里使用简化字体写入中文。> Remote sensing异常探测器可以找到背景上异常的物体作为潜在目标。由于地球上异常的多样性,一个通用的异常探测器适合多modalities和场景将是cost-effective和flexible。然而,现有的异常探测器仅对单一modalities和单一场景进行学习,因为它们专注于变化的背景分布。我们受到地球上异常的通用偏差模式所驱动,我们利用这个特征来建立一个通用的异常探测器。首先,我们将异常探测任务 reformulate为一个不同irectional bilayer graph,基于偏差关系,异常得分被model为背景和正常物体的模式下的条件 probabilities。学习目标则表示为conditional probability ranking问题。此外,我们还设计了实体化的构造、架构和优化方面。实验 spectral和spatial异常驱动了实体化架构。我们直接对条件 probabilities进行优化。我们的模型在五种modalities中,包括光谱、可见光、Synthetic Aperture Radar(SAR)、infrared和low light中显示了通用的探测能力。

Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2310.07510
  • repo_url: None
  • paper_authors: Zhiming Qian
  • for: 该 paper 的目的是提出一种基于自然语言的视觉基础模型,以便更好地模仿人类视觉的方式来认识多样化和开放的世界。
  • methods: 该 paper 使用了自然语言上的多种预言任务,包括自然语言模型和图像分类等,以便更好地学习视觉表示学习。
  • results: 该 paper 的实验结果表明,使用该基础模型可以在多种视觉任务中达到或超过现有最佳Result,包括 ImageNet-1K 分类、COCO 物体检测和 ADE-20K Semantic Segmentation 等。
    Abstract To mimic human vision with the way of recognizing the diverse and open world, foundation vision models are much critical. While recent techniques of self-supervised learning show the promising potentiality of this mission, we argue that signals from labelled data are also important for common-sense recognition, and properly chosen pre-text tasks can facilitate the efficiency of vision representation learning. To this end, we propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner. Specifically, given an image, we take a heuristic way by considering its intrinsic style properties, inside objects with their locations and correlations, and how it looks like in 3D space for basic visual understanding. However, large-scale object bounding boxes and correlations are usually hard to achieve. Alternatively, we develop a hybrid method by leveraging both multi-label classification and self-supervised learning. On the one hand, under the multi-label supervision, the pre-trained model can explore the detailed information of an image, e.g., image types, objects, and part of semantic relations. On the other hand, self-supervised learning tasks, with respect to Masked Image Modeling (MIM) and contrastive learning, can help the model learn pixel details and patch correlations. Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3\% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection for Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation when using Upernet. The performance shows the ability of our vision foundation model to serve general purpose vision tasks.
    摘要 On one hand, under multi-label supervision, the pre-trained model can explore detailed image information such as image types, objects, and semantic relations. On the other hand, self-supervised learning tasks like Masked Image Modeling (MIM) and contrastive learning help the model learn pixel details and patch correlations. Our pre-trained models achieve results on par with or better than state-of-the-art (SOTA) on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection for Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation when using Upernet. These results demonstrate the ability of our vision foundation model to handle various general-purpose vision tasks.

Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation

  • paper_url: http://arxiv.org/abs/2310.07506
  • repo_url: None
  • paper_authors: Haizhong Zheng, Jiachen Sun, Shutong Wu, Bhavya Kailkhura, Zhuoqing Mao, Chaowei Xiao, Atul Prakash
  • for: 本文提出了一种新的数据压缩方法,以提高模型训练的性能。
  • methods: 本文使用了一种新的数据参数化方法,将数据压缩成参数化数据容器而不是像素空间。
  • results: 比较baseline方法,本文的提posed方法在四个公共数据集(SVHN、CIFAR10、CIFAR100和Tiny-ImageNet)上表现出了更高的性能,甚至在使用批处理损失函数并使用 less GPU内存。
    Abstract Given a real-world dataset, data condensation (DC) aims to synthesize a significantly smaller dataset that captures the knowledge of this dataset for model training with high performance. Recent works propose to enhance DC with data parameterization, which condenses data into parameterized data containers rather than pixel space. The intuition behind data parameterization is to encode shared features of images to avoid additional storage costs. In this paper, we recognize that images share common features in a hierarchical way due to the inherent hierarchical structure of the classification system, which is overlooked by current data parameterization methods. To better align DC with this hierarchical nature and encourage more efficient information sharing inside data containers, we propose a novel data parameterization architecture, Hierarchical Memory Network (HMN). HMN stores condensed data in a three-tier structure, representing the dataset-level, class-level, and instance-level features. Another helpful property of the hierarchical architecture is that HMN naturally ensures good independence among images despite achieving information sharing. This enables instance-level pruning for HMN to reduce redundant information, thereby further minimizing redundancy and enhancing performance. We evaluate HMN on four public datasets (SVHN, CIFAR10, CIFAR100, and Tiny-ImageNet) and compare HMN with eight DC baselines. The evaluation results show that our proposed method outperforms all baselines, even when trained with a batch-based loss consuming less GPU memory.
    摘要 Translated into Simplified Chinese:给一个实际世界数据集,数据缩写(DC)目标是生成一个较小的数据集,捕捉这个数据集的知识,以便模型训练高性能。现有的方法提议通过数据参数化来增强DC,将数据压缩到参数化数据容器中,而不是像素空间。我们认为图像共享特征的理念是基于图像分类系统的层次结构,这一点被当前的数据参数化方法忽略。为了更好地将DC与这个层次结构相匹配,并促进数据容器内的信息共享,我们提出了一种新的数据参数化架构——层次记忆网络(HMN)。HMN将压缩数据存储在三级结构中,表示数据集级、类级和实例级特征。另外,层次架构的帮助特性是,HMN自然地保证图像之间的独立性,同时实现信息共享。这使得HMN可以进行实例级别的剪辑,从而进一步减少重复信息,提高性能。我们在四个公共数据集(SVHN、CIFAR10、CIFAR100和Tiny-ImageNet)上评估HMN,并与八个DC基准方法进行比较。评估结果表明,我们的提出方法在所有基准方法中表现出色,即使在使用更少的GPU内存的批处理损失函数训练时。

PtychoDV: Vision Transformer-Based Deep Unrolling Network for Ptychographic Image Reconstruction

  • paper_url: http://arxiv.org/abs/2310.07504
  • repo_url: None
  • paper_authors: Weijie Gan, Qiuchen Zhai, Michael Thompson McCann, Cristina Garcia Cardona, Ulugbek S. Kamilov, Brendt Wohlberg
  • for: 本研究旨在提出一种高效、高品质ptychography图像重建方法,以解决现有深度学习方法的缺陷和费时问题。
  • methods: 本研究使用了一种新的深度学习模型,称为PtychoDV,来实现高效、高品质ptychography图像重建。PtychoDV包括一个视觉变换器,用于从 raw measurement 中生成初始图像,并考虑到这些测量的相互关系。然后,使用一个深度 unfolding 网络来修正初始图像,使用学习的卷积推荐和ptychography测量模型。
  • results: 实验结果表明,PtychoDV 可以在 simulated data 上比现有的深度学习方法表现更好,并且可以Significantly reduce 计算成本,与迭代方法相比,而且维持竞争力性。
    Abstract Ptychography is an imaging technique that captures multiple overlapping snapshots of a sample, illuminated coherently by a moving localized probe. The image recovery from ptychographic data is generally achieved via an iterative algorithm that solves a nonlinear phase-field problem derived from measured diffraction patterns. However, these approaches have high computational cost. In this paper, we introduce PtychoDV, a novel deep model-based network designed for efficient, high-quality ptychographic image reconstruction. PtychoDV comprises a vision transformer that generates an initial image from the set of raw measurements, taking into consideration their mutual correlations. This is followed by a deep unrolling network that refines the initial image using learnable convolutional priors and the ptychography measurement model. Experimental results on simulated data demonstrate that PtychoDV is capable of outperforming existing deep learning methods for this problem, and significantly reduces computational cost compared to iterative methodologies, while maintaining competitive performance.
    摘要 ptychography 是一种成像技术,通过多次重叠的探针探测样品,以征服干扰的方式获得样品的图像。然而,现有的方法通常需要高度计算成本。在这篇论文中,我们介绍了ptychoDV,一种新的深度学习模型基网络,用于高效、高质量ptychographic图像重建。ptychoDV包括一个视觉转换器,通过对 Raw 测量数据进行处理,生成初始图像,考虑到测量数据之间的相互关系。然后,一个深度拓展网络会使用学习的卷积约束和ptychography测量模型,来细化初始图像。我们对模拟数据进行了实验,结果表明,ptychoDV 可以在深度学习方法中出色表现,并且在计算成本方面具有显著的改善,同时维持竞争力。

FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation

  • paper_url: http://arxiv.org/abs/2310.07473
  • repo_url: https://github.com/XinyuSun/FGPrompt
  • paper_authors: Xinyu Sun, Peihao Chen, Jugang Fan, Thomas H. Li, Jian Chen, Mingkui Tan
  • for: 本研究旨在解决自主系统导航到图像指定目标的重要 yet challenging task,agent需要从拍摄图像中理解目标位置。
  • methods: 我们采用 Fine-grained Goal Prompting (FGPrompt) 方法,利用高级别和高分辨率特征图作为启发,通过conditioned embedding来保留目标图像中细节信息,并使观察Encoder更加关注目标相关区域。
  • results: 与existMethods比较,我们在3个图像目标导航 benchmark datasets(i.e., Gibson, MP3D, HM3D)上显著提高了性能,特别是在Gibson上,我们超过了状态之前最佳成功率by 8%,只用1/50的模型大小。
    Abstract Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems. The agent is required to reason the goal location from where a picture is shot. Existing methods try to solve this problem by learning a navigation policy, which captures semantic features of the goal image and observation image independently and lastly fuses them for predicting a sequence of navigation actions. However, these methods suffer from two major limitations. 1) They may miss detailed information in the goal image, and thus fail to reason the goal location. 2) More critically, it is hard to focus on the goal-relevant regions in the observation image, because they attempt to understand observation without goal conditioning. In this paper, we aim to overcome these limitations by designing a Fine-grained Goal Prompting (FGPrompt) method for image-goal navigation. In particular, we leverage fine-grained and high-resolution feature maps in the goal image as prompts to perform conditioned embedding, which preserves detailed information in the goal image and guides the observation encoder to pay attention to goal-relevant regions. Compared with existing methods on the image-goal navigation benchmark, our method brings significant performance improvement on 3 benchmark datasets (i.e., Gibson, MP3D, and HM3D). Especially on Gibson, we surpass the state-of-the-art success rate by 8% with only 1/50 model size. Project page: https://xinyusun.github.io/fgprompt-pages
    摘要 学习到具体目标位置 Navigation 是自主系统中一项重要但具有挑战性的任务。 Agent 需要从拍摄图像中理解目标位置。现有方法通过学习导航策略,以独立地捕捉目标图像和观察图像的semantic特征,并最终将其混合以预测导航动作序列。然而,这些方法受到两个主要限制:1)它们可能会遗漏目标图像中的细节信息,因此无法理解目标位置。2)更加重要的是,它们难以在观察图像中注意目标相关区域,因为它们不会在目标条件下理解观察。在这篇论文中,我们寻求超越这些限制,通过设计 Fine-grained Goal Prompting(FGPrompt)方法,以便在图像目标导航中提高性能。具体来说,我们利用目标图像中细化和高分辨率的特征图作为引导,通过conditioned embedding来实现conditioned embedding,以保持目标图像中细节信息,并导引观察编码器注意目标相关区域。相比之前的方法,我们在三个图像目标导航 benchmark 上实现了显著的性能提升。尤其是在 Gibson 上,我们超过了状态 искусternal 的成功率,并且只需要1/50 的模型大小。项目页面:https://xinyusun.github.io/fgprompt-pages

PoRF: Pose Residual Field for Accurate Neural Surface Reconstruction

  • paper_url: http://arxiv.org/abs/2310.07449
  • repo_url: None
  • paper_authors: Jia-Wang Bian, Wenjing Bian, Victor Adrian Prisacariu, Philip Torr
    for:* The paper aims to improve the accuracy of neural surface reconstruction in real-world scenarios by refining camera poses and reducing the impact of camera pose noise.methods:* The proposed method uses a novel implicit representation called the pose residual field (PoRF), which leverages global information over the entire sequence to improve pose accuracy.* The method also introduces an epipolar geometry loss to enhance supervision and improve pose accuracy.results:* The proposed method achieves promising results on two datasets: the DTU dataset and the MobileBrick dataset.* On the DTU dataset, the method reduces rotation error by 78% for COLMAP poses and decreases the reconstruction Chamfer distance from 3.48mm to 0.85mm.* On the MobileBrick dataset, the method improves the reconstruction F1 score from 69.18 to 75.67, outperforming the dataset provided ground-truth pose.
    Abstract Neural surface reconstruction is sensitive to the camera pose noise, even if state-of-the-art pose estimators like COLMAP or ARKit are used. More importantly, existing Pose-NeRF joint optimisation methods have struggled to improve pose accuracy in challenging real-world scenarios. To overcome the challenges, we introduce the pose residual field (\textbf{PoRF}), a novel implicit representation that uses an MLP for regressing pose updates. This is more robust than the conventional pose parameter optimisation due to parameter sharing that leverages global information over the entire sequence. Furthermore, we propose an epipolar geometry loss to enhance the supervision that leverages the correspondences exported from COLMAP results without the extra computational overhead. Our method yields promising results. On the DTU dataset, we reduce the rotation error by 78\% for COLMAP poses, leading to the decreased reconstruction Chamfer distance from 3.48mm to 0.85mm. On the MobileBrick dataset that contains casually captured unbounded 360-degree videos, our method refines ARKit poses and improves the reconstruction F1 score from 69.18 to 75.67, outperforming that with the dataset provided ground-truth pose (75.14). These achievements demonstrate the efficacy of our approach in refining camera poses and improving the accuracy of neural surface reconstruction in real-world scenarios.
    摘要 “神经表面重建敏感于摄像头姿态噪声,即使使用最先进的姿态估计器如COLMAP或ARKit。更重要的是,现有的姿态-NeRF共产化优化方法在实际世界场景中困难寻求高精度姿态。为了解决这些挑战,我们引入姿态差估场(PoRF),一种新的隐式表示方法,使用多层感知(MLP)来回归姿态更新。这种方法比传统的姿态参数优化更加稳定,因为它可以共享参数,利用全序列的全局信息。此外,我们提出了视觉几何损失来增强监督,利用COLMAP结果中出口的对准关系,而无需额外计算开销。我们的方法实现了可靠的成果。在DTU数据集上,我们将COLMAP姿态Error降低至78%,导致重建Chamfer距离从3.48mm降低至0.85mm。在包含抓拍 captured 360度视频的MobileBrick数据集上,我们的方法改善了ARKit姿态,提高了重建F1分数从69.18提高至75.67,超过了数据集提供的基准姿态(75.14)。这些成果表明我们的方法在实际世界场景中有效地改善摄像头姿态和神经表面重建精度。”

Distance Weighted Trans Network for Image Completion

  • paper_url: http://arxiv.org/abs/2310.07440
  • repo_url: None
  • paper_authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Xuelong Li, Yue Lu
  • for: 本文提出了一种新的图像生成模型,用于更好地理解图像的结构和关系。
  • methods: 该模型基于距离Weighted Transformer (DWT)和卷积神经网络 (CNN) 两种不同的模型,以优化图像完成过程。
  • results: 对三个复杂的图像 dataset 进行了广泛的量化和质量测试,并证明了该模型的超越性。
    Abstract The challenge of image generation has been effectively modeled as a problem of structure priors or transformation. However, existing models have unsatisfactory performance in understanding the global input image structures because of particular inherent features (for example, local inductive prior). Recent studies have shown that self-attention is an efficient modeling technique for image completion problems. In this paper, we propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components. In our model, we leverage the strengths of both Convolutional Neural Networks (CNNs) and DWT blocks to enhance the image completion process. Specifically, CNNs are used to augment the local texture information of coarse priors and DWT blocks are used to recover certain coarse textures and coherent visual structures. Unlike current approaches that generally use CNNs to create feature maps, we use the DWT to encode global dependencies and compute distance-based weighted feature maps, which substantially minimizes the problem of visual ambiguities. Meanwhile, to better produce repeated textures, we introduce Residual Fast Fourier Convolution (Res-FFC) blocks to combine the encoder's skip features with the coarse features provided by our generator. Furthermore, a simple yet effective technique is proposed to normalize the non-zero values of convolutions, and fine-tune the network layers for regularization of the gradient norms to provide an efficient training stabiliser. Extensive quantitative and qualitative experiments on three challenging datasets demonstrate the superiority of our proposed model compared to existing approaches.
    摘要 描述文本:图像生成挑战已经被模型为结构先验或变换问题。然而,现有模型在理解全局输入图像结构方面表现不满足,主要因为特定的遗传特征(例如本地推导先验)。最近的研究表明,自注意是一种高效的模型技术 для图像完成问题。在这篇论文中,我们提议一种新的架构,即距离基于变换器(DWT),以更好地理解图像组件之间的关系。我们在模型中结合了卷积神经网络(CNN)和DWT块来提高图像完成过程。具体来说,CNNs用于增强粗略先验中的本地特征信息,而DWT块用于恢复一些粗略的文本特征和一致视觉结构。与现有方法一样,我们使用DWT来编码全局依赖关系并计算距离基于权重的特征地图,从而减少视觉混乱的问题。此外,我们还提出了一种简单 yet有效的技术,即 residual Fast Fourier Convolution(Res-FFC)块,以结合生成器提供的粗略特征和编码器的跳过特征。此外,我们还提出了一种简单 yet有效的正则化技术,即Normalize the non-zero values of convolutions,并在网络层次进行 fine-tune 来稳定训练过程。广泛的量化和质量测试表明,我们的提议模型在三个复杂的数据集上表现出色,比现有方法更高效。

DESTINE: Dynamic Goal Queries with Temporal Transductive Alignment for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2310.07438
  • repo_url: None
  • paper_authors: Rezaul Karim, Soheil Mohamad Alizadeh Shabestary, Amir Rasouli
  • for: 预测路用户的时间相关路径,以便更好地预测路用户的行为。
  • methods: 使用semantic map信息和模型交互,并建立一种能够理解不同粒度的行为的机制。
  • results: 使用DESTINE方法,在Argoverse数据集上实现了状态领先的性能,并通过精细的ablation研究,证明了不同模块的贡献。
    Abstract Predicting temporally consistent road users' trajectories in a multi-agent setting is a challenging task due to unknown characteristics of agents and their varying intentions. Besides using semantic map information and modeling interactions, it is important to build an effective mechanism capable of reasoning about behaviors at different levels of granularity. To this end, we propose Dynamic goal quErieS with temporal Transductive alIgNmEnt (DESTINE) method. Unlike past arts, our approach 1) dynamically predicts agents' goals irrespective of particular road structures, such as lanes, allowing the method to produce a more accurate estimation of destinations; 2) achieves map compliant predictions by generating future trajectories in a coarse-to-fine fashion, where the coarser predictions at a lower frame rate serve as intermediate goals; and 3) uses an attention module designed to temporally align predicted trajectories via masked attention. Using the common Argoverse benchmark dataset, we show that our method achieves state-of-the-art performance on various metrics, and further investigate the contributions of proposed modules via comprehensive ablation studies.
    摘要 预测 temporally consistent road users' trajectories 在多代理人设定下是一项具有挑战性的任务,因为代理人的特性和意图都是未知的。除了使用semantic map信息和模型交互外,我们还需要建立一种有效的机制,以便对代理人的行为进行不同级别的推理。为此,我们提出了动态目标QuErieS with temporal Transductive alIgNmEnt(DESTINE)方法。与过去的艺术不同,我们的方法具有以下特点:1. 动态预测代理人的目标,不受特定的道路结构,如车道,影响预测的精度;2. 实现地图符合的预测,通过在粗粒度和细粒度之间进行多层次预测,使得中间的粗粒度预测服为间接目标;3. 使用面向时间的注意模块,通过做Masked attention来进行时间启配。使用常用的 Argoverse 数据集,我们表明了我们的方法在不同的维度上达到了领先的性能,并通过完整的减少研究来调查提posed模块的贡献。

A Novel Voronoi-based Convolutional Neural Network Framework for Pushing Person Detection in Crowd Videos

  • paper_url: http://arxiv.org/abs/2310.07416
  • repo_url: https://github.com/pedestriandynamics/vcnn4pude
  • paper_authors: Ahmed Alia, Mohammed Maree, Mohcine Chraibi, Armin Seyfried
  • for: 本研究旨在通过分析人群中微观动态行为,提供更深刻的人群模式和互动理解,以便制定更有效的人群管理策略、优化人群流动和提高总体人群体验。
  • methods: 本研究提出了一种新的自动化推拿框架,包括两个主要组成部分:i)特征提取和ii)视频标注。在特征提取部分,采用了基于Voronoi方法的新方法来确定输入视频中每个人的本地区域。然后,这些区域被 fed into EfficientNetV1B0卷积神经网络以提取每个人在时间上的深度特征。在第二个组成部分,使用了一个完全连接层与sigmoid激活函数来分析这些深度特征,并将其与视频中推拿行为相关的个体标注。
  • results: 实验结果表明,提议的框架在比较分析中超过了7个基准方法。
    Abstract Analyzing the microscopic dynamics of pushing behavior within crowds can offer valuable insights into crowd patterns and interactions. By identifying instances of pushing in crowd videos, a deeper understanding of when, where, and why such behavior occurs can be achieved. This knowledge is crucial to creating more effective crowd management strategies, optimizing crowd flow, and enhancing overall crowd experiences. However, manually identifying pushing behavior at the microscopic level is challenging, and the existing automatic approaches cannot detect such microscopic behavior. Thus, this article introduces a novel automatic framework for identifying pushing in videos of crowds on a microscopic level. The framework comprises two main components: i) Feature extraction and ii) Video labeling. In the feature extraction component, a new Voronoi-based method is developed for determining the local regions associated with each person in the input video. Subsequently, these regions are fed into EfficientNetV1B0 Convolutional Neural Network to extract the deep features of each person over time. In the second component, a combination of a fully connected layer with a Sigmoid activation function is employed to analyze these deep features and annotate the individuals involved in pushing within the video. The framework is trained and evaluated on a new dataset created using six real-world experiments, including their corresponding ground truths. The experimental findings indicate that the suggested framework outperforms seven baseline methods that are employed for comparative analysis purposes.
    摘要 可以通过分析人群中微型动态行为来获得有价值的人群模式和互动知识。通过在人群视频中标识推担行为,可以更深入地理解推担行为何时、何处、何 raison。这些知识是创建更有效的人群管理策略、优化人群流动和提高总体人群体验的关键。然而,手动地在微型水平上识别推担行为很困难,而现有的自动方法无法检测微型行为。因此,本文提出了一种新的自动框架,用于在人群视频中识别推担行为。该框架包括两个主要组成部分:一是特征提取,二是视频标注。在特征提取部分,我们开发了一种基于Voronoi区域的新方法,用于每个输入视频中确定每个人的本地区域。然后,这些区域将被传递给EfficientNetV1B0卷积神经网络,以提取每个人的深度特征。在第二个组成部分,我们使用一个全连接层和sigmoid激活函数,以分析这些深度特征并将其与视频中的推担行为关联。我们对新的实验 dataset 进行了训练和评估,该dataset 包括六个实际实验,以及其对应的地面真实值。实验结果表明,我们的方法在七个基准方法的比较分析中表现出色,并且在识别推担行为方面达到了更高的准确率。

CLIP for Lightweight Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.07394
  • repo_url: None
  • paper_authors: Ke Jin, Wankou Yang
  • for: 该文章目的是提出一种新的语言引导的Semantic Segmentation方法,使得语言引导的方法可以应用于轻量级网络。
  • methods: 该方法使用了一种新的特征融合模块,该模块通过并行的CNN和Transformer网络实现语言引导的特征融合,从而提高Semantic Segmentation的性能。
  • results: 实验结果表明,该方法可以在不同的视觉背景下 achieve better performance than previous SOTA work,例如DenseCLIP,并且可以全面利用预训练的语言优先知识来提高Semantic Segmentation的性能。
    Abstract The large-scale pretrained model CLIP, trained on 400 million image-text pairs, offers a promising paradigm for tackling vision tasks, albeit at the image level. Later works, such as DenseCLIP and LSeg, extend this paradigm to dense prediction, including semantic segmentation, and have achieved excellent results. However, the above methods either rely on CLIP-pretrained visual backbones or use none-pretrained but heavy backbones such as Swin, while falling ineffective when applied to lightweight backbones. The reason for this is that the lightweitht networks, feature extraction ability of which are relatively limited, meet difficulty embedding the image feature aligned with text embeddings perfectly. In this work, we present a new feature fusion module which tackles this problem and enables language-guided paradigm to be applied to lightweight networks. Specifically, the module is a parallel design of CNN and transformer with a two-way bridge in between, where CNN extracts spatial information and visual context of the feature map from the image encoder, and the transformer propagates text embeddings from the text encoder forward. The core of the module is the bidirectional fusion of visual and text feature across the bridge which prompts their proximity and alignment in embedding space. The module is model-agnostic, which can not only make language-guided lightweight semantic segmentation practical, but also fully exploit the pretrained knowledge of language priors and achieve better performance than previous SOTA work, such as DenseCLIP, whatever the vision backbone is. Extensive experiments have been conducted to demonstrate the superiority of our method.
    摘要 大规模预训练模型CLIP,训练了400万张图像和文本对应对,提供了解决视觉任务的有前途的方法,即图像水平。后续的工作,如 denseclip 和 lseg,在 dense prediction 方面扩展了这种方法,并取得了出色的结果。然而,以上方法都是基于 CLIP 预训练的视觉脊梁或使用不预训练的但是重量级别的脊梁,如 swin,而不是使用轻量级别的脊梁。这是因为轻量级别的网络,其特征提取能力相对较弱,难以将图像特征与文本嵌入完全匹配。在这种情况下,我们提出了一种新的特征融合模块,该模块可以解决这个问题,并使得语言导向的方法可以应用于轻量级别的网络。具体来说,该模块是一种并行的 CNN 和 transformer 的设计,其中 CNN 提取图像中的空间信息和视觉上下文,而 transformer 将文本嵌入从文本编码器传递进来。模块的核心是两个方向的特征融合,使得视觉和文本特征在嵌入空间中相互吸引。这种模块是模型无关的,可以不仅使得语言导向的轻量级别semantic segmentation 成为现实,还可以全面利用预训练的语言优先知识,并在不同的视觉脊梁上达到更高的性能,比如 denseclip 等。我们进行了广泛的实验,以证明我们的方法的优越性。

Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters

  • paper_url: http://arxiv.org/abs/2310.07361
  • repo_url: None
  • paper_authors: Mateusz Michalkiewicz, Masoud Faraki, Xiang Yu, Manmohan Chandraker, Mahsa Baktashmotlagh
  • for: 防止深度神经网络过拟合源频道数据
  • methods: 基于梯度信号噪声比(GSNR)选择dropout掩码,利用元学习法寻找优化的dropout比率
  • results: 在标准频道总结chmark上达到竞争性的结果,包括分类和人脸反掩膜问题
    Abstract Overfitting to the source domain is a common issue in gradient-based training of deep neural networks. To compensate for the over-parameterized models, numerous regularization techniques have been introduced such as those based on dropout. While these methods achieve significant improvements on classical benchmarks such as ImageNet, their performance diminishes with the introduction of domain shift in the test set i.e. when the unseen data comes from a significantly different distribution. In this paper, we move away from the classical approach of Bernoulli sampled dropout mask construction and propose to base the selection on gradient-signal-to-noise ratio (GSNR) of network's parameters. Specifically, at each training step, parameters with high GSNR will be discarded. Furthermore, we alleviate the burden of manually searching for the optimal dropout ratio by leveraging a meta-learning approach. We evaluate our method on standard domain generalization benchmarks and achieve competitive results on classification and face anti-spoofing problems.
    摘要 常见的过拟合问题在深度神经网络的梯度基本训练中出现。为了缓解过参数的模型,许多正则化技术被引入,如基于dropout的方法。这些方法在ImageNet等 классических测试集上实现了显著改善,但是在测试集中的领域变化时,其性能减退。在这篇论文中,我们偏离了传统的bernoulli抽样dropout面积建构方法,并基于网络参数的梯度噪声比率(GSNR)来选择参数。具体来说,在每个训练步骤中,具有高GSNR的参数将被排除。此外,我们利用元学习方法,以避免手动搜索dropout率的优化。我们在标准的领域普适性测试上评估了我们的方法,并在分类和人脸防伪检测问题上实现了竞争性的结果。

Diagnosing Bipolar Disorder from 3-D Structural Magnetic Resonance Images Using a Hybrid GAN-CNN Method

  • paper_url: http://arxiv.org/abs/2310.07359
  • repo_url: None
  • paper_authors: Masood Hamed Saghayan, Mohammad Hossein Zolfagharnasab, Ali Khadem, Farzam Matinfar, Hassan Rashidi
  • for: 这个研究旨在开发一个基于三维显微镜像(sMRI)的潜在鉴别BD患者的方法,以提供更可靠的诊断支持系统,帮助医生更加准确地诊断BD患者。
  • methods: 这个研究使用了一种混合式GAN-CNN模型,通过对sMRI样本进行扩展,提高了BD的诊断精度。研究还使用了5-fold检查分割,以评估不同扩展比例的影响。
  • results: 根据结果,这个研究获得了75.8%的准确率、60.3%的感染率和82.5%的特异率,较以往研究高出3-5%,并使用了少于6%的样本数。此外,研究还证明了一个2D层基本的GAN生成器可以有效地重现复杂的3D脑样本,并且使用了50%的扩展阈值。
    Abstract Bipolar Disorder (BD) is a psychiatric condition diagnosed by repetitive cycles of hypomania and depression. Since diagnosing BD relies on subjective behavioral assessments over a long period, a solid diagnosis based on objective criteria is not straightforward. The current study responded to the described obstacle by proposing a hybrid GAN-CNN model to diagnose BD from 3-D structural MRI Images (sMRI). The novelty of this study stems from diagnosing BD from sMRI samples rather than conventional datasets such as functional MRI (fMRI), electroencephalography (EEG), and behavioral symptoms while removing the data insufficiency usually encountered when dealing with sMRI samples. The impact of various augmentation ratios is also tested using 5-fold cross-validation. Based on the results, this study obtains an accuracy rate of 75.8%, a sensitivity of 60.3%, and a specificity of 82.5%, which are 3-5% higher than prior work while utilizing less than 6% sample counts. Next, it is demonstrated that a 2- D layer-based GAN generator can effectively reproduce complex 3D brain samples, a more straightforward technique than manual image processing. Lastly, the optimum augmentation threshold for the current study using 172 sMRI samples is 50%, showing the applicability of the described method for larger sMRI datasets. In conclusion, it is established that data augmentation using GAN improves the accuracy of the CNN classifier using sMRI samples, thus developing more reliable decision support systems to assist practitioners in identifying BD patients more reliably and in a shorter period
    摘要 《抑郁症(BD)是一种心理疾病,通过反复的假mania和抑郁而诊断。由于诊断BD需要长期的主观行为评估,因此不可靠的诊断方法不是很 straightforward。本研究回应了这个挑战,提出了一种将GAN和CNN结合使用的模型,用于从三维结构MRI图像(sMRI)上诊断BD。本研究的创新之处在于,不同于以往的fMRI、EEG和行为症状数据,从sMRI样本中诊断BD,同时解决了对sMRI样本的数据缺乏问题。此外,本研究还测试了不同的扩展比例,并使用5次分割验证。根据结果,本研究获得了75.8%的准确率,60.3%的敏感度和82.5%的特异性,与先前的工作相比高出3-5%,使用的样本数量少于6%。然后,示出了一种使用2维GAN生成器可以有效地重现复杂的3D脑样本,比手动图像处理更加简单。最后,本研究发现了使用172个sMRI样本的最佳扩展阈值为50%。总之,本研究证明了通过GAN数据扩展提高CNN分类器的准确率,从而开发更可靠的决策支持系统,帮助医生更准确地诊断BD患者,更快速地进行诊断。

IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training

  • paper_url: http://arxiv.org/abs/2310.07355
  • repo_url: None
  • paper_authors: Che Liu, Sibo Cheng, Miaojing Shi, Anand Shah, Wenjia Bai, Rossella Arcucci
  • for: 这 paper 是为了提高医学视语预训练(VLP)中对医疗报告和相关医疗图像的特征提取方法。
  • methods: 这 paper 提出了一种新的临床指导 VLP 框架,named IMITATE,通过在医疗报告中捕捉结构信息来学习视语对Alignment。该框架使用多级视觉特征来对医疗报告中的描述性和结论性文本进行分层对应。此外,这 paper 还提出了一种新的临床知识引入对医疗报告进行对比学习的矫正损失。
  • results: 根据实验结果,IMITATE 模型在六个不同的数据集上比基eline VLP 方法表现出色,在五种医疗影像下游任务中均显示出优于基eline方法。这些结果表明,通过 integrate 医疗报告中的结构信息可以提高视语预训练中的特征提取效果。
    Abstract In the field of medical Vision-Language Pre-training (VLP), significant efforts have been devoted to deriving text and image features from both clinical reports and associated medical images. However, most existing methods may have overlooked the opportunity in leveraging the inherent hierarchical structure of clinical reports, which are generally split into `findings' for descriptive content and `impressions' for conclusive observation. Instead of utilizing this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from the chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge in formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports for vision-language alignment.
    摘要 在医疗领域的医学视语预训练(VLP)领域,有大量努力投入到从医疗报告和关联的医学图像中提取文本和图像特征。然而,大多数现有方法可能忽略了利用医疗报告的自然层次结构,这些报告通常分为描述性内容的“发现”和结论的“印象”。相反,现有的医学VLP方法通常将报告简化为单一实体或分解成多个符号。在这种情况下,我们提出了一种新的临床导向的VLP框架,名为IMITATE,用于从医疗报告中学习层次结构信息。该框架在胸部X射线图像中提取多级视觉特征,并将这些特征与描述性和结论文本进行层次视语对应。此外,我们还引入了一种新的临床知识 Informed Contrastive Loss,用于在跨模态学习中考虑临床知识。我们的模型IMITATE在六个不同的数据集上比基eline VLP方法表现出色,在五种医学影像下沉淀任务中取得了最高的表现。广泛的实验结果表明,将医疗报告的层次结构 интеグ吧到视语对应可以提高视语对应的性能。

PointHR: Exploring High-Resolution Architectures for 3D Point Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2310.07743
  • repo_url: https://github.com/haibo-qiu/PointHR
  • paper_authors: Haibo Qiu, Baosheng Yu, Yixin Chen, Dacheng Tao
  • for: 高精度3D点云分割,即使没有额外的优化和修饰。
  • methods: 提出了一种高精度架构,named PointHR,包括knn顺序运算符和差分扩散运算符,以及预计算序列和扩散运算符的索引。
  • results: 在S3DIS和ScanNetV2数据集上进行了广泛的实验,并表明PointHR可以高度竞争与当前最佳方法无需额外优化和修饰。
    Abstract Significant progress has been made recently in point cloud segmentation utilizing an encoder-decoder framework, which initially encodes point clouds into low-resolution representations and subsequently decodes high-resolution predictions. Inspired by the success of high-resolution architectures in image dense prediction, which always maintains a high-resolution representation throughout the entire learning process, we consider it also highly important for 3D dense point cloud analysis. Therefore, in this paper, we explore high-resolution architectures for 3D point cloud segmentation. Specifically, we generalize high-resolution architectures using a unified pipeline named PointHR, which includes a knn-based sequence operator for feature extraction and a differential resampling operator to efficiently communicate different resolutions. Additionally, we propose to avoid numerous on-the-fly computations of high-resolution architectures by pre-computing the indices for both sequence and resampling operators. By doing so, we deliver highly competitive high-resolution architectures while capitalizing on the benefits of well-designed point cloud blocks without additional effort. To evaluate these architectures for dense point cloud analysis, we conduct thorough experiments using S3DIS and ScanNetV2 datasets, where the proposed PointHR outperforms recent state-of-the-art methods without any bells and whistles. The source code is available at \url{https://github.com/haibo-qiu/PointHR}.
    摘要 Recently, there have been significant advances in point cloud segmentation using an encoder-decoder framework, where point clouds are first encoded into low-resolution representations and then decoded into high-resolution predictions. Inspired by the success of high-resolution architectures in image dense prediction, we believe it is also crucial for 3D dense point cloud analysis. Therefore, in this paper, we explore high-resolution architectures for 3D point cloud segmentation. Specifically, we propose a unified pipeline named PointHR, which includes a knn-based sequence operator for feature extraction and a differential resampling operator to efficiently communicate different resolutions. Additionally, we pre-compute the indices for both sequence and resampling operators to avoid on-the-fly computations. By doing so, we achieve highly competitive high-resolution architectures without additional effort.To evaluate these architectures for dense point cloud analysis, we conduct thorough experiments using S3DIS and ScanNetV2 datasets. The results show that our proposed PointHR outperforms recent state-of-the-art methods without any additional bells and whistles. The source code is available at \url{https://github.com/haibo-qiu/PointHR}.

Guided Attention for Interpretable Motion Captioning

  • paper_url: http://arxiv.org/abs/2310.07324
  • repo_url: https://github.com/rd20karim/m2t-interpretable
  • paper_authors: Karim Radouane, Andon Tchechmedjiev, Sylvie Ranwez, Julien Lagarde
  • for: 本研究旨在生成文本从动作中,并提出了一种使用运动编码器和时空注意模型的方法,以及一种在训练中引导注意力的策略,以提高文本生成的可读性。
  • methods: 本研究使用运动编码器和时空注意模型,并提出了一种在训练中引导注意力的策略,以提高文本生成的可读性。
  • results: 本研究在KIT MLD dataset和HumanML3D dataset上取得了比基eline高的性能,包括BLEU@4、ROUGE-L、CIDEr和Bertscore等指标。
    Abstract While much effort has been invested in generating human motion from text, relatively few studies have been dedicated to the reverse direction, that is, generating text from motion. Much of the research focuses on maximizing generation quality without any regard for the interpretability of the architectures, particularly regarding the influence of particular body parts in the generation and the temporal synchronization of words with specific movements and actions. This study explores the combination of movement encoders with spatio-temporal attention models and proposes strategies to guide the attention during training to highlight perceptually pertinent areas of the skeleton in time. We show that adding guided attention with adaptive gate leads to interpretable captioning while improving performance compared to higher parameter-count non-interpretable SOTA systems. On the KIT MLD dataset, we obtain a BLEU@4 of 24.4% (SOTA+6%), a ROUGE-L of 58.30% (SOTA +14.1%), a CIDEr of 112.10 (SOTA +32.6) and a Bertscore of 41.20% (SOTA +18.20%). On HumanML3D, we obtain a BLEU@4 of 25.00 (SOTA +2.7%), a ROUGE-L score of 55.4% (SOTA +6.1%), a CIDEr of 61.6 (SOTA -10.9%), a Bertscore of 40.3% (SOTA +2.5%). Our code implementation and reproduction details will be soon available at https://github.com/rd20karim/M2T-Interpretable/tree/main.
    摘要 “尽管有很多研究投入到人体动作从文本生成中,但相对少数研究关注反向方向,即从动作生成文本。大多数研究集中于提高生成质量而忽略建模 interpretability,特别是关于特定身体部分在生成中的影响和时间同步词语与动作之间的关系。本研究拟合运动编码器和空间时间注意模型,并提出了引导注意力的策略,以提高文本生成的可读性。我们的实验表明,在 KIT MLD 数据集上,通过添加引导注意力和适应门户,可以实现可读性的文本生成,同时提高性能,与高参数计数的非可读性 SOTA 系统相比。在 HumanML3D 数据集上,我们获得了 BLEU@4 的 25.00% (SOTA +2.7%), ROUGE-L 的 55.4% (SOTA +6.1%), CIDEr 的 61.6 (SOTA -10.9%), Bertscore 的 40.3% (SOTA +2.5%).我们的代码实现和复现细节将在 GitHub 上公开。”

A webcam-based machine learning approach for three-dimensional range of motion evaluation

  • paper_url: http://arxiv.org/abs/2310.07322
  • repo_url: None
  • paper_authors: Xiaoye Michael Wang, Derek T. Smith, Qin Zhu
  • for: 这个研究旨在提供一种可靠且可以远端存取的肢体范围动作评估方法,并评估这种方法的可靠性。
  • methods: 这个研究使用机器学习方法来评估肢体范围动作,并使用 webcam 进行评估。
  • results: 研究结果显示,这种 webcam 方法具有高的内排重复性和其他评估方法的相互重复性,并且可以实现远端存取physical therapy 和rehabilitation。
    Abstract Background. Joint range of motion (ROM) is an important quantitative measure for physical therapy. Commonly relying on a goniometer, accurate and reliable ROM measurement requires extensive training and practice. This, in turn, imposes a significant barrier for those who have limited in-person access to healthcare. Objective. The current study presents and evaluates an alternative machine learning-based ROM evaluation method that could be remotely accessed via a webcam. Methods. To evaluate its reliability, the ROM measurements for a diverse set of joints (neck, spine, and upper and lower extremities) derived using this method were compared to those obtained from a marker-based optical motion capture system. Results. Data collected from 25 healthy adults demonstrated that the webcam solution exhibited high test-retest reliability, with substantial to almost perfect intraclass correlation coefficients for most joints. Compared with the marker-based system, the webcam-based system demonstrated substantial to almost perfect inter-rater reliability for some joints, and lower inter-rater reliability for other joints (e.g., shoulder flexion and elbow flexion), which could be attributed to the reduced sensitivity to joint locations at the apex of the movement. Conclusions. The proposed webcam-based method exhibited high test-retest and inter-rater reliability, making it a versatile alternative for existing ROM evaluation methods in clinical practice and the tele-implementation of physical therapy and rehabilitation.
    摘要 背景: JOINT 范围动作(ROM)是物理治疗中非常重要的量化测量。通常使用指南仪,准确和可靠的 ROM 测量需要广泛的训练和实践。这种情况限制了那些具有有限的面对面医疗访问的人们。目标:本研究提出了一种基于机器学习的 ROM 评估方法,可以通过网络摄像头访问。方法:为评估其可靠性,使用这种方法测量的 JOINT 范围动作数据与使用标记器基于光学运动跟踪系统获取的数据进行比较。结果:从25名健康成人收集的数据表明,网络摄像头解决方案具有高测试再测试可靠性和大多数关节的substantial到几乎完美的同准类相关系数。与标记器基于系统相比,网络摄像头解决方案在一些关节(如肩部和肘部)上具有substantial到几乎完美的同准类相关系数,而在其他关节(如肩部和肘部)上具有较低的同准类相关系数,这可能是因为网络摄像头的感度减退。结论:提出的网络摄像头解决方案具有高测试再测试和同准类相关性,使其成为现有的 ROM 评估方法的多样化选择,并且可以在临床实践和远程物理治疗中应用。

Deep Aramaic: Towards a Synthetic Data Paradigm Enabling Machine Learning in Epigraphy

  • paper_url: http://arxiv.org/abs/2310.07310
  • repo_url: None
  • paper_authors: Andrei C. Aioanei, Regine Hunziker-Rodewald, Konstantin Klein, Dominik L. Michels
  • for: 这个研究旨在提高古代文献中文字识别率,使用现代人工智能技术,如机器学习(ML),从古代文献中提取知识。
  • methods: 该研究使用创新的数据生成技术,synthesize photo-realistic Aramaic letter datasets,包括文本特征、照明、损害和扩展等,以模拟实际铭文多样性。
  • results: 该研究成功创建了250,000个训练图像和25,000个验证图像,涵盖了古代阿拉伯字母的22个类别。这个全面的数据集为训练一个剩余神经网络(ResNet)提供了高度的识别率,并在不同材料和风格下进行了成功的检验,证明了模型的可重复性。
    Abstract Epigraphy increasingly turns to modern artificial intelligence (AI) technologies such as machine learning (ML) for extracting insights from ancient inscriptions. However, scarce labeled data for training ML algorithms severely limits current techniques, especially for ancient scripts like Old Aramaic. Our research pioneers an innovative methodology for generating synthetic training data tailored to Old Aramaic letters. Our pipeline synthesizes photo-realistic Aramaic letter datasets, incorporating textural features, lighting, damage, and augmentations to mimic real-world inscription diversity. Despite minimal real examples, we engineer a dataset of 250,000 training and 25,000 validation images covering the 22 letter classes in the Aramaic alphabet. This comprehensive corpus provides a robust volume of data for training a residual neural network (ResNet) to classify highly degraded Aramaic letters. The ResNet model demonstrates high accuracy in classifying real images from the 8th century BCE Hadad statue inscription. Additional experiments validate performance on varying materials and styles, proving effective generalization. Our results validate the model's capabilities in handling diverse real-world scenarios, proving the viability of our synthetic data approach and avoiding the dependence on scarce training data that has constrained epigraphic analysis. Our innovative framework elevates interpretation accuracy on damaged inscriptions, thus enhancing knowledge extraction from these historical resources.
    摘要 随着epigraphy越来越向现代人工智能(AI)技术,如机器学习(ML),提取古代铭文中的信息。然而,古代文字如Old Aramaic的有限的标注数据对当前技术带来了严重的限制。我们的研究创新了一种适用于Old Aramaic字母的创新方法。我们的管道Synthesize了高度真实的Aramaic字母数据集,包括文字特征、照明、损害和扩展等,以模拟实际铭文多样性。尽管具有 minimal real examples,我们Enginered一个包括250,000个训练图像和25,000个验证图像的数据集,覆盖了Aramaic字母的22个类型。这个全面的数据集提供了一个强大的训练ResNet神经网络来分类高度损害的Aramaic字母。ResNet模型在实际图像中的8世纪前BCE Hadad雕塑铭文中显示出高精度的分类能力。此外,我们还进行了多种材质和风格的实验,证明了模型的普适性和可行性。我们的结果证明了我们的合成数据方法的可行性,并且不再依赖于稀缺的训练数据,从而提高了铭文解读的准确性,并扩展了历史资源的知识抽取。

Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.07265
  • repo_url: None
  • paper_authors: Xu Zheng, Yunhao Luo, Pengyuan Zhou, Lin Wang
  • for: 本研究目的是如何将预训练的 CNN 模型知识转移到学习 Compact Vision Transformer (ViT) 模型,而保持其学习能力?
  • methods: 本研究提出了一个 CNN 到 ViT 知识传递框架(C2VKD),并提出了一个独特的视语相容特征(VLFD)模组和一个像素对像分离分配(PDD)模组。
  • results: 实验结果显示,与 SoTA KD 方法相比,我们的方法可以提高 mIoU 的提升量超过 200%。
    Abstract In this paper, we tackle a new problem: how to transfer knowledge from the pre-trained cumbersome yet well-performed CNN-based model to learn a compact Vision Transformer (ViT)-based model while maintaining its learning capacity? Due to the completely different characteristics of ViT and CNN and the long-existing capacity gap between teacher and student models in Knowledge Distillation (KD), directly transferring the cross-model knowledge is non-trivial. To this end, we subtly leverage the visual and linguistic-compatible feature character of ViT (i.e., student), and its capacity gap with the CNN (i.e., teacher) and propose a novel CNN-to-ViT KD framework, dubbed C2VKD. Importantly, as the teacher's features are heterogeneous to those of the student, we first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations. Moreover, due to the large capacity gap between the teacher and student and the inevitable prediction errors of the teacher, we then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes. Experiments on three semantic segmentation benchmark datasets consistently show that the increment of mIoU of our method is over 200% of the SoTA KD methods
    摘要 在这篇论文中,我们面临一个新的问题:如何将预训练的庞然一物的 CNN 模型的知识传递到学习 compact Vision Transformer (ViT) 模型,而保持其学习能力?由于 CNN 和 ViT 模型之间的完全不同特征和长期存在的容量差异,直接传递模型之间的知识是非常困难的。为此,我们偏好利用 ViT 模型(学生)的视觉和语言相容特征,以及它们与 CNN 模型(教师)之间的容量差,并提出了一种新的 CNN 到 ViT KD 框架,称为 C2VKD。特别是,由于教师模型的特征与学生模型的特征不同,我们首先提出了一种视觉语言相容特征采样(VLFD)模块,以实现有效的 KD。此外,由于教师模型和学生模型之间的容量差较大,以及不可避免的预测错误,我们则提出了一种像素独立分离采样(PDD)模块,以便在教师的预测和非预测类划分中监督学生。我们在三个semantic segmentation benchmarkdataset上进行了实验,结果表明,我们的方法可以提高 mIoU 的增量超过 200% 的 SoTA KD 方法。

ADASR: An Adversarial Auto-Augmentation Framework for Hyperspectral and Multispectral Data Fusion

  • paper_url: http://arxiv.org/abs/2310.07255
  • repo_url: https://github.com/fangfang11-plog/adasr
  • paper_authors: Jinghui Qin, Lihuang Fang, Ruitao Lu, Liang Lin, Yukai Shi
  • for: 提高深度学习驱动的干涉光спектраль图像(HSI)超分辨率,通过融合干涉光спектраль图像(HSI)和多spectral图像(MSI),使用深度神经网络(DNNs)来生成高空间分辨率HSI(HR-HSI)。
  • methods: 我们提出了一种新的反对抗整形自动数据增强框架ADASR,可以自动优化和增强HSI-MSI示例对的多样性,以便在实际场景中应用。我们的框架是示例意识的,并通过对扩展网络和两个下采样网络进行对抗学习来优化它们,以便学习更加稳定的下采样网络用于训练扩展网络。
  • results: 我们的ADASR在两个公共的古典干涉光спектраль数据集上进行了广泛的实验,并证明了我们的ADASR与当前方法相比更加有效。
    Abstract Deep learning-based hyperspectral image (HSI) super-resolution, which aims to generate high spatial resolution HSI (HR-HSI) by fusing hyperspectral image (HSI) and multispectral image (MSI) with deep neural networks (DNNs), has attracted lots of attention. However, neural networks require large amounts of training data, hindering their application in real-world scenarios. In this letter, we propose a novel adversarial automatic data augmentation framework ADASR that automatically optimizes and augments HSI-MSI sample pairs to enrich data diversity for HSI-MSI fusion. Our framework is sample-aware and optimizes an augmentor network and two downsampling networks jointly by adversarial learning so that we can learn more robust downsampling networks for training the upsampling network. Extensive experiments on two public classical hyperspectral datasets demonstrate the effectiveness of our ADASR compared to the state-of-the-art methods.
    摘要 深度学习基于卷积神经网络(DNN)的卷积谱图像(HSI)超分辨率,旨在通过将卷积谱图像(HSI)和多спектраль图像(MSI)融合,生成高空间分辨率卷积谱图像(HR-HSI)。然而,DNN需要大量的训练数据,使其在实际场景中应用受限。在这封信中,我们提出了一种新的反对抗自动数据增强框架ADASR,可以自动优化和增强HSI-MSI样本对的多样性,以便为HSI-MSI融合提供更多的数据。我们的框架是样本意识的,并通过反对学习来优化一个增强器网络和两个下采样网络,以便我们可以更好地学习更有效的下采样网络,用于训练上采样网络。我们在两个公共古典卷积谱数据集上进行了广泛的实验,并证明了我们的ADASR比 estado-of-the-art方法更有效。

A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation

  • paper_url: http://arxiv.org/abs/2310.07252
  • repo_url: None
  • paper_authors: Rashid Khan, Bingding Huang, Haseeb Hassan, Asim Zaman, Zhongfu Ye
  • for: 这个论文是为了提出一种深度神经网络框架,用于图像描述文本生成。
  • methods: 该方法使用了多个预训练的卷积神经网络作为Encoder来提取图像特征,并使用GRU语言模型作为Decoder来生成描述文本。它还使用了Bahdanau注意机制与GRU解码器相结合,以便学习专注于特定图像部分。
  • results: 该方法在MSCOCO和Flickr30k datasets上进行了评估,并取得了与当前方法相当的成绩。
    Abstract Image captioning is a challenging task involving generating a textual description for an image using computer vision and natural language processing techniques. This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism. Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate descriptive sentences. To improve performance, we integrate the Bahdanau attention model with the GRU decoder to enable learning to focus on specific image parts. We evaluate our approach using the MSCOCO and Flickr30k datasets and show that it achieves competitive scores compared to state-of-the-art methods. Our proposed framework can bridge the gap between computer vision and natural language and can be extended to specific domains.
    摘要 Image captioning是一个复杂的任务,即使用计算机视觉和自然语言处理技术生成图像的文本描述。这篇论文提出了一种深度神经网络框架,用于图像描述生成。我们的方法使用多个预训练的卷积神经网络作为Encoder提取图像特征,并使用GRU基于注意机制来生成描述性句子。为了提高性能,我们将巴登瑙注意模型与GRU解码器结合,让学习关注特定图像部分。我们使用MSCOCO和Flickr30k数据集进行评估,并显示了与状态之前方法相当的分数。我们的提议的框架可以将计算机视觉和自然语言相连,并可以扩展到特定领域。

Synthesizing Missing MRI Sequences from Available Modalities using Generative Adversarial Networks in BraTS Dataset

  • paper_url: http://arxiv.org/abs/2310.07250
  • repo_url: None
  • paper_authors: Ibrahim Ethem Hamamci
  • for: 这篇论文的目的是为了提高脑癌MRI检查和诊断的精度和效率,并且将AI技术应用到脑癌MRI量化中。
  • methods: 这篇论文使用了生成对抗网络(GAN)技术,将三个MRI序列作为输入,生成缺失的第四个MRI序列。
  • results: 这篇论文的结果显示,使用GAN技术可以实现高品质和现实的MRI序列生成,帮助临床医生提高诊断能力,并且支持AI技术应用到脑癌MRI量化中。
    Abstract Glioblastoma is a highly aggressive and lethal form of brain cancer. Magnetic resonance imaging (MRI) plays a significant role in the diagnosis, treatment planning, and follow-up of glioblastoma patients due to its non-invasive and radiation-free nature. The International Brain Tumor Segmentation (BraTS) challenge has contributed to generating numerous AI algorithms to accurately and efficiently segment glioblastoma sub-compartments using four structural (T1, T1Gd, T2, T2-FLAIR) MRI scans. However, these four MRI sequences may not always be available. To address this issue, Generative Adversarial Networks (GANs) can be used to synthesize the missing MRI sequences. In this paper, we implement and utilize an open-source GAN approach that takes any three MRI sequences as input to generate the missing fourth structural sequence. Our proposed approach is contributed to the community-driven generally nuanced deep learning framework (GaNDLF) and demonstrates promising results in synthesizing high-quality and realistic MRI sequences, enabling clinicians to improve their diagnostic capabilities and support the application of AI methods to brain tumor MRI quantification.
    摘要 高级肿瘤癌是脑肿瘤的一种高度致命的形式。核磁共振成像(MRI)在诊断、治疗规划和跟踪高级肿瘤癌患者中扮演着非常重要的角色,因为它不侵入性和无辐射性。国际脑肿瘤分 segmentation(BraTS)挑战已经促成了许多人工智能算法,以准确和高效地分 segment glioblastoma 子组件使用四种结构(T1、T1Gd、T2、T2-FLAIR)MRI扫描。然而,这四种 MRI 序列可能不总是可用。为解决这个问题,生成对抗网络(GANs)可以用于生成缺失的 MRI 序列。在这篇论文中,我们实现了一种开源 GAN 方法,该方法使用任意三种 MRI 序列作为输入,以生成缺失的第四种结构序列。我们的提议方法被添加到了社区驱动的普遍精细深度学习框架(GaNDLF)中,并在生成高质量和实实际的 MRI 序列方面显示出了扎实的成果,使临床医生可以提高诊断能力,并支持人工智能方法的脑肿瘤 MRI 量化应用。

IBoxCLA: Towards Robust Box-supervised Segmentation of Polyp via Improved Box-dice and Contrastive Latent-anchors

  • paper_url: http://arxiv.org/abs/2310.07248
  • repo_url: None
  • paper_authors: Zhiwei Wang, Qiang Hu, Hongkuan Shi, Li He, Man He, Wenxuan Dai, Ting Li, Yitong Zhang, Dun Li, Mei Liu, Qiang Li
  • for: 这篇论文主要应用于泌脓腺部分分类,以提高医疗影像分类的精度和效率。
  • methods: 这篇论文提出了两种创新的学习方法:Improved Box-dice(IBox)和Contrastive Latent-Anchors(CLA),并将它们融合使用以训练一个坚固的泌脓腺部分分类模型IBoxCLA。这两种学习方法可以将形状学习和位置/大小学习分开,从而增强模型的准确性和稳定性。
  • results: 这篇论文的实验结果显示IBoxCLA在五个公共的泌脓腺数据集上的比较表现,与最近的完全监督泌脓腺部分分类方法相比,有至少6.5%和7.5%的提升。此外,IBoxCLA也比其他的泌脓腺部分分类方法有更好的稳定性和精度。
    Abstract Box-supervised polyp segmentation attracts increasing attention for its cost-effective potential. Existing solutions often rely on learning-free methods or pretrained models to laboriously generate pseudo masks, triggering Dice constraint subsequently. In this paper, we found that a model guided by the simplest box-filled masks can accurately predict polyp locations/sizes, but suffers from shape collapsing. In response, we propose two innovative learning fashions, Improved Box-dice (IBox) and Contrastive Latent-Anchors (CLA), and combine them to train a robust box-supervised model IBoxCLA. The core idea behind IBoxCLA is to decouple the learning of location/size and shape, allowing for focused constraints on each of them. Specifically, IBox transforms the segmentation map into a proxy map using shape decoupling and confusion-region swapping sequentially. Within the proxy map, shapes are disentangled, while locations/sizes are encoded as box-like responses. By constraining the proxy map instead of the raw prediction, the box-filled mask can well supervise IBoxCLA without misleading its shape learning. Furthermore, CLA contributes to shape learning by generating two types of latent anchors, which are learned and updated using momentum and segmented polyps to steadily represent polyp and background features. The latent anchors facilitate IBoxCLA to capture discriminative features within and outside boxes in a contrastive manner, yielding clearer boundaries. We benchmark IBoxCLA on five public polyp datasets. The experimental results demonstrate the competitive performance of IBoxCLA compared to recent fully-supervised polyp segmentation methods, and its superiority over other box-supervised state-of-the-arts with a relative increase of overall mDice and mIoU by at least 6.5% and 7.5%, respectively.
    摘要 《Box-supervised胞分割吸引了增加的关注,因为它的成本效果很高。现有的解决方案frequently使用学习无关的方法或预训练模型生成 Pseudo masks,从而触发 dice约束。在这篇论文中,我们发现一个以最简单的方块填充mask为导向的模型可以准确预测胞位置/大小,但是受到形态压缩的影响。为了解决这个问题,我们提出了两种创新的学习方法:Improved Box-dice(IBox)和Contrastive Latent-Anchors(CLA),并将它们相结合以训练一个可靠的box-supervised模型IBoxCLA。IBoxCLA的核心想法是解耦位置/大小和形态的学习,以便对每个特征进行专注的约束。specifically, IBox将 segmentation map转换为一个代理 map,先后进行形态解耦和干扰区域交换。在代理 map中,形态被解耦,而位置/大小被编码为方块样式的回应。通过约束代理 map而不是直接约束raw prediction,方块填充mask可以良好地指导IBoxCLA不mislead its shape learning。此外,CLA通过生成两种类型的latent anchors,通过涨动和分割胞和背景特征来逐渐代表胞和背景特征。这些latent anchors使IBoxCLA能够在对和外部box中捕捉特征,并通过对比式学习捕捉更清晰的边界。我们对IBoxCLA进行五个公共胞数据集的测试。实验结果表明IBoxCLA与最近的完全监督胞分割方法相比,有着竞争性的性能,并且与其他box-supervised state-of-the-arts相比,在全部mDice和mIoU上增加至少6.5%和7.5%。

Optimizing the Placement of Roadside LiDARs for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.07247
  • repo_url: https://github.com/PJLab-ADG/PCSim
  • paper_authors: Wentao Jiang, Hao Xiang, Xinyu Cai, Runsheng Xu, Jiaqi Ma, Yikang Li, Gim Hee Lee, Si Liu
  • for: 提高自动驾驶roadside LiDAR的探测性能
  • methods: 基于探测功能的greedy算法和单个点云帧学习的感知预测器
  • results: 提出了一种基于探测功能的LiDAR位置优化方法,并创建了Roadside-Opt数据集以便进一步研究roadside LiDAR placement问题。
    Abstract Multi-agent cooperative perception is an increasingly popular topic in the field of autonomous driving, where roadside LiDARs play an essential role. However, how to optimize the placement of roadside LiDARs is a crucial but often overlooked problem. This paper proposes an approach to optimize the placement of roadside LiDARs by selecting optimized positions within the scene for better perception performance. To efficiently obtain the best combination of locations, a greedy algorithm based on perceptual gain is proposed, which selects the location that can maximize the perceptual gain sequentially. We define perceptual gain as the increased perceptual capability when a new LiDAR is placed. To obtain the perception capability, we propose a perception predictor that learns to evaluate LiDAR placement using only a single point cloud frame. A dataset named Roadside-Opt is created using the CARLA simulator to facilitate research on the roadside LiDAR placement problem.
    摘要 Multi-agent合作感知是自驾车领域中越来越受欢迎的话题,路边LiDAR扮演着关键性的角色。然而,如何优化路边LiDAR的布局是一个关键 yet often overlooked的问题。这篇论文提出了一种方法来优化路边LiDAR的布局,通过选择场景中最佳位置来提高感知性能。为了高效地获得最佳组合位置,我们提出了一种基于偏好度的贪婪算法,该算法在每次选择新LiDAR位置时,选择能够最大化偏好度的位置。我们定义偏好度为新LiDAR的布局增加的感知能力。为了获得感知能力,我们提出了一种感知预测器,该预测器通过只使用单个点云帧来评估LiDAR布局。为了促进路边LiDAR布局问题的研究,我们创建了名为Roadside-Opt的数据集,该数据集使用CARLA模拟器生成。

Crowd Counting in Harsh Weather using Image Denoising with Pix2Pix GANs

  • paper_url: http://arxiv.org/abs/2310.07245
  • repo_url: None
  • paper_authors: Muhammad Asif Khan, Hamid Menouar, Ridha Hamila
  • For: 本研究旨在提高人群计数模型在恶劣天气下的性能,特别是在fog、尘埃和低光照等不良条件下。* Methods: 本研究提出使用Pix2Pix生成器对人群图像进行预处理,以提高计数模型的推理性能。* Results: 研究测试了JHU-Crowd数据集,并证明了提出的方法可以在不良天气下提高人群计数模型的准确率和可靠性。
    Abstract Visual crowd counting estimates the density of the crowd using deep learning models such as convolution neural networks (CNNs). The performance of the model heavily relies on the quality of the training data that constitutes crowd images. In harsh weather such as fog, dust, and low light conditions, the inference performance may severely degrade on the noisy and blur images. In this paper, we propose the use of Pix2Pix generative adversarial network (GAN) to first denoise the crowd images prior to passing them to the counting model. A Pix2Pix network is trained using synthetic noisy images generated from original crowd images and then the pretrained generator is then used in the inference engine to estimate the crowd density in unseen, noisy crowd images. The performance is tested on JHU-Crowd dataset to validate the significance of the proposed method particularly when high reliability and accuracy are required.
    摘要 <>传感技术可以估算人群的密度使用深度学习模型,如 convolutional neural networks (CNNs)。模型的性能强度取决于训练数据中的人群图像质量。在恶劣天气条件下,如雾、尘埃和低光照条件下,推理性能可能会受到图像噪声和模糊的影响,使得推理结果不准确。在这篇论文中,我们提议使用 Pix2Pix 生成 adversarial network (GAN) 首先将人群图像去噪,然后将已经训练的生成器用于实际推理引擎中,以估算人群密度。我们在 JHU-Crowd 数据集上测试了方法,以验证该方法在需要高可靠性和准确性时的效果。Note: "Pix2Pix" is translated as "生成 adversarial network" (生成对抗网络) in Simplified Chinese, which is a combination of "Pix2Pix" and "adversarial network" (对抗网络).

SAGE-ICP: Semantic Information-Assisted ICP

  • paper_url: http://arxiv.org/abs/2310.07237
  • repo_url: None
  • paper_authors: Jiaming Cui, Jiming Chen, Liang Li
    for: 本研究旨在提高遥感器加载 unknown environment中的姿态估计精度和稳定性。methods: 本文提出了一种基于LiDAR的点对点ICP算法,并利用有效的semantic信息。results: 实验表明,相比基eline方法,本方法可以在大规模场景中提高姿态估计精度,并且可以保持实时性。
    Abstract Robust and accurate pose estimation in unknown environments is an essential part of robotic applications. We focus on LiDAR-based point-to-point ICP combined with effective semantic information. This paper proposes a novel semantic information-assisted ICP method named SAGE-ICP, which leverages semantics in odometry. The semantic information for the whole scan is timely and efficiently extracted by a 3D convolution network, and these point-wise labels are deeply involved in every part of the registration, including semantic voxel downsampling, data association, adaptive local map, and dynamic vehicle removal. Unlike previous semantic-aided approaches, the proposed method can improve localization accuracy in large-scale scenes even if the semantic information has certain errors. Experimental evaluations on KITTI and KITTI-360 show that our method outperforms the baseline methods, and improves accuracy while maintaining real-time performance, i.e., runs faster than the sensor frame rate.
    摘要 Robust和准确的姿态估计在未知环境中是机器人应用的关键部分。我们关注LiDAR基于点对点ICP的方法,并且使用有效的semantic信息。这篇论文提出了一种基于semantic信息的ICP方法,名为SAGE-ICP,它在odometry中利用semantic信息。整个扫描的semantic信息在时间上是有效的和高效的提取,并且这些点级标签在注册过程中深入参与了每一部分,包括semantic尺度下采样、数据关联、自适应地图和动态车辆除除。与过去的semantic援助方法不同,我们的方法可以在大规模场景中提高姿态精度,即使semantic信息有一定错误。实验评估在KITTI和KITTI-360上显示,我们的方法在精度和实时性之间协调,比基准方法快速,即在感知框架帧率以上运行。

AdaMesh: Personalized Facial Expressions and Head Poses for Speech-Driven 3D Facial Animation

  • paper_url: http://arxiv.org/abs/2310.07236
  • repo_url: None
  • paper_authors: Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang
  • for: 这个论文旨在提出一种基于 speech-driven 3D 人脸动画的个性化方法,以实现与驱动语音同步的自然人脸表达和头pose。
  • methods: 该方法使用 mixture-of-low-rank adaptation (MoLoRA) 技术来精准地捕捉Reference video中的 talking style,并通过一个具有语义意识的 pose style matrix 来自动调整头pose。
  • results: 对比于现有方法,该方法能够更好地保持 Reference video 中的 talking style,并生成更加生动的人脸动画。
    Abstract Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
    摘要 《语音驱动3D面部动画》是一项相对新的领域,目标是在语音驱动下生成同步的面部动作,现有的工作大多忽略了个人特有的说话风格,包括表情和头部姿态风格。一些工作尝试通过细化模块来捕捉个人性格,但由于训练数据的限制,生成的表情和头部姿态具有浓郁的缺乏生命力。在这种情况下,我们提出了 AdaMesh,一种新的适应语音驱动面部动画方法,可以从约10秒的参考视频中学习个人化说话风格,并生成有生命力的表情和头部姿态。具体来说,我们提出了一种混合低级 adaptation(MoLoRA)方法,用于精细地捕捉表情风格。为个性化姿态风格,我们提出了一种姿态适应器,通过建立精确的姿态前providing和使用语义意识 pose style matrix 来无需 Fine-tuning retrieve相应的风格嵌入。EXTensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. Supplementary video and code will be available at https://adamesh.github.io.

Deep Learning for blind spectral unmixing of LULC classes with MODIS multispectral time series and ancillary data

  • paper_url: http://arxiv.org/abs/2310.07223
  • repo_url: https://github.com/jrodriguezortega/msmtu
  • paper_authors: José Rodríguez-Ortega, Rohaifa Khaldi, Domingo Alcaraz-Segura, Siham Tabik
  • for: 这个论文的目的是提出一种基于深度学习模型的多spectral时间序列数据中的纯化方法,用于提取杂合的土地用途和地形特征信息。
  • methods: 该方法使用了深度学习模型,包括长短期记忆网络(LSTM)模型,将 spectral-temporal 输入数据与地ографи和气候参数结合使用,以提高杂合像素中的物类含量估计。
  • results: 实验表明,结合spectral-temporal输入数据与地ографи和气候参数可以substantially improve the abundance estimation of LULC classes in mixed pixels。
    Abstract Remotely sensed data are dominated by mixed Land Use and Land Cover (LULC) types. Spectral unmixing is a technique to extract information from mixed pixels into their constituent LULC types and corresponding abundance fractions. Traditionally, solving this task has relied on either classical methods that require prior knowledge of endmembers or machine learning methods that avoid explicit endmembers calculation, also known as blind spectral unmixing (BSU). Most BSU studies based on Deep Learning (DL) focus on one time-step hyperspectral data, yet its acquisition remains quite costly compared with multispectral data. To our knowledge, here we provide the first study on BSU of LULC classes using multispectral time series data with DL models. We further boost the performance of a Long-Short Term Memory (LSTM)-based model by incorporating geographic plus topographic (geo-topographic) and climatic ancillary information. Our experiments show that combining spectral-temporal input data together with geo-topographic and climatic information substantially improves the abundance estimation of LULC classes in mixed pixels. To carry out this study, we built a new labeled dataset of the region of Andalusia (Spain) with monthly multispectral time series of pixels for the year 2013 from MODIS at 460m resolution, for two hierarchical levels of LULC classes, named Andalusia MultiSpectral MultiTemporal Unmixing (Andalusia-MSMTU). This dataset provides, at the pixel level, a multispectral time series plus ancillary information annotated with the abundance of each LULC class inside each pixel. The dataset and code are available to the public.
    摘要 Remotely sensed data受混合土地用途和土地覆盖(LULC)类型的控制。spectral unmixing是一种技术,以提取混合像素中的各种LULC类型和相应的含量分数。传统上,解决这个任务通常需要知识先驱练或机器学习方法,而不需要显式计算终端成员。大多数无知终端混合(BSU)研究基于深度学习(DL)方法,但这些研究通常只关注单个时间步的干涉光谱数据。在我们所知道的情况下,我们在这里提供了首个BSU的LULC类型使用多spectral时间序数据与DL模型的研究。我们还使用地理加topographic(geo-topographic)和气候附加信息来提高LSTM模型的性能。我们的实验表明,将spectral-temporal输入数据与geo-topographic和气候信息结合在一起可以显著提高混合像素中LULC类型的含量估计。为了进行这项研究,我们建立了一个新的标注集,名为Andalusia MultiSpectral MultiTemporal Unmixing(Andalusia-MSMTU)。这个集包括2013年MODIS在460米分辨率上每月多spectral时间序数据,以及两个层次的LULC类型。每个像素级别具有多spectral时间序数据和附加信息,并标注每个像素中每种LULC类型的含量。这个集和代码现在对公众开放。

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.07222
  • repo_url: https://github.com/ysy31415/unipaint
  • paper_authors: Shiyuan Yang, Xiaodong Chen, Jing Liao
  • for: 这篇论文的目的是提出一种多modal填充方法,以提供多种导向方式,包括无条件、文本驱动、橡皮擦驱动和示例驱动填充,以及这些模式的组合。
  • methods: 该方法基于预训练的稳定扩散模型(Stable Diffusion),不需要特定任务的训练,可以在少量数据下实现一元化。
  • results: 论文通过对多种数据集进行广泛的质量和量化评估,表明其方法可以与单modal方法匹配的效果,同时具有多modal填充的能力。
    Abstract Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have demonstrated impressive image generation capabilities and have also been successfully applied to image inpainting. However, in practice, users often require more control over the inpainting process beyond textual guidance, especially when they want to composite objects with customized appearance, color, shape, and layout. Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. Furthermore, our Uni-paint is based on pretrained Stable Diffusion and does not require task-specific training on specific datasets, enabling few-shot generalizability to customized images. We have conducted extensive qualitative and quantitative evaluations that show our approach achieves comparable results to existing single-modal methods while offering multimodal inpainting capabilities not available in other methods. Code will be available at https://github.com/ysy31415/unipaint.
    摘要 最近,文本到图像去噪扩散概率模型(DDPM)已经表现出了惊人的图像生成能力,并且在图像填充方面也有成功应用。然而,在实践中,用户经常需要更多的控制力在填充过程中,特别是当他们想要搭配自定义外观、颜色、形状和布局时。可惜,现有的扩散基于的填充方法都受到单modal导航的限制,需要任务特定的训练,这会阻碍其跨modal扩展性。为了解决这些限制,我们提出了Uni-paint,一个多modal填充框架,它提供了多种导航模式,包括随机、文本驱动、笔划驱动、示例驱动填充,以及这些模式的组合。此外,我们的Uni-paint基于预训练的稳定扩散,不需要任务特定的训练,可以在特定图像上进行几步扩展,实现了自定义图像的填充。我们进行了广泛的质量和量测试,结果表明,我们的方法可以与单modal方法匹配,同时提供了多modal填充的能力,不在其他方法中可以实现。代码将在https://github.com/ysy31415/unipaint上公开。

Multi-task Explainable Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2310.07209
  • repo_url: None
  • paper_authors: Mahapara Khurshid, Mayank Vatsa, Richa Singh
  • for: 这篇论文旨在提出一种基于几个扩展练习的多任务几据学习方法,以帮助早期识别皮肤癌。
  • methods: 本文提出了一种多任务几据学习方法,包括一个统一维度的条件统计学习(Segmentation Network)和一个分类维度的条件统计学习(Classification Network),并将它们融合为一个整体的条件统计学习系统。
  • results: 实验结果显示,提出的方法可以对皮肤癌进行早期识别,并且在不同的数据集上具有很好的一致性和稳定性。
    Abstract Skin cancer is one of the deadliest diseases and has a high mortality rate if left untreated. The diagnosis generally starts with visual screening and is followed by a biopsy or histopathological examination. Early detection can aid in lowering mortality rates. Visual screening can be limited by the experience of the doctor. Due to the long tail distribution of dermatological datasets and significant intra-variability between classes, automatic classification utilizing computer-aided methods becomes challenging. In this work, we propose a multitask few-shot-based approach for skin lesions that generalizes well with few labelled data to address the small sample space challenge. The proposed approach comprises a fusion of a segmentation network that acts as an attention module and classification network. The output of the segmentation network helps to focus on the most discriminatory features while making a decision by the classification network. To further enhance the classification performance, we have combined segmentation and classification loss in a weighted manner. We have also included the visualization results that explain the decisions made by the algorithm. Three dermatological datasets are used to evaluate the proposed method thoroughly. We also conducted cross-database experiments to ensure that the proposed approach is generalizable across similar datasets. Experimental results demonstrate the efficacy of the proposed work.
    摘要 皮肤癌是一种非常危险的疾病,如果不得到治疗,则可能会有很高的死亡率。诊断通常由视觉检查开始,然后是比采或历史检查。早期发现可以降低死亡率。然而,视觉检查可能受医生的经验限制。由于皮肤癌数据集的长尾分布和类型之间的显著内部变化,使用计算机助手的自动分类变得困难。在这种情况下,我们提出了一种多任务几 shot 基于的方法,用于处理皮肤癌的小样本空间问题。我们的方法包括一个 fusion 的 segmentation 网络,它作为注意模块,并一个分类网络。分类网络输出的结果可以帮助注意模块决策,同时,注意模块可以帮助分类网络做出更加准确的决策。为了进一步提高分类性能,我们将 segmentation 和分类损失权重加以权重,并包括了可视化结果,以便更好地解释算法的决策。我们使用了三个皮肤癌数据集进行了全面的评估。我们还进行了跨数据集的试验,以确保我们的方法是可重复性的。实验结果表明,我们的方法具有效果。

DeepSimHO: Stable Pose Estimation for Hand-Object Interaction via Physics Simulation

  • paper_url: http://arxiv.org/abs/2310.07206
  • repo_url: https://github.com/rongakowang/DeepSimHO
  • paper_authors: Rong Wang, Wei Mao, Hongdong Li
  • for: 这个研究目的是从单一图像观察中进行3D姿势估测,并且处理手与物体之间的互动。
  • methods: 这个研究使用了深度学习架构,其中包括前向物理 simulations 和反向梯度推断。
  • results: 实验结果显示,这个方法可以提高估测的稳定性,并且比测时优化更高效。
    Abstract This paper addresses the task of 3D pose estimation for a hand interacting with an object from a single image observation. When modeling hand-object interaction, previous works mainly exploit proximity cues, while overlooking the dynamical nature that the hand must stably grasp the object to counteract gravity and thus preventing the object from slipping or falling. These works fail to leverage dynamical constraints in the estimation and consequently often produce unstable results. Meanwhile, refining unstable configurations with physics-based reasoning remains challenging, both by the complexity of contact dynamics and by the lack of effective and efficient physics inference in the data-driven learning framework. To address both issues, we present DeepSimHO: a novel deep-learning pipeline that combines forward physics simulation and backward gradient approximation with a neural network. Specifically, for an initial hand-object pose estimated by a base network, we forward it to a physics simulator to evaluate its stability. However, due to non-smooth contact geometry and penetration, existing differentiable simulators can not provide reliable state gradient. To remedy this, we further introduce a deep network to learn the stability evaluation process from the simulator, while smoothly approximating its gradient and thus enabling effective back-propagation. Extensive experiments show that our method noticeably improves the stability of the estimation and achieves superior efficiency over test-time optimization. The code is available at https://github.com/rongakowang/DeepSimHO.
    摘要 To address these issues, we propose DeepSimHO, a novel deep-learning pipeline that combines forward physics simulation and backward gradient approximation with a neural network. Given an initial hand-object pose estimated by a base network, we forward it to a physics simulator to evaluate its stability. However, existing differentiable simulators cannot provide reliable state gradients due to non-smooth contact geometry and penetration. To address this, we introduce a deep network to learn the stability evaluation process from the simulator, while smoothly approximating its gradient and enabling effective back-propagation.Our method significantly improves the stability of the estimation and achieves superior efficiency over test-time optimization. The code is available at https://github.com/rongakowang/DeepSimHO.

SpikePoint: An Efficient Point-based Spiking Neural Network for Event Cameras Action Recognition

  • paper_url: http://arxiv.org/abs/2310.07189
  • repo_url: None
  • paper_authors: Hongwei Ren, Yue Zhou, Yulong Huang, Haotian Fu, Xiaopeng Lin, Jie Song, Bojun Cheng
  • for: 本研究旨在开发一种能够实现低功耗和高精度的事件摄像头应用场景,通过将事件摄像头与脉冲神经网络(SNN)相结合。
  • methods: 本研究提出了一种名为SpikePoint的新的端到端点 wise SNN架构,可以高效处理 sparse event cloud 数据,并提取全局和局部特征。
  • results: 对于四个事件基因 recognize 数据集,SpikePoint 达到了状态机器人(SOTA)性能,只需要使用 16 个时间步骤,超过了其他 SNN 方法。此外,它还在三个数据集上达到了 SOTA 性能,使用了约 0.3% 的参数和 0.5% 的能耗,相比较于人工神经网络(ANNs)的参数和能耗。这些结果证明 Point Cloud 的重要性,并开启了许多低功耗事件基因数据处理应用场景。
    Abstract Event cameras are bio-inspired sensors that respond to local changes in light intensity and feature low latency, high energy efficiency, and high dynamic range. Meanwhile, Spiking Neural Networks (SNNs) have gained significant attention due to their remarkable efficiency and fault tolerance. By synergistically harnessing the energy efficiency inherent in event cameras and the spike-based processing capabilities of SNNs, their integration could enable ultra-low-power application scenarios, such as action recognition tasks. However, existing approaches often entail converting asynchronous events into conventional frames, leading to additional data mapping efforts and a loss of sparsity, contradicting the design concept of SNNs and event cameras. To address this challenge, we propose SpikePoint, a novel end-to-end point-based SNN architecture. SpikePoint excels at processing sparse event cloud data, effectively extracting both global and local features through a singular-stage structure. Leveraging the surrogate training method, SpikePoint achieves high accuracy with few parameters and maintains low power consumption, specifically employing the identity mapping feature extractor on diverse datasets. SpikePoint achieves state-of-the-art (SOTA) performance on four event-based action recognition datasets using only 16 timesteps, surpassing other SNN methods. Moreover, it also achieves SOTA performance across all methods on three datasets, utilizing approximately 0.3\% of the parameters and 0.5\% of power consumption employed by artificial neural networks (ANNs). These results emphasize the significance of Point Cloud and pave the way for many ultra-low-power event-based data processing applications.
    摘要 Simplified Chinese:事件摄像机是生物体外的感知器,响应当地光度变化,具有低延迟、高能效率和高 dinamic范围。同时,使得神经网络(SNN)在处理数据时具有很好的性能和容错性。通过将事件摄像机的能效性与 SNN 的射频处理机制相结合,可以实现低功耗应用场景,如行动认知任务。然而,现有的方法通常会将异步事件转换成常规帧,从而需要额外的数据映射努力,并且会导致数据精度下降,与 SNN 和事件摄像机的设计原则相抵触。为解决这个挑战,我们提出了 PointCloud,一种新的端到端点基 SNN 架构。PointCloud excells 在处理稀疏事件云数据,通过单个阶段结构,提取全局和局部特征。通过使用代理训练方法,PointCloud 可以在多个数据集上实现高精度,并且具有低功耗。在四个事件基于动作认知数据集上,PointCloud 使用 16 个时间步骤,已经超越了其他 SNN 方法。此外,PointCloud 还在三个数据集上实现了全方法的最佳性能,使用了约 0.3% 的参数和 0.5% 的电力耗用,相比于人工神经网络 (ANN) 的参数和电力耗用。这些结果强调了 PointCloud 的重要性,并开创了许多低功耗事件基数据处理应用场景。

NeuroInspect: Interpretable Neuron-based Debugging Framework through Class-conditional Visualizations

  • paper_url: http://arxiv.org/abs/2310.07184
  • repo_url: https://github.com/yeongjoonju/neuroinspect
  • paper_authors: Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee
  • for: This paper aims to provide an interpretable neuron-based debugging framework for deep learning (DL) models, to help DL practitioners understand and fix mistakes made by the models.
  • methods: The paper proposes a three-stage debugging framework called NeuroInspect, which includes counterfactual explanations, feature visualizations, and false correlation mitigation. The framework uses a novel feature visualization method called CLIP-Illusion to generate human-interpretable explanations for model errors.
  • results: The paper demonstrates the effectiveness of NeuroInspect in addressing false correlations and improving inferences for classes with the worst performance in real-world settings. The results show that NeuroInspect helps debug the mistakes of DL models and improves human understanding of the decision-making process within the networks.
    Abstract Despite deep learning (DL) has achieved remarkable progress in various domains, the DL models are still prone to making mistakes. This issue necessitates effective debugging tools for DL practitioners to interpret the decision-making process within the networks. However, existing debugging methods often demand extra data or adjustments to the decision process, limiting their applicability. To tackle this problem, we present NeuroInspect, an interpretable neuron-based debugging framework with three key stages: counterfactual explanations, feature visualizations, and false correlation mitigation. Our debugging framework first pinpoints neurons responsible for mistakes in the network and then visualizes features embedded in the neurons to be human-interpretable. To provide these explanations, we introduce CLIP-Illusion, a novel feature visualization method that generates images representing features conditioned on classes to examine the connection between neurons and the decision layer. We alleviate convoluted explanations of the conventional visualization approach by employing class information, thereby isolating mixed properties. This process offers more human-interpretable explanations for model errors without altering the trained network or requiring additional data. Furthermore, our framework mitigates false correlations learned from a dataset under a stochastic perspective, modifying decisions for the neurons considered as the main causes. We validate the effectiveness of our framework by addressing false correlations and improving inferences for classes with the worst performance in real-world settings. Moreover, we demonstrate that NeuroInspect helps debug the mistakes of DL models through evaluation for human understanding. The code is openly available at https://github.com/yeongjoonJu/NeuroInspect.
    摘要 尽管深度学习(DL)已经取得了各种领域的显著进步,但DL模型仍然容易出错。这个问题需要有效的调试工具,以便DL实践者可以解释网络的决策过程。然而,现有的调试方法经常需要额外的数据或调整决策过程,限制其应用。为解决这个问题,我们提出了NeuroInspect,一个可解释的 neuron-based 调试框架。我们的调试框架首先在网络中标识出负责出错的 neuron,然后使用CLIP-Illusion,一种新的特征视图方法,生成表示特征类别的图像,以便人类可以理解。我们通过使用类信息,缩小混杂的解释,从而提供更人类可理解的错误解释,而无需更改已训练的网络或需要额外的数据。此外,我们的框架还解决了基于数据的false correlation问题,通过修改考虑到的neuron的决策,从而提高网络的准确率。我们验证了NeuroInspect的效果,通过对实际场景中的false correlation和各类错误进行修复,提高了网络的推理能力。此外,我们还证明了NeuroInspect可以帮助调试DL模型的错误。代码可以在https://github.com/yeongjoonJu/NeuroInspect中下载。

Improving mitosis detection on histopathology images using large vision-language models

  • paper_url: http://arxiv.org/abs/2310.07176
  • repo_url: None
  • paper_authors: Ruiwen Ding, James Hall, Neil Tenenholtz, Kristen Severson
  • for: 这个论文的目的是提高悉尼肿瘤检测的准确率,使用混沌神经网络和自然语言进行辅助。
  • methods: 这个论文使用了混沌神经网络,并利用了图像描述和自然语言来提高悉尼肿瘤检测的准确率。
  • results: 研究表明,使用这种方法可以提高悉尼肿瘤检测的准确率,并且比较于各种基线模型。
    Abstract In certain types of cancerous tissue, mitotic count has been shown to be associated with tumor proliferation, poor prognosis, and therapeutic resistance. Due to the high inter-rater variability of mitotic counting by pathologists, convolutional neural networks (CNNs) have been employed to reduce the subjectivity of mitosis detection in hematoxylin and eosin (H&E)-stained whole slide images. However, most existing models have performance that lags behind expert panel review and only incorporate visual information. In this work, we demonstrate that pre-trained large-scale vision-language models that leverage both visual features and natural language improve mitosis detection accuracy. We formulate the mitosis detection task as an image captioning task and a visual question answering (VQA) task by including metadata such as tumor and scanner types as context. The effectiveness of our pipeline is demonstrated via comparison with various baseline models using 9,501 mitotic figures and 11,051 hard negatives (non-mitotic figures that are difficult to characterize) from the publicly available Mitosis Domain Generalization Challenge (MIDOG22) dataset.
    摘要 某些癌细胞组织中, Mitotic 数量与肿瘤增殖、较差的预后和治疗耐荷强相关。由于病理医生在 Mitotic 计数中存在高度的交互变性,因此人工神经网络(CNNs)已被应用以减少病理分析中的主观性。然而,大多数现有模型只使用视觉信息,其性能落后于专家审查anel。在这项工作中,我们示出了使用预训练的大规模视力语言模型,可以提高 Mitosis 检测精度。我们将 Mitosis 检测任务定义为一个图像描述任务和一个视觉问答(VQA)任务,并通过包括肿瘤类型和扫描仪类型等metadata来提供背景信息。我们的管道的效果通过与多种基准模型进行比较,使用 MIDOG22 公共可用的 Mitosis 领域泛化挑战 dataset 中的 9,501 个 Mitotic 图像和 11,051 个困难的非 Mitotic 图像进行证明。

Anchor-based Multi-view Subspace Clustering with Hierarchical Feature Descent

  • paper_url: http://arxiv.org/abs/2310.07166
  • repo_url: None
  • paper_authors: Qiyuan Ou, Siwei Wang, Pei Zhang, Sihang Zhou, En Zhu
  • for: 这个论文主要目标是提出一种基于多视图的含义下降 clustering 算法,以解决现有多视图 clustering 算法的时间复杂度问题。
  • methods: 该论文使用了一种基于层次特征下降的方法,通过在不同视图之间建立相互关系来实现视图之间的数据拟合。然后,通过一种统一采样策略在含义下降空间中进行采样,并使用子空间 clustering 算法来学习共同表示。
  • results: 实验结果表明,提出的 MVSC-HFD 模型在公共评估数据集上经常超越当前状态艺技。
    Abstract Multi-view clustering has attracted growing attention owing to its capabilities of aggregating information from various sources and its promising horizons in public affairs. Up till now, many advanced approaches have been proposed in recent literature. However, there are several ongoing difficulties to be tackled. One common dilemma occurs while attempting to align the features of different views. We dig out as well as deploy the dependency amongst views through hierarchical feature descent, which leads to a common latent space( STAGE 1). This latent space, for the first time of its kind, is regarded as a 'resemblance space', as it reveals certain correlations and dependencies of different views. To be exact, the one-hot encoding of a category can also be referred to as a resemblance space in its terminal phase. Moreover, due to the intrinsic fact that most of the existing multi-view clustering algorithms stem from k-means clustering and spectral clustering, this results in cubic time complexity w.r.t. the number of the objects. However, we propose Anchor-based Multi-view Subspace Clustering with Hierarchical Feature Descent(MVSC-HFD) to further reduce the computing complexity to linear time cost through a unified sampling strategy in resemblance space( STAGE 2), followed by subspace clustering to learn the representation collectively( STAGE 3). Extensive experimental results on public benchmark datasets demonstrate that our proposed model consistently outperforms the state-of-the-art techniques.
    摘要 多视图聚类在最近几年来得到了越来越多的关注,这是因为它可以聚合来自不同源泉的信息,并且在公共事务中具有承诺的前途。到目前为止,文献中已经提出了许多高级方法。然而,还有许多在进行的困难,其中一个最常见的困难是对不同视图之间的特征进行Alignment。我们通过层次特征降低来解决这个问题,并在多视图聚类中提出了一个新的Latent space(Stage 1)。这个Latent space被认为是一种'相似空间',因为它揭示了不同视图之间的相似性和依赖关系。此外,由于大多数现有的多视图聚类算法来自k-means clustering和spectral clustering,这会导致 cubic time complexity 对于对象的数量。然而,我们提出了基于 anchor的多视图子空间聚类 Algorithm with Hierarchical Feature Descent(MVSC-HFD),可以在resemblance space中进行统一采样策略(Stage 2),然后使用子空间聚类来学习表示(Stage 3)。我们在公共测试数据集上进行了广泛的实验,结果显示,我们提出的模型在与现有技术相比 consistently outperform。

Robust Unsupervised Domain Adaptation by Retaining Confident Entropy via Edge Concatenation

  • paper_url: http://arxiv.org/abs/2310.07149
  • repo_url: None
  • paper_authors: Hye-Seong Hong, Abhishek Kumar, Dong-Gyu Lee
  • for: 提高不监督领域适应的Semantic segmentation网络训练效果,使用计算机生成的标注数据作为源数据。
  • methods: 利用内部和外部信息的共同作用,在Entropy-based adversarial networks中增强预测源领域的能力。增加权重 Edge-predicted probability values,以提高分类边界的清晰度。设计了一种概率分享网络,将多种信息集成到更有效的分类中。
  • results: 在不监督领域适应 benchmark上进行了严格的评估,包括 SYNTHIA $\rightarrow$ Cityscapes和 SYNTHIA $\rightarrow$ Mapillary。实验结果表明,提出的方法在不同的无监督领域适应场景中具有优于当前方法的性能。
    Abstract The generalization capability of unsupervised domain adaptation can mitigate the need for extensive pixel-level annotations to train semantic segmentation networks by training models on synthetic data as a source with computer-generated annotations. Entropy-based adversarial networks are proposed to improve source domain prediction; however, they disregard significant external information, such as edges, which have the potential to identify and distinguish various objects within an image accurately. To address this issue, we introduce a novel approach to domain adaptation, leveraging the synergy of internal and external information within entropy-based adversarial networks. In this approach, we enrich the discriminator network with edge-predicted probability values within this innovative framework to enhance the clarity of class boundaries. Furthermore, we devised a probability-sharing network that integrates diverse information for more effective segmentation. Incorporating object edges addresses a pivotal aspect of unsupervised domain adaptation that has frequently been neglected in the past -- the precise delineation of object boundaries. Conventional unsupervised domain adaptation methods usually center around aligning feature distributions and may not explicitly model object boundaries. Our approach effectively bridges this gap by offering clear guidance on object boundaries, thereby elevating the quality of domain adaptation. Our approach undergoes rigorous evaluation on the established unsupervised domain adaptation benchmarks, specifically in adapting SYNTHIA $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Mapillary. Experimental results show that the proposed model attains better performance than state-of-the-art methods. The superior performance across different unsupervised domain adaptation scenarios highlights the versatility and robustness of the proposed method.
    摘要 通过不监督领域适应,可以减轻 semantic segmentation 网络训练所需的广泛像素级注释。通过训练模型使用生成的数据作为源,并使用计算机生成的注释。不过,基于Entropy的反对抗网络可能会忽略一些重要的外部信息,如图像中的边缘,这些信息可能能够准确地识别和分类不同的对象。为解决这个问题,我们提出了一种新的适应领域方法,具有内部和外部信息的共同作用。在这种方法中,我们增强了推定网络的边缘预测概率值,以提高类划界的清晰度。此外,我们设计了一种概率共享网络,以更有效地进行分类。在这种方法中,我们利用了对象边缘的信息,解决了过去常见的适应领域方法不具备的对象边界的精确分类问题。我们的方法在已知的适应领域 benchmark 上进行了严格的评估,包括 SYNTHIA $\rightarrow$ Cityscapes 和 SYNTHIA $\rightarrow$ Mapillary 等。实验结果表明,我们的方法在不同的适应领域场景中表现出色,超过了现有的state-of-the-art方法。这些结果表明了我们的方法的多样性和可靠性。

Echocardiography video synthesis from end diastolic semantic map via diffusion model

  • paper_url: http://arxiv.org/abs/2310.07131
  • repo_url: None
  • paper_authors: Phi Nguyen Van, Duc Tran Minh, Hieu Pham Huy, Long Tran Quoc
  • for: 这篇论文的目的是为echocardiography视频生成任务提供一种新的方法,使用semantic映射来指导生成过程,以提高生成的视频的真实感和一致性。
  • methods: 这篇论文使用了Denoising Diffusion Probabilistic Models (DDPMs),并在其基础上进行了扩展和改进,以适应echocardiography视频生成任务。具体来说,我们使用了semantic映射来指导生成过程,并在多尺度特征图中进行了空间适应 нормализа。
  • results: 经过实验,我们发现OUR模型在CAMUS数据集上的表现比标准扩散技术更好,包括多个纪录指标,如FID、FVD和SSMI。这表明OUR模型可以更好地生成echocardiography视频序列,具有更高的真实感和一致性。
    Abstract Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated significant achievements in various image and video generation tasks, including the domain of medical imaging. However, generating echocardiography videos based on semantic anatomical information remains an unexplored area of research. This is mostly due to the constraints imposed by the currently available datasets, which lack sufficient scale and comprehensive frame-wise annotations for every cardiac cycle. This paper aims to tackle the aforementioned challenges by expanding upon existing video diffusion models for the purpose of cardiac video synthesis. More specifically, our focus lies in generating video using semantic maps of the initial frame during the cardiac cycle, commonly referred to as end diastole. To further improve the synthesis process, we integrate spatial adaptive normalization into multiscale feature maps. This enables the inclusion of semantic guidance during synthesis, resulting in enhanced realism and coherence of the resultant video sequences. Experiments are conducted on the CAMUS dataset, which is a highly used dataset in the field of echocardiography. Our model exhibits better performance compared to the standard diffusion technique in terms of multiple metrics, including FID, FVD, and SSMI.
    摘要 德 numerically 噪声扩散模型(DDPM)在不同的图像和视频生成任务中表现出了显著的成就,其中包括医学影像领域。然而,基于semantic anatomical information的echocardiography视频生成仍然是一个未经探索的领域,这主要是因为当前available的数据集缺乏 sufficient scale和每个cardiac cycle的完整框架级别的注释。本文旨在解决上述挑战,扩展现有的视频扩散模型用于卡ди亚视频生成。更 specifically,我们的重点是使用initial frame during cardiac cycle的semantic maps来生成视频。为了进一步改进生成过程,我们将spatial adaptive normalization integrate into multiscale feature maps,以使用semantic guidance during synthesis,从而提高生成的视频序列的真实性和一致性。我们在CAMUS dataset上进行了实验,这是医学影像领域中的一个非常流行的数据集。我们的模型在多个纪录metric上表现出了更好的性能,包括FID、FVD和SSMI。

cs.AI - 2023-10-11

AutoRepo: A general framework for multi-modal LLM-based automated construction reporting

  • paper_url: http://arxiv.org/abs/2310.07944
  • repo_url: None
  • paper_authors: Hongxu Pu, Xincong Yang, Jing Li, Runhao Guo, Heng Li
  • for: 提高建筑项目安全、质量和时间完成性,使用自动生成建筑检查报告的新框架AutoRepo。
  • methods: 利用无人车进行建筑检查,收集场景信息,并使用多模态大语言模型生成检查报告。
  • results: 在实际建筑项目中应用并测试了AutoRepo框架,显示它可以减少检查过程的时间和资源配置,并生成符合法规标准的高质量检查报告。
    Abstract Ensuring the safety, quality, and timely completion of construction projects is paramount, with construction inspections serving as a vital instrument towards these goals. Nevertheless, the predominantly manual approach of present-day inspections frequently results in inefficiencies and inadequate information management. Such methods often fall short of providing holistic, exhaustive assessments, consequently engendering regulatory oversights and potential safety hazards. To address this issue, this paper presents a novel framework named AutoRepo for automated generation of construction inspection reports. The unmanned vehicles efficiently perform construction inspections and collect scene information, while the multimodal large language models (LLMs) are leveraged to automatically generate the inspection reports. The framework was applied and tested on a real-world construction site, demonstrating its potential to expedite the inspection process, significantly reduce resource allocation, and produce high-quality, regulatory standard-compliant inspection reports. This research thus underscores the immense potential of multimodal large language models in revolutionizing construction inspection practices, signaling a significant leap forward towards a more efficient and safer construction management paradigm.
    摘要 Ensuring the safety, quality, and timely completion of construction projects is crucial, with construction inspections serving as a vital tool towards these goals. However, the predominantly manual approach of present-day inspections frequently leads to inefficiencies and inadequate information management. Such methods often fall short of providing comprehensive, exhaustive assessments, resulting in regulatory oversights and potential safety hazards. To address this issue, this paper presents a novel framework named AutoRepo for automated generation of construction inspection reports. The unmanned vehicles efficiently perform construction inspections and collect scene information, while the multimodal large language models (LLMs) are leveraged to automatically generate the inspection reports. The framework was applied and tested on a real-world construction site, demonstrating its potential to expedite the inspection process, significantly reduce resource allocation, and produce high-quality, regulatory standard-compliant inspection reports. This research thus underscores the immense potential of multimodal large language models in revolutionizing construction inspection practices, signaling a significant leap forward towards a more efficient and safer construction management paradigm.Note: Please note that the translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China. If you need the translation in Traditional Chinese, please let me know.

Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07937
  • repo_url: None
  • paper_authors: Bangguo Yu, Hamidreza Kasaei, Ming Cao
  • for: 这篇论文旨在解决多机器人合作时的可视目标导航问题,以实现高效率和可靠性。
  • methods: 本文提出了一个创新的框架,即Co-NavGPT,它使用大型自然语言模型(LLM)作为多机器人合作的全球观察者,将环境资料转换为提示,进而提高 LLM 的景象理解能力。
  • results: 实验结果显示,Co-NavGPT 在 HM3D 环境中的成功率和效率都高于现有模型,而且不需要任何学习过程,这表明 LLM 在多机器人合作领域的应用潜力非常大。
    Abstract In advanced human-robot interaction tasks, visual target navigation is crucial for autonomous robots navigating unknown environments. While numerous approaches have been developed in the past, most are designed for single-robot operations, which often suffer from reduced efficiency and robustness due to environmental complexities. Furthermore, learning policies for multi-robot collaboration are resource-intensive. To address these challenges, we propose Co-NavGPT, an innovative framework that integrates Large Language Models (LLMs) as a global planner for multi-robot cooperative visual target navigation. Co-NavGPT encodes the explored environment data into prompts, enhancing LLMs' scene comprehension. It then assigns exploration frontiers to each robot for efficient target search. Experimental results on Habitat-Matterport 3D (HM3D) demonstrate that Co-NavGPT surpasses existing models in success rates and efficiency without any learning process, demonstrating the vast potential of LLMs in multi-robot collaboration domains. The supplementary video, prompts, and code can be accessed via the following link: \href{https://sites.google.com/view/co-navgpt}{https://sites.google.com/view/co-navgpt}.
    摘要 在高级人机交互任务中,视觉目标导航是关键,以便自主机器人在未知环境中进行自主导航。过去有许多方法被开发出来,但大多数是单机器人操作的,它们往往因环境复杂性而减少效率和可靠性。此外,学习策略 для多机器人合作也是费时费力的。为解决这些挑战,我们提出了Co-NavGPT框架,它将大型自然语言模型(LLM)作为多机器人合作的全球规划器。Co-NavGPT将探索环境数据编码成提示,从而提高 LLM 的景象理解能力。然后,它将每个机器人分配出探索前沿,以实现高效的目标搜索。在Habitat-Matterport 3D(HM3D)上进行的实验结果表明,Co-NavGPT比既有模型在成功率和效率方面具有更高的潜力,而且无需进行学习过程,这表明 LLM 在多机器人合作领域的潜力是非常大。补充视频、提示和代码可以通过以下链接获取:\href{https://sites.google.com/view/co-navgpt}{https://sites.google.com/view/co-navgpt}.

What Matters to You? Towards Visual Representation Alignment for Robot Learning

  • paper_url: http://arxiv.org/abs/2310.07932
  • repo_url: None
  • paper_authors: Ran Tian, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy
  • for: 本研究旨在帮助机器人优化与人类 preference 相关的奖励,以便机器人可以根据人类的需求和选择进行行为。
  • methods: 本研究使用了 Representation-Aligned Preference-based Learning (RAPL) 方法,该方法通过人类反馈来调整机器人的视觉表示,以便更好地满足人类的需求。
  • results: 实验结果表明,RAPL 的奖励可以生成人类喜欢的机器人行为,并且具有高样本效率和零样本泛化性。
    Abstract When operating in service of people, robots need to optimize rewards aligned with end-user preferences. Since robots will rely on raw perceptual inputs like RGB images, their rewards will inevitably use visual representations. Recently there has been excitement in using representations from pre-trained visual models, but key to making these work in robotics is fine-tuning, which is typically done via proxy tasks like dynamics prediction or enforcing temporal cycle-consistency. However, all these proxy tasks bypass the human's input on what matters to them, exacerbating spurious correlations and ultimately leading to robot behaviors that are misaligned with user preferences. In this work, we propose that robots should leverage human feedback to align their visual representations with the end-user and disentangle what matters for the task. We propose Representation-Aligned Preference-based Learning (RAPL), a method for solving the visual representation alignment problem and visual reward learning problem through the lens of preference-based learning and optimal transport. Across experiments in X-MAGICAL and in robotic manipulation, we find that RAPL's reward consistently generates preferred robot behaviors with high sample efficiency, and shows strong zero-shot generalization when the visual representation is learned from a different embodiment than the robot's.
    摘要

D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning

  • paper_url: http://arxiv.org/abs/2310.07931
  • repo_url: https://github.com/adymaharana/d2pruning
  • paper_authors: Adyasha Maharana, Prateek Yadav, Mohit Bansal
  • for: 提高模型训练数据质量可以降低模型测试错误率,同时可以采用数据减少方法来降低计算成本。
  • methods: 提出了一种基于图structure的数据选择算法, named D2 Pruning, 使用前向和反向消息传递来更新数据集中每个示例的difficulty scores,然后使用图Structured sampling方法选择最佳的核心集。
  • results: 对于视觉和语言 datasets,D2 Pruning比前一代方法更好地选择核心集,可以达到70%的减少率,同时发现使用D2 Pruning来筛选大量多模态数据集可以提高数据集的多样性和预训练模型的泛化性。
    Abstract Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if a dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset. There are two dominant approaches: (1) geometry-based data selection for maximizing data diversity in the coreset, and (2) functions that assign difficulty scores to samples based on training dynamics. Optimizing for data diversity leads to a coreset that is biased towards easier samples, whereas, selection by difficulty ranking omits easy samples that are necessary for the training of deep learning models. This demonstrates that data diversity and importance scores are two complementary factors that need to be jointly considered during coreset selection. We represent a dataset as an undirected graph and propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection. D2 Pruning updates the difficulty scores of each example by incorporating the difficulty of its neighboring examples in the dataset graph. Then, these updated difficulty scores direct a graph-based sampling method to select a coreset that encapsulates both diverse and difficult regions of the dataset space. We evaluate supervised and self-supervised versions of our method on various vision and language datasets. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates. Additionally, we find that using D2 Pruning for filtering large multimodal datasets leads to increased diversity in the dataset and improved generalization of pretrained models.
    摘要 高品质数据可能导致模型在固定数据预算下的测试错误下降。此外,一个模型可以在固定计算预算下训练无需妥协性能,只要可以从数据集中剔除重复项。核心集选择(数据剔除)目的是选择训练数据集中的子集,以最大化模型在这个子集上的性能。现有两种主导方法:(1)基于geometry的数据选择,以提高数据多样性在核心集中,和(2)基于训练动态函数来评分样本的难度。最佳化数据多样性会导致核心集偏向容易样本,而选择难度排名则会忽略容易样本,这些样本是深度学习模型训练所必需的。这表明数据多样性和重要性分数是两种完全相关的因素,需要在核心集选择中同时考虑。我们将数据集表示为无向图,并提出一种新的减少算法,D2 Pruning,它使用数据集图上的前向和反向消息传递来进行核心集选择。D2 Pruning将数据集图上的每个例子的难度分数更新,基于该例子的邻居例子在数据集图上的难度。然后,这些更新后的难度分数将导航一种基于图的采样方法,选择包含多样和困难区域的核心集。我们对各种视觉和语言数据集进行了超过70%的减少率。此外,我们发现使用D2 Pruning进行大量多模态数据集的筛选可以提高数据集的多样性和预训练模型的泛化性。

Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning

  • paper_url: http://arxiv.org/abs/2310.07918
  • repo_url: None
  • paper_authors: Jannik Deuschel, Caleb N. Ellington, Benjamin J. Lengerich, Yingtao Luo, Pascal Friederich, Eric P. Xing
  • for: 该论文目的是提出一种 Contextualized Policy Recovery(CPR)方法,以解决现有的政策学习模型存在准确性和可读性之间的负面选择问题。
  • methods: CPR方法将问题定义为多任务学习问题,将复杂的决策过程分解为不同的上下文特定策略。每个上下文特定策略都是一个线性观察到行动映射。CPR方法可以在完全无线上和部分可见的决策环境中运行,并可以与任何循环黑盒模型或可读的决策模型结合使用。
  • results: 研究人员通过在 simulate 和实际数据上测试 CPR 方法,实现了在静脉抗生素干扰 ICU 中预测抗生素药物的 (+22% AUROC vs. 前一代 SOTA) 和预测 Alzheimer 病人 MRI 药物的 (+7.7% AUROC vs. 前一代 SOTA) 任务上的状元表现。与此同时,CPR 方法 closing 了可读性和黑盒方法之间的准确性差距,允许高分辨率探索和分析上下文特定的决策模型。
    Abstract Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models fall short by forcing a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making process. e.g. to audit medical decisions for biases and suboptimal practices, we require models of decision processes which provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. Thus, we propose Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are comprised of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models $\textit{on-demand}$ as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement in predictive performance, CPR closes the accuracy gap between interpretable and black-box methods for policy learning, allowing high-resolution exploration and analysis of context-specific decision models.
    摘要 “干预推理”政策学习目标在于从观察行为中估算出可理解的决策政策,但现有模型却存在精度和可理解之间的对抗。这个对抗限制了基于数据的人类决策过程的解释。例如,为了审核医疗决策中的偏见和不良实践,我们需要一些可解释的决策过程模型,以提供简洁的行为描述。实际上,现有的方法受到这个对抗,因为它们将基于决策过程的底层模型设计为一个通用政策,但人类决策实际上是动态的,可以随着上下文资讯而改变。因此,我们提出了“上下文化政策恢复”(CPR),它将决策过程模型化为多任务学习问题,并将复杂的决策政策拆分为具有上下文特定的政策。CPR模型每个上下文特定的政策为一个线性观察到动作的映射,并在上下文更新时产生新的决策模型。CPR适用于完全离线和部分可观察的决策环境,并可以与任何可读黑盒模型或可解释的决策模型整合。我们通过在模拟和实际数据上进行了研究, CP 的表现比前一代最佳化� Архивная копия от 20 августа 2020 на Wayback Machine

  • paper_url: http://arxiv.org/abs/2310.07917
  • repo_url: None
  • paper_authors: Elaheh Jafarigol, Theodore Trafalis
  • for: 本研究旨在提供大规模不均衡数据中机器学习领域各种方法的概述和总结,以便在不同领域中应用大规模不均衡数据。
  • methods: 本研究涉及了各种手段,包括各种数据处理技术和机器学习算法,以 Addressing the problem of imbalanced data in various domains。
  • results: 本研究通过收集和评审258篇同行评审文章,提供了对各种方法的审视和总结,以及在不同领域中机器学习的应用。
    Abstract For over two decades, detecting rare events has been a challenging task among researchers in the data mining and machine learning domain. Real-life problems inspire researchers to navigate and further improve data processing and algorithmic approaches to achieve effective and computationally efficient methods for imbalanced learning. In this paper, we have collected and reviewed 258 peer-reviewed papers from archival journals and conference papers in an attempt to provide an in-depth review of various approaches in imbalanced learning from technical and application perspectives. This work aims to provide a structured review of methods used to address the problem of imbalanced data in various domains and create a general guideline for researchers in academia or industry who want to dive into the broad field of machine learning using large-scale imbalanced data.
    摘要

Recurrent networks recognize patterns with low-dimensional oscillations

  • paper_url: http://arxiv.org/abs/2310.07908
  • repo_url: https://github.com/ktmurray1999/neural-rules
  • paper_authors: Keith T. Murray
  • for: 这个研究探讨了一种新的动力学机制,用于识别模式,通过解释一个基于SET卡游戏的简单任务中的回归神经网络(RNN)的含义。
  • methods: 这个研究使用了解释RNN的方法,并将其视为在低维度限径谱中发生相位变换的模式识别。此外,研究者还手工制作了一个气泡模型,以重现RNN的动力学。
  • results: 这个研究发现,RNN可以通过相位变换来识别模式,并且这种机制与金字塔自动机(FSA)的转移有相似之处。此外,研究还发现了一种可能的神经网络实现FSA的机制。这项研究不仅提供了一种可能的模式识别机制,还为深度学习模型解释提供了一个新的视角。
    Abstract This study proposes a novel dynamical mechanism for pattern recognition discovered by interpreting a recurrent neural network (RNN) trained on a simple task inspired by the SET card game. We interpreted the trained RNN as recognizing patterns via phase shifts in a low-dimensional limit cycle in a manner analogous to transitions in a finite state automaton (FSA). We further validated this interpretation by handcrafting a simple oscillatory model that reproduces the dynamics of the trained RNN. Our findings not only suggest of a potential dynamical mechanism capable of pattern recognition, but also suggest of a potential neural implementation of FSA. Above all, this work contributes to the growing discourse on deep learning model interpretability.
    摘要

RoboCLIP: One Demonstration is Enough to Learn Robot Policies

  • paper_url: http://arxiv.org/abs/2310.07899
  • repo_url: None
  • paper_authors: Sumedh A Sontakke, Jesse Zhang, Sébastien M. R. Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, Laurent Itti
    for: RoboCLIP is designed to address the difficulty of reward specification in reinforcement learning, particularly the need for extensive expert supervision to design robust reward functions.methods: RoboCLIP uses a single video demonstration or textual description of the task to generate rewards without manual reward function design, leveraging pretrained Video-and-Language Models (VLMs) without any finetuning.results: Reinforcement learning agents trained with RoboCLIP rewards demonstrate 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, using only one video/text demonstration.
    Abstract Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions. Imitation learning (IL) methods attempt to circumvent these problems by utilizing expert demonstrations but typically require a large number of in-domain expert demonstrations. Inspired by advances in the field of Video-and-Language Models (VLMs), we present RoboCLIP, an online imitation learning method that uses a single demonstration (overcoming the large data requirement) in the form of a video demonstration or a textual description of the task to generate rewards without manual reward function design. Additionally, RoboCLIP can also utilize out-of-domain demonstrations, like videos of humans solving the task for reward generation, circumventing the need to have the same demonstration and deployment domains. RoboCLIP utilizes pretrained VLMs without any finetuning for reward generation. Reinforcement learning agents trained with RoboCLIP rewards demonstrate 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, doing so using only one video/text demonstration.
    摘要 <>请注意,以下文本将使用简化中文表示。> reward specification 是 reinforcement learning 中的一个不orious难题,需要广泛的专家指导以设计可靠的奖励函数。 imitation learning(IL)方法尝试通过使用专家示范来绕过这些问题,但通常需要大量的域内专家示范。 我们受到 Video-and-Language Models(VLMs)领域的进步 inspirited,我们提出了 RoboCLIP,一种在线的 imitation learning 方法,使用单个示范(超越大量数据要求),可以通过视频示范或文本描述来生成奖励,无需手动设计奖励函数。 RoboCLIP 还可以使用不同域的示范,如人类解决任务的视频示范,以便不需要同一个示范和部署域。 RoboCLIP 使用预训练的 VLMs,无需任何finetuning,可以生成奖励。 reinforcement learning 代理人使用 RoboCLIP 奖励表现出2-3倍于竞争的 imitation learning 方法在下游机器人处理任务上的零件表现,使用单个视频/文本示范。

Efficient Integrators for Diffusion Generative Models

  • paper_url: http://arxiv.org/abs/2310.07894
  • repo_url: https://github.com/mandt-lab/PSLD
  • paper_authors: Kushagra Pandey, Maja Rudolph, Stephan Mandt
  • for: 本研究旨在提高扩展Diffusion模型的采样速度,以便更快地生成样本。
  • methods: 我们提出了两种扩展Diffusion模型的采样方法: conjugate integrators和splitting integrators。 conjugate integrators通过将反射 diffusion 动力学映射到更易于采样的空间,而 splitting-based integrators通过在数据和占 auxiliary 变量之间进行交替更新来减少数值计算误差。
  • results: 我们对这两种方法进行了广泛的实验和理论研究,并提出了一种hybrid方法,这种方法可以在扩展空间中对Diffusion模型进行最佳性能的采样。在应用于[Pandey & Mandt, 2023]中的CIFAR-10上,我们的deterministic和stochastic采样器可以在100个网络功能评估(NFE)后达到FID分数为2.11和2.36,比最佳基eline的2.57和2.63更低。
    Abstract Diffusion models suffer from slow sample generation at inference time. Therefore, developing a principled framework for fast deterministic/stochastic sampling for a broader class of diffusion models is a promising direction. We propose two complementary frameworks for accelerating sample generation in pre-trained models: Conjugate Integrators and Splitting Integrators. Conjugate integrators generalize DDIM, mapping the reverse diffusion dynamics to a more amenable space for sampling. In contrast, splitting-based integrators, commonly used in molecular dynamics, reduce the numerical simulation error by cleverly alternating between numerical updates involving the data and auxiliary variables. After extensively studying these methods empirically and theoretically, we present a hybrid method that leads to the best-reported performance for diffusion models in augmented spaces. Applied to Phase Space Langevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our deterministic and stochastic samplers achieve FID scores of 2.11 and 2.36 in only 100 network function evaluations (NFE) as compared to 2.57 and 2.63 for the best-performing baselines, respectively. Our code and model checkpoints will be made publicly available at \url{https://github.com/mandt-lab/PSLD}.
    摘要 Diffusion models在推理时会受到慢的样本生成问题。因此,开发一个原则性的框架来加速预训练模型的干扰/随机抽取是一个有前途的方向。我们提议两种补充性框架来加速预训练模型的样本生成:协调 интеграторы和分割 интеграторы。协调 интеграторы扩展了逆扩散动力学,将逆扩散动力学映射到更适合采样的空间。相比之下,分割基于分子动力学通常使用数值方法更新数据和辅助变量,从而减少数值计算误差。我们经验和理论上深入研究这些方法,并提出了一种混合方法,导致预训练模型在扩展空间中的最佳表现。应用于[Pandey & Mandt, 2023]中的phas Space Langevin Diffusion(PSLD)在CIFAR-10上,我们的干扰和随机抽取器在100个网络函数评估(NFE)内达到了FID分数为2.11和2.36,与最佳基线相比下升2.57和2.63。我们将代码和模型Checkpoint公开在 GitHub上,请参考\url{https://github.com/mandt-lab/PSLD}.

LangNav: Language as a Perceptual Representation for Navigation

  • paper_url: http://arxiv.org/abs/2310.07889
  • repo_url: None
  • paper_authors: Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, Yoon Kim
  • for: 本文探讨了使用语言作为视觉 Navigation 的感知表示。
  • methods: 我们的方法使用了市场上可得的视觉系统(图像描述和对象检测)将当前时间步的自我中心Panoramic View转换成自然语言描述。然后,我们将预训练的语言模型进行训练,以选择基于当前视图和轨迹历史的最佳行为。与标准设置不同的是,我们的方法使用(粗粒度)语言作为感知表示,而不是直接使用预训练的视觉特征。
  • results: 我们的方法在R2R视觉语言导航标准 benchmark上实现了比预先学习的视觉特征更好的性能,特别是在只有10-100个黄金轨迹available的情况下。这表明使用语言作为感知表示可以在导航任务中提供更好的性能。
    Abstract We explore the use of language as a perceptual representation for vision-and-language navigation. Our approach uses off-the-shelf vision systems (for image captioning and object detection) to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore two use cases of our language-based navigation (LangNav) approach on the R2R vision-and-language navigation benchmark: generating synthetic trajectories from a prompted large language model (GPT-4) with which to finetune a smaller language model; and sim-to-real transfer where we transfer a policy learned on a simulated environment (ALFRED) to a real-world environment (R2R). Our approach is found to improve upon strong baselines that rely on visual features in settings where only a few gold trajectories (10-100) are available, demonstrating the potential of using language as a perceptual representation for navigation tasks.
    摘要

Leader-Follower Neural Networks with Local Error Signals Inspired by Complex Collectives

  • paper_url: http://arxiv.org/abs/2310.07885
  • repo_url: None
  • paper_authors: Chenzhong Yin, Mingxi Cheng, Xiongye Xiao, Xinghe Chen, Shahin Nazarian, Andrei Irimia, Paul Bogdan
  • For: The paper is written to propose a neural network architecture inspired by the rules observed in nature’s collective ensembles, and to investigate the behavior of workers in the network.* Methods: The paper uses a leader-follower neural network (LFNN) structure, and trains the network using local error signals and optionally incorporating backpropagation (BP) and global loss.* Results: The paper achieves significantly lower error rates than previous BP-free algorithms on MNIST and CIFAR-10, and outperforms previous BP-free algorithms by a significant margin on ImageNet.Here is the information in Simplified Chinese text, as requested:* For: 这篇论文是为了提出基于自然集合中规则的神经网络架构,并研究工作者在网络中的行为。* Methods: 这篇论文使用了领导者-追随者神经网络(LFNN)结构,并通过本地错误信号和可选的反propagation(BP)和全局损失来训练网络。* Results: 这篇论文在MNIST和CIFAR-10上实现了比前一代BP-free算法更低的错误率,并在ImageNet上超过了前一代BP-free算法的性能。
    Abstract The collective behavior of a network with heterogeneous, resource-limited information processing units (e.g., group of fish, flock of birds, or network of neurons) demonstrates high self-organization and complexity. These emergent properties arise from simple interaction rules where certain individuals can exhibit leadership-like behavior and influence the collective activity of the group. Motivated by the intricacy of these collectives, we propose a neural network (NN) architecture inspired by the rules observed in nature's collective ensembles. This NN structure contains workers that encompass one or more information processing units (e.g., neurons, filters, layers, or blocks of layers). Workers are either leaders or followers, and we train a leader-follower neural network (LFNN) by leveraging local error signals and optionally incorporating backpropagation (BP) and global loss. We investigate worker behavior and evaluate LFNNs through extensive experimentation. Our LFNNs trained with local error signals achieve significantly lower error rates than previous BP-free algorithms on MNIST and CIFAR-10 and even surpass BP-enabled baselines. In the case of ImageNet, our LFNN-l demonstrates superior scalability and outperforms previous BP-free algorithms by a significant margin.
    摘要 集体行为的网络(例如鱼群、鸟群或神经网络)表现出高度自组织和复杂性。这些emergent特性来自简单的互动规则,其中某些个体可以展示领导性行为,影响集体活动的总体表现。受自然集体ensemble的复杂性启发,我们提出一种基于自然集体的神经网络(NN)结构。这个NN结构包含有一个或多个信息处理单元(例如神经元、滤波器、层或层组)的工作者。工作者可以是领导者或追随者,我们使用本地错误信号和可选地包括反向传播(BP)和全局损失来训练领导者-追随者神经网络(LFNN)。我们对工作者行为进行了广泛的实验研究,并评估了LFNN的性能。我们的LFNN使用本地错误信号进行训练,在MNIST和CIFAR-10上达到了较低的错误率,并在ImageNet上超越了前一代BP-free算法。

The Thousand Faces of Explainable AI Along the Machine Learning Life Cycle: Industrial Reality and Current State of Research

  • paper_url: http://arxiv.org/abs/2310.07882
  • repo_url: None
  • paper_authors: Thomas Decker, Ralf Gross, Alexander Koebler, Michael Lebacher, Ronald Schnitzer, Stefan H. Weber
  • for: 本研究探讨了可解释人工智能(XAI)在生产业中的实际应用 relevance,并对当前学术研究进行了比较。
  • methods: 本研究采用了广泛的采访方法,包括各种角色和关键参与者从不同的行业部门进行了访问。此外,我们还提供了一个简短的文献回顾,以提供一个涵盖所调查人员的意见以及当前学术研究的总览。
  • results: 我们的调查结果表明,虽然存在许多不同的XAI方法,但大多数都集中在模型评估阶段和数据科学家之间。此外,我们还发现了一些不足,例如现有方法和框架不足以帮助非专家用户理解和解释透明度不高的人工智能模型。
    Abstract In this paper, we investigate the practical relevance of explainable artificial intelligence (XAI) with a special focus on the producing industries and relate them to the current state of academic XAI research. Our findings are based on an extensive series of interviews regarding the role and applicability of XAI along the Machine Learning (ML) lifecycle in current industrial practice and its expected relevance in the future. The interviews were conducted among a great variety of roles and key stakeholders from different industry sectors. On top of that, we outline the state of XAI research by providing a concise review of the relevant literature. This enables us to provide an encompassing overview covering the opinions of the surveyed persons as well as the current state of academic research. By comparing our interview results with the current research approaches we reveal several discrepancies. While a multitude of different XAI approaches exists, most of them are centered around the model evaluation phase and data scientists. Their versatile capabilities for other stages are currently either not sufficiently explored or not popular among practitioners. In line with existing work, our findings also confirm that more efforts are needed to enable also non-expert users' interpretation and understanding of opaque AI models with existing methods and frameworks.
    摘要 The study finds that while there are many different XAI approaches, most are centered around the model evaluation phase and data scientists, with limited exploration of their versatility in other stages. Additionally, the study confirms that more efforts are needed to enable non-expert users to interpret and understand opaque AI models using existing methods and frameworks.The review of the relevant literature provides an encompassing overview of the opinions of the surveyed persons as well as the current state of academic research. By comparing the interview results with the current research approaches, the study reveals several discrepancies between the two, highlighting the need for further research and development in XAI.

DeePref: Deep Reinforcement Learning For Video Prefetching In Content Delivery Networks

  • paper_url: http://arxiv.org/abs/2310.07881
  • repo_url: None
  • paper_authors: Nawras Alkassab, Chin-Tser Huang, Tania Lorido Botran
  • for: 这篇论文旨在提高内容传输网络(Content Delivery Networks,CDN)中视频内容的缓存和预取优化算法,以提高用户体验质量。
  • methods: 这篇论文提出了一种基于深度学习的预取优化算法,named DeePref,可以在CDN边缘网络中自动适应用户访问模式的变化,提高预取精度和覆盖率。
  • results: 实验结果表明,使用DeePref DRQN在实际世界数据集上,可以提高预取精度和覆盖率,相比基eline方法,提高28%和17%,同时也研究了将统计模型从一个边缘网络传输到另一个边缘网络,可以提高预取精度和覆盖率,相比基eline方法,提高30%和10%。
    Abstract Content Delivery Networks carry the majority of Internet traffic, and the increasing demand for video content as a major IP traffic across the Internet highlights the importance of caching and prefetching optimization algorithms. Prefetching aims to make data available in the cache before the requester places its request to reduce access time and improve the Quality of Experience on the user side. Prefetching is well investigated in operating systems, compiler instructions, in-memory cache, local storage systems, high-speed networks, and cloud systems. Traditional prefetching techniques are well adapted to a particular access pattern, but fail to adapt to sudden variations or randomization in workloads. This paper explores the use of reinforcement learning to tackle the changes in user access patterns and automatically adapt over time. To this end, we propose, DeePref, a Deep Reinforcement Learning agent for online video content prefetching in Content Delivery Networks. DeePref is a prefetcher implemented on edge networks and is agnostic to hardware design, operating systems, and applications. Our results show that DeePref DRQN, using a real-world dataset, achieves a 17% increase in prefetching accuracy and a 28% increase in prefetching coverage on average compared to baseline approaches that use video content popularity as a building block to statically or dynamically make prefetching decisions. We also study the possibility of transfer learning of statistical models from one edge network into another, where unseen user requests from unknown distribution are observed. In terms of transfer learning, the increase in prefetching accuracy and prefetching coverage are [$30%$, $10%$], respectively. Our source code will be available on Github.
    摘要 content delivery networks 承载了互联网上大量流量,而在互联网上占主要的流量中,视频内容的增长也使得缓存和预取优化算法变得越来越重要。预取的目的是在用户请求之前将数据提取到缓存中,以降低访问时间和提高用户体验质量。预取技术已经在操作系统、编译器指令、内存缓存、本地存储系统、高速网络和云系统中得到了广泛的研究。传统的预取技术通常是根据特定的访问模式进行适应,但是它们无法适应用户访问模式的快速变化或随机化。本文探讨了使用强化学习来解决这些变化的问题。为此,我们提出了DeePref,一种基于深度强化学习的在线视频内容预取代理。DeePref在边缘网络上实现,不依赖于硬件设计、操作系统或应用程序。我们的结果表明,DeePref DRQN使用实际数据集时,平均提高预取精度28%和预取覆盖率17% Compared to基eline方法,使用视频内容的流行度作为预取决策的基础或动态决策。我们还研究了将统计模型从一个边缘网络传播到另一个边缘网络中,以处理未经见过的用户请求和未知分布。在传播学习中,预取精度和预取覆盖率增加了[30%, 10%]。我们的源代码将在 GitHub 上发布。

TabLib: A Dataset of 627M Tables with Context

  • paper_url: http://arxiv.org/abs/2310.07875
  • repo_url: None
  • paper_authors: Gus Eggert, Kevin Huo, Mike Biven, Justin Waugh
  • for: 这篇论文是为了提供一个大型、多样化的表格数据集,以便进行现代AI系统的研究和发展。
  • methods: 该论文使用了多种数据抽取方法,从GitHub和Common Crawl等来源中提取了627万个表格,总量达69 TiB,并且包含了867亿个上下文符号。
  • results: 该论文提出了一个名为”TabLib”的新的表格数据集,其包含了627万个表格,总量达69 TiB,并且具有多样化的表格结构和上下文。这个数据集的大小和多样性都提供了许多探索和研究的机会,类似于text和图像模态的基础数据集。
    Abstract It is well-established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib'', a compilation of 627 million tables totaling 69 TiB, along with 867B tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.
    摘要 “已经确立了现代人工智能系统中大量多样数据的重要作用。然而,对于表格数据,没有相关的大量多样数据集的存在,与文本和图像模式相似。因此,我们提出了“TabLib”,包含627万个表格,总量为69 TiB,以及867亿个上下文token。TabLib从多种文件格式中提取,包括CSV、HTML、SQLite、PDF、Excel等,来自GitHub和Common Crawl。TabLib的大小和多样性表现出了很大的抢救潜力,类似于文本和图像领域的基础数据集,如The Pile和LAION。”

Hierarchical Pretraining on Multimodal Electronic Health Records

  • paper_url: http://arxiv.org/abs/2310.07871
  • repo_url: https://github.com/xiaochenwang-psu/medhmp
  • paper_authors: Xiaochen Wang, Junyu Luo, Jiaqi Wang, Ziyi Yin, Suhan Cui, Yuan Zhong, Yaqing Wang, Fenglong Ma
  • for: 这篇论文是为了解决医疗领域中电子健康记录(EHR)资料的层次结构问题,以提高现有预训模型在多种下游任务中的通用能力。
  • methods: 本文提出了一个新的、通用的、多modal EHR预训框架(MEDHMP),专门针对医疗领域中的层次结构资料进行预训。
  • results: authors通过实验结果显示了MEDHMP的效果,在八个下游任务中三个层次上展示了佳绩,与十八个基准相比,更加强调了我们的方法的可行性。
    Abstract Pretraining has proven to be a powerful technique in natural language processing (NLP), exhibiting remarkable success in various NLP downstream tasks. However, in the medical domain, existing pretrained models on electronic health records (EHR) fail to capture the hierarchical nature of EHR data, limiting their generalization capability across diverse downstream tasks using a single pretrained model. To tackle this challenge, this paper introduces a novel, general, and unified pretraining framework called MEDHMP, specifically designed for hierarchically multimodal EHR data. The effectiveness of the proposed MEDHMP is demonstrated through experimental results on eight downstream tasks spanning three levels. Comparisons against eighteen baselines further highlight the efficacy of our approach.
    摘要 <>转换文本到简化中文。<>预训练技术在自然语言处理(NLP)中已经证明是一种强大的技术,在不同的NLP下游任务中表现出了很好的成绩。然而,在医疗领域,现有的预训练模型在电子医疗记录(EHR)数据上失去了层次结构的特点,因此限制了单个预训练模型在多种下游任务中的通用化能力。为解决这个挑战,本文提出了一种新的、通用、多模式预训练框架called MEDHMP,专门针对层次多模式EHR数据。我们通过对八个下游任务进行实验,证明了我们的方法的有效性。与 eighteen个基准值进行比较,我们的方法的成绩更加出色。

Cheap Talking Algorithms

  • paper_url: http://arxiv.org/abs/2310.07867
  • repo_url: None
  • paper_authors: Daniele Condorelli, Massimiliano Furlan
  • for: 研究独立强化学习算法在 crawford 和 sobel (1982) 游戏中的信息传输行为。
  • methods: sender 和 receiver 在训练中共同 converges to 离散的 Nash 均衡。
  • results: 通信占据预期的最大程度,即根据交战利益冲突的程度。结论是对不同参数和游戏 especification robust。I hope this helps!
    Abstract We simulate behaviour of independent reinforcement learning algorithms playing the Crawford and Sobel (1982) game of strategic information transmission. We show that a sender and a receiver training together converge to strategies close to the exante optimal equilibrium of the game. Hence, communication takes place to the largest extent predicted by Nash equilibrium given the degree of conflict of interest between agents. The conclusion is shown to be robust to alternative specifications of the hyperparameters and of the game. We discuss implications for theories of equilibrium selection in information transmission games, for work on emerging communication among algorithms in computer science and for the economics of collusions in markets populated by artificially intelligent agents.
    摘要 我们模拟独立强化学习算法在克劳福德和索贝尔(1982)游戏中的行为。我们显示,发送者和接收者一起培训时,会 converges到靠近预先优化的均衡点。因此,通信发生在预先优化的均衡点所预测的范围内。结论是对于不同的具体化参数和游戏参数,都是Robust的。我们讨论这些结论对信息传输游戏理论选择、计算机科学中算法之间的emerging通信以及人工智能代理人市场中的经济合作等方面的影响。

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

  • paper_url: http://arxiv.org/abs/2310.07849
  • repo_url: None
  • paper_authors: Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin
  • for: investigate the effectiveness of using large language models (LLMs) to generate synthetic datasets for text classification
  • methods: use LLMs to generate synthetic data, and evaluate the performance of models trained on these synthetic data
  • results: find that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic dataHere’s the full text in Simplified Chinese:
  • for: 这个研究是为了investigate大语言模型(LLMs)是否可以生成高质量的synthetic datasets,以便提高文本分类模型的性能。
  • methods: 研究者使用LLMs生成synthetic data,并评估这些synthetic data上模型的性能。
  • results: 发现任务级别和实例级别的主观性均对模型在synthetic data上的性能产生负面影响。
    Abstract The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently explored using large language models (LLMs) to generate synthetic datasets as an alternative approach. However, the effectiveness of the LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks. To better understand factors that moderate the effectiveness of the LLM-generated synthetic data, in this study, we look into how the performance of models trained on these synthetic data may vary with the subjectivity of classification. Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data. We conclude by discussing the implications of our work on the potential and limitations of leveraging LLM for synthetic data generation.
    摘要 collection and curation of high-quality training data 高质量训练数据的收集和整理是发展文本分类模型的表现优秀的关键因素,但它们frequently associated with significant costs and time investment. Researchers have recently explored using large language models (LLMs) to generate synthetic datasets as an alternative approach. However, the effectiveness of the LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks.To better understand the factors that moderate the effectiveness of the LLM-generated synthetic data, in this study, we investigate how the performance of models trained on these synthetic data may vary with the subjectivity of classification. Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data.We conclude by discussing the implications of our work on the potential and limitations of leveraging LLM for synthetic data generation.Here's the translation in Traditional Chinese:集成和整理高质量训练数据的收集是发展文本分类模型的表现优秀的关键因素,但它们经常与大量的成本和时间投入相关。研究人员最近尝试使用大语言模型(LLMs)生成 sintetic数据作为代替方案。然而,LLM生成的 sintetic数据在不同的分类任务中的效果是不一致的。为了更好地理解LLM生成 sintetic数据的效果的moderating因素,在这个研究中,我们investigate how the performance of models trained on these sintetic data may vary with the subjectivity of classification.我们的结果表明,在任务水平和实例水平都存在负相关性 между模型在 sintetic数据上的性能和分类的主观性。我们 conclude by discussing the implications of our work on the potential and limitations of leveraging LLM for synthetic data generation.

Towards the Fundamental Limits of Knowledge Transfer over Finite Domains

  • paper_url: http://arxiv.org/abs/2310.07838
  • repo_url: https://github.com/sw-packages/ae0783895ca52d793929d6e5e57c365320dc5864c41ab9a7d5f64b2310c2fd59
  • paper_authors: Qingyue Zhao, Banghua Zhu
  • for: 本文研究了知识传递的统计效率,具体来说是通过$n$个教师样本传递到一个 probabilistic 学习器中,以便在输入空间$\mathcal{S}$上预测标签$\mathcal{A}$。
  • methods: 本文使用了三个进行知识传递的水平,它们分别是:只有硬标签信息(first level)、教师概率分布信息+硬标签信息(second level)和完整的soft labels信息(third level)。
  • results: 本文证明了,在第一个水平上,只有硬标签信息时,最优的最大 LIKElihood estimator 的渐近速率为 $\sqrt{|{\mathcal S}||{\mathcal A}|}/{n}$。在第二个水平上,增加教师概率分布信息可以提高渐近速率的下界至 ${|{\mathcal S}||{\mathcal A}|}/{n}$。在第三个水平上,使用完整的soft labels信息可以实现渐近速率 ${|{\mathcal S}|}/{n}$ ,而且任何 Kullback-Leibler divergence 最小化器都是优选的。numerical simulations 验证了这些理论结论。
    Abstract We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal S$ over labels $\mathcal A$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{|{\mathcal S}||{\mathcal A}|}/{n}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${|{\mathcal S}||{\mathcal A}|}/{n}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby provably enables the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of $|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.
    摘要 我们Characterize了知识传输的统计效率通过$n$个教师到一个概率学生分类器的输入空间$\mathcal S$上的标签$\mathcal A$.我们显示了隐私信息的三个进步级别可以加速传输。在第一个级别上,只有硬标签是知道的,由最大likelihood估计达到最小最大值$\sqrt{|{\mathcal S}||{\mathcal A}|}/{n}$.在第二个级别上,教师标签的概率也可以获得,这会降低下界到${|{\mathcal S}||{\mathcal A}|}/{n}$.然而,在这个第二个数据收集协议下,直接适应cross-entropy损失会导致漫游学生。我们解决了这个局限性,并达到基本限制,使用一种新的实际变体的平方差logit损失。在第三个级别上,学生还获得了每个输入的完整logits(${\mathcal A}$),这使得学生可以在${|{\mathcal S}|}/{n}$的时间内达到基本限制。我们发现任何Kullback-Leibler divergence最小化者是最佳的。数字实验证明了我们的理论。

When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement

  • paper_url: http://arxiv.org/abs/2310.07831
  • repo_url: https://github.com/facebookresearch/adaptive_scheduling
  • paper_authors: Aaron Defazio, Ashok Cutkosky, Harsh Mehta, Konstantin Mishchenko
  • for: 这个论文的目的是为了bridging theory和实践之间的差距,并为一类优化算法(包括SGD)提供新的问题适应学习率计划。
  • methods: 这篇论文使用了对一类优化算法的推论分析,并使用实际中的观测梯度 norm来Derive更加精细的学习率计划。
  • results: 论文的实验结果表明,linear decay schedule在10种深度学习问题中具有最好的性能,而且在一系列LLMs和一组логистиック回归问题中也有出色的表现。此外,论文还提供了一种自动实现学习率计划的系统方法,可以实现学习率温化和快速学习率降低。
    Abstract Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). In contrast to most prior works that study the convergence of the average iterate, we study the last iterate, which is what most people use in practice. When considering only worst-case analysis, our theory predicts that the best choice is the linear decay schedule: a popular choice in practice that sets the stepsize proportionally to $1 - t/T$, where $t$ is the current iteration and $T$ is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule matches or outperforms all commonly used default schedules including cosine annealing, and that our schedule refinement method gives further improvements.
    摘要 theory和实践中的学习率调度不符合,我们通过提出新的问题适应型学习率调度来减少这一差距。我们的关键技术贡献是对一类优化算法(包括SGD)的学习率调度进行了精细分析。与大多数前一些工作一样,我们研究了平均迭代点的渐进,但是我们强调研究最后一个迭代点,这是实际应用中人们常用的。在假设最坏情况下,我们的理论预测,最佳选择是线性衰减调度:一种广泛使用的做法,其中每个迭代点的步长与当前迭代次数/$T$ 成正比。为了超越这个最坏情况分析,我们使用观察到的梯度 norm 来 derive 更加细化的调度。这些细化调度具有学习率暖启和快速学习率缓和两个性能。我们是首个系统地自动实现这两个特性的系统。我们对10种深度学习问题、一系列LLMs以及一组logistic regression问题进行了最全面的学习率调度评估。我们证明,总的来说,线性衰减调度与所有通用的默认调度(包括cosine annealing)相当或超越。此外,我们的调度修正方法可以提供进一步的改进。

Does Synthetic Data Make Large Language Models More Efficient?

  • paper_url: http://arxiv.org/abs/2310.07830
  • repo_url: None
  • paper_authors: Sia Gholami, Marwan Omar
  • for: 本研究旨在探讨深度学习方法在自然语言处理(NLP)领域中的应用,尤其是关于生成模板化问题的数据生成。
  • methods: 本研究使用模板基于的问题生成方法,以增加数据的多样性和数据量,并对现代变换器模型的性能进行评估。
  • results: 研究发现,使用模板基于的数据生成可以提高变换器模型的性能,但同时也存在风险的过拟合和模板限制的问题。
    Abstract Natural Language Processing (NLP) has undergone transformative changes with the advent of deep learning methodologies. One challenge persistently confronting researchers is the scarcity of high-quality, annotated datasets that drive these models. This paper explores the nuances of synthetic data generation in NLP, with a focal point on template-based question generation. By assessing its advantages, including data augmentation potential and the introduction of structured variety, we juxtapose these benefits against inherent limitations, such as the risk of overfitting and the constraints posed by pre-defined templates. Drawing from empirical evaluations, we demonstrate the impact of template-based synthetic data on the performance of modern transformer models. We conclude by emphasizing the delicate balance required between synthetic and real-world data, and the future trajectories of integrating synthetic data in model training pipelines. The findings aim to guide NLP practitioners in harnessing synthetic data's potential, ensuring optimal model performance in diverse applications.
    摘要 自然语言处理(NLP)在深度学习方法出现后已经经历了转变性变化。一个持续挑战研究人员的问题是数据质量的缺乏,这些模型驱动。这篇论文探讨了NLP中 sintetic数据生成的细节,特点是模板基于的问题生成。我们评估了这些优点,如数据增强潜力和结构化多样性的引入,并与内置的限制进行对比,如过拟合风险和预定模板所带来的限制。通过实验评估,我们证明了模板基于的 sintetic数据对现代转换器模型的性能产生了影响。我们结论,将synthetic数据和真实世界数据进行 equilibrio是NLP实践者需要注意的。将来,我们预计将在模型训练管道中集成synthetic数据,以便优化模型在多样化应用中的性能。这些发现希望能够引导NLP实践者在使用synthetic数据的潜力,以确保模型在多样化应用中的优秀性能。

Exploring the Relationship between Analogy Identification and Sentence Structure Encoding in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07818
  • repo_url: None
  • paper_authors: Thilini Wijesiriwardene, Ruwan Wickramarachchi, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
  • for: 本研究旨在探讨现代自然语言处理(NLP)技术是如何识别句子对应关系,以及这种能力与语言模型(LLM)的句子结构编码能力之间的关系。
  • methods: 本研究使用多种大语言模型(LLM)来识别句子对应关系,并分析这些模型对句子结构的编码能力。
  • results: 研究发现,LLMs的句子对应关系识别能力与它们对句子结构的编码能力之间存在正相关性。具体来说, LLMS 更好地捕捉句子结构,它们也更具句子对应关系识别能力。
    Abstract Identifying analogies plays a pivotal role in human cognition and language proficiency. In the last decade, there has been extensive research on word analogies in the form of ``A is to B as C is to D.'' However, there is a growing interest in analogies that involve longer text, such as sentences and collections of sentences, which convey analogous meanings. While the current NLP research community evaluates the ability of Large Language Models (LLMs) to identify such analogies, the underlying reasons behind these abilities warrant deeper investigation. Furthermore, the capability of LLMs to encode both syntactic and semantic structures of language within their embeddings has garnered significant attention with the surge in their utilization. In this work, we examine the relationship between the abilities of multiple LLMs to identify sentence analogies, and their capacity to encode syntactic and semantic structures. Through our analysis, we find that analogy identification ability of LLMs is positively correlated with their ability to encode syntactic and semantic structures of sentences. Specifically, we find that the LLMs which capture syntactic structures better, also have higher abilities in identifying sentence analogies.
    摘要 理解对比在人类认知和语言能力中发挥着重要作用。过去十年,关于单词对比的研究得到了广泛的关注,但是现在越来越多的研究者关注 sentence对比,即 sentence A 和 sentence B 之间的对比。然而,当前的自然语言处理(NLP)研究社区正在评估大型自然语言模型(LLM)能否识别这类对比。尽管 LLM 的能力在识别对比方面的研究得到了广泛的关注,但是这些能力的下面原因仍然需要更深入的探究。此外, LLM 能够内嵌语言结构的能力也在不断受到关注,特别是它们可以内嵌 sentence 的语法和 semantics 结构。在这篇文章中,我们研究了多个 LLM 的对比 Identification 能力和它们内嵌语言结构的关系。我们发现,LLM 的对比 Identification 能力和它们内嵌语言结构的能力是正相关的。具体来说,我们发现 LLM 可以更好地捕捉语法结构的那些也有更高的对比 Identification 能力。

Generative Modeling with Phase Stochastic Bridges

  • paper_url: http://arxiv.org/abs/2310.07805
  • repo_url: None
  • paper_authors: Tianrong Chen, Jiatao Gu, Laurent Dinh, Evangelos A. Theodorou, Josh Susskind, Shuangfei Zhai
  • for: 这篇论文的目的是提出一种基于阶段空间动力学的生成模型框架,以便更好地生成连续输入数据。
  • methods: 该模型使用了Stochastic Differential Equation(SDE)和神经网络来实现逆向生成。具体来说,它首先在输入空间中定义了一个阶段空间,然后使用Stochastic Optimal Control的理念来建立一个路径度量,以便高效地采样数据。
  • results: 在标准图像生成测试 benchmark 上,该模型在小范围内的Number of Function Evaluations(NFEs)下表现出色,并且与使用有效采样技术的 diffusion models 的性能相当,这说明该模型有potential作为一种新的生成模型工具。
    Abstract Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs. DMs work by constructing a Stochastic Differential Equation (SDE) in the input space (ie, position space), and using a neural network to reverse it. In this work, we introduce a novel generative modeling framework grounded in \textbf{phase space dynamics}, where a phase space is defined as {an augmented space encompassing both position and velocity.} Leveraging insights from Stochastic Optimal Control, we construct a path measure in the phase space that enables efficient sampling. {In contrast to DMs, our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.} This early prediction sets the stage for efficient data generation by leveraging additional velocity information along the trajectory. On standard image generation benchmarks, our model yields favorable performance over baselines in the regime of small Number of Function Evaluations (NFEs). Furthermore, our approach rivals the performance of diffusion models equipped with efficient sampling techniques, underscoring its potential as a new tool generative modeling.
    摘要 干扰模型(DM)表示现代生成模型的极限,它们在维度输入空间中构建了随机差分方程(SDE),并使用神经网络来逆向解决。在这项工作中,我们介绍了一种新的生成模型框架,基于phaspace动力学,phaspace是包括位置和速度的扩展空间。利用Stochastic Optimal Control的洞察,我们在phaspace中建立了一个路径度量,以便高效采样。与DMs不同,我们的框架在动力学征标的早期阶段就能生成真实数据点,这使得可以通过使用速度信息来加速数据生成。在标准图像生成标准benchmark上,我们的模型在小范围内的Number of Function Evaluations(NFEs)下表现优秀,并且与Diffusion Models配备高效采样技术相比,我们的方法在生成模型方面具有潜在的潜力。

A general mechanism of humor: reformulating the semantic overlap

  • paper_url: http://arxiv.org/abs/2310.07803
  • repo_url: None
  • paper_authors: Javier Martínez
  • for: 提出一种通用的幽默机制,不限于语言交流。
  • methods: 基于礼物的概念,包括脱困和冲突的解决方法。
  • results: 提出了一种可以应用于非语言交流的幽默机制,并且认为这种机制是人类思维的核心。
    Abstract This article proposes a cognitive mechanism of humour of general applicability, not restricted to verbal communication. It is indebted to Raskin's concept of script overlap, and conforms to the incongruity-resolution theoretical framework, but it is built on the notion of constraint, an abstract correspondence between sets of data. Under this view, script overlap is an outcome of a more abstractly described phenomenon, constraint overlap. The important concept of the overlooked argument is introduced to characterise the two overlapping constraints -- overt and covert. Their inputs and outputs are not directly encoded in utterances, but implicated by them, and their overlap results in another overlap at the level of the communicated utterances, that the incongruity reveals. Our hypothesis assumes as a given that the evocation of such constraints is a cognitive effect of the inferential process by which a hearer interprets utterances. We base this assumption on Hofstadter's theory of analogy-making as the essence of human thought. By substituting "stimuli" of any kind for "utterances" in this model, we obtain a mechanism as easily applicable to non-verbal communication -- slapstick, cartoons -- and we propose it describes the necessary and sufficient conditions for a communicative act in any modality to carry humour.
    摘要 The authors propose that the evocation of these constraints is a cognitive effect of the inferential process by which a hearer interprets utterances. This idea is based on Hofstadter's theory of analogy-making as the essence of human thought. By applying this mechanism to non-verbal communication, such as slapstick or cartoons, the authors suggest that it provides a necessary and sufficient condition for a communicative act in any modality to carry humor.

An Information Bottleneck Characterization of the Understanding-Workload Tradeoff

  • paper_url: http://arxiv.org/abs/2310.07802
  • repo_url: https://github.com/mycal-tucker/ib-explanations
  • paper_authors: Lindsay Sanneman, Mycal Tucker, Julie Shah
  • for: 这篇论文旨在探讨人工智能系统的可解释性(XAI),以支持人类理解AI系统。
  • methods: 论文使用信息瓶颈方法(Information Bottleneck method)来自动生成抽象(hand-crafted groupings of related problem features),以平衡工作负荷和理解之间的 contradistinction。
  • results: 实验表明,通过抽象来解释复杂概念可以有效地Address和平衡工作负荷和理解之间的 contradistinction。
    Abstract Recent advances in artificial intelligence (AI) have underscored the need for explainable AI (XAI) to support human understanding of AI systems. Consideration of human factors that impact explanation efficacy, such as mental workload and human understanding, is central to effective XAI design. Existing work in XAI has demonstrated a tradeoff between understanding and workload induced by different types of explanations. Explaining complex concepts through abstractions (hand-crafted groupings of related problem features) has been shown to effectively address and balance this workload-understanding tradeoff. In this work, we characterize the workload-understanding balance via the Information Bottleneck method: an information-theoretic approach which automatically generates abstractions that maximize informativeness and minimize complexity. In particular, we establish empirical connections between workload and complexity and between understanding and informativeness through human-subject experiments. This empirical link between human factors and information-theoretic concepts provides an important mathematical characterization of the workload-understanding tradeoff which enables user-tailored XAI design.
    摘要 Translation notes:* "artificial intelligence" is translated as "人工智能" (réngōng zhìnéng)* "explainable AI" is translated as "可解释人工智能" (kějìjiě xiǎng réngōng zhìnéng)* "human understanding" is translated as "人类理解" (réngrì lǐjiě)* "mental workload" is translated as "心理劳动" (xīn lǐ gōngzuò)* "information-theoretic approach" is translated as "信息理论方法" (xìnwù lǐlùn fāngfa)* "abstractions" is translated as "抽象" (chōuxiàng)* "hand-crafted groupings" is translated as "手工组合" (shǒu gōng zǔyì)* "problem features" is translated as "问题特征" (wèn tí tèchēng)* "workload-understanding tradeoff" is translated as "劳动-理解交换" (gōngzuò-lǐjiě gòuhuan)* "user-tailored XAI design" is translated as "用户定制XAI设计" (yònghòu dìngxì XAI jièdì)

Explainable Attention for Few-shot Learning and Beyond

  • paper_url: http://arxiv.org/abs/2310.07800
  • repo_url: None
  • paper_authors: Bahareh Nikpour, Narges Armanfard
  • for: 提高几何学模型的准确率和可靠性,特别在数据采集和标注过程中面临限制的情况下。
  • methods: 利用深度强化学习实现硬注意力找到,直接影响原始输入数据,使其可解释性提高。
  • results: 通过对多个 benchmark 数据集进行广泛的实验,证明我们提出的方法的效果。
    Abstract Attention mechanisms have exhibited promising potential in enhancing learning models by identifying salient portions of input data. This is particularly valuable in scenarios where limited training samples are accessible due to challenges in data collection and labeling. Drawing inspiration from human recognition processes, we posit that an AI baseline's performance could be more accurate and dependable if it is exposed to essential segments of raw data rather than the entire input dataset, akin to human perception. However, the task of selecting these informative data segments, referred to as hard attention finding, presents a formidable challenge. In situations with few training samples, existing studies struggle to locate such informative regions due to the large number of training parameters that cannot be effectively learned from the available limited samples. In this study, we introduce a novel and practical framework for achieving explainable hard attention finding, specifically tailored for few-shot learning scenarios, called FewXAT. Our approach employs deep reinforcement learning to implement the concept of hard attention, directly impacting raw input data and thus rendering the process interpretable for human understanding. Through extensive experimentation across various benchmark datasets, we demonstrate the efficacy of our proposed method.
    摘要 注意机制在增强学习模型方面表现出了扎实的潜力,特别是在数据采集和标注过程中存在困难时。 Drawing inspiration from human recognition processes, we argue that an AI baseline's performance could be more accurate and dependable if it is exposed to essential segments of raw data rather than the entire input dataset, similar to human perception. However, the task of selecting these informative data segments, referred to as hard attention finding, presents a formidable challenge. In situations with few training samples, existing studies struggle to locate such informative regions due to the large number of training parameters that cannot be effectively learned from the available limited samples. In this study, we introduce a novel and practical framework for achieving explainable hard attention finding, specifically tailored for few-shot learning scenarios, called FewXAT. Our approach employs deep reinforcement learning to implement the concept of hard attention, directly impacting raw input data and thus rendering the process interpretable for human understanding. Through extensive experimentation across various benchmark datasets, we demonstrate the efficacy of our proposed method.

A Transfer-Learning-Based Prognosis Prediction Paradigm that Bridges Data Distribution Shift across EMR Datasets

  • paper_url: http://arxiv.org/abs/2310.07799
  • repo_url: None
  • paper_authors: Zhongji Zhang, Yuhang Wang, Yinghao Zhu, Xinyu Ma, Tianlong Wang, Chaohe Zhang, Yasha Wang, Liantao Ma
  • for: 预测新疆突病和其他疾病的准确性
  • methods: 使用传输学习方法建立源数据集和目标数据集之间的转换模型,以适应不同任务域下的特征分布偏移问题
  • results: 比基eline方法高效,特别是在处理有限数据量时 display(“我们的方法可以更好地预测新疆突病和其他疾病。”)
    Abstract Due to the limited information about emerging diseases, symptoms are hard to be noticed and recognized, so that the window for clinical intervention could be ignored. An effective prognostic model is expected to assist doctors in making right diagnosis and designing personalized treatment plan, so to promptly prevent unfavorable outcomes. However, in the early stage of a disease, limited data collection and clinical experiences, plus the concern out of privacy and ethics, may result in restricted data availability for reference, to the extent that even data labels are difficult to mark correctly. In addition, Electronic Medical Record (EMR) data of different diseases or of different sources of the same disease can prove to be having serious cross-dataset feature misalignment problems, greatly mutilating the efficiency of deep learning models. This article introduces a transfer learning method to build a transition model from source dataset to target dataset. By way of constraining the distribution shift of features generated in disparate domains, domain-invariant features that are exclusively relative to downstream tasks are captured, so to cultivate a unified domain-invariant encoder across various task domains to achieve better feature representation. Experimental results of several target tasks demonstrate that our proposed model outperforms competing baseline methods and has higher rate of training convergence, especially in dealing with limited data amount. A multitude of experiences have proven the efficacy of our method to provide more accurate predictions concerning newly emergent pandemics and other diseases.
    摘要 (Simplified Chinese translation)由于疾病出现的信息有限,症状难以注意和识别,因此临床 intervención的窗口可能被忽略。一个有效的预测模型可以帮助医生确定病种和制定个性化的治疗方案,以便更快地预防不利的结果。然而,疾病的早期阶段,有限的数据收集和临床经验,加上隐私和伦理的担忧,可能导致参考数据的有限性,甚至数据标签难以正确地标注。此外,不同疾病或同一种疾病的不同来源的电子医疗记录(EMR)数据可能会导致严重的跨数据集特征不一致问题,大大降低深度学习模型的效率。这篇文章介绍了一种传输学习方法,用于从源数据集转移到目标数据集。通过限制不同领域中特征生成的分布shift,捕捉固有的领域特征,以便在不同任务领域中培养一个统一的领域特征不变的编码器,以达到更好的特征表示。实验结果表明,我们提出的方法在多个目标任务上表现出色,特别是在处理有限数据量时。多种经验证明了我们的方法在新出现的流行病和其他疾病预测方面的可靠性。

GenTKG: Generative Forecasting on Temporal Knowledge Graph

  • paper_url: http://arxiv.org/abs/2310.07793
  • repo_url: None
  • paper_authors: Ruotong Liao, Xu Jia, Yunpu Ma, Volker Tresp
  • for: 用于替代传统的 embedding-based 和 rule-based 模型,并在 temporal knowledge graph 领域实现生成式预测。
  • methods: 提出了一种基于 retrieval 策略和 lightweight 参数efficient instruciton tuning的生成式预测方法,named GenTKG。
  • results: 在低计算资源下,GenTKG 比传统方法有更高的预测性能,并且在未经重新训练的情况下,在未看到的数据集上也表现出了很好的转移性。
    Abstract The rapid advancements in large language models (LLMs) have ignited interest in the temporal knowledge graph (tKG) domain, where conventional carefully designed embedding-based and rule-based models dominate. The question remains open of whether pre-trained LLMs can understand structured temporal relational data and replace them as the foundation model for temporal relational forecasting. Therefore, we bring temporal knowledge forecasting into the generative setting. However, challenges occur in the huge chasms between complex temporal graph data structure and sequential natural expressions LLMs can handle, and between the enormous data sizes of tKGs and heavy computation costs of finetuning LLMs. To address these challenges, we propose a novel retrieval augmented generation framework that performs generative forecasting on tKGs named GenTKG, which combines a temporal logical rule-based retrieval strategy and lightweight parameter-efficient instruction tuning. Extensive experiments have shown that GenTKG outperforms conventional methods of temporal relational forecasting under low computation resources. GenTKG also highlights remarkable transferability with exceeding performance on unseen datasets without re-training. Our work reveals the huge potential of LLMs in the tKG domain and opens a new frontier for generative forecasting on tKGs.
    摘要 大Language Model (LLM)的快速进步使得temporal knowledge graph (tKG)领域受到了关注,而在这个领域,传统的经过设计的嵌入式和规则基本模型仍然主导。问题是, pré-trained LLMs能否理解结构化的时间关系数据,并取代它们作为时间关系预测的基本模型?因此,我们将 temporal knowledge forecasting 引入到生成设定中。然而,在复杂的时间图数据结构和Sequential natural expressions LLMs处理的大� Fischer gap 和 tKGs的庞大数据量和轻量级 fine-tuning LLMs 的计算成本之间存在挑战。为解决这些挑战,我们提出了一种新的检索增强生成框架,名为 GenTKG,它结合了时间逻辑规则基本的检索策略和轻量级参数高效调整。我们的实验表明,GenTKG 在低计算资源下超过了传统的时间关系预测方法。GenTKG 还表现出了很好的转移性,在未看过的数据集上达到了 excel 的表现。我们的工作揭示了 LLMs 在 tKG 领域的巨大潜力,并开启了一个新的前ier для generative forecasting on tKGs。

DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model

  • paper_url: http://arxiv.org/abs/2310.07771
  • repo_url: None
  • paper_authors: Xiaofan Li, Yifu Zhang, Xiaoqing Ye
  • for: 提供高质量、大规模多视图视频数据,用于自动驾驶研究。
  • methods: 提出了一种基于 Bird’s-Eye-View(BEV)表示的协调扩散框架DrivingDiffusion,用于生成真实多视图视频。
  • results: 通过该框架,可以免费生成大规模、高质量多视图视频,用于驱动任务研究。
    Abstract With the increasing popularity of autonomous driving based on the powerful and unified bird's-eye-view (BEV) representation, a demand for high-quality and large-scale multi-view video data with accurate annotation is urgently required. However, such large-scale multi-view data is hard to obtain due to expensive collection and annotation costs. To alleviate the problem, we propose a spatial-temporal consistent diffusion framework DrivingDiffusion, to generate realistic multi-view videos controlled by 3D layout. There are three challenges when synthesizing multi-view videos given a 3D layout: How to keep 1) cross-view consistency and 2) cross-frame consistency? 3) How to guarantee the quality of the generated instances? Our DrivingDiffusion solves the problem by cascading the multi-view single-frame image generation step, the single-view video generation step shared by multiple cameras, and post-processing that can handle long video generation. In the multi-view model, the consistency of multi-view images is ensured by information exchange between adjacent cameras. In the temporal model, we mainly query the information that needs attention in subsequent frame generation from the multi-view images of the first frame. We also introduce the local prompt to effectively improve the quality of generated instances. In post-processing, we further enhance the cross-view consistency of subsequent frames and extend the video length by employing temporal sliding window algorithm. Without any extra cost, our model can generate large-scale realistic multi-camera driving videos in complex urban scenes, fueling the downstream driving tasks. The code will be made publicly available.
    摘要 随着自动驾驶基于强大和统一的 bird's-eye-view (BEV) 表示的 популяр度增长,需要大量高质量多视图视频数据和准确的标注,但这些数据很难获得因为收集和标注成本高昂。为解决这个问题,我们提出了一个空间-时间一致的扩散框架 DrivingDiffusion,用于生成真实的多视图视频,控制了3D 布局。在生成多视图视频时,存在以下三个挑战:1)保持多视图视频之间的一致性和2)保持多帧视频之间的一致性?3)如何保证生成的实例质量?我们的 DrivingDiffusion 解决这些问题,通过将多视图单帧图像生成步骤、多camera共享的单视图视频生成步骤和后处理步骤,来生成真实的多视图视频。在多视图模型中,保证多视图图像之间的一致性,通过邻接相机之间的信息交换。在时间模型中,我们主要从多视图图像的首帧中提取需要注意的信息,并通过引入本地提示来提高生成的实例质量。在后处理步骤中,我们进一步提高了后续帧之间的一致性,并通过使用时间滑动窗口算法,延长视频的长度。无需额外成本,我们的模型可以生成大量高质量的多相机城市驾驶视频,为下游驾驶任务提供燃料。代码将公开。

PAD: A Dataset and Benchmark for Pose-agnostic Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.07716
  • repo_url: https://github.com/ericlee0224/pad
  • paper_authors: Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, Hao Zhao
  • for: 这个论文的目的是解决对象异常检测中的两个主要挑战:第一个是现有数据集缺乏完整的视觉信息,其中数据集通常假设训练和测试样本具有相同的pose angle,但在实际应用中,异常可能存在任何对象区域,需要研究无关于pose的异常检测。第二个挑战是对于无关于pose的异常检测的实验协议的缺乏一致性,这使得不同方法之间的比较不公平,阻碍了无关于pose的异常检测的研究。
  • methods: 作者们开发了一个名为Multi-pose Anomaly Detection(MAD)数据集和Pose-agnostic Anomaly Detection(PAD)benchmark,以解决上述两个挑战。Specifically,他们使用了20种复杂形状的LEGO玩具,包括4K视图,以及高质量和多样化的3D异常在 both simulated和real environments中。此外,作者们还提出了一种名为OmniposeAD的新方法,通过使用MAD进行训练,专门设计用于无关于pose的异常检测。
  • results: 作者们通过了全面的评估,证明了他们的数据集和方法的相关性。此外,他们还提供了一个开源的benchmark库,包括数据集和基eline方法,以便未来的研究和应用。代码、数据和模型都公开可用于https://github.com/EricLee0224/PAD。
    Abstract Object anomaly detection is an important problem in the field of machine vision and has seen remarkable progress recently. However, two significant challenges hinder its research and application. First, existing datasets lack comprehensive visual information from various pose angles. They usually have an unrealistic assumption that the anomaly-free training dataset is pose-aligned, and the testing samples have the same pose as the training data. However, in practice, anomaly may exist in any regions on a object, the training and query samples may have different poses, calling for the study on pose-agnostic anomaly detection. Second, the absence of a consensus on experimental protocols for pose-agnostic anomaly detection leads to unfair comparisons of different methods, hindering the research on pose-agnostic anomaly detection. To address these issues, we develop Multi-pose Anomaly Detection (MAD) dataset and Pose-agnostic Anomaly Detection (PAD) benchmark, which takes the first step to address the pose-agnostic anomaly detection problem. Specifically, we build MAD using 20 complex-shaped LEGO toys including 4K views with various poses, and high-quality and diverse 3D anomalies in both simulated and real environments. Additionally, we propose a novel method OmniposeAD, trained using MAD, specifically designed for pose-agnostic anomaly detection. Through comprehensive evaluations, we demonstrate the relevance of our dataset and method. Furthermore, we provide an open-source benchmark library, including dataset and baseline methods that cover 8 anomaly detection paradigms, to facilitate future research and application in this domain. Code, data, and models are publicly available at https://github.com/EricLee0224/PAD.
    摘要 “物体异常检测是机器视觉领域的重要问题,最近有很大的进步。然而,两个主要挑战是阻碍其研究和应用。第一个是现有数据集缺乏全面的视觉信息,通常假设 anomaly-free 训练数据集是同一个 pose 的,测试样本也是同一个 pose。然而,在实际情况下,异常可能存在于对象任意区域,训练和查询样本可能有不同的 pose,需要研究无 pose 的异常检测。第二个是对pose-agnostic异常检测的实验室协议缺乏共识,导致不公正的比较,阻碍研究pose-agnostic异常检测。为解决这些问题,我们开发了Multi-pose Anomaly Detection(MAD)数据集和Pose-agnostic Anomaly Detection(PAD) benchmar,这是解决pose-agnostic异常检测问题的第一步。 Specifically,我们使用了20种复杂形状的 LEGO 玩具,包括4K 视图和各种 pose,以及高质量和多样化的3D 异常在 both simulated 和实际环境中。此外,我们提出了一种新的 OmniposeAD 方法,通过 MAD 训练而得,专门针对无 pose 异常检测。通过全面的评估,我们证明了我们的数据集和方法的相关性。此外,我们还提供了一个开源的benchmark库,包括dataset和基准方法,覆盖8种异常检测思想,以便未来的研究和应用。代码、数据和模型都公开可用于https://github.com/EricLee0224/PAD。”

InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining

  • paper_url: http://arxiv.org/abs/2310.07713
  • repo_url: None
  • paper_authors: Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro
  • for: 这个论文是为了研究预训练自然语言模型(LLM)的可靠性和精度,以及如何通过外部数据库来提高这些模型的性能。
  • methods: 这个论文使用了Retrieval方法来预训练LLM,并在这个基础模型上进行了更多的预训练和调教。
  • results: 论文的实验结果表明,使用Retrieval方法预训练LLM后,可以大幅提高模型的精度和可靠性,并且可以在零基础情况下进行成功的问答 tasks。
    Abstract Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLM is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval before instruction tuning. Specifically, we continue to pretrain the 43B GPT model on additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. The obtained foundation model, Retro 48B, largely outperforms the original 43B GPT in terms of perplexity. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on zero-shot question answering (QA) tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA tasks, and 10% over GPT across 4 challenging long-form QA tasks. Surprisingly, we find that one can ablate the encoder from InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. We hypothesize that pretraining with retrieval makes its decoder good at incorporating context for QA. Our results highlights the promising direction to obtain a better GPT decoder for QA through continued pretraining with retrieval before instruction tuning.
    摘要 <> translate "Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLM is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval before instruction tuning. Specifically, we continue to pretrain the 43B GPT model on additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. The obtained foundation model, Retro 48B, largely outperforms the original 43B GPT in terms of perplexity. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on zero-shot question answering (QA) tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA tasks, and 10% over GPT across 4 challenging long-form QA tasks. Surprisingly, we find that one can ablate the encoder from InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. We hypothesize that pretraining with retrieval makes its decoder good at incorporating context for QA. Our results highlights the promising direction to obtain a better GPT decoder for QA through continued pretraining with retrieval before instruction tuning." into Simplified Chinese.Here's the translation:<>预训自动循环大语言模型(LLM)与检索结合可以提高混淆率和事实准确率,通过利用外部数据库。然而,现有的预训检索增强LLM的大小仍然有限(例如,Retro有7.5亿参数),这限制了指令调整和零基数泛化的效iveness。在这种工作中,我们引入Retro 48B,最大化LLM预训检索后的指令调整。具体来说,我们继续预训43B GPT模型在additional 100亿个字符上,使用Retro增强方法,通过检索1.2万亿个字符来进行预训。获得的基础模型,Retro 48B,与原始43B GPT在混淆率上显著提高。在Retro上进行指令调整后,InstructRetro在零基数问答任务上表现出了显著提高。具体来说,InstructRetro在8个短形问答任务中平均提高7%,在4个挑战长形问答任务中提高10%。 surprisely,我们发现可以从InstructRetro架构中除去encoder,直接使用其decoder backbone,而 achieve comparable results。我们 hypothesize that预训检索使得其decoder可以好好地包含上下文。我们的结果显示了预训检索后GPT decoder的提高的可能性,并且 highlights the promising direction to obtain a better GPT decoder for QA through continued pretraining with retrieval before instruction tuning。

Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks

  • paper_url: http://arxiv.org/abs/2310.07711
  • repo_url: None
  • paper_authors: Ziming Liu, Mikail Khona, Ila R. Fiete, Max Tegmark
  • for: 这个论文的目的是研究如何使用机器学习方法来实现脑模式下的神经网络结构。
  • methods: 这个论文使用的方法是一种名为“脑灵感模块化训练”(BIMT),它可以让神经网络中的神经元组织成功参与到同一些计算任务中,同时也可以使神经网络的性能更高。
  • results: 研究发现,通过使用BIMT训练神经网络,可以同时实现功能和结构的模块化,并且这些模块化的神经元也可以在不同的计算任务中保持一定的稳定性。此外,相比标准的$L_1$或无regularization设置,BIMT可以使神经网络的性能更高。
    Abstract Recurrent neural networks (RNNs) trained on compositional tasks can exhibit functional modularity, in which neurons can be clustered by activity similarity and participation in shared computational subtasks. Unlike brains, these RNNs do not exhibit anatomical modularity, in which functional clustering is correlated with strong recurrent coupling and spatial localization of functional clusters. Contrasting with functional modularity, which can be ephemerally dependent on the input, anatomically modular networks form a robust substrate for solving the same subtasks in the future. To examine whether it is possible to grow brain-like anatomical modularity, we apply a recent machine learning method, brain-inspired modular training (BIMT), to a network being trained to solve a set of compositional cognitive tasks. We find that functional and anatomical clustering emerge together, such that functionally similar neurons also become spatially localized and interconnected. Moreover, compared to standard $L_1$ or no regularization settings, the model exhibits superior performance by optimally balancing task performance and network sparsity. In addition to achieving brain-like organization in RNNs, our findings also suggest that BIMT holds promise for applications in neuromorphic computing and enhancing the interpretability of neural network architectures.
    摘要 Recurrent neural networks (RNNs) 训练在compositional tasks上可以显示函数含量,在这些 neurons 可以被分为活动相似性和共享计算子任务中的集群。与大脑不同,这些 RNNs 不会显示解剖学含量,解剖学含量与强回路互连和功能集群的空间地域化强相关。与函数含量不同,解剖学含量可以在输入的影响下短暂地存在。为了检查是否可以培养大脑类似的解剖学含量,我们在一个解剖学含量训练(BIMT)中训练一个解剖学含量的网络,以解决一组compositional cognitive tasks。我们发现,功能相似的 neurons 不仅在活动上相似,还在空间上受到相似的归一化和连接。此外,相比标准 $L_1$ 或无规则化设置,模型在任务性能和网络稀缺性之间取得了优质平衡,并且表现出色。除了在 RNNs 中实现大脑类似的组织结构外,我们的发现还表明BIMT在 neuromorphic computing 和增强神经网络架构的解释性方面具有潜在的潜力。

Pixel State Value Network for Combined Prediction and Planning in Interactive Environments

  • paper_url: http://arxiv.org/abs/2310.07706
  • repo_url: None
  • paper_authors: Sascha Rosbach, Stefan M. Leupold, Simon Großjohann, Stefan Roth
  • for: 本研究旨在提高自动驾驶车辆在城市环境中的交通互动能力。
  • methods: 该研究提出了一种基于深度学习的方法,将预测和规划分别作为两个独立模块。 conditional GAN with U-Net architecture 是用于预测高分辨率图像序列的。
  • results: 研究结果表明,该方法可以在复杂的情况下,如车道变换 amidst conflicting objectives 中展现出直观的行为。
    Abstract Automated vehicles operating in urban environments have to reliably interact with other traffic participants. Planning algorithms often utilize separate prediction modules forecasting probabilistic, multi-modal, and interactive behaviors of objects. Designing prediction and planning as two separate modules introduces significant challenges, particularly due to the interdependence of these modules. This work proposes a deep learning methodology to combine prediction and planning. A conditional GAN with the U-Net architecture is trained to predict two high-resolution image sequences. The sequences represent explicit motion predictions, mainly used to train context understanding, and pixel state values suitable for planning encoding kinematic reachability, object dynamics, safety, and driving comfort. The model can be trained offline on target images rendered by a sampling-based model-predictive planner, leveraging real-world driving data. Our results demonstrate intuitive behavior in complex situations, such as lane changes amidst conflicting objectives.
    摘要 自动驾驶车辆在城市环境中必须可靠地与其他交通参与者交互。规划算法经常利用分离的预测模块预测 probabilistic、多模式和互动行为。将预测和规划分为两个模块会导致很多挑战,尤其是由于这两个模块之间的互相关系。这项工作提出了基于深度学习的方法,将预测和规划合并起来。使用 conditional GAN WITH U-Net 架构,训练预测两个高分辨率图像序列。这两个序列表示明确的运动预测,主要用于训练上下文理解,以及适用于规划编码减速可能性、物体动力学、安全和驾驶舒适。模型可以在 target 图像上进行训练,使用采样基本的模拟预测规划器生成的图像,利用实际驾驶数据。我们的结果表明在复杂的情况下,如lane change amidst conflicting objectives, exhibit intuitive behavior。

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

  • paper_url: http://arxiv.org/abs/2310.07699
  • repo_url: None
  • paper_authors: Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, Meng Cao
  • for: 提高CLIP模型的训练效果和数据效率
  • methods: 利用视觉概念和新生成的视觉增强caption(VeC)进行拓展和改进 caption,并提出了一种混合训练方案
  • results: 对于不同规模的原始数据进行了全面的评估,显示 VeCLIP 在图像-文本对齐和总体模型性能方面具有显著优势,例如在 COCO 和 Flickr30k 检索任务中的 Retrieval 任务中达到了20%以上的提升,同时在数据效率方面也达到了3%以上的提升。
    Abstract Web-crawled datasets are pivotal to the success of pre-training vision-language models, exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant to images, thereby undermining the crucial image-text alignment. Existing methods for rewriting captions using large language models (LLMs) have shown promise on small, curated datasets like CC3M and CC12M. Nevertheless, their efficacy on massive web-captured captions is constrained by the inherent noise and randomness in such data. In this study, we address this limitation by focusing on two key aspects: data quality and data variety. Unlike recent LLM rewriting techniques, we emphasize exploiting visual concepts and their integration into the captions to improve data quality. For data variety, we propose a novel mixed training scheme that optimally leverages AltTexts alongside newly generated Visual-enriched Captions (VeC). We use CLIP as one example and adapt the method for CLIP training on large-scale web-crawled datasets, named VeCLIP. We conduct a comprehensive evaluation of VeCLIP across small, medium, and large scales of raw data. Our results show significant advantages in image-text alignment and overall model performance, underscoring the effectiveness of VeCLIP in improving CLIP training. For example, VeCLIP achieves a remarkable over 20% improvement in COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, we also achieve a notable over 3% improvement while using only 14% of the data employed in the vanilla CLIP and 11% in ALIGN.
    摘要 网络爬取数据集是CLIP成功的关键因素,但网络爬取AltText可能是不稳定和无关的图像,从而损害图像和文本的对齐。现有的使用大型自然语言模型(LLM)重写caption的方法有所成就在小型 cura dataset上,但它们在大量网络抓取caption上的效果受限于数据的噪音和随机性。在这项研究中,我们解决这一问题,重点关注数据质量和数据多样性两个方面。与之前的LLM重写技术不同,我们强调利用视觉概念并将其 интегриinto caption中以提高数据质量。为了提高数据多样性,我们提议一种新的混合训练方案,利用AltText alongside新生成的视觉增强caption(VeC)进行优化。我们使用CLIP作为一个例子,并适应CLIP在大规模网络抓取数据上进行训练,称之为VeCLIP。我们对VeCLIP进行了广泛的评估,包括小、中和大规模的 raw data 评估。我们的结果表明,VeCLIP在图像和文本对齐方面存在显著改善,并且在COCO和Flickr30k检索任务中 achieved 辉煌的提升,尤其在12M设定下。此外,我们还在数据效率方面取得了明显的提升,只使用了14%的数据,相比于vanilla CLIP和ALIGN使用的11%和14%。

SurroCBM: Concept Bottleneck Surrogate Models for Generative Post-hoc Explanation

  • paper_url: http://arxiv.org/abs/2310.07698
  • repo_url: None
  • paper_authors: Bo Pan, Zhenke Liu, Yifei Zhang, Liang Zhao
  • For: 这 paper 的目的是解释黑盒模型的决策过程,以提高模型的可解释性。* Methods: 这 paper 使用了 Concept Activation Vectors (CAVs) 和 Concept Bottleneck Models (CBMs) 等新的技术,以提供基于概念的解释。但是,这些技术需要人工定义的概念,可能是成本高的。因此,这 paper 提出了一种新的框架,即 Concept Bottleneck Surrogate Models (SurroCBM),可以自动发现黑盒模型中的概念,并提供可解释的模型。* Results: 经过广泛的实验,这 paper 证明了 SurroCBM 的可行性和有效性,并且可以不断提高解释质量。这表明 SurroCBM 有可能成为黑盒模型可解释性的新途径。
    Abstract Explainable AI seeks to bring light to the decision-making processes of black-box models. Traditional saliency-based methods, while highlighting influential data segments, often lack semantic understanding. Recent advancements, such as Concept Activation Vectors (CAVs) and Concept Bottleneck Models (CBMs), offer concept-based explanations but necessitate human-defined concepts. However, human-annotated concepts are expensive to attain. This paper introduces the Concept Bottleneck Surrogate Models (SurroCBM), a novel framework that aims to explain the black-box models with automatically discovered concepts. SurroCBM identifies shared and unique concepts across various black-box models and employs an explainable surrogate model for post-hoc explanations. An effective training strategy using self-generated data is proposed to enhance explanation quality continuously. Through extensive experiments, we demonstrate the efficacy of SurroCBM in concept discovery and explanation, underscoring its potential in advancing the field of explainable AI.
    摘要 <>用�ayer 的 Explainable AI 技术来揭示黑盒型模型的决策过程。传统的焦点方法可以高亮影响数据段的数据,但缺乏含义理解。最近的进步包括概念活化 вектор (CAV) 和概念瓶颈模型 (CBM),可以提供基于概念的解释,但需要人类定义的概念。然而,人类标注的概念是有成本的获得的。这篇论文介绍了概念瓶颈代理模型 (SurroCBM),一种新的框架,用于解释黑盒型模型的决策过程。SurroCBM 可以自动发现黑盒型模型中共享和特有的概念,并使用可解释的代理模型进行后期解释。通过自动生成的数据进行高质量的训练,可以不断提高解释质量。经过广泛的实验,我们证明 SurroCBM 在概念发现和解释方面具有极高的效果,这有助于进一步发展透明 AI 技术。

Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design

  • paper_url: http://arxiv.org/abs/2310.07684
  • repo_url: None
  • paper_authors: Lev Telyatnikov, Maria Sofia Bucarelli, Guillermo Bernardez, Olga Zaghen, Simone Scardapane, Pietro Lio
  • for: 本文探讨了hypergraph学习领域中存在的一些问题,包括homophily在高阶网络中的作用、现有的hypergraph架构和方法的可能性,以及现有的数据集是否能够为高阶网络学习提供有意义的比较标准。
  • methods: 本文提出了一种基于消息传递方式的高阶网络内部homophily概念,并提出了一种新的消息传递框架MultiSet,以及一种基于新的超链抽样策略的新架构MultiSetMixer。
  • results: 经过广泛的实验,本文得出了许多有价值的发现,包括homophily在高阶网络中的作用、现有的hypergraph架构和方法的局限性,以及一些改进的方法和架构的可能性。
    Abstract Most of the current hypergraph learning methodologies and benchmarking datasets in the hypergraph realm are obtained by lifting procedures from their graph analogs, simultaneously leading to overshadowing hypergraph network foundations. This paper attempts to confront some pending questions in that regard: Can the concept of homophily play a crucial role in Hypergraph Neural Networks (HGNNs), similar to its significance in graph-based research? Is there room for improving current hypergraph architectures and methodologies? (e.g. by carefully addressing the specific characteristics of higher-order networks) Do existing datasets provide a meaningful benchmark for HGNNs? Diving into the details, this paper proposes a novel conceptualization of homophily in higher-order networks based on a message passing scheme; this approach harmonizes the analytical frameworks of datasets and architectures, offering a unified perspective for exploring and interpreting complex, higher-order network structures and dynamics. Further, we propose MultiSet, a novel message passing framework that redefines HGNNs by allowing hyperedge-dependent node representations, as well as introduce a novel architecture MultiSetMixer that leverages a new hyperedge sampling strategy. Finally, we provide an extensive set of experiments that contextualize our proposals and lead to valuable insights in hypergraph representation learning.
    摘要 Currently, most hypergraph learning methodologies and benchmarking datasets in the hypergraph realm are derived from lifting procedures from their graph analogs, which can lead to overshadowing hypergraph network foundations. This paper aims to address some outstanding questions in this regard: Can the concept of homophily play a crucial role in Hypergraph Neural Networks (HGNNs), similar to its significance in graph-based research? Is there room for improving current hypergraph architectures and methodologies? (e.g., by carefully addressing the specific characteristics of higher-order networks) Do existing datasets provide a meaningful benchmark for HGNNs?In detail, this paper proposes a novel conceptualization of homophily in higher-order networks based on a message passing scheme, which harmonizes the analytical frameworks of datasets and architectures, offering a unified perspective for exploring and interpreting complex, higher-order network structures and dynamics. Furthermore, we propose MultiSet, a novel message passing framework that redefines HGNNs by allowing hyperedge-dependent node representations, as well as introduce a novel architecture MultiSetMixer that leverages a new hyperedge sampling strategy. Finally, we provide an extensive set of experiments that contextualize our proposals and lead to valuable insights in hypergraph representation learning.Here's the translation in Traditional Chinese:现在,大多数的超гра网学方法和测试数据集在超гра网领域都是从它们的几何网领域中提取出来的,这可能会导致超гра网网络基础建筑的陌生。本文尝试回答一些尚未得到解答的问题:在超гра网神经网络(HGNN)中,认可性(homophily)是否能够扮演重要的角色,跟graph-based研究中的认可性一样?现有的数据集是否能够提供有意义的参考 benchmark для HGNN?进一步详细地说,本文提出了一个新的高阶网络中认可性的概念化方法,基于讯息传递方案,这种方法可以融合数据集和架构的分析框架,提供一个统一的见解来探索和解释高阶网络结构和动态的复杂性。此外,我们还提出了 MultiSet,一个新的讯息传递框架,它可以让超边依赖的节点表现,以及引入一个新的超边抽样策略。最后,我们提供了一系列实验,以背景和评估我们的提议,并带来有价值的见解在超гра网表示学习中。

Controllable Data Generation Via Iterative Data-Property Mutual Mappings

  • paper_url: http://arxiv.org/abs/2310.07683
  • repo_url: None
  • paper_authors: Bo Pan, Muran Qin, Shiyu Wang, Yifei Zhang, Liang Zhao
  • for: 这个论文的目的是提高基于VAE的数据生成器的控制性和分离性。
  • methods: 这个论文提出了一种普适的框架,通过在数据和属性之间进行互相映射,来增强VAE基于的数据生成器的控制性和分离性。
  • results: 实验结果表明,该框架可以在短时间内准确地控制生成样本的属性,同时保持生成样本的有效性和分离性。
    Abstract Deep generative models have been widely used for their ability to generate realistic data samples in various areas, such as images, molecules, text, and speech. One major goal of data generation is controllability, namely to generate new data with desired properties. Despite growing interest in the area of controllable generation, significant challenges still remain, including 1) disentangling desired properties with unrelated latent variables, 2) out-of-distribution property control, and 3) objective optimization for out-of-distribution property control. To address these challenges, in this paper, we propose a general framework to enhance VAE-based data generators with property controllability and ensure disentanglement. Our proposed objective can be optimized on both data seen and unseen in the training set. We propose a training procedure to train the objective in a semi-supervised manner by iteratively conducting mutual mappings between the data and properties. The proposed framework is implemented on four VAE-based controllable generators to evaluate its performance on property error, disentanglement, generation quality, and training time. The results indicate that our proposed framework enables more precise control over the properties of generated samples in a short training time, ensuring the disentanglement and keeping the validity of the generated samples.
    摘要 Translation notes:* 离干分离 (lìgǎn fēnhū) refers to the problem of disentangling desired properties from unrelated latent variables.* OUT-OF-DISTRIBUTION (OUT-OF-DISTRIBUTION) refers to the challenge of controlling properties that are not present in the training data.* 目标优化 (mèngtǎo yòujiā) refers to the process of optimizing the objective function to achieve the desired properties.* VAE-based controllable generators (VAE-based kòngzhì yǐngchǎng) refers to the generative models that use Variational Autoencoders (VAEs) to generate data with desired properties.

Explainable Image Similarity: Integrating Siamese Networks and Grad-CAM

  • paper_url: http://arxiv.org/abs/2310.07678
  • repo_url: None
  • paper_authors: Ioannis E. Livieris, Emmanuel Pintelas, Niki Kiriakidou, Panagiotis Pintelas
  • for: 提高图像相似性评估的可解释性,以便更好地理解图像之间的相似性原因。
  • methods: 提出了一种基于Siamese网络和Grad-CAM的图像相似性评估方法,并提供了可视化的实际和假设性解释。
  • results: 提出了一种新的图像相似性评估框架,可以提供可解释的图像相似性分数以及实际和假设性解释,并且有可能提高图像基于系统的解释性、可靠性和用户接受度。
    Abstract With the proliferation of image-based applications in various domains, the need for accurate and interpretable image similarity measures has become increasingly critical. Existing image similarity models often lack transparency, making it challenging to understand the reasons why two images are considered similar. In this paper, we propose the concept of explainable image similarity, where the goal is the development of an approach, which is capable of providing similarity scores along with visual factual and counterfactual explanations. Along this line, we present a new framework, which integrates Siamese Networks and Grad-CAM for providing explainable image similarity and discuss the potential benefits and challenges of adopting this approach. In addition, we provide a comprehensive discussion about factual and counterfactual explanations provided by the proposed framework for assisting decision making. The proposed approach has the potential to enhance the interpretability, trustworthiness and user acceptance of image-based systems in real-world image similarity applications. The implementation code can be found in https://github.com/ioannislivieris/Grad_CAM_Siamese.git.
    摘要 We present a new framework that integrates Siamese Networks and Grad-CAM for providing explainable image similarity. This approach has the potential to enhance the interpretability, trustworthiness, and user acceptance of image-based systems in real-world image similarity applications.In addition, we provide a comprehensive discussion of the factual and counterfactual explanations provided by the proposed framework, which can assist decision-making. The proposed approach has the potential to improve the interpretability and trustworthiness of image-based systems, and the implementation code can be found at https://github.com/ioannislivieris/Grad_CAM_Siamese.git.

Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples

  • paper_url: http://arxiv.org/abs/2310.07747
  • repo_url: None
  • paper_authors: Hao Sun, Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar
  • for: 本研究旨在提出一种负责任控制器,以便在决策系统中减少实际应用中的风险。
  • methods: 本研究使用了离线数据集作为决策 cuerpo,并根据特定的例子选择(称为 Corpus Subset)进行负责任控制。
  • results: 研究表明,AOC可以在低数据量情况下运行,并且可以在具有离线imitating设定的情况下进行扩展。AOC在模拟和实际医疗应用中表现出了高水平的性能,同时保持负责任。
    Abstract Learning controllers with offline data in decision-making systems is an essential area of research due to its potential to reduce the risk of applications in real-world systems. However, in responsibility-sensitive settings such as healthcare, decision accountability is of paramount importance, yet has not been adequately addressed by the literature. This paper introduces the Accountable Offline Controller (AOC) that employs the offline dataset as the Decision Corpus and performs accountable control based on a tailored selection of examples, referred to as the Corpus Subset. AOC operates effectively in low-data scenarios, can be extended to the strictly offline imitation setting, and displays qualities of both conservation and adaptability. We assess AOC's performance in both simulated and real-world healthcare scenarios, emphasizing its capability to manage offline control tasks with high levels of performance while maintaining accountability.
    摘要 学习控制器使用停机数据进行决策系统的研究是一个非常重要的领域,因为它可以减少实际系统中的风险。然而,在责任感知的设置中,决策责任并没有得到文献充分考虑。这篇论文介绍了负责任控制器(AOC),它使用停机dataset作为决策体系,并基于定制的示例选择(称为Corpus Subset)进行负责任控制。AOC在低数据情况下运行得非常有效,可以扩展到严格的停机模仿环境,并表现出保守和适应的特点。我们在模拟和实际医疗场景中评估了AOC的性能,强调它在处理停机控制任务时能够达到高水平的性能,同时保持责任。

HaarNet: Large-scale Linear-Morphological Hybrid Network for RGB-D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.07669
  • repo_url: None
  • paper_authors: Rick Groenendijk, Leo Dorst, Theo Gevers
  • for: 本研究旨在开探使用多样性模式的约束来提高RGB-D数据的处理和分析效能。
  • methods: 本文提出了一种混合线性- morphological 网络,称为 HaarNet,使用了 morphological 元素和常见的线性模块。
  • results: 实验表明, HaarNet 与当前最佳 CNN 相当竞争,表明 morphological 网络是 geometry-based 学习任务的可能的研究方向。
    Abstract Signals from different modalities each have their own combination algebra which affects their sampling processing. RGB is mostly linear; depth is a geometric signal following the operations of mathematical morphology. If a network obtaining RGB-D input has both kinds of operators available in its layers, it should be able to give effective output with fewer parameters. In this paper, morphological elements in conjunction with more familiar linear modules are used to construct a mixed linear-morphological network called HaarNet. This is the first large-scale linear-morphological hybrid, evaluated on a set of sizeable real-world datasets. In the network, morphological Haar sampling is applied to both feature channels in several layers, which splits extreme values and high-frequency information such that both can be processed to improve both modalities. Moreover, morphologically parameterised ReLU is used, and morphologically-sound up-sampling is applied to obtain a full-resolution output. Experiments show that HaarNet is competitive with a state-of-the-art CNN, implying that morphological networks are a promising research direction for geometry-based learning tasks.
    摘要 文本中的不同modalities每个都有自己的组合代数,这些组合代数会影响它们的采样处理。RGB主要是线性的;深度是一种几何信号,按照数学形态学的操作进行处理。如果一个网络获得RGB-D输入,这个网络中有多种操作可以在其层次结构中使用,那么它应该能够在 fewer parameters 下提供有效的输出。在这篇论文中,作者使用了 conjunction 的 linear-morphological 网络,称之为 HaarNet。这是第一个大规模的线性-几何混合网络,在一些实际世界的数据集上进行了评估。在网络中,morphological Haar sampling 是在多个层次应用于特征通道上,将极端值和高频信息分割,以便进一步处理这两个模式。此外,使用了 morphologically parameterized ReLU 和 morphologically-sound up-sampling,以获得全分辨率输出。实验表明,HaarNet 与state-of-the-art CNN 相当竞争,implying that morphological networks 是一个有前途的研究方向 для基于几何学的学习任务。

GRaMuFeN: Graph-based Multi-modal Fake News Detection in Social Media

  • paper_url: http://arxiv.org/abs/2310.07668
  • repo_url: None
  • paper_authors: Makan Kananian, Fatima Badiei, S. AmirAli Gh. Ghahramani
  • for: 检测假信息在社交媒体平台上的扩散,提高公众意见形成的真实性。
  • methods: 提议使用文本encoder和图像encoder组合,文本encoder使用图Converter网络(GCN)进行文本分析,图像encoder使用预训练的ResNet-152 convolutional neural network(CNN)进行图像分析,并实现对比相似度损失函数,以提高检测假信息的精度。
  • results: 对两个公共可用的社交媒体新闻数据集进行了广泛的评估,相比现有的状态之artifact,提高了10%的微 F1-Score,表明GCN和CNN模型的组合可以有效地检测社交媒体上的假信息。
    Abstract The proliferation of social media platforms such as Twitter, Instagram, and Weibo has significantly enhanced the dissemination of false information. This phenomenon grants both individuals and governmental entities the ability to shape public opinions, highlighting the need for deploying effective detection methods. In this paper, we propose GraMuFeN, a model designed to detect fake content by analyzing both the textual and image content of news. GraMuFeN comprises two primary components: a text encoder and an image encoder. For textual analysis, GraMuFeN treats each text as a graph and employs a Graph Convolutional Neural Network (GCN) as the text encoder. Additionally, the pre-trained ResNet-152, as a Convolutional Neural Network (CNN), has been utilized as the image encoder. By integrating the outputs from these two encoders and implementing a contrastive similarity loss function, GraMuFeN achieves remarkable results. Extensive evaluations conducted on two publicly available benchmark datasets for social media news indicate a 10 % increase in micro F1-Score, signifying improvement over existing state-of-the-art models. These findings underscore the effectiveness of combining GCN and CNN models for detecting fake news in multi-modal data, all while minimizing the additional computational burden imposed by model parameters.
    摘要 “社交媒体平台的普及,如Twitter、Instagram和微博,已经提高了假信息的传播。这种现象让个人和政府机构都可以影响公众意见,高亮了需要部署有效的检测方法。本文提出了GraMuFeN模型,用于检测假新闻。GraMuFeN包括两个主要组成部分:文本编码器和图像编码器。对文本分析,GraMuFeN将每个文本视为一个图,并使用图 convolutional neural network (GCN) 作为文本编码器。此外,预训练的 ResNet-152 也被用作图像编码器。通过将这两个编码器的输出集成并实现对比相似性损失函数,GraMuFeN实现了显著的效果。对社交媒体新闻两个公共可用的 benchmark 数据集进行了广泛的评估,GraMuFeN 在 micro F1-Score 方面提高了10%,表明与现有状态码模型相比有显著的提高。这些发现表明了将 GCN 和 CNN 模型结合使用可以在多模式数据中检测假新闻,同时减少模型参数所增加的计算负担。”

Global Minima, Recoverability Thresholds, and Higher-Order Structure in GNNS

  • paper_url: http://arxiv.org/abs/2310.07667
  • repo_url: None
  • paper_authors: Drake Brown, Trevor Garrity, Kaden Parker, Jason Oliphant, Stone Carson, Cole Hanson, Zachary Boyd
  • for: 这个论文探讨了图 neural network(GNN)架构的性能从Random Graph Theory的角度。
  • methods: 作者使用了理论和数值方法来分析GNN的性能,包括对一层和二层GCNs的nodewise准确率的理论分析,以及对四种不同的GNN架构(GCN、GAT、SAGE和Graph Transformer)在不同假设下的数值分析。
  • results: 作者发现了一些关键的结论,包括:重 tailed degree distribution可以提高GNN性能,GNN可以在强烈不同结构上工作,SAGE和Graph Transformer可以在无isy edge数据上工作,但是没有架构能够处理足够噪音特征数据。此外,作者发现了一些特定的高阶结构在 sintetic data中和实际数据中的杂合效果通常是负面的。
    Abstract We analyze the performance of graph neural network (GNN) architectures from the perspective of random graph theory. Our approach promises to complement existing lenses on GNN analysis, such as combinatorial expressive power and worst-case adversarial analysis, by connecting the performance of GNNs to typical-case properties of the training data. First, we theoretically characterize the nodewise accuracy of one- and two-layer GCNs relative to the contextual stochastic block model (cSBM) and related models. We additionally prove that GCNs cannot beat linear models under certain circumstances. Second, we numerically map the recoverability thresholds, in terms of accuracy, of four diverse GNN architectures (GCN, GAT, SAGE, and Graph Transformer) under a variety of assumptions about the data. Sample results of this second analysis include: heavy-tailed degree distributions enhance GNN performance, GNNs can work well on strongly heterophilous graphs, and SAGE and Graph Transformer can perform well on arbitrarily noisy edge data, but no architecture handled sufficiently noisy feature data well. Finally, we show how both specific higher-order structures in synthetic data and the mix of empirical structures in real data have dramatic effects (usually negative) on GNN performance.
    摘要 我们从Random graph theory的角度分析图 neural network(GNN)的性能。我们的方法可以补充现有的GNN分析方法,如 combinatorial expressive power和最坏情况攻击分析,以连接GNN的性能和训练数据的典型特性。首先,我们理论上Characterize the node accuracy of one- and two-layer GCNs relative to the contextual stochastic block model (cSBM) and related models。我们还证明GCNs不能在某些情况下超过线性模型。其次,我们 numerically map the recoverability thresholds, in terms of accuracy, of four diverse GNN architectures (GCN, GAT, SAGE, and Graph Transformer) under a variety of assumptions about the data。 Sample results of this second analysis include: heavy-tailed degree distributions enhance GNN performance, GNNs can work well on strongly heterophilous graphs, and SAGE and Graph Transformer can perform well on arbitrarily noisy edge data, but no architecture handled sufficiently noisy feature data well。最后,我们示出了 especific higher-order structures in synthetic data and the mix of empirical structures in real data have dramatic effects (usually negative) on GNN performance。

Deep Backtracking Counterfactuals for Causally Compliant Explanations

  • paper_url: http://arxiv.org/abs/2310.07665
  • repo_url: None
  • paper_authors: Klaus-Rudolf Kladny, Julius von Kügelgen, Bernhard Schölkopf, Michael Muehlebach
  • for: 本文研究了Counterfactuals的一种新方法,即backtracking方法,可以在结构 causal models 中计算出 conditional counterfactuals。
  • methods: 本文提出了一种实用的方法,通过在结构 latent space 中做 tractable constrained optimization 问题,来生成 backtracking counterfactuals。
  • results: 实验表明, compared to existing methods of counterfactual explanations, 本文的方法更加 versatile, modular, and causally compliant。
    Abstract Counterfactuals can offer valuable insights by answering what would have been observed under altered circumstances, conditional on a factual observation. Whereas the classical interventional interpretation of counterfactuals has been studied extensively, backtracking constitutes a less studied alternative the backtracking principle has emerged as an alternative philosophy where all causal laws are kept intact. In the present work, we introduce a practical method for computing backtracking counterfactuals in structural causal models that consist of deep generative components. To this end, we impose conditions on the structural assignments that enable the generation of counterfactuals by solving a tractable constrained optimization problem in the structured latent space of a causal model. Our formulation also facilitates a comparison with methods in the field of counterfactual explanations. Compared to these, our method represents a versatile, modular and causally compliant alternative. We demonstrate these properties experimentally on a modified version of MNIST and CelebA.
    摘要 <>输入文本翻译成简化中文。<> counterfactuals 可以提供有价值的洞察,回答在修改条件下所观察到的结果。 classical interventional interpretation of counterfactuals 已经得到了广泛的研究,而 backtracking 则是一种 less studied alternative 。 backtracking principle 是一种具有保持所有 causal laws 的哲学原则。 在当前的工作中,我们介绍了一种实用的计算 backtracking counterfactuals 的方法,该方法在 structural causal models 中包含深度生成组件。 为此,我们在 causal model 中做出了特定的结构分配,以便通过解决一个可解决的封闭优化问题在 structured latent space 中生成 counterfactuals。 我们的формаulation 还可以与 counterfactual explanations 方法进行比较,相比之下,我们的方法表现出了 versatile、modular 和 causally compliant 的性能。 我们在一个修改后的 MNIST 和 CelebA 上进行了实验来证明这一点。

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07653
  • repo_url: https://github.com/Zeqiang-Lai/MiniDALLE-3
  • paper_authors: Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang
  • for: 这个研究旨在探讨如何使用自然语言描述来与高品质的文字至图模型(T2I)进行有效的沟通,以及如何将这种技术应用于实际的人机交互中。
  • methods: 本研究使用了调整提示的技术和现有的T2I模型来解决问题,并评估了这种方法在不同的语言模型(LLM)和T2I模型下的效果。
  • results: 研究发现,这种方法可以让LLMs拥有更好的图像质量和更强的文字与图像相互对应,并且可以让任何现有的LLMs和T2I模型都具备这种能力,而且不需要进行任何训练。
    Abstract The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models. Within just two years of development, it was unprecedentedly of high-quality, diversity, and creativity that the state-of-the-art models could generate. However, a prevalent limitation persists in the effective communication with these popular T2I models, such as Stable Diffusion, using natural language descriptions. This typically makes an engaging image hard to obtain without expertise in prompt engineering with complex word compositions, magic tags, and annotations. Inspired by the recently released DALLE3 - a T2I model directly built-in ChatGPT that talks human language, we revisit the existing T2I systems endeavoring to align human intent and introduce a new task - interactive text to image (iT2I), where people can interact with LLM for interleaved high-quality image generation/edit/refinement and question answering with stronger images and text correspondences using natural language. In addressing the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of common-used scenarios under different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach could be a convenient and low-cost way to introduce the iT2I ability for any existing LLMs and any text-to-image models without any training while bringing little degradation on LLMs' inherent capabilities in, e.g., question answering and code generation. We hope this work could draw broader attention and provide inspiration for boosting user experience in human-machine interactions alongside the image quality of the next-generation T2I systems.
    摘要 人工智能内容生成革命已经快速加速,特别是在文本到图像(T2I)扩散模型方面。只用两年的时间,这些模型的质量、多样性和创造力已经到了历史的新高度。然而,在使用自然语言描述与这些流行的T2I模型进行有效交流仍然存在一定的限制,这通常需要专业的提示工程师、魔法标签和注释。 drawing inspiration from the recently released DALLE3 - a T2I model directly built-in ChatGPT that talks human language, we revisit the existing T2I systems and introduce a new task - interactive text to image (iT2I), where people can interact with LLM for interleaved high-quality image generation/edit/refinement and question answering with stronger images and text correspondences using natural language.在解决iT2I问题时,我们提出了一种简单的方法,即在LLMs中进行iT2I的扩展,使用提示技术和现有的T2I模型。我们在不同的LLMs(如ChatGPT、LLAMA、Baichuan和InternLM)下进行了多种常见的场景的评估。我们的方法可以让任何现有的LLMs和任何文本到图像模型都具备iT2I能力,而无需训练,同时带来对LLMs的内置能力的影响很小。我们希望这种工作能吸引更广泛的关注,并为下一代T2I系统的图像质量和人机交互的进步提供灵感。

Rethinking the BERT-like Pretraining for DNA Sequences

  • paper_url: http://arxiv.org/abs/2310.07644
  • repo_url: None
  • paper_authors: Chaoqi Liang, Weiqiang Bai, Lifeng Qiao, Yuchen Ren, Jianle Sun, Peng Ye, Hongliang Yan, Xinzhu Ma, Wangmeng Zuo, Wanli Ouyang
  • for: 这份研究旨在探讨如何将大规模预训应用到生物科学领域,特别是基于DNA序列的预训方法。
  • methods: 研究人员首先执行了一系列的探索性实验,获得了许多有益的观察,包括:在下游任务 fine-tuning 阶段,使用 K-mer 重叠tokenization 而不是 K-mer 非重叠tokenization,两者都可以在下游任务中获得显著的性能改进。
  • results: 研究人员发现,使用 K-mer 重叠tokenization 在预训过程中可以迅速生成明确的 K-mer 嵌入,并降低损失到非常低水平,但使用 K-mer 非重叠tokenization 则会导致嵌入变得更模糊,并持续降低损失。此外,使用重叠tokenization 会导致预训模型的自我注意力在中间层中过度集中在某些字串上,显示这些层未能得到适当的优化。总之,重叠tokenization 可以帮助下游任务的 fine-tuning,但它会导致预训过程中的快速收敛。为了解开预训的潜力,研究人员提出了一种新的方法called RandomMask,它通过不断扩展隐藏界限来增加BERT-like预训的任务难度,并成功取得了26个数据集中的28个数据集上的7个下游任务的Top-tier表现。
    Abstract With the success of large-scale pretraining in NLP, there is an increasing trend of applying it to the domain of life sciences. In particular, pretraining methods based on DNA sequences have garnered growing attention due to their potential to capture generic information about genes. However, existing pretraining methods for DNA sequences largely rely on direct adoptions of BERT pretraining from NLP, lacking a comprehensive understanding and a specifically tailored approach. To address this research gap, we first conducted a series of exploratory experiments and gained several insightful observations: 1) In the fine-tuning phase of downstream tasks, when using K-mer overlapping tokenization instead of K-mer non-overlapping tokenization, both overlapping and non-overlapping pretraining weights show consistent performance improvement.2) During the pre-training process, using K-mer overlapping tokenization quickly produces clear K-mer embeddings and reduces the loss to a very low level, while using K-mer non-overlapping tokenization results in less distinct embeddings and continuously decreases the loss. 3) Using overlapping tokenization causes the self-attention in the intermediate layers of pre-trained models to tend to overly focus on certain tokens, reflecting that these layers are not adequately optimized. In summary, overlapping tokenization can benefit the fine-tuning of downstream tasks but leads to inadequate pretraining with fast convergence. To unleash the pretraining potential, we introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pretraining by continuously expanding its mask boundary, forcing the model to learn more knowledge. RandomMask is simple but effective, achieving top-tier performance across 26 datasets of 28 datasets spanning 7 downstream tasks.
    摘要 随着人工智能的应用在生命科学领域的扩大,对于基因序列的预训练方法也在吸引越来越多的关注。特别是基因序列预训练方法,因为它们可能会捕捉到生物体中 generic 信息。然而,现有的基因序列预训练方法大多基于 NLP 中的 BERT 预训练方法,lacking a comprehensive understanding and a specifically tailored approach。为了填补这个研究漏洞,我们首先进行了一系列的探索性实验,获得了一些有价值的观察:1)在下游任务 fine-tuning 阶段,使用 K-mer overlap 的tokenization而不是 K-mer non-overlapping 的tokenization, both overlapping 和 non-overlapping 预训练权重都显示了一致的性能提升。2)在预训练过程中,使用 K-mer overlap 的tokenization快速生成了明确的 K-mer 嵌入,并将损失降到了非常低的水平,而使用 K-mer non-overlapping 的tokenization则导致了较为模糊的嵌入,并持续降低损失。3)使用 overlap 的tokenization会让预训练模型中的自注意力倾向于过度关注某些符号,表明这些层并未充分优化。总之,overlapping 的tokenization可以优化下游任务的 fine-tuning,但是会导致预训练快速收敛。为了解锁预训练的潜力,我们提出了一种新的方法RandomMask,它通过不断扩展BERT-like 预训练模型的mask边界来增加任务难度,让模型学习更多的知识。RandomMask 简单 yet effective,在 28 个数据集上的 26 个任务上实现了顶尖表现。

OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07637
  • repo_url: None
  • paper_authors: Yuhe Liu, Changhua Pei, Longlong Xu, Bohan Chen, Mingze Sun, Zhirui Zhang, Yongqian Sun, Shenglin Zhang, Kun Wang, Haiming Zhang, Jianhui Li, Gaogang Xie, Xidao Wen, Xiaohui Nie, Dan Pei
    for:The paper is written to evaluate the performance of large language models (LLMs) in Artificial Intelligence for IT Operations (AIOps) tasks.methods:The paper presents a comprehensive task-oriented AIOps benchmark called OpsEval, which includes 7,200 questions in both multiple-choice and question-answer formats to assess LLMs’ proficiency in three crucial scenarios (Wired Network Operation, 5G Communication Operation, and Database Operation) at various ability levels.results:The paper shows that GPT4-score is more consistent with experts than widely used Bleu and Rouge, and that various LLM tricks can affect the performance of AIOps, including zero-shot, chain-of-thought, and few-shot in-context learning. The paper also provides quantitative and qualitative results that demonstrate the effectiveness of OpsEval in evaluating LLMs’ performance in AIOps tasks.
    Abstract Large language models (LLMs) have exhibited remarkable capabilities in NLP-related tasks such as translation, summarizing, and generation. The application of LLMs in specific areas, notably AIOps (Artificial Intelligence for IT Operations), holds great potential due to their advanced abilities in information summarizing, report analyzing, and ability of API calling. Nevertheless, the performance of current LLMs in AIOps tasks is yet to be determined. Furthermore, a comprehensive benchmark is required to steer the optimization of LLMs tailored for AIOps. Compared with existing benchmarks that focus on evaluating specific fields like network configuration, in this paper, we present \textbf{OpsEval}, a comprehensive task-oriented AIOps benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in three crucial scenarios (Wired Network Operation, 5G Communication Operation, and Database Operation) at various ability levels (knowledge recall, analytical thinking, and practical application). The benchmark includes 7,200 questions in both multiple-choice and question-answer (QA) formats, available in English and Chinese. With quantitative and qualitative results, we show how various LLM tricks can affect the performance of AIOps, including zero-shot, chain-of-thought, and few-shot in-context learning. We find that GPT4-score is more consistent with experts than widely used Bleu and Rouge, which can be used to replace automatic metrics for large-scale qualitative evaluations.
    摘要 大型语言模型(LLM)在自然语言处理(NLP)相关任务中表现出了非常出色的能力,包括翻译、摘要和生成等。在特定领域中应用LLM的潜在性非常大,尤其是在人工智能操作(AIOps)中,因为它们在资讯摘要、报告分析和API调用等方面有出色的能力。然而,目前LLM在AIOps任务中的表现仍未被评估。此外,为了适当地优化LLM,需要一个全面的标准参考。相比于现有的标准,这篇文章提出了一个名为“OpsEval”的全面的AIOps标准参考,用于评估LLM的能力。OpsEval包括三个重要的操作场景(有线网络操作、5G通信操作和数据库操作),并且在不同的能力水平(知识回传、分析思维和实践应用)进行评估。标准包括7,200个问题,分为多选和问题回答(QA)格式,英文和中文两种语言。我们通过量化和质量的结果显示出不同的LLM技巧可以如何影响AIOps的表现,包括零式、串行和少数内容学习。我们发现GPT4-score与专家的表现更一致,而Bleu和Rouge的自动评分可以用来取代大规模的质量评分。

Dual Quaternion Rotational and Translational Equivariance in 3D Rigid Motion Modelling

  • paper_url: http://arxiv.org/abs/2310.07623
  • repo_url: None
  • paper_authors: Guilherme Vieira, Eleonora Grassucci, Marcos Eduardo Valle, Danilo Comminiello
  • for: 本 paper 的目的是提出一种基于 dual quaternion 表示的3D空间中对象的刚性运动模型,以便更好地处理3D学习任务。
  • methods: 本 paper 使用 dual quaternion 表示法来模型3D空间中对象的刚性运动,并且通过对每个点进行同时旋转和平移的表示,保留了点集的相关性。
  • results: 实验证明,使用本 paper 提出的 dual quaternion 表示法可以在人姿预测任务中超越前一些方法,表明该方法在3D学习任务中的效果。
    Abstract Objects' rigid motions in 3D space are described by rotations and translations of a highly-correlated set of points, each with associated $x,y,z$ coordinates that real-valued networks consider as separate entities, losing information. Previous works exploit quaternion algebra and their ability to model rotations in 3D space. However, these algebras do not properly encode translations, leading to sub-optimal performance in 3D learning tasks. To overcome these limitations, we employ a dual quaternion representation of rigid motions in the 3D space that jointly describes rotations and translations of point sets, processing each of the points as a single entity. Our approach is translation and rotation equivariant, so it does not suffer from shifts in the data and better learns object trajectories, as we validate in the experimental evaluations. Models endowed with this formulation outperform previous approaches in a human pose forecasting application, attesting to the effectiveness of the proposed dual quaternion formulation for rigid motions in 3D space.
    摘要 三维空间中物体的刚性运动被描述为旋转和平移的高相关点集,每个点有相应的 $x,y,z$ 坐标,但是实值网络视为每个点为独立实体,导致信息损失。先前的工作利用四元数代数来模型旋转运动,但这些代数不能正确地表示平移,从而导致三维学习任务中的下降性能。为了解决这些限制,我们使用三元数表示方式来描述点集的刚性运动,对每个点进行单一处理。我们的方法具有平移和旋转对称性,因此不会受到数据的偏移,更好地学习物体的运动轨迹,如我们在实验评估中所证明。使用这种形式的模型,与先前的方法相比,在人姿预测应用中表现出色,证明了我们的双元数表示方法在三维空间中的刚性运动的有效性。

Reinforcement Learning-based Knowledge Graph Reasoning for Explainable Fact-checking

  • paper_url: http://arxiv.org/abs/2310.07613
  • repo_url: None
  • paper_authors: Gustav Nikopensius, Mohit Mayank, Orchid Chetia Phukan, Rajesh Sharma
    for:The paper aims to improve the trustworthiness of automated fact-checking systems by incorporating reinforcement learning (RL) and knowledge graph (KG) reasoning for explainable fact-checking.methods:The proposed approach uses RL to train an agent to compute paths that prove or disprove factual claims, and a voting mechanism to reach a verdict based on the paths produced by the agent. The KG is used to represent knowledge for explanations.results:Extensive experiments on two datasets (FB15K-277 and NELL-995) show that the proposed approach is effective in producing human-readable explanations in the form of paths and classifications for fact claims, and can increase trustworthiness by providing a human-in-the-loop approach.
    Abstract Fact-checking is a crucial task as it ensures the prevention of misinformation. However, manual fact-checking cannot keep up with the rate at which false information is generated and disseminated online. Automated fact-checking by machines is significantly quicker than by humans. But for better trust and transparency of these automated systems, explainability in the fact-checking process is necessary. Fact-checking often entails contrasting a factual assertion with a body of knowledge for such explanations. An effective way of representing knowledge is the Knowledge Graph (KG). There have been sufficient works proposed related to fact-checking with the usage of KG but not much focus is given to the application of reinforcement learning (RL) in such cases. To mitigate this gap, we propose an RL-based KG reasoning approach for explainable fact-checking. Extensive experiments on FB15K-277 and NELL-995 datasets reveal that reasoning over a KG is an effective way of producing human-readable explanations in the form of paths and classifications for fact claims. The RL reasoning agent computes a path that either proves or disproves a factual claim, but does not provide a verdict itself. A verdict is reached by a voting mechanism that utilizes paths produced by the agent. These paths can be presented to human readers so that they themselves can decide whether or not the provided evidence is convincing or not. This work will encourage works in this direction for incorporating RL for explainable fact-checking as it increases trustworthiness by providing a human-in-the-loop approach.
    摘要 fact-checking是一项非常重要的任务,因为它可以防止谣言的扩散。然而,手动fact-checking无法与在线false信息的速度保持 pace。自动化fact-checking机器比人类更快。但为了提高自动化系统的信任和透明度,需要解释性在fact-checking过程中。fact-checking通常包括对一个真实声明与一个知识库进行对比,以便提供这些解释。知识图(KG)是一种有效的知识表示方式。虽然有很多关于fact-checking的提议,但没有很多关于RL的应用。为了填补这一差,我们提出了一种基于RL的KG逻辑应用,用于可读性的fact-checking。我们对FB15K-277和NELL-995 datasets进行了广泛的实验,结果表明,使用KG进行逻辑 reasoning可以生成人类可读的解释,包括路径和分类。RL逻辑代理 computes一个路径,以证明或驳斥一个真实声明,但不提供自己的判断。一个决策是通过RL逻辑代理生成的路径进行投票,以确定声明的真实性。这些路径可以向人类读者展示,让他们自己决定提供的证据是否有力。这种工作将鼓励更多关于RL的fact-checking应用,因为它提高了系统的可信度,并提供了人类在循环的方式。

PHYDI: Initializing Parameterized Hypercomplex Neural Networks as Identity Functions

  • paper_url: http://arxiv.org/abs/2310.07612
  • repo_url: https://github.com/ispamm/phydi
  • paper_authors: Matteo Mancanelli, Eleonora Grassucci, Aurelio Uncini, Danilo Comminiello
  • for: 这篇论文主要用于研究 parameterized hypercomplex neural networks(PHNNs)的收敛性和提高其性能。
  • methods: 本文提出了 parameterized hypercomplex identity initialization(PHYDI)方法,用于控制PHNNs的收敛性,并在不同的缩放量下实现更好的性能。
  • results: 研究发现,PHYDI方法可以在不同的benchmark中提高PHNNs的性能,并且可以在减少迭代次数的情况下达到相同的性能水平。
    Abstract Neural models based on hypercomplex algebra systems are growing and prolificating for a plethora of applications, ranging from computer vision to natural language processing. Hand in hand with their adoption, parameterized hypercomplex neural networks (PHNNs) are growing in size and no techniques have been adopted so far to control their convergence at a large scale. In this paper, we study PHNNs convergence and propose parameterized hypercomplex identity initialization (PHYDI), a method to improve their convergence at different scales, leading to more robust performance when the number of layers scales up, while also reaching the same performance with fewer iterations. We show the effectiveness of this approach in different benchmarks and with common PHNNs with ResNets- and Transformer-based architecture. The code is available at https://github.com/ispamm/PHYDI.
    摘要 神经网络基于高维复杂代数系统在各种应用中增长和普遍,从计算机视觉到自然语言处理。随着其采用,具有参数化的高维复杂神经网络(PHNNs)的大小不断增长,而没有任何控制其归一化的技术。本文研究PHNNs的归一化并提出具有参数化高维复杂标识初始化(PHYDI)方法,以提高它们在不同级别上的归一化性能,以达到更加稳定的性能,同时也可以在更少的迭代次数下达到相同的性能。我们在不同的标准吨数据集上测试了这种方法,并与常见的PHNNs结构(ResNets和Transformers)结合使用。代码可以在https://github.com/ispamm/PHYDI中找到。

Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in Self-Refined Open-Source Models

  • paper_url: http://arxiv.org/abs/2310.07611
  • repo_url: None
  • paper_authors: Sumuk Shashidhar, Abhinav Chinta, Vaibhav Sahai, Zhenhailong Wang, Heng Ji
  • For: The paper aims to address the issue of restricted access and information privacy concerns due to the dominance of proprietary large language models (LLMs). It seeks to provide high-performing open-source alternatives that can compete with proprietary models in performance and cost.* Methods: The paper proposes a novel ranking metric called Performance, Refinement, and Inference Cost Score (PeRFICS) to evaluate and select the optimal open-source model for a given task. The authors also propose a domain-agnostic self-refinement process to improve the performance of open-source models.* Results: The authors’ experiments show that open-source models of varying sizes, on average, improve 8.2% from their baseline performance. The smallest model, Vicuna-7B, achieves a 11.74% improvement overall and up to a 25.39% improvement in high-creativity tasks. The best-performing model, Vicuna-13B, outperforms ChatGPT post-refinement, demonstrating the effectiveness of the proposed approach.
    Abstract The dominance of proprietary LLMs has led to restricted access and raised information privacy concerns. High-performing open-source alternatives are crucial for information-sensitive and high-volume applications but often lag behind in performance. To address this gap, we propose (1) A untargeted variant of iterative self-critique and self-refinement devoid of external influence. (2) A novel ranking metric - Performance, Refinement, and Inference Cost Score (PeRFICS) - to find the optimal model for a given task considering refined performance and cost. Our experiments show that SoTA open source models of varying sizes from 7B - 65B, on average, improve 8.2% from their baseline performance. Strikingly, even models with extremely small memory footprints, such as Vicuna-7B, show a 11.74% improvement overall and up to a 25.39% improvement in high-creativity, open ended tasks on the Vicuna benchmark. Vicuna-13B takes it a step further and outperforms ChatGPT post-refinement. This work has profound implications for resource-constrained and information-sensitive environments seeking to leverage LLMs without incurring prohibitive costs, compromising on performance and privacy. The domain-agnostic self-refinement process coupled with our novel ranking metric facilitates informed decision-making in model selection, thereby reducing costs and democratizing access to high-performing language models, as evidenced by case studies.
    摘要 due to the dominance of proprietary LLMs, there are concerns about restricted access and information privacy. High-performing open-source alternatives are crucial for information-sensitive and high-volume applications, but they often lag behind in performance. To address this gap, we propose:1. A untargeted variant of iterative self-critique and self-refinement that is not influenced by external factors.2. A new ranking metric called Performance, Refinement, and Inference Cost Score (PeRFICS) to find the best model for a given task based on refined performance and cost.Our experiments show that the SoTA open-source models of varying sizes from 7B to 65B, on average, improve 8.2% from their baseline performance. Surprisingly, even models with extremely small memory footprints, such as Vicuna-7B, show a 11.74% improvement overall and up to a 25.39% improvement in high-creativity, open-ended tasks on the Vicuna benchmark. Vicuna-13B even outperforms ChatGPT post-refinement.This work has significant implications for resource-constrained and information-sensitive environments seeking to leverage LLMs without incurring prohibitive costs, compromising on performance, and privacy. The domain-agnostic self-refinement process coupled with our novel ranking metric facilitates informed decision-making in model selection, thereby reducing costs and democratizing access to high-performing language models, as shown by case studies.

Survey on Imbalanced Data, Representation Learning and SEP Forecasting

  • paper_url: http://arxiv.org/abs/2310.07598
  • repo_url: None
  • paper_authors: Josias Moukpe
  • for: 这篇论文旨在探讨深度学习方法如何在实际应用中进行调整,以减少因数据不均衡而导致的影响。
  • methods: 这篇论文使用了表示学习方法,将注意力集中在具有丰富特征的数据空间中,以更好地捕捉资料特征和泛化到少数类别。
  • results: 这篇论文发现,这些新的深度学习方法可以更好地处理实际世界中的数据不均衡问题,并且在SEP预测任务中获得了更好的结果。
    Abstract Deep Learning methods have significantly advanced various data-driven tasks such as regression, classification, and forecasting. However, much of this progress has been predicated on the strong but often unrealistic assumption that training datasets are balanced with respect to the targets they contain. This misalignment with real-world conditions, where data is frequently imbalanced, hampers the effectiveness of such models in practical applications. Methods that reconsider that assumption and tackle real-world imbalances have begun to emerge and explore avenues to address this challenge. One such promising avenue is representation learning, which enables models to capture complex data characteristics and generalize better to minority classes. By focusing on a richer representation of the feature space, these techniques hold the potential to mitigate the impact of data imbalance. In this survey, we present deep learning works that step away from the balanced-data assumption, employing strategies like representation learning to better approximate real-world imbalances. We also highlight a critical application in SEP forecasting where addressing data imbalance is paramount for success.
    摘要 深度学习方法在许多数据驱动任务中取得了重大进步,包括回归、分类和预测等。然而,大多数这些进步假设了训练集与目标之间的均衡,这种假设在实际应用中并不是真实的。因此,许多模型在实际应用中效果不佳。为了解决这个问题,有些新的方法开始出现,它们尝试重新考虑训练集与实际应用中的偏衡。一种这样的有前途的方向是表示学习,它允许模型捕捉复杂的数据特征,并更好地泛化到小类。通过强调richer的表示空间,这些技术可能会减轻数据偏衡的影响。在这篇评论中,我们介绍了脱离平衡数据的深度学习工作,使用表示学习等策略来更好地 aproximate实际应用中的偏衡。我们还高亮了应用在SEP预测中,Addressing data imbalance是成功的关键应用。

Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models

  • paper_url: http://arxiv.org/abs/2310.07589
  • repo_url: None
  • paper_authors: Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker
  • for: 这篇论文的目的是提出一种适应语言演化的恶意抑制方法,以提高模型在实际应用中的性能和可靠性。
  • methods: 该方法基于Retrieval-based Approach,通过在decode时使用检索来实现恶意控制文本生成。
  • results: 论文通过对多种语言模型进行实验,证明了Goodtriever方法可以减少43%的延迟时间和提高计算效率,同时保持与当前状态艺术的水平。
    Abstract Considerable effort has been dedicated to mitigating toxicity, but existing methods often require drastic modifications to model parameters or the use of computationally intensive auxiliary models. Furthermore, previous approaches have often neglected the crucial factor of language's evolving nature over time. In this work, we present a comprehensive perspective on toxicity mitigation that takes into account its changing nature. We introduce Goodtriever, a flexible methodology that matches the current state-of-the-art toxicity mitigation while achieving 43% relative latency reduction during inference and being more computationally efficient. By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation. Our research advocates for an increased focus on adaptable mitigation techniques, which better reflect the data drift models face when deployed in the wild. Code and data are available at https://github.com/for-ai/goodtriever.
    摘要 很大的努力已经投入到抑制毒性方面,但现有的方法frequently需要对模型参数进行极大的修改或使用 computationally intensive的auxiliary models。此外,前一代的方法经常忽视了语言的不断发展的特点。在这项工作中,我们提出了一个完整的抑制毒性视角,考虑到其变化的特点。我们介绍了Goodtriever,一种灵活的方法,与当前状态的抑制毒性方法相当,实现了43%的相对延迟减少 durante la inferencia y es más eficiente en términos de computación。通过在 decode 时 incorporating a retrieval-based approach,Goodtriever 实现了基于文本生成的毒性控制。我们强调了适应性的技术的重要性,以更好地反映数据模型在野外部署时所面临的数据漂移模型。代码和数据可以在 获取。

Accurate Use of Label Dependency in Multi-Label Text Classification Through the Lens of Causality

  • paper_url: http://arxiv.org/abs/2310.07588
  • repo_url: None
  • paper_authors: Caoyun Fan, Wenqing Chen, Jidong Tian, Yitian Li, Hao He, Yaohui Jin
  • for: 本研究旨在提高多标签文本分类(MLTC)模型的性能,通过引入标签相关性来增强模型的预测能力。
  • methods: 本研究提出了一种Counterfactual Text Classifier(CFTC),它首先使用预测后修改的基础体系来提取标签相关性中嵌入的精准标签信息,然后通过对标签相关性的反向矩阵来阻断相关性快捷的偏好。
  • results: 实验结果表明,CFTC在三个数据集上显著超过基eline,并有效地消除了标签相关性偏好。
    Abstract Multi-Label Text Classification (MLTC) aims to assign the most relevant labels to each given text. Existing methods demonstrate that label dependency can help to improve the model's performance. However, the introduction of label dependency may cause the model to suffer from unwanted prediction bias. In this study, we attribute the bias to the model's misuse of label dependency, i.e., the model tends to utilize the correlation shortcut in label dependency rather than fusing text information and label dependency for prediction. Motivated by causal inference, we propose a CounterFactual Text Classifier (CFTC) to eliminate the correlation bias, and make causality-based predictions. Specifically, our CFTC first adopts the predict-then-modify backbone to extract precise label information embedded in label dependency, then blocks the correlation shortcut through the counterfactual de-bias technique with the help of the human causal graph. Experimental results on three datasets demonstrate that our CFTC significantly outperforms the baselines and effectively eliminates the correlation bias in datasets.
    摘要 多 Label 文本分类 (MLTC) 目标是为每个给定的文本分配最 relevante 标签。现有方法表明,标签依赖可以帮助提高模型的性能。然而,标签依赖的引入可能会导致模型受到不良预测偏见。在本研究中,我们归因偏见于模型对标签依赖的误用,即模型倾向于利用标签依赖的相关短cut 而不是将文本信息和标签依赖 fusion 用于预测。 motivated by causal inference,我们提出了Counterfactual Text Classifier (CFTC),以消除相关偏见,并基于 causality 进行预测。specifically,我们的 CFTC 首先采用 predict-then-modify 基干来提取标签依赖中嵌入的精确标签信息,然后通过 counterfactual de-bias 技术和人类 causal graph 屏蔽相关短cut,以消除偏见。实验结果在三个数据集上表明,我们的 CFTC significantly outperforms 基elines,并有效地消除数据集中的偏见。

Fed-GraB: Federated Long-tailed Learning with Self-Adjusting Gradient Balancer

  • paper_url: http://arxiv.org/abs/2310.07587
  • repo_url: https://github.com/zackzikaixiao/fedgrab
  • paper_authors: Zikai Xiao, Zihan Chen, Songshang Liu, Hualiang Wang, Yang Feng, Jin Hao, Joey Tianyi Zhou, Jian Wu, Howard Hao Yang, Zuozhu Liu
  • for: 这篇论文研究了一种 federated long-tailed learning(Fed-LT)任务,在每个客户端上存在本地不同的数据集,如果这些数据集可以全局聚合,它们就会共同出现长尾分布。在这种设置下,现有的联邦优化和/或中央长尾学习方法很难应用,因为在隐私约束下不能准确地特征长尾分布的全局性。
  • methods: 该论文提出了一种名为 $\texttt{Fed-GraB}$ 的方法,包括一个自适应权重调整器(SGB)模块,该模块在关闭loop的方式下,根据客户端的反馈,重新权重客户端的梯度。此外,该方法还包括一个直接先验分析器(DPA)模块,用于评估客户端数据集的全局长尾分布。
  • results: EXTENSIVE EXPERIMENTS DEMONSTRATE THAT $\texttt{Fed-GraB}$ ACHIEVES STATE-OF-THE-ART PERFORMANCE ON REPRESENTATIVE DATASETS SUCH AS CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, AND iNaturalist。
    Abstract Data privacy and long-tailed distribution are the norms rather than the exception in many real-world tasks. This paper investigates a federated long-tailed learning (Fed-LT) task in which each client holds a locally heterogeneous dataset; if the datasets can be globally aggregated, they jointly exhibit a long-tailed distribution. Under such a setting, existing federated optimization and/or centralized long-tailed learning methods hardly apply due to challenges in (a) characterizing the global long-tailed distribution under privacy constraints and (b) adjusting the local learning strategy to cope with the head-tail imbalance. In response, we propose a method termed $\texttt{Fed-GraB}$, comprised of a Self-adjusting Gradient Balancer (SGB) module that re-weights clients' gradients in a closed-loop manner, based on the feedback of global long-tailed distribution evaluated by a Direct Prior Analyzer (DPA) module. Using $\texttt{Fed-GraB}$, clients can effectively alleviate the distribution drift caused by data heterogeneity during the model training process and obtain a global model with better performance on the minority classes while maintaining the performance of the majority classes. Extensive experiments demonstrate that $\texttt{Fed-GraB}$ achieves state-of-the-art performance on representative datasets such as CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist.
    摘要 “数据隐私和长尾分布是现实任务中的常见现象,而不是特例。本文研究了一种联合长尾学习(Fed-LT)任务,每个客户端持有本地不同 datasets,如果这些数据集可以全球聚合,它们就会共同表现出长尾分布。在这种设置下,现有的联合优化和/或中央长尾学习方法几乎无法应用,因为(a)全局长尾分布的特征难以在隐私约束下表征,(b)本地学习策略难以适应头部和尾部差异。为应对这些挑战,我们提出了一种方法,称为 $\texttt{Fed-GraB}$,它包括一个自适应权重重新分配(SGB)模块,根据全球长尾分布的反馈,在关闭循环方式下重新分配客户端的梯度。使用 $\texttt{Fed-GraB}$,客户端可以在模型训练过程中有效地缓解由数据不同性引起的分布漂移,并在少数类上获得更好的性能,同时保持多数类的性能。广泛的实验表明, $\texttt{Fed-GraB}$ 达到了代表性数据集的领先性能,包括 CIFAR-10-LT、CIFAR-100-LT、ImageNet-LT 和 iNaturalist。”

Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT

  • paper_url: http://arxiv.org/abs/2310.07582
  • repo_url: https://github.com/deanhazineh/emergent-world-representations-othello
  • paper_authors: Dean S. Hazineh, Zechen Zhang, Jeffery Chiu
  • for: 这篇论文是为了研究基于Othello的Transformer模型是否真正理解世界,而不仅仅是随机模仿。
  • methods: 这篇论文使用了一个简单的Transformer模型,并对其进行了扩展,以提高对Othello-GPT模型的理解。
  • results: 研究发现,Othello-GPT模型具有一个线性的对抗方面表示,这个表示导致了它的决策过程。此外,研究还发现了这个线性世界表示和 causal 决策之间的互动,以及层数和模型复杂度对这种互动的影响。
    Abstract Foundation models exhibit significant capabilities in decision-making and logical deductions. Nonetheless, a continuing discourse persists regarding their genuine understanding of the world as opposed to mere stochastic mimicry. This paper meticulously examines a simple transformer trained for Othello, extending prior research to enhance comprehension of the emergent world model of Othello-GPT. The investigation reveals that Othello-GPT encapsulates a linear representation of opposing pieces, a factor that causally steers its decision-making process. This paper further elucidates the interplay between the linear world representation and causal decision-making, and their dependence on layer depth and model complexity. We have made the code public.
    摘要 基本模型在决策和逻辑推理方面表现出了显著的能力。然而,关于它们真正理解世界的问题仍然存在着不断的讨论,一些人认为它们只是Random mimicry。本文仔细检查了一个简单的 transformer 被训练 для奥菲洛,扩展了先前的研究,以更好地了解奥菲洛-GPT 的 emergent 世界模型。调查发现,奥菲洛-GPT 包含了对对抗的 Piece 的线性表示,这种因素 causally 导向它的决策过程。本文还详细解释了线性世界表示和 causal 决策之间的交互,以及它们与模型复杂度和层数之间的依赖关系。我们已经公开了代码。

In-Context Unlearning: Language Models as Few Shot Unlearners

  • paper_url: http://arxiv.org/abs/2310.07579
  • repo_url: https://github.com/MartinPawel/In-Context-Unlearning
  • paper_authors: Martin Pawelczyk, Seth Neel, Himabindu Lakkaraju
  • for: 研究高效地从特定训练点中除去对训练模型的影响,以遵守隐私法规 like Right to be Forgotten.
  • methods: 提出了一些用于 approximate 训练数据除去而不需要重新训练模型的算法,这些算法需要对模型参数进行访问。
  • results: 实验结果表明,在提供输入Context和颠倒标签 alongside correctly labelled instances 时,可以高效地除去特定训练点的影响,并保持与状态体系方法的竞争力。
    Abstract Machine unlearning, the study of efficiently removing the impact of specific training points on the trained model, has garnered increased attention of late, driven by the need to comply with privacy regulations like the Right to be Forgotten. Although unlearning is particularly relevant for LLMs in light of the copyright issues they raise, achieving precise unlearning is computationally infeasible for very large models. To this end, recent work has proposed several algorithms which approximate the removal of training data without retraining the model. These algorithms crucially rely on access to the model parameters in order to update them, an assumption that may not hold in practice due to computational constraints or when the LLM is accessed via API. In this work, we propose a new class of unlearning methods for LLMs we call ''In-Context Unlearning'', providing inputs in context and without having to update model parameters. To unlearn a particular training instance, we provide the instance alongside a flipped label and additional correctly labelled instances which are prepended as inputs to the LLM at inference time. Our experimental results demonstrate that these contexts effectively remove specific information from the training set while maintaining performance levels that are competitive with (or in some cases exceed) state-of-the-art unlearning methods that require access to the LLM parameters.
    摘要 机器学习无学习(Machine Unlearning),即在已经训练过的模型上efficiently removing特定训练点的影响,在最近几年来收到了更多的关注,即由隐私法规如“Right to be Forgotten”所驱动。特别是在LLM中,由于复杂的版权问题,无学习变得非常重要。然而,在非常大的模型上,准确的无学习是计算不可能的。为此,最近的研究提出了一些使用模型参数更新的算法,以便精确地除掉训练数据。这些算法具有访问模型参数的假设,但在实践中这种假设可能不成立,例如因为计算限制或者LLM通过API访问。在这项工作中,我们提出了一种新的LLM无学习方法,我们称之为“在Context中的无学习”(In-Context Unlearning)。在这种方法中,我们在推理时提供特定训练实例,并将其旋转后的标签和正确标签的其他实例预处理为LLM的输入。我们的实验结果表明,这些上下文可以有效地从训练集中除掉特定信息,而且与无需更新模型参数的状态之前的方法竞争。

ChatGPT for Computational Topology

  • paper_url: http://arxiv.org/abs/2310.07570
  • repo_url: https://github.com/joybearliu/chatgpt-for-computational-topology
  • paper_authors: Jian Liu, Li Shen, Guo-Wei Wei
  • for: bridging the gap between theoretical topological concepts and their practical implementation in computational topology
  • methods: utilizing ChatGPT to transform mathematical formulations and concepts into functional code for computational topology
  • results: demonstrating the effectiveness of ChatGPT in computing Betti numbers, Laplacian matrices, and Dirac matrices for simplicial complexes, as well as the persistence of various homologies and Laplacians, and exploring its application in computing recently developed topological theories for hypergraphs and digraphs.
    Abstract ChatGPT represents a significant milestone in the field of artificial intelligence (AI), finding widespread applications across diverse domains. However, its effectiveness in mathematical contexts has been somewhat constrained by its susceptibility to conceptual errors. Concurrently, topological data analysis (TDA), a relatively new discipline, has garnered substantial interest in recent years. Nonetheless, the advancement of TDA is impeded by the limited understanding of computational algorithms and coding proficiency among theoreticians. This work endeavors to bridge the gap between theoretical topological concepts and their practical implementation in computational topology through the utilization of ChatGPT. We showcase how a pure theoretician, devoid of computational experience and coding skills, can effectively transform mathematical formulations and concepts into functional code for computational topology with the assistance of ChatGPT. Our strategy outlines a productive process wherein a mathematician trains ChatGPT on pure mathematical concepts, steers ChatGPT towards generating computational topology code, and subsequently validates the generated code using established examples. Our specific case studies encompass the computation of Betti numbers, Laplacian matrices, and Dirac matrices for simplicial complexes, as well as the persistence of various homologies and Laplacians. Furthermore, we explore the application of ChatGPT in computing recently developed topological theories for hypergraphs and digraphs. This work serves as an initial step towards effectively transforming pure mathematical theories into practical computational tools, with the ultimate goal of enabling real applications across diverse fields.
    摘要 chatGPT 代表了人工智能(AI)领域的一个重要里程碑,在多个领域中发现了广泛的应用。然而,它在数学上的效iveness受到了概念错误的限制。同时,数据 topology 分析(TDA)在最近几年内得到了广泛的关注。然而,TDA 的发展受到了计算算法和编程技能的限制,特别是在理论家中。这项工作的目的是通过使用 chatGPT 将数学概念与计算 topology 的实现相连接。我们展示了如何让纯粹的数学家,没有计算经验和编程技能,通过 chatGPT 将数学表述和概念转化为功能的计算 topology 代码。我们的策略是让数学家通过 chatGPT 训练数学概念,然后通过 chatGPT 生成计算 topology 代码,并 finally 验证生成的代码使用已知的例子进行验证。我们的具体案例包括计算 simplicial 复杂体上的 Betti 数、Laplacian 矩阵和 Dirac 矩阵,以及不同 homology 和 Laplacian 的 persistency。此外,我们还探讨了 chatGPT 在计算最新发展的图 theoretically 的应用。这项工作作为将纯粹的数学理论转化为实用计算工具的第一步,最终目标是在多个领域应用。

ROMO: Retrieval-enhanced Offline Model-based Optimization

  • paper_url: http://arxiv.org/abs/2310.07560
  • repo_url: https://github.com/cmciris/romo
  • paper_authors: Mingcheng Chen, Haoran Zhao, Yuxiang Zhao, Hulei Fan, Hongqiao Gao, Yong Yu, Zheng Tian
  • for: 在各种实际应用场景中,数据驱动黑盒模型基于优化(MBO)问题广泛存在,旨在在整个空间中找到一个最优设计,以最大化黑盒目标函数,基于静止的离线数据集。
  • methods: 我们在这篇论文中考虑了一种更加普遍而具有挑战性的MBO设定,即受限MBO(CoMBO),其中只有一部分的设计空间可以优化,而另一部分则被环境所限制。我们提出了一种新的挑战,即大多数在离线数据集中满足约束的设计都是中等的评价。因此,我们将注意力集中在优化这些中等的设计,而不是进一步提高传统MBO设定中的最优设计。
  • results: 我们提出了一种新的forward方法,名为回 retrieve-enhanced offline model-based optimization(ROMO),它可以在离线数据集中检索和聚合相关样本,以提供可靠的预测,并用其进行梯度下降优化。ROMO简单易行,并在CoMBO设定中超过了现有方法的表现。我们在一个 sintetic Hartmann(3D)函数数据集、一个工业CIO数据集以及一个Modified Tasks中进行了实验,结果表明,ROMO在各种受限优化任务中表现良好。
    Abstract Data-driven black-box model-based optimization (MBO) problems arise in a great number of practical application scenarios, where the goal is to find a design over the whole space maximizing a black-box target function based on a static offline dataset. In this work, we consider a more general but challenging MBO setting, named constrained MBO (CoMBO), where only part of the design space can be optimized while the rest is constrained by the environment. A new challenge arising from CoMBO is that most observed designs that satisfy the constraints are mediocre in evaluation. Therefore, we focus on optimizing these mediocre designs in the offline dataset while maintaining the given constraints rather than further boosting the best observed design in the traditional MBO setting. We propose retrieval-enhanced offline model-based optimization (ROMO), a new derivable forward approach that retrieves the offline dataset and aggregates relevant samples to provide a trusted prediction, and use it for gradient-based optimization. ROMO is simple to implement and outperforms state-of-the-art approaches in the CoMBO setting. Empirically, we conduct experiments on a synthetic Hartmann (3D) function dataset, an industrial CIO dataset, and a suite of modified tasks in the Design-Bench benchmark. Results show that ROMO performs well in a wide range of constrained optimization tasks.
    摘要 “数据驱动黑盒模型基于优化(MBO)问题在许多实际应用场景中出现,目标是在整个空间上找到最优化黑盒目标函数的设计,使用静态离线数据。在这项工作中,我们考虑了更一般 yet 更加挑战性的 MBO 设定,即受限 MBO(CoMBO),其中只有部分设计空间可以优化,而另外的部分受到环境的限制。这种新的挑战是,大多数满足约束的观察到的设计都是较差的评价。因此,我们将注意力转向了这些较差的评价设计,而不是在传统 MBO 设定中进一步提高最佳观察到的设计。我们提出了 reuse-based offline model-based optimization(ROMO),一种新的可求导的前向方法,它在拥有离线数据时重新检索和聚合相关样本,以提供可靠的预测,并用其进行梯度下降优化。ROMO 简单实现,并在 CoMBO 设定中超过了当前状态的方法。我们在一个 Synthetic Hartmann(3D)函数数据集、一个industrial CIO数据集以及一个 Design-Bench benchmark 中进行了实验,结果表明,ROMO 在受限优化任务中表现良好。”

ProtoHPE: Prototype-guided High-frequency Patch Enhancement for Visible-Infrared Person Re-identification

  • paper_url: http://arxiv.org/abs/2310.07552
  • repo_url: None
  • paper_authors: Guiwei Zhang, Yongfei Zhang, Zichang Tan
  • for: bridging the modality gap in visible-infrared person re-identification
  • methods: using high-frequency components and ProtoHPE with two core designs: split patches and multimodal prototypical contrast
  • results: effective enhancement of representation ability and capture of key high-frequency components without extra complexity, validated by extensive experiments
    Abstract Visible-infrared person re-identification is challenging due to the large modality gap. To bridge the gap, most studies heavily rely on the correlation of visible-infrared holistic person images, which may perform poorly under severe distribution shifts. In contrast, we find that some cross-modal correlated high-frequency components contain discriminative visual patterns and are less affected by variations such as wavelength, pose, and background clutter than holistic images. Therefore, we are motivated to bridge the modality gap based on such high-frequency components, and propose \textbf{Proto}type-guided \textbf{H}igh-frequency \textbf{P}atch \textbf{E}nhancement (ProtoHPE) with two core designs. \textbf{First}, to enhance the representation ability of cross-modal correlated high-frequency components, we split patches with such components by Wavelet Transform and exponential moving average Vision Transformer (ViT), then empower ViT to take the split patches as auxiliary input. \textbf{Second}, to obtain semantically compact and discriminative high-frequency representations of the same identity, we propose Multimodal Prototypical Contrast. To be specific, it hierarchically captures the comprehensive semantics of different modal instances, facilitating the aggregation of high-frequency representations belonging to the same identity. With it, ViT can capture key high-frequency components during inference without relying on ProtoHPE, thus bringing no extra complexity. Extensive experiments validate the effectiveness of ProtoHPE.
    摘要 visible-infrared人识别具有大的模态差异,因此大多数研究都会仰仗可 correlate的可见-红外人像,可能会在严重的分布转换下表现不佳。然而,我们发现一些跨模态相关的高频成分含有可 дискriminative的视觉特征,并且比可见人像更不受波长、姿势和背景噪声的影响。因此,我们决心基于这些高频成分来bridging模态差异,并提出了ProtoHPE方法,具有两个核心设计。首先,为了提高跨模态相关高频成分的表征能力,我们使用波峰变换和抽象迭代ViT来拆分patches,然后使ViT作为辅助输入使用。其次,为了获得不同感知模式下的同一个人的semantically Compact和дискriminative高频表示,我们提出了多模态原型冲突。具体来说,它可以 hierarchically capture不同感知模式下的同一个人的全面semantics,使得ViT能在推理过程中捕捉到关键的高频成分,无需依赖于ProtoHPE,从而不增加额外复杂性。我们的实验证明了ProtoHPE的效果。

Improving Fairness-Accuracy tradeoff with few Test Samples under Covariate Shift

  • paper_url: http://arxiv.org/abs/2310.07535
  • repo_url: None
  • paper_authors: Shreyas Havaldar, Jatin Chauhan, Karthikeyan Shanmugam, Jay Nandy, Aravindan Raghuveer
  • for: 本研究旨在提高模型的准确性和公平性,在covariate shift问题下,确保不同敏感群体之间的公平性具有社会意义,如刑事正义。
  • methods: 本文提出三种贡献:首先,我们提出一种新的复合权重 entropy 基于目标函数,用于提高预测准确性和公平性。其次,我们提出了一种新的不对称 covariate shift Setting,这种设定在许多现有基eline上表现出EXTREMELY CHALLENGING的特点。第三,我们提出了一种理论基础,表明我们的方法可以准确地预测测试数据中的loss。
  • results: 我们的实验和理论分析表明,我们的方法可以在几个标准数据集上出perform state-of-the-art baselines in the Pareto sense,并且不受重要样本偏置的影响。此外,我们还证明了我们的方法可以在新的不对称 covariate shift Setting中提高公平性和准确性。
    Abstract Covariate shift in the test data can significantly downgrade both the accuracy and the fairness performance of the model. Ensuring fairness across different sensitive groups in such settings is of paramount importance due to societal implications like criminal justice. We operate under the unsupervised regime where only a small set of unlabeled test samples along with a labeled training set is available. Towards this problem, we make three contributions. First is a novel composite weighted entropy based objective for prediction accuracy which is optimized along with a representation matching loss for fairness. We experimentally verify that optimizing with our loss formulation outperforms a number of state-of-the-art baselines in the pareto sense with respect to the fairness-accuracy tradeoff on several standard datasets. Our second contribution is a new setting we term Asymmetric Covariate Shift that, to the best of our knowledge, has not been studied before. Asymmetric covariate shift occurs when distribution of covariates of one group shifts significantly compared to the other groups and this happens when a dominant group is over-represented. While this setting is extremely challenging for current baselines, We show that our proposed method significantly outperforms them. Our third contribution is theoretical, where we show that our weighted entropy term along with prediction loss on the training set approximates test loss under covariate shift. Empirically and through formal sample complexity bounds, we show that this approximation to the unseen test loss does not depend on importance sampling variance which affects many other baselines.
    摘要 科Variate shift在测试数据中可能会导致模型的准确性和公平性表现下降。在社会中,保持不同敏感群体之间的公平性非常重要,特别是在刑事司法方面。我们在无监督情况下运行,只有一小组无标记测试样本和一个标记训练集可用。为解决这问题,我们提出了三个贡献:第一个贡献是一种新的复合权重Entropy基于目标函数,可以同时保证预测准确性和公平性。我们通过实验表明,使用我们的损失函数可以在 pareto折衔中超过一些状态方法的基elines,并且在多个标准数据集上达到更好的性能。第二个贡献是一种新的设定,我们称之为不对称科Variate shift。这种情况发生在一个群体中covariate的分布发生了显著变化,而另一个群体的分布则相对稳定。这种情况非常困难,但我们展示了我们提出的方法可以在这种情况下表现出色。第三个贡献是理论上的,我们表明了我们的复合权重Entropy терμ和预测损失函数可以近似测试损失函数。我们通过实验和正式样本复杂度下界来证明,这种近似不依赖于重要样本变量,因此不同于许多其他基elines。

Human-Centered Evaluation of XAI Methods

  • paper_url: http://arxiv.org/abs/2310.07534
  • repo_url: None
  • paper_authors: Karam Dawoud, Wojciech Samek, Peter Eisert, Sebastian Lapuschkin, Sebastian Bosse
  • for: 本研究旨在探讨深度学习中的决策过程,以提高人们对AI的理解和信任。
  • methods: 本研究使用三种主流解释方法:Prototypical Part Network、Occlusion和Layer-wise Relevance Propagation,以评估这些方法的可解释性。
  • results: 研究发现,这三种方法可以帮助人们快速理解和分类图像,并且它们之间的解释结果相对一致,从而提高了AI的透明度。
    Abstract In the ever-evolving field of Artificial Intelligence, a critical challenge has been to decipher the decision-making processes within the so-called "black boxes" in deep learning. Over recent years, a plethora of methods have emerged, dedicated to explaining decisions across diverse tasks. Particularly in tasks like image classification, these methods typically identify and emphasize the pivotal pixels that most influence a classifier's prediction. Interestingly, this approach mirrors human behavior: when asked to explain our rationale for classifying an image, we often point to the most salient features or aspects. Capitalizing on this parallel, our research embarked on a user-centric study. We sought to objectively measure the interpretability of three leading explanation methods: (1) Prototypical Part Network, (2) Occlusion, and (3) Layer-wise Relevance Propagation. Intriguingly, our results highlight that while the regions spotlighted by these methods can vary widely, they all offer humans a nearly equivalent depth of understanding. This enables users to discern and categorize images efficiently, reinforcing the value of these methods in enhancing AI transparency.
    摘要 在人工智能领域中,一个关键挑战是解释深度学习中的决策过程。在过去几年,许多方法出现了,旨在对各种任务中的决策进行解释。特别是在图像分类任务中,这些方法通常可以识别并强调影响分类器预测的关键像素。这种方法与人类行为的差异不大:当被问到分类图像的理由时,我们通常会指出最引人注目的特征或方面。 builds on this parallel, our research conducted a user-centered study to objectively measure the interpretability of three leading explanation methods: (1) Prototypical Part Network, (2) Occlusion, and (3) Layer-wise Relevance Propagation. Our results show that while the regions highlighted by these methods can vary widely, they all provide humans with a nearly equivalent depth of understanding. This enables users to efficiently categorize and discern images, which reinforces the value of these methods in enhancing AI transparency.

Energy Estimates Across Layers of Computing: From Devices to Large-Scale Applications in Machine Learning for Natural Language Processing, Scientific Computing, and Cryptocurrency Mining

  • paper_url: http://arxiv.org/abs/2310.07516
  • repo_url: None
  • paper_authors: Sadasivan Shankar
  • for: 这 paper 旨在确定和分析计算机系统中的能源使用量。
  • methods: 这 paper 使用了以前分析 [3] 的基础,对单个设备和系统,包括三种大规模计算应用程序(人工智能/机器学习自然语言处理、科学仿真和加密货币矿 Pool)进行了能源估计。与以前的比较 bit-level switching 中,由于几何缩放,通过逻辑门阵列来实现的能效性,现在在应用程序层次的指令和仿真层次中都需要更多的能源。
  • results: 这 paper 的分析表明,使用older 半导体技术节点的架构改变可以与使用 newer 技术节点的架构相比,在 AI/ML 加速器中实现相同的能效性。此外,对计算系统中的能量和热动力学限制进行了比较,显示计算应用程序的总模拟需要27-36个数量级更高的能量需求。这些能量估计表明计算系统中的能效性需要被考虑,包括能量作为设计参数,以满足计算应用程序的增长需求在数字世界中。
    Abstract Estimates of energy usage in layers of computing from devices to algorithms have been determined and analyzed. Building on the previous analysis [3], energy needed from single devices and systems including three large-scale computing applications such as Artificial Intelligence (AI)/Machine Learning for Natural Language Processing, Scientific Simulations, and Cryptocurrency Mining have been estimated. In contrast to the bit-level switching, in which transistors achieved energy efficiency due to geometrical scaling, higher energy is expended both at the at the instructions and simulations levels of an application. Additionally, the analysis based on AI/ML Accelerators indicate that changes in architectures using an older semiconductor technology node have comparable energy efficiency with a different architecture using a newer technology. Further comparisons of the energy in computing systems with the thermodynamic and biological limits, indicate that there is a 27-36 orders of magnitude higher energy requirements for total simulation of an application. These energy estimates underscore the need for serious considerations of energy efficiency in computing by including energy as a design parameter, enabling growing needs of compute-intensive applications in a digital world.
    摘要 计算层从设备到算法的能源使用估算和分析已经确定。基于之前分析(3),包括人工智能(AI)/机器学习自然语言处理、科学仿真和加密货币开采等三大规模计算应用程序的能源需求已经估算。与比特级 switching相比,在应用程序的指令和仿真层级上都需要更多的能源。此外,使用older半导体技术节点的架构改变表明,与不同架构使用更新的技术节点相比,能效性没有很大差异。此外,对计算系统的能源需求和热动力学和生物学限制进行比较,显示计算应用程序的总模拟需求高达27-36个数量级。这些能源估算表明,在计算设计中包含能源为重要参数是必要的,以满足计算密集应用程序在数字世界中的增长需求。

Sample-Driven Federated Learning for Energy-Efficient and Real-Time IoT Sensing

  • paper_url: http://arxiv.org/abs/2310.07497
  • repo_url: https://github.com/skyd-fl/scfl
  • paper_authors: Minh Ngoc Luu, Minh-Duong Nguyen, Ebrahim Bedeer, Van Duc Nguyen, Dinh Thai Hoang, Diep N. Nguyen, Quoc-Viet Pham
  • for: 这篇论文主要针对 Federated Learning (FL) 系统中的最新潮流方法,它们假设训练集在 IoT 设备上的数据具有全球数据分布的相似性。但这种方法无法捕捉现实时感数据的全面特点。这篇论文的目的是为 IoT 网络实现现实时感数据的 Federated Learning 系统。
  • methods: 这篇论文提出了一个新的方法,即 Sample-driven Control for Federated Learning (SCFL),该方法可以在 IoT 网络中实现现实时感数据的 Federated Learning。SCFL 方法首先将数据采样过程视为一个优化问题,并通过实现数据采样过程中的范例控制来减少过滤问题并提高准确性。
  • results: 这篇论文的结果显示,SCFL 方法可以有效地控制数据采样过程中的过滤问题,并提高 Federated Learning 系统的准确性。实验结果显示,SCFL 方法在不同的数据分布下均可以获得高准确性。此外,SCFL 方法还可以在变化的环境下获得佳效果,因为它可以在不同的数据分布下进行自适应调整。
    Abstract In the domain of Federated Learning (FL) systems, recent cutting-edge methods heavily rely on ideal conditions convergence analysis. Specifically, these approaches assume that the training datasets on IoT devices possess similar attributes to the global data distribution. However, this approach fails to capture the full spectrum of data characteristics in real-time sensing FL systems. In order to overcome this limitation, we suggest a new approach system specifically designed for IoT networks with real-time sensing capabilities. Our approach takes into account the generalization gap due to the user's data sampling process. By effectively controlling this sampling process, we can mitigate the overfitting issue and improve overall accuracy. In particular, We first formulate an optimization problem that harnesses the sampling process to concurrently reduce overfitting while maximizing accuracy. In pursuit of this objective, our surrogate optimization problem is adept at handling energy efficiency while optimizing the accuracy with high generalization. To solve the optimization problem with high complexity, we introduce an online reinforcement learning algorithm, named Sample-driven Control for Federated Learning (SCFL) built on the Soft Actor-Critic (A2C) framework. This enables the agent to dynamically adapt and find the global optima even in changing environments. By leveraging the capabilities of SCFL, our system offers a promising solution for resource allocation in FL systems with real-time sensing capabilities.
    摘要 在 Federated Learning(FL)系统领域,现代方法倚靠理想的条件进行归一化分析。specifically, these approaches assume that the training datasets on IoT devices possess similar attributes to the global data distribution. However, this approach fails to capture the full spectrum of data characteristics in real-time sensing FL systems. In order to overcome this limitation, we propose a new approach system specifically designed for IoT networks with real-time sensing capabilities. Our approach takes into account the generalization gap due to the user's data sampling process. By effectively controlling this sampling process, we can mitigate the overfitting issue and improve overall accuracy. In particular, We first formulate an optimization problem that harnesses the sampling process to concurrently reduce overfitting while maximizing accuracy. In pursuit of this objective, our surrogate optimization problem is adept at handling energy efficiency while optimizing the accuracy with high generalization. To solve the optimization problem with high complexity, we introduce an online reinforcement learning algorithm, named Sample-driven Control for Federated Learning (SCFL) built on the Soft Actor-Critic (A2C) framework. This enables the agent to dynamically adapt and find the global optima even in changing environments. By leveraging the capabilities of SCFL, our system offers a promising solution for resource allocation in FL systems with real-time sensing capabilities.Here's the word-for-word translation of the text into Simplified Chinese:在 Federated Learning(FL)系统领域,现代方法倚靠理想的条件进行归一化分析。specifically, these approaches assume that the training datasets on IoT devices possess similar attributes to the global data distribution. However, this approach fails to capture the full spectrum of data characteristics in real-time sensing FL systems. In order to overcome this limitation, we propose a new approach system specifically designed for IoT networks with real-time sensing capabilities. Our approach takes into account the generalization gap due to the user's data sampling process. By effectively controlling this sampling process, we can mitigate the overfitting issue and improve overall accuracy. In particular, We first formulate an optimization problem that harnesses the sampling process to concurrently reduce overfitting while maximizing accuracy. In pursuit of this objective, our surrogate optimization problem is adept at handling energy efficiency while optimizing the accuracy with high generalization. To solve the optimization problem with high complexity, we introduce an online reinforcement learning algorithm, named Sample-driven Control for Federated Learning (SCFL) built on the Soft Actor-Critic (A2C) framework. This enables the agent to dynamically adapt and find the global optima even in changing environments. By leveraging the capabilities of SCFL, our system offers a promising solution for resource allocation in FL systems with real-time sensing capabilities.

Diversity for Contingency: Learning Diverse Behaviors for Efficient Adaptation and Transfer

  • paper_url: http://arxiv.org/abs/2310.07493
  • repo_url: None
  • paper_authors: Finn Rietz, Johannes Andreas Stork
  • for: 本文旨在提出一种简单的方法,以寻找任务中所有可能的解决方案,以提高转移RL代理的性能和适应能力。
  • methods: 本文使用迭代学习策略,每一个策略都要求在所有前一个策略下 unlikely 的解决方案。不需要学习额外的模型,也不需要调整任务和新鲜度奖励信号。
  • results: 本文的方法可以快速适应任务和转移动力学变化,并且可以提高转移RL代理的性能。
    Abstract Discovering all useful solutions for a given task is crucial for transferable RL agents, to account for changes in the task or transition dynamics. This is not considered by classical RL algorithms that are only concerned with finding the optimal policy, given the current task and dynamics. We propose a simple method for discovering all possible solutions of a given task, to obtain an agent that performs well in the transfer setting and adapts quickly to changes in the task or transition dynamics. Our method iteratively learns a set of policies, while each subsequent policy is constrained to yield a solution that is unlikely under all previous policies. Unlike prior methods, our approach does not require learning additional models for novelty detection and avoids balancing task and novelty reward signals, by directly incorporating the constraint into the action selection and optimization steps.
    摘要 发现所有有用的解决方案是让转移RL代理工作的关键,以适应任务或过程动态变化。这并不是классиical RL算法所考虑的,它们只关心当前任务和动态下找到优化策略。我们提出了一种简单的方法,可以帮助代理人在转移设置下快速适应任务或动态变化,并且 Perform well。我们的方法每次迭代学习一组策略,而每一个后续策略都需要在所有前一个策略下 unlikely to yield a solution。不同于先前的方法,我们的方法不需要学习额外的模型来检测新事物,也不需要平衡任务和新事物的奖励信号,直接将约束 incorporated into action selection and optimization steps。

Boosting Black-box Attack to Deep Neural Networks with Conditional Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.07492
  • repo_url: None
  • paper_authors: Renyang Liu, Wei Zhou, Tianwei Zhang, Kangjie Chen, Jun Zhao, Kwok-Yan Lam
    for:This paper proposes a novel black-box attack strategy to improve the query efficiency of generating adversarial examples (AE) under query-limited situations.methods:The proposed method, called Conditional Diffusion Model Attack (CDMA), formulates the task of AE synthesis as a distribution transformation problem and uses the conditional Denoising Diffusion Probabilistic Model as the converter to learn the transformation from clean samples to AEs.results:CDMA significantly reduces the number of queries needed compared to nine state-of-the-art black-box attacks, with an average reduction of the query count to a handful of times. The attack success rate is high, with $>99%$ success rate for untargeted attacks over all datasets and targeted attack over CIFAR-10 with a noise budget of $\epsilon=16$.
    Abstract Existing black-box attacks have demonstrated promising potential in creating adversarial examples (AE) to deceive deep learning models. Most of these attacks need to handle a vast optimization space and require a large number of queries, hence exhibiting limited practical impacts in real-world scenarios. In this paper, we propose a novel black-box attack strategy, Conditional Diffusion Model Attack (CDMA), to improve the query efficiency of generating AEs under query-limited situations. The key insight of CDMA is to formulate the task of AE synthesis as a distribution transformation problem, i.e., benign examples and their corresponding AEs can be regarded as coming from two distinctive distributions and can transform from each other with a particular converter. Unlike the conventional \textit{query-and-optimization} approach, we generate eligible AEs with direct conditional transform using the aforementioned data converter, which can significantly reduce the number of queries needed. CDMA adopts the conditional Denoising Diffusion Probabilistic Model as the converter, which can learn the transformation from clean samples to AEs, and ensure the smooth development of perturbed noise resistant to various defense strategies. We demonstrate the effectiveness and efficiency of CDMA by comparing it with nine state-of-the-art black-box attacks across three benchmark datasets. On average, CDMA can reduce the query count to a handful of times; in most cases, the query count is only ONE. We also show that CDMA can obtain $>99\%$ attack success rate for untarget attacks over all datasets and targeted attack over CIFAR-10 with the noise budget of $\epsilon=16$.
    摘要 现有的黑盒攻击已经显示了吸引深度学习模型的可能性(AE)的潜在攻击性。大多数这些攻击需要处理広大的优化空间,需要大量的询问,因此在实际情况下显示有限的实际影响。在这篇论文中,我们提出了一种新的黑盒攻击策略,即条件扩散模型攻击(CDMA),以提高受限制的询问量生成AE的能力。我们的关键见解是将AE生成视为两种不同分布之间的转换问题,即正常示例和其相应的AE可以被视为来自两个不同的分布,并可以使用特定的数据转换器将其转换为另一种分布。不同于传统的询问和优化方法,我们可以直接使用条件扩散probabilistic模型来生成适合的AE,这可以将询问数量大大减少。CDMA使用条件排除扩散模型来实现转换,这种模型可以从清洁示例到AE的转换,并确保受到不同防御策略的抗压。我们透过与九种现有的黑盒攻击进行比较,证明CDMA可以将询问数量大大减少,平均只需要几回询问。此外,我们还证明CDMA可以在CIFAR-10上 obtaint>99%的攻击成功率,并且在其他两个测试集上也有出色的表现。

KwaiYiiMath: Technical Report

  • paper_url: http://arxiv.org/abs/2310.07488
  • repo_url: None
  • paper_authors: Jiayi Fu, Lei Lin, Xiaoyang Gao, Pengli Liu, Zhengzong Chen, Zhirui Yang, Shengnan Zhang, Xue Zheng, Yan Li, Yuliang Liu, Xucheng Ye, Yiqiao Liao, Chao Liao, Bin Chen, Chengru Song, Junchen Wan, Zijia Lin, Fuzheng Zhang, Zhongyuan Wang, Di Zhang, Kun Gai
  • for: The paper is written to enhance the mathematical reasoning abilities of KwaiYiiBase1, a large language model, by applying Supervised Fine-Tuning (SFT) and Reinforced Learning from Human Feedback (RLHF) on both English and Chinese mathematical tasks.
  • methods: The paper uses Supervised Fine-Tuning (SFT) and Reinforced Learning from Human Feedback (RLHF) to enhance the mathematical reasoning abilities of KwaiYiiBase1.
  • results: The paper achieves state-of-the-art (SOTA) performance on GSM8k, CMath, and KMath compared with similar size models, respectively.Here are the three key information points in Simplified Chinese text:
  • for: 本文是为了增强kwaiyiibase1的数学逻辑能力,通过supervised fine-tuning (SFT)和人工反馈学习 (RLHF),在英文和中文数学任务上进行了应用。
  • methods: 本文使用supervised fine-tuning (SFT)和人工反馈学习 (RLHF)来增强kwaiyiibase1的数学逻辑能力。
  • results: 本文在GSM8k、CMath和KMath上 achievements state-of-the-art (SOTA)性能,与相似大小的模型相比。
    Abstract Recent advancements in large language models (LLMs) have demonstrated remarkable abilities in handling a variety of natural language processing (NLP) downstream tasks, even on mathematical tasks requiring multi-step reasoning. In this report, we introduce the KwaiYiiMath which enhances the mathematical reasoning abilities of KwaiYiiBase1, by applying Supervised Fine-Tuning (SFT) and Reinforced Learning from Human Feedback (RLHF), including on both English and Chinese mathematical tasks. Meanwhile, we also constructed a small-scale Chinese primary school mathematics test set (named KMath), consisting of 188 examples to evaluate the correctness of the problem-solving process generated by the models. Empirical studies demonstrate that KwaiYiiMath can achieve state-of-the-art (SOTA) performance on GSM8k, CMath, and KMath compared with the similar size models, respectively.
    摘要 近期大语言模型(LLM)的进步已经表现出对各种自然语言处理(NLP)下游任务的出色能力,甚至包括需要多步逻辑的数学任务。在这份报告中,我们介绍了废弃YiiMath,通过精度练习(SFT)和人工反馈学习(RLHF)进行数学逻辑能力的提升,包括英语和中文数学任务。同时,我们还建立了一个小规模的中文小学数学测试集(名为KMath),包含188个例子,用于评估模型生成的问题解决过程的正确性。实验表明,废弃YiiMath可以在GSM8k、CMath和KMath上达到相同规模模型的SOTA性能。

Multimodal Graph Learning for Generative Tasks

  • paper_url: http://arxiv.org/abs/2310.07478
  • repo_url: https://github.com/minjiyoon/mmgl
  • paper_authors: Minji Yoon, Jing Yu Koh, Bryan Hooi, Ruslan Salakhutdinov
  • for: 本研究旨在扩展现有的文本生成模型,以使其能够利用多模态数据进行生成。
  • methods: 我们提出了一种名为多模态图学习(MMGL)的框架,可以捕捉多模态数据之间的复杂关系。我们基于预训练语言模型(LM),并通过infuse多个邻居信息和图структура信息来增强其文本生成能力。
  • results: 我们通过了三个研究问题,即如何兼容多个邻居信息,如何兼容图структура信息,以及如何 Parametric-efficiently 训练预训练LM。我们的实验结果表明,MMGL可以增强文本生成能力,并且可以在兼容多个邻居信息和图структура信息的情况下进行Parameter-efficient 训练。
    Abstract Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize: for example, from plain text to image-caption pairs. Most multimodal learning algorithms focus on modeling simple one-to-one pairs of data from two modalities, such as image-caption pairs, or audio-text pairs. However, in most real-world settings, entities of different modalities interact with each other in more complex and multifaceted ways, going beyond one-to-one mappings. We propose to represent these complex relationships as graphs, allowing us to capture data with any number of modalities, and with complex relationships between modalities that can flexibly vary from one sample to another. Toward this goal, we propose Multimodal Graph Learning (MMGL), a general and systematic framework for capturing information from multiple multimodal neighbors with relational structures among them. In particular, we focus on MMGL for generative tasks, building upon pretrained Language Models (LMs), aiming to augment their text generation with multimodal neighbor contexts. We study three research questions raised by MMGL: (1) how can we infuse multiple neighbor information into the pretrained LMs, while avoiding scalability issues? (2) how can we infuse the graph structure information among multimodal neighbors into the LMs? and (3) how can we finetune the pretrained LMs to learn from the neighbor context in a parameter-efficient manner? We conduct extensive experiments to answer these three questions on MMGL and analyze the empirical results to pave the way for future MMGL research.
    摘要 多模态学习结合多种数据模式,扩大我们模型使用的数据类型和复杂度:例如,从纯文本到图像描述对。大多数多模态学习算法都是模型简单的一对一对数据,如图像描述对或音频文本对。然而,在实际世界中,不同模式之间的实体往往存在更复杂和多方面的交互,超出一对一映射。我们建议表示这些复杂关系为图,以便捕捉多模式数据,并且在不同样本之间可以有不同的关系。为了实现这个目标,我们提出了多模态图学习(MMGL),一种通用和系统的框架,用于从多个多模式邻居中捕捉信息。具体来说,我们将关注MMGL的生成任务,基于预训练语言模型(LM),以增强其文本生成功能。我们提出了三个研究问题:(1)如何将多个邻居信息融入预训练LM中,而不会面临扩展性问题?(2)如何将多模式邻居之间的图结构信息融入LM中?(3)如何在LM中 parameter-efficient 地学习邻居 контекст?我们进行了广泛的实验和分析,以回答这三个问题,并为未来MMGL研究做出了基础。

An Ontology of Co-Creative AI Systems

  • paper_url: http://arxiv.org/abs/2310.07472
  • repo_url: None
  • paper_authors: Zhiyu Lin, Mark Riedl
  • for: 这个论文旨在为研究人员提供一种 Ontology of co-creative systems,以帮助解决人工智能和人类在创造过程中的分工和信息交换问题。
  • methods: 这篇论文使用了 Lubart 的原始 Ontology of creativity support tools,并新增了三个类别,专注于人工智能:计算机作为合作伙伴、计算机作为评论人和计算机作为合作者,其中一些有子类别。
  • results: 论文提出了一个 Ontology of co-creative systems,可以帮助研究人员更好地理解和分析人工智能和人类在创造过程中的相互作用,并且可以帮助设计更加有效的人工智能和人类合作系统。
    Abstract The term co-creativity has been used to describe a wide variety of human-AI assemblages in which human and AI are both involved in a creative endeavor. In order to assist with disambiguating research efforts, we present an ontology of co-creative systems, focusing on how responsibilities are divided between human and AI system and the information exchanged between them. We extend Lubart's original ontology of creativity support tools with three new categories emphasizing artificial intelligence: computer-as-subcontractor, computer-as-critic, and computer-as-teammate, some of which have sub-categorizations.
    摘要 《合作创造力》一词汇集了许多人工智能融合体,在创造活动中,人类和AI系统都参与了创新。为了帮助研究工作,我们提出了一个协创系统 ontology,关注人类和 AI 系统之间的责任分配和信息交换。我们对 Lubart 的原始创造支持工具 ontology 进行扩展,添加了三个新类型,强调人工智能:计算机作为合同人,计算机作为评论人,计算机作为团队成员,其中有一些子分类。

The Implications of Decentralization in Blockchained Federated Learning: Evaluating the Impact of Model Staleness and Inconsistencies

  • paper_url: http://arxiv.org/abs/2310.07471
  • repo_url: None
  • paper_authors: Francesc Wilhelmi, Nima Afraz, Elia Guerra, Paolo Dini
  • for: 这篇论文探讨了在区块链上实现分布式机器学习(DL)的可能性,尤其是在分布式学习(FL)中,区块链可以提供更多的分布式、安全、不可变和信任等特性,这些特性可以激发下一代应用中的共同智能。
  • methods: 论文使用了预训练模型,并在块链上进行了模型协调和更新。
  • results: 研究发现,在使用块链进行FL时,模型不一致和延迟会导致预测精度下降(下降约35%),这说明在设计块链系统时,需要考虑FL应用的特点。
    Abstract Blockchain promises to enhance distributed machine learning (ML) approaches such as federated learning (FL) by providing further decentralization, security, immutability, and trust, which are key properties for enabling collaborative intelligence in next-generation applications. Nonetheless, the intrinsic decentralized operation of peer-to-peer (P2P) blockchain nodes leads to an uncharted setting for FL, whereby the concepts of FL round and global model become meaningless, as devices' synchronization is lost without the figure of a central orchestrating server. In this paper, we study the practical implications of outsourcing the orchestration of FL to a democratic network such as in a blockchain. In particular, we focus on the effects that model staleness and inconsistencies, endorsed by blockchains' modus operandi, have on the training procedure held by FL devices asynchronously. Using simulation, we evaluate the blockchained FL operation on the well-known CIFAR-10 dataset and focus on the accuracy and timeliness of the solutions. Our results show the high impact of model inconsistencies on the accuracy of the models (up to a ~35% decrease in prediction accuracy), which underscores the importance of properly designing blockchain systems based on the characteristics of the underlying FL application.
    摘要 blockchain 承诺增强分布式机器学习(ML)方法,如联合学习(FL),通过提供更多的分布式、安全、不可变和信任性,这些属性是实现共同智能在下一代应用程序的关键。然而,干预式P2P区块链节点的自然分布式运行环境导致FL中的概念,例如轮次和全局模型,失去意义,因为设备的同步失去了中央调度服务器的引导。在这篇论文中,我们研究了在块链上进行FL的实际影响。具体来说,我们关注FL设备在异步情况下进行训练过程中的模型落后和不一致性问题,这些问题由块链的操作方式推动。通过实验,我们评估了使用块链进行FL操作在CIFAR-10数据集上的性能。我们发现,模型不一致性可以导致预测精度下降,最高下降约35%,这说明了如何设计基于FL应用程序的块链系统的重要性。

AI/ML-based Load Prediction in IEEE 802.11 Enterprise Networks

  • paper_url: http://arxiv.org/abs/2310.07467
  • repo_url: None
  • paper_authors: Francesc Wilhelmi, Dariush Salami, Gianluca Fontanesi, Lorenzo Galati-Giordano, Mika Kasslin
  • for: 该论文旨在探讨在实际企业 Wi-Fi 网络中采用基于人工智能和机器学习(AI/ML)的负荷预测是否可行和有效。
  • methods: 该论文采用了基于 AI/ML 的负荷预测方法,并对其适用性和可行性进行了研究。
  • results: 研究发现,使用硬件受限的 AI/ML 模型可以预测网络负荷,误差在20%以下,85%分位数在3%以下,这可以作为 Wi-Fi 网络优化的输入。
    Abstract Enterprise Wi-Fi networks can greatly benefit from Artificial Intelligence and Machine Learning (AI/ML) thanks to their well-developed management and operation capabilities. At the same time, AI/ML-based traffic/load prediction is one of the most appealing data-driven solutions to improve the Wi-Fi experience, either through the enablement of autonomous operation or by boosting troubleshooting with forecasted network utilization. In this paper, we study the suitability and feasibility of adopting AI/ML-based load prediction in practical enterprise Wi-Fi networks. While leveraging AI/ML solutions can potentially contribute to optimizing Wi-Fi networks in terms of energy efficiency, performance, and reliability, their effective adoption is constrained to aspects like data availability and quality, computational capabilities, and energy consumption. Our results show that hardware-constrained AI/ML models can potentially predict network load with less than 20% average error and 3% 85th-percentile error, which constitutes a suitable input for proactively driving Wi-Fi network optimization.
    摘要

Efficient machine-learning surrogates for large-scale geological carbon and energy storage

  • paper_url: http://arxiv.org/abs/2310.07461
  • repo_url: None
  • paper_authors: Teeratorn Kadeethum, Stephen J. Verzi, Hongkyu Yoon
  • for: 这篇论文是为了探讨地质碳和能源储存在实现零碳排放和气候变化管理方面所面临的不确定性,并提出一个特殊化的机器学习(ML)模型来有效管理大规模的油气储存模型。
  • methods: 这篇论文使用了一种特殊的机器学习模型,具有预测精度和训练成本之间的平衡,以解决大规模地质碳储存应用中的计算资源限制问题。
  • results: 这篇论文的结果显示,这种特殊的机器学习模型可以实现高精度的预测,并且可以对大规模的油气储存应用进行有效管理,协助解决地质碳储存中的不确定性和操作限制问题。
    Abstract Geological carbon and energy storage are pivotal for achieving net-zero carbon emissions and addressing climate change. However, they face uncertainties due to geological factors and operational limitations, resulting in possibilities of induced seismic events or groundwater contamination. To overcome these challenges, we propose a specialized machine-learning (ML) model to manage extensive reservoir models efficiently. While ML approaches hold promise for geological carbon storage, the substantial computational resources required for large-scale analysis are the obstacle. We've developed a method to reduce the training cost for deep neural operator models, using domain decomposition and a topology embedder to link spatio-temporal points. This approach allows accurate predictions within the model's domain, even for untrained data, enhancing ML efficiency for large-scale geological storage applications.
    摘要 地质碳和能源储存对于实现零碳排放和气候变化做出重要贡献,但它们面临地质因素和运营限制,可能导致人工地震或地下水污染。为了解决这些挑战,我们提议使用专门的机器学习(ML)模型来有效管理大规模的沉存模型。While ML approaches hold promise for geological carbon storage, the substantial computational resources required for large-scale analysis are the obstacle. We've developed a method to reduce the training cost for deep neural operator models, using domain decomposition and a topology embedder to link spatio-temporal points. This approach allows accurate predictions within the model's domain, even for untrained data, enhancing ML efficiency for large-scale geological storage applications.Translation notes:* "地质碳" (dì qiè diān) means "geological carbon"* "能源储存" (néngyuan jīcè) means "energy storage"* "零碳排放" (líng diān fāshuā) means "net-zero carbon emissions"* "气候变化" (qìngkēng biànhàng) means "climate change"* "地质因素" (dì qiè yīnxiàng) means "geological factors"* "运营限制" (yùn yì jìzhèng) means "operational limitations"* "人工地震" (réngōng dìzhèn) means "induced seismic events"* "地下水污染" (dì xià shuǐwù) means "groundwater contamination"* "ML" (M L) stands for "machine learning"* "沉存模型" (chéncèng módel) means "reservoir models"* "域 decomposition" (dì zhāng) means "domain decomposition"* "トポлоги embedder" (tuōpōlógì yìbiāo) means "topology embedder"* "链接点" (liánjié diǎn) means "linking points"* "模型的域" (módel de dì) means "the model's domain"* "untrained data" (wutaining tiào xiàng) means "untrained data"

HealthWalk: Promoting Health and Mobility through Sensor-Based Rollator Walker Assistance

  • paper_url: http://arxiv.org/abs/2310.07434
  • repo_url: None
  • paper_authors: Ivanna Kramer, Kevin Weirauch, Sabine Bauer, Mark Oliver Mints, Peer Neubert
  • for: 增强physically limited人群 mobilty和独立参与社会
  • methods: интегриyo sensors into rollator walker designs
  • results: 数据收集和其他 interessin use casesHere’s the information in Simplified Chinese text:
  • for: 增强physically limited人群 mobilty和独立参与社会
  • methods: интегриyo sensors into rollator walker designs
  • results: 数据收集和其他 interessin use casesI hope this helps! Let me know if you have any other questions.
    Abstract Rollator walkers allow people with physical limitations to increase their mobility and give them the confidence and independence to participate in society for longer. However, rollator walker users often have poor posture, leading to further health problems and, in the worst case, falls. Integrating sensors into rollator walker designs can help to address this problem and results in a platform that allows several other interesting use cases. This paper briefly overviews existing systems and the current research directions and challenges in this field. We also present our early HealthWalk rollator walker prototype for data collection with older people, rheumatism, multiple sclerosis and Parkinson patients, and individuals with visual impairments.
    摘要 轮椅步行器让人们 WITH 身体限制能够提高 mobilility 并给他们带来自信和独立,以便 longer 参与社会。但是轮椅步行器用户 часто有坏 posture,导致更多的健康问题,甚至最坏情况下掉进了 falls。把感应器 integrate 到轮椅步行器设计中可以解决这个问题,并且可以实现多种有趣的用case。本文 briefly 概述了现有系统和这个领域的当前研究方向和挑战。我们也 Present 我们的早期 HealthWalk 轮椅步行器原型,用于收集数据 FROM older people, rheumatism, multiple sclerosis 和 Parkinson 病人,以及Visual impairments 患者。

Imitation Learning from Observation with Automatic Discount Scheduling

  • paper_url: http://arxiv.org/abs/2310.07433
  • repo_url: None
  • paper_authors: Yuyang Liu, Weijun Dong, Yingdong Hu, Chuan Wen, Zhao-Heng Yin, Chongjie Zhang, Yang Gao
    for:本研究旨在解决机器人学习从视频示例数据中学习的问题,即“imitating the expert without access to its action”。methods:本研究使用了一种 inverse reinforcement learning 方法,将 ILfO 问题转化为 reinforcement learning 问题,使用代理奖金计算从 agent 和专家的观察中。results:实验结果显示,我们的方法在 nine Meta-World 任务上表现出色,与现有方法比较,在所有任务上均有显著改善,包括一些不可解决的任务。
    Abstract Humans often acquire new skills through observation and imitation. For robotic agents, learning from the plethora of unlabeled video demonstration data available on the Internet necessitates imitating the expert without access to its action, presenting a challenge known as Imitation Learning from Observations (ILfO). A common approach to tackle ILfO problems is to convert them into inverse reinforcement learning problems, utilizing a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent needs to initially learn the expert's preceding behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of initial behaviors. To address this challenge, we present a novel ILfO framework that enables the agent to master earlier behaviors before advancing to later ones. We introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively alters the discount factor in reinforcement learning during the training phase, prioritizing earlier rewards initially and gradually engaging later rewards only when the earlier behaviors have been mastered. Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them.
    摘要 人类常通过观察和模仿获得新技能。对于机器人代理,从互联网上的大量未标注视频示例数据中学习,却面临了无法访问行为者的行为的挑战,称为观察学习从抽象(ILfO)。常见的解决ILfO问题的方法是将其转化为反杂化学习问题,使用代理人和行为者的观察得到的代理奖励。然而,我们发现任务具有进步依赖性属性时,这些方法会遇到很大的挑战,因为代理人需要在学习早期的行为之前,才能学习专家的后续行为。我们的研究发现,主要的问题在于奖励信号赋给后续步骤会阻碍代理人学习初期的行为。为解决这个挑战,我们提出了一种新的ILfO框架,使得代理人可以在训练期间,先学习初期的行为,然后才能进行后续的学习。我们在训练过程中引入自适应折扣因子调整机制(ADS),根据训练过程中的步骤,自动调整折扣因子,在初期偏好早期奖励,逐渐地只有在初期行为已经熟悉时,才能够参与后续奖励。我们在Meta-World任务上进行了九项实验,结果表明,我们的方法在所有任务上都有显著的优势,包括一些不可能由现有方法解决的任务。

Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else

  • paper_url: http://arxiv.org/abs/2310.07419
  • repo_url: None
  • paper_authors: Hazarapet Tunanyan, Dejia Xu, Shant Navasardyan, Zhangyang Wang, Humphrey Shi
  • for: 提高文本描述逻辑概率生成图像的自然多元概念能力,不需要额外训练或执行时指导。
  • methods: 通过修正预训练文本扩展模型中的文本嵌入,解决概念占据和非本地贡献问题,提高多元概念图像生成性能。
  • results: 在文本描述逻辑概率生成图像、图像修改和个性化任务中,比前方法高效,且不需要额外训练或执行时指导。
    Abstract Recent advances in text-to-image diffusion models have enabled the photorealistic generation of images from text prompts. Despite the great progress, existing models still struggle to generate compositional multi-concept images naturally, limiting their ability to visualize human imagination. While several recent works have attempted to address this issue, they either introduce additional training or adopt guidance at inference time. In this work, we consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model, and with almost no extra cost. To achieve this goal, we identify the limitations in the text embeddings used for the pre-trained text-to-image diffusion models. Specifically, we observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance. We further design a minimal low-cost solution that overcomes the above issues by tweaking (not re-training) the text embeddings for more realistic multi-concept text-to-image generation. Our Correction by Similarities method tweaks the embedding of concepts by collecting semantic features from most similar tokens to localize the contribution. To avoid mixing features of concepts, we also apply Cross-Token Non-Maximum Suppression, which excludes the overlap of contributions from different concepts. Experiments show that our approach outperforms previous methods in text-to-image, image manipulation, and personalization tasks, despite not introducing additional training or inference costs to the diffusion steps.
    摘要

Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages

  • paper_url: http://arxiv.org/abs/2310.07418
  • repo_url: https://github.com/Guozheng-Ma/Adaptive-Replay-Ratio
  • paper_authors: Guozheng Ma, Lu Li, Sen Zhang, Zixuan Liu, Zhen Wang, Yixin Chen, Li Shen, Xueqian Wang, Dacheng Tao
  • for: 这篇论文的目的是研究强化学习中的塑性(plasticity)如何影响高性能和效率的视觉强化学习(VRL)。
  • methods: 该论文使用了系统的实验尝试,探讨了三个未曾充分研究的方面,并得出了以下有益的结论:(1)数据扩展是维持塑性的关键因素;(2)评价器的塑性损失是干扰高效培训的主要障碍;(3)在早期阶段不及时 intervene 可能会导致评价器的塑性损失变成致命的问题。
  • results: 该论文的研究结果表明,适应式 RR 可以避免早期阶段的评价器塑性损失,并在后期阶段更频 reuse,从而提高样本效率。
    Abstract Plasticity, the ability of a neural network to evolve with new data, is crucial for high-performance and sample-efficient visual reinforcement learning (VRL). Although methods like resetting and regularization can potentially mitigate plasticity loss, the influences of various components within the VRL framework on the agent's plasticity are still poorly understood. In this work, we conduct a systematic empirical exploration focusing on three primary underexplored facets and derive the following insightful conclusions: (1) data augmentation is essential in maintaining plasticity; (2) the critic's plasticity loss serves as the principal bottleneck impeding efficient training; and (3) without timely intervention to recover critic's plasticity in the early stages, its loss becomes catastrophic. These insights suggest a novel strategy to address the high replay ratio (RR) dilemma, where exacerbated plasticity loss hinders the potential improvements of sample efficiency brought by increased reuse frequency. Rather than setting a static RR for the entire training process, we propose Adaptive RR, which dynamically adjusts the RR based on the critic's plasticity level. Extensive evaluations indicate that Adaptive RR not only avoids catastrophic plasticity loss in the early stages but also benefits from more frequent reuse in later phases, resulting in superior sample efficiency.
    摘要 neural network 的可变性(plasticity) 是视觉强化学习(VRL) 的关键因素,它可以使 agent 在新数据上进行学习和改进。 although methods like resetting and regularization can potentially mitigate plasticity loss, the influences of various components within the VRL framework on the agent's plasticity are still poorly understood. 在这项工作中,我们进行了系统性的实验研究,关注三个次要未经探索的方面,得到以下有价值的结论:(1)数据扩展是维护可变性的关键因素;(2)批评者的可变性损失是训练的主要瓶颈;(3)在早期阶段没有及时干预可以使批评者的可变性恢复,这种损失会变成灾难性的。这些结论建议了一种新的策略,即适应性的 reuse frequency(RR),可以 dynamically 根据批评者的可变性水平进行调整。我们的实验表明,适应性的 RR 不仅可以避免批评者的可变性损失在早期阶段,还可以在后期阶段更频 reuse,从而提高样本效率。

What can knowledge graph alignment gain with Neuro-Symbolic learning approaches?

  • paper_url: http://arxiv.org/abs/2310.07417
  • repo_url: None
  • paper_authors: Pedro Giesteira Cotovio, Ernesto Jimenez-Ruiz, Catia Pesquita
  • for: 本研究旨在探讨现有的知识图гра(KG)对应算法的局限性,以及可以通过结合符号学习和数字学习的гибри德学习模型来改进KGA的性能和可解性。
  • methods: 本研究使用了现有的KGA算法和深度学习模型,以及一些相关的数字学习和符号学习方法进行比较和分析。
  • results: 研究发现,结合符号学习和数字学习的гибри德学习模型可以提高KGA的性能和可解性,并且可以支持人类中心的验证和验证方法。
    Abstract Knowledge Graphs (KG) are the backbone of many data-intensive applications since they can represent data coupled with its meaning and context. Aligning KGs across different domains and providers is necessary to afford a fuller and integrated representation. A severe limitation of current KG alignment (KGA) algorithms is that they fail to articulate logical thinking and reasoning with lexical, structural, and semantic data learning. Deep learning models are increasingly popular for KGA inspired by their good performance in other tasks, but they suffer from limitations in explainability, reasoning, and data efficiency. Hybrid neurosymbolic learning models hold the promise of integrating logical and data perspectives to produce high-quality alignments that are explainable and support validation through human-centric approaches. This paper examines the current state of the art in KGA and explores the potential for neurosymbolic integration, highlighting promising research directions for combining these fields.
    摘要 知识 graphs (KG) 是许多数据敏感应用的重要组成部分,因为它们可以表示数据以及其意义和上下文。对不同领域和提供商的 KG 进行对接是必要的,以便获得更加全面和集成的表示。当前的 KG 对应 (KGA) 算法有一定的限制,它们无法体现逻辑思维和语言、结构和 semantics 数据学习的相互作用。深入学习模型在 KGA 方面具有良好的表现,但它们受到解释性、逻辑和数据效率的限制。混合 neuralsymbolic 学习模型可以结合逻辑和数据视角,生成高质量的对接,同时可以提供可解释的结果和人类中心的验证方法。本文将对当前 KGA 领域的状况进行检查,并探讨将这两个领域结合在一起的潜在研究方向。

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

  • paper_url: http://arxiv.org/abs/2310.07403
  • repo_url: https://github.com/ictnlp/daspeech
  • paper_authors: Qingkai Fang, Yan Zhou, Yang Feng
    for:* DASpeech is a non-autoregressive direct speech-to-speech translation model that aims to achieve both high-quality translations and fast decoding speeds.methods:* DASpeech uses a two-pass architecture to decompose the generation process into two steps: a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder.* DA-Transformer models translations with a directed acyclic graph (DAG), and dynamic programming is used to calculate the expected hidden states for each target token during training.results:* DASpeech achieves comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while preserving up to 18.53x speedup compared to the autoregressive baseline.* DASpeech shows significant improvements in both translation quality and decoding speed compared to previous non-autoregressive S2ST models, without relying on knowledge distillation or iterative decoding.* DASpeech preserves the speaker’s voice of the source speech during translation.Here is the format you requested:for: direkt S2ST模型可以实现高质量翻译和快速解码methods: two-pass架构,首先使用语言解码器生成目标文本,然后使用听音解码器基于语言解码器的隐藏状态生成目标语音results: comparable或更好的性能,与当前状态OF-the-art S2ST模型Translatotron 2相当,同时保持18.53倍的速度提升相比于核心基eline。
    Abstract Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To consider all potential paths in the DAG during training, we calculate the expected hidden states for each target token via dynamic programming, and feed them into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take hidden states on that path as input to the acoustic decoder. Experiments on the CVSS Fr-En benchmark demonstrate that DASpeech can achieve comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while preserving up to 18.53x speedup compared to the autoregressive baseline. Compared with the previous non-autoregressive S2ST model, DASpeech does not rely on knowledge distillation and iterative decoding, achieving significant improvements in both translation quality and decoding speed. Furthermore, DASpeech shows the ability to preserve the speaker's voice of the source speech during translation.
    摘要 直接Speech-to-Speech翻译(S2ST)模型可以将一种语言的语音翻译成另一种语言。然而,由于语言和听音多样性,目标语音表现出复杂的多Modal分布,这会对S2ST模型的高质量翻译和快速解码速度带来挑战。在这篇论文中,我们提出了DASpeech模型,这是一种非autoregressive的直接S2ST模型,可以同时实现高质量翻译和快速解码速度。为了更好地捕捉目标语音的复杂分布,DASpeech采用了两个过程来分解生成过程,其中一个是语言解码器,它首先生成目标文本;另一个是听音解码器,它根据语言解码器生成的隐藏状态来生成目标听音spectrogram。我们使用DA-Transformer模型的解码器作为语言解码器,并使用FastSpeech 2模型作为听音解码器。在训练时,我们使用DAG模型来表示翻译,并通过动态计算隐藏状态来考虑所有可能的路径。在推理时,我们选择最有可能性的路径,并将隐藏状态feed into听音解码器来预测目标听音spectrogram。实验结果表明,DASpeech可以与状态艺术S2ST模型Translatotron 2进行比较,同时保持18.53倍的速度提升。与之前的非autoregressive S2ST模型相比,DASpeech不需要知识储存和迭代解码,它可以实现显著的改进 both translation quality和解码速度。此外,DASpeech还可以保持源语音的 speaker voice。

NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time Series Pretraining

  • paper_url: http://arxiv.org/abs/2310.07402
  • repo_url: None
  • paper_authors: Chenguo Lin, Xumeng Wen, Wei Cao, Congrui Huang, Jiang Bian, Stephen Lin, Zhirong Wu
  • for: 学习时序数据的 semantic 表示。
  • methods: 采用 Transformer 架构,首先将输入分割成不重叠的窗口,然后对每个窗口进行 numerically multi-scaled embedding。
  • results: 在多个uniivariate和multivariate classification benchmark上显示出惊人的改进,并在非学习基础方法中Establish新的状态码。
    Abstract Recent research on time-series self-supervised models shows great promise in learning semantic representations. However, it has been limited to small-scale datasets, e.g., thousands of temporal sequences. In this work, we make key technical contributions that are tailored to the numerical properties of time-series data and allow the model to scale to large datasets, e.g., millions of temporal sequences. We adopt the Transformer architecture by first partitioning the input into non-overlapping windows. Each window is then characterized by its normalized shape and two scalar values denoting the mean and standard deviation within each window. To embed scalar values that may possess arbitrary numerical scales to high-dimensional vectors, we propose a numerically multi-scaled embedding module enumerating all possible scales for the scalar values. The model undergoes pretraining using the proposed numerically multi-scaled embedding with a simple contrastive objective on a large-scale dataset containing over a million sequences. We study its transfer performance on a number of univariate and multivariate classification benchmarks. Our method exhibits remarkable improvement against previous representation learning approaches and establishes the new state of the art, even compared with domain-specific non-learning-based methods.
    摘要 We adopt the Transformer architecture by first partitioning the input into non-overlapping windows. Each window is then characterized by its normalized shape and two scalar values denoting the mean and standard deviation within each window. To embed scalar values that may possess arbitrary numerical scales into high-dimensional vectors, we propose a numerically multi-scaled embedding module that enumerates all possible scales for the scalar values.The model undergoes pretraining using the proposed numerically multi-scaled embedding with a simple contrastive objective on a large-scale dataset containing over a million sequences. We study its transfer performance on a number of univariate and multivariate classification benchmarks. Our method exhibits remarkable improvement compared to previous representation learning approaches and establishes a new state of the art, even compared with domain-specific non-learning-based methods.

Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation

  • paper_url: http://arxiv.org/abs/2310.07397
  • repo_url: https://github.com/iwangjian/topdial
  • paper_authors: Jian Wang, Yi Cheng, Dongding Lin, Chak Tou Leong, Wenjie Li
    for: 这 paper 是研究 targets-oriented dialogue systems 的,它们可以帮助对话系统更好地驱动对话向 predetermined targets 或达到系统方面的目标。methods: 这 paper 使用了一种新的方法,即在 <dialogue act, topic> 对应的对话中进行个性化目标途径。它还提出了一种自动 dataset curation 框架,使用 role-playing 方法来生成大规模的个性化目标对话数据集。results: 这 paper 通过实验表明,这些个性化目标对话数据集具有高质量,可以用于探索个性化目标对话。
    Abstract Target-oriented dialogue systems, designed to proactively steer conversations toward predefined targets or accomplish specific system-side goals, are an exciting area in conversational AI. In this work, by formulating a pair as the conversation target, we explore a novel problem of personalized target-oriented dialogue by considering personalization during the target accomplishment process. However, there remains an emergent need for high-quality datasets, and building one from scratch requires tremendous human effort. To address this, we propose an automatic dataset curation framework using a role-playing approach. Based on this framework, we construct a large-scale personalized target-oriented dialogue dataset, TopDial, which comprises about 18K multi-turn dialogues. The experimental results show that this dataset is of high quality and could contribute to exploring personalized target-oriented dialogue.
    摘要 目标导向对话系统,旨在主动导引对话向预定的目标或完成系统侧的目标,是现代对话智能领域的一个有趣领域。在这种工作中,我们通过формализова对话行为和话题的对应关系,探索一种个性化目标导向对话的问题。然而,有一定的需求升级高质量的数据集,建立自身的数据集需要巨大的人工劳动。为解决这个问题,我们提出了一种自动数据集筛选框架,基于角色扮演的方法。通过这种框架,我们建立了一个大规模的个性化目标导向对话数据集,TopDial,包含约18K多轮对话。实验结果表明,这个数据集具有高质量,可以探索个性化目标导向对话。

Learning a Reward Function for User-Preferred Appliance Scheduling

  • paper_url: http://arxiv.org/abs/2310.07389
  • repo_url: https://github.com/nikskiks/learning-reward-function-demand-response
  • paper_authors: Nikolina Čović, Jochen Cremer, Hrvoje Pandžić
  • for: 降低电力 секто�的碳排放,需要加快住宅部门的需给回应服务提供的推展。
  • methods: 使用反馈学习算法,将住宅用户的过往消耗数据变数为创建每天家用电器运行时间表。
  • results: 透过不询问住宅用户的需求和愿望,将住宅用户内在地参与服务设计和决策过程,并透过金钱或环保动机来鼓励住宅用户继续参与需给回应服务。
    Abstract Accelerated development of demand response service provision by the residential sector is crucial for reducing carbon-emissions in the power sector. Along with the infrastructure advancement, encouraging the end users to participate is crucial. End users highly value their privacy and control, and want to be included in the service design and decision-making process when creating the daily appliance operation schedules. Furthermore, unless they are financially or environmentally motivated, they are generally not prepared to sacrifice their comfort to help balance the power system. In this paper, we present an inverse-reinforcement-learning-based model that helps create the end users' daily appliance schedules without asking them to explicitly state their needs and wishes. By using their past consumption data, the end consumers will implicitly participate in the creation of those decisions and will thus be motivated to continue participating in the provision of demand response services.
    摘要 加速了住宅部分的需求应答服务提供的发展是减少能源部门碳排放的关键。同时,激励终端用户参与是关键。终端用户强烈关注隐私和控制,希望在日常家用电器运行时间的设计和决策过程中被包括。此外,如果他们没有经济或环境上的驱动力,他们通常不愿意为了帮助平衡能源系统而做出牺牲。本文提出了一种基于逆激励学习的模型,可以无需直接询问终端用户需求和愿望,通过使用其过去的消耗数据,使终端用户在创造这些决策过程中implicitly参与,从而被激励继续参与提供需求应答服务。

Histopathological Image Classification and Vulnerability Analysis using Federated Learning

  • paper_url: http://arxiv.org/abs/2310.07380
  • repo_url: None
  • paper_authors: Sankalp Vyas, Amar Nath Patra, Raj Mani Shukla
  • for: 革新健康预测技术,保护用户隐私
  • methods: 联邦学习(Federated Learning)技术,对于肉眼皮癌资料集进行预测
  • results: 发现联邦学习容易受到数据毒素攻击,影响模型的精度。透过测试实验,发现当数据毒素比例增加时,模型的精度会显著下降。
    Abstract Healthcare is one of the foremost applications of machine learning (ML). Traditionally, ML models are trained by central servers, which aggregate data from various distributed devices to forecast the results for newly generated data. This is a major concern as models can access sensitive user information, which raises privacy concerns. A federated learning (FL) approach can help address this issue: A global model sends its copy to all clients who train these copies, and the clients send the updates (weights) back to it. Over time, the global model improves and becomes more accurate. Data privacy is protected during training, as it is conducted locally on the clients' devices. However, the global model is susceptible to data poisoning. We develop a privacy-preserving FL technique for a skin cancer dataset and show that the model is prone to data poisoning attacks. Ten clients train the model, but one of them intentionally introduces flipped labels as an attack. This reduces the accuracy of the global model. As the percentage of label flipping increases, there is a noticeable decrease in accuracy. We use a stochastic gradient descent optimization algorithm to find the most optimal accuracy for the model. Although FL can protect user privacy for healthcare diagnostics, it is also vulnerable to data poisoning, which must be addressed.
    摘要 医疗是机器学习(ML)的一个重要应用领域。传统上,ML模型通常由中央服务器进行训练,该服务器将来自不同分布式设备的数据集成以预测新生成的数据结果。这会导致隐私披露问题,因为模型可以访问敏感用户信息。一种联邦学习(FL)方法可以解决这个问题:全球模型将其 копию传递给所有客户端,客户端将其更新( weights)回传给它。随着全球模型的改进,其精度会逐渐提高。在训练过程中,数据隐私得到保护,因为训练在客户端上进行。然而,全球模型受到数据毒品攻击的威胁。我们开发了一种隐私保护的FL技术,并在皮肤癌数据集上进行了实验。发现,当一个客户端意外地将标签反转为攻击时,全球模型的准确率会下降。随着标签反转的百分比增加,全球模型的准确率会显著下降。我们使用某种随机梯度下降优化算法来找到最佳准确率。虽然FL可以保护用户隐私 для医疗诊断,但也容易受到数据毒品攻击,这些问题需要解决。

Causal Unsupervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.07379
  • repo_url: https://github.com/byungkwanlee/causal-unsupervised-segmentation
  • paper_authors: Junho Kim, Byung-Kwan Lee, Yong Man Ro
  • for: 这个论文的目的是提出一种新的无监督 semantic segmentation 方法,以实现高质量的semantic grouping无需人工标注。
  • methods: 该方法利用自动预训练的特征进行train prediction heads,并采用 causal inference 的思想来定义适当的 clustering 级别。
  • results: 通过大量实验和分析,该方法在不同的数据集上达到了无监督 semantic segmentation 的州OF-THE-ART表现。
    Abstract Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations. With the advent of self-supervised pre-training, various frameworks utilize the pre-trained features to train prediction heads for unsupervised dense prediction. However, a significant challenge in this unsupervised setup is determining the appropriate level of clustering required for segmenting concepts. To address it, we propose a novel framework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages insights from causal inference. Specifically, we bridge intervention-oriented approach (i.e., frontdoor adjustment) to define suitable two-step tasks for unsupervised prediction. The first step involves constructing a concept clusterbook as a mediator, which represents possible concept prototypes at different levels of granularity in a discretized form. Then, the mediator establishes an explicit link to the subsequent concept-wise self-supervised learning for pixel-level grouping. Through extensive experiments and analyses on various datasets, we corroborate the effectiveness of CAUSE and achieve state-of-the-art performance in unsupervised semantic segmentation.
    摘要 自动 semantic segmentation 的目标是实现高质量的 semantic grouping 无需人工标注。随着自我超级预训练的出现,各种框架通过预训练特征来训练预测头 для无监督的稠密预测。然而,在这种无监督设置中,确定适当的归类水平是一个重要挑战。为解决这个问题,我们提出了一种新的框架,名为 CAusal Unsupervised Semantic sEgmentation (CAUSE),它利用 causal inference 的启示。具体来说,我们将 intervention-oriented approach (i.e., frontdoor adjustment) 引入,定义适当的两步任务 для无监督预测。第一步是构建 concept clusterbook 作为中间变量,它表示不同粒度的概念原型的可能性。然后, mediator 建立了Explicit链接,使得后续的概念wise self-supervised learning 可以帮助像素级别的归类。通过对多个数据集的广泛实验和分析,我们证明 CAUSE 的效果,并实现了无监督 semantic segmentation 的州际性能。

Point Cloud Denoising and Outlier Detection with Local Geometric Structure by Dynamic Graph CNN

  • paper_url: http://arxiv.org/abs/2310.07376
  • repo_url: None
  • paper_authors: Kosuke Nakayama, Hiroto Fukuta, Hiroshi Watanabe
  • for: 点云数据清洗和异常检测
  • methods: 应用两种基于动态图 convolutional layer的方法
  • results: 提出的方法在AUPR和Chamfer Distance上表现出色,比传统方法更高的异常检测精度和清洗精度
    Abstract The digitalization of society is rapidly developing toward the realization of the digital twin and metaverse. In particular, point clouds are attracting attention as a media format for 3D space. Point cloud data is contaminated with noise and outliers due to measurement errors. Therefore, denoising and outlier detection are necessary for point cloud processing. Among them, PointCleanNet is an effective method for point cloud denoising and outlier detection. However, it does not consider the local geometric structure of the patch. We solve this problem by applying two types of graph convolutional layer designed based on the Dynamic Graph CNN. Experimental results show that the proposed methods outperform the conventional method in AUPR, which indicates outlier detection accuracy, and Chamfer Distance, which indicates denoising accuracy.
    摘要 社会的数字化发展 rapidly towards the realization of digital twin和metaverse。特别是3D空间中的点云数据受到关注。由于测量错误导致的噪声和异常值,因此需要对点云数据进行噪声除除和异常值检测。其中,PointCleanNet是一种有效的点云噪声除除和异常值检测方法。但它不考虑点云的本地 геометрическую结构。我们解决这个问题,通过基于动态图 convolutional layer的两种类型应用。实验结果显示,我们提出的方法在AUPR和Chamfer Distance中表现出色,即异常检测精度和噪声除除精度。Here's the breakdown of the translation:* 社会的数字化发展 (society's digital development) -> 社会的数字化发展 (society's digital development)* rapidly towards the realization of digital twin和metaverse (rapidly towards the realization of digital twin and metaverse) -> rapid towards the realization of digital twin和metaverse (rapid towards the realization of digital twin and metaverse)* 特别是3D空间中的点云数据 (point cloud data in 3D space) -> 特别是3D空间中的点云数据 (point cloud data in 3D space)* 受到关注 (attracting attention) -> 受到关注 (attracting attention)* 由于测量错误导致的噪声和异常值 (due to measurement errors and outliers) -> 由于测量错误导致的噪声和异常值 (due to measurement errors and outliers)* 因此需要对点云数据进行噪声除除和异常值检测 (therefore, need to denoise and detect outliers for point cloud data) -> 因此需要对点云数据进行噪声除除和异常值检测 (therefore, need to denoise and detect outliers for point cloud data)* 其中,PointCleanNet是一种有效的点云噪声除除和异常值检测方法 (PointCleanNet is an effective method for point cloud denoising and outlier detection) -> 其中,PointCleanNet是一种有效的点云噪声除除和异常值检测方法 (PointCleanNet is an effective method for point cloud denoising and outlier detection)* 但它不考虑点云的本地 геометрическую结构 (but it does not consider the local geometric structure of the point cloud) -> 但它不考虑点云的本地 геометрическую结构 (but it does not consider the local geometric structure of the point cloud)* 我们解决这个问题,通过基于动态图 convolutional layer的两种类型应用 (we solve this problem by applying two types of graph convolutional layers based on dynamic graphs) -> 我们解决这个问题,通过基于动态图 convolutional layer的两种类型应用 (we solve this problem by applying two types of graph convolutional layers based on dynamic graphs)* 实验结果显示,我们提出的方法在AUPR和Chamfer Distance中表现出色 (experimental results show that our method outperforms the conventional method in AUPR and Chamfer Distance) -> 实验结果显示,我们提出的方法在AUPR和Chamfer Distance中表现出色 (experimental results show that our method outperforms the conventional method in AUPR and Chamfer Distance)

Give and Take: Federated Transfer Learning for Industrial IoT Network Intrusion Detection

  • paper_url: http://arxiv.org/abs/2310.07354
  • repo_url: None
  • paper_authors: Lochana Telugu Rajesh, Tapadhir Das, Raj Mani Shukla, Shamik Sengupta
  • for: 这篇论文旨在提出一种联边学习(Federated Transfer Learning,FTL)方法,用于防护互联网络(IIoT)中的网络入侵攻击。
  • methods: 本论文提出了一个搭配式神经网络(Combinational Neural Network,CNN),用于实现FTL的中心部分。在这篇论文中,我们将IIoT数据分成客户端和服务器端两部分,然后使用客户端模型的weight更新服务器模型。
  • results: 本论文的实验结果显示,FTL设置在IIoT客户端和服务器之间的 Iterations 具有高性能,并且比现有的机器学习算法在网络入侵攻击预测方面表现更好。
    Abstract The rapid growth in Internet of Things (IoT) technology has become an integral part of today's industries forming the Industrial IoT (IIoT) initiative, where industries are leveraging IoT to improve communication and connectivity via emerging solutions like data analytics and cloud computing. Unfortunately, the rapid use of IoT has made it an attractive target for cybercriminals. Therefore, protecting these systems is of utmost importance. In this paper, we propose a federated transfer learning (FTL) approach to perform IIoT network intrusion detection. As part of the research, we also propose a combinational neural network as the centerpiece for performing FTL. The proposed technique splits IoT data between the client and server devices to generate corresponding models, and the weights of the client models are combined to update the server model. Results showcase high performance for the FTL setup between iterations on both the IIoT clients and the server. Additionally, the proposed FTL setup achieves better overall performance than contemporary machine learning algorithms at performing network intrusion detection.
    摘要 “现代互联网络设备(IoT)技术的快速发展已成为今天的行业核心,组成了企业级互联网络(IIoT)计划,where industries are leveraging IoT to improve communication and connectivity via emerging solutions like data analytics and cloud computing. Unfortunately, the rapid use of IoT has made it an attractive target for cybercriminals. Therefore, protecting these systems is of utmost importance. In this paper, we propose a federated transfer learning (FTL) approach to perform IIoT network intrusion detection. As part of the research, we also propose a combinational neural network as the centerpiece for performing FTL. The proposed technique splits IoT data between the client and server devices to generate corresponding models, and the weights of the client models are combined to update the server model. Results showcase high performance for the FTL setup between iterations on both the IIoT clients and the server. Additionally, the proposed FTL setup achieves better overall performance than contemporary machine learning algorithms at performing network intrusion detection.”Note: Please note that the translation is in Simplified Chinese, which is one of the two standard Chinese writing systems.

Semantic Association Rule Learning from Time Series Data and Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2310.07348
  • repo_url: None
  • paper_authors: Erkan Karabulut, Victoria Degeler, Paul Groth
  • for: 这篇论文的目的是为了提出一个基于知识 graphs 和时间序列数据的 semantic association rule 学习管道,以及一个新的 semantic association rule 标准。
  • methods: 这篇论文使用了知识 graphs 和时间序列数据来学习 semantic association rules,并提出了一种新的 semantic association rule 标准。
  • results: 实验结果表明,提出的方法可以学习出具有 semantic information 的大量 association rules,这些规则更加普适。
    Abstract Digital Twins (DT) are a promising concept in cyber-physical systems research due to their advanced features including monitoring and automated reasoning. Semantic technologies such as Knowledge Graphs (KG) are recently being utilized in DTs especially for information modelling. Building on this move, this paper proposes a pipeline for semantic association rule learning in DTs using KGs and time series data. In addition to this initial pipeline, we also propose new semantic association rule criterion. The approach is evaluated on an industrial water network scenario. Initial evaluation shows that the proposed approach is able to learn a high number of association rules with semantic information which are more generalizable. The paper aims to set a foundation for further work on using semantic association rule learning especially in the context of industrial applications.
    摘要 数字双胞(DT)是现代物联网系统研究中的一个抢险概念,它具有监测和自动推理等高级功能。 semantic技术如知识图(KG)在DT中特别地应用于信息模型化。基于这种趋势,本文提出了基于知识图和时间序列数据的semantic association rule学习管道。此外,我们还提出了一新的semantic association rule标准。我们通过对工业水网enario进行初步评估,发现提出的方法能够学习大量具有Semantic信息的相关规则,这些规则更加通用。本文的目的是为将来的Semantic association rule学习研究提供基础,特别是在工业应用中。

Fast-ELECTRA for Efficient Pre-training

  • paper_url: http://arxiv.org/abs/2310.07347
  • repo_url: None
  • paper_authors: Chengyu Dong, Liyuan Liu, Hao Cheng, Jingbo Shang, Jianfeng Gao, Xiaodong Liu
  • for: 本文针对ELECTRA预训方法进行改进,以提高效率和稳定性。
  • methods: 本文使用现有语言模型作为助动模型,并透过温度调整的渐减 schedule 建立学习课程,以帮助主模型提高表现。
  • results: 本研究显示,使用现有语言模型作为助动模型可以帮助提高ELECTRA的效率和稳定性,并且与现有state-of-the-art ELECTRA-style预训方法相当。
    Abstract ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model. Although ELECTRA offers a significant boost in efficiency, its potential is constrained by the training cost brought by the auxiliary model. Notably, this model, which is jointly trained with the main model, only serves to assist the training of the main model and is discarded post-training. This results in a substantial amount of training cost being expended in vain. To mitigate this issue, we propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model. To construct a learning curriculum for the main model, we smooth its output distribution via temperature scaling following a descending schedule. Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods, while significantly eliminating the computation and memory cost brought by the joint training of the auxiliary model. Our method also reduces the sensitivity to hyper-parameters and enhances the pre-training stability.
    摘要 ELECTRA 使用替换token的模型进行预训练,以提高效率。然而,ELECTRA 的潜力受到助手模型的训练成本的限制。具体来说,这个模型只是用于帮助主模型训练,并且在训练结束后被抛弃。这会导致大量的训练成本浪费。为了解决这个问题,我们提出了 Fast-ELECTRA,它利用现有的语言模型作为助手模型。我们使用温度层 scaling 将主模型的输出分布平滑化,以建立一个学习课程 для主模型。我们的方法可以与现有的状态态-of-the-art ELECTRA 预训练方法相比,同时减少计算和内存成本,以及敏感度到hyperparameter和预训练稳定性。

Exploring Social Motion Latent Space and Human Awareness for Effective Robot Navigation in Crowded Environments

  • paper_url: http://arxiv.org/abs/2310.07335
  • repo_url: None
  • paper_authors: Junaid Ahmed Ansari, Satyajit Tourani, Gourav Kumar, Brojeshwar Bhowmick
  • for: 本研究提出了一种基于社交动作潜在空间学习的社交机器人导航方法,以提高社交导航指标(如成功率、导航时间、轨迹长度),并生成更平滑、更预测性的轨迹。
  • methods: 该方法利用社交动作潜在空间来生成机器人控制,并通过对比基线模型而证明其超越性。
  • results: 研究表明,包含人类意识的社交机器人导航框架可以生成更短、更平滑的轨迹,是因为人类可以正面与机器人互动。
    Abstract This work proposes a novel approach to social robot navigation by learning to generate robot controls from a social motion latent space. By leveraging this social motion latent space, the proposed method achieves significant improvements in social navigation metrics such as success rate, navigation time, and trajectory length while producing smoother (less jerk and angular deviations) and more anticipatory trajectories. The superiority of the proposed method is demonstrated through comparison with baseline models in various scenarios. Additionally, the concept of humans' awareness towards the robot is introduced into the social robot navigation framework, showing that incorporating human awareness leads to shorter and smoother trajectories owing to humans' ability to positively interact with the robot.
    摘要 这个工作提出了一种新的社交机器人导航方法,通过学习生成机器人控制从社交动作潜在空间中生成机器人控制。通过利用这个社交动作潜在空间,提议方法可以获得 significatively 提高社交导航指标,如成功率、导航时间和轨迹长度,同时生成更平滑(具有 menos jerk 和 angular deviation)的轨迹。在不同的场景下,相比基eline 模型,提议方法的优势得到了证明。此外,在社交机器人导航框架中引入人类意识的概念,表明在机器人与人类之间的交互中,机器人可以更短更平滑的轨迹。

An Empirical Study of Instruction-tuning Large Language Models in Chinese

  • paper_url: http://arxiv.org/abs/2310.07328
  • repo_url: https://github.com/phoebussi/alpaca-cot
  • paper_authors: Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, Weiping Wang
  • for: 这 paper 旨在对中文大语言模型(LLMs)进行深入的实验研究,以便更好地适应中文指令。
  • methods: 本 paper 使用了三个关键元素进行实验研究:LLM bases、参数效率方法和指令数据类型。此外,它还研究了其他因素,如 chain-of-thought 数据和人类价值对Alignment。
  • results: 本 paper 的实验结果表明,通过对 LLM bases、参数效率方法和指令数据类型进行调整,可以实现更好地适应中文指令的中文 LLMS。 Code 和数据可以在 https://github.com/PhoebusSi/Alpaca-CoT 上获取。
    Abstract The success of ChatGPT validates the potential of large language models (LLMs) in artificial general intelligence (AGI). Subsequently, the release of LLMs has sparked the open-source community's interest in instruction-tuning, which is deemed to accelerate ChatGPT's replication process. However, research on instruction-tuning LLMs in Chinese, the world's most spoken language, is still in its early stages. Therefore, this paper makes an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook that provides valuable findings for effectively customizing LLMs that can better respond to Chinese instructions. Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, instruction data types, which are the three most important elements for instruction-tuning. Besides, we also conduct experiment to study the impact of other factors, e.g., chain-of-thought data and human-value alignment. We hope that this empirical study can make a modest contribution to the open Chinese version of ChatGPT. This paper will release a powerful Chinese LLMs that is comparable to ChatGLM. The code and data are available at https://github.com/PhoebusSi/Alpaca-CoT.
    摘要 成功的ChatGPT证明大语言模型(LLM)在人工通用智能(AGI)中的潜力。随后,LLM的发布激发了开源社区对于教程调整的兴趣,这被认为可以加速ChatGPT的复制过程。然而,对于中文的LLM教程调整研究仍处于初期阶段。因此,本文进行了深入的实验研究,以提供有价值的发现,可以帮助更好地调整LLM,以便更好地回应中文指令。具体来说,我们系统地探讨LLM基础、参数效率方法和指令数据类型等三个关键元素的影响。此外,我们还进行了实验研究其他因素,如链条数据和人类价值对alignment。我们希望这个实验研究可以为开放的中文版ChatGPT提供一个有价值的参考,并释放一个与ChatGPT相当的中文LLM。代码和数据可以在https://github.com/PhoebusSi/Alpaca-CoT中找到。

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

  • paper_url: http://arxiv.org/abs/2310.07325
  • repo_url: None
  • paper_authors: James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak
  • for: 该研究探讨了一种4层转换器的内存管理问题,并提供了具体的证据。
  • methods: 研究使用了直观逻辑权重分析技术来分析模型的输出。
  • results: 研究发现,直观逻辑权重分析技术可能提供不准确的结果,因为它不考虑模型中的干净行为。
    Abstract We provide concrete evidence for memory management in a 4-layer transformer. Specifically, we identify clean-up behavior, in which model components consistently remove the output of preceeding components during a forward pass. Our findings suggest that the interpretability technique Direct Logit Attribution provides misleading results. We show explicit examples where this technique is inaccurate, as it does not account for clean-up behavior.
    摘要 我们提供具体证据对 transformer Memory Management 的研究。具体来说,我们发现了“清理”行为,即模型组件在前进通道中一致地 removes 前一个组件的输出。我们的发现表明,使用 Direct Logit Attribution interpretability 技术可能会得到错误的结果。我们提供了明确的例子,证明这种技术无法考虑清理行为。

On the Impact of Cross-Domain Data on German Language Models

  • paper_url: http://arxiv.org/abs/2310.07321
  • repo_url: None
  • paper_authors: Amin Dada, Aokun Chen, Cheng Peng, Kaleb E Smith, Ahmad Idrissi-Yaghir, Constantin Marc Seibold, Jianning Li, Lars Heiliger, Xi Yang, Christoph M. Friedrich, Daniel Truhn, Jan Egger, Jiang Bian, Jens Kleesiek, Yonghui Wu
  • for: 本研究目的是探讨数据多样性对大语言模型的影响,以及高质量数据是否能够超越多样性的效果。
  • methods: 研究者使用了五个不同领域的文本数据集,并对这些数据集进行了归一化和分词处理。然后,他们在这些数据集上训练了一系列大语言模型,并对这些模型进行了多个下游任务的benchmark测试。
  • results: 研究者发现,训练在多样性数据集上的模型可以与高质量数据集上的模型相比,在多个下游任务上表现出较好的性能,并且可以提高过去最佳性能的4.45%。
    Abstract Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to $4.45\%$ over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen
    摘要 传统上,大型语言模型通常是通过全网爬虫或域专业数据进行训练。然而,最近的生成大型语言模型的成功,抛光了跨领域数据的优势。为了评估数据多样性的重要性,我们提供了一个德国 dataset,包含五个领域的文本,以及另一个具有高质量数据的 dataset。通过在两个 dataset 上训练一系列模型,从 122M 到 750M 参数的模型,我们进行了多个下游任务的完整性测试。我们的发现表明,跨领域 dataset 上训练的模型比单一质量数据alone 训练的模型提高了 $4.45\%$,超过了之前的状态调。模型可以在 https://huggingface.co/ikim-uk-essen 上获取。

WiGenAI: The Symphony of Wireless and Generative AI via Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.07312
  • repo_url: None
  • paper_authors: Mehdi Letafati, Samad Ali, Matti Latva-aho
  • for: 这 paper 旨在探讨生成AI在无线通信系统中的应用,以铺垫未来的研究。
  • methods: 这 paper 使用了Diffusion-based生成模型,这种新的状态态模型在生成模型中具有最新的状态。
  • results: 这 paper 通过两个案例研究,提出了一种使用Diffusion模型提高无线通信系统的Bit Error Rate的方法,并且在不理想的接收器情况下实现了30%的提高。
    Abstract Innovative foundation models, such as GPT-3 and stable diffusion models, have made a paradigm shift in the realm of artificial intelligence (AI) towards generative AI-based systems. In unison, from data communication and networking perspective, AI and machine learning (AI/ML) algorithms are envisioned to be pervasively incorporated into the future generations of wireless communications systems, highlighting the need for novel AI-native solutions for the emergent communication scenarios. In this article, we outline the applications of generative AI in wireless communication systems to lay the foundations for research in this field. Diffusion-based generative models, as the new state-of-the-art paradigm of generative models, are introduced, and their applications in wireless communication systems are discussed. Two case studies are also presented to showcase how diffusion models can be exploited for the development of resilient AI-native communication systems. Specifically, we propose denoising diffusion probabilistic models (DDPM) for a wireless communication scheme with non-ideal transceivers, where 30% improvement is achieved in terms of bit error rate. As the second application, DDPMs are employed at the transmitter to shape the constellation symbols, highlighting a robust out-of-distribution performance. Finally, future directions and open issues for the development of generative AI-based wireless systems are discussed to promote future research endeavors towards wireless generative AI (WiGenAI).
    摘要 创新基础模型,如GPT-3和稳定扩散模型,在人工智能(AI)领域引入了一个新的思维方式,推动了基于生成AI的系统的发展。在数据通信和网络方面,AI/ML算法将在未来的无线通信系统中普遍应用,需要新的AINative解决方案来应对新兴的通信场景。本文介绍了生成AI在无线通信系统中的应用,为这一领域的研究奠基。 diffusion基础生成模型被介绍为新的生成模型的状态艺术,其应用在无线通信系统中被讨论。两个案例研究如何使用扩散模型来开发鲁棒的AINative通信系统。首先,我们提出了噪声扩散概率模型(DDPM),用于一种无线通信方案中的非理想接收器,其中提高了比特错误率30%。其次,DDPM被用于发送器来修饰 konstellation 符号,并证明了在异常输出情况下的稳定性。最后,我们讨论了未来发展和开放问题,以促进未来的无线生成AI(WiGenAI)研究。

RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation

  • paper_url: http://arxiv.org/abs/2310.07299
  • repo_url: https://github.com/hillzhang1999/robustgec
  • paper_authors: Yue Zhang, Leyang Cui, Enbo Zhao, Wei Bi, Shuming Shi
    for: 这种论文的目的是为了评估语言修正系统的上下文稳定性。methods: 该论文使用了5000个语言修正例子,每个例子包括一个原始错误语句和五个人工标注的变体。results: 研究发现当前的语言修正系统仍然无法快速响应上下文变化,而提议的简单 yet effective 方法可以有效解决这个问题。
    Abstract Grammatical Error Correction (GEC) systems play a vital role in assisting people with their daily writing tasks. However, users may sometimes come across a GEC system that initially performs well but fails to correct errors when the inputs are slightly modified. To ensure an ideal user experience, a reliable GEC system should have the ability to provide consistent and accurate suggestions when encountering irrelevant context perturbations, which we refer to as context robustness. In this paper, we introduce RobustGEC, a benchmark designed to evaluate the context robustness of GEC systems. RobustGEC comprises 5,000 GEC cases, each with one original error-correct sentence pair and five variants carefully devised by human annotators. Utilizing RobustGEC, we reveal that state-of-the-art GEC systems still lack sufficient robustness against context perturbations. In addition, we propose a simple yet effective method for remitting this issue.
    摘要 grammatical error correction (GEC) 系统在日常写作任务中扮演着重要的角色。然而,用户可能会在使用 GEC 系统时发现,当输入有所修改时,GEC 系统可能会在初始化时表现良好,但在修改后仍然无法正确地更正错误。为确保理想的用户体验,一个可靠的 GEC 系统应该有能力在不相关的上下文干扰下提供一致和准确的建议。在这篇论文中,我们介绍了 RobustGEC,一个用于评估 GEC 系统的上下文稳定性的库。RobustGEC 包含 5,000 个 GEC 案例,每个案例包含一对原始错误 corrected 句子 pair 和五个由人类标注员所设计的修改案例。利用 RobustGEC,我们发现现有的 GEC 系统仍然缺乏对上下文干扰的抗衡能力。此外,我们也提出了一个简单 yet 有效的方法来解决这个问题。

Beyond Memorization: Violating Privacy Via Inference with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07298
  • repo_url: None
  • paper_authors: Robin Staab, Mark Vero, Mislav Balunović, Martin Vechev
  • for: 这个研究的目的是研究现有大语言模型(LLM)是否可以通过文本内容来推断个人特征。
  • methods: 研究使用了现有的LLM,并构建了一个基于Reddit Profilese的数据集,以测试LLM的推断能力。
  • results: 研究发现,现有的LLM可以准确地推断个人特征,例如地点、收入和性别,并且可以在人工智能技术的一个分之一的成本和时间下达到人类水平。此外,研究还探讨了隐私泄露的风险,并发现现有的防御措施无法保护用户隐私。
    Abstract Current privacy research on large language models (LLMs) primarily focuses on the issue of extracting memorized training data. At the same time, models' inference capabilities have increased drastically. This raises the key question of whether current LLMs could violate individuals' privacy by inferring personal attributes from text given at inference time. In this work, we present the first comprehensive study on the capabilities of pretrained LLMs to infer personal attributes from text. We construct a dataset consisting of real Reddit profiles, and show that current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to $85\%$ top-1 and $95.8\%$ top-3 accuracy at a fraction of the cost ($100\times$) and time ($240\times$) required by humans. As people increasingly interact with LLM-powered chatbots across all aspects of life, we also explore the emerging threat of privacy-invasive chatbots trying to extract personal information through seemingly benign questions. Finally, we show that common mitigations, i.e., text anonymization and model alignment, are currently ineffective at protecting user privacy against LLM inference. Our findings highlight that current LLMs can infer personal data at a previously unattainable scale. In the absence of working defenses, we advocate for a broader discussion around LLM privacy implications beyond memorization, striving for a wider privacy protection.
    摘要 现有大量语言模型(LLM)隐私研究主要关注模型在训练数据中储存的问题。同时,模型的推理能力有了很大的提高。这提出了关键问题:现有LLM是否可以通过文本来推断个人特征?在这个工作中,我们提供了首次LLM在文本中推断个人特征的全面研究。我们构建了基于真实的Reddit Profilestext dataset,并显示了当前LLM可以通过文本推断各种个人特征(如地点、收入、性别),达到了人类的$85\%$ top-1和$95.8\%$ top-3准确率,而且只需要人类的$100\times$ 时间和$240\times$ 成本。随着人们在所有方面的生活中与LLM驱动的chatbot进行交互,我们也探讨了隐私泄露的emerging threat,即通过看起来普通的问题来提取个人信息。最后,我们发现现有的mitigationstrategies,如文本匿名和模型对齐,目前无法保护用户隐私。我们的发现表明当前LLM可以在前所未有的规模上进行个人数据推断。在没有工作的防御措施时,我们呼吁更广泛的隐私问题讨论,努力为更多的隐私保护。

Automated Verification of Equivalence Properties in Advanced Logic Programs – Bachelor Thesis

  • paper_url: http://arxiv.org/abs/2310.19806
  • repo_url: None
  • paper_authors: Jan Heuer
  • for: 这个论文的目的是提供一种自动化正式验证工具,用于替换原始子程序。
  • methods: 这个论文使用了 Anthem 翻译工具,以及一个自动证明工具 для классической逻辑,来验证两个程序的强等价性。
  • results: 该论文扩展了 Anthem 工具,使其能够验证包含卷积、否定和简单选择规则的逻辑程序的强等价性。新版本的 Anthem 还能够翻译这些程序到 классической逻辑中。
    Abstract With the increase in industrial applications using Answer Set Programming, the need for formal verification tools, particularly for critical applications, has also increased. During the program optimisation process, it would be desirable to have a tool which can automatically verify whether an optimised subprogram can replace the original subprogram. Formally this corresponds to the problem of verifying the strong equivalence of two programs. In order to do so, the translation tool anthem was developed. It can be used in conjunction with an automated theorem prover for classical logic to verify that two programs are strongly equivalent. With the current version of anthem, only the strong equivalence of positive programs with a restricted input language can be verified. This is a result of the translation $\tau^*$ implemented in anthem that produces formulas in the logic of here-and-there, which coincides with classical logic only for positive programs. This thesis extends anthem in order to overcome these limitations. First, the transformation $\sigma^*$ is presented, which transforms formulas from the logic of here-and-there to classical logic. A theorem formalises how $\sigma^*$ can be used to express equivalence in the logic of here-and-there in classical logic. Second, the translation $\tau^*$ is extended to programs containing pools. Another theorem shows how $\sigma^*$ can be combined with $\tau^*$ to express the strong equivalence of two programs in classical logic. With $\sigma^*$ and the extended $\tau^*$, it is possible to express the strong equivalence of logic programs containing negation, simple choices, and pools. Both the extended $\tau^*$ and $\sigma^*$ are implemented in a new version of anthem. Several examples of logic programs containing pools, negation, and simple choice rules, which the new version of anthem can translate to classical logic, are presented. Some a...
    摘要 随着应用 Answer Set Programming 的增加,对于重要应用程序的正式验证工具的需求也增加了。在程序优化过程中,您希望有一个工具可以自动验证优化后的子程序是否可以替换原始子程序。正式来说,这对应于两个程序的强等价性问题的验证。为此,anthem 工具被开发出来。它可以与自动证明工具结合,以验证两个程序的强等价性。anthem 的当前版本只能验证正确的程序的强等价性,这是因为 anthem 中的翻译 $\tau^*$ 仅能处理正确的程序。这个论文扩展 anthem,以解决这些限制。首先,我们提出了一种变换 $\sigma^*$,可以将 formulas 从 here-and-there 逻辑转换为类型逻辑。一个定理证明了 $\sigma^*$ 如何用于表达 here-and-there 逻辑中的等价性。其次,我们扩展了 $\tau^*$,以便处理包含池的程序。另一个定理证明了如何将 $\sigma^*$ 与 $\tau^*$ 结合使用,以表达两个程序的强等价性。通过 $\sigma^*$ 和扩展后的 $\tau^*$,我们可以表达包含否定、简单选择规则的逻辑程序的强等价性。这些扩展后的 $\sigma^*$ 和 $\tau^*$ 都被实现在新版本的 anthem 中。我们还提供了一些逻辑程序示例,包括包含池、否定和简单选择规则的程序,这些程序可以通过新版本的 anthem 转换为类型逻辑。

An Analysis on Large Language Models in Healthcare: A Case Study of BioBERT

  • paper_url: http://arxiv.org/abs/2310.07282
  • repo_url: None
  • paper_authors: Shyni Sharaf, V. S. Anoop
  • for: This paper explores the application of large language models, specifically BioBERT, in healthcare and its potential benefits for clinical decision support and information retrieval.
  • methods: The paper proposes a systematic methodology for fine-tuning BioBERT to meet the unique needs of the healthcare domain, including data gathering, annotation, and specialized preprocessing techniques.
  • results: The paper evaluates the performance of BioBERT in various healthcare-related tasks, such as medical entity recognition and question-answering, and explores techniques to improve the model’s interpretability. It also acknowledges the ethical considerations and challenges of integrating BioBERT into healthcare contexts.
    Abstract This paper conducts a comprehensive investigation into applying large language models, particularly on BioBERT, in healthcare. It begins with thoroughly examining previous natural language processing (NLP) approaches in healthcare, shedding light on the limitations and challenges these methods face. Following that, this research explores the path that led to the incorporation of BioBERT into healthcare applications, highlighting its suitability for addressing the specific requirements of tasks related to biomedical text mining. The analysis outlines a systematic methodology for fine-tuning BioBERT to meet the unique needs of the healthcare domain. This approach includes various components, including the gathering of data from a wide range of healthcare sources, data annotation for tasks like identifying medical entities and categorizing them, and the application of specialized preprocessing techniques tailored to handle the complexities found in biomedical texts. Additionally, the paper covers aspects related to model evaluation, with a focus on healthcare benchmarks and functions like processing of natural language in biomedical, question-answering, clinical document classification, and medical entity recognition. It explores techniques to improve the model's interpretability and validates its performance compared to existing healthcare-focused language models. The paper thoroughly examines ethical considerations, particularly patient privacy and data security. It highlights the benefits of incorporating BioBERT into healthcare contexts, including enhanced clinical decision support and more efficient information retrieval. Nevertheless, it acknowledges the impediments and complexities of this integration, encompassing concerns regarding data privacy, transparency, resource-intensive requirements, and the necessity for model customization to align with diverse healthcare domains.
    摘要

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

  • paper_url: http://arxiv.org/abs/2310.07276
  • repo_url: https://github.com/QizhiPei/BioT5
  • paper_authors: Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan
  • For: The paper aims to enhance drug discovery by integrating molecules, proteins, and natural language.* Methods: The proposed method, BioT5, uses SELFIES to generate robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. It also distinguishes between structured and unstructured knowledge.* Results: BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities.
    Abstract Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose $\mathbf{BioT5}$, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. $\mathbf{BioT5}$ utilizes SELFIES for $100%$ robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, $\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at $\href{https://github.com/QizhiPei/BioT5}{Github}$.
    摘要 最近的生物研究进步利用分子、蛋白质和自然语言的集成来提高药物发现。然而,当前的模型具有许多限制,如生成无效的分子SMILES、Contextual information的过度利用和结构化知识和未结构化知识的平等对待。为解决这些问题,我们提出了 $\mathbf{BioT5}$,一个全面预训练框架,用于增强生物学中的分子知识和自然语言关系。 $\mathbf{BioT5}$ 使用 SELFIES 确保 $100%$ 可靠的分子表示,从生物文献中的生物实体周围的上下文中提取知识,并在结构化和未结构化知识之间进行区分。这使得 $\mathbf{BioT5}$ 在许多任务上表现出色,表明它能够捕捉生物实体下面的关系和性质。我们的代码可以在 $\href{https://github.com/QizhiPei/BioT5}{Github}$ 上找到。

CoPAL: Corrective Planning of Robot Actions with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07263
  • repo_url: None
  • paper_authors: Frank Joublin, Antonello Ceravola, Pavel Smirnov, Felix Ocker, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Stephan Hasler, Daniel Tanneberg, Michael Gienger
  • for: 这种研究旨在提高机器人完全自主系统的可行性,以替代人类执行任务。
  • methods: 这种研究使用了大语言模型应用于机器人任务和动作规划,提出了一种新的重启策略来处理物理、逻辑和 semantics 错误。
  • results: 经验证明,提出的反馈体系可以提高执行可能性、正确性和时间复杂度。
    Abstract In the pursuit of fully autonomous robotic systems capable of taking over tasks traditionally performed by humans, the complexity of open-world environments poses a considerable challenge. Addressing this imperative, this study contributes to the field of Large Language Models (LLMs) applied to task and motion planning for robots. We propose a system architecture that orchestrates a seamless interplay between multiple cognitive levels, encompassing reasoning, planning, and motion generation. At its core lies a novel replanning strategy that handles physically grounded, logical, and semantic errors in the generated plans. We demonstrate the efficacy of the proposed feedback architecture, particularly its impact on executability, correctness, and time complexity via empirical evaluation in the context of a simulation and two intricate real-world scenarios: blocks world, barman and pizza preparation.
    摘要 “在实现完全自主机器人系统的挑战中,开放世界环境的复杂性对任务和动作观念规划产生了很大的挑战。这项研究对于大型自然语言模型(LLM)应用在机器人任务和动作规划方面做出了贡献。我们提出了一个系统架构,它在多个认知水平之间进行了联系,包括理解、规划和动作生成。这个架构的核心是一种新的重新规划策略,可以处理物理基础、逻辑和semantic错误在生成的计划中。我们透过实际评估在模拟和两个实际世界情况下(即积木世界和制作啤酒和制作izza),证明了我们提出的反馈架构的有效性,特别是对于执行可能性、正确性和时间复杂度的影响。”

Uncovering Hidden Connections: Iterative Tracking and Reasoning for Video-grounded Dialog

  • paper_url: http://arxiv.org/abs/2310.07259
  • repo_url: https://github.com/hyu-zhang/itr
  • paper_authors: Haoyu Zhang, Meng Liu, Yaowei Wang, Da Cao, Weili Guan, Liqiang Nie
  • for: 这篇论文的目的是提出一种新的视频对话方法,可以快速和准确地回答视频内容相关的问题。
  • methods: 这篇论文使用了一种迭代跟踪和理解策略,将文本编码器、视觉编码器和生成器相结合。文本编码器使用了一种路径跟踪和汇总机制,能够从对话历史中提取关键信息,解决问题。视觉编码器使用了一种迭代理解网络,精心挑选和强调视频中重要的视觉特征,提高视觉理解的深度。
  • results: 作者通过在两个知名的数据集上进行实验,证明了他们提出的方法的可靠性和适应性。
    Abstract In contrast to conventional visual question answering, video-grounded dialog necessitates a profound understanding of both dialog history and video content for accurate response generation. Despite commendable strides made by existing methodologies, they often grapple with the challenges of incrementally understanding intricate dialog histories and assimilating video information. In response to this gap, we present an iterative tracking and reasoning strategy that amalgamates a textual encoder, a visual encoder, and a generator. At its core, our textual encoder is fortified with a path tracking and aggregation mechanism, adept at gleaning nuances from dialog history that are pivotal to deciphering the posed questions. Concurrently, our visual encoder harnesses an iterative reasoning network, meticulously crafted to distill and emphasize critical visual markers from videos, enhancing the depth of visual comprehension. Culminating this enriched information, we employ the pre-trained GPT-2 model as our response generator, stitching together coherent and contextually apt answers. Our empirical assessments, conducted on two renowned datasets, testify to the prowess and adaptability of our proposed design.
    摘要 contrast to traditional visual question answering, video-grounded dialog requires a profound understanding of both dialog history and video content for accurate response generation. Despite notable advances made by existing methodologies, they often struggle with incrementally understanding complex dialog histories and integrating video information. In response to this gap, we propose an iterative tracking and reasoning strategy that combines a textual encoder, a visual encoder, and a generator. At its core, our textual encoder is reinforced with a path tracking and aggregation mechanism, skilled at extracting subtleties from dialog history that are crucial to deciphering the posed questions. Meanwhile, our visual encoder utilizes an iterative reasoning network, carefully crafted to distill and emphasize essential visual cues from videos, enhancing the depth of visual understanding. Combining this enriched information, we employ the pre-trained GPT-2 model as our response generator, seamlessly stitching together coherent and contextually appropriate answers. Our empirical evaluations, conducted on two well-known datasets, demonstrate the effectiveness and adaptability of our proposed design.

ADMEOOD: Out-of-Distribution Benchmark for Drug Property Prediction

  • paper_url: http://arxiv.org/abs/2310.07253
  • repo_url: https://github.com/qweasdzxc-wsy/ADMEOOD
  • paper_authors: Shuoying Wei, Xinlong Wen, Lida Zhu, Songquan Li, Rongbo Zhu
  • For: This paper aims to address the out-of-distribution (OOD) problem in drug property prediction by proposing a novel benchmark dataset and evaluating the performance of different domain generalization models.* Methods: The proposed benchmark, called ADMEOOD, includes a systematic OOD dataset curator and benchmark specifically designed for drug property prediction. It includes two types of OOD data shifts: Noise Shift and Concept Conflict Drift (CCD).* Results: The experimental results demonstrate the effectiveness of the proposed partition method in ADMEOOD, showing a significant difference in performance between in-distribution and out-of-distribution data. Additionally, the paper shows that Empirical Risk Minimization (ERM) and other models exhibit distinct trends in performance across different domains and measurement types.
    Abstract Obtaining accurate and valid information for drug molecules is a crucial and challenging task. However, chemical knowledge and information have been accumulated over the past 100 years from various regions, laboratories, and experimental purposes. Little has been explored in terms of the out-of-distribution (OOD) problem with noise and inconsistency, which may lead to weak robustness and unsatisfied performance. This study proposes a novel benchmark ADMEOOD, a systematic OOD dataset curator and benchmark specifically designed for drug property prediction. ADMEOOD obtained 27 ADME (Absorption, Distribution, Metabolism, Excretion) drug properties from Chembl and relevant literature. Additionally, it includes two kinds of OOD data shifts: Noise Shift and Concept Conflict Drift (CCD). Noise Shift responds to the noise level by categorizing the environment into different confidence levels. On the other hand, CCD describes the data which has inconsistent label among the original data. Finally, it tested on a variety of domain generalization models, and the experimental results demonstrate the effectiveness of the proposed partition method in ADMEOOD: ADMEOOD demonstrates a significant difference performance between in-distribution and out-of-distribution data. Moreover, ERM (Empirical Risk Minimization) and other models exhibit distinct trends in performance across different domains and measurement types.
    摘要 得到正确和有效的药分子信息是一项关键和挑战性的任务。然而,化学知识和信息在过去100年间在不同的地方、实验室和实验目的下积累了大量的经验。虽然有很多研究对药物性能进行了预测,但对于异常情况(OOD)的问题还有很少的探索。本研究提出了一个新的标准测试集ADMEOOD,该集包含27个ADME(吸收、分布、代谢、排除)药物性能 Parameters from Chembl和相关文献。此外,它还包括两种类型的OOD数据推移:噪声推移和概念冲突推移(CCD)。噪声推移根据噪声水平进行分类环境,而CCD则描述了原始数据中存在冲突的标签。最后,它在多种领域通用化模型上进行了测试,实验结果表明了提案的分区方法在ADMEOOD中的效果:ADMEOOD在含有和不含的数据之间显示了明显的性能差异。此外,ERM(Empirical Risk Minimization)模型和其他模型在不同的领域和测量类型上表现出了不同的趋势。

Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs

  • paper_url: http://arxiv.org/abs/2310.07251
  • repo_url: None
  • paper_authors: Abhinav Rao, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, Monojit Choudhury
  • for: argue that LLMs should be infused with generic ethical reasoning capabilities to handle value pluralism at a global scale, rather than aligning them to specific ethical principles.
  • methods: develop a framework that integrates moral dilemmas with moral principles from different formalisms of normative ethics and at different levels of abstractions.
  • results: initial experiments with GPT-x models show that while GPT-4 is a nearly perfect ethical reasoner, the models still have bias towards the moral values of Western and English speaking societies.
    Abstract In this position paper, we argue that instead of morally aligning LLMs to specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale. When provided with an ethical policy, an LLM should be capable of making decisions that are ethically consistent to the policy. We develop a framework that integrates moral dilemmas with moral principles pertaining to different foramlisms of normative ethics, and at different levels of abstractions. Initial experiments with GPT-x models shows that while GPT-4 is a nearly perfect ethical reasoner, the models still have bias towards the moral values of Western and English speaking societies.
    摘要 在这份Position paper中,我们认为,而不是将人工智能语言模型(LLMs) morally align到特定的道德原则上,我们应该通过嵌入基于道德理解的能力来让它们能够处理全球范围内的价值多元性。当提供了一个道德政策时,一个LLM应该能够作出道德一致的决策。我们开发了一个整合道德困境和不同形式的normative ethics的道德原则的框架。初步实验表明,使用GPT-x模型时,GPT-4是一个几乎完美的道德思考者,但模型仍然偏向西方和英语社会的道德价值观。

Surrogate modeling for stochastic crack growth processes in structural health monitoring applications

  • paper_url: http://arxiv.org/abs/2310.07241
  • repo_url: None
  • paper_authors: Nicholas E. Silionis, Konstantinos N. Anyfantis
  • for: 这个论文的目的是用structural health monitoring(SHM)技术预测金属结构中裂缝增长的未来趋势,以便实现预测维护。
  • methods: 该论文使用了物理基础的裂缝增长模型,以及对这些模型的不确定性进行了 reprehension。具体来说,这个论文使用了 Gaussian Process(GP)回归模型,以生成不同类型的不确定性的先验分布。
  • results: 该论文通过在数值上实现了这种方法,并对两个基本的裂缝监测问题进行了评估,即裂缝长度监测(损害评估)和裂缝增长监测(损害预测)。
    Abstract Fatigue crack growth is one of the most common types of deterioration in metal structures with significant implications on their reliability. Recent advances in Structural Health Monitoring (SHM) have motivated the use of structural response data to predict future crack growth under uncertainty, in order to enable a transition towards predictive maintenance. Accurately representing different sources of uncertainty in stochastic crack growth (SCG) processes is a non-trivial task. The present work builds on previous research on physics-based SCG modeling under both material and load-related uncertainty. The aim here is to construct computationally efficient, probabilistic surrogate models for SCG processes that successfully encode these different sources of uncertainty. An approach inspired by latent variable modeling is employed that utilizes Gaussian Process (GP) regression models to enable the surrogates to be used to generate prior distributions for different Bayesian SHM tasks as the application of interest. Implementation is carried out in a numerical setting and model performance is assessed for two fundamental crack SHM problems; namely crack length monitoring (damage quantification) and crack growth monitoring (damage prognosis).
    摘要 轻度疲劳裂隙是金属结构衰弱的一种最常见的类型,它对结构可靠性有着重要的影响。现代结构健康监测(SHM)技术的发展,使得可以通过结构响应数据预测未来裂隙增长,以便实现预测维护。正确表达不同类型的不确定性在杂音裂隙(SCG)过程中是一项非常困难的任务。本工作基于之前的物理基础SCG模型下的不确定性研究,旨在构建高效、probabilistic替身模型,以成功地编码不同类型的不确定性。我们采用了含隐变量模型的方法,使用 Gaussian Process(GP)回归模型,以便替身模型可以用来生成不同bayesian SHM任务的先验分布。实施在数字化环境中,并对两个基本的裂隙 SHM问题进行了评估,即裂隙长度监测(损害评估)和裂隙增长监测(损害预测)。

Using Learnable Physics for Real-Time Exercise Form Recommendations

  • paper_url: http://arxiv.org/abs/2310.07221
  • repo_url: None
  • paper_authors: Abhishek Jaiswal, Gautam Chauhan, Nisheeth Srivastava
  • for: 这篇论文是为了提供一个用于训练和重abilitation的推荐系统,以实时评估和给出修正建议,以提高安全性和生产力。
  • methods: 这个推荐系统使用MediaPipe进行姿势识别,使用峰值振荡检测缩数量,并使用一个可学习的物理模拟器追踪每个运动动作的动作演进。
  • results: 这个系统在六种全身和上半身运动动作中进行了实时评估和修正建议,以提高自修练的可能性和降低运动伤害的风险。
    Abstract Good posture and form are essential for safe and productive exercising. Even in gym settings, trainers may not be readily available for feedback. Rehabilitation therapies and fitness workouts can thus benefit from recommender systems that provide real-time evaluation. In this paper, we present an algorithmic pipeline that can diagnose problems in exercise techniques and offer corrective recommendations, with high sensitivity and specificity in real-time. We use MediaPipe for pose recognition, count repetitions using peak-prominence detection, and use a learnable physics simulator to track motion evolution for each exercise. A test video is diagnosed based on deviations from the prototypical learned motion using statistical learning. The system is evaluated on six full and upper body exercises. These real-time recommendations, counseled via low-cost equipment like smartphones, will allow exercisers to rectify potential mistakes making self-practice feasible while reducing the risk of workout injuries.
    摘要 好的姿势和形态是健身安全和生产力的关键。即使在健身房 Setting中,教练可能不总是可以提供反馈。rehabilitation therapy和健身训练可以从推荐系统中受益,该系统可以提供实时的评估。在这篇论文中,我们提出了一个算法管道,可以诊断运动技巧中的问题并提供修正建议,具有高度敏感和特异性。我们使用MediaPipe进行姿势识别,使用峰值特征检测计数 repetitions,并使用可学习的物理模拟器跟踪运动的动态变化。一个测试视频根据异常分析 deviations from the learned prototypical motion。该系统被评估在六种全身和上半身运动中。这些实时建议,通过低成本的设备如智能手机,将让运动员可以Rectify potential mistakes,使自修成为可能,同时降低运动伤害的风险。

Improved Membership Inference Attacks Against Language Classification Models

  • paper_url: http://arxiv.org/abs/2310.07219
  • repo_url: None
  • paper_authors: Shlomit Shachor, Natalia Razinkov, Abigail Goldsteen
  • for: 这篇论文旨在评估机器学习模型中的隐私风险,以帮助决策使用、部署或共享模型。
  • methods: 该论文提出了一种新的整合方法,通过生成多个专门的攻击模型来对分类模型进行会员推理攻击。
  • results: 该研究表明,使用该整合方法可以实现更高的攻击精度,比单个攻击模型或每个分类标签的攻击模型都高。
    Abstract Artificial intelligence systems are prevalent in everyday life, with use cases in retail, manufacturing, health, and many other fields. With the rise in AI adoption, associated risks have been identified, including privacy risks to the people whose data was used to train models. Assessing the privacy risks of machine learning models is crucial to enabling knowledgeable decisions on whether to use, deploy, or share a model. A common approach to privacy risk assessment is to run one or more known attacks against the model and measure their success rate. We present a novel framework for running membership inference attacks against classification models. Our framework takes advantage of the ensemble method, generating many specialized attack models for different subsets of the data. We show that this approach achieves higher accuracy than either a single attack model or an attack model per class label, both on classical and language classification tasks.
    摘要 人工智能系统在日常生活中广泛应用,包括零售、制造、医疗等领域。随着人工智能的普及,相关的风险也被识别出来,包括使用人工智能模型训练数据的隐私风险。评估机器学习模型的隐私风险是必要的,以便做出了知情的决策是否使用、部署或共享模型。我们提出了一种新的散 membership 攻击框架,该框架利用 ensemble 方法,生成多个特化的攻击模型,用于不同的数据子集。我们展示了,这种方法可以在类别和语言分类任务上达到更高的准确率,比单个攻击模型或每个类别标签的攻击模型都高。

Quantifying Agent Interaction in Multi-agent Reinforcement Learning for Cost-efficient Generalization

  • paper_url: http://arxiv.org/abs/2310.07218
  • repo_url: None
  • paper_authors: Yuxin Chen, Chen Tang, Ran Tian, Chenran Li, Jinning Li, Masayoshi Tomizuka, Wei Zhan
  • for: 这 paper 是 investigate 多智能体强化学习(MARL)中的泛化问题。
  • methods: 这 paper 使用 Level of Influence(LoI)metric 量化多智能体之间的互动程度,并在不同的enario和环境中评估 LoI 对泛化性能的影响。
  • results: 研究发现,在许多enario中,多个合作者的多样化训练可以提高 eg agent 的泛化性能,但是这种提高的程度因enario和环境而异。LoI 能够预测这种差异性。此外,基于 LoI 的资源分配策略可以在受限的 computation budget 下提高泛化性能。
    Abstract Generalization poses a significant challenge in Multi-agent Reinforcement Learning (MARL). The extent to which an agent is influenced by unseen co-players depends on the agent's policy and the specific scenario. A quantitative examination of this relationship sheds light on effectively training agents for diverse scenarios. In this study, we present the Level of Influence (LoI), a metric quantifying the interaction intensity among agents within a given scenario and environment. We observe that, generally, a more diverse set of co-play agents during training enhances the generalization performance of the ego agent; however, this improvement varies across distinct scenarios and environments. LoI proves effective in predicting these improvement disparities within specific scenarios. Furthermore, we introduce a LoI-guided resource allocation method tailored to train a set of policies for diverse scenarios under a constrained budget. Our results demonstrate that strategic resource allocation based on LoI can achieve higher performance than uniform allocation under the same computation budget.
    摘要 <>文本翻译成简化中文。<>多 Agent Reinforcement Learning(MARL)中,泛化带来了重要的挑战。代表者在未经见过的合作者的影响程度取决于代表者的策略和特定情况。一个量化的分析这个关系,可以帮助有效地训练代表者。在这个研究中,我们提出了影响度指数(LoI),用于衡量在给定enario和环境中代表者之间的交互强度。我们发现,通常情况下,在训练过程中采用更加多样化的合作者集合,可以提高代表者的泛化性能;然而,这种改善的程度随着不同的情况和环境而异。LoI有效地预测这些改善的差异。此外,我们还提出了基于LoI的资源分配策略,用于在有限的计算预算下训练多种情况下的策略。我们的结果表明,根据LoI进行策略资源的有计划分配,可以在同一个计算预算下达到更高的性能。

Multi-Task Learning-Enabled Automatic Vessel Draft Reading for Intelligent Maritime Surveillance

  • paper_url: http://arxiv.org/abs/2310.07212
  • repo_url: None
  • paper_authors: Jingxiang Qu, Ryan Wen Liu, Chenjie Zhao, Yu Guo, Sendren Sheng-Dong Xu, Fenghua Zhu, Yisheng Lv
  • For: This paper proposes a multi-task learning-enabled computational method (MTL-VDR) for generating highly reliable vessel draft depth readings.* Methods: The MTL-VDR method consists of four components: draft mark detection, draft scale recognition, vessel/water segmentation, and final draft depth estimation. The method uses a powerful and efficient convolutional neural network for draft mark detection and employs a multi-task learning method for simultaneous draft scale recognition and vessel/water segmentation.* Results: The method demonstrated superior performance in terms of accuracy, robustness, and efficiency, with an adaptive computational method used to yield an accurate and robust draft depth. The computational speed exceeds 40 FPS, satisfying the requirements of real-time maritime surveillance to guarantee vessel traffic safety.
    Abstract The accurate and efficient vessel draft reading (VDR) is an important component of intelligent maritime surveillance, which could be exploited to assist in judging whether the vessel is normally loaded or overloaded. The computer vision technique with an excellent price-to-performance ratio has become a popular medium to estimate vessel draft depth. However, the traditional estimation methods easily suffer from several limitations, such as sensitivity to low-quality images, high computational cost, etc. In this work, we propose a multi-task learning-enabled computational method (termed MTL-VDR) for generating highly reliable VDR. In particular, our MTL-VDR mainly consists of four components, i.e., draft mark detection, draft scale recognition, vessel/water segmentation, and final draft depth estimation. We first construct a benchmark dataset related to draft mark detection and employ a powerful and efficient convolutional neural network to accurately perform the detection task. The multi-task learning method is then proposed for simultaneous draft scale recognition and vessel/water segmentation. To obtain more robust VDR under complex conditions (e.g., damaged and stained scales, etc.), the accurate draft scales are generated by an automatic correction method, which is presented based on the spatial distribution rules of draft scales. Finally, an adaptive computational method is exploited to yield an accurate and robust draft depth. Extensive experiments have been implemented on the realistic dataset to compare our MTL-VDR with state-of-the-art methods. The results have demonstrated its superior performance in terms of accuracy, robustness, and efficiency. The computational speed exceeds 40 FPS, which satisfies the requirements of real-time maritime surveillance to guarantee vessel traffic safety.
    摘要 “精准和高效的船舶吃水深度读取(VDR)是智能海上监测中重要的一部分,可以帮助判断船舶是否超载。计算机视觉技术具有出色的价格-性能比,成为船舶吃水深度估算的受欢迎媒体。然而,传统估算方法容易受到低质量图像、高计算成本等限制。在这种情况下,我们提出了一种基于多任务学习的计算方法(简称MTL-VDR),用于生成高可靠性的VDR。具体来说,我们的MTL-VDR包括四个组件:船舶吃水深度检测、船舶/水域分割、船舶吃水深度估算和自适应计算方法。我们首先构建了相关的船舶吃水深度检测数据集,并使用高效和强大的卷积神经网络进行检测任务的准确实施。然后,我们提出了多任务学习方法,用于同时进行船舶吃水深度估算和船舶/水域分割。为了在复杂情况下(如损坏和污染等)获得更加稳定的VDR,我们提出了一种自动更正方法,基于船舶吃水深度的精度规则。最后,我们运用了一种适应计算方法,以确保高准确性和稳定性。我们对实际数据进行了广泛的实验,与现有方法进行比较。结果显示,我们的MTL-VDR在精度、稳定性和效率方面具有显著的优势。计算速度超过40帧每秒,满足了实时海上监测的需求,以保障船舶交通安全。”

State of the Art on Diffusion Models for Visual Computing

  • paper_url: http://arxiv.org/abs/2310.07204
  • repo_url: None
  • paper_authors: Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Björn Ommer, Christian Theobalt, Peter Wonka, Gordon Wetzstein
  • for: 提供一个入门性的状态报告,帮助研究者、艺术家和实践者了解扩散模型在视觉计算领域的基本数学概念、实现细节和设计选择,以及扩散基于生成AI工具的各种应用和扩展。
  • methods: 涵盖了扩散模型的基本数学概念、Stable Diffusion模型的实现细节和设计选择,以及扩散基于生成AI工具的各种应用和扩展。
  • results: 提供了一个全面的Literature综述,概述了扩散基于生成AI工具的各种应用和扩展,包括2D图像、视频、3D物体、动作和4D场景等。同时也讨论了可用的数据集、评价指标、开放的挑战和社会影响等。
    Abstract The field of visual computing is rapidly advancing due to the emergence of generative artificial intelligence (AI), which unlocks unprecedented capabilities for the generation, editing, and reconstruction of images, videos, and 3D scenes. In these domains, diffusion models are the generative AI architecture of choice. Within the last year alone, the literature on diffusion-based tools and applications has seen exponential growth and relevant papers are published across the computer graphics, computer vision, and AI communities with new works appearing daily on arXiv. This rapid growth of the field makes it difficult to keep up with all recent developments. The goal of this state-of-the-art report (STAR) is to introduce the basic mathematical concepts of diffusion models, implementation details and design choices of the popular Stable Diffusion model, as well as overview important aspects of these generative AI tools, including personalization, conditioning, inversion, among others. Moreover, we give a comprehensive overview of the rapidly growing literature on diffusion-based generation and editing, categorized by the type of generated medium, including 2D images, videos, 3D objects, locomotion, and 4D scenes. Finally, we discuss available datasets, metrics, open challenges, and social implications. This STAR provides an intuitive starting point to explore this exciting topic for researchers, artists, and practitioners alike.
    摘要 领域的视觉计算在迅速发展,启动了基于生成人工智能(AI)的扩展,这些技术在图像、视频和3D场景的生成、编辑和重建方面提供了无前例的能力。在这些领域中,扩散模型是生成AI架构的首选。过去一年内,有关扩散工具和应用的学术论文数量在计算机图形、计算机视觉和人工智能领域呈指数增长,每天在arXiv上出现新的论文。这种快速发展的领域使得保持最新的发展变得困难。本state-of-the-art报告(STAR)的目的是介绍扩散模型的基本数学概念,扩散模型的实现细节和设计选择,以及生成AI工具的重要方面,包括个性化、条件、反向等。此外,我们还给出了生成和编辑扩散模型的快速增长的评价,分类为生成媒介的类型,包括2D图像、视频、3D物体、移动和4D场景。最后,我们讨论了可用的数据集、评价指标、开放的挑战和社会影响。这个STAR为研究者、艺术家和实践者提供了直观的入门点,以便更好地探索这个激动人心的主题。

MatChat: A Large Language Model and Application Service Platform for Materials Science

  • paper_url: http://arxiv.org/abs/2310.07197
  • repo_url: None
  • paper_authors: Ziyi Chen, Fankai Xie, Meng Wan, Yang Yuan, Miao Liu, Zongguo Wang, Sheng Meng, Yangang Wang
  • for: 预测化学合成路径,以满足材料科学研究中的需求。
  • methods: 利用自动生成文本和问答系统,以及精细调整技术,实现大规模AI模型在特定领域中的部署。
  • results: 研究人员通过使用LLaMA2-7B模型,并将其特化为材料科学领域,创造出了名为MatChat的特殊AI模型,可以预测无机材料合成路径。MatChat表现出了丰富的知识掌握和逻辑能力,但还需要进一步改进,以满足不同材料设计需求。
    Abstract The prediction of chemical synthesis pathways plays a pivotal role in materials science research. Challenges, such as the complexity of synthesis pathways and the lack of comprehensive datasets, currently hinder our ability to predict these chemical processes accurately. However, recent advancements in generative artificial intelligence (GAI), including automated text generation and question-answering systems, coupled with fine-tuning techniques, have facilitated the deployment of large-scale AI models tailored to specific domains. In this study, we harness the power of the LLaMA2-7B model and enhance it through a learning process that incorporates 13,878 pieces of structured material knowledge data. This specialized AI model, named MatChat, focuses on predicting inorganic material synthesis pathways. MatChat exhibits remarkable proficiency in generating and reasoning with knowledge in materials science. Although MatChat requires further refinement to meet the diverse material design needs, this research undeniably highlights its impressive reasoning capabilities and innovative potential in the field of materials science. MatChat is now accessible online and open for use, with both the model and its application framework available as open source. This study establishes a robust foundation for collaborative innovation in the integration of generative AI in materials science.
    摘要 文本预测在材料科学研究中扮演着重要的角色。现在,化学合成路径的复杂性和缺乏全面数据库等问题正在阻碍我们准确预测这些化学过程。然而,最近的生成人工智能(GAI)技术,包括自动生成文本和问答系统,以及精细调整技术,已经使得大规模AI模型在特定领域中进行部署。在本研究中,我们利用LLaMA2-7B模型的力量,并通过包括13,878个结构化物理知识数据的学习过程,开发了一个专门用于预测无机材料合成路径的AI模型,称为MatChat。MatChat在材料科学领域中表现出了惊人的知识生成和理解能力。虽然MatChat还需要进一步的优化,以满足多样化的材料设计需求,但这项研究无疑地高亮了MatChat在材料科学领域的创新潜力。MatChat现在在线开放,可以免费使用,模型和应用框架都是开源的。本研究建立了对生成AI在材料科学领域的集成创新的坚实基础。

Adaptive Gating in Mixture-of-Experts based Language Models

  • paper_url: http://arxiv.org/abs/2310.07188
  • repo_url: None
  • paper_authors: Jiamin Li, Qiang Su, Yitao Yang, Yimin Jiang, Cong Wang, Hong Xu
  • for: 这篇论文主要是关于如何使用适应性网络来提高语言模型的训练效率和性能。
  • methods: 该论文提出了一种适应性网络(MoE)模型,其中每个token可以通过不同的专家来进行计算,以适应不同的语言复杂度。此外,论文还使用了课程学习来进一步降低训练时间。
  • results: 实验结果显示,适应性网络可以减少最多22.5%的训练时间,同时保持推理质量。此外,论文还进行了路由决策的分析,并提供了相关的分析结论。
    Abstract Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities in various NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models while maintaining a constant number of computational operations. Existing MoE model adopts a fixed gating network where each token is computed by the same number of experts. However, this approach contradicts our intuition that the tokens in each sequence vary in terms of their linguistic complexity and, consequently, require different computational costs. Little is discussed in prior research on the trade-off between computation per token and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Additionally, curriculum learning is leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces at most 22.5% training time while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the routing decisions and present our insights when adaptive gating is used.
    摘要 大型语言模型,如OpenAI的ChatGPT,在不同的自然语言处理任务中表现出了非常出色的语言理解能力。零启动权重的混合专家(MoE)已经成为了可扩展模型的一个有 promise的解决方案,以保持计算操作数量的常数。现有的MoE模型采用固定的闭包网络,每个字符都由相同数量的专家计算。然而,这种方法与我们的语言理解 intuition 相抵触,即每个序列中的字符具有不同的语言复杂度,因此需要不同的计算成本。在优化过程中,对计算每个字符的时间和模型性能之间的负面ffekt little 的研究。本文提出了适应性闭包(Adaptive Gating),一种灵活的训练策略,使得字符可以根据专家概率分布来处理不同数量的专家。该提案保持了稀疏性,同时改善了训练效率。此外,我们还利用了课程学习,以进一步减少训练时间。广泛的实验表明,适应性闭包可以在不同的自然语言处理任务中减少训练时间最多22.5%,保持推理质量。此外,我们还进行了路由决策的全面分析,并对适应性闭包使用的时候提供了我们的思路。

Multiview Transformer: Rethinking Spatial Information in Hyperspectral Image Classification

  • paper_url: http://arxiv.org/abs/2310.07186
  • repo_url: None
  • paper_authors: Jie Zhang, Yongshan Zhang, Yicong Zhou
  • for: 本文是为了提高涉及谱图像(HSI)的地形分类准确性而研究的。
  • methods: 本文使用多视图变换器(MPCA、SED、SPTT)来提取HSI中的空间-spectral特征表示。MPCA通过构建多视图观察数据,并在每个视图数据上应用PCA来提取低维度视图表示。SED使用全连接卷积神经网络来提取多视图特征图。SPTT使用空间poolingtokenization策略将多视图特征转换为tokens,学习稳定和分类的空间-spectral特征。
  • results: 实验结果表明,提出的多视图变换器在三个HSI数据集上表现出色,超过了现有方法的性能。
    Abstract Identifying the land cover category for each pixel in a hyperspectral image (HSI) relies on spectral and spatial information. An HSI cuboid with a specific patch size is utilized to extract spatial-spectral feature representation for the central pixel. In this article, we investigate that scene-specific but not essential correlations may be recorded in an HSI cuboid. This additional information improves the model performance on existing HSI datasets and makes it hard to properly evaluate the ability of a model. We refer to this problem as the spatial overfitting issue and utilize strict experimental settings to avoid it. We further propose a multiview transformer for HSI classification, which consists of multiview principal component analysis (MPCA), spectral encoder-decoder (SED), and spatial-pooling tokenization transformer (SPTT). MPCA performs dimension reduction on an HSI via constructing spectral multiview observations and applying PCA on each view data to extract low-dimensional view representation. The combination of view representations, named multiview representation, is the dimension reduction output of the MPCA. To aggregate the multiview information, a fully-convolutional SED with a U-shape in spectral dimension is introduced to extract a multiview feature map. SPTT transforms the multiview features into tokens using the spatial-pooling tokenization strategy and learns robust and discriminative spatial-spectral features for land cover identification. Classification is conducted with a linear classifier. Experiments on three HSI datasets with rigid settings demonstrate the superiority of the proposed multiview transformer over the state-of-the-art methods.
    摘要 Identifying the land cover category for each pixel in a hyperspectral image (HSI) relies on both spectral and spatial information. We use an HSI cuboid with a specific patch size to extract spatial-spectral feature representations for the central pixel. However, we find that scene-specific but not essential correlations may be recorded in the HSI cuboid, which can improve model performance on existing HSI datasets but also make it difficult to evaluate the model's ability. We refer to this problem as the spatial overfitting issue and use strict experimental settings to avoid it.To address this issue, we propose a multiview transformer for HSI classification, which consists of multiview principal component analysis (MPCA), spectral encoder-decoder (SED), and spatial-pooling tokenization transformer (SPTT). MPCA performs dimension reduction on the HSI by constructing spectral multiview observations and applying PCA on each view data to extract low-dimensional view representations. The combination of view representations, named multiview representation, is the dimension reduction output of the MPCA. To aggregate the multiview information, a fully-convolutional SED with a U-shape in spectral dimension is introduced to extract a multiview feature map. SPTT transforms the multiview features into tokens using the spatial-pooling tokenization strategy and learns robust and discriminative spatial-spectral features for land cover identification. Classification is conducted with a linear classifier.Experiments on three HSI datasets with rigid settings demonstrate the superiority of the proposed multiview transformer over the state-of-the-art methods.

rpcPRF: Generalizable MPI Neural Radiance Field for Satellite Camera

  • paper_url: http://arxiv.org/abs/2310.07179
  • repo_url: None
  • paper_authors: Tongtong Zhang, Yuanxiang Li
  • for: 这个论文targets the task of novel view synthesis of satellite images, with a focus on Rational Polynomial Camera (RPC) models.
  • methods: The proposed method, called rpcPRF, uses a Multiplane Images (MPI) based Planar neural Radiance Field (PRF) to synthesize novel views of satellite images. The model leverages reprojection supervision to generalize to unseen scenes and removes the need for dense depth supervision.
  • results: The paper reports that rpcPRF outperforms state-of-the-art NERF-based methods in terms of image fidelity, reconstruction accuracy, and efficiency on two datasets (TLC and SatMVS3D) with urban scenes from WV-3 and ZY-3 satellites.
    Abstract Novel view synthesis of satellite images holds a wide range of practical applications. While recent advances in the Neural Radiance Field have predominantly targeted pin-hole cameras, and models for satellite cameras often demand sufficient input views. This paper presents rpcPRF, a Multiplane Images (MPI) based Planar neural Radiance Field for Rational Polynomial Camera (RPC). Unlike coordinate-based neural radiance fields in need of sufficient views of one scene, our model is applicable to single or few inputs and performs well on images from unseen scenes. To enable generalization across scenes, we propose to use reprojection supervision to induce the predicted MPI to learn the correct geometry between the 3D coordinates and the images. Moreover, we remove the stringent requirement of dense depth supervision from deep multiview-stereo-based methods by introducing rendering techniques of radiance fields. rpcPRF combines the superiority of implicit representations and the advantages of the RPC model, to capture the continuous altitude space while learning the 3D structure. Given an RGB image and its corresponding RPC, the end-to-end model learns to synthesize the novel view with a new RPC and reconstruct the altitude of the scene. When multiple views are provided as inputs, rpcPRF exerts extra supervision provided by the extra views. On the TLC dataset from ZY-3, and the SatMVS3D dataset with urban scenes from WV-3, rpcPRF outperforms state-of-the-art nerf-based methods by a significant margin in terms of image fidelity, reconstruction accuracy, and efficiency, for both single-view and multiview task.
    摘要 <> translate the following text into Simplified Chinese:Novel view synthesis of satellite images has a wide range of practical applications. While recent advances in the Neural Radiance Field have predominantly targeted pin-hole cameras, and models for satellite cameras often demand sufficient input views. This paper presents rpcPRF, a Multiplane Images (MPI) based Planar neural Radiance Field for Rational Polynomial Camera (RPC). Unlike coordinate-based neural radiance fields in need of sufficient views of one scene, our model is applicable to single or few inputs and performs well on images from unseen scenes. To enable generalization across scenes, we propose to use reprojection supervision to induce the predicted MPI to learn the correct geometry between the 3D coordinates and the images. Moreover, we remove the stringent requirement of dense depth supervision from deep multiview-stereo-based methods by introducing rendering techniques of radiance fields. rpcPRF combines the superiority of implicit representations and the advantages of the RPC model, to capture the continuous altitude space while learning the 3D structure. Given an RGB image and its corresponding RPC, the end-to-end model learns to synthesize the novel view with a new RPC and reconstruct the altitude of the scene. When multiple views are provided as inputs, rpcPRF exerts extra supervision provided by the extra views. On the TLC dataset from ZY-3, and the SatMVS3D dataset with urban scenes from WV-3, rpcPRF outperforms state-of-the-art nerf-based methods by a significant margin in terms of image fidelity, reconstruction accuracy, and efficiency, for both single-view and multiview task.Translation:新视图合成卫星图像应用广泛。Recent Advances in Neural Radiance Field 主要针对平面镜头,而卫星相机模型通常需要足够的输入视图。本文提出了rpcPRF,基于多平面图像(MPI)的平面神经频率场(RPC)模型。与坐标基于神经频率场的模型不同,我们的模型适用于单个或几个输入,并在未见场景中表现良好。为实现场景总结,我们提议使用重投映监督,使预测的MPI学习正确的场景坐标和图像之间的几何关系。此外,我们从深度多视图雷达方法中移除了密集深度监督的需求,通过引入投映技术来实现雷达场景的渲染。rpcPRF结合了神经频率场的优势和RPC模型的优点,可以捕捉不间断的高度空间,同时学习3D结构。给定一个RGB图像和其相应的RPC,结束到终点模型可以使用新的RPCsynthesize Novel View和重建场景的高度。当提供多个视图输入时,rpcPRF可以提供Extra supervision。在ZY-3的TLC数据集和WV-3的SatMVS3D数据集上,rpcPRF与状态aru的nerf-based方法比之,在图像准确度、重建精度和效率等方面具有显著的优势,包括单视图和多视图任务。

Online Speculative Decoding

  • paper_url: http://arxiv.org/abs/2310.07177
  • repo_url: None
  • paper_authors: Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, Hao Zhang
  • for: 加速大语言模型(LLM)的推理过程,使用较小的稿本模型预测目标模型的输出。
  • methods: 在线预测推理(OSD)技术,通过在用户查询数据观察到的过程中不断更新(多个)稿本模型(),使用LLM服务器集群的剩余计算能力进行在线重新训练稿本模型,以提高预测精度。
  • results: 根据实验结果,在线预测推理可以提高稿本模型的准确率,从而提高LLM的推理效率,并且可以降低延迟时间,实现1.22倍至3.06倍的提升。
    Abstract Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding (OSD) to address this challenge. The main idea is to continually update (multiple) draft model(s) on observed user query data using the abundant excess computational power in an LLM serving cluster. Given that LLM inference is memory-bounded, the surplus computational power in a typical LLM serving cluster can be repurposed for online retraining of draft models, thereby making the training cost-neutral. Since the query distribution of an LLM service is relatively simple, retraining on query distribution enables the draft model to more accurately predict the target model's outputs, particularly on data originating from query distributions. As the draft model evolves online, it aligns with the query distribution in real time, mitigating distribution shifts. We develop a prototype of online speculative decoding based on online knowledge distillation and evaluate it using both synthetic and real query data on several popular LLMs. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, which translates into 1.22x to 3.06x latency reduction.
    摘要 推测解码是一种重要的技术,可以加速大型语言模型(LLM)的推断过程,通过使用一个更小的签名模型来预测目标模型的输出。然而,其效果可能受到签名模型预测精度的限制,特别是面对多样化的文本输入和目标模型的能力差距。我们提出在线推测解码(OSD)技术来解决这个挑战。主要想法是在 LLM 服务中继续更新(多个)签名模型(多个),使用 LLM 的剩余计算能力来在线 retrained 签名模型,从而使得 retrained 成本neutral。由于 LLM 的查询分布相对简单,因此在线 retrained 签名模型可以更好地预测目标模型的输出,特别是来自查询分布的数据。在签名模型在线演化的过程中,它与查询分布相对应,使得分布shift Mitigated。我们开发了基于在线知识填充的 OSD прототип,并对其进行了Synthetic和实际查询数据的评估。结果表明,通过使用 OSD 技术,可以提高 токен接受率 BY 0.1 TO 0.65,相当于减少延迟时间 BY 1.22x TO 3.06x。

Solving Travelling Thief Problems using Coordination Based Methods

  • paper_url: http://arxiv.org/abs/2310.07156
  • repo_url: None
  • paper_authors: Majid Namazi, M. A. Hakim Newton, Conrad Sanderson, Abdul Sattar
  • for: solves the Travelling Thief Problem (TTP) by proposing a coordination-based approach that integrates human-designed and machine learning-based heuristics to improve solution quality.
  • methods: uses a combination of local search, human-designed coordination heuristics, and machine learning to explore cyclic tours and make item selections during collection plan exploration.
  • results: significantly outperforms existing state-of-the-art TTP solvers on a set of benchmark problems, demonstrating the effectiveness of the proposed coordination-based approach.
    Abstract A travelling thief problem (TTP) is a proxy to real-life problems such as postal collection. TTP comprises an entanglement of a travelling salesman problem (TSP) and a knapsack problem (KP) since items of KP are scattered over cities of TSP, and a thief has to visit cities to collect items. In TTP, city selection and item selection decisions need close coordination since the thief's travelling speed depends on the knapsack's weight and the order of visiting cities affects the order of item collection. Existing TTP solvers deal with city selection and item selection separately, keeping decisions for one type unchanged while dealing with the other type. This separation essentially means very poor coordination between two types of decision. In this paper, we first show that a simple local search based coordination approach does not work in TTP. Then, to address the aforementioned problems, we propose a human designed coordination heuristic that makes changes to collection plans during exploration of cyclic tours. We further propose another human designed coordination heuristic that explicitly exploits the cyclic tours in item selections during collection plan exploration. Lastly, we propose a machine learning based coordination heuristic that captures characteristics of the two human designed coordination heuristics. Our proposed coordination based approaches help our TTP solver significantly outperform existing state-of-the-art TTP solvers on a set of benchmark problems. Our solver is named Cooperation Coordination (CoCo) and its source code is available from https://github.com/majid75/CoCo
    摘要 “旅行盗僧问题”(TTP)是一个代表现实生活中的问题,如邮政收集。TTP包括一个旅行销售问题(TSP)和一个袋包问题(KP)的挂钮,因为KP中的物品是在TSP中的城市分散的,盗僧需要到城市集取物品。在TTP中,城市选择和物品选择的决策需要密切协调,因为盗僧的旅行速度取决于袋包的重量,并且城市顺序影响物品顺序收集。现有的TTP解决方案是分开处理城市选择和物品选择,对一种类型的决策不会改变,而对另一种类型的决策进行处理。这种分离实际上意味着两种决策之间的很 poor 的协调。在这篇论文中,我们首先显示了一个简单的本地搜索基于协调方法在TTP中不会工作。然后,为了解决上述问题,我们提出了一个人工设计的协调规则,在探索游历的过程中对收集计划进行修改。我们还提出了另一个人工设计的协调规则,在收集计划探索过程中明确地利用游历。 finally, we propose a machine learning based coordination heuristic that captures the characteristics of the two human-designed coordination heuristics. our proposed coordination-based approaches significantly outperform existing state-of-the-art TTP solvers on a set of benchmark problems. our solver is named Cooperation Coordination (CoCo), and its source code is available from .

No Privacy Left Outside: On the (In-)Security of TEE-Shielded DNN Partition for On-Device ML

  • paper_url: http://arxiv.org/abs/2310.07152
  • repo_url: https://github.com/ziqi-zhang/teeslice-artifact
  • paper_authors: Ziqi Zhang, Chen Gong, Yifeng Cai, Yuanyuan Yuan, Bingyan Liu, Ding Li, Yao Guo, Xiangqun Chen
  • For: The paper is focused on addressing the security challenges of on-device machine learning (ML) models, specifically the threats of model stealing (MS) and membership inference attack (MIA).* Methods: The paper proposes a novel technique called TEESlice, which partitions the DNN model into two parts before training to defend against MS and MIA during inference. TEESlice uses a partition-before-training strategy to accurately separate privacy-related weights from public weights.* Results: The paper presents experimental results that show TEESlice delivers the same security protection as shielding the entire DNN model inside a Trusted Execution Environment (TEE), but with over 10X less overhead and no accuracy loss compared to prior TSDP solutions. The paper also highlights the inherent difficulty in deciding optimal DNN partition configurations for present TSDP solutions and the variability of such configurations across datasets and models.
    Abstract On-device ML introduces new security challenges: DNN models become white-box accessible to device users. Based on white-box information, adversaries can conduct effective model stealing (MS) and membership inference attack (MIA). Using Trusted Execution Environments (TEEs) to shield on-device DNN models aims to downgrade (easy) white-box attacks to (harder) black-box attacks. However, one major shortcoming is the sharply increased latency (up to 50X). To accelerate TEE-shield DNN computation with GPUs, researchers proposed several model partition techniques. These solutions, referred to as TEE-Shielded DNN Partition (TSDP), partition a DNN model into two parts, offloading the privacy-insensitive part to the GPU while shielding the privacy-sensitive part within the TEE. This paper benchmarks existing TSDP solutions using both MS and MIA across a variety of DNN models, datasets, and metrics. We show important findings that existing TSDP solutions are vulnerable to privacy-stealing attacks and are not as safe as commonly believed. We also unveil the inherent difficulty in deciding optimal DNN partition configurations (i.e., the highest security with minimal utility cost) for present TSDP solutions. The experiments show that such ``sweet spot'' configurations vary across datasets and models. Based on lessons harvested from the experiments, we present TEESlice, a novel TSDP method that defends against MS and MIA during DNN inference. TEESlice follows a partition-before-training strategy, which allows for accurate separation between privacy-related weights from public weights. TEESlice delivers the same security protection as shielding the entire DNN model inside TEE (the ``upper-bound'' security guarantees) with over 10X less overhead (in both experimental and real-world environments) than prior TSDP solutions and no accuracy loss.
    摘要 ondevice ML引入新的安全挑战:DNN模型变成了设备用户可见的白盒模型。基于白盒信息,攻击者可以进行有效的模型窃取(MS)和会员推理攻击(MIA)。使用Trusted Execution Environments(TEEs)保护在设备上的DNN模型,以降低(容易)白盒攻击到(更加困难)黑盒攻击。然而,一个主要缺点是增加了响应时间(最多50倍)。为了加速TEE保护的DNN计算,研究人员提出了多种模型分割技术。这些解决方案被称为TEE-Shielded DNN Partition(TSDP),它将DNN模型分成两部分,将隐私敏感部分卷入TEE中,而隐私不敏感部分将被卷入GPU上。这篇论文对现有TSDP解决方案进行了MS和MIA的分别测试,并对多个DNN模型、数据集和指标进行了测试。我们发现现有TSDP解决方案容易受到隐私窃取攻击,并不如常被认为的安全。我们还发现决定最佳DNN分割配置(即最高安全性和最小实用成本)对现有TSDP解决方案是困难的。实验表明,这些“甜点”配置在不同的数据集和模型上具有差异。基于实验所获的经验,我们提出了TEESlice,一种新的TSDP方法。TEESlice采用分配before training的策略,允许准确地分化隐私相关的权重与公共权重。TEESlice提供了完全保护MS和MIA During DNN推理的安全保障,并且在实验和实际环境中具有10倍以上的性能优化,无损失 accuracy。

Determining Winners in Elections with Absent Votes

  • paper_url: http://arxiv.org/abs/2310.07150
  • repo_url: None
  • paper_authors: Qishen Han, Amélie Marian, Lirong Xia
  • for: 这个论文研究了选举中缺失选票的情况下,决定赢家的问题。
  • methods: 这篇论文使用了NP-完全理论和特殊的位置得分规则,以计算缺失选票情况下的赢家问题。
  • results: 论文表明,在投票 truncated 情况下,赢家问题是NP-完全的,而在特定的位置得分规则下,问题可以在多阶段时间内解决。
    Abstract An important question in elections is the determine whether a candidate can be a winner when some votes are absent. We study this determining winner with the absent votes (WAV) problem when the votes are top-truncated. We show that the WAV problem is NP-complete for the single transferable vote, Maximin, and Copeland, and propose a special case of positional scoring rule such that the problem can be computed in polynomial time. Our results in top-truncated rankings differ from the results in full rankings as their hardness results still hold when the number of candidates or the number of missing votes are bounded, while we show that the problem can be solved in polynomial time in either case.
    摘要 <>转换给定文本到简化中文。<>选举中一个重要问题是确定缺失票的候选人是否可以赢得选举。我们研究缺失票确定赢家问题(WAV问题),当投票是top-truncated时。我们显示WAV问题是NP-完全的 для单轮转移投票、最大最小值和冠军得分方式,并提出一种特殊情况的 pozitional scoring rule,使得问题可以在多阶段时间内解决。我们的结果在top-truncated排名中与全排名的结果不同,因为他们的困难结果仍然在缺失票或候选人数量是有限的情况下仍然成立,而我们则显示问题可以在任一情况下解决在多阶段时间内。Note: "简化中文" (Simplified Chinese) is a standardized form of Chinese used in mainland China and Singapore, while " tradicional中文" (Traditional Chinese) is used in Hong Kong, Macau, and Taiwan.

Denoising Task Routing for Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.07138
  • repo_url: None
  • paper_authors: Byeongjun Park, Sangmin Woo, Hyojun Go, Jin-Young Kim, Changick Kim
  • for: 这篇论文的目的是提出一种简单的添加策略,以提高 diffusion 模型中的多任务学习(MTL)性能。
  • methods: 该策略基于 selectively 活跃 diffusion 模型中的多个核心通道,以实现不同任务之间的信息路径分离。
  • results: experiments 表明,该策略可以提高 diffusion 模型的性能,并且不需要添加额外参数。此外,该策略还可以加速训练过程的收敛。
    Abstract Diffusion models generate highly realistic images through learning a multi-step denoising process, naturally embodying the principles of multi-task learning (MTL). Despite the inherent connection between diffusion models and MTL, there remains an unexplored area in designing neural architectures that explicitly incorporate MTL into the framework of diffusion models. In this paper, we present Denoising Task Routing (DTR), a simple add-on strategy for existing diffusion model architectures to establish distinct information pathways for individual tasks within a single architecture by selectively activating subsets of channels in the model. What makes DTR particularly compelling is its seamless integration of prior knowledge of denoising tasks into the framework: (1) Task Affinity: DTR activates similar channels for tasks at adjacent timesteps and shifts activated channels as sliding windows through timesteps, capitalizing on the inherent strong affinity between tasks at adjacent timesteps. (2) Task Weights: During the early stages (higher timesteps) of the denoising process, DTR assigns a greater number of task-specific channels, leveraging the insight that diffusion models prioritize reconstructing global structure and perceptually rich contents in earlier stages, and focus on simple noise removal in later stages. Our experiments demonstrate that DTR consistently enhances the performance of diffusion models across various evaluation protocols, all without introducing additional parameters. Furthermore, DTR contributes to accelerating convergence during training. Finally, we show the complementarity between our architectural approach and existing MTL optimization techniques, providing a more complete view of MTL within the context of diffusion training.
    摘要 Diffusion models可以生成非常真实的图像,通过学习多步降噪过程,自然地启用多任务学习(MTL)的原理。尽管涉及到的连接在 diffusion models 和 MTL 之间,仍然有一个未探索的领域,那就是设计神经网络架构,以显式地包含 MTL 在 diffusion models 中。在这篇论文中,我们提出了一种简单的扩展策略,即 Denoising Task Routing(DTR),可以让现有的 diffusion model 架构中设置独特的信息通路,以便每个任务在单一架构中有自己的信息通路。DTR 的实现方式很有趣,它通过在模型中选择性地启用多个通道来实现这一点。具体来说,DTR 在不同的时间步骤中启用不同的通道,使得模型可以在不同的时间步骤中完成不同的任务。此外,DTR 还可以根据任务之间的相互关系来启用相应的通道,从而使得模型可以更好地利用多任务的相互关系。我们的实验结果表明,DTR 可以一直提高 diffusion models 的性能,无需添加额外参数。此外,DTR 还可以加速训练过程的收敛。最后,我们还证明了 DTR 和现有的 MTL 优化技术之间的相互关系,从而提供了更全面的 MTL 视角,以便更好地理解 diffusion 训练中的多任务学习。

Off-Policy Evaluation for Human Feedback

  • paper_url: http://arxiv.org/abs/2310.07123
  • repo_url: None
  • paper_authors: Qitong Gao, Ge Gao, Juncheng Dong, Vahid Tarokh, Min Chi, Miroslav Pajic
  • for: 用于评估人工奖励信号(HF)的精度,提高RL在医疗等领域的安全性和效率。
  • methods: 基于IHR恢复方法和环境知识储存的幂等空间,对HF信号进行精度评估。
  • results: 在实验中,比直接使用现有OPE方法而言,我们的方法显著提高了HF信号的精度评估表现。
    Abstract Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL), by estimating performance and/or rank of target (evaluation) policies using offline trajectories only. It can improve the safety and efficiency of data collection and policy testing procedures in situations where online deployments are expensive, such as healthcare. However, existing OPE methods fall short in estimating human feedback (HF) signals, as HF may be conditioned over multiple underlying factors and is only sparsely available; as opposed to the agent-defined environmental rewards (used in policy optimization), which are usually determined over parametric functions or distributions. Consequently, the nature of HF signals makes extrapolating accurate OPE estimations to be challenging. To resolve this, we introduce an OPE for HF (OPEHF) framework that revives existing OPE methods in order to accurately evaluate the HF signals. Specifically, we develop an immediate human reward (IHR) reconstruction approach, regularized by environmental knowledge distilled in a latent space that captures the underlying dynamics of state transitions as well as issuing HF signals. Our approach has been tested over two real-world experiments, adaptive in-vivo neurostimulation and intelligent tutoring, as well as in a simulation environment (visual Q&A). Results show that our approach significantly improves the performance toward estimating HF signals accurately, compared to directly applying (variants of) existing OPE methods.
    摘要 <>Off-policy evaluation (OPE) 是关键性的,可以减少在训练和评估强化学习(RL)中线上部署的成本,通过使用停滞轨迹来估计目标(评估)策略的性能和/或排名。它可以提高数据采集和策略测试的安全性和效率,特别是在医疗领域,因为在线部署是昂贵的。然而,现有的 OPE 方法无法准确地估计人类反馈(HF)信号,因为 HF 可能会受到多个下面因素的影响,并且只有稀缺的可用;相比之下,agent-defined 环境奖励(用于策略优化)通常是基于参数函数或分布来定义的。因此,HF 信号的自然特点使得 extrapolating 准确的 OPE 估计变得挑战。为解决这一问题,我们介绍了一个 OPEHF 框架,该框架可以重新使用现有的 OPE 方法,以准确地评估 HF 信号。specifically,我们开发了一种立即人类奖励(IHR)重构方法,该方法在环境知识捕获层中进行Regularization,以捕捉状态转移的下面动态和发出 HF 信号。我们的方法在两个实际实验(adaptive in-vivo neurostimulation和智能教学)以及一个模拟环境(视觉问答)中进行测试,结果表明,我们的方法可以准确地估计 HF 信号,相比直接应用(变种的)现有 OPE 方法。Note: The translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you prefer Traditional Chinese, please let me know and I can provide the translation in that form instead.

The Temporal Structure of Language Processing in the Human Brain Corresponds to The Layered Hierarchy of Deep Language Models

  • paper_url: http://arxiv.org/abs/2310.07106
  • repo_url: None
  • paper_authors: Ariel Goldstein, Eric Ham, Mariano Schain, Samuel Nastase, Zaid Zada, Avigail Dabush, Bobbi Aubrey, Harshvardhan Gazula, Amir Feder, Werner K Doyle, Sasha Devore, Patricia Dugan, Daniel Friedman, Roi Reichart, Michael Brenner, Avinatan Hassidim, Orrin Devinsky, Adeen Flinker, Omer Levy, Uri Hasson
  • for: 这paper的目的是用Deep Language Models(DLMs)来理解人类大脑中自然语言处理的机制。
  • methods: 这paper使用了层次序列的连续数值向量来表示单词和上下文,从而开拓了一系列的应用,如人类化文本生成。
  • results: 这paper表明了DLM层次结构可以模型人类大脑中语言理解的时间动力学,并通过对ECoG数据的使用,实现了更高的时间分辨率。结果表明DLM和人类大脑的语言处理有关系,DLM层次结构的层次积累信息与人类大脑高级语言区域的神经活动相吻合。
    Abstract Deep Language Models (DLMs) provide a novel computational paradigm for understanding the mechanisms of natural language processing in the human brain. Unlike traditional psycholinguistic models, DLMs use layered sequences of continuous numerical vectors to represent words and context, allowing a plethora of emerging applications such as human-like text generation. In this paper we show evidence that the layered hierarchy of DLMs may be used to model the temporal dynamics of language comprehension in the brain by demonstrating a strong correlation between DLM layer depth and the time at which layers are most predictive of the human brain. Our ability to temporally resolve individual layers benefits from our use of electrocorticography (ECoG) data, which has a much higher temporal resolution than noninvasive methods like fMRI. Using ECoG, we record neural activity from participants listening to a 30-minute narrative while also feeding the same narrative to a high-performing DLM (GPT2-XL). We then extract contextual embeddings from the different layers of the DLM and use linear encoding models to predict neural activity. We first focus on the Inferior Frontal Gyrus (IFG, or Broca's area) and then extend our model to track the increasing temporal receptive window along the linguistic processing hierarchy from auditory to syntactic and semantic areas. Our results reveal a connection between human language processing and DLMs, with the DLM's layer-by-layer accumulation of contextual information mirroring the timing of neural activity in high-order language areas.
    摘要

ClausewitzGPT Framework: A New Frontier in Theoretical Large Language Model Enhanced Information Operations

  • paper_url: http://arxiv.org/abs/2310.07099
  • repo_url: None
  • paper_authors: Benjamin Kereopa-Yorke
  • for: This paper aims to provide a framework for navigating the risks and challenges of Large Language Models (LLMs) and autonomous AI agents in the context of information operations.
  • methods: The paper uses a novel formulation called the “ClausewitzGPT” equation to quantify the risks of LLM-augmented operations and emphasizes the importance of ethical considerations and autonomous AI agents in ensuring a moral compass and societal imperatives.
  • results: The paper highlights the staggering year-on-year growth of AI information campaigns and emphasizes the urgency of addressing the challenges and risks of LLMs and autonomous AI agents in the context of information operations.Here are the three points in Simplified Chinese text:
  • for: 这篇论文目标是为了提供LLM和自动化AI代理在信息操作中 navigating 技术飞速带来的风险和挑战的框架。
  • methods: 论文使用一种新的方法,称为“ClausewitzGPT”方程,以量化LLM增强操作的风险,并强调了伦理考虑和自动化AI代理在保持道德 компас和社会要求方面的重要性。
  • results: 论文强调了年度增长的AI信息运动,并强调了现在的杰uncje点,需要我们通过加强LLM和自动化AI代理的风险和挑战来应对。
    Abstract In a digital epoch where cyberspace is the emerging nexus of geopolitical contention, the melding of information operations and Large Language Models (LLMs) heralds a paradigm shift, replete with immense opportunities and intricate challenges. As tools like the Mistral 7B LLM (Mistral, 2023) democratise access to LLM capabilities (Jin et al., 2023), a vast spectrum of actors, from sovereign nations to rogue entities (Howard et al., 2023), find themselves equipped with potent narrative-shaping instruments (Goldstein et al., 2023). This paper puts forth a framework for navigating this brave new world in the "ClausewitzGPT" equation. This novel formulation not only seeks to quantify the risks inherent in machine-speed LLM-augmented operations but also underscores the vital role of autonomous AI agents (Wang, Xie, et al., 2023). These agents, embodying ethical considerations (Hendrycks et al., 2021), emerge as indispensable components (Wang, Ma, et al., 2023), ensuring that as we race forward, we do not lose sight of moral compasses and societal imperatives. Mathematically underpinned and inspired by the timeless tenets of Clausewitz's military strategy (Clausewitz, 1832), this thesis delves into the intricate dynamics of AI-augmented information operations. With references to recent findings and research (Department of State, 2023), it highlights the staggering year-on-year growth of AI information campaigns (Evgeny Pashentsev, 2023), stressing the urgency of our current juncture. The synthesis of Enlightenment thinking, and Clausewitz's principles provides a foundational lens, emphasising the imperative of clear strategic vision, ethical considerations, and holistic understanding in the face of rapid technological advancement.
    摘要 在数字时代,虚拟空间成为地opolitical竞争的emerging nexus,information操作和大型自然语言模型(LLM)的融合标志着一种新的 paradigm shift,具有巨大的机遇和复杂的挑战。Tools like Mistral 7B LLM(Mistral,2023)通过 démocratising access to LLM capabilities(Jin et al., 2023),让各种actor,从主权国家到黑帮(Howard et al., 2023),拥有高效的叙述形成工具。这篇论文提出了一种用于 navigate这个勇敢的新世界的“ClausewitzGPT”方程。这种新的方程不仅试图量化机器速度下LLM-加速的风险,而且强调了自主AI代理(Wang, Xie, et al., 2023)的重要性。这些代理,具有伦理考虑(Hendrycks et al., 2021),在我们前进的过程中变得不可或缺。这篇论文受数学基础和Clausewitz的 воен略思想(Clausewitz, 1832)的激发,探讨了人工智能加速的信息操作动态。通过参考最新的发现和研究(Department of State, 2023),这篇论文强调了艺术智能信息活动的年度增长(Evgeny Pashentsev, 2023),强调当前的战略危机性。通过融合Enlightenment思想和Clausewitz的原则,这篇论文提供了一种基本的镜像,强调了在快速技术进步的面前,我们必须具备清晰的战略视野、伦理考虑和整体理解。

Sparse Universal Transformer

  • paper_url: http://arxiv.org/abs/2310.07096
  • repo_url: https://github.com/shawntan/SUT
  • paper_authors: Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan
  • for: 本研究旨在提出一种叫做 sparse universal transformer(SUT),用于提高 universal transformer(UT)的计算复杂度和参数效率,同时保持其基本特性和泛化能力。
  • methods: 本研究使用了 sparse mixture of experts(SMoE)和一种新的棒拌分解方式来降低 UT 的计算复杂度,并且提出了一种新的停止机制来控制计算复杂度。
  • results: 实验表明,SUT 可以与强基线模型相当的性能,仅使用了半个计算和参数量,并且在正式语言任务(Logical inference和CFQ)上显示了强大的泛化能力。此外,新的停止机制还可以在推理过程中减少计算量约 50%,而无须减少性能。
    Abstract The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter-sharing also affords it better parameter efficiency than VTs. Despite its many advantages, scaling UT parameters is much more compute and memory intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism to reduce UT's computation complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT achieves the same performance as strong baseline models while only using half computation and parameters on WMT'14 and strong generalization results on formal language tasks (Logical inference and CFQ). The new halting mechanism also enables around 50\% reduction in computation during inference with very little performance decrease on formal language tasks.
    摘要 《 universal transformer(UT)是一种变体的transformer,它在层之间共享参数。实验证明,UT在正式语言任务上有更好的compositional generalizationthan Vanilla Transformers(VT)。此外,UT的参数共享还使其的参数使用效率更高than VT。尽管它有许多优点,但是扩展UT参数的计算复杂度会比VT的计算复杂度更高。这篇论文提出了Sparse Universal Transformer(SUT),它利用Sparse Mixture of Experts(SMoE)和一种基于扔投的新动态停止机制来降低UT的计算复杂度,保持UT的参数效率和泛化能力。实验表明,SUT可以与强基eline模型相当的性能,仅使用半个计算和参数来处理WMT'14和正式语言任务(逻辑推理和CFQ)。新的停止机制还可以在推理过程中降低计算量约50%,而无需减少正式语言任务中的性能。

Jaeger: A Concatenation-Based Multi-Transformer VQA Model

  • paper_url: http://arxiv.org/abs/2310.07091
  • repo_url: None
  • paper_authors: Jieting Long, Zewei Shi, Penghao Jiang, Yidong Gan
  • for: 提高文档视觉问答的表现,增强语义含义的准确性和细化的multimodal检索。
  • methods: 利用大语言和开放世界的先前模型,如RoBERTa大型和GPT2-xl,作为特征提取器,并对其输出进行 concatenation 操作,以实现多源信息的同时考虑。
  • results: 在 Task C 的 PDF-VQA 数据集上实现竞争性表现。
    Abstract Document-based Visual Question Answering poses a challenging task between linguistic sense disambiguation and fine-grained multimodal retrieval. Although there has been encouraging progress in document-based question answering due to the utilization of large language and open-world prior models\cite{1}, several challenges persist, including prolonged response times, extended inference durations, and imprecision in matching. In order to overcome these challenges, we propose Jaegar, a concatenation-based multi-transformer VQA model. To derive question features, we leverage the exceptional capabilities of RoBERTa large\cite{2} and GPT2-xl\cite{3} as feature extractors. Subsequently, we subject the outputs from both models to a concatenation process. This operation allows the model to consider information from diverse sources concurrently, strengthening its representational capability. By leveraging pre-trained models for feature extraction, our approach has the potential to amplify the performance of these models through concatenation. After concatenation, we apply dimensionality reduction to the output features, reducing the model's computational effectiveness and inference time. Empirical results demonstrate that our proposed model achieves competitive performance on Task C of the PDF-VQA Dataset. If the user adds any new data, they should make sure to style it as per the instructions provided in previous sections.
    摘要 文档视觉问答 зада题存在语义含义涉及和细腻多媒体检索的挑战。虽然因使用大语言和开放世界先进模型\cite{1}而取得了鼓舞人的进步,但还存在许多挑战,包括长时间响应、延长推理时间和匹配不准确。为了解决这些挑战,我们提议Jaegar,一种 concatenation-based 多变换 VQA 模型。为了 derivate 问题特征,我们利用 RoBERTa 大\cite{2} 和 GPT2-xl\cite{3} 作为特征提取器。然后,我们将两个模型的输出经 concatenation 操作,以便同时考虑不同来源的信息,提高模型的表达能力。通过利用预训练模型来提取特征,我们的方法可能会强化这些模型的表现。然后,我们对输出特征进行维度缩放,以降低模型的计算效率和推理时间。实验结果表明,我们提议的模型在 PDF-VQA 数据集的 Task C 上达到了竞争性的性能。如果用户添加任何新数据,他们应该按照以前提供的指导方针进行风格化处理。

Diversity of Thought Improves Reasoning Abilities of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07088
  • repo_url: None
  • paper_authors: Ranjita Naik, Varun Chandrasekaran, Mert Yuksekgonul, Hamid Palangi, Besmira Nushi
  • for: 提高 LLM 在需要复杂逻辑的设置中的表现
  • methods: 使用多种生成和修改解码步骤来 ensemble 多个生成,以提高模型表现
  • results: 在固定生成预算下,DIV-SE 和 IDIV-SE 在多个逻辑测试上表现出色,超过了之前的基线值,而且在最Difficult Blocksworld任务上达到了最高的29.6%的提升率。
    Abstract Large language models (LLMs) are documented to struggle in settings that require complex reasoning. Nevertheless, instructing the model to break down the problem into smaller reasoning steps (Wei et al., 2022), or ensembling various generations through modifying decoding steps (Wang et al., 2023) boosts performance. Current methods assume that the input prompt is fixed and expect the decoding strategies to introduce the diversity needed for ensembling. In this work, we relax this assumption and discuss how one can create and leverage variations of the input prompt as a means to diversity of thought to improve model performance. We propose a method that automatically improves prompt diversity by soliciting feedback from the LLM to ideate approaches that fit for the problem. We then ensemble the diverse prompts in our method DIV-SE (DIVerse reasoning path Self-Ensemble) across multiple inference calls. We also propose a cost-effective alternative where diverse prompts are used within a single inference call; we call this IDIV-SE (In-call DIVerse reasoning path Self-Ensemble). Under a fixed generation budget, DIV-SE and IDIV-SE outperform the previously discussed baselines using both GPT-3.5 and GPT-4 on several reasoning benchmarks, without modifying the decoding process. Additionally, DIV-SE advances state-of-the-art performance on recent planning benchmarks (Valmeekam et al., 2023), exceeding the highest previously reported accuracy by at least 29.6 percentage points on the most challenging 4/5 Blocksworld task. Our results shed light on how to enforce prompt diversity toward LLM reasoning and thereby improve the pareto frontier of the accuracy-cost trade-off.
    摘要 大型语言模型(LLM)在需要复杂推理的设定下 frequently struggle (Wei et al., 2022)。然而,将问题拆分成小推理步骤(Wei et al., 2022)或者聚合不同代表(Wang et al., 2023)可以提高表现。现有的方法假设输入提示是固定的,并且预期解码策略可以引入充分的多样性。在这个工作中,我们实际松动这个假设,并讨论如何将输入提示的多样性引入到模型中,以提高模型表现。我们提出了一个方法,可以自动提高提示多样性,通过从LLM获取反馈,以便发展适合问题的方法。我们然后聚合这些多样的提示,在多个推理步骤中进行ensemble。我们还提出了一个成本效益高的代替方案,使用多个不同的提示在单一的推理步骤中进行ensemble。在固定的生成预算下,DIV-SE和IDIV-SE在多个推理benchmark上表现出色,不需要修改解码过程。此外,DIV-SE在最近的规划benchmark上进一步提高了州时的表现,至少比前一代最高的accuracy报告提高29.6个百分点。我们的结果 shed light on how to enforce提示多样性在LLM推理中,并因此提高精度-成本费用的 pareto frontier。

Leveraging Twitter Data for Sentiment Analysis of Transit User Feedback: An NLP Framework

  • paper_url: http://arxiv.org/abs/2310.07086
  • repo_url: None
  • paper_authors: Adway Das, Abhishek Kumar Prajapati, Pengxiang Zhang, Mukund Srinath, Andisheh Ranjbari
  • for: 本研究は、便宜なソーシャルメディアデータを利用して、ユーザーフィードバックを収集するための新しいNLP基盘フレームワークを提案しています。
  • methods: このフレームワークでは、ツイートをクラスタリングするためのフィーウェイ学习を使用し、ツイートの感情を评価するためのレキシコンベースの感情分析モデルを适用します。
  • results: この研究では、ニューヨーク市の地下铁システムを例として、このフレームワークを适用して、ユーザーの感想を捉えることができました。结果は、安全性、信赖性、および保守の3つのカテゴリーに分类されたツイートを正确に识别することができました。また、感情の强さと方向性を评価することができました。この结果は、便宜なソーシャルメディアデータを使用してユーザーフィードバックを収集することが有效であることを证明しています。
    Abstract Traditional methods of collecting user feedback through transit surveys are often time-consuming, resource intensive, and costly. In this paper, we propose a novel NLP-based framework that harnesses the vast, abundant, and inexpensive data available on social media platforms like Twitter to understand users' perceptions of various service issues. Twitter, being a microblogging platform, hosts a wealth of real-time user-generated content that often includes valuable feedback and opinions on various products, services, and experiences. The proposed framework streamlines the process of gathering and analyzing user feedback without the need for costly and time-consuming user feedback surveys using two techniques. First, it utilizes few-shot learning for tweet classification within predefined categories, allowing effective identification of the issues described in tweets. It then employs a lexicon-based sentiment analysis model to assess the intensity and polarity of the tweet sentiments, distinguishing between positive, negative, and neutral tweets. The effectiveness of the framework was validated on a subset of manually labeled Twitter data and was applied to the NYC subway system as a case study. The framework accurately classifies tweets into predefined categories related to safety, reliability, and maintenance of the subway system and effectively measured sentiment intensities within each category. The general findings were corroborated through a comparison with an agency-run customer survey conducted in the same year. The findings highlight the effectiveness of the proposed framework in gauging user feedback through inexpensive social media data to understand the pain points of the transit system and plan for targeted improvements.
    摘要 传统的公共交通User feedback收集方法经常是时间consuming、资源占用和成本高的。在这篇论文中,我们提出了一种基于自然语言处理(NLP)的框架,利用社交媒体平台 like Twitter 上的丰富、便宜的用户生成内容来了解用户对不同服务问题的看法。Twitter 是一个 Microblogging 平台,它上有大量的实时用户生成内容,这些内容经常包含有价值的反馈和意见。我们的框架可以快速地收集和分析用户反馈,不需要费时和费力的用户反馈调查。我们使用了 few-shot learning 来分类 tweet,并使用 sentiment analysis 模型来评估 tweet 的情感 INTENSITY和方向。我们验证了该框架的有效性,并将其应用到纽约市地铁系统作为案例研究。结果表明,框架可以有效地将 tweet 分类到预定义的安全、可靠性和维护等类别中,并准确地评估每个类别中的情感 INTENSITY。我们的发现得到了公共交通机构所运行的客户调查的支持,这些发现反映了该框架在估计用户反馈的有效性。

cs.CL - 2023-10-11

Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models

  • paper_url: http://arxiv.org/abs/2310.07929
  • repo_url: None
  • paper_authors: Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen
  • for: 这研究探讨了多语言模型是否共享抽象语法表示形式,以及这些表示形式是如何发展的。
  • methods: 作者使用了结构预导来测试模型输出中的抽象语法表示形式,并将这种方法应用到了荷兰语-英语双语设定中。他们还评估了一个荷兰语-英语语言模型在预训练时的表现。
  • results: 研究发现,在接受第二语言后,跨语言结构预导效果很快出现,只需要少于100万个字的数据。这些结果有关于数据污染、低资源传输和多语言模型中抽象语法表示形式的发展。
    Abstract Do multilingual language models share abstract grammatical representations across languages, and if so, when do these develop? Following Sinclair et al. (2022), we use structural priming to test for abstract grammatical representations with causal effects on model outputs. We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training. We find that crosslingual structural priming effects emerge early after exposure to the second language, with less than 1M tokens of data in that language. We discuss implications for data contamination, low-resource transfer, and how abstract grammatical representations emerge in multilingual models.
    摘要 请参考Sinclair等(2022),我们使用结构驱动来测试多语言模型中的抽象语法表示。我们将该方法扩展到荷兰语-英语双语设置,并评估一个荷兰语-英语语言模型在预训练期间。我们发现,在接触第二语言后不久,跨语言结构驱动效果便出现了,仅需要少于1M个Token的数据。我们讨论了数据污染、低资源传输和多语言模型中抽象语法表示的起源。

The Expressive Power of Transformers with Chain of Thought

  • paper_url: http://arxiv.org/abs/2310.07923
  • repo_url: None
  • paper_authors: William Merrill, Ashish Sabharwal
  • for: 这个研究探讨了使用Transformer进行语言理解和计算的能力。
  • methods: 研究人员使用了一种”链条思维”或”笔记簿”的方法,即在回答输入前生成和条件 intermediate tokens。
  • results: 研究人员发现,使用 intermediate generation 可以提高 transformer 的计算能力,但是Amount of increase 取决于 intermediate generation 的数量。例如,使用 logarithmic 数量的 decoding steps 只能 marginally 提高 transformer 的能力,而 linear 数量的 decoding steps 可以认为是一种新的能力(以标准复杂性 conjectures),可以识别所有的 regular languages。此外,研究人员还发现,使用 polynomial 数量的 decoding steps 可以识别所有的 polynomial-time solvable problems,这是第一次对 transformer 的类型进行了正确的 Characterization 。
    Abstract Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
    摘要 (Simplified Chinese translation)最近的理论研究发现了一些奇异的简单逻辑问题,如图中两个节点是连接的检查或模拟 finite-state machine,是不可避免地由标准 transformer 所解决的。但在实践中, transformer 的逻辑可以通过允许它们使用 "链式思维" 或 "笔记",即生成并条件于输入的一系列间接符号,来改进。这引发了我们的问题:这种间接生成是否fundamentally 扩展了 decoder-only transformer 的计算能力?我们显示,答案是 yes,但间接生成的数量对计算能力的提高有关键的影响。例如,我们发现,对输入长度的 logarithmic 数量的解oding步可以只很小地推动标准 transformer,而 linear 数量的解oding步可以添加一个明确的新能力(根据标准复杂性假设):recognize 所有的 Regular 语言。我们的结果还表明,linear 步骤可以将 transformer decoder 限制在 context-sensitive 语言中,而 polynomial 步骤可以使其recognize 所有的 polynomial-time solvable problems ,这是首次对 transformer 类型的一种描述。总之,我们的结果提供了一个细化的框架,用于理解 transformer 的 chain of thought 或 scratchpad 的长度如何影响其逻辑能力。

Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention

  • paper_url: http://arxiv.org/abs/2310.07911
  • repo_url: None
  • paper_authors: Huiyin Xue, Nikolaos Aletras
  • for: 降低自适应语言模型的内存需求,提高自然语言处理任务的性能。
  • methods: 基于transformer中的位嵌入,提出了一种简化多头注意力(MHA)机制的代替模块,使用单个共享投影矩阵和多头嵌入(MHE)。
  • results: 对多个下游任务进行实验,证明MHE注意力可以减少大量内存需求,同时保持高预测性能率。相比之下,MHA需要更多的参数($(3n^2-3n)d^2-3nd$)。
    Abstract Scaling pre-trained language models has resulted in large performance gains in various natural language processing tasks but comes with a large cost in memory requirements. Inspired by the position embeddings in transformers, we aim to simplify and reduce the memory footprint of the multi-head attention (MHA) mechanism. We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE), i.e. one per head. We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms while achieving high predictive performance retention ratio to vanilla MHA on several downstream tasks. MHE attention only requires a negligible fraction of additional parameters ($3nd$, where $n$ is the number of attention heads and $d$ the size of the head embeddings) compared to a single-head attention, while MHA requires $(3n^2-3n)d^2-3nd$ additional parameters.
    摘要 <>预训练语言模型的扩大已经导致了许多自然语言处理任务中的性能提升,但是它们的内存需求却很大。 Drawing inspiration from transformers的位嵌入,我们想要简化并减少多头注意(MHA)机制的内存占用。我们提出了一种替代模块,它使用单个共享投影矩阵和多个头嵌入(MHE),即每个头都有一个嵌入。我们实际测试了我们的MHE注意力,并证明它与替代注意力机制相比,具有较高的内存效率和预测性能保留率。MHE注意力只需要negligible fraction of additional parameters($3nd$, where $n$ is the number of attention heads and $d$ is the size of the head embeddings),而MHA需要 $(3n^2-3n)d^2-3nd$ 额外参数。

Assessing Evaluation Metrics for Neural Test Oracle Generation

  • paper_url: http://arxiv.org/abs/2310.07856
  • repo_url: None
  • paper_authors: Jiho Shin, Hadi Hemmati, Moshi Wei, Song Wang
  • for: 本研究は现有的oracle生成研究 plus ChatGPTを用于实际调查当前的性能水平,包括NLG基于和测试充分性metric。
  • methods: 我们训练并运行四种state-of-the-art测试oracle生成模型,并对五种NLG基于和两种测试充分性metric进行分析。
  • results: 我们发现NLG基于metric和测试充分性metric之间没有显著相关性。例如,通过ChatGPT生成的activemq-artemis项目的oracles在所有studied NOGs中的NLG基于metric最高,但它们在所有studied NOGs中测试充分性metric最低。我们进行质量分析,发现oracles with high NLG-based metrics but low test adequacy metrics tend to have complex or multiple chained method invocations within the oracle’s parameters, making it hard for the model to generate completely, affecting the test adequacy metrics。
    Abstract In this work, we revisit existing oracle generation studies plus ChatGPT to empirically investigate the current standing of their performance in both NLG-based and test adequacy metrics. Specifically, we train and run four state-of-the-art test oracle generation models on five NLG-based and two test adequacy metrics for our analysis. We apply two different correlation analyses between these two different sets of metrics. Surprisingly, we found no significant correlation between the NLG-based metrics and test adequacy metrics. For instance, oracles generated from ChatGPT on the project activemq-artemis had the highest performance on all the NLG-based metrics among the studied NOGs, however, it had the most number of projects with a decrease in test adequacy metrics compared to all the studied NOGs. We further conduct a qualitative analysis to explore the reasons behind our observations, we found that oracles with high NLG-based metrics but low test adequacy metrics tend to have complex or multiple chained method invocations within the oracle's parameters, making it hard for the model to generate completely, affecting the test adequacy metrics. On the other hand, oracles with low NLG-based metrics but high test adequacy metrics tend to have to call different assertion types or a different method that functions similarly to the ones in the ground truth. Overall, this work complements prior studies on test oracle generation with an extensive performance evaluation with both NLG and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software test generation in the future.
    摘要 在这项研究中,我们对现有的oracle生成研究进行了评估,并与ChatGPT进行了实验性的研究,以评估现有的性能水平。我们训练并运行了四种state-of-the-art测试 oracle生成模型,并对五种NLG基于的和两种测试准确性度量进行了分析。我们应用了两种不同的相关分析方法 между这两种不同的度量。结果显示,NLG基于的度量和测试准确性度量之间没有显著的相关性。例如,由ChatGPT生成的活动mq-artemis的oracles在所有研究的NOG中表现最高,但它在所有研究的NOG中有最多的项目测试准确性度量下降。我们进行了质量分析,以探究这些观察结果的原因。我们发现,NLG基于的度量低,但测试准确性度量高的oracles通常有复杂的或多个链接的方法调用在参数中,使模型生成完全难以,affecting测试准确性度量。相反,NLG基于的度量高,但测试准确性度量低的oracles通常有不同的断言类型或类似于ground truth中的方法调用。总的来说,这项研究补充了先前的测试 oracle生成研究,并提供了未来深度学习应用软件测试生成的评估指南。

Framework for Question-Answering in Sanskrit through Automated Construction of Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2310.07848
  • repo_url: None
  • paper_authors: Hrishikesh Terdalkar, Arnab Bhattacharya
  • for: 本研究旨在提取梵语(sa\d{m}sk\d{r}ta)文献中的知识,并使用知识图来回答问题。
  • methods: 本研究使用自然语言问答系统和知识图来回答问题。
  • results: 研究表明,使用知识图可以回答约50%的问题。同时,研究还分析了系统的缺点并提出了可能的改进方向。
    Abstract Sanskrit (sa\d{m}sk\d{r}ta) enjoys one of the largest and most varied literature in the whole world. Extracting the knowledge from it, however, is a challenging task due to multiple reasons including complexity of the language and paucity of standard natural language processing tools. In this paper, we target the problem of building knowledge graphs for particular types of relationships from sa\d{m}sk\d{r}ta texts. We build a natural language question-answering system in sa\d{m}sk\d{r}ta that uses the knowledge graph to answer factoid questions. We design a framework for the overall system and implement two separate instances of the system on human relationships from mah\=abh\=arata and r\=am\=aya\d{n}a, and one instance on synonymous relationships from bh\=avaprak\=a\'sa nigha\d{n}\d{t}u, a technical text from \=ayurveda. We show that about 50% of the factoid questions can be answered correctly by the system. More importantly, we analyse the shortcomings of the system in detail for each step, and discuss the possible ways forward.
    摘要 sanskrit (sa\d{m}sk\d{r}ta) 有世界上最大和最多样化的文学作品。然而,从它中提取知识是一项复杂的任务,主要因为 sanskrit 语言的复杂性和自然语言处理工具的缺乏。在这篇论文中,我们面临的问题是从 sa\d{m}sk\d{r}ta 文本中建立知识图。我们开发了一套在 sa\d{m}sk\d{r}ta 语言中建立自然语言问答系统的框架,并在人类关系、mah\=abh\=arata 和 r\=am\=aya\d{n}a 等领域中实现了两个独立的实例。此外,我们还在 bh\=avaprak\=a\'sa nigha\d{n}\d{t}u 等技术文献中实现了一个实例。我们发现,该系统可以对约50%的问题作出正确的答案。此外,我们还详细分析了系统的缺陷,并讨论了可能的进一步策略。

Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language Annotation

  • paper_url: http://arxiv.org/abs/2310.07826
  • repo_url: https://github.com/Antarlekhaka/code
  • paper_authors: Hrishikesh Terdalkar, Arnab Bhattacharya
  • for: 这篇论文旨在提高自然语言处理(NLP)技术的发展,尤其是为低资源语言的NLP技术提供更多的注释数据集,以便训练和测试机器学习模型。
  • methods: 这篇论文提出了一种名为Antarlekhaka的工具,用于手动注释NLP领域的广泛任务。该工具支持多个同时注释者,语言不限,可以在网络上部署,并且支持分布式注释。工具包含8种类型的用户友好的界面,以便注释8种NLP任务类型。这些任务类型包括2种语言学任务,即句子边界检测和选择正确的字符顺序,这两种任务都不是其他工具能处理。
  • results: 论文表明Antarlekhaka工具在对象评估中表现更好,并且已经在两种不同的语言上进行了两个实际注释任务。工具可以在 https://github.com/Antarlekhaka/code 上获取。
    Abstract One of the primary obstacles in the advancement of Natural Language Processing (NLP) technologies for low-resource languages is the lack of annotated datasets for training and testing machine learning models. In this paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators. The system sports user-friendly interfaces for 8 categories of annotation tasks. These, in turn, enable the annotation of a considerably larger set of NLP tasks. The task categories include two linguistic tasks not handled by any other tool, namely, sentence boundary detection and deciding canonical word order, which are important tasks for text that is in the form of poetry. We propose the idea of sequential annotation based on small text units, where an annotator performs several tasks related to a single text unit before proceeding to the next unit. The research applications of the proposed mode of multi-task annotation are also discussed. Antarlekhaka outperforms other annotation tools in objective evaluation. It has been also used for two real-life annotation tasks on two different languages, namely, Sanskrit and Bengali. The tool is available at https://github.com/Antarlekhaka/code.
    摘要 一个主要阻碍自然语言处理(NLP)技术的发展是低资源语言的Annotation dataset缺乏。在这篇论文中,我们介绍了Antarlekhaka工具,用于手动标注NLP相关的完整任务集。该工具兼容Unicode、语言不偏、Web部署和多个同时 annotators 支持分布式标注。系统支持8种类型的标注任务用户界面,可以对NLP任务进行更广泛的标注。任务类型包括两种语言学任务,即句子边界检测和推定正确的单词顺序,这些任务对文学作品中的文本非常重要。我们提出了基于小文本单元的顺序标注的想法,其中一个annotator先完成一个文本单元中的多个任务,然后进行下一个单元的标注。我们还讨论了这种多任务标注的研究应用。Antarlekhaka在对象评估中表现出色,并在两种不同语言的两个实际标注任务中使用。工具可以在https://github.com/Antarlekhaka/code 上下载。

Non-autoregressive Text Editing with Copy-aware Latent Alignments

  • paper_url: http://arxiv.org/abs/2310.07821
  • repo_url: https://github.com/yzhangcs/ctc-copy
  • paper_authors: Yu Zhang, Yue Zhang, Leyang Cui, Guohong Fu
  • for: 提高文本编辑效率和多语言泛化性
  • methods: 基于秘密CTC对编辑进行非自适应排序,引入复制操作以优化文本重叠管理
  • results: 在GEC和句子融合任务上实现了比基于Seq2Seq的显著提高(超过4倍),并且在德语和俄语上也达到了良好的泛化性。
    Abstract Recent work has witnessed a paradigm shift from Seq2Seq to Seq2Edit in the field of text editing, with the aim of addressing the slow autoregressive inference problem posed by the former. Despite promising results, Seq2Edit approaches still face several challenges such as inflexibility in generation and difficulty in generalizing to other languages. In this work, we propose a novel non-autoregressive text editing method to circumvent the above issues, by modeling the edit process with latent CTC alignments. We make a crucial extension to CTC by introducing the copy operation into the edit space, thus enabling more efficient management of textual overlap in editing. We conduct extensive experiments on GEC and sentence fusion tasks, showing that our proposed method significantly outperforms existing Seq2Edit models and achieves similar or even better results than Seq2Seq with over $4\times$ speedup. Moreover, it demonstrates good generalizability on German and Russian. In-depth analyses reveal the strengths of our method in terms of the robustness under various scenarios and generating fluent and flexible outputs.
    摘要 最近的工作受到了Seq2Seq到Seq2Edit的 paradigm shift,以解决前者的慢进 autoregressive inference 问题。然而,Seq2Edit 方法仍然面临着一些挑战,如生成灵活性不足和通用性不强。在这项工作中,我们提出了一种新的非autoregressive文本编辑方法,通过使用 latent CTC 对应关系来模型编辑过程。我们对 CTC 进行了重要扩展,在编辑空间中引入了复制操作,从而更有效地管理文本重叠。我们对 GEC 和 sentence fusion 任务进行了广泛的实验,显示了我们的提出方法在现有 Seq2Edit 模型的比较和 Seq2Seq 模型的 $4\times$ 速度提升。此外,它还能够在德语和俄语上显示出良好的通用性。深入分析表明,我们的方法在不同情况下具有强大的Robustness和生成灵活的输出。

Faithfulness Measurable Masked Language Models

  • paper_url: http://arxiv.org/abs/2310.07819
  • repo_url: https://github.com/AndreasMadsen/faithfulness-measurable-models
  • paper_authors: Andreas Madsen, Siva Reddy, Sarath Chandar
  • for: 本研究旨在提供一种可靠度度量,以评估 NLP 模型中Token的重要性。
  • methods: 本研究使用一种新的 fine-tuning 方法,将掩码Token进行特殊处理,使其成为内存 Distribution 的一部分。
  • results: 本研究通过对多种任务进行应用和统计分布测试,证明了该方法的可靠度度量的有效性。此外,由于掩码Token已经成为内存 Distribution 的一部分,因此 NLP 模型中Token的重要性度量也变得更加可靠。
    Abstract A common approach to explain NLP models, is to use importance measures that express which tokens are important for a prediction. Unfortunately, such explanations are often wrong despite being persuasive. Therefore, it is essential to measure their faithfulness. One such metric is if tokens are truly important, then masking them should result in worse model performance. However, token masking introduces out-of-distribution issues and existing solutions are computationally expensive and employ proxy-models. Furthermore, other metrics are very limited in scope. In this work, we propose an inherently faithfulness measurable model that addresses these challenges. This is achieved by using a novel fine-tuning method that incorporates masking, such that masking tokens become in-distribution by design. This differs from existing approaches, which are completely model-agnostic but are inapplicable in practice. We demonstrate the generality of our approach by applying it to various tasks and validate it using statistical in-distribution tests. Additionally, because masking is in-distribution, importance measures which themselves use masking become more faithful, thus our model becomes more explainable.
    摘要 通常来说,使用重要度度量来解释NLG模型的方法是使用重要度度量来表示选择的токен是否重要。然而,这些解释通常是错误的,尽管很有吸引力。因此,必须测试其忠实程度。一种度量器是,如果tokentoken是重要的,那么对它们进行遮盖应该导致模型性能下降。然而,tokentoken遮盖引入了非标准issue和现有解决方案是计算成本高昂,并且使用代理模型。此外,其他度量器很有限,只能用于某些任务。在这项工作中,我们提出了一种内在准确度度量器,解决了这些挑战。这是通过使用一种新的微调方法,将遮盖tokentoken变成了标准分布。这与现有方法不同,它们是完全无关的模型的,但在实践中无法应用。我们在不同的任务上应用了我们的方法,并通过统计学的在 Distribution Test validate 它。此外,因为遮盖tokentoken变成了标准分布,因此重要度度量也变得更加准确,从而使我们的模型变得更加解释。

Language Models As Semantic Indexers

  • paper_url: http://arxiv.org/abs/2310.07815
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang
  • for: 本文targets at learning semantic IDs for documents, which can facilitate various downstream tasks such as recommendation and retrieval.
  • methods: 本文提出了一种自然语言模型基于的自我超vised框架,用于学习文档的含义ID。该框架使用了进步的训练和对比学习来生成神经网络顺序分解表示,并通过自我超vised文档重建目标进行训练。
  • results: 实验结果表明,LMINDEXER在三个任务中(推荐、产品搜索和文档检索)在五个数据集上具有显著性和一致性的优势,与竞争性基准模型相比。
    Abstract Semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs by first procuring embeddings using off-the-shelf text encoders and then deriving IDs based on the embeddings. However, each step introduces potential information loss and there is usually an inherent mismatch between the distribution of embeddings within the latent space produced by text encoders and the anticipated distribution required for semantic indexing. Nevertheless, it is non-trivial to design a method that can learn the document's semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured, and the semantic supervision is deficient. In this paper, we introduce LMINDEXER, a self-supervised framework to learn semantic IDs with a generative language model. We tackle the challenge of sequential discrete ID by introducing a semantic indexer capable of generating neural sequential discrete representations with progressive training and contrastive learning. In response to the semantic supervision deficiency, we propose to train the model with a self-supervised document reconstruction objective. The learned semantic indexer can facilitate various downstream tasks, such as recommendation and retrieval. We conduct experiments on three tasks including recommendation, product search, and document retrieval on five datasets from various domains, where LMINDEXER outperforms competitive baselines significantly and consistently.
    摘要 <>Semantic identifier (ID) 是信息检索中的一个重要概念,旨在保留文档和其内部元素的 semantics。先前的研究通常采用两个阶段管道来学习含义 ID,先从 off-the-shelf 文本编码器获取嵌入,然后基于嵌入derive ID。但每个步骤都会导致信息损失,而且嵌入空间中的分布和预期的分布通常存在差异。尽管是非常困难的设计一种可以同时学习文档的含义表示和层次结构的方法,因为含义 ID 是整数和顺序结构的,而且含义监督不足。在本文中,我们介绍了 LMINDEXER,一种自动化的框架,可以使用生成语言模型来学习含义 ID。我们解决了顺序整数 ID 的挑战,通过引入含义索引器,该索引器可以在进行进程训练和对比学习后生成神经网络整数表示。受到含义监督不足的挑战,我们提议使用自动化文档重建目标进行训练。学习的含义索引器可以帮助下游任务,如推荐和检索。我们在三个任务上进行了五个数据集的实验,包括推荐、产品搜索和文档检索,LMINDEXER 与竞争对手相比显著并且一致性高。

Ontology Enrichment for Effective Fine-grained Entity Typing

  • paper_url: http://arxiv.org/abs/2310.07795
  • repo_url: None
  • paper_authors: Siru Ouyang, Jiaxin Huang, Pranav Pillai, Yunyi Zhang, Yu Zhang, Jiawei Han
  • for: 本研究旨在提出一种静态ontology-based zero-shot fine-grained实体类型标注(FET)方法,以便在无需人工标注的情况下实现高质量的实体类型标注。
  • methods: 我们提出了一种名为OnEFET的方法,它在ontology结构中增加了两种类型的额外信息,并开发了一种从粗到细的类型标注算法,通过在不同话题和实例增强训练样本中训练一个推理模型来利用这些额外信息。
  • results: 我们的实验结果表明,OnEFET可以在无需人工标注的情况下实现高质量的 fine-grained entity typing,与现有的零shot方法相比,其表现较好,甚至可以与有监督方法相比。
    Abstract Fine-grained entity typing (FET) is the task of identifying specific entity types at a fine-grained level for entity mentions based on their contextual information. Conventional methods for FET require extensive human annotation, which is time-consuming and costly. Recent studies have been developing weakly supervised or zero-shot approaches. We study the setting of zero-shot FET where only an ontology is provided. However, most existing ontology structures lack rich supporting information and even contain ambiguous relations, making them ineffective in guiding FET. Recently developed language models, though promising in various few-shot and zero-shot NLP tasks, may face challenges in zero-shot FET due to their lack of interaction with task-specific ontology. In this study, we propose OnEFET, where we (1) enrich each node in the ontology structure with two types of extra information: instance information for training sample augmentation and topic information to relate types to contexts, and (2) develop a coarse-to-fine typing algorithm that exploits the enriched information by training an entailment model with contrasting topics and instance-based augmented training samples. Our experiments show that OnEFET achieves high-quality fine-grained entity typing without human annotation, outperforming existing zero-shot methods by a large margin and rivaling supervised methods.
    摘要 细化实体类型标识(FET)是根据实体提及的上下文信息确定特定实体类型的任务。传统方法需要大量人工标注,却是时间consuming和成本高的。最近的研究已经开始开发弱级或无级指导的方法。我们研究了基于ontology的零基础FET设定,但现有的ontology结构缺乏详细信息和甚至存在歧义关系,使其无法有效地导引FET。最近发展的自然语言处理模型,尽管在各种几个shot和零基础NLP任务中表现出色,但在零基础FET中可能会遇到挑战,因为它们与任务特定的ontology没有直接交互。在本研究中,我们提出了OnEFET,其中我们(1)为ontology结构中的每个节点添加了两种类型的额外信息:实例信息用于训练样本增强和主题信息用于将类型与上下文关联,并(2)开发了一种宽到细类型标识算法,利用这些额外信息通过训练排除模型和对比主题的实例基本样本来利用。我们的实验显示,OnEFET可以在无人注释情况下实现高质量细化实体类型标识,胜过现有的零基础方法,并与supervised方法相当。

To Build Our Future, We Must Know Our Past: Contextualizing Paradigm Shifts in Natural Language Processing

  • paper_url: http://arxiv.org/abs/2310.07715
  • repo_url: None
  • paper_authors: Sireesh Gururaja, Amanda Bertsch, Clara Na, David Gray Widder, Emma Strubell
  • for: 本研究目的是理解NLP领域的发展趋势,以便更好地形决未来。
  • methods: 本研究使用长形采访26名NLP研究人员,分析了文化、激励力和基础设施等因素对NLP领域的影响。
  • results: 研究发现了NLP领域的cyclical patterns以及新的变革,包括benchmark文化和软件基础设施的变化。
    Abstract NLP is in a period of disruptive change that is impacting our methodologies, funding sources, and public perception. In this work, we seek to understand how to shape our future by better understanding our past. We study factors that shape NLP as a field, including culture, incentives, and infrastructure by conducting long-form interviews with 26 NLP researchers of varying seniority, research area, institution, and social identity. Our interviewees identify cyclical patterns in the field, as well as new shifts without historical parallel, including changes in benchmark culture and software infrastructure. We complement this discussion with quantitative analysis of citation, authorship, and language use in the ACL Anthology over time. We conclude by discussing shared visions, concerns, and hopes for the future of NLP. We hope that this study of our field's past and present can prompt informed discussion of our community's implicit norms and more deliberate action to consciously shape the future.
    摘要

Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07712
  • repo_url: https://github.com/castorini/perm-sc
  • paper_authors: Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, Ferhan Ture
  • for: Addressing positional bias in listwise ranking of large language models (LLMs)
  • methods: permutation self-consistency, marginalizing out different list orders in the prompt to produce order-independent ranking with less positional bias
  • results: Improved scores from conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B) on five list-ranking datasets in sorting and passage reranking, surpassing the previous state of the art in passage reranking.
    Abstract Large language models (LLMs) exhibit positional bias in how they use context, which especially complicates listwise ranking. To address this, we propose permutation self-consistency, a form of self-consistency over ranking list outputs of black-box LLMs. Our key idea is to marginalize out different list orders in the prompt to produce an order-independent ranking with less positional bias. First, given some input prompt, we repeatedly shuffle the list in the prompt and pass it through the LLM while holding the instructions the same. Next, we aggregate the resulting sample of rankings by computing the central ranking closest in distance to all of them, marginalizing out prompt order biases in the process. Theoretically, we prove the robustness of our method, showing convergence to the true ranking in the presence of random perturbations. Empirically, on five list-ranking datasets in sorting and passage reranking, our approach improves scores from conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B), surpassing the previous state of the art in passage reranking. Our code is at https://github.com/castorini/perm-sc.
    摘要 大型语言模型(LLM)会展示位置偏见,尤其是在列表排名中。为了解决这问题,我们提出了排序自适应性,一种黑盒 LLM 的排名结果自适应性。我们的关键想法是在提示中随机排序列表,然后将其通过 LLM,并将不同的列表顺序聚合成一个位置不受歧视的排名。具体来说,我们会将输入提示中的列表随机排序,然后将其通过 LLM,并在不同的列表顺序下重复这些排序。接着,我们将这些排序结果聚合起来, computed the central ranking closest in distance to all of them, thereby marginalizing out prompt order biases in the process。理论上,我们证明了我们的方法的稳定性,显示在随机干扰下,我们的方法会趋向真实的排名。实验上,我们在五个列表排名 dataset 上进行了 sorting 和 passage reranking,比较了与传统推理相比,我们的方法可以提高分数达7-18% для GPT-3.5 和 8-16% для LLaMA v2 (70B),超过了过去的州际之优。我们的代码位于 GitHub 上的 https://github.com/castorini/perm-sc。

DiPmark: A Stealthy, Efficient and Resilient Watermark for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07710
  • repo_url: None
  • paper_authors: Yihan Wu, Zhengmian Hu, Hongyang Zhang, Heng Huang
  • for: 保护数据安全,采用隐藏信息在数据中嵌入 watermarking 技术。
  • methods: 提出了一种基于分布采样和哈希函数的分布保持 watermarking 方法(DiPmark),可以避免当前策略中的分布误差。
  • results: 对比 experiments 表明,该方法可以具有隐蔽性、高效性和鲁棒性,适用于需要准确性保持的 watermarking 任务。
    Abstract Watermarking techniques offer a promising way to secure data via embedding covert information into the data. A paramount challenge in the domain lies in preserving the distribution of original data during watermarking. Our research extends and refines existing watermarking framework, placing emphasis on the importance of a distribution-preserving (DiP) watermark. Contrary to the current strategies, our proposed DiPmark preserves the original token distribution during watermarking (stealthy), is detectable without access to the language model API or weights (efficient), and is robust to moderate changes of tokens (resilient). This is achieved by incorporating a novel reweight strategy, combined with a hash function that assigns unique \textit{i.i.d.} ciphers based on the context. The empirical benchmarks of our approach underscore its stealthiness, efficiency, and resilience, making it a robust solution for watermarking tasks that demand impeccable quality preservation.
    摘要 通过水印技术来保护数据,通过在数据中隐藏潜在信息。领域中的一大挑战是保持原始数据的分布。我们的研究扩展和改进了现有的水印框架,强调在水印过程中保持原始token的分布。与现有策略不同,我们的提议的DiPmark可以在水印过程中保持原始token的分布(隐蔽),不需要访问语言模型API或参数(高效),并且对小范围的token变化具有抗锋性。这是通过新的重要策略和基于上下文的哈希函数来实现的,这使得我们的方法在隐蔽性、高效性和抗锋性方面具有优势,适用于需要精细质量保持的水印任务。

MatFormer: Nested Transformer for Elastic Inference

  • paper_url: http://arxiv.org/abs/2310.07707
  • repo_url: None
  • paper_authors: Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain
  • for: 这 paper 的目的是提出一种可以适应不同部署环境的嵌入式 transformer 模型,以提高模型的灵活性和可控性。
  • methods: 这 paper 使用了一种名为 MatFormer 的嵌入式 transformer 模型,该模型通过对各层 feed forward network (FFN) 块进行共同优化,以实现模型的灵活性和可控性。
  • results: 该 paper 通过实验表明,MatFormer 模型在不同的模型类型(解码器和编码器)、Modalities(语言和视觉)和大小(达到 2.6B 参数)中具有广泛的适用性和可靠性。此外, MatFormer 模型还可以提取出准确且可靠的小型模型,以提高下游评估的准确性和可靠性。
    Abstract Transformer models are deployed in a wide range of settings, from multi-accelerator clusters to standalone mobile phones. The diverse inference constraints in these scenarios necessitate practitioners to train foundation models such as PaLM 2, Llama, & ViTs as a series of models of varying sizes. Due to significant training costs, only a select few model sizes are trained and supported, limiting more fine-grained control over relevant tradeoffs, including latency, cost, and accuracy. This work introduces MatFormer, a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints. Each Feed Forward Network (FFN) block of a MatFormer model is jointly optimized with a few nested smaller FFN blocks. This training procedure allows for the Mix'n'Match of model granularities across layers -- i.e., a trained universal MatFormer model enables extraction of hundreds of accurate smaller models, which were never explicitly optimized. We empirically demonstrate MatFormer's effectiveness across different model classes (decoders & encoders), modalities (language & vision), and scales (up to 2.6B parameters). We find that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting comparable validation loss and one-shot downstream evaluations to their independently trained counterparts. Furthermore, we observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval. Finally, we showcase that speculative decoding with the accurate and consistent submodels extracted from MatFormer can further reduce inference latency.
    摘要 《Transformer模型在多种设置中部署,从多个加速器集群到单个手持式手机。这些不同的推理约束在这些场景中需要实践者在PaLM 2、Llama和ViTs等基础模型中训练多种模型的不同大小。由于训练成本很高,只有一些选择的模型大小得到了训练和支持,限制了更细化的控制 над相关的负载、成本和准确率的负载。本文介绍了MatFormer,一种嵌入式Transporter架构,用于提供多种部署约束的灵活性。每个Feed Forward Network(FFN)块的MatFormer模型都是与嵌入的一些更小的FFN块进行共同优化的。这种训练过程允许在层次上进行模型的混合和拼接,即一个已经训练过的通用MatFormer模型可以提取出百度精度的准确小模型,这些小模型从未直接优化过。我们在不同的模型类型(解码器和编码器)、modalities(语言和视觉)和 scale(最多2.6亿参数)上进行了实验,发现一个2.6亿解码器Only MatFormer语言模型(MatLM)可以提取出1.5亿到2.6亿的精度小模型,每个模型具有与独立训练的对应模型相同的验证损失和一次下游评估。此外,我们发现使用MatFormer基于ViT(MatViT)Encoder提取的小编码器保持了度量空间结构,适用于大规模适应式检索。最后,我们发现使用MatFormer提取的准确和一致的子模型进行规划的推理可以进一步减少推理时间。

Ferret: Refer and Ground Anything Anywhere at Any Granularity

  • paper_url: http://arxiv.org/abs/2310.07704
  • repo_url: https://github.com/apple/ml-ferret
  • paper_authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
  • for: 这个论文旨在描述一种新的多Modal大语言模型(MLLM),可以理解图像中任意形状或粒度的空间引用,并准确地将描述与图像相关联。
  • methods: 为了统一引用和降grunding在LLM模型中,这个论文提出了一种新的hybrid区域表示方法,将精确的坐标和连续特征相结合,以表示图像中的区域。此外,提出了一种适应不同形状的视觉采样器,可以处理图像中的不同粒度。
  • results: 根据实验结果,这个模型不仅在经典的引用和降grunding任务中表现出色,而且在基于区域的多模式聊天中也表现出了优秀的能力。此外,模型还能够更好地描述图像的细节和减少对象推测的情况。
    Abstract We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret
    摘要 我们介绍 Ferret,一种新的多modal大型语言模型(MLLM),可以理解图像中任何形状或粒度的空间引用,并准确地将开放词汇描述与图像相关联。为了在LLM模型中统一引用和降解,Ferret使用一种新的强大的混合区域表示方法,将精确的整数坐标和连续特征联合起来表示图像中的区域。为了提取不同形状的区域中的连续特征,我们提议了一种适应性能处理的空间视觉采样器,能够处理不同形状的变化。因此,Ferret可以接受不同的区域输入,如点、矩形框和自由形状。为了增强Ferret的愿望能力,我们组织了 GRIT,一个包括1.1万个样本的参考和降解指令调整数据集,其中包括9.5万个困难的负例数据,以促进模型的稳定性。最终模型不仅在经典的引用和降解任务中表现出色,而且在多modal聊天中需要区域基础的任务中也表现出优异,并且在描述图像细节和对象投影方面表现出了显著改善。代码和数据将在https://github.com/apple/ml-ferret上提供。

Knowledge-enhanced Memory Model for Emotional Support Conversation

  • paper_url: http://arxiv.org/abs/2310.07700
  • repo_url: None
  • paper_authors: Mengzhao Jia, Qianglong Chen, Liqiang Jing, Dawei Fu, Renyu Li
  • For: 提高Emotional Support Conversation的效果,以扩展 mental health 支持的可能性。* Methods: 提出了一种知识增强 Memory mODEl for emotional suppoRt coNversation (MODERN),包括对话语义encode和基于ConceptNet的实用回答生成模块。* Results: 对一个广泛使用的大规模数据集进行了详细实验,证明了我们的模型在比较先进的基准上表现出色。
    Abstract The prevalence of mental disorders has become a significant issue, leading to the increased focus on Emotional Support Conversation as an effective supplement for mental health support. Existing methods have achieved compelling results, however, they still face three challenges: 1) variability of emotions, 2) practicality of the response, and 3) intricate strategy modeling. To address these challenges, we propose a novel knowledge-enhanced Memory mODEl for emotional suppoRt coNversation (MODERN). Specifically, we first devise a knowledge-enriched dialogue context encoding to perceive the dynamic emotion change of different periods of the conversation for coherent user state modeling and select context-related concepts from ConceptNet for practical response generation. Thereafter, we implement a novel memory-enhanced strategy modeling module to model the semantic patterns behind the strategy categories. Extensive experiments on a widely used large-scale dataset verify the superiority of our model over cutting-edge baselines.
    摘要 现在,情绪疾病的流行性已成为一个重要的问题,导致了情绪支持对话的效iveness来支持心理健康的更多的注意力。现有的方法已经实现了有力的结果,但它们还面临着三个挑战:1)情绪的变化性,2)回应的实用性,和3)复杂的战略模型。为了解决这些挑战,我们提出了一种基于知识的Memory mODEl for emotional suppoRt coNversation(MODERN)。具体来说,我们首先设计了一种增强对话上下文的知识编码,以捕捉不同时间段的对话中的动态情绪变化,并选择相关的上下文概念从ConceptNet中生成实用的回应。然后,我们实施了一种新的记忆增强策略模型模块,以模型 semantic patterns 下的策略类别。经验证明,我们的模型在一个广泛使用的大规模数据集上表现出了较好的效果,比较出色的基准值。

Composite Backdoor Attacks Against Large Language Models

  • paper_url: http://arxiv.org/abs/2310.07676
  • repo_url: None
  • paper_authors: Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
  • for: 本研究探讨了语言模型(LLMs)的不可靠性,以及在下游任务中可能存在的攻击方式。
  • methods: 本研究使用了背门际攻击(backdoor attack)来探讨 LLMS 的攻击性。与先前的攻击方法不同,本研究在不同的提问组件中扫描多个触发密钥。这种复合背门攻击(CBA)比单个组件中的多个触发密钥更隐蔽。
  • results: 我们的实验表明,CBA 在自然语言处理(NLP)和多媒体任务中都是有效的。例如,在 LLaMA-7B 模型对 Emotion 数据集的 $3%$ 杂入样本上,我们的攻击达到了 $100%$ 攻击成功率(ASR), False Triggered Rate(FTR)在 $2.06%$ 以下,模型性能减少很小。
    Abstract Large language models (LLMs) have demonstrated superior performance compared to previous methods on various tasks, and often serve as the foundation models for many researches and services. However, the untrustworthy third-party LLMs may covertly introduce vulnerabilities for downstream tasks. In this paper, we explore the vulnerability of LLMs through the lens of backdoor attacks. Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier than implanting the same multiple trigger keys in only a single component. CBA ensures that the backdoor is activated only when all trigger keys appear. Our experiments demonstrate that CBA is effective in both natural language processing (NLP) and multimodal tasks. For instance, with $3\%$ poisoning samples against the LLaMA-7B model on the Emotion dataset, our attack achieves a $100\%$ Attack Success Rate (ASR) with a False Triggered Rate (FTR) below $2.06\%$ and negligible model accuracy degradation. The unique characteristics of our CBA can be tailored for various practical scenarios, e.g., targeting specific user groups. Our work highlights the necessity of increased security research on the trustworthiness of foundation LLMs.
    摘要 大型语言模型(LLM)在各种任务上表现出色,常作为许多研究和服务的基础模型。然而,第三方不可靠的 LLM 可能会隐藏攻击性漏洞。在这篇论文中,我们通过针对 LLM 的后门攻击来探讨 LLM 的不可靠性。与现有的 LLM 后门攻击不同,我们的 Composite Backdoor Attack(CBA)在不同的提示组件中扫描多个触发关键。这种 CBA 比将同样多个触发关键Implanting 在单个组件中更隐蔽。CBA 确保只有当所有触发关键都出现时才会打开后门。我们的实验表明,CBA 在自然语言处理(NLP)和多Modal 任务中都有效。例如,在对 LLaMA-7B 模型的 Emotion 数据集上,我们使用 $3\%$ 恶意样本,我们的攻击达到了 $100\%$ 攻击成功率(ASR),False Triggered Rate(FTR)低于 $2.06\%$,模型性能下降非常小。CBA 的独特特点可以适应不同的实际场景,例如 targeting 特定用户群。我们的工作高亮了基础 LLM 的信任性的重要性,强调了对这些模型的安全研究的必要性。

Well Begun is Half Done: Generator-agnostic Knowledge Pre-Selection for Knowledge-Grounded Dialogue

  • paper_url: http://arxiv.org/abs/2310.07659
  • repo_url: https://github.com/qinlang14/gate
  • paper_authors: Lang Qin, Yao Zhang, Hongru Liang, Jun Wang, Zhenglu Yang
  • for: 这篇论文旨在提高知识选择的准确性,以便在知识卷积对话系统中进行更好的对话。
  • methods: 这篇论文提出了一种新的知识选择方法,即在生成之前选择相关的知识,以减少后续响应生成模型(特别是LLMs)的学习、调整和解释压力。
  • results: 实验结果表明,该方法可以提高响应的信息性,并且指出了知识选择前的生成是轻量级而有效的方式,可以帮助LLMs(如ChatGPT)生成更有用的响应。
    Abstract Accurate knowledge selection is critical in knowledge-grounded dialogue systems. Towards a closer look at it, we offer a novel perspective to organize existing literature, i.e., knowledge selection coupled with, after, and before generation. We focus on the third under-explored category of study, which can not only select knowledge accurately in advance, but has the advantage to reduce the learning, adjustment, and interpretation burden of subsequent response generation models, especially LLMs. We propose GATE, a generator-agnostic knowledge selection method, to prepare knowledge for subsequent response generation models by selecting context-related knowledge among different knowledge structures and variable knowledge requirements. Experimental results demonstrate the superiority of GATE, and indicate that knowledge selection before generation is a lightweight yet effective way to facilitate LLMs (e.g., ChatGPT) to generate more informative responses.
    摘要 精准的知识选择是知识固有对话系统的关键。我们提出了一种新的视角,即知识选择与生成结合在一起,以及生成之前和之后的知识选择。我们专注于第三种未探讨的研究领域,即可以不仅在预先选择正确的知识,而且可以减轻后续响应生成模型(特别是LLMs)的学习、调整和解释负担。我们提出了一种生成器独立的知识选择方法,以适应不同的知识结构和变量知识需求。实验结果表明GATE的优越性,并表明知识选择 перед生成是一种轻量级 yet 有效的方式,使LLMs(如ChatGPT)能够更加有 informations 的响应。

Audio-Visual Neural Syntax Acquisition

  • paper_url: http://arxiv.org/abs/2310.07654
  • repo_url: None
  • paper_authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
  • for: 这个论文是用来研究视觉挖掘的语音结构induction的。
  • methods: 这个论文使用的方法是首先将语音波形分割成单词序列,然后使用推导出的段级连续表示来推导语句结构。
  • results: 这个论文的实验结果表明,通过听取音频和查看图像,而不需要任何文本Supervision,Audio-Visual Neural Syntax Learner(AV-NSL)可以学习出有意义的语句结构,与自然Supervised文本解析器相当。
    Abstract We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.
    摘要 我们研究语音结构推导从视觉关联的 speech。核心思想是首先将语音波形分段成字段序列,然后使用推断出的段级连续表示来推导 phrase structure。我们提出了 Audio-Visual Neural Syntax Learner(AV-NSL),它通过听音和看图学习语音结构,不需要任何文本干扰。通过对图像和说话标注进行训练,AV-NSL能够推导出有意义的 phrase structure,与自然监督文本分析器相当。我们的发现扩展了先前的无监督语言学习从speech和固定语法推导的研究,并提供了一种将两个话题相连接的方法。

LLM4Vis: Explainable Visualization Recommendation using ChatGPT

  • paper_url: http://arxiv.org/abs/2310.07652
  • repo_url: https://github.com/demoleiwang/llm4vis
  • paper_authors: Lei Wang, Songheng Zhang, Yun Wang, Ee-Peng Lim, Yong Wang
  • for: 该研究旨在提供一种自动化视觉推荐方法,以便在不同领域中探索和传达视觉结论。
  • methods: 该方法基于ChatGPT的提问方法,包括特征描述、示例选择、解释生成、示例构建和推理步骤。为了获得高质量的解释,提出了一种新的解释生成启发法,通过考虑之前的生成和模板基本提示来进行反射式启发。
  • results: 在VizML数据集上进行评估,LLM4Vis在几个实际中或与随机森林、决策树和多层感知网络等超vised学习模型相当,并在零shot和几个实际中表现出色。Qualitative评估还表明LLM4Vis生成的解释的效果。代码可以在 \href{https://github.com/demoleiwang/LLM4Vis}{https://github.com/demoleiwang/LLM4Vis} 上获取。
    Abstract Data visualization is a powerful tool for exploring and communicating insights in various domains. To automate visualization choice for datasets, a task known as visualization recommendation has been proposed. Various machine-learning-based approaches have been developed for this purpose, but they often require a large corpus of dataset-visualization pairs for training and lack natural explanations for their results. To address this research gap, we propose LLM4Vis, a novel ChatGPT-based prompting approach to perform visualization recommendation and return human-like explanations using very few demonstration examples. Our approach involves feature description, demonstration example selection, explanation generation, demonstration example construction, and inference steps. To obtain demonstration examples with high-quality explanations, we propose a new explanation generation bootstrapping to iteratively refine generated explanations by considering the previous generation and template-based hint. Evaluations on the VizML dataset show that LLM4Vis outperforms or performs similarly to supervised learning models like Random Forest, Decision Tree, and MLP in both few-shot and zero-shot settings. The qualitative evaluation also shows the effectiveness of explanations generated by LLM4Vis. We make our code publicly available at \href{https://github.com/demoleiwang/LLM4Vis}{https://github.com/demoleiwang/LLM4Vis}.
    摘要 “数据视觉是一种强大的工具,用于探索和传达不同领域的探索结果。为自动化视觉选择,一项称为视觉推荐的任务已被提议。多种基于机器学习的方法已经被开发出来,但它们经常需要大量的数据集和视觉对的训练,而且往往无法提供自然的解释。为了解决这一研究漏洞,我们提出了LLM4Vis,一种基于ChatGPT的新的提问方法,用于进行视觉推荐并返回人类化的解释,只需要几个示例例子。我们的方法包括特征描述、示例选择、解释生成、示例构建和推理步骤。为了获得高质量的解释,我们提出了一种新的解释生成恒bootstrap,可以逐步优化生成的解释,通过考虑之前的生成和模板基于的提示。在VizML数据集上进行评估,我们发现LLM4Vis在几个示例例子和零示例例子的情况下均能与随机森林、决策树和多层感知网络相当或超越。 qualitative评估还表明LLM4Vis生成的解释的效果。我们将代码公开在https://github.com/demoleiwang/LLM4Vis。”

Evaluating Large Language Models at Evaluating Instruction Following

  • paper_url: http://arxiv.org/abs/2310.07641
  • repo_url: https://github.com/princeton-nlp/llmbar
  • paper_authors: Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen
  • for: 本研究旨在评估LLM evaluator的效果,特别是用于评估生成文本是否遵循给定的指令。
  • methods: 本研究使用了LLMBar作为一个挑战性的meta-评估benchmark,以测试LLM evaluator的指令遵循能力。
  • results: 研究发现不同的评估器(即LLM和提示的组合)在LLMBar上表现出 diferencia significativa,甚至最高分的评估器也有较大的改进空间。此外,本研究还提出了一个新的提示策略,以更好地衡量LLM的指令遵循能力。
    Abstract As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasing list of models. This paper investigates the efficacy of these "LLM evaluators", particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. The authors manually curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBar, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models.
    摘要 LLM 研究在继续加速, LLMBased 评估成为一种可扩展和成本效果的人工评估的替代方案,用于比较越来越多的模型。这篇论文 investigate LLM 评估器的效果,特别是用于评估 instruciton 遵循度,一个测量生成文本是否遵循给定的 instruciton 的指标。我们提出了一个挑战性的 meta-评估标准 LLMBar,用于测试 LLM 评估器是否能够识别遵循 instruciton 的输出。作者手动精心选择了 419 对输出,其中一个遵循 instruciton,另一个偏离 instruciton,但可能具有诱导 LLM 评估器的特性,如更有吸引力的语言风格。与现有的 meta-评估不同,我们发现不同的评估器(即 LLM 和提示的组合)在 LLMBar 上表现出不同的性能,甚至最高分的评估器还有很大的改进空间。我们还提出了一个新的提示策略集合,可以进一步减距 между LLM 和人类评估器。通过 LLMBar,我们希望能够为 LLM 评估器提供更多的洞察,并促进未来的研究,以开发更好的 instruciton 遵循模型。

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values

  • paper_url: http://arxiv.org/abs/2310.07629
  • repo_url: None
  • paper_authors: Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, Scott A. Hale
  • for: 这篇论文主要是为了研究如何收集和利用人类反馈来改进大语言模型(LLM)的表现。
  • methods: 论文主要抄取了95篇文章,来总结过去对语言模型的人类反馈的约束和现代技术和实践。
  • results: 论文提出了五个未解决的概念和实践挑战,以促进未来对Feedback学习的发展。
    Abstract Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs). However, it is unclear how to collect and incorporate feedback in a way that is efficient, effective and unbiased, especially for highly subjective human preferences and values. In this paper, we survey existing approaches for learning from human feedback, drawing on 95 papers primarily from the ACL and arXiv repositories.First, we summarise the past, pre-LLM trends for integrating human feedback into language models. Second, we give an overview of present techniques and practices, as well as the motivations for using feedback; conceptual frameworks for defining values and preferences; and how feedback is collected and from whom. Finally, we encourage a better future of feedback learning in LLMs by raising five unresolved conceptual and practical challenges.
    摘要 人类反馈在大自然语言模型(LLM)中的行为指导 increasingly 使用。然而,不清楚如何收集和包含反馈的方式,尤其是对于高度主观的人类偏好和价值,是效果、高效和无偏见的。在这篇论文中,我们对现有的反馈学习方法进行了调查,主要从 ACLL和arXiv 存储库中检索了95篇论文。首先,我们总结过去, LLM 前的人类反馈integrating 语言模型的趋势。其次,我们提供现有的技术和实践,以及使用反馈的动机,定义价值和偏好的概念框架,以及反馈是如何收集的和从谁收集。最后,我们鼓励更好的反馈学习在 LLM 中的未来,提出了五个未解决的概念和实践挑战。

QACHECK: A Demonstration System for Question-Guided Multi-Hop Fact-Checking

  • paper_url: http://arxiv.org/abs/2310.07609
  • repo_url: https://github.com/xinyuanlu00/qacheck
  • paper_authors: Liangming Pan, Xinyuan Lu, Min-Yen Kan, Preslav Nakov
  • For: The paper aims to address the challenge of fact-checking real-world claims with complex, multi-step reasoning, and to provide a transparent, explainable, and user-friendly fact-checking process.* Methods: The proposed QACHECK system uses a sequence of (question, answer) pairs to guide its reasoning process, and includes five key modules: a claim verifier, a question generator, a question-answering module, a QA validator, and a reasoner.* Results: The paper demonstrates the effectiveness of QACHECK through a recorded video, showing how the system can provide a comprehensive report detailing its reasoning process and the source of evidence supporting each question.
    Abstract Fact-checking real-world claims often requires complex, multi-step reasoning due to the absence of direct evidence to support or refute them. However, existing fact-checking systems often lack transparency in their decision-making, making it challenging for users to comprehend their reasoning process. To address this, we propose the Question-guided Multi-hop Fact-Checking (QACHECK) system, which guides the model's reasoning process by asking a series of questions critical for verifying a claim. QACHECK has five key modules: a claim verifier, a question generator, a question-answering module, a QA validator, and a reasoner. Users can input a claim into QACHECK, which then predicts its veracity and provides a comprehensive report detailing its reasoning process, guided by a sequence of (question, answer) pairs. QACHECK also provides the source of evidence supporting each question, fostering a transparent, explainable, and user-friendly fact-checking process. A recorded video of QACHECK is at https://www.youtube.com/watch?v=ju8kxSldM64
    摘要 fact-checking 实际场景中的真假性检查经常需要复杂的多步骤 reasoning,因为没有直接的证据支持或驳斥这些laims。然而,现有的 fact-checking 系统经常缺乏决策过程的透明性,使得用户很难理解它们的思维过程。为解决这个问题,我们提议Question-guided Multi-hop Fact-Checking(QACHECK)系统,它通过问题来导引模型的思维过程。QACHECK 系统有五个关键模块:声明验证模块、问题生成模块、问题回答模块、QA 验证模块和理解模块。用户可以将声明输入到 QACHECK 系统中,然后它会预测声明的真假性并提供一份详细的报告,描述了它的思维过程,并且这些思维过程是通过一系列(问题、答案)对被Question-guided Multi-hop Fact-Checking(QACHECK)系统。QACHECK 系统还提供每个问题的证据来源,这使得 fact-checking 过程变得更加透明、可靠和用户友好。有关 QACHECK 的录制视频可以在 YouTube 上搜索:https://www.youtube.com/watch?v=ju8kxSldM64

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

  • paper_url: http://arxiv.org/abs/2310.07521
  • repo_url: https://github.com/wangcunxiang/llm-factuality-survey
  • paper_authors: Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, Yue Zhang
  • for: This paper addresses the issue of factuality in Large Language Models (LLMs) and its implications for their reliability and accuracy in diverse applications.
  • methods: The paper analyzes the mechanisms of LLM factuality, including the storage and processing of facts, and evaluates methodologies for assessing LLM factuality.
  • results: The paper explores strategies for enhancing LLM factuality, including approaches tailored for specific domains, and offers a structured guide for researchers aiming to improve the factual reliability of LLMs.Here are the three points in Simplified Chinese text:
  • for: 这篇论文关注 Large Language Models (LLMs) 的准确性问题,以及其在多个领域的可靠性和准确性。
  • methods: 论文分析 LLM 的准确性机制,包括存储和处理事实的方式,并评估了评估 LLM 准确性的方法。
  • results: 论文探讨了提高 LLM 准确性的策略,包括特定领域的方法,并提供了为研究人员增强 LLM 准确性的结构化指南。
    Abstract This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital. We define the Factuality Issue as the probability of LLMs to produce content inconsistent with established facts. We first delve into the implications of these inaccuracies, highlighting the potential consequences and challenges posed by factual errors in LLM outputs. Subsequently, we analyze the mechanisms through which LLMs store and process facts, seeking the primary causes of factual errors. Our discussion then transitions to methodologies for evaluating LLM factuality, emphasizing key metrics, benchmarks, and studies. We further explore strategies for enhancing LLM factuality, including approaches tailored for specific domains. We focus two primary LLM configurations standalone LLMs and Retrieval-Augmented LLMs that utilizes external data, we detail their unique challenges and potential enhancements. Our survey offers a structured guide for researchers aiming to fortify the factual reliability of LLMs.
    摘要

Cognate Transformer for Automated Phonological Reconstruction and Cognate Reflex Prediction

  • paper_url: http://arxiv.org/abs/2310.07487
  • repo_url: https://github.com/mahesh-ak/cognatetransformer
  • paper_authors: V. S. D. S. Mahesh Akavarapu, Arnab Bhattacharya
  • for: 本研究的目的是自动化历史语言学中的phonological reconstruction问题,使用computational biology中的一些想法和技术。
  • methods: 本研究使用的方法是基于多重序列对 alignment的 MSA Transformer 模型,并将其应用到自动化 phonological reconstruction 和 cognate reflex prediction 问题上。
  • results: 研究结果显示,我们的模型在这两个 задачі中的表现都比现有的模型更好,特别是在预训练 masked word prediction 任务上。
    Abstract Phonological reconstruction is one of the central problems in historical linguistics where a proto-word of an ancestral language is determined from the observed cognate words of daughter languages. Computational approaches to historical linguistics attempt to automate the task by learning models on available linguistic data. Several ideas and techniques drawn from computational biology have been successfully applied in the area of computational historical linguistics. Following these lines, we adapt MSA Transformer, a protein language model, to the problem of automated phonological reconstruction. MSA Transformer trains on multiple sequence alignments as input and is, thus, apt for application on aligned cognate words. We, hence, name our model as Cognate Transformer. We also apply the model on another associated task, namely, cognate reflex prediction, where a reflex word in a daughter language is predicted based on cognate words from other daughter languages. We show that our model outperforms the existing models on both tasks, especially when it is pre-trained on masked word prediction task.
    摘要 <>传统的历史语言学问题之一是推算 proto-word 的 ancestral 语言,从 observer 的 daughter 语言中的ognate 词语中确定。计算方法在历史语言学中尝试自动化任务,学习模型以用于可用的语言数据。从计算生物学中的想法和技术来,我们在 computational 历史语言学中应用了 MSA Transformer,一种蛋白语言模型。MSA Transformer 在输入多个序列对应上训练,因此适用于对aligned cognate 词语进行自动化 reconstruction。我们因此将模型命名为 Cognate Transformer。此外,我们还应用了模型在另一个相关任务中,即 cognate reflex 预测任务, predict 一个 daughter 语言中的 reflex 词语,基于另外的 daughter 语言中的 cognate 词语。我们表明我们的模型在两个任务上比现有模型表现更好,特别是在预训练 word 隐藏预测任务中。Note: The translation is in Simplified Chinese, which is the standard writing system used in mainland China. The Traditional Chinese writing system is also commonly used in Taiwan and other parts of the world, but it may have slightly different grammar and vocabulary.

Adapting the adapters for code-switching in multilingual ASR

  • paper_url: http://arxiv.org/abs/2310.07423
  • repo_url: https://github.com/atharva7k/mms-code-switching
  • paper_authors: Atharva Kulkarni, Ajinkya Kulkarni, Miguel Couceiro, Hanan Aldarmaki
  • for: 提高 code-switched speech 的自动语音识别(ASR)性能
  • methods: 使用语言适应器和批处理机制,将信息从每个语言适应器在网络中的每个适应点传递
  • results: 在三个 code-switched 数据集(包括阿拉伯语、普通话和印地语)和英语的测试集上,实现了持续性的 code-switching 性能提高,至少减少了 10% 的 CERHere’s a more detailed explanation of each point:
  • for: The paper aims to improve the performance of automatic speech recognition (ASR) on code-switched speech.
  • methods: The proposed method uses language adapters and processing mechanisms to transfer information from each language adapter at each adaptation point in the network. Additionally, the paper models code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level.
  • results: The proposed approach is evaluated on three code-switched datasets (including Arabic, Mandarin, and Hindi) and shows consistent improvements in code-switching performance, with at least 10% absolute reduction in CER across all test sets.
    Abstract Recently, large pre-trained multilingual speech models have shown potential in scaling Automatic Speech Recognition (ASR) to many low-resource languages. Some of these models employ language adapters in their formulation, which helps to improve monolingual performance and avoids some of the drawbacks of multi-lingual modeling on resource-rich languages. However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. We also model code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level. The proposed approaches are evaluated on three code-switched datasets encompassing Arabic, Mandarin, and Hindi languages paired with English, showing consistent improvements in code-switching performance with at least 10\% absolute reduction in CER across all test sets.
    摘要 最近,大型预训练多语言语音模型已经显示出了扩展自动语音识别(ASR)到多个低资源语言的潜力。其中一些模型使用语言适配器在其形式ulation中,以提高单语言性能并避免在资源rich语言上多语言模型的一些缺点。然而,这种形式ulation限制了这些模型在混合语言语音上的可用性。在这项工作中,我们提出了有效地微调这些模型在混合语言语音上的方法,通过在每个语言适配点网络中吸收两种语言适配器中的信息。我们还模型了混合语言为一系列隐藏 binary 序列,以便在帧层级引导每个语言适配器的信息流。我们的方法被评估在涵盖阿拉伯语、普通话和印地语与英语的三个混合语言测试集上,显示了一致性的提高,CER 的最小减少为 10% 以上。

Linguistic laws in biology

  • paper_url: http://arxiv.org/abs/2310.07387
  • repo_url: None
  • paper_authors: Stuart Semple, Ramon Ferrer-i-Cancho, Morgan L. Gustison
  • for: investigating the prevalence of linguistic laws beyond language and unifying linguistic laws and core theory in biology
  • methods: adopting a new conceptual framework that integrates distinct levels of analysis, from description to prediction to theory building
  • results: providing critical new insights into the fundamental rules of organisation underpinning natural systems, unifying linguistic laws and core theory in biology
    Abstract Linguistic laws, the common statistical patterns of human language, have been investigated by quantitative linguists for nearly a century. Recently, biologists from a range of disciplines have started to explore the prevalence of these laws beyond language, finding patterns consistent with linguistic laws across multiple levels of biological organisation, from molecular (genomes, genes, and proteins) to organismal (animal behaviour) to ecological (populations and ecosystems). We propose a new conceptual framework for the study of linguistic laws in biology, comprising and integrating distinct levels of analysis, from description to prediction to theory building. Adopting this framework will provide critical new insights into the fundamental rules of organisation underpinning natural systems, unifying linguistic laws and core theory in biology.
    摘要 生物学中的语言法则,人类语言中的统计趋势,已经在量化语言学家的研究中进行了nearly一个世纪。而在最近几年,生物学家从不同领域开始探索这些法则在生物系统中的普遍性,从分子(基因、蛋白质)到生物体(动物行为)到生态系统(人口和生态系统)多个生物水平都发现了与语言法则相符的模式。我们提出了一个新的概念框架,用于语言法则生物学的研究,包括了不同水平的分析、预测和理论建构。采用这个框架,将提供新的核心理论,整合语言法则和生物核心理论。

Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers

  • paper_url: http://arxiv.org/abs/2310.07345
  • repo_url: None
  • paper_authors: Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney
  • for: 本研究探讨了不同语言模型(LM) Context length和标签单元(phoneme vs. word)在语音识别器的序列推理训练中的效果。
  • methods: 本研究采用了无格式和N-best列表方法,并对 lattice-free方法中使用phoneme-level LM进行了一种近似方法来模拟全文本依赖关系。
  • results: 实验结果表明,在Librispeech上使用word-level LM进行训练可以超过使用phoneme-level LM,同时发现语言模型在probability计算中的上下文大小有限制性,以及序列推理训练中假设空间质量的重要性。
    Abstract In this work, we investigate the effect of language models (LMs) with different context lengths and label units (phoneme vs. word) used in sequence discriminative training for phoneme-based neural transducers. Both lattice-free and N-best-list approaches are examined. For lattice-free methods with phoneme-level LMs, we propose a method to approximate the context history to employ LMs with full-context dependency. This approximation can be extended to arbitrary context length and enables the usage of word-level LMs in lattice-free methods. Moreover, a systematic comparison is conducted across lattice-free and N-best-list-based methods. Experimental results on Librispeech show that using the word-level LM in training outperforms the phoneme-level LM. Besides, we find that the context size of the LM used for probability computation has a limited effect on performance. Moreover, our results reveal the pivotal importance of the hypothesis space quality in sequence discriminative training.
    摘要 在这项研究中,我们 investigate了不同上下文长度和标签单元(phoneme vs. word)在序列推断训练中语言模型(LM)的效果。我们还 examine了无格子和N-best-list方法。对于phoneme-level LM,我们提出了一种方法来approximate context history以使用具有全文件依赖的LM。这种approximation可以扩展到任意上下文长度,并允许在无格子方法中使用word-level LM。此外,我们进行了系统性的比较,包括无格子和N-best-list-based方法。实验结果表明,在Librispeech上使用word-level LM进行训练可以超越phoneme-level LM。此外,我们发现上下文大小对LM在概率计算中的性能影响很有限。此外,我们的结果还表明,序列推断训练中假设空间质量的重要性。

How Do Large Language Models Capture the Ever-changing World Knowledge? A Review of Recent Advances

  • paper_url: http://arxiv.org/abs/2310.07343
  • repo_url: https://github.com/hyintell/awesome-refreshing-llms
  • paper_authors: Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad, Jun Wang
  • for: 本研究旨在寻找一种方法,使大语言模型(LLMs)可以适应世界知识的变化,而不需要从scratch重新训练。
  • methods: 本文系统地归纳了最新的研究成果,并进行了深入的比较和讨论。
  • results: 本文提供了一个综合的评估和未来研究方向,以帮助研究人员更好地进行这一领域的研究。
    Abstract Although large language models (LLMs) are impressive in solving various tasks, they can quickly be outdated after deployment. Maintaining their up-to-date status is a pressing concern in the current era. This paper provides a comprehensive review of recent advances in aligning LLMs with the ever-changing world knowledge without re-training from scratch. We categorize research works systemically and provide in-depth comparisons and discussion. We also discuss existing challenges and highlight future directions to facilitate research in this field. We release the paper list at https://github.com/hyintell/awesome-refreshing-llms
    摘要 尽管大语言模型(LLMs)在各种任务上表现出色,但它们很快就会被取代。维护它们的最新状态是当今时期的一项严重问题。本文提供了对最近各种更新LLMs与不断变化的世界知识的系统性梳理,并进行了深入的比较和讨论。我们还讨论了现有的挑战和未来的发展方向,以便促进这一领域的研究。我们在 GitHub 上发布了相关资料,请参考

SNOiC: Soft Labeling and Noisy Mixup based Open Intent Classification Model

  • paper_url: http://arxiv.org/abs/2310.07306
  • repo_url: None
  • paper_authors: Aditi Kanwar, Aditi Seetha, Satyendra Singh Chouhan, Rajdeep Niyogi
  • For: The paper presents a Soft Labeling and Noisy Mixup-based open intent classification model (SNOiC) to address the limitations of existing threshold-based methods, which can overfit and produce biased predictions.* Methods: The SNOiC model combines Soft Labeling and Noisy Mixup strategies to reduce bias and generate pseudo-data for open intent classes.* Results: The experimental results on four benchmark datasets show that the SNOiC model achieves a minimum and maximum performance of 68.72% and 94.71%, respectively, in identifying open intents, and improves the performance by 0.93% to 12.76% compared to state-of-the-art models.Here are the three points in Simplified Chinese text:* For: 本文提出了一种基于软标签和噪音混合的开放意图分类模型(SNOiC),以解决现有的阈值基于方法具有过拟合和生成偏见预测的问题。* Methods: SNOiC模型结合软标签和噪音混合策略,以减少偏见并生成开放意图类型的 Pseudo-数据。* Results: 对四个 benchmark 数据集进行实验,SNOiC模型在开放意图分类方面实现了最低和最高的性能为 68.72% 和 94.71%,分别提高了现有模型的性能表现0.93% 到 12.76%。
    Abstract This paper presents a Soft Labeling and Noisy Mixup-based open intent classification model (SNOiC). Most of the previous works have used threshold-based methods to identify open intents, which are prone to overfitting and may produce biased predictions. Additionally, the need for more available data for an open intent class presents another limitation for these existing models. SNOiC combines Soft Labeling and Noisy Mixup strategies to reduce the biasing and generate pseudo-data for open intent class. The experimental results on four benchmark datasets show that the SNOiC model achieves a minimum and maximum performance of 68.72\% and 94.71\%, respectively, in identifying open intents. Moreover, compared to state-of-the-art models, the SNOiC model improves the performance of identifying open intents by 0.93\% (minimum) and 12.76\% (maximum). The model's efficacy is further established by analyzing various parameters used in the proposed model. An ablation study is also conducted, which involves creating three model variants to validate the effectiveness of the SNOiC model.
    摘要 Translation in Simplified Chinese:这篇论文提出了一种基于软标签和噪音混合的开放意图分类模型(SNOiC),以解决过去的阈值基于方法存在过拟合和生成偏向预测的问题。此外,开放意图类的数据不足是另一个限制。SNOiC模型 combinates 软标签和噪音混合策略,以减少偏向和生成开放意图类的 Pseudo-data。实验结果表明,SNOiC模型在四个 benchmark 数据集上的最低和最高性能为 68.72% 和 94.71%,分别,在开放意图类中标识open intent。与现有模型相比,SNOiC模型提高了开放意图类的标识性能的最低和最高值为 0.93% 和 12.76%。此外,模型的有效性还得到了参数分析和减少研究的支持。

Parrot: Enhancing Multi-Turn Chat Models by Learning to Ask Questions

  • paper_url: http://arxiv.org/abs/2310.07301
  • repo_url: None
  • paper_authors: Yuchong Sun, Che Liu, Jinwen Huang, Ruihua Song, Fuzheng Zhang, Di Zhang, Zhongyuan Wang, Kun Gai
  • for: 这篇论文主要是为了提高对话模型的性能,具体来说是在多回合对话中提高chat模型的效果。
  • methods: 论文使用了一种自动生成高质量指令调整数据的方法,并使用这些数据来强化对话模型的性能。特别是,论文使用了一种名为Parrot-Ask的模型,用于模拟真实用户在生成指令时的行为。
  • results: 论文的实验结果显示,使用Parrot-Ask模型生成的高质量多回合对话数据可以大幅提高chat模型的性能,特别是在多回合评价中。此外,论文还发现了这些数据在不同话题上的多样性和人类对话的相似性。
    Abstract Impressive progress has been made on chat models based on Large Language Models (LLMs) recently; however, there is a noticeable lag in multi-turn conversations between open-source chat models (e.g., Alpaca and Vicuna) and the leading chat models (e.g., ChatGPT and GPT-4). Through a series of analyses, we attribute the lag to the lack of enough high-quality multi-turn instruction-tuning data. The available instruction-tuning data for the community are either single-turn conversations or multi-turn ones with certain issues, such as non-human-like instructions, less detailed responses, or rare topic shifts. In this paper, we address these challenges by introducing Parrot, a highly scalable solution designed to automatically generate high-quality instruction-tuning data, which are then used to enhance the effectiveness of chat models in multi-turn conversations. Specifically, we start by training the Parrot-Ask model, which is designed to emulate real users in generating instructions. We then utilize Parrot-Ask to engage in multi-turn conversations with ChatGPT across a diverse range of topics, resulting in a collection of 40K high-quality multi-turn dialogues (Parrot-40K). These data are subsequently employed to train a chat model that we have named Parrot-Chat. We demonstrate that the dialogues gathered from Parrot-Ask markedly outperform existing multi-turn instruction-following datasets in critical metrics, including topic diversity, number of turns, and resemblance to human conversation. With only 40K training examples, Parrot-Chat achieves strong performance against other 13B open-source models across a range of instruction-following benchmarks, and particularly excels in evaluations of multi-turn capabilities. We make all codes, datasets, and two versions of the Parrot-Ask model based on LLaMA2-13B and KuaiYii-13B available at https://github.com/kwai/KwaiYii/Parrot.
    摘要 很多进步已经被做出在基于大语言模型(LLM)的对话模型方面,但是在多回合对话方面,开源对话模型(如阿尔帕卡和维库那)和领先对话模型(如对话GPT和GPT-4)之间存在明显的延迟。经过一系列分析,我们认为这种延迟的原因是因为社区available的instruction-tuning数据不够,这些数据包括单回合对话或多回合对话,但是具有一些问题,如非人类化的指令、较少细节的回答和rare topic shift。在这篇论文中,我们解决这些挑战,通过引入Parrot,一种高可扩展的解决方案,以生成高质量的instruction-tuning数据,并用于提高对话模型在多回合对话中的效果。具体来说,我们首先在Parrot-Ask模型中训练,该模型设计用于模拟真正的用户来生成指令。然后,我们使用Parrot-Ask与ChatGPT进行多回合对话,生成了一个多样化的话题的40K高质量多回合对话集(Parrot-40K)。这些数据被用来训练一个名为Parrot-Chat的对话模型,我们示出Parrot-Ask生成的对话明显超过现有的多回合指令遵从数据集的重要指标,包括话题多样性、回合数和人类对话的相似性。尽管只有40K的训练样例,Parrot-Chat仍然在多种指令遵从benchmark上表现出色,特别是在多回合评估中表现出特殊的优势。我们在https://github.com/kwai/KwaiYii/Parrot上分享所有代码、数据集和基于LLaMA2-13B和KuaiYii-13B的两个Parrot-Ask模型。

Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators

  • paper_url: http://arxiv.org/abs/2310.07289
  • repo_url: https://github.com/chanliang/conner
  • paper_authors: Liang Chen, Yang Deng, Yatao Bian, Zeyu Qin, Bingzhe Wu, Tat-Seng Chua, Kam-Fai Wong
  • for: 这篇论文的目的是如何评估生成的知识是否准确?
  • methods: 这篇论文使用了六个不同的评估标准来评估生成知识的准确性、相关性、一致性、有用性、有效性和合理性。
  • results: 研究发现,生成知识的准确性并不是准确度的决定因素,而更重要的是输出的相关性和一致性。此外, authors 还提出了两种策略来提高知识具有任务的性能:Prompt Engineering 和 Knowledge Selection。
    Abstract Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks when being prompted to generate world knowledge. However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives -- Factuality, Relevance, Coherence, Informativeness, Helpfulness and Validity. We conduct an extensive empirical analysis of the generated knowledge from three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that the factuality of generated knowledge, even if lower, does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs are more important than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.
    摘要 大型语言模型(LLM)在下游知识密集任务中表现更好于信息检索技术,但社区对使用这些未经检索的知识表达出了关切。为此,我们介绍了CONNER,一个全面的知识评估框架,可以系统地和自动地评估生成的知识从六个重要角度:事实性、相关性、一致性、启示性、帮助性和有效性。我们进行了对三种不同的LLM生成知识的广泛验证研究,并在两个广泛研究的知识密集任务上进行了实验。surprisingly,我们的研究发现,生成的知识的事实性,即使低,并不会对下游任务产生很大的阻碍。相反,输出的相关性和一致性更加重要于小的事实错误。此外,我们还示出了如何使用CONNER来改进知识密集任务,通过提出两种策略:提示工程和知识选择。我们的评估代码和LLM生成的知识以及人工注释将被释出,以便未来研究。

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

  • paper_url: http://arxiv.org/abs/2310.07284
  • repo_url: https://github.com/haoxiangsnr/llm-tse
  • paper_authors: Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan
    for: 这种研究旨在复制人类在听力环境中选择 interess 的音频源的能力,即cocktail party问题。methods: 这种研究使用了大型自然语言模型(LLM)来提取用户输入文本中的有用 semantic cues,以便增强目标 speaker extraction 模型的可靠性、可控性和性能。results: 实验结果表明,只有文本基础的cue可以达到竞争性表现,文本作为任务选择器的效果,并且将文本基础cue与先前注册的cue相结合可以创造出新的纪录。这是首次使用LLM来引导目标 speaker extraction,可能成为cocktail party问题研究的重要基础。
    Abstract Humans possess an extraordinary ability to selectively focus on the sound source of interest amidst complex acoustic environments, commonly referred to as cocktail party scenarios. In an attempt to replicate this remarkable auditory attention capability in machines, target speaker extraction (TSE) models have been developed. These models leverage the pre-registered cues of the target speaker to extract the sound source of interest. However, the effectiveness of these models is hindered in real-world scenarios due to the unreliable or even absence of pre-registered cues. To address this limitation, this study investigates the integration of natural language description to enhance the feasibility, controllability, and performance of existing TSE models. Specifically, we propose a model named LLM-TSE, wherein a large language model (LLM) extracts useful semantic cues from the user's typed text input. These cues can serve as independent extraction cues, task selectors to control the TSE process or complement the pre-registered cues. Our experimental results demonstrate competitive performance when only text-based cues are presented, the effectiveness of using input text as a task selector, and a new state-of-the-art when combining text-based cues with pre-registered cues. To our knowledge, this is the first study to successfully incorporate LLMs to guide target speaker extraction, which can be a cornerstone for cocktail party problem research.
    摘要 人类具有一种极强的选择性听觉能力,能够在复杂的听觉环境中选择 интересуante的声音来源,这种情况通常被称为“cocktail party”问题。为了复制人类的出色听觉注意力能力,目标说话者抽取(TSE)模型已经被开发。这些模型利用目标说话者的预注册的cue来抽取声音来源。然而,现实中的限制使得这些模型的效果受到限制。为了解决这个问题,本研究 investigate了在现有TSE模型中 integrate natural language description以提高可行性、可控性和性能。specifically,我们提出了一个名为LLM-TSE的模型,其中一个大型自然语言模型(LLM)通过用户输入的文本来提取有用的semantic cue。这些cue可以作为独立的抽取cue、任务选择器来控制TSE过程或补充预注册的cue。我们的实验结果表明,只有文本基础的cue提供时,LLM-TSE模型的性能具有竞争力,使用文本作为任务选择器的效果和将文本基础和预注册的cue结合使用时的新状态。在我们知道的范围内,这是首次成功地integrate LLM来导向目标说话者抽取,这可能是cocktail party问题研究的开山之作。

Enhancing expressivity transfer in textless speech-to-speech translation

  • paper_url: http://arxiv.org/abs/2310.07279
  • repo_url: None
  • paper_authors: Jarod Duret, Benjamin O’Brien, Yannick Estève, Titouan Parcollet
  • for: 本研究旨在提高文本eless speech-to-speech翻译系统的表达准确性。
  • methods: 该研究提出了一种新的方法,通过在不同语言之间传递语言无关信息来提高表达准确性。这种方法在speech unit级别进行操作,并利用多语言情感嵌入来预测目标语言中的音高和持续时间。
  • results: 对于一个法语到英语翻译任务,我们的实验结果表明,我们的方法可以更好地传递表达情感,比现有的状态 искусственный智能系统更高效。
    Abstract Textless speech-to-speech translation systems are rapidly advancing, thanks to the integration of self-supervised learning techniques. However, existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages. Expressivity plays a vital role in conveying emotions, nuances, and cultural subtleties, thereby enhancing communication across diverse languages. To address this issue this study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings to capture language-agnostic information. Specifically, we demonstrate how these embeddings can be used to effectively predict the pitch and duration of speech units in the target language. Through objective and subjective experiments conducted on a French-to-English translation task, our findings highlight the superior expressivity transfer achieved by our approach compared to current state-of-the-art systems.
    摘要 文本语音翻译系统在不断发展,感谢自我超vised学习技术的整合。然而,现有的状态调研系统在准确传递expressivity方面存在缺陷。expressivity在传递情感、细节和文化差异方面扮演着重要的角色,因此可以增强不同语言之间的交流。为解决这个问题,本研究提出了一种新的方法,它在discrete speech unit nivel operate和多语言情感嵌入之间进行了结合。我们示出了这些嵌入可以准确预测目标语言中的音高和持续时间。通过对法语到英语翻译任务进行对象和主观实验,我们的发现表明我们的方法在expressivity传递方面表现出了与当前状态调研系统相比的superior性。

Exploring the Landscape of Large Language Models In Medical Question Answering: Observations and Open Questions

  • paper_url: http://arxiv.org/abs/2310.07225
  • repo_url: None
  • paper_authors: Karolina Korgul, Andrew M. Bean, Felix Krones, Robert McCraith, Adam Mahdi
  • for: 这篇论文旨在了解现代医疗问答系统中大型自然语言模型(LLMs)的局限性,以便在高风险环境中部署这些模型。
  • methods: 该论文使用了多种流行的LLMs,对医疗问题进行了评估,以了解这些模型在医疗领域的性能。
  • results: 论文提出了一些初步的观察和开放问题,以便进一步探讨LLMs在医疗领域的应用。
    Abstract Large Language Models (LLMs) have shown promise in medical question answering by achieving passing scores in standardised exams and have been suggested as tools for supporting healthcare workers. Deploying LLMs into such a high-risk context requires a clear understanding of the limitations of these models. With the rapid development and release of new LLMs, it is especially valuable to identify patterns which exist across models and may, therefore, continue to appear in newer versions. In this paper, we evaluate a wide range of popular LLMs on their knowledge of medical questions in order to better understand their properties as a group. From this comparison, we provide preliminary observations and raise open questions for further research.
    摘要 大型语言模型(LLM)在医疗问答中表现良好,达到标准化考试的过关分数,并被建议作为医疗工作者支持工具。将LLM部署到高风险环境中需要清晰地理解这些模型的限制。随着新的LLM的快速开发和发布,可以从多个模型之间找到共同的特征,因此可能在更新后仍然存在。本文通过评估广泛的流行LLMs来更好地了解它们的特性。从这个比较中,我们提供初步观察和开出更进一步的研究问题。

PHALM: Building a Knowledge Graph from Scratch by Prompting Humans and a Language Model

  • paper_url: http://arxiv.org/abs/2310.07170
  • repo_url: https://github.com/nlp-waseda/comet-atomic-ja
  • paper_authors: Tatsuya Ide, Eiki Murata, Daisuke Kawahara, Takato Yamazaki, Shengzhe Li, Kenta Shinzato, Toshinori Sato
  • for: 这篇论文旨在提出一种从头构建知识图谱的方法,以便建立更好的常识感知型语言模型。
  • methods: 该方法使用了人工智能和大型自然语言模型(LLM)的提问,从而构建了一个日本事件知识图谱。
  • results: 实验结果表明, constructed graph 和通过训练的理解模型生成的推理都是可接受的,并且比较人工智能和 LLM 的提问方法的不同性。 code、数据和模型都可以在 GitHub 上找到。
    Abstract Despite the remarkable progress in natural language understanding with pretrained Transformers, neural language models often do not handle commonsense knowledge well. Toward commonsense-aware models, there have been attempts to obtain knowledge, ranging from automatic acquisition to crowdsourcing. However, it is difficult to obtain a high-quality knowledge base at a low cost, especially from scratch. In this paper, we propose PHALM, a method of building a knowledge graph from scratch, by prompting both crowdworkers and a large language model (LLM). We used this method to build a Japanese event knowledge graph and trained Japanese commonsense generation models. Experimental results revealed the acceptability of the built graph and inferences generated by the trained models. We also report the difference in prompting humans and an LLM. Our code, data, and models are available at github.com/nlp-waseda/comet-atomic-ja.
    摘要 尽管各种预训Transformer的进步很remarkable, neural language model经常不处理常识知识得当。为建立常识意识的模型,有尝试从自动获取到聚集ources。然而,获得高质量常识知识库,特别是从头开始,是很困难的。在这篇论文中,我们提出了PHALM方法,可以从头开始建立知识图грам,通过询问大量的人工工作者和一个大语言模型(LLM)。我们使用了这种方法建立了日本事件知识图грам,并训练了日本常识生成模型。实验结果表明了建立的图грам和训练的模型生成的推理都是可接受的。我们还报告了人工工作者和LLM的询问之间的差异。我们的代码、数据和模型都可以在github.com/nlp-waseda/comet-atomic-ja上获取。

Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

  • paper_url: http://arxiv.org/abs/2310.07161
  • repo_url: None
  • paper_authors: Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Shuo Han, Yunyang Zeng, Ankit Shah, Bhiksha Raj
  • for: 这研究探讨了VoIP(声音在互联网协议)通信中的复杂性,尤其是sender-side denoising效果的研究。
  • methods: 这研究使用了Oaxaca分解法,一种经济学工具,以分析VoIP系统中的语音-语音学变化。此外,研究还使用了PESQ和STOI指标来衡量语音变化的质量。
  • results: 研究发现,VoIP系统中的语音变化非常复杂,受到多种因素的影响。此外,研究还发现了一些新的 psychoacoustic 指标,可以用于衡量语音变化的质量。
    Abstract Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis. This research, rooted in the exploration of proprietary sender-side denoising effects, meticulously evaluates platforms such as Google Meets and Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured examination tailored to various denoising settings and receiver interfaces. A methodological novelty is introduced via the Oaxaca decomposition, traditionally an econometric tool, repurposed herein to analyze acoustic-phonetic perturbations within VoIP systems. To further ground the implications of these transformations, psychoacoustic metrics, specifically PESQ and STOI, were harnessed to furnish a comprehensive understanding of speech alterations. Cumulatively, the insights garnered underscore the intricate landscape of VoIP-influenced acoustic dynamics. In addition to the primary findings, a multitude of metrics are reported, extending the research purview. Moreover, out-of-domain benchmarking for both time and time-frequency domain speech enhancement models is included, thereby enhancing the depth and applicability of this inquiry.
    摘要 在VOIP(语音过网协议)电信中,音频转换引入的复杂性需要严格的分析。这项研究,基于专利发送方清除效果的探索,综合评估Google Meets和Zoom等平台。研究使用2020年深度噪声 dataset,确保结构化的评估适应不同的清除设置和接收界面。本研究 introduce了一种方法新颖,即使用Oaxaca decomposition,原本是一种经济ometrics工具,以分析VOIP系统中的音频语音扰动。此外,通过使用 psychoacoustic 指标,如PESQ和STOI,研究得到了全面的说话变化理解。总的来说,这些发现反映了VOIP-影响的听音动力学景观的复杂性。此外,本研究还报告了多种指标,扩大了研究范围。此外,对域外的时间频率域speech增强模型进行了对比测试,从而提高了这项研究的深度和实用性。

  • paper_url: http://arxiv.org/abs/2310.07155
  • repo_url: None
  • paper_authors: Shamik Roy, Dan Goldwasser
  • for: 这篇论文是为了研究社交媒体在社会变革中的作用,以及如何自动理解在线社会运动的视角和对立面。
  • methods: 该论文提出了一种弱监督图structured prediction方法,通过分析Twitter上的文本和社交网络作者之间的关系,对#BlackLivesMatter相关的微博进行了视角分类。该方法使用了社会语言表示,将文本转换为图形,并使用小量示例集来验证。
  • results: 该论文通过对人工标注测试集进行质量和量itative分析,发现其模型在对视角分类方面表现出色,比多任务基线提高了大幅度。该模型还成功地描述了支持和反对 #BLM 的视角。
    Abstract Social media has become a major driver of social change, by facilitating the formation of online social movements. Automatically understanding the perspectives driving the movement and the voices opposing it, is a challenging task as annotated data is difficult to obtain. We propose a weakly supervised graph-based approach that explicitly models perspectives in #BackLivesMatter-related tweets. Our proposed approach utilizes a social-linguistic representation of the data. We convert the text to a graph by breaking it into structured elements and connect it with the social network of authors, then structured prediction is done over the elements for identifying perspectives. Our approach uses a small seed set of labeled examples. We experiment with large language models for generating artificial training examples, compare them to manual annotation, and find that it achieves comparable performance. We perform quantitative and qualitative analyses using a human-annotated test set. Our model outperforms multitask baselines by a large margin, successfully characterizing the perspectives supporting and opposing #BLM.
    摘要 社交媒体已成为社会变革的主要驱动力,通过促成在线社会运动的形成。自动理解运动中的观点和反对观点是一项具有挑战性的任务,因为获取注释数据是困难的。我们提议一种弱级超视的图structured prediction方法,其中明确表示数据中的观点。我们将文本转换成图形,将其与作者社交网络连接起来,然后进行结构预测,以确定观点。我们的方法使用小量精心标注示例。我们使用大型自然语言模型生成人工训练示例,并与手动注释进行比较,发现它们具有相似性。我们使用人工注释测试集进行量化和质量分析,发现我们的模型在对 #BLM 相关的 tweet 进行观点分类中占据了明显的优势,成功地捕捉了支持和反对 #BLM 的观点。

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

  • paper_url: http://arxiv.org/abs/2310.07147
  • repo_url: None
  • paper_authors: Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer
  • for: 这 paper 的目的是提出一种Parameter-efficient fine-tuning方法,以便在大型语言模型(LLMs)上进行精细调整,而不会增加过多的资源开销。
  • methods: 这 paper 使用了两个新的想法:首先,使用高效的 Lion 优化器,它只跟踪积分和批量大小,这有利于精细调整和量化;其次,对所有模型状态进行量化,并采用梯度流和参数更新方案来更新量化的参数。
  • results: 根据 paper 的结果,QFT 可以将模型状态内存减少至 21% 的标准解决方案,同时保持相同的性能水平。例如,对 LLaMA-7B 模型进行精细调整只需要 <30GB 的内存,可以通过一个 A6000 GPU 满足。
    Abstract Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements. To this end, existing efforts focus on parameter-efficient fine-tuning, which, unfortunately, fail to capitalize on the powerful potential of full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance. Our framework incorporates two novel ideas: (i) we adopt the efficient Lion optimizer, which only keeps track of the momentum and has consistent update magnitudes for each parameter, an inherent advantage for robust quantization; and (ii) we quantize all model states and store them as integer values, and present a gradient flow and parameter update scheme for the quantized weights. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance, e.g., tuning a LLaMA-7B model requires only <30GB of memory, satisfied by a single A6000 GPU.
    摘要
  1. 我们采用高效的 Lion 优化器,只跟踪势量和每个参数的一致更新大小,这是量化的自然优势;2. 我们量化所有模型状态,将其转换为整数值,并提供了一种梯度流和参数更新方案 для量化的 weights。因此,QFT 可以将模型状态内存减少至 21% 的标准解决方案,同时实现相似的性能,例如,对 LLaMA-7B 模型进行 fine-tuning 只需要 <30GB 的内存,可以由一个 A6000 GPU 满足。

Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting

  • paper_url: http://arxiv.org/abs/2310.07146
  • repo_url: None
  • paper_authors: Zhiyu Chen, Yujie Lu, William Yang Wang
    for: 这个论文的目的是开发人工智能助手来帮助计算心理咨询。methods: 这个论文使用了大语言模型进行诊断思维检测,包括三个阶段:主观性评估、相互推理和学习概念板块。results: 实验显示,DoT在认知扭曲检测方面得到了显著改进,同时生成的诊断理由得到了专业人员的批准。
    Abstract Mental illness remains one of the most critical public health issues of our time, due to the severe scarcity and accessibility limit of professionals. Psychotherapy requires high-level expertise to conduct deep, complex reasoning and analysis on the cognition modeling of the patients. In the era of Large Language Models, we believe it is the right time to develop AI assistance for computational psychotherapy. We study the task of cognitive distortion detection and propose the Diagnosis of Thought (DoT) prompting. DoT performs diagnosis on the patient's speech via three stages: subjectivity assessment to separate the facts and the thoughts; contrastive reasoning to elicit the reasoning processes supporting and contradicting the thoughts; and schema analysis to summarize the cognition schemas. The generated diagnosis rationales through the three stages are essential for assisting the professionals. Experiments demonstrate that DoT obtains significant improvements over ChatGPT for cognitive distortion detection, while generating high-quality rationales approved by human experts.
    摘要 心理疾病仍然是当今公共卫生问题中最严重的一个,这主要归结于专业人员的紧缺和访问限制。心理治疗需要高水平的专业知识,以进行深入的、复杂的认知和分析,以模拟患者的认知模型。在大语言模型时代,我们认为是时候开发人工智能助手来支持计算心理治疗。我们研究了认知扭曲检测任务,并提出了诊断思维(DoT)提示。DoT在患者的话语中进行三个阶段的诊断:主观评估,以分离 факты和思想; 对比逻辑,以引出支持和反对思想的逻辑过程; 和 schema 分析,以概括认知schema。生成的诊断理由经过三个阶段的诊断是对专业人员的助手。实验显示,DoT在认知扭曲检测方面比ChatGPT具有显著的改善,同时生成高质量的人类专家批准的诊断理由。

AE-smnsMLC: Multi-Label Classification with Semantic Matching and Negative Label Sampling for Product Attribute Value Extraction

  • paper_url: http://arxiv.org/abs/2310.07137
  • repo_url: https://github.com/zhongfendeng/ae-smnsmlc
  • paper_authors: Zhongfen Deng, Wei-Te Chen, Lei Chen, Philip S. Yu
  • for: 这篇论文主要针对电子商务中的产品特征值EXTRACTION问题,尤其是产品搜寻和推荐。
  • methods: 这篇论文提出了一个基于多 Label Classification 的方法,它可以应对实际情况下的产品特征值EXTRACTION,在缺乏位置信息的情况下进行train模型。它还考虑了产品内多个属性值之间的 semantics 连接,从而帮助产品特征值EXTRACTION。
  • results: 论文的实验结果显示,提出的方法具有优越性和有效性,可以在三个不同的电子商务数据集上进行精确的产品特征值EXTRACTION。
    Abstract Product attribute value extraction plays an important role for many real-world applications in e-Commerce such as product search and recommendation. Previous methods treat it as a sequence labeling task that needs more annotation for position of values in the product text. This limits their application to real-world scenario in which only attribute values are weakly-annotated for each product without their position. Moreover, these methods only use product text (i.e., product title and description) and do not consider the semantic connection between the multiple attribute values of a given product and its text, which can help attribute value extraction. In this paper, we reformulate this task as a multi-label classification task that can be applied for real-world scenario in which only annotation of attribute values is available to train models (i.e., annotation of positional information of attribute values is not available). We propose a classification model with semantic matching and negative label sampling for attribute value extraction. Semantic matching aims to capture semantic interactions between attribute values of a given product and its text. Negative label sampling aims to enhance the model's ability of distinguishing similar values belonging to the same attribute. Experimental results on three subsets of a large real-world e-Commerce dataset demonstrate the effectiveness and superiority of our proposed model.
    摘要 In this paper, we reformulate this task as a multi-label classification task that can be applied to real-world scenarios where only annotation of attribute values is available to train models (i.e., annotation of positional information of attribute values is not available). We propose a classification model with semantic matching and negative label sampling for attribute value extraction. Semantic matching aims to capture the semantic interactions between attribute values of a given product and its text, while negative label sampling aims to enhance the model's ability to distinguish similar values belonging to the same attribute.Experimental results on three subsets of a large real-world e-Commerce dataset demonstrate the effectiveness and superiority of our proposed model.

Comparing Styles across Languages

  • paper_url: http://arxiv.org/abs/2310.07135
  • repo_url: https://github.com/sanusanth/javascript-basic-program
  • paper_authors: Shreya Havaldar, Matthew Pressimone, Eric Wong, Lyle Ungar
  • for: 这篇论文的目的是提出一种解释框架,用于从多种语言的语言模型中提取风格差异并比较不同语言之间的风格。
  • methods: 该论文使用的方法包括生成全面的风格词典和将语言模型中的特征重要性转化为可比较的 lexical category。
  • results: 通过应用该解释框架, authors 创造了首个涵盖四种语言的全面风格数据集,并分析了不同语言之间的尊重程度如何异同。
    Abstract Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.
    摘要 理解不同语言的风格差异对于训练人类和计算机生成文本的文化适应性是有利的。我们介绍了一种解释框架,用于从多语言LM中提取风格差异并对语言之间的风格进行比较。我们的框架包括以下两个主要功能:1. 生成任何语言的完整风格词典。2. 将LM中的特征重要性整合到相似的词汇类别中。我们应用这个框架,比较了四种语言的尊重度,创建了第一个涵盖四种语言的整体多语言尊重数据集,并explored了各语言之间的尊重度差异。我们的方法可以有效地评估不同语言类型的风格差异的贡献,并提供可读取的各种通信方式之间的对比。

Argumentative Stance Prediction: An Exploratory Study on Multimodality and Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2310.07093
  • repo_url: None
  • paper_authors: Arushi Sharma, Abhibha Gupta, Maneesh Bilalpur
  • for: 这项研究旨在评估图像是否对架构预测中的意见角度有益,以及文本基础模型在少数shot设置下的表现。
  • methods: 研究使用了Twitter上的推文和图像,并使用了几种不同的模型来进行预测,包括文本基础模型、图像基础模型和 multimedia 模型。
  • results: 研究发现, ensemble 的文本基础模型(0.817 F1-score)在预测中表现更好,比 multimodal 模型(0.677 F1-score)和文本基础 few-shot 预测(0.550 F1-score)更高。 Additionally, the study found that multimodal models perform better when the image content is summarized as natural language, and using in-context examples improves the few-shot performance of language models.
    Abstract To advance argumentative stance prediction as a multimodal problem, the First Shared Task in Multimodal Argument Mining hosted stance prediction in crucial social topics of gun control and abortion. Our exploratory study attempts to evaluate the necessity of images for stance prediction in tweets and compare out-of-the-box text-based large-language models (LLM) in few-shot settings against fine-tuned unimodal and multimodal models. Our work suggests an ensemble of fine-tuned text-based language models (0.817 F1-score) outperforms both the multimodal (0.677 F1-score) and text-based few-shot prediction using a recent state-of-the-art LLM (0.550 F1-score). In addition to the differences in performance, our findings suggest that the multimodal models tend to perform better when image content is summarized as natural language over their native pixel structure and, using in-context examples improves few-shot performance of LLMs.
    摘要 要提高论证立场预测为多Modal问题,第一次共同任务在多Modal Argument Mining中进行了立场预测,涉及到重要的社会话题,如枪支控制和堕胎。我们的探索研究试图评估图像是否对立场预测在微博中是必要的,并比较未经修改的文本大语言模型(LLM)在少量学习设置下与精心适应单Modal和多Modal模型的性能。我们的工作表明,一个 ensemble of 精心适应文本大语言模型(0.817 F1-score)超过了多Modal(0.677 F1-score)和文本基于几个shot预测使用最新的状态对语言模型(0.550 F1-score)。此外,我们的发现表明,多Modal模型在自然语言描述图像内容时表现较好,而使用Context例子可以提高LLMs的几个shot预测性能。

cs.LG - 2023-10-11

Cost-Driven Hardware-Software Co-Optimization of Machine Learning Pipelines

  • paper_url: http://arxiv.org/abs/2310.07940
  • repo_url: None
  • paper_authors: Ravit Sharma, Wojciech Romaszkan, Feiqian Zhu, Puneet Gupta, Ankur Mehta
  • for: This paper aims to enable widely-applicable smart devices by overcoming the storage and processing requirements of deep neural networks.
  • methods: The paper explores the interactions between quantization, model scaling, and multi-modality with system components such as memory, sensors, and processors, and develops guidelines for optimal system design and model deployment for cost-constrained platforms.
  • results: The paper demonstrates an end-to-end, on-device, biometric user authentication system using a $20 ESP-EYE board.
    Abstract Researchers have long touted a vision of the future enabled by a proliferation of internet-of-things devices, including smart sensors, homes, and cities. Increasingly, embedding intelligence in such devices involves the use of deep neural networks. However, their storage and processing requirements make them prohibitive for cheap, off-the-shelf platforms. Overcoming those requirements is necessary for enabling widely-applicable smart devices. While many ways of making models smaller and more efficient have been developed, there is a lack of understanding of which ones are best suited for particular scenarios. More importantly for edge platforms, those choices cannot be analyzed in isolation from cost and user experience. In this work, we holistically explore how quantization, model scaling, and multi-modality interact with system components such as memory, sensors, and processors. We perform this hardware/software co-design from the cost, latency, and user-experience perspective, and develop a set of guidelines for optimal system design and model deployment for the most cost-constrained platforms. We demonstrate our approach using an end-to-end, on-device, biometric user authentication system using a $20 ESP-EYE board.
    摘要 In this work, we comprehensively explore how quantization, model scaling, and multi-modality interact with system components such as memory, sensors, and processors. We perform this hardware/software co-design from the cost, latency, and user-experience perspective, and develop a set of guidelines for optimal system design and model deployment for the most cost-constrained platforms. We demonstrate our approach using an end-to-end, on-device, biometric user authentication system using a $20 ESP-EYE board.

Enhanced sampling of Crystal Nucleation with Graph Representation Learnt Variables

  • paper_url: http://arxiv.org/abs/2310.07927
  • repo_url: None
  • paper_authors: Ziyue Zou, Pratyush Tiwary
  • for: 这个论文是用来描述一种基于图 neural network的学习方法,用于从实验室中观察的结晶结构中Derive low-dimensional variables。
  • methods: 这种方法使用了简单的卷积和聚合方法,并使用了自适应Encoder设置。
  • results: 该方法可以在多种铁和глицин的多态和多晶体中实现可靠的抽象和自由能计算,并且可以与实验结果相匹配。
    Abstract In this study, we present a graph neural network-based learning approach using an autoencoder setup to derive low-dimensional variables from features observed in experimental crystal structures. These variables are then biased in enhanced sampling to observe state-to-state transitions and reliable thermodynamic weights. Our approach uses simple convolution and pooling methods. To verify the effectiveness of our protocol, we examined the nucleation of various allotropes and polymorphs of iron and glycine from their molten states. Our graph latent variables when biased in well-tempered metadynamics consistently show transitions between states and achieve accurate free energy calculations in agreement with experiments, both of which are indicators of dependable sampling. This underscores the strength and promise of our graph neural net variables for improved sampling. The protocol shown here should be applicable for other systems and with other sampling methods.
    摘要 在这项研究中,我们提出了基于图 neural network的学习方法,使用自适应Encoder设置来 derivate低维度变量从实验室中观察的晶体结构特征。这些变量然后被偏导向增强抽样,以观察状态转移和可靠的热动力学权重。我们的方法使用简单的卷积和聚合方法。为了证明我们的协议的有效性,我们对铁和глицин的多种晶体和合金的融化过程进行了研究。我们的图秘密变量,当偏导向于well-tempered metadynamics中,一致地显示了状态之间的转移,并实现了准确的自由能计算,与实验数据一致,这都是可靠的抽样的标志。这表明了我们的图 neural net变量的优异和承诺,这种方法应该适用于其他系统和其他抽样方法。

First-Order Dynamic Optimization for Streaming Convex Costs

  • paper_url: http://arxiv.org/abs/2310.07925
  • repo_url: None
  • paper_authors: M. Rostami, H. Moradian, S. S. Kia
  • for: 该 paper 提出了一组新的优化算法,用于解决一类具有时间变化的流量成本函数的凸优化问题。
  • methods: 我们开发了一种方法来跟踪优化解的最优解,并且只使用了成本函数的首次导数,从而使得算法更加计算效率。
  • results: 我们比较了我们的算法和梯度下降算法,并证明了梯度下降算法在优化问题中不是有效的。我们还通过一些示例,如解决一个预测控制问题,来演示我们的结果。
    Abstract This paper proposes a set of novel optimization algorithms for solving a class of convex optimization problems with time-varying streaming cost function. We develop an approach to track the optimal solution with a bounded error. Unlike the existing results, our algorithm is executed only by using the first-order derivatives of the cost function which makes it computationally efficient for optimization with time-varying cost function. We compare our algorithms to the gradient descent algorithm and show why gradient descent is not an effective solution for optimization problems with time-varying cost. Several examples including solving a model predictive control problem cast as a convex optimization problem with a streaming time-varying cost function demonstrate our results.
    摘要 这篇论文提出了一组新的优化算法,用于解决具有时间变化流量成本函数的凸优化问题。我们开发了一种方法,以确保遵循优化解的 bounded error。与现有结果不同,我们的算法只使用成本函数的首次导数,从而使其 computationally efficient 于优化时间变化的成本函数。我们与梯度下降算法进行比较,并解释了梯度下降算法在优化时间变化的成本函数时的不足之处。这篇论文通过解决一个模型预测控制问题,具有流动时间变化的成本函数,来展示我们的结果。Note: "Simplified Chinese" is a romanization of Chinese that uses a simplified set of characters and grammar rules to represent the language. It is commonly used in mainland China and Singapore.

Unraveling the Single Tangent Space Fallacy: An Analysis and Clarification for Applying Riemannian Geometry in Robot Learning

  • paper_url: http://arxiv.org/abs/2310.07902
  • repo_url: None
  • paper_authors: Noémie Jaquier, Leonel Rozo, Tamim Asfour
  • for: 本研究旨在探讨在 робототехнике中广泛使用机器学习方法处理数据时,数据中的自然几何约束如何得到有效处理。
  • methods: 本文提出了一种基于偏微分几何的方法,以解决机器学习方法中数据中的几何约束问题。
  • results: 本研究发现,在使用偏微分几何时,存在一种“单 tangent 空间误差”,即仅将数据项投影到单个 tangent 空间上,然后使用存在误差的机器学习算法进行处理。这种方法的缺陷导致了机器学习模型的不准确性和不稳定性。
    Abstract In the realm of robotics, numerous downstream robotics tasks leverage machine learning methods for processing, modeling, or synthesizing data. Often, this data comprises variables that inherently carry geometric constraints, such as the unit-norm condition of quaternions representing rigid-body orientations or the positive definiteness of stiffness and manipulability ellipsoids. Handling such geometric constraints effectively requires the incorporation of tools from differential geometry into the formulation of machine learning methods. In this context, Riemannian manifolds emerge as a powerful mathematical framework to handle such geometric constraints. Nevertheless, their recent adoption in robot learning has been largely characterized by a mathematically-flawed simplification, hereinafter referred to as the ``single tangent space fallacy". This approach involves merely projecting the data of interest onto a single tangent (Euclidean) space, over which an off-the-shelf learning algorithm is applied. This paper provides a theoretical elucidation of various misconceptions surrounding this approach and offers experimental evidence of its shortcomings. Finally, it presents valuable insights to promote best practices when employing Riemannian geometry within robot learning applications.
    摘要 在 роботиCS中,许多下游任务利用机器学习方法处理、模型或生成数据。经常times,这些数据包括具有几何约束的变数,如体积正规的旋转矩阵或弹性和操作矩阵的正定性。有效地处理这些几何约束需要将数据转换为几何空间中的一个紧致的数据集合,并且使用几何学的工具来设计机器学习方法。在这个上下文中,里曼维 manifold emerges as a powerful mathematical framework to handle such geometric constraints.然而,在机器学习中的最近几年,这种方法的采用受到了“单点 tangent 空间误解”的影响,即将数据转换为单点 tangent 空间,然后使用对应的机器学习算法进行处理。本文将提供这种方法的理论详细阐述,以及实验证据证明其缺陷。最后,它将提供实践中的最佳做法,以便在机器学习应用中正确地使用几何学。

Precise localization within the GI tract by combining classification of CNNs and time-series analysis of HMMs

  • paper_url: http://arxiv.org/abs/2310.07895
  • repo_url: None
  • paper_authors: Julia Werner, Christoph Gerum, Moritz Reiber, Jörg Nick, Oliver Bringmann
  • for: 这个论文是为了高效地分类来自Video Capsule Endoscopy (VCE)研究中的肠胃部分图像而设计的。
  • methods: 这个论文使用了卷积神经网络(CNN)进行分类,并利用隐马尔可夫模型(HMM)的时间序列分析特性。
  • results: 该方法在里士满(RI)肠胃学数据集上达到了$98.04%$的准确率,可以准确地地标定肠胃迷你隧道中的位置,并且只需要约1M个参数,因此适用于低功耗设备。
    Abstract This paper presents a method to efficiently classify the gastroenterologic section of images derived from Video Capsule Endoscopy (VCE) studies by exploring the combination of a Convolutional Neural Network (CNN) for classification with the time-series analysis properties of a Hidden Markov Model (HMM). It is demonstrated that successive time-series analysis identifies and corrects errors in the CNN output. Our approach achieves an accuracy of $98.04\%$ on the Rhode Island (RI) Gastroenterology dataset. This allows for precise localization within the gastrointestinal (GI) tract while requiring only approximately 1M parameters and thus, provides a method suitable for low power devices
    摘要 这篇论文提出了一种方法,通过结合卷积神经网络(CNN)和隐马尔可夫模型(HMM)的时间序列分析特性来高效地分类 Gastroenterologic 图像序列,从Video Capsule Endoscopy (VCE) 研究中获取的。研究表明,顺序时间序列分析可以corrrect CNN 输出中的错误。我们的方法在 Rhode Island (RI) Gastroenterology 数据集上达到了 $98.04\%$ 的准确率,这使得在 Gastrointestinal (GI) 轨迹中进行精确定位,只需约 1M 参数,因此适用于低功耗设备。

ASV Station Keeping under Wind Disturbances using Neural Network Simulation Error Minimization Model Predictive Control

  • paper_url: http://arxiv.org/abs/2310.07892
  • repo_url: None
  • paper_authors: Jalil Chavez-Galaviz, Jianwen Li, Ajinkya Chaudhary, Nina Mahmoudian
  • for: 这个论文主要针对 Autonomous Surface Vehicles (ASVs) 在狭窄空间中进行探测和相对定位时的稳定控制问题。
  • methods: 该论文提出了一种基于神经网络预测误差最小化 (NNSEM-MPC) 的模型预测控制器,用于精准地预测 ASV 的动态行为下风干扰的影响。
  • results: 对于风干扰情况下的稳定控制问题,该论文的提出的 NNSEM-MPC 方法与其他控制器(包括 backstepping 控制器、滑模控制器、简化动态 MPC (SD-MPC)、神经普通几何 MPC (NODE-MPC) 和知识基于 NODE MPC (KNODE-MPC))进行比较,并在六个测试情况下得到了显著的优异性。
    Abstract Station keeping is an essential maneuver for Autonomous Surface Vehicles (ASVs), mainly when used in confined spaces, to carry out surveys that require the ASV to keep its position or in collaboration with other vehicles where the relative position has an impact over the mission. However, this maneuver can become challenging for classic feedback controllers due to the need for an accurate model of the ASV dynamics and the environmental disturbances. This work proposes a Model Predictive Controller using Neural Network Simulation Error Minimization (NNSEM-MPC) to accurately predict the dynamics of the ASV under wind disturbances. The performance of the proposed scheme under wind disturbances is tested and compared against other controllers in simulation, using the Robotics Operating System (ROS) and the multipurpose simulation environment Gazebo. A set of six tests were conducted by combining two wind speeds (3 m/s and 6 m/s) and three wind directions (0$^\circ$, 90$^\circ$, and 180$^\circ$). The simulation results clearly show the advantage of the NNSEM-MPC over the following methods: backstepping controller, sliding mode controller, simplified dynamics MPC (SD-MPC), neural ordinary differential equation MPC (NODE-MPC), and knowledge-based NODE MPC (KNODE-MPC). The proposed NNSEM-MPC approach performs better than the rest in 4 out of the 6 test conditions, and it is the second best in the 2 remaining test cases, reducing the mean position and heading error by at least 31\% and 46\% respectively across all the test cases. In terms of execution speed, the proposed NNSEM-MPC is at least 36\% faster than the rest of the MPC controllers. The field experiments on two different ASV platforms showed that ASVs can effectively keep the station utilizing the proposed method, with a position error as low as $1.68$ m and a heading error as low as $6.14^{\circ}$ within time windows of at least $150$s.
    摘要 Station keeping是自主水下车辆(ASV)的必需操作,尤其在封闭空间中使用,以进行需要ASV保持位置或与其他车辆合作,其中相对位置对任务有重要影响。然而,这种操作可能会对 классическими反馈控制器而成为挑战,因为需要准确的ASV动态模型和环境干扰的数据。这项工作提议使用神经网络预测误差最小化模型predictive控制(NNSEM-MPC)来准确预测ASV在风干扰下的动态。我们对提议的方案在风干扰下的性能进行了测试和比较,使用Robotics Operating System(ROS)和多用途 simulate环境Gazebo。我们进行了六个测试,其中每个测试都 combinated two wind speed(3 m/s和6 m/s)和 three wind direction(0$^\circ$, 90$^\circ$,和180$^\circ)。模拟结果明显地显示了NNSEM-MPC的优势,胜过以下方法:backstepping controller、滑模控制、简化动态MPC(SD-MPC)、神经ordinary differential equation MPC(NODE-MPC)和知识基于NODE MPC(KNODE-MPC)。提议的NNSEM-MPC方法在4个测试条件中表现最佳,并在另外2个测试条件中表现第二最佳,将mean position和heading error降低至少31%和46%。在执行速度方面,提议的NNSEM-MPC至少36% faster than其他MPC控制器。在两种不同ASV平台上进行的野外实验表明,ASV可以通过提议的方法 effetively keep station,position error为1.68米,heading error为6.14度,在时间窗口至少150秒内。

A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks

  • paper_url: http://arxiv.org/abs/2310.07891
  • repo_url: None
  • paper_authors: Behrad Moniri, Donghwan Lee, Hamed Hassani, Edgar Dobriban
  • for: This paper is written for understanding the conditions under which feature learning occurs in deep neural networks, and how to improve the learning process by introducing multiple rank-one components.
  • methods: The paper uses two-layer fully-connected neural networks and gradient descent with a growing learning rate to introduce multiple rank-one components, which enable the learning of non-linear features.
  • results: The paper shows that with a growing learning rate, the training process introduces multiple rank-one components, each corresponding to a specific polynomial feature, and these non-linear features can enhance learning. The limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes.
    Abstract Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer followed by ridge regression on the second layer can lead to feature learning; characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the loss, we demonstrate that these non-linear features can enhance learning.
    摘要 <>translate "Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer followed by ridge regression on the second layer can lead to feature learning; characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the loss, we demonstrate that these non-linear features can enhance learning."Translate:Feature 学习是深度神经网络的成功的基本原因之一。在满足 certain 条件下的两层全连接神经网络中,一步Gradient Descent在第一层 followed by Ridge Regression 在第二层可以导致特征学习,这可以通过特征矩阵的spectrum中的分离rank-one component -- spike -- 来Characterize。然而,随着步长不变,这个spike只会传递线性函数的信息,因此学习非线性Component是不可能的。我们显示,随着样本大小增加,这种训练实际引入多个rank-one component,每个component都对应于特定的多项式特征。我们进一步证明, limiting 大量和大样本训练和测试错误的更新神经网络的准确性是由这些spike完全Characterize。通过精确分析改进的损失,我们示出这些非线性特征可以提高学习。

Refined Mechanism Design for Approximately Structured Priors via Active Regression

  • paper_url: http://arxiv.org/abs/2310.07874
  • repo_url: None
  • paper_authors: Christos Boutsikas, Petros Drineas, Marios Mertzanidis, Alexandros Psomas, Paritosh Verma
  • for: 这个论文的目的是解决一个具有大量商品和投标者的契约问题,投标者的估价是从高维不确定的先前分布中独立采样出来的。
  • methods: 该论文使用了一种以话题模型为基础的方法,使用活动学习组件和机制设计组件。活动学习组件负责与投标者进行互动,并输出低维度的投标者类型,而机制设计组件则负责使机制在低维度模型下变得对投标者类型有效。
  • results: 该论文的结果表明,使用这种方法可以减少机制设计中的假设和限制,并且可以在不同的话题模型下实现更高的收益。此外,该论文还开拓了机制设计和随机线性代数(RLA)的连接,并将RLA的多个突破成果应用到机制设计中。
    Abstract We consider the problem of a revenue-maximizing seller with a large number of items $m$ for sale to $n$ strategic bidders, whose valuations are drawn independently from high-dimensional, unknown prior distributions. It is well-known that optimal and even approximately-optimal mechanisms for this setting are notoriously difficult to characterize or compute, and, even when they can be found, are often rife with various counter-intuitive properties. In this paper, following a model introduced recently by Cai and Daskalakis~\cite{cai2022recommender}, we consider the case that bidders' prior distributions can be well-approximated by a topic model. We design an active learning component, responsible for interacting with the bidders and outputting low-dimensional approximations of their types, and a mechanism design component, responsible for robustifying mechanisms for the low-dimensional model to work for the approximate types of the former component. On the active learning front, we cast our problem in the framework of Randomized Linear Algebra (RLA) for regression problems, allowing us to import several breakthrough results from that line of research, and adapt them to our setting. On the mechanism design front, we remove many restrictive assumptions of prior work on the type of access needed to the underlying distributions and the associated mechanisms. To the best of our knowledge, our work is the first to formulate connections between mechanism design, and RLA for active learning of regression problems, opening the door for further applications of randomized linear algebra primitives to mechanism design.
    摘要 我们考虑一个寻求最大化收益的售家,面临一大量的商品($m$)供$n$名战略性投标者购买。这些投标者的价值是从高维度、未知的对应分布中独立地获取。对于这个设定,优化和约优化的机制是极其困难computational和易受到各种 counter-intuitive 的性质。在这篇文章中,我们遵循 Cai 和 Daskalakis (2022)提出的模型,即投标者对价值的对应分布可以以主题模型的形式近似。我们设计了一个活动学习部分,负责与投标者进行互动,从来到低维度的近似类型,以及一个机制设计部分,负责对这个低维度模型进行强健化,以适应投标者的实际类型。在活动学习方面,我们将问题套用在Randomized Linear Algebra(RLA)的框架中,允许我们从这个领域的研究中吸取多个突破性结果,并将其适应到我们的设定。在机制设计方面,我们取消了许多先前的研究对机制设计的限制性假设,例如需要存取到背后的分布和相关机制。我们的工作是首个将机制设计与 RLA 的活动学习连接起来,这会开启更多的应用机会,将随机线性代数元素应用到机制设计中。

QArchSearch: A Scalable Quantum Architecture Search Package

  • paper_url: http://arxiv.org/abs/2310.07858
  • repo_url: None
  • paper_authors: Ankit Kulshrestha, Danylo Lykov, Ilya Safro, Yuri Alexeev
  • for: 这篇论文旨在提供一种AI驱动的量子架构搜索包,用于自动选择合适的量子模型,以实现量子计算任务。
  • methods: 该论文使用的方法包括使用\texttt{QTensor}库作为后端,并采用两级并行的策略来加速搜索过程,以便在高性能计算系统上运行。
  • results: 论文的实验结果表明,\texttt{QArchSearch}可以有效地搜索大型量子电路,并可以探索不同的量子应用中的更复杂的模型。
    Abstract The current era of quantum computing has yielded several algorithms that promise high computational efficiency. While the algorithms are sound in theory and can provide potentially exponential speedup, there is little guidance on how to design proper quantum circuits to realize the appropriate unitary transformation to be applied to the input quantum state. In this paper, we present \texttt{QArchSearch}, an AI based quantum architecture search package with the \texttt{QTensor} library as a backend that provides a principled and automated approach to finding the best model given a task and input quantum state. We show that the search package is able to efficiently scale the search to large quantum circuits and enables the exploration of more complex models for different quantum applications. \texttt{QArchSearch} runs at scale and high efficiency on high-performance computing systems using a two-level parallelization scheme on both CPUs and GPUs, which has been demonstrated on the Polaris supercomputer.
    摘要 当前的量子计算时代已经涌现了许多算法,这些算法承诺可以提供高效的计算能力。虽然这些算法在理论上是有效的,但是有很少关于如何设计合适的量子电路来实现输入量子状态的应用转换的指导。在这篇论文中,我们介绍了\texttt{QArchSearch},一个基于人工智能的量子建筑搜索包,它使用\texttt{QTensor}库作为后端,并提供了一种原则正的和自动化的方法来找到给定任务和输入量子状态的最佳模型。我们表明了\texttt{QArchSearch}可以高效地扩展到大型量子电路,并允许对不同量子应用的模型进行更加复杂的探索。\texttt{QArchSearch}在高性能计算系统上运行,使用了两级并行化策略,其在CPUs和GPUs上进行了两级并行,这已经在Polaris超级计算机上得到了证明。

On the Computational Complexity of Private High-dimensional Model Selection via the Exponential Mechanism

  • paper_url: http://arxiv.org/abs/2310.07852
  • repo_url: None
  • paper_authors: Saptarshi Roy, Ambuj Tewari
  • for: 该文章研究了在高维度含有稀畴元素的线性回归模型下的模型选择问题,并在敏感性框架下进行研究。特别是,文章研究了极私的最佳子集选择问题,并对其实现了utilities保证。
  • methods: 文章采用了著名的指数机制来选择最佳模型,并在满足某种margin条件下Establish its strong model recovery property。然而,指数搜索空间的指数机制带来了严重的计算瓶颈。为了解决这个挑战,文章提出了 Metropolis-Hastings算法来采样步骤,并Establish its polynomial mixing time to its stationary distribution in the problem parameters $n,p$, and $s$.
  • results: 文章的结果表明,Metropolis-Hastings算法可以在高维度含有稀畴元素的线性回归模型下实现极私的最佳子集选择,并且可以在某种margin条件下Establish strong model recovery property。此外,文章还提出了一种approximate differential privacy的方法来保证最终估计的Metropolis-Hastings random walk的极私性。最后,文章还进行了一些Ilustrative simulations,并证明了其理论结论。
    Abstract We consider the problem of model selection in a high-dimensional sparse linear regression model under the differential privacy framework. In particular, we consider the problem of differentially private best subset selection and study its utility guarantee. We adopt the well-known exponential mechanism for selecting the best model, and under a certain margin condition, we establish its strong model recovery property. However, the exponential search space of the exponential mechanism poses a serious computational bottleneck. To overcome this challenge, we propose a Metropolis-Hastings algorithm for the sampling step and establish its polynomial mixing time to its stationary distribution in the problem parameters $n,p$, and $s$. Furthermore, we also establish approximate differential privacy for the final estimates of the Metropolis-Hastings random walk using its mixing property. Finally, we also perform some illustrative simulations that echo the theoretical findings of our main results.
    摘要 我团队考虑了在高维简单线性回归模型下进行模型选择,并在权限保护框架下进行研究。特别是,我们考虑了不同极值隐私最佳分布选择问题,并研究其性能保证。我们采用了著名的凝聚机制来选择最佳模型,并在某种margin条件下证明了它的强型回归性质。然而,凝聚搜索空间的凝聚机制带来了严重的计算瓶颈。为了解决这个挑战,我们提议了 Metropolis-Hastings算法来实现抽样步骤,并证明了它在参数$n$, $p$, 和 $s$下的多项式混合时间到其站点分布。此外,我们还证明了对最终估计的Metropolis-Hastings随机步骤的approximate权限保护。最后,我们还进行了一些与理论结果相符的仿真实验。

Measuring Feature Sparsity in Language Models

  • paper_url: http://arxiv.org/abs/2310.07837
  • repo_url: None
  • paper_authors: Mingyang Deng, Lucas Tao, Joe Benton
  • for: 本研究旨在探讨语言模型中活动的表征方式,具体来说是用简单的线性组合来描述输入文本中的特征方向。
  • methods: 本研究使用 sparse coding 技术来重建特征方向,并开发了一些指标来评估 sparse coding 的成功程度。
  • results: 研究发现,语言模型的活动可以准确地表示为简单的线性组合,而且这种线性和简单性假设都得到了证明。此外,研究还发现模型活动的稀热程度最高在输入文本的第一层和最后一层。
    Abstract Recent works have proposed that activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several language models. We find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.
    摘要 近期研究建议,语言模型的激活可以表示为稀疏线性组合的特征向量。基于这个假设,这些研究尝试使用稀疏编码重建特征方向。我们开发了评估这些稀疏编码技术的度量,并测试了Linearity和稀疏性假设的有效性。我们发现我们的度量可以预测稀疏线性活动的水平,并能够分辨稀疏线性数据和其他数据 Distribution。我们使用我们的度量测量了一些语言模型的激活水平,并发现语言模型的激活可以准确地表示为稀疏线性组合的特征向量,比控制数据集更高。此外,我们还发现模型的激活在第一层和最后一层最为稀疏。

Large Language Models Are Zero-Shot Time Series Forecasters

  • paper_url: http://arxiv.org/abs/2310.07820
  • repo_url: https://github.com/ngruver/llmtime
  • paper_authors: Nate Gruver, Marc Finzi, Shikai Qiu, Andrew Gordon Wilson
  • for: 这个论文主要针对时间序列预测 task,使用大型自然语言模型(LLMs)如 GPT-3 和 LLaMA-2 进行预测。
  • methods: 该论文提出了一种将时间序列编码为数字字符串的方法,并使用这种方法来实现 LLMS 的 zero-shot 推广。另外,论文还提出了一种将离散分布转换为高度灵活的连续值分布的方法。
  • results: 论文发现, LLMS 可以在时间序列预测 task 上达到或超过专门设计的时间序列模型的性能,而无需训练。此外,论文还示出了 LLMS 可以处理缺失数据、考虑文本副信息和回答预测问题等。
    Abstract By encoding time series as a string of numerical digits, we can frame time series forecasting as next-token prediction in text. Developing this approach, we find that large language models (LLMs) such as GPT-3 and LLaMA-2 can surprisingly zero-shot extrapolate time series at a level comparable to or exceeding the performance of purpose-built time series models trained on the downstream tasks. To facilitate this performance, we propose procedures for effectively tokenizing time series data and converting discrete distributions over tokens into highly flexible densities over continuous values. We argue the success of LLMs for time series stems from their ability to naturally represent multimodal distributions, in conjunction with biases for simplicity, and repetition, which align with the salient features in many time series, such as repeated seasonal trends. We also show how LLMs can naturally handle missing data without imputation through non-numerical text, accommodate textual side information, and answer questions to help explain predictions. While we find that increasing model size generally improves performance on time series, we show GPT-4 can perform worse than GPT-3 because of how it tokenizes numbers, and poor uncertainty calibration, which is likely the result of alignment interventions such as RLHF.
    摘要 通过将时间序列编码为一串数字,我们可以将时间序列预测转化为下一个token的预测问题。发展这种方法,我们发现大语言模型(LLM)such as GPT-3和LLaMA-2可以 surprisingly Zero-shot推断时间序列,其性能与专门为下游任务训练的时间序列模型相当或超越。为了实现这种性能,我们提出了 tokenization 时间序列数据和将精度分布转化为高度灵活的浮点值的方法。我们认为 LLMS 对时间序列的成功归功于它们的自然地表示多模态分布,以及对简单、重复的偏好,这些特征与许多时间序列中的重复季节性脉冲有很大相似性。我们还示出了 LLMS 可以自然处理缺失数据,不需要采用非数字文本进行填充,同时可以处理文本侧信息和解释预测结果。虽然我们发现模型大小增加通常会提高时间序列的性能,但我们发现 GPT-4 可能会比 GPT-3 更差,这可能是因为 GPT-4 的 tokenization 方式不同,以及不良的uncertainty calibration,这可能是由RLHF等对适应性进行的调整所致。

Online RL in Linearly $q^π$-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore

  • paper_url: http://arxiv.org/abs/2310.07811
  • repo_url: None
  • paper_authors: Gellért Weisz, András György, Csaba Szepesvári
  • for: 这种研究是为了解决线性$q^\pi$-可实现性假设下的在线学习决策问题。
  • methods: 这篇论文使用的方法是同时学习扫描过程中的状态和动作值,以及使用一种新的算法来同时学习扫描过程中的状态和动作值。
  • results: 这篇论文提出了一种新的算法,可以在线学习决策问题,并且可以在有限的交互中返回$\epsilon$-优化的策略。此外,论文还证明了这种算法在假设错误的情况下的性能,并且显示了性能与假设错误之间的关系。
    Abstract We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^\pi$-realizable MDPs where for any policy, all the actions have approximately equal values, and skipping over these states by following an arbitrarily fixed policy in those states transforms the problem to a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously learns what states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.
    摘要 我们考虑在线束规动学学习(RL)中的集合Markov决策过程(MDP),在Linear-$q^\pi$-可现实性假设下,其中假设所有策略的动作价值可以表示为线性函数的状态动作特征。这个类型比linear MDP更加一般,因为在linear MDP中,转移核和奖励函数都是特征Vector的线性函数。作为我们的首要贡献,我们显示了这两个类型之间的差别在于在Linearly-$q^\pi$-可现实性MDP中有些状态,对于任何策略,所有的动作都有相对平等的价值,并且通过遵循任意固定策略在这些状态中跳过这些状态,可以将问题转化为一个线性MDP。基于这一观察,我们 derivate了一种新的(计算不fficiente)学习算法,可以同时学习在Linearly-$q^\pi$-可现实性MDP中哪些状态应该跳过以及在隐藏在问题中的线性MDP上运行另一个学习算法。这种方法可以在$\text{polylog}(H, d)/\epsilon^2$交互后返回一个$\epsilon$-优化策略,其中$H$是时间悬度和$d$是特征vector的维度,这是首个在这种设定下的多项式样本复杂度在线RL算法。我们的结果是在误specified情况下证明的,其中样本复杂度显示了对误pecification错误的恶化。

FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning

  • paper_url: http://arxiv.org/abs/2310.07807
  • repo_url: None
  • paper_authors: Ensiye Kiyamousavi, Boris Kraychev, Ivan Koychev
    for: 这种研究的目标是为了解决 Federated Learning (FL) 中数据不均衡和模型融合效果的问题。methods: 这种研究使用了多种数据分割技术,以适应不同的数据不均衡。results: 研究表明,我们提出的方法可以准确地测量数据不均衡,并且可以逐步挑战 FL 算法。实验结果表明,模型在我们提出的分布上进行训练后,模型更加异质。
    Abstract Federated learning (FL) is a decentralized machine learning approach where independent learners process data privately. Its goal is to create a robust and accurate model by aggregating and retraining local models over multiple rounds. However, FL faces challenges regarding data heterogeneity and model aggregation effectiveness. In order to simulate real-world data, researchers use methods for data partitioning that transform a dataset designated for centralized learning into a group of sub-datasets suitable for distributed machine learning with different data heterogeneity. In this paper, we study the currently popular data partitioning techniques and visualize their main disadvantages: the lack of precision in the data diversity, which leads to unreliable heterogeneity indexes, and the inability to incrementally challenge the FL algorithms. To resolve this problem, we propose a method that leverages entropy and symmetry to construct 'the most challenging' and controllable data distributions with gradual difficulty. We introduce a metric to measure data heterogeneity among the learning agents and a transformation technique that divides any dataset into splits with precise data diversity. Through a comparative study, we demonstrate the superiority of our method over existing FL data partitioning approaches, showcasing its potential to challenge model aggregation algorithms. Experimental results indicate that our approach gradually challenges the FL strategies, and the models trained on FedSym distributions are more distinct.
    摘要 federated learning (FL) 是一种分布式机器学习方法,其目标是通过聚合和重新训练本地模型,创建一个稳定和准确的模型。然而,FL 面临数据多样性和模型聚合效果的挑战。为了模拟实际数据,研究人员使用数据分区方法,将Centralized learning 的数据集转换为适合分布式机器学习的多个子集。在这篇论文中,我们研究了目前流行的数据分区技术,并视觉化它们的主要缺点:数据多样性精度的缺失,导致无法准确地评估多样性指标,以及不能逐步挑战 FL 算法。为解决这个问题,我们提议一种方法,利用熵和对称来构建 '最有挑战性' 和可控的数据分布。我们介绍了一个度量数据多样性 среди学习代理的 metric,以及一种分割任何数据集的技术,以确保数据多样性的精度。通过比较研究,我们证明了我们的方法比现有的 FL 数据分区方法更加有利,并示出了它的潜在挑战模型聚合算法的能力。实验结果表明,我们的方法可以逐步挑战 FL 策略,并且模型在 FedSym 分布上训练得更加独特。

Using Spark Machine Learning Models to Perform Predictive Analysis on Flight Ticket Pricing Data

  • paper_url: http://arxiv.org/abs/2310.07787
  • repo_url: None
  • paper_authors: Philip Wong, Phue Thant, Pratiksha Yadav, Ruta Antaliya, Jongwook Woo
  • for: 预测航空票价(非停站) flight pricing
  • methods: 使用 r2(r-square)和 RMSE 进行预测,利用大量数据集(来自Expedia.com),包括约2000万记录和4.68 gigabytes
  • results: 确定最佳模型,以便在实际世界中预测航空票价,并且要求模型具有良好的泛化能力和优化的处理时间。
    Abstract This paper discusses predictive performance and processes undertaken on flight pricing data utilizing r2(r-square) and RMSE that leverages a large dataset, originally from Expedia.com, consisting of approximately 20 million records or 4.68 gigabytes. The project aims to determine the best models usable in the real world to predict airline ticket fares for non-stop flights across the US. Therefore, good generalization capability and optimized processing times are important measures for the model. We will discover key business insights utilizing feature importance and discuss the process and tools used for our analysis. Four regression machine learning algorithms were utilized: Random Forest, Gradient Boost Tree, Decision Tree, and Factorization Machines utilizing Cross Validator and Training Validator functions for assessing performance and generalization capability.
    摘要 这篇论文讨论了预测性能和利用r2(r平方)和RMSE来处理飞行票价数据,使用了大量的数据集,来自Expedia.com,包含约2000万记录或4.68吉比特。项目的目标是确定用于实际应用中预测航空票价的最佳模型,因此泛化能力和处理时间优化是重要的评价指标。我们会通过特征重要性来探索关键的商业发现,并讨论我们使用的分析过程和工具。在本研究中,我们使用了四种回归机器学习算法:随机森林、梯度提升树、决策树和因素分解机器,并使用了cross validate和training validate函数来评估性能和泛化能力。

Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

  • paper_url: http://arxiv.org/abs/2310.07786
  • repo_url: None
  • paper_authors: Zheqing Zhu, Yueyang Liu, Xu Kuang, Benjamin Van Roy
  • for: 实际应用中的情感带its often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends.
  • methods: 我们提出了一个新的非站ARY contextual bandit算法,它结合了可扩展的深度对应网络架构,并且运用了一个策略性优先预测的探索机制,以优先收集持续性最高的信息。
  • results: 我们通过对两个实际推荐数据集进行实验,证明了我们的方法与现有的基eline signifiantly outperform。
    Abstract Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends. While a number of non-stationary contextual bandit learning algorithms have been proposed in the literature, they excessively explore due to a lack of prioritization for information of enduring value, or are designed in ways that do not scale in modern applications with high-dimensional user-specific features and large action set, or both. In this paper, we introduce a novel non-stationary contextual bandit algorithm that addresses these concerns. It combines a scalable, deep-neural-network-based architecture with a carefully designed exploration mechanism that strategically prioritizes collecting information with the most lasting value in a non-stationary environment. Through empirical evaluations on two real-world recommendation datasets, which exhibit pronounced non-stationarity, we demonstrate that our approach significantly outperforms the state-of-the-art baselines.
    摘要 实际应用中的上下文带有趋势的带宽机器学习问题常会出现非站点性,这可能由季节性、偶然性和不断变化的社会趋势引起。虽然文献中有许多非站点上下文带宽机器学习算法,但它们过度探索,因为缺乏综合价值信息的优先级,或者设计不适应现代应用中高维用户特定特征和大量动作集,或者都两者。在本文中,我们介绍了一种新的非站点上下文带宽机器学习算法。它将拓扑图 neural network 作为核心组件,并通过精心设计的探索机制,以优先级Collect information with the most lasting value in a non-stationary environment。经验证明,我们的方法在两个实际推荐数据集上,具有明显的非站点性,significantly outperform了状态对照。

Promoting Robustness of Randomized Smoothing: Two Cost-Effective Approaches

  • paper_url: http://arxiv.org/abs/2310.07780
  • repo_url: None
  • paper_authors: Linbo Liu, Trong Nghia Hoang, Lam M. Nguyen, Tsui-Wei Weng
  • for: 提供可靠的 robustness 保证 для滑动化神经网络分类器。
  • methods: 提出了两种成本效果的方法,包括 AdvMacer 和 EsbRS,以提高滑动化神经网络分类器的 robustness 性能,而不会影响 clean 性能。
  • results: Comparing with SOTA 基elines, AdvMacer 可以提高滑动化神经网络分类器的 robustness 性能,并且可以在训练时间上减少 3 倍的时间成本。EsbRS 可以提高模型 ensemble 的 robustness 性能,并且提出了一种新的模型 ensemble 设计方法,以提高 robustness 性能。
    Abstract Randomized smoothing has recently attracted attentions in the field of adversarial robustness to provide provable robustness guarantees on smoothed neural network classifiers. However, existing works show that vanilla randomized smoothing usually does not provide good robustness performance and often requires (re)training techniques on the base classifier in order to boost the robustness of the resulting smoothed classifier. In this work, we propose two cost-effective approaches to boost the robustness of randomized smoothing while preserving its clean performance. The first approach introduces a new robust training method AdvMacerwhich combines adversarial training and robustness certification maximization for randomized smoothing. We show that AdvMacer can improve the robustness performance of randomized smoothing classifiers compared to SOTA baselines, while being 3x faster to train than MACER baseline. The second approach introduces a post-processing method EsbRS which greatly improves the robustness certificate based on building model ensembles. We explore different aspects of model ensembles that has not been studied by prior works and propose a novel design methodology to further improve robustness of the ensemble based on our theoretical analysis.
    摘要 Randomized smoothing 在 adversarial robustness 领域已经吸引了关注,以提供可证明的 Robustness garanties на smoothed neural network 分类器。然而,现有的工作表明,vanilla randomized smoothing 通常不提供良好的 Robustness 性能,并且常需要 (re)training 技术来提高基础分类器的 Robustness。在这个工作中,我们提出了两种cost-effective的方法来提高 randomized smoothing 的 Robustness,同时保持其 clean 性能。第一种方法是 AdvMacer,它combines adversarial training 和 Robustness 证明最大化 для randomized smoothing。我们示出,AdvMacer 可以在 SOTA 基eline 下提高 randomized smoothing 分类器的 Robustness 性能,而且训练速度比 MACER 基eline 快三倍。第二种方法是 EsbRS,它是一种post-processing 方法,可以大幅提高基于模型集的 Robustness 证明。我们探索了不同的模型集方面,并提出了一种新的设计方法,以进一步提高模型集的 Robustness。我们的理论分析表明,这种设计方法可以减少模型集的复杂性,同时保持其 Robustness。

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

  • paper_url: http://arxiv.org/abs/2310.07765
  • repo_url: None
  • paper_authors: Hannah Day, Yonatan Kahn, Daniel A. Roberts
  • for: 这篇论文是为了研究深度 neural network 的 initialization 方法,以及其对 training 的影响。
  • methods: 该论文使用了 rectangular network 和 tanh activation function,并使用了 orthogonal matrix 的 ensemble 初始化方法。
  • results: 论文表明,使用这种 initialization 方法可以避免深度 neural network 的 signal 干扰增长,并且可以提高网络的泛化能力和训练速度。
    Abstract Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.
    摘要 全连接深度神经网络的权重初始化为独立的高斯分布可以调整到极点,从而防止信号在网络中 exponential 增长或减少。然而,这些网络仍然会表现出线性增长,与网络的宽度相同的深度成比例。我们表述了一种方法,使用矩阵的ensemble initialization,可以使得前activation 干扰的变化独立于深度。此外,我们还证明了在初始化时,NTK和其后代的 correlators 在对width 进行倒数据阶段会到达一个深度约为20,而不是无限制增长,如果使用高斯初始化。我们推测这种结构可以保留宽度与深度相对的特征学习,同时减少总的噪音,从而提高泛化和训练速度。我们通过对NTK的实际测量与深非linear orthogonal网络在MNIST和CIFAR-10分类任务上的性能进行比较,提供了一些实际证明。

Self-supervised Representation Learning From Random Data Projectors

  • paper_url: http://arxiv.org/abs/2310.07756
  • repo_url: https://github.com/layer6ai-labs/lfr
  • paper_authors: Yi Sui, Tongzi Wu, Jesse C. Cresswell, Ga Wu, George Stein, Xiao Shi Huang, Xiaochen Zhang, Maksims Volkovs
  • for: 本研究旨在提出一种不需要人工数据增强的自我超级vised representation learning方法,可以应用于多种数据类型和网络架构。
  • methods: 该方法基于重建随机数据 проекции来学习高质量数据表示。
  • results: 对多种表示学习任务进行了广泛的评估,并与多个状态时的SSRL基elines进行比较,得到了更高的表示学习性能。
    Abstract Self-supervised representation learning~(SSRL) has advanced considerably by exploiting the transformation invariance assumption under artificially designed data augmentations. While augmentation-based SSRL algorithms push the boundaries of performance in computer vision and natural language processing, they are often not directly applicable to other data modalities, and can conflict with application-specific data augmentation constraints. This paper presents an SSRL approach that can be applied to any data modality and network architecture because it does not rely on augmentations or masking. Specifically, we show that high-quality data representations can be learned by reconstructing random data projections. We evaluate the proposed approach on a wide range of representation learning tasks that span diverse modalities and real-world applications. We show that it outperforms multiple state-of-the-art SSRL baselines. Due to its wide applicability and strong empirical results, we argue that learning from randomness is a fruitful research direction worthy of attention and further study.
    摘要 自我监督学习(SSRL)已经得到了很大的进步,通过人工设计的数据增强技术来利用转换不变性假设。而这些增强技术基于的SSRL算法在计算机视觉和自然语言处理领域的性能已经推pushed the boundaries,但它们通常不直接适用于其他数据类型,并且可能与应用特定的数据增强约束矛盾。这篇论文提出了不需要增强或掩蔽的SSRL方法,可以应用于任何数据类型和网络架构。具体来说,我们表明可以通过重建随机数据投影来学习高质量的数据表示。我们对各种表示学习任务进行了广泛的评估,这些任务覆盖了多种Modalities和真实应用。我们发现该方法可以超越多个状态 искусственныйSSRL基准。由于其广泛适用和强实验结果,我们认为学习 randomness 是一个有前途的研究方向,值得关注和进一步研究。

Stabilizing Estimates of Shapley Values with Control Variates

  • paper_url: http://arxiv.org/abs/2310.07672
  • repo_url: None
  • paper_authors: Jeremy Goldwasser, Giles Hooker
  • for: 用于解释黑盒机器学习模型的预测结果
  • methods: 使用Monte Carlo技术和控制变量来稳定模型解释
  • results: 在高维数据集上可以带来很大减少Monte Carlo协变性的Shapley估计Here’s the breakdown of each point:1. 用于解释黑盒机器学习模型的预测结果 (for): The paper is written to explain the predictions of blackbox machine learning models.2. 使用Monte Carlo技术和控制变量来稳定模型解释 (methods): The paper proposes using the Monte Carlo technique of control variates to stabilize the model explanations.3. 在高维数据集上可以带来很大减少Monte Carlo协变性的Shapley估计 (results): The paper finds that the proposed method can produce dramatic reductions in the Monte Carlo variability of Shapley estimates on high-dimensional datasets.
    Abstract Shapley values are among the most popular tools for explaining predictions of blackbox machine learning models. However, their high computational cost motivates the use of sampling approximations, inducing a considerable degree of uncertainty. To stabilize these model explanations, we propose ControlSHAP, an approach based on the Monte Carlo technique of control variates. Our methodology is applicable to any machine learning model and requires virtually no extra computation or modeling effort. On several high-dimensional datasets, we find it can produce dramatic reductions in the Monte Carlo variability of Shapley estimates.
    摘要 <>将文本翻译为简化字符的中文。<>采用Shapley值是黑obox机器学习模型预测的解释工具中最受欢迎的。然而,它们的计算成本较高,使得使用抽象近似法得到的解释带有一定的不确定性。为了稳定这些模型解释,我们提议使用控制SHAP,基于蒙тек控制变量的方法。我们的方法适用于任何机器学习模型,需要virtually no extra computation或模型定制努力。在一些高维数据集上,我们发现它可以生成显著减少Monte Carlo变化的Shapley估计。Note: "virtually no extra computation" is a bit tricky to translate, as "extra computation" is not a direct translation of "extra computation" in Chinese. I have translated it as "需要virtually no extra computation", where "virtually" is used to convey the idea that there is no significant additional computation required.

The First Pathloss Radio Map Prediction Challenge

  • paper_url: http://arxiv.org/abs/2310.07658
  • repo_url: None
  • paper_authors: Çağkan Yapar, Fabian Jaensch, Ron Levie, Gitta Kutyniok, Giuseppe Caire
  • for: 提出了一个pathloss radio map prediction挑战,以便促进研究和对最新的方法进行公平的比较。
  • methods: 使用提供的数据集和挑战任务进行预测。
  • results: 在挑战中,得到了一些结果。
    Abstract To foster research and facilitate fair comparisons among recently proposed pathloss radio map prediction methods, we have launched the ICASSP 2023 First Pathloss Radio Map Prediction Challenge. In this short overview paper, we briefly describe the pathloss prediction problem, the provided datasets, the challenge task and the challenge evaluation methodology. Finally, we present the results of the challenge.
    摘要 <>转换文本为简化中文。<>为促进研究和促进最近提出的路径损失Radio map预测方法的公正比较,我们于ICASSP 2023年首次Radio map预测挑战。在这篇简短概述中,我们 briefly describe the pathloss prediction problem, the provided datasets, the challenge task and the challenge evaluation methodology. Finally, we present the results of the challenge。

Hypercomplex Multimodal Emotion Recognition from EEG and Peripheral Physiological Signals

  • paper_url: http://arxiv.org/abs/2310.07648
  • repo_url: None
  • paper_authors: Eleonora Lopez, Eleonora Chiarantano, Eleonora Grassucci, Danilo Comminiello
  • for: 本研究旨在提出一种基于嵌入空间的多modal感情识别方法,以提高现有方法的效果。
  • methods: 本方法使用了一种新的融合模块,该模块通过在嵌入空间进行参数化的高复杂多元 multiplication 来实现更加有效的融合步骤。
  • results: 在使用 MAHNOB-HCI 数据集进行分类测试中,本方法的性能超过了现有的多modal状态gartoon network,并且可以更好地识别抑或情绪的强度。
    Abstract Multimodal emotion recognition from physiological signals is receiving an increasing amount of attention due to the impossibility to control them at will unlike behavioral reactions, thus providing more reliable information. Existing deep learning-based methods still rely on extracted handcrafted features, not taking full advantage of the learning ability of neural networks, and often adopt a single-modality approach, while human emotions are inherently expressed in a multimodal way. In this paper, we propose a hypercomplex multimodal network equipped with a novel fusion module comprising parameterized hypercomplex multiplications. Indeed, by operating in a hypercomplex domain the operations follow algebraic rules which allow to model latent relations among learned feature dimensions for a more effective fusion step. We perform classification of valence and arousal from electroencephalogram (EEG) and peripheral physiological signals, employing the publicly available database MAHNOB-HCI surpassing a multimodal state-of-the-art network. The code of our work is freely available at https://github.com/ispamm/MHyEEG.
    摘要 多Modal Emotion recognition from physiological signals 已经收到了越来越多的关注,因为不可控制的physiological signals不同于行为反应,可以提供更可靠的信息。现有的深度学习基本方法仍然基于提取的手动特征,没有完全利用神经网络的学习能力,而且常采用单模态方法,而人类情感表达是自然多模态的。在这篇论文中,我们提出了一个嵌入式多模态网络,具有一个新的融合模块,其中包括参数化的超复杂 multiply 运算。在超复杂domain中操作,操作按照代数规则进行,可以模型学习 feature 维度之间的潜在关系,从而实现更有效的融合步骤。我们使用MAHNOB-HCI数据集进行了EEG和周边生物信号的分类,并超越了现有的多模态状态码网络。我们的代码可以免费在https://github.com/ispamm/MHyEEG中下载。

Deep Reinforcement Learning for Autonomous Cyber Operations: A Survey

  • paper_url: http://arxiv.org/abs/2310.07745
  • repo_url: None
  • paper_authors: Gregory Palmer, Chris Parry, Daniel J. B. Harrold, Chris Willis
  • for: 这篇论文是为了探讨深度强化学习(DRL)在自动化网络操作(ACO)中的应用和挑战而写的。
  • methods: 论文使用了DRL方法,并对其应用于ACO问题进行了批判和评估。
  • results: 论文发现了DRL在ACO问题中的挑战,包括高维状态空间、大多个数动作空间和对抗学习等问题,并提出了一些可能的解决方案。
    Abstract The rapid increase in the number of cyber-attacks in recent years raises the need for principled methods for defending networks against malicious actors. Deep reinforcement learning (DRL) has emerged as a promising approach for mitigating these attacks. However, while DRL has shown much potential for cyber-defence, numerous challenges must be overcome before DRL can be applied to autonomous cyber-operations (ACO) at scale. Principled methods are required for environments that confront learners with very high-dimensional state spaces, large multi-discrete action spaces, and adversarial learning. Recent works have reported success in solving these problems individually. There have also been impressive engineering efforts towards solving all three for real-time strategy games. However, applying DRL to the full ACO problem remains an open challenge. Here, we survey the relevant DRL literature and conceptualize an idealised ACO-DRL agent. We provide: i.) A summary of the domain properties that define the ACO problem; ii.) A comprehensive evaluation of the extent to which domains used for benchmarking DRL approaches are comparable to ACO; iii.) An overview of state-of-the-art approaches for scaling DRL to domains that confront learners with the curse of dimensionality, and; iv.) A survey and critique of current methods for limiting the exploitability of agents within adversarial settings from the perspective of ACO. We conclude with open research questions that we hope will motivate future directions for researchers and practitioners working on ACO.
    摘要 随着最近几年的网络攻击数量快速增加,需要有原则的方法来防御网络免受黑客的攻击。深度强化学习(DRL)已经出现为防御攻击的有力的方法。然而,虽然DRL在网络防御方面表现出了很多潜力,但是在自动化网络操作(ACO)中应用DRL仍然是一个开放的挑战。在面临高维状态空间、大多个粒度动作空间和反对学习的环境中,原则的方法是必要的。过去的研究已经解决了这些问题,并且有很多有优的工程努力在解决这些问题上。然而,将DRL应用到整个ACO问题仍然是一个开放的挑战。在这篇文章中,我们对DRL的相关文献进行了抽象,并提出了一个理想化的ACO-DRL代理。我们提供了以下内容:1. ACO问题的域属性的总结,包括域的特点和挑战。2. DRL在不同域上的比较,以及这些域是否与ACO相似的分析。3. 对于面临维度瓶颈的域,State-of-the-art的扩展DRL方法的概述。4. 对于反对学习Setting下的DRL方法的评价和批判。我们结束时,提出了未来研究方向的开放问题,希望能够鼓励未来的研究人员和实践者在ACO领域进行更多的研究。

Graph Transformer Network for Flood Forecasting with Heterogeneous Covariates

  • paper_url: http://arxiv.org/abs/2310.07631
  • repo_url: https://github.com/JimengShi/FloodGTN_Prediction
  • paper_authors: Jimeng Shi, Vitalii Stebliankin, Zhaonan Wang, Shaowen Wang, Giri Narasimhan
  • for: 预测洪水,帮助管理洪水风险
  • methods: 使用图像变换网络(Graph Transformer Network,简称FloodGTN),通过图гра内部网络(Graph Neural Networks,GNNs)和LSTM学习水位的空间Temporal依赖关系,并利用Transformer学习对外部参数(如降雨、潮汐、水利设施等)的注意力。
  • results: 对南佛瑞达水资源管理区的数据进行应用,实验结果表明,FloodGTN比HEC-RAS模型高度精准,提高了70%,并在运行时间上减少了至少500倍。
    Abstract Floods can be very destructive causing heavy damage to life, property, and livelihoods. Global climate change and the consequent sea-level rise have increased the occurrence of extreme weather events, resulting in elevated and frequent flood risk. Therefore, accurate and timely flood forecasting in coastal river systems is critical to facilitate good flood management. However, the computational tools currently used are either slow or inaccurate. In this paper, we propose a Flood prediction tool using Graph Transformer Network (FloodGTN) for river systems. More specifically, FloodGTN learns the spatio-temporal dependencies of water levels at different monitoring stations using Graph Neural Networks (GNNs) and an LSTM. It is currently implemented to consider external covariates such as rainfall, tide, and the settings of hydraulic structures (e.g., outflows of dams, gates, pumps, etc.) along the river. We use a Transformer to learn the attention given to external covariates in computing water levels. We apply the FloodGTN tool to data from the South Florida Water Management District, which manages a coastal area prone to frequent storms and hurricanes. Experimental results show that FloodGTN outperforms the physics-based model (HEC-RAS) by achieving higher accuracy with 70% improvement while speeding up run times by at least 500x.
    摘要 洪水可以很破坏生命、财产和生活方式。全球气候变化和相应的海平面上升已导致极端天气事件的增加,从而增加了洪水风险的频繁性。因此,在海 coastal 河流系统中准确并快速的洪水预测是非常重要的。但是,现有的计算工具都是慢或不准确的。在这篇论文中,我们提出了一种基于图 transformer 网络 (FloodGTN) 的洪水预测工具,用于河流系统。更specifically,FloodGTN 使用图神经网络 (GNNs) 和 LSTM 学习水位在不同监测站之间的空间时间相互关系。它还可以考虑外部covariates,如降雨、潮汐和水利设施(如水库、水闸、泵等)的设置。我们使用 transformer 来学习对外部covariates的注意力。我们在南佛瑞达水资源管理区的数据上应用了 FloodGTN 工具,该区域是频繁遭受风暴和飓风的海岸区。实验结果表明,FloodGTN 在比较HEC-RAS物理模型时,提高了准确率70%,而且提高了运行速度至少500倍。

Differentiable Euler Characteristic Transforms for Shape Classification

  • paper_url: http://arxiv.org/abs/2310.07630
  • repo_url: https://github.com/aidos-lab/dect
  • paper_authors: Ernst Roell, Bastian Rieck
  • for: 本文旨在开发一种能够在终端到终端学习ECT的新计算层,以提高ECT在图像和点云分类任务中的性能。
  • methods: 本文使用了一种新的计算层DECT,它可以在终端到终端学习ECT,并且具有快速和计算效率的优点。
  • results: 本文的DECT方法在图像和点云分类任务中的性能与更复杂的模型相当,而且还证明了ECT仍然具有同等的 topological 表达能力。
    Abstract The Euler Characteristic Transform (ECT) has proven to be a powerful representation, combining geometrical and topological characteristics of shapes and graphs. However, the ECT was hitherto unable to learn task-specific representations. We overcome this issue and develop a novel computational layer that enables learning the ECT in an end-to-end fashion. Our method DECT is fast and computationally efficient, while exhibiting performance on a par with more complex models in both graph and point cloud classification tasks. Moreover, we show that this seemingly unexpressive statistic still provides the same topological expressivity as more complex topological deep learning layers provide.
    摘要 《EulerCharacteristicTransform》(ECT)已经显示出了一种强大的表示方式,将几何和拓扑特征结合在一起。然而,ECT之前无法学习任务特定的表示。我们解决了这个问题,并开发了一种新的计算层,使ECT可以在端到端的方式学习。我们的方法DECT快速和计算效率高,在图像和点云分类任务中表现与更复杂的模型相当。此外,我们还证明ECT仍然提供了与更复杂的拓扑深度学习层相同的拓扑表达能力。

Unsupervised Learning of Sea Surface Height Interpolation from Multi-variate Simulated Satellite Observations

  • paper_url: http://arxiv.org/abs/2310.07626
  • repo_url: None
  • paper_authors: Theo Archambault, Arthur Filoche, Anastase Charantonis, Dominique Bereziat, Sylvie Thiria
    for:这个论文主要是为了研究使用卫星测距数据推算海面高程的方法。methods:论文使用了深度学习网络,利用海面温度数据来提高推算海面高程的精度。results:论文发现,使用深度学习网络可以在不具备训练数据的情况下,使用海面温度数据来提高推算海面高程的精度,并且可以减少41%的根据差值。
    Abstract Satellite-based remote sensing missions have revolutionized our understanding of the Ocean state and dynamics. Among them, spaceborne altimetry provides valuable measurements of Sea Surface Height (SSH), which is used to estimate surface geostrophic currents. However, due to the sensor technology employed, important gaps occur in SSH observations. Complete SSH maps are produced by the altimetry community using linear Optimal Interpolations (OI) such as the widely-used Data Unification and Altimeter Combination System (DUACS). However, OI is known for producing overly smooth fields and thus misses some mesostructures and eddies. On the other hand, Sea Surface Temperature (SST) products have much higher data coverage and SST is physically linked to geostrophic currents through advection. We design a realistic twin experiment to emulate the satellite observations of SSH and SST to evaluate interpolation methods. We introduce a deep learning network able to use SST information, and a trainable in two settings: one where we have no access to ground truth during training and one where it is accessible. Our investigation involves a comparative analysis of the aforementioned network when trained using either supervised or unsupervised loss functions. We assess the quality of SSH reconstructions and further evaluate the network's performance in terms of eddy detection and physical properties. We find that it is possible, even in an unsupervised setting to use SST to improve reconstruction performance compared to SST-agnostic interpolations. We compare our reconstructions to DUACS's and report a decrease of 41\% in terms of root mean squared error.
    摘要 卫星远感任务已经革命化了我们对海洋状态和动力学的理解。其中,空间遥感技术提供了海面高程(SSH)的重要测量,用于估算表面地OSTROPIC currents。然而,由于感知技术的限制, SSH 观测存在重要的缺失。complete SSH 地图是通过附近优化技术(OI)生成,如广泛使用的数据统一和探针组合系统(DUACS)。然而,OI 知道生成过于平滑的场景,因此会错过一些中规模的涡动和涝层。一方面,海面温度(SST)产品具有更高的数据覆盖率,SST 与地OSTROPIC currents 是物理相关的。我们设计了一个现实的双子实验,用于模拟卫星观测的 SSH 和 SST。我们引入了一个深度学习网络,可以使用 SST 信息,并在两种设置下训练:一种没有训练数据,一种可以使用训练数据。我们的调查包括比较这些网络在不同的训练设置下的性能,并评估其在涝层检测和物理性能方面的表现。我们发现,即使在无supervision的设置下,也可以使用 SST 提高重建性能,相比于 SST 无关的 interpolations。我们对我们的重建与 DUACS 进行比较,发现root mean squared error 下降41%。

Prospective Side Information for Latent MDPs

  • paper_url: http://arxiv.org/abs/2310.07596
  • repo_url: None
  • paper_authors: Jeongyeol Kwon, Yonathan Efroni, Shie Mannor, Constantine Caramanis
  • for: 本研究目的是研究Latent Markov Decision Processes(LMDP)中的强化学习问题,具体来说是在有 prospectivе side information 的情况下。
  • methods: 本研究使用了Markov decision process(MDP)和partially observed Markov decision process(POMDP)的概念,以及近似搜索和强化学习算法。
  • results: 本研究发现,在LMDP中,具有prospectivе side information的情况下,任何高效样本算法都会遭受 $\Omega(K^{2/3})$ 的违和,而不是标准的 $\Omega(\sqrt{K})$ 下界。此外,本研究还设计了一种具有匹配的Upper bound的算法。
    Abstract In many interactive decision-making settings, there is latent and unobserved information that remains fixed. Consider, for example, a dialogue system, where complete information about a user, such as the user's preferences, is not given. In such an environment, the latent information remains fixed throughout each episode, since the identity of the user does not change during an interaction. This type of environment can be modeled as a Latent Markov Decision Process (LMDP), a special instance of Partially Observed Markov Decision Processes (POMDPs). Previous work established exponential lower bounds in the number of latent contexts for the LMDP class. This puts forward a question: under which natural assumptions a near-optimal policy of an LMDP can be efficiently learned? In this work, we study the class of LMDPs with {\em prospective side information}, when an agent receives additional, weakly revealing, information on the latent context at the beginning of each episode. We show that, surprisingly, this problem is not captured by contemporary settings and algorithms designed for partially observed environments. We then establish that any sample efficient algorithm must suffer at least $\Omega(K^{2/3})$-regret, as opposed to standard $\Omega(\sqrt{K})$ lower bounds, and design an algorithm with a matching upper bound.
    摘要 Many interactive decision-making settings have latent and unobserved information that remains fixed. For example, in a dialogue system, the user's preferences may not be fully known. In such an environment, the latent information remains fixed throughout each episode because the user's identity does not change during an interaction. This type of environment can be modeled as a Latent Markov Decision Process (LMDP), a special instance of Partially Observed Markov Decision Processes (POMDPs). Previous work established exponential lower bounds in the number of latent contexts for the LMDP class. This raises a question: under what natural assumptions can a near-optimal policy of an LMDP be efficiently learned?In this work, we study the class of LMDPs with "prospective side information," where an agent receives additional, weakly revealing information on the latent context at the beginning of each episode. We find that this problem is not captured by contemporary settings and algorithms designed for partially observed environments. We then establish that any sample-efficient algorithm must suffer at least $\Omega(K^{2/3})$ regret, rather than the standard $\Omega(\sqrt{K})$ lower bounds, and design an algorithm with a matching upper bound.

Transformers for Green Semantic Communication: Less Energy, More Semantics

  • paper_url: http://arxiv.org/abs/2310.07592
  • repo_url: None
  • paper_authors: Shubhabrata Mukherjee, Cory Beard, Sejun Song
  • for: 本研究旨在提高含义传输的效率和可靠性,而不是强调个别符号或位元。
  • methods: 该研究提出了一种新的多目标损失函数 named “Energy-Optimized Semantic Loss” (EOSL),用于平衡含义损失和能耗。
  • results: 经过对transformer模型的测试,包括CPU和GPU能耗测试,显示EOSL可以在推理阶段提高含义相似性表现的44%,同时节省90%的能耗。
    Abstract Semantic communication aims to transmit meaningful and effective information rather than focusing on individual symbols or bits, resulting in benefits like reduced latency, bandwidth usage, and higher throughput compared to traditional communication. However, semantic communication poses significant challenges due to the need for universal metrics for benchmarking the joint effects of semantic information loss and practical energy consumption. This research presents a novel multi-objective loss function named "Energy-Optimized Semantic Loss" (EOSL), addressing the challenge of balancing semantic information loss and energy consumption. Through comprehensive experiments on transformer models, including CPU and GPU energy usage, it is demonstrated that EOSL-based encoder model selection can save up to 90\% of energy while achieving a 44\% improvement in semantic similarity performance during inference in this experiment. This work paves the way for energy-efficient neural network selection and the development of greener semantic communication architectures.
    摘要 This research proposes a novel multi-objective loss function called "Energy-Optimized Semantic Loss" (EOSL) to address the challenge of balancing semantic information loss and energy consumption. Through comprehensive experiments on transformer models, including CPU and GPU energy usage, it is demonstrated that EOSL-based encoder model selection can save up to 90% of energy while achieving a 44% improvement in semantic similarity performance during inference. This work paves the way for energy-efficient neural network selection and the development of greener semantic communication architectures.Here is the text in Simplified Chinese:semantic communication aimsto transmit meaningful and effective information instead of focusing on individual symbols or bits, which can result in benefits such as reduced latency, bandwidth usage, and higher throughput compared to traditional communication. However, semantic communication also poses significant challenges, such as the need for universal metrics for benchmarking the joint effects of semantic information loss and practical energy consumption.this research proposes a novel multi-objective loss function called "Energy-Optimized Semantic Loss" (EOSL) to address the challenge of balancing semantic information loss and energy consumption. Through comprehensive experiments on transformer models, including CPU and GPU energy usage, it is demonstrated that EOSL-based encoder model selection can save up to 90% of energy while achieving a 44% improvement in semantic similarity performance during inference. This work paves the way for energy-efficient neural network selection and the development of greener semantic communication architectures.

Analyzing Trendy Twitter Hashtags in the 2022 French Election

  • paper_url: http://arxiv.org/abs/2310.07576
  • repo_url: None
  • paper_authors: Aamir Mandviwalla, Lake Yin, Boleslaw K. Szymanski
  • for: 预测社交媒体用户未来活动的模型需要丰富的特征来进行准确预测。许多先进的模型可以生成这些特征,但是它们在庞大数据集上运行时的计算时间常常是禁制的。一些研究表明,简单的含义网络特征可以rich enough使用于机器学习任务。我们提议使用含义网络作为用户级别特征。
  • methods: 我们使用了一个含义网络,其中有1037个Twitter标签从一个包含370万个推文的2022年法国总统选举相关的词语集中提取出来。我们将标签作为节点,用户之间的互动关系作为 Edge,并将这些关系加权。然后,我们将这个图转换成最大拓扑树,并将最受欢迎的标签作为根节点来构建一个层次结构。最后,我们为每个用户提供一个基于这个树的向量特征。
  • results: 我们使用了这个semantic特征进行回归预测每个用户的六种情感响应(愤怒、享受、失望等)。我们发现大多数情感的$R^2$值大于0.5,这表明我们的semantic特征具有预测社交媒体用户回归的能力。这些结果表明我们的semantic特征可以考虑在进一步的大数据集上进行预测。
    Abstract Regressions trained to predict the future activity of social media users need rich features for accurate predictions. Many advanced models exist to generate such features; however, the time complexities of their computations are often prohibitive when they run on enormous data-sets. Some studies have shown that simple semantic network features can be rich enough to use for regressions without requiring complex computations. We propose a method for using semantic networks as user-level features for machine learning tasks. We conducted an experiment using a semantic network of 1037 Twitter hashtags from a corpus of 3.7 million tweets related to the 2022 French presidential election. A bipartite graph is formed where hashtags are nodes and weighted edges connect the hashtags reflecting the number of Twitter users that interacted with both hashtags. The graph is then transformed into a maximum-spanning tree with the most popular hashtag as its root node to construct a hierarchy amongst the hashtags. We then provide a vector feature for each user based on this tree. To validate the usefulness of our semantic feature we performed a regression experiment to predict the response rate of each user with six emotions like anger, enjoyment, or disgust. Our semantic feature performs well with the regression with most emotions having $R^2$ above 0.5. These results suggest that our semantic feature could be considered for use in further experiments predicting social media response on big data-sets.
    摘要 <>将文本翻译成简化中文。<>预测社交媒体用户未来活动的回归模型需要丰富的特征来实现准确预测。许多高级模型可以生成这些特征,但是它们在庞大数据集上进行计算时间复杂度经常是禁止的。一些研究表明,使用 semantic network 的简单 semantic 特征可以避免复杂的计算。我们提出一种使用 semantic network 作为用户级别特征的方法。我们在一个包含 3.7 万条提子的推特帖子中选择了 1037 个标签,并将这些标签组织成一个带有权重边的对角raph。然后将这个对角raph变换成一个最大拓扑树,其中最受欢迎的标签作为根节点,以建立一个层次结构。我们然后为每个用户提供一个基于这棵树的 вектор特征。为验证我们的semantic特征的有用性,我们进行了一个回归实验,用于预测每个用户的六种情感响应(愤怒、愉悦、厌恶等)。我们的semantic特征在这些情感回归中表现良好,大多数情感有 $R^2$ 超过 0.5。这些结果表明,我们的semantic特征可以考虑在大数据集上进行进一步的实验。

Smootheness-Adaptive Dynamic Pricing with Nonparametric Demand Learning

  • paper_url: http://arxiv.org/abs/2310.07558
  • repo_url: None
  • paper_authors: Zeqi Ye, Hansheng Jiang
  • for: 研究非参数化需求函数的动态价格问题,并聚焦于适应未知Holder平滑参数$\beta$的需求函数。
  • methods: 提出了一种自相似性条件,以启用适应性。并开发了一种可以在不知道$\beta$情况下实现最佳动态价格算法,并证明了这种算法可以达到最佳 regretBound。
  • results: 证明了无法适应性的动态价格问题,并提出了一种基于自相似性条件的适应性动态价格算法,该算法可以在不知道$\beta$情况下实现最佳 regretBound。
    Abstract We study the dynamic pricing problem where the demand function is nonparametric and H\"older smooth, and we focus on adaptivity to the unknown H\"older smoothness parameter $\beta$ of the demand function. Traditionally the optimal dynamic pricing algorithm heavily relies on the knowledge of $\beta$ to achieve a minimax optimal regret of $\widetilde{O}(T^{\frac{\beta+1}{2\beta+1})$. However, we highlight the challenge of adaptivity in this dynamic pricing problem by proving that no pricing policy can adaptively achieve this minimax optimal regret without knowledge of $\beta$. Motivated by the impossibility result, we propose a self-similarity condition to enable adaptivity. Importantly, we show that the self-similarity condition does not compromise the problem's inherent complexity since it preserves the regret lower bound $\Omega(T^{\frac{\beta+1}{2\beta+1})$. Furthermore, we develop a smoothness-adaptive dynamic pricing algorithm and theoretically prove that the algorithm achieves this minimax optimal regret bound without the prior knowledge $\beta$.
    摘要 我们研究动态价格问题,其中需求函数是非parametric且哈lder平滑的。我们专注于适应未知哈lder平滑度 Parameter $\beta$ 的需求函数。传统上最佳的动态价格算法严重依赖 $\beta$ 的知识,以 дости得最佳的 regret Bound $\widetilde{O}(T^{\frac{\beta+1}{2\beta+1})$。但我们点出了适应性的挑战,并证明了无法适应地achivr 此最佳 regret bound 的价格策略。我们提出了自similarity 条件,以启动适应性。我们证明了这个条件不会增加问题的内在复杂性,因为它保持了 regret 下界 $\Omega(T^{\frac{\beta+1}{2\beta+1})$。此外,我们开发了一个具有哈lder平滑性的动态价格算法,并证明了这个算法可以 дости得最佳的 regret bound 无需 $\beta$ 的专门知识。

Provable Advantage of Parameterized Quantum Circuit in Function Approximation

  • paper_url: http://arxiv.org/abs/2310.07528
  • repo_url: None
  • paper_authors: Zhan Yu, Qiuhao Chen, Yuling Jiao, Yinan Li, Xiliang Lu, Xin Wang, Jerry Zhijian Yang
  • for: 这个论文的目的是分析parameterized quantum circuits(PQCs)在机器学习任务中的表达能力。
  • methods: 这篇论文使用函数近似的角度来分析PQCs的表达能力,并提供了可重构的PQCs的建构方法,以及在各种函数上进行近似的技术。
  • results: 这篇论文提供了关于PQCs的表达能力的Explicit Construction,并提供了关于PQCs的近似误差的量化界限,其中误差的大小与PQCs的宽度、深度和可调参数的数量有关。此外,论文还比较了提案的PQCs和深度学习网络在高维平滑函数的近似中的性能,并发现PQCs的模型大小与深度学习网络的模型大小之间存在指数关系。这 suggets a potentially novel avenue for showcasing quantum advantages in quantum machine learning.
    Abstract Understanding the power of parameterized quantum circuits (PQCs) in accomplishing machine learning tasks is one of the most important questions in quantum machine learning. In this paper, we analyze the expressivity of PQCs through the lens of function approximation. Previously established universal approximation theorems for PQCs are mainly nonconstructive, leading us to the following question: How large do the PQCs need to be to approximate the target function up to a given error? We exhibit explicit constructions of data re-uploading PQCs for approximating continuous and smooth functions and establish quantitative approximation error bounds in terms of the width, the depth and the number of trainable parameters of the PQCs. To achieve this, we utilize techniques from quantum signal processing and linear combinations of unitaries to construct PQCs that implement multivariate polynomials. We implement global and local approximation techniques using Bernstein polynomials and local Taylor expansion and analyze their performances in the quantum setting. We also compare our proposed PQCs to nearly optimal deep neural networks in approximating high-dimensional smooth functions, showing that the ratio between model sizes of PQC and deep neural networks is exponentially small with respect to the input dimension. This suggests a potentially novel avenue for showcasing quantum advantages in quantum machine learning.
    摘要

Exploiting Causal Graph Priors with Posterior Sampling for Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.07518
  • repo_url: None
  • paper_authors: Mirco Mutti, Riccardo De Santi, Marcello Restelli, Alexander Marx, Giorgia Ramponi
  • for: 提高强化学习的采样效率,使用 posterior sampling 技术,并利用先验知识来改善采样效率。
  • methods: 提出一种新的 posterior sampling 方法,使用 causal graph 来表示先验知识,并在这个 graph 上进行 Bayesian 推理。
  • results: 在 illustrate 的领域中,经过数值评估,提出的 C-PSRL 方法可以强化 posterior sampling 的效率,并且与完整的 causal graph 相比,其效果几乎相同。
    Abstract Posterior sampling allows the exploitation of prior knowledge of the environment's transition dynamics to improve the sample efficiency of reinforcement learning. The prior is typically specified as a class of parametric distributions, a task that can be cumbersome in practice, often resulting in the choice of uninformative priors. In this work, we propose a novel posterior sampling approach in which the prior is given as a (partial) causal graph over the environment's variables. The latter is often more natural to design, such as listing known causal dependencies between biometric features in a medical treatment study. Specifically, we propose a hierarchical Bayesian procedure, called C-PSRL, simultaneously learning the full causal graph at the higher level and the parameters of the resulting factored dynamics at the lower level. For this procedure, we provide an analysis of its Bayesian regret, which explicitly connects the regret rate with the degree of prior knowledge. Our numerical evaluation conducted in illustrative domains confirms that C-PSRL strongly improves the efficiency of posterior sampling with an uninformative prior while performing close to posterior sampling with the full causal graph.
    摘要 <>使用 posterior sampling 可以利用环境转移动力学的先前知识来提高强化学习的样本效率。先前通常是指定为一类 Parametric 分布,这可以在实践中是困难的,常导致选择不具有信息的先前。在这种工作中,我们提出了一种新的 posterior sampling 方法,在该方法中,先前是表示环境变量的 causal 图。这对于设计是更自然的,例如在医疗治疗研究中列出了知道的生物特征相互关系。我们提出了一种层次 Bayesian 过程,称为 C-PSRL,该过程同时学习全部 causal 图和其导致的结果的 factored 动力学参数。我们提供了 Bayesian regret 的分析,其直接连接了 regret 率与先前知识的度量。我们的数值评估在演示领域中表明,C-PSRL 可以大幅提高 posterior sampling 的效率,同时与完整的 causal 图相似。Note: " Simplified Chinese" is a romanization of the Chinese language, which is used to represent the language in the Latin alphabet. It is not a translation of the text into Chinese characters.

Model-based Clustering of Individuals’ Ecological Momentary Assessment Time-series Data for Improving Forecasting Performance

  • paper_url: http://arxiv.org/abs/2310.07491
  • repo_url: None
  • paper_authors: Mandani Ntekouli, Gerasimos Spanakis, Lourens Waldorp, Anne Roefs
  • for: 这个研究旨在使用时间序列数据进行ecological momentary assessment (EMA),并使用集成分析方法来描述个人情绪行为。
  • methods: 这个研究使用了两种模型基于的集成方法,一是使用个人化模型中提取的参数,另一是根据模型基于的预测性能进行优化。
  • results: 研究发现,使用性能为基准的集成方法得到了最好的结果,在所有评估指标上都超过了个人化、全部一起和随机分组的基准。
    Abstract Through Ecological Momentary Assessment (EMA) studies, a number of time-series data is collected across multiple individuals, continuously monitoring various items of emotional behavior. Such complex data is commonly analyzed in an individual level, using personalized models. However, it is believed that additional information of similar individuals is likely to enhance these models leading to better individuals' description. Thus, clustering is investigated with an aim to group together the most similar individuals, and subsequently use this information in group-based models in order to improve individuals' predictive performance. More specifically, two model-based clustering approaches are examined, where the first is using model-extracted parameters of personalized models, whereas the second is optimized on the model-based forecasting performance. Both methods are then analyzed using intrinsic clustering evaluation measures (e.g. Silhouette coefficients) as well as the performance of a downstream forecasting scheme, where each forecasting group-model is devoted to describe all individuals belonging to one cluster. Among these, clustering based on performance shows the best results, in terms of all examined evaluation measures. As another level of evaluation, those group-models' performance is compared to three baseline scenarios, the personalized, the all-in-one group and the random group-based concept. According to this comparison, the superiority of clustering-based methods is again confirmed, indicating that the utilization of group-based information could be effectively enhance the overall performance of all individuals' data.
    摘要

Nonlinear embeddings for conserving Hamiltonians and other quantities with Neural Galerkin schemes

  • paper_url: http://arxiv.org/abs/2310.07485
  • repo_url: None
  • paper_authors: Paul Schwerdtner, Philipp Schulze, Jules Berman, Benjamin Peherstorfer
  • for: 这个论文关注在解析方程的解场 Solution Fields 中的量保守问题,特别是使用深度网络来近似解场。
  • methods: 该方法基于Dirac–Frenkel变分原理,逐步在时间上训练非线性参数化。
  • results: 实验表明,该方法可以保持量的精度,并且可以与标准的显式和隐式时间推导方法结合使用。
    Abstract This work focuses on the conservation of quantities such as Hamiltonians, mass, and momentum when solution fields of partial differential equations are approximated with nonlinear parametrizations such as deep networks. The proposed approach builds on Neural Galerkin schemes that are based on the Dirac--Frenkel variational principle to train nonlinear parametrizations sequentially in time. We first show that only adding constraints that aim to conserve quantities in continuous time can be insufficient because the nonlinear dependence on the parameters implies that even quantities that are linear in the solution fields become nonlinear in the parameters and thus are challenging to discretize in time. Instead, we propose Neural Galerkin schemes that compute at each time step an explicit embedding onto the manifold of nonlinearly parametrized solution fields to guarantee conservation of quantities. The embeddings can be combined with standard explicit and implicit time integration schemes. Numerical experiments demonstrate that the proposed approach conserves quantities up to machine precision.
    摘要 我们的研究探讨了使用深度网络作为非线性参数化方法时,对于偏微分方程解场的保守量的问题。我们的方法基于Neural Galerkin方法,该方法基于Dirac--Frenkel变量原理来逐步在时间上训练非线性参数化。我们发现,只要添加保守量的约束并不足以保证量的保守,因为非线性参数的依赖关系使得解场中的量变为非线性函数,这使得在时间上的积分变得困难。因此,我们提议使用Neural Galerkin方法来在每个时间步骤中计算非线性参数化解场的明确嵌入,以保证量的保守。这些嵌入可以与标准的显式和隐式时间积分方法结合使用。我们的numerical experiments表明,我们的方法可以保证量的保守,并且和标准方法相比,可以提高计算效率。

Automatic Sensor-free Affect Detection: A Systematic Literature Review

  • paper_url: http://arxiv.org/abs/2310.13711
  • repo_url: None
  • paper_authors: Felipe de Morais, Diógines Goldoni, Tiago Kautzmann, Rodrigo da Silva, Patricia A. Jaques
  • for: This paper provides a comprehensive literature review on sensor-free affect detection in computer-based learning environments (CBLEs) to enhance learning outcomes.
  • methods: The paper reviews the most frequently identified affective states, methodologies and techniques employed for sensor development, defining attributes of CBLEs and data samples, and key research trends.
  • results: The paper highlights the consistent performance of the models and the application of advanced machine learning techniques, but notes that there is ample scope for future research, including enhancing model performance, collecting more samples of underrepresented emotions, and refining model development practices.Here is the information in Simplified Chinese text:
  • for: 这篇论文为了提高学习 outcome,提供了计算机基础学习环境(CBLEs)中无感器情感探测的全面文献评论。
  • methods: 论文评论了最常见的情感状态,感应器开发方法和技术,定义CBLE和数据样本的特征,以及主要的研究趋势。
  • results: 论文指出了模型的一致性表现和应用先进机器学习技术,但还有充足的发展空间,包括提高模型性能,收集更多的异常情感样本,并规范模型开发方法。
    Abstract Emotions and other affective states play a pivotal role in cognition and, consequently, the learning process. It is well-established that computer-based learning environments (CBLEs) that can detect and adapt to students' affective states can enhance learning outcomes. However, practical constraints often pose challenges to the deployment of sensor-based affect detection in CBLEs, particularly for large-scale or long-term applications. As a result, sensor-free affect detection, which exclusively relies on logs of students' interactions with CBLEs, emerges as a compelling alternative. This paper provides a comprehensive literature review on sensor-free affect detection. It delves into the most frequently identified affective states, the methodologies and techniques employed for sensor development, the defining attributes of CBLEs and data samples, as well as key research trends. Despite the field's evident maturity, demonstrated by the consistent performance of the models and the application of advanced machine learning techniques, there is ample scope for future research. Potential areas for further exploration include enhancing the performance of sensor-free detection models, amassing more samples of underrepresented emotions, and identifying additional emotions. There is also a need to refine model development practices and methods. This could involve comparing the accuracy of various data collection techniques, determining the optimal granularity of duration, establishing a shared database of action logs and emotion labels, and making the source code of these models publicly accessible. Future research should also prioritize the integration of models into CBLEs for real-time detection, the provision of meaningful interventions based on detected emotions, and a deeper understanding of the impact of emotions on learning.
    摘要 感情和其他情感状态在认知过程中扮演着关键性角色,因此在学习过程中也具有重要作用。已经证明了通过检测和适应学生情感状态的计算机基础学习环境(CBLE)可以提高学习成果。然而,实际应用中的偏见和限制常常使得感知基础的情感检测成为实现的瓶颈。因此,不需要感知设备的情感检测(sensor-free affect detection)成为了一种有前途的alternative。本文提供了关于情感检测的完整的文献综述,包括最常见的情感状态、情感检测器的开发方法和技术、CBLE和数据样本的特点以及主要的研究趋势。尽管领域的成熔度已经明显,表明模型的稳定性和高级机器学习技术的应用,但是还有很多可能的发展方向。未来研究应该强调检测模型的性能提升、更多的弱化情感样本收集、更多的情感状态检测以及模型开发方法的优化。此外,将模型集成到CBLE中进行实时检测,为检测到的情感状态提供有意义的 intervención,以及更深入地理解情感对学习的影响,也是未来研究的重要方向。

  • paper_url: http://arxiv.org/abs/2310.07464
  • repo_url: None
  • paper_authors: Zijie Fang, Yihan Liu, Yifeng Wang, Xiangyang Zhang, Yang Chen, Changjing Cai, Yiyang Lin, Ying Han, Zhi Wang, Shan Zeng, Hong Shen, Jun Tan, Yongbing Zhang
    for:多种低度 glioma (LGG) 诊断和治疗中不可或缺的一部分是生物标志物 detection。但现有的LGG生物标志物 detection方法依赖于成本高、复杂的分子遗传测试,需要专业人员分析结果,并且经常报告了内部repeatability。methods:我们提出了一种可读性深度学习管道,基于多例学习(MIL)框架的多生物标志物 Histomorphology Discoverer(Multi-Beholder)模型,可以使用染色和抹平板扫描图像来预测LGG中五个生物标志物的状态。特别是通过 incorporating一类分类into MIL framework,实现了准确的实例 Pseudo-labeling,以便使用板块级别标签进行实例级别监督,从而提高生物标志物预测性能。results:Multi-Beholder在两个组合(n=607)中显示出了出色的预测性能和普适性(AUROC=0.6469-0.9735)。此外,Multi-Beholder的极佳可读性使得可以发现生物标志物状态和 histomorphology 特征之间的量化和质量相关性。我们的管道不仅提供了一种新的生物标志物预测方法,推进了LGG患者的分子治疗的可采用性,而且可以促进分子功能和LGG进程的新机制发现。
    Abstract Biomarker detection is an indispensable part in the diagnosis and treatment of low-grade glioma (LGG). However, current LGG biomarker detection methods rely on expensive and complex molecular genetic testing, for which professionals are required to analyze the results, and intra-rater variability is often reported. To overcome these challenges, we propose an interpretable deep learning pipeline, a Multi-Biomarker Histomorphology Discoverer (Multi-Beholder) model based on the multiple instance learning (MIL) framework, to predict the status of five biomarkers in LGG using only hematoxylin and eosin-stained whole slide images and slide-level biomarker status labels. Specifically, by incorporating the one-class classification into the MIL framework, accurate instance pseudo-labeling is realized for instance-level supervision, which greatly complements the slide-level labels and improves the biomarker prediction performance. Multi-Beholder demonstrates superior prediction performance and generalizability for five LGG biomarkers (AUROC=0.6469-0.9735) in two cohorts (n=607) with diverse races and scanning protocols. Moreover, the excellent interpretability of Multi-Beholder allows for discovering the quantitative and qualitative correlations between biomarker status and histomorphology characteristics. Our pipeline not only provides a novel approach for biomarker prediction, enhancing the applicability of molecular treatments for LGG patients but also facilitates the discovery of new mechanisms in molecular functionality and LGG progression.
    摘要 生物标志物检测是低级 Glioma(LGG)诊断和治疗中不可或缺的一部分。然而,现有的LGG生物标志物检测方法依赖于昂贵和复杂的分子遗传学测试,需要专业人员分析结果,并且间误量往往被报告。为了解决这些挑战,我们提出了一个可解释的深度学习管道,即多种生物标志物检测器(Multi-Beholder)模型,基于多例学习(MIL)框架,用于预测LGG五个生物标志物的状态,只需使用染色的整幅干涂图像和板块级别生物标志物状态标签。特别是,通过将一类学习 integrate into MIL框架,实现了准确的实例 pseudo-标签,这种方法可以较好地补充板块级别标签,提高生物标志物预测性能。Multi-Beholder在两个 cohort(n=607)中表现出色,其AUROC值为0.6469-0.9735。此外,Multi-Beholder的优秀可解释性使得可以发现某些生物标志物状态和 Histomorphology 特征之间的数量和质量相关性。我们的管道不仅提供了一种新的生物标志物预测方法,扩展了LGG患者可应用的分子治疗,还可以探索新的分子功能和LGG进程中的新机制。

Uncovering ECG Changes during Healthy Aging using Explainable AI

  • paper_url: http://arxiv.org/abs/2310.07463
  • repo_url: https://github.com/ai4healthuol/ecg-aging
  • paper_authors: Gabriel Ott, Yannik Schaubelt, Juan Miguel Lopez Alcaraz, Wilhelm Haverkamp, Nils Strodthoff
  • for: 这项研究旨在提供更深刻的心脏年龄过程理解,以便更好地诊断心血管健康水平。
  • methods: 这项研究使用深度学习模型和树式分类模型分析了健康个体的ECG数据,并使用可解释AI技术确定ECG特征或原始信号特征是否能够分类不同年龄组。
  • results: 研究发现,年龄增长导致呼吸速率下降,并且发现高SDANN值能够区分年轻人和老年人。此外,深度学习模型表明,年龄增长导致P波类型的分布变化,这些发现可能提供传统特征分析方法之外的新的年龄相关ECG变化。
    Abstract Cardiovascular diseases remain the leading global cause of mortality. This necessitates a profound understanding of heart aging processes to diagnose constraints in cardiovascular fitness. Traditionally, most of such insights have been drawn from the analysis of electrocardiogram (ECG) feature changes of individuals as they age. However, these features, while informative, may potentially obscure underlying data relationships. In this paper, we employ a deep-learning model and a tree-based model to analyze ECG data from a robust dataset of healthy individuals across varying ages in both raw signals and ECG feature format. Explainable AI techniques are then used to identify ECG features or raw signal characteristics are most discriminative for distinguishing between age groups. Our analysis with tree-based classifiers reveal age-related declines in inferred breathing rates and identifies notably high SDANN values as indicative of elderly individuals, distinguishing them from younger adults. Furthermore, the deep-learning model underscores the pivotal role of the P-wave in age predictions across all age groups, suggesting potential changes in the distribution of different P-wave types with age. These findings shed new light on age-related ECG changes, offering insights that transcend traditional feature-based approaches.
    摘要 In this study, we employ a deep-learning model and a tree-based model to analyze ECG data from a large dataset of healthy individuals across different ages. We use explainable AI techniques to identify the most discriminative ECG features or raw signal characteristics for distinguishing between age groups.Our analysis with tree-based classifiers reveals age-related declines in inferred breathing rates and identifies high SDANN values as indicative of elderly individuals. Additionally, the deep-learning model emphasizes the crucial role of the P-wave in age predictions across all age groups, suggesting potential changes in the distribution of different P-wave types with age.These findings offer new insights into age-related ECG changes, going beyond traditional feature-based approaches.

ProbTS: A Unified Toolkit to Probe Deep Time-series Forecasting

  • paper_url: http://arxiv.org/abs/2310.07446
  • repo_url: None
  • paper_authors: Jiawen Zhang, Xumeng Wen, Shun Zheng, Jia Li, Jiang Bian
  • for: 这篇论文的目的是将深度学习技术应用到时间序列预测领域,并评估这两个分支之间的不同特性和表现。
  • methods: 这篇论文使用了两种不同的方法:一种是特定的神经网络架构,另一种是使用进步的深度生成模型进行 probabilistic 预测。
  • results: 这篇论文使用 ProbTS 工具套件进行比较和评估这两种方法的表现,发现它们在不同的数据enario和方法ological focuses 之间有所不同,并提供了新的研究方向来提高时间序列预测的精度。
    Abstract Time-series forecasting serves as a linchpin in a myriad of applications, spanning various domains. With the growth of deep learning, this arena has bifurcated into two salient branches: one focuses on crafting specific neural architectures tailored for time series, and the other harnesses advanced deep generative models for probabilistic forecasting. While both branches have made significant progress, their differences across data scenarios, methodological focuses, and decoding schemes pose profound, yet unexplored, research questions. To bridge this knowledge chasm, we introduce ProbTS, a pioneering toolkit developed to synergize and compare these two distinct branches. Endowed with a unified data module, a modularized model module, and a comprehensive evaluator module, ProbTS allows us to revisit and benchmark leading methods from both branches. The scrutiny with ProbTS highlights their distinct characteristics, relative strengths and weaknesses, and areas that need further exploration. Our analyses point to new avenues for research, aiming for more effective time-series forecasting.
    摘要 时间序列预测作为许多应用领域的核心,其中有两个主要分支:一个是针对时间序列特定的神经网络架构的设计,另一个是利用高级深度生成模型进行 probabilistic 预测。虽然这两个分支都取得了 significiant progress,但是在数据场景、方法重点和解码方案方面存在差异,这些差异仍然尚未得到系统的探索。为了bridging这个知识差距,我们提出了 ProbTS,一个 pioneering 的工具集,用于同时Synergize和比较这两个分支。ProbTS 具有一个统一的数据模块、一个模块化的模型模块和一个全面的评价模块,这使得我们可以重新评估和比较领导的方法从两个分支中。通过 ProbTS 的检验,我们发现了这两个分支的特点、相对优劣点和需要进一步探索的领域。我们的分析指向了新的研究方向,旨在更有效地预测时间序列。

A Branched Deep Convolutional Network for Forecasting the Occurrence of Hazes in Paris using Meteorological Maps with Different Characteristic Spatial Scales

  • paper_url: http://arxiv.org/abs/2310.07437
  • repo_url: None
  • paper_authors: Chien Wang
  • for: 预测浓雾事件的发生
  • methods: 使用多decadal日地区天气和水文变量作为输入特征,并使用Surface visibility观测数据作为目标进行训练
  • results: 两支分支架构对抗气雾事件的预测性能有所提高,并在验证和盲测评估中获得了合理的分数。
    Abstract A deep learning platform has been developed to forecast the occurrence of the low visibility events or hazes. It is trained by using multi-decadal daily regional maps of various meteorological and hydrological variables as input features and surface visibility observations as the targets. To better preserve the characteristic spatial information of different input features for training, two branched architectures have recently been developed for the case of Paris hazes. These new architectures have improved the performance of the network, producing reasonable scores in both validation and a blind forecasting evaluation using the data of 2021 and 2022 that have not been used in the training and validation.
    摘要 一个深度学习平台已经开发,用于预测普通天气下雾霾或低视野事件的发生。该平台通过使用多个劳伦地域天气和水文变量的多 décennial日aily地图作为输入特征,并将地面可见度观测作为目标。为更好地保留不同输入特征的特征空间信息, reciently 两个分支架构已经开发用于Paris雾霾情况。这两个新架构已经提高了网络的性能,在验证和隐藏预测评估中制造了合理的分数,使用2021和2022年的数据进行验证和预测。Note: "reciently" is a typo, it should be "recently".

Generalized Mixture Model for Extreme Events Forecasting in Time Series Data

  • paper_url: http://arxiv.org/abs/2310.07435
  • repo_url: None
  • paper_authors: Jincheng Wang, Yue Gao
  • for: 这个研究是为了提高时间序列预测中的极端值预测性能,特别是在气象预测、交通管理和股票价格预测等领域。
  • methods: 本研究使用了一个新的深度混合模型,即深度极点混合模型(DEMMA),其包括两个主要模块:1)一个基于折衣分布的通用混合分布,和2)一个基于自动编码器的LSTM特征提取器和时间注意力机制。
  • results: 研究使用多个真实世界的雨量数据展示了DEMMA模型的效果,并证明了它在模型极端值预测方面的改进。
    Abstract Time Series Forecasting (TSF) is a widely researched topic with broad applications in weather forecasting, traffic control, and stock price prediction. Extreme values in time series often significantly impact human and natural systems, but predicting them is challenging due to their rare occurrence. Statistical methods based on Extreme Value Theory (EVT) provide a systematic approach to modeling the distribution of extremes, particularly the Generalized Pareto (GP) distribution for modeling the distribution of exceedances beyond a threshold. To overcome the subpar performance of deep learning in dealing with heavy-tailed data, we propose a novel framework to enhance the focus on extreme events. Specifically, we propose a Deep Extreme Mixture Model with Autoencoder (DEMMA) for time series prediction. The model comprises two main modules: 1) a generalized mixture distribution based on the Hurdle model and a reparameterized GP distribution form independent of the extreme threshold, 2) an Autoencoder-based LSTM feature extractor and a quantile prediction module with a temporal attention mechanism. We demonstrate the effectiveness of our approach on multiple real-world rainfall datasets.
    摘要 时间序列预测(TSF)是广泛研究的主题,具有广泛的应用于天气预报、交通管理和股票价格预测等领域。时间序列中的极值事件经常对人类和自然系统产生深远的影响,但预测它们却是具有挑战性的,主要因为它们的发生率很低。基于极值理论(EVT)的统计方法可以系统地模型极值事件的分布,特别是通过 Generalized Pareto(GP)分布来模型超过阈值的出现。为了解决深度学习在处理具有极大尾部的数据时的表现不佳,我们提出了一种新的框架来强调极值事件。具体来说,我们提出了一种基于极值分布的深度混合模型(DEMMA),用于时间序列预测。该模型包括两个主要模块:1)基于极值模型和独立于极值阈值的GP分布的扩展 mixture distribution;2)基于自适应神经网络和时间注意机制的LSTM特征提取器和时间预测模块。我们在多个实际降水数据集上证明了我们的方法的效果。

Non-backtracking Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.07430
  • repo_url: None
  • paper_authors: Seonghyun Park, Narae Ryu, Gahee Kim, Dongyeop Woo, Se-Young Yun, Sungsoo Ahn
  • for: 提高大规模图像 neural network 的精度和计算效率,解决 message-passing 更新方法中的循环问题。
  • methods: 提出非循环回访图 neural network (NBA-GNN),通过不包含之前访问过的节点的信息来更新消息,从而解决 message-passing 更新方法中的循环问题。
  • results: 通过实验证明,NBA-GNN 可以有效地解决循环问题,提高大规模图像 neural network 的精度和计算效率,并在长距离图像和推理节点分类任务上表现出色。
    Abstract The celebrated message-passing updates for graph neural networks allow the representation of large-scale graphs with local and computationally tractable updates. However, the local updates suffer from backtracking, i.e., a message flows through the same edge twice and revisits the previously visited node. Since the number of message flows increases exponentially with the number of updates, the redundancy in local updates prevents the graph neural network from accurately recognizing a particular message flow for downstream tasks. In this work, we propose to resolve such a redundancy via the non-backtracking graph neural network (NBA-GNN) that updates a message without incorporating the message from the previously visited node. We further investigate how NBA-GNN alleviates the over-squashing of GNNs, and establish a connection between NBA-GNN and the impressive performance of non-backtracking updates for stochastic block model recovery. We empirically verify the effectiveness of our NBA-GNN on long-range graph benchmark and transductive node classification problems.
    摘要 “著名的消息传递更新方法为图 neuronal network 提供了可 representations of large-scale graphs 的地方和计算可 tractable 的更新。然而,本地更新受到回tracking的影响,即消息流经同一个边两次并返回已经访问过的节点。由于消息流的数量 exponentially 增加与更新数量的关系,这种 redundancy 在图 neuronal network 中阻碍了下游任务的准确识别。在这项工作中,我们提议了解决这种 redundancy 的非回tracking图 neuronal network (NBA-GNN),它在更新消息时不 incorporate 先前访问过的节点的消息。我们进一步调查了 NBA-GNN 如何缓解 GNN 的过 compressing 问题,并建立了 NB 更新与 Stochastic Block Model 回归的连接。我们employmically 验证了我们的 NBA-GNN 在长距离图 bencmark 和推uctive node classification 问题上的效果。”Note: Simplified Chinese is a romanization of Chinese, and the translation may not be perfect. The original text is in English, and the translation is provided for reference only.

Quantum-Enhanced Forecasting: Leveraging Quantum Gramian Angular Field and CNNs for Stock Return Predictions

  • paper_url: http://arxiv.org/abs/2310.07427
  • repo_url: None
  • paper_authors: Zhengmeng Xu, Hai Lin
  • for: 该研究旨在提高时间序列预测精度,使用量子计算技术与深度学习结合。
  • methods: 该方法使用量子电路特定设计,将时间序列数据转换为适合 convolutional Neural Network (CNN) 训练的二维图像。与传统的 Gramian Angular Field (GAF) 方法不同,QGAF 方法不需要数据Normalization 和 inverse cosine 计算,简化了时间序列数据转换为图像的过程。
  • results: 对三个主要股市的数据进行了实验:中国A股市场、香港股市和美国股市。实验结果表明,相比传统GAF方法,QGAF方法在时间序列预测精度方面具有显著提高,降低了预测错误的平均绝对值(MAE)和平均方差(MSE) Errors by an average of 25% for MAE and 48% for MSE.
    Abstract We propose a time series forecasting method named Quantum Gramian Angular Field (QGAF). This approach merges the advantages of quantum computing technology with deep learning, aiming to enhance the precision of time series classification and forecasting. We successfully transformed stock return time series data into two-dimensional images suitable for Convolutional Neural Network (CNN) training by designing specific quantum circuits. Distinct from the classical Gramian Angular Field (GAF) approach, QGAF's uniqueness lies in eliminating the need for data normalization and inverse cosine calculations, simplifying the transformation process from time series data to two-dimensional images. To validate the effectiveness of this method, we conducted experiments on datasets from three major stock markets: the China A-share market, the Hong Kong stock market, and the US stock market. Experimental results revealed that compared to the classical GAF method, the QGAF approach significantly improved time series prediction accuracy, reducing prediction errors by an average of 25% for Mean Absolute Error (MAE) and 48% for Mean Squared Error (MSE). This research confirms the potential and promising prospects of integrating quantum computing with deep learning techniques in financial time series forecasting.
    摘要 我们提出了一种名为量子agramian angular field(QGAF)的时间序列预测方法。这种方法结合了量子计算技术和深度学习,目的是提高时间序列分类和预测精度。我们成功地将股票回报时间序列数据转化为适合深度神经网络训练的二维图像,通过设计专门的量子电路。与 классиical Gramian angular field(GAF)方法不同,QGAF方法不需要数据Normalization和反归cosine计算,从时间序列数据转化到二维图像的过程被简化。为验证这种方法的有效性,我们在三个主要股票市场的数据上进行了实验:中国A股市场、香港股市场和美国股市场。实验结果表明,相比 классиical GAF方法,QGAF方法在时间序列预测精度方面有显著提高,降低预测错误的平均绝对值(MAE)和平均方差(MSE) errors by an average of 25% and 48%, respectively.这项研究证明了将量子计算技术与深度学习技术结合在金融时间序列预测中的潜在和未来的投资机会。

Deep Kernel and Image Quality Estimators for Optimizing Robotic Ultrasound Controller using Bayesian Optimization

  • paper_url: http://arxiv.org/abs/2310.07392
  • repo_url: None
  • paper_authors: Deepak Raina, SH Chandrashekhara, Richard Voyles, Juan Wachs, Subir Kumar Saha
  • for: 帮助自动化医疗影像成像,减少医生的工作负担
  • methods: 使用神经网络学习低维度kernels,并使用这些kernels进行bayesian优化
  • results: 实现50%的样本效率提高,并且这种性能提高是不受具体训练数据的影响Here’s a more detailed explanation of each point:
  • for: The paper is written to help improve the efficiency of autonomous robotic ultrasound imaging, by reducing the workload of sonographers.
  • methods: The paper proposes using a neural network to learn a low-dimensional kernel in Bayesian optimization, which is a sample-efficient optimization framework. The neural network is trained using probe and image data acquired during the procedure.
  • results: The paper shows that the proposed framework can achieve over 50% increase in sample efficiency for 6D control of the robotized probe, and this performance enhancement is independent of the specific training dataset, demonstrating inter-patient adaptability.
    Abstract Ultrasound is a commonly used medical imaging modality that requires expert sonographers to manually maneuver the ultrasound probe based on the acquired image. Autonomous Robotic Ultrasound (A-RUS) is an appealing alternative to this manual procedure in order to reduce sonographers' workload. The key challenge to A-RUS is optimizing the ultrasound image quality for the region of interest across different patients. This requires knowledge of anatomy, recognition of error sources and precise probe position, orientation and pressure. Sample efficiency is important while optimizing these parameters associated with the robotized probe controller. Bayesian Optimization (BO), a sample-efficient optimization framework, has recently been applied to optimize the 2D motion of the probe. Nevertheless, further improvements are needed to improve the sample efficiency for high-dimensional control of the probe. We aim to overcome this problem by using a neural network to learn a low-dimensional kernel in BO, termed as Deep Kernel (DK). The neural network of DK is trained using probe and image data acquired during the procedure. The two image quality estimators are proposed that use a deep convolution neural network and provide real-time feedback to the BO. We validated our framework using these two feedback functions on three urinary bladder phantoms. We obtained over 50% increase in sample efficiency for 6D control of the robotized probe. Furthermore, our results indicate that this performance enhancement in BO is independent of the specific training dataset, demonstrating inter-patient adaptability.
    摘要 ultrasound 是一种广泛使用的医疗影像模式,需要专业的医疗人员手动操作ultrasound 探针。 autonomous Robotic Ultrasound (A-RUS) 是一种可能的解决方案,以减轻医疗人员的工作负担。 然而,要OPTIMIZE 影像质量在不同的病人中是主要挑战。这需要了解解剖学、识别错误来源以及精确的探针位置、orientation 和压力。 sample efficiency 是重要的,而且OPTIMIZING 这些参数与 robotized 探针控制器相关。 Bayesian Optimization (BO) 是一种样本效率的优化框架,已经应用于优化2D 探针运动。然而,需要进一步改进以提高高维度控制的样本效率。我们想使用神经网络学习一个低维度kernel,称为Deep Kernel (DK)。神经网络的 DK 在 BO 中训练,使用在过程中获取的探针和影像数据。我们提出了两种影像质量估计器,使用深度卷积神经网络,并提供实时反馈给 BO。我们验证了我们的框架使用这两种反馈函数,在三个尿道模拟中获得了50%以上的样本效率提高。此外,我们的结果表明,这种性能改进在 BO 中是无关特定训练集的,表示可以在不同的病人中进行适应。

Experimental quantum natural gradient optimization in photonics

  • paper_url: http://arxiv.org/abs/2310.07371
  • repo_url: None
  • paper_authors: Yizhi Wang, Shichuan Xue, Yaxuan Wang, Jiangfang Ding, Weixu Shi, Dongyang Wang, Yong Liu, Yingwen Liu, Xiang Fu, Guangyao Huang, Anqi Huang, Mingtang Deng, Junjie Wu
  • for: 研究可行量子计算机在噪响中等级量子时代的实用应用。
  • methods: 使用可 Parameterized 量子圈和类别优化器,实现可行量子计算机的实用应用。
  • results: 比用Gradient-free和普通的梯度下降方法更快地 converges 和更好地避免地点极值,从而降低了电路执行的成本。在光学设备上实验ally 证明了这一点,并实现了He-H$^+$ 离子的分解曲线,达到了化学精度。
    Abstract Variational quantum algorithms (VQAs) combining the advantages of parameterized quantum circuits and classical optimizers, promise practical quantum applications in the Noisy Intermediate-Scale Quantum era. The performance of VQAs heavily depends on the optimization method. Compared with gradient-free and ordinary gradient descent methods, the quantum natural gradient (QNG), which mirrors the geometric structure of the parameter space, can achieve faster convergence and avoid local minima more easily, thereby reducing the cost of circuit executions. We utilized a fully programmable photonic chip to experimentally estimate the QNG in photonics for the first time. We obtained the dissociation curve of the He-H$^+$ cation and achieved chemical accuracy, verifying the outperformance of QNG optimization on a photonic device. Our work opens up a vista of utilizing QNG in photonics to implement practical near-term quantum applications.
    摘要 “变量量量算法(VQA),结合参数化量Circuit和классиical优化器的优点,承诺实现量子应用程序在噪声中间量子时代。VQA的性能强调优化方法。相比于梯度值为零和普通梯度下降方法,量子自然梯度(QNG),它反映参数空间的几何结构,可以更快地收敛和更容易避免地陷入地点,从而降低环境执行成本。我们利用了完全可编程的光学芯片进行实验,对光学中的QNG进行了实验性估计。我们获得了He-H$^+$离子的解键曲线,并达到了化学精度,证明了QNG优化在光学设备上的超越性。我们的工作开启了在光学中使用QNG实现实用的近期量子应用的可能性。”Note that Simplified Chinese is a more informal and spoken version of Chinese, and it may not be appropriate for all formal situations or audiences. If you need a more formal translation, you may want to consider using Traditional Chinese or Classical Chinese.

Orthogonal Random Features: Explicit Forms and Sharp Inequalities

  • paper_url: http://arxiv.org/abs/2310.07370
  • repo_url: None
  • paper_authors: Nizar Demni, Hachem Kadri
  • for: 这个论文是为了扩大kernel方法的扩展而引入随机特征。特别是使用随机傅рие纹理和正交随机纹理来 aproximate popular Gaussian kernel。
  • methods: 这篇论文使用了随机傅рие纹理和正交随机纹理来 aproximate Gaussian kernel。
  • results: 这篇论文分析了随机特征的偏见和方差,并提供了正常化essel函数的准确表达和锐化均衡 bound,证明了正交随机特征比随机傅рие纹理更有用。
    Abstract Random features have been introduced to scale up kernel methods via randomization techniques. In particular, random Fourier features and orthogonal random features were used to approximate the popular Gaussian kernel. The former is performed by a random Gaussian matrix and leads exactly to the Gaussian kernel after averaging. In this work, we analyze the bias and the variance of the kernel approximation based on orthogonal random features which makes use of Haar orthogonal matrices. We provide explicit expressions for these quantities using normalized Bessel functions and derive sharp exponential bounds supporting the view that orthogonal random features are more informative than random Fourier features.
    摘要 随机特性被引入来扩大kernel方法。特别是随机傅立勃方法和正交随机特性被用来估计流行的加aussian kernel。前者通过随机矩阵来实现,并导致exact Gaussian kernel после均值。在这种工作中,我们分析kernel approximation的偏差和方差基于正交随机特性,使用 Haar正交矩阵。我们提供了Explicit表达式使用 normalized Bessel functions,并 derivsharp exponential bounds,支持我们的观点,即正交随机特性比随机傅立勃方法更有用。

Improved Analysis of Sparse Linear Regression in Local Differential Privacy Model

  • paper_url: http://arxiv.org/abs/2310.07367
  • repo_url: None
  • paper_authors: Liyang Zhu, Meng Ding, Vaneet Aggarwal, Jinhui Xu, Di Wang
  • for: 本研究重新审视了含有稀疏参数的线性回归问题在本地隐私(LDP)模型中。现有研究在非互动和串行本地模型中已经关注到了对于$1$-稀疏参数的下界,但扩展到更一般的$k$-稀疏参数的情况是有挑战的。此外,是否存在有效的非互动LDP(NLDP)算法仍然是一个问题。
  • methods: 我们首先考虑了在$\epsilon$非互动LDP模型中的问题,并提供了$\ell_2$-范数估计误差的下界为$\Omega(\frac{\sqrt{dk\log d}{\sqrt{n}\epsilon})$,其中$n$是样本大小和$d$是特征空间的维度。我们还提出了一种创新的NLDP算法,这是本问题的首次解决方案。这个算法还生成了一个高效的估计器作为副产品。我们的算法实现了对于各种数据的上界为$\tilde{O}({\frac{d\sqrt{k}{\sqrt{n}\epsilon})$,可以通过增加$O(\sqrt{d})$的因子进一步提高。在串行互动LDP模型中,我们显示了类似的下界。
  • results: 我们的结论表明,在稀疏线性回归问题中,非互动LDP模型和中心DP模型之间存在深刻的差异。
    Abstract In this paper, we revisit the problem of sparse linear regression in the local differential privacy (LDP) model. Existing research in the non-interactive and sequentially local models has focused on obtaining the lower bounds for the case where the underlying parameter is $1$-sparse, and extending such bounds to the more general $k$-sparse case has proven to be challenging. Moreover, it is unclear whether efficient non-interactive LDP (NLDP) algorithms exist. To address these issues, we first consider the problem in the $\epsilon$ non-interactive LDP model and provide a lower bound of $\Omega(\frac{\sqrt{dk\log d}{\sqrt{n}\epsilon})$ on the $\ell_2$-norm estimation error for sub-Gaussian data, where $n$ is the sample size and $d$ is the dimension of the space. We propose an innovative NLDP algorithm, the very first of its kind for the problem. As a remarkable outcome, this algorithm also yields a novel and highly efficient estimator as a valuable by-product. Our algorithm achieves an upper bound of $\tilde{O}({\frac{d\sqrt{k}{\sqrt{n}\epsilon})$ for the estimation error when the data is sub-Gaussian, which can be further improved by a factor of $O(\sqrt{d})$ if the server has additional public but unlabeled data. For the sequentially interactive LDP model, we show a similar lower bound of $\Omega({\frac{\sqrt{dk}{\sqrt{n}\epsilon})$. As for the upper bound, we rectify a previous method and show that it is possible to achieve a bound of $\tilde{O}(\frac{k\sqrt{d}{\sqrt{n}\epsilon})$. Our findings reveal fundamental differences between the non-private case, central DP model, and local DP model in the sparse linear regression problem.
    摘要 在这篇论文中,我们重新考虑了在本地权限隐私(LDP)模型下的稀疏线性回归问题。现有研究在非交互式和顺序本地模型下都集中在了对于$1$-稀疏的参数下获得下界,并将其推广到更通用的$k$-稀疏 случа子是一项挑战。另外,是否存在高效的非交互式LDP(NLDP)算法也是一个问题。为解决这些问题,我们首先考虑了在$\epsilon$非交互式LDP模型下的问题,并提供了$\ell_2$-范数估计误差的下界为$\Omega(\frac{\sqrt{dk\log d}{\sqrt{n}\epsilon})$,其中$n$是样本大小,$d$是空间维度。我们提出了一种创新的NLDP算法,这是首次为这个问题提出了解决方案。这个算法还生成了一种高效的估计器作为副产品。我们的算法在SUB-高分布数据上达到了$\tilde{O}({\frac{d\sqrt{k}{\sqrt{n}\epsilon})$的估计误差上界,可以通过增加$O(\sqrt{d})$的因子进一步改进。如果服务器拥有额外的公共 yet 无标签数据,那么我们的算法可以在这些数据上进行改进,从而提高估计误差的Bound。在Sequentially交互式LDP模型下,我们显示了类似的下界为$\Omega(\frac{\sqrt{dk}{\sqrt{n}\epsilon})$。在上界方面,我们修复了之前的方法,并显示了可以达到$\tilde{O}(\frac{k\sqrt{d}{\sqrt{n}\epsilon})$的上界。我们的发现表明了稀疏线性回归问题在非权限模型、中央DP模型和本地DP模型之间存在fundamental的差异。

GraphControl: Adding Conditional Control to Universal Graph Pre-trained Models for Graph Domain Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.07365
  • repo_url: None
  • paper_authors: Yun Zhu, Yaoke Wang, Haizhou Shi, Zhenshuo Zhang, Siliang Tang
    for:这篇论文的目的是提出一个名为GraphControl的创新部署模组,以便更好地实现图形领域的转移学习。methods:这篇论文使用了自适应的图形模型和ControlNet来实现图形转移学习。results:实验结果显示,GraphControl可以将预训 модеル更好地适应目标图形资料,实现1.4-3倍的性能提升,并且比较training-from-scratch方法在目标资料上的表现更好,且变得更快速。
    Abstract Graph-structured data is ubiquitous in the world which models complex relationships between objects, enabling various Web applications. Daily influxes of unlabeled graph data on the Web offer immense potential for these applications. Graph self-supervised algorithms have achieved significant success in acquiring generic knowledge from abundant unlabeled graph data. These pre-trained models can be applied to various downstream Web applications, saving training time and improving downstream (target) performance. However, different graphs, even across seemingly similar domains, can differ significantly in terms of attribute semantics, posing difficulties, if not infeasibility, for transferring the pre-trained models to downstream tasks. Concretely speaking, for example, the additional task-specific node information in downstream tasks (specificity) is usually deliberately omitted so that the pre-trained representation (transferability) can be leveraged. The trade-off as such is termed as "transferability-specificity dilemma" in this work. To address this challenge, we introduce an innovative deployment module coined as GraphControl, motivated by ControlNet, to realize better graph domain transfer learning. Specifically, by leveraging universal structural pre-trained models and GraphControl, we align the input space across various graphs and incorporate unique characteristics of target data as conditional inputs. These conditions will be progressively integrated into the model during fine-tuning or prompt tuning through ControlNet, facilitating personalized deployment. Extensive experiments show that our method significantly enhances the adaptability of pre-trained models on target attributed datasets, achieving 1.4-3x performance gain. Furthermore, it outperforms training-from-scratch methods on target data with a comparable margin and exhibits faster convergence.
    摘要 世界各地的各种数据都具有图structured的特点,用于模型复杂的对象之间的关系,支持多种Web应用程序。日常大量未标注图数据的入流提供了巨大的潜在性 для这些应用程序。自我超vised学习算法在丰富的未标注图数据上取得了显著的成功,从而获得了一些通用的知识。这些预训练模型可以应用于多个下游Web应用程序,提高下游(目标)性能,并节省训练时间。然而,不同的图,即使在看起来相似的领域中,可能存在很大的属性 semantics 的差异,这会对预训练模型的转移带来很大的困难,甚至是不可能。具体来说,例如,在下游任务中添加特定任务的节点信息通常会故意 omitted,以便利用预训练表示。这种困难被称为“转移性-特点矛盾”在这篇论文中。为解决这个挑战,我们提出了一种创新的投入模块,称为GraphControl, inspirited by ControlNet。我们利用通用的结构预训练模型和GraphControl,将输入空间 across various graphs 进行对应,并 incorporate 目标数据中独特的特征作为条件输入。这些条件将在练习或提示调整中逐渐进行 integrate 到模型中,通过ControlNet,实现个性化部署。我们的方法在目标 attributed 数据上显著提高了预训练模型的适应性,实现了1.4-3x的性能提升。此外,它还超过了从scratch 训练方法在目标数据上的相同幅度,并且显示更快的收敛速度。

Atom-Motif Contrastive Transformer for Molecular Property Prediction

  • paper_url: http://arxiv.org/abs/2310.07351
  • repo_url: None
  • paper_authors: Wentao Yu, Shuo Chen, Chen Gong, Gang Niu, Masashi Sugiyama
    for:* 这 paper 是为了提高分子属性预测(MPP)的效果而写的。methods:* 这 paper 使用了 Atom-Motif Contrastive Transformer(AMCT)模型,该模型不仅考虑了分子中的单个原子间交互,还考虑了分子中的重要模式(例如功能组)的交互。results:* 对比于当前的状态艺术方法,该 paper 的方法能够更好地预测分子的属性,并且可以准确地确定每个分子中的关键模式。
    Abstract Recently, Graph Transformer (GT) models have been widely used in the task of Molecular Property Prediction (MPP) due to their high reliability in characterizing the latent relationship among graph nodes (i.e., the atoms in a molecule). However, most existing GT-based methods usually explore the basic interactions between pairwise atoms, and thus they fail to consider the important interactions among critical motifs (e.g., functional groups consisted of several atoms) of molecules. As motifs in a molecule are significant patterns that are of great importance for determining molecular properties (e.g., toxicity and solubility), overlooking motif interactions inevitably hinders the effectiveness of MPP. To address this issue, we propose a novel Atom-Motif Contrastive Transformer (AMCT), which not only explores the atom-level interactions but also considers the motif-level interactions. Since the representations of atoms and motifs for a given molecule are actually two different views of the same instance, they are naturally aligned to generate the self-supervisory signals for model training. Meanwhile, the same motif can exist in different molecules, and hence we also employ the contrastive loss to maximize the representation agreement of identical motifs across different molecules. Finally, in order to clearly identify the motifs that are critical in deciding the properties of each molecule, we further construct a property-aware attention mechanism into our learning framework. Our proposed AMCT is extensively evaluated on seven popular benchmark datasets, and both quantitative and qualitative results firmly demonstrate its effectiveness when compared with the state-of-the-art methods.
    摘要 近些年,图Transformer(GT)模型在分子性质预测(MPP)任务中广泛应用,因为它们可以准确地描述分子中节点之间的潜在关系。然而,大多数现有的GT基于方法通常只explore分子中基本的对应关系,因此它们忽略了分子中重要的对应关系(例如功能组consisted of several atoms)。由于分子中的功能组是决定分子性质的重要Patterns,忽略这些对应关系不可避免地降低了MPP的效果。为解决这个问题,我们提出了一种新的Atom-Motif Contrastive Transformer(AMCT)模型,它不仅探索分子中的原子间交互,还考虑分子中的对应关系。由于分子中的原子和功能组的表示是同一个实例的两种视角,因此它们自然地启用了自我超visional信号 для模型训练。此外,同一个功能组可以在不同的分子中出现,因此我们还使用了对比损失来提高同一个功能组在不同的分子中的表示协调。最后,为了清晰地确定每个分子中critical的功能组,我们进一步构建了一个性质意识的注意机制 into our learning framework。我们的提出的AMCT在七个流行的 benchmark dataset上进行了广泛的评估,并且量化和质量上的结果都证明了它的效果性比领先方法更高。

Towards Foundation Models for Learning on Tabular Data

  • paper_url: http://arxiv.org/abs/2310.07338
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Han Zhang, Xumeng Wen, Shun Zheng, Wei Xu, Jiang Bian
  • for: 提高 tabular 数据上的学习效果,并提供可转移的模型 для新任务。
  • methods: 使用生成型 tabular 学习,采用预训练的大语言模型(LLM)作为基本模型,并通过定制的目标进行精度调整。
  • results: 在零shot和Contextual Inference等指令ollowing任务中, TabFM 显著超越了 GPT-4 等关闭源 LLM,并在 scarce 数据下进行 fine-tuning 时显示了remarkable的效率和竞争力。
    Abstract Learning on tabular data underpins numerous real-world applications. Despite considerable efforts in developing effective learning models for tabular data, current transferable tabular models remain in their infancy, limited by either the lack of support for direct instruction following in new tasks or the neglect of acquiring foundational knowledge and capabilities from diverse tabular datasets. In this paper, we propose Tabular Foundation Models (TabFMs) to overcome these limitations. TabFMs harness the potential of generative tabular learning, employing a pre-trained large language model (LLM) as the base model and fine-tuning it using purpose-designed objectives on an extensive range of tabular datasets. This approach endows TabFMs with a profound understanding and universal capabilities essential for learning on tabular data. Our evaluations underscore TabFM's effectiveness: not only does it significantly excel in instruction-following tasks like zero-shot and in-context inference, but it also showcases performance that approaches, and in instances, even transcends, the renowned yet mysterious closed-source LLMs like GPT-4. Furthermore, when fine-tuning with scarce data, our model achieves remarkable efficiency and maintains competitive performance with abundant training data. Finally, while our results are promising, we also delve into TabFM's limitations and potential opportunities, aiming to stimulate and expedite future research on developing more potent TabFMs.
    摘要 学习表格数据的应用场景非常广泛。尽管有大量的努力在开发有效的学习模型 для表格数据,但目前可传输的表格模型仍处于幼年期,受到 Either irect instruction following in new tasks 或 neglect of acquiring foundational knowledge and capabilities from diverse tabular datasets的限制。在这篇论文中,我们提议使用表格基础模型(TabFM)来超越这些限制。TabFM 利用生成表格学习的潜力,使用预训练的大型自然语言模型(LLM)作为基本模型,并通过特定目标的精心调整在广泛的表格数据集上。这种方法赋予 TabFM 深刻的理解和普遍的能力,使其成为学习表格数据的优秀选择。我们的评估表明,TabFM 不仅在 zero-shot 和 in-context 推理任务中表现出色,而且在一些情况下,甚至超越了著名但神秘的关闭源 LLM like GPT-4。此外,当 fine-tuning WITH 稀有数据时,我们的模型实现了很好的效率,并保持了与丰富训练数据相比的竞争性。最后,虽然我们的结果吸引人,但我们 также探讨 TabFM 的局限性和潜在机遇,以便促进和加快未来的研究。

Multichannel consecutive data cross-extraction with 1DCNN-attention for diagnosis of power transformer

  • paper_url: http://arxiv.org/abs/2310.07323
  • repo_url: None
  • paper_authors: Wei Zheng, Guogang Zhang, Chenchen Zhao, Qianqian Zhu
  • for: 本研究旨在提出一种基于多通道连续数据的变压器诊断方法,以便更好地捕捉变压器的状态信息。
  • methods: 本方法基于多通道连续数据层次结构(MCDC),并引入一维 convolutional neural network attention(1DCNN-attention)机制以提高诊断效果和简化空间复杂度。
  • results: 实验结果表明,相比于其他方法,MCDC和1DCNN-attention具有更高的诊断精度和泛化能力,并且1DCNN-attention机制能够提供更稳定的诊断结果。
    Abstract Power transformer plays a critical role in grid infrastructure, and its diagnosis is paramount for maintaining stable operation. However, the current methods for transformer diagnosis focus on discrete dissolved gas analysis, neglecting deep feature extraction of multichannel consecutive data. The unutilized sequential data contains the significant temporal information reflecting the transformer condition. In light of this, the structure of multichannel consecutive data cross-extraction (MCDC) is proposed in this article in order to comprehensively exploit the intrinsic characteristic and evaluate the states of transformer. Moreover, for the better accommodation in scenario of transformer diagnosis, one dimensional convolution neural network attention (1DCNN-attention) mechanism is introduced and offers a more efficient solution given the simplified spatial complexity. Finally, the effectiveness of MCDC and the superior generalization ability, compared with other algorithms, are validated in experiments conducted on a dataset collected from real operation cases of power transformer. Additionally, the better stability of 1DCNN-attention has also been certified.
    摘要 <>Power transformer plays a critical role in grid infrastructure, and its diagnosis is paramount for maintaining stable operation. However, the current methods for transformer diagnosis focus on discrete dissolved gas analysis, neglecting deep feature extraction of multichannel consecutive data. The unutilized sequential data contains significant temporal information reflecting the transformer condition. In light of this, the structure of multichannel consecutive data cross-extraction (MCDC) is proposed in this article to comprehensively exploit the intrinsic characteristic and evaluate the states of transformer. Moreover, to better accommodate the scenario of transformer diagnosis, one dimensional convolution neural network attention (1DCNN-attention) mechanism is introduced, offering a more efficient solution given the simplified spatial complexity. Finally, the effectiveness of MCDC and the superior generalization ability, compared with other algorithms, are validated in experiments conducted on a dataset collected from real operation cases of power transformer. Additionally, the better stability of 1DCNN-attention has also been certified.Note: Please keep in mind that the translation is in Simplified Chinese, and the grammar and sentence structure may be different from Traditional Chinese.

Byzantine-Resilient Decentralized Multi-Armed Bandits

  • paper_url: http://arxiv.org/abs/2310.07320
  • repo_url: None
  • paper_authors: Jingxuan Zhu, Alec Koppel, Alvaro Velasquez, Ji Liu
  • for: This paper focuses on the problem of decentralized cooperative multi-armed bandits (MAB) in a setting where some agents may be Byzantine (i.e., they may provide arbitrary wrong information). The goal is to develop a fully decentralized resilient algorithm that can recover the salient behavior of a cooperative setting even in the presence of Byzantine agents.
  • methods: The proposed algorithm uses an information mixing step among agents and a truncation of inconsistent and extreme values to fuse the information. The algorithm is based on the Upper-Confidence Bound (UCB) method, but with a modification to handle the Byzantine agents.
  • results: The paper shows that the proposed algorithm can achieve a regret that is no worse than the classic single-agent UCB1 algorithm, and the cumulative regret of all normal agents is strictly better than the non-cooperative case, as long as each agent has at least 3f+1 neighbors where f is the maximum possible Byzantine agents in each agent’s neighborhood. The paper also establishes extensions to time-varying neighbor graphs and minimax lower bounds on the achievable regret. Experiments corroborate the merits of the proposed framework in practice.
    Abstract In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient behavior when an unknown fraction of the agents can be Byzantine, that is, communicate arbitrarily wrong information in the form of reward mean-estimates or confidence sets. This framework can be used to model attackers in computer networks, instigators of offensive content into recommender systems, or manipulators of financial markets. Our key contribution is the development of a fully decentralized resilient upper confidence bound (UCB) algorithm that fuses an information mixing step among agents with a truncation of inconsistent and extreme values. This truncation step enables us to establish that the performance of each normal agent is no worse than the classic single-agent UCB1 algorithm in terms of regret, and more importantly, the cumulative regret of all normal agents is strictly better than the non-cooperative case, provided that each agent has at least 3f+1 neighbors where f is the maximum possible Byzantine agents in each agent's neighborhood. Extensions to time-varying neighbor graphs, and minimax lower bounds are further established on the achievable regret. Experiments corroborate the merits of this framework in practice.
    摘要 在分布式合作多臂强擦擦机(MAB)中,每个代理都观察到自己独特的奖励流,并尝试与其他代理交换信息,以选择一个序列的臂,以最小化它的 regret。在合作 setting中,代理可以超过单个代理运行 MAB 方法,如 Upper-Confidence Bound(UCB)独立地执行。在这种工作中,我们研究如何恢复这种突出的行为,当一个未知的比例的代理可以是 Byzantine,即通过不正确地传递奖励均值或信任集来交流信息时。这种框架可以用来模型计算机网络中的攻击者,推荐系统中的启动者,或财务市场中的操纵者。我们的关键贡献是开发了一种完全分布式抗攻击的Upper Confidence Bound(UCB)算法,该算法结合代理之间的信息混合步骤,并对不一致和极端值进行舍入。这种舍入步骤使得我们可以证明每个正常的代理的性能不 worse than 独立的单个代理 UCB1 算法,而且更重要的是,所有正常的代理的总 regret 比非合作情况更好,只要每个代理至少有 3f+1 个邻居,其中 f 是最大可能的 Byzantine 代理数量。我们还Extensions to 时变邻居图和最小最差下界是进一步确立的。实验证明了这种框架在实践中的优势。

Molecule-Edit Templates for Efficient and Accurate Retrosynthesis Prediction

  • paper_url: http://arxiv.org/abs/2310.07313
  • repo_url: None
  • paper_authors: Mikołaj Sacha, Michał Sadowski, Piotr Kozakowski, Ruard van Workum, Stanisław Jastrzębski
  • for: 本研究旨在开发一种基于机器学习的Retrosynthesis方法,以便用更加有效和可解释的方式预测复杂分子的合成步骤。
  • methods: 本研究使用了一种名为METRO(分子修改模板Retrosynthesis)的机器学习模型,该模型使用最小的模板(简化的反应模式)来预测反应,从而减少计算开销并达到标准benchmark的最佳结果。
  • results: 根据标准benchmark的测试结果,METRO模型可以准确预测复杂分子的合成步骤,并且比传统的模板基本法更加有效和可解释。
    Abstract Retrosynthesis involves determining a sequence of reactions to synthesize complex molecules from simpler precursors. As this poses a challenge in organic chemistry, machine learning has offered solutions, particularly for predicting possible reaction substrates for a given target molecule. These solutions mainly fall into template-based and template-free categories. The former is efficient but relies on a vast set of predefined reaction patterns, while the latter, though more flexible, can be computationally intensive and less interpretable. To address these issues, we introduce METRO (Molecule-Edit Templates for RetrOsynthesis), a machine-learning model that predicts reactions using minimal templates - simplified reaction patterns capturing only essential molecular changes - reducing computational overhead and achieving state-of-the-art results on standard benchmarks.
    摘要 转换文本到简化中文:Retrosynthesis是指从简单前体分子 synthesize复杂分子的过程。这在有机化学中存在挑战,特别是预测目标分子可能的反应substrate。这些解决方案主要分为模板基和模板自由两类。前者是效率高,但需要大量预定的反应模式,而后者具有更多的灵活性,但计算负担大和解释性差。为解决这些问题,我们介绍METRO(分子修改模板 для RetrOsynthesis),一种基于机器学习的模型,使用最小模板预测反应,减少计算负担,达到标准benchmark的状态 искусственный智能。

Score Regularized Policy Optimization through Diffusion Behavior

  • paper_url: http://arxiv.org/abs/2310.07297
  • repo_url: https://github.com/thu-ml/srpo
  • paper_authors: Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu
  • for: 实现高效的行为样本抽取和独立于时间和资源consuming的算法。
  • methods: 利用批评模型和遗传 diffusion 行为模型,将分布式批评政策转换为具有高效决策能力的决策政策,并在优化过程中直接使用行为分布的得分函数进行调整。
  • results: 在 D4RL 任务上,我们的方法可以大幅提高行为抽取速度,较于各种主流的扩散基于方法,并且仍保持现有的性能水准。
    Abstract Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.
    摘要

Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

  • paper_url: http://arxiv.org/abs/2310.07269
  • repo_url: None
  • paper_authors: Zixiang Chen, Junkai Zhang, Yiwen Kou, Xiangning Chen, Cho-Jui Hsieh, Quanquan Gu
  • for: 降低过拟合的挑战,特别是在训练大型神经网络时,使用Sharpness-Aware Minimization(SAM)方法可以提高神经网络的泛化能力,即使存在标签噪声。
  • methods: 本文使用Sharpness-Aware Minimization(SAM)方法,并通过对非线性神经网络和分类任务的研究,解释了SAM在这些任务中的成功原理。
  • results: 对于某种数据模型和二层卷积ReLU网络,本文证明SAM在某些情况下比Stochastic Gradient Descent(SGD)更好地泛化,并通过实验证明了这一点。
    Abstract The challenge of overfitting, in which the model memorizes the training data and fails to generalize to test data, has become increasingly significant in the training of large neural networks. To tackle this challenge, Sharpness-Aware Minimization (SAM) has emerged as a promising training method, which can improve the generalization of neural networks even in the presence of label noise. However, a deep understanding of how SAM works, especially in the setting of nonlinear neural networks and classification tasks, remains largely missing. This paper fills this gap by demonstrating why SAM generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks. The loss landscape of our studied problem is nonsmooth, thus current explanations for the success of SAM based on the Hessian information are insufficient. Our result explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages, thereby facilitating more effective learning of features. Experiments on both synthetic and real data corroborate our theory.
    摘要 Translated into Simplified Chinese:难以适应问题,即模型记忆训练数据而不能泛化测试数据,在大型神经网络训练中变得越来越重要。为解决这个挑战,锐度感知敏感化(SAM)已经成为一种有前途的训练方法,可以提高神经网络的泛化性,即使存在标签噪声。然而,SAM在非线性神经网络和分类任务中的工作机制仍然不够了解。这篇论文填补这个空白,说明SAM在某种数据模型和二层卷积ReLU网络上的优化性比SGD更好。我们的搜索问题的损失景观是非凹的,因此现有基于Hessian信息的解释不够。我们的结果解释了SAM的优势,特别是它在早期阶段避免噪声学习,从而促进更有效的特征学习。实验证明了我们的理论。

RaftFed: A Lightweight Federated Learning Framework for Vehicular Crowd Intelligence

  • paper_url: http://arxiv.org/abs/2310.07268
  • repo_url: None
  • paper_authors: Changan Yang, Yaxing Chen, Yao Zhang, Helei Cui, Zhiwen Yu, Bin Guo, Zheng Yan, Zijiang Yang
  • for: 这则研究旨在解决车载智能应用中的数据隐私问题,使用了联邦学习(Federated Learning,FL)技术。
  • methods: 本研究提出了一个名为RaftFed的新型联邦学习框架,实现了隐私保护的车载智能应用。RaftFed使用了raft协议来实现分布式模型聚合,并且适应非 Identical Independent Distributions(Non-IID)数据。
  • results: 实验结果显示,RaftFed相比基eline的通信负载、模型精度和模型融合都表现更好。
    Abstract Vehicular crowd intelligence (VCI) is an emerging research field. Facilitated by state-of-the-art vehicular ad-hoc networks and artificial intelligence, various VCI applications come to place, e.g., collaborative sensing, positioning, and mapping. The collaborative property of VCI applications generally requires data to be shared among participants, thus forming network-wide intelligence. How to fulfill this process without compromising data privacy remains a challenging issue. Although federated learning (FL) is a promising tool to solve the problem, adapting conventional FL frameworks to VCI is nontrivial. First, the centralized model aggregation is unreliable in VCI because of the existence of stragglers with unfavorable channel conditions. Second, existing FL schemes are vulnerable to Non-IID data, which is intensified by the data heterogeneity in VCI. This paper proposes a novel federated learning framework called RaftFed to facilitate privacy-preserving VCI. The experimental results show that RaftFed performs better than baselines regarding communication overhead, model accuracy, and model convergence.
    摘要 vehicular crowd intelligence (VCI) 是一个emerging研究领域。通过现代交通运输网络和人工智能技术,VCI应用得到了广泛的应用,例如共同探测、定位和地图生成。VCI应用的共同性通常需要参与者之间数据共享,因此形成网络范围内的智能。但是保护数据隐私的问题仍然是一个挑战。虽然联邦学习(FL)是一种有望的解决方案,但将传统FL框架适应VCI是一个非常困难的任务。首先,中央模型聚合是VCI中不可靠的,因为存在不优惠的通道条件下的停留者。其次,现有的FL方案容易受到非同分布数据的影响,这在VCI中更加严重,因为数据多样性很高。这篇论文提出了一种新的联邦学习框架called RaftFed,用于保护隐私的VCI。实验结果表明,RaftFed比基准方案更好,具有较低的通信开销、更高的模型准确率和更快的模型融合。

Classification of Dysarthria based on the Levels of Severity. A Systematic Review

  • paper_url: http://arxiv.org/abs/2310.07264
  • repo_url: None
  • paper_authors: Afnan Al-Ali, Somaya Al-Maadeed, Moutaz Saleh, Rani Chinnappa Naidu, Zachariah C Alex, Prakash Ramachandran, Rajeev Khoodeeram, Rajesh Kumar M
  • for: 这个评估是为了提高受影响者的沟通能力和生活质量,以及为了提供更加准确和可靠的诊断。
  • methods: 这个评估使用了人工智能技术,特别是机器学习算法,以自动分类受影响者的喉痛度。
  • results: 这个评估发现了一些最有效的特征和技术,可以用于自动分类受影响者的喉痛度,并提高了诊断的准确性和可靠性。
    Abstract Dysarthria is a neurological speech disorder that can significantly impact affected individuals' communication abilities and overall quality of life. The accurate and objective classification of dysarthria and the determination of its severity are crucial for effective therapeutic intervention. While traditional assessments by speech-language pathologists (SLPs) are common, they are often subjective, time-consuming, and can vary between practitioners. Emerging machine learning-based models have shown the potential to provide a more objective dysarthria assessment, enhancing diagnostic accuracy and reliability. This systematic review aims to comprehensively analyze current methodologies for classifying dysarthria based on severity levels. Specifically, this review will focus on determining the most effective set and type of features that can be used for automatic patient classification and evaluating the best AI techniques for this purpose. We will systematically review the literature on the automatic classification of dysarthria severity levels. Sources of information will include electronic databases and grey literature. Selection criteria will be established based on relevance to the research questions. Data extraction will include methodologies used, the type of features extracted for classification, and AI techniques employed. The findings of this systematic review will contribute to the current understanding of dysarthria classification, inform future research, and support the development of improved diagnostic tools. The implications of these findings could be significant in advancing patient care and improving therapeutic outcomes for individuals affected by dysarthria.
    摘要 《嗜睡病患者精度诊断分类方法的系统atic review》Introduction:嗜睡病(Dysarthria)是一种神经系统疾病,可能对患者的沟通能力和生活质量产生重要影响。准确和客观地诊断嗜睡病和其严重程度是诊断治疗的关键。现有的传统评估方法由语言听说师(SLP)进行,但这些评估方法常常是主观的、耗时的,并且可能由听说师之间存在差异。新兴的机器学习基于模型已经显示出可以提供更客观的嗜睡病诊断,从而提高诊断的准确性和可靠性。本系统atic review旨在全面分析当前的嗜睡病分类方法,具体来说是寻找最有效的分类特征和AI技术。Objectives:本文的目标是对嗜睡病分类方法进行系统atic review,以便更好地理解嗜睡病的分类方法,并且为未来的研究提供参考。特定的研究问题包括:1. 寻找最有效的分类特征,以便自动分类嗜睡病的严重程度。2. 评估AI技术的效果,以便选择最佳的AI技术来进行嗜睡病分类。Methodology:本文的方法包括:1. 搜索电子数据库和灰色文献,以找到相关的研究。2. 根据研究问题的 relevance 选择合适的文献。3. 对选择的文献进行数据抽取,包括使用的方法、分类特征和AI技术。Expected outcomes:本文的结果将对当前的嗜睡病分类方法进行全面分析,并提供有价值的参考。这些结果可能对患者的诊断和治疗产生重要影响,并且可能推动未来的研究。Conclusion:本文的系统atic review将对嗜睡病分类方法进行全面分析,并评估AI技术的效果。这些结果将有助于我们更好地理解嗜睡病的分类方法,并为未来的研究提供参考。这些结果的发现可能对患者的诊断和治疗产生重要影响,并且可能推动未来的研究。

Deep ReLU networks and high-order finite element methods II: Chebyshev emulation

  • paper_url: http://arxiv.org/abs/2310.07261
  • repo_url: None
  • paper_authors: Joost A. A. Opschoor, Christoph Schwab
  • for: 这篇论文主要研究了深度ReLU神经网络(NN)在 Sobolev нормов下的表达率和稳定性,以及NN的参数数量如何影响这些性能指标。
  • methods: 这篇论文使用了 Novel constructions of ReLU NN surrogates,即使用Chebychev多项式扩展系数来表示近似函数。这些系数可以从Clenshaw–Curtis点中的函数值使用 inverse fast Fourier transform 计算。
  • results: 论文得到了对表达率和稳定性的较好的上限,超过了基于ReLU NN拟合幂数考虑的构造。论文还提供了不同类型函数和 norms 的NN拟合误差估计,以及在数值分析中遇到的常见函数和 norms 的ReLU NN拟合率 bounds。
    Abstract Expression rates and stability in Sobolev norms of deep ReLU neural networks (NNs) in terms of the number of parameters defining the NN for continuous, piecewise polynomial functions, on arbitrary, finite partitions $\mathcal{T}$ of a bounded interval $(a,b)$ are addressed. Novel constructions of ReLU NN surrogates encoding the approximated functions in terms of Chebyshev polynomial expansion coefficients are developed. Chebyshev coefficients can be computed easily from the values of the function in the Clenshaw--Curtis points using the inverse fast Fourier transform. Bounds on expression rates and stability that are superior to those of constructions based on ReLU NN emulations of monomials considered in [Opschoor, Petersen, Schwab, 2020] are obtained. All emulation bounds are explicit in terms of the (arbitrary) partition of the interval, the target emulation accuracy and the polynomial degree in each element of the partition. ReLU NN emulation error estimates are provided for various classes of functions and norms, commonly encountered in numerical analysis. In particular, we show exponential ReLU emulation rate bounds for analytic functions with point singularities and develop an interface between Chebfun approximations and constructive ReLU NN emulations.
    摘要 “深度ReLU神经网络(NN)的表达速率和稳定性在 Sobolev нор下,对于连续、分割式多项函数,在有界区间(a,b)上的任意、有限分区 $\mathcal{T}$ 上进行研究。我们提出了基于 ReLU NN 的新构造,用于表示approximes 函数的 Chebyshev 多项展开系数。Chebyshev 系数可以通过 Clenshaw-Curtis 点的值来容易计算,使用反快速傅立叶变换。我们获得了基于 ReLU NN 的构造,superior 于 [Opschoor, Petersen, Schwab, 2020] 中基于 monomials 的构造的表达率和稳定性 bound。所有的 emulation bound 是对于(任意)分区、目标投影精度和每个分区元素的权重度的explicit 表达。我们还提供了不同类型函数和norms 的 ReLU NN 投影误差估计,广泛存在在数学分析中。特别是,我们展示了对于分析函数的点稳定性的 exponential ReLU 投影速率 bound。”

CacheGen: Fast Context Loading for Language Model Applications

  • paper_url: http://arxiv.org/abs/2310.07240
  • repo_url: None
  • paper_authors: Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang
    for: This paper aims to improve the efficiency of large language models (LLMs) by minimizing the delays in fetching and processing contexts.methods: The paper proposes a novel encoder that compresses key-value (KV) features into more compact bitstream representations, taking advantage of the KV features’ distributional properties. Additionally, the paper uses a controller to determine when to load the context as compressed KV features or raw text and picks the appropriate compression level.results: Compared to recent methods that handle long contexts, the proposed method reduces bandwidth usage by 3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3x while maintaining similar LLM performance on various tasks.
    Abstract As large language models (LLMs) take on more complex tasks, their inputs incorporate longer contexts to respond to questions that require domain knowledge or user-specific conversational histories. Yet, using long contexts poses a challenge for responsive LLM systems, as nothing can be generated until all the contexts are fetched to and processed by the LLM. Existing systems optimize only the computation delay in context processing (e.g., by caching intermediate key-value features of the text context) but often cause longer network delays in context fetching (e.g., key-value features consume orders of magnitude larger bandwidth than the text context). This paper presents CacheGen to minimize the delays in fetching and processing contexts for LLMs. CacheGen reduces the bandwidth needed for transmitting long contexts' key-value (KV) features through a novel encoder that compresses KV features into more compact bitstream representations. The encoder combines adaptive quantization with a tailored arithmetic coder, taking advantage of the KV features' distributional properties, such as locality across tokens. Furthermore, CacheGen minimizes the total delay in fetching and processing a context by using a controller that determines when to load the context as compressed KV features or raw text and picks the appropriate compression level if loaded as KV features. We test CacheGen on three models of various sizes and three datasets of different context lengths. Compared to recent methods that handle long contexts, CacheGen reduces bandwidth usage by 3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3x while maintaining similar LLM performance on various tasks as loading the text contexts.
    摘要 large language models (LLMs) 在更复杂的任务中使用时,其输入将包含更长的上下文,以回答需要领域知识或用户特定的对话历史的问题。然而,使用长上下文会对responsive LLM系统 pose a challenge,因为系统无法生成任何内容 until all the contexts are fetched and processed by the LLM. existing systems only optimize the computation delay in context processing (e.g., by caching intermediate key-value features of the text context), but often cause longer network delays in context fetching (e.g., key-value features consume orders of magnitude larger bandwidth than the text context).this paper presents CacheGen, a method to minimize the delays in fetching and processing contexts for LLMs. CacheGen reduces the bandwidth needed for transmitting long contexts' key-value (KV) features through a novel encoder that compresses KV features into more compact bitstream representations. the encoder combines adaptive quantization with a tailored arithmetic coder, taking advantage of the KV features' distributional properties, such as locality across tokens. furthermore, CacheGen minimizes the total delay in fetching and processing a context by using a controller that determines when to load the context as compressed KV features or raw text and picks the appropriate compression level if loaded as KV features. we test CacheGen on three models of various sizes and three datasets of different context lengths. compared to recent methods that handle long contexts, CacheGen reduces bandwidth usage by 3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3x while maintaining similar LLM performance on various tasks as loading the text contexts.

Are GATs Out of Balance?

  • paper_url: http://arxiv.org/abs/2310.07235
  • repo_url: None
  • paper_authors: Nimrah Mustafa, Aleksandar Bojchevski, Rebekka Burkholz
  • for: 该研究探讨了图神经网络(GNN)的优化和学习动力,尤其是GNN中的Graph Attention Network(GAT) Architecture的学习动力。
  • methods: 研究者们使用了权重化注意力系数的GAT网络,并 derive了GAT梯度流动动力的保守定律,解释了为什么使用标准初始化的GAT网络中大部分参数难以更新 durante el entrenamiento。
  • results: 研究者们提出了一种Initialize Balance的方法,该方法可以更好地传播梯度,从而使得更深的GAT网络可以更好地训练,同时可以大幅提高训练和融合时间。
    Abstract While the expressive power and computational capabilities of graph neural networks (GNNs) have been theoretically studied, their optimization and learning dynamics, in general, remain largely unexplored. Our study undertakes the Graph Attention Network (GAT), a popular GNN architecture in which a node's neighborhood aggregation is weighted by parameterized attention coefficients. We derive a conservation law of GAT gradient flow dynamics, which explains why a high portion of parameters in GATs with standard initialization struggle to change during training. This effect is amplified in deeper GATs, which perform significantly worse than their shallow counterparts. To alleviate this problem, we devise an initialization scheme that balances the GAT network. Our approach i) allows more effective propagation of gradients and in turn enables trainability of deeper networks, and ii) attains a considerable speedup in training and convergence time in comparison to the standard initialization. Our main theorem serves as a stepping stone to studying the learning dynamics of positive homogeneous models with attention mechanisms.
    摘要 whilst the expressive power and computational capabilities of graph neural networks (GNNs) have been theoretically studied, their optimization and learning dynamics, in general, remain largely unexplored. Our study undertakes the Graph Attention Network (GAT), a popular GNN architecture in which a node's neighborhood aggregation is weighted by parameterized attention coefficients. We derive a conservation law of GAT gradient flow dynamics, which explains why a high portion of parameters in GATs with standard initialization struggle to change during training. This effect is amplified in deeper GATs, which perform significantly worse than their shallow counterparts. To alleviate this problem, we devise an initialization scheme that balances the GAT network. Our approach i) allows more effective propagation of gradients and in turn enables trainability of deeper networks, and ii) attains a considerable speedup in training and convergence time in comparison to the standard initialization. Our main theorem serves as a stepping stone to studying the learning dynamics of positive homogeneous models with attention mechanisms.Here's the translation in Traditional Chinese:而Graph Neural Networks(GNNs)的表达能力和计算能力已经被理论上研究,但是它们的优化和学习动态在总的来说还是有很多不清楚的地方。我们的研究涉及到Graph Attention Network(GAT),这是一种常见的GNN架构,其中每个节点的邻居聚合是通过参数化的注意系数来权重。我们得出了GAT的Gradient Flow动力学保守定律,这解释了为什么使用标准初始化的GAT参数大部分在训练中难以变化。这个效果在深度GAT中更加明显,它们在训练和 converges 时间上表现较差。为了解决这个问题,我们提出了一种初始化方案,该方案可以更好地传递梯度,从而使得深度网络可以更好地训练,同时也可以减少训练和 converges 时间的浪费。我们的主要定理作为 studying the learning dynamics of positive homogeneous models with attention mechanisms 的起点。

Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality

  • paper_url: http://arxiv.org/abs/2310.07234
  • repo_url: https://github.com/thu-ml/hide-prompt
  • paper_authors: Liyuan Wang, Jingyi Xie, Xingxing Zhang, Mingyi Huang, Hang Su, Jun Zhu
  • for: 提高 continual learning 下的表现,尤其是在自动生成标签数据时。
  • methods: 提出了一种 Hierarchical Decomposition (HiDe-)Prompt 方法,通过将具有任务特征的提示 ensemble 和不同 Representation 的统计数据协调,以提高 continual learning 的表现。
  • results: 对 Split CIFAR-100 和 Split ImageNet-R 进行了广泛的实验,并取得了比较出色的表现(例如,在 Split CIFAR-100 上提高了15.01%和9.61%的表现),并且robustness 到不同的预训练方法。
    Abstract Prompt-based continual learning is an emerging direction in leveraging pre-trained knowledge for downstream continual learning, and has almost reached the performance pinnacle under supervised pre-training. However, our empirical research reveals that the current strategies fall short of their full potential under the more realistic self-supervised pre-training, which is essential for handling vast quantities of unlabeled data in practice. This is largely due to the difficulty of task-specific knowledge being incorporated into instructed representations via prompt parameters and predicted by uninstructed representations at test time. To overcome the exposed sub-optimality, we conduct a theoretical analysis of the continual learning objective in the context of pre-training, and decompose it into hierarchical components: within-task prediction, task-identity inference, and task-adaptive prediction. Following these empirical and theoretical insights, we propose Hierarchical Decomposition (HiDe-)Prompt, an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics of both uninstructed and instructed representations, further with the coordination of a contrastive regularization strategy. Our extensive experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning (e.g., up to 15.01% and 9.61% lead on Split CIFAR-100 and Split ImageNet-R, respectively). Our code is available at \url{https://github.com/thu-ml/HiDe-Prompt}.
    摘要 Prompt-based continual learning 是一种emerging direction,它利用预训练知识来进行下游 continual learning,并已经几乎 дости到了supervised pre-training的性能巅峰。然而,我们的实验研究表明,当前的策略在更实际的self-supervised pre-training下表现不佳,这主要是因为任务特定知识的包含 INTO instructed representations via prompt parameters和在测试时由uninstructed representations预测的困难。为了解决这种暴露的下optimality,我们进行了 continual learning目标在预训练 context中的理论分析,并将其 decomposes into hierarchical components:within-task prediction、task-identity inference和task-adaptive prediction。根据这些实际和理论的洞察,我们提出了HiDe-Prompt方法,它使用了 ensemble of task-specific prompts和预训练和测试时的 both uninstructed和 instructed representations的统计,同时协调了一种对比正则化策略。我们的广泛实验表明,HiDe-Prompt方法具有superior performance和鲁棒性,在不同的预训练 paradigms下(例如,Split CIFAR-100和Split ImageNet-R)。我们的代码可以在 \url{https://github.com/thu-ml/HiDe-Prompt} 上获取。

Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment

  • paper_url: http://arxiv.org/abs/2310.07229
  • repo_url: None
  • paper_authors: Bowen Gao, Yinjun Jia, Yuanle Mo, Yuyan Ni, Weiying Ma, Zhiming Ma, Yanyan Lan
  • for: 这个研究旨在提高药物可活性预测、药物结合亲和力预测和无样化药物设计等生医应用中的腔室表现。
  • methods: 本研究使用了一种新的腔室预训法,具体来说是将蛋白结构分成药物相似的片段和其对应的腔室,然后使用高效的先验式小分子表现来帮助学习腔室表现。
  • results: 研究结果显示,ProFSA方法可以在不同任务中实现州务之前的表现,包括腔室可活性预测、腔室匹配和药物结合亲和力预测等。此外,ProFSA方法也超过了其他预训方法,并且开启了一个新的途径来缓解蛋白-药物复杂数据的罕见性问题。
    Abstract Pocket representations play a vital role in various biomedical applications, such as druggability estimation, ligand affinity prediction, and de novo drug design. While existing geometric features and pretrained representations have demonstrated promising results, they usually treat pockets independent of ligands, neglecting the fundamental interactions between them. However, the limited pocket-ligand complex structures available in the PDB database (less than 100 thousand non-redundant pairs) hampers large-scale pretraining endeavors for interaction modeling. To address this constraint, we propose a novel pocket pretraining approach that leverages knowledge from high-resolution atomic protein structures, assisted by highly effective pretrained small molecule representations. By segmenting protein structures into drug-like fragments and their corresponding pockets, we obtain a reasonable simulation of ligand-receptor interactions, resulting in the generation of over 5 million complexes. Subsequently, the pocket encoder is trained in a contrastive manner to align with the representation of pseudo-ligand furnished by some pretrained small molecule encoders. Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. Notably, ProFSA surpasses other pretraining methods by a substantial margin. Moreover, our work opens up a new avenue for mitigating the scarcity of protein-ligand complex data through the utilization of high-quality and diverse protein structure databases.
    摘要 pocket 表示具有重要作用在各种生物医学应用中,如药物可能性预测、药物粘性预测和新药设计。 although existing geometric features and pre-trained representations have shown promising results, they usually treat pockets independently of ligands, neglecting the fundamental interactions between them. However, the limited number of pocket-ligand complex structures available in the PDB database (less than 100 thousand non-redundant pairs) hinders large-scale pretraining endeavors for interaction modeling. To address this constraint, we propose a novel pocket pretraining approach that leverages knowledge from high-resolution atomic protein structures, assisted by highly effective pre-trained small molecule representations. By segmenting protein structures into drug-like fragments and their corresponding pockets, we obtain a reasonable simulation of ligand-receptor interactions, resulting in the generation of over 5 million complexes. Subsequently, the pocket encoder is trained in a contrastive manner to align with the representation of pseudo-ligand furnished by some pre-trained small molecule encoders. Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. Notably, ProFSA surpasses other pretraining methods by a substantial margin. Moreover, our work opens up a new avenue for mitigating the scarcity of protein-ligand complex data through the utilization of high-quality and diverse protein structure databases.

COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL

  • paper_url: http://arxiv.org/abs/2310.07220
  • repo_url: None
  • paper_authors: Xiyao Wang, Ruijie Zheng, Yanchao Sun, Ruonan Jia, Wichayaporn Wongkamjan, Huazhe Xu, Furong Huang
  • for: 提高模型基于学习方法的效果和稳定性
  • methods: 使用保守的模型执行和优惠环境探索,以避免模型不准确地区域和减少模型预测错误的影响
  • results: 在一系列的 proprioceptive 和视觉控制任务上,使用 $\texttt{COPlanner}$ 可以显著提高模型基于方法的效率和稳定性,并且可以减少模型预测错误的影响
    Abstract Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate sample for policy learning and real environment exploration using current policy for dynamics model learning. However, due to the complex real-world environment, it is inevitable to learn an imperfect dynamics model with model prediction error, which can further mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real environment exploration respectively, to choose actions. Consequently, $\texttt{COPlanner}$ can avoid model uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model uncertain regions to reduce model error actively through optimistic real environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any dyna-style model-based methods. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both sample efficiency and asymptotic performance of strong model-based methods are significantly improved combined with $\texttt{COPlanner}$.
    摘要 dynamically�model�based reinforcement learning�contains two phases: model rollouts to generate samples for policy learning and real environment exploration using current policy for dynamics model learning. However, due to the complex real-world environment, it is inevitable to learn an imperfect dynamics model with model prediction error, which can further mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real environment exploration respectively, to choose actions. Consequently, $\texttt{COPlanner}$ can avoid model uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model uncertain regions to reduce model error actively through optimistic real environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any dyna-style model-based methods. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both sample efficiency and asymptotic performance of strong model-based methods are significantly improved combined with $\texttt{COPlanner}$.

Enhancing Neural Architecture Search with Multiple Hardware Constraints for Deep Learning Model Deployment on Tiny IoT Devices

  • paper_url: http://arxiv.org/abs/2310.07217
  • repo_url: None
  • paper_authors: Alessio Burrello, Matteo Risso, Beatrice Alessandra Motetti, Enrico Macii, Luca Benini, Daniele Jahier Pagliari
  • for: 这个研究旨在解决互联网项目中的问题,即对于互联网的设备不足的硬件限制,以及对于深度学习模型的复杂度和计算量的需求。
  • methods: 这个研究使用了多个条件搜索的方法,包括对于紧缩度和延迟的条件搜索,以及对于紧缩度和内存的条件搜索。这些方法可以实现在单一搜索中,同时满足多个条件的需求。
  • results: 这个研究的结果显示,使用这些方法可以在单一搜索中,实现对于内存和延迟的条件搜索,并且可以实现与现有的深度学习模型相比,提高了硬件限制下的性能。具体来说,这个研究可以实现87.4%的内存和54.2%的延迟的减少,同时保持与现有的深度学习模型相比的精度。
    Abstract The rapid proliferation of computing domains relying on Internet of Things (IoT) devices has created a pressing need for efficient and accurate deep-learning (DL) models that can run on low-power devices. However, traditional DL models tend to be too complex and computationally intensive for typical IoT end-nodes. To address this challenge, Neural Architecture Search (NAS) has emerged as a popular design automation technique for co-optimizing the accuracy and complexity of deep neural networks. Nevertheless, existing NAS techniques require many iterations to produce a network that adheres to specific hardware constraints, such as the maximum memory available on the hardware or the maximum latency allowed by the target application. In this work, we propose a novel approach to incorporate multiple constraints into so-called Differentiable NAS optimization methods, which allows the generation, in a single shot, of a model that respects user-defined constraints on both memory and latency in a time comparable to a single standard training. The proposed approach is evaluated on five IoT-relevant benchmarks, including the MLPerf Tiny suite and Tiny ImageNet, demonstrating that, with a single search, it is possible to reduce memory and latency by 87.4% and 54.2%, respectively (as defined by our targets), while ensuring non-inferior accuracy on state-of-the-art hand-tuned deep neural networks for TinyML.
    摘要 “因互联网物 Things(IoT)设备的快速普及,导致深度学习(DL)模型的效率和准确性成为当前的应用需求。然而,传统的DL模型通常是复杂过度,不适合IoT端设备的 computationally intense 环境。为解决这个挑战,深度架构搜寻(NAS)已经出现为一种受欢迎的设计自动化技术,可以同时优化网络的准确性和复杂度。然而,现有的NAS技术通常需要许多迭代才能生成一个遵循硬件紧存和延迟限制的网络。在这个工作中,我们提出了一种新的方法,可以同时包含多个紧存和延迟的硬件紧存缓冲击网络的设计,并在单一迭代中生成一个遵循用户定义的紧存和延迟限制的网络,与现有的手动游戏定义网络相比,可以降低87.4%的紧存和54.2%的延迟,同时维持非劣于标准的深度学习网络。”Note: The translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you need Traditional Chinese, please let me know.

Generative Modeling on Manifolds Through Mixture of Riemannian Diffusion Processes

  • paper_url: http://arxiv.org/abs/2310.07216
  • repo_url: None
  • paper_authors: Jaehyeong Jo, Sung Ju Hwang
  • for: 模型数据的分布在里曼尼安 manifold 上,是许多科学领域的应用需求。但现有的生成模型在 manifold 上受到严重的异常计算成本或简化热体函数的限制,这限制了它们的可行性和缩放性。
  • methods: 我们介绍了里曼尼安扩散混合(Riemannian Diffusion Mixture),一种基于endpoint-conditioned扩散过程的原理性框架,而不是基于之前的扩散模型的减去方法。我们还提出了一种简单又高效的训练目标,可以直接应用于一般 manifold。
  • results: 我们的方法在不同的 manifold 上比前一代生成模型表现出色,可以在高维度下进行扩散,并且需要减少很多在训练过程中的伪 simulate 步骤。
    Abstract Learning the distribution of data on Riemannian manifolds is crucial for modeling data from non-Euclidean space, which is required by many applications from diverse scientific fields. Yet, existing generative models on manifolds suffer from expensive divergence computation or rely on approximations of heat kernel. These limitations restrict their applicability to simple geometries and hinder scalability to high dimensions. In this work, we introduce the Riemannian Diffusion Mixture, a principled framework for building a generative process on manifolds as a mixture of endpoint-conditioned diffusion processes instead of relying on the denoising approach of previous diffusion models, for which the generative process is characterized by its drift guiding toward the most probable endpoint with respect to the geometry of the manifold. We further propose a simple yet efficient training objective for learning the mixture process, that is readily applicable to general manifolds. Our method outperforms previous generative models on various manifolds while scaling to high dimensions and requires a dramatically reduced number of in-training simulation steps for general manifolds.
    摘要 学习里曼尼安数据分布是非常重要的,因为它是许多科学领域的应用所需的。然而,现有的生成模型在 manifold 上受到了严重的减法计算成本或者使用抽象的热核算法,这限制了它们的可用性和扩展性。在这项工作中,我们介绍了里曼尼安扩散混合(Riemannian Diffusion Mixture),一种主义的框架,用于在 manifold 上建立生成过程,而不是依赖于前一个扩散模型的减法方法。我们还提出了一种简单 yet efficient 的训练目标,可以轻松应用于一般 manifold。我们的方法在多种 manifold 上表现出色,可以扩展到高维度,并且需要减少很多在训练过程中的计算步骤。

Bridging the Gap between Newton-Raphson Method and Regularized Policy Iteration

  • paper_url: http://arxiv.org/abs/2310.07211
  • repo_url: None
  • paper_authors: Zeyang Li, Chuxiong Hu, Yunan Wang, Guojian Zhan, Jie Li, Shengbo Eben Li
  • for: 这篇论文主要研究了规范化学习算法中的正则化技术,具体来说是使用希尔博 entropy 作为正则化函数,并证明了这种方法与标准的新颖矩阵法相等。
  • methods: 该论文使用了正则化策略迭代法,并证明了这种方法在强转化函数下的平滑化 Bellman 方程下具有全球线性准确率 $\gamma$,同时在本地区域内也具有 quadratic 准确率。
  • results: 该论文证明了正则化策略迭代法在全球和本地区域内都具有 globally linear convergence 性,其速率为 $\gamma$,并且在本地区域内也具有 quadratic 准确率。此外,研究者还证明了一种修改后的正则化策略迭代法,即使用 finite-step policy evaluation,与不准确的新颖法相等。这种算法在 asymptotic linear convergence 性下具有 $\gamma^M$ 的速率,其中 $M$ 是策略评估中所执行的步骤数。
    Abstract Regularization is one of the most important techniques in reinforcement learning algorithms. The well-known soft actor-critic algorithm is a special case of regularized policy iteration where the regularizer is chosen as Shannon entropy. Despite some empirical success of regularized policy iteration, its theoretical underpinnings remain unclear. This paper proves that regularized policy iteration is strictly equivalent to the standard Newton-Raphson method in the condition of smoothing out Bellman equation with strongly convex functions. This equivalence lays the foundation of a unified analysis for both global and local convergence behaviors of regularized policy iteration. We prove that regularized policy iteration has global linear convergence with the rate being $\gamma$ (discount factor). Furthermore, this algorithm converges quadratically once it enters a local region around the optimal value. We also show that a modified version of regularized policy iteration, i.e., with finite-step policy evaluation, is equivalent to inexact Newton method where the Newton iteration formula is solved with truncated iterations. We prove that the associated algorithm achieves an asymptotic linear convergence rate of $\gamma^M$ in which $M$ denotes the number of steps carried out in policy evaluation. Our results take a solid step towards a better understanding of the convergence properties of regularized policy iteration algorithms.
    摘要 “常规化”是强化学习算法中最重要的技术之一。知名的软actor-批评算法是常规化policy迭代的特殊情况,其 régulateur选择为 entropy。虽然规范化policy迭代的实际成功,但其理论基础仍然不清楚。这篇论文证明了规范化policy迭代与标准的Newton-raphson方法在bellman方程略微化下是等价的。这种等价性为证明了规范化policy迭代的全面分析。我们证明了规范化policy迭代在discount因子γ下有全面线性减少率,其速率为γ。此外,当iteration进入一个地方圈附近的optimal值时,这种算法会 quadratic减少。我们还证明了一种修改后的规范化policy迭代算法,即在policy评估中使用有限步骤,与不准确的Newton方法相等。我们证明了这种算法在迭代过程中有 asymptotic 线性减少率,其减少率为γ^M,其中M为policy评估中进行的步数。我们的结果为规范化policy迭代算法的减少性质提供了一个坚实的步骤。

Robust Safe Reinforcement Learning under Adversarial Disturbances

  • paper_url: http://arxiv.org/abs/2310.07207
  • repo_url: None
  • paper_authors: Zeyang Li, Chuxiong Hu, Shengbo Eben Li, Jia Cheng, Yunan Wang
  • For: This paper aims to address the challenge of applying reinforcement learning to real-world control tasks while ensuring safety and robustness in the presence of external disturbances.* Methods: The proposed method uses a policy iteration scheme to solve for the robust invariant set, which is a subset of the safe set where persistent safety is possible. The method also integrates the proposed policy iteration scheme into a constrained reinforcement learning algorithm that simultaneously synthesizes the robust invariant set and uses it for constrained policy optimization.* Results: The proposed method achieves zero constraint violation with learned worst-case adversarial disturbances, while other baseline algorithms violate the safety constraints substantially. Additionally, the proposed method attains comparable performance as the baselines even in the absence of the adversary.Here are the three points in Simplified Chinese text:* For: 本研究旨在应对在真实世界控制任务中应用强化学习,保证安全和可靠性在外部干扰存在下。* Methods: 提议的方法使用政策迭代方法解决最差情况下的不变性集,并将其集成到了受限制的强化学习算法中,同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同时同 Times同时同时同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times同 Times
    Abstract Safety is a primary concern when applying reinforcement learning to real-world control tasks, especially in the presence of external disturbances. However, existing safe reinforcement learning algorithms rarely account for external disturbances, limiting their applicability and robustness in practice. To address this challenge, this paper proposes a robust safe reinforcement learning framework that tackles worst-case disturbances. First, this paper presents a policy iteration scheme to solve for the robust invariant set, i.e., a subset of the safe set, where persistent safety is only possible for states within. The key idea is to establish a two-player zero-sum game by leveraging the safety value function in Hamilton-Jacobi reachability analysis, in which the protagonist (i.e., control inputs) aims to maintain safety and the adversary (i.e., external disturbances) tries to break down safety. This paper proves that the proposed policy iteration algorithm converges monotonically to the maximal robust invariant set. Second, this paper integrates the proposed policy iteration scheme into a constrained reinforcement learning algorithm that simultaneously synthesizes the robust invariant set and uses it for constrained policy optimization. This algorithm tackles both optimality and safety, i.e., learning a policy that attains high rewards while maintaining safety under worst-case disturbances. Experiments on classic control tasks show that the proposed method achieves zero constraint violation with learned worst-case adversarial disturbances, while other baseline algorithms violate the safety constraints substantially. Our proposed method also attains comparable performance as the baselines even in the absence of the adversary.
    摘要 安全是控制任务中的主要考虑因素,特别是在外部干扰存在的情况下。然而,现有的安全学习算法很少考虑外部干扰,这限制了它们在实践中的可行性和可靠性。为解决这个挑战,这篇论文提出了一种可靠安全学习框架,可以抵御最差情况下的干扰。首先,这篇论文提出了一种策略迭代算法,用于解决最差情况下的策略问题。这种算法基于哈密逊-雅可比达到可靠性的概念,在其中,控制输入(即主人公)尝试维护安全,而外部干扰(即反派)尝试打砸安全。这篇论文证明了该算法的 converges monotonic 性。其次,这篇论文将提出的策略迭代算法integrated into a constrained reinforcement learning algorithm,该算法同时Synthesizes the robust invariant set and uses it for constrained policy optimization.这个算法同时解决了最优和安全的问题,即学习一个策略,可以在最差情况下实现高的奖励,同时维护安全。在经典控制任务上进行实验,我们的提出的方法可以在面对最差情况下的干扰时,保持零的约束违反率,而其他基准算法却有很大的违反率。此外,我们的方法还可以与基准算法相比,在缺乏反派的情况下具有相似的性能。

Boosting Learning for LDPC Codes to Improve the Error-Floor Performance

  • paper_url: http://arxiv.org/abs/2310.07194
  • repo_url: https://github.com/ghy1228/ldpc_error_floor
  • paper_authors: Hee-Youl Kwak, Dae-Young Yun, Yongjune Kim, Sang-Hyo Kim, Jong-Seon No
  • for: 这个论文的目的是提出一种用神经网络学习的LDPC码解码器,以消除LDPC码的错误底部效应。
  • methods: 这个论文使用了两种training方法来提高LDPC码解码器的性能:首先,通过使用 Ensemble Networks 技术,将解码器分成两个神经网络,并训练后一个神经网络来专门处理不可 corrected 的单词。其次,通过使用块式训练计划,地方训练块内的 weights,以解决迷你梯度问题。
  • results: 通过应用这些training方法于标准LDPC码中,实现了最佳的错误底部性能,并且这些方法不需要额外的硬件成本。
    Abstract Low-density parity-check (LDPC) codes have been successfully commercialized in communication systems due to their strong error correction capabilities and simple decoding process. However, the error-floor phenomenon of LDPC codes, in which the error rate stops decreasing rapidly at a certain level, presents challenges for achieving extremely low error rates and deploying LDPC codes in scenarios demanding ultra-high reliability. In this work, we propose training methods for neural min-sum (NMS) decoders to eliminate the error-floor effect. First, by leveraging the boosting learning technique of ensemble networks, we divide the decoding network into two neural decoders and train the post decoder to be specialized for uncorrected words that the first decoder fails to correct. Secondly, to address the vanishing gradient issue in training, we introduce a block-wise training schedule that locally trains a block of weights while retraining the preceding block. Lastly, we show that assigning different weights to unsatisfied check nodes effectively lowers the error-floor with a minimal number of weights. By applying these training methods to standard LDPC codes, we achieve the best error-floor performance compared to other decoding methods. The proposed NMS decoder, optimized solely through novel training methods without additional modules, can be integrated into existing LDPC decoders without incurring extra hardware costs. The source code is available at https://github.com/ghy1228/LDPC_Error_Floor .
    摘要 低密度正交码(LDPC)编码器在通信系统中得到成功,这是因为它们具有强大的错误检测能力和简单的解码过程。然而,LDPC编码器的错误地面现象,即错误率停止减少快速的现象,对于实现极低的错误率和在需要极高可靠性的场景中部署LDPC编码器带来挑战。在这种情况下,我们提出了使用神经网络的训练方法来消除错误地面现象。首先,我们利用了聚合学习技术,将解码网络分成两个神经网络,并将后台解码器特化为无法被首个解码器正确地解码的词语。其次,为了解决训练过程中的渐进性问题,我们引入了分割训练计划,当地方训练一个块的 weights 时,还会重新训练前一个块的 weights。最后,我们发现,对不满足的检查节点分配不同的权重,可以有效地降低错误地面。通过这些训练方法,我们在标准LDPC编码器上实现了最佳的错误地面性能,并且这些训练方法不需要额外的硬件成本。源代码可以在 上下载。

Neural networks: deep, shallow, or in between?

  • paper_url: http://arxiv.org/abs/2310.07190
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Guergana Petrova, Przemyslaw Wojtaszczyk
  • for: 估算一个封闭集合从巴拿赫空间中的近似误差。
  • methods: 使用带width W、深度 l 和 lipschitz 活动函数的批处理神经网络输出来估算。
  • results: 显示,当depth l 往infty 时,可以达到更好的估算精度,但是 fixing depth 并 letting width W 往infty 无法提高估算精度。
    Abstract We give estimates from below for the error of approximation of a compact subset from a Banach space by the outputs of feed-forward neural networks with width W, depth l and Lipschitz activation functions. We show that, modulo logarithmic factors, rates better that entropy numbers' rates are possibly attainable only for neural networks for which the depth l goes to infinity, and that there is no gain if we fix the depth and let the width W go to infinity.
    摘要 我们提供以下估计 для准确度近似compactsubset从Banach空间的迪拜恩网络输出。我们显示,对于具有无限深度l的神经网络,可以达到比Entropy数字更好的速率,但是如果将宽度W fix,则无法获得更好的性能。Here's the breakdown of the translation:* "compact subset" is translated as "compact subset" ( compact subset 是 compact subset)* "Banach space" is translated as "Banach空间" ( Banach 空间)* "feed-forward neural networks" is translated as "迪拜恩网络" ( 迪拜恩网络)* "width" is translated as "宽度" ( 宽度)* "depth" is translated as "深度" ( 深度)* "Lipschitz activation functions" is translated as "Lipschitz激活函数" ( Lipschitz 激活函数)* "entropy numbers" is translated as "Entropy数字" ( Entropy数字)* "modulo logarithmic factors" is translated as "除alogarithmic因子" ( 除alogarithmic因子)* "rates" is translated as "速率" ( 速率)I hope this helps! Let me know if you have any further questions.

Kernel Cox partially linear regression: building predictive models for cancer patients’ survival

  • paper_url: http://arxiv.org/abs/2310.07187
  • repo_url: https://github.com/rongyaohua/reggkm
  • paper_authors: Yaohua Rong, Sihai Dave Zhao, Xia Zheng, Yi Li
  • for: 预测肿瘤患者的临床结果,准确预测肿瘤患者的生存时间。
  • methods: 使用kernel Cox准对分析方法和RegGKM方法,同时自动除除无关参数和非参数。
  • results: 对多重肿瘤数据集进行分析,预测患者的死亡负担基于基因表达,并可以将患者分为不同的死亡风险组。
    Abstract Wide heterogeneity exists in cancer patients' survival, ranging from a few months to several decades. To accurately predict clinical outcomes, it is vital to build an accurate predictive model that relates patients' molecular profiles with patients' survival. With complex relationships between survival and high-dimensional molecular predictors, it is challenging to conduct non-parametric modeling and irrelevant predictors removing simultaneously. In this paper, we build a kernel Cox proportional hazards semi-parametric model and propose a novel regularized garrotized kernel machine (RegGKM) method to fit the model. We use the kernel machine method to describe the complex relationship between survival and predictors, while automatically removing irrelevant parametric and non-parametric predictors through a LASSO penalty. An efficient high-dimensional algorithm is developed for the proposed method. Comparison with other competing methods in simulation shows that the proposed method always has better predictive accuracy. We apply this method to analyze a multiple myeloma dataset and predict patients' death burden based on their gene expressions. Our results can help classify patients into groups with different death risks, facilitating treatment for better clinical outcomes.
    摘要 广泛的多样性存在于癌症患者的存活时间,从几个月到多个 décadas。为准确预测临床结果,建立一个准确预测模型,将患者的分子 Profiling 与患者的存活时间相关联是非常重要。由于存活时间与高维分子预测器之间存在复杂的关系,同时需要进行非参数化模型和不相关预测器的自动 removing,因此在这篇论文中,我们构建了一个kernel Cox准确预测模型,并提出了一种新的RegGKM方法来适应该问题。我们使用kernel机器方法来描述存活时间和预测器之间的复杂关系,同时通过LASSO penalty自动地除掉了无关的参数化和非参数化预测器。我们开发了一种高维度的高效算法来解决该问题。与其他竞争方法在模拟中比较,我们的方法总是有更高的预测准确性。我们在分析多发性骨髓癌数据集时应用了该方法,并预测了患者的死亡负担基于他们的基因表达。我们的结果可以帮助将患者分组为不同的死亡风险类别,以便进行更好的临床结果。

SAM-OCTA: Prompting Segment-Anything for OCTA Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.07183
  • repo_url: https://github.com/shellredia/sam-octa
  • paper_authors: Xinrun Chen, Chengliang Wang, Haojian Ning, Shiying Li
  • For: This paper proposes a new method for segmenting specific targets in optical coherence tomography angiography (OCTA) images, which is useful for diagnosing and treating eye diseases.* Methods: The proposed method, named SAM-OCTA, uses a low-rank adaptation technique for fine-tuning a foundation model and generates prompt points for various segmentation tasks on OCTA datasets.* Results: SAM-OCTA achieves or approaches state-of-the-art segmentation performance metrics on two publicly available OCTA datasets (OCTA-500 and ROSE), and demonstrates effective local vessel segmentation and artery-vein segmentation, which were not well-solved in previous works.
    Abstract In the analysis of optical coherence tomography angiography (OCTA) images, the operation of segmenting specific targets is necessary. Existing methods typically train on supervised datasets with limited samples (approximately a few hundred), which can lead to overfitting. To address this, the low-rank adaptation technique is adopted for foundation model fine-tuning and proposed corresponding prompt point generation strategies to process various segmentation tasks on OCTA datasets. This method is named SAM-OCTA and has been experimented on the publicly available OCTA-500 and ROSE datasets. This method achieves or approaches state-of-the-art segmentation performance metrics. The effect and applicability of prompt points are discussed in detail for the retinal vessel, foveal avascular zone, capillary, artery, and vein segmentation tasks. Furthermore, SAM-OCTA accomplishes local vessel segmentation and effective artery-vein segmentation, which was not well-solved in previous works. The code is available at https://github.com/ShellRedia/SAM-OCTA.
    摘要 在Optical coherence tomography angiography(OCTA)图像分析中,需要进行特定目标的分割。现有方法通常是在有限样本(约百个)上进行监督学习,这可能导致过拟合。为解决这问题,我们采用了低级适应技术进行基础模型细化和提出了相应的提示点生成策略,以处理不同的分割任务。我们称之为SAM-OCTA,并在公共可用OCTA-500和ROSE数据集上进行了实验。SAM-OCTA方法达到或接近状态 искусственный智能分割性能指标。我们在retinal vessel、foveal avascular zone、capillary、artery和vein分割任务中详细介绍了提示点的效果和适用性。此外,SAM-OCTA还实现了本地血管分割和有效的artery-vein分割,这在前一次作品中没有得到妥善解决。代码可以在https://github.com/ShellRedia/SAM-OCTA上获取。

Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions

  • paper_url: http://arxiv.org/abs/2310.07174
  • repo_url: None
  • paper_authors: Jungtaek Kim, Jeongbeen Yoon, Minsu Cho
  • for: 本研究旨在探讨sorting问题的更加抽象且表达强的输入,如多 digit 图像和图像碎片,通过神经网络 sorting。
  • methods: 我们使用了一种具有梯度可信度的排序网络,并开发了一种不减少的交换函数来保证排序网络的导数性。
  • results: 我们的方法在多种排序 benchmark 上表现了比或相当于基eline方法。
    Abstract Sorting is a fundamental operation of all computer systems, having been a long-standing significant research topic. Beyond the problem formulation of traditional sorting algorithms, we consider sorting problems for more abstract yet expressive inputs, e.g., multi-digit images and image fragments, through a neural sorting network. To learn a mapping from a high-dimensional input to an ordinal variable, the differentiability of sorting networks needs to be guaranteed. In this paper we define a softening error by a differentiable swap function, and develop an error-free swap function that holds non-decreasing and differentiability conditions. Furthermore, a permutation-equivariant Transformer network with multi-head attention is adopted to capture dependency between given inputs and also leverage its model capacity with self-attention. Experiments on diverse sorting benchmarks show that our methods perform better than or comparable to baseline methods.
    摘要 “排序是计算机系统中的基本操作,已经是长期着重研究的主题。我们在传统排序算法的问题形ulation上,考虑到更抽象且表达力强的输入,例如多位数字和图像碎片,透过神经网络进行排序。为了从高维输入学习一个排序顺序,排序网络的 diferenciability 需要保证。在这篇论文中,我们定义了一个滑动错误函数,并开发了一个无错误的交换函数,它保证了非减少和渐近条件。此外,我们还使用了一个具有对称性的Transformer网络,并使用多头注意力来捕捉输入的依赖关系。实验结果显示,我们的方法在多种排序实验中表现更好或相近于基eline方法。”Note that Simplified Chinese is used in the translation, as it is the more commonly used standard for Chinese writing.

Federated Generalization via Information-Theoretic Distribution Diversification

  • paper_url: http://arxiv.org/abs/2310.07171
  • repo_url: None
  • paper_authors: Zheshun Wu, Zenglin Xu, Dun Zeng, Qifan Wang
  • For: The paper focuses on addressing the non-Independent Identically Distributed (non-IID) challenge in Federated Learning (FL), which is a significant hurdle to FL’s generalization efficacy.* Methods: The paper proposes an information-theoretic generalization framework for FL, which quantifies generalization errors by evaluating the information entropy of local distributions and discerning discrepancies across these distributions. The paper also introduces a weighted aggregation approach and a duo of client selection strategies to bolster FL’s generalization prowess.* Results: The paper’s extensive empirical evaluations reaffirm the potency of the proposed methods, aligning seamlessly with the theoretical construct.
    Abstract Federated Learning (FL) has surged in prominence due to its capability of collaborative model training without direct data sharing. However, the vast disparity in local data distributions among clients, often termed the non-Independent Identically Distributed (non-IID) challenge, poses a significant hurdle to FL's generalization efficacy. The scenario becomes even more complex when not all clients participate in the training process, a common occurrence due to unstable network connections or limited computational capacities. This can greatly complicate the assessment of the trained models' generalization abilities. While a plethora of recent studies has centered on the generalization gap pertaining to unseen data from participating clients with diverse distributions, the divergence between the training distributions of participating clients and the testing distributions of non-participating ones has been largely overlooked. In response, our paper unveils an information-theoretic generalization framework for FL. Specifically, it quantifies generalization errors by evaluating the information entropy of local distributions and discerning discrepancies across these distributions. Inspired by our deduced generalization bounds, we introduce a weighted aggregation approach and a duo of client selection strategies. These innovations aim to bolster FL's generalization prowess by encompassing a more varied set of client data distributions. Our extensive empirical evaluations reaffirm the potency of our proposed methods, aligning seamlessly with our theoretical construct.
    摘要 Federated Learning (FL) 在合作模型训练无需直接数据共享的能力下崛起。然而,本地数据分布的巨大差异(非独立同分布,non-IID)问题对 FL 的通用效果提出了 significiant 挑战。这种情况变得更加复杂,当不 все客户端参与训练过程时,这是由于网络连接不稳定或计算资源有限的情况。这可能会很大地复杂训练过程中模型的通用能力评估。而在当前的研究中,大多数研究都集中在参与训练的客户端数据的不同分布上,而忽略了参与训练的客户端数据与测试数据的分布之间的差异。为回应这一问题,我们的论文揭示了一种信息学基本的通用框架,具体来说是通过评估本地分布中的信息熵来评估模型的通用错误。以我们的推导出的通用 bound 为导向,我们引入了权重聚合方法和两种客户端选择策略。这些创新目的是通过包括更多客户端数据分布的方式来提高 FL 的通用能力。我们的实验证明了我们的提议的力量,与我们的理论构造一致。

LLark: A Multimodal Foundation Model for Music

  • paper_url: http://arxiv.org/abs/2310.07160
  • repo_url: https://github.com/spotify-research/llark
  • paper_authors: Josh Gardner, Simon Durand, Daniel Stoller, Rachel M. Bittner
  • for: 本研究旨在开发一种用于音乐理解的多Modal模型(LLark),以便更好地理解音乐的结构和特点。
  • methods: 本研究使用了多种数据集的扩充和指令调整方法来训练LLark模型,并将音乐和语言模型集成在一起。
  • results: 在三类任务(音乐理解、描述和逻辑)中,我们的模型可以在零shot泛化中与现有基eline匹配或超越它们,而人类在描述和逻辑任务中也与模型的回答有高度一致性。I hope this helps! Let me know if you have any other questions.
    Abstract Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that our model matches or outperforms existing baselines in zero-shot generalization for music understanding, and that humans show a high degree of agreement with the model's responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark .
    摘要 音乐具有独特和复杂的结构,对于专业人员和现有的人工智能系统来说都是挑战,与其他形式的音频不同。我们介绍了LLark,一种基于指令调整的多Modal模型,用于音乐理解。我们详细介绍了我们的数据创建过程,包括对多种开源音乐数据集的扩充和转换为一致的指令调整格式。我们提议一种多Modal架构,将音乐生成模型和语言模型集成。在三种任务(音乐理解、captioning和理解)的评估中,我们显示了我们的模型在零shot泛化中与现有基elines匹配或超越,而人类在captioning和理解任务中与模型的回答达到了高度一致。LLark通过 entirely从开源音乐数据和模型进行训练,我们在发表这篇论文时提供了训练代码,可以在https://bit.ly/llark找到更多的结果和音频示例。我们的源代码可以在https://github.com/spotify-research/llark 上找到。

Imitation Learning from Purified Demonstration

  • paper_url: http://arxiv.org/abs/2310.07143
  • repo_url: None
  • paper_authors: Yunke Wang, Minjing Dong, Bo Du, Chang Xu
  • For: Addressing sequential decision-making problems with imperfect expert demonstrations.* Methods: Purifying potential perturbations in imperfect demonstrations via a two-step diffusion process, and then conducting imitation learning from the purified demonstrations.* Results: Theoretical evidence supporting the approach, and evaluation results on MuJoCo demonstrating effectiveness from different aspects.Here’s the summary in Traditional Chinese:* For: 解决受到不完整专家示范的序列做决策问题。* Methods: 使用两步Diffusion过程缓和潜在随机变动,然后从缓和的示范中学习。* Results: 提供了对方法的理论支持,并在MuJoCo上进行了不同方面的评估。
    Abstract Imitation learning has emerged as a promising approach for addressing sequential decision-making problems, with the assumption that expert demonstrations are optimal. However, in real-world scenarios, expert demonstrations are often imperfect, leading to challenges in effectively applying imitation learning. While existing research has focused on optimizing with imperfect demonstrations, the training typically requires a certain proportion of optimal demonstrations to guarantee performance. To tackle these problems, we propose to purify the potential perturbations in imperfect demonstrations and subsequently conduct imitation learning from purified demonstrations. Motivated by the success of diffusion models, we introduce a two-step purification via the diffusion process. In the first step, we apply a forward diffusion process to effectively smooth out the potential perturbations in imperfect demonstrations by introducing additional noise. Subsequently, a reverse generative process is utilized to recover the optimal expert demonstrations from the diffused ones. We provide theoretical evidence supporting our approach, demonstrating that total variance distance between the purified and optimal demonstration distributions can be upper-bounded. The evaluation results on MuJoCo demonstrate the effectiveness of our method from different aspects.
    摘要 <> translate="zh-CN"Sequential decision-making problems 有 Emerged as a promising approach imitation learning, with the assumption that expert demonstrations are optimal. However, in real-world scenarios, expert demonstrations are often imperfect, leading to challenges in effectively applying imitation learning. While existing research has focused on optimizing with imperfect demonstrations, the training typically requires a certain proportion of optimal demonstrations to guarantee performance. To tackle these problems, we propose to purify the potential perturbations in imperfect demonstrations and subsequently conduct imitation learning from purified demonstrations.受 diffusion models 的成功所 inspirited, we introduce a two-step purification via the diffusion process. In the first step, we apply a forward diffusion process to effectively smooth out the potential perturbations in imperfect demonstrations by introducing additional noise. Subsequently, a reverse generative process is utilized to recover the optimal expert demonstrations from the diffused ones. We provide theoretical evidence supporting our approach, demonstrating that total variance distance between the purified and optimal demonstration distributions can be upper-bounded. The evaluation results on MuJoCo demonstrate the effectiveness of our method from different aspects.

Risk Assessment and Statistical Significance in the Age of Foundation Models

  • paper_url: http://arxiv.org/abs/2310.07132
  • repo_url: None
  • paper_authors: Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jerret Ross
  • for: 本研究旨在评估基础模型中的社会技术风险,并使用量化的统计学 significado进行评估。
  • methods: 本研究使用了一种新的统计相对测试,基于实际随机变量的首次和第二次统计学上的准则。这种测试的第二个统计学是与经济统计和金融数学中常用的mean-risk模型相关的。
  • results: 本研究通过定义一个”指标股票”来汇总一系列指标,并使用这个股票来选择基础模型。我们也提供了一种基于风险意识的模型选择方法,并通过 bootstrap variance estimate来支持统计学上的分析。通过使用这些方法,我们对不同的大语言模型进行了风险相关的比较。
    Abstract We propose a distributional framework for assessing socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a \emph{metrics portfolio} for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.
    摘要 我们提出了一个分布式框架,用于评估基础模型的社会技术风险,并使用量化的统计学 significado。我们的方法基于新的随机变量测试,使用第一和第二个泛化随机变量的统计学上的比较。我们表明了这些测试的第二个统计学与经济数学和金融中常用的均值风险模型相关。使用这个框架,我们正式开发了一种基于风险意识的模型选择方法,给出了 guardrails 量化的指标。 inspirited by 数学金融中的股票选择理论,我们定义了一个“指标股票” для每个模型,以便对一系列指标进行汇总,并根据这些股票的随机上度来进行模型选择。我们的统计学 significado 是通过中心假设定理和 bootstrap 方法来证明的,并在实践中通过 bootstrap 假设来估计偏差。我们使用我们的框架来比较不同的大语言模型,关于从指令脱离和输出恶意内容的风险。

Machine Learning Methods for Background Potential Estimation in 2DEGs

  • paper_url: http://arxiv.org/abs/2310.07089
  • repo_url: None
  • paper_authors: Carlo da Cunha, Nobuyuki Aoki, David Ferry, Kevin Vora, Yu Zhang
    for: 这研究探讨了二维电子晕(2DEGs)中含有杂stoff和瑕疵的问题,以及这些问题如何影响碰撞载子 mobilit、导电性和量子准确时间。methods: 这个研究使用了扫描门微镜(SGM)技术,并应用了三种不同的机器学习算法来估算2DEGs的背景潜能:生成 adversarial neural network、细胞神经网络和进化搜索算法。results: 研究发现,使用进化搜索算法可以很有效地估算2DEGs的背景潜能,并提供了一种新的瑕疵分析方法。这项研究不仅提高了我们对2DEGs的理解,还强调了机器学习在探索量子材料方面的潜在优势,对量子计算和纳но电子学习有重要意义。
    Abstract In the realm of quantum-effect devices and materials, two-dimensional electron gases (2DEGs) stand as fundamental structures that promise transformative technologies. However, the presence of impurities and defects in 2DEGs poses substantial challenges, impacting carrier mobility, conductivity, and quantum coherence time. To address this, we harness the power of scanning gate microscopy (SGM) and employ three distinct machine learning techniques to estimate the background potential of 2DEGs from SGM data: image-to-image translation using generative adversarial neural networks, cellular neural network, and evolutionary search. Our findings, despite data constraints, highlight the effectiveness of an evolutionary search algorithm in this context, offering a novel approach for defect analysis. This work not only advances our understanding of 2DEGs but also underscores the potential of machine learning in probing quantum materials, with implications for quantum computing and nanoelectronics.
    摘要 在量子效应设备和材料领域中,二维电子液体(2DEG)作为基本结构,承诺引领技术转型。然而,2DEG中的杂质和缺陷对载子移动性、导电性和量子同步时间产生了显著影响。为此,我们利用扫描门微scanning gate microscopy(SGM)数据,并采用三种不同的机器学习技术来估算2DEG背景电势:图像到图像翻译使用生成对抗神经网络、细胞神经网络和进化搜索。我们的发现,尽管数据有限制,表明了进化搜索算法在这种情况下的效果,提供了一种新的缺陷分析方法。这种工作不仅提高了我们对2DEG的理解,也强调了机器学习在探测量子材料方面的潜在潜力,对量子计算和纳но电子学习有重要意义。

eess.IV - 2023-10-11

Time-Resolved Reconstruction of Motion, Force, and Stiffness using Spectro-Dynamic MRI

  • paper_url: http://arxiv.org/abs/2310.07622
  • repo_url: None
  • paper_authors: Max H. C. van Riel, Tristan van Leeuwen, Cornelis A. T. van den Berg, Alessandro Sbrizzi
  • for: 本研究旨在掌握肌肉和关节的动态特性和物理性质,以便更好地理解肌肉的(病)physiology。
  • methods: 本研究使用了Spectro-Dynamic MRI技术,可以直接从k-space数据中获得高空间和时间分辨率的动态系统特性。
  • results: 本研究提出了一种扩展的Spectro-Dynamic MRI框架,可以重建1) 时间分辨率为11 ms的MR图像,2) 时间分辨率为11 ms的运动场图像,3) 动态参数,以及4) 活动力的激活力。此外,该方法还比two-step方法(先从不含运动信息的下探测数据中重建时间分辨率MR图像,然后使用运动场图像进行运动估算)表现更好。
    Abstract Measuring the dynamics and mechanical properties of muscles and joints is important to understand the (patho)physiology of muscles. However, acquiring dynamic time-resolved MRI data is challenging. We have previously developed Spectro-Dynamic MRI which allows the characterization of dynamical systems at a high spatial and temporal resolution directly from k-space data. This work presents an extended Spectro-Dynamic MRI framework that reconstructs 1) time-resolved MR images, 2) time-resolved motion fields, 3) dynamical parameters, and 4) an activation force, at a temporal resolution of 11 ms. An iterative algorithm solves a minimization problem containing four terms: a motion model relating the motion to the fully-sampled k-space data, a dynamical model describing the expected type of dynamics, a data consistency term describing the undersampling pattern, and finally a regularization term for the activation force. We acquired MRI data using a dynamic motion phantom programmed to move like an actively driven linear elastic system, from which all dynamic variables could be accurately reconstructed, regardless of the sampling pattern. The proposed method performed better than a two-step approach, where time-resolved images were first reconstructed from the undersampled data without any information about the motion, followed by a motion estimation step.
    摘要 An iterative algorithm solves a minimization problem with four terms: a motion model relating motion to fully-sampled k-space data, a dynamical model describing expected dynamics, a data consistency term describing undersampling pattern, and a regularization term for activation force. We acquired MRI data using a dynamic motion phantom programmed to move like an actively driven linear elastic system, from which all dynamic variables could be accurately reconstructed, regardless of sampling pattern. The proposed method outperformed a two-step approach where time-resolved images were first reconstructed without motion information and then motion was estimated.

eess.SP - 2023-10-11

Kronecker-structured Sparse Vector Recovery with Application to IRS-MIMO Channel Estimation

  • paper_url: http://arxiv.org/abs/2310.07869
  • repo_url: None
  • paper_authors: Yanbin He, Geethu Joseph
  • for: 本文研究了一种 Kronecker 结构杂度矩阵稀疏 вектор回归问题,该问题在多个实际应用中出现,如智能反射面 aid 多输入多出力系统的稀疏通道估计。先前的方法仅利用 Kronecker 结构在稀疏 вектор的支持和非零元素上,解决整个线性系统,导致计算复杂性高。而我们则将原始稀疏回归问题分解成多个独立的子问题,并将它们解决一个一个。通过 Kronecker 乘法,我们可以保留稀疏 веctor 的结构,包括支持和非零元素。我们的 simulations 表明,我们的方法在精度和运行时间两个方面与先前的工作相比,有更高的性能。
  • methods: 我们的方法包括将原始稀疏回归问题分解成多个独立的子问题,并将它们解决一个一个。然后,通过 Kronecker 乘法,我们可以将独立解决的子问题结果相加以获得稀疏 веctor。
  • results: 我们的 simulations 表明,我们的方法在精度和运行时间两个方面与先前的工作相比,有更高的性能。我们归因此于解决空间减少和去噪效果。
    Abstract This paper studies the problem of Kronecker-structured sparse vector recovery from an underdetermined linear system with a Kronecker-structured dictionary. Such a problem arises in many real-world applications such as the sparse channel estimation of an intelligent reflecting surface-aided multiple-input multiple-output system. The prior art only exploits the Kronecker structure in the support of the sparse vector and solves the entire linear system together leading to high computational complexity. Instead, we break down the original sparse recovery problem into multiple independent sub-problems and solve them individually. We obtain the sparse vector as the Kronecker product of the individual solutions, retaining its structure in both support and nonzero entries. Our simulations demonstrate the superior performance of our methods in terms of accuracy and run time compared with the existing works, using synthetic data and the channel estimation application. We attribute the low run time to the reduced solution space due to the additional structure and improved accuracy to the denoising effect owing to the decomposition step.
    摘要 In contrast, our approach breaks down the original sparse recovery problem into multiple independent sub-problems and solves them individually. We obtain the sparse vector as the Kronecker product of the individual solutions, retaining its structure in both support and nonzero entries. Our simulations show that our methods outperform existing works in terms of accuracy and run time, using synthetic data and the channel estimation application. We attribute the low run time to the reduced solution space due to the additional structure and improved accuracy to the denoising effect of the decomposition step.

Adaptive Quantization for Key Generation in Low-Power Wide-Area Networks

  • paper_url: http://arxiv.org/abs/2310.07853
  • repo_url: None
  • paper_authors: Chen Chen, Junqing Zhang, Yingying Chen
  • for: 本研究旨在提高资源有限低功率宽域网络(LPWAN)的物理层密钥生成方法,通过基于反射和随机无线通信的物理层密钥生成方法,提高 LPWAN 的安全性。
  • methods: 本研究提出了一种基于 Lempel-Ziv 复杂度(LZ76)的自适应量化方案,可以在 LPWAN 中动态调整量化参数,以根据各个时间段的 RSSI 测量结果的随机性来生成密钥。此外,我们还提出了一种在实时密钥生成过程中进行信息重新匹配的卡利ibration scheme,以确保键的均匀分布。
  • results: 实验结果表明,对于 LoRa 设备,我们的自适应量化方案可以与对比方法(差分量化和固定量化)相比,提高 KGR 的表现,最高可以达到 2.35 倍和 1.51 倍的提升。
    Abstract Physical layer key generation based on reciprocal and random wireless channels has been an attractive solution for securing resource-constrained low-power wide-area networks (LPWANs). When quantizing channel measurements, namely received signal strength indicator (RSSI), into key bits, the existing works mainly adopt fixed quantization levels and guard band parameters, which fail to fully extract keys from RSSI measurements. In this paper, we propose a novel adaptive quantization scheme for key generation in LPWANs, taking LoRa as a case study. The proposed adaptive quantization scheme can dynamically adjust the quantization parameters according to the randomness of RSSI measurements estimated by Lempel-Ziv complexity (LZ76), while ensuring a predefined key disagreement ratio (KDR). Specifically, our scheme uses pre-trained linear regression models to determine the appropriate quantization level and guard band parameter for each segment of RSSI measurements. Moreover, we propose a guard band parameter calibration scheme during information reconciliation during real-time key generation operation. Experimental evaluations using LoRa devices show that the proposed adaptive quantization scheme outperforms the benchmark differential quantization and fixed quantization with up to 2.35$\times$ and 1.51$\times$ key generation rate (KGR) gains, respectively.
    摘要 物理层密钥生成基于相互反射和随机无线通道已成为优化资源充足低功率宽域网络(LPWAN)的一个吸引人的解决方案。当量化通道测量值(RSSI)为密钥位时,现有的工作主要采用固定量化水平和保障带参数,这些参数不足以完全从RSSI测量值中提取密钥。在本文中,我们提出了一种新的自适应量化方案,用于LPWAN中的密钥生成,具体来说是通过Lempel-Ziv复杂度(LZ76)来估计RSSI测量值的随机性,并保证一定的密钥分歧率(KDR)。specifically,我们的方案使用预训练的线性回归模型来确定每个RSSI测量值段的合适的量化水平和保障带参数。此外,我们还提出了在实时密钥生成过程中进行信息重新协调的保射参数calibration方案。实验使用LoRa设备显示,我们的自适应量化方案与参考 differential量化和固定量化相比,可以获得最高达2.35倍和1.51倍的密钥生成速率(KGR)提升。

Exploiting Semantic Localization in Highly Dynamic Wireless Networks Using Deep Homoscedastic Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.07792
  • repo_url: https://github.com/leo-chu/semanticloc
  • paper_authors: Lei Chu, Abdullah Alghafis, Andreas F. Molisch
  • for: 本研究旨在提高无GPS的street canyon环境中的地点定位精度和Robustness,通过对三种传播条件(line-of-sight(LOS)、阻挡LOS(OLOS)和非LOS(NLOS))进行semantic定义,并同时确定它们和实际地点的定位。
  • methods: 本研究使用机器学习(ML)技术,提出了semantic定位方法,其中包括联合任务(坐标回归和 semantics 分类)学习问题,以及对各个位置的多个CSIs进行有效的表示学习和知识传递。
  • results: 本研究通过多任务深度适应(DA)技术,使用有限多个标注样本和大量无标注样本进行地点定位,并提出了scenario适应学习策略,以确保efficient表示学习和成功的知识传递。此外,本研究还使用 bayesian 理论来模型uncertainty weights的重要性,从而降低了时间consuming的参数finetuning。
    Abstract Localization in GPS-denied outdoor locations, such as street canyons in an urban or metropolitan environment, has many applications. Machine Learning (ML) is widely used to tackle this critical problem. One challenge lies in the mixture of line-of-sight (LOS), obstructed LOS (OLOS), and non-LOS (NLOS) conditions. In this paper, we consider a semantic localization that treats these three propagation conditions as the ''semantic objects", and aims to determine them together with the actual localization, and show that this increases accuracy and robustness. Furthermore, the propagation conditions are highly dynamic, since obstruction by cars or trucks can change the channel state information (CSI) at a fixed location over time. We therefore consider the blockage by such dynamic objects as another semantic state. Based on these considerations, we formulate the semantic localization with a joint task (coordinates regression and semantics classification) learning problem. Another problem created by the dynamics is the fact that each location may be characterized by a number of different CSIs. To avoid the need for excessive amount of labeled training data, we propose a multi-task deep domain adaptation (DA) based localization technique, training neural networks with a limited number of labeled samples and numerous unlabeled ones. Besides, we introduce novel scenario adaptive learning strategies to ensure efficient representation learning and successful knowledge transfer. Finally, we use Bayesian theory for uncertainty modeling of the importance weights in each task, reducing the need for time-consuming parameter finetuning; furthermore, with some mild assumptions, we derive the related log-likelihood for the joint task and present the deep homoscedastic DA based localization method.
    摘要 本文研究用机器学习(ML)在无GPS的户外场景中进行地点地理位置的定位。这些场景包括城市街区的封闭环境,其中存在多种propagation condition,如直接视线(LOS)、受阻视线(OLOS)和非直接视线(NLOS)等。本文提出一种 semantics localization,将这三种propagation condition treated为 semantics objects,并将其与实际地理位置一起确定,以提高精度和可靠性。此外,这些propagation condition是高度动态的,因为卡车或卡车可以在fixed location上时间上对通信状态信息(CSI)进行阻挡。因此,我们将阻挡这些动态对象作为另一种semantic state来考虑。基于以上考虑,我们将semantic localization定义为一个联合任务(coordinates regression和semantics classification)学习问题。此外,由于每个地点可能有多个CSIs,我们提出一种多任务深度适应(DA)基本地址技术,通过使用有限量的标注样本和大量的无标注样本来训练神经网络。此外,我们还引入了一些novel scenario adaptive learning strategy来保证效率的表征学习和成功的知识传递。最后,我们使用 bayesian 理论来模型 uncertainty weight的importance,减少需要时间consuming的参数微调; 此外,在某些轻微的假设下,我们Derive the related log-likelihood for the joint task, and present the deep homoscedastic DA based localization method.

Sparse Millimeter Wave Channel Estimation From Partially Coherent Measurements

  • paper_url: http://arxiv.org/abs/2310.07569
  • repo_url: None
  • paper_authors: Weijia Yi, Nitin Jonathan Myers, Geethu Joseph
  • for: 这种研究旨在开发一种毫米波通信系统中的通道估计技术,以减少训练负担并考虑普通频率噪声引起的频率错误。
  • methods: 该方法利用毫米波通道的稀畴结构,并考虑在扫描射频谱中的频率错误。特别是在基于IEEE 802.11ad/ay的毫米波系统中,在一个扫描射频谱中的频率错误相对较小,而在不同的扫描射频谱中的频率错误相对较大。因此,标准的稀畴意识算法,忽略频率错误,在扫描射频谱中不能准确估计通道。我们提出一种新的算法called partially coherent matching pursuit,它采用迭代最小化来同时估计信号和频率错误。
  • results: 我们通过数值分析表明,我们的算法可以在较低的复杂性下高精度地估计通道。
    Abstract This paper develops a channel estimation technique for millimeter wave (mmWave) communication systems. Our method exploits the sparse structure in mmWave channels for low training overhead and accounts for the phase errors in the channel measurements due to phase noise at the oscillator. Specifically, in IEEE 802.11ad/ay-based mmWave systems, the phase errors within a beam refinement protocol packet are almost the same, while the errors across different packets are substantially different. Consequently, standard sparsity-aware algorithms, which ignore phase errors, fail when channel measurements are acquired over multiple beam refinement protocol packets. We present a novel algorithm called partially coherent matching pursuit for sparse channel estimation under practical phase noise perturbations. Our method iteratively detects the support of sparse signal and employs alternating minimization to jointly estimate the signal and the phase errors. We numerically show that our algorithm can reconstruct the channel accurately at a lower complexity than the benchmarks.
    摘要 The proposed algorithm, called partially coherent matching pursuit, iteratively detects the support of the sparse signal and employs alternating minimization to jointly estimate the signal and the phase errors. The algorithm is designed to handle practical phase noise perturbations and can accurately reconstruct the channel at a lower complexity than the benchmarks.Here is the text in Simplified Chinese:这篇论文提出了一种渠道估计技术 для毫米波通信系统,该技术利用毫米波通道的稀疏结构来减少训练负担,同时考虑振荡器内的频率噪声对渠道测量的影响。具体来说,在IEEE 802.11ad/ay基于毫米波系统中,内部的振荡器频率噪声对同一个扫描射频 packet 的频率错误是非常相似的,而不同包的频率错误则是非常不同的。因此,标准的稀疏意识算法,忽略了频率错误,在多个扫描射频 packet 上不能正确地估计渠道。我们提出了一种新的算法,called partially coherent matching pursuit,该算法 iteratively 检测稀疏信号的支持,并使用交互式最小化来联合估计信号和频率错误。我们numerically 表明,我们的算法可以在较低的复杂性下准确地重建渠道。

Quality of Service-Constrained Online Routing in High Throughput Satellites

  • paper_url: http://arxiv.org/abs/2310.07557
  • repo_url: None
  • paper_authors: Olivier Bélanger, Olfa Ben Yahia, Stéphane Martel, Antoine Lesage-Landry, Gunes Karabulut Kurt
    for: 这篇论文旨在解决高通信卫星(HTS)内部网络的优化问题,以实现高速数据传输和保持质量服务(QoS)标准。methods: 该论文提出了一种在线优化流量分配和计划方法,基于多品质流模型(MPC)的预测控制技术,以适应HTS数据流的变化和不确定性。results: 对于 numerical simulations,该方法与预先知道的优质方法相比,能够实现几乎与优质方法相当的性能,证明了其有效性和适应性。
    Abstract High Throughput Satellites (HTSs) outpace traditional satellites due to their multi-beam transmission. The rise of low Earth orbit mega constellations amplifies HTS data rate demands to terabits/second with acceptable latency. This surge in data rate necessitates multiple modems, often exceeding single device capabilities. Consequently, satellites employ several processors, forming a complex packet-switch network. This can lead to potential internal congestion and challenges in adhering to strict quality of service (QoS) constraints. While significant research exists on constellation-level routing, a literature gap remains on the internal routing within a singular HTS. The intricacy of this internal network architecture presents a significant challenge to achieve high data rates. This paper introduces an online optimal flow allocation and scheduling method for HTSs. The problem is treated as a multi-commodity flow instance with different priority data streams. An initial full time horizon model is proposed as a benchmark. We apply a model predictive control (MPC) approach to enable adaptive routing based on current information and the forecast within the prediction time horizon while allowing for deviation of the latter. Importantly, MPC is inherently suited to handle uncertainty in incoming flows. Our approach minimizes packet loss by optimally and adaptively managing the priority queue schedulers and flow exchanges between satellite processing modules. Central to our method is a routing model focusing on optimal priority scheduling to enhance data rates and maintain QoS. The model's stages are critically evaluated, and results are compared to traditional methods via numerical simulations. Through simulations, our method demonstrates performance nearly on par with the hindsight optimum, showcasing its efficiency and adaptability in addressing satellite communication challenges.
    摘要 高通过put Satellites (HTSs) 的传输速率比传统卫星更快,因为它们使用多个扫描。随着低地球轨道巨型卫星的出现,HTS 数据率需求增加到 Terra bits/秒,同时保持可接受的延迟。这种增加的数据率使得多个模式、常常超过单个设备的能力。因此,卫星通常使用多个处理器,形成复杂的包队列网络。这可能会导致内部塞突和遵循严格的服务质量(QoS)约束的挑战。虽然关于卫星群 constellation 级别的路由研究已经有很多,但是关于单个 HTS 内部路由的研究还存在一个知识空白。卫星内部网络的复杂性提出了高达数据率的挑战。本文介绍了一种在 HTS 中线上优化流量分配和调度方法。该问题被视为多商品流 instances 中的一个,其中有不同优先级数据流。我们提出了一个全时间轴模型作为参考。我们采用了预测控制(MPC)方法,以适应现有信息和预测时间 horizon 内的变化。MPC 自然地能够处理入流的不确定性。我们的方法可以最小化包产生损失,通过优化优先级调度和卫星处理模块之间的流体换来实现高达数据率和维护 QoS。中心于我们的方法的是一种专注于优化优先级调度,以提高数据率和维护 QoS。模型的阶段得到了严格的评估,并与传统方法进行比较。通过 simulations,我们的方法可以与后看 optimum 的性能相似,这显示了它的效率和适应性。

Proactive Monitoring via Jamming in Fluid Antenna Systems

  • paper_url: http://arxiv.org/abs/2310.07550
  • repo_url: None
  • paper_authors: Junteng Yao, Tuo Wu, Xiazhi Lai, Ming Jin, Cunhua Pan, Maged Elkashlan, Kai-Kit Wong
  • for: 本研究探讨了使用流体天线系统(FAS)在合法监测器上进行异常通信管理,以提高监测性能。
  • methods: 研究人员使用监测器调整天线位置,以降低停机概率,并提出了一种基于差分搜索法的优化方法,以提高平均监测率。
  • results: 实验结果表明,提出的方案可以较 Convention benchmark 高效地提高监测性能。
    Abstract This paper investigates the efficacy of utilizing fluid antenna system (FAS) at a legitimate monitor to oversee suspicious communication. The monitor switches the antenna position to minimize its outage probability for enhancing the monitoring performance. Our objective is to maximize the average monitoring rate, whose expression involves the integral of the first-order Marcum $Q$ function. The optimization problem, as initially posed, is non-convex owing to its objective function. Nevertheless, upon substituting with an upper bound, we provide a theoretical foundation confirming the existence of a unique optimal solution for the modified problem, achievable efficiently by the bisection search method. Furthermore, we also introduce a locally closed-form optimal resolution for maximizing the average monitoring rate. Empirical evaluations confirm that the proposed schemes outperform conventional benchmarks considerably.
    摘要 Simplified Chinese:这篇论文研究了使用流体天线系统(FAS)在合法监控器上监测异常通信的效果。监控器通过调整天线位置来最小化监测损失的概率,以提高监测性能。我们的目标是最大化平均监测率,其表达式包括首频 Marcum $Q$ 函数的积分。然而,原始优化问题是非凸的,但我们通过substituted an upper bound提供了一个理论基础,证明了优化问题的唯一优解存在。此外,我们还引入了一个本地关闭形式的优化解决方案,以最大化平均监测率。实验证明,我们的提议方案在与传统标准相比较显著地出performances。

Symbol-Level Precoding for Average SER Minimization in Multiuser MISO Systems

  • paper_url: http://arxiv.org/abs/2310.07436
  • repo_url: None
  • paper_authors: Yafei Wang, Hongwei Hou, Wenjin Wang, Xinping Yi
  • for: investigate symbol-level precoding (SLP) for high-order quadrature amplitude modulation (QAM) to minimize average symbol error rate (SER) and improve full signal-to-noise ratio (SNR) ranges.
  • methods: construct SER expression, formulate problem of average SER minimization subject to total transmit power constraint, and propose double-space alternating optimization (DSAO) algorithm to optimize transmitted signal and rescaling factor on orthogonal Stiefel manifold and Euclidean spaces, respectively.
  • results: propose a block transmission scheme to keep rescaling factor constant within a block, and demonstrate significant performance advantage over existing state-of-the-art SLP schemes through simulation results.
    Abstract This paper investigates symbol-level precoding (SLP) for high-order quadrature amplitude modulation (QAM) aimed at minimizing the average symbol error rate (SER), leveraging both constructive interference (CI) and noise power to gain superiority in full signal-to-noise ratio (SNR) ranges. We first construct the SER expression with respect to the transmitted signal and the rescaling factor, based on which the problem of average SER minimization subject to total transmit power constraint is further formulated. Given the non-convex nature of the objective, solving the above problem becomes challenging. Due to the differences in constraints between the transmit signal and the rescaling factor, we propose the double-space alternating optimization (DSAO) algorithm to optimize the two variables on orthogonal Stiefel manifold and Euclidean spaces, respectively. To facilitate QAM demodulation instead of affording impractical signaling overhead, we further develop a block transmission scheme to keep the rescaling factor constant within a block. Simulation results demonstrate that the proposed SLP scheme exhibits a significant performance advantage over existing state-of-the-art SLP schemes.
    摘要 Here is the translation in Simplified Chinese:这篇论文研究了使用高阶 quadrature amplitude modulation (QAM) 的 symbol-level precoding (SLP),以最小化平均符号错误率 (SER),利用构建性干扰 (CI) 和噪声功率。我们首先 derive SER 表达式,基于这些表达式,我们进一步形式化了在全信号响应率 (SNR) 范围内的平均 SER 最小化问题,并且采用了 double-space alternating optimization (DSAO) 算法来优化两个变量。为了使 QAM 模测可行,我们采用了块传输方案,以保持块内的扩大因子不变。Results show that the proposed SLP scheme outperforms existing state-of-the-art SLP schemes.

IRS Assisted Federated Learning A Broadband Over-the-Air Aggregation Approach

  • paper_url: http://arxiv.org/abs/2310.07405
  • repo_url: None
  • paper_authors: Deyou Zhang, Ming Xiao, Zhibo Pang, Lihui Wang, H. Vincent Poor
  • for: 这个研究旨在提高无线联合学习(FL)系统中的广域无线计算能力,并使用智能反射表面(IRS)来抵消无线折射和噪声。
  • methods: 该研究提出了一种基于节点选择的模型集成方法,并使用矩阵提升技术和差异度计算程序来解决问题。
  • results: 研究人员通过对MNIST数据集进行 simulate,并分析了两种基于节点选择和重量选择的模型集成方法的性能。结果显示, weight-selection 方法可以提高学习性能,而节点选择方法的性能与选择的边缘节点数量有关。
    Abstract We consider a broadband over-the-air computation empowered model aggregation approach for wireless federated learning (FL) systems and propose to leverage an intelligent reflecting surface (IRS) to combat wireless fading and noise. We first investigate the conventional node-selection based framework, where a few edge nodes are dropped in model aggregation to control the aggregation error. We analyze the performance of this node-selection based framework and derive an upper bound on its performance loss, which is shown to be related to the selected edge nodes. Then, we seek to minimize the mean-squared error (MSE) between the desired global gradient parameters and the actually received ones by optimizing the selected edge nodes, their transmit equalization coefficients, the IRS phase shifts, and the receive factors of the cloud server. By resorting to the matrix lifting technique and difference-of-convex programming, we successfully transform the formulated optimization problem into a convex one and solve it using off-the-shelf solvers. To improve learning performance, we further propose a weight-selection based FL framework. In such a framework, we assign each edge node a proper weight coefficient in model aggregation instead of discarding any of them to reduce the aggregation error, i.e., amplitude alignment of the received local gradient parameters from different edge nodes is not required. We also analyze the performance of this weight-selection based framework and derive an upper bound on its performance loss, followed by minimizing the MSE via optimizing the weight coefficients of the edge nodes, their transmit equalization coefficients, the IRS phase shifts, and the receive factors of the cloud server. Furthermore, we use the MNIST dataset for simulations to evaluate the performance of both node-selection and weight-selection based FL frameworks.
    摘要 我们考虑了一种宽带无线通信 empowered 模型聚合方法 для无线联合学习(FL)系统,并提议利用智能反射表面(IRS)来抗衰减和噪声。我们首先investigate了传统的节点选择基础框架,其中一些边节点被排除在模型聚合中来控制聚合错误。我们分析了这种节点选择基础框架的性能,并 derivated一个上限 bound的性能损失,该损失与选择的边节点相关。然后,我们寻求通过最小化模型聚合中的均方差(MSE)来实现 global 梯度参数与实际接收的梯度参数之间的均方差最小化。为此,我们采用矩阵提升技术和差分 convex 编程,将问题转化为一个convex问题,并使用存储库中的解决方案。为了提高学习性能,我们进一步提议一种weight-selection based FL框架。在这种框架中,我们为每个边节点分配一个适当的weight coefficient,以便在模型聚合中减少聚合错误,即不需要对接收到的本地梯度参数进行幂等匹配。我们还分析了weight-selection based FL框架的性能,并 derivated一个上限 bound的性能损失,然后通过最小化MSE来实现模型聚合中的均方差最小化。此外,我们使用 MNIST 数据集进行实验来评估两种 FL 框架的性能。

Integrated Sensing and Communication enabled Doppler Frequency Shift Estimation and Compensation

  • paper_url: http://arxiv.org/abs/2310.07401
  • repo_url: None
  • paper_authors: Jinzhu Jia, Zhiqing Wei, Ruiyun Zhang, Lin Wang
  • for: 这篇论文是为了解决高速车辆网络中 millimeter wave 技术导致的严重 Doppler Frequency Shift (DFS) 问题,以提高通信性能。
  • methods: 本论文提出了一个 Integrated Sensing and Communication (ISAC) 实现 DFS 估计和补偿算法,包括对 DFS 进行粗略估计和补偿、使用设计的 preamble 序列进行精确估计和补偿,以及适应式 DFS 估计器以减少计算复杂度。
  • results: 比较traditional DFS 估计算法,提案的算法在 bit error rate 和平均方差Error 性能上显示出改善。
    Abstract Despite the millimeter wave technology fulfills the low-latency and high data transmission, it will cause severe Doppler Frequency Shift (DFS) for high-speed vehicular network, which tremendously damages the communication performance. In this paper, we propose an Integrated Sensing and Communication (ISAC) enabled DFS estimation and compensation algorithm. Firstly, the DFS is coarsely estimated and compensated using radar detection. Then, the designed preamble sequence is used to accurately estimate and compensate DFS. In addition, an adaptive DFS estimator is designed to reduce the computational complexity. Compared with the traditional DFS estimation algorithm, the improvement of the proposed algorithm is verified in bit error rate and mean square error performance by simulation results.
    摘要 尽管毫米波技术实现了低延迟和高数据传输,但它会导致高速交通网络中严重的多普勒频率偏移(DFS),从而极大地损害通信性能。在这篇论文中,我们提出了一种结合探测和通信(ISAC)能力的DFS估算和补偿算法。首先,通过雷达探测来粗略地估算并补偿DFS。然后,采用设计的首部序列来精度地估算和补偿DFS。此外,我们还设计了一种适应式DFS估算器,以降低计算复杂性。与传统的DFS估算算法相比,我们的提案的改进被证明通过实验结果的比特错误率和均方差性能。

Extremal Mechanisms for Pointwise Maximal Leakage

  • paper_url: http://arxiv.org/abs/2310.07381
  • repo_url: None
  • paper_authors: Leonhard Grosse, Sara Saeidian, Tobias Oechtering
  • for: 本文研究了在不可靠第三方party发布数据时,如何保护数据隐私。
  • methods: 本文使用了随机性加载到数据点,以降低数据的有用性。
  • results: 研究发现,随机response机制可以实现本地均衡隐私,并且可以通过 convex 分析获得一些关键的关键解。
    Abstract Data publishing under privacy constraints can be achieved with mechanisms that add randomness to data points when released to an untrusted party, thereby decreasing the data's utility. In this paper, we analyze this privacy-utility tradeoff for the pointwise maximal leakage privacy measure and a general class of convex utility functions. Pointwise maximal leakage (PML) was recently proposed as an operationally meaningful privacy measure based on two equivalent threat models: An adversary guessing a randomized function and an adversary aiming to maximize a general gain function. We study the behavior of the randomized response mechanism designed for local differential privacy under different prior distributions of the private data. Motivated by the findings of this analysis, we derive several closed-form solutions for the optimal privacy-utility tradeoff in the presented PML context using tools from convex analysis. Finally, we present a linear program that can compute optimal mechanisms for PML in a general setting.
    摘要 <>传输数据以遵循隐私限制可以通过添加随机性到数据点来实现,从而降低数据的使用价值。在这篇论文中,我们分析了这种隐私-使用价值贸易的关系,使用点最大泄露隐私度量(PML)和一类凸Utility函数。PML是最近提出的一种操作可能性的隐私度量,基于两种等价威胁模型:敌方猜测一个随机函数,以及敌方尝试最大化一个通用获得函数。我们研究了随机响应机制在本地差分隐私下的不同先前分布的私人数据的行为。受此分析的结果启发,我们 deriv了一些关闭形式的解决方案,用于PML上的优化隐私-使用价值贸易。最后,我们提出了一个可以计算PML上优化机制的线性程序。Note: "随机函数" in the original text is translated as "随机性" in Simplified Chinese, which is a more common term used in the field.

A Unified Algorithmic Framework for Dynamic Compressive Sensing

  • paper_url: http://arxiv.org/abs/2310.07202
  • repo_url: None
  • paper_authors: Xiaozhi Liu, Yong Xia
  • for: 这 paper 是为了重建信号序列中的内在结构化动态稀缺性而设计的一种统一动态跟踪算法框架 (PLAY-CS)。
  • methods: 该框架利用了特定的统计假设,用于描述信号序列的动态滤波器,并通过新提出的partial-Laplacian filtering简度模型,更好地捕捉到信号序列的更加复杂的动态稀缺性。
  • results: 在实际应用场景,如无线通信中的动态通道跟踪,该框架比既有的DCS算法表现出更高的性能。
    Abstract We propose a unified dynamic tracking algorithmic framework (PLAY-CS) to reconstruct signal sequences with their intrinsic structured dynamic sparsity. By capitalizing on specific statistical assumptions concerning the dynamic filter of the signal sequences, the proposed framework exhibits versatility by encompassing various existing dynamic compressive sensing (DCS) algorithms. This is achieved through the incorporation of a newly proposed Partial-Laplacian filtering sparsity model, tailored to capture a more sophisticated dynamic sparsity. In practical scenarios such as dynamic channel tracking in wireless communications, the framework demonstrates enhanced performance compared to existing DCS algorithms.
    摘要 我们提出一个统一的动态跟踪算法框架(PLAY-CS),用于重建信号序列的内在结构化动态稀烈性。通过利用信号序列动态滤波器的特定统计假设,我们的框架能够展示多样性,包括不同的动态压缩感知(DCS)算法。这是通过 newly proposed partial-laplace 滤波稀烈性模型来实现的,这种模型用于捕捉更复杂的动态稀烈性。在无线通信中的实际应用场景中,我们的框架可以比既有的 DCS 算法表现更好。Note: The translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you prefer Traditional Chinese, please let me know and I can provide the translation in that format as well.

Input-Output Relation and Low-Complexity Receiver Design for CP-OTFS Systems with Doppler Squint

  • paper_url: http://arxiv.org/abs/2310.07200
  • repo_url: None
  • paper_authors: Xuehan Wang, Xu Shi, Jintao Wang, Jian Song
  • for: 这篇论文旨在探讨在 orthogonal time frequency space (OTFS) 系统中,频率相关的 Doppler 效应(DSE)的影响,并研究一种基于 CP-OFDM 的 OTFS 系统。
  • methods: 本论文使用了实用的 OFDM 系统,并采用了循环前预处理(CP)来减少 DSE 的影响。在输入输出关系中,考虑了 DSE,并将通道估计分成了三部分:延迟探测、Doppler提取和增强估计。采用了线性平衡方案,利用通道矩阵的块对称性,实现了低复杂度接收器的设计。
  • results: simulations 表明,DSE 对 OTFS 系统的性能有重要影响,而提posed 的低复杂度接收器设计可以考虑 DSE,并有显著的性能提升。
    Abstract In orthogonal time frequency space (OTFS) systems, the impact of frequency-dependent Doppler which is referred to as the Doppler squint effect (DSE) is accumulated through longer duration, whose negligence has prevented OTFS systems from exploiting the performance superiority. In this paper, practical OFDM system using cyclic prefix time guard interval (CP-OFDM)-based OTFS systems with DSE are adopted. Cyclic prefix (CP) length is analyzed while the input-output relation considering DSE is derived. By deploying two prefix OFDM symbols, the channel estimation can be easily divided into three parts as delay detection, Doppler extraction and gain estimation. The linear equalization scheme is adopted taking the block diagonal property of the channel matrix into account, which completes the low-complexity receiver design. Simulation results confirm the significance of DSE and the considerable performance of the proposed low-complexity receiver scheme considering DSE.
    摘要 在正交时频空间(OTFS)系统中,频率相关的多普勒效应(DSE)的影响会随着时间的推移而受拥,这种忽略会导致OTFS系统无法实现性能优势。本文提出了基于CP-OFDM的OTFS系统,其中CP长度进行分析,并 derivation of the input-output relation considering DSE。通过分配两个prefix OFDM符号,渠道估计可以被简单地分解为三部分:延迟探测、多普勒提取和增强估计。采用了线性平衡方案,利用通道矩阵的块对称性,实现了低复杂度接收器设计。实验结果证明了DSE的重要性以及提议的低复杂度接收器计划的出色表现。

Integrated Sensing and Communication enabled Multiple Base Stations Cooperative Sensing Towards 6G

  • paper_url: http://arxiv.org/abs/2310.07180
  • repo_url: None
  • paper_authors: Zhiqing Wei, Wangjun Jiang, Zhiyong Feng, Huici Wu, Ning Zhang, Kaifeng Han, Ruizhong Xu, Ping Zhang
  • for: sixth-generation (6G) mobile communication systems, such as smart city and autonomous driving
  • methods: multi-BS cooperative sensing, unified ISAC performance metrics, ISAC signal design and optimization, interference management, cooperative sensing algorithms
  • results: breaking the limitation of single-BS sensing, establishing intelligent infrastructures connecting physical and cyber space, ushering the era of 6G
    Abstract Driven by the intelligent applications of sixth-generation (6G) mobile communication systems such as smart city and autonomous driving, which connect the physical and cyber space, the integrated sensing and communication (ISAC) brings a revolutionary change to the base stations (BSs) of 6G by integrating radar sensing and communication in the same hardware and wireless resource. However, with the requirements of long-range and accurate sensing in the applications of smart city and autonomous driving, the ISAC enabled single BS still has a limitation in the sensing range and accuracy. With the networked infrastructures of mobile communication systems, multi-BS cooperative sensing is a natural choice satisfying the requirement of long-range and accurate sensing. In this article, the framework of multi-BS cooperative sensing is proposed, breaking through the limitation of single-BS sensing. The enabling technologies, including unified ISAC performance metrics, ISAC signal design and optimization, interference management, cooperative sensing algorithms, are introduced in details. The performance evaluation results are provided to verify the effectiveness of multi-BS cooperative sensing schemes. With ISAC enabled multi-BS cooperative sensing (ISAC-MCS), the intelligent infrastructures connecting physical and cyber space can be established, ushering the era of 6G promoting the intelligence of everything.
    摘要 驱动了六代移动通信系统(6G)的智能应用,如智能城市和自动驾驶,这些应用连接了物理空间和虚拟空间,因此集成感知和通信(ISAC)在6G基站(BS)中带来了革命性的变革。然而,在智能城市和自动驾驶应用中需要覆盖较长范围和精度高的感知,ISAC启用的单BS仍有限制的感知范围和精度。基于移动通信系统的网络基础设施,多BS合作感知是一个自然的选择,满足覆盖较长范围和精度高的感知需求。本文提出了多BS合作感知框架,突破单BS感知的限制。本文还介绍了实现多BS合作感知的关键技术,包括统一ISAC性能指标、ISAC信号设计优化、干扰管理和合作感知算法。本文还提供了性能评估结果,证明了多BS合作感知方案的有效性。通过ISAC启用的多BS合作感知(ISAC-MCS),智能基础设施可以建立,推动6G时代,智能化 Everything。

Time and Frequency Offset Estimation and Intercarrier Interference Cancellation for AFDM Systems

  • paper_url: http://arxiv.org/abs/2310.07141
  • repo_url: None
  • paper_authors: Yuankun Tang, Anjie Zhang, Miaowen Wen, Yu Huang, Fei Ji, Jinming Wen
  • for: 这篇论文旨在提供一个可靠的通信方案,供时间变化频道上的 AFDM 系统。
  • methods: 这篇论文提出了两种最大 LIKELIHOOD(ML)估计器:一个 JOINT ML 估计器,通过比较样本之间的相似性来评估到来访时间和频率偏移;另一个是 Stepwise ML 估计器,可以降低复杂度。这两种估计器都利用 AFDM symbol 中内含的双极几何资讯,无需额外的导频。
  • results: numerically 的结果显示,提出的时间和频率偏移估计准确性和 mirror-mapping 基本帧调制可以实现 AFDM 系统中的可靠通信。
    Abstract Affine frequency division multiplexing (AFDM) is an emerging multicarrier waveform that offers a potential solution for achieving reliable communication for time-varying channels. This paper proposes two maximum likelihood (ML) estimators of symbol time offset and carrier frequency offset for AFDM systems. The joint ML estimator evaluates the arrival time and frequency offset by comparing the correlations of samples. Moreover, we propose the stepwise ML estimator to reduce the complexity. The proposed estimators exploit the redundant information contained within the chirp-periodic prefix inherent in AFDM symbols, thus dispensing with any additional pilots. To further mitigate the intercarrier interference resulting from the residual frequency offset, we design a mirror-mappingbased scheme for AFDM systems. Numerical results verify the effectiveness of the proposed time and frequency offset estimation criteria and the mirror-mapping-based modulation for AFDM systems.
    摘要 《 Affine 频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通频率分多普通

Edge Cloud Collaborative Stream Computing for Real-Time Structural Health Monitoring

  • paper_url: http://arxiv.org/abs/2310.07130
  • repo_url: None
  • paper_authors: Wenzhao Zhang, Cheng Guo, Yi Gao, Wei Dong
  • for: 这个论文是为了解决结构健康监控(SHM)中的资料过量和实时需求问题。
  • methods: 本论文提出了一个Edge Cloud合作精细流运算框架(ECStream),用于解决SHM中的问题。ECStream考虑了原子和composite operators的联合运算,并将其形式化为一个可读性和可缩小的流运算问题。
  • results: 根据先前的评估结果,ECStream可以有效地对带宽和终端 оператор处理延迟进行平衡,将带宽使用率降低到73.01%,并将终端operator computation延迟降低到34.08%的水平。
    Abstract Structural Health Monitoring (SHM) is crucial for the safety and maintenance of various infrastructures. Due to the large amount of data generated by numerous sensors and the high real-time requirements of many applications, SHM poses significant challenges. Although the cloud-centric stream computing paradigm opens new opportunities for real-time data processing, it consumes too much network bandwidth. In this paper, we propose ECStream, an Edge Cloud collaborative fine-grained stream operator scheduling framework for SHM. We collectively consider atomic and composite operators together with their iterative computability to model and formalize the problem of minimizing bandwidth usage and end-to-end operator processing latency. Preliminary evaluation results show that ECStream can effectively balance bandwidth usage and end-to-end operator computation latency, reducing bandwidth usage by 73.01% and latency by 34.08% on average compared to the cloud-centric approach.
    摘要 STRUCTURAL HEALTH MONITORING (SHM) 是重要的安全和维护基础设施的关键。由于众多感知器件生成的大量数据以及许多应用程序的高实时要求,SHM带来了重大挑战。虽然云计算中心主义思想开启了新的实时数据处理机会,但它占用了过多的网络带宽。在本文中,我们提出了ECStream,一个边缘云集成细致流操作调度框架,用于解决SHM中的带宽使用和终端操作计算延迟问题。我们一起考虑原子和复合运算者的迭代可计算性,以模型和正式化带宽使用和终端操作计算延迟的最佳化问题。初步评估结果表明,ECStream可以有效均衡带宽使用和终端操作计算延迟,减少带宽使用率73.01%,减少平均计算延迟34.08%。

Decentralization of Energy Systems with Blockchain: Bridging Top-down and Bottom-up Management of the Electricity Grid

  • paper_url: http://arxiv.org/abs/2310.07103
  • repo_url: None
  • paper_authors: Sakshi Mishra, Roohallah Khatami, Yu Christine Chen
  • for: 这篇论文旨在提出一种可行的方案,以满足递进的分布式能源资源(DERs)普及和卷积式计算和通信技术的发展,从而对现有的中央化操作模式进行重大变革。
  • methods: 本文使用了区块链技术,以便在分布式能源系统中实现端到端交易,并且提出了一种混合式操作模式,包括传统中央化操作模式和分布式操作模式。
  • results: 本文认为,通过使用区块链技术,可以实现分布式能源系统中的端到端交易,并且可以帮助实现现有的中央化操作模式和分布式操作模式之间的协同操作。
    Abstract For more than a century, the grid has operated in a centralized top-down fashion. However, as distributed energy resources (DERs) penetration grows, the grid edge is increasingly infused with intelligent computing and communication capabilities. Thus, the bottom-up approach to grid operations inclined toward decentralizing energy systems will likely gain momentum alongside the existing centralized paradigm. Decentralization refers to transferring control and decision-making from a centralized entity (individual, organization, or group thereof) to a distributed network. It is not a new concept - in energy systems context or otherwise. In the energy systems context, however, the complexity of this multifaceted concept increases manifolds due to two major reasons - i) the nature of the commodity being traded (the electricity) and ii) the enormity of the traditional electricity sector's structure that builds, operates, and maintains this capital-intensive network. In this work, we aim to highlight the need for and outline a credible path toward restructuring the current operational architecture of the electricity grid in view of the ongoing decentralization trends with an emphasis on peer-to-peer energy trading. We further introduce blockchain technology in the context of decentralized energy systems problems. We also suggest that blockchain is an effective technology for facilitating the synergistic operations of top-down and bottom-up approaches to grid management.
    摘要 Translation notes:* "grid" is translated as "电网" (dian wang)* "centralized" is translated as "中央化" (zhong yang hua)* "decentralized" is translated as "分散化" (fen shi hua)* "bottom-up" is translated as "底层化" (di yan hua)* "top-down" is translated as "顶层化" (ding yan hua)* "peer-to-peer" is translated as "对等" (dui yi)* "blockchain" is translated as "区块链" (qu yu lian)

Hybrid Arrays: How Many RF Chains Are Required to Prevent Beam Squint?

  • paper_url: http://arxiv.org/abs/2310.07101
  • repo_url: None
  • paper_authors: Heedong Do, Namyoon Lee, Robert W. Heath Jr, Angel Lozano
  • for: 解决 beamforming 中的 beam squint 问题
  • methods: 使用 hybrid arrays,不需要 downconversion at each element
  • results: hybrid arrays 可以达到 digital arrays 的性能水平,但需要 exceeds a certain threshold 的数量Here’s the breakdown of each point:1. for: The paper is written to solve the problem of beam squint in beamforming.2. methods: The paper proposes the use of hybrid arrays, which do not require downconversion at each element, to achieve the same performance as digital arrays.3. results: The paper shows that hybrid arrays can achieve the same performance as digital arrays, but with a lower threshold of elements. The result is robust and holds for suboptimum but highly appealing beamspace architectures.
    Abstract With increasing frequencies, bandwidths, and array apertures, the phenomenon of beam squint arises as a serious impairment to beamforming. Fully digital arrays with true time delay per antenna element are a potential solution, but they require downconversion at each element. This paper shows that hybrid arrays can perform essentially as well as digital arrays once the number of radio-frequency chains exceeds a certain threshold that is far below the number of elements. The result is robust, holding also for suboptimum but highly appealing beamspace architectures.
    摘要

cs.SD - 2023-10-10

Modeling of Speech-dependent Own Voice Transfer Characteristics for Hearables with In-ear Microphones

  • paper_url: http://arxiv.org/abs/2310.06554
  • repo_url: None
  • paper_authors: Mattes Ohlenbusch, Christian Rollwage, Simon Doclo
  • for: 这篇论文是为了研究听力器中的自己声音传递特性而写的。
  • methods: 该论文使用了语音认知技术来建立一个语音依赖的系统标定模型,以估计听力器中自己声音的传递特性。
  • results: 研究发现,使用提议的语音依赖模型可以更好地模拟听力器中的自己声音传递特性,并且比适应 filtering-based 模型更好地适应新的语音。此外,研究还发现,对于不同的说话者,使用 talked-averaged 模型可以更好地泛化到不同的说话者。
    Abstract Hearables often contain an in-ear microphone, which may be used to capture the own voice of its user. However, due to ear canal occlusion the in-ear microphone mostly records body-conducted speech, which suffers from band-limitation effects and is subject to amplification of low frequency content. These transfer characteristics are assumed to vary both based on speech content and between individual talkers. It is desirable to have an accurate model of the own voice transfer characteristics between hearable microphones. Such a model can be used, e.g., to simulate a large amount of in-ear recordings to train supervised learning-based algorithms aiming at compensating own voice transfer characteristics. In this paper we propose a speech-dependent system identification model based on phoneme recognition. Using recordings from a prototype hearable, the modeling accuracy is evaluated in terms of technical measures. We investigate robustness of transfer characteristic models to utterance or talker mismatch. Simulation results show that using the proposed speech-dependent model is preferable for simulating in-ear recordings compared to a speech-independent model. The proposed model is able to generalize better to new utterances than an adaptive filtering-based model. Additionally, we find that talker-averaged models generalize better to different talkers than individual models.
    摘要 听ables 经常包含在耳朵中的一个内耳麦克风,可以用来捕捉其用户的自己声音。然而,由于耳朵封闭,内耳麦克风主要记录的是身体传导的语音,这种语音受到频率限制的影响,同时也受到低频强调效果的增强。这些传输特性的变化受到语音内容和个体演说者的影响。因此,有一个准确的自己声音传输特性模型可以用于训练基于supervised learning的算法,以资acia减少自己声音传输特性的影响。在这篇论文中,我们提出了基于phoneme认识的语音依赖系统模型。使用一种原型听ables 的录音,我们评估了模型的准确性,并进行了技术性的评估。我们也研究了语音或演说者之间的模型的稳定性。结果表明,使用我们提出的语音依赖模型在模拟耳朵录音时比使用语音独立模型更好。此外,我们发现了 talker-averaged 模型在不同的演说者之间更好地泛化。

Topological data analysis of human vowels: Persistent homologies across representation spaces

  • paper_url: http://arxiv.org/abs/2310.06508
  • repo_url: None
  • paper_authors: Guillem Bonafos, Jean-Marc Freyermuth, Pierre Pudlo, Samuel Tronçon, Arnaud Rey
  • for: 这篇论文是用于研究数据分析方法的,具体来说是研究如何从各种数据表示空间中提取有用的特征,以便进行预测和分类。
  • methods: 这篇论文使用的方法包括 persistent homology 理论和 topologic 数据分析 (TDA) 技术,以及一些 Machine Learning 算法,如 random forest。
  • results: 这篇论文的结果表明,使用不同的数据表示空间可以提取到不同的特征,而这些特征之间存在一定的相互补做作用。此外,使用 topologic 数据分析可以提高预测和分类的准确率。
    Abstract Topological Data Analysis (TDA) has been successfully used for various tasks in signal/image processing, from visualization to supervised/unsupervised classification. Often, topological characteristics are obtained from persistent homology theory. The standard TDA pipeline starts from the raw signal data or a representation of it. Then, it consists in building a multiscale topological structure on the top of the data using a pre-specified filtration, and finally to compute the topological signature to be further exploited. The commonly used topological signature is a persistent diagram (or transformations of it). Current research discusses the consequences of the many ways to exploit topological signatures, much less often the choice of the filtration, but to the best of our knowledge, the choice of the representation of a signal has not been the subject of any study yet. This paper attempts to provide some answers on the latter problem. To this end, we collected real audio data and built a comparative study to assess the quality of the discriminant information of the topological signatures extracted from three different representation spaces. Each audio signal is represented as i) an embedding of observed data in a higher dimensional space using Taken's representation, ii) a spectrogram viewed as a surface in a 3D ambient space, iii) the set of spectrogram's zeroes. From vowel audio recordings, we use topological signature for three prediction problems: speaker gender, vowel type, and individual. We show that topologically-augmented random forest improves the Out-of-Bag Error (OOB) over solely based Mel-Frequency Cepstral Coefficients (MFCC) for the last two problems. Our results also suggest that the topological information extracted from different signal representations is complementary, and that spectrogram's zeros offers the best improvement for gender prediction.
    摘要 topological数据分析(TDA)已经成功地应用于各种信号/图像处理任务,从视觉化到指导/无指导分类。经常地, topological特征来自 persistente homology理论。TDA管道从原始信号数据或信号表示开始,然后在基于预先指定的筛选器上建立多级 topological结构,最后计算 topological签名以进一步利用。通常使用的 topological签名是持续 diagram(或其变形)。当前研究的问题是 exploit topological签名的多种方法,而不是筛选器的选择,而且尚未考虑信号表示的选择。这篇论文尝试提供一些答案,并通过对实际的音频数据进行比较性研究来评估不同表示空间中的 topological签名的质量。我们使用了三种不同的表示空间来表示每个音频信号:1. 使用 Takens 表示法将数据embedding到高维空间中。2. 视为三维 ambient空间中的表面,使用 spectrogram。3. spectrogram中的 zeros 集。对于女性语音录制,我们使用 topological签名进行三个预测问题:speaker gender、vowel type和个人。我们发现,使用 topologically-augmented random forest 可以在 Out-of-Bag Error(OOB)中提高 Mel-Frequency Cepstral Coefficients(MFCC)的性能。我们的结果还表明,不同的表示空间中的 topological信息是夹带的,而spectrogram中的 zeros 提供了最好的性能提升。

Cross-modal Cognitive Consensus guided Audio-Visual Segmentation

  • paper_url: http://arxiv.org/abs/2310.06259
  • repo_url: None
  • paper_authors: Zhaofeng Shi, Qingbo Wu, Hongliang Li, Fanman Meng, Linfeng Xu
    for:* 这篇论文的目的是提出一种 Audio-Visual Segmentation (AVS) 方法,用于从视频帧中提取听到的对象。methods:* 该方法使用 dense feature-level audio-visual interaction,忽略不同模式之间的维度差异。* 使用 Cross-modal Cognitive Consensus guided Network (C3N) align audio-visual semantics 从全Dimension 维度和地进行进一步的注意力机制。results:* 经验表明,该方法可以在 Single Sound Source Segmentation (S4) 和 Multiple Sound Source Segmentation (MS3) 任务上达到状态之最好性能。
    Abstract Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame, which is represented by a pixel-wise segmentation mask. The pioneering work conducts this task through dense feature-level audio-visual interaction, which ignores the dimension gap between different modalities. More specifically, the audio clip could only provide a \textit{Global} semantic label in each sequence, but the video frame covers multiple semantic objects across different \textit{Local} regions. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) to align the audio-visual semantics from the global dimension and progressively inject them into the local regions via an attention mechanism. Firstly, a Cross-modal Cognitive Consensus Inference Module (C3IM) is developed to extract a unified-modal label by integrating audio/visual classification confidence and similarities of modality-specific label embeddings. Then, we feed the unified-modal label back to the visual backbone as the explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the interested object. Extensive experiments on the Single Sound Source Segmentation (S4) setting and Multiple Sound Source Segmentation (MS3) setting of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
    摘要 音视频分割(AVS)目标是从视频帧中提取听到的对象,它通过像素级别的音视频交互来实现。在这种情况下,音频片断只能提供一个全局Semantic标签,而视频帧则包含多个不同地方的Semantic对象。在这篇论文中,我们提议一种协调音视频 semantics的网络(C3N),以将全局维度上的音视频 semantics 与本地区域相协调。首先,我们开发了一种协调音视频 Semantic Inference模块(C3IM),以抽取音视频分类信任度和模式特征之间的相似性。然后,我们将这个协调模式标签返回给视频底层,并通过一种协调注意力模块(CCAM)来高亮对应的本地特征。我们对AVSBench数据集的Single Sound Source Segmentation(S4)和Multiple Sound Source Segmentation(MS3)两个设置进行了广泛的实验,并证明了我们的方法的有效性,达到了当前最佳性能。