eess.IV - 2023-09-10

Spatial Perceptual Quality Aware Adaptive Volumetric Video Streaming

  • paper_url: http://arxiv.org/abs/2309.05026
  • repo_url: None
  • paper_authors: Xi Wang, Wei Liu, Huitong Liu, Peng Yang
  • for: This work investigates how the viewing distance introduced by six-degrees-of-freedom (6DoF) spatial navigation affects users' perceived quality, and proposes a visual acuity model, grounded in human visual resolution limitations, that satisfies spatial visual requirements during 6DoF exploration.
  • methods: A visual acuity model based on human visual resolution limitations describes the relationship between the virtual viewing distance and the tolerable boundary point cloud density; in addition, a QoE model is proposed to precisely represent users' perceived quality at different viewing distances.
  • results: Experimental results show that the proposed scheme improves the overall average QoE by up to 26% compared to existing baselines.
    Abstract Volumetric video offers a highly immersive viewing experience, but poses challenges in ensuring quality of experience (QoE) due to its high bandwidth requirements. In this paper, we explore the effect of viewing distance introduced by six degrees of freedom (6DoF) spatial navigation on user's perceived quality. By considering human visual resolution limitations, we propose a visual acuity model that describes the relationship between the virtual viewing distance and the tolerable boundary point cloud density. The proposed model satisfies spatial visual requirements during 6DoF exploration. Additionally, it dynamically adjusts quality levels to balance perceptual quality and bandwidth consumption. Furthermore, we present a QoE model to represent user's perceived quality at different viewing distances precisely. Extensive experimental results demonstrate that the proposed scheme can effectively improve the overall average QoE by up to 26% over real networks and user traces, compared to existing baselines.
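A minimal sketch of how a visual-acuity bound on point cloud density can be computed. This is an illustrative model, not necessarily the paper's: it assumes the eye resolves roughly one arcminute, so points spaced closer than the angle subtended at distance d are indistinguishable; all constants and function names are assumptions.

```python
import math

def tolerable_point_spacing(viewing_distance_m: float,
                            acuity_arcmin: float = 1.0) -> float:
    """Smallest point spacing (m) the eye can resolve at a given distance.

    Assumes a simple angular-resolution model: points separated by less
    than the angle `acuity_arcmin` subtends are visually indistinguishable.
    """
    theta = math.radians(acuity_arcmin / 60.0)  # arcmin -> radians
    return 2.0 * viewing_distance_m * math.tan(theta / 2.0)

def tolerable_boundary_density(viewing_distance_m: float) -> float:
    """Points per square meter beyond which extra density is imperceptible."""
    s = tolerable_point_spacing(viewing_distance_m)
    return 1.0 / (s * s)

# Farther viewpoints tolerate much sparser point clouds:
for d in (0.5, 1.0, 2.0, 4.0):
    print(f"d = {d} m -> {tolerable_boundary_density(d):,.0f} pts/m^2")
```

This monotone distance-to-density relation is what lets a streaming scheme lower the quality level as the viewer moves away without perceptible loss.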

eess.SP - 2023-09-10

Kinematics-Based Sensor Fault Detection for Autonomous Vehicles Using Real-Time Numerical Differentiation

  • paper_url: http://arxiv.org/abs/2309.05158
  • repo_url: None
  • paper_authors: Shashank Verma, Yousaf Rahman, E. Dogan Sumer, Dennis S. Bernstein
  • for: This work aims to detect and identify faulty sensors on vehicles.
  • methods: For ground vehicles confined to the horizontal plane, the approach computes six kinematics-based error metrics in real time from onboard compass, radar, accelerometer, and rate-gyro measurements and their derivatives, with real-time numerical differentiation performed by the adaptive input and state estimation (AIE/ASE) algorithm.
  • results: Numerical examples show that the method can accurately detect and identify faulty vehicle sensors.
    Abstract Sensor fault detection is of extreme importance for ensuring the safe operation of vehicles. This paper introduces a novel approach to detecting and identifying faulty sensors. For ground vehicles confined to the horizontal plane, this technique is based on six kinematics-based error metrics that are computed in real time by using onboard sensor data encompassing compass, radar, rate gyro, and accelerometer measurements as well as their derivatives. Real-time numerical differentiation is performed by applying the adaptive input and state estimation (AIE/ASE) algorithm. Numerical examples are provided to assess the efficacy of the proposed methodology.
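As an illustration of the kinematics-based idea, and not the paper's AIE/ASE algorithm, one simple error metric compares the accelerometer reading against a numerical derivative of the radar speed measurement; a persistent mismatch flags a fault in one of the two sensors. The differentiation scheme, smoothing window, and threshold below are assumed values.

```python
import numpy as np

def central_difference(x: np.ndarray, dt: float) -> np.ndarray:
    """Second-order central difference; one-sided at the ends."""
    dx = np.empty_like(x)
    dx[1:-1] = (x[2:] - x[:-2]) / (2.0 * dt)
    dx[0] = (x[1] - x[0]) / dt
    dx[-1] = (x[-1] - x[-2]) / dt
    return dx

def accel_vs_radar_residual(radar_speed, accel_along_track, dt, window=25):
    """Kinematic error metric: |d(speed)/dt - measured acceleration|, smoothed."""
    resid = np.abs(central_difference(radar_speed, dt) - accel_along_track)
    kernel = np.ones(window) / window
    return np.convolve(resid, kernel, mode="same")

# Simulated healthy sensors, then a stuck accelerometer after t = 5 s.
dt, t = 0.01, np.arange(0.0, 10.0, 0.01)
speed = 10.0 + 2.0 * np.sin(0.5 * t)
accel = np.gradient(speed, dt) + 0.05 * np.random.randn(t.size)
accel[t > 5.0] = 0.0                         # fault: accelerometer stuck at zero
metric = accel_vs_radar_residual(speed, accel, dt)
print("fault flagged:", bool(np.any(metric[t > 5.5] > 0.5)))
```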

Multi UAV-enabled Distributed Sensing: Cooperation Orchestration and Detection Protocol

  • paper_url: http://arxiv.org/abs/2309.05114
  • repo_url: None
  • paper_authors: Xavier Alejandro Flores Cabezas, Diana Pamela Moya Osorio, Markku Juntti
  • for: This paper proposes an unmanned aerial vehicle (UAV)-based distributed sensing framework for detecting the position of a ground target, with the UAVs operating in half-duplex mode.
  • methods: The framework detects ground targets using Orthogonal Frequency-Division Multiplexing (OFDM) waveforms and a spatial grid approach that divides a specific area into equal-size cells; the radar cross-section (RCS) of each cell is then jointly estimated by a network of dual-function UAVs using three maximum-likelihood estimation algorithms and digital beamforming.
  • results: Monte Carlo simulations show that the proposed framework improves detection accuracy and resolution over a single monostatic UAV benchmark, with better reliability, and incurs less overhead than a general compressive sensing (CS) approach.
    Abstract This paper proposes an unmanned aerial vehicle (UAV)-based distributed sensing framework that uses orthogonal frequency-division multiplexing (OFDM) waveforms to detect the position of a ground target, and UAVs operate in half-duplex mode. A spatial grid approach is proposed, where a specific area on the ground is divided into cells of equal size, then the radar cross-section (RCS) of each cell is jointly estimated by a network of dual-function UAVs. For this purpose, three estimation algorithms are proposed employing the maximum likelihood criterion, and digital beamforming is used for the local signal acquisition at the receiving UAVs. It is also considered that the coordination, fusion of sensing data, and central estimation are performed at a certain UAV acting as a fusion center (FC). Monte Carlo simulations are performed to obtain the absolute estimation error of the proposed framework. The results show an improved accuracy and resolution by the proposed framework, if compared to a single monostatic UAV benchmark, due to the distributed approach among the UAVs. It is also evidenced that a reduced overhead is obtained when compared to a general compressive sensing (CS) approach.
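A toy sketch of the spatial-grid idea (the paper's three ML estimators, digital beamforming, and fusion protocol are not reproduced): if each received snapshot is modeled as a linear combination of per-cell RCS values plus Gaussian noise, stacking the snapshots collected across the UAV network gives y = A·σ + n, and maximum likelihood reduces to least squares. The grid size and measurement matrix below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_meas = 16, 64              # 4x4 ground grid, snapshots from all UAVs

# True scene: a single reflective cell (the ground target).
sigma_true = np.zeros(n_cells)
sigma_true[5] = 1.0

# Assumed linear measurement model: rows are known per-cell responses
# (geometry- and waveform-dependent in the real system).
A = rng.standard_normal((n_meas, n_cells))
y = A @ sigma_true + 0.05 * rng.standard_normal(n_meas)

# ML estimate under Gaussian noise = least squares over the grid.
sigma_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print("detected cell:", int(np.argmax(np.abs(sigma_hat))))  # -> 5
```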

Strategic Deployment of Swarm of UAVs for Secure IoT Networks

  • paper_url: http://arxiv.org/abs/2309.05104
  • repo_url: None
  • paper_authors: Xavier Alejandro Flores Cabezas, Diana Pamela Moya Osorio
  • for: Improving the security of future wireless networks, in particular secure transmissions for devices in the Internet of Things (IoT).
  • methods: UAVs acting as aerial base stations provide secure connectivity between the network and ground nodes; the association of IoT nodes, the 3D positioning of the UAVs, and the UAV power allocation are obtained with game-theoretic and convex-optimization tools to improve the secrecy performance of the system.
  • results: The proposed framework achieves better and more efficient secrecy performance over an IoT network than state-of-the-art greedy algorithms.
    Abstract Security provisioning for low-complex and constrained devices in the Internet of Things (IoT) is exacerbating the concerns for the design of future wireless networks. To unveil the full potential of the sixth generation (6G), it is becoming even more evident that security measurements should be considered at all layers of the network. This work aims to contribute in this direction by investigating the employment of unmanned aerial vehicles (UAVs) for providing secure transmissions in ground IoT networks. Toward this purpose, it is considered that a set of UAVs acting as aerial base stations provide secure connectivity between the network and multiple ground nodes. Then, the association of IoT nodes, the 3D positioning of the UAVs and the power allocation of the UAVs are obtained by leveraging game theoretic and convex optimization-based tools with the goal of improving the secrecy of the system. It is shown that the proposed framework obtains better and more efficient secrecy performance over an IoT network than state-of-the-art greedy algorithms for positioning and association.
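For context, the secrecy performance such schemes optimize is typically the secrecy rate: the legitimate link's rate minus the eavesdropper's, floored at zero. A minimal sketch under an assumed distance-based path-loss model (the paper's game-theoretic association and 3D placement steps are not shown; all parameters are illustrative):

```python
import numpy as np

def rate_bps_hz(p_tx: float, dist: float, noise: float = 1e-9,
                path_loss_exp: float = 2.7) -> float:
    """Shannon rate of a link under simple distance-based path loss."""
    snr = p_tx * dist ** (-path_loss_exp) / noise
    return float(np.log2(1.0 + snr))

def secrecy_rate(p_tx: float, d_legit: float, d_eve: float) -> float:
    """Secrecy rate = [R_legitimate - R_eavesdropper]^+ ."""
    return max(0.0, rate_bps_hz(p_tx, d_legit) - rate_bps_hz(p_tx, d_eve))

# Placing the UAV closer to the IoT node than to the eavesdropper helps:
for d_legit in (40.0, 60.0, 80.0):
    print(d_legit, "m ->", round(secrecy_rate(0.1, d_legit, d_eve=100.0), 2), "b/s/Hz")
```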

Maximizing the performance for microcomb based microwave photonic transversal signal processors

  • paper_url: http://arxiv.org/abs/2309.07155
  • repo_url: None
  • paper_authors: Yang Sun, Jiayang Wu, Yang Li, Xingyuan Xu, Guanghui Ren, Mengxi Tan, Sai Tak Chu, Brent E. Little, Roberto Morandotti, Arnan Mitchell, David J. Moss
  • for: This paper aims to analyze and improve the accuracy of microcomb-based MWP transversal signal processors.
  • methods: The paper uses a detailed analysis of error sources, including imperfections in microcombs, chirp of electro-optic modulators, chromatic dispersion of the dispersive module, shaping errors of optical spectral shapers, and noise of the photodetector.
  • results: The paper shows that feedback control can be used to compensate for errors caused by experimental imperfections, resulting in significantly improved accuracy for microcomb-based MWP transversal signal processors.
    Abstract Microwave photonic (MWP) transversal signal processors offer a compelling solution for realizing versatile high-speed information processing by combining the advantages of reconfigurable electrical digital signal processing and high-bandwidth photonic processing. With the capability of generating a number of discrete wavelengths from micro-scale resonators, optical microcombs are powerful multi-wavelength sources for implementing MWP transversal signal processors with significantly reduced size, power consumption, and complexity. By using microcomb-based MWP transversal signal processors, a diverse range of signal processing functions have been demonstrated recently. In this paper, we provide a detailed analysis of the processing inaccuracy that is induced by the imperfect response of experimental components. First, we investigate the errors arising from different sources including imperfections in the microcombs, the chirp of electro-optic modulators, chromatic dispersion of the dispersive module, shaping errors of the optical spectral shapers, and noise of the photodetector. Next, we provide a global picture quantifying the impact of different error sources on the overall system performance. Finally, we introduce feedback control to compensate for the errors caused by experimental imperfections and achieve significantly improved accuracy. These results provide a guide for optimizing the accuracy of microcomb-based MWP transversal signal processors.
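An MWP transversal processor implements, in the optical domain, the discrete convolution y[n] = Σ_k w_k x[n−k], with each comb line providing one weighted, delayed tap. The sketch below shows the ideal behavior that the paper's error sources perturb, here with a tap-weight (shaping) error as one example; the tap count and weights are arbitrary assumptions.

```python
import numpy as np

def transversal_processor(x: np.ndarray, taps: np.ndarray) -> np.ndarray:
    """Ideal transversal filter: each comb line = one weighted, delayed tap."""
    return np.convolve(x, taps, mode="full")[: x.size]

# Example: 8-tap differentiator-like weights (illustrative only).
taps = np.array([1, -1, 0, 0, 0, 0, 0, 0], dtype=float)
t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * 5 * t)
y = transversal_processor(x, taps)

# A tap-weight (shaping) error perturbs the response, as analyzed in the paper:
y_err = transversal_processor(x, taps + 0.05 * np.random.randn(taps.size))
print("rms deviation from ideal:", np.sqrt(np.mean((y - y_err) ** 2)))
```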

High-Precision Channel Estimation for Sub-Noise Self-Interference Cancellation

  • paper_url: http://arxiv.org/abs/2309.05042
  • repo_url: None
  • paper_authors: Dongsheng Zheng, Lifeng Lin, Wenyao Li, Bingli Jiao
  • for: Achieving reliable full-duplex communication, in which self-interference cancellation plays a crucial role.
  • methods: A high-precision channel estimation method designed for sub-noise self-interference cancellation; exploiting the fact that all transmitted symbols are known to their receivers, it uses all transmitted symbols for self-interference channel estimation.
  • results: Analytical derivations and numerical simulations validate the effectiveness of the proposed method and demonstrate its superior performance in achieving sub-noise self-interference cancellation.
    Abstract Self-interference cancellation plays a crucial role in achieving reliable full-duplex communications. In general, it is essential to cancel the self-interference signal below the thermal noise level, which necessitates accurate reconstruction of the self-interference signal. In this paper, we propose a high-precision channel estimation method specifically designed for sub-noise self-interference cancellation. Exploiting the fact that all transmitted symbols are known to their respective receivers, our method utilizes all transmitted symbols for self-interference channel estimation. Through analytical derivations and numerical simulations, we validate the effectiveness of the proposed method. The results demonstrate the superior performance of our approach in achieving sub-noise self-interference cancellation.
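The key enabler is that the transceiver knows every symbol it transmitted, so the self-interference channel can be estimated over all of them rather than over pilots alone. A minimal least-squares sketch (the paper's estimator may differ; channel order and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
L, N = 4, 10_000                       # channel taps, known transmitted symbols
h_true = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)

x = (2 * rng.integers(0, 2, N) - 1).astype(complex)       # known BPSK symbols
noise = 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
y = np.convolve(x, h_true)[:N] + noise                    # received SI signal

# Least-squares estimate using ALL transmitted symbols, not just pilots.
X = np.column_stack([np.roll(x, k) for k in range(L)])    # convolution matrix
X[:L] = np.tril(X[:L])                                    # zero wrapped samples
h_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Subtract the reconstructed self-interference; the residual sits at the noise floor.
residual = y - X @ h_hat
print("residual power (dB):", 10 * np.log10(np.mean(np.abs(residual) ** 2)))
```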

Soft-connected Rigid Body Localization: State-of-the-Art and Research Directions for 6G

  • paper_url: http://arxiv.org/abs/2309.05002
  • repo_url: None
  • paper_authors: Niclas Führling, Hyeon Seok Rou, Giuseppe Thadeu Freitas de Abreu, David González G., Osvaldo Gonsa
  • for: This white paper outlines an in-depth study of the evolution of wireless localization (WL), from single-point target models toward wireless rigid body localization (W-RBL).
  • methods: A compact but comprehensive survey of mechanisms to evolve WL algorithms into W-RBL schemes, considering the type of information they use, their mathematical approach, and the features they build on or offer.
  • results: A discussion of mechanisms to extend W-RBL techniques to soft-connected rigid body localization (SCW-RBL) algorithms.
    Abstract This white paper describes a proposed article that will aim to provide a thorough study of the evolution of the typical paradigm of wireless localization (WL), which is based on a single point model of each target, towards wireless rigid body localization (W-RBL). We also look beyond the concept of RBL itself, whereby each target is modeled as an independent multi-point three-dimensional (3D) object, with shape enforced via a set of conformation constraints, as a step towards a more general approach we refer to as soft-connected RBL, whereby an ensemble of several objects embedded in a given environment is modeled as a set of soft-connected 3D objects, with rigid and soft conformation constraints enforced within each object and among them, respectively. A first intended contribution of the full version of this article is a compact but comprehensive survey on mechanisms to evolve WL algorithms in W-RBL schemes, considering their peculiarities in terms of the type of information, mathematical approach, and features they build on or offer. A subsequent contribution is a discussion of mechanisms to extend W-RBL techniques to soft-connected rigid body localization (SCW-RBL) algorithms.

AFDM vs OTFS: A Comparative Study of Promising Waveforms for ISAC in Doubly-Dispersive Channels

  • paper_url: http://arxiv.org/abs/2309.04998
  • repo_url: None
  • paper_authors: Hyeon Seok Rou, Giuseppe Thadeu Freitas de Abreu, Junil Choi, David González G., Osvaldo Gonsa, Yong Lian Guan, Marios Kountouris
  • for: The paper is written for a comprehensive comparative study of waveforms suitable for integrated sensing and communications (ISAC) systems in beyond fifth generation (B5G) and sixth generation (6G) wireless communication systems.
  • methods: The paper compares two waveform designs: (1) delay-Doppler domain-based orthogonal time frequency space (OTFS) waveforms, and (2) chirp domain-based affine frequency division multiplexing (AFDM) waveforms. Both waveforms are designed based on a full delay-Doppler representation of the time variant (TV) multipath channel.
  • results: The paper aims to provide a thorough study of the advantages, shortcomings, and implications of these waveform designs for ISAC systems in B5G/6G systems, including their performance in terms of communication and sensing functions.
    Abstract This white paper aims to briefly describe a proposed article that will provide a thorough comparative study of waveforms designed to exploit the features of doubly-dispersive channels arising in heterogeneous high-mobility scenarios as expected in the beyond fifth generation (B5G) and sixth generation (6G), in relation to their suitability to integrated sensing and communications (ISAC) systems. In particular, the full article will compare the well-established delay-Doppler domain-based orthogonal time frequency space (OTFS) and the recently proposed chirp domain-based affine frequency division multiplexing (AFDM) waveforms. Both these waveforms are designed based on a full delay-Doppler representation of the time variant (TV) multipath channel, yielding not only robustness and orthogonality of information symbols in high-mobility scenarios, but also a beneficial implication for environment target detection through the inherent capability of estimating the path delay and Doppler shifts, which are standard radar parameters. These modulation schemes are distinct candidates for ISAC in B5G/6G systems, such that a thorough study of their advantages, shortcomings, implications to signal processing, and performance of communication and sensing functions is well in order. In light of the above, a sample of the intended contribution (Special Issue paper) is provided below.
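For orientation, AFDM multiplexes symbols on orthogonal chirps via the inverse discrete affine Fourier transform (DAFT), s[n] = (1/√N) Σ_m x[m] exp(j2π(c1·n² + c2·m² + nm/N)); setting c1 = c2 = 0 recovers plain OFDM. A sketch of the modulate/demodulate round trip follows; the chirp parameters are illustrative and not tuned to a channel's delay-Doppler profile as the literature prescribes.

```python
import numpy as np

def daft_matrix(N: int, c1: float, c2: float) -> np.ndarray:
    """Inverse DAFT matrix: A[n, m] = exp(j2pi(c1 n^2 + c2 m^2 + nm/N)) / sqrt(N)."""
    n = np.arange(N)[:, None]
    m = np.arange(N)[None, :]
    return np.exp(2j * np.pi * (c1 * n**2 + c2 * m**2 + n * m / N)) / np.sqrt(N)

N = 64
c1, c2 = 1.0 / (2 * N), 0.0            # illustrative chirp parameters
A = daft_matrix(N, c1, c2)             # unitary: product of phase ramps and IDFT

x = (2 * np.random.randint(0, 2, N) - 1).astype(complex)   # BPSK symbols
s = A @ x                              # AFDM modulation (chirp domain -> time)
x_hat = A.conj().T @ s                 # demodulation: apply the forward DAFT
print("round-trip error:", np.linalg.norm(x - x_hat))      # ~0
```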

On the Impact of Mutual Coupling on RIS-Assisted Channel Estimation

  • paper_url: http://arxiv.org/abs/2309.04990
  • repo_url: None
  • paper_authors: Pinjun Zheng, Xiuxiu Ma, Tareq Y. Al-Naffouri
  • for: This letter quantitatively evaluates the impact of mutual coupling on channel estimation assisted by reconfigurable intelligent surfaces (RIS).
  • methods: A misspecified Cramér-Rao bound analysis within an electromagnetics-compliant communication model.
  • results: Numerical results show that, in practical scenarios, reducing the RIS element spacing or enlarging the RIS accentuates the impact of neglecting mutual coupling; moreover, even in mutual-coupling-aware setups, excessively tight element spacing can substantially degrade channel estimation performance.
    Abstract Amid the demand for densely integrated elements in holographic reconfigurable intelligent surfaces (RISs), the mutual coupling effect has gained prominence. By performing a misspecified Cram\'er-Rao bound analysis within an electromagnetics-compliant communication model, this letter offers a quantitative evaluation of the impact of mutual coupling on RIS-assisted channel estimation. Our analysis provides insights into situations where mutual coupling can be disregarded safely. The numerical results reveal that within practical scenarios, closer integration of RIS elements or the enlargement of RIS size accentuates the impact of neglecting mutual coupling. In addition, even with mutual coupling-aware setups, excessively tight RIS element spacing can lead to substantial degradation in the channel estimation performance.

On the Capacity of Generalized Quadrature Spatial Modulation

  • paper_url: http://arxiv.org/abs/2309.04986
  • repo_url: None
  • paper_authors: Kein Yukiyoshi, Naoki Ishikawa
  • for: This letter studies the average mutual information (AMI) of generalized quadrature spatial modulation (GQSM).
  • methods: The AMI, a key performance metric of GQSM, is first evaluated by Monte Carlo integration, and a closed-form expression is then derived.
  • results: The calculation error induced by Monte Carlo integration grows exponentially with the signal-to-noise ratio; compared with related SM schemes and across antenna activation patterns, an equiprobable antenna selection method slightly decreases the AMI of symbols while significantly improving the total AMI.
    Abstract In this letter, the average mutual information (AMI) of generalized quadrature spatial modulation (GQSM) is first derived for continuous-input continuous-output channels. Our mathematical analysis shows that the calculation error induced by Monte Carlo integration increases exponentially with the signal-to-noise ratio. This nature of GQSM is resolved by deriving a closed-form expression. The derived AMI is compared with other related SM schemes and evaluated for different antenna activation patterns. Our results show that an equiprobable antenna selection method slightly decreases AMI of symbols, while the method significantly improves AMI in total.
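The paper's starting point, Monte Carlo evaluation of AMI for a discrete-input channel over complex AWGN, looks like the following for a generic constellation (this is the textbook estimator, not GQSM-specific; the constellation and SNR values are illustrative). At high SNR the estimate is dominated by rare noise events, which is why the Monte Carlo error grows with SNR and a closed form is valuable.

```python
import numpy as np

def ami_monte_carlo(constellation: np.ndarray, snr_db: float,
                    n_samples: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo AMI (bits) of a uniform discrete input over complex AWGN."""
    rng = np.random.default_rng(seed)
    M = constellation.size
    es = np.mean(np.abs(constellation) ** 2)
    n0 = es / 10 ** (snr_db / 10)                       # noise variance E|n|^2
    n = np.sqrt(n0 / 2) * (rng.standard_normal(n_samples)
                           + 1j * rng.standard_normal(n_samples))
    ami = 0.0
    for xi in constellation:
        # d[j, k] = |xi - xj + n_k|^2 - |n_k|^2
        d = (np.abs(xi - constellation[:, None] + n[None, :]) ** 2
             - np.abs(n[None, :]) ** 2)
        ami -= np.mean(np.log2(np.sum(np.exp(-d / n0), axis=0)))
    return np.log2(M) + ami / M

qpsk = np.exp(1j * np.pi * (2 * np.arange(4) + 1) / 4)
for snr in (0, 5, 10, 15):
    print(snr, "dB ->", round(ami_monte_carlo(qpsk, snr), 3), "bits")  # -> 2.0
```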

Trade-Off Between Beamforming and Macro-Diversity Gains in Distributed mMIMO

  • paper_url: http://arxiv.org/abs/2309.04975
  • repo_url: None
  • paper_authors: Eduardo Noboro Tominaga, Hsuan-Jung Su, Jinfeng Du, Sivarama Venkatesan, Richard Demo Souza, Hirley Alves
  • for: Industry and academia are working toward the evolution from Centralized massive Multiple-Input Multiple-Output (CmMIMO) to Distributed mMIMO (DmMIMO) architectures; this work studies that transition.
  • methods: A mathematical model is used to study the trade-off between deploying more Access Points (APs) with few or single antennas and deploying fewer APs equipped with many antennas.
  • results: There exists a "sweet spot" in the optimal number of APs and of antenna elements per AP, which is a function of the physical dimensions of the coverage area.
    Abstract Industry and academia have been working towards the evolution from Centralized massive Multiple-Input Multiple-Output (CmMIMO) to Distributed mMIMO (DmMIMO) architectures. Instead of splitting a coverage area into many cells, each served by a single Base Station equipped with several antennas, the whole coverage area is jointly covered by several Access Points (AP) equipped with few or single antennas. Nevertheless, when choosing between deploying more APs with few or single antennas or fewer APs equipped with many antennas, one observes an inherent trade-off between the beamforming and macro-diversity gains that has not been investigated in the literature. Given a total number of antenna elements and total downlink power, under a channel model that takes into account a probability of Line-of-Sight (LoS) as a function of the distance between the User Equipments (UEs) and APs, our numerical results show that there exists a "sweet spot" in the optimal number of APs and of antenna elements per AP which is a function of the physical dimensions of the coverage area.
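A small simulation conveys the trade-off in question: with a fixed antenna budget, more APs shorten the typical user-to-AP distance (macro diversity) but leave fewer antennas per AP (beamforming gain). Everything below, the LoS model, path-loss exponents, and area size, is an assumed toy setup, not the paper's channel model.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_los(d: np.ndarray, d0: float = 50.0) -> np.ndarray:
    """Assumed distance-dependent LoS probability: exp(-d / d0)."""
    return np.exp(-d / d0)

def mean_snr_db(n_aps: int, total_antennas: int = 64, side: float = 200.0,
                n_users: int = 2_000) -> float:
    """Average best-AP SNR gain for users dropped uniformly in a square area.

    Fixed antenna budget: each AP gets total_antennas / n_aps elements,
    so beamforming gain trades against macro diversity (shorter distances).
    """
    aps = rng.uniform(0, side, (n_aps, 2))
    users = rng.uniform(0, side, (n_users, 2))
    d = np.linalg.norm(users[:, None, :] - aps[None, :, :], axis=2) + 1.0
    los = rng.random(d.shape) < p_los(d)
    path_loss_exp = np.where(los, 2.0, 3.5)           # assumed exponents
    gain_db = (10 * np.log10(total_antennas / n_aps)  # beamforming gain
               - 10 * path_loss_exp * np.log10(d))    # distance loss
    return float(np.mean(gain_db.max(axis=1)))        # serve from best AP

for n in (1, 4, 16, 64):
    print(f"{n:2d} APs x {64 // n:2d} antennas -> {mean_snr_db(n):6.1f} dB")
```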

  • paper_url: http://arxiv.org/abs/2309.04916
  • repo_url: None
  • paper_authors: Byunghyun Lee, Andrew Marcum, David Love, James Krogmeier
  • for: Improving the spectral efficiency of UAV-assisted massive multiple-input multiple-output (MIMO) communication systems.
  • methods: A data fusion-based predictive beamforming scheme that uses aircraft dynamics to track and predict the UAV's trajectory and orientation.
  • results: simulation results demonstrate improved overall spectral efficiency, particularly when the number of antennas is large.
    Abstract In this letter, we propose a data fusion-based predictive beamforming scheme for unmanned aerial vehicle (UAV)-assisted massive multiple-input multiple-output (MIMO) communication, which involves a base station and UAV, each equipped with a massive MIMO array. We consider aircraft dynamics to track and predict the trajectory and orientation of the UAV. To improve communication and tracking performance, we propose a novel fusion of the channel and motion data of the UAV using an extended Kalman filter (EKF). Simulation results demonstrate that the proposed scheme can improve overall spectral efficiency, particularly when the number of antennas is large.
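A minimal EKF skeleton for the kind of predict/update fusion described, here a 1D constant-velocity toy state corrected with position fixes. The paper's filter fuses channel and motion data over a full 3D UAV state, so the model below is purely illustrative.

```python
import numpy as np

def ekf_step(x, P, z, dt, q=0.1, r=0.5):
    """One predict/update cycle for a 1D constant-velocity state [pos, vel].

    This motion model is linear, so the EKF Jacobians equal the model
    matrices; a real UAV model (attitude, 3D dynamics) would linearize.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])            # state transition
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                      [dt**2 / 2, dt]])              # process noise
    H = np.array([[1.0, 0.0]])                       # measure position only
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with measurement z
    S = H @ P @ H.T + r
    K = P @ H.T / S                                  # Kalman gain
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 0.0]), np.eye(2)
rng = np.random.default_rng(3)
for k in range(100):                                 # UAV moving at 5 m/s
    z = 5.0 * 0.1 * (k + 1) + rng.normal(0, 0.5)
    x, P = ekf_step(x, P, z, dt=0.1)
print("estimated velocity:", round(x[1], 2), "m/s")  # ~5
```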

One-Bit-Aided Modulo Sampling for DOA Estimation

  • paper_url: http://arxiv.org/abs/2309.04901
  • repo_url: None
  • paper_authors: Qi Zhang, Jiang Zhu, Zhiwei Xu, De Wen Soh
  • for: Improving DOA estimation accuracy and resolving the near-far problem.
  • methods: A one-bit-aided (1bit-aided) modulo sampling scheme combined with a blind integer-forcing (BIF) decoder.
  • results: Excellent performance compared to directly using a high-precision ADC in the given setup.
    Abstract Modulo sampling or unlimited sampling has recently drawn a great deal of attention for cutting-edge applications, due to overcoming the barrier of information loss through sensor saturation and clipping. This is a significant problem, especially when the range of signal amplitudes is unknown or in the near-far case. To overcome this fundamental bottleneck, we propose a one-bit-aided (1bit-aided) modulo sampling scheme for direction-of-arrival (DOA) estimation. On the one hand, one-bit quantization involving a simple comparator offers the advantages of low-cost and low-complexity implementation. On the other hand, one-bit quantization provides an estimate of the normalized covariance matrix of the unquantized measurements via the arcsin law. The estimate of the normalized covariance matrix is used to implement a blind integer-forcing (BIF) decoder to unwrap the modulo samples to construct the covariance matrix, and subspace methods can be used to perform the DOA estimation. Our approach, named 1bit-aided-BIF, addresses the near-far problem well and overcomes the intrinsic low dynamic range of one-bit quantization. Numerical experiments validate the excellent performance of the proposed algorithm compared to using a high-precision ADC directly in the given setup.
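The role of the one-bit branch rests on the arcsin law: for zero-mean Gaussian signals, the correlation of their signs equals (2/π)·arcsin(ρ), so the normalized covariance can be recovered from one-bit samples alone. A sketch of that recovery step (the BIF unwrapping and subspace DOA stages are not shown; the signal model is assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
rho_true = 0.7
n = 100_000

# Two zero-mean Gaussian channels with correlation rho_true.
cov = np.array([[1.0, rho_true], [rho_true, 1.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# One-bit quantization: only the signs survive.
bx, by = np.sign(x), np.sign(y)

# Arcsin law: E[sign(x) sign(y)] = (2/pi) * arcsin(rho)  =>  invert it.
rho_hat = np.sin(np.pi / 2 * np.mean(bx * by))
print(f"true rho = {rho_true}, one-bit estimate = {rho_hat:.3f}")
```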

cs.SD - 2023-09-09

Exploring Music Genre Classification: Algorithm Analysis and Deployment Architecture

  • paper_url: http://arxiv.org/abs/2309.04861
  • repo_url: None
  • paper_authors: Ayan Biswas, Supriya Dhabal, Palaniandavar Venkateswaran
  • for: This paper studies music genre classification.
  • methods: It combines Digital Signal Processing (DSP) and Deep Learning (DL) techniques in a novel algorithm that extracts relevant features from audio signals and classifies them into genres.
  • results: The algorithm achieves high accuracy on the GTZAN dataset; an end-to-end deployment architecture is also proposed for integration into music-related applications.
    Abstract Music genre classification has become increasingly critical with the advent of various streaming applications. Nowadays, we find it impossible to imagine using the artist's name and song title to search for music in a sophisticated music app. It is always difficult to classify music correctly because the information linked to music, such as region, artist, album, or non-album, is so variable. This paper presents a study on music genre classification using a combination of Digital Signal Processing (DSP) and Deep Learning (DL) techniques. A novel algorithm is proposed that utilizes both DSP and DL methods to extract relevant features from audio signals and classify them into various genres. The algorithm was tested on the GTZAN dataset and achieved high accuracy. An end-to-end deployment architecture is also proposed for integration into music-related applications. The performance of the algorithm is analyzed and future directions for improvement are discussed. The proposed DSP and DL-based music genre classification algorithm and deployment architecture demonstrate a promising approach for music genre classification.
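A typical DSP front end for this task extracts MFCCs that a downstream DL classifier consumes. A sketch of that stage with librosa follows; the paper does not specify its exact features or network here, and the file path is a placeholder.

```python
import librosa
import numpy as np

def extract_features(path: str, sr: int = 22_050, n_mfcc: int = 20) -> np.ndarray:
    """DSP stage: load 30 s of audio and summarize its MFCCs.

    Returns a fixed-length vector (per-coefficient mean and std) that a
    genre classifier, e.g. a small dense network, can take as input.
    """
    y, sr = librosa.load(path, sr=sr, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Usage (path is hypothetical; GTZAN clips are 30 s long):
# features = extract_features("gtzan/blues/blues.00000.wav")
# print(features.shape)  # (40,)
```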

Generalized Minimum Error with Fiducial Points Criterion for Robust Learning

  • paper_url: http://arxiv.org/abs/2309.04670
  • repo_url: None
  • paper_authors: Haiquan Zhao, Yuan Gao, Yingying Zhu
  • for: Overcoming the limitations of the minimum error entropy criterion, namely its reduced sensitivity to error mean values and its uncertainty regarding error probability density function locations.
  • methods: The Generalized Gaussian Density (GGD) function is adopted as the kernel, providing more control over tail behavior and peakedness; a quantized variant further lowers the computational load.
  • results: In numerical simulations on adaptive filters, kernel recursive algorithms, and multilayer perceptrons, covering system identification, acoustic echo cancellation, time series prediction, and supervised classification, the proposed algorithms perform excellently.
    Abstract The conventional Minimum Error Entropy criterion (MEE) has its limitations, showing reduced sensitivity to error mean values and uncertainty regarding error probability density function locations. To overcome this, a MEE with fiducial points criterion (MEEF) was presented. However, the efficacy of the MEEF is not consistent due to its reliance on a fixed Gaussian kernel. In this paper, a generalized minimum error with fiducial points criterion (GMEEF) is presented by adopting the Generalized Gaussian Density (GGD) function as kernel. The GGD extends the Gaussian distribution by introducing a shape parameter that provides more control over the tail behavior and peakedness. In addition, due to the high computational complexity of the GMEEF criterion, a quantized variant is introduced to notably lower the computational load of the GMEEF-type algorithm. Finally, the proposed criteria are introduced to the domains of adaptive filters, kernel recursive algorithms, and multilayer perceptrons. Several numerical simulations, which contain system identification, acoustic echo cancellation, time series prediction, and supervised classification, indicate that the novel algorithms perform excellently.
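The generalized Gaussian density adopted as the kernel is p(x) = β/(2αΓ(1/β))·exp(−(|x|/α)^β); β = 2 recovers the Gaussian and β = 1 the Laplacian, with smaller β giving heavier tails. The sketch below pairs it with a fiducial-point-style cost that stays sensitive to the error mean; the exact GMEEF criterion is defined in the paper, so this form is an assumption.

```python
import numpy as np
from scipy.special import gamma

def ggd_kernel(e: np.ndarray, alpha: float = 1.0, beta: float = 2.0) -> np.ndarray:
    """Generalized Gaussian density; beta=2 -> Gaussian, beta=1 -> Laplacian."""
    c = beta / (2.0 * alpha * gamma(1.0 / beta))
    return c * np.exp(-(np.abs(e) / alpha) ** beta)

def gmeef_style_loss(errors: np.ndarray, alpha=1.0, beta=1.5) -> float:
    """Fiducial-point-style cost: score error pairs AND errors against the
    fiducial point e=0, so a shifted error mean is penalized, unlike plain MEE."""
    pairwise = ggd_kernel(errors[:, None] - errors[None, :], alpha, beta)
    fiducial = ggd_kernel(errors, alpha, beta)
    return -float(np.mean(pairwise) + np.mean(fiducial))

e_unbiased = np.random.randn(200) * 0.3
e_biased = e_unbiased + 2.0          # same spread, shifted mean
print(gmeef_style_loss(e_unbiased) < gmeef_style_loss(e_biased))  # True
```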

Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.04654
  • repo_url: None
  • paper_authors: Huaibo Zhao, Yosuke Higuchi, Yusuke Kida, Tetsuji Ogawa, Tetsunori Kobayashi
  • for: This paper examines the effectiveness of Mask-CTC-based pre-training for improving the accuracy and latency of streaming end-to-end automatic speech recognition (ASR) systems.
  • methods: The methods include Mask-CTC-based encoder pre-training, triggered attention, and different model architectures such as the Transformer-Transducer and contextual block streaming ASR.
  • results: Mask-CTC-based pre-training improves the accuracy and latency of streaming ASR across model architectures and yields accurate output spike timing.
    Abstract Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future contexts, a streaming ASR model achieves higher accuracy but results in larger latency, which hurts the streaming performance. In the Mask-CTC framework, an encoder network is trained to learn the feature representation that anticipates long-term contexts, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing.

cs.CV - 2023-09-09

Semi-supervised Instance Segmentation with a Learned Shape Prior

  • paper_url: http://arxiv.org/abs/2309.04888
  • repo_url: None
  • paper_authors: Long Chen, Weiwen Zhang, Yuli Wu, Martin Strauch, Dorit Merhof
  • for: Solving instance segmentation without requiring large amounts of annotated object contours.
  • methods: A shape prior model, learned with a variational autoencoder, that requires only a very limited amount of training data.
  • results: In the experiments, a few dozen object shape patches from the target dataset, as well as purely synthetic shapes, were sufficient to match supervised methods with full access to training data on two out of three cell segmentation datasets; with a synthetic shape prior, the method was superior to pre-trained supervised models with limited domain-specific training data on all three datasets.
    Abstract To date, most instance segmentation approaches are based on supervised learning that requires a considerable amount of annotated object contours as training ground truth. Here, we propose a framework that searches for the target object based on a shape prior. The shape prior model is learned with a variational autoencoder that requires only a very limited amount of training data: In our experiments, a few dozens of object shape patches from the target dataset, as well as purely synthetic shapes, were sufficient to achieve results on par with supervised methods with full access to training data on two out of three cell segmentation datasets. Our method with a synthetic shape prior was superior to pre-trained supervised models with access to limited domain-specific training data on all three datasets. Since the learning of prior models requires shape patches, whether real or synthetic data, we call this framework semi-supervised learning.

SortedAP: Rethinking evaluation metrics for instance segmentation

  • paper_url: http://arxiv.org/abs/2309.04887
  • repo_url: None
  • paper_authors: Long Chen, Yuli Wu, Johannes Stegmaier, Dorit Merhof
  • for: Evaluation metrics for instance segmentation need to comprehensively consider both object detection and segmentation accuracy.
  • methods: This paper proposes a new evaluation metric called sortedAP, which strictly decreases with both object- and pixel-level imperfections.
  • results: sortedAP assesses segmentation quality accurately and has an uninterrupted penalization scale over the entire domain, giving a more faithful indication of the quality gap between results.
    Abstract Designing metrics for evaluating instance segmentation revolves around comprehensively considering object detection and segmentation accuracy. However, other important properties, such as sensitivity, continuity, and equality, are overlooked in the current study. In this paper, we reveal that most existing metrics have a limited resolution of segmentation quality. They are only conditionally sensitive to the change of masks or false predictions. For certain metrics, the score can change drastically in a narrow range which could provide a misleading indication of the quality gap between results. Therefore, we propose a new metric called sortedAP, which strictly decreases with both object- and pixel-level imperfections and has an uninterrupted penalization scale over the entire domain. We provide the evaluation toolkit and experiment code at https://www.github.com/looooongChen/sortedAP.

AnyPose: Anytime 3D Human Pose Forecasting via Neural Ordinary Differential Equations

  • paper_url: http://arxiv.org/abs/2309.04840
  • repo_url: None
  • paper_authors: Zixing Wang, Ahmed H. Qureshi
  • for: Proposing a reliable 3D human pose forecasting method that can predict pose at any real-valued time step, as required for synchronous real-world human-machine interaction.
  • methods: Neural ordinary differential equations (neural ODEs) are used to model human motion dynamics.
  • results: AnyPose predicts future poses with high accuracy on the Human3.6M, AMASS, and 3DPW datasets while taking significantly less computational time than traditional methods.
    Abstract Anytime 3D human pose forecasting is crucial to synchronous real-world human-machine interaction, where the term ``anytime" corresponds to predicting human pose at any real-valued time step. However, to the best of our knowledge, all the existing methods in human pose forecasting perform predictions at preset, discrete time intervals. Therefore, we introduce AnyPose, a lightweight continuous-time neural architecture that models human behavior dynamics with neural ordinary differential equations. We validate our framework on the Human3.6M, AMASS, and 3DPW dataset and conduct a series of comprehensive analyses towards comparison with existing methods and the intersection of human pose and neural ordinary differential equations. Our results demonstrate that AnyPose exhibits high-performance accuracy in predicting future poses and takes significantly lower computational time than traditional methods in solving anytime prediction tasks.
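The core idea of a neural-ODE forecaster is that pose evolves as dh/dt = f_θ(h, t), so predicting at any real-valued time t simply means integrating to t; no fixed prediction grid is needed. A minimal sketch with a fixed-step Euler integrator in PyTorch (AnyPose's actual architecture and training are not reproduced; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class PoseDynamics(nn.Module):
    """f_theta(h, t): learned time derivative of the latent pose state."""
    def __init__(self, dim: int = 51):          # e.g. 17 joints x 3 coords
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.Tanh(), nn.Linear(128, dim))

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_col = t.expand(h.shape[0], 1)
        return self.net(torch.cat([h, t_col], dim=1))

def odeint_euler(f: nn.Module, h0: torch.Tensor, t_end: float,
                 n_steps: int = 50) -> torch.Tensor:
    """Integrate dh/dt = f(h, t) from t=0 to any real-valued t_end."""
    h, dt = h0, t_end / n_steps
    for k in range(n_steps):
        t = torch.tensor([[k * dt]])
        h = h + dt * f(h, t)                    # forward Euler step
    return h

f = PoseDynamics()
h0 = torch.randn(1, 51)                          # last observed pose
pose_at_any_time = odeint_euler(f, h0, t_end=0.37)   # arbitrary horizon
print(pose_at_any_time.shape)                    # torch.Size([1, 51])
```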

Neural Semantic Surface Maps

  • paper_url: http://arxiv.org/abs/2309.04836
  • repo_url: None
  • paper_authors: Luca Morreale, Noam Aigerman, Vladimir G. Kim, Niloy J. Mitra
  • for: Generating a semantic surface-to-surface map between two genus-zero shapes, matching semantically corresponding regions to one another.
  • methods: Semantic matches are distilled from pre-trained vision models: the pair of 3D shapes is rendered from multiple viewpoints and the renders are fed into an off-the-shelf image-matching method that produces feature points.
  • results: The approach generates semantic surface-to-surface maps without any 3D training data or manual annotation, and works both in scenarios of high semantic complexity, where objects are non-isometrically related, and when they are nearly isometric.
    Abstract We present an automated technique for computing a map between two genus-zero shapes, which matches semantically corresponding regions to one another. Lack of annotated data prohibits direct inference of 3D semantic priors; instead, current State-of-the-art methods predominantly optimize geometric properties or require varying amounts of manual annotation. To overcome the lack of annotated training data, we distill semantic matches from pre-trained vision models: our method renders the pair of 3D shapes from multiple viewpoints; the resulting renders are then fed into an off-the-shelf image-matching method which leverages a pretrained visual model to produce feature points. This yields semantic correspondences, which can be projected back to the 3D shapes, producing a raw matching that is inaccurate and inconsistent between different viewpoints. These correspondences are refined and distilled into an inter-surface map by a dedicated optimization scheme, which promotes bijectivity and continuity of the output map. We illustrate that our approach can generate semantic surface-to-surface maps, eliminating manual annotations or any 3D training data requirement. Furthermore, it proves effective in scenarios with high semantic complexity, where objects are non-isometrically related, as well as in situations where they are nearly isometric.

Few-Shot Medical Image Segmentation via a Region-enhanced Prototypical Transformer

  • paper_url: http://arxiv.org/abs/2309.04825
  • repo_url: https://github.com/yazhouzhu19/rpt
  • paper_authors: Yazhou Zhu, Shidong Wang, Tong Xin, Haofeng Zhang
  • for: This paper addresses medical image segmentation, in particular the automated segmentation of large volumes of medical images with limited fully annotated data.
  • methods: The Region-enhanced Prototypical Transformer (RPT), a few-shot learning method that learns from a handful of support images and generalizes across test cases.
  • results: Extensive experiments on three publicly available medical image datasets show that RPT outperforms existing Few-Shot Medical Image Segmentation (FSMS) methods with better accuracy and consistency.
    Abstract Automated segmentation of large volumes of medical images is often plagued by the limited availability of fully annotated data and the diversity of organ surface properties resulting from the use of different acquisition protocols for different patients. In this paper, we introduce a more promising few-shot learning-based method named Region-enhanced Prototypical Transformer (RPT) to mitigate the effects of large intra-class diversity/bias. First, a subdivision strategy is introduced to produce a collection of regional prototypes from the foreground of the support prototype. Second, a self-selection mechanism is proposed to incorporate into the Bias-alleviated Transformer (BaT) block to suppress or remove interferences present in the query prototype and regional support prototypes. By stacking BaT blocks, the proposed RPT can iteratively optimize the generated regional prototypes and finally produce rectified and more accurate global prototypes for Few-Shot Medical Image Segmentation (FSMS). Extensive experiments are conducted on three publicly available medical image datasets, and the obtained results show consistent improvements compared to state-of-the-art FSMS methods. The source code is available at: https://github.com/YazhouZhu19/RPT.

ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

  • paper_url: http://arxiv.org/abs/2309.04820
  • repo_url: None
  • paper_authors: Michael A. Hobley, Victor A. Prisacariu
  • for: This paper proposes a multi-class, class-agnostic counting method that addresses the limitations of existing counting approaches.
  • methods: The method introduces a new paradigm in which no class exemplars are needed to guide enumeration during training or inference; instead, examples are found after the counting stage to help a user understand the generated outputs.
  • results: On the MCAC dataset the method outperforms contemporary approaches without requiring human-in-the-loop annotations, and this performance transfers to the FSC-147 dataset.
    Abstract Class-agnostic counting methods enumerate objects of an arbitrary class, providing tremendous utility in many fields. Prior works have limited usefulness as they require either a set of examples of the type to be counted or that the image contains only a single type of object. A significant factor in these shortcomings is the lack of a dataset to properly address counting in settings with more than one kind of object present. To address these issues, we propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A Blind Counter (ABC123), a method that can count multiple types of objects simultaneously without using examples of type during training or inference. ABC123 introduces a new paradigm where instead of requiring exemplars to guide the enumeration, examples are found after the counting stage to help a user understand the generated outputs. We show that ABC123 outperforms contemporary methods on MCAC without the requirement of human in-the-loop annotations. We also show that this performance transfers to FSC-147, the standard class-agnostic counting dataset.

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

  • paper_url: http://arxiv.org/abs/2309.04814
  • repo_url: None
  • paper_authors: Xiuzhe Wu, Pengfei Hu, Yang Wu, Xiaoyang Lyu, Yan-Pei Cao, Ying Shan, Wenming Yang, Zhongqian Sun, Xiaojuan Qi
  • for: Generating natural-looking talking videos from speech, addressing previous issues of inaccurate lip-shape generation and poor image quality.
  • methods: A decomposition-synthesis-composition framework (Speech2Lip) disentangles speech-driven motion and appearance into speech-sensitive and speech-insensitive parts, enabling effective learning from limited training data.
  • results: The model can be trained on just a few minutes of video and achieves state-of-the-art performance in both visual quality and speech-visual synchronization.
    Abstract Synthesizing realistic videos according to a given speech is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that only motions and appearances on limited facial areas (e.g., lip area) are mainly driven by the input speech. Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., canonical space), we present a speech-driven implicit model for lip image generation which concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste generated lip images at the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip.

VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis

  • paper_url: http://arxiv.org/abs/2309.04800
  • repo_url: None
  • paper_authors: Xinya Chen, Jiaxin Huang, Yanrui Bin, Lu Yu, Yiyi Liao
  • for: Generating high-quality human images with natural variations in pose and shape.
  • methods: A generative vertex-based radiance field, learned with neural networks and parameterized by the vertices of the parametric human template SMPL.
  • results: The method generates photorealistic human images with free control over camera pose, human pose, and shape, and enables part-level editing.
    Abstract Unsupervised learning of 3D-aware generative adversarial networks has lately made much progress. Some recent work demonstrates promising results of learning human generative models using neural articulated radiance fields, yet their generalization ability and controllability lag behind parametric human models, i.e., they do not perform well when generalizing to novel pose/shape and are not part controllable. To solve these problems, we propose VeRi3D, a generative human vertex-based radiance field parameterized by vertices of the parametric human template, SMPL. We map each 3D point to the local coordinate system defined on its neighboring vertices, and use the corresponding vertex feature and local coordinates for mapping it to color and density values. We demonstrate that our simple approach allows for generating photorealistic human images with free control over camera pose, human pose, shape, as well as enabling part-level editing.

Self-Supervised Transformer with Domain Adaptive Reconstruction for General Face Forgery Video Detection

  • paper_url: http://arxiv.org/abs/2309.04795
  • repo_url: None
  • paper_authors: Daichi Zhang, Zihao Xiao, Jianmin Li, Shiming Ge
  • for: Improving face forgery video detection, especially generalization when detecting videos produced by different forgery methods or from different real source videos.
  • methods: A Self-supervised Transformer cooperating with Contrastive and Reconstruction learning (CoReST), pre-trained only on real face videos and then fine-tuned with a linear head on specific forgery datasets, with two auxiliary tasks (contrastive and reconstruction learning); a Domain Adaptive Reconstruction (DAR) module bridges different forgery domains by reconstructing unlabeled target videos during fine-tuning.
  • results: Experiments on public datasets show that the proposed method outperforms state-of-the-art supervised competitors with impressive generalization.
    Abstract Face forgery videos have caused severe social public concern, and various detectors have been proposed recently. However, most of them are trained in a supervised manner with limited generalization when detecting videos from different forgery methods or real source videos. To tackle this issue, we explore to take full advantage of the difference between real and forgery videos by only exploring the common representation of real face videos. In this paper, a Self-supervised Transformer cooperating with Contrastive and Reconstruction learning (CoReST) is proposed, which is first pre-trained only on real face videos in a self-supervised manner, and then fine-tuned a linear head on specific face forgery video datasets. Two specific auxiliary tasks incorporated contrastive and reconstruction learning are designed to enhance the representation learning. Furthermore, a Domain Adaptive Reconstruction (DAR) module is introduced to bridge the gap between different forgery domains by reconstructing on unlabeled target videos when fine-tuning. Extensive experiments on public datasets demonstrate that our proposed method performs even better than the state-of-the-art supervised competitors with impressive generalization.

Latent Degradation Representation Constraint for Single Image Deraining

  • paper_url: http://arxiv.org/abs/2309.04780
  • repo_url: None
  • paper_authors: Yuhong He, Long Peng, Lu Wang, Jun Cheng
  • for: This work proposes a new single-image deraining model that addresses the difficulty existing methods have in learning rain degradation representations.
  • methods: The model consists of a Direction-Aware Encoder (DAEncoder), a UNet deraining network, and a Multi-Scale Interaction Block (MSIBlock). The DAEncoder uses deformable convolutions to exploit the directional consistency of rain streaks and adaptively extract latent degradation representations; a constraint loss explicitly constrains degradation representation learning during training; and the MSIBlock adaptively fuses the learned degradation representation with the decoder features of the deraining network, enabling it to remove various complicated rain patterns and reconstruct image details.
  • results: Experimental results show that the method achieves new state-of-the-art performance on both synthetic and real datasets.
    Abstract Since rain streaks show a variety of shapes and directions, learning the degradation representation is extremely challenging for single image deraining. Existing methods are mainly targeted at designing complicated modules to implicitly learn latent degradation representation from coupled rainy images. This way, it is hard to decouple the content-independent degradation representation due to the lack of explicit constraint, resulting in over- or under-enhancement problems. To tackle this issue, we propose a novel Latent Degradation Representation Constraint Network (LDRCNet) that consists of Direction-Aware Encoder (DAEncoder), UNet Deraining Network, and Multi-Scale Interaction Block (MSIBlock). Specifically, the DAEncoder is proposed to adaptively extract latent degradation representation by using the deformable convolutions to exploit the direction consistency of rain streaks. Next, a constraint loss is introduced to explicitly constraint the degradation representation learning during training. Last, we propose an MSIBlock to fuse with the learned degradation representation and decoder features of the deraining network for adaptive information interaction, which enables the deraining network to remove various complicated rainy patterns and reconstruct image details. Experimental results on synthetic and real datasets demonstrate that our method achieves new state-of-the-art performance.

Visual Material Characteristics Learning for Circular Healthcare

  • paper_url: http://arxiv.org/abs/2309.04763
  • repo_url: https://github.com/fedezocco/matvisiongluinh-pytorch_tensorflow
  • paper_authors: Federico Zocco, Shahin Rahimifard
  • for: Strengthening the medical waste recovery chain and increasing the circularity of material flows in healthcare.
  • methods: Several vision systems are developed for three main circular economy tasks: resource mapping and quantification, waste sorting, and disassembly.
  • results: The systems' performance shows that representation-learning vision can improve the recovery chain, where autonomous systems are key enablers due to contamination risks; two fully annotated datasets are also released, for image segmentation and for key-point tracking in disassembly operations of inhalers and glucose meters.
    Abstract The linear take-make-dispose paradigm at the foundations of our traditional economy is proving to be unsustainable due to waste pollution and material supply uncertainties. Hence, increasing the circularity of material flows is necessary. In this paper, we make a step towards circular healthcare by developing several vision systems targeting three main circular economy tasks: resources mapping and quantification, waste sorting, and disassembly. The performance of our systems demonstrates that representation-learning vision can improve the recovery chain, where autonomous systems are key enablers due to the contamination risks. We also published two fully-annotated datasets for image segmentation and for key-point tracking in disassembly operations of inhalers and glucose meters. The datasets and source code are publicly available.

Probabilistic Triangulation for Uncalibrated Multi-View 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.04756
  • repo_url: https://github.com/bymaths/probabilistic_triangulation
  • paper_authors: Boyuan Jiang, Lei Hu, Shihong Xia
  • for: This work proposes a 3D human pose estimation method for uncalibrated multi-view settings, replacing the fixed-camera-pose assumption of existing methods to improve generalization.
  • methods: A Probabilistic Triangulation module maintains a distribution over camera poses and iteratively updates it by computing the posterior probability from 2D features via Monte Carlo sampling (see the sketch below), so gradients can be back-propagated directly from the 3D pose estimate to the 2D heatmap, enabling end-to-end training.
  • results: Extensive experiments on Human3.6M and CMU Panoptic show that the method outperforms other uncalibrated methods and achieves results comparable to state-of-the-art calibrated ones, striking a trade-off between estimation accuracy and generalizability.
    Abstract 3D human pose estimation has been a long-standing challenge in computer vision and graphics, where multi-view methods have significantly progressed but are limited by the tedious calibration processes. Existing multi-view methods are restricted to fixed camera pose and therefore lack generalization ability. This paper presents a novel Probabilistic Triangulation module that can be embedded in a calibrated 3D human pose estimation method, generalizing it to uncalibration scenes. The key idea is to use a probability distribution to model the camera pose and iteratively update the distribution from 2D features instead of using camera pose. Specifically, We maintain a camera pose distribution and then iteratively update this distribution by computing the posterior probability of the camera pose through Monte Carlo sampling. This way, the gradients can be directly back-propagated from the 3D pose estimation to the 2D heatmap, enabling end-to-end training. Extensive experiments on Human3.6M and CMU Panoptic demonstrate that our method outperforms other uncalibration methods and achieves comparable results with state-of-the-art calibration methods. Thus, our method achieves a trade-off between estimation accuracy and generalizability. Our code is in https://github.com/bymaths/probabilistic_triangulation
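To make the Monte Carlo pose-update idea concrete, here is a toy sketch with a one-parameter camera yaw and a hypothetical pinhole setup (the paper works with full camera poses): pose samples are reweighted by the reprojection likelihood of 2D keypoints, then resampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def reproject(pts3d, yaw, f=1000.0):
    # Toy pinhole camera rotated about the y-axis, scene pushed to z + 5
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    cam = pts3d @ R.T + np.array([0.0, 0.0, 5.0])
    return f * cam[:, :2] / cam[:, 2:3]

def posterior_step(samples, pts3d, obs2d, sigma=5.0):
    # Reweight each pose sample by the keypoint reprojection likelihood,
    # then resample and jitter (a basic particle-filter-style update).
    log_w = np.array([-0.5 * np.sum((reproject(pts3d, y) - obs2d) ** 2) / sigma ** 2
                      for y in samples])
    log_w -= log_w.max()                       # numerical stability
    w = np.exp(log_w)
    w /= w.sum()
    idx = rng.choice(len(samples), size=len(samples), p=w)
    return samples[idx] + rng.normal(0.0, 0.01, len(samples))

pts = rng.normal(size=(10, 3))
obs = reproject(pts, 0.3) + rng.normal(0.0, 1.0, (10, 2))   # noisy 2D keypoints
samples = rng.uniform(-np.pi / 4, np.pi / 4, 500)
for _ in range(5):
    samples = posterior_step(samples, pts, obs)
print("estimated yaw:", samples.mean())        # should approach 0.3
```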

Deep Video Restoration for Under-Display Camera

  • paper_url: http://arxiv.org/abs/2309.04752
  • repo_url: None
  • paper_authors: Xuanxi Chen, Tao Wang, Ziqian Shao, Kaihao Zhang, Wenhan Luo, Tong Lu, Zikun Liu, Tae-Kyun Kim, Hongdong Li
  • for: Under-Display Camera video restoration (UDC-VR), a problem that existing UDC restoration work, which focuses only on images, does not address.
  • methods: A GAN-based generation pipeline simulates the realistic UDC degradation process and is used to build PexelsUDC, a large-scale UDC video restoration dataset with two subsets, PexelsUDC-T and PexelsUDC-P, corresponding to different displays. A novel transformer-based baseline is then proposed whose key components, a spatial branch with local-aware transformers, a temporal branch with embedded temporal transformers, and a spatial-temporal fusion module, let the model fully exploit spatial and temporal information to restore degraded videos.
  • results: Benchmark studies on the proposed dataset expose the limitations of existing video restoration methods on UDC-VR, and extensive experiments show that the proposed baseline achieves state-of-the-art performance on PexelsUDC.
    Abstract Images or videos captured by the Under-Display Camera (UDC) suffer from severe degradation, such as saturation degeneration and color shift. While restoration for UDC has been a critical task, existing works of UDC restoration focus only on images. UDC video restoration (UDC-VR) has not been explored in the community. In this work, we first propose a GAN-based generation pipeline to simulate the realistic UDC degradation process. With the pipeline, we build the first large-scale UDC video restoration dataset called PexelsUDC, which includes two subsets named PexelsUDC-T and PexelsUDC-P corresponding to different displays for UDC. Using the proposed dataset, we conduct extensive benchmark studies on existing video restoration methods and observe their limitations on the UDC-VR task. To this end, we propose a novel transformer-based baseline method that adaptively enhances degraded videos. The key components of the method are a spatial branch with local-aware transformers, a temporal branch embedded temporal transformers, and a spatial-temporal fusion module. These components drive the model to fully exploit spatial and temporal information for UDC-VR. Extensive experiments show that our method achieves state-of-the-art performance on PexelsUDC. The benchmark and the baseline method are expected to promote the progress of UDC-VR in the community, which will be made public.

Mirror-Aware Neural Humans

  • paper_url: http://arxiv.org/abs/2309.04750
  • repo_url: None
  • paper_authors: Daniel Ajisafe, James Tang, Shih-Yang Su, Bastian Wandt, Helge Rhodin
  • for: Building a high-quality, consumer-level human motion capture system from a single camera, avoiding the drawbacks of both multi-view and single-view setups.
  • methods: A mirror is used to record two views with a single camera, and the mirror is exploited to learn a complete body model, including shape and dense appearance, by extending articulated neural radiance fields with a notion of a mirror that makes sampling efficient over potential occlusion regions.
  • results: The system automatically calibrates the camera, estimates the mirror orientation, and lifts off-the-shelf 2D keypoint detections to 3D skeleton poses, while handling occlusions between the real and mirrored person, improving robustness and accuracy in challenging mirror scenes.
    Abstract Human motion capture either requires multi-camera systems or is unreliable using single-view input due to depth ambiguities. Meanwhile, mirrors are readily available in urban environments and form an affordable alternative by recording two views with only a single camera. However, the mirror setting poses the additional challenge of handling occlusions of real and mirror image. Going beyond existing mirror approaches for 3D human pose estimation, we utilize mirrors for learning a complete body model, including shape and dense appearance. Our main contributions are extending articulated neural radiance fields to include a notion of a mirror, making it sample-efficient over potential occlusion regions. Together, our contributions realize a consumer-level 3D motion capture system that starts from off-the-shelf 2D poses by automatically calibrating the camera, estimating mirror orientation, and subsequently lifting 2D keypoint detections to 3D skeleton pose that is used to condition the mirror-aware NeRF. We empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.

When to Learn What: Model-Adaptive Data Augmentation Curriculum

  • paper_url: http://arxiv.org/abs/2309.04747
  • repo_url: None
  • paper_authors: Chengkai Hou, Jieyu Zhang, Tianyi Zhou
  • for: Improving the generalization of neural networks through data augmentation, which enforces invariance to a set of transformations applied to the input data.
  • methods: Model Adaptive Data Augmentation (MADAug) jointly trains an augmentation policy network that selects augmentation operators for each input image with a model-adaptive policy varying across training stages, producing a data augmentation curriculum optimized for generalization. The policy is trained via a bi-level optimization scheme that minimizes the validation loss of a model trained on the policy-produced augmentations (see the sketch below).
  • results: Evaluations across multiple image classification tasks and network architectures show that MADAug outperforms or matches existing data augmentation methods, improving all classes and the difficult ones most. The learned policy also transfers well to fine-grained datasets and naturally forms an easy-to-hard curriculum of increasing perturbations.
    Abstract Data augmentation (DA) is widely used to improve the generalization of neural networks by enforcing the invariances and symmetries to pre-defined transformations applied to input data. However, a fixed augmentation policy may have different effects on each sample in different training stages but existing approaches cannot adjust the policy to be adaptive to each sample and the training model. In this paper, we propose Model Adaptive Data Augmentation (MADAug) that jointly trains an augmentation policy network to teach the model when to learn what. Unlike previous work, MADAug selects augmentation operators for each input image by a model-adaptive policy varying between training stages, producing a data augmentation curriculum optimized for better generalization. In MADAug, we train the policy through a bi-level optimization scheme, which aims to minimize a validation-set loss of a model trained using the policy-produced data augmentations. We conduct an extensive evaluation of MADAug on multiple image classification tasks and network architectures with thorough comparisons to existing DA approaches. MADAug outperforms or is on par with other baselines and exhibits better fairness: it brings improvement to all classes and more to the difficult ones. Moreover, MADAug learned policy shows better performance when transferred to fine-grained datasets. In addition, the auto-optimized policy in MADAug gradually introduces increasing perturbations and naturally forms an easy-to-hard curriculum.
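The bi-level idea can be roughed out as follows. This is a hypothetical sketch: the operator names, the per-image feature source, and the REINFORCE-style policy update are illustrative assumptions, not MADAug's actual formulation.

```python
import torch
import torch.nn as nn

AUG_OPS = ["identity", "flip", "rotate", "color_jitter"]   # illustrative set

class AugPolicy(nn.Module):
    # Maps per-image features to a distribution over augmentation operators.
    def __init__(self, feat_dim: int = 128, n_ops: int = len(AUG_OPS)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_ops))

    def forward(self, feats):
        return torch.distributions.Categorical(logits=self.net(feats))

policy = AugPolicy()
feats = torch.randn(4, 128)          # per-image features from the task model
dist = policy(feats)
ops = dist.sample()                  # one augmentation operator per image

# Outer loop (sketched): train the task model on the augmented batch, then
# treat the resulting validation loss as the signal for the policy update.
val_loss = torch.tensor(0.7)         # stand-in for the measured value
policy_loss = (dist.log_prob(ops) * val_loss).mean()
policy_loss.backward()               # descending this lowers expected val loss
```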

Frequency-Aware Self-Supervised Long-Tailed Learning

  • paper_url: http://arxiv.org/abs/2309.04723
  • repo_url: None
  • paper_authors: Ci-Siang Lin, Min-Hung Chen, Yu-Chiang Frank Wang
  • for: Addressing long-tailed data distributions in real-world scenarios where label annotations may not be available.
  • methods: Frequency-Aware Self-Supervised Learning (FASSL) learns discriminative feature representations from unlabeled data with inherent long-tailed distributions. It first learns frequency-aware prototypes reflecting the long-tailed distribution, then exploits the relationships between image data and the derived prototypes through a self-supervised learning scheme, with particular focus on rare-class samples.
  • results: Experiments on long-tailed image datasets quantitatively and qualitatively verify that FASSL learns effective feature representations from unlabeled data.
    Abstract Data collected from the real world typically exhibit long-tailed distributions, where frequent classes contain abundant data while rare ones have only a limited number of samples. While existing supervised learning approaches have been proposed to tackle such data imbalance, the requirement of label supervision would limit their applicability to real-world scenarios in which label annotation might not be available. Without the access to class labels nor the associated class frequencies, we propose Frequency-Aware Self-Supervised Learning (FASSL) in this paper. Targeting at learning from unlabeled data with inherent long-tailed distributions, the goal of FASSL is to produce discriminative feature representations for downstream classification tasks. In FASSL, we first learn frequency-aware prototypes, reflecting the associated long-tailed distribution. Particularly focusing on rare-class samples, the relationships between image data and the derived prototypes are further exploited with the introduced self-supervised learning scheme. Experiments on long-tailed image datasets quantitatively and qualitatively verify the effectiveness of our learning scheme.

UnitModule: A Lightweight Joint Image Enhancement Module for Underwater Object Detection

  • paper_url: http://arxiv.org/abs/2309.04708
  • repo_url: None
  • paper_authors: Zhuoyan Liu, Bo Wang, Ye Li, Jiaxian He, Yunfeng Li
  • for: Improving the quality of input images for underwater object detection models, thereby improving detection performance.
  • methods: A plug-and-play Underwater joint image enhancement Module (UnitModule) is trained jointly with the detector through an unsupervised learning loss, without additional datasets, to improve the interaction between the two. A color cast predictor with an assisting color cast loss and a data augmentation called Underwater Color Random Transfer (UCRT) further improve performance on underwater images with different color casts (see the sketch below).
  • results: In extensive experiments on DUO, UnitModule achieves the highest improvement of 2.6 AP for YOLOv5-S and gains 3.3 AP on the brand-new test set (URPCtest). It improves every detection model tested, especially those with few parameters, and with only 31K parameters it barely affects the inference speed of the original detector. Quantitative and visual analyses confirm that UnitModule enhances the input image and the detector's perception of object features.
    Abstract Underwater object detection faces the problem of underwater image degradation, which affects the performance of the detector. Underwater object detection methods based on noise reduction and image enhancement usually do not provide images preferred by the detector or require additional datasets. In this paper, we propose a plug-and-play Underwater joint image enhancement Module (UnitModule) that provides the input image preferred by the detector. We design an unsupervised learning loss for the joint training of UnitModule with the detector without additional datasets to improve the interaction between UnitModule and the detector. Furthermore, a color cast predictor with the assisting color cast loss and a data augmentation called Underwater Color Random Transfer (UCRT) are designed to improve the performance of UnitModule on underwater images with different color casts. Extensive experiments are conducted on DUO for different object detection models, where UnitModule achieves the highest performance improvement of 2.6 AP for YOLOv5-S and gains the improvement of 3.3 AP on the brand-new test set (URPCtest). And UnitModule significantly improves the performance of all object detection models we test, especially for models with a small number of parameters. In addition, UnitModule with a small number of parameters of 31K has little effect on the inference speed of the original object detection model. Our quantitative and visual analysis also demonstrates the effectiveness of UnitModule in enhancing the input image and improving the perception ability of the detector for object features.
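What a UCRT-style augmentation might look like, as a speculative sketch (the abstract does not specify the formulation; channel-statistics matching with a random blend strength is one plausible reading):

```python
import numpy as np

rng = np.random.default_rng(0)

def ucrt_like(img: np.ndarray, ref: np.ndarray) -> np.ndarray:
    # Push the image's per-channel statistics toward those of a reference
    # underwater image with a different color cast, by a random amount.
    strength = rng.uniform(0.2, 0.8)
    out = img.astype(np.float32)
    ref = ref.astype(np.float32)
    for c in range(3):
        mu_i, sd_i = out[..., c].mean(), out[..., c].std() + 1e-6
        mu_r, sd_r = ref[..., c].mean(), ref[..., c].std() + 1e-6
        matched = (out[..., c] - mu_i) / sd_i * sd_r + mu_r
        out[..., c] = (1 - strength) * out[..., c] + strength * matched
    return np.clip(out, 0, 255).astype(np.uint8)

greenish = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)
bluish = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)
augmented = ucrt_like(greenish, bluish)
```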

A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos

  • paper_url: http://arxiv.org/abs/2309.04702
  • repo_url: https://github.com/alfredqin/stnet
  • paper_authors: Chao Qin, Jiale Cao, Huazhu Fu, Rao Muhammad Anwer, Fahad Shahbaz Khan
  • for: Detecting breast lesions in ultrasound videos, a crucial task for computer-aided diagnosis. Existing video-based methods aggregate temporal features with self-attention, a strategy that struggles to perform deep feature aggregation and ignores useful local information.
  • methods: A spatial-temporal deformable attention framework, STNet, introduces a spatial-temporal deformable attention module for local spatial-temporal feature fusion, enabling deep feature aggregation at every stage of both the encoder and decoder. To further accelerate detection, an encoder feature shuffle strategy shares the backbone and encoder features across frames and shuffles the encoder features so the decoder can predict multiple frames during inference.
  • results: On the public breast lesion ultrasound video dataset, STNet achieves state-of-the-art detection performance while running at twice the inference speed. Code and models are available at https://github.com/AlfredQin/STNet.
    Abstract Detecting breast lesion in videos is crucial for computer-aided diagnosis. Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation. We argue that such a strategy struggles to effectively perform deep feature aggregation and ignores the useful local information. To tackle these issues, we propose a spatial-temporal deformable attention based framework, named STNet. Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion. The spatial-temporal deformable attention module enables deep feature aggregation in each stage of both encoder and decoder. To further accelerate the detection speed, we introduce an encoder feature shuffle strategy for multi-frame prediction during inference. In our encoder feature shuffle strategy, we share the backbone and encoder features, and shuffle encoder features for decoder to generate the predictions of multiple frames. The experiments on the public breast lesion ultrasound video dataset show that our STNet obtains a state-of-the-art detection performance, while operating twice as fast inference speed. The code and model are available at https://github.com/AlfredQin/STNet.

DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions

  • paper_url: http://arxiv.org/abs/2309.04682
  • repo_url: None
  • paper_authors: Teng Fu, Xiaocong Wang, Haiyang Yu, Ke Niu, Bin Li, Xiangyang Xue
  • for: Improving multiple object tracking (MOT) under severe occlusions.
  • methods: An end-to-end trainable DeNoising Transformer (DNMOT) explicitly simulates occlusion scenarios: trajectories are augmented with noise during training so the model learns a denoising process in an encoder-decoder architecture (see the sketch below), and a Cascaded Mask strategy coordinates the interaction between different query types in the decoder to prevent mutual suppression between neighboring trajectories in crowded scenes. No additional modules such as matching strategies or motion state estimation are needed at inference.
  • results: Extensive experiments on the MOT17, MOT20, and DanceTrack datasets show that the method outperforms previous state-of-the-art methods by a clear margin.
    Abstract Multiple object tracking (MOT) tends to become more challenging when severe occlusions occur. In this paper, we analyze the limitations of traditional Convolutional Neural Network-based methods and Transformer-based methods in handling occlusions and propose DNMOT, an end-to-end trainable DeNoising Transformer for MOT. To address the challenge of occlusions, we explicitly simulate the scenarios when occlusions occur. Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture, so that our model can exhibit strong robustness and perform well under crowded scenes. Additionally, we propose a Cascaded Mask strategy to better coordinate the interaction between different types of queries in the decoder to prevent the mutual suppression between neighboring trajectories under crowded scenes. Notably, the proposed method requires no additional modules like matching strategy and motion state estimation in inference. We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
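The trajectory-noising side of training can be pictured with a minimal sketch (the (cx, cy, w, h) box format and the noise scale are illustrative assumptions):

```python
import torch

def noise_trajectory(gt_boxes: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    """Jitter ground-truth boxes (cx, cy, w, h) so the decoder is trained
    to denoise them back to the clean trajectory."""
    noise = torch.randn_like(gt_boxes)
    noisy = gt_boxes.clone()
    noisy[:, :2] += noise[:, :2] * gt_boxes[:, 2:] * scale       # shift centers
    noisy[:, 2:] *= (1 + scale * noise[:, 2:]).clamp(0.5, 1.5)   # rescale sizes
    return noisy

track = torch.tensor([[100.0, 120.0, 40.0, 80.0]])   # one box per frame
noisy_track = noise_trajectory(track)
# Training target: model(noisy_track, image_features) ≈ track
```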

BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification

  • paper_url: http://arxiv.org/abs/2309.04675
  • repo_url: None
  • paper_authors: Takuro Fujii, Shuhei Tarashima
  • for: Text-based person re-identification (TBPReID), which retrieves person images matching a given textual query.
  • methods: The Bidirectional Local-Matching (BiLMa) framework aligns image and text parts by jointly optimizing Masked Language Modeling (MLM) and Masked Image Modeling (MIM) during training, adding the image-to-text matching direction missing from prior text-to-image-only approaches. To narrow the semantic gap between image and text in MIM, Semantic MIM (SemMIM) labels masked image tokens automatically with a state-of-the-art human parser.
  • results: BiLMa with SemMIM achieves state-of-the-art Rank@1 and mAP scores on three benchmarks.
    Abstract Text-based person re-identification (TBPReID) aims to retrieve person images represented by a given textual query. In this task, how to effectively align images and texts globally and locally is a crucial challenge. Recent works have obtained high performances by solving Masked Language Modeling (MLM) to align image/text parts. However, they only performed uni-directional (i.e., from image to text) local-matching, leaving room for improvement by introducing opposite-directional (i.e., from text to image) local-matching. In this work, we introduce Bidirectional Local-Matching (BiLMa) framework that jointly optimize MLM and Masked Image Modeling (MIM) in TBPReID model training. With this framework, our model is trained so as the labels of randomly masked both image and text tokens are predicted by unmasked tokens. In addition, to narrow the semantic gap between image and text in MIM, we propose Semantic MIM (SemMIM), in which the labels of masked image tokens are automatically given by a state-of-the-art human parser. Experimental results demonstrate that our BiLMa framework with SemMIM achieves state-of-the-art Rank@1 and mAP scores on three benchmarks.

SSHNN: Semi-Supervised Hybrid NAS Network for Echocardiographic Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.04672
  • repo_url: None
  • paper_authors: Renqi Chen, Jingjing Luo, Fan Nian, Yuhui Cen, Yiheng Peng, Zekuan Yu
  • for: Accurate medical image segmentation, particularly for echocardiographic images whose unavoidable noise demands elaborate network design.
  • methods: A semi-supervised hybrid NAS network (SSHNN) uses convolution operations for layer-wise feature fusion instead of normalized scalars, avoiding detail loss and making the NAS-searched encoder stronger; Transformers compensate for global context, and a U-shaped decoder efficiently connects global context with local features. A Mean-Teacher semi-supervised algorithm overcomes the limited volume of labeled medical image data (see the sketch below).
  • results: Extensive experiments on the CAMUS echocardiography dataset show that SSHNN outperforms state-of-the-art methods and achieves accurate segmentation.
    Abstract Accurate medical image segmentation especially for echocardiographic images with unmissable noise requires elaborate network design. Compared with manual design, Neural Architecture Search (NAS) realizes better segmentation results due to larger search space and automatic optimization, but most of the existing methods are weak in layer-wise feature aggregation and adopt a ``strong encoder, weak decoder" structure, insufficient to handle global relationships and local details. To resolve these issues, we propose a novel semi-supervised hybrid NAS network for accurate medical image segmentation termed SSHNN. In SSHNN, we creatively use convolution operation in layer-wise feature fusion instead of normalized scalars to avoid losing details, making NAS a stronger encoder. Moreover, Transformers are introduced for the compensation of global context and U-shaped decoder is designed to efficiently connect global context with local features. Specifically, we implement a semi-supervised algorithm Mean-Teacher to overcome the limited volume problem of labeled medical image dataset. Extensive experiments on CAMUS echocardiography dataset demonstrate that SSHNN outperforms state-of-the-art approaches and realizes accurate segmentation. Code will be made publicly available.
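Since SSHNN adopts the standard Mean-Teacher algorithm for its semi-supervised component, a generic sketch of that algorithm is shown below (not SSHNN's exact configuration; the toy network and coefficients are illustrative): the teacher is an exponential moving average of the student, and a consistency loss ties their predictions on unlabeled images.

```python
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Conv2d(3, 1, 3, padding=1)   # stand-in for the segmentation net
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                     # teacher is never trained directly

@torch.no_grad()
def update_teacher(alpha: float = 0.99):
    # Teacher weights are an exponential moving average of the student's
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1 - alpha)

x = torch.randn(2, 3, 32, 32)                   # unlabeled images
noisy = x + 0.1 * torch.randn_like(x)           # student sees a perturbed view
consistency = F.mse_loss(student(noisy), teacher(x))
consistency.backward()                          # plus a supervised loss on labeled data
update_teacher()
```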

Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization

  • paper_url: http://arxiv.org/abs/2309.04669
  • repo_url: https://github.com/jy0205/LaVIT
  • paper_authors: Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Yadong Mu, Di Zhang, Wenwu Ou, Kun Gai
  • for: Breaking the limitation of existing models that treat visual input merely as a prompt for a frozen LLM, by representing both vision and language in a unified form to improve multimodal comprehension.
  • methods: LaVIT (Language-VIsion Transformer) introduces a visual tokenizer that translates non-linguistic images into sequences of discrete tokens, like a foreign language the LLM can read, so the model handles images and text indiscriminately under a unified generative learning paradigm. The visual tokens carry high-level semantics and support dynamic sequence lengths that vary with image content.
  • results: Pre-trained on a web-scale image-text corpus, LaVIT outperforms existing models by a large margin on downstream tasks.
    Abstract Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to data across several modalities. The prevailing approaches primarily regard visual input as the prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified representation. To this end, we craft a visual tokenizer that translates the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image content. Coped with this visual tokenizer, the presented foundation model called LaVIT (Language-VIsion Transformer) can handle both image and text indiscriminately under a unified generative learning paradigm. Pre-trained on the web-scale image-text corpus, LaVIT is empowered with impressive multi-modal comprehension capability. The extensive experiments showcase that it outperforms existing models by a large margin on downstream tasks. Our code and models will be available at https://github.com/jy0205/LaVIT.

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.05674
  • repo_url: https://github.com/xianlin7/convformer
  • paper_authors: Xian Lin, Zengqiang Yan, Xianbo Deng, Chuansheng Zheng, Li Yu
  • for: Improving segmentation performance in transformer-based frameworks for medical images by mitigating attention collapse.
  • methods: CNN-style Transformers (ConvFormer) promote better attention convergence. ConvFormer consists of pooling, CNN-style self-attention (CSA), and a convolutional feed-forward network (CFFN), corresponding to tokenization, self-attention, and the feed-forward network in vanilla Vision Transformers. It adopts 2D convolution and max-pooling for position-information preservation and feature-size reduction, and CSA establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes.
  • results: Experiments on multiple datasets show that ConvFormer works as a plug-and-play module that consistently improves the segmentation performance of transformer-based frameworks.
    Abstract Transformers have been extensively studied in medical image segmentation to build pairwise long-range dependence. Yet, relatively limited well-annotated medical image data makes transformers struggle to extract diverse global features, resulting in attention collapse where attention maps become similar or even identical. Comparatively, convolutional neural networks (CNNs) have better convergence properties on small-scale training data but suffer from limited receptive fields. Existing works are dedicated to exploring the combinations of CNN and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose to build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and convolutional feed-forward network (CFFN) corresponding to tokenization, self-attention, and feed-forward network in vanilla vision transformers. In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction. In this way, CSA takes 2D feature maps as inputs and establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes. Following CSA, 2D convolution is utilized for feature refinement through CFFN. Experimental results on multiple datasets demonstrate the effectiveness of ConvFormer working as a plug-and-play module for consistent performance improvement of transformer-based frameworks. Code is available at https://github.com/xianlin7/ConvFormer.

Progressive Feature Adjustment for Semi-supervised Learning from Pretrained Models

  • paper_url: http://arxiv.org/abs/2309.04659
  • repo_url: None
  • paper_authors: Hai-Ming Xu, Lingqiao Liu, Hao Chen, Ehsan Abbasnejad, Rafael Felix
  • for: Improving semi-supervised learning (SSL) when starting from pretrained models, alleviating the burden of data annotation.
  • methods: Pseudo-labels from the unlabeled data are used to update the feature extractor, which is less sensitive to incorrect labels, while the classifier is trained only on labeled data. The feature extractor is progressively adjusted so its induced feature distribution maintains good class separability even under strong input perturbation, counteracting the source-data bias that self-training otherwise magnifies.
  • results: Extensive experimental studies show that the proposed approach outperforms existing solutions.
    Abstract As an effective way to alleviate the burden of data annotation, semi-supervised learning (SSL) provides an attractive solution due to its ability to leverage both labeled and unlabeled data to build a predictive model. While significant progress has been made recently, SSL algorithms are often evaluated and developed under the assumption that the network is randomly initialized. This is in sharp contrast to most vision recognition systems that are built from fine-tuning a pretrained network for better performance. While the marriage of SSL and a pretrained model seems to be straightforward, recent literature suggests that naively applying state-of-the-art SSL with a pretrained model fails to unleash the full potential of training data. In this paper, we postulate the underlying reason is that the pretrained feature representation could bring a bias inherited from the source data, and the bias tends to be magnified through the self-training process in a typical SSL algorithm. To overcome this issue, we propose to use pseudo-labels from the unlabelled data to update the feature extractor that is less sensitive to incorrect labels and only allow the classifier to be trained from the labeled data. More specifically, we progressively adjust the feature extractor to ensure its induced feature distribution maintains a good class separability even under strong input perturbation. Through extensive experimental studies, we show that the proposed approach achieves superior performance over existing solutions.

Generation and Recombination for Multifocus Image Fusion with Free Number of Inputs

  • paper_url: http://arxiv.org/abs/2309.04657
  • repo_url: None
  • paper_authors: Huafeng Li, Dan Wang, Yuxin Huang, Yafei Zhang, Zhengtao Yu
  • for: Overcoming the limitation of optical lenses by fusing an arbitrary number of multifocus images simultaneously.
  • methods: A combined generation and recombination model (GRFusion) detects the focus property of each source image independently, identifies hard pixels from inconsistencies among the focus-area detections, generates full-focus images with a multi-directional gradient embedding method, and constructs the fused result via a hard-pixel-guided recombination mechanism.
  • results: Effective and superior fusion performance, independent of the number of inputs and with improved visual quality.
    Abstract Multifocus image fusion is an effective way to overcome the limitation of optical lenses. Many existing methods obtain fused results by generating decision maps. However, such methods often assume that the focused areas of the two source images are complementary, making it impossible to achieve simultaneous fusion of multiple images. Additionally, the existing methods ignore the impact of hard pixels on fusion performance, limiting the visual quality improvement of fusion image. To address these issues, a combining generation and recombination model, termed as GRFusion, is proposed. In GRFusion, focus property detection of each source image can be implemented independently, enabling simultaneous fusion of multiple source images and avoiding information loss caused by alternating fusion. This makes GRFusion free from the number of inputs. To distinguish the hard pixels from the source images, we achieve the determination of hard pixels by considering the inconsistency among the detection results of focus areas in source images. Furthermore, a multi-directional gradient embedding method for generating full focus images is proposed. Subsequently, a hard-pixel-guided recombination mechanism for constructing fused result is devised, effectively integrating the complementary advantages of feature reconstruction-based method and focused pixel recombination-based method. Extensive experimental results demonstrate the effectiveness and the superiority of the proposed method.The source code will be released on https://github.com/xxx/xxx.

Exploring Robust Features for Improving Adversarial Robustness

  • paper_url: http://arxiv.org/abs/2309.04650
  • repo_url: None
  • paper_authors: Hong Wang, Yuefan Deng, Shinjae Yoo, Yuewei Lin
  • for: Enabling the use of deep neural networks (DNNs) in safety-critical applications, where their fragility to carefully designed adversarial attacks is an obstacle.
  • methods: A feature disentanglement model segregates robust features, i.e., those invariant between clean images and their adversarial examples, from non-robust features and domain-specific features.
  • results: Extensive experiments on four widely used datasets under different attacks show that the robust features improve adversarial robustness over state-of-the-art approaches, and the trained domain discriminator identifies domain-specific features of clean images and adversarial examples almost perfectly, enabling adversarial example detection at no extra computational cost and allowing separate classifiers for clean and adversarial inputs without a drop in clean accuracy.
    Abstract While deep neural networks (DNNs) have revolutionized many fields, their fragility to carefully designed adversarial attacks impedes the usage of DNNs in safety-critical applications. In this paper, we strive to explore the robust features which are not affected by the adversarial perturbations, i.e., invariant to the clean image and its adversarial examples, to improve the model's adversarial robustness. Specifically, we propose a feature disentanglement model to segregate the robust features from non-robust features and domain specific features. The extensive experiments on four widely used datasets with different attacks demonstrate that robust features obtained from our model improve the model's adversarial robustness compared to the state-of-the-art approaches. Moreover, the trained domain discriminator is able to identify the domain specific features from the clean images and adversarial examples almost perfectly. This enables adversarial example detection without incurring additional computational costs. With that, we can also specify different classifiers for clean images and adversarial examples, thereby avoiding any drop in clean image accuracy.

cs.AI - 2023-09-09

How to Evaluate Semantic Communications for Images with ViTScore Metric?

  • paper_url: http://arxiv.org/abs/2309.04891
  • repo_url: None
  • paper_authors: Tingting Zhu, Bo Peng, Jifan Liang, Tingchen Han, Hai Wan, Jingqiao Fu, Junjie Chen
  • for: Proposing a new metric for image semantic similarity, since the metrics traditionally used for images cannot evaluate the semantic information exchanged in semantic communications (SC).
  • methods: A Transformer-based metric, the Vision Transformer Score (ViTScore), is proposed, inspired by BERTScore in NLP, and proven to have three important properties, symmetry, boundedness, and normalization, which make it convenient and intuitive for image measurement (see the sketch below).
  • results: Across five classes of experiments, ViTScore evaluates image semantic similarity better than three typical metrics (PSNR, MS-SSIM, and LPIPS), indicating that it is an effective performance metric for SC scenarios.
    Abstract Semantic communications (SC) have been expected to be a new paradigm shifting to catalyze the next generation communication, whose main concerns shift from accurate bit transmission to effective semantic information exchange in communications. However, the previous and widely-used metrics for images are not applicable to evaluate the image semantic similarity in SC. Classical metrics to measure the similarity between two images usually rely on the pixel level or the structural level, such as the PSNR and the MS-SSIM. Straightforwardly using some tailored metrics based on deep-learning methods in CV community, such as the LPIPS, is infeasible for SC. To tackle this, inspired by BERTScore in NLP community, we propose a novel metric for evaluating image semantic similarity, named Vision Transformer Score (ViTScore). We prove theoretically that ViTScore has 3 important properties, including symmetry, boundedness, and normalization, which make ViTScore convenient and intuitive for image measurement. To evaluate the performance of ViTScore, we compare ViTScore with 3 typical metrics (PSNR, MS-SSIM, and LPIPS) through 5 classes of experiments. Experimental results demonstrate that ViTScore can better evaluate the image semantic similarity than the other 3 typical metrics, which indicates that ViTScore is an effective performance metric when deployed in SC scenarios.
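The BERTScore-style matching that motivates ViTScore can be sketched as follows over ViT patch embeddings (a sketch of the matching idea only; the published ViTScore definition may differ in details):

```python
import torch

def vitscore_like(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """BERTScore-style F1 over two sets of patch embeddings,
    each of shape (num_patches, dim)."""
    a = torch.nn.functional.normalize(emb_a, dim=-1)
    b = torch.nn.functional.normalize(emb_b, dim=-1)
    sim = a @ b.T                              # pairwise cosine similarities
    recall = sim.max(dim=1).values.mean()      # each patch of A matched to B
    precision = sim.max(dim=0).values.mean()   # each patch of B matched to A
    return (2 * precision * recall / (precision + recall)).item()

# toy usage with random tensors standing in for ViT patch features
score = vitscore_like(torch.randn(196, 768), torch.randn(196, 768))
print(score)
```

Swapping the two inputs swaps precision and recall, so the F1 is symmetric, and cosine similarity keeps it bounded, mirroring two of the properties the paper proves for ViTScore itself.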

Evaluating Chatbots to Promote Users’ Trust – Practices and Open Problems

  • paper_url: http://arxiv.org/abs/2309.05680
  • repo_url: None
  • paper_authors: Biplav Srivastava, Kausik Lakkaraju, Tarmo Koppel, Vignesh Narayanan, Ashish Kundu, Sachindra Joshi
  • for: Evaluating chatbots (collaborative assistants) with respect to trust in service or product performance, user satisfaction, and long-term unintended consequences for society.
  • methods: A review of current practices for chatbot testing and of the gaps that remain as open problems.
  • results: The paper identifies open problems in testing chatbots for user trust and outlines a path forward.
    Abstract Chatbots, the common moniker for collaborative assistants, are Artificial Intelligence (AI) software that enables people to naturally interact with them to get tasks done. Although chatbots have been studied since the dawn of AI, they have particularly caught the imagination of the public and businesses since the launch of easy-to-use and general-purpose Large Language Model-based chatbots like ChatGPT. As businesses look towards chatbots as a potential technology to engage users, who may be end customers, suppliers, or even their own employees, proper testing of chatbots is important to address and mitigate issues of trust related to service or product performance, user satisfaction and long-term unintended consequences for society. This paper reviews current practices for chatbot testing, identifies gaps as open problems in pursuit of user trust, and outlines a path forward.

Recall-driven Precision Refinement: Unveiling Accurate Fall Detection using LSTM

  • paper_url: http://arxiv.org/abs/2309.07154
  • repo_url: None
  • paper_authors: Rishabh Mondal, Prasun Ghosal
  • for: Addressing fall incidents among the elderly by developing an accurate fall detection system.
  • methods: The system combines accelerometer and gyroscope sensor data with a deep learning model, specifically a Long Short-Term Memory (LSTM) network, and achieves real-time execution on Raspberry Pi hardware (see the sketch below). Pruning techniques strategically fine-tune the LSTM's architecture and parameters, and recall is prioritized over precision to minimize false negatives and enable timely intervention.
  • results: Extensive experimentation shows remarkable performance, with a high recall rate while maintaining 96% specificity; the resulting system promptly sends notifications so vulnerable individuals receive timely assistance.
    Abstract This paper presents an innovative approach to address the pressing concern of fall incidents among the elderly by developing an accurate fall detection system. Our proposed system combines state-of-the-art technologies, including accelerometer and gyroscope sensors, with deep learning models, specifically Long Short-Term Memory (LSTM) networks. Real-time execution capabilities are achieved through the integration of Raspberry Pi hardware. We introduce pruning techniques that strategically fine-tune the LSTM model's architecture and parameters to optimize the system's performance. We prioritize recall over precision, aiming to accurately identify falls and minimize false negatives for timely intervention. Extensive experimentation and meticulous evaluation demonstrate remarkable performance metrics, emphasizing a high recall rate while maintaining a specificity of 96\%. Our research culminates in a state-of-the-art fall detection system that promptly sends notifications, ensuring vulnerable individuals receive timely assistance and improve their overall well-being. Applying LSTM models and incorporating pruning techniques represent a significant advancement in fall detection technology, offering an effective and reliable fall prevention and intervention solution.
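A minimal PyTorch sketch of this kind of model (the layer sizes, 6-axis IMU input layout, and class weights are illustrative assumptions, not the paper's tuned configuration):

```python
import torch
import torch.nn as nn

class FallDetector(nn.Module):
    """LSTM classifier over windows of 6-axis IMU samples
    (3-axis accelerometer + 3-axis gyroscope)."""
    def __init__(self, in_features: int = 6, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)       # fall vs. no-fall

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                  # (batch, time, hidden)
        return self.head(out[:, -1])           # classify from the last step

model = FallDetector()
window = torch.randn(8, 100, 6)               # 8 windows of 100 IMU samples
logits = model(window)

# Recall-oriented training: weight the fall class more heavily so false
# negatives are penalized harder than false alarms (weights illustrative).
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 4.0]))
loss = loss_fn(logits, torch.randint(0, 2, (8,)))
```

Weighting the positive class trades precision for recall, which matches the paper's recall-first objective.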

Distributional Data Augmentation Methods for Low Resource Language

  • paper_url: http://arxiv.org/abs/2309.04862
  • repo_url: https://github.com/mosh98/text_aug_low_res
  • paper_authors: Mosleh Mahamud, Zed Lee, Isak Samsten
  • for: Improving predictive performance through text augmentation, particularly for low-resource languages.
  • methods: Two extensions of easy data augmentation (EDA) are proposed: easy distributional data augmentation (EDDA) and type-specific similar word replacement (TSSR), which use semantic word-context information and part-of-speech tags for word replacement and augmentation (see the sketch below).
  • results: On two representative Swedish datasets, taken as an example of a low-resource language, the proposed methods improve classification performance, measured by F1 score, in low-resource settings.
    Abstract Text augmentation is a technique for constructing synthetic data from an under-resourced corpus to improve predictive performance. Synthetic data generation is common in numerous domains. However, recently text augmentation has emerged in natural language processing (NLP) to improve downstream tasks. One of the current state-of-the-art text augmentation techniques is easy data augmentation (EDA), which augments the training data by injecting and replacing synonyms and randomly permuting sentences. One major obstacle with EDA is the need for versatile and complete synonym dictionaries, which cannot be easily found in low-resource languages. To improve the utility of EDA, we propose two extensions, easy distributional data augmentation (EDDA) and type specific similar word replacement (TSSR), which uses semantic word context information and part-of-speech tags for word replacement and augmentation. In an extensive empirical evaluation, we show the utility of the proposed methods, measured by F1 score, on two representative datasets in Swedish as an example of a low-resource language. With the proposed methods, we show that augmented data improve classification performances in low-resource settings.
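A toy sketch of the TSSR idea (the synonym table and tag set are made up; in practice candidates would come from word embeddings or a lexicon for the target low-resource language):

```python
import random

rng = random.Random(0)

# Toy synonym table keyed by (word, POS); a real system would derive
# candidates distributionally rather than from a hand-written table.
SYNONYMS = {
    ("quick", "ADJ"): ["fast", "rapid"],
    ("run", "VERB"): ["sprint", "jog"],
}

def tssr_augment(tokens, pos_tags, p=0.3):
    """Type-specific similar-word replacement: only swap a word for a
    candidate sharing its part-of-speech tag."""
    out = []
    for tok, pos in zip(tokens, pos_tags):
        cands = SYNONYMS.get((tok, pos))
        out.append(rng.choice(cands) if cands and rng.random() < p else tok)
    return out

print(tssr_augment(["the", "quick", "fox", "can", "run"],
                   ["DET", "ADJ", "NOUN", "AUX", "VERB"]))
```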

AmbientFlow: Invertible generative models from incomplete, noisy measurements

  • paper_url: http://arxiv.org/abs/2309.04856
  • repo_url: None
  • paper_authors: Varun A. Kelkar, Rucha Deshpande, Arindam Banerjee, Mark A. Anastasio
  • for: Learning flow-based generative models directly from noisy and incomplete data, which is easier to acquire in computed imaging than large, high-quality datasets.
  • methods: Using variational Bayesian methods, a novel framework is established for training flow-based generative models from noisy, incomplete measurements.
  • results: Extensive numerical studies show that AmbientFlow correctly learns the object distribution, and its utility is demonstrated in a downstream image reconstruction task.
    Abstract Generative models have gained popularity for their potential applications in imaging science, such as image reconstruction, posterior sampling and data sharing. Flow-based generative models are particularly attractive due to their ability to tractably provide exact density estimates along with fast, inexpensive and diverse samples. Training such models, however, requires a large, high quality dataset of objects. In applications such as computed imaging, it is often difficult to acquire such data due to requirements such as long acquisition time or high radiation dose, while acquiring noisy or partially observed measurements of these objects is more feasible. In this work, we propose AmbientFlow, a framework for learning flow-based generative models directly from noisy and incomplete data. Using variational Bayesian methods, a novel framework for establishing flow-based generative models from noisy, incomplete data is proposed. Extensive numerical studies demonstrate the effectiveness of AmbientFlow in correctly learning the object distribution. The utility of AmbientFlow in a downstream inference task of image reconstruction is demonstrated.

Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

  • paper_url: http://arxiv.org/abs/2309.04849
  • repo_url: None
  • paper_authors: Debaditya Shome, Ali Etemad
  • for: Proposing a speech emotion recognition (SER) framework that learns strong prosodic and linguistic representations of emotion from speech.
  • methods: Cross-modal knowledge distillation is applied during training: information is distilled at both the embedding and logit levels from a pair of pre-trained prosodic and linguistic teachers fine-tuned for SER (see the sketch below). At inference, the method uses only a stream of speech signals for unimodal SER, reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors.
  • results: Experiments on the IEMOCAP benchmark show that the method outperforms other unimodal and multimodal techniques by a considerable margin, achieving state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component.
    Abstract We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method only uses a stream of speech signals to perform unimodal SER thus reducing computation overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
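The two-level distillation can be sketched like this (shown for a single teacher with illustrative weights and temperature; the paper distills from both a prosodic and a linguistic teacher):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, student_emb, teacher_logits, teacher_emb,
                 labels, T=2.0, alpha=0.5, beta=0.5):
    # Task loss on ground-truth emotion labels
    ce = F.cross_entropy(student_logits, labels)
    # Logit-level distillation: match the softened teacher distribution
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    # Embedding-level distillation: match intermediate representations
    emb = F.mse_loss(student_emb, teacher_emb)
    return ce + alpha * kd + beta * emb

loss = distill_loss(torch.randn(8, 4), torch.randn(8, 256),
                    torch.randn(8, 4), torch.randn(8, 256),
                    torch.randint(0, 4, (8,)))
```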

Verifiable Reinforcement Learning Systems via Compositionality

  • paper_url: http://arxiv.org/abs/2309.06420
  • repo_url: None
  • paper_authors: Cyrus Neary, Aryaman Singh Samyal, Christos Verginis, Murat Cubuktepe, Ufuk Topcu
  • for: A verifiable and compositional reinforcement learning framework in which a collection of RL subsystems, each learning a separate subtask, are composed to achieve an overall task.
  • methods: A high-level model, represented as a parametric Markov decision process, plans and analyzes compositions of the subsystems, which are implemented as deep RL agents operating under partial observability. Interfaces between subsystems enable automatic decomposition of task specifications (e.g., reach a target set of states with probability at least 0.95) into individual subtask specifications, allowing independent training and testing; when the learned policies cannot satisfy all subtask specifications, an optimal set of parameters in the high-level model is found to automatically update them.
  • results: Experimental results demonstrate the framework's capabilities in environments with full and partial observability, discrete and continuous state and action spaces, and deterministic and stochastic dynamics.
    Abstract We propose a framework for verifiable and compositional reinforcement learning (RL) in which a collection of RL subsystems, each of which learns to accomplish a separate subtask, are composed to achieve an overall task. The framework consists of a high-level model, represented as a parametric Markov decision process, which is used to plan and analyze compositions of subsystems, and of the collection of low-level subsystems themselves. The subsystems are implemented as deep RL agents operating under partial observability. By defining interfaces between the subsystems, the framework enables automatic decompositions of task specifications, e.g., reach a target set of states with a probability of at least 0.95, into individual subtask specifications, i.e. achieve the subsystem's exit conditions with at least some minimum probability, given that its entry conditions are met. This in turn allows for the independent training and testing of the subsystems. We present theoretical results guaranteeing that if each subsystem learns a policy satisfying its subtask specification, then their composition is guaranteed to satisfy the overall task specification. Conversely, if the subtask specifications cannot all be satisfied by the learned policies, we present a method, formulated as the problem of finding an optimal set of parameters in the high-level model, to automatically update the subtask specifications to account for the observed shortcomings. The result is an iterative procedure for defining subtask specifications, and for training the subsystems to meet them. Experimental results demonstrate the presented framework's novel capabilities in environments with both full and partial observability, discrete and continuous state and action spaces, as well as deterministic and stochastic dynamics.

Global Convergence of Receding-Horizon Policy Search in Learning Estimator Designs

  • paper_url: http://arxiv.org/abs/2309.04831
  • repo_url: https://github.com/xiangyuan-zhang/learningkf
  • paper_authors: Xiangyuan Zhang, Saviz Mowlavi, Mouhacine Benosman, Tamer Başar
  • for: This work develops the receding-horizon policy gradient (RHPG) algorithm for learning the optimal linear estimator design, i.e., the Kalman filter (KF).
  • methods: RHPG integrates vanilla policy search directions into a dynamic programming outer loop that decomposes the infinite-horizon KF problem, constrained and non-convex in the policy parameter, into a sequence of static estimation problems that are unconstrained and strongly convex, together with fine-grained analyses of the optimization landscape and sample complexity guarantees.
  • results: RHPG achieves provable global convergence without prior knowledge of the system for initialization and without requiring the target system to be open-loop stable. As an initial attempt at reinforcement learning algorithms for control with performance guarantees grounded in classic control theory, the approach is validated by learning the Kalman filter design of a large-scale convection-diffusion model; the code repository is at https://github.com/xiangyuan-zhang/LearningKF.
    Abstract We introduce the receding-horizon policy gradient (RHPG) algorithm, the first PG algorithm with provable global convergence in learning the optimal linear estimator designs, i.e., the Kalman filter (KF). Notably, the RHPG algorithm does not require any prior knowledge of the system for initialization and does not require the target system to be open-loop stable. The key of RHPG is that we integrate vanilla PG (or any other policy search directions) into a dynamic programming outer loop, which iteratively decomposes the infinite-horizon KF problem that is constrained and non-convex in the policy parameter into a sequence of static estimation problems that are unconstrained and strongly-convex, thus enabling global convergence. We further provide fine-grained analyses of the optimization landscape under RHPG and detail the convergence and sample complexity guarantees of the algorithm. This work serves as an initial attempt to develop reinforcement learning algorithms specifically for control applications with performance guarantees by utilizing classic control theory in both algorithmic design and theoretical analyses. Lastly, we validate our theories by deploying the RHPG algorithm to learn the Kalman filter design of a large-scale convection-diffusion model. We open-source the code repository at \url{https://github.com/xiangyuan-zhang/LearningKF}.

Good-looking but Lacking Faithfulness: Understanding Local Explanation Methods through Trend-based Testing

  • paper_url: http://arxiv.org/abs/2309.05679
  • repo_url: https://github.com/jenniferho97/xai-trend-test
  • paper_authors: Jinwen He, Kai Chen, Guozhu Meng, Jiangshan Zhang, Congyi Li
  • for: Evaluating the faithfulness of local explanation methods used to explain model decisions.
  • methods: Three novel trend-based faithfulness tests are proposed (see the sketch below), and ten popular explanation methods are evaluated with them.
  • results: The study finds that traditional faithfulness tests suffer from the random dominance problem, where random selection performs best, especially on complex data, whereas the new trend-based tests assess faithfulness better on image, natural language, and security tasks, enabling the first assessment of explanation methods on complex data. Downstream tasks also benefit: model debugging with faithful explanation methods detects and corrects accuracy and security problems much better.
    Abstract While enjoying the great achievements brought by deep learning (DL), people are also worried about the decision made by DL models, since the high degree of non-linearity of DL models makes the decision extremely difficult to understand. Consequently, attacks such as adversarial attacks are easy to carry out, but difficult to detect and explain, which has led to a boom in the research on local explanation methods for explaining model decisions. In this paper, we evaluate the faithfulness of explanation methods and find that traditional tests on faithfulness encounter the random dominance problem, \ie, the random selection performs the best, especially for complex data. To further solve this problem, we propose three trend-based faithfulness tests and empirically demonstrate that the new trend tests can better assess faithfulness than traditional tests on image, natural language and security tasks. We implement the assessment system and evaluate ten popular explanation methods. Benefiting from the trend tests, we successfully assess the explanation methods on complex data for the first time, bringing unprecedented discoveries and inspiring future research. Downstream tasks also greatly benefit from the tests. For example, model debugging equipped with faithful explanation methods performs much better for detecting and correcting accuracy and security problems.
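One way to picture a trend-style faithfulness check (a hedged sketch; the paper's three trend tests are defined differently and more carefully) is a deletion curve: occlude features in decreasing order of attributed importance and check that the model's confidence falls near-monotonically, which a random ordering typically fails to do.

```python
import numpy as np

def deletion_trend(model_fn, x, attribution, steps=10, baseline=0.0):
    """Occlude features from most- to least-important and record the
    model's confidence; return the curve and the fraction of steps on
    which confidence decreased (closer to 1.0 suggests faithfulness)."""
    order = np.argsort(-attribution.ravel())       # most important first
    flat = x.ravel().copy()
    scores = [model_fn(flat.reshape(x.shape))]
    chunk = len(order) // steps
    for s in range(steps):
        flat[order[s * chunk:(s + 1) * chunk]] = baseline
        scores.append(model_fn(flat.reshape(x.shape)))
    diffs = np.diff(scores)
    return scores, float((diffs <= 0).mean())

# toy model: "confidence" is the mean of the first 10 pixels
model = lambda im: im.ravel()[:10].mean()
x = np.random.rand(8, 8)
attr = np.zeros(64); attr[:10] = 1.0               # explanation marks those pixels
print(deletion_trend(model, x, attr))
```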

Timely Fusion of Surround Radar/Lidar for Object Detection in Autonomous Driving Systems

  • paper_url: http://arxiv.org/abs/2309.04806
  • repo_url: None
  • paper_authors: Wenjing Xie, Tao Hu, Neiwen Ling, Guoliang Xing, Shaoshan Liu, Nan Guan
  • for: This paper aims to improve the fusion of surround Radar and Lidar sensor data for autonomous driving systems by developing techniques that work at the frequency of the faster Lidar instead of the slower Radar.
  • methods: The proposed method uses the state-of-the-art object detection model MVDNet to fuse surround Radar/Lidar data, with an enhanced training procedure that tolerates the temporal unalignment of the input data.
  • results: The proposed method achieves a high output frequency with little accuracy loss, making it a promising solution for real-time object detection in autonomous driving systems.
    Abstract Fusing Radar and Lidar sensor data can fully utilize their complementary advantages and provide more accurate reconstruction of the surrounding for autonomous driving systems. Surround Radar/Lidar can provide 360-degree view sampling at minimal cost, making them promising sensing hardware solutions for autonomous driving systems. However, due to the intrinsic physical constraints, the rotating speed of surround Radar, and thus the frequency to generate Radar data frames, is much lower than surround Lidar. Existing Radar/Lidar fusion methods have to work at the low frequency of surround Radar, which cannot meet the high responsiveness requirement of autonomous driving systems. This paper develops techniques to fuse surround Radar/Lidar with working frequency only limited by the faster surround Lidar instead of the slower surround Radar, based on the state-of-the-art object detection model MVDNet. The basic idea of our approach is simple: we let MVDNet work with temporally unaligned data from Radar/Lidar, so that fusion can take place at any time when a new Lidar data frame arrives, instead of waiting for the slow Radar data frame. However, directly applying MVDNet to temporally unaligned Radar/Lidar data greatly degrades its object detection accuracy. The key information revealed in this paper is that we can achieve high output frequency with little accuracy loss by enhancing the training procedure to explore the temporal redundancy in MVDNet so that it can tolerate the temporal unalignment of input data. We explore several different ways of training enhancement and compare them quantitatively with experiments.
    摘要 将雷达和激光雷达的感知数据融合,可以充分利用二者的互补优势,为自动驾驶系统提供更准确的周围环境重建。环视雷达/激光雷达能以最低成本提供360度视野采样,是自动驾驶系统颇具前景的感知硬件方案。然而,受固有物理约束,环视雷达的旋转速率以及生成数据帧的频率都远低于环视激光雷达。现有的雷达/激光融合方法只能工作在环视雷达的低频率上,无法满足自动驾驶系统的高响应性要求。本文基于最先进的目标检测模型MVDNet,提出了仅受更快的环视激光雷达频率限制的融合技术。我们的思路很简单:让MVDNet处理时间上未对齐的雷达/激光数据,使融合可以在每个新的激光数据帧到达时进行,而不必等待缓慢的雷达数据帧。然而,直接将MVDNet应用于时间未对齐的雷达/激光数据会显著降低其目标检测精度。本文的关键发现是:通过强化训练过程、挖掘MVDNet中的时间冗余,使其能够容忍输入数据的时间不对齐,即可在几乎不损失精度的情况下实现高输出频率。我们尝试了多种训练强化方法,并通过实验进行了量化比较。
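
To make the core idea concrete, here is a small sketch, under our own assumptions about frame rates and pairing logic, of matching each fast Lidar frame with the most recent slow Radar frame and randomly staling the Radar frame during training so the fused detector learns to tolerate temporal unalignment; the authors' actual MVDNet training pipeline is not reproduced here.

```python
# Pair every fast Lidar frame with the latest slow Radar frame, plus random
# extra staleness as a training-time augmentation (illustrative only).
import random

LIDAR_HZ, RADAR_HZ = 20, 5                       # assumed frame rates

def latest_radar_index(t_lidar, radar_times):
    """Most recent Radar frame at or before the Lidar timestamp."""
    return max(i for i, t in enumerate(radar_times) if t <= t_lidar)

def make_training_pairs(lidar_times, radar_times, max_extra_lag=2):
    pairs = []
    for t in lidar_times:
        j = latest_radar_index(t, radar_times)
        j = max(0, j - random.randint(0, max_extra_lag))   # random staleness
        pairs.append((t, radar_times[j]))
    return pairs

lidar_times = [i / LIDAR_HZ for i in range(40)]
radar_times = [i / RADAR_HZ for i in range(10)]
for lt, rt in make_training_pairs(lidar_times, radar_times)[:5]:
    print(f"lidar t={lt:.2f}s  <- fused with radar t={rt:.2f}s")
```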

Finding Influencers in Complex Networks: An Effective Deep Reinforcement Learning Approach

  • paper_url: http://arxiv.org/abs/2309.07153
  • repo_url: None
  • paper_authors: Changan Liu, Changjun Fan, Zhongzhi Zhang
  • for: 本文针对复杂网络中的影响力最大化(Influence Maximization)问题提出了一种有效的深度学习模型,以提升社交网络分析的效果。
  • methods: 本文提出了一种结合图神经网络与强化学习的端到端学习框架DREIM:以图神经网络为编码器、强化学习为解码器,并在大量小规模合成图上进行训练。
  • results: DREIM在大规模合成网络和真实网络上的解质量均超越现有基线方法,并被实证验证具有随网络规模线性扩展的能力。
    Abstract Maximizing influences in complex networks is a practically important but computationally challenging task for social network analysis, due to its NP-hard nature. Most current approximation or heuristic methods either require tremendous human design efforts or achieve unsatisfying balances between effectiveness and efficiency. Recent machine learning attempts only focus on speed but lack performance enhancement. In this paper, different from previous attempts, we propose an effective deep reinforcement learning model that achieves superior performance over traditional best influence maximization algorithms. Specifically, we design an end-to-end learning framework that combines a graph neural network as the encoder and reinforcement learning as the decoder, named DREIM. Through extensive training on small synthetic graphs, DREIM outperforms the state-of-the-art baseline methods on very large synthetic and real-world networks in solution quality, and we also empirically show its linear scalability with regard to the network size, which demonstrates its superiority in solving this problem.
    摘要 在复杂网络中最大化影响力是社交网络分析中一项实际重要但计算困难的任务,因为它具有NP难的性质。现有的近似或启发式方法要么需要大量人工设计,要么难以在效果与效率之间取得令人满意的平衡;近期的机器学习尝试只关注速度,缺乏性能上的提升。与以往工作不同,本文提出了一种有效的深度强化学习模型,其性能超越了传统最优的影响力最大化算法。具体而言,我们设计了一个端到端学习框架DREIM,以图神经网络为编码器、强化学习为解码器。通过在小规模合成图上的大量训练,DREIM在超大规模合成网络和真实网络上的解质量超越了最先进的基线方法;我们还通过实验证明其运行时间随网络规模线性扩展,展示了其在求解该问题上的优越性。
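
For context, the classic greedy baseline for influence maximization under the independent-cascade model (one of the traditional algorithms DREIM is positioned against) can be sketched as follows; this is not the paper's GNN+RL model, and the graph and parameters are illustrative. Requires networkx.

```python
# Greedy influence maximization with Monte Carlo estimates of spread under
# the independent-cascade (IC) model; tractable only on small graphs.
import random
import networkx as nx

def ic_spread(G, seeds, p=0.1, trials=100):
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in G.neighbors(u):
                if v not in active and random.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials

def greedy_im(G, k, p=0.1):
    seeds = []
    for _ in range(k):                      # add the best marginal node
        best = max((n for n in G if n not in seeds),
                   key=lambda n: ic_spread(G, seeds + [n], p))
        seeds.append(best)
    return seeds

G = nx.barabasi_albert_graph(200, 3, seed=1)
print("greedy seed set:", greedy_im(G, k=3))
```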

Towards Real-World Burst Image Super-Resolution: Benchmark and Method

  • paper_url: http://arxiv.org/abs/2309.04803
  • repo_url: https://github.com/yjsunnn/fbanet
  • paper_authors: Pengxu Wei, Yujing Sun, Xingbei Guo, Chang Liu, Jie Chen, Xiangyang Ji, Liang Lin
  • for: 本研究旨在探讨如何使用多张图像来重建高质量的图像,特别是在实际场景中。
  • methods: 我们提出了一种 Federated Burst Affinity network (FBAnet),它使用了一种简单的投影变换来对图像进行匹配,并使用了一种 Federated Affinity Fusion (FAF) 策略来聚合帧中的相关信息。
  • results: 我们的 FBAnet 在数据集的两个版本上进行了广泛实验,结果表明它超越了现有最先进的爆发超分辨方法,并能生成细节丰富、视觉效果令人满意的超分辨图像。我们的数据集、代码和模型均已在 GitHub 上公开。
    Abstract Despite substantial advances, single-image super-resolution (SISR) is always in a dilemma to reconstruct high-quality images with limited information from one input image, especially in realistic scenarios. In this paper, we establish a large-scale real-world burst super-resolution dataset, i.e., RealBSR, to explore the faithful reconstruction of image details from multiple frames. Furthermore, we introduce a Federated Burst Affinity network (FBAnet) to investigate non-trivial pixel-wise displacements among images under real-world image degradation. Specifically, rather than using pixel-wise alignment, our FBAnet employs a simple homography alignment from a structural geometry aspect and a Federated Affinity Fusion (FAF) strategy to aggregate the complementary information among frames. Those fused informative representations are fed to a Transformer-based module of burst representation decoding. Besides, we have conducted extensive experiments on two versions of our datasets, i.e., RealBSR-RAW and RealBSR-RGB. Experimental results demonstrate that our FBAnet outperforms existing state-of-the-art burst SR methods and also achieves visually-pleasant SR image predictions with model details. Our dataset, codes, and models are publicly available at https://github.com/yjsunnn/FBANet.
    摘要 尽管已取得重要进展,单图像超分辨(SISR)始终难以仅凭一张输入图像的有限信息重建高质量图像,在真实场景下尤其如此。在本文中,我们构建了一个大规模真实场景爆发超分辨数据集RealBSR,以探索从多帧图像中忠实重建图像细节。此外,我们提出了一种 Federated Burst Affinity network(FBAnet),以研究真实图像退化下各帧之间非平凡的像素级位移。具体而言,FBAnet 不采用逐像素对齐,而是从结构几何角度使用简单的单应性(homography)对齐,并通过 Federated Affinity Fusion(FAF)策略聚合各帧间的互补信息。融合后的信息表示被送入基于 Transformer 的爆发表示解码模块。此外,我们在 RealBSR-RAW 和 RealBSR-RGB 两个版本的数据集上进行了广泛实验。结果表明,FBAnet 超越了现有最先进的爆发超分辨方法,并能生成细节丰富、视觉效果令人满意的超分辨预测结果。我们的数据集、代码和模型公开于 https://github.com/yjsunnn/FBANet。
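
A minimal sketch of the homography-style alignment step the entry describes, using OpenCV's estimate-then-warp pipeline on synthetic correspondences; FBAnet's exact estimation procedure may differ, and the ground-truth transform below is assumed purely for demonstration. Requires opencv-python and numpy.

```python
import cv2
import numpy as np

# Assumed ground-truth homography (small rotation + translation).
theta = np.deg2rad(2.0)
H_true = np.array([[np.cos(theta), -np.sin(theta), 5.0],
                   [np.sin(theta),  np.cos(theta), 3.0],
                   [0.0, 0.0, 1.0]])

# Synthetic correspondences: project reference points through H_true.
pts_ref = np.float32([[20, 30], [300, 40], [310, 220], [15, 200], [160, 120]])
pts_frame = cv2.perspectiveTransform(pts_ref.reshape(-1, 1, 2), H_true)

# Estimate the homography that maps frame coordinates back to the reference.
H_est, _ = cv2.findHomography(pts_frame, pts_ref.reshape(-1, 1, 2))

# Warp a (toy) frame onto the reference grid before fusion.
frame = (np.random.default_rng(0).random((240, 320)) * 255).astype(np.uint8)
aligned = cv2.warpPerspective(frame, H_est, (320, 240))
print("recovered homography:\n", np.round(H_est, 3))
```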

CPMR: Context-Aware Incremental Sequential Recommendation with Pseudo-Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2309.04802
  • repo_url: https://github.com/dimarziobian/cpmr
  • paper_authors: Qingtian Bian, Jiaxing Xu, Hui Fang, Yiping Ke
  • for: 对用户表征随时间的演化进行建模,即刻画用户在历史与上下文场景下的动态兴趣。
  • methods: 通过信息传播与演化挖掘批次到达的交互数据,并为用户和物品分别构建三种表示:静态嵌入、历史时间状态和上下文时间状态。
  • results: 在四个标准推荐数据集上的实验表明,CPMR持续超越当前最先进的基线方法,并在其中三个数据集上取得显著提升。
    Abstract The motivations of users to make interactions can be divided into static preference and dynamic interest. To accurately model user representations over time, recent studies in sequential recommendation utilize information propagation and evolution to mine from batches of arriving interactions. However, they ignore the fact that people are easily influenced by the recent actions of other users in the contextual scenario, and applying evolution across all historical interactions dilutes the importance of recent ones, thus failing to model the evolution of dynamic interest accurately. To address this issue, we propose a Context-Aware Pseudo-Multi-Task Recommender System (CPMR) to model the evolution in both historical and contextual scenarios by creating three representations for each user and item under different dynamics: static embedding, historical temporal states, and contextual temporal states. To dually improve the performance of temporal states evolution and incremental recommendation, we design a Pseudo-Multi-Task Learning (PMTL) paradigm by stacking the incremental single-target recommendations into one multi-target task for joint optimization. Within the PMTL paradigm, CPMR employs a shared-bottom network to conduct the evolution of temporal states across historical and contextual scenarios, as well as the fusion of them at the user-item level. In addition, CPMR incorporates one real tower for incremental predictions, and two pseudo towers dedicated to updating the respective temporal states based on new batches of interactions. Experimental results on four benchmark recommendation datasets show that CPMR consistently outperforms state-of-the-art baselines and achieves significant gains on three of them. The code is available at: https://github.com/DiMarzioBian/CPMR.
    摘要 用户发生交互的动机可分为静态偏好与动态兴趣。为了准确建模用户表征随时间的变化,近期的序列推荐研究利用信息传播与演化从批次到达的交互数据中进行挖掘。然而,它们忽略了人们在上下文场景中容易受到其他用户近期行为的影响,而且在全部历史交互上应用演化会稀释近期交互的重要性,从而无法准确建模动态兴趣的演化。为解决这一问题,我们提出了上下文感知伪多任务推荐系统(CPMR),通过为每个用户和物品在不同动态下构建三种表示(静态嵌入、历史时间状态和上下文时间状态),同时建模历史与上下文场景中的演化。为了同时提升时间状态演化与增量推荐的性能,我们设计了伪多任务学习(PMTL)范式,将增量的单目标推荐堆叠为一个多目标任务进行联合优化。在PMTL范式中,CPMR使用共享底层网络完成历史与上下文场景下时间状态的演化,并在用户-物品层面进行融合。此外,CPMR包含一个用于增量预测的真实塔,以及两个分别基于新批次交互更新对应时间状态的伪塔。在四个基准推荐数据集上的实验结果表明,CPMR持续超越最先进的基线方法,并在其中三个数据集上取得显著提升。代码见:https://github.com/DiMarzioBian/CPMR。

TMComposites: Plug-and-Play Collaboration Between Specialized Tsetlin Machines

  • paper_url: http://arxiv.org/abs/2309.04801
  • repo_url: https://github.com/cair/plug-and-play-collaboration-between-specialized-tsetlin-machines
  • paper_authors: Ole-Christoffer Granmo
  • for: 在更复杂的任务和数据集(如CIFAR-10和CIFAR-100)上提升TM的性能。
  • methods: 让专门化的TM以即插即用的方式协作:各成员在学习中各自专门化,并在推理时评估自身置信度,由最自信的成员做出决策。
  • results: TM Composite 在 Fashion-MNIST 上将准确率提高了2个百分点,在 CIFAR-10 上提高了12个百分点,在 CIFAR-100 上提高了9个百分点,为TM创造了新的最佳结果。
    Abstract Tsetlin Machines (TMs) provide a fundamental shift from arithmetic-based to logic-based machine learning. Supporting convolution, they deal successfully with image classification datasets like MNIST, Fashion-MNIST, and CIFAR-2. However, the TM struggles with getting state-of-the-art performance on CIFAR-10 and CIFAR-100, representing more complex tasks. This paper introduces plug-and-play collaboration between specialized TMs, referred to as TM Composites. The collaboration relies on a TM's ability to specialize during learning and to assess its competence during inference. When teaming up, the most confident TMs make the decisions, relieving the uncertain ones. In this manner, a TM Composite becomes more competent than its members, benefiting from their specializations. The collaboration is plug-and-play in that members can be combined in any way, at any time, without fine-tuning. We implement three TM specializations in our empirical evaluation: Histogram of Gradients, Adaptive Gaussian Thresholding, and Color Thermometers. The resulting TM Composite increases accuracy on Fashion-MNIST by two percentage points, CIFAR-10 by twelve points, and CIFAR-100 by nine points, yielding new state-of-the-art results for TMs. Overall, we envision that TM Composites will enable an ultra-low energy and transparent alternative to state-of-the-art deep learning on more tasks and datasets.
    摘要 Tsetlin机(TM)带来了从基于算术到基于逻辑的机器学习的根本转变。在支持卷积后,它能成功处理 MNIST、Fashion-MNIST 和 CIFAR-2 等图像分类数据集,但在更复杂的 CIFAR-10 和 CIFAR-100 任务上难以达到最先进的性能。本文提出了专门化TM之间的即插即用协作,称为 TM Composites。这种协作依赖于TM在学习中进行专门化、并在推理中评估自身能力:协作时由最自信的TM做出决策,替代不确定的成员。这样,TM Composite 便比其任何成员都更有能力,从它们的专门化中受益。这种协作是即插即用的,成员可以在任何时候以任何方式组合,无需微调。我们在实验评估中实现了三种TM专门化:梯度直方图(Histogram of Gradients)、自适应高斯阈值化(Adaptive Gaussian Thresholding)和颜色温度计(Color Thermometers)。由此得到的 TM Composite 在 Fashion-MNIST 上将准确率提高2个百分点,在 CIFAR-10 上提高12个百分点,在 CIFAR-100 上提高9个百分点,为TM创造了新的最佳结果。总体而言,我们期望 TM Composites 能在更多任务和数据集上提供超低能耗且透明的、可替代最先进深度学习的方案。
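
A sketch of the plug-and-play composite idea: each specialized classifier scores all classes, scores are normalized per member, and the most confident members dominate the joint decision. The min-max normalization below is our own assumption; the TM literature defines its own clause-sum scores.

```python
import numpy as np

def composite_predict(member_scores):
    """member_scores: list of (n_classes,) raw score vectors, one per member."""
    votes = np.zeros(len(member_scores[0]), dtype=float)
    for s in member_scores:
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        conf = (s - s.min()) / span if span else np.zeros_like(s)
        votes += conf                 # confident members contribute sharply
    return int(np.argmax(votes))

# Three hypothetical specialists (e.g., HoG, thresholding, color) scoring a
# 4-class problem; the nearly undecided member barely affects the outcome.
print(composite_predict([[5, 40, 8, 2],      # confident in class 1
                         [10, 12, 11, 9],    # almost uniform -> weak vote
                         [3, 33, 6, 1]]))    # agrees on class 1
```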

A Fast Algorithm for Moderating Critical Nodes via Edge Removal

  • paper_url: http://arxiv.org/abs/2309.06392
  • repo_url: https://github.com/hahaabc/fasticm
  • paper_authors: Changan Liu, Xiaotian Zhou, Ahad N. Zehmakan, Zhongzhi Zhang
  • for: 本研究旨在通过移除边来有效调控网络中的关键节点,以缓解恶意传播(如错误信息和疾病扩散)造成的潜在损害。
  • methods: 本研究利用基于随机游走的Schur补近似和快速求和估计等新技术,提出了三种近似贪心算法来求解该问题。
  • results: 实验结果表明,所提算法在多种设定下兼具有效性与高效率,能显著降低调控关键节点的计算成本。
    Abstract Critical nodes in networks are extremely vulnerable to malicious attacks to trigger negative cascading events such as the spread of misinformation and diseases. Therefore, effective moderation of critical nodes is very vital for mitigating the potential damages caused by such malicious diffusions. The current moderation methods are computationally expensive. Furthermore, they disregard the fundamental metric of information centrality, which measures the dissemination power of nodes. We investigate the problem of removing $k$ edges from a network to minimize the information centrality of a target node while preserving the network's connectivity. We prove that this problem is computationally challenging: it is NP-complete and its objective function is not supermodular. However, we propose three approximation greedy algorithms using novel techniques such as random walk-based Schur complement approximation and fast sum estimation. One of our algorithms runs in nearly linear time in the number of edges. To complement our theoretical analysis, we conduct a comprehensive set of experiments on synthetic and real networks with over one million nodes. Across various settings, the experimental results illustrate the effectiveness and efficiency of our proposed algorithms.
    摘要 网络中的关键节点极易受到恶意攻击,从而引发错误信息和疾病传播等负面级联事件。因此,有效调控关键节点对于缓解此类恶意扩散造成的潜在损害至关重要。现有的调控方法计算代价高昂,并且忽略了衡量节点传播能力的基本指标——信息中心性。我们研究如下问题:从网络中移除 $k$ 条边,在保持网络连通性的前提下,使目标节点的信息中心性最小化。我们证明该问题在计算上具有挑战性:它是NP完全的,且其目标函数不具有超模性(supermodular)。尽管如此,我们利用基于随机游走的Schur补近似和快速求和估计等新技术,提出了三种近似贪心算法,其中一种算法的运行时间与边数几乎成线性关系。为补充理论分析,我们在节点数超过一百万的合成网络和真实网络上进行了全面实验。在多种设定下,实验结果都展示了所提算法的有效性与高效率。
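
Not the paper's fast approximation algorithms; on small graphs, the same objective can be expressed as the brute-force greedy baseline sketched below with networkx (current-flow closeness is the quantity also known as information centrality; requires scipy).

```python
# Greedily remove k edges to minimize a target node's information
# centrality while keeping the graph connected. O(k * m) centrality
# evaluations, so small graphs only.
import networkx as nx

def greedy_edge_removal(G, target, k):
    G = G.copy()
    for _ in range(k):
        best_edge, best_score = None, float("inf")
        for e in list(G.edges()):
            G.remove_edge(*e)
            if nx.is_connected(G):
                score = nx.current_flow_closeness_centrality(G)[target]
                if score < best_score:
                    best_edge, best_score = e, score
            G.add_edge(*e)               # undo the trial removal
        if best_edge is None:
            break
        G.remove_edge(*best_edge)
    return G

G = nx.erdos_renyi_graph(30, 0.2, seed=7)
t = 0
print("before:", round(nx.current_flow_closeness_centrality(G)[t], 4))
G2 = greedy_edge_removal(G, t, k=3)
print("after :", round(nx.current_flow_closeness_centrality(G2)[t], 4))
```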

A Full-fledged Commit Message Quality Checker Based on Machine Learning

  • paper_url: http://arxiv.org/abs/2309.04797
  • repo_url: https://github.com/commit-message-collective/beams-commit-message-checker
  • paper_authors: David Faragó, Michael Färber, Christian Petrov
  • for: 这篇论文是关于提高版本控制中的提交信息质量的研究,以便更好地支持软件维护和演化。
  • methods: 该论文使用机器学习方法来评估提交信息质量,包括语义和上下文。
  • results: 该方法能够足够准确地评估提交信息质量:即使在最具挑战性的任务上,最低F$_1$分数也达到82.9%,表明机器学习方法能够很好地评估提交信息质量。
    Abstract Commit messages (CMs) are an essential part of version control. By providing important context in regard to what has changed and why, they strongly support software maintenance and evolution. But writing good CMs is difficult and often neglected by developers. So far, there is no tool suitable for practice that automatically assesses how well a CM is written, including its meaning and context. Since this task is challenging, we ask the research question: how well can the CM quality, including semantics and context, be measured with machine learning methods? By considering all rules from the most popular CM quality guideline, creating datasets for those rules, and training and evaluating state-of-the-art machine learning models to check those rules, we can answer the research question with: sufficiently well for practice, with the lowest F$_1$ score of 82.9\%, for the most challenging task. We develop a full-fledged open-source framework that checks all these CM quality rules. It is useful for research, e.g., automatic CM generation, but most importantly for software practitioners to raise the quality of CMs and thus the maintainability and evolution speed of their software.
    摘要 commit messages (CMs) 是版本控制中非常重要的一部分,它们提供了更改的重要上下文和原因,从而强化软件维护和演化。然而,写好CMs是很困难的,开发者们经常忽略这一点。迄今为止,没有一种适合实践的工具可以自动评估CM质量,包括它的 semantics 和context。由于这是一项具有挑战性的任务,我们提出了研究问题:可以使用机器学习方法来评估CM质量,包括 semantics 和context?我们考虑了最流行的CM质量指南中的所有规则,创建了相应的数据集,并使用当今最佳的机器学习模型来检查这些规则。我们的研究表明,使用机器学习方法可以很好地评估CM质量,最低的F1分数为82.9%,对最复杂的任务来说。我们开发了一套免费、开源的框架,可以检查所有CM质量规则。它可以用于研究,例如自动生成CM,但更重要的是,它可以帮助软件实践者提高CM质量,从而提高软件的维护和演化速度。
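
The paper's checker is ML-based and covers semantics and context; to make the underlying task concrete, here is a sketch of a few purely structural rules in the spirit of the popular commit-message guidelines such a checker builds on (the heuristics below are our own simplifications, not the paper's trained models).

```python
import re

def check_commit_message(msg: str):
    lines = msg.splitlines()
    subject = lines[0] if lines else ""
    issues = []
    if len(subject) > 50:
        issues.append("subject exceeds 50 characters")
    if subject.endswith("."):
        issues.append("subject ends with a period")
    if subject and not subject[0].isupper():
        issues.append("subject is not capitalized")
    if len(lines) > 1 and lines[1].strip():
        issues.append("missing blank line between subject and body")
    if re.match(r"^(added|fixed|updated|changed)\b", subject, re.I):
        issues.append("subject not in imperative mood (e.g. use 'Add')")
    if any(len(l) > 72 for l in lines[2:]):
        issues.append("body lines exceed 72 characters")
    return issues

print(check_commit_message("added stuff."))
print(check_commit_message("Add input validation to the parser\n\n"
                           "Reject negative sizes early to avoid overflow."))
```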

Towards Robust Model Watermark via Reducing Parametric Vulnerability

  • paper_url: http://arxiv.org/abs/2309.04777
  • repo_url: https://github.com/guanhaogan/robust-model-watermarking
  • paper_authors: Guanhao Gan, Yiming Li, Dongxian Wu, Shu-Tao Xia
  • for: 保护深度神经网络(DNN)的版权,防止其被盗用或滥用。
  • methods: 采用基于后门的方法,在发布模型前嵌入特定行为作为水印,以便事后验证模型所有权。
  • results: 提出了一种基于极小极大(mini-max)公式的方法,可显著提升模型水印在参数变化和多种水印移除攻击下的鲁棒性。
    Abstract Deep neural networks are valuable assets considering their commercial benefits and huge demands for costly annotation and computation resources. To protect the copyright of DNNs, backdoor-based ownership verification becomes popular recently, in which the model owner can watermark the model by embedding a specific backdoor behavior before releasing it. The defenders (usually the model owners) can identify whether a suspicious third-party model is ``stolen'' from them based on the presence of the behavior. Unfortunately, these watermarks are proven to be vulnerable to removal attacks even like fine-tuning. To further explore this vulnerability, we investigate the parameter space and find there exist many watermark-removed models in the vicinity of the watermarked one, which may be easily used by removal attacks. Inspired by this finding, we propose a mini-max formulation to find these watermark-removed models and recover their watermark behavior. Extensive experiments demonstrate that our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks. The codes for reproducing our main experiments are available at \url{https://github.com/GuanhaoGan/robust-model-watermarking}.
    摘要 考虑到商业价值以及高昂的标注与算力成本,深度神经网络是非常宝贵的资产。为保护DNN的版权,基于后门的所有权验证近来颇为流行:模型所有者在发布模型前嵌入特定的后门行为作为水印,之后便可根据该行为是否存在来判断可疑的第三方模型是否“窃取”自己的模型。遗憾的是,这类水印已被证明在微调等移除攻击下十分脆弱。为进一步探究这一脆弱性,我们考察了参数空间,发现在带水印模型的邻域内存在许多已被去除水印的模型,它们很容易被移除攻击利用。受此发现启发,我们提出了一种极小极大(mini-max)公式,用于找到这些去除水印的模型并恢复其水印行为。大量实验表明,我们的方法显著提升了模型水印对参数变化和多种水印移除攻击的鲁棒性。复现主要实验的代码见 https://github.com/GuanhaoGan/robust-model-watermarking。

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

  • paper_url: http://arxiv.org/abs/2309.04766
  • repo_url: None
  • paper_authors: Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, Nancy F. Chen
  • for: 本文提出了一个面向多语言基础模型的基准SeaEval,用以考察这些模型的自然语言理解与推理能力,以及它们对文化实践、细微差异和价值观的理解。
  • methods: 除标准准确率指标外,本文还考察了基础模型在语义与多语言维度上的脆弱性。
  • results: 研究发现许多基础模型存在不稳定行为,例如对改写后的指令反应不一致、存在位置偏差和多数标签偏差;在基于事实、科学和常识知识的问题上,许多模型对语义等价的多语言查询给出不一致的回答。
    Abstract We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate (1) Most models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained "balanced multilingual" capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios.
    摘要 我们提出了面向多语言基础模型的基准SeaEval。除了刻画这些模型如何理解和推理自然语言之外,我们还考察了它们对文化实践、细微差异和价值观的理解程度。在标准准确率指标之外,我们还研究了基础模型在语义与多语言维度上的脆弱性。我们的分析涵盖开源与闭源模型,在经典NLP任务、推理和文化理解方面得到了实证结果。主要发现包括:1. 大多数模型在收到改写后的指令时表现不一;2. 许多模型仍然受到暴露偏差(如位置偏差、多数标签偏差)的影响;3. 对于基于事实、科学和常识知识的问题,语义等价的多语言查询理应得到一致回答,但大多数模型却在这些查询上出人意料地表现不一致;4. 经过多语言训练的模型尚未达到“均衡的多语言”能力。我们的工作凸显了对更具泛化性的语义表示和更强的多语言上下文建模的需求。SeaEval可以作为对多语言、多文化场景进行更深入研究与评估的起点。

AudRandAug: Random Image Augmentations for Audio Classification

  • paper_url: http://arxiv.org/abs/2309.04762
  • repo_url: https://github.com/turab45/audrandaug
  • paper_authors: Teerath Kumar, Muhammad Turab, Alessandra Mileo, Malika Bendechache, Takfarinas Saber
  • for: 这篇论文探讨音频数据增强方法,并提出了一种基于专用音频搜索空间的随机数据增强方法(AudRandAug)。
  • methods: AudRandAug 从专门的音频搜索空间中随机选择数据增强策略,应用于转换成类图像模式的音频数据。
  • results: 实验结果表明,AudRandAug 在准确率上优于其他现有的数据增强方法。
    Abstract Data augmentation has proven to be effective in training neural networks. Recently, a method called RandAug was proposed, randomly selecting data augmentation techniques from a predefined search space. RandAug has demonstrated significant performance improvements for image-related tasks while imposing minimal computational overhead. However, no prior research has explored the application of RandAug specifically for audio data augmentation, which converts audio into an image-like pattern. To address this gap, we introduce AudRandAug, an adaptation of RandAug for audio data. AudRandAug selects data augmentation policies from a dedicated audio search space. To evaluate the effectiveness of AudRandAug, we conducted experiments using various models and datasets. Our findings indicate that AudRandAug outperforms other existing data augmentation methods regarding accuracy performance.
    摘要 数据增强已被证明对神经网络训练十分有效。最近,一种名为RandAug的方法被提出,它从预定义的搜索空间中随机选择数据增强技术。RandAug在图像相关任务上展现了显著的性能提升,且几乎不增加计算开销。然而,此前尚无研究探讨将RandAug专门应用于音频数据增强(即先将音频转换为类图像模式)。为填补这一空白,我们提出了AudRandAug,即RandAug在音频数据上的改编版本。AudRandAug从专门的音频搜索空间中选择数据增强策略。为评估AudRandAug的有效性,我们使用多种模型和数据集进行了实验。结果表明,AudRandAug在准确率上优于其他现有的数据增强方法。
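
A sketch of the RandAug-style recipe AudRandAug follows: sample N operations from an audio-specific search space at a shared magnitude M and apply them to the image-like spectrogram. The four operations below are our own illustrative stand-ins, not the paper's exact search space.

```python
import random
import numpy as np

def time_mask(spec, m):
    out = spec.copy()
    width = max(1, int(m * spec.shape[1] / 10))
    t = random.randrange(spec.shape[1] - width)
    out[:, t:t + width] = 0
    return out

def freq_mask(spec, m):
    out = spec.copy()
    width = max(1, int(m * spec.shape[0] / 10))
    f = random.randrange(spec.shape[0] - width)
    out[f:f + width, :] = 0
    return out

def add_noise(spec, m):
    return spec + np.random.randn(*spec.shape) * 0.01 * m

def gain(spec, m):
    return spec * (1.0 + 0.05 * m * random.choice([-1, 1]))

SEARCH_SPACE = [time_mask, freq_mask, add_noise, gain]

def aud_rand_aug(spec, n_ops=2, magnitude=5):
    """Apply n_ops randomly chosen operations at a shared magnitude."""
    for op in random.sample(SEARCH_SPACE, n_ops):
        spec = op(spec, magnitude)
    return spec

mel = np.abs(np.random.default_rng(0).normal(size=(64, 128)))  # toy spectrogram
print(aud_rand_aug(mel).shape)
```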

RR-CP: Reliable-Region-Based Conformal Prediction for Trustworthy Medical Image Classification

  • paper_url: http://arxiv.org/abs/2309.04760
  • repo_url: None
  • paper_authors: Yizhe Zhang, Shuo Wang, Yejia Zhang, Danny Z. Chen
  • for: 提高医疗AI模型的准确率和可靠性,以便更好地与人类专家合作。
  • methods: 基于可靠区域的保形预测(Reliable-Region-Based Conformal Prediction, RR-CP)技术,在满足用户指定错误率(例如0.5%)的约束下,尽量减小预测集的大小。
  • results: 在五个公共数据集上的实验表明,RR-CP在以合理的小预测集实现用户指定错误率(例如0.5%)方面,显著优于现有的保形预测方法。
    Abstract Conformal prediction (CP) generates a set of predictions for a given test sample such that the prediction set almost always contains the true label (e.g., 99.5\% of the time). CP provides comprehensive predictions on possible labels of a given test sample, and the size of the set indicates how certain the predictions are (e.g., a set larger than one is `uncertain'). Such distinct properties of CP enable effective collaborations between human experts and medical AI models, allowing efficient intervention and quality check in clinical decision-making. In this paper, we propose a new method called Reliable-Region-Based Conformal Prediction (RR-CP), which aims to impose a stronger statistical guarantee so that the user-specified error rate (e.g., 0.5\%) can be achieved in the test time, and under this constraint, the size of the prediction set is optimized (to be small). We consider a small prediction set size an important measure only when the user-specified error rate is achieved. Experiments on five public datasets show that our RR-CP performs well: with a reasonably small-sized prediction set, it achieves the user-specified error rate (e.g., 0.5\%) significantly more frequently than existing CP methods.
    摘要 保形预测(Conformal Prediction, CP)为给定测试样本生成一个预测集,使其几乎总是(例如99.5%的情况下)包含真实标签。CP对测试样本可能的标签给出全面预测,预测集的大小反映了预测的确定程度(例如,元素多于一个的集合是“不确定”的)。CP的这些独特性质使人类专家与医疗AI模型能够有效协作,在临床决策中实现高效的干预与质量核查。在本文中,我们提出了一种名为基于可靠区域的保形预测(RR-CP)的新方法,旨在施加更强的统计保证,使用户指定的错误率(例如0.5%)能够在测试阶段达成,并在该约束下将预测集的大小优化到尽可能小。我们认为,只有在达成用户指定错误率的前提下,较小的预测集才是有意义的衡量指标。在五个公共数据集上的实验表明:RR-CP表现出色,在预测集保持合理大小的同时,其达成用户指定错误率(例如0.5%)的频率显著高于现有的CP方法。
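
For background, standard split conformal prediction (the base procedure RR-CP builds on) can be sketched in a few lines; RR-CP's reliable-region construction for hitting a user-specified error rate is not reproduced here.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    """Calibrate on held-out data: score = 1 - prob of the true class."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    q = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
    return np.quantile(scores, min(q, 1.0))

def prediction_sets(test_probs, threshold):
    """All classes whose nonconformity 1 - p stays below the threshold."""
    return [np.where(1.0 - p <= threshold)[0].tolist() for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4) * 2, size=500)   # toy calibrated probs
cal_labels = np.array([rng.choice(4, p=p) for p in cal_probs])
tau = conformal_threshold(cal_probs, cal_labels, alpha=0.05)
print("threshold:", round(float(tau), 3))
print("example sets:", prediction_sets(cal_probs[:3], tau))
```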

Towards Real-time Training of Physics-informed Neural Networks: Applications in Ultrafast Ultrasound Blood Flow Imaging

  • paper_url: http://arxiv.org/abs/2309.04755
  • repo_url: None
  • paper_authors: Haotian Guan, Jinping Dong, Wei-Ning Lee
  • for: 求解血流的控制方程——Navier-Stokes方程,并面向超快多普勒超声(ultrafast Doppler ultrasound)这一刻画体内复杂血流动态的最先进技术(其每秒采集数千帧数据,现有基于完整Navier-Stokes方程的方法难以满足)。
  • methods: 提出了一种新的PINN训练框架SeqPINN:将Navier-Stokes方程离散为稳态形式,并借助迁移学习逐时刻求解稳态方程;在此基础上进一步提出SP-PINN,以平均化的恒定随机梯度下降作为初始化,对所有时刻并行训练,并借鉴Stochastic Weight Averaging Gaussian进行不确定性估计,以衡量初始化的泛化能力。
  • results: 在单分支和三分支血管的有限元仿真与体外血管模型上,SeqPINN和SP-PINN都比原始PINN设计快了许多倍;在恢复血流速度时,二者在直血管上的RMSE分别为1.01 cm/s和1.26 cm/s,在三分支血管上分别为1.91 cm/s和2.56 cm/s。
    Abstract Physics-informed Neural Network (PINN) is one of the most preeminent solvers of Navier-Stokes equations, which are widely used as the governing equation of blood flow. However, current approaches, relying on full Navier-Stokes equations, are impractical for ultrafast Doppler ultrasound, the state-of-the-art technique for depiction of complex blood flow dynamics in vivo through acquired thousands of frames (or, timestamps) per second. In this article, we first propose a novel training framework of PINN for solving Navier-Stokes equations by discretizing Navier-Stokes equations into steady state and sequentially solving steady-state Navier-Stokes equations with transfer learning. The novel training framework is coined as SeqPINN. Upon the success of SeqPINN, we adopt the idea of averaged constant stochastic gradient descent (SGD) as initialization and propose a parallel training scheme for all timestamps. To ensure an initialization that generalizes well, we borrow the concept of Stochastic Weight Averaging Gaussian to perform uncertainty estimation as an indicator of generalizability of the initialization. This algorithm, named SP-PINN, further expedites training of PINN while achieving comparable accuracy with SeqPINN. Finite-element simulations and in vitro phantoms of single-branch and trifurcate blood vessels are used to evaluate the performance of SeqPINN and SP-PINN. Results show that both SeqPINN and SP-PINN are manyfold faster than the original design of PINN, while respectively achieving Root Mean Square Errors (RMSEs) of 1.01 cm/s and 1.26 cm/s on the straight vessel and 1.91 cm/s and 2.56 cm/s on the trifurcate blood vessel when recovering blood flow velocities.
    摘要 物理信息神经网络(PINN)是求解Navier-Stokes方程(血流的控制方程)最杰出的求解器之一。然而,现有依赖完整Navier-Stokes方程的方法并不适用于超快多普勒超声——这一刻画体内复杂血流动态的最先进技术每秒可采集数千帧(即数千个时间戳)数据。本文首先提出了一种新的PINN训练框架:将Navier-Stokes方程离散为稳态形式,并借助迁移学习依次求解各稳态Navier-Stokes方程,该框架被命名为SeqPINN。在SeqPINN成功的基础上,我们采用平均化的恒定随机梯度下降(SGD)作为初始化,并提出了对所有时间戳并行训练的方案。为确保初始化具有良好的泛化能力,我们借鉴Stochastic Weight Averaging Gaussian的思想进行不确定性估计,作为初始化泛化能力的指标;该算法被命名为SP-PINN,它在达到与SeqPINN相当精度的同时进一步加速了PINN的训练。我们使用单分支和三分支血管的有限元仿真与体外血管模型评估SeqPINN和SP-PINN的性能。结果显示,二者都比原始PINN设计快许多倍;在恢复血流速度时,它们在直血管上的RMSE分别为1.01 cm/s和1.26 cm/s,在三分支血管上分别为1.91 cm/s和2.56 cm/s。
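
The gist of SeqPINN — solve a steady-state problem at each timestamp and warm-start from the previous solution — is sketched below on a 1D steady diffusion stand-in (u'' = f) instead of the full Navier-Stokes equations, purely to keep the example short. Requires PyTorch.

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))

def solve_steady(net, f, steps=300):
    """Fit u(x) with u'' = f(x), u(0) = u(1) = 0, via the PINN residual."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    x = torch.linspace(0, 1, 64).reshape(-1, 1).requires_grad_(True)
    for _ in range(steps):
        u = net(x)
        du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
        loss = ((d2u - f(x)) ** 2).mean() \
             + net(torch.zeros(1, 1)).pow(2).mean() \
             + net(torch.ones(1, 1)).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Sequential solving with warm starts: each timestamp's steady problem is
# initialized from the previous solution, so later fits converge faster.
for t in range(5):
    f = lambda x, t=t: torch.sin(torch.pi * x) * (1 + 0.1 * t)
    print(f"timestamp {t}: residual {solve_steady(net, f):.2e}")
```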

A Spatiotemporal Deep Neural Network for Fine-Grained Multi-Horizon Wind Prediction

  • paper_url: http://arxiv.org/abs/2309.04733
  • repo_url: https://github.com/hfl15/windpred
  • paper_authors: Fanling Huang, Yangdong Deng
  • for: 预测风速与风向——这对航空、风力发电等诸多实际应用至关重要。
  • methods: 提出了一种新的数据驱动模型——多时域时空网络(MHSTN),用于准确、高效地进行细粒度风预测。
  • results: 评估结果表明,MHSTN显著优于各对比方法,并已集成到中国最繁忙的国际机场之一的调度平台中。
    Abstract The prediction of wind in terms of both wind speed and direction, which has a crucial impact on many real-world applications like aviation and wind power generation, is extremely challenging due to the high stochasticity and complicated correlation in the weather data. Existing methods typically focus on a sub-set of influential factors and thus lack a systematic treatment of the problem. In addition, fine-grained forecasting is essential for efficient industry operations, but has been less attended in the literature. In this work, we propose a novel data-driven model, Multi-Horizon SpatioTemporal Network (MHSTN), generally for accurate and efficient fine-grained wind prediction. MHSTN integrates multiple deep neural networks targeting different factors in a sequence-to-sequence (Seq2Seq) backbone to effectively extract features from various data sources and produce multi-horizon predictions for all sites within a given region. MHSTN is composed of four major modules. First, a temporal module fuses coarse-grained forecasts derived by Numerical Weather Prediction (NWP) and historical on-site observation data at stations so as to leverage both global and local atmospheric information. Second, a spatial module exploits spatial correlation by modeling the joint representation of all stations. Third, an ensemble module weighs the above two modules for final predictions. Furthermore, a covariate selection module automatically choose influential meteorological variables as initial input. MHSTN is already integrated into the scheduling platform of one of the busiest international airports of China. The evaluation results demonstrate that our model outperforms competitors by a significant margin.
    摘要 风速和风向的预测对航空、风力发电等许多现实应用有着至关重要的影响,但由于天气数据的高度随机性和复杂相关性,这一预测极具挑战性。现有方法通常只关注一部分影响因素,缺乏对该问题的系统化处理;此外,细粒度预测对高效的行业运营至关重要,却较少受到文献关注。在这项工作中,我们提出了一种新的数据驱动模型——多时域时空网络(MHSTN),用于准确、高效的细粒度风预测。MHSTN在一个序列到序列(Seq2Seq)主干中集成了针对不同因素的多个深度神经网络,以有效地从多种数据源中提取特征,并对给定区域内的所有站点产生多时域预测。MHSTN由四个主要模块组成:第一,时间模块融合数值天气预报(NWP)产生的粗粒度预报与站点的历史实测数据,以同时利用全局与局部大气信息;第二,空间模块通过建模所有站点的联合表示来挖掘空间相关性;第三,集成模块对上述两个模块加权得到最终预测;此外,协变量选择模块自动挑选有影响的气象变量作为初始输入。MHSTN已集成到中国最繁忙的国际机场之一的调度平台中。评估结果表明,我们的模型显著优于各对比方法。

TCGAN: Convolutional Generative Adversarial Network for Time Series Classification and Clustering

  • paper_url: http://arxiv.org/abs/2309.04732
  • repo_url: https://bitbucket.org/lynn1/tcgan
  • paper_authors: Fanling Huang, Yangdong Deng
  • for: 本文旨在提出一种时序卷积生成对抗网络(TCGAN),以无监督方式学习时序数据的层次表示。
  • methods: TCGAN 通过两个一维卷积神经网络(生成器与判别器)之间的对抗博弈进行学习,无需标注数据。
  • results: 在合成与真实数据上的大量实验表明,TCGAN 比现有的时序GAN更快、更准确;学习到的表示使简单的分类和聚类方法获得优越且稳定的性能,并在少标签与标签不平衡的场景下保持高效。
    Abstract Recent works have demonstrated the superiority of supervised Convolutional Neural Networks (CNNs) in learning hierarchical representations from time series data for successful classification. These methods require sufficiently large labeled data for stable learning, however acquiring high-quality labeled time series data can be costly and potentially infeasible. Generative Adversarial Networks (GANs) have achieved great success in enhancing unsupervised and semi-supervised learning. Nonetheless, to our best knowledge, it remains unclear how effectively GANs can serve as a general-purpose solution to learn representations for time series recognition, i.e., classification and clustering. The above considerations inspire us to introduce a Time-series Convolutional GAN (TCGAN). TCGAN learns by playing an adversarial game between two one-dimensional CNNs (i.e., a generator and a discriminator) in the absence of label information. Parts of the trained TCGAN are then reused to construct a representation encoder to empower linear recognition methods. We conducted comprehensive experiments on synthetic and real-world datasets. The results demonstrate that TCGAN is faster and more accurate than existing time-series GANs. The learned representations enable simple classification and clustering methods to achieve superior and stable performance. Furthermore, TCGAN retains high efficacy in scenarios with few-labeled and imbalanced-labeled data. Our work provides a promising path to effectively utilize abundant unlabeled time series data.
    摘要 近期研究表明,有监督卷积神经网络(CNN)能够从时序数据中学习层次表示,从而实现成功的分类。这类方法需要足够大的标注数据才能稳定学习,然而获取高质量的标注时序数据代价高昂,有时甚至不可行。生成对抗网络(GAN)在增强无监督与半监督学习方面取得了巨大成功,但据我们所知,GAN能否作为学习时序识别(即分类与聚类)表示的通用方案仍不清楚。基于上述考虑,我们提出了时序卷积GAN(TCGAN)。TCGAN在没有标签信息的情况下,通过两个一维CNN(即生成器和判别器)之间的对抗博弈进行学习;随后复用训练好的TCGAN的部分网络构建表示编码器,以支撑线性识别方法。我们在合成和真实数据集上进行了全面实验。结果表明,TCGAN比现有的时序GAN更快、更准确;学习到的表示使简单的分类和聚类方法获得优越且稳定的性能。此外,TCGAN在少标签和标签不平衡的场景下仍保持高效。我们的工作为有效利用海量无标注时序数据提供了一条颇具前景的路径。

Transitions in echo index and dependence on input repetitions

  • paper_url: http://arxiv.org/abs/2309.04728
  • repo_url: None
  • paper_authors: Peter Ashwin, Andrea Ceni
  • for: 研究非自治(即输入驱动)动力系统中的回声指数(echo index)——它推广了回声状态性质(echo state property)——并探究其对输入的依赖关系。
  • methods: 考察在一组有限映射之间切换的非自治系统,假设每个映射都拥有有限个双曲平衡吸引子,研究回声指数如何依赖于刻画典型输入响应的参数。
  • results: 发现每个映射的最少与最多重复次数对回声指数至关重要:小振幅强迫下回声指数等于无输入系统的吸引子个数,大振幅强迫下回声指数降为一;中间区域最有趣,此时回声指数不仅取决于强迫振幅,还取决于输入更精细的性质。
    Abstract The echo index counts the number of simultaneously stable asymptotic responses of a nonautonomous (i.e. input-driven) dynamical system. It generalizes the well-known echo state property for recurrent neural networks - this corresponds to the echo index being equal to one. In this paper, we investigate how the echo index depends on parameters that govern typical responses to a finite-state ergodic external input that forces the dynamics. We consider the echo index for a nonautonomous system that switches between a finite set of maps, where we assume that each map possesses a finite set of hyperbolic equilibrium attractors. We find the minimum and maximum repetitions of each map are crucial for the resulting echo index. Casting our theoretical findings in the RNN computing framework, we obtain that for small amplitude forcing the echo index corresponds to the number of attractors for the input-free system, while for large amplitude forcing, the echo index reduces to one. The intermediate regime is the most interesting; in this region the echo index depends not just on the amplitude of forcing but also on more subtle properties of the input.
    摘要 回声指数(echo index)统计了一个非自治(即输入驱动)动力系统中同时稳定的渐近响应的数目,它推广了循环神经网络中著名的回声状态性质——后者对应于回声指数等于一的情形。本文研究回声指数如何依赖于刻画系统对有限状态遍历外部输入的典型响应的参数。我们考察一个在有限组映射之间切换的非自治系统,并假设每个映射都拥有有限个双曲平衡吸引子。我们发现,每个映射的最少与最多重复次数对最终的回声指数至关重要。将理论结果置于RNN计算框架中,我们得到:对于小振幅强迫,回声指数等于无输入系统的吸引子个数;对于大振幅强迫,回声指数降为一。中间区域最为有趣:在该区域内,回声指数不仅取决于强迫的振幅,还取决于输入更精细的性质。
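
A numerical sketch of estimating an echo index: drive the same input-switched system from many initial conditions and count how many distinct asymptotic trajectories survive. The bistable map below is our own toy example, not one analyzed in the paper.

```python
import numpy as np

def step(x, s):
    """One driven step: input symbol s in {0, 1} picks a small bias, on top
    of a bistable contraction (two stable fixed points, unstable origin)."""
    return 0.7 * np.tanh(3.0 * x) + (0.15 if s else -0.15)

def echo_index(input_seq, n_init=200, tol=1e-3):
    xs = np.linspace(-2.0, 2.0, n_init)        # many initial conditions
    for s in input_seq:
        xs = step(xs, s)
    xs = np.sort(xs)
    return 1 + int(np.sum(np.diff(xs) > tol))  # count surviving clusters

rng = np.random.default_rng(0)
print("estimated echo index:", echo_index(rng.integers(0, 2, size=2000)))
# Expected: 2 for this map -- two simultaneously stable driven responses.
```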

SHAPE: A Sample-adaptive Hierarchical Prediction Network for Medication Recommendation

  • paper_url: http://arxiv.org/abs/2309.05675
  • repo_url: None
  • paper_authors: Sicen Liu, Xiaolong Wang, JIngcheng Du, Yongshuai Hou, Xianbing Zhao, Hui Xu, Hui Wang, Yang Xiang, Buzhou Tang
  • for: 这篇论文旨在为复杂共病情况下的用药推荐提出一种方法,以应对现有医疗预测任务中的挑战。
  • methods: 提出了样本自适应分层用药预测网络(SHAPE):用紧凑的就诊内集合编码器刻画医疗事件之间的关系,用就诊间纵向编码器高效学习患者级的时序表示,并引入软课程学习策略按就诊长度自动分配每个样本的难度。
  • results: 在基准数据集上的大量实验验证了SHAPE相对于多种最先进基线方法的优越性。
    Abstract Effectively medication recommendation with complex multimorbidity conditions is a critical task in healthcare. Most existing works predicted medications based on longitudinal records, which assumed the information transmitted patterns of learning longitudinal sequence data are stable and intra-visit medical events are serialized. However, the following conditions may have been ignored: 1) A more compact encoder for intra-relationship in the intra-visit medical event is urgent; 2) Strategies for learning accurate representations of the variable longitudinal sequences of patients are different. In this paper, we proposed a novel Sample-adaptive Hierarchical medicAtion Prediction nEtwork, termed SHAPE, to tackle the above challenges in the medication recommendation task. Specifically, we design a compact intra-visit set encoder to encode the relationship in the medical event for obtaining visit-level representation and then develop an inter-visit longitudinal encoder to learn the patient-level longitudinal representation efficiently. To endow the model with the capability of modeling the variable visit length, we introduce a soft curriculum learning method to assign the difficulty of each sample automatically by the visit length. Extensive experiments on a benchmark dataset verify the superiority of our model compared with several state-of-the-art baselines.
    摘要 在复杂共病情况下进行有效的用药推荐是医疗保健中的一项关键任务。现有工作大多基于纵向就诊记录预测用药,其假设是纵向序列数据中信息传递的模式是稳定的,且就诊内的医疗事件是串行排列的。然而,以下情况可能被忽略了:1)亟需更紧凑的编码器来刻画就诊内医疗事件之间的关系;2)学习患者可变长度纵向序列的精确表示需要不同的策略。在本文中,我们提出了一种新颖的样本自适应分层用药预测网络(SHAPE),以应对用药推荐任务中的上述挑战。具体来说,我们设计了紧凑的就诊内集合编码器来编码医疗事件间的关系,从而获得就诊级表示;随后开发了就诊间纵向编码器,以高效学习患者级的纵向表示。为了赋予模型建模可变就诊长度的能力,我们引入软课程学习方法,按就诊长度自动分配每个样本的难度。在基准数据集上的大量实验验证了我们的模型相对于多种最先进基线方法的优越性。

Toward Reproducing Network Research Results Using Large Language Models

  • paper_url: http://arxiv.org/abs/2309.04716
  • repo_url: None
  • paper_authors: Qiao Xiang, Yuling Lin, Mingjun Fang, Bang Huang, Siyong Huang, Ridi Wen, Franck Le, Linghe Kong, Jiwu Shu
  • for: 本文提出了一种新的方法来复制网络研究结果,即使用大型自然语言模型(LLM)。
  • methods: 本文使用了四名学生,每个学生分别使用ChatGPT进行一种不同的网络系统的复制,通过提示工程来实现。
  • results: 实验观察到了四名学生的复制结果和经验教训,并提出了未来的研究问题。
    Abstract Reproducing research results in the networking community is important for both academia and industry. The current best practice typically resorts to three approaches: (1) looking for publicly available prototypes; (2) contacting the authors to get a private prototype; and (3) manually implementing a prototype following the description of the publication. However, most published network research does not have public prototypes and private prototypes are hard to get. As such, most reproducing efforts are spent on manual implementation based on the publications, which is both time and labor consuming and error-prone. In this paper, we boldly propose reproducing network research results using the emerging large language models (LLMs). In particular, we first prove its feasibility with a small-scale experiment, in which four students with essential networking knowledge each reproduces a different networking system published in prominent conferences and journals by prompt engineering ChatGPT. We report the experiment's observations and lessons and discuss future open research questions of this proposal. This work raises no ethical issue.
    摘要 复现网络研究结果对学术界和工业界都非常重要。目前的最佳实践通常采用三种途径:(1)寻找公开可用的原型;(2)联系作者获取私有原型;(3)按照论文描述手工实现原型。然而,大多数已发表的网络研究没有公开原型,私有原型也很难获得。因此,大多数复现工作都耗费在基于论文的手工实现上,这既耗时耗力又容易出错。在本文中,我们大胆提出利用新兴的大语言模型(LLM)来复现网络研究结果。具体来说,我们首先通过一个小规模实验证明了其可行性:四名具备基本网络知识的学生,借助提示工程使用ChatGPT,各自复现了一个发表在知名会议和期刊上的网络系统。我们报告了实验的观察与经验教训,并讨论了该提议未来有待研究的开放问题。这项工作不涉及伦理问题。

Jade: A Differentiable Physics Engine for Articulated Rigid Bodies with Intersection-Free Frictional Contact

  • paper_url: http://arxiv.org/abs/2309.04710
  • repo_url: None
  • paper_authors: Gang Yang, Siyuan Luo, Lin Shao
  • for: 这篇论文提出了一个面向铰接刚体的可微分物理引擎,可在仿真中实现无穿透的摩擦接触。
  • methods: 该引擎将接触建模为线性互补问题(LCP),采用连续碰撞检测确定碰撞时刻,并借助回溯策略避免复杂几何形状的刚体之间发生穿透;同时推导了回溯机制下的梯度计算,使整个仿真过程可微,并修改了流行的Dantzig算法以在多摩擦接触下获得有效解。
  • results: 大量实验表明,该可微分物理仿真在多种接触密集的任务上兼具有效性与稳定性。
    Abstract We present Jade, a differentiable physics engine for articulated rigid bodies. Jade models contacts as the Linear Complementarity Problem (LCP). Compared to existing differentiable simulations, Jade offers features including intersection-free collision simulation and stable LCP solutions for multiple frictional contacts. We use continuous collision detection to detect the time of impact and adopt the backtracking strategy to prevent intersection between bodies with complex geometry shapes. We derive the gradient calculation to ensure the whole simulation process is differentiable under the backtracking mechanism. We modify the popular Dantzig algorithm to get valid solutions under multiple frictional contacts. We conduct extensive experiments to demonstrate the effectiveness of our differentiable physics simulation over a variety of contact-rich tasks.
    摘要 我们提出了面向铰接刚体的可微分物理引擎Jade。Jade将接触建模为线性互补问题(LCP)。与现有的可微分仿真相比,Jade具有无穿透的碰撞仿真以及多摩擦接触下稳定的LCP解等特性。我们使用连续碰撞检测来确定碰撞时刻,并采用回溯策略防止具有复杂几何形状的刚体之间发生穿透。我们推导了梯度计算,以确保整个仿真过程在回溯机制下保持可微。我们还修改了流行的Dantzig算法,以在多摩擦接触下获得有效解。我们进行了大量实验,证明了我们的可微分物理仿真在多种接触密集任务上的有效性。
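
Jade modifies the Dantzig algorithm; sketched below instead is projected Gauss-Seidel, a simpler LCP solver widely used in physics engines, just to make the contact formulation concrete: find z >= 0 with w = Mz + q >= 0 and z^T w = 0.

```python
import numpy as np

def pgs_lcp(M, q, iters=200):
    """Projected Gauss-Seidel: sweep components, clamping each to z >= 0."""
    z = np.zeros_like(q)
    for _ in range(iters):
        for i in range(len(q)):
            r = q[i] + M[i] @ z - M[i, i] * z[i]   # residual excluding z[i]
            z[i] = max(0.0, -r / M[i, i])
    return z

# Toy two-contact problem with a symmetric positive-definite M.
M = np.array([[2.0, 0.5], [0.5, 1.5]])
q = np.array([-1.0, 0.3])
z = pgs_lcp(M, q)
w = M @ z + q
print("z =", np.round(z, 4), " w =", np.round(w, 4),
      " complementarity z.w =", round(float(z @ w), 8))
```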

Advantage Actor-Critic with Reasoner: Explaining the Agent’s Behavior from an Exploratory Perspective

  • paper_url: http://arxiv.org/abs/2309.04707
  • repo_url: None
  • paper_authors: Muzhe Guo, Feixu Yu, Tian Lan, Fang Jin
  • for: 强化学习(RL)在求解复杂决策问题上缺乏透明度与可解释性,这在决策具有重大现实影响的领域构成了严峻挑战。
  • methods: 我们提出了带推理器的优势演员-评论家(Advantage Actor-Critic with Reasoner, A2CR)方法,可以方便地应用于基于Actor-Critic的RL模型并使其可解释。A2CR由三个相互连接的网络组成:策略网络、价值网络和推理器网络。通过预先定义并分类演员动作的潜在目的,A2CR自动生成了理解智能体决策过程的更全面、更可解释的范式。
  • results: 在动作密集的《超级马里奥兄弟》环境中的评估带来了有趣的发现:随着RL算法探索程度的提高,推理器预测的“Breakout”标签比例下降、“Hovering”标签比例上升;同时,基于目的的显著性图也更加集中、更易理解。
    Abstract Reinforcement learning (RL) is a powerful tool for solving complex decision-making problems, but its lack of transparency and interpretability has been a major challenge in domains where decisions have significant real-world consequences. In this paper, we propose a novel Advantage Actor-Critic with Reasoner (A2CR), which can be easily applied to Actor-Critic-based RL models and make them interpretable. A2CR consists of three interconnected networks: the Policy Network, the Value Network, and the Reasoner Network. By predefining and classifying the underlying purpose of the actor's actions, A2CR automatically generates a more comprehensive and interpretable paradigm for understanding the agent's decision-making process. It offers a range of functionalities such as purpose-based saliency, early failure detection, and model supervision, thereby promoting responsible and trustworthy RL. Evaluations conducted in action-rich Super Mario Bros environments yield intriguing findings: Reasoner-predicted label proportions decrease for ``Breakout" and increase for ``Hovering" as the exploration level of the RL algorithm intensifies. Additionally, purpose-based saliencies are more focused and comprehensible.
    摘要 强化学习(RL)是求解复杂决策问题的强大工具,但其缺乏透明度和可解释性,这在决策具有重大现实影响的领域一直是主要挑战。在本文中,我们提出了一种新颖的带推理器的优势演员-评论家(A2CR)方法,它可以方便地应用于基于Actor-Critic的RL模型并使其可解释。A2CR由三个相互连接的网络组成:策略网络、价值网络和推理器网络。通过预先定义并分类演员动作的潜在目的,A2CR自动生成了理解智能体决策过程的更全面、更可解释的范式。它提供了基于目的的显著性分析、早期失败检测、模型监督等一系列功能,从而促进负责任且可信的RL。在动作密集的《超级马里奥兄弟》环境中的评估带来了有趣的发现:随着RL算法探索程度的提高,推理器预测的“Breakout”标签比例下降、“Hovering”标签比例上升;同时,基于目的的显著性图也更加集中、更易理解。

Analysis of Disinformation and Fake News Detection Using Fine-Tuned Large Language Model

  • paper_url: http://arxiv.org/abs/2309.04704
  • repo_url: None
  • paper_authors: Bohdan M. Pavlyshenko
  • for: 本研究探讨了使用LLM进行假新闻检测和假信息分析的可能性。
  • methods: 本研究使用PEFT/LoRA等方法进行了模型精细调整。
  • results: 研究表明,微调后的Llama 2模型可以对文本进行深入分析,揭示复杂的风格和叙事;提取的命名实体情感可以作为监督式机器学习模型中的预测特征。
    Abstract The paper considers the possibility of fine-tuning Llama 2 large language model (LLM) for the disinformation analysis and fake news detection. For fine-tuning, the PEFT/LoRA based approach was used. In the study, the model was fine-tuned for the following tasks: analysing a text on revealing disinformation and propaganda narratives, fact checking, fake news detection, manipulation analytics, extracting named entities with their sentiments. The obtained results show that the fine-tuned Llama 2 model can perform a deep analysis of texts and reveal complex styles and narratives. Extracted sentiments for named entities can be considered as predictive features in supervised machine learning models.
    摘要 本文探讨了对Llama 2大语言模型(LLM)进行微调,用于虚假信息分析与假新闻检测的可能性。微调采用了基于PEFT/LoRA的方法。在研究中,模型针对以下任务进行了微调:分析文本以揭示虚假信息和宣传叙事、事实核查、假新闻检测、操纵分析,以及提取命名实体及其情感。所得结果表明,微调后的Llama 2模型能够对文本进行深入分析,揭示复杂的风格和叙事。提取的命名实体情感可以作为监督式机器学习模型中的预测特征。
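
A minimal sketch of a PEFT/LoRA fine-tuning setup in the style the paper describes; the hyperparameters, target modules and prompt format are assumptions, not the authors' exact configuration. Requires the transformers and peft packages (and access to the gated Llama 2 weights).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"           # gated model; access required
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # common choice for Llama blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # tiny fraction of 7B is trained
# ...then train with any standard Trainer on instruction-formatted examples
# such as: "Analyse the following text for propaganda narratives: <text>"
```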

Advancements in Upper Body Exoskeleton: Implementing Active Gravity Compensation with a Feedforward Controller

  • paper_url: http://arxiv.org/abs/2309.04698
  • repo_url: None
  • paper_authors: Muhammad Ayaz Hussain, Ioannis Iossifidis
  • for: 这篇论文面向上肢外骨骼,设计了一套基于前馈控制的主动重力补偿控制系统。
  • methods: 该系统仅利用内部电机传感器的位置数据计算力矩,采用基于牛顿-欧拉逆动力学的解析控制方程。
  • results: 硬件与仿真实验结果表明,系统性能稳定、精度良好,能够长时间保持位置,摩擦极小且未出现意外偏转。
    Abstract In this study, we present a feedforward control system designed for active gravity compensation on an upper body exoskeleton. The system utilizes only positional data from internal motor sensors to calculate torque, employing analytical control equations based on Newton-Euler Inverse Dynamics. Compared to feedback control systems, the feedforward approach offers several advantages. It eliminates the need for external torque sensors, resulting in reduced hardware complexity and weight. Moreover, the feedforward control exhibits a more proactive response, leading to enhanced performance. The exoskeleton used in the experiments is lightweight and comprises 4 Degrees of Freedom, closely mimicking human upper body kinematics and three-dimensional range of motion. We conducted tests on both hardware and simulations of the exoskeleton, demonstrating stable performance. The system maintained its position over an extended period, exhibiting minimal friction and avoiding undesired slewing.
    摘要 在这项研究中,我们提出了一套用于上肢外骨骼主动重力补偿的前馈控制系统。该系统仅利用内部电机传感器的位置数据计算力矩,采用基于牛顿-欧拉逆动力学的解析控制方程。与反馈控制系统相比,前馈方法具有多项优势:它无需外部力矩传感器,从而降低了硬件复杂度和重量;此外,前馈控制的响应更为主动,带来更佳的性能。实验所用外骨骼重量轻,具有4个自由度,能较好地模拟人体上肢运动学及三维活动范围。我们在外骨骼的硬件和仿真上都进行了测试,展示了稳定的性能:系统能在长时间内保持位置,摩擦极小,且未出现意外偏转。
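
The Newton-Euler-derived gravity terms are easiest to see in the planar 2-link special case sketched below; the actual exoskeleton has 4 DoF and three-dimensional kinematics, and all masses and lengths here are assumed values.

```python
import math

M1, M2 = 2.0, 1.5          # link masses [kg] (assumed)
L1 = 0.30                  # upper link length [m]
LC1, LC2 = 0.15, 0.14      # distances to link centers of mass [m]
G = 9.81

def gravity_torques(q1, q2):
    """Joint torques that exactly cancel gravity at configuration (q1, q2),
    with joint angles measured from the horizontal."""
    tau2 = M2 * LC2 * G * math.cos(q1 + q2)
    tau1 = (M1 * LC1 + M2 * L1) * G * math.cos(q1) + tau2
    return tau1, tau2

# Feedforward loop: read joint positions, command compensating torques.
for q1_deg, q2_deg in [(0, 0), (45, 30), (90, 0)]:
    t1, t2 = gravity_torques(math.radians(q1_deg), math.radians(q2_deg))
    print(f"q=({q1_deg:>2},{q2_deg:>2}) deg -> tau=({t1:6.2f}, {t2:5.2f}) N*m")
```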

Code-Style In-Context Learning for Knowledge-Based Question Answering

  • paper_url: http://arxiv.org/abs/2309.04695
  • repo_url: https://github.com/TeniaKovacs/ExploratoryDataProject1
  • paper_authors: Zhijie Nie, Richong Zhang, Zhongyuan Wang, Xudong Liu
  • for: 现有基于知识库的问答(KBQA)方法通常依赖复杂的训练技术和模型框架,在实际应用中存在诸多限制;本研究旨在以无需训练的范式缓解这些问题。
  • methods: 本研究利用上下文学习(In-Context Learning, ICL):向大语言模型(LLM)提供少量问题及其标注的逻辑形式作为示例,使LLM理解任务意图并为新问题生成逻辑形式;并进一步提出代码风格的上下文学习方法,将不熟悉的逻辑形式生成转化为LLM更熟悉的代码生成过程。
  • results: 实验结果表明,代码风格的上下文学习方法显著缓解了生成逻辑形式时的格式错误问题,并在WebQSP、GrailQA和GraphQ的少样本设定下取得了新的最佳性能。
    Abstract Current methods for Knowledge-Based Question Answering (KBQA) usually rely on complex training techniques and model frameworks, leading to many limitations in practical applications. Recently, the emergence of In-Context Learning (ICL) capabilities in Large Language Models (LLMs) provides a simple and training-free semantic parsing paradigm for KBQA: Given a small number of questions and their labeled logical forms as demo examples, LLMs can understand the task intent and generate the logic form for a new question. However, current powerful LLMs have little exposure to logic forms during pre-training, resulting in a high format error rate. To solve this problem, we propose a code-style in-context learning method for KBQA, which converts the generation process of unfamiliar logical form into the more familiar code generation process for LLMs. Experimental results on three mainstream datasets show that our method dramatically mitigated the formatting error problem in generating logic forms while realizing a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting.
    摘要 现有的基于知识库的问答(KBQA)方法通常依赖复杂的训练技术和模型框架,导致其在实际应用中存在许多限制。近期,大语言模型(LLM)展现出的上下文学习(ICL)能力为KBQA提供了一种简单且无需训练的语义解析范式:给定少量问题及其标注的逻辑形式作为示例,LLM便能理解任务意图并为新问题生成逻辑形式。然而,当前强大的LLM在预训练阶段几乎没有接触过逻辑形式,导致较高的格式错误率。为解决这一问题,我们提出了一种面向KBQA的代码风格上下文学习方法,将生成不熟悉的逻辑形式的过程转换为LLM更熟悉的代码生成过程。在三个主流数据集上的实验结果表明,我们的方法显著缓解了生成逻辑形式时的格式错误问题,同时在WebQSP、GrailQA和GraphQ的少样本设定下取得了新的最佳性能。
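
The code-style prompt idea can be sketched as follows; the demo questions and the KB query API names are our own illustrative assumptions, not the paper's exact schema.

```python
# Present few-shot logical forms as function calls rather than an unfamiliar
# formalism, so a code-trained LLM makes fewer format errors.
DEMOS = [
    ("what is the capital of france",
     'answer = capital_of(entity("France"))'),
    ("who directed inception",
     'answer = director_of(entity("Inception"))'),
]

def build_prompt(question: str) -> str:
    parts = ["# Translate each question into a KB query program.\n"]
    for q, program in DEMOS:
        parts.append(f"# Q: {q}\n{program}\n")
    parts.append(f"# Q: {question}\n")          # the LLM completes the code
    return "\n".join(parts)

print(build_prompt("what team does messi play for"))
# The completion (e.g. `answer = team_of(entity("Lionel Messi"))`) would
# then be parsed and executed against the knowledge base.
```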

Flexible and Robust Counterfactual Explanations with Minimal Satisfiable Perturbations

  • paper_url: http://arxiv.org/abs/2309.04676
  • repo_url: https://github.com/wangyongjie-ntu/cemsp
  • paper_authors: Yongjie Wang, Hangwei Qian, Yongjie Liu, Wei Guo, Chunyan Miao
  • for: 本文旨在提供更鲁棒、更灵活的反事实解释(CFE),以增强机器学习模型的信息公平性与可信度。
  • methods: 提出的方法名为“最小可满足扰动的反事实解释”(CEMSP):借助语义上有意义的正常取值范围来约束异常特征值的改变,并将问题建模为布尔可满足性问题,以尽可能少地修改特征。
  • results: 所提方法在保持灵活性的同时提供了更鲁棒的解释,在合成与真实数据集上的全面实验表明其比现有方法更有效。
    Abstract Counterfactual explanations (CFEs) exemplify how to minimally modify a feature vector to achieve a different prediction for an instance. CFEs can enhance informational fairness and trustworthiness, and provide suggestions for users who receive adverse predictions. However, recent research has shown that multiple CFEs can be offered for the same instance or instances with slight differences. Multiple CFEs provide flexible choices and cover diverse desiderata for user selection. However, individual fairness and model reliability will be damaged if unstable CFEs with different costs are returned. Existing methods fail to exploit flexibility and address the concerns of non-robustness simultaneously. To address these issues, we propose a conceptually simple yet effective solution named Counterfactual Explanations with Minimal Satisfiable Perturbations (CEMSP). Specifically, CEMSP constrains changing values of abnormal features with the help of their semantically meaningful normal ranges. For efficiency, we model the problem as a Boolean satisfiability problem to modify as few features as possible. Additionally, CEMSP is a general framework and can easily accommodate more practical requirements, e.g., casualty and actionability. Compared to existing methods, we conduct comprehensive experiments on both synthetic and real-world datasets to demonstrate that our method provides more robust explanations while preserving flexibility.
    摘要 反事实解释(CFE)展示了如何对特征向量做最小修改,从而使某一实例获得不同的预测结果。CFE能够增强信息公平性与可信度,并为收到不利预测的用户提供建议。然而,近期研究表明,对同一实例或仅有细微差异的实例可能给出多个CFE。多个CFE提供了灵活的选择,能覆盖用户选择的多样化诉求;但如果返回的是代价不同、不稳定的CFE,个体公平性和模型可靠性就会受损。现有方法无法在利用灵活性的同时兼顾非鲁棒性问题。为解决这些问题,我们提出了一种概念简单却有效的方案——最小可满足扰动的反事实解释(CEMSP)。具体而言,CEMSP借助语义上有意义的正常取值范围来约束异常特征值的改变。出于效率考虑,我们将该问题建模为布尔可满足性问题,以尽可能少地修改特征。此外,CEMSP是一个通用框架,可以方便地容纳更多实际需求,例如因果性(causality)与可操作性(actionability)。与现有方法相比,我们在合成与真实数据集上进行了全面实验,结果表明我们的方法在保持灵活性的同时提供了更鲁棒的解释。
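
The paper encodes this search as a Boolean satisfiability problem; the brute-force sketch below expresses the same objective on a toy risk model: restore as few abnormal features as possible to their semantically meaningful normal ranges until the prediction flips. Model, ranges and data are illustrative assumptions.

```python
from itertools import combinations
import numpy as np

normal_ranges = {0: (36.0, 37.5), 1: (60, 100), 2: (90, 120)}  # temp, HR, BP
model = lambda x: int(x[0] > 37.5 or x[1] > 100 or x[2] > 130) # 1 = "at risk"

def cemsp_like(x):
    abnormal = [i for i, (lo, hi) in normal_ranges.items()
                if not lo <= x[i] <= hi]
    for k in range(1, len(abnormal) + 1):             # smallest subset first
        for subset in combinations(abnormal, k):
            cf = x.copy()
            for i in subset:                          # clamp into normal range
                lo, hi = normal_ranges[i]
                cf[i] = float(np.clip(cf[i], lo, hi))
            if model(cf) != model(x):
                return subset, cf
    return None, x

x = np.array([38.6, 112.0, 118.0])                    # fever + tachycardia
changed, cf = cemsp_like(x)
print("features changed:", changed, "-> counterfactual:", cf)
```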

FIAT: Fusing learning paradigms with Instruction-Accelerated Tuning

  • paper_url: http://arxiv.org/abs/2309.04663
  • repo_url: None
  • paper_authors: Xinyi Wang, John Wieting, Jonathan H. Clark
  • for: 这篇论文旨在探讨大语言模型(LLM)的学习方法,具体来说是研究如何使用受限的数据量、模型大小和计算成本来训练LLM,以及如何使用最大化的方法来提高模型的性能。
  • methods: 该论文使用了两种常见的LLM学习方法:受限学习(ICL)和全部精度调整(full fine-tuning)。它们的不同之处在于数据量、模型大小和计算成本等方面,并且它们在不同的任务上表现不同。
  • results: 研究发现,使用FIAT方法可以在100-10,000个训练示例的范围内,比ICL和精度调整更好地表现。FIAT方法可以同时利用最大化的方法和受限学习的方法,以提高模型的性能。
    Abstract Learning paradigms for large language models (LLMs) currently tend to fall within either in-context learning (ICL) or full fine-tuning. Each of these comes with their own trade-offs based on available data, model size, compute cost, ease-of-use, and final quality with neither solution performing well across-the-board. In this article, we first describe ICL and fine-tuning paradigms in a way that highlights their natural connections. Based on these connections, we propose a new learning paradigm called FIAT that fuses the best of these paradigms together, enabling prompt-engineered instructions and chain-of-thought reasoning with the very largest models while also using similar methods to perform parameter updates on a modestly-sized LLM with parameter-efficient tuning. We evaluate FIAT's effectiveness on a variety of multilingual tasks and observe that FIAT performs better than both ICL and fine-tuning at scales ranging from 100-10,000 training examples. We hope that FIAT provides a practical way of harnessing the full potential of LLMs without needing to make a hard choice between learning paradigms.
    摘要 当前大语言模型(LLM)的学习范式主要分为两类:上下文学习(ICL)与全量微调。二者在可用数据、模型规模、计算成本、易用性和最终质量等方面各有取舍,没有一种方案能在所有方面都表现出色。本文首先以凸显二者天然联系的方式描述ICL与微调范式;基于这些联系,我们提出了一种新的学习范式FIAT,融合二者之长:既能借助最大规模的模型实现提示工程化指令与思维链推理,又能以参数高效调优的类似方法对规模适中的LLM进行参数更新。我们在多种多语言任务上评估了FIAT的效果,发现在100至10,000个训练样本的范围内,FIAT都优于ICL和微调。我们希望FIAT能提供一种实用的途径,无需在两种学习范式之间做非此即彼的抉择,即可充分发挥LLM的潜力。

Video and Synthetic MRI Pre-training of 3D Vision Architectures for Neuroimage Analysis

  • paper_url: http://arxiv.org/abs/2309.04651
  • repo_url: None
  • paper_authors: Nikhil J. Dhinagar, Amit Singh, Saket Ozarkar, Ketaki Buwa, Sophia I. Thomopoulos, Conor Owens-Walton, Emily Laltoo, Yao-Liang Chen, Philip Cook, Corey McMillan, Chih-Chien Tsai, J-J Wang, Yih-Ru Wu, Paul M. Thompson
  • for: 这篇论文主要评估不同的预训练方法,以提升 3D 医学影像任务中的模型性能。
  • methods: 作者以视觉 Transformer(ViT)和卷积神经网络(CNN)为模型,并用多种上游预训练方法对其进行初始化。
  • results: 研究发现,预训练提升了所有任务的性能:ViT 在 AD 分类上提升 7.4%,在 PD 分类上提升 4.6%;CNN 在 PD 分类上提升 19.1%,脑龄预测误差降低 1.26 年。此外,在大规模视频或合成 MRI 数据上预训练可提升 ViT 的性能;CNN 在数据有限的设置下表现稳健,域内预训练进一步增强其性能;预训练还改善了对分布外数据集和站点的泛化能力。
    Abstract Transfer learning represents a recent paradigm shift in the way we build artificial intelligence (AI) systems. In contrast to training task-specific models, transfer learning involves pre-training deep learning models on a large corpus of data and minimally fine-tuning them for adaptation to specific tasks. Even so, for 3D medical imaging tasks, we do not know if it is best to pre-train models on natural images, medical images, or even synthetically generated MRI scans or video data. To evaluate these alternatives, here we benchmarked vision transformers (ViTs) and convolutional neural networks (CNNs), initialized with varied upstream pre-training approaches. These methods were then adapted to three unique downstream neuroimaging tasks with a range of difficulty: Alzheimer's disease (AD) classification, Parkinson's disease (PD) classification, and "brain age" prediction. Experimental tests led to the following key observations: 1. Pre-training improved performance across all tasks, including a boost of 7.4% for AD classification and 4.6% for PD classification for the ViT, and 19.1% for PD classification and a reduction in brain age prediction error of 1.26 years for CNNs. 2. Pre-training on large-scale video or synthetic MRI data boosted performance of ViTs. 3. CNNs were robust in limited-data settings, and in-domain pretraining enhanced their performance. 4. Pre-training improved generalization to out-of-distribution datasets and sites. Overall, we benchmarked different vision architectures, revealing the value of pre-training them with emerging datasets for model initialization. The resulting pre-trained models can be adapted to a range of downstream neuroimaging tasks, even when training data for the target task is limited.
    摘要 迁移学习代表了构建人工智能(AI)系统方式的一次范式转变:不再为特定任务单独训练模型,而是先在大规模数据上预训练深度学习模型,再针对具体任务进行少量微调。然而,对于 3D 医学影像任务,我们尚不清楚在自然图像、医学图像,还是在合成 MRI 扫描或视频数据上预训练效果最佳。为评估这些选择,本文对以多种上游预训练方式初始化的视觉 Transformer(ViT)和卷积神经网络(CNN)进行了基准测试,并将其适配到三个难度各异的下游神经影像任务:阿尔茨海默病(AD)分类、帕金森病(PD)分类与"脑龄"预测。实验得出以下关键结论:1)预训练提升了所有任务的性能,其中 ViT 在 AD 分类上提升 7.4%、在 PD 分类上提升 4.6%,CNN 在 PD 分类上提升 19.1%,脑龄预测误差降低 1.26 年;2)在大规模视频或合成 MRI 数据上预训练可提升 ViT 的性能;3)CNN 在数据有限的设置下表现稳健,域内预训练可进一步增强其性能;4)预训练改善了对分布外数据集和站点的泛化能力。总体而言,我们对不同视觉架构进行了基准测试,揭示了利用新兴数据集进行预训练初始化的价值;所得预训练模型即便在目标任务训练数据有限时,也能适配到一系列下游神经影像任务。

Efficient Finetuning Large Language Models For Vietnamese Chatbot

  • paper_url: http://arxiv.org/abs/2309.04646
  • repo_url: None
  • paper_authors: Vu-Thuan Doan, Quoc-Truong Truong, Duc-Vu Nguyen, Vinh-Tiep Nguyen, Thuy-Ngan Nguyen Luu
  • for: 这个研究旨在提高大型自然语言模型(LLMs)的效能,并且可以实现用户的指令和生成人类化回应。
  • methods: 我们使用大量的指令跟踪数据库,包括Alpaca、GPT4All和Chat-Doctor等,这些数据库覆盖了通用领域和具体医疗领域。然后,我们使用LoRA的参数高效调整技术,将Bloomz(多语言)和GPTJ-6B(越南语)两个开源模型进行调整,从而产生四个模型:Bloomz-Chat、Bloomz-Doctor、GPTJ-Chat和GPTJ-Doctor。
  • results: 我们通过自动评分机制GPT-4进行评估,发现我们的方法可以在评估任务中提高20-30%。
    Abstract Large language models (LLMs), such as GPT-4, PaLM, and LLaMa, have been shown to achieve remarkable performance across a variety of natural language tasks. Recent advancements in instruction tuning bring LLMs with ability in following user's instructions and producing human-like responses. However, the high costs associated with training and implementing LLMs pose challenges to academic research. Furthermore, the availability of pretrained LLMs and instruction-tune datasets for Vietnamese language is limited. To tackle these concerns, we leverage large-scale instruction-following datasets from open-source projects, namely Alpaca, GPT4All, and Chat-Doctor, which cover general domain and specific medical domain. To the best of our knowledge, these are the first instructional dataset for Vietnamese. Subsequently, we utilize parameter-efficient tuning through Low-Rank Adaptation (LoRA) on two open LLMs: Bloomz (Multilingual) and GPTJ-6B (Vietnamese), resulting in four models: Bloomz-Chat, Bloomz-Doctor, GPTJ-Chat, GPTJ-Doctor. Finally, we assess the effectiveness of our methodology on a per-sample basis, taking into consideration the helpfulness, relevance, accuracy, and level of detail in the responses. This evaluation process entails the utilization of GPT-4 as an automated scoring mechanism. Despite utilizing a low-cost setup, our method demonstrates about 20-30% improvement over the original models in our evaluation tasks.
    摘要 如 GPT-4、PaLM 和 LLaMa 等大语言模型(LLM)已在各类自然语言任务上表现出色,近期的指令微调技术更使 LLM 能够遵循用户指令并生成拟人化回复。然而,训练和部署 LLM 的高昂成本给学术研究带来挑战;此外,面向越南语的预训练 LLM 与指令微调数据集也十分有限。为解决这些问题,我们利用来自开源项目 Alpaca、GPT4All 和 Chat-Doctor 的大规模指令遵循数据集,覆盖通用领域和医疗专业领域;据我们所知,这是越南语的首批指令数据集。随后,我们借助 LoRA 参数高效微调技术,对 Bloomz(多语言)和 GPTJ-6B(越南语)两个开源 LLM 进行微调,得到四个模型:Bloomz-Chat、Bloomz-Doctor、GPTJ-Chat 与 GPTJ-Doctor。最后,我们按样本评估方法的有效性,综合考虑回复的有用性、相关性、准确性与细节程度;该评估过程使用 GPT-4 作为自动评分机制。尽管采用低成本设置,我们的方法在评估任务中仍比原始模型提升约 20-30%。

cs.CL - 2023-09-09

Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System

  • paper_url: http://arxiv.org/abs/2309.04858
  • repo_url: None
  • paper_authors: Daphne Ippolito, Nicholas Carlini, Katherine Lee, Milad Nasr, Yun William Yu
  • for: 本研究旨在逆向工程语言生成系统所使用的解码策略,以便检测生成文本并揭示解码设置带来的偏差。
  • methods: 本研究通过对黑盒系统的查询,逆向推断语言模型生成文本时所用的截断采样方法(top-$k$ 或核采样),并评估这些解码设置对模型预测分布的截断所造成的偏差。
  • results: 研究人员成功探测了多个开源语言模型家族以及生产系统(如 ChatGPT)所使用的解码策略,并发现这些解码设置会严重截断模型的预测分布,从而引入偏差。
    Abstract Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model's predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).
    摘要 神经语言模型正越来越多地部署到 API 和网站中,用户输入提示即可获得生成文本,而其中许多系统并不公开生成参数。本文提出了逆向工程文本生成所用解码方法(top-$k$ 或核采样)的技术。能够识别所用的解码策略,对检测生成文本具有重要意义;此外,该过程还能揭示因解码设置严重截断模型预测分布而产生的偏差。我们在多个开源语言模型家族以及生产系统(如 ChatGPT)上实施了这一攻击。
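
A toy illustration of the core signal such an attack can exploit: with top-$k$ sampling, the number of distinct next tokens observed for a fixed prompt saturates at $k$, while nucleus (top-$p$) sampling yields a support size that varies with the shape of the predicted distribution. The `sample_next_token` callable stands in for one blackbox API query and is an assumption of this sketch, not the paper's exact procedure.

```python
import collections

def estimate_support(sample_next_token, prompt, n_queries=2000):
    # Repeatedly query the blackbox and count distinct next tokens.
    counts = collections.Counter(sample_next_token(prompt)
                                 for _ in range(n_queries))
    return len(counts), counts

# If the observed support never exceeds some fixed k across prompts of
# very different entropy, top-k is the likely strategy; a strongly
# prompt-dependent support size points toward nucleus sampling instead.
```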

Leveraging Large Language Models for Exploiting ASR Uncertainty

  • paper_url: http://arxiv.org/abs/2309.04842
  • repo_url: None
  • paper_authors: Pranay Dighe, Yi Su, Shangshang Zheng, Yunshu Liu, Vineet Garg, Xiaochuan Niu, Ahmed Tewfik
  • for: 这项研究旨在提高语言模型在语音理解任务中的表现,而不需要采用复杂或特化的架构设计。
  • methods: 研究人员将语音识别器输出的 n-best 假设列表(而非仅 1-best 假设)作为大语言模型的输入提示,并通过训练低秩适配器(LoRA)使模型适配下游任务。
  • results: 在设备指向语音检测和关键词检测任务中,使用 n-best 列表提示的系统优于仅使用 1-best 假设的系统。
    Abstract While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle speech-intent classification task, where a high word-error-rate can limit the LLM's ability to understand the spoken intent. Instead of chasing a high accuracy by designing complex or specialized architectures regardless of deployment costs, we seek to answer how far we can go without substantially changing the underlying ASR and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We explore prompt-engineering to explain the concept of n-best lists to the LLM; followed by the finetuning of Low-Rank Adapters on the downstream tasks. Our approach using n-best lists proves to be effective on a device-directed speech detection task as well as on a keyword spotting task, where systems using n-best list prompts outperform those using 1-best ASR hypothesis; thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
    摘要 虽然大语言模型(LLM)在各类自然语言处理(NLP)任务上表现出色,但要在口语理解(SLU)任务上取得好成绩,它们要么依赖现成的自动语音识别(ASR)系统进行转写,要么需要内置语音模态。本文关注前一种情形,即 LLM 在 SLU 任务上的准确率受固定 ASR 系统转写准确率的制约。具体而言,我们研究语音意图分类任务,其中较高的词错误率会限制 LLM 理解口语意图的能力。我们不追求通过设计复杂或专用架构来不计部署成本地换取高准确率,而是探究在基本不改动底层 ASR 与 LLM(二者可能被多个不相关任务共享)的前提下能走多远。为此,我们提出用 ASR 的 n-best 假设列表(而非仅易错的 1-best 假设)来提示 LLM:先通过提示工程向 LLM 解释 n-best 列表的概念,再在下游任务上微调低秩适配器(LoRA)。实验表明,该方法在设备指向语音检测任务和关键词检测任务上均有效,使用 n-best 列表提示的系统优于使用 1-best 假设的系统,从而为借助 LLM 利用 ASR 不确定性的语音应用提供了一条高效途径。
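
A minimal sketch of prompting an LLM with an ASR n-best list instead of only the 1-best hypothesis, as the paper proposes for intent classification. The exact prompt wording is an illustrative assumption.

```python
def build_nbest_prompt(nbest, intents):
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "An automatic speech recognizer produced these ranked hypotheses "
        "for one utterance (1 is most likely, but any may be correct):\n"
        f"{hyps}\n"
        f"Choose the speaker's intent from: {', '.join(intents)}.\n"
        "Intent:"
    )

prompt = build_nbest_prompt(
    ["play the beetles", "play the beatles", "play the battles"],
    ["play_music", "set_alarm", "get_weather"],
)
print(prompt)
```

Keeping several hypotheses in the prompt lets the model recover from a wrong 1-best transcription ("beetles") by exploiting agreement across the list.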

Neurons in Large Language Models: Dead, N-gram, Positional

  • paper_url: http://arxiv.org/abs/2309.04827
  • repo_url: None
  • paper_authors: Elena Voita, Javier Ferrando, Christoforos Nalmpantis
  • for: 这篇论文以可在单个 GPU 上完成的轻量方式,分析了一族大语言模型的内部机制。
  • methods: 研究者针对参数量从 125m 到 66b 的 OPT 家族模型展开分析,仅依据 FFN 神经元是否被激活这一信息。
  • results: 研究发现,网络的早期部分十分稀疏,表征许多离散特征:在 66b 模型的某些层中,超过 70% 的神经元是"死亡"的,即在大量多样化数据上从不激活;同时,许多存活的神经元充当 token 与 n-gram 探测器,其对应的 FFN 更新不仅提升下一个 token 候选,还会显式移除触发它们的当前输入信息——这是已知的首个专门从残差流中移除(而非添加)信息的机制。随着规模增大,模型拥有更多的死亡神经元与 token 探测器;此外,一些神经元具有位置性,其激活与否主要取决于位置而非文本内容。
    Abstract We analyze a family of large language models in such a lightweight manner that can be done on a single GPU. Specifically, we focus on the OPT family of models ranging from 125m to 66b parameters and rely only on whether an FFN neuron is activated or not. First, we find that the early part of the network is sparse and represents many discrete features. Here, many neurons (more than 70% in some layers of the 66b model) are "dead", i.e. they never activate on a large collection of diverse data. At the same time, many of the alive neurons are reserved for discrete features and act as token and n-gram detectors. Interestingly, their corresponding FFN updates not only promote next token candidates as could be expected, but also explicitly focus on removing the information about triggering them tokens, i.e., current input. To the best of our knowledge, this is the first example of mechanisms specialized at removing (rather than adding) information from the residual stream. With scale, models become more sparse in a sense that they have more dead neurons and token detectors. Finally, some neurons are positional: them being activated or not depends largely (or solely) on position and less so (or not at all) on textual data. We find that smaller models have sets of neurons acting as position range indicators while larger models operate in a less explicit manner.
    摘要 我们以可在单个 GPU 上完成的轻量方式分析了一个大语言模型家族。具体而言,我们研究参数量从 125m 到 66b 的 OPT 系列模型,且仅依据 FFN 神经元是否被激活。我们发现,网络早期部分十分稀疏,表征许多离散特征:在 66b 模型的某些层中,超过 70% 的神经元是"死亡"的,即在大量多样化数据上从不激活;同时,许多存活的神经元专用于离散特征,充当 token 与 n-gram 探测器。有趣的是,它们对应的 FFN 更新不仅如预期那样提升下一个 token 候选,还会显式地移除关于触发它们的当前输入 token 的信息。据我们所知,这是首个专门从残差流中移除(而非添加)信息的机制。随着规模增大,模型在"死亡"神经元和 token 探测器数量上变得更加稀疏。最后,一些神经元具有位置性:其是否激活主要(或完全)取决于位置,而较少(或完全不)取决于文本内容。我们发现,较小的模型拥有充当位置范围指示器的神经元集合,而更大的模型则以不那么显式的方式运作。
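
A small sketch of the paper's core measurement: mark an FFN neuron "dead" if its post-activation value never exceeds zero over a large, diverse token stream. It is written for a generic PyTorch model whose FFN activation layer can be hooked; the choice of layer and the ReLU-style ">0 means fired" convention are assumptions of this sketch.

```python
import torch

@torch.no_grad()
def dead_neuron_fraction(model, act_layer, batches):
    """batches: iterable of input tensors the model accepts."""
    ever_active = None

    def hook(_module, _inputs, out):
        nonlocal ever_active
        # out: (..., d_ffn); did each neuron fire for any token?
        fired = (out > 0).flatten(0, -2).any(dim=0)
        ever_active = fired if ever_active is None else ever_active | fired

    handle = act_layer.register_forward_hook(hook)
    for batch in batches:
        model(batch)
    handle.remove()
    return 1.0 - ever_active.float().mean().item()  # fraction never firing
```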

FaNS: a Facet-based Narrative Similarity Metric

  • paper_url: http://arxiv.org/abs/2309.04823
  • repo_url: None
  • paper_authors: Mousumi Akter, Shubhra Kanti Karmaker Santu
  • for: 本研究的目的是提出一种新的叙事相似度度量方法,以便更好地比较叙事的细节。
  • methods: 本研究使用了现有的大语言模型(LLMs)来提取5W1H的特征(Who, What, When, Where, Why, and How),并将其作为叙事相似度度量的基础。
  • results: 实验结果表明,与直接度量词汇/语义匹配的传统文本相似度指标相比,FaNS 指标的相关性高出 37%,说明其能更好地比较一对叙事之间的细节差异。
    Abstract Similar Narrative Retrieval is a crucial task since narratives are essential for explaining and understanding events, and multiple related narratives often help to create a holistic view of the event of interest. To accurately identify semantically similar narratives, this paper proposes a novel narrative similarity metric called Facet-based Narrative Similarity (FaNS), based on the classic 5W1H facets (Who, What, When, Where, Why, and How), which are extracted by leveraging the state-of-the-art Large Language Models (LLMs). Unlike existing similarity metrics that only focus on overall lexical/semantic match, FaNS provides a more granular matching along six different facets independently and then combines them. To evaluate FaNS, we created a comprehensive dataset by collecting narratives from AllSides, a third-party news portal. Experimental results demonstrate that the FaNS metric exhibits a higher correlation (37\% higher) than traditional text similarity metrics that directly measure the lexical/semantic match between narratives, demonstrating its effectiveness in comparing the finer details between a pair of narratives.
    摘要 相似叙事检索是一项关键任务:叙事是解释和理解事件的重要载体,而多条相关叙事往往有助于形成对目标事件的整体认识。为了准确识别语义相似的叙事,本文提出了一种新的叙事相似度指标——基于侧面的叙事相似度(FaNS)。该指标基于经典的 5W1H 侧面(Who、What、When、Where、Why、How),并借助最先进的大语言模型(LLM)进行侧面抽取。与仅关注整体词汇/语义匹配的现有相似度指标不同,FaNS 沿六个侧面分别进行更细粒度的独立匹配,再将结果组合。为评估 FaNS,我们从第三方新闻门户 AllSides 收集叙事,构建了一个综合数据集。实验结果表明,与直接度量叙事间词汇/语义匹配的传统文本相似度指标相比,FaNS 指标的相关性高出 37%,证明其在比较一对叙事的细节方面更为有效。
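
A simplified FaNS-style scorer, as a sketch: extract the 5W1H facets of each narrative (the paper does this with an LLM; `extract_facets` and `embed` are assumed interfaces here), score facet-wise cosine similarity, and average the six facet scores.

```python
import numpy as np

FACETS = ["who", "what", "when", "where", "why", "how"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def fans_similarity(narrative_a, narrative_b, extract_facets, embed):
    """extract_facets: text -> {facet: str}; embed: str -> np.ndarray."""
    fa, fb = extract_facets(narrative_a), extract_facets(narrative_b)
    scores = [cosine(embed(fa[f]), embed(fb[f])) for f in FACETS]
    return sum(scores) / len(scores)   # combine the six facet matches
```

The per-facet decomposition is what lets two narratives about the same event but different actors (matching What/When/Where, mismatching Who) score differently from a plain document-level similarity.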

MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images

  • paper_url: http://arxiv.org/abs/2309.04790
  • repo_url: None
  • paper_authors: Weihao Liu, Fangyu Lei, Tongxu Luo, Jiahe Lei, Shizhu He, Jun Zhao, Kang Liu
  • for: 解决跨文本、表格与图像的多模态异构问答问题
  • methods: 提出 MMHQA-ICL 框架,包括更强的异构数据检索器和图像描述模块,以及针对不同问题类型的上下文学习策略
  • results: 实验结果表明,在少样本设置下,我们的框架在 MultimodalQA 数据集上取得了最先进的结果,超过了所有基线方法以及在完整数据集上训练的方法
    Abstract In the real world, knowledge often exists in a multimodal and heterogeneous form. Addressing the task of question answering with hybrid data types, including text, tables, and images, is a challenging task (MMHQA). Recently, with the rise of large language models (LLM), in-context learning (ICL) has become the most popular way to solve QA problems. We propose MMHQA-ICL framework for addressing this problems, which includes stronger heterogeneous data retriever and an image caption module. Most importantly, we propose a Type-specific In-context Learning Strategy for MMHQA, enabling LLMs to leverage their powerful performance in this task. We are the first to use end-to-end LLM prompting method for this task. Experimental results demonstrate that our framework outperforms all baselines and methods trained on the full dataset, achieving state-of-the-art results under the few-shot setting on the MultimodalQA dataset.
    摘要 在现实世界中,知识往往以多模态、异构的形式存在。针对包含文本、表格和图像等混合数据类型的问答任务(MMHQA)颇具挑战。近年来,随着大语言模型(LLM)的兴起,上下文学习(ICL)已成为求解问答问题最流行的方法。我们为此提出了 MMHQA-ICL 框架,其中包含更强的异构数据检索器和图像描述模块。最重要的是,我们提出了一种针对问题类型的上下文学习策略,使 LLM 能在该任务上充分发挥其强大性能。我们是首个在该任务上使用端到端 LLM 提示方法的工作。实验结果表明,在少样本设置下,我们的框架超过了所有基线以及在完整数据集上训练的方法,在 MultimodalQA 数据集上取得了最先进的结果。
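
A minimal sketch of what "type-specific in-context learning" can look like: route each question to a prompt template and demonstration pool chosen by its predicted modality type (text / table / image caption). The templates and the routing interface below are illustrative assumptions, not the paper's exact prompts.

```python
TEMPLATES = {
    "table": "Answer using the table.\n{demos}\nTable: {context}\nQ: {q}\nA:",
    "text":  "Answer using the passage.\n{demos}\nPassage: {context}\nQ: {q}\nA:",
    "image": "Answer using the caption.\n{demos}\nCaption: {context}\nQ: {q}\nA:",
}

def build_prompt(question, context, qtype, demo_pools, k=3):
    # demo_pools: {qtype: [formatted demonstration strings]}
    demos = "\n".join(demo_pools[qtype][:k])
    return TEMPLATES[qtype].format(demos=demos, context=context, q=question)
```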

Data Augmentation for Conversational AI

  • paper_url: http://arxiv.org/abs/2309.04739
  • repo_url: https://github.com/dataug-convai/dataug-convai.github.io
  • paper_authors: Heydar Soudani, Evangelos Kanoulas, Faegheh Hasibi
  • for: 提高对话系统的信息访问,超越单个查询的限制
  • methods: 使用数据增强(DA)方法,缓解低资源领域和语言中的数据稀缺问题
  • results: 综述了对话系统中最新的数据增强技术,包括对话增强、开放域与任务型对话生成,以及评估这些模型的不同范式,并讨论了当前挑战与未来方向
    Abstract Advancements in conversational systems have revolutionized information access, surpassing the limitations of single queries. However, developing dialogue systems requires a large amount of training data, which is a challenge in low-resource domains and languages. Traditional data collection methods like crowd-sourcing are labor-intensive and time-consuming, making them ineffective in this context. Data augmentation (DA) is an affective approach to alleviate the data scarcity problem in conversational systems. This tutorial provides a comprehensive and up-to-date overview of DA approaches in the context of conversational systems. It highlights recent advances in conversation augmentation, open domain and task-oriented conversation generation, and different paradigms of evaluating these models. We also discuss current challenges and future directions in order to help researchers and practitioners to further advance the field in this area.
    摘要 对话系统的进步革新了信息获取的方式,突破了单次查询的限制。然而,开发对话系统需要大量训练数据,这在低资源领域和语言中是一大挑战。众包等传统数据收集方法劳动密集且耗时,在此场景下并不适用。数据增强(DA)是缓解对话系统数据稀缺问题的有效途径。本教程对对话系统中的 DA 方法进行了全面且最新的综述,重点介绍对话增强、开放域与任务型对话生成方面的最新进展,以及评估这些模型的不同范式。我们还讨论了当前的挑战与未来方向,以帮助研究者和从业者进一步推动该领域的发展。

Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering

  • paper_url: http://arxiv.org/abs/2309.04734
  • repo_url: None
  • paper_authors: Yifan Dong, Suhang Wu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jianxin Lin, Jinsong Su
  • for: 本研究旨在提出一种基于多模态信息的关键短语生成模型,以便更好地捕捉输入文本和图像对的核心意思。
  • methods: 我们提出了一种新的多模态关键短语生成模型,它不仅用外部知识丰富模型输入,还能有效过滤图像噪声。首先,我们引入图像的外部视觉实体作为补充输入,以促进跨模态语义对齐;其次,我们同时计算图文匹配分数与图像区域—文本相关分数,进行多粒度的图像噪声过滤;特别地,我们引入图像区域与真实关键短语之间的相关分数,以进一步改进上述相关分数的计算。
  • results: 我们在 benchmark 数据集上进行了多组实验,实验结果表明,我们的模型可以达到领先的性能。我们的代码可以在 https://github.com/DeepLearnXMU/MM-MKP 上找到。
    Abstract Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair. In this regard, dominant methods mainly focus on multi-modal fusion for keyphrase generation. Nevertheless, there are still two main drawbacks: 1) only a limited number of sources, such as image captions, can be utilized to provide auxiliary information. However, they may not be sufficient for the subsequent keyphrase generation. 2) the input text and image are often not perfectly matched, and thus the image may introduce noise into the model. To address these limitations, in this paper, we propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise. First, we introduce external visual entities of the image as the supplementary input to the model, which benefits the cross-modal semantic alignment for keyphrase generation. Second, we simultaneously calculate an image-text matching score and image region-text correlation scores to perform multi-granularity image noise filtering. Particularly, we introduce the correlation scores between image regions and ground-truth keyphrases to refine the calculation of the previously-mentioned correlation scores. To demonstrate the effectiveness of our model, we conduct several groups of experiments on the benchmark dataset. Experimental results and in-depth analyses show that our model achieves the state-of-the-art performance. Our code is available on https://github.com/DeepLearnXMU/MM-MKP.
    摘要 多模态关键短语生成旨在产出一组能代表输入图文对核心要点的关键短语。现有主流方法主要关注用于关键短语生成的多模态融合,但仍存在两个主要缺陷:1)只能利用图像描述等有限来源提供辅助信息,而这些信息对后续的关键短语生成未必充分;2)输入文本与图像往往并非完全匹配,图像因此可能给模型引入噪声。为解决这些局限,本文提出了一种新的多模态关键短语生成模型,它不仅以外部知识丰富模型输入,还能有效过滤图像噪声。首先,我们引入图像的外部视觉实体作为补充输入,有利于关键短语生成中的跨模态语义对齐;其次,我们同时计算图文匹配分数与图像区域—文本相关分数,以进行多粒度的图像噪声过滤;特别地,我们引入图像区域与真实关键短语之间的相关分数,进一步改进前述相关分数的计算。为验证模型的有效性,我们在基准数据集上进行了多组实验。实验结果与深入分析表明,我们的模型达到了最先进的性能。代码见 https://github.com/DeepLearnXMU/MM-MKP。

EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets

  • paper_url: http://arxiv.org/abs/2309.04725
  • repo_url: None
  • paper_authors: Hongyuan Lu, Wai Lam
  • for: This paper aims to improve the performance of large language models (LLMs) on various natural language processing (NLP) tasks by developing a novel method called Easy Prompt Augmentation (EPA).
  • methods: The proposed EPA method uses paraphrasing as an augmentation method to automatically generate multiple sources/targets for demonstrations, which are then used to improve the performance of LLMs on NLP tasks.
  • results: The proposed EPA method effectively improves the performance of LLMs on various NLP tasks, including natural language inference and machine translation, covering tens of languages.
    Abstract Large language models (LLMs) have shown promising performance on various NLP tasks via task prompting. And their performance can be further improved by appending task demonstrations to the head of the prompt. And usually, a better performance can be achieved with more demonstrations. However, asking the users to write the demonstrations can be cumbersome. As a simple yet cost-effective workaround, this paper proposes a novel method called EPA (Easy Prompt Augmentation; the name EDA is already taken by a well-known NLP method, Wei & Zou 2019) that effectively minimizes user efforts in writing demonstrations while improving the model performance at the same time. EPA achieves these goals by automatically augmenting the demonstrations with multiple sources/targets, where each of them paraphrases each other. This is well motivated as augmenting data via paraphrasing effectively improves neural language models. EPA thus employs paraphrasing as an augmentation method for in-context learning. Extensive experiments indicate that EPA effectively improves both NLU and NLG tasks, covering from natural language inference to machine translation in translating tens of languages. Code and data will be released upon publication.
    摘要 大语言模型(LLM)通过任务提示在各类 NLP 任务上展现出可观的性能,而在提示前附加任务示例还能进一步提升其表现;通常示例越多,效果越好。然而,要求用户自行撰写示例颇为繁琐。作为一种简单且低成本的变通方案,本文提出了一种名为 EPA(Easy Prompt Augmentation,因 EDA 之名已被一个著名的 NLP 方法占用而得名)的新方法,在提升模型性能的同时有效减少用户撰写示例的工作量。EPA 通过自动为示例扩充多个互为复述的来源/目标来实现这一目标——以复述方式扩充数据能有效改进神经语言模型,这一做法因此有充分依据。EPA 由此将复述用作上下文学习的增强手段。大量实验表明,EPA 能有效提升从自然语言推理到覆盖数十种语言的机器翻译等各类 NLU 与 NLG 任务。代码和数据将在论文发表后公开。
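
A minimal sketch of the EPA idea: each seed demonstration is expanded with paraphrases of its source and target before being placed in the prompt. The `paraphrase` callable stands in for any paraphraser (e.g., a round-trip translation model) and, like the prompt format, is an assumption of this sketch.

```python
def epa_prompt(demos, query, paraphrase, n_para=2):
    """demos: list of (source, target) pairs; query: the new input."""
    lines = []
    for src, tgt in demos:
        pairs = [(src, tgt)] + [(paraphrase(src), paraphrase(tgt))
                                for _ in range(n_para)]
        lines += [f"Input: {s}\nOutput: {t}" for s, t in pairs]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)
```

Each seed pair thus contributes `1 + n_para` in-context demonstrations without any extra human annotation effort.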

Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

  • paper_url: http://arxiv.org/abs/2309.04679
  • repo_url: None
  • paper_authors: C. M. Downey, Terra Blevins, Nora Goldfine, Shane Steinert-Threlkeld
  • for: 本研究旨在通过为特定语言定制多语言模型的词表与嵌入矩阵,提升低资源语言下的模型性能。
  • methods: 本研究提出了若干用紧凑的语言专用词表替换跨语言词表的简单技术,包括词表特化后 token 嵌入矩阵的多种重初始化策略。
  • results: 结果显示,词表特化与嵌入矩阵重初始化策略可提升低资源语言下的模型性能,且基于文字系统子分布的简单重初始化技术可与 Focus 方法相媲美。
    Abstract Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix comes at considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Namely, we address strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques, in addition to the recently-proposed Focus method. We demonstrate that: 1) Embedding-replacement techniques in the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual vocabularies with smaller specialized ones provides an efficient method to improve performance in low-resource languages. 3) Simple embedding re-initialization techniques based on script-wise sub-distributions rival techniques such as Focus, which rely on similarity scores obtained from an auxiliary model.
    摘要 预训练多语言模型支撑着英语之外绝大多数现代 NLP 工具。语言适应预训练(LAPT)是将这些模型特化到特定语言的有力基线,但在适配过程中保留庞大的跨语言词表与嵌入矩阵会带来可观的额外计算开销。本研究提出了若干用紧凑的语言专用词表替换跨语言词表的简单技术,尤其是词表特化后 token 嵌入矩阵的重初始化策略。我们对这些技术以及近期提出的 Focus 方法进行了系统的实验比较,结果表明:1)单语迁移文献中的嵌入替换技术不足以适配多语言模型;2)用更小的专用词表替换跨语言词表是提升低资源语言性能的高效方法;3)基于文字系统子分布的简单嵌入重初始化技术可与依赖辅助模型相似度分数的 Focus 等方法相媲美。
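
A sketch of script-wise embedding re-initialization in the spirit of the paper: tokens of the new, specialized vocabulary get embeddings sampled from the mean and (diagonal) standard deviation of old embeddings that share the same script. The `script_of` helper is an assumption; a Unicode-block lookup would be one way to implement it.

```python
import numpy as np

def reinit_embeddings(old_emb, old_tokens, new_tokens, script_of, rng=None):
    """old_emb: (V_old, d) array; script_of: token -> script name."""
    rng = rng or np.random.default_rng(0)
    by_script = {}
    for tok, vec in zip(old_tokens, old_emb):
        by_script.setdefault(script_of(tok), []).append(vec)
    stats = {s: (np.mean(v, 0), np.std(v, 0)) for s, v in by_script.items()}
    new_emb = np.empty((len(new_tokens), old_emb.shape[1]), old_emb.dtype)
    for i, tok in enumerate(new_tokens):
        # Fall back to the global sub-distribution for unseen scripts.
        mu, sd = stats.get(script_of(tok), (old_emb.mean(0), old_emb.std(0)))
        new_emb[i] = rng.normal(mu, sd)
    return new_emb
```

Unlike Focus-style methods, this needs no auxiliary model to compute similarity scores, which matches the paper's point that such simple script-wise schemes can be competitive.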

MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

  • paper_url: http://arxiv.org/abs/2309.04662
  • repo_url: None
  • paper_authors: Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat
  • for: 这个论文是为了介绍一个新的、通用领域的3Ttoken单语言 dataset,名为MADLAD-400,该 dataset 基于 CommonCrawl,覆盖了419种语言。
  • methods: 论文使用了自我审核的方法来检测 dataset 的局限性,并讨论了数据审核在 dataset 创建过程中的作用。
  • results: 论文在使用公共可用数据进行训练后,发现一个10.7B参数的多语言翻译模型和一个8B参数的语言模型,并对不同领域进行评估。Results 表明这些模型在翻译和ew-shot翻译方面具有竞争力,并且提供了基准模型供研究人员使用。
    Abstract We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, and find that it is competitive with models that are significantly larger, and report the results on different domains. In addition, we train a 8B-parameter language model, and assess the results on few-shot translation. We make the baseline models available to the research community.
    摘要 我们介绍 MADLAD-400,一个经人工审核的通用领域 3T token 单语数据集,基于 CommonCrawl 构建,覆盖 419 种语言。我们讨论了自我审核 MADLAD-400 所揭示的局限性,以及数据审核在数据集构建过程中的作用。随后,我们利用公开数据在 2500 亿 token 上训练并发布了一个覆盖 450 余种语言的 10.7B 参数多语言翻译模型,发现其可与规模大得多的模型相竞争,并报告了不同领域上的结果。此外,我们还训练了一个 8B 参数的语言模型,并评估了其少样本翻译能力。我们将基线模型公开给研究社区。

Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf

  • paper_url: http://arxiv.org/abs/2309.04658
  • repo_url: None
  • paper_authors: Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, Yang Liu
  • for: 这个论文探讨了如何让大语言模型(LLMs)参与交流游戏,并提出了一个不需要调整的框架。
  • methods: 该方法保持 LLM 参数冻结,依靠对过往交流与经验的检索和反思来实现改进。
  • results: 实验表明,该框架无需调整 LLM 参数即可有效游玩"狼人杀"游戏,且实验中涌现出策略性行为,表明让 LLM 参与交流游戏及相关领域是一个有前景的研究方向。
    Abstract Communication games, which we refer to as incomplete information games that heavily depend on natural language communication, hold significant research value in fields such as economics, social science, and artificial intelligence. In this work, we explore the problem of how to engage large language models (LLMs) in communication games, and in response, propose a tuning-free framework. Our approach keeps LLMs frozen, and relies on the retrieval and reflection on past communications and experiences for improvement. An empirical study on the representative and widely-studied communication game, ``Werewolf'', demonstrates that our framework can effectively play Werewolf game without tuning the parameters of the LLMs. More importantly, strategic behaviors begin to emerge in our experiments, suggesting that it will be a fruitful journey to engage LLMs in communication games and associated domains.
    摘要 交流游戏——即高度依赖自然语言交流的不完全信息游戏——在经济学、社会科学和人工智能等领域具有重要的研究价值。本文研究如何让大语言模型(LLM)参与交流游戏,并为此提出了一个免调参框架:该方法保持 LLM 参数冻结,依靠对过往交流与经验的检索和反思来实现改进。在代表性且被广泛研究的交流游戏"狼人杀"上的实证研究表明,我们的框架无需调整 LLM 参数即可有效进行游戏。更重要的是,实验中开始涌现出策略性行为,这表明让 LLM 参与交流游戏及相关领域将是一段富有成果的旅程。

cs.LG - 2023-09-09

Symplectic Structure-Aware Hamiltonian (Graph) Embeddings

  • paper_url: http://arxiv.org/abs/2309.04885
  • repo_url: None
  • paper_authors: Jiaxu Liu, Xinping Yi, Tianle Zhang, Xiaowei Huang
  • for: 这项研究旨在突破传统图神经网络(GNN)中固定嵌入流形的限制,使模型能更好地适应多样的图几何。
  • methods: 该研究利用哈密顿动力学更新节点特征,并在辛 Stiefel 流形上进行黎曼优化,在训练中自适应地学习底层的辛结构。
  • results: 该方法在多种类型的图数据集上的节点分类任务中表现出优越的性能与适应性,且在训练过程中保持能量守恒,使隐式哈密顿系统具有物理意义。
    Abstract In traditional Graph Neural Networks (GNNs), the assumption of a fixed embedding manifold often limits their adaptability to diverse graph geometries. Recently, Hamiltonian system-inspired GNNs are proposed to address the dynamic nature of such embeddings by incorporating physical laws into node feature updates. In this work, we present SAH-GNN, a novel approach that generalizes Hamiltonian dynamics for more flexible node feature updates. Unlike existing Hamiltonian-inspired GNNs, SAH-GNN employs Riemannian optimization on the symplectic Stiefel manifold to adaptively learn the underlying symplectic structure during training, circumventing the limitations of existing Hamiltonian GNNs that rely on a pre-defined form of standard symplectic structure. This innovation allows SAH-GNN to automatically adapt to various graph datasets without extensive hyperparameter tuning. Moreover, it conserves energy during training such that the implicit Hamiltonian system is physically meaningful. To this end, we empirically validate SAH-GNN's superior performance and adaptability in node classification tasks across multiple types of graph datasets.
    摘要 在传统的图神经网络(GNN)中,固定嵌入流形的假设往往限制了其对多样图几何的适应能力。近期,受哈密顿系统启发的 GNN 被提出,通过将物理定律融入节点特征更新来刻画这类嵌入的动态性。本文提出 SAH-GNN,一种将哈密顿动力学推广、以实现更灵活节点特征更新的新方法。与现有受哈密顿启发的 GNN 不同,SAH-GNN 在辛 Stiefel 流形上进行黎曼优化,在训练过程中自适应地学习底层辛结构,从而规避了现有哈密顿 GNN 依赖预定义标准辛结构形式的局限。这一创新使 SAH-GNN 能够自动适应各类图数据集,而无需大量超参数调优;同时,它在训练中保持能量守恒,使隐式哈密顿系统具有物理意义。我们在多种类型的图数据集上的节点分类任务中实证验证了 SAH-GNN 的优越性能与适应性。
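
For context, the Hamiltonian-GNN family updates node features by integrating Hamilton's equations dq/dt = dH/dp, dp/dt = -dH/dq. The leapfrog step below is the standard symplectic, energy-conserving integrator such methods build on; SAH-GNN additionally *learns* the symplectic structure on the Stiefel manifold, which this plain sketch does not attempt.

```python
import numpy as np

def leapfrog_step(q, p, grad_H_q, grad_H_p, dt=0.1):
    p = p - 0.5 * dt * grad_H_q(q)   # half kick
    q = q + dt * grad_H_p(p)         # drift (separable H assumed)
    p = p - 0.5 * dt * grad_H_q(q)   # half kick
    return q, p

# Example: harmonic oscillator H = (q^2 + p^2) / 2; the energy stays
# bounded over long rollouts, unlike with a plain Euler integrator.
q, p = np.array([1.0]), np.array([0.0])
for _ in range(100):
    q, p = leapfrog_step(q, p, lambda q: q, lambda p: p)
```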

A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning

  • paper_url: http://arxiv.org/abs/2309.04877
  • repo_url: None
  • paper_authors: Neha S. Wadia, Yatin Dandi, Michael I. Jordan
  • for: 这篇论文面向机器学习领域的拓展与进步,具体而言是将关注点从模式识别转向决策与多智能体问题;在更广的设定下,新的数学挑战涉及均衡与博弈论而非最优点。
  • methods: 论文从鞍点与单调博弈讲起,进而推广到一般的变分不等式,介绍了一个更广义的基于梯度的算法框架。
  • results: 论文为机器学习中的梯度类算法提供了一个更广义的理解框架,涵盖鞍点与单调博弈等;虽然对部分算法给出了收敛性证明,但其主要重点在于提供动机与直觉。
    Abstract The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based methods remain essential -- given the high dimensionality and large scale of machine-learning problems -- but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms that we present, our main focus is that of providing motivation and intuition.
    摘要 近年来机器学习的快速进步,建立在其与基于梯度的优化之间高度高效的联系之上。进一步的发展在一定程度上有赖于将关注点从模式识别转向决策与多智能体问题。在这些更广的设定中,涌现出涉及均衡与博弈论(而非最优点)的新数学挑战。鉴于机器学习问题的高维度与大规模,基于梯度的方法依然不可或缺,但简单的梯度下降已不再是算法设计的出发点。我们为机器学习中基于梯度的算法提供了一个更广义框架的入门介绍:从鞍点与单调博弈讲起,进而推广到一般的变分不等式。虽然我们对文中部分算法给出了收敛性证明,但主要目的在于提供动机与直觉。
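
The tutorial's departure point beyond plain gradient descent is the saddle-point problem min_x max_y f(x, y), where simultaneous gradient steps can cycle forever. The extragradient method below is the classic fix: take a trial step, then update from the gradient evaluated at the trial point. This is a standard textbook algorithm shown here as a sketch, not a method specific to this paper.

```python
def extragradient(grad_x, grad_y, x, y, lr=0.1, steps=500):
    for _ in range(steps):
        xh = x - lr * grad_x(x, y)      # trial (extrapolation) step
        yh = y + lr * grad_y(x, y)
        x = x - lr * grad_x(xh, yh)     # real step from trial gradients
        y = y + lr * grad_y(xh, yh)
    return x, y

# Example: f(x, y) = x*y, whose unique saddle point is (0, 0).
# Simultaneous gradient descent-ascent spirals outward on this problem;
# extragradient converges.
x, y = extragradient(lambda x, y: y, lambda x, y: x, 1.0, 1.0)
print(x, y)  # both close to 0
```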

Approximating ReLU on a Reduced Ring for Efficient MPC-based Private Inference

  • paper_url: http://arxiv.org/abs/2309.04875
  • repo_url: None
  • paper_authors: Kiwan Maeng, G. Edward Suh
  • for: 这篇论文旨在加速基于不可信服务器的隐私保护机器学习推理,同时保护用户的隐私敏感数据。
  • methods: 本文基于安全多方计算(MPC)技术,提出名为 HummingBird 的框架,通过仅使用秘密份额中的部分比特、在更小的环上评估 ReLU,大幅降低 ReLU 评估所需的通信量。
  • results: 在涉及多台服务器的真实 MPC 环境中,HummingBird 在不引入任何误差的情况下平均实现 2.03-2.67 倍的端到端加速;在可容忍一定精度损失时,平均加速最高可达 8.64 倍。
    Abstract Secure multi-party computation (MPC) allows users to offload machine learning inference on untrusted servers without having to share their privacy-sensitive data. Despite their strong security properties, MPC-based private inference has not been widely adopted in the real world due to their high communication overhead. When evaluating ReLU layers, MPC protocols incur a significant amount of communication between the parties, making the end-to-end execution time multiple orders slower than its non-private counterpart. This paper presents HummingBird, an MPC framework that reduces the ReLU communication overhead significantly by using only a subset of the bits to evaluate ReLU on a smaller ring. Based on theoretical analyses, HummingBird identifies bits in the secret share that are not crucial for accuracy and excludes them during ReLU evaluation to reduce communication. With its efficient search engine, HummingBird discards 87--91% of the bits during ReLU and still maintains high accuracy. On a real MPC setup involving multiple servers, HummingBird achieves on average 2.03--2.67x end-to-end speedup without introducing any errors, and up to 8.64x average speedup when some amount of accuracy degradation can be tolerated, due to its up to 8.76x communication reduction.
    摘要 安全多方计算(MPC)允许用户将机器学习推理外包给不可信服务器,而无需共享隐私敏感数据。尽管具有很强的安全性质,基于 MPC 的隐私推理因通信开销过高而尚未在现实中得到广泛采用:在评估 ReLU 层时,MPC 协议需要参与方之间进行大量通信,使端到端执行时间比非隐私版本慢多个数量级。本文提出 HummingBird,一个通过仅使用部分比特、在更小的环上评估 ReLU 来显著降低 ReLU 通信开销的 MPC 框架。基于理论分析,HummingBird 识别出秘密份额中对精度无关紧要的比特,并在 ReLU 评估时将其排除以减少通信。凭借高效的搜索引擎,HummingBird 在 ReLU 评估中可舍弃 87-91% 的比特,同时保持高精度。在涉及多台服务器的真实 MPC 环境中,HummingBird 在不引入任何误差的情况下平均实现 2.03-2.67 倍的端到端加速;得益于最高 8.76 倍的通信削减,在可容忍一定精度损失时,平均加速最高可达 8.64 倍。
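
A plaintext illustration of the bit-reduction intuition: decide the sign needed for ReLU from only a high-order subset of the fixed-point bits instead of the full ring. In the actual MPC protocol this shrinks the ring the secure comparison runs over, cutting communication; the sketch below merely shows in the clear which inputs the truncated decision can get wrong. Bit widths are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def relu_with_truncated_decision(x, frac_bits=16, total_bits=32,
                                 kept_bits=20):
    # Encode to fixed point, drop the low (total_bits - kept_bits) bits,
    # make the ReLU sign decision on the truncated value, apply to full x.
    x_fixed = np.round(x * (1 << frac_bits)).astype(np.int64)
    trunc = x_fixed >> (total_bits - kept_bits)  # keep high-order bits
    mask = trunc > 0                             # sign decision, fewer bits
    return np.where(mask, x, 0.0)

x = np.linspace(-2, 2, 9)
print(relu_with_truncated_decision(x))
# Errors occur only for tiny positive values (|x| below the dropped
# resolution), which is why accuracy degrades gracefully as bits are cut.
```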

Approximation Results for Gradient Descent trained Neural Networks

  • paper_url: http://arxiv.org/abs/2309.04860
  • repo_url: None
  • paper_authors: G. Welper
  • for: 这篇论文旨在为以梯度流训练、逼近 Sobolev 光滑目标函数的神经网络提供近似保证。
  • methods: 论文采用梯度流与神经正切核(NTK)分析。
  • results: 结果表明,对 Sobolev 光滑的目标函数,以梯度流训练的神经网络在欠参数化设定下也能获得近似保证,但相对于既有的近似方法会损失一定的逼近速率。
    Abstract The paper contains approximation guarantees for neural networks that are trained with gradient flow, with error measured in the continuous $L_2(\mathbb{S}^{d-1})$-norm on the $d$-dimensional unit sphere and targets that are Sobolev smooth. The networks are fully connected of constant depth and increasing width. Although all layers are trained, the gradient flow convergence is based on a neural tangent kernel (NTK) argument for the non-convex second but last layer. Unlike standard NTK analysis, the continuous error norm implies an under-parametrized regime, possible by the natural smoothness assumption required for approximation. The typical over-parametrization re-enters the results in form of a loss in approximation rate relative to established approximation methods for Sobolev smooth functions.
    摘要 本文为以梯度流训练的神经网络给出了近似保证,误差以 $d$ 维单位球面上的连续 $L_2(\mathbb{S}^{d-1})$ 范数度量,目标函数具有 Sobolev 光滑性。所考虑的网络为深度固定、宽度递增的全连接网络。尽管所有层都参与训练,梯度流的收敛性分析基于对倒数第二层的神经正切核(NTK)论证。与标准 NTK 分析不同,连续误差范数意味着一个欠参数化的机制,这得益于近似所需的自然光滑性假设。典型的过参数化则以相对于既有 Sobolev 光滑函数近似方法的逼近速率损失的形式重新出现在结果中。

HAct: Out-of-Distribution Detection with Neural Net Activation Histograms

  • paper_url: http://arxiv.org/abs/2309.04837
  • repo_url: None
  • paper_authors: Sudeepta Mondal, Ganesh Sundaramoorthi
  • for: 检测已训练神经网络模型的分布外(out-of-distribution,OOD)输入数据,为 OOD 泛化方法提供潜在的第一步
  • methods: 提出了一种简单、高效且准确的 OOD 检测方法,其核心是一种新的描述子 HAct(激活直方图),即在输入数据作用下神经网络各层输出值的概率分布(以直方图近似)
  • results: 在多个 OOD 图像分类基准上显著优于现有最佳方法;例如在标准 OOD 基准上,基于 Resnet-50 在 95% 真阳性率(TPR)下仅有 0.05% 的假阳性率,并在相同 TPR 下将假阳性率较此前最佳方法降低了 20.66%
    Abstract We propose a simple, efficient, and accurate method for detecting out-of-distribution (OOD) data for trained neural networks, a potential first step in methods for OOD generalization. We propose a novel descriptor, HAct - activation histograms, for OOD detection, that is, probability distributions (approximated by histograms) of output values of neural network layers under the influence of incoming data. We demonstrate that HAct is significantly more accurate than state-of-the-art on multiple OOD image classification benchmarks. For instance, our approach achieves a true positive rate (TPR) of 95% with only 0.05% false-positives using Resnet-50 on standard OOD benchmarks, outperforming previous state-of-the-art by 20.66% in the false positive rate (at the same TPR of 95%). The low computational complexity and the ease of implementation make HAct suitable for online implementation in monitoring deployed neural networks in practice at scale.
    摘要 我们提出了一种简单、高效且准确的方法,用于检测已训练神经网络的分布外(OOD)数据,这可作为 OOD 泛化方法的潜在第一步。我们为 OOD 检测提出了一种新的描述子 HAct(激活直方图),即在输入数据作用下神经网络各层输出值的概率分布(以直方图近似)。实验表明,HAct 在多个 OOD 图像分类基准上的准确率显著优于现有最佳方法:例如在标准 OOD 基准上,基于 Resnet-50 的方法在 95% 真阳性率(TPR)下仅有 0.05% 的假阳性率,在相同 TPR 下将假阳性率较此前最佳方法降低了 20.66%。较低的计算复杂度与易于实现的特点,使 HAct 适合在线大规模监控已部署的神经网络。
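
A compact sketch of the HAct idea: summarize a layer's output values as a normalized histogram, build a reference histogram from in-distribution data, and flag inputs whose histogram lies too far from the reference. The L1 distance, bin count, and thresholding are assumptions of this sketch rather than the paper's exact recipe.

```python
import numpy as np

def activation_histogram(values, bins=64, value_range=(-5.0, 5.0)):
    hist, _ = np.histogram(np.ravel(values), bins=bins, range=value_range)
    return hist / max(hist.sum(), 1)   # normalized to a distribution

def hact_score(ref_hist, sample_values, bins=64, value_range=(-5.0, 5.0)):
    h = activation_histogram(sample_values, bins, value_range)
    return float(np.abs(h - ref_hist).sum())  # larger => more OOD-like

# Usage: ref_hist is averaged over layer outputs on in-distribution data;
# at test time, threshold hact_score to decide OOD vs in-distribution.
```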

Correcting sampling biases via importance reweighting for spatial modeling

  • paper_url: http://arxiv.org/abs/2309.04824
  • repo_url: None
  • paper_authors: Boris Prokhorov, Diana Koldasbayeva, Alexey Zaytsev
  • for: 该论文旨在解决机器学习模型误差估计中的分布偏差问题,尤其针对环境研究中常见的空间数据。
  • methods: 该方法基于重要性采样的思想,通过考虑目标误差与可用数据之间的差异,对每个采样点的误差重新加权,以中和分布偏移;重加权利用重要性采样技术与核密度估计实现。
  • results: 我们用与真实空间数据集相似的人工数据验证了该方法的有效性。结果表明,预测的总体误差从 7% 降至仅 2%,且样本规模越大,误差越小。
    Abstract In machine learning models, the estimation of errors is often complex due to distribution bias, particularly in spatial data such as those found in environmental studies. We introduce an approach based on the ideas of importance sampling to obtain an unbiased estimate of the target error. By taking into account difference between desirable error and available data, our method reweights errors at each sample point and neutralizes the shift. Importance sampling technique and kernel density estimation were used for reweighteing. We validate the effectiveness of our approach using artificial data that resemble real-world spatial datasets. Our findings demonstrate advantages of the proposed approach for the estimation of the target error, offering a solution to a distribution shift problem. Overall error of predictions dropped from 7% to just 2% and it gets smaller for larger samples.
    摘要 在机器学习模型中,误差估计常因分布偏差而变得复杂,在环境研究等领域的空间数据中尤为突出。我们提出了一种基于重要性采样思想的方法,以获得目标误差的无偏估计:通过考虑目标误差与可用数据之间的差异,该方法对每个采样点的误差重新加权,从而中和分布偏移。重加权利用重要性采样技术与核密度估计实现。我们用与真实空间数据集相似的人工数据验证了方法的有效性。结果展示了所提方法在目标误差估计上的优势,为分布偏移问题提供了一种解决方案:预测的总体误差从 7% 降至仅 2%,且随样本规模增大而进一步减小。
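
A sketch of an importance-reweighted error estimator in the spirit of the abstract: weight each held-out point by the ratio of the target density to the sampling density, both estimated with kernel density estimation, so the reweighted error neutralizes the spatial sampling bias. The shared bandwidth and the scikit-learn estimator are assumptions of this sketch.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def reweighted_error(errors, sample_locs, target_locs, bandwidth=0.5):
    """errors: per-point errors at sample_locs; *_locs: (n, d) arrays."""
    kde_s = KernelDensity(bandwidth=bandwidth).fit(sample_locs)
    kde_t = KernelDensity(bandwidth=bandwidth).fit(target_locs)
    # score_samples returns log densities; the difference is log w.
    log_w = (kde_t.score_samples(sample_locs)
             - kde_s.score_samples(sample_locs))
    w = np.exp(log_w)
    return float(np.sum(w * errors) / np.sum(w))  # importance-weighted mean
```

Normalizing by the weight sum (a self-normalized importance estimator) keeps the estimate stable when the density ratio is only known up to KDE error.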

Detecting Violations of Differential Privacy for Quantum Algorithms

  • paper_url: http://arxiv.org/abs/2309.04819
  • repo_url: None
  • paper_authors: Ji Guan, Wang Fang, Mingyu Huang, Mingsheng Ying
  • for: 本研究旨在提出一种形式化的检测框架,用于检测量子算法对差分隐私的违反。
  • methods: 本文开发了一种检测算法,用于验证(含噪)量子算法是否满足差分隐私,并在检出违反时自动生成调试信息;该算法以张量网络这一高效数据结构实现,并在知名机器学习平台 TensorFlow 与 PyTorch 的量子扩展——TensorFlow Quantum 与 TorchQuantum 上执行。
  • results: 实验结果表明,该方法能有效检测几乎所有类型量子算法中的差分隐私违反,包括量子霸权算法(超出经典算法能力范围)、量子机器学习模型、量子近似优化算法,以及最多 21 个量子比特的变分量子本征求解器。
    Abstract Quantum algorithms for solving a wide range of practical problems have been proposed in the last ten years, such as data search and analysis, product recommendation, and credit scoring. The concern about privacy and other ethical issues in quantum computing naturally rises up. In this paper, we define a formal framework for detecting violations of differential privacy for quantum algorithms. A detection algorithm is developed to verify whether a (noisy) quantum algorithm is differentially private and automatically generate bugging information when the violation of differential privacy is reported. The information consists of a pair of quantum states that violate the privacy, to illustrate the cause of the violation. Our algorithm is equipped with Tensor Networks, a highly efficient data structure, and executed both on TensorFlow Quantum and TorchQuantum which are the quantum extensions of famous machine learning platforms -- TensorFlow and PyTorch, respectively. The effectiveness and efficiency of our algorithm are confirmed by the experimental results of almost all types of quantum algorithms already implemented on realistic quantum computers, including quantum supremacy algorithms (beyond the capability of classical algorithms), quantum machine learning models, quantum approximate optimization algorithms, and variational quantum eigensolvers with up to 21 quantum bits.
    摘要 过去十年间,研究者提出了用于解决数据搜索与分析、产品推荐、信用评分等各类实际问题的量子算法,量子计算中的隐私及其他伦理问题也随之引起关注。本文为检测量子算法对差分隐私的违反定义了一个形式化框架,并开发了一种检测算法:它可验证(含噪)量子算法是否满足差分隐私,并在报告违反时自动生成调试信息——即一对违反隐私的量子态,用以说明违反的成因。该算法配备了张量网络这一高效数据结构,并在 TensorFlow Quantum 与 TorchQuantum(分别为知名机器学习平台 TensorFlow 与 PyTorch 的量子扩展)上执行。在已于真实量子计算机上实现的几乎所有类型量子算法上(包括超出经典算法能力的量子霸权算法、量子机器学习模型、量子近似优化算法,以及最多 21 个量子比特的变分量子本征求解器),实验结果验证了该算法的有效性与高效性。

Neural Latent Geometry Search: Product Manifold Inference via Gromov-Hausdorff-Informed Bayesian Optimization

  • paper_url: http://arxiv.org/abs/2309.04810
  • repo_url: None
  • paper_authors: Haitz Saez de Ocariz Borde, Alvaro Arroyo, Ismael Morales, Ingmar Posner, Xiaowen Dong
  • for: 通过使潜在空间的几何结构与底层数据结构对齐,提升机器学习模型的性能。
  • methods: 提出了一种名为神经潜在几何搜索(NLGS)的新问题形式化,基于度量几何中的 Gromov-Hausdorff 距离定义候选潜在几何之间的距离,并利用贝叶斯优化以查询高效的方式搜索由常曲率模型空间乘积构成的最优潜在几何。
  • results: 实验证明,NLGS 能高效地为多种机器学习模型与下游任务找到最优潜在几何,从而提升模型性能。
    Abstract Recent research indicates that the performance of machine learning models can be improved by aligning the geometry of the latent space with the underlying data structure. Rather than relying solely on Euclidean space, researchers have proposed using hyperbolic and spherical spaces with constant curvature, or combinations thereof, to better model the latent space and enhance model performance. However, little attention has been given to the problem of automatically identifying the optimal latent geometry for the downstream task. We mathematically define this novel formulation and coin it as neural latent geometry search (NLGS). More specifically, we introduce a principled method that searches for a latent geometry composed of a product of constant curvature model spaces with minimal query evaluations. To accomplish this, we propose a novel notion of distance between candidate latent geometries based on the Gromov-Hausdorff distance from metric geometry. In order to compute the Gromov-Hausdorff distance, we introduce a mapping function that enables the comparison of different manifolds by embedding them in a common high-dimensional ambient space. Finally, we design a graph search space based on the calculated distances between candidate manifolds and use Bayesian optimization to search for the optimal latent geometry in a query-efficient manner. This is a general method which can be applied to search for the optimal latent geometry for a variety of models and downstream tasks. Extensive experiments on synthetic and real-world datasets confirm the efficacy of our method in identifying the optimal latent geometry for multiple machine learning problems.
    摘要 We propose a novel approach called neural latent geometry search (NLGS) to address this problem. NLGS is a principled method that searches for a latent geometry composed of a product of constant curvature model spaces with minimal query evaluations. To accomplish this, we introduce a new notion of distance between candidate latent geometries based on the Gromov-Hausdorff distance from metric geometry. This distance measure allows us to compare different manifolds by embedding them in a common high-dimensional ambient space.We then design a graph search space based on the calculated distances between candidate manifolds and use Bayesian optimization to search for the optimal latent geometry in a query-efficient manner. This method is general and can be applied to search for the optimal latent geometry for a variety of models and downstream tasks.Extensive experiments on synthetic and real-world datasets confirm the effectiveness of our method in identifying the optimal latent geometry for multiple machine learning problems. By automatically identifying the optimal latent geometry, our method can improve the performance of machine learning models and help to unlock their full potential.

Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape

  • paper_url: http://arxiv.org/abs/2309.04788
  • repo_url: None
  • paper_authors: Persia Jana Kamali, Pierfrancesco Urbani
  • for: 这篇论文主要研究随机梯度下降(SGD)在训练人工神经网络中的作用,以及它在高维非凸优化问题上相对于梯度下降(GD)的表现。
  • methods: 这篇论文利用动力学平均场理论,在高维极限下精确分析 SGD 的性能。
  • results: 研究以恢复高维非线性加密信号这一典型的高维非凸困难优化问题为例,发现 SGD 大幅优于 GD;对两种算法弛豫时间的幂律拟合表明,小批量 SGD 的恢复阈值小于 GD 的相应阈值。
    Abstract Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artificial neural networks. However, very little is known about the extent to which SGD is crucial to the success of this technology and, in particular, how effective it is in optimizing high-dimensional non-convex cost functions as compared to other optimization algorithms such as Gradient Descent (GD). In this work we leverage dynamical mean field theory to analyze exactly its performances in the high-dimensional limit. We consider the problem of recovering a hidden high-dimensional non-linearly encrypted signal, a prototype high-dimensional non-convex hard optimization problem. We compare the performances of SGD to GD and we show that SGD largely outperforms GD. In particular, a power law fit of the relaxation time of these algorithms shows that the recovery threshold for SGD with small batch size is smaller than the corresponding one of GD.
    摘要 随机梯度下降(SGD)是一种被广泛用于训练人工神经网络的非平衡算法。然而,对于 SGD 在多大程度上决定了这项技术的成功——尤其是与梯度下降(GD)等其他优化算法相比,它在优化高维非凸代价函数方面的有效性如何——人们知之甚少。本文利用动力学平均场理论,在高维极限下精确分析其性能。我们考虑恢复一个隐藏的高维非线性加密信号——一个典型的高维非凸困难优化问题——并比较 SGD 与 GD 的表现,结果表明 SGD 大幅优于 GD。特别地,对两种算法弛豫时间的幂律拟合表明,小批量 SGD 的恢复阈值小于 GD 的相应阈值。

RRCNN$^{+}$: An Enhanced Residual Recursive Convolutional Neural Network for Non-stationary Signal Decomposition

  • paper_url: http://arxiv.org/abs/2309.04782
  • repo_url: https://github.com/zhoudafa08/RRCNN_plus
  • paper_authors: Feng Zhou, Antonio Cicone, Haomin Zhou
  • for: 这篇论文主要针对非线性与非平稳信号时频分析中的挑战。
  • methods: 该论文在以经验模态分解为先驱的方法基础上,借助深度学习为非平稳信号分解提供了独特视角,并结合深度学习与优化领域的若干技术改进了残差递归卷积神经网络(RRCNN)。
  • results: 研究表明,改进后的方法在批量处理大规模信号时能以较低的计算成本实现更稳定的分解。
    Abstract Time-frequency analysis is an important and challenging task in many applications. Fourier and wavelet analysis are two classic methods that have achieved remarkable success in many fields. They also exhibit limitations when applied to nonlinear and non-stationary signals. To address this challenge, a series of nonlinear and adaptive methods, pioneered by the empirical mode decomposition method have been proposed. Their aim is to decompose a non-stationary signal into quasi-stationary components which reveal better features in the time-frequency analysis. Recently, inspired by deep learning, we proposed a novel method called residual recursive convolutional neural network (RRCNN). Not only RRCNN can achieve more stable decomposition than existing methods while batch processing large-scale signals with low computational cost, but also deep learning provides a unique perspective for non-stationary signal decomposition. In this study, we aim to further improve RRCNN with the help of several nimble techniques from deep learning and optimization to ameliorate the method and overcome some of the limitations of this technique.
    摘要 时频分析是许多应用中一项重要而富有挑战性的任务。傅里叶分析与小波分析是两种经典方法,已在众多领域取得显著成功,但在处理非线性与非平稳信号时仍存在局限。为应对这一挑战,研究者提出了以经验模态分解(EMD)为先驱的一系列非线性自适应方法,其目标是将非平稳信号分解为在时频分析中特征更清晰的准平稳分量。最近,受深度学习启发,我们提出了一种名为残差递归卷积神经网络(RRCNN)的新方法:RRCNN 不仅能在批量处理大规模信号时以较低计算成本实现比现有方法更稳定的分解,深度学习还为非平稳信号分解提供了独特视角。在本研究中,我们借助深度学习与优化领域的若干灵巧技术进一步改进 RRCNN,以克服该方法的部分局限。

A Comprehensive Survey on Deep Learning Techniques in Educational Data Mining

  • paper_url: http://arxiv.org/abs/2309.04761
  • repo_url: None
  • paper_authors: Yuanguo Lin, Hong Chen, Wei Xia, Fan Lin, Pengcheng Wu, Zongyue Wang, Yong Li
  • for: 这篇论文旨在系统地回顾现代教育中使用深度学习技术的教育数据挖掘(EDM)现状。
  • methods: 本综述梳理了使用深度学习技术分析和建模教育数据的研究,涵盖知识追踪、异常学生检测、成绩预测和个性化推荐四类典型教育场景。
  • results: 本论文对现有的公共数据集和处理工具进行了全面的概述,并指出了未来这个领域的趋势和发展方向。
    Abstract Educational Data Mining (EDM) has emerged as a vital field of research, which harnesses the power of computational techniques to analyze educational data. With the increasing complexity and diversity of educational data, Deep Learning techniques have shown significant advantages in addressing the challenges associated with analyzing and modeling this data. This survey aims to systematically review the state-of-the-art in EDM with Deep Learning. We begin by providing a brief introduction to EDM and Deep Learning, highlighting their relevance in the context of modern education. Next, we present a detailed review of Deep Learning techniques applied in four typical educational scenarios, including knowledge tracing, undesirable student detecting, performance prediction, and personalized recommendation. Furthermore, a comprehensive overview of public datasets and processing tools for EDM is provided. Finally, we point out emerging trends and future directions in this research area.
    摘要 教育数据挖掘(EDM)已成为一个重要的研究领域,它利用计算技术分析教育数据。随着教育数据的复杂度和多样性不断增加,深度学习技术在分析与建模这类数据方面展现出显著优势。本综述系统回顾了结合深度学习的 EDM 研究现状:首先简要介绍 EDM 与深度学习,强调其在现代教育中的意义;随后详细评述深度学习技术在知识追踪、异常学生检测、成绩预测和个性化推荐四类典型教育场景中的应用;此外,还对 EDM 的公开数据集与处理工具进行了全面概述;最后指出了该研究领域的新兴趋势与未来方向。

Gromov-Hausdorff Distances for Comparing Product Manifolds of Model Spaces

  • paper_url: http://arxiv.org/abs/2309.05678
  • repo_url: None
  • paper_authors: Haitz Saez de Ocariz Borde, Alvaro Arroyo, Ismael Morales, Ingmar Posner, Xiaowen Dong
  • for: 通过使潜在空间的几何特征与底层数据结构对齐,提升机器学习模型的性能。
  • methods: 使用具有常曲率的双曲与球面空间或其组合(称为乘积流形)来改进模型性能,并基于度量几何中的 Gromov-Hausdorff 距离定义候选潜在几何之间的距离,在图搜索空间中搜索最优潜在几何。
  • results: 提出了一种利用 Gromov-Hausdorff 距离评估候选潜在几何的新思路,并给出了计算模型空间之间 Gromov-Hausdorff 距离的算法及其计算实现。
    Abstract Recent studies propose enhancing machine learning models by aligning the geometric characteristics of the latent space with the underlying data structure. Instead of relying solely on Euclidean space, researchers have suggested using hyperbolic and spherical spaces with constant curvature, or their combinations (known as product manifolds), to improve model performance. However, there exists no principled technique to determine the best latent product manifold signature, which refers to the choice and dimensionality of manifold components. To address this, we introduce a novel notion of distance between candidate latent geometries using the Gromov-Hausdorff distance from metric geometry. We propose using a graph search space that uses the estimated Gromov-Hausdorff distances to search for the optimal latent geometry. In this work we focus on providing a description of an algorithm to compute the Gromov-Hausdorff distance between model spaces and its computational implementation.
    摘要 近期研究提出,通过使潜在空间的几何特征与底层数据结构对齐来增强机器学习模型。研究者建议不再仅依赖欧氏空间,而是使用具有常曲率的双曲空间和球面空间,或它们的组合(即乘积流形)来提升模型性能。然而,目前尚无原则性的技术来确定最优的潜在乘积流形签名,即流形分量的选择及其维度。为此,我们基于度量几何中的 Gromov-Hausdorff 距离,引入了一种候选潜在几何之间的新距离概念,并提议利用估计出的 Gromov-Hausdorff 距离构建图搜索空间,以搜索最优潜在几何。本文的重点在于描述计算模型空间之间 Gromov-Hausdorff 距离的算法及其计算实现。
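
As a rough sketch of the "common ambient space" step: once two finite samples from candidate model spaces are isometrically embedded into the same ambient space, the ordinary Hausdorff distance below upper-bounds their Gromov-Hausdorff distance (GH is the infimum of this quantity over all such embedding pairs). The true GH computation optimizes over embeddings, which this sketch omits.

```python
import numpy as np

def hausdorff(X, Y):
    """X: (n, d), Y: (m, d) point clouds in a common ambient space."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # max over points of the distance to the nearest point in the other set
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```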

Affine Invariant Ensemble Transform Methods to Improve Predictive Uncertainty in ReLU Networks

  • paper_url: http://arxiv.org/abs/2309.04742
  • repo_url: None
  • paper_authors: Diksha Bhandari, Jakiw Pidstrigach, Sebastian Reich
  • for: 利用集成卡尔曼滤波(ensemble Kalman filter)的适当推广,对 logistic 回归进行贝叶斯推断
  • methods: 提出两种从近似后验中采样的交互粒子系统,并证明当粒子数趋于无穷时,这些交互粒子系统以可量化的速率收敛到其平均场极限
  • results: 将上述技术应用于 ReLU 网络预测不确定性的量化,并检验其作为贝叶斯近似方法的有效性
    Abstract We consider the problem of performing Bayesian inference for logistic regression using appropriate extensions of the ensemble Kalman filter. Two interacting particle systems are proposed that sample from an approximate posterior and prove quantitative convergence rates of these interacting particle systems to their mean-field limit as the number of particles tends to infinity. Furthermore, we apply these techniques and examine their effectiveness as methods of Bayesian approximation for quantifying predictive uncertainty in ReLU networks.
    摘要 我们考虑利用集成卡尔曼滤波(ensemble Kalman filter)的适当推广,对 logistic 回归进行贝叶斯推断。我们提出了两种从近似后验中采样的交互粒子系统,并证明当粒子数趋于无穷时,这些交互粒子系统以可量化的速率收敛到其平均场极限。此外,我们将这些技术应用于 ReLU 网络预测不确定性的量化,并检验其作为贝叶斯近似方法的有效性。

Training of Spiking Neural Network joint Curriculum Learning Strategy

  • paper_url: http://arxiv.org/abs/2309.04737
  • repo_url: None
  • paper_authors: Lingling Tang, Jielei Chu, Zhiguo Gong, Tianrui Li
  • for: The paper aims to enhance the biological plausibility of Spiking Neural Networks (SNNs) by introducing Curriculum Learning (CL) into SNNs.
  • methods: The proposed CL-SNN model uses a confidence-aware loss to measure and process samples with different difficulty levels, allowing the model to learn more like humans and with higher biological interpretability.
  • results: The authors conducted experiments on static image datasets (MNIST, Fashion-MNIST, CIFAR10) and neuromorphic datasets (N-MNIST, CIFAR10-DVS, DVS-Gesture), and the results are promising. To the best of the authors' knowledge, this is the first proposal to enhance the biological plausibility of SNNs by introducing CL.
    Abstract Starting with small and simple concepts, and gradually introducing complex and difficult concepts is the natural process of human learning. Spiking Neural Networks (SNNs) aim to mimic the way humans process information, but current SNNs models treat all samples equally, which does not align with the principles of human learning and overlooks the biological plausibility of SNNs. To address this, we propose a CL-SNN model that introduces Curriculum Learning(CL) into SNNs, making SNNs learn more like humans and providing higher biological interpretability. CL is a training strategy that advocates presenting easier data to models before gradually introducing more challenging data, mimicking the human learning process. We use a confidence-aware loss to measure and process the samples with different difficulty levels. By learning the confidence of different samples, the model reduces the contribution of difficult samples to parameter optimization automatically. We conducted experiments on static image datasets MNIST, Fashion-MNIST, CIFAR10, and neuromorphic datasets N-MNIST, CIFAR10-DVS, DVS-Gesture. The results are promising. To our best knowledge, this is the first proposal to enhance the biologically plausibility of SNNs by introducing CL.
    摘要 从小而简单的概念入手,再逐步引入复杂困难的概念,是人类学习的自然过程。脉冲神经网络(SNN)旨在模仿人类处理信息的方式,但现有 SNN 模型对所有样本一视同仁,这既不符合人类学习的规律,也忽视了 SNN 的生物合理性。为此,我们提出 CL-SNN 模型,将课程学习(CL)引入 SNN,使其更像人类那样学习,并具备更高的生物可解释性。CL 是一种倡导先向模型呈现较容易的数据、再逐步引入更具挑战性数据的训练策略,模仿了人类的学习过程。我们采用置信度感知损失来度量并处理不同难度的样本:通过学习各样本的置信度,模型会自动降低困难样本对参数优化的贡献。我们在静态图像数据集 MNIST、Fashion-MNIST、CIFAR10 以及神经形态数据集 N-MNIST、CIFAR10-DVS、DVS-Gesture 上进行了实验,结果令人鼓舞。据我们所知,这是首个通过引入 CL 来增强 SNN 生物合理性的提议。
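
A sketch of a confidence-aware loss in the spirit of CL-SNN: each sample's contribution is scaled by the model's own confidence in the true class, so hard (low-confidence) samples are automatically down-weighted. The exact weighting scheme is an assumption of this sketch, not the paper's published formula.

```python
import torch
import torch.nn.functional as F

def confidence_aware_loss(logits, targets):
    """logits: (N, C); targets: (N,) class indices."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    # Confidence = predicted probability of the true class (no gradient
    # flows through the weight itself).
    conf = logits.softmax(dim=-1).gather(1, targets[:, None]).squeeze(1)
    return (conf.detach() * ce).mean()   # easy samples weigh more
```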

MultiCaM-Vis: Visual Exploration of Multi-Classification Model with High Number of Classes

  • paper_url: http://arxiv.org/abs/2309.05676
  • repo_url: None
  • paper_authors: Syed Ahsan Ali Dilawer, Shah Rukh Humayoun
  • for: 本文旨在帮助机器学习专家通过可视分析,快速定位学习阶段出现的实例误分类等问题的根本原因。
  • methods: 本文提出了一种名为 MultiCaM-Vis 的交互式可视分析工具,它提供 Overview+Detail 样式的平行坐标视图和弦图(Chord diagram),用于探索和检查实例级的类别误分类。
  • results: 本文还报告了一项有 12 名参与者的初步用户研究的结果。
    Abstract Visual exploration of multi-classification models with a large number of classes would help machine learning experts in identifying the root cause of a problem that occurs during the learning phase, such as miss-classification of instances. Most of the previous visual analytics solutions targeted only a few classes. In this paper, we present our interactive visual analytics tool, called MultiCaM-Vis, that provides overview+detail style parallel coordinate views and a Chord diagram for exploration and inspection of class-level miss-classification of instances. We also present results of a preliminary user study with 12 participants.
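
The tool itself is not released, but the data transformation behind its class-level views is easy to reproduce: a row-normalized confusion matrix drawn as parallel coordinates, one polyline per true class and one axis per predicted class. A sketch with toy labels (the five-class setup and the roughly 80% accuracy are invented for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, 1000)
y_pred = np.where(rng.random(1000) < 0.8, y_true, rng.integers(0, 5, 1000))

# Row-normalized confusion matrix: one misclassification profile per class.
cm = confusion_matrix(y_true, y_pred, normalize="true")
df = pd.DataFrame(cm, columns=[f"pred_{c}" for c in range(5)])
df["true_class"] = [f"class_{c}" for c in range(5)]

parallel_coordinates(df, "true_class")  # one polyline per true class
plt.ylabel("fraction of instances")
plt.show()
```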

Weak-PDE-LEARN: A Weak Form Based Approach to Discovering PDEs From Noisy, Limited Data

  • paper_url: http://arxiv.org/abs/2309.04699
  • repo_url: https://github.com/punkduckable/weak_pde_learn
  • paper_authors: Robert Stephany, Christopher Earls
  • for: Inferring non-linear partial differential equations (PDEs) directly from noisy, limited measurements of their solutions.
  • methods: Trains a neural network with an adaptive, weak-form-based loss function to approximate the PDE solution while simultaneously identifying the governing PDE.
  • results: Discovers a range of PDEs quickly and accurately, with strong robustness to noise.
    Abstract We introduce Weak-PDE-LEARN, a Partial Differential Equation (PDE) discovery algorithm that can identify non-linear PDEs from noisy, limited measurements of their solutions. Weak-PDE-LEARN uses an adaptive loss function based on weak forms to train a neural network, $U$, to approximate the PDE solution while simultaneously identifying the governing PDE. This approach yields an algorithm that is robust to noise and can discover a range of PDEs directly from noisy, limited measurements of their solutions. We demonstrate the efficacy of Weak-PDE-LEARN by learning several benchmark PDEs.
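
The appeal of the weak form is that no derivatives of the noisy network output are ever taken: integration by parts moves every derivative onto a smooth test function that vanishes (with its first derivative) at the patch boundary. A minimal PyTorch sketch for a single test-function patch, using the heat equation u_t = nu * u_xx as the assumed candidate form — Weak-PDE-LEARN itself works with a richer library of candidate terms and many patches:

```python
import torch

# One test-function patch on [0,1] x [0,1]; the real method tiles many.
nx, nt = 64, 64
x = torch.linspace(0, 1, nx)
t = torch.linspace(0, 1, nt)
X, T = torch.meshgrid(x, t, indexing="ij")

# psi(s) = s^2 (1-s)^2 vanishes together with its first derivative at
# s = 0, 1, so all boundary terms drop out after integrating by parts.
def f(s):   return s**2 * (1 - s)**2
def df(s):  return 2*s - 6*s**2 + 4*s**3
def d2f(s): return 2 - 12*s + 12*s**2

psi_t = f(X) * df(T)      # d(psi)/dt  for psi(x,t) = f(x) f(t)
psi_xx = d2f(X) * f(T)    # d2(psi)/dx2

def weak_residual(u_net, nu):
    """Weak residual of u_t = nu * u_xx with all derivatives on psi:
       r = integral( -u * psi_t - nu * u * psi_xx ) dx dt  -> 0."""
    pts = torch.stack([X.reshape(-1), T.reshape(-1)], dim=1)
    u = u_net(pts).reshape(nx, nt)
    integrand = -u * psi_t - nu * u * psi_xx
    return torch.trapezoid(torch.trapezoid(integrand, x, dim=0), t)

# U approximates the solution while nu is identified jointly; a
# data-fitting MSE term would be added to this loss in practice.
u_net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
nu = torch.tensor(0.1, requires_grad=True)
(weak_residual(u_net, nu) ** 2).backward()
```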

Redundancy-Free Self-Supervised Relational Learning for Graph Clustering

  • paper_url: http://arxiv.org/abs/2309.04694
  • repo_url: https://github.com/yisiyu95/r2fgc
  • paper_authors: Si-Yu Yi, Wei Ju, Yifang Qin, Xiao Luo, Luchen Liu, Yong-Dao Zhou, Ming Zhang
  • for: This paper proposes a self-supervised graph clustering method, built on an autoencoder and a graph autoencoder, to better extract and exploit the semantic information in graph-structured data.
  • methods: The method, Relational Redundancy-Free Graph Clustering (R$^2$FGC), extracts attribute- and structure-level relational information from global and local views; it preserves the consistent relation among augmented nodes while reducing the redundant relation to learn discriminative embeddings, and adopts a simple yet effective strategy to alleviate over-smoothing.
  • results: On widely used benchmark datasets, R$^2$FGC outperforms state-of-the-art baselines and makes better use of the semantic information in graph-structured data.
    Abstract Graph clustering, which learns the node representations for effective cluster assignments, is a fundamental yet challenging task in data analysis and has received considerable attention accompanied by graph neural networks in recent years. However, most existing methods overlook the inherent relational information among the non-independent and non-identically distributed nodes in a graph. Due to the lack of exploration of relational attributes, the semantic information of the graph-structured data fails to be fully exploited which leads to poor clustering performance. In this paper, we propose a novel self-supervised deep graph clustering method named Relational Redundancy-Free Graph Clustering (R$^2$FGC) to tackle the problem. It extracts the attribute- and structure-level relational information from both global and local views based on an autoencoder and a graph autoencoder. To obtain effective representations of the semantic information, we preserve the consistent relation among augmented nodes, whereas the redundant relation is further reduced for learning discriminative embeddings. In addition, a simple yet valid strategy is utilized to alleviate the over-smoothing issue. Extensive experiments are performed on widely used benchmark datasets to validate the superiority of our R$^2$FGC over state-of-the-art baselines. Our codes are available at https://github.com/yisiyu95/R2FGC.
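
The paper's relational objectives are defined over fused autoencoder/graph-autoencoder representations, but the core "keep the consistent relation, drop the redundant one" idea can be illustrated with a cross-correlation loss between two augmented views of the node embeddings. A generic sketch in that spirit (the normalization and the `lam` weighting are assumptions, not the paper's exact loss):

```python
import torch

def relation_loss(z1, z2, lam=5e-3):
    """z1, z2: (N, d) node embeddings from two graph augmentations.

    The diagonal of the normalized cross-correlation is pulled to 1
    (consistent relation between augmented nodes); off-diagonals are
    pushed to 0 (redundant relation between embedding dimensions).
    """
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = z1.T @ z2 / z1.size(0)                  # (d, d) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

z1, z2 = torch.randn(256, 64), torch.randn(256, 64)
print(relation_loss(z1, z2))
```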

Compact: Approximating Complex Activation Functions for Secure Computation

  • paper_url: http://arxiv.org/abs/2309.04664
  • repo_url: None
  • paper_authors: Mazharul Islam, Sunpreet S. Arora, Rahul Chatterjee, Peter Rindal, Maliheh Shirvanian
  • for: Privacy-preserving inference for deep neural network (DNN) models hosted on a public cloud.
  • methods: Builds on state-of-the-art multi-party computation (MPC) techniques, with Compact generating piece-wise polynomial approximations of complex activation functions (AFs) to make their MPC evaluation efficient.
  • results: Compact neither requires nor imposes any restriction on model training; across four machine-learning tasks it incurs negligible accuracy loss compared with DNN-specific approaches for complex non-linear AFs while delivering a 2x-5x computation speedup.
    Abstract Secure multi-party computation (MPC) techniques can be used to provide data privacy when users query deep neural network (DNN) models hosted on a public cloud. State-of-the-art MPC techniques can be directly leveraged for DNN models that use simple activation functions (AFs) such as ReLU. However, DNN model architectures designed for cutting-edge applications often use complex and highly non-linear AFs. Designing efficient MPC techniques for such complex AFs is an open problem. Towards this, we propose Compact, which produces piece-wise polynomial approximations of complex AFs to enable their efficient use with state-of-the-art MPC techniques. Compact neither requires nor imposes any restriction on model training and results in near-identical model accuracy. We extensively evaluate Compact on four different machine-learning tasks with DNN architectures that use popular complex AFs SiLU, GeLU, and Mish. Our experimental results show that Compact incurs negligible accuracy loss compared to DNN-specific approaches for handling complex non-linear AFs. We also incorporate Compact in two state-of-the-art MPC libraries for privacy-preserving inference and demonstrate that Compact provides 2x-5x speedup in computation compared to the state-of-the-art approximation approach for non-linear functions -- while providing similar or better accuracy for DNN models with a large number of hidden layers.
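
Compact's actual spline construction is more sophisticated, but the core operation — replacing a transcendental AF with per-interval least-squares polynomials that an MPC protocol can evaluate with additions and multiplications — can be sketched in a few lines of NumPy. The knot placement, degree, and linear tails below are illustrative choices, not the paper's:

```python
import numpy as np

def gelu(x):  # the tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fit_piecewise(fn, knots, degree=3, samples=200):
    """Least-squares polynomial of the given degree on each knot interval."""
    pieces = []
    for a, b in zip(knots[:-1], knots[1:]):
        xs = np.linspace(a, b, samples)
        pieces.append(((a, b), np.polyfit(xs, fn(xs), degree)))
    return pieces

def eval_piecewise(pieces, x):
    for (a, b), coeffs in pieces:
        if a <= x <= b:
            return np.polyval(coeffs, x)
    return x if x > 0 else 0.0  # GeLU tails: identity above, zero below

pieces = fit_piecewise(gelu, knots=[-6, -2, 0, 2, 6])
print(eval_piecewise(pieces, 1.3), gelu(1.3))  # approximation vs. exact
```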

Intelligent upper-limb exoskeleton using deep learning to predict human intention for sensory-feedback augmentation

  • paper_url: http://arxiv.org/abs/2309.04655
  • repo_url: None
  • paper_authors: Jinwoo Lee, Kangkyu Kwon, Ira Soltis, Jared Matthews, Yoonjae Lee, Hojoong Kim, Lissette Romero, Nathan Zavanelli, Youngjin Kwon, Shinjae Kwon, Jimin Lee, Yewon Na, Sung Hoon Lee, Ki Jun Yu, Minoru Shinohara, Frank L. Hammond, Woon-Hong Yeo
  • for: This work develops an intelligent upper-limb exoskeleton system, based on cloud computing and sensory feedback, to augment human arm strength.
  • methods: A cloud-based deep-learning algorithm predicts the user's intended motion, while embedded soft wearable sensors collect real-time muscle signals to provide sensory feedback.
  • results: The system predicts four upper-limb joint motions with 96.2% average accuracy at a 200-250 millisecond response rate and augments human strength by 5.15 times on average.
    Abstract The age and stroke-associated decline in musculoskeletal strength degrades the ability to perform daily human tasks using the upper extremities. Although there are a few examples of exoskeletons, they require manual operation due to the absence of sensor feedback and of intention prediction of movements. Here, we introduce an intelligent upper-limb exoskeleton system that uses cloud-based deep learning to predict human intention for strength augmentation. The embedded soft wearable sensors provide sensory feedback by collecting real-time muscle signals, which are simultaneously computed to determine the user's intended movement. The cloud-based deep learning predicts four upper-limb joint motions with an average accuracy of 96.2% at a 200-250 millisecond response rate, suggesting that the exoskeleton operates just by human intention. In addition, an array of soft pneumatics assists the intended movements by providing 897 newtons of force and 78.7 millimeters of displacement at maximum. Collectively, the intent-driven exoskeleton can augment human strength by 5.15 times on average compared to the unassisted exoskeleton. This report demonstrates an exoskeleton robot that augments the upper-limb joint movements by human intention based on machine-learning cloud computing and sensory feedback.
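
The paper's network is not published, but predicting an intended joint motion from a short multi-channel sEMG window is commonly cast as a small 1D-CNN classifier over the four target motions. A hedged sketch in that style — the channel count, window length, and architecture are all assumptions:

```python
import torch
import torch.nn as nn

class IntentNet(nn.Module):
    """Classify the intended upper-limb joint motion from an sEMG window."""
    def __init__(self, channels=4, classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, classes),
        )

    def forward(self, x):  # x: (batch, channels, samples)
        return self.net(x)

emg = torch.randn(8, 4, 200)   # 8 windows, 4 channels, ~200 ms of samples
print(IntentNet()(emg).shape)  # torch.Size([8, 4]) logits over motions
```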

Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay

  • paper_url: http://arxiv.org/abs/2309.04644
  • repo_url: None
  • paper_authors: Leyan Pan, Xinyuan Cao
  • for: This paper studies whether Neural Collapse emerges in the last layer of neural network classifiers trained with batch normalization and weight decay.
  • methods: It proposes a geometrically intuitive intra-class and inter-class cosine similarity measure that captures multiple core aspects of Neural Collapse, and provides theoretical guarantees of Neural Collapse emergence when the regularized cross-entropy loss is near-optimal.
  • results: Experiments show that Neural Collapse is most pronounced in models with batch normalization and high weight-decay values.
    Abstract Neural Collapse is a recently observed geometric structure that emerges in the final layer of neural network classifiers. Specifically, Neural Collapse states that at the terminal phase of neural network training, 1) the intra-class variability of last-layer features tends to zero, 2) the class feature means form an Equiangular Tight Frame (ETF), 3) last-layer class features and weights become equal up to scaling, and 4) classification behavior collapses to the nearest class center (NCC) decision rule. This paper investigates the effect of batch normalization and weight decay on the emergence of Neural Collapse. We propose the geometrically intuitive intra-class and inter-class cosine similarity measure which captures multiple core aspects of Neural Collapse. With this measure, we provide theoretical guarantees of Neural Collapse emergence with last-layer batch normalization and weight decay when the regularized cross-entropy loss is near optimal. We also perform further experiments to show that Neural Collapse is most significant in models with batch normalization and high weight-decay values. Collectively, our results imply that batch normalization and weight decay may be fundamental factors in the emergence of Neural Collapse.
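
The proposed measure is straightforward to approximate on a batch of last-layer features: the average cosine similarity within classes versus across classes. Under Neural Collapse the intra-class value approaches 1 and, for C classes whose means form an ETF, the inter-class value approaches -1/(C-1). A sketch (the paper's exact definition may differ in detail):

```python
import torch
import torch.nn.functional as F

def class_cosine_stats(features, labels):
    """Mean intra-class and inter-class cosine similarity of features."""
    f = F.normalize(features, dim=1)
    sims = f @ f.T                                   # (N, N) cosine matrix
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool)
    intra = sims[same & ~eye].mean()                 # same class, not self
    inter = sims[~same].mean()                       # different classes
    return intra.item(), inter.item()

feats = torch.randn(100, 64)
labels = torch.randint(0, 10, (100,))
print(class_cosine_stats(feats, labels))  # collapse: intra->1, inter->-1/9
```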

eess.IV - 2023-09-09

Latent Degradation Representation Constraint for Single Image Deraining

  • paper_url: http://arxiv.org/abs/2309.04780
  • repo_url: None
  • paper_authors: Yuhong He, Long Peng, Lu Wang, Jun Cheng
  • for: Improving the accuracy of single-image deraining and resolving the over- and under-enhancement problems of existing methods.
  • methods: Proposes the Latent Degradation Representation Constraint Network (LDRCNet), consisting of a Direction-Aware Encoder, a UNet deraining network, and a Multi-Scale Interaction Block; deformable convolutions exploit the directional consistency of rain streaks to adapt to diverse rain patterns, and a constraint loss explicitly supervises degradation representation learning during training.
  • results: Experiments on synthetic and real datasets show new state-of-the-art performance.
    Abstract Since rain streaks show a variety of shapes and directions, learning the degradation representation is extremely challenging for single image deraining. Existing methods are mainly targeted at designing complicated modules to implicitly learn latent degradation representation from coupled rainy images. This way, it is hard to decouple the content-independent degradation representation due to the lack of explicit constraint, resulting in over- or under-enhancement problems. To tackle this issue, we propose a novel Latent Degradation Representation Constraint Network (LDRCNet) that consists of Direction-Aware Encoder (DAEncoder), UNet Deraining Network, and Multi-Scale Interaction Block (MSIBlock). Specifically, the DAEncoder is proposed to adaptively extract latent degradation representation by using the deformable convolutions to exploit the direction consistency of rain streaks. Next, a constraint loss is introduced to explicitly constraint the degradation representation learning during training. Last, we propose an MSIBlock to fuse with the learned degradation representation and decoder features of the deraining network for adaptive information interaction, which enables the deraining network to remove various complicated rainy patterns and reconstruct image details. Experimental results on synthetic and real datasets demonstrate that our method achieves new state-of-the-art performance.
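
The direction-aware idea can be illustrated with torchvision's deformable convolution: a small convolution predicts per-location sampling offsets so the kernel can bend along the orientation of rain streaks. A sketch of such a block — the channel sizes and the offset predictor are illustrative, not the paper's exact encoder:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DirectionAwareBlock(nn.Module):
    def __init__(self, ch=32, k=3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel tap, predicted from the input.
        self.offset = nn.Conv2d(ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(ch, ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))

x = torch.randn(1, 32, 64, 64)
print(DirectionAwareBlock()(x).shape)  # torch.Size([1, 32, 64, 64])
```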

SSHNN: Semi-Supervised Hybrid NAS Network for Echocardiographic Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.04672
  • repo_url: None
  • paper_authors: Renqi Chen, Jingjing Luo, Fan Nian, Yuhui Cen, Yiheng Peng, Zekuan Yu
  • for: This work aims to improve the accuracy of medical image segmentation, particularly for echocardiographic images with unavoidable noise.
  • methods: It uses Neural Architecture Search (NAS) for network design, introducing layer-wise feature aggregation and Transformers to improve segmentation accuracy and efficiency.
  • results: Experiments show that SSHNN outperforms existing methods, achieving higher segmentation accuracy and efficiency.
    Abstract Accurate medical image segmentation, especially for echocardiographic images with unmissable noise, requires elaborate network design. Compared with manual design, Neural Architecture Search (NAS) realizes better segmentation results due to larger search space and automatic optimization, but most of the existing methods are weak in layer-wise feature aggregation and adopt a "strong encoder, weak decoder" structure, insufficient to handle global relationships and local details. To resolve these issues, we propose a novel semi-supervised hybrid NAS network for accurate medical image segmentation termed SSHNN. In SSHNN, we creatively use convolution operation in layer-wise feature fusion instead of normalized scalars to avoid losing details, making NAS a stronger encoder. Moreover, Transformers are introduced for the compensation of global context and a U-shaped decoder is designed to efficiently connect global context with local features. Specifically, we implement a semi-supervised algorithm Mean-Teacher to overcome the limited volume problem of labeled medical image dataset. Extensive experiments on CAMUS echocardiography dataset demonstrate that SSHNN outperforms state-of-the-art approaches and realizes accurate segmentation. Code will be made publicly available.
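
The semi-supervised component is the standard Mean-Teacher recipe: the teacher's weights are an exponential moving average (EMA) of the student's, and its predictions on unlabeled images act as consistency targets. A minimal sketch of the update, with a stand-in module in place of the segmentation network:

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Teacher <- alpha * teacher + (1 - alpha) * student, per parameter."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1 - alpha)

student = torch.nn.Linear(16, 4)       # stand-in for the segmentation net
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)            # teacher is never trained directly

# ...after each optimizer step on the student:
ema_update(teacher, student)
```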

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.05674
  • repo_url: None
  • paper_authors: Xian Lin, Zengqiang Yan, Xianbo Deng, Chuansheng Zheng, Li Yu
  • for: Improving Transformer performance in medical image segmentation, in particular by addressing the attention collapse problem.
  • methods: Builds a CNN-style Transformer (ConvFormer) with pooling, CNN-style self-attention (CSA), and a convolutional feed-forward network (CFFN) to improve attention convergence and feature refinement.
  • results: Experiments on multiple datasets show that ConvFormer works as a plug-and-play module that consistently improves Transformer-based frameworks.
    Abstract Transformers have been extensively studied in medical image segmentation to build pairwise long-range dependence. Yet, relatively limited well-annotated medical image data makes transformers struggle to extract diverse global features, resulting in attention collapse where attention maps become similar or even identical. Comparatively, convolutional neural networks (CNNs) have better convergence properties on small-scale training data but suffer from limited receptive fields. Existing works are dedicated to exploring the combinations of CNN and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose to build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and convolutional feed-forward network (CFFN) corresponding to tokenization, self-attention, and feed-forward network in vanilla vision transformers. In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction. In this way, CSA takes 2D feature maps as inputs and establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes. Following CSA, 2D convolution is utilized for feature refinement through CFFN. Experimental results on multiple datasets demonstrate the effectiveness of ConvFormer working as a plug-and-play module for consistent performance improvement of transformer-based frameworks. Code is available at https://github.com/xianlin7/ConvFormer.

Video and Synthetic MRI Pre-training of 3D Vision Architectures for Neuroimage Analysis

  • paper_url: http://arxiv.org/abs/2309.04651
  • repo_url: None
  • paper_authors: Nikhil J. Dhinagar, Amit Singh, Saket Ozarkar, Ketaki Buwa, Sophia I. Thomopoulos, Conor Owens-Walton, Emily Laltoo, Yao-Liang Chen, Philip Cook, Corey McMillan, Chih-Chien Tsai, J-J Wang, Yih-Ru Wu, Paul M. Thompson
  • for: 3D medical imaging tasks, particularly Alzheimer's disease (AD) and Parkinson's disease (PD) classification, and "brain age" prediction.
  • methods: Pre-training deep learning models on a large corpus of data, including natural images, medical images, and synthetically generated MRI scans or video data, then adapting the pre-trained models to downstream neuroimaging tasks with a range of difficulty.
  • results: Pre-training improved performance across all tasks, with boosts of 7.4% for AD classification and 4.6% for PD classification for ViTs, and a 19.1% boost for PD classification plus a 1.26-year reduction in brain-age prediction error for CNNs; pre-training on large-scale video or synthetic MRI data boosted ViT performance; CNNs were robust in limited-data settings, with in-domain pretraining enhancing their performance; and pre-training improved generalization to out-of-distribution datasets and sites.
    Abstract Transfer learning represents a recent paradigm shift in the way we build artificial intelligence (AI) systems. In contrast to training task-specific models, transfer learning involves pre-training deep learning models on a large corpus of data and minimally fine-tuning them for adaptation to specific tasks. Even so, for 3D medical imaging tasks, we do not know if it is best to pre-train models on natural images, medical images, or even synthetically generated MRI scans or video data. To evaluate these alternatives, here we benchmarked vision transformers (ViTs) and convolutional neural networks (CNNs), initialized with varied upstream pre-training approaches. These methods were then adapted to three unique downstream neuroimaging tasks with a range of difficulty: Alzheimer's disease (AD) and Parkinson's disease (PD) classification, "brain age" prediction. Experimental tests led to the following key observations: 1. Pre-training improved performance across all tasks including a boost of 7.4% for AD classification and 4.6% for PD classification for the ViT and 19.1% for PD classification and reduction in brain age prediction error by 1.26 years for CNNs, 2. Pre-training on large-scale video or synthetic MRI data boosted performance of ViTs, 3. CNNs were robust in limited-data settings, and in-domain pretraining enhanced their performances, 4. Pre-training improved generalization to out-of-distribution datasets and sites. Overall, we benchmarked different vision architectures, revealing the value of pre-training them with emerging datasets for model initialization. The resulting pre-trained models can be adapted to a range of downstream neuroimaging tasks, even when training data for the target task is limited.

eess.SP - 2023-09-09

After-Fatigue Condition: A Novel Analysis Based on Surface EMG Signals

  • paper_url: http://arxiv.org/abs/2309.04770
  • repo_url: None
  • paper_authors: Van Hieu Nguyen, Gia Thien Luu, Thien Van Luong, Mai Xuan Trang, Philippe Ravier, Olivier Buttelli
  • for: This study evaluates the after-fatigue condition of muscle based on surface electromyography (sEMG) signals, a condition overlooked in previous studies.
  • methods: It analyzes muscle fatigue indicators at various maximal voluntary contraction (MVC) levels using amplitude-based, spectral-based, and muscle fiber conduction velocity (CV) parameters, together with the contraction time at each MVC level.
  • results: In the after-fatigue condition, muscle activation changes significantly: higher CV, power spectral density shifted to the right, and longer contraction time until exhaustion compared with the before-fatigue and fatigue conditions.
    Abstract This study introduces a novel muscle activation analysis based on surface electromyography (sEMG) signals to assess the muscle's after-fatigue condition. Previous studies have mainly focused on the before-fatigue and fatigue conditions. However, a comprehensive analysis of the after-fatigue condition has been overlooked. The proposed method analyzes muscle fatigue indicators at various maximal voluntary contraction (MVC) levels to compare the before-fatigue, fatigue, and after-fatigue conditions using amplitude-based, spectral-based, and muscle fiber conduction velocity (CV) parameters. In addition, the contraction time of each MVC level is also analyzed with the same indicators. The results show that in the after-fatigue condition, the muscle activation changes significantly in the ways such as higher CV, power spectral density shifting to the right, and longer contraction time until exhaustion compared to the before-fatigue and fatigue conditions. The results can provide a comprehensive and objective evaluation of muscle fatigue and recovery, which can be helpful in clinical diagnosis, rehabilitation, and sports performance.
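
The spectral indicators referred to here are typically the mean and median frequency of the sEMG power spectrum, so a rightward PSD shift shows up as an increase in both. A short SciPy sketch (the sampling rate and the synthetic segment are placeholders):

```python
import numpy as np
from scipy.signal import welch

def spectral_indicators(emg, fs=1024):
    """Mean and median frequency of an sEMG segment via Welch's PSD."""
    f, pxx = welch(emg, fs=fs, nperseg=min(len(emg), 1024))
    mean_freq = np.sum(f * pxx) / np.sum(pxx)
    cdf = np.cumsum(pxx) / np.sum(pxx)
    median_freq = f[np.searchsorted(cdf, 0.5)]
    return mean_freq, median_freq

rng = np.random.default_rng(0)
emg = rng.standard_normal(4096)        # stand-in for a recorded segment
print(spectral_indicators(emg))
```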

Transformer-Based Deep Learning Detector for Dual-Mode Index Modulation 3D-OFDM

  • paper_url: http://arxiv.org/abs/2309.04764
  • repo_url: None
  • paper_authors: Toan Gian, Tien-Hoa Nguyen, Trung Tan Nguyen, Van-Cuong Pham, Thien Van Luong
  • for: This paper proposes a deep-learning signal detector, TransD3D-IM, for signal detection in dual-mode index modulation-aided three-dimensional OFDM (DM-IM-3D-OFDM) systems.
  • methods: The detector applies the Transformer framework to signal detection, where data bits are conveyed by dual-mode 3D constellation symbols and active subcarrier indices.
  • results: Experiments show that TransD3D-IM approaches the performance of the model-based receiver in DM-IM-3D-OFDM systems, offers higher transmission reliability than existing IM-based detectors, and is more robust than the existing deep-learning detector while considerably reducing runtime complexity relative to the benchmarks.
    Abstract In this paper, we propose a deep learning-based signal detector called TransD3D-IM, which employs the Transformer framework for signal detection in the Dual-mode index modulation-aided three-dimensional (3D) orthogonal frequency division multiplexing (DM-IM-3D-OFDM) system. In this system, the data bits are conveyed using dual-mode 3D constellation symbols and active subcarrier indices. As a result, this method exhibits significantly higher transmission reliability than current IM-based models with traditional maximum likelihood (ML) detection. Nevertheless, the ML detector suffers from high computational complexity, particularly when the parameters of the system are large. Even the complexity of the Log-Likelihood Ratio algorithm, known as a low-complexity detector for signal detection in the DM-IM-3D-OFDM system, is also not impressive enough. To overcome this limitation, our proposal applies a deep neural network at the receiver, utilizing the Transformer framework for signal detection of the DM-IM-3D-OFDM system over a Rayleigh fading channel. Simulation results demonstrate that our detector attains performance approaching that of the model-based receiver. Furthermore, TransD3D-IM exhibits more robustness than the existing deep learning-based detector while considerably reducing runtime complexity in comparison with the benchmarks.

A Public Information Precoding for MIMO Visible Light Communication System Based on Manifold Optimization

  • paper_url: http://arxiv.org/abs/2309.04709
  • repo_url: None
  • paper_authors: Hamed Alizadeh Ghazijahani, Mahmoud Atashbar, Guan Yong Liang, Zhaojie Yang
  • for: This paper designs an omnidirectional precoder for transmitting the public-information signals in a multi-user MIMO visible light communication (MU-MIMO-VLC) network.
  • methods: It maximizes the achievable rate, which amounts to maximizing the received mean power at possible user locations, under an equal mean transmit power constraint across all LEDs for higher power-amplifier efficiency; the constraint forms a manifold, and the problem is solved with a gradient method projected onto that manifold.
  • results: Simulations show that the proposed omnidirectional precoding achieves superior received mean power and bit error rate compared with the classical unprecoded scheme.
    Abstract Visible light communication (VLC) is an attractive subset of optical communication that provides a high data rate in the access layer of the network. The combination of multiple input-multiple output (MIMO) with a VLC system leads to a higher speed of data transmission, known as the MIMO-VLC system. In multi-user (MU) MIMO-VLC, an LED array transmits signals for users. These signals are categorized as signals of private information for each user and signals of public information for all users. The main idea of this paper is to design an omnidirectional precoding to transmit the signals of public information in the MU-MIMO-VLC network. To this end, we propose to maximize the achievable rate, which leads to maximizing the received mean power at the possible location of the users. Besides maximizing the achievable rate, we impose an equal mean transmission power constraint on all LEDs to achieve higher power efficiency of the power amplifiers used in the LED array. Based on this, we formulate an optimization problem in which the constraint is in the form of a manifold and utilize a gradient method projected on the manifold to solve the problem. Simulation results indicate that the proposed omnidirectional precoding can achieve superior received mean power and bit error rate with respect to the classical form without precoding utilization.
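
The equal-per-LED-power constraint puts each row of the precoder on a sphere, so one natural reading of the abstract is a projected gradient ascent: take a gradient step on the received-mean-power surrogate of the rate, then renormalize the rows back onto the manifold. A NumPy sketch under that reading (dimensions and step size are illustrative):

```python
import numpy as np

def project_rows(W, p=1.0):
    """Retract each row of W onto the sphere of radius sqrt(p):
    every LED then transmits with the same mean power."""
    return np.sqrt(p) * W / np.linalg.norm(W, axis=1, keepdims=True)

def omni_precoder(H_samples, n_tx, n_stream, steps=500, lr=1e-2):
    """Maximize the average received power (1/K) * sum_k ||H_k W||_F^2
    over candidate user locations, subject to equal row norms of W."""
    rng = np.random.default_rng(0)
    W = project_rows(rng.standard_normal((n_tx, n_stream)))
    for _ in range(steps):
        grad = 2 * sum(Hk.T @ Hk @ W for Hk in H_samples) / len(H_samples)
        W = project_rows(W + lr * grad)   # gradient step + retraction
    return W

H = [np.random.default_rng(k).standard_normal((2, 8)) for k in range(50)]
W = omni_precoder(H, n_tx=8, n_stream=2)
print(np.linalg.norm(W, axis=1))          # equal row norms, as required
```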

Novel Smart N95 Filtering Facepiece Respirator with Real-time Adaptive Fit Functionality and Wireless Humidity Monitoring for Enhanced Wearable Comfort

  • paper_url: http://arxiv.org/abs/2309.07152
  • repo_url: None
  • paper_authors: Kangkyu Kwon, Yoon Jae Lee, Yeongju Jung, Ira Soltis, Chanyeong Choi, Yewon Na, Lissette Romero, Myung Chul Kim, Nathan Rodeheaver, Hodam Kim, Michael S. Lloyd, Ziqing Zhuang, William King, Susan Xu, Seung-Hwan Ko, Jinwoo Lee, Woon-Hong Yeo
  • for: This paper aims to address the limitations of current facial respirators by developing a novel smart N-95 filtering facepiece respirator with self-fit adjusting functionality and air quality monitoring.
  • methods: The proposed respirator incorporates a humidity sensor made of laser-induced graphene and a pressure sensor array based on a dielectric elastomeric sponge to monitor the respirator's contact with the user's face and provide closed-loop feedback for self-fit adjustment.
  • results: The self-fit adjusting mode and elastomeric lining improve the fit factor by 3.20 times on average and 5 times at maximum, compared to current commercial respirators.
    Abstract The widespread emergence of the COVID-19 pandemic has transformed our lifestyle, and facial respirators have become an essential part of daily life. Nevertheless, the current respirators possess several limitations such as poor respirator fit because they are incapable of covering diverse human facial sizes and shapes, potentially diminishing the effect of wearing respirators. In addition, the current facial respirators do not inform the user of the air quality within the smart facepiece respirator in case of continuous long-term use. Here, we demonstrate the novel smart N-95 filtering facepiece respirator that incorporates the humidity sensor and pressure sensory feedback-enabled self-fit adjusting functionality for the effective performance of the facial respirator to prevent the transmission of airborne pathogens. The laser-induced graphene (LIG) constitutes the humidity sensor, and the pressure sensor array based on the dielectric elastomeric sponge monitors the respirator contact on the face of the user, providing the sensory information for a closed-loop feedback mechanism. As a result of the self-fit adjusting mode along with elastomeric lining, the fit factor is increased by 3.20 and 5 times at average and maximum respectively. We expect that the experimental proof-of-concept of this work will offer viable solutions to the current commercial respirators to address the limitations.

cs.SD - 2023-09-08

Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer

  • paper_url: http://arxiv.org/abs/2309.04641
  • repo_url: None
  • paper_authors: Ashwin Pillay, Sage Betko, Ari Liloia, Hao Chen, Ankit Shah
  • for: This work builds a neural Foley synthesizer capable of generating mono-audio sound-effect clips across seven predefined categories.
  • methods: The model uses a pre-trained encoder that retains acoustic and musical attributes in intermediate embeddings, class-conditioning to sharpen the differences between Foley classes in their intermediate representations, and a novel Transformer-based architecture that optimizes self-attention computation on very large inputs without sacrificing valuable information.
  • results: The intermediate results surpass the baseline; the paper discusses practical challenges encountered and outlines pathways for further research.
    Abstract Foley sound synthesis refers to the creation of authentic, diegetic sound effects for media, such as film or radio. In this study, we construct a neural Foley synthesizer capable of generating mono-audio clips across seven predefined categories. Our approach introduces multiple enhancements to existing models in the text-to-audio domain, with the goal of enriching the diversity and acoustic characteristics of the generated foleys. Notably, we utilize a pre-trained encoder that retains acoustical and musical attributes in intermediate embeddings, implement class-conditioning to enhance differentiability among foley classes in their intermediate representations, and devise an innovative transformer-based architecture for optimizing self-attention computations on very large inputs without compromising valuable information. Subsequent to implementation, we present intermediate outcomes that surpass the baseline, discuss practical challenges encountered in achieving optimal results, and outline potential pathways for further research.

Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

  • paper_url: http://arxiv.org/abs/2309.04628
  • repo_url: None
  • paper_authors: Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak
  • for: This paper aims to improve the performance of speech-based visually grounded models by making better use of pretrained image and text encoders.
  • methods: A hierarchical segmental speech encoder converts speech into a sequence of word-like unit representations, which are encoded with the pretrained CLIP text encoder; the paper also explores mapping audio to the CLIP vocabulary embedding space via regularization and quantization.
  • results: Experiments show significant improvements over the cascaded variant of SpeechCLIP, and audio-only systems trained on pairs of semantically related audio perform close to the audio-visual system.
    Abstract Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP attempted to generate localized word-level information and utilize both the pretrained image and text encoders. Despite using both, they noticed a substantial drop in retrieval performance. We proposed Segmental SpeechCLIP which used a hierarchical segmental speech encoder to generate sequences of word-like units. We used the pretrained CLIP text encoder on top of these word-like unit representations and showed significant improvements over the cascaded variant of SpeechCLIP. Segmental SpeechCLIP directly learns the word embeddings as input to the CLIP text encoder bypassing the vocabulary embeddings. Here, we explore mapping audio to CLIP vocabulary embeddings via regularization and quantization. As our objective is to distill semantic information into the speech encoders, we explore the usage of large unimodal pretrained language models as the text encoders. Our method enables us to bridge image and text encoders e.g. DINO and RoBERTa trained with uni-modal data. Finally, we extend our framework in audio-only settings where only pairs of semantically related audio are available. Experiments show that audio-only systems perform close to the audio-visual system.
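
Mapping audio to the CLIP vocabulary embedding space via quantization can be sketched as a nearest-neighbor lookup with a straight-through estimator, so gradients still reach the speech encoder. The vocabulary size and width below mirror CLIP's text model but are assumptions here:

```python
import torch

def quantize_to_vocab(units, vocab_emb):
    """Snap each word-like unit vector to its nearest vocabulary embedding;
    the straight-through trick keeps the speech encoder trainable."""
    d = torch.cdist(units, vocab_emb)          # (T, V) pairwise distances
    idx = d.argmin(dim=1)
    q = vocab_emb[idx]
    return units + (q - units).detach(), idx   # gradient bypasses argmin

vocab = torch.randn(49408, 512)  # CLIP BPE vocab size / width (illustrative)
units = torch.randn(12, 512, requires_grad=True)
quantized, ids = quantize_to_vocab(units, vocab)
quantized.sum().backward()       # gradients flow back into `units`
```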

A Long-Tail Friendly Representation Framework for Artist and Music Similarity

  • paper_url: http://arxiv.org/abs/2309.04182
  • repo_url: None
  • paper_authors: Haoran Xiang, Junyu Dai, Xuchen Song, Furao Shen
  • for: This paper proposes a Long-Tail Friendly Representation Framework (LTFRF) for artist and music similarity, suited to the long-tail scenarios that arise in music retrieval and recommendation.
  • methods: It uses neural networks to model music, user, metadata, and relationship data in a unified metric learning framework, and introduces a Multi-Relationship Loss with a meta-consistency relationship as a regular term.
  • results: On similar artist/music recommendation tasks, LTFRF outperforms the baseline by 9.69%/19.42% in Hit Ratio@10, and in long-tail cases by 11.05%/14.14% in Consistent@10.
    Abstract The investigation of the similarity between artists and music is crucial in music retrieval and recommendation, and addressing the challenge of the long-tail phenomenon is increasingly important. This paper proposes a Long-Tail Friendly Representation Framework (LTFRF) that utilizes neural networks to model the similarity relationship. Our approach integrates music, user, metadata, and relationship data into a unified metric learning framework, and employs a meta-consistency relationship as a regular term to introduce the Multi-Relationship Loss. Compared to the Graph Neural Network (GNN), our proposed framework improves the representation performance in long-tail scenarios, which are characterized by sparse relationships between artists and music. We conduct experiments and analysis on the AllMusic dataset, and the results demonstrate that our framework provides a favorable generalization of artist and music representation. Specifically, on similar artist/music recommendation tasks, the LTFRF outperforms the baseline by 9.69%/19.42% in Hit Ratio@10, and in long-tail cases, the framework achieves 11.05%/14.14% higher than the baseline in Consistent@10.

A Two-Stage Training Framework for Joint Speech Compression and Enhancement

  • paper_url: http://arxiv.org/abs/2309.04132
  • repo_url: https://github.com/jscscloris/sestream
  • paper_authors: Jiayi Huang, Zeyu Yan, Wenbin Jiang, Fei Wen
  • for: This work considers the joint compression and enhancement problem for speech signals in the presence of noise.
  • methods: It establishes a theoretically grounded two-stage optimization procedure: first optimize an encoder-decoder pair using only a distortion loss, then fix the encoder and optimize a perceptual decoder using a perception loss.
  • results: Under various noise and bit-rate conditions, a codec trained with the two-stage framework outperforms SoundStream and other representative codecs on both objective and subjective evaluation metrics.
    Abstract This paper considers the joint compression and enhancement problem for speech signal in the presence of noise. Recently, the SoundStream codec, which relies on end-to-end joint training of an encoder-decoder pair and a residual vector quantizer by a combination of adversarial and reconstruction losses, has shown very promising performance, especially in subjective perception quality. In this work, we provide a theoretical result to show that, to simultaneously achieve low distortion and high perception in the presence of noise, there exists an optimal two-stage optimization procedure for the joint compression and enhancement problem. This procedure firstly optimizes an encoder-decoder pair using only distortion loss and then fixes the encoder to optimize a perceptual decoder using perception loss. Based on this result, we construct a two-stage training framework for joint compression and enhancement of noisy speech signal. Unlike existing training methods which are heuristic, the proposed two-stage training method has a theoretical foundation. Finally, experimental results for various noise and bit-rate conditions are provided. The results demonstrate that a codec trained by the proposed framework can outperform SoundStream and other representative codecs in terms of both objective and subjective evaluation metrics. Code is available at https://github.com/jscscloris/SEStream.
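
The two-stage procedure is simple to express in code: stage one optimizes the encoder-decoder pair on a distortion loss alone; stage two freezes the encoder and trains a perceptual decoder on a perception loss. A hedged PyTorch skeleton — the optimizers, learning rates, and omitted quantizer are placeholders, not the paper's settings:

```python
import itertools
import torch

def two_stage_train(encoder, decoder, perc_decoder, loader,
                    distortion_loss, perception_loss, epochs=(10, 10)):
    # Stage 1: encoder + decoder, distortion loss only.
    opt1 = torch.optim.Adam(itertools.chain(encoder.parameters(),
                                            decoder.parameters()), lr=1e-4)
    for _ in range(epochs[0]):
        for noisy, clean in loader:
            loss = distortion_loss(decoder(encoder(noisy)), clean)
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: encoder frozen, perceptual decoder trained on perception loss.
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(perc_decoder.parameters(), lr=1e-4)
    for _ in range(epochs[1]):
        for noisy, clean in loader:
            loss = perception_loss(perc_decoder(encoder(noisy)), clean)
            opt2.zero_grad(); loss.backward(); opt2.step()
```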