eess.SP - 2023-09-20

Channel Reciprocity Attacks Using Intelligent Surfaces with Non-Diagonal Phase Shifts

  • paper_url: http://arxiv.org/abs/2309.11665
  • repo_url: None
  • paper_authors: Haoyu Wang, Zhu Han, Lee Swindlehurst
  • for: This work investigates an attack based on reconfigurable intelligent surface (RIS) technology that can disrupt communication links in multi-antenna wireless systems operating in time-division duplex mode under the assumption of channel reciprocity.
  • methods: An RIS with a non-diagonal (ND) phase shift matrix (ND-RIS) is used to break channel reciprocity and thereby degrade downlink performance; a genetic-algorithm-based approach optimizes the ND structure under partial knowledge of the channel state information.
  • results: The results show that a maliciously deployed ND-RIS can dramatically degrade downlink performance, and that the attack is entirely passive and difficult to detect.
    Abstract While reconfigurable intelligent surface (RIS) technology has been shown to provide numerous benefits to wireless systems, in the hands of an adversary such technology can also be used to disrupt communication links. This paper describes and analyzes an RIS-based attack on multi-antenna wireless systems that operate in time-division duplex mode under the assumption of channel reciprocity. In particular, we show how an RIS with a non-diagonal (ND) phase shift matrix (referred to here as an ND-RIS) can be deployed to maliciously break the channel reciprocity and hence degrade the downlink network performance. Such an attack is entirely passive and difficult to detect. We provide a theoretical analysis of the degradation in the sum ergodic rate that results when an arbitrary malicious ND-RIS is deployed and design an approach based on the genetic algorithm for optimizing the ND structure under partial knowledge of the available channel state information. Our simulation results validate the analysis and demonstrate that an ND-RIS channel reciprocity attack can dramatically reduce the downlink throughput.
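
The genetic-algorithm optimization mentioned in the abstract can be illustrated with a toy search over non-diagonal phase-shift matrices. The sketch below is not the paper's algorithm: the single-user MISO model, the MRT precoder built from the uplink estimate, and all GA hyper-parameters are illustrative assumptions; it only shows how a mutation-based search can drive down the downlink beamforming gain once reciprocity is broken.

```python
# Toy genetic search over non-diagonal (ND) RIS responses (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 16                       # BS antennas, RIS elements (assumed)
G = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
g = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

def random_nd():
    """Unit-modulus ND-RIS: each incident element is rerouted to one other element."""
    Phi = np.zeros((N, N), dtype=complex)
    Phi[rng.permutation(N), np.arange(N)] = np.exp(1j * rng.uniform(0, 2 * np.pi, N))
    return Phi

def downlink_gain(Phi):
    h_up = G.T @ Phi @ g            # what the BS estimates from uplink pilots
    h_dl = G.T @ Phi.T @ g          # what the downlink actually experiences
    w = h_up.conj() / np.linalg.norm(h_up)   # MRT precoder built on the uplink estimate
    return abs(h_dl @ w)            # the attacker wants this small

pop = [random_nd() for _ in range(40)]
for _ in range(200):
    pop.sort(key=downlink_gain)                # ascending: strongest attack first
    parents = pop[:20]
    children = []
    for p in parents:                          # mutation-only "GA" for brevity
        c = p.copy()
        i = rng.integers(N)
        c[:, i] *= np.exp(1j * rng.normal(0.0, 0.3))   # perturb one element's phase
        children.append(c)
    pop = parents + children
print("residual downlink beamforming gain:", downlink_gain(pop[0]))
```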

Compression Spectrum: Where Shannon meets Fourier

  • paper_url: http://arxiv.org/abs/2309.11640
  • repo_url: None
  • paper_authors: Aditi Kathpalia, Nithin Nagaraj
  • for: This paper aims to bring together signal processing and information theory by using lossless compression to estimate the information content and compressibility of a time series at different scales.
  • methods: A lossless data compression algorithm, Effort-to-Compress (ETC), is used to estimate the information content or compressibility of the time series and to obtain what the authors call a Compression Spectrum.
  • results: Applying the Compression Spectrum to heart interbeat interval (RR) series, the RR tachograms of healthy young subjects show behaviour similar to 1/f noise on a log-log scale, whereas those of healthy elderly subjects behave differently.
    Abstract Signal processing and Information theory are two disparate fields used for characterizing signals for various scientific and engineering applications. Spectral/Fourier analysis, a technique employed in signal processing, helps estimation of power at different frequency components present in the signal. Characterizing a time-series based on its average amount of information (Shannon entropy) is useful for estimating its complexity and compressibility (eg., for communication applications). Information theory doesn't deal with spectral content while signal processing doesn't directly consider the information content or compressibility of the signal. In this work, we attempt to bring the fields of signal processing and information theory together by using a lossless data compression algorithm to estimate the amount of information or `compressibility' of time series at different scales. To this end, we employ the Effort-to-Compress (ETC) algorithm to obtain what we call as a Compression Spectrum. This new tool for signal analysis is demonstrated on synthetically generated periodic signals, a sinusoid, chaotic signals (weak and strong chaos) and uniform random noise. The Compression Spectrum is applied on heart interbeat intervals (RR) obtained from real-world normal young and elderly subjects. The compression spectrum of healthy young RR tachograms in the log-log scale shows behaviour similar to $1/f$ noise whereas the healthy old RR tachograms show a different behaviour. We envisage exciting possibilities and future applications of the Compression Spectrum.
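
As a rough illustration of the Effort-to-Compress idea, the sketch below computes ETC via non-sequential recursive pair substitution on coarse-grained, median-binarized versions of a series to form a crude compression spectrum. The multi-scale coarse-graining and the two-symbol quantization are assumptions made for illustration and may differ from the paper's exact procedure.

```python
# Effort-to-Compress (ETC) at several coarse-graining scales (illustrative sketch).
from collections import Counter
import numpy as np

def etc(symbols):
    """Number of pair-substitution passes until the sequence becomes trivial."""
    seq = list(symbols)
    steps = 0
    while len(seq) > 1 and len(set(seq)) > 1:
        pairs = Counter(zip(seq, seq[1:]))
        target = pairs.most_common(1)[0][0]      # most frequent adjacent pair
        new_sym = max(seq) + 1                   # fresh symbol for the substitution
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == target:
                out.append(new_sym)
                i += 2                           # non-overlapping replacement
            else:
                out.append(seq[i])
                i += 1
        seq = out
        steps += 1
    return steps

def compression_spectrum(x, scales):
    spec = []
    for s in scales:
        coarse = x[: len(x) // s * s].reshape(-1, s).mean(axis=1)
        sym = (coarse > np.median(coarse)).astype(int)   # 2-symbol quantization (assumed)
        spec.append(etc(sym) / max(len(sym) - 1, 1))     # normalized effort per scale
    return spec

x = np.random.default_rng(1).standard_normal(4096)
print(compression_spectrum(x, scales=[1, 2, 4, 8, 16]))
```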

Brief Architectural Survey of Biopotential Recording Front-Ends since the 1970s

  • paper_url: http://arxiv.org/abs/2309.11612
  • repo_url: None
  • paper_authors: Taeju Lee, Minkyu Je
  • for: The paper surveys the architectural history of biopotential recording front-ends developed since the 1970s and discusses the overall key circuit techniques for reliable and continuous signal acquisition.
  • methods: The paper discusses various front-end architectures for biopotential recording, including their characteristics and challenges, depending on the bioelectric signals being measured.
  • results: The paper provides an overview of the evolution of biopotential recording front-ends over the last five decades and discusses the key circuit techniques for low-power and low-noise performance.
    Abstract Measuring the bioelectric signals is one of the key functions in wearable healthcare devices and implantable medical devices. The use of wearable healthcare devices has made continuous and immediate monitoring of personal health status possible. Implantable medical devices have played an important role throughout the fields of neuroscience, brain-machine (or brain-computer) interface, and rehabilitation technology. Over the last five decades, the bioelectric signals have been observed through a variety of biopotential recording front-ends, along with advances in semiconductor technology scaling and circuit techniques. Also, for reliable and continuous signal acquisition, the front-end architectures have evolved while maintaining low power and low noise performance. In this article, the architecture history of the biopotential recording front-ends developed since the 1970s is surveyed, and overall key circuit techniques are discussed. Depending on the bioelectric signals being measured, appropriate front-end architecture needs to be chosen, and the characteristics and challenges of each architecture are also covered in this article.

Self-Sustaining Oscillator with Frequency Counter for Resonance Frequency Tracking in Micro- and Nanomechanical Sensing

  • paper_url: http://arxiv.org/abs/2309.11581
  • repo_url: None
  • paper_authors: Hajrudin Bešić, Alper Demir, Veljko Vukićević, Johannes Steurer, Silvan Schmid
  • for: This work proposes a frequency-counter-based scheme for tracking resonance frequency shifts in micro- and nanomechanical sensing and studies its speed and precision both theoretically and experimentally.
  • methods: A self-sustaining oscillator (SSO) nanoelectromechanical system (NEMS) configuration is combined with a frequency counter as the frequency-shift monitor; a theoretical model characterizes the counter's speed and precision, novel counter enhancements are introduced, and the design is implemented on a low-cost FPGA.
  • results: Compared with the existing phase-locked-loop (PLL) based scheme, the proposed approach achieves similar or better performance at a substantially lower cost and with improved ease of use; the experimental measurements agree almost perfectly with the theoretical predictions.
    Abstract Nanomechanical sensors based on detecting and tracking resonance frequency shifts are to be used in many applications. Various open- and closed-loop tracking schemes, all offering a trade-off between speed and precision, have been studied both theoretically and experimentally. In this work, we advocate the use of a frequency counter as a frequency shift monitor in conjunction with a self-sustaining oscillator (SSO) nanoelectromechanical system (NEMS) configuration. We derive a theoretical model for characterizing the speed and precision of frequency measurements with state-of-the-art frequency counters. Based on the understanding provided by this model, we introduce novel enhancements to frequency counters that result in a trade-off characteristics which is on a par with the other tracking schemes. We describe a low-cost field-programmable-gate array (FPGA) based implementation for the proposed frequency counter and use it with the SSO-NEMS device in order to study its frequency tracking performance. We compare the proposed approach with the phase-locked-loop based scheme both in theory and experimentally. Our results show that similar or better performance can be achieved at a substantially lower cost and improved ease-of-use. We obtain almost perfect correspondence between the theoretical model predictions and the experimental measurements.
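
A reciprocal frequency counter, the basic building block the paper enhances, can be sketched in a few lines: frequency is the integer number of periods divided by the time between interpolated zero crossings. The sample rate, record length, and interpolation details below are illustrative assumptions, not the FPGA design described in the paper.

```python
# Reciprocal frequency counting on a sampled oscillator signal (illustrative sketch).
import numpy as np

fs = 1.0e6                                  # sample rate in Hz (assumed)
t = np.arange(int(0.01 * fs)) / fs          # 10 ms record
f_true = 12_345.6
x = np.sin(2 * np.pi * f_true * t) + 0.01 * np.random.default_rng(2).standard_normal(t.size)

# indices where the signal crosses zero going upward
idx = np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0))
# linear interpolation refines each crossing instant to sub-sample precision
frac = -x[idx] / (x[idx + 1] - x[idx])
t_cross = (idx + frac) / fs

# reciprocal counter: N full periods divided by the elapsed time between crossings
n_periods = len(t_cross) - 1
f_est = n_periods / (t_cross[-1] - t_cross[0])
print(f"estimated {f_est:.2f} Hz vs true {f_true} Hz")
```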

Decision-Directed Hybrid RIS Channel Estimation with Minimal Pilot Overhead

  • paper_url: http://arxiv.org/abs/2309.11485
  • repo_url: None
  • paper_authors: Ly V. Nguyen, A. Lee Swindlehurst
  • for: Improve system spectral efficiency and reduce the pilot overhead required for RIS channel estimation.
  • methods: A decision-directed channel estimation framework is proposed for an RIS containing hybrid elements that can simultaneously reflect and sense the incoming signal, improving the accuracy of the acquired channel state information.
  • results: The pilot overhead scales with the number of users rather than with the number of RIS elements times the number of users, so the system spectral efficiency is substantially higher than that of systems with passive RIS arrays.
    Abstract To reap the benefits of reconfigurable intelligent surfaces (RIS), channel state information (CSI) is generally required. However, CSI acquisition in RIS systems is challenging and often results in very large pilot overhead, especially in unstructured channel environments. Consequently, the RIS channel estimation problem has attracted a lot of interest and also been a subject of intense study in recent years. In this paper, we propose a decision-directed RIS channel estimation framework for general unstructured channel models. The employed RIS contains some hybrid elements that can simultaneously reflect and sense the incoming signal. We show that with the help of the hybrid RIS elements, it is possible to accurately recover the CSI with a pilot overhead proportional to the number of users. Therefore, the proposed framework substantially improves the system spectral efficiency compared to systems with passive RIS arrays since the pilot overhead in passive RIS systems is proportional to the number of RIS elements times the number of users. We also perform a detailed spectral efficiency analysis for both the pilot-directed and decision-directed frameworks. Our analysis takes into account both the channel estimation and data detection errors at both the RIS and the BS. Finally, we present numerous simulation results to verify the accuracy of the analysis as well as to show the benefits of the proposed decision-directed framework.

Generalised Hyperbolic State-space Models for Inference in Dynamic Systems

  • paper_url: http://arxiv.org/abs/2309.11422
  • repo_url: None
  • paper_authors: Yaman Kındap, Simon Godsill
  • for: This paper studies continuous-time non-Gaussian filtering problems using non-Gaussian state-space models.
  • methods: Linear vector stochastic differential equation (SDE) models driven by the generalised hyperbolic (GH) Lévy process are used, together with continuous-time simulation methods and a novel inference methodology based on a variant of sequential MCMC.
  • results: The model is applied to a synthetically generated data set and a real-world financial time series to demonstrate its capabilities.
    Abstract In this work we study linear vector stochastic differential equation (SDE) models driven by the generalised hyperbolic (GH) L\'evy process for inference in continuous-time non-Gaussian filtering problems. The GH family of stochastic processes offers a flexible framework for modelling of non-Gaussian, heavy-tailed characteristics and includes the normal inverse-Gaussian, variance-gamma and Student-t processes as special cases. We present continuous-time simulation methods for the solution of vector SDE models driven by GH processes and novel inference methodologies using a variant of sequential Markov chain Monte Carlo (MCMC). As an example a particular formulation of Langevin dynamics is studied within this framework. The model is applied to both a synthetically generated data set and a real-world financial series to demonstrate its capabilities.
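
One special case of the generalised hyperbolic family named in the abstract, the variance-gamma process, can be simulated by subordinating Brownian motion with a gamma process. The sketch below uses illustrative parameter values and a simple time grid; the paper itself works with vector SDE models and a sequential MCMC inference scheme, which are not shown here.

```python
# Variance-gamma sample path via gamma subordination (illustrative sketch).
import numpy as np

rng = np.random.default_rng(9)
T, n = 1.0, 1000
dt = T / n
theta, sigma, nu = 0.1, 0.3, 0.2           # drift, volatility, gamma variance rate (assumed)

dG = rng.gamma(shape=dt / nu, scale=nu, size=n)      # gamma subordinator increments
dX = theta * dG + sigma * np.sqrt(dG) * rng.standard_normal(n)
X = np.concatenate([[0.0], np.cumsum(dX)])           # variance-gamma sample path
print(X[-1], X.std())
```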

Active Inference for Sum Rate Maximization in UAV-Assisted Cognitive NOMA Networks

  • paper_url: http://arxiv.org/abs/2309.11263
  • repo_url: None
  • paper_authors: Felix Obite, Ali Krayani, Atm S. Alam, Lucio Marcenaro, Arumugam Nallanathan, Carlo Regazzoni
  • for: This work aims to intelligently improve the channel capacity of future wireless networks in order to handle the massive connectivity driven by the Internet of Things (IoT), unmanned aerial vehicles (UAVs), cognitive radio (CR), and non-orthogonal multiple access (NOMA).
  • methods: Motivated by active inference from cognitive neuroscience, a joint subchannel and power allocation algorithm based on an Active Generalized Dynamic Bayesian Network (Active-GDBN) is proposed to maximize the sum rate.
  • results: Simulation results show that, compared with benchmark schemes, the proposed algorithm adapts better to the time-varying network environment and improves the cumulative sum rate.
    Abstract Given the surge in wireless data traffic driven by the emerging Internet of Things (IoT), unmanned aerial vehicles (UAVs), cognitive radio (CR), and non-orthogonal multiple access (NOMA) have been recognized as promising techniques to overcome massive connectivity issues. As a result, there is an increasing need to intelligently improve the channel capacity of future wireless networks. Motivated by active inference from cognitive neuroscience, this paper investigates joint subchannel and power allocation for an uplink UAV-assisted cognitive NOMA network. Maximizing the sum rate is often a highly challenging optimization problem due to dynamic network conditions and power constraints. To address this challenge, we propose an active inference-based algorithm. We transform the sum rate maximization problem into abnormality minimization by utilizing a generalized state-space model to characterize the time-changing network environment. The problem is then solved using an Active Generalized Dynamic Bayesian Network (Active-GDBN). The proposed framework consists of an offline perception stage, in which a UAV employs a hierarchical GDBN structure to learn an optimal generative model of discrete subchannels and continuous power allocation. In the online active inference stage, the UAV dynamically selects discrete subchannels and continuous power to maximize the sum rate of secondary users. By leveraging the errors in each episode, the UAV can adapt its resource allocation policies and belief updating to improve its performance over time. Simulation results demonstrate the effectiveness of our proposed algorithm in terms of cumulative sum rate compared to benchmark schemes.

Beamforming Design for RIS-Aided THz Wideband Communication Systems

  • paper_url: http://arxiv.org/abs/2309.11161
  • repo_url: None
  • paper_authors: Yihang Jiang, Ziqin Zhou, Xiaoyang Li, Yi Gong
  • for: This paper addresses the beam split effect (BSE) in terahertz (THz) massive MIMO communication systems to improve the performance of future 6G networks.
  • methods: The BSE is quantified by analyzing the array gain loss, and a new beamforming architecture is proposed to mitigate this effect in reconfigurable intelligent surface (RIS)-aided communication scenarios.
  • results: Simulations show that the proposed architecture effectively combats the array gain loss and improves system performance.
    Abstract Benefiting from tens of GHz of bandwidth, terahertz (THz) communications has become a promising technology for future 6G networks. However, the conventional hybrid beamforming architecture based on frequency-independent phase-shifters is not able to cope with the beam split effect (BSE) in THz massive multiple-input multiple-output (MIMO) systems. Despite some work introducing the frequency-dependent phase shifts via the time delay network to mitigate the beam splitting in THz wideband communications, the corresponding issue in reconfigurable intelligent surface (RIS)-aided communications has not been well investigated. In this paper, the BSE in THz massive MIMO is quantified by analyzing the array gain loss. A new beamforming architecture has been proposed to mitigate this effect under RIS-aided communications scenarios. Simulations are performed to evaluate the effectiveness of the proposed system architecture in combating the array gain loss.

Sum-Rate Maximization for Movable Antenna Enabled Multiuser Communications

  • paper_url: http://arxiv.org/abs/2309.11135
  • repo_url: None
  • paper_authors: Zhenqiao Cheng, Nanxi Li, Jianchi Zhu, Xiaoming She, Chongjun Ouyang, Peng Chen
  • for: The paper proposes a novel multiuser communication system with movable antennas (MAs) that exploits antenna position optimization to enhance the downlink sum-rate.
  • methods: The transmit beamforming vector and transmit MA positions are jointly optimized via an efficient algorithm combining fractional programming, alternating optimization, and gradient descent; a zero-forcing beamforming-based design is also proposed as an alternative that strikes a better performance-complexity trade-off (see the sketch after this entry).
  • results: Numerical investigations show that the proposed algorithms improve the downlink sum-rate and outperform the benchmark relying on conventional fixed-position antennas (FPAs).
    Abstract A novel multiuser communication system with movable antennas (MAs) is proposed, where the antenna position optimization is exploited to enhance the downlink sum-rate. The joint optimization of the transmit beamforming vector and transmit MA positions is studied for a multiuser multiple-input single-input system. An efficient algorithm is proposed to tackle the formulated non-convex problem via capitalizing on fractional programming, alternating optimization, and gradient descent methods. To strike a better performance-complexity trade-off, a zero-forcing beamforming-based design is also proposed as an alternative. Numerical investigations are presented to verify the efficiency of the proposed algorithms and their superior performance compared with the benchmark relying on conventional fixed-position antennas (FPAs).
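
The zero-forcing baseline referenced in the methods bullet can be sketched as follows for a generic multiuser MISO downlink. The channel dimensions and per-user power normalization are illustrative assumptions rather than the paper's exact formulation, and the movable-antenna position optimization is not modeled here.

```python
# Zero-forcing (ZF) downlink precoding for a generic multiuser MISO system (sketch).
import numpy as np

rng = np.random.default_rng(3)
K, M = 4, 8                                   # users, transmit antennas (assumed)
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)

W = H.conj().T @ np.linalg.inv(H @ H.conj().T)    # ZF directions (M x K)
W = W / np.linalg.norm(W, axis=0, keepdims=True)  # unit-power beam per user

# cross-user leakage is nulled: |H @ W| is (approximately) diagonal
print(np.round(np.abs(H @ W), 3))
```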

Evaluating Mental Stress Among College Students Using Heart Rate and Hand Acceleration Data Collected from Wearable Sensors

  • paper_url: http://arxiv.org/abs/2309.11097
  • repo_url: None
  • paper_authors: Moein Razavi, Anthony McDonald, Ranjana Mehta, Farzan Sasangohar
  • for: This paper aims to develop a machine learning-based method for identifying stress using physiological data collected from college students.
  • methods: The study uses wearable wrist-worn sensors and a mobile health application to collect heart rate, hand acceleration, and self-reported stress data from college students. The XGBoost method was used to evaluate the effectiveness of the machine learning algorithms for stress detection.
  • results: The study found that XGBoost was the most reliable model for identifying stress episodes, with an AUC of 0.64 and an accuracy of 84.5%. The standard deviation of hand acceleration, standard deviation of heart rate, and the minimum heart rate were the most important features for stress detection.
    Abstract Stress is associated with various mental health disorders, including depression and anxiety, among college students. Early stress diagnosis and intervention may lower the risk of developing mental illnesses. We examined a machine learning-based method for identification of stress using data collected in a naturalistic study utilizing self-reported stress as ground truth as well as physiological data such as heart rate and hand acceleration. The study involved 54 college students from a large campus who used wearable wrist-worn sensors and a mobile health (mHealth) application continuously for 40 days. The app gathered physiological data including heart rate and hand acceleration at one hertz frequency. The application also enabled users to self-report stress by tapping on the watch face, resulting in a time-stamped record of the self-reported stress. We created, evaluated, and analyzed machine learning algorithms for identifying stress episodes among college students using heart rate and accelerometer data. The XGBoost method was the most reliable model with an AUC of 0.64 and an accuracy of 84.5%. The standard deviation of hand acceleration, standard deviation of heart rate, and the minimum heart rate were the most important features for stress detection. This evidence may support the efficacy of identifying patterns in physiological reaction to stress using smartwatch sensors and may inform the design of future tools for real-time detection of stress.
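
A minimal sketch of the kind of feature-based classifier evaluated in the study appears below, using synthetic stand-in data: windowed statistics of heart rate and hand acceleration (including the standard deviations and minimum heart rate highlighted in the abstract) feed an XGBoost model. The window length, simulated distributions, and hyper-parameters are assumptions made for illustration.

```python
# XGBoost stress classifier on windowed heart-rate / acceleration features (sketch).
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

def window_features(hr, acc):
    """Per-window summary statistics used as model inputs."""
    return [np.std(acc), np.std(hr), np.min(hr), np.mean(hr)]

# synthetic stand-in for the wearable dataset: 1-minute windows sampled at 1 Hz
X, y = [], []
for _ in range(2000):
    stressed = int(rng.integers(2))
    hr = rng.normal(70 + 12 * stressed, 5 + 3 * stressed, 60)
    acc = rng.normal(0.0, 0.3 + 0.2 * stressed, 60)
    X.append(window_features(hr, acc))
    y.append(stressed)
X, y = np.array(X), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```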

Pointing-and-Acquisition for Optical Wireless in 6G: From Algorithms to Performance Evaluation

  • paper_url: http://arxiv.org/abs/2309.10999
  • repo_url: None
  • paper_authors: Hyung-Joo Moon, Chan-Byoung Chae, Kai-Kit Wong, Mohamed-Slim Alouini
  • for: This paper examines pointing-and-acquisition considerations and challenges for free-space optical (FSO) communications between gateways and aircraft in non-terrestrial networks.
  • methods: A baseline method using conventional devices and mechanisms is developed first; an algorithm is then proposed that combines angle-of-arrival (AoA) estimation through supplementary radio frequency (RF) links with beam tracking using retroreflectors.
  • results: Extensive simulations show that the proposed method offers superior performance in terms of link acquisition and maintenance.
    Abstract The increasing demand for wireless communication services has led to the development of non-terrestrial networks, which enables various air and space applications. Free-space optical (FSO) communication is considered one of the essential technologies capable of connecting terrestrial and non-terrestrial layers. In this article, we analyze considerations and challenges for FSO communications between gateways and aircraft from a pointing-and-acquisition perspective. Based on the analysis, we first develop a baseline method that utilizes conventional devices and mechanisms. Furthermore, we propose an algorithm that combines angle of arrival (AoA) estimation through supplementary radio frequency (RF) links and beam tracking using retroreflectors. Through extensive simulations, we demonstrate that the proposed method offers superior performance in terms of link acquisition and maintenance.

cs.SD - 2023-09-19

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.10922
  • repo_url: None
  • paper_authors: Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg
  • for: This paper evaluates compression-based discrete audio representations (audio tokens) against mel-spectrogram features for speaker and speech recognition.
  • methods: A Residual Vector Quantization (RVQ) based audio tokenizer is used, and the tokens are evaluated on three tasks: speaker verification, diarization, and multilingual speech recognition.
  • results: Models trained on audio tokens perform competitively, on average within 1% of mel-spectrogram features across all tasks without yet surpassing them; they are robust to out-of-domain narrowband data, particularly for speaker tasks; and audio tokens allow roughly 20x compression relative to mel-spectrogram features with minimal loss of performance, which matters for low bit-rate applications.
    Abstract Discrete audio representation, aka audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in audio domain. To this end, various compression and representation-learning based tokenization schemes have been proposed. However, there is limited investigation into the performance of compression-based audio tokens compared to well-established mel-spectrogram features across various speaker and speech related tasks. In this paper, we evaluate compression based audio tokens on three tasks: Speaker Verification, Diarization and (Multi-lingual) Speech Recognition. Our findings indicate that (i) the models trained on audio tokens perform competitively, on average within $1\%$ of mel-spectrogram features for all the tasks considered, and do not surpass them yet. (ii) these models exhibit robustness for out-of-domain narrowband data, particularly in speaker tasks. (iii) audio tokens allow for compression to 20x compared to mel-spectrogram features with minimal loss of performance in speech and speaker related tasks, which is crucial for low bit-rate applications, and (iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer exhibits a low-pass frequency response characteristic, offering a plausible explanation for the observed results, and providing insight for future tokenizer designs.
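
Residual vector quantization, the tokenization scheme examined in the paper, lets each stage quantize the residual left by the previous stage, so that a frame becomes a short tuple of codebook indices. The sketch below uses random codebooks and illustrative sizes; real neural audio codecs learn the codebooks jointly with encoder and decoder networks.

```python
# Residual vector quantization (RVQ) encode/decode round trip (illustrative sketch).
import numpy as np

rng = np.random.default_rng(5)
dim, n_stages, codebook_size = 64, 4, 256
codebooks = [rng.standard_normal((codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(frame):
    residual, indices = frame.copy(), []
    for cb in codebooks:
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))  # nearest code
        indices.append(int(idx))
        residual = residual - cb[idx]          # next stage quantizes the remainder
    return indices

def rvq_decode(indices):
    return sum(cb[i] for cb, i in zip(codebooks, indices))

frame = rng.standard_normal(dim)
tokens = rvq_encode(frame)
recon = rvq_decode(tokens)
print(tokens, "reconstruction error:", np.linalg.norm(frame - recon))
```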

USED: Universal Speaker Extraction and Diarization

  • paper_url: http://arxiv.org/abs/2309.10674
  • repo_url: None
  • paper_authors: Junyi Ao, Mehmet Sinan Yıldırım, Meng Ge, Shuai Wang, Ruijie Tao, Yanmin Qian, Liqun Deng, Longshuai Xiao, Haizhou Li
  • for: This paper proposes a unified framework, Universal Speaker Extraction and Diarization (USED), that simultaneously extracts the waveforms of all speakers.
  • methods: An existing speaker extraction model is extended, and a scenario-aware differentiated loss function is added to handle the sparsely overlapped speech found in real-world conversations.
  • results: The USED model significantly outperforms the baselines for both speaker extraction and diarization, in both highly overlapped and sparsely overlapped scenarios.
    Abstract Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talk mixture, while speaker diarization demarcates speech segments by speaker, identifying `who spoke when'. The previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective, that is to disentangle the speakers in the spectral domain for the former but in the temporal domain for the latter. It is logical to believe that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at https://ajyy.github.io/demo/USED/.

An Active Noise Control System Based on Soundfield Interpolation Using a Physics-informed Neural Network

  • paper_url: http://arxiv.org/abs/2309.10605
  • repo_url: None
  • paper_authors: Yile Zhang, Fei Ma, Thushara Abhayapala, Prasanga Samarasinghe, Amy Bastine
  • for: Reduce noise within a region of interest (ROI) without placing error microphones inside it.
  • methods: A physics-informed neural network (PINN) interpolates the soundfield within the ROI from monitoring microphones placed outside it; its interpolation performance is compared with the spherical harmonic method under a limited number of monitoring microphones.
  • results: In simulations, the PINN-assisted ANC system reduces noise within the ROI more than a multiple-point ANC system.
    Abstract Conventional multiple-point active noise control (ANC) systems require placing error microphones within the region of interest (ROI), inconveniencing users. This paper designs a feasible monitoring microphone arrangement placed outside the ROI, providing a user with more freedom of movement. The soundfield within the ROI is interpolated from the microphone signals using a physics-informed neural network (PINN). PINN exploits the acoustic wave equation to assist soundfield interpolation under a limited number of monitoring microphones, and demonstrates better interpolation performance than the spherical harmonic method in simulations. An ANC system is designed to take advantage of the interpolated signal to reduce noise signal within the ROI. The PINN-assisted ANC system reduces noise more than that of the multiple-point ANC system in simulations.
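
The physics-informed interpolation idea can be sketched as a small network trained on a few monitoring-microphone samples plus a Helmholtz-equation residual penalty evaluated at collocation points. The network size, single-frequency formulation, toy data, and loss weighting below are assumptions made for illustration, not the paper's architecture.

```python
# Physics-informed soundfield interpolation at one frequency (illustrative sketch).
import torch

k = 2 * torch.pi * 500 / 343.0          # wavenumber at 500 Hz, c = 343 m/s (assumed)
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 2))   # outputs (Re, Im) of pressure

def helmholtz_residual(xy):
    """Mean squared residual of (∇² + k²) p at the collocation points xy."""
    xy = xy.requires_grad_(True)
    p = net(xy)
    res = 0.0
    for c in range(2):                   # real and imaginary channels
        g = torch.autograd.grad(p[:, c].sum(), xy, create_graph=True)[0]
        lap = sum(torch.autograd.grad(g[:, d].sum(), xy,
                                      create_graph=True)[0][:, d] for d in range(2))
        res = res + ((lap + (k ** 2) * p[:, c]) ** 2).mean()
    return res

mic_xy = torch.rand(8, 2)                # monitoring mics outside the ROI (toy positions)
mic_p = torch.randn(8, 2)                # their measured pressures (toy data)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    data_loss = ((net(mic_xy) - mic_p) ** 2).mean()
    pde_loss = helmholtz_residual(torch.rand(256, 2))   # collocation points in the ROI
    (data_loss + 0.1 * pde_loss).backward()
    opt.step()
```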

Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks

  • paper_url: http://arxiv.org/abs/2309.10560
  • repo_url: None
  • paper_authors: Awais Khan, Khalid Mahmood Malik
  • for: Strengthen the security of voice-biometric user authentication, since automatic speaker verification (ASV) systems face security risks from both logical and physical spoofing attacks.
  • methods: A Parallel Stacked Aggregation Network processes raw audio using a split-transform-aggregate technique: utterances are divided into convolved representations, transformations are applied, and the results are aggregated to identify logical (LA) and physical (PA) spoofing attacks.
  • results: Evaluation on the ASVspoof-2019 and VSDC datasets shows the effectiveness of the proposed system; it outperforms existing solutions with smaller EER disparities and better spoofing detection, demonstrating its generalizability and providing a robust defense for ASVs and user data in voice-based security systems.
    Abstract Automatic Speaker Verification (ASV) systems are increasingly used in voice bio-metrics for user authentication but are susceptible to logical and physical spoofing attacks, posing security risks. Existing research mainly tackles logical or physical attacks separately, leading to a gap in unified spoofing detection. Moreover, when existing systems attempt to handle both types of attacks, they often exhibit significant disparities in the Equal Error Rate (EER). To bridge this gap, we present a Parallel Stacked Aggregation Network that processes raw audio. Our approach employs a split-transform-aggregation technique, dividing utterances into convolved representations, applying transformations, and aggregating the results to identify logical (LA) and physical (PA) spoofing attacks. Evaluation of the ASVspoof-2019 and VSDC datasets shows the effectiveness of the proposed system. It outperforms state-of-the-art solutions, displaying reduced EER disparities and superior performance in detecting spoofing attacks. This highlights the proposed method's generalizability and superiority. In a world increasingly reliant on voice-based security, our unified spoofing detection system provides a robust defense against a spectrum of voice spoofing attacks, safeguarding ASVs and user data effectively.

FoleyGen: Visually-Guided Audio Generation

  • paper_url: http://arxiv.org/abs/2309.10537
  • repo_url: None
  • paper_authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra
  • for: This work proposes FoleyGen, an open-domain video-to-audio (V2A) generation system built on a language modeling paradigm.
  • methods: A single Transformer model generates audio tokens conditioned on visual features extracted by a visual encoder, with an off-the-shelf neural audio codec handling bidirectional conversion between waveforms and discrete tokens; three novel visual attention mechanisms are explored to address the misalignment of generated audio with visible actions.
  • results: Experiments on the VGGSound dataset show that FoleyGen outperforms previous systems across all objective metrics and human evaluations.
    Abstract Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. The generation of audio tokens is facilitated by a single Transformer model, which is conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. The experimental results on VGGSound dataset show that our proposed FoleyGen outperforms previous systems across all objective metrics and human evaluations.

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.10455
  • repo_url: https://github.com/ZhengRachel/UTIforAVSE-demo
  • paper_authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
  • for: Improve the quality and intelligibility of degraded speech by incorporating ultrasound tongue imaging into lip-based audio-visual speech enhancement.
  • methods: Knowledge distillation is used during training so that tongue-related information can be leveraged without directly inputting ultrasound tongue images (an audio-lip student model learns from a pre-trained audio-lip-tongue teacher model), and a lip-tongue key-value memory network is introduced to model the alignment between the lip and tongue modalities.
  • results: Both proposed methods significantly improve the quality and intelligibility of the enhanced speech and generalize well to unseen speakers and unseen noises; phone error rate (PER) analysis of automatic speech recognition (ASR) shows that all phonemes benefit from introducing ultrasound tongue images, with palatal and velar consonants benefiting most.
    Abstract Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization performance on unseen speakers and in the presence of unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.

Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

  • paper_url: http://arxiv.org/abs/2309.10832
  • repo_url: None
  • paper_authors: Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang
  • for: Improve multi-channel speech enhancement, which uses multiple microphones to capture spatial cues.
  • methods: Spherical harmonics transform (SHT) coefficients, which concisely represent spatial distributions, are used as auxiliary model inputs; the model has one encoder for the STFT and another for the SHT, and the two are fused in the decoder to estimate the enhanced STFT.
  • results: On the TIMIT dataset under varying noise and reverberation, the model outperforms established benchmarks while using fewer computations and parameters.
    Abstract Multi-channel speech enhancement extracts speech using multiple microphones that capture spatial cues. Effectively utilizing directional information is key for multi-channel enhancement. Deep learning shows great potential on multi-channel speech enhancement and often takes short-time Fourier Transform (STFT) as inputs directly. To fully leverage the spatial information, we introduce a method using spherical harmonics transform (SHT) coefficients as auxiliary model inputs. These coefficients concisely represent spatial distributions. Specifically, our model has two encoders, one for the STFT and another for the SHT. By fusing both encoders in the decoder to estimate the enhanced STFT, we effectively incorporate spatial context. Evaluations on TIMIT under varying noise and reverberation show our model outperforms established benchmarks. Remarkably, this is achieved with fewer computations and parameters. By leveraging spherical harmonics to incorporate directional cues, our model efficiently improves the performance of the multi-channel speech enhancement.
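
The auxiliary spherical-harmonic-transform input can be illustrated by projecting one time-frequency frame of a spherical microphone array onto low-order spherical harmonics. The array layout, uniform quadrature weights, and order N = 2 below are illustrative assumptions; the paper feeds such coefficients to a dedicated encoder alongside the STFT.

```python
# Spherical-harmonic coefficients of one STFT frame on a spherical array (sketch).
import numpy as np
from scipy.special import sph_harm

rng = np.random.default_rng(6)
Q = 32                                         # microphones (assumed)
azim = rng.uniform(0, 2 * np.pi, Q)            # azimuth of each capsule
colat = np.arccos(rng.uniform(-1, 1, Q))       # colatitude, roughly uniform on the sphere
frame = rng.standard_normal(Q) + 1j * rng.standard_normal(Q)   # one STFT bin, all mics

def sht_coefficients(p, order=2):
    """Coefficients a_{nm} for n <= order, assuming uniform quadrature weights."""
    coeffs = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            Y = sph_harm(m, n, azim, colat)            # scipy order: (m, n, azimuth, colatitude)
            coeffs.append((4 * np.pi / Q) * np.sum(p * np.conj(Y)))
    return np.array(coeffs)                             # (order + 1)**2 values

a_nm = sht_coefficients(frame)
print(a_nm.shape)        # (9,) for order 2 -- fed to the model alongside the STFT
```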

Hierarchical Modeling of Spatial Cues via Spherical Harmonics for Multi-Channel Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.10393
  • repo_url: None
  • paper_authors: Jiahui Pan, Shulin He, Hui Zhang, Xueliang Zhang
  • for: Improve multi-channel speech enhancement by making better use of the spatial information in multi-channel signals to extract the target speech.
  • methods: Spatial information is modeled explicitly by applying spherical harmonic transforms (SHT) to the multi-channel input in a hierarchical framework: lower-order harmonics capturing broader spatial patterns are estimated first, then combined with higher orders to recursively predict finer spatial details.
  • results: Experiments on TIMIT show the proposed method effectively recovers target spatial patterns and outperforms baseline models while using fewer parameters and computations.
    Abstract Multi-channel speech enhancement utilizes spatial information from multiple microphones to extract the target speech. However, most existing methods do not explicitly model spatial cues, instead relying on implicit learning from multi-channel spectra. To better leverage spatial information, we propose explicitly incorporating spatial modeling by applying spherical harmonic transforms (SHT) to the multi-channel input. In detail, a hierarchical framework is introduced whereby lower order harmonics capturing broader spatial patterns are estimated first, then combined with higher orders to recursively predict finer spatial details. Experiments on TIMIT demonstrate the proposed method can effectively recover target spatial patterns and achieve improved performance over baseline models, using fewer parameters and computations. Explicitly modeling spatial information hierarchically enables more effective multi-channel speech enhancement.

PDPCRN: Parallel Dual-Path CRN with Bi-directional Inter-Branch Interactions for Multi-Channel Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.10379
  • repo_url: None
  • paper_authors: Jiahui Pan, Shulin He, Tianci Wu, Hui Zhang, Xueliang Zhang
  • for: Improve the accuracy of multi-channel speech enhancement.
  • methods: The Parallel Dual-Path Convolutional Recurrent Network (PDPCRN) is proposed, with two key innovations: separate parallel branches that extract complementary features, and bi-directional modules that enable cross-branch communication.
  • results: Experiments on the TIMIT dataset show that PDPCRN not only outperforms baselines such as the standard DPCRN on the PESQ and STOI metrics but also has a smaller computational footprint with fewer parameters.
    Abstract Multi-channel speech enhancement seeks to utilize spatial information to distinguish target speech from interfering signals. While deep learning approaches like the dual-path convolutional recurrent network (DPCRN) have made strides, challenges persist in effectively modeling inter-channel correlations and amalgamating multi-level information. In response, we introduce the Parallel Dual-Path Convolutional Recurrent Network (PDPCRN). This acoustic modeling architecture has two key innovations. First, a parallel design with separate branches extracts complementary features. Second, bi-directional modules enable cross-branch communication. Together, these facilitate diverse representation fusion and enhanced modeling. Experimental validation on TIMIT datasets underscores the prowess of PDPCRN. Notably, against baseline models like the standard DPCRN, PDPCRN not only outperforms in PESQ and STOI metrics but also boasts a leaner computational footprint with reduced parameters.

eess.AS - 2023-09-19

Exploring Speech Enhancement for Low-resource Speech Synthesis

  • paper_url: http://arxiv.org/abs/2309.10795
  • repo_url: None
  • paper_authors: Zhaoheng Ni, Sravya Popuri, Ning Dong, Kohei Saijo, Xiaohui Zhang, Gael Le Lan, Yangyang Shi, Vikas Chandra, Changhan Wang
  • for: Obtaining high-quality speech data for training text-to-speech (TTS) models in low-resource languages is challenging and expensive; this work aims to mitigate that by enhancing existing data.
  • methods: A speech enhancement model (TF-GridNet) is trained and applied to automatic speech recognition (ASR) corpora to augment the training data, and a discrete-unit-based TTS model is then trained on the enhanced speech.
  • results: Using Arabic datasets as an example, the proposed pipeline significantly improves the low-resource TTS system over baseline methods in terms of the ASR WER metric, and an empirical analysis examines the correlation between speech enhancement and TTS performance.
    Abstract High-quality and intelligible speech is essential to text-to-speech (TTS) model training, however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement on Automatic Speech Recognition (ASR) corpus mitigates the issue by augmenting the training data, while how the nonlinear speech distortion brought by speech enhancement models affects TTS training still needs to be investigated. In this paper, we train a TF-GridNet speech enhancement model and apply it to low-resource datasets that were collected for the ASR task, then train a discrete unit based TTS model on the enhanced speech. We use Arabic datasets as an example and show that the proposed pipeline significantly improves the low-resource TTS system compared with other baseline methods in terms of ASR WER metric. We also run empirical analysis on the correlation between speech enhancement and TTS performances.

cs.CV - 2023-09-19

A Novel Deep Neural Network for Trajectory Prediction in Automated Vehicles Using Velocity Vector Field

  • paper_url: http://arxiv.org/abs/2309.10948
  • repo_url: https://github.com/Amir-Samadi/VVF-TP
  • paper_authors: MReza Alipour Sormoli, Amir Samadi, Sajjad Mozaffari, Konstantinos Koufos, Mehrdad Dianati, Roger Woodman
  • for: Anticipate the motion of other road users so that automated driving systems (ADS) can make safe and informed downstream decisions and plan motion accordingly.
  • methods: A data-driven learning-based method is combined with a velocity vector field (VVF) inspired by fluid flow dynamics; the vector field is fed as an additional input to a convolutional-recurrent deep neural network that predicts the most likely future trajectories from bird's-eye-view scene representations.
  • results: Compared with state-of-the-art methods on the HighD dataset, the proposed technique improves prediction accuracy for both short and long-term (5 s) horizons, and the accuracy remains consistent as the observation window shrinks, reducing the need for a long history of past observations.
    Abstract Anticipating the motion of other road users is crucial for automated driving systems (ADS), as it enables safe and informed downstream decision-making and motion planning. Unfortunately, contemporary learning-based approaches for motion prediction exhibit significant performance degradation as the prediction horizon increases or the observation window decreases. This paper proposes a novel technique for trajectory prediction that combines a data-driven learning-based method with a velocity vector field (VVF) generated from a nature-inspired concept, i.e., fluid flow dynamics. In this work, the vector field is incorporated as an additional input to a convolutional-recurrent deep neural network to help predict the most likely future trajectories given a sequence of bird's eye view scene representations. The performance of the proposed model is compared with state-of-the-art methods on the HighD dataset demonstrating that the VVF inclusion improves the prediction accuracy for both short and long-term (5~sec) time horizons. It is also shown that the accuracy remains consistent with decreasing observation windows which alleviates the requirement of a long history of past observations for accurate trajectory prediction. Source codes are available at: https://github.com/Amir-Samadi/VVF-TP.

A Geometric Flow Approach for Segmentation of Images with Inhomogeneous Intensity and Missing Boundaries

  • paper_url: http://arxiv.org/abs/2309.10935
  • repo_url: None
  • paper_authors: Paramjyoti Mohapatra, Richard Lartey, Weihong Guo, Michael Judkovich, Xiaojuan Li
  • for: This paper targets muscle segmentation in MR images, which suffer from intensity inhomogeneity and tightly packed objects with missing boundaries, using a novel intensity correction and a semi-automatic segmentation approach.
  • methods: A geometric flow is used that incorporates a reproducing kernel Hilbert space (RKHS) edge detector and a geodesic distance penalty term derived from sets of markers and anti-markers; a new bias field estimation method, Prior Bias-Corrected Fuzzy C-means (PBCFCM), which uses a fat fraction image, is also introduced to handle the intensity inhomogeneity.
  • results: Numerical experiments show that the proposed scheme leads to significantly better results than compared methods, with average Dice values of 92.5%, 85.3%, and 85.3% for the quadriceps, hamstrings, and other muscle groups, while other approaches are at least 10% worse.
    Abstract Image segmentation is a complex mathematical problem, especially for images that contain intensity inhomogeneity and tightly packed objects with missing boundaries in between. For instance, Magnetic Resonance (MR) muscle images often contain both of these issues, making muscle segmentation especially difficult. In this paper we propose a novel intensity correction and a semi-automatic active contour based segmentation approach. The approach uses a geometric flow that incorporates a reproducing kernel Hilbert space (RKHS) edge detector and a geodesic distance penalty term from a set of markers and anti-markers. We test the proposed scheme on MR muscle segmentation and compare with some state of the art methods. To help deal with the intensity inhomogeneity in this particular kind of image, a new approach to estimate the bias field using a fat fraction image, called Prior Bias-Corrected Fuzzy C-means (PBCFCM), is introduced. Numerical experiments show that the proposed scheme leads to significantly better results than compared ones. The average dice values of the proposed method are 92.5%, 85.3%, 85.3% for quadriceps, hamstrings and other muscle groups while other approaches are at least 10% worse.
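
Standard fuzzy c-means, the base of the PBCFCM bias-field estimator named above, alternates membership and centroid updates. The sketch below shows only this classic iteration on 1-D intensities with fuzzifier m = 2; the prior and bias-correction terms specific to PBCFCM are omitted.

```python
# Classic fuzzy c-means on scalar intensities (illustrative sketch, not PBCFCM).
import numpy as np

def fuzzy_c_means(x, c=3, m=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    x = x.reshape(-1, 1).astype(float)
    centers = rng.choice(x.ravel(), c).reshape(1, c)
    for _ in range(iters):
        d = np.abs(x - centers) + 1e-12                  # (N, c) distances
        # membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
        centers = ((u ** m * x).sum(0) / (u ** m).sum(0)).reshape(1, c)  # weighted centroids
    return u, centers.ravel()

intensities = np.concatenate([np.random.normal(mu, 5, 500) for mu in (40, 100, 160)])
u, centers = fuzzy_c_means(intensities)
print(np.sort(centers))
```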

Incremental Multimodal Surface Mapping via Self-Organizing Gaussian Mixture Models

  • paper_url: http://arxiv.org/abs/2309.10900
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Kshitij Goel, Wennie Tabib
  • for: This work proposes an incremental multimodal surface mapping methodology that represents the environment as a continuous probabilistic model, enabling high-resolution reconstruction while compressing spatial and intensity point cloud data.
  • methods: The environment is represented with Gaussian mixture models (GMMs); a spatial hash map is introduced for rapid GMM submap extraction, together with an approach for determining relevant and redundant data in a point cloud, which increases computational speed by an order of magnitude over state-of-the-art incremental GMM-based mapping.
  • results: The approach yields a superior trade-off between map accuracy and size compared with state-of-the-art mapping methods (both GMM- and non-GMM-based), as shown in evaluations on simulated and real-world data; the software is released open-source.
    Abstract This letter describes an incremental multimodal surface mapping methodology, which represents the environment as a continuous probabilistic model. This model enables high-resolution reconstruction while simultaneously compressing spatial and intensity point cloud data. The strategy employed in this work utilizes Gaussian mixture models (GMMs) to represent the environment. While prior GMM-based mapping works have developed methodologies to determine the number of mixture components using information-theoretic techniques, these approaches either operate on individual sensor observations, making them unsuitable for incremental mapping, or are not real-time viable, especially for applications where high-fidelity modeling is required. To bridge this gap, this letter introduces a spatial hash map for rapid GMM submap extraction combined with an approach to determine relevant and redundant data in a point cloud. These contributions increase computational speed by an order of magnitude compared to state-of-the-art incremental GMM-based mapping. In addition, the proposed approach yields a superior tradeoff in map accuracy and size when compared to state-of-the-art mapping methodologies (both GMM- and not GMM-based). Evaluations are conducted using both simulated and real-world data. The software is released open-source to benefit the robotics community.
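
The spatial-hash idea can be illustrated by bucketing incoming points with a coarse voxel hash and keeping a small Gaussian mixture per bucket that can be looked up in constant time. The cell size, component count, and naive refit-on-insert policy below are assumptions made for illustration; the paper's incremental, self-organizing GMM updates are considerably more sophisticated.

```python
# Spatial hash of per-cell GMM submaps (illustrative sketch).
import numpy as np
from collections import defaultdict
from sklearn.mixture import GaussianMixture

CELL = 2.0                                  # metres per hash cell (assumed)
buckets = defaultdict(list)                 # hash key -> list of points
submaps = {}                                # hash key -> fitted GMM submap

def hash_key(p):
    return tuple(np.floor(p / CELL).astype(int))

def insert_scan(points, min_pts=50, n_components=4):
    for p in points:
        buckets[hash_key(p)].append(p)
    for key, pts in buckets.items():        # refit cells that have enough support
        if len(pts) >= min_pts:
            submaps[key] = GaussianMixture(n_components, covariance_type="full").fit(np.array(pts))

scan = np.random.default_rng(7).uniform(0, 6, size=(3000, 3))   # fake point cloud
insert_scan(scan)
print(len(submaps), "GMM submaps; query hit:", hash_key(np.array([1.0, 1.0, 1.0])) in submaps)
```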

PLVS: A SLAM System with Points, Lines, Volumetric Mapping, and 3D Incremental Segmentation

  • paper_url: http://arxiv.org/abs/2309.10896
  • repo_url: https://github.com/luigifreda/plvs
  • paper_authors: Luigi Freda
  • for: This paper presents PLVS, a real-time system that combines sparse SLAM, volumetric mapping, and 3D unsupervised incremental segmentation.
  • methods: The keyframe-based SLAM module extracts and tracks sparse points and line segments as features; volumetric mapping runs in parallel with the SLAM front-end and generates a 3D reconstruction by fusing point clouds back-projected from keyframes, and a novel reprojection error that exploits depth information is used to bundle-adjust line segments.
  • results: Qualitative and quantitative evaluations on publicly available datasets demonstrate the PLVS framework, which also integrates an incremental, geometry-based segmentation method for RGB-D cameras; the software is released open-source.
    Abstract This document presents PLVS: a real-time system that leverages sparse SLAM, volumetric mapping, and 3D unsupervised incremental segmentation. PLVS stands for Points, Lines, Volumetric mapping, and Segmentation. It supports RGB-D and Stereo cameras, which may be optionally equipped with IMUs. The SLAM module is keyframe-based, and extracts and tracks sparse points and line segments as features. Volumetric mapping runs in parallel with respect to the SLAM front-end and generates a 3D reconstruction of the explored environment by fusing point clouds backprojected from keyframes. Different volumetric mapping methods are supported and integrated in PLVS. We use a novel reprojection error to bundle-adjust line segments. This error exploits available depth information to stabilize the position estimates of line segment endpoints. An incremental and geometric-based segmentation method is implemented and integrated for RGB-D cameras in the PLVS framework. We present qualitative and quantitative evaluations of the PLVS framework on some publicly available datasets. The appendix details the adopted stereo line triangulation method and provides a derivation of the Jacobians we used for line error terms. The software is available as open-source.

GelSight Svelte: A Human Finger-shaped Single-camera Tactile Robot Finger with Large Sensing Coverage and Proprioceptive Sensing

  • paper_url: http://arxiv.org/abs/2309.10885
  • repo_url: None
  • paper_authors: Jialiang Zhao, Edward H. Adelson
  • for: This work develops GelSight Svelte, a curved, human finger-sized, single-camera tactile sensor capable of both tactile and proprioceptive sensing over a large area.
  • methods: Curved mirrors give the sensor its shape and sensing coverage; proprioceptive information such as the total bending and twisting torques applied to the finger appears as deformations of the flexible backbone, which the camera captures, and a convolutional neural network is trained to estimate these torques from the images.
  • results: Gel deformation experiments at various locations on the finger evaluate the tactile sensing capability and proprioceptive sensing accuracy, and an object-holding task with three grasping modes that use different areas of the finger demonstrates the sensor's capabilities. More information is available at https://gelsight-svelte.alanz.info.
    Abstract Camera-based tactile sensing is a low-cost, popular approach to obtain highly detailed contact geometry information. However, most existing camera-based tactile sensors are fingertip sensors, and longer fingers often require extraneous elements to obtain an extended sensing area similar to the full length of a human finger. Moreover, existing methods to estimate proprioceptive information such as total forces and torques applied on the finger from camera-based tactile sensors are not effective when the contact geometry is complex. We introduce GelSight Svelte, a curved, human finger-sized, single-camera tactile sensor that is capable of both tactile and proprioceptive sensing over a large area. GelSight Svelte uses curved mirrors to achieve the desired shape and sensing coverage. Proprioceptive information, such as the total bending and twisting torques applied on the finger, is reflected as deformations on the flexible backbone of GelSight Svelte, which are also captured by the camera. We train a convolutional neural network to estimate the bending and twisting torques from the captured images. We conduct gel deformation experiments at various locations of the finger to evaluate the tactile sensing capability and proprioceptive sensing accuracy. To demonstrate the capability and potential uses of GelSight Svelte, we conduct an object holding task with three different grasping modes that utilize different areas of the finger. More information is available on our website: https://gelsight-svelte.alanz.info

DeepliteRT: Computer Vision at the Edge

  • paper_url: http://arxiv.org/abs/2309.10878
  • repo_url: None
  • paper_authors: Saad Ashfaq, Alexander Hoffman, Saptarshi Mitra, Sudhakar Sah, MohammadHossein AskariHemmat, Ehsan Saboori
  • for: To enable efficient deployment of deep learning models for computer vision on edge devices through ultra low-bit quantized neural networks.
  • methods: Highly optimized ultra low-bit convolution operators for ARM targets, packaged in Deeplite Runtime (DeepliteRT), an end-to-end solution for compiling, tuning, and running ultra low-bit models; compiler passes convert fake-quantized full-precision models into compact ultra low-bit representations.
  • results: DeepliteRT achieves speedups of up to 2.20x, 2.33x, and 2.17x over optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines on classification and detection models.
    Abstract The proliferation of edge devices has unlocked unprecedented opportunities for deep learning model deployment in computer vision applications. However, these complex models require considerable power, memory and compute resources that are typically not available on edge platforms. Ultra low-bit quantization presents an attractive solution to this problem by scaling down the model weights and activations from 32-bit to less than 8-bit. We implement highly optimized ultra low-bit convolution operators for ARM-based targets that outperform existing methods by up to 4.34x. Our operator is implemented within Deeplite Runtime (DeepliteRT), an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices. Compiler passes in DeepliteRT automatically convert a fake-quantized model in full precision to a compact ultra low-bit representation, easing the process of quantized model deployment on commodity hardware. We analyze the performance of DeepliteRT on classification and detection models against optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines, achieving significant speedups of up to 2.20x, 2.33x and 2.17x, respectively.
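    As a concrete illustration of the ultra low-bit idea (not DeepliteRT's actual ARM operators or compiler passes), the minimal sketch below shows symmetric 2-bit fake quantization of a weight tensor, the kind of representation a compiler pass would later pack into bit-serial form. All names and sizes are illustrative.

```python
import numpy as np

def fake_quantize_2bit(weights: np.ndarray) -> np.ndarray:
    """Symmetric 2-bit fake quantization: weights are snapped to 4 levels but
    kept in floating point, the form a compiler pass would later pack into bits."""
    q_max = 2 ** (2 - 1) - 1                            # integer levels {-2, -1, 0, 1}
    scale = max(float(np.max(np.abs(weights))), 1e-8) / q_max
    q = np.clip(np.round(weights / scale), -q_max - 1, q_max)
    return (q * scale).astype(weights.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)).astype(np.float32)
w_q = fake_quantize_2bit(w)
print("distinct quantized values:", np.unique(w_q))     # at most 4 levels
print("mean absolute error      :", float(np.abs(w - w_q).mean()))
```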

On-device Real-time Custom Hand Gesture Recognition

  • paper_url: http://arxiv.org/abs/2309.10858
  • repo_url: None
  • paper_authors: Esha Uboweja, David Tian, Qifei Wang, Yi-Chun Kuo, Joe Zou, Lu Wang, George Sung, Matthias Grundmann
  • for: To let users and developers quickly build and deploy custom hand gesture recognition without requiring machine learning (ML) expertise.
  • methods: A pre-trained single-hand embedding model is fine-tuned on a small number of webcam images collected per gesture; low-code and no-code tools train and deploy the custom recognition model, which runs on-device via the MediaPipe Tasks inference API.
  • results: An end-to-end custom gesture recognition pipeline can be built, tested, and deployed on-device for real-time use in only a few minutes.
    Abstract Most existing hand gesture recognition (HGR) systems are limited to a predefined set of gestures. However, users and developers often want to recognize new, unseen gestures. This is challenging due to the vast diversity of all plausible hand shapes, e.g. it is impossible for developers to include all hand gestures in a predefined list. In this paper, we present a user-friendly framework that lets users easily customize and deploy their own gesture recognition pipeline. Our framework provides a pre-trained single-hand embedding model that can be fine-tuned for custom gesture recognition. Users can perform gestures in front of a webcam to collect a small amount of images per gesture. We also offer a low-code solution to train and deploy the custom gesture recognition model. This makes it easy for users with limited ML expertise to use our framework. We further provide a no-code web front-end for users without any ML expertise. This makes it even easier to build and test the end-to-end pipeline. The resulting custom HGR is then ready to be run on-device for real-time scenarios. This can be done by calling a simple function in our open-sourced model inference API, MediaPipe Tasks. This entire process only takes a few minutes.
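    A rough sketch of the few-shot customization idea: embeddings from a frozen pre-trained hand-embedding model are computed once, and only a small classifier head is trained on a handful of images per gesture. The `embed_fn` below is a random stand-in, not MediaPipe's actual embedding model, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

def embed_fn(images: torch.Tensor) -> torch.Tensor:
    """Random stand-in for the frozen pre-trained single-hand embedding model:
    maps (N, 3, 64, 64) hand crops to (N, 128) embeddings."""
    torch.manual_seed(0)                       # keep the stand-in deterministic
    proj = torch.randn(3 * 64 * 64, 128)
    return images.flatten(1) @ proj

class GestureHead(nn.Module):
    """Lightweight classifier trained on top of the frozen embeddings."""
    def __init__(self, embed_dim: int = 128, num_gestures: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                nn.Linear(64, num_gestures))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.fc(emb)

images = torch.rand(32, 3, 64, 64)             # a few webcam crops per gesture (toy data)
labels = torch.randint(0, 4, (32,))
head = GestureHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

with torch.no_grad():                          # the embedding model stays frozen
    emb = embed_fn(images)

for _ in range(50):                            # a short fine-tuning loop suffices
    opt.zero_grad()
    loss = loss_fn(head(emb), labels)
    loss.backward()
    opt.step()
print("final training loss:", float(loss))
```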

Assessing the capacity of a denoising diffusion probabilistic model to reproduce spatial context

  • paper_url: http://arxiv.org/abs/2309.10817
  • repo_url: None
  • paper_authors: Rucha Deshpande, Muzaffer Özbey, Hua Li, Mark A. Anastasio, Frank J. Brooks
  • for: To investigate whether DDPMs can reliably learn spatial context relevant to medical imaging.
  • methods: Stochastic context models (SCMs) are used to generate training data, and DDPM-generated image ensembles are analysed post hoc to quantitatively assess how well spatial context is reproduced, with error rates compared against a modern GAN.
  • results: DDPMs can reproduce spatial context and generate contextually correct images "interpolated" between training samples, which may benefit data-augmentation tasks in ways that GANs cannot.
    Abstract Diffusion models have emerged as a popular family of deep generative models (DGMs). In the literature, it has been claimed that one class of diffusion models -- denoising diffusion probabilistic models (DDPMs) -- demonstrate superior image synthesis performance as compared to generative adversarial networks (GANs). To date, these claims have been evaluated using either ensemble-based methods designed for natural images, or conventional measures of image quality such as structural similarity. However, there remains an important need to understand the extent to which DDPMs can reliably learn medical imaging domain-relevant information, which is referred to as `spatial context' in this work. To address this, a systematic assessment of the ability of DDPMs to learn spatial context relevant to medical imaging applications is reported for the first time. A key aspect of the studies is the use of stochastic context models (SCMs) to produce training data. In this way, the ability of the DDPMs to reliably reproduce spatial context can be quantitatively assessed by use of post-hoc image analyses. Error-rates in DDPM-generated ensembles are reported, and compared to those corresponding to a modern GAN. The studies reveal new and important insights regarding the capacity of DDPMs to learn spatial context. Notably, the results demonstrate that DDPMs hold significant capacity for generating contextually correct images that are `interpolated' between training samples, which may benefit data-augmentation tasks in ways that GANs cannot.

PanopticNeRF-360: Panoramic 3D-to-2D Label Transfer in Urban Scenes

  • paper_url: http://arxiv.org/abs/2309.10815
  • repo_url: https://github.com/fuxiao0719/panopticnerf
  • paper_authors: Xiao Fu, Shangzhan Zhang, Tianrun Chen, Yichong Lu, Xiaowei Zhou, Andreas Geiger, Yiyi Liao
  • for: To improve the training of perception systems for self-driving cars, and thereby their safety and reliability, by reducing the annotation burden.
  • methods: A new method that combines coarse 3D annotations with noisy 2D semantic cues to generate consistent panoptic labels and high-quality images from any viewpoint, using mutual enhancement of geometry and semantics and hybrid MLP/hash-grid scene features.
  • results: Experiments on KITTI-360 show state-of-the-art performance over existing label transfer methods, with high-fidelity, multi-view, and spatiotemporally consistent renderings of appearance, semantic, and instance labels.
    Abstract Training perception systems for self-driving cars requires substantial annotations. However, manual labeling in 2D images is highly labor-intensive. While existing datasets provide rich annotations for pre-recorded sequences, they fall short in labeling rarely encountered viewpoints, potentially hampering the generalization ability for perception models. In this paper, we present PanopticNeRF-360, a novel approach that combines coarse 3D annotations with noisy 2D semantic cues to generate consistent panoptic labels and high-quality images from any viewpoint. Our key insight lies in exploiting the complementarity of 3D and 2D priors to mutually enhance geometry and semantics. Specifically, we propose to leverage noisy semantic and instance labels in both 3D and 2D spaces to guide geometry optimization. Simultaneously, the improved geometry assists in filtering noise present in the 3D and 2D annotations by merging them in 3D space via a learned semantic field. To further enhance appearance, we combine MLP and hash grids to yield hybrid scene features, striking a balance between high-frequency appearance and predominantly contiguous semantics. Our experiments demonstrate PanopticNeRF-360's state-of-the-art performance over existing label transfer methods on the challenging urban scenes of the KITTI-360 dataset. Moreover, PanopticNeRF-360 enables omnidirectional rendering of high-fidelity, multi-view and spatiotemporally consistent appearance, semantic and instance labels. We make our code and data available at https://github.com/fuxiao0719/PanopticNeRF

PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance

  • paper_url: http://arxiv.org/abs/2309.10810
  • repo_url: https://github.com/pq-yang/pgdiff
  • paper_authors: Peiqing Yang, Shangchen Zhou, Qingyi Tao, Chen Change Loy
  • for: To replace traditional task-specific training for image restoration with guidance of pre-trained diffusion models, improving restoration performance.
  • methods: PGDiff introduces partial guidance: rather than explicitly defining the degradation process, the desired properties of high-quality images (such as image structure and color statistics) are modelled and applied as guidance during the reverse diffusion process.
  • results: Experiments show the method outperforms existing diffusion-prior-based approaches and competes favorably with task-specific models; it also extends to composite tasks by combining guidance from multiple tasks.
    Abstract Exploiting pre-trained diffusion models for restoration has recently become a favored alternative to the traditional task-specific training approach. Previous works have achieved noteworthy success by limiting the solution space using explicit degradation models. However, these methods often fall short when faced with complex degradations as they generally cannot be precisely modeled. In this paper, we propose PGDiff by introducing partial guidance, a fresh perspective that is more adaptable to real-world degradations compared to existing works. Rather than specifically defining the degradation process, our approach models the desired properties, such as image structure and color statistics of high-quality images, and applies this guidance during the reverse diffusion process. These properties are readily available and make no assumptions about the degradation process. When combined with a diffusion prior, this partial guidance can deliver appealing results across a range of restoration tasks. Additionally, PGDiff can be extended to handle composite tasks by consolidating multiple high-quality image properties, achieved by integrating the guidance from respective tasks. Experimental results demonstrate that our method not only outperforms existing diffusion-prior-based approaches but also competes favorably with task-specific models.
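    A heavily simplified sketch of what one partial-guidance step could look like: a property loss (here, matching per-channel color statistics, one of the properties the abstract mentions) is evaluated on the denoiser's current estimate of the clean image, and its gradient nudges the sample before the next reverse-diffusion step. The actual PGDiff schedules, property definitions, and update rule differ; the denoiser below is a dummy placeholder.

```python
import torch

def color_stat_loss(x0_pred, target_mean, target_std):
    """Penalize deviation of per-channel mean/std from the statistics of
    high-quality images (one possible 'partial' property)."""
    mean = x0_pred.mean(dim=(0, 2, 3))
    std = x0_pred.std(dim=(0, 2, 3))
    return ((mean - target_mean) ** 2).sum() + ((std - target_std) ** 2).sum()

def guided_step(x_t, t, denoiser, target_mean, target_std, guidance_scale=0.1):
    """One reverse step with gradient guidance; `denoiser(x_t, t)` is assumed to
    return an estimate of the clean image x0."""
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoiser(x_t, t)
    grad = torch.autograd.grad(
        color_stat_loss(x0_pred, target_mean, target_std), x_t)[0]
    # A full sampler would next draw x_{t-1} from x0_pred plus noise; here we only
    # show the guidance correction applied before that sampling step.
    return (x0_pred - guidance_scale * grad).detach()

dummy_denoiser = lambda x, t: x                     # placeholder network
x = torch.rand(1, 3, 32, 32)
for t in reversed(range(5)):                        # a few toy reverse steps
    x = guided_step(x, t, dummy_denoiser,
                    target_mean=torch.tensor([0.5, 0.5, 0.5]),
                    target_std=torch.tensor([0.25, 0.25, 0.25]))
print(x.shape)
```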

Multi-Context Dual Hyper-Prior Neural Image Compression

  • paper_url: http://arxiv.org/abs/2309.10799
  • repo_url: None
  • paper_authors: Atefeh Khoshkhahtinat, Ali Zafari, Piyush M. Mehta, Mohammad Akyash, Hossein Kashiani, Nasser M. Nasrabadi
  • for: To improve learned image compression with a Transformer-based nonlinear transform that models both local context and long-range (global) dependencies in the input image.
  • methods: A Transformer-based nonlinear transform efficiently captures local and global information, yielding a more decorrelated latent representation; a novel entropy model with two hyperpriors models cross-channel and spatial dependencies, aided by a causal-attention global context.
  • results: Experiments show the proposed framework outperforms state-of-the-art methods in rate-distortion performance.
    Abstract Transform and entropy models are the two core components in deep image compression neural networks. Most existing learning-based image compression methods utilize convolutional-based transform, which lacks the ability to model long-range dependencies, primarily due to the limited receptive field of the convolution operation. To address this limitation, we propose a Transformer-based nonlinear transform. This transform has the remarkable ability to efficiently capture both local and global information from the input image, leading to a more decorrelated latent representation. In addition, we introduce a novel entropy model that incorporates two different hyperpriors to model cross-channel and spatial dependencies of the latent representation. To further improve the entropy model, we add a global context that leverages distant relationships to predict the current latent more accurately. This global context employs a causal attention mechanism to extract long-range information in a content-dependent manner. Our experiments show that our proposed framework performs better than the state-of-the-art methods in terms of rate-distortion performance.

Multi-spectral Entropy Constrained Neural Compression of Solar Imagery

  • paper_url: http://arxiv.org/abs/2309.10791
  • repo_url: None
  • paper_authors: Ali Zafari, Atefeh Khoshkhahtinat, Piyush M. Mehta, Nasser M. Nasrabadi, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva
  • for: To develop a Transformer-based multi-spectral neural image compressor that efficiently captures redundancies both within and across wavelength bands of solar imagery.
  • methods: An inter-window aggregated token multi-head self-attention relaxes the locality of window-based self-attention, and a randomly shifted window attention mechanism makes the transformer blocks insensitive to translations of the input.
  • results: The method not only outperforms conventional compression algorithms but also decorrelates images across multiple wavelengths better than single-spectral compression.
    Abstract Missions studying the dynamic behaviour of the Sun are defined to capture multi-spectral images of the sun and transmit them to the ground station in a daily basis. To make transmission efficient and feasible, image compression systems need to be exploited. Recently successful end-to-end optimized neural network-based image compression systems have shown great potential to be used in an ad-hoc manner. In this work we have proposed a transformer-based multi-spectral neural image compressor to efficiently capture redundancies both intra/inter-wavelength. To unleash the locality of window-based self attention mechanism, we propose an inter-window aggregated token multi head self attention. Additionally to make the neural compressor autoencoder shift invariant, a randomly shifted window attention mechanism is used which makes the transformer blocks insensitive to translations in their input domain. We demonstrate that the proposed approach not only outperforms the conventional compression algorithms but also it is able to better decorrelates images along the multiple wavelengths compared to single spectral compression.

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

  • paper_url: http://arxiv.org/abs/2309.10787
  • repo_url: https://github.com/roger-tseng/av-superb
  • paper_authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
  • for: Audio-visual representation learning aims at human-like perception by exploiting correlations between auditory and visual information; this work targets its general-purpose evaluation.
  • methods: The AV-SUPERB benchmark evaluates unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing, and is used to assess 5 recent self-supervised models.
  • results: None of the evaluated models generalizes to all tasks; representations can be improved with intermediate-task fine-tuning, and audio event classification on AudioSet serves as a strong intermediate task.
    Abstract Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.

Context-Aware Neural Video Compression on Solar Dynamics Observatory

  • paper_url: http://arxiv.org/abs/2309.10784
  • repo_url: None
  • paper_authors: Atefeh Khoshkhahtinat, Ali Zafari, Piyush M. Mehta, Nasser M. Nasrabadi, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva
  • for: This paper aims to improve the compression of solar images collected by NASA’s Solar Dynamics Observatory (SDO) mission.
  • methods: The paper proposes a novel neural Transformer-based video compression approach that leverages a Fused Local-aware Window (FLaWin) Transformer block to efficiently exploit temporal and spatial redundancies in the images, resulting in a high compression ratio.
  • results: The proposed approach outperforms conventional hand-engineered video codecs such as H.264 and H.265 in terms of rate-distortion trade-off, demonstrating the effectiveness of the FLaWin Transformer block in improving compression performance.
    Abstract NASA's Solar Dynamics Observatory (SDO) mission collects large data volumes of the Sun's daily activity. Data compression is crucial for space missions to reduce data storage and video bandwidth requirements by eliminating redundancies in the data. In this paper, we present a novel neural Transformer-based video compression approach specifically designed for the SDO images. Our primary objective is to efficiently exploit the temporal and spatial redundancies inherent in solar images to obtain a high compression ratio. Our proposed architecture benefits from a novel Transformer block called Fused Local-aware Window (FLaWin), which incorporates window-based self-attention modules and an efficient fused local-aware feed-forward (FLaFF) network. This architectural design allows us to simultaneously capture short-range and long-range information while facilitating the extraction of rich and diverse contextual representations. Moreover, this design choice results in reduced computational complexity. Experimental results demonstrate the significant contribution of the FLaWin Transformer block to the compression performance, outperforming conventional hand-engineered video codecs such as H.264 and H.265 in terms of rate-distortion trade-off.

MAGIC-TBR: Multiview Attention Fusion for Transformer-based Bodily Behavior Recognition in Group Settings

  • paper_url: http://arxiv.org/abs/2309.10765
  • repo_url: https://github.com/surbhimadan92/magic-tbr
  • paper_authors: Surbhi Madan, Rishabh Jain, Gulshan Sharma, Ramanathan Subramanian, Abhinav Dhall
  • for: To improve machine understanding of social cues by automatically analysing bodily behavioural language.
  • methods: MAGIC-TBR, a multiview attention fusion method that combines features extracted from videos and their corresponding Discrete Cosine Transform coefficients via a transformer-based approach.
  • results: Experiments on the BBSI dataset demonstrate the effectiveness of the proposed feature fusion with multiview attention for fine-grained behaviours such as gesturing, grooming, and fumbling.
    Abstract Bodily behavioral language is an important social cue, and its automated analysis helps in enhancing the understanding of artificial intelligence systems. Furthermore, behavioral language cues are essential for active engagement in social agent-based user interactions. Despite the progress made in computer vision for tasks like head and body pose estimation, there is still a need to explore the detection of finer behaviors such as gesturing, grooming, or fumbling. This paper proposes a multiview attention fusion method named MAGIC-TBR that combines features extracted from videos and their corresponding Discrete Cosine Transform coefficients via a transformer-based approach. The experiments are conducted on the BBSI dataset and the results demonstrate the effectiveness of the proposed feature fusion with multiview attention. The code is available at: https://github.com/surbhimadan92/MAGIC-TBR
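    For reference, extracting per-frame Discrete Cosine Transform coefficients, the frequency-domain features the method fuses with video features, can be sketched as follows; the 8x8 low-frequency crop and grayscale input are illustrative choices, not the paper's exact preprocessing.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(frames: np.ndarray, keep: int = 8) -> np.ndarray:
    """frames: (T, H, W) grayscale video clip.
    Returns the top-left `keep` x `keep` low-frequency DCT coefficients per frame."""
    feats = []
    for frame in frames:
        coeffs = dctn(frame, type=2, norm="ortho")   # 2D DCT-II of the frame
        feats.append(coeffs[:keep, :keep].ravel())   # keep the low-frequency block
    return np.stack(feats)                           # (T, keep * keep)

clip = np.random.rand(16, 64, 64)                    # toy 16-frame clip
print(dct_features(clip).shape)                      # (16, 64)
```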

Few-Shot Panoptic Segmentation With Foundation Models

  • paper_url: http://arxiv.org/abs/2309.10726
  • repo_url: https://github.com/robot-learning-freiburg/SPINO
  • paper_authors: Markus Käppeler, Kürsat Petek, Niclas Vödisch, Wolfram Burgard, Abhinav Valada
  • for: To make panoptic segmentation feasible with almost no annotated training data by exploiting unlabeled images.
  • methods: Task-agnostic image features from a DINOv2 backbone are combined with lightweight network heads for semantic segmentation and boundary estimation.
  • results: Trained with only ten annotated images, the method predicts high-quality pseudo-labels usable with any panoptic segmentation method, achieving results competitive with fully supervised baselines while using less than 0.3% of the ground-truth labels.
    Abstract Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at http://spino.cs.uni-freiburg.de.
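    A minimal sketch of the architectural pattern described: a frozen, task-agnostic backbone (DINOv2 in the paper, a random convolutional stand-in here) feeds two lightweight heads for semantic logits and boundary estimation, and only the heads would be trained on the handful of annotated images. Dimensions and layer choices below are assumptions.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Placeholder for a frozen task-agnostic ViT backbone (DINOv2 in the paper)."""
    def __init__(self, out_dim: int = 384):
        super().__init__()
        self.conv = nn.Conv2d(3, out_dim, kernel_size=16, stride=16)  # patch-like features
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.conv(x)                      # (N, out_dim, H/16, W/16)

class LightweightHead(nn.Module):
    """Small convolutional head trained on top of the frozen features."""
    def __init__(self, in_dim: int, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_dim, 128, 1), nn.ReLU(),
                                 nn.Conv2d(128, out_channels, 1))

    def forward(self, feats):
        return self.net(feats)

backbone = FrozenBackbone()
semantic_head = LightweightHead(384, out_channels=19)   # e.g. 19 semantic classes
boundary_head = LightweightHead(384, out_channels=1)    # boundary probability map

img = torch.rand(2, 3, 224, 224)
feats = backbone(img)
print(semantic_head(feats).shape, boundary_head(feats).shape)
# torch.Size([2, 19, 14, 14]) torch.Size([2, 1, 14, 14])
```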

Reconstruct-and-Generate Diffusion Model for Detail-Preserving Image Denoising

  • paper_url: http://arxiv.org/abs/2309.10714
  • repo_url: None
  • paper_authors: Yujin Wang, Lingen Li, Tianfan Xue, Jinwei Gu
  • for: To denoise images while preserving high-frequency detail and visual quality.
  • methods: The Reconstruct-and-Generate Diffusion Model (RnG) couples a reconstructive denoising network, which recovers most of the clean signal for fidelity, with a diffusion algorithm that generates residual high-frequency details; an adaptive step controller regulates the number of inverse diffusion steps.
  • results: Extensive experiments on synthetic and real-world noisy datasets demonstrate the superiority of the proposed method.
    Abstract Image denoising is a fundamental and challenging task in the field of computer vision. Most supervised denoising methods learn to reconstruct clean images from noisy inputs, which have intrinsic spectral bias and tend to produce over-smoothed and blurry images. Recently, researchers have explored diffusion models to generate high-frequency details in image restoration tasks, but these models do not guarantee that the generated texture aligns with real images, leading to undesirable artifacts. To address the trade-off between visual appeal and fidelity of high-frequency details in denoising tasks, we propose a novel approach called the Reconstruct-and-Generate Diffusion Model (RnG). Our method leverages a reconstructive denoising network to recover the majority of the underlying clean signal, which serves as the initial estimation for subsequent steps to maintain fidelity. Additionally, it employs a diffusion algorithm to generate residual high-frequency details, thereby enhancing visual quality. We further introduce a two-stage training scheme to ensure effective collaboration between the reconstructive and generative modules of RnG. To reduce undesirable texture introduced by the diffusion model, we also propose an adaptive step controller that regulates the number of inverse steps applied by the diffusion model, allowing control over the level of high-frequency details added to each patch as well as saving the inference computational cost. Through our proposed RnG, we achieve a better balance between perception and distortion. We conducted extensive experiments on both synthetic and real denoising datasets, validating the superiority of the proposed approach.

Interpret Vision Transformers as ConvNets with Dynamic Convolutions

  • paper_url: http://arxiv.org/abs/2309.10713
  • repo_url: None
  • paper_authors: Chong Zhou, Chen Change Loy, Bo Dai
  • for: To compare Vision Transformers and ConvNets by interpreting Vision Transformers as ConvNets with dynamic convolutions, so that the design choices of both architectures can be examined side by side in a unified framework.
  • methods: Two case studies under this interpretation: examining the role of softmax as the activation in Vision Transformers and replacing it with common ConvNet modules such as ReLU and Layer Normalization, and designing a depth-wise Vision Transformer following depth-wise convolution.
  • results: The unified interpretation aids understanding and design of both architectures; the softmax replacement yields faster convergence and better performance, and the depth-wise Vision Transformer is more efficient with comparable performance.
    Abstract There has been a debate about the superiority between vision Transformers and ConvNets, serving as the backbone of computer vision models. Although they are usually considered as two completely different architectures, in this paper, we interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. In addition, our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate such potential through two specific studies. First, we inspect the role of softmax in vision Transformers as the activation function and find it can be replaced by commonly used ConvNets modules, such as ReLU and Layer Normalization, which results in a faster convergence rate and better performance. Second, following the design of depth-wise convolution, we create a corresponding depth-wise vision Transformer that is more efficient with comparable performance. The potential of the proposed unified interpretation is not limited to the given examples and we hope it can inspire the community and give rise to more advanced network architectures.
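    The first study can be illustrated with a toy single-head attention in which the token-mixing weights act as a dynamic, input-dependent kernel, and the softmax normalization is swapped for a ReLU-plus-normalization alternative; this is only a sketch of the idea, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, use_softmax: bool = True, eps: float = 1e-6):
    """Single-head attention where the mixing weights act like a dynamic,
    input-dependent convolution kernel over the tokens."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (N, L, L)
    if use_softmax:
        weights = scores.softmax(dim=-1)
    else:
        # One softmax-free alternative: ReLU, then normalize rows to sum to 1.
        pos = F.relu(scores)
        weights = pos / (pos.sum(dim=-1, keepdim=True) + eps)
    return weights @ v                                        # dynamically mixed values

q = k = v = torch.rand(1, 8, 16)          # 8 tokens, dim 16
print(attention(q, k, v, True).shape, attention(q, k, v, False).shape)
```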

Latent Space Energy-based Model for Fine-grained Open Set Recognition

  • paper_url: http://arxiv.org/abs/2309.10711
  • repo_url: None
  • paper_authors: Wentao Bao, Qi Yu, Yu Kong
  • for: Fine-grained open-set recognition: recognizing images of classes with subtle appearance differences while rejecting images of unknown classes.
  • methods: An energy-based model (EBM) prior in a low-dimensional latent space supports hybrid generative/discriminative modelling, avoiding density estimation in high-dimensional space; an attribute-aware information bottleneck (AIB), a residual attribute feature aggregation (RAFA) module, and uncertainty-based virtual outlier synthesis (UVOS) are added.
  • results: The approach improves the expressivity, granularity, and density of samples in fine-grained classes, and can leverage recent vision transformers for strong visual classification and generation.
    Abstract Fine-grained open-set recognition (FineOSR) aims to recognize images belonging to classes with subtle appearance differences while rejecting images of unknown classes. A recent trend in OSR shows the benefit of generative models to discriminative unknown detection. As a type of generative model, energy-based models (EBM) are the potential for hybrid modeling of generative and discriminative tasks. However, most existing EBMs suffer from density estimation in high-dimensional space, which is critical to recognizing images from fine-grained classes. In this paper, we explore the low-dimensional latent space with energy-based prior distribution for OSR in a fine-grained visual world. Specifically, based on the latent space EBM, we propose an attribute-aware information bottleneck (AIB), a residual attribute feature aggregation (RAFA) module, and an uncertainty-based virtual outlier synthesis (UVOS) module to improve the expressivity, granularity, and density of the samples in fine-grained classes, respectively. Our method is flexible to take advantage of recent vision transformers for powerful visual classification and generation. The method is validated on both fine-grained and general visual classification datasets while preserving the capability of generating photo-realistic fake images with high resolution.

ReShader: View-Dependent Highlights for Single Image View-Synthesis

  • paper_url: http://arxiv.org/abs/2309.10689
  • repo_url: https://github.com/avinashpaliwal/ReShader
  • paper_authors: Avinash Paliwal, Brandon Nguyen, Andrii Tsarov, Nima Khademi Kalantari
  • for: To improve the reliability and accuracy of single-image novel view synthesis, in particular its handling of view-dependent highlights.
  • methods: The view synthesis process is split into two independent tasks: pixel reshading (adjusting shading for the novel camera) and pixel relocation, with a neural network trained on synthetic input-reshaded pairs.
  • results: Generates novel view images with realistic moving highlights, demonstrated on a variety of real-world scenes.
    Abstract In recent years, novel view synthesis from a single image has seen significant progress thanks to the rapid advancements in 3D scene representation and image inpainting techniques. While the current approaches are able to synthesize geometrically consistent novel views, they often do not handle the view-dependent effects properly. Specifically, the highlights in their synthesized images usually appear to be glued to the surfaces, making the novel views unrealistic. To address this major problem, we make a key observation that the process of synthesizing novel views requires changing the shading of the pixels based on the novel camera, and moving them to appropriate locations. Therefore, we propose to split the view synthesis process into two independent tasks of pixel reshading and relocation. During the reshading process, we take the single image as the input and adjust its shading based on the novel camera. This reshaded image is then used as the input to an existing view synthesis method to relocate the pixels and produce the final novel view image. We propose to use a neural network to perform reshading and generate a large set of synthetic input-reshaded pairs to train our network. We demonstrate that our approach produces plausible novel view images with realistic moving highlights on a variety of real world scenes.

CMRxRecon: An open cardiac MRI dataset for the competition of accelerated image reconstruction

  • paper_url: http://arxiv.org/abs/2309.10836
  • repo_url: None
  • paper_authors: Chengyan Wang, Jun Lyu, Shuo Wang, Chen Qin, Kunyuan Guo, Xinyu Zhang, Xiaotong Yu, Yan Li, Fanwen Wang, Jianhua Jin, Zhang Shi, Ziqiang Xu, Yapeng Tian, Sha Hua, Zhensen Chen, Meng Liu, Mengting Sun, Xutong Kuang, Kang Wang, Haoran Wang, Hao Li, Yinghua Chu, Guang Yang, Wenjia Bai, Xiahai Zhuang, He Wang, Jing Qin, Xiaobo Qu
  • for: This paper aims to facilitate the advancement of state-of-the-art cardiac magnetic resonance imaging (CMR) image reconstruction.
  • methods: The paper provides a large dataset of multi-contrast, multi-view, multi-slice, and multi-coil CMR imaging data from 300 subjects, which can be used to train and evaluate deep learning-based image reconstruction algorithms.
  • results: The dataset includes manual segmentations of the myocardium and chambers of all the subjects, and scripts of state-of-the-art reconstruction algorithms are provided as a point of reference. The dataset is freely accessible to the research community and can be accessed at https://www.synapse.org/#!Synapse:syn51471091/wiki/.
    Abstract Cardiac magnetic resonance imaging (CMR) has emerged as a valuable diagnostic tool for cardiac diseases. However, a limitation of CMR is its slow imaging speed, which causes patient discomfort and introduces artifacts in the images. There has been growing interest in deep learning-based CMR imaging algorithms that can reconstruct high-quality images from highly under-sampled k-space data. However, the development of deep learning methods requires large training datasets, which have not been publicly available for CMR. To address this gap, we released a dataset that includes multi-contrast, multi-view, multi-slice and multi-coil CMR imaging data from 300 subjects. Imaging studies include cardiac cine and mapping sequences. Manual segmentations of the myocardium and chambers of all the subjects are also provided within the dataset. Scripts of state-of-the-art reconstruction algorithms were also provided as a point of reference. Our aim is to facilitate the advancement of state-of-the-art CMR image reconstruction by introducing standardized evaluation criteria and making the dataset freely accessible to the research community. Researchers can access the dataset at https://www.synapse.org/#!Synapse:syn51471091/wiki/.
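    A small sketch of the reconstruction setting the challenge targets: retrospective Cartesian undersampling of k-space followed by a zero-filled baseline reconstruction. The acceleration factor and mask pattern below are illustrative, not the challenge's official sampling scheme.

```python
import numpy as np

def undersample_and_zero_fill(image: np.ndarray, acceleration: int = 4,
                              center_lines: int = 8):
    """Simulate Cartesian undersampling along phase-encode lines (rows)."""
    kspace = np.fft.fftshift(np.fft.fft2(image))
    mask = np.zeros(image.shape[0], dtype=bool)
    mask[::acceleration] = True                                 # regular undersampling
    c = image.shape[0] // 2
    mask[c - center_lines // 2: c + center_lines // 2] = True   # fully sampled center
    kspace_us = kspace * mask[:, None]
    zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace_us)))
    return kspace_us, zero_filled

phantom = np.zeros((128, 128))
phantom[40:90, 50:80] = 1.0                                     # toy "anatomy"
_, recon = undersample_and_zero_fill(phantom)
print("aliasing error (L2):", np.linalg.norm(recon - phantom))
```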

Locally Stylized Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.10684
  • repo_url: None
  • paper_authors: Hong-Wing Pang, Binh-Son Hua, Sai-Kit Yeung
  • for: To apply a reference style to 3D scenes, in particular scenes represented as neural radiance fields (NeRF).
  • methods: Stylization via local style transfer: a hash-grid encoding learns embeddings for the appearance and geometry components, stylization optimizes the appearance branch while keeping geometry fixed, and a new loss uses a segmentation network with bipartite matching to establish region correspondences between the style image and rendered content.
  • results: Plausible stylization results with novel view synthesis and flexible controllability, achieved by manipulating and customizing the region correspondences.
    Abstract In recent years, there has been increasing interest in applying stylization on 3D scenes from a reference style image, in particular onto neural radiance fields (NeRF). While performing stylization directly on NeRF guarantees appearance consistency over arbitrary novel views, it is a challenging problem to guide the transfer of patterns from the style image onto different parts of the NeRF scene. In this work, we propose a stylization framework for NeRF based on local style transfer. In particular, we use a hash-grid encoding to learn the embedding of the appearance and geometry components, and show that the mapping defined by the hash table allows us to control the stylization to a certain extent. Stylization is then achieved by optimizing the appearance branch while keeping the geometry branch fixed. To support local style transfer, we propose a new loss function that utilizes a segmentation network and bipartite matching to establish region correspondences between the style image and the content images obtained from volume rendering. Our experiments show that our method yields plausible stylization results with novel view synthesis while having flexible controllability via manipulating and customizing the region correspondences.

Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping

  • paper_url: http://arxiv.org/abs/2309.10667
  • repo_url: None
  • paper_authors: Subash Khanal, Srikumar Sastry, Aayush Dhakal, Nathan Jacobs
  • for: To predict the most probable sounds that could be perceived at a particular geographic location.
  • methods: Contrastive pre-training with recent state-of-the-art encoders maps geotagged audio, its textual description, and an overhead image of the capture location into a shared tri-modal embedding space, enabling soundscape maps for any geographic region to be built from textual or audio queries.
  • results: On the SoundingEarth dataset, the approach significantly outperforms the existing state of the art, improving image-to-audio Recall@100 from 0.256 to 0.450.
    Abstract We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location. We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location using contrastive pre-training. The end result is a shared embedding space for the three modalities, which enables the construction of soundscape maps for any geographic region from textual or audio queries. Using the SoundingEarth dataset, we find that our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450. Our code is available at https://github.com/mvrl/geoclap.
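    A minimal sketch of the contrastive pre-training objective underlying the shared embedding space, shown here for one pair of modalities (overhead image and audio) with a standard symmetric InfoNCE loss; the actual GeoCLAP training couples all three modalities and uses its own encoders and temperature.

```python
import torch
import torch.nn.functional as F

def info_nce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric contrastive loss: matched (a_i, b_i) pairs are pulled together,
    all other pairings in the batch are pushed apart."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(a.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

image_emb = torch.randn(16, 256)    # overhead-image embeddings (toy)
audio_emb = torch.randn(16, 256)    # geotagged-audio embeddings (toy)
print(float(info_nce(image_emb, audio_emb)))
```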

Analysing race and sex bias in brain age prediction

  • paper_url: http://arxiv.org/abs/2309.10835
  • repo_url: None
  • paper_authors: Carolina Piçarra, Ben Glocker
  • for: To analyse whether MRI-based brain age prediction models are biased with respect to demographics.
  • methods: A ResNet-34 model is examined through a comprehensive subgroup performance analysis (Kruskal-Wallis test followed by post-hoc Conover-Iman tests) across race and biological-sex subgroups, and through feature inspection (PCA followed by two-sample Kolmogorov-Smirnov tests).
  • results: Predictive performance differs statistically significantly between several subgroups, and 7 of 12 pairwise comparisons show statistically significant differences in feature distributions, suggesting that brain age prediction models may be biased.
    Abstract Brain age prediction from MRI has become a popular imaging biomarker associated with a wide range of neuropathologies. The datasets used for training, however, are often skewed and imbalanced regarding demographics, potentially making brain age prediction models susceptible to bias. We analyse the commonly used ResNet-34 model by conducting a comprehensive subgroup performance analysis and feature inspection. The model is trained on 1,215 T1-weighted MRI scans from Cam-CAN and IXI, and tested on UK Biobank (n=42,786), split into six racial and biological sex subgroups. With the objective of comparing the performance between subgroups, measured by the absolute prediction error, we use a Kruskal-Wallis test followed by two post-hoc Conover-Iman tests to inspect bias across race and biological sex. To examine biases in the generated features, we use PCA for dimensionality reduction and employ two-sample Kolmogorov-Smirnov tests to identify distribution shifts among subgroups. Our results reveal statistically significant differences in predictive performance between Black and White, Black and Asian, and male and female subjects. Seven out of twelve pairwise comparisons show statistically significant differences in the feature distributions. Our findings call for further analysis of brain age prediction models.
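    The subgroup analysis can be reproduced in outline with standard SciPy tests: a Kruskal-Wallis test on absolute prediction errors across subgroups and a two-sample Kolmogorov-Smirnov test on a reduced feature between two subgroups. The data below are synthetic, and the post-hoc Conover-Iman tests are omitted from this sketch.

```python
import numpy as np
from scipy.stats import kruskal, ks_2samp

rng = np.random.default_rng(42)
# Synthetic absolute brain-age prediction errors for three demographic subgroups.
errors = {
    "group_A": rng.gamma(shape=2.0, scale=2.0, size=500),
    "group_B": rng.gamma(shape=2.0, scale=2.3, size=500),
    "group_C": rng.gamma(shape=2.0, scale=2.0, size=500),
}

h_stat, p_value = kruskal(*errors.values())
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_value:.4f}")

# Distribution shift of one PCA feature between two subgroups.
feat_A = rng.normal(0.0, 1.0, size=500)
feat_B = rng.normal(0.3, 1.0, size=500)
ks_stat, ks_p = ks_2samp(feat_A, feat_B)
print(f"Two-sample KS: D={ks_stat:.3f}, p={ks_p:.4f}")
```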

Multi-Stain Self-Attention Graph Multiple Instance Learning Pipeline for Histopathology Whole Slide Images

  • paper_url: http://arxiv.org/abs/2309.10650
  • repo_url: https://github.com/amayags/mustang
  • paper_authors: Amaya Gallagher-Syed, Luca Rossi, Felice Rivellese, Costantino Pitzalis, Myles Lewis, Michael Barnes, Gregory Slabaugh
  • for: To address weakly supervised computer vision on gigapixel Whole Slide Images (WSIs) for patient diagnosis and stratification.
  • methods: MUSTANG, a self-attention-based multiple instance learning pipeline for classifying sets of WSIs with only patient-level labels; self-attention is restricted to a highly sparse, quickly computed k-Nearest Neighbour graph of embedded WSI patches based on Euclidean distance.
  • results: A state-of-the-art F1-score/AUC of 0.89/0.92, outperforming the widely used CLAM model; the pipeline is modular, requires only patient-level labels without annotations, and accepts WSI sets whose graphs vary in size and structure.
    Abstract Whole Slide Images (WSIs) present a challenging computer vision task due to their gigapixel size and presence of numerous artefacts. Yet they are a valuable resource for patient diagnosis and stratification, often representing the gold standard for diagnostic tasks. Real-world clinical datasets tend to come as sets of heterogeneous WSIs with labels present at the patient-level, with poor to no annotations. Weakly supervised attention-based multiple instance learning approaches have been developed in recent years to address these challenges, but can fail to resolve both long and short-range dependencies. Here we propose an end-to-end multi-stain self-attention graph (MUSTANG) multiple instance learning pipeline, which is designed to solve a weakly-supervised gigapixel multi-image classification task, where the label is assigned at the patient-level, but no slide-level labels or region annotations are available. The pipeline uses a self-attention based approach by restricting the operations to a highly sparse k-Nearest Neighbour Graph of embedded WSI patches based on the Euclidean distance. We show this approach achieves a state-of-the-art F1-score/AUC of 0.89/0.92, outperforming the widely used CLAM model. Our approach is highly modular and can easily be modified to suit different clinical datasets, as it only requires a patient-level label without annotations and accepts WSI sets of different sizes, as the graphs can be of varying sizes and structures. The source code can be found at https://github.com/AmayaGS/MUSTANG.
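    A minimal sketch of the graph-construction step: each embedded WSI patch is connected to its k nearest neighbours under Euclidean distance, giving the highly sparse adjacency that the pipeline restricts self-attention to. The embedding dimension, k, and the CSR output format below are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(patch_embeddings: np.ndarray, k: int = 5):
    """Return a sparse adjacency matrix connecting each patch embedding to its
    k nearest neighbours under Euclidean distance (including itself)."""
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(patch_embeddings)
    adjacency = nn.kneighbors_graph(patch_embeddings, mode="connectivity")
    return adjacency                        # scipy.sparse CSR matrix, shape (N, N)

# Toy patient: 200 patch embeddings of dimension 512 pooled from several WSIs.
embeddings = np.random.rand(200, 512).astype(np.float32)
adj = build_knn_graph(embeddings, k=5)
print(adj.shape, adj.nnz)                   # (200, 200) with 200 * 5 edges
```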

Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D Segmentation

  • paper_url: http://arxiv.org/abs/2309.10649
  • repo_url: None
  • paper_authors: Jingyu Zhang, Huitong Yang, Daijie Wu, Xuesong Li, Xinge Zhu, Yuexin Ma
  • for: To improve 3D point cloud semantic segmentation without requiring large amounts of annotated 3D data.
  • methods: Knowledge is transferred from 2D images to 3D point clouds by fully exploring the relationship between the two modalities and designing effective feature alignment strategies for cross-modal and cross-domain adaptation.
  • results: Without any 3D labels, state-of-the-art performance on SemanticKITTI is achieved using knowledge from KITTI360 and GTA5, outperforming existing unsupervised and weakly supervised baselines.
    Abstract Current state-of-the-art point cloud-based perception methods usually rely on large-scale labeled data, which requires expensive manual annotations. A natural option is to explore the unsupervised methodology for 3D perception tasks. However, such methods often face substantial performance-drop difficulties. Fortunately, we found that there exist amounts of image-based datasets and an alternative can be proposed, i.e., transferring the knowledge in the 2D images to 3D point clouds. Specifically, we propose a novel approach for the challenging cross-modal and cross-domain adaptation task by fully exploring the relationship between images and point clouds and designing effective feature alignment strategies. Without any 3D labels, our method achieves state-of-the-art performance for 3D point cloud semantic segmentation on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to existing unsupervised and weakly-supervised baselines.

Self-Supervised Super-Resolution Approach for Isotropic Reconstruction of 3D Electron Microscopy Images from Anisotropic Acquisition

  • paper_url: http://arxiv.org/abs/2309.10646
  • repo_url: None
  • paper_authors: Mohammad Khateri, Morteza Ghahremani, Alejandra Sierra, Jussi Tohka
  • for: To reconstruct isotropic 3D electron microscopy (3DEM) volumes from anisotropic acquisitions.
  • methods: A self-supervised, deep-learning super-resolution approach built on a U-shaped architecture with vision transformer (ViT) blocks that learns local and global multi-scale image dependencies; anisotropic/isotropic training pairs are generated from the given anisotropic data, so no co-registered isotropic ground truth is needed.
  • results: Isotropic 3DEM is successfully reconstructed from anisotropic acquisitions, demonstrated on three brain 3DEM datasets.
    Abstract Three-dimensional electron microscopy (3DEM) is an essential technique to investigate volumetric tissue ultra-structure. Due to technical limitations and high imaging costs, samples are often imaged anisotropically, where resolution in the axial direction ($z$) is lower than in the lateral directions $(x,y)$. This anisotropy 3DEM can hamper subsequent analysis and visualization tasks. To overcome this limitation, we propose a novel deep-learning (DL)-based self-supervised super-resolution approach that computationally reconstructs isotropic 3DEM from the anisotropic acquisition. The proposed DL-based framework is built upon the U-shape architecture incorporating vision-transformer (ViT) blocks, enabling high-capability learning of local and global multi-scale image dependencies. To train the tailored network, we employ a self-supervised approach. Specifically, we generate pairs of anisotropic and isotropic training datasets from the given anisotropic 3DEM data. By feeding the given anisotropic 3DEM dataset in the trained network through our proposed framework, the isotropic 3DEM is obtained. Importantly, this isotropic reconstruction approach relies solely on the given anisotropic 3DEM dataset and does not require pairs of co-registered anisotropic and isotropic 3DEM training datasets. To evaluate the effectiveness of the proposed method, we conducted experiments using three 3DEM datasets acquired from brain. The experimental results demonstrated that our proposed framework could successfully reconstruct isotropic 3DEM from the anisotropic acquisition.

KFC: Kinship Verification with Fair Contrastive Loss and Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2309.10641
  • repo_url: https://github.com/garynlfd/kfc
  • paper_authors: Jia Luo Peng, Keng Wei Chang, Shang-Hong Lai
  • for: Kinship verification in computer vision, with the goal of achieving better performance while mitigating racial bias.
  • methods: A multi-task learning model structure with an attention module, and a fairness-aware contrastive loss function with adversarial learning (a debias term plus gradient reversal on the race classification task), trained on the combined, race-labelled KinRace dataset.
  • results: The proposed method, KFC, achieves state-of-the-art performance and mitigates racial bias, as demonstrated through extensive experimental evaluation of both accuracy and standard deviation across groups.
    Abstract Kinship verification is an emerging task in computer vision with multiple potential applications. However, there's no large enough kinship dataset to train a representative and robust model, which is a limitation for achieving better performance. Moreover, face verification is known to exhibit bias, which has not been dealt with by previous kinship verification works and sometimes even results in serious issues. So we first combine existing kinship datasets and label each identity with the correct race in order to take race information into consideration and provide a larger and complete dataset, called KinRace dataset. Secondly, we propose a multi-task learning model structure with attention module to enhance accuracy, which surpasses state-of-the-art performance. Lastly, our fairness-aware contrastive loss function with adversarial learning greatly mitigates racial bias. We introduce a debias term into traditional contrastive loss and implement gradient reverse in race classification task, which is an innovative idea to mix two fairness methods to alleviate bias. Exhaustive experimental evaluation demonstrates the effectiveness and superior performance of the proposed KFC in both standard deviation and accuracy at the same time.
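    The adversarial component can be illustrated with the standard gradient-reversal trick: features pass through unchanged in the forward pass while gradients are negated on the way back, so a shared encoder is pushed toward race-invariant features. This is a generic sketch of the mechanism, not the paper's full fairness-aware contrastive loss or multi-task heads.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda in backward."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Toy check: the gradient reaching the encoder output is flipped in sign.
feat = torch.randn(4, 8, requires_grad=True)
out = grad_reverse(feat, lambd=1.0).sum()
out.backward()
print(feat.grad[0, :3])        # all -1 instead of +1 without the reversal
```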

Sparser Random Networks Exist: Enforcing Communication-Efficient Federated Learning via Regularization

  • paper_url: http://arxiv.org/abs/2309.10834
  • repo_url: None
  • paper_authors: Mohamad Mestoukirdi, Omid Esrafilian, David Gesbert, Qianrui Li, Nicolas Gresset
  • for: To improve communication efficiency in stochastic federated learning that trains over-parameterized random networks.
  • methods: A binary mask is optimized instead of the model weights, which stay fixed; the mask characterizes a sparse sub-network that generalizes as well as a smaller target network, and a regularization term added to the local objectives encourages sparser solutions by eliminating redundant features across sub-networks. Only binary masks are exchanged rather than floating-point weights, reducing communication to at most 1 bit per parameter.
  • results: Up to five orders of magnitude improvement in communication and memory efficiency compared to the literature, with minimal degradation in validation accuracy in some instances.
    Abstract This work presents a new method for enhancing communication efficiency in stochastic Federated Learning that trains over-parameterized random networks. In this setting, a binary mask is optimized instead of the model weights, which are kept fixed. The mask characterizes a sparse sub-network that is able to generalize as good as a smaller target network. Importantly, sparse binary masks are exchanged rather than the floating point weights in traditional federated learning, reducing communication cost to at most 1 bit per parameter. We show that previous state of the art stochastic methods fail to find the sparse networks that can reduce the communication and storage overhead using consistent loss objectives. To address this, we propose adding a regularization term to local objectives that encourages sparser solutions by eliminating redundant features across sub-networks. Extensive experiments demonstrate significant improvements in communication and memory efficiency of up to five magnitudes compared to the literature, with minimal performance degradation in validation accuracy in some instances.
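    The 1-bit-per-parameter communication claim is easy to picture: a client transmits only a packed binary mask over the fixed random weights, as in the sketch below. Mask optimization itself (e.g. via per-parameter scores and a straight-through estimator) is not shown, and the sizes are illustrative.

```python
import numpy as np

n_params = 1_000_000
rng = np.random.default_rng(0)

weights = rng.normal(size=n_params).astype(np.float32)     # fixed random weights
mask = rng.random(n_params) > 0.5                          # binary mask to transmit

packed = np.packbits(mask)                                 # 1 bit per parameter
print("float32 weights:", weights.nbytes, "bytes")         # 4,000,000 bytes
print("packed mask    :", packed.nbytes, "bytes")          # 125,000 bytes (32x smaller)

# Server side: unpack the mask and apply it to its copy of the random weights.
mask_rx = np.unpackbits(packed, count=n_params).astype(bool)
subnetwork = weights * mask_rx
assert np.array_equal(mask, mask_rx)
```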
    摘要 本工作提出了一种新方法,用于提高训练过参数化随机网络的随机联邦学习中的通信效率。在这一设定下,优化的不是模型权重(保持固定),而是一个二值掩码,该掩码刻画出一个能够与更小的目标网络泛化得同样好的稀疏子网络。重要的是,交换的是稀疏二值掩码而非传统联邦学习中的浮点权重,从而将通信成本降至每个参数至多1比特。我们表明,此前最先进的随机方法在使用一致的损失目标时,无法找到能够降低通信与存储开销的稀疏网络。为此,我们提出在本地目标函数中加入正则化项,通过消除子网络之间的冗余特征来鼓励更稀疏的解。大量实验表明,与现有文献相比,我们的方法可在通信与存储效率上带来最多约五个数量级的提升,且在部分情况下验证准确率仅有极小的下降。
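
The core mechanism above — freezing random weights, training only per-weight scores, thresholding them into a binary mask, and regularizing the local objective toward sparser masks — can be sketched in a few lines of PyTorch. The `MaskedLinear` layer, the straight-through threshold, and the L1 regularizer below are illustrative assumptions; the paper's exact mask parameterization and regularization term may differ, and the federated aggregation of the 1-bit masks is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer whose random weights stay frozen; only a per-weight score is trained,
    and the forward pass uses the binary mask obtained by thresholding the scores."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(0.05 * torch.randn(out_features, in_features),
                                   requires_grad=False)
        self.score = nn.Parameter(0.01 * torch.randn(out_features, in_features))

    def mask(self):
        hard = (self.score > 0).float()
        # straight-through estimator: hard mask forward, identity gradient backward
        return hard + self.score - self.score.detach()

    def forward(self, x):
        return F.linear(x, self.weight * self.mask())

def local_objective(model, x, y, sparsity_weight=1e-3):
    """Task loss plus an L1 penalty on the masks that encourages sparser sub-networks
    (a stand-in for the regularization term described in the abstract)."""
    task_loss = F.cross_entropy(model(x), y)
    reg = sum(m.mask().abs().mean()
              for m in model.modules() if isinstance(m, MaskedLinear))
    return task_loss + sparsity_weight * reg

# toy client update: only `score` parameters receive gradients; weights stay fixed
model = nn.Sequential(MaskedLinear(20, 64), nn.ReLU(), MaskedLinear(64, 5))
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
local_objective(model, x, y).backward()
# the client would then send the 1-bit masks, e.g. (model[0].score > 0), to the server
```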

Source-free Active Domain Adaptation for Diabetic Retinopathy Grading Based on Ultra-wide-field Fundus Image

  • paper_url: http://arxiv.org/abs/2309.10619
  • repo_url: None
  • paper_authors: Jinye Ran, Guanghua Zhang, Ximei Zhang, Juan Xie, Fan Xia, Hao Zhang
  • for: 本研究旨在提高无标注超广角(UWF)眼底图像上糖尿病视网膜病变(DR)分级的性能。
  • methods: 本研究采用源自由主动域适应(SFADA)技术:生成蕴含DR关系连续演化的彩色眼底图像特征,通过局部表示匹配主动挑选少量有价值的UWF眼底图像进行标注,并借助DR病灶原型将模型适应到UWF眼底图像上。
  • results: 实验结果显示,所提出的SFADA达到了最先进的DR分级性能,相比基线准确率提升20.9%、二次加权Kappa提升18.63%,分别达到85.36%与92.38%。
    Abstract Domain adaptation (DA) has been widely applied in the diabetic retinopathy (DR) grading of unannotated ultra-wide-field (UWF) fundus images, which can transfer annotated knowledge from labeled color fundus images. However, suffering from huge domain gaps and complex real-world scenarios, the DR grading performance of most mainstream DA is far from that of clinical diagnosis. To tackle this, we propose a novel source-free active domain adaptation (SFADA) in this paper. Specifically, we focus on DR grading problem itself and propose to generate features of color fundus images with continuously evolving relationships of DRs, actively select a few valuable UWF fundus images for labeling with local representation matching, and adapt model on UWF fundus images with DR lesion prototypes. Notably, the SFADA also takes data privacy and computational efficiency into consideration. Extensive experimental results demonstrate that our proposed SFADA achieves state-of-the-art DR grading performance, increasing accuracy by 20.9% and quadratic weighted kappa by 18.63% compared with baseline and reaching 85.36% and 92.38% respectively. These investigations show that the potential of our approach for real clinical practice is promising.
    摘要 域适应(DA)已被广泛应用于无标注超广角(UWF)眼底图像的糖尿病视网膜病变(DR)分级中,可以迁移有标注彩色眼底图像中的知识。然而,由于巨大的域差距和复杂的真实场景,大多数主流DA方法的DR分级性能远不及临床诊断水平。为解决这一问题,本文提出了一种新的源自由主动域适应方法(SFADA)。具体而言,我们聚焦于DR分级问题本身,提出生成蕴含DR关系连续演化的彩色眼底图像特征,通过局部表示匹配主动挑选少量有价值的UWF眼底图像进行标注,并借助DR病灶原型在UWF眼底图像上适应模型。值得注意的是,SFADA还兼顾了数据隐私与计算效率。大量实验结果表明,所提出的SFADA达到了最先进的DR分级性能,相比基线准确率提升20.9%、二次加权Kappa提升18.63%,分别达到85.36%与92.38%。这些结果表明我们的方法在真实临床实践中具有良好的应用潜力。

Intelligent Debris Mass Estimation Model for Autonomous Underwater Vehicle

  • paper_url: http://arxiv.org/abs/2309.10617
  • repo_url: None
  • paper_authors: Mohana Sri S, Swethaa S, Aouthithiye Barathwaj SR Y, Sai Ganesh CS
  • for: 这个论文的目的是提高自动下水车(AUV)在水下环境中导航和交互的能力,使用实例分割技术来分割图像中的对象。
  • methods: 这个论文使用的方法包括:YOLOV7在Roboflow中生成对象的 bounding box,将每个对象分割成不同的领域,并使用预处理技术来提高分割质量。
  • results: 这个论文的结果表明,使用实例分割技术可以准确地分割水下环境中的对象,并且可以提高AUV的导航和交互能力。
    Abstract Marine debris poses a significant threat to the survival of marine wildlife, often leading to entanglement and starvation, ultimately resulting in death. Therefore, removing debris from the ocean is crucial to restore the natural balance and allow marine life to thrive. Instance segmentation is an advanced form of object detection that identifies objects and precisely locates and separates them, making it an essential tool for autonomous underwater vehicles (AUVs) to navigate and interact with their underwater environment effectively. AUVs use image segmentation to analyze images captured by their cameras to navigate underwater environments. In this paper, we use instance segmentation to calculate the area of individual objects within an image, we use YOLOV7 in Roboflow to generate a set of bounding boxes for each object in the image with a class label and a confidence score for every detection. A segmentation mask is then created for each object by applying a binary mask to the object's bounding box. The masks are generated by applying a binary threshold to the output of a convolutional neural network trained to segment objects from the background. Finally, refining the segmentation mask for each object is done by applying post-processing techniques such as morphological operations and contour detection, to improve the accuracy and quality of the mask. The process of estimating the area of instance segmentation involves calculating the area of each segmented instance separately and then summing up the areas of all instances to obtain the total area. The calculation is carried out using standard formulas based on the shape of the object, such as rectangles and circles. In cases where the object is complex, the Monte Carlo method is used to estimate the area. This method provides a higher degree of accuracy than traditional methods, especially when using a large number of samples.
    摘要 海洋垃圾对海洋野生动物的生存构成严重威胁,常导致其被缠绕和饥饿,最终死亡。因此,清除海洋中的垃圾对于恢复自然平衡、让海洋生物得以繁衍至关重要。实例分割是一种更高级的目标检测形式,它在识别物体的同时精确地定位并将其分离开来,因此是自主水下航行器(AUV)有效感知并与水下环境交互的重要工具。AUV利用图像分割来分析其相机拍摄的图像,从而在水下环境中导航。本文利用实例分割计算图像中单个物体的面积:我们在Roboflow中使用YOLOv7为图像中的每个物体生成带类别标签和置信度分数的边界框;随后对每个物体的边界框应用二值掩码生成分割掩码,掩码由一个训练用于从背景中分割物体的卷积神经网络的输出经二值阈值化得到;最后,通过形态学运算和轮廓检测等后处理技术对每个物体的分割掩码进行细化,以提高掩码的准确性和质量。实例分割面积的估计过程是分别计算每个分割实例的面积,再将所有实例的面积相加得到总面积。面积计算采用基于物体形状(如矩形和圆形)的标准公式;当物体形状复杂时,则采用蒙特卡洛方法估计面积,该方法比传统方法精度更高,尤其是在采样数量较大时。
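
The Monte Carlo area estimation mentioned at the end of the abstract can be illustrated with a short NumPy routine: sample points uniformly inside the instance's bounding box and scale the box area by the fraction of samples that land on the mask. This is a generic sketch of the technique, not the authors' implementation; `pixel_area` stands for the real-world area represented by one pixel and is an assumed parameter.

```python
import numpy as np

def monte_carlo_mask_area(mask, pixel_area=1.0, n_samples=100_000, seed=0):
    """Estimate the area of a binary instance mask by rejection sampling inside its bounding box."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return 0.0
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    sx = rng.integers(x0, x1, n_samples)          # uniform samples inside the box
    sy = rng.integers(y0, y1, n_samples)
    hit_ratio = mask[sy, sx].mean()               # fraction of samples on the object
    box_area = (x1 - x0) * (y1 - y0) * pixel_area
    return float(hit_ratio * box_area)

# sanity check on a filled circle of radius 30 (true area ~ pi * 30**2 ~ 2827 pixels)
yy, xx = np.mgrid[0:100, 0:100]
circle = (xx - 50) ** 2 + (yy - 50) ** 2 <= 30 ** 2
print(monte_carlo_mask_area(circle))
```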

NDDepth: Normal-Distance Assisted Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.10592
  • repo_url: None
  • paper_authors: Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, Zhengguo Li
  • for: 这个论文主要针对的是单目深度估计问题,它的广泛应用在计算机视觉领域。
  • methods: 该论文提出了一种基于物理学(几何学)的深度学习框架,假设3D场景由分割面组成。论文引入了一个新的normal-distance头,该头输出每个位置的像素级表面法向量和平面到起点的距离,以便从 depth 的估计。此外,normal和距离被通过开发的平面感知约束进行正则化。论文还增加了一个附加的深度头,以提高提案的稳定性。
  • results: 对于NYU-Depth-v2、KITTI和SUN RGB-D数据集,提案的方法超过了之前的状态态的竞争对手。特别是,在KITTI的深度预测在线竞赛上,提案的方法在提交时 ranked 1st 中所有提交。
    Abstract Monocular depth estimation has drawn widespread attention from the vision community due to its broad applications. In this paper, we propose a novel physics (geometry)-driven deep learning framework for monocular depth estimation by assuming that 3D scenes are constituted by piece-wise planes. Particularly, we introduce a new normal-distance head that outputs pixel-level surface normal and plane-to-origin distance for deriving depth at each position. Meanwhile, the normal and distance are regularized by a developed plane-aware consistency constraint. We further integrate an additional depth head to improve the robustness of the proposed framework. To fully exploit the strengths of these two heads, we develop an effective contrastive iterative refinement module that refines depth in a complementary manner according to the depth uncertainty. Extensive experiments indicate that the proposed method exceeds previous state-of-the-art competitors on the NYU-Depth-v2, KITTI and SUN RGB-D datasets. Notably, it ranks 1st among all submissions on the KITTI depth prediction online benchmark at the submission time.
    摘要 单目深度估计因其广泛的应用而受到视觉领域的广泛关注。本文提出了一种新的物理(几何)驱动的深度学习框架,假设3D场景由分段平面构成。我们引入了一个新的法向-距离头,在每个位置输出像素级的表面法向量与平面到原点的距离,用以推导深度;同时,法向量和距离由所设计的平面感知一致性约束进行正则化。我们还加入了一个额外的深度头,以提高框架的鲁棒性。为充分发挥这两个头的优势,我们开发了一个有效的对比迭代细化模块,根据深度不确定性以互补的方式细化深度。大量实验表明,所提方法在NYU-Depth-v2、KITTI和SUN RGB-D数据集上超越了此前的最先进方法。特别地,在提交时,它在KITTI深度预测在线基准的所有提交中排名第一。
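
Under the piecewise-planar assumption above, a pixel's depth can be recovered in closed form from its predicted surface normal n, its plane-to-origin distance d, and the camera intrinsics: with the back-projected ray r = K^{-1}[u, v, 1]^T, the 3D point z·r lies on the plane n^T P = d, so z = d / (n^T r). The NumPy sketch below implements just this conversion; it illustrates the geometry only and is not the paper's full normal-distance head.

```python
import numpy as np

def depth_from_normal_distance(normal, distance, K):
    """Per-pixel depth from the normal-distance parameterization.
    normal:   (H, W, 3) unit surface normals in camera coordinates
    distance: (H, W)    plane-to-origin distances
    K:        (3, 3)    camera intrinsics
    """
    H, W = distance.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                        # back-projected ray directions
    denom = np.sum(normal * rays, axis=-1)
    denom = np.where(np.abs(denom) < 1e-6, 1e-6, denom)    # guard against grazing angles
    return distance / denom

# toy example: a fronto-parallel plane 5 m away yields constant depth 5.0 everywhere
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
normal = np.zeros((480, 640, 3)); normal[..., 2] = 1.0
print(depth_from_normal_distance(normal, np.full((480, 640), 5.0), K)[0, 0])
```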

Few-shot Object Detection in Remote Sensing: Lifting the Curse of Incompletely Annotated Novel Objects

  • paper_url: http://arxiv.org/abs/2309.10588
  • repo_url: None
  • paper_authors: Fahong Zhang, Yilei Shi, Zhitong Xiong, Xiao Xiang Zhu
  • for: 本研究的目的是提出一种基于自适应学习的几何卷积网络(ST-FSOD)方法,用于实现几何卷积网络在几何图像处理中的几何检测。
  • methods: 本研究使用了两个分支的Region Proposal Networks(RPN),其中一个分支用于提取基本对象的提案,另一个分支用于提取 noval 对象的提案。此外,本研究还使用了学生-教师机制,将高度自信的未标注目标作为pseudo标签,并将其包含在RPN和Region of Interest(RoI)头中。
  • results: 实验结果表明, compared with现有的state-of-the-art方法,本研究的ST-FSOD方法在各种几何检测设置下表现出了大幅提升。
    Abstract Object detection is an essential and fundamental task in computer vision and satellite image processing. Existing deep learning methods have achieved impressive performance thanks to the availability of large-scale annotated datasets. Yet, in real-world applications the availability of labels is limited. In this context, few-shot object detection (FSOD) has emerged as a promising direction, which aims at enabling the model to detect novel objects with only few of them annotated. However, many existing FSOD algorithms overlook a critical issue: when an input image contains multiple novel objects and only a subset of them are annotated, the unlabeled objects will be considered as background during training. This can cause confusions and severely impact the model's ability to recall novel objects. To address this issue, we propose a self-training-based FSOD (ST-FSOD) approach, which incorporates the self-training mechanism into the few-shot fine-tuning process. ST-FSOD aims to enable the discovery of novel objects that are not annotated, and take them into account during training. On the one hand, we devise a two-branch region proposal networks (RPN) to separate the proposal extraction of base and novel objects, On another hand, we incorporate the student-teacher mechanism into RPN and the region of interest (RoI) head to include those highly confident yet unlabeled targets as pseudo labels. Experimental results demonstrate that our proposed method outperforms the state-of-the-art in various FSOD settings by a large margin. The codes will be publicly available at https://github.com/zhu-xlab/ST-FSOD.
    摘要 目标检测是计算机视觉与卫星图像处理中一项基础而重要的任务。得益于大规模标注数据集,现有深度学习方法已取得令人印象深刻的性能。然而,在实际应用中标注数据往往十分有限。在这种情况下,小样本目标检测(FSOD)成为一个有前景的方向,旨在让模型仅凭少量标注即可检测新类目标。然而,许多现有FSOD算法忽视了一个关键问题:当输入图像包含多个新类目标而其中只有一部分被标注时,未标注的目标在训练中会被当作背景,这会造成混淆并严重影响模型对新类目标的召回能力。为解决这一问题,我们提出了一种基于自训练的FSOD方法(ST-FSOD),将自训练机制融入小样本微调过程,使模型能够发现未标注的新类目标并在训练中加以利用。一方面,我们设计了双分支的区域建议网络(RPN),将基类与新类目标的候选框提取分离开来;另一方面,我们在RPN与感兴趣区域(RoI)头中引入学生-教师机制,把置信度高但未标注的目标作为伪标签纳入训练。实验结果表明,我们的方法在多种FSOD设置下都大幅超越了当前最先进水平。代码将在 https://github.com/zhu-xlab/ST-FSOD 公开。

Adversarial Attacks Against Uncertainty Quantification

  • paper_url: http://arxiv.org/abs/2309.10586
  • repo_url: https://github.com/shymalagowri/Defense-against-Adversarial-Malware-using-RObust-Classifier-DAM-ROC
  • paper_authors: Emanuele Ledda, Daniele Angioni, Giorgio Piras, Giorgio Fumera, Battista Biggio, Fabio Roli
  • for: 本研究旨在攻击概率评估(Uncertainty Quantification,UQ)技术,以下用于让机器学习模型的输出不可靠。
  • methods: 我们设计了一个威胁模型,并提出了多种攻击策略,用于让UQ技术输出不准确的结果。
  • results: 我们的实验结果表明,我们的攻击策略可以更好地 manipulate UQ测量结果,比起induce misclassification。
    Abstract Machine-learning models can be fooled by adversarial examples, i.e., carefully-crafted input perturbations that force models to output wrong predictions. While uncertainty quantification has been recently proposed to detect adversarial inputs, under the assumption that such attacks exhibit a higher prediction uncertainty than pristine data, it has been shown that adaptive attacks specifically aimed at reducing also the uncertainty estimate can easily bypass this defense mechanism. In this work, we focus on a different adversarial scenario in which the attacker is still interested in manipulating the uncertainty estimate, but regardless of the correctness of the prediction; in particular, the goal is to undermine the use of machine-learning models when their outputs are consumed by a downstream module or by a human operator. Following such direction, we: \textit{(i)} design a threat model for attacks targeting uncertainty quantification; \textit{(ii)} devise different attack strategies on conceptually different UQ techniques spanning for both classification and semantic segmentation problems; \textit{(iii)} conduct a first complete and extensive analysis to compare the differences between some of the most employed UQ approaches under attack. Our extensive experimental analysis shows that our attacks are more effective in manipulating uncertainty quantification measures than attacks aimed to also induce misclassifications.
    摘要 机器学习模型可能被对抗样本欺骗,即精心构造的输入扰动会迫使模型输出错误的预测。近来,有人提出利用不确定性量化来检测对抗输入,其假设是此类攻击相对于正常数据会表现出更高的预测不确定性;但已有研究表明,专门同时降低不确定性估计的自适应攻击可以轻易绕过这种防御机制。本工作关注另一种对抗场景:攻击者仍然希望操纵不确定性估计,但并不关心预测是否正确,其目的是当模型输出被下游模块或人工操作者使用时,破坏机器学习模型的可用性。沿着这一方向,我们:(i) 为针对不确定性量化的攻击设计了威胁模型;(ii) 针对分类与语义分割问题中若干概念上不同的UQ技术设计了不同的攻击策略;(iii) 首次对若干最常用的UQ方法在攻击下的差异进行了完整而深入的分析。大量实验分析表明,与同时诱导错误分类的攻击相比,我们的攻击在操纵不确定性量化度量方面更加有效。
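
As a concrete illustration of what "manipulating the uncertainty estimate regardless of the predicted class" can look like, the PGD-style sketch below perturbs an input so that the softmax predictive entropy of a classifier decreases (or increases) while staying inside an L-infinity ball. It is a generic example of an attack on one particular uncertainty measure; the paper's threat model and attack strategies cover several UQ techniques and are not limited to this formulation.

```python
import torch
import torch.nn.functional as F

def entropy_attack(model, x, eps=8/255, alpha=2/255, steps=10, minimize=True):
    """PGD on the predictive entropy: make the model look (over-)confident on the
    perturbed input without explicitly targeting the predicted label."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        probs = F.softmax(model(x_adv), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
        objective = entropy if minimize else -entropy
        grad, = torch.autograd.grad(objective, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()          # descend on the entropy objective
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

# toy usage with an untrained classifier
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
x_adv = entropy_attack(model, x)
print((x_adv - x).abs().max())   # stays within the 8/255 budget
```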

Forgedit: Text Guided Image Editing via Learning and Forgetting

  • paper_url: http://arxiv.org/abs/2309.10556
  • repo_url: https://github.com/witcherofresearch/forgedit
  • paper_authors: Shiwen Zhang, Shuai Xiao, Weilin Huang
  • for: 这个论文的目的是提出一种新的文本引导图像编辑方法,以解决现有的图像编辑模型受过拟合和时间占用的问题。
  • methods: 该方法基于一种新的精度学习框架,通过视语言结合学习来重建输入图像,并使用向量减减和向量投影来找到适合的文本嵌入。此外,它还利用了Diffusion Models的一般性特性,并采用了忘记策略来解决恶性拟合问题。
  • results: 该方法在TEdBench数据集上达到了新的顶峰性状态,在CLIP分数和LPIPS分数两个指标上都超越了之前的SOTA方法Imagic with Imagen。
    Abstract Text guided image editing on real images given only the image and the target text prompt as inputs, is a very general and challenging problem, which requires the editing model to reason by itself which part of the image should be edited, to preserve the characteristics of original image, and also to perform complicated non-rigid editing. Previous fine-tuning based solutions are time-consuming and vulnerable to overfitting, limiting their editing capabilities. To tackle these issues, we design a novel text guided image editing method, Forgedit. First, we propose a novel fine-tuning framework which learns to reconstruct the given image in less than one minute by vision language joint learning. Then we introduce vector subtraction and vector projection to explore the proper text embedding for editing. We also find a general property of UNet structures in Diffusion Models and inspired by such a finding, we design forgetting strategies to diminish the fatal overfitting issues and significantly boost the editing abilities of Diffusion Models. Our method, Forgedit, implemented with Stable Diffusion, achieves new state-of-the-art results on the challenging text guided image editing benchmark TEdBench, surpassing the previous SOTA method Imagic with Imagen, in terms of both CLIP score and LPIPS score. Codes are available at https://github.com/witcherofresearch/Forgedit.
    摘要 仅给定图像和目标文本提示、对真实图像进行文本引导编辑,是一个非常通用且具有挑战性的问题:编辑模型需要自行推断应编辑图像的哪些部分,在保留原图特征的同时完成复杂的非刚性编辑。此前基于微调的方案耗时且容易过拟合,限制了编辑能力。为解决这些问题,我们设计了一种新的文本引导图像编辑方法Forgedit。首先,我们提出一种新的微调框架,通过视觉-语言联合学习在不到一分钟内学会重建给定图像;随后引入向量减法与向量投影来探索适合编辑的文本嵌入。我们还发现了扩散模型中UNet结构的一个普遍性质,并据此设计了遗忘策略,以缓解致命的过拟合问题并显著提升扩散模型的编辑能力。基于Stable Diffusion实现的Forgedit在具有挑战性的文本引导图像编辑基准TEdBench上取得了新的最先进结果,在CLIP分数和LPIPS分数两项指标上均超越了此前的SOTA方法Imagic(基于Imagen)。代码可在 https://github.com/witcherofresearch/Forgedit 获取。
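
The "vector subtraction and vector projection" step used to search for a suitable text embedding can be illustrated with two small helpers: one moves the source-prompt embedding along the difference toward the target prompt, the other splits the target embedding into components parallel and orthogonal to the source. Both are hedged sketches of the general operations; Forgedit's actual recipe for combining them is described in the paper, not here.

```python
import torch
import torch.nn.functional as F

def embedding_subtraction(source_emb, target_emb, strength=1.0):
    """Interpolate/extrapolate from the source-prompt embedding toward the target prompt."""
    return source_emb + strength * (target_emb - source_emb)

def embedding_projection(source_emb, target_emb):
    """Split the target embedding into a component parallel to the source embedding
    (identity-preserving) and an orthogonal residual (editing direction)."""
    src_dir = F.normalize(source_emb, dim=-1)
    parallel = (target_emb * src_dir).sum(dim=-1, keepdim=True) * src_dir
    orthogonal = target_emb - parallel
    return parallel, orthogonal

# toy usage on token-level embeddings of shape (tokens, dim)
source, target = torch.randn(77, 768), torch.randn(77, 768)
edited = embedding_subtraction(source, target, strength=0.8)
parallel, orthogonal = embedding_projection(source, target)
print(edited.shape, (parallel * orthogonal).sum(-1).abs().max())   # residual is orthogonal (~0)
```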

An overview of some mathematical techniques and problems linking 3D vision to 3D printing

  • paper_url: http://arxiv.org/abs/2309.10549
  • repo_url: None
  • paper_authors: Emiliano Cristiani, Maurizio Falcone, Silvia Tozza
  • for: 本研究旨在探讨 Computer Vision 和 3D printing 之间的交互,尤其是Shape-from-Shading 问题的解决方法,以及基于非线性偏微分方程和优化的方法。
  • methods: 本文使用了一些非线性偏微分方程和优化技术来解决 Shape-from-Shading 问题,并考虑了将这些方法应用于 3D printing 过程中。
  • results: 本研究提出了一些实用的例子,以示出将图像转换为 final 3D 印刷的过程。
    Abstract Computer Vision and 3D printing have rapidly evolved in the last 10 years but interactions among them have been very limited so far, despite the fact that they share several mathematical techniques. We try to fill the gap presenting an overview of some techniques for Shape-from-Shading problems as well as for 3D printing with an emphasis on the approaches based on nonlinear partial differential equations and optimization. We also sketch possible couplings to complete the process of object manufacturing starting from one or more images of the object and ending with its final 3D print. We will give some practical examples of this procedure.
    摘要 计算机视觉和3D打印技术在过去10年内快速发展,但它们之间的交互非常有限,尽管它们共享一些数学方法。我们尝试填补这个空白,介绍一些Shape-from-Shading问题的技巧以及基于非线性偏微分方程和优化的3D打印技术。我们还简要介绍可能的交互,完成对象制造的整个过程,从一个或多个图像开始,结束于其最终3D打印。我们将给出一些实践示例。

Decoupling the Curve Modeling and Pavement Regression for Lane Detection

  • paper_url: http://arxiv.org/abs/2309.10533
  • repo_url: None
  • paper_authors: Wencheng Han, Jianbing Shen
  • for: 本研究的目的是提出一种新的车道检测方法,以解决现有方法中 curve-based lane representation 对不规则的车道线的处理不佳问题。
  • methods: 本研究使用了分解车道检测任务为两部分:曲线建模和地面高程回归。Specifically, we use a parameterized curve to represent lanes in the BEV space to reflect the original distribution of lanes, and regress the ground heights of key points separately from the curve modeling.
  • results: 我们在2D车道检测 benchmarks (TuSimple和CULane) 和最近提出的3D车道检测 datasets (ONCE-3Dlane和OpenLane) 上进行了实验,并显示出了显著的改进。
    Abstract The curve-based lane representation is a popular approach in many lane detection methods, as it allows for the representation of lanes as a whole object and maximizes the use of holistic information about the lanes. However, the curves produced by these methods may not fit well with irregular lines, which can lead to gaps in performance compared to indirect representations such as segmentation-based or point-based methods. We have observed that these lanes are not intended to be irregular, but they appear zigzagged in the perspective view due to being drawn on uneven pavement. In this paper, we propose a new approach to the lane detection task by decomposing it into two parts: curve modeling and ground height regression. Specifically, we use a parameterized curve to represent lanes in the BEV space to reflect the original distribution of lanes. For the second part, since ground heights are determined by natural factors such as road conditions and are less holistic, we regress the ground heights of key points separately from the curve modeling. Additionally, we have unified the 2D and 3D lane detection tasks by designing a new framework and a series of losses to guide the optimization of models with or without 3D lane labels. Our experiments on 2D lane detection benchmarks (TuSimple and CULane), as well as the recently proposed 3D lane detection datasets (ONCE-3Dlane and OpenLane), have shown significant improvements. We will make our well-documented source code publicly available.
    摘要 基于曲线的车道表示是许多车道检测方法中的流行做法,因为它将车道作为一个整体对象来表示,能够最大限度地利用车道的整体信息。然而,这些方法生成的曲线可能难以拟合不规则的车道线,导致其性能与基于分割或基于点的间接表示方法相比存在差距。我们观察到这些车道本身并非不规则,只是由于绘制在不平整的路面上,在透视视角下显得曲折。本文提出了一种新的车道检测思路,将任务分解为两部分:曲线建模与地面高度回归。具体而言,我们在鸟瞰(BEV)空间用参数化曲线表示车道,以反映车道的原始分布;对于第二部分,由于地面高度取决于路况等自然因素、整体性较弱,我们将关键点的地面高度与曲线建模分开回归。此外,我们通过设计新的框架和一系列损失函数,统一了2D与3D车道检测任务,可在有或没有3D车道标注的情况下指导模型优化。我们在2D车道检测基准(TuSimple和CULane)以及最近提出的3D车道检测数据集(ONCE-3Dlane和OpenLane)上的实验均取得了显著提升。我们将公开文档完善的源代码。

Retinex-guided Channel-grouping based Patch Swap for Arbitrary Style Transfer

  • paper_url: http://arxiv.org/abs/2309.10528
  • repo_url: None
  • paper_authors: Chang Liu, Yi Niu, Mingming Ma, Fu Li, Guangming Shi
  • for: 提高基于块匹配(patch-matching)的风格迁移的质量与稳定性,以生成风格更加一致的纹理。
  • methods: 根据Retinex理论和通道组合策略,对 conten image feature map 进行替换,并提供补充混合和多尺度生成策略来避免不希望的黑色区域和过度风格化问题。
  • results: 实验结果表明,所提方法在保持内容保真度的同时,能够比现有技术生成风格更加一致的纹理。
    Abstract The basic principle of the patch-matching based style transfer is to substitute the patches of the content image feature maps by the closest patches from the style image feature maps. Since the finite features harvested from one single aesthetic style image are inadequate to represent the rich textures of the content natural image, existing techniques treat the full-channel style feature patches as simple signal tensors and create new style feature patches via signal-level fusion, which ignore the implicit diversities existed in style features and thus fail for generating better stylised results. In this paper, we propose a Retinex theory guided, channel-grouping based patch swap technique to solve the above challenges. Channel-grouping strategy groups the style feature maps into surface and texture channels, which prevents the winner-takes-all problem. Retinex theory based decomposition controls a more stable channel code rate generation. In addition, we provide complementary fusion and multi-scale generation strategy to prevent unexpected black area and over-stylised results respectively. Experimental results demonstrate that the proposed method outperforms the existing techniques in providing more style-consistent textures while keeping the content fidelity.
    摘要 基于块匹配的风格迁移的基本原则,是用风格图像特征图中最相近的块来替换内容图像特征图中的块。由于从单张风格图像中提取的有限特征不足以表示内容自然图像的丰富纹理,现有技术将全通道的风格特征块当作简单的信号张量,通过信号级融合生成新的风格特征块,忽略了风格特征中隐含的多样性,因而难以生成更好的风格化结果。本文提出一种由Retinex理论引导、基于通道分组的块交换技术来解决上述挑战。通道分组策略将风格特征图划分为表面通道与纹理通道,避免"赢者通吃"问题;基于Retinex理论的分解使通道编码率的生成更加稳定。此外,我们还提供了互补融合与多尺度生成策略,分别用于避免不期望的黑色区域与过度风格化的结果。实验结果表明,所提方法在保持内容保真度的同时,能比现有技术提供风格更一致的纹理。
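
The patch-swap step itself — replacing every content-feature patch with its most correlated style-feature patch — is a standard operation that the paper builds on before adding Retinex-guided channel grouping. A minimal PyTorch sketch of that baseline step is given below; the channel grouping, complementary fusion, and multi-scale generation described in the abstract are not included.

```python
import torch
import torch.nn.functional as F

def patch_swap(content_feat, style_feat, patch_size=3):
    """Baseline style-swap: for each content location pick the style patch with the highest
    normalized cross-correlation and paste it back, averaging overlapping patches.
    content_feat, style_feat: (1, C, H, W) feature maps."""
    C = content_feat.size(1)
    pad = patch_size // 2
    style_patches = F.unfold(style_feat, patch_size, padding=pad)          # (1, C*k*k, L)
    style_patches = style_patches.squeeze(0).t()                            # (L, C*k*k)
    kernels = style_patches.view(-1, C, patch_size, patch_size)
    normed = F.normalize(style_patches, dim=1).view(-1, C, patch_size, patch_size)
    scores = F.conv2d(content_feat, normed, padding=pad)                    # (1, L, H, W)
    best = scores.argmax(dim=1, keepdim=True)                               # closest style patch per location
    one_hot = torch.zeros_like(scores).scatter_(1, best, 1.0)
    out = F.conv_transpose2d(one_hot, kernels, padding=pad)                 # paste selected patches
    overlap = F.conv_transpose2d(one_hot, torch.ones_like(kernels), padding=pad)
    return out / overlap.clamp_min(1e-8)

# toy usage on small random feature maps
swapped = patch_swap(torch.randn(1, 16, 24, 24), torch.randn(1, 16, 24, 24))
print(swapped.shape)   # torch.Size([1, 16, 24, 24])
```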

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.10527
  • repo_url: https://github.com/pjlab-adg/3dtrans
  • paper_authors: Xiangchao Yan, Runjian Chen, Bo Zhang, Jiakang Yuan, Xinyu Cai, Botian Shi, Wenqi Shao, Junchi Yan, Ping Luo, Yu Qiao
  • for: 本文为了提高3D LiDAR点云的感知任务,包括3D物体检测和LiDARSemantic分割,提出了一种可扩展的预训练方法。
  • methods: 本文提出了一种名为SPOT(可扩展预训练via占用预测)的方法,通过大规模预训练和不同下游数据集和任务的细致调整,以提高3D表示的学习效果。
  • results: 本文通过多个公共数据集和任务下的实验,证明了occupancy预测的潜在性,并且通过树枝抽样技术和类别准备策略来缓解不同LiDAR传感器和注释策略在不同数据集中的领域差异。此外,本文还观察到了扩展预训练的现象,即下游性能随预训练数据的增加而提高。
    Abstract Annotating 3D LiDAR point clouds for perception tasks including 3D object detection and LiDAR semantic segmentation is notoriously time-and-energy-consuming. To alleviate the burden from labeling, it is promising to perform large-scale pre-training and fine-tune the pre-trained backbone on different downstream datasets as well as tasks. In this paper, we propose SPOT, namely Scalable Pre-training via Occupancy prediction for learning Transferable 3D representations, and demonstrate its effectiveness on various public datasets with different downstream tasks under the label-efficiency setting. Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on plenty of datasets and tasks. (2) SPOT uses beam re-sampling technique for point cloud augmentation and applies class-balancing strategies to overcome the domain gap brought by various LiDAR sensors and annotation strategies in different datasets. (3) Scalable pre-training is observed, that is, the downstream performance across all the experiments gets better with more pre-training data. We believe that our findings can facilitate understanding of LiDAR point clouds and pave the way for future exploration in LiDAR pre-training. Codes and models will be released.
    摘要 为3D LiDAR点云标注以用于3D目标检测和LiDAR语义分割等感知任务,费时费力。为减轻标注负担,一种有前景的做法是进行大规模预训练,再在不同的下游数据集和任务上微调预训练的骨干网络。本文提出SPOT(Scalable Pre-training via Occupancy prediction),即通过占用预测进行可扩展预训练以学习可迁移的3D表示,并在标注受限的设定下,在多个公开数据集和不同下游任务上验证了其有效性。我们的贡献有三点:(1) 大量数据集和任务上的实验表明,占用预测有助于学习通用的表示;(2) SPOT采用射束重采样技术进行点云增广,并使用类别平衡策略,以克服不同数据集中LiDAR传感器与标注策略差异带来的域差距;(3) 我们观察到可扩展的预训练现象,即预训练数据越多,所有实验中的下游性能越好。我们相信这些发现有助于理解LiDAR点云,并为未来的LiDAR预训练研究铺路。代码和模型将会公开。

Edge-aware Feature Aggregation Network for Polyp Segmentation

  • paper_url: http://arxiv.org/abs/2309.10523
  • repo_url: None
  • paper_authors: Tao Zhou, Yizhe Zhang, Geng Chen, Yi Zhou, Ye Wu, Deng-Ping Fan
  • for: 本研究旨在通过提高结肠息肉分割的精度,助力结直肠癌的早期诊断与预防。
  • methods: 本研究提出了一种Edge-aware Feature Aggregation Network(EFA-Net),包括Edge-aware Guidance Module(EGM)、Scale-aware Convolution Module(SCM)和Cross-level Fusion Module(CFM)等模块。
  • results: 对五种常用的肠癌检测数据集进行实验,EFA-Net表现出比前方法更高的普适性和效果。
    Abstract Precise polyp segmentation is vital for the early diagnosis and prevention of colorectal cancer (CRC) in clinical practice. However, due to scale variation and blurry polyp boundaries, it is still a challenging task to achieve satisfactory segmentation performance with different scales and shapes. In this study, we present a novel Edge-aware Feature Aggregation Network (EFA-Net) for polyp segmentation, which can fully make use of cross-level and multi-scale features to enhance the performance of polyp segmentation. Specifically, we first present an Edge-aware Guidance Module (EGM) to combine the low-level features with the high-level features to learn an edge-enhanced feature, which is incorporated into each decoder unit using a layer-by-layer strategy. Besides, a Scale-aware Convolution Module (SCM) is proposed to learn scale-aware features by using dilated convolutions with different ratios, in order to effectively deal with scale variation. Further, a Cross-level Fusion Module (CFM) is proposed to effectively integrate the cross-level features, which can exploit the local and global contextual information. Finally, the outputs of CFMs are adaptively weighted by using the learned edge-aware feature, which are then used to produce multiple side-out segmentation maps. Experimental results on five widely adopted colonoscopy datasets show that our EFA-Net outperforms state-of-the-art polyp segmentation methods in terms of generalization and effectiveness.
    摘要 精准的息肉分割对于临床实践中结直肠癌(CRC)的早期诊断与预防至关重要。然而,由于息肉尺度变化大、边界模糊,要在不同尺度和形状下取得令人满意的分割性能仍然具有挑战性。本研究提出了一种新的边缘感知特征聚合网络(EFA-Net)用于息肉分割,它能够充分利用跨层级与多尺度特征来提升息肉分割性能。具体而言,我们首先提出边缘感知引导模块(EGM),将低层特征与高层特征结合以学习边缘增强特征,并以逐层方式将其融入每个解码单元;此外,提出尺度感知卷积模块(SCM),通过不同膨胀率的空洞卷积学习尺度感知特征,从而有效应对尺度变化;进一步提出跨层融合模块(CFM),有效整合跨层特征,以挖掘局部与全局上下文信息。最后,CFM的输出由学习到的边缘感知特征进行自适应加权,并用于生成多个侧输出分割图。在五个常用的结肠镜数据集上的实验表明,EFA-Net在泛化性与有效性方面均优于最先进的息肉分割方法。
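
The scale-aware convolution idea — extracting features in parallel with 3x3 convolutions of different dilation rates and fusing them — can be sketched as a small PyTorch module. The branch count, dilation rates, and the 1x1 fusion with a residual connection below are assumptions for illustration; the SCM in the paper may be configured differently.

```python
import torch
import torch.nn as nn

class ScaleAwareConv(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates, concatenated and fused by a
    1x1 convolution, plus a residual connection — a generic sketch of scale-aware feature extraction."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ) for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1)) + x

# quick shape check: spatial size and channel count are preserved
print(ScaleAwareConv(32)(torch.randn(1, 32, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])
```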

Spatial-Assistant Encoder-Decoder Network for Real Time Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.10519
  • repo_url: https://github.com/cuzaoo/sanet-main
  • paper_authors: Yalun Wang, Shidong Chen, Huicong Bian, Weixiao Li, Qin Lu
  • for: 本文目的是提出一种基于encoder-decoder架构的实时semantic segmentation网络,以提高自动驾驶车辆的环境理解。
  • methods: 本文使用了encoder-decoder架构,并在encoder部分保留了中间部分的特征图,并使用了atrous convolution branches来实现同分辨率特征EXTRACTION。在decoder部分,我们提出了一种hybrid attention模块,SAD,来混合不同的分支。
  • results: 我们的SANet模型在实时CamVid和cityscape数据集上达到了竞争性的Result,包括78.4% mIOU at 65.1 FPS on Cityscape test dataset和78.8% mIOU at 147 FPS on CamVid test dataset。
    Abstract Semantic segmentation is an essential technology for self-driving cars to comprehend their surroundings. Currently, real-time semantic segmentation networks commonly employ either encoder-decoder architecture or two-pathway architecture. Generally speaking, encoder-decoder models tend to be quicker,whereas two-pathway models exhibit higher accuracy. To leverage both strengths, we present the Spatial-Assistant Encoder-Decoder Network (SANet) to fuse the two architectures. In the overall architecture, we uphold the encoder-decoder design while maintaining the feature maps in the middle section of the encoder and utilizing atrous convolution branches for same-resolution feature extraction. Toward the end of the encoder, we integrate the asymmetric pooling pyramid pooling module (APPPM) to optimize the semantic extraction of the feature maps. This module incorporates asymmetric pooling layers that extract features at multiple resolutions. In the decoder, we present a hybrid attention module, SAD, that integrates horizontal and vertical attention to facilitate the combination of various branches. To ascertain the effectiveness of our approach, our SANet model achieved competitive results on the real-time CamVid and cityscape datasets. By employing a single 2080Ti GPU, SANet achieved a 78.4 % mIOU at 65.1 FPS on the Cityscape test dataset and 78.8 % mIOU at 147 FPS on the CamVid test dataset. The training code and model for SANet are available at https://github.com/CuZaoo/SANet-main
    摘要 Semantic segmentation 是自驾车技术的重要组成部分,帮助自驾车理解它所处的环境。当前实时 semantic segmentation 网络通常采用 either encoder-decoder 架构或 two-pathway 架构。一般来说,encoder-decoder 模型比较快速,而 two-pathway 模型具有更高的准确率。为了利用这两者的优点,我们提出了 Spatial-Assistant Encoder-Decoder Network (SANet),将两种架构融合在一起。整体架构保持 encoder-decoder 设计,并在中间部分的 encoder 中维持特征图,使用 atrous convolution 分支来实现同分辨率特征提取。在 encoder 的末端,我们 integration asymmetric pooling pyramid pooling module (APPPM) 来优化特征提取。在 decoder 中,我们提出了 hybrid attention module (SAD),将 horizontal 和 vertical attention 集成到不同分支中,以便不同分支之间的组合。为了证明我们的方法的有效性,我们的 SANet 模型在 real-time CamVid 和 cityscape 数据集上达到了竞争性的结果。使用单个 2080Ti GPU,SANet 在 Cityscape 测试数据集上达到了 78.4 % mIOU 的值,并在 65.1 FPS 的速度下运行。训练代码和模型可以在 https://github.com/CuZaoo/SANet-main 上下载。

Unsupervised Landmark Discovery Using Consistency Guided Bottleneck

  • paper_url: http://arxiv.org/abs/2309.10518
  • repo_url: https://github.com/mamonaawan/cgb_uld
  • paper_authors: Mamona Awan, Muhammad Haris Khan, Sanoojan Baliah, Muhammad Ahmad Waseem, Salman Khan, Fahad Shahbaz Khan, Arif Mahmood
  • for: 本研究目标是不监督的物体标记发现问题。
  • methods: 我们引入了一个可靠性指标导向的瓶颈,该瓶颈利用标记兼容度来生成适应性的热图。我们还提出了一种在图像重建 pipeline 中使用 pseudo-ground truth 来获得假Supervision。
  • results: 我们在五种多样化的 dataset 上进行了评估,结果表明我们的方法与现有状态的方法相比,表现出色。我们的代码可以在 GitHub 上获取(https://github.com/MamonaAwan/CGB_ULD)。
    Abstract We study a challenging problem of unsupervised discovery of object landmarks. Many recent methods rely on bottlenecks to generate 2D Gaussian heatmaps however, these are limited in generating informed heatmaps while training, presumably due to the lack of effective structural cues. Also, it is assumed that all predicted landmarks are semantically relevant despite having no ground truth supervision. In the current work, we introduce a consistency-guided bottleneck in an image reconstruction-based pipeline that leverages landmark consistency, a measure of compatibility score with the pseudo-ground truth to generate adaptive heatmaps. We propose obtaining pseudo-supervision via forming landmark correspondence across images. The consistency then modulates the uncertainty of the discovered landmarks in the generation of adaptive heatmaps which rank consistent landmarks above their noisy counterparts, providing effective structural information for improved robustness. Evaluations on five diverse datasets including MAFL, AFLW, LS3D, Cats, and Shoes demonstrate excellent performance of the proposed approach compared to the existing state-of-the-art methods. Our code is publicly available at https://github.com/MamonaAwan/CGB_ULD.
    摘要 我们研究一个具有挑战性的问题:无监督的物体关键点发现。许多近期方法依赖瓶颈结构来生成2D高斯热图,但由于缺乏有效的结构线索,这些方法在训练时难以生成信息充分的热图;同时,它们在没有真值监督的情况下假定所有预测的关键点都具有语义相关性。在本工作中,我们在基于图像重建的流程中引入了一致性引导的瓶颈,利用关键点一致性(与伪真值的兼容度分数)来生成自适应热图。我们提出通过在图像之间建立关键点对应关系来获得伪监督。一致性随后调节所发现关键点的不确定性,使得在生成自适应热图时,一致的关键点排名高于噪声关键点,从而提供有效的结构信息以提升鲁棒性。在MAFL、AFLW、LS3D、Cats与Shoes五个多样化数据集上的评估表明,所提方法的性能优于现有最先进方法。我们的代码公开于 https://github.com/MamonaAwan/CGB_ULD。

Uncertainty Estimation in Instance Segmentation with Star-convex Shapes

  • paper_url: http://arxiv.org/abs/2309.10513
  • repo_url: None
  • paper_authors: Qasim M. K. Siddiqui, Sebastian Starke, Peter Steinbach
  • for: 这个研究旨在提高实例 segmentation 中模型的预测uncertainty 评估,以便更加准确地评估模型的可靠性和决策。
  • methods: 本研究使用 Monte-Carlo Dropout 和 Deep Ensemble 技术来计算实例的空间和分数 certeinty 分布,并 comparing 两种不同的聚类方法来评估它们的效果。
  • results: 研究发现,结合空间和分数 certeinty 分布可以获得更好的准确性评估结果,而使用 Deep Ensemble 技术并结合我们的新 radial clustering 方法则更加有效。
    Abstract Instance segmentation has witnessed promising advancements through deep neural network-based algorithms. However, these models often exhibit incorrect predictions with unwarranted confidence levels. Consequently, evaluating prediction uncertainty becomes critical for informed decision-making. Existing methods primarily focus on quantifying uncertainty in classification or regression tasks, lacking emphasis on instance segmentation. Our research addresses the challenge of estimating spatial certainty associated with the location of instances with star-convex shapes. Two distinct clustering approaches are evaluated which compute spatial and fractional certainty per instance employing samples by the Monte-Carlo Dropout or Deep Ensemble technique. Our study demonstrates that combining spatial and fractional certainty scores yields improved calibrated estimation over individual certainty scores. Notably, our experimental results show that the Deep Ensemble technique alongside our novel radial clustering approach proves to be an effective strategy. Our findings emphasize the significance of evaluating the calibration of estimated certainties for model reliability and decision-making.
    摘要 实例分割借助深度神经网络算法取得了可喜的进展,但这些模型常常以不合理的置信度给出错误预测,因此评估预测不确定性对于可靠决策至关重要。现有方法主要关注分类或回归任务中的不确定性量化,缺少对实例分割的研究。本文针对星凸形状实例的位置,研究其空间确定性的估计问题。我们评估了两种不同的聚类方法,它们利用Monte-Carlo Dropout或Deep Ensemble技术采样,计算每个实例的空间确定性与分数确定性。研究表明,将空间确定性与分数确定性相结合,可以获得比单一确定性更好的校准估计。值得注意的是,实验结果显示Deep Ensemble技术结合我们新提出的径向聚类方法是一种有效的策略。我们的发现强调了评估估计确定性校准对于模型可靠性与决策的重要性。
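
One of the two sampling techniques named in the abstract, Monte-Carlo Dropout, amounts to keeping dropout active at test time and treating the per-pixel statistics over repeated forward passes as spatial certainty. The sketch below shows that sampling step for a single-channel mask logit; the grouping of samples into per-instance spatial and fractional certainties (including the paper's radial clustering) is not reproduced here.

```python
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model, image, n_samples=20):
    """Per-pixel mean and variance of the foreground probability over several
    stochastic forward passes with dropout kept active."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()                       # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(image)) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0)

# toy mask head with dropout; mean/var keep the input's spatial size
head = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                     nn.Dropout2d(0.5), nn.Conv2d(8, 1, 1))
mean, var = mc_dropout_uncertainty(head, torch.randn(1, 3, 64, 64))
print(mean.shape, var.shape)
```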

Single-Image based unsupervised joint segmentation and denoising

  • paper_url: http://arxiv.org/abs/2309.10511
  • repo_url: https://github.com/Nadja1611/Single-Image-based-unsupervised-joint-segmentation-and-denoising
  • paper_authors: Nadja Gruber, Johannes Schwab, Noémie Debroux, Nicolas Papadakis, Markus Haltmeier
  • for: 本论文的目标是对单幅图像进行联合分割与去噪。
  • methods: 该方法将变分分割方法与基于单幅图像的自监督深度学习方法相结合。
  • results: 该方法无需训练数据库即可在噪声较大或纹理复杂的图像中分割出多个有意义的区域,并且能够对显微成像中的噪声图像进行联合分割与去噪。
    Abstract In this work, we develop an unsupervised method for the joint segmentation and denoising of a single image. To this end, we combine the advantages of a variational segmentation method with the power of a self-supervised, single-image based deep learning approach. One major strength of our method lies in the fact, that in contrast to data-driven methods, where huge amounts of labeled samples are necessary, our model can segment an image into multiple meaningful regions without any training database. Further, we introduce a novel energy functional in which denoising and segmentation are coupled in a way that both tasks benefit from each other. The limitations of existing single-image based variational segmentation methods, which are not capable of dealing with high noise or generic texture, are tackled by this specific combination with self-supervised image denoising. We propose a unified optimisation strategy and show that, especially for very noisy images available in microscopy, our proposed joint approach outperforms its sequential counterpart as well as alternative methods focused purely on denoising or segmentation. Another comparison is conducted with a supervised deep learning approach designed for the same application, highlighting the good performance of our approach.
    摘要 在本工作中,我们提出了一种对单幅图像进行联合分割与去噪的无监督方法。为此,我们将变分分割方法的优势与基于单幅图像的自监督深度学习方法的能力相结合。该方法的一大优点在于:与需要大量标注样本的数据驱动方法不同,我们的模型无需任何训练数据库即可将图像分割为多个有意义的区域。此外,我们引入了一个新的能量泛函,将去噪与分割耦合起来,使两项任务相互受益。现有基于单幅图像的变分分割方法难以处理强噪声或一般纹理,而这种与自监督图像去噪的特定结合正好弥补了这一不足。我们提出了统一的优化策略,并表明:尤其是对于显微成像中噪声很强的图像,所提出的联合方法优于其顺序执行的版本,也优于仅专注于去噪或分割的其他方法。我们还与针对同一应用设计的有监督深度学习方法进行了比较,进一步展示了本方法的良好性能。

DCPT: Darkness Clue-Prompted Tracking in Nighttime UAVs

  • paper_url: http://arxiv.org/abs/2309.10491
  • repo_url: None
  • paper_authors: Jiawen Zhu, Huayi Tang, Zhi-Qi Cheng, Jun-Yan He, Bin Luo, Shihao Qiu, Shengming Li, Huchuan Lu
  • for: 提高夜间无人机跟踪性能
  • methods: 提出了一种新的 darkness clue-prompted tracking (DCPT) 架构,通过效率地学习生成黑暗提示来实现夜间无人机跟踪。
  • results: 对多个黑场景标准吊靡进行了广泛的实验,并达到了现状最佳性能。
    Abstract Existing nighttime unmanned aerial vehicle (UAV) trackers follow an "Enhance-then-Track" architecture - first using a light enhancer to brighten the nighttime video, then employing a daytime tracker to locate the object. This separate enhancement and tracking fails to build an end-to-end trainable vision system. To address this, we propose a novel architecture called Darkness Clue-Prompted Tracking (DCPT) that achieves robust UAV tracking at night by efficiently learning to generate darkness clue prompts. Without a separate enhancer, DCPT directly encodes anti-dark capabilities into prompts using a darkness clue prompter (DCP). Specifically, DCP iteratively learns emphasizing and undermining projections for darkness clues. It then injects these learned visual prompts into a daytime tracker with fixed parameters across transformer layers. Moreover, a gated feature aggregation mechanism enables adaptive fusion between prompts and between prompts and the base model. Extensive experiments show state-of-the-art performance for DCPT on multiple dark scenario benchmarks. The unified end-to-end learning of enhancement and tracking in DCPT enables a more trainable system. The darkness clue prompting efficiently injects anti-dark knowledge without extra modules. Code and models will be released.
    摘要 现有的夜间无人机(UAV)跟踪器遵循"先增强、后跟踪"的架构:先用低光增强器提亮夜间视频,再用白天跟踪器定位目标。这种将增强与跟踪分离的做法无法构建端到端可训练的视觉系统。为此,我们提出一种名为黑暗线索提示跟踪(DCPT)的新架构,通过高效地学习生成黑暗线索提示,实现夜间的鲁棒无人机跟踪。DCPT无需单独的增强器,而是利用黑暗线索提示器(DCP)将抗黑暗能力直接编码进提示中。具体而言,DCP迭代学习对黑暗线索的强调与削弱投影,然后将学到的视觉提示注入到各Transformer层参数固定的白天跟踪器中。此外,门控特征聚合机制实现了提示之间以及提示与基础模型之间的自适应融合。大量实验表明,DCPT在多个黑暗场景基准上达到了最先进的性能。DCPT中增强与跟踪的统一端到端学习使系统更易训练,黑暗线索提示无需额外模块即可高效注入抗黑暗知识。代码和模型将会发布。

RECALL+: Adversarial Web-based Replay for Continual Learning in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.10479
  • repo_url: None
  • paper_authors: Chang Liu, Giulia Rizzoli, Francesco Barbato, Umberto Michieli, Yi Niu, Pietro Zanuttigh
  • for: 本研究的目的是解决 continual learning 中的 catastrophic forgetting 问题,通过不同的 regularization 策略来保持之前学习的知识。
  • methods: 本研究扩展了之前的方法(RECALL),通过在线数据库中检索过时的类例来避免忘记。本研究引入了两种新的方法:基于对抗学习和自适应阈值调整来选择来自网络数据的示例,以及一种改进的pseudo-labeling scheme来更准确地标注网络数据。
  • results: 实验结果显示,这种加强的方法在多个增量学习步骤中表现出色,特别是在多个增量学习步骤中表现出色,特别是在多个增量学习步骤中表现出色,特别是在多个增量学习步骤中表现出色。
    Abstract Catastrophic forgetting of previous knowledge is a critical issue in continual learning typically handled through various regularization strategies. However, existing methods struggle especially when several incremental steps are performed. In this paper, we extend our previous approach (RECALL) and tackle forgetting by exploiting unsupervised web-crawled data to retrieve examples of old classes from online databases. Differently from the original approach that did not perform any evaluation of the web data, here we introduce two novel approaches based on adversarial learning and adaptive thresholding to select from web data only samples strongly resembling the statistics of the no longer available training ones. Furthermore, we improved the pseudo-labeling scheme to achieve a more accurate labeling of web data that also consider classes being learned in the current step. Experimental results show that this enhanced approach achieves remarkable results, especially when multiple incremental learning steps are performed.
    摘要 灾难性遗忘先前知识是持续学习中的一个关键问题,通常通过各种正则化策略来处理。然而,现有方法在执行多个增量步骤时尤其吃力。本文扩展了我们此前的方法(RECALL),通过利用无监督的网络爬取数据,从在线数据库中检索旧类别的样例来应对遗忘。与原方法不对网络数据做任何评估不同,我们在此引入两种基于对抗学习与自适应阈值的新方法,仅从网络数据中挑选与已不可用的训练数据统计特性高度相似的样本。此外,我们改进了伪标注方案,使网络数据的标注更加准确,并同时考虑当前步骤正在学习的类别。实验结果表明,这一增强方法取得了出色的效果,尤其是在执行多个增量学习步骤时。

LineMarkNet: Line Landmark Detection for Valet Parking

  • paper_url: http://arxiv.org/abs/2309.10475
  • repo_url: None
  • paper_authors: Zizhang Wu, Fan Wang, Yuanzhu Gan, Tianhao Xu, Weiwei Sun, Rui Tang
  • for: This paper aims to solve the long-standing problem of accurate and efficient line landmark detection for valet parking in autonomous driving.
  • methods: The paper presents a deep line landmark detection system that utilizes a pre-calibrated homography to fuse context from four separate cameras into a unified bird-eye-view (BEV) space. The system employs a multi-task decoder to detect multiple line landmarks and incorporates a graph transformer to enhance the vision transformer with hierarchical level graph reasoning for semantic segmentation.
  • results: The paper achieves enhanced performance compared to several line detection methods and validates the multi-task network’s efficiency in real-time line landmark detection on the Qualcomm 820A platform while maintaining superior accuracy.
    Abstract We aim for accurate and efficient line landmark detection for valet parking, which is a long-standing yet unsolved problem in autonomous driving. To this end, we present a deep line landmark detection system where we carefully design the modules to be lightweight. Specifically, we first empirically design four general line landmarks including three physical lines and one novel mental line. The four line landmarks are effective for valet parking. We then develop a deep network (LineMarkNet) to detect line landmarks from surround-view cameras where we, via the pre-calibrated homography, fuse context from four separate cameras into the unified bird-eye-view (BEV) space, specifically we fuse the surroundview features and BEV features, then employ the multi-task decoder to detect multiple line landmarks where we apply the center-based strategy for object detection task, and design our graph transformer to enhance the vision transformer with hierarchical level graph reasoning for semantic segmentation task. At last, we further parameterize the detected line landmarks (e.g., intercept-slope form) whereby a novel filtering backend incorporates temporal and multi-view consistency to achieve smooth and stable detection. Moreover, we annotate a large-scale dataset to validate our method. Experimental results show that our framework achieves the enhanced performance compared with several line detection methods and validate the multi-task network's efficiency about the real-time line landmark detection on the Qualcomm 820A platform while meantime keeps superior accuracy, with our deep line landmark detection system.
    摘要 我们的目标是为代客泊车实现准确且高效的线状路标检测,这是自动驾驶中一个长期存在却尚未解决的问题。为此,我们提出了一个深度线状路标检测系统,并精心设计各模块使其保持轻量。具体而言,我们首先通过经验设计了四种通用的线状路标,包括三种物理线和一种新的心理线,这四种线状路标对代客泊车十分有效。随后我们开发了深度网络LineMarkNet,从环视相机中检测线状路标:借助预先标定的单应性变换,将四个相机的上下文信息融合到统一的鸟瞰(BEV)空间中,具体是融合环视特征与BEV特征,再利用多任务解码器检测多种线状路标,其中目标检测任务采用基于中心点的策略,并设计图Transformer为视觉Transformer增加层级化的图推理以完成语义分割任务。最后,我们进一步对检测到的线状路标进行参数化(例如截距-斜率形式),并通过一个新的滤波后端结合时间与多视角一致性,实现平滑稳定的检测。此外,我们标注了一个大规模数据集来验证所提方法。实验结果表明,我们的框架相比多种线检测方法取得了更好的性能,并验证了多任务网络在Qualcomm 820A平台上实时检测线状路标的效率,同时保持了优越的精度。
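
The surround-view fusion step relies on warping each camera into a common bird-eye-view canvas with its pre-calibrated homography. The OpenCV/NumPy sketch below illustrates that warping and a simple averaging of overlaps on raw images; in the paper the same homographies fuse feature maps, followed by learned fusion, which is not shown here.

```python
import cv2
import numpy as np

def fuse_to_bev(images, homographies, bev_hw=(400, 400)):
    """Warp each camera image into the BEV canvas with its homography and average overlaps.
    images: list of (H, W, 3) uint8 arrays; homographies: list of (3, 3) arrays mapping
    image pixels to BEV pixels (assumed pre-calibrated)."""
    acc = np.zeros((*bev_hw, 3), dtype=np.float64)
    weight = np.zeros(bev_hw, dtype=np.float64)
    for img, H in zip(images, homographies):
        warped = cv2.warpPerspective(img, H, (bev_hw[1], bev_hw[0]))
        valid = cv2.warpPerspective(np.ones(img.shape[:2], np.float64), H,
                                    (bev_hw[1], bev_hw[0]))
        acc += warped.astype(np.float64) * valid[..., None]
        weight += valid
    return (acc / np.clip(weight[..., None], 1e-6, None)).astype(np.uint8)

# toy usage with identity homographies and random "camera" images
imgs = [np.random.randint(0, 255, (240, 320, 3), np.uint8) for _ in range(4)]
Hs = [np.eye(3) for _ in range(4)]
print(fuse_to_bev(imgs, Hs).shape)   # (400, 400, 3)
```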

Diffusion-based speech enhancement with a weighted generative-supervised learning loss

  • paper_url: http://arxiv.org/abs/2309.10457
  • repo_url: None
  • paper_authors: Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel
  • for: 研究语音增强(SE)中基于扩散的生成模型,为传统的有监督方法提供一种替代方案。
  • methods: 将干净语音训练样本转换为以带噪语音为中心的高斯噪声,并学习一个参数化模型,在带噪语音的条件下逆转该过程。
  • results: 所提方法能够有效弥补仅依赖无监督损失的生成式方法的不足,实验结果验证了其有效性。
    Abstract Diffusion-based generative models have recently gained attention in speech enhancement (SE), providing an alternative to conventional supervised methods. These models transform clean speech training samples into Gaussian noise centered at noisy speech, and subsequently learn a parameterized model to reverse this process, conditionally on noisy speech. Unlike supervised methods, generative-based SE approaches usually rely solely on an unsupervised loss, which may result in less efficient incorporation of conditioned noisy speech. To address this issue, we propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech at each reverse process iteration. Experimental results demonstrate the effectiveness of our proposed methodology.
    摘要 基于扩散的生成模型近来在语音增强(SE)中受到关注,为传统的有监督方法提供了一种替代方案。这类模型将干净语音训练样本转换为以带噪语音为中心的高斯噪声,随后学习一个参数化模型,在带噪语音的条件下逆转该过程。与有监督方法不同,基于生成的SE方法通常仅依赖无监督损失,可能导致对带噪语音条件信息的利用不够充分。为解决这一问题,我们提出在原有的扩散训练目标中加入均方误差(MSE)损失,在每一步反向过程迭代中衡量估计的增强语音与真实干净语音之间的差异。实验结果证明了所提方法的有效性。
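
A hedged sketch of the idea of augmenting a generative diffusion objective with a supervised MSE term is given below: a denoising score-matching loss on the diffused clean speech is combined with an MSE between the clean target and the denoised estimate recovered from the predicted score. The simple variance-exploding diffusion, the Tweedie-style estimate, and the weighting `alpha` are illustrative assumptions; the paper's forward process (centered at the noisy speech) and exact weighting differ in detail.

```python
import torch
import torch.nn.functional as F

def weighted_generative_supervised_loss(score_model, clean, noisy, sigma, alpha=0.5):
    """Weighted sum of a denoising score-matching loss and a supervised MSE term.
    score_model(x_t, noisy, sigma) is assumed to predict the score of the diffused clean speech,
    conditioned on the noisy speech."""
    noise = torch.randn_like(clean)
    x_t = clean + sigma * noise                          # diffuse the clean target
    score = score_model(x_t, noisy, sigma)
    loss_generative = F.mse_loss(score, -noise / sigma)  # standard score-matching target
    clean_estimate = x_t + (sigma ** 2) * score          # Tweedie-style denoised estimate
    loss_supervised = F.mse_loss(clean_estimate, clean)
    return (1 - alpha) * loss_generative + alpha * loss_supervised

# toy usage with a stand-in "score model" on STFT-sized tensors
toy_model = lambda x_t, noisy, sigma: (noisy - x_t) / (sigma ** 2)
clean, noisy = torch.randn(2, 257, 100), torch.randn(2, 257, 100)
print(weighted_generative_supervised_loss(toy_model, clean, noisy, sigma=0.5))
```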

Unsupervised speech enhancement with diffusion-based generative models

  • paper_url: http://arxiv.org/abs/2309.10450
  • repo_url: https://github.com/sp-uhh/sgmse
  • paper_authors: Berné Nortier, Mostafa Sadeghi, Romain Serizel
  • for: 本研究旨在提出一种基于扩散模型的无监督音频增强方法,以解决现有方法在面对未见过的条件时的挑战。
  • methods: 我们使用了分布式扩散模型来学习干净语音的先验分布在短时域傅立叙特(STFT)域,然后通过将学习到的干净语音先验分布与陌生噪声模型结合,实现无监督的音频增强。
  • results: 我们的方法在比较之下比一种最新的变分自动编码器(VAE)基于无监督方法和一种现有的扩散基于监督方法的状态艺术方法更有优势,这些结果展示了我们的方法在无监督音频增强方面的潜在优势。
    Abstract Recently, conditional score-based diffusion models have gained significant attention in the field of supervised speech enhancement, yielding state-of-the-art performance. However, these methods may face challenges when generalising to unseen conditions. To address this issue, we introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models. Specifically, in a training phase, a clean speech prior distribution is learnt in the short-time Fourier transform (STFT) domain using score-based diffusion models, allowing it to unconditionally generate clean speech from Gaussian noise. Then, we develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference. The noise parameters are simultaneously learnt along with clean speech estimation through an iterative expectationmaximisation (EM) approach. To the best of our knowledge, this is the first work exploring diffusion-based generative models for unsupervised speech enhancement, demonstrating promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method. It thus opens a new direction for future research in unsupervised speech enhancement.
    摘要 近来,基于条件分数的扩散模型在有监督语音增强领域受到了广泛关注,并取得了最先进的性能。然而,这些方法在泛化到未见过的条件时可能面临挑战。为解决这一问题,我们提出一种以无监督方式工作的替代方案,利用扩散模型的生成能力。具体而言,在训练阶段,我们使用基于分数的扩散模型在短时傅里叶变换(STFT)域学习干净语音的先验分布,使其能够从高斯噪声无条件地生成干净语音。随后,我们将学习到的干净语音先验与语音信号推断所需的噪声模型相结合,建立了用于语音增强的后验采样方法。噪声参数通过迭代的期望最大化(EM)方法与干净语音估计同时学习。据我们所知,这是首个探索基于扩散的生成模型用于无监督语音增强的工作;与最近一种基于变分自编码器(VAE)的无监督方法以及一种最先进的基于扩散的有监督方法相比,本方法取得了有前景的结果,从而为无监督语音增强的未来研究开辟了新的方向。

Posterior sampling algorithms for unsupervised speech enhancement with recurrent variational autoencoder

  • paper_url: http://arxiv.org/abs/2309.10439
  • repo_url: None
  • paper_authors: Mostafa Sadeghi, Romain Serizel
  • for: addresses the unsupervised speech enhancement problem based on recurrent variational autoencoder (RVAE)
  • methods: uses efficient sampling techniques based on Langevin dynamics and Metropolis-Hasting algorithms to circumvent the computational complexity of variational inference
  • results: significantly outperforms the variational expectation-maximization (VEM) method and shows robust generalization performance in mismatched test conditions
    Abstract In this paper, we address the unsupervised speech enhancement problem based on recurrent variational autoencoder (RVAE). This approach offers promising generalization performance over the supervised counterpart. Nevertheless, the involved iterative variational expectation-maximization (VEM) process at test time, which relies on a variational inference method, results in high computational complexity. To tackle this issue, we present efficient sampling techniques based on Langevin dynamics and Metropolis-Hasting algorithms, adapted to the EM-based speech enhancement with RVAE. By directly sampling from the intractable posterior distribution within the EM process, we circumvent the intricacies of variational inference. We conduct a series of experiments, comparing the proposed methods with VEM and a state-of-the-art supervised speech enhancement approach based on diffusion models. The results reveal that our sampling-based algorithms significantly outperform VEM, not only in terms of computational efficiency but also in overall performance. Furthermore, when compared to the supervised baseline, our methods showcase robust generalization performance in mismatched test conditions.
    摘要 本文研究基于循环变分自编码器(RVAE)的无监督语音增强问题。该方法相对于有监督方法具有更好的泛化性能。然而,测试阶段涉及的迭代变分期望最大化(VEM)过程依赖变分推断,计算复杂度很高。为此,我们提出了基于朗之万动力学和Metropolis-Hastings算法的高效采样技术,并将其适配到基于EM的RVAE语音增强中。通过在EM过程中直接从难以处理的后验分布中采样,我们绕开了变分推断的繁琐。我们开展了一系列实验,将所提方法与VEM以及一种基于扩散模型的最先进有监督语音增强方法进行比较。结果表明,我们基于采样的算法不仅在计算效率上,而且在整体性能上都显著优于VEM。此外,与有监督基线相比,我们的方法在失配的测试条件下表现出鲁棒的泛化性能。
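
The Langevin-dynamics sampler referenced above can be written in a few lines: repeatedly follow the gradient of an unnormalized log-posterior and inject Gaussian noise scaled by the step size. The sketch below is the generic unadjusted Langevin algorithm applied to an arbitrary differentiable log-density; in the paper it (and a Metropolis-Hastings variant) replaces the variational E-step over the RVAE latent variables, which requires the actual RVAE posterior instead of the toy Gaussian used here.

```python
import torch

def langevin_sampling(log_posterior, z_init, step_size=1e-3, n_steps=200):
    """Unadjusted Langevin dynamics over the latent variables z."""
    z = z_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(log_posterior(z).sum(), z)[0]
        with torch.no_grad():
            z = z + 0.5 * step_size * grad \
                  + (step_size ** 0.5) * torch.randn_like(z)   # drift + diffusion
        z.requires_grad_(True)
    return z.detach()

# toy target: a standard Gaussian posterior; samples should have ~zero mean and ~unit variance
samples = langevin_sampling(lambda z: -0.5 * (z ** 2).sum(dim=-1),
                            torch.randn(2000, 2))
print(samples.mean(dim=0), samples.var(dim=0))
```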

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration

  • paper_url: http://arxiv.org/abs/2309.10438
  • repo_url: None
  • paper_authors: Lijiang Li, Huixia Li, Xiawu Zheng, Jie Wu, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan, Fei Chao, Rongrong Ji
  • for: 提高 diffusion models 的生成速度,无需额外训练。
  • methods: 提出了一种统一搜索空间和演化算法,可以在不同的 diffusion models 中找到最佳的时间步骤和模型结构。
  • results: 通过使用只有几步(例如,4步),可以达到优秀的图像生成效果(例如, ImageNet 64 $\times$ 64 的 FID 分数为 17.86),比传统的 DDIM 更好。
    Abstract Diffusion models are emerging expressive generative models, in which a large number of time steps (inference steps) are required for a single image generation. To accelerate such tedious process, reducing steps uniformly is considered as an undisputed principle of diffusion models. We consider that such a uniform assumption is not the optimal solution in practice; i.e., we can find different optimal time steps for different models. Therefore, we propose to search the optimal time steps sequence and compressed model architecture in a unified framework to achieve effective image generation for diffusion models without any further training. Specifically, we first design a unified search space that consists of all possible time steps and various architectures. Then, a two stage evolutionary algorithm is introduced to find the optimal solution in the designed search space. To further accelerate the search process, we employ FID score between generated and real samples to estimate the performance of the sampled examples. As a result, the proposed method is (i).training-free, obtaining the optimal time steps and model architecture without any training process; (ii). orthogonal to most advanced diffusion samplers and can be integrated to gain better sample quality. (iii). generalized, where the searched time steps and architectures can be directly applied on different diffusion models with the same guidance scale. Experimental results show that our method achieves excellent performance by using only a few time steps, e.g. 17.86 FID score on ImageNet 64 $\times$ 64 with only four steps, compared to 138.66 with DDIM.
    摘要 扩散模型是一类新兴的高表达力生成模型,但生成单张图像需要大量的时间步(推理步)。为了加速这一繁琐的过程,均匀地减少步数通常被视为扩散模型的一条公认原则。我们认为这种均匀假设在实践中并非最优解:不同的模型可以有不同的最优时间步序列。因此,我们提出在一个统一框架中同时搜索最优的时间步序列与压缩的模型结构,从而在无需任何额外训练的情况下实现高效的图像生成。具体而言,我们首先设计了一个包含所有可能时间步与多种结构的统一搜索空间;随后引入两阶段进化算法,在该搜索空间中寻找最优解。为进一步加速搜索过程,我们使用生成样本与真实样本之间的FID分数来评估被采样方案的性能。因此,所提方法具有以下特点:(1) 无需训练,即可直接获得最优的时间步与模型结构;(2) 与大多数先进的扩散采样器正交,可与其结合以获得更好的样本质量;(3) 具有通用性,搜索得到的时间步与结构可在相同引导尺度下直接应用于不同的扩散模型。实验结果表明,我们的方法仅用少量时间步即可取得优异性能,例如在ImageNet 64×64上仅用4步即达到17.86的FID分数,而DDIM为138.66。
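
The evolutionary search over which time steps to keep can be sketched independently of the diffusion model itself: candidates are small sorted subsets of the full schedule, mutated and selected by a fitness function that would, in practice, be the FID of images sampled with that schedule. Everything below (population size, mutation rule, the toy fitness) is an illustrative assumption; the paper additionally searches over model architectures with a two-stage algorithm.

```python
import random

def evolve_timesteps(total_steps, n_keep, fitness, pop_size=16, generations=20,
                     mutate_p=0.2, seed=0):
    """Tiny evolutionary search over sorted subsets of {0..total_steps-1};
    `fitness` returns a score to minimize (e.g., FID of images sampled with that schedule)."""
    rng = random.Random(seed)

    def random_schedule():
        return tuple(sorted(rng.sample(range(total_steps), n_keep)))

    def mutate(schedule):
        kept = set(schedule)
        for t in list(kept):
            if rng.random() < mutate_p:        # randomly swap a step for another one
                kept.discard(t)
                kept.add(rng.randrange(total_steps))
        while len(kept) < n_keep:              # top up after accidental collisions
            kept.add(rng.randrange(total_steps))
        return tuple(sorted(kept))

    population = [random_schedule() for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=fitness)[: pop_size // 4]   # keep the best quarter
        population = parents + [mutate(rng.choice(parents))
                                for _ in range(pop_size - len(parents))]
    return min(population, key=fitness)

# toy fitness: prefer 4-step schedules spread evenly across a 1000-step process
best = evolve_timesteps(1000, 4, lambda s: sum(abs(a - b) for a, b in
                                               zip(s, (125, 375, 625, 875))))
print(best)
```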

Sample-adaptive Augmentation for Point Cloud Recognition Against Real-world Corruptions

  • paper_url: http://arxiv.org/abs/2309.10431
  • repo_url: https://github.com/roywangj/adaptpoint
  • paper_authors: Jie Wang, Lihe Ding, Tingfa Xu, Shaocong Dong, Xinli Xu, Long Bai, Jianan Li
  • for: 提高3D视觉中的稳定性和可靠性,应对各种潜在的损害和噪声。
  • methods: 提出了一种自适应增强框架,名为AdaptPoint,通过基于样本结构的适应变换来应对潜在的损害。同时,还引入了一种听众指导反馈机制,以便生成合适的样本难度水平。
  • results: 实验结果表明,我们的方法在多种损害 benchmarcks 上达到了状态的最佳结果,包括 ModelNet-C、我们的 ScanObjectNN-C 和 ShapeNet-C。
    Abstract Robust 3D perception under corruption has become an essential task for the realm of 3D vision. While current data augmentation techniques usually perform random transformations on all point cloud objects in an offline way and ignore the structure of the samples, resulting in over-or-under enhancement. In this work, we propose an alternative to make sample-adaptive transformations based on the structure of the sample to cope with potential corruption via an auto-augmentation framework, named as AdaptPoint. Specially, we leverage a imitator, consisting of a Deformation Controller and a Mask Controller, respectively in charge of predicting deformation parameters and producing a per-point mask, based on the intrinsic structural information of the input point cloud, and then conduct corruption simulations on top. Then a discriminator is utilized to prevent the generation of excessive corruption that deviates from the original data distribution. In addition, a perception-guidance feedback mechanism is incorporated to guide the generation of samples with appropriate difficulty level. Furthermore, to address the paucity of real-world corrupted point cloud, we also introduce a new dataset ScanObjectNN-C, that exhibits greater similarity to actual data in real-world environments, especially when contrasted with preceding CAD datasets. Experiments show that our method achieves state-of-the-art results on multiple corruption benchmarks, including ModelNet-C, our ScanObjectNN-C, and ShapeNet-C.

Predicate Classification Using Optimal Transport Loss in Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2309.10430
  • repo_url: None
  • paper_authors: Sorachi Kurita, Satoshi Oyama, Itsuki Noda
  • for: 提高Scene Graph生成(SGG)中的预测准确率,避免由数据集中关系标签的分布偏度导致的预测偏误。
  • methods: 使用最佳运输损失来比较两个概率分布的相似性,并在 predicate classification 中使用学习最佳运输损失来生成Scene Graph。
  • results: 相比现有方法,提出的方法在 mean Recall@50 和 100 上表现出色,并且提高了数据集中罕见的关系标签的召回率。
    Abstract In scene graph generation (SGG), learning with cross-entropy loss yields biased predictions owing to the severe imbalance in the distribution of the relationship labels in the dataset. Thus, this study proposes a method to generate scene graphs using optimal transport as a measure for comparing two probability distributions. We apply learning with the optimal transport loss, which reflects the similarity between the labels in terms of transportation cost, for predicate classification in SGG. In the proposed approach, the transportation cost of the optimal transport is defined using the similarity of words obtained from the pre-trained model. The experimental evaluation of the effectiveness demonstrates that the proposed method outperforms existing methods in terms of mean Recall@50 and 100. Furthermore, it improves the recall of the relationship labels scarcely available in the dataset.
    摘要 在场景图生成(SGG)中,使用交叉熵损失会导致预测结果受到数据集中关系标签分布严重不均衡的影响,因此这项研究提出了一种使用最优传输作为比较两个概率分布的度量来生成场景图的方法。我们在该方法中使用最优传输损失,该损失以传输成本反映标签之间的相似性,用于 predicate 分类。在我们的方法中,最优传输的传输成本是基于预训练模型得到的词语相似度来定义的。实验评估结果表明,所提出的方法在 mean Recall@50 和 100 上超过了现有方法,并且提高了数据集中罕见的关系标签的召回率。
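The core idea, replacing cross-entropy with an optimal transport loss whose ground cost comes from word similarity, can be sketched as follows. This is a generic Sinkhorn-based approximation under assumed shapes, not the paper's exact formulation; `class_embeddings` stands in for pre-trained word vectors of the predicate labels.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(a, b, C, eps=0.3, n_iters=100):
    """Entropic-OT transport plan between histograms a (n,) and b (m,) with cost C (n, m)."""
    K = torch.exp(-C / eps)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # diag(u) K diag(v)

def ot_predicate_loss(logits, target, class_embeddings, eps=0.3):
    """OT loss between the predicted predicate distribution and a one-hot target,
    with ground cost = 1 - cosine similarity of predicate word embeddings."""
    probs = F.softmax(logits, dim=-1)
    emb = F.normalize(class_embeddings, dim=-1)
    cost = 1.0 - emb @ emb.t()                      # (K, K) label-to-label cost
    one_hot = F.one_hot(target, num_classes=logits.shape[-1]).float()
    plan = sinkhorn_plan(probs, one_hot, cost, eps=eps)
    return (plan * cost).sum()

# Toy usage with 5 predicate classes and random "word" embeddings.
torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)
target = torch.tensor(2)
class_emb = torch.randn(5, 300)                     # stand-in for pre-trained word vectors
loss = ot_predicate_loss(logits, target, class_emb)
loss.backward()
print(loss.item())
```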

Exploring Different Levels of Supervision for Detecting and Localizing Solar Panels on Remote Sensing Imagery

  • paper_url: http://arxiv.org/abs/2309.10421
  • repo_url: None
  • paper_authors: Maarten Burger, Rob Wijnhoven, Shaodi You
  • for: 本研究探讨了遥感图像中的对象存在检测和定位问题,重点是太阳能板识别。研究比较了不同级别的监督,包括全监督物体探测器、弱监督图像分类器和最小监督异常探测器。
  • methods: 研究采用了不同监督级别的模型,包括全监督物体探测器、基于 CAM 定位的弱监督图像分类器和最小监督异常探测器。
  • results: 研究结果显示,分类器在二分类存在检测中取得 0.79 的 F1 分数,而物体探测器则在精确定位方面取得 0.72 的成绩。异常探测器需要更多数据才能达到可用性能。模型结果的融合可能会提高准确性。CAM 对定位的影响有限,其中 GradCAM、GradCAM++ 和 HiResCAM 的定位效果较好。另外,与物体探测器不同,分类器在数据量较少时仍然保持了良好的鲁棒性。
    Abstract This study investigates object presence detection and localization in remote sensing imagery, focusing on solar panel recognition. We explore different levels of supervision, evaluating three models: a fully supervised object detector, a weakly supervised image classifier with CAM-based localization, and a minimally supervised anomaly detector. The classifier excels in binary presence detection (0.79 F1-score), while the object detector (0.72) offers precise localization. The anomaly detector requires more data for viable performance. Fusion of model results shows potential accuracy gains. CAM impacts localization modestly, with GradCAM, GradCAM++, and HiResCAM yielding superior results. Notably, the classifier remains robust with less data, in contrast to the object detector.
    摘要 本研究探讨遥感图像中的对象存在检测与定位,重点是太阳能板识别。我们考察了不同级别的监督,评估三种模型:完全监督的物体探测器、基于 CAM 定位的弱监督图像分类器,以及最小监督的异常探测器。分类器在二分类存在检测中表现出色(F1 分数 0.79),而物体探测器(F1 分数 0.72)具有精确的定位能力。异常探测器需要更多数据才能实现可靠的表现。模型结果的融合显示了潜在的准确性提升。CAM 对定位的影响较小,其中 GradCAM、GradCAM++ 和 HiResCAM 表现较好。值得注意的是,与物体探测器不同,分类器在数据较少时仍然保持了鲁棒性。
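Since CAM variants drive the weak localization here, a minimal Grad-CAM sketch may help make the mechanism concrete. The backbone and layer choice below are assumptions; a generic torchvision ResNet stands in for the paper's solar-panel classifier.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def grad_cam(model, layer, image, class_idx):
    """Class activation map from the gradients of `class_idx` w.r.t. `layer`'s output."""
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(image)                                  # (1, num_classes)
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)    # GAP over spatial dims
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove(); h2.remove()

# Toy usage: a generic ImageNet backbone stands in for the solar-panel classifier.
model = resnet18(weights=None).eval()
image = torch.randn(1, 3, 224, 224, requires_grad=True)
heatmap = grad_cam(model, model.layer4, image, class_idx=0)
print(heatmap.shape)   # torch.Size([1, 1, 224, 224])
```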

SideGAN: 3D-Aware Generative Model for Improved Side-View Image Synthesis

  • paper_url: http://arxiv.org/abs/2309.10388
  • repo_url: None
  • paper_authors: Kyungmin Jo, Wonjoon Jin, Jaegul Choo, Hyunjoon Lee, Sunghyun Cho
  • for: 该 paper 主要 targets 生成高质量的三维图像,尤其是在侧视角度下。
  • methods: 该 paper 提出了一种新的 GAN 训练方法,即 SideGAN,可以生成不同视角下的高质量图像。为了解决 pose 难以学习和 photo-realism 同时学习的问题,该方法将问题分解为两个更容易解决的子问题。
  • results: 该 paper 通过了广泛的验证,证明了 SideGAN 可以生成高质量的三维图像,不受 camera pose 的影响。
    Abstract While recent 3D-aware generative models have shown photo-realistic image synthesis with multi-view consistency, the synthesized image quality degrades depending on the camera pose (e.g., a face with a blurry and noisy boundary at a side viewpoint). Such degradation is mainly caused by the difficulty of learning both pose consistency and photo-realism simultaneously from a dataset with heavily imbalanced poses. In this paper, we propose SideGAN, a novel 3D GAN training method to generate photo-realistic images irrespective of the camera pose, especially for faces of side-view angles. To ease the challenging problem of learning photo-realistic and pose-consistent image synthesis, we split the problem into two subproblems, each of which can be solved more easily. Specifically, we formulate the problem as a combination of two simple discrimination problems, one of which learns to discriminate whether a synthesized image looks real or not, and the other learns to discriminate whether a synthesized image agrees with the camera pose. Based on this, we propose a dual-branched discriminator with two discrimination branches. We also propose a pose-matching loss to learn the pose consistency of 3D GANs. In addition, we present a pose sampling strategy to increase learning opportunities for steep angles in a pose-imbalanced dataset. With extensive validation, we demonstrate that our approach enables 3D GANs to generate high-quality geometries and photo-realistic images irrespective of the camera pose.
    摘要 最近的3D感知生成模型已经能够生成多视图一致的真实图像,但生成图像质量会随摄像头姿态的变化而下降(例如,侧视角度下的人脸会有模糊和噪声的边缘)。这种下降主要是由于很难从姿态分布严重不均衡的数据集中同时学习姿态一致性和照片级真实感。在这篇论文中,我们提出了 SideGAN,一种新的3D GAN 训练方法,可以不受摄像头姿态的限制生成真实图像,尤其是侧视角度下的人脸。为了缓解同时学习照片级真实感和姿态一致性这一复杂问题,我们将其分解为两个更容易解决的子问题。具体来说,我们将问题定义为两个简单判别问题的组合:一个判别生成图像是否真实,另一个判别生成图像是否与摄像头姿态一致。基于此,我们提出了带有两个判别分支的双分支判别器,以及用于学习3D GAN 姿态一致性的姿态匹配损失。此外,我们还提出了一种姿态采样策略,以增加在姿态不均衡数据集中对陡峭角度的学习机会。通过广泛验证,我们证明了我们的方法可以使3D GANs生成高质量的几何和真实图像,而不受摄像头姿态的限制。
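The dual-branched discriminator can be illustrated schematically. The layer sizes and the 3-parameter pose encoding below are assumptions rather than SideGAN's actual architecture; the sketch only shows the split into a realism head and an image/pose-agreement head.

```python
import torch
import torch.nn as nn

class DualBranchDiscriminator(nn.Module):
    """Shared convolutional trunk with two heads: one scores realism,
    the other scores agreement between the image and its camera pose."""
    def __init__(self, pose_dim: int = 3, feat_dim: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, feat_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.real_head = nn.Linear(feat_dim, 1)                  # real vs. fake
        self.pose_head = nn.Linear(feat_dim + pose_dim, 1)       # image/pose agreement

    def forward(self, image, pose):
        f = self.trunk(image)
        return self.real_head(f), self.pose_head(torch.cat([f, pose], dim=1))

# Toy usage: a batch of 64x64 images with (yaw, pitch, roll) camera poses.
disc = DualBranchDiscriminator()
real_score, pose_score = disc(torch.randn(4, 3, 64, 64), torch.randn(4, 3))
print(real_score.shape, pose_score.shape)   # torch.Size([4, 1]) torch.Size([4, 1])
```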

Pointing out Human Answer Mistakes in a Goal-Oriented Visual Dialogue

  • paper_url: http://arxiv.org/abs/2309.10375
  • repo_url: None
  • paper_authors: Ryosuke Oshima, Seitaro Shinagawa, Hideki Tsunashima, Qi Feng, Shigeo Morishima
  • for: 该论文旨在研究人工智能与人类交互中的有效沟通方法,以便解决复杂问题。
  • methods: 该论文使用视觉对话来助人类完成问题,并分析了人类答案错误的因素,以提高模型的准确率。
  • results: 经过实验,研究发现人类答案错误的因素包括问题类型和QA转数,这些因素对模型的准确率有重要影响。
    Abstract Effective communication between humans and intelligent agents has promising applications for solving complex problems. One such approach is visual dialogue, which leverages multimodal context to assist humans. However, real-world scenarios occasionally involve human mistakes, which can cause intelligent agents to fail. While most prior research assumes perfect answers from human interlocutors, we focus on a setting where the agent points out unintentional mistakes for the interlocutor to review, better reflecting real-world situations. In this paper, we show that human answer mistakes depend on question type and QA turn in the visual dialogue by analyzing a previously unused data collection of human mistakes. We demonstrate the effectiveness of those factors for the model's accuracy in a pointing-human-mistake task through experiments using a simple MLP model and a Visual Language Model.
    摘要 人机对话可以有效地解决复杂问题,其中一种方法是视觉对话,它利用多ModalContext来帮助人类。然而,实际情况中有时会出现人类的错误,这会导致智能代理人失败。而大多数先前的研究假设了人类回答是完美的,我们则关注实际情况中人类的意外错误,并对这些错误进行分析。在这篇论文中,我们发现人类回答错误的因素取决于问题类型和QA轮次在视觉对话中。我们通过使用简单的MLP模型和视觉语言模型进行实验,证明这些因素对模型准确性的影响。

GloPro: Globally-Consistent Uncertainty-Aware 3D Human Pose Estimation & Tracking in the Wild

  • paper_url: http://arxiv.org/abs/2309.10369
  • repo_url: None
  • paper_authors: Simon Schaefer, Dorian F. Henning, Stefan Leutenegger
  • for: 提高人机交互的安全性和效率,通过提供高精度的3D人体姿态估计。
  • methods: 利用视觉启示和学习的动作模型,效果地融合视觉启示,预测3D人体姿态的不确定性分布,包括形状、姿态和根姿态。
  • results: 与现有方法相比,在世界坐标系中的人体轨迹准确率得到了大幅提高(即使面临严重遮挡),能够提供一致的不确定性分布,并可以在实时下运行。
    Abstract An accurate and uncertainty-aware 3D human body pose estimation is key to enabling truly safe but efficient human-robot interactions. Current uncertainty-aware methods in 3D human pose estimation are limited to predicting the uncertainty of the body posture, while effectively neglecting the body shape and root pose. In this work, we present GloPro, which to the best of our knowledge the first framework to predict an uncertainty distribution of a 3D body mesh including its shape, pose, and root pose, by efficiently fusing visual clues with a learned motion model. We demonstrate that it vastly outperforms state-of-the-art methods in terms of human trajectory accuracy in a world coordinate system (even in the presence of severe occlusions), yields consistent uncertainty distributions, and can run in real-time.
    摘要 一个准确且具备不确定性感知的3D人体姿态估计是实现真正安全且高效的人机交互的关键。目前的不确定性感知方法仅预测身体姿势的不确定性,而忽略了身体形状和根姿态。在这项工作中,我们提出了 GloPro,据我们所知,这是第一个通过有效融合视觉线索与学习的运动模型,来预测包括身体形状、姿态和根姿态在内的3D身体网格不确定性分布的框架。我们展示了它在世界坐标系中的人体轨迹精度高于当前最先进方法(即使存在严重遮挡),能给出一致的不确定性分布,并且可以实时运行。

Improving CLIP Robustness with Knowledge Distillation and Self-Training

  • paper_url: http://arxiv.org/abs/2309.10361
  • repo_url: None
  • paper_authors: Clement Laroudie, Andrei Bursuc, Mai Lan Ha, Gianni Franchi
  • for: 本研究旨在评估CLIP模型在无监督学习上的 robustness,同时探索可以增强其 robustness 的策略。
  • methods: 我们提出了一种名为LP-CLIP的新方法,即在CLIP模型的编码结构上添加一个线性探测层,并通过使用CLIP生成的pseudo-标签和自我训练策略来训练这层。
  • results: 我们的LP-CLIP方法可以增强CLIP模型的鲁棒性,并在多个数据集上达到超越监督技术的最先进结果。此外,我们的方法不需要标注数据,因此在实际应用中更加有效。
    Abstract This paper examines the robustness of a multi-modal computer vision model, CLIP (Contrastive Language-Image Pretraining), in the context of unsupervised learning. The main objective is twofold: first, to evaluate the robustness of CLIP, and second, to explore strategies for augmenting its robustness. To achieve this, we introduce a novel approach named LP-CLIP. This technique involves the distillation of CLIP features through the incorporation of a linear probing layer positioned atop its encoding structure. This newly added layer is trained utilizing pseudo-labels produced by CLIP, coupled with a self-training strategy. The LP-CLIP technique offers a promising approach to enhance the robustness of CLIP without the need for annotations. By leveraging a simple linear probing layer, we aim to improve the model's ability to withstand various uncertainties and challenges commonly encountered in real-world scenarios. Importantly, our approach does not rely on annotated data, which makes it particularly valuable in situations where labeled data might be scarce or costly to obtain. Our proposed approach increases the robustness of CLIP with SOTA results compared to supervised technique on various datasets.
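The LP-CLIP recipe, a linear probe over frozen CLIP features trained on CLIP-generated pseudo-labels with self-training, can be sketched generically as below. The confidence threshold and refresh schedule are assumptions; random tensors stand in for real CLIP image features and zero-shot logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def confident_mask(probs, threshold):
    """Select samples whose top probability exceeds `threshold` (all samples as fallback)."""
    conf, pseudo = probs.max(dim=-1)
    keep = conf > threshold
    if not keep.any():
        keep = torch.ones_like(conf, dtype=torch.bool)
    return keep, pseudo

def self_train_linear_probe(image_features, zero_shot_logits, num_classes,
                            threshold=0.5, epochs=10, lr=1e-3):
    """Fit a linear head on frozen image features using confident pseudo-labels,
    refreshing the pseudo-labels with the head's own predictions (self-training)."""
    keep, pseudo = confident_mask(zero_shot_logits.softmax(dim=-1), threshold)
    head = nn.Linear(image_features.shape[1], num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(head(image_features[keep]), pseudo[keep])
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            keep, pseudo = confident_mask(head(image_features).softmax(dim=-1), threshold)
    return head

# Toy usage: random vectors stand in for frozen CLIP image features and zero-shot logits.
torch.manual_seed(0)
feats, zs_logits = torch.randn(100, 512), torch.randn(100, 10)
head = self_train_linear_probe(feats, zs_logits, num_classes=10)
```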

RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing

  • paper_url: http://arxiv.org/abs/2309.10356
  • repo_url: None
  • paper_authors: Jiahang Li, Yikang Zhang, Peng Yun, Guangliang Zhou, Qijun Chen, Rui Fan
  • for: 本研究旨在提出一种基于Transformer的数据融合网络 RoadFormer,用于道路场景分解。
  • methods: RoadFormer 使用 duplex encoder 架构从 RGB 图像和表面法线信息中提取异构特征,并使用异构特征协同块进行有效的特征融合和重新校准。
  • results: 在 SYN-UDTIRI 数据集以及 KITTI road、CityScapes 和 ORFD 三个公共数据集上,RoadFormer 的表现优于所有现有的最先进网络,并在 KITTI road 基准上排名第一。
    Abstract The recent advancements in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, the existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset that contains over 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets, including KITTI road, CityScapes, and ORFD, demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI road benchmark. Our source code, created dataset, and demo video are publicly available at mias.group/RoadFormer.
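A schematic of the duplex-encoder idea (separate RGB and surface-normal branches whose features are fused and recalibrated) is sketched below. This is not RoadFormer's heterogeneous feature synergy block; the channel-attention gate and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DuplexFusionBlock(nn.Module):
    """Two parallel encoders (RGB and surface normals) whose features are fused
    and recalibrated with a channel-attention gate (squeeze-and-excitation style)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        def branch():
            return nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                                 nn.BatchNorm2d(channels), nn.ReLU())
        self.rgb_enc, self.normal_enc = branch(), branch()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb, normals):
        f = self.fuse(torch.cat([self.rgb_enc(rgb), self.normal_enc(normals)], dim=1))
        return f * self.gate(f)                     # recalibrated fused features

# Toy usage with random RGB and normal maps.
block = DuplexFusionBlock()
out = block(torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128))
print(out.shape)   # torch.Size([2, 64, 128, 128])
```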

Language Guided Adversarial Purification

  • paper_url: http://arxiv.org/abs/2309.10348
  • repo_url: None
  • paper_authors: Himanshu Singh, A V Subramanyam
  • for: 防御对抗攻击
  • methods: 使用预训练的扩散模型和图像描述(caption)生成器进行语言引导的对抗净化
  • results: 在强对抗攻击下评估,防御性能优于多数现有方法,且不需要针对攻击的专门网络训练
    Abstract Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods known as adversarial training requires specific knowledge of attack vectors, forcing them to be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.
    摘要 使用生成模型进行对抗净化可以达到强大的对抗防御性能。这些方法与分类器和攻击方式无关,因此非常灵活,但经常需要大量计算资源。扩散和分数网络方面的最新进展改进了图像生成,并进而改进了对抗净化。另一种非常高效的对抗防御方法是对抗训练,但它需要攻击向量的特定知识,因此必须在对抗样本上进行大量训练。为了克服这些限制,我们介绍了一个新的框架,即语言引导对抗净化(LGAP),使用预训练的扩散模型和caption生成器来防御对抗攻击。给定一个输入图像,我们的方法首先生成一个caption,然后使用它通过扩散网络来引导对抗净化过程。我们的方法已经在强对抗攻击下进行评估,证明了它在增强对抗鲁棒性方面的有效性。我们的结果表明,LGAP在不需要专门网络训练的情况下超越了大多数现有的对抗防御技术。这反映了在大规模数据上训练的模型的泛化能力,标示了一个有前景的研究方向。

Anti-Aliased Neural Implicit Surfaces with Encoding Level of Detail

  • paper_url: http://arxiv.org/abs/2309.10336
  • repo_url: None
  • paper_authors: Yiyu Zhuang, Qi Zhang, Ying Feng, Hao Zhu, Yao Yao, Xiaoyu Li, Yan-Pei Cao, Ying Shan, Xun Cao
  • for: 高频几何细节恢复和抗锯齿的新视图渲染
  • methods: 受带有细节层级(LoD)的体素表示启发,提出多尺度三平面场景表示,能够刻画符号距离函数(SDF)与空间辐射的细节层级。
  • results: 比 state-of-the-art 方法更高效的surface 重建和 photorealistic 视图合成。
    Abstract We present LoD-NeuS, an efficient neural representation for high-frequency geometry detail recovery and anti-aliased novel view rendering. Drawing inspiration from voxel-based representations with the level of detail (LoD), we introduce a multi-scale tri-plane-based scene representation that is capable of capturing the LoD of the signed distance function (SDF) and the space radiance. Our representation aggregates space features from a multi-convolved featurization within a conical frustum along a ray and optimizes the LoD feature volume through differentiable rendering. Additionally, we propose an error-guided sampling strategy to guide the growth of the SDF during the optimization. Both qualitative and quantitative evaluations demonstrate that our method achieves superior surface reconstruction and photorealistic view synthesis compared to state-of-the-art approaches.
    摘要 我们提出 LoD-NeuS,一种用于高频几何细节恢复和抗锯齿新视图渲染的高效神经表示。受带有细节层级(LoD)的体素表示启发,我们引入了一种基于多尺度三平面的场景表示,能够刻画符号距离函数(SDF)与空间辐射的细节层级。我们的表示沿光线在圆锥视锥内聚合经多重卷积特征化得到的空间特征,并通过可微渲染优化 LoD 特征体。此外,我们提出了一种误差引导的采样策略,在优化过程中引导 SDF 的增长。定性和定量评估均表明,与最先进方法相比,我们的方法实现了更优的表面重建和照片级真实的视图合成。
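The tri-plane part of the representation can be illustrated with a single-scale feature query; the multi-scale, cone-aware aggregation and LoD handling in LoD-NeuS are more involved and not reproduced here.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Query tri-plane features for 3D points in [-1, 1]^3.

    planes : tensor (3, C, H, W) holding the xy, xz and yz feature planes.
    points : tensor (N, 3) of query coordinates.
    Returns a (N, C) feature by summing the three bilinear samples.
    """
    coords = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]  # xy, xz, yz
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                                # (1, N, 1, 2) in [-1, 1]
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode="bilinear", align_corners=True)  # (1, C, N, 1)
        feats.append(sampled.squeeze(0).squeeze(-1).t())           # (N, C)
    return sum(feats)

# Toy usage: 32-channel planes at 64x64 resolution, queried at 5 random points.
planes = torch.randn(3, 32, 64, 64)
pts = torch.rand(5, 3) * 2 - 1
print(sample_triplane(planes, pts).shape)   # torch.Size([5, 32])
```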

Multi-dimension Queried and Interacting Network for Stereo Image Deraining

  • paper_url: http://arxiv.org/abs/2309.10319
  • repo_url: https://github.com/chdwyb/mqinet
  • paper_authors: Yuanbo Wen, Tao Gao, Ziqi Li, Jing Zhang, Ting Chen
  • for: 这个论文的目的是实现高效地除去双投影像中的雨尘变形。
  • methods: 这个方法使用多维度查询和互动来构建双投影像的雨尘除去模型。具体来说,它使用了一个具有上下文感知的维度别查询块(CDQB),这个模块利用维度别查询来获取双投影像中的关键特征,并且使用全局上下文感知注意力(GCA)来捕捉双投影像中的重要特征,而不是捕捉无用或不相关的信息。此外,它还引入了一个间iewPhysics-aware注意力(IPA),这个注意力基于雨水影像的倒数物理模型,可以提取双投影像中的潜在雨尘特征,并且帮助降低雨尘相关的artefacts在早期学习阶段。最后,它组合了多维度互动来增强两个看到之间的特征互动。
  • results: 实验结果显示,该模型比EPRRNet和StereoIRR更高效,具体来说,它在PSNR方面比EPRRNet和StereoIRR提高了4.18 dB和0.45 dB。代码和模型可以在\url{https://github.com/chdwyb/MQINet}上获取。
    Abstract Eliminating the rain degradation in stereo images poses a formidable challenge, which necessitates the efficient exploitation of mutual information present between the dual views. To this end, we devise MQINet, which employs multi-dimension queries and interactions for stereo image deraining. More specifically, our approach incorporates a context-aware dimension-wise queried block (CDQB). This module leverages dimension-wise queries that are independent of the input features and employs global context-aware attention (GCA) to capture essential features while avoiding the entanglement of redundant or irrelevant information. Meanwhile, we introduce an intra-view physics-aware attention (IPA) based on the inverse physical model of rainy images. IPA extracts shallow features that are sensitive to the physics of rain degradation, facilitating the reduction of rain-related artifacts during the early learning period. Furthermore, we integrate a cross-view multi-dimension interacting attention mechanism (CMIA) to foster comprehensive feature interaction between the two views across multiple dimensions. Extensive experimental evaluations demonstrate the superiority of our model over EPRRNet and StereoIRR, achieving respective improvements of 4.18 dB and 0.45 dB in PSNR. Code and models are available at \url{https://github.com/chdwyb/MQINet}.
    摘要 消除立体图像中的雨水退化是一项艰巨的挑战,需要高效利用双视图之间的互信息。为此,我们设计了MQINet,它使用多维查询和交互来进行立体图像去雨。更具体地说,我们的方法包括一个上下文感知的逐维查询块(CDQB)。该模块利用与输入特征无关的逐维查询,并使用全局上下文感知注意力(GCA)来捕捉重要特征,同时避免纠缠冗余或无关的信息。此外,我们引入了基于雨天图像逆物理模型的视图内物理感知注意力(IPA),它提取对雨水退化物理过程敏感的浅层特征,从而在学习早期减少与雨水相关的伪影。我们还将跨视图多维交互注意力机制(CMIA)与CDQB和IPA相结合,以促进双视图之间跨多个维度的全面特征交互。大量实验评估表明,我们的模型优于EPRRNet和StereoIRR,在PSNR上分别提升了4.18 dB和0.45 dB。代码和模型可以在 https://github.com/chdwyb/MQINet 找到。

Dive Deeper into Rectifying Homography for Stereo Camera Online Self-Calibration

  • paper_url: http://arxiv.org/abs/2309.10314
  • repo_url: None
  • paper_authors: Hongbo Zhao, Yikang Zhang, Qijun Chen, Rui Fan
  • for: 本文提出了一种基于校正单应性(rectifying homography)的立体相机在线自标定算法,用于在仅有一对图像时对立体相机进行标定。
  • methods: 本文提出了一种简单而有效的全局最优外参估计方法,并引入四种新的评价指标用于评估外参估计的精度和鲁棒性。
  • results: 广泛的实验结果表明,提出的算法在室内和室外环境中的多种实验设置下具有更高的性能,相比基准算法。
    Abstract Accurate estimation of stereo camera extrinsic parameters is the key to guarantee the performance of stereo matching algorithms. In prior arts, the online self-calibration of stereo cameras has commonly been formulated as a specialized visual odometry problem, without taking into account the principles of stereo rectification. In this paper, we first delve deeply into the concept of rectifying homography, which serves as the cornerstone for the development of our novel stereo camera online self-calibration algorithm, for cases where only a single pair of images is available. Furthermore, we introduce a simple yet effective solution for global optimum extrinsic parameter estimation in the presence of stereo video sequences. Additionally, we emphasize the impracticality of using three Euler angles and three components in the translation vectors for performance quantification. Instead, we introduce four new evaluation metrics to quantify the robustness and accuracy of extrinsic parameter estimation, applicable to both single-pair and multi-pair cases. Extensive experiments conducted across indoor and outdoor environments using various experimental setups validate the effectiveness of our proposed algorithm. The comprehensive evaluation results demonstrate its superior performance in comparison to the baseline algorithm. Our source code, demo video, and supplement are publicly available at mias.group/StereoCalibrator.
    摘要 针对仅有单对图像的情况,本文首先深入探讨了校正单应性(rectifying homography)的概念,该概念是我们提出的新型立体相机在线自标定算法的基础。然后,我们提出了一种简单而有效的全局最优外参估计解决方案,适用于立体视频序列的情形。此外,我们强调了使用三个欧拉角和三个平移分量来量化性能的不现实性,转而引入了四个新的评估指标来评估外参估计的精度和鲁棒性,适用于单对图像和多对图像两种情况。我们在室内和室外环境中使用多种实验设置进行了广泛的实验,结果证明了所提算法的有效性,其综合评估表现优于基线算法。我们的源代码、演示视频和补充材料均公开发布于 mias.group/StereoCalibrator。

Decoupled Training: Return of Frustratingly Easy Multi-Domain Learning

  • paper_url: http://arxiv.org/abs/2309.10302
  • repo_url: None
  • paper_authors: Ximei Wang, Junwei Pan, Xingzhuo Guo, Dapeng Liu, Jie Jiang
  • for: 本研究的目的是提出一种简单、无参数的多Domain学习方法,以解决多个领域之间的数据偏袋和领域占据问题。
  • methods: 本研究使用的方法是一种三阶段、从通用到特定的训练策略:首先在所有领域上进行预热(warm-up)训练得到根模型,然后将模型拆分为多个头并在各领域上分别后训练,最后在固定主干(backbone)的情况下微调各个头。
  • results: 研究表明,使用该方法可以在各种数据集上达到惊人的性能,包括标准评价指标和卫星图像和推荐系统等应用领域。
    Abstract Multi-domain learning (MDL) aims to train a model with minimal average risk across multiple overlapping but non-identical domains. To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training(D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multi heads, and finally fine-tunes the heads by fixing the backbone, enabling decouple training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems.
    摘要 In this paper, we propose a simple and hyperparameter-free multi-domain learning method called Decoupled Training (D-Train). D-Train is a three-phase training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multiple heads, and finally fine-tunes the heads by fixing the backbone. This approach enables decoupled training to achieve domain independence.Despite its simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems.

360$^\circ$ Reconstruction From a Single Image Using Space Carved Outpainting

  • paper_url: http://arxiv.org/abs/2309.10279
  • repo_url: None
  • paper_authors: Nuri Ryu, Minsu Gong, Geonung Kim, Joo-Haeng Lee, Sunghyun Cho
  • for: 创建一个全景360°视图的3D模型从单一图像
  • methods: 组合使用单目深度与法线预测、空间雕刻(space carving)、在大规模图像数据集上预训练的生成模型,以及神经隐式表面重建方法
  • results: 提供了一个高度普适的方法,能够在不同的自然场景中进行高质量的3D重建,并且比同期作品提高了重建的精度和自然性
    Abstract We introduce POP3D, a novel framework that creates a full $360^\circ$-view 3D model from a single image. POP3D resolves two prominent issues that limit the single-view reconstruction. Firstly, POP3D offers substantial generalizability to arbitrary categories, a trait that previous methods struggle to achieve. Secondly, POP3D further improves reconstruction fidelity and naturalness, a crucial aspect that concurrent works fall short of. Our approach marries the strengths of four primary components: (1) a monocular depth and normal predictor that serves to predict crucial geometric cues, (2) a space carving method capable of demarcating the potentially unseen portions of the target object, (3) a generative model pre-trained on a large-scale image dataset that can complete unseen regions of the target, and (4) a neural implicit surface reconstruction method tailored in reconstructing objects using RGB images along with monocular geometric cues. The combination of these components enables POP3D to readily generalize across various in-the-wild images and generate state-of-the-art reconstructions, outperforming similar works by a significant margin. Project page: \url{http://cg.postech.ac.kr/research/POP3D}
    摘要 我们引入POP3D,一个新的框架,可以从单一的图像中生成全景360°的3D模型。POP3D解决了限制单视图重建的两个主要问题:其一是对任意类别的泛化能力,这是以往方法难以达到的;其二是重建的保真度与自然度,这正是同期作品所欠缺的。POP3D的方法结合了四个主要组成部分:(1)一个单眼深度和法线预测器,用于预测重要的几何线索;(2)一种能够划分目标物潜在未见部分的空间雕刻(space carving)方法;(3)一个基于大规模图像集预训练的生成模型,用于补全目标物的未见区域;以及(4)一个适用于RGB图像和单眼几何线索的神经隐式表面重建方法。这些组件的结合使得POP3D可以轻松泛化至各种在野的图像,并生成明显超越相似作品的最先进重建效果。项目页面:http://cg.postech.ac.kr/research/POP3D

RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery

  • paper_url: http://arxiv.org/abs/2309.10255
  • repo_url: None
  • paper_authors: Jiaxin Wei, Xibin Song, Weizhe Liu, Laurent Kneip, Hongdong Li, Pan Ji
  • for: 提出一种新的估计流程,以解决基于RGB-D的类别级物体姿态估计方法对深度传感器的依赖问题。
  • methods: 利用预训练的单目估计器提取局部几何信息,主要用于辅助搜索2D-3D内点对应;另外设计了一个独立分支,基于类别级统计直接回归物体的度量尺度;最后使用RANSAC-PnP算法鲁棒地求解6D物体姿态。
  • results: 在合成与真实数据集上进行了广泛的实验,证明了该方法的精度优于此前基于RGB的方法,尤其是在旋转精度方面。
    Abstract While showing promising results, recent RGB-D camera-based category-level object pose estimation methods have restricted applications due to the heavy reliance on depth sensors. RGB-only methods provide an alternative to this problem yet suffer from inherent scale ambiguity stemming from monocular observations. In this paper, we propose a novel pipeline that decouples the 6D pose and size estimation to mitigate the influence of imperfect scales on rigid transformations. Specifically, we leverage a pre-trained monocular estimator to extract local geometric information, mainly facilitating the search for inlier 2D-3D correspondence. Meanwhile, a separate branch is designed to directly recover the metric scale of the object based on category-level statistics. Finally, we advocate using the RANSAC-P$n$P algorithm to robustly solve for 6D object pose. Extensive experiments have been conducted on both synthetic and real datasets, demonstrating the superior performance of our method over previous state-of-the-art RGB-based approaches, especially in terms of rotation accuracy.
    摘要 当前基于RGB-D相机的类别级物体姿态估计方法虽然展示了可观的结果,但由于严重依赖深度传感器,其应用受到限制。仅用RGB的方法可以缓解这个问题,但它们受到单目观测固有的尺度歧义影响。在这篇论文中,我们提出了一个新的流程,将6D姿态与尺寸估计解耦,以减轻不准确尺度对刚体变换的影响。我们利用预训练的单目估计器提取局部几何信息,主要用于辅助搜索2D-3D内点对应。同时,我们设计了一个分支,基于类别级统计直接回归物体的度量尺度。最后,我们提议使用RANSAC-PnP算法鲁棒地求解6D物体姿态。我们在合成和真实数据集上进行了广泛的实验,证明了我们的方法相对于此前最先进的基于RGB的方法具有更高的精度,尤其是旋转精度。
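The final pose-solving step named in the abstract, RANSAC-PnP over 2D-3D correspondences, can be shown with OpenCV on synthetic data. The correspondence search and metric-scale regression branches are outside this sketch; the intrinsics and point sets below are made up for the example.

```python
import cv2
import numpy as np

# Given 2D-3D correspondences, camera intrinsics, and metric-scaled object points,
# recover the 6D pose with RANSAC-PnP.
np.random.seed(0)
object_points = np.random.rand(30, 3).astype(np.float32)          # 3D points in object frame
rvec_gt = np.array([0.1, -0.2, 0.05], dtype=np.float32)
tvec_gt = np.array([0.0, 0.0, 2.0], dtype=np.float32)
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, None,
    reprojectionError=3.0, iterationsCount=100)
print(ok, rvec.ravel(), tvec.ravel())   # should approximately recover rvec_gt / tvec_gt
```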

UPL-SFDA: Uncertainty-aware Pseudo Label Guided Source-Free Domain Adaptation for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.10244
  • repo_url: https://github.com/hilab-git/upl-sfda
  • paper_authors: Jianghao Wu, Guotai Wang, Ran Gu, Tao Lu, Yinan Chen, Wentao Zhu, Tom Vercauteren, Sébastien Ourselin, Shaoting Zhang
  • for: 这篇研究的目的是为了提高深度学习基础的医疗影像分类模型在新目标领域中的测试影像处理,特别是在源领域数据不可用且目标领域数据没有标签的情况下进行调整。
  • methods: 本研究提出了一种新的不确定性感知伪标签引导的无源域自适应(SFDA)方法,包括目标域增长(TDG)与两次前向传播监督(TFS)策略,以及一个基于平均预测的熵最小化项。
  • results: 本研究在多中心心脏MRI分割、跨模态胎儿脑分割和3D胎儿组织分割三个任务上验证了UPL-SFDA方法,与基线相比,平均Dice分别提高了5.54、5.01和6.89个百分点。此外,UPL-SFDA还优于多种现有的SFDA方法。
    Abstract Domain Adaptation (DA) is important for deep learning-based medical image segmentation models to deal with testing images from a new target domain. As the source-domain data are usually unavailable when a trained model is deployed at a new center, Source-Free Domain Adaptation (SFDA) is appealing for data and annotation-efficient adaptation to the target domain. However, existing SFDA methods have a limited performance due to lack of sufficient supervision with source-domain images unavailable and target-domain images unlabeled. We propose a novel Uncertainty-aware Pseudo Label guided (UPL) SFDA method for medical image segmentation. Specifically, we propose Target Domain Growing (TDG) to enhance the diversity of predictions in the target domain by duplicating the pre-trained model's prediction head multiple times with perturbations. The different predictions in these duplicated heads are used to obtain pseudo labels for unlabeled target-domain images and their uncertainty to identify reliable pseudo labels. We also propose a Twice Forward pass Supervision (TFS) strategy that uses reliable pseudo labels obtained in one forward pass to supervise predictions in the next forward pass. The adaptation is further regularized by a mean prediction-based entropy minimization term that encourages confident and consistent results in different prediction heads. UPL-SFDA was validated with a multi-site heart MRI segmentation dataset, a cross-modality fetal brain segmentation dataset, and a 3D fetal tissue segmentation dataset. It improved the average Dice by 5.54, 5.01 and 6.89 percentage points for the three tasks compared with the baseline, respectively, and outperformed several state-of-the-art SFDA methods.
    摘要 域自适应(DA)对于基于深度学习的医疗影像分割模型适应新目标域的测试图像十分重要。由于训练好的模型部署到新中心时通常无法获得源域数据,无源域自适应(SFDA)在数据与标注效率方面具有吸引力。然而,现有的SFDA方法由于源域图像不可用且目标域图像缺乏标注、监督不足,性能有限。我们提出了一种新的不确定性感知伪标签引导(UPL)的SFDA方法,用于医疗影像分割。具体来说,我们提出了目标域增长(TDG)技术,通过对预训练模型的预测头施加扰动并复制多份,提高目标域预测的多样性;这些复制头的不同预测用于为无标注目标域图像生成伪标签,并利用其不确定性筛选可靠的伪标签。我们还提出了两次前向传播监督(TFS)策略,利用一次前向传播得到的可靠伪标签来监督下一次前向传播的预测。此外,自适应过程还通过一个基于平均预测的熵最小化项进行正则化,以鼓励不同预测头给出自信且一致的结果。UPL-SFDA在多中心心脏MRI分割数据集、跨模态胎儿脑分割数据集和3D胎儿组织分割数据集上进行了验证,相比基线,三个任务的平均Dice分别提高了5.54、5.01和6.89个百分点。此外,它还超过了多种最先进的SFDA方法。
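A compact sketch of the main ingredients, duplicated perturbed prediction heads (TDG), uncertainty-filtered pseudo-labels, and a mean-prediction entropy term, is given below. The head architecture, perturbation scale, and threshold are assumptions, and the two-forward-pass (TFS) schedule is omitted.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def duplicate_heads(head: nn.Module, n: int = 4, noise: float = 0.01):
    """Clone the pretrained prediction head n times with small weight perturbations."""
    heads = nn.ModuleList(copy.deepcopy(head) for _ in range(n))
    with torch.no_grad():
        for h in heads:
            for p in h.parameters():
                p.add_(noise * torch.randn_like(p))
    return heads

def upl_style_losses(features, heads, uncertainty_thresh=0.2):
    """Pseudo-label loss on low-uncertainty pixels plus mean-prediction entropy minimization."""
    probs = torch.stack([F.softmax(h(features), dim=1) for h in heads])   # (n, B, C, H, W)
    mean_p = probs.mean(dim=0)
    uncertainty = probs.std(dim=0).mean(dim=1)                            # (B, H, W)
    pseudo = mean_p.argmax(dim=1)                                         # (B, H, W)
    reliable = (uncertainty < uncertainty_thresh).float()
    ce = F.nll_loss((mean_p + 1e-6).log(), pseudo, reduction="none")      # per-pixel CE
    pseudo_loss = (ce * reliable).sum() / (reliable.sum() + 1e-6)
    entropy = -(mean_p * torch.log(mean_p + 1e-6)).sum(dim=1).mean()
    return pseudo_loss, entropy

# Toy usage: a 1x1-conv segmentation head over 16-channel features, 3 classes.
head = nn.Conv2d(16, 3, kernel_size=1)
heads = duplicate_heads(head)
feat = torch.randn(2, 16, 32, 32)
p_loss, ent = upl_style_losses(feat, heads)
print(p_loss.item(), ent.item())
```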

Transferable Adversarial Attack on Image Tampering Localization

  • paper_url: http://arxiv.org/abs/2309.10243
  • repo_url: None
  • paper_authors: Yuqi Wang, Gang Cao, Zijie Lou, Haochen Zhu
  • for: 评估现有数字图像篡改定位算法在实际应用中的安全性。
  • methods: 提出了一种针对篡改定位算法的对抗攻击方案,用于揭示其可靠性不足,使其无法正确预测被篡改区域。具体而言,基于优化和梯度构造对抗样本,实现白盒/黑盒攻击。
  • results: 广泛评估表明,所提出的攻击方案能够大幅降低定位准确率,同时保持被攻击图像的高视觉质量。
    Abstract It is significant to evaluate the security of existing digital image tampering localization algorithms in real-world applications. In this paper, we propose an adversarial attack scheme to reveal the reliability of such tampering localizers, which would be fooled and fail to predict altered regions correctly. Specifically, the adversarial examples based on optimization and gradient are implemented for white/black-box attacks. Correspondingly, the adversarial example is optimized via reverse gradient propagation, and the perturbation is added adaptively in the direction of gradient rising. The black-box attack is achieved by relying on the transferability of such adversarial examples to different localizers. Extensive evaluations verify that the proposed attack sharply reduces the localization accuracy while preserving high visual quality of the attacked images.
    摘要 在实际应用中评估现有数字图像篡改定位算法的安全性十分重要。在这篇论文中,我们提出了一种对抗攻击方案,用于揭示此类篡改定位器的可靠性:它们会被欺骗,从而无法正确预测被篡改的区域。具体而言,我们基于优化和梯度构造对抗样本,实现白盒/黑盒攻击。相应地,对抗样本通过反向梯度传播进行优化,并在梯度上升方向自适应地添加扰动。黑盒攻击则依赖此类对抗样本向不同定位器的可迁移性来实现。广泛评估表明,所提出的攻击方案能够大幅降低定位准确率,同时保持被攻击图像的高视觉质量。
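A generic white-box, gradient-based attack of this kind can be sketched with a PGD-style loop that pushes the localizer's predictions toward "untampered" everywhere while keeping the perturbation small. The toy one-layer localizer and the exact objective are assumptions; the paper's optimization details and black-box transfer are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_attack(localizer, image, eps=8 / 255, alpha=2 / 255, steps=10):
    """Perturb `image` within an L-infinity ball of radius eps so that the localizer's
    per-pixel tampering logits are pushed toward the 'untampered' decision."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        pred = localizer(adv)                                   # (B, 1, H, W) tamper logits
        loss = F.binary_cross_entropy_with_logits(pred, torch.zeros_like(pred))
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() - alpha * grad.sign()                # descend: suppress detections
        adv = image + torch.clamp(adv - image, -eps, eps)       # project back into the eps-ball
        adv = adv.clamp(0.0, 1.0)
    return adv.detach()

# Toy usage: a single conv layer stands in for a real tampering localization network.
localizer = nn.Conv2d(3, 1, kernel_size=3, padding=1)
img = torch.rand(1, 3, 64, 64)
adv = pgd_attack(localizer, img)
print((adv - img).abs().max().item())   # stays within eps
```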

Learning Point-wise Abstaining Penalty for Point Cloud Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.10230
  • repo_url: https://github.com/daniellli/pad
  • paper_authors: Shaocong Xu, Pengfei Li, Xinyu Liu, Qianpu Sun, Yang Li, Shihui Guo, Zhen Wang, Bo Jiang, Rui Wang, Kehua Sheng, Bo Zhang, Hao Zhao
  • for: 该论文旨在提高自动驾驶系统的LiDAR场景理解模块中的Out-Of-Distribution(OOD)点云识别能力。
  • methods: 该方法基于选择性分类,即在标准闭集分类设置中引入选择函数,学习逐点的弃权惩罚;该方法还包括一个强大的合成管道,用于生成来源各异的离群点。
  • results: 该方法在SemanticKITTI和nuScenes上达到了最先进的结果,并通过风险-覆盖率分析揭示了不同方法的内在性质。代码和模型将公开可用。
    Abstract LiDAR-based semantic scene understanding is an important module in the modern autonomous driving perception stack. However, identifying Out-Of-Distribution (OOD) points in a LiDAR point cloud is challenging as point clouds lack semantically rich features when compared with RGB images. We revisit this problem from the perspective of selective classification, which introduces a selective function into the standard closed-set classification setup. Our solution is built upon the basic idea of abstaining from choosing any known categories but learns a point-wise abstaining penalty with a marginbased loss. Synthesizing outliers to approximate unlimited OOD samples is also critical to this idea, so we propose a strong synthesis pipeline that generates outliers originated from various factors: unrealistic object categories, sampling patterns and sizes. We demonstrate that learning different abstaining penalties, apart from point-wise penalty, for different types of (synthesized) outliers can further improve the performance. We benchmark our method on SemanticKITTI and nuScenes and achieve state-of-the-art results. Risk-coverage analysis further reveals intrinsic properties of different methods. Codes and models will be publicly available.
    摘要 基于LiDAR的语义场景理解是现代自动驾驶感知栈中的重要模块。然而,在LiDAR点云中识别分布外(OOD)点具有挑战性,因为与RGB图像相比,点云缺乏语义丰富的特征。我们从选择性分类的角度重新审视这个问题,即在标准闭集分类设置中引入选择函数。我们的方案基于"不选择任何已知类别(弃权)"的基本思想,并通过基于间隔(margin)的损失学习逐点的弃权惩罚。合成离群点以近似无限多的OOD样本对该思想也至关重要,因此我们提出了一个强大的合成管道,生成源于多种因素的离群点:不真实的物体类别、采样模式和尺寸。我们证明,除了逐点惩罚之外,为不同类型的(合成)离群点学习不同的弃权惩罚可以进一步提升性能。我们在SemanticKITTI和nuScenes上对方法进行了评测,取得了当前最佳结果。风险-覆盖率分析进一步揭示了不同方法的内在特性。代码和模型将公开可用。
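To make the abstention idea concrete, the sketch below uses a gambler-style selective loss with an extra abstain channel and a per-point penalty. This is a generic stand-in, not the paper's margin-based formulation or its learned point-wise penalties.

```python
import torch
import torch.nn.functional as F

def abstain_loss(logits, labels, point_penalty):
    """Selective-classification loss over K known classes plus one abstain channel.

    logits        : (N, K + 1) per-point scores; the last channel is "abstain".
    labels        : (N,) ground-truth indices in [0, K).
    point_penalty : (N,) per-point abstaining cost o > 1 (larger = abstaining costs more).
    Follows the gambler's-loss form  -log(p_true + p_abstain / o).
    """
    probs = F.softmax(logits, dim=-1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    p_abstain = probs[:, -1]
    return -torch.log(p_true + p_abstain / point_penalty + 1e-8).mean()

# Toy usage: 6 points, 4 known classes, a constant per-point penalty.
torch.manual_seed(0)
logits = torch.randn(6, 5, requires_grad=True)
labels = torch.randint(0, 4, (6,))
penalty = torch.full((6,), 2.2)
loss = abstain_loss(logits, labels, penalty)
loss.backward()
print(loss.item())
```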

Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer

  • paper_url: http://arxiv.org/abs/2309.10227
  • repo_url: None
  • paper_authors: Di Xu, Hengjie Liu, Dan Ruan, Ke Sheng
  • for: 这篇论文主要是为了提高动态磁共振成像(DMRI)在需要动态跟踪的诊断任务中的重建能力。
  • methods: 这篇论文使用了压缩感知(compressed sensing)和深度学习技术来加速DMRI获取。具体来说,论文提出了一种基于Transformer的新型重建架构,称为Reconstruction Swin Transformer(RST),用于4D MRI重建,并使用名为SADXNet的卷积网络对2D MR帧进行快速初始化,以降低模型复杂度、GPU硬件需求和训练时间。
  • results: 实验结果表明,RST在cardiac 4D MR数据集上表现出色:在9倍加速的验证序列上,RMSE值为0.0286±0.0199,1-SSIM值为0.0872±0.0783,同时保持了重建质量。
    Abstract Dynamic magnetic resonance imaging (DMRI) is an effective imaging tool for diagnosis tasks that require motion tracking of a certain anatomy. To speed up DMRI acquisition, k-space measurements are commonly undersampled along spatial or spatial-temporal domains. The difficulty of recovering useful information increases with increasing undersampling ratios. Compress sensing was invented for this purpose and has become the most popular method until deep learning (DL) based DMRI reconstruction methods emerged in the past decade. Nevertheless, existing DL networks are still limited in long-range sequential dependency understanding and computational efficiency and are not fully automated. Considering the success of Transformers positional embedding and "swin window" self-attention mechanism in the vision community, especially natural video understanding, we hereby propose a novel architecture named Reconstruction Swin Transformer (RST) for 4D MRI. RST inherits the backbone design of the Video Swin Transformer with a novel reconstruction head introduced to restore pixel-wise intensity. A convolution network called SADXNet is used for rapid initialization of 2D MR frames before RST learning to effectively reduce the model complexity, GPU hardware demand, and training time. Experimental results in the cardiac 4D MR dataset further substantiate the superiority of RST, achieving the lowest RMSE of 0.0286 +/- 0.0199 and 1 - SSIM of 0.0872 +/- 0.0783 on 9 times accelerated validation sequences.
    摘要 动态磁共振成像(DMRI)是一种有效的成像工具,适用于需要对特定解剖结构进行运动跟踪的诊断任务。为了加速DMRI采集,通常会在空间域或空间-时间域对k空间测量进行欠采样。随着欠采样率的提高,恢复有用信息的难度也随之增加。压缩感知正是为此目的而提出的,并一直是最流行的方法,直到过去十年基于深度学习(DL)的DMRI重建方法出现。然而,现有的DL网络在长程序列依赖的理解和计算效率方面仍有局限,且尚未实现完全自动化。鉴于Transformer的位置编码和"滑动窗口(swin window)"自注意力机制在视觉领域(尤其是自然视频理解)的成功,我们提出了一种名为Reconstruction Swin Transformer(RST)的新架构用于4D MRI。RST继承了Video Swin Transformer的主干设计,并引入一个新的重建头来恢复逐像素强度。在RST学习之前,使用名为SADXNet的卷积网络对2D MR帧进行快速初始化,以有效降低模型复杂度、GPU硬件需求和训练时间。在心脏4D MR数据集上的实验结果进一步证实了RST的优越性,在9倍加速的验证序列上取得了最低的RMSE(0.0286 ± 0.0199)和1-SSIM(0.0872 ± 0.0783)。

cs.AI - 2023-09-19

LMDX: Language Model-based Document Information Extraction and Localization

  • paper_url: http://arxiv.org/abs/2309.10952
  • repo_url: None
  • paper_authors: Vincent Perot, Kai Kang, Florian Luisier, Guolong Su, Xiaoyu Sun, Ramya Sree Boppana, Zilong Wang, Jiaqi Mu, Hao Zhang, Nan Hua
  • for: This paper is written for the task of document information extraction, specifically for semi-structured documents, and aims to improve the state-of-the-art in this area.
  • methods: The paper introduces a new methodology called Language Model-based Document Information Extraction and Localization (LMDX), which adapts arbitrary large language models (LLMs) for document information extraction. LMDX uses a grounding mechanism to ensure that the extracted information is accurate and not hallucinated.
  • results: The paper evaluates LMDX on two benchmark datasets (VRDU and CORD) and achieves a new state-of-the-art in document information extraction. The results show that LMDX can extract singular, repeated, and hierarchical entities with high accuracy, both with and without training data. Additionally, the paper demonstrates the efficiency of LMDX in creating high-quality parsers with minimal data.
    Abstract Large Language Models (LLM) have revolutionized Natural Language Processing (NLP), improving state-of-the-art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied on semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, critical for a high quality extraction, and the lack of a grounding mechanism ensuring the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can do extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on VRDU and CORD benchmarks, setting a new state-of-the-art and showing how LMDX enables the creation of high quality, data-efficient parsers.

Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change

  • paper_url: http://arxiv.org/abs/2309.10945
  • repo_url: None
  • paper_authors: Paulo Pirozelli, Marcos M. José, Igor Silveira, Flávio Nakasato, Sarajane M. Peres, Anarosa A. F. Brandão, Anna H. R. Costa, Fabio G. Cozman
  • for: 这篇论文的目的是为了测试机器学习模型在科学知识领域的能力。
  • methods: 这篇论文使用了 Pirá 数据集,并定义了六个基准测试。
  • results: 这篇论文提供了多个参考值,用于测试机器学习模型在不同的问答任务上的能力。
    Abstract Pirá is a reading comprehension dataset focused on the ocean, the Brazilian coast, and climate change, built from a collection of scientific abstracts and reports on these topics. This dataset represents a versatile language resource, particularly useful for testing the ability of current machine learning models to acquire expert scientific knowledge. Despite its potential, a detailed set of baselines has not yet been developed for Pirá. By creating these baselines, researchers can more easily utilize Pirá as a resource for testing machine learning models across a wide range of question answering tasks. In this paper, we define six benchmarks over the Pirá dataset, covering closed generative question answering, machine reading comprehension, information retrieval, open question answering, answer triggering, and multiple choice question answering. As part of this effort, we have also produced a curated version of the original dataset, where we fixed a number of grammar issues, repetitions, and other shortcomings. Furthermore, the dataset has been extended in several new directions, so as to face the aforementioned benchmarks: translation of supporting texts from English into Portuguese, classification labels for answerability, automatic paraphrases of questions and answers, and multiple choice candidates. The results described in this paper provide several points of reference for researchers interested in exploring the challenges provided by the Pirá dataset.
    摘要 Pirá 是一个关注海洋、巴西海岸和气候变化的阅读理解数据集,基于这些主题的科学摘要和报告构建。这个数据集是一个用途广泛的语言资源,特别适合用于测试当前机器学习模型是否能够习得专家级科学知识。尽管潜力很大,但针对 Pirá 的详细基线尚未建立。通过创建这些基线,研究人员可以更方便地将 Pirá 用作测试机器学习模型的资源,涵盖广泛的问答任务。在这篇论文中,我们在 Pirá 数据集上定义了六个基准,包括闭式生成问答、机器阅读理解、信息检索、开放问答、答案触发和多项选择问答。为实现这些目标,我们还制作了原始数据集的修订版本,修复了若干语法错误、重复和其他缺陷。此外,该数据集还在多个新方向上进行了扩展,以支撑上述基准:将支撑文本从英语翻译为葡萄牙语、可回答性的分类标签、问题与答案的自动复述,以及多项选择的候选项。本文所描述的结果为有意探索 Pirá 数据集挑战的研究人员提供了多个参考点。

End-to-End Speech Recognition Contextualization with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10917
  • repo_url: None
  • paper_authors: Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen
  • for: This paper aims to improve the performance of speech recognition models by incorporating large language models (LLMs) and contextual information.
  • methods: The authors propose a novel method that casts speech recognition as a mixed-modal language modeling task, using a pretrained LLM and providing audio features and optional text tokens for context. The system is trained in a decoder-only fashion, and the authors use adapters to add a small number of trainable parameters to unlock contextualized speech recognition capability.
  • results: The authors demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided, and a 7.5% WER improvement overall and 17% WER improvement on rare words compared to a baseline contextualized RNN-T system that was trained on a much larger dataset.
    Abstract In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively and improve by 7.5% WER overall and 17% WER on rare words against a baseline contextualized RNN-T system that has been trained on more than twenty five times larger speech dataset. Overall, we demonstrate that by only adding a handful number of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.

Amplifying Pathological Detection in EEG Signaling Pathways through Cross-Dataset Transfer Learning

  • paper_url: http://arxiv.org/abs/2309.10910
  • repo_url: None
  • paper_authors: Mohammad-Javad Darvishi-Bayazi, Mohammad Sajjad Ghaemi, Timothee Lesort, Md Rifat Arefin, Jocelyn Faubert, Irina Rish
  • for: This paper aims to explore the effectiveness of data and model scaling, as well as cross-dataset knowledge transfer, in the context of pathology diagnosis based on EEG signals.
  • methods: The authors use a combination of data scaling, model scaling, and cross-dataset knowledge transfer to improve the performance of their target model on a low-regime dataset. They also employ a small and generic model (ShallowNet) and a larger model (TCN) to compare their performance.
  • results: The authors observe varying performance improvements through data scaling, and identify the challenges of possible negative transfer and the significance of some key components to overcome distribution shifts and potential spurious correlations. They find that a small and generic model performs well on a single dataset, while a larger model performs better on transfer and learning from a larger and diverse dataset.
    Abstract Pathology diagnosis based on EEG signals and decoding brain activity holds immense importance in understanding neurological disorders. With the advancement of artificial intelligence methods and machine learning techniques, the potential for accurate data-driven diagnoses and effective treatments has grown significantly. However, applying machine learning algorithms to real-world datasets presents diverse challenges at multiple levels. The scarcity of labelled data, especially in low regime scenarios with limited availability of real patient cohorts due to high costs of recruitment, underscores the vital deployment of scaling and transfer learning techniques. In this study, we explore a real-world pathology classification task to highlight the effectiveness of data and model scaling and cross-dataset knowledge transfer. As such, we observe varying performance improvements through data scaling, indicating the need for careful evaluation and labelling. Additionally, we identify the challenges of possible negative transfer and emphasize the significance of some key components to overcome distribution shifts and potential spurious correlations and achieve positive transfer. We see improvement in the performance of the target model on the target (NMT) datasets by using the knowledge from the source dataset (TUAB) when a low amount of labelled data was available. Our findings indicate a small and generic model (e.g. ShallowNet) performs well on a single dataset, however, a larger model (e.g. TCN) performs better on transfer and learning from a larger and diverse dataset.
    摘要 基于EEG信号的病理诊断与脑活动解码对于理解神经系统疾病具有重要意义。随着人工智能方法和机器学习技术的发展,实现精确的数据驱动诊断和有效治疗的潜力显著增长。然而,在真实世界数据集上应用机器学习算法存在多层面的挑战。标注数据的稀缺,尤其是在因招募成本高昂而难以获得真实患者队列的低资源场景下,凸显了数据扩展与迁移学习技术的重要性。在本研究中,我们通过一个真实世界的病理分类任务来探讨数据与模型扩展以及跨数据集知识迁移的有效性。我们观察到,数据扩展带来的性能提升并不一致,说明需要仔细的评估与标注。此外,我们还指出了可能出现的负迁移挑战,并强调了一些关键组件对于克服分布偏移和潜在伪相关、实现正迁移的重要性。我们发现,在标注数据有限时,利用源数据集(TUAB)的知识可以提升目标模型在目标(NMT)数据集上的性能。我们的结果表明,小而通用的模型(如 ShallowNet)在单一数据集上表现良好,而更大的模型(如 TCN)在迁移以及从更大、更多样的数据集学习时表现更佳。

Multicopy Reinforcement Learning Agents

  • paper_url: http://arxiv.org/abs/2309.10908
  • repo_url: None
  • paper_authors: Alicia P. Wolfe, Oliver Diamond, Remi Feuerman, Magdalena Kisielinska, Brigitte Goeler-Slough, Victoria Manfredi
  • for: 这种研究旨在解决一种多智能问题,在其中一个智能创建多个同一个智能的复制品来完成单个智能任务更好或更高效。
  • methods: 我们提出了一种学习算法,该算法利用值函数的结构来高效地学习如何平衡多副本智能体的优点和成本。
  • results: 我们的研究表明,在噪声环境中,使用多副本智能体可以提高任务的完成率和效率。
    Abstract This paper examines a novel type of multi-agent problem, in which an agent makes multiple identical copies of itself in order to achieve a single agent task better or more efficiently. This strategy improves performance if the environment is noisy and the task is sometimes unachievable by a single agent copy. We propose a learning algorithm for this multicopy problem which takes advantage of the structure of the value function to efficiently learn how to balance the advantages and costs of adding additional copies.
    摘要 这篇论文研究了一种新型多智能体问题:一个智能体创建多个相同的副本,以更好或更高效地完成单个智能体的任务。在噪声环境下,当单个副本有时无法完成任务时,这种策略可以提高性能。我们提出了一种学习算法来解决这个多副本问题,该算法利用价值函数的结构,高效地学习如何平衡增加副本数量的优点和成本。

Artificial Intelligence-Enabled Intelligent Assistant for Personalized and Adaptive Learning in Higher Education

  • paper_url: http://arxiv.org/abs/2309.10892
  • repo_url: None
  • paper_authors: Ramteja Sajja, Yusuf Sermet, Muhammed Cikmaz, David Cwiertny, Ibrahim Demir
  • for: 这篇论文旨在开发一种基于人工智能的智能助手(AIIA),用于个性化和适应性的大学学习。
  • methods: 该系统使用高级人工智能和自然语言处理技术,创造了一个互动性强、有趣的学习平台,以减轻学生的认知负担,提供易于获取信息、评估知识和个性化学习支持。
  • results: 研究发现,AIIA系统可以理解和回答学生问题,生成测验和卡片,并提供个性化学习路径,有望改善学生学习效果、参与度和满意度。
    Abstract This paper presents a novel framework, Artificial Intelligence-Enabled Intelligent Assistant (AIIA), for personalized and adaptive learning in higher education. The AIIA system leverages advanced AI and Natural Language Processing (NLP) techniques to create an interactive and engaging learning platform. This platform is engineered to reduce cognitive load on learners by providing easy access to information, facilitating knowledge assessment, and delivering personalized learning support tailored to individual needs and learning styles. The AIIA's capabilities include understanding and responding to student inquiries, generating quizzes and flashcards, and offering personalized learning pathways. The research findings have the potential to significantly impact the design, implementation, and evaluation of AI-enabled Virtual Teaching Assistants (VTAs) in higher education, informing the development of innovative educational tools that can enhance student learning outcomes, engagement, and satisfaction. The paper presents the methodology, system architecture, intelligent services, and integration with Learning Management Systems (LMSs) while discussing the challenges, limitations, and future directions for the development of AI-enabled intelligent assistants in education.
    摘要 The AIIA system has several capabilities, including (1) understanding and responding to student inquiries, (2) generating quizzes and flashcards, and (3) offering personalized learning pathways. The research findings have the potential to significantly impact the design, implementation, and evaluation of AI-enabled Virtual Teaching Assistants (VTAs) in higher education, and can inform the development of innovative educational tools that enhance student learning outcomes, engagement, and satisfaction. The paper discusses the methodology, system architecture, intelligent services, and integration with Learning Management Systems (LMSs), while addressing the challenges, limitations, and future directions for the development of AI-enabled intelligent assistants in education.

Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer

  • paper_url: http://arxiv.org/abs/2309.10891
  • repo_url: https://github.com/luka-group/SALT
  • paper_authors: Fei Wang, Kuan-Hao Huang, Kai-Wei Chang, Muhao Chen
  • for: 提高零shot跨语言传递性,不需要外部数据。
  • methods: 使用代码混合和嵌入混合自我增强,从多语言预训练语言模型中提取跨语言知识,提高下游任务的传递性。
  • results: 在XNLI和PAWS-X任务上,我们的方法能够提高零shot跨语言传递性,无需外部数据。
    Abstract Zero-shot cross-lingual transfer is a central task in multilingual NLP, allowing models trained in languages with more sufficient training resources to generalize to other low-resource languages. Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data to improve cross-lingual transferability, which are typically expensive to obtain. In this paper, we propose a simple yet effective method, SALT, to improve the zero-shot cross-lingual transfer of the multilingual pretrained language models without the help of such external data. By incorporating code-switching and embedding mixup with self-augmentation, SALT effectively distills cross-lingual knowledge from the multilingual PLM and enhances its transferability on downstream tasks. Experimental results on XNLI and PAWS-X show that our method is able to improve zero-shot cross-lingual transferability without external data. Our code is available at https://github.com/luka-group/SALT.
    摘要 zero-shot 跨语言传递是多语言NLP中的核心任务,允许基于更有 suficient 训练资源的语言模型在其他低资源语言上进行泛化。 Earlier 的尝试使用平行 corpora、双语词典或其他注解对应数据来提高跨语言传递性,这些数据通常是 expensive 的获得。 在这篇论文中,我们提出了一种简单又有效的方法,SALT,以提高多语言预训练语言模型的零shot 跨语言传递性。通过将 code-switching 和 embedding mixup 与自我束缚,SALT 有效地储存了多语言PLM 中的跨语言知识,并提高了其在下游任务的传递性。实验结果表明,我们的方法可以在 XNLI 和 PAWS-X 上提高零shot 跨语言传递性,无需外部数据。我们的代码可以在 https://github.com/luka-group/SALT 上获取。
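To make the self-augmentation idea concrete, here is a minimal NumPy sketch of embedding mixup between an original sentence and a code-switched variant: token embeddings of the two versions are interpolated with a Beta-sampled coefficient and the mixed sequence serves as an extra training input for the same label. The toy vocabulary, the substitution table, and the Beta(0.5, 0.5) prior are illustrative assumptions, not SALT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a shared multilingual embedding table (vocab_size x dim).
vocab = {"the": 0, "cat": 1, "sleeps": 2, "le": 3, "chat": 4, "dort": 5}
emb = rng.normal(size=(len(vocab), 8))

# Hypothetical code-switch substitutions (English -> French) used for augmentation.
code_switch = {"the": "le", "cat": "chat", "sleeps": "dort"}

def embed(tokens):
    """Look up embeddings for a token sequence."""
    return emb[[vocab[t] for t in tokens]]

def mixup_augment(tokens, alpha=0.5):
    """Interpolate original and code-switched token embeddings."""
    switched = [code_switch.get(t, t) for t in tokens]
    lam = rng.beta(alpha, alpha)              # mixing coefficient in [0, 1]
    mixed = lam * embed(tokens) + (1.0 - lam) * embed(switched)
    return mixed, lam

sentence = ["the", "cat", "sleeps"]
augmented, lam = mixup_augment(sentence)
print(f"lambda = {lam:.2f}, augmented shape = {augmented.shape}")
```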

Classifying Organizations for Food System Ontologies using Natural Language Processing

  • paper_url: http://arxiv.org/abs/2309.10880
  • repo_url: https://github.com/ICICLE-ai/Organization-Classification-for-Food-Systems
  • paper_authors: Tianyu Jiang, Sonia Vinogradova, Nathan Stringham, E. Louise Earl, Allan D. Hollander, Patrick R. Huber, Ellen Riloff, R. Sandra Schillo, Giorgio A. Ubbiali, Matthew Lange
  • for: 填充知识图和食品系统 Ontology 的信息
  • methods: 使用自然语言处理(NLP)方法自动分类实体
  • results: NLP 模型可以达到相对好的性能,并可以应用于许多其他分类问题
    Abstract Our research explores the use of natural language processing (NLP) methods to automatically classify entities for the purpose of knowledge graph population and integration with food system ontologies. We have created NLP models that can automatically classify organizations with respect to categories associated with environmental issues as well as Standard Industrial Classification (SIC) codes, which are used by the U.S. government to characterize business activities. As input, the NLP models are provided with text snippets retrieved by the Google search engine for each organization, which serves as a textual description of the organization that is used for learning. Our experimental results show that NLP models can achieve reasonably good performance for these two classification tasks, and they rely on a general framework that could be applied to many other classification problems as well. We believe that NLP models represent a promising approach for automatically harvesting information to populate knowledge graphs and aligning the information with existing ontologies through shared categories and concepts.
    摘要 我们的研究探讨了使用自然语言处理(NLP)方法自动分类实体,以填充知识图和食品系统 ontology 的目的。我们已经创建了一些 NLP 模型,可以自动将组织分类到环境问题相关的类别以及美国政府使用的标准工业分类(SIC)代码中。作为输入,NLP 模型被提供文本摘要,它们是由 Google 搜索引擎提取的每个组织的文本描述,用于学习。我们的实验结果表明,NLP 模型可以达到相对好的性能水平,并且可以应用于许多其他分类问题。我们认为,NLP 模型代表一种有前途的方法,用于自动收割信息,并将信息与现有 ontology 进行对应。
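As a rough, self-contained illustration of the classification setup (a text snippet describing an organization in, a category out), the sketch below trains a TF-IDF plus logistic-regression pipeline on invented snippets standing in for retrieved Google search results; the paper's actual models, categories, and SIC codes differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented snippets standing in for search-result text about each organization.
snippets = [
    "nonprofit focused on sustainable agriculture and soil health",
    "manufacturer of canned vegetables and frozen food products",
    "advocacy group working on water pollution and conservation",
    "wholesale distributor of packaged snacks and beverages",
]
labels = ["environment", "manufacturing", "environment", "distribution"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(snippets, labels)

print(clf.predict(["charity working on river pollution and habitat conservation"]))
# expected: ['environment']
```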

Believable Minecraft Settlements by Means of Decentralised Iterative Planning

  • paper_url: http://arxiv.org/abs/2309.10871
  • repo_url: None
  • paper_authors: Arthur van der Staaij, Jelmer Prins, Vincent L. Prins, Julian Poelsma, Thera Smit, Matthias Müller-Brockhausen, Mike Preuss
  • for: 这篇论文主要是为了解决 Procedural Content Generation (PCG) 领域中的寓真性和适应随机地形的城市生成问题。
  • methods: 该论文使用了分布式、迭代的规划过程,可以转移到类似的生成过程中生成“有机”的内容。
  • results: 该论文在 Generative Settlement Design in Minecraft (GDMC) 2022 比赛中获胜,表明了其在寓真性和适应随机地形的城市生成方面的可行性。
    Abstract Procedural city generation that focuses on believability and adaptability to random terrain is a difficult challenge in the field of Procedural Content Generation (PCG). Dozens of researchers compete for a realistic approach in challenges such as the Generative Settlement Design in Minecraft (GDMC), in which our method has won the 2022 competition. This was achieved through a decentralised, iterative planning process that is transferable to similar generation processes that aims to produce "organic" content procedurally.
    摘要 程序化城市生成强调可信度和对随机地形的适应性,是程序化内容生成(PCG)领域的一大挑战。数十名研究人员在诸如 Minecraft 生成定居点设计挑战(GDMC)等赛事中竞相寻求更逼真的方法,我们的方法在 2022 年赛事中获胜。这一成果基于分散式、迭代的规划过程,可迁移到类似的生成过程,以程序化方式生成"有机"的内容。

Using AI Uncertainty Quantification to Improve Human Decision-Making

  • paper_url: http://arxiv.org/abs/2309.10852
  • repo_url: None
  • paper_authors: Laura R. Marusich, Jonathan Z. Bakdash, Yan Zhou, Murat Kantarcioglu
  • for: The paper aims to improve human decision-making by providing additional probabilistic information from AI uncertainty quantification (UQ).
  • methods: The paper implements instance-based UQ for three real datasets: it trains different AI classification models for each dataset and builds confidence intervals from predictions on random samples generated around the neighborhood of a given instance, calibrating the UQ with a strictly proper scoring rule. Two preregistered online behavioral experiments then compare human decision-making under different AI information conditions.
  • results: Providing UQ information along with AI predictions significantly improves human decision-making beyond AI predictions alone, and this benefit generalizes across different representations of UQ information (point vs. distribution, needle vs. dotplot visualizations).
    Abstract AI Uncertainty Quantification (UQ) has the potential to improve human decision-making beyond AI predictions alone by providing additional useful probabilistic information to users. The majority of past research on AI and human decision-making has concentrated on model explainability and interpretability. We implemented instance-based UQ for three real datasets. To achieve this, we trained different AI models for classification for each dataset, and used random samples generated around the neighborhood of the given instance to create confidence intervals for UQ. The computed UQ was calibrated using a strictly proper scoring rule as a form of quality assurance for UQ. We then conducted two preregistered online behavioral experiments that compared objective human decision-making performance under different AI information conditions, including UQ. In Experiment 1, we compared decision-making for no AI (control), AI prediction alone, and AI prediction with a visualization of UQ. We found UQ significantly improved decision-making beyond the other two conditions. In Experiment 2, we focused on comparing different representations of UQ information: Point vs. distribution of uncertainty and visualization type (needle vs. dotplot). We did not find meaningful differences in decision-making performance among these different representations of UQ. Overall, our results indicate that human decision-making can be improved by providing UQ information along with AI predictions, and that this benefit generalizes across a variety of representations of UQ.
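A minimal sketch of instance-based uncertainty quantification in the spirit of this paper: perturb an instance within a small neighborhood, collect the classifier's predicted probabilities on those perturbations, and report a percentile interval alongside the point prediction. The synthetic dataset, the perturbation scale, and the 95% percentile interval are illustrative assumptions rather than the paper's exact procedure, which also calibrates the UQ with a strictly proper scoring rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

def uq_interval(x, scale=0.2, n_samples=200, level=0.95):
    """Interval for P(class=1) from predictions on a neighborhood of instance x."""
    neighborhood = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    probs = model.predict_proba(neighborhood)[:, 1]
    lo, hi = np.percentile(probs, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(lo), float(hi)

x = X[0]
point = model.predict_proba(x[None])[0, 1]
print(f"point estimate = {point:.2f}, interval = {uq_interval(x)}")
```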

SlimPajama-DC: Understanding Data Combinations for LLM Training

  • paper_url: http://arxiv.org/abs/2309.10818
  • repo_url: None
  • paper_authors: Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing
  • for: 本研究使用SlimPajama dataset进行语言模型训练,旨在探索不同数据组合(如网络文本、Wikipedia、GitHub、书籍)对大语言模型训练的影响。
  • methods: 本研究使用SlimPajama dataset,并对其进行了全面和本地减重。然后,通过使用1.3B Cerebras-GPT模型和Alibi、SwiGLU进行训练,对不同的数据组合进行分析。
  • results: 研究发现,全球减重和本地减重对训练后的模型性能有着不同的影响。此外,研究还发现,在不同的数据组合中,提高数据多样性是关键的。最终,本研究的最佳配置比使用RedPajama dataset训练的1.3B模型表现出较好的性能。
    Abstract This paper aims to understand the impacts of various data combinations (e.g., web text, wikipedia, github, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T tokens RedPajama dataset contributed by Together. We've termed our research as SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within the single source of dataset) deduplications affect the performance of trained models. (2) Proportions of high-quality/highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations of SlimPajama dataset and train individual ones using 1.3B Cerebras-GPT model with Alibi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on Cerebras 16$\times$ CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as increasing data diversity is crucial after global deduplication) on a 7B model with large batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
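The global-versus-local deduplication distinction can be stated in a few lines: local deduplication removes duplicates within each source independently, while global deduplication removes them across all sources. The exact-match, hash-of-normalized-text criterion below is a simplification of the fuzzy deduplication actually used for SlimPajama.

```python
import hashlib

def doc_key(text):
    """Exact-duplicate key: hash of whitespace-normalized, lowercased text."""
    return hashlib.sha256(" ".join(text.split()).lower().encode()).hexdigest()

def dedup_local(sources):
    """Deduplicate within each source independently."""
    return {name: list({doc_key(d): d for d in docs}.values())
            for name, docs in sources.items()}

def dedup_global(sources):
    """Deduplicate across all sources: keep the first occurrence anywhere."""
    seen, out = set(), {name: [] for name in sources}
    for name, docs in sources.items():
        for d in docs:
            k = doc_key(d)
            if k not in seen:
                seen.add(k)
                out[name].append(d)
    return out

sources = {"web": ["the quick brown fox", "hello world"],
           "books": ["Hello   world", "a different passage"]}
print({k: len(v) for k, v in dedup_local(sources).items()})   # {'web': 2, 'books': 2}
print({k: len(v) for k, v in dedup_global(sources).items()})  # {'web': 2, 'books': 1}
```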

AI Foundation Models for Weather and Climate: Applications, Design, and Implementation

  • paper_url: http://arxiv.org/abs/2309.10808
  • repo_url: None
  • paper_authors: S. Karthik Mukkavilli, Daniel Salles Civitarese, Johannes Schmude, Johannes Jakubik, Anne Jones, Nam Nguyen, Christopher Phillips, Sujit Roy, Shraddha Singh, Campbell Watson, Raghu Ganti, Hendrik Hamann, Udaysankar Nair, Rahul Ramachandran, Kommy Weldemariam
  • for: 这篇论文旨在探讨使用机器学习和深度学习方法来理解大气中的混沌行为,并推进天气预报。
  • methods: 论文主要回顾变换器、物理信息机器学习和图神经网络等方法,这些方法已在相对狭小的时空尺度和特定任务上达到最先进性能。
  • results: 论文认为,随着生成式人工智能的进步,目前的方法已经成熟到可以为全球地球系统模型、区域气候模型和中尺度天气模型设计并实现通用的天气基础模型,这些模型能在多个领域特定的下游任务上达到有竞争力的表现。
    Abstract Machine learning and deep learning methods have been widely explored in understanding the chaotic behavior of the atmosphere and furthering weather forecasting. There has been increasing interest from technology companies, government institutions, and meteorological agencies in building digital twins of the Earth. Recent approaches using transformers, physics-informed machine learning, and graph neural networks have demonstrated state-of-the-art performance on relatively narrow spatiotemporal scales and specific tasks. With the recent success of generative artificial intelligence (AI) using pre-trained transformers for language modeling and vision with prompt engineering and fine-tuning, we are now moving towards generalizable AI. In particular, we are witnessing the rise of AI foundation models that can perform competitively on multiple domain-specific downstream tasks. Despite this progress, we are still in the nascent stages of a generalizable AI model for global Earth system models, regional climate models, and mesoscale weather models. Here, we review current state-of-the-art AI approaches, primarily from transformer and operator learning literature in the context of meteorology. We provide our perspective on criteria for success towards a family of foundation models for nowcasting and forecasting weather and climate predictions. We also discuss how such models can perform competitively on downstream tasks such as downscaling (super-resolution), identifying conditions conducive to the occurrence of wildfires, and predicting consequential meteorological phenomena across various spatiotemporal scales such as hurricanes and atmospheric rivers. In particular, we examine current AI methodologies and contend they have matured enough to design and implement a weather foundation model.
    摘要 机器学习和深度学习方法已广泛应用于理解大气中的混沌行为以及进一步改进天气预测。现在,技术公司、政府机构和气象局都有增加兴趣于建立地球的数字孪生。最近的方法使用变换器、物理学 informed machine learning 和图 neural networks 已经在相对狭小的空间时间尺度和特定任务上达到了国际先进水平。尤其是在语言模型和视觉领域使用预训练变换器后 fine-tuning 的情况下,人工智能已经取得了很好的进步。我们现在正在向通用人工智能进化。特别是我们正在见证到通用人工智能模型可以在多个领域特定下渠道任务上表现竞争力。 DESPITE 这些进步,我们还处于全球 Earth system models、区域气候模型和 mesoscale 天气模型的通用人工智能模型的初始阶段。这里,我们评论当前领域的状态艺术方法,主要是基于 transformer 和操作学习文献中的 meteorology 方法。我们提供我们对成功 criterion 的看法,以及如何建立一个家族基础模型,以便在不同的下渠道任务上进行 nowcasting 和预测天气和气候预测。我们还讨论了如何使用这些模型在下采样、识别激发野火的条件以及预测不同的空间时间尺度的重要气象现象上表现竞争力。

Heuristic Search for Path Finding with Refuelling

  • paper_url: http://arxiv.org/abs/2309.10796
  • repo_url: None
  • paper_authors: Anushtup Nandy, Zhongqiang Ren, Sivakumar Rathinam, Howie Choset
  • for: 这篇论文考虑了路径找路径(PF)的一种扩展,即充油路径找路径(RF-PF)问题。与PF问题一样,RF-PF问题定义在一个图上,图 vertices是知道燃料价格的加油站,边成本取决于加油站之间的燃料消耗。RF-PF寻找最低成本路径从开始到目标Vertex的一个机器人,机器人具有有限燃料箱和有限数量的加油停留。
  • methods: 这篇论文提出了一种启发搜索算法called Refuel A* (RF-A*),该算法在图上逐步构建部分解决方案路径,并利用准则来精简状态的排除。
  • results: 在包含数百个加油站的大城市地图上测试时,RF-A* 比现有的最先进方法(一种多项式时间算法)快一个数量级以上,并且保证找到最优解。
    Abstract This paper considers a generalization of the Path Finding (PF) with refueling constraints referred to as the Refuelling Path Finding (RF-PF) problem. Just like PF, the RF-PF problem is defined over a graph, where vertices are gas stations with known fuel prices, and edge costs depend on the gas consumption between the corresponding vertices. RF-PF seeks a minimum-cost path from the start to the goal vertex for a robot with a limited gas tank and a limited number of refuelling stops. While RF-PF is polynomial-time solvable, it remains a challenge to quickly compute an optimal solution in practice since the robot needs to simultaneously determine the path, where to make the stops, and the amount to refuel at each stop. This paper develops a heuristic search algorithm called Refuel A* (RF-A* ) that iteratively constructs partial solution paths from the start to the goal guided by a heuristic function while leveraging dominance rules for state pruning during planning. RF-A* is guaranteed to find an optimal solution and runs more than an order of magnitude faster than the existing state of the art (a polynomial time algorithm) when tested in large city maps with hundreds of gas stations.
    摘要 这个论文考虑了路径找路(PF)的一种扩展,即充油路径找路(RF-PF)问题。与PF类似,RF-PF问题在图上定义,图 vertices 是知道燃料价格的加油站,边的成本取决于两个顶点之间的燃料消耗。RF-PF寻找最低成本路径从开始顶点到目标顶点, robot 有有限燃料箱和有限数量的加油停。虽然 RF-PF 是 polynomial-time 可解决的,但在实践中很难快速计算优化的解决方案,因为机器人需要同时确定路径、停留处和充油量。这篇论文开发了一种启发搜索算法called Refuel A*(RF-A*),该算法在启发函数的指导下逐步构建从开始顶点到目标顶点的偏好解。RF-A* 保证找到优化解决方案,并在大型城市地图上进行了大量的测试,与现有的状态艺术(一个 polynomial time 算法)比较,运行速度高于一个数量级。
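To make the problem statement concrete, the sketch below solves a tiny refuelling path-finding instance with a plain Dijkstra-style search over (station, fuel, stops-used) states and integer fuel levels. The three-station graph, prices, tank size, and stop budget are invented; this is not the RF-A* algorithm itself, which adds heuristic guidance and dominance-based pruning on top of such a state space.

```python
import heapq

# Tiny instance: station -> {neighbor: fuel units needed}, and fuel price per unit.
edges = {"A": {"B": 4, "C": 7}, "B": {"C": 4}, "C": {}}
price = {"A": 3.0, "B": 1.0, "C": 2.0}

def refuel_path(start, goal, tank=6, max_stops=2):
    """Minimum fuel cost from start to goal with a bounded tank and stop budget."""
    # State: (station, fuel_in_tank, stops_used); priority = money spent so far.
    queue = [(0.0, start, 0, 0)]
    best = {(start, 0, 0): 0.0}
    while queue:
        cost, v, fuel, stops = heapq.heappop(queue)
        if v == goal:
            return cost
        if cost > best.get((v, fuel, stops), float("inf")):
            continue  # stale entry
        # Action 1: refuel here to any higher integer level (uses one stop).
        if stops < max_stops:
            for level in range(fuel + 1, tank + 1):
                c = cost + (level - fuel) * price[v]
                s = (v, level, stops + 1)
                if c < best.get(s, float("inf")):
                    best[s] = c
                    heapq.heappush(queue, (c, v, level, stops + 1))
        # Action 2: drive to a neighbour if the tank holds enough fuel.
        for w, need in edges[v].items():
            if fuel >= need:
                s = (w, fuel - need, stops)
                if cost < best.get(s, float("inf")):
                    best[s] = cost
                    heapq.heappush(queue, (cost, w, fuel - need, stops))
    return float("inf")

# Cheapest plan: buy 4 units at A (12.0), drive to B, buy 4 units at B (4.0), drive to C.
print(refuel_path("A", "C"))  # 16.0
```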

Guide Your Agent with Adaptive Multimodal Rewards

  • paper_url: http://arxiv.org/abs/2309.10790
  • repo_url: https://github.com/csmile-1006/arp
  • paper_authors: Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak Lee, Kimin Lee
  • for: 这篇论文的目的是提高强化学习agent在未经见过环境中的适应能力。
  • methods: 该方法使用了自然语言任务描述和预训练多Modal embedding来增强agent的总结能力。具体来说,它使用了CLIP预训练多Modal embedding来计算视觉观察和自然语言指令之间的相似性,并使用这个相似性作为奖励信号来训练返回conditioned政策。
  • results: 该方法可以有效地 mitigate目的泛化,并在面对未经见过的文本指令时实现superior的总结性能。此外,通过细化预训练多Modal encoder来提高奖励质量,进一步提高了性能。视频示例和源代码可以在项目网站(https://sites.google.com/view/2023arp)上找到。
    Abstract Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. In this work, we present Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to calculate a similarity between visual observations and natural language instructions in the pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy using expert demonstrations labeled with multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, our ARP effectively mitigates the goal misgeneralization. This results in superior generalization performances even when faced with unseen text instructions, compared to existing text-conditioned policies. To improve the quality of rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, further enhancing the performance. Video demonstrations and source code are available on the project website: https://sites.google.com/view/2023arp.
    摘要 开发一个能够适应未看过环境的智能代理仍然是一个困难的挑战。在这个工作中,我们提出了适应返回条件策略(ARP),这是一个高效的框架,用于提高智能代理的通用能力使用自然语言任务描述和预训练多模态编码器。我们的关键想法是在预训练多模态空间(如CLIP)中计算视觉观察和自然语言指令之间的相似性,并将其作为奖励信号使用。然后,我们使用专家示范标注为多模态奖励进行返回条件策略的训练,因此我们的ARP可以有效地消除目标泛化。这导致我们在面对未看过文本指令时的总体性能强于现有的文本条件策略。为了提高奖励质量,我们还引入了预训练多模态编码器的细化方法,进一步提高性能。视频示例和源代码可以在项目网站上找到:https://sites.google.com/view/2023arp。
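The reward construction in ARP boils down to a similarity score in a shared multimodal embedding space. The sketch below assumes the visual observation and the language instruction have already been encoded (e.g. by CLIP's image and text encoders, as the paper does) and uses their cosine similarity as the per-timestep reward; the random vectors are placeholders for real embeddings.

```python
import numpy as np

def cosine_reward(image_emb, text_emb):
    """Per-timestep multimodal reward: cosine similarity of the two embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

rng = np.random.default_rng(0)
instruction_emb = rng.normal(size=512)          # placeholder for a CLIP text embedding
trajectory_embs = rng.normal(size=(5, 512))     # placeholders for per-frame image embeddings

rewards = [cosine_reward(frame, instruction_emb) for frame in trajectory_embs]
returns = np.cumsum(rewards[::-1])[::-1]        # return-to-go used to condition the policy
print(np.round(rewards, 3), np.round(returns, 3))
```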

Language as the Medium: Multimodal Video Classification through text only

  • paper_url: http://arxiv.org/abs/2309.10783
  • repo_url: None
  • paper_authors: Laura Hanu, Anita L. Verő, James Thewlis
  • for: 本研究旨在提出一种新的模型独立方法,可以帮助解释视频中的复杂Contextual关系。
  • methods: 该方法利用大型语言模型,如GPT-3.5或Llama2,来理解视频和声音modalities的文本描述,从BLIP-2、Whisper和ImageBind获取。无需进行额外的视频-文本模型或数据集调整,我们示出了现有的LLMs可以使用这些多modal的文本描述作为“看”或“听”的代理,进行零shot多modal视频分类。
  • results: 我们在UCf-101和Kinetics等知名动作认知 benchmark上进行了评估,并示出了这些 Context-rich描述可以在视频理解任务中得到成功应用。这种方法预示着一个promising的新研究方向,即多modal机器学习模型之间的互动,可以实现更全面的视频理解。
    Abstract Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for ``sight'' or ``hearing'' and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
    摘要 尽管现有一新的多modal机器学习模型浪潮,现在的方法仍然无法理解视频中不同modalities之间的复杂关系。我们提出了一种新的模型无关方法,可以生成详细的文本描述,捕捉视频信息。我们的方法利用了大语言模型,如GPT-3.5或Llama2所学习的广泛知识,来理解文本描述的视觉和听觉modalities,从BLIP-2、Whisper和ImageBind获取。不需要额外的视频-文本模型或数据集进行训练,我们示示现有的LLM可以使用这些多modal文本描述作为“视”或“听”的代理,进行零例Multimodal分类视频 tasks。我们在UCf-101和Kinetics等流行动作识别benchmark上进行评估,显示这些具有上下文的描述可以在视频理解任务中使用。这种方法指向了一个新的研究方向,证明了多modal机器学习模型之间的互动可以实现更全面的视频理解。

FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

  • paper_url: http://arxiv.org/abs/2309.10770
  • repo_url: None
  • paper_authors: Jamil Zaghir, Mina Bjelogrlic, Jean-Philippe Goldman, Soukaïna Aananou, Christophe Gaudet-Blavignac, Christian Lovis
  • for: 本研究旨在提供一种通过跨语言注解投影(crosslingual annotation projection)生成注解数据集翻译版本的方法,以扩充低资源语料中的标注数据。
  • methods: 该方法基于 BERT 语言模型,采用与语言无关的方式,仅利用已有的开源数据资源,即可以较少的人工成本扩充低资源语料的标注数据。
  • results: 评估结果表明,该跨语言注解投影方法高效且准确,能够生成高质量的注解数据集。作为实际应用,作者构建了法语医疗实体注解资源 FRASIMED,包含 2'051 个合成临床病例,可用于开发和改进法语临床自然语言处理(NLP)应用。
    Abstract Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs) where there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language agnostic BERT-based approach, it is an efficient solution to increase low-resource corpora with few human efforts and by only using already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to evaluating the quality and effectiveness of semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2'051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French natural language processing (NLP) applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.
    摘要 自然语言处理(NLP)应用程序,如命名实体识别(NER) для低资源 Corpora 不会受到最近的大语言模型(LLM)的发展所带来的 beneficial effects。这篇研究文章介绍了一种方法ologies for generating translated versions of annotated datasets through crosslingual annotation projection。通过使用语言无关的 BERT 基于方法,可以efficiently 增加低资源 Corpora 以及少量人工劳动,只需使用已有的开源数据资源。量化和质量评估是评估自动数据生成策略的重要问题,但对于我们的 crosslingual annotation projection 方法,我们的评估结果表明其效果和准确性都很高。作为这种方法的实践应用,我们介绍了创建了 French Annotated Resource with Semantic Information for Medical Entities Detection(FRASIMED),这是一个包含 2'051 个 sintetic clinical cases 的法语 annotated corpus(https://zenodo.org/record/8355629)。这个 corpus 现在可以为研究人员和实践者提供,以开发和完善法语自然语言处理(NLP)应用程序在医疗领域。

A Blueprint for Precise and Fault-Tolerant Analog Neural Networks

  • paper_url: http://arxiv.org/abs/2309.10759
  • repo_url: None
  • paper_authors: Cansu Demirkiran, Lakshmi Nair, Darius Bunandar, Ajay Joshi
  • for: This paper aims to improve the energy efficiency and scalability of deep neural network (DNN) acceleration using analog computing.
  • methods: The paper proposes using the residue number system (RNS) to compose high-precision operations from multiple low-precision operations, eliminating the information loss caused by limited-precision data converters.
  • results: The study achieves at least 99% of FP32 accuracy for state-of-the-art DNN inference using data converters with only 6-bit precision, which can reduce the energy consumption of analog accelerators by several orders of magnitude while maintaining the same throughput and precision. The approach is also applied to DNN training, achieving accuracy comparable to FP32 using 7-bit integer arithmetic. Additionally, the paper presents a fault-tolerant dataflow using redundant RNS error-correcting codes to protect computation against the noise and errors inherent in analog accelerators.
    Abstract Analog computing has reemerged as a promising avenue for accelerating deep neural networks (DNNs) due to its potential to overcome the energy efficiency and scalability challenges posed by traditional digital architectures. However, achieving high precision and DNN accuracy using such technologies is challenging, as high-precision data converters are costly and impractical. In this paper, we address this challenge by using the residue number system (RNS). RNS allows composing high-precision operations from multiple low-precision operations, thereby eliminating the information loss caused by the limited precision of the data converters. Our study demonstrates that analog accelerators utilizing the RNS-based approach can achieve ${\geq}99\%$ of FP32 accuracy for state-of-the-art DNN inference using data converters with only $6$-bit precision whereas a conventional analog core requires more than $8$-bit precision to achieve the same accuracy in the same DNNs. The reduced precision requirements imply that using RNS can reduce the energy consumption of analog accelerators by several orders of magnitude while maintaining the same throughput and precision. Our study extends this approach to DNN training, where we can efficiently train DNNs using $7$-bit integer arithmetic while achieving accuracy comparable to FP32 precision. Lastly, we present a fault-tolerant dataflow using redundant RNS error-correcting codes to protect the computation against noise and errors inherent within an analog accelerator.
    摘要 Traditional digital architectures have faced challenges in terms of energy efficiency and scalability, which has led to the resurgence of analog computing as a promising avenue for accelerating deep neural networks (DNNs). However, achieving high precision and DNN accuracy using analog technologies is challenging, as high-precision data converters are costly and impractical. In this paper, we address this challenge by using the residue number system (RNS). RNS allows for the composition of high-precision operations from multiple low-precision operations, thereby eliminating the information loss caused by the limited precision of the data converters. Our study shows that analog accelerators utilizing the RNS-based approach can achieve accuracy of at least 99% of FP32 for state-of-the-art DNN inference using data converters with only 6-bit precision, whereas a conventional analog core requires more than 8-bit precision to achieve the same accuracy in the same DNNs. This reduction in precision requirements implies that using RNS can reduce the energy consumption of analog accelerators by several orders of magnitude while maintaining the same throughput and precision. Our study also extends this approach to DNN training, where we can efficiently train DNNs using 7-bit integer arithmetic while achieving accuracy comparable to FP32 precision. Finally, we present a fault-tolerant dataflow using redundant RNS error-correcting codes to protect the computation against noise and errors inherent within an analog accelerator.
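A worked example of the residue number system idea: an integer is represented by its residues modulo a set of pairwise-coprime, low-precision moduli, multiply-accumulate is carried out independently on each small residue channel, and the full-precision result is recovered with the Chinese Remainder Theorem. The moduli below are illustrative, not the ones used in the paper, and the result is only valid while it stays within the dynamic range (the product of the moduli).

```python
from math import prod

MODULI = (7, 11, 13, 15)          # pairwise coprime; dynamic range = 7*11*13*15 = 15015
M = prod(MODULI)

def to_rns(x):
    """Represent x by its residues modulo each small modulus."""
    return tuple(x % m for m in MODULI)

def rns_mac(a_res, b_res, acc_res):
    """Multiply-accumulate performed independently per low-precision channel."""
    return tuple((a * b + c) % m for a, b, c, m in zip(a_res, b_res, acc_res, MODULI))

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction back to an ordinary integer."""
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m) is the modular inverse
    return x % M

a, b, acc = 123, 45, 678
result = from_rns(rns_mac(to_rns(a), to_rns(b), to_rns(acc)))
print(result, a * b + acc)            # both 6213, since the result stays below 15015
```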

SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2309.10748
  • repo_url: None
  • paper_authors: Anilkumar Swamy, Vincent Leroy, Philippe Weinzaepfel, Fabien Baradel, Salma Galaaoui, Romain Bregier, Matthieu Armando, Jean-Sebastien Franco, Gregory Rogez
  • for: 本研究旨在超越现有手-物体交互数据集的局限,提供更丰富的真实物体变化以及更精确的 3D 手-物体重建基准。
  • methods: 该研究使用一种两阶段推理管道:先对刚性的手-物体系统进行精确配准(registration),再使用多视图重建(MVR)算法进行 3D 重建。
  • results: 研究表明,使用 SfM 工具箱或手部姿态估计器可以实现有前景的、与物体无关(object-agnostic)的 3D 手-物体重建,但这些方法仍对初始相机位姿估计较为敏感,重建质量仍有改进空间。
    Abstract Recent hand-object interaction datasets show limited real object variability and rely on fitting the MANO parametric model to obtain groundtruth hand shapes. To go beyond these limitations and spur further research, we introduce the SHOWMe dataset which consists of 96 videos, annotated with real and detailed hand-object 3D textured meshes. Following recent work, we consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence. This assumption allows us to register sub-millimetre-precise groundtruth 3D scans to the image sequences in SHOWMe. Although simpler, this hypothesis makes sense in terms of applications where the required accuracy and level of detail is important eg., object hand-over in human-robot collaboration, object scanning, or manipulation and contact point analysis. Importantly, the rigidity of the hand-object systems allows to tackle video-based 3D reconstruction of unknown hand-held objects using a 2-stage pipeline consisting of a rigid registration step followed by a multi-view reconstruction (MVR) part. We carefully evaluate a set of non-trivial baselines for these two stages and show that it is possible to achieve promising object-agnostic 3D hand-object reconstructions employing an SfM toolbox or a hand pose estimator to recover the rigid transforms and off-the-shelf MVR algorithms. However, these methods remain sensitive to the initial camera pose estimates which might be imprecise due to lack of textures on the objects or heavy occlusions of the hands, leaving room for improvements in the reconstruction. Code and dataset are available at https://europe.naverlabs.com/research/showme
    摘要 近期手Object交互数据集显示有限的真实物体多样性,并且通过适应MANO参数模型来获取实际手势。为了突破这些限制并促进更多的研究,我们介绍了SHOWMe数据集,包括96个视频,每个视频都有细节Real和3D手Object纹理网格的注释。我们遵循最近的工作,假设手Object场景是固定的,即手指与对象之间的pose保持不变 durante全个视频序列。这种假设使我们能够将亮度毫米精度的地面扫描与图像序列进行注册。虽然更简单,但这种假设在应用场景中是有意义的,例如人 robot合作中的手Object交换、物体扫描、或手指与对象的接触点分析。重要的是,固定的手Object系统使得我们可以通过一个2 stage管道来解决视频基于3D重建未知手持 объек的问题。我们仔细评估了一些非常轻量级的基准,并证明可以使用SfM工具箱或手势估计器来恢复固定变换和Off-the-shelf MVR算法来实现可靠的物体agnostic 3D手Object重建。然而,这些方法仍然敏感于初始相机pose估计,可能因为对象上缺乏文本或手指重叠而导致估计不准确,留下改进重建的空间。代码和数据集可以在https://europe.naverlabs.com/research/showme上下载。
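The first stage of the pipeline above is a rigid registration. As a simplified stand-in, the sketch below estimates the rotation and translation aligning two corresponding 3D point sets with the Kabsch/Procrustes algorithm; the benchmark's actual registration works on sub-millimetre scans and image sequences rather than clean point correspondences.

```python
import numpy as np

def rigid_register(source, target):
    """Least-squares rigid transform (R, t) such that R @ p + t maps source to target."""
    src_c, tgt_c = source.mean(axis=0), target.mean(axis=0)
    H = (source - src_c).T @ (target - tgt_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
moved = points @ R_true.T + np.array([0.1, -0.2, 0.3])

R_est, t_est = rigid_register(points, moved)
print(np.allclose(R_est, R_true, atol=1e-6))          # True
```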

Evaluating large language models’ ability to understand metaphor and sarcasm using a screening test for Asperger syndrome

  • paper_url: http://arxiv.org/abs/2309.10744
  • repo_url: https://github.com/hiromu/llm-msst
  • paper_authors: Hiromu Yakura
  • for: 本研究旨在检验最新的大型语言模型(LLMs)能否理解人类含义丰富的交流方式,包括隐喻和讽刺。
  • methods: 本研究使用标准化筛查测试来评估 LLMs 对隐喻和讽刺的理解能力。
  • results: 研究发现,随着模型参数的增加,LLMs 对隐喻的理解能力有所提高,但对讽刺的理解能力没有改善。这表明,要让 LLMs 理解讽刺,需要采取不同的方法。
    Abstract Metaphors and sarcasm are precious fruits of our highly-evolved social communication skills. However, children with Asperger syndrome are known to have difficulties in comprehending sarcasm, even if they possess a certain level of verbal IQ sufficient for understanding metaphors. Given that, a screening test that scores the ability to understand metaphor and sarcasm has been used to differentiate Asperger syndrome from other symptoms exhibiting akin external behaviors (e.g., attention-deficit/hyperactivity disorder). This study uses the standardized test to examine the capability of recent large language models (LLMs) in understanding human nuanced communication. The results divulged that, whereas their ability to comprehend metaphors has been improved with the increase of the number of model parameters, the improvement in sarcasm understanding was not observed. This implies that an alternative approach is imperative to imbue LLMs with the capacity to grasp sarcasm, which has been associated with the amygdala, a pivotal cerebral region for emotional learning, in the case of humans.
    摘要 隐喻和讽刺是我们高度进化的社会交流能力的珍贵成果。然而,患有阿斯伯格综合征的儿童即使具有足以理解隐喻的言语智商,也常常难以理解讽刺。因此,一种对理解隐喻和讽刺能力打分的筛查测试被用来将阿斯伯格综合征与其他外在行为相似的症状(如注意力缺陷/多动障碍)区分开来。本研究使用这种标准化测试来考察最新的大型语言模型(LLMs)理解人类细微交流的能力。结果显示,随着模型参数数量的增加,其理解隐喻的能力有所提高,但对讽刺的理解没有相应改善。这意味着,要让 LLMs 掌握讽刺,需要另辟蹊径;在人类中,讽刺理解与负责情绪学习的关键脑区杏仁核(amygdala)相关。

MelodyGLM: Multi-task Pre-training for Symbolic Melody Generation

  • paper_url: http://arxiv.org/abs/2309.10738
  • repo_url: https://github.com/NEXTLab-ZJU/MelodyGLM
  • paper_authors: Xinda Wu, Zhijie Huang, Kejun Zhang, Jiaxing Yu, Xu Tan, Tieyao Zhang, Zihao Wang, Lingyun Sun
  • for: 这 paper 是为了提高 symbolic melody generation 的预训练方法,以便更好地捕捉多个尺度、多个维度的结构信息在音序中。
  • methods: 这 paper 使用了 multi-task pre-training 框架 MelodyGLM,并设计了 local blank infilling 和 global blank infilling 任务,以模型音序中的本地和全球结构。
  • results: 对于 melody continuation 和 melody inpainting 任务,MelodyGLM 表现出了明显的改善,特别是在 subjective 评价中,MelodyGLM 的平均提升为 0.82、0.87、0.78 和 0.94 个数据点,并且在 melody inpainting 任务上几乎与人工编写的 melody 相当。
    Abstract Pre-trained language models have achieved impressive results in various music understanding and generation tasks. However, existing pre-training methods for symbolic melody generation struggle to capture multi-scale, multi-dimensional structural information in note sequences, due to the domain knowledge discrepancy between text and music. Moreover, the lack of available large-scale symbolic melody datasets limits the pre-training improvement. In this paper, we propose MelodyGLM, a multi-task pre-training framework for generating melodies with long-term structure. We design the melodic n-gram and long span sampling strategies to create local and global blank infilling tasks for modeling the local and global structures in melodies. Specifically, we incorporate pitch n-grams, rhythm n-grams, and their combined n-grams into the melodic n-gram blank infilling tasks for modeling the multi-dimensional structures in melodies. To this end, we have constructed a large-scale symbolic melody dataset, MelodyNet, containing more than 0.4 million melody pieces. MelodyNet is utilized for large-scale pre-training and domain-specific n-gram lexicon construction. Both subjective and objective evaluations demonstrate that MelodyGLM surpasses the standard and previous pre-training methods. In particular, subjective evaluations show that, on the melody continuation task, MelodyGLM gains average improvements of 0.82, 0.87, 0.78, and 0.94 in consistency, rhythmicity, structure, and overall quality, respectively. Notably, MelodyGLM nearly matches the quality of human-composed melodies on the melody inpainting task.
    摘要 传统的预训练方法对象是文本和音乐之间的知识差异,使得现有的预训练方法很难捕捉多级多维结构信息在旋律中。此外,Symbolic melody的大规模数据集的可用性限制了预训练的改进。本文提出了MelodyGLM,一个多任务预训练框架,用于生成具有长期结构的旋律。我们设计了旋律n-gram和长span采样策略,以创建本地和全局的缺失填充任务,以模拟旋律中的本地和全局结构。特别是,我们将把抑音n-gram、节奏n-gram和其结合的n-gram添加到旋律n-gram缺失填充任务中,以模拟旋律中的多维结构。为此,我们建立了一个大规模的Symbolic melody数据集,MelodyNet,包含超过0.4万个旋律 Piece。MelodyNet被用于大规模预训练和域pecific n-gram词典构造。对比标准和先前的预训练方法,我们的MelodyGLM得分较高,特别是在旋律续写任务上,MelodyGLM的平均提升为0.82、0.87、0.78和0.94在一致性、节奏性、结构和总质量等方面。值得注意的是,MelodyGLM在旋律填充任务上几乎与人工制作的旋律相当。
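The blank-infilling objective can be illustrated with a few lines of token manipulation: pick a melodic n-gram (a contiguous span of note tokens), replace it with a single mask token in the corrupted input, and keep the span as the infilling target. The toy (pitch, duration) tokens and the uniform span selection are illustrative; MelodyGLM builds its n-grams from pitch and rhythm lexicons and also uses long-span (global) infilling.

```python
import random

random.seed(0)

# Toy melody: each token is a (pitch, duration-in-sixteenths) pair.
melody = [("C4", 4), ("E4", 4), ("G4", 8), ("A4", 4), ("G4", 4), ("E4", 8), ("C4", 16)]

def ngram_blank_infilling(tokens, n=3):
    """Mask one contiguous n-gram; return (corrupted input, infilling target)."""
    start = random.randrange(len(tokens) - n + 1)
    target = tokens[start:start + n]
    corrupted = tokens[:start] + ["[MASK]"] + tokens[start + n:]
    return corrupted, target

corrupted, target = ngram_blank_infilling(melody, n=3)
print("input :", corrupted)
print("target:", target)
```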

Monte-Carlo tree search with uncertainty propagation via optimal transport

  • paper_url: http://arxiv.org/abs/2309.10737
  • repo_url: None
  • paper_authors: Tuan Dam, Pascal Stenger, Lukas Schneider, Joni Pajarinen, Carlo D’Eramo, Odalric-Ambrym Maillard
  • for: 这篇论文提出了一种新的备份策略,用于高度随机和部分可见的马尔可夫决策过程。
  • methods: 我们采用了一种概率方法,将值节点和行动值节点都模型为高斯分布。我们引入了一种新的备份操作符,通过计算行动值子节点的 Wasserstein 质量中心来传递估计的不确定性到根节点。
  • results: 我们提供了许多理论保证,证明我们的概率备份操作符在某些情况下具有极限吞吐量,并且在一些随机和部分可见环境中比较出色的表现。
    Abstract This paper introduces a novel backup strategy for Monte-Carlo Tree Search (MCTS) designed for highly stochastic and partially observable Markov decision processes. We adopt a probabilistic approach, modeling both value and action-value nodes as Gaussian distributions. We introduce a novel backup operator that computes value nodes as the Wasserstein barycenter of their action-value children nodes; thus, propagating the uncertainty of the estimate across the tree to the root node. We study our novel backup operator when using a novel combination of $L^1$-Wasserstein barycenter with $\alpha$-divergence, by drawing a notable connection to the generalized mean backup operator. We complement our probabilistic backup operator with two sampling strategies, based on optimistic selection and Thompson sampling, obtaining our Wasserstein MCTS algorithm. We provide theoretical guarantees of asymptotic convergence to the optimal policy, and an empirical evaluation on several stochastic and partially observable environments, where our approach outperforms well-known related baselines.
    摘要 这篇论文介绍了一种新的 Monte Carlo Tree Search(MCTS)备份策略,适用于高度随机和部分可见的 Markov 决策过程。我们采用概率化建模,将值节点和行动值节点都表示为 Gaussian 分布。我们提出了一种新的备份算子,将值节点计算为其行动值子节点的 Wasserstein 重心(barycenter),从而将估计的不确定性沿树传递到根节点。我们将该备份算子与两种抽样策略(基于 optimistic selection 和 Thompson sampling)相结合,得到 Wasserstein MCTS 算法。我们提供了渐近收敛到最优策略的理论保证,并在多个随机和部分可见环境上进行了实验评估,结果表明我们的方法超过了一些相关的基准方法。
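For one-dimensional Gaussian action-value estimates, the 2-Wasserstein barycenter has a closed form: a Gaussian whose mean is the weighted average of the children's means and whose standard deviation is the weighted average of their standard deviations. The sketch below applies that formula as the value-node backup; the visit-count weights are an illustrative choice, not necessarily the weighting used in the paper.

```python
import numpy as np

def gaussian_w2_barycenter(means, stds, weights):
    """Closed-form W2 barycenter of 1-D Gaussians: weighted mean of means,
    weighted mean of standard deviations."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(weights @ np.asarray(means)), float(weights @ np.asarray(stds))

# Action-value children of a node: (mean, std, visit count).
children = [(1.2, 0.50, 10), (0.8, 0.20, 30), (1.5, 0.90, 5)]
means, stds, visits = zip(*children)

mu, sigma = gaussian_w2_barycenter(means, stds, visits)
print(f"backed-up value node: N({mu:.3f}, {sigma:.3f}^2)")
```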

PAMS: Platform for Artificial Market Simulations

  • paper_url: http://arxiv.org/abs/2309.10729
  • repo_url: https://github.com/masanorihirano/pams
  • paper_authors: Masanori Hirano, Ryosuke Takata, Kiyoshi Izumi
  • for: 这个论文提出了一个新的人工市场模拟平台,即 PAMS:Platform for Artificial Market Simulations。PAMS 是一个基于 Python 的模拟器,可以轻松地与深度学习结合,并且允许用户轻松地修改 simulation。
  • methods: 本论文使用的方法包括了深度学习,以 Predicting future prices。
  • results: 研究表明,PAMS 可以准确地预测未来的价格。
    Abstract This paper presents a new artificial market simulation platform, PAMS: Platform for Artificial Market Simulations. PAMS is developed as a Python-based simulator that is easily integrated with deep learning and enabling various simulation that requires easy users' modification. In this paper, we demonstrate PAMS effectiveness through a study using agents predicting future prices by deep learning.
    摘要 这篇论文介绍了一个新的人工市场模拟平台 PAMS(Platform for Artificial Market Simulations)。PAMS 是一个基于 Python 的模拟器,可以轻松地与深度学习结合,并允许用户方便地修改模拟设置以支持各种模拟场景。在本文中,我们通过使用深度学习 agents 预测未来价格来展示 PAMS 的有效性。

Causality-Driven One-Shot Learning for Prostate Cancer Grading from MRI

  • paper_url: http://arxiv.org/abs/2309.10725
  • repo_url: None
  • paper_authors: Gianluca Carloni, Eva Pachetti, Sara Colantonio
  • for: 这个研究旨在提出一种自动分类医疗影像的方法,并将弱 causal 信号在影像中学习和应用。
  • methods: 我们的框架包括卷积神经网络和 causality-extractor 模组,这个模组可以将 causa-effect 关系 между特征图中的特征,帮助模型对于影像中的特征进行推断。
  • results: 我们通过一个 One-shot 学习 scheme 进行训练,包括 meta-training 和 meta-testing 任务,以评估我们的方法在低数据情况下的效果。我们在公开可用的前列腺(prostate)MRI 影像集上进行了二分类和多分类实验,并进行了消融实验和定性评估,以验证所提出的 causality-driven 模组的有效性。我们发现,特征之间的因果关系扮演着关键角色,帮助模型更好地分类医疗影像。
    Abstract In this paper, we present a novel method to automatically classify medical images that learns and leverages weak causal signals in the image. Our framework consists of a convolutional neural network backbone and a causality-extractor module that extracts cause-effect relationships between feature maps that can inform the model on the appearance of a feature in one place of the image, given the presence of another feature within some other place of the image. To evaluate the effectiveness of our approach in low-data scenarios, we train our causality-driven architecture in a One-shot learning scheme, where we propose a new meta-learning procedure entailing meta-training and meta-testing tasks that are designed using related classes but at different levels of granularity. We conduct binary and multi-class classification experiments on a publicly available dataset of prostate MRI images. To validate the effectiveness of the proposed causality-driven module, we perform an ablation study and conduct qualitative assessments using class activation maps to highlight regions strongly influencing the network's decision-making process. Our findings show that causal relationships among features play a crucial role in enhancing the model's ability to discern relevant information and yielding more reliable and interpretable predictions. This would make it a promising approach for medical image classification tasks.
    摘要 在这篇论文中,我们提出了一种新的方法,用于自动分类医疗图像。我们的框架包括一个卷积神经网络骨干和一个 causality-extractor 模块,该模块提取特征图之间的因果关系,即当某一特征出现在图像某个位置时,另一特征在其他位置出现的可能性,从而为模型提供信息。为了评估我们的方法在低数据场景中的效果,我们采用了一种 One-shot learning 方案,包括 meta-training 和 meta-testing 任务,这些任务使用相关的类别,但在不同的粒度水平上设计。我们在公开可用的前列腺(prostate)MRI 图像数据集上进行了二分类和多分类实验。为了验证所提出的 causality-driven 模块的有效性,我们进行了消融实验和定性评估,使用类激活图(class activation maps)高亮强烈影响网络决策过程的区域。我们的发现表明,特征之间的因果关系在帮助模型辨别相关信息、产生更可靠且可解释的预测方面发挥了关键作用,这使得该方法有望用于医疗图像分类任务。

Sound Source Localization is All about Cross-Modal Alignment

  • paper_url: http://arxiv.org/abs/2309.10724
  • repo_url: None
  • paper_authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
  • for: 本研究旨在解决真正的声源定位问题,即人类可以通过视觉场景中的声音来确定声音的来源。
  • methods: 我们提出了一种涉及声音和视觉modalities的共同定位任务,以提高声音定位和视觉modalities之间的协调。
  • results: 我们的方法在声音定位和跨模态检索中表现出色,高于当前的状态艺术方法。这些结果表明,同时解决声音定位和跨模态协调任务是解决真正的声源定位问题的关键。
    Abstract Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization. Cross-modal semantic understanding is important in understanding semantically mismatched audio-visual events, e.g., silent objects, or off-screen sounds. To account for this, we propose a cross-modal alignment task as a joint task with sound source localization to better learn the interaction between audio and visual modalities. Thereby, we achieve high localization performance with strong cross-modal semantic understanding. Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval. Our work suggests that jointly tackling both tasks is necessary to conquer genuine sound source localization.
    摘要 人类可以轻松地在视觉场景中识别声音源的方向,称为声音源localization。现在的学习基于的声音源localization研究主要从localization角度出发。然而,前一代和现有的标准没有考虑一个更重要的问题,即跨模态 semantics的理解,这是真正的声音源localization的关键。跨模态 semantics的理解能够有效地处理semantically mismatched audio-visual事件,如静物或屏外声音。为了考虑这一点,我们提议在声音源localization任务中添加跨模态对应 зада务,以更好地学习视觉modalities之间的交互。因此,我们实现了高地理位性性能和强的跨模态 semantics理解。我们的方法超越了当前状态的方法在声音源localization和跨模态retrieval两个领域。我们的工作表明,同时解决这两个任务是必要的,以解决真正的声音源localization。

LEA*: An A* Variant Algorithm with Improved Edge Efficiency for Robot Motion Planning

  • paper_url: http://arxiv.org/abs/2309.10722
  • repo_url: https://github.com/dongliangch/leastar
  • paper_authors: Dongliang Zheng, Panagiotis Tsiotras
  • for: 这个论文是为了提出一种新的图搜索算法,即懒边基于A*(LEA*),用于机器人运动规划。
  • methods: 该算法使用边队列和懒搜索的想法,与A*相似,具有优化的顶点效率和改进的边效率。它的实现几乎没有改变A*的基本结构,因此对前一些懒搜索算法的过渡带来了较小的负担。
  • results: 我们在2D规划问题和7度 freedom manipulator的规划中测试了LEA*和其它算法。我们对Random世界和不同的图大小进行了严格的比较,结果显示LEA*和它的彩色版本wLEA*在发现计划的速度方面与之前的算法相比较快。
    Abstract In this work, we introduce a new graph search algorithm, lazy edged based A* (LEA*), for robot motion planning. By using an edge queue and exploiting the idea of lazy search, LEA* is optimally vertex efficient similar to A*, and has improved edge efficiency compared to A*. LEA* is simple and easy to implement with minimum modification to A*, resulting in a very small overhead compared to previous lazy search algorithms. We also explore the effect of inflated heuristics, which results in the weighted LEA* (wLEA*). We show that the edge efficiency of wLEA* becomes close to LazySP and, thus is near-optimal. We test LEA* and wLEA* on 2D planning problems and planning of a 7-DOF manipulator. We perform a thorough comparison with previous algorithms by considering sparse, medium, and cluttered random worlds and small, medium, and large graph sizes. Our results show that LEA* and wLEA* are the fastest algorithms to find the plan compared to previous algorithms.
    摘要 在这个工作中,我们介绍了一种新的图搜索算法,懒散边基于A*(LEA*),用于机器人运动规划。通过使用边队列和懒散搜索的想法,LEA* 能够与A* 类似的顶点效率优化,并与A* 的边效率相比提高。LEA* 简单易于实现,对 previous lazy search 算法的修改 minimal,因此对于 previous lazy search 算法的 overhead 具有较小的影响。我们还探讨了膨胀式拓扑(weighted LEA*)的效果,并证明其边效率接近LazySP,因此是近似于优化的。我们在 2D 规划问题和7-DOF manipulator 的规划中测试了 LEA* 和 weighted LEA*。我们对 previous algorithms 进行了系统比较,包括 randomly generated sparse、medium 和填充的世界,以及 small、medium 和大的图像大小。我们的结果表明 LEA* 和 weighted LEA* 比 previous algorithms 更快地查找了计划。

Measurement Simplification in ρ-POMDP with Performance Guarantees

  • paper_url: http://arxiv.org/abs/2309.10701
  • repo_url: None
  • paper_authors: Tom Yotam, Vadim Indelman
  • for: 这篇论文主要目标是提出一种高效的决策方法,用于在不精确的信息下进行决策。
  • methods: 该论文使用分割观察空间的方法,以形成关于预期信息奖励的分析 bounds。这些 bounds 然后用于高效地规划,保证性能。
  • results: 该论文显示了这种方法的效果,包括在 Gaussian 信号下的性能提升,以及在实验中的速度增加。同时,它也与其他现有的方法进行比较,并证明其在活动 SLAM 场景中的优势。
    Abstract Decision making under uncertainty is at the heart of any autonomous system acting with imperfect information. The cost of solving the decision making problem is exponential in the action and observation spaces, thus rendering it unfeasible for many online systems. This paper introduces a novel approach to efficient decision-making, by partitioning the high-dimensional observation space. Using the partitioned observation space, we formulate analytical bounds on the expected information-theoretic reward, for general belief distributions. These bounds are then used to plan efficiently while keeping performance guarantees. We show that the bounds are adaptive, computationally efficient, and that they converge to the original solution. We extend the partitioning paradigm and present a hierarchy of partitioned spaces that allows greater efficiency in planning. We then propose a specific variant of these bounds for Gaussian beliefs and show a theoretical performance improvement of at least a factor of 4. Finally, we compare our novel method to other state of the art algorithms in active SLAM scenarios, in simulation and in real experiments. In both cases we show a significant speed-up in planning with performance guarantees.
    摘要 在不完全信息下行动的自主系统,其核心是不确定性下的决策。求解该决策问题的代价随动作空间和观察空间呈指数增长,因此许多在线系统无法承受。这篇论文提出了一种新方法,通过划分高维观察空间来提高决策效率。基于划分后的观察空间,我们针对一般的信念分布推导了预期信息论奖励的解析界。这些界随后被用于高效规划,并保证性能。我们证明这些界是自适应的、计算高效的,并且收敛到原始解。我们还扩展了划分思想,提出了一个层次化的划分空间结构,以进一步提高规划效率。随后,我们针对高斯信念提出了这些界的一个具体变体,并证明其在理论上至少带来 4 倍的性能提升。最后,我们在仿真和真实实验的主动 SLAM 场景中与其它最新算法进行了比较,均在保证性能的同时显著加快了规划速度。

From “Let’s Google” to “Let’s ChatGPT”: Student and Instructor Perspectives on the influence of LLMs on Undergraduate Engineering Education

  • paper_url: http://arxiv.org/abs/2309.10694
  • repo_url: None
  • paper_authors: Ishika Joshi, Ritvik Budhiraja, Pranav Deepak Tanna, Lovenya Jain, Mihika Deshpande, Arjun Srivastava, Srinivas Rallapalli, Harshal D Akolekar, Jagat Sesh Challa, Dhruv Kumar
  • for: This paper aims to explore the current usage patterns, perceived benefits, threats, and challenges of Large Language Models (LLMs) among students and instructors in undergraduate engineering universities in India.
  • methods: The study gathers data on the academic usage of ChatGPT through 1306 survey responses from students, 112 student interviews, and 27 instructor interviews.
  • results: The study finds that LLMs are currently used primarily for answering questions and providing explanations, and that students and instructors perceive benefits such as improved understanding and efficiency, but also face challenges such as the need for critical thinking and the potential for misuse. The study offers recommendations for enhancing the adoption of LLMs in undergraduate engineering education and beyond.
    Abstract The rise in popularity of Large Language Models (LLMs) has prompted discussions in academic circles, with students exploring LLM-based tools for coursework inquiries and instructors exploring them for teaching and research. Even though a lot of work is underway to create LLM-based tools tailored for students and instructors, there is a lack of comprehensive user studies that capture the perspectives of students and instructors regarding LLMs. This paper addresses this gap by conducting surveys and interviews within undergraduate engineering universities in India. Using 1306 survey responses among students, 112 student interviews, and 27 instructor interviews around the academic usage of ChatGPT (a popular LLM), this paper offers insights into the current usage patterns, perceived benefits, threats, and challenges, as well as recommendations for enhancing the adoption of LLMs among students and instructors. These insights are further utilized to discuss the practical implications of LLMs in undergraduate engineering education and beyond.
    摘要 LLM(大型自然语言模型)的崛起,已经引发了学术界的讨论,学生们在作业问题上使用 LLM 的工具,教师则在教学和研究中使用 LLM。虽然有很多人在开发学生和教师专门的 LLM 工具,但是没有全面的用户研究,捕捉学生和教师对 LLM 的看法。这篇论文填补了这个空白,通过在印度的大学中进行调查和采访,收集了1306名学生的问卷回答、112名学生的面对面采访和27名教师的采访,对学生和教师在学术上使用 ChatGPT(一个流行的 LLM)的现有使用模式、感受到的利点、威胁和挑战,以及提高学生和教师对 LLM 的采用的建议。这些发现还可以用来讨论 LLMS 在bachelor 工程教育中的实际应用和未来发展。

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

  • paper_url: http://arxiv.org/abs/2309.10691
  • repo_url: None
  • paper_authors: Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji
  • for: 评估大型自然语言模型(LLM)在复杂任务解决方面的多轮交互能力。
  • methods: 利用工具和自然语言反馈来评估 LLM 的多轮交互能力,并提供一套可重复性评估框架。
  • results: 研究发现, LLM 在多轮交互中受益于工具和自然语言反馈,表现提升(绝对值)1-8% 每次工具使用和2-17% 自然语言反馈。 单 turno 性能不一定对多轮交互性能有积极影响。 SIFT 和 RLHF 等方法在 LLM 中通常减退多轮交互能力。
    Abstract To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation paradigms often focus solely on benchmark performance with single-turn exchanges, neglecting the intricate interactions among the user, LLMs, and external tools, creating a discrepancy between benchmark evaluation and real-world use cases. We introduce MINT benchmark to evaluate LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive natural language feedback from the user simulated with GPT-4. We repurpose a diverse set of established datasets and tasks focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset of instances for efficient evaluation. Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (1) LLMs generally benefit from tool interactions and language feedback, with performance gains (absolute, same below) of 1--8% per additional turn with tool use and 2--17% with natural language feedback. (2) Better single-turn performance does not guarantee better multi-turn performance. (3) Surprisingly, on LLMs we evaluated, we found supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We hope MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation has been less accessible compared to commercial LLMs with a larger user base.
    摘要 LLMs 通常需要多次互动来解决复杂任务,但现有的评估方法通常只关注单次交互的性能,忽视用户、LLMs 和外部工具之间的复杂互动,从而导致评估和实际使用场景之间的差异。我们提出了 MINT 评估标准,用于评估 LLMs 在多次交互中解决任务的能力,包括使用工具和利用自然语言反馈。为确保可重复性,我们提供了一个评估框架,其中 LLMs 可以通过执行 Python 代码来访问工具,并从用户模拟器(使用 GPT-4)接收自然语言反馈。我们将一些已有的 dataset 和任务重新分配,并将其精炼成一个高效的评估集。我们对 20 个开源和关闭源 LLMs 进行分析,发现了一些有趣的发现:1. LLMs 通常受益于工具和自然语言反馈,其性能提升(绝对值)为 1-8% 每次工具使用和 2-17% 自然语言反馈。2. 更高的单次性能不一定意味着更高的多次性能。3. 对我们评估的 LLMs,我们发现了超级vised instruction-finetuning (SIFT) 和人类反馈学习 (RLHF) 通常会降低多次性能。我们希望 MINT 可以帮助测量进步,并鼓励研究人员在多次互动中提高 LLMs 的能力,特别是对于开源社区,其中多次人工评估的训练资源相对较少,相比于商业 LLMs 的更大用户基数。
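The evaluation loop in MINT hinges on letting the model act by emitting Python code and feeding the execution result back as the next observation. Below is a heavily simplified sketch of that single tool-execution step (code extraction, exec, captured stdout returned as feedback); the tag convention is invented for illustration, exec here is not a real sandbox, and the actual framework adds GPT-4-simulated language feedback, turn budgets, and safety handling.

```python
import io
import re
import contextlib

def run_tool_call(model_output: str) -> str:
    """Extract code wrapped in <execute>...</execute> tags from a model turn,
    run it, and return captured stdout (or the error) as the next observation.
    The tag convention is illustrative; exec() here is NOT a real sandbox."""
    match = re.search(r"<execute>(.*?)</execute>", model_output, re.DOTALL)
    if match is None:
        return "No code block found."
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), {})
    except Exception as exc:
        return f"Execution error: {exc!r}"
    return buffer.getvalue() or "(no output)"

turn = """Let me compute it.
<execute>
total = sum(i * i for i in range(10))
print(total)
</execute>"""
print(run_tool_call(turn))   # 285
```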

Learning-Initialized Trajectory Planning in Unknown Environments

  • paper_url: http://arxiv.org/abs/2309.10683
  • repo_url: None
  • paper_authors: Yicheng Chen, Jinjie Li, Wenyuan Qin, Yongzhao Hua, Xiwang Dong, Qingdong Li
  • for: 提高自适应飞行器在未知环境中的准确规划,以便实现更高级别的自主飞行。
  • methods: 提出了学习初始化规划器(LIT-Planner),利用神经网络规划器提供初始值,并通过批量采样进行空间-时间优化,以捕捉多模态性。
  • results: 通过对真实世界和虚拟环境进行模拟和实验,证明LIT-Planner可以减少优化时间cost,并保持规划质量。
    Abstract Autonomous flight in unknown environments requires precise planning for both the spatial and temporal profiles of trajectories, which generally involves nonconvex optimization, leading to high time costs and susceptibility to local optima. To address these limitations, we introduce the Learning-Initialized Trajectory Planner (LIT-Planner), a novel approach that guides optimization using a Neural Network (NN) Planner to provide initial values. We first leverage the spatial-temporal optimization with batch sampling to generate training cases, aiming to capture multimodality in trajectories. Based on these data, the NN-Planner maps visual and inertial observations to trajectory parameters for handling unknown environments. The network outputs are then optimized to enhance both reliability and explainability, ensuring robust performance. Furthermore, we propose a framework that supports robust online replanning with tolerance to planning latency. Comprehensive simulations validate the LIT-Planner's time efficiency without compromising trajectory quality compared to optimization-based methods. Real-world experiments further demonstrate its practical suitability for autonomous drone navigation.
    摘要 自适应飞行在未知环境中需要精准规划空间和时间轨迹的profile,通常是非核心优化,导致高时间成本和易陷到地点优化。为解决这些限制,我们介绍了学习INITIALIZED Trajectory Planner(LIT-Planner),一种新的方法,该使用神经网络(NN)Planner提供初始值。我们首先利用空间-时间优化批处理生成训练例子,以捕捉多模态的轨迹。基于这些数据,NN-Planner将视觉和遥感观察映射到轨迹参数,以处理未知环境。网络输出被优化,以提高可靠性和可解释性,确保robust性。此外,我们提出了支持稳定在线重新规划的框架,抗性能规划延迟。完整的 simulations validate LIT-Planner的时间效率,而不会妥协轨迹质量与优化方法相比。实际世界实验进一步证明了它的实用性。

Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation

  • paper_url: http://arxiv.org/abs/2309.10677
  • repo_url: None
  • paper_authors: Yucheng Li
  • for: 本研究旨在提供一种不需要全量训练数据的污染分析方法,以便对现代语言模型进行可靠的评估。
  • methods: 本研究提出了一种基于困惑度(perplexity)的污染分析方法,无需访问全量训练数据。
  • results: 研究发现,近期的基础模型在文本理解和概要写作benchmark上存在显著的记忆现象,而多选问题相对较少受污染。
    Abstract Data contamination in model evaluation is getting increasingly prevalent as the massive training corpora of large language models often unintentionally include benchmark samples. Therefore, contamination analysis has became an inevitable part of reliable model evaluation. However, existing method of contamination analysis requires the access of the entire training data which is often confidential for recent models. This prevent the community to rigorously audit these models and conduct accurate assessment of their capability. In this paper, we propose a novel method to quantify contamination without the access of the full training set, that measure the extent of contamination with perplexity. Our analysis provides evidence of significant memorisation of recent foundation models in popular reading comprehension, summarisation benchmarks, while multiple choice appears less contaminated.
    摘要 模型评估中的数据污染问题日益普遍,这是由于大型语言模型的海量训练语料经常无意间包含了基准测试样本。因此,污染分析已成为可靠模型评估不可或缺的一部分。然而,现有的污染分析方法需要访问整个训练数据,而对近期模型而言这些数据往往是保密的,这阻碍了社区对这些模型进行严格审核和准确评估其能力。在这篇论文中,我们提出了一种无需访问完整训练集即可量化污染程度的新方法,该方法基于困惑度(perplexity)来衡量污染程度。我们的分析表明,最近的基础模型在流行的阅读理解和摘要生成基准测试中存在明显的记忆现象,而多选题基准受污染的程度相对较低。
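
The abstract's core idea admits a very small sketch: perplexity is the exponential of the average negative log-likelihood, and a benchmark that a model finds markedly "easier" than comparable fresh text is a candidate for memorisation. The interface `model_logprobs_fn` and the ratio-based score below are assumptions for illustration, not the paper's exact procedure.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log p(token)). Markedly lower perplexity on benchmark
    text than on comparable fresh text is read here as a rough memorisation signal."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def contamination_score(model_logprobs_fn, benchmark_texts, reference_texts):
    """Compare a model's perplexity on benchmark samples vs. unseen reference samples
    from a similar distribution. `model_logprobs_fn(text)` is a caller-supplied
    function returning per-token log-probabilities under the model being audited."""
    ppl_bench = [perplexity(model_logprobs_fn(t)) for t in benchmark_texts]
    ppl_ref = [perplexity(model_logprobs_fn(t)) for t in reference_texts]
    mean = lambda xs: sum(xs) / len(xs)
    # A ratio well below 1 means the benchmark is suspiciously "easy" for the model.
    return mean(ppl_bench) / mean(ppl_ref)
```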

Language Modeling Is Compression

  • paper_url: http://arxiv.org/abs/2309.10668
  • repo_url: https://github.com/facebookresearch/FBTT-Embedding
  • paper_authors: Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness
  • for: 这项研究的目的是用predictive模型来压缩数据,并评估大型自然语言模型的压缩能力。
  • methods: 该研究使用了大型自然语言模型,并使用了压缩视角来评估这些模型的Scaling laws、Tokenization和in-context learning能力。
  • results: 研究发现,大型自然语言模型不仅是强大的预测器,而且可以压缩图像和语音数据,比如ImageNet和LibriSpeech,以达到43.4%和16.4%的压缩率。此外,研究还表明,使用压缩视角可以使用任何压缩器(如gzip)建立conditional generative模型。
    Abstract It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
    摘要 预测模型可以转化为无损压缩器,反之亦然,这一点早已确立。近年来,机器学习社区专注于训练规模越来越大、能力越来越强的自监督(语言)模型。由于这些大型语言模型展现出令人印象深刻的预测能力,它们天然适合作为强大的压缩器。在这项工作中,我们主张从压缩的视角看待预测问题,并评估大型(基础)模型的压缩能力。我们表明大型语言模型是强大的通用预测器,并且压缩视角为扩展规律(scaling laws)、分词(tokenization)和上下文学习(in-context learning)提供了新的见解。例如,主要在文本上训练的 Chinchilla 70B 可以将 ImageNet 图像块压缩到原始大小的 43.4%、将 LibriSpeech 样本压缩到 16.4%,分别优于 PNG(58.5%)和 FLAC(30.3%)等领域专用压缩器。最后,我们展示了预测-压缩等价性允许我们使用任何压缩器(如 gzip)来构建条件生成模型。
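
The prediction-compression equivalence the abstract relies on is Shannon's source-coding identity: an arithmetic coder driven by a predictive model spends about $-\log_2 p(x_t \mid x_{<t})$ bits per symbol. The toy sketch below (with an assumed `predictor(prefix) -> {symbol: probability}` interface) shows how a better predictor directly yields a shorter code.

```python
import math

def ideal_compressed_bits(sequence, predictor):
    """Ideal (arithmetic-coding) code length under a predictive model:
    sum over the sequence of -log2 p(x_t | x_<t). Better prediction means
    fewer bits, which is the prediction <-> compression equivalence."""
    return sum(-math.log2(predictor(sequence[:t])[x]) for t, x in enumerate(sequence))

data = "ab" * 100  # a very predictable "text"

# Context-free model: 50/50 over the two symbols -> exactly 1 bit per symbol.
uniform = lambda prefix: {"a": 0.5, "b": 0.5}

def alternating(prefix):
    """A model that has learned the alternation (with a little smoothing)."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    nxt = "b" if prefix[-1] == "a" else "a"
    other = "a" if nxt == "b" else "b"
    return {nxt: 0.99, other: 0.01}

print(ideal_compressed_bits(data, uniform))      # 200.0 bits
print(ideal_compressed_bits(data, alternating))  # ~3.9 bits: the better predictor compresses more
```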

NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

  • paper_url: http://arxiv.org/abs/2309.10661
  • repo_url: https://github.com/indonlp/nusa-writes
  • paper_authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta Muridan, Genta Indra Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung
  • for: 这篇论文的目的是推动自然语言处理(NLP)技术的普及,尤其是面向代表性不足和资源极度匮乏的语言。
  • methods: 以往工作使用在线抓取和文档翻译来构建标注与无标注语料库。然而,这些方法存在局限,包括词汇多样性不足以及与当地社区的文化相关性欠缺。
  • results: 实验结果表明,由母语者撰写段落构建的数据集在词汇多样性和文化内容上质量更高。此外,我们提出了 NusaWrites 基准,覆盖印度尼西亚数百万人使用的 12 种代表性不足且资源极度匮乏的语言。实验还表明,现有的多语言大语言模型需要扩展到更多代表性不足的语言。NusaWrites 数据集发布于 https://github.com/IndoNLP/nusa-writes 。
    Abstract Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the \datasetname{} benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical experiment results using existing multilingual large language models conclude the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.
    摘要 普及自然语言处理(NLP)技术至关重要,尤其是对于代表性不足和资源极度匮乏的语言。以往的研究着重于通过网络抓取和文档翻译为这些语言构建标注与无标注语料库。尽管这些方法行之有效且成本低廉,我们发现所得语料存在局限,包括词汇多样性不足以及缺乏与当地社区相关的文化内涵。为弥补这一缺口,我们以印度尼西亚的地方语言为案例进行研究,比较了网络抓取、人工翻译以及由母语者撰写段落三种构建数据集的方法。结果表明,由母语者撰写段落得到的数据集在词汇多样性和文化内容方面质量更佳。此外,我们提出了 NusaWrites 基准,涵盖印度尼西亚数百万人使用的 12 种代表性不足且资源极度匮乏的语言。基于现有多语言大型语言模型的实验结果表明,有必要将这些模型扩展到更多代表性不足的语言。NusaWrites 数据集发布于 https://github.com/IndoNLP/nusa-writes 。

CFGPT: Chinese Financial Assistant with Large Language Model

  • paper_url: http://arxiv.org/abs/2309.10654
  • repo_url: None
  • paper_authors: Jiangtong Li, Yuxuan Bian, Guoxuan Wang, Yang Lei, Dawei Cheng, Zhijun Ding, Changjun Jiang
  • For: The paper presents a Chinese Financial Generative Pre-trained Transformer framework (CFGPT) for natural language processing tasks in the financial domain.
  • Methods: The paper uses a dataset (CFData) for pre-training and supervised fine-tuning, a financial LLM (CFLLM) to manage financial texts, and a deployment framework (CFAPP) to navigate real-world financial applications. The CFLLM is trained on CFData in two stages, continued pre-training and supervised fine-tuning.
  • Results: The paper presents a tailored dataset (CFData) for financial natural language processing, a financial LLM (CFLLM) that can adeptly manage financial texts, and a deployment framework (CFAPP) with additional modules for multifaceted functionality in real-world applications.
  • for: 这篇论文是为了介绍一种基于Transformer框架的中文金融生成预训练模型(CFGPT),用于金融自然语言处理任务。
  • methods: 论文使用了一个名为CFData的数据集进行预训练和精度调整,还有一个专门为金融文本管理的金融LLM(CFLLM),以及一个用于实际应用的投放框架(CFAPP)。CFLLM通过两个阶段的预训练和精度调整来训练。
  • results: 论文提供了一个适用于金融自然语言处理的专门数据集(CFData),一个能够有效地处理金融文本的金融LLM(CFLLM),以及一个具有多方面功能的投放框架(CFAPP)。
    Abstract Large language models (LLMs) have demonstrated great potential in natural language processing tasks within the financial domain. In this work, we present a Chinese Financial Generative Pre-trained Transformer framework, named CFGPT, which includes a dataset~(CFData) for pre-training and supervised fine-tuning, a financial LLM~(CFLLM) to adeptly manage financial texts, and a deployment framework~(CFAPP) designed to navigate real-world financial applications. The CFData comprising both a pre-training dataset and a supervised fine-tuning dataset, where the pre-training dataset collates Chinese financial data and analytics, alongside a smaller subset of general-purpose text with 584M documents and 141B tokens in total, and the supervised fine-tuning dataset is tailored for six distinct financial tasks, embodying various facets of financial analysis and decision-making with 1.5M instruction pairs and 1.5B tokens in total. The CFLLM, which is based on InternLM-7B to balance the model capability and size, is trained on CFData in two stage, continued pre-training and supervised fine-tuning. The CFAPP is centered on large language models (LLMs) and augmented with additional modules to ensure multifaceted functionality in real-world application. Our codes are released at https://github.com/TongjiFinLab/CFGPT.
    摘要 大型语言模型(LLMs)在金融领域的自然语言处理任务中表现出了很大的潜力。在这项工作中,我们介绍了一个名为 CFGPT 的中文金融生成预训练 Transformer 框架,其包括一个用于预训练和监督微调的数据集 CFData、一个能够熟练处理金融文本的金融大语言模型 CFLLM,以及一个面向实际金融应用的部署框架 CFAPP。CFData 包含预训练数据集和监督微调数据集:预训练数据集汇集了中文金融数据与分析以及一小部分通用文本,共 584M 份文档、141B 个词元;监督微调数据集则针对六类金融任务定制,共包含 1.5M 条指令对、1.5B 个词元。CFLLM 基于 InternLM-7B 以平衡模型能力与规模,在 CFData 上分继续预训练和监督微调两个阶段进行训练。CFAPP 以大型语言模型为核心,并增加了其他模块,以确保实际应用中的多方面功能。我们的代码发布于 https://github.com/TongjiFinLab/CFGPT 。

Towards Energy-Aware Federated Traffic Prediction for Cellular Networks

  • paper_url: http://arxiv.org/abs/2309.10645
  • repo_url: https://github.com/vperifan/federated-time-series-forecasting
  • paper_authors: Vasileios Perifanis, Nikolaos Pavlidis, Selim F. Yilmaz, Francesc Wilhelmi, Elia Guerra, Marco Miozzo, Pavlos S. Efraimidis, Paolo Dini, Remous-Aris Koutsiamanis
  • for: 预测 fifth-generation 网络流量是一项重要的活动,以便优化网络,因为准确的预测是关键 для智能网络设计、资源分配和异常情况检测。
  • methods: 本文使用了 federated learning(FL)作为一种机器学习训练框架,以提高预测精度并避免数据中心化问题。
  • results: 研究发现,大型机器学习模型在联合学习场景下可以 marginally 提高性能,但具有显著的环境影响,导致它们在实际应用中不实际。
    Abstract Cellular traffic prediction is a crucial activity for optimizing networks in fifth-generation (5G) networks and beyond, as accurate forecasting is essential for intelligent network design, resource allocation and anomaly mitigation. Although machine learning (ML) is a promising approach to effectively predict network traffic, the centralization of massive data in a single data center raises issues regarding confidentiality, privacy and data transfer demands. To address these challenges, federated learning (FL) emerges as an appealing ML training framework which offers high accurate predictions through parallel distributed computations. However, the environmental impact of these methods is often overlooked, which calls into question their sustainability. In this paper, we address the trade-off between accuracy and energy consumption in FL by proposing a novel sustainability indicator that allows assessing the feasibility of ML models. Then, we comprehensively evaluate state-of-the-art deep learning (DL) architectures in a federated scenario using real-world measurements from base station (BS) sites in the area of Barcelona, Spain. Our findings indicate that larger ML models achieve marginally improved performance but have a significant environmental impact in terms of carbon footprint, which make them impractical for real-world applications.
    摘要 蜂窝网络流量预测是第五代 (5G) 及未来网络优化中的一项关键活动,因为准确的预测对智能网络设计、资源分配和异常缓解至关重要。虽然机器学习 (ML) 是有效预测网络流量的有前途的方法,但将海量数据集中到单一数据中心会带来保密性、隐私和数据传输需求方面的问题。为解决这些挑战,联邦学习 (FL) 作为一种有吸引力的 ML 训练框架,通过并行分布式计算提供高精度预测。然而,这些方法的环境影响经常被忽略,其可持续性因此受到质疑。在这篇论文中,我们研究了联邦学习中精度与能耗之间的权衡,并提出了一个可用于评估 ML 模型可行性的新可持续性指标。随后,我们使用来自西班牙巴塞罗那地区基站 (BS) 站点的真实测量数据,在联邦场景下对最先进的深度学习 (DL) 架构进行了全面评估。我们发现,更大的 ML 模型只能带来微小的性能提升,却在碳足迹方面产生显著的环境影响,这使得它们在实际应用中并不可行。
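
As a concrete reference point for the federated setup, here is a minimal FedAvg sketch with a linear model standing in for the paper's forecasting architectures; the client data, model, and hyperparameters are placeholders, and the energy/sustainability accounting the paper adds is not modelled here.

```python
import numpy as np

def local_update(weights, X, y, lr=0.01, epochs=1):
    """One client's local training: plain linear regression via gradient descent,
    a stand-in for the traffic-forecasting models studied in the paper."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_averaging(clients, rounds=20, dim=8):
    """FedAvg: each round, every client trains on its own traffic data and the
    server averages the models weighted by local sample counts, so raw data
    never leaves the base stations."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in clients:                      # clients = list of (features, targets)
            updates.append(local_update(global_w, X, y))
            sizes.append(len(y))
        global_w = np.average(updates, axis=0, weights=np.array(sizes, dtype=float))
    return global_w

# Toy usage: three "base stations" with synthetic traffic features and targets.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(100, 8)), rng.normal(size=100)) for _ in range(3)]
model = federated_averaging(clients)
```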

Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers

  • paper_url: http://arxiv.org/abs/2309.10639
  • repo_url: None
  • paper_authors: Thomas Chen, Patricia Muñoz Ewald
  • for: 这篇论文的目的是对深度学习(Deep Learning)网络的结构给出几何解释,所研究的网络包含 $L$ 个隐藏层、斜坡(ramp)激活函数、${\mathcal L}^2$ Schatten 类(即希尔伯特-施密特)成本函数,输入与输出空间均为 ${\mathbb R}^Q$($Q\geq1$)。
  • methods: 论文利用作者们此前关于浅层神经网络的研究结果,在 $L\geq Q$ 的情形下显式构造出成本函数全局最小值的一族极小化解,并证明该全局最小值是退化的。在这一设置中,隐藏层通过逐层应用截断映射来"整理"(curate)训练输入,从而最小化训练输入的噪声与信号之比。
  • results: 论文还确定了成本函数的 $2^Q-1$ 个互不相同的退化局部极小值。
    Abstract In this paper, we provide a geometric interpretation of the structure of Deep Learning (DL) networks, characterized by $L$ hidden layers, a ramp activation function, an ${\mathcal L}^2$ Schatten class (or Hilbert-Schmidt) cost function, and input and output spaces ${\mathbb R}^Q$ with equal dimension $Q\geq1$. The hidden layers are defined on spaces ${\mathbb R}^{Q}$, as well. We apply our recent results on shallow neural networks to construct an explicit family of minimizers for the global minimum of the cost function in the case $L\geq Q$, which we show to be degenerate. In the context presented here, the hidden layers of the DL network "curate" the training inputs by recursive application of a truncation map that minimizes the noise to signal ratio of the training inputs. Moreover, we determine a set of $2^Q-1$ distinct degenerate local minima of the cost function.
    摘要 在这篇论文中,我们对深度学习(DL)网络的结构给出了几何解释,所考察的网络具有 $L$ 个隐藏层、斜坡(ramp)激活函数、${\mathcal L}^2$ Schatten 类(即希尔伯特-施密特)成本函数,输入和输出空间均为 ${\mathbb R}^Q$,维度 $Q\geq1$;各隐藏层同样定义在 ${\mathbb R}^{Q}$ 上。我们利用此前对浅层神经网络的研究结果,在 $L\geq Q$ 的情形下构造出成本函数全局最小值的一族显式极小化解,并证明该全局最小值是退化的。在此设置中,DL 网络的隐藏层通过递归应用截断映射来"整理"训练输入,从而最小化训练输入的噪声与信号之比。此外,我们确定了成本函数的 $2^Q-1$ 个互不相同的退化局部极小值。
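
To fix notation, here is a hedged reconstruction of the objects the abstract names (the paper's exact conventions may differ): the ramp (ReLU) activation, an $L$-layer network whose hidden layers act on $\mathbb{R}^Q$, and an $\mathcal{L}^2$ (Hilbert-Schmidt) cost over $N$ training pairs:

$$
\sigma(x)=\max(0,x),\qquad
x^{(\ell)}=\sigma\!\left(W_\ell\, x^{(\ell-1)}+b_\ell\right),\quad \ell=1,\dots,L,
$$

$$
\mathcal{C}[W,b]=\big\|X^{(L)}-Y\big\|_{\mathrm{HS}}^{2}=\sum_{j=1}^{N}\big\|x_j^{(L)}-y_j\big\|_{2}^{2},
$$

where $W_\ell\in\mathbb{R}^{Q\times Q}$, $b_\ell\in\mathbb{R}^{Q}$, and the columns of $X^{(L)},Y\in\mathbb{R}^{Q\times N}$ collect the network outputs and targets over the training set; the squared Hilbert-Schmidt (Frobenius) norm of the error matrix coincides with the summed squared $\ell^2$ errors, which matches the cost class named in the abstract.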

Exploring the Influence of Information Entropy Change in Learning Systems

  • paper_url: http://arxiv.org/abs/2309.10625
  • repo_url: None
  • paper_authors: Xiaowei Yu, Yao Xue, Lu Zhang, Li Wang, Tianming Liu, Dajiang Zhu
  • for: This paper explores the influence of entropy change in deep learning systems by adding noise to the inputs/latent features, with applications in computer vision tasks.
  • methods: The paper uses theoretical analysis and empirical experiments to demonstrate the enhancement gained from positive noise by reducing the task complexity defined by information entropy.
  • results: The paper shows significant performance gains in large image datasets such as ImageNet by proactively injecting positive noise, achieving an unprecedented top 1 accuracy of over 95%.
    Abstract In this work, we explore the influence of entropy change in deep learning systems by adding noise to the inputs/latent features. The applications in this paper focus on deep learning tasks within computer vision, but the proposed theory can be further applied to other fields. Noise is conventionally viewed as a harmful perturbation in various deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers (ViTs), as well as different learning tasks like image classification and transfer learning. However, this paper aims to rethink whether the conventional proposition always holds. We demonstrate that specific noise can boost the performance of various deep architectures under certain conditions. We theoretically prove the enhancement gained from positive noise by reducing the task complexity defined by information entropy and experimentally show the significant performance gain in large image datasets, such as the ImageNet. Herein, we use the information entropy to define the complexity of the task. We categorize the noise into two types, positive noise (PN) and harmful noise (HN), based on whether the noise can help reduce the complexity of the task. Extensive experiments of CNNs and ViTs have shown performance improvements by proactively injecting positive noise, where we achieved an unprecedented top 1 accuracy of over 95% on ImageNet. Both theoretical analysis and empirical evidence have confirmed that the presence of positive noise can benefit the learning process, while the traditionally perceived harmful noise indeed impairs deep learning models. The different roles of noise offer new explanations for deep models on specific tasks and provide a new paradigm for improving model performance. Moreover, it reminds us that we can influence the performance of learning systems via information entropy change.
    摘要 在这项研究中,我们通过向输入/潜在特征添加噪声,探索深度学习系统中信息熵变化的影响。本文的应用集中于计算机视觉中的深度学习任务,但所提出的理论也可推广到其他领域。传统上,噪声被视为各类深度学习架构(如卷积神经网络 CNNs 和视觉 Transformer ViTs)以及图像分类、迁移学习等不同学习任务中的有害扰动。然而,这篇论文旨在重新思考这一传统观点是否总是成立。我们展示了特定的噪声可以在某些条件下提升各种深度架构的性能。我们使用信息熵来定义任务的复杂度,并依据噪声能否降低任务复杂度,将其分为正面噪声 (PN) 和有害噪声 (HN)。在 ImageNet 等大规模图像数据集上针对 CNNs 和 ViTs 的大量实验表明,主动注入正面噪声可以带来性能提升,我们在 ImageNet 上取得了超过 95% 的前所未有的 top-1 准确率。理论分析与实验证据均证实,正面噪声的存在有利于学习过程,而传统上认为有害的噪声确实会损害深度学习模型。噪声的这些不同角色为深度模型在特定任务上的表现提供了新的解释,也为提升模型性能提供了新的范式。此外,它提醒我们可以通过改变信息熵来影响学习系统的性能。
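
A minimal sketch of the mechanism the abstract argues about: inject zero-mean Gaussian noise into input or latent features and track a simple entropy proxy of the model's predictions. Whether a given noise source is "positive" or "harmful" in the paper's sense depends on its effect on the entropy-defined task complexity; the proxy below is an illustrative assumption, not the paper's measure.

```python
import numpy as np

def inject_noise(features, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to a batch of input or latent features."""
    rng = rng or np.random.default_rng()
    return features + rng.normal(scale=sigma, size=features.shape)

def prediction_entropy(probs, eps=1e-12):
    """Mean Shannon entropy H = -sum p log2 p over a batch of class-probability
    vectors; used here only as a rough stand-in for the paper's entropy measure."""
    probs = np.clip(probs, eps, 1.0)
    return float(np.mean(-np.sum(probs * np.log2(probs), axis=1)))

# Toy usage: perturb a batch of latent features before feeding a classifier head.
rng = np.random.default_rng(0)
latents = rng.normal(size=(32, 128))
noisy_latents = inject_noise(latents, sigma=0.1, rng=rng)
```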

Large language models can accurately predict searcher preferences

  • paper_url: http://arxiv.org/abs/2309.10621
  • repo_url: None
  • paper_authors: Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra
  • for: 这个论文目的是提高搜索系统中 Labels 的质量,即用户是否认为搜索结果有用。
  • methods: 这个论文使用大型语言模型(LLM)来生成 Labels,并依据真实用户的细致反馈设计与之一致的提示(prompt),以提高 Labels 的质量。
  • results: 论文表明,使用 Lang Model 可以生成高质量 Labels,并且比第三方标注者更准确,同时也比较cost-effective。此外,这些 Labels 还可以用于训练更好的排名算法。
    Abstract Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels are systematically disagreeing with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops an large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. To measure agreement with real searchers needs high-quality ``gold'' labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
    摘要 相关性标签(relevance labels)用于表明搜索结果对搜索者是否有价值,是评估和优化搜索系统的关键。捕捉用户真实偏好的最佳方式是请他们对哪些结果有用给出细致的反馈,但这种方式无法规模化地产生大量标签。大规模获取相关性标签通常依靠第三方标注者代表用户进行判断,但如果标注者不理解用户需求,就存在数据质量低下的风险。为提高质量,一种标准做法是通过访谈、用户研究和直接反馈来研究真实用户,找出标签与用户系统性不一致的方面,再通过评审指南、培训和监控让标注者了解用户需求。本文提出了另一种提高标签质量的方法:从真实用户那里获取细致的反馈(这本身就是可获得的最高质量的第一方"金标准"数据),并据此开发与之一致的大型语言模型提示。我们介绍了在 Bing 部署语言模型进行大规模相关性标注的经验与观察,并用 TREC 数据加以说明。我们发现大型语言模型可以十分有效,其准确率可与人工标注者相当,并且在挑选最难的查询、最佳运行和最佳分组方面具有相近的能力。对提示进行系统性修改会影响准确性,简单的改写同样会产生影响。衡量与真实搜索者需求的一致性需要高质量的"金标准"标签;在此基础上我们发现,模型生成的标签优于第三方标注者,而成本仅为其一小部分,并且这些标签能让我们训练出明显更好的排名器。
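
A toy sketch of LLM-based relevance labelling in the spirit of the abstract: build a graded judging prompt and parse the model's answer. The prompt wording, scale, and parsing below are illustrative assumptions; the paper's point is precisely that such wording choices (and simple paraphrases) shift agreement with real searchers, so in practice they are tuned against first-party gold labels.

```python
def relevance_prompt(query, passage):
    """Build a simple graded-relevance judging prompt. The wording here is an
    illustrative placeholder, not the prompt studied in the paper."""
    return (
        "You are a search quality rater.\n"
        f"Query: {query}\n"
        f"Result: {passage}\n"
        "On a scale of 0 (irrelevant) to 3 (perfectly answers the query), "
        "how relevant is this result? Answer with a single digit."
    )

def judge(llm, query, passage):
    """Ask a caller-supplied LLM for a label and parse the first digit it emits."""
    reply = llm(relevance_prompt(query, passage))
    for ch in reply:
        if ch.isdigit():
            return int(ch)
    return None  # unparseable answer; in practice these would be re-asked or audited
```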

A Dynamic Linear Bias Incorporation Scheme for Nonnegative Latent Factor Analysis

  • paper_url: http://arxiv.org/abs/2309.10618
  • repo_url: None
  • paper_authors: Yurong Zhong, Zhe Xie, Weiling Li, Xin Luo
  • for: Handle high-dimensional and incomplete (HDI) data in big data-related applications, such as social network services systems, by learning HDI data representation.
  • methods: Propose a dynamic linear bias incorporation (DLBI) scheme to improve the scalability and representation ability of nonnegative latent factor analysis (NLFA) models for HDI data.
  • results: Obtain higher representation accuracy and competitive computational efficiency compared to state-of-the-art models on three HDI datasets from real applications.
    Abstract High-Dimensional and Incomplete (HDI) data is commonly encountered in big data-related applications like social network services systems, which are concerning the limited interactions among numerous nodes. Knowledge acquisition from HDI data is a vital issue in the domain of data science due to their embedded rich patterns like node behaviors, where the fundamental task is to perform HDI data representation learning. Nonnegative Latent Factor Analysis (NLFA) models have proven to possess the superiority to address this issue, where a linear bias incorporation (LBI) scheme is important in present the training overshooting and fluctuation, as well as preventing the model from premature convergence. However, existing LBI schemes are all statistic ones where the linear biases are fixed, which significantly restricts the scalability of the resultant NLFA model and results in loss of representation learning ability to HDI data. Motivated by the above discoveries, this paper innovatively presents the dynamic linear bias incorporation (DLBI) scheme. It firstly extends the linear bias vectors into matrices, and then builds a binary weight matrix to switch the active/inactive states of the linear biases. The weight matrix's each entry switches between the binary states dynamically corresponding to the linear bias value variation, thereby establishing the dynamic linear biases for an NLFA model. Empirical studies on three HDI datasets from real applications demonstrate that the proposed DLBI-based NLFA model obtains higher representation accuracy several than state-of-the-art models do, as well as highly-competitive computational efficiency.
    摘要 高维且不完全(HDI)数据在社交网络服务系统等大数据相关应用中十分常见,其特点是海量节点之间的交互十分有限。由于 HDI 数据中蕴含着节点行为等丰富模式,从中获取知识是数据科学领域的重要问题,而其基础任务是学习 HDI 数据的表示。非负潜在因子分析(NLFA)模型已被证明在解决该问题上具有优势,其中线性偏置引入(LBI)方案对于抑制训练过程中的过冲与振荡、防止模型过早收敛十分重要。然而,现有的 LBI 方案都是静态的,线性偏置固定不变,这显著限制了 NLFA 模型的可扩展性,并削弱其对 HDI 数据的表示学习能力。受上述发现的启发,本文创新性地提出了动态线性偏置引入(DLBI)方案:首先将线性偏置向量扩展为矩阵,然后构建一个二值权重矩阵来切换各线性偏置的激活/非激活状态;权重矩阵中的每个元素会随线性偏置取值的变化而在两种二值状态之间动态切换,从而为 NLFA 模型建立动态线性偏置。在来自真实应用的三个 HDI 数据集上的实验研究表明,所提出的基于 DLBI 的 NLFA 模型在表示精度上明显优于现有最先进模型,同时在计算效率上也具有高度竞争力。
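
A simplified sketch of the model family involved: stochastic-gradient nonnegative latent factor analysis over the observed entries of an HDI matrix, with per-row/column linear biases that can be switched on and off. The toggle rule used here is only a caricature of the paper's learnable binary weight matrix, and all hyperparameters are placeholders.

```python
import numpy as np

def nlfa_with_bias(R, mask, k=8, steps=200, lr=0.005, lam=0.02, rng=None):
    """Nonnegative latent factor analysis on the observed entries of an incomplete
    matrix R (mask == 1 where known), with per-row/column linear biases. The
    `active_*` flags only caricature the dynamic on/off switching of the paper's
    DLBI scheme; they are not its learned binary weight matrix."""
    rng = rng or np.random.default_rng(0)
    n, m = R.shape
    P = np.abs(rng.normal(0.1, 0.05, (n, k)))
    Q = np.abs(rng.normal(0.1, 0.05, (m, k)))
    bu, bi = np.zeros(n), np.zeros(m)
    active_u, active_i = np.ones(n), np.ones(m)        # dynamic on/off states
    mu = R[mask == 1].mean()
    rows, cols = np.nonzero(mask)
    for _ in range(steps):
        for u, i in zip(rows, cols):
            pred = mu + active_u[u] * bu[u] + active_i[i] * bi[i] + P[u] @ Q[i]
            e = R[u, i] - pred
            bu[u] += lr * (e * active_u[u] - lam * bu[u])
            bi[i] += lr * (e * active_i[i] - lam * bi[i])
            # Nonnegativity is kept by projecting the factor updates onto [0, inf).
            P[u], Q[i] = (np.maximum(0, P[u] + lr * (e * Q[i] - lam * P[u])),
                          np.maximum(0, Q[i] + lr * (e * P[u] - lam * Q[i])))
        # Toy "dynamic" rule: deactivate biases whose magnitude has shrunk to ~0.
        active_u = (np.abs(bu) > 1e-3).astype(float)
        active_i = (np.abs(bi) > 1e-3).astype(float)
    return mu, bu, bi, P, Q
```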

Decentralized Online Learning in Task Assignment Games for Mobile Crowdsensing

  • paper_url: http://arxiv.org/abs/2309.10594
  • repo_url: None
  • paper_authors: Bernd Simon, Andrea Ortiz, Walid Saad, Anja Klein
  • for: 这项研究旨在解决移动群智感知 (MCS) 系统中的协调数据收集问题。
  • methods: 研究提出了一种结合匹配理论与在线学习的新型去中心化方法,称为带策略性免费感知的避碰多臂老虎机 (CA-MAB-SFS)。该方法将任务分配问题建模为一个匹配博弈,同时考虑 MCSP 与 MU 各自的目标,并让 MU 在线学习其所需付出的努力。
  • results: 结果显示,CA-MAB-SFS 能同时提高 MCSP 和 MU 的满意度,并将平均任务完成时间至少降低 16%。此外,CA-MAB-SFS 的稳定后悔值(即学习损失)被证明受一个次线性函数约束,其"免费感知"机制还显著加快了 MU 的在线学习过程。
    Abstract The problem of coordinated data collection is studied for a mobile crowdsensing (MCS) system. A mobile crowdsensing platform (MCSP) sequentially publishes sensing tasks to the available mobile units (MUs) that signal their willingness to participate in a task by sending sensing offers back to the MCSP. From the received offers, the MCSP decides the task assignment. A stable task assignment must address two challenges: the MCSP's and MUs' conflicting goals, and the uncertainty about the MUs' required efforts and preferences. To overcome these challenges a novel decentralized approach combining matching theory and online learning, called collision-avoidance multi-armed bandit with strategic free sensing (CA-MAB-SFS), is proposed. The task assignment problem is modeled as a matching game considering the MCSP's and MUs' individual goals while the MUs learn their efforts online. Our innovative "free-sensing" mechanism significantly improves the MU's learning process while reducing collisions during task allocation. The stable regret of CA-MAB-SFS, i.e., the loss of learning, is analytically shown to be bounded by a sublinear function, ensuring the convergence to a stable optimal solution. Simulation results show that CA-MAB-SFS increases the MUs' and the MCSP's satisfaction compared to state-of-the-art methods while reducing the average task completion time by at least 16%.
    摘要 本文研究移动群智感知 (MCS) 系统中的协调数据收集问题。移动群智感知平台 (MCSP) 依次向可用的移动单元 (MU) 发布感知任务,MU 通过向 MCSP 发送感知报价来表明参与意愿,MCSP 再根据收到的报价决定任务分配。稳定的任务分配必须应对两个挑战:MCSP 与 MU 的目标冲突,以及 MU 所需努力和偏好的不确定性。为此,我们提出了一种结合匹配理论与在线学习的新型去中心化方法,称为带策略性免费感知的避碰多臂老虎机 (CA-MAB-SFS)。任务分配问题被建模为一个匹配博弈,同时考虑 MCSP 和 MU 的个体目标,并由 MU 在线学习其努力。我们创新的"免费感知"机制显著改善了 MU 的学习过程,同时减少了任务分配中的碰撞。分析表明,CA-MAB-SFS 的稳定后悔值(即学习损失)受一个次线性函数约束,从而保证收敛到稳定的最优解。仿真结果表明,与现有最先进方法相比,CA-MAB-SFS 提升了 MU 和 MCSP 的满意度,并将平均任务完成时间至少降低 16%。
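
For intuition about the learning component, here is a plain epsilon-greedy bandit for a single mobile unit learning which task pays off; the collision avoidance, strategic free sensing, and the surrounding matching game that define CA-MAB-SFS are deliberately omitted, and the reward model is a made-up placeholder.

```python
import numpy as np

def epsilon_greedy_mab(reward_fn, n_arms, horizon=1000, eps=0.1, rng=None):
    """Plain epsilon-greedy bandit for one mobile unit choosing which task (arm)
    to bid on; `reward_fn(arm)` returns a noisy net reward (payment minus effort)."""
    rng = rng or np.random.default_rng(0)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    total = 0.0
    for _ in range(horizon):
        arm = rng.integers(n_arms) if rng.random() < eps else int(np.argmax(means))
        r = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total += r
    return means, total

# Toy usage: 5 tasks whose true net rewards the unit must learn online.
true_reward = np.array([0.2, 0.5, 0.35, 0.1, 0.4])
rng = np.random.default_rng(1)
means, total = epsilon_greedy_mab(lambda a: true_reward[a] + rng.normal(0, 0.1), 5)
```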

PDRL: Multi-Agent based Reinforcement Learning for Predictive Monitoring

  • paper_url: http://arxiv.org/abs/2309.10576
  • repo_url: None
  • paper_authors: Thanveer Shaik, Xiaohui Tao, Lin Li, Haoran Xie, U R Acharya, Raj Gururajan, Xujuan Zhou
  • for: 这项研究旨在提出一个新颖且通用的预测性深度强化学习(PDRL)系统,用于时间序列预测环境中的监测。
  • methods: 该系统在时间序列预测环境中部署多个深度Q网络(DQN)智能体,依据明确定义的奖励策略监控预测出的未来环境状态,使智能体在最大化奖励的同时学习已有知识。
  • results: 在评估过程中,三个DRL代理人能够顺利学习相应的模式,并逐次获得奖励。该系统在时间序列预测中实现了状态预测的最佳性能。
    Abstract Reinforcement learning has been increasingly applied in monitoring applications because of its ability to learn from previous experiences and can make adaptive decisions. However, existing machine learning-based health monitoring applications are mostly supervised learning algorithms, trained on labels and they cannot make adaptive decisions in an uncertain complex environment. This study proposes a novel and generic system, predictive deep reinforcement learning (PDRL) with multiple RL agents in a time series forecasting environment. The proposed generic framework accommodates virtual Deep Q Network (DQN) agents to monitor predicted future states of a complex environment with a well-defined reward policy so that the agent learns existing knowledge while maximizing their rewards. In the evaluation process of the proposed framework, three DRL agents were deployed to monitor a subject's future heart rate, respiration, and temperature predicted using a BiLSTM model. With each iteration, the three agents were able to learn the associated patterns and their cumulative rewards gradually increased. It outperformed the baseline models for all three monitoring agents. The proposed PDRL framework is able to achieve state-of-the-art performance in the time series forecasting process. The proposed DRL agents and deep learning model in the PDRL framework are customized to implement the transfer learning in other forecasting applications like traffic and weather and monitor their states. The PDRL framework is able to learn the future states of the traffic and weather forecasting and the cumulative rewards are gradually increasing over each episode.
    摘要 强化学习在监测应用中得到了广泛应用,因为它可以从前一次的经验中学习并做出适应性的决策。然而,现有的机器学习基于的健康监测应用多为指导学习算法,它们不能在不确定的复杂环境中做出适应性的决策。本研究提出了一种新的和通用的系统——预测深度强化学习(PDRL),该系统通过多个RL代理在时间序列预测环境中使用多个DQN代理来监测预测的未来状况,以便代理学习现有的知识而寻求最大化奖励。在评估PDRL系统的过程中,三个DRL代理被部署到监测一个人的未来心率、呼吸和体温预测结果。在每次迭代中,三个代理能够学习相关的模式,其总奖励逐渐增长。与基线模型相比,PDRL系统在时间序列预测过程中实现了状态的极佳性能。PDRL系统可以在其他预测应用中,如交通和天气预测,实现传输学习,并监测其状态。在每个 episoden 中,PDRL系统能够学习未来的交通和天气预测结果,并逐渐增长其总奖励。
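
A tabular stand-in for the monitoring agents described above: states are discretized forecasted heart-rate values, actions are to raise or withhold an alert, and the reward encourages alerts exactly when the forecast leaves a normal range. The paper uses DQN agents over BiLSTM forecasts with a richer reward policy; everything below is an illustrative simplification.

```python
import numpy as np

def q_learning_monitor(forecasts, normal_range=(60, 100), episodes=50,
                       alpha=0.1, gamma=0.9, eps=0.1, rng=None):
    """Tabular Q-learning over discretized forecasted heart-rate values.
    Actions: 0 = no alert, 1 = alert. Reward: +1 when the action matches whether
    the forecast is outside `normal_range`, else -1 (a simplified reward policy)."""
    rng = rng or np.random.default_rng(0)
    bins = np.arange(40, 181, 10)                      # 40-180 bpm in 10-bpm states
    Q = np.zeros((len(bins) + 1, 2))
    lo, hi = normal_range
    for _ in range(episodes):
        for t, value in enumerate(forecasts):
            s = int(np.digitize(value, bins))
            a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
            abnormal = value < lo or value > hi
            r = 1.0 if (a == 1) == abnormal else -1.0
            nxt = forecasts[t + 1] if t + 1 < len(forecasts) else value
            s_next = int(np.digitize(nxt, bins))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q

# Toy usage with a synthetic sequence standing in for BiLSTM heart-rate forecasts.
rng = np.random.default_rng(2)
hr_forecast = 80 + 15 * np.sin(np.linspace(0, 6, 120)) + rng.normal(0, 5, 120)
Q = q_learning_monitor(hr_forecast, rng=rng)
```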

A multimodal deep learning architecture for smoking detection with a small data approach

  • paper_url: http://arxiv.org/abs/2309.10561
  • repo_url: None
  • paper_authors: Robert Lakatos, Peter Pollner, Andras Hajdu, Tamas Joo
  • for: 探索使用人工智能检测隐藏烟草广告的可能性,以提高媒体内容的不偏不倚和公平性。
  • methods: 提出一种基于深度学习、生成方法和人类干扰的整合文本和图像处理模型,可以在文本和图像格式下检测吸烟场景,即使有限的训练数据。
  • results: 模型可以达到 74% 的图像准确率和 98% 的文本准确率,并且集成了人类强化机制以支持专家干预。
    Abstract Introduction: Covert tobacco advertisements often raise regulatory measures. This paper presents that artificial intelligence, particularly deep learning, has great potential for detecting hidden advertising and allows unbiased, reproducible, and fair quantification of tobacco-related media content. Methods: We propose an integrated text and image processing model based on deep learning, generative methods, and human reinforcement, which can detect smoking cases in both textual and visual formats, even with little available training data. Results: Our model can achieve 74\% accuracy for images and 98\% for text. Furthermore, our system integrates the possibility of expert intervention in the form of human reinforcement. Conclusions: Using the pre-trained multimodal, image, and text processing models available through deep learning makes it possible to detect smoking in different media even with few training data.
    摘要 引言:隐蔽的烟草广告常常引发监管措施。本文指出,人工智能、特别是深度学习,在检测隐藏广告方面具有巨大潜力,并能够对烟草相关媒体内容进行无偏、可重复且公正的量化评估。方法:我们提出一种基于深度学习、生成方法和人类强化的文本与图像一体化处理模型,即使可用训练数据很少,也能在文本和图像两种形式中检测吸烟场景。结果:我们的模型在图像上可达到 74% 的准确率,在文本上可达到 98% 的准确率;此外,系统还集成了以人类强化形式进行专家干预的能力。结论:借助深度学习提供的预训练多模态、图像和文本处理模型,即使训练数据很少,也可以在不同媒体中检测吸烟内容。

A Neighbourhood-Aware Differential Privacy Mechanism for Static Word Embeddings

  • paper_url: http://arxiv.org/abs/2309.10551
  • repo_url: None
  • paper_authors: Danushka Bollegala, Shuichi Otake, Tomoya Machide, Ken-ichi Kawarabayashi
  • for: 保护个人隐私(differential privacy)
  • methods: 使用邻域相关的差分隐私机制(Neighbourhood-Aware Differential Privacy,NADP),根据word embedding空间中 Word 的邻域构建图,并在不同邻域中应用不同水平的高斯噪声,以保证指定的隐私水平。
  • results: 在多个下游任务中,NADP 机制比 laplacian、gaussian 和 mahalanobis 等先前提出的隐私机制表现更好,同时保证更高的隐私水平。
    Abstract We propose a Neighbourhood-Aware Differential Privacy (NADP) mechanism considering the neighbourhood of a word in a pretrained static word embedding space to determine the minimal amount of noise required to guarantee a specified privacy level. We first construct a nearest neighbour graph over the words using their embeddings, and factorise it into a set of connected components (i.e. neighbourhoods). We then separately apply different levels of Gaussian noise to the words in each neighbourhood, determined by the set of words in that neighbourhood. Experiments show that our proposed NADP mechanism consistently outperforms multiple previously proposed DP mechanisms such as Laplacian, Gaussian, and Mahalanobis in multiple downstream tasks, while guaranteeing higher levels of privacy.
    摘要 我们提出了一种邻域感知差分隐私(NADP)机制,利用预训练静态词嵌入空间中各词的邻域关系,确定在保证指定隐私水平的前提下所需的最小噪声量。我们首先基于词嵌入构建最近邻图,并将其分解为若干连通分量(即邻域);随后根据各邻域所包含的词,对其中的词分别施加不同强度的高斯噪声。实验表明,所提出的 NADP 机制在多个下游任务中持续优于 Laplacian、Gaussian 和 Mahalanobis 等多种先前提出的差分隐私机制,同时保证更高的隐私水平。
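
An illustrative approximation of the idea (not the paper's calibrated mechanism): scale the Gaussian noise added to each word's embedding by the radius of its local neighbourhood, so that words in dense neighbourhoods receive smaller perturbations. The actual NADP mechanism works on connected components of a nearest-neighbour graph and comes with formal privacy guarantees that this sketch does not provide.

```python
import numpy as np

def nadp_like_noise(embeddings, k=10, base_sigma=0.1, rng=None):
    """Add Gaussian noise to static word embeddings with a per-word scale tied to
    the radius of the word's k-nearest-neighbour ball. Illustrative only; the
    paper's NADP derives the noise level from the neighbourhood composition and
    guarantees a specified privacy level."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(embeddings, dtype=float)
    # Pairwise squared distances (fine for a small vocabulary; use a kNN index at scale).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    radii = np.sqrt(np.sort(d2, axis=1)[:, k - 1])      # radius of each word's kNN ball
    sigma = base_sigma * radii / radii.mean()            # neighbourhood-aware noise scale
    return X + rng.normal(size=X.shape) * sigma[:, None]
```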

Towards Generative Modeling of Urban Flow through Knowledge-enhanced Denoising Diffusion

  • paper_url: http://arxiv.org/abs/2309.10547
  • repo_url: https://github.com/tsinghua-fib-lab/kstdiff-urban-flow-generation
  • paper_authors: Zhilun Zhou, Jingtao Ding, Yu Liu, Depeng Jin, Yong Li
  • for: 本研究旨在生成城市流动数据,尤其是在数据缺乏或新规划区域的情况下。
  • methods: 本研究使用了Diffusion Model和知识增强的spatio-temporal diffusion模型(KSTDiff)来生成城市流动数据。在KSTDiff模型中,我们首先构建了一个城市知识图(UKG),以模拟城市环境和区域之间的关系。然后,我们设计了一个学习式的流量估计器,以便准确地生成不同区域的流量。此外,我们还提出了一种知识增强的降噪网络,以捕捉城市流动的空间时间关系以及城市环境的影响。
  • results: 对四个实际数据集进行了广泛的实验,并证明了我们的模型在城市流动生成方面的优越性。此外,我们还进行了更深入的研究,证明了生成的城市流动数据的实用性和我们模型的长期流动预测和城市流动预测能力。
    Abstract Although generative AI has been successful in many areas, its ability to model geospatial data is still underexplored. Urban flow, a typical kind of geospatial data, is critical for a wide range of urban applications. Existing studies mostly focus on predictive modeling of urban flow that predicts the future flow based on historical flow data, which may be unavailable in data-sparse areas or newly planned regions. Some other studies aim to predict OD flow among regions but they fail to model dynamic changes of urban flow over time. In this work, we study a new problem of urban flow generation that generates dynamic urban flow for regions without historical flow data. To capture the effect of multiple factors on urban flow, such as region features and urban environment, we employ diffusion model to generate urban flow for regions under different conditions. We first construct an urban knowledge graph (UKG) to model the urban environment and relationships between regions, based on which we design a knowledge-enhanced spatio-temporal diffusion model (KSTDiff) to generate urban flow for each region. Specifically, to accurately generate urban flow for regions with different flow volumes, we design a novel diffusion process guided by a volume estimator, which is learnable and customized for each region. Moreover, we propose a knowledge-enhanced denoising network to capture the spatio-temporal dependencies of urban flow as well as the impact of urban environment in the denoising process. Extensive experiments on four real-world datasets validate the superiority of our model over state-of-the-art baselines in urban flow generation. Further in-depth studies demonstrate the utility of generated urban flow data and the ability of our model for long-term flow generation and urban flow prediction. Our code is released at: https://github.com/tsinghua-fib-lab/KSTDiff-Urban-flow-generation.
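
For readers unfamiliar with the diffusion backbone, the standard DDPM forward process and training target look as follows; the urban-knowledge-graph conditioning, region-specific volume estimator, and knowledge-enhanced denoising network that make up KSTDiff are outside the scope of this sketch.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative alpha-bar used by DDPM-style models."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    A denoising network (in the paper, conditioned on region knowledge from the
    urban knowledge graph) would be trained to predict eps from (x_t, t, condition)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

# Toy usage: a day of hourly flow for one region, noised at a random timestep.
rng = np.random.default_rng(0)
flow = np.abs(rng.normal(100, 30, size=24))   # synthetic stand-in for real urban flow
_, alpha_bar = make_schedule()
x_t, eps_target = forward_noise(flow, t=int(rng.integers(1000)), alpha_bar=alpha_bar, rng=rng)
# Training would minimise || eps_theta(x_t, t, condition) - eps_target ||^2.
```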

Mean Absolute Directional Loss as a New Loss Function for Machine Learning Problems in Algorithmic Investment Strategies

  • paper_url: http://arxiv.org/abs/2309.10546
  • repo_url: None
  • paper_authors: Jakub Michańków, Paweł Sakowski, Robert Ślepaczuk
  • for: 本文研究了在优化用于金融时间序列预测的机器学习模型时如何选择恰当的损失函数,以构建更好的算法投资策略(AIS)。
  • methods: 作者提出了平均绝对方向损失(MADL)函数,解决了经典预测误差函数难以从预测中提取信息、生成有效买卖信号的问题。
  • results: 基于两类不同资产(加密货币 Bitcoin 与大宗商品 Crude Oil)的数据,作者表明新的损失函数能为 LSTM 模型选出更好的超参数,并在样本外数据上获得风险调整后收益更高的投资策略。
    Abstract This paper investigates the issue of an adequate loss function in the optimization of machine learning models used in the forecasting of financial time series for the purpose of algorithmic investment strategies (AIS) construction. We propose the Mean Absolute Directional Loss (MADL) function, solving important problems of classical forecast error functions in extracting information from forecasts to create efficient buy/sell signals in algorithmic investment strategies. Finally, based on the data from two different asset classes (cryptocurrencies: Bitcoin and commodities: Crude Oil), we show that the new loss function enables us to select better hyperparameters for the LSTM model and obtain more efficient investment strategies, with regard to risk-adjusted return metrics on the out-of-sample data.
    摘要 本文研究了在优化用于金融时间序列预测的机器学习模型、进而构建算法投资策略(AIS)时,如何选择恰当的损失函数的问题。我们提出了平均绝对方向损失(MADL)函数,解决了经典预测误差函数在从预测中提取信息、生成算法投资策略所需的有效买卖信号方面的重要缺陷。最后,基于两类不同资产(加密货币:Bitcoin;大宗商品:Crude Oil)的数据,我们表明新的损失函数使我们能够为 LSTM 模型选择更好的超参数,并在样本外数据上获得风险调整后收益指标更优的投资策略。
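
The abstract does not reproduce the loss formula, so the sketch below shows one directional loss consistent with its description: each observation contributes the negative magnitude of the realised return when the predicted direction is right, and the positive magnitude when it is wrong, so minimising it rewards forecasts that translate into profitable buy/sell signals. Treat this exact form as an assumption and consult the paper for the authors' definition of MADL.

```python
import numpy as np

def directional_loss(returns_true, returns_pred):
    """A directional loss in the spirit of MADL: -|R_t| when predicted and realised
    returns share a sign (a profitable call), +|R_t| when they disagree. The exact
    MADL formula is defined in the paper; this form is an assumption consistent
    with its description."""
    r, r_hat = np.asarray(returns_true), np.asarray(returns_pred)
    return float(np.mean(-np.sign(r * r_hat) * np.abs(r)))

# Toy check: a forecast that nails every direction gets a negative (good) loss.
r = np.array([0.01, -0.02, 0.015, -0.005])
print(directional_loss(r, r))    # -0.0125
print(directional_loss(r, -r))   # +0.0125
```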

Model Leeching: An Extraction Attack Targeting LLMs

  • paper_url: http://arxiv.org/abs/2309.10544
  • repo_url: None
  • paper_authors: Lewis Birch, William Hackett, Stefan Trawicki, Neeraj Suri, Peter Garraghan
  • for: 本研究旨在提取大语言模型(LLM)中的任务特有知识,并将其转换为具有减少参数的模型。
  • methods: 本研究使用的方法是Model Leeching,可以快速和效率地从目标LLM中提取任务特有的知识。
  • results: 研究表明,使用 Model Leeching 可以从 ChatGPT-3.5-Turbo 中提取出达到 73% 精确匹配(EM)相似度的任务能力,SQuAD 的 EM 和 F1 分数分别为 75% 和 87%,而 API 成本仅为 50 美元。此外,研究还证明了利用被提取模型进行攻击准备的可行性:将由此生成的对抗攻击迁移到 ChatGPT-3.5-Turbo 时,攻击成功率提高了 11%。
    Abstract Model Leeching is a novel extraction attack targeting Large Language Models (LLMs), capable of distilling task-specific knowledge from a target LLM into a reduced parameter model. We demonstrate the effectiveness of our attack by extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match (EM) similarity, and SQuAD EM and F1 accuracy scores of 75% and 87%, respectively for only $50 in API cost. We further demonstrate the feasibility of adversarial attack transferability from an extracted model extracted via Model Leeching to perform ML attack staging against a target LLM, resulting in an 11% increase to attack success rate when applied to ChatGPT-3.5-Turbo.
    摘要 模型偷窥(Model Leeching)是一种新的抽取攻击,可以从目标大语言模型(LLMs)中提取任务特定的知识,并将其转换为具有减少参数的模型。我们通过对ChatGPT-3.5-Turbo进行抽取,实现了73%的精确匹配(EM)相似性,以及SQuAD EM和F1分数的75%和87%分别。这些成绩只需要50美元的API成本。我们还证明了攻击者可以通过我们提取的模型来对目标LLM进行ML攻击,从而提高了攻击成功率11%。

OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

  • paper_url: http://arxiv.org/abs/2309.10539
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Yang Gao, Ji Ma, Ivan Korotkov, Keith Hall, Dana Alon, Don Metzler
  • for: 本研究旨在开发和评估多语言科学文献相似度测量模型,以便帮助多语言研究人员更有效地找到和探索相关文献。
  • methods: 我们使用Open-access Multilingual Scientific Documents(OpenMSD)数据集,该数据集包含74M篇论文和778M个引用对,并采用科学专业语言模型的预训练和不同策略来Derive “相关” 文献对进行细化。
  • results: 我们的最佳模型在比较 STRONG 基线模型时显著超越,提高了7-16%的平均精度。
    Abstract We develop and evaluate multilingual scientific documents similarity measurement models in this work. Such models can be used to find related works in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we pretrain science-specialized language models, and explore different strategies to derive "related" paper pairs to fine-tune the models, including using a mixture of citation, co-citation, and bibliographic-coupling pairs. To further improve the models' performance for non-English papers, we explore the use of generative language models to enrich the non-English papers with English summaries. This allows us to leverage the models' English capabilities to create better representations for non-English papers. Our best model significantly outperforms strong baselines by 7-16% (in mean average precision).
    摘要 我们在这项工作中开发和评估多语言科学文献相似度评估模型。这些模型可以帮助多语言研究人员更高效地找到和探索不同语言的相关文献。我们提出了首个多语言科学文献数据集 Open-access Multilingual Scientific Documents (OpenMSD),该数据集包含 7400 万篇文献和 7.78 亿个引用对,涵盖 103 种语言。基于 OpenMSD,我们预训练了面向科学文献的语言模型,并研究了多种生成"相关"文献对以微调模型的策略,包括混合使用引用、共被引和文献耦合(bibliographic coupling)对。为进一步提升非英文文献上的表现,我们探索了使用生成式语言模型为非英文文献补充英文摘要,从而利用模型的英文能力为非英文文献构建更好的表示。我们的最佳模型在平均精度上显著超越强基线 7-16%。

A Cognitively-Inspired Neural Architecture for Visual Abstract Reasoning Using Contrastive Perceptual and Conceptual Processing

  • paper_url: http://arxiv.org/abs/2309.10532
  • repo_url: None
  • paper_authors: Yuan Yang, Deepayan Sanyal, James Ainooson, Joel Michelson, Effat Farhana, Maithilee Kunda
  • for: 解决视觉抽象逻辑任务,即人类认知中的抽象逻辑过程。
  • methods: 提出了一种基于人类认知原理的新神经网络 architecture,即对比性感知-概念处理网络(CPCNet),它模型了人类视觉抽象逻辑的迭代、自我对比、学习过程。
  • results: 在采用类似 Raven's Progressive Matrices 智力测验风格的矩阵推理问题数据集上,CPCNet 在使用最弱归纳偏置的同时,取得了高于所有已发表模型的准确率。此外,文章还指出了原始 RAVEN 数据集中一个此前未被注意到的显著类别不平衡,并提出了在抽象概念方面更加均衡的新 RAVEN 变体 – AB-RAVEN。
    Abstract We introduce a new neural architecture for solving visual abstract reasoning tasks inspired by human cognition, specifically by observations that human abstract reasoning often interleaves perceptual and conceptual processing as part of a flexible, iterative, and dynamic cognitive process. Inspired by this principle, our architecture models visual abstract reasoning as an iterative, self-contrasting learning process that pursues consistency between perceptual and conceptual processing of visual stimuli. We explain how this new Contrastive Perceptual-Conceptual Network (CPCNet) works using matrix reasoning problems in the style of the well-known Raven's Progressive Matrices intelligence test. Experiments on the machine learning dataset RAVEN show that CPCNet achieves higher accuracy than all previously published models while also using the weakest inductive bias. We also point out a substantial and previously unremarked class imbalance in the original RAVEN dataset, and we propose a new variant of RAVEN -- AB-RAVEN -- that is more balanced in terms of abstract concepts.
    摘要 我们引入了一种受人类认知启发、用于解决视觉抽象推理任务的新型神经网络架构;其灵感来自这样一个观察:人类的抽象推理往往在一个灵活、迭代且动态的认知过程中交替进行感知加工与概念加工。受此原则启发,我们的架构将视觉抽象推理建模为一种迭代的、自我对比的学习过程,追求对视觉刺激的感知加工与概念加工之间的一致性。我们借助类似著名的瑞文渐进矩阵(Raven's Progressive Matrices)智力测验的矩阵推理问题,说明这一新的对比感知-概念网络(CPCNet)如何工作。在 RAVEN 机器学习数据集上的实验表明,CPCNet 在使用最弱归纳偏置的同时取得了高于所有已发表模型的准确率。我们还指出了原始 RAVEN 数据集中一个此前未被注意到的显著类别不平衡,并提出了在抽象概念方面更加均衡的新变体 AB-RAVEN。

Visible and NIR Image Fusion Algorithm Based on Information Complementarity

  • paper_url: http://arxiv.org/abs/2309.10522
  • repo_url: None
  • paper_authors: Zhuo Li, Bo Li
  • for: 本研究旨在通过融合可见光与近红外(NIR)波段图像,利用其光谱特性提升图像质量。
  • methods: 本研究从物理信号层面提出了一种互补融合模型,利用两层权重引导滤波器与引导滤波器分别获取纹理层和边缘层,并使用扩展 DoG 滤波器生成初始的可见-NIR 互补性权重图。
  • results: 实验结果表明,与现有最先进方法相比,所提算法能够充分利用可见与 NIR 图像的光谱特性和信息互补性,并在保持自然感的同时避免颜色失真。
    Abstract Visible and near-infrared(NIR) band sensors provide images that capture complementary spectral radiations from a scene. And the fusion of the visible and NIR image aims at utilizing their spectrum properties to enhance image quality. However, currently visible and NIR fusion algorithms cannot well take advantage of spectrum properties, as well as lack information complementarity, which results in color distortion and artifacts. Therefore, this paper designs a complementary fusion model from the level of physical signals. First, in order to distinguish between noise and useful information, we use two layers of the weight-guided filter and guided filter to obtain texture and edge layers, respectively. Second, to generate the initial visible-NIR complementarity weight map, the difference maps of visible and NIR are filtered by the extend-DoG filter. After that, the significant region of NIR night-time compensation guides the initial complementarity weight map by the arctanI function. Finally, the fusion images can be generated by the complementarity weight maps of visible and NIR images, respectively. The experimental results demonstrate that the proposed algorithm can not only well take advantage of the spectrum properties and the information complementarity, but also avoid color unnatural while maintaining naturalness, which outperforms the state-of-the-art.
    摘要 可见光与近红外(NIR)波段传感器可以捕捉场景互补的光谱辐射信息,将可见光与 NIR 图像融合旨在利用二者的光谱特性提升图像质量。然而,现有的可见-NIR 融合算法难以充分利用这些光谱特性,且缺乏信息互补性,导致颜色失真和伪影。为此,本文从物理信号层面设计了一种互补融合模型。首先,为区分噪声与有用信息,利用两层权重引导滤波器和引导滤波器分别获取纹理层和边缘层;其次,使用扩展 DoG 滤波器对可见光与 NIR 的差分图进行滤波,生成初始的可见-NIR 互补性权重图;随后,利用 arctanI 函数,由 NIR 夜间补偿的显著区域对初始互补性权重图进行引导;最后,分别依据可见光与 NIR 图像的互补性权重图生成融合图像。实验结果表明,所提算法不仅能充分利用光谱特性与信息互补性,还能在保持自然感的同时避免颜色失真,性能优于当前最先进方法。
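
A heavily simplified sketch of the pipeline's shape: a base/detail split per band, a difference-of-Gaussians on the VIS-NIR difference map to build a complementarity weight, and a weighted recombination. Plain Gaussian smoothing stands in for the paper's (weight-)guided filters, and the soft weighting replaces its extended-DoG and arctan-based night-time compensation, so this illustrates the structure rather than the proposed algorithm.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fuse_vis_nir(vis, nir, sigma_base=5.0, s1=1.0, s2=2.0):
    """Grossly simplified visible/NIR fusion for float images in [0, 1]."""
    # Base / detail split for each band (the paper extracts texture and edge layers).
    base_v, base_n = gaussian_filter(vis, sigma_base), gaussian_filter(nir, sigma_base)
    detail_v, detail_n = vis - base_v, nir - base_n
    # Complementarity weight from a difference-of-Gaussians of the VIS-NIR difference map.
    diff = nir - vis
    dog = gaussian_filter(diff, s1) - gaussian_filter(diff, s2)
    w = 0.5 + 0.5 * np.tanh(5.0 * dog)     # soft weight in (0, 1), favouring NIR where it adds detail
    # Fuse: weighted bases plus both detail layers.
    return np.clip((1 - w) * base_v + w * base_n + detail_v + detail_n, 0.0, 1.0)
```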

Partially-Specified Causal Simulations

  • paper_url: http://arxiv.org/abs/2309.10514
  • repo_url: None
  • paper_authors: A. Zamanian, L. Mareis, N. Ahmidi
  • For: The paper is written to emphasize the importance of proper simulation design in causal inference research, and to introduce a new simulation framework called PARCS that addresses this issue.
  • Methods: The paper uses graphical causal models and a wide range of adjustable parameters to synthesize data, and allows users to identify and specify the subset of related parameters and randomize the remaining ones to generate a range of complying data-generating processes for their causal method.
  • Results: The paper reproduces and extends the simulation studies of two well-known causal discovery and missing data analysis papers, demonstrating the necessity of a proper simulation design and the benefits of using PARCS for simulation. The results show that PARCS can generate a more comprehensive and inclusive empirical investigation for causal claims.
    Abstract Simulation studies play a key role in the validation of causal inference methods. The simulation results are reliable only if the study is designed according to the promised operational conditions of the method-in-test. Still, many causal inference literature tend to design over-restricted or misspecified studies. In this paper, we elaborate on the problem of improper simulation design for causal methods and compile a list of desiderata for an effective simulation framework. We then introduce partially-randomized causal simulation (PARCS), a simulation framework that meets those desiderata. PARCS synthesizes data based on graphical causal models and a wide range of adjustable parameters. There is a legible mapping from usual causal assumptions to the parameters, thus, users can identify and specify the subset of related parameters and randomize the remaining ones to generate a range of complying data-generating processes for their causal method. The result is a more comprehensive and inclusive empirical investigation for causal claims. Using PARCS, we reproduce and extend the simulation studies of two well-known causal discovery and missing data analysis papers to emphasize the necessity of a proper simulation design. Our results show that those papers would have improved and extended the findings, had they used PARCS for simulation. The framework is implemented as a Python package, too. By discussing the comprehensiveness and transparency of PARCS, we encourage causal inference researchers to utilize it as a standard tool for future works.
    摘要 模拟研究在 causal inference 方法的验证中扮演着关键角色。模拟结果的可靠性取决于研究按照测试方法的操作条件进行设计。然而,许多 causal inference 文献中的模拟设计往往过于紧张或不准确。在这篇文章中,我们讨论了模拟设计不当的问题,并编辑了一份有效模拟框架的需求列表。然后,我们介绍了 partially-randomized causal simulation(PARCS)模拟框架,该框架基于图形 causal 模型和广泛的可调参数。在这个框架中,用户可以明确地将相关参数与 causal 假设之间的映射,并随机化剩下的参数来生成一个包含多种合法的数据生成过程。这使得用户可以对 causal laims 进行更广泛和包容的实验 исследование。使用 PARCS,我们重新生成和扩展了两篇已有的 causal discovery 和 missing data analysis 文献中的模拟研究,以强调模拟设计的重要性。我们的结果表明,如果使用 PARCS,这些文献中的结果将更加完整和多元。PARCS 已经实现为 Python 包。通过讨论 PARCS 的全面性和透明度,我们鼓励 causal inference 研究人员在未来的工作中使用这种标准工具。
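
To see what "synthesizing data from a graphical causal model with adjustable parameters" means in the smallest possible case, here is a three-variable structural causal model with a confounder; PARCS generalises this idea by exposing the graph and its functional/noise parameters as knobs that users either pin down or randomise. The code is illustrative and not the package's API.

```python
import numpy as np

def simulate_scm(n=10_000, effect=1.5, confounding=0.8, noise=1.0, rng=None):
    """Generate data from a tiny, fully specified structural causal model
    (Z -> X, Z -> Y, X -> Y). Each keyword argument is one adjustable knob of
    the data-generating process."""
    rng = rng or np.random.default_rng(0)
    Z = rng.normal(size=n)                                  # confounder
    X = confounding * Z + rng.normal(scale=noise, size=n)   # treatment
    Y = effect * X + confounding * Z + rng.normal(scale=noise, size=n)
    return Z, X, Y

# Naive regression of Y on X is biased by Z; adjusting for Z recovers `effect`.
Z, X, Y = simulate_scm()
naive = np.polyfit(X, Y, 1)[0]
adjusted = np.linalg.lstsq(np.column_stack([X, Z, np.ones_like(X)]), Y, rcond=None)[0][0]
```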

A Configurable Library for Generating and Manipulating Maze Datasets

  • paper_url: http://arxiv.org/abs/2309.10498
  • repo_url: https://github.com/understanding-search/maze-dataset
  • paper_authors: Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, Samy Wu Fung
  • for: investigate how machine learning models respond to distributional shifts using maze-solving tasks as a testbed
  • methods: present a comprehensive library for generating, processing, and visualizing maze-solving datasets with extensive control over generation algorithms and parameters
  • results: support for multiple output formats and tools for visualizing and converting between them, ensuring versatility and adaptability in research applications
    Abstract Understanding how machine learning models respond to distributional shifts is a key research challenge. Mazes serve as an excellent testbed due to varied generation algorithms offering a nuanced platform to simulate both subtle and pronounced distributional shifts. To enable systematic investigations of model behavior on out-of-distribution data, we present $\texttt{maze-dataset}$, a comprehensive library for generating, processing, and visualizing datasets consisting of maze-solving tasks. With this library, researchers can easily create datasets, having extensive control over the generation algorithm used, the parameters fed to the algorithm of choice, and the filters that generated mazes must satisfy. Furthermore, it supports multiple output formats, including rasterized and text-based, catering to convolutional neural networks and autoregressive transformer models. These formats, along with tools for visualizing and converting between them, ensure versatility and adaptability in research applications.
    摘要 理解机器学习模型如何响应分布偏移是一项关键的研究挑战。迷宫因其多样的生成算法而成为出色的试验平台,可以细致地模拟从细微到显著的各种分布偏移。为了支持对模型在分布外数据上行为的系统性研究,我们提出了 $\texttt{maze-dataset}$,一个用于生成、处理和可视化迷宫求解任务数据集的综合性库。借助该库,研究人员可以轻松创建数据集,并对所用的生成算法、传入所选算法的参数以及生成迷宫必须满足的筛选条件进行全面控制。此外,该库支持多种输出格式,包括栅格化格式和文本格式,分别适配卷积神经网络与自回归 Transformer 模型;这些格式连同相应的可视化与格式转换工具,确保了其在研究应用中的多样性与适应性。
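
As a flavour of the kind of generation algorithm such a library can vary, here is a plain depth-first-search ("recursive backtracker") maze generator; it is not the maze-dataset package's actual API, and the library's value lies in controlling many such algorithms, their parameters, filters, and output formats.

```python
import random

def generate_maze(n, seed=0):
    """Depth-first-search maze generation on an n x n grid: returns, for every
    cell, the set of neighbouring cells it is connected to (no walls between)."""
    rng = random.Random(seed)
    connections = {(r, c): set() for r in range(n) for c in range(n)}
    visited, stack = {(0, 0)}, [(0, 0)]
    while stack:
        r, c = stack[-1]
        neighbours = [(r + dr, c + dc)
                      for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if 0 <= r + dr < n and 0 <= c + dc < n
                      and (r + dr, c + dc) not in visited]
        if not neighbours:
            stack.pop()        # dead end: backtrack
            continue
        nxt = rng.choice(neighbours)
        connections[(r, c)].add(nxt)
        connections[nxt].add((r, c))
        visited.add(nxt)
        stack.append(nxt)
    return connections

maze = generate_maze(8)   # an 8 x 8 maze; different seeds/algorithms shift the distribution
```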

An Evaluation of GPT-4 on the ETHICS Dataset

  • paper_url: http://arxiv.org/abs/2309.10492
  • repo_url: None
  • paper_authors: Sergey Rodionov, Zarathustra Amadeus Goertzel, Ben Goertzel
  • for: 这个研究是为了评估GPT-4模型在ETHICS数据集上的表现。
  • methods: 这个研究使用了GPT-4模型来处理ETHICS数据集中的道德判断。
  • results: GPT-4的表现比前一代模型要好,表明AI工程学习并不是道德伦理的硬件问题。
    Abstract This report summarizes a short study of the performance of GPT-4 on the ETHICS dataset. The ETHICS dataset consists of five sub-datasets covering different fields of ethics: Justice, Deontology, Virtue Ethics, Utilitarianism, and Commonsense Ethics. The moral judgments were curated so as to have a high degree of agreement with the aim of representing shared human values rather than moral dilemmas. GPT-4's performance is much better than that of previous models and suggests that learning to work with common human values is not the hard problem for AI ethics.
    摘要 这份报告总结了对 GPT-4 在 ETHICS 数据集上表现的一项简短研究。ETHICS 数据集包括五个子数据集,涵盖不同领域的伦理:正义、义务论、美德伦理、功利主义和常识伦理。这些道德判断经过精心筛选,具有高度一致性,旨在反映人类共同的价值观,而非道德困境。相比之前的模型,GPT-4 的表现显著提高,表明学习与人类共同价值观保持一致并不是 AI 伦理中的难题。

Fully automated landmarking and facial segmentation on 3D photographs

  • paper_url: http://arxiv.org/abs/2309.10472
  • repo_url: https://github.com/rumc3dlab/3dlandmarkdetection
  • paper_authors: Bo Berends, Freek Bielevelt, Ruud Schreurs, Shankeeth Vinayahalingam, Thomas Maal, Guido de Jong
  • for: 这项研究的目的是开发并评估一种基于深度学习的全自动头影测量标志点标注方法,以提高头影测量分析的精度和效率。
  • methods: 该方法由两个级联的 DiffusionNet 模型和额外的面部分割算法组成,并以单一观察者在 2897 张三维面部照片上手工标注的 10 个标志点作为训练与评估数据。
  • results: 研究发现,该自动化方法能够实现精确且一致的标志点标注,其精度与人工标注的观察者间差异相当,同时可减少人工标注所需的时间和误差。
    Abstract Three-dimensional facial stereophotogrammetry provides a detailed representation of craniofacial soft tissue without the use of ionizing radiation. While manual annotation of landmarks serves as the current gold standard for cephalometric analysis, it is a time-consuming process and is prone to human error. The aim in this study was to develop and evaluate an automated cephalometric annotation method using a deep learning-based approach. Ten landmarks were manually annotated on 2897 3D facial photographs by a single observer. The automated landmarking workflow involved two successive DiffusionNet models and additional algorithms for facial segmentation. The dataset was randomly divided into a training and test dataset. The training dataset was used to train the deep learning networks, whereas the test dataset was used to evaluate the performance of the automated workflow. The precision of the workflow was evaluated by calculating the Euclidean distances between the automated and manual landmarks and compared to the intra-observer and inter-observer variability of manual annotation and the semi-automated landmarking method. The workflow was successful in 98.6% of all test cases. The deep learning-based landmarking method achieved precise and consistent landmark annotation. The mean precision of 1.69 (+/-1.15) mm was comparable to the inter-observer variability (1.31 +/-0.91 mm) of manual annotation. The Euclidean distance between the automated and manual landmarks was within 2 mm in 69%. Automated landmark annotation on 3D photographs was achieved with the DiffusionNet-based approach. The proposed method allows quantitative analysis of large datasets and may be used in diagnosis, follow-up, and virtual surgical planning.
    摘要 三维面部塑型摄影技术可以提供细腻的脑颅面软组织图像,不需要使用辐射。现有的手动标注方法为头颈相机分析的现金标准,但是它是一项时间consuming的过程,容易出现人工错误。本研究的目标是开发和评估一种基于深度学习的自动标注方法。研究使用2897个3D面部照片,每个照片由一名观察者手动标注10个标记。自动标注工作流程包括两个DiffusionNet模型和其他的面部分 segmentation 算法。数据集随机分成训练和测试集。训练集用于训练深度学习网络,而测试集用于评估自动工作流程的性能。自动工作流程的精度由计算自动和手动标注之间的欧几丁度距离来评估。结果显示,自动工作流程成功的情况为98.6%。深度学习基本的标注方法实现了精确和一致的标记。手动标注和自动标注之间的平均差距为1.69(+/-1.15)毫米,与人工变化(1.31(+/-0.91)毫米)相比,表明自动标注的精度和一致性。自动和手动标注之间的欧几丁度距离在2毫米内的情况为69%。这种基于DiffusionNet的方法可以在3D照片上自动标注标记,并且允许大量数据的量化分析,可能用于诊断、跟踪和虚拟手术规划。

Exploring the Dark Side of AI: Advanced Phishing Attack Design and Deployment Using ChatGPT

  • paper_url: http://arxiv.org/abs/2309.10463
  • repo_url: None
  • paper_authors: Nils Begou, Jeremy Vinoy, Andrzej Duda, Maciej Korczynski
  • for: This paper examines whether ChatGPT can be used to develop advanced phishing attacks and automate their large-scale deployment.
  • methods: The authors prompt ChatGPT to generate the components of a phishing attack: (1) cloning a targeted website, (2) integrating credential-stealing code, (3) obfuscating the code, (4) automating website deployment on a hosting provider, (5) registering a phishing domain name, and (6) integrating the website with a reverse proxy.
  • results: An initial assessment of the automatically generated phishing kits highlights their rapid generation and deployment and the close resemblance of the resulting pages to the target websites; the findings underscore the risks of AI misuse in phishing and the need for stronger countermeasures within AI systems.
    Abstract This paper explores the possibility of using ChatGPT to develop advanced phishing attacks and automate their large-scale deployment. We make ChatGPT generate the following parts of a phishing attack: i) cloning a targeted website, ii) integrating code for stealing credentials, iii) obfuscating code, iv) automating website deployment on a hosting provider, v) registering a phishing domain name, and vi) integrating the website with a reverse proxy. The initial assessment of the automatically generated phishing kits highlights their rapid generation and deployment process as well as the close resemblance of the resulting pages to the target website. More broadly, we demonstrate that recent advances in AI underscore the potential risks of its misuse in phishing attacks, which can lead to their increased prevalence and severity. This highlights the necessity for enhanced countermeasures within AI systems.

Human-AI Interactions and Societal Pitfalls

  • paper_url: http://arxiv.org/abs/2309.10448
  • repo_url: None
  • paper_authors: Francisco Castro, Jian Gao, Sébastien Martin
  • for: This study examines how, when working with generative AI, users may see productivity gains even though the AI-generated content does not exactly match their preferences.
  • methods: The authors introduce a Bayesian framework in which heterogeneous users choose how much information to share with the AI, facing a trade-off between output fidelity and communication cost.
  • results: The interplay between these individual-level decisions and AI training can create societal challenges: outputs may become more homogenized, especially when the AI is trained on AI-generated content, and any AI bias may become societal bias; improving human-AI interactions to enable personalized outputs without sacrificing productivity is proposed as a remedy.
    Abstract When working with generative artificial intelligence (AI), users may see productivity gains, but the AI-generated content may not match their preferences exactly. To study this effect, we introduce a Bayesian framework in which heterogeneous users choose how much information to share with the AI, facing a trade-off between output fidelity and communication cost. We show that the interplay between these individual-level decisions and AI training may lead to societal challenges. Outputs may become more homogenized, especially when the AI is trained on AI-generated content. And any AI bias may become societal bias. A solution to the homogenization and bias issues is to improve human-AI interactions, enabling personalized outputs without sacrificing productivity.

Toward Unified Controllable Text Generation via Regular Expression Instruction

  • paper_url: http://arxiv.org/abs/2309.10447
  • repo_url: https://github.com/mrzhengxin/ctg-regex-instruction
  • paper_authors: Xin Zheng, Hongyu Lin, Xianpei Han, Le Sun
  • for: This work aims to provide a unified instruction mechanism for controllable text generation that can quickly adapt to different constraint types and their combinations.
  • methods: The proposed Regular Expression Instruction (REI) uses regular-expression-style instructions to uniformly model all popular fine-grained constraints, i.e., lexical, positional, and length constraints, as well as their complex combinations; it only requires fine-tuning medium-scale language models or few-shot, in-context learning with large ones (a small constraint-checking sketch follows this entry).
  • results: Experiments show that this simple approach achieves high success rates and adaptability to various constraint combinations while remaining competitive on automatic metrics and outperforming most previous baselines.
    Abstract Controllable text generation is a fundamental aspect of natural language generation, with numerous methods proposed for different constraint types. However, these approaches often require significant architectural or decoding modifications, making them challenging to apply to additional constraints or resolve different constraint combinations. To address this, our paper introduces Regular Expression Instruction (REI), which utilizes an instruction-based mechanism to fully exploit regular expressions' advantages to uniformly model diverse constraints. Specifically, our REI supports all popular fine-grained controllable generation constraints, i.e., lexical, positional, and length, as well as their complex combinations, via regular expression-style instructions. Our method only requires fine-tuning on medium-scale language models or few-shot, in-context learning on large language models, and requires no further adjustment when applied to various constraint combinations. Experiments demonstrate that our straightforward approach yields high success rates and adaptability to various constraints while maintaining competitiveness in automatic metrics and outperforming most previous baselines.
    摘要 natural language generation中的可控制文本生成是一个基本问题,各种方法被提出来解决不同的约束类型。然而,这些方法经常需要大量的架构或解码修改,使其应用于其他约束或解决不同的约束组合变得困难。为解决这个问题,我们的论文引入了常见表达式指令(REI),该机制利用了常见表达式的优点,以通用的方式模型多样的约束。具体来说,我们的REI支持所有流行的细化可控制生成约束,包括lexical、位置和长度约束,以及它们的复杂组合,通过常见表达式样式的指令。我们的方法仅需要中型语言模型的微调或几极少的在线学习,并且无需进行进一步的调整,无论应用于不同的约束组合。实验表明,我们的简单方法可以实现高的成功率和适应性,同时保持自动指标的竞争力和大多数之前的基elines的性能。
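
Because REI phrases constraints in regular-expression style, candidate outputs can be checked mechanically. The snippet below is a sketch under invented assumptions: the instruction wording and the concrete constraint (the word "solar" plus a 50-300 character length) are not from the paper; only the idea of regex-based verification is.

```python
import re

# Constraint: the keyword "solar" must appear, and the output must be 50-300 characters.
constraint = re.compile(r"(?s)(?=.*\bsolar\b).{50,300}")

instruction = (
    "Generate one sentence matching this pattern: "
    "it must contain the word 'solar' and be between 50 and 300 characters long."
)
print("instruction:", instruction)

candidates = [
    "Go solar now.",                                                     # too short
    "Falling panel prices have made solar power one of the cheapest "
    "sources of new electricity in many countries around the world.",   # satisfies both constraints
]

for text in candidates:
    ok = constraint.fullmatch(text) is not None
    print(f"satisfied={ok}: {text[:60]}...")
```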

Exploring Self-Reinforcement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10444
  • repo_url: https://github.com/strong-ai-lab/explanation-generation
  • paper_authors: Qiming Bao, Juho Leinonen, Alex Yuxuan Peng, Wanjun Zhong, Tim Pistotti, Alice Huang, Paul Denny, Michael Witbrock, Jiamou Liu
  • for: This work aims to help students generate high-quality learning resources by automatically generating and improving explanations for learnersourced multiple-choice questions.
  • methods: A self-reinforcement framework based on large language models is proposed, with three modules: generating student-aligned explanations, evaluating their quality, and iteratively refining explanations whose evaluation score falls below a defined threshold.
  • results: In a human expert evaluation, GPT-4 showed more creativity in generating explanations than the other models, and its explanations were ranked higher than both those of the other models and the original student-written explanations.
    Abstract Learnersourcing involves students generating and sharing learning resources with their peers. When learnersourcing multiple-choice questions, creating explanations for the generated questions is a crucial step as it facilitates a deeper understanding of the related concepts. However, it is often difficult for students to craft effective explanations due to limited subject understanding and a tendency to merely restate the question stem, distractors, and correct answer. To help scaffold this task, in this work we propose a self-reinforcement large-language-model framework, with the goal of generating and evaluating explanations automatically. Comprising three modules, the framework generates student-aligned explanations, evaluates these explanations to ensure their quality and iteratively enhances the explanations. If an explanation's evaluation score falls below a defined threshold, the framework iteratively refines and reassesses the explanation. Importantly, our framework emulates the manner in which students compose explanations at the relevant grade level. For evaluation, we had a human subject-matter expert compare the explanations generated by students with the explanations created by the open-source large language model Vicuna-13B, a version of Vicuna-13B that had been fine-tuned using our method, and by GPT-4. We observed that, when compared to other large language models, GPT-4 exhibited a higher level of creativity in generating explanations. We also found that explanations generated by GPT-4 were ranked higher by the human expert than both those created by the other models and the original student-created explanations. Our findings represent a significant advancement in enriching the learnersourcing experience for students and enhancing the capabilities of large language models in educational applications.

Rethinking Imitation-based Planner for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.10443
  • repo_url: https://github.com/jchengai/planTF
  • paper_authors: Jie Cheng, Yingbing Chen, Xiaodong Mei, Bowen Yang, Bo Li, Ming Liu
  • for: This paper leverages the newly released nuPlan benchmark, a large-scale real-world dataset with a standardized closed-loop protocol, to enable fair comparison of imitation-based planner designs.
  • methods: The study examines two fundamental yet underexplored aspects of imitation-based planners: the essential features for ego planning and effective data augmentation techniques for reducing compounding errors; it also highlights an imitation gap overlooked by current learning systems.
  • results: The resulting baseline, PlanTF, shows that a well-designed, purely imitation-based planner can be highly competitive with state-of-the-art methods that involve hand-crafted rules, while generalizing better in long-tail cases.
    Abstract In recent years, imitation-based driving planners have reported considerable success. However, due to the absence of a standardized benchmark, the effectiveness of various designs remains unclear. The newly released nuPlan addresses this issue by offering a large-scale real-world dataset and a standardized closed-loop benchmark for equitable comparisons. Utilizing this platform, we conduct a comprehensive study on two fundamental yet underexplored aspects of imitation-based planners: the essential features for ego planning and the effective data augmentation techniques to reduce compounding errors. Furthermore, we highlight an imitation gap that has been overlooked by current learning systems. Finally, integrating our findings, we propose a strong baseline model-PlanTF. Our results demonstrate that a well-designed, purely imitation-based planner can achieve highly competitive performance compared to state-of-the-art methods involving hand-crafted rules and exhibit superior generalization capabilities in long-tail cases. Our models and benchmarks are publicly available. Project website https://jchengai.github.io/planTF.

Multi-Object Graph Affordance Network: Enabling Goal-Oriented Planning through Compound Object Affordances

  • paper_url: http://arxiv.org/abs/2309.10426
  • repo_url: None
  • paper_authors: Tuba Girgin, Emre Ugur
  • for: This work studies the affordances of compound objects composed of an arbitrary number of complex-shaped parts, a gap in robot learning research that has focused on single or paired objects.
  • methods: The proposed Multi-Object Graph Affordance Network (MOGAN) models compound object affordances and predicts the effect of placing a new object on top of an existing compound; search-based planning then finds the sequence of stack actions for tasks such as building towers of specific heights or properties.
  • results: The system correctly modeled the affordances of very complex compounds, including stacked spheres and cups, poles, and rings enclosing poles, and outperformed a baseline model in both simulated and real-world environments.
    Abstract Learning object affordances is an effective tool in the field of robot learning. While the data-driven models delve into the exploration of affordances of single or paired objects, there is a notable gap in the investigation of affordances of compound objects that are composed of an arbitrary number of objects with complex shapes. In this study, we propose Multi-Object Graph Affordance Network (MOGAN) that models compound object affordances and predicts the effect of placing new objects on top of the existing compound. Given different tasks, such as building towers of specific heights or properties, we used a search based planning to find the sequence of stack actions with the objects of suitable affordances. We showed that our system was able to correctly model the affordances of very complex compound objects that include stacked spheres and cups, poles, and rings that enclose the poles. We demonstrated the applicability of our system in both simulated and real-world environments, comparing our systems with a baseline model to highlight its advantages.
    摘要 学习对象可行性是机器学习领域的有效工具。而数据驱动模型探索单个或对应的对象可行性的探索,但是对于包含多个对象的复杂形状的复合物体可行性的探索却存在显著的缺口。在本研究中,我们提议了多对象图像可行性网络(MOGAN),该模型可以预测将新对象放置在现有复合物体上的效果。我们使用搜索基本计划来找到适合的堆作业序列,以实现不同任务,如建立特定高度或性能的塔楼。我们证明了我们的系统可以正确地模型复杂的复合物体,包括堆积球和杯子、柱子和环形结构,并在实验室和真实环境中进行了比较,与基eline模型进行了对比,以 highlight its advantages。

Functional requirements to mitigate the Risk of Harm to Patients from Artificial Intelligence in Healthcare

  • paper_url: http://arxiv.org/abs/2309.10424
  • repo_url: None
  • paper_authors: Juan M. García-Gómez, Vicent Blanes-Selva, José Carlos de Bartolomé Cenzano, Jaime Cebolla-Cornejo, Ascensión Doñate-Martínez
  • for: This paper addresses seven main risks of Artificial Intelligence (AI) in medicine and healthcare identified in a report to the Members of the European Parliament, and proposes technical requirements to mitigate them.
  • methods: Starting from the seven risks, the authors derive fourteen functional requirements, including an AI passport, user management, regulation checks, data quality assessment, clinician double checks, continuous performance evaluation, audit trails, bias checks, explainable AI, encryption with field-tested libraries, and semantic interoperability.
  • results: Implementing these requirements is argued to reduce the risks of medical AI and to help ensure continuous good performance of AI systems for the benefit of patients, in compliance with the future EU regulatory framework.
    Abstract The Directorate General for Parliamentary Research Services of the European Parliament has prepared a report to the Members of the European Parliament where they enumerate seven main risks of Artificial Intelligence (AI) in medicine and healthcare: patient harm due to AI errors, misuse of medical AI tools, bias in AI and the perpetuation of existing inequities, lack of transparency, privacy and security issues, gaps in accountability, and obstacles in implementation. In this study, we propose fourteen functional requirements that AI systems may implement to reduce the risks associated with their medical purpose: AI passport, User management, Regulation check, Academic use only disclaimer, data quality assessment, Clinicians double check, Continuous performance evaluation, Audit trail, Continuous usability test, Review of retrospective/simulated cases, Bias check, eXplainable AI, Encryption and use of field-tested libraries, and Semantic interoperability. Our intention here is to provide specific high-level specifications of technical solutions to ensure continuous good performance and use of AI systems to benefit patients in compliance with the future EU regulatory framework.
    摘要 欧洲议会Directorate General for Parliamentary Research Services已经准备了一份关于人工智能(AI)在医疗领域的报告,并列出了七个主要的AI风险:对patient的伤害 due to AI错误,违规使用医疗AI工具,AI中存在偏见和现有不平等的持续传递,缺乏透明度、隐私和安全问题,责任缺口,以及实施困难。在这项研究中,我们提出了十四个功能需求,以减少AI系统的医疗用途中的风险:AI护照,用户管理,法规检查,仅学术用途说明,数据质量评估,临床医生双重检查,不断性能评估,审计记录,不断用户测试,审查退化/模拟案例,偏见检查,可解释AI,加密,并使用已经测试的库。我们的目的是提供特定的高级技术解决方案,以确保AI系统的持续良好表现,并且在欧盟未来的法规框架下使用AI系统为病人带来好处。

Learning from Teaching Assistants to Program with Subgoals: Exploring the Potential for AI Teaching Assistants

  • paper_url: http://arxiv.org/abs/2309.10419
  • repo_url: None
  • paper_authors: Changyoon Lee, Junho Myung, Jieun Han, Jiho Jin, Alice Oh
  • for: This study investigates the practicality of using generative AI as teaching assistants (TAs) in introductory programming education, by examining how novice learners interact with and perceive AI versus human TAs.
  • methods: A between-subject study was conducted with 20 novice programming learners, who solved programming tasks by producing subgoals and subsolutions under the guidance of either an AI or a human TA.
  • results: Learners solved tasks faster with comparable scores when guided by the AI TA, and their perception of the AI TA matched that of human TAs in terms of speed, comprehensiveness, helpfulness, difficulty, and satisfaction; the paper also derives guidelines for designing and using generative AI as TAs from an analysis of the chat logs.
    Abstract With recent advances in generative AI, conversational models like ChatGPT have become feasible candidates for TAs. We investigate the practicality of using generative AI as TAs in introductory programming education by examining novice learners' interaction with TAs in a subgoal learning environment. To compare the learners' interaction and perception of the AI and human TAs, we conducted a between-subject study with 20 novice programming learners. Learners solve programming tasks by producing subgoals and subsolutions with the guidance of a TA. Our study shows that learners can solve tasks faster with comparable scores with AI TAs. Learners' perception of the AI TA is on par with that of human TAs in terms of speed and comprehensiveness of the replies and helpfulness, difficulty, and satisfaction of the conversation. Finally, we suggest guidelines to better design and utilize generative AI as TAs in programming education from the result of our chat log analysis.
    摘要 Recent advances in 生成AI 使得对话模型如ChatGPT可能成为教学助手。我们 investigate 使用生成AI 作为初级编程教育中的教学助手,通过评估新手学者与 AI 和人类教学助手之间的互动来评估实用性。我们通过对 20 名初级编程学生进行比较研究,发现学生可以更快地解决编程任务,并且得分相似。学生对 AI 教学助手的评估与人类教学助手的评估相似,包括快速回答、全面性、 helpfulness、difficulty 和满意度。最后,我们提出了更好地设计和使用生成AI 作为编程教育教学助手的指南,基于我们的对话记录分析结果。

Unsupervised Learning via Network-Aware Embeddings

  • paper_url: http://arxiv.org/abs/2309.10408
  • repo_url: None
  • paper_authors: Anne Sophie Riis Damstrup, Sofie Tosti Madsen, Michele Coscia
  • for: This paper addresses a blind spot of current clustering methods: they cannot take an explicit map of the complex interdependencies between the dimensions of analysis, such as a social network connecting the observations, as input.
  • methods: The authors create network-aware embeddings by estimating the network distance between numeric node attributes via a generalized Euclidean distance; unlike existing methods, they cluster the node attributes rather than the nodes of the network (an illustrative sketch follows this entry).
  • results: Experiments show that these network-aware embeddings consistently benefit the clustering task, that the method scales to large networks, and that it yields actionable insights in fields such as marketing, economics, and political science; data and code are fully open source.
    Abstract Data clustering, the task of grouping observations according to their similarity, is a key component of unsupervised learning -- with real world applications in diverse fields such as biology, medicine, and social science. Often in these fields the data comes with complex interdependencies between the dimensions of analysis, for instance the various characteristics and opinions people can have live on a complex social network. Current clustering methods are ill-suited to tackle this complexity: deep learning can approximate these dependencies, but not take their explicit map as the input of the analysis. In this paper, we aim at fixing this blind spot in the unsupervised learning literature. We can create network-aware embeddings by estimating the network distance between numeric node attributes via the generalized Euclidean distance. Differently from all methods in the literature that we know of, we do not cluster the nodes of the network, but rather its node attributes. In our experiments we show that having these network embeddings is always beneficial for the learning task; that our method scales to large networks; and that we can actually provide actionable insights in applications in a variety of fields such as marketing, economics, and political science. Our method is fully open source and data and code are available to reproduce all results in the paper.
    摘要 “数据聚合,将观察值按其相似性分组,是无监督学习中的关键组成部分,在生物、医学和社会科学等领域有广泛的应用。在这些领域中,数据往往具有复杂的相互关系,例如人们在社交网络上的多种特征和意见。现有的聚合方法无法处理这些复杂性,深度学习可以近似这些关系,但是无法直接使其作为分析的输入。在这篇论文中,我们想要解决这个潜在的盲点在无监督学习文献中。我们可以创建网络意识 embedding,通过一般化欧几何距离来估算网络中节点属性之间的距离。与现有文献中所有方法不同,我们不是将网络节点聚合,而是其节点属性。在我们的实验中,我们发现在应用于多个领域,如市场学、经济学和政治科学等,有助于提供实用的洞察。我们的方法是完全开源的,数据和代码都可以在论文中提供,以便重现所有结果。”
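
A rough illustration of the idea: blend attribute distances with graph distances and cluster the resulting precomputed distance matrix. The exact generalized Euclidean distance from the paper is not reproduced here; the combination rule below and the karate-club toy graph are stand-ins.

```python
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Toy social network and one numeric attribute per node (e.g., an opinion score).
G = nx.karate_club_graph()
n = G.number_of_nodes()
attr = rng.normal(size=n)

# Shortest-path (hop) distances between all node pairs.
sp = dict(nx.all_pairs_shortest_path_length(G))

# "Network-aware" pairwise distance: attribute difference inflated by graph distance.
# This is only a stand-in for the paper's generalized Euclidean distance.
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = abs(attr[i] - attr[j]) * (1.0 + sp[i][j])
        D[i, j] = D[j, i] = d

# Cluster the attributes using the precomputed network-aware distances.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```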

Exploiting Causality Signals in Medical Images: A Pilot Study with Empirical Results

  • paper_url: http://arxiv.org/abs/2309.10399
  • repo_url: None
  • paper_authors: Gianluca Carloni, Sara Colantonio
  • for: This paper targets automatic classification of medical images by modeling weak causal signals in the scene, i.e., how the presence of a feature in one part of an image affects the appearance of another feature elsewhere.
  • methods: The method has two components: a convolutional neural network backbone and a causality-factors extractor module that computes weights to enhance each feature map according to its causal influence in the scene; two external signals can modify the module's behaviour, yielding different variants of the method.
  • results: Quantitative experiments, qualitative assessment, and ablation studies on a public prostate MRI dataset for prostate cancer diagnosis show improved classification performance and more robust predictions focused on relevant parts of the image, which is essential for reliable diagnosis and treatment planning.
    Abstract We present a new method for automatically classifying medical images that uses weak causal signals in the scene to model how the presence of a feature in one part of the image affects the appearance of another feature in a different part of the image. Our method consists of two components: a convolutional neural network backbone and a causality-factors extractor module. The latter computes weights for the feature maps to enhance each feature map according to its causal influence in the image's scene. We can modify the functioning of the causality module by using two external signals, thus obtaining different variants of our method. We evaluate our method on a public dataset of prostate MRI images for prostate cancer diagnosis, using quantitative experiments, qualitative assessment, and ablation studies. Our results show that our method improves classification performance and produces more robust predictions, focusing on relevant parts of the image. That is especially important in medical imaging, where accurate and reliable classifications are essential for effective diagnosis and treatment planning.
    摘要 我们提出了一种新的自动化医学图像分类方法,该方法利用图像场景中的弱 causal 信号来模型图像中不同部分之间的特征之间的相互影响。我们的方法包括两个组件:一个 convolutional neural network 背bone 和一个 causality-factors 提取模块。后者计算图像feature map 中的强度,以强调每个特征图像中的相互影响。我们可以通过使用两个外部信号来修改 causality 模块的功能,从而获得不同的方法变体。我们对公共的 проstate MRI 图像集进行评估,使用量化实验、质量评估和剪辑研究来评估我们的方法。我们的结果显示,我们的方法可以提高分类性能,生成更加稳定的预测结果,特别是在医学成像中,准确和可靠的分类是诊断和治疗规划的关键。

Adaptive questionnaires for facilitating patient data entry in clinical decision support systems: Methods and application to STOPP/START v2

  • paper_url: http://arxiv.org/abs/2309.10398
  • repo_url: None
  • paper_authors: Jean-Baptiste Lamy, Abdelmalek Mouazer, Karima Sedki, Sophie Dubois, Hector Falcoff
  • for: This paper aims to simplify patient data entry so that clinicians can more easily use clinical decision support systems.
  • methods: The proposed solution is an adaptive questionnaire that shows or hides questions dynamically during user interaction; the system's clinical rules are translated into display rules that determine which items to show and in which order of priority (a small illustrative sketch follows this entry).
  • results: Applied to a decision support system implementing STOPP/START v2, a guideline for managing polypharmacy, the adaptive questionnaire reduced the number of clinical conditions displayed by about two thirds, and clinicians in focus group sessions found it "pretty easy to use"; the approach could later be applied to other guidelines and adapted for data entry by patients.
    Abstract Clinical decision support systems are software tools that help clinicians to make medical decisions. However, their acceptance by clinicians is usually rather low. A known problem is that they often require clinicians to manually enter lots of patient data, which is long and tedious. Existing solutions, such as the automatic data extraction from electronic health record, are not fully satisfying, because of low data quality and availability. In practice, many systems still include long questionnaire for data entry. In this paper, we propose an original solution to simplify patient data entry, using an adaptive questionnaire, i.e. a questionnaire that evolves during user interaction, showing or hiding questions dynamically. Considering a rule-based decision support systems, we designed methods for translating the system's clinical rules into display rules that determine the items to show in the questionnaire, and methods for determining the optimal order of priority among the items in the questionnaire. We applied this approach to a decision support system implementing STOPP/START v2, a guideline for managing polypharmacy. We show that it permits reducing by about two thirds the number of clinical conditions displayed in the questionnaire. Presented to clinicians during focus group sessions, the adaptive questionnaire was found "pretty easy to use". In the future, this approach could be applied to other guidelines, and adapted for data entry by patients.
    摘要 临床决策支持系统是软件工具,帮助临床医生做出医疗决策。然而,它们的接受度通常很低。一个知道的问题是,它们经常需要临床医生手动输入大量患者数据,这是长时间的和繁琐的。现有的解决方案,如自动提取电子医疗记录中的数据,并不充分满足,因为数据质量和可用性不够。在实践中,许多系统仍然包含长问卷。在这篇论文中,我们提出了一种新的解决方案,以简化患者数据输入。我们使用了适应问卷,即在用户互动中动态显示或隐藏问题的问卷。考虑到规则驱动的决策支持系统,我们设计了将系统的临床规则翻译成显示规则,以确定问卷中显示的项目的顺序和优先级。我们应用了这种方法于一个管理多剂药物的决策支持系统,我们发现可以将问卷中显示的临床条件减少到大约两 third。在临床医生Focus组会议中展示了适应问卷,他们认为它很容易使用。未来,这种方法可能会应用于其他指南,并适应用于患者的数据输入。
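
A minimal sketch of the display-rule idea, with an invented clinical rule and item names (not from STOPP/START v2): the questionnaire only reveals a follow-up item once earlier answers make the rule potentially applicable.

```python
# Clinical rule (toy): "alert if the patient takes an NSAID AND has a history of peptic ulcer".
# Derived display rules: always ask about the NSAID; ask about the ulcer
# only if the NSAID answer keeps the rule potentially applicable.

QUESTIONS = {
    "takes_nsaid": "Does the patient currently take an NSAID?",
    "peptic_ulcer": "Does the patient have a history of peptic ulcer?",
}

def visible_questions(answers):
    """Return the ids of questions to display, given the answers entered so far."""
    shown = ["takes_nsaid"]                      # always displayed first
    if answers.get("takes_nsaid") is True:       # second item only becomes relevant now
        shown.append("peptic_ulcer")
    return [q for q in shown if q not in answers]

def rule_fires(answers):
    return answers.get("takes_nsaid") and answers.get("peptic_ulcer")

answers = {}
print(visible_questions(answers))        # ['takes_nsaid']
answers["takes_nsaid"] = True
print(visible_questions(answers))        # ['peptic_ulcer']
answers["peptic_ulcer"] = True
print("alert:", rule_fires(answers))     # alert: True
```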

Graph Contrastive Learning Meets Graph Meta Learning: A Unified Method for Few-shot Node Tasks

  • paper_url: http://arxiv.org/abs/2309.10376
  • repo_url: https://github.com/haoliu-cola/cola
  • paper_authors: Hao Liu, Jiarui Feng, Lecheng Kong, Dacheng Tao, Yixin Chen, Muhan Zhang
  • for: This work proposes a new paradigm for few-shot node classification that avoids the overfitting risks of existing meta-learning approaches.
  • methods: Building on graph neural networks (GNNs), the paper combines graph contrastive learning with meta learning: graph augmentations are used to identify semantically similar nodes, so meta-tasks can be constructed from all nodes without label information (an illustrative sketch follows this entry).
  • results: Extensive experiments show that the resulting method, COLA, achieves new state-of-the-art results on all few-shot node classification tasks, and ablations validate the necessity of each component.
    Abstract Graph Neural Networks (GNNs) have become popular in Graph Representation Learning (GRL). One fundamental application is few-shot node classification. Most existing methods follow the meta learning paradigm, showing the ability of fast generalization to few-shot tasks. However, recent works indicate that graph contrastive learning combined with fine-tuning can significantly outperform meta learning methods. Despite the empirical success, there is limited understanding of the reasons behind it. In our study, we first identify two crucial advantages of contrastive learning compared to meta learning, including (1) the comprehensive utilization of graph nodes and (2) the power of graph augmentations. To integrate the strength of both contrastive learning and meta learning on the few-shot node classification tasks, we introduce a new paradigm: Contrastive Few-Shot Node Classification (COLA). Specifically, COLA employs graph augmentations to identify semantically similar nodes, which enables the construction of meta-tasks without the need for label information. Therefore, COLA can utilize all nodes to construct meta-tasks, further reducing the risk of overfitting. Through extensive experiments, we validate the essentiality of each component in our design and demonstrate that COLA achieves new state-of-the-art on all tasks.
    摘要 图神经网络(GNNs)已成为图表示学习(GRL)中的流行方法之一,其中一个基本应用是少样本节点分类。大多数现有方法遵循元学习范式,展现出对少样本任务的快速泛化能力。然而,最近的研究表明,图对比学习结合微调可以显著超越元学习方法。尽管有这些实证上的成功,其背后的原因仍缺乏深入理解。在本研究中,我们首先指出了对比学习相对于元学习的两大优势:(1)对图中所有节点的充分利用;(2)图增强的作用。为了在少样本节点分类任务中结合对比学习与元学习的优势,我们提出了一种新范式:对比式少样本节点分类(COLA)。具体来说,COLA利用图增强来识别语义相似的节点,从而无需标签信息即可构建元任务,因此可以利用所有节点来构建元任务,进一步降低过拟合风险。通过大量实验,我们验证了设计中各组件的必要性,并证明COLA在所有任务上达到了新的最先进水平。
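
A rough, NumPy-only illustration of the label-free meta-task construction: augment node embeddings, then treat the nodes most similar to an anchor across the augmented views as pseudo-positives. The embeddings and the feature-masking augmentation here are toy stand-ins for GNN outputs and the paper's graph augmentations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy node features standing in for GNN embeddings of the graph.
n_nodes, dim = 100, 16
features = rng.normal(size=(n_nodes, dim))

def augment(x, drop_prob=0.2):
    """Feature-masking augmentation: randomly zero out feature entries."""
    mask = rng.random(x.shape) > drop_prob
    return x * mask

view_a = augment(features)
view_b = augment(features)

def cosine(a, b):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-9)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-9)
    return a @ b.T

# Build a label-free "meta-task": for an anchor node, treat its most similar
# nodes across views as pseudo-positives (a support set), the rest as negatives.
sim = cosine(view_a, view_b)
anchor = 0
order = np.argsort(-sim[anchor])
support = [j for j in order if j != anchor][:5]   # 5 most similar nodes, anchor excluded
print("anchor:", anchor, "pseudo-positive support set:", support)
```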

Generative AI vs. AGI: The Cognitive Strengths and Weaknesses of Modern LLMs

  • paper_url: http://arxiv.org/abs/2309.10371
  • repo_url: None
  • paper_authors: Ben Goertzel
  • for: The paper discusses the cognitive strengths and weaknesses of interactive large language models (LLMs) such as ChatGPT, GPT-4, Bard, and Llama, and how they differ from human cognitive systems.
  • methods: The paper reviews the basic cognitive architectures of these LLMs and argues that many of their practical weaknesses can be tied to gaps in those architectures, so that incremental improvement is not a viable path to human-level artificial general intelligence (AGI) given realizable compute resources.
  • results: The paper argues that there is still much to learn about human-level AGI from studying and experimenting with LLMs, and that LLMs may form significant parts of AGI architectures that also incorporate other ideas; it briefly touches on social and ethical matters such as misinformation and economic upheavals, concluding that the policy needed for modern LLMs differs from what a more credible approximation of human-level AGI would require.
    Abstract A moderately detailed consideration of interactive LLMs as cognitive systems is given, focusing on LLMs circa mid-2023 such as ChatGPT, GPT-4, Bard, Llama, etc.. Cognitive strengths of these systems are reviewed, and then careful attention is paid to the substantial differences between the sort of cognitive system these LLMs are, and the sort of cognitive systems human beings are. It is found that many of the practical weaknesses of these AI systems can be tied specifically to lacks in the basic cognitive architectures according to which these systems are built. It is argued that incremental improvement of such LLMs is not a viable approach to working toward human-level AGI, in practical terms given realizable amounts of compute resources. This does not imply there is nothing to learn about human-level AGI from studying and experimenting with LLMs, nor that LLMs cannot form significant parts of human-level AGI architectures that also incorporate other ideas. Social and ethical matters regarding LLMs are very briefly touched from this perspective, which implies that while care should be taken regarding misinformation and other issues, and economic upheavals will need their own social remedies based on their unpredictable course as with any powerfully impactful technology, overall the sort of policy needed as regards modern LLMs is quite different than would be the case if a more credible approximation to human-level AGI were at hand.
    摘要 一篇moderately detailed的文章对交互式LLM进行了认知系统的评估,主要focus on LLlMAmid-2023年,如ChatGPT、GPT-4、Bard、Llama等等。文章评估了这些系统的认知优势,然后仔细审视了这些AI系统与人类认知系统之间的重要差异。发现这些AI系统的实用弱点可以追溯到其基本认知架构的缺失。 argue that不可靠地提高这些LLMs不是实现人类水平AGI的可靠方法,具体来说,随着可计算资源的增加,这些LLMs的改进速度会变得慢。这并不意味着不能从研究和实验LLMs中学习人类水平AGI,也不意味着LLMs无法成为人类水平AGI架构的重要组成部分。文章 briefly touched social and ethical matters related to LLMs from this perspective, suggesting that while care should be taken to address misinformation and other issues, and economic upheavals will need their own social remedies based on their unpredictable course, the policy needed for modern LLMs is quite different from what would be the case if a more credible approximation to human-level AGI were at hand.

Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

  • paper_url: http://arxiv.org/abs/2309.10370
  • repo_url: None
  • paper_authors: Thomas Chen, Patricia Muñoz Ewald
  • for: This paper analyzes the geometric structure of shallow neural networks with one hidden layer, a ramp activation function, an ${\mathcal L}^2$ Schatten class (Hilbert-Schmidt) cost function, input space ${\mathbb R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training sample size $N>QM$.
  • methods: The authors construct an approximate optimizer using projections adapted to the averages $\overline{x_{0,j}}$ of training inputs belonging to the same output vector $y_j$, $j=1,\dots,Q$, and prove an upper bound of order $O(\delta_P)$ on the minimum of the cost function, where $\delta_P$ measures the signal-to-noise ratio of the training inputs; in the special case $M=Q$ an exact degenerate local minimum is determined, differing from the upper bound by a relative error $O(\delta_P^2)$ (a small numerical sketch follows this entry).
  • results: The proof of the upper bound yields a constructively trained network that metrizes the $Q$-dimensional subspace of the input space ${\mathbb R}^M$ spanned by the $\overline{x_{0,j}}$, and the paper comments on the characterization of the global minimum in this setting.
    Abstract In this paper, we provide a geometric interpretation of the structure of shallow neural networks characterized by one hidden layer, a ramp activation function, an ${\mathcal L}^2$ Schatten class (or Hilbert-Schmidt) cost function, input space ${\mathbb R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size $N>QM$. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$, where $\delta_P$ measures the signal to noise ratio of training inputs. We obtain an approximate optimizer using projections adapted to the averages $\overline{x_{0,j}}$ of training input vectors belonging to the same output vector $y_j$, $j=1,\dots,Q$. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function; the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes the $Q$-dimensional subspace in the input space ${\mathbb R}^M$ spanned by $\overline{x_{0,j}}$, $j=1,\dots,Q$. We comment on the characterization of the global minimum of the cost function in the given context.
    摘要 在这篇论文中,我们对具有单隐藏层、斜坡激活函数、$\mathcal{L}^2$ Schatten 类(即 Hilbert-Schmidt)代价函数、输入空间 $\mathbb{R}^M$、输出空间 $\mathbb{R}^Q$($Q\leq M$)以及训练样本数 $N>QM$ 的浅层神经网络结构给出了几何解释。我们证明了代价函数最小值的上界为 $O(\delta_P)$,其中 $\delta_P$ 度量训练输入的信噪比。我们利用适应于各输出向量 $y_j$ 对应训练输入平均值 $\overline{x_{0,j}}$($j=1,\dots,Q$)的投影,构造了一个近似优化解。在特殊情形 $M=Q$ 下,我们显式确定了代价函数的一个精确退化局部极小值,其与上界的相对误差为 $O(\delta_P^2)$。上界的证明给出了一个可构造训练的网络,它度量化了输入空间 $\mathbb{R}^M$ 中由 $\overline{x_{0,j}}$ 张成的 $Q$ 维子空间。我们还讨论了该设定下代价函数全局最小值的刻画。
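
To make the setup concrete, the sketch below evaluates the $\mathcal{L}^2$ cost of a one-hidden-layer ramp network and computes the class averages $\overline{x_{0,j}}$ on synthetic data; the paper's actual projection-based optimizer is not reproduced, and the dimensions and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setup from the abstract: inputs in R^M, outputs in R^Q (Q <= M), N > Q*M samples,
# one hidden layer with a ramp (ReLU-type) activation and an L^2 cost.
M, Q, N = 6, 3, 60
Y = np.eye(Q)                                          # one output vector y_j per class
labels = rng.integers(0, Q, size=N)
X = rng.normal(size=(N, M)) * 0.1 + labels[:, None]    # noisy inputs clustered by class

ramp = lambda z: np.maximum(z, 0.0)

def l2_cost(W1, b1, W2, b2):
    """Sum-of-squares (L^2) cost of the shallow network over the training set."""
    pred = ramp(X @ W1.T + b1) @ W2.T + b2
    return np.sum((pred - Y[labels]) ** 2)

# Class averages \bar{x}_{0,j}: the quantities the paper's approximate optimizer
# builds its projections from (the construction itself is not reproduced here).
xbar = np.stack([X[labels == j].mean(axis=0) for j in range(Q)])

# A random (untrained) network, just to show the cost being evaluated.
W1, b1 = rng.normal(size=(Q, M)), np.zeros(Q)
W2, b2 = rng.normal(size=(Q, Q)), np.zeros(Q)
print("class averages shape:", xbar.shape, " cost:", round(l2_cost(W1, b1, W2, b2), 2))
```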

Toward efficient resource utilization at edge nodes in federated learning

  • paper_url: http://arxiv.org/abs/2309.10367
  • repo_url: None
  • paper_authors: Sadi Alawadi, Addi Ait-Mlouk, Salman Toor, Andreas Hellander
  • for: This work empirically explores whether randomly selecting only a subset of model layers to train in federated learning can reduce resource utilization at edge nodes without harming global model convergence.
  • methods: The strategy, inspired by transfer learning, freezes the remaining layers in each local update and excludes their weights from transmission to the server; it is implemented in the federated learning framework FEDn and evaluated on CIFAR-10, CASA, and IMDB with different deep learning architectures (a minimal sketch follows this entry).
  • results: Training only part of the model accelerates training, uses on-device resources efficiently, and reduces data transmission by about 75% and 53% when training 25% and 50% of the layers, respectively, without harming global model accuracy.
    Abstract Federated learning (FL) enables edge nodes to collaboratively contribute to constructing a global model without sharing their data. This is accomplished by devices computing local, private model updates that are then aggregated by a server. However, computational resource constraints and network communication can become a severe bottleneck for larger model sizes typical for deep learning applications. Edge nodes tend to have limited hardware resources (RAM, CPU), and the network bandwidth and reliability at the edge is a concern for scaling federated fleet applications. In this paper, we propose and evaluate a FL strategy inspired by transfer learning in order to reduce resource utilization on devices, as well as the load on the server and network in each global training round. For each local model update, we randomly select layers to train, freezing the remaining part of the model. In doing so, we can reduce both server load and communication costs per round by excluding all untrained layer weights from being transferred to the server. The goal of this study is to empirically explore the potential trade-off between resource utilization on devices and global model convergence under the proposed strategy. We implement the approach using the federated learning framework FEDn. A number of experiments were carried out over different datasets (CIFAR-10, CASA, and IMDB), performing different tasks using different deep-learning model architectures. Our results show that training the model partially can accelerate the training process, efficiently utilizes resources on-device, and reduce the data transmission by around 75% and 53% when we train 25%, and 50% of the model layers, respectively, without harming the resulting global model accuracy.
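
A minimal PyTorch sketch of the partial-layer idea, assuming a toy model and a 50% layer fraction (not the paper's FEDn setup): freeze a random subset of layers, run one local step, and transmit only the trained parameters.

```python
import random
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Randomly pick which trainable layers to update this round; freeze the rest.
trainable_layers = [m for m in model if isinstance(m, nn.Linear)]
selected = set(random.sample(range(len(trainable_layers)), k=len(trainable_layers) // 2))
for i, layer in enumerate(trainable_layers):
    for p in layer.parameters():
        p.requires_grad = i in selected

# One local training step on dummy data.
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

# Only the weights of the layers that were actually trained are sent to the server,
# which is where the communication savings come from.
update = {name: p.detach().clone() for name, p in model.named_parameters() if p.requires_grad}
print("parameters transmitted this round:", list(update))
```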

OccluTrack: Rethinking Awareness of Occlusion for Enhancing Multiple Pedestrian Tracking

  • paper_url: http://arxiv.org/abs/2309.10360
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Jianjun Gao, Yi Wang, Kim-Hui Yap, Kratika Garg, Boon Siew Han
  • for: This work aims to improve the accuracy and robustness of multiple pedestrian tracking in occlusion scenes.
  • methods: The proposed occlusion-aware tracker, OccluTrack, introduces an abnormal motion suppression mechanism into the Kalman filter to detect and suppress outlier motions caused by partial occlusion, a pose-guided re-ID module that extracts discriminative part features for partially occluded pedestrians, and an occlusion-aware association method for fair IoU and appearance-embedding distance measurement (a small Kalman-gating sketch follows this entry).
  • results: Extensive evaluation on the MOT-Challenge datasets shows that OccluTrack outperforms state-of-the-art methods, with improvements in IDF1, IDSw, AssA, and AssR demonstrating better tracking and association performance under occlusion.
    Abstract Multiple pedestrian tracking faces the challenge of tracking pedestrians in the presence of occlusion. Existing methods suffer from inaccurate motion estimation, appearance feature extraction, and association due to occlusion, leading to inadequate Identification F1-Score (IDF1), excessive ID switches (IDSw), and insufficient association accuracy and recall (AssA and AssR). We found that the main reason is abnormal detections caused by partial occlusion. In this paper, we suggest that the key insight is explicit motion estimation, reliable appearance features, and fair association in occlusion scenes. Specifically, we propose an adaptive occlusion-aware multiple pedestrian tracker, OccluTrack. We first introduce an abnormal motion suppression mechanism into the Kalman Filter to adaptively detect and suppress outlier motions caused by partial occlusion. Second, we propose a pose-guided re-ID module to extract discriminative part features for partially occluded pedestrians. Last, we design a new occlusion-aware association method towards fair IoU and appearance embedding distance measurement for occluded pedestrians. Extensive evaluation results demonstrate that our OccluTrack outperforms state-of-the-art methods on MOT-Challenge datasets. Particularly, the improvements on IDF1, IDSw, AssA, and AssR demonstrate the effectiveness of our OccluTrack on tracking and association performance.
    摘要 多人行踪面临 occlusion 挑战,现有方法受到 occlusion 的影响,导致不准确的运动估计、外观特征提取和关联,从而导致 IDF1 分数不够高、ID Switches 过多、关联准确率和回归率不够高。我们发现主要原因是部分 occlusion 引起的异常检测。在这篇论文中,我们提出了关键思路,即显式运动估计、可靠的外观特征和公平的关联在 occlusion 场景下。 Specifically,我们提出了一种适应 occlusion 的多人行踪器,即 OccluTrack。我们首先在 Kalman 筛引入了异常运动抑制机制,以适应部分 occlusion 引起的异常检测。其次,我们提出了一种基于 pose 的 Re-ID 模块,以提取部分 occlusion 的特征。最后,我们设计了一种新的 occlusion-aware 关联方法,以实现公平的 IoU 和外观嵌入距离度量测量。我们对 MOT-Challenge 数据集进行了广泛的评估,结果显示,我们的 OccluTrack 超过了当前状态的方法。特别是,对 IDF1、ID Switches、AssA 和 AssR 的改进表明了我们的 OccluTrack 在跟踪和关联性能方面的效果。
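
The abnormal-motion suppression can be illustrated with a simple innovation gate on a 1D constant-velocity Kalman filter; the noise levels, gate threshold, and measurements below are arbitrary and only stand in for occlusion-corrupted detections (the paper's actual mechanism is more involved).

```python
import numpy as np

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])     # state transition (position, velocity)
H = np.array([[1.0, 0.0]])                # we only observe position
Q = np.eye(2) * 1e-2                      # process noise
R = np.array([[1.0]])                     # measurement noise
GATE = 9.0                                # squared Mahalanobis gate (~3 sigma)

x = np.array([0.0, 1.0])                  # initial position and velocity
P = np.eye(2)

measurements = [1.1, 2.0, 15.0, 3.9, 5.1]   # the 15.0 mimics an occlusion-induced outlier

for z in measurements:
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Innovation and its covariance.
    y = z - (H @ x)[0]
    S = (H @ P @ H.T + R)[0, 0]
    if y * y / S > GATE:
        print(f"measurement {z:.1f} suppressed as abnormal motion")
        continue                          # keep the prediction, skip the update
    K = (P @ H.T / S).reshape(2)
    x = x + K * y
    P = (np.eye(2) - np.outer(K, H)) @ P
    print(f"measurement {z:.1f} -> position estimate {x[0]:.2f}")
```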

Explaining Agent Behavior with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10346
  • repo_url: None
  • paper_authors: Xijia Zhang, Yue Guo, Simon Stepputtis, Katia Sycara, Joseph Campbell
  • for: This work aims to let intelligent agents such as robots explain the reasoning behind their decisions to human counterparts, even when their behavior is produced by uninterpretable models such as deep neural networks.
  • methods: The approach learns a compact representation of the agent's behavior from observations of states and actions alone, agnostic to the underlying model, and uses it together with a pre-trained large language model to produce natural language explanations with minimal hallucination while supporting user interaction.
  • results: User studies and experiments show that the generated explanations are as helpful as those written by a human domain expert, while enabling beneficial interactions such as clarification and counterfactual queries.
    Abstract Intelligent agents such as robots are increasingly deployed in real-world, safety-critical settings. It is vital that these agents are able to explain the reasoning behind their decisions to human counterparts, however, their behavior is often produced by uninterpretable models such as deep neural networks. We propose an approach to generate natural language explanations for an agent's behavior based only on observations of states and actions, agnostic to the underlying model representation. We show how a compact representation of the agent's behavior can be learned and used to produce plausible explanations with minimal hallucination while affording user interaction with a pre-trained large language model. Through user studies and empirical experiments, we show that our approach generates explanations as helpful as those generated by a human domain expert while enabling beneficial interactions such as clarification and counterfactual queries.
    摘要 智能代理人如机器人在实际世界中越来越多地被部署。这些代理人的决策需要给人类对手 explicable,但它们的行为通常是由不可解释的模型,如深度神经网络生成的。我们提出了一种方法,可以基于状态和行动的观察来生成代理人的决策的自然语言解释。我们表明了如何学习一种紧凑的代理人行为表示,并使用这种表示生成可靠的解释,同时允许用户与预训练的大型自然语言模型进行互动。通过用户研究和实验,我们表明了我们的方法可以生成与人类领域专家生成的解释相当有用,并允许有利的互动,如确认和对比查询。

FedWOA: A Federated Learning Model that uses the Whale Optimization Algorithm for Renewable Energy Prediction

  • paper_url: http://arxiv.org/abs/2309.10337
  • repo_url: None
  • paper_authors: Viorica Chifu, Tudor Cioara, Cristian Anitiei, Cristina Pop, Ionut Anghel
  • for: This paper tackles the privacy concerns of training machine learning models on sensitive household prosumer energy data, which is needed for accurate energy predictions that support grid management and large-scale adoption of renewables.
  • methods: The proposed federated learning model, FedWOA, uses the Whale Optimization Algorithm to find the optimal vector of weights for aggregating the local LSTM models trained on prosumer energy data into a global shared model, and uses K-Means to cluster prosumers with a similar scale of energy data in order to handle non-IID data (a minimal optimizer sketch follows this entry).
  • results: Evaluation on prosumer energy data shows that FedWOA improves the accuracy of the energy prediction models by 25% for MSE and 16% for MAE compared to FedAVG, while exhibiting good convergence and reduced loss.
    Abstract Privacy is important when dealing with sensitive personal information in machine learning models, which require large data sets for training. In the energy field, access to household prosumer energy data is crucial for energy predictions to support energy grid management and large-scale adoption of renewables however citizens are often hesitant to grant access to cloud-based machine learning models. Federated learning has been proposed as a solution to privacy challenges however report issues in generating the global prediction model due to data heterogeneity, variations in generation patterns, and the high number of parameters leading to even lower prediction accuracy. This paper addresses these challenges by introducing FedWOA a novel federated learning model that employs the Whale Optimization Algorithm to aggregate global prediction models from the weights of local LTSM neural network models trained on prosumer energy data. The proposed solution identifies the optimal vector of weights in the search spaces of the local models to construct the global shared model and then is subsequently transmitted to the local nodes to improve the prediction quality at the prosumer site while for handling non-IID data K-Means was used for clustering prosumers with similar scale of energy data. The evaluation results on prosumers energy data have shown that FedWOA can effectively enhance the accuracy of energy prediction models accuracy by 25% for MSE and 16% for MAE compared to FedAVG while demonstrating good convergence and reduced loss.
    摘要 隐私是机器学习模型处理敏感个人信息时的重要问题,这些模型需要大量数据进行训练。在能源领域,获取家庭生产消费者(prosumer)的能源数据对能源预测至关重要,以支持能源网络管理和可再生能源的大规模采用,但居民往往不愿向云端机器学习模型提供数据访问权限。联邦学习被提出作为隐私挑战的解决方案,但由于数据不均匀、发电模式变化和参数过多,生成全局预测模型时预测精度会进一步下降。本文提出了一种名为FedWOA的联邦学习模型,该模型使用鲸鱼优化算法(Whale Optimization Algorithm)对基于生产消费者能源数据训练的本地LSTM模型权重进行聚合,在局部模型的搜索空间中寻找最优权重向量以构建全局共享模型,并将其下发到本地节点以提高预测质量;在处理非独立同分布(non-IID)数据时,使用K-Means对能源数据规模相近的生产消费者进行聚类。评估结果表明,FedWOA相比FedAVG可将能源预测模型的MSE提高25%、MAE提高16%,同时表现出良好的收敛性并降低了损失。
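
A minimal Whale Optimization Algorithm sketch: the quadratic toy objective below merely stands in for the validation loss of the model aggregated with weights w, and the hyperparameters are arbitrary rather than the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    """Toy stand-in for the validation loss of the model aggregated with weights w."""
    return np.sum((w - 0.3) ** 2)

def whale_optimization(dim=5, n_whales=20, n_iter=100, lb=0.0, ub=1.0):
    """Minimal Whale Optimization Algorithm (encircling / random search / spiral)."""
    X = rng.uniform(lb, ub, size=(n_whales, dim))
    best = min(X, key=loss).copy()
    for t in range(n_iter):
        a = 2.0 * (1.0 - t / n_iter)                    # decreases linearly from 2 to 0
        for i in range(n_whales):
            r1, r2, p = rng.random(), rng.random(), rng.random()
            A, C = 2.0 * a * r1 - a, 2.0 * r2
            if p < 0.5:
                if abs(A) < 1.0:                        # encircling the best solution
                    X[i] = best - A * np.abs(C * best - X[i])
                else:                                   # exploration around a random whale
                    Xr = X[rng.integers(n_whales)]
                    X[i] = Xr - A * np.abs(C * Xr - X[i])
            else:                                       # spiral (bubble-net) update
                l = rng.uniform(-1.0, 1.0)
                X[i] = np.abs(best - X[i]) * np.exp(l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lb, ub)
            if loss(X[i]) < loss(best):
                best = X[i].copy()
    return best

w = whale_optimization()
print("aggregation weights:", np.round(w, 3), "loss:", round(loss(w), 6))
```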

Learning based 2D Irregular Shape Packing

  • paper_url: http://arxiv.org/abs/2309.10329
  • repo_url: None
  • paper_authors: Zeshi Yang, Zherong Pan, Manyi Li, Kui Wu, Xifeng Gao
  • for: 2D irregular shape packing arranges the UV patches of a 3D model within a texture atlas for memory-efficient appearance rendering in computer graphics.
  • methods: The proposed learning-assisted method iteratively selects and groups subsets of UV patches into near-rectangular super patches, essentially reducing the problem to bin packing, and then applies a joint optimization to further improve the packing ratio; deep neural policies predict nearly rectangular patch subsets and their relative poses, giving linear time scaling with the number of patches.
  • results: On three UV packing datasets, the method achieves a higher packing ratio than several widely used baselines while maintaining competitive computational speed.
    Abstract 2D irregular shape packing is a necessary step to arrange UV patches of a 3D model within a texture atlas for memory-efficient appearance rendering in computer graphics. Being a joint, combinatorial decision-making problem involving all patch positions and orientations, this problem has well-known NP-hard complexity. Prior solutions either assume a heuristic packing order or modify the upstream mesh cut and UV mapping to simplify the problem, which either limits the packing ratio or incurs robustness or generality issues. Instead, we introduce a learning-assisted 2D irregular shape packing method that achieves a high packing quality with minimal requirements from the input. Our method iteratively selects and groups subsets of UV patches into near-rectangular super patches, essentially reducing the problem to bin-packing, based on which a joint optimization is employed to further improve the packing ratio. In order to efficiently deal with large problem instances with hundreds of patches, we train deep neural policies to predict nearly rectangular patch subsets and determine their relative poses, leading to linear time scaling with the number of patches. We demonstrate the effectiveness of our method on three datasets for UV packing, where our method achieves a higher packing ratio over several widely used baselines with competitive computational speed.
    摘要 二维不规则形填充是计算机图形中为三维模型的Texture Atlas中的UV贴图进行内存高效的显示的必要步骤。作为一个共同的、复杂决策问题,这个问题有well-known NP-hard复杂性。先前的解决方案 Either assume a heuristic packing order或修改上游缝隙和UV映射以简化问题,这些方法 Either limit the packing ratio or incur robustness or generality issues。相比之下,我们介绍了一种学习帮助的二维不规则形填充方法,可以 achieve high packing quality with minimal input requirements。我们的方法 iteratively selects and groups subsets of UV patches into near-rectangular super patches, essentially reducing the problem to bin-packing, based on which a joint optimization is employed to further improve the packing ratio。为了有效地处理大型问题集合,我们训练了深度神经策略来预测 nearly rectangular patch subsets和他们的相对位置,从而实现 linear time scaling with the number of patches。我们在三个UV填充数据集上展示了我们的方法的效果,其中我们的方法高于许多常用的基线方法,并且与 computation speed相当。

QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation

  • paper_url: http://arxiv.org/abs/2309.10326
  • repo_url: None
  • paper_authors: Kunlun Zhu, Shihao Liang, Xu Han, Zhi Zheng, Guoyang Zeng, Zhiyuan Liu, Maosong Sun
  • for: This work aims to provide more and higher-quality training data for question answering (QA) by iteratively bootstrapping QA data generation from a seed set of supervised examples.
  • methods: QASnowball consists of three modules: an answer extractor that extracts core phrases from unlabeled documents as candidate answers, a question generator that generates questions from documents and candidate answers, and a QA data filter that keeps only high-quality QA pairs; the framework can also enhance itself by reseeding the seed set and fine-tuning in successive iterations (a skeleton of this loop follows this entry).
  • results: Experiments in a high-resource English setting and a medium-resource Chinese setting show that (1) models trained on the generated data achieve results comparable to using supervised data, and (2) pre-training on the generated data followed by fine-tuning on supervised data yields better performance.
    Abstract Recent years have witnessed the success of question answering (QA), especially its potential to be a foundation paradigm for tackling diverse NLP tasks. However, obtaining sufficient data to build an effective and stable QA system still remains an open problem. For this problem, we introduce an iterative bootstrapping framework for QA data augmentation (named QASnowball), which can iteratively generate large-scale high-quality QA data based on a seed set of supervised examples. Specifically, QASnowball consists of three modules, an answer extractor to extract core phrases in unlabeled documents as candidate answers, a question generator to generate questions based on documents and candidate answers, and a QA data filter to filter out high-quality QA data. Moreover, QASnowball can be self-enhanced by reseeding the seed set to fine-tune itself in different iterations, leading to continual improvements in the generation quality. We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the experimental results show that the data generated by QASnowball can facilitate QA models: (1) training models on the generated data achieves comparable results to using supervised data, and (2) pre-training on the generated data and fine-tuning on supervised data can achieve better performance. Our code and generated data will be released to advance further work.
    摘要 近年来,问答(QA)取得了显著成功,尤其是其作为解决多种自然语言处理(NLP)任务的基础范式的潜力。然而,获取足够的数据以构建有效且稳定的QA系统仍是一个尚未解决的问题。为此,我们提出了一个迭代自举框架QASnowball,它可以在有监督样例种子集的基础上迭代生成大规模高质量的QA数据。具体来说,QASnowball包括三个模块:答案提取器,从无标注文档中提取核心短语作为候选答案;问题生成器,基于文档和候选答案生成问题;以及QA数据筛选器,筛选出高质量的QA数据。此外,QASnowball可以通过在不同迭代中重新设定种子集来微调自身,从而不断提高生成质量。我们在高资源英语场景和中资源中文场景进行了实验,结果表明QASnowball生成的数据可以帮助QA模型:(1)使用生成数据训练模型可以达到与使用有监督数据相当的性能;(2)先在生成数据上预训练、再在有监督数据上微调可以取得更好的性能。我们将发布代码和生成的数据,以推动后续工作。
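
A skeleton of the three-module bootstrapping loop. The regex answer extractor, template question generator, and length-based filter below are naive placeholders for the trained models used in the paper, and the documents are invented for illustration.

```python
import re

DOCS = [
    "The Amazon River flows through South America and is fed by heavy rainfall.",
    "Marie Curie received two Nobel Prizes for her work on radioactivity.",
]

def extract_answers(doc):
    """Stand-in answer extractor: capitalized multi-word phrases as candidate answers."""
    return re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", doc)

def generate_question(doc, answer):
    """Stand-in question generator: a fixed template instead of a trained seq2seq model."""
    return f"What does the passage say about {answer}?"

def filter_pair(question, answer):
    """Stand-in quality filter: keep pairs with reasonably sized questions and answers."""
    return 5 <= len(question.split()) <= 30 and 1 <= len(answer.split()) <= 6

seed = []                      # would hold supervised (q, a, doc) examples
for iteration in range(2):     # each iteration would also re-train the modules on `seed`
    generated = []
    for doc in DOCS:
        for ans in extract_answers(doc):
            q = generate_question(doc, ans)
            if filter_pair(q, ans):
                generated.append((q, ans, doc))
    seed.extend(generated)     # "reseeding": generated data feeds the next iteration
    print(f"iteration {iteration}: {len(generated)} QA pairs, seed size {len(seed)}")
```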

Metastatic Breast Cancer Prognostication Through Multimodal Integration of Dimensionality Reduction Algorithms and Classification Algorithms

  • paper_url: http://arxiv.org/abs/2309.10324
  • repo_url: None
  • paper_authors: Bliss Singhal, Fnu Pooja
  • for: This study uses machine learning to detect whether pathology scans show metastatic cancer, a task pathologists currently perform manually and that is prone to mislabeling.
  • methods: Two preprocessing algorithms, principal component analysis (PCA) and a genetic algorithm, are used to reduce the dimensionality of the dataset, followed by three classification algorithms: logistic regression, a decision tree classifier, and k-nearest neighbors (a simplified pipeline sketch follows this entry).
  • results: The best ML pipeline, combining PCA, the genetic algorithm, and k-nearest neighbors, reached an accuracy of 71.14%, suggesting that preprocessing plus classification algorithms have real potential for detecting metastatic cancer.
    Abstract Machine learning (ML) is a branch of Artificial Intelligence (AI) where computers analyze data and find patterns in the data. The study focuses on the detection of metastatic cancer using ML. Metastatic cancer is the point where the cancer has spread to other parts of the body and is the cause of approximately 90% of cancer related deaths. Normally, pathologists spend hours each day to manually classify whether tumors are benign or malignant. This tedious task contributes to mislabeling metastasis being over 60% of time and emphasizes the importance to be aware of human error, and other inefficiencies. ML is a good candidate to improve the correct identification of metastatic cancer saving thousands of lives and can also improve the speed and efficiency of the process thereby taking less resources and time. So far, deep learning methodology of AI has been used in the research to detect cancer. This study is a novel approach to determine the potential of using preprocessing algorithms combined with classification algorithms in detecting metastatic cancer. The study used two preprocessing algorithms: principal component analysis (PCA) and the genetic algorithm to reduce the dimensionality of the dataset, and then used three classification algorithms: logistic regression, decision tree classifier, and k-nearest neighbors to detect metastatic cancer in the pathology scans. The highest accuracy of 71.14% was produced by the ML pipeline comprising of PCA, the genetic algorithm, and the k-nearest neighbors algorithm, suggesting that preprocessing and classification algorithms have great potential for detecting metastatic cancer.
    摘要 机器学习(ML)是人工智能(AI)的一个分支,让计算机分析数据并从中发现模式。本研究关注利用ML检测转移性癌症。转移性癌症指癌症已扩散到身体其他部位,约占癌症相关死亡的90%。通常,病理学家每天需要花费数小时手动判断肿瘤是良性还是恶性,这一繁琐的工作使转移病例被误标的比例超过60%,也凸显了人为错误和其他低效问题。ML有望提高转移性癌症的正确识别率,从而挽救大量生命,并加快流程、减少所需的资源和时间。此前的研究多采用深度学习方法检测癌症,而本研究是一种新的尝试,评估将预处理算法与分类算法相结合检测转移性癌症的潜力。研究使用两种预处理算法——主成分分析(PCA)和遗传算法——来降低数据维度,随后使用逻辑回归、决策树分类器和k近邻三种分类算法在病理切片中检测转移性癌症。由PCA、遗传算法和k近邻组成的ML流水线取得了71.14%的最高准确率,表明预处理与分类算法在检测转移性癌症方面具有很大潜力。
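
A simplified scikit-learn version of such a pipeline. The genetic-algorithm step is omitted, only the kNN classifier is shown, and scikit-learn's built-in breast-cancer dataset stands in for the pathology data, so the numbers will not match the paper's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load a stand-in tabular dataset (benign vs. malignant tumors).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Dimensionality reduction with PCA followed by a k-nearest-neighbors classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)
print("test accuracy:", round(pipeline.score(X_test, y_test), 4))
```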

Who to Trust, How and Why: Untangling AI Ethics Principles, Trustworthiness and Trust

  • paper_url: http://arxiv.org/abs/2309.10318
  • repo_url: None
  • paper_authors: Andreas Duenser, David M. Douglas
  • for: This paper reviews the literature on trust in AI and AI trustworthiness, and argues for distinguishing the two concepts more clearly and for gathering more empirical evidence on what drives people's trusting behaviours.
  • methods: The authors discuss that trust in AI involves not only reliance on the system itself but also trust in its developers, and note that AI ethics principles such as explainability and transparency are often assumed to promote user trust although empirical evidence of how such features affect perceived trustworthiness is limited and unclear.
  • results: They argue that AI systems should be recognised as socio-technical systems, in which the people involved in designing, developing, deploying, and using the system matter as much as the system itself for determining trustworthiness; without these nuances, trust in AI and trustworthy AI risk becoming nebulous terms for any desirable feature of AI systems.
    Abstract We present an overview of the literature on trust in AI and AI trustworthiness and argue for the need to distinguish these concepts more clearly and to gather more empirically evidence on what contributes to people s trusting behaviours. We discuss that trust in AI involves not only reliance on the system itself, but also trust in the developers of the AI system. AI ethics principles such as explainability and transparency are often assumed to promote user trust, but empirical evidence of how such features actually affect how users perceive the system s trustworthiness is not as abundance or not that clear. AI systems should be recognised as socio-technical systems, where the people involved in designing, developing, deploying, and using the system are as important as the system for determining whether it is trustworthy. Without recognising these nuances, trust in AI and trustworthy AI risk becoming nebulous terms for any desirable feature for AI systems.
    摘要 我们提供了关于人们对AI和AI可靠性的文献综述,并 argue了更清晰地分 differentiating these concepts,并更多地寻求实证证据以确定人们如何信任系统的行为。我们讨论了人们对AI系统的信任不仅取决于系统本身,而且还取决于开发者。AI伦理原则,如可读性和透明度,通常被认为能够促进用户信任,但实际证据表明这些特性对用户对系统可靠性的看法并不那么清晰。我们认为AI系统应被视为社会技术系统,其中设计、开发、部署和使用系统的人员是决定系统可靠性的重要因素。如果不认真地考虑这些细节,则“信任AI”和“可靠AI”这两个概念可能会变得混乱,成为任何愿望的AI系统特性。

Investigating the Catastrophic Forgetting in Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10313
  • repo_url: None
  • paper_authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
  • for: This paper studies multimodal large language models (MLLMs) built by fine-tuning pre-trained LLMs and vision models, and asks whether they retain the performance of their underlying vision encoders.
  • methods: The authors introduce EMT (Evaluating MulTimodality), which evaluates catastrophic forgetting in MLLMs by treating each MLLM as an image classifier, and additionally fine-tune LLaVA while tracking performance throughout fine-tuning (a small evaluation sketch follows this entry).
  • results: Almost all evaluated fine-tuned MLLMs fail to retain the performance of their vision encoders on standard image classification; early-stage fine-tuning on an image dataset improves performance on other image datasets by better aligning text and visual features, but as fine-tuning proceeds the MLLMs begin to hallucinate and lose generalizability, even when the image encoder remains frozen.
    Abstract Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still remains an inherent problem in multimodal LLMs (MLLM). In this paper, we introduce EMT: Evaluating MulTimodality for evaluating the catastrophic forgetting in MLLMs, by treating each MLLM as an image classifier. We first apply EMT to evaluate several open-source fine-tuned MLLMs and we discover that almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks. Moreover, we continue fine-tuning LLaVA, an MLLM and utilize EMT to assess performance throughout the fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on an image dataset improves performance across other image datasets, by enhancing the alignment of text and visual features. However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to demonstrate performance on par with their vision models on standard image classification tasks and the current MLLM fine-tuning procedure still has room for improvement.
    摘要 根据GPT4的成功,Multimodal大型语言模型(MLLM)的研究获得了更多的关注。这些研究旨在通过精心适应已经预训的语言模型和视觉模型来开发通用的MLLM。然而,在多modal LLM中,严重的忘记现象仍然是一个困扰,这意味着精心适应的模型无法保持与预训模型相同的性能水平。在这篇论文中,我们引入EMT:评估多modal性,用于评估MLLM中的忘记现象。我们首先将EMT应用于评估一些公开源的精心适应MLLM,我们发现大多数评估的MLLM都无法保持与摄像头模型在标准图像分类任务中的相同性能水平。此外,我们继续适应LLaVA,一个MLLM,并使用EMT评估其性能。我们发现,在早期的适应过程中,使用图像数据集进行适应可以提高图像和文本特征之间的对齐,但是,当精心适应进行时,MLLM开始伪造,导致严重的泛化能力损失,甚至当摄像头模型保持固定时。我们的结果表明,目前的MLLM尚未能达到和摄像头模型在标准图像分类任务中的性能水平,并且精心适应程序仍然需要改进。
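
A sketch of evaluating an MLLM as an image classifier in the spirit of EMT; `query_mllm` is a hypothetical stub (no real model is called here) and the prompt and parsing strategy are invented for illustration.

```python
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

def query_mllm(image, prompt):
    """Hypothetical MLLM call; replace with a real model/SDK invocation."""
    return "I think this image shows a cat sitting on a sofa."

def parse_prediction(reply):
    """Map the free-form reply to the first class name it mentions, if any."""
    reply = reply.lower()
    for cls in CLASSES:
        if cls in reply:
            return cls
    return None

def evaluate(dataset):
    prompt = "Answer with exactly one of: " + ", ".join(CLASSES) + "."
    correct = 0
    for image, label in dataset:
        pred = parse_prediction(query_mllm(image, prompt))
        correct += int(pred == label)
    return correct / len(dataset)

# Dummy "dataset" of (image, label) pairs; images are placeholders here.
dataset = [(None, "cat"), (None, "dog"), (None, "cat")]
print("accuracy:", round(evaluate(dataset), 3))
```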

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2309.10294
  • repo_url: None
  • paper_authors: Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen
  • for: This paper aims to improve speech emotion recognition (SER) using the state-of-the-art speech pre-trained model data2vec, the text generation technique GPT-4, and the speech synthesis technique Azure TTS.
  • methods: The authors combine a speech self-supervised pre-trained model, a powerful large language model (LLM), and an emotional text-to-speech (TTS) model, with carefully designed text prompts and dataset construction, to generate emotionally congruent synthetic text and speech; they then study data augmentation strategies including random mixing, adversarial training, transfer learning, and curriculum learning.
  • results: Experiments and ablation studies on the IEMOCAP dataset demonstrate the effectiveness of the method compared with other data augmentation approaches and with augmentation using other synthetic data.
    Abstract In this paper, we explored how to boost speech emotion recognition (SER) with the state-of-the-art speech pre-trained model (PTM), data2vec, text generation technique, GPT-4, and speech synthesis technique, Azure TTS. First, we investigated the representation ability of different speech self-supervised pre-trained models, and we found that data2vec has a good representation ability on the SER task. Second, we employed a powerful large language model (LLM), GPT-4, and emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech. We carefully designed the text prompt and dataset construction, to obtain the synthetic emotional speech data with high quality. Third, we studied different ways of data augmentation to promote the SER task with synthetic speech, including random mixing, adversarial training, transfer learning, and curriculum learning. Experiments and ablation studies on the IEMOCAP dataset demonstrate the effectiveness of our method, compared with other data augmentation methods, and data augmentation with other synthetic data.
    摘要 本文探索了如何借助最先进的语音预训练模型 data2vec、文本生成技术 GPT-4 和语音合成技术 Azure TTS 来提升语音情感识别(SER)。首先，我们考察了不同语音自监督预训练模型的表示能力，发现 data2vec 在 SER 任务上具有良好的表示能力。其次，我们利用强大的大语言模型 GPT-4 和情感文本转语音模型 Azure TTS 生成情感一致的文本和语音，并精心设计了文本提示与数据构造流程，以获得高质量的合成情感语音数据。第三，我们研究了多种利用合成语音促进 SER 的数据增强方式，包括随机混合、对抗训练、迁移学习和课程学习。在 IEMOCAP 数据集上的实验与消融研究表明，与其他数据增强方法以及基于其他合成数据的增强方案相比，我们的方法更为有效。
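
下面是"随机混合"这一最简单增强方式的示意：把真实的IEMOCAP语音样本与TTS合成的情感语音样本混入同一个训练列表。文件路径、标签格式与混合比例均为示意性假设，并非论文的原始实现。

```python
# Minimal sketch of "random mixing" augmentation: pool real IEMOCAP utterances
# with synthetic emotional TTS utterances and shuffle them into one training
# list. Paths, label format and the mixing ratio are illustrative assumptions.
import random
from typing import List, Tuple

Sample = Tuple[str, str]  # (path_to_wav, emotion_label)


def random_mixing(real: List[Sample], synthetic: List[Sample],
                  synth_ratio: float = 0.5, seed: int = 0) -> List[Sample]:
    """Return a shuffled training list containing all real samples plus a
    random subset of synthetic samples sized relative to the real data."""
    rng = random.Random(seed)
    n_synth = int(len(real) * synth_ratio)
    mixed = list(real) + rng.sample(synthetic, min(n_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed


real_data = [("iemocap/ses01_001.wav", "angry"), ("iemocap/ses01_002.wav", "sad")]
tts_data = [("tts/gen_0001.wav", "angry"), ("tts/gen_0002.wav", "happy")]
train_list = random_mixing(real_data, tts_data, synth_ratio=0.5)
```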

QXAI: Explainable AI Framework for Quantitative Analysis in Patient Monitoring Systems

  • paper_url: http://arxiv.org/abs/2309.10293
  • repo_url: None
  • paper_authors: Thanveer Shaik, Xiaohui Tao, Haoran Xie, Lin Li, Juan D. Velasquez, Niall Higgins
  • for: 这个研究的目的是提出一种可解释的人工智能技术，用于远程监测患者的生命体征和身体活动。
  • methods: 这个研究提出了可解释人工智能框架(QXAI)，在监督学习设置下结合Shapley值概念与深度学习模型中的注意力机制，为回归和分类任务同时提供事后解释与内在解释（Shapley值的蒙特卡洛近似示意代码见本条目之后）。
  • results: 研究使用PPG-DaLiA数据集进行心率预测、MHEALTH数据集进行身体活动分类，深度模型在两项任务上均取得了最先进的结果，并给出了全局解释与局部解释。
    Abstract Artificial Intelligence techniques can be used to classify a patient's physical activities and predict vital signs for remote patient monitoring. Regression analysis based on non-linear models like deep learning models has limited explainability due to its black-box nature. This can require decision-makers to make blind leaps of faith based on non-linear model results, especially in healthcare applications. In non-invasive monitoring, patient data from tracking sensors and their predisposing clinical attributes act as input features for predicting future vital signs. Explaining the contributions of various features to the overall output of the monitoring application is critical for a clinician's decision-making. In this study, an Explainable AI for Quantitative analysis (QXAI) framework is proposed with post-hoc model explainability and intrinsic explainability for regression and classification tasks in a supervised learning approach. This was achieved by utilizing the Shapley values concept and incorporating attention mechanisms in deep learning models. We adopted the artificial neural networks (ANN) and attention-based Bidirectional LSTM (BiLSTM) models for the prediction of heart rate and classification of physical activities based on sensor data. The deep learning models achieved state-of-the-art results in both prediction and classification tasks. Global explanation and local explanation were conducted on input data to understand the feature contribution of various patient data. The proposed QXAI framework was evaluated using PPG-DaLiA data to predict heart rate and mobile health (MHEALTH) data to classify physical activities based on sensor data. Monte Carlo approximation was applied to the framework to overcome the time complexity and high computation power requirements required for Shapley value calculations.
    摘要 人工智能技术可以用于对患者的身体活动进行分类并预测生命体征，从而实现远程患者监测。然而，基于深度学习等非线性模型的回归分析具有黑箱性质，可解释性有限，这可能迫使决策者（尤其是在医疗应用中）盲目相信模型的输出。在非侵入式监测中，来自跟踪传感器的患者数据及其临床特征被用作预测未来生命体征的输入特征，而解释各特征对监测结果的贡献对临床决策至关重要。本研究提出了一个面向定量分析的可解释人工智能框架(QXAI)，在监督学习设置下为回归和分类任务同时提供事后解释与内在解释，其实现方式是利用Shapley值概念并在深度模型中引入注意力机制。我们采用人工神经网络(ANN)和基于注意力的双向LSTM(BiLSTM)模型，基于传感器数据预测心率并对身体活动进行分类，两类深度模型均取得了最先进的结果。我们对输入数据进行了全局解释和局部解释，以理解各项患者数据特征的贡献。QXAI框架分别在PPG-DaLiA数据集（心率预测）和MHEALTH数据集（身体活动分类）上进行了评估，并采用蒙特卡洛近似来缓解Shapley值计算的时间复杂度和算力需求。
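
下面给出用蒙特卡洛置换采样近似Shapley值的最小示意，说明用近似来缓解Shapley值计算开销的基本思路；这只是该技术的通用写法，并非论文QXAI的原始实现。

```python
# Monte Carlo approximation of Shapley values by permutation sampling:
# for each sampled feature ordering, the marginal change in the model output
# when a feature is revealed is credited to that feature. A sketch of the
# general technique, not the paper's exact QXAI implementation.
import numpy as np


def mc_shapley(predict, x, baseline, n_permutations=200, seed=0):
    """predict: f(X) -> (n,) array; x, baseline: 1-D feature vectors."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_permutations):
        order = rng.permutation(d)
        z = baseline.copy()
        prev = predict(z[None, :])[0]
        for j in order:
            z[j] = x[j]                      # reveal feature j
            cur = predict(z[None, :])[0]
            phi[j] += cur - prev             # marginal contribution
            prev = cur
    return phi / n_permutations


# Example with a toy linear "heart-rate" model over 3 sensor features.
weights = np.array([0.5, -1.2, 2.0])
model = lambda X: X @ weights + 60.0
x_patient = np.array([1.0, 0.3, 0.8])
x_reference = np.zeros(3)
print(mc_shapley(model, x_patient, x_reference))   # ~ [0.5, -0.36, 1.6]
```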

Koopman Invertible Autoencoder: Leveraging Forward and Backward Dynamics for Temporal Modeling

  • paper_url: http://arxiv.org/abs/2309.10291
  • repo_url: None
  • paper_authors: Kshitij Tayal, Arvind Renganathan, Rahul Ghosh, Xiaowei Jia, Vipin Kumar
  • for: 本研究旨在提升机器学习模型的长期预测能力，解决循环神经网络等现有时序模型只能捕捉训练数据中的统计关联、难以学习目标系统内在动力学的局限。
  • methods: 我们提出了一种基于库普曼(Koopman)算子理论的模型——库普曼可逆自编码器(KIA)，在无穷维希尔伯特空间中同时建模系统的前向与后向动力学，从而高效地学习低维表示；其可逆性设计保证了前向与逆向运算的可逆性和一致性（前向/后向一致性的示意代码见本条目之后）。
  • results: 在单摆和气候数据集上的实验表明，该方法将单摆的长期预测能力提升约300%，并在噪声下保持稳健；在长期气候预测上同样表现出色。
    Abstract Accurate long-term predictions are the foundations for many machine learning applications and decision-making processes. However, building accurate long-term prediction models remains challenging due to the limitations of existing temporal models like recurrent neural networks (RNNs), as they capture only the statistical connections in the training data and may fail to learn the underlying dynamics of the target system. To tackle this challenge, we propose a novel machine learning model based on Koopman operator theory, which we call Koopman Invertible Autoencoders (KIA), that captures the inherent characteristic of the system by modeling both forward and backward dynamics in the infinite-dimensional Hilbert space. This enables us to efficiently learn low-dimensional representations, resulting in more accurate predictions of long-term system behavior. Moreover, our method's invertibility design guarantees reversibility and consistency in both forward and inverse operations. We illustrate the utility of KIA on pendulum and climate datasets, demonstrating 300% improvements in long-term prediction capability for pendulum while maintaining robustness against noise. Additionally, our method excels in long-term climate prediction, further validating our method's effectiveness.
    摘要 准确的长期预测是许多机器学习应用和决策过程的基础。然而，现有的时序模型（如循环神经网络RNN）只能捕捉训练数据中的统计关联，可能无法学习目标系统的内在动力学，因此构建准确的长期预测模型仍然具有挑战性。为此，我们提出了一种基于库普曼算子理论的新模型——库普曼可逆自编码器(KIA)，它在无穷维希尔伯特空间中同时建模系统的前向与后向动力学，从而高效地学习低维表示，提升长期系统行为预测的准确性。此外，模型的可逆性设计保证了前向与逆向运算的可逆性和一致性。我们在单摆和气候数据集上验证了KIA：在单摆数据上长期预测能力提升约300%，同时对噪声保持稳健；在长期气候预测上也表现出色，进一步验证了方法的有效性。
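
下面是KIA思想的一个最小PyTorch示意：在潜空间用一个可学习的线性库普曼矩阵做前向演化、用其逆做后向演化，并以前向/后向预测损失训练。论文实际的可逆结构可能不同，此处仅示意前向与后向一致性的原理。

```python
# Minimal PyTorch sketch of the KIA idea: encode the state into a latent space
# where dynamics are (approximately) linear, advance forward with a learned
# Koopman matrix K and backward with its inverse, and train with forward and
# backward prediction losses. The paper's exact invertible architecture may
# differ; this only illustrates the forward/backward consistency principle.
import torch
import torch.nn as nn


class KoopmanAE(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, state_dim))
        self.K = nn.Parameter(torch.eye(latent_dim))   # linear latent dynamics

    def forward(self, x_t: torch.Tensor, x_next: torch.Tensor):
        z_t, z_next = self.enc(x_t), self.enc(x_next)
        pred_next = self.dec(z_t @ self.K)                       # forward step
        pred_prev = self.dec(z_next @ torch.linalg.inv(self.K))  # backward step
        recon = self.dec(z_t)
        loss = (nn.functional.mse_loss(pred_next, x_next)
                + nn.functional.mse_loss(pred_prev, x_t)
                + nn.functional.mse_loss(recon, x_t))
        return loss


model = KoopmanAE(state_dim=2)
x_t, x_next = torch.randn(32, 2), torch.randn(32, 2)
loss = model(x_t, x_next)
loss.backward()
```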

AstroPortal: An ontology repository concept for astronomy, astronautics and other space topics

  • paper_url: http://arxiv.org/abs/2309.10288
  • repo_url: https://github.com/rrovetto/astroportal
  • paper_authors: Robert J. Rovetto
  • for: 这篇论文是为了建立一个关于天文学、航天学和其他空间相关领域的 Ontology 仓库而写的。
  • methods: 论文提出了一个集中式平台，允许用户针对天文相关主题搜索、评审和创建本体(ontology)。
  • results: 论文提出了一个新概念——建立一个专门的本体存储库，以减少研究时间，并提供一种便捷的方式来研究和比较目标领域的知识组织系统或语义资源。
    Abstract This paper describes a repository for ontologies of astronomy, astronautics, and other space-related topics. It may be called AstroPortal (or SpacePortal), AstroHub (or SpaceHub), etc. The creation of this repository will be applicable to academic, research and other data-intensive sectors. It is relevant for space sciences (including astronomy), Earth science, and astronautics (spaceflight), among other data-intensive disciplines. The repository should provide a centralized platform to search, review and create ontologies for astro-related topics. It thereby can decrease research time, while also providing a user-friendly means to study and compare knowledge organization systems or semantic resources of the target domains. With no apparent repository available on the target domain, this paper also expresses a novel concept.
    摘要 本文描述了一个面向天文学、航天学及其他空间相关主题的本体(ontology)存储库，可称为 AstroPortal（或 SpacePortal）、AstroHub（或 SpaceHub）等。该存储库的建设适用于学术、科研及其他数据密集型领域，与空间科学（含天文学）、地球科学和航天学等学科密切相关。它应提供一个集中式平台，用于搜索、评审和创建与天文相关的本体，从而减少研究时间，并为研究和比较目标领域的知识组织系统或语义资源提供便捷的途径。鉴于目标领域目前尚无类似的存储库，本文也提出了一个新的概念。

FRAMU: Attention-based Machine Unlearning using Federated Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.10283
  • repo_url: None
  • paper_authors: Thanveer Shaik, Xiaohui Tao, Lin Li, Haoran Xie, Taotao Cai, Xiaofeng Zhu, Qing Li
  • for: 这篇论文旨在解决数据隐私问题，提出一种基于联邦强化学习与注意力机制的机器遗忘(Machine Unlearning)框架，在移除过时、隐私或无关数据的同时保持模型的准确性和计算效率。
  • methods: 该框架结合注意力机制、隐私保护技术和优化策略，可以处理单模态和多模态等不同数据源，并在遗忘过程中兼顾准确性与隐私。
  • results: 在单模态和多模态数据集上的实验表明，FRAMU显著优于基线模型；对收敛行为和优化策略的进一步评估也验证了该框架在联邦学习应用中的实用性。
    Abstract Machine Unlearning is an emerging field that addresses data privacy issues by enabling the removal of private or irrelevant data from the Machine Learning process. Challenges related to privacy and model efficiency arise from the use of outdated, private, and irrelevant data. These issues compromise both the accuracy and the computational efficiency of models in both Machine Learning and Unlearning. To mitigate these challenges, we introduce a novel framework, Attention-based Machine Unlearning using Federated Reinforcement Learning (FRAMU). This framework incorporates adaptive learning mechanisms, privacy preservation techniques, and optimization strategies, making it a well-rounded solution for handling various data sources, either single-modality or multi-modality, while maintaining accuracy and privacy. FRAMU's strength lies in its adaptability to fluctuating data landscapes, its ability to unlearn outdated, private, or irrelevant data, and its support for continual model evolution without compromising privacy. Our experiments, conducted on both single-modality and multi-modality datasets, revealed that FRAMU significantly outperformed baseline models. Additional assessments of convergence behavior and optimization strategies further validate the framework's utility in federated learning applications. Overall, FRAMU advances Machine Unlearning by offering a robust, privacy-preserving solution that optimizes model performance while also addressing key challenges in dynamic data environments.
    摘要 机器遗忘(Machine Unlearning)是一个新兴领域，旨在通过从机器学习过程中移除隐私或无关数据来解决数据隐私问题。过时、隐私和无关数据的使用会同时损害模型在学习与遗忘过程中的准确性和计算效率。为缓解这些挑战，我们提出了一个新框架——基于联邦强化学习的注意力机器遗忘(FRAMU)。该框架融合了自适应学习机制、隐私保护技术和优化策略，能够处理单模态或多模态等多种数据源，同时保持准确性与隐私。FRAMU的优势在于能够适应不断变化的数据环境、遗忘过时、隐私或无关的数据，并支持模型在不损害隐私的前提下持续演化。我们在单模态和多模态数据集上的实验表明，FRAMU显著优于基线模型；对收敛行为和优化策略的进一步评估也验证了该框架在联邦学习应用中的实用性。总体而言，FRAMU提供了一个稳健且保护隐私的解决方案，在动态数据环境中优化模型性能的同时应对机器遗忘的关键挑战。

Crowd-Aware Multi-Agent Pathfinding With Boosted Curriculum Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.10275
  • repo_url: None
  • paper_authors: Phu Pham, Aniket Bera
  • for: 解决多Agent路径规划(MAPF)在拥挤环境中的困难问题,旨在找到所有Agent在系统中的冲突自由路径。
  • methods: 我们提出了一种人群感知的去中心化方法 CRAMP，采用增强式课程学习(boosted curriculum reinforcement learning)引导的训练策略来解决该问题（课程调度的示意代码见本条目之后）。
  • results: 我们在模拟环境中测试了CRAMP,并证明了我们的方法在多种维度上超过了现有的分布式方法的性能。CRAMP提高了解决quality达58% measured in makespan和冲突数量,并提高了成功率达5%。
    Abstract Multi-Agent Path Finding (MAPF) in crowded environments presents a challenging problem in motion planning, aiming to find collision-free paths for all agents in the system. MAPF finds a wide range of applications in various domains, including aerial swarms, autonomous warehouse robotics, and self-driving vehicles. The current approaches for MAPF can be broadly categorized into two main categories: centralized and decentralized planning. Centralized planning suffers from the curse of dimensionality and thus does not scale well in large and complex environments. On the other hand, decentralized planning enables agents to engage in real-time path planning within a partially observable environment, demonstrating implicit coordination. However, they suffer from slow convergence and performance degradation in dense environments. In this paper, we introduce CRAMP, a crowd-aware decentralized approach to address this problem by leveraging reinforcement learning guided by a boosted curriculum-based training strategy. We test CRAMP on simulated environments and demonstrate that our method outperforms the state-of-the-art decentralized methods for MAPF on various metrics. CRAMP improves the solution quality up to 58% measured in makespan and collision count, and up to 5% in success rate in comparison to previous methods.
    摘要 在拥挤环境中进行多智能体路径规划(MAPF)是运动规划中的一个难题，其目标是为系统中所有智能体找到无碰撞的路径。MAPF在飞行集群、自主仓储机器人和自动驾驶车辆等领域有广泛应用。现有方法大致分为集中式规划和去中心化规划两类：集中式规划受维度灾难影响，难以扩展到大型复杂环境；去中心化规划允许智能体在部分可观测环境中实时规划路径并表现出隐式协调，但在稠密环境中收敛缓慢、性能下降。本文提出了人群感知的去中心化方法CRAMP，利用增强式课程学习引导的强化学习来解决该问题。我们在模拟环境中测试了CRAMP，结果表明其在多项指标上优于现有最好的去中心化MAPF方法：在完工时间(makespan)和碰撞次数上将解的质量最多提升58%，成功率最多提升5%。
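
下面给出"增强式课程"调度的一个示意：当近期成功率超过阈值时提升环境难度（更多智能体、更高障碍密度）。阈值、难度档位与训练循环本身均为假设，并非CRAMP的原始设定。

```python
# Illustrative curriculum schedule for crowd-aware MAPF training: start with
# few agents in a small map and "boost" the difficulty (more agents, denser
# obstacles) whenever the recent success rate clears a threshold. Thresholds,
# step sizes and the training loop itself are assumptions, not CRAMP's values.
from collections import deque


def make_curriculum(levels):
    """levels: list of dicts like {"agents": 4, "obstacle_density": 0.1}."""
    idx, window = 0, deque(maxlen=100)

    def on_episode_end(success: bool):
        nonlocal idx
        window.append(1.0 if success else 0.0)
        if len(window) == window.maxlen and sum(window) / len(window) > 0.9:
            idx = min(idx + 1, len(levels) - 1)   # boost to the next stage
            window.clear()
        return levels[idx]                         # config for the next episode

    return on_episode_end


curriculum = make_curriculum([
    {"agents": 4,  "obstacle_density": 0.10},
    {"agents": 16, "obstacle_density": 0.20},
    {"agents": 32, "obstacle_density": 0.30},
])
next_cfg = curriculum(success=True)   # call after every training episode
```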

Using an Uncrewed Surface Vehicle to Create a Volumetric Model of Non-Navigable Rivers and Other Shallow Bodies of Water

  • paper_url: http://arxiv.org/abs/2309.10269
  • repo_url: None
  • paper_authors: Jayesh Tripathi, Robin Murphy
  • for: 这篇论文提供了一种实用方法，利用无人水面艇(USV)采集浅水水体的水下测深图，并将其与岸边的数字表面模型合并，生成统一的体积模型。
  • methods: 论文将泊松表面重建算法应用于稀疏的声呐测深点以生成水线以下的网格，并使用商用的运动恢复结构(SfM)软件生成岸边的稠密水线以上网格（泊松重建步骤的示意代码见本条目之后）。
  • results: 该方法能够生成浅水水体的体积模型，并在传感器覆盖存在缺口的情况下完成网格合并，从而帮助应急规划人员评估水体在溢出前可容纳的水量。
    Abstract Non-navigable rivers and retention ponds play important roles in buffering communities from flooding, yet emergency planners often have no data as to the volume of water that they can carry before flooding the surrounding. This paper describes a practical approach for using an uncrewed marine surface vehicle (USV) to collect and merge bathymetric maps with digital surface maps of the banks of shallow bodies of water into a unified volumetric model. The below-waterline mesh is developed by applying the Poisson surface reconstruction algorithm to the sparse sonar depth readings of the underwater surface. Dense above-waterline meshes of the banks are created using commercial structure from motion (SfM) packages. Merging is challenging for many reasons, the most significant is gaps in sensor coverage, i.e., the USV cannot collect sonar depth data or visually see sandy beaches leading to a bank thus the two meshes may not intersect. The approach is demonstrated on a Hydronalix EMILY USV with a Humminbird single beam echosounder and Teledyne FLIR camera at Lake ESTI at the Texas A&M Engineering Extension Service Disaster City complex.
    摘要 非通航河流和蓄水池在缓冲社区免受洪水侵袭方面发挥着重要作用，但应急规划人员往往缺乏这些水体在溢出前可容纳水量的数据。本文描述了一种实用方法：利用无人水面艇(USV)采集浅水水体的水下测深图，并将其与岸边的数字表面模型合并为统一的体积模型。水线以下的网格通过对稀疏的声呐测深点应用泊松表面重建算法获得；岸边水线以上的稠密网格则由商用的运动恢复结构(SfM)软件生成。两者的合并存在诸多挑战，其中最主要的是传感器覆盖缺口——USV无法在通向岸边的浅滩处采集声呐深度数据或拍摄影像，导致两个网格可能不相交。该方法在德克萨斯A&M工程推广服务"灾难城"的ESTI湖上，使用搭载Humminbird单波束回声测深仪和Teledyne FLIR相机的Hydronalix EMILY无人艇进行了演示。
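
下面是水线以下建模步骤的示意：用Open3D的泊松表面重建把稀疏的声呐测深点(x, y, 深度)转为网格。文件名与八叉树深度为占位参数，与SfM岸边网格的合并不在此示意之内。

```python
# Sketch of the below-waterline step: turn sparse sonar soundings (x, y, depth)
# into a mesh with Open3D's Poisson surface reconstruction. File names and the
# octree depth are placeholders; merging with the SfM bank mesh is not shown.
import numpy as np
import open3d as o3d

# soundings.csv: columns x, y, z where z is the (negative) sonar depth.
points = np.loadtxt("soundings.csv", delimiter=",")

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
# Normals are required by Poisson reconstruction; estimate them from neighbors
# and orient them roughly upward since the sensed surface is a lake bed.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=2.0, max_nn=30))
pcd.orient_normals_to_align_with_direction(np.array([0.0, 0.0, 1.0]))

mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=8)
# Trim low-support triangles that Poisson extrapolates far from any sounding.
densities = np.asarray(densities)
mesh.remove_vertices_by_mask(densities < np.quantile(densities, 0.05))
o3d.io.write_triangle_mesh("below_waterline.ply", mesh)
```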

Correlation between morphological evolution of splashing drop and exerted impact force revealed by interpretation of explainable artificial intelligence

  • paper_url: http://arxiv.org/abs/2309.10266
  • repo_url: None
  • paper_authors: Jingzu Yee, Daichi Igarashi, Pradipto, Akinori Yamanaka, Yoshiyuki Tagawa
  • for: 这个研究探讨了撞击液体在固体表面上的某些特征与正常化的影响力之间的可能的相关性。
  • methods: 这个研究使用了一种新的特征提取方法和一种可解释的人工智能(XAI)视频分类器来分类撞击和非撞击液体。
  • results: 研究发现,XAI模型对撞击和非撞击液体的分类值具有不同的重要性,并且这些重要性随时间的演化而变化。具体来说,在撞击时间的各个点上,抽象出的撞击特征的贡献率与正常化影响力的贡献率之间存在紧密的相关性。
    Abstract This study reveals a possible correlation between splashing morphology and the normalized impact force exerted by an impacting drop on a solid surface. This finding is obtained from a newly proposed feature extraction method and a subsequent interpretation of the classification of splashing and non-splashing drops performed by an explainable artificial intelligence (XAI) video classifier. Notably, the values of the weight matrix elements of the XAI that correspond to the extracted features are found to change with the temporal evolution of the drop morphology. We compute the rate of change of the contributions of each frame with respect to the classification value of a video as an important index to quantify the contributions of the extracted splashing and non-splashing features at different impact times to the classification of the XAI model. Remarkably, the rate computed for the extracted splashing features is found to closely match the profile of the normalized impact force, where the splashing features are most pronounced immediately after the normalized impact force reaches its peak value. This study has provided an example that clarifies the relationship between the complex morphological evolution of a splashing drop and physical parameters by interpreting the classification of an XAI video classifier.
    摘要 (以下是简化中文版)这个研究发现可能存在液体撞击表面时的液体形态和normalized影响力之间的关系。这一发现来自于一种新提出的特征提取方法和随后的XAI视频分类器的解释。值得注意的是,XAI模型中的weight矩阵元素与提取特征之间的关系发生了时间的变化。我们计算了每帧的贡献的变化率对视频分类值的影响,以便量化不同的撞击时间对XAI模型的分类的贡献。吸引人的是,计算的液体撞击特征的变化率与正常化影响力的profile非常相似,特别是在正常化影响力达到最大值时,液体撞击特征的变化最为明显。这个研究提供了一个示例,从液体撞击的形态进行解释,并且解释了XAI模型的分类结果与物理参数之间的关系。

LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins

  • paper_url: http://arxiv.org/abs/2309.10254
  • repo_url: https://github.com/llm-platform-security/chatgpt-plugin-eval
  • paper_authors: Umar Iqbal, Tadayoshi Kohno, Franziska Roesner
  • for: 本研究旨在提供一个框架，帮助LLM平台设计者分析并改进现有及未来集成插件的LLM平台在安全、隐私与使用安全方面的表现。
  • methods: 我们提出了一个攻击分类体系，通过迭代地考察LLM平台各利益相关方如何利用自身的能力和职责相互发起攻击来构建；在迭代过程中，我们将该分类体系应用于OpenAI的插件生态系统。
  • results: 我们发现了一些插件，它们具体地体现了攻击分类体系中列出的多类问题。我们的结论是，这些问题为当前及未来基于LLM的计算平台带来了新的安全、隐私与使用安全方面的挑战。
    Abstract Large language model (LLM) platforms, such as ChatGPT, have recently begun offering a plugin ecosystem to interface with third-party services on the internet. While these plugins extend the capabilities of LLM platforms, they are developed by arbitrary third parties and thus cannot be implicitly trusted. Plugins also interface with LLM platforms and users using natural language, which can have imprecise interpretations. In this paper, we propose a framework that lays a foundation for LLM platform designers to analyze and improve the security, privacy, and safety of current and future plugin-integrated LLM platforms. Our framework is a formulation of an attack taxonomy that is developed by iteratively exploring how LLM platform stakeholders could leverage their capabilities and responsibilities to mount attacks against each other. As part of our iterative process, we apply our framework in the context of OpenAI's plugin ecosystem. We uncover plugins that concretely demonstrate the potential for the types of issues that we outline in our attack taxonomy. We conclude by discussing novel challenges and by providing recommendations to improve the security, privacy, and safety of present and future LLM-based computing platforms.
    摘要 以ChatGPT为代表的大语言模型(LLM)平台近期开始提供插件生态，用于对接互联网上的第三方服务。这些插件扩展了LLM平台的能力，但由任意第三方开发，不能被默认信任；插件与LLM平台、用户之间又通过可能产生歧义的自然语言进行交互。本文提出了一个框架，为LLM平台设计者分析并改进当前及未来集成插件的LLM平台的安全性、隐私性与使用安全奠定基础。该框架的核心是一个攻击分类体系，通过迭代地考察各利益相关方如何利用自身能力与职责相互发起攻击而构建。在迭代过程中，我们将该框架应用于OpenAI的插件生态系统，发现了一些具体体现上述问题类型的插件。最后，我们讨论了新的挑战，并就提升现有及未来基于LLM的计算平台的安全、隐私与使用安全给出了建议。

GPTFUZZER : Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

  • paper_url: http://arxiv.org/abs/2309.10253
  • repo_url: https://github.com/sherdencooper/gptfuzz
  • paper_authors: Jiahao Yu, Xingwei Lin, Xinyu Xing
  • for: 这项研究旨在提供一个自动生成越狱(jailbreak)提示模板的黑盒模糊测试框架，用于对LLM进行红队测试以提升其安全性。
  • methods: 该框架以AFL模糊测试框架为蓝本，包含三个关键组件：种子选择策略、用于生成语义等价或相近句子的变异算子，以及判断越狱攻击是否成功的判定模型（模糊测试主循环的示意代码见本条目之后）。
  • results: 研究发现，GPTFuzzer在多种攻击场景下都能稳定地生成高成功率的越狱模板，即使初始种子模板质量较低也是如此。
    Abstract Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial "jailbreak" attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzzer, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzzer automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzzer starts with human-written templates as seeds, then mutates them using mutate operators to produce new templates. We detail three key components of GPTFuzzer: a seed selection strategy for balancing efficiency and variability, metamorphic relations for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We tested GPTFuzzer on various commercial and open-source LLMs, such as ChatGPT, LLaMa-2, and Claude2, under diverse attack scenarios. Our results indicate that GPTFuzzer consistently produces jailbreak templates with a high success rate, even in settings where all human-crafted templates fail. Notably, even starting with suboptimal seed templates, GPTFuzzer maintains over 90% attack success rate against ChatGPT and Llama-2 models. We believe GPTFuzzer will aid researchers and practitioners in assessing LLM robustness and will spur further research into LLM safety.
    摘要 大语言模型(LLM)近来广受欢迎，被广泛应用于日常对话乃至AI驱动的编程。然而，LLM并非完全可靠，可能就如何实施有害或非法活动给出详细指导。虽然安全措施可以降低此类输出的风险，但对抗性的"越狱"攻击仍能诱使LLM生成有害内容。这些越狱模板通常需要人工精心构造，使大规模测试变得困难。本文提出GPTFuzzer，一个受AFL启发的新型黑盒越狱模糊测试框架，可自动生成用于红队测试LLM的越狱模板。其核心流程是以人工编写的模板作为种子，通过变异算子生成新模板。框架包含三个关键组件：在效率与多样性之间取得平衡的种子选择策略、用于生成语义等价或相近句子的变形关系，以及评估越狱攻击是否成功的判定模型。我们在ChatGPT、LLaMa-2和Claude2等商用与开源LLM上、在多种攻击场景下测试了GPTFuzzer。结果表明，即使在所有人工模板都失败的情形下，GPTFuzzer仍能稳定地生成高成功率的越狱模板；即便从质量欠佳的种子出发，对ChatGPT和LLaMa-2的攻击成功率仍保持在90%以上。我们相信GPTFuzzer将帮助研究者和从业者评估LLM的稳健性，并推动LLM安全方面的进一步研究。
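
下面给出摘要所述模糊测试主循环的骨架：选种子、变异、查询目标模型、由判定模型打分。其中 `mutate_with_llm`、`query_target`、`judge` 为占位函数，种子选择也只用了简化策略，并非GPTFuzzer的原始实现；仅供对自有模型做红队测试时参考。

```python
# Skeleton of a GPTFuzzer-style loop: pick a seed template, mutate it, query
# the target model, and let a judgment model decide whether the jailbreak
# succeeded. The three callables are stand-ins for the framework's components.
import random


def fuzz(seed_templates, question, mutate_with_llm, query_target, judge, budget=100):
    """mutate_with_llm(template) -> template, query_target(prompt) -> response,
    judge(response) -> bool are stand-ins for the framework's real components."""
    pool = [{"template": t, "score": 1.0} for t in seed_templates]
    successes = []
    for _ in range(budget):
        # Simplified seed selection: sample a few candidates and keep the best
        # scoring one (the paper uses more elaborate selection strategies).
        seed = max(random.sample(pool, k=min(3, len(pool))),
                   key=lambda s: s["score"])
        candidate = mutate_with_llm(seed["template"])            # metamorphic edit
        response = query_target(candidate.replace("[QUESTION]", question))
        if judge(response):                                      # judgment model
            successes.append(candidate)
            pool.append({"template": candidate, "score": seed["score"] + 1.0})
        else:
            seed["score"] *= 0.95                                # decay weak seeds
    return successes
```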

On Explicit Curvature Regularization in Deep Generative Models

  • paper_url: http://arxiv.org/abs/2309.10237
  • repo_url: None
  • paper_authors: Yonghyeon Lee, Frank Chongwoo Park
  • for: 这篇论文提出了一类基于曲率的正则化项，用于深度生成模型的学习。
  • methods: 论文针对嵌入在高维欧氏空间中的任意数据流形，推导了内在曲率与外在曲率度量的显式坐标不变公式，并给出了近似计算这两类曲率的高效公式。
  • results: 在含噪的动作捕捉数据上，基于曲率的方法优于现有的自编码器正则化方法，其中内在曲率度量略优于外在曲率度量。
    Abstract We propose a family of curvature-based regularization terms for deep generative model learning. Explicit coordinate-invariant formulas for both intrinsic and extrinsic curvature measures are derived for the case of arbitrary data manifolds embedded in higher-dimensional Euclidean space. Because computing the curvature is a highly computation-intensive process involving the evaluation of second-order derivatives, efficient formulas are derived for approximately evaluating intrinsic and extrinsic curvatures. Comparative studies are conducted that compare the relative efficacy of intrinsic versus extrinsic curvature-based regularization measures, as well as performance comparisons against existing autoencoder training methods. Experiments involving noisy motion capture data confirm that curvature-based methods outperform existing autoencoder regularization methods, with intrinsic curvature measures slightly more effective than extrinsic curvature measures.
    摘要 我们为深度生成模型的学习提出了一类基于曲率的正则化项。针对嵌入在高维欧氏空间中的任意数据流形，我们推导了内在曲率和外在曲率度量的显式坐标不变公式。由于曲率计算涉及二阶导数、计算开销很大，我们还给出了近似计算内在与外在曲率的高效公式。我们通过对比研究评估了基于内在曲率与基于外在曲率的正则化项的相对效果，并与现有的自编码器训练方法进行了性能比较。在含噪的动作捕捉数据上的实验证实，基于曲率的方法优于现有的自编码器正则化方法，且内在曲率度量略优于外在曲率度量。

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

  • paper_url: http://arxiv.org/abs/2309.10228
  • repo_url: None
  • paper_authors: Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang
  • for: This paper aims to enhance autonomous vehicles’ decision-making processes by integrating Large Language Models (LLMs) to provide personalized assistance, continuous learning, and transparent decision-making.
  • methods: The proposed framework leverages LLMs’ natural language capabilities and contextual understanding, specialized tools usage, synergizing reasoning, and acting with various modules on autonomous vehicles.
  • results: The proposed framework has the potential to revolutionize the way autonomous vehicles operate, offering personalized assistance, continuous learning, and transparent decision-making, ultimately contributing to safer and more efficient autonomous driving technologies.
    Abstract The future of autonomous vehicles lies in the convergence of human-centric design and advanced AI capabilities. Autonomous vehicles of the future will not only transport passengers but also interact and adapt to their desires, making the journey comfortable, efficient, and pleasant. In this paper, we present a novel framework that leverages Large Language Models (LLMs) to enhance autonomous vehicles' decision-making processes. By integrating LLMs' natural language capabilities and contextual understanding, specialized tools usage, synergizing reasoning, and acting with various modules on autonomous vehicles, this framework aims to seamlessly integrate the advanced language and reasoning capabilities of LLMs into autonomous vehicles. The proposed framework holds the potential to revolutionize the way autonomous vehicles operate, offering personalized assistance, continuous learning, and transparent decision-making, ultimately contributing to safer and more efficient autonomous driving technologies.
    摘要 自动驾驶未来在人类中心设计和高级人工智能技术的融合中实现。未来的自动驾驶车不仅会运送乘客,还会与乘客互动,适应其愿望,使旅行更舒适、更高效、更愉悦。在这篇论文中,我们提出了一种新的框架,通过将大型自然语言模型(LLM)的自然语言能力和上下文理解 integrate into autonomous vehicles的决策过程中。通过特殊工具的使用、同步理解、合并推理和行动等模块的结合,这个框架计划将 LLM 的高级语言和推理能力融合到自动驾驶车中。该框架的提议具有改变自动驾驶车的运行方式,提供个性化协助、不断学习和透明决策,从而为更安全和更高效的自动驾驶技术做出贡献。

Multi-level feature fusion network combining attention mechanisms for polyp segmentation

  • paper_url: http://arxiv.org/abs/2309.10219
  • repo_url: None
  • paper_authors: Junzhuo Liu, Qiaosong Chen, Ye Zhang, Zhixiang Wang, Deng Xin, Jin Wang
  • for: 这篇论文旨在提出一种新的自动息肉分割技术，以提高医疗诊断的效率和准确性，降低结直肠癌的风险。
  • methods: 论文提出的网络称为MLFF-Net，利用多层次特征融合和注意力机制优化息肉分割，包含三个模块：多尺度注意力模块(MAM)、高级特征增强模块(HFEM)和全局注意力模块(GAM)。
  • results: 在五个公开数据集上的实验表明，所提方法不仅能分割多种类型的息肉，而且在准确性和泛化能力上优于当前最先进的方法。
    Abstract Clinically, automated polyp segmentation techniques have the potential to significantly improve the efficiency and accuracy of medical diagnosis, thereby reducing the risk of colorectal cancer in patients. Unfortunately, existing methods suffer from two significant weaknesses that can impact the accuracy of segmentation. Firstly, features extracted by encoders are not adequately filtered and utilized. Secondly, semantic conflicts and information redundancy caused by feature fusion are not attended to. To overcome these limitations, we propose a novel approach for polyp segmentation, named MLFF-Net, which leverages multi-level feature fusion and attention mechanisms. Specifically, MLFF-Net comprises three modules: Multi-scale Attention Module (MAM), High-level Feature Enhancement Module (HFEM), and Global Attention Module (GAM). Among these, MAM is used to extract multi-scale information and polyp details from the shallow output of the encoder. In HFEM, the deep features of the encoders complement each other by aggregation. Meanwhile, the attention mechanism redistributes the weight of the aggregated features, weakening the conflicting redundant parts and highlighting the information useful to the task. GAM combines features from the encoder and decoder features, as well as computes global dependencies to prevent receptive field locality. Experimental results on five public datasets show that the proposed method not only can segment multiple types of polyps but also has advantages over current state-of-the-art methods in both accuracy and generalization ability.
    摘要 在临床上，自动化的息肉分割技术有望显著提高医疗诊断的效率和准确性，从而降低患者罹患结直肠癌的风险。然而，现有方法存在两个影响分割精度的明显缺陷：其一，编码器提取的特征未经充分筛选和利用；其二，特征融合带来的语义冲突和信息冗余未被妥善处理。为克服这些局限，我们提出了一种新的息肉分割方法MLFF-Net，它利用多层次特征融合和注意力机制，包含三个模块：多尺度注意力模块(MAM)、高级特征增强模块(HFEM)和全局注意力模块(GAM)。其中，MAM用于从编码器的浅层输出中提取多尺度信息和息肉细节；HFEM通过聚合使编码器的深层特征互相补充，同时注意力机制重新分配聚合特征的权重，削弱相互冲突的冗余部分并突出对任务有用的信息；GAM则结合编码器与解码器的特征并计算全局依赖，以避免感受野的局部性。在五个公开数据集上的实验表明，所提方法不仅能分割多种类型的息肉，而且在准确性和泛化能力上均优于当前最先进的方法。

An Empirical Study of Attention Networks for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.10217
  • repo_url: None
  • paper_authors: Hao Guo, Hongbiao Si, Guilin Jiang, Wei Zhang, Zhiyan Liu, Xuanyi Zhu, Xulong Zhang, Yang Liu
  • for: 本文主要研究 semantic segmentation 领域中的注意网络,探讨其计算复杂度和精度在不同类别上的表现,以及适用场景和建议。
  • methods: 本文使用了多种注意网络,包括 decoder 和 self-attention 网络,进行对比研究。
  • results: 研究发现,decoder 网络在某些场景下表现较好,而 self-attention 网络在其他场景下表现较好。此外,研究还发现了一些注意网络的缺点和未来发展方向。
    Abstract Semantic segmentation is a vital problem in computer vision. Recently, a common solution to semantic segmentation is the end-to-end convolution neural network, which is much more accurate than traditional methods.Recently, the decoders based on attention achieve state-of-the-art (SOTA) performance on various datasets. But these networks always are compared with the mIoU of previous SOTA networks to prove their superiority and ignore their characteristics without considering the computation complexity and precision in various categories, which is essential for engineering applications. Besides, the methods to analyze the FLOPs and memory are not consistent between different networks, which makes the comparison hard to be utilized. What's more, various methods utilize attention in semantic segmentation, but the conclusion of these methods is lacking. This paper first conducts experiments to analyze their computation complexity and compare their performance. Then it summarizes suitable scenes for these networks and concludes key points that should be concerned when constructing an attention network. Last it points out some future directions of the attention network.
    摘要 semantic segmentation 是计算机视觉中的一个关键问题。最近,一种常见的解决方案是将端到端 convolutional neural network(CNN)作为解决方案,这种方法比传统方法更为精准。 Recently, attention 基于的解码器在多个数据集上达到了状态的极点性能(SOTA)。但这些网络总是与之前的 SOTA 网络的 mIoU 进行比较,忽略它们的特点而不考虑不同类别的计算复杂度和精度,这是工程应用中必须考虑的。另外,不同网络之间的 FLOPs 和内存分析方法不一致,使得比较变得困难。此外,各种方法在 semantic segmentation 中使用 attention,但这些方法的结论缺乏。这篇论文首先进行了计算复杂度的分析和比较性能。然后总结了适合这些网络的场景,并指出了在建立注意力网络时需要关注的关键点。最后,它指出了未来注意力网络的发展方向。

Safe POMDP Online Planning via Shielding

  • paper_url: http://arxiv.org/abs/2309.10216
  • repo_url: None
  • paper_authors: Shili Sheng, David Parker, Lu Feng
  • for: 这个研究旨在为POMDP在线规划提供安全保证，使所得策略满足几乎必然的"可达-避免"安全规范。
  • methods: 研究计算出限制不安全动作的"防护盾"(shield)，并将其与POMCP在线规划算法结合，提出了四种在计算与集成方式上各不相同的防护方法（动作屏蔽的示意代码见本条目之后）。
  • results: 实验结果显示，所提防护方法能够在大型POMDP上成功保证安全性（基线POMCP则不能），且对在线规划的运行时间影响可以忽略。
    Abstract Partially observable Markov decision processes (POMDPs) have been widely used in many robotic applications for sequential decision-making under uncertainty. POMDP online planning algorithms such as Partially Observable Monte-Carlo Planning (POMCP) can solve very large POMDPs with the goal of maximizing the expected return. But the resulting policies cannot provide safety guarantees that are imperative for real-world safety-critical tasks (e.g., autonomous driving). In this work, we consider safety requirements represented as almost-sure reach-avoid specifications (i.e., the probability to reach a set of goal states is one and the probability to reach a set of unsafe states is zero). We compute shields that restrict unsafe actions violating almost-sure reach-avoid specifications. We then integrate these shields into the POMCP algorithm for safe POMDP online planning. We propose four distinct shielding methods, differing in how the shields are computed and integrated, including factored variants designed to improve scalability. Experimental results on a set of benchmark domains demonstrate that the proposed shielding methods successfully guarantee safety (unlike the baseline POMCP without shielding) on large POMDPs, with negligible impact on the runtime for online planning.
    摘要 部分可观测马尔可夫决策过程(POMDP)被广泛用于机器人在不确定性下的序贯决策。POMCP等POMDP在线规划算法能够以最大化期望回报为目标求解规模很大的POMDP，但所得策略无法提供安全保证，而这对自动驾驶等安全攸关任务至关重要。本文考虑以"几乎必然可达-避免"规范表示的安全要求（到达目标状态集合的概率为1，进入不安全状态集合的概率为0），计算出限制违反该规范的不安全动作的防护盾，并将其集成到POMCP算法中以实现安全的POMDP在线规划。我们提出了四种在防护盾计算与集成方式上各不相同的方法，其中包括为提升可扩展性而设计的分解变体。在一组基准领域上的实验表明，所提防护方法能够在大型POMDP上成功保证安全性（无防护的基线POMCP则不能），且对在线规划运行时间的影响可以忽略。
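
下面示意"屏蔽式"动作选择：在线规划器提交动作前，先过滤掉防护盾判定为违反几乎必然可达-避免规范的动作。防护盾的查询接口（从信念支撑集到允许动作的映射）为假设，POMCP搜索本身并未重新实现。

```python
# Sketch of shielded action selection: before the online planner commits to an
# action, actions that the precomputed shield marks as violating the
# almost-sure reach-avoid specification are filtered out. The shield interface
# (a lookup from belief support to allowed actions) is an assumption; POMCP's
# search itself is not reimplemented here.
from typing import Dict, FrozenSet, List


def shielded_action(
    q_values: Dict[str, float],                 # action -> value from POMCP search
    belief_support: FrozenSet[str],             # states with nonzero belief
    shield: Dict[FrozenSet[str], List[str]],    # support -> allowed actions
) -> str:
    allowed = shield.get(belief_support, list(q_values))
    safe = {a: v for a, v in q_values.items() if a in allowed}
    if not safe:                                # fall back rather than crash
        safe = q_values
    return max(safe, key=safe.get)


q = {"north": 1.2, "south": 0.4, "wait": 0.1}
support = frozenset({"s3", "s7"})
shield = {frozenset({"s3", "s7"}): ["south", "wait"]}   # "north" may reach unsafe states
print(shielded_action(q, support, shield))              # -> "south"
```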

cs.CL - 2023-09-19

MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods

  • paper_url: http://arxiv.org/abs/2309.10966
  • repo_url: None
  • paper_authors: Mara Finkelstein, Markus Freitag
  • for: 提高NLG任务中模型的质量和效率
  • methods: 提出MBR微调与QE微调方法：在训练阶段蒸馏MBR解码和QE重排序这类昂贵解码方法带来的质量增益，在推理阶段仍使用高效的解码算法（MBR解码的示意代码见本条目之后）。
  • results: 在Neural Machine Translation任务上,使用这些finetuning方法可以大幅提高模型的质量和效率,并且在使用外部LLM作为教师模型时,还能超越使用人工生成的参考数据
    Abstract Recent research in decoding methods for Natural Language Generation (NLG) tasks has shown that the traditional beam search and greedy decoding algorithms are not optimal, because model probabilities do not always align with human preferences. Stronger decoding methods, including Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR) decoding, have since been proposed to mitigate the model-perplexity-vs-quality mismatch. While these decoding methods achieve state-of-the-art performance, they are prohibitively expensive to compute. In this work, we propose MBR finetuning and QE finetuning which distill the quality gains from these decoding methods at training time, while using an efficient decoding algorithm at inference time. Using the canonical NLG task of Neural Machine Translation (NMT), we show that even with self-training, these finetuning methods significantly outperform the base model. Moreover, when using an external LLM as a teacher model, these finetuning methods outperform finetuning on human-generated references. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data, while maintaining maximum efficiency during decoding.
    摘要 近期关于自然语言生成(NLG)解码方法的研究表明，传统的束搜索和贪心解码并非最优，因为模型概率并不总与人类偏好一致。质量估计(QE)重排序和最小贝叶斯风险(MBR)解码等更强的解码方法已被提出，以缓解模型困惑度与质量之间的错配，但它们的计算代价极高。本文提出MBR微调与QE微调，在训练阶段蒸馏这些解码方法带来的质量增益，而在推理阶段仍使用高效的解码算法。以神经机器翻译(NMT)为例，我们表明即使采用自训练，这些微调方法也显著优于基础模型；而当使用外部LLM作为教师模型时，其效果甚至超过在人工参考译文上的微调。这些结果表明，利用单语数据可以取得与人工整理数据相当乃至更优的质量提升，同时在解码时保持最高效率。
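
下面是基于采样的MBR解码示意：对同一源句采样若干候选译文，以其余候选为伪参考计算每个候选的平均效用，返回风险最小者。效用函数此处用sacrebleu的chrF，采样数量等均为示意；论文的蒸馏（在MBR输出上微调）步骤未包含在内。

```python
# Sketch of sampling-based MBR decoding: draw candidate translations, score
# each candidate by its average utility against the other samples (used as
# pseudo-references), and return the risk-minimizing one. chrF from sacrebleu
# is used as the utility; the paper's exact utility, sample counts and the
# subsequent finetuning-on-MBR-outputs step are not reproduced here.
from sacrebleu.metrics import CHRF

chrf = CHRF()


def mbr_decode(candidates):
    """candidates: list of sampled translations for one source sentence."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        refs = [c for c in candidates if c is not hyp]
        score = sum(chrf.sentence_score(hyp, [r]).score for r in refs) / len(refs)
        if score > best_score:
            best, best_score = hyp, score
    return best


samples = ["the cat sat on the mat", "a cat sits on the mat", "the cat is on a mat"]
print(mbr_decode(samples))
```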

In-Context Learning for Text Classification with Many Labels

  • paper_url: http://arxiv.org/abs/2309.10954
  • repo_url: https://github.com/dia2018/What-is-the-Difference-Between-AI-and-Machine-Learning
  • paper_authors: Aristides Milios, Siva Reddy, Dzmitry Bahdanau
  • for: 本研究探讨利用大语言模型进行多标签任务的上下文学习(ICL)：由于上下文窗口有限，难以在提示中放入覆盖全部标签的足量示例。
  • methods: 我们使用预训练的稠密检索模型，仅在每次推理调用时向模型提供完整标签空间的一个局部视图；我们在若干常用意图分类数据集上，用最新的开源LLM（OPT、LLaMA）进行测试（检索式提示构造的示意代码见本条目之后）。
  • results: 在不进行微调的情况下取得了新的最先进的小样本结果，并发现只有更大的模型才能有效且稳定地利用更长的上下文进行ICL。我们通过多项消融实验分析了模型对以下三方面的利用：a) 上下文示例与当前输入的相似度，b) 类名的语义内容，c) 示例与标签之间的正确对应关系；结果表明三者在不同领域中的重要性各不相同。
    Abstract In-context learning (ICL) using large language models for tasks with many labels is challenging due to the limited context window, which makes it difficult to fit a sufficient number of examples in the prompt. In this paper, we use a pre-trained dense retrieval model to bypass this limitation, giving the model only a partial view of the full label space for each inference call. Testing with recent open-source LLMs (OPT, LLaMA), we set new state of the art performance in few-shot settings for three common intent classification datasets, with no finetuning. We also surpass fine-tuned performance on fine-grained sentiment classification in certain cases. We analyze the performance across number of in-context examples and different model scales, showing that larger models are necessary to effectively and consistently make use of larger context lengths for ICL. By running several ablations, we analyze the model's use of: a) the similarity of the in-context examples to the current input, b) the semantic content of the class names, and c) the correct correspondence between examples and labels. We demonstrate that all three are needed to varying degrees depending on the domain, contrary to certain recent works.
    摘要 使用大语言模型对多标签任务进行上下文学习(ICL)具有挑战性：上下文窗口有限，难以在提示中放入足量示例。本文使用预训练的稠密检索模型绕过这一限制，在每次推理调用时只向模型提供完整标签空间的一个局部视图。在使用最新的开源LLM（OPT、LLaMA）进行测试时，我们在三个常用意图分类数据集上、在不微调的情况下取得了新的最先进的小样本结果，并在部分细粒度情感分类任务上超过了微调的表现。我们分析了不同上下文示例数量和模型规模下的表现，发现只有更大的模型才能有效且稳定地利用更长的上下文进行ICL。通过多项消融实验，我们分析了模型对以下三方面的利用：a) 上下文示例与当前输入的相似度，b) 类名的语义内容，c) 示例与标签之间的正确对应关系。与某些近期工作的结论不同，我们发现这三者在不同领域中都在不同程度上是必要的。
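
下面示意检索增强的上下文学习：用稠密检索模型为测试输入找出最相近的训练示例，仅用这部分标签空间构造小样本提示。所用检索模型为常见的公开检查点，未必是论文所用模型。

```python
# Sketch of retrieval-augmented in-context learning: embed the training
# examples with a dense retriever, pull the k nearest neighbors of the test
# input, and build a few-shot prompt containing only that slice of the label
# space. The retriever checkpoint is a common public one, not necessarily the
# retriever used in the paper.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")

train = [("play some jazz", "play_music"),
         ("what's the weather tomorrow", "weather_query"),
         ("set an alarm for 7am", "set_alarm"),
         ("turn the volume down", "volume_down")]

train_emb = retriever.encode([t for t, _ in train], convert_to_tensor=True)


def build_prompt(query: str, k: int = 2) -> str:
    q_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, train_emb, top_k=k)[0]
    demos = [train[h["corpus_id"]] for h in hits]
    lines = [f"Input: {t}\nIntent: {y}" for t, y in demos]
    lines.append(f"Input: {query}\nIntent:")
    return "\n\n".join(lines)


print(build_prompt("will it rain on friday"))
# The resulting prompt is then sent to the LLM (e.g. an OPT or LLaMA model).
```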

A Family of Pretrained Transformer Language Models for Russian

  • paper_url: http://arxiv.org/abs/2309.10931
  • repo_url: None
  • paper_authors: Dmitry Zmitrovich, Alexander Abramov, Andrey Kalmykov, Maria Tikhonova, Ekaterina Taktasheva, Danil Astafurov, Mark Baushenko, Artem Snegirev, Tatiana Shavrina, Sergey Markov, Vladislav Mikhailov, Alena Fenogenova
  • for: 本研究旨在开发特性化的Transformer语言模型,用于俄语自然语言理解和生成。
  • methods: 本文使用encoder(ruBERT、ruRoBERTa、ruELECTRA)、decoder(ruGPT-3)和encoder-decoder(ruT5、FRED-T5)多种Transformer模型,并对其进行预训练和测试。
  • results: 研究人员通过对俄语自然语言理解和生成数据集和标准做测试,发现这些特性化Transformer模型具有良好的普适能力和生成能力。
    Abstract Nowadays, Transformer language models (LMs) represent a fundamental component of the NLP research methodologies and applications. However, the development of such models specifically for the Russian language has received little attention. This paper presents a collection of 13 Russian Transformer LMs based on the encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) models in multiple sizes. Access to these models is readily available via the HuggingFace platform. We provide a report of the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian natural language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we hope to broaden the scope of the NLP research directions and enable the development of industrial solutions for the Russian language.
    摘要 现在, transformer 语言模型(LMs)成为了自然语言处理(NLP)研究方法和应用的基本组成部分。然而,为俄语语言的特有Transformer LMs的开发受到了相对少的关注。本文介绍了13种俄语 transformer LMs,包括encoder(ruBERT、ruRoBERTa、ruELECTRA)、decoder(ruGPT-3)和encoder-decoder(ruT5、FRED-T5)模型,以及这些模型的训练和预训练方法。通过这些特化的Transformer LMs,我们希望拓宽NLP研究方向和提供俄语语言industrial解决方案。这些模型通过HuggingFace平台可以访问。我们还提供了模型体系设计和预训练方法的报告,以及在俄语自然语言理解和生成数据集和benchmark上模型的一般化能力的评价结果。

Specializing Small Language Models towards Complex Style Transfer via Latent Attribute Pre-Training

  • paper_url: http://arxiv.org/abs/2309.10929
  • repo_url: https://github.com/ruiqixu37/BTTS_ECAI2023
  • paper_authors: Ruiqi Xu, Yongfeng Huang, Xin Chen, Lin Zhang
  • for: 这项研究旨在介绍复杂文本风格传递任务,并基于两个广泛适用的场景构建了复杂文本数据集。
  • methods: 为规避大语言模型带来的数据隐私、网络不稳定和部署成本高等问题，我们探索了小型模型（不超过T5-3B）通过对比学习进行隐式风格预训练的有效性，并提出了一种基于ChatGPT、与人工评估对齐的文本生成质量自动评估方法。
  • results: 实验表明，我们的方法优于现有方法，在小样本(few-shot)文本风格迁移上达到了最先进的水平。
    Abstract In this work, we introduce the concept of complex text style transfer tasks, and constructed complex text datasets based on two widely applicable scenarios. Our dataset is the first large-scale data set of its kind, with 700 rephrased sentences and 1,000 sentences from the game Genshin Impact. While large language models (LLM) have shown promise in complex text style transfer, they have drawbacks such as data privacy concerns, network instability, and high deployment costs. To address these issues, we explore the effectiveness of small models (less than T5-3B) with implicit style pre-training through contrastive learning. We also propose a method for automated evaluation of text generation quality based on alignment with human evaluations using ChatGPT. Finally, we compare our approach with existing methods and show that our model achieves state-of-art performances of few-shot text style transfer models.
    摘要 在这项工作中，我们提出了复杂文本风格迁移任务的概念，并基于两个具有广泛适用性的场景构建了复杂文本数据集。该数据集是此类数据集中首个大规模数据集，包含700个改写句子和1,000个来自游戏《原神》的句子。虽然大语言模型(LLM)在复杂文本风格迁移上展现出潜力，但它们存在数据隐私、网络不稳定和部署成本高等问题。为此，我们探索了小型模型（不超过T5-3B）通过对比学习进行隐式风格预训练的有效性，并提出了一种基于ChatGPT、与人工评估对齐的文本生成质量自动评估方法。最后，我们与现有方法进行了比较，结果表明我们的模型在小样本文本风格迁移上达到了最先进的水平。

Semi-Autoregressive Streaming ASR With Label Context

  • paper_url: http://arxiv.org/abs/2309.10926
  • repo_url: None
  • paper_authors: Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury
  • for: 这个论文旨在提高流式自动语音识别(ASR)模型的准确率并降低其延迟。
  • methods: 该模型是一种"半自回归"的流式ASR模型，借助语言模型(LM)子网络将先前块中输出的标签作为额外上下文；论文还提出了一种新的贪心解码算法，在不显著增加推理时间的前提下减少块边界附近的插入和删除错误。
  • results: 实验表明，我们的方法相对现有的流式非自回归(NAR)模型提升了流式ASR的准确率：在Tedlium2上相对提升19%，在Librispeech-100的clean/other测试集上分别提升16%/8%，在Switchboard(SWB)/Callhome(CH)测试集上分别提升19%/8%。此外，该方法还能更好地利用外部文本数据预训练LM子网络，进一步提升流式ASR的准确率。
    Abstract Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based on blockwise attention for low-latency applications. However, streaming NAR models significantly lag in accuracy compared to streaming AR and non-streaming NAR models. To address this, we propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context using a Language Model (LM) subnetwork. We also introduce a novel greedy decoding algorithm that addresses insertion and deletion errors near block boundaries while not significantly increasing the inference time. Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB) / Callhome(CH) test sets. It also reduced the accuracy gap with streaming AR and non-streaming NAR models while achieving 2.5x lower latency. We also demonstrate that our approach can effectively utilize external text data to pre-train the LM subnetwork to further improve streaming ASR accuracy.
    摘要 非自回归(NAR)建模在语音处理中受到广泛关注，因为这类模型的推理时间远低于自回归(AR)模型，同时还能取得良好的识别精度。由于NAR自动语音识别(ASR)模型必须等待整句话结束后才能处理，一些工作研究了基于块级注意力的流式NAR模型以满足低时延场景，但其精度明显落后于流式AR模型和非流式NAR模型。为此，我们提出了一种流式"半自回归"ASR模型，借助语言模型(LM)子网络把先前块中输出的标签作为额外上下文；我们还提出了一种新的贪心解码算法，在不显著增加推理时间的前提下处理块边界附近的插入和删除错误。实验表明，我们的方法相对现有流式NAR模型在Tedlium2上提升19%，在Librispeech-100 clean/other测试集上分别提升16%/8%，在Switchboard(SWB)/Callhome(CH)测试集上分别提升19%/8%；同时缩小了与流式AR及非流式NAR模型的精度差距，并将时延降低2.5倍。我们还证明了该方法能够有效利用外部文本数据预训练LM子网络，进一步提升流式ASR的精度。

Semi-automatic staging area for high-quality structured data extraction from scientific literature

  • paper_url: http://arxiv.org/abs/2309.10923
  • repo_url: None
  • paper_authors: Luca Foppiano, Tomoya Mato, Kensei Terashima, Pedro Ortiz Suarez, Taku Tou, Chikako Sakai, Wei-Sheng Wang, Toshiyuki Amagasa, Yoshihiko Takano, Masashi Ishii
  • for: 提高 SuperCon 中新型超导体实验数据的更新效率,同时维持或提高数据质量。
  • methods: 在抽取得到的数据库上构建由自动与人工流程组成的半自动暂存区：先用异常检测流程对采集数据进行预筛（预筛步骤的示意代码见本条目之后），再让用户通过专门设计的界面对照原始PDF文档校正错误；被校正的记录还会被收集起来，用作机器学习模型的训练数据。
  • results: 评估实验表明，该暂存区能显著提升数据整理质量：与阅读PDF并在Excel中记录信息的传统人工方式相比，使用该界面使精确率和召回率分别提升6%和50%，F1分数平均提升40%。
    Abstract In this study, we propose a staging area for ingesting new superconductors' experimental data in SuperCon that is machine-collected from scientific articles. Our objective is to enhance the efficiency of updating SuperCon while maintaining or enhancing the data quality. We present a semi-automatic staging area driven by a workflow combining automatic and manual processes on the extracted database. An anomaly detection automatic process aims to pre-screen the collected data. Users can then manually correct any errors through a user interface tailored to simplify the data verification on the original PDF documents. Additionally, when a record is corrected, its raw data is collected and utilised to improve machine learning models as training data. Evaluation experiments demonstrate that our staging area significantly improves curation quality. We compare the interface with the traditional manual approach of reading PDF documents and recording information in an Excel document. Using the interface boosts the precision and recall by 6% and 50%, respectively to an average increase of 40% in F1-score.
    摘要 本研究为SuperCon提出了一个暂存区，用于录入由机器从科学文献中自动采集的新超导体实验数据，目标是在保持乃至提升数据质量的同时提高SuperCon的更新效率。我们提出了一个由自动与人工流程共同驱动的半自动暂存区：自动的异常检测流程先对采集数据进行预筛，随后用户可通过一个专门设计、便于对照原始PDF文档核对数据的界面手动校正错误；此外，记录被校正时，其原始数据会被收集并用作训练数据以改进机器学习模型。评估实验表明，该暂存区显著提升了数据整理质量：与阅读PDF文档并在Excel中记录信息的传统人工方式相比，使用该界面使精确率和召回率分别提升6%和50%，F1分数平均提升40%。
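
下面是自动预筛步骤的一个示意：用IsolationForest标记数值字段异常的机器抽取记录，供人工优先复核。特征名与参数均为示意性假设，并非生产流程。

```python
# Sketch of the automatic pre-screening step: flag machine-extracted
# superconductor records whose numeric fields look anomalous so that curators
# review them first. The feature names and the IsolationForest settings are
# illustrative assumptions, not the production pipeline.
import pandas as pd
from sklearn.ensemble import IsolationForest

records = pd.DataFrame({
    "critical_temperature_K": [9.2, 39.0, 4.1, 912.0, 23.0],   # 912 K looks suspicious
    "applied_pressure_GPa":   [0.0, 0.0, 1.5, 0.0, 250.0],
})

detector = IsolationForest(contamination=0.2, random_state=0)
records["anomaly"] = detector.fit_predict(records) == -1   # True -> manual review first

print(records[records["anomaly"]])
```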

What Learned Representations and Influence Functions Can Tell Us About Adversarial Examples

  • paper_url: http://arxiv.org/abs/2309.10916
  • repo_url: https://github.com/sjabin/nnif
  • paper_authors: Shakila Mahjabin Tonni, Mark Dras
  • for: 这篇论文主要是为了研究在自然语言处理(NLP)中的对抗例(adversarial examples)的检测方法。
  • methods: 这篇论文使用了两种方法来检测对抗例:一种是基于最近邻居和影响函数,另一种是基于马ха拉欧斯距离。
  • results: 研究发现,使用基于最近邻居和影响函数的方法可以制定出state-of-the-art的检测器,而且这种方法还提供了对于NLP任务的对抗例subspace的新的理解和对比。
    Abstract Adversarial examples, deliberately crafted using small perturbations to fool deep neural networks, were first studied in image processing and more recently in NLP. While approaches to detecting adversarial examples in NLP have largely relied on search over input perturbations, image processing has seen a range of techniques that aim to characterise adversarial subspaces over the learned representations. In this paper, we adapt two such approaches to NLP, one based on nearest neighbors and influence functions and one on Mahalanobis distances. The former in particular produces a state-of-the-art detector when compared against several strong baselines; moreover, the novel use of influence functions provides insight into how the nature of adversarial example subspaces in NLP relate to those in image processing, and also how they differ depending on the kind of NLP task.
    摘要 对抗样本是通过微小扰动精心构造、用以欺骗深度神经网络的输入，最早在图像处理中被研究，近来也出现在NLP中。NLP中检测对抗样本的方法大多依赖对输入扰动的搜索，而图像处理领域已有一系列刻画学习表示上对抗子空间的技术。本文将其中两种方法移植到NLP：一种基于最近邻与影响函数，另一种基于马氏距离。前者与多个强基线相比可以得到最先进的检测器；此外，对影响函数的新颖运用揭示了NLP中对抗样本子空间与图像处理中的关系，以及它们如何随NLP任务类型而变化。
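
下面给出在学习表示上构建马氏距离检测器的通用示意：在干净训练特征上拟合各类别均值与共享协方差，对新样本取其到各类别均值的最小马氏距离作为分数，分数异常大则提示可能是对抗样本。这只是该技术的通用写法，并非论文在NLP上的具体配置。

```python
# Sketch of a Mahalanobis-distance detector over learned representations:
# fit class-conditional means and a shared covariance on clean training
# features, then score a new example by its minimum distance to any class
# mean; unusually large distances suggest an adversarial input.
import numpy as np


def fit_mahalanobis(features: np.ndarray, labels: np.ndarray):
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    precision = np.linalg.inv(cov)
    return means, precision


def mahalanobis_score(x: np.ndarray, means, precision) -> float:
    dists = [float((x - mu) @ precision @ (x - mu)) for mu in means.values()]
    return min(dists)   # threshold this value to flag adversarial examples


rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 8)) + np.repeat([[0.0], [3.0]], 100, axis=0)
labs = np.repeat([0, 1], 100)
means, precision = fit_mahalanobis(feats, labs)
print(mahalanobis_score(rng.normal(size=8) + 10.0, means, precision))  # large -> suspicious
```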

RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans

  • paper_url: http://arxiv.org/abs/2309.10898
  • repo_url: None
  • paper_authors: Bohdan Didenko, Andrii Sameliuk
  • for: 本文面向UNLP 2023研讨会，针对乌克兰语语法纠错(GEC)共享任务而撰写。
  • methods: 论文提出RedPenNet方法处理文本编辑任务，它结合了序列到序列与序列标注两类技术，旨在减少特定Sequence-To-Edits模型中的架构与参数冗余，同时保留其半自回归优势。
  • results: 模型在BEA-2019（测试集）上取得77.60的F0.5分数（除系统融合外可视为最先进结果），在UAGEC+Fluency（测试集）上取得67.71。
    Abstract The text editing tasks, including sentence fusion, sentence splitting and rephrasing, text simplification, and Grammatical Error Correction (GEC), share a common trait of dealing with highly similar input and output sequences. This area of research lies at the intersection of two well-established fields: (i) fully autoregressive sequence-to-sequence approaches commonly used in tasks like Neural Machine Translation (NMT) and (ii) sequence tagging techniques commonly used to address tasks such as Part-of-speech tagging, Named-entity recognition (NER), and similar. In the pursuit of a balanced architecture, researchers have come up with numerous imaginative and unconventional solutions, which we're discussing in the Related Works section. Our approach to addressing text editing tasks is called RedPenNet and is aimed at reducing architectural and parametric redundancies presented in specific Sequence-To-Edits models, preserving their semi-autoregressive advantages. Our models achieve $F_{0.5}$ scores of 77.60 on the BEA-2019 (test), which can be considered as state-of-the-art the only exception for system combination and 67.71 on the UAGEC+Fluency (test) benchmarks. This research is being conducted in the context of the UNLP 2023 workshop, where it was presented as a paper as a paper for the Shared Task in Grammatical Error Correction (GEC) for Ukrainian. This study aims to apply the RedPenNet approach to address the GEC problem in the Ukrainian language.
    摘要 句子合并、句子拆分与改写、文本简化以及语法纠错(GEC)等文本编辑任务的共同特点是输入与输出序列高度相似。这一研究方向处于两个成熟领域的交汇处：一是神经机器翻译(NMT)等任务常用的完全自回归序列到序列方法，二是词性标注、命名实体识别(NER)等任务常用的序列标注技术。为寻求二者间平衡的架构，研究者提出了许多富有想象力的非常规方案，我们在相关工作部分加以讨论。我们处理文本编辑任务的方法称为RedPenNet，旨在减少特定Sequence-To-Edits模型中的架构与参数冗余，同时保留其半自回归优势。我们的模型在BEA-2019（测试集）上取得77.60的F0.5分数（除系统融合外可视为最先进结果），在UAGEC+Fluency（测试集）上取得67.71。本研究在UNLP 2023研讨会的乌克兰语语法纠错共享任务中作为论文发表，目标是将RedPenNet方法应用于乌克兰语的GEC问题。

Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning

  • paper_url: http://arxiv.org/abs/2309.10814
  • repo_url: https://github.com/luohongyin/langcode
  • paper_authors: Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass
  • for: 解决 math/symbolic reasoning、自然语言理解和指令执行任务
  • methods: 使用自然语言嵌入程序(NLEP)框架,让语言模型生成基于数据结构中的自然语言表示知识的函数
  • results: 可以超越强基线,在多种任务上提高性能,包括数学和符号逻辑、文本分类、问答和指令执行等任务,并且可以进行后续检查中间逻辑步骤的解释。
    Abstract How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning? We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks. Our approach prompts a language model to generate full Python programs that define functions over data structures which contain natural language representations of structured knowledge. A Python interpreter then executes the generated code and prints the output. Despite using a task-general prompt, we find that this approach can improve upon strong baselines across a range of different tasks including math and symbolic reasoning, text classification, question answering, and instruction following. We further find the generated programs are often interpretable and enable post-hoc verification of the intermediate reasoning steps.
    摘要 如何对自然语言表示进行计算，以解决需要符号与数值推理的任务？我们提出自然语言嵌入程序(NLEP)，作为处理数学/符号推理、自然语言理解和指令遵循任务的统一框架。该方法提示语言模型生成完整的Python程序，这些程序在包含自然语言结构化知识表示的数据结构上定义函数，随后由Python解释器执行生成的代码并打印输出。尽管使用的是任务通用的提示，该方法仍能在数学与符号推理、文本分类、问答和指令遵循等多种任务上超越强基线。我们还发现生成的程序通常具有可解释性，便于事后核验中间推理步骤。
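
下面是"生成-执行"流程的示意：向语言模型索取一段完整的Python程序，用Python解释器在子进程中运行并读取其打印结果。`generate_program` 为LLM调用的占位；运行模型生成的代码应当放在沙箱中。

```python
# Sketch of the natural-language-embedded-program loop: ask a language model
# for a complete Python program that answers the question, run it with a
# Python interpreter, and read what it prints. `generate_program` is a
# stand-in for whatever LLM API is used; running untrusted generated code
# should of course happen in a sandbox.
import subprocess
import sys


def run_nlep(question: str, generate_program) -> str:
    prompt = (
        "Write a complete Python program that answers the question below "
        "by printing only the final answer.\n\nQuestion: " + question
    )
    program = generate_program(prompt)          # returns Python source code
    result = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True, text=True, timeout=30)
    return result.stdout.strip()


# Toy stand-in "LLM" so the sketch runs end to end.
fake_llm = lambda prompt: "print(sum(i * i for i in range(1, 11)))"
print(run_nlep("What is the sum of the squares of 1..10?", fake_llm))  # 385
```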

Modeling interdisciplinary interactions among Physics, Mathematics & Computer Science

  • paper_url: http://arxiv.org/abs/2309.10811
  • repo_url: None
  • paper_authors: Rima Hazra, Mayank Singh, Pawan Goyal, Bibhas Adhikari, Animesh Mukherjee
  • for: 本研究旨在研究物理(PHY)、数学(MA)和计算机科学(CS)三个领域之间的引用流动,以及这三个领域之间的引用关系。
  • methods: 本研究使用了一个数据集,包含了这三个领域的 более than 1.2 million 篇论文,并使用了时间桶特征来量化这三个领域之间的引用互动。
  • results: 研究发现,在这三个领域之间的引用关系存在一些特定的模式,例如,物理领域常常引用数学领域的论文,而计算机科学领域则常常引用物理领域和数学领域的论文。此外,研究还提出了一些基于 relay-linking 框架的数学模型,用于解释这三个领域之间的引用动态。
    Abstract Interdisciplinarity has over the recent years have gained tremendous importance and has become one of the key ways of doing cutting edge research. In this paper we attempt to model the citation flow across three different fields -- Physics (PHY), Mathematics (MA) and Computer Science (CS). For instance, is there a specific pattern in which these fields cite one another? We carry out experiments on a dataset comprising more than 1.2 million articles taken from these three fields. We quantify the citation interactions among these three fields through temporal bucket signatures. We present numerical models based on variants of the recently proposed relay-linking framework to explain the citation dynamics across the three disciplines. These models make a modest attempt to unfold the underlying principles of how citation links could have been formed across the three fields over time.
    摘要 近年来,多学科研究(interdisciplinarity)在研究领域中得到了广泛的重视和发展,成为当今研究的一种重要方法。在这篇论文中,我们尝试通过模拟三个不同领域(物理(PHY)、数学(MA)和计算机科学(CS))之间的引用流动,以探索这三个领域之间的引用模式是否存在特定的征式。我们在一个包含超过120万篇论文的数据集上进行了实验,并通过时间桶签名来量化这三个领域之间的引用互动。我们基于近期提出的协助链框架(relay-linking framework)的变体来提出数学模型,用于解释这三个领域之间的引用动力学。这些模型尝试描述在不同时间点上如何形成这三个领域之间的引用链。

Semantic Text Compression for Classification

  • paper_url: http://arxiv.org/abs/2309.10809
  • repo_url: None
  • paper_authors: Emrecan Kutay, Aylin Yener
  • for: 本文旨在压缩文本所承载的语义，以便在分类等应用中提高效率。
  • methods: 本文提出了面向文本的语义量化与压缩方法，利用句子嵌入和语义失真度量来保留文本含义（基于k-means码本的量化示意代码见本条目之后）。
  • results: 结果表明，与不考虑语义的基线相比，所提语义方法能将表示消息所需的比特数降低若干个数量级，而准确率损失很小；语义聚类还可进一步放大语义量化带来的资源节省。该方法在多个语境各异的基准文本分类数据集上均取得了优异结果。
    Abstract We study semantic compression for text where meanings contained in the text are conveyed to a source decoder, e.g., for classification. The main motivator to move to such an approach of recovering the meaning without requiring exact reconstruction is the potential resource savings, both in storage and in conveying the information to another node. Towards this end, we propose semantic quantization and compression approaches for text where we utilize sentence embeddings and the semantic distortion metric to preserve the meaning. Our results demonstrate that the proposed semantic approaches result in substantial (orders of magnitude) savings in the required number of bits for message representation at the expense of very modest accuracy loss compared to the semantic agnostic baseline. We compare the results of proposed approaches and observe that resource savings enabled by semantic quantization can be further amplified by semantic clustering. Importantly, we observe the generalizability of the proposed methodology which produces excellent results on many benchmark text classification datasets with a diverse array of contexts.
    摘要 我们研究面向文本的语义压缩：将文本所承载的含义传递给源端解码器（例如用于分类）。采用这种只恢复含义、不要求精确重建的思路，其主要动机在于可以同时节省存储和向其他节点传输信息的开销。为此，我们提出了面向文本的语义量化与压缩方法，利用句子嵌入和语义失真度量来保留含义。结果表明，与不考虑语义的基线相比，所提语义方法能将表示消息所需的比特数降低若干个数量级，而准确率损失很小。我们进一步比较发现，语义聚类可以放大语义量化带来的资源节省。重要的是，该方法具有良好的泛化性，在多个语境各异的基准文本分类数据集上均取得了优异结果。
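
下面示意面向分类的语义量化：先得到句子嵌入，再用k-means学得的小码本对其量化（每条文本只需传输log2(k)比特的码字索引），分类器在量化后的表示上训练。嵌入模型与k的取值均为示意。

```python
# Sketch of semantic quantization for classification: embed texts, quantize
# the embeddings to a small learned codebook with k-means (so each text costs
# only log2(k) bits to transmit), and train the classifier on the codebook
# vectors instead of the raw text. Embedding model and k are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

texts = ["great movie, loved it", "terrible plot and acting",
         "an absolute delight", "boring and far too long"]
labels = np.array([1, 0, 1, 0])

embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode(texts)

k = 2                                        # codebook size -> 1 bit per text
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
codes = codebook.predict(emb)                # what would actually be transmitted
quantized = codebook.cluster_centers_[codes] # receiver reconstructs from the code

clf = LogisticRegression().fit(quantized, labels)
new_emb = embedder.encode(["what a wonderful film"])
new_quantized = codebook.cluster_centers_[codebook.predict(new_emb)]
print(clf.predict(new_quantized))
```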

Interactive Distillation of Large Single-Topic Corpora of Scientific Papers

  • paper_url: http://arxiv.org/abs/2309.10772
  • repo_url: None
  • paper_authors: Nicholas Solovyev, Ryan Barron, Manish Bhattarai, Maksim E. Eren, Kim O. Rasmussen, Boian S. Alexandrov
  • for: 这篇论文的目的是建立一个可扩展的、可靠的科学文献集合,并将其用于研究和教育。
  • methods: 这篇论文使用机器学习技术,将小量的“核心”文献集成为一个大量的科学文献集合。文献集合中的每篇文章都通过机器学习算法进行评估,以确定它们是否与“核心”文献集成关联。
  • results: 论文通过机器学习技术构建了可扩展、可靠的科学文献集合，并可通过人在回路的筛选保证入选文献的相关性；此外还利用SeNMFk进行子主题建模，以获取关于文献的更多信息。
    Abstract Highly specific datasets of scientific literature are important for both research and education. However, it is difficult to build such datasets at scale. A common approach is to build these datasets reductively by applying topic modeling on an established corpus and selecting specific topics. A more robust but time-consuming approach is to build the dataset constructively in which a subject matter expert (SME) handpicks documents. This method does not scale and is prone to error as the dataset grows. Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature. Given a small initial "core" corpus of papers, we build a citation network of documents. At each step of the citation network, we generate text embeddings and visualize the embeddings through dimensionality reduction. Papers are kept in the dataset if they are "similar" to the core or are otherwise pruned through human-in-the-loop selection. Additional insight into the papers is gained through sub-topic modeling using SeNMFk. We demonstrate our new tool for literature review by applying it to two different fields in machine learning.
    摘要 高度专一的科学文献数据集对科研和教育都很重要，但大规模构建此类数据集十分困难。一种常见做法是"还原式"构建：对已有语料库进行主题建模并挑选特定主题；另一种更稳健但耗时的做法是"构造式"构建，由领域专家(SME)逐篇挑选文献，这种方式难以扩展，且随数据集增大而容易出错。本文展示了一个基于机器学习、用于构造式生成定向科学文献数据集的新工具：给定一个小规模的初始"核心"论文集，我们构建文献的引用网络；在引用网络的每一步，我们生成文本嵌入并通过降维对其可视化，与核心"相似"的论文被保留，其余则通过人在回路的筛选予以剔除。我们还利用SeNMFk进行子主题建模以获得对文献的进一步洞察。我们将该工具应用于机器学习的两个不同子领域，展示了其在文献综述中的用途。

OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch

  • paper_url: http://arxiv.org/abs/2309.10706
  • repo_url: https://github.com/opennlg/openba
  • paper_authors: Juntao Li, Zecheng Tang, Yuyang Ding, Pinzheng Wang, Pei Guo, Wangjie You, Dan Qiao, Wenliang Chen, Guohong Fu, Qiaoming Zhu, Guodong Zhou, Min Zhang
  • for: 本研究旨在开发一个基于中文的开源语言模型,以提高中文自然语言处理 task 的性能。
  • methods: 本文使用了一些有效和高效的技术,包括三个阶段的训练策略和数据处理等,以减少模型的大小。
  • results: 根据测试结果,我们的模型在 BELEBELE 测试套件、MMLU 测试套件和 C-Eval (hard) 测试套件上的性能比 LLaMA-70B 和 BLOOM-176B 更好,只需要使用 380B 个字。
    Abstract Large language models (LLMs) with billions of parameters have demonstrated outstanding performance on various natural language processing tasks. This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model, to contribute an LLM variant to the Chinese-oriented open-source model community. We enhance OpenBA with effective and efficient techniques as well as adopt a three-stage training strategy to train the model from scratch. Our solution can also achieve very competitive performance with only 380B tokens, which is better than LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU benchmark, GLM-130B on the C-Eval (hard) benchmark. This report provides the main details to pre-train an analogous model, including pre-training data processing, Bilingual Flan data collection, the empirical observations that inspire our model architecture design, training objectives of different stages, and other enhancement techniques. We have refactored our code to follow the design principles of the Huggingface Transformers Library, making it more convenient for developers to use, and released checkpoints of different training stages at https://huggingface.co/openBA. More details of our project are available at https://github.com/OpenNLG/openBA.git.
    摘要 拥有数十亿参数的大型语言模型(LLM)在各类自然语言处理任务上表现出色。本报告介绍 OpenBA,一个开源的 15B 双语非对称 seq2seq 模型,为面向中文的开源模型社区贡献一个 LLM 变体。我们为 OpenBA 采用了有效且高效的技术,并使用三阶段训练策略从零开始训练模型。我们的方案仅使用 380B 个 token 便取得了极具竞争力的性能:在 BELEBELE 基准上优于 LLaMA-70B,在 MMLU 基准上优于 BLOOM-176B,在 C-Eval(hard)基准上优于 GLM-130B。本报告给出了预训练类似模型所需的主要细节,包括预训练数据处理、双语 Flan 数据收集、启发我们模型架构设计的实证观察、不同阶段的训练目标以及其他增强技术。我们已按照 Huggingface Transformers Library 的设计原则重构了代码,方便开发者使用,并在 https://huggingface.co/openBA 发布了不同训练阶段的检查点。项目的更多细节见 https://github.com/OpenNLG/openBA.git。

Improving Medical Dialogue Generation with Abstract Meaning Representations

  • paper_url: http://arxiv.org/abs/2309.10608
  • repo_url: None
  • paper_authors: Bohao Yang, Chen Tang, Chenghua Lin
  • for: 这篇论文旨在探讨对距离医疗的对话生成,以实现医疗专业知识的传递。
  • methods: 本研究使用抽象意义表示(AMR)创建图形表示,以强调对话中的语言构成部分和医疗知识中的关系和实体。
  • results: 实验结果显示,使用 AMR 图表示能够增强模型对医疗知识与逻辑关系的表示,所提框架优于强基线模型。此外,本研究还提供了相应的源代码,以便未来的研究。
    Abstract Medical Dialogue Generation serves a critical role in telemedicine by facilitating the dissemination of medical expertise to patients. Existing studies focus on incorporating textual representations, which have limited their ability to represent the semantics of text, such as ignoring important medical entities. To enhance the model's understanding of the textual semantics and the medical knowledge including entities and relations, we introduce the use of Abstract Meaning Representations (AMR) to construct graphical representations that delineate the roles of language constituents and medical entities within the dialogues. In this paper, We propose a novel framework that models dialogues between patients and healthcare professionals using AMR graphs, where the neural networks incorporate textual and graphical knowledge with a dual attention mechanism. Experimental results show that our framework outperforms strong baseline models in medical dialogue generation, demonstrating the effectiveness of AMR graphs in enhancing the representations of medical knowledge and logical relationships. Furthermore, to support future research in this domain, we provide the corresponding source code at https://github.com/Bernard-Yang/MedDiaAMR.
    摘要 医疗对话生成在远程医疗中扮演着重要角色,有助于将医疗专业知识传递给患者。现有研究主要集中在文本表示上,这限制了模型表示文本语义的能力,例如会忽略重要的医疗实体。为了增强模型对文本语义以及包括实体和关系在内的医疗知识的理解,我们引入抽象意义表示(AMR)来构建图表示,以刻画对话中语言成分与医疗实体所扮演的角色。在这篇论文中,我们提出了一种新的框架,使用 AMR 图对患者与医疗专业人员之间的对话进行建模,其中神经网络通过双重注意机制融合文本知识与图知识。实验结果表明,我们的框架在医疗对话生成上优于强基线模型,说明 AMR 图能够增强医疗知识与逻辑关系的表示。此外,为支持该领域的后续研究,我们在 https://github.com/Bernard-Yang/MedDiaAMR 提供了相应的源代码。

FRACAS: A FRench Annotated Corpus of Attribution relations in newS

  • paper_url: http://arxiv.org/abs/2309.10604
  • repo_url: None
  • paper_authors: Ange Richard, Laura Alonzo-Canul, François Portet
  • for: 构建一个用于法语新闻文本中引用抽取与来源归属的手动标注语料库。
  • methods: 对 1676 篇法语新闻文本进行人工标注,用于引用抽取与来源归属。
  • results: 得到了一个手动标注语料库,各引用类型(直接、间接和混合)之间保持平衡,且 8 名标注者之间的标注一致性较高。
    Abstract Quotation extraction is a widely useful task both from a sociological and from a Natural Language Processing perspective. However, very little data is available to study this task in languages other than English. In this paper, we present a manually annotated corpus of 1676 newswire texts in French for quotation extraction and source attribution. We first describe the composition of our corpus and the choices that were made in selecting the data. We then detail the annotation guidelines and annotation process, as well as a few statistics about the final corpus and the obtained balance between quote types (direct, indirect and mixed, which are particularly challenging). We end by detailing our inter-annotator agreement between the 8 annotators who worked on manual labelling, which is substantially high for such a difficult linguistic phenomenon.
    摘要 引用抽取无论从社会学角度还是自然语言处理角度来看,都是一项非常有用的任务。然而,除英语之外,可用于研究这一任务的其他语言数据非常少。在这篇论文中,我们提供了一个由 1676 篇法语新闻文本构成的手动标注语料库,用于引用抽取和来源归属。我们首先介绍了语料库的组成以及数据选取上的考量,随后详细说明了标注指南和标注流程,并给出了最终语料库的一些统计信息及各引用类型(直接、间接和混合,后者尤具挑战性)之间的平衡情况。最后,我们报告了参与人工标注的 8 名标注者之间的标注一致性,对于如此困难的语言现象而言,这一一致性相当高。
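For a corpus like this, inter-annotator agreement is usually reported with a chance-corrected statistic. The toy sketch below computes Cohen's kappa for two annotators on quote-type labels; with 8 annotators one would instead use a multi-rater measure such as Fleiss' kappa or Krippendorff's alpha (the abstract does not state which statistic the authors used).

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels: two annotators tagging the same five quote spans.
annotator_a = ["direct", "indirect", "mixed", "direct", "indirect"]
annotator_b = ["direct", "indirect", "direct", "direct", "indirect"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```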

Unsupervised Deep Cross-Language Entity Alignment

  • paper_url: http://arxiv.org/abs/2309.10598
  • repo_url: https://github.com/chuanyus/udcea
  • paper_authors: Chuanyu Jiang, Yiming Qian, Lijun Chen, Yang Gu, Xia Xie
  • for: 这篇论文旨在提出一种简单且新的无监督方法,用于跨语言实体对齐。
  • methods: 我们使用深度学习多语言encoder和机器翻译器来编码知识图文本,从而减少了标注数据的依赖性。我们的方法同时考虑了全局和局部对齐策略。
  • results: 我们的方法可以在DBP15K数据集上得到0.966、0.990和0.996的Hits@1率,在无监督和半监督类别中超过了现状的方法。与监督方法相比,我们的方法在 Ja-En和Fr-En对齐任务中表现出了2.6%和0.4%的提升,只是微弱地下降0.2%在Zh-En对齐任务中。
    Abstract Cross-lingual entity alignment is the task of finding the same semantic entities from different language knowledge graphs. In this paper, we propose a simple and novel unsupervised method for cross-language entity alignment. We utilize the deep learning multi-language encoder combined with a machine translator to encode knowledge graph text, which reduces the reliance on label data. Unlike traditional methods that only emphasize global or local alignment, our method simultaneously considers both alignment strategies. We first view the alignment task as a bipartite matching problem and then adopt the re-exchanging idea to accomplish alignment. Compared with the traditional bipartite matching algorithm that only gives one optimal solution, our algorithm generates ranked matching results which enabled many potentials downstream tasks. Additionally, our method can adapt two different types of optimization (minimal and maximal) in the bipartite matching process, which provides more flexibility. Our evaluation shows, we each scored 0.966, 0.990, and 0.996 Hits@1 rates on the DBP15K dataset in Chinese, Japanese, and French to English alignment tasks. We outperformed the state-of-the-art method in unsupervised and semi-supervised categories. Compared with the state-of-the-art supervised method, our method outperforms 2.6% and 0.4% in Ja-En and Fr-En alignment tasks while marginally lower by 0.2% in the Zh-En alignment task.
    摘要 跨语言实体对齐是指在不同语言的知识图中找到语义相同实体的任务。在这篇论文中,我们提出了一种简单而新颖的无监督跨语言实体对齐方法。我们利用深度学习多语言编码器和机器翻译器来编码知识图文本,从而减少对标注数据的依赖。与只强调全局或局部对齐的传统方法不同,我们的方法同时考虑这两种对齐策略。我们首先将对齐任务视为一个二分图匹配问题,然后采用重新交换的思想来完成对齐。与只给出单一最优解的传统二分图匹配算法不同,我们的算法生成带排序的匹配结果,从而支持许多潜在的下游任务。此外,我们的方法可以在二分图匹配过程中采用两种不同的优化方向(最小化与最大化),提供了更大的灵活性。评估结果显示,我们在 DBP15K 数据集上的中文、日语和法语到英语对齐任务中分别取得了 0.966、0.990 和 0.996 的 Hits@1。我们在无监督和半监督类别中超越了当前最先进的方法;与最先进的有监督方法相比,我们的方法在 Ja-En 和 Fr-En 对齐任务上分别高出 2.6% 和 0.4%,在 Zh-En 对齐任务上仅略低 0.2%。
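A minimal sketch of the "alignment as bipartite matching" view: entity names from the two knowledge graphs are encoded with a multilingual encoder, and a one-to-one assignment maximizing total similarity is solved. The encoder name is an assumption, and this shows only the one-best assignment, not the paper's re-exchanging step that produces ranked matches or its global/local combination.

```python
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder
zh_entities = ["巴黎", "柏林", "东京"]
en_entities = ["Tokyo", "Paris", "Berlin"]

# Cosine similarities between normalized embeddings of the two entity sets.
S = encoder.encode(zh_entities, normalize_embeddings=True) @ \
    encoder.encode(en_entities, normalize_embeddings=True).T

rows, cols = linear_sum_assignment(-S)               # maximise total similarity
for i, j in zip(rows, cols):
    print(zh_entities[i], "->", en_entities[j], f"(sim={S[i, j]:.2f})")
```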

Multimodal Modeling For Spoken Language Identification

  • paper_url: http://arxiv.org/abs/2309.10567
  • repo_url: None
  • paper_authors: Shikhar Bharadwaj, Min Ma, Shikhar Vashishth, Ankur Bapna, Sriram Ganapathy, Vera Axelrod, Siddharth Dalmia, Wei Han, Yu Zhang, Daan van Esch, Sandy Ritchie, Partha Talukdar, Jason Riesa
  • for: 本研究旨在提高多媒体录音语言识别精度,通过利用不同类型的metadata来增强语言识别。
  • methods: 本研究提出了一种多modal Spoken Language Identification方法(MuSeLI),利用视频标题、描述和地理位置等metadata来提高语言识别精度。
  • results: 实验结果表明,元数据能为语言识别任务提供大量有用信息,使系统在多媒体录音的口语语言识别任务上取得最先进的结果;消融研究进一步表明,每种模态都对语言识别做出了独特的贡献。
    Abstract Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
    摘要 口语语言识别指的是自动预测一段话语所使用语言的任务。传统上,它被建模为基于语音的语言识别任务,先前的方法局限于单一模态;而对于视频数据,还存在大量其他元数据可能对这一任务有帮助。在这项工作中,我们提出 MuSeLI,一种多模态口语语言识别方法,深入探索利用多种元数据来源来增强语言识别。我们的研究表明,视频标题、描述和地理位置等元数据能够为识别多媒体录音中的口语语言提供大量信息。我们在两个多样化的 YouTube 公开视频数据集上进行实验,并在语言识别任务上取得了最先进的结果。此外,我们还进行了消融研究,说明了每种模态对语言识别的独特贡献。

NSOAMT – New Search Only Approach to Machine Translation

  • paper_url: http://arxiv.org/abs/2309.10526
  • repo_url: None
  • paper_authors: João Luís, Diogo Cardoso, José Marques, Luís Campos
  • for: 这个论文的目的是提出一种基于新搜索方法的机器翻译技术,以解决传统技术的慢速和不准确问题。
  • methods: 这个研究采用了一种新的索引技术,通过将具有相似语义的词语组合在一起,实现对原文语言记录和翻译语言之间的对应关系。
  • results: 尽管观测到的和预测的度量值并未给出令人鼓舞的结果,研究仍基于这种索引方法开发并提供了一个可用的翻译工具。
    Abstract Translation automation mechanisms and tools have been developed for several years to bring people who speak different languages together. A "new search only approach to machine translation" was adopted to tackle some of the slowness and inaccuracy of the other technologies. The idea is to develop a solution that, by indexing an incremental set of words that combine a certain semantic meaning, makes it possible to create a process of correspondence between their native language record and the language of translation. This research principle assumes that the vocabulary used in a given type of publication/document is relatively limited in terms of language style and word diversity, which enhances the greater effect of instantaneously and rigor in the translation process through the indexing process. A volume of electronic text documents where processed and loaded into a database, and analyzed and measured in order confirm the previous premise. Although the observed and projected metric values did not give encouraging results, it was possible to develop and make available a translation tool using this approach.
    摘要 多年来,人们一直在开发翻译自动化机制和工具,以便让讲不同语言的人们走到一起。为了解决其他技术速度慢、不够准确的问题,我们采用了一种"仅基于搜索的机器翻译新方法"。其思路是:通过对承载特定语义的递增词组集合建立索引,在原文语言记录与目标翻译语言之间建立对应关系。该研究原则假设某一类出版物/文档所使用的词汇在语言风格和词语多样性上相对有限,这使得借助索引过程能够更即时、更严谨地完成翻译。我们对一批电子文本文档进行了处理并加载到数据库中,随后进行分析和度量,以验证上述前提。尽管观测到的和预测的度量值并未给出令人鼓舞的结果,我们仍然基于这一方法开发并提供了一个翻译工具。

Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.10524
  • repo_url: None
  • paper_authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi
  • for: 本研究旨在探讨一种使用大型自然语言模型(LLM)和端到端自动语音识别(ASR)的新的集成方法,以提高ASR性能。
  • methods: 我们使用名为 Llama2 的指令微调 LLM,并将其作为解码器的前端。模型建立在 CTC 与注意力混合架构之上:由 CTC 解码得到的 ASR 假设与指令一起输入 LLM,使 LLM 能够依据指令进行文本生成。
  • results: 我们的实验结果和分析表明,这种集成方法可以提供出色的性能提升,并且我们的方法受益于LLM-based rescoring。
    Abstract We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can perform a wide range of linguistic tasks within zero-shot learning when provided with a precise instruction or a prompt to guide the text generation process towards the desired task. We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance. Specifically, we direct an LLM to correct grammatical errors in an ASR hypothesis and harness the embedded linguistic knowledge to conduct end-to-end ASR. The proposed model is built on the hybrid connectionist temporal classification (CTC) and attention architecture, where an instruction-tuned LLM (i.e., Llama2) is employed as a front-end of the decoder. An ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding, which is then fed into the LLM along with an instruction. The decoder subsequently takes as input the LLM embeddings to perform sequence generation, incorporating acoustic information from the encoder output. Experimental results and analyses demonstrate that the proposed integration yields promising performance improvements, and our approach largely benefits from LLM-based rescoring.
    摘要 我们提出了一种将指令微调的大语言模型(LLM)与端到端自动语音识别(ASR)相结合的新方法。现代 LLM 在给定精确指令或提示时,能够以零样本方式完成多种语言任务。我们探索利用 LLM 的这一零样本能力来提取有助于提升 ASR 性能的语言学信息:具体而言,我们引导 LLM 纠正 ASR 假设中的语法错误,并利用其内嵌的语言学知识来完成端到端 ASR。所提模型建立在混合 CTC 与注意力架构之上,其中指令微调的 LLM(即 Llama2)被用作解码器的前端。待纠正的 ASR 假设先由编码器经 CTC 解码得到,随后与指令一起输入 LLM;解码器再以 LLM 的嵌入为输入进行序列生成,同时结合来自编码器输出的声学信息。实验结果与分析表明,所提的集成方法带来了可观的性能提升,而且我们的方法在很大程度上受益于基于 LLM 的重打分。

Enhancing Open-Domain Table Question Answering via Syntax- and Structure-aware Dense Retrieval

  • paper_url: http://arxiv.org/abs/2309.10506
  • repo_url: https://github.com/nzjin/ODTQA
  • paper_authors: Nengzheng Jin, Dongfang Li, Junying Chen, Joanna Siebert, Qingcai Chen
  • for: 回答开放领域表格问题,通过从大量表格中检索和提取信息。
  • methods: 使用语法和结构感知的检索方法,为问题提供语法表示,并使用表格的结构化表头和值表示,以避免信息损失。
  • results: 在 NQ-tables 数据集上达到最先进性能,并在新构建的开放领域 Text-to-SQL 数据集上大幅超越强基线。
    Abstract Open-domain table question answering aims to provide answers to a question by retrieving and extracting information from a large collection of tables. Existing studies of open-domain table QA either directly adopt text retrieval methods or consider the table structure only in the encoding layer for table retrieval, which may cause syntactical and structural information loss during table scoring. To address this issue, we propose a syntax- and structure-aware retrieval method for the open-domain table QA task. It provides syntactical representations for the question and uses the structural header and value representations for the tables to avoid the loss of fine-grained syntactical and structural information. Then, a syntactical-to-structural aggregator is used to obtain the matching score between the question and a candidate table by mimicking the human retrieval process. Experimental results show that our method achieves the state-of-the-art on the NQ-tables dataset and overwhelms strong baselines on a newly curated open-domain Text-to-SQL dataset.
    摘要 开放领域表格问答旨在提供问题的答案,通过检索和提取大量表格中的信息。现有研究的开放领域表格QA方法可能直接采用文本检索方法,或者只考虑表格结构在编码层面进行表格检索,这可能会导致问题的语法和结构信息丢失,从而影响表格的得分。为解决这个问题,我们提出一种语法和结构意识检索方法,用于开放领域表格QA任务。它提供了问题的语法表示,并使用表格的结构标头和值表示来避免语法和结构信息的丢失。然后,一个语法-结构汇总器用于获取问题和候选表格之间的匹配分数,通过模拟人类检索过程来实现。实验结果表明,我们的方法在NQ-tables数据集上实现了领先地位,并在一个新收录的开放领域文本-SQL数据集上压倒了强大的基线。
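A toy sketch of structure-aware table scoring: instead of flattening a table into one string, headers and cell values are embedded as separate fields, and a candidate table is scored by its best header/value match with the question. The encoder and the equal weighting are assumptions; the paper additionally uses syntactic representations of the question and a learned syntactical-to-structural aggregator.

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder

def table_score(question: str, headers: list[str], values: list[str]) -> float:
    """Score = best header match + best value match (equal weights, assumed)."""
    q = encoder.encode([question], normalize_embeddings=True)[0]
    h = encoder.encode(headers, normalize_embeddings=True)
    v = encoder.encode(values, normalize_embeddings=True)
    return 0.5 * float((h @ q).max()) + 0.5 * float((v @ q).max())

score = table_score(
    "Which country hosted the 2016 Olympics?",
    headers=["Year", "Host country", "City"],
    values=["2016", "Brazil", "Rio de Janeiro"],
)
print(f"{score:.3f}")
```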

Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

  • paper_url: http://arxiv.org/abs/2309.10456
  • repo_url: None
  • paper_authors: Luyao Cheng, Siqi Zheng, Qinglin Zhang, Hui Wang, Yafeng Chen, Qian Chen, Shiliang Zhang
  • for: 本研究旨在充分利用语言模型来提高基于集群的说话人识别系统的性能。
  • methods: 我们提出了一种新的方法,利用语言模型提取说话人相关的semantic信息,并将这些信息转化为对应的对比约束。
  • results: 我们在公共数据集上进行了广泛的实验,结果表明,与仅基于声学信息的说话人辨识系统相比,我们提出的方法表现出一致的优越性。
    Abstract Speaker diarization has gained considerable attention within speech processing research community. Mainstream speaker diarization rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Considering the fact that speech signals can efficiently convey the content of a speech, it is of our interest to fully exploit these semantic cues utilizing language models. In this work we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. Firstly, we introduce spoken language understanding modules to extract speaker-related semantic information and utilize these information to construct pairwise constraints. Secondly, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on the public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems.
    摘要 说话人辨识(speaker diarization)在语音处理研究社区中受到了广泛关注。主流的说话人辨识方法主要依赖从声学信号中提取的说话人声音特征,往往忽略了语义信息的潜力。考虑到语音信号能够有效传达说话内容,我们希望借助语言模型充分利用这些语义线索。在这项工作中,我们提出了一种在基于聚类的说话人辨识系统中有效利用语义信息的新方法。首先,我们引入口语理解模块来提取与说话人相关的语义信息,并利用这些信息构建成对约束;其次,我们提出了一个新框架,将这些约束集成到说话人辨识流程中,从而提升整个系统的性能。在公共数据集上进行的大量实验表明,与仅基于声学信息的说话人辨识系统相比,我们提出的方法具有一致的优越性。
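A small sketch of how pairwise constraints derived from semantic cues could be injected into clustering-based diarization: must-link and cannot-link pairs re-weight the acoustic affinity matrix before spectral clustering. The toy embeddings, hard 0/1 weights, and constraints are illustrative only and much simpler than the paper's joint constraint-propagation scheme.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 32))                      # 6 segment embeddings (toy)
A = emb @ emb.T
A = (A - A.min()) / (A.max() - A.min())             # acoustic affinity in [0, 1]

must_link = [(0, 1)]                                # semantics says: same speaker
cannot_link = [(0, 5)]                              # semantics says: different speakers
for i, j in must_link:
    A[i, j] = A[j, i] = 1.0
for i, j in cannot_link:
    A[i, j] = A[j, i] = 0.0

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)                                       # speaker label per segment
```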

Reformulating Sequential Recommendation: Learning Dynamic User Interest with Content-enriched Language Modeling

  • paper_url: http://arxiv.org/abs/2309.10435
  • repo_url: None
  • paper_authors: Junzhe Jiang, Shang Qu, Mingyue Cheng, Qi Liu
  • for: 这个论文的目的是提出一种基于语言模型的新式分布式推荐方法,以提高推荐系统的个性化性和准确性。
  • methods: 该方法使用预训练的语言模型来捕捉用户的兴趣和需求,并通过语义理解来生成个性化的推荐结果。
  • results: 实验验证表明,该方法在多个基准数据集上取得了良好的推荐效果,并为序列推荐任务提供了有价值的洞见。
    Abstract Recommender systems are essential for online applications, and sequential recommendation has enjoyed significant prevalence due to its expressive ability to capture dynamic user interests. However, previous sequential modeling methods still have limitations in capturing contextual information. The primary reason for this issue is that language models often lack an understanding of domain-specific knowledge and item-related textual content. To address this issue, we adopt a new sequential recommendation paradigm and propose LANCER, which leverages the semantic understanding capabilities of pre-trained language models to generate personalized recommendations. Our approach bridges the gap between language models and recommender systems, resulting in more human-like recommendations. We demonstrate the effectiveness of our approach through experiments on several benchmark datasets, showing promising results and providing valuable insights into the influence of our model on sequential recommendation tasks. Furthermore, our experimental codes are publicly available.
    摘要 推荐系统对在线应用至关重要,而序列推荐因其能够刻画动态用户兴趣的表达能力而得到广泛应用。然而,以往的序列建模方法在捕捉上下文信息方面仍有局限,其主要原因在于语言模型往往缺乏对领域知识以及物品相关文本内容的理解。为了解决这一问题,我们采用一种新的序列推荐范式并提出 LANCER,它利用预训练语言模型的语义理解能力来生成个性化推荐。我们的方法弥合了语言模型与推荐系统之间的鸿沟,从而产生更接近人类偏好的推荐。我们在多个基准数据集上的实验表明了该方法的有效性,取得了有前景的结果,并为序列推荐任务提供了有价值的洞见。此外,我们的实验代码已公开。

Writer-Defined AI Personas for On-Demand Feedback Generation

  • paper_url: http://arxiv.org/abs/2309.10433
  • repo_url: None
  • paper_authors: Karim Benharrak, Tim Zindulka, Florian Lehmann, Hendrik Heuer, Daniel Buschek
  • for: 这篇论文旨在支持写作者,帮助他们更好地理解并贴近目标受众。
  • methods: 使用由写作者自行定义的目标受众 AI 人设,按需生成反馈。
  • results: 在两项用户研究中,写作者对这一概念表示欢迎,并有策略地使用不同人设来获取多种视角的反馈;不过反馈往往冗长且不够具体。
    Abstract Compelling writing is tailored to its audience. This is challenging, as writers may struggle to empathize with readers, get feedback in time, or gain access to the target group. We propose a concept that generates on-demand feedback, based on writer-defined AI personas of any target audience. We explore this concept with a prototype (using GPT-3.5) in two user studies (N=5 and N=11): Writers appreciated the concept and strategically used personas for getting different perspectives. The feedback was seen as helpful and inspired revisions of text and personas, although it was often verbose and unspecific. We discuss the impact of on-demand feedback, the limited representativity of contemporary AI systems, and further ideas for defining AI personas. This work contributes to the vision of supporting writers with AI by expanding the socio-technical perspective in AI tool design: To empower creators, we also need to keep in mind their relationship to an audience.
    摘要 优秀的写作是为其读者群体量身定制的。这并不容易,因为作者可能难以与读者产生共鸣、难以及时获得反馈,或无法接触目标群体。我们提出了一个概念:基于作者自行定义的目标受众 AI 人设,按需生成反馈。我们用一个原型(基于 GPT-3.5)在两项用户研究(N=5 和 N=11)中探索了这一概念:作者对该概念表示欢迎,并有策略地使用人设来获取不同视角的反馈。反馈被认为是有帮助的,并启发了对文本和人设的修改,尽管反馈往往冗长且不够具体。我们讨论了按需反馈的影响、当代 AI 系统代表性的局限,以及定义 AI 人设的进一步思路。这项工作通过拓展 AI 工具设计中的社会技术视角,为用 AI 支持写作者的愿景做出贡献:要赋能创作者,我们也需要牢记他们与受众之间的关系。

PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems

  • paper_url: http://arxiv.org/abs/2309.10413
  • repo_url: None
  • paper_authors: Bryan Wilie, Yan Xu, Willy Chung, Samuel Cahyawijaya, Holy Lovenia, Pascale Fung
  • for: 提高知识增强(knowledge-grounded)对话系统的响应质量,使其更加忠实且与对话历史相关。
  • methods: 分析多种语言模型在单次解码过程中的生成结果,发现存在多个备选响应,其中一些比解码过程优先选择的最优响应更加忠实,且与先前对话轮次的相关性相当或更高。
  • results: 提出了一种生成重排序框架 PICK,使模型无需额外标注数据或模型调优即可生成更忠实且相关的响应;自动与人工评估表明,PICK 能稳定提升系统表现,并在所有解码策略下保持一致的改进。详细实现见 https://github.com/bryanwilie/pick。
    Abstract Grounding dialogue response generation on external knowledge is proposed to produce informative and engaging responses. However, current knowledge-grounded dialogue (KGD) systems often fail to align the generated responses with human-preferred qualities due to several issues like hallucination and the lack of coherence. Upon analyzing multiple language model generations, we observe the presence of alternative generated responses within a single decoding process. These alternative responses are more faithful and exhibit a comparable or higher level of relevance to prior conversational turns compared to the optimal responses prioritized by the decoding processes. To address these challenges and driven by these observations, we propose Polished \& Informed Candidate Scoring (PICK), a generation re-scoring framework that empowers models to generate faithful and relevant responses without requiring additional labeled data or model tuning. Through comprehensive automatic and human evaluations, we demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history. Furthermore, PICK consistently improves the system's performance with both oracle and retrieved knowledge in all decoding strategies. We provide the detailed implementation in https://github.com/bryanwilie/pick .
    摘要 为了生成信息丰富且有吸引力的回复,人们提出了将对话回复生成建立在外部知识之上的做法。然而,由于幻觉以及缺乏连贯性等问题,现有的知识增强对话(KGD)系统生成的回复往往不符合人类偏好的质量要求。我们在分析多种语言模型的生成结果后发现,单次解码过程中存在多个备选回复,它们比解码过程优先选择的最优回复更加忠实,且与先前对话轮次的相关性相当甚至更高。针对这些挑战并受上述观察的启发,我们提出了 Polished & Informed Candidate Scoring(PICK)生成重排序框架,使模型无需额外标注数据或模型调优即可生成忠实且相关的回复。通过全面的自动和人工评估,我们证明了 PICK 能够生成更忠实且与对话历史相关的回复。此外,无论使用 oracle 知识还是检索得到的知识,PICK 在所有解码策略下都能持续提升系统性能。详细实现见 https://github.com/bryanwilie/pick 。
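A minimal sketch of re-scoring decoded candidates rather than trusting beam order: each candidate is scored for faithfulness to the grounding knowledge and relevance to the dialogue history, then re-ranked. The sentence encoder and equal weighting are assumptions; PICK's actual scoring functions differ in detail.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder

def rerank(candidates, knowledge, history):
    k = encoder.encode(knowledge, convert_to_tensor=True)
    h = encoder.encode(history, convert_to_tensor=True)
    c = encoder.encode(candidates, convert_to_tensor=True)
    faithful = util.cos_sim(c, k).squeeze(1)         # similarity to grounding knowledge
    relevant = util.cos_sim(c, h).squeeze(1)         # similarity to the last turn
    scores = 0.5 * faithful + 0.5 * relevant
    order = scores.argsort(descending=True)
    return [candidates[int(i)] for i in order]

print(rerank(
    ["The Eiffel Tower is 330 metres tall.", "I love towers!"],
    knowledge="The Eiffel Tower is 330 m tall.",
    history="How tall is the Eiffel Tower?",
))
```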

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

  • paper_url: http://arxiv.org/abs/2309.10400
  • repo_url: https://github.com/dwzhu-pku/pose
  • paper_authors: Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
  • for: 用于提高大语言模型的适应性和扩展性
  • methods: 使用Positional Skip-wisE(PoSE)训练方法,通过在训练过程中随机填充各个 chunk 的位置指标来模拟长输入序列,从而适应不同的上下文窗口大小
  • results: 比较训练在全长输入上与 PoSE 训练在短 chunk 上,后者减少了内存和时间开销,而且性能减差不大。此外,PoSE 方法可以与 RoPE-based LLMs 和不同的位置插值策略兼容,并且可以无限扩展上下文窗口,具体取决于执行时间的内存使用情况。
    Abstract In this paper, we introduce Positional Skip-wisE (PoSE) training for efficient adaptation of large language models~(LLMs) to extremely long context windows. PoSE decouples train length from target context window size by simulating long inputs using a fixed context window with manipulated position indices during training. Concretely, we select several short chunks from a long input sequence, and introduce distinct skipping bias terms to modify the position indices of each chunk. These bias terms, along with the length of each chunk, are altered for each training example, allowing the model to adapt to all positions within the target context window without training on full length inputs. Experiments show that, compared with fine-tuning on the full length, PoSE greatly reduces memory and time overhead with minimal impact on performance. Leveraging this advantage, we have successfully extended the LLaMA model to 128k tokens. Furthermore, we empirically confirm that PoSE is compatible with all RoPE-based LLMs and various position interpolation strategies. Notably, by decoupling fine-tuning length from target context window, PoSE can theoretically extend the context window infinitely, constrained only by memory usage for inference. With ongoing advancements for efficient inference, we believe PoSE holds great promise for scaling the context window even further.
    摘要 在这篇论文中,我们介绍了 Positional Skip-wisE(PoSE)训练方法,用于让大语言模型(LLM)高效地适应极长上下文窗口。PoSE 将训练长度与目标上下文窗口大小解耦:在训练中使用固定长度的上下文窗口,并通过操纵位置索引来模拟长输入。具体来说,我们从长输入序列中选取若干短块,并为每个块引入不同的跳跃偏置项来修改其位置索引。这些偏置项以及每个块的长度在每个训练样本中都会变化,使模型无需在全长输入上训练即可适应目标上下文窗口内的所有位置。实验表明,与在全长输入上微调相比,PoSE 大幅降低了内存和时间开销,而性能影响极小。借助这一优势,我们成功地将 LLaMA 模型扩展到 128k token。此外,我们通过实验证实 PoSE 与所有基于 RoPE 的 LLM 以及多种位置插值策略兼容。值得注意的是,由于微调长度与目标上下文窗口解耦,PoSE 理论上可以无限扩展上下文窗口,仅受推理时内存使用的限制。随着高效推理技术的不断进步,我们相信 PoSE 在进一步扩展上下文窗口方面具有巨大潜力。
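A minimal sketch of the position-index manipulation behind PoSE: the model is trained on a short sequence, but the positions assigned to its chunks are offset by random skips so that, across training examples, all positions of the much larger target window are covered. The chunk sizes, the two-chunk split, and the window lengths below are illustrative assumptions.

```python
import numpy as np

def pose_position_ids(train_len=2048, target_len=8192, num_chunks=2, rng=None):
    """Position ids for one training example of length `train_len` whose
    chunks are scattered across a `target_len` window via random skips."""
    rng = rng if rng is not None else np.random.default_rng()
    chunk = train_len // num_chunks
    total_skip = target_len - train_len                  # budget of skipped positions
    # non-decreasing cumulative skip applied before chunks 2..num_chunks
    cuts = np.sort(rng.integers(0, total_skip + 1, size=num_chunks - 1))
    skips = np.concatenate(([0], cuts))
    ids = [np.arange(c * chunk + skips[c], c * chunk + skips[c] + chunk)
           for c in range(num_chunks)]
    return np.concatenate(ids)                           # still only train_len ids

pos = pose_position_ids()
# 2048 ids in total, with a jump between the two chunks; max id stays below 8192
print(len(pos), pos[1023:1026], int(pos.max()))
```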

Prompt, Condition, and Generate: Classification of Unsupported Claims with In-Context Learning

  • paper_url: http://arxiv.org/abs/2309.10359
  • repo_url: None
  • paper_authors: Peter Ebert Christensen, Srishti Yadav, Serge Belongie
  • for: 本研究旨在提出一种方法,从日常生活中遇到的无依据、无法证伪的声明中提炼出一组可数的叙事,以便更好地理解和刻画这些声明。
  • methods: 本研究利用大语言模型(LLM),通过上下文学习(In-Context Learning)自动合成声明。
  • results: 研究发现,带支持证据的生成声明可以提升叙事分类模型的性能,并且同一模型仅凭少量训练示例即可推断声明的立场与方面。这类模型可用于依赖叙事的应用,例如事实核查。
    Abstract Unsupported and unfalsifiable claims we encounter in our daily lives can influence our view of the world. Characterizing, summarizing, and -- more generally -- making sense of such claims, however, can be challenging. In this work, we focus on fine-grained debate topics and formulate a new task of distilling, from such claims, a countable set of narratives. We present a crowdsourced dataset of 12 controversial topics, comprising more than 120k arguments, claims, and comments from heterogeneous sources, each annotated with a narrative label. We further investigate how large language models (LLMs) can be used to synthesise claims using In-Context Learning. We find that generated claims with supported evidence can be used to improve the performance of narrative classification models and, additionally, that the same model can infer the stance and aspect using a few training examples. Such a model can be useful in applications which rely on narratives , e.g. fact-checking.
    摘要 日常生活中遇到的无依据且无法证伪的声明可能会影响我们对世界的看法。然而,对这些声明进行刻画、概括乃至更广义的理解并非易事。在这项工作中,我们关注细粒度的辩论话题,并提出一项新任务:从这些声明中提炼出一组可数的叙事。我们提供了一个众包数据集,涵盖 12 个有争议的话题,包括来自异构来源的超过 12 万条论证、声明和评论,每条都标注了叙事标签。我们进一步研究了如何利用大语言模型(LLM)通过上下文学习来合成声明,并发现:带有支持证据的生成声明可用于提升叙事分类模型的性能;同一模型还能仅凭少量训练示例推断声明的立场与方面。这类模型可用于依赖叙事的应用,例如事实核查。

KoBigBird-large: Transformation of Transformer for Korean Language Understanding

  • paper_url: http://arxiv.org/abs/2309.10339
  • repo_url: None
  • paper_authors: Kisu Yang, Yoonna Jang, Taewoo Lee, Jinwoo Seong, Hyungjin Lee, Hwanseok Jang, Heuiseok Lim
  • for: 这个研究实现了一个大型的韩文BigBird模型,以获得韩文理解的State-of-the-art表现,并允许较长的序列处理。
  • methods: 我们没有进行额外预训练,仅改动模型架构,并将位置编码扩展为我们提出的锥形绝对位置编码表示(TAPER)。
  • results: 实验结果显示,KoBigBird-large 在韩语语言理解基准上取得了最先进的整体表现,并在较长序列的文档分类与问答任务上优于有竞争力的基线模型。我们公开发布了该模型。
    Abstract This work presents KoBigBird-large, a large size of Korean BigBird that achieves state-of-the-art performance and allows long sequence processing for Korean language understanding. Without further pretraining, we only transform the architecture and extend the positional encoding with our proposed Tapered Absolute Positional Encoding Representations (TAPER). In experiments, KoBigBird-large shows state-of-the-art overall performance on Korean language understanding benchmarks and the best performance on document classification and question answering tasks for longer sequences against the competitive baseline models. We publicly release our model here.
    摘要 本文提出 KoBigBird-large,一个大规模的韩语 BigBird 模型,在韩语语言理解上取得了最先进的性能,并支持长序列处理。我们无需额外预训练,仅改动模型架构,并用我们提出的锥形绝对位置编码表示(TAPER)扩展位置编码。实验表明,KoBigBird-large 在韩语语言理解基准上取得了最先进的整体性能,并在较长序列的文档分类和问答任务上优于有竞争力的基线模型。我们公开发布了该模型。

Rigorously Assessing Natural Language Explanations of Neurons

  • paper_url: http://arxiv.org/abs/2309.10312
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Jing Huang, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, Christopher Potts
  • for: This paper aims to evaluate the faithfulness of natural language explanations of how large language models process and store information.
  • methods: The paper develops two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input, including observational and intervention modes.
  • results: The paper shows that even the most confident explanations have high error rates and little to no causal efficacy, and critically assesses whether natural language is a good choice for explanations and whether neurons are the best level of analysis.
    Abstract Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging. To help address this, we develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. In the observational mode, we evaluate claims that a neuron $a$ activates on all and only input strings that refer to a concept picked out by the proposed explanation $E$. In the intervention mode, we construe $E$ as a claim that the neuron $a$ is a causal mediator of the concept denoted by $E$. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy. We close the paper by critically assessing whether natural language is a good choice for explanations and whether neurons are the best level of analysis.
    摘要 自然语言是一种颇具吸引力的媒介,可用来解释大语言模型如何处理和存储信息,但评估此类解释的忠实度颇具挑战。为此,我们为声称单个神经元表示文本输入中某一概念的自然语言解释设计了两种评估模式。在观察模式下,我们评估这样的断言:神经元 $a$ 当且仅当输入字符串涉及解释 $E$ 所指概念时才被激活。在干预模式下,我们将 $E$ 理解为神经元 $a$ 是 $E$ 所指概念的因果中介这一断言。我们将该框架应用于 Bills 等人(2023)用 GPT-4 生成的 GPT-2 XL 神经元解释,结果显示即便是置信度最高的解释也有很高的错误率,且几乎没有因果效力。文章最后批判性地讨论了自然语言是否是解释的良好载体,以及神经元是否是最佳的分析层次。

Baichuan 2: Open Large-scale Language Models

  • paper_url: http://arxiv.org/abs/2309.10305
  • repo_url: https://github.com/baichuan-inc/baichuan2
  • paper_authors: Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, JunTao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu
  • for: 这个技术报告是用于介绍一种大型多语言语言模型(Baichuan 2)的论文。
  • methods: 这个论文使用了从零开始训练的大规模多语言语言模型,共包括70亿和130亿参数。
  • results: 这个论文表明,Baichuan 2 可以与其他开源模型相比或超越它们在公共测试准则上,例如 MMLU、CMMLU、GSM8K 和 HumanEval。 此外,Baichuan 2 在医学和法律领域也表现出色。
    Abstract Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.
    摘要 大型语言模型(LLMs)已经展示出杰出的表现在多种自然语言任务上,仅基于几个自然语言指令,减少需要广泛的特征工程。然而,大多数最具备力 LLMs 是封闭式或仅能在英文语言上使用。在本技术报告中,我们发布了 Baichuan 2,一系列大规模多语言模型,包含 700亿和1300亿个参数,从零开始训练,使用 2.6 兆个字元。Baichuan 2 与其他开源模型相比,在公共测试 benchmark 上匹配或超越。此外,Baichuan 2 在医学和法律领域中表现出色。我们将发布所有预训练模型检查点,以便研究社区更好地理解 Baichuan 2 的训练过程。

Using fine-tuning and min lookahead beam search to improve Whisper

  • paper_url: http://arxiv.org/abs/2309.10299
  • repo_url: None
  • paper_authors: Andrea Do, Oscar Brown, Zhengjie Wang, Nikhil Mathew, Zixin Liu, Jawwad Ahmed, Cheng Yu
  • for: 提高low-resource语言中Whisper的表现
  • methods: fine-tune Whisper on additional data and propose an improved decoding algorithm
  • results: 在越南语上,使用 LoRA 微调 Whisper-Tiny 相比零样本设置可将 WER 改善 38.49;在多种语言上,使用 Filter-Ends 和 Min Lookahead 解码算法相比标准束搜索平均降低 WER 2.26。
    Abstract The performance of Whisper in low-resource languages is still far from perfect. In addition to a lack of training data on low-resource languages, we identify some limitations in the beam search algorithm used in Whisper. To address these issues, we fine-tune Whisper on additional data and propose an improved decoding algorithm. On the Vietnamese language, fine-tuning Whisper-Tiny with LoRA leads to an improvement of 38.49 in WER over the zero-shot Whisper-Tiny setting which is a further reduction of 1.45 compared to full-parameter fine-tuning. Additionally, by using Filter-Ends and Min Lookahead decoding algorithms, the WER reduces by 2.26 on average over a range of languages compared to standard beam search. These results generalise to larger Whisper model sizes. We also prove a theorem that Min Lookahead outperforms the standard beam search algorithm used in Whisper.
    摘要 Whisper 在低资源语言上的表现仍远未完美。除了低资源语言训练数据不足之外,我们还发现 Whisper 所用的束搜索算法存在一些局限。为解决这些问题,我们在额外数据上微调 Whisper,并提出了一种改进的解码算法。在越南语上,使用 LoRA 微调 Whisper-Tiny 相比零样本设置将 WER 改善 38.49,比全参数微调再降低 1.45。此外,使用 Filter-Ends 和 Min Lookahead 解码算法,相比标准束搜索,多种语言上的 WER 平均降低 2.26。这些结果同样适用于更大规模的 Whisper 模型。我们还证明了一个定理:Min Lookahead 优于 Whisper 中使用的标准束搜索算法。
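A sketch of what the LoRA fine-tuning setup for Whisper-Tiny might look like with the Hugging Face transformers and peft libraries. The target modules and hyper-parameters are assumptions rather than the paper's exact configuration, and data loading/training is omitted.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])   # assumed attention projections
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only a small fraction of weights is updated
```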

Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi

  • paper_url: http://arxiv.org/abs/2309.10272
  • repo_url: None
  • paper_authors: Md Nishat Raihan, Dhiman Goswami, Antara Mahmud
  • for: 这个论文主要针对的是 code-mixed NLP 挑战,即在文本中混合多种语言的问题。
  • methods: 这篇论文使用了 BERT 模型,并通过组合 synthetic 数据和实际数据进行预训练。
  • results: 论文提出了 Tri-Distil-BERT 和 Mixed-Distil-BERT 两种模型,并在多个 NLP 任务上进行了评估,与大型模型 like mBERT 和 XLM-R 进行了比较,得到了竞争力的表现。
    Abstract One of the most popular downstream tasks in the field of Natural Language Processing is text classification. Text classification tasks have become more daunting when the texts are code-mixed. Though they are not exposed to such text during pre-training, different BERT models have demonstrated success in tackling Code-Mixed NLP challenges. Again, in order to enhance their performance, Code-Mixed NLP models have depended on combining synthetic data with real-world data. It is crucial to understand how the BERT models' performance is impacted when they are pretrained using corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models like mBERT and XLM-R. Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding, contributing to advancements in the field.
    摘要 文本分类是自然语言处理领域最受欢迎的下游任务之一。当文本出现语码混用(code-mixed)时,文本分类任务变得更加困难。尽管预训练时并未接触此类文本,不同的 BERT 模型在应对语码混用 NLP 挑战方面仍表现出色。同时,为了进一步提升性能,语码混用 NLP 模型通常将合成数据与真实数据相结合。理解 BERT 模型在使用相应的语码混用语言进行预训练时性能如何变化十分关键。在本文中,我们介绍 Tri-Distil-BERT(一个在孟加拉语、英语和印地语上预训练的多语言模型)与 Mixed-Distil-BERT(一个在语码混用数据上微调的模型)。两个模型在多项 NLP 任务上进行了评估,表现可与 mBERT 和 XLM-R 等更大的模型相竞争。我们的两阶段预训练方法为多语言及语码混用语言理解提供了高效的替代方案,推动了该领域的发展。

What is the Best Automated Metric for Text to Motion Generation?

  • paper_url: http://arxiv.org/abs/2309.10248
  • repo_url: None
  • paper_authors: Jordan Voas, Yili Wang, Qixing Huang, Raymond Mooney
  • for: 本研究旨在系统地研究skeleton-based人体动作生成 task 的评估指标,并提出一种基于多modal BERT 模型的新指标。
  • methods: 本研究使用了多种自动评估指标,并进行了人类评估来评估指标的含义。
  • results: 研究发现,现有的评估指标与人类评估之间存在很大的误差,而新提出的指标则与人类评估呈现很高的相似性。
    Abstract There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric exhibits extensive benefits over all current alternatives.
    摘要 从自然语言描述生成基于骨架的人体动作正受到越来越多的关注。虽然大多数工作都致力于为这一任务设计更好的神经网络架构,但在确定合适的评估指标方面尚缺乏系统研究。人类评估是这一任务的最终准确性度量,自动指标应当与人类质量判断高度相关。由于一段描述可以对应多种动作,确定正确的指标对评估和设计有效的生成模型至关重要。本文系统研究了哪些指标与人类评估最为一致,并提出了一致性更高的新指标。我们发现,目前用于该任务的所有指标在样本层面都未能与人类判断呈现哪怕中等程度的相关性;但在评估平均模型性能时,常用的 R-Precision 以及较少使用的坐标误差显示出很强的相关性。此外,一些近期提出的指标由于相关性低于替代方案而不被推荐。我们还提出了一种基于多模态 BERT 类模型的新指标 MoBERT,它在样本层面提供与人类高度相关的评估,同时在模型层面保持接近完美的相关性。我们的结果表明,这一新指标相对于当前所有替代方案具有广泛的优势。
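The evaluation protocol itself is simple to sketch: collect human ratings and automatic scores for the same generated samples and measure their rank correlation. The numbers below are toy values, not data from the paper.

```python
from scipy.stats import spearmanr

human_scores  = [4.5, 2.0, 3.5, 1.0, 4.0]        # human quality ratings per sample
metric_scores = [0.81, 0.40, 0.66, 0.35, 0.74]   # automatic metric per sample

rho, p = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # higher rho = metric tracks humans better
```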

PolicyGPT: Automated Analysis of Privacy Policies with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10238
  • repo_url: None
  • paper_authors: Chenhao Tang, Zhengliang Liu, Chong Ma, Zihao Wu, Yiwei Li, Wei Liu, Dajiang Zhu, Quanzheng Li, Xiang Li, Tianming Liu, Lei Fan
  • for: 这篇论文旨在开发一种基于大语言模型(LLM,如 ChatGPT 和 GPT-4)的隐私政策文本分析框架,用于自动分类隐私政策。
  • methods: 该框架名为 PolicyGPT,采用零样本学习方法对隐私政策进行分析,并将其归入 10 个类别。
  • results: PolicyGPT 在两个数据集上表现出色,分别达到 97% 和 87% 的准确率,超过了基线机器学习和神经网络模型。
    Abstract Privacy policies serve as the primary conduit through which online service providers inform users about their data collection and usage procedures. However, in a bid to be comprehensive and mitigate legal risks, these policy documents are often quite verbose. In practical use, users tend to click the Agree button directly rather than reading them carefully. This practice exposes users to risks of privacy leakage and legal issues. Recently, the advent of Large Language Models (LLM) such as ChatGPT and GPT-4 has opened new possibilities for text analysis, especially for lengthy documents like privacy policies. In this study, we investigate a privacy policy text analysis framework PolicyGPT based on the LLM. This framework was tested using two datasets. The first dataset comprises of privacy policies from 115 websites, which were meticulously annotated by legal experts, categorizing each segment into one of 10 classes. The second dataset consists of privacy policies from 304 popular mobile applications, with each sentence manually annotated and classified into one of another 10 categories. Under zero-shot learning conditions, PolicyGPT demonstrated robust performance. For the first dataset, it achieved an accuracy rate of 97%, while for the second dataset, it attained an 87% accuracy rate, surpassing that of the baseline machine learning and neural network models.
    摘要 隐私政策是在线服务提供商向用户说明其数据收集与使用方式的主要渠道。然而,为了面面俱到并降低法律风险,这些政策文件往往十分冗长。在实际使用中,用户往往直接点击"同意"按钮而不仔细阅读,这使用户面临隐私泄露和法律问题的风险。近年来,ChatGPT 和 GPT-4 等大语言模型(LLM)的出现为文本分析,尤其是隐私政策这类长文档的分析,带来了新的可能。本研究考察了一个基于 LLM 的隐私政策文本分析框架 PolicyGPT。我们用两个数据集对其进行了测试:第一个数据集包含 115 个网站的隐私政策,由法律专家精心标注,每个段落被归入 10 个类别之一;第二个数据集包含 304 个流行移动应用的隐私政策,每个句子被人工标注并归入另外 10 个类别之一。在零样本学习条件下,PolicyGPT 表现稳健:在第一个数据集上达到 97% 的准确率,在第二个数据集上达到 87% 的准确率,超过了基线机器学习和神经网络模型。
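A sketch of the zero-shot classification setup: each policy segment is sent to the LLM together with an instruction listing the permissible categories. The label names and prompt wording below are illustrative, and `llm` stands in for any chat-completion call (e.g. to GPT-4); this is not PolicyGPT's exact prompt.

```python
CATEGORIES = ["First Party Collection/Use", "Third Party Sharing/Collection",
              "Data Retention", "Data Security", "User Choice/Control",
              "Other"]                               # assumed subset of the 10 classes

def classify_segment(llm, segment: str) -> str:
    prompt = (
        "You are a legal analyst. Classify the following privacy-policy segment "
        f"into exactly one of these categories: {', '.join(CATEGORIES)}.\n\n"
        f"Segment: {segment}\n\nAnswer with the category name only."
    )
    return llm(prompt).strip()

# Example with a stub in place of a real LLM call:
print(classify_segment(lambda p: "Data Retention",
                       "We keep your account data for 12 months after deletion."))
```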

cs.LG - 2023-09-19

DPpack: An R Package for Differentially Private Statistical Analysis and Machine Learning

  • paper_url: http://arxiv.org/abs/2309.10965
  • repo_url: None
  • paper_authors: Spencer Giddens, Fang Liu
  • for: 这篇论文旨在提供一个开源的R包DPpack,用于保证个人隐私when分析数据。
  • methods: 本论文使用了三种流行的隐私保护机制:卷积、 Gaussian 和 exponential。此外,DPpack还提供了许多privacy-preserving描述统计函数,如mean、variance、covariance和quantiles,以及分布图和相关表。
  • results: DPpack提供了一个 user-friendly 的隐私保护版本的 logistic regression、SVM 和线性回归,以及隐私保护的模型参数调整。这些实现的隐私保护统计和机器学习技术,使得通常进行的统计分析中可以轻松地应用隐私保护原则。
    Abstract Differential privacy (DP) is the state-of-the-art framework for guaranteeing privacy for individuals when releasing aggregated statistics or building statistical/machine learning models from data. We develop the open-source R package DPpack that provides a large toolkit of differentially private analysis. The current version of DPpack implements three popular mechanisms for ensuring DP: Laplace, Gaussian, and exponential. Beyond that, DPpack provides a large toolkit of easily accessible privacy-preserving descriptive statistics functions. These include mean, variance, covariance, and quantiles, as well as histograms and contingency tables. Finally, DPpack provides user-friendly implementation of privacy-preserving versions of logistic regression, SVM, and linear regression, as well as differentially private hyperparameter tuning for each of these models. This extensive collection of implemented differentially private statistics and models permits hassle-free utilization of differential privacy principles in commonly performed statistical analysis. We plan to continue developing DPpack and make it more comprehensive by including more differentially private machine learning techniques, statistical modeling and inference in the future.
    摘要 差分隐私(DP)是在发布聚合统计量或基于数据构建统计/机器学习模型时保障个人隐私的最先进框架。我们开发了开源 R 包 DPpack,它提供了丰富的差分隐私分析工具。当前版本的 DPpack 实现了三种常用的差分隐私机制:拉普拉斯、高斯和指数机制。此外,DPpack 还提供了大量易于使用的隐私保护描述统计函数,包括均值、方差、协方差和分位数,以及直方图和列联表。最后,DPpack 提供了隐私保护版本的逻辑回归、SVM 和线性回归的易用实现,以及针对这些模型的差分隐私超参数调优。这一系列已实现的差分隐私统计与模型,使得在常见统计分析中应用差分隐私原则变得轻而易举。我们计划继续开发 DPpack,未来纳入更多的差分隐私机器学习技术以及统计建模与推断方法,使其更加全面。
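DPpack itself is an R package; for illustration only, the Laplace mechanism that underlies a call such as a differentially private mean can be sketched in a few lines (shown here in Python, not the DPpack API). The clipping bounds and epsilon are user choices.

```python
import numpy as np

def dp_mean(x, epsilon, lower, upper, rng=None):
    """Differentially private mean via the Laplace mechanism."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.clip(x, lower, upper)                 # bound each record's influence
    sensitivity = (upper - lower) / len(x)       # L1 sensitivity of the bounded mean
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return x.mean() + noise

ages = np.array([23, 35, 41, 29, 52, 61], dtype=float)
print(dp_mean(ages, epsilon=1.0, lower=18, upper=90))
```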

Deep Reinforcement Learning for Infinite Horizon Mean Field Problems in Continuous Spaces

  • paper_url: http://arxiv.org/abs/2309.10953
  • repo_url: None
  • paper_authors: Andrea Angiuli, Jean-Pierre Fouque, Ruimeng Hu, Alan Raydan
  • for: 开发并分析一种强化学习算法,以统一的方式求解连续空间中的平均场博弈(MFG)与平均场控制(MFC)问题。
  • methods: 所提算法将 actor-critic(AC)范式与参数化得分函数相结合来表示平均场分布,得分函数可在线高效更新,并利用朗之万动力学从相应分布中采样;AC 智能体与得分函数交替迭代更新,依学习率的选择收敛到 MFG 均衡或 MFC 最优解。
  • results: 在渐近无限时域框架下的线性二次基准问题上评估了算法性能;算法经简单修改即可求解混合平均场控制博弈(MFCG)。
    Abstract We present the development and analysis of a reinforcement learning (RL) algorithm designed to solve continuous-space mean field game (MFG) and mean field control (MFC) problems in a unified manner. The proposed approach pairs the actor-critic (AC) paradigm with a representation of the mean field distribution via a parameterized score function, which can be efficiently updated in an online fashion, and uses Langevin dynamics to obtain samples from the resulting distribution. The AC agent and the score function are updated iteratively to converge, either to the MFG equilibrium or the MFC optimum for a given mean field problem, depending on the choice of learning rates. A straightforward modification of the algorithm allows us to solve mixed mean field control games (MFCGs). The performance of our algorithm is evaluated using linear-quadratic benchmarks in the asymptotic infinite horizon framework.
    摘要 我们介绍并分析了一种强化学习(RL)算法,用以统一的方式求解连续空间中的平均场博弈(MFG)与平均场控制(MFC)问题。所提方法将 actor-critic(AC)范式与通过参数化得分函数表示的平均场分布相结合,该得分函数可以在线高效更新,并利用朗之万动力学从相应分布中获取样本。AC 智能体与得分函数交替迭代更新,依据学习率的选择,收敛到给定平均场问题的 MFG 均衡或 MFC 最优解。对算法进行简单修改即可求解混合平均场控制博弈(MFCG)。我们在渐近无限时域框架下,使用线性二次基准问题评估了算法的性能。
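A minimal sketch of the Langevin step used to draw samples from the distribution represented by a parameterized score function s(x) ≈ ∇ log μ(x). Here the known score of a standard Gaussian (-x) stands in for the learned score network.

```python
import numpy as np

def langevin_sample(score, x0, step=1e-2, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics: x <- x + step*score(x) + sqrt(2*step)*noise."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

samples = np.stack([langevin_sample(lambda x: -x, x0=[3.0]) for _ in range(200)])
print(samples.mean(), samples.std())   # approximately 0 and 1 for the Gaussian score
```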

Test-Time Training for Speech

  • paper_url: http://arxiv.org/abs/2309.10930
  • repo_url: https://github.com/Aastha2104/Parkinson-Disease-Prediction
  • paper_authors: Sri Harsha Dumpala, Chandramouli Sastry, Sageev Oore
  • for: 本研究探讨了应用测试时训练(Test-Time Training,TTT)解决speech应用中的分布转移问题。
  • methods: 我们在标准语音分类任务(例如说话人识别和情感识别)的测试集中引入分布偏移,并探索 TTT 如何帮助模型适应这些偏移。实验涵盖了由背景噪声以及性别、年龄等自然语音变化所引入的分布偏移。
  • results: 我们发现了若干关键挑战,包括对优化超参数的敏感性(例如优化步数以及选取哪些参数进行 TTT)和可扩展性问题(每个测试样本都需要一组自己的参数)。我们提议使用 BitFit——一种仅微调偏置参数的参数高效微调算法——来应对这些挑战,并证明它比微调全部模型参数更加稳定。
    Abstract In this paper, we study the application of Test-Time Training (TTT) as a solution to handling distribution shifts in speech applications. In particular, we introduce distribution-shifts to the test datasets of standard speech-classification tasks -- for example, speaker-identification and emotion-detection -- and explore how Test-Time Training (TTT) can help adjust to the distribution-shift. In our experiments that include distribution shifts due to background noise and natural variations in speech such as gender and age, we identify some key-challenges with TTT including sensitivity to optimization hyperparameters (e.g., number of optimization steps and subset of parameters chosen for TTT) and scalability (e.g., as each example gets its own set of parameters, TTT is not scalable). Finally, we propose using BitFit -- a parameter-efficient fine-tuning algorithm proposed for text applications that only considers the bias parameters for fine-tuning -- as a solution to the aforementioned challenges and demonstrate that it is consistently more stable than fine-tuning all the parameters of the model.
    摘要 在这篇论文中,我们研究了将测试时训练(TTT)用于应对语音应用中分布偏移问题。具体而言,我们在标准语音分类任务(例如说话人识别和情感识别)的测试集中引入分布偏移,并探索 TTT 如何帮助模型适应这些偏移。在涵盖背景噪声以及性别、年龄等自然语音变化所导致的分布偏移的实验中,我们识别出 TTT 的一些关键挑战,包括对优化超参数(例如优化步数和参与 TTT 的参数子集)的敏感性以及可扩展性问题(每个样本都需要一组自己的参数,因此 TTT 难以扩展)。最后,我们提议使用 BitFit——一种最初针对文本应用提出的、仅微调偏置参数的参数高效微调算法——来解决上述挑战,并证明它始终比微调模型的全部参数更加稳定。
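A minimal sketch of BitFit-style adaptation: all weights are frozen except bias terms, and a few optimization steps are taken on a self-supervised loss for the incoming test data. The model and the loss below are placeholders; the actual TTT objective and speech model are task-specific.

```python
import torch

def bitfit_parameters(model: torch.nn.Module):
    """Freeze everything except bias terms and return the trainable parameters."""
    params = []
    for name, p in model.named_parameters():
        p.requires_grad = "bias" in name
        if p.requires_grad:
            params.append(p)
    return params

model = torch.nn.Sequential(torch.nn.Linear(40, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 8))   # stand-in for a speech classifier
opt = torch.optim.Adam(bitfit_parameters(model), lr=1e-3)

x = torch.randn(4, 40)                                # features for one test batch
for _ in range(10):                                   # a handful of TTT steps
    loss = model(x).pow(2).mean()                     # placeholder self-supervised loss
    opt.zero_grad(); loss.backward(); opt.step()
```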

Posterior Contraction Rates for Matérn Gaussian Processes on Riemannian Manifolds

  • paper_url: http://arxiv.org/abs/2309.10918
  • repo_url: None
  • paper_authors: Paul Rosa, Viacheslav Borovitskiy, Alexander Terenin, Judith Rousseau
  • for: 本研究旨在探讨 Whether intrinsic geometric Gaussian processes (GPs) can lead to better performance compared to embedding all relevant quantities into $\mathbb{R}^d$ and using the restriction of an ordinary Euclidean GP.
  • methods: 本研究使用了Optimal contraction rates for intrinsic Mat'ern GPs defined on compact Riemannian manifolds, as well as trace and extension theorems between manifold and ambient Sobolev spaces.
  • results: 研究发现,对于合适的细分参数,intrinsic GPs can achieve better performance than embedding all relevant quantities into $\mathbb{R}^d$ and using the restriction of an ordinary Euclidean GP. This result is demonstrated empirically on a number of examples.
    Abstract Gaussian processes are used in many machine learning applications that rely on uncertainty quantification. Recently, computational tools for working with these models in geometric settings, such as when inputs lie on a Riemannian manifold, have been developed. This raises the question: can these intrinsic models be shown theoretically to lead to better performance, compared to simply embedding all relevant quantities into $\mathbb{R}^d$ and using the restriction of an ordinary Euclidean Gaussian process? To study this, we prove optimal contraction rates for intrinsic Mat\'ern Gaussian processes defined on compact Riemannian manifolds. We also prove analogous rates for extrinsic processes using trace and extension theorems between manifold and ambient Sobolev spaces: somewhat surprisingly, the rates obtained turn out to coincide with those of the intrinsic processes, provided that their smoothness parameters are matched appropriately. We illustrate these rates empirically on a number of examples, which, mirroring prior work, show that intrinsic processes can achieve better performance in practice. Therefore, our work shows that finer-grained analyses are needed to distinguish between different levels of data-efficiency of geometric Gaussian processes, particularly in settings which involve small data set sizes and non-asymptotic behavior.
    摘要 高斯过程被广泛用于依赖不确定性量化的机器学习应用。近来,用于在几何设置(例如输入位于黎曼流形上)中使用这些模型的计算工具已被开发出来。这引出一个问题:与把所有相关量嵌入 $\mathbb{R}^d$ 并使用普通欧氏高斯过程的限制相比,这些内蕴模型能否在理论上被证明带来更好的性能?为研究这一问题,我们证明了定义在紧致黎曼流形上的内蕴 Matérn 高斯过程的最优收缩速率。我们还利用流形与环境 Sobolev 空间之间的迹定理和延拓定理,证明了外蕴过程的类似速率:有些出人意料的是,只要适当匹配光滑性参数,所得速率与内蕴过程的速率一致。我们在若干例子上对这些速率进行了实证展示;与先前工作一致,内蕴过程在实践中可以取得更好的性能。因此,我们的工作表明,需要更细粒度的分析来区分几何高斯过程在不同数据效率水平上的差异,特别是在小数据量和非渐近行为的场景中。

  • paper_url: http://arxiv.org/abs/2309.10890
  • repo_url: None
  • paper_authors: Sofiane Azogagh, Zelma Aubin Birba, Sébastien Gambs, Marc-Olivier Killijian
  • for: 针对分布式图的隐私保护链接预测与敏感数据共享。
  • methods: 使用密码学原语实现隐私保护,各参与方无需披露各自持有的图结构,即可共同计算新链接形成的可能性。
  • results: 能够实现高精度的链接预测,并防御图投毒攻击。
    Abstract Graphs are a widely used data structure for collecting and analyzing relational data. However, when the graph structure is distributed across several parties, its analysis is particularly challenging. In particular, due to the sensitivity of the data each party might want to keep their partial knowledge of the graph private, while still willing to collaborate with the other parties for tasks of mutual benefit, such as data curation or the removal of poisoned data. To address this challenge, we propose Crypto'Graph, an efficient protocol for privacy-preserving link prediction on distributed graphs. More precisely, it allows parties partially sharing a graph with distributed links to infer the likelihood of formation of new links in the future. Through the use of cryptographic primitives, Crypto'Graph is able to compute the likelihood of these new links on the joint network without revealing the structure of the private individual graph of each party, even though they know the number of nodes they have, since they share the same graph but not the same links. Crypto'Graph improves on previous works by enabling the computation of a certain number of similarity metrics without any additional cost. The use of Crypto'Graph is illustrated for defense against graph poisoning attacks, in which it is possible to identify potential adversarial links without compromising the privacy of the graphs of individual parties. The effectiveness of Crypto'Graph in mitigating graph poisoning attacks and achieving high prediction accuracy on a graph neural network node classification task is demonstrated through extensive experimentation on a real-world dataset.
    摘要 图是一种被广泛用于收集和分析关系数据的数据结构。然而,当图结构分布在多个参与方手中时,对其进行分析尤其困难。由于数据的敏感性,各参与方可能希望对自己掌握的部分图保密,但仍愿意在对彼此有益的任务上合作,例如数据整理或清除被投毒的数据。为应对这一挑战,我们提出 Crypto'Graph,一种在分布式图上进行隐私保护链接预测的高效协议。更确切地说,它允许共享同一图但链接分散在各方的参与者推断未来新链接形成的可能性。借助密码学原语,Crypto'Graph 能够在联合网络上计算新链接的可能性,而无需泄露各方私有图的结构,即便各方知道节点数目,因为它们共享同一组节点却不共享相同的链接。Crypto'Graph 相比已有工作的改进在于,可以在不增加额外开销的情况下计算若干相似度指标。我们以防御图投毒攻击为例说明 Crypto'Graph 的用途:它能在不损害各方图隐私的前提下识别潜在的对抗性链接。通过在真实数据集上的大量实验,我们展示了 Crypto'Graph 在缓解图投毒攻击以及在图神经网络节点分类任务上取得高预测精度方面的有效性。

Dynamical Tests of a Deep-Learning Weather Prediction Model

  • paper_url: http://arxiv.org/abs/2309.10867
  • repo_url: None
  • paper_authors: Gregory J. Hakim, Sanjit Masanam
  • for: 本研究通过一组经典动力学实验,检验一种深度学习天气预测模型是否编码了真实的大气动力学。
  • methods: 研究者对名为 Pangu-weather 的深度学习模型施加局地扰动,开展了四组经典动力学实验。
  • results: 研究结果表明,该模型在各种情形下都表现出符合物理的行为,包括热带加热引发的 Matsuno-Gill 响应与行星波、真实的温带气旋与锋面、自发形成的极地低压以及大西洋飓风等。
    Abstract Global deep-learning weather prediction models have recently been shown to produce forecasts that rival those from physics-based models run at operational centers. It is unclear whether these models have encoded atmospheric dynamics, or simply pattern matching that produces the smallest forecast error. Answering this question is crucial to establishing the utility of these models as tools for basic science. Here we subject one such model, Pangu-weather, to a set of four classical dynamical experiments that do not resemble the model training data. Localized perturbations to the model output and the initial conditions are added to steady time-averaged conditions, to assess the propagation speed and structural evolution of signals away from the local source. Perturbing the model physics by adding a steady tropical heat source results in a classical Matsuno--Gill response near the heating, and planetary waves that radiate into the extratropics. A localized disturbance on the winter-averaged North Pacific jet stream produces realistic extratropical cyclones and fronts, including the spontaneous emergence of polar lows. Perturbing the 500hPa height field alone yields adjustment from a state of rest to one of wind--pressure balance over ~6 hours. Localized subtropical low pressure systems produce Atlantic hurricanes, provided the initial amplitude exceeds about 5 hPa, and setting the initial humidity to zero eliminates hurricane development. We conclude that the model encodes realistic physics in all experiments, and suggest it can be used as a tool for rapidly testing ideas before using expensive physics-based models.
    摘要 全球深度学习天气预测模型最近已能生成与业务中心运行的物理模型相当的预报。然而,尚不清楚这些模型究竟编码了大气动力学,还是仅靠模式匹配来最小化预报误差。回答这一问题,对于确立此类模型作为基础科学工具的价值至关重要。本文对其中一个模型 Pangu-weather 进行了四组不同于其训练数据的经典动力学实验:在稳定的时间平均状态上,对模型输出和初始条件施加局地扰动,以评估信号自局地源向外传播的速度和结构演变。通过加入稳定的热带热源来扰动模型物理过程,得到了热源附近经典的 Matsuno--Gill 响应,以及向温带辐射的行星波。在冬季平均的北太平洋急流上施加局地扰动,可产生真实的温带气旋和锋面,并自发出现极地低压。仅扰动 500 hPa 高度场时,模型可在约 6 小时内从静止状态调整到风压平衡状态。局地的副热带低压系统在初始振幅超过约 5 hPa 时会发展为大西洋飓风,而将初始湿度设为零则会抑制飓风发展。我们的结论是,该模型在所有实验中都编码了符合实际的物理过程,并建议在使用昂贵的物理模型之前,可将其作为快速检验想法的工具。
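As a rough illustration of the kind of initial-condition perturbation described in the abstract, the sketch below adds a localized Gaussian bump to a stand-in 500 hPa height field; the grid resolution, bump amplitude, and the way the field would be fed back into the forecast model are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def gaussian_bump(lat, lon, lat0, lon0, amplitude_m, sigma_deg):
    """Localized, smooth perturbation centred at (lat0, lon0)."""
    d2 = (lat - lat0) ** 2 + (lon - lon0) ** 2
    return amplitude_m * np.exp(-d2 / (2.0 * sigma_deg ** 2))

# Hypothetical 0.25-degree global grid for a 500 hPa geopotential height field.
lat = np.linspace(-90, 90, 721)
lon = np.linspace(0, 359.75, 1440)
LON, LAT = np.meshgrid(lon, lat)

z500_mean = np.full_like(LAT, 5500.0)          # stand-in for a time-averaged state (m)
z500_perturbed = z500_mean + gaussian_bump(LAT, LON, lat0=40.0, lon0=180.0,
                                           amplitude_m=50.0, sigma_deg=5.0)

# The perturbed field would then replace the corresponding channel in the model's
# initial-condition tensor before the forecast is rolled forward.
print(z500_perturbed.max() - z500_mean.max())   # ~50 m at the bump centre
```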

$O(k)$-Equivariant Dimensionality Reduction on Stiefel Manifolds

  • paper_url: http://arxiv.org/abs/2309.10775
  • repo_url: https://github.com/harlinlee/psc
  • paper_authors: Andrew Lee, Harlin Lee, Jose A. Perea, Nikolas Schonsheck, Madeleine Weinstein
  • for: 本研究的目的是对高维度的几何空间资料进行降维,以提高分析和检测的效率。
  • methods: 本研究提出了一个名为几何主成分对称coordinate(PSC)的算法,它可以将资料从$V_k(\mathbb{R}^N)$降维到$V_k(\mathbb{R}^n)$,并且保持了$O(k)$-对称性。
  • results: 本研究通过多个实验表明,PSC算法可以对高维度资料进行有效的降维,并且可以提高分析和检测的效率。另外,研究也显示了PSC算法在不同的数据集中的表现。
    Abstract Many real-world datasets live on high-dimensional Stiefel and Grassmannian manifolds, $V_k(\mathbb{R}^N)$ and $Gr(k, \mathbb{R}^N)$ respectively, and benefit from projection onto lower-dimensional Stiefel (respectively, Grassmannian) manifolds. In this work, we propose an algorithm called Principal Stiefel Coordinates (PSC) to reduce data dimensionality from $ V_k(\mathbb{R}^N)$ to $V_k(\mathbb{R}^n)$ in an $O(k)$-equivariant manner ($k \leq n \ll N$). We begin by observing that each element $\alpha \in V_n(\mathbb{R}^N)$ defines an isometric embedding of $V_k(\mathbb{R}^n)$ into $V_k(\mathbb{R}^N)$. Next, we optimize for such an embedding map that minimizes data fit error by warm-starting with the output of principal component analysis (PCA) and applying gradient descent. Then, we define a continuous and $O(k)$-equivariant map $\pi_\alpha$ that acts as a ``closest point operator'' to project the data onto the image of $V_k(\mathbb{R}^n)$ in $V_k(\mathbb{R}^N)$ under the embedding determined by $\alpha$, while minimizing distortion. Because this dimensionality reduction is $O(k)$-equivariant, these results extend to Grassmannian manifolds as well. Lastly, we show that the PCA output globally minimizes projection error in a noiseless setting, but that our algorithm achieves a meaningfully different and improved outcome when the data does not lie exactly on the image of a linearly embedded lower-dimensional Stiefel manifold as above. Multiple numerical experiments using synthetic and real-world data are performed.
    摘要 许多实际数据集位于高维 Stiefel 流形 $V_k(\mathbb{R}^N)$ 与 Grassmannian 流形 $Gr(k, \mathbb{R}^N)$ 上,可受益于向低维 Stiefel(或 Grassmannian)流形的投影。本文提出一种名为 Principal Stiefel Coordinates(PSC)的算法,以 $O(k)$-等变的方式将数据从 $V_k(\mathbb{R}^N)$ 降维到 $V_k(\mathbb{R}^n)$($k \leq n \ll N$)。我们首先观察到,每个元素 $\alpha \in V_n(\mathbb{R}^N)$ 都定义了一个从 $V_k(\mathbb{R}^n)$ 到 $V_k(\mathbb{R}^N)$ 的等距嵌入。随后,我们以主成分分析(PCA)的输出作为初始化,通过梯度下降优化该嵌入映射,使数据拟合误差最小。然后,我们定义一个连续且 $O(k)$-等变的映射 $\pi_\alpha$,作为"最近点算子",在由 $\alpha$ 确定的嵌入下将数据投影到 $V_k(\mathbb{R}^n)$ 在 $V_k(\mathbb{R}^N)$ 中的像上,同时使失真最小。由于这种降维是 $O(k)$-等变的,上述结果同样适用于 Grassmannian 流形。最后,我们证明在无噪声情形下 PCA 的输出是投影误差的全局最小值,但当数据并非恰好位于如上所述线性嵌入的低维 Stiefel 流形的像上时,我们的算法可以得到显著不同且更优的结果。文中还在合成数据与真实数据上进行了多组数值实验。
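The abstract describes a "closest point operator" $\pi_\alpha$ that projects data onto the image of $V_k(\mathbb{R}^n)$ in $V_k(\mathbb{R}^N)$ under the embedding defined by $\alpha$. Below is a minimal numpy sketch of one natural realization of such a projection, via the orthogonal-Procrustes (polar) factor of $\alpha^\top X$; treating this as the paper's exact operator is an assumption.

```python
import numpy as np

def random_stiefel(N, k, rng):
    """A random element of V_k(R^N): an N x k matrix with orthonormal columns."""
    q, _ = np.linalg.qr(rng.standard_normal((N, k)))
    return q

def project_to_embedded_stiefel(X, alpha):
    """Closest point to X in {alpha @ Y : Y in V_k(R^n)} under the Frobenius norm.

    alpha: (N, n) with orthonormal columns; X: (N, k), k <= n.
    The inner problem is an orthogonal Procrustes problem solved by the
    polar factor of alpha.T @ X.
    """
    M = alpha.T @ X                     # (n, k)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    Y = U @ Vt                          # closest point in V_k(R^n)
    return alpha @ Y                    # lift back into V_k(R^N)

rng = np.random.default_rng(0)
N, n, k = 50, 10, 3
alpha = random_stiefel(N, n, rng)       # defines the low-dimensional "frame"
X = random_stiefel(N, k, rng)           # a data point on the big Stiefel manifold

X_proj = project_to_embedded_stiefel(X, alpha)
print(np.allclose(X_proj.T @ X_proj, np.eye(k)))   # True: the projection stays on V_k(R^N)
print(np.linalg.norm(X - X_proj))                  # residual distortion
```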

Semi-supervised Domain Adaptation in Graph Transfer Learning

  • paper_url: http://arxiv.org/abs/2309.10773
  • repo_url: https://github.com/daiquanyu/AdaGCN_TKDE
  • paper_authors: Ziyue Qiao, Xiao Luo, Meng Xiao, Hao Dong, Yuanchun Zhou, Hui Xiong
  • for: 这个研究目的是为了实现不监控的类别转移学习在图形上,将标签集中的知识转移到无标签的目标图形上。
  • methods: 我们提出了一个方法named Semi-supervised Graph Domain Adaptation (SGDA),对抗域shift和标签稀缺的难题。我们在源图形中添加了 adaptive shift 参数,通过在反对抗模式下训练来对图形端点进行对齐,从而将源图形中训练的类别标签转移到目标图形上。此外,我们还提出了伪标签的方法,通过量测缺乏标签的范例中的后天影响,从而提高目标图形上的类别分类精度。
  • results: 我们在一些公开的数据集上进行了广泛的实验,证明了我们的提出的SGDA在不同的实验设定下的效果。
    Abstract As a specific case of graph transfer learning, unsupervised domain adaptation on graphs aims for knowledge transfer from label-rich source graphs to unlabeled target graphs. However, graphs with topology and attributes usually have considerable cross-domain disparity and there are numerous real-world scenarios where merely a subset of nodes are labeled in the source graph. This imposes critical challenges on graph transfer learning due to serious domain shifts and label scarcity. To address these challenges, we propose a method named Semi-supervised Graph Domain Adaptation (SGDA). To deal with the domain shift, we add adaptive shift parameters to each of the source nodes, which are trained in an adversarial manner to align the cross-domain distributions of node embedding, thus the node classifier trained on labeled source nodes can be transferred to the target nodes. Moreover, to address the label scarcity, we propose pseudo-labeling on unlabeled nodes, which improves classification on the target graph via measuring the posterior influence of nodes based on their relative position to the class centroids. Finally, extensive experiments on a range of publicly accessible datasets validate the effectiveness of our proposed SGDA in different experimental settings.
    摘要 为特例的图转移学习,无监督领域适应图的目标是将标签丰富的源图知识传播到无标注目标图。然而,图像结构和特征通常存在跨领域差异,世界上有许多实际情况下只有源图中的一个节点被标注。这些挑战使得图转移学习受到了严重的领域shift和标签稀缺的挑战。为解决这些挑战,我们提出了半监督图领域适应方法(SGDA)。为了处理领域shift,我们在源节点中添加了适应参数,这些参数在对抗式的训练中使得源节点的领域分布与目标节点的领域分布相似,从而使得源节点的类别推论器可以被传递到目标节点。此外,为了解决标签稀缺的问题,我们提出了假标注方法,通过测量每个节点的后向影响,根据节点的相对位置来提高目标图的分类精度。最后,我们在一系列公共访问的数据集上进行了广泛的实验,证明了我们提出的SGDA在不同的实际情况下的效果。
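To make the centroid-based pseudo-labeling step concrete, here is a small sketch that assigns pseudo-labels to unlabeled node embeddings from the class centroids of labeled source nodes and derives a confidence weight from the distances; the softmax-over-distances weighting is an illustrative stand-in for the posterior-influence measure described in the abstract.

```python
import numpy as np

def centroid_pseudo_labels(emb_labeled, y_labeled, emb_unlabeled, temperature=1.0):
    """Assign pseudo-labels to unlabeled embeddings from class centroids.

    Confidence is a softmax over negative distances to the centroids, a stand-in
    for the posterior-influence weighting used to reweight the pseudo-label loss.
    """
    classes = np.unique(y_labeled)
    centroids = np.stack([emb_labeled[y_labeled == c].mean(axis=0) for c in classes])

    # Pairwise Euclidean distances: (num_unlabeled, num_classes)
    d = np.linalg.norm(emb_unlabeled[:, None, :] - centroids[None, :, :], axis=-1)
    probs = np.exp(-d / temperature)
    probs /= probs.sum(axis=1, keepdims=True)

    pseudo = classes[np.argmax(probs, axis=1)]
    confidence = probs.max(axis=1)          # used to weight the pseudo-label loss
    return pseudo, confidence

rng = np.random.default_rng(1)
emb_l = np.vstack([rng.normal(0, 1, (20, 16)), rng.normal(3, 1, (20, 16))])
y_l = np.array([0] * 20 + [1] * 20)
emb_u = rng.normal(1.5, 1, (10, 16))
labels, conf = centroid_pseudo_labels(emb_l, y_l, emb_u)
print(labels, conf.round(2))
```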

Improving Opioid Use Disorder Risk Modelling through Behavioral and Genetic Feature Integration

  • paper_url: http://arxiv.org/abs/2309.10837
  • repo_url: https://github.com/bayesomicslab/OUD-Risk-Prediction
  • paper_authors: Sybille Légitime, Kaustubh Prabhu, Devin McConnell, Bing Wang, Dipak K. Dey, Derek Aguiar
  • for: 预测抑菌药用症(OUD)的风险,以提高治疗方案、监测计划和干预策略的效果。
  • methods: 将与 OUD 相关的基因变异与从 GPS 和 Wi-Fi 时空坐标中提取的行为特征相结合来评估 OUD 风险。开发了相应算法以(1)从经验分布生成行为特征,(2)在假定一定共病水平和相对风险的前提下合成行为与基因样本。
  • results: 结果表明,结合行为与基因特征可以改进风险建模,且行为特征对 OUD 风险的影响更大,尽管基因的贡献也显著,尤其是在线性模型中。但在临床应用之前,仍需在临床试验中评估隐私、安全、偏差和泛化性等问题。
    Abstract Opioids are an effective analgesic for acute and chronic pain, but also carry a considerable risk of addiction leading to millions of opioid use disorder (OUD) cases and tens of thousands of premature deaths in the United States yearly. Estimating OUD risk prior to prescription could improve the efficacy of treatment regimens, monitoring programs, and intervention strategies, but risk estimation is typically based on self-reported data or questionnaires. We develop an experimental design and computational methods that combines genetic variants associated with OUD with behavioral features extracted from GPS and Wi-Fi spatiotemporal coordinates to assess OUD risk. Since both OUD mobility and genetic data do not exist for the same cohort, we develop algorithms to (1) generate mobility features from empirical distributions and (2) synthesize mobility and genetic samples assuming a level of comorbidity and relative risks. We show that integrating genetic and mobility modalities improves risk modelling using classification accuracy, area under the precision-recall and receiver operator characteristic curves, and $F_1$ score. Interpreting the fitted models suggests that mobility features have more influence on OUD risk, although the genetic contribution was significant, particularly in linear models. While there exists concerns with respect to privacy, security, bias, and generalizability that must be evaluated in clinical trials before being implemented in practice, our framework provides preliminary evidence that behavioral and genetic features may improve OUD risk estimation to assist with personalized clinical decision-making.
    摘要 阿片类药物是治疗急性和慢性疼痛的有效镇痛药,但同时具有相当高的成瘾风险,在美国每年导致数百万例阿片类药物使用障碍(OUD)和数万例过早死亡。在处方前估计 OUD 风险有助于提高治疗方案、监测计划和干预策略的效果,但现有的风险估计通常基于自我报告数据或问卷。我们设计了一套实验方案和计算方法,将与 OUD 相关的基因变异与从 GPS 和 Wi-Fi 时空坐标中提取的行为特征结合起来评估 OUD 风险。由于同一队列中并不同时存在 OUD 移动数据和基因数据,我们开发了算法来(1)从经验分布生成移动特征,(2)在假定一定共病水平和相对风险的前提下合成移动与基因样本。结果显示,整合基因与移动两种模态可以在分类准确率、精确率-召回率曲线下面积、ROC 曲线下面积以及 $F_1$ 分数上改进风险建模。对拟合模型的解读表明,移动特征对 OUD 风险的影响更大,但基因的贡献也显著,尤其是在线性模型中。虽然在临床应用之前仍需在临床试验中评估隐私、安全、偏差与泛化性等问题,但我们的框架提供了初步证据,表明行为与基因特征可能改进 OUD 风险估计,从而辅助个体化临床决策。

Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

  • paper_url: http://arxiv.org/abs/2309.10740
  • repo_url: None
  • paper_authors: Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi
  • for: 加速文本到音频(TTA)生成,使其适用于有推理时间或算力约束的场景
  • methods: 改进最近提出的一致性蒸馏框架来训练 TTA 模型,使生成仅需一次神经网络查询
  • results: 在 AudioCaps 数据集上的实验表明,一致性模型保持了扩散模型的高生成质量和多样性,同时将查询次数减少了约 400 倍。
    Abstract Diffusion models power a vast majority of text-to-audio (TTA) generation methods. Unfortunately, these models suffer from slow inference speed due to iterative queries to the underlying denoising network, thus unsuitable for scenarios with inference time or computational constraints. This work modifies the recently proposed consistency distillation framework to train TTA models that require only a single neural network query. In addition to incorporating classifier-free guidance into the distillation process, we leverage the availability of generated audio during distillation training to fine-tune the consistency TTA model with novel loss functions in the audio space, such as the CLAP score. Our objective and subjective evaluation results on the AudioCaps dataset show that consistency models retain diffusion models' high generation quality and diversity while reducing the number of queries by a factor of 400.
    摘要 目前绝大多数文本到音频(TTA)生成方法都基于扩散模型。然而,这类模型需要对底层去噪网络进行迭代查询,推理速度较慢,因而不适合有推理时间或算力约束的场景。本文改进最近提出的一致性蒸馏框架,训练只需一次神经网络查询的 TTA 模型。除了在蒸馏过程中引入无分类器引导外,我们还利用蒸馏训练期间生成的音频,以音频空间中的新损失函数(如 CLAP 分数)对一致性 TTA 模型进行微调。在 AudioCaps 数据集上的客观与主观评估结果表明,一致性模型保持了扩散模型的高生成质量和多样性,同时将查询次数减少了 400 倍。

Mixture Weight Estimation and Model Prediction in Multi-source Multi-target Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.10736
  • repo_url: None
  • paper_authors: Yuyang Deng, Ilja Kuzborskij, Mehrdad Mahdavi
  • for: 本文主要针对多种不同来源数据的学习问题,目标是在新的目标分布上表现良好。
  • methods: 本文使用了一种新的混合学习方法,可以同时适应多个目标分布,并且可以在 computationally efficient 的方式下解决多个目标分布的采样问题。
  • results: 本文提出了一种新的混合学习方法,可以有效地解决多个目标分布的学习问题,并且可以在线上和离线上实现高效的学习。
    Abstract We consider the problem of learning a model from multiple heterogeneous sources with the goal of performing well on a new target distribution. The goal of learner is to mix these data sources in a target-distribution aware way and simultaneously minimize the empirical risk on the mixed source. The literature has made some tangible advancements in establishing theory of learning on mixture domain. However, there are still two unsolved problems. Firstly, how to estimate the optimal mixture of sources, given a target domain; Secondly, when there are numerous target domains, how to solve empirical risk minimization (ERM) for each target using possibly unique mixture of data sources in a computationally efficient manner. In this paper we address both problems efficiently and with guarantees. We cast the first problem, mixture weight estimation, as a convex-nonconcave compositional minimax problem, and propose an efficient stochastic algorithm with provable stationarity guarantees. Next, for the second problem, we identify that for certain regimes, solving ERM for each target domain individually can be avoided, and instead parameters for a target optimal model can be viewed as a non-linear function on a space of the mixture coefficients. Building upon this, we show that in the offline setting, a GD-trained overparameterized neural network can provably learn such function to predict the model of target domain instead of solving a designated ERM problem. Finally, we also consider an online setting and propose a label efficient online algorithm, which predicts parameters for new targets given an arbitrary sequence of mixing coefficients, while enjoying regret guarantees.
    摘要 我们考虑一个从多个不同来源学习模型的问题,目的是在新的目标分布上表现良好。学习者需要将这些数据源混合在目标分布意识的方式下,同时降低混合后的观察风险。文献已经做出了一些可观的进展,但还有两个未解决的问题。第一个问题是,给定目标分布,如何估计最佳混合比例?第二个问题是,当有多个目标分布时,如何使用可能不同的数据源混合来解决每个目标的Empirical Risk Minimization(ERM)问题,并且在计算效率上具有保证?在这篇论文中,我们efficiently和有保证地解决了这两个问题。我们将第一个问题,混合比例估计,转化为一个凸-非凸 Compositional Minimax问题,并提出了一种效果的杂化算法,具有可观的站点性保证。接下来,我们发现在某些情况下,可以避免解决每个目标分布的ERM问题,而是视 Parameters for a target optimal model as a non-linear function on a space of the mixture coefficients。在这个基础上,我们证明了在离线设置下,一个GD训练的过参数化神经网络可以可靠地学习这种函数,以预测目标分布下的模型。最后,我们还考虑了在线设置,并提出了一种标签效率的在线算法,可以预测新的目标参数,并且具有误差保证。

GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models

  • paper_url: http://arxiv.org/abs/2309.10730
  • repo_url: None
  • paper_authors: Yonggan Fu, Yongan Zhang, Zhongzhi Yu, Sixu Li, Zhifan Ye, Chaojian Li, Cheng Wan, Yingyan Lin
  • for: 这个论文旨在提高人工智能加速器的设计效率和质量,使用大语言模型(LLMs)来自动化加速器设计。
  • methods: 本论文使用了LLMs的启发作用,开发了一个名为GPT4AIGChip的框架,以便通过人工语言指令来自动生成AI加速器设计。
  • results: 研究发现,LLMs可以帮助生成高质量的AI加速器设计,并且可以帮助不熟悉硬件领域的人员也能够设计高效的加速器。这是首次在LLMs中实现了自动化AI加速器设计的工作。
    Abstract The remarkable capabilities and intricate nature of Artificial Intelligence (AI) have dramatically escalated the imperative for specialized AI accelerators. Nonetheless, designing these accelerators for various AI workloads remains both labor- and time-intensive. While existing design exploration and automation tools can partially alleviate the need for extensive human involvement, they still demand substantial hardware expertise, posing a barrier to non-experts and stifling AI accelerator development. Motivated by the astonishing potential of large language models (LLMs) for generating high-quality content in response to human language instructions, we embark on this work to examine the possibility of harnessing LLMs to automate AI accelerator design. Through this endeavor, we develop GPT4AIGChip, a framework intended to democratize AI accelerator design by leveraging human natural languages instead of domain-specific languages. Specifically, we first perform an in-depth investigation into LLMs' limitations and capabilities for AI accelerator design, thus aiding our understanding of our current position and garnering insights into LLM-powered automated AI accelerator design. Furthermore, drawing inspiration from the above insights, we develop a framework called GPT4AIGChip, which features an automated demo-augmented prompt-generation pipeline utilizing in-context learning to guide LLMs towards creating high-quality AI accelerator design. To our knowledge, this work is the first to demonstrate an effective pipeline for LLM-powered automated AI accelerator generation. Accordingly, we anticipate that our insights and framework can serve as a catalyst for innovations in next-generation LLM-powered design automation tools.
    摘要 人工智能(AI)的出色能力和复杂性大大提高了对专用 AI 加速器的需求。然而,为各种 AI 任务设计加速器仍然耗费大量人力和时间。尽管现有的设计空间探索与自动化工具可以部分减少人工参与,但它们仍然需要相当多的硬件专业知识,这成为非专业人员的障碍,也制约了 AI 加速器的发展。受大型语言模型(LLM)依据自然语言指令生成高质量内容的惊人潜力的启发,我们开展这项工作,考察利用 LLM 自动化 AI 加速器设计的可能性。由此,我们开发了 GPT4AIGChip 框架,旨在通过使用人类自然语言而非领域专用语言,使 AI 加速器设计大众化。具体而言,我们首先深入考察了 LLM 在 AI 加速器设计方面的局限与能力,从而明确当前所处的位置,并为基于 LLM 的自动化加速器设计获得洞见。在这些洞见的启发下,我们开发了 GPT4AIGChip 框架,其核心是一条利用上下文学习、由演示增强的自动提示生成流水线,引导 LLM 生成高质量的 AI 加速器设计。据我们所知,这是首个展示 LLM 驱动的自动化 AI 加速器生成有效流程的工作。因此,我们期望这些洞见和框架能够成为下一代 LLM 驱动设计自动化工具创新的催化剂。

On the different regimes of Stochastic Gradient Descent

  • paper_url: http://arxiv.org/abs/2309.10688
  • repo_url: https://github.com/AntonioScl/regimes_of_SGD
  • paper_authors: Antonio Sclocchi, Matthieu Wyart
  • for: The paper is written to understand the dynamics of stochastic gradient descent (SGD) in training deep neural networks, and in particular to resolve the central challenge of understanding the cross-overs between SGD and gradient descent (GD) at large batch size. The analysis focuses on a teacher-student perceptron classification model, and the results are expected to apply to deep networks.
  • methods: The authors use a phase diagram in the $B$-$\eta$ plane to separate three dynamical phases: noise-dominated SGD, large-first-step-dominated SGD, and GD. The analysis reveals that the batch size $B^*$ separating regimes $\textit{(i)}$ and $\textit{(ii)}$ scales with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.
  • results: The authors obtain empirical results that support their key predictions and show the applicability of their analysis to deep networks. The phase diagram provides a framework for understanding the different regimes of generalization error.
    Abstract Modern deep networks are trained with stochastic gradient descent (SGD) whose key parameters are the number of data considered at each step or batch size $B$, and the step size or learning rate $\eta$. For small $B$ and large $\eta$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the `temperature' $T\equiv \eta/B$. Yet this description is observed to break down for sufficiently large batches $B\geq B^*$, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here we resolve these questions for a teacher-student perceptron classification model, and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$\eta$ plane that separates three dynamical phases: $\textit{(i)}$ a noise-dominated SGD governed by temperature, $\textit{(ii)}$ a large-first-step-dominated SGD and $\textit{(iii)}$ GD. These different phases also corresponds to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes $\textit{(i)}$ and $\textit{(ii)}$ scale with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.
    摘要 现代深度网络使用随机梯度下降(SGD)训练,其关键参数是每步使用的数据量(批大小)$B$ 和步长(学习率)$\eta$。当 $B$ 较小而 $\eta$ 较大时,SGD 相当于参数的一种随机演化,其噪声幅度由"温度" $T\equiv \eta/B$ 控制。然而,当批大小足够大($B\geq B^*$)时,这种描述会失效;而当温度足够小时,SGD 则简化为梯度下降(GD)。理解这些转变发生在何处仍是一个核心挑战。本文在教师-学生感知机分类模型上解决了这些问题,并通过实验表明我们的主要预测同样适用于深度网络。具体而言,我们得到了 $B$-$\eta$ 平面上的相图,它划分出三个动力学阶段:$\textit{(i)}$ 由温度支配的噪声主导 SGD;$\textit{(ii)}$ 由第一步大更新主导的 SGD;$\textit{(iii)}$ GD。这些不同的阶段也对应于不同的泛化误差区间。值得注意的是,我们的分析表明,划分阶段 $\textit{(i)}$ 与 $\textit{(ii)}$ 的批大小 $B^*$ 随训练集规模 $P$ 按幂律增长,其指数刻画了分类问题的难度。
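A toy experiment can make the role of the "temperature" $T=\eta/B$ tangible: the sketch below trains a teacher-student logistic perceptron with constant-step SGD at a few $(B,\eta)$ pairs; the model size, step count, and parameter choices are illustrative, not the paper's experimental setup.

```python
import numpy as np

def sgd_perceptron(P=2000, d=50, B=16, eta=0.5, steps=3000, seed=0):
    """Toy teacher-student logistic regression trained with constant-step SGD."""
    rng = np.random.default_rng(seed)
    teacher = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((P, d))
    y = np.sign(X @ teacher)

    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, P, size=B)          # mini-batch sampled with replacement
        margin = y[idx] * (X[idx] @ w)
        grad = -(y[idx] * (1.0 / (1.0 + np.exp(margin)))) @ X[idx] / B
        w -= eta * grad
    return np.mean(np.sign(X @ w) == y)           # training accuracy

# Same data, different (B, eta): the noise level of the dynamics tracks T = eta / B.
for B, eta in [(8, 0.5), (128, 0.5), (1024, 0.05)]:
    acc = sgd_perceptron(B=B, eta=eta)
    print(f"B={B:5d}  eta={eta:.2f}  T=eta/B={eta / B:.1e}  train acc={acc:.3f}")
```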

Oracle Complexity Reduction for Model-free LQR: A Stochastic Variance-Reduced Policy Gradient Approach

  • paper_url: http://arxiv.org/abs/2309.10679
  • repo_url: https://github.com/jd-anderson/lqr_svrpg
  • paper_authors: Leonardo F. Toso, Han Wang, James Anderson
  • for: 为无模型(model-free)离散时间线性二次调节器(LQR)问题学习 $\epsilon$-近似解。
  • methods: 采用随机方差缩减策略梯度(SVRPG)方法,在双环方差缩减算法中结合一点和两点成本估计。
  • results: 仅需 $O\left(\log\left(1/\epsilon\right)^{\beta}\right)$ 次两点成本查询($\beta \in (0,1)$)即可得到近似最优解。
    Abstract We investigate the problem of learning an $\epsilon$-approximate solution for the discrete-time Linear Quadratic Regulator (LQR) problem via a Stochastic Variance-Reduced Policy Gradient (SVRPG) approach. Whilst policy gradient methods have proven to converge linearly to the optimal solution of the model-free LQR problem, the substantial requirement for two-point cost queries in gradient estimations may be intractable, particularly in applications where obtaining cost function evaluations at two distinct control input configurations is exceptionally costly. To this end, we propose an oracle-efficient approach. Our method combines both one-point and two-point estimations in a dual-loop variance-reduced algorithm. It achieves an approximate optimal solution with only $O\left(\log\left(1/\epsilon\right)^{\beta}\right)$ two-point cost information for $\beta \in (0,1)$.
    摘要 我们研究通过随机方差缩减策略梯度(SVRPG)方法为离散时间线性二次调节器(LQR)问题学习 $\epsilon$-近似解。虽然策略梯度方法已被证明能线性收敛到无模型 LQR 问题的最优解,但梯度估计中对两点成本查询的大量需求可能难以满足,尤其是在获取两个不同控制输入配置下的成本评估代价极高的应用中。为此,我们提出一种对成本查询预言机高效的方法:在双环方差缩减算法中同时结合一点和两点估计。该方法仅需 $O\left(\log\left(1/\epsilon\right)^{\beta}\right)$($\beta \in (0,1)$)次两点成本信息即可得到近似最优解。
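The method mixes one-point and two-point cost queries for gradient estimation. The sketch below shows the standard sphere-smoothing versions of both zeroth-order estimators on a toy quadratic cost; the exact estimator form and the dual-loop schedule used in the paper may differ.

```python
import numpy as np

def unit_sphere_sample(dim, rng):
    u = rng.standard_normal(dim)
    return u / np.linalg.norm(u)

def one_point_estimate(cost, theta, r, rng):
    """One cost query per sample: cheaper oracle, higher variance."""
    d = theta.size
    u = unit_sphere_sample(d, rng)
    return (d / r) * cost(theta + r * u) * u

def two_point_estimate(cost, theta, r, rng):
    """Two cost queries per sample: much lower variance."""
    d = theta.size
    u = unit_sphere_sample(d, rng)
    return (d / (2 * r)) * (cost(theta + r * u) - cost(theta - r * u)) * u

# Toy quadratic "LQR-like" cost with known gradient 2*A*(theta - theta_star).
rng = np.random.default_rng(0)
A = 3.0
theta_star = np.array([1.0, -2.0, 0.5])
cost = lambda th: A * np.sum((th - theta_star) ** 2)

theta = np.zeros(3)
true_grad = 2 * A * (theta - theta_star)
g1 = np.mean([one_point_estimate(cost, theta, 0.05, rng) for _ in range(20000)], axis=0)
g2 = np.mean([two_point_estimate(cost, theta, 0.05, rng) for _ in range(200)], axis=0)
print("true:", true_grad, "\none-point avg:", g1.round(2), "\ntwo-point avg:", g2.round(2))
```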

Implementing a new fully stepwise decomposition-based sampling technique for the hybrid water level forecasting model in real-world application

  • paper_url: http://arxiv.org/abs/2309.10658
  • repo_url: None
  • paper_authors: Ziqian Zhang, Nana Bao, Xingting Yan, Aokai Zhu, Chenyang Li, Mingyu Liu
  • for: 这个研究是为了提高水位时间序列预测中的实际应用性,特别是使用分解方法来优化预测模型。
  • methods: 设计了新的全逐步分解(FSDB)抽样技术,并将其与变分模态分解(VMD)、奇异谱分析(SSA)等分解方法结合,以获得更好的预测性能。
  • results: 根据这个研究,使用FSDB抽样技术和VMD分析方法,水位时间序列预测中的Nash-Sutcliffe效率(NSE)系数在三个站点中提高了6.4%、28.8%和7.0%,相比之下,使用现有最先进的抽样技术时的NSE系数提高幅度较低。同时,使用SSA分析方法时,NSE系数在三个站点中提高了3.2%、3.1%和1.1%。
    Abstract Various time variant non-stationary signals need to be pre-processed properly in hydrological time series forecasting in real world, for example, predictions of water level. Decomposition method is a good candidate and widely used in such a pre-processing problem. However, decomposition methods with an inappropriate sampling technique may introduce future data which is not available in practical applications, and result in incorrect decomposition-based forecasting models. In this work, a novel Fully Stepwise Decomposition-Based (FSDB) sampling technique is well designed for the decomposition-based forecasting model, strictly avoiding introducing future information. This sampling technique with decomposition methods, such as Variational Mode Decomposition (VMD) and Singular spectrum analysis (SSA), is applied to predict water level time series in three different stations of Guoyang and Chaohu basins in China. Results of VMD-based hybrid model using FSDB sampling technique show that Nash-Sutcliffe Efficiency (NSE) coefficient is increased by 6.4%, 28.8% and 7.0% in three stations respectively, compared with those obtained from the currently most advanced sampling technique. In the meantime, for series of SSA-based experiments, NSE is increased by 3.2%, 3.1% and 1.1% respectively. We conclude that the newly developed FSDB sampling technique can be used to enhance the performance of decomposition-based hybrid model in water level time series forecasting in real world.
    摘要 在实际的水文时间序列预测(例如水位预测)中,各类时变非平稳信号需要经过恰当的预处理。分解方法是这种预处理问题中常用且有效的手段。然而,若分解方法采用了不恰当的抽样技术,就可能引入实际应用中无法获得的未来数据,从而得到不正确的基于分解的预测模型。本文为基于分解的预测模型精心设计了一种新的全逐步分解(FSDB)抽样技术,严格避免引入未来信息。我们将该抽样技术与变分模态分解(VMD)、奇异谱分析(SSA)等分解方法结合,用于预测中国 Guoyang 与 Chaohu 流域三个站点的水位时间序列。结果表明,采用 FSDB 抽样技术的 VMD 混合模型在三个站点的 Nash-Sutcliffe 效率系数(NSE)相比当前最先进的抽样技术分别提高了 6.4%、28.8% 和 7.0%;在基于 SSA 的系列实验中,NSE 分别提高了 3.2%、3.1% 和 1.1%。我们的结论是,新提出的 FSDB 抽样技术可用于提升基于分解的混合模型在实际水位时间序列预测中的性能。

Learning Adaptive Safety for Multi-Agent Systems

  • paper_url: http://arxiv.org/abs/2309.10657
  • repo_url: https://github.com/luigiberducci/learning_adaptive_safety
  • paper_authors: Luigi Berducci, Shuo Yang, Rahul Mangharam, Radu Grosu
  • for: 保障多体系统的安全性在具有限制信息的情况下是一项挑战,现有的方法往往假设其他代理人的行为,并且需要手动调整以保证安全、可行性和性能的平衡。本文探讨了基于控制障碍函数(CBF)的弹性安全学习方法。
  • methods: 本文提出了一种名为ASRL的可靠安全学习框架,可以自动优化策略和CBF系数,以提高安全性和长期性能。ASRL通过直接与其他代理人交互,学习与多种代理人行为相应,并保持成本违反在所需的限制下。
  • results: 本文在多机器人系统和竞争型多体系统中评估了ASRL,与学习基于和控制理论基于的方法进行比较。实验结果表明ASRL有效地增强了安全性和长期性能,并且可以适应各种非典型情况。代码和补充材料在线公开。
    Abstract Ensuring safety in dynamic multi-agent systems is challenging due to limited information about the other agents. Control Barrier Functions (CBFs) are showing promise for safety assurance but current methods make strong assumptions about other agents and often rely on manual tuning to balance safety, feasibility, and performance. In this work, we delve into the problem of adaptive safe learning for multi-agent systems with CBF. We show how emergent behavior can be profoundly influenced by the CBF configuration, highlighting the necessity for a responsive and dynamic approach to CBF design. We present ASRL, a novel adaptive safe RL framework, to fully automate the optimization of policy and CBF coefficients, to enhance safety and long-term performance through reinforcement learning. By directly interacting with the other agents, ASRL learns to cope with diverse agent behaviours and maintains the cost violations below a desired limit. We evaluate ASRL in a multi-robot system and a competitive multi-agent racing scenario, against learning-based and control-theoretic approaches. We empirically demonstrate the efficacy and flexibility of ASRL, and assess generalization and scalability to out-of-distribution scenarios. Code and supplementary material are public online.
    摘要 由于对其他智能体的信息有限,在动态多智能体系统中保障安全具有挑战性。控制障碍函数(CBF)在安全保障方面展现出潜力,但现有方法往往对其他智能体作出较强假设,且通常依赖人工调参来平衡安全性、可行性与性能。本文研究带 CBF 的多智能体系统的自适应安全学习问题。我们展示了 CBF 的配置会深刻影响系统的涌现行为,凸显了对 CBF 设计采用响应式、动态方法的必要性。我们提出 ASRL,一种新的自适应安全强化学习框架,通过强化学习完全自动地优化策略与 CBF 系数,以提升安全性和长期性能。通过与其他智能体直接交互,ASRL 学会应对多样的智能体行为,并将代价违约保持在期望限度之下。我们在多机器人系统和竞争性多智能体竞速场景中,将 ASRL 与基于学习和基于控制理论的方法进行了对比评估,实证展示了 ASRL 的有效性与灵活性,并考察了其对分布外场景的泛化与可扩展性。代码与补充材料已在线公开。

A spectrum of physics-informed Gaussian processes for regression in engineering

  • paper_url: http://arxiv.org/abs/2309.10656
  • repo_url: None
  • paper_authors: Elizabeth J Cross, Timothy J Rogers, Daniel J Pitchforth, Samuel J Gibson, Matthew R Jones
  • for: 提高使用有限数据进行预测模型的能力
  • methods: 结合机器学习技术和物理基础理解来提高预测模型的可靠性和可读性
  • results: 通过将物理基础理解与数据回归方法联系起来,可以大幅减少数据收集量,同时提高模型的解释性。
    Abstract Despite the growing availability of sensing and data in general, we remain unable to fully characterise many in-service engineering systems and structures from a purely data-driven approach. The vast data and resources available to capture human activity are unmatched in our engineered world, and, even in cases where data could be referred to as ``big,'' they will rarely hold information across operational windows or life spans. This paper pursues the combination of machine learning technology and physics-based reasoning to enhance our ability to make predictive models with limited data. By explicitly linking the physics-based view of stochastic processes with a data-based regression approach, a spectrum of possible Gaussian process models are introduced that enable the incorporation of different levels of expert knowledge of a system. Examples illustrate how these approaches can significantly reduce reliance on data collection whilst also increasing the interpretability of the model, another important consideration in this context.
    摘要 尽管传感与数据的可得性不断提高,我们仍然难以仅凭数据驱动的方法完整刻画许多在役工程系统和结构。用于记录人类活动的数据与资源之丰富,是工程领域无法比拟的;即便在数据可称"海量"的情况下,它们也很少覆盖完整的运行工况或全寿命周期。本文将机器学习技术与基于物理的推理相结合,以提升在数据有限时建立预测模型的能力。通过把随机过程的物理视角与基于数据的回归方法显式联系起来,本文引入了一族高斯过程模型,可以纳入不同程度的系统专家知识。算例表明,这类方法能够显著减少对数据采集的依赖,同时提高模型的可解释性,而可解释性在该场景下同样是重要考量。
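One simple way to blend physics with a Gaussian process, in the spirit of the abstract, is to use a physics-based model as the GP prior mean and let the GP absorb the residual. The numpy sketch below does exactly that with a placeholder linear "physics" model and an RBF kernel; it is a generic illustration rather than any of the specific models in the paper.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def physics_mean(x, stiffness=2.0):
    """Placeholder physics model: a linear load-deflection law y = x / stiffness."""
    return x / stiffness

def gp_posterior(x_train, y_train, x_test, noise=0.05):
    """GP regression on the residual y - m(x), so the prior mean encodes the physics."""
    resid = y_train - physics_mean(x_train)
    K = rbf_kernel(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    Ks = rbf_kernel(x_test, x_train)
    alpha = np.linalg.solve(K, resid)
    mean = physics_mean(x_test) + Ks @ alpha
    cov = rbf_kernel(x_test, x_test) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0, None))

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 5, 8)                              # few, expensive measurements
y_tr = x_tr / 2.0 + 0.3 * np.sin(2 * x_tr) + rng.normal(0, 0.05, 8)
x_te = np.linspace(0, 6, 50)
mu, sd = gp_posterior(x_tr, y_tr, x_te)
print(mu[:5].round(3), sd[:5].round(3))   # far from data, the prediction reverts to the physics mean
```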

An Extendable Python Implementation of Robust Optimisation Monte Carlo

  • paper_url: http://arxiv.org/abs/2309.10612
  • repo_url: None
  • paper_authors: Vasilis Gkolemis, Michael Gutmann, Henri Pesonen
  • For: 这篇论文的目的是提出一个基于Monte Carlo的likelihood-free inference(LFI)方法,并实现其在Python的ELFI套件中。* Methods: 这篇论文使用了一种名为Robust Optimisation Monte Carlo(ROMC)的LFI方法,ROMC是一个新的、高度平行化的LFI框架,可以实现精确的 posterior 权重样本。* Results: 这篇论文的实现方法可以在两种方式下使用:一、科学家可以直接使用它作为一个出厂装置的LFI算法;二、研究者可以将ROMC分解为单独的部分,并让其在完全平行化的方式下运行,以便进一步的扩展和改进。
    Abstract Performing inference in statistical models with an intractable likelihood is challenging, therefore, most likelihood-free inference (LFI) methods encounter accuracy and efficiency limitations. In this paper, we present the implementation of the LFI method Robust Optimisation Monte Carlo (ROMC) in the Python package ELFI. ROMC is a novel and efficient (highly-parallelizable) LFI framework that provides accurate weighted samples from the posterior. Our implementation can be used in two ways. First, a scientist may use it as an out-of-the-box LFI algorithm; we provide an easy-to-use API harmonized with the principles of ELFI, enabling effortless comparisons with the rest of the methods included in the package. Additionally, we have carefully split ROMC into isolated components for supporting extensibility. A researcher may experiment with novel method(s) for solving part(s) of ROMC without reimplementing everything from scratch. In both scenarios, the ROMC parts can run in a fully-parallelized manner, exploiting all CPU cores. We also provide helpful functionalities for (i) inspecting the inference process and (ii) evaluating the obtained samples. Finally, we test the robustness of our implementation on some typical LFI examples.
    摘要 在似然不可解的统计模型中进行推断具有挑战性,因此大多数无似然推断(LFI)方法都面临精度与效率的限制。本文介绍了 LFI 方法 Robust Optimisation Monte Carlo(ROMC)在 Python 套件 ELFI 中的实现。ROMC 是一种新颖且高效(高度可并行)的 LFI 框架,能够从后验分布中获得准确的加权样本。我们的实现可以两种方式使用:其一,科学家可以把它当作开箱即用的 LFI 算法;我们提供了与 ELFI 设计原则一致、易于使用的 API,便于与套件中其他方法进行比较。其二,我们将 ROMC 仔细拆分为相互独立的组件以支持扩展,研究人员可以针对 ROMC 的某些部分尝试新方法,而无需从头重新实现整个流程。在这两种情形下,ROMC 的各个部分都可以完全并行运行,利用所有 CPU 核心。我们还提供了辅助功能,用于(i)检视推断过程以及(ii)评估所得样本。最后,我们在若干典型的 LFI 例子上检验了实现的稳健性。

Asteroids co-orbital motion classification based on Machine Learning

  • paper_url: http://arxiv.org/abs/2309.10603
  • repo_url: None
  • paper_authors: Giulia Ciacci, Andrea Barucci, Sara Di Ruzza, Elisa Maria Alessi
  • for: 本研究探讨了使用机器学习分类asteroids在与一个 planet 的共轨运动中。
  • methods: 我们考虑了四种不同的运动方式,包括Tadpole、Horseshoe和Quasi-satellite,并构建了三个数据集(Real、Ideal和Perturbed)来训练和测试机器学习算法。
  • results: 我们的结果表明,机器学习算法能够正确地分类时间序列,并且性能非常高。
    Abstract In this work, we explore how to classify asteroids in co-orbital motion with a given planet using Machine Learning. We consider four different kinds of motion in mean motion resonance with the planet, nominally Tadpole, Horseshoe and Quasi-satellite, building 3 datasets defined as Real (taking the ephemerides of real asteroids from the JPL Horizons system), Ideal and Perturbed (both simulated, obtained by propagating initial conditions considering two different dynamical systems) for training and testing the Machine Learning algorithms in different conditions. The time series of the variable theta (angle related to the resonance) are studied with a data analysis pipeline defined ad hoc for the problem and composed by: data creation and annotation, time series features extraction thanks to the tsfresh package (potentially followed by selection and standardization) and the application of Machine Learning algorithms for Dimensionality Reduction and Classification. Such approach, based on features extracted from the time series, allows to work with a smaller number of data with respect to Deep Learning algorithms, also allowing to define a ranking of the importance of the features. Physical Interpretability of the features is another key point of this approach. In addition, we introduce the SHapley Additive exPlanations for Explainability technique. Different training and test sets are used, in order to understand the power and the limits of our approach. The results show how the algorithms are able to identify and classify correctly the time series, with a high degree of performance.
    摘要 在这项工作中,我们探索如何使用机器学习分类 asteroids 在与给定的 planet 的共轨运动中。我们考虑了四种不同的运动,包括 Tadpole、Horseshoe 和 Quasi-satellite,并构建了三个数据集:Real(使用 JPL Horizons 系统中的真实小行星轨道数据)、Ideal 和 Perturbed(两者都是通过初始条件的传播而生成的,以模拟不同的动力系统),用于训练和测试机器学习算法。我们使用自定义的数据分析管道来研究时间序列中的θ(与共轨运动相关的角度)。该管道包括:数据创建和注释、时间序列特征提取(使用 tsfresh 包)以及机器学习算法的维度减少和分类。这种方法,基于时间序列特征,允许我们使用较少的数据量与深度学习算法进行比较,同时允许我们定义特征的排名和物理可解性。此外,我们还引入 SHapley Additive exPlanations for Explainability 技术。为了了解我们的方法的力量和局限,我们使用了不同的训练和测试集。结果显示,我们的算法能够正确地识别和分类时间序列,性能非常高。
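The pipeline extracts tsfresh features from the resonant-angle time series and feeds them to a classifier. A compact sketch of that pipeline on synthetic librating vs. circulating angles is given below; the feature settings, classifier, and synthetic series are assumptions made only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from sklearn.ensemble import RandomForestClassifier

# Long-format table of resonant-angle time series, one id per (synthetic) asteroid:
# librating series stand in for tadpole/horseshoe-like motion, circulating ones for the rest.
rng = np.random.default_rng(0)
frames, labels = [], {}
for i in range(40):
    t = np.arange(500)
    librating = i % 2 == 0
    theta = 60 + 15 * np.sin(0.05 * t) if librating else (0.7 * t) % 360
    theta = theta + rng.normal(0, 1.0, t.size)
    labels[i] = int(librating)
    frames.append(pd.DataFrame({"id": i, "time": t, "theta": theta}))
long_df = pd.concat(frames, ignore_index=True)

# Feature extraction (MinimalFCParameters keeps the example fast), then a simple classifier.
X = extract_features(long_df, column_id="id", column_sort="time",
                     default_fc_parameters=MinimalFCParameters()).fillna(0.0)
y = pd.Series(labels).loc[X.index]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```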

Motif-Centric Representation Learning for Symbolic Music

  • paper_url: http://arxiv.org/abs/2309.10597
  • repo_url: None
  • paper_authors: Yuxuan Wu, Roger B. Dannenberg, Gus Xia
  • for: 本研究旨在 computational modeling of music motifs, 以便自动音乐作曲和音乐信息检索。
  • methods: 采用 Siamese 网络架构以及"预训练-微调"流程,通过表示学习刻画动机及其变体之间的隐式关系;预训练使用 VICReg 方法,微调使用对比学习。
  • results: 实验结果表明这两种方法相互补充,在基于检索的任务上将精确率-召回率曲线下面积提高了 12.6%。最后,我们对学到的动机表示进行了可视化,以便更直观地理解乐曲的整体结构。据我们所知,这项工作是音乐动机计算建模的重要一步,为动机在自动作曲和音乐信息检索中的后续应用奠定了基础。
    Abstract Music motif, as a conceptual building block of composition, is crucial for music structure analysis and automatic composition. While human listeners can identify motifs easily, existing computational models fall short in representing motifs and their developments. The reason is that the nature of motifs is implicit, and the diversity of motif variations extends beyond simple repetitions and modulations. In this study, we aim to learn the implicit relationship between motifs and their variations via representation learning, using the Siamese network architecture and a pretraining and fine-tuning pipeline. A regularization-based method, VICReg, is adopted for pretraining, while contrastive learning is used for fine-tuning. Experimental results on a retrieval-based task show that these two methods complement each other, yielding an improvement of 12.6% in the area under the precision-recall curve. Lastly, we visualize the acquired motif representations, offering an intuitive comprehension of the overall structure of a music piece. As far as we know, this work marks a noteworthy step forward in computational modeling of music motifs. We believe that this work lays the foundations for future applications of motifs in automatic music composition and music information retrieval.
    摘要 音乐动机(motif)作为作曲的概念性构件,对音乐结构分析和自动作曲都至关重要。人类听众可以轻松识别动机,但现有计算模型在表示动机及其发展方面仍有不足,原因在于动机的本质是隐式的,而动机变体的多样性远超简单的重复和移调。在本研究中,我们通过表示学习来学习动机与其变体之间的隐式关系,采用 Siamese 网络架构以及预训练加微调的流程:预训练采用基于正则化的 VICReg 方法,微调采用对比学习。在基于检索的任务上,实验结果表明这两种方法相互补充,将精确率-召回率曲线下面积提高了 12.6%。最后,我们对学到的动机表示进行可视化,为理解乐曲的整体结构提供了直观的认识。据我们所知,这项工作标志着音乐动机计算建模向前迈出了值得注意的一步。我们相信它为动机在自动作曲和音乐信息检索中的后续应用奠定了基础。
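For reference, a compact PyTorch sketch of the VICReg objective used in the pretraining stage is shown below (invariance, variance, and covariance terms); the loss coefficients are the commonly quoted defaults and are not necessarily the ones used in the paper.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_coef=25.0, var_coef=25.0, cov_coef=1.0, gamma=1.0):
    """VICReg objective on two batches of embeddings (two views of the same motifs)."""
    n, d = z_a.shape

    # Invariance: the two views of a motif should map to nearby embeddings.
    sim = F.mse_loss(z_a, z_b)

    # Variance: keep each embedding dimension's std above gamma (anti-collapse).
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    var = torch.mean(F.relu(gamma - std_a)) + torch.mean(F.relu(gamma - std_b))

    # Covariance: decorrelate embedding dimensions.
    def cov_term(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    cov = cov_term(z_a) + cov_term(z_b)
    return sim_coef * sim + var_coef * var + cov_coef * cov

# Two augmented "views" of the same batch of motif embeddings (random stand-ins).
z1 = torch.randn(64, 128, requires_grad=True)
z2 = z1 + 0.1 * torch.randn(64, 128)
loss = vicreg_loss(z1, z2)
loss.backward()
print(float(loss))
```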

Task Graph offloading via Deep Reinforcement Learning in Mobile Edge Computing

  • paper_url: http://arxiv.org/abs/2309.10569
  • repo_url: None
  • paper_authors: Jiagang Liu, Yun Mi, Xinyu Zhang
  • for: 本文主要研究适用于移动边缘计算(MEC)中任务图Offloading,以提高用户体验。
  • methods: 本文使用了Markov决策过程(MDP)模型来调度任务,并采用了深度学习算法(SATA-DRL)来学习任务调度策略。
  • results: 对比 existed 策略,SATA-DRL 能够更好地减少平均延迟和死线违反率。
    Abstract Various mobile applications that comprise dependent tasks are gaining widespread popularity and are increasingly complex. These applications often have low-latency requirements, resulting in a significant surge in demand for computing resources. With the emergence of mobile edge computing (MEC), it becomes the most significant issue to offload the application tasks onto small-scale devices deployed at the edge of the mobile network for obtaining a high-quality user experience. However, since the environment of MEC is dynamic, most existing works focusing on task graph offloading, which rely heavily on expert knowledge or accurate analytical models, fail to fully adapt to such environmental changes, resulting in the reduction of user experience. This paper investigates the task graph offloading in MEC, considering the time-varying computation capabilities of edge computing devices. To adapt to environmental changes, we model the task graph scheduling for computation offloading as a Markov Decision Process (MDP). Then, we design a deep reinforcement learning algorithm (SATA-DRL) to learn the task scheduling strategy from the interaction with the environment, to improve user experience. Extensive simulations validate that SATA-DRL is superior to existing strategies in terms of reducing average makespan and deadline violation.
    摘要 各类由相互依赖的任务组成的移动应用正日益普及,且越来越复杂。这些应用通常具有低延迟要求,导致计算资源的需求激增。随着移动边缘计算(MEC)的出现,如何将应用任务卸载到部署在移动网络边缘的小规模设备上以获得高质量的用户体验,成为最关键的问题。然而,由于 MEC 环境是动态的,现有大多数关于任务图卸载的工作严重依赖专家知识或精确的解析模型,难以充分适应环境变化,从而导致用户体验下降。本文研究 MEC 中的任务图卸载问题,并考虑边缘计算设备计算能力随时间变化的情形。为适应环境变化,我们将用于计算卸载的任务图调度建模为马尔可夫决策过程(MDP),并设计了深度强化学习算法(SATA-DRL),通过与环境交互来学习任务调度策略,以改善用户体验。大量仿真表明,SATA-DRL 在降低平均完工时间和截止期限违约率方面优于现有策略。

  • paper_url: http://arxiv.org/abs/2309.10563
  • repo_url: None
  • paper_authors: Nishchal Prasad, Mohand Boughanem, Taoufik Dkaki
  • for: 法律判决预测和其解释受到长案文档超过万个字的问题困扰,特别是没有结构化注释的案文。我们定义这个问题为“缺乏注释法律文档”,并explore其缺乏结构信息和长文档的问题。
  • methods: 我们使用了一种深度学习基于分类框架MESc,将案文分解成多个部分,从自定义精心调整的大语言模型的最后四层中提取其嵌入,并使用无监督归一化来近似结构。然后,我们使用另一组变换器Encoder层来学习间隔表示。
  • results: 我们发现了大语言模型(GPT-Neo和GPT-J)在法律文档上的适应性和内部领域(法律)的转移学习能力。此外,我们还比较了MESc和这些模型的性能,以及将嵌入组合在最后几层中的影响。为 Hierarchical模型,我们也提出了一种解释抽取算法 named ORSE,即Occlusion sensitivity-based Relevant Sentence Extractor。
    Abstract Automatic legal judgment prediction and its explanation suffer from the problem of long case documents exceeding tens of thousands of words, in general, and having a non-uniform structure. Predicting judgments from such documents and extracting their explanation becomes a challenging task, more so on documents with no structural annotation. We define this problem as "scarce annotated legal documents" and explore their lack of structural information and their long lengths with a deep learning-based classification framework which we call MESc; "Multi-stage Encoder-based Supervised with-clustering"; for judgment prediction. Specifically, we divide a document into parts to extract their embeddings from the last four layers of a custom fine-tuned Large Language Model, and try to approximate their structure through unsupervised clustering. Which we use in another set of transformer encoder layers to learn the inter-chunk representations. We explore the adaptability of LLMs with multi-billion parameters (GPT-Neo, and GPT-J) to legal texts and their intra-domain(legal) transfer learning capacity. Alongside this, we compare their performance with MESc and the impact of combining embeddings from their last layers. For such hierarchical models, we also propose an explanation extraction algorithm named ORSE; Occlusion sensitivity-based Relevant Sentence Extractor;
    摘要 自动化法律判断预测和其解释受到长案文档超过万字的问题困扰,其中文档结构不均匀,预测判断和提取解释变得非常困难。我们称这个问题为“缺乏注释法律文档”,并explore其缺乏结构信息和长度的问题,以及无结构注释的文档。我们使用一种深度学习基于分类的框架,称之为MESc,以预测判断。具体来说,我们将文档分成多个部分,从最后四层的自定义微调Large Language Model中提取嵌入,并使用无结构注释的 clustering来approximate结构。然后,我们使用另一组 transformer encoder层来学习间键表示。此外,我们还 explore了LLMs的多亿 Parameter(GPT-Neo和GPT-J)在法律文本上的适应性和内部领域(legal)的传输学习能力。此外,我们还比较了MESc和这些模型的性能,以及将嵌入从最后几层 combine的影响。对于层次模型,我们还提出了一种解释抽取算法,称之为ORSE,即Occlusion sensitivity-based Relevant Sentence Extractor。
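A minimal sketch of the chunk-embed-cluster idea is shown below: split a long judgment into chunks, embed each chunk, and approximate the missing structure with k-means over the chunk embeddings. The embed_chunk function is a hypothetical stand-in for the fine-tuned LLM's last-four-layer embeddings, used only so the sketch runs end to end.

```python
import numpy as np
from sklearn.cluster import KMeans

def chunk_document(text, chunk_len=512):
    """Split a long, unannotated judgment into fixed-length chunks of whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_len]) for i in range(0, len(tokens), chunk_len)]

def embed_chunk(chunk, dim=256):
    """Hypothetical stand-in for a chunk embedding.

    In the framework above this would come from the last four layers of a fine-tuned
    large language model; here it is a deterministic pseudo-random vector keyed on
    the chunk text, purely to keep the example self-contained.
    """
    rng = np.random.default_rng(abs(hash(chunk)) % (2 ** 32))
    return rng.standard_normal(dim)

document = " ".join(f"token{i % 97}" for i in range(6000))   # placeholder for a long case document
chunks = chunk_document(document)
chunk_embs = np.stack([embed_chunk(c) for c in chunks])

# Unsupervised clustering over chunk embeddings approximates the missing structural
# annotation; cluster ids and embeddings can then feed the inter-chunk encoder layers.
k = min(4, len(chunks))
structure_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(chunk_embs)
print(len(chunks), structure_ids)
```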

Hybrid State Space-based Learning for Sequential Data Prediction with Joint Optimization

  • paper_url: http://arxiv.org/abs/2309.10553
  • repo_url: None
  • paper_authors: Mustafa E. Aydın, Arda Fazla, Suleyman S. Kozat
  • for: 该 paper investigate 非线性预测/回归在在线环境中,并提出了一种混合模型,可以有效地解决传统非线性预测模型中的域pecificFeature工程问题,并实现了有效的非线性和线性组件混合。
  • methods: 该 paper 使用了回归结构来提取Raw序列序列中的特征,并使用了传统的线性时间序列模型来处理时间序列数据中的特有性,如季节性和趋势。而不同于现有的集成或混合模型,我们在一个单一的过程中,jointly optimize 一个加强的杜林 ней网络 (LSTM) для自动特征提取和一个 ARMA 家族时间序列模型 (SARIMAX) 来有效地处理时间序列数据。
  • results: 我们在 widely 公布的实验数据集上表现出了显著的改进,并在 Code 中开源了我们的代码,以便进一步研究和复现我们的结果。
    Abstract We investigate nonlinear prediction/regression in an online setting and introduce a hybrid model that effectively mitigates, via a joint mechanism through a state space formulation, the need for domain-specific feature engineering issues of conventional nonlinear prediction models and achieves an efficient mix of nonlinear and linear components. In particular, we use recursive structures to extract features from raw sequential sequences and a traditional linear time series model to deal with the intricacies of the sequential data, e.g., seasonality, trends. The state-of-the-art ensemble or hybrid models typically train the base models in a disjoint manner, which is not only time consuming but also sub-optimal due to the separation of modeling or independent training. In contrast, as the first time in the literature, we jointly optimize an enhanced recurrent neural network (LSTM) for automatic feature extraction from raw data and an ARMA-family time series model (SARIMAX) for effectively addressing peculiarities associated with time series data. We achieve this by introducing novel state space representations for the base models, which are then combined to provide a full state space representation of the hybrid or the ensemble. Hence, we are able to jointly optimize both models in a single pass via particle filtering, for which we also provide the update equations. The introduced architecture is generic so that one can use other recurrent architectures, e.g., GRUs, traditional time series-specific models, e.g., ETS or other optimization methods, e.g., EKF, UKF. Due to such novel combination and joint optimization, we demonstrate significant improvements in widely publicized real life competition datasets. We also openly share our code for further research and replicability of our results.
    摘要 我们调查线性预测/回归在线上环境中,并提出了一种混合模型,可以有效地减少传统非线性预测模型中的域专特性工程问题,并 achiev 了一个有效的非线性和线性元件混合。具体来说,我们使用回传结构来从原始的序列序列中提取特征,并使用传统的线性时间序列模型来处理时间序列数据中的特有性,例如季节性和趋势。现有的ensemble或混合模型通常会在分开的方式训练基本模型,这不仅是时间耗费大的,而且也是不佳的,因为它们的模型化或独立训练。相反地,我们是第一次在文献中使用一个强化的长期内部遮蔽网络(LSTM)来自动提取特征,并使用ARMA家族时间序列模型(SARIMAX)来有效地处理时间序列数据中的特有性。我们通过引入新的状态空间表示来融合这两个基本模型,然后将它们联合以提供一个完整的状态空间表示。因此,我们可以在单一通过粒子统计来协同优化这两个模型。我们的架构是通用的,可以使用其他回归架构,例如GRU,传统时间序列特定模型,例如ETS,或其他优化方法,例如EKF、UKF。因为我们的新的结构和协同优化,我们在广泛公开的真实生活竞赛数据中展示了明显的改善。我们还公开了我们的代码,以便进一步的研究和我们的结果的重现。

Love or Hate? Share or Split? Privacy-Preserving Training Using Split Learning and Homomorphic Encryption

  • paper_url: http://arxiv.org/abs/2309.10517
  • repo_url: https://github.com/khoaguin/hesplitnet
  • paper_authors: Tanveer Khan, Khoa Nguyen, Antonis Michalas, Alexandros Bakas
  • for: 这篇论文针对拆分学习(Split Learning)中的隐私问题提出了一个新的解决方案。
  • methods: 本论文的方法基于 U 型拆分学习(U-shaped Split Learning),并使用同态加密(Homomorphic Encryption)来保护用户隐私。
  • results: 本论文的结果显示,使用HE数据进行U型分布式学习只会导致精度下降2.65%,并且保护了原始训练数据的隐私。
    Abstract Split learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies homomorphic encryption on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.
    摘要 分离学习(SL)是一种新的合作学习技术,允许参与者(例如客户和服务器)无需共享原始数据来训练机器学习模型。在这种设定下,客户首先应用其部分机器学习模型于原始数据,生成活动图并将其发送给服务器继续训练过程。先前的工作表明,重建活动图可能导致客户数据泄露。此外,现有的防范措施可以减少SL中客户数据隐私泄露的影响,但是这些措施在准确性方面表现不佳。在这篇论文中,我们超越先前的工作,基于U型SL构建了一种具有同质 encrypting 数据的协议。具体来说,在我们的方法中,客户将 homomorphic encryption 应用于活动图,以保护用户隐私。这是一项重要的改进,可以减少客户数据泄露的风险,相比其他SL基于的工作。最后,我们的结果表明,在U型SL设定下,使用HE数据进行训练,只减少准确性比例2.65%,与平文训练相比。此外,原始数据隐私得到保护。

Learning End-to-End Channel Coding with Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.10505
  • repo_url: None
  • paper_authors: Muah Kim, Rick Fritschek, Rafael F. Schaefer
  • for: 这个论文的目的是提出一种基于扩散模型的频道编码器训练方法,以解决深度学习中的频道模型整合问题。
  • methods: 这个论文使用了扩散模型来近似频道分布,并提出了一种高效的训练算法。
  • results: 实验结果表明,扩散模型可以准确地学习频道分布,并在不同的频道模型下实现近乎最佳的端到端符号错误率。此外,扩散模型还具有较好的抗随机性和鲁棒性。
    Abstract The training of neural encoders via deep learning necessitates a differentiable channel model due to the backpropagation algorithm. This requirement can be sidestepped by approximating either the channel distribution or its gradient through pilot signals in real-world scenarios. The initial approach draws upon the latest advancements in image generation, utilizing generative adversarial networks (GANs) or their enhanced variants to generate channel distributions. In this paper, we address this channel approximation challenge with diffusion models, which have demonstrated high sample quality in image generation. We offer an end-to-end channel coding framework underpinned by diffusion models and propose an efficient training algorithm. Our simulations with various channel models establish that our diffusion models learn the channel distribution accurately, thereby achieving near-optimal end-to-end symbol error rates (SERs). We also note a significant advantage of diffusion models: A robust generalization capability in high signal-to-noise ratio regions, in contrast to GAN variants that suffer from error floor. Furthermore, we examine the trade-off between sample quality and sampling speed, when an accelerated sampling algorithm is deployed, and investigate the effect of the noise scheduling on this trade-off. With an apt choice of noise scheduling, sampling time can be significantly reduced with a minor increase in SER.
    摘要 neural encoder 的训练通过深度学习需要可导通道模型,这是因为权重传播算法的需求。这种要求可以通过在实际场景中使用启动信号来缓解。首先,我们使用最新的图像生成技术,如生成对抗网络(GANs)或其改进版本,来生成通道分布。在这篇论文中,我们使用扩散模型来解决通道 aproximation 挑战。我们提出了基于扩散模型的端到端通道编码框架,并提出了高效的训练算法。我们的 simulations 表明,我们的扩散模型可以准确地学习通道分布,并实现near-optimal的端到端符号错误率(SER)。此外,我们注意到扩散模型的一个重要优点:在高信号噪比区域中具有robust的总体化能力,与GAN变体不同,后者在错误 floor 方面受到影响。此外,我们还研究了样本质量和抽取速度之间的质量,以及降低SER的代价。通过适当的噪声调度,抽取时间可以被显著减少,但是SER的增加很小。

A comparative study of Grid and Natural sentences effects on Normal-to-Lombard conversion

  • paper_url: http://arxiv.org/abs/2309.10485
  • repo_url: None
  • paper_authors: Hongyang Chen, Yuhong Yang, Qingmu Liu, Baifeng Li, Weiping Tu, Song Lin
  • for: This paper is written to investigate the effectiveness of Normal-to-Lombard models in improving natural speech intelligibility in real-world applications, using a parallel Lombard corpus (Lombard Chinese TIMIT, LCT) and a comparison with a standard grid sentence corpus (Enhanced MAndarin Lombard Grid, EMALG).
  • methods: The paper uses the parallel Lombard corpus (LCT) and a comparison with the grid sentence corpus (EMALG) to evaluate the Lombard effect and Normal-to-Lombard conversion in natural and grid sentences. The authors also use a subjective intelligibility assessment across genders and Signal-to-Noise Ratios to evaluate the performance of a StarGAN model trained on EMALG.
  • results: The paper finds that both natural and grid sentences exhibit similar changes in parameters as the noise level increases, but grid sentences show a greater increase in the alpha ratio. The StarGAN model trained on EMALG consistently outperforms the model trained on LCT in terms of improving intelligibility, which may be attributed to EMALG's larger alpha ratio increase from normal to Lombard speech.
    Abstract Grid sentence is commonly used for studying the Lombard effect and Normal-to-Lombard conversion. However, it's unclear if Normal-to-Lombard models trained on grid sentences are sufficient for improving natural speech intelligibility in real-world applications. This paper presents the recording of a parallel Lombard corpus (called Lombard Chinese TIMIT, LCT) extracting natural sentences from Chinese TIMIT. Then We compare natural and grid sentences in terms of Lombard effect and Normal-to-Lombard conversion using LCT and Enhanced MAndarin Lombard Grid corpus (EMALG). Through a parametric analysis of the Lombard effect, We find that as the noise level increases, both natural sentences and grid sentences exhibit similar changes in parameters, but in terms of the increase of the alpha ratio, grid sentences show a greater increase. Following a subjective intelligibility assessment across genders and Signal-to-Noise Ratios, the StarGAN model trained on EMALG consistently outperforms the model trained on LCT in terms of improving intelligibility. This superior performance may be attributed to EMALG's larger alpha ratio increase from normal to Lombard speech.
    摘要 Grid sentence 通常用于研究洛伯特效应和正常到洛伯特转换。然而,不清楚的是,正常到洛伯特模型在格子句子上训练后是否能提高实际应用中的自然语音 inteligibility。这篇论文介绍了一个平行的洛伯特 corpus(名为洛伯特中文 TIMIT, LCT),从中文 TIMIT 中提取了自然句子。然后,我们比较了自然句子和格子句子在洛伯特效应和正常到洛伯特转换方面的不同,使用 LCT 和增强 Mandarin Lombard Grid corpora(EMALG)。通过对洛伯特效应的参数分析,我们发现,随着噪音水平的增加,自然句子和格子句子都会出现类似的参数变化,但是在 alpha 比率的增加方面,格子句子显示更大的增加。进一步,我们对 gender 和 Signal-to-Noise Ratio 不同的人进行主观的听力评估,发现 StarGAN 模型在 EMALG 上训练后在 inteligibility 方面一直表现出优于在 LCT 上训练的模型。这种更好的性能可能是因为 EMALG 中 alpha 比率在正常到洛伯特语音转换时的更大增加。
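The alpha ratio referenced in the analysis compares high-band to low-band spectral energy. The sketch below computes it from a waveform with numpy, using the common 50 Hz-1 kHz vs. 1-5 kHz band split; the exact band edges and windowing used in the paper are assumptions here.

```python
import numpy as np

def alpha_ratio_db(signal, sr, low_band=(50, 1000), high_band=(1000, 5000)):
    """Alpha ratio: high-band to low-band energy ratio in dB, computed from the spectrum."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)

    def band_energy(lo, hi):
        mask = (freqs >= lo) & (freqs < hi)
        return np.sum(spec[mask]) + 1e-12          # guard against empty bands

    return 10.0 * np.log10(band_energy(*high_band) / band_energy(*low_band))

# Toy "speech-like" signals: Lombard speech shifts energy towards higher frequencies.
sr = 16000
t = np.arange(0, 1.0, 1.0 / sr)
normal = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 2500 * t)
lombard = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)

print("normal  alpha ratio:", round(alpha_ratio_db(normal, sr), 1), "dB")
print("lombard alpha ratio:", round(alpha_ratio_db(lombard, sr), 1), "dB")
```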

Ad-load Balancing via Off-policy Learning in a Content Marketplace

  • paper_url: http://arxiv.org/abs/2309.11518
  • repo_url: None
  • paper_authors: Hitesh Sagtani, Madan Jhawar, Rishabh Mehrotra, Olivier Jeunen
  • for: 优化社交媒体平台在线广告系统中的广告负载均衡,在维护用户体验的同时最大化用户满意度和广告收益。
  • methods: 利用离策略学习(off-policy learning)和已记录的 bandit 反馈来评估并优化广告负载。
  • results: 在大规模 A/B 实验中,借助离策略学习与无偏估计器(如逆倾向得分 IPS 和双重稳健 DR),实现了用户满意度指标和平台广告收益的同时提升。
    Abstract Ad-load balancing is a critical challenge in online advertising systems, particularly in the context of social media platforms, where the goal is to maximize user engagement and revenue while maintaining a satisfactory user experience. This requires the optimization of conflicting objectives, such as user satisfaction and ads revenue. Traditional approaches to ad-load balancing rely on static allocation policies, which fail to adapt to changing user preferences and contextual factors. In this paper, we present an approach that leverages off-policy learning and evaluation from logged bandit feedback. We start by presenting a motivating analysis of the ad-load balancing problem, highlighting the conflicting objectives between user satisfaction and ads revenue. We emphasize the nuances that arise due to user heterogeneity and the dependence on the user's position within a session. Based on this analysis, we define the problem as determining the optimal ad-load for a particular feed fetch. To tackle this problem, we propose an off-policy learning framework that leverages unbiased estimators such as Inverse Propensity Scoring (IPS) and Doubly Robust (DR) to learn and estimate the policy values using offline collected stochastic data. We present insights from online A/B experiments deployed at scale across over 80 million users generating over 200 million sessions, where we find statistically significant improvements in both user satisfaction metrics and ads revenue for the platform.
    摘要 在在线广告系统中,广告负载均衡是一项关键挑战,尤其是在社交媒体平台上:目标是在保证用户体验的同时最大化用户参与度和广告收益,这需要在用户满意度与广告收益等相互冲突的目标之间进行权衡。传统的负载均衡方法采用静态分配策略,无法适应用户偏好和上下文因素的变化。在本文中,我们提出一种利用离策略学习以及基于已记录 bandit 反馈进行评估的方法。我们首先对广告负载均衡问题进行动机分析,指出用户满意度与广告收益之间的冲突,并强调用户异质性以及用户在会话中所处位置带来的细微差别。基于这一分析,我们将问题定义为确定某次 feed 拉取的最优广告负载。为解决该问题,我们提出一个离策略学习框架,利用逆倾向得分(IPS)和双重稳健(DR)等无偏估计器,基于离线收集的随机化数据学习并估计策略价值。我们给出了在超过 8000 万用户、逾 2 亿会话上大规模部署的在线 A/B 实验结果,发现用户满意度指标和平台广告收益均取得了统计显著的提升。
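The off-policy evaluation step relies on unbiased estimators such as IPS and DR. The toy sketch below shows both estimators on logged bandit feedback for a synthetic ad-load problem; the reward model, policies, and numbers are synthetic placeholders, not the platform's data.

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Inverse Propensity Scoring: reweight logged rewards by the policy ratio."""
    w = target_probs / logging_probs
    return np.mean(w * rewards)

def dr_estimate(rewards, logging_probs, target_probs, q_hat_logged, q_hat_target):
    """Doubly Robust: a reward model q_hat plus an IPS correction on its residual."""
    w = target_probs / logging_probs
    return np.mean(q_hat_target + w * (rewards - q_hat_logged))

# Logged data: for each session, the ad-load chosen by the logging policy,
# its propensity, and the observed reward (e.g., a satisfaction/revenue blend).
rng = np.random.default_rng(0)
n = 100_000
actions = rng.integers(0, 3, n)                      # ad-loads {0, 1, 2}
logging_probs = np.full(n, 1 / 3)                    # uniform logging policy
true_value = np.array([0.30, 0.50, 0.35])            # unknown per-ad-load value
rewards = rng.binomial(1, true_value[actions]).astype(float)

# Target policy: always pick ad-load 1; its probability of the logged action.
target_probs = (actions == 1).astype(float)

q_hat = np.array([0.28, 0.47, 0.40])                 # an (imperfect) reward model
v_ips = ips_estimate(rewards, logging_probs, target_probs)
v_dr = dr_estimate(rewards, logging_probs, target_probs,
                   q_hat_logged=q_hat[actions], q_hat_target=np.full(n, q_hat[1]))
print(f"IPS: {v_ips:.3f}  DR: {v_dr:.3f}  (true value of target policy: 0.50)")
```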

Coreset selection can accelerate quantum machine learning models with provable generalization

  • paper_url: http://arxiv.org/abs/2309.10441
  • repo_url: None
  • paper_authors: Yiming Huang, Huiyuan Wang, Yuxuan Du, Xiao Yuan
  • for: 提高量子机器学习模型的训练效率,使其在大规模数据集上达到相同的性能水平。
  • methods: 采用核心集选择法,从原始训练集中选择一个judicious subsets,以加速量子神经网络和量子核心的训练。
  • results: 通过系统的数值实验,展示了核心集选择法在多种量子机器学习任务中的潜在效果,包括合成数据分类、量子关联识别和量子编译。
    Abstract Quantum neural networks (QNNs) and quantum kernels stand as prominent figures in the realm of quantum machine learning, poised to leverage the nascent capabilities of near-term quantum computers to surmount classical machine learning challenges. Nonetheless, the training efficiency challenge poses a limitation on both QNNs and quantum kernels, curbing their efficacy when applied to extensive datasets. To confront this concern, we present a unified approach: coreset selection, aimed at expediting the training of QNNs and quantum kernels by distilling a judicious subset from the original training dataset. Furthermore, we analyze the generalization error bounds of QNNs and quantum kernels when trained on such coresets, unveiling the comparable performance with those training on the complete original dataset. Through systematic numerical simulations, we illuminate the potential of coreset selection in expediting tasks encompassing synthetic data classification, identification of quantum correlations, and quantum compiling. Our work offers a useful way to improve diverse quantum machine learning models with a theoretical guarantee while reducing the training cost.
    摘要 量子神经网络(QNN)和量子kernels作为量子机器学习领域的代表性 figma,潜在地利用近期量子计算机的潜在能力,以超越 классиical机器学习挑战。然而,训练效率问题成为了QNN和量子kernels的限制因素,在处理大规模数据时减少了它们的效果。为了解决这个问题,我们提出了一种统一方法:核心选择,旨在加速QNN和量子kernels的训练过程,通过筛选judicioussubset从原始训练集。此外,我们分析了QNN和量子kernels在 Such coresets上进行训练时的泛化误差 bound,发现它们与原始数据集上进行训练时的性能相似。通过系统的数字实验,我们描述了核心选择在加速Synthetic数据分类、量子相关性识别和量子编译等任务中的潜在优势。我们的工作提供了一种可靠的方法来改进多种量子机器学习模型,同时降低训练成本。
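
The abstract does not spell out the selection rule, so the sketch below uses a generic k-center greedy criterion as a stand-in for "distilling a judicious subset"; the feature matrix, coreset size, and the choice of heuristic are assumptions for illustration only.

```python
import numpy as np

def kcenter_greedy_coreset(features: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Pick m indices whose points cover the dataset well (k-center greedy heuristic)."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]
    # Distance of every point to its nearest already-selected point.
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))            # the farthest point joins the coreset
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return np.array(selected)

# Toy usage: shrink a 1000-sample training set to 50 points before (Q)NN/kernel training.
X = np.random.default_rng(1).normal(size=(1000, 8))
idx = kcenter_greedy_coreset(X, m=50)
X_core = X[idx]
print(X_core.shape)  # (50, 8)
```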

Graph Neural Networks for Dynamic Modeling of Roller Bearing

  • paper_url: http://arxiv.org/abs/2309.10418
  • repo_url: None
  • paper_authors: Vinay Sharma, Jens Ravesloot, Cees Taal, Olga Fink
  • for: 预测滚珠式机械的动态行为
  • methods: 使用图 neuronal networks (GNNs) 模型,利用图表示滚珠机械的组件之间的复杂关系和互动
  • results: 通过测试不同的滚珠机械配置,证明 GNN 模型在准确预测滚珠机械的动态行为方面具有良好的学习和泛化能力,表明其在实时监测旋转机械的健康状况中具有潜在的应用前景。
    Abstract In the presented work, we propose to apply the framework of graph neural networks (GNNs) to predict the dynamics of a rolling element bearing. This approach offers generalizability and interpretability, having the potential for scalable use in real-time operational digital twin systems for monitoring the health state of rotating machines. By representing the bearing's components as nodes in a graph, the GNN can effectively model the complex relationships and interactions among them. We utilize a dynamic spring-mass-damper model of a bearing to generate the training data for the GNN. In this model, discrete masses represent bearing components such as rolling elements, inner raceways, and outer raceways, while a Hertzian contact model is employed to calculate the forces between these components. We evaluate the learning and generalization capabilities of the proposed GNN framework by testing different bearing configurations that deviate from the training configurations. Through this approach, we demonstrate the effectiveness of the GNN-based method in accurately predicting the dynamics of rolling element bearings, highlighting its potential for real-time health monitoring of rotating machinery.
    摘要 在提出的工作中,我们提议使用图 neural network (GNN) 模型来预测滚动元件机械的动态行为。这种方法具有普适性和可解释性,有可能在实时操作中的数字双系统中进行扩展使用,以监测旋转机器的健康状态。通过在图中表示机械的组件为节点,GNN 可以有效地模型机械中复杂的关系和互动。我们利用一种动态春荷振荷模型来生成训练数据,其中离散质量表示机械中的滚动元件、内环和外环等组件,而哈特兹振荷模型则用于计算这些组件之间的力。我们通过测试不同的机械配置来评估 GNN 模型的学习和泛化能力。通过这种方法,我们证明了 GNN 模型在准确预测滚动元件机械的动态行为方面的效iveness,这种方法还有可能在实时监测旋转机器的健康状态。

A Variational Auto-Encoder Enabled Multi-Band Channel Prediction Scheme for Indoor Localization

  • paper_url: http://arxiv.org/abs/2309.12200
  • repo_url: None
  • paper_authors: Ruihao Yuan, Kaixuan Huang, Pan Yang, Shunqing Zhang
  • for: 提高室内定位精度,服务于虚拟/增强现实和智能家居等前沿应用。
  • methods: 从频域出发,预测另一传输信道的信道状态信息(CSI)值,并将多频段信息拼接以提高室内定位精度。
  • results: 在 COST 2100 仿真数据和办公室场景采集的实时正交频分复用(OFDM)WiFi 数据上测试了所提方案,获得了更精确的室内定位结果。
    Abstract Indoor localization is getting increasing demands for various cutting-edged technologies, like Virtual/Augmented reality and smart home. Traditional model-based localization suffers from significant computational overhead, so fingerprint localization is getting increasing attention, which needs lower computation cost after the fingerprint database is built. However, the accuracy of indoor localization is limited by the complicated indoor environment which brings the multipath signal refraction. In this paper, we provided a scheme to improve the accuracy of indoor fingerprint localization from the frequency domain by predicting the channel state information (CSI) values from another transmitting channel and spliced the multi-band information together to get more precise localization results. We tested our proposed scheme on COST 2100 simulation data and real time orthogonal frequency division multiplexing (OFDM) WiFi data collected from an office scenario.
    摘要 室内定位技术正在不断受到不同的前沿技术的需求,如虚拟/增强现实和智能家居。传统的模型基地定位技术具有较高的计算开销,因此脸部特征定位技术在受到越来越多的关注,需要降低计算成本后,脸部特征定位技术可以在室内环境中提供更高的精度。然而,室内环境的复杂性使得多Path信号折射带来了局部定位精度的限制。本文提出了一种从频域提高室内脸部特征定位精度的方案,通过预测另一个传输通道的通道状态信息(CSI)值,并将多频信息组合在一起,以获得更加精确的定位结果。我们在COST 2100 simulatedata和实时的orthogonal frequency division multiplexing(OFDM)WiFi数据中测试了我们的提议方案,并取得了更好的定位精度。

Minimum width for universal approximation using ReLU networks on compact domain

  • paper_url: http://arxiv.org/abs/2309.10402
  • repo_url: None
  • paper_authors: Namjun Kim, Chanho Min, Sejun Park
  • for: 这个论文研究了宽度约束的网络的通用近似性Property,作为深度约束网络的对偶。
  • methods: 此前已有多项工作试图刻画使通用近似性成立的最小宽度 $w_{\min}$,但只有少数给出精确值;本文针对紧致定义域与 ReLU 类激活函数给出了精确刻画。
  • results: 作者证明,将 $[0,1]^{d_x}$ 上的 $L^p$ 函数近似到 $\mathbb R^{d_y}$ 所需的最小宽度恰为 $\max\{d_x,d_y,2\}$,低于定义域为 $\mathbb R^{d_x}$ 时已知的结果 $w_{\min}=\max\{d_x+1,d_y\}$。此外,作者还给出了一致近似情形下 $w_{\min}$ 的下界:当 $d_x<d_y\le 2d_x$ 时 $w_{\min}\ge d_y+1$。
    Abstract The universal approximation property of width-bounded networks has been studied as a dual of the classical universal approximation theorem for depth-bounded ones. There were several attempts to characterize the minimum width $w_{\min}$ enabling the universal approximation property; however, only a few of them found the exact values. In this work, we show that the minimum width for the universal approximation of $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$ is exactly $\max\{d_x,d_y,2\}$ if an activation function is ReLU-Like (e.g., ReLU, GELU, Softplus). Compared to the known result $w_{\min}=\max\{d_x+1,d_y\}$ when the domain is $\mathbb R^{d_x}$, our result first shows that approximation on a compact domain requires smaller width than on $\mathbb R^{d_x}$. We next prove a lower bound on $w_{\min}$ for uniform approximation using general activation functions including ReLU: $w_{\min}\ge d_y+1$ if $d_x<d_y\le 2d_x$.
    摘要 宽度受限网络的通用近似性被视为经典的深度受限网络通用近似定理的对偶问题。此前已有多项工作尝试刻画使通用近似性成立的最小宽度 $w_{\min}$,但只有少数给出精确值。In this work, we show that the minimum width for the universal approximation of $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$ is exactly $\max\{d_x,d_y,2\}$ if the activation function is ReLU-Like (e.g., ReLU, GELU, Softplus). Compared to the known result $w_{\min}=\max\{d_x+1,d_y\}$ when the domain is $\mathbb R^{d_x}$, our result first shows that approximation on a compact domain requires smaller width than on $\mathbb R^{d_x}$. We next prove a lower bound on $w_{\min}$ for uniform approximation using general activation functions including ReLU: $w_{\min}\ge d_y+1$ if $d_x<d_y\le 2d_x$.
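
For readability, the width results quoted above can be collected in a single display (restating the abstract: the compact domain is $[0,1]^{d_x}$ and the exact value holds for ReLU-like activations):

```latex
% Exact minimum width for L^p universal approximation on a compact domain
% (ReLU-like activations), versus the known result on the whole space:
\[
  w_{\min}\bigl(L^p([0,1]^{d_x}\!\to\mathbb{R}^{d_y})\bigr) = \max\{d_x,\, d_y,\, 2\},
  \qquad
  w_{\min}\bigl(L^p(\mathbb{R}^{d_x}\!\to\mathbb{R}^{d_y})\bigr) = \max\{d_x+1,\, d_y\}.
\]
% Lower bound for uniform approximation with general activations (including ReLU):
\[
  w_{\min} \ge d_y + 1 \quad \text{if } d_x < d_y \le 2 d_x .
\]
```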

Differentiable Quantum Architecture Search for Quantum Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.10392
  • repo_url: None
  • paper_authors: Yize Sun, Yunpu Ma, Volker Tresp
  • for: 该研究的目标是确定 whether DQAS 可以应用于量子深 Q-学习问题。
  • methods: 该研究使用了一种梯度基于的架构搜索框架 DQAS,并在两个不同的环境 - 护套和冰湖上进行了评估。
  • results: 实验结果显示 DQAS 可以自动和高效地设计量子电路,并且在训练过程中,自动创建的电路表现较为出色。
    Abstract Differentiable quantum architecture search (DQAS) is a gradient-based framework to design quantum circuits automatically in the NISQ era. It was motivated by such as low fidelity of quantum hardware, low flexibility of circuit architecture, high circuit design cost, barren plateau (BP) problem, and periodicity of weights. People used it to address error mitigation, unitary decomposition, and quantum approximation optimization problems based on fixed datasets. Quantum reinforcement learning (QRL) is a part of quantum machine learning and often has various data. QRL usually uses a manually designed circuit. However, the pre-defined circuit needs more flexibility for different tasks, and the circuit design based on various datasets could become intractable in the case of a large circuit. The problem of whether DQAS can be applied to quantum deep Q-learning with various datasets is still open. The main target of this work is to discover the capability of DQAS to solve quantum deep Q-learning problems. We apply a gradient-based framework DQAS on reinforcement learning tasks and evaluate it in two different environments - cart pole and frozen lake. It contains input- and output weights, progressive search, and other new features. The experiments conclude that DQAS can design quantum circuits automatically and efficiently. The evaluation results show significant outperformance compared to the manually designed circuit. Furthermore, the performance of the automatically created circuit depends on whether the super-circuit learned well during the training process. This work is the first to show that gradient-based quantum architecture search is applicable to QRL tasks.
    摘要 “差分可求量化架构搜寻(DQAS)是一个基于梯度的框架,用于自动设计量子Circuit在NISQ时代。它受到了低精度量子硬件、固定Circuit架构、高Circuit设计成本、梯度扁平(BP)问题以及periodicity of weights等因素的激励。人们使用它来解决错误补偿、单元分解和量子近似优化问题,基于固定数据集。量子回传学习(QRL)通常使用预先设计的Circuit,但这种预先设计的Circuit需要更多的灵活性以应对不同的任务,而且基于不同的数据集进行Circuit设计可能会成为实际问题。这篇研究的主要目标是探索DQAS能否应用于量子深度Q-学习问题。我们将DQAS应用到了循环学习任务上,并在滑车和冻湖两个不同环境中进行评估。实验结果显示DQAS可以自动设计量子Circuit,效率高。评估结果显示DQAS可以对量子深度Q-学习问题提供明显的超越性。此外,自动创建的Circuit表现取决于在训练过程中超Circuit是否从良好的学习。这是首次显示出gradient-based量子架构搜寻可以应用到QRL任务。”

Testable Likelihoods for Beyond-the-Standard Model Fits

  • paper_url: http://arxiv.org/abs/2309.10365
  • repo_url: None
  • paper_authors: Anja Beck, Méril Reboud, Danny van Dyk
  • for: 该论文旨在研究精度前沿处的 BSM 效应,这需要将低能测量中的信息准确地传递到高能 BSM 模型。
  • methods: 该论文使用归一化流(normalizing flow)构建似然函数以实现这种传递。如此构建的似然函数可以生成额外样本,并允许以 $\chi^2$ 统计量形式进行"平凡"的拟合优度检验。
  • results: 该论文研究了一种特定的归一化流,将其应用于一个多模态且非高斯的例子,并量化了似然函数及其检验统计量的准确性。
    Abstract Studying potential BSM effects at the precision frontier requires accurate transfer of information from low-energy measurements to high-energy BSM models. We propose to use normalising flows to construct likelihood functions that achieve this transfer. Likelihood functions constructed in this way provide the means to generate additional samples and admit a ``trivial'' goodness-of-fit test in form of a $\chi^2$ test statistic. Here, we study a particular form of normalising flow, apply it to a multi-modal and non-Gaussian example, and quantify the accuracy of the likelihood function and its test statistic.
    摘要 研究精度前沿处潜在的 BSM 效应,需要将低能测量中的信息准确地传递到高能 BSM 模型。我们提议使用归一化流来构建实现这种传递的似然函数。通过这种方式构建的似然函数可以生成额外样本,并允许以 $\chi^2$ 统计量形式进行"平凡"的拟合优度检验。在这里,我们研究了一种特定形式的归一化流,将其应用于一个多模态且非高斯的例子,并量化似然函数及其检验统计量的准确性。
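
The "trivial" $\chi^2$ goodness-of-fit test follows from the flow mapping data to its Gaussian base distribution: the squared norm of a latent sample is then $\chi^2$-distributed with $d$ degrees of freedom. A minimal sketch, with a hand-written affine map standing in for a trained flow:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n = 3, 5_000

# Pretend these are low-energy observables; a trained flow f maps them to N(0, I).
# Here the "flow" is the exact affine whitening transform of the generating process.
A = np.array([[1.0, 0.3, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.5]])
mu = np.array([0.5, -1.0, 2.0])
x = rng.standard_normal((n, d)) @ A.T + mu        # "measurements"

def flow_forward(x):
    # Inverse of the affine map above; a real flow would be a trained neural network.
    return np.linalg.solve(A, (x - mu).T).T

z = flow_forward(x)
t = np.sum(z**2, axis=1)                          # chi^2 test statistic per sample

# Under a correct likelihood, t ~ chi^2(d); a KS test against that law checks the fit.
ks = stats.kstest(t, stats.chi2(df=d).cdf)
print(f"KS statistic {ks.statistic:.4f}, p-value {ks.pvalue:.3f}")
```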

Striking a Balance: An Optimal Mechanism Design for Heterogenous Differentially Private Data Acquisition for Logistic Regression

  • paper_url: http://arxiv.org/abs/2309.10340
  • repo_url: None
  • paper_authors: Ameya Anjarlekar, Rasoul Etesami, R. Srikant
  • for: 这篇论文是为了解决在隐私敏感的卖家数据上进行逻辑回归的问题。由于数据是隐私的,卖家需要通过支付来提供数据,因此目标是设计一种机制,可以平衡多个目标,包括测试损失、卖家隐私和支付。
  • methods: 该论文使用了游戏理论、统计学学习理论和隐私保护来解决这个问题。在解决方案中,我们使用了变量转换来将买家的目标函数转换为凸函数,从而使问题可以被 convexified。
  • results: 我们提供了一些各向异性结果,包括测试错误和支付量在数据量很大时的极限分布。此外,我们还应用了我们的想法到一个实际的医疗数据集中,以便展示我们的方法在实际应用中的效果。
    Abstract We investigate the problem of performing logistic regression on data collected from privacy-sensitive sellers. Since the data is private, sellers must be incentivized through payments to provide their data. Thus, the goal is to design a mechanism that optimizes a weighted combination of test loss, seller privacy, and payment, i.e., strikes a balance between multiple objectives of interest. We solve the problem by combining ideas from game theory, statistical learning theory, and differential privacy. The buyer's objective function can be highly non-convex. However, we show that, under certain conditions on the problem parameters, the problem can be convexified by using a change of variables. We also provide asymptotic results characterizing the buyer's test error and payments when the number of sellers becomes large. Finally, we demonstrate our ideas by applying them to a real healthcare data set.
    摘要 我们研究在收集来自隐私敏感卖家的数据上进行逻辑回归的问题。由于数据是私有的,需要通过支付来激励卖家提供数据。因此,我们的目标是设计一种机制,优化测试损失、卖家隐私和支付之间的加权组合,即在多个目标之间取得平衡。我们结合博弈论、统计学习理论和差分隐私来解决这个问题。买家的目标函数可能是高度非凸的;然而我们证明,在某些问题参数条件下,可以通过变量替换将问题凸化。我们还给出了当卖家数量趋于无穷时买家测试误差和支付的渐近刻画,并将我们的方法应用于一个真实的医疗数据集。

Computational Approaches for App-to-App Retrieval and Design Consistency Check

  • paper_url: http://arxiv.org/abs/2309.10328
  • repo_url: None
  • paper_authors: Seokhyeon Park, Wonjae Kim, Young-Ho Kim, Jinwook Seo
  • for: 这paper的目的是提出一种基于大规模网络图像训练的视觉模型,用于从零开始提取移动用户界面(UI)的含义表示,并用于设计决策过程中的 Computational design support tools。
  • methods: 这paper使用的方法包括使用大规模网络图像训练的视觉模型,以及使用数学基础的方法来实现应用程序之间的比较和设计一致性分析。
  • results: 实验结果显示,该方法不仅超越了先前的提取模型,还启用了多种新的应用程序。
    Abstract Extracting semantic representations from mobile user interfaces (UI) and using the representations for designers' decision-making processes have shown the potential to be effective computational design support tools. Current approaches rely on machine learning models trained on small-sized mobile UI datasets to extract semantic vectors and use screenshot-to-screenshot comparison to retrieve similar-looking UIs given query screenshots. However, the usability of these methods is limited because they are often not open-sourced and have complex training pipelines for practitioners to follow, and are unable to perform screenshot set-to-set (i.e., app-to-app) retrieval. To this end, we (1) employ visual models trained with large web-scale images and test whether they could extract a UI representation in a zero-shot way and outperform existing specialized models, and (2) use mathematically founded methods to enable app-to-app retrieval and design consistency analysis. Our experiments show that our methods not only improve upon previous retrieval models but also enable multiple new applications.
    摘要 现有方法依赖在小规模移动用户界面(UI)数据集上训练的机器学习模型来提取语义向量,并通过截图与截图的比较来检索与查询截图相似的 UI。然而,这些方法的可用性有限:它们通常不开源、训练流程复杂,且无法进行截图集合到集合(即应用到应用)的检索。为了解决这些问题,我们采用以下两种方法:
1. 使用在大规模网络图像上训练的视觉模型,验证它们能否以零样本方式提取 UI 表示,并超越现有的专用模型。
2. 使用有数学基础的方法来实现应用到应用的检索和设计一致性分析。
我们的实验结果表明,我们的方法不仅超越了现有的检索模型,还使多种新应用成为可能。
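
A sketch of zero-shot app-to-app retrieval under stated assumptions: `embed_screenshot` is a placeholder for whatever frozen web-scale vision encoder is used, and the symmetric best-match average is one simple, mathematically grounded set-to-set score, not necessarily the paper's exact choice.

```python
import numpy as np

def embed_screenshot(image: np.ndarray) -> np.ndarray:
    """Placeholder for a pretrained web-scale vision encoder applied zero-shot.

    In practice this would be a frozen image encoder; here we return a
    deterministic random unit vector so the example stays self-contained.
    """
    rng = np.random.default_rng(abs(hash(image.tobytes())) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def app_similarity(screens_a, screens_b) -> float:
    """Symmetric set-to-set similarity between two apps' screenshot sets."""
    A = np.stack([embed_screenshot(s) for s in screens_a])
    B = np.stack([embed_screenshot(s) for s in screens_b])
    sims = A @ B.T                                   # pairwise cosine similarities
    # For each screen, how well is it matched by the other app's best screen?
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

app1 = [np.zeros((64, 64, 3), dtype=np.uint8) + i for i in range(3)]
app2 = [np.zeros((64, 64, 3), dtype=np.uint8) + i for i in range(2, 5)]
print(f"app-to-app similarity: {app_similarity(app1, app2):.3f}")
```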

TensorCodec: Compact Lossy Compression of Tensors without Strong Data Assumptions

  • paper_url: http://arxiv.org/abs/2309.10310
  • repo_url: https://github.com/kbrother/tensorcodec
  • paper_authors: Taehyung Kwon, Jihoon Ko, Jinhong Jung, Kijung Shin
  • for: 这个论文的目的是提出一种lossy压缩算法 для一般的tensor,以提高压缩率和准确性。
  • methods: 这个算法使用了三个关键想法:首先, integrate a recurrent neural network into Tensor-Train Decomposition,以增强其表达力和降低低级假设的限制。其次,折叠输入tensor到更高级tensor,以降低NTTD所需的空间。最后,重新排序输入tensor的模式索引,以便通过NTTD进行更好的预测。
  • results: 该算法可以达到以下三个目标:(a) concise:它可以提供7.38倍的更紧凑的压缩,与相同的重建错误相比;(b) accurate:给定相同的压缩大小预算,它可以提供3.33倍的更准确的重建,与相同的重建错误相比;(c) scalable:其实际压缩时间是线性增长,并且每个Entry的重建时间是对数增长。
    Abstract Many real-world datasets are represented as tensors, i.e., multi-dimensional arrays of numerical values. Storing them without compression often requires substantial space, which grows exponentially with the order. While many tensor compression algorithms are available, many of them rely on strong data assumptions regarding its order, sparsity, rank, and smoothness. In this work, we propose TENSORCODEC, a lossy compression algorithm for general tensors that do not necessarily adhere to strong input data assumptions. TENSORCODEC incorporates three key ideas. The first idea is Neural Tensor-Train Decomposition (NTTD) where we integrate a recurrent neural network into Tensor-Train Decomposition to enhance its expressive power and alleviate the limitations imposed by the low-rank assumption. Another idea is to fold the input tensor into a higher-order tensor to reduce the space required by NTTD. Finally, the mode indices of the input tensor are reordered to reveal patterns that can be exploited by NTTD for improved approximation. Our analysis and experiments on 8 real-world datasets demonstrate that TENSORCODEC is (a) Concise: it gives up to 7.38x more compact compression than the best competitor with similar reconstruction error, (b) Accurate: given the same budget for compressed size, it yields up to 3.33x more accurate reconstruction than the best competitor, (c) Scalable: its empirical compression time is linear in the number of tensor entries, and it reconstructs each entry in logarithmic time. Our code and datasets are available at https://github.com/kbrother/TensorCodec.
    摘要 许多实际数据集都是tensor的形式,即多维数值数组。不压缩存储这些tensor可能需要巨大的存储空间,其增长为 exponent。虽然有很多tensor压缩算法可用,但是许多它们假设输入tensor的级数、稀疏性、核心级和平滑性。在这种工作中,我们提出了TENSORCODEC,一种lossy压缩算法,用于通用的tensor,不 necesarily遵循强大的输入数据假设。TENSORCODEC包括三个关键想法:首先,我们将输入tensor integrate到一个循环神经网络中,以提高表达力并缓解低级假设的限制。其次,我们将输入tensor折叠成更高级的tensor,以降低存储空间的需求。最后,我们重新排序输入tensor的模式索引,以便NTTD可以更好地利用这些模式进行改进的approximation。我们的分析和实验表明,TENSORCODEC具有以下特点:(a) Concise:它可以提供与最佳竞争对手相同的压缩比,但是具有7.38倍的容器大小;(b) Accurate:给定相同的压缩容器大小,它可以提供3.33倍的更高精度重建结果;(c) Scalable:其实际压缩时间是线性增长的tensor入口数量,并且每个入口的重建时间是对数增长的。我们的代码和数据集可以在https://github.com/kbrother/TensorCodec中获得。
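
For context, the sketch below shows the plain Tensor-Train entry reconstruction that NTTD builds on; the recurrent generation of TT cores, the folding step, and the mode-index reordering described in the abstract are not reproduced here.

```python
import numpy as np

def tt_entry(cores, index) -> float:
    """Reconstruct one tensor entry from Tensor-Train cores.

    cores[k] has shape (r_k, n_k, r_{k+1}) with r_0 = r_N = 1, so an entry is a
    product of small matrices -- the reason TT-style decompositions (and
    TensorCodec's neural variant) can answer per-entry queries cheaply.
    """
    out = np.ones((1, 1))
    for core, i in zip(cores, index):
        out = out @ core[:, i, :]
    return float(out[0, 0])

# Toy 8 x 8 x 8 tensor with TT ranks (1, 2, 2, 1).
rng = np.random.default_rng(0)
shapes = [(1, 8, 2), (2, 8, 2), (2, 8, 1)]
cores = [rng.normal(size=s) for s in shapes]
print(tt_entry(cores, (3, 0, 7)))
```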

Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

  • paper_url: http://arxiv.org/abs/2309.10301
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Keru Wu, Yuansi Chen, Wooseok Ha, Bin Yu
  • for: The paper focuses on the assumption of conditionally invariant components (CICs) in domain adaptation (DA) and explores their role in providing target risk guarantees.
  • methods: The paper proposes a new algorithm, importance-weighted conditional invariant penalty (IW-CIP), based on CICs, which has target risk guarantees beyond simple settings such as covariate shift and label shift. Additionally, the paper shows that CICs help identify large discrepancies between the source and target risks of other DA algorithms.
  • results: The paper demonstrates the effectiveness of the proposed algorithm and its theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, and Camelyon17. In particular, incorporating CICs into the domain invariant projection (DIP) algorithm addresses its failure scenario caused by label-flipping features.
    Abstract Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, and Camelyon17 datasets.
    摘要 域适应(DA)是一个统计学习问题,其中源数据用于训练模型的分布与目标数据用于评估模型的分布不同。虽然许多DA算法已经显示了较好的实际成果,但是盲目地应用这些算法可能会导致新的数据集上的性能更差。为了解决这问题,需要清楚地明确DA算法在目标数据上的性能假设。在这种工作中,我们关注了conditionally invariant component(CIC)的存在,它们在预测中 relevante和目标数据中具有条件不变性。我们示出了CICs可以通过conditional invariant penalty(CIP)来估计,它们在域适应中提供了三个主要的角色:首先,我们提出了一种基于CICs的新算法,即importance-weighted conditional invariant penalty(IW-CIP),它在更复杂的设置中,如covariate shift和label shift,具有更多的目标风险保证。其次,我们表明CICs可以帮助标识源和目标数据之间的大量差异。最后,我们示出了在DIP算法中包含CICs可以解决因为label-flipping特征而导致的失败情况。我们通过synthetic数据、MNIST、CelebA和Camelyon17 dataset的numerical experiments支持我们的新算法和理论发现。

Learning Orbitally Stable Systems for Diagrammatically Teaching

  • paper_url: http://arxiv.org/abs/2309.10298
  • repo_url: None
  • paper_authors: Weiming Zhi, Kangni Liu, Tianyi Zhang, Matthew Johnson-Roberson
  • for: 教授机器人新技能,使其能够接近某一表面并在其上执行循环运动,循环模式可由用户通过一张 2D 草图任意指定。
  • methods: 用户在机器人相机图像上绘制 2D 草图来塑造机器人的运动;方法将运动建模为轨道渐近稳定(O.A.S.)动力系统,并通过可微同胚(diffeomorphism)对已知 O.A.S. 系统进行变形。
  • results: 仿真与真机实验表明,该方法能够以很高的精度通过草图教授复杂的循环运动模式。
    Abstract Diagrammatic Teaching is a paradigm for robots to acquire novel skills, whereby the user provides 2D sketches over images of the scene to shape the robot's motion. In this work, we tackle the problem of teaching a robot to approach a surface and then follow cyclic motion on it, where the cycle of the motion can be arbitrarily specified by a single user-provided sketch over an image from the robot's camera. Accordingly, we introduce the \emph{Stable Diffeomorphic Diagrammatic Teaching} (SDDT) framework. SDDT models the robot's motion as an \emph{Orbitally Asymptotically Stable} (O.A.S.) dynamical system that learns to follow the user-specified sketch. This is achieved by applying a \emph{diffeomorphism}, i.e. a differentiable and invertible function, to morph a known O.A.S. system. The parameterised diffeomorphism is then optimised with respect to the Hausdorff distance between the limit cycle of our modelled system and the sketch, to produce the desired robot motion. We provide theoretical insight into the behaviour of the optimised system and also empirically evaluate SDDT, both in simulation and on a quadruped with a mounted 6-DOF manipulator. Results show that we can diagrammatically teach complex cyclic motion patterns with a high degree of accuracy.
    摘要 图形教学是一种 робоット学习新技能的方法,其中用户提供的2D图形将影响机器人的运动。在这项工作中,我们解决了教育机器人接近场景中的表面并跟踪循环运动的问题,其中循环运动的周期可以通过用户提供的一个图形来定义。因此,我们介绍了稳定幂函数减杂教学框架(SDDT)。 SDDT将机器人的运动模型为一个稳定幂函数系统,该系统学习从用户提供的图形中学习循环运动。我们通过应用一个幂函数,即一个可导和反函数,将一个已知稳定幂函数系统变换为我们的模型系统。然后,我们对这个参数化的幂函数进行优化,使其与图形中的循环运动的 Hausdorff 距离最小化,以生成所需的机器人运动。我们提供了对优化后的系统行为的理论分析,以及在实验中对 SDDT 的评估,包括在模拟环境和一只四脚机器人上安装了六度 freedom 抓取机的实验。结果表明,我们可以使用图形来教学机器人复杂的循环运动模式,并且具有高度准确性。
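
A toy illustration of warping an orbitally asymptotically stable system with a diffeomorphism: a Hopf-style oscillator (stable unit limit cycle) is pushed through a fixed smooth invertible map. In SDDT the map is parameterized and optimized against the user's sketch via the Hausdorff distance, which is not reproduced here; the oscillator and warp below are illustrative stand-ins.

```python
import numpy as np

def hopf_step(x, dt=0.01, mu=1.0, omega=2.0):
    """One Euler step of a Hopf-style oscillator with a stable unit limit cycle."""
    r2 = x[0]**2 + x[1]**2
    dx = np.array([(mu - r2) * x[0] - omega * x[1],
                   (mu - r2) * x[1] + omega * x[0]])
    return x + dt * dx

def diffeo(x, a=0.6):
    """A smooth invertible warp (toy stand-in for the learned diffeomorphism)."""
    # Triangular structure (x[1] unchanged) makes the map invertible for any a.
    return np.array([x[0] + a * np.tanh(x[1]), x[1]])

# Roll out the base system and push the trajectory through the warp; the image of
# an O.A.S. limit cycle under a diffeomorphism is again an O.A.S. limit cycle.
x = np.array([0.1, 0.0])
traj = []
for _ in range(5000):
    x = hopf_step(x)
    traj.append(diffeo(x))
traj = np.array(traj)
print("warped cycle bounding box:", traj.min(axis=0), traj.max(axis=0))
```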

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

  • paper_url: http://arxiv.org/abs/2309.10285
  • repo_url: https://github.com/alibabaresearch/flash-llm
  • paper_authors: Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song
  • for: 大型生成模型的快速部署和高效执行,特别是在高性能但高度限制的tensor核心硬件上。
  • methods: 提出了一种基于 Load-as-Sparse 和 Compute-as-Dense 方法的 Flash-LLM 框架,以优化 tensor 核心硬件上的大型生成模型执行。
  • results: Flash-LLM 在 SpMM 层次上至少比 state-of-the-art 库 Sputnik 和 SparTA 快速得多,在 OPT-30B/66B/175B 模型上达到了最高的 tokens per GPU-second 性能,与 DeepSpeed 和 FasterTransformer 相比,具有显著的性能提升和较低的执行成本。
    Abstract With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common approach to reduce both GPU memory footprint and the overall computation while retaining good model accuracy. However, the existing solutions do not provide a highly-efficient support for handling unstructured sparsity on modern GPUs, especially on the highly-structured Tensor Core hardware. Therefore, we propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference with the sophisticated support of unstructured sparsity on high-performance but highly restrictive Tensor Cores. Based on our key observation that the main bottleneck of generative model inference is the several skinny matrix multiplications for which Tensor Cores would be significantly under-utilized due to low computational intensity, we propose a general Load-as-Sparse and Compute-as-Dense methodology for unstructured sparse matrix multiplication. The basic insight is to address the significant memory bandwidth bottleneck while tolerating redundant computations that are not critical for end-to-end performance on Tensor Cores. Based on this, we design an effective software framework for Tensor Core based unstructured SpMM, leveraging on-chip resources for efficient sparse data extraction and computation/memory-access overlapping. At SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art library, i.e., Sputnik and SparTA by an average of 2.9x and 1.5x, respectively. At end-to-end framework level on OPT-30B/66B/175B models, for tokens per GPU-second, Flash-LLM achieves up to 3.8x and 3.6x improvement over DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.
    摘要 随着参数大小的快速增长,大型生成模型的部署变得越来越困难,因为它们通常需要大量的GPU内存消耗和巨量计算。不结构化模型剔除已成为一种常见的方法来降低GPU内存占用和总计算量,同时保持良好的模型准确性。然而,现有的解决方案不提供高效支持对现代GPU上的不结构化稀疏性,特别是在高性能但高度结构化的Tensor Core硬件上。因此,我们提出Flash-LLM,用于实现低成本高效的大型生成模型推理,并且在高性能但高度结构化的Tensor Core硬件上提供了高度支持不结构化稀疏性。我们的关键观察是,生成模型推理的主要瓶颈在于一些瘦剑矩阵乘法,对于Tensor Core硬件来说,这些矩阵乘法的计算INTENSITY很低,导致GPU内存带宽瓶颈。基于这一点,我们提出一种普适的 Load-as-Sparse 和 Compute-as-Dense 方法,用于不结构化稀疏矩阵乘法。这种方法的基本思想是在约束可以忽略的计算上进行缓存和内存访问的重叠,以降低内存带宽瓶颈。基于这一方法,我们设计了一个高效的Tensor Core基于不结构化稀疏矩阵乘法的软件框架,利用GPU内存中的资源进行高效的稀疏数据提取和计算/内存访问重叠。在SpMM kernel层,Flash-LLM比Sputnik和SparTA两者均高效,平均提高2.9倍和1.5倍。在框架层,Flash-LLM在OPT-30B/66B/175B模型上,对于每个GPU每秒的字符数,与DeepSpeed和FasterTransformer相比,提高了3.8倍和3.6倍,同时具有显著更低的推理成本。
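
A NumPy/SciPy analogue of the Load-as-Sparse and Compute-as-Dense idea: weights are stored and moved in a sparse format (markedly less memory traffic at high sparsity), then a tile is densified and multiplied with a dense GEMM, tolerating the redundant zero computations. The sizes and tiling below are illustrative; Flash-LLM's actual CUDA/Tensor-Core kernels are not reproduced.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A "skinny" GEMM typical of generative inference: large weight matrix, few tokens.
W_dense = rng.normal(size=(2048, 2048))
W_dense[rng.random(W_dense.shape) < 0.8] = 0.0       # 80% unstructured sparsity
X = rng.normal(size=(2048, 8))

# Load-as-Sparse: keep and move the weights in CSR to cut memory traffic.
W_csr = sparse.csr_matrix(W_dense)

def sparse_load_dense_compute(W_csr, X, tile=512):
    """Densify one tile of rows at a time, then run a dense GEMM on the tile."""
    out = np.empty((W_csr.shape[0], X.shape[1]))
    for r0 in range(0, W_csr.shape[0], tile):
        tile_dense = W_csr[r0:r0 + tile].toarray()    # redundant zeros are tolerated
        out[r0:r0 + tile] = tile_dense @ X            # dense compute (tensor-core role)
    return out

Y = sparse_load_dense_compute(W_csr, X)
print(np.allclose(Y, W_dense @ X))                    # True
```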

Crowdotic: A Privacy-Preserving Hospital Waiting Room Crowd Density Estimation with Non-speech Audio

  • paper_url: http://arxiv.org/abs/2309.10280
  • repo_url: None
  • paper_authors: Forsad Al Hossain, Tanjid Hasan Tonmoy, Andrew A. Lover, George A. Corey, Mohammad Arif Ul Alam, Tauhidur Rahman
  • for: 本研究旨在提出一种基于非语音音频的人群分析方法,在各种场景中改进智能建筑的运行和管理,同时满足隐私要求。
  • methods: 我们提出了一种基于 transformer 模型的非语音音频方法,并进行了大量实验和对比分析,证明仅凭非语音音频即可准确地进行人群分析。
  • results: 实验结果表明,基于非语音音频的方法可以高精度地估计人群占用情况,优于基于热成像的模型及其他基线;我们还结合差分隐私技术进行了进一步分析,以提供额外的隐私保障。
    Abstract Privacy-preserving crowd density analysis finds application across a wide range of scenarios, substantially enhancing smart building operation and management while upholding privacy expectations in various spaces. We propose a non-speech audio-based approach for crowd analytics, leveraging a transformer-based model. Our results demonstrate that non-speech audio alone can be used to conduct such analysis with remarkable accuracy. To the best of our knowledge, this is the first time when non-speech audio signals are proposed for predicting occupancy. As far as we know, there has been no other similar approach of its kind prior to this. To accomplish this, we deployed our sensor-based platform in the waiting room of a large hospital with IRB approval over a period of several months to capture non-speech audio and thermal images for the training and evaluation of our models. The proposed non-speech-based approach outperformed the thermal camera-based model and all other baselines. In addition to demonstrating superior performance without utilizing speech audio, we conduct further analysis using differential privacy techniques to provide additional privacy guarantees. Overall, our work demonstrates the viability of employing non-speech audio data for accurate occupancy estimation, while also ensuring the exclusion of speech-related content and providing robust privacy protections through differential privacy guarantees.
    摘要 隐私保护的人群密度分析在各种场景中发挥着广泛的应用,大幅提高智能建筑的运行和管理,同时坚持隐私期望在不同的空间。我们提出了一种基于转换器的非语音音频方法,用于人群分析。我们的结果表明,非语音音频 alone 可以用于进行这种分析,并且具有惊人的准确性。我们知道,这是第一次使用非语音音频信号进行占用率预测。在这之前,没有任何相似的方法。为了完成这一目标,我们在一所大 Hospital 的等待室中部署了我们的传感器平台,并在一些月份内采集了非语音音频和热成像数据,用于训练和评估我们的模型。我们的非语音基本方法在占用率预测方面表现出色,并且超过了基于热成像模型和所有基eline的表现。此外,我们还进行了进一步的分析,使用差分隐私技术提供了额外的隐私保障。总的来说,我们的工作表明了非语音音频数据的可靠性,而不需要使用语音内容,同时提供了坚实的隐私保障。

Diffusion Methods for Generating Transition Paths

  • paper_url: http://arxiv.org/abs/2309.10276
  • repo_url: None
  • paper_authors: Luke Triplett, Jianfeng Lu
  • for: 本研究使用分子系统中的Score-based生成模型来模拟罕见的转移。
  • methods: 本文提出了两种新的路径生成方法:一种是基于链的方法,另一种是基于中点的方法。
  • results: 数值结果表明,这两种方法在数据充沛和数据缺乏两种情况下都能够效果地生成转移路径。
    Abstract In this work, we seek to simulate rare transitions between metastable states using score-based generative models. An efficient method for generating high-quality transition paths is valuable for the study of molecular systems since data is often difficult to obtain. We develop two novel methods for path generation in this paper: a chain-based approach and a midpoint-based approach. The first biases the original dynamics to facilitate transitions, while the second mirrors splitting techniques and breaks down the original transition into smaller transitions. Numerical results of generated transition paths for the M\"uller potential and for Alanine dipeptide demonstrate the effectiveness of these approaches in both the data-rich and data-scarce regimes.
    摘要 在这项工作中,我们寻求使用得分模型来模拟罕见的 между元态态转移。一种高效的方法 для生成高质量的转移路径对分子系统的研究非常有用,因为数据往往困难获取。我们在这篇论文中开发了两种新的方法来生成转移路径:一种链基的方法和一种中点基的方法。第一种偏导原动力学到促进转移,而第二种使用分割技术将原始转移分解成小转移。我们对穆勒潜能和阿拉伦二肽进行了数值研究,并证明了这些方法在数据丰富和数据缺乏两种情况下的有效性。

Multi-fidelity climate model parameterization for better generalization and extrapolation

  • paper_url: http://arxiv.org/abs/2309.10231
  • repo_url: None
  • paper_authors: Mohamed Aziz Bhouri, Liran Peng, Michael S. Pritchard, Pierre Gentine
  • for: 这个研究的目的是提出一种基于机器学习的全球气候模型或温室效应模型的参数化方法,以提高模型的准确性和可靠性。
  • methods: 这种方法使用了多种数据集,包括低精度的物理参数化数据和高精度的气候模拟数据,通过多质量混合来提高模型的准确性和可靠性。
  • results: 研究发现,使用多质量混合方法可以提供更准确的气候预测,而无需增加计算资源的增加。此外,这种方法还可以提供可靠的不确定性评估,并能够在多种enario下提供更加准确的预测结果。
    Abstract Machine-learning-based parameterizations (i.e. representation of sub-grid processes) of global climate models or turbulent simulations have recently been proposed as a powerful alternative to physical, but empirical, representations, offering a lower computational cost and higher accuracy. Yet, those approaches still suffer from a lack of generalization and extrapolation beyond the training data, which is however critical to projecting climate change or unobserved regimes of turbulence. Here we show that a multi-fidelity approach, which integrates datasets of different accuracy and abundance, can provide the best of both worlds: the capacity to extrapolate leveraging the physically-based parameterization and a higher accuracy using the machine-learning-based parameterizations. In an application to climate modeling, the multi-fidelity framework yields more accurate climate projections without requiring major increase in computational resources. Our multi-fidelity randomized prior networks (MF-RPNs) combine physical parameterization data as low-fidelity and storm-resolving historical run's data as high-fidelity. To extrapolate beyond the training data, the MF-RPNs are tested on high-fidelity warming scenarios, $+4K$, data. We show the MF-RPN's capacity to return much more skillful predictions compared to either low- or high-fidelity (historical data) simulations trained only on one regime while providing trustworthy uncertainty quantification across a wide range of scenarios. Our approach paves the way for the use of machine-learning based methods that can optimally leverage historical observations or high-fidelity simulations and extrapolate to unseen regimes such as climate change.
    摘要 globale 气候模型或液体动力学 simulations中的机器学习基 Parameters (i.e. 表示 sub-grid 过程的 representation) 被提议为一种有力的代替physical, but empirical, representations,提供更低的计算成本和更高的准确性。然而,这些方法仍然受到 extrapolation beyond the training data 的限制,这是 projecting 气候变化或未观察到的液体动力学 regime 的 kritical 因素。我们展示了一种多 fideliTY 方法,该方法 integrate 不同精度和充沛的数据集,可以提供best of both worlds:可以 extrapolate leveraging physically-based parameterization,同时使用 machine-learning-based parameterizations 提供更高的准确性。在气候模型中,我们的多 fideliTY 框架实现了更准确的气候预测,不需要大幅增加计算资源。我们的多 fideliTY randomized prior networks (MF-RPNs) 组合物理参数化数据作为low-fidelity,并使用风暴解决的历史数据作为高精度。为了 extrapolate beyond the training data,MF-RPNs 在高精度增温enario数据上进行测试。我们显示MF-RPNs 能够返回更有技巧的预测,比 Either low- 或 high-fidelity (历史数据) simulations 训练只在一个 режиме上,同时提供可靠的uncertainty quantification across a wide range of scenarios。我们的方法开创了使用机器学习基 methods 可以最佳地利用历史观察或高精度 simulations 并 extrapolate to unseen regimes such as climate change。

eess.IV - 2023-09-19

Multi-Spectral Reflection Matrix for Ultra-Fast 3D Label-Free Microscopy

  • paper_url: http://arxiv.org/abs/2309.10951
  • repo_url: None
  • paper_authors: Paul Balondrade, Victor Barolle, Nicolas Guigui, Emeric Auriant, Nathan Rougier, Claude Boccara, Mathias Fink, Alexandre Aubry
  • for: 实现深入、实时、量化的生物组织观察
  • methods: 多спектル矩阵方法
  • results: 实现0.1mm^3的场视野,290nm的分辨率,1Hz的帧率三维图像
    Abstract Label-free microscopy exploits light scattering to obtain a three-dimensional image of biological tissues. However, light propagation is affected by aberrations and multiple scattering, which drastically degrade the image quality and limit the penetration depth. Multi-conjugate adaptive optics and time-gated matrix approaches have been developed to compensate for aberrations but the associated frame rate is extremely limited for 3D imaging. Here we develop a multi-spectral matrix approach to solve these fundamental problems. Based on an interferometric measurement of a polychromatic reflection matrix, the focusing process can be optimized in post-processing at any voxel by addressing independently each frequency component of the wave-field. A proof-of-concept experiment demonstrates the three-dimensional image of an opaque human cornea over a 0.1 mm^3-field-of-view at a 290 nm-resolution and a 1 Hz-frame rate. This work paves the way towards a fully-digital microscope allowing real-time, in-vivo, quantitative and deep inspection of tissues.
    摘要 Label-free microscopy 利用光散射获取生物组织的三维图像。然而,光束传播受到偏振和多散射的影响,导致图像质量严重下降,限制了温度深度。多 conjugate adaptive optics 和时间锁定矩阵方法已经开发,但这些方法的相关帧率非常低,不适于3D图像。在这种情况下,我们开发了一种多 spectral matrix 方法。基于一种多色干涉测量,我们可以在后处理中独立地处理每个频率成分的波场,从而优化焦点处理。一个证明实验表明,我们可以在0.1 mm^3 的场视野内获得290 nm 的分辨率和1 Hz 的帧率。这种工作开创了一种完全数字的镜像机,允许实时、生物体内、量化和深入检查组织。

Multisource Holography

  • paper_url: http://arxiv.org/abs/2309.10816
  • repo_url: None
  • paper_authors: Grace Kuo, Florian Schiffers, Douglas Lanman, Oliver Cossairt, Nathan Matsuda
  • for: Multisource holography is proposed as a novel architecture to suppress speckle in a single frame without sacrificing resolution.
  • methods: The approach uses an array of sources, two spatial light modulators, and an algorithm to calculate multisource holograms.
  • results: The proposed method achieves up to a 10 dB increase in peak signal-to-noise ratio compared to an equivalent single-source system, and is validated with a benchtop experimental prototype.
    Abstract Holographic displays promise several benefits including high quality 3D imagery, accurate accommodation cues, and compact form-factors. However, holography relies on coherent illumination which can create undesirable speckle noise in the final image. Although smooth phase holograms can be speckle-free, their non-uniform eyebox makes them impractical, and speckle mitigation with partially coherent sources also reduces resolution. Averaging sequential frames for speckle reduction requires high speed modulators and consumes temporal bandwidth that may be needed elsewhere in the system. In this work, we propose multisource holography, a novel architecture that uses an array of sources to suppress speckle in a single frame without sacrificing resolution. By using two spatial light modulators, arranged sequentially, each source in the array can be controlled almost independently to create a version of the target content with different speckle. Speckle is then suppressed when the contributions from the multiple sources are averaged at the image plane. We introduce an algorithm to calculate multisource holograms, analyze the design space, and demonstrate up to a 10 dB increase in peak signal-to-noise ratio compared to an equivalent single source system. Finally, we validate the concept with a benchtop experimental prototype by producing both 2D images and focal stacks with natural defocus cues.
    摘要 激光显示技术承诺了许多优点,包括高质量3D图像、准确的视力缩放和具有减小的形态因素。然而,激光学依赖于 coherent 照明,可能会在最终图像中产生不жела的斑点噪声。尽管平滑相位激光可以无斑点,但它们的非均匀观看窗口使其实际无法应用,而使用半共振源也会降低分辨率。均值多帧图像以提高噪声抑制效果需要高速调制器,这会占用时间频谱资源,这些资源可能需要在系统中用于其他目的。在这项工作中,我们提出了多源激光学技术,一种新的架构,使用一个数组源来抑制斑点。通过使用两个空间光模ulator,其中每个源在数组中可以被控制为创建不同的斑点版本,并且在图像平面上均值多个源的贡献可以抑制斑点。我们提出了一种算法来计算多源激光图,分析设计空间,并证明在相同的单个源系统中,我们可以获得最高达10 dB的峰值信号响应比。最后,我们验证了这一概念,使用了一个桌面实验prototype,并生成了2D图像和自然减ocus图像。
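
The statistical mechanism behind multisource speckle suppression can be illustrated with a toy simulation: averaging K mutually incoherent, fully developed speckle patterns reduces the speckle contrast roughly as 1/sqrt(K). The circular-Gaussian field model below is a standard textbook assumption and does not model the SLMs or the hologram optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.ones((64, 64))                    # flat target image, mean intensity 1

def speckle_frame(target, rng):
    """One coherent frame: a circular-Gaussian field gives fully developed speckle."""
    re, im = rng.normal(size=target.shape), rng.normal(size=target.shape)
    return target * (re**2 + im**2) / 2.0     # exponential intensity, mean = target

for k in (1, 4, 16):
    img = np.mean([speckle_frame(target, rng) for _ in range(k)], axis=0)
    contrast = img.std() / img.mean()         # expected to fall roughly as 1/sqrt(k)
    print(f"{k:2d} mutually incoherent sources -> speckle contrast {contrast:.3f}")
```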

InSPECtor: an end-to-end design framework for compressive pixelated hyperspectral instruments

  • paper_url: http://arxiv.org/abs/2309.10833
  • repo_url: None
  • paper_authors: T. A. Stockmans, F. Snik, M. Esposito, C. van Dijk, C. U. Keller
  • for: 这个论文是为了设计一种高spectral仪器,它可以压缩数据,从而减少数据量和采集时间。
  • methods: 这个论文使用了TensorFlow算法,并利用自动微分来联合优化滤波器数组的布局和重建器。
  • results: 研究人员通过使用这种方法,可以减少数据量,采集时间和探测器空间,并且不会产生重要的信息损失。实际上,这种方法可以减少数据量的40倍,相比于传统的高spectral仪器。
    Abstract Classic designs of hyperspectral instrumentation densely sample the spatial and spectral information of the scene of interest. Data may be compressed after the acquisition. In this paper we introduce a framework for the design of an optimized, micro-patterned snapshot hyperspectral imager that acquires an optimized subset of the spatial and spectral information in the scene. The data is thereby compressed already at the sensor level, but can be restored to the full hyperspectral data cube by the jointly optimized reconstructor. This framework is implemented with TensorFlow and makes use of its automatic differentiation for the joint optimization of the layout of the micro-patterned filter array as well as the reconstructor. We explore the achievable compression ratio for different numbers of filter passbands, number of scanning frames, and filter layouts using data collected by the Hyperscout instrument. We show resulting instrument designs that take snapshot measurements without losing significant information while reducing the data volume, acquisition time, or detector space by a factor of 40 as compared to classic, dense sampling. The joint optimization of a compressive hyperspectral imager design and the accompanying reconstructor provides an avenue to substantially reduce the data volume from hyperspectral imagers.
    摘要 We evaluate the achievable compression ratio for various filter passband numbers, scanning frame numbers, and filter layouts using data from the Hyperscout instrument. Our results show that the proposed instrument designs can capture snapshot measurements without losing significant information, while reducing the data volume, acquisition time, and detector space by a factor of 40 compared to traditional, dense sampling. The joint optimization of the compressive hyperspectral imager design and the accompanying reconstructor provides a means to significantly reduce the data volume from hyperspectral imagers.

Minimum-length chain embedding for the phase unwrapping problem on D-Wave’s advantage architecture

  • paper_url: http://arxiv.org/abs/2309.10296
  • repo_url: None
  • paper_authors: Mohammad Kashfi Haghighi, Nikitas Dimopoulos
  • for: 解决 phase unwrapping 问题
  • methods: 使用 quantum annealing 和 Pegasus 图的嵌入
  • results: 提出一种新的嵌入算法,可以更好地解决 phase unwrapping 问题,并且可以应用于其他问题的嵌入
    Abstract With the current progress of quantum computing, quantum annealing is being introduced as a powerful method to solve hard computational problems. In this paper, we study the potential capability of quantum annealing in solving the phase unwrapping problem, an instance of hard computational problems. To solve the phase unwrapping problem using quantum annealing, we deploy the D-Wave Advantage machine which is currently the largest available quantum annealer. The structure of this machine, however, is not compatible with our problem graph structure. Consequently, the problem graph needs to be mapped onto the target (Pegasus) graph, and this embedding significantly affects the quality of the results. Based on our experiment and also D-Wave's reports, the lower chain lengths can result in a better performance of quantum annealing. In this paper, we propose a new embedding algorithm that has the lowest possible chain length for embedding the graph of the phase unwrapping problem onto the Pegasus graph. The obtained results using this embedding strongly outperform the results of Auto-embedding provided by D-Wave. Besides the phase unwrapping problem, this embedding can be used to embed any subset of our problem graph to the Pegasus graph.
    摘要 现在量子计算技术的进步,量子热处理已经被提出为解决复杂计算问题的强大方法。在这篇论文中,我们研究了量子热处理可以解决阶跃问题,这是复杂计算问题的一个实例。为解决阶跃问题使用量子热处理,我们使用D-Wave Advantage机器,该机器目前是最大的量子热处理器。然而,这台机器的结构与我们的问题图结构不兼容,因此需要将问题图映射到目标( Pegasus)图上,这种映射会对结果产生深见影响。根据我们的实验和D-Wave的报告,较短的链长可以使量子热处理表现更好。在这篇论文中,我们提出了一种新的映射算法,该算法可以将问题图映射到 Pegasus 图上,并且 obtenains 最低的链长。使用这种映射,我们在实验中获得了较好的结果,比D-Wave 提供的 Auto-embedding 更好。此外,这种映射可以将任何我们问题图的子集映射到 Pegasus 图上。

Disentangled Information Bottleneck guided Privacy-Protective JSCC for Image Transmission

  • paper_url: http://arxiv.org/abs/2309.10263
  • repo_url: None
  • paper_authors: Lunan Sun, Yang Yang, Mingzhe Chen, Caili Guo
  • for: 保护图像传输中的私有信息,同时保证合法接收端的通信性能。
  • methods: 提出一种由解耦信息瓶颈(DIB)引导的隐私保护联合信源信道编码(JSCC)方法,将私有信息与公共信息分离,并结合基于口令的加密算法对私有子码字进行加密。
  • results: 实验表明,该方法可将窃听者对私有信息的识别准确率降低多达 15%,并将推理时间减少约 10%,同时保持高质量的图像重建。
    Abstract Joint source and channel coding (JSCC) has attracted increasing attention due to its robustness and high efficiency. However, JSCC is vulnerable to privacy leakage due to the high relevance between the source image and channel input. In this paper, we propose a disentangled information bottleneck guided privacy-protective JSCC (DIB-PPJSCC) for image transmission, which aims at protecting private information as well as achieving superior communication performance at the legitimate receiver. In particular, we propose a DIB objective to disentangle private and public information. The goal is to compress the private information in the public subcodewords, preserve the private information in the private subcodewords and improve the reconstruction quality simultaneously. In order to optimize JSCC neural networks using the DIB objective, we derive a differentiable estimation of the DIB objective based on the variational approximation and the density-ratio trick. Additionally, we design a password-based privacy-protective (PP) algorithm which can be jointly optimized with JSCC neural networks to encrypt the private subcodewords. Specifically, we employ a private information encryptor to encrypt the private subcodewords before transmission, and a corresponding decryptor to recover the private information at the legitimate receiver. A loss function for jointly training the encryptor, decryptor and JSCC decoder is derived based on the maximum entropy principle, which aims at maximizing the eavesdropping uncertainty as well as improving the reconstruction quality. Experimental results show that DIB-PPJSCC can reduce the eavesdropping accuracy on private information up to $15\%$ and reduce $10\%$ inference time compared to existing privacy-protective JSCC and traditional separate methods.
    摘要 joint source和通道编码(JSCC)已经吸引了越来越多的关注,因为它具有高效率和鲁棒性。然而,JSCC受到隐私泄露的威胁,因为源图像和通道输入之间存在高度相关性。在本文中,我们提出了一种基于信息瓶颈的隐私保护JSCC(DIB-PPJSCC),用于图像传输,以保护私人信息并实现合法接收器的超越性表现。具体来说,我们提出了一个DIB目标,用于分离私人信息和公共信息。我们的目标是压缩私人信息在公共子码字中,保持私人信息在私人子码字中,并同时提高重建质量。为了优化JSCC神经网络使用DIB目标,我们 deriv了一个可导的DIB目标基于变量 aproximation和density-ratio trick。此外,我们设计了一种基于密码的隐私保护算法(PP),可以与JSCC神经网络 jointly 优化,以加密私人子码字。具体来说,我们使用一个私人信息加密器加密私人子码字,并在合法接收器中使用相应的解密器恢复私人信息。我们 derive了基于最大 entropy 原理的损失函数,用于同时优化加密器、解密器和JSCC解码器的训练。实验结果表明,DIB-PPJSCC可以降低私人信息泄露率达15%,并提高重建质量10%,相比之下存在隐私保护JSCC和传统分离方法。

eess.SP - 2023-09-19

Deep Learning based Fast and Accurate Beamforming for Millimeter-Wave Systems

  • paper_url: http://arxiv.org/abs/2309.10904
  • repo_url: None
  • paper_authors: Tarun S Cousik, Vijay K Shah, Jeffrey H. Reed, Harry X Tran, Rittwik Jana
  • for: 这个论文是为了提高mmWave设备的表现,特别是对于增加信号力和/或减少干扰水平。
  • methods: 这个论文使用了深度神经网络(DNN)框架,以实现快速和精准的照准方向。不同于传统的有限内存Look-Up表(LUT),BeamShaper使用训练好的NN模型来生成三角形矩阵的系数,并在实时运算中将其转换为任意方向的照准。
  • results: simulations 显示,BeamShaper 比 contemporary LUT 基本的解决方案在cosine-similarity 和中央角度上表现更好,并且在几乎相同的时间尺度上表现更好。此外,我们还显示了我们的 DNN 基本方法具有更好的对抗量化噪声的性能,这是因为量化噪声对于数字相位调节器而言是一个重要的问题。
    Abstract The widespread proliferation of mmW devices has led to a surge of interest in antenna arrays. This interest in arrays is due to their ability to steer beams in desired directions, for the purpose of increasing signal-power and/or decreasing interference levels. To enable beamforming, array coefficients are typically stored in look-up tables (LUTs) for subsequent referencing. While LUTs enable fast sweep times, their limited memory size restricts the number of beams the array can produce. Consequently, a receiver is likely to be offset from the main beam, thus decreasing received power, and resulting in sub-optimal performance. In this letter, we present BeamShaper, a deep neural network (DNN) framework, which enables fast and accurate beamsteering in any desirable 3-D direction. Unlike traditional finite-memory LUTs which support a fixed set of beams, BeamShaper utilizes a trained NN model to generate the array coefficients for arbitrary directions in \textit{real-time}. Our simulations show that BeamShaper outperforms contemporary LUT based solutions in terms of cosine-similarity and central angle in time scales that are slightly higher than LUT based solutions. Additionally, we show that our DNN based approach has the added advantage of being more resilient to the effects of quantization noise generated while using digital phase-shifters.
    摘要 广泛的 millimeter 设备的普及,导致了天线数组的兴趣增加。这种兴趣是因为天线数组可以将指向所需的方向中的能量强化,以提高信号强度和/或降低干扰水平。为实现射频,通常需要存储在Look-Up表(LUT)中的数组系数。尽管LUT 允许快速滚动,但它们的内存大小有限制,因此数组只能生成一定数量的射频。这意味着接收器可能会偏离主束,从而降低接收到的功率,并导致不佳的性能。在这封信中,我们介绍了BeamShaper,一个深度神经网络(DNN)框架,它允许在任意三维方向上快速和准确地实现射频。与传统的有限存储LUT不同,BeamShaper 使用训练的神经网络模型来生成数组系数,而不是固定的数组。我们的 simulations 表明,BeamShaper 在cosine-similarity和中心角时间尺度上都高于当前LUT基本解决方案。此外,我们还发现了我们的神经网络基本方法在使用数字阶梯器时产生的量化噪声的影响更加抗性。
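
A small sketch of what a beam-steering table stores for a uniform linear array and why a finite LUT leaves a receiver offset from the main beam; the half-wavelength spacing and 16 elements are assumptions, and the BeamShaper network that regresses such coefficients for arbitrary 3-D directions is not reproduced.

```python
import numpy as np

def steering_weights(n_elements: int, theta_deg: float, d_over_lambda: float = 0.5):
    """Array coefficients that point a uniform linear array toward theta (degrees)."""
    n = np.arange(n_elements)
    phase = -2j * np.pi * d_over_lambda * n * np.sin(np.radians(theta_deg))
    return np.exp(phase) / np.sqrt(n_elements)

# A finite LUT can only store a fixed set of beams...
lut = {theta: steering_weights(16, theta) for theta in range(-60, 61, 10)}

# ...so a user at 23.7 degrees is served by the nearest stored beam (20 degrees),
# whereas computing (or regressing) coefficients on demand matches the true direction.
w_lut = lut[20]
w_exact = steering_weights(16, 23.7)

def gain_toward(w, theta_deg, d_over_lambda=0.5):
    """Received power toward theta, up to a constant factor."""
    a = steering_weights(len(w), theta_deg, d_over_lambda) * np.sqrt(len(w))
    return np.abs(np.vdot(w, a))**2

print(f"LUT beam gain:   {gain_toward(w_lut, 23.7):.2f}")
print(f"Exact beam gain: {gain_toward(w_exact, 23.7):.2f}")
```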

Non-Orthogonal Time-Frequency Space Modulation

  • paper_url: http://arxiv.org/abs/2309.10889
  • repo_url: None
  • paper_authors: Mahdi Shamsi, Farokh Marvasti
  • for: 提出了一种时频空间变换(TFST)来 derivate 非正交基函数 для调制技术在延迟-多普勒平面上。
  • methods: 基于 TFST 的一家 Overloaded Delay-Doppler Modulation(ODDM)技术被提出,它提高了灵活性和效率,将调制信号表示为基函数信号的线性组合。
  • results: 基于所提出的 ODDM 技术,推导出一种非正交时频空间(NOTFS)数字调制;仿真表明,它能为高移动性通信系统带来更高的频谱效率和更低的时延,尤其是在高过载因子与加性高斯白噪声(AWGN)信道等苛刻场景下。同时提出了一种改进的球形译码算法,以高效地解码接收信号。
    Abstract This paper proposes a Time-Frequency Space Transformation (TFST) to derive non-orthogonal bases for modulation techniques over the delay-doppler plane. A family of Overloaded Delay-Doppler Modulation (ODDM) techniques is proposed based on the TFST, which enhances flexibility and efficiency by expressing modulated signals as a linear combination of basis signals. A Non-Orthogonal Time-Frequency Space (NOTFS) digital modulation is derived for the proposed ODDM techniques, and simulations show that they offer high-mobility communication systems with improved spectral efficiency and low latency, particularly in challenging scenarios such as high overloading factors and Additive White Gaussian Noise (AWGN) channels. A modified sphere decoding algorithm is also presented to efficiently decode the received signal. The proposed modulation and decoding techniques contribute to the advancement of non-orthogonal approaches in the next-generation of mobile communication systems, delivering superior spectral efficiency and low latency, and offering a promising solution towards the development of efficient high-mobility communication systems.
    摘要 本文提出一种时频空间变换(TFST),用于推导延迟-多普勒平面上调制技术的非正交基。基于 TFST,提出了一族过载延迟-多普勒调制(ODDM)技术,通过将调制信号表示为基信号的线性组合来提升灵活性与效率。针对所提出的 ODDM 技术,推导出一种非正交时频空间(NOTFS)数字调制;仿真表明,它能为高移动性通信系统提供更高的频谱效率与更低的时延,尤其是在高过载因子和加性高斯白噪声(AWGN)信道等苛刻场景下。文中还给出一种改进的球形译码算法,以高效地解码接收信号。所提出的调制与译码技术有助于推动非正交方法在下一代移动通信系统中的应用,为高效的高移动性通信系统的发展提供一条可行路径。

  • paper_url: http://arxiv.org/abs/2309.10758
  • repo_url: None
  • paper_authors: Jiayu Mao, Aylin Yener
  • for: This paper focuses on over-the-air federated learning (OTA-FL) in a heterogeneous edge-intelligent network with non-i.i.d. user dataset distributions and physical layer impairments.
  • methods: The proposed cross-layer algorithm jointly optimizes RIS configuration, communication, and computation resources to enhance learning performance, with dynamic local update steps, RIS phase shifts, and transmission power control.
  • results: The proposed algorithm outperforms the existing unified approach under heterogeneous systems and imperfect channel state information (CSI) in numerical results.
    Abstract Over-the-air federated learning (OTA-FL) exploits the inherent superposition property of wireless channels to integrate the communication and model aggregation. Though a naturally promising framework for wireless federated learning, it requires care to mitigate physical layer impairments. In this work, we consider a heterogeneous edge-intelligent network with different edge device resources and non-i.i.d. user dataset distributions, under a general non-convex learning objective. We leverage the Reconfigurable Intelligent Surface (RIS) technology to augment OTA-FL system over simultaneous time varying uplink and downlink noisy communication channels under imperfect CSI scenario. We propose a cross-layer algorithm that jointly optimizes RIS configuration, communication and computation resources in this general realistic setting. Specifically, we design dynamic local update steps in conjunction with RIS phase shifts and transmission power to boost learning performance. We present a convergence analysis of the proposed algorithm, and show that it outperforms the existing unified approach under heterogeneous system and imperfect CSI in numerical results.
    摘要 “阶层联盟学习(OTA-FL)利用无线通信频道的自然层次性来整合通信和模型集合。尽管是一个具有掌上优势的架构,但它需要注意对物理层问题的处理。在这个工作中,我们考虑了一个多样化的智能边缘网络,其中不同的边缘设备具有不同的资源,并且有不同的用户数据分布。我们运用了智能反射表面(RIS)技术来增强OTA-FL系统,并在同时进行同频道上的上传和下传噪音通信频道下进行无瑕的通信。我们提出了一个跨层数据算法,将RIS配置、通信和计算资源进行统一优化。具体来说,我们设计了动态本地更新步骤,与RIS相位调整和传输功率进行协同运作,以提高学习性能。我们提供了一个对照分析,说明了我们的方法在不同的多样化系统和实际问题下的性能优势。”

ResEMGNet: A Lightweight Residual Deep Learning Architecture for Neuromuscular Disorder Detection from Raw EMG Signals

  • paper_url: http://arxiv.org/abs/2309.10756
  • repo_url: None
  • paper_authors: Minhajur Rahman, Md Toufiqur Rahman, Md Tanvir Raihan, Celia Shahnaz
  • for: 该研究旨在使用深度学习技术检测肌萎缩侧索硬化症(ALS)和肌病。
  • methods: 该研究使用卷积神经网络(CNN)直接从原始 EMG 信号中检测 ALS 和肌病。不同于需要繁琐手工特征提取的传统方法,ResEMGNet 以原始 EMG 数据为输入,从而降低计算复杂度并提高实用性。
  • results: 该研究表明,ResEMGNet可以达到94.43%的全Subject-independent性能,在比较其他方法时表现出色。
    Abstract Amyotrophic Lateral Sclerosis (ALS) and Myopathy are debilitating neuromuscular disorders that demand accurate and efficient diagnostic approaches. In this study, we harness the power of deep learning techniques to detect ALS and Myopathy. Convolutional Neural Networks (CNNs) have emerged as powerful tools in this context. We present ResEMGNet, designed to identify ALS and Myopathy directly from raw electromyography (EMG) signals. Unlike traditional methods that require intricate handcrafted feature extraction, ResEMGNet takes raw EMG data as input, reducing computational complexity and enhancing practicality. Our approach was rigorously evaluated using various metrics in comparison to existing methods. ResEMGNet exhibited exceptional subject-independent performance, achieving an impressive overall three-class accuracy of 94.43\%.
    摘要 Amyotrophic Lateral Sclerosis (ALS) 和 Myopathy 是肌肉疾病,需要精准和高效的诊断方法。在这项研究中,我们利用深度学习技术来识别 ALS 和 Myopathy。卷积神经网络 (CNNs) 在这个上是非常有力的工具。我们介绍了 ResEMGNet,可以直接从 Raw 电romyography (EMG) 信号中识别 ALS 和 Myopathy。与传统方法不同,ResEMGNet 不需要手动提取特征,从而降低计算复杂性和提高实用性。我们的方法在不同的指标下进行了严格的评估,与现有方法进行了比较。ResEMGNet 在三类精度测试中取得了94.43%的总平均精度,表现出色。

BeamSec: A Practical mmWave Physical Layer Security Scheme Against Strong Adversaries

  • paper_url: http://arxiv.org/abs/2309.10632
  • repo_url: None
  • paper_authors: Afifa Ishtiaq, Arash Asadi, Ladan Khaloopour, Waqar Ahmed, Vahid Jamali, Matthias Hollick
  • for: 提高物理层安全性,防止窃听攻击
  • methods: 无需知晓窃听者位置/信道即可工作;对合谋窃听攻击具有鲁棒性;与 IEEE 802.11ad/ay 标准兼容
  • results: 与随机路径选择等基线方案相比,BeamSec 将保密速率提高 79.8%,并能抵御单个及合谋窃听者
    Abstract The high directionality of millimeter-wave (mmWave) communication systems has proven effective in reducing the attack surface against eavesdropping, thus improving the physical layer security. However, even with highly directional beams, the system is still exposed to eavesdropping against adversaries located within the main lobe. In this paper, we propose \acrshort{BSec}, a solution to protect the users even from adversaries located in the main lobe. The key feature of BeamSec are: (i) Operating without the knowledge of eavesdropper's location/channel; (ii) Robustness against colluding eavesdropping attack and (iii) Standard compatibility, which we prove using experiments via our IEEE 802.11ad/ay-compatible 60 GHz phased-array testbed. Methodologically, BeamSec first identifies uncorrelated and diverse beam-pairs between the transmitter and receiver by analyzing signal characteristics available through standard-compliant procedures. Next, it encodes the information jointly over all selected beam-pairs to minimize information leakage. We study two methods for allocating transmission time among different beams, namely uniform allocation (no knowledge of the wireless channel) and optimal allocation for maximization of the secrecy rate (with partial knowledge of the wireless channel). Our experiments show that \acrshort{BSec} outperforms the benchmark schemes against single and colluding eavesdroppers and enhances the secrecy rate by 79.8% over a random paths selection benchmark.

A Multi Constrained Transformer-BiLSTM Guided Network for Automated Sleep Stage Classification from Single-Channel EEG

  • paper_url: http://arxiv.org/abs/2309.10542
  • repo_url: None
  • paper_authors: Farhan Sadik, Md Tanvir Raihan, Rifat Bin Rashid, Minhjaur Rahman, Sabit Md Abdal, Shahed Ahmed, Talha Ibn Mahmud
  • for: automatic sleep scoring from single-channel EEG signals
  • methods: DenseRTSleep-II combines a Convolutional Neural Network (CNN), a transformer network, and a Bidirectional Long Short-Term Memory (BiLSTM) network, trained with a weighted multi-loss scheme
  • results: outperforms state-of-the-art techniques by a large margin in terms of accuracy, precision, and F1-score
    Abstract Sleep stage classification from electroencephalogram (EEG) is significant for the rapid evaluation of sleeping patterns and quality. A novel deep learning architecture, ``DenseRTSleep-II'', is proposed for automatic sleep scoring from single-channel EEG signals. The architecture utilizes the advantages of Convolutional Neural Network (CNN), transformer network, and Bidirectional Long Short Term Memory (BiLSTM) for effective sleep scoring. Moreover, with the addition of a weighted multi-loss scheme, this model is trained more implicitly for vigorous decision-making tasks. Thus, the model generates the most efficient result in the SleepEDFx dataset and outperforms different state-of-the-art (IIT-Net, DeepSleepNet) techniques by a large margin in terms of accuracy, precision, and F1-score.
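The weighted multi-loss scheme is only named in the abstract, so the following is a hedged sketch of what such an objective can look like: a class-weighted cross-entropy combined with an auxiliary term, mixed with illustrative weights. The component losses and weights are assumptions, not the paper's exact formulation.

```python
# Sketch of a weighted multi-loss objective for 5-stage sleep scoring.
import torch
import torch.nn.functional as F

def weighted_multi_loss(logits, targets, class_weights=None, weights=(1.0, 0.5)):
    """Combine a (class-weighted) cross-entropy with an MSE on soft labels."""
    ce = F.cross_entropy(logits, targets, weight=class_weights)
    one_hot = F.one_hot(targets, num_classes=logits.shape[-1]).float()
    mse = F.mse_loss(torch.softmax(logits, dim=-1), one_hot)
    return weights[0] * ce + weights[1] * mse

logits = torch.randn(16, 5, requires_grad=True)   # 5 sleep stages
targets = torch.randint(0, 5, (16,))
loss = weighted_multi_loss(logits, targets)
loss.backward()
print(float(loss))
```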

EMG Signal Classification for Neuromuscular Disorders with Attention-Enhanced CNN

  • paper_url: http://arxiv.org/abs/2309.10483
  • repo_url: None
  • paper_authors: Md. Toufiqur Rahman, Minhajur Rahman, Celia Shahnaz
  • for: Detecting Amyotrophic Lateral Sclerosis (ALS) and myopathy, two debilitating neuromuscular disorders.
  • methods: Extracts informative features from raw electromyography (EMG) signals using the log-spectrum and delta log-spectrum, which capture the spectral and temporal characteristics of the signals, then applies SpectroEMG-Net, a deep model combining Convolutional Neural Networks (CNNs) with attention, for three-class classification.
  • results: Distinguishes the classes Myopathy, Normal, and ALS with an overall accuracy of 92%, offering a data-driven, multi-class approach with potential for early and accurate detection.
    Abstract Amyotrophic Lateral Sclerosis (ALS) and Myopathy present considerable challenges in the realm of neuromuscular disorder diagnostics. In this study, we employ advanced deep-learning techniques to address the detection of ALS and Myopathy, two debilitating conditions. Our methodology begins with the extraction of informative features from raw electromyography (EMG) signals, leveraging the Log-spectrum, and Delta Log spectrum, which capture the frequency contents, and spectral and temporal characteristics of the signals. Subsequently, we applied a deep-learning model, SpectroEMG-Net, combined with Convolutional Neural Networks (CNNs) and Attention for the classification of three classes. The robustness of our approach is rigorously evaluated, demonstrating its remarkable performance in distinguishing among the classes: Myopathy, Normal, and ALS, with an outstanding overall accuracy of 92\%. This study marks a contribution to addressing the diagnostic challenges posed by neuromuscular disorders through a data-driven, multi-class classification approach, providing valuable insights into the potential for early and accurate detection.
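A small sketch of the feature side: computing a log-spectrum and a frame-to-frame delta log-spectrum from a raw signal. The frame length, hop size and the delta definition are assumptions rather than the paper's exact settings.

```python
# Log-spectrum and delta log-spectrum features from a raw 1-D signal.
import numpy as np

def log_spectrum_features(x, frame_len=256, hop=128, eps=1e-10):
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
    window = np.hanning(frame_len)
    spec = np.abs(np.fft.rfft(frames * window, axis=-1))
    log_spec = np.log(spec + eps)                               # (n_frames, n_bins)
    delta = np.diff(log_spec, axis=0, prepend=log_spec[:1])     # frame-to-frame delta
    return np.concatenate([log_spec, delta], axis=-1)

emg = np.random.randn(4096)          # stand-in for one raw EMG recording
feats = log_spectrum_features(emg)
print(feats.shape)                   # (n_frames, 2 * (frame_len // 2 + 1))
```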

  • paper_url: http://arxiv.org/abs/2309.10460
  • repo_url: None
  • paper_authors: Daeun Kim, Jeonghun Park, Namyoon Lee
  • for: investigate the coverage performance of downlink satellite networks employing dynamic coordinated beamforming.
  • methods: modeling the spatial arrangement of satellites and users using Poisson point processes situated on concentric spheres, deriving analytical expressions for the coverage probability, and developing an approximation for the coverage probability.
  • results: dynamic coordinated beamforming significantly improves coverage compared to the absence of satellite coordination, and the optimal cluster size, which maximizes the ergodic spectral efficiency, increases with higher satellite density, provided that the number of antennas on the satellites is sufficiently large.
    Abstract In this paper, we investigate the coverage performance of downlink satellite networks employing dynamic coordinated beamforming. Our approach involves modeling the spatial arrangement of satellites and users using Poisson point processes situated on concentric spheres. We derive analytical expressions for the coverage probability, which take into account the in-cluster geometry of the coordinated satellite set. These expressions are formulated in terms of various parameters, including the number of antennas per satellite, satellite density, fading characteristics, and path-loss exponent. To offer a more intuitive understanding, we also develop an approximation for the coverage probability. Furthermore, by considering the distribution of normalized distances, we derive the spatially averaged coverage probability, thereby validating the advantages of coordinated beamforming from a spatial average perspective. Our primary finding is that dynamic coordinated beamforming significantly improves coverage compared to the absence of satellite coordination, in direct proportion to the number of antennas on each satellite. Moreover, we observe that the optimal cluster size, which maximizes the ergodic spectral efficiency, increases with higher satellite density, provided that the number of antennas on the satellites is sufficiently large. Our findings are corroborated by simulation results, confirming the accuracy of the derived expressions.
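A Monte-Carlo toy model in the spirit of the analysis: satellites drawn as a Poisson point process on a sphere, a user served by the strongest visible satellite, and coverage measured against an SINR threshold. The altitude, density, fading model, path-loss exponent and threshold are example values, and the coordinated beamforming studied in the paper is not modelled here.

```python
# Monte-Carlo coverage estimate with Poisson-distributed satellites on a sphere.
import numpy as np

rng = np.random.default_rng(1)
R_E, h = 6371e3, 550e3                       # Earth radius, satellite altitude [m]
R_S = R_E + h
density = 1e-12                              # satellites per m^2 of the sphere
alpha, T = 2.0, 1.0                          # path-loss exponent, SINR threshold

def sample_sphere(n, radius):
    v = rng.normal(size=(n, 3))
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)

user = np.array([0.0, 0.0, R_E])             # user fixed at the "north pole"
trials, covered = 2000, 0
for _ in range(trials):
    n_sat = rng.poisson(density * 4 * np.pi * R_S**2)
    sats = sample_sphere(n_sat, R_S)
    d = np.linalg.norm(sats - user, axis=1)
    visible = d <= np.sqrt(R_S**2 - R_E**2)   # above the user's local horizon
    if not np.any(visible):
        continue
    g = rng.exponential(size=visible.sum())   # Rayleigh fading power gains
    p = g * d[visible] ** (-alpha)            # received powers
    k = np.argmax(p)
    sinr = p[k] / (p.sum() - p[k] + 1e-21)    # serve strongest, rest is interference
    covered += sinr > T
print("coverage probability ~", covered / trials)
```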

Enhancing Congestion Control to Improve User Experience in IoT Using LSTM Network

  • paper_url: http://arxiv.org/abs/2309.10347
  • repo_url: None
  • paper_authors: Atta Ur Rahman, Bibi Saqia, Wali Ullah Khan, Khaled Rabie, Mahmood Alam, Khairullah Khan
  • for: Proposes a new strategy for improving congestion control in IoT networks using Long Short-Term Memory (LSTM) networks.
  • methods: Gathers and analyzes IoT-specific data such as network traffic patterns, device interactions, and congestion occurrences, trains an LSTM architecture tailored to the IoT environment, and incorporates the model's predictions into the congestion control methods.
  • results: The approach improves user satisfaction and the reliability of IoT connectivity; performance is evaluated with metrics such as throughput, latency, packet loss, and user satisfaction, through rigorous testing and comparison with conventional congestion control methods.
    Abstract This study suggests a new strategy for improving congestion control by deploying Long Short-Term Memory (LSTM) networks. LSTMs are recurrent neural networks (RNN), that excel at capturing temporal relationships and patterns in data. IoT-specific data such as network traffic patterns, device interactions, and congestion occurrences are gathered and analyzed. The gathered data is used to create and train an LSTM network architecture specific to the IoT environment. Then, the LSTM model's predictive skills are incorporated into the congestion control methods. This work intends to optimize congestion management methods using LSTM networks, which results in increased user satisfaction and dependable IoT connectivity. Utilizing metrics like throughput, latency, packet loss, and user satisfaction, the success of the suggested strategy is evaluated. Evaluation of performance includes rigorous testing and comparison to conventional congestion control methods.
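A minimal stand-in for the congestion predictor described above: an LSTM that maps a window of traffic features to a congestion indicator. The feature set, window length and training loop are assumptions for illustration only.

```python
# Minimal LSTM regressor: traffic-feature windows -> congestion indicator.
import torch
import torch.nn as nn

class CongestionLSTM(nn.Module):
    def __init__(self, n_features=3, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, features)
        h, _ = self.lstm(x)
        return self.out(h[:, -1])         # predict congestion level at next step

model = CongestionLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 20, 3)                # e.g. [throughput, latency, packet loss]
y = torch.rand(64, 1)                     # normalised congestion indicator
for _ in range(5):                        # a few illustrative training steps
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print(float(loss))
```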

Time Stretch with Continuous-Wave Lasers for Practical Fast Realtime Measurements

  • paper_url: http://arxiv.org/abs/2309.10330
  • repo_url: None
  • paper_authors: Tingyi Zhou, Yuta Goto, Takeshi Makino, Callen MacPhee, Yiming Zhou, Asad M. Madni, Hideaki Furukawa, Naoya Wada, Bahram Jalali
  • for: Describes a new continuous-wave (CW) implementation of photonic time stretch that replaces the expensive, hard-to-integrate supercontinuum mode-locked laser source.
  • methods: Uses a wavelength-division multiplexed (WDM) CW laser source, pulsed by electro-optic (EO) modulation, to realize time stretch.
  • results: The new approach is validated through simulation and experiment, and two potential application scenarios are described.
    Abstract Realtime high-throughput sensing and detection enables the capture of rare events within sub-picosecond time scale, which makes it possible for scientists to uncover the mystery of ultrafast physical processes. Photonic time stretch is one of the most successful approaches that utilize the ultra-wide bandwidth of mode-locked laser for detecting ultrafast signal. Though powerful, it relies on supercontinuum mode-locked laser source, which is expensive and difficult to integrate. This greatly limits the application of this technology. Here we propose a novel Continuous Wave (CW) implementation of the photonic time stretch. Instead of a supercontinuum mode-locked laser, a wavelength division multiplexed (WDM) CW laser, pulsed by electro-optic (EO) modulation, is adopted as the laser source. This opens up the possibility for low-cost integrated time stretch systems. This new approach is validated via both simulation and experiment. Two scenarios for potential application are also described.
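A short numerical sketch of the wavelength-to-time mapping that underlies time stretch: several WDM carriers gated into a short pulse and then separated in time by chromatic dispersion. The dispersion, fibre length and channel grid are illustrative values, not the paper's experimental parameters.

```python
# Wavelength-to-time mapping of an EO-gated WDM comb through a dispersive fibre.
import numpy as np

D = 17e-6                       # fibre dispersion: 17 ps/(nm*km) in SI units [s/m^2]
L = 10e3                        # fibre length [m]
lam0 = 1550e-9
channels = lam0 + 0.8e-9 * np.arange(8)    # 8 WDM channels on a 0.8 nm grid
gate = 10e-12                   # EO-modulator gate width [s]

# Relative arrival time of each channel after the dispersive fibre.
delays = D * L * (channels - lam0)
print("channel delays [ps]:", np.round(delays * 1e12, 1))
print("record stretched over %.1f ps from a %.1f ps gate"
      % ((delays[-1] - delays[0]) * 1e12, gate * 1e12))
```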

Delay-sensitive Task Offloading in Vehicular Fog Computing-Assisted Platoons

  • paper_url: http://arxiv.org/abs/2309.10234
  • repo_url: None
  • paper_authors: Qiong Wu, Siyuan Wang, Hongmei Ge, Pingyi Fan, Qiang Fan, Khaled B. Letaief
  • for: Proposes an SMDP-based offloading strategy that minimizes offloading delay in vehicular fog computing (VFC)-assisted platoons.
  • methods: Models the offloading problem in the VFC system as a semi-Markov decision process (SMDP) and derives a strategy that maximizes the long-term reward reflecting the offloading delay.
  • results: The proposed strategy minimizes offloading delay in the VFC system and outperforms benchmark strategies in simulation experiments.
    Abstract Vehicles in platoons need to process many tasks to support various real-time vehicular applications. When a task arrives at a vehicle, the vehicle may not process the task due to its limited computation resource. In this case, it usually requests to offload the task to other vehicles in the platoon for processing. However, when the computation resources of all the vehicles in the platoon are insufficient, the task cannot be processed in time through offloading to the other vehicles in the platoon. Vehicular fog computing (VFC)-assisted platoon can solve this problem through offloading the task to the VFC which is formed by the vehicles driving near the platoon. Offloading delay is an important performance metric, which is impacted by both the offloading strategy for deciding where the task is offloaded and the number of the allocated vehicles in VFC to process the task. Thus, it is critical to propose an offloading strategy to minimize the offloading delay. In the VFC-assisted platoon system, vehicles usually adopt the IEEE 802.11p distributed coordination function (DCF) mechanism while having various computation resources. Moreover, when vehicles arrive and depart the VFC randomly, their tasks also arrive at and depart the system randomly. In this paper, we propose a semi-Markov decision process (SMDP) based offloading strategy while considering these factors to obtain the maximal long-term reward reflecting the offloading delay. Our research provides a robust strategy for task offloading in VFC systems, its effectiveness is demonstrated through simulation experiments and comparison with benchmark strategies.
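A toy value-iteration sketch of an offloading decision process: states are the number of busy fog vehicles and the action is "process locally" or "offload". The transition probabilities, rewards and discount factor are invented for illustration, and the full SMDP of the paper (random arrivals, departures and sojourn times) is not reproduced.

```python
# Toy value iteration for a local-vs-offload decision process.
import numpy as np

n_states, gamma = 5, 0.95                 # states: 0..4 busy fog vehicles
actions = ("local", "offload")
V = np.zeros(n_states)

def step(s, a):
    """Return a list of (prob, next_state, reward) for state s and action a."""
    if a == "offload" and s < n_states - 1:
        # offloading occupies one more fog vehicle; reward falls as occupancy grows
        return [(0.8, s + 1, 1.0 - 0.1 * s), (0.2, s, 1.0 - 0.1 * s)]
    # local processing: lower reward, but fog occupancy can drain
    return [(0.6, max(s - 1, 0), 0.4), (0.4, s, 0.4)]

for _ in range(200):                      # value iteration
    V = np.array([max(sum(p * (r + gamma * V[s2]) for p, s2, r in step(s, a))
                      for a in actions) for s in range(n_states)])

policy = [max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in step(s, a)))
          for s in range(n_states)]
print(policy)
```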

A Generalized Approach for Recovering Time Encoded Signals with Finite Rate of Innovation

  • paper_url: http://arxiv.org/abs/2309.10223
  • repo_url: None
  • paper_authors: Dorian Florescu
  • for: Considers the recovery of a sum of filtered Diracs, representing an input with finite rate of innovation (FRI), from time encoding machine (TEM) measurements.
  • methods: Introduces a generalized method guaranteeing recovery of FRI signals from the TEM output. On the theoretical side, it significantly extends the class of filters for which reconstruction is guaranteed and gives a perfect-recovery condition depending on the filter's first two local derivatives; on the practical side, when the filter's mathematical form is unknown, the method bypasses the filter-modelling stage and reduces the complexity of the recovery process.
  • results: Validated via numerical simulations with filters used in the prior literature as well as filters not covered by existing results, and additionally with a TEM hardware implementation.
    Abstract In this paper, we consider the problem of recovering a sum of filtered Diracs, representing an input with finite rate of innovation (FRI), from its corresponding time encoding machine (TEM) measurements. So far, the recovery was guaranteed for cases where the filter is selected from a number of particular mathematical functions. Here, we introduce a new generalized method for recovering FRI signals from the TEM output. On the theoretical front, we significantly increase the class of filters for which reconstruction is guaranteed, and provide a condition for perfect input recovery depending on the first two local derivatives of the filter. We extend this result with reconstruction guarantees in the case of noise corrupted FRI signals. On the practical front, in cases where the filter has an unknown mathematical function, the proposed method streamlines the recovery process by bypassing the filter modelling stage. We validate the proposed method via numerical simulations with filters previously used in the literature, as well as filters that are not compatible with the existing results. Additionally, we validate the results using a TEM hardware implementation.
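The measurement side of the problem can be illustrated with a basic integrate-and-fire time encoding machine driven by a filtered pair of Diracs. The kernel, bias and threshold below are assumed values, and the paper's recovery algorithm is not implemented here.

```python
# Integrate-and-fire TEM producing trigger times from a filtered FRI-style input.
import numpy as np

fs = 100_000
t = np.arange(0, 0.02, 1 / fs)

def kernel(tau):                           # assumed Gaussian sampling kernel
    return np.exp(-(tau / 0.5e-3) ** 2)

# Input: two Diracs (amplitudes 1.0 and 0.6) filtered by the kernel.
x = 1.0 * kernel(t - 5e-3) + 0.6 * kernel(t - 12e-3)

bias, kappa, delta = 1.5, 1.0, 2e-3        # TEM bias, integrator gain, threshold
integ, spikes = 0.0, []
for n, xn in enumerate(x):
    integ += (xn + bias) / kappa / fs      # non-leaky integrate-and-fire
    if integ >= delta:
        spikes.append(t[n])
        integ -= delta                     # reset by subtraction
print(f"{len(spikes)} trigger times, first few: {np.round(spikes[:5], 5)}")
```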

cs.SD - 2023-09-18

Investigating End-to-End ASR Architectures for Long Form Audio Transcription

  • paper_url: http://arxiv.org/abs/2309.09950
  • repo_url: None
  • paper_authors: Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg
  • for: This paper provides an overview and evaluation of end-to-end ASR models on long-form audios, with a focus on three categories of models based on their core architecture.
  • methods: The paper evaluates Word Error Rate, maximum audio length, and real-time factor for each model on several long audio benchmarks, including Earnings-21 and 22, CORAAL, and TED-LIUM3.
  • results: The model with self-attention and local attention has the best accuracy, and CTC-based models are more robust and efficient than RNNT on long-form audio.
    Abstract This paper presents an overview and evaluation of some of the end-to-end ASR models on long-form audios. We study three categories of Automatic Speech Recognition(ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maximum audio length and real-time factor for each model on a variety of long audio benchmarks: Earnings-21 and 22, CORAAL, and TED-LIUM3. The model from the category of self-attention with local attention and global token has the best accuracy comparing to other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT on long form audio.
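The two headline metrics of the comparison, word error rate and real-time factor, can be computed as in the sketch below; the inputs are toy examples and `transcribe` is a placeholder for any ASR system, not a function from the paper.

```python
# Word error rate via edit distance, plus a real-time factor helper.
import time

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1): d[i][0] = i
    for j in range(len(h) + 1): d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(r), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))   # 2 errors / 6 words

def real_time_factor(transcribe, audio, audio_seconds):
    """RTF < 1 means the system runs faster than real time."""
    start = time.perf_counter()
    transcribe(audio)
    return (time.perf_counter() - start) / audio_seconds
```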

Harmony and Duality: An introduction to Music Theory

  • paper_url: http://arxiv.org/abs/2309.10719
  • repo_url: None
  • paper_authors: Maksim Lipyanskiy
  • for: Provides a combinatorial foundation for harmony-related music theory, including scales, chord formation, and improvisation.
  • methods: Introduces constraints that limit the admissible scales, for example forbidding two voices only a semitone apart, or three voices colliding (three notes separated only by semitones), and then studies scales that are complete, i.e., maximal sets of tones satisfying these constraints.
  • results: Completeness under these simple two-/three-voice constraints characterizes the scales commonly used in composition; there is a duality between scales subject to the two-voice constraint and those subject to the three-voice constraint, and combining these constraint ideas yields a classification of chords.
    Abstract We develop aspects of music theory related to harmony, such as scales, chord formation and improvisation from a combinatorial perspective. The goal is to provide a foundation for this subject by deriving the basic structure from a few assumptions, rather than writing down long lists of chords/scales to memorize without an underlying principle. Our approach involves introducing constraints that limit the possible scales we can consider. For example, we may impose the constraint that two voices cannot be only a semitone apart as this is too dissonant. We can then study scales that do not contain notes that are a semitone apart. A more refined constraint avoids three voices colliding by studying scales that do not have three notes separated only by semitones. Additionally, we require that our scales are complete, which roughly means that they are the maximal sets of tones that satisfy these constraints. As it turns out, completeness as applied to these simple two/three voice constraints characterizes the types of scales that are commonly used in music composition. Surprisingly, there is a correspondence between scales subject to the two-voice constraint and those subject to the three-voice constraint. We formulate this correspondence as a duality statement that provides a way to understand scales subject to one type of constraint in terms of scales subject to the other. Finally, we combine these constraint ideas to provide a classification of chords.
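A small combinatorial sketch of the two-voice constraint and the completeness notion described above: scales are taken as subsets of the twelve pitch classes with no two members a semitone apart (circularly), and a complete scale is one to which no further note can be added without violating the constraint. The paper's actual definitions are richer; this only illustrates the flavour.

```python
# Enumerate "complete" scales under a toy two-voice (no-semitone) constraint.
from itertools import combinations

N = 12

def ok(scale):                       # no two members a semitone apart (mod 12)
    s = set(scale)
    return all((p + 1) % N not in s for p in s)

def complete(scale):                 # maximal: no pitch class can be added
    s = set(scale)
    return all(not ok(s | {q}) for q in range(N) if q not in s)

scales = [set(c) for k in range(1, N + 1)
          for c in combinations(range(N), k) if ok(c) and complete(set(c))]
print(len(scales), "complete scales under the two-voice constraint")
print(sorted(scales, key=len)[-1])   # a largest one: a whole-tone collection
```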

Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection

  • paper_url: http://arxiv.org/abs/2309.09837
  • repo_url: None
  • paper_authors: Awais Khan, Khalid Mahmood Malik, Shah Nawaz
  • for: Defending automatic speaker verification systems against voice spoofing attacks and improving their security.
  • methods: A spectra-temporal fusion strategy that combines a frame-level local spectral deviation coefficient (SDC), a bi-LSTM network for utterance-level sequential temporal coefficients (STC), and an auto-encoder producing spectra-temporal deviated coefficients (STDC), capturing artifacts of diverse spoofing types.
  • results: Extensive evaluation on multiple datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrates robustness for a wide range of voice applications.
    Abstract Voice spoofing attacks pose a significant threat to automated speaker verification systems. Existing anti-spoofing methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our spectra-temporal fusion strategy combines these coefficients, and an auto-encoder generates spectra-temporal deviated coefficients (STDC) to enhance robustness. Our proposed approach addresses multiple spoofing categories, including synthetic, replay, and partial deepfake attacks. Extensive evaluation on diverse datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrated its robustness for a wide range of voice applications.
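The exact SDC definition is not given in the abstract, so the following is an assumed, illustrative frame-level "spectral deviation" feature: the deviation of each frame's log spectrum from a local average over neighbouring frames.

```python
# Illustrative frame-level spectral-deviation feature (not the paper's exact SDC).
import numpy as np

def local_spectral_deviation(x, frame_len=512, hop=256, context=5, eps=1e-10):
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
    logmag = np.log(np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=-1)) + eps)
    sdc = np.empty(len(logmag))
    for i in range(len(logmag)):
        lo, hi = max(0, i - context), min(len(logmag), i + context + 1)
        local_mean = logmag[lo:hi].mean(axis=0)
        sdc[i] = np.mean((logmag[i] - local_mean) ** 2)   # per-frame deviation
    return sdc

audio = np.random.randn(16000)       # 1 s of toy audio at 16 kHz
print(local_spectral_deviation(audio)[:5])
```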
    摘要 声音骗陷poses a significant threat to automatic speaker verification systems. Existing anti-骗陷 methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our spectra-temporal fusion strategy combines these coefficients, and an auto-encoder generates spectra-temporal deviated coefficients (STDC) to enhance robustness. Our proposed approach addresses multiple spoofing categories, including synthetic, replay, and partial deepfake attacks. Extensive evaluation on diverse datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrated its robustness for a wide range of voice applications.

Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

  • paper_url: http://arxiv.org/abs/2309.09705
  • repo_url: https://github.com/littleflyingsheep/synthac
  • paper_authors: Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Xubo Liu, Haohe Liu, Kejia Zhang, Wenwu Wang
  • for: Advancing audio captioning, whose development is constrained by the limited availability and quality of text-audio data.
  • methods: Proposes the SynthAC framework, which leverages an existing audio generative model and commonly available text corpora to create synthetic text-audio pairs and thereby enhance text-audio representation; specifically, the text-to-audio generation model AudioLDM is used to generate synthetic audio signals from captions in an image captioning dataset.
  • results: Experiments show that SynthAC benefits audio captioning models by learning relations within the synthetic text-audio pairs, and it can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
    Abstract Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which leverages recent advances in audio generative models and commonly available text corpus to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, the text-to-audio generation model, i.e., AudioLDM, is used to generate synthetic audio signals with captions from an image captioning dataset. Our SynthAC expands the availability of well-annotated captions from the text-vision domain to audio captioning, thus enhancing text-audio representation by learning relations within synthetic text-audio pairs. Experiments demonstrate that our SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpus from the text-vision domain, offering a promising solution to the challenge caused by data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.

Scaling the time and Fourier domains to align periodically and their convolution

  • paper_url: http://arxiv.org/abs/2309.09645
  • repo_url: https://github.com/flatmax/fxt
  • paper_authors: Matthew R. Flax, W. Harvey Holmes
  • for: Explains how to align a periodic signal with its Fourier transform by means of frequency or time scaling.
  • methods: Applies frequency or time scaling so that the signal and its transform align periodically, and also convolves the signals; the frequency-time convolution is denoted fxt.
  • results: The alignment may be useful for developing new algorithms, e.g., for pitch estimation.
    Abstract This note shows how to align a periodic signal with its the Fourier transform by means of frequency or time scaling. This may be useful in developing new algorithms, e.g. for pitch estimation. This note also convolves the signals and the frequency time convolution is denoted fxt.
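One possible reading of the scaling idea, sketched under stated assumptions: time-scale a periodic signal so that its fundamental falls exactly on an FFT bin, which aligns the signal with its transform and removes spectral leakage. The signal, sample rate and scale factor below are toy values, not taken from the note.

```python
# Time-scaling a periodic signal so its fundamental lands on an FFT bin.
import numpy as np

fs, N = 8000, 1024
f0 = 212.3                                  # true fundamental [Hz], not bin-aligned
t = np.arange(N) / fs
x = np.sin(2 * np.pi * f0 * t)

k = round(f0 * N / fs)                      # nearest FFT bin
scale = (k * fs / N) / f0                   # time-scale factor moving f0 onto bin k
x_aligned = np.sin(2 * np.pi * f0 * (scale * t))   # time-scaled copy

for name, sig in (("raw", x), ("aligned", x_aligned)):
    X = np.abs(np.fft.rfft(sig))
    print(f"{name}: peak bin {int(np.argmax(X))}, "
          f"energy concentration {X.max() / X.sum():.3f}")
```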

Refining DNN-based Mask Estimation using CGMM-based EM Algorithm for Multi-channel Noise Reduction

  • paper_url: http://arxiv.org/abs/2309.09630
  • repo_url: None
  • paper_authors: Julitta Bartolewska, Stanisław Kacprzak, Konrad Kowalczyk
  • for: Further improving speech enhancement obtained with recently introduced deep neural network (DNN) models.
  • methods: Proposes a multi-channel refinement of time-frequency masks obtained with single-channel DNNs, consisting of an iterative complex Gaussian mixture model (CGMM)-based algorithm followed by optimum spatial filtering.
  • results: Validated on masks estimated by three recent deep learning models (DCUnet, DCCRN, and FullSubNet), the refinement improves mask accuracy (AUC) and, consequently, the overall quality of the enhanced speech (PESQ improvement), with consistent gains across all three DNN models.
    Abstract In this paper, we present a method that allows to further improve speech enhancement obtained with recently introduced Deep Neural Network (DNN) models. We propose a multi-channel refinement method of time-frequency masks obtained with single-channel DNNs, which consists of an iterative Complex Gaussian Mixture Model (CGMM) based algorithm, followed by optimum spatial filtration. We validate our approach on time-frequency masks estimated with three recent deep learning models, namely DCUnet, DCCRN, and FullSubNet. We show that our method with the proposed mask refinement procedure allows to improve the accuracy of estimated masks, in terms of the Area Under the ROC Curve (AUC) measure, and as a consequence the overall speech quality of the enhanced speech signal, as measured by PESQ improvement, and that the improvement is consistent across all three DNN models.
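A condensed sketch of the refinement idea: starting from a DNN-style mask, a few EM-like iterations of a two-component complex Gaussian model per frequency bin re-estimate the speech posterior, which could then drive a spatial filter. The update below is a simplified variant written for illustration, not the paper's exact CGMM algorithm, and the data are random stand-ins for multichannel STFT frames.

```python
# Simplified two-component complex-Gaussian EM refinement of a T-F mask.
import numpy as np

rng = np.random.default_rng(0)
F_bins, T_frames, M = 4, 200, 3                     # freq bins, frames, microphones
Y = (rng.normal(size=(F_bins, T_frames, M)) +
     1j * rng.normal(size=(F_bins, T_frames, M)))    # toy multichannel STFT
mask = rng.uniform(0.2, 0.8, size=(F_bins, T_frames))  # stand-in for a DNN mask

def refine_mask(Y, mask, n_iter=5, eps=1e-6):
    F_bins, T_frames, M = Y.shape
    post = np.stack([mask, 1.0 - mask])              # posteriors for [speech, noise]
    for _ in range(n_iter):
        for f in range(F_bins):
            Yf = Y[f]                                # (T, M)
            new_post = np.empty((2, T_frames))
            for c in range(2):
                w = post[c, f]
                # weighted spatial covariance: sum_t w_t * y_t y_t^H
                R = (Yf.T * w) @ Yf.conj() / (w.sum() + eps) + eps * np.eye(M)
                Rinv = np.linalg.inv(R)
                quad = np.real(np.einsum("tm,mn,tn->t", Yf.conj(), Rinv, Yf))
                logdet = np.linalg.slogdet(R)[1]
                new_post[c] = -quad - logdet         # log-likelihood up to a constant
            new_post -= new_post.max(axis=0)         # normalise in the log domain
            p = np.exp(new_post)
            post[:, f] = p / p.sum(axis=0)
    return post[0]                                    # refined speech mask

print(refine_mask(Y, mask).shape)                     # (F_bins, T_frames)
```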

Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

  • paper_url: http://arxiv.org/abs/2309.09627
  • repo_url: None
  • paper_authors: Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda
  • for: Enhancing the intelligibility of electrolaryngeal (EL) speech.
  • methods: Uses a robust linguistic encoder that projects both EL and typical speech into the same latent space to reduce the speech-type mismatch, and introduces HuBERT output features to reduce the speaker mismatch.
  • results: Compared with the conventional framework, achieves a 16% improvement in character error rate and a 0.83 improvement in naturalness score.
    Abstract We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech in the same latent space, while still being able to extract accurate linguistic information, creating a unified representation to reduce the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework for reducing the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that compared to the conventional framework using mel-spectrogram input and output features, using the proposed framework enables the model to synthesize more intelligible and naturally sounding speech, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score.

HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond

  • paper_url: http://arxiv.org/abs/2309.09623
  • repo_url: https://github.com/shansongliu/humtrans
  • paper_authors: Shansong Liu, Xu Li, Dian Li, Ying Shan
  • for: Introduces HumTrans, a publicly available humming dataset designed for humming melody transcription and for downstream tasks such as humming-based music generation.
  • methods: Ten college students, all music majors or proficient in at least one instrument, hummed each segment twice through the web recording interface of the authors' website, with recordings sampled at 44,100 Hz; a musical score and the melody audio were provided during humming to help capture both melody and rhythm.
  • results: The dataset contains about 56.22 hours of audio, making it the largest known humming dataset to date; it will be released on Hugging Face together with a GitHub repository containing baseline results and evaluation code.
    Abstract This paper introduces the HumTrans dataset, which is publicly available and primarily designed for humming melody transcription. The dataset can also serve as a foundation for downstream tasks such as humming melody based music generation. It consists of 500 musical compositions of different genres and languages, with each composition divided into multiple segments. In total, the dataset comprises 1000 music segments. To collect this humming dataset, we employed 10 college students, all of whom are either music majors or proficient in playing at least one musical instrument. Each of them hummed every segment twice using the web recording interface provided by our designed website. The humming recordings were sampled at a frequency of 44,100 Hz. During the humming session, the main interface provides a musical score for students to reference, with the melody audio playing simultaneously to aid in capturing both melody and rhythm. The dataset encompasses approximately 56.22 hours of audio, making it the largest known humming dataset to date. The dataset will be released on Hugging Face, and we will provide a GitHub repository containing baseline results and evaluation codes.

Spoofing attack augmentation: can differently-trained attack models improve generalisation?

  • paper_url: http://arxiv.org/abs/2309.09586
  • repo_url: None
  • paper_authors: Wanying Ge, Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Nicholas Evans
  • for: Investigates the robustness of deepfake detectors and spoofing countermeasures (CMs) in the face of unpredictable spoofing attacks.
  • methods: Studies deep-learning-based attacks trained under different conditions (initialisations, hyper-parameters, training data partitions) to examine whether exposing CMs to a broad variety of attacks during training yields generalisable learning.
  • results: The potency of deep-learning-based spoofing attacks varies with training conditions and can substantially degrade detection; a RawNet2-based CM is vulnerable to even modest adjustments of the attack algorithm, whereas CMs based on graph attention networks and self-supervised learning remain robust. Training on data generated with different attack algorithms may not be sufficient on its own, and spoofing attack augmentation at the algorithm level can be complementary.
    Abstract A reliable deepfake detector or spoofing countermeasure (CM) should be robust in the face of unpredictable spoofing attacks. To encourage the learning of more generaliseable artefacts, rather than those specific only to known attacks, CMs are usually exposed to a broad variety of different attacks during training. Even so, the performance of deep-learning-based CM solutions are known to vary, sometimes substantially, when they are retrained with different initialisations, hyper-parameters or training data partitions. We show in this paper that the potency of spoofing attacks, also deep-learning-based, can similarly vary according to training conditions, sometimes resulting in substantial degradations to detection performance. Nevertheless, while a RawNet2 CM model is vulnerable when only modest adjustments are made to the attack algorithm, those based upon graph attention networks and self-supervised learning are reassuringly robust. The focus upon training data generated with different attack algorithms might not be sufficient on its own to ensure generaliability; some form of spoofing attack augmentation at the algorithm level can be complementary.

Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2309.09469
  • repo_url: None
  • paper_authors: Zeyang Song, Jibin Wu, Malu Zhang, Mike Zheng Shou, Haizhou Li
  • for: Improving the speech processing performance of spiking neural network (SNN)-based systems.
  • methods: Proposes Spiking-LEAF, a learnable auditory front-end combining a learnable filter bank with the IHC-LIF neuron model, a two-compartment spiking neuron inspired by inner hair cells that uses segregated dendritic and somatic compartments, lateral feedback, and a spike regularization loss to capture the multi-scale temporal dynamics of speech efficiently.
  • results: On keyword spotting and speaker identification tasks, Spiking-LEAF outperforms state-of-the-art spiking auditory front-ends and conventional real-valued acoustic features in classification accuracy, noise robustness, and encoding efficiency.
    Abstract Brain-inspired spiking neural networks (SNNs) have demonstrated great potential for temporal signal processing. However, their performance in speech processing remains limited due to the lack of an effective auditory front-end. To address this limitation, we introduce Spiking-LEAF, a learnable auditory front-end meticulously designed for SNN-based speech processing. Spiking-LEAF combines a learnable filter bank with a novel two-compartment spiking neuron model called IHC-LIF. The IHC-LIF neurons draw inspiration from the structure of inner hair cells (IHC) and they leverage segregated dendritic and somatic compartments to effectively capture multi-scale temporal dynamics of speech signals. Additionally, the IHC-LIF neurons incorporate the lateral feedback mechanism along with spike regularization loss to enhance spike encoding efficiency. On keyword spotting and speaker identification tasks, the proposed Spiking-LEAF outperforms both SOTA spiking auditory front-ends and conventional real-valued acoustic features in terms of classification accuracy, noise robustness, and encoding efficiency.
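A plain leaky integrate-and-fire neuron driven by a rectified band-limited signal gives a feel for the kind of spike encoding such a front-end produces. The two-compartment IHC-LIF model, lateral feedback and spike-regularisation loss from the paper are not implemented here, and all constants are assumed.

```python
# Basic leaky integrate-and-fire encoding of a rectified "filter output".
import numpy as np

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
drive = np.maximum(0.0, np.sin(2 * np.pi * 300 * t))   # half-wave rectified input

tau, v_th, v = 5e-3, 1e-3, 0.0      # membrane time constant, threshold, potential
spikes = []
for n, i_in in enumerate(drive):
    v += (-(v / tau) + i_in) / fs    # leaky integration of the input current
    if v >= v_th:
        spikes.append(t[n])
        v = 0.0                      # reset after a spike
print(f"{len(spikes)} spikes in {t[-1] * 1000:.0f} ms")
```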

Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

  • paper_url: http://arxiv.org/abs/2309.09413
  • repo_url: None
  • paper_authors: Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Fabian Ritter-Gutierrez, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma
  • for: Explains how and why soft prompts help automatic speech recognition (ASR) and how they can be used to improve ASR performance.
  • methods: Analyzes the role of soft prompts in ASR under different conditions to deepen understanding of this parameter-efficient tuning method.
  • results: Soft prompts act as zero-shot learners that improve ASR performance and help the model cope with noisy environments, but they are not required for inference and are vulnerable to malicious modification. Two primary roles are identified, content refinement and noise-information enhancement, which improve robustness to background noise, and a proposed modification of noise prompts enables zero-shot adaptation to out-of-distribution noise environments.
    Abstract Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). Our findings highlight their role as zero-shot learners in improving ASR performance but also make them vulnerable to malicious modifications. Soft prompts aid generalization but are not obligatory for inference. We also identify two primary roles of soft prompts: content refinement and noise information enhancement, which enhances robustness against background noise. Additionally, we propose an effective modification on noise prompts to show that they are capable of zero-shot learning on adapting to out-of-distribution noise environments.
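A minimal sketch of soft-prompt tuning in general, not the paper's specific ASR setup: a few learnable embedding vectors are prepended to a frozen encoder's input sequence, and only those vectors would be trained. The encoder, dimensions and prompt length are placeholder assumptions.

```python
# Soft-prompt tuning sketch: learnable prompt vectors prepended to a frozen encoder.
import torch
import torch.nn as nn

d_model, n_prompt = 256, 16
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
for p in encoder.parameters():
    p.requires_grad_(False)                       # freeze the pretrained encoder

soft_prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
# An optimizer for prompt tuning would update only soft_prompt, e.g.:
# opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def encode_with_prompt(features):                 # features: (batch, time, d_model)
    prompt = soft_prompt.unsqueeze(0).expand(features.size(0), -1, -1)
    return encoder(torch.cat([prompt, features], dim=1))

feats = torch.randn(4, 100, d_model)              # stand-in acoustic feature frames
out = encode_with_prompt(feats)
print(out.shape)                                   # torch.Size([4, 116, 256])
```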