results: The results demonstrate that the proposed method enables real-time molecular composition inference while maintaining the accuracy of traditional optimization.
Abstract
Spectroscopy-based imaging modalities such as near-infrared spectroscopy (NIRS) and hyperspectral imaging (HSI) represent a promising alternative for low-cost, non-invasive, and fast monitoring of functional and structural properties of living tissue. In particular, the possibility of extracting the molecular composition of the tissue from the optical spectra in real time makes spectroscopy techniques unique diagnostic tools. However, due to the highly limited availability of paired optical and molecular profiling studies, building a mapping between a spectral signature and a corresponding set of molecular concentrations is still an unsolved problem. Moreover, there are as yet no established methods to streamline inference of the biochemical composition from the optical spectrum for real-time applications such as surgical monitoring. In this paper, we develop a technique for fast inference of changes in the molecular composition of brain tissue. We base our method on the Beer-Lambert law to analytically connect the spectra with concentrations and use a deep-learning approach to significantly speed up the concentration inference compared to traditional optimization methods. We test our approach on real data obtained from the broadband NIRS study of piglets' brains. The results demonstrate that the proposed method enables real-time molecular composition inference while maintaining the accuracy of traditional optimization.
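As a concrete illustration of the analytical connection above, the sketch below inverts the modified Beer-Lambert law by least squares; this is the kind of per-sample optimization that a trained network would replace for real-time use. The chromophore count, pathlength factor, and extinction spectra are illustrative stand-ins, not values from the study.

```python
# Minimal sketch: inferring concentration changes from broadband NIRS
# attenuation via the modified Beer-Lambert law (all values illustrative).
import numpy as np

rng = np.random.default_rng(0)

n_wavelengths, n_chromophores = 120, 3   # e.g. HbO2, HHb, oxCCO (assumed)
d_dpf = 3.0 * 4.5                        # source-detector distance x DPF (assumed)

# E[i, j]: extinction coefficient of chromophore j at wavelength i.
# In practice these come from published spectra; random stand-ins here.
E = rng.uniform(0.1, 1.0, size=(n_wavelengths, n_chromophores))

true_dc = np.array([1.5, -0.8, 0.2])     # ground-truth concentration changes
dA = E @ true_dc * d_dpf + 0.01 * rng.standard_normal(n_wavelengths)

# Least-squares inversion: the "traditional optimization" baseline that the
# deep-learning approach is trained to approximate at much lower latency.
dc_hat, *_ = np.linalg.lstsq(E * d_dpf, dA, rcond=None)
print("estimated concentration changes:", dc_hat)
```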
X-ray dark-field via spectral propagation-based imaging
paper_authors: Jannis N. Ahlers, Konstantin M. Pavlov, Marcus J. Kitchen, Kaye S. Morgan
for: This paper describes a novel dark-field X-ray imaging technique that visualises scattering from unresolved microstructure.
methods: The technique builds on propagation-based imaging (PBI), an optics-free phase-contrast method that recovers phase information by modelling the propagation of a diffracted wavefield.
results: Using an a priori dark-field spectral dependence, the PBI dark-field retrieval algorithm is successfully applied to simulated and experimental dual-energy data.
Abstract
Dark-field X-ray imaging is a novel modality which visualises scattering from unresolved microstructure. Current dark-field imaging techniques typically require precision optics in a stable environment. Propagation-based imaging (PBI) is an optics-free phase-contrast imaging technique that can be used to recover phase information by modelling the propagation of a diffracted wavefield. Based on the Fokker--Planck equation of X-ray imaging, we propose a dual-energy PBI approach to capture phase and dark-field effects. The equation is solved under conditions of a single-material sample with spatially slowly-varying dark-field signal, together with an a priori dark-field spectral dependence. We use single-grid dark-field imaging to fit a power law to the dark-field spectral dependence, and successfully apply the PBI dark-field retrieval algorithm to simulated and experimental dual-energy data.
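The power-law fit mentioned in the abstract can be illustrated with a simple log-log regression. The sketch below uses synthetic energies and dark-field values, not the paper's measurements; the exponent and prefactor are assumptions.

```python
# Minimal sketch: fitting a power law D(E) = a * E**b to a dark-field
# spectral dependence via log-log linear regression (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
energies = np.array([20.0, 25.0, 30.0, 35.0, 40.0])          # keV (assumed)
dark_field = 5.0 * energies**-2.1 * (1 + 0.02 * rng.standard_normal(5))

b, log_a = np.polyfit(np.log(energies), np.log(dark_field), 1)
print(f"fitted power law: D(E) ~ {np.exp(log_a):.3f} * E^{b:.3f}")
```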
DREAM-PCD: Deep Reconstruction and Enhancement of mmWave Radar Pointcloud
results: DREAM-PCD surpasses existing methods in reconstruction quality and generalization, and exhibits excellent real-time capability and scalability across various scenarios and parameters.
Abstract
Millimeter-wave (mmWave) radar pointcloud offers attractive potential for 3D sensing, thanks to its robustness in challenging conditions such as smoke and low illumination. However, existing methods fail to simultaneously address the three main challenges in mmWave radar pointcloud reconstruction: loss of specular information, low angular resolution, and strong interference and noise. In this paper, we propose DREAM-PCD, a novel framework that combines signal processing and deep learning methods into three well-designed components to tackle all three challenges: Non-Coherent Accumulation for dense points, Synthetic Aperture Accumulation for improved angular resolution, and Real-Denoise Multiframe network for noise and interference removal. Moreover, the causal multiframe and "real-denoise" mechanisms in DREAM-PCD significantly enhance the generalization performance. We also introduce RadarEyes, the largest mmWave indoor dataset with over 1,000,000 frames, featuring a unique design incorporating two orthogonal single-chip radars, lidar, and camera, enriching dataset diversity and applications. Experimental results demonstrate that DREAM-PCD surpasses existing methods in reconstruction quality, and exhibits superior generalization and real-time capabilities, enabling high-quality real-time reconstruction of radar pointcloud under various parameters and scenarios. We believe that DREAM-PCD, along with the RadarEyes dataset, will significantly advance mmWave radar perception in future real-world applications.
results: The classical matched filter does not produce optimal results and introduces bias into the function estimates. To counteract this, the authors propose a new filter design and show that, under a bound on the maximum time delay, unbiased function computation is achievable. They further pose a Tikhonov regularization problem that produces an optimal filter given a tradeoff between the bias and the noise-induced variance of the function estimates. When the time delays are long compared to the transmitted pulses, the proposed filter vastly outperforms the matched filter; for shorter delays it yields similar MSE while reducing the bias.
Abstract
Over-the-air computation (OAC) is a promising wireless communication method for aggregating data from many devices in dense wireless networks. The fundamental idea of OAC is to exploit signal superposition to compute functions of multiple simultaneously transmitted signals. However, the time- and phase-alignment of these superimposed signals have a significant effect on the quality of function computation. In this study, we analyze the OAC problem for a system with unknown random time delays and phase shifts. We show that the classical matched filter does not produce optimal results, and generates bias in the function estimates. To counteract this, we propose a new filter design and show that, under a bound on the maximum time delay, it is possible to achieve unbiased function computation. Additionally, we propose a Tikhonov regularization problem that produces an optimal filter given a tradeoff between the bias and noise-induced variance of the function estimates. When the time delays are long compared to the length of the transmitted pulses, our filter vastly outperforms the matched filter both in terms of bias and mean-squared error (MSE). For shorter time delays, our proposal yields similar MSE as the matched filter, while reducing the bias.
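A Tikhonov-regularized filter design in this spirit can be sketched as a ridge-regression problem: the filter trades fit to a target response (bias) against its own energy (noise-induced variance) through a regularization weight. The pulse shape, delay set, circular-convolution model, and target response below are assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch: Tikhonov-regularized receive-filter design,
# g = argmin ||H g - target||^2 + lam * ||g||^2  (all parameters assumed).
import numpy as np

L = 64                                     # filter length (assumed)
pulse = np.sinc(np.linspace(-4, 4, 33))    # transmitted pulse shape (assumed)

def circ_conv_matrix(p, delay, n):
    """Circular-convolution matrix of pulse p shifted by `delay` samples."""
    h = np.zeros(n)
    k = min(len(p), n - delay)
    h[delay:delay + k] = p[:k]
    return np.column_stack([np.roll(h, i) for i in range(n)])

# Average channel over a few candidate delays (kept circular for simplicity).
H = sum(circ_conv_matrix(pulse, d, L) for d in (0, 3, 7)) / 3.0
target = np.eye(L)[L // 2]                 # desired overall response (assumed)

lam = 1e-2                                 # bias/variance tradeoff knob
g = np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ target)
print("filter energy:", g @ g)
```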
Channel Estimation for Reconfigurable Intelligent Surface-Aided Multiuser Communication Systems Exploiting Statistical CSI of Correlated RIS-User Channels
results: Simulation results show that the proposed EP channel estimation scheme achieves accurate channel estimation with lower pilot overhead than existing schemes.
Abstract
Reconfigurable intelligent surface (RIS) is a promising candidate technology for the upcoming Sixth Generation (6G) communication system for its ability to manipulate the wireless communication environment by controlling the coefficients of reflection elements (REs). However, since the RIS usually consists of a large number of passive REs, the pilot overhead for channel estimation in the RIS-aided system is prohibitively high. In this paper, the channel estimation problem for a RIS-aided multi-user multiple-input-single-output (MISO) communication system with clustered users is investigated. First, to describe the correlated feature for RIS-user channels, a beam domain channel model is developed for RIS-user channels. Then, a pilot reuse strategy is put forward to reduce the pilot overhead and decompose the channel estimation problem into several subproblems. Finally, by leveraging the correlated nature of RIS-user channels, an eigenspace projection (EP) algorithm is proposed to solve each subproblem respectively. Simulation results show that the proposed EP channel estimation scheme can achieve accurate channel estimation with lower pilot overhead than existing schemes.
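The eigenspace-projection idea can be captured in a few lines: a noisy least-squares channel estimate is projected onto the dominant eigenspace of the channel covariance, suppressing noise outside the low-dimensional subspace. The dimensions and statistics below are assumptions, and in practice the subspace would come from the estimated second-order statistics rather than being known exactly.

```python
# Minimal sketch of an eigenspace-projection (EP) style estimator.
import numpy as np

rng = np.random.default_rng(3)
N, r = 64, 8                               # channel dim, effective rank (assumed)

# Orthonormal basis of a low-rank subspace modelling correlated channels.
U = np.linalg.qr(rng.standard_normal((N, r)))[0]
h = U @ rng.standard_normal(r)             # true channel lives in the subspace
h_ls = h + 0.3 * rng.standard_normal(N)    # noisy least-squares estimate

h_ep = U @ (U.T @ h_ls)                    # project onto dominant eigenspace
print("LS error :", np.linalg.norm(h_ls - h))
print("EP error :", np.linalg.norm(h_ep - h))
```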
Bridging the complexity gap in Tbps-achieving THz-band baseband processing
paper_authors: Hadi Sarieddeen, Hakim Jemaa, Simon Tarboush, Christoph Studer, Mohamed-Slim Alouini, Tareq Y. Al-Naffouri
for: This work aims to translate recent advances in electronic and photonic technologies into terabit-per-second (Tbps) data rates at terahertz (THz) frequencies.
methods: The study advocates parallelization in baseband signal processing, particularly for channel code decoding.
results: By leveraging the structured sub-spaces of THz channels, bits can be mapped to transmission resources using shorter code words, extending parallelizability across all baseband processing blocks and enabling Tbps-scale data rates.
Abstract
Recent advances in electronic and photonic technologies have allowed efficient signal generation and transmission at terahertz (THz) frequencies. However, as the gap in THz-operating devices narrows, the demand for terabit-per-second (Tbps)-achieving circuits is increasing. Translating the available hundreds of gigahertz (GHz) of bandwidth into a Tbps data rate requires processing thousands of information bits per clock cycle at state-of-the-art clock frequencies of digital baseband processing circuitry of a few GHz. This paper addresses these constraints and emphasizes the importance of parallelization in signal processing, particularly for channel code decoding. By leveraging structured sub-spaces of THz channels, we propose mapping bits to transmission resources using shorter code words, extending parallelizability across all baseband processing blocks. THz channels exhibit quasi-deterministic frequency, time, and space structures that enable efficient parallel bit mapping at the source and provide pseudo-soft bit reliability information for efficient detection and decoding at the receiver.
Statistical CSI Based Beamforming for Reconfigurable Intelligent Surface Aided MISO Systems with Channel Correlation
methods: The paper uses statistical channel state information (S-CSI) in place of instantaneous channel state information (I-CSI) and proposes two joint beamforming algorithms: singular value decomposition-gradient descent (SVD-GD) and fractional programming-gradient descent (FP-GD).
results: Simulation results show the effectiveness of both proposed beamforming algorithms in improving network capacity and validate that a 2-bit quantizer is enough for implementing the RIS phase shifts.
Abstract
Reconfigurable intelligent surface (RIS) is a promising candidate technology of the upcoming Sixth Generation (6G) communication system for its ability to provide unprecedented spectral and energy efficiency increment through passive beamforming. However, it is challenging to obtain instantaneous channel state information (I-CSI) for RIS, which obliges us to use statistical channel state information (S-CSI) to achieve passive beamforming. In this paper, RIS-aided multiple-input single-output (MISO) multi-user downlink communication system with correlated channels is investigated. Then, we formulate the problem of joint beamforming design at the AP and RIS to maximize the sum ergodic spectral efficiency (ESE) of all users to improve the network capacity. Since it is too hard to compute sum ESE, an ESE approximation is adopted to reformulate the problem into a more tractable form. Then, we present two joint beamforming algorithms, namely the singular value decomposition-gradient descent (SVD-GD) algorithm and the fractional programming-gradient descent (FP-GD) algorithm. Simulation results show the effectiveness of our proposed algorithms and validate that 2-bits quantizer is enough for RIS phase shifts implementation.
Linear Progressive Coding for Semantic Communication using Deep Neural Networks
results: Experiments show that progressive semantic coding provides timely previews of semantic information from a small number of initial measurements while achieving overall accuracy and efficiency comparable to non-progressive methods.
Abstract
We propose a general method for semantic representation of images and other data using progressive coding. Semantic coding allows for specific pieces of information to be selectively encoded into a set of measurements that can be highly compressed compared to the size of the original raw data. We consider a hierarchical method of coding where a partial amount of semantic information is first encoded a into a coarse representation of the data, which is then refined by additional encodings that add additional semantic information. Such hierarchical coding is especially well-suited for semantic communication i.e. transferring semantic information over noisy channels. Our proposed method can be considered as a generalization of both progressive image compression and source coding for semantic communication. We present results from experiments on the MNIST and CIFAR-10 datasets that show that progressive semantic coding can provide timely previews of semantic information with a small number of initial measurements while achieving overall accuracy and efficiency comparable to non-progressive methods.
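A linear analogue of progressive coding makes the hierarchy concrete: decoding from a growing prefix of measurements yields successively refined reconstructions, so a coarse preview is available early. The generic least-squares sketch below stands in for the learned semantic encoder; the dimensions and measurement matrix are assumptions.

```python
# Minimal sketch of progressive coding with linear measurements: earlier
# blocks give a coarse reconstruction, later blocks refine it.
import numpy as np

rng = np.random.default_rng(4)
d = 256                                    # raw data dimension (assumed)
x = rng.standard_normal(d)                 # data to encode

A = rng.standard_normal((96, d)) / np.sqrt(d)   # measurement matrix (assumed)
y = A @ x                                  # full measurement set

for k in (16, 48, 96):                     # coarse -> refined previews
    x_hat, *_ = np.linalg.lstsq(A[:k], y[:k], rcond=None)
    err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
    print(f"{k:3d} measurements -> relative error {err:.3f}")
```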
IEEE 802.11be Wi-Fi 7: Feature Summary and Performance Evaluation
results: System-level simulation results suggest that, by combining the new techniques, Wi-Fi 7 achieves 30 Gbps throughput and lower latency than Wi-Fi 6.
Abstract
While the pace of commercial scale application of Wi-Fi 6 accelerates, the IEEE 802.11 Working Group is about to complete the development of a new amendment standard IEEE 802.11be -- Extremely High Throughput (EHT), also known as Wi-Fi 7, which can be used to meet the demand for the throughput of 4K/8K videos up to tens of Gbps and low-latency video applications such as virtual reality (VR) and augmented reality (AR). Wi-Fi 7 not only scales Wi-Fi 6 with doubled bandwidth, but also supports real-time applications, which brings revolutionary changes to Wi-Fi. In this article, we start by introducing the main objectives and timeline of Wi-Fi 7 and then list the latest key techniques which promote the performance improvement of Wi-Fi 7. Finally, we validate the most critical objectives of Wi-Fi 7 -- the potential up to 30 Gbps throughput and lower latency. System-level simulation results suggest that by combining the new techniques, Wi-Fi 7 achieves 30 Gbps throughput and lower latency than Wi-Fi 6.
Quantum computer-enabled receivers for optical communication
results: Using optomechanical transduction of phase information to superconducting qubit states, followed by trained short-depth variational quantum circuits, the authors achieve joint detection of communication codewords with error probabilities that surpass all classical, individual-pulse detection receivers in the low-light regime. A transduction model capturing non-idealities such as thermal noise and loss is used to estimate the performance required for a quantum advantage, and the trained variational circuits are executed on an IBM-Q device to show that a quantum advantage is possible even with current levels of quantum computing hardware noise.
Abstract
Optical communication is the standard for high-bandwidth information transfer in today's digital age. The increasing demand for bandwidth has led to the maturation of coherent transceivers that use phase- and amplitude-modulated optical signals to encode more bits of information per transmitted pulse. Such encoding schemes achieve higher information density, but also require more complicated receivers to discriminate the signaling states. In fact, achieving the ultimate limit of optical communication capacity, especially in the low light regime, requires coherent joint detection of multiple pulses. Despite their superiority, such joint detection receivers are not in widespread use because of the difficulty of constructing them in the optical domain. In this work we describe how optomechanical transduction of phase information from coherent optical pulses to superconducting qubit states followed by the execution of trained short-depth variational quantum circuits can perform joint detection of communication codewords with error probabilities that surpass all classical, individual pulse detection receivers. Importantly, we utilize a model of optomechanical transduction that captures non-idealities such as thermal noise and loss in order to understand the transduction performance necessary to achieve a quantum advantage with such a scheme. We also execute the trained variational circuits on an IBM-Q device with the modeled transduced states as input to demonstrate that a quantum advantage is possible even with current levels of quantum computing hardware noise.
Differentiable Machine Learning-Based Modeling for Directly-Modulated Lasers
results: The surrogate model based on a convolutional attention transformer performs best in the numerical equalization setup, with lower root mean square error and shorter training/testing time.
Abstract
End-to-end learning has become a popular method for joint transmitter and receiver optimization in optical communication systems. Such approach may require a differentiable channel model, thus hindering the optimization of links based on directly modulated lasers (DMLs). This is due to the DML behavior in the large-signal regime, for which no analytical solution is available. In this paper, this problem is addressed by developing and comparing differentiable machine learning-based surrogate models. The models are quantitatively assessed in terms of root mean square error and training/testing time. Once the models are trained, the surrogates are then tested in a numerical equalization setup, resembling a practical end-to-end scenario. Based on the numerical investigation conducted, the convolutional attention transformer is shown to outperform the other models considered.
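The surrogate-modeling step can be sketched generically: a small neural network is fitted to input/output pairs from a black-box channel so that gradients can later flow through the surrogate during end-to-end optimization. The toy nonlinearity below merely stands in for the large-signal DML response; the architecture and hyperparameters are assumptions.

```python
# Minimal sketch: fitting a differentiable neural surrogate to a
# black-box channel (toy nonlinearity standing in for the DML response).
import torch

torch.manual_seed(0)

def black_box_channel(x):                  # stand-in for the non-differentiable model
    return torch.tanh(2.0 * x) + 0.05 * torch.roll(x, 1, dims=-1)

surrogate = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for step in range(500):
    x = torch.randn(128, 32)               # random driving waveforms
    with torch.no_grad():
        y = black_box_channel(x)           # "measured" channel output
    loss = torch.mean((surrogate(x) - y) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()

print("surrogate MSE:", loss.item())
```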
Energy-Saving Cell-Free Massive MIMO Precoders with a Per-AP Wideband Kronecker Channel Model
results: Numerical simulations confirm the accuracy of the asymptotic approximation and show that, in low-load scenarios, a subset of access points can be turned off to save power while all antennas of the active access points are used with uniform power, reducing consumed power by up to a factor of 9.
Abstract
We study cell-free massive multiple-input multiple-output precoders that minimize the power consumed by the power amplifiers subject to per-user per-subcarrier rate constraints. The power at each antenna is generally retrieved by solving a fixed-point equation that depends on the instantaneous channel coefficients. Using random matrix theory, we retrieve each antenna power as the solution to a fixed-point equation that depends only on the second-order statistics of the channel. Numerical simulations prove the accuracy of our asymptotic approximation and show how a subset of access points should be turned off to save power consumption, while all the antennas of the active access points are utilized with uniform power across them. This mechanism allows to save consumed power up to a factor of 9$\times$ in low-load scenarios.
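The per-antenna fixed-point structure can be illustrated with a simple iteration p = f(p); the map f below is a toy contraction, not the paper's statistics-based equation.

```python
# Minimal sketch: solving a per-antenna power fixed point p = f(p) by
# simple iteration (the map f is illustrative only).
import numpy as np

def f(p, gains, sigma2=0.1):
    # toy map: each antenna's power grows with noise plus average load
    return sigma2 + 0.5 * gains * np.mean(p)

gains = np.linspace(0.5, 1.5, 8)           # per-antenna effective gains (assumed)
p = np.ones_like(gains)                    # initial powers

for _ in range(100):
    p_new = f(p, gains)
    if np.max(np.abs(p_new - p)) < 1e-10:  # stop once the iteration converges
        break
    p = p_new

print("fixed-point powers:", p)
```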
Design and Optimization of Residual Neural Network Accelerators for Low-Power FPGAs Using High-Level Synthesis
results: ResNet8 and ResNet20 are implemented on Xilinx FPGAs and evaluated on the CIFAR-10 dataset. Compared to the state of the art on the Kria KV260 board, the ResNet20 implementation achieves a 2.88X speedup with 0.5% higher accuracy (91.3%), while ResNet8 accuracy improves by 2.8% to 88.7%.
Abstract
Residual neural networks are widely used in computer vision tasks. They enable the construction of deeper and more accurate models by mitigating the vanishing gradient problem. Their main innovation is the residual block which allows the output of one layer to bypass one or more intermediate layers and be added to the output of a later layer. Their complex structure and the buffering required by the residual block make them difficult to implement on resource-constrained platforms. We present a novel design flow for implementing deep learning models for field programmable gate arrays optimized for ResNets, using a strategy to reduce their buffering overhead to obtain a resource-efficient implementation of the residual layer. Our high-level synthesis (HLS)-based flow encompasses a thorough set of design principles and optimization strategies, exploiting in novel ways standard techniques such as temporal reuse and loop merging to efficiently map ResNet models, and potentially other skip connection-based NN architectures, into FPGA. The models are quantized to 8-bit integers for both weights and activations, 16-bit for biases, and 32-bit for accumulations. The experimental results are obtained on the CIFAR-10 dataset using ResNet8 and ResNet20 implemented with Xilinx FPGAs using HLS on the Ultra96-V2 and Kria KV260 boards. Compared to the state-of-the-art on the Kria KV260 board, our ResNet20 implementation achieves 2.88X speedup with 0.5% higher accuracy of 91.3%, while ResNet8 accuracy improves by 2.8% to 88.7%. The throughputs of ResNet8 and ResNet20 are 12971 FPS and 3254 FPS on the Ultra96 board, and 30153 FPS and 7601 FPS on the Kria KV26, respectively. They Pareto-dominate state-of-the-art solutions concerning accuracy, throughput, and energy.
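For reference, the residual block described above is a few lines in PyTorch: the input bypasses the convolutions and is added back to their output. The sketch shows only the skip connection itself, not the quantization or the HLS buffering optimizations that are the paper's contribution.

```python
# Minimal sketch of a residual block: out = ReLU(conv2(ReLU(conv1(x))) + x).
import torch

class ResidualBlock(torch.nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = torch.nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(y + x)            # skip connection: add the input back

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```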
Approximate Message Passing with Rigorous Guarantees for Pooled Data and Quantitative Group Testing
paper_authors: Nelvin Tan, Jonathan Scarlett, Ramji Venkataramanan
for: The goal is to identify the categories associated with a large collection of items via a sequence of pooled tests, using an approximate message passing (AMP) algorithm.
methods: An AMP algorithm estimates the categories, and its performance is rigorously characterized in both the noiseless and noisy settings.
results: In the noiseless setting, the AMP algorithm is shown to be equivalent to one previously proposed by El Alaoui et al., giving a rigorous version of their performance guarantees. For quantitative group testing (QGT), the AMP guarantees yield precise limiting values of the false positive and false negative rates. Simulations indicate that AMP outperforms a convex programming estimator in a range of QGT scenarios, while the convex program performs better for pooled data with three categories.
Abstract
In the pooled data problem, the goal is to identify the categories associated with a large collection of items via a sequence of pooled tests. Each pooled test reveals the number of items of each category within the pool. We study an approximate message passing (AMP) algorithm for estimating the categories and rigorously characterize its performance, in both the noiseless and noisy settings. For the noiseless setting, we show that the AMP algorithm is equivalent to one recently proposed by El Alaoui et al. Our results provide a rigorous version of their performance guarantees, previously obtained via non-rigorous techniques. For the case of pooled data with two categories, known as quantitative group testing (QGT), we use the AMP guarantees to compute precise limiting values of the false positive rate and the false negative rate. Though the pooled data problem and QGT are both instances of estimation in a linear model, existing AMP theory cannot be directly applied since the design matrices are binary valued. The key technical ingredient in our result is a rigorous analysis of AMP for generalized linear models defined via generalized white noise design matrices. This result, established using a recent universality result of Wang et al., is of independent interest. Our theoretical results are validated by numerical simulations. For comparison, we propose estimators based on convex relaxation and iterative thresholding, without providing theoretical guarantees. Our simulations indicate that AMP outperforms the convex programming estimator for a range of QGT scenarios, but the convex program performs better for pooled data with three categories.
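To make the AMP recursion concrete, the sketch below runs generic AMP for a linear model y = Ax + w with a Gaussian design and a soft-threshold denoiser, including the Onsager correction term. This differs from the pooled-data/QGT setting, which uses a categorical prior and a binary design matrix; all parameters are illustrative.

```python
# Minimal sketch of the AMP iteration for y = A x + w with a
# soft-threshold denoiser (Gaussian design; illustrative only).
import numpy as np

rng = np.random.default_rng(5)
n, N, k = 250, 500, 25                     # measurements, signal dim, sparsity

x_true = np.zeros(N); x_true[rng.choice(N, k, replace=False)] = 1.0
A = rng.standard_normal((n, N)) / np.sqrt(n)
y = A @ x_true + 0.01 * rng.standard_normal(n)

soft = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
x, z = np.zeros(N), y.copy()
for _ in range(30):
    tau = 1.5 * np.sqrt(z @ z / n)         # threshold from residual energy
    x_new = soft(x + A.T @ z, tau)
    onsager = z * np.mean(np.abs(x_new) > 0) * (N / n)   # Onsager correction
    z = y - A @ x_new + onsager
    x = x_new

print("relative recovery error:",
      np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```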
Formation Wing-Beat Modulation (FWM): A Tool for Quantifying Bird Flocks Using Radar Micro-Doppler Signals
results: FWM signals are observed in X-band radar returns from a seagull flock, providing tools for quantifying the number of birds and estimating their mean wing-beat rate, which can aid research in radar ornithology and aero-ecology.
Abstract
Radar echoes from bird flocks contain modulation signals, which we find are produced by the flapping gaits of birds in the flock, resulting in a group of spectral peaks with similar amplitudes spaced at a specific interval. We call this the formation wing-beat modulation (FWM) effect. FWM signals are micro-Doppler modulated by flapping wings and are related to the bird number, wing-beat frequency, and flight phasing strategy. Our X-band radar data show that FWM signals exist in radar signals of a seagull flock, providing tools for quantifying the bird number and estimating the mean wingbeat rate of birds. This new finding could aid in research on the quantification of bird migration numbers and estimation of bird flight behavior in radar ornithology and aero-ecology.
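The FWM signature can be reproduced on synthetic data: summing sinusoidally phase-modulated echoes from several birds produces a comb of spectral peaks whose spacing estimates the mean wing-beat frequency. All signal parameters in the sketch are illustrative.

```python
# Minimal sketch: evenly spaced micro-Doppler peaks from flapping echoes;
# the median peak spacing recovers the wing-beat frequency (synthetic data).
import numpy as np
from scipy.signal import find_peaks

fs, T = 1000.0, 2.0                        # sample rate (Hz), duration (s)
t = np.arange(int(fs * T)) / fs
f_wing = 4.0                               # wing-beat frequency (Hz, assumed)

rng = np.random.default_rng(6)
sig = sum(np.exp(1j * (2 * np.pi * 50 * t                 # body Doppler
                       + 3.0 * np.sin(2 * np.pi * f_wing * t + ph)))
          for ph in rng.uniform(0, 2 * np.pi, 5))         # 5 birds

spec = np.abs(np.fft.rfft(np.real(sig) * np.hanning(len(t))))
freqs = np.fft.rfftfreq(len(t), 1 / fs)
peaks, _ = find_peaks(spec, height=0.1 * spec.max())
print("estimated peak spacing (Hz):", np.median(np.diff(freqs[peaks])))
```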
An Exploration of Optimal Parameters for Efficient Blind Source Separation of EEG Recordings Using AMICA
paper_authors: Gwenevere Frank, Seyed Yahya Shirazi, Jason Palmer, Gert Cauwenberghs, Scott Makeig, Arnaud Delorme
for: This paper studies the effectiveness of independent component analysis (ICA) algorithms, in particular AMICA, for decomposing electroencephalography (EEG) recordings.
methods: AMICA decompositions are run on data from a collection of subjects while varying key parameters; AMICA exposes many parameters to the user, allowing precise control of the decomposition.
results: The running time and decomposition quality under different parameter settings are analyzed using two metrics, Pairwise Mutual Information (PMI) and Mutual Information Reduction (MIR), and recommendations for selecting starting values for the parameters are presented.
Abstract
EEG continues to find a multitude of uses in both neuroscience research and medical practice, and independent component analysis (ICA) continues to be an important tool for analyzing EEG. A multitude of ICA algorithms for EEG decomposition exist, and in the past, their relative effectiveness has been studied. AMICA is considered the benchmark against which to compare the performance of other ICA algorithms for EEG decomposition. AMICA exposes many parameters to the user to allow for precise control of the decomposition. However, several of the parameters currently tend to be set according to "rules of thumb" shared in the EEG community. Here, AMICA decompositions are run on data from a collection of subjects while varying certain key parameters. The running time and quality of decompositions are analyzed based on two metrics: Pairwise Mutual Information (PMI) and Mutual Information Reduction (MIR). Recommendations for selecting starting values for parameters are presented.
results: Experiments demonstrate that the model accurately predicts the head orientations of both speaker and listener, providing high-accuracy estimates across different listener positions and environments.
Abstract
Estimation of a speaker's direction and head orientation with binaural recordings can be a critical piece of information in many real-world applications with emerging `earable' devices, including smart headphones and AR/VR headsets. However, it requires predicting the mutual head orientations of both the speaker and the listener, which is challenging in practice. This paper presents a system for jointly predicting speaker-listener head orientations by leveraging inherent human voice directivity and listener's head-related transfer function (HRTF) as perceived by the ear-mounted microphones on the listener. We propose a convolution neural network model that, given binaural speech recording, can predict the orientation of both speaker and listener with respect to the line joining the two. The system builds on the core observation that the recordings from the left and right ears are differentially affected by the voice directivity as well as the HRTF. We also incorporate the fact that voice is more directional at higher frequencies compared to lower frequencies.
A multi-modal approach for identifying schizophrenia using cross-modal attention
results: The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in weighted average F1 score.
Abstract
This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.
Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference
results: Experimental results show that the method is 12 to 13 times faster in inference on the LibriSpeech corpus than autoregressive decoding while preserving high accuracy.
Abstract
Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam search for improving the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. It first generates an initial hypothesis using greedy CTC decoding, identifying low-confidence tokens based on their output probabilities. We then utilize the decoder to perform segment-level vectorized beam search on these tokens, re-predicting in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13 times faster in inference on the LibriSpeech corpus over AR decoding whilst preserving high accuracy.
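The first stage of the framework can be sketched as greedy CTC decoding followed by low-confidence token selection; the tokens flagged here are what the decoder would then re-predict in parallel. The posteriors and confidence threshold below are synthetic stand-ins.

```python
# Minimal sketch: greedy CTC decoding and flagging low-confidence tokens.
import numpy as np

rng = np.random.default_rng(7)
T, V, BLANK = 20, 6, 0                     # frames, vocab size, blank id
logits = rng.standard_normal((T, V))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

best, conf = probs.argmax(axis=1), probs.max(axis=1)

# Collapse repeats, drop blanks; keep each emitted token's confidence.
hyp, hyp_conf, prev = [], [], None
for tok, c in zip(best, conf):
    if tok != prev and tok != BLANK:
        hyp.append(int(tok)); hyp_conf.append(float(c))
    prev = tok

low_conf = [i for i, c in enumerate(hyp_conf) if c < 0.5]   # threshold assumed
print("hypothesis:", hyp)
print("token positions to re-predict with the decoder:", low_conf)
```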
Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification
paper_authors: Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng
for: Improving automatic speaker verification performance.
methods: Knowledge distillation is used to enforce consistency between the teacher and student networks, with the classification probabilities of non-target speakers disentangled and emphasized.
results: Applied to three different student model architectures, the modified label-level knowledge distillation achieves an average 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level knowledge distillation.
Abstract
Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.
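One way to isolate non-target knowledge, sketched below, is to drop the target logit and take the KL divergence over the renormalized non-target distribution, in the spirit of decoupled knowledge distillation. The temperature, speaker count, and absence of any target-class term are assumptions, not the paper's exact recipe.

```python
# Minimal sketch: label-level KD restricted to non-target speakers.
import torch
import torch.nn.functional as F

def non_target_kd_loss(student_logits, teacher_logits, target, T=2.0):
    n_cls = student_logits.size(-1)
    mask = F.one_hot(target, n_cls).bool()
    # Drop the target column, renormalize over non-target classes only.
    s = student_logits[~mask].view(-1, n_cls - 1)
    t = teacher_logits[~mask].view(-1, n_cls - 1)
    log_p_s = F.log_softmax(s / T, dim=-1)
    p_t = F.softmax(t / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

student = torch.randn(8, 100, requires_grad=True)   # 100 speakers (assumed)
teacher = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = non_target_kd_loss(student, teacher, labels)
loss.backward()
print("non-target KD loss:", loss.item())
```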
Optimization Techniques for a Physical Model of Human Vocalisation
results: A comparison of optimization techniques and audio representations shows that genetic and swarm optimizers outperform least-squares algorithms at the cost of executing more slowly, and that specific combinations of optimizers and audio representations offer significantly different results.
Abstract
We present a non-supervised approach to optimize and evaluate the synthesis of non-speech audio effects from a speech production model. We use the Pink Trombone synthesizer as a case study of a simplified production model of the vocal tract to target non-speech human audio signals --yawnings. We selected and optimized the control parameters of the synthesizer to minimize the difference between real and generated audio. We validated the most common optimization techniques reported in the literature and a specifically designed neural network. We evaluated several popular quality metrics as error functions. These include both objective quality metrics and subjective-equivalent metrics. We compared the results in terms of total error and computational demand. Results show that genetic and swarm optimizers outperform least squares algorithms at the cost of executing slower and that specific combinations of optimizers and audio representations offer significantly different results. The proposed methodology could be used in benchmarking other physical models and audio types.
Exploring RWKV for Memory Efficient and Low Latency Streaming ASR
methods: The paper applies RWKV, a linear attention transformer variant that combines the superior performance of transformers with the inference efficiency of RNNs, to streaming ASR scenarios where the budget for latency and memory is restricted.
results: Experiments on varying scales (100h-10000h) show that RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve accuracy comparable to or better than a chunk conformer transducer, with minimal latency and inference memory cost.
Abstract
Recently, self-attention-based transformers and conformers have been introduced as alternatives to RNNs for ASR acoustic modeling. Nevertheless, the full-sequence attention mechanism is non-streamable and computationally expensive, thus requiring modifications, such as chunking and caching, for efficient streaming ASR. In this paper, we propose to apply RWKV, a variant of linear attention transformer, to streaming ASR. RWKV combines the superior performance of transformers and the inference efficiency of RNNs, which is well-suited for streaming ASR scenarios where the budget for latency and memory is restricted. Experiments on varying scales (100h - 10000h) demonstrate that RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve comparable to or even better accuracy compared with chunk conformer transducer, with minimal latency and inference memory cost.
Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
results: Session information can be effectively compensated without retraining the embedding extractor.
Abstract
In the field of speaker verification, session or channel variability poses a significant challenge. While many contemporary methods aim to disentangle session information from speaker embeddings, we introduce a novel approach using an additional embedding to represent the session information. This is achieved by training an auxiliary network appended to the speaker embedding extractor which remains fixed in this training process. This results in two similarity scores: one for the speakers information and one for the session information. The latter score acts as a compensator for the former that might be skewed due to session variations. Our extensive experiments demonstrate that session information can be effectively compensated without retraining of the embedding extractor.
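The two-score idea can be sketched as a simple linear fusion: the speaker similarity is compensated by a session similarity computed from the auxiliary embedding. The embedding sizes and fusion weight below are assumptions; the weight would normally be tuned on development data.

```python
# Minimal sketch: compensating a speaker score with a session score.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(8)
spk_enroll, spk_test = rng.standard_normal(192), rng.standard_normal(192)
ses_enroll, ses_test = rng.standard_normal(64), rng.standard_normal(64)

speaker_score = cosine(spk_enroll, spk_test)
session_score = cosine(ses_enroll, ses_test)

alpha = 0.2                                # compensation strength (assumed)
final_score = speaker_score - alpha * session_score
print(f"speaker {speaker_score:+.3f}, session {session_score:+.3f}, "
      f"final {final_score:+.3f}")
```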
paper_authors: Yuchen Liu, Apu Kapadia, Donald Williamson
for: This paper aims to examine existing approaches for privacy-preserving and privacy-attacking strategies for audio and speech, and to provide a comprehensive analysis of their limitations.
methods: The paper classifies attack and defense scenarios into several categories, provides a detailed analysis of each approach, and highlights their contributions and limitations.
results: The investigation reveals that voice-controlled devices based on neural networks are inherently susceptible to specific types of attacks, and that more sophisticated approaches are required to comprehensively safeguard user privacy.
Abstract
In contemporary society, voice-controlled devices, such as smartphones and home assistants, have become pervasive due to their advanced capabilities and functionality. The always-on nature of their microphones offers users the convenience of readily accessing these devices. However, recent research and events have revealed that such voice-controlled devices are prone to various forms of malicious attacks, hence making it a growing concern for both users and researchers to safeguard against such attacks. Despite the numerous studies that have investigated adversarial attacks and privacy preservation for images, a conclusive study of this nature has not been conducted for the audio domain. Therefore, this paper aims to examine existing approaches for privacy-preserving and privacy-attacking strategies for audio and speech. To achieve this goal, we classify the attack and defense scenarios into several categories and provide detailed analysis of each approach. We also interpret the dissimilarities between the various approaches, highlight their contributions, and examine their limitations. Our investigation reveals that voice-controlled devices based on neural networks are inherently susceptible to specific types of attacks. Although it is possible to enhance the robustness of such models to certain forms of attack, more sophisticated approaches are required to comprehensively safeguard user privacy.
results: Experiments show that the proposed high-resolution HighDAN outperforms existing competitors on the C2Seg dataset in both segmentation performance and generalization ability.
Abstract
Artificial intelligence (AI) approaches nowadays have gained remarkable success in single-modality-dominated remote sensing (RS) applications, especially with an emphasis on individual urban environments (e.g., single cities or regions). Yet these AI models tend to meet the performance bottleneck in the case studies across cities or regions, due to the lack of diverse RS information and cutting-edge solutions with high generalization ability. To this end, we build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task (called C2Seg dataset), which consists of two cross-city scenes, i.e., Berlin-Augsburg (in Germany) and Beijing-Wuhan (in China). Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN for short, to promote the AI model's generalization ability from the multi-city environments. HighDAN is capable of retaining the spatially topological structure of the studied urban scene well in a parallel high-to-low resolution fusion fashion but also closing the gap derived from enormous differences of RS image representations between different cities by means of adversarial learning. In addition, the Dice loss is considered in HighDAN to alleviate the class imbalance issue caused by factors across cities. Extensive experiments conducted on the C2Seg dataset show the superiority of our HighDAN in terms of segmentation performance and generalization ability, compared to state-of-the-art competitors. The C2Seg dataset and the semantic segmentation toolbox (involving the proposed HighDAN) will be available publicly at https://github.com/danfenghong.
Conversion of single-energy computed tomography to parametric maps of dual-energy computed tomography using convolutional neural network
results: The model converts 120 kVp SECT images into high-quality VMI, EAN, and RED maps. The converted VMIs show an absolute difference (AD) of 9.02 Hounsfield Units and a relative difference (RD) of 0.41% against the ground-truth VMIs; the ADs of the converted EAN and RED are 0.29 and 0.96, with RDs of 1.99% and 0.50%, respectively.
Abstract
Objectives: We propose a deep learning (DL) multi-task learning framework using convolutional neural network (CNN) for a direct conversion of single-energy CT (SECT) to three different parametric maps of dual-energy CT (DECT): Virtual-monochromatic image (VMI), effective atomic number (EAN), and relative electron density (RED). Methods: We propose VMI-Net for conversion of SECT to 70, 120, and 200 keV VMIs. In addition, EAN-Net and RED-Net were also developed to convert SECT to EAN and RED. We trained and validated our model using 67 patients collected between 2019 and 2020. SECT images with 120 kVp acquired by the DECT (IQon spectral CT, Philips) were used as input, while the VMIs, EAN, and RED acquired by the same device were used as target. The performance of the DL framework was evaluated by absolute difference (AD) and relative difference (RD). Results: The VMI-Net converted 120 kVp SECT to the VMIs with AD of 9.02 Hounsfield Unit, and RD of 0.41% compared to the ground truth VMIs. The ADs of the converted EAN and RED were 0.29 and 0.96, respectively, while the RDs were 1.99% and 0.50% for the converted EAN and RED, respectively. Conclusions: SECT images were directly converted to the three parametric maps of DECT (i.e., VMIs, EAN, and RED). By using this model, one can generate the parametric information from SECT images without DECT device. Our model can help investigate the parametric information from SECT retrospectively. Advances in knowledge: Deep learning framework enables converting SECT to various high-quality parametric maps of DECT.
M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding
results: Experiments show that M$^{3}$3D outperforms existing state-of-the-art methods on ScanNet, NYUv2, UCF-101, and OR-AR, notably with a +1.3% mIoU improvement over Mask3D on ScanNet semantic segmentation, and demonstrates superior data efficiency in the low-data regime.
Abstract
We present a new pre-training strategy called M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$) built based on Multi-modal masked autoencoders that can leverage 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks; Masked Image Modeling (MIM) and contrastive learning; aiming to effectively embed masked 3D priors and modality complementary features to enhance the correspondence between modalities. In contrast to recent approaches which are either focusing on specific downstream tasks or require multi-view correspondence, we show that our pre-training strategy is ubiquitous, enabling improved representation learning that can transfer into improved performance on various downstream tasks such as video action recognition, video action detection, 2D semantic segmentation and depth estimation. Experiments show that M$^{3}$3D outperforms the existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR, particularly with an improvement of +1.3\% mIoU against Mask3D on ScanNet semantic segmentation. We further evaluate our method on low-data regime and demonstrate its superior data efficiency compared to current state-of-the-art approaches.
SEPT: Towards Efficient Scene Representation Learning for Motion Prediction
results: Extensive experiments show that SEPT achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
Abstract
Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments. Extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the successful practice of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful spatiotemporal understanding for complex traffic scenes. Specifically, our approach involves three masking-reconstruction modeling tasks on scene inputs including agents' trajectories and road network, pretraining the scene encoder to capture kinematics within trajectory, spatial structure of road network, and interactions among roads and agents. The pretrained encoder is then finetuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
Boosting High Resolution Image Classification with Scaling-up Transformers
results: The approach won second place in the ICCV/CVPPA2023 Deep Nutrient Deficiency Challenge and performs strongly on high-resolution image classification.
Abstract
We present a holistic approach for high resolution image classification that won second place in the ICCV/CVPPA2023 Deep Nutrient Deficiency Challenge. The approach consists of a full pipeline of: 1) data distribution analysis to check potential domain shift, 2) backbone selection for a strong baseline model that scales up for high resolution input, 3) transfer learning that utilizes published pretrained models and continuous fine-tuning on small sub-datasets, 4) data augmentation for the diversity of training data and to prevent overfitting, 5) test-time augmentation to improve the prediction's robustness, and 6) "data soups" that conducts cross-fold model prediction average for smoothened final test results.
A Topological Machine Learning Pipeline for Classification
paper_authors: Francesco Conti, Davide Moroni, Maria Antonietta Pascali
for: This work develops a pipeline that associates Persistence Diagrams to digital data.
methods: The pipeline uses a grid search to determine the optimal representation methods and parameters.
results: The pipeline transforms Persistence Diagrams into representations suitable for Machine Learning; its performance is assessed and the different representation methods are compared.
Abstract
In this work, we develop a pipeline that associates Persistence Diagrams to digital data via the most appropriate filtration for the type of data considered. Using a grid search approach, this pipeline determines optimal representation methods and parameters. The development of such a topological pipeline for Machine Learning involves two crucial steps that strongly affect its performance: firstly, digital data must be represented as an algebraic object with a proper associated filtration in order to compute its topological summary, the Persistence Diagram. Secondly, the persistence diagram must be transformed with suitable representation methods in order to be introduced in a Machine Learning algorithm. We assess the performance of our pipeline, and in parallel, we compare the different representation methods on popular benchmark datasets. This work is a first step toward both an easy and ready-to-use pipeline for data classification using persistent homology and Machine Learning, and to understand the theoretical reasons why, given a dataset and a task to be performed, a pair (filtration, topological representation) is better than another.
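As a sketch of the two crucial steps above, the snippet below assumes Persistence Diagrams have already been computed (for example with a TDA library such as ripser or giotto-tda) as arrays of (birth, death) pairs; the statistics-based vectorization and the SVC hyper-parameter grid are illustrative stand-ins for the pipeline's representation methods and grid search.

```python
# Sketch: vectorize persistence diagrams, then grid-search a classifier.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def vectorize(diagram):
    """Map a persistence diagram (birth, death) pairs to a fixed-length vector."""
    pers = diagram[:, 1] - diagram[:, 0]                 # lifetimes
    return np.array([pers.sum(), pers.mean(), pers.std(),
                     pers.max(), len(pers)])

# toy data: one diagram per sample (sorted so birth <= death)
rng = np.random.default_rng(0)
diagrams = [np.sort(rng.random((20, 2)), axis=1) for _ in range(100)]
X = np.stack([vectorize(d) for d in diagrams])
y = rng.integers(0, 2, size=100)

# grid search, mirroring the pipeline's search over representations/parameters
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```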
DECO: Dense Estimation of 3D Human-Scene Contact In The Wild
results: We extensively evaluate DECO on the DAMON dataset as well as on the RICH and BEHAVE datasets. Compared with existing SOTA methods, our approach shows significant performance gains across all benchmarks. We also demonstrate that DECO generalizes well to real-world images of human interactions.
Abstract
Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de.
results: Experiments show that ObVi-SLAM produces accurate localization estimates under varying weather and lighting conditions and maintains consistent localization over long time scales.
Abstract
Robots responsible for tasks over long time scales must be able to localize consistently and scalably amid geometric, viewpoint, and appearance changes. Existing visual SLAM approaches rely on low-level feature descriptors that are not robust to such environmental changes and result in large map sizes that scale poorly over long-term deployments. In contrast, object detections are robust to environmental variations and lead to more compact representations, but most object-based SLAM systems target short-term indoor deployments with close objects. In this paper, we introduce ObVi-SLAM to overcome these challenges by leveraging the best of both approaches. ObVi-SLAM uses low-level visual features for high-quality short-term visual odometry; and to ensure global, long-term consistency, ObVi-SLAM builds an uncertainty-aware long-term map of persistent objects and updates it after every deployment. By evaluating ObVi-SLAM on data from 16 deployment sessions spanning different weather and lighting conditions, we empirically show that ObVi-SLAM generates accurate localization estimates consistent over long-time scales in spite of varying appearance conditions.
SLIQ: Quantum Image Similarity Networks on Noisy Quantum Computers
results: SLIQ performs unsupervised similarity detection efficiently under limited resources and outperforms conventional quantum algorithms.
Abstract
Exploration into quantum machine learning has grown tremendously in recent years due to the ability of quantum computers to speed up classical programs. However, these efforts have yet to solve unsupervised similarity detection tasks due to the challenge of porting them to run on quantum computers. To overcome this challenge, we propose SLIQ, the first open-sourced work for resource-efficient quantum similarity detection networks, built with practical and effective quantum learning and variance-reducing algorithms.
APIS: A paired CT-MRI dataset for ischemic stroke segmentation challenge
results: Participating researchers applied specialized deep learning tools to segment brain lesions in stroke patients. The results show, however, that segmenting stroke lesions from NCCT sequences remains challenging. Notwithstanding, the annotated dataset remains accessible to the public upon registration, inviting the scientific community to deal with stroke characterization from NCCT but guided with paired DWI information.
Abstract
Stroke is the second leading cause of mortality worldwide. Immediate attention and diagnosis play a crucial role regarding patient prognosis. The key to diagnosis consists in localizing and delineating brain lesions. Standard stroke examination protocols include the initial evaluation from a non-contrast CT scan to discriminate between hemorrhage and ischemia. However, non-contrast CTs may lack sensitivity in detecting subtle ischemic changes in the acute phase. As a result, complementary diffusion-weighted MRI studies are captured to provide valuable insights, allowing to recover and quantify stroke lesions. This work introduced APIS, the first paired public dataset with NCCT and ADC studies of acute ischemic stroke patients. APIS was presented as a challenge at the 20th IEEE International Symposium on Biomedical Imaging 2023, where researchers were invited to propose new computational strategies that leverage paired data and deal with lesion segmentation over CT sequences. Despite all the teams employing specialized deep learning tools, the results suggest that the ischemic stroke segmentation task from NCCT remains challenging. The annotated dataset remains accessible to the public upon registration, inviting the scientific community to deal with stroke characterization from NCCT but guided with paired DWI information.
CLRmatchNet: Enhancing Curved Lane Detection with Deep Matching Process
paper_authors: Sapir Kontente, Roy Orfaig, Ben-Zion Bobrovsky
for: Improving lane detection accuracy to provide vital data for safe navigation.
methods: A deep learning sub-module network (MatchNet) replaces the conventional label assignment process to improve lane detection accuracy.
results: Detection accuracy improves significantly on curved lanes across all backbones, while comparable accuracy is maintained or improved in other sections.
Abstract
Lane detection plays a crucial role in autonomous driving by providing vital data to ensure safe navigation. Modern algorithms rely on anchor-based detectors, which are then followed by a label assignment process to categorize training detections as positive or negative instances based on learned geometric attributes. The current methods, however, have limitations and might not be optimal since they rely on predefined classical cost functions that are based on a low-dimensional model. Our research introduces MatchNet, a deep learning sub-module-based approach aimed at enhancing the label assignment process. Integrated into a state-of-the-art lane detection network like the Cross Layer Refinement Network for Lane Detection (CLRNet), MatchNet replaces the conventional label assignment process with a sub-module network. This integration results in significant improvements in scenarios involving curved lanes, with remarkable improvement across all backbones of +2.8% for ResNet34, +2.3% for ResNet101, and +2.96% for DLA34. In addition, it maintains or even improves comparable results in other sections. Our method boosts the confidence level in lane detection, allowing an increase in the confidence threshold. The code will be available soon: https://github.com/sapirkontente/CLRmatchNet.git
GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes
paper_authors: Chaoqiang Zhao, Matteo Poggi, Fabio Tosi, Lei Zhou, Qiyu Sun, Yang Tang, Stefano Mattoccia
for: This paper aims to improve self-supervised monocular depth estimation in indoor scenes, addressing challenges such as large rotation and low texture.
methods: The proposed method uses multi-view geometry to obtain coarse camera poses, and refines them through rotation and translation/scale optimization. It also combines global reasoning with an overfitting-aware, iterative self-distillation mechanism to improve depth estimation.
results: The proposed method achieves state-of-the-art performance on four datasets (NYUv2, ScanNet, 7scenes, and KITTI), with outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono.
Abstract
This paper tackles the challenges of self-supervised monocular depth estimation in indoor scenes caused by large rotation between frames and low texture. We ease the learning process by obtaining coarse camera poses from monocular sequences through multi-view geometry to deal with the former. However, we found that limited by the scale ambiguity across different scenes in the training dataset, a na\"ive introduction of geometric coarse poses cannot play a positive role in performance improvement, which is counter-intuitive. To address this problem, we propose to refine those poses during training through rotation and translation/scale optimization. To soften the effect of the low texture, we combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism, providing more accurate depth guidance coming from the network itself. Experiments on NYUv2, ScanNet, 7scenes, and KITTI datasets support the effectiveness of each component in our framework, which sets a new state-of-the-art for indoor self-supervised monocular depth estimation, as well as outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono
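The pose-refinement idea can be illustrated with a small gradient-based sketch: starting from a coarse pose (here an identity rotation and a unit-scale translation direction), a rotation correction and a translation scale are optimized. This toy version aligns known 3D point correspondences, whereas the actual method refines poses inside the self-supervised photometric training loop.

```python
# Toy rotation + translation/scale refinement of a coarse pose (illustrative).
import torch

def skew(v):
    # 3-vector -> 3x3 skew-symmetric matrix (matrix_exp then yields a rotation)
    x, y, z = v
    zero = v.new_zeros(())
    return torch.stack([torch.stack([zero, -z, y]),
                        torch.stack([z, zero, -x]),
                        torch.stack([-y, x, zero])])

P = torch.randn(100, 3)                        # points in the source frame
R_true = torch.linalg.matrix_exp(skew(torch.tensor([0.1, -0.2, 0.05])))
t_true = torch.tensor([0.3, 0.0, -0.1])
Q = P @ R_true.T + t_true                      # points in the target frame

# coarse pose from multi-view geometry: identity rotation, unit-norm direction
t_coarse = t_true / t_true.norm()
omega = torch.zeros(3, requires_grad=True)     # rotation correction (axis-angle)
log_s = torch.zeros((), requires_grad=True)    # translation scale correction

opt = torch.optim.Adam([omega, log_s], lr=0.05)
for _ in range(300):
    R = torch.linalg.matrix_exp(skew(omega))
    loss = ((P @ R.T + log_s.exp() * t_coarse - Q) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())                             # should approach zero
```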
paper_authors: Fengyu Yang, Jiacheng Zhang, Andrew Owens
for: This paper aims to generate plausible images from touch.
methods: The paper draws on recent advances in latent diffusion to build a model that synthesizes images from tactile signals (and vice versa) and applies it to several visuo-tactile synthesis tasks.
results: The model excels at tactile-driven stylization, i.e., manipulating an image to match a touch signal, and is the first to successfully generate images from touch without additional sources of information about the scene. Additionally, the model is used to solve two novel synthesis problems: generating images that do not contain the touch sensor or the hand holding it, and estimating an image's shading from its reflectance and touch.
Abstract
An emerging line of work has sought to generate plausible imagery from touch. Existing approaches, however, tackle only narrow aspects of the visuo-tactile synthesis problem, and lag significantly behind the quality of cross-modal synthesis methods in other domains. We draw on recent advances in latent diffusion to create a model for synthesizing images from tactile signals (and vice versa) and apply it to a number of visuo-tactile synthesis tasks. Using this model, we significantly outperform prior work on the tactile-driven stylization problem, i.e., manipulating an image to match a touch signal, and we are the first to successfully generate images from touch without additional sources of information about the scene. We also successfully use our model to address two novel synthesis problems: generating images that do not contain the touch sensor or the hand holding it, and estimating an image's shading from its reflectance and touch.
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
paper_authors: Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
for: This paper proposes a vision-language large model called InternLM-XComposer, which enables advanced image-text comprehension and composition.
methods: The model provides interleaved text-image composition and comprehension empowered by rich multilingual knowledge, trained on extensive multi-modal multilingual concepts.
results: The model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models.
Abstract
We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a title, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, and CCBench (Chinese Cultural Benchmark). Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation
results: Extensive evaluations on multiple representative models show that our approach achieves state-of-the-art performance on nuScenes.
Abstract
3D perception based on the representations learned from multi-camera bird's-eye-view (BEV) is trending as cameras are cost-effective for mass production in autonomous driving industry. However, there exists a distinct performance gap between multi-camera BEV and LiDAR based 3D object detection. One key reason is that LiDAR captures accurate depth and other geometry measurements, while it is notoriously challenging to infer such 3D information from merely image input. In this work, we propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector. We propose effective balancing strategy to enforce the student to focus on learning the crucial features from the teacher, and generalize knowledge transfer to multi-scale layers with temporal fusion. We conduct extensive evaluations on multiple representative models of multi-camera BEV. Experiments reveal that our approach renders significant improvement over the student models, leading to the state-of-the-art performance on the popular benchmark nuScenes.
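A minimal sketch of the core distillation signal: the camera student's BEV features imitate a frozen LiDAR teacher's, with extra weight on foreground cells so background does not wash out the objects. The 1x1 channel adapter and the weighting scheme are assumptions, not the paper's exact balancing strategy.

```python
# Foreground-weighted cross-modal feature imitation (illustrative assumptions).
import torch
import torch.nn as nn

def distill_loss(student_feat, teacher_feat, fg_mask, adapter, fg_weight=5.0):
    # student_feat: (B, Cs, H, W), teacher_feat: (B, Ct, H, W) BEV features
    # fg_mask: (B, 1, H, W) in {0, 1}, 1 where a ground-truth object lies
    aligned = adapter(student_feat)                 # match teacher channels
    per_cell = (aligned - teacher_feat).pow(2).mean(dim=1, keepdim=True)
    weight = 1.0 + (fg_weight - 1.0) * fg_mask      # emphasize foreground cells
    return (weight * per_cell).mean()

student = torch.randn(2, 64, 128, 128, requires_grad=True)
teacher = torch.randn(2, 256, 128, 128)             # frozen LiDAR teacher
mask = (torch.rand(2, 1, 128, 128) < 0.1).float()
adapter = nn.Conv2d(64, 256, kernel_size=1)
loss = distill_loss(student, teacher, mask, adapter)
loss.backward()
```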
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
results: Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. In addition, pre-trained LaVie models prove versatile across long video generation and personalized video synthesis applications.
Abstract
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
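The first insight, temporal self-attention with rotary positional encoding, can be sketched as follows, with attention running along the frame axis for each spatial location; head count and dimensions are illustrative.

```python
# Temporal self-attention with rotary positional encoding (illustrative sizes).
import torch
import torch.nn as nn

def rope(x):
    # x: (..., T, D) with D even; rotate feature pairs by position-dependent angles
    T, D = x.shape[-2], x.shape[-1]
    pos = torch.arange(T, dtype=x.dtype)[:, None]
    freq = 10000 ** (-torch.arange(0, D, 2, dtype=x.dtype) / D)
    ang = pos * freq                                   # (T, D/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rot1 = x1 * ang.cos() - x2 * ang.sin()
    rot2 = x1 * ang.sin() + x2 * ang.cos()
    return torch.stack([rot1, rot2], dim=-1).flatten(-2)

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B*HW, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.heads, self.dh).transpose(1, 2)
                   for t in (q, k, v))
        q, k = rope(q), rope(k)                        # rotary encoding on time
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5
        out = attn.softmax(-1) @ v
        return self.proj(out.transpose(1, 2).reshape(B, T, -1))

video_tokens = torch.randn(16, 8, 64)                  # 8 frames per location
out = TemporalSelfAttention()(video_tokens)
```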
Case Study: Ensemble Decision-Based Annotation of Unconstrained Real Estate Images
results: The study yields important insights into the content characteristics and uniqueness of the individual image classes, as well as the essential requirements for a practical implementation.
Abstract
We describe a proof-of-concept for annotating real estate images using simple iterative rule-based semi-supervised learning. In this study, we have gained important insights into the content characteristics and uniqueness of individual image classes as well as essential requirements for a practical implementation.
Video-adverb retrieval with compositional adverb-action embeddings
results: The method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. In addition, new dataset splits are introduced to benchmark the generalization of video-adverb retrieval to unseen adverb-action compositions.
Abstract
Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works for the generalisation task of retrieving adverbs from videos for unseen adverb-action compositions. Code and dataset splits are available at https://hummelth.github.io/ReGaDa/.
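A minimal sketch of the two named ingredients, a residual gating mechanism composing adverb-action text embeddings and a triplet loss against video embeddings; the exact gating form, dimensions, and negative sampling are assumptions, and the paper's full objective also includes a regression target.

```python
# Residual-gated adverb-action composition trained with a triplet loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdverbActionComposer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, action_emb, adverb_emb):
        pair = torch.cat([action_emb, adverb_emb], dim=-1)
        g = self.gate(pair)
        # residual gating: start from the action, add a gated modification
        return F.normalize(action_emb + g * self.update(pair), dim=-1)

composer = AdverbActionComposer()
video = F.normalize(torch.randn(32, 256), dim=-1)       # video embeddings
action = torch.randn(32, 256)                           # from a text encoder
adverb = torch.randn(32, 256)
text = composer(action, adverb)

# triplet loss: matching video-text pairs pulled together, shuffled pairs pushed
negative = text[torch.randperm(32)]
loss = F.triplet_margin_loss(video, text, negative, margin=0.2)
loss.backward()
```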
results: The study finds that the large majority of annotated computer vision papers and patents self-report that their technology enables extracting data about humans, most often about human bodies and body parts. It also finds that the institutions, nations, and subfields most prolific in computer vision research are subsequently cited in thousands of surveillance patents. Overall, the number of computer vision papers with downstream surveillance patents grew more than five-fold between the 1990s and the 2010s, with computer vision research now used in more than 11,000 surveillance patents; the study further documents pervasive use of language that obfuscates the extent of surveillance.
Abstract
A rapidly growing number of voices argue that AI research, and computer vision in particular, is powering mass surveillance. Yet the direct path from computer vision research to surveillance has remained obscured and difficult to assess. Here, we reveal the Surveillance AI pipeline by analyzing three decades of computer vision research papers and downstream patents, more than 40,000 documents. We find the large majority of annotated computer vision papers and patents self-report their technology enables extracting data about humans. Moreover, the majority of these technologies specifically enable extracting data about human bodies and body parts. We present both quantitative and rich qualitative analysis illuminating these practices of human data extraction. Studying the roots of this pipeline, we find that institutions that prolifically produce computer vision research, namely elite universities and "big tech" corporations, are subsequently cited in thousands of surveillance patents. Further, we find consistent evidence against the narrative that only these few rogue entities are contributing to surveillance. Rather, we expose the fieldwide norm that when an institution, nation, or subfield authors computer vision papers with downstream patents, the majority of these papers are used in surveillance patents. In total, we find the number of papers with downstream surveillance patents increased more than five-fold between the 1990s and the 2010s, with computer vision research now having been used in more than 11,000 surveillance patents. Finally, in addition to the high levels of surveillance we find documented in computer vision papers and patents, we unearth pervasive patterns of documents using language that obfuscates the extent of surveillance. Our analysis reveals the pipeline by which computer vision research has powered the ongoing expansion of surveillance.
RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation
results: In experiments, our model outperforms the existing state of the art by a wide margin, and we contribute a new synthetic dataset to encourage further research into multimodal perception.
Abstract
Recently, the RGB images and point clouds fusion methods have been proposed to jointly estimate 2D optical flow and 3D scene flow. However, as both conventional RGB cameras and LiDAR sensors adopt a frame-based data acquisition mechanism, their performance is limited by the fixed low sampling rates, especially in highly-dynamic scenes. By contrast, the event camera can asynchronously capture the intensity changes with a very high temporal resolution, providing complementary dynamic information of the observed scenes. In this paper, we incorporate RGB images, Point clouds and Events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow. First, we present an attention fusion module with a cross-attention mechanism to implicitly explore the internal cross-modal correlation for 2D and 3D branches, respectively. Second, we introduce a mutual information regularization term to explicitly model the complementary information of three modalities for effective multimodal feature learning. We also contribute a new synthetic dataset to advocate further research. Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin. Code and dataset is available at https://npucvr.github.io/RPEFlow.
Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding
paper_authors: Christina Kassab, Matias Mattamala, Lintong Zhang, Maurice Fallon
for: This paper introduces a real-time indoor Simultaneous Localization and Mapping (SLAM) system that can understand and interact with its surroundings.
methods: The system uses Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition, combining visual-inertial odometry with Contrastive Language-Image Pretraining (CLIP) features.
results: The system successfully categorizes rooms with varying layouts and dimensions, outperforms the state of the art (SOTA) on room classification, achieves performance equivalent to the SOTA on place recognition and trajectory estimation, and demonstrates potential for planning.
Abstract
Versatile and adaptive semantic understanding would enable autonomous systems to comprehend and interact with their surroundings. Existing fixed-class models limit the adaptability of indoor mobile and assistive autonomous systems. In this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and Mapping (SLAM) system that harnesses the open-vocabulary nature of Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition. The approach first builds a topological SLAM graph of the environment (using visual-inertial odometry) and embeds Contrastive Language-Image Pretraining (CLIP) features in the graph nodes. We use this representation for flexible room classification and segmentation, serving as a basis for room-centric place recognition. This allows loop closure searches to be directed towards semantically relevant places. Our proposed system is evaluated using both public, simulated data and real-world data, covering office and home environments. It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA). For place recognition and trajectory estimation tasks we achieve equivalent performance to the SOTA, all also utilizing the same pre-trained model. Lastly, we demonstrate the system's potential for planning.
HPCR: Holistic Proxy-based Contrastive Replay for Online Continual Learning
paper_authors: Huiwei Lin, Shanshan Feng, Baoquan Zhang, Xutao Li, Yew-soon Ong, Yunming Ye
for: This paper aims to address the catastrophic forgetting issue in online continual learning (OCL) by proposing a novel replay-based method called proxy-based contrastive replay (PCR) and a more advanced method named holistic proxy-based contrastive replay (HPCR).
methods: The proposed methods use a contrastive-based loss with anchor-to-proxy pairs instead of anchor-to-sample pairs to alleviate the forgetting issue. HPCR consists of three components: a contrastive component, a temperature component, and a distillation component.
results: The proposed methods are evaluated on four datasets and consistently demonstrate superior performance over various state-of-the-art methods.
Abstract
Online continual learning (OCL) aims to continuously learn new data from a single pass over the online data stream. It generally suffers from the catastrophic forgetting issue. Existing replay-based methods effectively alleviate this issue by replaying part of old data in a proxy-based or contrastive-based replay manner. In this paper, we conduct a comprehensive analysis of these two replay manners and find they can be complementary. Inspired by this finding, we propose a novel replay-based method called proxy-based contrastive replay (PCR), which replaces anchor-to-sample pairs with anchor-to-proxy pairs in the contrastive-based loss to alleviate the phenomenon of forgetting. Based on PCR, we further develop a more advanced method named holistic proxy-based contrastive replay (HPCR), which consists of three components. The contrastive component conditionally incorporates anchor-to-sample pairs to PCR, learning more fine-grained semantic information with a large training batch. The second is a temperature component that decouples the temperature coefficient into two parts based on their impacts on the gradient and sets different values for them to learn more novel knowledge. The third is a distillation component that constrains the learning process to keep more historical knowledge. Experiments on four datasets consistently demonstrate the superiority of HPCR over various state-of-the-art methods.
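The anchor-to-proxy idea at the heart of PCR can be sketched as a contrastive loss that compares each feature against learnable class proxies instead of other samples; the temperature value is illustrative, and HPCR's extra temperature and distillation components are omitted here.

```python
# Anchor-to-proxy contrastive loss in the spirit of PCR (simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxyContrastiveLoss(nn.Module):
    def __init__(self, num_classes=10, dim=128, temperature=0.09):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_classes, dim))
        self.t = temperature

    def forward(self, features, labels):
        # cosine similarity of each anchor to every class proxy
        sim = F.normalize(features, dim=-1) @ F.normalize(self.proxies, dim=-1).T
        # the true-class proxy is the positive; all other proxies are negatives
        return F.cross_entropy(sim / self.t, labels)

criterion = ProxyContrastiveLoss()
feats = torch.randn(16, 128, requires_grad=True)     # current batch features
labels = torch.randint(0, 10, (16,))                 # mix of new + replayed data
loss = criterion(feats, labels)
loss.backward()
```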
Nuclear Morphometry using a Deep Learning-based Algorithm has Prognostic Relevance for Canine Cutaneous Mast Cell Tumors
paper_authors: Andreas Haghofer, Eda Parlak, Alexander Bartel, Taryn A. Donovan, Charles-Antoine Assenmacher, Pompei Bolfa, Michael J. Dark, Andrea Fuchs-Baumgartinger, Andrea Klang, Kathrin Jäger, Robert Klopfleisch, Sophie Merz, Barbara Richter, F. Yvonne Schulman, Jonathan Ganz, Josef Scharinger, Marc Aubreville, Stephan M. Winkler, Matti Kiupel, Christof A. Bertram
results: The values reported by automated morphometry correspond closely to manual measurements and pathologists' estimates of nuclear size, while carrying higher prognostic value. Automated morphometry reached an AUC of 0.943 (95% CI: 0.889 - 0.996) for tumor-specific survival, higher than manual morphometry and the mitotic count, and it sidesteps the limited inter-rater reproducibility of subjective estimates.
Abstract
Variation in nuclear size and shape is an important criterion of malignancy for many tumor types; however, categorical estimates by pathologists have poor reproducibility. Measurements of nuclear characteristics (morphometry) can improve reproducibility, but manual methods are time consuming. In this study, we evaluated fully automated morphometry using a deep learning-based algorithm in 96 canine cutaneous mast cell tumors with information on patient survival. Algorithmic morphometry was compared with karyomegaly estimates by 11 pathologists, manual nuclear morphometry of 12 cells by 9 pathologists, and the mitotic count as a benchmark. The prognostic value of automated morphometry was high with an area under the ROC curve regarding the tumor-specific survival of 0.943 (95% CI: 0.889 - 0.996) for the standard deviation (SD) of nuclear area, which was higher than manual morphometry of all pathologists combined (0.868, 95% CI: 0.737 - 0.991) and the mitotic count (0.885, 95% CI: 0.765 - 1.00). At the proposed thresholds, the hazard ratio for algorithmic morphometry (SD of nuclear area $\geq 9.0 \mu m^2$) was 18.3 (95% CI: 5.0 - 67.1), for manual morphometry (SD of nuclear area $\geq 10.9 \mu m^2$) 9.0 (95% CI: 6.0 - 13.4), for karyomegaly estimates 7.6 (95% CI: 5.7 - 10.1), and for the mitotic count 30.5 (95% CI: 7.8 - 118.0). Inter-rater reproducibility for karyomegaly estimates was fair ($\kappa$ = 0.226) with highly variable sensitivity/specificity values for the individual pathologists. Reproducibility for manual morphometry (SD of nuclear area) was good (ICC = 0.654). This study supports the use of algorithmic morphometry as a prognostic test to overcome the limitations of estimates and manual measurements.
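Given per-nucleus areas from a segmentation model, the prognostic readout is straightforward: compute the standard deviation of nuclear area per tumor, evaluate it against survival labels with ROC analysis, and threshold at the proposed cutoff. The data below are synthetic placeholders.

```python
# SD-of-nuclear-area readout with ROC evaluation (synthetic placeholder data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=96)               # 1 = tumor-specific death
# hypothetical per-tumor nuclear areas (um^2); deceased cases get more
# variable nuclear sizes so the statistic carries signal in this toy example
tumors = [rng.normal(30, 5 + 8 * d, size=200) for d in labels]

sd_area = np.array([areas.std() for areas in tumors])
print("AUC:", roc_auc_score(labels, sd_area))
high_risk = sd_area >= 9.0                         # study's proposed threshold
```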
IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging
results: Extensive experiments on multiple standard benchmarks demonstrate state-of-the-art performance over existing methods.
Abstract
Multi-frame high dynamic range (HDR) imaging aims to reconstruct ghost-free images with photo-realistic details from content-complementary but spatially misaligned low dynamic range (LDR) images. Existing HDR algorithms are prone to producing ghosting artifacts as their methods fail to capture long-range dependencies between LDR frames with large motion in dynamic scenes. To address this issue, we propose a novel image fusion transformer, referred to as IFT, which presents a fast global patch searching (FGPS) module followed by a self-cross fusion module (SCF) for ghost-free HDR imaging. The FGPS searches the patches from supporting frames that have the closest dependency to each patch of the reference frame for long-range dependency modeling, while the SCF conducts intra-frame and inter-frame feature fusion on the patches obtained by the FGPS with linear complexity to input resolution. By matching similar patches between frames, objects with large motion ranges in dynamic scenes can be aligned, which can effectively alleviate the generation of artifacts. In addition, the proposed FGPS and SCF can be integrated into various deep HDR methods as efficient plug-in modules. Extensive experiments on multiple benchmarks show that our method achieves state-of-the-art performance both quantitatively and qualitatively.
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
results: Compared with global feature approaches, the method achieves significantly better results on three datasets, increasing accuracy by up to 15 mAP points, and demonstrates advantages in scalability and interpretability.
Abstract
The task of open-vocabulary object-centric image retrieval involves the retrieval of images containing a specified object of interest, delineated by an open-set text query. As working on large image datasets becomes standard, solving this task efficiently has gained significant practical importance. Applications include targeted performance analysis of retrieved images using ad-hoc queries and hard example mining during training. Recent advancements in contrastive-based open vocabulary systems have yielded remarkable breakthroughs, facilitating large-scale open vocabulary image retrieval. However, these approaches use a single global embedding per image, thereby constraining the system's ability to retrieve images containing relatively small object instances. Alternatively, incorporating local embeddings from detection pipelines faces scalability challenges, making it unsuitable for retrieval from large databases. In this work, we present a simple yet effective approach to object-centric open-vocabulary image retrieval. Our approach aggregates dense embeddings extracted from CLIP into a compact representation, essentially combining the scalability of image retrieval pipelines with the object identification capabilities of dense detection methods. We show the effectiveness of our scheme to the task by achieving significantly better results than global feature approaches on three datasets, increasing accuracy by up to 15 mAP points. We further integrate our scheme into a large scale retrieval framework and demonstrate our method's advantages in terms of scalability and interpretability.
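One way to realize the aggregation of dense embeddings into a compact representation is to cluster per-patch CLIP features and retain the centroids, scoring an image by its best cluster-to-query similarity. The use of KMeans, the cluster count, and the scoring rule here are assumptions, not necessarily the paper's aggregation.

```python
# Cluster-centroid aggregation of dense patch features (assumed scheme).
import numpy as np
from sklearn.cluster import KMeans

def aggregate(patch_embeddings, k=8):
    # patch_embeddings: (num_patches, dim) L2-normalized features
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(patch_embeddings)
    centroids = km.cluster_centers_
    return centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

def score(image_repr, text_emb):
    # object-centric match: best cluster vs. the open-vocabulary text query
    return float((image_repr @ text_emb).max())

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 512)).astype(np.float32)   # e.g. 14x14 ViT grid
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
text = rng.normal(size=512)
text /= np.linalg.norm(text)
print(score(aggregate(patches), text))
```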
An Ensemble Model for Distorted Images in Real Scenarios
results: Our model performs excellently on the CDCOCO test set. Our denoising detection model can denoise and repair distorted images, making it useful in a variety of real-world scenarios and environments.
Abstract
Image acquisition conditions and environments can significantly affect high-level tasks in computer vision, and the performance of most computer vision algorithms will be limited when trained on distortion-free datasets. Even with updates in hardware such as sensors and deep learning methods, it will still not work in the face of variable conditions in real-world applications. In this paper, we apply the object detector YOLOv7 to detect distorted images from the dataset CDCOCO. Through carefully designed optimizations including data enhancement, detection box ensemble, denoiser ensemble, super-resolution models, and transfer learning, our model achieves excellent performance on the CDCOCO test set. Our denoising detection model can denoise and repair distorted images, making the model useful in a variety of real-world scenarios and environments.
IAIFNet: An Illumination-Aware Infrared and Visible Image Fusion Network
results: Experimental results show that, compared with five state-of-the-art methods, our method improves the quality of image fusion in low-light environments.
Abstract
Infrared and visible image fusion (IVIF) is used to generate fusion images with comprehensive features of both images, which is beneficial for downstream vision tasks. However, current methods rarely consider the illumination condition in low-light environments, and the targets in the fused images are often not prominent. To address the above issues, we propose an Illumination-Aware Infrared and Visible Image Fusion Network, named as IAIFNet. In our framework, an illumination enhancement network first estimates the incident illumination maps of input images. Afterwards, with the help of proposed adaptive differential fusion module (ADFM) and salient target aware module (STAM), an image fusion network effectively integrates the salient features of the illumination-enhanced infrared and visible images into a fusion image of high visual quality. Extensive experimental results verify that our method outperforms five state-of-the-art methods of fusing infrared and visible images.
paper_authors: Rui Shao, Tianxing Wu, Ziwei Liu
for: This paper investigates a new threat from facial manipulation, multi-step sequential facial editing (Sequential DeepFake), and proposes a novel research problem: detecting sequential DeepFake manipulation (Seq-DeepFake).
methods: The authors construct the first Seq-DeepFake dataset and cast detecting Seq-DeepFake manipulation as an image-to-sequence task, proposing a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer) and a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++).
results: Experiments show that SeqFakeFormer and SeqFakeFormer++ both perform strongly on the Seq-DeepFake dataset and on the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P), demonstrating their effectiveness in detecting Seq-DeepFake manipulation.
Abstract
Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing methods only focus on detecting one-step facial manipulation. As the emergence of easy-accessible facial editing applications, people can easily manipulate facial components using multi-step operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital for both detecting deepfake media and recovering original faces afterwards. Motivated by this observation, we emphasize the need and propose a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task only demanding a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations on the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlation between images and sequences when facing Seq-DeepFake-P, a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++) is devised, which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection.
MoCaE: Mixture of Calibrated Experts Significantly Improves Object Detection
results: Improves object detection on COCO test-dev by $2.4$ AP, improves instance segmentation on LVIS by $2.3$ AP, and reaches $82.62$ $\mathrm{AP_{50}}$ for rotated object detection on DOTA, establishing a new state of the art (SOTA).
Abstract
We propose an extremely simple and highly effective approach to faithfully combine different object detectors to obtain a Mixture of Experts (MoE) that has a superior accuracy to the individual experts in the mixture. We find that naively combining these experts in a similar way to the well-known Deep Ensembles (DEs), does not result in an effective MoE. We identify the incompatibility between the confidence score distribution of different detectors to be the primary reason for such failure cases. Therefore, to construct the MoE, our proposal is to first calibrate each individual detector against a target calibration function. Then, filter and refine all the predictions from different detectors in the mixture. We term this approach as MoCaE and demonstrate its effectiveness through extensive experiments on object detection, instance segmentation and rotated object detection tasks. Specifically, MoCaE improves (i) three strong object detectors on COCO test-dev by $2.4$ $\mathrm{AP}$ by reaching $59.0$ $\mathrm{AP}$; (ii) instance segmentation methods on the challenging long-tailed LVIS dataset by $2.3$ $\mathrm{AP}$; and (iii) all existing rotated object detectors by reaching $82.62$ $\mathrm{AP_{50}}$ on DOTA dataset, establishing a new state-of-the-art (SOTA). Code will be made public.
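A minimal sketch of the recipe: calibrate each detector's confidence scores on held-out validation detections, then pool the calibrated detections and filter them jointly. Isotonic regression and plain NMS stand in for the paper's specific target calibration function and refinement step.

```python
# Calibrate per-detector scores, then fuse and filter (stand-in components).
import numpy as np
import torch
from sklearn.isotonic import IsotonicRegression
from torchvision.ops import nms

rng = np.random.default_rng(0)

def fit_calibrator(scores, is_tp):
    # map raw confidences to monotone estimates of correctness on a val set
    return IsotonicRegression(out_of_bounds="clip").fit(scores, is_tp)

# two "experts" with differently distributed raw scores
cal_a = fit_calibrator(rng.random(500), rng.integers(0, 2, 500))
cal_b = fit_calibrator(rng.random(500) ** 2, rng.integers(0, 2, 500))

boxes_a, scores_a = torch.rand(50, 4), rng.random(50)
boxes_b, scores_b = torch.rand(50, 4), rng.random(50)
boxes_a[:, 2:] += boxes_a[:, :2]      # make valid x1y1x2y2 boxes
boxes_b[:, 2:] += boxes_b[:, :2]

boxes = torch.cat([boxes_a, boxes_b])
scores = torch.tensor(np.concatenate([cal_a.predict(scores_a),
                                      cal_b.predict(scores_b)]),
                      dtype=torch.float32)
keep = nms(boxes, scores, iou_threshold=0.6)          # joint refinement
fused_boxes, fused_scores = boxes[keep], scores[keep]
```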
A novel approach for holographic 3D content generation without depth map
paper_authors: Hakdong Kim, Minkyu Jee, Yurim Lee, Kyudam Choi, MinSung Yoon, Cheongwon Kim
for: Generating computer-generated holograms (CGHs) with the FFT algorithm.
methods: A deep learning approach estimates the depth map of an image from only the input RGB image and then generates the CGH.
results: Compared with other models, the proposed method generates holograms more accurately while using only RGB color data.
Abstract
In preparation for observing holographic 3D content, acquiring a set of RGB color and depth map images per scene is necessary to generate computer-generated holograms (CGHs) when using the fast Fourier transform (FFT) algorithm. However, in real-world situations, these paired formats of RGB color and depth map images are not always fully available. We propose a deep learning-based method to synthesize the volumetric digital holograms using only the given RGB image, so that we can overcome environments where RGB color and depth map images are partially provided. The proposed method uses only the input of RGB image to estimate its depth map and then generate its CGH sequentially. Through experiments, we demonstrate that the volumetric hologram generated through our proposed model is more accurate than that of competitive models, under the situation that only RGB color data can be provided.
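Once a depth map has been estimated, the FFT-based CGH step can be sketched with the standard angular spectrum method: the scene is sliced into depth layers and each layer is propagated to the hologram plane. Wavelength, pixel pitch, layer count, and propagation distances below are illustrative.

```python
# Layered CGH synthesis via angular spectrum propagation (illustrative values).
import numpy as np

def angular_spectrum(field, wavelength, pitch, z):
    """Propagate a complex field a distance z (all lengths in meters)."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)
    FX, FY = np.meshgrid(fx, fx)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0))       # evanescent cut-off
    H = np.exp(1j * kz * z)
    return np.fft.ifft2(np.fft.fft2(field) * H)

rng = np.random.default_rng(0)
amplitude = rng.random((256, 256))                     # one color channel
depth = rng.random((256, 256))                         # estimated depth, 0..1

hologram = np.zeros((256, 256), dtype=complex)
for k, z in enumerate(np.linspace(0.01, 0.02, 8)):     # 8 depth layers
    layer = amplitude * ((depth * 8).astype(int) == k) # slice scene by depth
    hologram += angular_spectrum(layer.astype(complex), 532e-9, 8e-6, z)
print(np.angle(hologram).shape)                        # phase-only CGH output
```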
GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction
results: The method achieves competitive performance compared with other approaches on five challenging benchmarks.
Abstract
All tables can be represented as grids. Based on this observation, we propose GridFormer, a novel approach for interpreting unconstrained table structures by predicting the vertex and edge of a grid. First, we propose a flexible table representation in the form of an MXN grid. In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table. Then, we introduce a DETR-style table structure recognizer to efficiently predict this multi-objective information of the grid in a single shot. Specifically, given a set of learned row and column queries, the recognizer directly outputs the vertexes and edges information of the corresponding rows and columns. Extensive experiments on five challenging benchmarks which include wired, wireless, multi-merge-cell, oriented, and distorted tables demonstrate the competitive performance of our model over other methods.
Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization
methods: The study complements the existing real-world test-time adaptation protocol with a globally class-imbalanced testing set and shows the shortcomings of existing methods under the combined settings. It proposes a balanced batchnorm layer that adapts at inference without biasing towards majority classes, and, drawing on the success of self-training (ST), adapts ST for test-time adaptation; since ST alone is prone to over-adaptation, model updates are regularized with an anchored loss.
results: The study builds a unified architecture named TRIBE upon a tri-net design with balanced batchnorm layers. Evaluated on four real-world test-time adaptation settings, TRIBE achieves state-of-the-art performance across multiple evaluation protocols.
Abstract
Test-Time Adaptation aims to adapt a source-domain model to testing data at inference stage, with success demonstrated in adapting to unseen corruptions. However, these attempts may fail under more challenging real-world scenarios. Existing works mainly consider real-world test-time adaptation under a non-i.i.d. data stream and continual domain shift. In this work, we first complement the existing real-world TTA protocol with a globally class-imbalanced testing set. We demonstrate that combining all settings together poses new challenges to existing methods. We argue the failure of state-of-the-art methods is first caused by indiscriminately adapting normalization layers to imbalanced testing data. To remedy this shortcoming, we propose a balanced batchnorm layer to swap out the regular batchnorm at inference stage. The new batchnorm layer is capable of adapting without biasing towards majority classes. We are further inspired by the success of self-training (ST) in learning from unlabeled data and adapt ST for test-time adaptation. However, ST alone is prone to over-adaptation, which is responsible for the poor performance under continual domain shift. Hence, we propose to improve self-training under continual domain shift by regularizing model updates with an anchored loss. The final TTA model, termed TRIBE, is built upon a tri-net architecture with balanced batchnorm layers. We evaluate TRIBE on four datasets representing real-world TTA settings. TRIBE consistently achieves the state-of-the-art performance across multiple evaluation protocols. The code is available at \url{https://github.com/Gorilla-Lab-SCUT/TRIBE}.
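A minimal sketch of the class-balanced normalization idea follows: instead of pooling statistics over a possibly imbalanced test batch, per-class statistics computed from pseudo-labels are averaged with equal weight per class. The paper's layer also tracks running estimates over the test stream; this single-batch version, with assumed (N, C) feature shapes, only illustrates the principle.

```python
import torch

def balanced_batchnorm(feats, pseudo_labels, num_classes, eps=1e-5):
    """Class-balanced normalization statistics (sketch).
    feats: (N, C) features; pseudo_labels: (N,) predicted classes.
    Per-class means/variances are averaged with equal weight, so a
    majority class cannot dominate the normalization statistics."""
    class_means, class_vars = [], []
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            fc = feats[mask]
            class_means.append(fc.mean(dim=0))
            class_vars.append(fc.var(dim=0, unbiased=False))
    mean = torch.stack(class_means).mean(dim=0)  # equal weight per class
    var = torch.stack(class_vars).mean(dim=0)
    return (feats - mean) / torch.sqrt(var + eps)

feats = torch.randn(32, 64)
labels = torch.randint(0, 10, (32,))
normed = balanced_batchnorm(feats, labels, num_classes=10)
```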
FEC: Three Finetuning-free Methods to Enhance Consistency for Real Image Editing
results: The methods reduce memory usage and computation while preserving the texture and features of the image, enabling accurate real-image editing.
Abstract
Text-conditional image editing is a very useful task that has recently emerged with immeasurable potential. Most current real-image editing methods first need to complete the reconstruction of the image, and then editing is carried out by various methods based on the reconstruction. Most methods use DDIM Inversion for reconstruction; however, DDIM Inversion often fails to guarantee reconstruction performance, i.e., it fails to produce results that preserve the original image content. To address the problem of reconstruction failure, we propose FEC, which consists of three sampling methods, each designed for different editing types and settings. Our three methods achieve two important goals in image editing tasks: 1) ensuring successful reconstruction, i.e., sampling a generated result that preserves the texture and features of the original real image; 2) pairing with many editing methods and greatly improving their performance across various editing tasks. In addition, none of our sampling methods require fine-tuning of the diffusion model or time-consuming training on large-scale datasets. Hence, time cost as well as memory and computation usage can be significantly reduced.
Addressing Data Misalignment in Image-LiDAR Fusion on Point Cloud Segmentation
results: Provides a careful analysis of the misalignment problem on the nuScenes dataset and the SOTA fusion model 2DPASS, along with possible solutions and directions for improvement.
Abstract
With the advent of advanced multi-sensor fusion models, there has been a notable enhancement in the performance of perception tasks in autonomous driving. Despite these advancements, challenges persist, particularly in the fusion of data from cameras and LiDAR sensors. A critical concern is the accurate alignment of data from these disparate sensors. Our observations indicate that the projected positions of LiDAR points often misalign on the corresponding image. Furthermore, fusion models appear to struggle in accurately segmenting these misaligned points. In this paper, we address this problem carefully, with a specific focus on the nuScenes dataset and the SOTA fusion model 2DPASS, and provide possible solutions and potential improvements.
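The misalignment in question arises in the standard projection of LiDAR points into the image plane, sketched below with assumed calibration matrices; errors in the extrinsics or in time synchronization shift the projected pixels off the objects the points actually belong to.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Project LiDAR points (N, 3) into pixel coordinates.
    T_cam_from_lidar: 4x4 extrinsic transform; K: 3x3 camera intrinsics.
    Returns pixel coordinates for points in front of the camera, plus
    the mask selecting those points."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0          # keep points in front of camera
    uvw = (K @ pts_cam[in_front].T).T
    uv = uvw[:, :2] / uvw[:, 2:3]         # perspective division
    return uv, in_front
```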
Noise-Tolerant Unsupervised Adapter for Vision-Language Models
results: Experiments show that NtUA achieves superior performance consistently across multiple widely adopted benchmarks, without requiring labelled target samples.
Abstract
Recent advances in large-scale vision-language models have achieved very impressive performance in various zero-shot image classification tasks. While prior studies have demonstrated significant improvements by introducing few-shot labelled target samples, they still require labelling of target samples, which greatly degrades their scalability while handling various visual recognition tasks. We design NtUA, a Noise-tolerant Unsupervised Adapter that allows learning superior target models with few-shot unlabelled target samples. NtUA works as a key-value cache that formulates visual features and predicted pseudo-labels of the few-shot unlabelled target samples as key-value pairs. It consists of two complementary designs. The first is adaptive cache formation that combats pseudo-label noises by weighting the key-value pairs according to their prediction confidence. The second is pseudo-label rectification, which corrects both pair values (i.e., pseudo-labels) and cache weights by leveraging knowledge distillation from large-scale vision language models. Extensive experiments show that NtUA achieves superior performance consistently across multiple widely adopted benchmarks.
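A minimal sketch of such a key-value cache is given below, loosely in the style of cache-based adapters: keys are target-sample features, values are confidence-weighted pseudo-labels, and the cache logits are blended with the vision-language model's zero-shot logits. The exact weighting and the distillation-based rectification step are simplified away, and the beta/alpha hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cache_predict(query, keys, values, confidences, clip_logits,
                  beta=5.0, alpha=1.0):
    """Key-value cache classification (sketch).
    query: (B, D) test features; keys: (K, D) L2-normalized features of
    few-shot unlabelled target samples; values: (K, C) one-hot
    pseudo-labels; confidences: (K,) prediction confidences used to
    down-weight noisy pairs; clip_logits: (B, C) zero-shot logits."""
    query = F.normalize(query, dim=-1)
    affinity = query @ keys.t()                    # (B, K) cosine sims
    weights = torch.exp(-beta * (1.0 - affinity))  # sharpened affinity
    weights = weights * confidences.unsqueeze(0)   # adaptive cache weighting
    cache_logits = weights @ values                # (B, C)
    return clip_logits + alpha * cache_logits
```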
PHRIT: Parametric Hand Representation with Implicit Template
paper_authors: Zhisheng Huang, Yujin Chen, Di Kang, Jinlu Zhang, Zhigang Tu
for: Parametric hand mesh modeling with an implicit template.
methods: Uses signed distance fields (SDFs) with part-based shape priors, deforming a canonical template with a deformation field.
results: Realistic and immersive hand modeling with state-of-the-art performance, demonstrated on multiple downstream tasks such as skeleton-driven hand reconstruction, shapes from point clouds, and single-view 3D reconstruction.
Abstract
We propose PHRIT, a novel approach for parametric hand mesh modeling with an implicit template that combines the advantages of both parametric meshes and implicit representations. Our method represents deformable hand shapes using signed distance fields (SDFs) with part-based shape priors, utilizing a deformation field to execute the deformation. The model offers efficient high-fidelity hand reconstruction by deforming the canonical template at infinite resolution. Additionally, it is fully differentiable and can be easily used in hand modeling since it can be driven by the skeleton and shape latent codes. We evaluate PHRIT on multiple downstream tasks, including skeleton-driven hand reconstruction, shapes from point clouds, and single-view 3D reconstruction, demonstrating that our approach achieves realistic and immersive hand modeling with state-of-the-art performance.
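The core representation can be sketched as a canonical template SDF composed with a learned deformation field, as below; the part-based priors and the skeleton/shape-code conditioning of the actual model are reduced here to a single latent code, so this is a structural sketch only.

```python
import torch
import torch.nn as nn

class DeformedSDF(nn.Module):
    """Canonical template SDF warped by a deformation field conditioned
    on a shape latent code (minimal sketch of the representation)."""
    def __init__(self, code_dim=64, hidden=256):
        super().__init__()
        self.deform = nn.Sequential(        # (x, code) -> offset in R^3
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))
        self.template = nn.Sequential(      # canonical-space SDF
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, code):
        # x: (N, 3) query points, code: (code_dim,) shape latent
        c = code.expand(x.shape[0], -1)
        x_canonical = x + self.deform(torch.cat([x, c], dim=-1))
        return self.template(x_canonical)   # signed distance per point

model = DeformedSDF()
sdf = model(torch.randn(1024, 3), torch.randn(64))
```

Because the SDF is continuous, it can be queried at arbitrary resolution, which is what enables reconstruction "at infinite resolution" from the deformed template.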
Face Cartoonisation For Various Poses Using StyleGAN
for: The paper presents a new face cartoonisation method that preserves the original identity and supports various poses; unlike previous methods that relied on conditional GANs, the approach leverages the expressive latent space of StyleGAN.
methods: An encoder captures pose and identity information from the image and produces a corresponding embedding in the StyleGAN latent space; the embedding is then passed through a pre-trained generator to obtain the desired cartoonised output.
results: Extensive experiments show that the encoder adapts the StyleGAN output to better preserve identity under cartoonisation across different poses. The method stands out from other StyleGAN-based approaches by not requiring a dedicated, fine-tuned StyleGAN model.
Abstract
This paper presents an innovative approach to achieve face cartoonisation while preserving the original identity and accommodating various poses. Unlike previous methods in this field that relied on conditional-GANs, which posed challenges related to dataset requirements and pose training, our approach leverages the expressive latent space of StyleGAN. We achieve this by introducing an encoder that captures both pose and identity information from images and generates a corresponding embedding within the StyleGAN latent space. By subsequently passing this embedding through a pre-trained generator, we obtain the desired cartoonised output. While many other approaches based on StyleGAN necessitate a dedicated and fine-tuned StyleGAN model, our method stands out by utilizing an already-trained StyleGAN designed to produce realistic facial images. We show by extensive experimentation how our encoder adapts the StyleGAN output to better preserve identity when the objective is cartoonisation.
Pre-training-free Image Manipulation Localization through Non-Mutually Exclusive Contrastive Learning
paper_authors: Jizhe Zhou, Xiaochen Ma, Xia Du, Ahmed Y. Alhammadi, Wentao Feng
for: The paper aims to address the data-insufficiency problem in deep Image Manipulation Localization (IML) models by proposing a Non-mutually exclusive Contrastive Learning (NCL) framework.
methods: The NCL framework uses a pivot structure with dual branches to constantly switch the role of contour patches between positives and negatives during training, together with a pivot-consistent loss to avoid spatial corruption.
results: The proposed NCL framework achieves state-of-the-art performance on all five benchmarks without any pre-training and is more robust on unseen real-life samples.
Abstract
Deep Image Manipulation Localization (IML) models suffer from training data insufficiency and thus heavily rely on pre-training. We argue that contrastive learning is more suitable to tackle the data insufficiency problem for IML. Crafting mutually exclusive positives and negatives is the prerequisite for contrastive learning. However, when adopting contrastive learning in IML, we encounter three categories of image patches: tampered, authentic, and contour patches. Tampered and authentic patches are naturally mutually exclusive, but contour patches containing both tampered and authentic pixels are non-mutually exclusive to them. Simply abnegating these contour patches results in a drastic performance loss since contour patches are decisive to the learning outcomes. Hence, we propose the Non-mutually exclusive Contrastive Learning (NCL) framework to rescue conventional contrastive learning from the above dilemma. In NCL, to cope with the non-mutually exclusivity, we first establish a pivot structure with dual branches to constantly switch the role of contour patches between positives and negatives while training. Then, we devise a pivot-consistent loss to avoid spatial corruption caused by the role-switching process. In this manner, NCL both inherits the self-supervised merits to address the data insufficiency and retains a high manipulation localization accuracy. Extensive experiments verify that our NCL achieves state-of-the-art performance on all five benchmarks without any pre-training and is more robust on unseen real-life samples. The code is available at: https://github.com/Knightzjz/NCL-IML.
FDLS: A Deep Learning Approach to Production Quality, Controllable, and Retargetable Facial Performances
paper_authors: Wan-Duo Kurt Ma, Muhammad Ghifary, J. P. Lewis, Byungkuk Choi, Haekwang Eom
for: The paper addresses facial performance solving for visual effects in film, including creating realistic synthetic humans and retargeting actors' performances to humanoid characters.
methods: The Facial Deep Learning Solver (FDLS) adopts a coarse-to-fine, human-in-the-loop strategy, allowing a solved performance to be verified and edited at several stages of the solving process.
results: FDLS enables production-quality animated performances with little or no manual effort in many cases, handles small day-to-day changes in the actor's face shape, and has been used in major movies.
Abstract
Visual effects commonly require both the creation of realistic synthetic humans and the retargeting of actors' performances to humanoid characters such as aliens and monsters. Achieving the expressive performances demanded in entertainment requires manipulating complex models with hundreds of parameters. Full creative control requires the freedom to make edits at any stage of the production, which prohibits the use of a fully automatic ``black box'' solution with uninterpretable parameters. On the other hand, producing realistic animation with these sophisticated models is difficult and laborious. This paper describes FDLS (Facial Deep Learning Solver), which is Weta Digital's solution to these challenges. FDLS adopts a coarse-to-fine and human-in-the-loop strategy, allowing a solved performance to be verified and edited at several stages in the solving process. To train FDLS, we first transform the raw motion-captured data into robust graph features. Secondly, based on the observation that artists typically finalize the jaw pass animation before proceeding to finer detail, we solve for the jaw motion first and predict fine expressions with region-based networks conditioned on the jaw position. Finally, artists can optionally invoke a non-linear finetuning process on top of the FDLS solution to follow the motion-captured virtual markers as closely as possible. FDLS supports editing if needed to improve the results of the deep learning solution, and it can handle small daily changes in the actor's face shape. FDLS permits reliable and production-quality performance solving with minimal training and little or no manual effort in many cases, while also allowing the solve to be guided and edited in unusual and difficult cases. The system has been under development for several years and has been used in major movies.
Nearest Neighbor Guidance for Out-of-Distribution Detection
results: Extensive experiments show that NNGuide reduces the overconfidence of OOD samples while preserving the fine-grained capability of classifier-based scores, achieving state-of-the-art AUROC, FPR95, and AUPR on the ImageNet OOD detection benchmarks.
Abstract
Detecting out-of-distribution (OOD) samples is crucial for machine learning models deployed in open-world environments. Classifier-based scores are a standard approach for OOD detection due to their fine-grained detection capability. However, these scores often suffer from overconfidence issues, misclassifying OOD samples distant from the in-distribution region. To address this challenge, we propose a method called Nearest Neighbor Guidance (NNGuide) that guides the classifier-based score to respect the boundary geometry of the data manifold. NNGuide reduces the overconfidence of OOD samples while preserving the fine-grained capability of the classifier-based score. We conduct extensive experiments on ImageNet OOD detection benchmarks under diverse settings, including a scenario where the ID data undergoes natural distribution shift. Our results demonstrate that NNGuide provides a significant performance improvement on the base detection scores, achieving state-of-the-art results on the AUROC, FPR95, and AUPR metrics. The code is given at \url{https://github.com/roomo7time/nnguide}.
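The guidance idea can be sketched as scaling a classifier-based score by the sample's similarity to its nearest in-distribution neighbors; the exact scoring function and combination rule in the paper may differ from this minimal multiplicative version.

```python
import numpy as np

def nnguide_score(test_feats, test_base_scores, train_feats, k=10):
    """Guide a classifier-based OOD score with nearest-neighbor similarity
    to an in-distribution feature bank. Features are L2-normalized so the
    inner product is cosine similarity."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                       # (N_test, N_train)
    topk = np.sort(sims, axis=1)[:, -k:]        # k nearest neighbors
    guidance = topk.mean(axis=1)                # high for ID-like samples
    return test_base_scores * guidance          # low score => flag as OOD
```

An overconfident classifier score far from the data manifold gets suppressed because its nearest-neighbor similarity to the ID bank is low, which is exactly the failure mode being targeted.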
Locality-preserving Directions for Interpreting the Latent Space of Satellite Image GANs
methods: The paper uses a locality-preserving method to decompose the weight space of pre-trained GANs and recover interpretable directions corresponding to high-level semantic concepts (such as urbanization, structure density, and flora presence).
results: Compared with conventional PCA-based approaches, the locality-preserving directions better preserve class information and perform well for satellite image data synthesis.
Abstract
We present a locality-aware method for interpreting the latent space of wavelet-based Generative Adversarial Networks (GANs) that can well capture the large spatial and spectral variability characteristic of satellite imagery. By focusing on preserving locality, the proposed method is able to decompose the weight space of pre-trained GANs and recover interpretable directions that correspond to high-level semantic concepts (such as urbanization, structure density, flora presence), which can subsequently be used for guided synthesis of satellite imagery. In contrast to typically used approaches that focus on capturing the variability of the weight space in a reduced-dimensionality space (i.e., based on Principal Component Analysis, PCA), we show that preserving locality leads to vectors with different angles that are more robust to artifacts and can better preserve class information. Via a set of quantitative and qualitative examples, we further show that the proposed approach can outperform both baseline geometric augmentations and global, PCA-based approaches for data synthesis in the context of data augmentation for satellite scene classification.
ITEM3D: Illumination-Aware Directional Texture Editing for 3D Models
results: Outperforms state-of-the-art methods and provides explicit control over lighting.
Abstract
Texture editing is a crucial task in 3D modeling that allows users to automatically manipulate the surface materials of 3D models. However, the inherent complexity of 3D models and ambiguous text descriptions make this task challenging. To address this challenge, we propose ITEM3D, an illumination-aware model for automatic 3D object editing according to text prompts. Leveraging diffusion models and differentiable rendering, ITEM3D takes rendered images as the bridge between text and 3D representation, and further optimizes the disentangled texture and environment map. Previous methods adopt an absolute editing direction, namely score distillation sampling (SDS), as the optimization objective, which unfortunately results in noisy appearance and text inconsistency. To solve the problem caused by ambiguous text, we introduce a relative editing direction, an optimization objective defined by the noise difference between the source and target texts, to resolve the semantic ambiguity between texts and images. Additionally, we gradually adjust the direction during optimization to further address unexpected deviation in the texture domain. Qualitative and quantitative experiments show that ITEM3D outperforms state-of-the-art methods on various 3D objects. We also perform text-guided relighting to show explicit control over lighting.
Cross-Dataset-Robust Method for Blind Real-World Image Quality Assessment
results: Experimental results show that the proposed method performs better in cross-dataset tests, even surpassing some state-of-the-art methods trained directly on those datasets, verifying its robustness and generalization.
Abstract
Although many effective models and real-world datasets have been presented for blind image quality assessment (BIQA), recent BIQA models usually tend to fit a specific training set. Hence, it is still difficult to accurately and robustly measure the visual quality of an arbitrary real-world image. In this paper, a robust BIQA method is designed based on three aspects, i.e., a robust training strategy, a large-scale real-world dataset, and a powerful backbone. First, many individual models based on the popular and state-of-the-art (SOTA) Swin-Transformer (SwinT) are trained on different real-world BIQA datasets respectively. Then, these biased SwinT-based models are jointly used to generate pseudo-labels, adopting the probability of the relative quality of two random images instead of a fixed quality score. A large-scale real-world image dataset with 1,000,000 image pairs and pseudo-labels is then proposed for training the final cross-dataset-robust model. Experimental results on cross-dataset tests show that the performance of the proposed method is even better than some SOTA methods that are directly trained on these datasets, thus verifying the robustness and generalization of our method.
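The pseudo-labeling step reduces to a simple computation: for a random image pair, the ensemble's pseudo-label is the probability that one image is of higher quality than the other, not a fixed absolute score. A minimal sketch, with illustrative scores:

```python
import numpy as np

def pairwise_pseudo_label(scores_a, scores_b):
    """Pseudo-label for a random image pair from an ensemble of biased
    models: the probability that image A has higher quality than image B.
    scores_a/scores_b: per-model quality predictions, shape (num_models,)."""
    return float(np.mean(scores_a > scores_b))

# e.g., five SwinT-based models trained on different BIQA datasets
label = pairwise_pseudo_label(np.array([62.1, 58.4, 70.2, 64.9, 61.0]),
                              np.array([55.3, 60.1, 49.8, 52.2, 57.7]))
# label = 0.8: four of the five experts rank A above B
```

Relative labels sidestep the fact that each expert's absolute quality scale is biased towards its own training set, which is what makes the pooled pseudo-labels usable across datasets.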
Unsupervised Reconstruction of 3D Human Pose Interactions From 2D Poses Alone
paper_authors: Peter Hardy, Hansung Kim
for: The study tackles the perspective-ambiguity problem that prevents unsupervised 2D-3D human pose estimation from working in multi-person scenarios, by predicting the camera's elevation angle relative to the subjects' pelvis, with a focus on reconstructing human interactions.
methods: Building on prior work, each subject's 2D pose is independently lifted to 3D and then combined in a shared 3D coordinate system; the poses are rotated and offset by the predicted elevation angle and scaled to obtain an accurate 3D reconstruction.
results: The method achieves accurate 3D reconstructions on the CHI3D dataset, whose use for unsupervised 2D-3D pose estimation is introduced together with three new quantitative metrics, establishing a benchmark for future research.
Abstract
Current unsupervised 2D-3D human pose estimation (HPE) methods do not work in multi-person scenarios due to perspective ambiguity in monocular images. Therefore, we present one of the first studies investigating the feasibility of unsupervised multi-person 2D-3D HPE from just 2D poses alone, focusing on reconstructing human interactions. To address the issue of perspective ambiguity, we expand upon prior work by predicting the cameras' elevation angle relative to the subjects' pelvis. This allows us to rotate the predicted poses to be level with the ground plane, while obtaining an estimate for the vertical offset in 3D between individuals. Our method involves independently lifting each subject's 2D pose to 3D, before combining them in a shared 3D coordinate system. The poses are then rotated and offset by the predicted elevation angle before being scaled. This by itself enables us to retrieve an accurate 3D reconstruction of their poses. We present our results on the CHI3D dataset, introducing its use for unsupervised 2D-3D pose estimation with three new quantitative metrics, and establishing a benchmark for future research.
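Leveling a lifted pose with the ground plane comes down to one rotation by the predicted elevation angle plus the estimated vertical offset between individuals; the axis convention below (x right, y up, z forward) is an assumption for illustration.

```python
import numpy as np

def level_pose(pose_3d, elevation_rad, vertical_offset=0.0):
    """Rotate a lifted 3D pose (J, 3) about the horizontal axis by the
    predicted camera elevation angle so it is level with the ground
    plane, then apply the estimated vertical offset."""
    c, s = np.cos(elevation_rad), np.sin(elevation_rad)
    rot_x = np.array([[1, 0,  0],
                      [0, c, -s],
                      [0, s,  c]])
    pose = pose_3d @ rot_x.T
    pose[:, 1] += vertical_offset      # shift along the up axis
    return pose

leveled = level_pose(np.random.randn(17, 3), np.deg2rad(12.0), 0.3)
```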
Generalization of pixel-wise phase estimation by CNN and improvement of phase-unwrapping by MRF optimization for one-shot 3D scan
paper_authors: Hiroto Harada, Michihiro Mikamo, Ryo Furukawa, Ryushuke Sagawa, Hiroshi Kawasaki
for: Improving the accuracy and stability of one-shot 3D scanning, which is useful in fields such as medicine and industry.
methods: Proposes a pixel-wise interpolation technique based on a U-Net pre-trained on CG data with an efficient data-augmentation algorithm, and a robust correspondence-finding algorithm based on Markov random field (MRF) optimization.
results: Experiments show that the proposed method effectively improves the accuracy and stability of one-shot 3D scanning and handles real data with strong noise and textures.
Abstract
The active stereo technique using single pattern projection, a.k.a. one-shot 3D scan, has drawn wide attention from industry, medicine, and other fields. One severe drawback of one-shot 3D scan is sparse reconstruction. In addition, since the spatial pattern becomes complicated for the purpose of efficient embedding, it is easily affected by noise, which results in unstable decoding. To solve these problems, we propose a pixel-wise interpolation technique for one-shot scan, which is applicable to any type of static pattern as long as the pattern is regular and periodic. This is achieved by a U-Net pre-trained on CG data with an efficient data-augmentation algorithm. In the paper, to further overcome the decoding instability, we propose a robust correspondence-finding algorithm based on Markov random field (MRF) optimization. We also propose a shape refinement algorithm based on b-spline and Gaussian kernel interpolation using explicitly detected laser curves. Experiments are conducted to show the effectiveness of the proposed method using real data with strong noise and textures.
Three-dimensional Tracking of a Large Number of High Dynamic Objects from Multiple Views using Current Statistical Model
results: Simulation experiments and a real experiment on fruit-fly clusters show that the method improves tracking integrity, continuity, and precision compared with the conventional constant-velocity-based particle filter.
Abstract
Three-dimensional tracking of multiple objects from multiple views has a wide range of applications, especially in the study of bio-cluster behavior, which requires precise trajectories of research objects. However, there are significant temporal-spatial association uncertainties when the objects are similar to each other, frequently maneuver, and cluster in large numbers. Aiming at such a multi-view multi-object 3D tracking scenario, a current statistical model based Kalman particle filter (CSKPF) method is proposed following the Bayesian tracking-while-reconstruction framework. The CSKPF algorithm predicts the objects' states and estimates the objects' state covariance by the current statistical model to improve particle sampling efficiency, and suppresses the measurement noise by the Kalman filter. The simulation experiments prove that the CSKPF method improves the tracking integrity, continuity, and precision compared with the existing constant-velocity-based particle filter (CVPF) method. The real experiment on fruitfly clusters also confirms the effectiveness of the CSKPF method.
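The Kalman component of the filter is the standard predict/update cycle sketched below; in the current statistical model, the process noise Q is additionally adapted online from the estimated acceleration, which is omitted here for brevity.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a Kalman filter.
    x: state (e.g., position/velocity/acceleration), P: state covariance,
    z: 3D measurement, F: transition model, H: measurement model,
    Q/R: process/measurement noise covariances."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Inside a particle filter, this update suppresses measurement noise per particle, while the adaptive process model keeps the predicted covariance realistic for frequently maneuvering objects.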
Discrepancy Matters: Learning from Inconsistent Decoder Features for Consistent Semi-supervised Medical Image Segmentation
results: Experiments show that LeFeD outperforms eight SOTA methods on three public datasets without extra bells and whistles such as uncertainty estimation or strong constraints, setting a new SOTA for semi-supervised medical image segmentation.
Abstract
Semi-supervised learning (SSL) has been proven beneficial for mitigating the issue of limited labeled data, especially on the task of volumetric medical image segmentation. Unlike previous SSL methods, which focus on exploring highly confident pseudo-labels or developing consistency regularization schemes, our empirical findings suggest that inconsistent decoder features emerge naturally when two decoders strive to generate consistent predictions. Based on this observation, we first analyze the treasure of discrepancy in learning towards consistency, under both pseudo-labeling and consistency regularization settings, and subsequently propose a novel SSL method called LeFeD, which learns the feature-level discrepancy obtained from two decoders by feeding the discrepancy as a feedback signal to the encoder. The core design of LeFeD is to enlarge the difference by training differentiated decoders, and then learn from the inconsistent information iteratively. We evaluate LeFeD against eight state-of-the-art (SOTA) methods on three public datasets. Experiments show LeFeD surpasses competitors without any bells and whistles such as uncertainty estimation and strong constraints, as well as setting a new state-of-the-art for semi-supervised medical image segmentation. Code is available at \url{https://github.com/maxwell0027/LeFeD}
A Comparative Study of Population-Graph Construction Methods and Graph Neural Networks for Brain Age Regression
results: The study finds that architectures highly sensitive to the graph structure, such as GCN and GAT, struggle with low-homophily graphs, while architectures such as GraphSAGE and Chebyshev are more robust across different homophily ratios.
Abstract
The difference between the chronological and biological brain age of a subject can be an important biomarker for neurodegenerative diseases, thus brain age estimation can be crucial in clinical settings. One way to incorporate multimodal information into this estimation is through population graphs, which combine various types of imaging data and capture the associations among individuals within a population. In medical imaging, population graphs have demonstrated promising results, mostly for classification tasks. In most cases, the graph structure is pre-defined and remains static during training. However, extracting population graphs is a non-trivial task and can significantly impact the performance of Graph Neural Networks (GNNs), which are sensitive to the graph structure. In this work, we highlight the importance of a meaningful graph construction and experiment with different population-graph construction methods and their effect on GNN performance on brain age estimation. We use the homophily metric and graph visualizations to gain valuable quantitative and qualitative insights on the extracted graph structures. For the experimental evaluation, we leverage the UK Biobank dataset, which offers many imaging and non-imaging phenotypes. Our results indicate that architectures highly sensitive to the graph structure, such as Graph Convolutional Network (GCN) and Graph Attention Network (GAT), struggle with low homophily graphs, while other architectures, such as GraphSage and Chebyshev, are more robust across different homophily ratios. We conclude that static graph construction approaches are potentially insufficient for the task of brain age estimation and make recommendations for alternative research directions.
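The homophily metric used to characterize the extracted population graphs can be computed as the fraction of edges joining same-label nodes, for example:

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose endpoints share a label.
    edge_index: (2, E) array of node indices; labels: (N,) node labels
    (e.g., binned brain-age groups). GCN/GAT tend to degrade when this
    ratio is low, while GraphSAGE/Chebyshev are more robust."""
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

# toy population graph: 4 subjects, 4 edges
edges = np.array([[0, 0, 1, 2],
                  [1, 2, 3, 3]])
labels = np.array([0, 0, 1, 1])
print(edge_homophily(edges, labels))  # 0.5: edges (0,1) and (2,3) match
```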
ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios
paper_authors: Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Claudia Bonanno, Rosario Scavo, Antonino Furnari, Giovanni Maria Farinella
for: The paper is written for studying human-object interactions in industrial scenarios.
methods: The paper uses a new dataset called ENIGMA-51, which is densely annotated with labels to enable the systematic study of human-object interactions.
results: The baseline results show that the ENIGMA-51 dataset poses a challenging benchmark for studying human-object interactions in industrial scenarios.
Abstract
ENIGMA-51 is a new egocentric dataset acquired in a real industrial domain by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., electric screwdriver) and electronic instruments (e.g., oscilloscope). The 51 sequences are densely annotated with a rich set of labels that enable the systematic study of human-object interactions in the industrial domain. We provide benchmarks on four tasks related to human-object interactions: 1) untrimmed action detection, 2) egocentric human-object interaction detection, 3) short-term object interaction anticipation and 4) natural language understanding of intents and entities. Baseline results show that the ENIGMA-51 dataset poses a challenging benchmark to study human-object interactions in industrial scenarios. We publicly release the dataset at: https://iplab.dmi.unict.it/ENIGMA-51/.
results: The method creates finger phantoms that yield realistic finger vein images with precisely known vein patterns, which can be used to develop and evaluate finger vein extraction and recognition methods, and also to spoof a finger vein recognition system.
Abstract
Finger vein pattern recognition is an emerging biometric with a good resistance to presentation attacks and low error rates. One problem is that it is hard to obtain ground truth finger vein patterns from live fingers. In this paper we propose an advanced method to create finger vein phantoms using 3D printing where we mimic the optical properties of the various tissues inside the fingers, like bone, veins and soft tissues using different printing materials and parameters. We demonstrate that we are able to create finger phantoms that result in realistic finger vein images and precisely known vein patterns. These phantoms can be used to develop and evaluate finger vein extraction and recognition methods. In addition, we show that the finger vein phantoms can be used to spoof a finger vein recognition system. This paper is based on the Master's thesis of Rasmus van der Grift.
3D Density-Gradient based Edge Detection on Neural Radiance Fields (NeRFs) for Geometric Reconstruction
methods: Uses density gradients obtained from 3D edge-detection filters of the first and second derivatives, namely Sobel, Canny, and Laplacian of Gaussian, applied to the voxelized 3D density field over relative neighboring values.
results: Achieves geometric 3D reconstructions with high geometric accuracy on object surfaces and remarkable object completeness; the Canny filter effectively eliminates gaps and delivers a uniform point density.
Abstract
Generating geometric 3D reconstructions from Neural Radiance Fields (NeRFs) is of great interest. However, accurate and complete reconstructions based on the density values are challenging. The network output depends on input data, NeRF network configuration and hyperparameter. As a result, the direct usage of density values, e.g. via filtering with global density thresholds, usually requires empirical investigations. Under the assumption that the density increases from non-object to object area, the utilization of density gradients from relative values is evident. As the density represents a position-dependent parameter it can be handled anisotropically, therefore processing of the voxelized 3D density field is justified. In this regard, we address geometric 3D reconstructions based on density gradients, whereas the gradients result from 3D edge detection filters of the first and second derivatives, namely Sobel, Canny and Laplacian of Gaussian. The gradients rely on relative neighboring density values in all directions, thus are independent from absolute magnitudes. Consequently, gradient filters are able to extract edges along a wide density range, almost independent from assumptions and empirical investigations. Our approach demonstrates the capability to achieve geometric 3D reconstructions with high geometric accuracy on object surfaces and remarkable object completeness. Notably, Canny filter effectively eliminates gaps, delivers a uniform point density, and strikes a favorable balance between correctness and completeness across the scenes.
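A first-derivative variant of this pipeline fits in a few lines with standard 3D Sobel filtering over the voxelized density field; the relative threshold reflects the method's independence from absolute density magnitudes (the Canny and Laplacian-of-Gaussian variants would replace the filter step).

```python
import numpy as np
from scipy import ndimage

def density_gradient_edges(density, rel_threshold=0.5):
    """First-derivative (Sobel) edge detection on a voxelized NeRF
    density field. The gradient magnitude depends only on relative
    neighboring densities, so a single relative threshold covers a
    wide density range."""
    gx = ndimage.sobel(density, axis=0)
    gy = ndimage.sobel(density, axis=1)
    gz = ndimage.sobel(density, axis=2)
    magnitude = np.sqrt(gx**2 + gy**2 + gz**2)
    mask = magnitude > rel_threshold * magnitude.max()
    return np.argwhere(mask)        # voxel indices on object surfaces

surface_voxels = density_gradient_edges(np.random.rand(64, 64, 64))
```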
Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation
results: Achieves state-of-the-art performance on all public benchmark datasets while maintaining real-time inference speed.
Abstract
Unsupervised video object segmentation (VOS) is a task that aims to detect the most salient object in a video without external guidance about the object. To leverage the property that salient objects usually have distinctive movements compared to the background, recent methods collaboratively use motion cues extracted from optical flow maps with appearance cues extracted from RGB images. However, as optical flow maps are usually very relevant to segmentation masks, the network easily becomes overly dependent on the motion cues during network training. As a result, such two-stream approaches are vulnerable to confusing motion cues, making their predictions unstable. To relieve this issue, we design a novel motion-as-option network by treating motion cues as optional. During network training, RGB images are randomly provided to the motion encoder instead of optical flow maps, to implicitly reduce the motion dependency of the network. As the learned motion encoder can deal with both RGB images and optical flow maps, two different predictions can be generated depending on which source information is used as motion input. In order to fully exploit this property, we also propose an adaptive output selection algorithm to adopt the optimal prediction result at test time. Our proposed approach affords state-of-the-art performance on all public benchmark datasets, even maintaining real-time inference speed.
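The training trick amounts to a coin flip over the motion encoder's input, sketched below with placeholder encoders; the flow map is assumed rendered to three channels so both inputs can share one encoder.

```python
import torch
import torch.nn as nn

appearance_enc = nn.Conv2d(3, 64, 3, padding=1)  # RGB branch (placeholder)
motion_enc = nn.Conv2d(3, 64, 3, padding=1)      # motion branch (placeholder)

def encode(rgb, flow, p_rgb_as_motion=0.5):
    """Training-time substitution: with some probability, feed the RGB
    image to the motion encoder instead of the optical flow map, so the
    network does not become overly dependent on motion cues."""
    motion_input = rgb if torch.rand(1).item() < p_rgb_as_motion else flow
    return appearance_enc(rgb), motion_enc(motion_input)

rgb = torch.randn(1, 3, 64, 64)
flow = torch.randn(1, 3, 64, 64)
app_feat, mot_feat = encode(rgb, flow)
# at test time, run both variants and pick the better output adaptively
```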
Frugal Satellite Image Change Detection with Deep-Net Inversion
for: The paper proposes an active-learning-based change detection algorithm for finding targeted changes in satellite imagery.
methods: The algorithm uses a question-and-answer model that probes an oracle (user) about the relevance of changes and updates deep neural network (DNN) classifiers from the responses; a novel adversarial model learns the most representative, diverse, and uncertain virtual exemplars, improving the effectiveness of active learning.
results: Experiments show that the proposed deep-net inversion outperforms related prior work.
Abstract
Change detection in satellite imagery seeks to find occurrences of targeted changes in a given scene taken at different instants. This task has several applications ranging from land-cover mapping, to anthropogenic activity monitoring, as well as climate change and natural hazard damage assessment. However, change detection is highly challenging due to the acquisition conditions and also to the subjectivity of changes. In this paper, we devise a novel algorithm for change detection based on active learning. The proposed method is based on a question-and-answer model that probes an oracle (user) about the relevance of changes only on a small set of critical images (referred to as virtual exemplars), and according to the oracle's responses updates deep neural network (DNN) classifiers. The main contribution resides in a novel adversarial model that allows learning the most representative, diverse and uncertain virtual exemplars (as inverted preimages of the trained DNNs) that challenge the trained DNNs the most, and this leads to a better re-estimate of these networks in the subsequent iterations of active learning. Experiments show that our proposed deep-net inversion outperforms the related work.
Multi-Label Feature Selection Using Adaptive and Transformed Relevance
results: Experiments on twelve benchmarks spanning various domains show superiority over ten state-of-the-art information-theoretical filter-based multi-label feature selection methods across six evaluation metrics, and confirm scalability on benchmarks with extensive feature and label spaces. Code: https://github.com/Sadegh28/ATR
Abstract
Multi-label learning has emerged as a crucial paradigm in data analysis, addressing scenarios where instances are associated with multiple class labels simultaneously. With the growing prevalence of multi-label data across diverse applications, such as text and image classification, the significance of multi-label feature selection has become increasingly evident. This paper presents a novel information-theoretical filter-based multi-label feature selection, called ATR, with a new heuristic function. Incorporating a combinations of algorithm adaptation and problem transformation approaches, ATR ranks features considering individual labels as well as abstract label space discriminative powers. Our experimental studies encompass twelve benchmarks spanning various domains, demonstrating the superiority of our approach over ten state-of-the-art information-theoretical filter-based multi-label feature selection methods across six evaluation metrics. Furthermore, our experiments affirm the scalability of ATR for benchmarks characterized by extensive feature and label spaces. The codes are available at https://github.com/Sadegh28/ATR
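As context, a minimal information-theoretic filter baseline in this family scores each feature by its average mutual information with the individual labels and ranks accordingly; ATR's adaptive, transformed-relevance heuristic goes beyond this by also weighing abstract label-space discriminative power, so the sketch covers only the per-label relevance part.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features_multilabel(X, Y):
    """Rank features by average mutual information with each label.
    X: (n_samples, n_features); Y: (n_samples, n_labels) binary matrix."""
    relevance = np.zeros(X.shape[1])
    for j in range(Y.shape[1]):              # one binary label at a time
        relevance += mutual_info_classif(X, Y[:, j], random_state=0)
    relevance /= Y.shape[1]
    return np.argsort(relevance)[::-1]       # best features first

X = np.random.rand(200, 20)
Y = (np.random.rand(200, 5) > 0.5).astype(int)
print(rank_features_multilabel(X, Y)[:5])    # indices of top-5 features
```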
InvKA: Gait Recognition via Invertible Koopman Autoencoder
results: On multiple datasets, computational cost is reduced to 1% of state-of-the-art methods while keeping a competitive recognition accuracy of 98% on non-occlusion datasets.
Abstract
Most current gait recognition methods suffer from poor interpretability and high computational cost. To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory. The transition matrix in this space captures complex kinematic features of gait cycles, namely the Koopman operator. The diagonal elements of the operator matrix can represent the overall motion trend, providing a physically meaningful descriptor. To reduce the computational cost of our algorithm, we use a reversible autoencoder to reduce the model size and eliminate convolutional layers to compress its depth, resulting in fewer floating-point operations. Experimental results on multiple datasets show that our method reduces computational cost to 1% compared to state-of-the-art methods while achieving competitive recognition accuracy 98% on non-occlusion datasets.
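Estimating the Koopman operator from an embedded gait sequence is a least-squares problem over consecutive snapshots, and its diagonal gives the motion-trend descriptor described above; the toy embedding below is illustrative only.

```python
import numpy as np

def fit_koopman(Z):
    """Least-squares estimate of the Koopman operator from a sequence of
    gait embeddings Z with shape (T, d): find K with z_{t+1} = K z_t.
    The diagonal of K summarizes the overall motion trend and serves as
    a physically meaningful descriptor."""
    Z0, Z1 = Z[:-1].T, Z[1:].T          # (d, T-1) snapshot matrices
    K = Z1 @ np.linalg.pinv(Z0)         # K = Z1 Z0^+
    return K, np.diag(K)

Z = np.cumsum(np.random.randn(100, 8), axis=0)   # toy embedding sequence
K, descriptor = fit_koopman(Z)
```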
Diffusion-based Holistic Texture Rectification and Synthesis
results: Experimental results show that the framework significantly outperforms existing methods both quantitatively and qualitatively, supported by comprehensive ablation studies.
Abstract
We present a novel framework for rectifying occlusions and distortions in degraded texture samples from natural images. Traditional texture synthesis approaches focus on generating textures from pristine samples, which necessitate meticulous preparation by humans and are often unattainable in most natural images. These challenges stem from the frequent occlusions and distortions of texture samples in natural images due to obstructions and variations in object surface geometry. To address these issues, we propose a framework that synthesizes holistic textures from degraded samples in natural images, extending the applicability of exemplar-based texture synthesis techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM) with a novel occlusion-aware latent transformer. This latent transformer not only effectively encodes texture features from partially-observed samples necessary for the generation process of the LDM, but also explicitly captures long-range dependencies in samples with large occlusions. To train our model, we introduce a method for generating synthetic data by applying geometric transformations and free-form mask generation to clean textures. Experimental results demonstrate that our framework significantly outperforms existing methods both quantitatively and qualitatively. Furthermore, we conduct comprehensive ablation studies to validate the different components of our proposed framework. Results are corroborated by a perceptual user study which highlights the efficiency of our proposed approach.
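The synthetic-data step can be sketched as random free-form (brush-stroke) mask generation applied to clean textures, combined with geometric transformations; the stroke counts and thicknesses below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import cv2

def free_form_mask(h=256, w=256, num_strokes=5, rng=None):
    """Random brush-stroke mask, a common way to synthesize occlusions
    on clean textures. Returns (h, w) uint8 mask, 1 on occluded pixels."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), np.uint8)
    for _ in range(num_strokes):
        x, y = int(rng.integers(0, w)), int(rng.integers(0, h))
        for _ in range(int(rng.integers(3, 8))):   # chained line segments
            angle = rng.uniform(0, 2 * np.pi)
            length = int(rng.integers(10, 60))
            x2 = int(np.clip(x + length * np.cos(angle), 0, w - 1))
            y2 = int(np.clip(y + length * np.sin(angle), 0, h - 1))
            cv2.line(mask, (x, y), (x2, y2), 1, int(rng.integers(10, 25)))
            x, y = x2, y2
    return mask

texture = np.random.rand(256, 256, 3)               # stand-in clean texture
degraded = texture * (1 - free_form_mask()[..., None])
```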
On quantifying and improving realism of images generated with diffusion
paper_authors: Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian
for: The paper addresses the problem of quantifying the realism of images produced by generative models.
methods: It proposes the Image Realism Score (IRS), a non-learning metric computed from five statistical measures of a given image, and shows experimentally that it can quantify the realism of generated content.
results: Experiments show that IRS reliably separates real from fake images and is model- and data-agnostic; minimizing an IRS-augmented generative loss of the Stable Diffusion Model considerably improves the quality of its generated content.
Abstract
Recent advances in diffusion models have led to a quantum leap in the quality of generative visual content. However, quantification of the realism of the content is still challenging. Existing evaluation metrics, such as Inception Score and Fr\'echet inception distance, fall short on benchmarking diffusion models due to the versatility of the generated images. Moreover, they are not designed to quantify the realism of an individual image. This restricts their application in forensic image analysis, which is becoming increasingly important in the emerging era of generative models. To address that, we first propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image. This non-learning based metric not only efficiently quantifies the realism of generated images, it is readily usable as a measure to classify a given image as real or fake. We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by the Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN. We further leverage this attribute of our metric to minimize an IRS-augmented generative loss of SDM, and demonstrate a convenient yet considerable quality improvement of the SDM-generated content with our modification. Our efforts have also led to the Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models. We will release the dataset and code.
results: Extensive experiments on synthetic noise removal and real-world image denoising datasets (SIDD and DND) demonstrate the effectiveness of the method in terms of both PSNR and SSIM, and the method also offers good interpretability.
Abstract
Image denoising is a fundamental task in low-level computer vision. While recent deep learning-based image denoising methods have achieved impressive performance, they are black-box models and the underlying denoising principle remains unclear. In this paper, we propose a novel approach to image denoising that offers both clear denoising mechanism and good performance. We view noise as a type of image style and remove it by incorporating noise-free styles derived from clean images. To achieve this, we design novel losses and network modules to extract noisy styles from noisy images and noise-free styles from clean images. The noise-free style induces low-response activations for noise features and high-response activations for content features in the feature space. This leads to the separation of clean contents from noise, effectively denoising the image. Unlike disentanglement-based image editing tasks that edit semantic-level attributes using styles, our main contribution lies in editing pixel-level attributes through global noise-free styles. We conduct extensive experiments on synthetic noise removal and real-world image denoising datasets (SIDD and DND), demonstrating the effectiveness of our method in terms of both PSNR and SSIM metrics. Moreover, we experimentally validate that our method offers good interpretability.
Advanced Volleyball Stats for All Levels: Automatic Setting Tactic Detection and Classification with a Single Camera
for: A single-camera computer vision framework for advanced setting-strategy classification in volleyball.
methods: Combines setting-ball trajectory recognition with a novel set-trajectory classifier to generate comprehensive and advanced statistics.
results: Outperforms the baseline approach in classifying setting tactics, handling complex game situations and different camera angles.
Abstract
This paper presents PathFinder and PathFinderPlus, two novel end-to-end computer vision frameworks designed specifically for advanced setting strategy classification in volleyball matches from a single camera view. Our frameworks combine setting ball trajectory recognition with a novel set trajectory classifier to generate comprehensive and advanced statistical data. This approach offers a fresh perspective for in-game analysis and surpasses the current level of granularity in volleyball statistics. In comparison to existing methods used in our baseline PathFinder framework, our proposed ball trajectory detection methodology in PathFinderPlus exhibits superior performance for classifying setting tactics under various game conditions. This robustness is particularly advantageous in handling complex game situations and accommodating different camera angles. Additionally, our study introduces an innovative algorithm for automatic identification of the opposing team's right-side (opposite) hitter's current row (front or back) during gameplay, providing critical insights for tactical analysis. The successful demonstration of our single-camera system's feasibility and benefits makes high-level technical analysis accessible to volleyball enthusiasts of all skill levels and resource availability. Furthermore, the computational efficiency of our system allows for real-time deployment, enabling in-game strategy analysis and on-the-spot gameplan adjustments.
Text-image guided Diffusion Model for generating Deepfake celebrity interactions
paper_authors: Yunzhuo Chen, Nur Al Hasan Haldar, Naveed Akhtar, Ajmal Mian
for: The paper aims to explore the use of diffusion models for generating realistic and controllable Deepfake images, with a focus on creating forged content for celebrity interactions.
methods: The paper modifies a popular stable diffusion model to generate high-quality Deepfake images with text and image prompts, and adds the input anchor image's latent at the beginning of inference to improve the generation of images with multiple persons. Additionally, the paper uses Dreambooth to enhance the realism of the fake images.
results: The paper demonstrates that the devised scheme can create fake visual content with alarming realism, such as images of meetings between powerful political figures, which could be used to spread rumors or misinformation.Abstract
Deepfake images are fast becoming a serious concern due to their realism. Diffusion models have recently demonstrated highly realistic visual content generation, which makes them an excellent potential tool for Deepfake generation. To curb their exploitation for Deepfakes, it is imperative to first explore the extent to which diffusion models can be used to generate realistic content that is controllable with convenient prompts. This paper devises and explores a novel method in that regard. Our technique alters the popular stable diffusion model to generate a controllable high-quality Deepfake image with text and image prompts. In addition, the original stable model is severely lacking in generating quality images that contain multiple persons. The modified diffusion model addresses this problem by adding the input anchor image's latent at the beginning of inference rather than using a Gaussian random latent as input. Hence, we focus on generating forged content for celebrity interactions, which may be used to spread rumors. We also apply Dreambooth to enhance the realism of our fake images. Dreambooth trains the pairing of center words and specific features to produce more refined and personalized output images. Our results show that with the devised scheme, it is possible to create fake visual content with alarming realism, such that the content can serve as believable evidence of meetings between powerful political figures.
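The anchor-latent modification described above can be made concrete with a short sampling sketch. Everything here, `encoder`, `unet`, and the diffusers-style `scheduler` interface, is a hypothetical stand-in; the point is only where the anchor latent enters the loop.

```python
# A minimal sketch of anchor-latent initialization for latent diffusion.
import torch

@torch.no_grad()
def sample(unet, scheduler, text_emb, encoder=None, anchor_img=None,
           shape=(1, 4, 64, 64), device="cuda"):
    if anchor_img is not None:
        # The paper's deviation: start from the anchor image's latent, noised
        # to the first timestep, instead of a pure Gaussian latent. This is
        # reported to help when the target image contains multiple persons.
        z0 = encoder(anchor_img)
        latents = scheduler.add_noise(z0, torch.randn_like(z0),
                                      scheduler.timesteps[0])
    else:
        latents = torch.randn(shape, device=device)  # vanilla initialization
    for t in scheduler.timesteps:
        eps = unet(latents, t, text_emb)           # predict noise
        latents = scheduler.step(eps, t, latents)  # one denoising step
    return latents
```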
SSPFusion: A Semantic Structure-Preserving Approach for Infrared and Visible Image Fusion
results: Experiments on three benchmark datasets demonstrate that SSPFusion generates high-quality fused images and boosts the performance of downstream computer-vision tasks such as object detection and recognition.Abstract
Most existing learning-based infrared and visible image fusion (IVIF) methods introduce massive redundant information into the fusion images, e.g., yielding an edge-blurring effect or making objects unrecognizable to detectors. To alleviate these issues, we propose a semantic structure-preserving approach for IVIF, namely SSPFusion. At first, we design a Structural Feature Extractor (SFE) to extract the structural features of infrared and visible images. Then, we introduce a multi-scale Structure-Preserving Fusion (SPF) module to fuse the structural features of infrared and visible images, while maintaining the consistency of semantic structures between the fusion and source images. Owing to these two effective modules, our method is able to generate high-quality fusion images from pairs of infrared and visible images, which can boost the performance of downstream computer-vision tasks. Experimental results on three benchmarks demonstrate that our method outperforms eight state-of-the-art image fusion methods in terms of both qualitative and quantitative evaluations. The code for our method, along with additional comparison results, will be made available at: https://github.com/QiaoYang-CV/SSPFUSION.
ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation
methods: Proposes a knowledge distillation framework named ADU-Depth that transfers knowledge from a well-trained teacher network, which takes left-right image pairs as input, to a monocular student network to improve depth estimation accuracy. In the training stage, attention-adapted feature distillation and focal-depth-adapted response distillation are applied for effective knowledge transfer under varying prediction difficulty, and the uncertainty of depth estimation is explicitly modeled to guide distillation in both feature space and result space.
results: Extensive experiments on the real depth estimation datasets KITTI and DrivingStereo demonstrate the effectiveness of the method, which ranked 1st on the challenging KITTI online benchmark.Abstract
Monocular depth estimation is challenging due to its inherent ambiguity and ill-posed nature, yet it is quite important to many applications. While recent works achieve limited accuracy by designing increasingly complicated networks to extract features with limited spatial geometric cues from a single RGB image, we intend to introduce spatial cues by training a teacher network that leverages left-right image pairs as inputs and transferring the learned 3D geometry-aware knowledge to the monocular student network. Specifically, we present a novel knowledge distillation framework, named ADU-Depth, with the goal of leveraging the well-trained teacher network to guide the learning of the student network, thus boosting the precise depth estimation with the help of extra spatial scene information. To enable domain adaptation and ensure effective and smooth knowledge transfer from teacher to student, we apply both attention-adapted feature distillation and focal-depth-adapted response distillation in the training stage. In addition, we explicitly model the uncertainty of depth estimation to guide distillation in both feature space and result space to better produce 3D-aware knowledge from monocular observations and thus enhance the learning for hard-to-predict image regions. Our extensive experiments on the real depth estimation datasets KITTI and DrivingStereo demonstrate the effectiveness of the proposed method, which ranked 1st on the challenging KITTI online benchmark.
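One way to read the uncertainty-guided response distillation is as a heteroscedastic regression loss between student and teacher depths. The sketch below is an assumption about the general form, with the student predicting a per-pixel log-variance; it is not the paper's exact formulation.

```python
# A minimal sketch of uncertainty-weighted response distillation.
import torch

def distill_loss(student_depth, student_logvar, teacher_depth):
    # Down-weight pixels where predicted uncertainty is high; the additive
    # log-variance term prevents inflating uncertainty everywhere
    # (the standard heteroscedastic regression trick).
    precision = torch.exp(-student_logvar)
    per_pixel = precision * (student_depth - teacher_depth) ** 2 + student_logvar
    return per_pixel.mean()
```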
Volumetric Semantically Consistent 3D Panoptic Mapping
results: Achieves state-of-the-art accuracy on public large-scale datasets, improving on a number of widely used metrics, and highlights a shortcoming in the evaluation of recent studies: using the ground-truth trajectory rather than a SLAM-estimated one as input creates a large gap between reported results and actual performance on real-world data.Abstract
We introduce an online 2D-to-3D semantic instance mapping algorithm aimed at generating comprehensive, accurate, and efficient semantic 3D maps suitable for autonomous agents in unstructured environments. The proposed approach is based on a Voxel-TSDF representation used in recent algorithms. It introduces novel ways of integrating semantic prediction confidence during mapping, producing semantic and instance-consistent 3D regions. Further improvements are achieved by graph optimization-based semantic labeling and instance refinement. The proposed method achieves accuracy superior to the state of the art on public large-scale datasets, improving on a number of widely used metrics. We also highlight a shortcoming in the evaluation of recent studies: using the ground truth trajectory as input instead of a SLAM-estimated one substantially affects the accuracy, creating a large gap between the reported results and the actual performance on real-world data.
Explaining Deep Face Algorithms through Visualization: A Survey
paper_authors: Thrupthi Ann John, Vineeth N Balasubramanian, C. V. Jawahar
for: This study aims to bridge the gap between deep face models and human understanding by conducting a meta-analysis of explainability algorithms in the face domain.
methods: Adapts a range of general-purpose visualization algorithms to the face domain and computes visualizations on popular face models.
results: The study reveals insights into the structure and hierarchy of face networks and the design considerations for visualization algorithms; a user study further identifies which visualizations are practically useful and accessible to AI practitioners.Abstract
Although current deep models for face tasks surpass human performance on some benchmarks, we do not understand how they work. Thus, we cannot predict how they will react to novel inputs, resulting in catastrophic failures and unwanted biases in the algorithms. Explainable AI helps bridge the gap, but currently, there are very few visualization algorithms designed for faces. This work undertakes a first-of-its-kind meta-analysis of explainability algorithms in the face domain. We explore the nuances and caveats of adapting general-purpose visualization algorithms to the face domain, illustrated by computing visualizations on popular face models. We review existing face explainability works and reveal valuable insights into the structure and hierarchy of face networks. We also determine the design considerations for practical face visualizations accessible to AI practitioners by conducting a user study on the utility of various explainability algorithms.
Bootstrap Diffusion Model Curve Estimation for High Resolution Low-Light Image Enhancement
for: To propose a learning-based low-light image enhancement method that addresses two problems of existing approaches: the high computational cost on high-resolution images and unsatisfactory simultaneous enhancement and denoising.
methods: Uses a bootstrap diffusion model that learns the distribution of curve parameters instead of the normal-light image itself; curve estimation handles high-resolution images, with the curve parameters estimated by the bootstrap diffusion model, and a denoising module applied at each iteration of curve adjustment to denoise the intermediate enhanced result.
results: Extensive experiments on commonly used benchmark datasets show that BDCE achieves state-of-the-art qualitative and quantitative performance.Abstract
Learning-based methods have attracted a lot of research attention and led to significant improvements in low-light image enhancement. However, most of them still suffer from two main problems: expensive computational cost in high resolution images and unsatisfactory performance in simultaneous enhancement and denoising. To address these problems, we propose BDCE, a bootstrap diffusion model that exploits the learning of the distribution of the curve parameters instead of the normal-light image itself. Specifically, we adopt the curve estimation method to handle the high-resolution images, where the curve parameters are estimated by our bootstrap diffusion model. In addition, a denoise module is applied in each iteration of curve adjustment to denoise the intermediate enhanced result of each iteration. We evaluate BDCE on commonly used benchmark datasets, and extensive experiments show that it achieves state-of-the-art qualitative and quantitative performance.
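The curve-estimation idea can be sketched concretely. The quadratic curve below follows the common Zero-DCE-style formulation; the abstract does not spell out BDCE's exact curve family, so treat the form, the iteration count, and the `denoiser` module as assumptions.

```python
# A minimal sketch of iterative curve adjustment with per-iteration denoising.
import torch

def enhance(x, curve_params, denoiser, n_iter=8):
    """x: low-light image in [0, 1]; curve_params: list of per-pixel maps,
    here assumed to come from the bootstrap diffusion model."""
    for n in range(n_iter):
        c = curve_params[n]
        x = x + c * x * (1.0 - x)  # monotone pixel-wise brightening curve
        x = denoiser(x)            # denoise the intermediate enhanced result
    return x.clamp(0.0, 1.0)
```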
Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer
results: On the two widely used PVS-HM and Xu-Gaze datasets, MFTR outperforms state-of-the-art methods in average prediction accuracy and overlap ratio while offering competitive computational efficiency.Abstract
Viewport prediction is a crucial aspect of tile-based 360 video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the construction and fusion of information across different modality inputs, leading to an error accumulation problem. In this paper, we propose a tile classification based viewport prediction method with a Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR utilizes transformer-based networks to extract the long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of user historical inputs and video contents on future viewport selection. In addition, MFTR categorizes future tiles into two categories, user interested or not, and selects the future viewport as the region that contains the most user-interested tiles. Compared with predicting head trajectories, choosing the future viewport based on tiles' binary classification results exhibits better robustness and interpretability. To evaluate our proposed MFTR, we conduct extensive experiments on the two widely used PVS-HM and Xu-Gaze datasets. MFTR shows superior performance over state-of-the-art methods in terms of average prediction accuracy and overlap ratio, and also presents competitive computational efficiency.
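Selecting the viewport from tile-level binary predictions can be illustrated with a sliding window over the tile grid; grid and viewport sizes below are arbitrary, and the horizontal wrap reflects the 360-degree layout.

```python
# A minimal sketch: pick the viewport position covering the most tiles
# predicted as "user interested".
import numpy as np

def select_viewport(tile_probs: np.ndarray, vp_h: int = 4, vp_w: int = 6,
                    tau: float = 0.5):
    """tile_probs: (H, W) per-tile probabilities of user interest."""
    interested = (tile_probs > tau).astype(int)
    H, W = interested.shape
    best, best_pos = -1, (0, 0)
    for i in range(H - vp_h + 1):
        for j in range(W):  # wrap around horizontally (360-degree video)
            cols = [(j + k) % W for k in range(vp_w)]
            count = interested[i:i + vp_h, cols].sum()
            if count > best:
                best, best_pos = count, (i, j)
    return best_pos  # top-left tile of the predicted viewport
```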
Structure Invariant Transformation for better Adversarial Transferability
for: This paper aims to improve the effectiveness of black-box adversarial attacks on deep neural networks (DNNs) by proposing a novel input transformation-based attack called Structure Invariant Attack (SIA).
methods: The SIA attack applies random image transformations to each image block to create a diverse set of images for gradient calculation, improving the transferability of the attack compared to existing methods.
results: The proposed SIA attack exhibits better transferability than existing state-of-the-art (SOTA) input transformation-based attacks on both CNN-based and transformer-based models, as demonstrated through extensive experiments on the ImageNet dataset.Abstract
Given the severe vulnerability of Deep Neural Networks (DNNs) against adversarial examples, there is an urgent need for an effective adversarial attack to identify the deficiencies of DNNs in security-sensitive applications. As one of the prevalent black-box adversarial attacks, the existing transfer-based attacks still cannot achieve comparable performance with the white-box attacks. Among these, input transformation based attacks have shown remarkable effectiveness in boosting transferability. In this work, we find that the existing input transformation based attacks transform the input image globally, resulting in limited diversity of the transformed images. We postulate that the more diverse transformed images result in better transferability. Thus, we investigate how to locally apply various transformations onto the input image to improve such diversity while preserving the structure of image. To this end, we propose a novel input transformation based attack, called Structure Invariant Attack (SIA), which applies a random image transformation onto each image block to craft a set of diverse images for gradient calculation. Extensive experiments on the standard ImageNet dataset demonstrate that SIA exhibits much better transferability than the existing SOTA input transformation based attacks on CNN-based and transformer-based models, showing its generality and superiority in boosting transferability. Code is available at https://github.com/xiaosen-wang/SIT.
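The core structure-invariant idea, independently transforming each image block before the gradient computation, can be sketched as follows; the particular transform set is illustrative, not SIA's exact list.

```python
# A minimal sketch of block-wise random transformation for transfer attacks.
import random
import torch

def block_transform(x: torch.Tensor, grid: int = 3) -> torch.Tensor:
    """x: (B, C, H, W) with H, W divisible by grid."""
    _, _, H, W = x.shape
    hs, ws = H // grid, W // grid
    ops = [
        lambda b: b.flip(-1),                        # horizontal flip
        lambda b: torch.rot90(b, 2, dims=(-2, -1)),  # 180-degree rotation
        lambda b: b * random.uniform(0.7, 1.3),      # intensity scaling
        lambda b: b + 0.05 * torch.randn_like(b),    # additive noise
        lambda b: b,                                 # identity
    ]
    out = x.clone()
    for i in range(grid):
        for j in range(grid):
            blk = out[..., i*hs:(i+1)*hs, j*ws:(j+1)*ws]
            out[..., i*hs:(i+1)*hs, j*ws:(j+1)*ws] = random.choice(ops)(blk)
    return out

# In the attack loop, gradients are averaged over several transformed copies:
# grad ~ mean_k of d loss(model(block_transform(x_adv)), y) / d x_adv
```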
DriveSceneGen: Generating Diverse and Realistic Driving Scenarios from Scratch
paper_authors: Shuo Sun, Zekai Gu, Tianchen Sun, Jiawei Sun, Chengran Yuan, Yuhang Han, Dongen Li, Marcelo H. Ang Jr
for: This paper is written for the development and validation of autonomous driving systems, specifically to address the lack of diverse and realistic traffic scenarios in large quantities.
methods: The paper introduces DriveSceneGen, a data-driven driving scenario generation method that learns from real-world driving datasets and generates entire dynamic driving scenarios from scratch.
results: The experimental results on 5,000 generated scenarios show that DriveSceneGen can generate novel driving scenarios with high fidelity and diversity, and is able to generate scenarios that align with real-world data distributions.Here’s the same information in Simplified Chinese text:
Realistic and diverse traffic scenarios in large quantities are crucial for the development and validation of autonomous driving systems. However, owing to numerous difficulties in the data collection process and the reliance on intensive annotations, real-world datasets lack sufficient quantity and diversity to support the increasing demand for data. This work introduces DriveSceneGen, a data-driven driving scenario generation method that learns from the real-world driving dataset and generates entire dynamic driving scenarios from scratch. DriveSceneGen is able to generate novel driving scenarios that align with real-world data distributions with high fidelity and diversity. Experimental results on 5k generated scenarios highlight the generation quality, diversity, and scalability compared to real-world datasets. To the best of our knowledge, DriveSceneGen is the first method that generates novel driving scenarios involving both static map elements and dynamic traffic participants from scratch.
DONNAv2 – Lightweight Neural Architecture Search for Vision tasks
results: DONNAv2 reduces the computational cost of DONNA by 10x for the larger datasets, validated with hardware-in-the-loop experiments on the Samsung Galaxy S10 mobile platform; it also leverages a block knowledge distillation filter to remove blocks with high inference cost.Abstract
With the growing demand for vision applications and deployment across edge devices, the development of hardware-friendly architectures that maintain performance during device deployment becomes crucial. Neural architecture search (NAS) techniques explore various approaches to discover efficient architectures for diverse learning tasks in a computationally efficient manner. In this paper, we present the next-generation neural architecture design for computationally efficient neural architecture distillation, DONNAv2. Conventional NAS algorithms rely on a computationally extensive stage where an accuracy predictor is learned to estimate model performance within the search space. This building of accuracy predictors helps them predict the performance of models that are not being finetuned. Here, we have developed an elegant approach to eliminate building the accuracy predictor and extend DONNA to a computationally efficient setting. The loss metric of individual blocks forming the network serves as the surrogate performance measure for the sampled models in the NAS search stage. To validate the performance of DONNAv2 we have performed extensive experiments involving a range of diverse vision tasks including classification, object detection, image denoising, super-resolution, and panoptic perception network (YOLOP). The hardware-in-the-loop experiments were carried out using the Samsung Galaxy S10 mobile platform. Notably, DONNAv2 reduces the computational cost of DONNA by 10x for the larger datasets. Furthermore, to improve the quality of NAS search space, DONNAv2 leverages a block knowledge distillation filter to remove blocks with high inference costs.
ZiCo-BC: A Bias Corrected Zero-Shot NAS for Vision Tasks
results: Extensive experiments across multiple vision tasks (image classification, object detection, and semantic segmentation) show that the approach successfully searches for architectures with higher accuracy and significantly lower latency on Samsung Galaxy S10 devices.Abstract
Zero-Shot Neural Architecture Search (NAS) approaches propose novel training-free metrics called zero-shot proxies to substantially reduce the search time compared to the traditional training-based NAS. Despite the success on image classification, the effectiveness of zero-shot proxies is rarely evaluated on complex vision tasks such as semantic segmentation and object detection. Moreover, existing zero-shot proxies are shown to be biased towards certain model characteristics which restricts their broad applicability. In this paper, we empirically study the bias of state-of-the-art (SOTA) zero-shot proxy ZiCo across multiple vision tasks and observe that ZiCo is biased towards thinner and deeper networks, leading to sub-optimal architectures. To solve the problem, we propose a novel bias correction on ZiCo, called ZiCo-BC. Our extensive experiments across various vision tasks (image classification, object detection and semantic segmentation) show that our approach can successfully search for architectures with higher accuracy and significantly lower latency on Samsung Galaxy S10 devices.
Probabilistic 3D Multi-Object Cooperative Tracking for Autonomous Driving via Differentiable Multi-Sensor Kalman Filter
paper_authors: Hsu-kuang Chiu, Chien-Yi Wang, Min-Hung Chen, Stephen F. Smith
for: This paper aims to improve the reliability and accuracy of multi-object cooperative tracking in autonomous driving by leveraging vehicle-to-vehicle (V2V) communication and a differentiable multi-sensor Kalman Filter.
methods: The proposed method uses a differentiable multi-sensor Kalman Filter to estimate the measurement uncertainty of each detection from different connected autonomous vehicles (CAVs), which enables better utilization of the theoretical optimality property of Kalman Filter-based tracking algorithms.
results: The experimental results show that the proposed algorithm improves the tracking accuracy by 17% with only 0.037x communication costs compared to the state-of-the-art method in V2V4Real.Abstract
Current state-of-the-art autonomous driving vehicles mainly rely on each individual sensor system to perform perception tasks. Such a framework's reliability could be limited by occlusion or sensor failure. To address this issue, more recent research proposes using vehicle-to-vehicle (V2V) communication to share perception information with others. However, most relevant works focus only on cooperative detection and leave cooperative tracking an underexplored research field. A few recent datasets, such as V2V4Real, provide 3D multi-object cooperative tracking benchmarks. However, their proposed methods mainly use cooperative detection results as input to a standard single-sensor Kalman Filter-based tracking algorithm. In their approach, the measurement uncertainty of different sensors from different connected autonomous vehicles (CAVs) may not be properly estimated to utilize the theoretical optimality property of Kalman Filter-based tracking algorithms. In this paper, we propose a novel 3D multi-object cooperative tracking algorithm for autonomous driving via a differentiable multi-sensor Kalman Filter. Our algorithm learns to estimate measurement uncertainty for each detection that can better utilize the theoretical property of Kalman Filter-based tracking methods. The experiment results show that our algorithm improves the tracking accuracy by 17% with only 0.037x communication costs compared with the state-of-the-art method in V2V4Real.
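The differentiable multi-sensor Kalman filter can be sketched as a standard update in which the measurement noise covariance R is predicted per detection by a small network, so gradients flow through the filter during training. The shapes, the diagonal R, and `r_net` are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of a Kalman update with a learned measurement covariance.
import torch

def kf_update(x, P, z, H, r_net, feat):
    """x: (n,) state, P: (n, n) covariance, z: (m,) measurement,
    H: (m, n) observation matrix, feat: per-detection features
    (e.g., detector confidence, sensor/CAV id, distance)."""
    R = torch.diag(torch.exp(r_net(feat)))  # learned, positive diagonal
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ torch.linalg.inv(S)       # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (torch.eye(P.shape[0]) - K @ H) @ P
    return x_new, P_new
```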
Free Discontinuity Design: With an Application to the Economic Effects of Internet Shutdowns
results: Using this method, the authors estimate that an internet shutdown in India reduced economic activity by over 50%, greatly surpassing previous estimates and shedding new light on the true cost of such shutdowns for digital economies globally.Abstract
Thresholds in treatment assignments can produce discontinuities in outcomes, revealing causal insights. In many contexts, like geographic settings, these thresholds are unknown and multivariate. We propose a non-parametric method to estimate the resulting discontinuities by segmenting the regression surface into smooth and discontinuous parts. This estimator uses a convex relaxation of the Mumford-Shah functional, for which we establish identification and convergence. Using our method, we estimate that an internet shutdown in India resulted in a reduction of economic activity by over 50%, greatly surpassing previous estimates and shedding new light on the true cost of such shutdowns for digital economies globally.
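For reference, the classical (non-convex) Mumford-Shah energy that such estimators relax can be written as below, where y is the observed outcome surface over the domain Ω, u its smooth approximation, and K the discontinuity set; the paper's convex relaxation is a different object, so this is background rather than the authors' estimator.

```latex
E(u, K) = \int_{\Omega \setminus K} \big(u(x) - y(x)\big)^2 \, dx
        + \alpha \int_{\Omega \setminus K} \|\nabla u(x)\|^2 \, dx
        + \beta \, \mathcal{H}^{d-1}(K)
```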
Text-to-Image Generation for Abstract Concepts
paper_authors: Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, Dongmei Zhang
for: This work aims to improve the ability of large-scale models to express abstract concepts, complementing their success with concrete concepts in domains such as natural language processing and computer vision.
methods: Based on the three-layer artwork theory, the abstract concept is clarified into an explicit intent, transformed by an LLM into semantically related physical objects, and integrated with a concept-dependent form retrieved from an LLM-extracted form pattern set.
results: Human assessments and a newly designed concept-score metric show that the framework creates images that sufficiently express abstract concepts.Abstract
Recent years have witnessed the substantial progress of large-scale models across various domains, such as natural language processing and computer vision, facilitating the expression of concrete concepts. Unlike concrete concepts that are usually directly associated with physical objects, expressing abstract concepts through natural language requires considerable effort, which results from their intricate semantics and connotations. An alternative approach is to leverage images to convey rich visual information as a supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily trained on concrete physical objects and tend to fail to visualize abstract concepts. Inspired by the three-layer artwork theory that identifies critical factors, intent, object and form during artistic creation, we propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. LLMs then transform it into semantic-related physical objects, and the concept-dependent form is retrieved from an LLM-extracted form pattern set. Information from these three aspects will be integrated to generate prompts for T2I models via LLM. Evaluation results from human assessments and our newly designed metric concept score demonstrate the effectiveness of our framework in creating images that can sufficiently express abstract concepts.
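The three-stage intent/object/form pipeline can be sketched with a generic chat-LLM helper. `ask_llm` and the prompts are hypothetical paraphrases of the abstract, not the paper's actual prompts.

```python
# A minimal sketch of the TIAC-style prompt construction pipeline.
def abstract_concept_to_prompt(concept: str, ask_llm) -> str:
    # 1) Intent: clarify the abstract concept into an unambiguous definition.
    intent = ask_llm(f"Give a clear, detailed definition of the intent behind "
                     f"the abstract concept '{concept}'.")
    # 2) Object: ground the intent in semantically related physical objects.
    objects = ask_llm(f"List physical objects that visually convey: {intent}")
    # 3) Form: retrieve a concept-dependent visual form/style.
    form = ask_llm(f"Suggest a composition and style suited to expressing "
                   f"'{concept}'.")
    # Integrate all three aspects into a single text-to-image prompt.
    return f"{objects}, {form}, expressing {intent}"
```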
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space
for: To improve the accuracy and efficiency of monocular 3D semantic scene completion (SSC) by addressing the feature ambiguity, pose ambiguity, and computation imbalance issues of current state-of-the-art methods.
methods: Proposes a Normalized Device Coordinates scene completion network (NDC-Scene) that extends the 2D feature map into a normalized device coordinates (NDC) space rather than directly into world space, progressively restoring the depth dimension with deconvolution operations; a depth-adaptive dual decoder simultaneously upsamples and fuses the 2D and 3D feature maps.
results: Experiments show that the method consistently outperforms state-of-the-art approaches on both the outdoor SemanticKITTI and indoor NYUv2 datasets. Code is available at https://github.com/Jiawei-Yao0812/NDCScene.Abstract
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs. In this paper, we identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Computation Imbalance in the 3D convolution across different depth levels. To address these problems, we devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to a Normalized Device Coordinates (NDC) space, rather than to the world space directly, through progressive restoration of the dimension of depth with deconvolution operations. Experiment results demonstrate that transferring the majority of computation from the target 3D space to the proposed normalized device coordinates space benefits monocular SSC tasks. Additionally, we design a Depth-Adaptive Dual Decoder to simultaneously upsample and fuse the 2D and 3D feature maps, further improving overall performance. Our extensive experiments confirm that the proposed method consistently outperforms state-of-the-art methods on both outdoor SemanticKITTI and indoor NYUv2 datasets. Our code are available at https://github.com/Jiawei-Yao0812/NDCScene.
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
paper_authors: Xiao Wang, Shiao Wang, Chuanming Tang, Lin Zhu, Bo Jiang, Yonghong Tian, Jin Tang
for: To propose a high-speed, low-latency visual tracking method based on bio-inspired event cameras that uses multi-modal / multi-view information for knowledge transfer during training, so that high-performance, low-latency tracking can be achieved at test time using only event signals.
methods: A teacher Transformer network is trained by feeding RGB frames and event streams simultaneously, and a hierarchical knowledge distillation strategy, covering pairwise similarity, feature representation, and response-map based distillation, guides the learning of the student Transformer network.
results: Extensive experiments on the low-resolution datasets (FE240hz, VisEvent, COESOT) and a newly proposed large-scale high-resolution dataset, EventVOT, containing 1141 videos covering pedestrians, vehicles, UAVs, and other categories, show that the method achieves high-performance tracking at high speed and low latency.Abstract
Tracking using bio-inspired event cameras has drawn more and more attention in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The first category needs more cost for inference and the second one may be easily influenced by noisy events or sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer, enabling us to achieve high-speed and low-latency visual tracking during testing by using only event signals. Specifically, a teacher Transformer-based multi-modal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy which includes pairwise similarity, feature representation, and response maps-based knowledge distillation to guide the learning of the student Transformer network. Moreover, since existing event-based tracking datasets are all low-resolution ($346 \times 260$), we propose the first large-scale high-resolution ($1280 \times 720$) dataset named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, ping pongs, etc. Extensive experiments on both low-resolution (FE240hz, VisEvent, COESOT), and our newly proposed high-resolution EventVOT dataset fully validated the effectiveness of our proposed method. The dataset, evaluation toolkit, and source code are available on \url{https://github.com/Event-AHU/EventVOT_Benchmark}
Progressive Text-to-3D Generation for Automatic 3D Prototyping
results: Experiments show that the proposed method performs favorably against existing approaches across a variety of prompts, including the most challenging descriptions for which existing methods struggle to produce viable shapes; it consistently generates high-quality 3D models while reducing the amount of descriptive detail users must provide.Abstract
Text-to-3D generation crafts a 3D object according to a natural language description. This can significantly reduce the workload of manually designing 3D models and provide a more natural way for users to interact. However, recovering fine-grained details effectively and optimizing a large-size 3D output efficiently remain challenging. Inspired by the success of progressive learning, we propose a Multi-Scale Triplane Network (MTN) and a new progressive learning strategy. As the name implies, the Multi-Scale Triplane Network consists of four triplanes transitioning from low to high resolution. The low-resolution triplane could serve as an initial shape for the high-resolution ones, easing the optimization difficulty. To further enable the fine-grained details, we also introduce the progressive learning strategy, which explicitly demands the network to shift its focus of attention from simple coarse-grained patterns to difficult fine-grained patterns. Our experiment verifies that the proposed method performs favorably against existing methods. For even the most challenging descriptions, where most existing methods struggle to produce a viable shape, our proposed method consistently delivers. We aspire for our work to pave the way for automatic 3D prototyping via natural language descriptions.
Applications of Sequential Learning for Medical Image Classification
results: In the first experiment, both approaches reach the ~95% accuracy threshold, although the short pre-training step lets sequential accuracy plateau in fewer steps. In the second experiment, the second of the two methods performs better, crossing the ~90% accuracy threshold much sooner. In the third experiment, a pre-training step lets the network cross the ~60% accuracy threshold much sooner than without pre-training.Abstract
Purpose: The aim of this work is to develop a neural network training framework for continual training of small amounts of medical imaging data and create heuristics to assess training in the absence of a hold-out validation or test set. Materials and Methods: We formulated a retrospective sequential learning approach that would train and consistently update a model on mini-batches of medical images over time. We address problems that impede sequential learning such as overfitting, catastrophic forgetting, and concept drift through PyTorch convolutional neural networks (CNN) and publicly available Medical MNIST and NIH Chest X-Ray imaging datasets. We begin by comparing two methods for a sequentially trained CNN with and without base pre-training. We then transition to two methods of unique training and validation data recruitment to estimate full information extraction without overfitting. Lastly, we consider an example of real-life data that shows how our approach would see mainstream research implementation. Results: For the first experiment, both approaches successfully reach a ~95% accuracy threshold, although the short pre-training step enables sequential accuracy to plateau in fewer steps. The second experiment comparing two methods showed better performance with the second method which crosses the ~90% accuracy threshold much sooner. The final experiment showed a slight advantage with a pre-training step that allows the CNN to cross ~60% threshold much sooner than without pre-training. Conclusion: We have displayed sequential learning as a serviceable multi-classification technique statistically comparable to traditional CNNs that can acquire data in small increments feasible for clinically realistic scenarios.
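The sequential setup described above amounts to updating a model on mini-batches in arrival order, with an accuracy heuristic standing in for a hold-out set. The sketch below uses a prequential (test-then-train) heuristic as one plausible choice; the paper develops its own heuristics.

```python
# A minimal PyTorch sketch of sequential training on streaming mini-batches.
import torch
import torch.nn as nn

def sequential_train(model: nn.Module, batch_stream, lr: float = 1e-4):
    """batch_stream yields (images, labels) in arrival order over time."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    history = []
    for images, labels in batch_stream:
        # Test-then-train: score each incoming batch before updating on it,
        # a simple stand-in for a missing validation set (an assumption).
        model.eval()
        with torch.no_grad():
            acc = (model(images).argmax(1) == labels).float().mean().item()
        history.append(acc)
        model.train()
        opt.zero_grad()
        loss_fn(model(images), labels).backward()
        opt.step()
    return history  # plateauing accuracy signals sufficient training
```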
DifAttack: Query-Efficient Black-Box Attack via Disentangled Feature Space
paper_authors: Liu Jun, Zhou Jiantao, Zeng Jiandian, Jinyu Tian
for: This paper aims to propose an efficient score-based black-box adversarial attack method with a high Attack Success Rate (ASR) and good generalizability.
methods: The proposed method, called DifAttack, uses a disentangled feature space to differentiate an image’s latent feature into an adversarial feature and a visual feature. The method trains an autoencoder for the disentanglement using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods.
results: The proposed method achieves significant improvements in ASR and query efficiency simultaneously, especially in the targeted attack and open-set scenarios. The code will be available at https://github.com/csjunjun/DifAttack.git soon.Abstract
This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (ASR) and good generalizability. We design a novel attack method based on a Disentangled Feature space, called DifAttack, which differs significantly from the existing ones operating over the entire feature space. Specifically, DifAttack firstly disentangles an image's latent feature into an adversarial feature and a visual feature, where the former dominates the adversarial capability of an image, while the latter largely determines its visual appearance. We train an autoencoder for the disentanglement by using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods. Eventually, DifAttack iteratively optimizes the adversarial feature according to the query feedback from the victim model until a successful AE is generated, while keeping the visual feature unaltered. In addition, due to the avoidance of using surrogate models' gradient information when optimizing AEs for black-box models, our proposed DifAttack inherently possesses better attack capability in the open-set scenario, where the training dataset of the victim model is unknown. Extensive experimental results demonstrate that our method achieves significant improvements in ASR and query efficiency simultaneously, especially in the targeted attack and open-set scenarios. The code will be available at https://github.com/csjunjun/DifAttack.git soon.
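The attack loop can be sketched as black-box optimization over the adversarial feature only, with the visual feature frozen. The random-search update and the encoder/decoder interface are stand-ins for the trained autoencoder and DifAttack's actual update rule.

```python
# A minimal sketch of a disentangled-feature black-box attack loop.
import torch

@torch.no_grad()
def attack(x, encode, decode, query_score, steps=500, sigma=0.1):
    """query_score: victim-model feedback, assumed lower = more adversarial."""
    f_adv, f_vis = encode(x)               # disentangled latent features
    best = query_score(decode(f_adv, f_vis))
    for _ in range(steps):
        cand = f_adv + sigma * torch.randn_like(f_adv)  # perturb adv. feature
        img = decode(cand, f_vis).clamp(0, 1)           # visual feature fixed
        s = query_score(img)                            # one victim query
        if s < best:
            f_adv, best = cand, s
    return decode(f_adv, f_vis).clamp(0, 1)
```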
results: Achieves performance comparable to XLS-R on ML-SUPERB with less than 10% of the training data, making this scale of SSL feasible with academic compute; a vanilla HuBERT Base model can further maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials.Abstract
Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more research groups. We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages. To build WavLabLM, we devise a novel multi-stage pre-training method, designed to address the language imbalance of multilingual data. WavLabLM achieves comparable performance to XLS-R on ML-SUPERB with less than 10% of the training data, making SSL realizable with academic compute. We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials. We open-source all code and models in ESPnet.
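One standard remedy for the language imbalance mentioned above is temperature-based resampling of languages during pre-training; it is offered here as an assumption about the kind of balancing a multi-stage scheme addresses, not as WavLabLM's exact method.

```python
# A minimal sketch of temperature-based language resampling.
import numpy as np

def language_sampling_probs(hours_per_lang: dict, alpha: float = 0.5) -> dict:
    """alpha = 1 reproduces the natural distribution; alpha < 1 upweights
    low-resource languages."""
    langs = list(hours_per_lang)
    p = np.array([hours_per_lang[l] for l in langs], dtype=float)
    p = (p / p.sum()) ** alpha
    return dict(zip(langs, p / p.sum()))

# e.g. language_sampling_probs({"en": 30000, "sw": 50, "yo": 20}, alpha=0.5)
```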
MAPTree: Beating “Optimal” Decision Trees with Bayesian Decision Trees
results: On 16 real-world datasets, MAPTree either outperforms baselines or demonstrates comparable performance with much smaller trees; on a synthetic dataset it demonstrates greater robustness to noise and better generalization than existing approaches. MAPTree also recovers the maximum a posteriori tree faster than existing sampling approaches and, unlike those algorithms, provides a certificate of optimality.Abstract
Decision trees remain one of the most popular machine learning models today, largely due to their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees. We first demonstrate a connection between maximum a posteriori inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree. Lastly, we demonstrate the empirical performance of the maximum a posteriori tree both on synthetic data and in real world settings. On 16 real world datasets, MAPTree either outperforms baselines or demonstrates comparable performance but with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the maximum a posteriori tree faster than existing sampling approaches and, in contrast with those algorithms, is able to provide a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree.
The Importance of Multimodal Emotion Conditioning and Affect Consistency for Embodied Conversational Agents
paper_authors: Che-Jui Chang, Samuel S. Sohn, Sen Zhang, Rajath Jayashankar, Muhammad Usman, Mubbasir Kapadia
for: To improve the perception of emotion conveyed by embodied conversational agents to humans.
methods: Proposes a multimodal behavior generation framework, ACTOR (Affect-Consistent mulTimodal behaviOR generation), that generates behaviors conditioned on a consistent driving affect so that emotion is conveyed consistently across modalities.
results: A user study with 199 participants found that the affect-consistent framework receives the highest ratings for the perception of driving affects, and that making any single modality affect-inconsistent significantly decreases the perceived affect.Abstract
Previous studies regarding the perception of emotions for embodied virtual agents have shown the effectiveness of using virtual characters in conveying emotions through interactions with humans. However, creating an autonomous embodied conversational agent with expressive behaviors presents two major challenges. The first challenge is the difficulty of synthesizing the conversational behaviors for each modality that are as expressive as real human behaviors. The second challenge is that the affects are modeled independently, which makes it difficult to generate multimodal responses with consistent emotions across all modalities. In this work, we propose a conceptual framework, ACTOR (Affect-Consistent mulTimodal behaviOR generation), that aims to increase the perception of affects by generating multimodal behaviors conditioned on a consistent driving affect. We have conducted a user study with 199 participants to assess how the average person judges the affects perceived from multimodal behaviors that are consistent and inconsistent with respect to a driving affect. The result shows that among all model conditions, our affect-consistent framework receives the highest Likert scores for the perception of driving affects. Our statistical analysis suggests that making a modality affect-inconsistent significantly decreases the perception of driving affects. We also observe that multimodal behaviors conditioned on consistent affects are more expressive compared to behaviors with inconsistent affects. Therefore, we conclude that multimodal emotion conditioning and affect consistency are vital to enhancing the perception of affects for embodied conversational agents.
Ruffle&Riley: Towards the Automated Induction of Conversational Tutoring Systems
results: In an online user study, compared with simpler QA chatbots and a reading activity, Ruffle&Riley users expressed higher ratings of understanding and remembering and perceived the offered support as more helpful and the conversation as more coherent.Abstract
Conversational tutoring systems (CTSs) offer learning experiences driven by natural language interaction. They are known to promote high levels of cognitive engagement and benefit learning outcomes, particularly in reasoning tasks. Nonetheless, the time and cost required to author CTS content is a major obstacle to widespread adoption. In this paper, we introduce a novel type of CTS that leverages the recent advances in large language models (LLMs) in two ways: First, the system induces a tutoring script automatically from a lesson text. Second, the system automates the script orchestration via two LLM-based agents (Ruffle&Riley) with the roles of a student and a professor in a learning-by-teaching format. The system allows a free-form conversation that follows the ITS-typical outer-/inner-loop structure. In an initial between-subject online user study (N = 100) comparing Ruffle&Riley to simpler QA chatbots and reading activity, we found no significant differences in post-test scores. Nonetheless, in the learning experience survey, Ruffle&Riley users expressed higher ratings of understanding and remembering and further perceived the offered support as more helpful and the conversation as coherent. Our study provides insights for a new generation of scalable CTS technologies.
STERLING: Self-Supervised Terrain Representation Learning from Unconstrained Robot Experience
results: Physical robot experiments show that STERLING features perform on par with fully supervised approaches on preference-aligned visual navigation and outperform other state-of-the-art methods with respect to preference alignment; the robot also autonomously completed a 3-mile hike on a natural trail with only two manual interventions, demonstrating robustness to real-world off-road conditions.Abstract
Terrain awareness, i.e., the ability to identify and distinguish different types of terrain, is a critical ability that robots must have to succeed at autonomous off-road navigation. Current approaches that provide robots with this awareness either rely on labeled data which is expensive to collect, engineered features and cost functions that may not generalize, or expert human demonstrations which may not be available. Towards endowing robots with terrain awareness without these limitations, we introduce Self-supervised TErrain Representation LearnING (STERLING), a novel approach for learning terrain representations that relies solely on easy-to-collect, unconstrained (e.g., non-expert), and unlabelled robot experience, with no additional constraints on data collection. STERLING employs a novel multi-modal self-supervision objective through non-contrastive representation learning to learn relevant terrain representations for terrain-aware navigation. Through physical robot experiments in off-road environments, we evaluate STERLING features on the task of preference-aligned visual navigation and find that STERLING features perform on par with fully supervised approaches and outperform other state-of-the-art methods with respect to preference alignment. Additionally, we perform a large-scale experiment of autonomously hiking a 3-mile long trail which STERLING completes successfully with only two manual interventions, demonstrating its robustness to real-world off-road conditions.
Maximum Diffusion Reinforcement Learning
results: Proves that the approach generalizes well-known maximum entropy techniques and shows that it robustly exceeds state-of-the-art performance across popular benchmarks.Abstract
The assumption that data are independent and identically distributed underpins all machine learning. When data are collected sequentially from agent experiences this assumption does not generally hold, as in reinforcement learning. Here, we derive a method that overcomes these limitations by exploiting the statistical mechanics of ergodic processes, which we term maximum diffusion reinforcement learning. By decorrelating agent experiences, our approach provably enables agents to learn continually in single-shot deployments regardless of how they are initialized. Moreover, we prove our approach generalizes well-known maximum entropy techniques, and show that it robustly exceeds state-of-the-art performance across popular benchmarks. Our results at the nexus of physics, learning, and control pave the way towards more transparent and reliable decision-making in reinforcement learning agents, such as locomoting robots and self-driving cars.
Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models
results: Across extensive simulation and real-world experiments, the proposed approaches perform well under varying numbers of objects and distractor actions, and outperform an implicit memory baseline.Abstract
Robots need to have a memory of previously observed, but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks including reasoning with occluded objects, novel objects appearance, and object reappearance. Throughout our extensive simulation and real-world experiments, we find that our approaches perform well in terms of different numbers of objects and different numbers of distractor actions. Furthermore, we show our approaches outperform an implicit memory baseline.
Efficient Low-rank Backpropagation for Vision Transformer Adaptation
results: Extensive experiments with different models (ViT and hybrid convolution-ViT models) on multiple datasets demonstrate the effectiveness of LBP-WHT. For example, when adapting an EfficientFormer-L1 model on CIFAR100, LBP-WHT achieves 10.4% higher accuracy than the state-of-the-art baseline while requiring 9 MFLOPs less computation.Abstract
The increasing scale of vision transformers (ViT) has made the efficient fine-tuning of these large models for specific needs a significant challenge in various applications. This issue originates from the computationally demanding matrix multiplications required during the backpropagation process through linear layers in ViT. In this paper, we tackle this problem by proposing a new Low-rank BackPropagation via Walsh-Hadamard Transformation (LBP-WHT) method. Intuitively, LBP-WHT projects the gradient into a low-rank space and carries out backpropagation. This approach substantially reduces the computation needed for adapting ViT, as matrix multiplication in the low-rank space is far less resource-intensive. We conduct extensive experiments with different models (ViT, hybrid convolution-ViT model) on multiple datasets to demonstrate the effectiveness of our method. For instance, when adapting an EfficientFormer-L1 model on CIFAR100, our LBP-WHT achieves 10.4% higher accuracy than the state-of-the-art baseline, while requiring 9 MFLOPs less computation. As the first work to accelerate ViT adaptation with low-rank backpropagation, our LBP-WHT method is complementary to many prior efforts and can be combined with them for better performance.
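To make the core idea concrete, here is a minimal numpy sketch of low-rank gradient projection with a Walsh-Hadamard basis. It replaces the paper's fast transform and basis-selection heuristics with an explicit Hadamard matrix and its first k rows, so it illustrates the mechanism rather than reproducing LBP-WHT itself; all names and shapes are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

def lowrank_wht_weight_grad(x, grad_out, k):
    """Approximate the weight gradient of a linear layer y = x @ W.

    The exact gradient is x.T @ grad_out. Here grad_out is projected
    onto the first k rows of an orthonormal Walsh-Hadamard basis, the
    gradient is formed in that rank-k space, and then mapped back;
    k trades accuracy for compute.
    """
    d_out = grad_out.shape[1]                 # must be a power of 2 here
    H = hadamard(d_out) / np.sqrt(d_out)      # orthonormal WHT basis
    Hk = H[:k]                                # keep k basis vectors
    g_low = grad_out @ Hk.T                   # (N, k) projected gradients
    return (x.T @ g_low) @ Hk                 # (d_in, d_out) approximation

rng = np.random.default_rng(0)
x, g = rng.normal(size=(32, 64)), rng.normal(size=(32, 128))
rel_err = np.linalg.norm(lowrank_wht_weight_grad(x, g, k=16) - x.T @ g)
print("relative error:", rel_err / np.linalg.norm(x.T @ g))
```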
Memory-Efficient Continual Learning Object Segmentation for Long Video
results: Experimental results show that the proposed methods improve the performance of online VOS models by up to 10% and boost their robustness on long-video datasets, while maintaining comparable performance on the short-video datasets DAVIS16 and DAVIS17.Abstract
Recent state-of-the-art semi-supervised Video Object Segmentation (VOS) methods have shown significant improvements in target object segmentation accuracy when information from preceding frames is used in undertaking segmentation on the current frame. In particular, such memory-based approaches can help a model to more effectively handle appearance changes (representation drift) or occlusions. Ideally, for maximum performance, online VOS methods would need all or most of the preceding frames (or their extracted information) to be stored in memory and be used for online learning in consecutive frames. Such a solution is not feasible for long videos, as the required memory size would grow without bound. On the other hand, these methods can fail when memory is limited and a target object experiences repeated representation drifts throughout a video. We propose two novel techniques to reduce the memory requirement of online VOS methods while improving modeling accuracy and generalization on long videos. Motivated by the success of continual learning techniques in preserving previously-learned knowledge, here we propose Gated-Regularizer Continual Learning (GRCL), which improves the performance of any online VOS subject to limited memory, and a Reconstruction-based Memory Selection Continual Learning (RMSCL) which empowers online VOS methods to efficiently benefit from stored information in memory. Experimental results show that the proposed methods improve the performance of online VOS models up to 10 %, and boosts their robustness on long-video datasets while maintaining comparable performance on short-video datasets DAVIS16 and DAVIS17.
STARC: A General Framework For Quantifying Differences Between Reward Functions
paper_authors: Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
for: The paper aims to provide a solution to the problem of deriving theoretical guarantees for reward learning algorithms, which is difficult due to the lack of good methods for quantifying the difference between reward functions.
methods: The paper proposes a class of pseudometrics called STARC (STAndardised Reward Comparison) metrics, which can be used to quantify the difference between reward functions and provide both upper and lower bounds on worst-case regret.
results: The paper shows that STARC metrics are tight and bilipschitz equivalent, and identifies issues with reward metrics proposed by earlier works. The paper also demonstrates the empirical efficacy of STARC metrics through practical evaluation.Abstract
In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm with high probability will learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to predict in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms both easier and more principled.
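As a hedged illustration of the standardize-then-compare recipe the abstract describes, the toy pseudometric below mean-centers each reward vector (a stand-in for the paper's canonicalization, which removes potential shaping), normalizes to unit norm, and takes the distance. It conveys the shape of a STARC-style metric, not its exact definition.

```python
import numpy as np

def starc_style_distance(r1, r2):
    """Toy pseudometric in the STARC spirit: standardize both reward
    vectors (over a fixed set of transitions) and compare them. The
    mean-centering step is only a stand-in canonicalization."""
    def standardize(r):
        r = np.asarray(r, dtype=float) - np.mean(r)
        norm = np.linalg.norm(r)
        return r / norm if norm > 0 else r
    return np.linalg.norm(standardize(r1) - standardize(r2))

# Rewards differing only by positive scale and a constant offset coincide:
print(starc_style_distance([1, 2, 3], [10, 20, 30]))   # ~0.0
print(starc_style_distance([1, 2, 3], [3, 2, 1]))      # 2.0
```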
results: Experimental results show that VPA improves out-of-distribution generalization by 3.3% across various models, surpassing previous test-time approaches. VPA also improves corruption robustness by 6.5% over strong baselines and boosts domain adaptation performance by a relative 5.2%.Abstract
Textual prompt tuning has demonstrated significant performance improvements in adapting natural language processing models to a variety of downstream tasks by treating hand-engineered prompts as trainable parameters. Inspired by the success of textual prompting, several studies have investigated the efficacy of visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA), the first framework that generalizes visual prompting with test-time adaptation. VPA introduces a small number of learnable tokens, enabling fully test-time and storage-efficient adaptation without necessitating source-domain information. We examine our VPA design under diverse adaptation settings, encompassing single-image, batched-image, and pseudo-label adaptation. We evaluate VPA on multiple tasks, including out-of-distribution (OOD) generalization, corruption robustness, and domain adaptation. Experimental results reveal that VPA effectively enhances OOD generalization by 3.3% across various models, surpassing previous test-time approaches. Furthermore, we show that VPA improves corruption robustness by 6.5% compared to strong baselines. Finally, we demonstrate that VPA also boosts domain adaptation performance by relatively 5.2%. Our VPA also exhibits marked effectiveness in improving the robustness of zero-shot recognition for vision-language models.
SeMAnD: Self-Supervised Anomaly Detection in Multimodal Geospatial Datasets
paper_authors: Daria Reshetova, Swetava Ganguli, C. V. Krishnakumar Iyer, Vipul Pandey
for: Detecting geometric anomalies (e.g., in roads, buildings, land cover) in multimodal geospatial datasets.
methods: The paper proposes a self-supervised anomaly detection technique, SeMAnD, which combines the RandPolyAugment data augmentation strategy with a self-supervised training objective to learn representations of multimodal data that are discriminative to local anomalies.
results: Experiments show that SeMAnD effectively detects real-world defects and outperforms domain-agnostic anomaly detection strategies by 4.8-19.7% in anomaly classification AUC. Model performance also grows with the number of input modalities and with the diversity and strength of the data augmentations.Abstract
We propose a Self-supervised Anomaly Detection technique, called SeMAnD, to detect geometric anomalies in Multimodal geospatial datasets. Geospatial data comprises of acquired and derived heterogeneous data modalities that we transform to semantically meaningful, image-like tensors to address the challenges of representation, alignment, and fusion of multimodal data. SeMAnD is comprised of (i) a simple data augmentation strategy, called RandPolyAugment, capable of generating diverse augmentations of vector geometries, and (ii) a self-supervised training objective with three components that incentivize learning representations of multimodal data that are discriminative to local changes in one modality which are not corroborated by the other modalities. Detecting local defects is crucial for geospatial anomaly detection where even small anomalies (e.g., shifted, incorrectly connected, malformed, or missing polygonal vector geometries like roads, buildings, landcover, etc.) are detrimental to the experience and safety of users of geospatial applications like mapping, routing, search, and recommendation systems. Our empirical study on test sets of different types of real-world geometric geospatial anomalies across 3 diverse geographical regions demonstrates that SeMAnD is able to detect real-world defects and outperforms domain-agnostic anomaly detection strategies by 4.8-19.7% as measured using anomaly classification AUC. We also show that model performance increases (i) up to 20.4% as the number of input modalities increase and (ii) up to 22.9% as the diversity and strength of training data augmentations increase.
PlotMap: Automated Layout Design for Building Game Worlds
paper_authors: Yi Wang, Jieliang Luo, Adam Gaier, Evan Atherton, Hilmar Koch
for: The paper aims to address the challenge of designing game maps that support a desired narrative by introducing an extra layer of plot facility layout design that is independent of the underlying map generation method.
methods: The paper proposes using Reinforcement Learning (RL) to automatically assign concrete locations on a game map to abstract locations mentioned in a given story (plot facilities), following spatial constraints derived from the story.
results: The paper presents a system that considers input from multiple modalities, including map images, facility locations, and story constraints expressed in natural language, to train and evaluate RL models for plot facility layout design. The system is evaluated through a group of comprehensive experiments and ablation studies, providing insights for RL-based plot facility layout design.Abstract
World-building, the process of developing both the narrative and physical world of a game, plays a vital role in the game's experience. Critically acclaimed independent and AAA video games are praised for strong world building, with game maps that masterfully intertwine with and elevate the narrative, captivating players and leaving a lasting impression. However, designing game maps that support a desired narrative is challenging, as it requires satisfying complex constraints from various considerations. Most existing map generation methods focus on considerations about gameplay mechanics or map topography, while the need to support the story is typically neglected. As a result, extensive manual adjustment is still required to design a game world that facilitates particular stories. In this work, we approach this problem by introducing an extra layer of plot facility layout design that is independent of the underlying map generation method in a world-building pipeline. Concretely, we present a system that leverages Reinforcement Learning (RL) to automatically assign concrete locations on a game map to abstract locations mentioned in a given story (plot facilities), following spatial constraints derived from the story. A decision-making agent moves the plot facilities around, considering their relationship to the map and each other, to locations on the map that best satisfy the constraints of the story. Our system considers input from multiple modalities: map images as pixels, facility locations as real values, and story constraints expressed in natural language. We develop a method of generating datasets of facility layout tasks, create an RL environment to train and evaluate RL models, and further analyze the behaviors of the agents through a group of comprehensive experiments and ablation studies, aiming to provide insights for RL-based plot facility layout design.
ChatGPT & Mechanical Engineering: Examining performance on the FE Mechanical Engineering and Undergraduate Exams
for: This paper explores the capabilities of ChatGPT in the discipline of mechanical engineering, specifically in classroom and professional settings.
methods: The paper uses two ChatGPT models, one free and one paid subscription, and examines their performance on junior- and senior-level mechanical engineering exams and on practice questions for the Fundamentals of Engineering (FE) Exam in Mechanical Engineering.
results: The paid subscription model (GPT-4) greatly outperformed the free version (GPT-3.5), achieving 76% correct vs. 51% correct, but the text-only input limitation of both models makes neither likely to pass the FE exam. The results confirm findings in the literature regarding the types of errors and pitfalls made by ChatGPT.Abstract
The launch of ChatGPT at the end of 2022 generated large interest into possible applications of artificial intelligence in STEM education and among STEM professions. As a result many questions surrounding the capabilities of generative AI tools inside and outside of the classroom have been raised and are starting to be explored. This study examines the capabilities of ChatGPT within the discipline of mechanical engineering. It aims to examine use cases and pitfalls of such a technology in the classroom and professional settings. ChatGPT was presented with a set of questions from junior and senior level mechanical engineering exams provided at a large private university, as well as a set of practice questions for the Fundamentals of Engineering Exam (FE) in Mechanical Engineering. The responses of two ChatGPT models, one free to use and one paid subscription, were analyzed. The paper found that the subscription model (GPT-4) greatly outperformed the free version (GPT-3.5), achieving 76% correct vs 51% correct, but the limitation of text only input on both models makes neither likely to pass the FE exam. The results confirm findings in the literature with regards to types of errors and pitfalls made by ChatGPT. It was found that due to its inconsistency and a tendency to confidently produce incorrect answers the tool is best suited for users with expert knowledge.
Learning Using Generated Privileged Information by Text-to-Image Diffusion Models
results: On four text classification datasets, LUGPI yields noticeable performance gains, demonstrating its potential for text classification without any additional cost at inference.Abstract
Learning Using Privileged Information is a particular type of knowledge distillation where the teacher model benefits from an additional data representation during training, called privileged information, improving the student model, which does not see the extra representation. However, privileged information is rarely available in practice. To this end, we propose a text classification framework that harnesses text-to-image diffusion models to generate artificial privileged information. The generated images and the original text samples are further used to train multimodal teacher models based on state-of-the-art transformer-based architectures. Finally, the knowledge from multimodal teachers is distilled into a text-based (unimodal) student. Hence, by employing a generative model to produce synthetic data as privileged information, we guide the training of the student model. Our framework, called Learning Using Generated Privileged Information (LUGPI), yields noticeable performance gains on four text classification data sets, demonstrating its potential in text classification without any additional cost during inference.
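The final distillation stage the abstract sketches can be written as a standard soft-target objective. The snippet below is a generic knowledge-distillation loss of the kind LUGPI would need, not the authors' exact formulation; the temperature and mixing weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a KL term that matches the text-only student to the softened
    outputs of the multimodal (text + generated image) teacher with the
    usual cross-entropy on the ground-truth labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2        # standard rescaling of soft targets
    return alpha * kd + (1 - alpha) * F.cross_entropy(student_logits, labels)
```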
User Experience Design Professionals’ Perceptions of Generative Artificial Intelligence
results: The study finds that experienced designers are confident in their own skills and view GenAI as assistive, while junior designers may be adversely affected by skill degradation, job replacement, and creativity exhaustion.Abstract
Among creative professionals, Generative Artificial Intelligence (GenAI) has sparked excitement over its capabilities and fear over unanticipated consequences. How does GenAI impact User Experience Design (UXD) practice, and are fears warranted? We interviewed 20 UX Designers, with diverse experience and across companies (startups to large enterprises). We probed them to characterize their practices, and sample their attitudes, concerns, and expectations. We found that experienced designers are confident in their originality, creativity, and empathic skills, and find GenAI's role as assistive. They emphasized the unique human factors of "enjoyment" and "agency", where humans remain the arbiters of "AI alignment". However, skill degradation, job replacement, and creativity exhaustion can adversely impact junior designers. We discuss implications for human-GenAI collaboration, specifically copyright and ownership, human creativity and agency, and AI literacy and access. Through the lens of responsible and participatory AI, we contribute a deeper understanding of GenAI fears and opportunities for UXD.
Collaborative Watermarking for Adversarial Speech Synthesis
results: The study shows that collaborative training increases robustness to noise and time-stretching, and listening tests indicate that it has little adverse effect on the perceptual quality of vocoded speech.Abstract
Advances in neural speech synthesis have brought us technology that is not only close to human naturalness, but is also capable of instant voice cloning with little data, and is highly accessible with pre-trained models available. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been related to the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view to generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine, but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on perceptual quality of vocoded speech.
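The "active watermarking" idea can be phrased as the opposite of a GAN generator objective: the vocoder keeps its synthesis loss but is additionally rewarded when the countermeasure flags its output as synthetic. The sketch below makes that explicit; the weighting and loss form are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def collaborative_loss(vocoder_loss, detector_logits_on_fake, lam=1.0):
    """Joint objective sketch: unlike a GAN generator, the vocoder is
    trained so the detector classifies its outputs as synthetic
    (label 1), while vocoder_loss preserves perceptual quality."""
    detect = F.binary_cross_entropy_with_logits(
        detector_logits_on_fake, torch.ones_like(detector_logits_on_fake))
    return vocoder_loss + lam * detect
```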
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
results: Experiments show that the LoRB architecture reduces training time by factors of 5.4 to 3.6 on LibriSpeech and internal datasets.Abstract
We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6.
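For readers unfamiliar with low-rank adaptation, a minimal PyTorch sketch of a LoRA-style linear layer follows: the pretrained weight is frozen and only a small rank-r update is trained, which is how a fraction like 0.08% of the pretrained parameters arises. This is generic LoRA, not the paper's LoRB pipeline (which adds a discriminative objective and a correlation-based regularizer); r and alpha are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Wrapping each attention and feed-forward projection of the rescoring LM with such a module is what keeps the trainable parameter count tiny.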
results: Evaluated across various datasets, domains, and tasks, the family of conservative FB algorithms reaches 150% of vanilla FB performance in aggregate. Conservative FB algorithms also outperform task-specific baselines that have access to reward labels, despite having none themselves.Abstract
Zero-shot reinforcement learning (RL) promises to provide agents that can perform any task in an environment after an offline pre-training phase. Forward-backward (FB) representations represent remarkable progress towards this ideal, achieving 85% of the performance of task-specific agents in this setting. However, such performance is contingent on access to large and diverse datasets for pre-training, which cannot be expected for most real problems. Here, we explore how FB performance degrades when trained on small datasets that lack diversity, and mitigate it with conservatism, a well-established feature of performant offline RL algorithms. We evaluate our family of methods across various datasets, domains and tasks, reaching 150% of vanilla FB performance in aggregate. Somewhat surprisingly, conservative FB algorithms also outperform the task-specific baseline, despite lacking access to reward labels and being required to maintain policies for all tasks. Conservative FB algorithms perform no worse than FB on full datasets, and so present little downside over their predecessor. Our code is available open-source via https://enjeeneer.io/projects/conservative-world-models/.
Revealing the Power of Spatial-Temporal Masked Autoencoders in Multivariate Time Series Forecasting
results: Experimental results show that integrating STMAE into existing spatial-temporal models can largely enhance their MTS forecasting capability.Abstract
Multivariate time series (MTS) forecasting involves predicting future time series data based on historical observations. Existing research primarily emphasizes the development of complex spatial-temporal models that capture spatial dependencies and temporal correlations among time series variables explicitly. However, recent advances have been impeded by challenges relating to data scarcity and model robustness. To address these issues, we propose Spatial-Temporal Masked Autoencoders (STMAE), an MTS forecasting framework that leverages masked autoencoders to enhance the performance of spatial-temporal baseline models. STMAE consists of two learning stages. In the pretraining stage, an encoder-decoder architecture is employed. The encoder processes the partially visible MTS data produced by a novel dual-masking strategy, including biased random walk-based spatial masking and patch-based temporal masking. Subsequently, the decoders aim to reconstruct the masked counterparts from both spatial and temporal perspectives. The pretraining stage establishes a challenging pretext task, compelling the encoder to learn robust spatial-temporal patterns. In the fine-tuning stage, the pretrained encoder is retained, and the original decoder from existing spatial-temporal models is appended for forecasting. Extensive experiments are conducted on multiple MTS benchmarks. The promising results demonstrate that integrating STMAE into various spatial-temporal models can largely enhance their MTS forecasting capability.
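The dual-masking pretext task is the heart of the method. Below is a toy numpy version on a (nodes, time) array: contiguous temporal patches are zeroed per series, and a walk over the sensor graph zeroes whole nodes. The biasing of the walk and the patch statistics from the paper are omitted; all parameters are illustrative.

```python
import numpy as np

def dual_mask(x, adj, patch_len=4, n_patches=2, walk_len=3, seed=0):
    """Toy STMAE-style dual masking: (i) patch-based temporal masking
    zeroes random contiguous windows in each series; (ii) spatial masking
    zeroes nodes visited by a (here unbiased) random walk on the graph."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    n_nodes, n_steps = x.shape
    for i in range(n_nodes):                   # temporal masking
        for _ in range(n_patches):
            s = rng.integers(0, n_steps - patch_len)
            x[i, s:s + patch_len] = 0.0
    node = int(rng.integers(n_nodes))          # spatial masking
    for _ in range(walk_len):
        x[node, :] = 0.0
        neighbors = np.flatnonzero(adj[node])
        if neighbors.size == 0:
            break
        node = int(rng.choice(neighbors))
    return x

adj = np.eye(5, k=1) + np.eye(5, k=-1)         # chain of 5 sensors
masked = dual_mask(np.ones((5, 24)), adj)
```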
3D Reconstruction with Generalizable Neural Fields using Scene Priors
paper_authors: Yang Fu, Shalini De Mello, Xueting Li, Amey Kulkarni, Jan Kautz, Xiaolong Wang, Sifei Liu
for: The paper targets high-fidelity 3D scene reconstruction using neural fields, with a focus on scalability and flexibility.
methods: The paper introduces training generalizable Neural Fields incorporating scene Priors (NFPs), which map single-view RGB-D images into signed distance and radiance values. The NFP network does not require a fusion module, allowing for faster adaptation to new scenes with fewer views.
results: The paper demonstrates state-of-the-art (SOTA) scene reconstruction performance and efficiency, as well as support for single-image novel-view synthesis, which is underexplored in neural fields.Abstract
High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods train a separate network from scratch for each individual scene. This is not scalable, inefficient, and unable to yield good results given limited views. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting makes it less flexible to scale up and to broad applications. Instead, we introduce training generalizable Neural Fields incorporating scene Priors (NFPs). The NFP network maps any single-view RGB-D image into signed distance and radiance values. A complete scene can be reconstructed by merging individual frames in the volumetric space WITHOUT a fusion module, which provides better flexibility. The scene priors can be trained on large-scale datasets, allowing for fast adaptation to the reconstruction of a new scene with fewer views. NFP not only demonstrates SOTA scene reconstruction performance and efficiency, but it also supports single-image novel-view synthesis, which is underexplored in neural fields. More qualitative results are available at: https://oasisyang.github.io/neural-prior
Doduo: Learning Dense Visual Correspondence from Unsupervised Semantic-Aware Flow
results: Trained on an in-the-wild video dataset, the method establishes accurate per-pixel correspondences and remains robust across dynamic scenes. Code and additional visualizations are available at https://ut-austin-rpl.github.io/Doduo.Abstract
Dense visual correspondence plays a vital role in robotic perception. This work focuses on establishing the dense correspondence between a pair of images that captures dynamic scenes undergoing substantial transformations. We introduce Doduo to learn general dense visual correspondence from in-the-wild images and videos without ground truth supervision. Given a pair of images, it estimates the dense flow field encoding the displacement of each pixel in one image to its corresponding pixel in the other image. Doduo uses flow-based warping to acquire supervisory signals for the training. Incorporating semantic priors with self-supervised flow training, Doduo produces accurate dense correspondence robust to the dynamic changes of the scenes. Trained on an in-the-wild video dataset, Doduo illustrates superior performance on point-level correspondence estimation over existing self-supervised correspondence learning baselines. We also apply Doduo to articulation estimation and zero-shot goal-conditioned manipulation, underlining its practical applications in robotics. Code and additional visualizations are available at https://ut-austin-rpl.github.io/Doduo
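The flow-based warping that supplies the supervisory signal is a standard differentiable operation. Below is a generic sketch built on grid_sample, not Doduo's exact implementation; shapes and the pixel-displacement flow convention are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img2, flow):
    """Differentiably warp img2 into img1's frame using a dense flow
    field of shape (B, 2, H, W) holding x- and y-displacements in
    pixels, so warp_with_flow(img2, flow) can be compared to img1."""
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img2.device),
                            torch.arange(w, device=img2.device),
                            indexing="ij")
    x_coords = xs.float() + flow[:, 0]          # (B, H, W) sample positions
    y_coords = ys.float() + flow[:, 1]
    grid = torch.stack((2 * x_coords / (w - 1) - 1,   # normalize to [-1, 1]
                        2 * y_coords / (h - 1) - 1), dim=-1)
    return F.grid_sample(img2, grid, align_corners=True)
```

A photometric loss such as (warp_with_flow(img2, flow) - img1).abs().mean() then trains the flow network without ground-truth correspondence.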
Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
results: In a large-scale study across 11 datasets and the Llama-2 family at all scales (7B, 13B, 70B), the proposed SAT Probe method, which probes self-attention patterns, predicts constraint satisfaction and factual errors and enables early error identification. The approach and findings demonstrate how the mechanistic understanding of factuality in LLMs can enhance reliability.Abstract
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as Constraint Satisfaction Problems and use this framework to investigate how the model interacts internally with factual constraints. Specifically, we discover a strong positive relation between the model's attention to constraint tokens and the factual accuracy of its responses. In our curated suite of 11 datasets with over 40,000 prompts, we study the task of predicting factual errors with the Llama-2 family across all scales (7B, 13B, 70B). We propose SAT Probe, a method probing self-attention patterns, that can predict constraint satisfaction and factual errors, and allows early error identification. The approach and findings demonstrate how using the mechanistic understanding of factuality in LLMs can enhance reliability.
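A probe in this spirit is easy to set up once per-head attention features are extracted. The sketch below uses random stand-ins for the attention mass on constraint tokens and for the factuality labels; only the shape of the pipeline (attention features into a logistic regression that predicts constraint satisfaction) reflects the described method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume that for each prompt we have extracted, from every layer and
# head, the attention mass the model puts on the constraint tokens
# (e.g. the entity name). These arrays are random stand-ins for that.
rng = np.random.default_rng(0)
n_prompts, n_layers, n_heads = 400, 32, 40
attn_to_constraints = rng.random((n_prompts, n_layers * n_heads))
is_correct = rng.integers(0, 2, size=n_prompts)  # stand-in labels

probe = LogisticRegression(max_iter=2000).fit(attn_to_constraints, is_correct)
# At inference, the probe flags likely factual errors early:
p_satisfied = probe.predict_proba(attn_to_constraints[:1])[0, 1]
print(f"predicted probability the constraint is satisfied: {p_satisfied:.2f}")
```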
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
paper_authors: Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
for: The paper explores the use of large language models (LLMs) for temporally consistent long video generation, and develops a novel framework called VideoDirectorGPT that leverages the knowledge of LLMs for video content planning and grounded video generation.
methods: The proposed VideoDirectorGPT framework consists of a video planner LLM (GPT-4) and a video generator (Layout2Vid), which work together to generate multi-scene videos with visual consistency across scenes. The video planner generates a "video plan" that includes scene descriptions, entity layouts, and background information, and the video generator uses this plan to generate the video content.
results: The proposed framework substantially improves layout and movement control in both single- and multi-scene video generation, and can generate multi-scene videos with visual consistency across scenes while achieving competitive performance with state-of-the-art (SOTA) methods in open-domain single-scene text-to-video generation. Additionally, the framework can dynamically control the strength of layout guidance and can generate videos with user-provided images.Abstract
Although recent text-to-video (T2V) generation methods have seen significant advancements, most of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across scenes, while only trained with image-level annotations. Our experiments demonstrate that VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength for layout guidance and can also generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.
A Review on AI Algorithms for Energy Management in E-Mobility Services
paper_authors: Sen Yan, Maqsood Hussain Shah, Ji Li, Noel O’Connor, Mingming Liu
for: This study explores the potential of artificial intelligence (AI) in e-mobility systems (EMS) to address various challenges related to efficient energy management, including range anxiety, charge rate optimization, and energy storage longevity.
methods: By analyzing the existing literature, the study examines the role of AI in EMS and proposes effective avenues for future research.
results: The study provides an overview of the current state of the art in AI for EMS and proposes directions for future investigation, contributing to sustainable and efficient e-mobility solutions and a greener, more sustainable future for transportation.Abstract
E-mobility, or electric mobility, has emerged as a pivotal solution to address pressing environmental and sustainability concerns in the transportation sector. The depletion of fossil fuels, escalating greenhouse gas emissions, and the imperative to combat climate change underscore the significance of transitioning to electric vehicles (EVs). This paper seeks to explore the potential of artificial intelligence (AI) in addressing various challenges related to effective energy management in e-mobility systems (EMS). These challenges encompass critical factors such as range anxiety, charge rate optimization, and the longevity of energy storage in EVs. By analyzing existing literature, we delve into the role that AI can play in tackling these challenges and enabling efficient energy management in EMS. Our objectives are twofold: to provide an overview of the current state-of-the-art in this research domain and propose effective avenues for future investigations. Through this analysis, we aim to contribute to the advancement of sustainable and efficient e-mobility solutions, shaping a greener and more sustainable future for transportation.
When Prolog meets generative models: a new approach for managing knowledge and planning in robotic applications
results: Demonstrated on a realistic application, the system automates plan generation and execution, improving the efficiency and reliability of the robotic system.Abstract
In this paper, we propose a robot oriented knowledge management system based on the use of the Prolog language. Our framework hinges on a special organisation of knowledge base that enables: 1. its efficient population from natural language texts using semi-automated procedures based on Large Language Models, 2. the bumpless generation of temporal parallel plans for multi-robot systems through a sequence of transformations, 3. the automated translation of the plan into an executable formalism (the behaviour trees). The framework is supported by a set of open source tools and is shown on a realistic application.
Class Incremental Learning via Likelihood Ratio Based Task Prediction
for: This paper targets the continual learning setting of class incremental learning (CIL), in which tasks are learned sequentially and no task identifier is provided at test time.
results: The paper shows that using a traditional OOD detector for task-id prediction is sub-optimal, because additional information available in CIL (such as replay data and the already-learned tasks) can be exploited to design a better and more principled task-id prediction method. The proposed TPLR (Task-id Prediction based on Likelihood Ratio) method markedly outperforms strong CIL baselines.Abstract
Class incremental learning (CIL) is a challenging setting of continual learning, which learns a series of tasks sequentially. Each task consists of a set of unique classes. The key feature of CIL is that no task identifier (or task-id) is provided at test time for each test sample. Predicting the task-id for each test sample is a challenging problem. An emerging theoretically justified and effective approach is to train a task-specific model for each task in a shared network for all tasks based on a task-incremental learning (TIL) method to deal with forgetting. The model for each task in this approach is an out-of-distribution (OOD) detector rather than a conventional classifier. The OOD detector can perform both within-task (in-distribution (IND)) class prediction and OOD detection. The OOD detection capability is the key for task-id prediction during inference for each test sample. However, this paper argues that using a traditional OOD detector for task-id prediction is sub-optimal because additional information (e.g., the replay data and the learned tasks) available in CIL can be exploited to design a better and principled method for task-id prediction. We call the new method TPLR (Task-id Prediction based on Likelihood Ratio}). TPLR markedly outperforms strong CIL baselines.
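At inference, the idea reduces to scoring each task model with a likelihood ratio rather than a raw in-distribution score. The toy function below assumes hypothetical per-task log-likelihood estimates for a test sample; how those estimates are actually built (e.g., from replay data) is the substance of the paper and is not reproduced here.

```python
import numpy as np

def predict_task_id(log_p_task, log_p_rest):
    """Pick the task whose likelihood-ratio score is largest:
    score_k = log p(x | task k) - log p(x | other tasks)."""
    scores = np.asarray(log_p_task) - np.asarray(log_p_rest)
    return int(np.argmax(scores))

# The sample looks most like task 1 once the ratio is taken:
print(predict_task_id([-3.2, -1.1, -4.0], [-2.0, -3.5, -2.5]))  # 1
```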
Combining Survival Analysis and Machine Learning for Mass Cancer Risk Prediction using EHR data
results: Compared to the baseline, the Survival Ensemble shows a significant advantage on the primary metric, Average Precision: 22.8% (ROC AUC 83.7%, F1 17.8%) versus 15.1% (ROC AUC 84.9%, F1 21.4%), and it exceeds age-based baselines by a significant margin. In a blind retrospective out-of-time experiment, the method reliably detects cancer patients (9 out of 100 selected).Abstract
Purely medical cancer screening methods are often costly, time-consuming, and weakly applicable on a large scale. Advanced Artificial Intelligence (AI) methods greatly help cancer detection but require specific or deep medical data. These aspects affect the mass implementation of cancer screening methods. For these reasons, it is a disruptive change for healthcare to apply AI methods for mass personalized assessment of the cancer risk among patients based on the existing Electronic Health Records (EHR) volume. This paper presents a novel method for mass cancer risk prediction using EHR data. Among other methods, our one stands out by the minimum data greedy policy, requiring only a history of medical service codes and diagnoses from EHR. We formulate the problem as a binary classification. This dataset contains 175 441 de-identified patients (2 861 diagnosed with cancer). As a baseline, we implement a solution based on a recurrent neural network (RNN). We propose a method that combines machine learning and survival analysis since these approaches are less computationally heavy, can be combined into an ensemble (the Survival Ensemble), and can be reproduced in most medical institutions. We test the Survival Ensemble in some studies. Firstly, we obtain a significant difference between values of the primary metric (Average Precision) with 22.8% (ROC AUC 83.7%, F1 17.8%) for the Survival Ensemble versus 15.1% (ROC AUC 84.9%, F1 21.4%) for the Baseline. Secondly, the performance of the Survival Ensemble is also confirmed during the ablation study. Thirdly, our method exceeds age baselines by a significant margin. Fourthly, in the blind retrospective out-of-time experiment, the proposed method is reliable in cancer patient detection (9 out of 100 selected). Such results exceed the estimates of medical screenings, e.g., the best Number Needed to Screen (9 out of 1000 screenings).
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
paper_authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner
for: The paper aims to develop a simple lie detector for large language models (LLMs) that does not require access to the LLM’s activations or ground-truth knowledge of the fact in question.
methods: The lie detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier.
results: The detector is highly accurate and generalizes well to different LLM architectures, fine-tuned lies, sycophantic lies, and real-life scenarios such as sales, indicating that LLMs have distinctive lie-related behavioural patterns that could enable general-purpose lie detection.Abstract
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
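Since the detector itself is just a logistic regression over yes/no answers to fixed follow-up questions, it fits in a few lines. Everything below (the follow-up questions, the feature coding, the tiny training set) is an illustrative stand-in for the paper's much larger setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [            # hypothetical unrelated elicitation questions
    "Is the sky sometimes green?",
    "Do you enjoy answering questions?",
    "Is 7 a prime number?",
]

def answers_to_features(yes_no_answers):
    """Map each yes/no follow-up answer to +1/-1."""
    return np.array([1.0 if a == "yes" else -1.0 for a in yes_no_answers])

# Stand-in training data: feature vectors collected after known lies (1)
# and known truthful answers (0) from some LLM.
X = np.array([[1, -1, 1], [1, 1, 1], [-1, -1, 1], [-1, 1, -1]], dtype=float)
y = np.array([1, 1, 0, 0])

detector = LogisticRegression().fit(X, y)
probe = answers_to_features(["yes", "no", "yes"])
print("P(lie) =", detector.predict_proba([probe])[0, 1])
```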
Don’t throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding
paper_authors: Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz
for: Improving the readability and appeal of generated text.
methods: Combining Monte-Carlo Tree Search (MCTS) with Proximal Policy Optimization (PPO).
results: Generated text is preferred over that of the standard practice of decoding with the PPO policy alone.Abstract
Inference-time search algorithms such as Monte-Carlo Tree Search (MCTS) may seem unnecessary when generating natural language text based on state-of-the-art reinforcement learning such as Proximal Policy Optimization (PPO). In this paper, we demonstrate that it is possible to get extra mileage out of PPO by integrating MCTS on top. The key idea is not to throw out the value network, a byproduct of PPO training for evaluating partial output sequences, when decoding text out of the policy network. More concretely, we present a novel value-guided decoding algorithm called PPO-MCTS, which can integrate the value network from PPO to work closely with the policy network during inference-time generation. Compared to prior approaches based on MCTS for controlled text generation, the key strength of our approach is to reduce the fundamental mismatch of the scoring mechanisms of the partial outputs between training and test. Evaluation on four text generation tasks demonstrate that PPO-MCTS greatly improves the preferability of generated text compared to the standard practice of using only the PPO policy. Our results demonstrate the promise of search algorithms even on top of the aligned language models from PPO, and the under-explored benefit of the value network.
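The mechanism reuses a standard PUCT selection rule, with the PPO policy supplying token priors and the PPO value network seeding the value estimates. The sketch below shows only that selection step, on an assumed dict-based node structure; it is generic MCTS machinery, not the authors' exact variant.

```python
import math

def select_child(node, c_puct=1.0):
    """Pick the child token maximizing Q + U, where Q is the mean
    backed-up value (seeded by the PPO value network) and U is an
    exploration bonus weighted by the PPO policy's token prior."""
    total = sum(c["visits"] for c in node["children"].values())

    def score(c):
        q = c["value_sum"] / c["visits"] if c["visits"] else 0.0
        u = c_puct * c["prior"] * math.sqrt(total + 1) / (1 + c["visits"])
        return q + u

    return max(node["children"].items(), key=lambda kv: score(kv[1]))
```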
results: The survey presents a wide variety of benchmarks and evaluation methodologies for assessing LLM alignment, and analyzes the reliability and safety of these methods.Abstract
Recent years have witnessed remarkable progress made in large language models (LLMs). Such advancements, while garnering significant attention, have concurrently elicited various concerns. The potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. Consequently, it becomes paramount to employ alignment techniques to ensure that these models exhibit behaviors consistent with human values. This survey endeavors to furnish an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe into salient issues including the models' interpretability, and potential vulnerabilities to adversarial attacks. To assess LLM alignment, we present a wide variety of benchmarks and evaluation methodologies. After discussing the state of alignment research for LLMs, we finally cast a vision toward the future, contemplating the promising avenues of research that lie ahead. Our aspiration for this survey extends beyond merely spurring research interests in this realm. We also envision bridging the gap between the AI alignment research community and the researchers engrossed in the capability exploration of LLMs for both capable and safe LLMs.
PINF: Continuous Normalizing Flows for Physics-Constrained Deep Learning
results: Can efficiently solve high-dimensional time-dependent and steady-state Fokker-Planck equations
Abstract
The normalization constraint on probability density poses a significant challenge for solving the Fokker-Planck equation. Normalizing Flow, an invertible generative model, leverages the change of variables formula to ensure probability density conservation and enable the learning of complex data distributions. In this paper, we introduce Physics-Informed Normalizing Flows (PINF), a novel extension of continuous normalizing flows, incorporating diffusion through the method of characteristics. Our method, which is mesh-free and causality-free, can efficiently solve high dimensional time-dependent and steady-state Fokker-Planck equations.
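For reference, the two ingredients the abstract connects can be written out as follows; the notation is assumed for exposition and the paper's exact formulation may differ.

```latex
% Fokker-Planck equation for the density p(x,t), with drift \mu and
% diffusion tensor D:
\frac{\partial p(x,t)}{\partial t}
  = -\nabla \cdot \bigl(\mu(x,t)\, p(x,t)\bigr)
  + \sum_{i,j} \frac{\partial^2}{\partial x_i \, \partial x_j}
      \bigl(D_{ij}(x,t)\, p(x,t)\bigr)

% Continuous normalizing flow with dynamics dz/dt = f(z,t): the log-density
% evolves by the instantaneous change-of-variables formula, which is what
% enforces the normalization constraint along the trajectory:
\frac{d \log p\bigl(z(t)\bigr)}{dt}
  = -\operatorname{Tr}\!\left(\frac{\partial f}{\partial z(t)}\right)
```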
Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex
paper_authors: Ruixing Liang, Xiangyu Zhang, Qiong Li, Lai Wei, Hexin Liu, Avisha Kumar, Kelley M. Kempski Leadingham, Joshua Punnoose, Leibny Paola Garcia, Amir Manbachi
for: Use artificial intelligence to understand visual perception and to support inquiry into the brain's function and structure.
methods: An artificial neural network named VISION models the brain's visual processing: given visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) response to natural images.
results: VISION predicts human hemodynamic responses, as fMRI voxel values, with accuracy exceeding the state of the art by 45%. Probing the trained network further reveals representational biases in different visual areas, generates experimentally testable hypotheses, and yields an interpretable metric associating these hypotheses with cortical functions.
Abstract
While significant advancements in artificial intelligence (AI) have catalyzed progress across various domains, its full potential in understanding visual perception remains underexplored. We propose an artificial neural network dubbed VISION, an acronym for "Visual Interface System for Imaging Output of Neural activity," to mimic the human brain and show how it can foster neuroscientific inquiries. Using visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) scan response to natural images. VISION successfully predicts human hemodynamic responses as fMRI voxel values to visual inputs with an accuracy exceeding state-of-the-art performance by 45%. We further probe the trained networks to reveal representational biases in different visual areas, generate experimentally testable hypotheses, and formulate an interpretable metric to associate these hypotheses with cortical functions. With both a model and evaluation metric, the cost and time burdens associated with designing and implementing functional analysis on the visual cortex could be reduced. Our work suggests that the evolution of computational models may shed light on our fundamental understanding of the visual cortex and provide a viable approach toward reliable brain-machine interfaces.
Automating question generation from educational text
paper_authors: Ayan Kumar Bhowmick, Ashish Jagmohan, Aditya Vempaty, Prasenjit Dey, Leigh Hall, Jeremy Hartman, Ravi Kokku, Hema Maheshwari
for: Design and evaluate an automated question-generation tool for formative and summative assessment in schools.
methods: Leveraging recent advances in generative AI, a modular framework employs transformer-based language models for automatic generation of multiple-choice questions (MCQs).
results: An extensive quantitative and qualitative evaluation demonstrates the trade-offs between different techniques and models.
Abstract
The use of question-based activities (QBAs) is widespread in education, traditionally forming an integral part of the learning and assessment process. In this paper, we design and evaluate an automated question generation tool for formative and summative assessment in schools. We present an expert survey of one hundred and four teachers, demonstrating the need for automated generation of QBAs, as a tool that can significantly reduce the workload of teachers and facilitate personalized learning experiences. Leveraging the recent advancements in generative AI, we then present a modular framework employing transformer based language models for automatic generation of multiple-choice questions (MCQs) from textual content. The presented solution, with distinct modules for question generation, correct answer prediction, and distractor formulation, enables us to evaluate different language models and generation techniques. Finally, we perform an extensive quantitative and qualitative evaluation, demonstrating trade-offs in the use of different techniques and models.
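A minimal sketch of such a modular pipeline follows; the checkpoint name and prompt formats are placeholders (the paper evaluates several models and techniques), so treat this as an outline of the three modules rather than the released tool.

```python
# Outline of a modular MCQ pipeline: question generation, answer prediction,
# and distractor formulation as separate seq2seq calls. The checkpoint and
# prompt wording are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text2text-generation", model="t5-base")  # placeholder model

def generate_mcq(passage: str, n_distractors: int = 3) -> dict:
    question = generator(f"generate question: {passage}")[0]["generated_text"]
    answer = generator(
        f"answer the question: {question} context: {passage}")[0]["generated_text"]
    distractors = [
        generator(f"generate a plausible wrong option for: {question}")[0]["generated_text"]
        for _ in range(n_distractors)
    ]
    return {"question": question, "answer": answer, "distractors": distractors}
```

Keeping the modules separate makes it possible to swap the model behind each step, which is exactly the kind of comparison the evaluation performs.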
Measurement Models For Sailboats Price vs. Features And Regional Areas
results: The analysis finds that monohulled boats are generally more affordable than catamarans, and that specifications such as length, beam, displacement, and sail area correlate directly with price; interestingly, lower draft is associated with higher listing prices. A country's GDP shows no direct correlation with sailboat prices. Under 50% cross-validation, the models yield consistent results across test groups, giving prospective buyers a machine-learning-enhanced guide to sailboat pricing.
Abstract
In this study, we investigated the relationship between sailboat technical specifications and their prices, as well as regional pricing influences. Utilizing a dataset encompassing characteristics like length, beam, draft, displacement, sail area, and waterline, we applied multiple machine learning models to predict sailboat prices. The gradient descent model demonstrated superior performance, producing the lowest MSE and MAE. Our analysis revealed that monohulled boats are generally more affordable than catamarans, and that certain specifications such as length, beam, displacement, and sail area directly correlate with higher prices. Interestingly, lower draft was associated with higher listing prices. We also explored regional price determinants and found that the United States tops the list in average sailboat prices, followed by Europe, Hong Kong, and the Caribbean. Contrary to our initial hypothesis, a country's GDP showed no direct correlation with sailboat prices. Utilizing a 50% cross-validation method, our models yielded consistent results across test groups. Our research offers a machine learning-enhanced perspective on sailboat pricing, aiding prospective buyers in making informed decisions.
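A compact sketch of this setup is below. The column names are assumptions about the dataset schema, and scikit-learn's gradient-boosted regressor stands in for the authors' best-performing learner purely for illustration.

```python
# Hedged sketch: price regression over hull specifications with a 50/50 split,
# mirroring the 50% cross-validation mentioned above. Feature names assumed.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

FEATURES = ["length", "beam", "draft", "displacement", "sail_area", "waterline"]

def fit_price_model(df: pd.DataFrame):
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["price"], test_size=0.5, random_state=0)
    model = GradientBoostingRegressor().fit(X_train, y_train)
    preds = model.predict(X_test)
    return model, {"MAE": mean_absolute_error(y_test, preds),
                   "MSE": mean_squared_error(y_test, preds)}
```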
Investigating Deep Neural Network Architecture and Feature Extraction Designs for Sensor-based Human Activity Recognition
paper_authors: Danial Ahangarani, Mohammad Shirazi, Navid Ashraf
for: Investigate the performance of common deep learning and machine learning approaches, as well as different training mechanisms and feature representations, for human activity recognition.
results: Experimental studies show that deep learning methods outperform traditional signal processing and machine learning approaches on human activity recognition, and that the choice of feature representation and training mechanism (such as contrastive learning) measurably affects task performance.
Abstract
The ubiquitous availability of sensors in smart devices and the Internet of Things (IoT) has opened up the possibilities for implementing sensor-based activity recognition. As opposed to traditional sensor time-series processing and hand-engineered feature extraction, in light of deep learning's proven effectiveness across various domains, numerous deep methods have been explored to tackle the challenges in activity recognition, outperforming the traditional signal processing and traditional machine learning approaches. In this work, by performing extensive experimental studies on two human activity recognition datasets, we investigate the performance of common deep learning and machine learning approaches as well as different training mechanisms (such as contrastive learning), and various feature representations extracted from the sensor time-series data and measure their effectiveness for the human activity recognition task.
Improving Unsupervised Visual Program Inference with Code Rewriting Families
results: Using SIRI with the proposed rewriter family improves 2D and 3D CSG shape-program inference, yielding better reconstructions and faster convergence; the rewriters can also be applied at test time to further improve the reconstructions of SIRI's predictions.
Abstract
Programs offer compactness and structure that makes them an attractive representation for visual data. We explore how code rewriting can be used to improve systems for inferring programs from visual data. We first propose Sparse Intermittent Rewrite Injection (SIRI), a framework for unsupervised bootstrapped learning. SIRI sparsely applies code rewrite operations over a dataset of training programs, injecting the improved programs back into the training set. We design a family of rewriters for visual programming domains: parameter optimization, code pruning, and code grafting. For three shape programming languages in 2D and 3D, we show that using SIRI with our family of rewriters improves performance: better reconstructions and faster convergence rates, compared with bootstrapped learning methods that do not use rewriters or use them naively. Finally, we demonstrate that our family of rewriters can be effectively used at test time to improve the output of SIRI predictions. For 2D and 3D CSG, we outperform or match the reconstruction performance of recent domain-specific neural architectures, while producing more parsimonious programs that use significantly fewer primitives.
Deep Generative Methods for Producing Forecast Trajectories in Power Systems
results: Extensive experiments on wind forecast data from the French TSO RTE, assessed with dedicated time-series generation metrics, show that the deep generative models capture the variability of renewable production more faithfully than the copula-based baseline.
Abstract
With the expansion of renewables in the electricity mix, power grid variability will increase, hence the need to robustify the system to guarantee its security. Therefore, Transport System Operators (TSOs) must conduct analyses to simulate the future functioning of power systems. These simulations are then used as inputs in decision-making processes. In this context, we investigate using deep learning models to generate energy production and load forecast trajectories. To capture the spatiotemporal correlations in these multivariate time series, we adapt autoregressive networks and normalizing flows, demonstrating their effectiveness against the current copula-based statistical approach. We conduct extensive experiments on the French TSO RTE wind forecast data and compare the different models with ad hoc evaluation metrics for time series generation.
Recurrent Hypernetworks are Surprisingly Strong in Meta-RL
results: End-to-end learning combined with a general-purpose sequential model such as a recurrent network achieves strong performance, but hypernetworks are crucial to unlocking this potential; surprisingly, these simple recurrent baselines then achieve the strongest performance of all methods evaluated.
Abstract
Deep reinforcement learning (RL) is notoriously impractical to deploy due to sample inefficiency. Meta-RL directly addresses this sample inefficiency by learning to perform few-shot learning when a distribution of related tasks is available for meta-training. While many specialized meta-RL methods have been proposed, recent work suggests that end-to-end learning in conjunction with an off-the-shelf sequential model, such as a recurrent network, is a surprisingly strong baseline. However, such claims have been controversial due to limited supporting evidence, particularly in the face of prior work establishing precisely the opposite. In this paper, we conduct an empirical investigation. While we likewise find that a recurrent network can achieve strong performance, we demonstrate that the use of hypernetworks is crucial to maximizing their potential. Surprisingly, when combined with hypernetworks, the recurrent baselines that are far simpler than existing specialized methods actually achieve the strongest performance of all methods evaluated.
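A minimal PyTorch sketch of the combination studied here: a recurrent encoder summarizes the interaction history, and a hypernetwork emits the weights of the policy head rather than feeding the embedding into a fixed head. Sizes and wiring are illustrative assumptions, not the paper's architecture.

```python
# Sketch: recurrent task encoding + hypernetwork-generated policy head.
import torch
import torch.nn as nn

class HyperPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        # Hypernetwork: maps the recurrent state to the weights and bias of a
        # per-timestep linear policy head.
        self.hyper = nn.Linear(hidden, hidden * act_dim + act_dim)
        self.hidden, self.act_dim = hidden, act_dim

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(obs_seq)                              # (B, T, hidden)
        params = self.hyper(h)
        W = params[..., : self.hidden * self.act_dim]
        b = params[..., self.hidden * self.act_dim:]
        W = W.view(*h.shape[:2], self.act_dim, self.hidden)   # (B, T, A, H)
        return torch.einsum("btah,bth->bta", W, h) + b        # action logits
```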
Interactively Learning Social Media Representations Improves News Source Factuality Detection
results: On real-world events, experiments show improved performance in detecting the factuality of news sources, even after only a few human interactions.
Abstract
The rise of social media has enabled the widespread propagation of fake news, text that is published with an intent to spread misinformation and sway beliefs. Rapidly detecting fake news, especially as new events arise, is important to prevent misinformation. While prior works have tackled this problem using supervised learning systems, automatically modeling the complexities of the social media landscape that enables the spread of fake news is challenging. On the contrary, having humans fact check all news is not scalable. Thus, in this paper, we propose to approach this problem interactively, where humans can interact to help an automated system learn a better social media representation quality. On real world events, our experiments show performance improvements in detecting factuality of news sources, even after few human interactions.
Contrastive Continual Multi-view Clustering with Filtered Structural Fusion
paper_authors: Xinhang Wan, Jiyuan Liu, Ao Li, Xinwang Liu, En Zhu
for: Addresses clustering of real-time data, where views arrive sequentially, by proposing Contrastive Continual Multi-view Clustering with Filtered Structural Fusion (CCMVC-FSF) to overcome the catastrophic forgetting that existing methods suffer when a new view arrives.
methods: A fixed-size data buffer stores filtered structural information to reduce interference and guides the generation of a robust partition matrix via contrastive learning; the approach is also connected theoretically to semi-supervised learning and knowledge distillation.
results: Extensive experiments show that the proposed method alleviates catastrophic forgetting and performs excellently across practical scenarios.
Abstract
Multi-view clustering thrives in applications where views are collected in advance by extracting consistent and complementary information among views. However, it overlooks scenarios where data views are collected sequentially, i.e., real-time data. Due to privacy issues or memory burden, previous views are not available over time in these situations. Some methods have been proposed to handle this but are trapped in a stability-plasticity dilemma. Specifically, these methods undergo a catastrophic forgetting of prior knowledge when a new view is attained. Such a catastrophic forgetting problem (CFP) makes the consistent and complementary information hard to obtain and degrades the clustering performance. To tackle this, we propose a novel method termed Contrastive Continual Multi-view Clustering with Filtered Structural Fusion (CCMVC-FSF). Precisely, considering that data correlations play a vital role in clustering and prior knowledge ought to guide the clustering process of a new view, we develop a data buffer with fixed size to store filtered structural information and utilize it to guide the generation of a robust partition matrix via contrastive learning. Furthermore, we theoretically connect CCMVC-FSF with semi-supervised learning and knowledge distillation. Extensive experiments exhibit the excellence of the proposed method.
Addressing preferred orientation in single-particle cryo-EM through AI-generated auxiliary particles
methods: Uses a conditional deep generative model to synthesize auxiliary particles, addressing the intrinsic bias in orientation estimation for the observed particles.
results: Near-atomic-resolution reconstruction of the hemagglutinin trimer from non-tilted cryo-EM data; the enhanced cryoPROS-MP version further improves the resolution of the membrane protein NaX from non-tilted data containing micelle effects.
Abstract
The single-particle cryo-EM field faces the persistent challenge of preferred orientation, lacking general computational solutions. We introduce cryoPROS, an AI-based approach designed to address the above issue. By generating the auxiliary particles with a conditional deep generative model, cryoPROS addresses the intrinsic bias in orientation estimation for the observed particles. We effectively employed cryoPROS in the cryo-EM single particle analysis of the hemagglutinin trimer, showing the ability to restore the near-atomic resolution structure on non-tilt data. Moreover, the enhanced version named cryoPROS-MP significantly improves the resolution of the membrane protein NaX using the no-tilted data that contains the effects of micelles. Compared to the classical approaches, cryoPROS does not need special experimental or image acquisition techniques, providing a purely computational yet effective solution for the preferred orientation problem. Finally, we conduct extensive experiments that establish the low risk of model bias and the high robustness of cryoPROS.
Multi-Source Domain Adaptation for Object Detection with Prototype-based Mean-teacher
paper_authors: Atif Belal, Akhil Meethal, Francisco Perdigon Romero, Marco Pedersoli, Eric Granger
for: Adapting visual object detectors to operational target domains with unsupervised domain adaptation (UDA), specifically in multi-source domain adaptation (MSDA) scenarios.
methods: The proposed Prototype-based Mean-Teacher (PMT) uses class prototypes, learned with a contrastive loss, to preserve domain-specific information and align categories across domains.
results: PMT outperforms state-of-the-art MSDA methods on several challenging object detection datasets.
Abstract
Adapting visual object detectors to operational target domains is a challenging task, commonly achieved using unsupervised domain adaptation (UDA) methods. When the labeled dataset is coming from multiple source domains, treating them as separate domains and performing a multi-source domain adaptation (MSDA) improves the accuracy and robustness over mixing these source domains and performing a UDA, as observed by recent studies in MSDA. Existing MSDA methods learn domain invariant and domain-specific parameters (for each source domain) for the adaptation. However, unlike single-source UDA methods, learning domain-specific parameters makes them grow significantly proportional to the number of source domains used. This paper proposes a novel MSDA method called Prototype-based Mean-Teacher (PMT), which uses class prototypes instead of domain-specific subnets to preserve domain-specific information. These prototypes are learned using a contrastive loss, aligning the same categories across domains and separating different categories far apart. Because of the use of prototypes, the parameter size of our method does not increase significantly with the number of source domains, thus reducing memory issues and possible overfitting. Empirical studies show PMT outperforms state-of-the-art MSDA methods on several challenging object detection datasets.
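A sketch of the kind of class-prototype contrastive objective described above: instance embeddings are pulled toward their class prototype and pushed away from the others, so category structure is shared across domains without per-domain subnetworks. The temperature and tensor shapes are assumptions.

```python
# Prototype-based contrastive alignment sketch.
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feats, labels, prototypes, tau=0.1):
    """feats: (N, D) instance embeddings; prototypes: (C, D); labels: (N,)."""
    feats = F.normalize(feats, dim=1)
    protos = F.normalize(prototypes, dim=1)
    logits = feats @ protos.t() / tau      # (N, C): similarity to each prototype
    # Cross-entropy pulls each instance to its own class prototype and pushes
    # it away from all other prototypes.
    return F.cross_entropy(logits, labels)
```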
A Democratic Platform for Engaging with Disabled Community in Generative AI Development
for: The paper aims to involve the disabled community in the design and development of generative AI systems to address bias and incorrectness in the outputs generated by these systems when used by the disabled community.
methods: The proposed platform calls for asynchronous and remote collaboration between disabled and non-disabled individuals from diverse backgrounds, using a democratic approach to decision-making.
results: The paper hopes to gain insight into the factors that contribute to bias in generative AI systems when used by the disabled community, and to identify the main algorithmic factors responsible for incorrect or irrelevant outputs.
Abstract
Artificial Intelligence (AI) systems, especially generative AI technologies are becoming more relevant in our society. Tools like ChatGPT are being used by members of the disabled community e.g., Autistic people may use it to help compose emails. The growing impact and popularity of generative AI tools have prompted us to examine their relevance within the disabled community. The design and development phases often neglect this marginalized group, leading to inaccurate predictions and unfair discrimination directed towards them. This could result from bias in data sets, algorithms, and systems at various phases of creation and implementation. This workshop paper proposes a platform to involve the disabled community while building generative AI systems. With this platform, our aim is to gain insight into the factors that contribute to bias in the outputs generated by generative AI when used by the disabled community. Furthermore, we expect to comprehend which algorithmic factors are the main contributors to the output's incorrectness or irrelevancy. The proposed platform calls on both disabled and non-disabled people from various geographical and cultural backgrounds to collaborate asynchronously and remotely in a democratic approach to decision-making.
Label Deconvolution for Node Representation Learning on Large-scale Attributed Graphs against Learning Bias
methods: Combines pre-trained models, serving as node encoders, with a graph neural network (GNN) to encode attributes and graph structure jointly. Because joint training on large-scale graphs is not scalable, prior methods train the node encoders and GNNs separately, which ignores the GNN's feature convolutions during node-encoder training and introduces a learning bias; the paper proposes an efficient label regularization technique, Label Deconvolution (LD), to alleviate it.
results: Experiments show LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.
Abstract
Node representation learning on attributed graphs -- whose nodes are associated with rich attributes (e.g., texts and protein sequences) -- plays a crucial role in many important downstream tasks. To encode the attributes and graph structures simultaneously, recent studies integrate pre-trained models with graph neural networks (GNNs), where pre-trained models serve as node encoders (NEs) to encode the attributes. As jointly training large NEs and GNNs on large-scale graphs suffers from severe scalability issues, many methods propose to train NEs and GNNs separately. Consequently, they do not take feature convolutions in GNNs into consideration in the training phase of NEs, leading to a significant learning bias from that by the joint training. To address this challenge, we propose an efficient label regularization technique, namely Label Deconvolution (LD), to alleviate the learning bias by a novel and highly scalable approximation to the inverse mapping of GNNs. The inverse mapping leads to an objective function that is equivalent to that by the joint training, while it can effectively incorporate GNNs in the training phase of NEs against the learning bias. More importantly, we show that LD converges to the optimal objective function values by the joint training under mild assumptions. Experiments demonstrate LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.
results: The study finds that artists and creators working with AI need better information and tools to understand AI's sustainability implications, motivating an explainable-sustainability approach that helps them reason about these impacts.
Abstract
AI is becoming increasingly popular in artistic practices, but the tools for informing practitioners about the environmental impact (and other sustainability implications) of AI are adapted to contexts other than creative practice -- making the tools and sustainability implications of AI inaccessible for artists and creative practitioners. In this position paper, I describe two empirical studies that aim to develop environmental sustainability reflection systems for AI Arts, and discuss and introduce Explainable Sustainability for AI Arts.
Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation
results: Different fine-tuning methods affect Stable Diffusion's behavior in different ways; the paper's systematic evaluation framework helps researchers understand these effects and apply the methods in practice.
Abstract
Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) [https://github.com/KohakuBlueleaf/LyCORIS], an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.
Supersonic: Learning to Generate Source Code Optimizations in C/C++
paper_authors: Zimin Chen, Sen Fang, Martin Monperrus
for: Targets minor source code modifications for optimization.
methods: Presents a neural approach called Supersonic, which uses a seq2seq model to optimize C/C++ programs.
results: Experiments show that Supersonic outperforms OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks, while also minimizing the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.
Abstract
Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic's performance is benchmarked against OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.
Revisiting Softmax Masking for Stability in Continual Learning
methods: Proposes masking the softmax function to preserve confidence distributions during continual learning.
results: Compared with state-of-the-art methods on class- and task-incremental learning benchmarks, the approach markedly increases stability while maintaining sufficient plasticity, and performs especially well with zero or small replay memory.
Abstract
In continual learning, many classifiers use the softmax function to learn confidence. However, numerous studies have pointed out its inability to accurately determine confidence distributions for outliers, often referred to as epistemic uncertainty. This inherent limitation also curtails accurate decisions about what to forget and what to keep in previously trained confidence distributions over the continual learning process. To address the issue, we revisit the effects of masking the softmax function. While this method is both simple and prevalent in the literature, its implication for retaining confidence distributions during continual learning, also known as stability, has been under-investigated. In this paper, we revisit the impact of softmax masking and introduce a methodology to utilize its confidence-preservation effects. In class- and task-incremental learning benchmarks with and without memory replay, our approach significantly increases stability while maintaining sufficiently large plasticity. In the end, our methodology shows better overall performance than state-of-the-art methods, particularly when used with zero or small memory. This lays a simple and effective foundation for strongly stable replay-based continual learning.
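A minimal sketch of the masking idea, assuming the common class-incremental setup: logits outside the current task's class set are driven to negative infinity before the softmax, so the loss does not disturb confidence previously assigned to old classes. The paper's exact masking schedule may differ.

```python
# Masked softmax cross-entropy for a continual learning step.
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, current_classes):
    """logits: (B, C); targets must belong to `current_classes`."""
    mask = torch.full_like(logits, float("-inf"))
    mask[:, current_classes] = 0.0       # only current-task classes stay active
    return F.cross_entropy(logits + mask, targets)
```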
Evaluating Soccer Match Prediction Models: A Deep Learning Approach and Feature Optimization for Gradient-Boosted Trees
results: Results on the validation sets show that the model delivers strong, stable win/draw/loss prediction, outperforming models published for the 2017 Soccer Prediction Challenge.
Abstract
Machine learning models have become increasingly popular for predicting the results of soccer matches, however, the lack of publicly-available benchmark datasets has made model evaluation challenging. The 2023 Soccer Prediction Challenge required the prediction of match results first in terms of the exact goals scored by each team, and second, in terms of the probabilities for a win, draw, and loss. The original training set of matches and features, which was provided for the competition, was augmented with additional matches that were played between 4 April and 13 April 2023, representing the period after which the training set ended, but prior to the first matches that were to be predicted (upon which the performance was evaluated). A CatBoost model was employed using pi-ratings as the features, which were initially identified as the optimal choice for calculating the win/draw/loss probabilities. Notably, deep learning models have frequently been disregarded in this particular task. Therefore, in this study, we aimed to assess the performance of a deep learning model and determine the optimal feature set for a gradient-boosted tree model. The model was trained using the most recent five years of data, and three training and validation sets were used in a hyperparameter grid search. The results from the validation sets show that our model had strong performance and stability compared to previously published models from the 2017 Soccer Prediction Challenge for win/draw/loss prediction.
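For concreteness, a hedged sketch of the CatBoost baseline described above; the pi-rating features are assumed to be computed upstream, and the hyperparameters are library defaults rather than the tuned grid from the hyperparameter search.

```python
# Win/draw/loss probabilities from pi-rating features with CatBoost.
from catboost import CatBoostClassifier

def fit_wdl_model(X_train, y_train):
    """X_train: per-match pi-rating features; y_train: 0=win, 1=draw, 2=loss."""
    model = CatBoostClassifier(loss_function="MultiClass", verbose=False)
    model.fit(X_train, y_train)
    return model  # model.predict_proba(X) yields the three outcome probabilities
```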
Fine-tuning and aligning question answering models for complex information extraction tasks
paper_authors: Matthias Engelbach, Dennis Klau, Felix Scheerer, Jens Drawehn, Maximilien Kintz
for: Proposes an approach for improving feature extraction from German business documents, such as insurance reports or medical leaflets, using extractive question answering (QA) models.
methods: Uses and integrates existing German QA models, fine-tuning them for tailored extraction of complex linguistic features.
results: Fine-tuning the QA models boosts performance on these tasks, even with only a small set of annotated data; the paper also discusses scoring metrics for evaluating information extraction and proposes a combined metric that mimics human experts' assessment criteria.
Abstract
The emergence of Large Language Models (LLMs) has boosted performance and possibilities in various NLP tasks. While the usage of generative AI models like ChatGPT opens up new opportunities for several business use cases, their current tendency to hallucinate fake content strongly limits their applicability to document analysis, such as information retrieval from documents. In contrast, extractive language models like question answering (QA) or passage retrieval models guarantee query results to be found within the boundaries of a given context document, which makes them candidates for more reliable information extraction in productive environments of companies. In this work we propose an approach that uses and integrates extractive QA models for improved feature extraction of German business documents such as insurance reports or medical leaflets into a document analysis solution. We further show that fine-tuning existing German QA models boosts performance for tailored extraction tasks of complex linguistic features like damage cause explanations or descriptions of medication appearance, even when using only a small set of annotated data. Finally, we discuss the relevance of scoring metrics for evaluating information extraction tasks and deduce a combined metric from Levenshtein distance, F1-Score, Exact Match and ROUGE-L to mimic the assessment criteria from human experts.
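A sketch of how such a combined metric could be assembled from the four named components, with equal weights as a placeholder assumption; the paper derives its own combination to mimic expert judgments.

```python
# Combined extraction score: normalized Levenshtein similarity, token F1,
# exact match, and ROUGE-L, averaged with assumed equal weights.
import Levenshtein                     # pip install python-Levenshtein
from rouge_score import rouge_scorer   # pip install rouge-score

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.split(), gold.split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if not common:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def combined_score(pred: str, gold: str) -> float:
    lev = 1 - Levenshtein.distance(pred, gold) / max(len(pred), len(gold), 1)
    em = float(pred.strip() == gold.strip())
    rl = rouge_scorer.RougeScorer(["rougeL"]).score(gold, pred)["rougeL"].fmeasure
    return (lev + token_f1(pred, gold) + em + rl) / 4
```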
Forgetting-aware Linear Bias for Attentive Knowledge Tracing
results: Proposes a simple yet effective solution, Forgetting-aware Linear Bias (FoLiBi), which reflects learners' forgetting behavior as a linear bias. Despite its simplicity, FoLiBi plugs readily into existing attention-based KT models and yields a consistent AUC improvement of up to 2.58% over state-of-the-art KT models on four benchmark datasets.
Abstract
Knowledge Tracing (KT) aims to track proficiency based on a question-solving history, allowing us to offer a streamlined curriculum. Recent studies actively utilize attention-based mechanisms to capture the correlation between questions and combine it with the learner's characteristics for responses. However, our empirical study shows that existing attention-based KT models neglect the learner's forgetting behavior, especially as the interaction history becomes longer. This problem arises from a bias that overprioritizes the correlation of questions while inadvertently ignoring the impact of forgetting behavior. This paper proposes a simple-yet-effective solution, namely Forgetting-aware Linear Bias (FoLiBi), to reflect forgetting behavior as a linear bias. Despite its simplicity, FoLiBi can be readily integrated with existing attentive KT models, effectively decomposing question correlations from forgetting behavior. Plugged into several KT models, FoLiBi yields a consistent improvement of up to 2.58% in AUC over state-of-the-art KT models on four benchmark datasets.
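A minimal sketch of a forgetting-aware linear bias on attention scores, similar in spirit to ALiBi-style distance penalties: a non-negative slope discounts attention paid to older interactions. The paper's exact parameterization may differ, so treat this as illustrative.

```python
# Attention scores with a linear forgetting bias over temporal distance.
import torch

def biased_attention_scores(q, k, slope):
    """q, k: (B, T, D); slope: non-negative scalar controlling forgetting.
    (Causal masking is omitted for brevity.)"""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, T, T)
    t = torch.arange(q.shape[1], device=q.device)
    distance = (t[:, None] - t[None, :]).clamp(min=0)       # steps into the past
    return scores - slope * distance                        # older => lower score
```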
Semantic Map Learning of Traffic Light to Lane Assignment based on Motion Data
results: The approach is implemented and evaluated by re-purposing available motion prediction datasets, and a public API for the Lyft Level 5 dataset lets researchers develop and evaluate their own methods.
Abstract
Understanding which traffic light controls which lane is crucial to navigate intersections safely. Autonomous vehicles commonly rely on High Definition (HD) maps that contain information about the assignment of traffic lights to lanes. The manual provisioning of this information is tedious, expensive, and not scalable. To remedy these issues, our novel approach derives the assignments from traffic light states and the corresponding motion patterns of vehicle traffic. This works in an automated way and independently of the geometric arrangement. We show the effectiveness of basic statistical approaches for this task by implementing and evaluating a pattern-based contribution method. In addition, our novel rejection method includes accompanying safety considerations by leveraging statistical hypothesis testing. Finally, we propose a dataset transformation to re-purpose available motion prediction datasets for semantic map learning. Our publicly available API for the Lyft Level 5 dataset enables researchers to develop and evaluate their own approaches.
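As a toy illustration of the pattern-based idea with the accompanying hypothesis test, one can check, for each traffic light and lane pair, whether lane motion is significantly associated with the green phase and reject pairs that fail the test; the contingency-table construction and significance level below are assumptions, not the paper's exact procedure.

```python
# Per (light, lane) pair: test association between green phase and lane motion.
from scipy.stats import fisher_exact

def plausible_assignment(light_green, lane_moving, alpha=0.01):
    """light_green, lane_moving: aligned lists of booleans per observation."""
    a = sum(g and m for g, m in zip(light_green, lane_moving))            # green, moving
    b = sum(g and not m for g, m in zip(light_green, lane_moving))        # green, halted
    c = sum((not g) and m for g, m in zip(light_green, lane_moving))      # red, moving
    d = sum((not g) and (not m) for g, m in zip(light_green, lane_moving))# red, halted
    _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    return p < alpha   # rejecting independence supports the assignment
```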
paper_authors: Francesco Immorlano, Veronika Eyring, Thomas le Monnier de Gouville, Gabriele Accarino, Donatello Elia, Giovanni Aloisio, Pierre Gentine
results: The approach projects 21st-century global surface temperature fields more precisely, yielding more accurate climate projections for adaptation and mitigation. Specifically, the study finds that the 1.5°C threshold of the Paris Agreement will be crossed in 2031 (2028-2034) for SSP2-4.5, in 2029 (2027-2031) for SSP3-7.0, and in 2028 (2025-2031) for SSP5-8.5; similarly, the 2°C threshold will be exceeded in 2051 (2045-2059), 2044 (2040-2047), and 2042 (2038-2047), respectively.
Abstract
Accurate climate projections are required for climate adaptation and mitigation. Earth system model simulations, used to project climate change, inherently make approximations in their representation of small-scale physical processes, such as clouds, that are at the root of the uncertainties in global mean temperature's response to increased greenhouse gas concentrations. Several approaches have been developed to use historical observations to constrain future projections and reduce uncertainties in climate projections and climate feedbacks. Yet those methods cannot capture the non-linear complexity inherent in the climate system. Using a Transfer Learning approach, we show that Machine Learning, in particular Deep Neural Networks, can be used to optimally leverage and merge the knowledge gained from Earth system model simulations and historical observations to more accurately project global surface temperature fields in the 21st century. For the Shared Socioeconomic Pathways (SSPs) 2-4.5, 3-7.0 and 5-8.5, we refine regional estimates and the global projection of the average global temperature in 2081-2098 (with respect to the period 1850-1900) to 2.73°C (2.44-3.11°C), 3.92°C (3.5-4.47°C) and 4.53°C (3.69-5.5°C), respectively, compared to the unconstrained 2.7°C (1.65-3.8°C), 3.71°C (2.56-4.97°C) and 4.47°C (2.95-6.02°C). Our findings show that the 1.5°C threshold of the Paris Agreement will be crossed in 2031 (2028-2034) for SSP2-4.5, in 2029 (2027-2031) for SSP3-7.0 and in 2028 (2025-2031) for SSP5-8.5. Similarly, the 2°C threshold will be exceeded in 2051 (2045-2059), 2044 (2040-2047) and 2042 (2038-2047) respectively. Our new method provides more accurate climate projections urgently required for climate adaptation.
Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification
results: In few-shot settings where prompt-based fine-tuning is possible, a T5-base model reaches over 75% accuracy with only limited labeled data; active few-shot sampling and ensembling in the prompt-learning pipeline add further gains, and well-designed prompts also substantially lift zero-shot performance.
Abstract
Domain-specific text classification faces the challenge of scarce labeled data due to the high cost of manual labeling. Prompt-learning, known for its efficiency in few-shot scenarios, is proposed as an alternative to traditional fine-tuning methods. Moreover, although large language models (LLMs) have gained prominence, small language models (SLMs, with under 1B parameters) offer significant customizability, adaptability, and cost-effectiveness for domain-specific tasks, given industry constraints. In this study, we investigate the potential of SLMs combined with the prompt-learning paradigm for domain-specific text classification, specifically within customer-agent interactions in retail. Our evaluations show that, in few-shot settings where prompt-based model fine-tuning is possible, T5-base, a typical SLM with 220M parameters, achieves approximately 75% accuracy with limited labeled data (up to 15% of the full data), which shows the great potential of SLMs with prompt-learning. Based on this, we further validate the effectiveness of active few-shot sampling and the ensemble strategy in the prompt-learning pipeline, which contribute to a remarkable performance gain. Besides, in zero-shot settings with a fixed model, we underscore a pivotal observation: although GPT-3.5-turbo, equipped with around 154B parameters, garners an accuracy of 55.16%, the power of well-designed prompts becomes evident when FLAN-T5-large, a model with a mere 0.5% of GPT-3.5-turbo's parameters, achieves an accuracy exceeding 31% with the optimized prompt, a leap from its sub-18% performance with an unoptimized one. Our findings underscore the promise of prompt-learning in classification tasks with SLMs, emphasizing the benefits of active few-shot sampling and ensemble strategies in few-shot settings, and the importance of prompt engineering in zero-shot settings.
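A hedged sketch of prompt-based classification with a small seq2seq model: the input is wrapped in a template and the generated verbalizer token is mapped back to a label. The checkpoint, template, and label verbalizers are illustrative assumptions, not the study's tuned prompts.

```python
# Prompt-learning style classification with a small seq2seq LM.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-base")                 # placeholder SLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
VERBALIZER = {"billing": "billing", "shipping": "shipping", "other": "other"}  # assumed labels

def classify(utterance: str) -> str:
    prompt = f"Classify the customer request: {utterance} Topic:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = tok.decode(model.generate(ids, max_new_tokens=4)[0],
                     skip_special_tokens=True)
    # Map the generated text back onto the nearest verbalizer (exact match here).
    return next((lbl for lbl, v in VERBALIZER.items() if v in out.lower()), "other")
```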
Boosting In-Context Learning with Factual Knowledge
results: Experiments show that KICT substantially outperforms strong baselines with auto-regressive LLMs, improving accuracy by more than 13% on text classification and 7% on question answering.
Abstract
In-Context Learning (ICL) over Large language models (LLMs) aims at solving previously unseen tasks by conditioning on a few training examples, eliminating the need for parameter updates and achieving competitive performance. In this paper, we demonstrate that factual knowledge is imperative for the performance of ICL in three core facets, i.e., the inherent knowledge learned in LLMs, the factual knowledge derived from the selected in-context examples, and the knowledge biases in LLMs for output generation. To unleash the power of LLMs in few-shot learning scenarios, we introduce a novel Knowledgeable In-Context Tuning (KICT) framework to further improve the performance of ICL: 1) injecting factual knowledge to LLMs during continual self-supervised pre-training, 2) judiciously selecting the examples with high knowledge relevance, and 3) calibrating the prediction results based on prior knowledge. We evaluate the proposed approaches on auto-regressive LLMs (e.g., GPT-style models) over multiple text classification and question answering tasks. Experimental results demonstrate that KICT substantially outperforms strong baselines, and improves by more than 13% and 7% of accuracy on text classification and question answering tasks, respectively.
for: This paper aims to suggest a correct program with minimal repair edits for solving introductory programming problems.
methods: The authors use a pre-trained CodeT5 model and fine-tune it on code pairs of wrong and correct programs to suggest a correct program.
results: The fine-tuned CodeT5 achieves a pass@100 of 91.95% and an average edit distance of 6.84, indicating that at least one correct program can be suggested by generating 100 candidate programs.
Abstract
Programmers often struggle to identify and fix bugs in their programs. In recent years, many language models (LMs) have been proposed to fix erroneous programs and support error recovery. However, the LMs tend to generate solutions that differ from the original input programs. This leads to potential comprehension difficulties for users. In this paper, we propose an approach to suggest a correct program with minimal repair edits using CodeT5. We fine-tune a pre-trained CodeT5 on code pairs of wrong and correct programs and evaluate its performance with several baseline models. The experimental results show that the fine-tuned CodeT5 achieves a pass@100 of 91.95% and an average edit distance of the most similar correct program of 6.84, which indicates that at least one correct program can be suggested by generating 100 candidate programs. We demonstrate the effectiveness of LMs in suggesting program repair with minimal edits for solving introductory programming problems.
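A small sketch of the implied evaluation protocol: sample k candidate fixes, count the problem as solved if any candidate passes the tests (pass@k), and report the edit distance of the closest passing fix to the original submission. The `passes_tests` harness is an assumption, not part of the paper's release.

```python
# pass@k and minimal-edit evaluation for candidate program repairs.
import Levenshtein  # pip install python-Levenshtein

def evaluate(wrong_program, candidates, passes_tests):
    passing = [c for c in candidates if passes_tests(c)]
    if not passing:
        return {"solved": False, "min_edit_distance": None}
    dists = [Levenshtein.distance(wrong_program, c) for c in passing]
    return {"solved": True, "min_edit_distance": min(dists)}
```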
Age Minimization in Massive IoT via UAV Swarm: A Multi-agent Reinforcement Learning Approach
paper_authors: Eslam Eldeeb, Mohammad Shehab, Hirley Alves
for: Addresses the high-dimensional problem of deploying a swarm of UAVs to collect fresh information from IoT devices, with the aim of minimizing the overall age of information in the IoT network.
methods: Applies multi-agent deep reinforcement learning, in both cooperative and partially cooperative variants.
results: Both the cooperative and partially cooperative multi-agent deep reinforcement learning approaches outperform the centralized deep reinforcement learning approach, especially in large-scale networks.
Abstract
In many massive IoT communication scenarios, the IoT devices require coverage from dynamic units that can move close to the IoT devices and reduce the uplink energy consumption. A robust solution is to deploy a large number of UAVs (UAV swarm) to provide coverage and a better line of sight (LoS) for the IoT network. However, the study of these massive IoT scenarios with a massive number of serving units leads to high dimensional problems with high complexity. In this paper, we apply multi-agent deep reinforcement learning to address the high-dimensional problem that results from deploying a swarm of UAVs to collect fresh information from IoT devices. The target is to minimize the overall age of information in the IoT network. The results reveal that both cooperative and partially cooperative multi-agent deep reinforcement learning approaches are able to outperform the high-complexity centralized deep reinforcement learning approach, which stands helpless in large-scale networks.
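As a toy illustration of the objective, each device's age of information resets when a UAV collects from it and grows otherwise, and the swarm is rewarded for keeping the network-wide average age low; the dynamics below are deliberately simplified assumptions.

```python
# Toy age-of-information bookkeeping for a UAV data-collection step.
def step_ages(ages: dict, collected: set) -> dict:
    """ages: device -> current age; collected: devices visited this step."""
    return {d: (0 if d in collected else a + 1) for d, a in ages.items()}

def reward(ages: dict) -> float:
    return -sum(ages.values()) / len(ages)   # minimize the overall average age
```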
Ego-perspective enhanced fitness training experience of AR Try to Move game
results: Provides an AR Try to Move game and a CNN model that identifies and classifies user gestures quickly and accurately, helping users train the upper-limb muscle system remotely with greater effectiveness and convenience.
Abstract
AR, a recently emerging technology, has been widely used in entertainment to provide users with immersive, interactive, and, sometimes, engaging experiences. The process of rehabilitation treatment and motor training is often boring, and it is well known that users often exercise less efficiently on their own than in a rehabilitation institution. Thus far, there is no effective ego-perspective upper limb sports rehabilitation training game. Hence, with the objective of enhancing the enjoyment of rehabilitation and enabling more effective remote rehabilitation training, this work provides an AR Try to Move game and a convolutional neural network (CNN) for identifying and classifying user gestures from a self-collected AR multiple interactive gestures dataset. Utilizing an AR game scoring system, users are incentivized to strengthen their upper limb muscle system through remote training with greater effectiveness and convenience.
ANNCRIPS: Artificial Neural Networks for Cancer Research In Prediction & Survival
paper_authors: Amit Mathapati
for: Develop and validate an intelligent mathematical model using Artificial Neural Networks (ANNs) to enhance the early detection of prostate cancer.
methods: The model uses ANNs to analyze clinical and laboratory data, such as PSA and DRE results, to improve detection accuracy and reduce false positives.
results: The model shows promising potential to reduce false positives and improve patient outcomes; with further refinement, it could become a robust, marketable solution for prostate cancer detection.
Abstract
Prostate cancer is a prevalent malignancy among men aged 50 and older. Current diagnostic methods primarily rely on blood tests measuring Prostate-Specific Antigen (PSA) levels and on Digital Rectal Examinations (DREs). However, these methods suffer from a significant rate of false positive results. This study focuses on the development and validation of an intelligent mathematical model utilizing Artificial Neural Networks (ANNs) to enhance the early detection of prostate cancer. The primary objective of this research paper is to present a novel mathematical model designed to aid in the early detection of prostate cancer, facilitating prompt intervention by healthcare professionals. The model's implementation demonstrates promising potential in reducing the incidence of false positives, thereby improving patient outcomes. Furthermore, we envision that, with further refinement, extensive testing, and validation, this model can evolve into a robust, marketable solution for prostate cancer detection. The long-term goal is to make this solution readily available for deployment in various screening centers, hospitals, and research institutions, ultimately contributing to more effective cancer screening and patient care.
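As a hedged illustration of the kind of model the abstract describes (the authors' exact architecture and features are not specified here), the sketch below trains a small neural network on synthetic PSA/DRE-style screening features; all feature values, the labeling rule, and the hyperparameters are fabricated for demonstration and have no clinical validity.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000
psa = rng.lognormal(mean=1.0, sigma=0.8, size=n)   # synthetic PSA level (ng/mL)
age = rng.integers(50, 85, size=n)                 # synthetic patient age
dre = rng.integers(0, 2, size=n)                   # synthetic DRE finding (0/1)
X = np.column_stack([psa, age, dre])
# toy label rule: risk rises with PSA and an abnormal DRE (not a clinical rule)
y = ((psa > 4.0) & (dre == 1) | (psa > 10.0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0),
)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```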
Legal Question-Answering in the Indian Context: Efficacy, Challenges, and Potential of Modern AI Models
results: The study finds that existing AILQA approaches are highly capable of interpreting natural-language prompts and generating precise answers.
Abstract
Legal QA platforms bear the promise to metamorphose the manner in which legal experts engage with jurisprudential documents. In this exposition, we embark on a comparative exploration of contemporary AI frameworks, gauging their adeptness in catering to the unique demands of the Indian legal milieu, with a keen emphasis on Indian Legal Question Answering (AILQA). Our discourse zeroes in on an array of retrieval and QA mechanisms, positioning the OpenAI GPT model as a reference point. The findings underscore the proficiency of prevailing AILQA paradigms in decoding natural language prompts and churning out precise responses. The ambit of this study is tethered to the Indian criminal legal landscape, distinguished by its intricate nature and associated logistical constraints. To ensure a holistic evaluation, we juxtapose empirical metrics with insights garnered from seasoned legal practitioners, thereby painting a comprehensive picture of AI's potential and challenges within the realm of Indian legal QA.
Effective Multi-Agent Deep Reinforcement Learning Control with Relative Entropy Regularization
results: MACDPP shows clear advantages in learning capability and sample efficiency on multi-agent cooperation and competition tasks as well as on traditional control tasks such as OpenAI benchmarks and robot arm manipulation.
Abstract
In this paper, a novel Multi-agent Reinforcement Learning (MARL) approach, Multi-Agent Continuous Dynamic Policy Gradient (MACDPP), is proposed to tackle the issues of limited capability and sample efficiency in various scenarios controlled by multiple agents. It alleviates the inconsistency of multiple agents' policy updates by introducing relative entropy regularization into the Centralized Training with Decentralized Execution (CTDE) framework with the Actor-Critic (AC) structure. Evaluated on multi-agent cooperation and competition tasks and on traditional control tasks including OpenAI benchmarks and robot arm manipulation, MACDPP demonstrates significant superiority in learning capability and sample efficiency compared with both related multi-agent and widely implemented single-agent baselines, and therefore expands the potential of MARL for effectively learning challenging control scenarios.
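The core ingredient is a relative-entropy (KL) regularizer on policy updates inside an actor-critic CTDE setup. The sketch below is a minimal, hypothetical rendering (not the authors' code) of one way to penalize divergence between an agent's new and old Gaussian policies in PyTorch; the coefficient `beta` is a fabricated default.

```python
import torch
from torch.distributions import Normal, kl_divergence

def regularized_actor_loss(mu_new, std_new, mu_old, std_old,
                           log_prob_new, advantage, beta=0.01):
    """Policy-gradient loss plus a relative-entropy penalty on the update.

    mu_*, std_*: parameters of the new/old Gaussian policies at sampled states
    log_prob_new: log-probability of the taken actions under the new policy
    advantage: critic-estimated advantages for those actions
    """
    pg_loss = -(log_prob_new * advantage.detach()).mean()
    kl = kl_divergence(Normal(mu_old.detach(), std_old.detach()),
                       Normal(mu_new, std_new)).mean()
    return pg_loss + beta * kl

# toy usage with fabricated tensors for a batch of 8 two-dimensional actions
mu_new = torch.zeros(8, 2, requires_grad=True)
std_new = torch.ones(8, 2)
mu_old, std_old = torch.zeros(8, 2), torch.ones(8, 2)
actions = Normal(mu_new, std_new).rsample()
loss = regularized_actor_loss(
    mu_new, std_new, mu_old, std_old,
    Normal(mu_new, std_new).log_prob(actions).sum(-1),
    advantage=torch.randn(8))
loss.backward()
```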
results: The paper finds that personal large models can deliver high-quality results across a wide range of applications, such as language and vision tasks. Moreover, these models can run in real time on personal computers or mobile devices.
Abstract
Inspired by Federated Learning, in this paper, we propose personal large models that are distilled from traditional large language models but more adaptive to local users' personal information such as education background and hobbies. We classify the large language models into three levels: the personal level, expert level and traditional level. The personal level models are adaptive to users' personal information. They encrypt the users' input and protect their privacy. The expert level models focus on merging specific knowledge such as finance, IT and art. The traditional models focus on the universal knowledge discovery and upgrading the expert models. In such classifications, the personal models directly interact with the user. For the whole system, the personal models hold users' (encrypted) personal information. Moreover, such models must be small enough to run on personal computers or mobile devices. Finally, they have to respond in real time for a better user experience and produce high-quality results. The proposed personal large models can be applied in a wide range of applications such as language and vision tasks.
Optimizing delegation between human and AI collaborative agents
paper_authors: Andrew Fuchs, Andrea Passarella, Marco Conti
for: This study aims to help humans form hybrid teams with autonomous agents and to delegate actions to team members accurately.
methods: The study learns a manager model through observations of team performance, without restricting agents to matching dynamics.
results: The results show that the manager model makes effective delegation decisions across differing representations of the environment, outperforming alternative approaches.
Abstract
In the context of humans operating with artificial or autonomous agents in a hybrid team, it is essential to accurately identify when to authorize those team members to perform actions. Given past examples where humans and autonomous systems can either succeed or fail at tasks, we seek to train a delegating manager agent to make delegation decisions with respect to these potential performance deficiencies. Additionally, we cannot always expect the various agents to operate within the same underlying model of the environment. It is possible to encounter cases where the actions and transitions would vary between agents. Therefore, our framework provides a manager model which learns through observations of team performance without restricting agents to matching dynamics. Our results show our manager learns to perform delegation decisions with teams of agents operating under differing representations of the environment, significantly outperforming alternative methods to manage the team.
From Asset Flow to Status, Action and Intention Discovery: Early Malice Detection in Cryptocurrency
For: This study aims to develop a model for early detection of malicious activity in cryptocurrency, addressing the limitations of existing detectors, which rely on uninterpretable deep learning and only support retrospective analysis of specific illicit types.
Methods: The study defines asset transfer paths with Decision-Tree based feature Selection and Complement (DT-SC), then uses a Status/Action Proposal Module (S/A-PM) and an Intention-VAE module to generate status, action, intent-snippet, and hidden intent-snippet embeddings.
Results: Experiments on three real-world datasets show that the proposed algorithm outperforms prior methods in both detection speed and interpretability, and well-designed loss functions further enhance prediction speed and model interpretability.
Abstract
Cryptocurrency has been subject to illicit activities probably more often than traditional financial assets due to the pseudo-anonymous nature of its transacting entities. An ideal detection model is expected to achieve all three critical properties of (I) early detection, (II) good interpretability, and (III) versatility for various illicit activities. However, existing solutions cannot meet all these requirements, as most of them heavily rely on deep learning without interpretability and are only available for retrospective analysis of a specific illicit type. To tackle all these challenges, we propose Intention-Monitor for early malice detection in Bitcoin (BTC), where the on-chain record data for a certain address are much scarcer than other cryptocurrency platforms. We first define asset transfer paths with the Decision-Tree based feature Selection and Complement (DT-SC) to build different feature sets for different malice types. Then, the Status/Action Proposal Module (S/A-PM) and the Intention-VAE module generate the status, action, intent-snippet, and hidden intent-snippet embedding. With all these modules, our model is highly interpretable and can detect various illegal activities. Moreover, well-designed loss functions further enhance the prediction speed and model's interpretability. Extensive experiments on three real-world datasets demonstrate that our proposed algorithm outperforms the state-of-the-art methods. Furthermore, additional case studies justify our model can not only explain existing illicit patterns but can also find new suspicious characters.
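The DT-SC step selects malice-type-specific features with decision trees. Below is a minimal, hypothetical sketch of tree-based feature selection (not the authors' pipeline) using scikit-learn; the feature matrix is a fabricated stand-in for per-address transaction features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# fabricated stand-in for per-address transaction features and malice labels
X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           random_state=0)

selector = SelectFromModel(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    threshold="mean")               # keep features above mean tree importance
X_sel = selector.fit_transform(X, y)
print("kept", X_sel.shape[1], "of", X.shape[1], "features")
```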
Are Human-generated Demonstrations Necessary for In-context Learning?
results: On four benchmarks covering arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation, SEC performs strongly without any human-crafted demonstrations, achieving results comparable to ICL with hand-crafted demonstrations. This suggests that today's large language models possess sufficient competence on many tasks to rely on their own decision making, removing the need for external training data.
Abstract
Despite the promising few-shot ability of large language models (LLMs), the standard paradigm of In-context Learning (ICL) suffers from susceptibility to the selected demonstrations and from the intricacy of generating them. In this paper, we raise the fundamental question of whether human-generated demonstrations are necessary for ICL. To answer this question, we propose the self-contemplation prompting strategy (SEC), a paradigm free from human-crafted demonstrations. The key point of SEC is that, instead of using hand-crafted examples as demonstrations in ICL, SEC asks LLMs to first create demonstrations on their own, based on which the final output is generated. SEC is a flexible framework and can be adapted to both vanilla ICL and chain-of-thought (CoT) prompting, but with greater ease, since the manual generation of both examples and rationales can be skipped. Extensive experiments on arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy and achieves comparable results to ICL with hand-crafted demonstrations. This demonstrates that, for many tasks, contemporary LLMs possess a sufficient level of competence to exclusively depend on their own capacity for decision making, removing the need for external training data. Code is available at https://github.com/ruili33/SEC.
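A minimal sketch of the SEC idea follows, with a stub `llm` callable standing in for any chat model; the exact prompt wording is hypothetical, not taken from the paper. The model first writes its own demonstrations, then answers conditioned on them.

```python
def llm(prompt: str) -> str:
    # stub: replace with a call to any large language model API
    raise NotImplementedError

def sec_answer(question: str, n_demos: int = 3) -> str:
    # Step 1: ask the model to invent its own demonstrations
    demo_prompt = (
        f"Write {n_demos} example question-answer pairs for problems "
        f"similar to the following, with worked solutions:\n{question}"
    )
    demonstrations = llm(demo_prompt)
    # Step 2: answer the real question with the self-generated demos as context
    final_prompt = f"{demonstrations}\n\nNow answer:\n{question}"
    return llm(final_prompt)
```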
XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection
results: The results show that XGV-BERT achieves higher detection accuracy than the two existing methods VulDeePecker and SySeVR. On the VulDeePecker dataset, XGV-BERT reaches an F1-score of 97.5%, far above VulDeePecker's 78.3%; on the SySeVR dataset, it reaches an F1-score of 95.5%, surpassing SySeVR's 83.5%.
Abstract
With the advancement of deep learning (DL) in various fields, there have been many attempts to reveal software vulnerabilities by data-driven approaches. Nonetheless, such existing works lack an effective representation that can retain the non-sequential semantic characteristics and contextual relationships of source code attributes. Hence, in this work, we propose XGV-BERT, a framework that combines the pre-trained CodeBERT model and a Graph Convolutional Network (GCN) to detect software vulnerabilities. By jointly training the CodeBERT and GCN modules within XGV-BERT, the proposed model leverages the advantages of large-scale pre-training, harnessing vast raw data, and transfer learning by learning representations for training data through graph convolution. The research results demonstrate that the XGV-BERT method significantly improves vulnerability detection accuracy compared to two existing methods, VulDeePecker and SySeVR. For the VulDeePecker dataset, XGV-BERT achieves an impressive F1-score of 97.5%, significantly outperforming VulDeePecker, which achieved an F1-score of 78.3%. Similarly, on the SySeVR dataset, XGV-BERT achieves an F1-score of 95.5%, surpassing the results of SySeVR with an F1-score of 83.5%.
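A hedged sketch of the central combination: one graph-convolution layer propagating node embeddings (which CodeBERT would supply; random stand-ins here) over a toy code graph. The real XGV-BERT jointly trains both modules; this only illustrates the propagation step.

```python
import numpy as np

n, d_in, d_out = 5, 8, 4
H = np.random.randn(n, d_in)                  # stand-in for CodeBERT embeddings
A = np.eye(n)                                 # adjacency with self-loops
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0   # toy code-graph edges

deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))       # symmetric normalization D^-1/2 A D^-1/2
W = np.random.randn(d_in, d_out) * 0.1        # layer weights

H_next = np.maximum(A_hat @ H @ W, 0.0)       # one GCN layer: ReLU(A_hat H W)
print(H_next.shape)                           # (5, 4)
```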
Leveraging Herpangina Data to Enhance Hospital-level Prediction of Hand-Foot-and-Mouth Disease Admissions Using UPTST
results: The model shows significant advantages in both long- and short-arm prediction accuracy at the hospital level, and exploratory extension experiments suggest broader applicability.
Abstract
Outbreaks of hand-foot-and-mouth disease(HFMD) have been associated with significant morbidity and, in severe cases, mortality. Accurate forecasting of daily admissions of pediatric HFMD patients is therefore crucial for aiding the hospital in preparing for potential outbreaks and mitigating nosocomial transmissions. To address this pressing need, we propose a novel transformer-based model with a U-net shape, utilizing the patching strategy and the joint prediction strategy that capitalizes on insights from herpangina, a disease closely correlated with HFMD. This model also integrates representation learning by introducing reconstruction loss as an auxiliary loss. The results show that our U-net Patching Time Series Transformer (UPTST) model outperforms existing approaches in both long- and short-arm prediction accuracy of HFMD at hospital-level. Furthermore, the exploratory extension experiments show that the model's capabilities extend beyond prediction of infectious disease, suggesting broader applicability in various domains.
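The patching strategy mentioned in the abstract splits a time series into fixed-size patches that serve as transformer tokens. A minimal, hypothetical illustration (patch length and stride fabricated, not the authors' settings):

```python
import numpy as np

series = np.arange(32, dtype=float)      # fabricated daily-admission counts
patch_len, stride = 8, 4                 # hypothetical patch settings

starts = range(0, len(series) - patch_len + 1, stride)
patches = np.stack([series[s:s + patch_len] for s in starts])
print(patches.shape)                     # (7, 8): 7 tokens of length 8
```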
ALEX: Towards Effective Graph Transfer Learning with Noisy Labels
methods: The study proposes a new technique, Balance Alignment and Information-aware Examination (ALEX), which applies singular value decomposition to generate structural views, provides robust node representations via graph contrastive learning, and estimates a prior distribution to build subgraphs with balanced label distributions.
results: Extensive experiments on multiple benchmark datasets substantiate the outstanding superiority of ALEX in different settings.
Abstract
Graph Neural Networks (GNNs) have garnered considerable interest due to their exceptional performance in a wide range of graph machine learning tasks. Nevertheless, the majority of GNN-based approaches have been examined using well-annotated benchmark datasets, leading to suboptimal performance in real-world graph learning scenarios. To bridge this gap, the present paper investigates the problem of graph transfer learning in the presence of label noise, which transfers knowledge from a noisy source graph to an unlabeled target graph. We introduce a novel technique termed Balance Alignment and Information-aware Examination (ALEX) to address this challenge. ALEX first employs singular value decomposition to generate different views with crucial structural semantics, which help provide robust node representations using graph contrastive learning. To mitigate both label shift and domain shift, we estimate a prior distribution to build subgraphs with balanced label distributions. Building on this foundation, an adversarial domain discriminator is incorporated for the implicit domain alignment of complex multi-modal distributions. Furthermore, we project node representations into a different space, optimizing the mutual information between the projected features and labels. Subsequently, the inconsistency of similarity structures is evaluated to identify noisy samples with potential overfitting. Comprehensive experiments on various benchmark datasets substantiate the outstanding superiority of the proposed ALEX in different settings.
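ALEX's first step builds alternative graph views via singular value decomposition. Below is a hedged sketch of one common instantiation, low-rank reconstruction of the adjacency matrix; the rank and the toy graph are fabricated, and the paper's exact construction may differ.

```python
import numpy as np

A = (np.random.rand(50, 50) < 0.1).astype(float)   # toy adjacency matrix
A = np.maximum(A, A.T)                             # symmetrize the graph

U, s, Vt = np.linalg.svd(A)
k = 5                                              # hypothetical rank
A_view = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # low-rank structural view

print(np.linalg.norm(A - A_view) / np.linalg.norm(A))  # relative reconstruction error
```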
Learning Emergent Behavior in Robot Swarms with NEAT
results: We run simulations of the Georgia Tech Miniature Autonomous Blimps (GT-MABs) aerial robotics platform and also test on simulated Anki Vector robots. We evaluate the algorithm on several tasks, including an Area Coverage task, a Surround Target task, and a Wall Climb task, and compare the behaviors evolved by our algorithm against 'designed policies', finding that the evolved behaviors better achieve the desired swarm behavior.
Abstract
When researching robot swarms, many studies observe complex group behavior emerging from the individual agents' simple local actions. However, the task of learning an individual policy to produce a desired emergent behavior remains a challenging and largely unsolved problem. We present a method of training distributed robotic swarm algorithms to produce emergent behavior. Inspired by the biological evolution of emergent behavior in animals, we use an evolutionary algorithm to train a 'population' of individual behaviors to approximate a desired group behavior. We perform experiments using simulations of the Georgia Tech Miniature Autonomous Blimps (GT-MABs) aerial robotics platforms conducted in the CoppeliaSim simulator. Additionally, we test on simulations of Anki Vector robots to display our algorithm's effectiveness on various modes of actuation. We evaluate our algorithm on various tasks where a somewhat complex group behavior is required for success. These tasks include an Area Coverage task, a Surround Target task, and a Wall Climb task. We compare behaviors evolved using our algorithm against 'designed policies', which we create in order to exhibit the emergent behaviors we desire.
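A minimal sketch of the evolutionary training loop the abstract describes, with a fabricated fitness function standing in for the swarm simulation score (e.g., area covered). The actual work uses NEAT, which also evolves network topology; this toy evolves only a flat parameter vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    # stand-in for a simulated group-behavior score
    return -np.sum((params - 0.5) ** 2)

pop = rng.normal(size=(20, 10))              # population of policy parameters
for gen in range(50):
    scores = np.array([fitness(p) for p in pop])
    elite = pop[np.argsort(scores)[-5:]]     # keep the 5 best individuals
    children = elite[rng.integers(0, 5, 15)] + 0.1 * rng.normal(size=(15, 10))
    pop = np.vstack([elite, children])       # next generation

print("best fitness:", max(fitness(p) for p in pop))
```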
CoFiI2P: Coarse-to-Fine Correspondences for Image-to-Point Cloud Registration
results: Experiments on the KITTI dataset show that CoFiI2P achieves excellent results, with a relative rotation error of 1.14 degrees and a relative translation error of 0.29 meters, improving on the previous state of the art by 84% and 89%, respectively.
Abstract
Image-to-point cloud (I2P) registration is a fundamental task in the field of autonomous vehicles and transportation systems for cross-modality data fusion and localization. Existing I2P registration methods estimate correspondences at the point/pixel level, often overlooking global alignment. However, I2P matching can easily converge to a local optimum when performed without high-level guidance from global constraints. To address this issue, this paper introduces CoFiI2P, a novel I2P registration network that extracts correspondences in a coarse-to-fine manner to achieve the globally optimal solution. First, the image and point cloud data are processed through a Siamese encoder-decoder network for hierarchical feature extraction. Second, a coarse-to-fine matching module is designed to leverage these features and establish robust feature correspondences. Specifically, in the coarse matching phase, a novel I2P transformer module is employed to capture both homogeneous and heterogeneous global information from the image and point cloud data. This enables the estimation of coarse super-point/super-pixel matching pairs with discriminative descriptors. In the fine matching module, point/pixel pairs are established with the guidance of super-point/super-pixel correspondences. Finally, based on matching pairs, the transform matrix is estimated with the EPnP-RANSAC algorithm. Extensive experiments conducted on the KITTI dataset demonstrate that CoFiI2P achieves impressive results, with a relative rotation error (RRE) of 1.14 degrees and a relative translation error (RTE) of 0.29 meters. These results represent a significant improvement of 84% in RRE and 89% in RTE compared to the current state-of-the-art (SOTA) method. Qualitative results are available at https://youtu.be/ovbedasXuZE. The source code will be publicly released at https://github.com/kang-1-2-3/CoFiI2P.
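The final pose estimation step uses EPnP inside a RANSAC loop. Given matched 3D points and 2D pixels (fabricated below, where CoFiI2P's coarse-to-fine matcher would supply them), OpenCV's solvePnPRansac provides this; the intrinsics and ground-truth pose are made up for the demo.

```python
import cv2
import numpy as np

# fabricated correspondences: 3D points in the cloud and their 2D projections
obj_pts = np.random.rand(30, 3).astype(np.float32) * 10
K = np.array([[700, 0, 320], [0, 700, 240], [0, 0, 1]], dtype=np.float32)
rvec_gt = np.array([0.1, -0.2, 0.05], dtype=np.float32)
tvec_gt = np.array([0.5, 0.1, 8.0], dtype=np.float32)
img_pts, _ = cv2.projectPoints(obj_pts, rvec_gt, tvec_gt, K, None)
img_pts = img_pts.astype(np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts, img_pts, K, None, flags=cv2.SOLVEPNP_EPNP)
print(ok, rvec.ravel(), tvec.ravel())   # should recover rvec_gt / tvec_gt
```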
Divide and Conquer in Video Anomaly Detection: A Comprehensive Review and New Approach
results: Based on the insights gained from the review, a new approach is proposed that integrates human skeletal frameworks with video data analysis techniques; it achieves state-of-the-art performance on the ShanghaiTech dataset, surpassing all existing advanced methods.
Abstract
Video anomaly detection is a complex task, and the principle of "divide and conquer" is often regarded as an effective approach to tackling intricate issues. It's noteworthy that recent methods in video anomaly detection have revealed the application of the divide and conquer philosophy (albeit with distinct perspectives from traditional usage), yielding impressive outcomes. This paper systematically reviews these literatures from six dimensions, aiming to enhance the use of the divide and conquer strategy in video anomaly detection. Furthermore, based on the insights gained from this review, a novel approach is presented, which integrates human skeletal frameworks with video data analysis techniques. This method achieves state-of-the-art performance on the ShanghaiTech dataset, surpassing all existing advanced methods.
Towards A Unified Utilitarian Ethics Framework for Healthcare Artificial Intelligence
paper_authors: Forhan Bin Emdad, Shuyuan Mary Ho, Benhur Ravuri, Shezin Hussain
for: This study aims to identify the major ethical principles influencing the utility performance of AI in healthcare settings and to propose a new utilitarian ethics-based theoretical framework for designing ethical AI.
methods: The study uses a thematic analysis of secondary survey data from 36 AI experts to identify the top ethical principles of AI design, and a meta-analysis to categorize the ethical issues in AI design.
results: The study found that justice, privacy, bias, lack of regulations, risks, and interpretability are the most important ethical principles to consider for ethical AI in healthcare settings. The proposed theoretical framework is based on utilitarian ethics and aims to resolve the ethical issues identified by the meta-analysis and domain experts.
Abstract
Artificial Intelligence (AI) aims to elevate healthcare to a pinnacle by aiding clinical decision support. Overcoming the challenges related to the design of ethical AI will enable clinicians, physicians, healthcare professionals, and other stakeholders to use and trust AI in healthcare settings. This study attempts to identify the major ethical principles influencing the utility performance of AI at different technological levels such as data access, algorithms, and systems through a thematic analysis. We observed that justice, privacy, bias, lack of regulations, risks, and interpretability are the most important principles to consider for ethical AI. This data-driven study has analyzed secondary survey data from the Pew Research Center (2020) of 36 AI experts to categorize the top ethical principles of AI design. To resolve the ethical issues identified by the meta-analysis and domain experts, we propose a new utilitarian ethics-based theoretical framework for designing ethical AI for the healthcare domain.
Unsupervised Graph Deep Learning Reveals Emergent Flood Risk Profile of Urban Areas
results: Using data from multiple metropolitan statistical areas (MSAs) in the United States, the model characterizes each MSA's flood risk into six distinct levels and supports feature analysis of the areas within each level, identifying three archetypes that shape the highest flood risk in each MSA. Flood risk is found to be hierarchically distributed within each MSA, with the core city bearing a disproportionate share.
Abstract
Urban flood risk emerges from complex and nonlinear interactions among multiple features related to flood hazard, flood exposure, and social and physical vulnerabilities, along with the complex spatial flood dependence relationships. Existing approaches for characterizing urban flood risk, however, are primarily based on flood plain maps, focusing on a limited number of features, primarily hazard and exposure features, without consideration of feature interactions or the dependence relationships among spatial areas. To address this gap, this study presents an integrated urban flood-risk rating model based on a novel unsupervised graph deep learning model (called FloodRisk-Net). FloodRisk-Net is capable of capturing spatial dependence among areas and complex and nonlinear interactions among flood hazards and urban features for specifying emergent flood risk. Using data from multiple metropolitan statistical areas (MSAs) in the United States, the model characterizes their flood risk into six distinct city-specific levels. The model is interpretable and enables feature analysis of areas within each flood-risk level, allowing for the identification of the three archetypes shaping the highest flood risk within each MSA. Flood risk is found to be spatially distributed in a hierarchical structure within each MSA, where the core city disproportionately bears the highest flood risk. Multiple cities are found to have high overall flood-risk levels and low spatial inequality, indicating limited options for balancing urban development and flood-risk reduction. Relevant flood-risk reduction strategies are discussed considering ways that the highest flood risk and uneven spatial distribution of flood risk are formed.
Efficient Post-training Quantization with FP8 Formats
for: The paper is written to study the advantages of FP8 data formats for post-training quantization of deep learning models, and to develop a quantization workflow that generalizes across different network architectures.
methods: The paper examines three different FP8 representations (E5M2, E4M3, and E3M4) and compares their effects on model accuracy, and also uses Intel Neural Compressor for quantization.
results: The paper finds that FP8 formats outperform INT8 in multiple aspects, including workload coverage, model accuracy, and suitability for a broader range of operations. Additionally, the paper finds that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
Abstract
Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
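A hedged sketch of simulated FP8-style rounding: E4M3 stores 3 mantissa bits and E5M2 stores 2. This toy ignores subnormals, NaN encodings, and proper saturation, so it is illustrative rather than bit-exact, and the exponent ranges are rough assumptions.

```python
import numpy as np

def fake_quantize(x, n_mantissa=3, e_min=-6, e_max=8):
    """Round values to the nearest float with n_mantissa mantissa bits."""
    m, e = np.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    e = np.clip(e, e_min, e_max)          # crude exponent-range handling
    scale = 2.0 ** (n_mantissa + 1)
    return np.round(m * scale) / scale * 2.0 ** e

w = np.random.randn(5).astype(np.float32)
print(w)
print(fake_quantize(w))                   # E4M3-like precision
print(fake_quantize(w, n_mantissa=2))     # E5M2-like precision
```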
Joint Communication and Computation Framework for Goal-Oriented Semantic Communication with Distortion Rate Resilience
methods: The study uses rate-distortion theory to analyze the distortions induced by communication and semantic compression, thereby estimating the empirical performance of AI tasks.
results: Experimental results show that the proposed method preserves AI task accuracy while adhering to network constraints, a valuable contribution to goal-oriented semantic communication. The work also highlights the role of data-driven approaches in optimizing the performance of intelligent systems.
Abstract
Recent research efforts on semantic communication have mostly considered accuracy as a main problem for optimizing goal-oriented communication systems. However, these approaches introduce a paradox: the accuracy of artificial intelligence (AI) tasks should naturally emerge through training rather than being dictated by network constraints. Acknowledging this dilemma, this work introduces an innovative approach that leverages the rate-distortion theory to analyze distortions induced by communication and semantic compression, thereby analyzing the learning process. Specifically, we examine the distribution shift between the original data and the distorted data, thus assessing its impact on the AI model's performance. Founding upon this analysis, we can preemptively estimate the empirical accuracy of AI tasks, making the goal-oriented semantic communication problem feasible. To achieve this objective, we present the theoretical foundation of our approach, accompanied by simulations and experiments that demonstrate its effectiveness. The experimental results indicate that our proposed method enables accurate AI task performance while adhering to network constraints, establishing it as a valuable contribution to the field of signal processing. Furthermore, this work advances research in goal-oriented semantic communication and highlights the significance of data-driven approaches in optimizing the performance of intelligent systems.
Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer
paper_authors: Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Sidney Fels, Jerry L. Prince, Georges El Fakhri, Jonghye Woo
for: This paper studies speech synthesis, specifically translating weighting maps into their corresponding audio waveforms.
methods: The paper uses non-negative matrix factorization to estimate the motion features of functional units, and a deep-learning framework to translate weighting maps into audio waveforms.
results: Experiments show that the method can synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
Abstract
The tongue's intricate 3D structure, comprising localized functional units, plays a crucial role in the production of speech. When measured using tagged MRI, these functional units exhibit cohesive displacements and derived quantities that facilitate the complex process of speech production. Non-negative matrix factorization-based approaches have been shown to estimate the functional units through motion features, yielding a set of building blocks and a corresponding weighting map. Investigating the link between weighting maps and speech acoustics can offer significant insights into the intricate process of speech production. To this end, in this work, we utilize two-dimensional spectrograms as a proxy representation, and develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms. Our proposed plastic light transformer (PLT) framework is based on directional product relative position bias and single-level spatial pyramid pooling, thus enabling flexible processing of weighting maps with variable size to fixed-size spectrograms, without input information loss or dimension expansion. Additionally, our PLT framework efficiently models the global correlation of wide matrix input. To improve the realism of our generated spectrograms with relatively limited training samples, we apply pair-wise utterance consistency with Maximum Mean Discrepancy constraint and adversarial training. Experimental results on a dataset of 29 subjects speaking two utterances demonstrated that our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
methods: The study proposes a new loss function, Continuously Weighted Contrastive Loss (CWCL), which uses a continuous measure of similarity to align the representation space of one modality with that of another.
results: Experiments across multiple models, datasets, and modalities show that CWCL outperforms existing methods for zero-shot transfer, particularly in image classification and speech classification. Specifically, the models achieve 5-8% (absolute) improvement over previous state-of-the-art methods in 0-shot image classification and 20-30% (absolute) improvement in 0-shot speech-to-intent classification and keyword classification.
Abstract
This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-trained model in one modality is used for representation learning in another domain using pairwise data. The learnt models in the latter domain can then be used for a diverse set of tasks in a zero-shot way, similar to ``Contrastive Language-Image Pre-training (CLIP)'' and ``Locked-image Tuning (LiT)'' that have recently gained considerable attention. Most existing works for cross-modal representation alignment (including CLIP and LiT) use the standard contrastive training objective, which employs sets of positive and negative examples to align similar and repel dissimilar training data samples. However, similarity amongst training examples has a more continuous nature, thus calling for a more `non-binary' treatment. To address this, we propose a novel loss function called Continuously Weighted Contrastive Loss (CWCL) that employs a continuous measure of similarity. With CWCL, we seek to align the embedding space of one modality with another. Owing to the continuous nature of similarity in the proposed loss function, these models outperform existing methods for 0-shot transfer across multiple models, datasets and modalities. Particularly, we consider the modality pairs of image-text and speech-text and our models achieve 5-8% (absolute) improvement over previous state-of-the-art methods in 0-shot image classification and 20-30% (absolute) improvement in 0-shot speech-to-intent classification and keyword classification.
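A hedged PyTorch sketch of a continuously weighted contrastive objective in the spirit of CWCL; the paper's exact weighting scheme may differ. Here `w` is any continuous intra-batch similarity matrix, fabricated for the demo, replacing the usual 0/1 positive/negative labels.

```python
import torch
import torch.nn.functional as F

def cwcl_loss(z_a, z_b, w, tau=0.07):
    """Contrastive loss with continuous per-pair weights instead of 0/1 labels.

    z_a, z_b: (N, d) L2-normalized embeddings from two modalities
    w:        (N, N) continuous similarity weights in [0, 1]
    """
    logits = z_a @ z_b.t() / tau
    log_p = F.log_softmax(logits, dim=1)
    w = w / w.sum(dim=1, keepdim=True)    # normalize weights per anchor
    return -(w * log_p).sum(dim=1).mean()

N, d = 8, 32
z_a = F.normalize(torch.randn(N, d), dim=1)
z_b = F.normalize(torch.randn(N, d), dim=1)
w = torch.eye(N) * 0.8 + 0.2 / N          # mostly-diagonal toy weights
print(cwcl_loss(z_a, z_b, w).item())
```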
results: The study finds that publicly available models such as XLSR-53 can achieve competitive recognition accuracy, and that customized pre-trained models can further improve system performance. The study also compares different training approaches, both supervised and unsupervised, using wav2vec 2.0 as the architecture.
Abstract
In today's interconnected world, moving abroad is more and more prevalent, whether for employment, refugee resettlement, or other causes. Language barriers between natives and immigrants present a common daily issue, especially in the medical domain. This can make it difficult for patients and doctors to communicate during anamnesis or in the emergency room, which compromises patient care. The goal of the HYKIST Project is to develop a speech translation system to support patient-doctor communication with ASR and MT. ASR systems have recently displayed astounding performance on particular tasks for which sufficient training data is available, such as LibriSpeech. Building a good model is still difficult due to the variety of speaking styles, acoustic and recording settings, and a lack of in-domain training data. In this thesis, we describe our efforts to construct ASR systems for a conversational telephone speech recognition task in the medical domain for the Vietnamese language, to assist emergency room contact between doctors and patients across linguistic barriers. In order to enhance the system's performance, we investigate various training schedules and data combining strategies. We also examine how best to make use of the little data that is available. The use of publicly accessible models like XLSR-53 is compared to the use of customized pre-trained models, and both supervised and unsupervised approaches are utilized using wav2vec 2.0 as the architecture.
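A hedged sketch of wav2vec 2.0 inference with Hugging Face Transformers, using a publicly available English CTC checkpoint as a stand-in; the thesis targets Vietnamese conversational telephone speech with its own fine-tuned models, and the dummy waveform below is just silence.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "facebook/wav2vec2-base-960h"   # stand-in checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

waveform = torch.zeros(16000)              # 1 s of silence as dummy input
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids))    # greedy CTC transcription
```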
RAGAS: Automated Evaluation of Retrieval Augmented Generation
results: The paper proposes a suite of evaluation metrics that require no human annotations and can assess different dimensions of a RAG pipeline: whether the retrieval module identifies relevant and focused context passages, whether the LLM exploits those passages faithfully, and the quality of the generation itself.
Abstract
We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAs, we put forward a suite of metrics which can be used to evaluate these different dimensions \textit{without having to rely on ground truth human annotations}. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.
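A hedged sketch of one reference-free metric in the RAGAs spirit, faithfulness via an LLM judge: decompose the answer into statements and check each against the retrieved context. The prompts and the `llm` stub are hypothetical, not the library's actual API.

```python
def llm(prompt: str) -> str:
    # stub: replace with a call to any large language model
    raise NotImplementedError

def faithfulness(question: str, answer: str, context: str) -> float:
    """Fraction of answer statements that the context supports."""
    statements = llm(
        f"Break the following answer into short standalone statements, "
        f"one per line.\nQuestion: {question}\nAnswer: {answer}"
    ).splitlines()
    supported = 0
    for s in statements:
        verdict = llm(
            f"Context:\n{context}\n\nCan the statement be inferred from the "
            f"context above? Answer yes or no.\nStatement: {s}"
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / max(len(statements), 1)
```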
STANCE-C3: Domain-adaptive Cross-target Stance Detection via Contrastive Learning and Counterfactual Generation
results: Experiments show that the model improves performance on multiple datasets and generalizes well across domains and target topics.
Abstract
Stance detection is the process of inferring a person's position or standpoint on a specific issue to deduce prevailing perceptions toward topics of general or controversial interest, such as health policies during the COVID-19 pandemic. Existing models for stance detection are trained to perform well for a single domain (e.g., COVID-19) and a specific target topic (e.g., masking protocols), but are generally ineffectual in other domains or targets due to distributional shifts in the data. However, constructing high-performing, domain-specific stance detection models requires an extensive corpus of labeled data relevant to the targeted domain, yet such datasets are not readily available. This poses a challenge as the process of annotating data is costly and time-consuming. To address these challenges, we introduce a novel stance detection model coined domain-adaptive Cross-target STANCE detection via Contrastive learning and Counterfactual generation (STANCE-C3) that uses counterfactual data augmentation to enhance domain-adaptive training by enriching the target domain dataset during the training process and requiring significantly less information from the new domain. We also propose a modified self-supervised contrastive learning as a component of STANCE-C3 to prevent overfitting for the existing domain and target and enable cross-target stance detection. Through experiments on various datasets, we show that STANCE-C3 shows performance improvement over existing state-of-the-art methods.
RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models
results: Experimental results show that, in a zero-shot setting, we can achieve listwise reranking effectiveness comparable to GPT-3.5, though slightly behind GPT-4. We hope this work provides a foundation for future research on listwise reranking.
Abstract
Researchers have successfully applied large language models (LLMs) such as ChatGPT to reranking in an information retrieval context, but to date, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. To address this significant shortcoming, we present RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. Experimental results on the TREC 2019 and 2020 Deep Learning Tracks show that we can achieve effectiveness comparable to zero-shot reranking with GPT-3.5 with a much smaller 7B parameter model, although our effectiveness remains slightly behind reranking with GPT-4. We hope our work provides the foundation for future research on reranking with modern LLMs. All the code necessary to reproduce our results is available at https://github.com/castorini/rank_llm.
Question-Answering Approach to Evaluate Legal Summaries
results: The correlation between GPT-4 grading and human grading suggests the approach can be used to gauge summary quality.
Abstract
Traditional evaluation metrics like ROUGE compare lexical overlap between the reference and generated summaries without taking argumentative structure into account, which is important for legal summaries. In this paper, we propose a novel legal summarization evaluation framework that utilizes GPT-4 to generate a set of question-answer pairs that cover main points and information in the reference summary. GPT-4 is then used to generate answers based on the generated summary for the questions from the reference summary. Finally, GPT-4 grades the answers from the reference summary and the generated summary. We examined the correlation between GPT-4 grading with human grading. The results suggest that this question-answering approach with GPT-4 can be a useful tool for gauging the quality of the summary.
Updated Corpora and Benchmarks for Long-Form Speech Recognition
results: The study finds that AED models are more susceptible to the train-test segmentation mismatch, and that long-form training improves their robustness.
Abstract
The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en - with updated transcriptions and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training for these models, showing its efficacy for model robustness under this domain shift.
methods: The study uses the Random Language Model (De Giuli 2019), an ensemble of stochastic context-free grammars, to quantify the syntax of human and computer languages.
results: The study shows that the model's scenario is robust when explicit symmetry breaking is taken into account. Comparison with the clustering coefficient of human syntax networks suggests the observed transition is equivalent to that typically experienced by children at age 24 months.
Abstract
The Random Language Model (De Giuli 2019) is an ensemble of stochastic context-free grammars, quantifying the syntax of human and computer languages. The model suggests a simple picture of first language learning as a type of annealing in the vast space of potential languages. In its simplest formulation, it implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken. Here this picture is scrutinized by considering its robustness against explicit symmetry breaking, an inevitable component of learning in the real world. It is shown that the scenario is robust to such symmetry breaking. Comparison with human data on the clustering coefficient of syntax networks suggests that the observed transition is equivalent to that normally experienced by children at age 24 months.
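A hedged sketch of the model's basic object, sampling sentences from a small stochastic context-free grammar; the grammar, weights, and lexicon are fabricated, whereas the paper studies large random ensembles of such grammars.

```python
import random

random.seed(0)
# rules: nonterminal -> list of (production, probability)
grammar = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["det", "noun"], 0.7), (["noun"], 0.3)],
    "VP": [(["verb", "NP"], 0.6), (["verb"], 0.4)],
}
lexicon = {"det": ["the"], "noun": ["cat", "dog"], "verb": ["sees", "chases"]}

def expand(symbol):
    if symbol in lexicon:                        # terminal category
        return [random.choice(lexicon[symbol])]
    prods, probs = zip(*grammar[symbol])
    prod = random.choices(prods, weights=probs)[0]
    return [w for s in prod for w in expand(s)]

for _ in range(3):
    print(" ".join(expand("S")))
```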
Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition
For: Improving the training of automatic speech recognition (ASR) systems, which requires large amounts of well-curated paired data.
Methods: The paper proposes the Omni-temporal Classification (OTC) training criterion, which explicitly accounts for label uncertainties so the model can effectively learn speech-text alignments. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers (WFSTs).
Results: Experiments on the LibriSpeech and LibriVox datasets show that training ASR models with OTC avoids performance degradation even when transcripts contain up to 70% errors.
Abstract
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our implementation is available at https://github.com/k2-fsa/icefall.
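For reference, here is a minimal PyTorch sketch of the conventional CTC objective that OTC extends; OTC itself additionally requires WFST machinery (e.g., from k2), which is omitted here, and all tensors below are fabricated.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 28            # time steps, batch size, vocab size (blank = 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 10))          # fabricated reference transcripts
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```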
results: Compared with competing approaches, the proposed Segmentation-Free framework achieves a better quality-latency trade-off.
Abstract
Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. Extensive experiments show how the proposed Segmentation-Free framework has better quality-latency trade-off than competing approaches that use an independent segmentation model. Software, data and models will be released upon paper acceptance.
BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning
results: The study shows that by freezing the parameters of the large pre-trained model and tuning only the adapter parameters, performance comparable to full fine-tuning can be achieved while greatly reducing the number of trainable parameters.
Abstract
This study aims to explore efficient tuning methods for the screenshot captioning task. Recently, image captioning has seen significant advancements, but research in captioning tasks for mobile screens remains relatively scarce. Current datasets and use cases describing user behaviors within product screenshots are notably limited. Consequently, we sought to fine-tune pre-existing models for the screenshot captioning task. However, fine-tuning large pre-trained models can be resource-intensive, requiring considerable time, computational power, and storage due to the vast number of parameters in image captioning models. To tackle this challenge, this study proposes a combination of adapter methods, which necessitates tuning only the additional modules on the model. These methods are originally designed for vision or language tasks, and our intention is to apply them to address similar challenges in screenshot captioning. By freezing the parameters of the image caption models and training only the weights associated with the methods, performance comparable to fine-tuning the entire model can be achieved, while significantly reducing the number of parameters. This study represents the first comprehensive investigation into the effectiveness of combining adapters within the context of the screenshot captioning task. Through our experiments and analyses, this study aims to provide valuable insights into the application of adapters in vision-language models and contribute to the development of efficient tuning techniques for the screenshot captioning task. Our study is available at https://github.com/RainYuGG/BLIP-Adapter
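To make the frozen-backbone-plus-adapter idea concrete, here is a hedged PyTorch sketch of a bottleneck adapter trained on top of a frozen encoder. The bottleneck size and toy backbone are illustrative assumptions, not the study's exact configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Freeze a pre-trained backbone and train only the adapter weights.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad = False

adapter = Adapter(dim=768)
trainable = sum(p.numel() for p in adapter.parameters())
total = sum(p.numel() for p in backbone.parameters()) + trainable
print(f"tuning {trainable:,} of {total:,} parameters")
```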
KERMIT: Knowledge Graph Completion of Enhanced Relation Modeling with Inverse Transformation
results: With these two mechanisms, we observe significant improvements in knowledge graph completion; they enhance the richness and diversity of the available data, leading to more accurate results.
Abstract
Knowledge graph completion is a task that revolves around filling in missing triples based on the information available in a knowledge graph. Among the current studies, text-based methods complete the task by utilizing textual descriptions of triples. However, this modeling approach may encounter limitations, particularly when the description fails to accurately and adequately express the intended meaning. To overcome these challenges, we propose the augmentation of data through two additional mechanisms. Firstly, we employ ChatGPT as an external knowledge base to generate coherent descriptions to bridge the semantic gap between the queries and answers. Secondly, we leverage inverse relations to create a symmetric graph, thereby creating extra labeling and providing supplementary information for link prediction. This approach offers additional insights into the relationships between entities. Through these efforts, we have observed significant improvements in knowledge graph completion, as these mechanisms enhance the richness and diversity of the available data, leading to more accurate results.
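The inverse-relation augmentation is simple to illustrate: for every triple (h, r, t), add a mirrored triple with an inverse relation. A minimal sketch follows; the relation-name suffix is an illustrative convention, not the paper's notation.

```python
def add_inverse_relations(triples):
    """Augment a knowledge graph with inverse triples: (h, r, t) -> (t, r_inv, h).

    The "_inverse" suffix is an illustrative convention, not the paper's.
    """
    augmented = list(triples)
    for head, relation, tail in triples:
        augmented.append((tail, relation + "_inverse", head))
    return augmented

kg = [("Paris", "capital_of", "France"),
      ("France", "member_of", "EU")]
print(add_inverse_relations(kg))
# adds ('France', 'capital_of_inverse', 'Paris') and
#      ('EU', 'member_of_inverse', 'France')
```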
ConPET: Continual Parameter-Efficient Tuning for Large Language Models
results: Experiments show that Static ConPET helps multiple earlier methods reduce the number of tunable parameters by over 3,000 times and surpass the PET-only baseline by at least 5 points on five smaller benchmarks, while Dynamic ConPET gains its advantage on the largest dataset. Code and data are available at https://github.com/Raincleared-Song/ConPET.
Abstract
Continual learning necessitates the continual adaptation of models to newly emerging tasks while minimizing the catastrophic forgetting of old ones. This is extremely challenging for large language models (LLMs) with vanilla full-parameter tuning due to high computation costs, memory consumption, and forgetting issue. Inspired by the success of parameter-efficient tuning (PET), we propose Continual Parameter-Efficient Tuning (ConPET), a generalizable paradigm for continual task adaptation of LLMs with task-number-independent training complexity. ConPET includes two versions with different application scenarios. First, Static ConPET can adapt former continual learning methods originally designed for relatively smaller models to LLMs through PET and a dynamic replay strategy, which largely reduces the tuning costs and alleviates the over-fitting and forgetting issue. Furthermore, to maintain scalability, Dynamic ConPET adopts separate PET modules for different tasks and a PET module selector for dynamic optimal selection. In our extensive experiments, the adaptation of Static ConPET helps multiple former methods reduce the scale of tunable parameters by over 3,000 times and surpass the PET-only baseline by at least 5 points on five smaller benchmarks, while Dynamic ConPET gains its advantage on the largest dataset. The codes and datasets are available at https://github.com/Raincleared-Song/ConPET.
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
results: The results show that QA-LoRA reduces the time and memory usage of quantized LLMs without sacrificing accuracy, and the method applies across different fine-tuning datasets and downstream scenarios.
Abstract
Recent years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.
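As a rough illustration of the two ingredients named in the abstract, here is a sketch combining group-wise weight quantization with a low-rank adapter. The min/max quantizer, group size, and initialization are assumptions for illustration, not QA-LoRA's exact operator design.

```python
import torch

def groupwise_quantize(W, group_size=64, bits=4):
    """Asymmetric min/max quantization with one (scale, zero) pair per
    column group. The quantizer and group size are illustrative
    assumptions; QA-LoRA's exact operators are defined in the paper."""
    qmax = 2 ** bits - 1
    out_f, in_f = W.shape
    Wg = W.view(out_f, in_f // group_size, group_size)
    wmin = Wg.min(-1, keepdim=True).values
    wmax = Wg.max(-1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    q = ((Wg - wmin) / scale).round().clamp(0, qmax)
    return (q * scale + wmin).view(out_f, in_f)      # dequantized weights

torch.manual_seed(0)
d, r = 512, 16
W = torch.randn(d, d)
W_q = groupwise_quantize(W)                           # frozen, quantized base

# LoRA adapter: only A and B are trained; the base stays quantized.
A = (0.01 * torch.randn(d, r)).requires_grad_()
B = torch.zeros(r, d, requires_grad=True)
x = torch.randn(2, d)
y = x @ W_q.T + (x @ A) @ B     # low-rank correction on quantized weights
```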
for: The paper proposes a general and simple text-to-video model based on the Transformer architecture.
methods: The model uses the Transformer to capture the temporal consistency between text and image sequences, and employs GPT2 as the language model.
results: Tested on the UCF101 dataset, the method generates promising videos.
Abstract
We present a general and simple text-to-video model based on the Transformer. Since both text and video are sequential data, we encode both texts and images into the same hidden space, which is further fed into the Transformer to capture temporal consistency, and then into a decoder to generate either text or images. Since the image signal may become weak in a long sequence, we introduce a U-Net to reconstruct each image from its noised version. Specifically, we increase the noise level of the original images along the long sequence, then use the $down$ module from the U-Net to encode the noised images, which are further input to the Transformer to predict the next clear images. We also add a constraint to promote motion between any generated image pair in the video. We use GPT2 and test our approach on the UCF101 dataset, showing that it can generate promising videos.
results: Experiments show that DeepROCK effectively controls the FDR, as validated extensively on simulated and real data.
Abstract
The complexity of deep neural networks (DNNs) makes them powerful but also makes them challenging to interpret, hindering their applicability in error-intolerant domains. Existing methods attempt to reason about the internal mechanism of DNNs by identifying feature interactions that influence prediction outcomes. However, such methods typically lack a systematic strategy to prioritize interactions while controlling confidence levels, making them difficult to apply in practice for scientific discovery and hypothesis validation. In this paper, we introduce a method, called DeepROCK, to address this limitation by using knockoffs, which are dummy variables that are designed to mimic the dependence structure of a given set of features while being conditionally independent of the response. Together with a novel DNN architecture involving a pairwise-coupling layer, DeepROCK jointly controls the false discovery rate (FDR) and maximizes statistical power. In addition, we identify a challenge in correctly controlling FDR using off-the-shelf feature interaction importance measures. DeepROCK overcomes this challenge by proposing a calibration procedure applied to existing interaction importance measures to make the FDR under control at a target level. Finally, we validate the effectiveness of DeepROCK through extensive experiments on simulated and real datasets.
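The knockoff selection step that DeepROCK builds on typically uses the knockoff+ threshold of Barber and Candès. A minimal sketch applied to generic feature-importance statistics W_j follows; DeepROCK's calibration procedure and pairwise-coupling architecture are not reproduced here.

```python
import numpy as np

def knockoff_threshold(W, q=0.1):
    """Knockoff+ threshold (Barber & Candes): smallest t such that
    (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

rng = np.random.default_rng(0)
# W_j > 0 suggests the real feature beats its knockoff; W_j < 0 the reverse.
W = np.concatenate([rng.normal(3, 1, 20), rng.normal(0, 1, 180)])
t = knockoff_threshold(W, q=0.1)
selected = np.where(W >= t)[0]
print(f"threshold={t:.2f}, selected {len(selected)} features")
```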
Telescope: An Automated Hybrid Forecasting Approach on a Level-Playing Field
paper_authors: André Bauer, Mark Leznik, Michael Stenger, Robert Leppich, Nikolas Herbst, Samuel Kounev, Ian Foster
for: Forecasting.
methods: Uses machine learning to automatically retrieve relevant information from a given time series and splits it into parts, handling each of them separately.
results: Provides accurate and reliable forecasts that compare favorably with other recent methods, without requiring parameterization or the training and fitting of a multitude of parameters.
Abstract
In many areas of decision-making, forecasting is an essential pillar. Consequently, many different forecasting methods have been proposed. From our experience, recently presented forecasting methods are computationally intensive, poorly automated, tailored to a particular data set, or they lack a predictable time-to-result. To this end, we introduce Telescope, a novel machine learning-based forecasting approach that automatically retrieves relevant information from a given time series and splits it into parts, handling each of them separately. In contrast to deep learning methods, our approach doesn't require parameterization or the need to train and fit a multitude of parameters. It operates with just one time series and provides forecasts within seconds without any additional setup. Our experiments show that Telescope outperforms recent methods by providing accurate and reliable forecasts while making no assumptions about the analyzed time series.
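Telescope's exact feature extraction is not spelled out in the abstract; the following sketch only illustrates the decompose-then-handle-parts idea using statsmodels' seasonal decomposition, with deliberately simple stand-in forecasters for each component.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Split a series into trend/seasonality/remainder and forecast each part
# separately. Telescope's actual extraction and component models differ.
rng = np.random.default_rng(1)
n, period = 240, 24
t = np.arange(n)
y = pd.Series(0.05 * t + 2 * np.sin(2 * np.pi * t / period)
              + rng.normal(0, 0.3, n))

parts = seasonal_decompose(y, model="additive", period=period)

h = 24  # forecast horizon
trend = parts.trend.interpolate().bfill().ffill()
trend_fc = np.polyval(np.polyfit(t, trend, 1), np.arange(n, n + h))
season_fc = parts.seasonal.values[-period:][:h]   # repeat last seasonal cycle
forecast = trend_fc + season_fc                   # remainder ~ 0 in expectation
```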
Beyond Log-Concavity: Theory and Algorithm for Sum-Log-Concave Optimization
results: As an application of the framework, the paper introduces a new classification method, checkered regression, which handles non-linearly separable problems by tessellating the feature space with any given number of hyperplanes, creating a checkerboard-like pattern of decision regions.
Abstract
This paper extends the classic theory of convex optimization to the minimization of functions that are equal to the negated logarithm of what we term as a sum-log-concave function, i.e., a sum of log-concave functions. In particular, we show that such functions are in general not convex but still satisfy generalized convexity inequalities. These inequalities unveil the key importance of a certain vector that we call the cross-gradient and that is, in general, distinct from the usual gradient. Thus, we propose the Cross Gradient Descent (XGD) algorithm moving in the opposite direction of the cross-gradient and derive a convergence analysis. As an application of our sum-log-concave framework, we introduce the so-called checkered regression method relying on a sum-log-concave function. This classifier extends (multiclass) logistic regression to non-linearly separable problems since it is capable of tessellating the feature space by using any given number of hyperplanes, creating a checkerboard-like pattern of decision regions.
Multiple Case Physics-Informed Neural Network for Biomedical Tube Flows
results: Results for unseen geometry cases can be obtained in real time, and network-architecture, tube-specific, and regularization strategies are identified that optimize performance.
Abstract
Fluid dynamics computations for tube-like geometries are important for biomedical evaluation of vascular and airway fluid dynamics. Physics-Informed Neural Networks (PINNs) have recently emerged as a good alternative to traditional computational fluid dynamics (CFD) methods. The vanilla PINN, however, requires much longer training time than the traditional CFD methods for each specific flow scenario and thus does not justify its mainstream use. Here, we explore the use of the multi-case PINN approach for calculating biomedical tube flows, where varied geometry cases are parameterized and pre-trained on the PINN, such that results for unseen geometries can be obtained in real time. Our objective is to identify network architecture, tube-specific, and regularization strategies that can optimize this, via experiments on a series of idealized 2D stenotic tube flows.
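As a toy illustration of a physics-informed loss for tube-like flow, here is a minimal PINN for a 1D Poiseuille-style velocity profile. The PDE, boundary conditions, and network size are assumptions, far simpler than the paper's parameterized 2D stenotic cases.

```python
import torch
import torch.nn as nn

# Solve u''(y) = -G/mu on y in [-1, 1] with u(-1) = u(1) = 0 (no-slip walls).
# This toy PDE and architecture are illustrative assumptions only.
G_over_mu = 1.0
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    y = torch.rand(128, 1) * 2 - 1
    y.requires_grad_(True)
    u = net(y)
    du = torch.autograd.grad(u.sum(), y, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), y, create_graph=True)[0]
    pde_loss = ((d2u + G_over_mu) ** 2).mean()    # PDE residual
    yb = torch.tensor([[-1.0], [1.0]])
    bc_loss = (net(yb) ** 2).mean()               # no-slip boundaries
    opt.zero_grad()
    (pde_loss + bc_loss).backward()
    opt.step()
```

The exact solution here is the parabola u(y) = (G/2mu)(1 - y^2), which makes it easy to sanity-check the trained network.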
Scaling Representation Learning from Ubiquitous ECG with State-Space Models
for: Enhancing human well-being through ubiquitous sensing from wearable devices in the wild, with a focus on electrocardiogram (ECG) signals.
methods: Introduces a pre-trained state-space model for representation learning from ECG signals, trained in a self-supervised manner on 275,000 10-second ECG recordings collected in the wild.
results: The proposed model performs competitively on a range of downstream tasks, including health monitoring and stress and affect estimation, and remains effective in low-resource regimes.
Abstract
Ubiquitous sensing from wearable devices in the wild holds promise for enhancing human well-being, from diagnosing clinical conditions and measuring stress to building adaptive health promoting scaffolds. But the large volumes of data therein across heterogeneous contexts pose challenges for conventional supervised learning approaches. Representation Learning from biological signals is an emerging realm catalyzed by the recent advances in computational modeling and the abundance of publicly shared databases. The electrocardiogram (ECG) is the primary researched modality in this context, with applications in health monitoring, stress and affect estimation. Yet, most studies are limited by small-scale controlled data collection and over-parameterized architecture choices. We introduce \textbf{WildECG}, a pre-trained state-space model for representation learning from ECG signals. We train this model in a self-supervised manner with 275,000 10s ECG recordings collected in the wild and evaluate it on a range of downstream tasks. The proposed model is a robust backbone for ECG analysis, providing competitive performance on most of the tasks considered, while demonstrating efficacy in low-resource regimes. The code and pre-trained weights are shared publicly at https://github.com/klean2050/tiles_ecg_model.
Composable Coresets for Determinant Maximization: Greedy is Almost Optimal
for: Selecting $k$ vectors from a set in $\mathbb{R}^d$ so as to maximize their volume (determinant maximization).
methods: Studies the MAP-inference task for determinantal point processes (DPPs) in the large-data, composable-coreset setting.
results: Shows that the Greedy algorithm yields composable coresets with an almost optimal approximation factor of $O(k)^{3k}$, and establishes a local optimality property of Greedy that is confirmed on real datasets.
Abstract
Given a set of $n$ vectors in $\mathbb{R}^d$, the goal of the \emph{determinant maximization} problem is to pick $k$ vectors with the maximum volume. Determinant maximization is the MAP-inference task for determinantal point processes (DPP) and has recently received considerable attention for modeling diversity. As most applications for the problem use large amounts of data, this problem has been studied in the relevant \textit{composable coreset} setting. In particular, [Indyk-Mahabadi-OveisGharan-Rezaei--SODA'20, ICML'19] showed that one can get composable coresets with optimal approximation factor of $\tilde O(k)^k$ for the problem, and that a local search algorithm achieves an almost optimal approximation guarantee of $O(k)^{2k}$. In this work, we show that the widely-used Greedy algorithm also provides composable coresets with an almost optimal approximation factor of $O(k)^{3k}$, which improves over the previously known guarantee of $C^{k^2}$, and supports the prior experimental results showing the practicality of the greedy algorithm as a coreset. Our main result follows by showing a local optimality property for Greedy: swapping a single point from the greedy solution with a vector that was not picked by the greedy algorithm can increase the volume by a factor of at most $(1+\sqrt{k})$. This is tight up to the additive constant $1$. Finally, our experiments show that the local optimality of the greedy algorithm is even lower than the theoretical bound on real data sets.
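The greedy algorithm analyzed here is simple to state: repeatedly add the vector with the largest component orthogonal to the span of the current selection, which is the standard equivalent of maximizing the marginal gain in the Gram determinant. A numpy sketch:

```python
import numpy as np

def greedy_determinant_max(V, k):
    """Greedily pick k rows of V maximizing the volume of the selection.

    Each step adds the vector with the largest squared distance to the
    span of the current selection (equivalently, the largest marginal
    gain in the Gram determinant)."""
    selected, residual = [], V.copy()
    for _ in range(k):
        norms = np.einsum("ij,ij->i", residual, residual)
        norms[selected] = -np.inf                       # never re-pick
        i = int(np.argmax(norms))
        selected.append(i)
        u = residual[i] / np.linalg.norm(residual[i])   # new direction
        residual = residual - np.outer(residual @ u, u) # project it out
    return selected

rng = np.random.default_rng(0)
V = rng.normal(size=(200, 10))
S = greedy_determinant_max(V, k=5)
vol_sq = np.linalg.det(V[S] @ V[S].T)   # squared volume of chosen vectors
```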
A Physics Enhanced Residual Learning (PERL) Framework for Traffic State Prediction
methods: Integrates a physics model and a residual learning model into a single model; the prediction is the physics-model result plus a predicted residual that corrects it.
results: Experiments show that the PERL model predicts vehicle trajectories better than the physics model, the data-driven model, and PINN models on a small dataset. PERL also converges faster during training and requires fewer training samples than the data-driven and PINN models.
Abstract
In vehicle trajectory prediction, physics models and data-driven models are two predominant methodologies. However, each approach presents its own set of challenges: physics models fall short in predictability, while data-driven models lack interpretability. Addressing these identified shortcomings, this paper proposes a novel framework, the Physics-Enhanced Residual Learning (PERL) model. PERL integrates the strengths of physics-based and data-driven methods for traffic state prediction. PERL contains a physics model and a residual learning model. Its prediction is the sum of the physics model result and a predicted residual as a correction to it. It preserves the interpretability inherent to physics-based models and has reduced data requirements compared to data-driven methods. Experiments were conducted using a real-world vehicle trajectory dataset. We proposed a PERL model, with the Intelligent Driver Model (IDM) as its physics car-following model and Long Short-Term Memory (LSTM) as its residual learning model. We compare this PERL model with the physics car-following model, data-driven model, and other physics-informed neural network (PINN) models. The result reveals that PERL achieves better prediction with a small dataset, compared to the physics model, data-driven model, and PINN model. Second, the PERL model showed faster convergence during training, offering comparable performance with fewer training samples than the data-driven model and PINN model. Sensitivity analysis also proves comparable performance of PERL using another residual learning model and a physics car-following model.
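To make the physics-plus-residual structure concrete, here is a sketch using the standard IDM acceleration equation with a pluggable residual term. The IDM parameter values are illustrative defaults, and the residual callable stands in for the paper's trained LSTM.

```python
import numpy as np

def idm_acceleration(v, dv, s, v0=30.0, T=1.5, a=1.0, b=2.0, s0=2.0):
    """Intelligent Driver Model (standard form).

    v: own speed; dv: approach rate to the leader; s: gap to the leader.
    Parameter defaults are illustrative, not calibrated values."""
    s_star = s0 + v * T + v * dv / (2 * np.sqrt(a * b))
    return a * (1 - (v / v0) ** 4 - (s_star / s) ** 2)

def perl_predict(v, dv, s, residual_model):
    """PERL-style prediction: physics output plus a learned residual.

    `residual_model` is a stand-in for the paper's trained LSTM; here it
    is any callable mapping the same inputs to a correction term."""
    return idm_acceleration(v, dv, s) + residual_model(v, dv, s)

# Toy residual (the real one is learned from trajectory data).
zero_residual = lambda v, dv, s: 0.0
acc = perl_predict(v=12.0, dv=1.5, s=20.0, residual_model=zero_residual)
```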
Identifying factors associated with fast visual field progression in patients with ocular hypertension based on unsupervised machine learning
results: The LCMM model identified four eye clusters with distinct MD-decline trajectories: 794 eyes (25%) Improvers, 1675 (54%) Stables, 531 (17%) Slow progressors, and 133 (4%) Fast progressors, with mean MD declines of 0.08, -0.06, -0.21, and -0.45 dB/year, respectively. Fast VF progression was associated with higher baseline age, intraocular pressure (IOP), pattern standard deviation (PSD), and refractive error (RE), but lower central corneal thickness (CCT), as well as with calcium channel blockers, male sex, heart disease history, diabetes history, African American race, stroke history, and migraine headaches.
Abstract
Purpose: To identify ocular hypertension (OHT) subtypes with different trends of visual field (VF) progression based on unsupervised machine learning and to discover factors associated with fast VF progression. Participants: A total of 3133 eyes of 1568 ocular hypertension treatment study (OHTS) participants with at least five follow-up VF tests were included in the study. Methods: We used a latent class mixed model (LCMM) to identify OHT subtypes using standard automated perimetry (SAP) mean deviation (MD) trajectories. We characterized the subtypes based on demographic, clinical, ocular, and VF factors at the baseline. We then identified factors driving fast VF progression using generalized estimating equation (GEE) and justified findings qualitatively and quantitatively. Results: The LCMM model discovered four clusters (subtypes) of eyes with different trajectories of MD worsening. The number of eyes in clusters were 794 (25%), 1675 (54%), 531 (17%) and 133 (4%). We labelled the clusters as Improvers, Stables, Slow progressors, and Fast progressors based on their mean of MD decline, which were 0.08, -0.06, -0.21, and -0.45 dB/year, respectively. Eyes with fast VF progression had higher baseline age, intraocular pressure (IOP), pattern standard deviation (PSD) and refractive error (RE), but lower central corneal thickness (CCT). Fast progression was associated with calcium channel blockers, being male, heart disease history, diabetes history, African American race, stroke history, and migraine headaches.
Method and Validation for Optimal Lineup Creation for Daily Fantasy Football Using Machine Learning and Linear Programming
paper_authors: Joseph M. Mahoney, Tomasz B. Paniak
for: This paper aims to develop a method to forecast NFL player performance under uncertainty and determine an optimal lineup to maximize FPTS under a set salary limit.
methods: The paper uses a supervised learning neural network to project FPTS based on past player performance, and a mixed integer linear program to find the optimal lineup.
results: The optimal lineups outperformed randomly-created lineups on average and fell in approximately the 31st percentile (median) compared to real-world lineups from users on DraftKings.
Abstract
Daily fantasy sports (DFS) are weekly or daily online contests where real-game performances of individual players are converted to fantasy points (FPTS). Users select players for their lineup to maximize their FPTS within a set player salary cap. This paper focuses on (1) the development of a method to forecast NFL player performance under uncertainty and (2) determining an optimal lineup to maximize FPTS under a set salary limit. A supervised learning neural network was created and used to project FPTS based on past player performance (2018 NFL regular season for this work) prior to the upcoming week. These projected FPTS were used in a mixed integer linear program to find the optimal lineup. The performance of resultant lineups was compared to randomly-created lineups. On average, the optimal lineups outperformed the random lineups. The generated lineups were then compared to real-world lineups from users on DraftKings. The generated lineups generally fell in approximately the 31st percentile (median). The FPTS methods and predictions presented here can be further improved using this study as a baseline comparison.
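A minimal sketch of the lineup optimization as a mixed integer linear program, using PuLP. The player pool, salary cap, and roster slots below are made-up toy values; real DraftKings roster rules (FLEX, DST, and so on) are richer.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

# Toy data: name -> (projected FPTS, salary, position). All values invented.
players = {
    "QB_A": (22.1, 7800, "QB"), "QB_B": (18.4, 6400, "QB"),
    "RB_A": (17.0, 7000, "RB"), "RB_B": (14.2, 5600, "RB"),
    "RB_C": (11.9, 4500, "RB"), "WR_A": (16.5, 6900, "WR"),
    "WR_B": (13.8, 5200, "WR"), "WR_C": (10.1, 3900, "WR"),
}
cap = 28000
slots = {"QB": 1, "RB": 2, "WR": 2}

x = LpVariable.dicts("pick", players, cat=LpBinary)
prob = LpProblem("lineup", LpMaximize)
prob += lpSum(players[p][0] * x[p] for p in players)          # maximize FPTS
prob += lpSum(players[p][1] * x[p] for p in players) <= cap   # salary cap
for pos, n in slots.items():
    prob += lpSum(x[p] for p in players if players[p][2] == pos) == n
prob.solve()
lineup = [p for p in players if x[p].value() == 1]
```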
V2X-Lead: LiDAR-based End-to-End Autonomous Driving with Vehicle-to-Everything Communication Integration
results: Experiments show that the proposed method improves safety and efficiency when traversing unsignalized intersections under mixed-autonomy traffic, and that it generalizes to previously unseen scenarios such as roundabouts. Integrating V2X communication gives autonomous vehicles an important data source for perceiving their surroundings beyond onboard sensors, leading to more accurate and comprehensive perception and to safer, more robust driving behavior.
Abstract
This paper presents a LiDAR-based end-to-end autonomous driving method with Vehicle-to-Everything (V2X) communication integration, termed V2X-Lead, to address the challenges of navigating unregulated urban scenarios under mixed-autonomy traffic conditions. The proposed method aims to handle imperfect partial observations by fusing the onboard LiDAR sensor and V2X communication data. A model-free and off-policy deep reinforcement learning (DRL) algorithm is employed to train the driving agent, which incorporates a carefully designed reward function and multi-task learning technique to enhance generalization across diverse driving tasks and scenarios. Experimental results demonstrate the effectiveness of the proposed approach in improving safety and efficiency in the task of traversing unsignalized intersections in mixed-autonomy traffic, and its generalizability to previously unseen scenarios, such as roundabouts. The integration of V2X communication offers a significant data source for autonomous vehicles (AVs) to perceive their surroundings beyond onboard sensors, resulting in a more accurate and comprehensive perception of the driving environment and more safe and robust driving behavior.
Homotopy Relaxation Training Algorithms for Infinite-Width Two-Layer ReLU Neural Networks
methods: Proposes the Homotopy Relaxation Training Algorithm (HRTA), which accelerates training through a homotopy activation function connecting the linear and ReLU activation functions, together with a relaxation of the homotopy parameter.
results: An in-depth analysis within the neural tangent kernel (NTK) setting shows that HRTA yields significantly improved convergence rates, especially for wider networks; experiments validate the theory, and the approach shows promise for other activation functions and deep networks.
Abstract
In this paper, we present a novel training approach called the Homotopy Relaxation Training Algorithm (HRTA), aimed at accelerating the training process in contrast to traditional methods. Our algorithm incorporates two key mechanisms: one involves building a homotopy activation function that seamlessly connects the linear activation function with the ReLU activation function; the other technique entails relaxing the homotopy parameter to enhance the training refinement process. We have conducted an in-depth analysis of this novel method within the context of the neural tangent kernel (NTK), revealing significantly improved convergence rates. Our experimental results, especially when considering networks with larger widths, validate the theoretical conclusions. This proposed HRTA exhibits the potential for other activation functions and deep neural networks.
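The abstract does not give the homotopy's functional form; one natural parameterization, assumed here for illustration, is a linear interpolation between the identity and ReLU, with the parameter relaxed toward the ReLU endpoint over training.

```python
import torch
import torch.nn as nn

class HomotopyActivation(nn.Module):
    """Interpolates between the linear map (s=0) and ReLU (s=1).

    The linear-interpolation form and the schedule below are illustrative
    assumptions; the paper defines its own homotopy and relaxation."""
    def __init__(self, s: float = 0.0):
        super().__init__()
        self.s = s

    def forward(self, x):
        return (1.0 - self.s) * x + self.s * torch.relu(x)

act = HomotopyActivation()
net = nn.Sequential(nn.Linear(10, 256), act, nn.Linear(256, 1))

# Relax the homotopy parameter toward the ReLU endpoint during training.
for epoch in range(10):
    act.s = min(1.0, epoch / 5.0)
    # ... one training epoch here ...
```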
Cross-Validation for Training and Testing Co-occurrence Network Inference Algorithms
results: The proposed evaluation method is useful for hyper-parameter selection (training) and for comparing the quality of the networks inferred by different algorithms (testing).
Abstract
Microorganisms are found in almost every environment, including the soil, water, air, and inside other organisms, like animals and plants. While some microorganisms cause diseases, most of them help in biological processes such as decomposition, fermentation and nutrient cycling. A lot of research has gone into studying microbial communities in various environments and how their interactions and relationships can provide insights into various diseases. Co-occurrence network inference algorithms help us understand the complex associations of micro-organisms, especially bacteria. Existing network inference algorithms employ techniques such as correlation, regularized linear regression, and conditional dependence, which have different hyper-parameters that determine the sparsity of the network. Previous methods for evaluating the quality of the inferred network include using external data, and network consistency across sub-samples, both which have several drawbacks that limit their applicability in real microbiome composition data sets. We propose a novel cross-validation method to evaluate co-occurrence network inference algorithms, and new methods for applying existing algorithms to predict on test data. Our empirical study shows that the proposed method is useful for hyper-parameter selection (training) and comparing the quality of the inferred networks between different algorithms (testing).
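As a sketch of the train/test idea for network inference, the following tunes the sparsity penalty of a graphical lasso by held-out log-likelihood. The paper's cross-validation method covers other inference algorithms and its own prediction scheme, so treat this only as an analogy.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Hold out samples, fit a sparse network on the training split for each
# penalty, and score by test log-likelihood. Data here are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))          # samples x taxa (e.g. log-abundances)
train, test = X[:90], X[90:]

best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.05, 0.1, 0.5]:
    model = GraphicalLasso(alpha=alpha).fit(train)
    score = model.score(test)           # held-out Gaussian log-likelihood
    if score > best_score:
        best_alpha, best_score = alpha, score

# Refit on all data; nonzero precision entries define the network edges.
network = np.abs(GraphicalLasso(alpha=best_alpha).fit(X).precision_) > 1e-8
```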
Auto-grading C programming assignments with CodeBERT and Random Forest Regressor
results: Testing shows that the approach grades C programming assignments accurately, with a root mean squared error (RMSE) of 1.89; the contrast between statistical methods and deep learning techniques is also discussed.
Abstract
Grading coding assignments manually is challenging due to complexity and subjectivity. However, auto-grading with deep learning simplifies the task. It objectively assesses code quality, detects errors, and assigns marks accurately, reducing the burden on instructors while ensuring efficient and fair assessment. This study provides an analysis of auto-grading of the C programming assignments using machine learning and deep learning approaches like regression, convolutional neural networks (CNN) and long short-term memory (LSTM). Using a code-based transformer word embedding model called CodeBERT, the textual code inputs were transformed into vectors, and the vectors were then fed into several models. The testing findings demonstrated the efficacy of the suggested strategy with a root mean squared error (RMSE) of 1.89. The contrast between statistical methods and deep learning techniques is discussed in the study.
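A hedged sketch of the embedding-plus-regressor pipeline follows; the [CLS] pooling, forest size, and toy labels are assumptions for illustration, not the study's exact configuration.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Embed C source with CodeBERT, then regress grades with a random forest.
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(source: str) -> np.ndarray:
    inputs = tok(source, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = enc(**inputs)
    return out.last_hidden_state[0, 0].numpy()   # [CLS]-token embedding

codes = ['int main(){return 0;}', 'int main(){printf("hi");return 0;}']
grades = [4.0, 7.5]                              # toy labels
X = np.stack([embed(c) for c in codes])

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, grades)
rmse = np.sqrt(mean_squared_error(grades, rf.predict(X)))
```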
Balancing Computational Efficiency and Forecast Error in Machine Learning-based Time-Series Forecasting: Insights from Live Experiments on Meteorological Nowcasting
paper_authors: Elin Törnquist, Wagner Costa Santos, Timothy Pogue, Nicholas Wingle, Robert A. Caulk
for: This paper aims to explore the relationship between computational cost and forecast error in machine learning-based time-series forecasting, using meteorological nowcasting as an example.
methods: The paper employs various popular regression techniques, including XGBoost, FC-MLP, Transformer, and LSTM, for multi-horizon, short-term forecasting of temperature, wind speed, and cloud cover at multiple locations. The authors also propose two computational cost minimization methods: a novel auto-adaptive data reduction technique called Variance Horizon and a performance-based concept drift-detection mechanism.
results: Using the Variance Horizon technique reduced computational usage by more than 50%, while increasing forecast error by up to 15%. Meanwhile, performance-based retraining reduced computational usage by up to 90%, while improving forecast error by up to 10%. The combination of both techniques outperformed other model configurations by up to 99.7% when considering error normalized to computational usage.
Abstract
Machine learning for time-series forecasting remains a key area of research. Despite successful application of many machine learning techniques, relating computational efficiency to forecast error remains an under-explored domain. This paper addresses this topic through a series of real-time experiments to quantify the relationship between computational cost and forecast error using meteorological nowcasting as an example use-case. We employ a variety of popular regression techniques (XGBoost, FC-MLP, Transformer, and LSTM) for multi-horizon, short-term forecasting of three variables (temperature, wind speed, and cloud cover) for multiple locations. During a 5-day live experiment, 4000 data sources were streamed for training and inferencing 144 models per hour. These models were parameterized to explore forecast error for two computational cost minimization methods: a novel auto-adaptive data reduction technique (Variance Horizon) and a performance-based concept drift-detection mechanism. Forecast error of all model variations were benchmarked in real-time against a state-of-the-art numerical weather prediction model. Performance was assessed using classical and novel evaluation metrics. Results indicate that using the Variance Horizon reduced computational usage by more than 50\%, while increasing between 0-15\% in error. Meanwhile, performance-based retraining reduced computational usage by up to 90\% while \emph{also} improving forecast error by up to 10\%. Finally, the combination of both the Variance Horizon and performance-based retraining outperformed other model configurations by up to 99.7\% when considering error normalized to computational usage.
ICML 2023 Topological Deep Learning Challenge : Design and Results
results: The challenge attracted 28 qualifying submissions, and the paper summarizes its main findings.
Abstract
This paper presents the computational challenge on topological deep learning that was hosted within the ICML 2023 Workshop on Topology and Geometry in Machine Learning. The competition asked participants to provide open-source implementations of topological neural networks from the literature by contributing to the python packages TopoNetX (data processing) and TopoModelX (deep learning). The challenge attracted twenty-eight qualifying submissions in its two-month duration. This paper describes the design of the challenge and summarizes its main findings.
Monitoring Machine Learning Models: Online Detection of Relevant Deviations
results: Empirical results show that the method detects relevant degradations in machine learning model performance more effectively than benchmark methods.
Abstract
Machine learning models are essential tools in various domains, but their performance can degrade over time due to changes in data distribution or other factors. On one hand, detecting and addressing such degradations is crucial for maintaining the models' reliability. On the other hand, given enough data, any arbitrary small change of quality can be detected. As interventions, such as model re-training or replacement, can be expensive, we argue that they should only be carried out when changes exceed a given threshold. We propose a sequential monitoring scheme to detect these relevant changes. The proposed method reduces unnecessary alerts and overcomes the multiple testing problem by accounting for temporal dependence of the measured model quality. Conditions for consistency and specified asymptotic levels are provided. Empirical validation using simulated and real data demonstrates the superiority of our approach in detecting relevant changes in model quality compared to benchmark methods. Our research contributes a practical solution for distinguishing between minor fluctuations and meaningful degradations in machine learning model performance, ensuring their reliability in dynamic environments.
results: The paper proves that with data drawn from the $d$-dimensional Boolean hypercube, a two-layer neural network trained on $d\,\mathrm{polylog}(d)$ samples reaches population error $o(1)$; this is the first $\tilde{O}(d)$ sample-complexity guarantee for efficiently learning the XOR function on a standard neural network with standard training.
Abstract
In this work, we consider the optimization process of minibatch stochastic gradient descent (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function. We prove that with data drawn from the $d$-dimensional Boolean hypercube labeled by the quadratic ``XOR'' function $y = -x_ix_j$, it is possible to train to a population error $o(1)$ with $d \:\text{polylog}(d)$ samples. Our result considers simultaneously training both layers of the two-layer-neural network with ReLU activations via standard minibatch SGD on the logistic loss. To our knowledge, this work is the first to give a sample complexity of $\tilde{O}(d)$ for efficiently learning the XOR function on isotropic data on a standard neural network with standard training. Our main technique is showing that the network evolves in two phases: a $\textit{signal-finding}$ phase where the network is small and many of the neurons evolve independently to find features, and a $\textit{signal-heavy}$ phase, where SGD maintains and balances the features. We leverage the simultaneous training of the layers to show that it is sufficient for only a small fraction of the neurons to learn features, since those neurons will be amplified by the simultaneous growth of their second layer weights.
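The training setup is easy to reproduce in miniature: hypercube inputs labeled by the quadratic XOR, a two-layer ReLU network, and minibatch SGD on the logistic loss with both layers trained. The width, dimension, and step size below are illustrative choices, not the paper's constants.

```python
import torch
import torch.nn as nn

d, m, n = 20, 128, 4096
X = torch.randint(0, 2, (n, d)).float() * 2 - 1        # hypercube {-1, +1}^d
y = -(X[:, 0] * X[:, 1])                               # quadratic XOR labels

net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.SoftMarginLoss()                          # logistic loss, +/-1 labels

for step in range(2000):
    idx = torch.randint(0, n, (64,))                   # minibatch SGD
    out = net(X[idx]).squeeze(-1)
    loss = loss_fn(out, y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

acc = (net(X).squeeze(-1).sign() == y).float().mean()
```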
Fixing the NTK: From Neural Network Linearizations to Exact Convex Programs
results: For a particular choice of mask weights, the NTK cannot outperform the optimal MKL kernel on the training set; iterative reweighting improves the NTK-induced weights to recover the optimal MKL kernel, and numerical simulations corroborate the theory.
Abstract
Recently, theoretical analyses of deep neural networks have broadly focused on two directions: 1) Providing insight into neural network training by SGD in the limit of infinite hidden-layer width and infinitesimally small learning rate (also known as gradient flow) via the Neural Tangent Kernel (NTK), and 2) Globally optimizing the regularized training objective via cone-constrained convex reformulations of ReLU networks. The latter research direction also yielded an alternative formulation of the ReLU network, called a gated ReLU network, that is globally optimizable via efficient unconstrained convex programs. In this work, we interpret the convex program for this gated ReLU network as a Multiple Kernel Learning (MKL) model with a weighted data masking feature map and establish a connection to the NTK. Specifically, we show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data. A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set. By using iterative reweighting, we improve the weights induced by the NTK to obtain the optimal MKL kernel which is equivalent to the solution of the exact convex reformulation of the gated ReLU network. We also provide several numerical simulations corroborating our theory. Additionally, we provide an analysis of the prediction error of the resulting optimal kernel via consistency results for the group lasso.
Automated Detection of Persistent Inflammatory Biomarkers in Post-COVID-19 Patients Using Machine Learning Techniques
paper_authors: Ghizal Fatima, Fadhil G. Al-Amran, Maitham G. Yousif
for: To explore machine learning techniques for automatically identifying persistent inflammatory biomarkers in post-COVID-19 patients, supporting early diagnosis and personalized treatment strategies.
methods: Applies several machine learning algorithms, including logistic regression, random forests, support vector machines, and gradient boosting, after rigorous data preprocessing and feature selection to optimize the dataset for analysis.
results: The models identify persistent inflammatory biomarkers in post-COVID-19 patients with high accuracy and precision; they can serve as useful tools for healthcare providers, ultimately contributing to improved post-acute COVID-19 care and patient well-being.
Abstract
The COVID-19 pandemic has left a lasting impact on individuals, with many experiencing persistent symptoms, including inflammation, in the post-acute phase of the disease. Detecting and monitoring these inflammatory biomarkers is critical for timely intervention and improved patient outcomes. This study employs machine learning techniques to automate the identification of persistent inflammatory biomarkers in 290 post-COVID-19 patients, based on medical data collected from hospitals in Iraq. The data encompassed a wide array of clinical parameters, such as C-reactive protein and interleukin-6 levels, patient demographics, comorbidities, and treatment histories. Rigorous data preprocessing and feature selection processes were implemented to optimize the dataset for machine learning analysis. Various machine learning algorithms, including logistic regression, random forests, support vector machines, and gradient boosting, were deployed to construct predictive models. These models exhibited promising results, showcasing high accuracy and precision in the identification of patients with persistent inflammation. The findings of this study underscore the potential of machine learning in automating the detection of persistent inflammatory biomarkers in post-COVID-19 patients. These models can serve as valuable tools for healthcare providers, facilitating early diagnosis and personalized treatment strategies for individuals at risk of persistent inflammation, ultimately contributing to improved post-acute COVID-19 care and patient well-being. Keywords: COVID-19, post-COVID-19, inflammation, biomarkers, machine learning, early detection.
Identifying Simulation Model Through Alternative Techniques for a Medical Device Assembly Process
results: Both approaches yield adaptable models that accurately represent the snap process and accommodate diverse scenarios; such models aid process understanding and decision-making, especially when data availability is limited.
Abstract
This scientific paper explores two distinct approaches for identifying and approximating the simulation model, particularly in the context of the snap process crucial to medical device assembly. Simulation models play a pivotal role in providing engineers with insights into industrial processes, enabling experimentation and troubleshooting before physical assembly. However, their complexity often results in time-consuming computations. To mitigate this complexity, we present two distinct methods for identifying simulation models: one utilizing Spline functions and the other harnessing Machine Learning (ML) models. Our goal is to create adaptable models that accurately represent the snap process and can accommodate diverse scenarios. Such models hold promise for enhancing process understanding and aiding in decision-making, especially when data availability is limited.
Single Biological Neurons as Temporally Precise Spatio-Temporal Pattern Recognizers
results: The thesis argues that the computational properties of single neurons have substantial system-wide ramifications, and shows that a simple, biologically plausible learning rule allows a single-neuron model to perform the nonlinear XOR operation.
Abstract
This PhD thesis is focused on the central idea that single neurons in the brain should be regarded as temporally precise and highly complex spatio-temporal pattern recognizers. This is opposed to the prevalent view of biological neurons as simple and mainly spatial pattern recognizers by most neuroscientists today. In this thesis, I will attempt to demonstrate that this is an important distinction, predominantly because the above-mentioned computational properties of single neurons have far-reaching implications with respect to the various brain circuits that neurons compose, and on how information is encoded by neuronal activity in the brain. Namely, that these particular "low-level" details at the single neuron level have substantial system-wide ramifications. In the introduction we will highlight the main components that comprise a neural microcircuit that can perform useful computations and illustrate the inter-dependence of these components from a system perspective. In chapter 1 we discuss the great complexity of the spatio-temporal input-output relationship of cortical neurons that are the result of morphological structure and biophysical properties of the neuron. In chapter 2 we demonstrate that single neurons can generate temporally precise output patterns in response to specific spatio-temporal input patterns with a very simple biologically plausible learning rule. In chapter 3, we use the differentiable deep network analog of a realistic cortical neuron as a tool to approximate the gradient of the output of the neuron with respect to its input and use this capability in an attempt to teach the neuron to perform nonlinear XOR operation. In chapter 4 we expand chapter 3 to describe extension of our ideas to neuronal networks composed of many realistic biological spiking neurons that represent either small microcircuits or entire brain regions.
On Excess Risk Convergence Rates of Neural Network Classifiers
paper_authors: Hyunouk Ko, Namjoon Suh, Xiaoming Huo
for: This paper studies the performance of plug-in classifiers based on neural networks in a binary classification setting, with a focus on their excess risks.
methods: The paper uses a more general scenario that resembles actual practice, with the function class including the Barron functions as a proper subset, and the neural network classifier is constructed as the minimizer of a surrogate loss.
results: The paper obtains a dimension-free, uniform rate of convergence for the excess risk, and shows that the rate is minimax optimal up to a logarithmic factor. The paper also demonstrates the effect of the margin assumption in this regime.
Abstract
The recent success of neural networks in pattern recognition and classification problems suggests that neural networks possess qualities distinct from other more classical classifiers such as SVMs or boosting classifiers. This paper studies the performance of plug-in classifiers based on neural networks in a binary classification setting as measured by their excess risks. Compared to the typical settings imposed in the literature, we consider a more general scenario that resembles actual practice in two respects: first, the function class to be approximated includes the Barron functions as a proper subset, and second, the neural network classifier constructed is the minimizer of a surrogate loss instead of the $0$-$1$ loss so that gradient descent-based numerical optimizations can be easily applied. While the class of functions we consider is large enough that optimal rates cannot be faster than $n^{-\frac{1}{3}}$, it is a regime in which dimension-free rates are possible and the approximation power of neural networks can be taken advantage of. In particular, we analyze the estimation and approximation properties of neural networks to obtain a dimension-free, uniform rate of convergence for the excess risk. Finally, we show that the rate obtained is in fact minimax optimal up to a logarithmic factor, and the minimax lower bound shows the effect of the margin assumption in this regime.
Targeting Relative Risk Heterogeneity with Causal Forests
for: This paper focuses on the problem of treatment effect heterogeneity (TEH) in clinical trial analysis, and proposes a method for modifying causal forests to target relative risk using a novel node-splitting procedure based on generalized linear model (GLM) comparison.
methods: The proposed method uses a modified version of causal forests, which is a highly popular method for detecting TEH, but with a focus on relative risk instead of absolute risk. The method uses a novel node-splitting procedure based on GLM comparison to capture nuance in the relative risk.
results: The results of the paper show that the proposed relative risk causal forests method can capture otherwise unobserved sources of heterogeneity, as demonstrated on simulated and real-world data.
Abstract
Treatment effect heterogeneity (TEH), or variability in treatment effect for different subgroups within a population, is of significant interest in clinical trial analysis. Causal forests (Wager and Athey, 2018) is a highly popular method for this problem, but like many other methods for detecting TEH, its criterion for separating subgroups focuses on differences in absolute risk. This can dilute statistical power by masking nuance in the relative risk, which is often a more appropriate quantity of clinical interest. In this work, we propose and implement a methodology for modifying causal forests to target relative risk using a novel node-splitting procedure based on generalized linear model (GLM) comparison. We present results on simulated and real-world data that suggest relative risk causal forests can capture otherwise unobserved sources of heterogeneity.
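To make "relative risk" concrete, here is a sketch of the node-level quantity such a split criterion compares: the treated-versus-control risk ratio for a binary outcome, estimated with a Poisson GLM with robust errors (a standard relative-risk regression device). The paper's actual GLM-comparison node-splitting procedure is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

# Estimate the relative risk within one (hypothetical) tree node.
rng = np.random.default_rng(0)
n = 500
treat = rng.integers(0, 2, n)
p = np.where(treat == 1, 0.30, 0.15)          # true relative risk = 2.0
y = rng.binomial(1, p)

# Poisson regression with robust (HC0) standard errors on a binary
# outcome recovers the log relative risk as the treatment coefficient.
X = sm.add_constant(treat.astype(float))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")
relative_risk = np.exp(fit.params[1])
```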
QUILT: Effective Multi-Class Classification on Quantum Computers Using an Ensemble of Diverse Quantum Classifiers
results: Quilt achieves up to 85% multi-class classification accuracy on the MNIST dataset using a five-qubit system.
Abstract
Quantum computers can theoretically have significant acceleration over classical computers; but, the near-future era of quantum computing is limited due to small number of qubits that are also error prone. Quilt is a framework for performing multi-class classification task designed to work effectively on current error-prone quantum computers. Quilt is evaluated with real quantum machines as well as with projected noise levels as quantum machines become more noise-free. Quilt demonstrates up to 85% multi-class classification accuracy with the MNIST dataset on a five-qubit system.
Synthia’s Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio
methods: The study uses a new data generation framework called Synthia's melody, which can simulate an unlimited variety of 4-second melodies with user-specified confounding structures.
results: Evaluations show that Synthia's melody provides a testbed free of unobserved biases for examining the susceptibility of acoustic deep learning models to distribution shift.
Abstract
Despite significant advancements in deep learning for vision and natural language, unsupervised domain adaptation in audio remains relatively unexplored. We, in part, attribute this to the lack of an appropriate benchmark dataset. To address this gap, we present Synthia's melody, a novel audio data generation framework capable of simulating an infinite variety of 4-second melodies with user-specified confounding structures characterised by musical keys, timbre, and loudness. Unlike existing datasets collected under observational settings, Synthia's melody is free of unobserved biases, ensuring the reproducibility and comparability of experiments. To showcase its utility, we generate two types of distribution shifts, domain shift and sample selection bias, and evaluate the performance of acoustic deep learning models under these shifts. Our evaluations reveal that Synthia's melody provides a robust testbed for examining the susceptibility of these models to varying levels of distribution shift.
Tempo Adaptation in Non-stationary Reinforcement Learning
results: Experiments demonstrate that the $\texttt{ProST}$ framework achieves a higher online return in various high-dimensional non-stationary environments than existing methods.
Abstract
We first raise and tackle a ``time synchronization'' issue between the agent and the environment in non-stationary reinforcement learning (RL), a crucial factor hindering its real-world applications. In reality, environmental changes occur over wall-clock time ($t$) rather than episode progress ($k$), where wall-clock time signifies the actual elapsed time within the fixed duration $t \in [0, T]$. In existing works, at episode $k$, the agent rolls a trajectory and trains a policy before transitioning to episode $k+1$. In the context of the time-desynchronized environment, however, the agent at time $t_{k}$ allocates $\Delta t$ for trajectory generation and training, subsequently moves to the next episode at $t_{k+1}=t_{k}+\Delta t$. Despite a fixed total number of episodes ($K$), the agent accumulates different trajectories influenced by the choice of interaction times ($t_1,t_2,...,t_K$), significantly impacting the suboptimality gap of the policy. We propose a Proactively Synchronizing Tempo ($\texttt{ProST}$) framework that computes a suboptimal sequence {$t_1,t_2,...,t_K$} (= { $t_{1:K}$}) by minimizing an upper bound on its performance measure, i.e., the dynamic regret. Our main contribution is that we show that a suboptimal {$t_{1:K}$} trades-off between the policy training time (agent tempo) and how fast the environment changes (environment tempo). Theoretically, this work develops a suboptimal {$t_{1:K}$} as a function of the degree of the environment's non-stationarity while also achieving a sublinear dynamic regret. Our experimental evaluation on various high-dimensional non-stationary environments shows that the $\texttt{ProST}$ framework achieves a higher online return at suboptimal {$t_{1:K}$} than the existing methods.
Statistical Analysis of Quantum State Learning Process in Quantum Neural Networks
methods: The paper proves a no-go theorem showing that when the loss value is lower than a critical threshold, the probability of avoiding local minima in the QNN search space vanishes exponentially with the qubit count, while growing only polynomially with the circuit depth.
results: The results show that QNNs cannot learn an unknown quantum state even when starting from a high-fidelity initial state, and that this holds for any circuit structure, initialization strategy, and ansatz. Numerical simulations validate the theoretical results. These findings place limits on the learnability and scalability of QNNs and deepen the understanding of the role of prior information in QNNs.
Abstract
Quantum neural networks (QNNs) have been a promising framework in pursuing near-term quantum advantage in various fields, where many applications can be viewed as learning a quantum state that encodes useful data. As a quantum analog of probability distribution learning, quantum state learning is theoretically and practically essential in quantum machine learning. In this paper, we develop a no-go theorem for learning an unknown quantum state with QNNs even starting from a high-fidelity initial state. We prove that when the loss value is lower than a critical threshold, the probability of avoiding local minima vanishes exponentially with the qubit count, while only grows polynomially with the circuit depth. The curvature of local minima is concentrated to the quantum Fisher information times a loss-dependent constant, which characterizes the sensibility of the output state with respect to parameters in QNNs. These results hold for any circuit structures, initialization strategies, and work for both fixed ansatzes and adaptive methods. Extensive numerical simulations are performed to validate our theoretical results. Our findings place generic limits on good initial guesses and adaptive methods for improving the learnability and scalability of QNNs, and deepen the understanding of prior information's role in QNNs.
Context-Aware Generative Models for Prediction of Aircraft Ground Tracks
results: Using a week of aircraft surveillance data from a busy sector of the United Kingdom's upper airspace for training and testing, the study found that a Bayesian neural network with the Laplace approximation generated the most plausible trajectories for emulating the flow of traffic through the sector.
Abstract
Trajectory prediction (TP) plays an important role in supporting the decision-making of Air Traffic Controllers (ATCOs). Traditional TP methods are deterministic and physics-based, with parameters that are calibrated using aircraft surveillance data harvested across the world. These models are, therefore, agnostic to the intentions of the pilots and ATCOs, which can have a significant effect on the observed trajectory, particularly in the lateral plane. This work proposes a generative method for lateral TP, using probabilistic machine learning to model the effect of the epistemic uncertainty arising from the unknown effect of pilot behaviour and ATCO intentions. The models are trained to be specific to a particular sector, allowing local procedures such as coordinated entry and exit points to be modelled. A dataset comprising a week's worth of aircraft surveillance data, passing through a busy sector of the United Kingdom's upper airspace, was used to train and test the models. Specifically, a piecewise linear model was used as a functional, low-dimensional representation of the ground tracks, with its control points determined by a generative model conditioned on partial context. It was found that, of the investigated models, a Bayesian Neural Network using the Laplace approximation was able to generate the most plausible trajectories in order to emulate the flow of traffic through the sector.
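The low-dimensional functional representation mentioned in the abstract can be sketched directly: a handful of control points define the lateral track, and a generative model would place a distribution over those control points. The rendering below is a generic piecewise linear interpolation, not the paper's exact parameterization:

```python
import numpy as np

def piecewise_linear_track(control_points, n_samples=100):
    """Render a ground track from a few 2D control points by linear
    interpolation -- a low-dimensional functional stand-in for the
    lateral trajectory."""
    control_points = np.asarray(control_points, dtype=float)  # (k, 2)
    t_ctrl = np.linspace(0.0, 1.0, len(control_points))
    t = np.linspace(0.0, 1.0, n_samples)
    x = np.interp(t, t_ctrl, control_points[:, 0])
    y = np.interp(t, t_ctrl, control_points[:, 1])
    return np.stack([x, y], axis=1)                           # (n_samples, 2)

# A generative model conditioned on partial context (e.g., sector entry and
# exit points) would output a distribution over the control points above.
```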
Learning Generative Models for Climbing Aircraft from Radar Data
results: The results show that the proposed method predicts arrival times with 66.3% less error than the BADA model and generates trajectories that are realistic when compared to test data.
Abstract
Accurate trajectory prediction (TP) for climbing aircraft is hampered by the presence of epistemic uncertainties concerning aircraft operation, which can lead to significant misspecification between predicted and observed trajectories. This paper proposes a generative model for climbing aircraft in which the standard Base of Aircraft Data (BADA) model is enriched by a functional correction to the thrust that is learned from data. The method offers three features: predictions of the arrival time with 66.3% less error when compared to BADA; generated trajectories that are realistic when compared to test data; and a means of computing confidence bounds for minimal computational cost.
Parallel Multi-Objective Hyperparameter Optimization with Uniform Normalization and Bounded Objectives
results: The approach improves the efficiency of multi-objective hyperparameter optimization and parallelizes well, with many evaluations running concurrently.
Abstract
Machine learning (ML) methods offer a wide range of configurable hyperparameters that have a significant influence on their performance. While accuracy is a commonly used performance objective, in many settings, it is not sufficient. Optimizing the ML models with respect to multiple objectives such as accuracy, confidence, fairness, calibration, privacy, latency, and memory consumption is becoming crucial. To that end, hyperparameter optimization, the approach to systematically optimize the hyperparameters, which is already challenging for a single objective, is even more challenging for multiple objectives. In addition, the differences in objective scales, the failures, and the presence of outlier values in objectives make the problem even harder. We propose a multi-objective Bayesian optimization (MoBO) algorithm that addresses these problems through uniform objective normalization and randomized weights in scalarization. We increase the efficiency of our approach by imposing constraints on the objective to avoid exploring unnecessary configurations (e.g., insufficient accuracy). Finally, we leverage an approach to parallelize the MoBO which results in a 5x speed-up when using 16x more workers.
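The two ingredients named in the abstract, uniform objective normalization and randomized-weight scalarization, can be sketched as follows. This is a generic illustration assuming all objectives are minimized; it is not the authors' code:

```python
import numpy as np

def scalarize(objectives, rng):
    """Rank-based (uniform) normalization of each objective to [0, 1],
    followed by a random-weight scalarization: a different point on the
    trade-off surface is targeted at each call."""
    Y = np.asarray(objectives, dtype=float)        # (n_points, n_objectives)
    # Empirical-CDF normalization is scale-free and robust to outliers.
    ranks = Y.argsort(axis=0).argsort(axis=0)
    U = (ranks + 1) / (len(Y) + 1)
    # Random weights on the probability simplex.
    w = rng.dirichlet(np.ones(Y.shape[1]))
    return U @ w                                   # lower is better

rng = np.random.default_rng(0)
scores = scalarize([[0.12, 350.0], [0.10, 500.0], [0.15, 200.0]], rng)
```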
Verifiable Learned Behaviors via Motion Primitive Composition: Applications to Scooping of Granular Media
results: The approach is demonstrated in simulation on an exploration task and on hardware with a robot scooping granular media.
Abstract
A robotic behavior model that can reliably generate behaviors from natural language inputs in real time would substantially expedite the adoption of industrial robots due to enhanced system flexibility. To facilitate these efforts, we construct a framework in which learned behaviors, created by a natural language abstractor, are verifiable by construction. Leveraging recent advancements in motion primitives and probabilistic verification, we construct a natural-language behavior abstractor that generates behaviors by synthesizing a directed graph over the provided motion primitives. If these component motion primitives are constructed according to the criteria we specify, the resulting behaviors are probabilistically verifiable. We demonstrate this verifiable behavior generation capacity in both simulation on an exploration task and on hardware with a robot scooping granular media.
Credit Card Fraud Detection with Subspace Learning-based One-Class Classification
paper_authors: Zaffar Zaffar, Fahad Sohrab, Juho Kanniainen, Moncef Gabbouj
for: This paper proposes an automated credit card fraud detection method based on one-class classification, addressing fraud driven by the digitization of commerce and the limitations of existing detection methods against continually evolving fraud techniques.
methods: The paper uses subspace learning-based One-Class Classification (OCC) algorithms, which handle heavily imbalanced data and can anticipate transactions carried out by yet-to-be-invented fraud techniques. Subspace learning is integrated into the data description, so the data are transformed into a lower-dimensional subspace optimized for OCC.
results: Rigorous experimentation and analysis show that the proposed approach mitigates the curse of dimensionality and the imbalanced nature of credit card data, improving the accuracy and efficiency of automated fraud detection.
Abstract
In an increasingly digitalized commerce landscape, the proliferation of credit card fraud and the evolution of sophisticated fraudulent techniques have led to substantial financial losses. Automating credit card fraud detection is a viable way to accelerate detection, reducing response times and minimizing potential financial losses. However, addressing this challenge is complicated by the highly imbalanced nature of the datasets, where genuine transactions vastly outnumber fraudulent ones. Furthermore, the high number of dimensions within the feature set gives rise to the ``curse of dimensionality". In this paper, we investigate subspace learning-based approaches centered on One-Class Classification (OCC) algorithms, which excel in handling imbalanced data distributions and possess the capability to anticipate and counter the transactions carried out by yet-to-be-invented fraud techniques. The study highlights the potential of subspace learning-based OCC algorithms by investigating the limitations of current fraud detection strategies and the specific challenges of credit card fraud detection. These algorithms integrate subspace learning into the data description; hence, the models transform the data into a lower-dimensional subspace optimized for OCC. Through rigorous experimentation and analysis, the study validated that the proposed approach helps tackle the curse of dimensionality and the imbalanced nature of credit card data for automatic fraud detection to mitigate financial losses caused by fraudulent activities.
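As a rough illustration of the idea, the following sketch describes genuine transactions in a fixed low-dimensional subspace before applying a one-class classifier. The paper learns the subspace jointly with the OCC objective, so the PCA stand-in here is an assumption for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Stand-in for genuine-transaction features; real data would be used here.
genuine = np.random.default_rng(0).normal(size=(1000, 30))

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),         # project to a lower-dimensional subspace
    OneClassSVM(nu=0.01),        # one-class description of genuine data
).fit(genuine)                   # trained on genuine transactions only

flags = model.predict(genuine)   # -1 marks transactions flagged as anomalous
```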
Cluster Exploration using Informative Manifold Projections
paper_authors: Stavros Gerolymatos, Xenophon Evangelopoulos, Vladimir Gusev, John Y. Goulermas
for: This study proposes a dimensionality reduction method that factors out structure associated with prior knowledge in order to visually explore the cluster structure of high-dimensional data.
methods: The method optimizes a linear combination of two objectives: contrastive PCA, which discounts the structure associated with the prior information, and kurtosis projection pursuit, which ensures meaningful data separation in the embeddings.
results: Experiments show that the method effectively reveals remaining underlying structure in high-dimensional data and supports automated iterative visual exploration under different kinds of prior knowledge.
Abstract
Dimensionality reduction (DR) is one of the key tools for the visual exploration of high-dimensional data and uncovering its cluster structure in two- or three-dimensional spaces. The vast majority of DR methods in the literature do not take into account any prior knowledge a practitioner may have regarding the dataset under consideration. We propose a novel method to generate informative embeddings which not only factor out the structure associated with different kinds of prior knowledge but also aim to reveal any remaining underlying structure. To achieve this, we employ a linear combination of two objectives: firstly, contrastive PCA that discounts the structure associated with the prior information, and secondly, kurtosis projection pursuit which ensures meaningful data separation in the obtained embeddings. We formulate this task as a manifold optimization problem and validate it empirically across a variety of datasets considering three distinct types of prior knowledge. Lastly, we provide an automated framework to perform iterative visual exploration of high-dimensional data.
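The first of the two objectives, contrastive PCA, has a compact standard form: find directions that maximize target variance while discounting variance explained by the prior-knowledge (background) data. The sketch below implements that standard formulation (Abid et al., 2018), not the authors' full combined objective:

```python
import numpy as np

def contrastive_pca(X_target, X_background, alpha=1.0, n_components=2):
    """Top eigenvectors of C_target - alpha * C_background: directions with
    high target variance but low background (prior-knowledge) variance."""
    Ct = np.cov(X_target, rowvar=False)
    Cb = np.cov(X_background, rowvar=False)
    evals, evecs = np.linalg.eigh(Ct - alpha * Cb)
    top = evecs[:, np.argsort(evals)[::-1][:n_components]]
    return X_target @ top        # the contrastive embedding
```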
Investigation of factors regarding the effects of COVID-19 pandemic on college students’ depression by quantum annealer
results: The results show that the QA-based algorithms have capabilities in factor analysis research comparable to the widely used MLR models, and the important-factor results of the QA-based algorithms were validated. Pandemic-related factors (e.g., confidence in the social system) and psychological factors (e.g., decision-making in uncertain situations) were more important under post-pandemic conditions. The authors expect the study to serve as a reference for researchers studying similar topics.
Abstract
Diverse cases regarding the impact, with its related factors, of the COVID-19 pandemic on mental health have been reported in previous studies. College student groups have been frequently selected as the target population in previous studies because they are easily affected by pandemics. In this study, multivariable datasets were collected from 751 college students based on the complex relationships between various mental health factors. We utilized quantum annealing (QA)-based feature selection algorithms that were executed by commercial D-Wave quantum computers to determine the changes in the relative importance of the associated factors before and after the pandemic. Multivariable linear regression (MLR) and XGBoost models were also applied to validate the QA-based algorithms. Based on the experimental results, we confirm that QA-based algorithms have comparable capabilities in factor analysis research to the MLR models that have been widely used in previous studies. Furthermore, the performance of the QA-based algorithms was validated through the important factor results from the algorithms. Pandemic-related factors (e.g., confidence in the social system) and psychological factors (e.g., decision-making in uncertain situations) were more important in post-pandemic conditions. We believe that our study will serve as a reference for researchers studying similar topics.
Realtime Motion Generation with Active Perception Using Attention Mechanism for Cooking Robot
results: After training and validation, the robot acquired human-like skills and successfully cooked scrambled eggs. It adapted its stirring and flipping motions to the state of the egg: at the start it stirred the whole pot, and as the egg heated up it switched to flipping and splitting motions targeting specific areas, even though these were not explicitly specified.
Abstract
To support humans in their daily lives, robots are required to autonomously learn, adapt to objects and environments, and perform the appropriate actions. We tackled the task of cooking scrambled eggs using real ingredients, in which the robot needs to perceive the state of the egg and adjust its stirring movement in real time while the egg is heated and its state changes continuously. In previous works, handling changing objects was found to be challenging because sensory information includes dynamic information that may be important or noisy, and the modality that should be focused on changes over time, making it difficult to realize both perception and motion generation in real time. We propose a predictive recurrent neural network with an attention mechanism that can weigh the sensor input, distinguishing how important and reliable each modality is, to realize quick and efficient perception and motion generation. The model is trained by learning from demonstration, allowing the robot to acquire human-like skills. We validated the proposed technique using the robot Dry-AIREC; with our learning model, it could cook eggs with unknown ingredients. The robot could change its method of stirring and its direction depending on the status of the egg: at the beginning it stirs the whole pot, and subsequently, after the egg starts being heated, it begins flipping and splitting motions targeting specific areas, although we did not explicitly indicate them.
results: 本论文透过应用OS-net于罗斯勒和斯洛特的系统中,发现了 périod doubling attractors 和 chaotic 行为的动力学。Abstract
We introduce OS-net (Orbitally Stable neural NETworks), a new family of neural network architectures specifically designed for periodic dynamical data. OS-net is a special case of Neural Ordinary Differential Equations (NODEs) and takes full advantage of the adjoint method based backpropagation method. Utilizing ODE theory, we derive conditions on the network weights to ensure stability of the resulting dynamics. We demonstrate the efficacy of our approach by applying OS-net to discover the dynamics underlying the R\"{o}ssler and Sprott's systems, two dynamical systems known for their period doubling attractors and chaotic behavior.
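Since OS-net is a special case of Neural ODEs, its backbone can be sketched with a standard NODE setup. The sketch below assumes the third-party torchdiffeq package and omits the weight conditions that give OS-net its orbital-stability guarantee:

```python
import torch
from torchdiffeq import odeint  # assumes the torchdiffeq package is installed

class ODEFunc(torch.nn.Module):
    """A learned vector field dy/dt = f_theta(y); OS-net constrains the
    weights so the resulting dynamics are orbitally stable (not shown)."""
    def __init__(self, dim=3, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, dim))

    def forward(self, t, y):
        return self.net(y)

y0 = torch.randn(3)                      # e.g., a Rossler-like 3D state
t = torch.linspace(0.0, 10.0, 200)
trajectory = odeint(ODEFunc(), y0, t)    # adjoint-based training would use
                                         # torchdiffeq.odeint_adjoint instead
```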
results: Empirical studies validate the theoretical results and show that MarchOn attains the best rates of convergence.
Abstract
Stochastic optimization methods such as mirror descent have wide applications due to their low computational cost. These methods have been well studied under the assumption of independent and identically distributed data, and usually achieve a sublinear rate of convergence. However, this assumption may be too strong and impractical in real application scenarios. Recent research investigates stochastic gradient descent when instances are sampled from a Markov chain. Unfortunately, few results are known for stochastic mirror descent. In this paper, we propose a new version of stochastic mirror descent, termed MarchOn, in the scenario of federated learning. Given a distributed network, the model iteratively travels from a node to one of its neighbours at random. Furthermore, we propose a new framework to analyze MarchOn, which yields the best rates of convergence for convex, strongly convex, and non-convex losses. Finally, we conduct empirical studies to evaluate the convergence of MarchOn and validate the theoretical results.
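For reference, the standard stochastic mirror descent update that MarchOn builds on is the following; this is the generic textbook form, with the federated, Markov-sampled variant left to the paper:

```latex
% Mirror descent step with stochastic gradient \hat{g}_t, step size \eta_t,
% and Bregman divergence D_\psi induced by a mirror map \psi:
x_{t+1} = \arg\min_{x \in \mathcal{X}} \Big\{ \eta_t \langle \hat{g}_t, x \rangle + D_\psi(x, x_t) \Big\},
\qquad
D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle .
```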
On the Computational Complexity and Formal Hierarchy of Second Order Recurrent Neural Networks
results: The authors prove that second-order RNNs with bounded precision and time can recognize any regular language, and that they outperform modern RNNs and gated recurrent units at recognizing regular grammars. They also provide an upper bound and a stability analysis on the maximum number of neurons a second-order RNN requires to recognize any class of regular grammar.
Abstract
Artificial neural networks (ANNs) with recurrence and self-attention have been shown to be Turing-complete (TC). However, existing work has shown that these ANNs require multiple turns or unbounded computation time, even with unbounded precision in weights, in order to recognize TC grammars. However, under constraints such as fixed or bounded precision neurons and time, ANNs without memory are shown to struggle to recognize even context-free languages. In this work, we extend the theoretical foundation for the $2^{nd}$-order recurrent network ($2^{nd}$ RNN) and prove there exists a class of a $2^{nd}$ RNN that is Turing-complete with bounded time. This model is capable of directly encoding a transition table into its recurrent weights, enabling bounded time computation and is interpretable by design. We also demonstrate that $2$nd order RNNs, without memory, under bounded weights and time constraints, outperform modern-day models such as vanilla RNNs and gated recurrent units in recognizing regular grammars. We provide an upper bound and a stability analysis on the maximum number of neurons required by $2$nd order RNNs to recognize any class of regular grammar. Extensive experiments on the Tomita grammars support our findings, demonstrating the importance of tensor connections in crafting computationally efficient RNNs. Finally, we show $2^{nd}$ order RNNs are also interpretable by extraction and can extract state machines with higher success rates as compared to first-order RNNs. Our results extend the theoretical foundations of RNNs and offer promising avenues for future explainable AI research.
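The tensor connections that let a second-order RNN encode a transition table directly can be seen in a single update step. This is the classic second-order recurrence (one hidden-by-input weight tensor), written as a minimal sketch with biases and the output layer omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def second_order_rnn_step(h, x, W):
    """One step of a second-order RNN. W has shape (H, H, I): one weight per
    (next unit, current unit, input symbol) triple, so a DFA transition
    table can be written directly into W. h: (H,) state, x: (I,) one-hot."""
    return sigmoid(np.einsum('ijk,j,k->i', W, h, x))
```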
FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices using a Computing Power Aware Scheduler
results: On diverse non-IID heterogeneous distributed datasets, FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms, while remaining more efficient than synchronous algorithms when clients have differing computing power.
Abstract
Cross-silo federated learning offers a promising solution to collaboratively train robust and generalized AI models without compromising the privacy of local datasets, e.g., healthcare, financial, as well as scientific projects that lack a centralized data facility. Nonetheless, because of the disparity of computing resources among different clients (i.e., device heterogeneity), synchronous federated learning algorithms suffer from degraded efficiency when waiting for straggler clients. Similarly, asynchronous federated learning algorithms experience degradation in the convergence rate and final model accuracy on non-identically and independently distributed (non-IID) heterogeneous datasets due to stale local models and client drift. To address these limitations in cross-silo federated learning with heterogeneous clients and data, we propose FedCompass, an innovative semi-asynchronous federated learning algorithm with a computing power aware scheduler on the server side, which adaptively assigns varying amounts of training tasks to different clients using the knowledge of the computing power of individual clients. FedCompass ensures that multiple locally trained models from clients are received almost simultaneously as a group for aggregation, effectively reducing the staleness of local models. At the same time, the overall training process remains asynchronous, eliminating prolonged waiting periods from straggler clients. Using diverse non-IID heterogeneous distributed datasets, we demonstrate that FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms while remaining more efficient than synchronous algorithms when performing federated learning on heterogeneous clients.
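The computing-power-aware scheduling idea can be illustrated with a toy assignment rule: give each client local work proportional to its measured speed so that updates arrive at the server at roughly the same time. The actual FedCompass scheduler groups clients adaptively; the rule and names below are illustrative assumptions:

```python
def assign_local_steps(client_speeds, target_time):
    """Assign each client a number of local training steps proportional to
    its steps-per-unit-time, so all clients finish near target_time."""
    return {cid: max(1, int(speed * target_time))
            for cid, speed in client_speeds.items()}

steps = assign_local_steps({"hospital_a": 3.0, "edge_b": 0.5}, target_time=10)
# {'hospital_a': 30, 'edge_b': 5} -- both finish in roughly 10 time units
```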
Transformer-based classification of user queries for medical consultancy with respect to expert specialization
results: Experiments demonstrate the approach's strong performance across medical domains such as cardiology, neurology, and dermatology, with an F1 score above 92%.
Abstract
The need for skilled medical support is growing in the era of digital healthcare. This research presents an innovative strategy, utilizing the RuBERT model, for categorizing user inquiries in the field of medical consultation with a focus on expert specialization. By harnessing the capabilities of transformers, we fine-tuned the pre-trained RuBERT model on a varied dataset, which facilitates precise correspondence between queries and particular medical specialisms. Using a comprehensive dataset, we have demonstrated our approach's superior performance with an F1-score of over 92%, calculated through both cross-validation and the traditional split of test and train datasets. Our approach has shown excellent generalization across medical domains such as cardiology, neurology and dermatology. This methodology provides practical benefits by directing users to appropriate specialists for prompt and targeted medical advice. It also enhances healthcare system efficiency, reduces practitioner burden, and improves patient care quality. In summary, our suggested strategy facilitates the attainment of specific medical knowledge, offering prompt and precise advice within the digital healthcare field.
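The classification setup can be sketched with the Hugging Face transformers API: a pre-trained RuBERT encoder with a sequence-classification head over specializations. The checkpoint name and the three-label set below are assumptions for illustration, not details from the paper:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "DeepPavlov/rubert-base-cased"               # assumed checkpoint
specialties = ["cardiology", "neurology", "dermatology"]  # illustrative labels

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(specialties))   # head to be fine-tuned

inputs = tokenizer("болит голова и кружится", return_tensors="pt")
logits = model(**inputs).logits
predicted = specialties[logits.argmax(-1).item()]
```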
Genetic InfoMax: Exploring Mutual Information Maximization in High-Dimensional Imaging Genetics Studies
paper_authors: Yaochen Xie, Ziqian Xie, Sheikh Muhammad Saiful Islam, Degui Zhi, Shuiwang Ji
for: This paper is written for the purpose of addressing the challenges of representation learning in genome-wide association studies (GWAS) for high-dimensional medical imaging data, specifically using mutual information (MI) to identify informative representations of the data.
methods: The paper introduces a trans-modal learning framework called Genetic InfoMax (GIM), which includes a regularized MI estimator and a novel genetics-informed transformer to address the specific challenges of GWAS.
results: The paper demonstrates the effectiveness of GIM and a significantly improved performance on GWAS, as evaluated on human brain 3D MRI data using standardized evaluation protocols.
Abstract
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits. When applied to high-dimensional medical imaging data, a key step is to extract lower-dimensional, yet informative representations of the data as traits. Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS in comparison to typical visual representation learning. In this study, we tackle this problem from the mutual information (MI) perspective by identifying key limitations of existing methods. We introduce a trans-modal learning framework Genetic InfoMax (GIM), including a regularized MI estimator and a novel genetics-informed transformer to address the specific challenges of GWAS. We evaluate GIM on human brain 3D MRI data and establish standardized evaluation protocols to compare it to existing approaches. Our results demonstrate the effectiveness of GIM and a significantly improved performance on GWAS.
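A common starting point for MI-based representation learning is the InfoNCE lower bound between paired embeddings; GIM's regularized estimator and genetics-informed transformer build on top of this kind of objective. The sketch below is the generic bound, not the paper's estimator:

```python
import torch
import torch.nn.functional as F

def infonce_bound(z_img, z_gen):
    """InfoNCE lower bound on MI between paired image and genetics
    embeddings: log(batch size) minus the contrastive cross-entropy of
    matching each image row to its own genetics row."""
    n = len(z_img)
    logits = z_img @ z_gen.t() / z_img.shape[-1] ** 0.5   # (n, n) similarities
    labels = torch.arange(n)
    return torch.log(torch.tensor(float(n))) - F.cross_entropy(logits, labels)
```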
Learning the Uncertainty Sets for Control Dynamics via Set Membership: A Non-Asymptotic Analysis
paper_authors: Yingying Li, Jing Yu, Lauren Conger, Adam Wierman
for: Linear dynamical systems under bounded, i.i.d. disturbances
methods: Set membership estimation, non-asymptotic bound on the diameter of the uncertainty sets
results: Robust adaptive model predictive control with performance approaching that of offline optimal model predictive control
Abstract
Set-membership estimation is commonly used in adaptive/learning-based control algorithms that require robustness over the model uncertainty sets, e.g., online robustly stabilizing control and robust adaptive model predictive control. Despite having broad applications, non-asymptotic estimation error bounds in the stochastic setting are limited. This paper provides such a non-asymptotic bound on the diameter of the uncertainty sets generated by set membership estimation on linear dynamical systems under bounded, i.i.d. disturbances. Further, this result is applied to robust adaptive model predictive control with uncertainty sets updated by set membership. We numerically demonstrate the performance of the robust adaptive controller, which rapidly approaches the performance of the offline optimal model predictive controller, in comparison with the control design based on least square estimation's confidence regions.
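The core of set membership estimation is easy to state: a candidate model is kept only if every observed transition could have been produced by a disturbance within the known bound. Below is a sketch of that consistency check under the bounded-disturbance linear model; the paper's contribution, a non-asymptotic bound on how fast the diameter of the resulting set shrinks, is not shown:

```python
import numpy as np

def is_consistent(A, B, xs, us, w_bound):
    """Check whether (A, B) is consistent with observed transitions
    x_{t+1} = A x_t + B u_t + w_t under ||w_t||_inf <= w_bound.
    The uncertainty set is the set of all consistent (A, B)."""
    for t in range(len(xs) - 1):
        w = xs[t + 1] - A @ xs[t] - B @ us[t]   # implied disturbance
        if np.max(np.abs(w)) > w_bound:
            return False
    return True
```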
Gray-box Adversarial Attack of Deep Reinforcement Learning-based Trading Agents
for: The paper studies the robustness of a Deep Reinforcement Learning (Deep RL) based trading agent by demonstrating a practical adversarial attack against it.
methods: The paper uses a “gray-box” approach to attack the Deep RL-based trading agent: the adversary trades in the same stock market with no extra access to the trading agent, and uses a hybrid Deep Neural Network as its policy, consisting of convolutional layers and fully-connected layers.
results: The adversary policy proposed in the paper reduces the reward values by 214.17%, which cuts the potential profits of the baseline by 139.4%, of the ensemble method by 93.7%, and of an automated trading software developed by the industrial partner by 85.5%, while consuming significantly less budget than the victims.
Abstract
In recent years, deep reinforcement learning (Deep RL) has been successfully implemented as a smart agent in many systems such as complex games, self-driving cars, and chat-bots. One of the interesting use cases of Deep RL is its application as an automated stock trading agent. In general, any automated trading agent is prone to manipulations by adversaries in the trading environment. Thus studying their robustness is vital for their success in practice. However, typical mechanism to study RL robustness, which is based on white-box gradient-based adversarial sample generation techniques (like FGSM), is obsolete for this use case, since the models are protected behind secure international exchange APIs, such as NASDAQ. In this research, we demonstrate that a "gray-box" approach for attacking a Deep RL-based trading agent is possible by trading in the same stock market, with no extra access to the trading agent. In our proposed approach, an adversary agent uses a hybrid Deep Neural Network as its policy consisting of Convolutional layers and fully-connected layers. On average, over three simulated trading market configurations, the adversary policy proposed in this research is able to reduce the reward values by 214.17%, which results in reducing the potential profits of the baseline by 139.4%, ensemble method by 93.7%, and an automated trading software developed by our industrial partner by 85.5%, while consuming significantly less budget than the victims (427.77%, 187.16%, and 66.97%, respectively).
results: The paper validates the proposed variational inference method (RVRS) through theoretical analysis and experiments, showing that it performs particularly well for models with local latent variables.
Abstract
Traditional approaches to variational inference rely on parametric families of variational distributions, with the choice of family playing a critical role in determining the accuracy of the resulting posterior approximation. Simple mean-field families often lead to poor approximations, while rich families of distributions like normalizing flows can be difficult to optimize and usually do not incorporate the known structure of the target distribution due to their black-box nature. To expand the space of flexible variational families, we revisit Variational Rejection Sampling (VRS) [Grover et al., 2018], which combines a parametric proposal distribution with rejection sampling to define a rich non-parametric family of distributions that explicitly utilizes the known target distribution. By introducing a low-variance reparameterized gradient estimator for the parameters of the proposal distribution, we make VRS an attractive inference strategy for models with continuous latent variables. We argue theoretically and demonstrate empirically that the resulting method--Reparameterized Variational Rejection Sampling (RVRS)--offers an attractive trade-off between computational cost and inference fidelity. In experiments we show that our method performs well in practice and that it is well-suited for black-box inference, especially for models with local latent variables.
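The building block that VRS refines is plain rejection sampling from a parametric proposal; RVRS additionally differentiates through the procedure with a low-variance reparameterized gradient (not shown). A minimal sketch of the accept/reject loop, with illustrative function arguments:

```python
import numpy as np

def rejection_sample(log_p, sample_q, log_q, log_m, rng, max_tries=10000):
    """Draw z ~ q and accept with probability p(z) / (M q(z)), where
    log_m = log M upper-bounds log p - log q. Accepted draws follow the
    target p exactly."""
    for _ in range(max_tries):
        z = sample_q(rng)
        if np.log(rng.uniform()) < log_p(z) - log_q(z) - log_m:
            return z
    raise RuntimeError("no sample accepted; improve the proposal or "
                       "increase max_tries")
```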
Neuro-Visualizer: An Auto-encoder-based Loss Landscape Visualization Method
for: This paper presents Neuro-Visualizer, a novel auto-encoder-based non-linear method for visualizing the loss landscape of neural networks, to help researchers study neural networks and their training process.
methods: Neuro-Visualizer uses an auto-encoder to learn a lower-dimensional representation of the loss landscape, which is then visualized as a 2D plot. The method is evaluated on a variety of problems in two applications of knowledge-guided machine learning (KGML).
results: Neuro-Visualizer outperforms other linear and non-linear baselines and provides useful insights about the loss landscape of neural networks, corroborating and sometimes challenging claims proposed by the machine learning community. All code and data used in the experiments are available at https://anonymous.4open.science/r/NeuroVisualizer-FDD6.
Abstract
In recent years, there has been a growing interest in visualizing the loss landscape of neural networks. Linear landscape visualization methods, such as principal component analysis, have become widely used as they intuitively help researchers study neural networks and their training process. However, these linear methods suffer from limitations and drawbacks due to their lack of flexibility and low fidelity at representing the high dimensional landscape. In this paper, we present a novel auto-encoder-based non-linear landscape visualization method called Neuro-Visualizer that addresses these shortcoming and provides useful insights about neural network loss landscapes. To demonstrate its potential, we run experiments on a variety of problems in two separate applications of knowledge-guided machine learning (KGML). Our findings show that Neuro-Visualizer outperforms other linear and non-linear baselines and helps corroborate, and sometime challenge, claims proposed by machine learning community. All code and data used in the experiments of this paper are available at an anonymous link https://anonymous.4open.science/r/NeuroVisualizer-FDD6
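The general recipe behind an auto-encoder landscape plot is: train an auto-encoder on flattened weight vectors collected along training, then decode a grid of 2-D latent points and evaluate the loss at each decoded weight vector. The architecture below is a generic sketch of that recipe, not the paper's model:

```python
import torch

class LandscapeAE(torch.nn.Module):
    """Auto-encoder from flattened network weights to a 2-D latent space.
    Decoding a latent grid and evaluating the training loss at each decoded
    weight vector yields a non-linear loss-landscape visualization."""
    def __init__(self, n_weights, latent=2, hidden=256):
        super().__init__()
        self.enc = torch.nn.Sequential(
            torch.nn.Linear(n_weights, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, latent))
        self.dec = torch.nn.Sequential(
            torch.nn.Linear(latent, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_weights))

    def forward(self, w):
        return self.dec(self.enc(w))   # train with reconstruction loss on w
```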
Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control
results: The study finds that the return landscape has surprising structure, with simple paths in parameter space along which the stability of a policy improves. The paper takes a distributional view of policy quality and develops a distribution-aware procedure for finding such paths.
Abstract
Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
paper_authors: Alessandro Fontanella, Wenwen Li, Grant Mair, Antreas Antoniou, Eleanor Platt, Chloe Martin, Paul Armitage, Emanuele Trucco, Joanna Wardlaw, Amos Storkey
for: This paper aims to address the challenge of preparing clinical brain CT datasets for deep learning (DL) analysis.
methods: The authors propose a complete semi-automatic pipeline to standardize the heterogeneous dataset, including handling image sets with different orientations (axial, sagittal, coronal), different image types (to view soft tissues or bones) and dimensions, and removing redundant background.
results: The final pipeline was able to process 5,868/10,659 (45%) CT image datasets, with the majority of axial scans being accepted after adjustments such as image cropping, resizing, and scaling; 465 scans failed the registration process.
Abstract
Despite the large amount of brain CT data generated in clinical practice, the availability of CT datasets for deep learning (DL) research is currently limited. Furthermore, the data can be insufficiently or improperly prepared for machine learning and thus lead to spurious and irreproducible analyses. This lack of access to comprehensive and diverse datasets poses a significant challenge for the development of DL algorithms. In this work, we propose a complete semi-automatic pipeline to address the challenges of preparing a clinical brain CT dataset for DL analysis and describe the process of standardising this heterogeneous dataset. Challenges include handling image sets with different orientations (axial, sagittal, coronal), different image types (to view soft tissues or bones) and dimensions, and removing redundant background. The final pipeline was able to process 5,868/10,659 (45%) CT image datasets. Reasons for rejection include non-axial data (n=1,920), bone reformats (n=687), separated skull base/vault images (n=1,226), and registration failures (n=465). Further format adjustments, including image cropping, resizing and scaling are also needed for DL processing. Of the axial scans that were not localisers, bone reformats or split brains, 5,868/6,333 (93%) were accepted, while the remaining 465 failed the registration process. Appropriate preparation of medical imaging datasets for DL is a costly and time-intensive process.
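One of the pipeline's filtering steps, keeping only axial acquisitions, can be sketched with nibabel by inspecting the voxel-axis codes derived from the image affine. This is an illustrative check under the assumption that axial stacks have their slice axis aligned superior-inferior; the study's actual implementation is not detailed in the abstract:

```python
import nibabel as nib

def looks_axial(path):
    """Heuristic axial check: the third voxel axis should point along the
    superior-inferior anatomical direction for an axial slice stack."""
    img = nib.load(path)
    return nib.aff2axcodes(img.affine)[2] in ("S", "I")
```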
Thalamic nuclei segmentation from T$_1$-weighted MRI: unifying and benchmarking state-of-the-art methods with young and old cohorts
for: This study benchmarks four state-of-the-art thalamic nuclei segmentation methods and evaluates how well each distinguishes healthy controls from people with mild cognitive impairment and Alzheimer's disease.
methods: Four state-of-the-art methods for T1 MRI (FreeSurfer, HIPS-THOMAS, SCS-CNN, and T1-THOMAS) were applied under a single segmentation framework and compared using overlap and dissimilarity metrics against the Morel stereotaxic atlas.
results: HIPS-THOMAS produced the most effective segmentations of individual thalamic nuclei and was most accurate at discriminating healthy controls from those with mild cognitive impairment and Alzheimer's disease; the study also quantified each method's estimate of thalamic nuclear degeneration across disease progression.
Abstract
The thalamus and its constituent nuclei are critical for a broad range of cognitive and sensorimotor processes, and implicated in many neurological and neurodegenerative conditions. However, the functional involvement and specificity of thalamic nuclei in human neuroimaging is underappreciated and not well studied due, in part, to technical challenges of accurately identifying and segmenting nuclei. This challenge is further exacerbated by a lack of common nomenclature for comparing segmentation methods. Here, we use data from healthy young (Human Connectome Project, 100 subjects) and older healthy adults, plus those with minor cognitive impairment and Alzheimer$'$s disease (Alzheimer$'$s Disease Neuroimaging Initiative, 540 subjects), to benchmark four state of the art thalamic segmentation methods for T1 MRI (FreeSurfer, HIPS-THOMAS, SCS-CNN, and T1-THOMAS) under a single segmentation framework. Segmentations were compared using overlap and dissimilarity metrics to the Morel stereotaxic atlas. We also quantified each method$'$s estimation of thalamic nuclear degeneration across Alzheimer$'$s disease progression, and how accurately early and late mild cognitive impairment, and Alzheimers disease could be distinguished from healthy controls. We show that HIPS-THOMAS produced the most effective segmentations of individual thalamic nuclei and was also most accurate in discriminating healthy controls from those with mild cognitive impairment and Alzheimer$'$s disease using individual nucleus volumes. This work is the first to systematically compare the efficacy of anatomical thalamic segmentation approaches under a unified nomenclature. We also provide recommendations of which segmentation method to use for studying the functional relevance of specific thalamic nuclei, based on their overlap and dissimilarity with the Morel atlas.
Multiplex ultrasound imaging of perfluorocarbon nanodroplets enabled by decomposition of post-vaporization dynamics
paper_authors: Austin Van Namen, Sidhartha Jandhyala, Catalina-Paula Spatarelu, Kenneth M. Tichauer, Kimberley S. Samkoe, Geoffrey P. Luke
for: This paper aims to develop a new approach to multiplex ultrasound imaging using perfluorocarbon (PFC) nanodroplets as activatable contrast agents.
methods: The paper uses two populations of PFC nanodroplets with different core boiling points, and leverages their unique temporal responses to an acoustic trigger to differentiate their unique contributions to the overall ultrasound signal.
results: The paper demonstrates the potential of this approach for multiplex ultrasound imaging, showing that the relative concentrations of the two populations of PFC nanodroplets can be accurately measured in the same imaging volume within an average error of 1.1%.
Abstract
Among the various molecular imaging modalities, ultrasound imaging benefits from its real-time, nonionizing, and cost-effective nature. Despite its benefits, there is a dearth of methods to visualize two or more populations of contrast agents simultaneously, a technique known as multiplex imaging. In this paper, we present a new approach to multiplex ultrasound imaging using perfluorocarbon (PFC) nanodroplets. The nanodroplets, which undergo a liquid-to-gas phase transition in response to an acoustic trigger, act as activatable contrast agents. By using two populations of PFC nanodroplets, each with a different core boiling point, their unique temporal responses to an acoustic trigger were leveraged to differentiate their unique contributions to the overall ultrasound signal. This work characterized the dynamic responses of two PFC nanodroplets with boiling points of 28 and 56 {\deg}C. These characteristic responses were then used to demonstrate that the relative concentrations of the two populations of PFC nanodroplets could be accurately measured in the same imaging volume within an average error of 1.1%. Overall, the findings indicate the potential of this approach for multiplex ultrasound imaging, allowing for the visualization of multiple molecular targets simultaneously.
Depolarized Holography with Polarization-multiplexing Metasurface
paper_authors: Seung-Woo Nam, Youngjin Kim, Dongyeon Kim, Yoonchan Jeong
for: To improve the performance of holographic displays beyond their physical limitations.
methods: A polarization-multiplexing metasurface is integrated into the holographic display, exploiting the mutual incoherence of orthogonal polarization states to give CGH optimization an additional degree of freedom.
results: Simulations and experiments show that polarization diversity reduces speckle noise and enhances image quality.
Abstract
The evolution of computer-generated holography (CGH) algorithms has prompted significant improvements in the performances of holographic displays. Nonetheless, they start to encounter a limited degree of freedom in CGH optimization and physical constraints stemming from the coherent nature of holograms. To surpass the physical limitations, we consider polarization as a new degree of freedom by utilizing a novel optical platform called metasurface. Polarization-multiplexing metasurfaces enable incoherent-like behavior in holographic displays due to the mutual incoherence of orthogonal polarization states. We leverage this unique characteristic of a metasurface by integrating it into a holographic display and exploiting polarization diversity to bring an additional degree of freedom for CGH algorithms. To minimize the speckle noise while maximizing the image quality, we devise a fully differentiable optimization pipeline by taking into account the metasurface proxy model, thereby jointly optimizing spatial light modulator phase patterns and geometric parameters of metasurface nanostructures. We evaluate the metasurface-enabled depolarized holography through simulations and experiments, demonstrating its ability to reduce speckle noise and enhance image quality.
results: The proposed algorithm efficiently extracts time-varying wave-shape information from non-stationary signals and, even in the presence of high noise levels, outperforms existing wave-shape estimation algorithms and denoising methods based on short-time Fourier transform thresholding. It is also applied to real physiological recordings, including the denoising of an electroencephalograph signal and the adaptive segmentation of an electrocardiograph of a patient undergoing ventricular fibrillation.
Abstract
In this work, we propose a time-varying wave-shape extraction algorithm based on a modified version of the adaptive non-harmonic model for non-stationary signals. The model codifies the time-varying wave-shape information in the relative amplitude and phase of the harmonic components of the wave-shape. The algorithm was validated on both real and synthetic signals for the tasks of denoising, decomposition and adaptive segmentation. For the denoising task, both monocomponent and multicomponent synthetic signals were considered. In both cases, the proposed algorithm can accurately recover the time-varying wave-shape of non-stationary signals, even in the presence of high levels of noise, outperforming existing wave-shape estimation algorithms and denoising methods based on short-time Fourier transform thresholding. The denoising of an electroencephalograph signal was also performed, giving similar results. For decomposition, our proposal was able to recover the composing waveforms more accurately by considering the time variations from the harmonic amplitude functions when compared to existing methods. Finally, the algorithm was used for the adaptive segmentation of synthetic signals and an electrocardiograph of a patient undergoing ventricular fibrillation.
Wave-shape Function Model Order Estimation by Trigonometric Regression
for: Studying compact representations of oscillating signals with time-varying amplitude and phase, and proposing an adaptive trigonometric regression approach to estimate the number of harmonics of the wave-shape function (WSF).
methods: Adapts trigonometric regression model selection criteria, originally developed for stationary signals, to estimate the number of harmonic components of the WSF in non-stationary signals.
results: Experimental results show that the method effectively reconstructs non-stationary signals with non-sinusoidal oscillatory patterns, even in the presence of noise. Furthermore, the proposed method can denoise simulated pulse wave signals and accounts for the interpatient waveform variability of ECG and respiratory signals.
Abstract
The adaptive non-harmonic (ANH) model is a powerful tool to compactly represent oscillating signals with time-varying amplitude and phase, and non-sinusoidal oscillating morphology. Given good estimators of instantaneous amplitude and phase, we can construct an adaptive model where the morphology of the oscillation is described by the wave-shape function (WSF), a 2π-periodic function more general than a sinusoid. In this paper, we address the problem of estimating the number of harmonic components of the WSF, a problem that remains under-researched, by adapting trigonometric regression model selection criteria to this context. We study the application of these criteria, originally developed in the context of stationary signals, to the case of signals with time-varying amplitudes and phases. We then incorporate the order estimation into the ANH model reconstruction procedure and analyze its performance for noisy AM-FM signals. Experimental results on synthetic signals indicate that these criteria enable the adaptive estimation of the waveform of non-stationary signals with non-sinusoidal oscillatory patterns, even in the presence of a considerable amount of noise. We also apply our reconstruction procedure to the task of denoising simulated pulse wave signals and determine that the proposed technique performs competitively with other denoising schemes. We conclude this work by showing that our adaptive order estimation algorithm takes into account the interpatient waveform variability of the electrocardiogram (ECG) and respiratory signals by analyzing recordings from the Fantasia Database.
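A hedged sketch of the order-selection idea: fit trigonometric regressions of increasing harmonic order against an estimated phase and pick the order minimizing an information criterion (BIC here; the paper studies several criteria and their adaptation to time-varying amplitudes and phases):

```python
import numpy as np

def select_order(x, phi, K_max=10):
    """Pick the number of WSF harmonics by minimizing BIC over
    trigonometric regression fits of increasing order K."""
    n, best_bic, best_K = len(x), np.inf, 1
    for K in range(1, K_max + 1):
        B = np.column_stack([np.cos(k * phi) for k in range(1, K + 1)]
                            + [np.sin(k * phi) for k in range(1, K + 1)])
        coef, *_ = np.linalg.lstsq(B, x, rcond=None)
        rss = np.sum((x - B @ coef) ** 2)
        bic = n * np.log(rss / n) + 2 * K * np.log(n)  # 2K regression coefficients
        if bic < best_bic:
            best_bic, best_K = bic, K
    return best_K

# Toy check: a three-harmonic waveform with a mildly time-varying phase
t = np.linspace(0, 10, 5000)
phi = 2 * np.pi * (t + 0.02 * t ** 2)
x = np.cos(phi) + 0.4 * np.cos(2 * phi + 0.5) + 0.2 * np.cos(3 * phi) \
    + 0.05 * np.random.randn(len(t))
print("selected order:", select_order(x, phi))  # expected: 3
```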
Eve Said Yes: AirBone Authentication for Head-Wearable Smart Voice Assistant
methods: A two-stage AirBone authentication method: the first stage checks whether air- and bone-conduction utterances are time-domain consistent (TC), and the second performs bone-conduction speaker recognition (BC-SR) on vibration-domain voice features.
results: Experimental results show that the proposed AirBone authentication is usable and secure, can be readily implemented on commercial off-the-shelf head wearables, and provides a higher security level by resisting current acoustic attacks and advanced cross-domain attacks.
Abstract
Recent advances in machine learning and natural language processing have fostered the enormous prosperity of smart voice assistants and their services, e.g., Alexa, Google Home, Siri, etc. However, voice spoofing attacks are deemed one of the major challenges of voice control security, and they keep evolving, as with deep-learning-based voice conversion and speech synthesis techniques. To solve this problem outside the acoustic domain, we focus on head-wearable devices, such as earbuds and virtual reality (VR) headsets, which can continuously monitor the bone-conducted voice in the vibration domain. Specifically, we identify that air and bone conduction (AC/BC) from the same vocalization are coupled (or concurrent) and user-level unique, which makes them suitable behavior and biometric factors for multi-factor authentication (MFA). The legitimate user can defeat acoustic-domain and even cross-domain spoofing samples with the proposed two-stage AirBone authentication. The first stage answers whether air and bone conduction utterances are time-domain consistent (TC), and the second stage runs bone conduction speaker recognition (BC-SR). The security level is hence increased for two reasons: (1) current acoustic attacks on smart voice assistants cannot affect bone conduction, which is in the vibration domain; (2) even for advanced cross-domain attacks, the unique bone conduction features can detect an adversary's impersonation and machine-induced vibration. Finally, AirBone authentication has good usability (on the same level as voice authentication) compared with traditional MFA and schemes specially designed to enhance smart voice security. Our experimental results show that the proposed AirBone authentication is usable and secure, and can be easily equipped on commercial off-the-shelf head wearables with good user experience.
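The two-stage decision logic might look like the following sketch; the envelope features, correlation score, and threshold are illustrative assumptions, not the paper's exact TC test or BC-SR model:

```python
import numpy as np

def envelope(x, win=160):
    """Short-time energy envelope (hypothetical 10 ms windows at 16 kHz)."""
    pad = len(x) % win
    x = np.pad(x, (0, win - pad if pad else 0))
    return np.sqrt(np.mean(x.reshape(-1, win) ** 2, axis=1))

def time_consistent(air, bone, threshold=0.8):
    """Stage 1: are the air- and bone-conducted utterances time-domain
    consistent? Here: peak normalized cross-correlation of their envelopes."""
    ea, eb = envelope(air), envelope(bone)
    n = min(len(ea), len(eb))
    ea, eb = ea[:n] - ea[:n].mean(), eb[:n] - eb[:n].mean()
    xcorr = np.correlate(ea, eb, mode="full")
    peak = np.max(xcorr) / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-12)
    return peak >= threshold

def authenticate(air, bone, bc_speaker_model):
    """Two-stage AirBone decision: TC gate, then bone-conduction speaker
    recognition (bc_speaker_model stands in for a trained BC-SR classifier)."""
    if not time_consistent(air, bone):
        return False  # cross-domain inconsistency: reject as spoofing
    return bc_speaker_model(bone)
```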
Application of reciprocity for facilitation of wave field visualization and defect detection
paper_authors: Bernd Köhler, Kanta Takahashi, Kazuyuki Nakahata
for: Studying motion visualization in structural components for defect detection.
methods: Elastic motions are excited by hammer impacts at multiple points and received by an accelerometer at a fixed point; the elastodynamic reciprocity theorem is used to obtain the response to fixed-point excitation from multipoint-excitation measurements.
results: In the visualized eigenmodes, significant additional deformation was observed at a wall thinning inserted as an artificial defect; a maximum intensity projection of guided wave modes extracted immediately after impact excitation detects defects independently of their position within the mode shape.
Abstract
The motion visualization in a structural component was studied for defect detection. Elastic motions were excited by hammer impacts at multiple points and received by an accelerometer at a fixed point. Reciprocity in elastodynamics is only valid under certain conditions. Its validity under given experimental conditions was derived from the elastodynamic reciprocity theorem. Based on this, the dynamic motion of the structural component was obtained for fixed-point excitation from measurements performed using multipoint excitations. In the visualized eigenmodes, significant additional deformation was observed at the wall thinning inserted as an artificial defect. To prevent the dependence of defect detection on its position within the mode shape, another approach was proposed based on the extraction of guided wave modes immediately after impact excitation. It is shown that this maximum intensity projection method works well in detecting defects.
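The maximum intensity projection itself is a one-line reduction over the time axis; the sketch below applies it to a toy expanding-ring wavefield with an artificially brightened "defect" pixel (all geometry and amplitudes are invented for illustration):

```python
import numpy as np

def max_intensity_projection(wavefield):
    """Collapse a time-resolved wavefield (time, x, y) into one image by
    taking the maximum absolute amplitude at each spatial point, which
    highlights localized extra deformation such as guided-wave scattering
    at a defect."""
    return np.max(np.abs(wavefield), axis=0)

# Toy example: a propagating ring pulse with extra amplitude at one pixel
T, N = 100, 64
field = np.zeros((T, N, N))
yy, xx = np.mgrid[0:N, 0:N]
for ti in range(T):
    r = ti * 0.6
    field[ti] = np.exp(-((np.hypot(xx - 32, yy - 32) - r) ** 2) / 4.0)
field[:, 40, 40] += 0.5 * field[:, 40, 40]  # localized 'defect' response
mip = max_intensity_projection(field)       # bright spot marks the defect
```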
Reliable Majority Vote Computation with Complementary Sequences for UAV Waypoint Flight Control
for: Proposes a non-coherent over-the-air computation (OAC) scheme for reliably computing the majority vote (MV) of multiple parameters over fading channels, applied to UAV waypoint flight control.
Abstract
In this study, we propose a non-coherent over-the-air computation (OAC) scheme to calculate the majority vote (MV) reliably in fading channels. The proposed approach relies on modulating the amplitude of the elements of complementary sequences (CSs) based on the sign of the parameters to be aggregated. Since it does not use channel state information at the nodes, it is compatible with time-varying channels. To demonstrate the efficacy of our method, we employ it in a scenario where an unmanned aerial vehicle (UAV) is guided by distributed sensors, relying on the MV computed using our proposed scheme. We show that the proposed scheme reduces the computation error rate notably with a longer sequence length in fading channels while maintaining the peak-to-mean-envelope power ratio of the transmitted orthogonal frequency division multiplexing signals to be less than or equal to 3 dB.
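A simplified sketch under stated assumptions: a complementary (Golay) pair is generated recursively, each node maps its vote to one of two vote-indexed resources carrying a sequence from the pair, and the receiver decides the MV by comparing accumulated energies without channel state information. The vote-to-resource mapping here is a stand-in for the paper's sign-based amplitude modulation of CS elements:

```python
import numpy as np

def golay_pair(m):
    """Recursive Golay complementary pair of length 2**m; complementary
    sequences keep the PMEPR of OFDM transmissions bounded (<= 3 dB)."""
    a, b = np.array([1.0]), np.array([1.0])
    for _ in range(m):
        a, b = np.concatenate([a, b]), np.concatenate([a, -b])
    return a, b

rng = np.random.default_rng(0)
a, b = golay_pair(6)                                  # length-64 sequences
votes = rng.choice([-1, +1], size=25, p=[0.3, 0.7])   # true MV is +1

L = len(a)
y_plus, y_minus = np.zeros(L, complex), np.zeros(L, complex)
for v in votes:
    h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # Rayleigh
    if v > 0:
        y_plus += h * a       # vote +1 -> resource A
    else:
        y_minus += h * b      # vote -1 -> resource B
noise = 0.1 * (rng.standard_normal((2, L)) + 1j * rng.standard_normal((2, L)))
y_plus, y_minus = y_plus + noise[0], y_minus + noise[1]

# Non-coherent decision: compare received energies on the two resources
mv_hat = 1 if np.sum(np.abs(y_plus) ** 2) > np.sum(np.abs(y_minus) ** 2) else -1
print("estimated MV:", mv_hat)
```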
AsQM: Audio streaming Quality Metric based on Network Impairments and User Preferences
results: Experimental results show that users' preference for audio content plays an important role in QoE; a preference-aware Audio streaming Quality Metric (AsQM) is proposed to assess audio streaming service quality. Moreover, implementing the AsQM in an end user's device has a low impact on energy, processing, and memory consumption.
Abstract
Audio streaming services have a large user base owing to the proliferation of cloud-based streaming services for different content. The complex networks that support these services do not always guarantee an acceptable quality on the end-user side. In this paper, the impact of temporal interruptions on the reproduction of audio streaming and users' preferences regarding audio content are studied. In order to determine the key parameters in the audio streaming service, subjective tests were conducted, and their results show that users' Quality-of-Experience (QoE) is highly correlated with the following application parameters: the number of temporal interruptions or stalls, their frequency and length, and the temporal location at which they occur. Most importantly, experimental results demonstrated that users' preference for audio content plays an important role in their QoE. Thus, a Preference Factor (PF) function is defined and considered in the formulation of the proposed metric, named the Audio streaming Quality Metric (AsQM). Considering that multimedia service providers rely on web servers, a framework to obtain user information is proposed. Furthermore, results show that the AsQM implemented in the audio player of an end user's device has a low impact on energy, processing, and memory consumption.
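Since the paper's exact PF and AsQM expressions are not reproduced here, the following is a purely hypothetical stall-based quality function showing how a preference factor could modulate the impairment penalty; the functional form and all weights are invented for illustration:

```python
def asqm(stalls, duration, preference, w_count=0.05, w_len=0.1, w_pos=0.5):
    """Hypothetical stall-based audio-streaming quality score in [0, 1].

    stalls: list of (start_time, length) interruptions, in seconds.
    preference: user's preference factor PF in [0, 1] for this content.
    This is an illustrative stand-in, not the paper's AsQM formula.
    """
    penalty = 0.0
    for start, length in stalls:
        position = start / max(duration, 1e-9)   # later stalls weighted more
        penalty += w_count + w_len * length + w_pos * position * length
    base_quality = max(0.0, 1.0 - penalty)
    # Higher preference makes the user less tolerant of the same impairments
    return base_quality ** (1.0 + preference)

# Two stalls in a 3-minute track for a strongly preferred content
print(asqm(stalls=[(30.0, 1.5), (100.0, 0.5)], duration=180.0, preference=0.8))
```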
STAR-RIS Assisted Full-Duplex Communication Networks
results: Closed-form expressions are derived for the ergodic rates of both up-link and down-link communications, and the analysis is extended to bidirectional communication. An optimization problem is also formulated to maximize the ergodic sum-rate by adjusting the amplitudes and phase-shifts of the STAR-RIS elements and efficiently allocating the total transmit power.
Abstract
Different from conventional reconfigurable intelligent surfaces (RIS), a recent innovation called simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) has emerged, aimed at achieving complete 360-degree coverage in communication networks. Additionally, full-duplex (FD) technology is recognized as a potent approach for enhancing spectral efficiency by enabling simultaneous transmission and reception within the same time and frequency resources. In this study, we investigate the performance of a STAR-RIS-assisted FD communication system. The STAR-RIS is strategically placed at the cell-edge to facilitate communication for users located in this challenging region, while cell-center users can communicate directly with the FD base station (BS). We employ a non-orthogonal multiple access (NOMA) pairing scheme and account for system impairments, such as self-interference at the BS and imperfect successive interference cancellation (SIC). We derive closed-form expressions for the ergodic rates in both the up-link and down-link communications and extend our analysis to bidirectional communication between cell-center and cell-edge users. Furthermore, we formulate an optimization problem aimed at maximizing the ergodic sum-rate. This optimization involves adjusting the amplitudes and phase-shifts of the STAR-RIS elements and allocating total transmit power efficiently. To gain deeper insights into the achievable rates of STAR-RIS-aided FD systems, we explore the impact of various system parameters through numerical results.
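A minimal sketch of the STAR-RIS element model used in such analyses, assuming the energy-splitting protocol (per-element transmission and reflection amplitudes whose squares sum to one) and evaluating the achievable rate of a cell-edge user through the cascaded channel; the optimization of the coefficients is omitted, and all channel parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64                                                # STAR-RIS elements
beta_t = rng.uniform(0, 1, N)                         # transmission energy split
theta_t, theta_r = rng.uniform(0, 2 * np.pi, (2, N))
phi_t = np.sqrt(beta_t) * np.exp(1j * theta_t)        # transmission coefficients
phi_r = np.sqrt(1 - beta_t) * np.exp(1j * theta_r)    # reflection coefficients
# Energy-splitting constraint: |phi_t|**2 + |phi_r|**2 == 1 per element

# Cascaded BS -> STAR-RIS -> cell-edge user channel and its achievable rate
g = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)  # BS-RIS
h = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)  # RIS-user
P, sigma2 = 1.0, 0.1
snr = P * np.abs(np.sum(g * phi_t * h)) ** 2 / sigma2
rate = np.log2(1 + snr)
```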
Quadratic Detection in Noncoherent Massive SIMO Systems over Correlated Channels
results: A fundamental error floor at high SNR is proved for non-negative PAM constellations with more than two energy levels under maximum likelihood detection, and a design framework for quadratic detectors that exploits statistical channel knowledge is presented, yielding receivers with lower error rates at moderate and high SNR.
Abstract
With the goal of enabling ultra-reliable and low-latency wireless communications for industrial internet of things (IIoT), this paper studies the use of energy-based modulations in noncoherent massive single input multiple output (SIMO) systems. We consider a one-shot communication over a channel with correlated Rayleigh fading and colored Gaussian noise. We first provide a theoretical analysis on the limitations of non-negative pulse-amplitude modulation (PAM) in systems of this kind, based on maximum likelihood detection. The existence of a fundamental error floor at high signal-to-noise ratio (SNR) regimes is proved for constellations with more than two energy levels, when no (statistical) channel state information is available at the transmitter. In the main body of the paper, we present a design framework for quadratic detectors that generalizes the widely-used energy detector, to better exploit the statistical knowledge of the channel. This allows us to design receivers optimized according to information-theoretic criteria that exhibit lower error rates at moderate and high SNR. We subsequently derive an analytic approximation for the error probability of a general class of quadratic detectors in the large array regime. Finally, we introduce an improved reception scheme based on the combination of quadratic detectors and assess its capabilities numerically.
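As a concrete instance of a quadratic detector exploiting channel statistics, the sketch below implements noncoherent ML detection of an energy level p from y ~ CN(0, pR + C), which reduces to the plain energy detector when R and C are scaled identities; the correlation model, noise level, and constellation are illustrative choices:

```python
import numpy as np

def ml_energy_detect(y, R, C, levels):
    """Noncoherent ML detection of a PAM energy level p from
    y ~ CN(0, p R + C): minimize the quadratic-plus-logdet metric
    y^H (pR + C)^{-1} y + log det(pR + C) over the candidate levels."""
    best_p, best_metric = None, np.inf
    for p in levels:
        S = p * R + C
        metric = np.real(y.conj() @ np.linalg.solve(S, y)) \
                 + np.linalg.slogdet(S)[1]
        if metric < best_metric:
            best_p, best_metric = p, metric
    return best_p

# Toy correlated massive-SIMO example
rng = np.random.default_rng(2)
M = 64
idx = np.arange(M)
R = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # exponentially correlated fading
C = 0.05 * np.eye(M)                             # noise covariance (colored also works)
levels = [0.0, 1.0, 2.0]                         # three nonnegative PAM energy levels
p_true = 1.0
Lr = np.linalg.cholesky(R)
h = Lr @ (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
n = np.sqrt(0.05 / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
y = np.sqrt(p_true) * h + n
print("detected energy level:", ml_energy_detect(y, R, C, levels))
```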
Minimizing Energy Consumption for 5G NR Beam Management for RedCap Devices
results: The analysis yields the regions of feasibility, i.e., the upper limits on the beam management parameters for RedCap devices, from which design guidelines are derived.
Abstract
In 5G New Radio (NR), beam management entails periodic and continuous transmission and reception of control signals in the form of synchronization signal blocks (SSBs), used to perform initial access and/or channel estimation. However, this procedure demands continuous energy consumption, which is particularly challenging to handle for low-cost, low-complexity, and battery-constrained devices, such as RedCap devices to support mid-market Internet of Things (IoT) use cases. In this context, this work aims at reducing the energy consumption during beam management for RedCap devices, while ensuring that the desired Quality of Service (QoS) requirements are met. To do so, we formalize an optimization problem in an Indoor Factory (InF) scenario to select the best beam management parameters, including the beam update periodicity and the beamwidth, to minimize energy consumption based on users' distribution and their speed. The analysis yields the regions of feasibility, i.e., the upper limit(s) on the beam management parameters for RedCap devices, that we use to provide design guidelines accordingly.
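A toy version of the feasibility search, with invented energy and mobility models: enumerate (periodicity, beamwidth) pairs, keep those whose QoS constraint holds for the given user speed and distance, and pick the minimum-energy point. None of the numbers below come from the paper:

```python
import numpy as np

periodicities_ms = [20, 40, 80, 160]     # candidate SSB burst periodicities
beamwidths_deg = [15, 30, 45, 60, 90]    # candidate beamwidths
user_speed = 1.5                         # m/s, a typical InF device
user_distance = 10.0                     # m from the BS

def beams_per_sweep(bw):
    """Narrower beams need more SSBs to cover the whole cell."""
    return int(np.ceil(360.0 / bw))

def energy_per_second(T_ms, bw, e_ssb=1.0):
    """Toy model: one unit of energy per SSB, swept every T_ms."""
    return beams_per_sweep(bw) * e_ssb * (1000.0 / T_ms)

def qos_ok(T_ms, bw):
    """Toy QoS: angle traversed between updates must stay within the beam."""
    max_angle = np.degrees(user_speed * (T_ms / 1000.0) / user_distance)
    return max_angle <= bw / 2.0

best = min(
    ((energy_per_second(T, bw), T, bw)
     for T in periodicities_ms for bw in beamwidths_deg if qos_ok(T, bw)),
    default=None,
)
print("min-energy feasible (energy/s, periodicity ms, beamwidth deg):", best)
```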
ML-based PBCH symbol detection and equalization for 5G Non-Terrestrial Networks
results: The results highlight ML performance in controlled settings and its adaptability to real-world challenges, shedding light on the potential benefits of applying ML in 5G-NTN and providing a basis for further research in this area.
Abstract
This paper delves into the application of Machine Learning (ML) techniques in the realm of 5G Non-Terrestrial Networks (5G-NTN), particularly focusing on symbol detection and equalization for the Physical Broadcast Channel (PBCH). As 5G-NTN gains prominence within the 3GPP ecosystem, ML offers significant potential to enhance wireless communication performance. To investigate these possibilities, we present ML-based models trained with both synthetic and real data from a real 5G over-the-satellite testbed. Our analysis includes examining the performance of these models under various Signal-to-Noise Ratio (SNR) scenarios and evaluating their effectiveness in symbol enhancement and channel equalization tasks. The results highlight the ML performance in controlled settings and their adaptability to real-world challenges, shedding light on the potential benefits of the application of ML in 5G-NTN.
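As an illustration of ML-based symbol equalization (not the paper's architecture), a small MLP can be trained to map received PBCH symbols and a noisy channel estimate to the transmitted QPSK symbols; the signal model, network size, and training setup below are all synthetic assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 4096
# QPSK symbols as (Re, Im) pairs
x = (torch.randint(0, 2, (n, 2)).float() * 2 - 1) / 2 ** 0.5
h = torch.randn(n, 2) / 2 ** 0.5                      # Rayleigh channel taps

def cmul(a, b):
    """Complex multiply on (Re, Im) pairs."""
    return torch.stack([a[:, 0] * b[:, 0] - a[:, 1] * b[:, 1],
                        a[:, 0] * b[:, 1] + a[:, 1] * b[:, 0]], dim=1)

y = cmul(h, x) + 0.05 * torch.randn(n, 2)             # received symbols
h_ls = h + 0.05 * torch.randn(n, 2)                   # hypothetical LS estimate

# Small MLP equalizer: (Re y, Im y, Re h_ls, Im h_ls) -> (Re x, Im x)
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inp = torch.cat([y, h_ls], dim=1)
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(inp), x)
    loss.backward()
    opt.step()
```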
Enhanced Channel Estimation in mm-Wave MIMO Systems Leveraging Integrated Communication and Sensing
methods: Exploits the integrated sensing and communication paradigm, employing both spatial and temporal sensing modes to reduce the number of training pilots required for channel estimation by a factor of four.
results: Extensive simulations show that the proposed method requires 4x fewer training pilots than the current state-of-the-art while correcting potential mismatches between the sensing and communication modes.
Abstract
This paper tackles the challenge of wideband MIMO channel estimation within indoor millimeter-wave scenarios. Our proposed approach exploits the integrated sensing and communication paradigm, where sensing information aids in channel estimation. The key innovation consists of employing both spatial and temporal sensing modes to significantly reduce the number of required training pilots. Moreover, our algorithm addresses and corrects potential mismatches between sensing and communication modes, which can arise from differing sensing and communication propagation paths. Extensive simulations demonstrate that the proposed method requires 4x less pilots compared to the current state-of-the-art, marking a substantial advancement in channel estimation efficiency.
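A sketch of why sensing information cuts the pilot count: if the sensing stage supplies the dominant path angles, channel estimation reduces to a small linear least-squares problem in the path gains, needing on the order of L pilots rather than Nt; the geometry, dimensions, and noise level below are invented:

```python
import numpy as np

def steering(n, theta):
    """Half-wavelength ULA steering vector."""
    return np.exp(1j * np.pi * np.arange(n) * np.sin(theta)) / np.sqrt(n)

rng = np.random.default_rng(3)
Nt, Nr, L = 32, 16, 3
aod = rng.uniform(-np.pi / 3, np.pi / 3, L)   # departure angles (from sensing)
aoa = rng.uniform(-np.pi / 3, np.pi / 3, L)   # arrival angles (from sensing)
g = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2)
H = sum(g[l] * np.outer(steering(Nr, aoa[l]), steering(Nt, aod[l]).conj())
        for l in range(L))

P = 4                                          # pilots ~ L, far fewer than Nt
F = (rng.standard_normal((Nt, P)) + 1j * rng.standard_normal((Nt, P))) / np.sqrt(2 * Nt)
Y = H @ F + 0.01 * (rng.standard_normal((Nr, P)) + 1j * rng.standard_normal((Nr, P)))

# With angles known, the model is linear in the gains: vec(Y) = B g
B = np.column_stack([
    np.reshape(np.outer(steering(Nr, aoa[l]), steering(Nt, aod[l]).conj()) @ F, -1)
    for l in range(L)])
g_hat, *_ = np.linalg.lstsq(B, Y.reshape(-1), rcond=None)
```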
Multi-static Parameter Estimation in the Near/Far Field Beam Space for Integrated Sensing and Communication Applications
results: Numerical simulations evaluate the effectiveness of the proposed framework and show that custom-designed flat-gain beamfocusing codewords improve the communication performance of the system.
Abstract
This work proposes a maximum likelihood (ML)-based parameter estimation framework for a millimeter wave (mmWave) integrated sensing and communication (ISAC) system in a multi-static configuration using energy-efficient hybrid digital-analog arrays. Due to the typically large arrays deployed in the higher frequency bands to mitigate isotropic path loss, such arrays may operate in the near-field regime. The proposed parameter estimation in this work consists of a two-stage estimation process, where the first stage is based on far-field assumptions, and is used to obtain a first estimate of the target parameters. In cases where the target is determined to be in the near-field of the arrays, a second estimation based on near-field assumptions is carried out to obtain more accurate estimates. In particular, we select beamfocusing array weights designed to achieve a constant gain over an extended spatial region and re-estimate the target parameters at the receivers. We evaluate the effectiveness of the proposed framework in numerous scenarios through numerical simulations and demonstrate the impact of the custom-designed flat-gain beamfocusing codewords in increasing the communication performance of the system.
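The two-stage estimation can be sketched as follows, under a simple ULA geometry: a far-field matched-filter grid search gives a coarse angle estimate, and a joint (range, angle) near-field refinement follows when the target may be within the array's near field; the grids, array size, and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 128, 0.5                                   # elements, spacing (wavelengths)
pos = (np.arange(N) - (N - 1) / 2) * d

def nearfield_steer(r, theta):
    """Spherical-wavefront array response for a source at range r (wavelengths)."""
    dist = np.sqrt(r ** 2 + pos ** 2 - 2 * r * pos * np.sin(theta))
    return np.exp(-2j * np.pi * (dist - r)) / np.sqrt(N)

def farfield_steer(theta):
    """Planar-wavefront (far-field) approximation of the same response."""
    return np.exp(2j * np.pi * pos * np.sin(theta)) / np.sqrt(N)

# Observation from a near-field target (r0 well inside 2D^2/lambda here)
r0, th0 = 20.0, 0.3
y = nearfield_steer(r0, th0) \
    + 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# Stage 1: coarse ML angle estimate under the far-field assumption
thetas = np.linspace(-np.pi / 3, np.pi / 3, 721)
th1 = thetas[np.argmax([np.abs(farfield_steer(t).conj() @ y) for t in thetas])]

# Stage 2: joint (range, angle) refinement with the near-field model
ranges = np.linspace(5.0, 100.0, 200)
ths = np.linspace(th1 - 0.1, th1 + 0.1, 41)
scores = [(np.abs(nearfield_steer(r, t).conj() @ y), r, t)
          for r in ranges for t in ths]
_, r_hat, th_hat = max(scores)
```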
Toward Energy Efficient Multiuser IRS-Assisted URLLC Systems: A Novel Rank Relaxation Method
results: Our results demonstrate that the proposed solution outperforms existing benchmarks.
Abstract
This paper proposes an energy efficient resource allocation design algorithm for an intelligent reflecting surface (IRS)-assisted downlink ultra-reliable low-latency communication (URLLC) network. This setup features a multi-antenna base station (BS) transmitting data traffic to a group of URLLC users with short packet lengths. We maximize the total network's energy efficiency (EE) through the optimization of active beamformers at the BS and passive beamformers (a.k.a. phase shifts) at the IRS. The main non-convex problem is divided into two sub-problems. An alternating optimization (AO) approach is then used to solve the problem. Through the use of the successive convex approximation (SCA) with a novel iterative rank relaxation method, we construct a concave-convex objective function for each sub-problem. The first sub-problem is a fractional program that is solved using the Dinkelbach method and a penalty-based approach. The second sub-problem is then solved based on semi-definite programming (SDP) and the penalty-based approach. The iterative solution gradually approaches the rank-one for both the active beamforming and unit modulus IRS phase-shift sub-problems. Our results demonstrate the efficacy of the proposed solution compared to existing benchmarks.
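The Dinkelbach step for the fractional sub-problem can be sketched generically: alternate between solving the parametric subproblem max_x num(x) − λ·den(x) and updating λ = num(x)/den(x) until the parametric objective vanishes. A toy scalar energy-efficiency example stands in for the paper's SCA/SDP inner solvers:

```python
import numpy as np

def dinkelbach(num, den, x0, maximize_step, tol=1e-6, iters=50):
    """Dinkelbach's method for max_x num(x)/den(x) with den > 0."""
    x, lam = x0, num(x0) / den(x0)
    for _ in range(iters):
        x = maximize_step(lam, x)       # inner solver (SCA/SDP in the paper)
        F = num(x) - lam * den(x)       # parametric objective at current lam
        lam = num(x) / den(x)
        if abs(F) < tol:
            break
    return x, lam

# Toy EE example: rate log2(1+p) over total power p + Pc (circuit power)
Pc = 1.0
num = lambda p: np.log2(1 + p)
den = lambda p: p + Pc

def step(lam, _):
    # The inner problem is concave in p; a fine grid suffices for illustration
    grid = np.linspace(1e-6, 10, 10001)
    return grid[np.argmax(num(grid) - lam * den(grid))]

p_star, ee = dinkelbach(num, den, x0=1.0, maximize_step=step)
print(f"EE-optimal power {p_star:.3f}, efficiency {ee:.3f} (toy units)")
```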